[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: vcpumask_to_pcpumask() case study



On 01.06.2024 20:50, Andrew Cooper wrote:
> One of the followon items I had from the bitops clean-up is this:
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 648d6dd475ba..9c3a017606ed 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>              unsigned int cpu;
>  
>              vcpu_id = ffsl(vmask) - 1;
> -            vmask &= ~(1UL << vcpu_id);
> +            vmask &= vmask - 1;
>              vcpu_id += vcpu_bias;
>              if ( (vcpu_id >= d->max_vcpus) )
>                  return 0;
> 
> which yields the following improvement:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>   Function                                     old     new   delta
>   vcpumask_to_pcpumask                         519     485     -34

Nice. At the risk of getting flamed again for raising dumb questions:
Considering that elsewhere "trickery" like the &= mask - 1 here was
deemed not nice to have (at least wanting to be hidden behind a
suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
be suitable here too, and less at risk of being considered "trickery"?
But yes, that would eliminate the benefit of making the bit clearing
independent of the ffsl() result. And personally I'm fine anyway with
the form as suggested.

> While I (the programmer) can reason the two expressions are equivalent,
> the compiler can't,

Why is it you think it can't? There's no further knowledge that you
as a human need to rely on for this, afaics. If ffsl() uses the
built-in (as it now does), the compiler has full insight into what's
going on. It's just that compiler engineers may not deem it worth the
effort to carry out such a special-case optimization.

> so we really are saving a SHL to re-calculate (1 <<
> vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted
> above the ffsl() call.
> 
> However, the majority of the savings came from the fact that the old
> code used to hold 1 in %r15 (across the entire function!) so it could
> calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for
> other use, we have one fewer thing spilt to the stack.
> 
> Anyway - while it goes to show that while/ffs loop logic should be extra
> careful to use x &= x - 1 for its loop condition logic, that's not all.
> 
> The rest of this function is crazy.  We're reading a guest-word at a
> time in order to make a d->max_vcpus sized bitmap (with a reasonable
> amount of opencoded logic to reassemble the fragments back into a vcpu
> number).
> 
> PVH dom0 can reach here, and because it's not pv32, will be deemed to
> have a guest word size of 64.  Also, word-at-a-time for any HVM guest is
> an insane overhead in terms of the pagewalks behind the copy_from_hvm()
> infrastructure.
> 
> Instead, we should calculate the size of the bitmap at
> DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
> just have a single for_each_set_bit() looping over it.  Importantly this
> avoids needing to know or care about the guest word size.
> 
> This removes 4 of the 6 hiding lfences; the pair for calculating
> is_native to start with, and then one of the two pairs behind the use of
> is_native to select the type of the access.
> 
> The only complication is this:  Right now, PV max vCPUs is 8k, so we
> could even get away with this being an on-stack bitmap.  However, the
> vast majority of PV guests are small, and a 64-bit bitmap would do fine.
> 
> But, given this is just one example, wouldn't it be better if we just
> unconditionally had a marshalling buffer for hypercalls?  There's the
> XLAT area, but it doesn't exist for PV64, and we've got various other pieces
> of scratch per-cpu data.
> 
> If not, having a 128/256-bit bitmap on the stack will still be good
> enough in practice, and still ok at amortising the PVH dom0 costs.
> 
> Thoughts?

Well, yes, the last of what you suggest might certainly be worthwhile
as a minimal improvement. The only slight concern with increasing the
granularity is that it becomes increasingly less likely that a 2nd or
further iteration of the loop would actually be exercised in routine
testing.

A marshalling buffer might also make sense to have, as long as we have
more than just one or two places wanting to use it. Since you say "just
one example", you likely have further uses in mind. Most other places
use input data more directly, so I can't right away come up with
possible further use cases myself.

Jan



 

