[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: vcpumask_to_pcpumask() case study



On 01.06.2024 20:50, Andrew Cooper wrote:
> One of the followon items I had from the bitops clean-up is this:
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 648d6dd475ba..9c3a017606ed 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>              unsigned int cpu;
>  
>              vcpu_id = ffsl(vmask) - 1;
> -            vmask &= ~(1UL << vcpu_id);
> +            vmask &= vmask - 1;
>              vcpu_id += vcpu_bias;
>              if ( (vcpu_id >= d->max_vcpus) )
>                  return 0;
> 
> which yields the following improvement:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>   Function                                     old     new   delta
>   vcpumask_to_pcpumask                         519     485     -34

Nice. At the risk of getting flamed again for raising dumb questions:
Considering that elsewhere "trickery" like the &= mask - 1 here was
deemed not nice to have (at least wanting to be hidden behind a
suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
be suitable here too, and less at risk of being considered "trickery"?
But yes, that would eliminate the benefit of making the bit clearing
independent of the ffsl() result. And personally I'm fine anyway with
the form as suggested.

> While I (the programmer) can reason the two expressions are equivalent,
> the compiler can't,

Why is it you think it can't? There's no further knowledge that you
as a human need to rely on for this, afaics. If ffsl() uses the
built-in (as it now does), the compiler has full insight into what's
going on. It's just that compiler engineers may not deem it worth the
effort to carry out such a special-case optimization.

> so we really are saving a SHL to re-calculate (1 <<
> vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted
> above the ffsl() call.
> 
> However, the majority of the savings came from the fact that the old
> code used to hold 1 in %r15 (across the entire function!) so it could
> calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for
> other use, we have one fewer thing spilt to the stack.
> 
> Anyway - while it goes to show that while/ffs loop logic should be extra
> careful to use x &= x - 1 for its loop condition logic, that's not all.
> 
> The rest of this function is crazy.  We're reading a guest-word at a
> time in order to make a d->max_vcpus sized bitmap (with a reasonable
> amount of opencoded logic to reassemble the fragments back into a vcpu
> number).
> 
> PVH dom0 can reach here, and because it's not pv32, will be deemed to
> have a guest word size of 64.  Also, word-at-a-time for any HVM guest is
> an insane overhead in terms of the pagewalks behind the copy_from_hvm()
> infrastructure.
> 
> Instead, we should calculate the size of the bitmap at
> DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
> just have a single for_each_set_bit() looping over it.  Importantly this
> avoids needing to know or care about the guest word size.
> 
> This removes 4 of the 6 hiding lfences; the pair for calculating
> is_native to start with, and then one of the two pairs behind the use of
> is_native to select the type of the access.
> 
> The only complication is this:  Right now, PV max vCPUs is 8k, so we
> could even get away with this being an on-stack bitmap.  However, the
> vast majority of PV guests are small, and a 64-bit bitmap would do fine.
> 
> But, given this is just one example, wouldn't it be better if we just
> unconditionally had a marshalling buffer for hypercalls?  There's the
> XLAT area, but it doesn't exist for PV64, and we've got various other pieces
> of scratch per-cpu data.
> 
> If not, having a 128/256-bit bitmap on the stack will still be good
> enough in practice, and still ok at amortising the PVH dom0 costs.
> 
> Thoughts?

Well, yes, the last of what you suggest might certainly be worthwhile
as a minimal improvement. The only slight concern with increasing the
granularity is that it becomes increasingly less likely that a 2nd or
further iteration of the loop would actually be exercised in routine
testing.

A marshalling buffer might also make sense to have, as long as we have
more than just one or two places wanting to use it. Since you say "just
one example", you likely have further uses in mind. Most other places
use input data more directly, so I can't right away come up with
possible further use cases myself.

Jan



 

