Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()
Hi Jan,

On 05/06/2019 08:42, Jan Beulich wrote:
> On 04.06.19 at 18:11, <julien.grall@xxxxxxx> wrote:
>> On 5/31/19 10:53 AM, Jan Beulich wrote:
>>> According to Linux commit e75bef2a4f ("arm64: Select
>>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>>> variant using only bitwise operations on at least some hardware, and
>>> no worse on other.
>>>
>>> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>> ---
>>> RFC: To be honest I'm not fully convinced this is a win in particular
>>>      in the hweight32() case, as there's no actual shift insn which
>>>      gets replaced by the multiplication. Even for hweight64() the
>>>      compiler could emit better code and avoid the explicit shift by
>>>      32 (which it emits at least for me).
>>
>> I can see a multiplication instruction used in both hweight32() and
>> hweight64() with the compiler I am using.
>
> That is for which exact implementation?

A simple call to hweight64().

> What I was referring to as "could emit better code" was the
> multiplication-free variant, where the compiler fails to recognize
> (afaict) another opportunity to fold a shift into an arithmetic
> instruction:
>
>         add     x0, x0, x0, lsr #4
>         and     x0, x0, #0xf0f0f0f0f0f0f0f
>         add     x0, x0, x0, lsr #8
>         add     x0, x0, x0, lsr #16
>  =>     lsr     x1, x0, #32
>  =>     add     w0, w1, w0
>         and     w0, w0, #0xff
>         ret
>
> Afaict the two marked insns could be replaced by
>
>         add     x0, x0, x0, lsr #32

I am not a compiler expert. Anyway, this likely depends on the version
of the compiler you are using. They are becoming smarter and smarter.

>> The commit message in Linux (and Robin's answer) is pretty clear. It
>> may improve on some cores but does not make it worse on others.
>
> With there only a sequence of add-s remaining, I'm having difficulty
> seeing how the use of mul+lsr would actually help:
>
>         add     x0, x0, x0, lsr #4
>         and     x0, x0, #0xf0f0f0f0f0f0f0f
>         mov     x1, #0x101010101010101
>         mul     x0, x0, x1
>         lsr     x0, x0, #56
>         ret
>
> But of course I know nothing about throughput and latency of such
> add-s with one of their operands shifted first. And yes, the variant
> using mul is, comparing with the better optimized case, still one
> insn smaller.

I would expect the compiler could easily replace a multiply by a series
of shifts, but it would be more difficult to do the inverse.

>> Also, this has been in Linux for a year now, so I am assuming Linux
>> folks are happy with the change (CCing Robin just in case I missed
>> anything). Therefore I am happy to give it a go on Xen as well.
>
> In which case - can I take this as an ack, or do you want to first
> pursue the discussion?

I will commit it later on with another bunch of patches.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
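
For reference, the two hweight64() variants compared in this thread can be sketched in C roughly as below. This is only an illustration, not the code from the patch: the names byte_counts(), hweight64_shift_add(), hweight64_mul() and hweight64_selfcheck() are made up here, and the first two reduction steps are assumed to follow the usual population-count pattern that precedes the instructions quoted above. What a given compiler emits for either variant will vary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Steps common to both variants (assumed, not taken from the patch):
 * reduce the word so that every byte holds the number of bits that were
 * set in that byte (0..8).
 */
static inline uint64_t byte_counts(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    return (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
}

/* Multiplication-free variant: fold the byte counts with shift+add. */
static inline unsigned int hweight64_shift_add(uint64_t w)
{
    uint64_t r = byte_counts(w);

    r += r >> 8;
    r += r >> 16;
    r += r >> 32;   /* the step that could be a single add with lsr #32 */

    return r & 0xff;
}

/*
 * Multiplier variant: multiplying by 0x0101...01 sums all eight byte
 * counts into the top byte, which the final shift extracts.
 */
static inline unsigned int hweight64_mul(uint64_t w)
{
    return (byte_counts(w) * 0x0101010101010101ULL) >> 56;
}

/* Hypothetical helper: checks that both variants agree on a few inputs. */
static inline void hweight64_selfcheck(void)
{
    assert(hweight64_shift_add(0) == hweight64_mul(0));
    assert(hweight64_shift_add(UINT64_MAX) == hweight64_mul(UINT64_MAX));
    assert(hweight64_shift_add(0x8000000000000001ULL) ==
           hweight64_mul(0x8000000000000001ULL));
}
```

Both functions return the same value for every input, since each byte count is at most 8 and the total is at most 64, so no sum overflows a byte. The only difference is whether the final summation is done by three shifted adds or by one multiply and one shift.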