Xen project Mailing List

Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()

To: "Julien Grall" <julien.grall@xxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Wed, 05 Jun 2019 01:42:53 -0600

Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Robin Murphy <robin.murphy@xxxxxxx>

Delivery-date: Wed, 05 Jun 2019 07:43:10 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 04.06.19 at 18:11, <julien.grall@xxxxxxx> wrote:
> On 5/31/19 10:53 AM, Jan Beulich wrote:
>> According to Linux commit e75bef2a4f ("arm64: Select
>> ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
>> variant using only bitwise operations on at least some hardware, and no
>> worse on other.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> RFC: To be honest I'm not fully convinced this is a win in particular in
>>      the hweight32() case, as there's no actual shift insn which gets
>>      replaced by the multiplication. Even for hweight64() the compiler
>>      could emit better code and avoid the explicit shift by 32 (which it
>>      emits at least for me).
>
> I can see multiplication instruction used in both hweight32() and
> hweight64() with the compiler I am using.

That is for which exact implementation? What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        add     x0, x0, x0, lsr #8
        add     x0, x0, x0, lsr #16
>>>     lsr     x1, x0, #32
>>>     add     w0, w1, w0
        and     w0, w0, #0xff
        ret

Afaict the two marked insns could be replaced by

        add     x0, x0, x0, lsr #32

With there only a sequence of add-s remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        mov     x1, #0x101010101010101
        mul     x0, x0, x1
        lsr     x0, x0, #56
        ret

But of course I know nothing about throughput and latency
of such add-s with one of their operands shifted first. And
yes, the variant using mul is, comparing with the better
optimized case, still one insn smaller.

> I would expect the compiler could easily replace a multiply by a series
> of shift but it would be more difficult to do the invert.
>
> Also, this has been in Linux for a year now, so I am assuming Linux
> folks are happy with changes (CCing Robin just in case I missed
> anything). Therefore I am happy to give it a go on Xen as well.

In which case - can I take this as an ack, or do you want
to first pursue the discussion?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
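
For readers following the two assembly sequences quoted in the message, here is a rough C sketch of the two hweight64() tails being compared. It is only an illustration: the names byte_counts64(), hweight64_shift(), hweight64_mul() and hweight_selfcheck() are made up for this sketch and are not taken from the patch or from the Xen or Linux sources, whose actual code may well differ. Both variants share the same first steps, which reduce the input to per-byte bit counts, and differ only in how those eight byte counts are summed.

```c
#include <assert.h>
#include <stdint.h>

/* Shared first steps (hypothetical helper): leave a bit count of 0..8
 * in each of the eight bytes of the result. */
static uint64_t byte_counts64(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    return (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
}

/* Multiplication-free tail: three shifted adds fold the byte counts
 * into the low byte. The sum is at most 64, so no byte overflows. */
static unsigned int hweight64_shift(uint64_t w)
{
    w = byte_counts64(w);
    w += w >> 8;
    w += w >> 16;
    w += w >> 32;
    return (unsigned int)(w & 0xff);
}

/* Multiplier tail: multiplying by 0x0101...01 accumulates the sum of
 * all eight byte counts in the top byte, which is then shifted down. */
static unsigned int hweight64_mul(uint64_t w)
{
    w = byte_counts64(w);
    return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}

/* Small self-check (hypothetical): both tails must agree. */
static void hweight_selfcheck(void)
{
    static const uint64_t samples[] = {
        0, 1, 0xff, 0x8000000000000000ULL,
        0x0123456789abcdefULL, ~0ULL,
    };
    for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(hweight64_shift(samples[i]) == hweight64_mul(samples[i]));
}
```

Whether a compiler turns the three `w += w >> n` statements into adds with a shifted operand, as suggested in the message, depends on the compiler and options used; the sketch makes no claim about the generated code or its performance.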
