Re: [PATCH] bitops/32: Convert variable_ffs() and fls() zero-case handling to C
- To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
- From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
- Date: Tue, 29 Apr 2025 15:34:30 -0700
- Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Arnd Bergmann <arnd@xxxxxxxx>, Arnd Bergmann <arnd@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, x86@xxxxxxxxxx, Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Alexander Usyskin <alexander.usyskin@xxxxxxxxx>, Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>, Mateusz Jończyk <mat.jonczyk@xxxxx>, Mike Rapoport <rppt@xxxxxxxxxx>, Ard Biesheuvel <ardb@xxxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
- Delivery-date: Tue, 29 Apr 2025 22:35:11 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On Tue, 29 Apr 2025 at 15:22, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> Oh, I didn't realise there was also a perf difference too, but Agner Fog
> agrees.
The perf difference is exactly because of the issue where the non-rep
one acts as a cmov, and has basically two inputs (the bits to test in
the source, and the old value of the result register).

I guess it's not "fundamental", but lzcnt is basically a bit simpler
for hardware to implement, and the non-rep legacy bsf instruction
has a dependency on the previous value of the result register.

So even when it's a single uop for both cases, that single uop can be
slower for the bsf because of the (typically false) dependency and
extra pressure on the rename registers.
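
To make the "two inputs" point concrete, here is a rough C model of the
two behaviours. It is only a sketch: the names model_bsf(),
model_tzcnt() and sketch_ffs() are made up for illustration and are not
the kernel's helpers, and the "destination keeps its old value on a
zero source" behaviour is the one assumed in this thread, not something
every vendor manual guarantees. It relies on the GCC/Clang builtin
__builtin_ctz().

```c
/*
 * Illustrative model only; the function names are hypothetical.
 *
 * Legacy bsf: the result depends on the source AND on the old contents
 * of the destination register, i.e. it behaves like a conditional move.
 * Assumption: a zero source leaves the destination unchanged.
 */
static inline unsigned int model_bsf(unsigned int src, unsigned int old_dst)
{
	return src ? (unsigned int)__builtin_ctz(src) : old_dst;
}

/*
 * rep bsf (tzcnt): the result depends on the source only. A zero source
 * yields the operand width, so no old destination value is needed.
 */
static inline unsigned int model_tzcnt(unsigned int src)
{
	return src ? (unsigned int)__builtin_ctz(src) : 32u;
}

/*
 * ffs() with the zero case handled in plain C rather than by relying on
 * the old destination value: returns 0 for x == 0, otherwise the
 * 1-based index of the lowest set bit.
 */
static inline int sketch_ffs(unsigned int x)
{
	if (!x)
		return 0;
	return __builtin_ctz(x) + 1;
}
```

In this model, model_bsf() cannot be evaluated until old_dst is known,
which is the dependency described above; model_tzcnt() and sketch_ffs()
take only the source.
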
Linus