[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH] x86/bitops: Account for POPCNT errata on earlier Intel CPUs



Manually break the false dependency for the benefit of cases such as
bitmap_weight() which is a reasonable hotpath.

Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
---
CC: Jan Beulich <JBeulich@xxxxxxxx>
CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Not sure if this warrants a fixes or not, but:

Fixes: 6978602334d9 ("x86/bitops: Use the POPCNT instruction when available")

Many examples online suggest a 2x improvement perf improvement on tight loops
by breaking this dependency.  cpumasks in particular make frequent use of this
loop.

Still TODO:

 1) Put a double CS prefix on the CALL instruction to avoid a trailing 2-byte
    NOP, but this depends on x86_decode_lite() in order to work.

 2) Revert a buggy GAS diagnostic:

    ./arch/x86/include/asm/bitops.h: Assembler messages:
    ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice
    ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice

    Multiple prefixes are not an error, and are sometimes the best choice
    available.

It turns out that LZCNT/TZCNT have the same input dependent bug, prior to
Skylake.  There are no instructions in the "cleaned up" part of bitops yet,
and I don't expect any to survive cleaning.
---
 xen/arch/x86/include/asm/bitops.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/bitops.h 
b/xen/arch/x86/include/asm/bitops.h
index bb9d75646023..87eac7782f10 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -488,10 +488,16 @@ static always_inline unsigned int arch_hweightl(unsigned 
long x)
      *
      * This limits the POPCNT instruction to using the same ABI as a function
      * call (input in %rdi, output in %eax) but that's fine.
+     *
+     * On Intel CPUs prior to Cannon Lake, the POPCNT instruction has a false
+     * input dependency on it's destination register (errata HSD146, SKL029
+     * amongst others), impacting loops such as bitmap_weight().  Insert an
+     * XOR to manually break the dependency.
      */
     alternative_io("call arch_generic_hweightl",
+                   "xor %k[res], %k[res]\n\t"
                    "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
-                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
+                   ASM_OUTPUT2([res] "=&a" (r) ASM_CALL_CONSTRAINT),
                    [val] "D" (x));
 
     return r;
-- 
2.39.5




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.