|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH] x86/build: Use new .nops directive when available
On Thu, Aug 16, 2018 at 04:56:04PM +0100, Andrew Cooper wrote:
> On 16/08/18 15:31, Roger Pau Monné wrote:
> > On Thu, Aug 16, 2018 at 11:42:56AM +0100, Andrew Cooper wrote:
> >> On 16/08/18 10:55, Roger Pau Monné wrote:
> >>> On Wed, Aug 15, 2018 at 06:57:38PM +0100, Andrew Cooper wrote:
> >>>> diff --git a/xen/arch/x86/Rules.mk b/xen/arch/x86/Rules.mk
> >>>> index ac585a3..c84ed20 100644
> >>>> --- a/xen/arch/x86/Rules.mk
> >>>> +++ b/xen/arch/x86/Rules.mk
> >>>> @@ -29,6 +29,10 @@ $(call as-option-add,CFLAGS,CC,"invpcid
> >>>> (%rax)$$(comma)%rax",-DHAVE_AS_INVPCID)
> >>>> $(call as-option-add,CFLAGS,CC,\
> >>>> ".if ((1 > 0) < 0); .error \"\";.endif",,-DHAVE_AS_NEGATIVE_TRUE)
> >>>>
> >>>> +# Check to see whether the assmbler supports the .nop directive.
> >>>> +$(call as-option-add,CFLAGS,CC,\
> >>>> + ".L1: .L2: .nops (.L2 - .L1)$$(comma)9",-DHAVE_AS_NOP_DIRECTIVE)
> >>> I think I remember commenting on an earlier version of this about the
> >>> usage of the CONTROL parameter. I would expect the assembler to
> >>> use the most optimized version by default, is that not the case?
> >> Again, I don't understand what you're trying to say.
> >>
> >> This expression is like this, because that's how we actually use it.
> > As mentioned in another email, I was wondering why we choose to not
> > use nop instructions > 9 bytes. The assembler will by default use
> > nop instructions up to 11 bytes for 64bit code.
>
> Using more than 9 bytes is suboptimal on some hardware.
OK. But using the same argument isn't it also suboptimal to use 9
bytes on some hardware then also?
What I mean by this is that it would be good to add a comment
somewhere that notes why nop instructions are limited to 9 bytes,
because that's likely to generate the more optimized code on a wide
variety of hardware.
At least it would have helped me.
> Toolchains use long nops (exclusively?) for basic block alignment,
> whereas we use use them for executing through because its still faster
> than a branch.
>
> >
> >>>> +
> >>>> CFLAGS += -mno-red-zone -fpic -fno-asynchronous-unwind-tables
> >>>>
> >>>> # Xen doesn't use SSE interally. If the compiler supports it, also
> >>>> skip the
> >>>> diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
> >>>> index 0ef7a8b..2c844d6 100644
> >>>> --- a/xen/arch/x86/alternative.c
> >>>> +++ b/xen/arch/x86/alternative.c
> >>>> @@ -84,6 +84,19 @@ static const unsigned char * const
> >>>> p6_nops[ASM_NOP_MAX+1] init_or_livepatch_cons
> >>>>
> >>>> static const unsigned char * const *ideal_nops init_or_livepatch_data =
> >>>> p6_nops;
> >>>>
> >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE
> >>>> +
> >>>> +/* Nops in .init.rodata to compare against the runtime ideal nops. */
> >>>> +asm ( ".pushsection .init.rodata, \"a\", @progbits\n\t"
> >>>> + "toolchain_nops: .nops " __stringify(ASM_NOP_MAX) "\n\t"
> >>>> + ".popsection\n\t");
> >>>> +extern char toolchain_nops[ASM_NOP_MAX];
> >>>> +static bool __read_mostly toolchain_nops_are_ideal;
> >>>> +
> >>>> +#else
> >>>> +# define toolchain_nops_are_ideal false
> >>>> +#endif
> >>>> +
> >>>> static void __init arch_init_ideal_nops(void)
> >>>> {
> >>>> switch ( boot_cpu_data.x86_vendor )
> >>>> @@ -112,6 +125,11 @@ static void __init arch_init_ideal_nops(void)
> >>>> ideal_nops = k8_nops;
> >>>> break;
> >>>> }
> >>>> +
> >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE
> >>>> + if ( memcmp(ideal_nops[ASM_NOP_MAX], toolchain_nops, ASM_NOP_MAX)
> >>>> == 0 )
> >>>> + toolchain_nops_are_ideal = true;
> >>>> +#endif
> >>> You are only comparing that the biggest nop instruction (9 bytes
> >>> AFAICT) generated by the assembler is what Xen believes to be the more
> >>> optimized version. What about shorter nops?
> >> They are all variations on a theme.
> >>
> >> For P6 nops, its the 0f 1f root which is important, which takes a modrm
> >> byte. Traditionally, its always encoded with eax and uses redundant
> >> memory encodings for longer instructions.
> >>
> >> I can't think of any way of detecting if the optimised nops if the
> >> toolchain starts using alternative registers in the encoding, but I
> >> expect this case won't happen in practice.
> > I would rather do:
> >
> > toolchain_nops:
> > .nops 1
> > .nops 2
> > .nops 3
> > ...
> > .nops 9
> >
> > And then build an assembler_nops[ASM_NOP_MAX+1] like it's done for the
> > other nops. Then you could do:
> >
> > toolchain_nops_are_ideal = true;
> > for ( i = 1; i < ASM_NOP_MAX+1; i++ )
> > if ( memcmp(ideal_nops[i], assembler_nops[i], i) )
> > {
> > toolchain_nops_are_ideal = false;
> > break;
> > }
> >
> > In order to make sure all the possible nop sizes are using the
> > expected optimized version.
>
> What is the point? Other than the 0f 1f, it really doesn't matter, and
> small variations won't invalidate them as ideal nops.
>
> This check needs to be just good enough to tell whether the toolchain
> used P6 or K8 nops, and unless you explicitly built with -march=k8, the
> answer is going to be P6.
>
> It does not need to check every variation byte for byte. Going back to
> my original argument for not even doing this basic check, if we happen
> to be wrong and the toolchain wrote the other kind of long nops, you
> probably won't be able to measure the difference.
I have to admit I don't know that much about nop instruction length or
encoding, but again I think it would be nice to add the reasoning
above to the commit as a comment on why only instruction of length 9
is tested.
Roger.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |