Re: [Xen-devel] [PATCH 1/2] x86: improve MSR_SHADOW_GS accesses
On 07/12/17 10:45, Jan Beulich wrote:
>>>> On 06.12.17 at 20:34, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 06/12/17 16:37, Jan Beulich wrote:
>>> @@ -172,6 +173,24 @@ static inline unsigned long rdgsbase(voi
>>>      return base;
>>>  }
>>>
>>> +static inline unsigned long rdgsshadow(void)
>>> +{
>>> +    unsigned long base;
>>> +
>>> +    alternative_io("mov %[msr], %%ecx\n\t"
>>> +                   "rdmsr\n\t"
>>> +                   "shl $32, %%rdx\n\t"
>>> +                   "or %%rdx, %[res]",
>> There needs to be some clearer distinction between the two
>> alternatives.  It took a while for me to spot this comma.
> Any suggestion? I've been noticing the issue of the split being
> hard to spot in other places as well, so I'd like to do this in a
> generally applicable and sufficiently uniform way.

The best I can think of is something like this:

                   "or %%rdx, %[res]",
                   /* Use a double swapgs and rdgsbase if available. */
                   "swapgs\n\t"

>
>> I'm not entirely sure the alternative is justified here.  For
>> consistency alone, these helpers should match their companions, and in
>> the unlikely case that the runtime feature test does make a measurable
>> difference, wouldn't a static key be a better option anyway?
> Static key? The main reason I dislike making {rd,wr}{fs,gs}base
> use alternatives is that the original code would be much larger
> than the replacement code. IOW of the options to make things
> consistent, I'd prefer using alternatives for the other inline
> functions too. But I think the choice here should be what fits
> best.
>
>>> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon
>>>      wrmsrl(MSR_GS_BASE, base);
>>>  }
>>>
>>> +static inline void wrgsshadow(unsigned long base)
>>> +{
>>> +    alternative_input("mov %[msr], %%ecx\n\t"
>>> +                      "shld $32, %[val], %%rdx\n\t"
>> This is a vector path instruction and specifically called out to be
>> avoided in the AMD optimisation guide.  On all hardware (according to
>> Agner's latency measurements) it alone has a longer latency to execute
>> than the mov/shl pair you've replaced, and that is before accounting for
>> mov elimination.
> For one I doubt the latency of the SHLD will be noticeable alongside
> the latency of the following WRMSR. And then I think the main
> goal should be to have optimal performance on modern CPUs.
> That means size-optimizing original code, to reduce the amount
> of NOPs needed when the alternative is being installed.

I'm sorry if this comes across as blunt, but you are too focused on the
wrong details.

I agree that `swapgs; {rd,wr}gsbase; swapgs` is better than a WRMSR, and
we should be using it when available.  However, it is simply not the case
that squeezing every possible byte out of .text gives the best result.

Just as with inc and dec, there is a very good reason why compilers don't
emit shld, and that is because the ORMs recommend against its use.  In
this specific case, a mov/shl pair is better than shld on all hardware,
even on older pipelines which don't do mov elimination.
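To make that concrete, here is a minimal sketch of the mov/shl (here
mov/shr, since this is the write direction) form of the WRMSR split being
argued for.  This is illustrative only, under the assumption that the MSR
fallback path takes the value in a register; the helper name is made up
and the exact register choices are not taken from the patch:

/* Sketch only: write MSR_SHADOW_GS_BASE without SHLD.  WRMSR takes the
 * value in edx:eax and the MSR index in ecx, so split the 64-bit value
 * with a mov/shr pair instead. */
static inline void wrgsshadow_wrmsr_sketch(unsigned long base)
{
    asm volatile ( "mov %%rax, %%rdx\n\t"
                   "shr $32, %%rdx\n\t"
                   "wrmsr"
                   :: "a" (base), "c" (MSR_SHADOW_GS_BASE)
                   : "rdx" );
}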
You are going to need a more convincing argument than currently provided
as to why the helpers aren't implemented like this:

static inline unsigned long rdgsshadow(void)
{
    unsigned long base;

    if ( cpu_has_fsgsbase )
        asm volatile ( "swapgs\n\t"
#ifdef HAVE_GAS_FSGSBASE
                       "rdgsbase %0\n\t"
#else
                       ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t"
#endif
                       "swapgs\n\t"
                       :
#ifdef HAVE_GAS_FSGSBASE
                       "=r" (base)
#else
                       "=a" (base)
#endif
                       );
    else
        rdmsrl(MSR_SHADOW_GS_BASE, base);

    return base;
}

static inline void wrgsshadow(unsigned long base)
{
    if ( cpu_has_fsgsbase )
        asm volatile ( "swapgs\n\t"
#ifdef HAVE_GAS_FSGSBASE
                       "wrgsbase %0\n\t"
#else
                       ".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8\n\t"
#endif
                       "swapgs\n\t"
                       ::
#ifdef HAVE_GAS_FSGSBASE
                       "r" (base)
#else
                       "a" (base)
#endif
                       );
    else
        wrmsrl(MSR_SHADOW_GS_BASE, base);
}

In particular:

1) This is consistent with the existing {rd,wr}{fs,gs}base helpers.

2) cpu_has_fsgsbase is constant after boot, so the branch will be
predicted correctly.

3) It doesn't use instructions recommended against by the ORMs.

I don't wish to suggest that the above code is better than any and all
possible other combinations of performing the same operation, but the
more you get into the minutiae of micro-optimisation, the higher the
barrier for inclusion gets, because it is important to demonstrate that
the benefits genuinely outweigh the associated costs.

~Andrew
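As an aside, a hedged sketch of how a caller might use the proposed
helpers in the PV context-switch path.  The save/restore wrappers and the
gs_base_user field reference are assumptions made for illustration, not
code from the patch:

/* Illustrative only: save/restore a PV vCPU's shadow (user) GS base
 * around a context switch using the proposed helpers.  The field name
 * gs_base_user is assumed for the sake of the example. */
static void save_shadow_gs(struct vcpu *v)
{
    v->arch.pv_vcpu.gs_base_user = rdgsshadow();
}

static void restore_shadow_gs(const struct vcpu *v)
{
    wrgsshadow(v->arch.pv_vcpu.gs_base_user);
}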