Re: [Xen-devel] [PATCH] x86/spec-ctrl: Scrub stale segment registers on leaky hardware

On 12/08/2019 09:00, Jan Beulich wrote:
> On 09.08.2019 19:16, Andrew Cooper wrote:
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -1914,7 +1914,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>>  ### spec-ctrl (x86)
>>  > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
>>  >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
>> ->              l1d-flush,l1tf-barrier}=<bool> ]`
>> +>              l1d-flush,stale-seg-clear,l1tf-barrier}=<bool> ]`
>>  Controls for speculative execution sidechannel mitigations.  By default, Xen
>>  will pick the most appropriate mitigations based on compiled in support,
>> @@ -1986,6 +1986,12 @@ Irrespective of Xen's setting, the feature is virtualised for HVM guests to
>>  use.  By default, Xen will enable this mitigation on hardware believed to be
>>  vulnerable to L1TF.
>> +On all hardware, the `stale-seg-clear=` option can be used to force or prevent
>> +Xen from clearing the microarchitectural segment register copies on context
>> +switch.  By default, Xen will choose to use stale segment clearing on affected
>> +hardware.  The clearing logic is tuned to microarchitectural details of the
>> +affected CPUs.
>> +
>>  On hardware vulnerable to L1TF, the `l1tf-barrier=` option can be used to force
>>  or prevent Xen from protecting evaluations inside the hypervisor with a barrier
>>  instruction to not load potentially secret information into L1 cache.  By
>
> Purely out of curiosity: Is there a reason both insertions happen between
> two pre-existing sub-options, not after all of them?

Easier backportability.  That, and l1tf-barrier is not going to stay in this
form by the time 4.13 gets released.

>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1328,6 +1328,29 @@ static void load_segments(struct vcpu *n)
>>      dirty_segment_mask = per_cpu(dirty_segment_mask, cpu);
>>      per_cpu(dirty_segment_mask, cpu) = 0;
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +    {
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +        /* Force a reload of all segments to be the second scrubbing write. */
>> +        dirty_segment_mask = ~0;
>> +    }
>
> Having the command line option, do we care about people actually using
> it on AMD hardware?

I care that it can be forced on to test.  Beyond that, I expect the overhead
will be large on Atom and AMD, which is why the command line doc was worded a
bit more forcefully than other options, and is hinting at "don't try using
this on non-affected hardware".
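
(For reference, the command line hook itself isn't quoted above.  Assuming it
is wired up in parse_spec_ctrl() following the existing l1d-flush pattern, it
would look something like:

    else if ( (val = parse_boolean("stale-seg-clear", s, ss)) >= 0 )
        opt_stale_seg_clear = val;

which is what makes "spec-ctrl=stale-seg-clear" (force on, e.g. for testing)
and "spec-ctrl=stale-seg-clear=0" (force off) available.)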

> For %ds and %es this would now lead to up to three
> selector register loads each, with the one above being completely
> pointless (due to not clearing the segment bases anyway).  Question of
> course is whether adding yet another conditional (to the added code
> above or to preload_segment()) would be any better than having this
> extra selector register write.

preload_segment() needs to change anyway given Rome's behaviour, but I need
to wait for feedback from AMD as to whether they are happy with my CPUID bit
and erratum proposal.

>
> Furthermore, if we cared, shouldn't SVM code also honor the flag and
> gain extra loads, albeit perhaps with unlikely() used in the if()?

If you observe, svm_ctxt_switch_to() already has this unconditionally.  This
was added in f9caf5ffaba but I'm not convinced it can be correct.  I don't
see equivalent logic in KVM, or any description to this effect in the APM.
It might be an erratum on an early SVM-capable processor.

>
> As to your choice of loading a nul selector: For the VERW change iirc
> we've been told that using a nul selector is a bad choice from
> performance pov.  Are nul selectors being treated better for selector
> register writes?

VERW and segment loads are very different.

VERW is microcoded, unconditionally references memory, and doesn't have a
plausible usecase in the 286+ where you would call it with a NUL selector
(until the MDS side effect came along).  As a result, handling NUL is the
slowpath in microcode.

Loading a segment selector with NUL has a specific meaning, and is the
common case in 64bit code.  It does not reference memory.

In practice, my microperf tests show no difference in net time between using
a real selector and NUL.  The majority of samples come in cycle-identical.
This might mean that NUL is slowpath'd (considering it doesn't have a memory
read to perform), but it probably means that other aspects of segment loading
dominate.
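
(To illustrate the kind of measurement involved - this is a self-contained
userspace sketch rather than the actual harness behind the numbers above, and
the function names are mine.  It reuses %ss as a convenient known-good data
selector, times batches of %ds loads with RDTSCP, and assumes a 64bit build,
where a NUL %ds is benign.)

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtscp_ticks(void)
    {
        uint32_t lo, hi, aux;

        asm volatile ( "rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux) );
        return ((uint64_t)hi << 32) | lo;
    }

    static uint64_t time_ds_loads(uint16_t sel, unsigned int iters)
    {
        uint32_t eax = 0, ebx, ecx, edx;
        uint64_t start, end;

        /* CPUID serialises the pipeline before timing starts. */
        asm volatile ( "cpuid" : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) );

        start = rdtscp_ticks();
        for ( unsigned int i = 0; i < iters; i++ )
            asm volatile ( "mov %0, %%ds" :: "r" (sel) );
        end = rdtscp_ticks();

        return end - start;
    }

    int main(void)
    {
        uint16_t ss;

        /* %ss holds a valid data segment selector we can legitimately reload. */
        asm volatile ( "mov %%ss, %0" : "=r" (ss) );

        printf("NUL selector: %llu cycles\n",
               (unsigned long long)time_ds_loads(0, 1000000));
        printf("%%ss selector: %llu cycles\n",
               (unsigned long long)time_ds_loads(ss, 1000000));

        return 0;
    }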

>
>> @@ -872,11 +873,38 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>>  static void vmx_ctxt_switch_to(struct vcpu *v)
>>  {
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +
>>      vmx_restore_guest_msrs(v);
>>      vmx_restore_dr(v);
>>      if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO )
>>          vmx_pi_switch_to(v);
>> +
>> +    /* Should be last in the function.  See above. */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>>  }
>
> Why two instances?  Aren't the selector register loads during VM
> entry sufficient to clear both instances?  (The question is
> rhetorical in a way, because I think I know the answer, but
> neither the comment here nor the patch description provide it.)

This really depends on how much information Intel are willing to make public
on the subject.

I don't think it is reasonable to presume that VMEntry works as a sequence
of architectural instructions (it provably doesn't when it comes to control
registers), which is why I commented the logic in this way.

>
>> @@ -111,6 +114,7 @@ static int __init parse_spec_ctrl(const char *s)
>>          opt_ibpb = false;
>>          opt_ssbd = false;
>>          opt_l1d_flush = 0;
>> +        opt_stale_seg_clear = 0;
>>      }
>
> Is it really correct for this to be affected by both "spec-ctrl=no"
> and "spec-ctrl=no-xen"?  It would seem to me that it would belong
> above the "disable_common" label, as the control also is meant to
> guard against guest->guest leaks.

spec-ctrl=no-xen is ill-defined.  As stated, it says "don't do anything in
Xen except enough to virtualise capabilities for guests".  However, an
argument could be made to say that this is conceptually similar to eager-fpu.

Either way, I think a clarification of what no-xen means is needed first.

>
>> @@ -864,6 +871,83 @@ static __init void mds_calculations(uint64_t caps)
>>      }
>>  }
>> +/* Calculate whether this CPU leaks segment registers between contexts. */
>> +static void __init stale_segment_calculations(void)
>> +{
>> +    /*
>> +     * Assume all unrecognised processors are ok.  This is only known to
>> +     * affect Intel Family 6 processors.
>> +     */
>> +    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
>> +         boot_cpu_data.x86 != 6 )
>> +        return;
>
> The comment above here and ...
>
>> +    switch ( boot_cpu_data.x86_model )
>> +    {
>> +        /*
>> +         * Core processors since at least Nehalem are vulnerable.
>> +         */
>> +    case 0x1e: /* Nehalem */
>> +    case 0x1f: /* Auburndale / Havendale */
>> +    case 0x1a: /* Nehalem EP */
>> +    case 0x2e: /* Nehalem EX */
>> +    case 0x25: /* Westmere */
>> +    case 0x2c: /* Westmere EP */
>> +    case 0x2f: /* Westmere EX */
>> +    case 0x2a: /* SandyBridge */
>> +    case 0x2d: /* SandyBridge EP/EX */
>> +    case 0x3a: /* IvyBridge */
>> +    case 0x3e: /* IvyBridge EP/EX */
>> +    case 0x3c: /* Haswell */
>> +    case 0x3f: /* Haswell EX/EP */
>> +    case 0x45: /* Haswell D */
>> +    case 0x46: /* Haswell H */
>> +    case 0x3d: /* Broadwell */
>> +    case 0x47: /* Broadwell H */
>> +    case 0x4f: /* Broadwell EP/EX */
>> +    case 0x56: /* Broadwell D */
>> +    case 0x4e: /* Skylake M */
>> +    case 0x55: /* Skylake X */
>> +    case 0x5e: /* Skylake D */
>> +    case 0x66: /* Cannonlake */
>> +    case 0x67: /* Cannonlake? */
>> +    case 0x8e: /* Kabylake M */
>> +    case 0x9e: /* Kabylake D */
>> +        cpu_has_bug_stale_seg = true;
>> +        break;
>> +
>> +        /*
>> +         * Atom processors are not vulnerable.
>> +         */
>> +    case 0x1c: /* Pineview */
>> +    case 0x26: /* Lincroft */
>> +    case 0x27: /* Penwell */
>> +    case 0x35: /* Cloverview */
>> +    case 0x36: /* Cedarview */
>> +    case 0x37: /* Baytrail / Valleyview (Silvermont) */
>> +    case 0x4d: /* Avaton / Rangely (Silvermont) */
>> +    case 0x4c: /* Cherrytrail / Brasswell */
>> +    case 0x4a: /* Merrifield */
>> +    case 0x5a: /* Moorefield */
>> +    case 0x5c: /* Goldmont */
>> +    case 0x5f: /* Denverton */
>> +    case 0x7a: /* Gemini Lake */
>> +        break;
>> +
>> +        /*
>> +         * Knights processors are not vulnerable.
>> +         */
>> +    case 0x57: /* Knights Landing */
>> +    case 0x85: /* Knights Mill */
>> +        break;
>> +
>> +    default:
>> +        printk("Unrecognised CPU model %#x - assuming vulnerable to StaleSeg\n",
>> +               boot_cpu_data.x86_model);
>> +        break;
>
> ... the plain "break" here are not in line with the log message text.
> Did you mean to add "not"?

No.  I intended this path to set cpu_has_bug_stale_seg.
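
In other words, the intention is for the default case to read along the lines
of (sketch of the fix, not what the posted patch does):

    default:
        printk("Unrecognised CPU model %#x - assuming vulnerable to StaleSeg\n",
               boot_cpu_data.x86_model);
        cpu_has_bug_stale_seg = true;
        break;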

~Andrew