
Re: [Xen-devel] [PATCH] x86/spec-ctrl: Scrub stale segment registers on leaky hardware


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 12 Aug 2019 13:30:40 +0100
  • Cc: Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 13 Aug 2019 06:57:52 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12/08/2019 09:00, Jan Beulich wrote:
> On 09.08.2019 19:16, Andrew Cooper wrote:
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -1914,7 +1914,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>>  ### spec-ctrl (x86)
>>  > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
>>  >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
>> ->              l1d-flush,l1tf-barrier}=<bool> ]`
>> +>              l1d-flush,stale-seg-clear,l1tf-barrier}=<bool> ]`
>>  
>>  Controls for speculative execution sidechannel mitigations.  By default, Xen
>>  will pick the most appropriate mitigations based on compiled in support,
>> @@ -1986,6 +1986,12 @@ Irrespective of Xen's setting, the feature is virtualised for HVM guests to
>>  use.  By default, Xen will enable this mitigation on hardware believed to be
>>  vulnerable to L1TF.
>>  
>> +On all hardware, the `stale-seg-clear=` option can be used to force or prevent
>> +Xen from clearing the microarchitectural segment register copies on context
>> +switch.  By default, Xen will choose to use stale segment clearing on affected
>> +hardware.  The clearing logic is tuned to microarchitectural details of the
>> +affected CPUs.
>> +
>>  On hardware vulnerable to L1TF, the `l1tf-barrier=` option can be used to force
>>  or prevent Xen from protecting evaluations inside the hypervisor with a barrier
>>  instruction to not load potentially secret information into L1 cache.  By
>
> Purely out of curiosity: Is there a reason both insertions happen between
> two pre-existing sub-options, not after all of them?

Easier backportability.  That, and l1tf-barrier is not going to stay in
this form by the time 4.13 gets released.

>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1328,6 +1328,29 @@ static void load_segments(struct vcpu *n)
>>      dirty_segment_mask = per_cpu(dirty_segment_mask, cpu);
>>      per_cpu(dirty_segment_mask, cpu) = 0;
>>  
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +    {
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +        /* Force a reload of all segments to be the second scrubbing write. */
>> +        dirty_segment_mask = ~0;
>> +    }
>
> Having the command line option, do we care about people actually using
> it on AMD hardware?

I care that it can be forced on to test.  Beyond that, I expect the
overhead will be large on Atom and AMD, which is why the command line
doc is worded a bit more forcefully than for other options, and hints
at "don't try using this on non-affected hardware".

> For %ds and %es this would now lead to up to three
> selector register loads each, with the one above being completely
> pointless (due to not clearing the segment bases anyway). Question of
> course is whether adding yet another conditional (to the added code
> above or to preload_segment()) would be any better than having this
> extra selector register write.

preload_segment() needs to change anyway given Rome's behaviour, but I
need to wait for feedback from AMD as to whether they are happy with my
CPUID bit and erratum proposal.

>
> Furthermore, if we cared, shouldn't SVM code also honor the flag and
> gain extra loads, albeit perhaps with unlikely() used in the if()?

If you observe, svm_ctxt_switch_to() already has this unconditionally.

This was added in f9caf5ffaba, but I'm not convinced it can be correct.
I don't see equivalent logic in KVM, or any description to this effect
in the APM.  It might be a workaround for an erratum on an early
SVM-capable processor.
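
For reference (paraphrasing from memory rather than quoting the tree), the
logic in question is along the lines of:

    /* In svm_ctxt_switch_to(): reload the selector registers with NUL,
     * because VMRUN/#VMEXIT otherwise leave stale cached values behind. */
    asm volatile ( "mov %0, %%ds; mov %0, %%es; mov %0, %%ss;"
                   :: "r" (0) );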

>
> As to your choice of loading a nul selector: For the VERW change iirc
> we've been told that using a nul selector is a bad choice from
> performance pov. Are nul selectors being treated better for selector
> register writes?

VERW and segment loads are very different.

VERW is microcoded, unconditionally references memory, and doesn't have
a plausible usecase in the 286+ where you would call it with a NUL
selector (until the MDS side effect came along).  As a result, handling
NUL is the slowpath in microcode.

Loading a segment selector with NUL has a specific meaning, and is the
common case in 64bit code.  It does not reference memory.

In practice, my microperf tests show no difference in net time between
using a real selector and NUL.  The majority of samples come in
cycle-identical.

This might mean that NUL is slowpath'd (considering it doesn't have a
memory read to perform), but it probably means that other aspects of
segment loading dominate.
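
For reference, a minimal sketch of the sort of microperf test I mean (an
illustrative reconstruction, not the actual harness; the 1000000 iteration
count is arbitrary, and 0x2b is assumed to be a valid flat data selector -
substitute whatever your GDT provides):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;

        asm volatile ( "lfence; rdtsc" : "=a" (lo), "=d" (hi) );
        return ((uint64_t)hi << 32) | lo;
    }

    /* Time @iters back-to-back loads of @sel into %ds. */
    static uint64_t time_ds_loads(uint16_t sel, unsigned int iters)
    {
        uint64_t start = rdtsc();
        unsigned int i;

        for ( i = 0; i < iters; ++i )
            asm volatile ( "mov %[sel], %%ds" :: [sel] "r" (sel) );

        return rdtsc() - start;
    }

    int main(void)
    {
        printf("NUL:  %"PRIu64" cycles\n", time_ds_loads(0,    1000000));
        printf("real: %"PRIu64" cycles\n", time_ds_loads(0x2b, 1000000));
        return 0;
    }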

>
>> @@ -872,11 +873,38 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>>  
>>  static void vmx_ctxt_switch_to(struct vcpu *v)
>>  {
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +
>>      vmx_restore_guest_msrs(v);
>>      vmx_restore_dr(v);
>>  
>>      if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO )
>>          vmx_pi_switch_to(v);
>> +
>> +    /* Should be last in the function.  See above. */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>>  }
>
> Why two instances? Aren't the selector register loads during VM
> entry sufficient to clear both instances? (The question is
> rhetorical in a way, because I think I know the answer, but
> neither the comment here nor the patch description provide it.)

This really depends on how much information Intel are willing to make
public on the subject.

I don't think it is reasonable to presume that VMEntry works as a
sequence of architectural instructions (it provably doesn't when it
comes to control registers), which is why I commented the logic in this way.

>
>> @@ -111,6 +114,7 @@ static int __init parse_spec_ctrl(const char *s)
>>               opt_ibpb = false;
>>               opt_ssbd = false;
>>               opt_l1d_flush = 0;
>> +            opt_stale_seg_clear = 0;
>>           }
>
> Is it really correct for this to be affected by both "spec-ctrl=no"
> and "spec-ctrl=no-xen"? It would seem to me that it would belong
> above the "disable_common" label, as the control also is meant to
> guard against guest->guest leaks.

spec-ctrl=no-xen is ill-defined.

As stated, it says "don't do anything in Xen except enough to virtualise
capabilities for guests".

However, an argument could be made to say that this is conceptually
similar to eager-fpu.

Either way, I think a clarification to what no-xen means is needed first.
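
For concreteness, the placement you're suggesting would look something like
this (an illustrative sketch with the unrelated options elided, not the
actual code):

    if ( !strcmp(s, "no") )
    {
        /* ... full disable: also drop guest->guest protections ... */
        opt_stale_seg_clear = 0;

    disable_common:
        /* ... Xen-only protections: rsb, bti-thunk, ibrs, ibpb, ssbd,
         * l1d-flush ... */
        opt_l1d_flush = 0;
    }
    else if ( !strcmp(s, "no-xen") )
        goto disable_common;

i.e. "spec-ctrl=no" would clear it, while "spec-ctrl=no-xen" would leave it
alone.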

>
>> @@ -864,6 +871,83 @@ static __init void mds_calculations(uint64_t caps)
>>      }
>>  }
>>  
>> +/* Calculate whether this CPU leaks segment registers between contexts. */
>> +static void __init stale_segment_calculations(void)
>> +{
>> +    /*
>> +     * Assume all unrecognised processors are ok.  This is only known to
>> +     * affect Intel Family 6 processors.
>> +     */
>> +    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
>> +         boot_cpu_data.x86 != 6 )
>> +        return;
>
> The comment above here and ...
>
>> +    switch ( boot_cpu_data.x86_model )
>> +    {
>> +        /*
>> +         * Core processors since at least Nehalem are vulnerable.
>> +         */
>> +    case 0x1e: /* Nehalem */
>> +    case 0x1f: /* Auburndale / Havendale */
>> +    case 0x1a: /* Nehalem EP */
>> +    case 0x2e: /* Nehalem EX */
>> +    case 0x25: /* Westmere */
>> +    case 0x2c: /* Westmere EP */
>> +    case 0x2f: /* Westmere EX */
>> +    case 0x2a: /* SandyBridge */
>> +    case 0x2d: /* SandyBridge EP/EX */
>> +    case 0x3a: /* IvyBridge */
>> +    case 0x3e: /* IvyBridge EP/EX */
>> +    case 0x3c: /* Haswell */
>> +    case 0x3f: /* Haswell EX/EP */
>> +    case 0x45: /* Haswell D */
>> +    case 0x46: /* Haswell H */
>> +    case 0x3d: /* Broadwell */
>> +    case 0x47: /* Broadwell H */
>> +    case 0x4f: /* Broadwell EP/EX */
>> +    case 0x56: /* Broadwell D */
>> +    case 0x4e: /* Skylake M */
>> +    case 0x55: /* Skylake X */
>> +    case 0x5e: /* Skylake D */
>> +    case 0x66: /* Cannonlake */
>> +    case 0x67: /* Cannonlake? */
>> +    case 0x8e: /* Kabylake M */
>> +    case 0x9e: /* Kabylake D */
>> +        cpu_has_bug_stale_seg = true;
>> +        break;
>> +
>> +        /*
>> +         * Atom processors are not vulnerable.
>> +         */
>> +    case 0x1c: /* Pineview */
>> +    case 0x26: /* Lincroft */
>> +    case 0x27: /* Penwell */
>> +    case 0x35: /* Cloverview */
>> +    case 0x36: /* Cedarview */
>> +    case 0x37: /* Baytrail / Valleyview (Silvermont) */
>> +    case 0x4d: /* Avaton / Rangely (Silvermont) */
>> +    case 0x4c: /* Cherrytrail / Brasswell */
>> +    case 0x4a: /* Merrifield */
>> +    case 0x5a: /* Moorefield */
>> +    case 0x5c: /* Goldmont */
>> +    case 0x5f: /* Denverton */
>> +    case 0x7a: /* Gemini Lake */
>> +        break;
>> +
>> +        /*
>> +         * Knights processors are not vulnerable.
>> +         */
>> +    case 0x57: /* Knights Landing */
>> +    case 0x85: /* Knights Mill */
>> +        break;
>> +
>> +    default:
>> +        printk("Unrecognised CPU model %#x - assuming vulnerable to
>> StaleSeg\n",
>> +               boot_cpu_data.x86_model);
>> +        break;
>
> ... the plain "break" here are not in line with the log message text.
> Did you mean to add "not"?

No.  I intended this path to set cpu_has_bug_stale_seg.
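
i.e. the default case was meant to gain the assignment:

    default:
        printk("Unrecognised CPU model %#x - assuming vulnerable to StaleSeg\n",
               boot_cpu_data.x86_model);
        cpu_has_bug_stale_seg = true;
        break;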

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

