Xen project Mailing List

Re: [Xen-devel] [PATCH v2 15/30] xen/x86: Improvements to in-hypervisor cpuid sanity checks

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 15 Feb 2016 17:12:45 +0000

Delivery-date: Mon, 15 Feb 2016 17:13:08 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 15/02/16 15:43, Jan Beulich wrote: >>>> On 05.02.16 at 14:42, <andrew.cooper3@xxxxxxxxxx> wrote: >> @@ -4617,50 +4618,39 @@ void hvm_cpuid(unsigned int input, unsigned int >> *eax, unsigned int *ebx, >> /* Fix up VLAPIC details. */ >> *ebx &= 0x00FFFFFFu; >> *ebx |= (v->vcpu_id * 2) << 24; >> + >> + *ecx &= hvm_featureset[FEATURESET_1c]; >> + *edx &= hvm_featureset[FEATURESET_1d]; > Looks like I've overlooked an issue in patch 11, which becomes > apparent here: How can you use a domain-independent featureset > here, when features vary between HAP and shadow mode guests? > I.e. in the earlier patch I suppose you need to calculate two > hvm_*_featureset[]s, with the HAP one perhaps empty when > !hvm_funcs.hap_supported. Their use here is a halfway house between nothing and the planned full per-domain policies. In this case, the "don't expose $X to a non-hap domain" checks have been retained, to cover the difference. > >> @@ -4694,6 +4687,9 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, >> unsigned int *ebx, >> break; >> >> case 0x80000001: >> + *ecx &= hvm_featureset[FEATURESET_e1c]; >> + *edx &= hvm_featureset[FEATURESET_e1d]; > Looking at the uses here, wouldn't it be better (less ambiguous) for > these to be named FEATURESET_x1c and FEATURESET_x1d > respectively, since x is not a hex digit, but e is? I originally chose e because these are commonly known as the extended cpuid leaves. 'e' being a hex digit is why the number is specified as an upper case hex number, as shown with FEATURESET_Da1. I suppose x could work here, but I don't see being less ambiguous. (Perhaps I am clouded by having this syntax firmly embedded in my expectation). > >> @@ -841,69 +842,70 @@ void pv_cpuid(struct cpu_user_regs *regs) >> else >> cpuid_count(leaf, subleaf, &a, &b, &c, &d); >> >> - if ( (leaf & 0x7fffffff) == 0x00000001 ) >> - { >> - /* Modify Feature Information. */ >> - if ( !cpu_has_apic ) >> - __clear_bit(X86_FEATURE_APIC, &d); >> - >> - if ( !is_pvh_domain(currd) ) >> - { >> - __clear_bit(X86_FEATURE_PSE, &d); >> - __clear_bit(X86_FEATURE_PGE, &d); >> - __clear_bit(X86_FEATURE_PSE36, &d); >> - __clear_bit(X86_FEATURE_VME, &d); >> - } >> - } >> - >> switch ( leaf ) >> { >> case 0x00000001: >> - /* Modify Feature Information. */ >> - if ( !cpu_has_sep ) >> - __clear_bit(X86_FEATURE_SEP, &d); >> - __clear_bit(X86_FEATURE_DS, &d); >> - __clear_bit(X86_FEATURE_ACC, &d); >> - __clear_bit(X86_FEATURE_PBE, &d); >> - if ( is_pvh_domain(currd) ) >> - __clear_bit(X86_FEATURE_MTRR, &d); >> - >> - __clear_bit(X86_FEATURE_DTES64 % 32, &c); >> - __clear_bit(X86_FEATURE_MWAIT % 32, &c); >> - __clear_bit(X86_FEATURE_DSCPL % 32, &c); >> - __clear_bit(X86_FEATURE_VMXE % 32, &c); >> - __clear_bit(X86_FEATURE_SMXE % 32, &c); >> - __clear_bit(X86_FEATURE_TM2 % 32, &c); >> + c &= pv_featureset[FEATURESET_1c]; >> + d &= pv_featureset[FEATURESET_1d]; >> + >> if ( is_pv_32bit_domain(currd) ) >> - __clear_bit(X86_FEATURE_CX16 % 32, &c); >> - __clear_bit(X86_FEATURE_XTPR % 32, &c); >> - __clear_bit(X86_FEATURE_PDCM % 32, &c); >> - __clear_bit(X86_FEATURE_PCID % 32, &c); >> - __clear_bit(X86_FEATURE_DCA % 32, &c); >> - if ( !cpu_has_xsave ) >> - { >> - __clear_bit(X86_FEATURE_XSAVE % 32, &c); >> - __clear_bit(X86_FEATURE_AVX % 32, &c); >> - } >> - if ( !cpu_has_apic ) >> - __clear_bit(X86_FEATURE_X2APIC % 32, &c); >> - __set_bit(X86_FEATURE_HYPERVISOR % 32, &c); >> + c &= ~cpufeat_mask(X86_FEATURE_CX16); > Shouldn't this be taken care of by clearing LM and then applying > your dependencies correction? Or is that meant to only get > enforced later? Is it maybe still worth having both pv64_featureset[] > and pv32_featureset[]? Again, this is just used as a halfway house. Longterm, I plan to read the featureset straight from the per-domain policy, which won't be stored as an adhoc array. > >> + /* >> + * !!! Warning - OSXSAVE handling for PV guests is >> non-architectural !!! >> + * >> + * Architecturally, the correct code here is simply: >> + * >> + * if ( curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_OSXSAVE ) >> + * c |= cpufeat_mask(X86_FEATURE_OSXSAVE); >> + * >> + * However because of bugs in Xen (before c/s bd19080b, Nov 2010, >> the >> + * XSAVE cpuid flag leaked into guests despite the feature not being >> + * avilable for use), buggy workarounds where introduced to Linux >> (c/s >> + * 947ccf9c, also Nov 2010) which relied on the fact that Xen also >> + * incorrectly leaked OSXSAVE into the guest. >> + * >> + * Furthermore, providing architectural OSXSAVE behaviour to a many >> + * Linux PV guests triggered a further kernel bug when the fpu code >> + * observes that XSAVEOPT is available, assumes that xsave state had >> + * been set up for the task, and follows a wild pointer. >> + * >> + * Therefore, the leaking of Xen's OSXSAVE setting has become a >> + * defacto part of the PV ABI and can't reasonably be corrected. >> + * >> + * The following situations and logic now applies: >> + * >> + * - Hardware without CPUID faulting support and native CPUID: >> + * There is nothing Xen can do here. The hosts XSAVE flag will >> + * leak through and Xen's OSXSAVE choice will leak through. >> + * >> + * In the case that the guest kernel has not set up OSXSAVE, only >> + * SSE will be set in xcr0, and guest userspace can't do too much >> + * damage itself. >> + * >> + * - Enlightened CPUID or CPUID faulting available: >> + * Xen can fully control what is seen here. Guest kernels need >> to >> + * see the leaked OSXSAVE, but guest userspace is given >> + * architectural behaviour, to reflect the guest kernels >> + * intentions. >> + */ > Well, I think all of this is too harsh: In a hypervisor-guest > relationship of PV kind I don't view it as entirely wrong to let the > guest kernel know of whether the hypervisor enabled XSAVE > support by inspecting the OSXSAVE bit. From guest kernel > perspective, the hypervisor is kind of the OS. Xen shadows the guests %cr4. That alone means that the emulated cpuid should have matched. If you are going for the OS argument, that means "Xen has XSAVE turned on and the guest can the provided features" and not "Xen supports the guest configuring its own XCR0", which is what it is taken to mean. In both of these cases, deviating from the architectural behaviour confuses the situation, and adds adds needless divergence from the native implementation in guests. I admit that all of this stems from the complete fiasco which was the original XSAVE support, but all of it would be unnecessary had buggy bugfixes not been stack up on each other. > While I won't > make weakening the above a little a requirement, I'd appreciate > you doing so. I don't which to be deliberately over the top, but I really don't think I am in this case. None of this should be necessary. > >> + case 0x80000007: >> + d &= pv_featureset[FEATURESET_e7d]; >> + break; > By not clearing eax and ebx (not sure about ecx) here you would > again expose flags to guests without proper white listing. > >> + case 0x80000008: >> + b &= pv_featureset[FEATURESET_e8b]; >> break; > Same here for ecx and edx and perhaps the upper 8 bits of eax. Both of these would be changes to how these things are currently handled, whereby a guest gets to see whatever the toolstack managed to find in said leaf. I was hoping to put off some of these decisions, but they probably need making now. On the PV side they definitely can't be fully hidden, as these leaves are not maskable. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.