[Xen-devel] [DESIGN] Feature Levelling improvements
All,

With migration v2 getting close to being done, I have had time to pick back
up with feature levelling improvements. Presented here for review is draft E.

A PDF version of the design is available here:
http://xenbits.xen.org/people/andrewcoop/feature-levelling/feature-levelling-E.pdf

Pandoc version as follows:

% VM CPU Feature Levelling Improvements
% Andrew Cooper <<andrew.cooper3@xxxxxxxxxx>>
% Draft E

Introduction
============

Revision History
----------------

------------------------------------------------------------------------------
Version  Date         Changes
-------  -----------  --------------------------------------------------------
Draft A  07 Feb 2014  Initial draft

Draft B  13 Feb 2014  More detail for proposed new implementation

Draft C  17 Feb 2014  Even more details for proposed new implementation

Draft D  11 Jun 2014  More background, having had time to hack around and
                      experiment

Draft E  15 Jun 2015  More details for the proposed implementation.
------------------------------------------------------------------------------

Background
----------

_CPU feature masking_ is a term used to mean altering the visible feature-set
of a processor. For single systems, this could be to hide certain features
from operating system software, for which support is buggy.

In the world of virtualisation, it is common to have non-identical hardware in
a cluster but still want to migrate a virtual machine safely.

On regular hardware, the kernel can safely assume that the feature-set as
detected on boot will remain the same. Live migration invalidates this
assumption when moving between two non-identical pieces of hardware. To
migrate virtual machines in this fashion, orchestration software must ensure
that the available feature set remains consistent anywhere the virtual machine
might end up.

The feature-set of a particular CPU can be obtained using the `CPUID`
instruction. It was introduced as a forward compatible way of advertising new
features which were detectable at runtime.
Information available includes processor branding, available features,
topology information and cache details.

The `CPUID` instruction is an unprivileged instruction, usable from user-mode
without interception from the kernel. This makes it impossible to
paravirtualise using the standard trap-and-emulate method.

Purpose
-------

This project originally started to improve the way in which XenServer
performed heterogeneous pool levelling. In the process of investigation, it
was discovered that the current implementations in Xen and libxc are in need
of improvement, particularly in relation to PV guests.

This document describes:

* What properties are needed from a VM point of view
* What hardware features are available to aid with levelling
* What abilities are exposed by Xen and libxc for levelling
* How XenServer currently does pool levelling (and why it is in need of
  improvements)

This document also proposes a new mechanism for VM feature levelling, taking
into account the information needed by orchestration software.

What a Virtual Machine cares about
==================================

On native hardware, a kernel, as well as certain userspace libraries, will use
the set of available features to tune themselves to run more efficiently.

Over a migrate, it is critical that features a VM is using do not disappear.
(In some cases it might be possible to trap-and-emulate missing features, but
this would be an exceedingly high overhead and is not considered.)

When a VM is liable to migrate between hardware of differing feature-sets, it
is important to ensure that the VM is strictly only using the common subset of
features available on any potential destination. This can be done either by
hiding features outside of the common subset, or in some cases specifically
instructing the kernel not to use a feature which it can see.
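To illustrate the "common subset" requirement, here is a minimal sketch in C.
It assumes each host's features are held as an array of 32-bit words, one per
feature-reporting `CPUID` register. The function names and the array layout
are hypothetical and are not part of Xen or libxc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the features common to every host.
 * 'hosts' is an nr_hosts x nr_words array, stored row by row, and the
 * result is written to 'common' (nr_words entries).
 */
void common_featureset(const uint32_t *hosts, size_t nr_hosts,
                       size_t nr_words, uint32_t *common)
{
    assert(nr_hosts > 0);

    for (size_t w = 0; w < nr_words; w++) {
        uint32_t acc = hosts[w];            /* start from the first host */

        for (size_t h = 1; h < nr_hosts; h++)
            acc &= hosts[h * nr_words + w]; /* drop bits any host lacks */

        common[w] = acc;
    }
}

/* A VM's feature word is safe if it uses nothing outside the common set. */
int is_subset(uint32_t vm_word, uint32_t common_word)
{
    return (vm_word & ~common_word) == 0;
}
```

A VM whose visible features pass `is_subset()` for every word can, under these
assumptions, move to any host that contributed to `common` without a feature
disappearing underneath it.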
Hardware features to aid levelling
==================================

HVM
---

HVM guests (using `Intel VT-x` or `AMD SVM`) will unconditionally exit to Xen
on all `CPUID` instructions, allowing Xen full and complete control over all
leaves.

PV
--

The `CPUID` instruction is unprivileged, so executing it in a PV guest will
not trap, leaving Xen no direct ability to control the information returned.

Xen Forced Emulation Prefix
---------------------------

Xen-aware PV guest kernels and userspace can make use of the 'Forced Emulation
Prefix'

> `ud2a; .byte 'x'; .byte 'e'; .byte 'n'; cpuid`

which Xen recognises as a deliberate attempt to get the fully-controlled
`CPUID` information rather than the hardware-reported information. This only
works with cooperative guests and guest userspace, so cannot be directly
relied upon.

Masking and Override MSRs
-------------------------

AMD CPUs from the `K8` onwards support _Feature Override_ MSRs, which specify
the raw value returned for all `CPUID` instructions querying a specific
feature bitmap. These MSRs allow any result to be returned, including the
ability to advertise features which are not actually supported.

Intel CPUs between `Nehalem` and `SandyBridge` have differing numbers of
_Feature Mask_ MSRs, which are a simple AND-mask applied to all `CPUID`
instructions requesting specific feature bitmap sets. The exact MSRs, and
which feature bitmap sets they affect, are hardware specific. These MSRs allow
features to be hidden by clearing the appropriate bit in the mask, but do not
allow unsupported features to be advertised.

CPUID Faulting
--------------

On newer Intel hardware, a feature known as _CPUID Faulting_ can allow Xen to
cause `CPUID` instructions executed in PV guests to trap, which allows Xen
full and complete control over all leaves (exactly like an HVM guest). _CPUID
Faulting_ support is present in `IvyBridge` and newer CPUs, although not
architecturally guaranteed.
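The difference between the two MSR styles described under "Masking and
Override MSRs" can be modelled in a few lines of C. This is only a sketch of
the semantics as stated above. It does not touch any real MSR, and the enum
and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two hardware levelling styles. */
enum levelling_kind {
    LEVEL_MASK,     /* Intel-style Feature Mask: AND-mask over hardware */
    LEVEL_OVERRIDE, /* AMD-style Feature Override: raw value returned   */
};

/* Model of the feature word a guest would see for a given MSR value. */
uint32_t visible_features(enum levelling_kind kind, uint32_t hw, uint32_t msr)
{
    switch (kind) {
    case LEVEL_MASK:
        return hw & msr;  /* can only hide bits the hardware has */
    case LEVEL_OVERRIDE:
        return msr;       /* any value, including unsupported bits */
    }
    assert(!"unknown levelling kind");
    return hw;
}

/* Can 'target' be presented to the guest on hardware reporting 'hw'? */
int can_express(enum levelling_kind kind, uint32_t hw, uint32_t target)
{
    if (kind == LEVEL_MASK)
        return (target & ~hw) == 0; /* target must be a subset of hw */
    return 1;                       /* an override can return anything */
}
```

In this model, a mask can never make a feature appear, which is why a target
containing bits absent from `hw` is rejected for `LEVEL_MASK`.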
How Xen currently uses and exposes levelling support
====================================================

Libxc has a `CPUID` Policy API which can be set by the toolstack for a domain.
Libxc performs some information gathering, and uses the `DOMCTL_set_cpuid`
hypercall to specify what information should be returned by Xen when the
domain requests specific `CPUID` leaves.

The user of the libxc `CPUID` Policy API may specify, for any leaf whatsoever,
whether particular bits should be forced high, forced low, default (as chosen
by libxc), specifically the same as hardware, or specifically the same as
hardware and maintained consistently across migration.

The default `CPUID` Policy involves libxc trying to work out which features
should be set or cleared in the policy. It does this with a mixture of native
`CPUID` instructions, some switch statements choosing to enable/disable
certain features, and hypercalls querying certain Xen state.

When Xen is servicing a `CPUID` instruction on behalf of a guest and ends up
using the policy provided by libxc, it subsequently edits certain fields,
particularly in the feature sets.

Support for the feature masking MSRs is available via the `cpuid_mask_*`
command line parameters, which get applied at boot and reduce the feature set
visible to every subsequent `CPUID` instruction.

Support for enabling _CPUID Faulting_ exists, but it does nothing more than
defer back to the default policy.

How XenServer currently does levelling
======================================

The _Heterogeneous Pool Levelling_ support in XenServer appears to predate the
libxc CPUID policy API, so does not currently use it.

The toolstack has a table of CPU model numbers identifying whether levelling
is supported. It then uses native `CPUID` instructions to look at the first
four feature masks, and identifies the common subset of features across the
pool. `cpuid_mask_{,extd_}{ecx,edx}` is then set on Xen's command line for
each host in the pool, and all hosts rebooted.
This has several limitations:

* Xen and dom0 have a reduced feature set despite not needing to migrate
* There is only a single level for all VMs in the pool
* The toolstack only understands the first 4 of the possible masking MSRs,
  and there are now feature maps in further `CPUID` leaves which have no
  masking MSRs

Notes and observations
======================

Experimentally, the masking MSRs can be context switched. There is no need to
force all PV guests to the same level, and no need to prevent dom0 or Xen from
using certain features. Context switching the masking MSRs will however incur
an overhead, and should be avoided where possible.

The toolstack needs to know how much control Xen has over VM features. In the
case that there are insufficient masking MSRs, and no faulting support is
present, a PV VM can still potentially be made safe to migrate by explicitly
disabling features on the kernel command line. As a result, there should be a
new mechanism which reports the levelling controls Xen has available.

The features available to each type of guest are really only known to Xen.
Having libxc try to divine them is bogus (especially as libxc is subject to
the toolstack domain's cpuid policy itself). Therefore on boot, Xen should
work out the maximal feature set available to each type of guest and make this
information available to the toolstack.

Design
======

`struct sysctl_physinfo.levelling_caps`
---------------------------------------

Xen shall gain a new physinfo field which reports the degree to which it can
influence `CPUID` executed by a PV guest. This is a bitmap containing:

* `faulting`
    * CPUID Faulting is available, and full control can be exercised.
* `mask_ecx`
    * Leaf 0x00000001.ECX
* `mask_edx`
    * Leaf 0x00000001.EDX
* `mask_extd_ecx`
    * Leaf 0x80000001.ECX
* `mask_extd_edx`
    * Leaf 0x80000001.EDX
* `mask_xsave_eax`
    * Leaf 0x0000000D[ECX=1].EAX
* `mask_therm_ecx`
    * Leaf 0x00000006.ECX
* `mask_l7s0_eax`
    * Leaf 0x00000007[ECX=0].EAX
* `mask_l7s0_ebx`
    * Leaf 0x00000007[ECX=0].EBX

At the time of writing, these are all the masking MSRs known by Xen. The
bitmap shall be extended as new MSRs become available.

New 'featureset' API for use by the toolstack
---------------------------------------------

A featureset is defined as a collection of words covering the cpuid leaves
which report features to the caller. It is variable length, and expected to
grow over time as processors gain more features, or Xen starts supporting
exposing more features to guests.

At the time of writing, the leaves containing feature bits are:

* 0x00000001.ECX
* 0x00000001.EDX
* 0x80000001.ECX
* 0x80000001.EDX
* 0x0000000D[ECX=1].EAX
* 0x00000007[ECX=0].EBX
* 0x00000006.EAX
* 0x00000006.ECX
* 0x0000000A.EAX
* 0x0000000A.EBX
* 0x0000000F[ECX=0].EDX
* 0x0000000F[ECX=1].EDX

XEN_SYSCTL_get_featureset
-------------------------

Xen shall on boot create a featureset for itself, and the maximum available
features for each type of guest, based on hardware features, command line
options etc. A toolstack shall be able to query all of these.

Cpuid feature-verification library
----------------------------------

There shall be a new library (shared between Xen and libxc in the same way as
libelf etc.) which can verify a featureset. In particular, it will confirm
that no features are enabled without their dependent features.

XEN_DOMCTL_set_cpuid
--------------------

This is an existing hypercall. Currently it just stashes the policy from
userspace. It shall be extended to provide verification of the policy, and
reject attempts to advertise features which Xen is incapable of providing (via
hardware or emulation support).
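The two checks described above (no feature without its dependency, and no
feature beyond what Xen can provide) could look roughly like the following C
sketch. It assumes word 0 of a featureset is leaf 0x00000001.ECX, matching the
order of the list above. The dependency table holds two illustrative entries
only (AVX needs XSAVE, FMA needs AVX), and all type and function names are
hypothetical rather than the proposed library's interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: feature number = word index * 32 + bit index. */
#define FEAT(word, bit) ((word) * 32u + (bit))
#define FEAT_FMA   FEAT(0, 12)  /* 0x00000001.ECX bit 12 */
#define FEAT_XSAVE FEAT(0, 26)  /* 0x00000001.ECX bit 26 */
#define FEAT_AVX   FEAT(0, 28)  /* 0x00000001.ECX bit 28 */

struct feature_dep {
    unsigned int feature;  /* this feature ...                 */
    unsigned int requires; /* ... is only valid alongside this */
};

/* Illustrative subset of dependencies, not an exhaustive table. */
static const struct feature_dep deps[] = {
    { FEAT_AVX, FEAT_XSAVE },
    { FEAT_FMA, FEAT_AVX },
};

bool fs_test(const uint32_t *fs, unsigned int feat)
{
    return (fs[feat / 32] & (1u << (feat % 32))) != 0;
}

/* True if 'fs' advertises nothing outside the maximum set 'max'. */
bool fs_is_subset(const uint32_t *fs, const uint32_t *max, size_t nr_words)
{
    for (size_t w = 0; w < nr_words; w++)
        if (fs[w] & ~max[w])
            return false;
    return true;
}

/* True if every enabled feature also has its dependency enabled. */
bool fs_deps_ok(const uint32_t *fs)
{
    for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++)
        if (fs_test(fs, deps[i].feature) && !fs_test(fs, deps[i].requires))
            return false;
    return true;
}

bool fs_verify(const uint32_t *fs, const uint32_t *max, size_t nr_words)
{
    assert(nr_words >= 1); /* the dependency table only refers to word 0 */
    return fs_is_subset(fs, max, nr_words) && fs_deps_ok(fs);
}
```

Keeping the subset check and the dependency check as separate functions would
let the same code serve both libxc (building a policy) and Xen (rejecting a
bad one), which is the sharing arrangement the design asks for.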
VCPU context switch
-------------------

Xen shall be updated to lazily context switch all available masking MSRs. It
is noted that this shall incur a performance overhead if restricted
featuresets are assigned to PV guests, and _CPUID Faulting_ is not available.

It shall be the responsibility of the host administrator to avoid creating
such a scenario, if the performance overhead is a concern.

Future work
===========

The above is a minimum quantity of work to support feature levelling, but
further problems exist. They are acknowledged as being issues, but are not in
scope for fixing as part of feature levelling.

* Xen has no notion of per-cpu and per-package data in the cpuid policy. In
  particular, this causes issues for VMs attempting to detect topology, which
  find inconsistent/incorrect cache information.

* In the case that `domain_cpuid()` can't locate a leaf in the topology, it
  will fall back to issuing a plain `CPUID` instruction. This breaks VM
  encapsulation, as a VM which has migrated can observe differences which
  should be hidden.

* There is currently a positioning issue with the domain's cpuid policy.
  Verifying the register state requires the policy, but the policy is behind
  the register state in the migration stream. The domain's cpuid policy should
  become an item in Xen's migration state for a VM.