
[Xen-devel] VM Feature levelling improvements proposal (draft C)


Here is a design proposal to improve VM feature levelling support in Xen
and libxc.

PDF can be found here:

And markdown source inline:


Revision History

Version  Date         Changes
-------  -----------  -------------------------------------------------
Draft A  07 Feb 2014  Initial draft
Draft B  13 Feb 2014  More detail for proposed new implementation
Draft C  17 Feb 2014  Even more details for proposed new implementation


_CPU feature masking_ is a term used to mean altering the visible features
of a processor.  For single systems, this could be used to hide from
operating system software certain features for which its support is buggy.

In the world of virtualisation, it is common to have non-identical hardware in
a cluster but still want to migrate a virtual machine safely.  On regular
hardware, the kernel can safely assume that the feature-set as detected on
boot will remain the same.  Live migration invalidates this assumption when
moving between two non-identical pieces of hardware.

To migrate virtual machines in this fashion, orchestration software must
ensure that the available feature set remains consistent anywhere the
machine might end up.

The feature-set of a particular CPU can be obtained using the `CPUID`
instruction.  It was introduced as a forward compatible way of advertising new
features which were detectable at runtime.  Information available includes
processor branding, available features, topology information and cache
information.

The `CPUID` instruction is an unprivileged instruction, usable from userspace
without interception from the kernel.  This makes it impossible to
paravirtualise using the standard trap-and-emulate method.


This project originally started to improve the way in which XenServer
performed heterogeneous pool levelling.  In the process of investigation, it
was discovered that the current implementations in Xen and libxc are in need of
improvement, particularly in relation to PV guests.

This document describes:

* What properties are needed from a VM point of view
* What hardware features are available to aid with levelling
* What abilities are exposed by Xen and libxc for levelling
* How XenServer currently does pool levelling (and why it is in need of
  improvement)

This document also proposes a new mechanism for VM feature levelling, taking
into account the information needed by orchestration software.

What a Virtual Machine cares about

On native hardware, a kernel, as well as certain userspace libraries, will use
the set of available features to tune themselves to run more efficiently.
Across a migration, it is critical that features a VM is using do not
disappear.  (In some cases it might be possible to trap-and-emulate missing
features, but this would have an unacceptably high overhead so is not
considered.  It is not applicable in the general case.)

When a VM is liable to migrate between hardware of differing feature-sets, it
is important to ensure that the VM is strictly only using the common subset of
features available on any potential destination.

This can be done either by hiding features outside of the common subset, or in
some cases specifically instructing the kernel not to use a feature which it
can see.
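
As a minimal sketch of what "common subset" means here, assume each host's
features are summarised as a single 32-bit bitmap (real hardware has several
such words; the values below are made up for illustration).  The level a VM
may safely use is then the bitwise AND across every potential destination:

```python
from functools import reduce

WORD = 0xFFFFFFFF  # feature words are treated as 32 bits wide


def common_features(host_feature_words):
    """Return the features present on every host (bitwise AND)."""
    return reduce(lambda a, b: a & b, host_feature_words, WORD)


# Hypothetical feature words for three hosts in a cluster.
hosts = [0x0000000F, 0x0000000D, 0x0000001D]
print(hex(common_features(hosts)))  # 0xd
```

Any feature bit missing from even one host drops out of the result, which is
the property the VM needs in order to migrate anywhere in the cluster.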

Hardware features to aid levelling

HVM guests (using `Intel VT-x` or `AMD SVM`) will exit to Xen on each `CPUID`
instruction, allowing Xen full and complete control over all leaves.

PV guests are harder.  By default, `CPUID` instructions executed in a PV
guest will not trap, leaving Xen no direct ability to control the information
returned.

On newer Intel hardware, a feature known as _CPUID Faulting_ can allow Xen to
cause `CPUID` instructions executed in PV guests to trap, which allows Xen full
and complete control over all leaves (exactly like an HVM guest).

Xen-aware PV guest kernels and userspace can make use of the 'Forced
Emulation Prefix'

> `ud2a; .byte 'x'; .byte 'e'; .byte 'n'; cpuid`

which Xen recognises as a deliberate attempt to get the fully-controlled
`CPUID` information rather than the hardware-reported information.  This
only works with cooperative guest kernels and guest userspace, so cannot be
relied upon.
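
For illustration only, the sequence quoted above can be written out as raw
instruction bytes.  This sketch assumes the usual x86 encodings (`ud2a` as
`0f 0b`, `cpuid` as `0f a2`); the function name is hypothetical:

```python
def forced_cpuid_bytes():
    """Byte sequence for: ud2a; .byte 'x'; .byte 'e'; .byte 'n'; cpuid"""
    ud2a = bytes([0x0F, 0x0B])   # invalid opcode, traps to the hypervisor
    marker = b"xen"              # the three literal marker bytes
    cpuid = bytes([0x0F, 0xA2])  # the instruction to be emulated
    return ud2a + marker + cpuid


print(forced_cpuid_bytes().hex())  # 0f0b78656e0fa2
```

A guest which executes a plain `cpuid` (the last two bytes alone) bypasses
this path entirely, which is why the mechanism depends on cooperation.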

Most hardware available these days has some number of `CPUID` Feature Mask
MSRs which are a simple AND-mask applied to the results of all `CPUID`
instructions requesting specific feature bitmap sets.  The exact MSRs, and
which feature bitmap sets they affect, are hardware specific.
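
The effect of a masking MSR can be modelled in a few lines.  This is a sketch
of the AND-mask semantics only, with hypothetical names and made-up values; it
does not touch real MSRs:

```python
def masked_cpuid(hardware_value, mask_msr):
    """Value a guest would observe for one masked 32-bit feature word."""
    return hardware_value & mask_msr & 0xFFFFFFFF


hardware = 0b1011
print(bin(masked_cpuid(hardware, 0b0110)))      # 0b10
print(bin(masked_cpuid(hardware, 0xFFFFFFFF)))  # 0b1011
```

Because the operation is an AND, a mask can only hide features the hardware
has; it can never make an absent feature appear.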

Having said that, for PV guests particularly, there are features which might
be visible, but which they cannot possibly use.  As a result, Xen can get away
with hiding fewer features where it knows the guest could not use the
feature.

How Xen currently uses and exposes levelling support

Libxc has a `CPUID` Policy API which can be set by the toolstack for a
domain.  Libxc performs some information gathering, and uses the
`DOMCTL_set_cpuid` hypercall to specify what information should be returned
by Xen when the domain requests specific `CPUID` leaves.

The user of the libxc `CPUID` Policy API may specify, for any leaf,
whether particular bits should be forced high, forced low, default (as
determined by libxc), specifically the same as hardware, or specifically the
same as hardware and maintained consistently across migration.
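
A rough sketch of how such per-bit choices combine into one register value
follows.  The one-character codes (`1`, `0`, `x`, `k`, `s`) and the
most-significant-bit-first ordering are assumptions made for this example and
should be checked against libxc itself.  Migration consistency for `s` is not
modelled:

```python
def apply_policy(policy, default_bits, host_bits):
    """Resolve a 32-character per-bit policy string into a register value.

    Assumed codes: '1' force high, '0' force low, 'x' take the default,
    'k' copy the host bit, 's' copy the host bit (kept across migration).
    """
    if len(policy) != 32:
        raise ValueError("policy must describe exactly 32 bits")
    result = 0
    for index, code in enumerate(policy):
        bit = 31 - index  # first character describes bit 31
        if code == "1":
            value = 1
        elif code == "0":
            value = 0
        elif code == "x":
            value = (default_bits >> bit) & 1
        elif code in ("k", "s"):
            value = (host_bits >> bit) & 1
        else:
            raise ValueError(f"unknown policy code {code!r}")
        result |= value << bit
    return result


# Bits 3..0 are: forced high, forced low, host value, default value.
policy = "x" * 28 + "10kx"
print(bin(apply_policy(policy, default_bits=0b0101, host_bits=0b1010)))  # 0b1011
```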

The default `CPUID` Policy involves libxc trying to work out which features
should be set or cleared in the policy.  It does this with a mixture of
`CPUID` instructions, some switch statements choosing to enable/disable
certain features and hypercalls querying certain Xen state.

When Xen is servicing a `CPUID` instruction on behalf of a guest and ends up
using the policy provided by libxc, it subsequently edits certain fields,
particularly in the feature sets.

Support for the feature masking MSRs is available via the five command line
parameters `cpuid_mask_({,extd_}{ecx,edx}|xsave_eax)`, which get applied at
boot and reduce the visible feature set to every subsequent `CPUID`
instruction.

Support for _CPUID Faulting_ exists, but only insofar as having the same
effect as the masking MSRs would provide.

How XenServer currently does levelling

The _Heterogeneous Pool Levelling_ support in XenServer appears to predate the
libxc CPUID policy API, so does not currently use it.  The toolstack has a
table of CPU model numbers identifying whether levelling is supported.  It
then uses native `CPUID` instructions to look at the first four feature
bitmaps, and identifies the common subset of features across the pool.
`cpuid_mask_{,extd_}{ecx,edx}` is then set on Xen's command line for each host
in the pool, and all hosts are rebooted.
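
The procedure described above can be sketched as follows.  The host data is
invented, the function name is hypothetical, and this is not XenServer's
actual code; only the four parameter names come from the text:

```python
MASK_PARAMS = [
    "cpuid_mask_ecx",
    "cpuid_mask_edx",
    "cpuid_mask_extd_ecx",
    "cpuid_mask_extd_edx",
]


def pool_level_cmdline(hosts):
    """AND each feature word across all hosts and format Xen parameters."""
    level = {name: 0xFFFFFFFF for name in MASK_PARAMS}
    for host in hosts:
        for name in MASK_PARAMS:
            level[name] &= host[name]
    return " ".join(f"{name}={level[name]:#010x}" for name in MASK_PARAMS)


host_a = {"cpuid_mask_ecx": 0xFF, "cpuid_mask_edx": 0xF0F0,
          "cpuid_mask_extd_ecx": 0x3, "cpuid_mask_extd_edx": 0x1}
host_b = {"cpuid_mask_ecx": 0x0F, "cpuid_mask_edx": 0xFF00,
          "cpuid_mask_extd_ecx": 0x1, "cpuid_mask_extd_edx": 0x1}
print(pool_level_cmdline([host_a, host_b]))
```

Every host receives the same string, which is why this approach yields a
single level for the whole pool.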

This has several limitations:

* Xen and dom0 have a reduced feature set despite not needing to migrate
* There is only a single level for all VMs in the pool
* The toolstack only understands 4 of the 5 possible masking MSRs, and there
  are now feature maps in further `CPUID` leaves which have no masking MSRs

Proposal for new implementation

Experimentally, the masking MSRs can be context switched.  There is no need to
force all PV guests to the same level, and no need to prevent dom0 or Xen from
using certain features.

The toolstack needs to know how much control Xen has over VM features.  In the
case that there are insufficient masking MSRs, and no faulting support is
present, a PV VM can still potentially be made safe to migrate by explicitly
disabling features on the kernel command line.  As a result, there should be a
new mechanism which reports the levelling controls Xen has available.

The features available to each type of guest are really only known to Xen.
Having libxc try to divine them is bogus (especially as libxc is subject to
the toolstack domain's cpuid policy itself).  Therefore on boot, Xen should
work out the maximal feature set available to each type of guest and make this
information available to the toolstack.

`struct sysctl_physinfo.levelling_caps`

A bitmap field.  This is to inform a toolstack what Xen is capable of in
terms of levelling.  Bits reported include:

* `faulting`
* `mask_ecx`
* `mask_edx`
* `mask_extd_ecx`
* `mask_extd_edx`
* `mask_xsave_eax`
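
A toolstack-side view of such a bitmap might look like the sketch below.  The
bit positions are invented for illustration (the proposal does not assign
any), and the class and function names are hypothetical:

```python
from enum import IntFlag


class LevellingCaps(IntFlag):
    """Hypothetical bit assignment for levelling_caps."""
    FAULTING = 1 << 0
    MASK_ECX = 1 << 1
    MASK_EDX = 1 << 2
    MASK_EXTD_ECX = 1 << 3
    MASK_EXTD_EDX = 1 << 4
    MASK_XSAVE_EAX = 1 << 5


def has_full_control(caps):
    """With faulting, Xen controls every leaf; masks cover only some."""
    return bool(caps & LevellingCaps.FAULTING)


caps = LevellingCaps.MASK_ECX | LevellingCaps.MASK_EDX
print(has_full_control(caps))                 # False
print(LevellingCaps.MASK_XSAVE_EAX in caps)   # False
```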

_It is probably better to extend sysctl_physinfo in preference to adding
a new hypercall to return a word with a few bits set._

Improvements to `XEN_DOMCTL_set_cpuid`

The `XEN_DOMCTL_set_cpuid` hypercall is too lax at validating its input,
which results in needing further validation scattered over the Xen code.  In
particular it should not be possible to set feature bits which are blatantly

* Feature bitmaps should be strictly checked against Xen's maximal set for a
* Leaves should be checked against `max{,_extd}_eax`.  `libxc` currently
  the leaves in a suitable order for this restriction to be enforced.
* Xen should calculate a domain's feature masking MSRs from uploaded leaves,
  which prevents the toolstack from needing to special-case `CPUID` masking vs
  faulting based on host support.
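
The first two checks in the list above could take roughly the following
shape.  This is a sketch under stated assumptions: the function name, the
dictionary layout and the sample values are all hypothetical, and real
validation would live inside Xen rather than in a script:

```python
def check_leaf(leaf, regs, max_eax, max_extd_eax, max_features):
    """Return True if an uploaded leaf passes the stricter validation.

    max_features maps (leaf, register name) to Xen's maximal feature bitmap
    for the guest type; registers without an entry are not feature bitmaps.
    """
    in_basic_range = leaf <= max_eax
    in_extd_range = 0x80000000 <= leaf <= max_extd_eax
    if not (in_basic_range or in_extd_range):
        return False
    for reg, value in regs.items():
        allowed = max_features.get((leaf, reg))
        if allowed is not None and value & ~allowed & 0xFFFFFFFF:
            return False  # asks for a feature outside the maximal set
    return True


maximal = {(1, "ecx"): 0x0F}
print(check_leaf(1, {"ecx": 0x05}, 0xD, 0x80000008, maximal))  # True
print(check_leaf(1, {"ecx": 0x10}, 0xD, 0x80000008, maximal))  # False
```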

Lazy context switching of VCPU masking MSRs

Domains having different sets of features is an important flexibility. This
requires tracking and properly context switching the MSRs on vcpu context
switches, in the case that _CPUID faulting_ is not available.

At boot, Xen shall determine which masking MSRs are available as part of
calculating `sysctl_physinfo.levelling_caps`.  All domain masks (including the
idle domain) default to `~0`, and for PV guests (when _faulting_ is not
available) can be reduced by setting the policy.  Updates to a domain's masks
must never be able to exceed the equivalent mask in the idle domain.

The context switch code shall lazily update the masking MSRs when context
switching between VCPUs.
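
The two rules in this section (clamping against the idle domain, and lazy
updates) can be modelled as below.  This is a toy simulation with
hypothetical names; the counter stands in for real MSR writes, which are the
cost the lazy scheme tries to avoid:

```python
def clamp_domain_mask(requested, idle_mask):
    """A domain's mask may only clear bits relative to the idle domain's."""
    return requested & idle_mask


class Cpu:
    """Tracks which mask is currently loaded on one physical CPU."""

    def __init__(self, idle_mask=0xFFFFFFFF):
        self.loaded_mask = idle_mask
        self.msr_writes = 0

    def context_switch_to(self, domain_mask):
        # Only touch the MSR when the incoming vcpu needs a different value.
        if domain_mask != self.loaded_mask:
            self.loaded_mask = domain_mask  # stands in for an MSR write
            self.msr_writes += 1


cpu = Cpu()
for mask in (0xFFFFFFFF, 0xFF, 0xFF, 0xFFFFFFFF):
    cpu.context_switch_to(mask)
print(cpu.msr_writes)  # 2
```

Four context switches cost only two writes here, because switching between
vcpus that share a mask leaves the MSR alone.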

Deprecation of `cpuid_mask_*` command line parameters

These masking MSRs are already only intermittently present, and are starting
to disappear in more modern hardware.  With feature levelling being properly
configurable via the improvements presented here, there is no real
justification to use the command line parameters.  Features which need hiding
from Xen or dom0 should be hidden using the appropriate command line
options.

Attempted use of these command line parameters should emit a deprecation
warning, but continue to work as a host-wide lowering of features.  It shall
continue to work by lowering the idle domain's masks.


Get the Xen-calculated default CPUID policy for PV and HVM domains.  This is
needed by toolstacks to calculate how to level the VM features for safe
migration.

_This is a SYSCTL rather than DOMCTL as it is system specific information
referring to types of domains, rather than domain information.  On the other
hand, it could probably just be another set of hw_caps and forgo introducing a
new hypercall - I am open to suggestions as to the best method of reporting
this information_
