
Re: [Xen-devel] DESIGN v2: CPUID part 3



On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
> > On 05/07/17 10:46, Joao Martins wrote:
> >> Hey Andrew,
> >>
> >> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
> >>> Presented herewith is a plan for the final part of the CPUID work, which
> >>> primarily covers better Xen/Toolstack interaction for configuring the
> >>> guest's CPUID policy.
> >>>
> >> Really nice write up, a few comments below.
> >>
> >>> A PDF version of this document is available from:
> >>>
> >>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
> >>>
> >>> Changes from v1:
> >>>  * Clarification of the interaction of emulated features
> >>>  * More information about the difference between max and default 
> >>> featuresets.
> >>>
> >>> ~Andrew
> >>>
> >>> -----8<-----
> >>> % CPUID Handling (part 3)
> >>> % Revision 2
> >>>
> 
> [snip]
> 
> >>> # Proposal
> >>>
> >>> First and foremost, split the current **max\_policy** notion into separate
> >>> **max** and **default** policies.  This allows for the provision of
> >>> features which are unused by default, but may be opted in to, both at the
> >>> hypervisor level and the toolstack level.
> >>>
> >>> At the hypervisor level, **max** constitutes all the features Xen can use
> >>> on the current hardware, while **default** is the subset thereof
> >>> comprising the supported features plus any features the user has
> >>> explicitly opted in to, and excluding any features the user has
> >>> explicitly opted out of.
> >>>
> >>> A new `cpuid=` command line option shall be introduced, whose internals
> >>> are generated automatically from the featureset ABI.  This means that all
> >>> features added to `include/public/arch-x86/cpufeatureset.h` automatically
> >>> gain command line control.  (RFC: The same top level option can probably
> >>> be used for non-feature CPUID data control, although I can't currently
> >>> think of any cases where this would be used.  Also find a sensible way to
> >>> express 'available but not to be used by Xen', as per the current `smep`
> >>> and `smap` options.)
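
For context, here is a purely illustrative, standalone sketch (nothing below is
the proposed implementation; the table and parser names are made up) of how a
"cpuid=<feature>,no-<feature>" style string could be parsed against a name
table generated from cpufeatureset.h, recording explicit opt-ins and opt-outs:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical name table; in Xen this would be generated from
     * include/public/arch-x86/cpufeatureset.h. */
    static const char *const feature_names[] = { "smep", "smap", "avx2" };
    #define NR_FEATURES (sizeof(feature_names) / sizeof(feature_names[0]))

    /* Parse e.g. "smep,no-avx2" into explicit opt-in/opt-out masks. */
    static void parse_cpuid_string(const char *s, uint32_t *opt_in,
                                   uint32_t *opt_out)
    {
        char buf[128];

        strncpy(buf, s, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for ( char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",") )
        {
            bool enable = true;

            if ( !strncmp(tok, "no-", 3) )
            {
                enable = false;
                tok += 3;
            }

            for ( unsigned int i = 0; i < NR_FEATURES; i++ )
                if ( !strcmp(tok, feature_names[i]) )
                    *(enable ? opt_in : opt_out) |= 1u << i;
        }
    }

    int main(void)
    {
        uint32_t opt_in = 0, opt_out = 0;

        parse_cpuid_string("smep,no-avx2", &opt_in, &opt_out);
        printf("opt-in %#x, opt-out %#x\n", opt_in, opt_out);
        return 0;
    }
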
> >>>
> >>>
> >>> At the guest level, the **max** policy is conceptually unchanged.  It
> >>> constitutes all the features Xen is willing to offer to each type of
> >>> guest on the current hardware (including emulated features).  However, it
> >>> shall instead be derived from Xen's **default** host policy.  This is to
> >>> ensure that experimental hypervisor features must be opted in to at the
> >>> Xen level before they can be opted in to at the toolstack level.
> >>>
> >>> The guest's **default** policy is then derived from its **max**.  This is
> >>> because there are some features which should always be explicitly opted
> >>> in to by the toolstack, such as emulated features which come with a
> >>> security trade-off, or non-architectural features which may differ in
> >>> implementation in heterogeneous environments.
> >>>
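
To make the derivation chain concrete, here's a toy sketch (single featureset
word, made-up bit meanings, not the real masks) of how each policy is derived
by masking the previous one, so nothing reaches a guest's default without first
being in Xen's default:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Host level: bit 7 is an experimental feature needing "cpuid=" opt-in. */
        uint32_t host_max     = 0x00ff;
        uint32_t host_default = host_max & ~0x80u;

        /* Guest level: bit 8 is an emulated, guest-only feature; bit 6 needs
         * an explicit toolstack opt-in (e.g. a security trade-off). */
        uint32_t emulated      = 0x0100;
        uint32_t guest_max     = host_default | emulated;
        uint32_t guest_default = guest_max & ~(0x40u | emulated);

        printf("host  max %#x, default %#x\n", host_max, host_default);
        printf("guest max %#x, default %#x\n", guest_max, guest_default);
        return 0;
    }
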
> >>> All global policies (Xen and guest, max and default) shall be made
> >>> available to the toolstack, in a manner similar to the existing
> >>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to
> >>> be taken which include all CPUID data, not just the feature bitmaps.
> >>>
> >>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be
> >>> introduced, which allow the toolstack to query and set the CPUID policy
> >>> for a specific domain.  They shall supersede _XEN\_DOMCTL\_set\_cpuid_,
> >>> and shall fail if Xen is unhappy with any aspect of the policy during
> >>> auditing.  This provides feedback to the user that a chosen combination
> >>> will not work, rather than the guest booting in an unexpected state.
> >>>
> >>> When a domain is initially created, the appropriate guest's **default**
> >>> policy is duplicated for use.  When auditing, Xen shall audit the
> >>> toolstack's requested policy against the guest's **max** policy.  This
> >>> allows experimental features or non-migration-safe features to be opted
> >>> in to, without those features being imposed upon all guests
> >>> automatically.
> >>>
> >>> A guest's CPUID policy shall be immutable after construction.  This
> >>> better matches real hardware, and simplifies the logic in Xen to
> >>> translate policy alterations into configuration changes.
> >>>
> >> This appears to be a suitable abstraction even for higher level toolstacks
> >> (libxl).  At least I can imagine libvirt fetching the PV/HVM max policies
> >> and comparing them between different servers when the user computes the
> >> guest CPU config (the normalized one), then using the common denominator
> >> as the guest policy.  Higher level toolstacks could probably even use
> >> these policy constructs to build the idea of models, so that the user
> >> could easily choose one for a pool of hosts with different families.  But
> >> the discussion here is more focused on xc <-> Xen, so I won't clobber the
> >> discussion with libxl remarks.
> > 
> > One thing I haven't decided on yet is how to represent the policy at a
> > higher level.  Somewhere (probably libxc), I am going to need to
> > implement is_policy_compatible(a, b), and calculate_compatible_policy(a,
> > b, res), which will definitely be needed by Xapi, and will probably be
> > useful to other higher level toolstacks.
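
FWIW, for the featureset part of the policy both of those presumably reduce to
bitwise subset/intersection checks; a toy sketch (the structure and word count
are hypothetical, and the non-feature leaves clearly won't reduce to a plain
AND):

    #include <stdbool.h>
    #include <stdint.h>

    #define FEATURESET_WORDS 5 /* illustrative size */

    /* Toy stand-in for a full CPUID policy: just the feature bitmap part. */
    struct policy {
        uint32_t feat[FEATURESET_WORDS];
    };

    /* a is compatible with b if a requires nothing that b doesn't offer. */
    static bool is_policy_compatible(const struct policy *a,
                                     const struct policy *b)
    {
        for ( unsigned int i = 0; i < FEATURESET_WORDS; i++ )
            if ( a->feat[i] & ~b->feat[i] )
                return false;

        return true;
    }

    /* The "common denominator" policy is the intersection of the two. */
    static void calculate_compatible_policy(const struct policy *a,
                                            const struct policy *b,
                                            struct policy *res)
    {
        for ( unsigned int i = 0; i < FEATURESET_WORDS; i++ )
            res->feat[i] = a->feat[i] & b->feat[i];
    }
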
> >
> I had initially intended for libxl to keep this sort of logic when I was 
> looking
> at the topic, but with the problems depicted above, libxc is probably better
> suited to have this.
> 
> >>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_
> >>> perhaps?)  The toolstack shall also have a mechanism to explicitly select
> >>> topology configuration for the guest, which primarily affects the virtual
> >>> APIC ID layout, and has a knock-on effect for the APIC ID of the virtual
> >>> IO-APIC.  Xen's auditing shall ensure that guests observe values
> >>> consistent with the guarantees made by the vendor manuals.
> >>>
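
For reference, on Intel the vendor-manual guarantee being audited is
essentially the leaf 0xB composition of the APIC ID out of per-level
bitfields; a small standalone sketch of that layout (field widths rounded up
to powers of two):

    #include <stdio.h>

    /* ceil(log2(x)) for x >= 1, i.e. the width of each topology bitfield. */
    static unsigned int order(unsigned int x)
    {
        unsigned int o = 0;

        while ( (1u << o) < x )
            o++;

        return o;
    }

    int main(void)
    {
        unsigned int threads_per_core = 2, cores_per_pkg = 4;
        unsigned int smt_shift  = order(threads_per_core);
        unsigned int core_shift = order(threads_per_core * cores_per_pkg);

        /* APIC ID = package | core | thread, each in its own bitfield. */
        for ( unsigned int pkg = 0; pkg < 2; pkg++ )
            for ( unsigned int core = 0; core < cores_per_pkg; core++ )
                for ( unsigned int thr = 0; thr < threads_per_core; thr++ )
                    printf("pkg %u core %u thread %u -> APIC ID %u\n",
                           pkg, core, thr,
                           (pkg << core_shift) | (core << smt_shift) | thr);

        return 0;
    }
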
> >> Why choose max_vcpus domctl?
> > 
> > Despite its name, the max_vcpus hypercall is the one which allocates all
> > the vcpus in the hypervisor.  I don't want there to be any opportunity
> > for vcpus to exist but no topology information to have been provided.
> > 
> /nods
> 
> So then, doing this at vcpu allocation, we would need to pass an additional
> CPU topology argument on the max_vcpus hypercall?  Otherwise it's sort of
> guesswork wrt sockets, cores, threads ... no?
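
(Purely as a strawman on my side, not an ABI proposal: the extra argument
could be a small per-domain topology description passed at vcpu-allocation
time, roughly of this hypothetical shape:)

    #include <stdint.h>

    /* Hypothetical topology description a toolstack could pass alongside the
     * vcpu allocation, so the APIC ID layout is fixed before any vcpu exists. */
    struct xen_domctl_vcpu_topology {
        uint8_t  threads_per_core;
        uint8_t  cores_per_socket;
        uint16_t sockets;       /* or nodes, if vNUMA is meant to match */
        uint32_t max_vcpus;     /* expected to equal the product above  */
    };
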

Andrew, thoughts on this and the one below?

> 
> There could be other uses for passing this info to Xen too: e.g. if the
> scheduler knew the guest CPU topology, it could better select core+sibling
> pairs such that they match the cache/CPU topology passed to the guest (for
> unpinned SMT guests).
> 
> >>
> >> With multiple sockets/nodes, and with the extended topology leaf
> >> supported, the APIC ID layout will change considerably, requiring fixup
> >> if... say we set vNUMA (I know a NUMA node != a socket spec-wise, but on
> >> the machines we have seen so far, it's a 1:1 mapping).
> > 
> > AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
> > will need to be accounted for in how the information is represented,
> > especially in leaf 0x8000001e.
> > 
> > Intel on the other hand (as far as I can tell), has no interaction
> > between NUMA and topology as far as CPUID is concerned.
> >
> Sorry, I should probably have mentioned earlier that the "machines we have
> seen so far" were Intel - I am a bit unaware of the AMD-added possibilities.
> 
> >> Another question, since we are speaking about topology, would be: how do
> >> we make hvmloader aware of the APIC_ID layout?  Right now it is hardcoded
> >> as 2 * APIC_ID :( Probably a xenstore entry
> >> 'hvmloader/cputopology-threads' and 'hvmloader/cputopology-sockets' (or
> >> use vnuma_topo.nr_nodes for the latter)?
> > 
> > ACPI table writing is in the toolstack now, but even if it weren't,
> > HVMLoader would have to do what all real firmware needs to do, and look
> > at CPUID.

I think real hardware, when constructing interesting topologies, uses
platform-specific MSRs or other hidden gems (like the AMD Northbridge).

> > 
> Right, but the mp tables (and lapic ids) are still adjusted/created by 
> hvmloader
> unless ofc I am reading it wrong. But anyhow - if you're planning to be based 
> on

<nods>

I can't see how CPUID would allow constructing the proper APIC MADT entries so
that the APIC IDs match, as of right now?

Unless hvmloader is changed to do a full SMP bootup (it does that now at some
point), and each CPU reports this information so they all update this table
based on their EAX=1 CPUID value?
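
(For what it's worth, having each AP read its own initial APIC ID is
straightforward; a minimal userspace sketch using GCC's <cpuid.h>, assuming
leaf 1 is what hvmloader would consult:)

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.1:EBX[31:24] is this CPU's initial (xAPIC) APIC ID; leaf 0xB
         * EDX would give the full 32-bit x2APIC ID instead. */
        if ( !__get_cpuid(1, &eax, &ebx, &ecx, &edx) )
            return 1;

        printf("initial APIC ID: %u\n", ebx >> 24);
        return 0;
    }
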

> CPUID, that is certainly more correct than what I had suggested earlier,
> though with a bit more surgery on hvmloader.
> 
> >> This all brings me to the question of perhaps a separate domctl?
> > 
> > I specifically want to avoid having a separate hypercall for this
> > information.
> > 
> OK.
> 
> Joao
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

