
Re: [Xen-devel] DESIGN v2: CPUID part 3

On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>> On 05/07/17 10:46, Joao Martins wrote:
>>>> Hey Andrew,
>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>> (RFC: Decide exactly where to fit this.  XEN_DOMCTL_max_vcpus
>>>>> perhaps?)
>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>> Xen's auditing shall ensure that guests observe values consistent with the
>>>>> guarantees made by the vendor manuals.
>>>> Why choose max_vcpus domctl?
>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>> for vcpus to exist but no topology information to have been provided.
>> /nods
>> So then doing this at vcpus allocation we would need to pass an additional
>> CPU topology argument on the max_vcpus hypercall? Otherwise it's sort of
>> guess work wrt sockets, cores, threads ... no?
> Andrew, thoughts on this and the one below?

Urgh sorry.  I've been distracted with some high priority interrupts (of
the non-maskable variety).

So, bad news is that the CPUID and MSR policy handling has become
substantially more complicated and entwined than I had first planned.  A
change in either set of data alters the auditing of the other, so I am
leaning towards implementing everything with a single set hypercall (as
this is the only way to get a plausibly-consistent set of data).

The good news is that I don't think we actually need any changes to
XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
the static CPUID policy to make this work.

>> There could be other uses too in passing this info to Xen: e.g. with the
>> scheduler knowing the guest CPU topology, it could make a better selection
>> of core+sibling pairs to match the cache/CPU topology exposed to the
>> guest (for unpinned SMT guests).

I remain to be convinced (i.e. with some real performance numbers) that
the added complexity in the scheduler for that logic is a benefit in the
general case.

In practice, customers are either running very specific and dedicated
workloads (at which point pinning is used and there is no
oversubscription, and exposing the actual SMT topology is a good thing),
or customers are running general workloads with no pinning (or perhaps
cpupool-numa-split) with a moderate amount of oversubscription (at which
point exposing SMT is a bad move).

Counterintuitively, exposing NUMA in general oversubscribed scenarios is
terrible for net system performance.  What happens in practice is that
VMs which see NUMA spend their idle cycles trying to balance their own
userspace processes, rather than yielding to the hypervisor so another
guest can get a go.

>>>> With multiple sockets/nodes and having supported extended topology leaf
>>>> the APIC ID layout will change considerably requiring fixup if... say we
>>>> set vNUMA (I know numa node != socket spec wise, but on the machines we
>>>> have seen so far, it's a 1:1 mapping).
>>> AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
>>> will need to be accounted for in how the information is represented,
>>> especially in leaf 0x8000001e.
>>> Intel on the other hand (as far as I can tell), has no interaction
>>> between NUMA and topology as far as CPUID is concerned.
>> Sorry, I should probably have mentioned earlier that "machines we have seen
>> so far" were Intel - I am a bit unaware of the AMD added possibilities.
>>>> Another question, since we are speaking about topology, would be: how do
>>>> we make hvmloader aware of the APIC_ID layout? Right now, it is
>>>> hardcoded to 2 * APIC_ID :( Probably a xenstore entry
>>>> 'hvmloader/cputopology-threads' and 'hvmloader/cputopology-sockets' (or
>>>> use vnuma_topo.nr_nodes for the latter)?
>>> ACPI table writing is in the toolstack now, but even if it weren't,
>>> HVMLoader would have to do what all real firmware needs to do, and look
>>> at CPUID.
> I think the real hardware when constructing interesting topologies uses
> platform specific MSRs or other hidden gems (like AMD Northbridge).

It was my understanding that APIC IDs are negotiated at power-on time,
as they are the base layer of addressing in the system.

>> Right, but the mp tables (and lapic ids) are still adjusted/created by
>> hvmloader unless ofc I am reading it wrong. But anyhow - if you're planning
>> to be based on
> <nods>
> I can't see how the CPUID would allow to construct the proper APIC MADT
> entries so that the APIC IDs match as of right now?
> Unless hvmloader is changed to do full SMP bootup (it does that now at some
> point) and each CPU reports this information and they all update this table
> based on their EAX=1 CPUID value?

HVMLoader is currently hardcoded to the same assumption (APIC ID =
vcpu_id * 2) as other areas of Xen and the toolstack.

All vcpus are already booted, so the MTRRs can be configured suitably. 
Having said that, I think vcpu0 can write out the ACPI tables properly,
so long as it knows that Xen doesn't insert arbitrary holes into the
APIC ID space.

