
Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest



On 07/23/2015 04:07 PM, Dario Faggioli wrote:
> On Thu, 2015-07-23 at 06:43 +0200, Juergen Gross wrote:
>> On 07/22/2015 04:44 PM, Boris Ostrovsky wrote:
>>> On 07/22/2015 10:09 AM, Juergen Gross wrote:

>>>> I think we have 2 possible solutions:
>>>>
>>>> 1. Try to handle this all in the hypervisor via CPUID mangling.
>>>>
>>>> 2. Add PV-topology support to the guest and indicate this capability
>>>>    via elfnote; only enable PV-NUMA if this note is present.
>>>>
>>>> I'd prefer the second solution. If you are okay with this, I'd try
>>>> to do some patches for the pvops kernel.
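
(For reference: the elfnote mechanism meant here is the set of Xen ELF
notes a pvops kernel already emits from arch/x86/xen/xen-head.S. A
sketch of what announcing such a capability could look like -- the
XEN_ELFNOTE_PV_TOPOLOGY name and its value are hypothetical, no such
note exists at this point:

    /* arch/x86/xen/xen-head.S, next to the existing notes: */
    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,    .asciz "linux")  /* existing */
    ELFNOTE(Xen, XEN_ELFNOTE_PV_TOPOLOGY, .long 1)         /* hypothetical */
)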

>>> Why do you think that kernel patches are preferable to CPUID
>>> management? This would be all in tools, I'd think. (Well, one problem
>>> that I can think of is that AMD sometimes pokes at MSRs and/or the
>>> Northbridge's PCI registers to figure out the node ID --- that we may
>>> need to address in the hypervisor.)

>> Doing it via CPUID is more HW specific. Trying to fake a topology for
>> the guest from outside might lead to weird decisions in the guest,
>> e.g. regarding licenses based on socket counts.

> I do see the value of this, I think...

>> If you are doing it in the guest itself you are able to address the
>> different problems (scheduling, licensing) in different ways.

> ... but, at least in the case of vNUMA for instance, there still needs
> to be a correlation between the vNUMA topology and the "CPUID
> topology", and the vNUMA topology is decided by the toolstack.
>
> Then, if you mean that, among all the possible solutions that match
> (i.e., that do not cause problems for!) the vNUMA setup we've been
> given, we should pick the one that is also best for some other
> purpose, then I agree.
>
> What I'm not sure I see, though, is how you would specify that other
> purpose; e.g., in this case, are you thinking of another parameter
> saying that we want to minimize the socket count?

>> Depending on the licensing model, playing with CPUID is either good
>> or bad. I can even imagine that the CPUID configuration capabilities
>> in xl are in use today for exactly this purpose. Using them for
>> PV-NUMA as well will make this feature unusable for those users.
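
(For reference, the xl CPUID overrides mentioned above look roughly
like this in a domain config -- see xl.cfg(5); the flag names and bit
pattern below are illustrative:

    # xend-style syntax: host values, with single feature flags forced
    cpuid = "host,tm=0,sse3=0"

    # libxl-style syntax: per-leaf/register bit patterns
    # ('x' keeps the host value, '0'/'1' force the bit)
    cpuid = [ "0x1:ecx=xxxxxxxxxxx0xxxxxxxxxxxxxxxxxxxx" ]
)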

> Yeah, well... So, you want a VM with only one socket, because of
> whatever reason (say licensing), and you're using libxl's CPUID
> fiddling capability to do that. Now, if you specify, for such a VM, a
> vNUMA layout with 2 vnodes, well, I'd call this asking for trouble. I
> know, strictly speaking, socket != (v)NUMA node. Still, I think this
> will be a corner case, way less common than a user just specifying a
> vNUMA topology and then getting only a fraction of the vcpus
> used/usable! :-/
>
> In summary, I probably know too little about CPUID handling to have a
> clear view on whether something like 'making it match the topology'
> --which also means, if no vNUMA, CPUID should say flat, for some
> definition of flat-- should live in the tools or in the kernel... I
> just know that we need to do something *consistent*.
>
> FWIW, I was thinking that the kernel was the better place, as Juergen
> is saying, while now I'm more convinced that the tools would be more
> appropriate, as Boris is saying.

I've collected some information from the Linux kernel sources as a
basis for the discussion:

The complete NUMA information (cpu->node and memory->node relations) is
taken from the ACPI tables (SRAT; SLIT for "distances").
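
A rough sketch of where the cpu->node relation comes from: each cpu is
described by one SRAT "Processor Local APIC/SAPIC Affinity" subtable
(the kernel's own version is struct acpi_srat_cpu_affinity; the field
names below are descriptive, not the kernel's):

    struct srat_cpu_affinity {            /* SRAT subtable, type 0 */
        uint8_t  type;                    /* 0 */
        uint8_t  length;                  /* 16 */
        uint8_t  proximity_domain_lo;     /* NUMA node, bits 7:0 */
        uint8_t  apic_id;                 /* local APIC ID of the cpu */
        uint32_t flags;                   /* bit 0: entry enabled */
        uint8_t  local_sapic_eid;
        uint8_t  proximity_domain_hi[3];  /* NUMA node, bits 31:8 */
        uint32_t clock_domain;
    } __attribute__((packed));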

The topology information is obtained via (a user-space sketch of the
leaf 0xb case follows the list):
- intel:
  + cpuid leaf 0xb with subleaves, leaf 4
  + cpuid leaf 2 and/or leaf 1 if leaf 0xb and/or leaf 4 aren't available
- amd:
  + cpuid leaf 0x8000001e, leaf 0x8000001d, leaf 4
  + msr 0xc001100c
  + cpuid leaf 2 and/or leaf 1 if the leaves above aren't available
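
A minimal user-space sketch of the intel leaf 0xb enumeration (and
since unprivileged cpuid executes natively in a PV guest, this is
essentially how the guest ends up seeing the host's topology):

    #include <stdio.h>
    #include <cpuid.h>   /* gcc's __get_cpuid()/__cpuid_count() helpers */

    int main(void)
    {
        unsigned int max, eax, ebx, ecx, edx, lvl;

        if (!__get_cpuid(0, &max, &ebx, &ecx, &edx) || max < 0xb)
            return 1;                    /* leaf 0xb not available */

        for (lvl = 0; ; lvl++) {
            __cpuid_count(0xb, lvl, eax, ebx, ecx, edx);
            if (!(ebx & 0xffff))         /* invalid subleaf: done */
                break;
            /* ECX[15:8]: level type (1 = SMT, 2 = core)
             * EAX[4:0]:  x2APIC ID shift to reach the next level
             * EDX:       this cpu's x2APIC ID */
            printf("level %u: type %u, shift %u, x2apic id %u\n",
                   lvl, (ecx >> 8) & 0xff, eax & 0x1f, edx);
        }
        return 0;
    }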

The scheduler is aware of:
- smt siblings (from topology)
- last-level-cache siblings (from topology)
- node siblings (from numa information)
When moving tasks from one cpu to another it prefers smt siblings
first, then llc siblings, then node siblings, and only as a last resort
considers all cpus.
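
This hierarchy can be inspected from user space, assuming the kernel is
built with CONFIG_SCHED_DEBUG so the sched_domain sysctl tree is
present (one domainN directory per level, with names like SMT, MC,
NUMA); a small sketch:

    #include <stdio.h>

    /* Print cpu0's scheduling-domain hierarchy, innermost level first. */
    int main(void)
    {
        char path[96], name[64];
        int d;

        for (d = 0; ; d++) {
            FILE *f;

            snprintf(path, sizeof(path),
                     "/proc/sys/kernel/sched_domain/cpu0/domain%d/name", d);
            if (!(f = fopen(path, "r")))
                break;
            if (fgets(name, sizeof(name), f))
                printf("domain%d: %s", d, name);
            fclose(f);
        }
        return 0;
    }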

Memory management does NUMA-node-aware memory allocation.

Topology and NUMA information are made available to user space through
the /sys and /proc filesystems.
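
For example (a sketch; all four files below are standard sysfs paths):

    #include <stdio.h>

    /* Print one line of a sysfs file, prefixed with its path. */
    static void show(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof(buf), f))
            printf("%-55s %s", path, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        /* The topology view ... */
        show("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        show("/sys/devices/system/cpu/cpu0/topology/core_siblings_list");
        /* ... and the NUMA view (from SRAT/SLIT). */
        show("/sys/devices/system/node/node0/cpulist");
        show("/sys/devices/system/node/node0/distance");
        return 0;
    }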

The cpuid instruction is available in user mode as well (which is why
the leaf 0xb sketch above runs unprivileged).


Anything I have missed?


Juergen
