
Re: [Xen-devel] _PXM, NUMA, and all that goodnesss

To: "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx>
From: "Jan Beulich" <JBeulich@xxxxxxxx>
Date: Thu, 13 Feb 2014 10:08:16 +0000
Cc: ufimtseva@xxxxxxxxx, andrew.thomas@xxxxxxxxxx, george.dunlap@xxxxxxxxxxxxx, andrew.cooper3@xxxxxxxxxx, jun.nakajima@xxxxxxxxx, kurt.hackel@xxxxxxxxxx, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, boris.ostrovsky@xxxxxxxxxx
Delivery-date: Thu, 13 Feb 2014 10:08:31 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

>>> On 12.02.14 at 20:50, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> From a Linux kernel perspective we do seem to 'pipe' said information
> from the ACPI DSDT (drivers/xen/pci.c):
> 
>                 unsigned long long pxm;
> 
>                 status = acpi_evaluate_integer(handle, "_PXM",
>                                                NULL, &pxm);
>                 if (ACPI_SUCCESS(status)) {
>                     add.optarr[0] = pxm;
>                     add.flags |= XEN_PCI_DEV_PXM;
> 
> Which is neat except that Xen ignores that flag altogether. I Googled
> a bit but still did not find anything relevant - though there were
> some presentations from past Xen Summits referring to it
> (I can't find them now :-()

When adding that interface it seemed pretty clear to me that we
would want/need this information sooner or later. I'm unaware of
any (prototype or better) code utilizing it.
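For illustration only, here is a minimal sketch of what consuming the flag on the receiving side could look like. The struct is a simplified, hypothetical stand-in for the real hypercall argument, `pci_add_get_pxm()` is an invented helper name, and the flag value 0x4 is assumed to match the public physdev.h header (please check it there). Nothing like this is known to exist in the hypervisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match Xen's public physdev.h; verify against the header. */
#define XEN_PCI_DEV_PXM 0x4

/* Hypothetical, simplified stand-in for the device-add argument. */
struct pci_add_sketch {
    uint32_t flags;
    size_t   nr_opt;      /* number of valid entries in optarr */
    uint32_t optarr[1];
};

/*
 * Store the proximity domain in *pxm and return true if the caller
 * supplied one; leave *pxm untouched and return false otherwise.
 */
bool pci_add_get_pxm(const struct pci_add_sketch *add, uint32_t *pxm)
{
    assert(add != NULL && pxm != NULL);

    if (!(add->flags & XEN_PCI_DEV_PXM) || add->nr_opt < 1)
        return false;

    *pxm = add->optarr[0];
    return true;
}
```

The value obtained this way is an ACPI proximity domain, so it would still need translating to a node number before anything could be reported to the tools.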

> Anyhow, what I am wondering is whether there were some prototypes out
> there in the past that utilize this. And if we were to use this, how
> can we expose it to 'libxl' or any other tools to say:
> 
> "Hey! You might want to use this other PCI device assigned
> to pciback which is on the same node". Some form of
> 'numa-pci' affinity.

Right, a hint like this might be desirable. But this shouldn't be
enforced.
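To make "hint, not enforcement" concrete, a toolstack-side check might look roughly like the sketch below. The function name `pci_numa_hint()`, the 64-bit node mask, and the message text are all assumptions made for the example, not existing libxl interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NODE_UNKNOWN (-1)

/*
 * Return 1 and fill buf with a hint if the device's node is known and
 * lies outside the guest's node mask; return 0 (empty buf) otherwise.
 * The caller only prints the hint; the assignment itself never fails.
 */
int pci_numa_hint(const char *bdf, int dev_node, uint64_t guest_nodes,
                  char *buf, size_t len)
{
    assert(bdf != NULL && buf != NULL && len > 0);

    buf[0] = '\0';
    if (dev_node == NODE_UNKNOWN || dev_node < 0 || dev_node >= 64)
        return 0;                       /* nothing useful to say */
    if (guest_nodes & (UINT64_C(1) << dev_node))
        return 0;                       /* device is local to the guest */

    snprintf(buf, len,
             "hint: PCI device %s is on node %d, outside the guest's nodes",
             bdf, dev_node);
    return 1;
}
```

A device whose node is unknown produces no hint at all, which seems the safer default while the locality information is not reliably available.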

> Interestingly enough one can also read this from SysFS:
> /sys/bus/pci/devices/<BDF>/numa_node,local_cpu,local_cpulist.
> 
> Except that we don't expose the NUMA topology to the initial
> domain so the 'numa_node' is all -1. And the local_cpu depends
> on seeing _all_ of the CPUs - and of course it assumes that
> vCPU == pCPU.
> 
> Anyhow, if this was "tweaked" such that the initial domain
> was seeing the hardware NUMA topology and parsing it (via
> Elena's patches) we could potentially have at least the
> 'numa_node' information present and figure out if a guest
> is using a PCIe device from the right socket.

I think you're mixing up things here. Afaict Elena's patches
are to introduce _virtual_ NUMA, i.e. it would specifically _not_
expose the host NUMA properties to the Dom0 kernel. Don't
we have interfaces to expose the host NUMA information to
the tools already?
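For reference, the sysfs attribute mentioned in the quoted text holds a single decimal integer, with -1 meaning "unknown". A small sketch of reading it follows; the helper names are made up for the example, and on a Dom0 without host NUMA information it would simply return -1, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a numa_node attribute; -1 means unknown or malformed. */
int parse_numa_node(const char *text)
{
    char *end;
    long v;

    assert(text != NULL);

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno != 0 || v < -1 || v > INT_MAX)
        return -1;
    return (int)v;
}

/* Read /sys/bus/pci/devices/<bdf>/numa_node; -1 if it cannot be read. */
int read_pci_numa_node(const char *bdf)
{
    char path[128], line[32];
    FILE *f;
    int n, node = -1;

    assert(bdf != NULL);

    n = snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/numa_node", bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                      /* BDF string too long */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) != NULL)
        node = parse_numa_node(line);
    fclose(f);
    return node;
}
```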

> So what I am wondering is:
>  1) Were there any plans for the XEN_PCI_DEV_PXM in the
>     hypervisor? Were there some prototypes for exporting the
>     PCI device BDF and NUMA information out?

As said above: Intentions (I wouldn't call it plans) yes, prototypes
no.

>  2) Would it be better to just look at making the initial domain
>    be able to figure out the NUMA topology and assign the
>    correct 'numa_node' in the PCI fields?

As said above, I don't think this should be exposed to and
handled in Dom0's kernel. It's the tool stack that has the overall
view here.

>  3) If either option is used, would taking that information into
>    account when launching a guest with either 'cpus' or 'numa-affinity'
>    or 'pci' and informing the user of a better choice be good?
>    Or would it be better if there was some diagnostic tool to at
>    least tell the user whether their PCI device assignment made
>    sense or not? Or perhaps program the 'numa-affinity' based on
>    the PCIe socket location?

I think issuing hint messages would be nice. Automatic placement
could clearly also take assigned devices' localities into consideration,
i.e. one could expect assigned devices to result in the respective
nodes being picked in preference (as long as CPU and memory
availability allow doing so).
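As a rough sketch of that preference: among the nodes that can satisfy the guest's CPU and memory needs, pick the one hosting the most assigned devices, falling back to free memory as a tie-breaker. The struct, its fields, and `pick_node()` are hypothetical; the real automatic placement code weighs candidates differently, and this only shows where device locality could enter the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node summary a toolstack might hold. */
struct node_info {
    unsigned int free_cpus;
    uint64_t     free_mem_kb;
    unsigned int assigned_devs;   /* guest's passthrough devices on node */
};

/*
 * Pick a node for a guest needing `cpus` CPUs and `mem_kb` of memory.
 * Nodes that cannot fit the guest are skipped, so device locality is a
 * preference only. Returns the node index, or -1 if no node fits.
 */
int pick_node(const struct node_info *nodes, size_t nr,
              unsigned int cpus, uint64_t mem_kb)
{
    int best = -1;

    assert(nodes != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++) {
        if (nodes[i].free_cpus < cpus || nodes[i].free_mem_kb < mem_kb)
            continue;                   /* availability comes first */
        if (best < 0 ||
            nodes[i].assigned_devs > nodes[best].assigned_devs ||
            (nodes[i].assigned_devs == nodes[best].assigned_devs &&
             nodes[i].free_mem_kb > nodes[best].free_mem_kb))
            best = (int)i;
    }
    return best;
}
```

When the device's node lacks capacity, the guest lands elsewhere and a hint message would be the natural complement.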

Jan

