
Re: [Xen-devel] _PXM, NUMA, and all that goodnesss

On 13/02/14 10:08, Jan Beulich wrote:
>> Interestingly enough one can also read this from SysFS:
>> /sys/bus/pci/devices/<BDF>/numa_node,local_cpu,local_cpulist.
>> Except that we don't expose the NUMA topology to the initial
>> domain so the 'numa_node' is all -1. And the local_cpu depends
>> on seeing _all_ of the CPUs - and of course it assumes that
>> vCPU == pCPU.
>> Anyhow, if this was "tweaked" such that the initial domain
>> was seeing the hardware NUMA topology and parsing it (via
>> Elena's patches) we could potentially have at least the
>> 'numa_node' information present and figure out if a guest
>> is using a PCIe device from the right socket.
> I think you're mixing up things here. Afaict Elena's patches
> are to introduce _virtual_ NUMA, i.e. it would specifically _not_
> expose the host NUMA properties to the Dom0 kernel. Don't
> we have interfaces to expose the host NUMA information to
> the tools already?

I have recently looked into this when playing with xen support in hwloc.

Xen can export its vcpu_to_{socket,node,core} mappings for the toolstack
to consume, and for each node expose a count of used and free pages,
along with a square matrix of node distances from the SLIT table.

The counts of used pages are problematic, because they include pages
mapping MMIO regions, which differs from the logical expectation of
their being just RAM.

>> So what I am wondering is:
>>  1) Were there any plans for the XEN_PCI_DEV_PXM in the
>>     hypervisor? Were there some prototypes for exporting the
>>     PCI device BDF and NUMA information out.
> As said above: Intentions (I wouldn't call it plans) yes, prototypes
> no.
>>  2) Would it be better to just look at making the initial domain
>>    be able to figure out the NUMA topology and assign the
>>    correct 'numa_node' in the PCI fields?
> As said above, I don't think this should be exposed to and
> handled in Dom0's kernel. It's the tool stack to have the overall
> view here.

This is where things get awkward.  Dom0 has the real ACPI tables and is
the only entity with the ability to evaluate the _PXM() attributes to
work out which PCI devices belong to which NUMA nodes.  On the other
hand, its idea of CPUs and NUMA is stifled by being virtual, and it
generally does not have access to all the CPUs it can see as present in
the ACPI tables.

It would certainly be nice for dom0 to report the _PXM() attributes back
up to Xen, but I have no idea how easy/hard it would be.

>>  3). If either option is used, would taking that information in-to
>>    advisement when launching a guest with either 'cpus' or 'numa-affinity'
>>    or 'pci' and informing the user of a better choice be good?
>>    Or would it be better if there was some diagnostic tool to at
>>    least tell the user whether their PCI device assignment made
>>    sense or not? Or perhaps program the 'numa-affinity' based on
>>    the PCIe socket location?
> I think issuing hint messages would be nice. Automatic placement
> could clearly also take assigned devices' localities into consideration,
> i.e. one could expect assigned devices to result in the respective
> nodes to be picked in preference (as long as CPU and memory
> availability allow doing so).
> Jan

A diagnostic tool is arguably in the works, having been done in my
copious free time, and rather more actively on the hwloc-devel list than
xen-devel, given the current code freeze.


One vague idea I had was to see about using hwloc's placement algorithms
to help advise domain placement, but I have not yet investigated the
feasibility of this.
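To make the device-aware preference Jan describes concrete, the
toolstack-side heuristic might look something like the sketch below.
This is purely illustrative: the function name, the scoring, and the
example node IDs and memory figures are all made up, but it captures
"prefer device-local nodes, subject to CPU/memory availability":

```python
def preferred_nodes(device_nodes, free_mem, vm_mem):
    """Rank NUMA nodes for placing a new guest.

    device_nodes: set of nodes owning the PCI devices assigned to the guest
    free_mem:     dict mapping node id -> free memory (bytes)
    vm_mem:       memory the guest needs (bytes)

    Nodes local to an assigned device are preferred, but only nodes that
    can actually satisfy the memory requirement are candidates at all.
    """
    candidates = [n for n, free in free_mem.items() if free >= vm_mem]
    # Device-local nodes sort first; ties broken by most free memory.
    return sorted(candidates,
                  key=lambda n: (n not in device_nodes, -free_mem[n]))

# Example: assigned device on node 1, both nodes have room -> node 1 wins.
print(preferred_nodes({1}, {0: 8 << 30, 1: 6 << 30}, 4 << 30))  # [1, 0]
```

If node 1 could not fit the guest, it would drop out of the candidate
list entirely and the hint would fall back to node 0, matching Jan's
"as long as CPU and memory availability allow" caveat.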


Xen-devel mailing list