[Xen-devel] _PXM, NUMA, and all that goodness
Hey,

I have been looking at figuring out how we can "easily" do PCIe assignment of devices that sit on different sockets. The problem is that on machines with many sockets (four or more) we might inadvertently assign a PCIe device from one socket to a guest bound to a different NUMA node. That means more QPI (inter-socket) traffic, higher latency, etc.

From a Linux kernel perspective we do seem to 'pipe' said information from the ACPI DSDT (drivers/xen/pci.c):

    unsigned long long pxm;

    status = acpi_evaluate_integer(handle, "_PXM",
                                   NULL, &pxm);
    if (ACPI_SUCCESS(status)) {
        add.optarr[0] = pxm;
        add.flags |= XEN_PCI_DEV_PXM;

Which is neat, except that Xen ignores that flag altogether. I Googled a bit but still did not find anything relevant - I thought there were some presentations from past Xen Summits referring to it (I can't find them now :-().

Anyhow, I am wondering whether there are some prototypes out there from the past that utilize this. And if we were to use it, how could we expose it to 'libxl' or any other tools so they could say: "Hey! You might want to use this other PCI device assigned to pciback which is on the same node." Some form of 'numa-pci' affinity.

Interestingly enough, one can also read this from SysFS: /sys/bus/pci/devices/<BDF>/numa_node, local_cpus, local_cpulist. Except that we don't expose the NUMA topology to the initial domain, so 'numa_node' is all -1. And 'local_cpus' depends on seeing _all_ of the CPUs - and of course it assumes that vCPU == pCPU. Anyhow, if this were "tweaked" such that the initial domain saw the hardware NUMA topology and parsed it (via Elena's patches), we could potentially have at least the 'numa_node' information present and figure out whether a guest is using a PCIe device from the right socket.

So what I am wondering is:

1) Were there any plans for XEN_PCI_DEV_PXM in the hypervisor? Were there some prototypes for exporting the PCI device BDF and NUMA information out?

2) Would it be better to just look at making the initial domain able to figure out the NUMA topology and assign the correct 'numa_node' in the PCI fields?

3) If either option is used, would it be good to take that information into advisement when launching a guest with either 'cpus', 'numa-affinity', or 'pci', and inform the user of a better choice? Or would it be better if there were some diagnostic tool to at least tell the user whether their PCI device assignment made sense or not (see the sketch below)? Or perhaps program the 'numa-affinity' based on the PCIe socket location?
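To make 3) a bit more concrete, here is a rough sketch of such a diagnostic - this is not existing code; the tool name and the idea of passing the guest's intended node on the command line are just my assumptions. It only reads the 'numa_node' attribute mentioned above, so it can only give a useful answer once the initial domain actually sees the host NUMA topology:

    /*
     * numa-pci-check.c - rough sketch of the "diagnostic tool" idea in 3).
     *
     * Given a PCI BDF (as it appears under /sys/bus/pci/devices/) and the
     * NUMA node the guest's vCPUs are meant to run on, report whether the
     * device lives on that node.  Today dom0 reads numa_node as -1, so
     * this only becomes useful once dom0 sees the real NUMA topology.
     *
     * Build: gcc -Wall -o numa-pci-check numa-pci-check.c
     * Usage: ./numa-pci-check 0000:81:00.0 1
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        char path[256];
        FILE *f;
        int dev_node, guest_node;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <BDF> <guest-numa-node>\n", argv[0]);
            return 1;
        }
        guest_node = atoi(argv[2]);

        /* The numa_node attribute is the same one discussed above. */
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/numa_node", argv[1]);

        f = fopen(path, "r");
        if (!f || fscanf(f, "%d", &dev_node) != 1) {
            fprintf(stderr, "cannot read %s\n", path);
            return 1;
        }
        fclose(f);

        if (dev_node < 0)
            printf("%s: numa_node is %d - dom0 does not see the host "
                   "NUMA topology yet, cannot tell\n", argv[1], dev_node);
        else if (dev_node == guest_node)
            printf("%s is on node %d - matches the guest's node\n",
                   argv[1], dev_node);
        else
            printf("%s is on node %d but the guest is bound to node %d - "
                   "expect cross-socket traffic\n",
                   argv[1], dev_node, guest_node);

        return 0;
    }

One could imagine 'xl' or libxl doing something equivalent at 'pci' assignment time and warning the user, rather than relying on a separate binary.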