Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
On 07/24/2015 04:44 PM, Dario Faggioli wrote:
> On Fri, 2015-07-24 at 12:28 +0200, Juergen Gross wrote:
>> On 07/23/2015 04:07 PM, Dario Faggioli wrote:
>>> FWIW, I was thinking that the kernel were a better place, as
>>> Juergen is saying, while now I'm more convinced that tools would
>>> be more appropriate, as Boris is saying.
>>
>> I've collected some information from the linux kernel sources as a
>> base for the discussion:
>
> That's great, thanks for this!
>
>> The complete NUMA information (cpu->node and memory->node
>> relations) is taken from the ACPI tables (SRAT, and SLIT for
>> "distances").
>
> Ok. And I already have a question (as I lost track of things a
> bit). What you just said about ACPI tables is certainly true for
> baremetal and HVM guests, but what about PV? At the time I was
> looking into it, together with Elena, there were Linux patches
> being produced for the PV case, which makes sense. However, ISTR
> that both Wei and Elena mentioned recently that those patches have
> not been upstreamed in Linux yet... Is that the case? Maybe not
> all, but at least some of them are there? Because if not, I'm not
> sure I see how a PV guest would even see a vNUMA topology (which it
> does). Of course, I can go and check, but since you just looked,
> you may have it fresh and clear already. :-)

I checked "bottom up", so when I found the ACPI scan stuff I stopped
searching for how the kernel obtains its NUMA information. During my
search I found no clue of any PV-NUMA stuff in the kernel, and a
quick "grep -i numa" in arch/x86/xen and drivers/xen didn't reveal
anything. The same goes for a search of the complete kernel sources
for "vnuma".
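For anyone who wants to double check from inside a guest, here is a
minimal user-space sketch (plain C over the standard Linux sysfs files
under /sys/devices/system/node/, nothing Xen- or vNUMA-specific) that
dumps the cpu->node mapping and node distances the guest kernel
actually ended up with:

/*
 * Minimal illustration: dump the NUMA topology the running kernel
 * exposes via sysfs (cpu->node mapping and node distances).  Uses
 * only standard Linux sysfs paths; nothing Xen-specific.
 */
#include <stdio.h>
#include <unistd.h>

static void print_line(const char *label, const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");

    if (!f)
        return;
    if (fgets(buf, sizeof(buf), f))
        printf("  %-10s %s", label, buf);   /* sysfs lines end in '\n' */
    fclose(f);
}

int main(void)
{
    char path[128];
    int node;

    for (node = 0; node < 64; node++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d", node);
        if (access(path, F_OK) != 0)
            continue;               /* node IDs need not be contiguous */
        printf("node%d\n", node);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/cpulist", node);
        print_line("cpus:", path);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/distance", node);
        print_line("distances:", path);
    }
    return 0;
}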
I think we have to be very careful here. I see two possible scenarios:

1) The vcpus are not pinned 1:1 to physical cpus. The hypervisor will
   try to schedule the vcpus according to their NUMA affinity, so they
   can change pcpus at any time in the case of very busy guests. I
   don't think the Linux kernel should treat the cpus differently in
   this case, as doing so would be in vain given the Xen scheduler's
   activity. So we should use the "null" topology here.

2) The vcpus of the guest are all pinned 1:1 to physical cpus. The Xen
   scheduler can't move vcpus between pcpus, so the Linux kernel
   should see the real topology of the used pcpus in order to optimize
   for this picture.

This only covers the scheduling aspect, of course.
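Which of the two pictures the guest kernel actually bases its
scheduling domains on can be seen from the sibling masks it derived;
a purely illustrative sketch over the standard sysfs topology files
(with a "null" topology every cpu should list only itself, with the
real 1:1-pinned topology the thread/core siblings show up grouped):

/*
 * Purely illustrative: dump the sibling masks the guest kernel
 * derived from the topology it was shown.  Standard Linux sysfs
 * paths; nothing Xen-specific.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *files[] = { "thread_siblings_list", "core_siblings_list" };
    char path[160], buf[256];
    FILE *f;
    int cpu;

    for (cpu = 0; cpu < 4096; cpu++) {
        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d", cpu);
        if (access(path, F_OK) != 0)
            break;

        for (int i = 0; i < 2; i++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/%s",
                     cpu, files[i]);
            f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(buf, sizeof(buf), f))
                printf("cpu%d %s: %s", cpu, files[i], buf);
            fclose(f);
        }
    }
    return 0;
}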
Uuh, I don't think a change of the scheduler on behalf of Xen is
really appreciated. :-) I'd rather fiddle with the cpu masks on the
different levels to let the scheduler do the right thing.

One thing I don't like about this approach is that it would
potentially solve vNUMA and other scheduling anomalies, but...

    cpuid instruction is available for user mode as well.

...it would not do any good for other subsystems, and user level code
and apps.

Indeed. I think the optimal solution would be two-fold: give the
scheduler the information it needs to react correctly via a kernel
patch that does not rely on cpuid values, and fiddle with the cpuid
values from the Xen tools according to the needs of other subsystems
and/or user code (e.g. licensing).
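To make the user-mode point concrete, here is a small sketch (x86-64
only, purely illustrative) that dumps cpuid leaf 0xb, the extended
topology enumeration: whatever values the tools decide to expose
there are exactly what user-level code and licensing checks will base
their decisions on.

/*
 * Illustration only: the cpuid instruction can be executed directly
 * from user mode, so applications see whatever topology the tools
 * expose in the cpuid leaves.  Dumps leaf 0xb (extended topology
 * enumeration); assumes an x86-64 build.
 */
#include <stdio.h>
#include <stdint.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf, uint32_t *a,
                        uint32_t *b, uint32_t *c, uint32_t *d)
{
    __asm__ volatile("cpuid"
                     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                     : "a" (leaf), "c" (subleaf));
}

int main(void)
{
    uint32_t eax, ebx, ecx, edx, subleaf;

    for (subleaf = 0; subleaf < 8; subleaf++) {
        cpuid_count(0xb, subleaf, &eax, &ebx, &ecx, &edx);

        /* A level type of 0 terminates the enumeration. */
        if (((ecx >> 8) & 0xff) == 0)
            break;

        printf("level %u: type %u (1=SMT, 2=core), "
               "logical cpus at this level: %u, x2APIC id: %u\n",
               subleaf, (ecx >> 8) & 0xff, ebx & 0xffff, edx);
    }
    return 0;
}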
Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel