Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
On Fri, Jul 24, 2015 at 5:09 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Fri, Jul 24, 2015 at 05:58:29PM +0200, Dario Faggioli wrote:
>> On Fri, 2015-07-24 at 17:24 +0200, Juergen Gross wrote:
>> > On 07/24/2015 05:14 PM, Juergen Gross wrote:
>> > > On 07/24/2015 04:44 PM, Dario Faggioli wrote:
>> >
>> > >> In fact, I think that it is the topology, i.e., what comes from
>> > >> MSRs, that needs to adapt, and follow vNUMA, as much as possible.
>> > >> Do we agree on this?
>> > >
>> > > I think we have to be very careful here. I see two possible
>> > > scenarios:
>> > >
>> > > 1) The vcpus are not pinned 1:1 on physical cpus. The hypervisor
>> > >    will try to schedule the vcpus according to their numa affinity,
>> > >    so they can change pcpus at any time in the case of very busy
>> > >    guests. I don't think the linux kernel should treat the cpus
>> > >    differently in this case, as doing so would be in vain given the
>> > >    Xen scheduler's activity. So we should use the "null" topology
>> > >    in this case.
>> >
>> > Sorry, the topology should reflect the vcpu<->numa-node relations, of
>> > course, but nothing else (so a flat topology in each numa node).
>> >
>> Yeah, I was replying to this point saying something like this right
>> now... Luckily, I've seen this email! :-P
>>
>> With this semantics, I fully agree.
>>
>> > > 2) The vcpus of the guest are all pinned 1:1 to physical cpus. The
>> > >    Xen scheduler can't move vcpus between pcpus, so the linux
>> > >    kernel should see the real topology of the used pcpus in order
>> > >    to optimize for this picture.
>> > >
>> Mmm... I did think about this too, but I'm not sure. I see the value
>> of this, of course, and the reason why it makes sense. However,
>> pinning can change on-line, via `xl vcpu-pin' and the like. Also,
>> migration could make things less certain, I think. What happens if we
>> build on top of the initial pinning, and then things change?
>>
>> To be fair, there is stuff building on top of the initial pinning
>> already; e.g., which physical NUMA node we allocate the memory from
>> depends exactly on that. That being said, I'm not sure I'm comfortable
>> with adding more of this...
>>
>> Perhaps introduce an 'immutable_pinning' flag, which would prevent the
>> affinity from being changed, and then bind the topology to the pinning
>> only if that flag is set?
>>
>> > >> Maybe there is room for "fixing" this at this level, hooking up
>> > >> inside the scheduler code... but I'm shooting in the dark here,
>> > >> without having checked whether and how this could really be
>> > >> feasible.
>> > >
>> > > Uuh, I don't think a change of the scheduler on behalf of Xen would
>> > > really be appreciated. :-)
>> > >
>> I'm sure it would (have been! :-)) a true and giant nightmare!! :-D
>>
>> > >> One thing I don't like about this approach is that it would
>> > >> potentially solve vNUMA and other scheduling anomalies, but...
>> > >>
>> > >>> The cpuid instruction is available for user mode as well.
>> > >>>
>> > >> ...it would not do any good for other subsystems, and user level
>> > >> code and apps.
>> > >
>> > > Indeed. I think the optimal solution would be two-fold: give the
>> > > scheduler the information it needs to react correctly via a kernel
>> > > patch not relying on cpuid values, and fiddle with the cpuid values
>> > > from the xen tools according to the needs of other subsystems
>> > > and/or user code (e.g. licensing).
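As a purely illustrative sketch of that two-fold idea (the ops table and
all names below are invented for illustration; nothing like this interface
existed in Linux or Xen): the scheduler side would query topology through
a set of hooks instead of deriving it from CPUID, with the Xen PV backend
filling in answers from the vNUMA map, flat inside each node as in
scenario 1) above.

    /*
     * Hypothetical sketch only: a topology ops table the scheduler
     * could consult instead of reading CPUID directly.  Modeled here
     * as a standalone user-space program so it can be compiled and run.
     */
    #include <stdio.h>

    struct cpu_topology_ops {
        int (*node_of_cpu)(int cpu);     /* NUMA node a (v)cpu belongs to */
        int (*sibling_of_cpu)(int cpu);  /* -1: no siblings, flat topology */
    };

    /* Xen PV backend: vcpu<->node map as handed over via vNUMA
     * (example layout: 4 vcpus across 2 virtual nodes). */
    static int vnuma_node[] = { 0, 0, 1, 1 };

    static int xen_node_of_cpu(int cpu)    { return vnuma_node[cpu]; }
    static int xen_sibling_of_cpu(int cpu) { (void)cpu; return -1; }

    static struct cpu_topology_ops xen_pv_topology_ops = {
        .node_of_cpu    = xen_node_of_cpu,
        .sibling_of_cpu = xen_sibling_of_cpu,
    };

    int main(void)
    {
        /* The scheduler side only ever goes through the ops table; a
         * native backend would fill the same hooks from CPUID/MSRs. */
        struct cpu_topology_ops *ops = &xen_pv_topology_ops;
        int cpu;

        for (cpu = 0; cpu < 4; cpu++)
            printf("vcpu%d: node %d, sibling %d\n", cpu,
                   ops->node_of_cpu(cpu), ops->sibling_of_cpu(cpu));
        return 0;
    }

The point of such a split is that the kernel scheduler would get
consistent vcpu<->node information regardless of what the (possibly
licensing-driven) CPUID values report to user space.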
>> >
>> So, just to check whether my understanding is correct: you'd like to
>> add an abstraction layer, in Linux, in generic (or, perhaps,
>> scheduling) code, to hide the direct interaction with CPUID.
>> Such a layer, on baremetal, would just read CPUID while, on PV-ops,
>> it would check with Xen/match vNUMA/whatever... Is this what you are
>> saying?
>>
>> If yes, I think I like it...
>
> I don't think this is workable. For example, there are applications
> which use 'cpuid' to figure out the core/thread layout and use it for
> their own scheduling purposes.

Can you expand a little on this? I'm having trouble figuring out exactly
what user-space applications are reading and how they're using it --
and how they work currently in virtual environments, given that they
(typically) will be moved between physical processors even if they stay
on the same virtual processor.

 -George
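For reference, a minimal user-space sketch of the sort of thing such
applications do (illustrative only, not taken from any particular
application): CPUID leaf 0xB, the x2APIC topology enumeration leaf,
reports per logical CPU the bit shifts needed to split its x2APIC ID
into thread, core, and package components.

    /*
     * Illustrative only: derive thread/core/package position from
     * CPUID leaf 0xB, as a topology-aware application might.
     * Build with: gcc -o topo topo.c
     */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        unsigned int smt_shift, pkg_shift, id;

        /* Make sure leaf 0xB exists at all. */
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 0xb) {
            fprintf(stderr, "CPUID leaf 0xB not supported\n");
            return 1;
        }

        /* Sub-leaf 0 (SMT level): EAX[4:0] is the shift that strips
         * the thread bits from the x2APIC ID; EDX is this logical
         * CPU's x2APIC ID. */
        __cpuid_count(0xb, 0, eax, ebx, ecx, edx);
        smt_shift = eax & 0x1f;
        id = edx;

        /* Sub-leaf 1: level type in ECX[15:8] should be 2 ("core");
         * then EAX[4:0] strips thread+core bits. */
        __cpuid_count(0xb, 1, eax, ebx, ecx, edx);
        pkg_shift = (((ecx >> 8) & 0xff) == 2) ? (eax & 0x1f) : smt_shift;

        printf("x2APIC ID %u: thread %u, core %u, package %u\n", id,
               id & ((1u << smt_shift) - 1),
               (id >> smt_shift) & ((1u << (pkg_shift - smt_shift)) - 1),
               id >> pkg_shift);
        return 0;
    }

Note that on a PV guest these values come straight from whatever pcpu
the vcpu happens to be running on, which is exactly why a result
computed this way can be stale or misleading as soon as the Xen
scheduler moves the vcpu.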
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel