Xen project Mailing List

Re: [Xen-devel] DESIGN v2: CPUID part 3

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

From: Joao Martins <joao.m.martins@xxxxxxxxxx>

Date: Wed, 2 Aug 2017 11:34:05 +0100

Cc: Dario Faggioli <dario.faggioli@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Wed, 02 Aug 2017 10:34:32 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
>> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>>> On 05/07/17 10:46, Joao Martins wrote:
>>>>> Hey Andrew,
>>>>>
>>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>>
>>>>>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_
>>>>>> perhaps?)
>>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>>> Xen's auditing shall ensure that guests observe values consistent with the
>>>>>> guarantees made by the vendor manuals.
>>>>>>
>>>>> Why choose max_vcpus domctl?
>>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>>> for vcpus to exist but no topology information to have been provided.
>>>>
>>> /nods
>>>
>>> So then doing this at vcpus allocation we would need to pass an additional CPU
>>> topology argument on the max_vcpus hypercall? Otherwise it's sort of guess work
>>> wrt sockets, cores, threads ... no?
>> Andrew, thoughts on this and the one below?
>
> Urgh sorry.  I've been distracted with some high priority interrupts (of
> the non-maskable variety).
>
> So, bad news is that the CPUID and MSR policy handling has become
> substantially more complicated and entwined than I had first planned.  A
> change in either of the data alters the auditing of the other, so I am
> leaning towards implementing everything with a single set hypercall (as
> this is the only way to get a plausibly-consistent set of data).
>
> The good news is that I don't think we actually need any changes to the
> XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
> the static cpuid policy to work.
>
Awesome!

>>> There could be other uses too on passing this info to Xen, say e.g. the
>>> scheduler knowing the guest CPU topology it would allow better selection of
>>> core+sibling pair such that it could match cache/cpu topology passed on the
>>> guest (for unpinned SMT guests).
>
> I remain to be convinced (i.e. with some real performance numbers) that
> the added complexity in the scheduler for that logic is a benefit in the
> general case.
>
The suggestion above was a simple extension to struct domain (e.g. cores/threads
or a struct cpu_topology field) - nothing too disruptive I think. But I cannot
really argue on this, as it was just an idea that I found interesting (no
numbers to support it entirely). We just happened to see it under-perform when
a simple range of cpus was used for affinity, and some vcpus ended up being
scheduled on the same core+sibling pair IIRC; hence I (perhaps naively)
imagined that there could be value in further scheduler enlightenment, e.g.
"gang-scheduling" where we always schedule core+sibling together. I was speaking
to Dario (CC'ed) at the summit about whether CPU topology could have value - and
there might be, but it remains to be explored once we're able to pass a cpu
topology to the guest. (In the past he seemed enthusiastic about the idea of the
topology[0], and hence I assumed that to be in the context of schedulers.)

[0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html

> In practice, customers are either running very specific and dedicated
> workloads (at which point pinning is used and there is no
> oversubscription, and exposing the actual SMT topology is a good thing),
>
/nods

> or customers are running general workloads with no pinning (or perhaps
> cpupool-numa-split) with a moderate amount of oversubscription (at which
> point exposing SMT is a bad move).
>
Given the scale you folks invest in over-subscription (1000 VMs), I wonder what
moderate here means :P

> Counterintuitively, exposing NUMA in general oversubscribed scenarios is
> terrible for net system performance.  What happens in practice is that
> VMs which see NUMA spend their idle cycles trying to balance their own
> userspace processes, rather than yielding to the hypervisor so another
> guest can get a go.
>
Interesting to know - vNUMA is perhaps only better placed for performance cases
where both (or either) I/O topology and memory locality matter - or when going
for bigger guests. Provided that the corresponding CPU topology is provided.

Joao

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
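
For readers following the thread, the remark that topology selection "primarily affects the virtual APIC ID layout" can be illustrated with a rough sketch. It assumes the power-of-two field widths that the vendor manuals describe for topology enumeration; `struct guest_topology`, `field_width()`, `vcpu_to_apic_id()` and `same_core()` are hypothetical names made up for this illustration and are not Xen interfaces or the design under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a guest topology; not a Xen structure. */
struct guest_topology {
    unsigned int threads_per_core;
    unsigned int cores_per_socket;
};

/* Bits needed to encode the values 0..count-1 (0 when count <= 1). */
static unsigned int field_width(unsigned int count)
{
    unsigned int bits = 0;

    while (bits < 31 && (1u << bits) < count)
        bits++;
    return bits;
}

/*
 * Build an APIC ID as socket | core | thread bit-fields, each field
 * rounded up to a power of two.  Assumes vcpus are numbered thread-first.
 */
static uint32_t vcpu_to_apic_id(const struct guest_topology *t,
                                unsigned int vcpu)
{
    unsigned int smt_bits, core_bits, thread, core, socket;

    assert(t->threads_per_core > 0 && t->cores_per_socket > 0);

    smt_bits = field_width(t->threads_per_core);
    core_bits = field_width(t->cores_per_socket);
    thread = vcpu % t->threads_per_core;
    core = (vcpu / t->threads_per_core) % t->cores_per_socket;
    socket = vcpu / (t->threads_per_core * t->cores_per_socket);

    return (socket << (core_bits + smt_bits)) | (core << smt_bits) | thread;
}

/* Two APIC IDs are SMT siblings when everything above the thread field matches. */
static int same_core(const struct guest_topology *t, uint32_t a, uint32_t b)
{
    unsigned int smt_bits = field_width(t->threads_per_core);

    return (a >> smt_bits) == (b >> smt_bits);
}
```

With 2 threads per core and 3 cores per socket, vcpus 0 to 5 map to APIC IDs 0 to 5, while vcpu 6 (the first thread of the second socket) maps to APIC ID 8, because the core field is rounded up to two bits. A sibling check like `same_core()` is the kind of information a topology-aware scheduler would need, under the same assumptions.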
