RE: [Xen-devel] RE: Live migration fails due to c/s 20627
Since this discussion seems to be going in circles, I suspect we may have some fundamentally different assumptions. You likely have some unstated ideas, maybe about the underlying implementation of the Linux NUMA syscalls when running on Xen, or maybe about defaults for how NUMA-ness might be specified when creating an HVM domain. But all of these are mostly unrelated to rdtscp.

The only reason this discussion has involved NUMA concepts at all is that the rdtscp instruction, by accident rather than by design, may, on some (but not all) guest OSes, communicate the guest OS's notion of cpu and node to an application. As Jeremy has pointed out, this cpu/node information is exactly the same information that can be obtained by a system call. So the only reason rdtscp would be better than the system call is performance. Rdtscp is faster than a system call in many situations, but it is now often emulated in Xen (even on processors that do support the hardware instruction*), so it cannot be assumed to be much faster than a system call. And the difference in performance is only measurable if an app executes rdtscp many thousands of times every second.

Are there apps that execute rdtscp many thousands of times every second PRIMARILY TO OBTAIN the cpu/node information? If so, I agree that it is unfortunately necessary to expose the rdtscp instruction. If not, I would strongly recommend we do NOT expose it now. Otherwise, to use Keir's words, we are "Supporting CPU instructions just because they're there [which] is not a useful effort." Once rdtscp/TSC_AUX is exposed to guests, it is very hard to remove it again (saved guests may have tested the cpuid bit once at startup and will fail if restored).

Other brief NUMA-related replies below.

* See xen-unstable.hg/docs/misc/tscmode.txt for an explanation.

> From: Zhang, Xiantao [mailto:xiantao.zhang@xxxxxxxxx]
> Dan Magenheimer wrote:
> >>> . And, as I've said before,
> >>> the node/cpu info provided by Linux in TSC_AUX is
> >>> wrong anyway (except in very constrained environments
> >>> such as where the admin has pinned vcpus to pcpus).
> >>
> >> I don't agree with you at this point. For guest numa support,
> >> it should be a must to pin virtual node's vcpus to its
> >> related physical node and crossing-node vcpu migration should
> >> be disallowed by default, otherwise guest numa support is
> >> meaningless, right ?
> >
> > It's not a must. A system administrator should always
> > have the option of choosing flexibility vs performance.
> > I agree that when performance is higher priority, pinning
> > is a must, but pinning may even have issues when the
> > guest's nvcpus exceeds the number of cores in a node.
>
> Could you elaborate the issues you can see? Normally, a
> virtual node's number of vcpus should be less than one
> physical node's cpu number. But even if the vcpu count exceeds
> the number of physical cpus in a node, why would that introduce issues?

Suppose a guest believes it has eight cores on a single processor/node. It is now started on a machine that has four cores per processor/node (and two or more sockets). Since the guest believes it is running on a single node, it communicates that information (via TSC_AUX or vgetcpu) to an application. The application is NUMA-aware, but since the guest OS told it that all cores are on the same node, it doesn't use its NUMA code/mode.
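For concreteness, here is a minimal illustrative sketch of the two paths being compared: reading TSC_AUX with rdtscp versus asking the kernel via getcpu(2). It assumes Linux's convention of packing the cpu number into the low 12 bits of TSC_AUX and the node number into the bits above them; other guest OSes may put nothing meaningful there, and under Xen the rdtscp itself may be trapped and emulated (see tscmode.txt).

/* Sketch only: two ways an app can learn "which cpu/node am I on?".
 * Assumes Linux's TSC_AUX layout (cpu in bits 0-11, node above);
 * other guest OSes may store nothing meaningful in TSC_AUX, and
 * under Xen rdtscp may be emulated rather than run natively.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

static uint64_t tsc_cpu_node(unsigned *cpu, unsigned *node)
{
    uint32_t lo, hi, aux;

    /* rdtscp returns the TSC in edx:eax and TSC_AUX in ecx. */
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    *cpu  = aux & 0xfff;
    *node = aux >> 12;
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    unsigned cpu, node, scpu = 0, snode = 0;
    uint64_t tsc = tsc_cpu_node(&cpu, &node);

    /* The same cpu/node pair obtained from the kernel via getcpu(2). */
    syscall(SYS_getcpu, &scpu, &snode, NULL);

    printf("rdtscp: tsc=%llu cpu=%u node=%u\n",
           (unsigned long long)tsc, cpu, node);
    printf("getcpu: cpu=%u node=%u\n", scpu, snode);
    return 0;
}

Note also that if the RDTSCP cpuid bit is not present (or vanishes across a save/restore), the rdtscp path above simply takes a #UD fault, which is exactly why it is so hard to withdraw the feature once guests have started relying on it.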
Suppose a guest believes it has a total of four cores, two cores on each of two nodes. It is now started on some future machine with 16 cores all on a single node. Since the guest believes it is running on two nodes, it communicates that information (via TSC_AUX or vgetcpu) to an application. The application is NUMA-aware, and the guest OS told it that there are two nodes. This app has very high memory bandwidth needs, so it spends lots of time making NUMA-related syscalls such as Linux move_pages to ensure that its memory is on the same node as the cpu. All of these move calls are wasted.

Both of these situations are very possible in a cloud environment.

(NOTE: Since this NUMA-related discussion is orthogonal to rdtscp, we should probably start a separate thread for further discussion.)

If the above discussion doesn't clarify my concerns and I haven't answered other questions in your email, please let me know.
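To make the "wasted calls" concrete, here is a hypothetical sketch of that pattern: the program asks the guest OS which node it is currently running on, then uses move_pages(2) to migrate a buffer onto that node (link with -lnuma). If the node number the guest reports bears no relation to the physical topology, every one of these migrations is wasted work.

/* Hypothetical sketch of the "keep memory near the cpu" pattern. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    enum { NPAGES = 16 };
    void *buf, *pages[NPAGES];
    int nodes[NPAGES], status[NPAGES];
    unsigned cpu = 0, node = 0;

    if (posix_memalign(&buf, page_size, NPAGES * page_size))
        return 1;
    memset(buf, 0, NPAGES * page_size);       /* fault the pages in */

    /* Which node does the guest OS claim we are running on? */
    syscall(SYS_getcpu, &cpu, &node, NULL);

    for (int i = 0; i < NPAGES; i++) {
        pages[i] = (char *)buf + i * page_size;
        nodes[i] = node;                      /* target: "our" node */
    }

    /* Migrate the buffer to that node; wasted effort if the guest's
     * node numbering has no relation to the physical topology. */
    if (move_pages(0 /* self */, NPAGES, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");

    printf("asked to move %d pages to node %u (cpu %u)\n", NPAGES, node, cpu);
    free(buf);
    return 0;
}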
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel