
[Xen-devel] [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation



Add some rationale and usage documentation for the new automatic
NUMA placement feature of xl.

TODO: * Decide whether we want to have things like "Future Steps/Roadmap"
        and/or "Performances/Benchmarks Results" here as well.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.txt
@@ -0,0 +1,205 @@
+               -------------------------------------
+               NUMA Guest Placement Design and Usage
+               -------------------------------------
+
+Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
+example, each domain has its own "node affinity", i.e., the set of NUMA nodes
+of the host from which memory for that domain is allocated. That becomes
+very important as soon as many domains start running memory-intensive
+workloads on a shared host. In fact, accessing non node-local memory
+locations costs much more than accessing node-local ones, to the point
+that the performance degradation is likely to be noticeable.
+
+It is then quite obvious that any mechanism that enables most of the
+memory accesses of most of the guest domains to stay local is very
+important to have when dealing with NUMA platforms.
+
+
+Node Affinity and CPU Affinity
+------------------------------
+
+Besides the node affinity discussed here, there is another very popular
+'affinity', namely '(v)cpu affinity'. The two are different, but related.
+In both the Xen and Linux worlds, 'cpu affinity' is the set of CPUs a
+domain (a task, when talking about Linux) can be scheduled on. This may
+seem to have little to do with memory accesses, but it does: the CPU a
+domain runs on is also where it accesses its memory from, i.e., it is one
+half of what decides whether a memory access is remote or local --- the
+other half being where the location it wants to access is stored.
+
+Of course, if a domain is known to only run on a subset of the physical
+CPUs of the host, it is very easy to turn all its memory accesses into
+local ones, by just constructing its node affinity (in Xen) based on
+the nodes those CPUs belong to. In fact, that is exactly what the
+hypervisor does by default, as soon as it finds out that a domain (or
+better, the vcpus of a domain, but let's not get into too much detail
+here) has a cpu affinity.
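+
+For instance, on a (hypothetical) host where CPUs 0-3 all belong to NUMA
+node 0, creating a domain with the config fragment below would result in
+Xen automatically setting its node affinity to node 0, so that all its
+memory is allocated (and accessed) locally:
+
+        # hypothetical example: CPUs 0-3 are assumed to all be in node 0
+        memory = 1024
+        cpus = "0-3"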
+
+This works quite well, but it requires the user/system administrator to
+explicitly specify that property --- the cpu affinity --- while the
+domain is being created, or Xen won't be able to exploit it to ensure
+access locality.
+
+On the other hand, as node affinity directly affects where a domain's
+memory lives, it makes a lot of sense for it to be involved in scheduling
+decisions too: ideally, the hypervisor would manage to schedule all the
+vcpus of all the domains on CPUs attached to the respective domains' local
+memory. That is why the node affinity of a domain is treated by the
+scheduler as the set of nodes on which it would be preferable to run it,
+although not at the cost of violating the behavior and invariants of the
+scheduling algorithm. This means Xen will check whether a vcpu of a domain
+can run on one of the CPUs belonging to the nodes of the domain's node
+affinity, but it will rather run it somewhere else --- even on another,
+remote, CPU --- than violate the priority ordering (e.g., by kicking out
+another running vcpu with higher priority) it is designed to enforce.
+
+Last but not least, what if a domain has both vcpu and node affinity, and
+they only partially match or do not match at all (to understand how that
+can happen, see the following sections)? In that case, all the domain's
+memory will be allocated according to its node affinity, while scheduling
+will happen according to its vcpu affinity. This means it is easy enough
+to construct optimal, sub-optimal, neutral and even bad and awful
+configurations (which can be useful, e.g., for benchmarking purposes).
+The remainder of this document explains how to do so.
+
+
+Specifying Node Affinity
+------------------------
+
+Besides being automatically computed by Xen from the vcpu affinity of a
+domain (or from its membership in a cpupool), it might make sense for the
+user to specify the node affinity of their domains by hand, in the config
+files, as another form of partitioning the host resources. This is where
+the "nodes" option of the xl config file becomes useful. In fact,
+specifying something like the following
+
+        nodes = [ '0', '1', '3', '4' ]
+
+in a domain configuration file would result in Xen assigning host NUMA
+nodes 0, 1, 3 and 4 to the domain's node affinity, regardless of any vcpu
+affinity setting for the same domain. The idea is that, yes, the two things
+are related, and if only one is present it makes sense to infer the other
+from it, but it is always possible to explicitly specify both of them,
+independently of how good or bad the resulting combination may be.
+
+Therefore, this is what one should expect when using "nodes", perhaps in
+conjunction with "cpus" in a domain configuration file:
+
+ * `cpus = "0, 1"` and no `nodes=` at all
+   (i.e., only vcpu affinity specified):
+     domain's vcpus can and will run only on host CPUs 0 and 1. Also, as
+     the domain's node affinity will be computed by Xen and set to whatever
+     nodes host CPUs 0 and 1 belong to, all the domain's memory accesses
+     will be local;
+
+ * `nodes = [ '0', '1' ]` and no `cpus=` at all
+   (i.e., only node affinity present):
+     domain's vcpus can run on any of the host CPUs, but the scheduler (at
+     least if credit is used, as it is the only scheduler supporting this
+     right now) will try running them on the CPUs that are part of host
+     NUMA nodes 0 and 1. Memory-wise, all the domain's memory will be
+     allocated on host NUMA nodes 0 and 1. This means most of the memory
+     accesses of the domain should be local, but that will depend on the
+     runtime load, behavior and actual scheduling of both the domain in
+     question and all the other domains on the same host;
+
+ * `nodes = [ '0', '1' ]` and `cpus = "0"`, with CPU 0 within node 0
+   (i.e., cpu affinity a subset of node affinity):
+     domain's vcpus can and will run only on host CPU 0. As node affinity
+     is explicitly set to host NUMA nodes 0 and 1 --- which include the
+     node CPU 0 belongs to --- all the memory accesses of the domain will
+     be local;
+
+ * `nodes = [ '0', '1' ]` and `cpus = "0, 4"`, with CPU 0 in node 0 but
+   CPU 4 in, say, node 2 (i.e., cpu affinity only partially overlapping
+   node affinity):
+     domain's vcpus can run on host CPUs 0 and 4, with CPU 4 not being
+     within the node affinity (explicitly set to host NUMA nodes 0 and 1).
+     The (credit) scheduler will try to keep memory accesses local by
+     scheduling the domain's vcpus on CPU 0, but it may not achieve 100%
+     success;
+
+ * `nodes = [ '0', '1' ]` and `cpus = "4"`, with CPU 4 within, say, node 2
+   (i.e., cpu affinity disjoint from node affinity):
+     domain's vcpus can and will run only on host CPU 4, i.e., completely
+     "outside" of the chosen node affinity. That necessarily means all the
+     domain's memory accesses will be remote.
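+
+For instance, here is a (purely hypothetical) config file fragment matching
+the third case above, assuming CPU 0 belongs to node 0 on the host at hand.
+After creating the domain, something like `xl vcpu-list <domain>` can be
+used to check on which host CPUs its vcpus are actually running:
+
+        # hypothetical example: CPU 0 is assumed to be in node 0
+        name = "numa-test"
+        memory = 2048
+        cpus = "0"
+        nodes = [ '0', '1' ]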
+
+
+Automatic NUMA Placement
+------------------------
+
+In case one does not want to bear the burden of manually specifying all
+the node (and, perhaps, CPU) affinities for all their domains, xl implements
+some automatic placement logic. This basically means the user can ask the
+toolstack to try sorting things out in the best possible way on their behalf.
+This replaces specifying a domain's node affinity manually and can be
+paired, or not, with any vcpu affinity (if it is, the relationship between
+vcpu and node affinities stays as stated above). To serve this purpose,
+a new domain config switch has been introduced, namely the "nodes_policy"
+option. As the name suggests, it allows specifying the policy to be used
+while attempting automatic placement of the new domain. Available policies
+at the time of writing are:
+
+ * "auto": automatic placement by means of a not better specified (xl
+           implementation dependant) algorithm. It is basically for those
+           who do want automatic placement, but have no idea what policy
+           or algorithm would be better... <<Just give me a sane default!>>
+
+ * "ffit": automatic placement via the First Fit algorithm, applied checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
+
+ * "bfit": automatic placement via the Best Fit algorithm, applied checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
+
+ * "wfit": automatic placement via the Worst Fit algorithm, applied checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
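+
+In its simplest form, asking for automatic placement with a specific policy
+is just a matter of adding one line to the domain config file, e.g.:
+
+        nodes_policy = "ffit"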
+
+The various algorithms have been implemented because they offer different
+behavior and performance characteristics (for different metrics). For
+instance, First Fit is known to be efficient and quick, and it generally
+works better than Best Fit with respect to memory fragmentation, although
+it tends to occupy "early" nodes more than "late" ones. On the other hand,
+Best Fit aims at optimizing memory usage, although it introduces quite a
+bit of fragmentation, by leaving behind a large number of small free memory
+areas. Finally, the idea behind Worst Fit is that it leaves free memory
+chunks big enough to limit the amount of fragmentation, but it (just like
+Best Fit) is more expensive in terms of execution time, as it needs the
+"list" of free memory areas to be kept sorted.
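+
+To make the differences more concrete, here is a small, purely illustrative
+sketch (in C, and definitely not the actual xl code) of how the three
+policies could pick a single host NUMA node for a domain, given the amount
+of free memory on each node (ignoring, for simplicity, the multi-node
+placement described below):
+
+    /* Illustrative only: NOT the real xl implementation. */
+    #include <stdio.h>
+
+    #define NR_NODES 4
+
+    /* Return the index of the chosen node, or -1 if the domain fits nowhere. */
+    static int pick_node(const unsigned long free_mb[], int nr_nodes,
+                         unsigned long dom_mb, char policy)
+    {
+        int i, best = -1;
+
+        for (i = 0; i < nr_nodes; i++) {
+            if (free_mb[i] < dom_mb)
+                continue;          /* the domain does not fit on this node */
+            if (policy == 'f')     /* First Fit: first node that fits */
+                return i;
+            if (best == -1 ||
+                (policy == 'b' && free_mb[i] < free_mb[best]) || /* Best Fit  */
+                (policy == 'w' && free_mb[i] > free_mb[best]))   /* Worst Fit */
+                best = i;
+        }
+        return best;
+    }
+
+    int main(void)
+    {
+        /* hypothetical free memory (MB) on each host node */
+        unsigned long free_mb[NR_NODES] = { 4096, 1024, 2048, 8192 };
+        unsigned long dom_mb = 1000;   /* memory requirement of the domain */
+
+        printf("ffit -> node %d\n", pick_node(free_mb, NR_NODES, dom_mb, 'f'));
+        printf("bfit -> node %d\n", pick_node(free_mb, NR_NODES, dom_mb, 'b'));
+        printf("wfit -> node %d\n", pick_node(free_mb, NR_NODES, dom_mb, 'w'));
+        return 0;
+    }
+
+With these made-up numbers, First Fit picks node 0, Best Fit picks node 1
+(the tightest fit) and Worst Fit picks node 3 (the one with the most free
+memory).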
+ 
+Therefore, automatic placement is achieved by properly using the "nodes"
+and "nodes_policy" configuration options, as follows:
+
+ * `nodes="auto"` or `nodes_policy="auto"`:
+     xl will try fitting the domain on the host NUMA nodes by using its
+     own default placement algorithm, with default parameters. Most likely,
+     all nodes will be considered suitable for the domain (unless a vcpu
+     affinity is specified; see the last entry of this list);
+
+ * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`) and no `nodes=` at all:
+     xl will try fitting the domain on the host NUMA nodes by using the
+     requested policy. All nodes will be considered suitable for the
+     domain, and consecutive fitting attempts will be performed while
+     increasing the number of nodes on which to put the domain itself
+     (unless a vcpu affinity is specified; see the last entry of this list);
+
+ * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `nodes=2`:
+     xl will try fitting the domain on the host NUMA nodes by using the
+     requested policy and only the number of nodes specified in `nodes=`
+     (2 in this example). All the nodes will be considered suitable for
+     the domain, and consecutive attempts will be performed while
+     increasing such a value;
+
+ * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `cpus="0-6"`:
+     xl will try fitting the domain on the host NUMA nodes to which the CPUs
+     specified as vcpu affinity (0 to 6 in this example) belong, by using the
+     requested policy. In case that fails, consecutive fitting attempts will
+     be performed with both a reduced (first) and an increased (then) number
+     of nodes.
+
+Other usage patterns --- like specifying both a policy and a list of
+nodes --- are accepted, but do not make much sense after all. Therefore,
+although xl will try its best to interpret the user's will, the resulting
+behavior is somewhat unspecified.
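+
+As a concrete example of the third usage pattern above, a (hypothetical)
+config fragment like the following asks xl to automatically place the
+domain on 2 host NUMA nodes, using the First Fit policy and leaving vcpu
+affinity alone:
+
+        # hypothetical example: automatic placement on 2 nodes via First Fit
+        memory = 4096
+        nodes = 2
+        nodes_policy = "ffit"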
