[Xen-devel] [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation
Add some rationale and usage documentation for the new automatic NUMA
placement feature of xl.

TODO: * Decide whether we want to have things like "Future Steps/Roadmap"
        and/or "Performance/Benchmark Results" here as well.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.txt
@@ -0,0 +1,205 @@

        -------------------------------------
        NUMA Guest Placement Design and Usage
        -------------------------------------

Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
example, each domain has its own "node affinity", i.e., the set of NUMA
nodes of the host from which memory for that domain is allocated. This
becomes very important as soon as many domains start running memory-intensive
workloads on a shared host. In fact, accessing non node-local memory
locations costs much more than accessing node-local ones, to the point that
the degradation in performance is likely to be noticeable.

It is therefore quite obvious that any mechanism enabling most of the memory
accesses of most of the guest domains to stay local is very important to
have when dealing with NUMA platforms.


Node Affinity and CPU Affinity
------------------------------

There is another very popular 'affinity' besides the node affinity we are
discussing here: '(v)cpu affinity'. To make things more complicated, the two
are different but related things. In both the Xen and Linux worlds, 'cpu
affinity' is the set of CPUs a domain (a task, when talking about Linux) can
be scheduled on. This may seem to have little to do with memory accesses,
but it does: the CPU a domain runs on is also where it accesses its memory
from, i.e., it is one half of what decides whether a memory access is remote
or local --- the other half being where the accessed location is stored.

Of course, if a domain is known to run only on a subset of the physical
CPUs of the host, it is very easy to turn all its memory accesses into
local ones, by just constructing its node affinity (in Xen) based on the
nodes those CPUs belong to. Actually, that is exactly what the hypervisor
does by default, as soon as it finds out that a domain (or better, the vcpus
of a domain, but let's avoid going into too much detail here) has a cpu
affinity.

This works quite well, but it requires the user/system administrator to
explicitly specify that property --- the cpu affinity --- while the domain
is being created, or Xen will not be able to exploit it for ensuring
access locality.
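As a purely illustrative sketch (the name, kernel path and sizes below are
made up for the example), specifying only a cpu affinity at creation time is
all that is needed for Xen to compute a matching node affinity by itself:

    # Hypothetical guest config: vcpus pinned to host CPUs 0-3. No "nodes="
    # is given, so Xen derives the node affinity from the nodes to which
    # CPUs 0-3 belong, and allocates the domain's memory there.
    name   = "numa-guest-1"
    memory = 1024
    vcpus  = 4
    kernel = "/path/to/guest/kernel"
    cpus   = "0-3"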
On the other hand, as node affinity directly affects where a domain's memory
lives, it makes a lot of sense for it to be involved in scheduling decisions:
it would be great if the hypervisor managed to schedule all the vcpus of all
the domains on CPUs attached to the various domains' local memory. That is
why the node affinity of a domain is treated by the scheduler as the set of
nodes on which it would be preferable to run it, although not at the cost of
violating the behavior and invariants of the scheduling algorithm. This means
Xen will check whether a vcpu of a domain can run on one of the CPUs
belonging to the nodes of the domain's node affinity, but will rather run it
somewhere else --- even on another, remote, CPU --- than violate the priority
ordering (e.g., by kicking out another running vcpu with higher priority) it
is designed to enforce.

So, last but not least, what if a domain has both vcpu and node affinity, and
they only partially match, or do not match at all (to understand how that can
happen, see the following sections)? In such a case, all the domain's memory
will be allocated according to its node affinity, while scheduling will
happen according to its vcpu affinity, meaning that it is easy enough to
construct optimal, sub-optimal, neutral and even bad and awful configurations
(which is something nice, e.g., for benchmarking purposes). The remainder of
this document explains how to do so.


Specifying Node Affinity
------------------------

Besides having it automatically computed from the vcpu affinities of a domain
(or from the domain being part of a cpupool) within Xen, it might make sense
for the user to specify the node affinity of their domains by hand, while
editing their config files, as another form of partitioning the host
resources. If that is the case, this is where the "nodes" option of the xl
config file becomes useful. In fact, specifying something like the below

    nodes = [ '0', '1', '3', '4' ]

in a domain configuration file results in Xen assigning host NUMA nodes
0, 1, 3 and 4 to the domain's node affinity, regardless of any vcpu affinity
setting for the same domain. The idea is: yes, the two things are related,
and if only one is present, it makes sense to use it for inferring the other,
but it is always possible to explicitly specify both of them, independently
of how good or awful the result could end up being.

Therefore, this is what one should expect when using "nodes", perhaps in
conjunction with "cpus", in a domain configuration file (a combined example
is sketched right after this list):

 * `cpus = "0, 1"` and no `nodes=` at all
   (i.e., only vcpu affinity specified):
   the domain's vcpus can and will run only on host CPUs 0 and 1. Also, as
   the domain's node affinity will be computed by Xen and set to whatever
   nodes host CPUs 0 and 1 belong to, all the domain's memory accesses will
   be local;

 * `nodes = [ '0', '1' ]` and no `cpus=` at all
   (i.e., only node affinity present):
   the domain's vcpus can run on any of the host CPUs, but the scheduler (at
   least if credit is used, as it is the only scheduler supporting this
   right now) will try running them on the CPUs that are part of host NUMA
   nodes 0 and 1. Memory-wise, all the domain's memory will be allocated on
   host NUMA nodes 0 and 1. This means most of the memory accesses of the
   domain should be local, but that will depend on the on-line load, behavior
   and actual scheduling of both the domain in question and all the other
   domains on the same host;

 * `nodes = [ '0', '1' ]` and `cpus = "0"`, with CPU 0 within node 0
   (i.e., cpu affinity a subset of node affinity):
   the domain's vcpus can and will run only on host CPU 0. As the node
   affinity is explicitly set to host NUMA nodes 0 and 1 --- which includes
   node 0, where CPU 0 lives --- all the memory accesses of the domain will
   be local;

 * `nodes = [ '0', '1' ]` and `cpus = "0, 4"`, with CPU 0 in node 0 but
   CPU 4 in, say, node 2 (i.e., cpu affinity a superset of node affinity):
   the domain's vcpus can run on host CPUs 0 and 4, with CPU 4 not being
   within the node affinity (explicitly set to host NUMA nodes 0 and 1). The
   (credit) scheduler will try to keep memory accesses local by scheduling
   the domain's vcpus on CPU 0, but it may not achieve 100% success;

 * `nodes = [ '0', '1' ]` and `cpus = "4"`, with CPU 4 within, say, node 2
   (i.e., cpu affinity disjoint from node affinity):
   the domain's vcpus can and will run only on host CPU 4, i.e., completely
   "outside" of the chosen node affinity. That necessarily means all the
   domain's memory accesses will be remote.
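As a concrete (and purely hypothetical) sketch of the third case above,
assuming host CPU 0 does sit in node 0, such a domain could be described
like this:

    # Hypothetical config fragment: memory comes from nodes 0 and 1, while
    # both vcpus only ever run on CPU 0 (node 0), so all memory accesses
    # of the domain should be local.
    name   = "numa-guest-2"
    memory = 2048
    vcpus  = 2
    cpus   = "0"
    nodes  = [ '0', '1' ]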
Automatic NUMA Placement
------------------------

In case one does not want to take the burden of manually specifying all the
node (and, perhaps, CPU) affinities for all their domains, xl implements some
automatic placement logic. This basically means the user can ask the
toolstack to try sorting things out in the best possible way for them. It
replaces the manual specification of a domain's node affinity, and it may or
may not be paired with a vcpu affinity (if it is, the relationship between
vcpu and node affinities stays as stated above). To serve this purpose, a new
domain config switch has been introduced: the "nodes_policy" option. As the
name suggests, it allows specifying the policy to be used while attempting
automatic placement of the new domain. Available policies at the time of
writing are:

 * "auto": automatic placement by means of a not better specified (xl
           implementation dependent) algorithm. It is basically for those
           who do want automatic placement, but have no idea what policy
           or algorithm would be better... <<Just give me a sane default!>>

 * "ffit": automatic placement via the First Fit algorithm, applied checking
           the memory requirement of the domain against the amount of free
           memory in the various host NUMA nodes;

 * "bfit": automatic placement via the Best Fit algorithm, applied checking
           the memory requirement of the domain against the amount of free
           memory in the various host NUMA nodes;

 * "wfit": automatic placement via the Worst Fit algorithm, applied checking
           the memory requirement of the domain against the amount of free
           memory in the various host NUMA nodes;

The various algorithms have been implemented because they offer different
behavior and performance (for different performance metrics). For instance,
First Fit is known to be efficient and quick, and it generally behaves better
than Best Fit wrt memory fragmentation, although it tends to occupy "early"
nodes more than "late" ones. On the other hand, Best Fit aims at optimizing
memory usage, although it introduces quite a bit of fragmentation, by leaving
behind large numbers of small free memory areas. Finally, the idea behind
Worst Fit is that it will leave big enough free memory chunks to limit the
amount of fragmentation, but it (as well as Best Fit) is more expensive in
terms of execution time, as it needs the "list" of free memory areas to be
kept sorted.
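As a purely made-up illustration of the difference, consider a host with
four NUMA nodes, the following amounts of free memory, and a new domain
that needs 2 GB:

    node0: 3 GB free   node1: 2 GB free   node2: 1 GB free   node3: 4 GB free

First Fit picks node 0 (the first node with enough room), Best Fit picks
node 1 (the smallest chunk that still fits), and Worst Fit picks node 3
(the largest chunk), each leaving a different pattern of free memory behind.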
Therefore, achieving automatic placement actually happens by properly using
the "nodes" and "nodes_policy" configuration options as follows:

 * `nodes="auto"` or `nodes_policy="auto"`:
   xl will try fitting the domain on the host NUMA nodes by using its own
   default placing algorithm, with default parameters. Most likely, all
   nodes will be considered suitable for the domain (unless a vcpu affinity
   is specified, see the last entry of this list);

 * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`) and no `nodes=` at all:
   xl will try fitting the domain on the host NUMA nodes by using the
   requested policy. All nodes will be considered suitable for the domain,
   and consecutive fitting attempts will be performed while increasing the
   number of nodes on which to put the domain itself (unless a vcpu affinity
   is specified, see the last entry of this list);

 * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `nodes=2`:
   xl will try fitting the domain on the host NUMA nodes by using the
   requested policy and only the number of nodes specified in `nodes=`
   (2 in this example). All the nodes will be considered suitable for the
   domain, and consecutive attempts will be performed while increasing such
   a value;

 * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `cpus="0-6"`:
   xl will try fitting the domain on the host NUMA nodes to which the CPUs
   specified as vcpu affinity (0 to 6 in this example) belong, by using the
   requested policy. In case it fails, consecutive fitting attempts will be
   performed with both a reduced (first) and an increased (next) number of
   nodes.

Different usage patterns --- like specifying both a policy and a list of
nodes --- are accepted, but do not make much sense after all. Therefore,
although xl will try its best to interpret the user's will, the resulting
behavior is somewhat unspecified.
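Putting it all together, a guest relying entirely on the automatic placement
described in this series could look like the hypothetical config below
(name, sizes and kernel path are, again, just placeholders):

    # Hypothetical guest config: let xl pick the nodes via First Fit,
    # restricting the candidates to the nodes hosting CPUs 0-6, which are
    # also the vcpu affinity of the domain.
    name         = "numa-guest-3"
    memory       = 4096
    vcpus        = 4
    kernel       = "/path/to/guest/kernel"
    cpus         = "0-6"
    nodes_policy = "ffit"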