Re: [Xen-devel] [PATCH 10 of 10 v3] Some automatic NUMA placement documentation
On 04/07/12 17:18, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli <raistlin@xxxxxxxx>
> # Date 1341416324 -7200
> # Node ID f1523c3dc63746e07b11fada5be3d461c3807256
> # Parent 885e2f385601d66179058bfb6bd3960f17d5e068
> Some automatic NUMA placement documentation
>
> About rationale, usage and (some small bits of) API.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Acked-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>
> Changes from v1:
>  * API documentation moved close to the actual functions.
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -0,0 +1,91 @@
> +# Guest Automatic NUMA Placement in libxl and xl #
> +
> +## Rationale ##
> +
> +NUMA means the memory accessing times of a program running on a CPU depends on
> +the relative distance between that CPU and that memory. In fact, most of the
> +NUMA systems are built in such a way that each processor has its local memory,
> +on which it can operate very fast. On the other hand, getting and storing data
> +from and on remote memory (that is, memory local to some other processor) is
> +quite more complex and slow. On these machines, a NUMA node is usually defined
> +as a set of processor cores (typically a physical CPU package) and the memory
> +directly attached to the set of cores.
> +
> +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA]) machines by
> +assigning to its domain a "node affinity", i.e., a set of NUMA nodes of the
> +host from which it gets its memory allocated.
> +
> +NUMA awareness becomes very important as soon as many domains start running
> +memory-intensive workloads on a shared host. In fact, the cost of accessing non
> +node-local memory locations is very high, and the performance degradation is
> +likely to be noticeable.
> +
> +## Guest Placement in xl ##
> +
> +If using xl for creating and managing guests, it is very easy to ask for both
> +manual or automatic placement of them across the host's NUMA nodes.
> +
> +Note that xm/xend does the very same thing, the only differences residing in
> +the details of the heuristics adopted for the placement (see below).
> +
> +### Manual Guest Placement with xl ###
> +
> +Thanks to the "cpus=" option, it is possible to specify where a domain should
> +be created and scheduled on, directly in its config file. This affects NUMA
> +placement and memory accesses as the hypervisor constructs the node affinity of
> +a VM basing right on its CPU affinity when it is created.
> +
> +This is very simple and effective, but requires the user/system administrator
> +to explicitly specify affinities for each and every domain, or Xen won't be
> +able to guarantee the locality for their memory accesses.
> +
> +It is also possible to deal with NUMA by partitioning the system using cpupools
> +(available in the upcoming release of Xen, 4.2). Again, this could be "The
> +Right Answer" for many needs and occasions, but has to to be carefully
> +considered and manually setup by hand.
> +
> +### Automatic Guest Placement with xl ###
> +
> +In case no "cpus=" option is specified in the config file, libxl tries to

I think "If no 'cpus=' option..." is better here.

> +figure out on its own on which node(s) the domain could fit best. It is
> +worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an
> +host host is an incarnation of the Bin Packing Problem. In fact, the various

"host host"

I think you can just say "...is an incarnation of the Bin Packing Problem,
which is known to be NP-hard. We will therefore be using some heuristics."
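(Purely as an illustration of the "Manual Guest Placement with xl" section
quoted above, a domain config fragment using the "cpus=" option might look
like the sketch below. The name and numbers are made up, and it assumes that
pCPUs 4-7 all belong to the same NUMA node on the host in question.)

    # Hypothetical xl domain config fragment: pin the guest's vCPUs to
    # pCPUs 4-7, so that the node affinity derived from this pinning (and
    # hence the guest's memory) stays on the node owning those pCPUs.
    name   = "guest1"
    memory = 1024
    vcpus  = 2
    cpus   = "4-7"

With something along these lines in the config file, the hypervisor builds the
domain's node affinity from its CPU affinity at creation time, as the quoted
text explains.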
> +VMs with different memory sizes are the items to be packed, and the host nodes
> +are the bins. That is known to be NP-hard, thus, it is probably better to
> +tackle the problem with some sort of hauristics, as we do not have any oracle
> +available!

(nb the spelling of "heuristics" as well.)

> +
> +The first thing to do is finding a node, or even a set of nodes, that have
> +enough free memory and enough physical CPUs for accommodating the one new
> +domain. The idea is to find a spot for the domain with at least as much free
> +memory as it has configured, and as much pCPUs as it has vCPUs. After that,
> +the actual decision on which solution to go for happens accordingly to the
> +following heuristics:
> +
> + * candidates involving fewer nodes come first. In case two (or more)
> +   candidates span the same number of nodes,
> + * the amount of free memory and the number of domains assigned to the
> +   candidates are considered. In doing that, candidates with greater amount
> +   of free memory and fewer assigned domains are preferred, with free memory
> +   "weighting" three times as much as number of domains.
> +
> +Giving preference to small candidates ensures better performance for the guest,

I think I would say "candidates with fewer nodes" here; "small candidates"
doesn't convey "fewer nodes" to me.

> +as it avoid spreading its memory among different nodes. Favouring the nodes
> +that have the biggest amounts of free memory helps keeping the memory

We normally don't say "big amount", but "large amount" (don't ask me why --
just sounds a bit funny to me). So this would be "largest amount".

> +fragmentation small, from a system wide perspective. However, in case more

Again, s/in case/if/;

Other than that, looks good to me.

 -George

> +candidates fulfil these criteria by roughly the same extent, having the number
> +of domains the candidates are "hosting" helps balancing the load on the various
> +nodes.
> +
> +## Guest Placement within libxl ##
> +
> +xl achieves automatic NUMA just because libxl does it interrnally.
> +No API is provided (yet) for interacting with this feature and modify
> +the library behaviour regarding automatic placement, it just happens
> +by default if no affinity is specified (as it is with xm/xend).
> +
> +For actually looking and maybe tweaking the mechanism and the algorithms it
> +uses, all is implemented as a set of libxl internal interfaces and facilities.
> +Look at the comment "Automatic NUMA placement" in libxl\_internal.h.
> +
> +Note this may change in future versions of Xen/libxl.
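For a more concrete picture of the selection step described in the quoted
heuristics (fewer nodes first; then more free memory and fewer already-placed
domains, with free memory weighting three times as much), a rough sketch of
such a comparison could look like the code below. This is only an illustration
under those assumptions, not the actual libxl code: the structure, field and
function names are all made up.

    #include <stdint.h>

    /* Hypothetical summary of a placement candidate (not the real libxl type). */
    struct candidate {
        int      nr_nodes;     /* how many NUMA nodes the candidate spans       */
        uint64_t free_memkb;   /* free memory summed over those nodes, in kB    */
        int      nr_domains;   /* number of domains already placed on the nodes */
    };

    /* Score used once two candidates span the same number of nodes: the share
     * of the pair's free memory weights three times as much as the share of
     * the pair's domains (fewer domains is better, hence the 1.0 - domfrac). */
    static double cand_score(const struct candidate *c, double memtot, double domtot)
    {
        double memfrac = memtot > 0 ? (double)c->free_memkb / memtot : 0.5;
        double domfrac = domtot > 0 ? (double)c->nr_domains / domtot : 0.5;
        return 3.0 * memfrac + (1.0 - domfrac);
    }

    /* Returns < 0 if c1 should be preferred to c2, > 0 for the opposite,
     * and 0 if the two candidates are considered equally good. */
    static int candidate_cmp(const struct candidate *c1, const struct candidate *c2)
    {
        /* Candidates involving fewer nodes always come first. */
        if (c1->nr_nodes != c2->nr_nodes)
            return c1->nr_nodes - c2->nr_nodes;

        double memtot = (double)c1->free_memkb + (double)c2->free_memkb;
        double domtot = (double)c1->nr_domains + (double)c2->nr_domains;
        double s1 = cand_score(c1, memtot, domtot);
        double s2 = cand_score(c2, memtot, domtot);

        return s1 > s2 ? -1 : (s1 < s2 ? 1 : 0);
    }

Sorting the suitable candidates with a comparison of this kind and picking the
first one is, in essence, what the quoted document says happens after the free
memory / pCPU filtering step; the real interfaces live behind the "Automatic
NUMA placement" comment in libxl_internal.h, as noted above.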