
[Xen-devel] Hwloc with Xen host topology


For some post-holiday hacking, I tried playing around with getting hwloc
to understand Xen's full system topology, rather than the faked up
topology dom0 receives.

I present here some code which works (on some interestingly shaped
servers in the XenRT test pool), and some discoveries/problems found
along the way.

Code can be found at:

You will need a libxc with the following patch:

Instructions for use can be found in the commit message of the hwloc.git
tree.  It is worth noting that, with the help of the hwloc-devel list,
v2 is already quite a bit different, but is still in progress.

Anyway, for the Xen issues I encountered.  If memory serves, some of
them might have been brought up on xen-devel in the past.

The first problem, as indicated by the extra patch required against
libxc, is that the current interfaces for xc_{topology,numa}info() suck
if you are not libxl.  They force the caller to handle hypercall bounce
buffering itself, which is even harder to do sensibly as half the
bounce buffer macros are private to libxc.  Bounce buffering is the
kind of detail which libxc should deal with on behalf of its callers,
and should only be exposed to callers who want to do something special.

My patch implements xc_{topology,numa}info_bounced() (names up for
reconsideration), which take some uint{32,64}_t arrays (optionally
NULL) and properly bounce buffer them.  This means hwloc does not need
to mess around with any of the bounce buffering itself.

The second problem is the choice of max_node_id, which is
MAX_NUMNODES-1, i.e. 63.  This means that the toolstack has to bounce a
16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
a single or dual node system.  The issue is less pronounced with the
node_to_mem{size,free} arrays, which only have to be 64 * uint64_t
long, but it is still wasteful, especially if node_to_memfree is being
periodically polled.  Having nr_node_ids set dynamically (similar to
nr_cpu_ids) would alleviate this overhead, as the number of nodes
available on the system is unconditionally fixed after boot.

The third problem is the one which created the only real bug in my hwloc
implementation.  Cores are numbered per-socket in Xen, while sockets,
numa nodes and cpus are numbered on an absolute scale.  There is
currently a gross hack in my hwloc code which adds (socket_id *
cores_per_socket * threads_per_core) onto each core id to make them
similarly numbered on an absolute scale.  This is fine for a
homogeneous system, but not for a heterogeneous one.

Relatedly, while debugging the third problem on an AMD Opteron 63xx
system, I noticed that it advertises 8 cores per socket and 2 threads
per core, but numbers the cores 1-16 on each socket.  This is broken.
It should either be 16 cores per socket and 1 thread per core, or
genuinely 8 cores per socket and 2 threads per core, with the cores
numbered 1-8 and each pair of cpus sharing the same core id.

Fourth, the API for identifying offline cpus is broken.  To mark a cpu
as offline, its topology information is shot, meaning that an offline
cpu cannot be positively located in the topology.  In practice it can
be, as Xen writes the records sequentially, so a single offline cpu can
be inferred from the valid information on either side, but a block of
offline cpus becomes rather harder to locate.  Ideally,
XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
being a bitmap from 0 to max_cpu_index identifying which cpus are
online, and with the correct core/socket/node information (when known)
written into the other parameters.  However, it being an ABI now makes
this somewhat harder to do.

Fifth, Xen has no way of querying the cpu cache information.  hwloc
likes to know the entire cache hierarchy, which is arguably more useful
for its primary purpose of optimising HPC workloads than for simply
viewing the Xen topology, but it is nonetheless a missing feature as
far as Xen is concerned.  I was considering adding a sysctl along the
lines of "please execute cpuid with these parameters on that pcpu and
give me the answers".

Sixth and finally, and conceptually the hardest problem to solve: Xen
has no notion of IO proximity.  Devices on the system can report their
location using _PXM() methods in the DSDT/SSDTs, but only dom0 can
gather this information, and dom0 does not have an accurate view of the
NUMA or CPU topology.

Anyway - that is probably enough rambling.  I don't expect much/any of
this to be resolved before the 4.5 dev window opens, but bringing these
issues to light might at least get some of them discussed.

