[Xen-devel] Request for input: Extended event channel support

* Executive summary

The number of event channels available for dom0 is currently one of
the biggest limitations on scaling up the number of VMs which can be
created on a single system.  There are two alternative implementations
we could choose, one of which is ready now, the other of which is
potentially technically superior, but will not be ready for the 4.3
release.

The core question we need to ask the community: How important is
lifting the event channel scalability limit to 4.3?  Will waiting
until 4.4 cause a limit in the uptake of the Xen platform?

* The issue

The existing event channel implementation for PV guests is a 2-level
bit array.  This limits the total number of event channels to
word_size ^ 2, which is 1024 for 32-bit guests and 4096 for 64-bit
guests.
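
As a rough illustration of why the limit is word_size ^ 2, here is a
minimal Python model of a 2-level pending bitmap.  It is a sketch, not
Xen's actual code: the class and method names are hypothetical, and it
models only the "pending" bits (no masking, no per-vCPU state).

```python
# Illustrative model of a 2-level pending bitmap (hypothetical names,
# not Xen's implementation).
# Level 1: a single selector word; bit i set means level-2 word i has
#          at least one pending bit.
# Level 2: word_size words of word_size bits each, one bit per port.

class TwoLevelBitmap:
    def __init__(self, word_size):
        self.word_size = word_size
        self.selector = 0                  # level-1 word
        self.pending = [0] * word_size     # level-2 words

    @property
    def capacity(self):
        # One selector bit per level-2 word, word_size bits per word.
        return self.word_size ** 2

    def raise_event(self, port):
        if not 0 <= port < self.capacity:
            raise ValueError("port out of range")
        word, bit = divmod(port, self.word_size)
        self.pending[word] |= 1 << bit
        self.selector |= 1 << word

    def pending_ports(self):
        """Return pending ports in ascending order, checking the selector first."""
        ports = []
        for word in range(self.word_size):
            if not (self.selector >> word) & 1:
                continue                   # whole level-2 word is empty
            for bit in range(self.word_size):
                if (self.pending[word] >> bit) & 1:
                    ports.append(word * self.word_size + bit)
        return ports
```

With word_size=32 the model holds 1024 ports, and with word_size=64 it
holds 4096, matching the figures above.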

This sounds like a lot, until you consider that in a typical system,
each VM needs 4 or more event channels in domain 0.  This means that
for a 32-bit dom0, there is a theoretical maximum of 256 guests -- and
in practice it's more like 180 or so, because of event channels
required for other things.  XenServer already has customers using VDI
that require more VMs than this.
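
The guest-count arithmetic can be written out as follows.  This is
only a back-of-the-envelope sketch: it assumes exactly 4 dom0 event
channels per VM (the text says "4 or more"), and the overhead figure it
derives is back-computed from the "180 or so" estimate, not measured.

```python
def max_guests(total_channels, channels_per_guest, reserved=0):
    """Guests that fit once `reserved` channels are set aside for other uses."""
    return (total_channels - reserved) // channels_per_guest

def implied_overhead(total_channels, channels_per_guest, observed_guests):
    """Channels left over for 'other things' at a given observed guest count."""
    return total_channels - observed_guests * channels_per_guest

theoretical_32 = max_guests(1024, 4)          # 32-bit dom0
theoretical_64 = max_guests(4096, 4)          # 64-bit dom0
overhead_32 = implied_overhead(1024, 4, 180)  # assumes exactly 4 per guest

print(theoretical_32, theoretical_64, overhead_32)
```

Under those assumptions, a 180-guest practical ceiling on a 32-bit dom0
would correspond to roughly 300 event channels being consumed by things
other than guests.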

* The dilemma

When we began the 4.3 release cycle, this was one of the items we
identified as a key feature we needed to get for 4.3.  Wei Liu started
work on an extension of the existing implementation, allowing 3 levels
of event channels.  The draft of this is ready, and just needs the
last bit of polishing and bug-chasing before it can be accepted.

However, several months ago, David Vrabel came up with an alternate
design which in theory was more scalable, based on queues of linked
lists (which we have internally been calling "FIFO" for short).  David
has been working on the implementation since, and has a draft
prototype; but it's in no shape to be included in 4.3.

There are some things that are attractive about the second solution,
including the flexible assignment of interrupt priorities, ease of
scalability, and potentially even the FIFO nature of the interrupt
delivery.

The question at hand then, is whether to take what we have in the
3-level implementation for 4.3, or wait to see how the FIFO
implementation turns out (taking either it or the 3-level
implementation in 4.4).

* The solution in hand: 3-level event channels

The basic idea behind 3-level event channels is to extend the existing
2-level implementation to 3 levels.  Going to 3 levels would give us
32k event channels for 32-bit, and 256k for 64-bit.
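
The capacity figures follow from adding one more level of fan-out.
The sketch below also shows one plausible way a port number could be
split into per-level bit indices; that indexing scheme is an assumption
made for illustration, and the function names are hypothetical rather
than taken from the 3-level patches.

```python
def capacity(word_size, levels):
    """Each level fans out by word_size, so capacity is word_size ** levels."""
    return word_size ** levels

def split_port(port, word_size):
    """Split a port into (level-1 bit, level-2 bit, level-3 bit) indices.

    Assumed layout: the level-3 bit is the port's position within its
    word, the level-2 bit selects that word within its group, and the
    level-1 bit selects the group.
    """
    if not 0 <= port < capacity(word_size, 3):
        raise ValueError("port out of range")
    rest, l3 = divmod(port, word_size)
    l1, l2 = divmod(rest, word_size)
    return l1, l2, l3

print(capacity(32, 3), capacity(64, 3))
```

Here capacity(32, 3) is 32768 (32k) and capacity(64, 3) is 262144
(256k).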

One of the advantages of this method is that since it is similar to
the existing method, the general concepts and race conditions are
fairly well understood and tested.

One of the disadvantages that this method inherits from the 2-level
event channels is the lack of priority.  In the initial implementation
of event channels, priority was handled by event channel order: scans
for events always started at 0 and went upwards.  However, this was
not very scalable, as lower-numbered events could easily completely
lock out higher-numbered events; and frequently "lower-numbered"
simply meant "created earlier".  Event channels were forced into a
priority even if one was not wanted.

So the implementation was tweaked, so that scans don't start at 0, but
continue where the last event left off.  This made it so that earlier
events were not prioritized and removed the starvation issue, but at
the cost of removing all event priorities.  Certain events, like the
timer event, are special-cased to be always checked, but this is
rather a bit of a hack and not very scalable or flexible.
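
The difference between the two scanning strategies can be seen in a
small simulation.  This is a toy model under stated assumptions, not
the hypervisor's or any guest kernel's code: one "busy" port is
re-raised immediately after every delivery, a second port is raised
once, and all names are hypothetical.

```python
def next_pending(pending, start, n):
    """Return the first pending port at or after `start`, wrapping around once."""
    for offset in range(n):
        port = (start + offset) % n
        if port in pending:
            return port
    return None

def simulate(resume, rounds, n=8, busy=0, other=5):
    """Deliver `rounds` events and return the ports handled, in order.

    resume=False: every scan starts at port 0 (the original behaviour).
    resume=True:  every scan starts just after the last handled port.
    """
    pending = {busy, other}
    start = 0
    handled = []
    for _ in range(rounds):
        port = next_pending(pending, start, n)
        handled.append(port)
        pending.discard(port)
        pending.add(busy)          # the busy source fires again immediately
        start = (port + 1) % n if resume else 0
    return handled

print(simulate(resume=False, rounds=4))   # port 5 is never reached
print(simulate(resume=True, rounds=4))    # port 5 is handled on the second scan
```

Scanning from 0 starves port 5 for as long as port 0 keeps firing;
resuming after the last handled port removes the starvation, but also
removes any notion of port 0 being more important.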

One thing that should be noted is that adding the extra level is
envisioned only to be used by guests that need the extended event
channel space, such as dom0 and driver domains; domUs will continue to
use the 2-level version.

* The solution close at hand: FIFO event channels

The FIFO solution makes event delivery a matter of adding items to a
highly structured linked list.  The number of event channels for the
interface design has a theoretical maximum of 2^28; the current
implementation is limited to 2^17, which is over 100,000.  The
number is the same for both 32-bit and 64-bit kernels.

One of the key design advantages of the FIFO is the ability to assign
an arbitrary priority to any event.  There are 16 priorities
available, with one queue for each priority.  Higher-priority queues
are handled before lower-priority queues, but events within a queue
are handled in FIFO order.
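
The delivery order described above can be modelled in a few lines.
This is a conceptual sketch only, not the FIFO ABI or David's
prototype: the class name is hypothetical, Python deques stand in for
the linked lists, and treating priority 0 as the highest is an
assumption made for the example.

```python
from collections import deque

NUM_PRIORITIES = 16   # from the text; 0 is treated as highest here (assumption)

class FifoEventQueues:
    def __init__(self):
        # One FIFO queue per priority level.
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]

    def raise_event(self, port, priority):
        if not 0 <= priority < NUM_PRIORITIES:
            raise ValueError("priority out of range")
        self.queues[priority].append(port)

    def next_event(self):
        """Pop the oldest event from the highest-priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

q = FifoEventQueues()
q.raise_event(10, priority=5)
q.raise_event(11, priority=5)
q.raise_event(12, priority=1)
print([q.next_event() for _ in range(4)])
```

Port 12 is delivered first because its queue has higher priority;
ports 10 and 11 then come out in the order they were raised.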

Another potential advantage is the FIFO ordering.  With the current
event channel implementation, one can construct scenarios where even
with events of the same priority, clusters of events can lock out
others based on where they are or the number of them.  FIFO solves
this by handling events within the same priority strictly in the order
in which they were raised.  It's not clear yet, however, whether this
has a measurable impact on performance.

One of the potential disadvantages of the FIFO solution is the amount
of memory that it requires to be mapped into the Xen address space.
The FIFO solution requires an entire word per event channel; a
reasonably configured system might have up to 128 Xen-mapped pages per
dom0 or domU.  On the other hand, this number can be scaled at a
fine-grained level, and limited by the toolstack; a typical domU would
require only one page mapped in the hypervisor.

By comparison, the 3-level solution requires only two bits per event
channel.  Any domain using the extra level would require exactly 16
pages for 64-bit domains, and 2 pages for 32-bit domains.  We would
expect this to include dom0 and any driver domains, but that domUs
would continue using 2-level event channels (and thus require no extra
pages to be mapped).
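
The page counts in the last two paragraphs can be checked with simple
arithmetic.  The sketch assumes 4 KiB pages and a 32-bit (4-byte) word
per FIFO event channel; neither is stated explicitly above, but both
are consistent with the 128-page figure at the 2^17 limit.

```python
PAGE_SIZE = 4096   # bytes; assumes 4 KiB pages

def pages_needed(channels, bits_per_channel):
    """Pages required to hold `bits_per_channel` bits for each channel."""
    total_bytes = (channels * bits_per_channel + 7) // 8
    return -(-total_bytes // PAGE_SIZE)   # ceiling division

fifo_max = pages_needed(2**17, 32)     # FIFO at the current 2^17 limit
fifo_small = pages_needed(1024, 32)    # FIFO domU with up to 1024 channels
three_64 = pages_needed(64**3, 2)      # 3-level, 64-bit domain
three_32 = pages_needed(32**3, 2)      # 3-level, 32-bit domain

print(fifo_max, fifo_small, three_64, three_32)
```

Under these assumptions a single FIFO page covers 1024 event channels,
which is why a typical domU would need only one.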

* Considerations

There are a number of additional considerations to take into account.

The first is that the hypervisor maintainers have made it clear that
once 3-level event channels are accepted, FIFO will have a higher bar
to clear for acceptance.  That is, if we wait for the 4.4 timeframe
before choosing one to accept, then FIFO will only need to be
marginally preferable to 3-level to be accepted.  However, if we
accept the 3-level implementation for 4.3, then FIFO will need to
demonstrate that it is significantly better than 3-level in order to
be accepted.

We are not yet aware of any companies that are blocked on this
feature.  Citrix XenServer clients using Citrix's VDI solution need to
be able to run more than 200 guests; however, because XenServer
controls both the kernel and hypervisor side, they can introduce
temporary, non-backwards or forwards-compatible changes to work around
the limitation, and so are not blocked.  Oracle and SuSE have not
indicated that this is a feature they are in dire need of.  Most cloud
deployments that we know of -- even extremely large ones like Amazon
or Rackspace -- use large numbers of relatively inexpensive computers,
and so typically do not need to run more than 200 VMs per physical
host.

Another factor to consider is that we are considering attempting a
shorter release cadence for 4.4 -- 6 months or possibly less.  That
means that the impact of delaying the event channel scalability
feature will be reduced.

* What we need to know

What we're missing in order to make an informed decision is voices
from the community: If we delay the event channel scalability feature
until 4.4, how likely is this to be an issue?  Are there current users
or potential users of Xen who need to be able to scale past 200 VMs on
a single host, and who would end up choosing another hypervisor if
this feature were delayed?

Thank you for your time and input.

 -George Dunlap,
  4.3 Release manager
