
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



Nice work and interesting reading for the night! Please see my inline
comments.

On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> Hi,
> 
> Here is a design for a scalable event channel ABI for Xen.  It has a
> number of benefits over the design currently being proposed by Wei Liu.
> 
> * More event channels (>100,000).

This is not a benefit over the 3-level design. ;-)

> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.

These are! ;-)

> The PDF version is easier to read and has diagrams and readable maths
> but the original markdown format document is included below (for ease of
> commenting).
> 
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
> 
> Introduction
> ============
> 
> Purpose
> -------
> 
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests.  The current event channel ABI
> provided by Xen only supports up to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels.  This limits scalability as
> support for more VMs, VCPUs and devices is required.
> 
> The existing ABI does not easily allow events to have different
> priorities.  Current Linux kernels prioritize the timer event by
> special-casing it, but this approach does not generalize to more events.
> Event priorities may be useful for prioritizing MMIO emulation
> requests over bulk data traffic (such as network or disk).
> 
> This design replaces the existing event channel ABI with one that:
> 
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
> 
> - allows guests to use up to 16 different event priorities.
> 
> System Overview
> ---------------
> 
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
> 
> Design Map
> ----------
> 
> A new event channel ABI requires changes to Xen and the guest kernels.
> 
> References
> ----------
> 
> [FIXME: link to alternate proposal?]
> 
> Design Considerations
> =====================
> 
> Assumptions
> -----------
> 
> * Atomic read-modify-write of 32-bit words is possible on all
>   supported platforms.  This can be done with a load-linked /
>   store-conditional (e.g., ARMv8's ldxr/stxr) or a compare-and-swap
>   (e.g., x86's cmpxchg).
> 
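
For concreteness, the kind of lock-less update this assumption enables
looks something like the following sketch (using the GCC/Clang __atomic
builtins; the flag name and bit position are purely illustrative):

    #include <stdint.h>

    #define EVENT_WORD_PENDING (1u << 31)   /* illustrative bit position */

    /* Atomically set a flag bit in a 32-bit event word with a
     * compare-and-swap loop. */
    static void event_word_set_pending(uint32_t *word)
    {
        uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);

        /* 'old' is refreshed by the builtin after each failed attempt. */
        while (!__atomic_compare_exchange_n(word, &old,
                                            old | EVENT_WORD_PENDING,
                                            0, __ATOMIC_ACQ_REL,
                                            __ATOMIC_RELAXED))
            ;
    }
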
> Constraints
> -----------
> 
> * The existing ABI must continue to be usable.  Compatibility with
>   existing guests is mandatory.
> 
> Risks and Volatile Areas
> ------------------------
> 
> * Should the 3-level proposal be merged into Xen then this design does
>   not offer enough improvements to warrant the cost of maintaining
>   three different event channel ABIs in Xen and guest kernels.
> 
> * The performance of some operations may be decreased.  Specifically,
>   retriggering an event now always requires a hypercall.
> 
> Architecture
> ============
> 
> Overview
> --------
> 
> The event channel ABI uses a data structure that is shared between Xen
> and the guest.  Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall).  The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.

Just for completeness: legacy guests might not register per-VCPU vcpu
info at all - they just use the vcpu info structures inside the shared
info page. And a modern guest might end up using the vcpu info inside
the shared info page alongside separately registered per-VCPU vcpu info
if the registrations fail halfway through.

But this will not be a big problem for this design.

> Events are reported to a guest's VCPU using a FIFO queue.  There is a
> queue for each priority level and each VCPU.
> 
> Each event has a _pending_ and a _masked_ bit.  The pending bit
> indicates the event has been raised.  The masked bit is used by the
> guest to prevent delivery of that specific event.
> 
> 
> High Level Design
> =================
> 
> Shared Event Data Structure
> ---------------------------
> 
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
> 
> - _event array_: A logical array of event words (one per event
>   channel) which contains the pending and mask bits and the link index
>   for the next event in the queue.
> 
> ![\label{fig_event-word}Event Array Word](event-word.pdf)
> 
> - _control block_: This contains the head and tail indexes for each
>   priority level.  Each VCPU has its own control block and this is
>   contained in the existing `struct vcpu_info`.
> 
> The pages within the event array need not be physically or virtually
> contiguous, but the guest or Xen may make them virtually contiguous for
> ease of implementation, e.g., by using vmap() in Xen or vmalloc() in
> Linux.  Pages are added by the guest as required by the number of
> bound event channels.
> 
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events.  This limit can be trivially increased without any
> other changes to the ABI.  Bits [28:17] are reserved for future
> expansion or for other uses.
> 
> Event State Machine
> -------------------
> 
> Event channels are bound to a domain using the existing ABI.
> 
> A bound event may be in one of three main states.
> 
> State    Abbrev.  Meaning
> -----    -------  -------
> BOUND    B        The event is bound but not pending.
> PENDING  P        The event has been raised and not yet acknowledged.
> LINKED   L        The event is on an event queue.
> 
> Additionally, events may be UNMASKED or MASKED (M).
> 
> ![\label{events_sm}Event State Machine](event-sm.pdf)
> 
> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.
> 
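
Putting the word-layout and state descriptions together, I picture the
event word as something like the following (the 17-bit LINK width and
the reserved range [28:17] are from the text above; the exact positions
of the P, M and L bits are my guess from the figure):

    #include <stdint.h>

    typedef uint32_t event_word_t;

    #define EVTCHN_PENDING   (1u << 31)   /* P: event has been raised     */
    #define EVTCHN_MASKED    (1u << 30)   /* M: delivery blocked by guest */
    #define EVTCHN_LINKED    (1u << 29)   /* L: event is on a queue       */
    /* Bits [28:17] are reserved for future expansion or other uses. */
    #define EVTCHN_LINK_MASK 0x0001ffffu  /* LINK: index of next event    */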

Where is the "priority" information stored? I think it should be
somewhere inside Xen, right? I presume a field in struct evtchn?

> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue.
> The tail index is the index of the last element.  Every element within
> the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
> 

I think the registration of new event channels should always be done in
a batch. If not, then you need to provide another OP to roll back
previously registered ones.

> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
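
As a sanity check on the interface, I imagine the guest side would look
something like this (a sketch only: the Linux-side plumbing and function
name are made up, the struct and op come from the draft, and this
ignores the batching/rollback question above):

    /* Hypothetical Linux guest code registering the first event array
     * page for a VCPU during bring-up. */
    static int evtchn_fifo_init_vcpu(unsigned int cpu)
    {
        struct evtchnop_init init;
        unsigned long page;
        int rc;

        page = get_zeroed_page(GFP_KERNEL);
        if (!page)
            return -ENOMEM;

        init.vcpu = cpu;                             /* VCPU to set up   */
        init.array_pfn = virt_to_gfn((void *)page);  /* first array page */

        rc = HYPERVISOR_event_channel_op(EVTCHNOP_init, &init);
        if (rc)
            free_page(page);

        return rc;
    }
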
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appending an additional
> page.
> 
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
> 
> It is not possible to shrink the event array once it has been
> expanded.
> 
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
> 
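
For what it's worth, I read the intended usage as something like the
sketch below (hypothetical names; I am also assuming that lower numbers
mean higher priority, which the draft does not actually say):

    /* Hypothetical guest helper: give a not-yet-bound port, e.g. one
     * intended for timer events, a more urgent priority before binding. */
    static int evtchn_make_urgent(uint32_t port)
    {
        struct evtchnop_set_priority sp = {
            .port     = port,  /* must still be unbound               */
            .priority = 0,     /* assumed: 0 is most urgent of 0 - 15 */
        };

        return HYPERVISOR_event_channel_op(EVTCHNOP_set_priority, &sp);
    }
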
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 

Is shrinking max_events legal? You should also deal with the max_events >
design_limit case.

> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domain's event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assumptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs, the upper bound on the total number of pages
                               ^^^^^
                               upper or lower?

> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 

Can this bound really be reduced? Can you map memory at sub-page
granularity?

> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).
> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
> 
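
In other words, the control block boils down to something like this
(a sketch; the struct and field names are mine):

    #include <stdint.h>

    #define EVTCHN_NUM_PRIORITIES 16

    /* Per-VCPU control block, embedded in struct vcpu_info:
     * 16 priorities x 2 indexes x 4 bytes = 128 bytes. */
    struct evtchn_control_block {
        uint32_t head[EVTCHN_NUM_PRIORITIES];  /* first event per queue */
        uint32_t tail[EVTCHN_NUM_PRIORITIES];  /* last event per queue  */
    };
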
> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
> 
> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.
> 
>     E[p].pending = 1
>     if not E[p].linked and not E[n].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
> 
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
> 

I presume "E[n]" in the pseudo code is "E[p]"?

Is this spin lock really a good idea? How many threads / CPUs will spin
on this lock? As [0] shows, contention on a spin lock incurs a heavy
performance penalty.

[0] https://lwn.net/Articles/530458/
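
For my own understanding, here is roughly how I picture that raise path
in C (the types, helper names and bit positions are mine, and I have
used a test-and-set on the linked bit to avoid double-linking; set_bit()
and friends, spin locks, read_atomic(), cmpxchg() and smp_mb() are Xen's
usual primitives):

    #define EVTCHN_PENDING_BIT 31          /* illustrative positions */
    #define EVTCHN_MASKED_BIT  30
    #define EVTCHN_LINKED_BIT  29
    #define EVTCHN_LINK_MASK   0x0001ffffu

    struct evtchn_queue {
        spinlock_t lock;       /* the per-queue lock discussed above */
        uint32_t head, tail;
    };

    /* Atomically replace the LINK field of an event word. */
    static void set_link(uint32_t *word, uint32_t link)
    {
        uint32_t old, new;

        do {
            old = read_atomic(word);
            new = (old & ~EVTCHN_LINK_MASK) | link;
        } while ( cmpxchg(word, old, new) != old );
    }

    /* Xen-side sketch: raise event p on the queue for its priority. */
    static void raise_event(struct evtchn_queue *q, uint32_t *E, uint32_t p)
    {
        set_bit(EVTCHN_PENDING_BIT, &E[p]);

        /* The test-and-set on LINKED also stops two CPUs linking p twice. */
        if ( !test_bit(EVTCHN_MASKED_BIT, &E[p]) &&
             !test_and_set_bit(EVTCHN_LINKED_BIT, &E[p]) )
        {
            spin_lock(&q->lock);

            set_link(&E[p], 0);            /* p will be the new tail */
            smp_mb();

            if ( q->head == 0 )
                q->head = p;               /* queue was empty        */
            else
                set_link(&E[q->tail], p);  /* append after old tail  */
            q->tail = p;

            spin_unlock(&q->lock);
        }
    }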

> Consuming Events
> ----------------
> 
> The guest consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[p].linked = 0
>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
> 

How do you synchronize access to the array and control blocks between
Xen and the guest? I'm afraid I'm not aware of any Xen-guest spin lock...

AFAICT the inter-domain event channel is at least a corner case in your
design - CPU A is handling events in the Linux kernel while CPU B tries
to raise an inter-domain port to A. If this is not a problem, please
clarify a bit. (Sorry, I'm a bit fuzzy after a day's work.)

> > Note: When the event queue contains a single event, removing the
> > event may race with Xen appending another event because the load of
> > `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> > the guest must recheck `E[p].link` if the list appeared empty.
> 
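
So on the guest side I would expect the consume loop to end up looking
roughly like the sketch below (reusing the flag masks guessed earlier;
handle() is the dispatch routine named in the draft):

    void handle(uint32_t port);  /* clears PENDING, EOIs level PIRQs */

    /* Guest-side sketch of the consume loop for one priority queue.
     * 'head' points at the head index in this VCPU's control block
     * and 'E' at the event array. */
    static void consume_queue(uint32_t *head, uint32_t *E)
    {
        uint32_t p, link;

        while ((p = __atomic_load_n(head, __ATOMIC_ACQUIRE)) != 0) {
            link = __atomic_load_n(&E[p], __ATOMIC_ACQUIRE)
                   & EVTCHN_LINK_MASK;
            if (link == 0) {
                /* Queue looked empty: Xen may be appending right now,
                 * so recheck the link (the race in the note above). */
                __atomic_thread_fence(__ATOMIC_SEQ_CST);
                link = __atomic_load_n(&E[p], __ATOMIC_ACQUIRE)
                       & EVTCHN_LINK_MASK;
            }
            __atomic_store_n(head, link, __ATOMIC_RELEASE);

            /* Take p off the queue, then handle it unless masked. */
            __atomic_fetch_and(&E[p], ~EVTCHN_LINKED, __ATOMIC_ACQ_REL);
            if (!(__atomic_load_n(&E[p], __ATOMIC_ACQUIRE) & EVTCHN_MASKED))
                handle(p);
        }
    }
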
> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
> 
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 

Currently unmasking a "remote" port requires issuing a hypercall as well,
so if unmasking is not very frequent, this is not a big problem.

But please take some interrupt-intensive workloads into consideration.
For example, a 1G NIC (e1000) even with NAPI enabled can generate 27k+
interrupts per second under high packet load [1], and a 10G NIC can
surely generate more. Can you give an estimate of the performance hit on
the context switch?

[1] http://www.linuxfoundation.org/collaborate/workgroups/networking/napi

> > Note: after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.

Even if the event has already been linked before you finish the
hypercall, you would still need to take a lock to serialize access to
the event structure for this check, right? Or is a test_bit on the
linked field sufficient? I think you need to write some pseudo code for
this as well.
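
Something along these lines is roughly what I have in mind, purely as a
sketch reusing the invented helpers and bit names from the raise-path
sketch above; if I read the design right, the test-and-set on the linked
bit there is what makes the "already linked by the time the hypercall is
done" case harmless without a separate guest-visible lock:

    /* Xen-side sketch of the unmask hypercall: the guest has already
     * cleared the masked bit, so Xen only needs to queue the event if
     * it is pending and not yet linked. */
    static void unmask_event(struct evtchn_queue *q, uint32_t *E, uint32_t p)
    {
        if ( test_bit(EVTCHN_PENDING_BIT, &E[p]) )
            raise_event(q, E, p);   /* does nothing if already LINKED */
    }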


Thanks
Wei.

