
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



On 04/02/2013 17:52, "David Vrabel" <david.vrabel@xxxxxxxxxx> wrote:

> Hi,
> 
> Here is a design for a scalable event channel ABI for Xen.  It has a
> number of benefits over the design currently being proposed by Wei Liu.
> 
> * More event channels (>100,000).
> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.

I have some sympathy for this design. Its primary downside compared with
the 3-level alternative is its greater space cost (32*#vcpus). However, as
you say, the fairness and prioritisation features are rather nice. Also,
having the data structures be per-vcpu may well help avoid cacheline
contention on busy multi-vcpu guests.

Interested in what others think, but I may actually prefer a ground-up
redesign like this.

 -- Keir

> The PDF version is easier to read and has diagrams and readable maths,
> but the original markdown-format document is included below (for ease
> of commenting).
> 
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
> 
> Introduction
> ============
> 
> Purpose
> -------
> 
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests.  The current event channel ABI
> provided by Xen only supports up to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels.  This limits scalability as
> support for more VMs, VCPUs and devices is required.
> 
> The existing ABI does not easily allow events to have different
> priorities.  Current Linux kernels prioritize the timer event by
> special-casing it, but this approach does not generalize to more
> events.  Event priorities may be useful for prioritizing MMIO
> emulation requests over bulk data traffic (such as network or disk).
> 
> This design replaces the existing event channel ABI with one that:
> 
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
> 
> - allows guests to use up to 16 different event priorities.
> 
> System Overview
> ---------------
> 
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
> 
> Design Map
> ----------
> 
> A new event channel ABI requires changes to Xen and the guest kernels.
> 
> References
> ----------
> 
> [FIXME: link to alternate proposal?]
> 
> Design Considerations
> =====================
> 
> Assumptions
> -----------
> 
> * Atomic read-modify-write of 32-bit words is possible on all
>   supported platforms.  This can be done with load-linked /
>   store-conditional instructions (e.g., ARMv8's ldxr/stxr) or a
>   compare-and-swap (e.g., x86's cmpxchg).
> 
> Constraints
> -----------
> 
> * The existing ABI must continue to be usable.  Compatibility with
>   existing guests is mandatory.
> 
> Risks and Volatile Areas
> ------------------------
> 
> * Should the 3-level proposal be merged into Xen then this design does
>   not offer enough improvements to warrant the cost of maintaining
>   three different event channel ABIs in Xen and guest kernels.
> 
> * The performance of some operations may be decreased.  Specifically,
>   retriggering an event now always requires a hypercall.
> 
> Architecture
> ============
> 
> Overview
> --------
> 
> The event channel ABI uses a data structure that is shared between Xen
> and the guest.  Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall).  The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.
> 
> Events are reported to a guest's VCPU using a FIFO queue.  There is a
> queue for each priority level and each VCPU.
> 
> Each event has a _pending_ and a _masked_ bit.  The pending bit
> indicates the event has been raised.  The masked bit is used by the
> guest to prevent delivery of that specific event.
> 
> 
> High Level Design
> =================
> 
> Shared Event Data Structure
> ---------------------------
> 
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
> 
> - _event array_: A logical array of event words (one per event
>   channel) which contains the pending and mask bits and the link index
>   for the next event in the queue.
> 
> ![\label{fig_event-word}Event Array Word](event-word.pdf)
> 
> - _control block_: This contains the head and tail indexes for each
>   priority level.  Each VCPU has its own control block and this is
>   contained in the existing `struct vcpu_info`.
> 
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make them virtually contiguous
> for ease of implementation, e.g., by using vmap() in Xen or vmalloc()
> in Linux.  Pages are added by the guest as required by the number of
> bound event channels.
> 
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events.  This limit can be trivially increased without any
> other changes to the ABI.  Bits [28:17] are reserved for future
> expansion or for other uses.
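> 
> As a concrete illustration, the event word might be described in C as
> below.  The LINK width and the reserved range are taken from the text
> above; the exact bit positions of the L, P and M bits are defined by
> the figure, so the values used here are assumptions for the sketch
> only:
> 
>     #include <stdint.h>
> 
>     typedef uint32_t event_word_t;
> 
>     #define EVTCHN_FIFO_LINK_BITS  17
>     #define EVTCHN_FIFO_LINK_MASK  ((1u << EVTCHN_FIFO_LINK_BITS) - 1)
> 
>     /* Assumed positions for the three state bits (bits [28:17] reserved). */
>     #define EVTCHN_FIFO_LINKED     (1u << 29)   /* L */
>     #define EVTCHN_FIFO_PENDING    (1u << 30)   /* P */
>     #define EVTCHN_FIFO_MASKED     (1u << 31)   /* M */
> 
>     /* Extract the index of the next event in the queue (0 = end of queue). */
>     static inline uint32_t evtchn_link(event_word_t w)
>     {
>         return w & EVTCHN_FIFO_LINK_MASK;
>     }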
> 
> Event State Machine
> -------------------
> 
> Event channels are bound to a domain using the existing ABI.
> 
> A bound event may be in one of three main states.
> 
> State    Abbrev.  Meaning
> -----    -------  -------
> BOUND    B        The event is bound but not pending.
> PENDING  P        The event has been raised and not yet acknowledged.
> LINKED   L        The event is on an event queue.
> 
> Additionally, events may be UNMASKED or MASKED (M).
> 
> ![\label{events_sm}Event State Machine](event-sm.pdf)
> 
> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.
> 
> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue
> and the tail index is the index of the last element.  Every element
> within the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
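> 
> For illustration, a control block laid out as described -- a head and
> tail index per priority -- could look like the C sketch below.  The
> structure and field names are hypothetical; the real block is embedded
> in the existing `struct vcpu_info`:
> 
>     #include <stdint.h>
> 
>     #define EVTCHN_FIFO_MAX_QUEUES 16   /* one queue per priority level */
> 
>     struct evtchn_fifo_control_block {
>         struct {
>             uint32_t head;   /* index of the first queued event; 0 = empty */
>             uint32_t tail;   /* index of the last queued event */
>         } queue[EVTCHN_FIFO_MAX_QUEUES];
>     };
> 
> At 8 bytes per priority level this is the 128 bytes accounted for in
> the Memory Usage section below.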
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
> 
> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
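> 
> For example, a guest might register its first event array page during
> VCPU bring-up with a wrapper like the sketch below.  It assumes the new
> sub-op is issued through the existing `HYPERVISOR_event_channel_op`
> hypercall and that the guest has already allocated and zeroed the page;
> the `EVTCHNOP_init` sub-op number is not defined in this draft:
> 
>     /* Hypothetical guest-side wrapper for the proposed sub-op. */
>     static int evtchn_fifo_init_vcpu(uint32_t vcpu, uint64_t array_pfn)
>     {
>         struct evtchnop_init init = {
>             .vcpu      = vcpu,
>             .array_pfn = array_pfn,
>         };
> 
>         return HYPERVISOR_event_channel_op(EVTCHNOP_init, &init);
>     }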
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appending an
> additional page.
> 
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
> 
> It is not possible to shrink the event array once it has been
> expanded.
> 
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or not initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.
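> 
> A guest-side sketch of the growth path described above is shown below:
> when binding would exceed the current array capacity, one page is
> appended for each VCPU.  `alloc_event_array_pfn()` is a placeholder for
> whatever page allocation the guest uses, and the sketch assumes each
> call supplies a fresh page:
> 
>     /* Hypothetical: append one event array page for every VCPU. */
>     static int evtchn_fifo_expand_all(uint32_t nr_vcpus)
>     {
>         uint32_t cpu;
> 
>         for (cpu = 0; cpu < nr_vcpus; cpu++) {
>             struct evtchnop_expand expand = {
>                 .vcpu      = cpu,
>                 .array_pfn = alloc_event_array_pfn(),   /* placeholder */
>             };
>             int ret = HYPERVISOR_event_channel_op(EVTCHNOP_expand, &expand);
> 
>             if (ret)
>                 return ret;
>         }
> 
>         return 0;
>     }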
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
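> 
> For example, a guest could request a higher-than-default priority for a
> port before binding it, along the lines of the sketch below (the choice
> of priority value is up to the guest, as noted above):
> 
>     /* Hypothetical: must be called while the port is still unbound. */
>     static int evtchn_fifo_set_priority(uint32_t port, uint32_t priority)
>     {
>         struct evtchnop_set_priority op = {
>             .port     = port,
>             .priority = priority,   /* valid range 0 - 15; default is 7 */
>         };
> 
>         return HYPERVISOR_event_channel_op(EVTCHNOP_set_priority, &op);
>     }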
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 
> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domain's event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assumptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs, the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16-bit domain ID
> (~32,000 VMs).
> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
> 
> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
> 
> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.
> 
>     E[p].pending = 1
>     if not E[p].linked and not E[p].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
> 
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
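> 
> A C rendering of the pseudo code above is sketched below, with the
> per-event-queue lock made explicit.  It reuses the bit definitions from
> the event word sketch earlier; `struct evtchn_queue`,
> `evtchn_word_update()` and the use of Xen's `spinlock_t` are
> illustrative assumptions, not the draft's definitions:
> 
>     /* Hypothetical Xen-internal queue state (one per VCPU per priority). */
>     struct evtchn_queue {
>         spinlock_t lock;   /* serializes Xen-side writers for this queue */
>         uint32_t   head;   /* H */
>         uint32_t   tail;   /* T */
>     };
> 
>     /* Atomically clear the bits in 'clr', then set the bits in 'set'. */
>     static void evtchn_word_update(event_word_t *w, uint32_t set, uint32_t clr)
>     {
>         uint32_t old = __atomic_load_n(w, __ATOMIC_RELAXED);
> 
>         while ( !__atomic_compare_exchange_n(w, &old, (old & ~clr) | set,
>                                              0, __ATOMIC_SEQ_CST,
>                                              __ATOMIC_RELAXED) )
>             ;   /* 'old' is refreshed on failure; retry */
>     }
> 
>     static void evtchn_fifo_raise(struct evtchn_queue *q,
>                                   event_word_t *E, uint32_t p)
>     {
>         uint32_t w;
> 
>         /* E[p].pending = 1 */
>         __atomic_fetch_or(&E[p], EVTCHN_FIFO_PENDING, __ATOMIC_SEQ_CST);
> 
>         spin_lock(&q->lock);
> 
>         w = __atomic_load_n(&E[p], __ATOMIC_SEQ_CST);
>         if ( !(w & EVTCHN_FIFO_LINKED) && !(w & EVTCHN_FIFO_MASKED) )
>         {
>             /* E[p].linked = 1; E[p].link = 0 */
>             evtchn_word_update(&E[p], EVTCHN_FIFO_LINKED,
>                                EVTCHN_FIFO_LINK_MASK);
>             __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* mb() */
> 
>             if ( q->head == 0 )
>                 q->head = p;                           /* queue was empty */
>             else
>                 evtchn_word_update(&E[q->tail], p,     /* E[T].link = p */
>                                    EVTCHN_FIFO_LINK_MASK);
>             q->tail = p;
>         }
> 
>         spin_unlock(&q->lock);
>     }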
> 
> Consuming Events
> ----------------
> 
> The guest consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[p].linked = 0
>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
> 
>> Note: When the event queue contains a single event, removing the
>> event may race with Xen appending another event, because the load of
>> `E[p].link` and the store of `H` are not performed as a single atomic
>> operation.  To avoid this race, the guest must recheck `E[p].link` if
>> the list appeared empty.
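> 
> A guest-side C sketch of the same loop, again reusing the earlier bit
> definitions.  `H` points at the head index in the control block for the
> priority being drained, and `handle_port()` stands in for the guest's
> per-event handler (which clears the pending bit as noted above):
> 
>     static void evtchn_fifo_consume(uint32_t *H, event_word_t *E)
>     {
>         while ( *H != 0 )
>         {
>             uint32_t p = *H;
> 
>             *H = evtchn_link(E[p]);
>             if ( *H == 0 )
>             {
>                 /* Queue looked empty: Xen may have appended concurrently. */
>                 __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* mb() */
>                 *H = evtchn_link(E[p]);
>             }
> 
>             /* E[p].linked = 0: p has been removed from the queue. */
>             __atomic_fetch_and(&E[p], ~EVTCHN_FIFO_LINKED, __ATOMIC_SEQ_CST);
> 
>             if ( !(__atomic_load_n(&E[p], __ATOMIC_SEQ_CST)
>                    & EVTCHN_FIFO_MASKED) )
>                 handle_port(p);
>         }
>     }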
> 
> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
> 
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 
>> Note: after clearing the mask bit, the event may be raised and thus
>> may already be linked by the time the hypercall is made.
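> 
> For completeness, the mask and unmask operations as a guest-side C
> sketch.  `EVTCHNOP_unmask` and `struct evtchn_unmask` are the existing
> ABI's unmask sub-op and argument structure; the rest reuses the earlier
> bit definitions:
> 
>     static void evtchn_fifo_mask(event_word_t *E, uint32_t p)
>     {
>         /* E[p].masked = 1 */
>         __atomic_fetch_or(&E[p], EVTCHN_FIFO_MASKED, __ATOMIC_SEQ_CST);
>     }
> 
>     static void evtchn_fifo_unmask(event_word_t *E, uint32_t p)
>     {
>         struct evtchn_unmask unmask = { .port = p };
> 
>         /* E[p].masked = 0 */
>         __atomic_fetch_and(&E[p], ~EVTCHN_FIFO_MASKED, __ATOMIC_SEQ_CST);
> 
>         /* A pending event must be (re)linked by Xen. */
>         if ( __atomic_load_n(&E[p], __ATOMIC_SEQ_CST) & EVTCHN_FIFO_PENDING )
>             HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
>     }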
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
