
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



On 04/02/2013 17:52, "David Vrabel" <david.vrabel@xxxxxxxxxx> wrote:

> Hi,
> 
> Here is a design for a scalable event channel ABI for Xen.  It has a
> number of benefits over the design currently being proposed by Wei Liu.
> 
> * More event channels (>100,000).
> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.

I have some sympathy for this design. Its primary downside compared with
the 3-level alternative is its greater space cost (32*#vcpus). However, as
you say, the fairness and prioritisation features are rather nice. Also,
having the data structures be per-vcpu may well help avoid cacheline
contention on busy multi-vcpu guests.

Interested in what others think, but I may actually prefer a ground-up
redesign like this.

 -- Keir

> The PDF version is easier to read and has diagrams and readable maths,
> but the original markdown-format document is included below (for ease
> of commenting).
> 
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
> 
> Introduction
> ============
> 
> Purpose
> -------
> 
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests.  The current event channel ABI
> provided by Xen only supports up to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels.  This limits scalability as
> support for more VMs, VCPUs and devices is required.
> 
> The existing ABI does not easily allow events to have different
> priorities.  Current Linux kernels prioritize the timer event by
> special-casing it, but this approach does not generalize to more
> events.  Event priorities may be useful for prioritizing MMIO
> emulation requests over bulk data traffic (such as network or disk).
> 
> This design replaces the existing event channel ABI with one that:
> 
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
> 
> - allows guests to use up to 16 different event priorities.
> 
> System Overview
> ---------------
> 
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
> 
> Design Map
> ----------
> 
> A new event channel ABI requires changes to Xen and the guest kernels.
> 
> References
> ----------
> 
> [FIXME: link to alternate proposal?]
> 
> Design Considerations
> =====================
> 
> Assumptions
> -----------
> 
> * Atomic read-modify-write of 32-bit words is possible on all
>   supported platforms.  This can be done with load-linked /
>   store-conditional instructions (e.g., ARMv8's ldxr/stxr) or a
>   compare-and-swap (e.g., x86's cmpxchg).
> 
> Constraints
> -----------
> 
> * The existing ABI must continue to be usable.  Compatibility with
>   existing guests is mandatory.
> 
> Risks and Volatile Areas
> ------------------------
> 
> * Should the 3-level proposal be merged into Xen then this design does
>   not offer enough improvements to warrant the cost of maintaining
>   three different event channel ABIs in Xen and guest kernels.
> 
> * The performance of some operations may be decreased.  Specifically,
>   retriggering an event now always requires a hypercall.
> 
> Architecture
> ============
> 
> Overview
> --------
> 
> The event channel ABI uses a data structure that is shared between Xen
> and the guest.  Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall).  The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.
> 
> Events are reported to a guest's VCPU using a FIFO queue.  There is a
> queue for each priority level and each VCPU.
> 
> Each event has a _pending_ and a _masked_ bit.  The pending bit
> indicates the event has been raised.  The masked bit is used by the
> guest to prevent delivery of that specific event.
> 
> 
> High Level Design
> =================
> 
> Shared Event Data Structure
> ---------------------------
> 
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
> 
> - _event array_: A logical array of event words (one per event
>   channel) which contains the pending and mask bits and the link index
>   for the next event in the queue.
> 
> ![\label{fig_event-word}Event Array Word](event-word.pdf)
> 
> - _control block_: This contains the head and tail indexes for each
>   priority level.  Each VCPU has its own control block and this is
>   contained in the existing `struct vcpu_info`.
> 
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make them virtually contiguous
> for ease of implementation, e.g., by using vmap() in Xen or vmalloc()
> in Linux.  Pages are added by the guest as required by the number of
> bound event channels.
> 
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events.  This limit can be trivially increased without any
> other changes to the ABI.  Bits [28:17] are reserved for future
> expansion or for other uses.
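> 
> As a concrete illustration, the event word might be described in C as
> below.  The LINK width and the reserved range are taken from the text
> above; the exact bit positions of the L, P and M bits are defined by
> the figure, so the values used here are assumptions for the sketch
> only:
> 
>     #include <stdint.h>
> 
>     typedef uint32_t event_word_t;
> 
>     #define EVTCHN_FIFO_LINK_BITS  17
>     #define EVTCHN_FIFO_LINK_MASK  ((1u << EVTCHN_FIFO_LINK_BITS) - 1)
> 
>     /* Assumed positions for the three state bits (bits [28:17] reserved). */
>     #define EVTCHN_FIFO_LINKED     (1u << 29)   /* L */
>     #define EVTCHN_FIFO_PENDING    (1u << 30)   /* P */
>     #define EVTCHN_FIFO_MASKED     (1u << 31)   /* M */
> 
>     /* Extract the index of the next event in the queue (0 = end of queue). */
>     static inline uint32_t evtchn_link(event_word_t w)
>     {
>         return w & EVTCHN_FIFO_LINK_MASK;
>     }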
> 
> Event State Machine
> -------------------
> 
> Event channels are bound to a domain using the existing ABI.
> 
> A bound event may be in one of three main states.
> 
> State    Abbrev.  Meaning
> -----    -------  -------
> BOUND    B        The event is bound but not pending.
> PENDING  P        The event has been raised and not yet acknowledged.
> LINKED   L        The event is on an event queue.
> 
> Additionally, events may be UNMASKED or MASKED (M).
> 
> ![\label{events_sm}Event State Machine](event-sm.pdf)
> 
> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.
> 
> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue
> and the tail index is the index of the last element.  Every element
> within the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
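> 
> For illustration, a control block laid out as described -- a head and
> tail index per priority -- could look like the C sketch below.  The
> structure and field names are hypothetical; the real block is embedded
> in the existing `struct vcpu_info`:
> 
>     #include <stdint.h>
> 
>     #define EVTCHN_FIFO_MAX_QUEUES 16   /* one queue per priority level */
> 
>     struct evtchn_fifo_control_block {
>         struct {
>             uint32_t head;   /* index of the first queued event; 0 = empty */
>             uint32_t tail;   /* index of the last queued event */
>         } queue[EVTCHN_FIFO_MAX_QUEUES];
>     };
> 
> At 8 bytes per priority level this is the 128 bytes accounted for in
> the Memory Usage section below.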
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
> 
> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
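> 
> For example, a guest might register its first event array page during
> VCPU bring-up with a wrapper like the sketch below.  It assumes the new
> sub-op is issued through the existing `HYPERVISOR_event_channel_op`
> hypercall and that the guest has already allocated and zeroed the page;
> the `EVTCHNOP_init` sub-op number is not defined in this draft:
> 
>     /* Hypothetical guest-side wrapper for the proposed sub-op. */
>     static int evtchn_fifo_init_vcpu(uint32_t vcpu, uint64_t array_pfn)
>     {
>         struct evtchnop_init init = {
>             .vcpu      = vcpu,
>             .array_pfn = array_pfn,
>         };
> 
>         return HYPERVISOR_event_channel_op(EVTCHNOP_init, &init);
>     }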
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appending an
> additional page.
> 
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
> 
> It is not possible to shrink the event array once it has been
> expanded.
> 
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or not initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.
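> 
> A guest-side sketch of the growth path described above is shown below:
> when binding would exceed the current array capacity, one page is
> appended for each VCPU.  `alloc_event_array_pfn()` is a placeholder for
> whatever page allocation the guest uses, and the sketch assumes each
> call supplies a fresh page:
> 
>     /* Hypothetical: append one event array page for every VCPU. */
>     static int evtchn_fifo_expand_all(uint32_t nr_vcpus)
>     {
>         uint32_t cpu;
> 
>         for (cpu = 0; cpu < nr_vcpus; cpu++) {
>             struct evtchnop_expand expand = {
>                 .vcpu      = cpu,
>                 .array_pfn = alloc_event_array_pfn(),   /* placeholder */
>             };
>             int ret = HYPERVISOR_event_channel_op(EVTCHNOP_expand, &expand);
> 
>             if (ret)
>                 return ret;
>         }
> 
>         return 0;
>     }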
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
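> 
> For example, a guest could request a higher-than-default priority for a
> port before binding it, along the lines of the sketch below (the choice
> of priority value is up to the guest, as noted above):
> 
>     /* Hypothetical: must be called while the port is still unbound. */
>     static int evtchn_fifo_set_priority(uint32_t port, uint32_t priority)
>     {
>         struct evtchnop_set_priority op = {
>             .port     = port,
>             .priority = priority,   /* valid range 0 - 15; default is 7 */
>         };
> 
>         return HYPERVISOR_event_channel_op(EVTCHNOP_set_priority, &op);
>     }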
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 
> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domain's event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assumptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs, the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16-bit domain ID
> (~32,000 VMs).
> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
> 
> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
> 
> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.
> 
>     E[p].pending = 1
>     if not E[p].linked and not E[p].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
> 
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
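> 
> A C rendering of the pseudo code above is sketched below, with the
> per-event-queue lock made explicit.  It reuses the bit definitions from
> the event word sketch earlier; `struct evtchn_queue`,
> `evtchn_word_update()` and the use of Xen's `spinlock_t` are
> illustrative assumptions, not the draft's definitions:
> 
>     /* Hypothetical Xen-internal queue state (one per VCPU per priority). */
>     struct evtchn_queue {
>         spinlock_t lock;   /* serializes Xen-side writers for this queue */
>         uint32_t   head;   /* H */
>         uint32_t   tail;   /* T */
>     };
> 
>     /* Atomically clear the bits in 'clr', then set the bits in 'set'. */
>     static void evtchn_word_update(event_word_t *w, uint32_t set, uint32_t clr)
>     {
>         uint32_t old = __atomic_load_n(w, __ATOMIC_RELAXED);
> 
>         while ( !__atomic_compare_exchange_n(w, &old, (old & ~clr) | set,
>                                              0, __ATOMIC_SEQ_CST,
>                                              __ATOMIC_RELAXED) )
>             ;   /* 'old' is refreshed on failure; retry */
>     }
> 
>     static void evtchn_fifo_raise(struct evtchn_queue *q,
>                                   event_word_t *E, uint32_t p)
>     {
>         uint32_t w;
> 
>         /* E[p].pending = 1 */
>         __atomic_fetch_or(&E[p], EVTCHN_FIFO_PENDING, __ATOMIC_SEQ_CST);
> 
>         spin_lock(&q->lock);
> 
>         w = __atomic_load_n(&E[p], __ATOMIC_SEQ_CST);
>         if ( !(w & EVTCHN_FIFO_LINKED) && !(w & EVTCHN_FIFO_MASKED) )
>         {
>             /* E[p].linked = 1; E[p].link = 0 */
>             evtchn_word_update(&E[p], EVTCHN_FIFO_LINKED,
>                                EVTCHN_FIFO_LINK_MASK);
>             __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* mb() */
> 
>             if ( q->head == 0 )
>                 q->head = p;                           /* queue was empty */
>             else
>                 evtchn_word_update(&E[q->tail], p,     /* E[T].link = p */
>                                    EVTCHN_FIFO_LINK_MASK);
>             q->tail = p;
>         }
> 
>         spin_unlock(&q->lock);
>     }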
> 
> Consuming Events
> ----------------
> 
> The guest consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[p].linked = 0
>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
> 
>> Note: When the event queue contains a single event, removing the
>> event may race with Xen appending another event, because the load of
>> `E[p].link` and the store of `H` are not performed as a single atomic
>> operation.  To avoid this race, the guest must recheck `E[p].link` if
>> the list appeared empty.
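> 
> A guest-side C sketch of the same loop, again reusing the earlier bit
> definitions.  `H` points at the head index in the control block for the
> priority being drained, and `handle_port()` stands in for the guest's
> per-event handler (which clears the pending bit as noted above):
> 
>     static void evtchn_fifo_consume(uint32_t *H, event_word_t *E)
>     {
>         while ( *H != 0 )
>         {
>             uint32_t p = *H;
> 
>             *H = evtchn_link(E[p]);
>             if ( *H == 0 )
>             {
>                 /* Queue looked empty: Xen may have appended concurrently. */
>                 __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* mb() */
>                 *H = evtchn_link(E[p]);
>             }
> 
>             /* E[p].linked = 0: p has been removed from the queue. */
>             __atomic_fetch_and(&E[p], ~EVTCHN_FIFO_LINKED, __ATOMIC_SEQ_CST);
> 
>             if ( !(__atomic_load_n(&E[p], __ATOMIC_SEQ_CST)
>                    & EVTCHN_FIFO_MASKED) )
>                 handle_port(p);
>         }
>     }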
> 
> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
> 
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 
>> Note: after clearing the mask bit, the event may be raised and thus
>> may already be linked by the time the hypercall is made.
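> 
> For completeness, the mask and unmask operations as a guest-side C
> sketch.  `EVTCHNOP_unmask` and `struct evtchn_unmask` are the existing
> ABI's unmask sub-op and argument structure; the rest reuses the earlier
> bit definitions:
> 
>     static void evtchn_fifo_mask(event_word_t *E, uint32_t p)
>     {
>         /* E[p].masked = 1 */
>         __atomic_fetch_or(&E[p], EVTCHN_FIFO_MASKED, __ATOMIC_SEQ_CST);
>     }
> 
>     static void evtchn_fifo_unmask(event_word_t *E, uint32_t p)
>     {
>         struct evtchn_unmask unmask = { .port = p };
> 
>         /* E[p].masked = 0 */
>         __atomic_fetch_and(&E[p], ~EVTCHN_FIFO_MASKED, __ATOMIC_SEQ_CST);
> 
>         /* A pending event must be (re)linked by Xen. */
>         if ( __atomic_load_n(&E[p], __ATOMIC_SEQ_CST) & EVTCHN_FIFO_PENDING )
>             HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
>     }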
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
