
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



On 05/02/13 16:10, Ian Campbell wrote:
> On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
>> The pages within the event array need not be physically or virtually
>> contiguous, but the guest or Xen may make them virtually contiguous
>> for ease of implementation, e.g. by using vmap() in Xen or vmalloc()
>> in Linux.  Pages are added by the guest as required by the number of
>> bound event channels.
> 
> Strictly speaking the size is related to the maximum bound evtchn, which
> need not be the same as the number of bound evtchns.

Yes.

>> The state of an event is tracked using 3 bits within the event word:
>> the M (masked), P (pending) and L (linked) bits.  Only state
>> transitions that change a single bit are valid.
> 
> The diagram shows transitions B<->P and P<->L, which involve two bits in
> each case. So do BM<->PM and LM<->BM now I think about it.

B == 000b, P == 100b, L == 101b.

Similarly for the masked transitions:

BM == 010b, PM == 110b, LM == 111b.
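
For illustration, one possible bit layout consistent with those encodings
(the actual bit positions are not fixed by the draft):

    /* Illustrative only: bit 2 = P, bit 1 = M, bit 0 = L, which gives
     * B = 000b, P = 100b, L = 101b, BM = 010b, PM = 110b, LM = 111b. */
    #define EVTCHN_LINKED   (1u << 0)   /* L: event is on a queue */
    #define EVTCHN_MASKED   (1u << 1)   /* M: event is masked */
    #define EVTCHN_PENDING  (1u << 2)   /* P: event is pending */

    /* B <-> P flips only the P bit and P <-> L flips only the L bit, so
     * every transition in the diagram changes a single bit. */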

> Is the L bit redundant with the LINK field == 0 or != 0?

LINK == 0 is the end of queue marker.  We could use all ones to mean
'unlinked' but using the L bit allows the guest to remove the event from
the queue by clearing a single bit, instead of writing the LINK field.
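
As a sketch of the difference, assuming Linux-style atomic bit helpers
and the illustrative bit positions above (event_words[] is a hypothetical
array of 32-bit event words):

    /* With a dedicated L bit, unlinking is a single atomic bit clear: */
    clear_bit(0 /* L */, (unsigned long *)&event_words[port]);

    /* With "LINK == all ones means unlinked", the guest would instead
     * need an atomic read-modify-write (e.g. a cmpxchg loop) over the
     * whole event word to rewrite the LINK field. */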

>> ### `EVTCHNOP_init`
>>
>> This call initializes a single VCPU's event channel data structures,
>> adding one page for the event array.
> 
> Isn't the event array per-domain?

Er, yes.  I changed this from per-VCPU and it looks like I didn't make
all the updates required.

>> A guest should call this during initial VCPU bring up (and on resume?).
>>
>>     struct evtchnop_init {
>>         uint32_t vcpu;
>>         uint64_t array_pfn;
>>     };
> 
> I think this will have a different layout on 32 and 64 bit x86, if you
> care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).

Ok.  I thought uint64_t was always 8-byte aligned on both.
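
A sketch of one way to get an identical layout on 32- and 64-bit x86 is
to add the padding explicitly rather than rely on the compiler:

    struct evtchnop_init {
        uint32_t vcpu;
        uint32_t pad;        /* keep array_pfn at an 8-byte offset even on
                                32-bit x86, where uint64_t is 4-byte aligned */
        uint64_t array_pfn;
    };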

>> Field          Purpose
>> -----          -------
>> `vcpu`         [in] The VCPU number.
>> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>>                of the event array.
>>
>> Error code  Reason
>> ----------  ------
>> EINVAL      `vcpu` is invalid or already initialized.
>> EINVAL      `array_pfn` is not a valid frame for the domain.
>> ENOMEM      Insufficient memory to allocate internal structures.
>>
>> ### `EVTCHNOP_expand`
>>
>> This call expands the event array for a VCPU by appending an
>> additional page.
> 
> This doesn't seem all that different to _init, except the former handles
> the transition from 0->1 event pages and this handles N->N+1?

Agreed, I'll fold the two together.
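
The folded call might end up looking much the same as the current
EVTCHNOP_init (this is only a sketch, and per the per-domain point above
the vcpu field may go away entirely):

    struct evtchnop_expand {
        uint32_t vcpu;       /* may become redundant if the array is per-domain */
        uint32_t pad;
        uint64_t array_pfn;  /* PFN or GMFN of the page to append; the first
                                call for a domain supplies the first page */
    };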

>> ### `EVTCHNOP_set_priority`
>>
>> This call sets the priority for an event channel.  The event must be
>> unbound.
>>
>> A guest may call this prior to binding an event channel. The meaning
>> and the use of the priority are up to the guest.  Valid priorities are
>> 0 - 15 and the default is 7.
>>
>>     struct evtchnop_set_priority {
>>         uint32_t port;
>>         uint32_t priority;
>>     };
>>
>> Field       Purpose
>> -----       -------
>> `port`      [in] The event channel.
>> `priority`  [in] The priority for the event channel.
>>
>> Error code  Reason
>> ----------  ------
>> EINVAL      `port` is invalid.
>> EINVAL      `port` is currently bound.
> 
> EBUSY?

Sure.

>> Memory Usage
>> ------------
>>
>> Alternatively, we can consider a system with $D$ driver domains, each
>> of which requires $E_D$ events, and a dom0 using the maximum number of
>> pages (128).
>>
>> \begin{eqnarray*}
>> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
>> \end{eqnarray*}
>>
>> With, for example, 16 driver domains each using the maximum number of pages:
>> \begin{eqnarray*}
>> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>>    & = & 129 \times 10^3\text{ VMs}
>> \end{eqnarray*}
> 
> This accounts for the driver domains and dom0 but not the domains which
> they are serving, doesn't it?

This is calculating the number of pages left after subtracting those
used for dom0 and the driver domains.  This is the same as the number
of supportable VMs, since each of the other VMs only requires a single
page.
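
Spelling the arithmetic out:

\begin{eqnarray*}
V & = & 131072 - (128 + 16 \times 128) \\
  & = & 131072 - 2176 \\
  & = & 128896 \approx 129 \times 10^3\text{ VMs}
\end{eqnarray*}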

>> In summary, there is space to map the event arrays for over 100,000
>> VMs.  This is more than the limit imposed by the 16 bit domain ID
>> (~32,000 VMs).
> 
> Is there scope to reduce the maximum then?

Maximum of what?

>> ### Control Block
>>
>> With $L$ priority levels and two 32-bit words for the head and tail
>> indexes, the amount of space ($S$) required in the `struct vcpu_info`
>> for the control block is:
>> \begin{eqnarray*}
>> S & = & L \times 2 \times 4 \\
>>   & = & 16 \times 2 \times 4 \\
>>   & = & 128\text{ bytes}
>> \end{eqnarray*}
>>
>> There is more than enough space within `struct vcpu_info` for the
>> control block while still keeping plenty of space for future use.
> 
> There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
> which I suppose becomes redundant now.
> 
> I don't think it would be a problem to predicate use of this new
> interface on the use of the VCPU_placement API and therefore give scope
> to expand the vcpu_info.

I should have taken a proper look at where the vcpu_info comes from...
Oops.

We can add a new VCPUOP_register_vcpu_info_and_control_block which will
register a struct vcpu_info together with the control block.
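
A rough sketch of what that could look like (names and layout entirely
hypothetical; the control block is just the 16 head/tail pairs from the
calculation above):

    struct evtchn_control_block {
        struct {
            uint32_t head;
            uint32_t tail;
        } queue[16];                   /* one pair per priority = 128 bytes */
    };

    struct vcpu_register_vcpu_info_and_control_block {
        uint64_t mfn;                  /* frame containing both structures */
        uint32_t vcpu_info_offset;
        uint32_t control_block_offset;
    };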

>> Low Level Design
>> ================
>>
>> In the pseudo code in this section, all memory accesses are atomic,
>> including those to bit-fields within the event word.
> 
>> These variables have a standard meaning:
>>
>> Variable  Purpose
>> --------  -------
>> E         Event array.
>> p         A specific event.
>> H         The head index for a specific priority.
>> T         The tail index for a specific priority.
>>
>> These memory barriers are required:
>>
>> Function  Purpose
>> --------  -------
>> mb()      Full (read/write) memory barrier
>>
>> Raising an Event
>> ----------------
>>
>> When Xen raises an event it marks it pending and (if it is not masked)
>> adds it to the tail of the event queue.
> 
> What are the conditions for actually performing the upcall when
> returning from the hypervisor to the guest?

The upcall is performed if any H != 0 for that VCPU.  This means we
should have a per-VCPU bitmap of non-empty queues.
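
In guest terms the upcall would then be driven by that bitmap, something
like the following (the "ready" field and consume_queue() are assumptions;
I'm assuming lower numbers mean higher priority and that consume_queue()
clears the bit once its queue is empty):

    /* Upcall handler: drain each non-empty queue for this VCPU, highest
     * priority first. */
    while (ctrl->ready) {
        unsigned int q = __ffs(ctrl->ready);  /* lowest set bit */
        consume_queue(q);                     /* the consume loop below */
    }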

>>     E[p].pending = 1
>>     if not E[p].linked and not E[p].masked
>>         E[p].linked = 1
>>         E[p].link = 0
>>         mb()
>>         if H == 0
>>             H = p
>>         else
>>             E[T].link = p
>>         T = p
> 
> Do you not need a barrier towards the end here to ensure that a consumer
> who is currently processing interrupts sees the updated link when they
> get there?

I need to take another look at the barriers -- I'll get back to you on
this and the others you highlighted.

>> Concurrent access by Xen to the event queue must be protected by a
>> per-event queue spin lock.
>>
>> Consuming Events
>> ----------------
>>
>> The guest consumes events starting at the head until it reaches the
>> tail.  Events in the queue that are not pending or are masked are
>> consumed but not handled.
>>
>>     while H != 0
>>         p = H
>>         H = E[p].link
>>         if H == 0
>>             mb()
>>             H = E[p].link
>>         E[H].linked = 0
> 
> Did you mean E[p].linked here?
> 
> If at this point the interrupt is reraised then the if in the raising
> pseudo code becomes true, linked will be set again, and don't we also
> race with the clearing of E[p].pending below?

The event will be correctly linked and then we clear P; when we consume
the linked event it will be ignored because P is already clear.

Perhaps we could also avoid linking the event if it is already pending?

>>> Note: When the event queue contains a single event, removing the
>>> event may race with Xen appending another event because the load of
>>> `E[p].link` and the store of `H` are not a single atomic operation.
>>> To avoid this race, the guest must recheck `E[p].link` if the list
>>> appeared empty.
> 
> It appears that both "Raising an Event" and "Consuming Events" can write
> H? Is that correct? Likewise for the linked bit.

Xen only writes H when the queue is empty; the VM only writes H if it
is non-empty.

The linked bit is set by Xen and cleared by the guest.

Do you see a problem with this?

> It's a bit unclear because the pseudo-code doesn't make it explicit
> which variables are part of the shared data structure and which are
> private to the local routine.

Capital variables are in the shared structures.  Lower-case are local.
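
To make that split explicit, the consume loop for one queue might read as
follows; q->head and the E[] words are shared, everything else is local
(read_atomic()/write_atomic()/clear_linked() are assumed helpers):

    uint32_t p, h;

    while ((h = read_atomic(&q->head)) != 0) {
        p = h;
        h = read_atomic(&E[p].link);
        if (h == 0) {
            mb();                     /* recheck: Xen may have appended */
            h = read_atomic(&E[p].link);
        }
        write_atomic(&q->head, h);
        clear_linked(&E[p]);          /* clear the L bit of event p */
        if (E[p].pending && !E[p].masked)
            handle_event(p);          /* hypothetical per-event handler */
    }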

>> Masking Events
>> --------------
>>
>> Events are masked by setting the masked bit.  If the event is pending
>> and linked it does not need to be unlinked.
>>
>>     E[p].masked = 1
>>
>> Unmasking Events
>> ----------------
>>
>> Events are unmasked by the guest by clearing the masked bit.  If the
>> event is pending the guest must call the event channel unmask
>> hypercall so Xen can link the event into the correct event queue.
>>
>>     E[p].masked = 0
>>     if E[p].pending
>>         hypercall(EVTCHN_unmask)
> 
> Can the hypercall do the E[p].masked = 0?

Sure.
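
The guest side would then reduce to something like this (a sketch using
the existing EVTCHNOP_unmask and Linux's HYPERVISOR_event_channel_op
wrapper):

    static void evtchn_unmask(uint32_t port)
    {
        struct evtchn_unmask unmask = { .port = port };

        /* Xen clears E[p].masked and, if the event is pending, links it
         * onto the correct event queue. */
        HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
    }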

David
