
[Xen-devel] Wait Queues

Hi all,
we currently have a problem in the (x86) mm layer. Callers may request the p2m 
to perform a translation of a gfn to an mfn. Such translation may need to wait 
for a third party to service it. This happens when:

- a page needs to be paged in
- a CoW breaking of a shared page fails due to lack of memory

Note that paging in may also fail due to lack of memory. In both ENOMEM cases, 
all the plumbing for a toolstack to be notified and take some corrective action 
to release some memory and retry is in place. We also have plumbing for pagers 
in place.

Ideally we want the internals to be self-contained, so that callers need not be 
concerned with any of this. A request for a p2m translation may or may not 
sleep, but on exit from the p2m the caller either has an mfn with a page ref, 
or an error code due to some other condition.
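To make that contract concrete, here is a minimal sketch in plain C. It is a toy model, not Xen's p2m code: every name (toy_translate, toy_entry, toy_wait_for_pager) is hypothetical, and the "sleep" is an ordinary function call. The only point is the shape of the interface: the caller gets an mfn with a reference taken, or a negative error code, and never a "retry".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types; these are not Xen's real p2m structures. */
typedef uint64_t toy_gfn_t;
typedef uint64_t toy_mfn_t;

enum toy_state { TOY_PRESENT, TOY_PAGED_OUT, TOY_INVALID };

struct toy_entry {
    enum toy_state state;
    toy_mfn_t mfn;
    unsigned int refs;
};

/* Stand-in for sleeping until a third party (the pager) brings the page in. */
static void toy_wait_for_pager(struct toy_entry *e)
{
    e->state = TOY_PRESENT;
}

/*
 * On success: returns 0, stores the mfn in *mfn and takes one reference.
 * On failure: returns a negative errno value.
 * There is deliberately no "please retry" result.
 */
static int toy_translate(struct toy_entry *table, toy_gfn_t nr,
                         toy_gfn_t gfn, toy_mfn_t *mfn)
{
    struct toy_entry *e;

    assert(table != NULL && mfn != NULL);

    if (gfn >= nr)
        return -EINVAL;

    e = &table[gfn];
    if (e->state == TOY_INVALID)
        return -EINVAL;

    if (e->state == TOY_PAGED_OUT)
        toy_wait_for_pager(e);   /* the real thing may sleep here */

    e->refs++;
    *mfn = e->mfn;
    return 0;
}
```

A caller of such an interface only needs to check the return value and later drop the reference it was given.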

Wait queue support in (x86) Xen prevents sleeping on a wait queue if any locks 
are held, including RCU read-side locks (i.e. the wait path asserts 
!in_atomic()).
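The restriction can be pictured with a toy preempt counter. This is a sketch under assumptions, not the hypervisor's code: the toy_ names are invented for illustration, and the counter is a single global rather than per-CPU state. Every lock acquisition (RCU read-side included) bumps the counter, and sleeping is only permitted when it is back at zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one global counter standing in for per-CPU state. */
static unsigned int toy_preempt_count;

/* Taking any lock, including an RCU read-side lock, marks us atomic. */
static void toy_lock_acquire(void)
{
    toy_preempt_count++;
}

static void toy_lock_release(void)
{
    assert(toy_preempt_count > 0);
    toy_preempt_count--;
}

static bool toy_in_atomic(void)
{
    return toy_preempt_count != 0;
}

/* Sleeping on a wait queue is only legal outside atomic context. */
static bool toy_may_sleep(void)
{
    return !toy_in_atomic();
}
```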

For this reason, we have not yet implemented sleeping on the p2m. Callers may 
get errors telling them to retry. A lot of (imho) ugly code is peppered around 
the hypervisor to deal with the consequences of this. More fundamentally, in 
some cases there is no possible elegant handling, and guests are crashed (for 
example, if a page table page is paged out and the hypervisor needs to 
translate a guest virtual address to a gfn). This limits the applicability of 
memory paging and sharing.

One way to solve this would be to ensure no code path liable to sleep in a wait 
queue is holding any locks at wait queue sleep time. I believe this is doomed, 
and not just because it is a herculean task. It also makes writing hypervisor 
code *very* difficult: anyone trying to throw a p2m translation into a code 
path needs to think of all possible upstream call sequences. Not even RCU read 
locks are allowed.

I'd like to propose an approach that ensures that, as long as some properties 
are met, arbitrary wait queue sleep is allowed. Here are the properties:
1. Third parties servicing a wait queue sleep are indeed third parties. In 
other words, dom0 does not do paging.
2. Vcpus of a wait queue servicing domain may never go to sleep on a wait queue 
during a foreign map.
3. A guest vcpu may go to sleep on a wait queue holding any kind of lock as 
long as it does not hold the p2m lock.
4. "Kick" routines in the hypervisor by which service domains effectively wake 
up a vcpu may only take the p2m lock to do a fix up of the p2m.
5. Wait queues can be awakened on a special domain destroy condition.

Property 1. is hopefully self-evident. Although it is not enforced in the code, 
doing so would be reasonably simple.

Property 2. is also self-evident and enforced in the code today.

Property 3. simplifies reasoning about p2m translations and wait queue 
sleeping, and provides a clean model for what might or might not happen. It 
will require us to restructure some code paths (e.g. XENMEM_add_to_physmap) 
that need multiple p2m translations in a single critical region, so that they 
perform all translations up front.
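The restructuring meant here is roughly the following two-phase pattern, shown as a toy sketch. None of this is the real XENMEM_add_to_physmap path: the names are hypothetical, the lock is a plain flag, and the mapping arithmetic is arbitrary. The assert in toy_translate_may_sleep stands for the rule that nobody sleeps while holding the p2m lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_gfn_t;
typedef uint64_t toy_mfn_t;

/* Toy stand-in for the p2m lock. */
static bool toy_p2m_locked;

/* A translation that is allowed to sleep, so the p2m lock must not be held. */
static int toy_translate_may_sleep(toy_gfn_t gfn, toy_mfn_t *mfn)
{
    assert(!toy_p2m_locked);     /* property 3 in the list above */
    if (gfn == 0)
        return -EINVAL;          /* arbitrary "bad gfn" for the sketch */
    *mfn = gfn + 0x1000;         /* arbitrary toy mapping */
    return 0;
}

/* Two translations are needed; both happen before the critical region. */
static int toy_add_mapping(toy_gfn_t src, toy_gfn_t dst, toy_mfn_t out[2])
{
    toy_mfn_t s, d;
    int rc;

    assert(out != NULL);

    /* Phase 1: everything that might sleep, with no p2m lock held. */
    rc = toy_translate_may_sleep(src, &s);
    if (rc)
        return rc;
    rc = toy_translate_may_sleep(dst, &d);
    if (rc)
        return rc;

    /* Phase 2: the critical region only consumes the results. */
    toy_p2m_locked = true;
    out[0] = s;
    out[1] = d;
    toy_p2m_locked = false;

    return 0;
}
```

A real version would also have to revalidate the results inside the critical region, or hold page references across it, since the p2m may change between the two phases.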

Property 4. is already enforced in the code as it stands.

Property 5. requires adding some logic to the top of domain destruction: set a 
flag, then wake up all vcpus in wait queues. Extra logic on the wait queue side 
will exit the wait with an error if the destroy flag is set. Most if not all 
callers can already deal with a p2m translation error (other than paging), 
and unwind and release all their locks.
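The wake-up logic could look something like the sketch below. It is a single-threaded toy model with invented names (toy_domain, toy_wait_outcome), and the choice of -ESRCH for the "domain is dying" error is an arbitrary assumption. What it shows is the decision a woken vcpu makes: the destroy flag takes priority, then the normal "page is ready" condition, otherwise go back to sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy domain state; a real domain has far more than this. */
struct toy_domain {
    bool dying;            /* set at the top of domain destruction */
    bool page_ready;       /* set when the pager has serviced the request */
    unsigned int wakeups;  /* how many times sleepers were woken */
};

/* Top of domain destruction: set the flag, then wake all sleepers. */
static void toy_domain_begin_destroy(struct toy_domain *d)
{
    assert(d != NULL);
    d->dying = true;
    d->wakeups++;
}

/* The pager's "kick": the page is in place, wake the sleeper. */
static void toy_pager_kick(struct toy_domain *d)
{
    assert(d != NULL);
    d->page_ready = true;
    d->wakeups++;
}

/*
 * Evaluated each time a sleeping vcpu is woken.
 * Returns 0 to proceed, -ESRCH (arbitrary choice) if the domain is dying,
 * or -EAGAIN as an internal "go back to sleep" that never reaches callers.
 */
static int toy_wait_outcome(const struct toy_domain *d)
{
    assert(d != NULL);
    if (d->dying)
        return -ESRCH;
    if (d->page_ready)
        return 0;
    return -EAGAIN;
}
```

Checking the destroy flag first means a dying domain unwinds with an error even if the pager happened to complete at the same time.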

I confess my understanding of RCU is not 100% there and I am not sure what will 
happen to force_quiescent_state. I also understand there is an impedance 
mismatch wrt "saving" and "restoring" the physical CPU preempt count.
