Re: [Xen-devel] [RFC] [PATCH] x86/mm: use wait queues for mem_paging
> Hi,
>
> At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
>> Well, thanks for taking a stab at it. Looks fairly good. My main
>> observation is that we do not want to unconditionally go on a wait
>> queue everywhere. For example, the grant table code pretty reliably
>> unwinds itself and correctly tells the grant mapper to retry, in the
>> presence of a paged page.
>
> The grant table code is unlikely to hit a paged-out gfn in the caller's
> address space.

Oh yeah they do. At least for the target gfns behind the grants. That's
why we need the retry loops in the PV backends in dom0.

>
>> The same could be said of emulation (X86_EMUL_RETRY).
>
> Fair enough.  But the emulator uses hvm_copy() in its callbacks anyway,
> so we'd need another flag to say 'special handling for p2m fixups plz'
> in all the hvm_copy() interfaces.  Unpleasant!
>
>> And certainly the balloon code should not sleep waiting for populate!
>
> The balloon code should not be trying to populate at all!

Agreed. get_gfn_query is needed there. Nothing bad happens because we
immediately send drop_page. But it's wrong-ish.

>
>> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if
>> the wait queue code is invoked only there. And then we can call
>> get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
>> __hvm_copy.
>
> I see.  In the longer term I definitely do not want get_gfn()'s callers
> to have to deal with p2m_is_paged(), and PoD, and sharing, and every
> other p2m hack we come up with.  I want all that stuff hidden behind
> get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.
>
> I don't like get_gfn_sleep() very much but I'll see if I can find a way
> to do the equivalent with a flag to get_gfn_*().  But I suspect it will
> end up with a tri-state argument of:
>  a) just give me the current state;
>  b) try (without sleeping) to fix up paging, PoD and sharing but
>     still maybe make me handle it myself by error propagation; and
>  c) just fix it up.
>
> 'b' is a pretty weird interface.  Especially given that in corner
> cases, 'make me handle it' might involve hitting a wait queue anyway.
> If we intend in the fullness of time to get rid of it (as I hope we
> would) then probably we shouldn't introduce it.  (And if that means
> pushing this whole change out past 4.2, we should consider doing that.)

It's weird. b) is an evolutionary side-effect. It should disappear. I
guess what I'm arguing for is "it should disappear in the long term, not
in a patch so close to 4.2".

>> >> diff -r 01c1bcbc6222 xen/arch/x86/mm/p2m.c
>> >> --- a/xen/arch/x86/mm/p2m.c    Thu Feb 16 08:48:23 2012 +0100
>> >> +++ b/xen/arch/x86/mm/p2m.c    Thu Feb 16 14:39:13 2012 +0000
>> >> @@ -160,13 +160,49 @@ mfn_t __get_gfn_type_access(struct p2m_d
>> >>      }
>> >>
>> >>      /* For now only perform locking on hap domains */
>> >> -    if ( locked && (hap_enabled(p2m->domain)) )
>> >> +    locked = locked && hap_enabled(p2m->domain);
>> >> +
>> >> +#ifdef __x86_64__

>> When are we getting rid of 32 bit hypervisor? (half kidding. Only half)
>
> Can't be soon enough for me.  I've been sharpening the axe for that one
> for years now. :)

What's stopping the death blow?
>
>> >> +again:
>> >> +#endif
>> >> +    if ( locked )
>> >>          /* Grab the lock here, don't release until put_gfn */
>> >>          gfn_lock(p2m, gfn, 0);
>> >>
>> >>      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
>> >>
>> >>  #ifdef __x86_64__
>> >> +    if ( p2m_is_paging(*t) && q != p2m_query
>> >> +         && p2m->domain == current->domain )
>> >> +    {
>> >> +        if ( locked )
>> >> +            gfn_unlock(p2m, gfn, 0);
>> >> +
>> >> +        /* Ping the pager */
>> >> +        if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
>> >> +            p2m_mem_paging_populate(p2m->domain, gfn);

>> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
>> still have the mfn in place.
>
> Doesn't p2m_mem_paging_populate already DTRT in all these cases?  If
> not, it really should.

It's not about populate, it's about killing the domain unnecessarily in
the in_atomic check.

>> >> +
>> >> +        /* Safety catch: it _should_ be safe to wait here but if it's not,
>> >> +         * crash the VM, not the host */
>> >> +        if ( in_atomic() )
>> >> +        {
>> >> +            WARN();
>> >> +            domain_crash(p2m->domain);
>> >> +        }
>> >> +        else
>> >> +        {
>> >> +            current->arch.mem_paging_gfn = gfn;
>> >> +            wait_event(current->arch.mem_paging_wq, ({
>> >> +                mfn = p2m->get_entry(p2m, gfn, t, a,
>> >> +                                     p2m_query, page_order);
>> >> +                (*t != p2m_ram_paging_in);

>> Well, first of all p2m->get_entry will not happen in a locked context,
>> so you will exit the wait loop not holding the p2m lock and crash
>> later.
>
> Yes - we always exit the loop without the lock; then we 'goto again' and
> do things properly.  I realise it's a bit inefficient, but on the
> page-in path the p2m lookups shouldn't be the bottleneck. :)  I did
> write the version that handled the locking in this loop but it got a bit
> twisty, plus we need the 'goto again' for other reasons (see below).

Yes, but you'll be reading the p2m in a racy manner.

>
> As a follow-up I ought to make the page-sharing fixup code that's just
> below this use 'goto again' as well; we shouldn't really leave this
> function until we get a lookup where all these cases are dealt with.

I was intending to do that, and then I stepped back and ended up with the
current patch submitted a couple of days ago. Same problem: can't safely
go to a wait queue.

I think we won't be able to have a magic get_gfn_solve_everything_silently()
call until we find a solution to the fact that get_gfn happens in locked
contexts. I've thought about a wacky idea that
1. schedules a sleep-n-retry softirq
2. fails the call
3. avoids crashing the domain on the unwind path
4. when executing softirqs, pauses the vcpu as per 1.
5. when woken up, retries the last failed hypercall/vmexit/whatever
I shouldn't have said that out loud...

>> So you want __get_gfn_type_access with q = p2m_query, and then put_gfn
>> if the condition fails. Probably not gonna fit in a ({}) block ;)
>>
>> I think checking for (!p2m_is_paging(*t)) is more appropriate.
>
> No, that was deliberate, and IIRC a change from Olaf's patch.  If the
> page gets paged in and back out while we're asleep, the type goes back
> to p2m_ram_paged, and in that case we _must_ break out and retry or this
> vcpu might stall forever.

I see, a new populate event is needed.
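
[ To make the get_gfn_sleep() proposal above concrete: a rough sketch of the
  shape it could take, for illustration only.  The name and signature are
  hypothetical, the unlocked re-query simply mirrors the wait_event condition
  in the patch, and none of this is code from the patch itself. ]

    /* Hypothetical sketch of a caller-facing get_gfn_sleep(); this helper
     * does not exist in the tree.  It retries the normal (non-sleeping)
     * lookup, pings the pager and sleeps on the per-vcpu wait queue added
     * by this patch until the gfn has left the paging-in state. */
    static mfn_t get_gfn_sleep(struct domain *d, unsigned long gfn,
                               p2m_type_t *t)
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(d);
        p2m_access_t a;
        mfn_t mfn;

        for ( ; ; )
        {
            mfn = get_gfn(d, gfn, t);       /* ordinary lookup (PoD etc.) */
            if ( !p2m_is_paging(*t) )
                return mfn;                 /* caller still does put_gfn() */
            put_gfn(d, gfn);

            /* Ping the pager if the page is on its way out, or fully out */
            if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
                p2m_mem_paging_populate(d, gfn);

            ASSERT(!in_atomic());           /* must be safe to sleep here */
            current->arch.mem_paging_gfn = gfn;
            wait_event(current->arch.mem_paging_wq, ({
                mfn = p2m->get_entry(p2m, gfn, t, &a, p2m_query, NULL);
                *t != p2m_ram_paging_in;    /* paged in, or paged out again */
            }));
        }
    }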
>
>> >>  void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
>> >>  {
>> >>      struct vcpu *v = current;
>> >> -    mem_event_request_t req;
>> >> +    mem_event_request_t req = {0};
>> >>      p2m_type_t p2mt;
>> >>      p2m_access_t a;
>> >>      mfn_t mfn;
>> >>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> >> +    int send_request = 0;
>> >>
>> >>      /* We're paging. There should be a ring */
>> >>      int rc = mem_event_claim_slot(d, &d->mem_event->paging);

>> Given all the optimizations around the send_request flag, it might be
>> worth moving the claim_slot call to an if(send_request) block at the
>> bottom, and getting rid of cancel_slot altogether. Claim_slot could
>> potentially go (or cause someone else to go) to a wait queue, and it's
>> best not to be that rude.
>
> Sure.  It might be tricky to make sure we unwind properly in the failure
> case, but I'll have a go at a follow-up patch that looks at that.
>
> (This is not really a regression in this patch AFAICS - in the existing
> code the get_gfn() callers end up calling this function anyway.)

*After* they've put the gfn. Not a regression, just cleaner code imo.

Andres

>
>> >> -    /* Pause domain if request came from guest and gfn has paging type */
>> >> -    if ( p2m_is_paging(p2mt) && v->domain == d )
>> >> +    /* No need to inform pager if the gfn is not in the page-out path */
>> >> +    if ( !send_request )
>> >>      {
>> >> -        vcpu_pause_nosync(v);

>> I see no sin in stacking vcpu_pauses. Plus, this will need to stay if we
>> (hopefully) adopt get_gfn_sleep.
>
> Yes, if we do it that way, this pause definitely needs to stay.  But if
> not then the waitqueue takes care of this case entirely, and this code
> is just redundant and confusing.
>
> Cheers,
>
> Tim.
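
[ For illustration, the claim_slot reordering suggested above might look
  roughly like this -- a sketch only, with the p2m inspection that computes
  send_request elided and the failure unwinding Tim mentions left as a TODO: ]

    /* Sketch: claim a ring slot only once we know a request will be sent,
     * so callers never risk sleeping in mem_event_claim_slot() for a gfn
     * that needs no pager attention. */
    void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
    {
        struct vcpu *v = current;
        mem_event_request_t req = {0};
        int send_request = 0;

        /* ... look up the p2m entry, update its type and set send_request,
         * exactly as in the hunk quoted above ... */

        if ( !send_request )
            return;                 /* gfn is not in the page-out path */

        /* We're paging.  There should be a ring. */
        if ( mem_event_claim_slot(d, &d->mem_event->paging) )
            return;                 /* TODO: unwind/retry properly here */

        req.gfn = gfn;
        req.vcpu_id = v->vcpu_id;
        mem_event_put_request(d, &d->mem_event->paging, &req);
    }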