Re: [Xen-devel] Domain relinquish resources racing with p2m access
> At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
>> So we've run into this interesting (race?) condition while doing
>> stress-testing. We pummel the domain with paging, sharing and mmap
>> operations from dom0, and concurrently we launch a domain destruction.
>> Often we get in the logs something along these lines:
>>
>> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from
>> L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>>
>> We're using the synchronized p2m patches just posted, so my analysis
>> is as follows:
>>
>> - The domain destroy domctl kicks in. It calls relinquish resources.
>> This disowns and puts most domain pages, resulting in invalid
>> (0xff...ff) m2p entries.
>>
>> - In parallel, a do_mmu_update is making progress; it has no issues
>> performing a p2m lookup because the p2m has not been torn down yet (we
>> haven't gotten to the RCU callback). Eventually, the mapping fails in
>> page_get_owner in get_page_from_l1e.
>>
>> The map fails, as expected, but what makes me uneasy is the fact that
>> there is a still-active p2m lurking around, with seemingly valid
>> translations to valid mfns, while all the domain pages are gone.
>
> Yes. That's OK as long as we know that any user of that page will
> fail, but I'm not sure that we do.
>
> At one point we talked about get_gfn() taking a refcount on the
> underlying MFN, which would fix this more cleanly. ISTR the problem was
> how to make sure the refcount was moved when the gfn->mfn mapping
> changed.

Oh, I ditched that because it's too hairy and error-prone. There are
plenty of nested get_gfn calls, with the n>1 call changing the mfn. So
unless we make a point of remembering the mfn at the point of get_gfn,
it's just impossible to make this work. And then "remembering the mfn"
means a serious uglification of existing code.

>
> Can you stick a WARN() in mm.c to get the actual path that leads to the
> failure?

As a debug aid, or as actual code to make it into the tree? This
typically happens in batches of a few dozen, so a WARN is going to
massively spam the console with stack traces. Guess how I found out ...

The moral is that the code is reasonably defensive, so this gets caught,
albeit in a rather verbose way. But this might eventually bite someone
who does a get_gfn and doesn't either check that the domain is dying or
ensure that a get_page succeeds.

Andres

> Tim.
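
On the WARN() point above: a minimal, hedged sketch of how the trace
could be captured without the console spam described, by firing only
once. WARN() and bool_t are taken from the Xen tree of that era; the
helper name and its placement near the failing check in mm.c are
hypothetical, not code from the tree.

/* Hypothetical debug aid, not actual tree code: dump the call path
 * that reaches the failed lookup once only, rather than WARN()ing on
 * every occurrence (which fires dozens of times per destroy). */
static void warn_bad_l1e_once(void)
{
    static bool_t dumped_trace;

    if ( dumped_trace )
        return;

    dumped_trace = 1;   /* not atomic; a rare duplicate trace is fine */
    WARN();
}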
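
The "moral" in the last paragraph amounts to a fairly mechanical
pattern for callers. Below is a minimal sketch, assuming the Xen
4.x-era helpers named in the thread (get_gfn/put_gfn, get_page/put_page,
mfn_to_page, mfn_valid, d->is_dying); signatures are simplified and the
caller itself is hypothetical.

/* Hypothetical caller, sketching the pattern the message asks for:
 * after translating gfn->mfn, either notice that the domain is dying
 * or take a real page reference before touching the frame, so a
 * racing relinquish_resources cannot free it underneath us. */
static int use_guest_frame(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;
    mfn_t mfn;
    struct page_info *page;
    int rc = -EINVAL;

    mfn = get_gfn(d, gfn, &t);      /* holds the gfn until put_gfn() */

    if ( d->is_dying || !mfn_valid(mfn) )
        goto out;                   /* teardown already under way */

    page = mfn_to_page(mfn);
    if ( !get_page(page, d) )
        goto out;                   /* lost the race: page disowned */

    /* ... safe to map/use the frame here; the reference pins it ... */

    put_page(page);
    rc = 0;

 out:
    put_gfn(d, gfn);
    return rc;
}

The existing do_mmu_update path already fails safely at this point
thanks to the ownership check in get_page_from_l1e; the sketch just
makes that expectation explicit for new get_gfn callers.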