
Re: [Xen-devel] Domain relinquish resources racing with p2m access


  • To: "Tim Deegan" <tim@xxxxxxx>
  • From: "Andres Lagar-Cavilla" <andres@xxxxxxxxxxxxxxxx>
  • Date: Fri, 10 Feb 2012 10:05:46 -0800
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, keir@xxxxxxx
  • Delivery-date: Fri, 10 Feb 2012 18:06:24 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

> At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
>> So we've run into this interesting (race?) condition while doing
>> stress-testing. We pummel the domain with paging, sharing and mmap
>> operations from dom0, and concurrently we launch a domain destruction.
>> Often we get in the logs something along these lines
>>
>> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from
>> L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>>
>> We're using the synchronized p2m patches just posted, so my analysis is
>> as follows:
>>
>> - the domain destroy domctl kicks in. It calls relinquish resources. This
>> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
>> entries.
>>
>> - In parallel, a do_mmu_update is making progress; it has no issues
>> performing a p2m lookup because the p2m has not been torn down yet (we
>> haven't gotten to the RCU callback). Eventually, the mapping fails in
>> page_get_owner in get_page_from_l1e.
>>
>> The mapping fails, as expected, but what makes me uneasy is that there
>> is a still-active p2m lurking around, with seemingly valid translations
>> to valid mfns, while all the domain pages are gone.
>
> Yes.  That's OK as long as we know that any user of that page will
> fail, but I'm not sure that we do.
>
> At one point we talked about get_gfn() taking a refcount on the
> underlying MFN, which would fix this more cleanly.  ISTR the problem was
> how to make sure the refcount was moved when the gfn->mfn mapping
> changed.

Oh, I ditched that because it's too hairy and error-prone. There are
plenty of nested get_gfn's in which the inner (n>1) call changes the mfn.
So unless we make a point of remembering the mfn at the point of get_gfn,
it's just impossible to make this work. And then "remembering the mfn"
means a serious uglification of existing code.
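
To spell out the ugliness (purely illustrative; get_gfn_refcounted and
put_gfn_refcounted are made-up names, nothing like this is in the tree):
if get_gfn() took a page reference on the mfn it returns, the matching
put would have to drop the reference on that same mfn, so every caller
would need to carry the looked-up mfn all the way to the put side,
because a nested get_gfn in between may have changed the gfn->mfn
mapping. Something like:

/* Hypothetical refcounting variant of get_gfn/put_gfn; not real code. */
mfn_t get_gfn_refcounted(struct domain *d, unsigned long gfn, p2m_type_t *t);
void  put_gfn_refcounted(struct domain *d, unsigned long gfn, mfn_t mfn);

static void caller_sketch(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;

    /* The caller now has to hold on to the mfn itself, not just the gfn. */
    mfn_t mfn = get_gfn_refcounted(d, gfn, &t);

    /*
     * A nested get_gfn on the same gfn (page-in, unshare, ...) can change
     * the gfn->mfn mapping here, so the reference we hold is on the old
     * frame; a put keyed only on the gfn can't tell which ref to drop.
     */

    /* ... use the frame ... */

    put_gfn_refcounted(d, gfn, mfn);   /* must pass back the remembered mfn */
}

And that mfn-threading would have to be retrofitted into every existing
get_gfn/put_gfn pair.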

>
> Can you stick a WARN() in mm.c to get the actual path that leads to the
> failure?

As a debug aid or as actual code to make it into the tree? This typically
happens in batches of a few dozen, so a WARN is going to massively spam
the console with stack traces. Guess how I found out ...

The moral is that the code is reasonably defensive, so this gets caught,
albeit in a rather verbose way. But this might eventually bite someone who
does a get_gfn and doesn't either check that the domain is dying or ensure
that a get_page succeeds.
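
For completeness, the pattern I mean is roughly the following. This is a
sketch from memory, not a patch: the wrapper function is made up, helper
spellings may not exactly match the tree, and it assumes the usual
xen/sched.h, asm/p2m.h and asm/mm.h context.

static int use_guest_frame(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;
    mfn_t mfn;
    struct page_info *page;
    int rc = -EINVAL;

    mfn = get_gfn(d, gfn, &t);         /* entry stays locked until put_gfn */

    if ( d->is_dying || !mfn_valid(mfn) )
        goto out;                      /* relinquish may already have run */

    page = mfn_to_page(mfn_x(mfn));
    if ( !get_page(page, d) )          /* fails once the page is disowned */
        goto out;

    /* ... safe to map/use the frame here ... */

    put_page(page);
    rc = 0;

 out:
    put_gfn(d, gfn);
    return rc;
}

Skip either the is_dying check or the get_page, and you end up relying on
mm.c catching it for you, verbosely, as above.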

Andres

>
> Tim.
>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

