
Re: [Xen-devel] Domain relinquish resources racing with p2m access

  • To: "Tim Deegan" <tim@xxxxxxx>
  • From: "Andres Lagar-Cavilla" <andres@xxxxxxxxxxxxxxxx>
  • Date: Fri, 10 Feb 2012 10:05:46 -0800
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, keir@xxxxxxx
  • Delivery-date: Fri, 10 Feb 2012 18:06:24 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

> At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
>> So we've run into this interesting (race?) condition while doing
>> stress-testing. We pummel the domain with paging, sharing and mmap
>> operations from dom0, and concurrently we launch a domain destruction.
>> Often we get in the logs something along these lines
>> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>> We're using the synchronized p2m patches just posted, so my analysis is
>> as follows:
>> - the domain destroy domctl kicks in. It calls relinquish resources. This
>> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
>> entries
>> - In parallel, a do_mmu_update is making progress. It has no issues
>> performing a p2m lookup because the p2m has not been torn down yet; we
>> haven't gotten to the RCU callback. Eventually, the mapping fails in
>> page_get_owner in get_page_from_l1e.
>> The map fails, as expected, but what makes me uneasy is the fact that
>> there is a still-active p2m lurking around, with seemingly valid
>> translations to valid mfn's, while all the domain pages are gone.
> Yes.  That's OK as long as we know that any user of that page will
> fail, but I'm not sure that we do.
> At one point we talked about get_gfn() taking a refcount on the
> underlying MFN, which would fix this more cleanly.  ISTR the problem was
> how to make sure the refcount was moved when the gfn->mfn mapping
> changed.

Oh, I ditched that because it's too hairy and error prone. There are
plenty of nested get_gfn's where a nested (n>1) call changes the mfn. So
unless we make a point of remembering the mfn at the point of get_gfn, it's
just impossible to make this work. And then "remembering the mfn" means a
serious uglification of existing code.
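
To make the "remembering the mfn" point concrete, here is a toy sketch in C. It is not Xen code: every name in it (toy_p2m, toy_gfn_ref, toy_get_gfn_ref, toy_put_gfn_ref, toy_remap) is hypothetical, and it ignores locking entirely. It only shows why each caller would have to carry the mfn it saw at get time, so that the put drops the reference on the right frame even after a nested call has changed the gfn->mfn mapping.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_GFNS 4
#define TOY_NR_MFNS 8

/* Toy p2m: a gfn->mfn table plus a per-mfn reference count.
 * Hypothetical model, not the Xen p2m. */
struct toy_p2m {
    unsigned long gfn_to_mfn[TOY_NR_GFNS];
    unsigned int  mfn_refs[TOY_NR_MFNS];
};

/* What every caller would have to keep around: the gfn it asked for
 * and the mfn that the gfn translated to at that moment. */
struct toy_gfn_ref {
    unsigned long gfn;
    unsigned long mfn;
};

/* Look up the gfn and take a reference on the mfn found right now. */
static struct toy_gfn_ref toy_get_gfn_ref(struct toy_p2m *p2m,
                                          unsigned long gfn)
{
    struct toy_gfn_ref ref = { gfn, p2m->gfn_to_mfn[gfn] };

    p2m->mfn_refs[ref.mfn]++;
    return ref;
}

/* Drop the reference on the remembered mfn, not on whatever the gfn
 * maps to by the time of the put. */
static void toy_put_gfn_ref(struct toy_p2m *p2m, struct toy_gfn_ref ref)
{
    assert(p2m->mfn_refs[ref.mfn] > 0);
    p2m->mfn_refs[ref.mfn]--;
}

/* Stand-in for any operation that changes the gfn->mfn mapping
 * between two nested gets. */
static void toy_remap(struct toy_p2m *p2m, unsigned long gfn,
                      unsigned long new_mfn)
{
    p2m->gfn_to_mfn[gfn] = new_mfn;
}
```

If an outer get sees one mfn and the mapping is changed before a nested get, the two handles hold different mfns, and a put that re-did the lookup instead of using the handle would decrement the wrong count. Threading such a handle through every existing get_gfn caller is the uglification in question.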

> Can you stick a WARN() in mm.c to get the actual path that leads to the
> failure?

As a debug aid or as actual code to make it into the tree? This typically
happens in batches of a few dozen, so a WARN is going to massively spam
the console with stack traces. Guess how I found out ...

The moral is that the code is reasonably defensive, so this gets caught,
albeit in a rather verbose way. But this might eventually bite someone who
does a get_gfn and doesn't either check that the domain is dying or ensure
that a get_page succeeds.
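
A toy model of that defensive pattern, in C, might look like the following. This is a sketch under loud assumptions, not the hypervisor's code: toy_domain, toy_page, toy_p2m, toy_get_gfn, toy_get_page, toy_put_page, toy_relinquish and toy_map_gfn are all hypothetical names with simplified signatures, and there is no locking. The point is only that the lookup keeps returning a stale translation after the pages are disowned, so the caller has to check that the domain is dying or that taking the page reference succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_GFNS 4

/* Toy stand-ins, NOT the real Xen structures. */
struct toy_domain { bool is_dying; };
struct toy_page   { struct toy_domain *owner; unsigned int refcount; };
struct toy_p2m    { struct toy_page *map[TOY_NR_GFNS]; };

/* Lookup only: returns whatever the p2m still says, stale or not. */
static struct toy_page *toy_get_gfn(struct toy_p2m *p2m, unsigned long gfn)
{
    return gfn < TOY_NR_GFNS ? p2m->map[gfn] : NULL;
}

/* Take a reference only if the page is still live and owned by d. */
static bool toy_get_page(struct toy_page *pg, struct toy_domain *d)
{
    if (pg == NULL || pg->owner != d || pg->refcount == 0)
        return false;
    pg->refcount++;
    return true;
}

static void toy_put_page(struct toy_page *pg)
{
    assert(pg->refcount > 0);
    pg->refcount--;
}

/* Models relinquish resources: pages are disowned and put, but the
 * p2m is left alone (its teardown happens later). */
static void toy_relinquish(struct toy_domain *d, struct toy_page *pages,
                           size_t n)
{
    size_t i;

    d->is_dying = true;
    for (i = 0; i < n; i++) {
        pages[i].owner = NULL;
        pages[i].refcount = 0;
    }
}

/* Defensive user of a translation: 0 on success, -1 if the domain is
 * dying, -2 if the page reference could not be taken. */
static int toy_map_gfn(struct toy_domain *d, struct toy_p2m *p2m,
                       unsigned long gfn)
{
    struct toy_page *pg;

    if (d->is_dying)
        return -1;

    pg = toy_get_gfn(p2m, gfn);
    if (!toy_get_page(pg, d))
        return -2;

    /* ... the page could safely be used here ... */
    toy_put_page(pg);
    return 0;
}
```

In this model either check alone is enough to reject the stale translation; a caller that does neither would go on to use a page the domain no longer owns.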


> Tim.
