
[Xen-devel] Re: Need help with fixing the Xen waitqueue feature


  • To: keir.xen@xxxxxxxxx
  • From: "Andres Lagar-Cavilla" <andres@xxxxxxxxxxxxxxxx>
  • Date: Tue, 8 Nov 2011 19:52:31 -0800
  • Cc: olaf@xxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 08 Nov 2011 19:53:03 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

> Date: Tue, 08 Nov 2011 22:05:41 +0000
> From: Keir Fraser <keir.xen@xxxxxxxxx>
> Subject: Re: [Xen-devel] Need help with fixing the Xen waitqueue feature
> To: Olaf Hering <olaf@xxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
> Message-ID: <CADF5835.245E1%keir.xen@xxxxxxxxx>
> Content-Type: text/plain; charset="US-ASCII"
>
> On 08/11/2011 21:20, "Olaf Hering" <olaf@xxxxxxxxx> wrote:
>
>> Another thing is that sometimes the host suddenly reboots without any
>> message. I think the reason for this is that a vcpu whose stack was put
>> aside and that was later resumed may find itself on another physical
>> cpu. And if that happens, wouldn't that invalidate some of the local
>> variables back in the callchain? If some of them point to the old
>> physical cpu, how could this be fixed? Perhaps a few "volatiles" are
>> needed in some places.
>
> From how many call sites can we end up on a wait queue? I know we were
> going to end up with a small and explicit number (e.g., in __hvm_copy())
> but does this patch make it a more generally-used mechanism? There will
> unavoidably be many constraints on callers who want to be able to yield
> the cpu. We can add Linux-style get_cpu/put_cpu abstractions to catch
> some of them. Actually I don't think it's *that* common that hypercall
> contexts cache things like per-cpu pointers. But every caller will need
> auditing, I expect.
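
To make that concrete, such Linux-style helpers might look roughly like
this in Xen. The no_wait_depth field is invented for illustration (struct
vcpu has no such member); smp_processor_id() and ASSERT() are existing
primitives:

    static inline unsigned int get_cpu(void)
    {
        /* Forbid sleeping on a waitqueue until put_cpu() is called;
         * the waitqueue code would BUG_ON() a non-zero depth before
         * putting the vcpu's stack aside. */
        current->no_wait_depth++;
        return smp_processor_id();
    }

    static inline void put_cpu(void)
    {
        ASSERT(current->no_wait_depth > 0);
        current->no_wait_depth--;
    }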

Tbh, for paging to be effective, we need to be prepared to yield on every
p2m lookup.

Let's compare paging to PoD. They're essentially the same thing: pages
disappear, and get allocated on the fly when you need them. PoD is a
highly optimized in-hypervisor mechanism that does not need a user-space
helper -- but the pager could do PoD easily and remove all that
p2m-pod.c code from the hypervisor.

PoD only introduces extraneous side-effects when there is no memory left
from which to allocate pages. The same cannot be said of paging, to put
it mildly: it returns EINVAL all over the place. Right now, qemu can be
crashed in a blink by paging out the right gfn.
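
The failing pattern looks roughly like this (a composite sketch, not a
verbatim call site; the function names follow the current tree, but the
signatures are approximate):

    static int lookup_gfn_today(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        mfn_t mfn = gfn_to_mfn(d, gfn, &t);

        if ( p2m_is_paging(t) )
        {
            /* Ask the pager to bring the page back in... */
            p2m_mem_paging_populate(d, gfn);
            /* ...but fail the caller anyway; qemu sees a hard error. */
            return -EINVAL;
        }

        return mfn_valid(mfn) ? 0 : -EINVAL;
    }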

To get paging to where PoD is, all these situations need to be handled in
some way other than returning EINVAL. That means putting the vcpu on a
waitqueue at every location where p2m_pod_demand_populate is called, not
just in __hvm_copy.
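
A sketch of the alternative, using the new wait.h API. gfn_paged_in() is
a hypothetical predicate and the global waitqueue is a stand-in for
whatever per-domain bookkeeping a real patch would need:

    /* Initialised elsewhere with init_waitqueue_head(); the pager side
     * would wake this queue once it has paged the gfn back in. */
    static struct waitqueue_head paging_wq;

    static mfn_t lookup_gfn_with_wait(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        mfn_t mfn = gfn_to_mfn(d, gfn, &t);

        while ( p2m_is_paging(t) )
        {
            /* Kick the pager, then yield the cpu instead of failing. */
            p2m_mem_paging_populate(d, gfn);
            wait_event(paging_wq, gfn_paged_in(d, gfn));
            /* Retry the lookup once the page is resident again. */
            mfn = gfn_to_mfn(d, gfn, &t);
        }

        return mfn;
    }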

I don't know that that's going to be altogether doable. Many of these gfn
lookups happen in atomic contexts, not to mention while holding per-cpu
pointers. But at least we should aim for that.
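
The per-cpu pointer hazard Olaf describes looks roughly like this (names
invented for illustration):

    static DEFINE_PER_CPU(unsigned long, scratch_count);

    /* A pointer into per-cpu data is taken on pcpu A, the vcpu then
     * sleeps on a waitqueue and may be resumed on pcpu B, and the stale
     * pointer now races with whatever runs on pcpu A. */
    static void buggy_caller(struct waitqueue_head *wq, bool_t *done)
    {
        unsigned long *count = &this_cpu(scratch_count); /* pcpu A's copy */

        wait_event(*wq, *done);   /* stack put aside; may wake on pcpu B */

        (*count)++;               /* BUG: still updates pcpu A's counter */
    }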

Andres
>
> A sudden reboot is very extreme. No message even on a serial line? That
> most commonly indicates bad page tables. With most other bugs you'd at
> least get a double fault message.
>
>  -- Keir
>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

