
Re: [Xen-devel] Kernel bug from 3.0 (was phy disks and vifs timing out in DomU)



On 03/09/2011 11:27, Ian Campbell wrote:
> On Fri, 2011-09-02 at 21:26 +0100, Jeremy Fitzhardinge wrote:
>> On 09/02/2011 12:17 AM, Ian Campbell wrote:
>>> On Thu, 2011-09-01 at 21:34 +0100, Jeremy Fitzhardinge wrote:
>>>> On 09/01/2011 12:21 PM, Ian Campbell wrote:
>>>>> On Thu, 2011-09-01 at 18:32 +0100, Jeremy Fitzhardinge wrote:
>>>>>> On 09/01/2011 12:42 AM, Ian Campbell wrote:
>>>>>>> On Wed, 2011-08-31 at 18:07 +0100, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Wed, Aug 31, 2011 at 05:58:43PM +0100, David Vrabel wrote:
>>>>>>>>> On 26/08/11 15:44, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> So while I am still looking at the hypervisor code to figure out why
>>>>>>>>>> it would give me [when trying to map a grant page]:
>>>>>>>>>>
>>>>>>>>>> (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
>>>>>>>>> It is failing in guest_map_l1e() because the page for the vmalloc'd
>>>>>>>>> virtual address PTEs is not present.
>>>>>>>>>
>>>>>>>>> The test that fails is:
>>>>>>>>>
>>>>>>>>> (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT
>>>>>>>>>
>>>>>>>>> I think this is because the GNTTABOP_map_grant_ref hypercall is done
>>>>>>>>> when task->active_mm != &init_mm, and alloc_vm_area() only adds PTEs
>>>>>>>>> into init_mm, so when Xen looks in the current page tables it doesn't
>>>>>>>>> find the entries because they're not there yet.
>>>>>>>>>
>>>>>>>>> Putting a call to vmalloc_sync_all() after alloc_vm_area() and before
>>>>>>>>> the hypercall makes it work for me.  Classic Xen kernels used to have
>>>>>>>>> such a call.
>>>>>>>> That sounds quite reasonable.
>>>>>>> I was wondering why upstream was missing the vmalloc_sync_all() in
>>>>>>> alloc_vm_area() since the out-of-tree kernels did have it and the
>>>>>>> function was added by us. I found this:
>>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ef691947d8a3d479e67652312783aedcf629320a
>>>>>>>
>>>>>>> commit ef691947d8a3d479e67652312783aedcf629320a
>>>>>>> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>>>>>>> Date:   Wed Dec 1 15:45:48 2010 -0800
>>>>>>>
>>>>>>>     vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
>>>>>>>     
>>>>>>>     There's no need for it: it will get faulted into the current
>>>>>>>     pagetable as needed.
>>>>>>>     
>>>>>>>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>>>>>>>
>>>>>>> The flaw in the reasoning here is that you cannot take a kernel fault
>>>>>>> while processing a hypercall, so hypercall arguments must have been
>>>>>>> faulted in beforehand, and that is what the sync_all was for.
>>>>>> That's a good point.  (Maybe Xen should have generated pagefaults when
>>>>>> hypercall arg pointers are bad...)
>>>>> I think it would be a bit tricky to do in practice: you'd either have to
>>>>> support recursive hypercalls in the middle of other hypercalls (because
>>>>> the page fault handler is surely going to want to do some), or proper
>>>>> hypercall restart (so you can fully return to guest context to handle
>>>>> the fault and then retry), or something along those lines, any of which
>>>>> complicates the hypervisor one way or another. Probably not impossible
>>>>> if you were building something from the ground up, but not trivial.
>>>> Well, Xen already has the continuation machinery for dealing with
>>>> hypercall restart, so that could be reused.
>>> That requires special support beyond just calling the continuation in
>>> each hypercall (often extending into the ABI) for pickling progress and
>>> picking it up again; only a small number of (usually long-running)
>>> hypercalls have that support today. It also uses the guest context to
>>> store the state, which perhaps isn't helpful if you want to return to the
>>> guest, although I suppose building a nested frame would work.
>> I guess it depends on how many hypercalls do work before touching guest
>> memory, but any hypercall should be written that way anyway, or at least
>> be able to wind back any work already done if a later read EFAULTs.
>>
>> I was vaguely speculating about a scheme on the lines of:
>>
>>  1. In copy_to/from_user, if we touch a bad address, save it in a
>>     per-vcpu "bad_guest_addr"
>>  2. When returning to the guest, if the errno is EFAULT and
>>     bad_guest_addr is set, then generate a memory fault frame with cr2 =
>>     bad_guest_addr, with the exception return set up to restart the hypercall
>>
>> Perhaps there should be an EFAULT_RETRY error return to trigger this
>> behaviour, rather than doing it for all EFAULTs, so the faulting
>> behaviour can be added incrementally.
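
To make the scheme above concrete, the per-vcpu bookkeeping being described might look something like the sketch below. This is purely hypothetical: none of these names exist in Xen today, and inject_guest_pagefault() / rearm_hypercall_restart() are invented stand-ins for whatever the real exit-path machinery would be.

#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical per-vcpu bookkeeping; not actual Xen code.  All names
 * below are invented purely to illustrate the scheme sketched above.
 */
struct vcpu_fault_state {
        unsigned long bad_guest_addr;   /* last guest address that faulted */
        bool addr_valid;
};

/* Would be called from copy_to/from_guest when a guest pointer faults. */
static void note_bad_guest_addr(struct vcpu_fault_state *v, unsigned long addr)
{
        v->bad_guest_addr = addr;
        v->addr_valid = true;
}

/*
 * Invented hooks into the return-to-guest path: build a #PF frame for
 * the guest with cr2 = the bad address, and arrange for the exception
 * return to re-issue the original hypercall.
 */
extern void inject_guest_pagefault(unsigned long cr2);
extern void rearm_hypercall_restart(void);

/*
 * On return to guest: if the hypercall failed with EFAULT (or a
 * dedicated EFAULT_RETRY) and a faulting address was recorded, hand the
 * guest a page fault instead of an error and retry the call once the
 * guest's fault handler returns.
 */
static bool maybe_inject_pagefault(struct vcpu_fault_state *v, long rc)
{
        if (rc != -EFAULT || !v->addr_valid)
                return false;

        v->addr_valid = false;
        inject_guest_pagefault(v->bad_guest_addr);
        rearm_hypercall_restart();
        return true;
}
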
> The kernel uses -ERESTARTSYS for something similar, doesn't it?
>
> Does this scheme work if the hypercall causing the exception was itself
> running in an exception handler? I guess it depends on the architecture's
> and OS's handling of nested faults.
>
>> Maybe this is a lost cause for x86, but perhaps it's worth considering
>> for new ports?
> Certainly worth thinking about.
>
>>> The guys doing paging and sharing etc looked into this and came to the
>>> conclusion that it would be intractably difficult to do this fully --
>>> hence we now have the ability to sleep in hypercalls, which works
>>> because the pager/sharer is in a different domain/vcpu.
>> Hmm.  Were they looking at injecting faults back into the guest, or
>> forwarding "missing page" events off to another domain?
> Sharing and swapping are transparent to the domain; another domain runs
> the swapper/unshare process (actually, unshare might be in the h/v
> itself, not sure).
>
>>>>   And accesses to guest
>>>> memory are already special events which must be checked so that EFAULT
>>>> can be returned.  If, rather than failing with EFAULT Xen set up a
>>>> pagefault exception for the guest CPU with the return set up to retry
>>>> the hypercall, it should all work...
>>>>
>>>> Of course, if the guest isn't expecting that - or it's buggy - then it
>>>> could end up in an infinite loop.  But maybe a flag (set a high bit in
>>>> the hypercall number?), or a feature, or something?  Might be worthwhile
>>>> if it saves guests having to do something expensive (like a
>>>> vmalloc_sync_all), even if they have to also deal with old hypervisors.
>>> The vmalloc_sync_all is a pretty event even on Xen though, isn't it?
>> Looks like an important word is missing there.  But its very expensive,
>> if that's what you're saying.
> Oops. "rare" was the missing word.
Is there any progress on an official patch for this? I have my own
unofficial patch which places a vmalloc_sync_all() after every
alloc_vm_area() call, and it works, but from the thread it sounds like
there should be a more sophisticated solution to the problem.
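
For reference, the pattern being discussed (alloc_vm_area() immediately followed by vmalloc_sync_all(), before the grant-map hypercall) looks roughly like the sketch below. This is illustrative only, modelled on the xenbus ring-mapping path rather than taken from any actual patch; map_one_grant() is a made-up name and error handling is minimal.

#include <linux/vmalloc.h>
#include <asm/xen/hypercall.h>
#include <xen/grant_table.h>
#include <xen/interface/grant_table.h>

/*
 * Illustrative only: map a single page granted by @otherend_id into a
 * fresh vmalloc area.  The vmalloc_sync_all() call is the workaround
 * discussed in this thread: alloc_vm_area() only instantiates the PTEs
 * in init_mm, and the grant-map hypercall cannot fault them in, so they
 * must be synced into all page tables before the hypercall is issued.
 */
static void *map_one_grant(domid_t otherend_id, grant_ref_t gnt_ref)
{
        struct vm_struct *area;
        struct gnttab_map_grant_ref op;

        area = alloc_vm_area(PAGE_SIZE);        /* populates init_mm only */
        if (!area)
                return NULL;

        vmalloc_sync_all();                     /* the workaround */

        gnttab_set_map_op(&op, (unsigned long)area->addr,
                          GNTMAP_host_map, gnt_ref, otherend_id);

        if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1) ||
            op.status != GNTST_okay) {
                free_vm_area(area);
                return NULL;
        }

        return area->addr;
}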
