Re: [Xen-devel] error in xen/arch/x86/mm.c:get_page during migration
At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
> >>>> On 22.02.13 at 21:07, Olaf Hering <olaf@xxxxxxxxx> wrote:
> >> On Fri, Feb 22, Jan Beulich wrote:
> >>
> >>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@xxxxxxxxx> wrote:
> >>>> It did not happen with xl.
> >>>
> >>> But the same guest and Dom0 kernel, and the same hypervisor?
> >>
> >> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
> >>
> >>>> Here is the output while doing xm migrate:
> >>>>
> >>>> (XEN) HVM2 restore: VMCE_VCPU 0
> >>>> (XEN) HVM2 restore: VMCE_VCPU 1
> >>>> (XEN) HVM2 restore: TSC_ADJUST 0
> >>>> (XEN) HVM2 restore: TSC_ADJUST 1
> >>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
> >> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> >>>
> >>> Didn't even notice yesterday that this is apparently after restore
> >>> has already started. Which makes me curious whether the domain
> >>> that is being referenced with rd= is the old or the new one (would
> >>> require printing the domain ID; honestly I never understood what
> >>> use printing of the domain pointer is).
> >>>
> >>> I'm also confused by the domain pointer always being the same;
> >>> I would expect it to at least toggle between two values, but
> >>> probably even be different between every instance of the guest.
> >>> But you're not having a stubdom configured for the guest either,
> >>> according to the config you sent earlier...
> >>
> >> The rd->domain_id is DOMID_COW in both cases.
> >
> > Which suggests that memory sharing is in use. At least I'm unaware
> > of other uses of that pseudo domain.
>
> There are none.
>
> There seems to be something else amiss though. Unless I am parsing
> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
> | PCD?
> Looks like a very unlikely combination

By my reading,
 taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
 caf = 0x0180000000000000 = refcount 0, PGC_state_free

iow this is a free page but somehow has ended up with a typecount (which
explains why the get_page() failed). And presumably this is one of the
various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.

Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
something's gone badly off the rails here.

One place I can see that tinkers with typecount without holding a ref is
share_xen_page_with_guest(), which sets exactly this typecount, but then
calls page_set_owner(page, d).

There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
it can't end up taking typecounts without refcounts.

__acquire_grant_for_copy() looks pretty hairy too, in particular this:
    (void)page_get_owner_and_reference(*page);
but presumably the matching put_page() would have crashed if that was
the problem. Does anyone understand the grant code well enough to get
into that?

If you can repro this, it might be worth tracing all the refcount ops
into a large buffer and dumping the history of this MFN on failure.

Cheers,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel