Re: [Xen-devel] Error restoring DomU when using GPLPV
OK, I've been looking at this and figured out what's going on. Annie's problem lies in not remapping the grant frames after migration; hence the leak: tot_pages goes up every time until migration fails. On Linux, what I found is that the remapping is where the frames created by restore (for the heap pfns) get freed back to the domain heap. So that's a fix to be made on the Windows PV driver side.

Now back to the original problem. As you already know, because libxc is not skipping the Xen-heap pages, tot_pages in struct domain{} temporarily goes up by (shared-info frame + grant-table frames) until the guest remaps these pages. Hence, migration fails if (max_pages - tot_pages) < (shared-info frame + grant-table frames). Occasionally I see tot_pages nearly the same as max_pages, and I don't know all the ways that may happen or what causes it (by default, I see tot_pages short by 21).

Anyway, of the two solutions:

1. Always balloon down shinfo + gnttab frames: this needs to be done just once during driver load, right? I'm not sure how it would work, though, if memory gets ballooned up subsequently. I suppose the driver would have to intercept every increase in reservation and balloon down every time? Also, ballooning down during the suspend call would probably be too late, right? (See the sketch below, after this message.)

2. libxc fix: I wonder how much work this would be. The good thing here is that it would take care of both Linux and PV HVM guests, avoiding driver updates across many versions, and is hence appealing to us. Can we somehow mark the frames as special so they get skipped? Looking at the big xc_domain_save function, I'm not sure how pfn_type gets set in the HVM case. Maybe before the outer loop it could ask the hypervisor for the list of all Xen-heap pages, but then what if a new page gets added to the list in between...

Also, unfortunately, the failure case is sometimes not handled properly. If migration fails after suspend, there is no way to get the guest back. A couple of times out of the several dozen migrations I did, I even saw the guest disappear entirely from both source and target on failure.

thanks,
Mukesh
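A minimal sketch of the "balloon down" idea in option 1, assuming a Linux-style HYPERVISOR_memory_op hypercall wrapper and the public xen/memory.h interface; the function name, the pfn array, and the error handling are illustrative placeholders, not the actual GPLPV code:

#include <xen/memory.h>  /* XENMEM_decrease_reservation, struct xen_memory_reservation */

/*
 * Release as many domain-heap pages as there are Xen-heap frames mapped
 * by the PV drivers (shared-info frame + grant-table frames), so that
 * tot_pages stays far enough below max_pages for the restore side to
 * repopulate those frames.  'pfns' is a hypothetical array of guest pfns
 * the driver has already allocated and can afford to give back.
 */
static int balloon_out_xen_heap_slack(xen_pfn_t *pfns, unsigned int nr_frames)
{
    struct xen_memory_reservation reservation = {
        .nr_extents   = nr_frames,
        .extent_order = 0,            /* 4K pages */
        .domid        = DOMID_SELF,
    };
    long done;

    set_xen_guest_handle(reservation.extent_start, pfns);

    /* Returns the number of extents actually released. */
    done = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
    return (done == nr_frames) ? 0 : -1;
}

If memory is later ballooned up, a call like this would have to be repeated to preserve the slack, which is the interception problem mentioned in option 1 above.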
Keir Fraser wrote:

Not all those pages are special. Frames fc0xx will be ACPI tables, resident in ordinary guest memory pages, for example. Only the Xen-heap pages are special and need to be (1) skipped; or (2) unmapped by the HVM PV drivers on suspend; or (3) accounted for by the HVM PV drivers by unmapping and freeing an equal number of domain-heap pages. (1) is 'nicest' but actually a bit of a pain to implement; (2) won't work well for live migration, where the pages wouldn't get unmapped by the drivers until the last round of page copying; and (3) was apparently tried by Annie but didn't work? I'm curious why (3) didn't work - I can't explain that.

-- Keir

On 05/09/2009 00:02, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx> wrote:

On further debugging, it appears that the p2m_size may be OK, but there's something about those 24 "magic" gpfns that isn't quite right.

-----Original Message-----
From: Dan Magenheimer
Sent: Friday, September 04, 2009 3:29 PM
To: Wayne Gong; Annie Li; Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] Error restoring DomU when using GPLPV

I think I've tracked down the cause of this problem in the hypervisor, but am unsure how best to fix it. In tools/libxc/xc_domain_save.c, the static variable p2m_size is said to be the "number of pfns this guest has (i.e. number of entries in the P2M)". But apparently p2m_size is getting set to a very large number (0x100000) regardless of the maximum pseudophysical memory of the HVM guest.

As a result, some "magic" pages in the 0xf0000-0xfefff range are getting placed in the save file. But since they are not "real" pages, the restore process runs beyond the maximum number of physical pages allowed for the domain and fails. (The gpfns of the last 24 pages saved are f2020, fc000-fc012, feffb, feffc, feffd, and feffe.)

p2m_size is set in "save" with a call to a memory_op hypercall (XENMEM_maximum_gpfn), which for an HVM domain returns d->arch.p2m->max_mapped_pfn. I suspect that the meaning of max_mapped_pfn changed at some point to better match its name, but this changed the semantics of the hypercall as used by xc_domain_restore, resulting in this curious problem. Any thoughts on how to fix this?

-----Original Message-----
From: Annie Li
Sent: Tuesday, September 01, 2009 10:27 PM
To: Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] Error restoring DomU when using GPLPV

It seems this problem is connected with the grant table, not the shared-info page. I changed some grant-table code in the Windows PV driver (not using the "balloon down shinfo + gnttab" method), and save/restore/migration now work properly on Xen 3.4.

What I changed is that the Windows PV driver now uses the hypercall XENMEM_add_to_physmap to map only the grant-table frames that the devices require, instead of mapping all 32 grant-table pages during initialization. It seems those extra grant-table mappings caused this problem.

I'm wondering whether those extra grant-table mappings are the root cause of the migration problem, or whether it just works by luck, as Linux PV-on-HVM does too?

Thanks
Annie.
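For completeness, the on-demand grant-table mapping Annie describes, and the remapping that Linux PV-on-HVM performs on resume (which is what frees the restore-allocated heap pages back to the domain heap), look roughly like the sketch below. It assumes a Linux-style HYPERVISOR_memory_op wrapper; base_gpfn and nr_frames are placeholders for whatever the driver chooses:

#include <xen/memory.h>  /* XENMEM_add_to_physmap, XENMAPSPACE_grant_table */

/*
 * Map (or, after restore, re-map) the first nr_frames grant-table frames
 * into the guest physmap starting at base_gpfn.  Mapping only the frames
 * the devices actually need, and repeating this on resume, avoids leaving
 * stale Xen-heap mappings behind across a migration.
 */
static int map_grant_frames(unsigned long base_gpfn, unsigned int nr_frames)
{
    unsigned int i;

    for (i = 0; i < nr_frames; i++) {
        struct xen_add_to_physmap xatp = {
            .domid = DOMID_SELF,
            .space = XENMAPSPACE_grant_table,
            .idx   = i,                 /* grant-table frame number */
            .gpfn  = base_gpfn + i,     /* guest pfn it should appear at */
        };

        if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp) != 0)
            return -1;
    }
    return 0;
}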