Re: [Xen-devel] Error restoring DomU when using GPLPV
OK, I've been looking at this and figured out what's going on. Annie's problem lies in not remapping the grant frames after migration; hence the leak: tot_pages goes up every time until migration fails. On Linux, what I found is that the remapping is where the frames created by restore (for the heap pfns) get freed back to the domain heap. So that's a fix to be made on the Windows PV driver side.

Now back to the original problem. As you already know, because libxc is not skipping the Xen-heap pages, tot_pages in struct domain{} temporarily goes up by (shared-info frame + grant-table frames) until the guest remaps these pages. Hence, migration fails if (max_pages - tot_pages) < (shared-info frame + grant-table frames). Occasionally I see tot_pages nearly the same as max_pages, and I don't know all the ways that may happen or what causes it (by default, I see tot_pages short by 21).

Anyway, of the two solutions:

1. Always balloon down shinfo + gnttab frames: this needs to be done just once during driver load, right? I'm not sure how it would work, though, if memory gets ballooned up subsequently. I suppose the driver would have to intercept every increase in reservation and balloon down every time? Also, ballooning down during the suspend call would probably be too late, right? (See the sketch below, after this message.)

2. libxc fix: I wonder how much work this would be. The good thing here is that it would take care of both Linux and PV HVM guests, avoiding driver updates across many versions, and is hence appealing to us. Can we somehow mark the frames as special so they get skipped? Looking at the big xc_domain_save function, I'm not sure how pfn_type gets set in the HVM case. Maybe before the outer loop it could ask the hypervisor for the list of all Xen-heap pages, but then what if a new page gets added to the list in between...

Also, unfortunately, the failure case is sometimes not handled properly. If migration fails after suspend, there is no way to get the guest back. A couple of times out of the several dozen migrations I did, I even saw the guest disappear entirely from both source and target on failure.

thanks,
Mukesh
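A minimal sketch of the "balloon down" idea in option 1, assuming a Linux-style HYPERVISOR_memory_op hypercall wrapper and the public xen/memory.h interface; the function name, the pfn array, and the error handling are illustrative placeholders, not the actual GPLPV code:

#include <xen/memory.h>  /* XENMEM_decrease_reservation, struct xen_memory_reservation */

/*
 * Release as many domain-heap pages as there are Xen-heap frames mapped
 * by the PV drivers (shared-info frame + grant-table frames), so that
 * tot_pages stays far enough below max_pages for the restore side to
 * repopulate those frames.  'pfns' is a hypothetical array of guest pfns
 * the driver has already allocated and can afford to give back.
 */
static int balloon_out_xen_heap_slack(xen_pfn_t *pfns, unsigned int nr_frames)
{
    struct xen_memory_reservation reservation = {
        .nr_extents   = nr_frames,
        .extent_order = 0,            /* 4K pages */
        .domid        = DOMID_SELF,
    };
    long done;

    set_xen_guest_handle(reservation.extent_start, pfns);

    /* Returns the number of extents actually released. */
    done = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
    return (done == nr_frames) ? 0 : -1;
}

If memory is later ballooned up, a call like this would have to be repeated to preserve the slack, which is the interception problem mentioned in option 1 above.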
Keir Fraser wrote:

Not all those pages are special. Frames fc0xx will be ACPI tables, resident in ordinary guest memory pages, for example. Only the Xen-heap pages are special and need to be (1) skipped; or (2) unmapped by the HVM PV drivers on suspend; or (3) accounted for by the HVM PV drivers by unmapping and freeing an equal number of domain-heap pages. (1) is 'nicest' but actually a bit of a pain to implement; (2) won't work well for live migration, where the pages wouldn't get unmapped by the drivers until the last round of page copying; and (3) was apparently tried by Annie but didn't work? I'm curious why (3) didn't work - I can't explain that.

-- Keir

On 05/09/2009 00:02, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx> wrote:

On further debugging, it appears that the p2m_size may be OK, but there's something about those 24 "magic" gpfns that isn't quite right.

-----Original Message-----
From: Dan Magenheimer
Sent: Friday, September 04, 2009 3:29 PM
To: Wayne Gong; Annie Li; Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] Error restoring DomU when using GPLPV

I think I've tracked down the cause of this problem in the hypervisor, but am unsure how best to fix it. In tools/libxc/xc_domain_save.c, the static variable p2m_size is said to be the "number of pfns this guest has (i.e. number of entries in the P2M)". But apparently p2m_size is getting set to a very large number (0x100000) regardless of the maximum pseudophysical memory of the HVM guest.

As a result, some "magic" pages in the 0xf0000-0xfefff range are getting placed in the save file. But since they are not "real" pages, the restore process runs beyond the maximum number of physical pages allowed for the domain and fails. (The gpfns of the last 24 pages saved are f2020, fc000-fc012, feffb, feffc, feffd, and feffe.)

p2m_size is set in "save" with a call to a memory_op hypercall (XENMEM_maximum_gpfn), which for an HVM domain returns d->arch.p2m->max_mapped_pfn. I suspect that the meaning of max_mapped_pfn changed at some point to better match its name, but this changed the semantics of the hypercall as used by xc_domain_restore, resulting in this curious problem. Any thoughts on how to fix this?

-----Original Message-----
From: Annie Li
Sent: Tuesday, September 01, 2009 10:27 PM
To: Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] Error restoring DomU when using GPLPV

It seems this problem is connected with the grant table, not the shared-info page. I changed some grant-table code in the Windows PV driver (not using the "balloon down shinfo + gnttab" method), and save/restore/migration now work properly on Xen 3.4.

What I changed is that the Windows PV driver now uses the hypercall XENMEM_add_to_physmap to map only the grant-table frames that the devices require, instead of mapping all 32 grant-table pages during initialization. It seems those extra grant-table mappings caused this problem.

I'm wondering whether those extra grant-table mappings are the root cause of the migration problem, or whether it just works by luck, as Linux PV-on-HVM does too?

Thanks
Annie.
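For completeness, the on-demand grant-table mapping Annie describes, and the remapping that Linux PV-on-HVM performs on resume (which is what frees the restore-allocated heap pages back to the domain heap), look roughly like the sketch below. It assumes a Linux-style HYPERVISOR_memory_op wrapper; base_gpfn and nr_frames are placeholders for whatever the driver chooses:

#include <xen/memory.h>  /* XENMEM_add_to_physmap, XENMAPSPACE_grant_table */

/*
 * Map (or, after restore, re-map) the first nr_frames grant-table frames
 * into the guest physmap starting at base_gpfn.  Mapping only the frames
 * the devices actually need, and repeating this on resume, avoids leaving
 * stale Xen-heap mappings behind across a migration.
 */
static int map_grant_frames(unsigned long base_gpfn, unsigned int nr_frames)
{
    unsigned int i;

    for (i = 0; i < nr_frames; i++) {
        struct xen_add_to_physmap xatp = {
            .domid = DOMID_SELF,
            .space = XENMAPSPACE_grant_table,
            .idx   = i,                 /* grant-table frame number */
            .gpfn  = base_gpfn + i,     /* guest pfn it should appear at */
        };

        if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp) != 0)
            return -1;
    }
    return 0;
}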