Re: [Xen-devel] PoD code killing domain before it really gets started
On 26/07/12 15:41, Jan Beulich wrote:
> George, in the hope that you might have some insight, or might
> remember that something like this was reported before (and ideally
> fixed), I'll try to describe a problem a customer of ours reported.
> Unfortunately this is with Xen 4.0.x (plus numerous backports), and
> it is not known whether the same issue exists on 4.1.x or -unstable.
>
> For a domain with maxmem=16000M and memory=3200M, what gets logged is
>
>   (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
>   (XEN) domain_crash called from p2m.c:1150
>   (XEN) Domain 3 reported crashed by domain 0 on cpu#6:
>   (XEN) p2m_pod_demand_populate: Out of populate-on-demand memory! tot_pages 480 pod_entries 221184
>   (XEN) domain_crash called from p2m.c:1150
>
> Translated to hex, the numbers are 1e0 and 36000. The latter varies
> across the (rather infrequent) cases where this happens (but was
> always a multiple of 1000 - see below), and immediate retries to
> create the affected domain have always succeeded so far (i.e. the
> failure is definitely not due to a lack of free memory).
>
> Given that the memory= target wasn't reached yet, I would conclude
> that this happens in the middle of the main physmap population code
> in tools/libxc/xc_hvm_build.c:setup_guest() (4.0.x file name used
> here). However, the way I read the code there, the sequence of
> population should be (using hex GFNs) 0...9f, c0...7ff, 800...fff,
> 1000...17ff, etc. That, however, appears to be inconsistent with the
> logged numbers above - tot_pages should always be at least 7e0 (the
> low 2Mb less the VGA hole), especially when pod_entries is divisible
> by 800 (the increment by which large-page population happens). As a
> result of this apparent inconsistency I can't really conclude
> anything from the logged numbers.
>
> The main question, irrespective of any numbers, of course is: how
> would p2m_pod_demand_populate() be invoked at all during this early
> phase of domain construction? Nothing should be touching any of the
> memory...

Yes, this is a very strange circumstance: p2m_demand_populate()
shouldn't happen until at least one PoD entry has been created; and
that shouldn't happen until after c0...7ff have been populated with
4k pages. (A sketch of the expected population sequence is appended
after this message.)

> If this nevertheless is possible (even if just for a single page),
> then perhaps the tools ought to make sure the pages put into the low
> 2Mb actually get zeroed, so the PoD code has a chance to find victim
> pages.

Although, it does look as though when populating 4k pages, the code
doesn't actually check whether the allocation succeeded or not... oh
wait, no, it actually checks rc as a condition of the while() loop --
but that is then clobbered by the xc_domain_set_pod_target() call.
But surely if the 4k allocation failed, the set_target() call should
fail as well? And in any case, there shouldn't yet be any PoD entries
to cause a demand-populate.

We probably should change "if(pod_mode)" to "if(rc == 0 && pod_mode)"
or something like that, just to be sure (sketched below). I'll spin
up a patch.

What I would try to do is add a stack trace to the demand_populate()
failure path, so you can see where the call came from; i.e., whether
it came from a guest access, or from someone in dom0 writing to some
of the memory. I'd also add a printk to set_pod_target(), so you can
see whether it was actually called and what it was set to (both
sketched below).

 -George
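A minimal sketch of the population sequence described above; the
helper name and the running tot_pages counter are illustrative
assumptions for this message, not the actual libxc code:

#include <stdint.h>

/* Illustrative only: the population order as described, with 4k pages
 * below 2Mb (skipping the VGA hole) and large-page batches of 0x800
 * GFNs thereafter. */
static void populate_order_sketch(uint64_t nr_pages)
{
    uint64_t gfn, tot_pages = 0;

    /* 4k pages below the VGA hole: GFNs 0...9f. */
    for ( gfn = 0; gfn < 0xa0; gfn++ )
        tot_pages++;

    /* 4k pages above the VGA hole up to 2Mb: GFNs c0...7ff. */
    for ( gfn = 0xc0; gfn < 0x800; gfn++ )
        tot_pages++;

    /* Here tot_pages == 0x7e0 (the low 2Mb less the VGA hole), which
     * is why a logged tot_pages of 0x1e0 alongside pod_entries
     * divisible by 0x800 looks inconsistent. */

    /* Large-page population in 0x800-GFN increments:
     * 800...fff, 1000...17ff, etc. */
    for ( gfn = 0x800; gfn < nr_pages; gfn += 0x800 )
        tot_pages += 0x800;

    (void)tot_pages;
}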
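Jan's suggestion of zeroing the low-2Mb pages might look roughly like
the following; zero_low_2mb() is an invented helper name, and the
mapping uses the 4.0.x-era xc_map_foreign_range() interface:

#include <string.h>
#include <sys/mman.h>
#include <xenctrl.h>

/* Hypothetical helper: map each populated low-2Mb GFN from the tools
 * and clear it, so a PoD emergency sweep can find zeroed victim
 * pages. */
static int zero_low_2mb(int xc_handle, uint32_t dom)
{
    unsigned long gfn;

    for ( gfn = 0; gfn < 0x800; gfn++ )
    {
        void *p;

        if ( gfn >= 0xa0 && gfn < 0xc0 )
            continue; /* VGA hole, not populated */

        p = xc_map_foreign_range(xc_handle, dom, XC_PAGE_SIZE,
                                 PROT_READ | PROT_WRITE, gfn);
        if ( p == NULL )
            return -1;
        memset(p, 0, XC_PAGE_SIZE);
        munmap(p, XC_PAGE_SIZE);
    }
    return 0;
}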
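The "if(rc == 0 && pod_mode)" change would look roughly like this; the
surrounding loop and error path are paraphrased from memory of
xc_hvm_build.c:setup_guest(), not quoted from the 4.0.x sources:

/* The while() loop does check rc, but rc was then unconditionally
 * overwritten by the set_target() call, masking any earlier
 * allocation failure. */
while ( (rc == 0) && (nr_pages > cur_pages) )
{
    /* ... populate 4k or 2Mb extents, setting rc on failure ... */
}

/* Proposed guard: only call set_target() if population succeeded. */
if ( (rc == 0) && pod_mode )
    rc = xc_domain_set_pod_target(xc_handle, dom, pod_pages,
                                  NULL, NULL, NULL);

if ( rc != 0 )
{
    PERROR("Could not allocate memory for HVM guest.");
    goto error_out;
}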
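And the suggested instrumentation might look like this on the
hypervisor side; the surrounding context in xen/arch/x86/mm/p2m.c is
paraphrased, but printk() and dump_execution_state() are the standard
Xen facilities for this kind of debugging:

/* In the p2m_pod_demand_populate() out-of-memory path, before
 * domain_crash(): dump the stack so the caller is visible. */
printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32
       " pod_entries %" PRIi32 "\n",
       __func__, d->tot_pages, p2md->pod.entry_count);
dump_execution_state(); /* guest access, or dom0 writing the memory? */
domain_crash(d);

/* In p2m_pod_set_mem_target(): confirm it is called, and with what. */
printk("%s: dom%d target %lu entry_count %" PRIi32 "\n",
       __func__, d->domain_id, target, p2md->pod.entry_count);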
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel