[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Balloon driver bug in increase_reservation
On Wed, 4 Sep 2013, Wei Liu wrote: > On Wed, Sep 04, 2013 at 02:15:51PM +0100, Stefano Stabellini wrote: > > On Mon, 2 Sep 2013, Ian Campbell wrote: > > > On Mon, 2013-09-02 at 16:13 +0100, Wei Liu wrote: > > > > On Mon, Sep 02, 2013 at 04:09:26PM +0100, Ian Campbell wrote: > > > > > On Mon, 2013-09-02 at 16:04 +0100, Wei Liu wrote: > > > > > > On Mon, Sep 02, 2013 at 03:48:43PM +0100, Ian Campbell wrote: > > > > > > > On Mon, 2013-09-02 at 15:43 +0100, Wei Liu wrote: > > > > > > > > Hi, Stefano > > > > > > > > > > > > > > > > I found another bug in the balloon scratch page code. As I > > > > > > > > didn't follow > > > > > > > > the discussion on scratch page so I cannot propose a proper fix > > > > > > > > at the > > > > > > > > moment. > > > > > > > > > > > > > > > > The problem is that in balloon.c:increase_reservation, when a > > > > > > > > ballooned > > > > > > > > page is resued, it can have a valid P2M entry pointing to the > > > > > > > > scratch, > > > > > > > > hitting the BUG_ON > > > > > > > > > > > > > > > > BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) && > > > > > > > > phys_to_machine_mapping_valid(pfn)); > > > > > > > > > > > > > > > > As balloon worker might run by a CPU other then the one that > > > > > > > > returns the > > > > > > > > page, checking pfn_to_mfn(pfn) == local_cpu_scratch_page_mfn > > > > > > > > wouldn't > > > > > > > > work. Checking pfn_to_mfn(pfn) belongs to the set of all > > > > > > > > scratch page > > > > > > > > mfns is not desirable. > > > > > > > > > > > > > > This makes me think that whoever suggested that pfn_to_mfn for a > > > > > > > ballooned page out to return INVALID_MFN was right. > > > > > > > > > > > > > > > > > > > If there are many balloon pages the check can be expensive. > > > > > > > > > > IIRC the suggestion was that the p2m for a ballooned out page would > > > > > contain INVALID_MFN, so the expense is just the lookup you would be > > > > > doing anyway. > > > > > > > > > > > > > That was David's idea. Stefano was worried that other PVOPS hooks would > > > > need to know about the MFN. I don't know how much it holds true. Need > > > > some input from Konrad I think. > > > > > > Hrm, aren't those expected to not be operating on ballooned out pages > > > anyway? Prior to this change those callers would have got INVALID_MFN, I > > > think? > > > > > > My gut feeling is that places which accidentally/unexpectedly have a > > > ballooned out page in their hand have either a virtual address or an mfn > > > (e.g. a cr2 fault address) and not a pfn in their hands, and won't in > > > general be that interested in the pfn and/or are already geared up to > > > deal with INVALID_MFN because previously that's what they would have > > > gotten. There's every chance I'm wrong though. > > > > > > I think we should just remove the BUG_ON: > > > > - if the guest is autotranslate, then the BUG_ON is already irrelevant; > > - if the guest is not autotranslate, then we are sure that the p2m of > > the page is pointing to a scratch page; > > > > either way the BUG_ON is wrong. > > Is it there to guard other misuse of the function? IOW can we be > confident that all calls are legit? the ballooned_pages list would need to be broken for this BUG_ON to trigger > Messing up the P2M table looks very hard to debug? > > > Otherwise could simply implement a is_balloon_scratch_page function that > > checks whether a given pfn corresponds to any of the scratch pages (it > > doesn't need to be the scratch page of this cpu). > > That's quite expensive IMHO, especially when you have lots of CPU's and > lots of ballooned pages. as a compromise we could run a check on the pfn only on debug runs? _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |