Xen project Mailing List

[Xen-devel] Re: Race between ept_get_entry / ept_set_entry

From: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>

Date: Thu, 26 Aug 2010 14:05:22 +0100

Cc: "Li, Xin" <xin.li@xxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Tim Deegan <Tim.Deegan@xxxxxxxxxx>

Delivery-date: Thu, 26 Aug 2010 06:06:30 -0700

Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=SOoMCX6WYisoDcMdKmkpXgxX2kfE2dtUbtTlgD4cBMCqNQEXHhkmZKXwzlpyYwu2/l jQTR6yTVRvFqQ8xZvlZFGanvMUbUZS7g9AYI/m8CaaM0p8JPll/ETGhdRSIokmxzbteO f7vMp/fzEgv3RJPBsrA+Yge9HmhtvHJK7aPmI=

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

FWIW, modifying ept_next_level() and ept_set_entry() to be {write,read}-once (i.e., reading once and working with a local copy; or making the entry in a local copy and writing it all at once) seems to work as well, at least on 64-bit. But overall, unless there's a measured reason to avoid locks, I think the best thing for long-term stability is just to have ept_get_entry() grab the p2m lock. -George On Thu, Aug 26, 2010 at 11:35 AM, George Dunlap <George.Dunlap@xxxxxxxxxxxxx> wrote: > In the course of doing some fixes for my populate-on-demand testing, I > found that a Windows Server 2008 VM with 30G static max and 24G ram > (i.e., booting ballooned) crashed 1-2 times out of ten during boot, > reporting MMIO errors. > > I managed to get a trace of this crash. Strangely enough, the trace > indicated that the page the NPF occured on was populate-on-demand -- > but that hvm_hap_nested_page_fault() injected a GP anyway. > > The only way this would be possible is if the gfn_to_mfn_query() in > the trace function got a p2m type of p2m_popluate_on_demand, but the > gfn_to_mfn_current() in hvm_hap_nested_page_fault() got a p2m type of > p2m_mmio_dm. > > Looking at the trace (snippet attached), the failed NPF happened on > d1v1; but almost simultaneously on d1v0, an NPF fault happened that > caused a populate-on-demand demand populate. That demand populate > happened to be of a superpage that was shared with the gpa fault on > d1v1. > > So, the first query on d1v1 (correctly) got a PoD; but the second > query, instead of either causing the demand-populate, or successfully > getting the result of d1v0's demand populate, returned failure, > causing the guest to crash. > > I looked in the p2m-ept.c code, and noticed (once again) that > ept_get_entry() can be called without the p2m lock held. I added > conditional locks, and am running the test again. The guest has now > booted 20 times successfully without crashing (whereas before, the > average was about 2 in 10 crashing). > > Looking closely at the code, I can see one potential race: > * entry starts out PoD, not-present. > * v0 finds the entry PoD, allocates a page, calls set_p2m_entry(), > which calls ept_set_entry(). > * v1 begins to walk the pagetable; at some point, it calls > ept_next_level(), which finds the flags all clear (entry->epte & 7 == > 0) > * v0 ept_set_entry() changes the p2m type from p2m_populate_on_demand > to p2m_ram_rw > * v1 ept_next_level() reads entry->avail1 and finds that it is not > p2m_populate_on_demand, so it returns GUEST_TABLE_MAP_FAILED > * v0 ept_set_entry() sets the flags to present. > > Is there a good reason not to just grab the p2m lock when walking the > ept tables? We could conceivably do some cleverness to avoid this > kind of race, but unless there's a significant performance gain, I > think the simple approach is better. > > -George > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.