RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Interesting. And non-intuitive. I think you are saying that, at least
theoretically (and using your ABCD, not my ABC below), A is always faster
than (B | C), and (B | C) is always faster than D. Taking into account the
fact that the TLB size is fixed (I think), C will always be faster than B
and never slower than D. So if the theory proves true, that does seem to
eliminate my objection.

Thanks,
Dan

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxxxxx]
> Sent: Friday, March 20, 2009 3:46 AM
> To: Dan Magenheimer
> Cc: Wei Huang; xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Tim Deegan
> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
>
> Dan,
>
> Don't forget that this is about the p2m table, which is (if I understand
> correctly) orthogonal to what the guest pagetables are doing. So the
> scenario, if HAP is used, would be:
>
> A) DB code uses 2MB pages, OS assumes 2MB pages, guest PTs use 2MB pages,
>    p2m uses 2MB pages
>    - A TLB miss requires 3 * 3 = 9 reads (assuming a 64-bit guest)
> B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4KB pages
>    - A TLB miss requires 3 * 4 = 12 reads
> C) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 2MB pages
>    - A TLB miss requires 4 * 3 = 12 reads
> D) DB code uses 4KB pages, OS uses 4KB pages, p2m uses 4KB pages
>    - A TLB miss requires 4 * 4 = 16 reads
>
> And adding the 1GB p2m entries will change the p2m multiplier from 3 to 2
> (i.e., 3 * 2 = 6 reads for guest superpages, 4 * 2 = 8 reads for 4KB
> guest pages).
>
> (Those who are more familiar with the hardware, please correct me if
> I've made some mistakes or oversimplified things.)
>
> So adding 1GB pages to the p2m table shouldn't change the expectations of
> the guest OS in any case. Using it will benefit the guest to the same
> degree whether the guest is using 4KB, 2MB, or 1GB pages. (If I
> understand correctly.)
>
> -George
>
> Dan Magenheimer wrote:
> > Hi Wei --
> >
> > I'm not worried about the overhead of the splintering; I'm worried
> > about the "hidden overhead" every time a "silent splinter" is used.
> >
> > Let's assume three scenarios (and for now use 2MB pages, though the
> > same concerns can be extended to 1GB and/or mixed 2MB/1GB):
> >
> > A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only
> >    2MB pages (no splintering occurs)
> > B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only
> >    4KB pages (because of fragmentation, all 2MB pages have been
> >    splintered)
> > C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides 4KB
> >    pages
> >
> > Now run some benchmarks. Clearly one would assume that A is faster than
> > both B and C. The question is: is B faster or slower than C?
> >
> > If B is always faster than C, then I have less objection to "silent
> > splintering". But if B is sometimes (or maybe always?) slower than C,
> > that's a big issue, because a user has gone through the effort of
> > choosing a better-performing system configuration for their software
> > (2MB DB on 2MB OS), but it actually performs worse than if they had
> > chosen the "lower-performing" configuration. And, worse, it will likely
> > degrade over time, so performance might be fine when the
> > 2MB-DB-on-2MB-OS guest is launched but get much worse when it is
> > paused, save/restored, migrated, or hot-failed. So even if B is only
> > slightly faster than C, if B is much slower than A, this is a problem.
> >
> > Does that make sense?
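
For reference, the arithmetic George uses above can be reproduced with a
small back-of-the-envelope helper. This is only a sketch of his simplified
model (reads per TLB miss ~ guest page-table levels walked times p2m levels
walked), not code from the patches, and as George notes himself the real
hardware walk may be more involved:

    /* Back-of-the-envelope helper for the simplified model above:
     * reads per TLB miss ~= (guest page-table levels walked) *
     *                       (p2m/NPT levels walked).
     * A 64-bit guest walks 4 levels for 4KB pages, 3 for 2MB, 2 for 1GB;
     * likewise for the p2m depending on which page size it uses.
     */
    #include <stdio.h>

    static int levels(unsigned long page_size)
    {
        switch (page_size) {
        case 4UL << 10:  return 4;  /* 4KB: L4, L3, L2, L1 */
        case 2UL << 20:  return 3;  /* 2MB: L4, L3, L2     */
        case 1UL << 30:  return 2;  /* 1GB: L4, L3         */
        default:         return -1;
        }
    }

    int main(void)
    {
        unsigned long sizes[] = { 4UL << 10, 2UL << 20, 1UL << 30 };
        const char *names[]   = { "4KB", "2MB", "1GB" };

        for (int g = 0; g < 3; g++)
            for (int p = 0; p < 3; p++)
                printf("guest %s, p2m %s: %d reads per TLB miss\n",
                       names[g], names[p],
                       levels(sizes[g]) * levels(sizes[p]));
        return 0;
    }

Running it reproduces the figures quoted above: 9 reads for scenario A,
12 for B and C, 16 for D, and 6 or 8 reads once 1GB p2m entries are used.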
> > Some suggestions:
> > 1) If it is possible for an administrator to determine how many large
> >    pages (both 2MB and 1GB) were requested by each domain and how many
> >    are currently whole vs. splintered, that would help.
> > 2) We may need some form of memory defragmenter.
> >
> >> -----Original Message-----
> >> From: Wei Huang [mailto:wei.huang2@xxxxxxx]
> >> Sent: Thursday, March 19, 2009 12:52 PM
> >> To: Dan Magenheimer
> >> Cc: George Dunlap; xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>
> >> Dan,
> >>
> >> Thanks for your comments. I am not sure which splintering overhead you
> >> are referring to. I can think of three areas:
> >>
> >> 1. Splintering in page allocation
> >> In this case, Xen fails to allocate the requested page order, so it
> >> falls back to smaller pages to set up the p2m table. The overhead is
> >> O(guest_mem_size), which is a one-time deal.
> >>
> >> 2. P2M splits a large page into smaller pages
> >> This is one-directional because we don't merge smaller pages back into
> >> large ones. The worst case is to split all guest large pages, so the
> >> overhead is O(total_large_page_mem). In the long run the overhead
> >> converges to 0 because it is one-directional. Note this overhead also
> >> covers the case when the PoD feature is enabled.
> >>
> >> 3. CPU splintering
> >> If the CPU does not support 1GB pages, it automatically does the
> >> splintering using smaller pages (such as 2MB). In this case, the
> >> overhead is always there. But 1) this only happens on a small number of
> >> old chips; and 2) I believe it is still faster than 4KB pages. CPUID
> >> (the 1GB feature flag and 1GB TLB entries) can be used to detect and
> >> stop this problem, if we don't really like it.
> >>
> >> I agree with your concerns. Customers should have the right to make
> >> their own decision, but that requires the new feature to be enabled in
> >> the first place. For a lot of benchmarks, the splintering overhead can
> >> be offset by the benefits of huge pages. SPECjbb is a good example of
> >> using large pages (see Ben Serebrin's presentation at Xen Summit). With
> >> that said, I agree with the idea of adding a new option in the guest
> >> configuration file.
> >>
> >> -Wei
> >>
> >> Dan Magenheimer wrote:
> >>
> >>> I'd like to reiterate my argument raised in a previous discussion of
> >>> hugepages: just because this CAN be made to work doesn't imply that it
> >>> SHOULD be made to work. Real users use larger pages in their OS for
> >>> the sole reason that they expect a performance improvement. If it
> >>> magically works, but works slowly (and possibly slower than if the OS
> >>> had just used small pages to start with), this is likely to lead to
> >>> unsatisfied customers, and perhaps allegations such as "Xen sucks when
> >>> running databases".
> >>>
> >>> So, please, let's think this through before implementing it just
> >>> because we can. At a minimum, an administrator should somehow be
> >>> warned if large pages are getting splintered.
> >>>
> >>> And if it's going in over my objection, please tie it to a boot option
> >>> that defaults off, so administrator action is required to allow silent
> >>> splintering.
> >>>
> >>> My two cents...
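
As a concrete illustration of Wei's point 3 above: whether the processor
can map 1GB pages natively is advertised through CPUID and can be probed
with a few lines of C. The snippet below is only an illustrative userspace
check (Xen itself would presumably rely on its own CPU feature handling);
CPUID leaf 0x80000001, EDX bit 26 is the standard long-mode 1GB-page flag,
and on AMD parts leaf 0x80000019 additionally reports the 1GB TLB sizes.

    /* Sketch of the CPUID check Wei mentions: does the CPU map 1GB pages
     * natively (so the hardware won't silently splinter them)?
     * Leaf 0x80000001, EDX bit 26 is the 1GB-page ("PDPE1GB") flag.
     * x86 + GCC/Clang only (<cpuid.h>).
     */
    #include <stdio.h>
    #include <cpuid.h>

    static int cpu_has_1gb_pages(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            return 0;            /* extended leaf not available */

        return (edx >> 26) & 1;  /* bit 26: 1GB page support */
    }

    int main(void)
    {
        printf("1GB pages %ssupported by this CPU\n",
               cpu_has_1gb_pages() ? "" : "not ");
        return 0;
    }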
> >>> Dan
> >>>
> >>>> -----Original Message-----
> >>>> From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
> >>>> Sent: Thursday, March 19, 2009 2:07 AM
> >>>> To: George Dunlap
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >>>> Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>> Here are patches using the middle approach. It handles 1GB pages in
> >>>> PoD by remapping the 1GB region with 2MB pages and retrying. I also
> >>>> added code for 1GB detection. Please comment.
> >>>>
> >>>> Thanks a lot,
> >>>>
> >>>> -Wei
> >>>>
> >>>> -----Original Message-----
> >>>> From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of George Dunlap
> >>>> Sent: Wednesday, March 18, 2009 12:20 PM
> >>>> To: Huang2, Wei
> >>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
> >>>> Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
> >>>>
> >>>> Thanks for doing this work, Wei -- especially all the extra effort
> >>>> for the PoD integration.
> >>>>
> >>>> One question: how well would you say you've tested the PoD
> >>>> functionality? Or, to put it the other way, how much do I need to
> >>>> prioritize testing this before the 3.4 release?
> >>>>
> >>>> It wouldn't be a bad idea to do as you suggested, and break things
> >>>> into 2MB pages for the PoD case. In order to take the best advantage
> >>>> of this in a PoD scenario, you'd need to have a balloon driver that
> >>>> could allocate 1GB of contiguous *guest* p2m space, which seems a bit
> >>>> optimistic at this point...
> >>>>
> >>>> -George
> >>>>
> >>>> 2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:
> >>>>> Current Xen supports 2MB super pages for NPT/EPT. The attached
> >>>>> patches extend this feature to support 1GB pages. The PoD
> >>>>> (populate-on-demand) feature introduced by George Dunlap made P2M
> >>>>> modification harder. I tried to preserve the existing PoD design by
> >>>>> introducing a 1GB PoD cache list.
> >>>>>
> >>>>> Note that 1GB PoD can be dropped if we don't care about 1GB when PoD
> >>>>> is enabled. In this case, we can just split a 1GB PDPE into 512 2MB
> >>>>> PDE entries and grab pages from the PoD super list. That can pretty
> >>>>> much make 1gb_p2m_pod.patch go away.
> >>>>>
> >>>>> Any comment/suggestion on the design idea will be appreciated.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> -Wei
> >>>>>
> >>>>> The following is the description:
> >>>>>
> >>>>> === 1gb_tools.patch ===
> >>>>> Extend the existing setup_guest() function. Basically, it tries to
> >>>>> allocate 1GB pages whenever available. If this request fails, it
> >>>>> falls back to 2MB. If both fail, then 4KB pages will be used.
> >>>>>
> >>>>> === 1gb_p2m.patch ===
> >>>>> * p2m_next_level()
> >>>>> Check the PSE bit of the L3 page table entry. If a 1GB page is found
> >>>>> (PSE=1), we split the 1GB page into 512 2MB pages.
> >>>>>
> >>>>> * p2m_set_entry()
> >>>>> Set the PSE bit of the L3 P2M table entry if page order == 18 (1GB).
> >>>>>
> >>>>> * p2m_gfn_to_mfn()
> >>>>> Add support for the 1GB case when doing gfn-to-mfn translation. When
> >>>>> the L3 entry is marked as POPULATE_ON_DEMAND, we call
> >>>>> p2m_pod_demand_populate(). Otherwise, we do the regular address
> >>>>> translation (gfn ==> mfn).
> >>>>>
> >>>>> * p2m_gfn_to_mfn_current()
> >>>>> This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked as
> >>>>> POPULATE_ON_DEMAND, it demands a populate using
> >>>>> p2m_pod_demand_populate(). Otherwise, it does a normal translation.
> >>>>> The 1GB page case is taken into consideration.
> >>>>>
> >>>>> * set_p2m_entry()
> >>>>> Request 1GB pages.
> >>>>>
> >>>>> * audit_p2m()
> >>>>> Support 1GB pages while auditing the p2m table.
> >>>>>
> >>>>> * p2m_change_type_global()
> >>>>> Deal with 1GB pages when changing the global page type.
> >>>>>
> >>>>> === 1gb_p2m_pod.patch ===
> >>>>> * xen/include/asm-x86/p2m.h
> >>>>> Minor change to deal with PoD. It separates the super page cache
> >>>>> list into 2MB and 1GB lists. Similarly, we record the last gpfn of
> >>>>> sweeping for both 2MB and 1GB.
> >>>>>
> >>>>> * p2m_pod_cache_add()
> >>>>> Check the page order and add 1GB super pages to the PoD 1GB cache
> >>>>> list.
> >>>>>
> >>>>> * p2m_pod_cache_get()
> >>>>> Grab a page from the cache list. It tries to break a 1GB page into
> >>>>> 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB pages can
> >>>>> be requested from super pages. The breaking order is 2MB then 1GB.
> >>>>>
> >>>>> * p2m_pod_cache_target()
> >>>>> This function is used to set the PoD cache size. To increase the PoD
> >>>>> target, we try to allocate 1GB pages from the Xen domheap. If this
> >>>>> fails, we try 2MB. If both fail, we try 4KB, which is guaranteed to
> >>>>> work. To decrease the target, we use a similar approach: we first
> >>>>> try to free 1GB pages from the 1GB PoD cache list. If such a request
> >>>>> fails, we try the 2MB PoD cache list. If both fail, we try the 4KB
> >>>>> list.
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>> This adds a new function to check for 1GB pages. This function is
> >>>>> similar to p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>> * p2m_pod_zero_check_superpage_1gb()
> >>>>> We add a new function to sweep 1GB pages from guest memory. This is
> >>>>> the same as p2m_pod_zero_check_superpage_2mb().
> >>>>>
> >>>>> * p2m_pod_demand_populate()
> >>>>> The trick of this function is to do remap-and-retry if
> >>>>> p2m_pod_cache_get() fails. When p2m_pod_cache_get() fails, this
> >>>>> function splits the p2m table entry into smaller ones (e.g. 1GB ==>
> >>>>> 2MB or 2MB ==> 4KB). That guarantees that populate demands always
> >>>>> work.
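
The same fallback order (1GB, then 2MB, then 4KB) shows up in setup_guest(),
p2m_pod_cache_get() and p2m_pod_cache_target() above. The sketch below is
purely illustrative -- the helper names are invented and it is not the patch
code -- but it shows the shape of the loop: try order 18 first, fall back to
order 9, and finally order 0, which is assumed to always succeed.

    /* Illustrative sketch (not the actual Xen code) of the 1GB -> 2MB ->
     * 4KB fallback described above. Page "order" is log2 of the number of
     * 4KB pages: order 18 = 1GB, order 9 = 2MB, order 0 = 4KB.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_ORDER_4K   0
    #define PAGE_ORDER_2M   9
    #define PAGE_ORDER_1G   18

    /* Stand-in for a real allocator; here it simply pretends that 1GB
     * chunks are exhausted so the fallback path is exercised. */
    static bool try_alloc_order(unsigned int order)
    {
        return order < PAGE_ORDER_1G;
    }

    /* Prefer the largest page size and report the order actually used. */
    static unsigned int alloc_with_fallback(void)
    {
        static const unsigned int orders[] = {
            PAGE_ORDER_1G, PAGE_ORDER_2M, PAGE_ORDER_4K
        };

        for (unsigned int i = 0; i < 3; i++)
            if (try_alloc_order(orders[i]))
                return orders[i];

        return PAGE_ORDER_4K;   /* order 0 is assumed to always succeed */
    }

    int main(void)
    {
        printf("allocated order %u\n", alloc_with_fallback());
        return 0;
    }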

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel