Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support
Dan,

I agree on the order: A > C >= B > D. Generally, super pages should perform better than small pages. In reality, the difference between B and C is subtle; it depends on how the TLB cache is designed and whether TLB flushes happen frequently.

-Wei

Dan Magenheimer wrote:

Interesting. And non-intuitive. I think you are saying that, at least theoretically (and using your ABCD, not my ABC below), A is always faster than (B | C), and (B | C) is always faster than D. Taking into account the fact that the TLB size is fixed (I think), C will always be faster than B and never slower than D. So if the theory proves true, that does seem to eliminate my objection.

Thanks,
Dan

-----Original Message-----
From: George Dunlap [mailto:george.dunlap@xxxxxxxxxxxxx]
Sent: Friday, March 20, 2009 3:46 AM
To: Dan Magenheimer
Cc: Wei Huang; xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Dan,

Don't forget that this is about the p2m table, which is (if I understand correctly) orthogonal to what the guest pagetables are doing. So the scenario, if HAP is used, would be:

A) DB code uses 2MB pages, OS uses 2MB pages, guest PTs use 2MB pages, p2m uses 2MB pages
   - a TLB miss requires 3 * 3 = 9 reads (assuming a 64-bit guest)
B) DB code uses 2MB pages, OS uses 2MB pages, p2m uses 4K pages
   - a TLB miss requires 3 * 4 = 12 reads
C) DB code uses 4K pages, OS uses 4K pages, p2m uses 2MB pages
   - a TLB miss requires 4 * 3 = 12 reads
D) DB code uses 4K pages, OS uses 4K pages, p2m uses 4K pages
   - a TLB miss requires 4 * 4 = 16 reads

And adding the 1G p2m entries will change the multiplier from 3 to 2 (i.e., 3 * 2 = 6 reads for superpages, 4 * 2 = 8 reads for 4K guest pages).

(Those who are more familiar with the hardware, please correct me if I've made some mistakes or oversimplified things.)

So adding 1G pages to the p2m table shouldn't change expectations of the guest OS in any case. Using it will benefit the guest to the same degree whether the guest is using 4K, 2MB, or 1G pages. (If I understand correctly.)

-George
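To make George's arithmetic concrete, here is a minimal sketch (not part of the original thread) that reproduces the per-TLB-miss read counts under the same simplified model, in which every guest page-table level touched during a walk must itself be translated through the p2m, so a miss costs roughly guest_levels * p2m_levels reads. Real hardware walk counts differ somewhat; the level counts below are just the usual 64-bit 4-level numbers.

/* Toy model of the 2-D (nested) page-walk counts in the A-D scenarios.
 * Level counts assume 64-bit 4-level tables: 4 for 4KB leaves, 3 for
 * 2MB, 2 for 1GB. */
#include <stdio.h>

enum pagesize { SZ_4K = 4, SZ_2M = 3, SZ_1G = 2 };  /* value = walk levels */

static void show(char name, enum pagesize guest, enum pagesize p2m)
{
    printf("%c) %d * %d = %2d reads per TLB miss\n",
           name, guest, p2m, guest * p2m);
}

int main(void)
{
    show('A', SZ_2M, SZ_2M);   /* 3 * 3 =  9 */
    show('B', SZ_2M, SZ_4K);   /* 3 * 4 = 12 */
    show('C', SZ_4K, SZ_2M);   /* 4 * 3 = 12 */
    show('D', SZ_4K, SZ_4K);   /* 4 * 4 = 16 */
    show('E', SZ_2M, SZ_1G);   /* 3 * 2 =  6, with 1GB p2m entries */
    show('F', SZ_4K, SZ_1G);   /* 4 * 2 =  8 */
    return 0;
}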
Dan Magenheimer wrote:

Hi Wei --

I'm not worried about the overhead of the splintering, I'm worried about the "hidden overhead" every time a "silent splinter" is used. Let's assume three scenarios (and for now use 2MB pages, though the same concerns can be extended to 1GB and/or mixed 2MB/1GB):

A) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only 2MB pages (no splintering occurs)
B) DB code assumes 2MB pages, OS assumes 2MB pages, Xen provides only 4KB pages (because of fragmentation, all 2MB pages have been splintered)
C) DB code assumes 4KB pages, OS assumes 4KB pages, Xen provides 4KB pages

Now run some benchmarks. Clearly one would assume that A is faster than both B and C. The question is: is B faster or slower than C?

If B is always faster than C, then I have less objection to "silent splintering". But if B is sometimes (or maybe always?) slower than C, that's a big issue, because a user has gone through the effort of choosing a better-performing system configuration for their software (2MB DB on a 2MB OS), but it actually performs worse than if they had chosen the "lower performing" configuration. And, worse, it will likely degrade over time, so performance might be fine when the 2MB-DB-on-2MB-OS guest is launched but get much worse after it is paused, save/restored, migrated, or hot-failed. So even if B is only slightly faster than C, if B is much slower than A, this is a problem.

Does that make sense?

Some suggestions:
1) If it is possible for an administrator to determine how many large pages (both 2MB and 1GB) were requested by each domain and how many are currently whole vs. splintered, that would help.
2) We may need some form of memory defragmenter.

-----Original Message-----
From: Wei Huang [mailto:wei.huang2@xxxxxxx]
Sent: Thursday, March 19, 2009 12:52 PM
To: Dan Magenheimer
Cc: George Dunlap; xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Dan,

Thanks for your comments. I am not sure which splintering overhead you are referring to. I can think of three areas:

1. Splintering in page allocation
In this case, Xen fails to allocate the requested page order, so it falls back to smaller pages to set up the p2m table. The overhead is O(guest_mem_size), which is a one-time deal.

2. P2M splits a large page into smaller pages
This is one-directional because we don't merge smaller pages back into large ones. The worst case is to split all guest large pages, so the overhead is O(total_large_page_mem). In the long run, the overhead will converge to 0 because it is one-directional. Note this overhead also covers the case when the PoD feature is enabled.

3. CPU splintering
If the CPU does not support 1GB pages, it automatically does the splintering using smaller ones (such as 2MB). In this case, the overhead is always there. But 1) this only happens on a small number of old chips; 2) I believe it is still faster than 4K pages. CPUID (the 1GB feature and 1GB TLB entries) can be used to detect and stop this problem, if we don't really like it.

I agree with your concerns. Customers should have the right to make their own decision, but that requires the new feature to be enabled in the first place. For a lot of benchmarks, the splintering overhead can be offset by the benefits of huge pages. SPECjbb is a good example of using large pages (see Ben Serebrin's presentation at Xen Summit). With that said, I agree with the idea of adding a new option in the guest configuration file.

-Wei

Dan Magenheimer wrote:

I'd like to reiterate my argument raised in a previous discussion of huge pages: just because this CAN be made to work doesn't imply that it SHOULD be made to work. Real users use larger pages in their OS for the sole reason that they expect a performance improvement. If it magically works, but works slowly (and possibly slower than if the OS had just used small pages to start with), this is likely to lead to unsatisfied customers, and perhaps allegations such as "Xen sucks when running databases". So, please, let's think this through before implementing it just because we can.

At a minimum, an administrator should be somehow warned if large pages are getting splintered. And if it's going in over my objection, please tie it to a boot option that defaults off, so administrator action is required to allow silent splintering.

My two cents...
Dan

-----Original Message-----
From: Huang2, Wei [mailto:Wei.Huang2@xxxxxxx]
Sent: Thursday, March 19, 2009 2:07 AM
To: George Dunlap
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: RE: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Here are patches using the middle approach. It handles 1GB pages in PoD by remapping 1GB with 2MB pages and retrying. I also added code for 1GB detection. Please comment.

Thanks a lot,
-Wei
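As a rough illustration of the 1GB detection Wei mentions, a minimal user-space sketch (not the code from the patches) can check the Page1GB capability bit, which is advertised in CPUID leaf 0x80000001, EDX bit 26; the 1GB TLB entry counts also mentioned in the thread live in other extended leaves and are not queried here.

/* Sketch only: detect whether the CPU can back 1GB mappings natively,
 * so the "CPU splintering" case does not apply.  Long-mode 1GB pages
 * are advertised in CPUID leaf 0x80000001, EDX bit 26 (Page1GB). */
#include <stdio.h>
#include <cpuid.h>

static int cpu_has_1gb_pages(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 0;                     /* extended leaf not implemented */
    return !!(edx & (1u << 26));      /* Page1GB capability bit */
}

int main(void)
{
    printf("1GB pages %ssupported by this CPU\n",
           cpu_has_1gb_pages() ? "" : "not ");
    return 0;
}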
-----Original Message-----
From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of George Dunlap
Sent: Wednesday, March 18, 2009 12:20 PM
To: Huang2, Wei
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; Tim Deegan
Subject: Re: [Xen-devel] [RFC][Patches] Xen 1GB Page Table Support

Thanks for doing this work, Wei -- especially all the extra effort for the PoD integration. One question: how well would you say you've tested the PoD functionality? Or, to put it the other way, how much do I need to prioritize testing this before the 3.4 release?

It wouldn't be a bad idea to do as you suggested, and break things into 2 meg pages for the PoD case. In order to take the best advantage of this in a PoD scenario, you'd need to have a balloon driver that could allocate 1G of contiguous *guest* p2m space, which seems a bit optimistic at this point...

-George

2009/3/18 Huang2, Wei <Wei.Huang2@xxxxxxx>:

Current Xen supports 2MB super pages for NPT/EPT. The attached patches extend this feature to support 1GB pages. The PoD (populate-on-demand) feature introduced by George Dunlap made P2M modification harder. I tried to preserve the existing PoD design by introducing a 1GB PoD cache list.

Note that 1GB PoD can be dropped if we don't care about 1GB when PoD is enabled. In this case, we can just split a 1GB PDPE into 512 2MB PDE entries and grab pages from the PoD super list. That would pretty much make 1gb_p2m_pod.patch go away.

Any comment/suggestion on the design idea will be appreciated.

Thanks,
-Wei

The following is the description:

=== 1gb_tools.patch ===
Extend the existing setup_guest() function. Basically, it tries to allocate 1GB pages whenever available. If this request fails, it falls back to 2MB. If both fail, then 4KB pages will be used.

=== 1gb_p2m.patch ===
* p2m_next_level()
Check the PSE bit of the L3 page table entry. If a 1GB page is found (PSE=1), we split the 1GB page into 512 2MB pages (a simplified sketch of this step follows this patch description).
* p2m_set_entry()
Configure the PSE bit of the L3 P2M table if page order == 18 (1GB).
* p2m_gfn_to_mfn()
Add support for the 1GB case when doing gfn-to-mfn translation. When the L3 entry is marked as POPULATE_ON_DEMAND, we call p2m_pod_demand_populate(). Otherwise, we do the regular address translation (gfn ==> mfn).
* p2m_gfn_to_mfn_current()
This is similar to p2m_gfn_to_mfn(). When the L3 entry is marked as POPULATE_ON_DEMAND, it demands a populate using p2m_pod_demand_populate(). Otherwise, it does a normal translation. The 1GB page case is taken into consideration.
* set_p2m_entry()
Request a 1GB page.
* audit_p2m()
Support 1GB pages while auditing the p2m table.
* p2m_change_type_global()
Deal with 1GB pages when changing the global page type.
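For illustration only, here is a simplified stand-in (not the actual Xen p2m code) for the splitting step described under p2m_next_level() above: when a 1GB superpage entry is found at L3, a new L2 page is filled with 512 contiguous 2MB entries covering the same machine-frame range. The entry layout and flag handling are toy versions; the real code uses Xen's l3e/l2e helpers and then repoints the L3 entry and flushes stale mappings.

/* Toy split of a 1GB superpage mapping into 512 x 2MB mappings. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PSE_FLAG        (1ull << 7)     /* "this entry maps a superpage" */
#define ENTRIES_PER_L2  512
#define PFNS_PER_2M     (1ull << 9)     /* 512 x 4KB frames per 2MB      */

typedef uint64_t pte_t;                 /* toy entry: mfn in bits 12+    */

static void split_1gb_entry(pte_t l3e, pte_t l2[ENTRIES_PER_L2])
{
    uint64_t base_mfn = l3e >> 12;
    uint64_t flags    = l3e & 0xfff;    /* propagate access/type flags   */

    /* Each new L2 entry keeps PSE set: it still maps a (2MB) superpage. */
    for (int i = 0; i < ENTRIES_PER_L2; i++)
        l2[i] = ((base_mfn + (uint64_t)i * PFNS_PER_2M) << 12) | flags;

    /* The real code would now point the L3 entry at this new L2 page
     * (with PSE cleared) and flush any stale translations. */
}

int main(void)
{
    pte_t *l2 = calloc(ENTRIES_PER_L2, sizeof(pte_t));
    pte_t l3e = (0x40000ull << 12) | PSE_FLAG | 1;   /* 1GB page at mfn 0x40000 */

    split_1gb_entry(l3e, l2);
    printf("l2[0]=%#llx l2[511]=%#llx\n",
           (unsigned long long)l2[0], (unsigned long long)l2[511]);
    free(l2);
    return 0;
}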
=== 1gb_p2m_pod.patch ===
* xen/include/asm-x86/p2m.h
Minor changes to deal with PoD. It separates the super page cache list into 2MB and 1GB lists. Similarly, we record the last gpfn of sweeping for both 2MB and 1GB.
* p2m_pod_cache_add()
Check the page order and add 1GB super pages to the PoD 1GB cache list.
* p2m_pod_cache_get()
Grab a page from the cache list. It tries to break a 1GB page into 512 2MB pages if the 2MB PoD list is empty. Similarly, 4KB pages can be requested from super pages. The breaking order is 2MB, then 1GB.
* p2m_pod_cache_target()
This function is used to set the PoD cache size. To increase the PoD target, we try to allocate 1GB from the Xen domheap. If this fails, we try 2MB. If both fail, we try 4KB, which is guaranteed to work. To decrease the target, we use a similar approach: we first try to free 1GB pages from the 1GB PoD cache list. If that fails, we try the 2MB PoD cache list. If both fail, we try the 4KB list.
* p2m_pod_zero_check_superpage_1gb()
This adds a new function to check for 1GB pages. This function is similar to p2m_pod_zero_check_superpage_2mb().
* p2m_pod_zero_check_superpage_1gb()
We add a new function to sweep 1GB pages from guest memory. This is the same as p2m_pod_zero_check_superpage_2mb().
* p2m_pod_demand_populate()
The trick of this function is to do a remap-and-retry if p2m_pod_cache_get() fails. When that happens, this function splits the p2m table entry into smaller ones (e.g., 1GB ==> 2MB or 2MB ==> 4KB). That guarantees populate demands always work (a sketch of this fall-back pattern follows below).
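The same fall-back shape shows up in setup_guest(), p2m_pod_cache_target(), and the remap-and-retry path of p2m_pod_demand_populate(): try the biggest order first, then drop to the next smaller size. A hedged sketch of that pattern follows; try_alloc() is a hypothetical stand-in for the real allocator (e.g. Xen's domheap allocator), not code from the patches.

/* Sketch of the "largest order first, then fall back" pattern.
 * Order 18 = 1GB, order 9 = 2MB, order 0 = 4KB. */
#include <stdbool.h>
#include <stdio.h>

static bool try_alloc(unsigned int order)
{
    /* Stand-in: pretend the heap is too fragmented for 1GB chunks. */
    return order <= 9;
}

static int alloc_with_fallback(void)
{
    static const unsigned int orders[] = { 18, 9, 0 };   /* 1GB, 2MB, 4KB */

    for (unsigned int i = 0; i < 3; i++) {
        if (try_alloc(orders[i])) {
            printf("allocated a %s chunk\n",
                   orders[i] == 18 ? "1GB" : orders[i] == 9 ? "2MB" : "4KB");
            return (int)orders[i];
        }
        /* On failure, the PoD demand-populate path similarly rewrites the
         * failing p2m entry at the next smaller size and retries. */
    }
    return -1;   /* 4KB is expected to always succeed in practice */
}

int main(void)
{
    alloc_with_fallback();
    return 0;
}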