Xen project Mailing List

Re: [Xen-devel] RMRR Fix Design for Xen

I'll work out a new design proposal based on below content and previous discussions. Thanks Tiejun for your hard-working and Jan for your careful reviews so far. For below comment: > > Also there's clearly an alternative proposal: Drop support for sharing > > page tables. Your colleagues will surely have told you that we've > > been considering this for quite some time, and had actually hoped > > for them to do the necessary VT-d side work to allow for this without > > causing performance regressions. let's separate it from RMRR discussion, because RMRR issues are about general p2m and thus orthogonal to the implementation difference between shared or not-shared fact (though it did lead to different behaviors w/ current bogus logic). Thanks Kevin > From: Chen, Tiejun > Sent: Monday, December 22, 2014 10:12 AM > > Jan, > > Thanks for your time but I'm not going to address your comments here. > Because I heard this design is totally not satisfied your expectation. > But this really was reviewed with several revisions by Kevin and Yang > before sending in public... > > Anyway, I guess the only thing what I can do is that, Kevin and Yang, or > other appropriate guys should finish this design as you expect. So now > I'd better not say anything to avoid bringing any inconvenience. > > Tiejun > > On 2014/12/19 23:13, Jan Beulich wrote: > >>>> On 19.12.14 at 02:21, <tiejun.chen@xxxxxxxxx> wrote: > >> #4 Something like USB, is still restricted to current RMRR implementation. > We > >> should work out this case. > > > > This can mean all or nothing. My understanding is that right now code > > assumes that USB devices won't use their RMRR-specified memory > > regions post-boot (kind of contrary to your earlier statement that in > > such a case the regions shouldn't be listed in RMRRs in the first place). > > > >> Design Overview > >> =============== > >> > >> First of all we need to make sure all resources don't overlap RMRR. And > then > >> in case of shared ept, we can set these identity entries. And Certainly we > >> will > >> group all devices associated to one same RMRR entry, then make sure all > >> group > >> devices should be assigned to same VM. > >> > >> 1. Setup RMRR identity mapping > >> > >> current status: > >> * identity mapping only setup in non-shared ept case > >> > >> proposal: > >> > >> In non-shared ept case, IOMMU stuff always set those entries and RMRR is > >> already marked reserved in host so its fine enough. > > > > Is it? Where? Or am I misunderstanding the whole statement, likely > > due to me silently replacing "host" by "guest" (since reservation in > > host address spaces is of no interest here afaict)? > > > >> But in shared ept case, we need to > >> check any conflit, so we should follow up > >> > >> - gfn space unoccupied > >> -> insert mapping: success. > >> gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, > p2m_access_rw > >> - gfn space already occupied by 1:1 RMRR mapping > >> -> do nothing; success. > >> - gfn space already occupied by other mapping > >> -> fail. > >> > >> expectation: > >> * only devices w/ non-conflicting RMRR can be assigned > >> * fortunately this achieves the very initial intention to support IGD > >> pass-through on BDW > > > > Are you trying to say here that doing the above is all you need for > > your specific machine? If so, that's clearly not something to go into > > a design document. > > > > Also there's clearly an alternative proposal: Drop support for sharing > > page tables. Your colleagues will surely have told you that we've > > been considering this for quite some time, and had actually hoped > > for them to do the necessary VT-d side work to allow for this without > > causing performance regressions. > > > >> 2.1 Expose RMRR to user space > >> > >> current status: > >> * Xen always record RMRR info into one list, acpi_rmrr_units, while > >> parsing > >> acpi. So we can retrieve these info by lookup that list. > >> > >> proposal: > >> * RMRR would be exposed by a new hypercall, which Jan already > finished > >> in > >> current version but just expose all RMRR info unconditionally. > >> * Furthermore we can expose RMRR on demand to diminish > shrinking guest > >> RAM/MMIO space. > >> * So we will introduce a new parameter, 'rdm_forcecheck' and to > >> collaborate > >> with SBDFs to control which RMRR should be exposed: > >> > >> - We can set this parameter in .cfg file like, > >> > >> rdm_forcecheck = 1 => Of course this should be 0 by > default. > >> > >> '1' means we should force check to reserve all ranges > >> unconditionally. > >> and if failed VM wouldn't be created successfully. This also can > >> give > >> user a chance to work well with later hotplug, even if not a > device > >> assignment while creating VM. > >> > >> If 0, we just check those assigned pci devices. As you know we > already > > > > assigned? Wasn't the plan to have a separate potentially-to-be- > > assigned list? And I can only re-iterate that the name > > "rdm_forcecheck" doesn't really express what you mean. Since > > your intention is to check all devices (rather than do a check that > > otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would > > seem closer to the intended behavior. > > > >> have such an existing hypercall to assign PCI devices, looks we > can > >> work > >> directly under this hypercall to get that necessary SBDF to sort > >> which > >> RMRR should be handled. But obviously, we need to get these > info > >> before > >> we populate guest memory to make sure these RMRR ranges > should be > >> excluded from guest memory. But unfortunately the memory > populating > >> takes place before a device assignment, so we can't live on that > >> directly. > >> > >> But as we discussed it just benefit that assigned case to reorder > >> that > >> order, but not good to hotplug case. So we have to introduce a > new > >> DOMCTL to pass that global parameter with SBDF at the same > time. > >> > >> For example, if we own these two RMRR entries, > > > > "own" is confusing here, I assume you mean if there are such entries. > > > >> [00:14.0] RMRR region: base_addr ab80a000 end_address > ab81dfff > >> [00:02.0] RMRR region: base_addr ad000000 end_address > af7fffff > >> > >> If 'rdm_forcecheck = 1', any caller to that hypercall may get > these two > > > > s/may/will/ I suppose. > > > >> entries. But if 'rdm_forcecheck = 0', and in .cfg file, > >> > >> #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two > entries. > >> #2 'pci=[00:02.0]' -> The caller just get one entry, > >> 0xad000000:0xaf7fffff > >> #2 'pci=[00:14.0]' -> The caller just get one entry, > >> 0xab80a000:0xab81dfff > >> #4 'pci=[others]' or no any pci configuration -> The caller get > >> nothing. > > > > Oh, interesting, you dropped the idea of a separate list of possibly- > > to-be-assigned devices. Why that? > > > >> And ultimately, if we expose any RMRR entry at gfn, > >> > >> in non-shared ept case, > >> > >> p2m table: valid non-identity mapping, gfn:mfn != 1:1 > > > > This would mean ... > > > >> VT-D table: no such a mapping until set identity mapping, > >> gfn:_mfn(gfn) == 1:1 when we have a associated > device > >> assignment. > > > > ... the two mappings to go out of sync when such a 1:1 mapping gets > > established. I think we should avoid such a situation if at all possible. > > > >> in shared ept case, > >> > >> p2m table\VT-D table: always INVALID until set identity > mapping, > >> gfn:_mfn(gfn) == 1:1 when we have > a > >> associated > >> device assignment. > >> > >> But if we don't expose any RMRR entry, > >> > >> in non-shared ept case, > >> > >> p2m table: valid non-identity mapping, gfn:mfn != 1:1 > >> VT-D table: no such a mapping until set identity mapping, > >> gfn:_mfn(gfn) == 1:1 when we have a associated > device > >> assignment. > >> > >> in shared ept case, > >> > >> p2m table\VT-D table: If guest RAM already cover gfn, we > have sunc a > >> valid non-identity mapping, > gfn:mfn != 1:1, > >> but > >> we can't set any identity mapping > again then > >> that associated device can't be > assigned > >> successfully. If not, we'll set identity > >> mapping, > >> gfn:_mfn(gfn) == 1:1, to work well. > > > > So in the end we'd still have various cases of different behavior, > > when really it would be much better (if not a requirement) if guest > > observable behavior wouldn't depend on implementation details > > like (non-)shared page tables. > > > >> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO > >> > >> current status: > >> * Currently Xen doesn't detect anything to avoid any conflict. > >> > >> proposal: > >> * We need to cover all points to check any conflict. Below lists > >> places > >> where reserved region conflictions should be detected: > >> > >> 1>. libxc:setup_guest():modules_init() > >> > >> There are some modules, like acpi_module and smbios_module. > They may > >> occupy some ranges so we need to check if they're conflicting > with > >> all ranges from that new hypercall above. > > > > I'm missing of what conflict resolution you expect to do here. > > Following earlier discussion in the context of patch review made > > pretty clear that this isn't really straightforward, having the > > potential to prevent a guest from booting (e.g. when the virtual > > BIOS conflicts with an RMRR). > > > >> 2>. libxc:setup_guest():xc_domain_populate_physmap() > >> > >> There are two ways to exclude RMRR ranges here: > >> > >> #1 Before we populate guest RAM without any RMRR > range from that > >> new hypercall to skip RMRR ranges when populating > guest RAM. > >> #2 In Xen we can fliter RMRR range while calling > >> XENMEM_populate_physmap, its no change to libxc, > but skip > >> setting p2m entry for RMRR ranges in Xen hypervisor > >> > >> But to compare #1, #2 is not better since Xen still allocate those > >> range, and we have to sync somewhere else in Xen. And #1 is > cleaner > >> because #2 actually shrinks the available memory size to the > guest. > >> > >> 3>. hvmloader:pci_setup() > >> > >> Here we should populate guest MMIO excluding all ranges > from that > >> new > >> hypercall to detect RMRR conflictions for allocating PCI MMIO > BAR. > >> > >> 4>. hvmloader:mem_hole_alloc() > >> > >> Here we should populate mem holw excluding all ranges from > that new > >> hypercall to detect RMRR conflictions for dynamic allocation in > >> hvmloader, e.g. as used for IGD opregion > >> > >> 5>. hvmloader:build_e820_table(): > >> > >> Finally we need let VM know that RMRR regions are reserved > through > >> e820 > >> table > >> > >> Its a little bit tricky to handle this inside hvmloader since as you > >> know, > >> struct hvm_info_table is only a approach between libxc and > hvmloader. > >> But > >> however, making up all above places will bring some duplicated > logic. > >> Especially between libxc and hvmloader, which skip same regions in > guest > >> physical layer(one in populating guest RAM, the other in > constructing > >> e820) > >> But current design has some limitation to pass information between > libxc > >> and > >> hvmloader, > >> > >> struct hvm_info_table { > >> ... > >> /* > >> * 0x0 to page_to_phys(low_mem_pgend)-1: > >> * RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF) > >> */ > >> uint32_t low_mem_pgend; > >> ... > >> /* > >> * 0x100000000 to page_to_phys(high_mem_pgend)-1: > >> * RAM above 4GB > >> */ > >> uint32_t high_mem_pgend; > >> ... > >> } > >> > >> nothing valuable is passed to hvmloader so we have to figure out a > way > >> to > >> handle those points in hvmloader. Currently, > >> > >> #1 Reconstruct hvm_info_table{} > >> > >> We can store all necessary RMRR info into hvm_info_table > as > >> well, > >> then we can pick them conveniently but oftentimes these > RMRR > >> entries are scattered and the number is also undertimined, > so > >> its > >> a little bit ugly to do. > >> > >> #2 Xenstore > >> > >> We may store all necessary RMRR info as Xenstore then > pick them > >> in > >> hvmloader. > >> > >> #3 A hypercall > >> > >> We may have to call our new hypercall again to pick them, > but > >> obviously this may introduce some duplicated codes. > >> > >> #4 XENMEM_{set_,}memory_map pair of hypercalls > >> > >> As Jan commented it "could be used(and is readily > available to > >> be > >> extended that way, since for HVM domains > XENMEM_set_memory_map > >> returns -EPERM at present). The only potentially > problematic > >> aspect > >> I can see with using it might be its limiting of the entry > count > >> to > >> E820MAX." So a preparation patch is required before > RMRR, and > >> hvmloader still needs to query RMRR information. > > > > Somewhere above I'm missing the mentioning of the option to reorder > > operations of hvmloader, to e.g. base PCI BAR assignment on the > > previously set up E820 table, rather than assuming a (mostly) fixed > > size window where to put these. > > > >> 3. Group multiple devices sharing same RMRR entry > >> > >> current status: > >> * Xen doesn't distinguish if multiple devices share same RMRR entry. > >> This > >> may lead a leak to corruption, e.g. Device A and device B share > same > >> entry > >> and then device A is assigned to domain A, device B is assigned to > >> domain > >> B. So domain A can read something from that range, even rewrite > >> maliciously that range to corrupt device B inside domain B. > >> > >> proposal: > >> * We will group all devices by means of one same RMRR entry. > >> Theoretically, > >> we should make sure all devices in one group are allowed to > assign to > >> one > >> VM. But in Xen side the hardware domain owns all devices at first, > we > >> should unbound all group devies before assign one of group device. > But > >> its > >> hard to do such thing in current VT-d stuff. And actually its rare > >> to > >> have > >> the group device in real world so we just prevent assigning any > group > >> device simply. > >> * We will introduce two field, gid, in struct, acpi_rmrr_unit: > >> gid: indicate if this device belongs a group. > >> Then when we add or attach device in iommu side, we will check > this > >> field > >> if we're assigning a group device then determine if we assign that. > >> > >> > >> expectation: > >> * Make all device associated to one RMRR to be assigned same VM. > > > > A few lines up you say you're intending to refuse such assignments > > rather than making them happen in a consistent way - what's the > > plan in the end? > > > >> 4. Handle in case of force accessing to RMRR regions > >> > >> proposal: > >> * Although we already reserve these ranges in guest e820 table, its > >> possible to access these ranges. > >> In non-shared ept case it will issue such EPT violation since we > have > >> no > >> p2m table actually. But its same to access other reserved range so > >> just > >> live on Xen's default behavior. In shared-ept case guest can read > or > >> write anything from this range, but such a leak or corruption just > >> happens inside same domain so this behavior is also same as a > native > >> case, it should be fine. > >> > >> expectation: > >> * Don't broken normal RMRR usage. > > > > I'm not getting the purpose of this whole section. > > > >> 5. Re-enable USB > >> > >> current status: > >> * Currently Xen doesn't allow USB passthrough. > > > > ??? > > > >> 6. Hotplug case > >> > >> current status: > >> * only work well in non-shared ept case or in case of shared ept & all > >> associated gfns are free. > >> > >> proposal: > >> * If user ensure there'll be a hotplug device in advace, > >> 'rdm_forcecheck' > >> can be set to reserve all RMRR range as we describe above. Then > any > >> device can be hotplugged successfully. > >> * If not, there are still two scenarios here: > >> in non-shared ept case it still work well as original; > > > > So you continue to assume that the two tables going out of sync is > > acceptable. > > > >> in shared ept case, its just going a case of "1. Setup RMRR identity > >> mapping" > >> - gfn space unoccupied -> insert mapping: success. > >> - gfn space already occupied by 1:1 RMRR mapping -> do > nothing; > >> success. > >> - gfn space already occupied by other mapping -> fail. > >> > >> expectation: > >> * provide a way to let hotplug work in case of shared ept totally > >> > >> Open Issue > >> ========== > >> > >> When other stuffs like ballon mechanism, to populate memory accessing > RMRR > >> range, or others like qemu, force map RMRR range, we may need to avoid > such > >> a > >> possibility but we're not 100% sure this so just open this to double check. > > > > I think a HVM guest can be expected to play by what E820 table > > says is reserved. That includes not using such ranges for ballooning. > > If it still does, it'll get what it deserves. > > > > Jan > > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.