[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] RMRR Fix Design for Xen



I'll work out a new design proposal based on below content and previous
discussions. Thanks Tiejun for your hard-working and Jan for your careful
reviews so far.

For below comment:
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.

let's separate it from RMRR discussion, because RMRR issues are about
general p2m and thus orthogonal to the implementation difference between
shared or not-shared fact (though it did lead to different behaviors w/ current
bogus logic).

Thanks
Kevin

> From: Chen, Tiejun
> Sent: Monday, December 22, 2014 10:12 AM
> 
> Jan,
> 
> Thanks for your time but I'm not going to address your comments here.
> Because I heard this design is totally not satisfied your expectation.
> But this really was reviewed with several revisions by Kevin and Yang
> before sending in public...
> 
> Anyway, I guess the only thing what I can do is that, Kevin and Yang, or
> other appropriate guys should finish this design as you expect. So now
> I'd better not say anything to avoid bringing any inconvenience.
> 
> Tiejun
> 
> On 2014/12/19 23:13, Jan Beulich wrote:
> >>>> On 19.12.14 at 02:21, <tiejun.chen@xxxxxxxxx> wrote:
> >> #4 Something like USB, is still restricted to current RMRR implementation.
> We
> >>     should work out this case.
> >
> > This can mean all or nothing. My understanding is that right now code
> > assumes that USB devices won't use their RMRR-specified memory
> > regions post-boot (kind of contrary to your earlier statement that in
> > such a case the regions shouldn't be listed in RMRRs in the first place).
> >
> >> Design Overview
> >> ===============
> >>
> >> First of all we need to make sure all resources don't overlap RMRR. And
> then
> >> in case of shared ept, we can set these identity entries. And Certainly we
> >> will
> >> group all devices associated to one same RMRR entry, then make sure all
> >> group
> >> devices should be assigned to same VM.
> >>
> >> 1. Setup RMRR identity mapping
> >>
> >> current status:
> >>      * identity mapping only setup in non-shared ept case
> >>
> >> proposal:
> >>
> >> In non-shared ept case, IOMMU stuff always set those entries and RMRR is
> >> already marked reserved in host so its fine enough.
> >
> > Is it? Where? Or am I misunderstanding the whole statement, likely
> > due to me silently replacing "host" by "guest" (since reservation in
> > host address spaces is of no interest here afaict)?
> >
> >> But in shared ept case, we need to
> >> check any conflit, so we should follow up
> >>
> >>    - gfn space unoccupied
> >>          -> insert mapping: success.
> >>              gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,
> p2m_access_rw
> >>    - gfn space already occupied by 1:1 RMRR mapping
> >>          -> do nothing; success.
> >>    - gfn space already occupied by other mapping
> >>          -> fail.
> >>
> >> expectation:
> >>      * only devices w/ non-conflicting RMRR can be assigned
> >>      * fortunately this achieves the very initial intention to support IGD
> >>        pass-through on BDW
> >
> > Are you trying to say here that doing the above is all you need for
> > your specific machine? If so, that's clearly not something to go into
> > a design document.
> >
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.
> >
> >> 2.1 Expose RMRR to user space
> >>
> >> current status:
> >>      * Xen always record RMRR info into one list, acpi_rmrr_units, while
> >> parsing
> >>        acpi. So we can retrieve these info by lookup that list.
> >>
> >> proposal:
> >>      * RMRR would be exposed by a new hypercall, which Jan already
> finished
> >> in
> >>        current version but just expose all RMRR info unconditionally.
> >>      * Furthermore we can expose RMRR on demand to diminish
> shrinking guest
> >>        RAM/MMIO space.
> >>      * So we will introduce a new parameter, 'rdm_forcecheck' and to
> >> collaborate
> >>        with SBDFs to control which RMRR should be exposed:
> >>
> >>          - We can set this parameter in .cfg file like,
> >>
> >>              rdm_forcecheck = 1 => Of course this should be 0 by
> default.
> >>
> >>          '1' means we should force check to reserve all ranges
> >> unconditionally.
> >>          and if failed VM wouldn't be created successfully. This also can
> >> give
> >>          user a chance to work well with later hotplug, even if not a
> device
> >>          assignment while creating VM.
> >>
> >>          If 0, we just check those assigned pci devices. As you know we
> already
> >
> > assigned? Wasn't the plan to have a separate potentially-to-be-
> > assigned list? And I can only re-iterate that the name
> > "rdm_forcecheck" doesn't really express what you mean. Since
> > your intention is to check all devices (rather than do a check that
> > otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
> > seem closer to the intended behavior.
> >
> >>          have such an existing hypercall to assign PCI devices, looks we
> can
> >> work
> >>          directly under this hypercall to get that necessary SBDF to sort
> >> which
> >>          RMRR should be handled. But obviously, we need to get these
> info
> >> before
> >>          we populate guest memory to make sure these RMRR ranges
> should be
> >>          excluded from guest memory. But unfortunately the memory
> populating
> >>          takes place before a device assignment, so we can't live on that
> >>          directly.
> >>
> >>          But as we discussed it just benefit that assigned case to reorder
> >> that
> >>          order, but not good to hotplug case. So we have to introduce a
> new
> >>          DOMCTL to pass that global parameter with SBDF at the same
> time.
> >>
> >>          For example, if we own these two RMRR entries,
> >
> > "own" is confusing here, I assume you mean if there are such entries.
> >
> >>          [00:14.0]    RMRR region: base_addr ab80a000 end_address
> ab81dfff
> >>          [00:02.0]    RMRR region: base_addr ad000000 end_address
> af7fffff
> >>
> >>          If 'rdm_forcecheck = 1', any caller to that hypercall may get
> these two
> >
> > s/may/will/ I suppose.
> >
> >>          entries. But if 'rdm_forcecheck = 0', and in .cfg file,
> >>
> >>          #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two
> entries.
> >>          #2 'pci=[00:02.0]' -> The caller just get one entry,
> >> 0xad000000:0xaf7fffff
> >>          #2 'pci=[00:14.0]' -> The caller just get one entry,
> >> 0xab80a000:0xab81dfff
> >>          #4 'pci=[others]' or no any pci configuration -> The caller get
> >> nothing.
> >
> > Oh, interesting, you dropped the idea of a separate list of possibly-
> > to-be-assigned devices. Why that?
> >
> >>          And ultimately, if we expose any RMRR entry at gfn,
> >>
> >>          in non-shared ept case,
> >>
> >>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >
> > This would mean ...
> >
> >>          VT-D table: no such a mapping until set identity mapping,
> >>                      gfn:_mfn(gfn) == 1:1 when we have a associated
> device
> >>                      assignment.
> >
> > ... the two mappings to go out of sync when such a 1:1 mapping gets
> > established. I think we should avoid such a situation if at all possible.
> >
> >>          in shared ept case,
> >>
> >>          p2m table\VT-D table:  always INVALID until set identity
> mapping,
> >>                                 gfn:_mfn(gfn) == 1:1 when we have
> a
> >> associated
> >>                                 device assignment.
> >>
> >>          But if we don't expose any RMRR entry,
> >>
> >>          in non-shared ept case,
> >>
> >>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >>          VT-D table: no such a mapping until set identity mapping,
> >>                      gfn:_mfn(gfn) == 1:1 when we have a associated
> device
> >>                      assignment.
> >>
> >>          in shared ept case,
> >>
> >>          p2m table\VT-D table:  If guest RAM already cover gfn, we
> have sunc a
> >>                                 valid non-identity mapping,
> gfn:mfn != 1:1,
> >> but
> >>                                 we can't set any identity mapping
> again then
> >>                                 that associated device can't be
> assigned
> >>                                 successfully. If not, we'll set identity
> >> mapping,
> >>                                 gfn:_mfn(gfn) == 1:1, to work well.
> >
> > So in the end we'd still have various cases of different behavior,
> > when really it would be much better (if not a requirement) if guest
> > observable behavior wouldn't depend on implementation details
> > like (non-)shared page tables.
> >
> >> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO
> >>
> >> current status:
> >>      * Currently Xen doesn't detect anything to avoid any conflict.
> >>
> >> proposal:
> >>      * We need to cover all points to check any conflict. Below lists 
> >> places
> >>        where reserved region conflictions should be detected:
> >>
> >>          1>. libxc:setup_guest():modules_init()
> >>
> >>          There are some modules, like acpi_module and smbios_module.
> They may
> >>          occupy some ranges so we need to check if they're conflicting
> with
> >>          all ranges from that new hypercall above.
> >
> > I'm missing of what conflict resolution you expect to do here.
> > Following earlier discussion in the context of patch review made
> > pretty clear that this isn't really straightforward, having the
> > potential to prevent a guest from booting (e.g. when the virtual
> > BIOS conflicts with an RMRR).
> >
> >>          2>. libxc:setup_guest():xc_domain_populate_physmap()
> >>
> >>          There are two ways to exclude RMRR ranges here:
> >>
> >>              #1 Before we populate guest RAM without any RMRR
> range from that
> >>                 new hypercall to skip RMRR ranges when populating
> guest RAM.
> >>              #2 In Xen we can fliter RMRR range while calling
> >>                 XENMEM_populate_physmap, its no change to libxc,
> but skip
> >>                 setting p2m entry for RMRR ranges in Xen hypervisor
> >>
> >>          But to compare #1, #2 is not better since Xen still allocate those
> >>          range, and we have to sync somewhere else in Xen. And #1 is
> cleaner
> >>          because #2 actually shrinks the available memory size to the
> guest.
> >>
> >>          3>. hvmloader:pci_setup()
> >>
> >>          Here we should populate guest MMIO excluding all ranges
> from that
> >> new
> >>          hypercall to detect RMRR conflictions for allocating PCI MMIO
> BAR.
> >>
> >>          4>. hvmloader:mem_hole_alloc()
> >>
> >>          Here we should populate mem holw excluding all ranges from
> that new
> >>          hypercall to detect RMRR conflictions for dynamic allocation in
> >>          hvmloader, e.g. as used for IGD opregion
> >>
> >>          5>. hvmloader:build_e820_table():
> >>
> >>          Finally we need let VM know that RMRR regions are reserved
> through
> >> e820
> >>          table
> >>
> >>      Its a little bit tricky to handle this inside hvmloader since as you
> >> know,
> >>      struct hvm_info_table is only a approach between libxc and
> hvmloader.
> >> But
> >>      however, making up all above places will bring some duplicated
> logic.
> >>      Especially between libxc and hvmloader, which skip same regions in
> guest
> >>      physical layer(one in populating guest RAM, the other in
> constructing
> >> e820)
> >>      But current design has some limitation to pass information between
> libxc
> >> and
> >>      hvmloader,
> >>
> >> struct hvm_info_table {
> >>      ...
> >>      /*
> >>       *  0x0 to page_to_phys(low_mem_pgend)-1:
> >>       *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
> >>       */
> >>      uint32_t    low_mem_pgend;
> >>      ...
> >>      /*
> >>       *  0x100000000 to page_to_phys(high_mem_pgend)-1:
> >>       *    RAM above 4GB
> >>       */
> >>      uint32_t    high_mem_pgend;
> >>      ...
> >> }
> >>
> >>      nothing valuable is passed to hvmloader so we have to figure out a
> way
> >> to
> >>      handle those points in hvmloader. Currently,
> >>
> >>              #1 Reconstruct hvm_info_table{}
> >>
> >>              We can store all necessary RMRR info into hvm_info_table
> as
> >> well,
> >>              then we can pick them conveniently but oftentimes these
> RMRR
> >>              entries are scattered and the number is also undertimined,
> so
> >> its
> >>              a little bit ugly to do.
> >>
> >>              #2 Xenstore
> >>
> >>              We may store all necessary RMRR info as Xenstore then
> pick them
> >> in
> >>              hvmloader.
> >>
> >>              #3 A hypercall
> >>
> >>              We may have to call our new hypercall again to pick them,
> but
> >>              obviously this may introduce some duplicated codes.
> >>
> >>              #4 XENMEM_{set_,}memory_map pair of hypercalls
> >>
> >>              As Jan commented it "could be used(and is readily
> available to
> >> be
> >>              extended that way, since for HVM domains
> XENMEM_set_memory_map
> >>              returns -EPERM at present). The only potentially
> problematic
> >> aspect
> >>              I can see with using it might be its limiting of the entry
> count
> >> to
> >>              E820MAX." So a preparation patch is required before
> RMRR, and
> >>              hvmloader still needs to query RMRR information.
> >
> > Somewhere above I'm missing the mentioning of the option to reorder
> > operations of hvmloader, to e.g. base PCI BAR assignment on the
> > previously set up E820 table, rather than assuming a (mostly) fixed
> > size window where to put these.
> >
> >> 3. Group multiple devices sharing same RMRR entry
> >>
> >> current status:
> >>      * Xen doesn't distinguish if multiple devices share same RMRR entry.
> >> This
> >>        may lead a leak to corruption, e.g. Device A and device B share
> same
> >> entry
> >>        and then device A is assigned to domain A, device B is assigned to
> >> domain
> >>        B. So domain A can read something from that range, even rewrite
> >>        maliciously that range to corrupt device B inside domain B.
> >>
> >> proposal:
> >>      * We will group all devices by means of one same RMRR entry.
> >> Theoretically,
> >>        we should make sure all devices in one group are allowed to
> assign to
> >> one
> >>        VM. But in Xen side the hardware domain owns all devices at first,
> we
> >>        should unbound all group devies before assign one of group device.
> But
> >> its
> >>        hard to do such thing in current VT-d stuff. And actually its rare 
> >> to
> >> have
> >>        the group device in real world so we just prevent assigning any
> group
> >>        device simply.
> >>      * We will introduce two field, gid, in struct, acpi_rmrr_unit:
> >>          gid: indicate if this device belongs a group.
> >>        Then when we add or attach device in iommu side, we will check
> this
> >> field
> >>        if we're assigning a group device then determine if we assign that.
> >>
> >>
> >> expectation:
> >>      * Make all device associated to one RMRR to be assigned same VM.
> >
> > A few lines up you say you're intending to refuse such assignments
> > rather than making them happen in a consistent way - what's the
> > plan in the end?
> >
> >> 4. Handle in case of force accessing to RMRR regions
> >>
> >> proposal:
> >>      * Although we already reserve these ranges in guest e820 table, its
> >>        possible to access these ranges.
> >>        In non-shared ept case it will issue such EPT violation since we
> have
> >> no
> >>        p2m table actually. But its same to access other reserved range so
> >> just
> >>        live on Xen's default behavior. In shared-ept case guest can read
> or
> >>        write anything from this range, but such a leak or corruption just
> >>        happens inside same domain so this behavior is also same as a
> native
> >>        case, it should be fine.
> >>
> >> expectation:
> >>      * Don't broken normal RMRR usage.
> >
> > I'm not getting the purpose of this whole section.
> >
> >> 5. Re-enable USB
> >>
> >> current status:
> >>      * Currently Xen doesn't allow USB passthrough.
> >
> > ???
> >
> >> 6. Hotplug case
> >>
> >> current status:
> >>      * only work well in non-shared ept case or in case of shared ept & all
> >>        associated gfns are free.
> >>
> >> proposal:
> >>      * If user ensure there'll be a hotplug device in advace,
> >> 'rdm_forcecheck'
> >>        can be set to reserve all RMRR range as we describe above. Then
> any
> >>        device can be hotplugged successfully.
> >>      * If not, there are still two scenarios here:
> >>        in non-shared ept case it still work well as original;
> >
> > So you continue to assume that the two tables going out of sync is
> > acceptable.
> >
> >>        in shared ept case, its just going a case of "1. Setup RMRR identity
> >>        mapping"
> >>          - gfn space unoccupied -> insert mapping: success.
> >>          - gfn space already occupied by 1:1 RMRR mapping -> do
> nothing;
> >> success.
> >>          - gfn space already occupied by other mapping -> fail.
> >>
> >> expectation:
> >>      * provide a way to let hotplug work in case of shared ept totally
> >>
> >> Open Issue
> >> ==========
> >>
> >> When other stuffs like ballon mechanism, to populate memory accessing
> RMRR
> >> range, or others like qemu, force map RMRR range, we may need to avoid
> such
> >> a
> >> possibility but we're not 100% sure this so just open this to double check.
> >
> > I think a HVM guest can be expected to play by what E820 table
> > says is reserved. That includes not using such ranges for ballooning.
> > If it still does, it'll get what it deserves.
> >
> > Jan
> >

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.