
Re: [Xen-devel] (v2) Design proposal for RMRR fix

> From: George Dunlap
> Sent: Thursday, January 08, 2015 8:50 PM
> On Fri, Dec 26, 2014 at 11:23 AM, Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > (please note some proposals differ from the last sent version, after more
> > discussions. But I tried to summarize previous discussions and explain why
> > we chose a different way. Sorry if I missed some opens/conclusions
> > discussed in past months. Please help point them out, which is very
> > appreciated. :-)
> Kevin, thanks for this document.  A few questions / comments below:
> > For proper functioning of these legacy reserved memory usages, when
> > system software enables DMA remapping, the translation structures for
> > the respective devices are expected to be set up to provide identity
> > mapping for the specified reserved memory regions with read and write
> > permissions. The system software is also responsible for ensuring
> > that any input addresses used for device accesses to OS-visible memory
> > do not overlap with the reserved system memory address ranges.
> Just to be clear: "identity mapping" here means that gpfn == mfn, in
> both the p2m and IOMMU.  (I suppose it might mean vfn == gpfn as well,
> but that wouldn't really concern us, as the guest deals with virtual
> mappings.)

I'm not sure what you meant by 'vfn', but the identity mapping applies to
whatever address space is created by the IOMMU page table. And the wording is
from the VT-d spec, which also covers bare-metal usage.
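To make the requirement concrete, here's a toy sketch (Python, purely illustrative, with invented names) of what "identity mapping" means for an RMRR range: for every frame in the range, the guest-physical frame equals the machine frame, in both the p2m and the IOMMU table (modelled here as a single dict):

```python
# Toy sketch, not Xen code: identity-map an RMRR range so that
# gpfn == mfn for every 4KiB frame in [range_start, range_end).

def identity_map(range_start, range_end, page_size=0x1000):
    mapping = {}
    for addr in range(range_start, range_end, page_size):
        gfn = mfn = addr >> 12   # gpfn == mfn across the whole range
        mapping[gfn] = mfn
    return mapping

# Hypothetical RMRR covering three pages
m = identity_map(0xAD800000, 0xAD803000)
```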

> > However the current RMRR implementation in Xen only partially achieves a)
> > and completely misses b), which causes some issues:
> >
> > --
> > [Issue-1] Identity mapping is not set up in the shared-EPT case, so a device
> > with RMRR may not function correctly if assigned to a VM.
> >
> > This was the original problem we found when assigning IGD on a BDW
> > platform, which triggered the whole long discussion of the past months.
> >
> > --
> > [Issue-2] Lacking goal-b), existing device assignment with
> > RMRR works only when reserved regions happen not to conflict with
> > other valid allocations in the guest physical address space. This can
> > lead to unpredictable failures in various deployments, due to undetected
> > conflicts caused by platform differences and VM configuration
> > differences.
> >
> > One example is USB controller assignment. It's already identified
> > as a problem on some platforms that USB reserved regions conflict with the
> > guest BIOS region. However, given that the host BIOS only touches
> > those reserved regions for legacy keyboard emulation in the early Dom0 boot
> > phase, a trick was added in Xen to bypass RMRR handling for USB
> > controllers.
> >
> > --
> > [Issue-3] Devices may share the same reserved regions; however,
> > there is no logic to handle this in Xen. Assigning such devices to
> > different VMs could lead to a security concern.
> So to summarize:
> When assigning a device to a guest, the device's associated RMRRs must
> be identity mapped in the p2m and IOMMU.
> At the moment, we don't have a reliable way to reclaim a particular
> gpfn space from a guest once it's been used for other purposes (e.g.,
> guest RAM or other MMIO ranges).
> So, we need to make sure at guest creation time that we reserve any
> RMRR ranges for devices we may wish to assign, and make sure that the
> RMRR in gpfn space is empty.
> For statically-assigned devices, we know at guest creation time which
> RMRRs may be required.  But if we want to dynamically add devices, we
> must figure out ahead of time which devices we *might* add, and
> reserve the RMRRs at boot time.
> As a separate problem, two different devices may share the same RMRR,
> meaning that if we assign these devices to two different VMs, the RMRR
> may be mapped into the gpfn space of two different VMs.  This may well
> be a security issue, so we need to handle it carefully.

exactly. :-)
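One more way to see goal-b): at creation time the toolstack needs an overlap check between the device's RMRRs and the ranges it is about to populate. A minimal sketch (Python, illustrative only; the layout and numbers are invented for the example):

```python
# Illustrative sketch, not Xen code: detect collisions between a device's
# RMRR ranges and the guest's populated gpfn ranges at creation time.
# All ranges are half-open [start, end) in guest-physical address space.

def overlaps(a_start, a_end, b_start, b_end):
    """True if [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

def find_conflicts(rmrrs, guest_map):
    """Return the RMRR ranges that collide with populated guest ranges."""
    return [r for r in rmrrs
            if any(overlaps(r[0], r[1], g[0], g[1]) for g in guest_map)]

# Hypothetical layout: lowmem RAM trunk plus a USB controller RMRR that
# happens to fall inside it -- the Issue-2 situation described above.
guest_map = [(0x0, 0xA0000), (0x100000, 0x7FF00000)]  # populated gpfn ranges
rmrrs = [(0x6F000000, 0x6F800000)]                    # device reserved region

conflicts = find_conflicts(rmrrs, guest_map)
```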

> > 3. High Level Design
> >
> =====================================================================
> >
> > To achieve the aforementioned two goals, major enhancements are required
> > across the Xen hypervisor, libxc, and hvmloader, to address the gap in
> > goal-b), i.e. handling possible conflicts in gfn space. Fixing
> > goal-a) is straightforward.
> >
> >>>>3.1 Guidelines
> > ----
> > There are several guidelines considered in the design:
> >
> > --
> > [Guideline-1] No regression in a VM w/o statically-assigned devices
> >
> >   If a VM isn't configured with assigned devices at creation, the new
> > conflict-detection logic shouldn't block the VM boot progress
> > (either skipped, or just throwing a warning)
> >
> > --
> > [Guideline-2] No regression on devices which do not have RMRR reported
> >
> >   If a VM is assigned a device which doesn't have RMRR reported,
> > either statically or dynamically, the new conflict-detection
> > logic shouldn't fail the assignment request for this device.
> >
> > --
> > [Guideline-3] New interface should be kept as common as possible
> >
> >   A new interface will be introduced to expose reserved regions to
> > user space. Though RMRR is a VT-d specific term, the interface
> > design should be generic enough, i.e. support a function which
> > allows the hypervisor to force the reservation of one or more gfn ranges.
> >
> > --
> > [Guideline-4] Keep changes simple
> >
> >   RMRR reserved regions should be avoided or limited by platform
> > designers, per the VT-d specification. Per our observations, there are
> > only a few reported examples (USB, IGD) on real platforms. So we need
> > to balance code complexity against usage limitations. If a limitation
> > only affects niche scenarios, we'd rather not support it, to keep the
> > changes simple for now.
> This is an excellent set of principles -- thanks.
> >
> >>>>3.2 Conflict detection
> > ----
> > Conflicts must be detected in several places as far as gfn space is
> > concerned (how to handle a conflict is discussed in 3.3)
> >
> > 1) libxc domain builder
> >   Here the coarse-grained gfn layout is created, including two contiguous
> > guest RAM trunks (lowmem and/or highmem) and MMIO holes (VGA, PCI),
> > which are passed to hvmloader for later fine-grained manipulation. Guest
> > RAM trunks are populated with valid translations set up in the underlying
> > p2m layer. Device reserved regions must be detected in that layout.
> >
> > 2) Xen hypervisor device assignment
> >   Device assignment can happen either at VM creation time (after the domain
> > builder), or any time through hotplug after the VM has booted. Regardless of
> > how user space handles conflicts, the Xen hypervisor will always do a
> > last, conservative detection when setting up the identity mapping:
> >         * gfn space unoccupied:
> >                 -> insert identity mapping; no conflict
> >         * gfn space already occupied with identity mapping:
> >                 -> do nothing; no conflict
> >         * gfn space already occupied with another mapping:
> >                 -> conflict detected
> >
> > 3) hvmloader
> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> > internal data structures in gfn space, and it creates the final guest
> > e820. So hvmloader also needs to detect conflicts when conducting
> > those operations. If there's no conflict, hvmloader will reserve
> > those regions in the guest e820 to make the guest OS aware.
> I think this can be summarized a bit more clearly by what each bit of
> code needs to actually do:
> 1. libxc
>  - RMRR areas need to be not populated with gfns during boot time.
> 2. Xen
>  - When a device with RMRRs is assigned, Xen must make an
> identity-mapping of the appropriate RMRR ranges.
> 3. hvmloader
>  - hvmloader must report, in the e820 map, the RMRRs of all devices which a
> guest may ever be assigned
>  - when placing devices in MMIO space, hvmloader must avoid placing
> MMIO devices over RMRR regions which are / may be assigned to a guest.
> One component I think may be missing here -- qemu-traditional is very
> tolerant with regards to the gpfn space; but qemu-upstream expects to
> know the layout of guest gpfn space, and may crash if its idea of gpfn
> space doesn't match Xen's idea.  Unfortunately, however, there is not
> a very close link between these two at the moment; IIUC at the moment
> this is limited to the domain builder telling qemu how big the lowmem
> PCI hole will be.  Any solution which marks GPFN space as "non-memory"
> needs to make sure this is communicated to qemu-upstream as well.

So what qemu cares about is only RAM pages, right? If that's the case,
Jan's idea in another mail might make sense, i.e. we assume there is no RMRR
in lowmem (or at least none low enough that we have to split lowmem), and
then the basic lowmem structure doesn't change.
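The three hypervisor-side detection cases from section 3.2 above can be sketched like this (Python, purely illustrative; the p2m is modelled as a toy dict mapping gfn to mfn, whereas the real code walks the p2m/EPT structures):

```python
# Illustrative sketch of the last-resort detection Xen applies when
# inserting an RMRR identity mapping. Not real Xen code.

NO_CONFLICT_INSERTED = "inserted"
NO_CONFLICT_ALREADY = "already identity-mapped"
CONFLICT = "conflict"

def insert_identity_mapping(p2m, gfn):
    if gfn not in p2m:              # gfn space unoccupied
        p2m[gfn] = gfn              # -> insert identity mapping
        return NO_CONFLICT_INSERTED
    if p2m[gfn] == gfn:             # already identity-mapped
        return NO_CONFLICT_ALREADY  # -> do nothing
    return CONFLICT                 # occupied with some other mapping

p2m = {0x1000: 0x1000, 0x2000: 0x9000}      # toy p2m state
r1 = insert_identity_mapping(p2m, 0x3000)   # unoccupied
r2 = insert_identity_mapping(p2m, 0x1000)   # identity mapping present
r3 = insert_identity_mapping(p2m, 0x2000)   # occupied by another mapping
```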

> >>>>3.3 Policies
> > ----
> > An intuitive thought is to fail immediately upon a conflict; however,
> > that is not flexible enough for the different requirements:
> >
> > a) it's not appropriate to fail the libxc domain builder just because of
> > such a conflict. We still want the guest to boot even w/o the assigned device;
> >
> > b) whether to fail in hvmloader has several dependencies. If the check
> > is for hotplug preparation, a warning is also an acceptable option
> > since the assignment may never happen. Or if it's a USB controller
> > but the user doesn't care about legacy keyboard emulation, it's also OK to
> > move forward despite a conflict;
> >
> > c) in the Xen hypervisor it is reasonable to fail upon a conflict, since
> > the device is actually being assigned. But due to the same requirement for
> > USB controllers, sometimes we might want it to succeed with just warnings.
> >
> > Given the complexity of addressing all of the above flexibility (user
> > preferences, per device), which would require inventing quite a few
> > parameters passed among different components, and given the fact that
> > failures would be rare (except for some USB) with proactive avoidance
> > in user space, we'd like to propose the simplified policy below, following
> > [Guideline-4]:
> >
> > - 'warn' on conflicts in user space (libxc and hvmloader)
> > - a boot option to specify 'fail' or 'warn' on conflicts in the Xen device
> > assignment path, defaulting to 'fail' (users can set 'warn' for the USB case)
> >
> > Such a policy provides a relaxed user-space policy with the hypervisor as
> > the final judge. It has the unique merit of simplifying later interface
> > design and hotplug support, w/o breaking [Guideline-1/2] even when all
> > possible reserved regions are exposed.
> >
> >     ******agreement is first required on above policy******
> So the important part of policy is what the user experience is.  I
> think we can assume that all device assignment will happen through
> libxl; so from a user interface perspective we mainly want to be
> thinking about the xl / libxl interface.
> How the various sub-components react if something unexpected happens
> is then just a matter of robust system design.
> So first of all, I think RMRR reservations should be specified at
> domain creation time.  If a user tries to assign a device with RMRRs
> to a VM that has not reserved those ranges at creation time, the
> assignment should fail.
> The main place this checking should happen is in the toolstack
> (libxl).  The toolstack can then give a sensible error message to the
> user, which may include things they can do to fix the problem.
> In the case of statically-assigned devices, the toolstack can look at
> the RMRRs required and make sure to reserve them at domain creation
> time.
> For dynamically-assigned devices, I think there should be an option to
> make the guest's memory layout mirror the host: this would include the
> PCI hole and all RMRR ranges.  This would be off by default.
> We could imagine a way of specifying "I may want to assign this pool
> of devices to this VM", or to manually specify RMRR ranges which
> should be reserved, but I think that's a bit more advanced than we
> really need right now.

Yes, that type of enhancement can be considered later.
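To illustrate the proposed policy split: user space only warns, while the hypervisor's assignment path consults a boot option defaulting to 'fail'. A toy sketch (Python, purely illustrative; the option name and return shape are invented):

```python
# Illustrative sketch of the simplified policy from section 3.3:
# user space warns on conflicts, the hypervisor decides fail vs. warn
# based on a boot option that defaults to 'fail'. Not real Xen code.

def assign_device(has_conflict, rmrr_policy="fail"):
    """Return (ok, message) for a device-assignment attempt."""
    if not has_conflict:
        return True, "assigned"
    if rmrr_policy == "warn":      # e.g. the USB-controller case
        return True, "assigned with warning: RMRR conflict ignored"
    return False, "assignment failed: RMRR conflict"

ok_default, _ = assign_device(has_conflict=True)                    # 'fail'
ok_usb, msg = assign_device(has_conflict=True, rmrr_policy="warn")  # relaxed
```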

> >>>>3.5 New interface: expose reserved region information
> It's not clear to me who this new interface is being exposed to.
> It seems to me what we want is for the toolstack to figure out, at
> guest creation time, what RMRRs should be reserved for this VM, and
> probably put that information in xenstore somewhere, where it's
> available to hvmloader.  I assume the RMRR information is already
> available through sysfs in dom0?


> One question: where are these RMRRs typically located in memory?  Are
> they normally up in the MMIO region?  Or can they occur anywhere (even
> in really low areas, say, under 1GiB)?

Reported by ACPI structures, and as Jan replied, they could be anywhere.

> If RMRRs almost always happen up above 2G, for example, then a simple
> solution that wouldn't require too much work would be to make sure
> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
> enough to include all RMRRs.  That would satisfy the libxc and qemu
> requirements.

Unfortunately it's not such a simple case.

> If we then store specific RMRRs we want included in xenstore,
> hvmloader can put them in the e820 map, and that would satisfy the
> hvmloader requirement.

Is xenstore really necessary, given the newly introduced hypercall?

> Then when we assign the device, those ranges will be already unused in
> the p2m, and (if I understand correctly) Xen will already map the RMRR
> ranges 1-1 upon device assignment.
> What do you think?
> If making the RMRRs fit inside the guest MMIO hole is not practical
> (for example, if the ranges occur very low in memory), then we'll have
> to come up with a way to specify, both to libxc and to qemu, where
> these  holes in memory are.
> >>>>3.8 Xen: Handle devices sharing reserved regions
> > ----
> > Per the VT-d spec, it's possible for two devices to share the same reserved
> > region. Though we haven't seen such an example in reality, the hypervisor
> > needs to detect and handle this scenario, otherwise a vulnerability may
> > exist if the two devices are assigned to different VMs (a malicious VM may
> > program its assigned device to clobber the shared region and so corrupt
> > another VM's device)
> >
> > Ideally all devices sharing reserved regions should be assigned to a
> > single VM. However, achieving this goal can't be done solely in the
> > hypervisor w/o reworking the current device assignment interface.
> > Assignment is managed by the toolstack, which would require exposing the
> > group-sharing information to user space and then extending the toolstack
> > to manage assignment in bundles.
> >
> > Given that the problem is so far only theoretical, we propose not to
> > support this scenario, i.e. have the hypervisor fail the assignment if the
> > target device happens to share some reserved regions with another device,
> > following [Guideline-4] to keep things simple.
> I think denying it by default, first in the toolstack and as a
> fall-back in the hypervisor, is a good idea.
> It shouldn't be too difficult, however, to add an option to override
> this.  We have a lot of individual users who use Xen for device
> pass-through; such advanced users should be allowed to "shoot
> themselves in the foot" if they want to.
> Thoughts?

That's also the option I'd like to keep. Instead of Xen enforcing the
strictest policy, it's better to warn about the problem but let the user
choose whether they want to move forward.
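The deny-by-default-with-override check could look roughly like this (Python sketch, purely illustrative; device names, structures, and the override flag are invented):

```python
# Illustrative sketch of the shared-RMRR hazard check from section 3.8:
# refuse assigning a device whose reserved region is shared with a device
# already assigned to a *different* VM, unless the user overrides.

def shares_region(rmrrs_a, rmrrs_b):
    return bool(set(rmrrs_a) & set(rmrrs_b))

def check_assignment(device, target_vm, device_rmrrs, assignments,
                     allow_override=False):
    """assignments: {device: (vm, rmrr list)}. Returns (ok, reason)."""
    for other, (vm, rmrrs) in assignments.items():
        if vm != target_vm and shares_region(device_rmrrs[device], rmrrs):
            if allow_override:   # the "shoot yourself in the foot" option
                return True, "warning: shared RMRR with %s" % other
            return False, "denied: shared RMRR with %s in another VM" % other
    return True, "ok"

# Hypothetical pair of USB controllers sharing one reserved region.
region = (0xAD800000, 0xADA00000)
device_rmrrs = {"usb1": [region], "usb2": [region]}
assignments = {"usb1": ("vm1", [region])}

ok_deny, _ = check_assignment("usb2", "vm2", device_rmrrs, assignments)
ok_same, _ = check_assignment("usb2", "vm1", device_rmrrs, assignments)
```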
