
[Xen-devel] (v2) Design proposal for RMRR fix



(please note that parts of this proposal differ from the last version sent
out, after further discussion. I've tried to summarize the earlier
discussions and explain why we chose a different approach. Apologies if I've
missed some open questions or conclusions from the past months. Please help
point them out, which would be much appreciated. :-)

----
TOC:
        1. What's RMRR
        2. RMRR status in Xen
        3. High Level Design
                3.1 Guidelines
                3.2 Conflict detection
                3.3 Policies
                3.4 Xen: setup RMRR identity mapping
                3.5 New interface: expose reserved region information
                3.6 Libxc/hvmloader: detect and avoid conflicts
                3.7 Hvmloader: reserve 'reserved regions' in guest E820
                3.8 Xen: Handle devices sharing reserved regions
        4. Plan
                4.1 Stage-1: hypervisor hardening
                4.2 Stage-2: libxc/hvmloader hardening
                
1. What's RMRR?
=====================================================================

RMRR stands for Reserved Memory Region Reporting. It is expected to
be used for legacy usages (such as USB, UMA graphics, etc.) which require
reserved memory.

(From vt-d spec)
----
Reserved system memory regions are typically allocated by BIOS at boot
time and reported to OS as reserved address ranges in the system memory
map. Requests to these reserved regions may either occur as a result of
operations performed by the system software driver (for example in the
case of DMA from unified memory access (UMA) graphics controllers to
graphics reserved memory) or may be initiated by non system software
(for example in case of DMA performed by a USB controller under BIOS
SMM control for legacy keyboard emulation). 

For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the translation structures for 
the respective devices are expected to be set up to provide identity 
mapping for the specified reserved memory regions with read and write 
permissions. The system software is also responsible for ensuring 
that any input addresses used for device accesses to OS-visible memory 
do not overlap with the reserved system memory address ranges.

BIOS may report each such reserved memory region through the RMRR
structures, along with the devices that require access to the 
specified reserved memory region. Reserved memory ranges that are
either not DMA targets, or memory ranges that may be target of BIOS
initiated DMA only during pre-boot phase (such as from a boot disk
drive) must not be included in the reserved memory region reporting.
The base address of each RMRR region must be 4KB aligned and the size
must be an integer multiple of 4KB. If there are no RMRR structures,
the system software concludes that the platform does not have any 
reserved memory ranges that are DMA targets.

Platform designers should avoid or limit use of reserved memory regions
since these require system software to create holes in the DMA virtual
address range available to system software and its drivers.
----

Below is one example from a BDW machine:
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff

Here the 1st reserved region is for the USB controller, and the 2nd one
belongs to the IGD.
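
For reference, below is a rough sketch (in C, field names mine) of the
per-region layout of the RMRR reporting structure in the ACPI DMAR table,
matching the alignment rules quoted above. It is illustrative only, not
Xen's actual definition:

----
#include <stdint.h>

/* Sketch of one RMRR reporting structure in the DMAR table (type 1);
 * field names are mine, widths follow the spec. */
struct rmrr_report_sketch {
    uint16_t type;           /* 1 = RMRR structure */
    uint16_t length;         /* size of this structure incl. device scopes */
    uint16_t reserved;
    uint16_t segment;        /* PCI segment of the reported devices */
    uint64_t base_address;   /* 4KB-aligned base of the reserved region */
    uint64_t limit_address;  /* inclusive end; (limit + 1 - base) is a
                                multiple of 4KB */
    /* followed by device scope entries naming the devices that require
     * access to this region */
};
----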



2. RMRR status in Xen
=====================================================================

There are two main design goals according to the VT-d spec:

a) Set up identity mappings for reserved regions in the IOMMU page table
b) Ensure reserved regions do not conflict with OS-visible memory
(OS-visible memory in a VM means guest physical memory; more strictly,
it also means no conflict with other types of allocations in the guest
physical address space, such as PCI MMIO, ACPI, etc.)

However the current RMRR implementation in Xen only partially achieves a)
and completely misses b), which causes some issues:

--
[Issue-1] Identity mapping is not set up in the shared EPT case, so a device
with an RMRR may not function correctly if assigned to a VM.

This was the original problem we found when assigning the IGD on the BDW
platform, which triggered the whole long discussion over the past months.

--
[Issue-2] Lacking goal b), existing device assignment with RMRRs works
only when the reserved regions happen not to conflict with other valid
allocations in the guest physical address space. This could lead to
unpredictable failures in various deployments, due to undetected
conflicts caused by platform and VM configuration differences.

One example is USB controller assignment. It has already been identified
as a problem on some platforms that USB reserved regions conflict with
the guest BIOS region. However, given that the host BIOS only touches
those reserved regions for legacy keyboard emulation during the early
Dom0 boot phase, a trick was added in Xen to bypass RMRR handling for
USB controllers.

--
[Issue-3] Devices may share the same reserved regions, but there is
no logic to handle this in Xen. Assigning such devices to
different VMs could lead to security concerns.



3. High Level Design
=====================================================================

To achieve the aforementioned two goals, major enhancements are required
across the Xen hypervisor, libxc, and hvmloader, to address the gap in
goal b), i.e. handling possible conflicts in the gfn space. Fixing
goal a) is straightforward.

>>>3.1 Guidelines
----
There are several guidelines considered in the design:

--
[Guideline-1] No regression in a VM w/o statically-assigned devices

  If a VM isn't configured with assigned devices at creation, the new
conflict detection logic shouldn't block the VM boot progress (it should
either be skipped, or just throw a warning).

--
[Guideline-2] No regression for devices which do not have an RMRR reported

  If a VM is assigned a device which doesn't have an RMRR reported,
whether statically or dynamically assigned, the new conflict detection
logic shouldn't fail the assignment request for this device.

--
[Guideline-3] New interface should be kept as generic as possible

  A new interface will be introduced to expose reserved regions to user
space. Though RMRR is a VT-d specific term, the interface design should
be generic enough, i.e. it should support a function which allows the
hypervisor to force reserving one or more gfn ranges.

--
[Guideline-4] Keep changes simple

  RMRR reserved regions should be avoided or limited by platform
designers, per the VT-d specification. Per our observations, there are
only a few reported examples (USB, IGD) on real platforms. So we need
to balance code complexity against usage limitations. If a limitation
only affects niche scenarios, we'd rather vote not to support it, to
keep the changes simple for now.

>>>3.2 Conflict detection
----
Conflicts must be detected in several places as far as gfns are
concerned (how to handle a conflict is discussed in 3.3):

1) libxc domain builder
  Here a coarse-grained gfn layout is created, including up to two
contiguous guest RAM chunks (lowmem and/or highmem) and MMIO holes (VGA,
PCI), which are passed to hvmloader for later fine-grained manipulation.
Guest RAM chunks are populated with valid translations set up in the
underlying p2m layer. Conflicts with device reserved regions must be
detected in that layout.

2) Xen hypervisor device assignment
  Device assignment can happen either at VM creation time (after the
domain builder), or at any time through hotplug after the VM is booted.
Regardless of how user space handles conflicts, the Xen hypervisor will
always perform a final, conservative detection when setting up the
identity mapping (see the sketch after this list):
        * gfn space unoccupied:
                -> insert identity mapping; no conflict
        * gfn space already occupied with an identity mapping:
                -> do nothing; no conflict
        * gfn space already occupied with another mapping:
                -> conflict detected

3) hvmloader
  Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
internal data structures in the gfn space, and it creates the final guest
e820. So hvmloader also needs to detect conflicts when performing
those operations. If there's no conflict, hvmloader will reserve
those regions in the guest e820 so the guest OS is aware of them.
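
A minimal sketch of the three-way check in 2) above, written against a toy
in-memory model rather than the real p2m code (all names here are
illustrative):

----
#include <errno.h>

/* Toy model of the gfn space: what each gfn is currently mapped to. */
typedef enum { GFN_EMPTY, GFN_IDENTITY, GFN_OTHER } gfn_state_t;

#define TOY_GFNS 1024
static gfn_state_t toy_p2m[TOY_GFNS];

/* Returns 0 on success, -EBUSY when a conflict is detected. */
static int map_rmrr_gfn(unsigned long gfn)
{
    switch ( toy_p2m[gfn] )
    {
    case GFN_EMPTY:          /* unoccupied: insert identity mapping */
        toy_p2m[gfn] = GFN_IDENTITY;
        return 0;
    case GFN_IDENTITY:       /* already identity-mapped: nothing to do */
        return 0;
    case GFN_OTHER:          /* occupied by another mapping: conflict */
    default:
        return -EBUSY;
    }
}
----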

>>>3.3 Policies
----
An intuitive approach is to fail immediately upon a conflict; however,
that is not flexible enough for the different requirements:

a) it's not appropriate to fail the libxc domain builder just because of
such a conflict. We still want the guest to boot even w/o the assigned
device;

b) whether to fail in hvmloader has several dependencies. If the check is
for hotplug preparation, a warning is an acceptable option since the
assignment may not happen at all. Or if it's a USB controller and the
user doesn't care about legacy keyboard emulation, it's also OK to
move forward despite a conflict;

c) in the Xen hypervisor it is reasonable to fail upon a conflict, since
that is where the device is actually assigned. But due to the same
consideration for USB controllers, sometimes we might want it to succeed
with just warnings.

Given the complexity of addressing all the above flexibilities (user
preferences, per-device policy), which would require inventing quite a few
parameters passed among different components, and given the fact that
failures would be rare (except for some USB cases) with proactive avoidance
in user space, we'd like to propose the simplified policy below, following
[Guideline-4]:

- 'warn' on conflicts in user space (libxc and hvmloader)
- a boot option to specify 'fail' or 'warn' on conflicts in the Xen device
assignment path, defaulting to 'fail' (the user can set it to 'warn' for
the USB case)

Such a policy provides a relaxed user space policy with the hypervisor
making the final judgement. It has the unique merit of simplifying the
later interface design and hotplug support, w/o breaking [Guideline-1/2]
even when all possible reserved regions are exposed.
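
As a hedged illustration of the proposed boot option (the flag and helper
names below are hypothetical, not an existing Xen parameter), the assignment
path could gate a detected conflict roughly like this:

----
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag set from a (not yet existing) boot option;
 * default is 'fail'. */
static bool rmrr_conflict_fails = true;

/* Called from the assignment path when a conflict was detected for the
 * reserved region [base, end]. */
static int handle_rmrr_conflict(unsigned long base, unsigned long end)
{
    if ( rmrr_conflict_fails )
    {
        fprintf(stderr, "RMRR [%lx, %lx] conflicts with an existing "
                "mapping: failing assignment\n", base, end);
        return -EBUSY;
    }

    fprintf(stderr, "RMRR [%lx, %lx] conflicts with an existing "
            "mapping: continuing per 'warn' policy\n", base, end);
    return 0;
}
----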

    ******agreement is first required on above policy******

>>>3.4 Xen: setup RMRR identity mapping
----
Regardless of whether user space has detected a conflict, the Xen hypervisor
always needs to detect conflicts itself when setting up the identity
mapping for reserved gfn regions, following the policy defined above.

Identity mappings should really be handled in the general p2m layer,
so the same r/w permissions apply equally to the CPU and DMA access paths,
regardless of whether EPT is shared with the IOMMU.

This matches the behavior on bare metal, where although reserved
regions are marked as E820_RESERVED, that is just a hint to system
software, which can still read data back because physically those bits
do exist. So in the virtualization case we don't need to treat CPU
accesses to RMRR reserved regions specially (similar to other reserved
regions like ACPI NVS).
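
To illustrate, a minimal sketch of walking one reserved region page by page,
reusing the hypothetical map_rmrr_gfn() helper from the 3.2 sketch (names
and the PAGE_SHIFT handling are illustrative only):

----
/* Reuses the hypothetical map_rmrr_gfn() from the 3.2 sketch. */
int map_rmrr_gfn(unsigned long gfn);

#define PAGE_SHIFT_SK 12

/* RMRR base/size are 4KB aligned per the spec, so a simple page walk
 * covers the region exactly; identity mapping means gfn == mfn here. */
static int map_rmrr_region(unsigned long base_addr, unsigned long end_addr)
{
    unsigned long gfn;
    int rc = 0;

    for ( gfn = base_addr >> PAGE_SHIFT_SK;
          gfn <= (end_addr >> PAGE_SHIFT_SK) && rc == 0;
          gfn++ )
        rc = map_rmrr_gfn(gfn);

    return rc;
}
----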

>>>3.5 New interface: expose reserved region information
----
As explained in [Guideline-3], we'd like to keep this interface generic
enough to serve as a common interface for the hypervisor to force reserving
gfn ranges for various reasons (RMRR is one client of this feature).

One design open was discussed back and forth: whether the interface
should return regions reported for all devices in the platform
(report-all), or selectively return only regions belonging to assigned
devices (report-sel). report-sel can be built on top of report-all, with
extra work to help the hypervisor generate filtered regions (e.g.
introducing a new interface, or making device assignment happen before
the domain builder runs).

We propose report-all as the simple solution (different from the last sent
version, which used report-sel), based on the following facts:

  - the 'warn' policy in user space makes report-all harmless
  - 'report-all' still means only a few entries in reality:
    * RMRR reserved regions should be avoided or limited by platform
designers, per the VT-d specification;
    * RMRR reserved regions are only a few on real platforms, per our
observations so far;
  - the OS needs to handle all the reserved regions on bare metal anyway;
  - it is hotplug friendly;
  - report-all can be extended to report-sel later if really required

This way, there are two situations in which the libxc domain builder may
query reserved region information through the same interface:

a) if there are any statically-assigned devices, and/or
b) if a new parameter is specified, asking for hotplug preparation
        ('rdm_check' or 'prepare_hotplug'?)

The first invocation of this interface saves all reported reserved
regions under the domain structure, and later invocations (e.g. from
hvmloader) get the saved content.
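
Purely as a hypothetical illustration (this is not an existing hypercall or
libxc API), the entries returned by such an interface might look like:

----
#include <stdint.h>

/* Hypothetical shape of one reported entry. */
struct reserved_region_sketch {
    uint64_t start_pfn;   /* first pfn of the reserved range */
    uint64_t nr_pages;    /* length of the range in 4KB pages */
};

/* Hypothetical query: returns the number of entries, filling at most
 * 'nr' of them into 'buf' (pass buf == NULL, nr == 0 to just count). */
int query_reserved_regions_sketch(int domid,
                                  struct reserved_region_sketch *buf,
                                  unsigned int nr);

/* With report-all, the same entries come back regardless of which devices
 * are assigned; the first invocation saves them under the domain
 * structure, as described above. */
----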

If a VM is configured w/o assigned devices, this interface is not
invoked, so there's no impact and [Guideline-1] is upheld.

If a VM is configured w/ assigned devices which don't have reserved
regions, this interface is still invoked. In some cases warnings may be
thrown out due to conflicts caused by other, non-assigned devices,
but they are just informational and there is no impact on the assigned
devices, so [Guideline-2] is upheld.

>>>3.6 Libxc/hvmloader: detect and avoid conflicts
----
libxc needs to detect reserved region conflicts with:
        - guest RAM
        - the monolithic PCI MMIO hole

hvmloader needs to detect reserved region conflicts with:
        - guest RAM
        - PCI MMIO allocations
        - memory allocations
        - some e820 entries like the ACPI OpRegion, etc.

When a conflict is detected, libxc/hvmloader first try to relocate the
conflicting gfn resources to avoid the conflict. A warning is thrown out
when such relocation fails. The relocation policy is straightforward for
most resources; however, there remains a major design tradeoff for guest
RAM, regarding the handoff between libxc and hvmloader...
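
A minimal sketch of the overlap test and the simplest relocation step both
components would need (illustrative helpers, not existing code in libxc or
hvmloader):

----
#include <stdbool.h>
#include <stdint.h>

/* Ranges are [start, end) in bytes. */
static bool ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
    return s1 < e2 && s2 < e1;
}

/* If a candidate allocation [start, start + size) hits a reserved
 * region, the simplest relocation is to retry just above that region. */
static uint64_t relocate_past(uint64_t start, uint64_t size,
                              uint64_t rsvd_start, uint64_t rsvd_end)
{
    if ( ranges_overlap(start, start + size, rsvd_start, rsvd_end) )
        return rsvd_end;    /* new candidate start address */
    return start;
}
----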

In the current implementation, guest RAM is contiguous in the gfn space,
with at most two chunks: lowmem (<4G) and highmem (>4G), which are passed
to hvmloader through hvm_info. If guest RAM is relocated to avoid conflicts
with reserved regions, sparse memory chunks are created, and introducing
such a sparse structure into hvm_info is not considered an extensible
approach.

There are several other options discussed so far:

a) Duplicate the same relocation algorithm within the libxc domain builder
(when populating the physmap) and hvmloader (when creating the e820)
  - Pros:
        * no interface/structure change
        * hvmloader still needs to handle reserved regions anyway
  - Cons:
        * duplication is not good

b) pass the sparse information through xenstore
  (no concrete idea yet; input from toolstack maintainers is needed)

c) utilize the XENMEM_{set,}_memory_map pair of hypercalls, with libxc
setting the map and hvmloader getting it. An extension is required to
allow an HVM guest to invoke the get side.
  - Pros:
        * centralized ownership in libxc; flexible for extension
  - Cons:
        * limits the number of entries to E820MAX (should be fine)
        * hvmloader e820 construction may become more complex, given
two predefined tables (reserved_regions, memory_map)
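
As a rough illustration of option c) from libxc's side (illustrative types
and layout only; the RMRR addresses are taken from the BDW example in
section 1, and the exact libxc plumbing for XENMEM_set_memory_map is not
shown):

----
#include <stdint.h>

#define E820_RAM_SK       1
#define E820_RESERVED_SK  2

struct e820entry_sketch {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* libxc would build the (possibly sparse) RAM layout it actually
 * populated plus the reserved regions, then push the table to Xen via
 * XENMEM_set_memory_map for hvmloader to retrieve later. The entries
 * below are purely illustrative, using the USB RMRR from section 1. */
static struct e820entry_sketch guest_map_sketch[] = {
    { 0x00000000ULL, 0xab80a000ULL, E820_RAM_SK      }, /* lowmem below RMRR */
    { 0xab80a000ULL, 0x00014000ULL, E820_RESERVED_SK }, /* USB RMRR          */
    /* ... remaining RAM / reserved / MMIO-hole entries ... */
};
----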

********Inputs are required to find a good option here*********

>>>3.7 hvmloader: reserve 'reserved regions' in guest E820
----
If no conflict is detected, hvmloader needs to mark those reserved
regions as E820_RESERVED in the guest E820 table, so the guest OS is
aware of them (and thus doesn't do anything problematic, e.g. when
re-allocating PCI MMIO).
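
A small sketch of that marking step (illustrative types and names, not
hvmloader's actual e820 builder):

----
#include <stdint.h>

#define E820_RESERVED_SK 2

struct e820entry_sk {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Only reached when no conflict was detected, so the range can be
 * appended verbatim; the guest OS then knows not to place anything
 * (e.g. relocated PCI BARs) on top of it. 'end' is inclusive, as the
 * regions are reported. */
static unsigned int add_reserved_e820(struct e820entry_sk *e820,
                                      unsigned int nr,
                                      uint64_t base, uint64_t end)
{
    e820[nr].addr = base;
    e820[nr].size = end - base + 1;
    e820[nr].type = E820_RESERVED_SK;
    return nr + 1;
}
----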

>>>3.8 Xen: Handle devices sharing reserved regions
----
Per the VT-d spec, it's possible for two devices to share the same reserved
region. Though we haven't seen such an example in reality, the hypervisor
needs to detect and handle this scenario, otherwise a vulnerability may
exist if the two devices are assigned to different VMs (a malicious VM
could program its assigned device to clobber the shared region and thereby
break the other VM's device).

Ideally all devices sharing reserved regions should be assigned to a
single VM. However this can't be achieved solely in the hypervisor
w/o reworking the current device assignment interface. Assignment is
managed by the toolstack, so this would require exposing group sharing
information to user space and then extending the toolstack to manage
assignment in bundles.

Given that the problem is so far only theoretical, we propose not to
support this scenario, i.e. the hypervisor fails the assignment if the
target device happens to share some reserved region with another device,
following [Guideline-4] to keep things simple.
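
A sketch of the proposed check, with illustrative data structures (not
Xen's): before assignment, see whether any of the device's reserved regions
is already claimed by a device assigned to another VM, and refuse if so:

----
#include <errno.h>
#include <stdint.h>

struct rmrr_claim_sketch {
    uint64_t base, end;   /* inclusive reserved range */
    int owner_domid;      /* -1 if not yet used by any domain */
};

/* Called when assigning a device whose RMRR is [base, end] to 'domid'.
 * Returns -EBUSY if another domain already claimed the same region. */
static int claim_rmrr(struct rmrr_claim_sketch *rmrrs, unsigned int nr,
                      uint64_t base, uint64_t end, int domid)
{
    unsigned int i;

    for ( i = 0; i < nr; i++ )
    {
        if ( rmrrs[i].base != base || rmrrs[i].end != end )
            continue;
        if ( rmrrs[i].owner_domid >= 0 && rmrrs[i].owner_domid != domid )
            return -EBUSY;   /* shared with another VM: refuse assignment */
        rmrrs[i].owner_domid = domid;
    }
    return 0;
}
----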



4. Plan
=====================================================================
We're seeking an incremental way to split the above tasks into 2 stages,
where each stage moves things forward a step w/o causing regressions.
Doing so benefits people who want to use device assignment early, and
also helps new developers ramp up, toward a final sane solution.

4.1 Stage-1: hypervisor hardening
----
  [Tasks]
        1) Set up RMRR identity mappings in the p2m layer with conflict
detection
        2) Add a boot option for the fail/warn policy
        3) Remove the USB hack
        4) Detect and fail device assignment w/ shared reserved regions

  [Enhancements]
        * fixes [Issue-1] and [Issue-3]
        * partially fixes [Issue-2], with limitations:
                - w/o user space relocation there's a larger chance of
seeing conflicts.
                - w/o reservations in the guest e820, the guest OS may
allocate a reserved pfn when re-enumerating PCI resources

  [Regressions]
        * devices which could be assigned successfully before may now
fail due to conflict detection. However it's not a regression per se,
and the user can change the policy to 'warn' if required.

4.2 Stage-2: libxc/hvmloader hardening
----
  [Tasks]
        5) Introduce a new interface to expose reserved region information
        6) Detect and avoid reserved region conflicts in libxc
        7) Pass the libxc guest RAM layout to hvmloader
        8) Detect and avoid reserved region conflicts in hvmloader
        9) Reserve the 'reserved regions' in the guest E820 in hvmloader

  [Enhancements]
        * completely fixes [Issue-2]

  [Regression]
        * n/a

Thanks,
Kevin

