
Re: [Xen-devel] (v2) Design proposal for RMRR fix

> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Thursday, January 08, 2015 9:55 PM
> >>> On 26.12.14 at 12:23, <kevin.tian@xxxxxxxxx> wrote:
> > [Issue-2] Lacking goal-b), existing device assignment with
> > RMRR works only when the reserved regions happen not to conflict with
> > other valid allocations in the guest physical address space. This could
> > lead to unpredictable failures in various deployments, due to undetected
> > conflicts caused by platform differences and VM configuration
> > differences.
> >
> > One example is about USB controller assignment. It's already identified
> > as a problem on some platforms, that USB reserved regions conflict with
> > guest BIOS region. However, being the fact that host BIOS only touches
> > those reserved regions for legacy keyboard emulation at early Dom0 boot
> > phase, a trick is added in Xen to bypass RMRR handling for usb
> > controllers.
> s/trick/hack/ - after all, doing this is not safe. Plus if these regions
> really were needed only for early boot legacy keyboard emulation,
> they wouldn't need expressing as RMRR afaict, or if that really was
> a requirement a suitable flag should be added to tell the OS that
> once a proper driver is in place for the device, the RMRR won't be
> needed anymore. In any event - the hack needs to go away.

Early boot doesn't mean pre-OS boot. The regions are still used in early OS
boot, before ACPI switches mode or a USB keyboard driver writes a register
(I will check the details later). VT-d can also be used on bare metal, which
is why the RMRR is required from the specification's point of view (an OS can
set up the IOMMU for USB devices very early).

So the RMRR is reported, and if we add strict conflict detection with an
immediate-failure policy (when the RMRR is below 1M), USB devices which could
previously be assigned would fail once the hack is removed. Whether we still
want to keep a 'warn' option to support them is an open discussed later.
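
To make the fail/warn choice concrete, here is a rough sketch only. All names are hypothetical and nothing here is existing Xen code. It assumes, for illustration, that any RMRR starting below 1M counts as a conflict with the guest BIOS region, and that the policy is either 'fail' or 'warn':

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not existing Xen identifiers. */
enum rmrr_policy { RMRR_POLICY_FAIL, RMRR_POLICY_WARN };
enum rmrr_result { RMRR_OK, RMRR_WARNED, RMRR_FAILED };

#define ONE_MB UINT64_C(0x100000)

struct rmrr {
    uint64_t start; /* inclusive, bytes */
    uint64_t end;   /* exclusive, bytes */
};

/*
 * Assumption of this sketch: an RMRR that starts below 1M conflicts
 * with the guest BIOS region. 'fail' rejects the assignment, 'warn'
 * lets it proceed with a warning.
 */
static enum rmrr_result rmrr_assign_check(const struct rmrr *r,
                                          enum rmrr_policy policy)
{
    assert(r->start < r->end);

    if (r->start >= ONE_MB)
        return RMRR_OK;

    return policy == RMRR_POLICY_WARN ? RMRR_WARNED : RMRR_FAILED;
}
```

With 'fail' as the default, a USB controller whose RMRR sits below 1M would be rejected unless the admin explicitly selects 'warn'.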

> > [Issue-3] devices may share same reserved regions, however
> > there is no logic to handle this in Xen. Assigning such devices to
> > different VMs could lead to secure concern
> s/could lead to/is a/
> > [Guideline-3] New interface should be kept as common as possible
> >
> >   New interface will be introduced to expose reserved regions to the
> > user space. Though RMRR is a VT-d specific terminology, the interface
> > design should be generic enough, i.e. to support a function which
> > allows hypervisor to force reserving one or more gfn ranges.
> s/hypervisor/user space/ ? Or else I don't see the connection between
> the new interface and the enforcement of the reserved ranges.

The reserved regions are specified by the hypervisor for some reason, e.g.
RMRR, and are ultimately reserved by user space. Here I wanted to convey that
the intention comes from the hypervisor.

> > 3) hvmloader
> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> > internal data structures in gfn space, and it creates the final guest
> > e820. So hvmloader also needs to detect conflictions when conducting
> > those operations. If there's no confliction, hvmloader will reserve
> > those regions in guest e820 to let guest OS aware.
> Ideally, rather than detecting conflicts, hvmloader would just
> consume what libxc set up. Obviously that would require awareness
> in libxc of things it currently doesn't care about (like fitting PCI BARs
> into the MMIO hole, enlarging it as necessary). I admit that this may
> end up being difficult to implement. Another alternative would be to
> have libxc only populate a limited part of RAM (for hvmloader to be
> loadable), and have hvmloader do the bulk of the populating.

Quite a few allocations are well suited to hvmloader, such as ACPI, PCI
BARs, and other hole allocations. Some of them are for hvmloader's own use,
and others are related to the guest BIOS. I don't think it's worth a massive
refactoring to move those allocations into libxc just for this very specific
task. As long as hvmloader still needs to allocate gfns, it needs to keep
the conflict detection logic itself.
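
The detection itself is small. A minimal sketch, assuming half-open gfn ranges and a flat array of reserved regions (the structure and function names are made up for illustration, not taken from hvmloader):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a half-open range [start, end) of gfns. */
struct gfn_range {
    uint64_t start;
    uint64_t end;
};

/* Two half-open ranges overlap iff each starts before the other ends. */
static bool ranges_overlap(struct gfn_range a, struct gfn_range b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * Return the index of the first reserved region that overlaps the
 * candidate allocation, or -1 if the allocation is conflict-free.
 */
static int find_conflict(struct gfn_range alloc,
                         const struct gfn_range *rsvd, size_t nr)
{
    assert(alloc.start < alloc.end);

    for (size_t i = 0; i < nr; i++)
        if (ranges_overlap(alloc, rsvd[i]))
            return (int)i;

    return -1;
}
```

Each allocator in hvmloader (PCI MMIO, ACPI, internal memory) would run such a check before committing a range, and skip past the conflicting region when one is found.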

So I want to avoid big changes if possible (they could proliferate into more
tasks, so that this specific RMRR task never ends), and only target the
minimal necessary changes now.

> >>>>3.3 Policies
> > ----
> > An intuitive thought is to fail immediately upon a conflict, however
> > it is not flexible with regard to different requirements:
> >
> > a) it's not appropriate to fail libxc domain builder just because such
> > confliction. We still want the guest to boot even w/o assigned device;
> I don't think that's right (and I believe this was discussed before):
> When device assignment fails, VM creation should fail too. It is the
> responsibility of the host admin in that case to remove some or all
> of the to be assigned devices from the guest config.

Think about bare metal. If a device, say a NIC, doesn't work, would the
platform refuse to work at all? There could be errors, but their scope
is limited to a specific function. The user can still use a platform with
errors as long as the related functions are not used.

Similarly, we should allow the domain builder to move forward upon a
device assignment failure (comparable to a circuit error when powering
the device); the user will notice the problem when using the device
(it is either not present or not functioning correctly).

The same goes for hotplug usage. All the detection done for future hotplug
is just preparation and is not strict. You don't want to hang a platform
just because it's not suitable for hotplugging some device in the future.

> > b) whether to fail in hvmloader has several dependencies. If it's
> > to check for hotplug preparation, warning is also an acceptable option
> > since assignment may not happen at all. Or if it's a USB controller
> > but user doesn't care about legacy keyboard emulation, it's also OK to
> > move forward upon a confliction;
> Again assuming that RMRRs for USB devices are _only_ used for
> legacy keyboard emulation, which may or may not be true.
> > c) in Xen hypervisor it is reasonable to fail upon confliction, where
> > device is actually assigned. But due to the same requirement on USB
> > controller, sometimes we might want it succeed just w/ warnings.
> But only when asked to do so by the host admin.
> > Regarding to the complexity of addressing all above flexibilities (user
> > preferences, per-device), which requires inventing quite some parameters
> > passed among different components, and regarding to the fact that
> > failures would be rare (except some USB) with proactive avoidance
> > in userspace, we'd like to propose below simplified policy following
> > [Guideline-4]:
> >
> > - 'warn' conflictions in user space (libxc and hvmloader)
> > - a boot option to specify 'fail' or 'warn' confliction in Xen device
> > assignment path, default to 'fail' (user can set to 'warn' for USB case)
> I think someone else (Tim?) already said this: Such a "warn" option
> would be unlikely to be desirable as a global one, affecting all devices,
> but should rather be a flag settable on particular devices.
> >>>>3.5 New interface: expose reserved region information
> > ----
> > As explained in [Guideline-3], we'd like to keep this interface general
> > enough, as a common interface for hypervisor to force reserving gfn
> > ranges, due to various reasons (RMRR is a client of this feature).
> >
> > One design open was discussed back-and-forth accordingly, regarding to
> > whether the interface should return regions reported for all devices
> > in the platform (report-all), or selectively return regions only
> > belonging to assigned devices (report-sel). report-sel can be built on
> > top of report-all, with extra work to help hypervisor generate filtered
> > regions (e.g. introduce new interface or make device assignment happened
> > before domain builder)
> >
> > We propose report-all as the simple solution (different from last sent
> > version which used report-sel), regarding to the below facts:
> >
> >   - 'warn' policy in user space makes report-all not harmful
> >   - 'report-all' still means only a few entries in reality:
> >     * RMRR reserved regions should be avoided or limited by platform
> > designers, per VT-d specification;
> >     * RMRR reserved regions are only a few on real platforms, per our
> > current observations;
> Few yes, but in the IGD example you gave the region is quite large,
> and it would be fairly odd to have all guests have a strange, large
> hole in their address spaces. Furthermore remember that these
> holes vary from machine to machine, so a migrateable guest would
> needlessly end up having a hole potentially not helping subsequent
> hotplug at all.

It's not strange, since it never exceeds the set on bare metal, but yes,
migration raises another interesting point. Currently I don't think
migration with assigned devices is supported. But even considering that
future possibility, there's always a limitation, since whatever reserved
regions are created at boot time in the e820 are static and can't adapt
to dynamic device changes. For hotplug or migration, you always
suffer from seeing some holes which might not be relevant at a

> > In this way, there are two situations libxc domain builder may request
> > to query reserved region information w/ same interface:
> >
> > a) if any statically-assigned devices, and/or
> > b) if a new parameter is specified, asking for hotplug preparation
> >     ('rdm_check' or 'prepare_hotplug'?)
> >
> > the 1st invocation of this interface will save all reported reserved
> > regions under domain structure, and later invocation (e.g. from
> > hvmloader) gets saved content.
> Why would the reserved regions need attaching to the domain
> structure? The combination of (to be) assigned devices and
> global RMRR list always allow reproducing the intended set of
> regions without any extra storage.

It's possible that a new device is plugged into the host between two
adjacent invocations, in which case inconsistent information would be
returned.
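
Roughly what I have in mind, as a sketch only: the structure layout, the fixed-size array and the function name are all assumptions for illustration, not the proposed hypercall interface. The first query copies the host list into per-domain storage and later queries return that saved copy, so libxc and hvmloader see the same set even if the host list changes in between:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RDM 16 /* arbitrary limit for this sketch */

struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Hypothetical per-domain snapshot of the reserved regions. */
struct domain_rdm {
    bool saved;
    size_t nr;
    struct rdm_entry e[MAX_RDM];
};

/*
 * First call: snapshot the current host list into the domain.
 * Later calls: ignore the host list and return the snapshot.
 */
static size_t rdm_query(struct domain_rdm *d,
                        const struct rdm_entry *host, size_t host_nr,
                        const struct rdm_entry **out)
{
    if (!d->saved) {
        assert(host_nr <= MAX_RDM);
        if (host_nr)
            memcpy(d->e, host, host_nr * sizeof(*host));
        d->nr = host_nr;
        d->saved = true;
    }

    *out = d->e;
    return d->nr;
}
```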

> >>>>3.6 Libxc/hvmloader: detect and avoid conflictions
> > ----
> > libxc needs to detect reserved region conflictions with:
> >     - guest RAM
> >     - monolithic PCI MMIO hole
> >
> > hvmloader needs to detect reserved region confliction with:
> >     - guest RAM
> >     - PCI MMIO allocation
> >     - memory allocation
> >     - some e820 entries like ACPI Opregion, etc.
> - BIOS and alike


> > There are several other options discussed so far:
> >
> > a) Duplicate same relocation algorithm within libxc domain builder
> > (when populating physmap) and hvmloader (when creating e820)
> >   - Pros:
> >     * no interface/structure change
> >     * anyway hvmloader still needs to handle reserved regions
> >   - Cons:
> >     * duplication is not good
> >
> > b) pass sparse information through Xenstore
> >   (no much idea. need input from toolstack maintainers)
> >
> > c) utilize XENMEM_{set,}_memory_map pair of hypercalls, with libxc to
> > set and hvmloader to get. Extension required to allow hvm invoke.
> >   - Pros:
> >     * centralized ownership in libxc. flexible for extension
> >   - Cons:
> >     * limiting entry to E820MAX (should be fine)
> >     * hvmloader e820 construction may become more complex, given
> > two predefined tables (reserved_regions, memory_map)
> d) Move down the lowmem RAM/MMIO boundary so that a single,
> contiguous chunk of lowmem results, with all other RAM moving up
> beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
> considered here, and I think we can reasonably safely assume that
> no RMRRs will ever report ranges above 1Mb but below the host
> lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
> that the lowmem chunk will always be reasonably big).

I don't see how the above assumption is validated, but it's a good
simplification, since how far we go to avoid conflicts is an implementation
tradeoff. :-)
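
If I read option d) correctly, the boundary computation could look like the sketch below. The names are hypothetical, and RMRRs that start below 1Mb are simply skipped, as suggested; the default boundary is whatever the toolstack would have chosen otherwise:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_MB UINT64_C(0x100000)

struct rmrr {
    uint64_t start; /* inclusive, bytes */
    uint64_t end;   /* exclusive, bytes */
};

/*
 * Pick the lowmem RAM/MMIO boundary: the lowest RMRR start at or
 * above 1Mb if it lies below the default boundary, otherwise the
 * default. RAM that no longer fits below the boundary would be
 * populated above 4Gb by the caller (not shown).
 */
static uint64_t lowmem_boundary(uint64_t default_boundary,
                                const struct rmrr *r, size_t nr)
{
    uint64_t boundary = default_boundary;

    assert(default_boundary > ONE_MB);

    for (size_t i = 0; i < nr; i++) {
        if (r[i].start < ONE_MB)
            continue; /* below 1Mb: not considered here */
        if (r[i].start < boundary)
            boundary = r[i].start;
    }

    return boundary;
}
```

How small the resulting lowmem chunk can get depends entirely on the unvalidated assumption above, so a real implementation would probably want a lower limit as well.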

> > 4. Plan
> >
> > =====================================================================
> > We're seeking an incremental way to split above tasks into 2 stages,
> > and in each stage we move forward a step w/o causing regression. Doing
> > so can benefit people who want to use device assignment early, and
> > also benefit newbie developer to rampup, toward a final sane solution.
> >
> > 4.1 Stage-1: hypervisor hardening
> > ----
> >   [Tasks]
> >     1) Setup RMRR identity mapping in p2m layer with confliction
> > detection
> >     2) add a boot option for fail/warn policy
> >     3) remove USB hack
> >     4) Detect and fail device assignment w/ shared reserve regions
> >
> >   [Enhancements]
> >     * fix [Issue-1] and [Issue-3]
> According to what you wrote earlier, [Issue-3] is not intended to be
> fixed, but instead devices sharing the same RMRR(s) are to be
> declared unassignable.

yes, that's clearer.
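
A sketch of that rule, with made-up names. It assumes a shared region shows up as one RMRR whose device scope lists more than one device; the case of two separate RMRR entries covering the same range is not handled here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SCOPE 4 /* arbitrary limit for this sketch */

/* Hypothetical view of one RMRR and the devices in its scope. */
struct rmrr_scope {
    uint64_t start, end;      /* [start, end), bytes */
    size_t nr_devs;
    uint16_t bdf[MAX_SCOPE];  /* bus/device/function of each device */
};

/*
 * A device is treated as unassignable if any RMRR that lists it
 * also lists at least one other device.
 */
static bool device_assignable(uint16_t bdf,
                              const struct rmrr_scope *r, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        assert(r[i].nr_devs <= MAX_SCOPE);
        for (size_t j = 0; j < r[i].nr_devs; j++)
            if (r[i].bdf[j] == bdf && r[i].nr_devs > 1)
                return false;
    }

    return true;
}
```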

> >     * partially fix [Issue-2] with limitations:
> >             - w/o userspace relocation there's larger chance to
> > see conflictions.
> >             - w/o reserve in guest e820, guest OS may allocate
> > reserved pfn when re-enumerating PCI resource
> >
> >   [Regressions]
> >     * devices which can be assigned successfully before may be
> > failed now due to confliction detection. However it's not a regression
> > per se. and user can change policy to 'warn' if required.
> Avoiding such a (perceived) regression would seem to be possible by
> intermixing hypervisor and libxc/hvmloader adjustments.
> Jan
