
Re: [Xen-devel] (v2) Design proposal for RMRR fix



>>> On 09.01.15 at 03:27, <kevin.tian@xxxxxxxxx> wrote:
>>  From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>> Sent: Thursday, January 08, 2015 9:55 PM
>> >>> On 26.12.14 at 12:23, <kevin.tian@xxxxxxxxx> wrote:
>> > 3) hvmloader
>> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
>> > internal data structures in gfn space, and it creates the final guest
>> > e820. So hvmloader also needs to detect conflicts when conducting
>> > those operations. If there's no conflict, hvmloader will reserve
>> > those regions in the guest e820 to make the guest OS aware.
>> 
>> Ideally, rather than detecting conflicts, hvmloader would just
>> consume what libxc set up. Obviously that would require awareness
>> in libxc of things it currently doesn't care about (like fitting PCI BARs
>> into the MMIO hole, enlarging it as necessary). I admit that this may
>> end up being difficult to implement. Another alternative would be to
>> have libxc only populate a limited part of RAM (for hvmloader to be
>> loadable), and have hvmloader do the bulk of the populating.
> 
> there are quite a few allocations which are best done in hvmloader, such
> as ACPI tables, PCI BARs, and other hole allocations. Some of them are for
> hvmloader's own usage, and others are related to the guest BIOS. I don't
> think the mass refactoring of moving those allocations to libxc is
> worthwhile just for this very specific task. As long as hvmloader still
> needs to allocate gfns, it needs to keep the conflict detection logic
> itself.

Allocations done by hvmloader don't need to look for conflicts.
All hvmloader needs to make sure is that what it allocates is actually
RAM. That it doesn't do so today is already a (latent) bug. The point
is that if libxc sets up a proper, fully correct memory map for
hvmloader to consume, hvmloader doesn't need to do anything other
than play by this memory map.
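To illustrate the check being argued for here, a minimal sketch of how hvmloader could validate an allocation against the memory map libxc hands over (structure layout and names are simplified/hypothetical; the real definitions live in Xen's public e820 headers):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified, hypothetical e820 entry; see Xen's public headers for
 * the real layout (e.g. xen/include/public/hvm/e820.h). */
#define E820_RAM 1

struct e820entry {
    uint64_t addr;   /* start of region, in bytes */
    uint64_t size;   /* region length, in bytes   */
    uint32_t type;   /* E820_RAM, reserved, ...   */
};

/*
 * An allocation is valid iff the whole [start, start + len) range
 * lies inside a single E820_RAM entry of the memory map.  Assumes
 * the map entries are non-overlapping.
 */
static int range_is_ram(const struct e820entry *map, size_t nr,
                        uint64_t start, uint64_t len)
{
    size_t i;

    for ( i = 0; i < nr; i++ )
    {
        if ( map[i].type != E820_RAM )
            continue;
        if ( start >= map[i].addr &&
             start + len <= map[i].addr + map[i].size )
            return 1;
    }
    return 0;
}
```

With a fully correct map from libxc, this containment test replaces any separate conflict-detection pass: anything outside E820_RAM (RMRRs included) is simply never handed out.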

>> > 3.3 Policies
>> > ----
>> > An intuitive thought is to fail immediately upon a conflict; however,
>> > that is not flexible with regard to different requirements:
>> >
>> > a) it's not appropriate to fail the libxc domain builder just because of
>> > such a conflict. We still want the guest to boot even w/o the assigned device;
>> 
>> I don't think that's right (and I believe this was discussed before):
>> When device assignment fails, VM creation should fail too. It is the
>> responsibility of the host admin in that case to remove some or all
>> of the to-be-assigned devices from the guest config.
> 
> think about bare metal. If a device, say a NIC, doesn't work, would the
> platform refuse to work at all? There could be errors, but their scope
> is limited to the specific function. A user can still use a platform
> with errors as long as the related functions are not used.
> 
> Similarly we should allow the domain builder to move forward upon a
> device assignment failure (something like a circuit error when powering
> the device), and the user will notice the problem when using the device
> (either not present or not functioning correctly).
> 
> Same thing for the hotplug usage. All the detections for future hotplug
> usage are just preparation and not strict. You don't want to hang
> a platform just because it's not suitable for hotplugging some device
> in the future.

Hotplug is something that can fail, and surely shouldn't lead to a
hung guest. Very similar to hotplug on bare metal indeed.

Boot time device assignment is different: The question isn't whether
an assigned device works, instead the proper analogy is whether a
device is _present_. If a device doesn't work on bare metal, it will
still be discoverable. Yet if device assignment fails, that's not going
to be the case - for security reasons, the guest would not see any
notion of the device.

The "device does not work" analogy to bare metal would only apply
if the device's presence would prevent the system from booting (in
which case you'd have to physically remove it from the system, just
like for the virtualized case you'd have to remove it from the guest
config).

>> > We propose report-all as the simple solution (different from the last
>> > sent version, which used report-sel), based on the following facts:
>> >
>> >   - 'warn' policy in user space makes report-all not harmful
>> >   - 'report-all' still means only a few entries in reality:
>> >     * RMRR reserved regions should be avoided or limited by platform
>> > designers, per VT-d specification;
>> >     * RMRR reserved regions are only a few on real platforms, per our
>> > current observations;
>> 
>> Few yes, but in the IGD example you gave the region is quite large,
>> and it would be fairly odd to have all guests have a strange, large
>> hole in their address spaces. Furthermore remember that these
>> holes vary from machine to machine, so a migratable guest would
>> needlessly end up having a hole potentially not helping subsequent
>> hotplug at all.
> 
> it's not strange since it never exceeds the set on bare metal, but yes,
> migration raises another interesting point. Currently I don't think
> migration w/ assigned devices is supported. But even considering the
> future possibility, there's always a limitation, since whatever reserved
> regions are created at boot time in the e820 are static and can't adapt
> to dynamic device changes. For hotplug or migration, you always
> suffer from seeing some holes which might not be relevant at a
> given moment.

The question isn't about migrating with devices assigned, but about
assigning devices after migration (consider a dual vif + SR-IOV NIC
guest setup where the SR-IOV NIC gets hot-removed before
migration and a new one hot-plugged afterwards).

Furthermore any tying of the guest memory layout to the host's
where the guest first boots is awkward, as post-migration there's
not going to be any reliable correlation between the guest layout
and the new host's.

>> > In this way, there are two situations in which the libxc domain builder
>> > may request reserved region information through the same interface:
>> >
>> > a) if there are any statically-assigned devices, and/or
>> > b) if a new parameter is specified, asking for hotplug preparation
>> >    ('rdm_check' or 'prepare_hotplug'?)
>> >
>> > The 1st invocation of this interface will save all reported reserved
>> > regions under the domain structure, and later invocations (e.g. from
>> > hvmloader) get the saved content.
>> 
>> Why would the reserved regions need attaching to the domain
>> structure? The combination of (to be) assigned devices and
>> global RMRR list always allows reproducing the intended set of
>> regions without any extra storage.
> 
> it's possible that a new device is plugged into the host between two
> adjacent invocations, and inconsistent information would be returned
> that way.

Can hot-plugged devices indeed be associated with RMRRs? That
would seem like a contradiction to me, since RMRRs are specifically
there to address boot-time compatibility issues.
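The stateless reproduction argued for above could be sketched as follows; all structure and function names here are hypothetical, standing in for whatever query interface the hypervisor/libxc would actually expose:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t sbdf_t;          /* segment/bus/device/function of a device */

/* Hypothetical view of one global RMRR entry and its scoped devices. */
struct rmrr_unit {
    uint64_t base, end;           /* reserved region [base, end)       */
    const sbdf_t *devices;        /* devices this RMRR applies to      */
    size_t nr_devices;
};

/* Does this RMRR cover any device in the (to be) assigned set? */
static int rmrr_matches(const struct rmrr_unit *r,
                        const sbdf_t *assigned, size_t nr_assigned)
{
    size_t i, j;

    for ( i = 0; i < r->nr_devices; i++ )
        for ( j = 0; j < nr_assigned; j++ )
            if ( r->devices[i] == assigned[j] )
                return 1;
    return 0;
}

/*
 * Recompute the reserved regions relevant to a domain purely from the
 * global RMRR list plus the assigned-device set - no per-domain state
 * needs caching.  Returns the number of regions written to 'out'.
 */
static size_t regions_for_devices(const struct rmrr_unit *rmrrs, size_t nr,
                                  const sbdf_t *assigned, size_t nr_assigned,
                                  struct rmrr_unit *out, size_t out_max)
{
    size_t i, n = 0;

    for ( i = 0; i < nr && n < out_max; i++ )
        if ( rmrr_matches(&rmrrs[i], assigned, nr_assigned) )
            out[n++] = rmrrs[i];
    return n;
}
```

Because the result is a pure function of its two inputs, two invocations can only disagree if the inputs changed between them (e.g. a device was hot-plugged into the host), which is the consistency question raised in the reply above.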

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

