
Re: [Xen-devel] (v2) Design proposal for RMRR fix



On Tue, Jan 13, 2015 at 11:03:22AM +0000, Tian, Kevin wrote:
> > From: George Dunlap
> > Sent: Monday, January 12, 2015 10:20 PM
> > 
> > On Mon, Jan 12, 2015 at 12:28 PM, Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > >> From: George Dunlap
> > >> Sent: Monday, January 12, 2015 8:14 PM
> > >>
> > >> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@xxxxxxxxx>
> > wrote:
> > >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> >> Sent: Monday, January 12, 2015 6:23 PM
> > >> >>
> > >> >> >>> On 12.01.15 at 11:12, <kevin.tian@xxxxxxxxx> wrote:
> > >> >> >>  From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> >> >> Sent: Monday, January 12, 2015 6:09 PM
> > >> >> >>
> > >> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@xxxxxxxxx> wrote:
> > >> >> >> > the result depends on another open question: whether we want to
> > >> >> >> > block guest boot for such a problem. If a 'warn' in the domain
> > >> >> >> > builder is acceptable, we don't need to change lowmem for such a
> > >> >> >> > rare 1GB case; just throw a warning for unnecessary conflicts
> > >> >> >> > (it doesn't hurt if the user doesn't assign the device).
> > >> >> >>
> > >> >> >> And how would you then deal with the one guest needing that
> > >> >> >> range reserved?
> > >> >> >
> > >> >> > if the guest needs the range, then report-all vs. report-sel
> > >> >> > doesn't matter: the domain builder throws the warning, and the
> > >> >> > later device assignment will fail (or warn w/ override). In
> > >> >> > reality I think a 1GB region is rare; making such an assumption
> > >> >> > to simplify the implementation is reasonable.
> > >> >>
> > >> >> One of my main problems with all your recent argumentation here
> > >> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> > >> >> in this discussion about where the boundary is. Everything revolves
> > >> >> around the (undue) effect of report-all on domains not needing all
> > >> >> of the ranges found on the host.
> > >> >>
> > >> >
> > >> > I'm not sure which part of my argument is unclear here. report-all
> > >> > would be a problem here only if we want to fix all the conflicts in
> > >> > the domain builder (pulling in unnecessary devices increases the
> > >> > likelihood of conflicts). But if we only fix reasonable ones (e.g.
> > >> > above 3GB) while merely warning about other conflicts (e.g. below
> > >> > 3GB) in the domain builder (letting the later assignment path
> > >> > actually fail if the conflict does matter), then we don't need to
> > >> > solve all conflicts in the domain builder (in the 1GB example,
> > >> > fixing it may instead reduce lowmem greatly), and report-all may
> > >> > just add more warnings than report-sel for unused devices.
> > >>
> > >> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> > >> what you mean by those.  In any case, the naming has got to be a bit
> > >> misleading: the important questions at the moment, AFAICT, are:
> > >
> > > I explained them in the original proposal.
> > 
> > Yes, I read it and didn't understand it there either. :-)
> 
> sorry for that.
> 
> > 
> > >> 1. Whether we make holes at boot time for all RMRRs on the system, or
> > >> whether only make RMRRs for some subset (or potentially some other
> > >> arbitrary range, which may include RMRRs on other hosts to which we
> > >> may want to migrate).
> > >
> > > I use 'report-all' to stand for making holes for all RMRRs on the
> > > system, and 'report-sel' for a specified subset.
> > >
> > > including other RMRRs (from the admin, for migration) is orthogonal
> > > to the above open question.
> > 
> > Right; so the "report" in this case is "report to the guest".
> > 
> > As I said, I think that's confusing terminology; after all, we want to
> > report to the guest all holes that we make, and only the holes that we
> > make.  The question isn't then which ones we report, but which ones we
> > make holes for. :-)
> 
> originally I used 'report' to describe the hypercall by which the hypervisor
> composes the actual information about RMRRs, so it can be 'report to libxl'
> or 'report to the guest' depending on who invokes that hypercall.
> 
> but yes, here we care more about what's reported to the guest.
> 
> > 
> > So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
> > the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
> > may be from this host, or even another host)?
> 
> the counterpart of 'rmrr-host' gives me the feeling of 'rmrr-guest'. :-)
> 
> > 
> > Given that the ranges may be of arbitrary size, and that we may want
> > to specify additional ranges for migration to other hosts, I think
> > that at some level we need the machinery to be in place to specify
> > the RMRRs that will be reserved for a specific guest.
> > 
> > At the xl level, there should of course be a way to specify "use all
> > host RMRRs"; but what should happen then is that xl / libxl should
> > query Xen for the host RMRRs and then pass those down to the next
> > layer of the library.
> > 
> > >> 2. Whether those holes are made by the domain builder in libxc, or by
> > >> hvmloader
> > >
> > > based on the current discussion, whether to make holes in hvmloader
> > > doesn't bring a fundamental difference. as long as the domain builder
> > > still needs to populate memory (even the minimum for hvmloader to
> > > boot), it needs to check for conflicts and may ideally make holes too
> > > (though we may assume it doesn't)
> > 
> > Well it will have an impact on the overall design of the code; but
> > you're right, if RMRRs really can (and will) be anywhere in memory,
> > then the domain builder will need to know what RMRRs are going to be
> > reserved for this VM and avoid populating those.  If, on the other
> > hand, we can make some fairly small assumptions about where there will
> > not be any RMRRs, then we can get away with handling everything in
> > hvmloader.
> 
> I'm not sure such fairly small assumptions can be made. For example,
> the host RMRRs may include one or several regions in the host PCI MMIO
> space (say >3G); hvmloader then has to know about them to avoid
> allocating those ranges for guest PCI MMIO.
> 
> > 
> > >>
> > >> 3. What happens if Xen is asked to assign a device and it finds that
> > >> the required RMRR is not empty:
> > >>  a. during guest creation
> > >>  b. after the guest has booted
> > >
> > > for Xen we don't need to differentiate a/b. by default a clear failure
> > > should be returned, as moving forward implies a security/correctness
> > > issue. but based on the discussion an override to 'warn' only is
> > > preferred, so the admin can make the decision (it remains an open
> > > question whether to do a global override or a per-device override)
> > 
> > Well I think part of our confusion here is what "fail" vs "warn" means.
> > 
> > Fail / warn might be "Do we refuse to assign the device, or do we go
> > ahead and assign the device, knowing that it may act buggy?"
> > 
> > Or it might be, "Do we fail domain creation if at some step we
> > discover an RMRR conflict?  Or do we let the domain create succeed but
> > warn that the device has not been attached?"
> > 
> > I think in any case, failing to *assign* the device is the right thing
> > to do (except perhaps with a per-device override option).
> 
> yes
> 
> > 
> > libxl already has a policy of what happens when pci assignment fails
> > during domain creation.  If I'm reading the code right, libxl will
> > destroy the domain if libxl__device_pci_add() fails during domain
> > creation; I think that's the right thing to do.  If you want to change
> > that policy, that's a different discussion.
> 
> not my intention. as I said, the policy for Xen is clear: just fail the
> assignment hypercall (or warn w/ override), and keep whatever policy is
> defined by libxl.
> 
> > 
> > But if the device assignment fails due to an unspecified RMRR, that's
> > a bug in the toolstack -- it should have looked at the device list,
> > found out what RMRRs were necessary, and reserved those ranges before
> > we got to that point.
> > 
> > The only time I would expect device assignment might fail during
> > domain creation is if one of the devices had an RMRR shared with a
> > device already assigned to another VM.
> > 
> > >> Obviously at some point some part of the toolstack needs to identify
> > >> which RMRRs go with what device, so that either libxc or hvmloader can
> > >> make the appropriate holes in the address space; but at that point,
> > >> "report" is not so much the right word as "query".  (Obviously we want
> > >> to "report" in the e820 map all RMRRs that we've made holes for in the
> > >> guest.)
> > >
> > > yes, using 'report' doesn't catch all the changes we need to make. I
> > > just used the terms to simplify discussion, assuming all were on the
> > > same page. However, clearly my original explanation didn't get that
> > > across. :/
> > >
> > > so let me state my major intention again. I don't think the preparation
> > > (i.e. detecting conflicts and making holes) for device assignment
> > > should be a blocking failure. Throwing a warning should be enough (i.e.
> > > in libxc). We should let the actual device assignment path make the
> > > final call based on the admin's configuration (default 'fail' w/ a
> > > 'warn' override). Based on that policy I think 'report-all' (making
> > > holes for all host RMRRs) is an acceptable approach, w/ the small
> > > impact of possibly more warning messages (actually not bad, to help the
> > > admin understand the hotplug possibilities on this platform) and more
> > > reserved regions shown to the end user (but he shouldn't make any
> > > assumptions here). :-)
> > 
> > I don't really understand what you're talking about here.
> > 
> > When the libxc domain builder runs, there is *no* guest memory mapped.
> > So if it has the RMRRs, then it can *avoid* conflict; and if it
> > doesn't have the RMRRs, it can't even *detect* conflict.  So there is
> > no reason for libxc to either give a warning, or cause a failure.
> 
> not all of the conflicts can or will be avoided. e.g. USB may report a
> region conflicting with the guest BIOS, which is a hard conflict. another
> example (from one design option) is that we may want to keep the
> current cross-component structure (one lowmem + one highmem), so a
> conflict in the middle (e.g. at 2G) is a problem (avoiding it would
> break lowmem up or make lowmem too small).
> 
> as long as we agree some conflicts may not be avoidable, it comes down
> to the open question of whether to give a warning or cause a failure.
> I view making holes in the domain builder as a preparation for later
> device assignment, so giving a warning should be sufficient here, since
> Xen will fail the assignment hypercall later when the conflict actually
> matters, and then libxl will react according to the defined policy as
> you described above.
> 
> > 
> > So I'm not sure why you think making holes for all RMRRs would have
> > more warning messages.
> 
> given that not all RMRRs can or will be avoided, making holes for all
> RMRRs on the host can definitely lead to more potential conflicts than
> just making holes for the RMRRs belonging to assigned devices. Once we
> agree that warning is OK in the domain builder, that means more warning
> messages.
> 
> > 
> > And when you say "show more reserved regions to the end user", I take
> > it you mean the guest kernel (via the e820 map)?
> 
> yes; all reserved regions have to be marked in the e820 so the guest
> OS itself won't allocate from those holes, e.g. when doing PCI
> re-configuration.
> 
> > 
> > I'm also not clear what assumptions "he" may be making: you mean, the
> > existence of an RMRR in the e820 map shouldn't be taken to mean that
> > he has a specific device assigned?  No, indeed, he should not make
> > such an assumption. :-)
> 
> I meant 'he' shouldn't make assumptions about how many reserved regions
> should exist in the e820 based on the exposed devices. Jan has a concern
> that exposing more reserved regions in the e820 than necessary is not a
> good thing. I'm trying to convince him it should be fine. :-)
> 
> > 
> > Again -- I think that the only place "rmrr-host" and "rmrr-sel" is
> > important is at the very top level -- in xl, and possibly at a high
> > level in libxl.  By the time things reach libxc and hvmloader, they
> > should simply be told, "These are the RMRRs for this domain", and they
> > should avoid conflicts and report those in the e820 map.
> > 
> 
> Having libxl centrally manage RMRRs at a high level is a good idea,
> which however needs help from you and Ian on what the detailed tasks
> in libxl are to achieve that goal. We're not tools experts (especially
> Tiejun), so more suggestions in this area are definitely welcome. :-)
> 
> Then I hope you now understand our discussion of libxl/Xen/hvmloader,
> based on the fact that some conflicts may not be avoidable. That was
> the major open question in the original discussion with Jan. I'd like
> to give an example of the flow here per Jan's suggestion, starting
> from the domain builder after the reserved regions have been specified
> by the high-level libxl.
> 
> Let's take a synthetic platform w/ two devices, each reported
> with one RMRR reserved region:
>       (D1): [0xe0000, 0xeffff] in <1MB area
>       (D2): [0xa0000000, 0xa37fffff] in ~2.75G area
> 
> The guest is configured with 4G memory and is assigned D2.
> Due to libxl policy (say, for migration and hotplug), in total 3
> ranges are reported:
>       (hotplug): [0xe0000, 0xeffff] in <1MB area in this node
>       (migration): [0x40000000, 0x40003fff] in ~1G area in another node
>       (static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node
> 
> let's use xenstore to save such information (assuming it's accessible
> to both the domain builder and hvmloader?)
> 
> STEP-1. domain builder
> 
> say the default layout w/o reserved regions would be:
>       lowmem:         [0, 0xbfffffff]
>       mmio hole:      [0xc0000000, 0xffffffff]
>       highmem:        [0x100000000, 0x140000000]
> 
> The domain builder then queries the reserved regions from xenstore
> and tries to avoid conflicts.

Perhaps an easier way to do this is to use the existing
mechanism we have - that is, XENMEM_memory_map (which,
BTW, e820_host uses). If all of this is done in libxl (which
already does this based on the host E820, though it can
be modified to query the hypervisor for other 'reserved
regions'), and hvmloader is modified to use XENMEM_memory_map
and base its E820 on that (and also QEMU-xen), then we solve
this problem and also http://bugs.xenproject.org/xen/bug/28?

(lots of handwaving).
> 
> For [0xa0000000, 0xa37fffff], the conflict can be avoided by reducing
> lowmem to 0xa0000000 and increasing highmem:
>       lowmem:         [0, 0x9fffffff]
>       mmio hole:      [0xa0000000, 0xffffffff]
>       highmem:        [0x100000000, 0x160000000]
> 
> 
> For [0x40000000, 0x40003fff], leave it as a conflict, since reducing
> lowmem to 1G is not nice to a guest which doesn't use highmem, and
> breaking lowmem into two chunks would require more structural changes.
> 
> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
> 
> In the libxl centrally-managed mode, the domain builder doesn't know
> whether a conflict will lead to an immediate error or not, so the
> best policy here is to throw a warning and then move forward.
> Conflicts will be caught in later steps, when a region actually
> matters.
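To make the layout adjustment described in STEP-1 concrete, here is a rough, self-contained sketch of that logic. This is illustration only, not Xen code; the function names and the 2GB policy threshold are invented for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration of STEP-1: given a default layout and a
 * reserved region, either avoid the region (by shrinking lowmem and
 * growing highmem by the same amount) or leave it as a warned conflict. */

struct layout {
    uint64_t lowmem_end;    /* exclusive end of lowmem, e.g. 0xc0000000  */
    uint64_t highmem_end;   /* exclusive end of highmem (starts at 4GB)  */
};

#define GB(x) ((uint64_t)(x) << 30)

/* Returns 1 if the region was avoided by moving RAM, 0 if it remains
 * a conflict that the builder should merely warn about. */
static int avoid_region(struct layout *l, uint64_t start, uint64_t end)
{
    /* Region entirely above lowmem: no RAM conflict to begin with. */
    if (start >= l->lowmem_end)
        return 1;

    /* Policy from the example: only shrink lowmem for "reasonable"
     * regions near its top (here: base at or above 2GB); shifting the
     * whole lowmem/highmem split for a low region would cost too much. */
    if (start >= GB(2)) {
        uint64_t moved = l->lowmem_end - start;
        l->lowmem_end = start;      /* grow the MMIO hole downwards   */
        l->highmem_end += moved;    /* re-add the displaced RAM high  */
        return 1;
    }

    fprintf(stderr, "warning: reserved region [%#llx, %#llx) conflicts "
            "with guest RAM; deferring to assignment time\n",
            (unsigned long long)start, (unsigned long long)end);
    return 0;
}
```

With the example layout (lowmem up to 0xc0000000, 1GB of highmem), the [0xa0000000, 0xa37fffff] region shrinks lowmem to 0xa0000000 and grows highmem to end at 0x160000000, while the two low regions are left as warnings, matching the figures quoted above.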
> 
> STEP-2. static device assignment
> 
> after the domain builder, libxl will request that the Xen hypervisor
> complete the actual device assignment. Because D2 is statically assigned
> to this guest, Xen will set up an identity mapping for [0xa0000000,
> 0xa37fffff] with conflict detection in gfn space. Since the domain
> builder has made a hole for this region, there'll be no conflict and the
> device will be assigned to the guest successfully.
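The gfn-space conflict detection mentioned here can be pictured with a toy model: before establishing the identity mapping, verify that no frame in the reserved range is already populated. The flat-array p2m and all names below are invented for the sketch; real Xen walks its p2m structures:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the STEP-2 check: before identity-mapping an RMRR for a
 * statically assigned device, verify that none of the gfns in
 * [start, end) are already populated with guest RAM. */

#define TOY_P2M_FRAMES 1024
#define PAGE_SHIFT 12

struct toy_p2m {
    uint8_t populated[TOY_P2M_FRAMES];  /* 0 = hole, 1 = RAM, 2 = RMRR */
};

/* Returns 0 on success, -1 (refuse the assignment) on conflict. */
static int identity_map_rmrr(struct toy_p2m *p2m,
                             uint64_t start, uint64_t end)
{
    uint64_t gfn;

    /* First pass: detect any conflict before changing anything. */
    for (gfn = start >> PAGE_SHIFT; gfn < (end >> PAGE_SHIFT); gfn++)
        if (p2m->populated[gfn])
            return -1;

    /* Second pass: establish the 1:1 mapping. */
    for (gfn = start >> PAGE_SHIFT; gfn < (end >> PAGE_SHIFT); gfn++)
        p2m->populated[gfn] = 2;

    return 0;
}
```

In the example flow, the hole made by the domain builder guarantees the first pass finds nothing populated, so the mapping (and hence the assignment) succeeds.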
> 
> STEP-3. hvmloader boot
> 
> hvmloader also needs to query the reserved regions (still through
> xenstore?) for two reasons:
>       - to mark all reported reserved regions in the guest e820
>       - to make holes to avoid conflicts in dynamic allocation (e.g. PCI
> BARs, the ACPI opregion, etc.)
> 
> hvmloader can avoid making holes for guest RAM again (even if there
> are potential conflicts w/ guest RAM, they would be acceptable; otherwise
> libxl would have failed the boot before reaching here). So hvmloader will
> just add a new reserved e820 entry and make a hole for [0xa0000000,
> 0xa37fffff] in this example, which doesn't conflict with guest RAM.
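The e820 marking in STEP-3 is essentially inserting an E820_RESERVED entry per reported region. A minimal sketch of that bookkeeping follows; the struct and constants mirror the conventional e820 layout, but this is not hvmloader's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define E820_RAM      1
#define E820_RESERVED 2
#define E820_MAX      32

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Insert a reserved entry, keeping the table sorted by base address.
 * Returns the new entry count, or -1 if the table is full. */
static int e820_add_reserved(struct e820entry *map, int nr,
                             uint64_t start, uint64_t end)
{
    int i;

    if (nr >= E820_MAX)
        return -1;

    /* Find the insertion point and shift later entries up by one. */
    for (i = 0; i < nr && map[i].addr < start; i++)
        ;
    memmove(&map[i + 1], &map[i], (nr - i) * sizeof(map[0]));

    map[i].addr = start;
    map[i].size = end - start;
    map[i].type = E820_RESERVED;
    return nr + 1;
}
```

With the example's post-adjustment layout, the [0xa0000000, 0xa37fffff] entry lands between the lowmem RAM entry and the highmem RAM entry, so the guest OS sees it as reserved and keeps PCI re-configuration away from it.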
> 
> STEP-4. D1 hotplug
> 
> After the guest has booted, the user decides to hotplug D1, so libxl
> will raise another device assignment request to the Xen hypervisor. At
> this point, a conflict is detected on [0xe0000, 0xeffff] and a
> failure is returned by default. An override is provided in Xen to allow
> warning and returning success instead, with the user understanding that
> doing so implies an insecure environment.
> 
> Outside of this example, hotplug will succeed if there is no conflict
> on the RMRRs of the hotplugged device.
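The fail-by-default / warn-with-override behaviour described for hotplug (and for migration below) reduces to a small decision helper, sketched here with invented names; `relaxed` stands for the proposed global or per-device override:

```c
#include <assert.h>
#include <stdio.h>

/* Sketch of the assignment-time policy: on an RMRR conflict, fail by
 * default; with the override set, warn and let the assignment proceed,
 * the admin having accepted the security implications. */
static int assign_device(int has_conflict, int relaxed)
{
    if (!has_conflict)
        return 0;                   /* assignment succeeds            */

    if (relaxed) {
        fprintf(stderr, "warning: RMRR conflict; proceeding because "
                "the admin set the relaxed override\n");
        return 0;                   /* warn, but still succeed        */
    }

    return -1;                      /* default: refuse the assignment */
}
```

In the D1 hotplug case above, the [0xe0000, 0xeffff] conflict makes `has_conflict` true, so the hypercall fails unless the override is set.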
> 
> STEP-5. migration
> 
> after the guest has booted, the user decides to migrate the guest
> to another node, where it will be assigned a new device
> w/ a [0x40000000, 0x40003fff] reserved region. After the guest is
> migrated, Xen on the new node will detect the conflict when the new
> device is hotplugged, and a failure (or, with the warn override,
> success) will be returned accordingly.
> 
> Again, outside of this example, hotplug will succeed if there is no
> conflict on the RMRRs of the hotplugged device.
> 
> ---
> 
> After following this long thread to its end, I actually like this
> high-level libxl management idea, which avoids complexity in the domain
> builder/hvmloader in understanding device/RMRR associations, and then
> applies different policies according to whether a device is statically
> assigned, hotplugged, or reserved for another purpose. :-)
> 
> Thanks
> Kevin
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
