Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline
>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
> Boot ABI
> --------
>
> Since the Xen entry point into the kernel can be different from the
> native entry point, a `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
>
> ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY, .long, xen_start32)
>
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
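As an aside, for anyone wiring this up: the note can also be emitted from C
rather than assembly. Just a sketch (32-bit build assumed, note type value as
in xen/elfnote.h, xen_start32 being the entry symbol from the example above):

    #include <stdint.h>

    #define XEN_ELFNOTE_PHYS32_ENTRY 18     /* from xen/elfnote.h */

    extern void xen_start32(void);          /* protected mode entry point */

    /* Standard ELF note layout: namesz/descsz/type header, "Xen" name,
     * 32-bit descriptor holding the physical entry point address. */
    static const struct {
        uint32_t namesz, descsz, type;
        char     name[4];
        uint32_t desc;
    } __attribute__((used, section(".note.Xen"), aligned(4)))
    phys32_entry_note = {
        .namesz = 4,
        .descsz = 4,
        .type   = XEN_ELFNOTE_PHYS32_ENTRY,
        .name   = "Xen",
        .desc   = (uint32_t)xen_start32,    /* assumes a 32-bit build */
    };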
>
> The domain builder shall load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:
>
> * `ebx`: contains the physical memory address where the loader has placed
> the boot start info structure.
>
> * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>
> * `cr4`: all bits are cleared.
>
> * `cs`: must be a 32-bit read/execute code segment with a base of '0'
> and a limit of '0xFFFFFFFF'. The selector value is unspecified.
>
> * `ds`, `es`: must be a 32-bit read/write data segment with a base of
> '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.
>
> * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of
> '0x67'.
>
> * `eflags`: all user settable bits are clear.
The word "user" here can be mistaken. Perhaps better "all modifiable
bits"?
> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up its own stack, GDT and IDT.
The "flag bits" part should now probably be dropped?
> The format of the boot start info structure is the following (pointed to
> be %ebx):
"... by %ebx"
> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of
> the address fields should be treated as not present.
>
> 0 +----------------+
> | magic | Contains the magic value 0x336ec578
> | | ("xEn3" with the 0x80 bit of the "E" set).
> 4 +----------------+
> | flags | SIF_xxx flags.
> 8 +----------------+
> | cmdline_paddr | Physical address of the command line,
> | | a zero-terminated ASCII string.
> 12 +----------------+
> | nr_modules | Number of modules passed to the kernel.
> 16 +----------------+
> | modlist_paddr | Physical address of an array of modules
> | | (layout of the structure below).
> 20 +----------------+
There having been talk about extending the structure, I think we
need some indicator that the consumer can use to know which
fields are present. I.e. either a version field, another flags one,
or a size one.
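For reference, expressed as a C structure the above comes out as follows
(field and structure names here are purely illustrative; the version/size
indicator asked for above is only marked in a comment):

    #include <stdint.h>

    #define HVMLITE_BOOT_MAGIC 0x336ec578U  /* "xEn3" with the 0x80 bit of the "E" set */

    /* Offsets match the diagram above; names are illustrative only.
     * Any version/size/extra-flags indicator would have to be added
     * before further fields get appended. */
    struct hvmlite_start_info {
        uint32_t magic;            /*  0: HVMLITE_BOOT_MAGIC */
        uint32_t flags;            /*  4: SIF_xxx flags */
        uint32_t cmdline_paddr;    /*  8: command line (zero-terminated ASCII) */
        uint32_t nr_modules;       /* 12: number of module entries */
        uint32_t modlist_paddr;    /* 16: physical address of the module array */
    };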
> The layout of each entry in the module structure is the following:
>
> 0 +----------------+
> | paddr | Physical address of the module.
> 4 +----------------+
> | size | Size of the module in bytes.
> 8 +----------------+
> | cmdline_paddr | Physical address of the command line,
> | | a zero-terminated ASCII string.
> 12 +----------------+
> | reserved |
> 16 +----------------+
I've been thinking about this on draft A already: Do we really want
to paint ourselves into the corner of not supporting >4Gb modules,
by limiting their addresses and sizes to 32 bits?
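Again for reference, as a C structure (illustrative names only) - the
uint32_t paddr/size fields are exactly where the 4Gb limitation comes from:

    #include <stdint.h>

    struct hvmlite_mod_entry {
        uint32_t paddr;            /*  0: physical address of the module */
        uint32_t size;             /*  4: size of the module in bytes */
        uint32_t cmdline_paddr;    /*  8: module command line (zero-terminated ASCII) */
        uint32_t reserved;         /* 12 */
    };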
> Hardware description
> --------------------
>
> Hardware description can come from two different sources, just like on
> (PV)HVM guests.
>
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
>
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided.
This seems too strict: How about "in the absence of any physical
hardware device ACPI tables may not be provided"?
> Non-PV devices exposed to the guest
> -----------------------------------
>
> The initial idea was to simply not provide any emulated devices to a
> HVMlite guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
>
> * 1. HVMlite with no emulated devices at all
> ------------------------------------------
> This is the current implementation inside of Xen, everything is disabled
> by default and the guest has access to the PV devices only. This is of
> course the most secure design because it has the smaller surface of attack.
smallest?
> * 2. HVMlite with (or capable to) PCI-passthrough
> -----------------------------------------------
> The current model of PCI-passthrough in PV guests is complex and requires
> heavy modifications to the guest OS. Going forward we would like to remove
> this limitation, by providing an interface that's the same as found on bare
> metal. In order to do this, at least an emulated local APIC should be
> provided to guests, together with the access to a PCI-Root complex.
> As said in the 'Hardware description' section above, this will also require
> ACPI. So this proposed scenario will require the following elements that
> are not present in the minimal (or default) HVMlite implementation: ACPI,
> local APIC, IO APIC (optional) and PCI-Root complex.
Are you reasonably convinced that the absence of an IO-APIC
won't, with LAPICs present, cause more confusion than aid to the
OSes wanting to adopt PVHv2?
> * 3. HVMlite hardware domain
> --------------------------
> The aim is that a HVMlite hardware domain is going to work exactly like a
> HVMlite domain with passed-through devices. This means that the domain will
> need access to the same set of emulated devices, and that some ACPI tables
> must be fixed in order to reflect the reality of the container the hardware
> domain is running on. The ACPI section contains more detailed information
> about which/how these tables are going to be fixed.
>
> Note that in this scenario the hardware domain will *always* have a local
> APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
> channels is going to be removed in favour of the bare metal mechanisms.
Do you really mean "*always*"? What about a system without IO-APIC?
Would you mean to emulate one there for no reason?
Also I think you should say "the usage of many PHYSDEV operations",
because - as we've already pointed out - some are unavoidable.
> ACPI
> ----
>
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains. In the case of unprivileged guests ACPI tables are going to be
> created by the toolstack and will only contain the set of devices available
> to the guest, which will at least be the following: local APIC and
> optionally an IO APIC and passed-through device(s). In order to provide this
> information from ACPI the following tables are needed as a minimum: RSDT,
> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
> the MADT table is not going to be provided to the guest OS.
>
> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
> to signal guests that there's no RTC device (the Xen PV wall clock should be
> used instead). It is likely that this flag is not going to be set for the
> hardware domain, since it should have access to the RTC present in the host
> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
> same boot_flags FADT field for DomUs in order to signal that there's no VGA
> adapter present.
>
> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
> There's no intention to enable these devices, so it is expected that the
> hardware-reduced FADT flag is always going to be set.
We'll need to be absolutely certain that use of this flag doesn't carry
any further implications.
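To illustrate what the guest side of these flags amounts to - just a sketch,
with the flag values quoted from ACPICA's actbl.h and made-up structure and
function names:

    #include <stdbool.h>
    #include <stdint.h>

    #define ACPI_FADT_NO_VGA       (1u << 2)   /* FADT boot_flags */
    #define ACPI_FADT_NO_CMOS_RTC  (1u << 5)   /* FADT boot_flags */
    #define ACPI_FADT_HW_REDUCED   (1u << 20)  /* FADT flags */

    struct hvmlite_platform {
        bool have_rtc;             /* false -> use the Xen PV wall clock */
        bool have_vga;             /* false -> no VGA console to set up */
        bool have_legacy_pic_pit;  /* false -> no i8259/i8254 to initialise */
    };

    static void parse_fadt_flags(struct hvmlite_platform *p,
                                 uint16_t boot_flags, uint32_t flags)
    {
        p->have_rtc            = !(boot_flags & ACPI_FADT_NO_CMOS_RTC);
        p->have_vga            = !(boot_flags & ACPI_FADT_NO_VGA);
        p->have_legacy_pic_pit = !(flags & ACPI_FADT_HW_REDUCED);
    }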
> In the case of the hardware domain, Xen has traditionally passed-through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
>
> * MADT: the number of local APIC entries needs to be fixed to match the number
> of vCPUs available to the guest. The address of the IO APIC(s) also
> needs to be fixed in order to match the emulated ones that we are going
> to provide.
>
> * DSDT: certain devices reported in the DSDT may not be available to the guest,
> but since the DSDT is a run-time generated table we cannot fix it. In
> order to cope with this, a STAO table will be provided that should
> be able to signal which devices are not available to the hardware
> domain. This is in line with the Xen/ACPI implementation for ARM.
Will STAO be sufficient for everything that may need customization?
I'm particularly worried about processor related methods in DSDT or
SSDT, which - if we're really meaning to do as you say - would need
to be limited (or extended) to the number of vCPU-s Dom0 gets.
What's even less clear to me is how you mean to deal with P-, C-,
and (once supported) T-state management for CPUs which don't
have a vCPU equivalent in Dom0.
> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm
> aware of the following:
>
> * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
> since this table is only available to the hardware domain it has to report
> the PM info back to Xen so that Xen can perform proper PM.
> * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
> mixed with native ACPICA code in most OSes. This is awkward and requires
> the usage of hooks into ACPICA which we have not yet managed to upstream.
Iirc shutdown doesn't require any custom patches anymore in Linux.
> * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
> intrusive in general, so I'm not that pushed to remove it. It's generally
> easy in any OS to add some kind of hook that's executed every time a PCI
> device is discovered.
> * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
> this one is the same as (3), it's not really intrusive so I'm not very
> pushed to remove it.
As said in another reply - for both of these, we just can't remove the
reporting to Xen.
> MMIO mapping
> ------------
>
> For DomUs without any device passed-through no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed-through the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
>
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
>
> NOTE: ranges are defined using memory addresses, not pages.
>
> * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
> memory map at the same position.
>
> * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
> guest physical memory.
When have you last seen a machine with a hole right below the
16Mb boundary?
> * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
> 1:1 to the guest physical memory map. There are going to be exceptions if
> Xen has to modify the tables before presenting them to the guest.
>
> * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
> time they will also be made available to the guest at the same position
> in its physical memory map. It is possible that Xen will trap accesses to
> those regions, but a guest should be able to use the native configuration
> mechanism in order to interact with this configuration space. If the
> hardware domain reports the presence of any of those regions using the
> PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
> them.
s/all guest/allow Dom0/ in this last sentence?
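For reference, the reporting path being talked about here would look roughly
like this on the Dom0 side (a sketch: structure layout as I recall it from
xen/physdev.h, hypercall stub being OS specific, error handling omitted):

    #include <stdint.h>

    #define PHYSDEVOP_pci_mmcfg_reserved 24    /* from xen/physdev.h */
    #define XEN_PCI_MMCFG_RESERVED       1     /* window lives in a reserved E820 region */

    struct physdev_pci_mmcfg_reserved {
        uint64_t address;      /* base of the MMCFG window */
        uint16_t segment;      /* PCI segment/domain */
        uint8_t  start_bus;
        uint8_t  end_bus;
        uint32_t flags;
    };

    extern long HYPERVISOR_physdev_op(int cmd, void *arg);  /* OS specific stub */

    static void report_mmcfg(uint64_t base, uint16_t seg,
                             uint8_t start_bus, uint8_t end_bus)
    {
        struct physdev_pci_mmcfg_reserved r = {
            .address   = base,
            .segment   = seg,
            .start_bus = start_bus,
            .end_bus   = end_bus,
            .flags     = XEN_PCI_MMCFG_RESERVED,
        };

        HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
    }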
> * PCI BARs: it's not possible for Xen to know the position of the BARs of
> the PCI devices without hardware domain interaction. In order to have
> the BARs of PCI devices properly mapped the hardware domain needs to
> call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
> up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
> procedure will be transparent from guest point of view, and upon returning
> from the hypercall mappings must be already established.
I'm not sure this can work, as it imposes restrictions on the ordering
of operations internal of the Dom0 OS: Successfully having probed
for a PCI device (and hence reporting its presence to Xen) doesn't
imply its BARs have already got set up. Together with the possibility
of the OS re-assigning BARs I think we will actually need another
hypercall, or the same device-add hypercall may need to be issued
more than once per device (i.e. also every time any BAR assignment
got changed).
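For completeness, a sketch of the Dom0 side of the current interface
(structure as I recall it from xen/physdev.h, hypercall stub OS specific) -
with the open question being exactly when, and how often, this would have to
be issued:

    #include <stdint.h>

    #define PHYSDEVOP_pci_device_add 25        /* from xen/physdev.h */

    struct physdev_pci_device_add {
        uint16_t seg;                          /* PCI segment/domain */
        uint8_t  bus;
        uint8_t  devfn;
        uint32_t flags;                        /* XEN_PCI_DEV_* (unused here) */
        struct {
            uint8_t bus;
            uint8_t devfn;
        } physfn;                              /* only relevant for virtual functions */
        uint32_t optarr[];                     /* optional PXM value */
    };

    extern long HYPERVISOR_physdev_op(int cmd, void *arg);  /* OS specific stub */

    /* Called whenever Dom0 discovers a PCI device; per the above it may also
     * need re-issuing (or a new hypercall) whenever BAR assignments change. */
    static void report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn)
    {
        struct physdev_pci_device_add add = {
            .seg = seg, .bus = bus, .devfn = devfn,
        };

        /* On successful return the device's current BARs are expected to be
         * mapped 1:1 in the hardware domain's physical memory map. */
        HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
    }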
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel