
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline



On 08/02/16 19:03, Roger Pau Monné wrote:
> The format of the boot start info structure is the following (pointed to
> by %ebx):
>
> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
> address fields should be treated as not present.
>
>  0 +----------------+
>    | magic          | Contains the magic value 0x336ec578
>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>  4 +----------------+
>    | flags          | SIF_xxx flags.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>    | modlist_paddr  | Physical address of an array of modules
>    |                | (layout of the structure below).
> 20 +----------------+
>
> The layout of each entry in the module structure is the following:
>
>  0 +----------------+
>    | paddr          | Physical address of the module.
>  4 +----------------+
>    | size           | Size of the module in bytes.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | reserved       |
> 16 +----------------+
>
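
For illustration, a minimal C sketch of the two layouts above (struct and
field names here are only illustrative, not necessarily what the final
public header will use):

    #include <stdint.h>

    #define HVMLITE_START_MAGIC 0x336ec578U /* "xEn3", 0x80 bit of 'E' set */

    /* Boot start info, pointed to by %ebx at kernel entry.
     * A physical address of 0 in any field means "not present". */
    struct hvmlite_start_info {
        uint32_t magic;         /* must be HVMLITE_START_MAGIC             */
        uint32_t flags;         /* SIF_xxx flags                           */
        uint32_t cmdline_paddr; /* paddr of zero-terminated command line   */
        uint32_t nr_modules;    /* number of entries in the module list    */
        uint32_t modlist_paddr; /* paddr of array of hvmlite_modlist_entry */
    };

    /* One entry in the module list. */
    struct hvmlite_modlist_entry {
        uint32_t paddr;         /* paddr of the module                     */
        uint32_t size;          /* size of the module in bytes             */
        uint32_t cmdline_paddr; /* paddr of zero-terminated command line   */
        uint32_t reserved;
    };
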
> Other relevant information needed in order to boot a guest kernel
> (console page address, xenstore event channel...) can be obtained
> using HVMPARAMS, just as is done for HVM guests.
>
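
As a rough illustration of the HVM params path (a sketch only: it assumes
the HVMOP_get_param interface from the public headers, whose include paths
vary by guest OS, and a guest-provided HYPERVISOR_hvm_op hypercall wrapper):

    #include <stdint.h>
    #include <xen/xen.h>            /* DOMID_SELF                            */
    #include <xen/hvm/hvm_op.h>     /* HVMOP_get_param, struct xen_hvm_param */
    #include <xen/hvm/params.h>     /* HVM_PARAM_STORE_EVTCHN, ...           */

    /* Fetch the xenstore event channel; returns 0 on success. */
    static int get_xenstore_evtchn(uint32_t *evtchn)
    {
        struct xen_hvm_param p = {
            .domid = DOMID_SELF,
            .index = HVM_PARAM_STORE_EVTCHN,
        };
        int rc = HYPERVISOR_hvm_op(HVMOP_get_param, &p);

        if ( rc == 0 )
            *evtchn = p.value;
        return rc;
    }
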
> The setup of the hypercall page is also performed in the same way as for
> HVM guests, using the hypervisor CPUID leaves and MSR ranges.
>
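
For reference, the HVM-style hypercall page setup looks roughly like the
sketch below (simplified: it assumes the Xen CPUID leaves start at
0x40000000 and skips the 0x100-offset probing and signature checks a real
guest would perform):

    #include <stdint.h>
    #include <cpuid.h>      /* __cpuid() from GCC/clang */

    static uint8_t hypercall_page[4096] __attribute__((aligned(4096)));

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        asm volatile ("wrmsr" :: "c" (msr), "a" ((uint32_t)val),
                                 "d" ((uint32_t)(val >> 32)));
    }

    static void init_hypercall_page(void)
    {
        uint32_t eax, ebx, ecx, edx;

        /* Leaf 0x40000000: "XenVMMXenVMM" signature in ebx/ecx/edx. */
        __cpuid(0x40000000, eax, ebx, ecx, edx);

        /* Leaf 0x40000002: ebx = MSR used to register the hypercall page. */
        __cpuid(0x40000002, eax, ebx, ecx, edx);

        /* Write the guest-physical address of the page; Xen fills it with
         * hypercall stubs.  Identity mapping is assumed for simplicity. */
        wrmsr(ebx, (uint64_t)(uintptr_t)hypercall_page);
    }
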
> Hardware description
> --------------------
>
> Hardware description can come from two different sources, just like on (PV)HVM
> guests.
>
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
>
> Description of physical hardware devices will always come from ACPI; in the
> absence of any physical hardware devices, no ACPI tables will be provided. The
> presence of ACPI tables can be detected by finding the RSDP, just like on
> bare metal.

As we are extending the base structure, why not have an RSDP paddr in it
as well?  This avoids the need to scan RAM, and also serves as an
indication of "No ACPI".
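
For comparison, the bare-metal style scan referred to above (which an RSDP
paddr field in the start info would let a guest skip entirely) is roughly:

    #include <stdint.h>
    #include <string.h>

    /* Scan the BIOS area for "RSD PTR " on a 16-byte boundary and verify
     * the 20-byte ACPI 1.0 checksum.  Assumes the region is identity
     * mapped; a real guest would also scan the EBDA first. */
    static uint64_t find_rsdp(void)
    {
        for ( uintptr_t p = 0xE0000; p < 0x100000; p += 16 )
        {
            const uint8_t *c = (const uint8_t *)p;
            uint8_t sum = 0;

            if ( memcmp(c, "RSD PTR ", 8) )
                continue;
            for ( unsigned int i = 0; i < 20; i++ )
                sum += c[i];
            if ( sum == 0 )
                return p;
        }
        return 0;   /* no RSDP => no ACPI */
    }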

>
> Non-PV devices exposed to the guest
> -----------------------------------
>
> The initial idea was to simply not provide any emulated devices to a HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
>
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen: everything is disabled
>    by default and the guest only has access to PV devices. This is of
>    course the most secure design because it has the smallest attack surface.
>
>  * 2. HVMlite with (or capable of) PCI-passthrough
>    -----------------------------------------------
>    The current model of PCI-passthrough in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC should be
>    provided to guests, together with access to a PCI root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC (optional) and PCI root complex.
>
>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.
>
> The default model for HVMlite guests is going to be to provide a local APIC
> together with a minimal set of ACPI tables that accurately match the reality
> of the container the guest is running on.

This statement is contrary to option 1 above, which states that all
emulation is disabled.

FWIW, I think there needs to be a 4th option, in between current 1 and 2,
which is HVMLite + LAPIC.  This is then the default HVMLite ABI, and is
not passthrough-capable.

>  An administrator should be able to change
> the default setting using the following tunables that are part of the xl
> toolstack:
>
>  * lapic: default to true. Indicates whether a local APIC is provided.
>  * ioapic: default to false. Indicates whether an IO APIC is provided
>    (requires lapic set to true).
>  * acpi: default to true. Indicates whether ACPI tables are provided.
>
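
By way of example, the tunables above would end up in a domain
configuration as something like the fragment below (option names as
proposed in this draft; the exact xl syntax is still to be settled):

    # HVMlite guest, defaults spelled out explicitly
    # (how the guest type itself is selected is outside this fragment)
    lapic  = 1      # provide a local APIC (default: true)
    ioapic = 0      # no IO APIC (default: false; requires lapic = 1)
    acpi   = 1      # provide ACPI tables (default: true)
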
> <snip>
>
> MMIO mapping
> ------------
>
> For DomUs without any devices passed through, no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed through, the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
>
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
>
> NOTE: ranges are defined using memory addresses, not pages.

I would preface this with "where applicable".  Non-legacy boots are
unlikely to have anything interesting in the first 1MB.

>
>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>    memory map at the same position.
>
>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>    guest physical memory.
>
>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>    1:1 to the guest physical memory map. There are going to be exceptions if
>    Xen has to modify the tables before presenting them to the guest.
>
>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>    time they will also be made available to the guest at the same position
>    in its physical memory map. It is possible that Xen will trap accesses to
>    those regions, but a guest should be able to use the native configuration
>    mechanism in order to interact with this configuration space. If the
>    hardware domain reports the presence of any of those regions using the
>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also allow guest access to
>    them.
>
>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>    the PCI devices without hardware domain interaction.

Xen requires no dom0 interaction to find all information like this for
devices in segment 0 (i.e. all current hardware).  Segments other than 0
may have their MMCONF regions expressed in AML only.

The reason this is all awkward in Xen is that PCI devices were hacked in
as second-class citizens when IOMMU support was added.  This is purely
a Xen software issue which needs undoing.

>  In order to have
>    the BARs of PCI devices properly mapped the hardware domain needs to
>    call the PHYSDEVOP_pci_device_add hypercall, which will take care of setting
>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>    procedure will be transparent from the guest's point of view, and upon
>    returning from the hypercall the mappings must already be established.
>
>
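
Regarding the MMCFG entry in the list above: the "native configuration
mechanism" is just the standard ECAM layout, roughly as sketched below
(assuming the MMCFG window for the segment is mapped at mcfg_base):

    #include <stdint.h>

    /* ECAM/MMCFG: one 4KiB config page per (bus, device, function). */
    static inline uint32_t pci_cfg_read32(volatile uint8_t *mcfg_base,
                                          uint8_t bus, uint8_t dev,
                                          uint8_t fn, uint16_t offset)
    {
        volatile uint32_t *reg = (volatile uint32_t *)
            (mcfg_base + ((uint32_t)bus << 20) + ((uint32_t)dev << 15) +
                         ((uint32_t)fn  << 12) + (offset & ~3u));

        return *reg;
    }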

~Andrew
