
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline



On 9/2/16 at 14:24, Jan Beulich wrote:
>>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
>>  * `eflags`: all user settable bits are clear.
> 
> The word "user" here can be mistaken. Perhaps better "all modifiable
> bits"?
>
>> All other processor registers and flag bits are unspecified. The OS is in
>> charge of setting up its own stack, GDT and IDT.
> 
> The "flag bits" part should now probably be dropped?
> 
>> The format of the boot start info structure is the following (pointed to
>> be %ebx):
> 
> "... by %ebx"

Done for both of the above comments.

>> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of 
>> the address fields should be treated as not present.
>>
>>  0 +----------------+
>>    | magic          | Contains the magic value 0x336ec578
>>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>  4 +----------------+
>>    | flags          | SIF_xxx flags.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | nr_modules     | Number of modules passed to the kernel.
>> 16 +----------------+
>>    | modlist_paddr  | Physical address of an array of modules
>>    |                | (layout of the structure below).
>> 20 +----------------+
> 
> There having been talk about extending the structure, I think we
> need some indicator that the consumer can use to know which
> fields are present. I.e. either a version field, another flags one,
> or a size one.

Either a version or a flags field sounds good to me. A version field is
probably more desirable, in order to prevent confusion with the flags
field that's already present:

 0 +----------------+
   | magic          | Contains the magic value 0x336ec578
   |                | ("xEn3" with the 0x80 bit of the "E" set).
 4 +----------------+
   | version        | Version of this structure. Current version is 0.
   |                | New versions are guaranteed to be
   |                | backwards-compatible.
 8 +----------------+
   | flags          | SIF_xxx flags.
12 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
16 +----------------+
   | nr_modules     | Number of modules passed to the kernel.
20 +----------------+
   | modlist_paddr  | Physical address of an array of modules
   |                | (layout of the structure below).
24 +----------------+
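
Something like this as a C representation, just to make the layout
explicit (the struct and field names here are only illustrative, not
part of the ABI):

#include <stdint.h>

/* Proposed boot start info layout; all fields are 32-bit and match the
 * offsets in the diagram above. */
struct hvmlite_start_info {
    uint32_t magic;         /* 0x336ec578 ("xEn3" with bit 0x80 of 'E' set) */
    uint32_t version;       /* 0 for this layout, bumped on extensions      */
    uint32_t flags;         /* SIF_xxx flags                                */
    uint32_t cmdline_paddr; /* physical address of the command line         */
    uint32_t nr_modules;    /* number of entries in the module array        */
    uint32_t modlist_paddr; /* physical address of the module array         */
};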

> 
>> The layout of each entry in the module structure is the following:
>>
>>  0 +----------------+
>>    | paddr          | Physical address of the module.
>>  4 +----------------+
>>    | size           | Size of the module in bytes.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | reserved       |
>> 16 +----------------+
> 
> I've been thinking about this on draft A already: Do we really want
> to paint ourselves into the corner of not supporting >4Gb modules,
> by limiting their addresses and sizes to 32 bits?

Hm, that's a tricky question. TBH I doubt we are going to see modules
larger than 4GB at the moment, but maybe in the future this no longer holds.

I wouldn't mind making all the fields in the module structure 64-bit,
but I think we should then spell out that Xen will always try to place
the modules below the 4GiB boundary when possible.
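
As a rough sketch (again purely illustrative, assuming every field grows
to 64 bits):

#include <stdint.h>

/* Hypothetical module entry with 64-bit fields, so modules placed above
 * the 4GiB boundary (or larger than 4GiB) can be described.  Xen would
 * still try to place modules below 4GiB when possible. */
struct hvmlite_modlist_entry {
    uint64_t paddr;         /* physical address of the module       */
    uint64_t size;          /* size of the module in bytes          */
    uint64_t cmdline_paddr; /* physical address of the command line */
    uint64_t reserved;
};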

>> Hardware description
>> --------------------
>>
>> Hardware description can come from two different sources, just like on 
>> (PV)HVM
>> guests.
>>
>> Description of PV devices will always come from xenbus, and in fact
>> xenbus is the only hardware description that is guaranteed to always be
>> provided to HVMlite guests.
>>
>> Description of physical hardware devices will always come from ACPI, in the
>> absence of any physical hardware device no ACPI tables will be provided.
> 
> This seems too strict: How about "in the absence of any physical
> hardware device ACPI tables may not be provided"?

Right, this should allow us more freedom when deciding whether to
provide ACPI tables or not.

The only case where we might avoid ACPI tables is when no local APIC or
IO APIC is provided, and even in this scenario I would be tempted to
provide at least a FADT in order to announce that no CMOS RTC is
available (and possibly also signal reduced HW).

>> Non-PV devices exposed to the guest
>> -----------------------------------
>>
>> The initial idea was to simply not provide any emulated devices to a 
>> HVMlite
>> guest as the default option. We have however identified certain situations
>> where emulated devices could be interesting, both from a performance and
>> ease of implementation point of view. The following list tries to encompass
>> the different identified scenarios:
>>
>>  * 1. HVMlite with no emulated devices at all
>>    ------------------------------------------
>>    This is the current implementation inside of Xen, everything is disabled
>>    by default and the guest has access to the PV devices only. This is of
>>    course the most secure design because it has the smaller surface of 
>> attack.
> 
> smallest?

Right, fixed.

>>  * 2. HVMlite with (or capable to) PCI-passthrough
>>    -----------------------------------------------
>>    The current model of PCI-passthrough in PV guests is complex and requires
>>    heavy modifications to the guest OS. Going forward we would like to remove
>>    this limitation, by providing an interface that's the same as found on 
>> bare
>>    metal. In order to do this, at least an emulated local APIC should be
>>    provided to guests, together with the access to a PCI-Root complex.
>>    As said in the 'Hardware description' section above, this will also 
>> require
>>    ACPI. So this proposed scenario will require the following elements that 
>> are
>>    not present in the minimal (or default) HVMlite implementation: ACPI, 
>> local
>>    APIC, IO APIC (optional) and PCI-Root complex.
> 
> Are you reasonably convinced that the absence of an IO-APIC
> won't, with LAPICs present, cause more confusion than aid to the
> OSes wanting to adopt PVHv2?

As long as the data in the MADT matches the container provided to the
guest I think we should be fine. If there are no IO APICs, no entries of
type 1 (IO APIC) will be present in the MADT.
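
From the guest side this is trivial to handle; a minimal sketch of
walking the MADT subtables (raw offsets as per the ACPI spec, nothing
Xen-specific assumed):

#include <stdint.h>

/* Count IO APIC (type 1) entries in a MADT already mapped at 'madt'.
 * The MADT starts with the 36-byte SDT header plus 8 bytes of local
 * APIC address/flags, so subtables begin at offset 44; each subtable
 * starts with a type byte followed by a length byte. */
static unsigned int madt_count_ioapics(const uint8_t *madt)
{
    uint32_t total_len = *(const uint32_t *)(madt + 4); /* SDT length */
    uint32_t off = 44;
    unsigned int count = 0;

    while (off + 2 <= total_len) {
        uint8_t type = madt[off];
        uint8_t len  = madt[off + 1];

        if (len < 2)
            break;          /* malformed entry */
        if (type == 1)      /* IO APIC */
            count++;
        off += len;
    }
    return count;
}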

>>  * 3. HVMlite hardware domain
>>    --------------------------
>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>    HVMlite domain with passed-through devices. This means that the domain 
>> will
>>    need access to the same set of emulated devices, and that some ACPI tables
>>    must be fixed in order to reflect the reality of the container the 
>> hardware
>>    domain is running on. The ACPI section contains more detailed information
>>    about which/how these tables are going to be fixed.
>>
>>    Note that in this scenario the hardware domain will *always* have a local
>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>    channels is going to be removed in favour of the bare metal mechanisms.
> 
> Do you really mean "*always*"? What about a system without IO-APIC?
> Would you mean to emulate one there for no reason?

Oh, a real system without an IO APIC. No, then we wouldn't provide one
to the hardware domain, since it makes no sense.

> Also I think you should say "the usage of many PHYSDEV operations",
> because - as we've already pointed out - some are unavoidable.

Yes, that's right.

>> ACPI
>> ----
>>
>> ACPI tables will be provided to the hardware domain or to unprivileged
>> domains. In the case of unprivileged guests ACPI tables are going to be
>> created by the toolstack and will only contain the set of devices available
>> to the guest, which will at least be the following: local APIC and
>> optionally an IO APIC and passed-through device(s). In order to provide this
>> information from ACPI the following tables are needed as a minimum: RSDT,
>> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
>> the MADT table is not going to be provided to the guest OS.
>>
>> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be 
>> used
>> to signal guests that there's no RTC device (the Xen PV wall clock should be
>> used instead). It is likely that this flag is not going to be set for the
>> hardware domain, since it should have access to the RTC present in the host
>> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
>> same boot_flags FADT field for DomUs in order to signal that there's no VGA
>> adapter present.
>>
>> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
>> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
>> There's no intention to enable these devices, so it is expected that the
>> hardware-reduced FADT flag is always going to be set.
> 
> We'll need to be absolutely certain that use of this flag doesn't carry
> any further implications.

No, after taking a closer look at the ACPI spec I don't think we can use
this flag. It has some implications that wouldn't be true for us, for
example:

 - UEFI must be used for boot.
 - Entering sleep states works differently: SLEEP_CONTROL_REG and
   SLEEP_STATUS_REG are used instead of SLP_TYP, SLP_EN and WAK_STS.
   This of course is not something that we can decide for Dom0.

And there are more implications which I think would not hold in our case.

So are we just going to say that HVMlite systems will never have an i8259
PIC or i8254 PIT? Because I don't see a proper way to report this using
standard ACPI fields.
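
FWIW, the guest-side checks for the flags we do plan to use are simple
enough; a sketch reading the raw FADT (offsets and bit positions as per
the ACPI spec: IAPC_BOOT_ARCH is a 16-bit field at byte offset 109, the
fixed feature Flags field is 32 bits at offset 112):

#include <stdbool.h>
#include <stdint.h>

/* Bit 5 of IAPC_BOOT_ARCH: "CMOS RTC Not Present".  If set, the guest
 * should use the Xen PV wall clock instead of the RTC. */
static bool fadt_no_cmos_rtc(const uint8_t *fadt)
{
    uint16_t boot_flags = fadt[109] | (uint16_t)fadt[110] << 8;
    return boot_flags & (1u << 5);
}

/* Bit 2 of IAPC_BOOT_ARCH: "VGA Not Present". */
static bool fadt_no_vga(const uint8_t *fadt)
{
    uint16_t boot_flags = fadt[109] | (uint16_t)fadt[110] << 8;
    return boot_flags & (1u << 2);
}

/* Bit 20 of the fixed feature Flags: HW_REDUCED_ACPI.  As discussed
 * above, we probably cannot set this one. */
static bool fadt_hw_reduced(const uint8_t *fadt)
{
    uint32_t flags = fadt[112] | (uint32_t)fadt[113] << 8 |
                     (uint32_t)fadt[114] << 16 | (uint32_t)fadt[115] << 24;
    return flags & (1u << 20);
}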

>> In the case of the hardware domain, Xen has traditionally passed-through the
>> native ACPI tables to the guest. This is something that of course we still
>> want to do, but in the case of HVMlite Xen will have to make sure that
>> the data passed in the ACPI tables to the hardware domain contain the 
>> accurate
>> hardware description. This means that at least certain tables will have to
>> be modified/mangled before being presented to the guest:
>>
>>  * MADT: the number of local APIC entries need to be fixed to match the 
>> number
>>          of vCPUs available to the guest. The address of the IO APIC(s) also
>>          need to be fixed in order to match the emulated ones that we are 
>> going
>>          to provide.
>>
>>  * DSDT: certain devices reported in the DSDT may not be available to the 
>> guest,
>>          but since the DSDT is a run-time generated table we cannot fix it. 
>> In
>>          order to cope with this, a STAO table will be provided that should
>>          be able to signal which devices are not available to the hardware
>>          domain. This is in line with the Xen/ACPI implementation for ARM.
> 
> Will STAO be sufficient for everything that may need customization?
> I'm particularly worried about processor related methods in DSDT or
> SSDT, which - if we're really meaning to do as you say - would need
> to be limited (or extended) to the number of vCPU-s Dom0 gets.
> What's even less clear to me is how you mean to deal with P-, C-,
> and (once supported) T-state management for CPUs which don't
> have a vCPU equivalent in Dom0.

I was mostly planning to use the STAO in order to hide the UART. Hiding
the CPU methods is also something that we might do from the STAO, but as
you say we still need to report them to Xen in order to have proper PM.
This is already listed in the section called 'hacks' below.

IMHO, the processor-related methods should not be hidden; instead a
custom Xen driver should be implemented in Dom0 in order to report them
to Xen. AFAICT masking them in the STAO would effectively prevent _any_
driver in Dom0 from using them. This is ugly, but I don't see any
alternative at all.

> 
>> NB: there are corner cases that I'm not sure how to solve properly. Currently
>> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm 
>> aware
>> of the following:
>>
>>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>>    since this table is only available to the hardware domain it has to report
>>    the PM info back to Xen so that Xen can perform proper PM.
>>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>>    mixed with native ACPICA code in most OSes. This is awkward and requires
>>    the usage of hooks into ACPICA which we have not yet managed to upstream.
> 
> Iirc shutdown doesn't require any custom patches anymore in Linux.

Hm, not in Linux, but the hooks have not been merged into ACPICA, which
is the standard code base used by many OSes in order to deal with ACPI.
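
What the hook boils down to is tiny; something along these lines in the
OS poweroff path (SCHEDOP_shutdown, struct sched_shutdown and
SHUTDOWN_poweroff come from Xen's public sched.h, while
HYPERVISOR_sched_op() is whatever hypercall wrapper the OS provides, so
this is only a sketch):

#include <xen/interface/sched.h>  /* Linux's copy of Xen's public sched.h */

/* Replace the native ACPI S5 entry (SLP_TYPa/SLP_EN writes) with a
 * shutdown request to the hypervisor. */
static void xen_poweroff(void)
{
    struct sched_shutdown shutdown = { .reason = SHUTDOWN_poweroff };

    HYPERVISOR_sched_op(SCHEDOP_shutdown, &shutdown);
}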

>>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>>    intrusive in general, so I'm not that pushed to remove it. It's generally
>>    easy in any OS to add some kind of hook that's executed every time a PCI
>>    device is discovered.
>>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion 
>> regarding
>>    this one is the same as (3), it's not really intrusive so I'm not very
>>    pushed to remove it.
> 
> As said in another reply - for both of these, we just can't remove the
> reporting to Xen.

Right, as I said above I'm not particularly pushed to remove them. IMHO
they are not that intrusive.
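
For reference, the kind of hook I have in mind is roughly the following
(the structure and hypercall number come from Xen's public physdev.h;
HYPERVISOR_physdev_op() is the OS-specific hypercall wrapper, so treat
this as a sketch only):

#include <xen/interface/physdev.h>  /* Linux's copy of Xen's public physdev.h */

/* Called from the OS's PCI enumeration path for every discovered device. */
static int xen_report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn)
{
    struct physdev_pci_device_add add = {
        .seg   = seg,
        .bus   = bus,
        .devfn = devfn,
        .flags = 0,     /* no PXM or virtual function information */
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
}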

>> MMIO mapping
>> ------------
>>
>> For DomUs without any device passed-through no direct MMIO mappings will be
>> present in the physical memory map presented to the guest. For DomUs with
>> devices passed-through the toolstack will create direct MMIO mappings as
>> part of the domain build process, and thus no action will be required
>> from the DomU.
>>
>> For the hardware domain initial direct MMIO mappings will be set for the
>> following regions:
>>
>> NOTE: ranges are defined using memory addresses, not pages.
>>
>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>    memory map at the same position.
>>
>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>    guest physical memory.
> 
> When have you last seen a machine with a hole right below the
> 16Mb boundary?

Right, I will remove this. Even my old Nehalem boxes (which IIRC was the
first Intel architecture with an IOMMU) don't have it.

Should I also mention RMRR?

  * Any RMRR regions reported will also be mapped 1:1 to Dom0.

>>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>>    1:1 to the guest physical memory map. There are going to be exceptions if
>>    Xen has to modify the tables before presenting them to the guest.
>>
>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>    time they will also be made available to the guest at the same position
>>    in its physical memory map. It is possible that Xen will trap accesses to
>>    those regions, but a guest should be able to use the native configuration
>>    mechanism in order to interact with this configuration space. If the
>>    hardware domain reports the presence of any of those regions using the
>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>    them.
> 
> s/all guest/allow Dom0/ in this last sentence?

Yes.
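
Just to illustrate what the hardware domain side looks like (field names
as in Xen's public physdev.h; HYPERVISOR_physdev_op() is again the
OS-specific wrapper, so this is only a sketch):

#include <xen/interface/physdev.h>  /* Linux's copy of Xen's public physdev.h */

/* Report an MMCFG region (e.g. parsed from the MCFG table) to Xen. */
static int xen_report_mmcfg(uint64_t base, uint16_t segment,
                            uint8_t start_bus, uint8_t end_bus)
{
    struct physdev_pci_mmcfg_reserved mmcfg = {
        .address   = base,
        .segment   = segment,
        .start_bus = start_bus,
        .end_bus   = end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED, /* firmware marks it reserved */
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &mmcfg);
}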

>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>    the PCI devices without hardware domain interaction. In order to have
>>    the BARs of PCI devices properly mapped the hardware domain needs to
>>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of 
>> setting
>>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>>    procedure will be transparent from guest point of view, and upon returning
>>    from the hypercall mappings must be already established.
> 
> I'm not sure this can work, as it imposes restrictions on the ordering
> of operations internal of the Dom0 OS: Successfully having probed
> for a PCI device (and hence reporting its presence to Xen) doesn't
> imply its BARs have already got set up. Together with the possibility
> of the OS re-assigning BARs I think we will actually need another
> hypercall, or the same device-add hypercall may need to be issued
> more than once per device (i.e. also every time any BAR assignment
> got changed).

We already trap accesses to 0xcf8/0xcfc; can't we detect BAR
reassignments there and change the MMIO mappings accordingly?

I was thinking that we could do the initial mapping at the current
position when issuing the hypercall, and then detect further changes and
remap if needed, but maybe I'm missing something again that makes this
approach not feasible.
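
To make the idea concrete, roughly this kind of check on the trapped
write (illustrative only, not how Xen would actually structure it):

#include <stdbool.h>
#include <stdint.h>

/* Given a config-space write trapped through 0xcf8/0xcfc, decide whether
 * it targets a memory BAR of a type-0 header and, if so, return the new
 * base so the caller can update the 1:1 MMIO mapping. */
static bool config_write_hits_bar(uint32_t cf8, uint32_t value,
                                  uint64_t *new_base)
{
    uint8_t reg = cf8 & 0xfc;           /* dword-aligned register offset */

    if (!(cf8 & 0x80000000u))           /* config access not enabled */
        return false;
    if (reg < 0x10 || reg > 0x24)       /* BAR0..BAR5 of a type-0 header */
        return false;
    if (value & 1)                      /* I/O BAR, no MMIO mapping involved */
        return false;

    /* Low bits of a memory BAR encode type/prefetchability; 64-bit BARs
     * span two registers, handling the upper half is omitted here. */
    *new_base = value & ~0xfull;
    return true;
}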

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

