
Re: [Xen-devel] RFC: very initial PVH design document



On 27/08/14 at 2:33, Mukesh Rathor wrote:
> On Fri, 22 Aug 2014 16:55:08 +0200
> Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
> 
>> Hello,
>>
>> I've started writing a document in order to describe the interface 
>> exposed by Xen to PVH guests, and how it should be used (by guest 
>> OSes). The document is far from complete (see the amount of TODOs 
>> scattered around), but given the lack of documentation regarding PVH
>> I think it's a good starting point. The aim of this is that it should
>> be committed to the Xen repository once it's ready. Given that this
>> is still a *very* early version I'm not even posting it as a patch.
>>
>> Please comment, and try to fill the holes if possible ;).
>>
>> Roger.
>>
>> ---
>> # PVH Specification #
>>
>> ## Rationale ##
>>
>> PVH is a new kind of guest that has been introduced in Xen 4.4 as a
>> DomU, and in Xen 4.5 as a Dom0. The aim of PVH is to make use of the
>> hardware virtualization extensions present in modern x86 CPUs in
>> order to improve performance.
>>
>> PVH is considered a mix between PV and HVM, and can be seen as a PV
>> guest that runs inside of an HVM container, or as a PVHVM guest
>> without any emulated devices. The design goal of PVH is to provide
>> the best performance possible and to reduce the amount of
>> modifications needed for a guest OS to run in this mode (compared to
>> pure PV).
>>
>> This document tries to describe the interfaces used by PVH guests,
>> focusing on how an OS should make use of them in order to support PVH.
>>
>> ## Early boot ##
>>
>> PVH guests use the PV boot mechanism, which means that the kernel is
>> loaded and directly launched by Xen (by jumping into the entry
>> point). In order to do this, Xen ELF notes need to be added to the
>> guest kernel, so that they contain the information needed by Xen.
>> Here is an example of the ELF Notes added to the FreeBSD amd64 kernel
>> in order to boot as PVH:
>>
>>     ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "FreeBSD")
>>     ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, __XSTRING(__FreeBSD_version))
>>     ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
>>     ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .quad,  KERNBASE)
>>     ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   .quad,  KERNBASE)
>>     ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .quad,  xen_start)
>>     ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad,  hypercall_page)
>>     ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   .quad,  HYPERVISOR_VIRT_START)
>>     ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>>     ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
>>     ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,   .long,  PG_V, PG_V)
>>     ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
>>     ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long,  0)
>>     ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB,     .asciz, "yes")
> 
> It will be helpful to add:
> 
> On the linux side, the above can be found in arch/x86/xen/xen-head.S.

Done, although I would prefer to limit the number of code examples
picked from Linux (or to at least try to provide alternative examples
under a more liberal license).

>> It is important to highlight the following notes:
>>
>>   * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
>> entry point.
>>   * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
>> hypercall page inside of the guest kernel (this memory region will be
>> filled by Xen prior to booting).
>>   * XEN_ELFNOTE_FEATURES: contains the list of features supported by
>> the kernel. In this case the kernel is only able to boot as a PVH
>> guest, but those options can be mixed with the ones used by pure PV
>> guests in order to have a kernel that supports both PV and PVH (like
>> Linux). The list of options available can be found in the
>> `features.h` public header.
> 
> Hmm... for linux I'd word that as follows:
> 
> A PVH guest is started by specifying pvh=1 in the config file. However,
> for the guest to be launched as a PVH guest, it must minimally advertise 
> certain features which are: auto_translated_physmap, hvm_callback_vector, 
> writable_descriptor_tables, and supervisor_mode_kernel. This is done
> via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES. See
> linux:arch/x86/xen/xen-head.S for more info. A list of all xen features
> can be found in xen:include/public/features.h. However, at present
> the absence of these features does not make it automatically boot in PV
> mode, but that may change in future. The ultimate goal is, if a guest
> supports these features, then boot it automatically in PVH mode, otherwise
> boot it in PV mode.

I don't think we should add tool-side stuff here (like setting pvh=1 in
the config file). I wanted this document to be a specification of the
interfaces used by a PVH guest, from the OS point of view. Xen supports
a wide variety of toolstacks, and I bet some of them will require a
different method in order to boot as PVH.

> [You can leave out the last part if you want, or just take whatever from
> above].
> 
>> Xen will jump into the kernel entry point defined in
>> `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
>> mode depending on the kernel bitness) and some basic page tables
>> set up.
> 
> If I may rephrase:
> 
> Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> with paging, PAE, and long mode enabled. At present only 64bit mode
> is supported, however, in future compat mode support will be added.
> An important distinction for a 64bit PVH is that it is launched at
> privilege level 0 as opposed to a 64bit PV guest which is launched at
> privilege level 3.

I've integrated a part of this paragraph, but I think some of this
content would go into the i386 section once we have support for 32bit
PVH guests.

>> Also, the `rsi` (`esi` on 32bits) register is going to contain the
>> virtual memory address where Xen has placed the start_info structure.
>> The `rsp` (`esp` on 32bits) will point to a stack that can be used by
>> the guest kernel. The start_info structure contains all the info the
>> guest needs in order to initialize. More information about its
>> contents can be found in the `xen.h` public header.
> 
> Since the above is all true for a PV guest, you could begin it with:
> 
> Just like a PV guest, the rsi ....
> 
>>
>> ### Initial amd64 control registers values ###
>>
>> Initial values for the control registers are set up by Xen before
>> booting the guest kernel. The guest kernel can expect to find the
>> following features enabled by Xen.
>>
>> On `CR0` the following bits are set by Xen:
>>
>>   * PE (bit 0): protected mode enable.
>>   * ET (bit 4): 80387 external math coprocessor.
>>   * PG (bit 31): paging enabled.
>>
>> On `CR4` the following bits are set by Xen:
>>
>>   * PAE (bit 5): PAE enabled.
>>
>> And finally on `EFER` the following features are enabled:
>>
>>   * LME (bit 8): Long mode enable.
>>   * LMA (bit 10): Long mode active.
>>
>> *TODO*: do we expect these flags to change? Are there other flags that
>> might be enabled depending on the hardware we are running on?
> 
> Can't think of anything...
> 
> 
>> ## Memory ##
>>
>> Since PVH guests rely on virtualization extensions provided by the
>> CPU, they have access to a hardware virtualized MMU, which means
>> page-table related operations should use the same instructions as
>> on native.
> 
> Do you wanna expand a bit since this is another big distinction from
> a PV guest?
> 
> which means that page tables are native and guest managed.
> This also implies that the mmu_update hypercall is not available to a PVH
> guest, unlike a PV guest. The guest is configured at start so it can
> access all pages up to start_info->nr_pages.

This is already explained in the last paragraph of this section, and
since MMU hypercalls are not available to PVH guests I don't think we
should even mention them.

I like to see this document as something that can be used to add PVH
support from scratch, not something written to be used to migrate from
PV to PVH (although I think it also serves this purpose).

> 
>> There are however some differences from native. The usage of native
>> MTRR operations is forbidden, and `XENPF_*_memtype` hypercalls should
>> be used instead. This can be avoided by simply not using MTRR and
>> setting all the memory attributes using PAT, which doesn't require
>> the use of any hypercalls.
>>
>> Since PVH doesn't use a BIOS in order to boot, the physical memory
>> map has to be retrieved using the `XENMEM_memory_map` hypercall,
>> which will return an e820 map. This memory map might contain holes
>> that describe MMIO regions, which will have already been set up by Xen.
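>>
>> The following sketch shows how the memory map could be retrieved. It
>> assumes an OS-provided `HYPERVISOR_memory_op` hypercall wrapper and the
>> usual guest handle macros; the buffer size and variable names are
>> illustrative:
>>
>>     /* Standard e820 entry layout. */
>>     struct e820_entry {
>>         uint64_t addr;
>>         uint64_t size;
>>         uint32_t type;
>>     } __attribute__((packed));
>>
>>     static struct e820_entry e820_map[128];
>>
>>     static int fetch_e820_map(void)
>>     {
>>         struct xen_memory_map memmap;
>>
>>         memmap.nr_entries = sizeof(e820_map) / sizeof(e820_map[0]);
>>         set_xen_guest_handle(memmap.buffer, e820_map);
>>         if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap) != 0)
>>             return -1;
>>         /* On return nr_entries holds the number of entries filled in. */
>>         return memmap.nr_entries;
>>     }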
>>
>> *TODO*: we need to figure out what to do with MMIO regions, right now
>> Xen sets all the holes in the native e820 to MMIO regions for Dom0 up
>> to 4GB. We need to decide what to do with MMIO regions above 4GB on
>> Dom0, and what to do for PVH DomUs with pci-passthrough.
> 
> We map all non-ram regions for dom0 1:1 till the highest non-ram e820
> entry. If there is anything that is beyond the last e820 entry,
> it will remain unmapped.
> 
> Correct, passthru needs to be figured.
> 
>> In the case of a guest started with memory != maxmem, the e820 memory
>> map returned by Xen will contain the memory up to maxmem. The guest
>> has to be very careful to only use the lower memory pages up to the
>> value contained in `start_info->nr_pages` because any memory page
>> above that value will not be populated.
>>
>> ## Physical devices ##
>>
>> When running as Dom0 the guest OS has the ability to interact with
>> the physical devices present in the system. A note should be made
>> that PVH guests require a working IOMMU in order to interact with
>> physical devices.
>>
>> The first step in order to manipulate the devices is to make Xen
>> aware of them. Since all the hardware description on x86 comes from
>> ACPI, Dom0 is responsible for parsing the ACPI tables and notifying
>> Xen about the devices it finds. This is done with the
>> `PHYSDEVOP_pci_device_add` hypercall.
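>>
>> A minimal sketch of registering a plain PCI device (assuming an
>> OS-provided `HYPERVISOR_physdev_op` wrapper; `seg`, `bus`, `slot` and
>> `func` stand for the values obtained from the PCI enumeration):
>>
>>     struct physdev_pci_device_add add;
>>
>>     memset(&add, 0, sizeof(add));
>>     add.seg = seg;                  /* PCI segment (domain) */
>>     add.bus = bus;
>>     add.devfn = (slot << 3) | func; /* device/function encoding */
>>     ret = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);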
>>
>> *TODO*: explain the way to register the different kinds of PCI
>> devices, like devices with virtual functions.
>>
>> ## Interrupts ##
>>
>> All interrupts on PVH guests are routed over event channels, see
>> [Event Channel Internals][event_channels] for more detailed
>> information about event channels. In order to inject interrupts into
>> the guest an IDT vector is used. This is the same mechanism used on
>> PVHVM guests, and allows having per-cpu interrupts that can be used
>> to deliver timers or IPIs.
>>
>> In order to register the callback IDT vector the `HVMOP_set_param`
>> hypercall is used with the following values:
>>
>>     domid = DOMID_SELF
>>     index = HVM_PARAM_CALLBACK_IRQ
>>     value = (0x2 << 56) | vector_value
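>>
>> As a sketch (assuming an OS-provided `HYPERVISOR_hvm_op` wrapper and an
>> IDT vector already set up by the guest in `vector_value`):
>>
>>     struct xen_hvm_param xhp;
>>
>>     xhp.domid = DOMID_SELF;
>>     xhp.index = HVM_PARAM_CALLBACK_IRQ;
>>     /* Type 2 in bits 63:56 selects delivery through an IDT vector. */
>>     xhp.value = ((uint64_t)0x2 << 56) | vector_value;
>>     ret = HYPERVISOR_hvm_op(HVMOP_set_param, &xhp);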
>>
>> In order to know which event channel has fired, we need to look into
>> the information provided in the `shared_info` structure. The
>> `evtchn_pending` array is used as a bitmap in order to find out which
>> event channel has fired. Event channels can also be masked by setting
>> the bit corresponding to their port in the `shared_info->evtchn_mask`
>> bitmap.
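>>
>> A simplified scan of the pending bitmap could look like the following.
>> This is only a sketch: it ignores the per-vCPU `evtchn_pending_sel`
>> selector word and the atomic clearing of the pending bit that a real
>> handler needs.
>>
>>     struct shared_info *s = HYPERVISOR_shared_info; /* mapped at boot */
>>     unsigned int w;
>>
>>     for (w = 0; w < sizeof(s->evtchn_pending) / sizeof(xen_ulong_t); w++) {
>>         xen_ulong_t pending = s->evtchn_pending[w] & ~s->evtchn_mask[w];
>>
>>         while (pending != 0) {
>>             unsigned int bit = __builtin_ctzl(pending);
>>             unsigned int port = w * sizeof(xen_ulong_t) * 8 + bit;
>>
>>             pending &= ~((xen_ulong_t)1 << bit);
>>             /* ... atomically clear the bit in evtchn_pending and
>>              * dispatch the handler for 'port' ... */
>>         }
>>     }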
>>
>> *TODO*: provide a reference about how to interact with FIFO event
>> channels?
>>
>> ### Interrupts from physical devices ###
>>
>> When running as Dom0 (or when using pci-passthrough) interrupts from
>> physical devices are routed over event channels. There are 3
>> different kinds of physical interrupts that can be routed over event
>> channels by Xen: IO APIC, MSI and MSI-X interrupts.
>>
>> Since physical interrupts usually need an EOI (End Of Interrupt), Xen
>> allows the registration of a memory region that indicates whether
>> a physical interrupt needs an EOI from the guest or not. This is done
>> with the `PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall, which takes a
>> parameter containing the physical address of the memory page that
>> will act as a bitmap. Then in order to find out if an IRQ needs EOI
>> or not, the OS can perform a simple bit test on the memory page using
>> the PIRQ value.
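>>
>> A sketch of the registration and the later bit test. It assumes a
>> hypothetical `virt_to_gfn()` helper that returns the guest frame number
>> backing a virtual address, and a page-aligned, page-sized bitmap:
>>
>>     static unsigned long pirq_eoi_map[PAGE_SIZE / sizeof(unsigned long)]
>>         __attribute__((aligned(PAGE_SIZE)));
>>
>>     struct physdev_pirq_eoi_gmfn eoi_gmfn;
>>
>>     eoi_gmfn.gmfn = virt_to_gfn(pirq_eoi_map);
>>     ret = HYPERVISOR_physdev_op(PHYSDEVOP_pirq_eoi_gmfn_v2, &eoi_gmfn);
>>
>>     /* Later, to check whether PIRQ#pirq needs an EOI: */
>>     needs_eoi = (pirq_eoi_map[pirq / (sizeof(unsigned long) * 8)] >>
>>                  (pirq % (sizeof(unsigned long) * 8))) & 1;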
>>
>> ### IO APIC interrupt routing ###
>>
>> IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
>> hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
>> hypercall, as an example IRQ#9 is used here:
>>
>>     domid = DOMID_SELF
>>     type = MAP_PIRQ_TYPE_GSI
>>     index = 9
>>     pirq = 9
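>>
>> Expressed as code (a sketch, assuming an OS-provided
>> `HYPERVISOR_physdev_op` wrapper):
>>
>>     struct physdev_map_pirq map;
>>
>>     memset(&map, 0, sizeof(map));
>>     map.domid = DOMID_SELF;
>>     map.type = MAP_PIRQ_TYPE_GSI;
>>     map.index = 9;      /* IRQ/GSI to map */
>>     map.pirq = 9;       /* requested PIRQ number */
>>     ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);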
>>
>> After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to
>> allocate a vector:
>>
>>     irq = 9
>>     vector = 0
>>
>> *TODO*: I'm not sure why we need those two hypercalls, and their
>> usage is not documented anywhere. We need to clarify what the
>> parameters mean and what effect they have.
>>
>> The IRQ#9 is now registered as PIRQ#9. The triggering and polarity
>> can also be configured using the `PHYSDEVOP_setup_gsi` hypercall:
>>
>>     gsi = 9 # This is the IRQ value.
>>     triggering = 0
>>     polarity = 0
>>
>> In this example the IRQ would be configured to use edge triggering
>> and high polarity.
>>
>> Finally the PIRQ can be bound to an event channel using the
>> `EVTCHNOP_bind_pirq` hypercall, which will return the event channel
>> port the PIRQ has been assigned to. After this the event channel will
>> be ready for delivery.
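>>
>> A sketch of the binding (the `BIND_PIRQ__WILL_SHARE` flag is only
>> needed if the IRQ line can be shared):
>>
>>     struct evtchn_bind_pirq bind;
>>
>>     bind.pirq = 9;                      /* PIRQ from PHYSDEVOP_map_pirq */
>>     bind.flags = BIND_PIRQ__WILL_SHARE; /* or 0 for an exclusive IRQ */
>>     ret = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);
>>     /* On success bind.port contains the assigned event channel port. */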
>>
>> *NOTE*: when running as Dom0, the guest has to parse the interrupt
>> overrides found in the ACPI tables and notify Xen about them.
>>
>> ### MSI ###
>>
>> In order to configure MSI interrupts for a device, Xen must be made
>> aware of its presence first by using the `PHYSDEVOP_pci_device_add`
>> hypercall as described above. Then the `PHYSDEVOP_map_pirq` hypercall
>> is used:
>>
>>     domid = DOMID_SELF
>>     type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
>>     index = -1
>>     pirq = -1
>>     bus = pci_device_bus
>>     devfn = pci_device_function
>>     entry_nr = number of MSI interrupts
>>
>> The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI
>> interrupt source is being configured. On devices that support MSI
>> interrupt groups `MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure
>> them by also placing the number of MSI interrupts in the `entry_nr`
>> field.
>>
>> The values in the `bus` and `devfn` field should be the same as the
>> ones used when registering the device with `PHYSDEVOP_pci_device_add`.
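>>
>> A sketch of mapping a single MSI interrupt (assuming the device was
>> already registered and `bus`/`devfn` hold the same values used then):
>>
>>     struct physdev_map_pirq map;
>>
>>     memset(&map, 0, sizeof(map));
>>     map.domid = DOMID_SELF;
>>     map.type = MAP_PIRQ_TYPE_MSI_SEG;
>>     map.index = -1;                 /* let Xen pick */
>>     map.pirq = -1;                  /* let Xen pick */
>>     map.bus = bus;
>>     map.devfn = devfn;
>>     map.entry_nr = 1;               /* number of MSI interrupts */
>>     ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);
>>     /* On success map.pirq contains the PIRQ assigned by Xen. */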
>>
>> ### MSI-X ###
>>
>> *TODO*: how to register/use them.
>>
>> ## Event timers and timecounters ##
>>
>> Since some hardware is not available on PVH (like the local APIC),
>> Xen provides the OS with suitable replacements in order to get the
>> same functionality. One of them is the timer interface. Using a set
>> of hypercalls, a guest OS can set event timers that will deliver an
>> event channel interrupt to the guest.
>>
>> In order to use the timer provided by Xen the guest OS first needs to
>> register a VIRQ event channel to be used by the timer to deliver the
>> interrupts. The event channel is registered using the
>> `EVTCHNOP_bind_virq` hypercall, that only takes two parameters:
>>
>>     virq = VIRQ_TIMER
>>     vcpu = vcpu_id
>>
>> The port that's going to be used by Xen in order to deliver the
>> interrupt is returned in the `port` field. Once the interrupt is set,
>> the timer can be programmed using the `VCPUOP_set_singleshot_timer`
>> hypercall.
>>
>>     flags = VCPU_SSHOTTMR_future
>>     timeout_abs_ns = absolute value when the timer should fire
>>
>> It is important to notice that the `VCPUOP_set_singleshot_timer`
>> hypercall must be executed from the same vCPU where the timer should
>> fire, or else Xen will refuse to set it. This is a single-shot timer,
>> so it must be set by the OS every time it fires if a periodic timer
>> is desired.
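>>
>> Putting both steps together (a sketch; `vcpu_id`, `now_ns` and
>> `delta_ns` are illustrative names, and the second hypercall must run
>> on the vCPU the timer is being armed for):
>>
>>     struct evtchn_bind_virq bind;
>>     struct vcpu_set_singleshot_timer single;
>>
>>     bind.virq = VIRQ_TIMER;
>>     bind.vcpu = vcpu_id;
>>     ret = HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, &bind);
>>     /* On success bind.port contains the event channel port to use. */
>>
>>     single.timeout_abs_ns = now_ns + delta_ns; /* absolute system time */
>>     single.flags = VCPU_SSHOTTMR_future;
>>     ret = HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, vcpu_id, &single);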
>>
>> Xen also shares a memory region with the guest OS that contains
>> time-related values that are updated periodically. These values can
>> be used to implement a timecounter or to obtain the current time.
>> This information is placed inside of
>> `shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the
>> guest has been launched) can be calculated using the following
>> expression and the values stored in the `vcpu_time_info` struct:
>>
>>     system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
>>
>> The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
>> calculated using the above value, plus the timeout the system wants
>> to set.
>>
>> If the OS also wants to obtain the current wallclock time, the value
>> calculated above has to be added to the values found in
>> `shared_info->wc_sec` and `shared_info->wc_nsec`.
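>>
>> A sketch of the calculation (assuming an OS-provided `rdtsc()` and the
>> current vCPU id in `vcpu_id`; note that `tsc_shift` is signed, so the
>> shift direction depends on its sign, and the version field has to be
>> re-checked to get a consistent snapshot):
>>
>>     static uint64_t get_system_time_ns(void)
>>     {
>>         struct vcpu_time_info *t =
>>             &HYPERVISOR_shared_info->vcpu_info[vcpu_id].time;
>>         uint32_t ver, mul;
>>         uint64_t delta, system_time;
>>         int8_t shift;
>>
>>         do {
>>             ver = t->version;           /* odd while Xen updates it */
>>             __sync_synchronize();
>>             delta = rdtsc() - t->tsc_timestamp;
>>             system_time = t->system_time;
>>             mul = t->tsc_to_system_mul;
>>             shift = t->tsc_shift;
>>             __sync_synchronize();
>>         } while ((ver & 1) != 0 || ver != t->version);
>>
>>         delta = (shift >= 0) ? delta << shift : delta >> -shift;
>>         return system_time + (uint64_t)(((__uint128_t)delta * mul) >> 32);
>>     }
>>
>> The wallclock time is then `wc_sec`/`wc_nsec` (read under the
>> `shared_info->wc_version` check) plus the value returned above.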
> 
> All the above is great info, not PVH specific tho. May wanna mention
> it fwiw. 
> 
>> ## SMP discovery and bring up ##
>>
>> The process of bringing up secondary CPUs is obviously different from
>> native, since PVH doesn't have a local APIC. The first thing to do is
>> to figure out how many vCPUs the guest has. This is done using the
>> `VCPUOP_is_up` hypercall, using for example this simple loop:
>>
>>     for (i = 0; i < MAXCPU; i++) {
>>         ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
>>         if (ret >= 0)
>>             ncpus++; /* vCPU#i is present */
>>     }
>>
>> Note that when running as Dom0, the ACPI tables might report a
>> different number of available CPUs. This is because the value in the
>> ACPI tables is the number of physical CPUs the host has, and it might
>> bear no resemblance to the number of vCPUs Dom0 actually has, so it
>> should be ignored.
>>
>> In order to bring up the secondary vCPUs they must be configured
>> first. This is achieved using the `VCPUOP_initialise` hypercall. A
>> valid context has to be passed to the vCPU in order to boot. The
>> relevant fields for PVH guests are the following:
>>
>>   * `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public
>> header).
>>   * `user_regs`: struct that contains the register values that will
>> be set on the vCPU before booting. The most relevant ones are `rip`
>> and `rsp` in order to set the start address and the stack.
>>   * `ctrlreg[3]`: contains the address of the page tables that will
>> be used by the vCPU.
>>
>> After the vCPU is initialized with the proper values, it can be
>> started by using the `VCPUOP_up` hypercall. The values of the other
>> control registers of the vCPU will be the same as the ones described
>> in the `control registers` section.
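>>
>> A minimal sketch (the entry point, stack and page-table address names
>> are illustrative; for an auto-translated PVH guest `ctrlreg[3]` holds
>> a guest physical address):
>>
>>     struct vcpu_guest_context ctxt;
>>
>>     memset(&ctxt, 0, sizeof(ctxt));
>>     ctxt.flags = VGCF_in_kernel;
>>     ctxt.user_regs.rip = (uint64_t)init_secondary; /* AP entry point */
>>     ctxt.user_regs.rsp = (uint64_t)ap_stack_top;   /* per-CPU stack */
>>     ctxt.ctrlreg[3] = pgtable_gpa;                 /* top level page table */
>>
>>     ret = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt);
>>     if (ret == 0)
>>         ret = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);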
> 
> If you want, you could put linux reference here:
> 
> For an example, please see cpu_initialize_context() in arch/x86/xen/smp.c
> in linux.

Done, thanks for the comments.

>> ## Control operations (reboot/shutdown) ##
>>
>> Reboot and shutdown operations on PVH guests are performed using
>> hypercalls. In order to issue a reboot, a guest must use the
>> `SCHEDOP_shutdown` hypercall with the `SHUTDOWN_reboot` reason. In
>> order to perform a power off from a guest DomU, the
>> `SHUTDOWN_poweroff` reason should be used instead.
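>>
>> As a sketch (assuming an OS-provided `HYPERVISOR_sched_op` wrapper):
>>
>>     struct sched_shutdown shutdown;
>>
>>     shutdown.reason = SHUTDOWN_reboot; /* or SHUTDOWN_poweroff */
>>     HYPERVISOR_sched_op(SCHEDOP_shutdown, &shutdown);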
>>
>> The way to perform a full system power off from Dom0 is different
>> from what's done in a DomU guest. In order to perform a power off
>> from Dom0 the native ACPI path should be followed, but the guest
>> should not write the SLP_EN bit to the Pm1Control register. Instead
>> the `XENPF_enter_acpi_sleep` hypercall should be used, filling the
>> following data in the `xen_platform_op` struct:
>>
>>     cmd = XENPF_enter_acpi_sleep
>>     interface_version = XENPF_INTERFACE_VERSION
>>     u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
>>     u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
>>
>> This will allow Xen to do its cleanup and to power off the system.
>> If the host is using hardware-reduced ACPI, the following field
>> should also be set:
>>
>>     u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
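>>
>> As a sketch (assuming an OS-provided `HYPERVISOR_platform_op` wrapper;
>> the field names follow the description above and may differ slightly
>> between Xen versions):
>>
>>     struct xen_platform_op op;
>>
>>     memset(&op, 0, sizeof(op));
>>     op.cmd = XENPF_enter_acpi_sleep;
>>     op.interface_version = XENPF_INTERFACE_VERSION;
>>     op.u.enter_acpi_sleep.pm1a_cnt_val = pm1a_cnt_val;
>>     op.u.enter_acpi_sleep.pm1b_cnt_val = pm1b_cnt_val;
>>     /* op.u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED; if the
>>      * host uses hardware-reduced ACPI. */
>>     ret = HYPERVISOR_platform_op(&op);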
>>
>> ## CPUID ##
>>
>> *TODO*: describe which cpuid flags a guest should ignore and also
>> which flags describe features that can be used. It would also be good
>> to describe the set of cpuid flags that will always be present when
>> running as PVH.
>>
>> ## Final notes ##
>>
>> All the other hardware functionality not described in this document
>> should be assumed to work in the same way as on native.
>>
>> [event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
> 
> 
> Great work Roger! Thanks a lot for writing it.
> 
> Mukesh
> 
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

