Re: [Xen-devel] RFC: very initial PVH design document
On 27/08/14 at 2.33, Mukesh Rathor wrote:
> On Fri, 22 Aug 2014 16:55:08 +0200
> Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>
>> Hello,
>>
>> I've started writing a document in order to describe the interface
>> exposed by Xen to PVH guests, and how it should be used (by guest
>> OSes). The document is far from complete (see the amount of TODOs
>> scattered around), but given the lack of documentation regarding PVH
>> I think it's a good starting point. The aim is for it to be
>> committed to the Xen repository once it's ready. Given that this
>> is still a *very* early version I'm not even posting it as a patch.
>>
>> Please comment, and try to fill the holes if possible ;).
>>
>> Roger.
>>
>> ---
>> # PVH Specification #
>>
>> ## Rationale ##
>>
>> PVH is a new kind of guest that has been introduced in Xen 4.4 as a
>> DomU, and in Xen 4.5 as a Dom0. The aim of PVH is to make use of the
>> hardware virtualization extensions present in modern x86 CPUs in
>> order to improve performance.
>>
>> PVH is considered a mix between PV and HVM, and can be seen as a PV
>> guest that runs inside of an HVM container, or as a PVHVM guest
>> without any emulated devices. The design goal of PVH is to provide
>> the best possible performance and to reduce the amount of
>> modifications needed for a guest OS to run in this mode (compared to
>> pure PV).
>>
>> This document tries to describe the interfaces used by PVH guests,
>> focusing on how an OS should make use of them in order to support PVH.
>>
>> ## Early boot ##
>>
>> PVH guests use the PV boot mechanism, which means that the kernel is
>> loaded and directly launched by Xen (by jumping into the entry
>> point). In order to do this, Xen ELF Notes need to be added to the
>> guest kernel, so that they contain the information needed by Xen.
>> Here is an example of the ELF Notes added to the FreeBSD amd64 kernel
>> in order to boot as PVH:
>>
>>     ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
>>     ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
>>     ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
>>     ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
>>     ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
>>     ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
>>     ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
>>     ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
>>     ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>>     ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
>>     ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
>>     ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
>>     ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
>>     ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
>
> It will be helpful to add:
>
> On the linux side, the above can be found in arch/x86/xen/xen-head.S.

Done, although I would prefer to limit the number of code examples
picked from Linux (or to at least try to provide alternate examples
under a more liberal license).

>> It is important to highlight the following notes:
>>
>>  * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
>>    entry point.
>>  * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
>>    hypercall page inside of the guest kernel (this memory region
>>    will be filled by Xen prior to booting).
>>  * XEN_ELFNOTE_FEATURES: contains the list of features supported by
>>    the kernel.
>>    In this case the kernel is only able to boot as a PVH
>>    guest, but those options can be mixed with the ones used by pure
>>    PV guests in order to have a kernel that supports both PV and PVH
>>    (like Linux). The list of available options can be found in the
>>    `features.h` public header.
>
> Hmm... for linux I'd word that as follows:
>
> A PVH guest is started by specifying pvh=1 in the config file. However,
> for the guest to be launched as a PVH guest, it must minimally advertise
> certain features, which are: auto_translated_physmap, hvm_callback_vector,
> writable_descriptor_tables, and supervisor_mode_kernel. This is done
> via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES. See
> linux:arch/x86/xen/xen-head.S for more info. A list of all xen features
> can be found in xen:include/public/features.h. However, at present
> the absence of these features does not make it automatically boot in PV
> mode, but that may change in future. The ultimate goal is, if a guest
> supports these features, then boot it automatically in PVH mode,
> otherwise boot it in PV mode.

I don't think we should add tool-side stuff here (like setting pvh=1
in the config file). I wanted this document to be a specification of
the interfaces used by a PVH guest, from the OS point of view. Xen
supports a wide variety of toolstacks, and I bet some of them will
require a different method in order to boot as PVH.

> [You can leave out the last part if you want, or just take whatever
> from above].
>
>> Xen will jump into the kernel entry point defined in
>> `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
>> mode depending on the kernel bitness) and some basic page tables
>> set up.
>
> If I may rephrase:
>
> The guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> with paging, PAE, and long mode enabled. At present only 64bit mode
> is supported, however, in future compat mode support will be added.
> An important distinction for a 64bit PVH guest is that it is launched
> at privilege level 0, as opposed to a 64bit PV guest, which is
> launched at privilege level 3.

I've integrated part of this paragraph, but I think some of this
content would go into the i386 section once we have support for 32bit
PVH guests.

>> Also, the `rsi` (`esi` on 32bits) register is going to contain the
>> virtual memory address where Xen has placed the start_info structure.
>> The `rsp` (`esp` on 32bits) register will point to a stack that can
>> be used by the guest kernel. The start_info structure contains all
>> the info the guest needs in order to initialize. More information
>> about its contents can be found in the `xen.h` public header.
>
> Since the above is all true for a PV guest, you could begin it with:
>
> Just like a PV guest, the rsi ....
>
>> ### Initial amd64 control registers values ###
>>
>> Initial values for the control registers are set up by Xen before
>> booting the guest kernel. The guest kernel can expect to find the
>> following features enabled by Xen.
>>
>> On `CR0` the following bits are set by Xen:
>>
>>  * PE (bit 0): protected mode enable.
>>  * ET (bit 4): 80387 external math coprocessor.
>>  * PG (bit 31): paging enabled.
>>
>> On `CR4` the following bits are set by Xen:
>>
>>  * PAE (bit 5): PAE enabled.
>>
>> And finally on `EFER` the following features are enabled:
>>
>>  * LME (bit 8): Long mode enable.
>>  * LMA (bit 10): Long mode active.
>>
>> *TODO*: do we expect these flags to change? Are there other flags
>> that might be enabled depending on the hardware we are running on?
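As an aside, and to keep examples away from Linux code, this is
roughly the kind of sanity check an OS could do against that initial
state at entry. It's an untested sketch; `panic()` is just a
placeholder for whatever error handling the guest kernel has:

    #include <stdint.h>

    #define MSR_EFER   0xc0000080u
    #define CR0_PE     (1UL << 0)
    #define CR0_PG     (1UL << 31)
    #define CR4_PAE    (1UL << 5)
    #define EFER_LMA   (1UL << 10)

    static inline uint64_t read_cr0(void)
    {
        uint64_t v;
        __asm__ __volatile__ ("mov %%cr0, %0" : "=r" (v));
        return v;
    }

    static inline uint64_t read_cr4(void)
    {
        uint64_t v;
        __asm__ __volatile__ ("mov %%cr4, %0" : "=r" (v));
        return v;
    }

    static inline uint64_t rdmsr(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Verify the state Xen promises to a 64bit PVH guest at boot. */
    static void pvh_check_boot_state(void)
    {
        if ((read_cr0() & (CR0_PE | CR0_PG)) != (CR0_PE | CR0_PG))
            panic("PVH: CR0.PE/CR0.PG not set");
        if (!(read_cr4() & CR4_PAE))
            panic("PVH: CR4.PAE not set");
        if (!(rdmsr(MSR_EFER) & EFER_LMA))
            panic("PVH: not running in long mode");
    }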
> Can't think of anything...
>
>> ## Memory ##
>>
>> Since PVH guests rely on virtualization extensions provided by the
>> CPU, they have access to a hardware virtualized MMU, which means
>> that page-table related operations should use the same instructions
>> used on native.
>
> Do you wanna expand a bit, since this is another big distinction from
> a PV guest?
>
> which means that page tables are native and guest managed.
> This also implies that the mmu_update hypercall is not available to a
> PVH guest, unlike a PV guest. The guest is configured at start so it
> can access all pages up to start_info->nr_pages.

This is already explained in the last paragraph of this section, and
since MMU hypercalls are not available to PVH guests I don't think we
should even mention them. I like to see this document as something
that can be used to add PVH support from scratch, not something
written to be used to migrate from PV to PVH (although I think it
also serves that purpose).

>> There are however some differences from native. The usage of native
>> MTRR operations is forbidden, and `XENPF_*_memtype` hypercalls should
>> be used instead. This can be avoided by simply not using MTRR and
>> setting all the memory attributes using PAT, which doesn't require
>> the usage of any hypercalls.
>>
>> Since PVH doesn't use a BIOS in order to boot, the physical memory
>> map has to be retrieved using the `XENMEM_memory_map` hypercall,
>> which will return an e820 map. This memory map might contain holes
>> that describe MMIO regions, which will already be set up by Xen.
>>
>> *TODO*: we need to figure out what to do with MMIO regions; right
>> now Xen sets all the holes in the native e820 to MMIO regions for
>> Dom0 up to 4GB. We need to decide what to do with MMIO regions above
>> 4GB on Dom0, and what to do for PVH DomUs with pci-passthrough.
>
> We map all non-ram regions for dom0 1:1 till the highest non-ram e820
> entry. If there is anything that is beyond the last e820 entry,
> it will remain unmapped.
>
> Correct, passthru needs to be figured out.
>
>> In the case of a guest started with memory != maxmem, the e820 memory
>> map returned by Xen will contain the memory up to maxmem. The guest
>> has to be very careful to only use the lower memory pages up to the
>> value contained in `start_info->nr_pages`, because any memory page
>> above that value will not be populated.
>>
>> ## Physical devices ##
>>
>> When running as Dom0 the guest OS has the ability to interact with
>> the physical devices present in the system. Note that PVH guests
>> require a working IOMMU in order to interact with physical devices.
>>
>> The first step in order to manipulate the devices is to make Xen
>> aware of them. Due to the fact that all the hardware description on
>> x86 comes from ACPI, Dom0 is responsible for parsing the ACPI tables
>> and notifying Xen about the devices it finds. This is done with the
>> `PHYSDEVOP_pci_device_add` hypercall.
>>
>> *TODO*: explain the way to register the different kinds of PCI
>> devices, like devices with virtual functions.
>>
>> ## Interrupts ##
>>
>> All interrupts on PVH guests are routed over event channels; see
>> [Event Channel Internals][event_channels] for more detailed
>> information about event channels. In order to inject interrupts into
>> the guest an IDT vector is used. This is the same mechanism used on
>> PVHVM guests, and allows having per-cpu interrupts that can be used
>> to deliver timers or IPIs.
>>
>> In order to register the callback IDT vector the `HVMOP_set_param`
>> hypercall is used with the following values:
>>
>>     domid = DOMID_SELF
>>     index = HVM_PARAM_CALLBACK_IRQ
>>     value = (0x2 << 56) | vector_value
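To make that concrete, here is a rough C sketch of the invocation,
assuming the guest already has a `HYPERVISOR_hvm_op` wrapper around
the hypercall page (the function name is just for illustration):

    #include <xen/xen.h>
    #include <xen/hvm/hvm_op.h>   /* HVMOP_set_param, struct xen_hvm_param */
    #include <xen/hvm/params.h>   /* HVM_PARAM_CALLBACK_IRQ */

    /* Route event channel upcalls to the given IDT vector; delivery
     * type 2 in the top byte selects vector injection. */
    static int pvh_set_callback_vector(unsigned int vector)
    {
        struct xen_hvm_param xhp = {
            .domid = DOMID_SELF,
            .index = HVM_PARAM_CALLBACK_IRQ,
            .value = (2ULL << 56) | vector,
        };

        return HYPERVISOR_hvm_op(HVMOP_set_param, &xhp);
    }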
>> In order to know which event channel has fired, we need to look at
>> the information provided in the `shared_info` structure. The
>> `evtchn_pending` array is used as a bitmap in order to find out which
>> event channel has fired. An event channel can also be masked by
>> setting the bit corresponding to its port in the
>> `shared_info->evtchn_mask` bitmap.
>>
>> *TODO*: provide a reference about how to interact with FIFO event
>> channels?
>>
>> ### Interrupts from physical devices ###
>>
>> When running as Dom0 (or when using pci-passthrough) interrupts from
>> physical devices are routed over event channels. There are three
>> different kinds of physical interrupts that can be routed over event
>> channels by Xen: IO APIC, MSI and MSI-X interrupts.
>>
>> Since physical interrupts usually need an EOI (End Of Interrupt),
>> Xen allows the registration of a memory region that will contain
>> whether a physical interrupt needs EOI from the guest or not. This
>> is done with the `PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall, which takes
>> a parameter containing the physical address of the memory page that
>> will act as a bitmap. Then, in order to find out if an IRQ needs EOI
>> or not, the OS can perform a simple bit test on the memory page
>> using the PIRQ value.
>>
>> ### IO APIC interrupt routing ###
>>
>> IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
>> hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
>> hypercall; as an example, IRQ#9 is used here:
>>
>>     domid = DOMID_SELF
>>     type = MAP_PIRQ_TYPE_GSI
>>     index = 9
>>     pirq = 9
>>
>> After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to
>> allocate a vector:
>>
>>     irq = 9
>>     vector = 0
>>
>> *TODO*: I'm not sure why we need those two hypercalls, and their
>> usage is not documented anywhere. Need to clarify what the
>> parameters mean and what effect they have.
>>
>> IRQ#9 is now registered as PIRQ#9. The triggering and polarity
>> can also be configured using the `PHYSDEVOP_setup_gsi` hypercall:
>>
>>     gsi = 9 # This is the IRQ value.
>>     triggering = 0
>>     polarity = 0
>>
>> In this example the IRQ would be configured to use edge triggering
>> and high polarity.
>>
>> Finally the PIRQ can be bound to an event channel using
>> `EVTCHNOP_bind_pirq`, which will return the event channel port that
>> the PIRQ has been assigned to. After this the event channel will be
>> ready for delivery.
>>
>> *NOTE*: when running as Dom0, the guest has to parse the interrupt
>> overrides found in the ACPI tables and notify Xen about them.
>>
>> ### MSI ###
>>
>> In order to configure MSI interrupts for a device, Xen must be made
>> aware of its presence first by using the `PHYSDEVOP_pci_device_add`
>> hypercall as described above. Then the `PHYSDEVOP_map_pirq`
>> hypercall is used:
>>
>>     domid = DOMID_SELF
>>     type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
>>     index = -1
>>     pirq = -1
>>     bus = pci_device_bus
>>     devfn = pci_device_function
>>     entry_nr = number of MSI interrupts
>>
>> The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI
>> interrupt source is being configured. On devices that support MSI
>> interrupt groups `MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure
>> them by also placing the number of MSI interrupts in the `entry_nr`
>> field.
>>
>> The values in the `bus` and `devfn` fields should be the same as the
>> ones used when registering the device with `PHYSDEVOP_pci_device_add`.
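For illustration, a rough sketch of the single-MSI case in C, assuming
a `HYPERVISOR_physdev_op` wrapper in the guest (again, the helper name
is made up):

    #include <xen/xen.h>
    #include <xen/physdev.h>   /* PHYSDEVOP_map_pirq, struct physdev_map_pirq */

    /* Ask Xen to allocate a PIRQ for a single MSI interrupt. */
    static int pvh_map_msi(int bus, int devfn, int *pirq)
    {
        struct physdev_map_pirq map = {
            .domid = DOMID_SELF,
            .type  = MAP_PIRQ_TYPE_MSI_SEG,
            .index = -1,            /* let Xen pick */
            .pirq  = -1,            /* let Xen pick */
            .bus   = bus,
            .devfn = devfn,
            .entry_nr = 1,
        };
        int rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);

        if (rc == 0)
            *pirq = map.pirq;       /* Xen returns the allocated PIRQ */
        return rc;
    }

The returned PIRQ can then be bound to an event channel with
`EVTCHNOP_bind_pirq`, just like the IO APIC case above.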
>> ### MSI-X ###
>>
>> *TODO*: how to register/use them.
>>
>> ## Event timers and timecounters ##
>>
>> Since some hardware is not available on PVH (like the local APIC),
>> Xen provides the OS with suitable replacements in order to get the
>> same functionality. One of them is the timer interface. Using a set
>> of hypercalls, a guest OS can set event timers that will deliver an
>> event channel interrupt to the guest.
>>
>> In order to use the timer provided by Xen, the guest OS first needs
>> to register a VIRQ event channel to be used by the timer to deliver
>> the interrupts. The event channel is registered using the
>> `EVTCHNOP_bind_virq` hypercall, which only takes two parameters:
>>
>>     virq = VIRQ_TIMER
>>     vcpu = vcpu_id
>>
>> The port that's going to be used by Xen in order to deliver the
>> interrupt is returned in the `port` field. Once the event channel is
>> set up, the timer can be programmed using the
>> `VCPUOP_set_singleshot_timer` hypercall:
>>
>>     flags = VCPU_SSHOTTMR_future
>>     timeout_abs_ns = absolute value when the timer should fire
>>
>> It is important to notice that the `VCPUOP_set_singleshot_timer`
>> hypercall must be executed from the same vCPU where the timer should
>> fire, or else Xen will refuse to set it. This is a single-shot timer,
>> so it must be re-armed by the OS every time it fires if a periodic
>> timer is desired.
>>
>> Xen also shares a memory region with the guest OS that contains
>> time-related values that are updated periodically. These values can
>> be used to implement a timecounter or to obtain the current time.
>> This information is placed inside of
>> `shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the
>> guest has been launched) can be calculated using the following
>> expression and the values stored in the `vcpu_time_info` struct:
>>
>>     system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
>>
>> The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
>> calculated using the above value, plus the timeout the system wants
>> to set.
>>
>> If the OS also wants to obtain the current wallclock time, the value
>> calculated above has to be added to the values found in
>> `shared_info->wc_sec` and `shared_info->wc_nsec`.
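One practical detail the draft doesn't spell out: Xen bumps the
`version` field of `vcpu_time_info` around updates (it is odd while an
update is in progress), so a reader has to retry until it sees a
stable snapshot. A rough, untested C sketch of the uptime calculation,
assuming GCC/Clang for the 128-bit intermediate multiply:

    #include <stdint.h>
    #include <xen/xen.h>   /* struct vcpu_time_info */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Nanoseconds since the guest was launched, from vcpu_time_info. */
    static uint64_t pvh_uptime_ns(volatile struct vcpu_time_info *t)
    {
        uint32_t ver, mul;
        uint64_t system_time, tsc_timestamp, delta;
        int8_t shift;

        do {
            ver = t->version;
            __asm__ __volatile__ ("" ::: "memory");
            system_time   = t->system_time;
            tsc_timestamp = t->tsc_timestamp;
            mul           = t->tsc_to_system_mul;
            shift         = t->tsc_shift;
            __asm__ __volatile__ ("" ::: "memory");
        } while ((ver & 1) || ver != t->version);   /* retry on updates */

        delta = rdtsc() - tsc_timestamp;
        if (shift >= 0)
            delta <<= shift;
        else
            delta >>= -shift;

        /* 64x32 -> 96 bit multiply, keeping the top 64 bits. */
        return system_time +
               (uint64_t)(((unsigned __int128)delta * mul) >> 32);
    }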
> All the above is great info, not PVH specific tho. May wanna mention
> it fwiw.
>
>> ## SMP discovery and bring up ##
>>
>> The process of bringing up secondary CPUs is obviously different from
>> native, since PVH doesn't have a local APIC. The first thing to do is
>> to figure out how many vCPUs the guest has. This is done using the
>> `VCPUOP_is_up` hypercall, using for example this simple loop:
>>
>>     for (i = 0; i < MAXCPU; i++) {
>>         ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
>>         if (ret >= 0) {
>>             /* vCPU#i is present */
>>         }
>>     }
>>
>> Note that when running as Dom0, the ACPI tables might report a
>> different number of available CPUs. This is because the value in the
>> ACPI tables is the number of physical CPUs the host has, and it might
>> bear no resemblance to the number of vCPUs Dom0 actually has, so it
>> should be ignored.
>>
>> In order to bring up the secondary vCPUs they must be configured
>> first. This is achieved using the `VCPUOP_initialise` hypercall. A
>> valid context has to be passed to the vCPU in order to boot. The
>> relevant fields for PVH guests are the following:
>>
>>  * `flags`: contains VGCF_* flags (see the `arch-x86/xen.h` public
>>    header).
>>  * `user_regs`: struct that contains the register values that will
>>    be set on the vCPU before booting. The most relevant ones are
>>    `rip` and `rsp`, in order to set the start address and the stack.
>>  * `ctrlreg[3]`: contains the address of the page tables that will
>>    be used by the vCPU.
>>
>> After the vCPU is initialized with the proper values, it can be
>> started by using the `VCPUOP_up` hypercall. The values of the other
>> control registers of the vCPU will be the same as the ones described
>> in the `control registers` section.
>
> If you want, you could put a linux reference here:
>
> For an example, please see cpu_initialize_context() in
> arch/x86/xen/smp.c in linux.

Done, thanks for the comments.
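As a license-neutral illustration of that sequence, here is a rough,
untested sketch; `secondary_start`, `secondary_stack` and
`BOOT_STACK_SIZE` are placeholders for the guest's own symbols:

    #include <string.h>
    #include <xen/xen.h>    /* struct vcpu_guest_context, VGCF_* */
    #include <xen/vcpu.h>   /* VCPUOP_initialise, VCPUOP_up */

    #define BOOT_STACK_SIZE 4096

    extern void secondary_start(void);              /* new vCPU entry point */
    extern char secondary_stack[BOOT_STACK_SIZE];   /* its boot stack */

    static int pvh_start_vcpu(int cpu, unsigned long cr3)
    {
        struct vcpu_guest_context ctx;
        int rc;

        memset(&ctx, 0, sizeof(ctx));
        ctx.flags = VGCF_in_kernel;
        ctx.user_regs.rip = (unsigned long)secondary_start;
        /* The stack grows down, so point rsp at the top of the array. */
        ctx.user_regs.rsp = (unsigned long)(secondary_stack + BOOT_STACK_SIZE);
        ctx.ctrlreg[3] = cr3;   /* page tables for the new vCPU */

        rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctx);
        if (rc != 0)
            return rc;
        return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
    }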
>> ## Control operations (reboot/shutdown) ##
>>
>> Reboot and shutdown operations on PVH guests are performed using
>> hypercalls. In order to issue a reboot, a guest must use the
>> `SHUTDOWN_reboot` reason of the `SCHEDOP_shutdown` hypercall. In
>> order to perform a power off from a guest DomU, the
>> `SHUTDOWN_poweroff` reason should be used.
>>
>> The way to perform a full system power off from Dom0 is different
>> from what's done in a DomU guest. In order to perform a power off
>> from Dom0 the native ACPI path should be followed, but the guest
>> should not write the SLP_EN bit to the PM1 control register. Instead
>> the `XENPF_enter_acpi_sleep` hypercall should be used, filling the
>> following data in the `xen_platform_op` struct:
>>
>>     cmd = XENPF_enter_acpi_sleep
>>     interface_version = XENPF_INTERFACE_VERSION
>>     u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
>>     u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
>>
>> This will allow Xen to do its cleanup and to power off the system.
>> If the host is using hardware reduced ACPI, the following field
>> should also be set:
>>
>>     u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
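For illustration, a rough sketch of the Dom0 side in C, assuming a
`HYPERVISOR_platform_op` wrapper and PM1 control values already
computed by the guest's ACPI code (`pm1a_value`/`pm1b_value` are
hypothetical parameters; the draft above doesn't spell out the
`sleep_state` field, so setting it to 5 for S5/power-off is an
assumption here):

    #include <xen/xen.h>
    #include <xen/platform.h>   /* XENPF_enter_acpi_sleep, xen_platform_op */

    /* Power off the host from Dom0: hand the PM1 control values to
     * Xen instead of writing SLP_EN directly. */
    static int pvh_dom0_poweroff(uint16_t pm1a_value, uint16_t pm1b_value)
    {
        struct xen_platform_op op = {
            .cmd = XENPF_enter_acpi_sleep,
            .interface_version = XENPF_INTERFACE_VERSION,
        };

        op.u.enter_acpi_sleep.pm1a_cnt_val = pm1a_value;
        op.u.enter_acpi_sleep.pm1b_cnt_val = pm1b_value;
        op.u.enter_acpi_sleep.sleep_state = 5;   /* assumed: S5, soft off */

        return HYPERVISOR_platform_op(&op);
    }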
>> ## CPUID ##
>>
>> *TODO*: describe which cpuid flags a guest should ignore and also
>> which flags describe features that can be used. It would also be
>> good to describe the set of cpuid flags that will always be present
>> when running as PVH.
>>
>> ## Final notes ##
>>
>> All the other hardware functionality not described in this document
>> should be assumed to be performed in the same way as on native.
>>
>> [event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
>
> Great work Roger! Thanks a lot for writing it.
>
> Mukesh