
Re: [Xen-devel] [PATCH v3] docs: add PVH specification



On 20/09/14 at 21.15, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 18, 2014 at 07:19:24PM +0200, Roger Pau Monne wrote:
>> Introduce a document that describes the interfaces used on PVH. This
>> document has been designed from a guest OS point of view (i.e.: what a guest
>> needs to do in order to support PVH).
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Acked-by: David Vrabel <david.vrabel@xxxxxxxxxx>
>> Cc: Jan Beulich <JBeulich@xxxxxxxx>
>> Cc: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
>> ---
>> The document is still far from complete IMHO, but it might be best to just
>> commit what we currently have rather than wait for a full document.
>>
>> I will try to fill the gaps as I go implementing new features on FreeBSD.
>>
>> I've retained David's Ack from v2 in this version.
>> ---
>>  docs/misc/pvh.markdown | 367 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 367 insertions(+)
>>  create mode 100644 docs/misc/pvh.markdown
>>
>> diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
>> new file mode 100644
>> index 0000000..120ede7
>> --- /dev/null
>> +++ b/docs/misc/pvh.markdown
>> @@ -0,0 +1,367 @@
>> +# PVH Specification #
>> +
>> +## Rationale ##
>> +
>> +PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
>> +on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
>> +virtualization extensions present in modern x86 CPUs in order to
>> +improve performance.
>> +
>> +PVH is considered a mix between PV and HVM, and can be seen as a PV guest
>> +that runs inside of an HVM container, or as a PVHVM guest without any emulated
>> +devices. The design goal of PVH is to provide the best performance possible and
>> +to reduce the amount of modifications needed for a guest OS to run in this mode
>> +(compared to pure PV).
>> +
>> +This document tries to describe the interfaces used by PVH guests, focusing
>> +on how an OS should make use of them in order to support PVH.
>> +
>> +## Early boot ##
>> +
>> +PVH guests use the PV boot mechanism, that means that the kernel is loaded and
>> +directly launched by Xen (by jumping into the entry point). In order to do this
>> +Xen ELF Notes need to be added to the guest kernel, so that they contain the
>> +information needed by Xen. Here is an example of the ELF Notes added to the
>> +FreeBSD amd64 kernel in order to boot as PVH:
>> +
>> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "FreeBSD")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, __XSTRING(__FreeBSD_version))
>> +    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .quad,  KERNBASE)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   .quad,  KERNBASE)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .quad,  xen_start)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad,  hypercall_page)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   .quad,  HYPERVISOR_VIRT_START)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,   .long,  PG_V, PG_V)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long,  0)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB,     .asciz, "yes")
>> +
>> +On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.
> 
> s/linux/Linux/

Done.

> 
>> +
>> +It is important to highlight the following notes:
>> +
>> +  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
>> +    point.
>> +  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
>> +    hypercall page inside of the guest kernel (this memory region will be filled
>> +    by Xen prior to booting).
>> +  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
>> +    In the example above the kernel is only able to boot as a PVH guest, but
>> +    those options can be mixed with the ones used by pure PV guests in order to
>> +    have a kernel that supports both PV and PVH (like Linux). The list of
>> +    options available can be found in the `features.h` public header.
>> +
> 
> 
> Note that 'hvm_callback_vector' is in XEN_ELFNOTE_FEATURES. Older hypervisors
> will balk at this being part of it, so it can also be put in
> XEN_ELFNOTE_SUPPORTED_FEATURES, which older hypervisors will ignore.

Added to the XEN_ELFNOTE_FEATURES comment, thanks for the info.
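
For the document, an example of such a note following the style of the
listing above could look like this (untested; `XENFEAT_hvm_callback_vector`
comes from the `features.h` public header):

    ELFNOTE(Xen, XEN_ELFNOTE_SUPPORTED_FEATURES, .long, (1 << XENFEAT_hvm_callback_vector))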

>> +Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
>> +paging enabled (either long mode or protected mode with paging turned on
>> +depending on the kernel bitness) and some basic page tables setup. An important
>> +distinction for a 64bit PVH is that it is launched at privilege level 0 as
>> +opposed to a 64bit PV guest which is launched at privilege level 3.
>> +
>> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
>> +memory address where Xen has placed the `start_info` structure. The `rsp` (`esp`
>> +on 32bits) will point to the top of an initial single page stack, that can be
>> +used by the guest kernel. The `start_info` structure contains all the info the
>> +guest needs in order to initialize. More information about the contents can be
>> +found on the `xen.h` public header.
> 
> s/on/in/
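
Purely as illustration (not something from the patch), a minimal C sketch of
what a guest might do with the `start_info` pointer once the assembly entry
stub has saved it from `rsi`; all names below are made up:

struct start_info;                       /* full definition lives in the   */
typedef struct start_info start_info_t;  /* xen.h public header            */

static start_info_t *xen_start_info;     /* pointer saved from %rsi        */

/* Called by the assembly entry point (xen_start in the FreeBSD example
 * above) after it has switched to the initial stack provided by Xen. */
void xen_pvh_early_init(start_info_t *si)
{
    xen_start_info = si;
    /* Fields of interest (see xen.h): magic ("xen-<version>-<platform>"),
     * nr_pages (pages assigned to the guest), cmd_line, shared_info and
     * flags (e.g. SIF_INITDOMAIN when booted as a Dom0). */
}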
>> +
>> +### Initial amd64 control registers values ###
>> +
>> +Initial values for the control registers are set up by Xen before booting the
>> +guest kernel. The guest kernel can expect to find the following features
>> +enabled by Xen.
>> +
>> +`CR0` has the following bits set by Xen:
>> +
>> +  * PE (bit 0): protected mode enable.
>> +  * ET (bit 4): 387 or newer processor.
>> +  * PG (bit 31): paging enabled.
> 
> Also TS (at least that is what the Linux code says:
> 
> /* Some of these are setup in 'secondary_startup_64'. The others:       
> * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests     
> * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */        
> 
> Perhaps it is incorrect?

I think this comment is outdated/incorrect. This is the CR0 value I see 
on a FreeBSD PVH start-of-day:

0x80000011 (PE, ET and PG bits set)
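
For reference, 0x80000011 is exactly PG | ET | PE. A trivial start-of-day
sanity check a guest could do looks like this (plain C with inline asm,
helper names made up):

#define CR0_PE (1UL << 0)   /* protected mode enable */
#define CR0_ET (1UL << 4)   /* 387 or newer          */
#define CR0_PG (1UL << 31)  /* paging enabled        */

static inline unsigned long read_cr0(void)
{
    unsigned long cr0;
    __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
    return cr0;
}

/* 0x80000011 == CR0_PG | CR0_ET | CR0_PE, matching the dump above. */
static int cr0_looks_like_pvh_start_of_day(void)
{
    unsigned long cr0 = read_cr0();
    return (cr0 & (CR0_PE | CR0_ET | CR0_PG)) == (CR0_PE | CR0_ET | CR0_PG);
}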

> 
>> +
>> +`CR4` has the following bits set by Xen:
>> +
>> +  * PAE (bit 5): PAE enabled.
>> +
>> +And finally in `EFER` the following features are enabled:
>> +
>> +  * LME (bit 8): Long mode enable.
>> +  * LMA (bit 10): Long mode active.
>> +
>> +At least the following flags in `EFER` are guaranteed to be disabled:
>> +
>> +  * SCE (bit 0): System call extensions disabled.
>> +  * NXE (bit 11): No-Execute disabled.
>> +
>> +There's no guarantee about the state of the other bits in the `EFER` register.
>> +
>> +All the segments selectors are set with a flat base at zero.
>> +
>> +The `cs` segment selector attributes are set to 0x0a09b, which describes an
>> +executable and readable code segment only accessible by the most privileged
>> +level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag
>> +unset).
>> +
>> +The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
>> +to the same values. The attributes are set to 0x0c093, which implies a read and
>> +write data segment only accessible by the most privileged level.
> 
> I think the SS, ES, FS, GS are set to the null selector in 64-bit mode.

This is what I see when I dump the vcpu state of a PVH guest created 
with the -p option (so that the guest is never started):

(XEN) CS: sel=0x0000, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
(XEN) DS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) SS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) ES: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) FS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) GS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000

Am I missing something? I don't see a difference between SS, ES, FS,
GS and DS. In construct_vmcs on Xen we seem to set all the segments
to the same values with the exception of CS attributes.
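
In case it is useful for the document: as far as I can tell the attr values
in these dumps follow the VMX access-rights layout (type in bits 0-3, S bit
4, DPL bits 5-6, P bit 7, AVL bit 12, L bit 13, D/B bit 14, G bit 15), so
they decode as:

  0x0a09b: type=0xb (exec/read, accessed), S=1, DPL=0, P=1, L=1, D/B=0, G=1
  0x0c093: type=0x3 (read/write, accessed), S=1, DPL=0, P=1, L=0, D/B=1, G=1

which matches the prose (a 64-bit code segment versus flat read/write data
segments).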

>> +
>> +The `FS.base` and `GS.base` MSRs are zeroed out.
> 
> .. and 'KERNEL_GS.base'

Done.
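
For completeness: the guest reloads these with a plain `wrmsr` once it has
its own per-CPU data; a minimal sketch (helper and symbol names made up):

#include <stdint.h>

#define MSR_FS_BASE        0xc0000100u
#define MSR_GS_BASE        0xc0000101u
#define MSR_KERNEL_GS_BASE 0xc0000102u

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ __volatile__("wrmsr" : : "c"(msr),
                         "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* e.g. point GS.base at a per-CPU area before any %gs-relative access. */
static void set_percpu_base(void *pcpu)
{
    wrmsr(MSR_GS_BASE, (uint64_t)(uintptr_t)pcpu);
}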

>> +
>> +The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to
>> +not trigger a fault until after they have been properly set. The way of setting
>> +the IDT and the GDT is using the native instructions as would be done on bare
>> +metal.
>> +
>> +The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
>> +entry point, with the exception of the reserved bit 1 set.

[...]
>> +## Interrupts ##
>> +
>> +All interrupts on PVH guests are routed over event channels, see
>> +[Event Channel Internals][event_channels] for more detailed information about
>> +event channels. In order to inject interrupts into the guest an IDT vector is
>> +used. This is the same mechanism used on PVHVM guests, and allows having
>> +per-cpu interrupts that can be used to deliver timers or IPIs.
>> +
>> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
>> +is used with the following values:
>> +
>> +    domid = DOMID_SELF
>> +    index = HVM_PARAM_CALLBACK_IRQ
>> +    value = (0x2 << 56) | vector_value
> 
> And naturally the OS has to program the IDT for the 'vector_value' using
> the baremetal mechanism.

Added.
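
In case an example is useful for the document, a rough C sketch of the
registration; it assumes the OS provides the usual HYPERVISOR_hvm_op
hypercall wrapper and the public header definitions (Linux-style include
paths shown, FreeBSD's differ), and the vector number is made up:

#include <xen/interface/xen.h>          /* DOMID_SELF                      */
#include <xen/interface/hvm/hvm_op.h>   /* HVMOP_set_param, xen_hvm_param  */
#include <xen/interface/hvm/params.h>   /* HVM_PARAM_CALLBACK_IRQ          */
#include <asm/xen/hypercall.h>          /* HYPERVISOR_hvm_op               */

#define CALLBACK_TYPE_VECTOR  2ULL   /* "deliver via IDT vector", as above */
#define XEN_CALLBACK_VECTOR   0x93   /* hypothetical, chosen by the OS     */

static int register_callback_vector(void)
{
    struct xen_hvm_param xhp = {
        .domid = DOMID_SELF,
        .index = HVM_PARAM_CALLBACK_IRQ,
        .value = (CALLBACK_TYPE_VECTOR << 56) | XEN_CALLBACK_VECTOR,
    };

    /* The OS must also install a handler for XEN_CALLBACK_VECTOR in its
     * IDT, exactly as it would for a native interrupt vector. */
    return HYPERVISOR_hvm_op(HVMOP_set_param, &xhp);
}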

[...]
>> +## CPUID ##
>> +
>> +*TODO*: describe which cpuid flags a guest should ignore and also which flags
>> +describe features that can be used. It would also be good to describe the set of
>> +cpuid flags that will always be present when running as PVH.
> 
> Perhaps start with: 
> The cpuid instruction that should be used is the normal 'cpuid', not
> the emulated 'cpuid' that PV guests usually require.

Done.
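
To make that concrete, a small sketch of the usual hypervisor signature scan
done with the plain `cpuid` instruction (no PV forced-emulation prefix);
helper names are made up:

#include <stdint.h>
#include <string.h>

static inline void cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                         uint32_t *ecx, uint32_t *edx)
{
    *eax = leaf;
    *ecx = 0;
    __asm__ __volatile__("cpuid"
                         : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                         : "0"(*eax), "2"(*ecx));
}

/* Look for the Xen signature ("XenVMMXenVMM") in the hypervisor leaves,
 * which start at 0x40000000 (possibly offset in steps of 0x100). */
static uint32_t xen_cpuid_base(void)
{
    uint32_t base, eax, regs[3];

    for (base = 0x40000000; base < 0x40010000; base += 0x100) {
        cpuid(base, &eax, &regs[0], &regs[1], &regs[2]);
        if (memcmp(regs, "XenVMMXenVMM", 12) == 0 && eax >= base + 2)
            return base;
    }
    return 0;
}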

> 
>> +
>> +## Final notes ##
>> +
>> +All the other hardware functionality not described in this document should be
>> +assumed to be performed in the same way as native.
>> +
>> +[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
> 
> And with those changes:
> 
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> 
>> -- 
>> 1.8.5.2 (Apple Git-48)
>>
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel