Xen project Mailing List

Re: [Xen-devel] [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>

Date: Mon, 18 Mar 2013 18:00:36 -0700

Cc: "Xen-devel@xxxxxxxxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Tue, 19 Mar 2013 01:01:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, 18 Mar 2013 11:28:43 -0400 Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote: > On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote: > > This patch mainly contains code to create a VMCS for PVH guest, and > > HVM specific vcpu/domain creation code. > > > > Changes in V2: > > - Avoid call to hvm_do_resume() at call site rather than return > > in it. > > - Return for PVH vmx_do_resume prior to intel debugger stuff. > > > > Signed-off-by: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> > > --- > > xen/arch/x86/hvm/hvm.c | 90 ++++++++++++++- > > xen/arch/x86/hvm/vmx/vmcs.c | 266 > > ++++++++++++++++++++++++++++++++++++++++++- > > xen/arch/x86/hvm/vmx/vmx.c | 34 ++++++ 3 files changed, 383 > > insertions(+), 7 deletions(-) > > > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > > index ea7adf6..18889ad 100644 > > --- a/xen/arch/x86/hvm/hvm.c > > +++ b/xen/arch/x86/hvm/hvm.c > > @@ -510,6 +510,29 @@ static int hvm_print_line( > > return X86EMUL_OKAY; > > } > > > > +static int hvm_pvh_dom_initialise(struct domain *d) > > +{ > > + int rc; > > + > > + if (!d->arch.hvm_domain.hap_enabled) > > + return -EINVAL; > > + > > + spin_lock_init(&d->arch.hvm_domain.irq_lock); > > + hvm_init_guest_time(d); > > + > > + hvm_init_cacheattr_region_list(d); > > + > > + if ( (rc=paging_enable(d, > > PG_refcounts|PG_translate|PG_external)) != 0 ) > > + goto fail1; <=================================== GOTO > > + > > + if ( (rc = hvm_funcs.domain_initialise(d)) == 0 ) > > + return 0; > > + > > +fail1: > > I don't think you need the label here? You are not doing an goto. Right above. > > long))hvm_assert_evtchn_irq, > > + (unsigned long)v ); > > + > > + v->arch.hvm_vcpu.hcall_64bit = 1; > > + v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN; > > + v->arch.user_regs.eflags = 2; > > So that sets the Reserved flag. Could you include a comment > explaining why.. Ah, is it b/c we later on bit-shift it and use it to > figure out whether IOPL needs to be virtualized in > arch_set_info_guest? Or is it just b/c this function is based off > hvm_vcpu_initialise? If so, since you are being called by it, can you > skip it? That resvd bit is required to be set for bootstrap. Set in other places also, like arch_set_info_guest(): v->arch.user_regs.eflags |= 2; > > > + v->arch.hvm_vcpu.inject_trap.vector = -1; > > + > > + if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) { > > The syntax here is off. Hmm... space surrounding "=" in rc=hvm* ? > > int hvm_vcpu_initialise(struct vcpu *v) > > { > > int rc; > > struct domain *d = v->domain; > > - domid_t dm_domid = > > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN]; > > + domid_t dm_domid; > > Not sure I follow, why the move of it further down? params is not defined/allocated for PVH. > > + /* VMCS controls. */ > > + vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS; > > + __vmwrite(PIN_BASED_VM_EXEC_CONTROL, > > vmx_pin_based_exec_control); + > > + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control; > > + > > + /* if rdtsc exiting is turned on and it goes thru > > emulate_privileged_op, > > + * then pv_vcpu.ctrlreg must be added to pvh struct */ > > That would be the 'timer_mode' syntax in the guest config right? > Perhaps then a check at the top of the function to see which > timer_mode is used and exit out with -ENOSYS? The vtsc setting. We set it to 0 for PVH guests. > > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING; > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; > > + > > + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING | > > + CPU_BASED_CR3_LOAD_EXITING | > > + CPU_BASED_CR3_STORE_EXITING); > > + v->arch.hvm_vmx.exec_control |= > > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > > Is that b/c the PV code ends up making the SCHED_yield_op hypercall > and we don't need the monitor/mwait capability? If so, could you add > that comment in please? No, MTF is debugging feature used mostly for single step. > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, > > msr_type); > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, > > msr_type); > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, > > msr_type); > > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, > > msr_type); > > So this looks like the one vmcs.c except that one has this extra: > > 895 if ( cpu_has_vmx_pat && paging_mode_hap(d) ) > 896 vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, > MSR_TYPE_R | MSR_TYPE_W); 897 } > > Did you miss that? I'll add it. I guess default must be disabled. > > + > > + /* pure hvm doesn't do this. safe? see: > > long_mode_do_msr_write() */ +#if 0 > > + vmx_disable_intercept_for_msr(v, MSR_STAR); > > + vmx_disable_intercept_for_msr(v, MSR_LSTAR); > > + vmx_disable_intercept_for_msr(v, MSR_CSTAR); > > + vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK); > > +#endif > > I would just provide a comment saying: > > /* > * long_mode_do_msr_write() takes care of > MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */ Good Idea. I left the "#if 0" there for suggestion. > > + } else { > > + printk("PVH: CPU does NOT have msr bitmap\n"); > > Perhaps: > > printk(XENLOG_G_ERR "%s: ..\n", __func__); > ? > > + return -EINVAL; > > + } > > + > > + if ( !cpu_has_vmx_vpid ) { > > + printk("PVH: At present VPID support is required to run > > PVH\n"); > > Should you de-allocate msr_bitmap at this point? > > Or perhaps move this check (and the one below) to the start of the > function? So you have: > > if ( !cpu_has_vmx_vpid ) > gdprintk ("%s: VPID required for PVH mode.\n", > __func__); > > if ( !cpu_has_vmx_secondary_exec_control ) > .. bla bla > > > > + return -EINVAL; > > + } > > + > > + v->arch.hvm_vmx.secondary_exec_control = > > vmx_secondary_exec_control; + > > + if ( cpu_has_vmx_secondary_exec_control ) { > > + v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn > > off all */ > > + v->arch.hvm_vmx.secondary_exec_control |= > > + > > SECONDARY_EXEC_PAUSE_LOOP_EXITING; > > + v->arch.hvm_vmx.secondary_exec_control |= > > SECONDARY_EXEC_ENABLE_VPID; + > > + v->arch.hvm_vmx.secondary_exec_control |= > > SECONDARY_EXEC_ENABLE_EPT; > > + __vmwrite(SECONDARY_VM_EXEC_CONTROL, > > + v->arch.hvm_vmx.secondary_exec_control); > > + } else { > > + printk("PVH: NO Secondary Exec control\n"); > > + return -EINVAL; > > Ditto - should you de-allocate msr_bitmap ? Or if you are going to > move the check for cpu_has_vmx_secondary_exec_control, then there is > no need for this if (.. ) else .. > > > > + } > > + > > + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); > > + > > + #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4 > > + #define VM_ENTRY_LOAD_EFER 0x8000 > > + vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS; > > + vmentry_ctl &= ~VM_ENTRY_LOAD_EFER; > > + vmentry_ctl &= ~VM_ENTRY_SMM; > > + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR; > > + vmentry_ctl |= VM_ENTRY_IA32E_MODE; > > + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); > > + > > From here on, it looks mostly the same as construct_vmcs right? > > Perhaps you can add a comment saying so - so when a cleanup effort > is done later on - these can be candidates for it? > > > + /* MSR intercepts. */ > > + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); > > + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0); > > + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); > > + > > + /* Host data selectors. */ > > + __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS); > > + __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS); > > + __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS); > > + __vmwrite(HOST_FS_SELECTOR, 0); > > + __vmwrite(HOST_GS_SELECTOR, 0); > > + __vmwrite(HOST_FS_BASE, 0); > > + __vmwrite(HOST_GS_BASE, 0); > > + > > + vmx_set_host_env(v); > > + > > + /* Host control registers. */ > > + v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS; > > + __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); > > + __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ? > > X86_CR4_OSXSAVE : 0)); > > That formatting looks odd. Copied from hvm code. whats wrong? > > + /* Set default guest context values here. Some of these are > > then overwritten > > + * in vmx_pvh_set_vcpu_info() by guest itself during vcpu > > bringup */ > > + __vmwrite(GUEST_CS_BASE, 0); > > + __vmwrite(GUEST_CS_LIMIT, ~0u); > > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */ > > + __vmwrite(GUEST_CS_SELECTOR, 0x10); > > 0x10. Could you use a #define for it? Somehow I thought it would > be running in FLAT_KERNEL_CS but that would be odd. And of course > since are booting in the Linux kernel without the PV MMU we would > be using its native segments. So this would correspond to > GDT_ENTRY_KERNEL_CS right? Might want to mention that > so if the Linux kernel alters its GDT page we don't blow up? > > Thought I guess it does not matter - this is just the initial > bootstrap segments. Presumarily the load_gdt in the Linux kernel > later on resets it to whatever the "new" GDT is. Correct: #define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8) And load_gdt loads a new GDT natively. > > + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); > > + __vmwrite(GUEST_DR7, 0); > > + __vmwrite(VMCS_LINK_POINTER, ~0UL); > > + > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0); > > Weird. In the vmcs.c file these are somewhat higher in the code. Yes. I just didn't copy the existing function, but created PVH function to make it easier for PVH. > > + > > + v->arch.hvm_vmx.exception_bitmap = > > + HVM_TRAP_MASK | (1 << > > TRAP_debug) | > > + (1U << TRAP_int3) | (1U << > > TRAP_no_device); > > Odd syntax. Similar to existing hvm code, whats wrong? > > + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap); > > + > > + __vmwrite(TSC_OFFSET, 0); > > Hm, so you did earlier: > > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; > > so is this neccessary? Or is just that you want it to be set > to default baseline state? Not necessary, doesn't hurt either. I can remove it. > > + > > + /* Set WP bit so rdonly pages are not written from CPL 0 */ > > + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP; > > + __vmwrite(GUEST_CR0, tmpval); > > + __vmwrite(CR0_READ_SHADOW, tmpval); > > + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = > > tmpval; + > > + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); > > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; > > + if ( (tmpval & required) != required ) > > + { > > + printk("PVH: required CR4 features not available:%lx\n", > > required); > > + return -EINVAL; > > You might want to move that to the top of the code. Or if you want > it here, then at least free the msr_bitmap I think I'll just move all the checks top of the code. > > { > > struct domain *d = v->domain; > > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v) > > > > vmx_vmcs_enter(v); > > > > + if ( is_pvh_vcpu(v) ) { > > + int rc = pvh_construct_vmcs(v); > > + vmx_vmcs_exit(v); > > Do you need to call paging_update_paging_modes as construct_vmcs() > does? Nop. We don't need to as the arch_set_info_guest() does it for PVH. Thanks Konrad. Mukesh _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.