Xen project Mailing List

Re: [Xen-devel] [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization

To: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Tue, 19 Mar 2013 09:23:30 -0400

Cc: "Xen-devel@xxxxxxxxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Tue, 19 Mar 2013 13:23:59 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, Mar 18, 2013 at 06:00:36PM -0700, Mukesh Rathor wrote: > On Mon, 18 Mar 2013 11:28:43 -0400 > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote: > > > On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote: > > > This patch mainly contains code to create a VMCS for PVH guest, and > > > HVM specific vcpu/domain creation code. > > > > > > Changes in V2: > > > - Avoid call to hvm_do_resume() at call site rather than return > > > in it. > > > - Return for PVH vmx_do_resume prior to intel debugger stuff. > > > > > > Signed-off-by: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> > > > --- > > > xen/arch/x86/hvm/hvm.c | 90 ++++++++++++++- > > > xen/arch/x86/hvm/vmx/vmcs.c | 266 > > > ++++++++++++++++++++++++++++++++++++++++++- > > > xen/arch/x86/hvm/vmx/vmx.c | 34 ++++++ 3 files changed, 383 > > > insertions(+), 7 deletions(-) > > > > > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > > > index ea7adf6..18889ad 100644 > > > --- a/xen/arch/x86/hvm/hvm.c > > > +++ b/xen/arch/x86/hvm/hvm.c > > > @@ -510,6 +510,29 @@ static int hvm_print_line( > > > return X86EMUL_OKAY; > > > } > > > > > > +static int hvm_pvh_dom_initialise(struct domain *d) > > > +{ > > > + int rc; > > > + > > > + if (!d->arch.hvm_domain.hap_enabled) > > > + return -EINVAL; > > > + > > > + spin_lock_init(&d->arch.hvm_domain.irq_lock); > > > + hvm_init_guest_time(d); > > > + > > > + hvm_init_cacheattr_region_list(d); > > > + > > > + if ( (rc=paging_enable(d, > > > PG_refcounts|PG_translate|PG_external)) != 0 ) > > > + goto fail1; <=================================== GOTO > > > + > > > + if ( (rc = hvm_funcs.domain_initialise(d)) == 0 ) > > > + return 0; > > > + > > > +fail1: > > > > I don't think you need the label here? You are not doing an goto. > > Right above. Duh! > > > > long))hvm_assert_evtchn_irq, > > > + (unsigned long)v ); > > > + > > > + v->arch.hvm_vcpu.hcall_64bit = 1; > > > + v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN; > > > + v->arch.user_regs.eflags = 2; > > > > So that sets the Reserved flag. Could you include a comment > > explaining why.. Ah, is it b/c we later on bit-shift it and use it to > > figure out whether IOPL needs to be virtualized in > > arch_set_info_guest? Or is it just b/c this function is based off > > hvm_vcpu_initialise? If so, since you are being called by it, can you > > skip it? > > That resvd bit is required to be set for bootstrap. Set in other places > also, like arch_set_info_guest(): > > v->arch.user_regs.eflags |= 2; OK, but why? I am not seeing that bit defined? Or was it needed to get the machine to boot up. > > > > > > + v->arch.hvm_vcpu.inject_trap.vector = -1; > > > + > > > + if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) { > > > > The syntax here is off. > > Hmm... space surrounding "=" in rc=hvm* ? Yes, like this: 385 if ( (rc = vcpu_init_fpu(v)) != 0 ) > > > > int hvm_vcpu_initialise(struct vcpu *v) > > > { > > > int rc; > > > struct domain *d = v->domain; > > > - domid_t dm_domid = > > > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN]; > > > + domid_t dm_domid; > > > > Not sure I follow, why the move of it further down? > > params is not defined/allocated for PVH. But you are still using it in the code. As in, you are still fetching it (just later). > > > > + /* VMCS controls. */ > > > + vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS; > > > + __vmwrite(PIN_BASED_VM_EXEC_CONTROL, > > > vmx_pin_based_exec_control); + > > > + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control; > > > + > > > + /* if rdtsc exiting is turned on and it goes thru > > > emulate_privileged_op, > > > + * then pv_vcpu.ctrlreg must be added to pvh struct */ > > > > That would be the 'timer_mode' syntax in the guest config right? > > Perhaps then a check at the top of the function to see which > > timer_mode is used and exit out with -ENOSYS? > > The vtsc setting. We set it to 0 for PVH guests. > OK, so that is the the 'timer_mode' always set to '2' or rather timer_mode="native" (in the guest config). Which then does the xc_domain_set_tsc_info hypercall to set the vtsc. You need to document that please. > > > > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING; > > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; > > > + > > > + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING | > > > + CPU_BASED_CR3_LOAD_EXITING | > > > + CPU_BASED_CR3_STORE_EXITING); > > > + v->arch.hvm_vmx.exec_control |= > > > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > > > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > > > > Is that b/c the PV code ends up making the SCHED_yield_op hypercall > > and we don't need the monitor/mwait capability? If so, could you add > > that comment in please? > > No, MTF is debugging feature used mostly for single step. > > > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, > > > msr_type); > > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, > > > msr_type); > > > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, > > > msr_type); > > > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, > > > msr_type); > > > > So this looks like the one vmcs.c except that one has this extra: > > > > 895 if ( cpu_has_vmx_pat && paging_mode_hap(d) ) > > 896 vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, > > MSR_TYPE_R | MSR_TYPE_W); 897 } > > > > Did you miss that? > > I'll add it. I guess default must be disabled. > > > > + > > > + /* pure hvm doesn't do this. safe? see: > > > long_mode_do_msr_write() */ +#if 0 > > > + vmx_disable_intercept_for_msr(v, MSR_STAR); > > > + vmx_disable_intercept_for_msr(v, MSR_LSTAR); > > > + vmx_disable_intercept_for_msr(v, MSR_CSTAR); > > > + vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK); > > > +#endif > > > > I would just provide a comment saying: > > > > /* > > * long_mode_do_msr_write() takes care of > > MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */ > > Good Idea. I left the "#if 0" there for suggestion. > > > > > + } else { > > > + printk("PVH: CPU does NOT have msr bitmap\n"); > > > > Perhaps: > > > > printk(XENLOG_G_ERR "%s: ..\n", __func__); > > ? > > > + return -EINVAL; > > > + } > > > + > > > + if ( !cpu_has_vmx_vpid ) { > > > + printk("PVH: At present VPID support is required to run > > > PVH\n"); > > > > Should you de-allocate msr_bitmap at this point? > > > > Or perhaps move this check (and the one below) to the start of the > > function? So you have: > > > > if ( !cpu_has_vmx_vpid ) > > gdprintk ("%s: VPID required for PVH mode.\n", > > __func__); > > > > if ( !cpu_has_vmx_secondary_exec_control ) > > .. bla bla > > > > > > > + return -EINVAL; > > > + } > > > + > > > + v->arch.hvm_vmx.secondary_exec_control = > > > vmx_secondary_exec_control; + > > > + if ( cpu_has_vmx_secondary_exec_control ) { > > > + v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn > > > off all */ > > > + v->arch.hvm_vmx.secondary_exec_control |= > > > + > > > SECONDARY_EXEC_PAUSE_LOOP_EXITING; > > > + v->arch.hvm_vmx.secondary_exec_control |= > > > SECONDARY_EXEC_ENABLE_VPID; + > > > + v->arch.hvm_vmx.secondary_exec_control |= > > > SECONDARY_EXEC_ENABLE_EPT; > > > + __vmwrite(SECONDARY_VM_EXEC_CONTROL, > > > + v->arch.hvm_vmx.secondary_exec_control); > > > + } else { > > > + printk("PVH: NO Secondary Exec control\n"); > > > + return -EINVAL; > > > > Ditto - should you de-allocate msr_bitmap ? Or if you are going to > > move the check for cpu_has_vmx_secondary_exec_control, then there is > > no need for this if (.. ) else .. > > > > > > > + } > > > + > > > + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); > > > + > > > + #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4 > > > + #define VM_ENTRY_LOAD_EFER 0x8000 > > > + vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS; > > > + vmentry_ctl &= ~VM_ENTRY_LOAD_EFER; > > > + vmentry_ctl &= ~VM_ENTRY_SMM; > > > + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR; > > > + vmentry_ctl |= VM_ENTRY_IA32E_MODE; > > > + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); > > > + > > > > From here on, it looks mostly the same as construct_vmcs right? > > > > Perhaps you can add a comment saying so - so when a cleanup effort > > is done later on - these can be candidates for it? > > > > > + /* MSR intercepts. */ > > > + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); > > > + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0); > > > + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); > > > + > > > + /* Host data selectors. */ > > > + __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS); > > > + __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS); > > > + __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS); > > > + __vmwrite(HOST_FS_SELECTOR, 0); > > > + __vmwrite(HOST_GS_SELECTOR, 0); > > > + __vmwrite(HOST_FS_BASE, 0); > > > + __vmwrite(HOST_GS_BASE, 0); > > > + > > > + vmx_set_host_env(v); > > > + > > > + /* Host control registers. */ > > > + v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS; > > > + __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); > > > + __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ? > > > X86_CR4_OSXSAVE : 0)); > > > > That formatting looks odd. > > Copied from hvm code. whats wrong? The HVM code has some extra spaces. > > > > + /* Set default guest context values here. Some of these are > > > then overwritten > > > + * in vmx_pvh_set_vcpu_info() by guest itself during vcpu > > > bringup */ > > > + __vmwrite(GUEST_CS_BASE, 0); > > > + __vmwrite(GUEST_CS_LIMIT, ~0u); > > > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */ > > > + __vmwrite(GUEST_CS_SELECTOR, 0x10); > > > > 0x10. Could you use a #define for it? Somehow I thought it would > > be running in FLAT_KERNEL_CS but that would be odd. And of course > > since are booting in the Linux kernel without the PV MMU we would > > be using its native segments. So this would correspond to > > GDT_ENTRY_KERNEL_CS right? Might want to mention that > > so if the Linux kernel alters its GDT page we don't blow up? > > > > Thought I guess it does not matter - this is just the initial > > bootstrap segments. Presumarily the load_gdt in the Linux kernel > > later on resets it to whatever the "new" GDT is. > > Correct: > #define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8) > > And load_gdt loads a new GDT natively. > > > > > + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); > > > + __vmwrite(GUEST_DR7, 0); > > > + __vmwrite(VMCS_LINK_POINTER, ~0UL); > > > + > > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); > > > + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0); > > > > Weird. In the vmcs.c file these are somewhat higher in the code. > > Yes. I just didn't copy the existing function, but created PVH function > to make it easier for PVH. Sure, but the next step (not these patches) will be to collapse some of the redundant logic. > > > > + > > > + v->arch.hvm_vmx.exception_bitmap = > > > + HVM_TRAP_MASK | (1 << > > > TRAP_debug) | > > > + (1U << TRAP_int3) | (1U << > > > TRAP_no_device); > > > > Odd syntax. > > Similar to existing hvm code, whats wrong? Tabs. The HVM looks different (I think it uses spaces?) > > > > > + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap); > > > + > > > + __vmwrite(TSC_OFFSET, 0); > > > > Hm, so you did earlier: > > > > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; > > > > so is this neccessary? Or is just that you want it to be set > > to default baseline state? > > Not necessary, doesn't hurt either. I can remove it. > > > > + > > > + /* Set WP bit so rdonly pages are not written from CPL 0 */ > > > + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP; > > > + __vmwrite(GUEST_CR0, tmpval); > > > + __vmwrite(CR0_READ_SHADOW, tmpval); > > > + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = > > > tmpval; + > > > + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); > > > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; > > > + if ( (tmpval & required) != required ) > > > + { > > > + printk("PVH: required CR4 features not available:%lx\n", > > > required); > > > + return -EINVAL; > > > > You might want to move that to the top of the code. Or if you want > > it here, then at least free the msr_bitmap > > I think I'll just move all the checks top of the code. > > > > { > > > struct domain *d = v->domain; > > > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v) > > > > > > vmx_vmcs_enter(v); > > > > > > + if ( is_pvh_vcpu(v) ) { > > > + int rc = pvh_construct_vmcs(v); > > > + vmx_vmcs_exit(v); > > > > Do you need to call paging_update_paging_modes as construct_vmcs() > > does? > > Nop. We don't need to as the arch_set_info_guest() does it for PVH. Ok, could you provide a comment please? That way when one looks at this code and the HVM one it will be clear. > > > Thanks Konrad. Of course. Thank you for bearing with this tiring review process. > Mukesh > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.