[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization



On Mon, Mar 18, 2013 at 06:00:36PM -0700, Mukesh Rathor wrote:
> On Mon, 18 Mar 2013 11:28:43 -0400
> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> 
> > On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote:
> > > This patch mainly contains code to create a VMCS for PVH guest, and
> > > HVM specific vcpu/domain creation code.
> > > 
> > > Changes in V2:
> > >   - Avoid call to hvm_do_resume() at call site rather than return
> > > in it.
> > >   - Return for PVH vmx_do_resume prior to intel debugger stuff.
> > > 
> > > Signed-off-by: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>
> > > ---
> > >  xen/arch/x86/hvm/hvm.c      |   90 ++++++++++++++-
> > >  xen/arch/x86/hvm/vmx/vmcs.c |  266
> > > ++++++++++++++++++++++++++++++++++++++++++-
> > > xen/arch/x86/hvm/vmx/vmx.c  |   34 ++++++ 3 files changed, 383
> > > insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > > index ea7adf6..18889ad 100644
> > > --- a/xen/arch/x86/hvm/hvm.c
> > > +++ b/xen/arch/x86/hvm/hvm.c
> > > @@ -510,6 +510,29 @@ static int hvm_print_line(
> > >      return X86EMUL_OKAY;
> > >  }
> > >  
> > > +static int hvm_pvh_dom_initialise(struct domain *d)
> > > +{
> > > +    int rc;
> > > +
> > > +    if (!d->arch.hvm_domain.hap_enabled)
> > > +        return -EINVAL;
> > > +
> > > +    spin_lock_init(&d->arch.hvm_domain.irq_lock);
> > > +    hvm_init_guest_time(d);
> > > +
> > > +    hvm_init_cacheattr_region_list(d);
> > > +
> > > +    if ( (rc=paging_enable(d,
> > > PG_refcounts|PG_translate|PG_external)) != 0 )
> > > +        goto fail1;        <=================================== GOTO
> > > +
> > > +    if ( (rc = hvm_funcs.domain_initialise(d)) == 0 )
> > > +        return 0;
> > > +
> > > +fail1:
> > 
> > I don't think you need the label here? You are not doing an goto.
> 
> Right above.

Duh!
> 
> > > long))hvm_assert_evtchn_irq,
> > > +                          (unsigned long)v );
> > > +
> > > +    v->arch.hvm_vcpu.hcall_64bit = 1;
> > > +    v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN;
> > > +    v->arch.user_regs.eflags = 2;
> > 
> > So that sets the Reserved flag. Could you include a comment
> > explaining why.. Ah, is it b/c we later on bit-shift it and use it to
> > figure out whether IOPL needs to be virtualized in
> > arch_set_info_guest? Or is it just b/c this function is based off
> > hvm_vcpu_initialise? If so, since you are being called by it, can you
> > skip it?
> 
> That resvd bit is required to be set for bootstrap. Set in other places
> also, like arch_set_info_guest():
> 
> v->arch.user_regs.eflags |= 2;

OK, but why? I am not seeing that bit defined? Or was it needed to get
the machine to boot up.
> 
> > 
> > > +    v->arch.hvm_vcpu.inject_trap.vector = -1;
> > > +
> > > +    if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) {
> > 
> > The syntax here is off.
> 
> Hmm... space surrounding "=" in rc=hvm* ?

Yes, like this:
 385     if ( (rc = vcpu_init_fpu(v)) != 0 )

> 
> > >  int hvm_vcpu_initialise(struct vcpu *v)
> > >  {
> > >      int rc;
> > >      struct domain *d = v->domain;
> > > -    domid_t dm_domid =
> > > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> > > +    domid_t dm_domid; 
> > 
> > Not sure I follow, why the move of it further down?
> 
> params is not defined/allocated for PVH.

But you are still using it in the code. As in, you are still
fetching it (just later).

> 
> > > +    /* VMCS controls. */
> > > +    vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS;
> > > +    __vmwrite(PIN_BASED_VM_EXEC_CONTROL,
> > > vmx_pin_based_exec_control); +
> > > +    v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
> > > +
> > > +    /* if rdtsc exiting is turned on and it goes thru
> > > emulate_privileged_op,
> > > +     * then pv_vcpu.ctrlreg must be added to pvh struct */
> > 
> > That would be the 'timer_mode' syntax in the guest config right?
> > Perhaps then a check at the top of the function to see which
> > timer_mode is used and exit out with -ENOSYS?
> 
> The vtsc setting. We set it to 0 for PVH guests.
> 

OK, so that is the the 'timer_mode' always set to '2' or rather
timer_mode="native" (in the guest config). Which then does
the xc_domain_set_tsc_info hypercall to set the vtsc.

You need to document that please.

> > 
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > > +
> > > +    v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
> > > +                                      CPU_BASED_CR3_LOAD_EXITING |
> > > +                                      CPU_BASED_CR3_STORE_EXITING);
> > > +    v->arch.hvm_vmx.exec_control |=
> > > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
> > 
> > Is that b/c the PV code ends up making the SCHED_yield_op hypercall
> > and we don't need the monitor/mwait capability? If so, could you add
> > that comment in please?
> 
> No, MTF is debugging feature used mostly for single step.
> 
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP,
> > > msr_type);
> > > +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
> > > msr_type);
> > 
> > So this looks like the one vmcs.c except that one has this extra:
> > 
> >  895         if ( cpu_has_vmx_pat && paging_mode_hap(d) )
> >  896             vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
> > MSR_TYPE_R | MSR_TYPE_W); 897     }
> > 
> > Did you miss that?
> 
> I'll add it. I guess default must be disabled.
> 
> > > +
> > > +        /* pure hvm doesn't do this. safe? see:
> > > long_mode_do_msr_write() */ +#if 0
> > > +        vmx_disable_intercept_for_msr(v, MSR_STAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_LSTAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_CSTAR);
> > > +        vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK);
> > > +#endif
> > 
> > I would just provide a comment saying:
> > 
> >     /*
> >     * long_mode_do_msr_write() takes care of
> > MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */
> 
> Good Idea. I left the "#if 0" there for suggestion.
> 
> 
> > > +    } else {
> > > +        printk("PVH: CPU does NOT have msr bitmap\n");
> > 
> > Perhaps:
> > 
> >     printk(XENLOG_G_ERR "%s: ..\n", __func__);
> > ?
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    if ( !cpu_has_vmx_vpid ) {
> > > +        printk("PVH: At present VPID support is required to run
> > > PVH\n");
> > 
> > Should you de-allocate msr_bitmap at this point?
> > 
> > Or perhaps move this check (and the one below) to the start of the
> > function? So you have:
> > 
> >     if ( !cpu_has_vmx_vpid )
> >             gdprintk ("%s: VPID required for PVH mode.\n",
> > __func__);
> > 
> >     if ( !cpu_has_vmx_secondary_exec_control )
> >             .. bla bla
> > 
> > 
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    v->arch.hvm_vmx.secondary_exec_control =
> > > vmx_secondary_exec_control; +
> > > +    if ( cpu_has_vmx_secondary_exec_control ) {
> > > +        v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn
> > > off all */
> > > +        v->arch.hvm_vmx.secondary_exec_control |= 
> > > +
> > > SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> > > +        v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_VPID; +
> > > +        v->arch.hvm_vmx.secondary_exec_control |=
> > > SECONDARY_EXEC_ENABLE_EPT;
> > > +        __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> > > +                  v->arch.hvm_vmx.secondary_exec_control);
> > > +    } else {
> > > +        printk("PVH: NO Secondary Exec control\n");
> > > +        return -EINVAL;
> > 
> > Ditto - should you de-allocate msr_bitmap ? Or if you are going to
> > move the check for cpu_has_vmx_secondary_exec_control, then there is
> > no need for this if (.. ) else ..
> > 
> > 
> > > +    }
> > > +
> > > +    __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
> > > +
> > > +    #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4
> > > +    #define VM_ENTRY_LOAD_EFER 0x8000
> > > +    vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS;
> > > +    vmentry_ctl &= ~VM_ENTRY_LOAD_EFER;
> > > +    vmentry_ctl &= ~VM_ENTRY_SMM;
> > > +    vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
> > > +    vmentry_ctl |= VM_ENTRY_IA32E_MODE;
> > > +    __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
> > > +
> > 
> > From here on, it looks mostly the same as construct_vmcs right?
> > 
> > Perhaps you can add a comment saying so - so when a cleanup effort
> > is done later on - these can be candidates for it?
> > 
> > > +    /* MSR intercepts. */
> > > +    __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
> > > +    __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
> > > +    __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
> > > +
> > > +    /* Host data selectors. */
> > > +    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> > > +    __vmwrite(HOST_FS_SELECTOR, 0);
> > > +    __vmwrite(HOST_GS_SELECTOR, 0);
> > > +    __vmwrite(HOST_FS_BASE, 0);
> > > +    __vmwrite(HOST_GS_BASE, 0);
> > > +
> > > +    vmx_set_host_env(v);
> > > +
> > > +    /* Host control registers. */
> > > +    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> > > +    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> > > +    __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ?
> > > X86_CR4_OSXSAVE : 0));
> > 
> > That formatting looks odd. 
> 
> Copied from hvm code. whats wrong?

The HVM code has some extra spaces.

> 
> > > +    /* Set default guest context values here. Some of these are
> > > then overwritten
> > > +     * in vmx_pvh_set_vcpu_info() by guest itself during vcpu
> > > bringup */
> > > +    __vmwrite(GUEST_CS_BASE, 0);
> > > +    __vmwrite(GUEST_CS_LIMIT, ~0u);
> > > +    __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */
> > > +    __vmwrite(GUEST_CS_SELECTOR, 0x10);
> > 
> > 0x10. Could you use a #define for it? Somehow I thought it would
> > be running in FLAT_KERNEL_CS but that would be odd. And of course
> > since are booting in the Linux kernel without the PV MMU we would
> > be using its native segments. So this would correspond to
> > GDT_ENTRY_KERNEL_CS right? Might want to mention that
> > so if the Linux kernel alters its GDT page we don't blow up?
> > 
> > Thought I guess it does not matter - this is just the initial
> > bootstrap segments. Presumarily the load_gdt in the Linux kernel
> > later on resets it to whatever the "new" GDT is.
> 
> Correct:
> #define __KERNEL_CS     (GDT_ENTRY_KERNEL_CS*8)
> 
> And load_gdt loads a new GDT natively.
> 
> 
> > > +    __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
> > > +    __vmwrite(GUEST_DR7, 0);
> > > +    __vmwrite(VMCS_LINK_POINTER, ~0UL);
> > > +
> > > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
> > > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> > 
> > Weird. In the vmcs.c file these are somewhat higher in the code.
> 
> Yes. I just didn't copy the existing function, but created PVH function
> to make it easier for PVH.

Sure, but the next step (not these patches) will be to collapse some
of the redundant logic.

> 
> > > +
> > > +    v->arch.hvm_vmx.exception_bitmap = 
> > > +                                   HVM_TRAP_MASK     | (1 <<
> > > TRAP_debug) | 
> > > +                                   (1U << TRAP_int3) | (1U <<
> > > TRAP_no_device);
> > 
> > Odd syntax.
> 
> Similar to existing hvm code, whats wrong?

Tabs. The HVM looks different (I think it uses spaces?)
> 
> 
> > > +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> > > +
> > > +    __vmwrite(TSC_OFFSET, 0);
> > 
> > Hm, so you did earlier:
> > 
> > v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > 
> > so is this neccessary? Or is just that you want it to be set
> > to default baseline state?
> 
> Not necessary, doesn't hurt either. I can remove it.
> 
> > > +
> > > +    /* Set WP bit so rdonly pages are not written from CPL 0 */
> > > +    tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
> > > +    __vmwrite(GUEST_CR0, tmpval);
> > > +    __vmwrite(CR0_READ_SHADOW, tmpval);
> > > +    v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] =
> > > tmpval; +
> > > +    tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> > > +    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
> > > +    if ( (tmpval & required) != required )
> > > +    {
> > > +        printk("PVH: required CR4 features not available:%lx\n",
> > > required);
> > > +        return -EINVAL;
> > 
> > You might want to move that to the top of the code. Or if you want
> > it here, then at least free the msr_bitmap
> 
> I think I'll just move all the checks top of the code.
> 
> > >  {
> > >      struct domain *d = v->domain;
> > > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v)
> > >  
> > >      vmx_vmcs_enter(v);
> > >  
> > > +    if ( is_pvh_vcpu(v) ) {
> > > +        int rc = pvh_construct_vmcs(v);
> > > +        vmx_vmcs_exit(v);
> > 
> > Do you need to call paging_update_paging_modes as construct_vmcs()
> > does?
> 
> Nop. We don't need to as the arch_set_info_guest() does it for PVH.

Ok, could you provide a comment please? That way when one looks at this
code and the HVM one it will be clear.
> 
> 
> Thanks Konrad.

Of course. Thank you for bearing with this tiring review process.
> Mukesh
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.