
Re: [Xen-devel] [PATCH for-4.5 v8 4/7] xen: Add vmware_port support



>>> On 21.01.15 at 18:52, <dslutz@xxxxxxxxxxx> wrote:
> On 01/16/15 05:09, Jan Beulich wrote:
>>>>> On 03.10.14 at 00:40, <dslutz@xxxxxxxxxxx> wrote:
>>> This is a new domain_create() flag, DOMCRF_vmware_port.  It is
>>> passed to domctl as XEN_DOMCTL_CDF_vmware_port.
>> 
>> Can you explain why a HVM param isn't suitable here?
>> 
> 
> The issue is that you need this flag during construct_vmcb() and
> construct_vmcs().  While Intel has vmx_update_exception_bitmap()
> AMD does not.  So when HVM param's are setup and/or changed there
> currently is no way to adjust AMD's exception bitmap.
> 
> So this is the simpler way.

But the less desirable one from a design/consistency perspective.
Unless other maintainers disagree, I'd like to see this changed.
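
For illustration, the kind of SVM-side helper this would imply could look
roughly like the sketch below (the function and the HVM_PARAM_VMWARE_PORT
name are assumptions made up for this example; the vmcb accessors are the
existing ones), to be invoked whenever the param changes, analogous to
vmx_update_exception_bitmap() on the Intel side:

    static void svm_update_exception_intercepts(struct vcpu *v)
    {
        struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
        uint32_t intercepts = vmcb_get_exception_intercepts(vmcb);

        /* HVM_PARAM_VMWARE_PORT is a hypothetical param name. */
        if ( v->domain->arch.hvm_domain.params[HVM_PARAM_VMWARE_PORT] )
            intercepts |= 1U << TRAP_gp_fault;
        else
            intercepts &= ~(1U << TRAP_gp_fault);

        vmcb_set_exception_intercepts(vmcb, intercepts);
    }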

>>> This is both more complete support than is currently provided by
>>> QEMU and/or KVM, and less.  The missing part requires QEMU changes
>>> and has been left out until the QEMU patches are accepted upstream.
>> 
>> I vaguely recall the question having been asked before, but I can't
>> find it or the answer to it: If qemu has support for this, why can't
>> you build on that rather than adding everything in the hypervisor?
>> 
> 
> The v10 version of this patch set (which is waiting for an adjusted
> QEMU; the released 2.2.0 is one) does use QEMU for more VMware port
> support.  The issues are:

Was there a newer version of these posted than the v8 I looked at?
If so, I must have overlooked the posting (as otherwise I would of
course have commented on the newer version).

> 1) QEMU needs access to parts of CPU registers to handle VMware port.
> 2) You need to allow ring 3 access to this 1 I/O port.
> 3) There is more state in xen that would need to also be sent to
>    QEMU if all support is moved to QEMU.

Understood.

>>> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>>>       return;
>>>   }
>>>
>>> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>> +                                    struct vcpu *v)
>>> +{
>>> +    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>>> +    /*
>>> +     * Just use 15 for the instruction length; vmport_gp_check will
>>> +     * adjust it.  This is because
>>> +     * __get_instruction_length_from_list() has issues, and may
>>> +     * require a double read of the instruction bytes.  At some
>>> +     * point a new routine could be added that is based on the code
>>> +     * in vmport_gp_check with extensions to make it more general.
>>> +     * Since that routine is the only user of this code this can be
>>> +     * done later.
>>> +     */
>>> +    unsigned long inst_len = 15;
>> 
>> Surely this can be unsigned int?
> 
> The code is smaller this way.  In vmx_vmexit_gp_intercept():
> 
>     unsigned long inst_len;
> ...
>     __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
> ...
>     rc = vmport_gp_check(regs, v, &inst_len, inst_addr,
> ...
> 
> So changing the argument to vmport_gp_check() to "unsigned int" would
> add code there.

So be it then. Generic code shouldn't use odd types just because
vendor-specific code needs it, unless this makes things _a lot_ more
complicated.

>>> +    int rc = X86EMUL_OKAY;
>>> +
>>> +    if ( regs->_eax == BDOOR_MAGIC )
>> 
>> With this, is handling other than 32-bit in/out really meaningful/
>> correct?
>> 
> 
> Yes. Harder to use, but since VMware allows it, I allow it also.

But then a comment explaining the non-architectural (from an
instruction set perspective) behavior is the minimum that's
needed for future readers (and reviewers) to understand this.
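
One possible shape for such a comment (a sketch only; the wording reflects
the behavior described in this thread, not VMware documentation):

    /*
     * Non-architectural: VMware's backdoor responds to IN/OUT of any
     * operand size on this port, as long as the full 32-bit %eax holds
     * BDOOR_MAGIC.  Mirror that instead of restricting to 32-bit
     * accesses, so guests using the 16-bit forms keep working.
     */
    if ( regs->_eax == BDOOR_MAGIC )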

>>> +        case BDOOR_CMD_GETHWVERSION:
>>> +            /* vmware_hw */
>>> +            regs->_eax = 0;
>>> +            if ( is_hvm_vcpu(curr) )
>> 
>> Since you can't get here for PV, I can't see what you need this
>> conditional for.
>> 
> 
> Since I was not 100% sure, I was being safe.  Would converting
> this to be a "debug=y" check be ok?

ASSERT() would indeed be the right vehicle.

>>> +            {
>>> +                struct hvm_domain *hd = &d->arch.hvm_domain;
>>> +
>>> +                regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
>>> +            }
>>> +            if ( !regs->_eax )
>>> +                regs->_eax = 4;  /* Act like version 4 */
>> 
>> Why version 4?
> 
> That is the 1st version that VMware was more consistent in the handling
> of the "VMware hardware version".  Any value between 1 and 6 would be
> ok.  This should only happen in strange configs.

Please make the comment say so then.
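
Folding both points in might look roughly like this (a sketch assuming
curr and d are in scope as in the quoted hunk):

    case BDOOR_CMD_GETHWVERSION:
        /* vmware_hw */
        ASSERT(is_hvm_vcpu(curr));
        regs->_eax = d->arch.hvm_domain.params[HVM_PARAM_VMWARE_HW];
        /*
         * 4 is the first version for which VMware handled the hardware
         * version query consistently; this is only reached when the
         * domain was configured without a vmware_hw value.
         */
        if ( !regs->_eax )
            regs->_eax = 4;
        break;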

>>> +            /* hostUsecs */
>>> +            regs->_ebx = value / 1000000ULL;
>>> +            /* maxTimeLag */
>>> +            regs->_ecx = 1000000;
>>> +            break;
>> 
>> Perhaps this should share code with BDOOR_CMD_GETTIME; I have
>> to admit though that I can't make any sense of why the latter one
>> has a FULL suffix when it returns _less_ information.
>> 
> 
> Sharing of code is not simple.
> Since I did not pick the names, VMware did.
> 
> Bug found.  The full returns data in si & dx.
> Will fix.  And that also makes sharing more complex than not.

Of course if the current code is incomplete, sharing makes less sense
once completed.

>>> +        unsigned char bytes[MAX_INST_LEN];
>>> +        unsigned int fetch_len;
>>> +        int frc;
>>> +
>>> +        /* in or out are limited to 32bits */
>>> +        if ( byte_cnt > 4 )
>>> +            byte_cnt = 4;
>>> +
>>> +        /*
>>> +         * Fetch up to the next page break; we'll fetch from the
>>> +         * next page later if we have to.
>>> +         */
>>> +        fetch_len = min_t(unsigned int, *inst_len,
>>> +                          PAGE_SIZE - (inst_addr & ~PAGE_MASK));
>>> +        frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
>>> +                                                PFEC_page_present);
>>> +        if ( frc != HVMCOPY_okay )
>>> +        {
>>> +            gdprintk(XENLOG_WARNING,
>>> +                     "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
>>> +                     (unsigned long) inst_addr, frc, *inst_len, fetch_len);
>> 
>> Pointless cast. But the value of log messages like this one is
>> questionable anyway.
>> 
> 
> Will drop cast.  I am not sure it is possible to get here. The best I
> have come up with is to change the GDT entry for CS to fault, then do
> this instruction.  Not sure it would fault, and clearly is an attempt
> to break in.
> 
> I do know that if Xen is running under VMware (Why anyone would do
> this?) this is possible.
> 
> With all this, should I just drop this message (or make it debug=y
> only)?

Yes - dropping would be preferred by me, but I'd accept a debug=y
only one too.

>>> +
>>> +        /* Only adjust byte_cnt 1 time */
>>> +        if ( bytes[0] == 0x66 )     /* operand size prefix */
>>> +        {
>>> +            if ( byte_cnt == 4 )
>>> +                byte_cnt = 2;
>>> +            else
>>> +                byte_cnt = 4;
>>> +        }
>> 
>> Iirc REX.W set following 0x66 cancels the effect of the latter. Another
>> thing x86emul would be taking care of for you if you used it.
> 
> I did not know this.  Most of my testing was done without any check
> for prefix(s).  I.E. (Open) VMware Tools only uses the inl.  I do
> not know of anybody using 16bit segments and VMware tools.

But this isn't the perspective to take when adding code to the
hypervisor - you should always consider what a (perhaps
malicious) guest could do.
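
To make that concrete, a prefix scan honoring the REX.W point might look
roughly like the sketch below (illustrative only; as noted, reusing
x86emul would avoid hand-rolled decoding altogether, and names beyond
those in the quoted hunk are made up):

    unsigned int i;
    bool_t opsize = 0, rex_w = 0;

    /* Walk prefix bytes until the opcode (REX comes last before it). */
    for ( i = 0; i < fetch_len; i++ )
    {
        if ( bytes[i] == 0x66 )                 /* operand size override */
            opsize = 1;
        else if ( (bytes[i] & 0xf0) == 0x40 )   /* REX, 64-bit mode only */
            rex_w = !!(bytes[i] & 0x08);
        else
            break;
    }

    /* REX.W cancels a preceding 0x66. */
    if ( opsize && !rex_w )
        byte_cnt = (byte_cnt == 4) ? 2 : 4;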

>>> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>> +                                    struct vcpu *v)
>>> +{
>>> +    unsigned long exit_qualification;
>>> +    unsigned long inst_len;
>>> +    unsigned long inst_addr = vmx_rip2pointer(regs, v);
>>> +    unsigned long ecode;
>>> +    int rc;
>>> +#ifndef NDEBUG
>>> +    unsigned long orig_inst_len;
>>> +    unsigned long vector;
>>> +
>>> +    __vmread(VM_EXIT_INTR_INFO, &vector);
>>> +    BUG_ON(!(vector & INTR_INFO_VALID_MASK));
>>> +    BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
>>> +#endif
>> 
>> If you use ASSERT() instead of BUG_ON(), I think you can avoid most
>> of this preprocessor conditional.
>> 
> 
> I do not see how.  vector only exists in "debug=y".  So yes if using
> ASSERT() I could move the #endif up 2 lines, but that does not
> look better to me.

I don't follow - ASSERT() is intentionally coded in a way such that
variables used only by it don't cause compiler warnings. And the
optimizer ought to be able to eliminate the then unnecessary
__vmread().
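
Concretely, the suggestion amounts to something like the following (a
sketch, assuming release-build ASSERT() still references its expression so
the variable stays "used"; the intr_info name is illustrative):

    unsigned long intr_info;

    /*
     * No #ifndef NDEBUG needed: intr_info is referenced by ASSERT()
     * even in release builds, and the then-unneeded __vmread() ought
     * to be removable by the optimizer once the checks compile away.
     */
    __vmread(VM_EXIT_INTR_INFO, &intr_info);
    ASSERT(intr_info & INTR_INFO_VALID_MASK);
    ASSERT(intr_info & INTR_INFO_DELIVER_CODE_MASK);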

>>> +    __vmread(EXIT_QUALIFICATION, &exit_qualification);
>>> +    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>> 
>> get_instruction_length(). But is it architecturally defined that
>> #GP intercept vmexits actually set this field?
>> 
> 
> I could not find a clear statement.

That's the point of my comment.

>  My reading of (directly out of
> "Intel 64 and IA-32 Architectures
> Software Developer's Manual
> Volume 3 (3A, 3B & 3C):
>[...]
> to me says that yes, this field is set on a #GP exit on an IN.  But the
> #GP case is not called out by name.

And it is not any of the cases mentioned.

> My read is that a #GP fault is a "VM Exits Unconditionally" based on the
> setting of the exception bit mask.

Right, but it's not exactly an instruction-based exit. Unless Intel
confirms that your extrapolation from what the manual says is correct, I'd
rather recommend against relying on unspecified behavior. If
any CPU model ends up behaving differently, this might be
rather hard to diagnose I'm afraid.

> So not using get_instruction_length() does avoid a possible BUG_ON()
> if I am wrong.
> 
> So there are 3 options here:
> 1) Add an ASSERT() like the BUG_ON() in get_instruction_length()
> 2) Switch to using get_instruction_length()
> 3) Switch to using MAX_INST_LEN.
> 
> Let me know which way to go.

As said above - use get_instruction_length() if Intel confirms the
necessary hardware behavior as being architectural. If they
don't, 3) looks like the only viable option.
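
If Intel doesn't confirm it, option 3 would amount to roughly the
following (a sketch mirroring what the SVM-side hunk quoted earlier
already does):

    /*
     * Don't rely on VM_EXIT_INSTRUCTION_LEN being architecturally valid
     * on a #GP intercept.  Start from the architectural maximum and let
     * vmport_gp_check() shrink it while it decodes the instruction.
     */
    unsigned long inst_len = MAX_INST_LEN;   /* 15 bytes */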

>>> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>>>               if ( v->fpu_dirtied )
>>>                   nvcpu->nv_vmexit_pending = 1;
>>>           }
>>> +        else if ( vector == TRAP_gp_fault )
>>> +            nvcpu->nv_vmexit_pending = 1;
>> 
>> Doesn't that mean an unconditional vmexit even if the L1 hypervisor
>> didn't ask for such?
> 
> It might.  I have not done any testing here for the nested VMX case.
> I could just drop this for now and decide what to do for this code later.

Only if dropping the code is safe without also forbidding the combination
of nested and VMware emulation.
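
For reference, making the reflection conditional on L1's wishes might look
roughly like this (a sketch; the __get_vvmcs() accessor usage is an
assumption based on the surrounding nvmx code):

    else if ( vector == TRAP_gp_fault )
    {
        u32 bitmap = __get_vvmcs(nvcpu->nv_vvmcx, EXCEPTION_BITMAP);

        /* Reflect the #GP to L1 only if L1 asked to intercept it. */
        if ( bitmap & (1u << TRAP_gp_fault) )
            nvcpu->nv_vmexit_pending = 1;
    }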

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
