[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v6 24/29] xen/x86: allow HVM guests to use hypercalls to bring up vCPUs



>>> On 28.09.15 at 18:09, <roger.pau@xxxxxxxxxx> wrote:
> El 21/09/15 a les 17.44, Jan Beulich ha escrit:
>>>>> On 04.09.15 at 14:09, <roger.pau@xxxxxxxxxx> wrote:
>>> Allow the usage of the VCPUOP_initialise, VCPUOP_up, VCPUOP_down and
>>> VCPUOP_is_up hypercalls from HVM guests.
>>>
>>> This patch introduces a new structure (vcpu_hvm_context) that should be used
>>> in conjuction with the VCPUOP_initialise hypercall in order to initialize
>>> vCPUs for HVM guests.
>>>
>>> Signed-off-by: Roger Pau Monnà <roger.pau@xxxxxxxxxx>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> 
>> So this bi-modal thing doesn't look too bad, but a concern I have is
>> with the now different contexts used by XEN_DOMCTL_setvcpucontext
>> and VCPUOP_initialise.
> 
> Yes, that's far from ideal. I was going to say that
> XEN_DOMCTL_{set/get}vcpucontext should return EOPNOTSUPP when executed
> against HVM guests, but that's going to break current toolstack code
> that relies on this in order to perform suspend/resume of HVM guests
> (and gdbsx would probably also be affected).
> 
> If you feel that would be a better solution I could fix current tools
> code in order to use XEN_DOMCTL_{get/set}hvmcontext instead of
> XEN_DOMCTL_{set/get}vcpucontext when dealing with HVM guests.

That might be a follow-up thing, subject to the tools maintainers
agreeing.

>>> +    if ( !paging_mode_hap(v->domain) )
>>> +        v->arch.guest_table = pagetable_null();
>> 
>> A comment with the reason for this would be nice. I can't immediately
>> see why this is here.
> 
> guest_table contains uninitialized data at this point, which makes
> pagetable_is_null return false. Other places where guest_table is set
> check if previous guest_table is null or not, and if it's not null Xen
> will try to free it, causing a bug. hvm_vcpu_reset_state does something
> similar when setting the initial vCPU state.

Truly uninitialized is not possible, as struct vcpu starts out zeroed.
Hence if anything the field may hold left over data, in which case
the store would better go where the reference becomes dangling.

>>> +    hvm_update_guest_cr(v, 0);
>>> +    hvm_update_guest_cr(v, 4);
>>> +
>>> +    if ( (ctx->mode == VCPU_HVM_MODE_32B) ||
>>> +         (ctx->mode == VCPU_HVM_MODE_64B) )
>>> +    {
>>> +        hvm_update_guest_cr(v, 3);
>>> +        hvm_update_guest_efer(v);
>>> +    }
>>> +
>>> +    if ( hvm_paging_enabled(v) && !paging_mode_hap(v->domain) )
>>> +    {
>>> +        /* Shadow-mode CR3 change. Check PDBR and update refcounts. */
>>> +        struct page_info *page = get_page_from_gfn(v->domain,
>>> +                                 v->arch.hvm_vcpu.guest_cr[3] >> 
>>> PAGE_SHIFT,
>>> +                                 NULL, P2M_ALLOC);
>> 
>> P2M_ALLOC but not P2M_UNSHARE?
> 
> This is a copy of what's done in hvm_set_cr3 when shadow mode is enabled.

As said on IRC - sadly we have to mem-sharing maintainer to ask. I
wonder whether the past or current x86/mm maintainer would know
- Tim, George?

>>> +    /* Sync AP's TSC with BSP's. */
>>> +    v->arch.hvm_vcpu.cache_tsc_offset =
>>> +        v->domain->vcpu[0]->arch.hvm_vcpu.cache_tsc_offset;
>>> +    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset,
>>> +                             v->domain->arch.hvm_domain.sync_tsc);
>>> +
>>> +    v->arch.hvm_vcpu.msr_tsc_adjust = 0;
>> 
>> The need for this one I also can't see right away - another comment
>> perhaps?
> 
> AFAICT we need to initialize this to a sane value, like it's done in
> hvm_vcpu_reset_state.

As said above - struct vcpu starts out zeroed, so unless stale data
is being left here, "initialize" would at least not be the right term.

>>> +#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
>>> +#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
>>> +
>>> +#include "../xen.h"
>>> +
>>> +struct vcpu_hvm_x86_16 {
>>> +    uint16_t ax;
>>> +    uint16_t cx;
>>> +    uint16_t dx;
>>> +    uint16_t bx;
>>> +    uint16_t sp;
>>> +    uint16_t bp;
>>> +    uint16_t si;
>>> +    uint16_t di;
>>> +    uint16_t ip;
>>> +    uint16_t flags;
>>> +
>>> +    uint32_t cr0;
>>> +    uint32_t cr4;
>>> +
>>> +    uint32_t cs_base;
>>> +    uint32_t ds_base;
>>> +    uint32_t ss_base;
>>> +    uint32_t tr_base;
>>> +    uint32_t cs_limit;
>>> +    uint32_t ds_limit;
>>> +    uint32_t ss_limit;
>>> +    uint32_t tr_limit;
>>> +    uint16_t cs_ar;
>>> +    uint16_t ds_ar;
>>> +    uint16_t ss_ar;
>>> +    uint16_t tr_ar;
>>> +};
>> 
>> I doubt this is very useful: While one ought to be able to start a
>> guest in 16-bit mode, its GPRs still are supposed to be 32 bits wide.
>> The mode used really would depend on the cs_ar setting. (Having
>> the structure just for a 16-bit IP would seem insane.)
> 
> So, would you prefer to go just with the 64-bit structure and the mode
> is simply set by the value of the control registers / segment selectors?

No, you certainly want a 32- and a 64-bit layout, for the benefit of
32- and 64-bit guests.

>>> +    uint32_t cs_base;
>>> +    uint32_t ds_base;
>>> +    uint32_t ss_base;
>> 
>> I continue to question why we have DS here, but not ES (and maybe
>> FS and GS too). I.e. either just CS and SS (which are architecturally
>> required) or at least all four traditional x86 segment registers. And
>> you're also clearly not targeting minimal state, or else there likely
>> wouldn't be a need for e.g. R8-R15 in the 64-bit variant.
> 
> I'm fine with removing r8-15. Regarding the segment selectors, I don't
> have a problem with only allowing CS and SS to be set, or all of them
> including FS and GS. But I would like to get a consensus on this, we
> have already gone back and forth several times regarding how this
> structure should look like, and TBH, I was hoping that this was the last
> time.

Was there back and forth? I only recall always having asked for
consistency here, just like spelled out above.

> Andrew, Jan, what would you prefer, either DS is removed or ES, FS and
> GS are also added?

I voiced my opinion. Andrew?

>>> +struct vcpu_hvm_x86_64 {
>>> +    uint64_t rax;
>>> +    uint64_t rcx;
>>> +    uint64_t rdx;
>>> +    uint64_t rbx;
>>> +    uint64_t rsp;
>>> +    uint64_t rbp;
>>> +    uint64_t rsi;
>>> +    uint64_t rdi;
>>> +    uint64_t r8;
>>> +    uint64_t r9;
>>> +    uint64_t r10;
>>> +    uint64_t r11;
>>> +    uint64_t r12;
>>> +    uint64_t r13;
>>> +    uint64_t r14;
>>> +    uint64_t r15;
>>> +    uint64_t rip;
>>> +    uint64_t rflags;
>>> +
>>> +    uint64_t cr0;
>>> +    uint64_t cr3;
>>> +    uint64_t cr4;
>>> +    uint64_t efer;
>>> +
>>> +    uint32_t cs_base;
>>> +    uint32_t ds_base;
>>> +    uint32_t ss_base;
>>> +    uint32_t tr_base;
>>> +    uint32_t cs_limit;
>>> +    uint32_t ds_limit;
>>> +    uint32_t ss_limit;
>>> +    uint32_t tr_limit;
>> 
>> I can see the need for the TR ones here, but what's the point of the
>> CS, SS, and DS ones?
> 
> In case the guest wants to start in 64bit compatibility mode?

Hmm, yes. But couldn't that be done using the 32-bit mode layout?

>>> +/*
>>> + * The layout of the _ar fields of the segment registers is the
>>> + * following:
>>> + *
>>> + * Bits [0,3]: type (bits 40-43).
>>> + * Bit      4: s    (descriptor type, bit 44).
>>> + * Bit  [5,6]: dpl  (descriptor privilege level, bits 45-46).
>>> + * Bit      7: p    (segment-present, bit 47).
>>> + * Bit      8: avl  (available for system software, bit 52).
>>> + * Bit      9: l    (64-bit code segment, bit 53).
>>> + * Bit     10: db   (meaning depends on the segment, bit 54).
>>> + * Bit     11: g    (granularity, bit 55)
>>> + *
>>> + * A more complete description of the meaning of this fields can be
>>> + * obtained from the Intel SDM, Volume 3, section 3.4.5.
>>> + */
>> 
>> Please make explicit what state bits 12-15 are expected to be in
>> (hopefully they're being checked to be zero rather than getting
>> ignored).
> 
> I will add a comment regarding the 12-15 bits. IMHO, that checking
> should be done in hvm_set_segment_register, and is not part of this patch.

Well - is there an existing path like the one being added here where
these values come from an untrusted source (which isn't to say that
when coming from a trusted source checking wouldn't be worthwhile)?
I wasn't able to spot any.

>> Please also clarify whether the limit specified for the various
>> segment registers is the one present in descriptor tables (subject
>> to scaling by the g bit) or (more likely) the byte granular one. In
>> any event I suppose certain restrictions apply.
> 
> Limit is not subject to the g bit, the limit set here it's the one that
> ends up present in the descriptor tables (or in this case in the cached
> part of the selectors). IMHO that's implicit, and it's what's done on
> bare metal.

See - you're mixing things up: In descriptor tables, the (20-bit) limit is
always subject to scaling due to G=1. And since we're talking about
documentation of an interface here, "implicit" is not an option. (I
wonder, btw, whether VMENTRY checks would fail when a limit value
isn't in line with the G bit. I didn't check the manual...)

With both of the above points my main goal is to convert an
unrecoverable error (guest death) to a recoverable one (the vCPU
issuing the hypercall receiving an error indication).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.