
Re: [Xen-devel] [PATCH 2/4] x86: suppress SMAP and SMEP while running 32-bit PV guest code



>>> On 09.03.16 at 12:19, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 08/03/16 07:57, Jan Beulich wrote:
>>>> @@ -174,10 +174,43 @@ compat_bad_hypercall:
>>>>  /* %rbx: struct vcpu, interrupts disabled */
>>>>  ENTRY(compat_restore_all_guest)
>>>>          ASSERT_INTERRUPTS_DISABLED
>>>> +.Lcr4_orig:
>>>> +        ASM_NOP3 /* mov   %cr4, %rax */
>>>> +        ASM_NOP6 /* and   $..., %rax */
>>>> +        ASM_NOP3 /* mov   %rax, %cr4 */
>>>> +        .pushsection .altinstr_replacement, "ax"
>>>> +.Lcr4_alt:
>>>> +        mov   %cr4, %rax
>>>> +        and   $~(X86_CR4_SMEP|X86_CR4_SMAP), %rax
>>>> +        mov   %rax, %cr4
>>>> +.Lcr4_alt_end:
>>>> +        .section .altinstructions, "a"
>>>> +        altinstruction_entry .Lcr4_orig, .Lcr4_alt, X86_FEATURE_SMEP, 12, \
>>>> +                             (.Lcr4_alt_end - .Lcr4_alt)
>>>> +        altinstruction_entry .Lcr4_orig, .Lcr4_alt, X86_FEATURE_SMAP, 12, \
>>>> +                             (.Lcr4_alt_end - .Lcr4_alt)
>>> These 12's look as if they should be (.Lcr4_alt - .Lcr4_orig).
>> Well, the NOPs that get put there make 12 (= 3 + 6 + 3) a
>> pretty obvious (shorter and hence more readable) option. But
>> yes, if you're of the strong opinion that we should use the
>> longer alternative, I can switch these around.
> 
> I have to admit that I prefer the Linux ALTERNATIVE macro for assembly,
> which takes care of the calculations like this.  It is slightly
> unfortunate that it generally requires its assembly blocks in strings,
> and is unsuitable for larger blocks.  Perhaps we can see about a
> variant in due course.

"In due course" to me means subsequently - is that the meaning you
imply here too?

But what's interesting about this suggestion: Their macro uses
.skip instead of .org, which means I should be able to replace the
ugly gas bug workaround by simply using .skip. I'll give that a try.
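
I.e. something along these lines (rough sketch only, borrowing the idea
from the Linux macro, and assuming gas is happy with the forward label
references in the .skip expression - that's what I'd want to verify):

.Lcr4_orig:
        /* NOP-pad the patch site to the size of the replacement, instead
         * of hand-counting ASM_NOPs / hard-coding 12: */
        .skip .Lcr4_alt_end - .Lcr4_alt, 0x90
.Lcr4_orig_end:
        .pushsection .altinstr_replacement, "ax"
.Lcr4_alt:
        mov   %cr4, %rax
        and   $~(X86_CR4_SMEP|X86_CR4_SMAP), %rax
        mov   %rax, %cr4
.Lcr4_alt_end:
        .section .altinstructions, "a"
        altinstruction_entry .Lcr4_orig, .Lcr4_alt, X86_FEATURE_SMEP, \
                             (.Lcr4_orig_end - .Lcr4_orig), \
                             (.Lcr4_alt_end - .Lcr4_alt)
        altinstruction_entry .Lcr4_orig, .Lcr4_alt, X86_FEATURE_SMAP, \
                             (.Lcr4_orig_end - .Lcr4_orig), \
                             (.Lcr4_alt_end - .Lcr4_alt)
        .popsection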

>>>> +        .pushsection .altinstr_replacement, "ax"
>>>> +.Lsmep_smap_alt:
>>>> +        mov   VCPU_domain(%rbx),%rax
>>>> +.Lsmep_smap_alt_end:
>>>> +        .section .altinstructions, "a"
>>>> +        altinstruction_entry .Lsmep_smap_orig, .Lsmep_smap_alt, \
>>>> +                             X86_FEATURE_SMEP, \
>>>> +                             (.Lsmep_smap_alt_end - .Lsmep_smap_alt), \
>>>> +                             (.Lsmep_smap_alt_end - .Lsmep_smap_alt)
>>>> +        altinstruction_entry .Lsmep_smap_orig, .Lsmep_smap_alt, \
>>>> +                             X86_FEATURE_SMAP, \
>>>> +                             (.Lsmep_smap_alt_end - .Lsmep_smap_alt), \
>>>> +                             (.Lsmep_smap_alt_end - .Lsmep_smap_alt)
>>>> +        .popsection
>>>> +
>>>> +        testb $3,UREGS_cs(%rsp)
>>>> +        jz    0f
>>>> +        cmpb  $0,DOMAIN_is_32bit_pv(%rax)
>>> This comparison is wrong on hardware lacking SMEP and SMAP, as the "mov
>>> VCPU_domain(%rbx),%rax" won't have happened.
>> That mov indeed won't have happened, but the original instruction
>> is a branch past all of this code, so the above is correct (and I did
>> test on older hardware).
> 
> Oh, so it won't.  It is moderately subtle that this entire code block is
> logically contained in the alternative.
> 
> It would be far clearer, and work around your .org bug, if this were a
> single alternative which patched the jump into a nop.

I specifically wanted to avoid needlessly patching in a NOP there, since
going forward we expect the majority of systems to be the ones having
that patching done.
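
Just so we're talking about the same thing - as I understand it, the
suggestion amounts to something like this (labels and the zero-length
replacement are mine, not taken from anywhere):

.Lcr4_pv32_orig:
        jmp   .Lcr4_pv32_done            /* taken only on pre-SMEP/SMAP hardware */
.Lcr4_pv32_orig_end:
        .pushsection .altinstr_replacement, "ax"
.Lcr4_pv32_alt:
        /* empty - patching merely NOP-fills the original jmp */
.Lcr4_pv32_alt_end:
        .section .altinstructions, "a"
        altinstruction_entry .Lcr4_pv32_orig, .Lcr4_pv32_alt, X86_FEATURE_SMEP, \
                             (.Lcr4_pv32_orig_end - .Lcr4_pv32_orig), \
                             (.Lcr4_pv32_alt_end - .Lcr4_pv32_alt)
        altinstruction_entry .Lcr4_pv32_orig, .Lcr4_pv32_alt, X86_FEATURE_SMAP, \
                             (.Lcr4_pv32_orig_end - .Lcr4_pv32_orig), \
                             (.Lcr4_pv32_alt_end - .Lcr4_pv32_alt)
        .popsection

On SMEP/SMAP capable hardware all that patching would then achieve is
turning the jmp into NOPs.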

> At the very least, a label of .Lcr3_pv32_fixup_done would be an
> improvement over 0.

Agreed (albeit I prefer it to be named .Lcr4_pv32_done).
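
I.e. the tail of the hunk would then read along the lines of

        testb $3,UREGS_cs(%rsp)
        jz    .Lcr4_pv32_done
        cmpb  $0,DOMAIN_is_32bit_pv(%rax)
        je    .Lcr4_pv32_done
        call  cr4_smep_smap_restore
        /* ... NMI/#MC related checks, all branching to the renamed label ... */
.Lcr4_pv32_done:
        sti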

>>>> +        je    0f
>>>> +        call  cr4_smep_smap_restore
>>>> +        /*
>>>> +         * An NMI or #MC may occur between clearing CR4.SMEP and CR4.SMAP in
>>>> +         * compat_restore_all_guest and it actually returning to guest
>>>> +         * context, in which case the guest would run with the two features
>>>> +         * enabled. The only bad that can happen from this is a kernel mode
>>>> +         * #PF which the guest doesn't expect. Rather than trying to make the
>>>> +         * NMI/#MC exit path honor the intended CR4 setting, simply check
>>>> +         * whether the wrong CR4 was in use when the #PF occurred, and exit
>>>> +         * back to the guest (which will in turn clear the two CR4 bits) to
>>>> +         * re-execute the instruction. If we get back here, the CR4 bits
>>>> +         * should then be found clear (unless another NMI/#MC occurred at
>>>> +         * exactly the right time), and we'll continue processing the
>>>> +         * exception as normal.
>>>> +         */
>>>> +        test  %rax,%rax
>>>> +        jnz   0f
>>>> +        mov   $PFEC_page_present,%al
>>>> +        cmpb  $TRAP_page_fault,UREGS_entry_vector(%rsp)
>>>> +        jne   0f
>>>> +        xor   UREGS_error_code(%rsp),%eax
>>>> +        test  $~(PFEC_write_access|PFEC_insn_fetch),%eax
>>>> +        jz    compat_test_all_events
>>>> +0:      sti
>>> It's code like this which makes me even more certain that we have far too
>>> much code written in assembly which doesn't need to be.  Maybe not this
>>> specific sample, but it has taken me 15 minutes and a pad of paper to
>>> try and work out how this conditional works, and I am still not certain
>>> it's correct.  In particular, PFEC_prot_key looks like it could fool the
>>> test into believing a non-SMAP/SMEP fault was an SMAP/SMEP fault.
>> Not sure how you come to think of PFEC_prot_key here: That's
>> a bit which can be set only together with PFEC_user_mode, yet
>> we care about kernel mode faults only here.
> 
> I would not make that assumption.  Assumptions about the valid set of
> #PF flags are precisely the reason that older Linux falls into an
> infinite loop when encountering an SMAP page fault, rather than a clean crash.

We have to make _some_ assumption here about the bits which so
far have no meaning. Whichever route we go, we can't exclude that
the mask may need adjusting for things to remain correct as new
features get added. As to the example you give for comparison -
that's apples and oranges as long as Xen isn't meant to run as a
PV guest on some (other) hypervisor.
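
FWIW, spelled out (and with a note on where the mask would grow -
PFEC_new_bit below is a made-up placeholder, not an existing constant),
the check reads:

        /* %rax is known to be zero here - the preceding test/jnz guarantees
         * it - so the mov to %al leaves the full %eax == PFEC_page_present. */
        mov   $PFEC_page_present,%al
        cmpb  $TRAP_page_fault,UREGS_entry_vector(%rsp)
        jne   .Lcr4_pv32_done
        xor   UREGS_error_code(%rsp),%eax
        /* Zero iff the error code matches PFEC_page_present in all bits other
         * than the R/W and I/D ones, i.e. a present, kernel mode,
         * non-reserved-bit fault - which is what a spurious SMEP/SMAP #PF
         * looks like. A new bit becoming valid for kernel mode faults would
         * simply get added to the mask, e.g. ...|PFEC_new_bit. */
        test  $~(PFEC_write_access|PFEC_insn_fetch),%eax
        jz    compat_test_all_events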

Jan


 

