
Re: [Xen-devel] [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests



On 16/01/18 07:46, Jan Beulich wrote:
>>>> On 15.01.18 at 19:23, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 15/01/18 11:06, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_64/compat/entry.S
>>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>>> @@ -199,6 +199,17 @@ ENTRY(cstar_enter)
>>>          pushq $0
>>>          movl  $TRAP_syscall, 4(%rsp)
>>>          SAVE_ALL
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
>>> +        neg   %rcx
>>> +UNLIKELY_START(nz, cstar_cr3)
>>> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +        neg   %rcx
>>> +        write_cr3 rcx, rdi, rsi
>>> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +UNLIKELY_END(cstar_cr3)
>> These UNLIKELY()s aren't correct.  It will depend on hardware and
>> command line setting as to whether we expect to update cr3.
> Why are they not correct? What 64-bit kernels do you know that
> use the CSTAR entry method in PV mode?

None, but I also don't see you excluding 32-bit PV guests, which means we
do take the unlikely path all the time in general.

>  Afaik this was in
> existence for a range of kernel versions only in our forward port.
> The INT80 path is perhaps indeed more questionable in this regard.
>
>> Furthermore, they will complicate splitting the early entry code away
>> from the main .text section for a full isolation implementation.
>>
>> For now, I'd drop them and have a simple jz .Lskip.
> I will replace them (on the entry paths; I think the one on the exit-
> to-Xen path is valid) to please you and in the interest of forward
> progress; maybe this will even slightly help backporting to really old
> Xen versions, where the UNLIKELY* constructs don't exist yet.
>
>> Also, can we collect these together into macros, rather than
>> opencoding?  We seem to have 3 distinct variations.
> I had considered that (following the model you use in the SP2
> series), but decided against it, not least because of the
> dependent but placement-wise separated code additions to
> restore original values. Plus, again, this might be a hindrance to
> backporting to really old Xen (which then typically will also be
> built on really old tool chains) - as you certainly recall, old gas
> had quite a few issues with macro handling.

There is nothing special in these macros, though? I found the SP2-style
macros far easier to backport because each is a single slot-in line.

Anyway, I'm not overly fussed, but I have a preference for the macro form.
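
(For illustration, roughly what the two suggestions above amount to:
the UNLIKELY block replaced by a plain jz, wrapped up as a single
slot-in macro. A sketch only - the macro name and labels are invented,
and nothing like this exists in either series:)

        /*
         * Hypothetical slot-in, modelled on the SP2-style macros.
         * \reg names a register already holding the stack end, as set
         * up by GET_STACK_END; clobbers %rcx, %rdi and %rsi, like the
         * open-coded original.
         */
        .macro SWITCH_TO_XEN_CR3 reg:req
        mov   STACK_CPUINFO_FIELD(xen_cr3)(%\reg), %rcx
        neg   %rcx
        jz    .Lcr3_okay\@            /* xen_cr3 == 0: nothing to do */
        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%\reg) /* mark in flight */
        neg   %rcx
        write_cr3 rcx, rdi, rsi
        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%\reg)   /* now switched */
.Lcr3_okay\@:
        .endm

Each entry path would then collapse to GET_STACK_END(bx) followed by
"SWITCH_TO_XEN_CR3 rbx" - the single slot-in line meant above.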

>
>>> --- a/xen/arch/x86/x86_64/entry.S
>>> +++ b/xen/arch/x86/x86_64/entry.S
>>> @@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
>>>  /* %rbx: struct vcpu, interrupts disabled */
>>>  restore_all_guest:
>>>          ASSERT_INTERRUPTS_DISABLED
>>> +
>>> +        /* Copy guest mappings and switch to per-CPU root page table. */
>>> +        mov   %cr3, %r9
>>> +        GET_STACK_END(dx)
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
>>> +        movabs $PADDR_MASK & PAGE_MASK, %rsi
>>> +        movabs $DIRECTMAP_VIRT_START, %rcx
>>> +        mov   %rdi, %rax
>>> +        and   %rsi, %rdi
>>> +        and   %r9, %rsi
>>> +        add   %rcx, %rdi
>>> +        add   %rcx, %rsi
>>> +        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
>>> +        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
>>> +        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
>>> +        rep movsq
>>> +        mov   $ROOT_PAGETABLE_ENTRIES - \
>>> +               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
>>> +        rep movsq
>>> +        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>>> +        write_cr3 rax, rdi, rsi
>>> +
>> Can we possibly move this up into C?  For this simplistic algorithm it
>> is ok in ASM, but if we want to do any optimisations to avoid the 4k
>> memcpy (generation count hidden somewhere in page_info?), ASM is
>> quickly going to become unwieldy.
> I'd prefer to move it into C when it really becomes necessary. Also
> you don't properly qualify "this" - for example, I'd rather not move
> the write_cr3 invocation into C, yet the placement of your comment
> suggests to do so.

I meant the whole hunk, not write_cr3 itself.
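
(For concreteness, a rough C rendering of the whole hunk - a sketch
only, not code from any version of the series; it uses the cpu_info
fields the patch introduces together with Xen's existing read_cr3(),
maddr_to_virt() and write_cr3() helpers:)

    /*
     * Sketch: copy the guest-controlled L4 slots of the L4 Xen is
     * currently running on into the per-CPU root table, mirror the
     * shadow-linear slot, then switch to the per-CPU table.
     */
    static void switch_to_percpu_root(struct cpu_info *ci)
    {
        unsigned long ocr3 = read_cr3();
        root_pgentry_t *dst = maddr_to_virt(ci->pv_cr3 & PADDR_MASK & PAGE_MASK);
        const root_pgentry_t *src = maddr_to_virt(ocr3 & PADDR_MASK & PAGE_MASK);

        dst[root_table_offset(SH_LINEAR_PT_VIRT_START)] =
            src[root_table_offset(SH_LINEAR_PT_VIRT_START)];

        /* Guest slots below the Xen range ... */
        memcpy(dst, src, ROOT_PAGETABLE_FIRST_XEN_SLOT * sizeof(*dst));
        /* ... and everything above it. */
        memcpy(dst + ROOT_PAGETABLE_LAST_XEN_SLOT + 1,
               src + ROOT_PAGETABLE_LAST_XEN_SLOT + 1,
               (ROOT_PAGETABLE_ENTRIES - ROOT_PAGETABLE_LAST_XEN_SLOT - 1) *
               sizeof(*dst));

        ci->xen_cr3 = ocr3;      /* value to load again on entry into Xen */
        write_cr3(ci->pv_cr3);
    }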

>
>> Another optimisation I found made a massive difference for the KAISER
>> series was to have an MRU cache of 4 pagetables, so in-guest syscalls
>> don't result in any copying as we pass in and out of Xen.
> As said elsewhere - optimization can come later. Plus - is avoiding the
> copying at _any_ time actually correct, considering possible racing
> L4 entry updates on another vCPU?

Indeed - optimisation can come later.

The common behaviour is that L4s don't really change after fork(), so
having a generation count on them will allow skipping the memcpy() in
most cases.

You can't skip the check, but the amount of logic used to avoid a 4k
memcpy can be quite substantial and still be a performance win.
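
(A rough sketch of that generation-count idea - every name here is
hypothetical; neither field nor helper exists in the tree:)

    /*
     * Hypothetical: l4_gen lives in the L4's struct page_info and is
     * bumped on every L4 entry update; synced_gen is recorded per pCPU
     * whenever the copy is made.
     */
    if ( l4pg->l4_gen != ci->synced_gen )
    {
        copy_guest_l4_slots(ci);        /* the 4k-ish copy shown above */
        ci->synced_gen = l4pg->l4_gen;
    }
    /* The check itself can't be skipped, but the expensive part - the
     * copy - is avoided whenever the guest L4 hasn't changed. */
    write_cr3(ci->pv_cr3);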

>
>>> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>>>          ALIGN
>>>  /* No special register assumptions. */
>>>  restore_all_xen:
>>> +        /*
>>> +         * Check whether we need to switch to the per-CPU page tables, in
>>> +         * case we return to late PV exit code (from an NMI or #MC).
>>> +         */
>>> +        GET_STACK_END(ax)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
>>> +        test  %rdx, %rdx
>>> +UNLIKELY_START(g, exit_cr3)
>> cmp or ne ?
> "ne" (or really "nz" when used with "test") is outright wrong - we
> want to skip the restore when the value is zero _or_ negative.
> What's wrong with "jg" and "test" in combination? There simply is
> no "jnsz" (other than e.g. "jnbe"). "cmp" against zero could be
> used here, but why would I use the larger instruction when "test"
> does the job?

Greater/less-than conditions are not commonly paired with the test
instruction, which is why this looks wrong to a reviewer.

A comment of /* Ideally jnsz, but jg will have to do */ would go a very
long way.

I've double-checked the logic and I agree with your conclusions, but the
only reason this works is that test unconditionally zeroes the
overflow flag.
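
(For the record, the flag semantics in question:)

        test  %rdx, %rdx  /* ZF = (%rdx == 0), SF = sign bit, OF = CF = 0 */
        jg    1f          /* jg is taken iff ZF == 0 && SF == OF; with OF
                             cleared by test, that means "sign clear and
                             nonzero", i.e. %rdx > 0 signed - the "jnsz"
                             that doesn't exist */
1: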

>
>>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>>          movl  $TRAP_double_fault,4(%rsp)
>>>          /* Set AC to reduce chance of further SMAP faults */
>>>          SAVE_ALL STAC
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>>> +        test  %rbx, %rbx
>>> +        jz    .Ldblf_cr3_okay
>>> +        jns   .Ldblf_cr3_load
>>> +        neg   %rbx
>>> +.Ldblf_cr3_load:
>>> +        write_cr3 rbx, rdi, rsi
>>> +.Ldblf_cr3_okay:
>>> +
>> It is moderately common for there to be cascade faults in #DF.  This
>> would be better if it were the general IST switch.
> Based on the issues I had with #DF occurring while debugging this,
> I've decided to keep the code here as simple as possible without
> being incorrect: There's no point looking at the incoming CR3.
> There's no point in trying to avoid nested faults (including
> subsequent #DF) restoring CR3. There's also no point in retaining
> the value for later restoring here, as we never return. In fact, as
> mentioned elsewhere, we should imo indeed consider unilaterally
> switching to idle_pg_table[] here.

Ok.
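
(For readers following along: the xen_cr3 encoding which all of the
fragments above rely on, as can be inferred from the neg/jz/jns/jg
logic - a paraphrase, not the patch's own wording:)

    /*
     * xen_cr3 == 0 : CR3 already holds the page tables Xen wants;
     *                nothing to do on entry.
     * xen_cr3 >  0 : the CR3 value to load on the next entry into Xen.
     * xen_cr3 <  0 : the same value, negated: the CR3 write is done or
     *                in flight, so the exit-to-Xen path must not switch
     *                again, while nested IST entries (NMI/#MC/#DF) can
     *                still recover the value by negating.
     */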

~Andrew

>
> Jan

