[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler



create ^
title it linux pvops: fpu corruption due to incorrect assumptions
about TS bit after exception under Xen
thanks

On Wed, Nov 6, 2013 at 6:41 AM, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
>
> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the 
> process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
>
> [see the code below in linux kernel,
>
> void math_state_restore(void)
> {
>     struct task_struct *tsk = current;
>
>     if (!tsk_used_math(tsk)) {
>         local_irq_enable();
>         /*
>          * does a slab alloc which can sleep
>          */
>         if (init_fpu(tsk)) {                 <<<< Here it might open a 
> schedule window
>             /*
>              * ran out of memory!
>              */
>             do_group_exit(SIGKILL);
>             return;
>         }
>         local_irq_disable();
>     }
>
>     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu 
> user'
>                                          after the schedule window
>
>     /*
>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>      */
>     if (unlikely(restore_fpu_checking(tsk))) {
>         drop_init_fpu(tsk);
>         force_sig(SIGSEGV, tsk);
>         return;
>     }
>
>     tsk->fpu_counter++;
> }
> ]
>
> The check in linux kernel's switch_fpu_prepare() doesn't stts() either because
> the current process only gets marked as a FPU user after the schedule window
> (the story is a bit different if eagerfpu is enabled, anyway a sane hypervisor
> cannot depend on such undetermined things). And then supposing that the new
> process scheduled-in wants to touch FPU, nobody will do fxrstor/frstor for it 
> anymore,
> conducing to a serious data damage.
>
> Also, The point is everything is fine on linux + baremetal since CR0.TS will
> keep set until the interrupted #NM handler got the memory it needs and exits,
> so the incomer FPU user will get trapped as it's supposed to be.
>
> The test case is as below,
>
>         buf = malloc(BUF_SIZE);
>         if (!buf) {
>                 fprintf(stderr, "error %s during %s\n",
>                         strerror(-err),
>                         "malloc");
>                 return 1;
>         }
>         memset(buf, IO_PATTERN, BUF_SIZE);
>         memset(cmp_buf, IO_PATTERN, BUF_SIZE);
>
>         if (memcmp(buf, cmp_buf, BUF_SIZE)) {
>                 unsigned long long *ubuf = (unsigned long long *)buf;
>                 int i;
>
>                 for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
>                         printf("%d: 0x%llx\n", i, ubuf[i]);
>
>                 return 2;
>         }
>
> Two shell scripts on each box's dom0 runs above program repeatedly until
> the compare fails (so every time the C program is a new fpu user and triggers
> memory allocation). we can see the data damage at least once with
> xen 4.3 + linux 2.6.32 on ~200 physical machines within two hours.
> With xen 4.3 + linux 3.11.6 stable it becomes harder to reproduce
> (guess it's because of the eagerfpu feature introduced in linux kernel 3.7)
> but it's still possible to come out within about four hours.
>
> The fix here is trying to make xen behave as close to the hardware as 
> possible.
>
> This bug only has effects on PV guests (and including dom0 kernel of course).
>
> Cc: Wan Jia <jia.wanj@xxxxxxxxxxxxxxx>
> Cc: Shen Yiben <zituan@xxxxxxxxxx>
> Cc: Charles Wang <muming.wq@xxxxxxxxxx>
> Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Cc: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Signed-off-by: Zhu Yanhai <gaoyang.zyh@xxxxxxxxxx>
> ---
>  xen/arch/x86/traps.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 77c200b..b0321a6 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -3267,8 +3267,8 @@ void do_device_not_available(struct cpu_user_regs *regs)
>
>      if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
>      {
> +        stts();
>          do_guest_trap(TRAP_no_device, regs, 0);
> -        curr->arch.pv_vcpu.ctrlreg[0] &= ~X86_CR0_TS;
>      }
>      else
>          TRACE_0D(TRC_PV_MATH_STATE_RESTORE);
> --
> 1.7.4.4
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.