[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler



On Wed, Nov 06, 2013 at 02:41:12PM +0800, Zhu Yanhai wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
> 
> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the 
> process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
> 


With the Ingo's FPU rewrite we haven't been able to retrigger this.
(Tests ran for 2 weeks while they would have failed within
two hours).

And when I dug in this I found the reason:

commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
Author: Dave Hansen <dave@xxxxxxxx>
Date:   Fri Jul 17 12:28:11 2015 +0200

    x86/fpu, sched: Dynamically allocate 'struct fpu'
    
    The FPU rewrite removed the dynamic allocations of 'struct fpu'.
    But, this potentially wastes massive amounts of memory (2k per
    task on systems that do not have AVX-512 for instance).
    
    Instead of having a separate slab, this patch just appends the
    space that we need to the 'task_struct' which we dynamically
    allocate already.  This saves from doing an extra slab
    allocation at fork().

And that when the #NM is called ('do_device_not_available')
it does:

 fpu__restore(&current->thread.fpu); /* interrupts still off */          
   |+- fpu__activate_curr (which just inits the already allocated space)
   |     \- memset(state, 0, xstate_size);
   |+- fpregs_activate
         \- stts()

So there is no scheduling window during this time, while in
kernels prior to Linux 4.2 there was.

And it took a bit of time to figure out what exactly the problem was.

I appreciate folks emails (and this giant thread) about this but
without some sort of diagram it was hard to understand this
(at least to me).

So here it is in case somebody is doing code archaeology:

For simplicity we assume the guest/baremetal use the lazy mechanism
not eager. That makes 'switch_fpu_prepare' (called by schedule()) effectively:

if (previous task had PF_USED_MATH set)
   stts (CR0.TS=1)
else
   ;

I am ignoring the case if the task had used the FPU more than
five times - where we do things a bit different.

The time diagram looks great at 132x42.

Anyhow, lets assume that we have two tasks: A and B. Both
haven't used the FPU. This is on PVHVM:


CR0.TS=1                       CR0.TS=1                 CR0.TS=0                
   CR0.TS=1                       CR0.TS=0
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |           
    | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |           
    |        |                   |       |
op     |  \- fpu_init            |      |                    |      |           
    |        |                   |       |
       |       \- .. schedule()  |      |                    |      |           
    |        |                   |       |
       |           [swap task B] |      |                    |      |           
    |        |                   |       |
       |           [since task A |      |                    |      |           
    |        |                   |       |
       |            hadn't set   |      |                    |      |           
    |        |                   |       |
       |            PF_USED_MATH |      |                    |      |           
    |        |                   |       |
       |            we don't muck|      |                    |      |           
    |        |                   |       |
       |            with CR0.TS] |      |                    |      |           
    |        |                   |       |
       |                         |MMX op|                    |      |           
    |        |                   |       |
       |                         |      |#NM                 |      |           
    |        |                   |       |
       |                         |      |math_state_restore  |      |           
    |        |                   |       |
       |                         |      | fpu_init worked    |      |           
    |        |                   |       |
       |                         |      |  clts()            |      |           
    |        |                   |       |
       |                         |      |task_B->flags |=    |      |           
    |        |                   |       |
       |                         |      |  PF_USED_MATH      |      |           
    |        |                   |       |
       |                         |      |  return;           |      |           
    |        |                   |       |
       |                         |      |                    |syscall|          
    |        |                   |       |
       |                         |      |                    |      |schedule() 
    |        |                   |       |
       |                         |      |                    |      |[swap task 
A]  |        |                   |       |
       |                         |      |                    |      |[taskB has 
    |        |                   |       |
       |                         |      |                    |      | 
PF_USED_MATH] |        |                   |       |
       |                         |      |                    |      |[so 
CR0.TS=1]  |        |                   |       |
       |                         |      |                    |      |  task A 
runs  |        |                   |       |
       |                         |      |                    |      |           
    |MMX op  |                   |       |
       |                         |      |                    |      |           
    |        |#NM                |       |
       |                         |      |                    |      |           
    |        | fpu_init works    |       |
       |                         |      |                    |      |           
    |        | clts()            |       |
       |                         |      |                    |      |           
    |        |  taskA->flags |=  |       |
       |                         |      |                    |      |           
    |        |  PF_USED_MATH     |       |
       |                         |      |                    |      |           
    |        |  return           |       |
       |                         |      |                    |      |           
    |        |                   |MMX op |



And under Xen PV:



CR0.TS=1                       CR0.TS=0                 CR0.TS=0                
   CR0.TS=0                       CR0.TS=0
[but Xen sets it to
CR0.TS=0 and calls
Linux #NM:]
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |           
    | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |           
    |        |                   |       |
op     |  \- fpu_init            |      |                    |      |           
    |        |                   |       |
       |       \- .. schedule()  |      |                    |      |           
    |        |                   |       |
       |           [swap task B] |      |                    |      |           
    |        |                   |       |
       |           [since task A |      |                    |      |           
    |        |                   |       |
       |            hadn't set   |      |                    |      |           
    |        |                   |       |
       |            PF_USED_MATH |      |                    |      |           
    |        |                   |       |
       |            we don't muck|      |                    |      |           
    |        |                   |       |
       |            with CR0.TS] |      |                    |      |           
    |        |                   |       |
       |                         |MMX op|                    |      |           
    |        |                   |       |
       |                         |      |[no trap to Linux or|      |           
    |        |                   |       |
       |                         |      |Xen as CR0.TS=0]    |      |           
    |        |                   |       |
       |                         |      |                    |      |           
    |        |                   |       |
       |                         |      |                    |      |           
    |        |                   |       |
       |                         |      |                    |      |           
    |        |                   |       |
       |                         |      |                    |      |           
    |        |                   |       |
       |                         |      |                    |      |           
    |        |                   |       |
       |                         |      |                    |syscall|          
    |        |                   |       |
       |                         |      |                    |      |schedule() 
    |        |                   |       |
       |                         |      |                    |      |[swap task 
A]  |        |                   |       |
       |                         |      |                    |      |[with task 
B]  |        |                   |       |
       |                         |      |                    |      |  task A 
runs  |        |                   |       |
       |                         |      |                    |      |           
    |MMX op  |                   |       |
       |                         |      |                    |      |           
    |        |[again, no trap to |       |
       |                         |      |                    |      |           
    |        | Xen or Linux b/c  |       |
       |                         |      |                    |      |           
    |        | CR0.TS=0 *1]      |       |
       |                         |      |                    |      |           
    |        |                   |MMX op |

And so on - FPU registers are effectively leaking across tasks.

The [*1] refers to the Xen scheduler. If any of the
syscalls that the user application called, ended in the Linux kernel
halt (xen_safe_halt) routine - we would deschedule the guest VCPU.

When that VCPU is re-scheduled, Xen would set CR0.TS=1 back
so the #NM would function again.

Not pretty - and only happening if the fpu_alloc() ends
up calling the schedule().

I followed Jan's recommendation and cobbled this up:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab52..06b7843 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -837,11 +837,17 @@ asmlinkage __visible void __attribute__((weak)) 
smp_threshold_interrupt(void)
  * Must be called with kernel preemption disabled (eg with local
  * local interrupts as in the case of do_device_not_available).
  */
+#include <xen/xen.h>
 void math_state_restore(void)
 {
        struct task_struct *tsk = current;
 
        if (!tsk_used_math(tsk)) {
+               /*
+                * See http://bugs.xenproject.org/xen/bug/40
+                */
+               if (xen_pv_domain())
+                       stts();
                local_irq_enable();
                /*
                 * does a slab alloc which can sleep
@@ -854,6 +860,8 @@ void math_state_restore(void)
                        return;
                }
                local_irq_disable();
+               if (xen_pv_domain())
+                       clts();
        }
 
        /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */

For the older pvops kernels and trying it out now.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.