
Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler



On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
> > As we know, Intel x86's CR0.TS is a sticky bit: once set, it remains set
> > until some software routine clears it; in other words, the exception
> > handler expects the bit to still be set when it starts executing.
> 
> Since when would that be the case? CR0.TS is entirely unaffected
> by exception invocations, as far as I know. All that is known
> here is that #NM wouldn't have occurred in the first place if CR0.TS
> had been clear.
> 
> > However, Xen doesn't simulate this behavior quite faithfully for PV guests:
> > vcpu_restore_fpu_lazy() clears CR0.TS unconditionally at its very beginning,
> > so the guest kernel's #NM handler runs with CR0.TS cleared. Generally
> > speaking that's fine, since the Linux kernel executes the exception handler
> > with interrupts disabled and a sane #NM handler will clear the bit anyway
> > before it exits. But there's a catch: on the first FPU trap for a process,
> > the Linux kernel must allocate a piece of SLAB memory in which to save the
> > FPU registers, which opens a scheduling window because the allocation may
> > sleep -- and all the while CR0.TS stays clear!
> > 
> > [see the code below from the Linux kernel:
> 
> You're apparently referring to the pvops kernel.
> 
> > void math_state_restore(void)
> > {
> >     struct task_struct *tsk = current;
> > 
> >     if (!tsk_used_math(tsk)) {
> >         local_irq_enable();
> >         /*
> >          * does a slab alloc which can sleep
> >          */
> >         if (init_fpu(tsk)) {        <<<< Here it might open a schedule window
> >             /*
> >              * ran out of memory!
> >              */
> >             do_group_exit(SIGKILL);
> >             return;
> >         }
> >         local_irq_disable();
> >     }
> > 
> >     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a
> >                                      'fpu user' after the schedule window
> > 
> >     /*
> >      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
> >      */
> >     if (unlikely(restore_fpu_checking(tsk))) {
> >         drop_init_fpu(tsk);
> >         force_sig(SIGSEGV, tsk);
> >         return;
> >     }
> > 
> >     tsk->fpu_counter++;
> > }
> > ]
> 
> May I direct your attention to the XenoLinux one:
> 
> asmlinkage void math_state_restore(void)
> {
>       struct task_struct *me = current;
> 
>       /* NB. 'clts' is done for us by Xen during virtual trap. */
>       __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>       if (!used_math())
>               init_fpu(me);
>       restore_fpu_checking(&me->thread.i387.fxsave);
>       task_thread_info(me)->status |= TS_USEDFPU;
> }
> 
> Note the comment close to the beginning - the fact that CR0.TS
> is clear at exception handler entry is actually part of the PV ABI,
> i.e. by altering hypervisor behavior here you break all forward
> ported kernels.
> 
> Nevertheless I agree that there is an issue, but this needs to be
> fixed on the Linux side (hence adding the Linux maintainers to Cc);
> this issue was introduced way back in 2.6.26 (before that there
> was no allocation on that path). It's not clear though whether
> using GFP_ATOMIC for the allocation would be preferable over
> stts() before calling the allocation function (and clts() if it
> succeeded), or whether perhaps to defer the stts() until we
> actually know the task is being switched out. It's going to be an
> ugly, Xen-specific hack in any event.

Was there ever a resolution to this problem? I never saw a comment
from the Linux Xen PV maintainers.

--msw


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

