Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
On 27/02/14 00:04, Matt Wilson wrote:
> On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
>>>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@xxxxxxxxx> wrote:
>>> As we know, Intel x86's CR0.TS is a sticky bit, which means once set
>>> it remains set until cleared by some software routine; in other words,
>>> the exception handler expects the bit to be set when it starts to execute.
>>
>> Since when would that be the case? CR0.TS is entirely unaffected
>> by exception invocations according to all I know. All that is known
>> here is that #NM wouldn't have occurred in the first place if CR0.TS
>> was clear.
>>
>>> However, Xen doesn't simulate this behavior quite well for PV guests -
>>> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally at the very
>>> beginning, so the guest kernel's #NM handler runs with CR0.TS cleared.
>>> Generally speaking this is fine, since the Linux kernel executes the
>>> exception handler with interrupts disabled and a sane #NM handler will
>>> clear the bit anyway before it exits, but there's a catch: if this is
>>> the first FPU trap for the process, the Linux kernel must allocate a
>>> piece of SLAB memory in which to save the FPU registers, which opens a
>>> schedule window because the memory allocation might sleep -- with
>>> CR0.TS kept clear!
>>>
>>> [see the code below in the Linux kernel,
>>
>> You're apparently referring to the pvops kernel.
>>
>>> void math_state_restore(void)
>>> {
>>>         struct task_struct *tsk = current;
>>>
>>>         if (!tsk_used_math(tsk)) {
>>>                 local_irq_enable();
>>>                 /*
>>>                  * does a slab alloc which can sleep
>>>                  */
>>>                 if (init_fpu(tsk)) {   <<<< Here it might open a schedule window
>>>                         /*
>>>                          * ran out of memory!
>>>                          */
>>>                         do_group_exit(SIGKILL);
>>>                         return;
>>>                 }
>>>                 local_irq_disable();
>>>         }
>>>
>>>         __thread_fpu_begin(tsk);   <<<< Here the process gets marked as an
>>>                                         'fpu user' after the schedule window
>>>
>>>         /*
>>>          * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>>>          */
>>>         if (unlikely(restore_fpu_checking(tsk))) {
>>>                 drop_init_fpu(tsk);
>>>                 force_sig(SIGSEGV, tsk);
>>>                 return;
>>>         }
>>>
>>>         tsk->fpu_counter++;
>>> }
>>> ]
>>
>> May I direct your attention to the XenoLinux one:
>>
>> asmlinkage void math_state_restore(void)
>> {
>>         struct task_struct *me = current;
>>
>>         /* NB. 'clts' is done for us by Xen during virtual trap. */
>>         __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>>         if (!used_math())
>>                 init_fpu(me);
>>         restore_fpu_checking(&me->thread.i387.fxsave);
>>         task_thread_info(me)->status |= TS_USEDFPU;
>> }
>>
>> Note the comment close to the beginning - the fact that CR0.TS
>> is clear at exception handler entry is actually part of the PV ABI,
>> i.e. by altering hypervisor behavior here you break all forward
>> ported kernels.
>>
>> Nevertheless I agree that there is an issue, but this needs to be
>> fixed on the Linux side (hence adding the Linux maintainers to Cc);
>> this issue was introduced way back in 2.6.26 (before that there
>> was no allocation on that path). It's not clear, though, whether
>> using GFP_ATOMIC for the allocation would be preferable over
>> stts() before calling the allocation function (and clts() if it
>> succeeded), or whether perhaps to defer the stts() until we
>> actually know the task is being switched out. It's going to be an
>> ugly, Xen-specific hack in any event.
>
> Was there ever a resolution to this problem? I never saw a comment
> from the Linux Xen PV maintainers.
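
[For reference, a minimal sketch of the stts()/clts() option Jan mentions above,
applied to the pvops math_state_restore() he quotes. This is illustrative only,
not a merged patch: the exact placement of stts()/clts() is an assumption drawn
from his wording, and a real fix would presumably be made conditional on running
as a Xen PV guest - the "ugly, Xen-specific hack" he anticipates.

void math_state_restore(void)
{
        struct task_struct *tsk = current;

        if (!tsk_used_math(tsk)) {
                local_irq_enable();
                /*
                 * Re-set TS before the slab allocation, which may sleep:
                 * a task scheduled in during that window then takes its
                 * own #NM instead of silently running with TS clear.
                 */
                stts();
                if (init_fpu(tsk)) {
                        /* ran out of memory! */
                        do_group_exit(SIGKILL);
                        return;
                }
                /* Allocation succeeded; clear TS again for this task. */
                clts();
                local_irq_disable();
        }

        __thread_fpu_begin(tsk);

        if (unlikely(restore_fpu_checking(tsk))) {
                drop_init_fpu(tsk);
                force_sig(SIGSEGV, tsk);
                return;
        }

        tsk->fpu_counter++;
}

The alternative Jan names, allocating with GFP_ATOMIC, avoids the sleep entirely
but is more likely to fail under memory pressure, turning a process's first FPU
use into a SIGKILL - which is presumably why the thread leaves the choice open.]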
I think allocating on the context switch is mad and the irq enable/disable just
to allow the allocation looks equally mad.

I had vague plans to maintain a mempool for FPU contexts, but couldn't
immediately think how we could guarantee that the pool would be kept
sufficiently populated.

David
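
[A rough illustration of the mempool idea David floats, under stated assumptions:
the cache and pool names, FPU_POOL_MIN, and the 512-byte object size are invented
for the sketch (the real FPU state lives in the task_xstate kmem_cache, whose size
is only known at boot), and the hard part he points out - keeping the reserve
topped up - is not solved here.

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/gfp.h>

#define FPU_POOL_MIN    16      /* arbitrary number of preallocated contexts */

static struct kmem_cache *fpu_state_cachep;     /* stand-in for task_xstate_cachep */
static mempool_t *fpu_state_pool;

static int __init fpu_state_pool_init(void)
{
        /* In reality the object size would be xstate_size, known at boot. */
        fpu_state_cachep = kmem_cache_create("fpu_state_demo", 512,
                                             __alignof__(long), SLAB_PANIC, NULL);
        fpu_state_pool = mempool_create_slab_pool(FPU_POOL_MIN, fpu_state_cachep);
        return fpu_state_pool ? 0 : -ENOMEM;
}
core_initcall(fpu_state_pool_init);

/*
 * What the #NM path would call instead of a sleeping slab allocation:
 * GFP_ATOMIC never schedules, and up to FPU_POOL_MIN preallocated
 * elements back up an allocation failure, so CR0.TS stays meaningful
 * for the whole handler and the irq enable/disable dance goes away.
 */
static void *fpu_state_alloc_atomic(void)
{
        return mempool_alloc(fpu_state_pool, GFP_ATOMIC);
}

static void fpu_state_free(void *fpstate)
{
        mempool_free(fpstate, fpu_state_pool);
}

mempool_free() refills the reserve as contexts are released, but nothing here
guarantees the pool stays populated if many tasks take their first FPU trap
before any exit - which is exactly the gap David describes.]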