
Re: [Xen-devel] Live-Patch application failure in core-scheduling mode



On Fri, Feb 07, 2020 at 10:25:05AM +0100, Jürgen Groß wrote:
> On 07.02.20 09:49, Jan Beulich wrote:
> > On 07.02.2020 09:42, Jürgen Groß wrote:
> > > On 07.02.20 09:23, Jan Beulich wrote:
> > > > On 07.02.2020 09:04, Jürgen Groß wrote:
> > > > > On 06.02.20 15:02, Sergey Dyasli wrote:
> > > > > > On 06/02/2020 11:05, Sergey Dyasli wrote:
> > > > > > > On 06/02/2020 09:57, Jürgen Groß wrote:
> > > > > > > > On 05.02.20 17:03, Sergey Dyasli wrote:
> > > > > > > > > Hello,
> > > > > > > > > 
> > > > > > > > > > I'm currently investigating a Live-Patch application failure in core-scheduling mode and this is an example of what I usually get (it's easily reproducible):
> > > > > > > > > 
> > > > > > > > > >         (XEN) [  342.528305] livepatch: lp: CPU8 - IPIing the other 15 CPUs
> > > > > > > > > >         (XEN) [  342.558340] livepatch: lp: Timed out on semaphore in CPU quiesce phase 13/15
> > > > > > > > > >         (XEN) [  342.558343] bad cpus: 6 9
> > > > > > > > > 
> > > > > > > > > >         (XEN) [  342.559293] CPU:    6
> > > > > > > > > >         (XEN) [  342.559562] Xen call trace:
> > > > > > > > > >         (XEN) [  342.559565]    [<ffff82d08023f304>] R common/schedule.c#sched_wait_rendezvous_in+0xa4/0x270
> > > > > > > > > >         (XEN) [  342.559568]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
> > > > > > > > > >         (XEN) [  342.559571]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
> > > > > > > > > >         (XEN) [  342.559574]    [<ffff82d080278ec5>] F arch/x86/domain.c#guest_idle_loop+0x35/0x60
> > > > > > > > > 
> > > > > > > > > >         (XEN) [  342.559761] CPU:    9
> > > > > > > > > >         (XEN) [  342.560026] Xen call trace:
> > > > > > > > > >         (XEN) [  342.560029]    [<ffff82d080241661>] R _spin_lock_irq+0x11/0x40
> > > > > > > > > >         (XEN) [  342.560032]    [<ffff82d08023f323>] F common/schedule.c#sched_wait_rendezvous_in+0xc3/0x270
> > > > > > > > > >         (XEN) [  342.560036]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
> > > > > > > > > >         (XEN) [  342.560039]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
> > > > > > > > > >         (XEN) [  342.560042]    [<ffff82d080279db5>] F arch/x86/domain.c#idle_loop+0x55/0xb0
> > > > > > > > > 
> > > > > > > > > > The first HT sibling is waiting for the second in the LP-application context while the second waits for the first in the scheduler context.
> > > > > > > > > 
> > > > > > > > > Any suggestions on how to improve this situation are welcome.
> > > > > > > > 
> > > > > > > > Can you test the attached patch, please? It is only tested to boot, so I did no livepatch tests with it.
> > > > > > > 
> > > > > > > Thank you for the patch! It seems to fix the issue in my manual testing. I'm going to submit automatic LP testing for both thread/core modes.
> > > > > > 
> > > > > > Andrew suggested testing late ucode loading as well, and so I did. It uses stop_machine() to rendezvous cpus, and it failed with a similar backtrace on a problematic CPU. But in this case the system crashed, since there is no timeout involved:
> > > > > > 
> > > > > >        (XEN) [  155.025168] Xen call trace:
> > > > > >        (XEN) [  155.040095]    [<ffff82d0802417f2>] R _spin_unlock_irq+0x22/0x30
> > > > > >        (XEN) [  155.069549]    [<ffff82d08023f3c2>] S common/schedule.c#sched_wait_rendezvous_in+0xa2/0x270
> > > > > >        (XEN) [  155.109696]    [<ffff82d08023f728>] F common/schedule.c#sched_slave+0x198/0x260
> > > > > >        (XEN) [  155.145521]    [<ffff82d080240e1a>] F common/softirq.c#__do_softirq+0x5a/0x90
> > > > > >        (XEN) [  155.180223]    [<ffff82d0803716f6>] F x86_64/entry.S#process_softirqs+0x6/0x20
> > > > > > 
> > > > > > It looks like your patch provides a workaround for the LP case, but other cases like stop_machine() remain broken, since the underlying issue with the scheduler is still there.
> > > > > 
> > > > > And here is the fix for ucode loading (that was in fact the only case where stop_machine_run() wasn't already called in a tasklet).
> > > > 
> > > > This is a rather odd restriction, and hence will need explaining.
> > > 
> > > stop_machine_run() uses a tasklet on each online cpu (excluding the one it was called on) to do a rendezvous of all cpus. With tasklets always being executed on idle vcpus, it is mandatory for stop_machine_run() to be called on an idle vcpu as well when core scheduling is active, as otherwise a deadlock will occur. This is accomplished by the use of continue_hypercall_on_cpu().
> > 
> > Well, it's this "a deadlock" which is too vague for me. What exactly is
> > it that deadlocks, and where (if not obvious from the description of
> > that case) is the connection to core scheduling? Fundamentally such an
> > issue would seem to call for an adjustment to core scheduling logic,
> > not placing of new restrictions on other pre-existing code.
> 
> This is the main objective of core scheduling: on all siblings of a
> core only vcpus of exactly one domain are allowed to be active.
> 
> As tasklets only run on idle vcpus, and stop_machine_run() activates
> tasklets on all cpus but the one it has been called on in order to
> rendezvous, stop_machine_run() must itself be called on an idle vcpu,
> too: otherwise there is no way for the scheduler to activate the idle
> vcpu for the tasklet on the sibling of the cpu stop_machine_run() has
> been called on.

Could there also be issues with other rendezvous not running in
tasklet context?

One triggered by on_selected_cpus for example?

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

