
Re: [Xen-devel] POD: soft lockups in dom0 kernel



On 12/06/2013 07:00 AM, David Vrabel wrote:
> On 06/12/13 11:30, Jan Beulich wrote:
>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@xxxxxxxxxx> wrote:
>>> We do not want to disable the soft lockup detection here as it has found
>>> a bug.  We can't have tasks that are unschedulable for minutes as it
>>> would only take a handful of such tasks to hose the system.
>> My understanding is that the soft lockup detection is what its name
>> says - a mechanism to find cases where the kernel software locked
>> up. Yet that's not the case with long-running hypercalls.
> Well ok, it's not a lockup in the kernel but it's still a task that
> cannot be descheduled for minutes of wallclock time.  This is still a
> bug that needs to be fixed.

>>> We should put an explicit preemption point in.  This will fix it for the
>>> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
>>> configuration.  Or perhaps this should even be a cond_resched() call to
>>> fix it for fully non-preemptible kernels as well.
>> How do you imagine doing this? When the hypervisor preempts a
>> hypercall, all the kernel gets to see is that it drops back into the
>> hypercall page, such that the next thing to happen would be
>> re-execution of the hypercall. You can't call anything at that point;
>> all that can get run here are interrupts (i.e. event upcalls). Or do
>> you suggest calling cond_resched() from within
>> __xen_evtchn_do_upcall()?
> I've not looked at how.

KVM has a hook (kvm_check_and_clear_guest_paused()) into the watchdog code to prevent it from producing false positives (for a different reason, though). If we claim that the soft lockup mechanism is only meant to detect Linux kernel problems, and not long-running hypervisor code, then perhaps we can make this hook a bit more generic.

We would still need to think about what may happen if we are stuck in the hypervisor for an abnormally long time. Maybe this Xen hook could still return false when such cases are detected.

-boris




> And even if you do - how certain is it that whatever gets its continuation
> deferred won't interfere with other things the kernel wants to do?
> (Since if you did it that way, you'd cover all hypercalls at
> once, not just those coming through privcmd, and hence you could
> end up with partially completed multicalls or other forms of batching,
> plus you'd need to deal with possibly active lazy modes.)
I would only do this for hypercalls issued by the privcmd driver.

David


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

