Re: [Xen-devel] Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 09:48, Ian Campbell wrote:
> On Thu, 2012-10-18 at 08:38 +0100, Stefan Bader wrote:
>> On 18.10.2012 09:08, Jan Beulich wrote:
>>>>>> On 18.10.12 at 09:00, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
>>>>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>> In each case, the event channels are masked (no surprise given the
>>>>> conversation so far on this thread), and have no pending events.
>>>>> Therefore, I believe we are looking at the same bug.
>>>>
>>>> That seems very unlikely (albeit not impossible) to me, given that
>>>> the non-pvops kernel uses ticket locks while the pvops one doesn't.
>>>
>>> And in fact we had a similar problem with our original ticket lock
>>> implementation, exposed by an open coded lock in the scheduler's
>>> run queue management. But that was really ticket lock specific, in
>>> that a CPU could passively become the owner of a lock while polling -
>>> that's impossible with pvops' byte locks afaict.
>>
>> One of the trains of thought I had was whether it could happen that a cpu
>> is in polling and the task gets moved. But I don't think that can happen,
>> as the hypercall is unlikely to be a place where any scheduling happens
>> (preempt is none). And it would be much more common...
>>
>> One detail which I hope someone can fill in is the whole "interrupted
>> spinlock" thing: saving the last lock pointer stored in the per-cpu
>> lock_spinners and so on. Is that really only for spinlocks taken without
>> interrupts disabled, or do I miss something there?
>
> spinning_lock() returns the old lock which the caller is expected to
> remember and replace via unspinning_lock() -- it effectively implements
> a stack of locks which are being waited on. xen_spin_lock_slow (the only
> caller) appears to do this correctly from a brief inspection.

Yes, but just *when* can there be a stack of locks (spinlocks)? The poll_irq
hypercall seems to be an active process (in the sense of not preempting to
another task). How could there be a situation where another lock (on the same
cpu) is being taken?

> Is there any chance this is just a simple AB-BA or similar type
> deadlock? Do we have data which suggests all vCPUs are waiting on the
> same lock or just that they are waiting on some lock? I suppose lockdep
> (which I think you mentioned before?) would have caught this, unless pv
> locks somehow confound it?

In the one situation where I went deeper into the tasks that appeared to be
on a cpu, one task was waiting to signal another task that looked to have
just been scheduled out, and the cpu that task had been running on was doing
an idle balance that waited on the lock of cpu#0's runqueue. cpu#0 itself
seemed to be waiting in the slow path (the lock pointer was in
lock_spinners[0]), but the lock itself was 0. Though there is a chance that
this is always just a coincidental state where the lock had just been
released, and is more related to how the Xen stack does a guest dump.

So it would help to find out who holds the other lock. Unfortunately a kernel
with full lock debugging enabled is sufficiently different in timing that I
cannot reproduce the issue on a test machine. And from reported crashes in
production I have no data.

> Ian.
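For reference, a minimal sketch of the nesting mechanism Ian describes,
modelled on the pvops xen_spin_lock_slow() path. The names lock_spinners,
spinning_lock(), unspinning_lock(), xen_spin_lock_slow() and poll_irq are the
ones used in the thread; everything else (the struct layout, lock_kicker_irq,
the elided steps) is illustrative and simplified, not copied from any
particular kernel tree:

    /* Illustrative sketch, not verbatim kernel code. */
    struct xen_spinlock {
            unsigned char lock;             /* 0 -> free, 1 -> taken */
            unsigned short spinners;        /* count of waiting cpus */
    };

    /* Per-cpu pointer to the lock this CPU is currently polling on. */
    static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

    /*
     * Record which lock this CPU is about to poll on and return whatever
     * was there before.  A non-NULL previous value means we interrupted
     * another slow-path spin on this same CPU, i.e. the "stack" grew.
     */
    static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
    {
            struct xen_spinlock *prev;

            prev = __this_cpu_read(lock_spinners);
            __this_cpu_write(lock_spinners, xl);
            wmb();  /* publish the pointer before we start polling */
            return prev;
    }

    /* Restore the lock pointer of the interrupted context ("pop"). */
    static void unspinning_lock(struct xen_spinlock *xl,
                                struct xen_spinlock *prev)
    {
            __this_cpu_write(lock_spinners, prev);
    }

    static void xen_spin_lock_slow(struct xen_spinlock *xl)
    {
            int irq = __this_cpu_read(lock_kicker_irq);
            struct xen_spinlock *prev;

            prev = spinning_lock(xl);       /* "push" */
            /* ... mark ourselves as a spinner, re-check the lock ... */
            xen_poll_irq(irq);              /* block in poll_irq until kicked */
            /* ... */
            unspinning_lock(xl, prev);      /* "pop" */
    }

As far as I can tell from this structure, the stack can only grow beyond one
entry when the poll is interrupted on the same CPU and the interrupt handler
itself ends up in the slow path for a second contended lock.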