
Re: [Xen-devel] Strange PVM spinlock case revisited



On 12.02.2013 16:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
>> On 12.02.2013 15:07, Ian Campbell wrote:
>>> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>>>> On 11.02.2013 18:29, Ian Campbell wrote:
>>>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>>>> gather a few more facts which I try to provide below. For the real
>>>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>>>> direction.
>>>>>>
>>>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>>>
>>>>> This is on a host with >= 8 PCPUs, correct?
>>>>
>>>> Maybe >=4 but I cannot remember for sure anymore.
>>>
>>> OK, so the important point I was getting at was whether the guest was
>>> overcommitting the host or not, it seems like it is (2xVCPUs for each
>>> PCPU)
>>
>> Err not so much. I run on an 8-core host. ;)
> 
> Um, which is what I asked. What is >= 4 then?

Uhm, that was me not keeping P and V apart at some point. What I meant is that I
believe I saw this when running with either 4 or 8 VCPUs on the same 8 PCPU
host...

> 
> 
>>>
>>>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>>>> last kicked and for which lock and that might show us something?
>>>>
>>>> IIRC I had done this but it did not really show much. What I have done in
>>>> the meantime was to instrument the IRQ service functions in
>>>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows
>>>> (in another experiment where it is 1,2,5,6 in the same lock and 2,5,6
>>>> still polling) that the last events on the vcpu not polling are:
>>>>
>>>> xen_do_hypervisor_callback+127
>>>> call_softirq+29
>>>> call_softirq+125
>>>> call_softirq+29
>>>> call_softirq+125
>>>> call_softirq+29
>>>> call_softirq+125
>>>> xen_do_hypervisor_callback+28
>>>> call_softirq+29
>>>> xen_do_hypervisor_callback+28
>>>>
>>>> The lower offsets are when irq_count gets incremented and the higher
>>>> offsets are when irq_count gets decremented before ending the callback.
>>>> This shows that at least in this case there was an upcall, a softirq and
>>>> another upcall being triggered without finishing the previous one. There
>>>> was another experiment where I saw softirq, upcall, upcall. But at least
>>>> it seems that there is always a three level stack.
>>>> I believe that softirq, upcall would be ok as softirq processing enables
>>>> interrupts but is it expected to have an upcall interrupted by another one?
>>>
>>> I'm not sure. evtchn's don't have a priority mechanism so I expect not
>>> in general but individual interrupt handlers could well reenable
>>> interrupts, I think (it might be one of the flags you pass when calling
>>> request_irq?).
>>
>> Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
>> handler. I thought anything that needs interrupts enabled and/or takes longer
>> should be pushed off to a workqueue...
> 
> Either that or some sort of interrupt prioritisation mechanism in the
> hardware so only higher priority interrupts than the current one will
> occur.
> 
>>
>>>
>>>>> Can someone explain why the non-locked update of lock_spinners in
>>>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>>>> kicks the lock which is just about to become prev? maybe this is handled
>>>>> in xen_spin_lock_slow somehow?
>>>>
>>>> Isn't that safe because lock_spinners is percpu?
>>>
>>> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
>>> is what I think might be unsafe.
>>
>> Ah right, that. Hm, the way those two play together always was a bit brain
>> hurting.
> 
> Yes...
> 
>>  Though somehow it feels like the top level poll_irq would just return and
>> count things as a spurious wakeup. I think in that case I would expect the
>> lock_spinners entry of one vcpu to not match the freed lock...
>> Not a really scientific argument.
> 
> Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
> unspinning_lock might be interesting?

Hm, besides writing prev back in there (which should never be xl anyway), I
think that since write access only happens in xen_spin_lock_slow the stacking
should be ok. The two cases that could go wrong are sending a wakeup event when
the lock was just replaced on the other vcpu, or not sending one when it should
be sent.
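
To spell out what I mean by stacking, roughly (a simplified sketch from my
memory of arch/x86/xen/spinlock.c, so the details may be off):

    struct xen_spinlock {
            unsigned char lock;             /* 0 -> free, 1 -> locked */
            unsigned short spinners;        /* count of waiting cpus */
    };

    static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

    /* On entry to xen_spin_lock_slow(): advertise the lock this vcpu is about
     * to poll for and stack whatever was there before (the lock of an
     * interrupted slow path, or NULL). */
    static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
    {
            struct xen_spinlock *prev;

            prev = __this_cpu_read(lock_spinners);
            __this_cpu_write(lock_spinners, xl);
            wmb();          /* publish the lock of interest before the count */
            asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

            return prev;
    }

    /* On exit from xen_spin_lock_slow(): restore the stacked lock.  This is
     * where the suggested BUG_ON(__this_cpu_read(lock_spinners) != xl) would
     * go; after this point an unlocker may again pick this vcpu for prev. */
    static void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
    {
            asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
            wmb();          /* drop the count before restoring the stacked lock */
            __this_cpu_write(lock_spinners, prev);
    }
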
IIRC I had one attempt that printed whenever xen_spin_unlock_slow went through
all vcpus without finding one to signal. That seemed to happen rather often, but
it would then resolve itself because xen_spin_lock_slow for the other lock sets
the event pending for the prev lock, if there was one, after being successful
with the top level lock.
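
If I remember right, the two sides involved look roughly like this (again a
simplified sketch, not the exact code):

    /* Unlock side: scan for a vcpu advertising this lock, always starting at
     * vcpu 0, and kick only the first match. */
    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    if (per_cpu(lock_spinners, cpu) == xl) {
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            return;
                    }
            }
            /* falling out of the loop is the case my debug print triggered on */
    }

    /* Lock side, fragment inside xen_spin_lock_slow(): the top level lock was
     * grabbed while the event was already cleared; this is what later resolves
     * the missed wakeup for prev. */
            ret = xen_spin_trylock(lock);
            if (ret) {
                    if (prev != NULL)
                            /* make the interrupted slow path recheck its lock */
                            xen_set_irq_pending(irq);
                    goto out;
            }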

I think there is room for improvement: find a way not to always start the
search at vcpu 0 in xen_spin_unlock_slow, and maybe check in xen_spin_lock_slow
whether returning from poll without the current lock being free is because prev
has become free in the meantime. When there is more than one spinner, another
one may then be able to proceed...
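
Just to illustrate the first part, something like this (untested, and the
next_kick counter is made up purely for illustration):

    /* Idea: rotate the starting point of the scan instead of always beginning
     * at vcpu 0, so the lowest-numbered spinner is not always the one that
     * gets kicked. */
    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            static atomic_t next_kick = ATOMIC_INIT(0); /* made-up bookkeeping */
            unsigned int start = atomic_inc_return(&next_kick);
            unsigned int i;

            for (i = 0; i < nr_cpu_ids; i++) {
                    unsigned int cpu = (start + i) % nr_cpu_ids;

                    if (cpu_online(cpu) && per_cpu(lock_spinners, cpu) == xl) {
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            break;
                    }
            }
            /* the second part (rechecking prev after a seemingly spurious
             * return from poll in xen_spin_lock_slow) is not shown here */
    }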

> 
> I wonder if this is something where the ftrace and related kernel
> infrastructure might be useful?

Maybe. I am not sure there are hooks for what we are looking for, or whether
the system is not already hanging too badly by then. My usual approach of
looking at the dump would require manually finding the trace buffer and parsing
the records...
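
If anything, sprinkling trace_printk() into the two slow paths might be the
cheapest way to get a history into the ftrace ring buffer (illustration only,
placement and format strings made up):

    /* in xen_spin_lock_slow(), just before xen_poll_irq(irq): */
    trace_printk("spin poll: lock %p prev %p\n", xl, prev);

    /* in xen_spin_unlock_slow(), once a vcpu has been picked: */
    trace_printk("spin kick: lock %p -> cpu %d\n", xl, cpu);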




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

