Re: [Xen-devel] Strange PVM spinlock case revisited
On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I had been reporting an issue which causes Xen PVM guests
> to hang (apparently related to spinlocks). Since then I was able to
> gather a few more facts which I try to provide below. For the real
> reasons, I am still puzzled and would be thankful for any hints or
> direction.
>
> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging, at least 2 VCPUs are in xen_spin_lock_slow with the
>   lock free. IIRC the weird lock always belongs to one VCPU's runqueue.
> - The testcase shows the problem for guest kernel versions v3.2..v3.5
>   (not verified farther back). Since v3.6 it cannot be reproduced.
>   Bisect ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>   "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
>
> Maybe allowing or not allowing the interrupts in xen_spin_lock_slow
> just manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow; a condensed
sketch of that slow path follows below).

> The same way that the changed TLB flush may reduce it by causing
> fewer IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. VCPU2 seems to be doing something but
> looks to be rather deep in event handling. The irq_count is at 2, but
> irq_count seems to be used in a few more places than just the
> xen_hypervisor callback. Though that at least seems to be the entry on
> the stack I can see for VCPU2. The contents of the per-cpu variables
> irq_regs and irq_stack_ptr at least seem consistent, as if a callback
> came in while being in the sched_op hypercall. But then the bt command
> seems to be operating on a completely different stack.
> VCPU2 seems to be active, not polling (event processing could be
> initiated from enable_irqs via a forced callback or before finally
> exiting the sched_op hypercall); there seems to be a pending upcall
> but upcalls are masked.
> The next thing I would try is to instrument the sites incrementing
> and decrementing irq_count...
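For reference, here is a condensed sketch of the v3.2-era slow path in
arch/x86/xen/spinlock.c (reconstructed with the stat counters and spin
timing stripped out, so details may differ from any particular tree).
The "other potential fix" mentioned above amounts to skipping the
raw_local_irq_enable() marked below:

static noinline int xen_spin_lock_slow(struct arch_spinlock *lock,
				       bool irq_enable)
{
	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
	struct xen_spinlock *prev;
	int irq = __this_cpu_read(lock_kicker_irq);
	int ret;

	/* If the kicker irq is not set up yet, just keep spinning. */
	if (irq == -1)
		return 0;

	/* Announce which lock we wait on (the per-cpu lock_spinners slot). */
	prev = spinning_lock(xl);

	do {
		unsigned long flags;

		xen_clear_irq_pending(irq);

		/* The lock may have been freed while we registered. */
		ret = xen_spin_trylock(lock);
		if (ret) {
			/*
			 * If we preempted another spinner, leave the irq
			 * pending so the outer slow path rechecks its lock.
			 */
			if (prev != NULL)
				xen_set_irq_pending(irq);
			goto out;
		}

		flags = arch_local_save_flags();
		if (irq_enable)
			raw_local_irq_enable();	/* <-- events unmasked here */

		/*
		 * Block in SCHEDOP_poll until the unlocker kicks our irq;
		 * this returns immediately if the irq is already pending.
		 */
		xen_poll_irq(irq);

		raw_local_irq_restore(flags);
	} while (!xen_test_irq_pending(irq));	/* spurious wakeup: poll again */

out:
	unspinning_lock(xl, prev);
	return ret;	/* 0 sends the caller back to the spinning fast path */
}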
> ---
>
> Below some of the info seen from dom0 debugging keys and looking
> at a dump with crash:
>
> Dom0 Info:
>
> Polling vCPUs: {1,5-7}
> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>
> 10 [0/1]: s=6 n=1 x=0
> 40 [0/1]: s=6 n=6 x=0
> 46 [0/1]: s=6 n=7 x=0
>
> Guest Info:
>
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly this means that 2 threads (VCPU2 & 6, I
suppose) are spinning on this lock, but no one is holding it?

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is in the nested wakeup path.

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock, and that might show us something?

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is
handled in xen_spin_lock_slow somehow? In a similar vein, do we not
need a barrier in unspinning_lock to ensure that the remote CPU sees
prev getting restored? (Both helpers are sketched below.)

> [3] (struct xen_spinlock *) 0x
> [4] (struct xen_spinlock *) 0x
> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
>
> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
>
> irq_count[1] = 1
> irq_count[2] = 2
> It is -1 for the rest.
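For reference, the spinning_lock()/unspinning_lock() pair those
questions refer to looks roughly like this in the v3.2-era
arch/x86/xen/spinlock.c (inc_spinners()/dec_spinners() are locked byte
increments/decrements of xl->spinners):

static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev;

	/* Remember any outer lock we were already spinning on (nesting). */
	prev = __this_cpu_read(lock_spinners);
	/* Plain per-cpu write: the non-locked update asked about above. */
	__this_cpu_write(lock_spinners, xl);

	wmb();		/* set lock of interest before incrementing count */

	inc_spinners(xl);

	return prev;
}

static inline void unspinning_lock(struct xen_spinlock *xl,
				   struct xen_spinlock *prev)
{
	dec_spinners(xl);
	wmb();		/* decrement count before restoring outer lock */
	__this_cpu_write(lock_spinners, prev);
	/* No barrier after the restore, hence the visibility question. */
}

As for the timeout experiment: xen_poll_irq(irq) is just
xen_poll_irq_timeout(irq, 0 /* no timeout */), so having the slow path
pass a non-zero timeout should be a one-line change.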
>
> *(struct pt_regs *) irq_regs[2] = {
>     .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
>     .sp = 0xffff8803bfc83ce8
> }
>
> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
>
> #> bt -T ffff88033d65c4a0
> PID: 2050  TASK: ffff88033d65c4a0  CPU: 2  COMMAND: "postgres"
>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>   [ffff88033d72da90] ip_output at ffffffff81577518
>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
> --- bt -t would end here
>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>     RBP: 0000000000000150  R8:  0000000000000015  R9:  0000000000000000
>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
>
> #> bt -T ffff88033b812dc0
> PID: 2069  TASK: ffff88033b812dc0  CPU: 6  COMMAND: "postgres"
>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033b897690] find_get_page at ffffffff811161ee
>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>   [ffff88033b897930] get_request at ffffffff812f004e
>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
> --- bt -t ends here
>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>   [ffff88033b897b80] load_balance at ffffffff8105e788
>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>     RBP: 00000000000000c5  R8:  000000000073a550  R9:  0000000000000000
>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel