Xen project Mailing List

Re: [Xen-devel] [PATCH 2/4] xen: x86 / cpupool: clear the proper cpu_valid bit on pCPU teardown

To: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 25 Jun 2015 16:52:07 +0100

Cc: Juergen Gross <jgross@xxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Thu, 25 Jun 2015 15:52:58 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 25/06/15 16:04, Dario Faggioli wrote: > On Thu, 2015-06-25 at 15:20 +0100, Andrew Cooper wrote: >> On 25/06/15 13:15, Dario Faggioli wrote: >>> In fact, if a pCPU belonging to some other pool than >>> cpupool0 goes down, we want to clear the relevant bit >>> from its actual pool, rather than always from cpupool0. >> This sentence is a little hard to parse. >> >> I presume you mean "use the correct cpupools valid mask, rather than >> cpupool0's". >> > Yes, that's a better way to say what I meant. > >>> # xl cpupool-cpu-remove Pool-0 8-15 >>> # xl cpupool-create name=\"Pool-1\" >>> # xl cpupool-cpu-add Pool-1 8-15 >>> --> suspend >>> --> resume >>> (XEN) ----[ Xen-4.6-unstable x86_64 debug=y Tainted: C ]---- >>> (XEN) CPU: 8 >>> (XEN) RIP: e008:[<ffff82d080123078>] csched_schedule+0x4be/0xb97 >>> (XEN) RFLAGS: 0000000000010087 CONTEXT: hypervisor >>> (XEN) rax: 80007d2f7fccb780 rbx: 0000000000000009 rcx: 0000000000000000 >>> (XEN) rdx: ffff82d08031ed40 rsi: ffff82d080334980 rdi: 0000000000000000 >>> (XEN) rbp: ffff83010000fe20 rsp: ffff83010000fd40 r8: 0000000000000004 >>> (XEN) r9: 0000ffff0000ffff r10: 00ff00ff00ff00ff r11: 0f0f0f0f0f0f0f0f >>> (XEN) r12: ffff8303191ea870 r13: ffff8303226aadf0 r14: 0000000000000009 >>> (XEN) r15: 0000000000000008 cr0: 000000008005003b cr4: 00000000000026f0 >>> (XEN) cr3: 00000000dba9d000 cr2: 0000000000000000 >>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 >>> (XEN) ... ... ... >>> (XEN) Xen call trace: >>> (XEN) [<ffff82d080123078>] csched_schedule+0x4be/0xb97 >>> (XEN) [<ffff82d08012c732>] schedule+0x12a/0x63c >>> (XEN) [<ffff82d08012f8c8>] __do_softirq+0x82/0x8d >>> (XEN) [<ffff82d08012f920>] do_softirq+0x13/0x15 >>> (XEN) [<ffff82d080164791>] idle_loop+0x5b/0x6b >>> (XEN) >>> (XEN) **************************************** >>> (XEN) Panic on CPU 8: >>> (XEN) GENERAL PROTECTION FAULT >>> (XEN) [error_code=0000] >>> (XEN) **************************************** >> What is the actual cause of the #GP fault? There are no obviously >> poised registers. >> > IIRC, CPU 8 has been just brought up and is scheduling. Not any other > CPU from Pool-1 is online yet. We are on CPU 8, in > csched_load_balance(), more specifically here: > > ... > BUG_ON( cpu != snext->vcpu->processor ); > online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu)); > ... > for_each_csched_balance_step( bstep ) > { > /* > * We peek at the non-idling CPUs in a node-wise fashion. In fact, > * it is more likely that we find some affine work on our same > * node, not to mention that migrating vcpus within the same node > * could well expected to be cheaper than across-nodes (memory > * stays local, there might be some node-wide cache[s], etc.). > */ > peer_node = node; > do > { > /* Find out what the !idle are in this node */ > cpumask_andnot(&workers, online, prv->idlers); > cpumask_and(&workers, &workers, &node_to_cpumask(peer_node)); > __cpumask_clear_cpu(cpu, &workers); > > peer_cpu = cpumask_first(&workers); > if ( peer_cpu >= nr_cpu_ids ) > goto next_node; > do > { > /* > * Get ahold of the scheduler lock for this peer CPU. > * > * Note: We don't spin on this lock but simply try it. > Spinning > * could cause a deadlock if the peer CPU is also load > * balancing and trying to lock this CPU. > */ > spinlock_t *lock = pcpu_schedule_trylock(peer_cpu); > > Because of the fact that we did not clear Pool-1->cpu_valid online is > 8-15. Also, since we _did_ clear bits 8-15 in prv->idlers when tearing > them down, during suspend, they're all (or all but 8) workers, as far as > the code above can tell. > > We therefore enter the inner do{}while with, for instance (that's what > I've seen in my debugging), peer_cpu=9, but we've not yet done > cpu_schedule_up()-->alloc_pdata()-->etc. for that CPU, so we die at (or > shortly after) the end of the code snippet shown above. Aah - it is a dereference with %rax as a pointer, which is #define INVALID_PERCPU_AREA (0x8000000000000000L - (long)__per_cpu_start) That explains the #GP fault which is due to a non-canonical address. It might be better to use 0xDEAD000000000000L as the constant to make it slightly easier to spot as a poisoned pointer. > >> Is it something we should modify to be a BUG or ASSERT? >> > Not sure how/where. Note that some more fixing of similar situations > happen in other patches in the series, and that includes also adding > ASSERT-s (although, no, they probably won't cover this case). > > I can try to think at it and to come up with something if you think it's > important... Not to worry. I was more concerned about working out why it was dying with an otherwise unqualified #GP fault. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.