
Re: [Xen-devel] Problem with simple scheduler



On 02/01/14 06:46, Manohar Vanga wrote:
Hi all,

I've spent the last few weeks trying to debug a weird issue with a new scheduler I'm developing for Xen. I have written a barebones round-robin scheduler which seems to work fine when starting up Dom0, but then at some point during the boot everything just hangs (somewhat deterministically from what I can tell from a week of debugging; see below).

I've inlined my source code below. I don't expect anyone to read the whole thing (although it's quite minimal) so here are the key points:
  • I've implemented the following callbacks: init_domain, destroy_domain, insert_vcpu, remove_vcpu, sleep, wake, yield, pick_cpu, do_schedule, init, deinit, alloc_vdata, free_vdata, alloc_pdata, free_pdata, alloc_domdata, free_domdata. Most of these are minimal (or in some cases do nothing). Am I missing anything critical?
  • The hang occurs even if I'm running Dom0 with just a single vcpu. Nothing hangs if I choose a stock scheduler. Either I'm doing something foolish that is causing a deadlock (less likely since the code structure is borrowed from sched_credit.c) or I'm *not* doing something leading to Dom0 crashing and the vcpu just dying.
If you do suspect some specific issue please let me know. Below are some of the possible issues that I've investigated but hit dead ends on:
  • Checking if my debug printk statements were leading to a deadlock due to sleeps in interrupt mode. This doesn't seem to be the case since Dom0 hangs during boot even if I disable all debug output.
  • I suspected incorrect queuing operations that might be corrupting memory somewhere. However, my debug logs tell me that this is not the case. There is at most one element in the runqueue at all times (I use Dom0 with 1 vcpu).
  • I also suspected a deadlock due to incorrect locking. However, compared with what the credit scheduler does in sched_credit.c, I don't seem to be doing anything significantly different. In general though, which callbacks run in interrupt context?
  • In the end, I stuck debug statements in tick_suspend and tick_resume and after the hang, those get called infinitely which seems like the physical CPU has gone idle. Is this correct? In that case, *what am I doing wrong in the scheduler* to cause Dom0 to crash?
  • The hang occurs around 3-5 seconds into the boot process quite deterministically. Could it be some periodic timer going off and interfering with my code in weird and wonderful ways?
Also, how do the sleep/wake/yield callbacks work? When do they get called? Is there any documentation on the different callbacks with regards to when they are called? If I understand everything correctly after this, I would gladly create a wiki page explaining this (and perhaps a tutorial on writing a simple scheduler; something I wish existed!).

I hope the description was enough to help understand my problem. If not, feel free to ask for more details :-)

Thanks for reading this far! Source code follows

Using printk()s in the code is going to skew the timing terribly.

A serial console and the 'q' debug key is probably a good start, to see some vcpu state.

'watchdog' on the Xen command line will enable NMI watchdogs which will catch deadlocks, but as I don't see a single use of spinlocks in your code, I doubt this is your issue.
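
For reference, a boot entry along these lines would enable both the serial console and the watchdog. The xen.gz path and the serial settings are assumptions for a typical GRUB 2 setup and will need adjusting for your hardware:

```
multiboot /boot/xen.gz com1=115200,8n1 console=com1 watchdog
```

With the serial console attached, pressing Ctrl-A three times should switch input to Xen, after which 'q' prints the domain and vcpu state.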

Beyond that, writing a custom keyhandler to dump all of the xfair state is probably the next thing to try.
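
As a rough idea of what such a dump could print, here is a user-space sketch, not Xen code. The struct layouts, field names and xfair_dump_pcpu() are all hypothetical, since the xfair source is not reproduced here; inside Xen the output would go through printk() from a registered keyhandler rather than fprintf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-vcpu scheduler state; field names are illustrative only. */
struct xfair_vcpu {
    int domid;
    int vcpu_id;
    int runnable;             /* 1 if the scheduler believes the vcpu can run */
    struct xfair_vcpu *next;  /* next runqueue entry, NULL at the tail */
};

/* Hypothetical per-pcpu scheduler state. */
struct xfair_pcpu {
    int cpu;
    struct xfair_vcpu *curr;  /* vcpu currently running, NULL if idle */
    struct xfair_vcpu *runq;  /* head of the runqueue, NULL if empty */
};

/*
 * Print the current vcpu and every runqueue entry for one pcpu.
 * Returns the number of entries found on the runqueue.
 */
int xfair_dump_pcpu(const struct xfair_pcpu *pc, FILE *out)
{
    const struct xfair_vcpu *v;
    int queued = 0;

    assert(pc != NULL && out != NULL);

    if (pc->curr != NULL)
        fprintf(out, "cpu %d: curr=d%dv%d\n",
                pc->cpu, pc->curr->domid, pc->curr->vcpu_id);
    else
        fprintf(out, "cpu %d: curr=idle\n", pc->cpu);

    for (v = pc->runq; v != NULL; v = v->next) {
        fprintf(out, "  runq[%d]: d%dv%d runnable=%d\n",
                queued, v->domid, v->vcpu_id, v->runnable);
        queued++;
    }

    return queued;
}
```

The state worth looking for is a pcpu whose curr is idle while its runqueue still holds a runnable vcpu. That would suggest the vcpu was queued but nothing prompted the scheduler to run again on that cpu.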

~Andrew