Re: [Xen-devel] xenbus and the message of doom
On 15.12.2011 20:39, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 15, 2011 at 08:20:23PM +0100, Stefan Bader wrote:
>> I was investigating a bug report[1] about newer kernels (>3.1) not booting as
>> HVM guests on Amazon EC2. For some reason git bisect gave me some pain, but it
>> led me at least close, and with some crash dump data I think I figured out the
>> problem.
>
> Stefan, thanks for finding this.
>

I realize I wanted to add the reference to our bug report but completely forgot
to do so. So, just for completeness:

[1] http://bugs.launchpad.net/bugs/901305

> Olaf, what are your thoughts? Should I prep a patch to revert the patch
> below and then we can work on 3.3 and rethink this in 3.3? The clock is
> ticking for 3.2 and there is not much runway to fix stuff.
>
>>
>> commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
>> Author: Olaf Hering <olaf@xxxxxxxxx>
>> Date:   Thu Sep 22 16:14:49 2011 +0200
>>
>>     xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old
>>     kernel
>>
>> This change introduced an xs_reset_watches() call. The problem seems to be that
>> there is at least some version of Xen (I was able to reproduce with a 3.4.3
>> version, which I admit to deliberately not having updated) for which xenstore
>> will not return any reply.
>
> And oxenstore too, but Ian prepped a patch for this. Perhaps that is
> what Amazon is running.
>>
>> At least the backtraces in crash showed that xs_init had been calling
>> xs_reset_watches() and that was happily idling in read_reply(). Effectively
>> nothing was going on and the boot just hung.
>
> So at least we should have a timeout in read_reply. But I don't see
> anything in the code that we could immediately use.
>
>> By just not doing that xs_reset_watches() call, I was able to boot on the
>> same host. And for what it is worth, there has not been an issue with Xen 4.1.1
>> and a 3.0 dom0 kernel. Just this "older" release is trouble.
>>
>> Now the big question is: should this never happen, meaning the host needs
>> urgent updating? Or should xs_talkv() set up a time limit and assume failure
>> when no message is received after that? I could imagine the latter might at
>> least lead to a more helpful "there is something wrong here, dude" than just
>> hanging around without any response. ;)
>>
>> -Stefan
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel