Re: [Xen-devel] [RFC 7/7] libxl: Wait for QEMU startup in stubdomain

On Fri, Feb 6, 2015 at 9:59 AM, Wei Liu <wei.liu2@xxxxxxxxxx> wrote:
> On Fri, Feb 06, 2015 at 08:56:40AM -0500, Eric Shelton wrote:
>> On Fri, Feb 6, 2015 at 6:16 AM, Wei Liu <wei.liu2@xxxxxxxxxx> wrote:
>> I simply used the code already present in the QEMU upstream code,
>> which is writing to that particular path to indicate "running."  Since
>> it is distinct from the path used by the QEMU instance running in
>> Dom0, it works for my intended purpose: ensuring the device model is
>> running before unpausing the HVM guest.  When you say it is "wrong,"
>> is that just because you ultimately intend to rearchitect this and use
>> something different?  If so, maybe the path I am using is "good
>> enough" until that happens.  Otherwise, can you suggest a better path
>> or mechanism?
> It is not "good enough". It just happens to be working.
> Currently the path is hardcoded as "/local/domain/0/BLAH". It's wrong,
> because the QEMU in stubdom is not running in 0. The correct prefix
> should be "/local/domain/$stubdom_id".

OK; that definitely makes more sense - I recall the same idea crossing
my mind when I first dug into this.  Although the revised protocol may
go in a different direction, I will adopt this approach for now.
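
To make sure I have the suggestion right, here is a minimal sketch of
how I understand the prefix should be derived.  The helper name and the
"BLAH" suffix are placeholders of mine (the suffix just stands in for
whatever follows the prefix); this is not code from libxl or QEMU.

```python
def dm_xenstore_path(dm_domid: int, suffix: str) -> str:
    """Build a xenstore path under the domain that actually runs the DM.

    dm_domid is 0 when QEMU runs in Dom0 and the stubdom's id when QEMU
    runs in a stubdomain. The suffix is a placeholder, not a real key.
    """
    if dm_domid < 0:
        raise ValueError("domain id must be non-negative")
    # Strip a leading slash so the result never contains "//".
    return f"/local/domain/{dm_domid}/{suffix.lstrip('/')}"


# QEMU in Dom0 versus QEMU in a stubdom with (hypothetical) id 7.
print(dm_xenstore_path(0, "BLAH"))
print(dm_xenstore_path(7, "BLAH"))
```

The only point is that the domain id in the prefix is a parameter
rather than a hardcoded 0.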

>> I noticed some discussion about this on xen-devel.  Unfortunately, I
>> was unable to find anything that laid out specifically what the
>> problems are - can you point me to a bug report or such?  The libxl
>> startup code - with callbacks on top of callbacks, callbacks within
>> callbacks, and callbacks stashed away in little places only to be
>> called _much_ later - is really convoluted, I suspect particularly so
>> for stubdom startup.  I am not surprised it got broken - who can
>> remember how it works?
> It's not how libxl is coded. It's the startup protocol that is broken.
> The breakage of stubdom in Xen 4.5 is a latent bug exposed by a new
> feature.
> I guess I should just send a bug report saying "Device model startup
> protocol is broken". But I don't have much to say at this point, because
> thorough research for both qemu-trad and qemu-upstream is required to
> produce a sensible report.

So, just where is the current protocol breaking down?  Is there a
contemplated bandaid for 4.5.1?  I'm just trying to figure out what I
might want to do differently.

> So prior to 4.5, when an emulation request is issued by a guest vcpu,
> that request is put on a ring and the guest vcpu is paused. When a DM
> shows up it processes that request and posts a response, then the guest
> vcpu is unpaused. So there is an implicit dependency on Xen's behaviour
> for the DM to work.
> In 4.5, a new feature called ioreq server is added. When Xen sees an
> io request which has no backing DM, it returns immediately. Guest sees some
> weird value and crashes. That is, Xen's behaviour has changed and a
> latent bug in stubdom's startup protocol is exposed.
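
To check my understanding of the behaviour change, here is a toy model
of it.  Everything in it (the class, the method names, the 0xFFFFFFFF
stand-in for the "weird value") is made up for illustration and does
not correspond to actual hypervisor code.

```python
from collections import deque

BOGUS_VALUE = 0xFFFFFFFF  # arbitrary stand-in for the "weird value"


class ToyXen:
    """Toy model of I/O request handling before and after ioreq servers."""

    def __init__(self, ioreq_servers: bool):
        self.ioreq_servers = ioreq_servers  # True models the 4.5 behaviour
        self.ring = deque()                 # pending emulation requests
        self.dm_attached = False
        self.vcpu_paused = False

    def guest_io_read(self, port: int):
        if self.ioreq_servers and not self.dm_attached:
            # 4.5: no backing DM, so return immediately with a bogus value.
            return BOGUS_VALUE
        # Otherwise: queue the request and pause the vcpu until a DM answers.
        self.ring.append(port)
        self.vcpu_paused = True
        return None

    def dm_attach_and_drain(self, emulate):
        """A DM shows up, answers every queued request, vcpu is unpaused."""
        self.dm_attached = True
        responses = [emulate(port) for port in self.ring]
        self.ring.clear()
        self.vcpu_paused = False
        return responses


old = ToyXen(ioreq_servers=False)
old.guest_io_read(0x70)             # request waits on the ring
print(old.vcpu_paused)              # guest is parked until the DM arrives

new = ToyXen(ioreq_servers=True)
print(hex(new.guest_io_read(0x70))) # guest gets the bogus value at once
```

If that is accurate, then before 4.5 an early guest access simply waited
for the DM, which is why the ordering problem stayed hidden.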

So, is the approach that I took - waiting for the stubdom DM to finish
initializing - a reasonable short-term solution?  I guess I am
wondering whether the fix you are contemplating is in libxl, the
hypervisor, or both.
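
For reference, the ordering I am after boils down to something like the
sketch below: do not unpause the guest until the DM has reported
"running".  The function and its parameters are hypothetical, and
read_state stands in for a xenstore read of the DM state path.  Polling
is used only to keep the sketch short; a xenstore watch would be the
more natural mechanism.

```python
import time


def wait_for_dm_running(read_state, timeout_s=10.0, poll_s=0.1,
                        sleep=time.sleep, now=time.monotonic):
    """Return True once read_state() yields "running", False on timeout.

    read_state is any callable returning the current state string (or
    None if the key does not exist yet). sleep and now are injectable so
    the loop can be exercised without real delays.
    """
    deadline = now() + timeout_s
    while True:
        if read_state() == "running":
            return True
        if now() >= deadline:
            return False
        sleep(poll_s)


# Usage with a fake state source: the DM becomes ready on the third read.
states = iter([None, "starting", "running"])
if wait_for_dm_running(lambda: next(states), poll_s=0.0):
    print("DM is running; safe to unpause the guest")
else:
    print("timed out waiting for the DM; leave the guest paused")
```

On timeout the guest stays paused, so the caller can fail domain
creation cleanly instead of letting the guest run without a DM.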


