Xen project Mailing List

Re: [Xen-API] [Xen-devel] driver domain crash and reconnect handling

To: Zoltan Kiss <zoltan.kiss@xxxxxxxxxx>

From: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

Date: Thu, 24 Jan 2013 09:59:45 +0000

Cc: "xen-api@xxxxxxxxxxxxx" <xen-api@xxxxxxxxxxxxx>, "'xen-devel@xxxxxxxxxxxxx'" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Thu, 24 Jan 2013 10:01:46 +0000

List-id: User and development list for XCP and XAPI <xen-api.lists.xen.org>

On Wed, 2013-01-23 at 21:58 +0000, Zoltan Kiss wrote:
> Hi,
>
> On 21/01/13 12:20, Ian Campbell wrote:
> > On Mon, 2013-01-21 at 11:31 +0000, Dave Scott wrote:
> >> Is the current xenstore protocol considered sufficient to
> >> support reconnecting a frontend to a new backend? I did a few
> >> simple experiments with an XCP driver domain prototype a while
> >> back and I failed to make the frontend happy -- usually it would
> >> become confused about the backend and become stuck. This might
> >> just be because I didn't know what I was doing :-)
> >
> > I think the protocol is probably sufficient but the implementations of
> > that protocol are not...
> What kind of problems do you think about?

Just lack of testing of the code paths when used in that way. My gut
feeling is that there will inevitably be frontends which can't cope, but
maybe I'm pessimistic.

> >> Zoltan (cc:d) also did a few simple experiments to see whether
> >> we could re-use the existing suspend/resume infrastructure,
> >> similar to the 'fast' resume we already use for live checkpoint.
> >> As an experiment he modified libxc's xc_resume.c to allow the
> >> guest's HYPERVISOR_suspend hypercall invocation to return with
> >> '0' (success) rather than '1' (cancelled). The effect of this
> >> was to leave the domain running, but since it thinks it has just
> >> resumed in another domain, it explicitly reconnects its frontends.
> >> With this change and one or two others (like fixing the
> >> start_info->{store_,console.domU}.mfns) he made it work for a
> >> number of oldish guests. I'm sure he can describe the changes
> >> needed more accurately than I can!
> >
> > Would be interesting to know, especially if everything was achieved with
> > toolstack side changes only!
> Actually I've used the xc_domain_resume_any() function from libxc to
> resume the guests. It worked with PV guests, however with some hacks in
> the hypervisor to silently discard the error conditions, and not
> return from the hypercall with an error. The two guests I've used,
> and their problems with the hypercall return values:
>
> - SLES 11 SP1 (2.6.32.12) crashes because VCPUOP_register_vcpu_info
>   hypercall returns EINVAL, as ( v->arch.vcpu_info_mfn != INVALID_MFN )
> - Debian Squeeze 6.0 (2.6.32-5) crashes because EVTCHNOP_bind_virq
>   returns EEXIST, as ( v->virq_to_evtchn[virq] != 0 )
> - (these hypercalls were made right after the guest comes back from the
>   suspend hypercall)

The toolstack might need to do EVTCHNOP_reset or do some other cleanup?

One difference between a cancelled suspend (i.e. resuming in the old
domain) and a normal/successful one is that in the normal case you are
starting in a fresh domain, so things like evtchns are all unbound and
must be redone, whereas in the cancelled case some of the old state can
persist and needs to be reset.

xend has some code which might form a useful basis of a list of things
which may need resetting, see tools/python/xen/xend/XendDomainInfo.py
resumeDomain.

> I suppose there will be similar problems with other PV guests, I intend
> to test other ones as well. My current problem is to architect a proper
> solution instead of my hacks in the hypervisor. I think we can't access
> those data areas outside the hypervisor (v is a "struct vcpu" equals
> current->domain->vcpu[vcpuid]), and unfortunately as I see Xen forgets
> the fact that the domain was suspended by the time these hypercalls come.

Xen isn't generally aware of things like suspend, it just sees a
domain/vcpu getting torn down and new ones (unrelated as far as Xen
knows) being created.

> Probably in storage driver domains it's better to suspend the guest
> immediately when the backend is gone, as the guest can easily crash if
> the block device is inaccessible for a long time. In case of network
> access, this isn't such a big problem.

Pausing guests when one of their supporting driver domains goes away
does seem like a good idea. I suppose the flip side is that a domain
which isn't using a disk which goes away briefly would see a hiccup it
wouldn't have otherwise seen.

Ian.

_______________________________________________
Xen-api mailing list
Xen-api@xxxxxxxxxxxxx
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api
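
The difference between a fresh domain and a cancelled suspend described in the message above can be illustrated with a toy model. This is only a sketch: it is not Xen or libxc code, and the class and function names (ToyVcpu, toolstack_reset, guest_resume_path) are hypothetical. It mirrors only the two checks quoted in the thread, and it does not claim which real operation (EVTCHNOP_reset or something else) clears which piece of state.

```python
import errno
from dataclasses import dataclass, field

INVALID_MFN = -1   # stand-in for the hypervisor's INVALID_MFN sentinel
NR_VIRQS = 24      # arbitrary table size chosen for this toy model


@dataclass
class ToyVcpu:
    """Hypothetical per-vcpu state; only the two fields named in the thread."""
    vcpu_info_mfn: int = INVALID_MFN
    virq_to_evtchn: list = field(default_factory=lambda: [0] * NR_VIRQS)
    next_port: int = 1


def register_vcpu_info(v, mfn):
    # Mirrors the quoted check: fail if an mfn is already registered.
    if v.vcpu_info_mfn != INVALID_MFN:
        return -errno.EINVAL
    v.vcpu_info_mfn = mfn
    return 0


def bind_virq(v, virq):
    # Mirrors the quoted check: fail if this virq is already bound.
    if v.virq_to_evtchn[virq] != 0:
        return -errno.EEXIST
    port = v.next_port
    v.next_port += 1
    v.virq_to_evtchn[virq] = port
    return port


def toolstack_reset(v):
    # Assumption: some cleanup step returns the vcpu to "fresh domain" state.
    v.vcpu_info_mfn = INVALID_MFN
    v.virq_to_evtchn = [0] * NR_VIRQS


def guest_resume_path(v, mfn=0x1234, virq=0):
    # A guest that believes it resumed in a new domain redoes its setup.
    return [register_vcpu_info(v, mfn), bind_virq(v, virq)]


if __name__ == "__main__":
    v = ToyVcpu()
    print("first boot:      ", guest_resume_path(v))
    print("fake resume:     ", guest_resume_path(v))  # both calls fail
    toolstack_reset(v)
    print("reset then resume:", guest_resume_path(v))
```

In this model the second pass fails because the old registrations persist, which is the situation the SLES and Debian guests hit; after the hypothetical reset the same guest path succeeds again.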
