[Xen-API] driver domain crash and reconnect handling
Hi,

[ my apologies if this has been discussed before but I couldn't find a relevant thread ]

In XCP we're hoping to make serious use of driver domains soon. We'd like to tell people that their xen-based cloud is even more robust than before, because even if a host driver crashes, there is only a slight interruption to guest I/O. For this to work smoothly, we need to figure out how to re-establish disk and network I/O after the driver restart -- this is where I'd appreciate some advice!

Is the current xenstore protocol considered sufficient to support reconnecting a frontend to a new backend? I did a few simple experiments with an XCP driver domain prototype a while back and I failed to make the frontend happy -- usually it would become confused about the backend and become stuck. This might just be because I didn't know what I was doing :-)

Zoltan (cc:d) also did a few simple experiments to see whether we could re-use the existing suspend/resume infrastructure, similar to the 'fast' resume we already use for live checkpoint. As an experiment he modified libxc's xc_resume.c to allow the guest's HYPERVISOR_suspend hypercall invocation to return with '0' (success) rather than '1' (cancelled). The effect of this was to leave the domain running, but since it thinks it has just resumed in another domain, it explicitly reconnects its frontends. With this change and one or two others (like fixing the start_info->{store_,console.domU}.mfns) he made it work for a number of oldish guests. I'm sure he can describe the changes needed more accurately than I can!

What do you think of this approach? Since it's based on the existing suspend/resume code it should hopefully work with all guest types without having to update the frontends or hopefully even fix bugs in them (because it looks just like a regular resume which is pretty well tested everywhere).
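To make the idea concrete, here is a toy Python model of the guest-side decision described above. It is only a sketch, not libxc or guest kernel code: the ToyGuest class, its method names and the log strings are all made up for illustration. The only parts taken from the description are the two return values of the suspend hypercall ('0' for success, '1' for cancelled) and the observation that a guest which sees '0' reconnects its frontends.

```python
from dataclasses import dataclass, field
from typing import List

# Return values of the guest's suspend hypercall, as described in the mail.
SUSPEND_RESUMED = 0    # guest believes it has resumed in a new domain
SUSPEND_CANCELLED = 1  # suspend was cancelled; guest is still in the same domain


@dataclass
class ToyGuest:
    """Hypothetical stand-in for a PV guest with some frontends."""
    frontends: List[str]
    log: List[str] = field(default_factory=list)

    def after_suspend(self, rc: int) -> None:
        """Model what the guest does once the suspend hypercall returns."""
        if rc == SUSPEND_CANCELLED:
            # Assumption: on a cancelled suspend the guest keeps its
            # existing frontend/backend connections untouched.
            self.log.append("cancelled: keep existing connections")
            return
        # rc == 0: treated as a real resume, so every frontend is
        # torn down and reconnected to whatever backend is now present.
        for name in self.frontends:
            self.log.append(f"reconnect {name}")


if __name__ == "__main__":
    guest = ToyGuest(frontends=["vbd", "vif"])
    guest.after_suspend(SUSPEND_RESUMED)
    print(guest.log)
```

The point of the trick is that the toolstack picks the value the guest sees: forcing '0' after a driver domain restart sends an unmodified guest down its ordinary resume path, which is the path that reconnects frontends.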
This is particularly important in "cloud" scenarios because the people running clouds usually have little or no control over the software their customers are running. Unfortunately if we have to wait for a PV frontend change to trickle into all the common distros it will be a while before we can fully benefit from driver domain restart.

If there is a better way of doing this in the long term involving a frontend change, what do you think about this as a stopgap until the frontends are updated?

Cheers,
Dave

_______________________________________________
Xen-api mailing list
Xen-api@xxxxxxxxxxxxx
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api