
[Xen-devel] driver domain crash and reconnect handling


[ my apologies if this has been discussed before but I couldn't
  find a relevant thread ]

In XCP we're hoping to make serious use of driver domains soon.
We'd like to tell people that their Xen-based cloud is even
more robust than before, because even if a host driver crashes,
there is only a slight interruption to guest I/O. For this to
work smoothly, we need to figure out how to re-establish disk
and network I/O after the driver restart -- this is where I'd
appreciate some advice!

Is the current xenstore protocol considered sufficient to
support reconnecting a frontend to a new backend? I did a few
simple experiments with an XCP driver domain prototype a while
back and I failed to make the frontend happy -- usually it would
get confused about the backend and get stuck. This might
just be because I didn't know what I was doing :-)

Zoltan (cc:d) also did a few simple experiments to see whether
we could re-use the existing suspend/resume infrastructure,
similar to the 'fast' resume we already use for live checkpoint.
As an experiment he modified libxc's xc_resume.c to allow the
guest's HYPERVISOR_suspend hypercall invocation to return with
'0' (success) rather than '1' (cancelled). The effect of this
was to leave the domain running, but since it thinks it has just
resumed in another domain, it explicitly reconnects its frontends.
With this change and one or two others (like fixing up
start_info->store_mfn and start_info->console.domU.mfn) he made
it work for a number of oldish guests. I'm sure he can describe
the changes needed more accurately than I can!
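
To make the idea concrete, here is a rough sketch of the guest-side
decision as I understand it. It is not taken from any real kernel:
the names (struct frontend, frontend_reconnect, guest_post_suspend)
are made up for illustration, and the only thing it relies on is the
return-value convention described above ('0' = resumed, '1' =
cancelled).

```c
#include <assert.h>
#include <stdio.h>

/* Return values of the suspend hypercall as described in this mail:
 * 0 = the guest believes it has been resumed, 1 = suspend cancelled. */
enum suspend_result { SUSPEND_RESUMED = 0, SUSPEND_CANCELLED = 1 };

/* Hypothetical stand-in for a PV frontend (disk, network, ...). */
struct frontend {
    const char *name;
    int reconnects;   /* number of times we renegotiated with a backend */
};

/* Hypothetical: drop the old connection and renegotiate via xenstore.
 * A real frontend would tear down rings and event channels here. */
static void frontend_reconnect(struct frontend *fe)
{
    fe->reconnects++;
    printf("%s: reconnecting to backend\n", fe->name);
}

/* Sketch of what the guest does once the suspend hypercall returns.
 * Returns the number of frontends that were reconnected. */
static int guest_post_suspend(int hypercall_ret, struct frontend *fes, int n)
{
    assert(hypercall_ret == SUSPEND_RESUMED ||
           hypercall_ret == SUSPEND_CANCELLED);

    /* Cancelled: the old backends are assumed to still be there,
     * so the existing connections are kept as they are. */
    if (hypercall_ret == SUSPEND_CANCELLED)
        return 0;

    /* Resumed: assume every backend is new and reconnect everything. */
    for (int i = 0; i < n; i++)
        frontend_reconnect(&fes[i]);
    return n;
}
```

With two frontends, calling guest_post_suspend(SUSPEND_CANCELLED, ...)
leaves both alone, while guest_post_suspend(SUSPEND_RESUMED, ...)
reconnects both. The experiment amounts to steering a still-running
guest down the second path.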

What do you think of this approach? Since it's based on the
existing suspend/resume code, it should hopefully work with all
guest types without our having to update the frontends, or even
fix bugs in them (because it looks just like a regular resume, which
is pretty well tested everywhere). This is particularly important in
"cloud" scenarios because the people running clouds usually have
little or no control over the software their customers are running.
If we have to wait for a PV frontend change to trickle into all
the common distros, it will be a while before we can fully
benefit from driver domain restart. If there is a better long-term
way of doing this that involves a frontend change, what do you
think about this as a stopgap until the frontends are updated?

