Re: [Xen-devel] driver domain crash and reconnect handling

On 24/01/13 15:10, Andrew Cooper wrote:
>If you did a suspend, the frontend driver would flush outstanding disk IO
>operations before the suspend was reached, and therefore there wouldn't be
>anything to time out after resume. However, if the storage driver domain has
>just crashed, I guess the guest would crash at suspend. Maybe we can try
>something to save the ring buffer contents and replay them once the
>backend comes back (but before resuming the guest). But I'm not sure
>whether the guest would handle the timeouts after the resume first, or
>cancel them if the requests were successfully answered.
Perhaps I am making this harder, but might it be best to wait for a
short while (15-30 seconds) for the device driver domain to come back,
and if it takes longer than that, pause the VM?

This way, if the driver domain is fast to come back, all the guest
notices is transiently blocked IO, and if the driver domain is too slow
(but does come back), all the guest might notice is a pause.

Ultimately, if the driver domain never comes back, then we are no
worse off than we are currently.
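
To make the proposal above concrete, here is a rough sketch of the
wait-then-pause policy. It is not toolstack code: wait_or_pause,
backend_is_up and pause_guest are hypothetical names, and the caller is
assumed to supply the real checks (e.g. a xenstore watch and a toolstack
pause call). The 30 second default is just the upper end of the range
suggested above.

```python
import time
from typing import Callable


def wait_or_pause(backend_is_up: Callable[[], bool],
                  pause_guest: Callable[[], None],
                  grace_seconds: float = 30.0,
                  poll_interval: float = 1.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Give the driver domain a grace period, then pause the guest.

    backend_is_up and pause_guest are hypothetical hooks; clock and sleep
    are injectable so the policy can be exercised without real delays.
    """
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if backend_is_up():
            # Backend returned in time: the guest only saw blocked IO.
            return "resumed-io"
        sleep(poll_interval)
    # Grace period expired: pause the guest so its IO timers stop running.
    pause_guest()
    return "paused"
```

If the driver domain never returns, the guest simply stays paused, which
matches the "no worse than currently" case.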

As Paul mentioned, pausing doesn't cause the guest to reconnect to the new backend, so you would need a suspend/resume. But in George's case, where the driver domain remains the same, this can work. However, to avoid George's problem with timeouts, a reconnect would be necessary. As Ian mentioned, the guest will replay the ring, and that might help to avoid the timeouts.
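
To illustrate what "replay the ring" means here, a toy model of the idea
follows. This is not the frontend driver's code; ShadowRing and its
methods are made-up names, and the only assumption is that the frontend
keeps a private copy of every request it has not yet seen a response for.

```python
from collections import OrderedDict
from typing import Any, Callable


class ShadowRing:
    """Toy model: remember in-flight requests so they can be reissued."""

    def __init__(self) -> None:
        # Insertion order is kept so replay preserves submission order.
        self._inflight: "OrderedDict[int, Any]" = OrderedDict()

    def submit(self, req_id: int, request: Any) -> None:
        # Record the request at the moment it is placed on the ring.
        self._inflight[req_id] = request

    def complete(self, req_id: int) -> None:
        # A response arrived, so this request no longer needs replaying.
        self._inflight.pop(req_id, None)

    def replay(self, send: Callable[[int, Any], None]) -> int:
        # After reconnecting, reissue everything still unanswered.
        for req_id, request in list(self._inflight.items()):
            send(req_id, request)
        return len(self._inflight)
```

In this model, requests answered before the crash are never resent, and
anything lost with the old backend is reissued to the new one, so from
the guest's side the requests eventually complete rather than time out
(provided the reconnect happens soon enough).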


