Xen project Mailing List

Re: [Xen-API] [Xen-devel] driver domain crash and reconnect handling

To: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Zoltan Kiss <zoltan.kiss@xxxxxxxxxx>

From: Paul Durrant <Paul.Durrant@xxxxxxxxxx>

Date: Thu, 24 Jan 2013 16:57:07 +0000

Accept-language: en-US

Cc: "'xen-devel@xxxxxxxxxxxxx'" <xen-devel@xxxxxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, "xen-api@xxxxxxxxxxxxx" <xen-api@xxxxxxxxxxxxx>

Delivery-date: Thu, 24 Jan 2013 16:56:48 +0000

List-id: User and development list for XCP and XAPI <xen-api.lists.xen.org>

Thread-index: Ac36U0D2/UP3MvGNQLCbLRuFC7wlSA==

Thread-topic: [Xen-devel] driver domain crash and reconnect handling

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: 24 January 2013 15:11
> To: Zoltan Kiss
> Cc: George Shuklin; xen-api@xxxxxxxxxxxxx; Ian Campbell; Paul Durrant;
> Dave Scott; 'xen-devel@xxxxxxxxxxxxx'
> Subject: Re: [Xen-devel] driver domain crash and reconnect handling
>
> On 24/01/13 15:01, Zoltan Kiss wrote:
> > On 24/01/13 14:06, George Shuklin wrote:
> >> 24.01.2013 17:25, Paul Durrant wrote:
> >>>> Some notes about guest suspend during IO.
> >>>>
> >>>> I tested that way for storage reboot (pause all domains, reboot
> >>>> iSCSI storage and resume every domain). If the pause is short
> >>>> (less than 2 minutes), the guest can survive. If the pause is
> >>>> longer than 2 minutes, guests waiting for IO completion detect an
> >>>> IO timeout after resuming and raise IO errors on virtual block
> >>>> devices (PV).
> >>>>
> >>> To be clear here: do you mean you *paused* and then unpaused the
> >>> VMs, or *suspended* and then resumed the VMs? I suspect you mean
> >>> the former.
> >>>
> >>> Paul
> >> Pause, of course. My bad.
> >>
> > If you do a suspend, the frontend driver flushes out disk IO
> > operations before the suspend is reached, and therefore there won't
> > be anything to time out after resume. However, if the storage driver
> > domain just crashed, I guess the guest would crash at suspend. Maybe
> > we can try out something to save the ring buffer contents, and
> > replay them once the backend comes back (but before resuming the
> > guest). But I'm not sure whether the guest would handle the timeouts
> > after the resume first, or cancel them if the requests were
> > successfully responded to.
> >
> > Zoli
>
> Perhaps I am making this harder, but might it be best to wait for a
> short while (15-30 seconds) for the device driver domain to come back,
> and if it takes longer than that, pause the VM.
>
> This way, if the driver domain is fast to come back, all the guest
> notices is transitorily blocked IO, and if the driver domain is too
> slow (but does come back), all the guest might notice is a pause.
>
> Ultimately, if the driver domain never comes back, then we are in a no
> worse position than currently.
>

What do you mean by 'come back' here? If you're talking about the same
driver domain then fair enough. If you're talking about a new instance
then pausing or not pausing the VM is immaterial. Unless the frontends
are prodded to connect to the new backends (remembering that the
xenstore paths have the domid baked into them) then IO will block
forever.

In general you're going to need to go through a full suspend/resume of
the frontend to achieve this, unless we write new frontend code to
directly notice the change in the backend (and distinguish it from an
unplug) and reconnect automatically.

  Paul

_______________________________________________
Xen-api mailing list
Xen-api@xxxxxxxxxxxxx
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api
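To illustrate the point above about the domid being baked into the xenstore paths, here is a rough sketch in plain Python. It does not talk to xenstore at all. It only builds path strings in the conventional /local/domain/<backend-domid>/backend/<type>/<frontend-domid>/<devid> layout. The function names and the domid and device numbers are made up for illustration, and the real toolstack code will differ.

```python
def backend_path(backend_domid: int, frontend_domid: int, devid: int,
                 kind: str = "vbd") -> str:
    """Build a backend directory path in the conventional xenstore layout.

    The backend's own domid is the third path component, so a new driver
    domain instance (with a new domid) necessarily gets a different path.
    """
    return f"/local/domain/{backend_domid}/backend/{kind}/{frontend_domid}/{devid}"


def frontend_is_stale(frontend_backend_key: str, live_backend_domid: int) -> bool:
    """Return True if the path a frontend holds names a different backend domain.

    `frontend_backend_key` stands in for the value of the frontend's
    "backend" key; here it is just a string passed in by the caller.
    """
    parts = frontend_backend_key.strip("/").split("/")
    # Expected components: local, domain, <backend-domid>, backend, ...
    return int(parts[2]) != live_backend_domid


# Hypothetical numbers: guest is domain 12, old driver domain was 5,
# the restarted driver domain came up as domain 9.
old = backend_path(5, 12, 51712)
new = backend_path(9, 12, 51712)
print(old)
print(new)
print(frontend_is_stale(old, 9))
```

A frontend still holding the old path is pointed at a directory that no longer belongs to any live backend, which is why merely unpausing the guest does not get IO moving again.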
