
Re: [Xen-devel] driver domain crash and reconnect handling

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: 24 January 2013 15:11
> To: Zoltan Kiss
> Cc: George Shuklin; xen-api@xxxxxxxxxxxxx; Ian Campbell; Paul Durrant; Dave
> Scott; 'xen-devel@xxxxxxxxxxxxx'
> Subject: Re: [Xen-devel] driver domain crash and reconnect handling
> On 24/01/13 15:01, Zoltan Kiss wrote:
> > On 24/01/13 14:06, George Shuklin wrote:
> >> 24.01.2013 17:25, Paul Durrant wrote:
> >>>> Some notes about guest suspend during IO.
> >>>>
> >>>> I tested it this way for a storage reboot (pause all domains, reboot
> >>>> the iSCSI storage and resume every domain). If the pause is short (less
> >>>> than 2 minutes), the guest can survive. If the pause is longer than 2
> >>>> minutes, guests that were waiting for IO completion detect an IO
> >>>> timeout after resuming and report IO errors on virtual block devices
> >>>> (PV).
> >>>>
> >>> To be clear here: do you mean you *paused* and then unpaused the
> >>> VMs, or *suspended* and then resumed the VMs? I suspect you mean the
> >>> former.
> >>>
> >>>     Paul
> >> Pause, of course. My bad.
> >>
> > If you did a suspend, the frontend driver would flush out disk IO
> > operations before the suspend is reached, and therefore there wouldn't be
> > anything to time out after resume. However, if the storage driver
> > domain has just crashed, I guess the guest would crash at suspend. Maybe
> > we can try out something to save the ring buffer contents and replay them
> > once the backend comes back (but before resuming the guest). But
> > I'm not sure whether the guest would handle the timeouts first after the
> > resume, or cancel them if the requests were successfully responded to.
> >
> > Zoli
> Perhaps I am making this harder, but might it be best to wait for a short
> while (15-30 seconds) for the device driver domain to come back, and if it
> takes longer than that, pause the VM?
> This way, if the driver domain is fast to come back, all the guest notices is
> transitorily blocked IO, and if the driver domain is too slow (but does come
> back), all the guest might notice is a pause.
> Ultimately, if the driver domain never comes back, then we are in a no
> worse position than currently.

What do you mean by 'come back' here? If you're talking about the same driver 
domain then fair enough. If you're talking about a new instance then pausing or 
not pausing the VM is immaterial. Unless the frontends are prodded to connect 
to the new backends (remembering that the xenstore paths have the domid baked 
into them) then IO will block forever. In general you're going to need to go 
through a full suspend/resume of the frontend to achieve this, unless we write 
new frontend code to directly notice the change in the backend (and distinguish 
it from an unplug) and reconnect automatically.
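
To illustrate the point about the domid being baked into the xenstore paths, here is a minimal sketch in Python. It assumes the conventional layout /local/domain/<backend-domid>/backend/<type>/<frontend-domid>/<devid>; the helper names (backend_path, backend_domid_of, needs_reconnect) are hypothetical and not part of any Xen tool or library. It only manipulates path strings and does not talk to xenstore.

```python
def backend_path(backend_domid: int, frontend_domid: int, devid: int,
                 kind: str = "vbd") -> str:
    """Build the backend directory path for one device (assumed layout)."""
    return f"/local/domain/{backend_domid}/backend/{kind}/{frontend_domid}/{devid}"


def backend_domid_of(path: str) -> int:
    """Extract the domid embedded in a /local/domain/<domid>/... path."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "local" or parts[1] != "domain":
        raise ValueError(f"not a /local/domain path: {path!r}")
    return int(parts[2])


def needs_reconnect(recorded_backend_path: str, current_backend_domid: int) -> bool:
    """True if the frontend still points at a backend in a different domain.

    A restarted driver domain that comes back with a new domid leaves the
    frontend's recorded backend path stale, so IO would block until the
    frontend is told to reconnect.
    """
    return backend_domid_of(recorded_backend_path) != current_backend_domid


# Frontend domain 12 was connected to a driver domain with domid 5.
recorded = backend_path(backend_domid=5, frontend_domid=12, devid=51712)
same_instance = needs_reconnect(recorded, current_backend_domid=5)
new_instance = needs_reconnect(recorded, current_backend_domid=7)
```

With the same driver domain instance the recorded path is still valid; with a new instance (domid 7 in this made-up example) the path refers to a domain that no longer exists, which is why pausing alone does not help.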
