
Re: [PATCH v2 RESEND] xen: Fix SEGV on domain disconnect



On Mon, Apr 24, 2023 at 1:08 PM Mark Syms <mark.syms@xxxxxxxxx> wrote:
>
> Copying in Tim who did the final phase of the changes.
>
> On Mon, 24 Apr 2023 at 11:32, Paul Durrant <xadimgnik@xxxxxxxxx> wrote:
> >
> > On 20/04/2023 12:02, mark.syms@xxxxxxxxxx wrote:
> > > From: Mark Syms <mark.syms@xxxxxxxxxx>
> > >
> > > Ensure the PV ring is drained on disconnect. Also ensure all pending
> > > AIO is complete, otherwise AIO tries to complete into a mapping of the
> > > ring which has been torn down.
> > >
> > > Signed-off-by: Mark Syms <mark.syms@xxxxxxxxxx>
> > > ---
> > > CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > CC: Anthony Perard <anthony.perard@xxxxxxxxxx>
> > > CC: Paul Durrant <paul@xxxxxxx>
> > > CC: xen-devel@xxxxxxxxxxxxxxxxxxxx
> > >
> > > v2:
> > >   * Ensure all inflight requests are completed before teardown
> > >   * RESEND to fix formatting
> > > ---
> > >   hw/block/dataplane/xen-block.c | 31 +++++++++++++++++++++++++------
> > >   1 file changed, 25 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
> > > index 734da42ea7..d9da4090bf 100644
> > > --- a/hw/block/dataplane/xen-block.c
> > > +++ b/hw/block/dataplane/xen-block.c
> > > @@ -523,6 +523,10 @@ static bool xen_block_handle_requests(XenBlockDataPlane *dataplane)
> > >
> > >       dataplane->more_work = 0;
> > >
> > > +    if (dataplane->sring == 0) {
> > > +        return done_something;
> > > +    }
> > > +
> >
> > I think you could just return false here... Nothing is ever going to be
> > done if there's no ring :-)
> >
> > >       rc = dataplane->rings.common.req_cons;
> > >       rp = dataplane->rings.common.sring->req_prod;
> > >       xen_rmb(); /* Ensure we see queued requests up to 'rp'. */
> > > @@ -666,14 +670,35 @@ void xen_block_dataplane_destroy(XenBlockDataPlane *dataplane)
> > >   void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
> > >   {
> > >       XenDevice *xendev;
> > > +    XenBlockRequest *request, *next;
> > >
> > >       if (!dataplane) {
> > >           return;
> > >       }
> > >
> > > +    /* We're about to drain the ring. We can cancel the scheduling of any
> > > +     * bottom half now */
> > > +    qemu_bh_cancel(dataplane->bh);
> > > +
> > > +    /* Ensure we have drained the ring */
> > > +    aio_context_acquire(dataplane->ctx);
> > > +    do {
> > > +        xen_block_handle_requests(dataplane);
> > > +    } while (dataplane->more_work);
> > > +    aio_context_release(dataplane->ctx);
> > > +
> >
> > I don't think we want to be taking new requests, do we?
> >

If we're in this situation and the guest has put something on the ring,
I think we should do our best with it. We can't just rely on the guest
being well-behaved, because they're not :-( We're about to throw the
ring away, so whatever is there would otherwise be lost; this bit is
here to handle guests which are less than diligent about their shutdown.
We *should* always be past this fast enough that, when the
disconnect()/connect() for XenbusStateConnected happens, all remains
well (if not, we were in a worse situation before).

> > > +    /* Now ensure that all inflight requests are complete */
> > > +    while (!QLIST_EMPTY(&dataplane->inflight)) {
> > > +        QLIST_FOREACH_SAFE(request, &dataplane->inflight, list, next) {
> > > +            blk_aio_flush(request->dataplane->blk, xen_block_complete_aio,
> > > +                        request);
> > > +        }
> > > +    }
> > > +
> >
> > I think this could possibly be simplified by doing the drain after the
> > call to blk_set_aio_context(), as long as we set dataplane->ctx to
> > qemu_get_aio_context(). Also, as long as more_work is not set, it
> > should still be safe to cancel the bh before the drain AFAICT.

I'm not sure what you mean by simpler? Possibly I'm not getting something.

We have to make sure that any aio_bh_schedule_oneshot_full() which
happens as a result of blk_aio_flush() has finished before any change of
AIO context, because the completion uses the context that was current at
the time of the call (I have the SEGVs to prove it :-)). Whether that
happens before or after blk_set_aio_context(qemu_get_aio_context())
doesn't seem to be a change in complexity to me.

The motivation was to get as much as possible to happen the way it
"normally" would, so that future changes are less likely to regress it,
but as mentioned, maybe I'm missing something.
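
If by "simpler" you mean something like the untested sketch below (just
my reading of your suggestion, so the exact placement is a guess on my
part), then I agree it can be made to work, but only if dataplane->ctx
is updated before any further AIO is issued, which is exactly the
constraint above:

    /* Untested sketch of the reordering as I read it: move everything
     * back onto the main loop context first... */
    aio_context_acquire(dataplane->ctx);
    if (dataplane->event_channel) {
        xen_device_set_event_channel_context(xendev, dataplane->event_channel,
                                             qemu_get_aio_context(),
                                             &error_abort);
    }
    blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
    aio_context_release(dataplane->ctx);

    /* ...record the new context before issuing any more AIO, so that the
     * completion BHs scheduled by blk_aio_flush() land on a context that
     * still exists... */
    dataplane->ctx = qemu_get_aio_context();

    /* ...and only then run the ring drain and inflight flush loops from
     * the patch. */

That doesn't look any simpler to me, just the same constraint honoured
in a different place.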

The BH needs to be prevented from firing ASAP; otherwise, during the
disconnect()/connect() which happens on XenbusStateConnected, whatever
the guest does next can have the bh fire right in the middle of juggling
contexts for the disconnect() (I have the SEGVs from that too...).
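
For what it's worth, the net ordering v2 is aiming for at the top of
xen_block_dataplane_stop() is the following (sketch only; error handling
and the existing event-channel/ring teardown at the end of the function
are elided):

    /* 1. Stop the bottom half from being (re)scheduled before anything
     *    else; a late firing races with the context juggling below. */
    qemu_bh_cancel(dataplane->bh);

    /* 2. Drain whatever the guest already put on the ring while the
     *    mapping is still in place. */
    aio_context_acquire(dataplane->ctx);
    do {
        xen_block_handle_requests(dataplane);
    } while (dataplane->more_work);
    aio_context_release(dataplane->ctx);

    /* 3. Let every inflight request complete while dataplane->ctx is
     *    still the context the requests were issued on. */
    while (!QLIST_EMPTY(&dataplane->inflight)) {
        QLIST_FOREACH_SAFE(request, &dataplane->inflight, list, next) {
            blk_aio_flush(request->dataplane->blk, xen_block_complete_aio,
                          request);
        }
    }

    /* 4. Only now move the event channel and the BlockBackend back onto
     *    the main loop context and continue with the existing teardown. */

Steps 1-3 are the new code; step 4 is the unchanged tail of the function.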

> >    Paul
> >
> > >       xendev = dataplane->xendev;
> > >
> > >       aio_context_acquire(dataplane->ctx);
> > > +
> > >       if (dataplane->event_channel) {
> > >           /* Only reason for failure is a NULL channel */
> > >           xen_device_set_event_channel_context(xendev, dataplane->event_channel,
> > > @@ -684,12 +709,6 @@ void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
> > >       blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
> > >       aio_context_release(dataplane->ctx);
> > >
> > > -    /*
> > > -     * Now that the context has been moved onto the main thread, cancel
> > > -     * further processing.
> > > -     */
> > > -    qemu_bh_cancel(dataplane->bh);
> > > -
> > >       if (dataplane->event_channel) {
> > >           Error *local_err = NULL;
> > >
> >

Tim (hoping GMail behaves itself with this message...)
