[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [BUG] xl devd segmentation fault on xl block-detach



Wei Liu writes ("Re: [Xen-devel] [BUG] xl devd segmentation fault on xl 
block-detach"):
> Can you give the following patch a try? This applies to 4.8.
> 
> Not sure if there is a better way to fix it though. Ian and Roger?

I find the logic here rather awkward.  I do remember reviewing it and
becoming a bit confused at the time and it seems that even though I
eventually convinced myself it was OK, I was wrong.

> From: Wei Liu <wei.liu2@xxxxxxxxxx>
> Date: Wed, 3 May 2017 17:55:42 +0100
> Subject: [PATCH] libxl: fix backend_watch_callback
> 
> That function needs to cope with spurious events. The original "skip"
> path blindly freed dguest even when it needed to stay in ddomain list.
> 
> Free dguest iff it is newly added to the list. That way we don't free
> the one that should stay on the list and we don't unnecessarily add a
> stale dguest entry to ddomain list.

AFAICT right now you are right.  But I see another possible way of
fixing it:

How about moving the num_devs == 0 check, and associated cleanup, to
the exit path ?  That way a if new guest struct is spuriously
allocated, it will automatically be freed.  It would mean that the
freeing of dguest would depend only on other invariants already in the
code, rather than on explicit tracking.

The invariants are, I think:

 * Any libxl__ddomain_device is either
     * on some list libxl__ddomain_guest->devices
     * being processed for removal, and referenced by a device
       remove async call initiated by remove_device and which will
       call device_complete() when done
   but not both!

 * Any libxl__domain_guest is on the list libxl__ddomain->guests.

 The above apply even within any function, except very briefly when
 transitioning from one state to another (eg creation, destruction).

 * SUM(libxl__domain_guest->num_*) != 0, when we return from the
   outermost callback.  (Ie, there are no leftover empty guest
   structs.)


Thinking about this like this, and observing the control flow, leads
me to think I have found another bug.

Consider what happens if a device is removed while it is still being
added.  That is, an event comes in which causes us to call add_device.
add_device sets up the callback and starts doing work (eg hotplug
scripts).  Before that finishes, the device is removed again.
backend_watch_callback will tear the device down and free dev.

But dev is still referenced by the add_device operation, and when it
completes, device_complete will call
  libxl__device_backend_path(gc, aodev->dev)

There ought to be a (perhaps implicit) invariant that
 * Any dev referenced by an aodev call is legit

But this invariant is violated by backend_watch_callback, which frees
it despite it not knowing whether there is a callback in flight.


Perhaps we should do explicit reference counting.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.