[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [BUG] xl devd segmentation fault on xl block-detach
On Thu, May 04, 2017 at 10:51:01AM +0100, Ian Jackson wrote: > Wei Liu writes ("Re: [Xen-devel] [BUG] xl devd segmentation fault on xl > block-detach"): > > Can you give the following patch a try? This applies to 4.8. > > > > Not sure if there is a better way to fix it though. Ian and Roger? > > I find the logic here rather awkward. I do remember reviewing it and > becoming a bit confused at the time and it seems that even though I > eventually convinced myself it was OK, I was wrong. > > > From: Wei Liu <wei.liu2@xxxxxxxxxx> > > Date: Wed, 3 May 2017 17:55:42 +0100 > > Subject: [PATCH] libxl: fix backend_watch_callback > > > > That function needs to cope with spurious events. The original "skip" > > path blindly freed dguest even when it needed to stay in ddomain list. > > > > Free dguest iff it is newly added to the list. That way we don't free > > the one that should stay on the list and we don't unnecessarily add a > > stale dguest entry to ddomain list. > > AFAICT right now you are right. But I see another possible way of > fixing it: > > How about moving the num_devs == 0 check, and associated cleanup, to > the exit path ? That way a if new guest struct is spuriously > allocated, it will automatically be freed. It would mean that the > freeing of dguest would depend only on other invariants already in the > code, rather than on explicit tracking. > > The invariants are, I think: > > * Any libxl__ddomain_device is either > * on some list libxl__ddomain_guest->devices > * being processed for removal, and referenced by a device > remove async call initiated by remove_device and which will > call device_complete() when done > but not both! > > * Any libxl__domain_guest is on the list libxl__ddomain->guests. > > The above apply even within any function, except very briefly when > transitioning from one state to another (eg creation, destruction). > > * SUM(libxl__domain_guest->num_*) != 0, when we return from the > outermost callback. (Ie, there are no leftover empty guest > structs.) Yes, that seems better so that there's no code duplication. Will send a patch shortly. > > Thinking about this like this, and observing the control flow, leads > me to think I have found another bug. > > Consider what happens if a device is removed while it is still being > added. That is, an event comes in which causes us to call add_device. > add_device sets up the callback and starts doing work (eg hotplug > scripts). Before that finishes, the device is removed again. > backend_watch_callback will tear the device down and free dev. > > But dev is still referenced by the add_device operation, and when it > completes, device_complete will call > libxl__device_backend_path(gc, aodev->dev) > > There ought to be a (perhaps implicit) invariant that > * Any dev referenced by an aodev call is legit Right, maybe an easier solution would be to not pass the stored libxl__device to the async functions, and instead copy it to a temporary one that's GC'ed afterwards. AFAICT the async operations only rely on the libxl__device, so passing a device tracked by the GC should solve this without refcounting, will send a patch for this also. Roger. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |