Re: race condition when re-connecting vif after backend died
On Wed, Oct 08, 2025 at 02:32:02PM +0200, Jürgen Groß wrote:
> On 08.10.25 13:22, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > I have the following scenario:
> > 1. Start backend domain (call it netvm1)
> > 2. Start frontend domain (call it vm1), with
> >    vif=['backend=netvm2,mac=00:16:3e:5e:6c:00,script=vif-route-qubes,ip=10.138.17.244']
> > 3. Pause vm1 (not strictly required, but makes reproducing much easier)
> > 5. Crash/shutdown/destroy netvm1
> > 4. Start another backend domain (call it netvm2)
> > 5. In quick succession:
> > 5.1. unpause vm1
> > 5.2. detach (or actually cleanup) the vif from vm1 (connected to the
> >      now dead netvm1)
> > 5.3. attach a similar vif with backend=netvm2

The way it's written above, it's tricky to reproduce (1/20 or even less
often). But if I move the unpause to after 5.3, it happens reliably. I
hope that's not too different a scenario...

> > Sometimes it ends up with eth0 present in vm1, but its xenstore state
> > key is still XenbusStateInitialising, and the backend state is
> > XenbusStateInitWait.
> > In step 5.2, libxl normally waits for the backend to transition to
> > XenbusStateClosed, and IIUC the backend waits for the frontend to do
> > the same. But when the backend is gone, libxl seems to simply remove
> > the frontend xenstore entries without any coordination with the
> > frontend domain itself.
> > What I suspect happens is that the xenstore events generated at 5.2
> > get handled by the frontend's kernel only after 5.3. At that point the
> > frontend sees a device that was in XenbusStateConnected transition to
> > XenbusStateInitialising (the frontend doesn't really expect somebody
> > else to change its state key), and (I guess) it doesn't notice that
> > the device vanished for a moment (xenbus_dev_changed() doesn't hit the
> > !exists path). I haven't verified it, but I guess it also doesn't
> > notice the backend path change, so it's still watching the old one
> > (gone at this point).
> >
> > If my diagnosis is correct, what should the solution be? Add handling
> > for XenbusStateUnknown in xen-netfront.c:netback_changed()? If so, it
> > should probably carefully clean up the old device while not touching
> > the xenstore entries (which already belong to the new instance) and
> > then re-initialize the device (a xennet_connect() call?).
> > Or maybe it should be done in a generic way in xenbus_probe.c, in
> > xenbus_dev_changed()? I'm not sure how exactly - maybe by checking
> > whether the backend path (or just backend-id?) changed, and then
> > calling both device_unregister() (again, being careful not to change
> > xenstore, especially not to set XenbusStateClosed) and
> > xenbus_probe_node()?
> >
>
> I think we need to know what is going on here.
>
> Can you repeat the test with Xenstore tracing enabled? Just do:
>
> xenstore-control logfile /tmp/xs-trace
>
> before point 3. in your list above and then perform steps 3. - 5.3. and
> then send the logfile. Please make sure not to have any additional
> actions causing Xenstore traffic in between, as this would make it much
> harder to analyze the log.

I can't completely avoid other xenstore activity, but I tried to reduce
it as much as possible... I'm attaching the reproduction script, its
output, and the xenstore traces. Note I split the xenstore trace into
two parts, hopefully making it easier to analyze.
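To make the xen-netfront.c option above a bit more concrete, here is a
rough, completely untested sketch of the kind of change meant. Only the
new XenbusStateUnknown branch matters; the existing cases of the switch
are elided, and xennet_disconnect_backend() is the existing teardown
helper in xen-netfront.c that would presumably be reused (whether it is
actually safe to call at this point is exactly the open question):

/*
 * Sketch only, not a tested patch: add a XenbusStateUnknown case to
 * netback_changed() in drivers/net/xen-netfront.c.  All other cases of
 * the existing switch are elided; the point is just the new branch,
 * which drops the old rings/grants without writing the frontend's own
 * xenstore state key, because the frontend path may already belong to
 * the newly attached device.
 */
static void netback_changed(struct xenbus_device *dev,
                            enum xenbus_state backend_state)
{
        struct netfront_info *np = dev_get_drvdata(&dev->dev);

        switch (backend_state) {
        case XenbusStateUnknown:
                /*
                 * The backend's xenstore directory disappeared (backend
                 * domain died and its entries were removed).  Disconnect
                 * from the old backend, but do not touch xenstore.
                 */
                xennet_disconnect_backend(np);
                break;

        /* ... existing cases (InitWait, Connected, Closing, ...) ... */
        default:
                break;
        }
}

The re-connect side (the xennet_connect() part) would then have to wait
until the new backend shows up in xenstore, which is where the generic
xenbus_probe.c / xenbus_dev_changed() approach might end up being a
better fit.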
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: network-reproduce
Attachment: output.txt
Attachment: xs-trace-2025-10-08T09:57:31-04:00
Attachment: xs-trace-2025-10-08T09:57:31-04:00-unpause
Attachment: signature.asc