
Re: race condition when re-connecting vif after backend died



On Wed, Oct 08, 2025 at 02:32:02PM +0200, Jürgen Groß wrote:
> On 08.10.25 13:22, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > I have the following scenario:
> > 1. Start backend domain (call it netvm1)
> > 2. Start frontend domain (call it vm1), with
> > vif=['backend=netvm1,mac=00:16:3e:5e:6c:00,script=vif-route-qubes,ip=10.138.17.244']
> > 3. Pause vm1 (not strictly required, but makes reproducing much easier)
> > 4. Crash/shutdown/destroy netvm1
> > 5. Start another backend domain (call it netvm2), then in quick succession:
> >     5.1. unpause vm1
> >     5.2. detach (or actually clean up) the vif from vm1 (connected to the
> >          now dead netvm1)
> >     5.3. attach a similar vif with backend=netvm2

As described above, it's tricky to reproduce (about 1 in 20 attempts or
even less often). But if I move the unpause after 5.3, it happens
reliably (roughly as in the sketch below). I hope that's not too
different a scenario...
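
For reference, an xl-level sketch of that modified sequence (domain names
and vif parameters are taken from the scenario above; the config file
names, the vif devid and the exact xl syntax are only illustrative, and
this is not the attached script):

    xl create netvm1.cfg    # 1. start the original backend domain
    xl create vm1.cfg       # 2. start the frontend domain, vif backed by netvm1
    xl pause vm1            # 3. pause the frontend
    xl destroy netvm1       # 4. crash/destroy the old backend
    xl create netvm2.cfg    # 5. start the replacement backend, then quickly:
    xl network-detach vm1 0                        # 5.2. detach the vif of dead netvm1
    xl network-attach vm1 backend=netvm2 mac=00:16:3e:5e:6c:00 \
        script=vif-route-qubes ip=10.138.17.244    # 5.3. attach a similar vif on netvm2
    xl unpause vm1                                 # 5.1. moved after 5.3 to hit the race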

> > Sometimes it ends up with eth0 being present in vm1, but its xenstore
> > state key is still XenbusStateInitialising, and the backend state is at
> > XenbusStateInitWait.
> > In step 5.2, libxl normally waits for the backend to transition to
> > XenbusStateClosed, and IIUC the backend waits for the frontend to do the
> > same. But when the backend is gone, libxl seems to simply remove the
> > frontend xenstore entries without any coordination with the frontend
> > domain itself.
> > What I suspect happens is that the xenstore events generated at 5.2 are
> > handled by the frontend's kernel only after 5.3. At that point the
> > frontend sees a device that was in XenbusStateConnected transition to
> > XenbusStateInitialising (the frontend doesn't really expect anybody else
> > to change its state key) and (I guess) doesn't notice that the device
> > vanished for a moment (xenbus_dev_changed() doesn't hit the !exists path).
> > I haven't verified it, but I guess it also doesn't notice the backend path
> > change, so it's still watching the old one (gone at this point).
> > 
> > If my diagnosis is correct, what should the solution be here? Add
> > handling for XenbusStateUnknown in xen-netfront.c:netback_changed()? If
> > so, it should probably carefully clean up the old device while not
> > touching the xenstore entries (which already belong to the new instance)
> > and then re-initialize the device (a xennet_connect() call?).
> > Or maybe it should be done in a generic way in xenbus_probe.c, in
> > xenbus_dev_changed()? Not sure how exactly - maybe by checking whether
> > the backend path (or just backend-id?) changed, and then calling
> > device_unregister() (again, being careful not to touch xenstore,
> > especially not to set XenbusStateClosed) followed by xenbus_probe_node()?
> > 
> 
> I think we need to know what is going on here.
> 
> Can you repeat the test with Xenstore tracing enabled? Just do:
> 
> xenstore-control logfile /tmp/xs-trace
> 
> before point 3. in your list above and then perform steps 3. - 5.3. and
> then send the logfile. Please make sure not to have any additional actions
> causing Xenstore traffic in between, as this would make it much harder to
> analyze the log.

I can't completely avoid other xenstore activity, but I tried to reduce
it as much as possible...
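
Roughly, the tracing fits around the critical steps like this (just a
sketch, not the attached script; the log file name is the one from your
instructions, the snapshot step is only illustrative):

    xenstore-control logfile /tmp/xs-trace          # start tracing before step 3
    # ... perform steps 3 - 5.3 as above, with as little other activity as possible ...
    cp /tmp/xs-trace "xs-trace-$(date -Iseconds)"   # snapshot the log for sending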

I'm attaching the reproduction script, its output, and the xenstore
traces. Note that I split the xenstore trace into two parts, hopefully
making it easier to analyze.
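
In case it helps, the stuck state described above can also be seen by
reading the xenstore keys directly; a minimal sketch (the vif devid 0 and
the domain names are assumptions, paths follow the usual /local/domain
layout):

    FRONT=$(xl domid vm1)
    BACK=$(xl domid netvm2)
    xenstore-read /local/domain/$FRONT/device/vif/0/state         # stuck at 1 (XenbusStateInitialising)
    xenstore-read /local/domain/$BACK/backend/vif/$FRONT/0/state  # stuck at 2 (XenbusStateInitWait)
    xenstore-read /local/domain/$FRONT/device/vif/0/backend       # already points at the netvm2 path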

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: network-reproduce
Description: Text document

Attachment: output.txt
Description: Text document

Attachment: xs-trace-2025-10-08T09:57:31-04:00
Description: Text document

Attachment: xs-trace-2025-10-08T09:57:31-04:00-unpause
Description: Text document

Attachment: signature.asc
Description: PGP signature


 

