
Re: race condition when re-connecting vif after backend died


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Jürgen Groß <jgross@xxxxxxxx>
  • Date: Wed, 8 Oct 2025 14:32:02 +0200
  • Delivery-date: Wed, 08 Oct 2025 12:32:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 08.10.25 13:22, Marek Marczykowski-Górecki wrote:
Hi,

I have the following scenario:
1. Start backend domain (call it netvm1)
2. Start frontend domain (call it vm1), with
vif=['backend=netvm2,mac=00:16:3e:5e:6c:00,script=vif-route-qubes,ip=10.138.17.244']
3. Pause vm1 (not strictly required, but makes reproducing much easier)
4. Crash/shutdown/destroy netvm1, then start another backend domain (call it netvm2)
5. In quick succession:
    5.1. unpause vm1
    5.2. detach (or actually cleanup) vif from vm1 (connected to now dead
         netvm1)
    5.3. attach similar vif with backend=netvm2

Sometimes it ends up with eth0 present in vm1, but with its xenstore
state key still at XenbusStateInitialising, while the backend state is at
XenbusStateInitWait.
In step 5.2, libxl normally waits for the backend to transition to
XenbusStateClosed, and IIUC the backend waits for the frontend to do the
same. But when the backend is gone, libxl seems to simply remove the
frontend's xenstore entries without any coordination with the frontend
domain itself.
What I suspect happens is that the xenstore events generated at 5.2 are
handled by the frontend's kernel only after 5.3. At that point, the
frontend sees a device that was in XenbusStateConnected transition back to
XenbusStateInitialising (the frontend doesn't really expect somebody else
to change its state key) and (I guess) doesn't notice that the device
vanished for a moment (xenbus_dev_changed() doesn't hit the !exists path). I
haven't verified it, but I guess it also doesn't notice the backend path
change, so it's still watching the old one (gone at this point).
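The suspected ordering can be modeled outside the kernel. In this toy model (all names are hypothetical; nothing here is a real Xen or Linux API), watch events queued while vm1 is paused are drained only after both the remove (5.2) and the re-create (5.3) have hit xenstore, so a handler shaped like xenbus_dev_changed() never observes the node as missing:

```c
/* Toy xenstore node state for device/vif/0 plus a count of queued,
 * not-yet-handled watch events. */
static int node_exists;
static int pending_events;

static void xs_remove(void) { node_exists = 0; pending_events++; } /* 5.2 */
static void xs_create(void) { node_exists = 1; pending_events++; } /* 5.3 */

/* Mirrors the shape of xenbus_dev_changed(): the cleanup path is taken
 * only if the node is gone at the time the event is finally handled,
 * not at the time it was generated. */
static const char *handle_event(void)
{
    return node_exists ? "probe-or-ignore" : "cleanup";
}
```

Draining the queue after both operations yields "probe-or-ignore" for every event; the "cleanup" branch is never taken, which matches the suspected miss of the !exists path.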

If my diagnosis is correct, what should the solution be? Add handling
for XenbusStateUnknown in xen-netfront.c:netback_changed()? If so, it
should probably carefully clean up the old device while not touching the
xenstore entries (which already belong to the new instance) and then
re-initialize the device (an xennet_connect() call?).
Or maybe it should be done in a generic way in xenbus_probe.c, in
xenbus_dev_changed()? I'm not sure how exactly - maybe by checking
whether the backend path (or just backend-id?) changed, and then calling
both device_unregister() (again, being careful not to touch xenstore,
especially not to set XenbusStateClosed) and then xenbus_probe_node()?
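A rough sketch of the first option only (this is not actual xen-netfront.c code; the action names and the helper function are made up for illustration, and a real fix would also have to re-establish watches and avoid writing the state key): treat a backend state that reads back as XenbusStateUnknown, while the frontend believes it is connected, as "old instance gone, tear down and re-initialize".

```c
/* State values as in include/xen/interface/io/xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
    XenbusStateClosing      = 5,
    XenbusStateClosed       = 6,
};

/* Hypothetical outcomes of a netback_changed()-style handler. */
enum vif_action { ACT_IGNORE, ACT_CONNECT, ACT_CLOSE, ACT_REINIT };

/* Sketch: if the (old) backend vanished while we were Connected, clean
 * up the old device without touching xenstore (those entries already
 * belong to the new instance) and re-initialize. */
static enum vif_action netback_changed_sketch(enum xenbus_state frontend,
                                              enum xenbus_state backend)
{
    switch (backend) {
    case XenbusStateUnknown:
        return frontend == XenbusStateConnected ? ACT_REINIT : ACT_IGNORE;
    case XenbusStateInitWait:
        return ACT_CONNECT;
    case XenbusStateClosing:
    case XenbusStateClosed:
        return ACT_CLOSE;
    default:
        return ACT_IGNORE;
    }
}
```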


I think we need to know what is going on here.

Can you repeat the test with Xenstore tracing enabled? Just do:

xenstore-control logfile /tmp/xs-trace

before point 3. in your list above, then perform steps 3. - 5.3. and
send the logfile. Please make sure there are no additional actions
causing Xenstore traffic in between, as that would make the log much
harder to analyze.

In case the problem doesn't appear, please delete the logfile before
starting a new attempt (xenstored appends new trace data to an existing
file).


Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature


 

