
Re: [Xen-users] XEN: bug in vif-bridge script



On 05.03.14 10:06, Roger Pau Monné wrote:
On 05/03/14 03:00, Ian Campbell wrote:
(Roger, I've trimmed the quotes fairly aggressively,
https://bugs.gentoo.org/show_bug.cgi?id=502570 or
http://lists.xen.org/archives/html/xen-users/2014-03/msg00013.html for
the full thing but in brief the vif is gone by the time the hotplug
script runs and this results in errors from e.g. brctl delif, which are
correctly ignored but are also logged. I presume Atom2 is running a log
scanning tool or something and would like to avoid spurious log
messages, which seems fair)

On Tue, 2014-03-04 at 17:06 +0100, Atom2 wrote:
If this rather needs to go to the xen-devel ML, I am sure Ian Campbell
(or somebody else) will shortly be around to move it or ask me to
resend it to the other list.
[...]
Feb 26 22:14:29 vm-host logger: /etc/xen/scripts/vif-bridge: brctl delif xenbr0 vif1.0 failed
Feb 26 22:14:29 vm-host logger: /etc/xen/scripts/vif-bridge: ifconfig vif1.0 down failed
[...]
Upon investigating, it seems that the problem is that the network
device (at least for paravirtualized guests using the netfront/netback
device model) has already been destroyed by the dom0 kernel by the time
the script runs.

This sounds very plausible to me.

Are you using the xm or xl toolstack? The way the new xl toolstack
handles hotplug scripts ought to be a lot less prone to this sort of
race (but I don't know if it avoids this particular one). Roger, do you
have any thoughts?

I've tried with xl and I can confirm this happens at least with xl (not
tested xm). In libxl we wait for the backends to switch to state 6
(closed) before executing hotplug scripts. My guess is that netback
removes the vif before switching to state 6, so by the time hotplug
scripts are executed the vif is long gone.
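
For illustration only (assuming dom0 shell access; the xenstore path and
vif name are taken from the log excerpt further down and are just an
example), this is roughly what one would expect to observe at the point
the offline script runs:
-------------------------------------------
# Illustration: the backend state in xenstore should already be 6
# (XenbusStateClosed) when the offline hook runs...
xenstore-read backend/vif/2/0/state 2>/dev/null    # expected to print: 6

# ...while the vif netdevice itself may already have been freed by netback.
if ip link show vif2.0 > /dev/null 2>&1 ; then
    echo "vif2.0 still present"
else
    echo "vif2.0 already removed by netback"       # the situation reported here
fi
-------------------------------------------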

These messages appear in /var/log/debug on my system, which also contains
other messages from hotplug scripts that are part of the normal hotplug
execution:

Mar  5 09:33:06 loki royger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/2/51712
Mar  5 09:33:06 loki royger: /etc/xen/scripts/vif-bridge: online type_if=vif XENBUS_PATH=backend/vif/2/0
Mar  5 09:33:06 loki royger: /etc/xen/scripts/vif-bridge: Successful vif-bridge online for vif2.0, bridge bridge0.
Mar  5 09:33:06 loki royger: /etc/xen/scripts/vif-bridge: Writing backend/vif/2/0/hotplug-status connected to xenstore.
Mar  5 09:33:22 loki royger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/51712
Mar  5 09:33:22 loki royger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
Mar  5 09:33:22 loki royger: /etc/xen/scripts/vif-bridge: brctl delif bridge0 vif2.0 failed
Mar  5 09:33:22 loki royger: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Mar  5 09:33:22 loki royger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge bridge0.

Suggested fix:
- for brctl: check whether the interface still exists and is still
  linked to the bridge prior to invoking the brctl command
- for ifconfig: check whether the interface still exists and is still
  up prior to invoking the ifconfig command

as follows:
-------------------------------------------
case "$command" in
      online)
          setup_virtual_bridge_port "$dev"
          mtu="`ip link show $bridge | awk '/mtu/ { print $5 }'`"
          if [ -n "$mtu" ] && [ "$mtu" -gt 0 ]
          then
                  ip link set $dev mtu $mtu || :
          fi
          add_to_bridge "$bridge" "$dev"
          ;;

      offline)
          if brctl show "$bridge" | grep "$dev" > /dev/null 2>&1 ; then
              do_without_error brctl delif "$bridge" "$dev"
        fi
          if ifconfig -s "$dev" > /dev/null 2>&1 ; then
              do_without_error ifconfig "$dev" down
        fi
          ;;

      add)
          setup_virtual_bridge_port "$dev"
          add_to_bridge "$bridge" "$dev"
          ;;
esac

If this issue does affect xl then I would like to see this fixed
upstream, preferably by fixing xl to not race hotplug scripts against
device tear down. If that is impossible (I don't think it should be, but
Roger?) then the script change which you propose seems like a very
reasonable fallback option.

IMHO we are doing the right thing in libxl: wait for the backend to
switch to state 6, and then execute hotplug scripts. Looking at the
netback code, I think your proposed fix is racy:

set_backend_state(be, XenbusStateClosed);

unregister_hotplug_status_watch(be);
if (be->vif) {
        kobject_uevent(&dev->dev.kobj, KOBJ_OFFLINE);
        xenbus_rm(XBT_NIL, dev->nodename, "hotplug-status");
        xenvif_free(be->vif);
        be->vif = NULL;
}
kfree(be);
dev_set_drvdata(&dev->dev, NULL);

Netback sets the "Closed" state before removing the vif, so there's a
possible race between the gate that you propose to add to the hotplug
script and the removal work done by netback.
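
To spell the race out (illustration only, reusing the check proposed
above): the test and the action are two separate steps, so netback can
free the vif in the window between them and the gated command can still
fail and be logged:
-------------------------------------------
if brctl show "$bridge" | grep "$dev" > /dev/null 2>&1 ; then
    # netback may call xenvif_free() right here (the state is already
    # Closed), so the command below can still fail despite the check.
    do_without_error brctl delif "$bridge" "$dev"
fi
-------------------------------------------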

I think a suitable fix would be to either remove those two lines from
the hotplug script entirely, or simply not log any error if they fail.
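
As an illustration of the second option only (the do_silently helper
below is hypothetical and not part of the shipped scripts), the two
commands could be run through a wrapper that discards output and
ignores failure, so nothing ends up in the logs:
-------------------------------------------
# Hypothetical helper, for illustration only: run a command, discard
# its output and ignore any failure.
do_silently()
{
    "$@" > /dev/null 2>&1 || true
}

    offline)
        do_silently brctl delif "$bridge" "$dev"
        do_silently ifconfig "$dev" down
        ;;
-------------------------------------------
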
If netback does indeed do all the removal work, I think completely removing those two lines from the script would be a sensible choice. Merely suppressing the error message for an error that only pops up because somebody else has already done the job seems odd to me.

On second thought, those two lines now seem to me a bit like double accounting for work that needs to be done only once and is already taken care of somewhere else - unless I missed the point.

And on a related note: if you remove those two lines from the hotplug script, you could also get rid of the "offline)" arm of the case statement entirely, as those two lines (prior to my suggested fix) were the only ones in there - see the sketch below.
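
As a minimal sketch (assuming netback really does take care of all the
teardown, as discussed above), the case statement would then reduce to:
-------------------------------------------
case "$command" in
    online)
        setup_virtual_bridge_port "$dev"
        mtu="`ip link show $bridge | awk '/mtu/ { print $5 }'`"
        if [ -n "$mtu" ] && [ "$mtu" -gt 0 ]
        then
            ip link set $dev mtu $mtu || :
        fi
        add_to_bridge "$bridge" "$dev"
        ;;

    add)
        setup_virtual_bridge_port "$dev"
        add_to_bridge "$bridge" "$dev"
        ;;
esac
-------------------------------------------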

Thanks Atom2

Roger.


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users

 

