[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

libxl / xen-pciback interaction


while trying to test a pciback fix for how SR-IOV VFs get placed in the
guest PCI topology I noticed that my test VM would only ever see the 1st
out of 3 VFs that I passed to it. As it turns out, the driver watches
its own (backend) node, and upon first receiving notification it
evaluates state found in xenstore to set up the backend device.
Subsequently it switches the device to Initialised. After this switching,
not further instances of the watch triggering would do anything.

In all instances I observed the watch event getting processed when the
"num_devs" node still reported 1. Trying to deal with this in libxl, by
delaying the writing of the "num_devs" node, led to a fatal error
("num_devs" not being available to read) in the driver, causing the
device to move to Closing state. Therefore I decided that the issue has
to be addressed in the driver, resulting in a patch (reproduced below)
that I'm not overly happy with. I think the present libxl behavior is
wrong - it shouldn't trigger driver initialization without having fully
populated the information the driver is supposed to consume for its
device initialization. The only solution that I can think of, however,
doesn't look very appealing either: Instead of putting all pieces of the
data for one device in a transaction, make a single transaction cover
all devices collectively.

Are there opinions about where to address the issue, or suggestions as
to better approaches than the ones shown / outlined?

While doing this it also occurred to me as odd how "num_devs" is
actually used: It's not really a "number of devices" indicator, but
instead a "next device has this number" one: libxl reads the present
value and increments it by one for every new device. Destroying
(hot-unplugging) of devices doesn't have any effect on the value. If
addition / removal of a device happens a number of times for a VM,
quite a few leftover, no longer used entries would accumulate afaict.
This isn't only consuming space in xenstore for no good reason, but
also means pciback has to do an increasing amount of processing every
time a reconfigure event happens.


xen-pciback: reconfigure also from backend watch handler

When multiple PCI devices get assigned to a guest right at boot, libxl
incrementally populates the backend tree. The writes for the first of
the devices trigger the backend watch. In turn xen_pcibk_setup_backend()
will set the XenBus state to Initialised, at which point no further
reconfigures would happen unless a device got hotplugged. Arrange for
reconfigure to also get triggered from the backend watch handler.

Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
I will admit that this isn't entirely race-free (with the guest actually
attaching in parallel), but from the looks of it such a race ought to be
benign (not the least thanks to the mutex). Ideally the tool stack
wouldn't write num_devs until all devices had their information
populated. I tried doing so in libxl, but xen_pcibk_setup_backend()
calling xenbus_dev_fatal() when not being able to read that node
prohibits such an approach (or else libxl and driver changes would need
to go into use in lock-step).

I wonder why what is being watched isn't just the num_devs node. Right
now the watch triggers quite frequently without anything relevant
actually having changed (I suppose in at least some cases in response
to writes by the backend itself).

--- a/drivers/xen/xen-pciback/xenbus.c
+++ b/drivers/xen/xen-pciback/xenbus.c
@@ -359,7 +359,8 @@ out:
        return err;
-static int xen_pcibk_reconfigure(struct xen_pcibk_device *pdev)
+static int xen_pcibk_reconfigure(struct xen_pcibk_device *pdev,
+                                enum xenbus_state state)
        int err = 0;
        int num_devs;
@@ -374,8 +375,7 @@ static int xen_pcibk_reconfigure(struct
        /* Make sure we only reconfigure once */
-       if (xenbus_read_driver_state(pdev->xdev->nodename) !=
-           XenbusStateReconfiguring)
+       if (xenbus_read_driver_state(pdev->xdev->nodename) != state)
                goto out;
        err = xenbus_scanf(XBT_NIL, pdev->xdev->nodename, "num_devs", "%d",
@@ -500,6 +500,9 @@ static int xen_pcibk_reconfigure(struct
+       if (state != XenbusStateReconfiguring)
+               goto out;
        err = xenbus_switch_state(pdev->xdev, XenbusStateReconfigured);
        if (err) {
                xenbus_dev_fatal(pdev->xdev, err,
@@ -525,7 +528,7 @@ static void xen_pcibk_frontend_changed(s
        case XenbusStateReconfiguring:
-               xen_pcibk_reconfigure(pdev);
+               xen_pcibk_reconfigure(pdev, XenbusStateReconfiguring);
        case XenbusStateConnected:
@@ -664,6 +667,10 @@ static void xen_pcibk_be_watch(struct xe
+       case XenbusStateInitialised:
+               xen_pcibk_reconfigure(pdev, XenbusStateInitialised);
+               break;



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.