[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-API] Re: openvswitch problem with XCP 1.0 RC3



On Tuesday 15 March 2011 20:23:21 Ben Pfaff wrote:
> Ben Pfaff <blp@xxxxxxxxxxxxxxx> writes:
> > Testing is taking longer than I expected, but we're still
> > planning to release this soon.
> 
> Here's the tested version of the kernel patch that we already
> passed along to Citrix.  I have already posted the corresponding
> OVS patch to ovs-dev here:
>         http://openvswitch.org/pipermail/dev/2011-March/028884.html
> 
> --8<--------------------------cut here-------------------------->8--
> 
> From: Ben Pfaff <blp@xxxxxxxxxx>
> Date: Fri, 25 Feb 2011 16:47:48 -0800
> Subject: [PATCH] 8021q: Add kluge to support 802.1Q VLANs on devices with
> buggy drivers.
> 
> Some Linux network drivers support a feature called "VLAN
> acceleration", associated with a data structure called a "vlan_group".
> A vlan_group is, abstractly, a dictionary that maps from a VLAN ID (in
> the range 0...4095) to a VLAN device, that is, a Linux network device
> associated with a particular VLAN, e.g. "eth0.9" for VLAN 9 on eth0.
> 
> Some drivers that support VLAN acceleration have bugs that fall
> roughly into the following categories:
> 
>     * Some NICs strip VLAN tags on receive if no vlan_group is
>       registered, so that the tag is completely lost.
> 
>     * Some drivers size their receive buffers based on whether a
>       vlan_group is enabled, meaning that a maximum size packet with a
>       VLAN tag will not fit if a vlan_group is not configured.
> 
>     * On transmit some drivers expect that VLAN acceleration will be
>       used if it is available (which can only be done if a vlan_group
>       is configured).  In these cases, the driver may fail to parse
>       the packet and correctly setup checksum offloading and/or TSO.
> 
> The correct long term solution is to fix these driver bugs.

FYI:
VLAN trunks work for us with igb, e1000, e1000e with all versions of XCP and 
openvswitch-1.1.0pre2.

The limitation is the incorrect receive buffer size which must be adjusted 
manually.

The drivers are patched to avoid transmit checksum errors, see 
http://patchwork.ozlabs.org/patch/24670/, but note that the patch is buggy in 
line 19.

> 
> This commit implements a workaround, which has been tested to work with the
> "igb" driver, which does not otherwise work with VLANs on Open vSwitch:
> 
>     * Implement a new VLAN device ioctl that sets up a vlan_group in
>       which all 4096 VLANs point back to the physical device.
> 
>     * In the code that handles received VLAN-accelerated packets, save
>       the VLAN on which the packet was received to a new sk_buff
>       member before clearing the VLAN number.
> 
>     * Check the value of the new member in the OVS kernel module
>       packet receive code.
> 
> This requires only minimal, low-risk changes to the core of the kernel
> network stack, as well as minimal changes to the Open vSwitch kernel
> module.
> 
> Driver Details
> --------------
> 
> The following drivers in linux-2.6.32.12-0.7.1.xs1.0.0.311.170586 are
> the ones that use "vlan_group" and are relevant to Open vSwitch on
> XenServer.  We have not tested any version of most of these drivers,
> so we do not know whether they have a VLAN problem that needs to be
> fixed:
> 
>     8139cp
>     acenic
>     amd8111e
>     atl1c
>     atl1e
>     atl1
>     atl2
>     be2net
>     bna
>     bnx2
>     bnx2x
>     cnic
>     cxgb
>     cxgb3
>     e1000
>     e1000e
>     enic
>     forcedeth
>     gianfar
>     igb
>     igbvf
>     ixgb
>     ixgbe
>     jme
>     mlx4_en
>     ns83820
>     qlge
>     r8169
>     s2io
>     sky2
>     starfire
>     tehuti
>     tg3
>     typhoon
>     via-velocity
>     vxge
> 
> Of the drivers that use "vlan_group" listed above, the following
> drivers inspect vlan devices in ways that imply that the VLAN
> workaround would not work for them.  These drivers have not yet been
> tested, so we do not know whether they need to use the VLAN workaround
> anyhow:
> 
>     cnic
>     cxgb3
> 
> Of the drivers that use "vlan_group" listed above, the following
> drivers appear to have bugs that would limit them to VLANs with VIDs
> in the range 0...63 with the kernel patch fix.  This is based on quick
> code inspection, so it is not necessarily an exhaustive list.  The
> drivers on this list have not yet been tested, so we do not know
> whether they need to use the VLAN workaround anyhow:
> 
>     via-velocity
> 
> The following drivers use "vlan_group" but are irrelevant to Open
> vSwitch on XenServer:
> 
>     bonding (not used with Open vSwitch)
>     ehea (cannot be built on x86; IBM Power architecture only)
>     stmmac (cannot be built on x86; SH4 architecture)
>     vmxnet3 (not shipped with XenServer; for use inside VMware VMs only)
> 
> Signed-off-by: Ben Pfaff <blp@xxxxxxxxxx>
> ---
>  include/linux/if_vlan.h |    5 ++-
>  net/8021q/vlan.c        |  131
> ++++++++++++++++++++++++++++++++++++++++++++-- net/8021q/vlan_core.c   |  
>  8 +++-
>  net/core/dev.c          |    1 +
>  4 files changed, 137 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
> index 46e2bf4..a0260d7 100644
> --- a/include/linux/if_vlan.h
> +++ b/include/linux/if_vlan.h
> @@ -84,6 +84,7 @@ struct vlan_group {
>       struct hlist_node       hlist;  /* linked list */
>       struct net_device **vlan_devices_arrays[VLAN_GROUP_ARRAY_SPLIT_PARTS];
>       struct rcu_head         rcu;
> +     bool                    loopback;
>  };
> 
>  static inline struct net_device *vlan_group_get_device(struct vlan_group
> *vg, @@ -325,7 +326,9 @@ enum vlan_ioctl_cmds {
>       SET_VLAN_NAME_TYPE_CMD,
>       SET_VLAN_FLAG_CMD,
>       GET_VLAN_REALDEV_NAME_CMD, /* If this works, you know it's a VLAN 
> device,
> btw */ -      GET_VLAN_VID_CMD /* Get the VID of this VLAN (specified by name)
> */ +  GET_VLAN_VID_CMD, /* Get the VID of this VLAN (specified by name) */
> +     ADD_ALL_VLANS_CMD,
> +     DEL_ALL_VLANS_CMD
>  };
> 
>  enum vlan_flags {
> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index bd8a167..f74d10e 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -101,7 +101,8 @@ static void vlan_group_free(struct vlan_group *grp)
>       kfree(grp);
>  }
> 
> -static struct vlan_group *vlan_group_alloc(struct net_device *real_dev)
> +static struct vlan_group *vlan_group_alloc(struct net_device *real_dev,
> +                                        bool loopback)
>  {
>       struct vlan_group *grp;
> 
> @@ -110,6 +111,7 @@ static struct vlan_group *vlan_group_alloc(struct
> net_device *real_dev) return NULL;
> 
>       grp->real_dev = real_dev;
> +     grp->loopback = loopback;
>       hlist_add_head_rcu(&grp->hlist,
>                       &vlan_group_hash[vlan_grp_hashfn(real_dev->ifindex)]);
>       return grp;
> @@ -242,7 +244,7 @@ int register_vlan_dev(struct net_device *dev)
> 
>       grp = __vlan_find_group(real_dev);
>       if (!grp) {
> -             ngrp = grp = vlan_group_alloc(real_dev);
> +             ngrp = grp = vlan_group_alloc(real_dev, false);
>               if (!grp)
>                       return -ENOBUFS;
>               err = vlan_gvrp_init_applicant(real_dev);
> @@ -363,6 +365,90 @@ out_free_newdev:
>       return err;
>  }
> 
> +/* Attach a "loopback" VLAN device to every VLAN on dev.
> + * Returns 0 if successful, a negative error code otherwise.
> + */
> +static int register_all_vlans(struct net_device *dev)
> +{
> +     struct net_device **array;
> +     struct vlan_group *grp;
> +     unsigned int size;
> +     int err;
> +     int i;
> +
> +     ASSERT_RTNL();
> +
> +     if (__vlan_find_group(dev))
> +             return -EBUSY;
> +
> +     err = vlan_check_real_dev(dev, 0);
> +     if (err < 0)
> +             return err;
> +
> +     grp = vlan_group_alloc(dev, true);
> +     if (!grp)
> +             return -ENOBUFS;
> +
> +     size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
> +     array = kmalloc(size, GFP_KERNEL);
> +     if (array == NULL)
> +             return -ENOBUFS;
> +     for (i = 0; i < VLAN_GROUP_ARRAY_PART_LEN; i++)
> +             array[i] = dev;
> +     for (i = 0; i < VLAN_GROUP_ARRAY_SPLIT_PARTS; i++)
> +             grp->vlan_devices_arrays[i] = array;
> +
> +     /* Account for reference in struct vlan_dev_info */
> +     dev_hold(dev);
> +
> +     grp->nr_vlans = VLAN_GROUP_ARRAY_LEN;
> +
> +     if (dev->features & NETIF_F_HW_VLAN_RX)
> +             dev->netdev_ops->ndo_vlan_rx_register(dev, grp);
> +     if (dev->features & NETIF_F_HW_VLAN_FILTER)
> +             for (i = 0; i < VLAN_GROUP_ARRAY_LEN; i++)
> +                     dev->netdev_ops->ndo_vlan_rx_add_vid(dev, i);
> +
> +     return 0;
> +}
> +
> +static int unregister_all_vlans(struct net_device *dev)
> +{
> +     struct vlan_group *grp;
> +     int i;
> +
> +     ASSERT_RTNL();
> +
> +     grp = __vlan_find_group(dev);
> +     if (!grp || !grp->loopback)
> +             return -EINVAL;
> +
> +     /* Take it out of our own structures, but be sure to interlock with
> +      * HW accelerating devices or SW vlan input packet processing.
> +      */
> +     if (dev->features & NETIF_F_HW_VLAN_FILTER)
> +             for (i = 0; i < VLAN_GROUP_ARRAY_LEN; i++)
> +                     dev->netdev_ops->ndo_vlan_rx_kill_vid(dev, i);
> +
> +     for (i = 0; i < VLAN_GROUP_ARRAY_PART_LEN; i++)
> +             vlan_group_set_device(grp, i, NULL);
> +     grp->nr_vlans = 0;
> +
> +     if (dev->features & NETIF_F_HW_VLAN_RX)
> +             dev->netdev_ops->ndo_vlan_rx_register(dev, NULL);
> +
> +     hlist_del_rcu(&grp->hlist);
> +
> +     synchronize_net();
> +     kfree(grp->vlan_devices_arrays[0]);
> +     kfree(grp);
> +
> +     /* Get rid of the vlan's reference to dev */
> +     dev_put(dev);
> +
> +     return 0;
> +}
> +
>  static void vlan_sync_address(struct net_device *dev,
>                             struct net_device *vlandev)
>  {
> @@ -444,6 +530,12 @@ static int vlan_device_event(struct notifier_block
> *unused, unsigned long event, * as we run under the RTNL lock.
>        */
> 
> +     if (grp->loopback) {
> +             if (event == NETDEV_UNREGISTER)
> +                     unregister_all_vlans(dev);
> +             goto out;
> +     }
> +
>       switch (event) {
>       case NETDEV_CHANGE:
>               /* Propagate real device state to vlan devices */
> @@ -581,13 +673,18 @@ static int vlan_ioctl_handler(struct net *net, void
> __user *arg) case DEL_VLAN_CMD:
>       case GET_VLAN_REALDEV_NAME_CMD:
>       case GET_VLAN_VID_CMD:
> +     case ADD_ALL_VLANS_CMD:
> +     case DEL_ALL_VLANS_CMD:
>               err = -ENODEV;
>               dev = __dev_get_by_name(net, args.device1);
>               if (!dev)
>                       goto out;
> 
>               err = -EINVAL;
> -             if (args.cmd != ADD_VLAN_CMD && !is_vlan_dev(dev))
> +             if (args.cmd != ADD_VLAN_CMD &&
> +                 args.cmd != ADD_ALL_VLANS_CMD &&
> +                 args.cmd != DEL_ALL_VLANS_CMD &&
> +                 !is_vlan_dev(dev))
>                       goto out;
>       }
> 
> @@ -667,6 +764,20 @@ static int vlan_ioctl_handler(struct net *net, void
> __user *arg) err = -EFAULT;
>               break;
> 
> +     case ADD_ALL_VLANS_CMD:
> +             err = -EPERM;
> +             if (!capable(CAP_NET_ADMIN))
> +                     break;
> +             err = register_all_vlans(dev);
> +             break;
> +
> +     case DEL_ALL_VLANS_CMD:
> +             err = -EPERM;
> +             if (!capable(CAP_NET_ADMIN))
> +                     break;
> +             err = unregister_all_vlans(dev);
> +             break;
> +
>       default:
>               err = -EOPNOTSUPP;
>               break;
> @@ -769,9 +880,17 @@ static void __exit vlan_cleanup_module(void)
> 
>       dev_remove_pack(&vlan_packet_type);
> 
> -     /* This table must be empty if there are no module references left. */
> -     for (i = 0; i < VLAN_GRP_HASH_SIZE; i++)
> -             BUG_ON(!hlist_empty(&vlan_group_hash[i]));
> +     for (i = 0; i < VLAN_GRP_HASH_SIZE; i++) {
> +             struct hlist_head *hlist = &vlan_group_hash[i];
> +             while (!hlist_empty(hlist)) {
> +                     struct vlan_group *grp;
> +                     int err;
> +
> +                     grp = hlist_entry(hlist->first, struct vlan_group, 
> hlist);
> +                     err = unregister_all_vlans(grp->real_dev);
> +                     BUG_ON(err);
> +             }
> +     }
> 
>       unregister_pernet_gen_device(vlan_net_id, &vlan_net_ops);
>       rcu_barrier(); /* Wait for completion of call_rcu()'s */
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index 7f7de1a..8327334 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -33,6 +33,11 @@ int vlan_hwaccel_do_receive(struct sk_buff *skb)
>       struct net_device *dev = skb->dev;
>       struct net_device_stats *stats;
> 
> +     if (!is_vlan_dev(dev)) {
> +             skb->pkt_type = PACKET_OTHERHOST;
> +             return 0;
> +     }
> +
>       skb->dev = vlan_dev_info(dev)->real_dev;
>       netif_nit_deliver(skb);
> 
> @@ -90,7 +95,8 @@ static int vlan_gro_common(struct napi_struct *napi,
> struct vlan_group *grp,
> 
>       for (p = napi->gro_list; p; p = p->next) {
>               NAPI_GRO_CB(p)->same_flow =
> -                     p->dev == skb->dev && !compare_ether_header(
> +                     p->dev == skb->dev && p->vlan_tci == skb->vlan_tci &&
> +                     !compare_ether_header(
>                               skb_mac_header(p), skb_gro_mac_header(skb));
>               NAPI_GRO_CB(p)->flush = 0;
>       }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0b5bf64..4c99b87 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2742,6 +2742,7 @@ void napi_reuse_skb(struct napi_struct *napi, struct
> sk_buff *skb) {
>       __skb_pull(skb, skb_headlen(skb));
>       skb_reserve(skb, NET_IP_ALIGN - skb_headroom(skb));
> +     skb->vlan_tci = 0;
> 
>       napi->skb = skb;
>  }


_______________________________________________
xen-api mailing list
xen-api@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/mailman/listinfo/xen-api


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.