
Re: [Xen-devel] [PATCH RFC] xen-netback: coalesce frags before copying



On Thu, 2013-03-14 at 18:26 +0000, Wei Liu wrote:
> This patch tries to coalesce txps when constructing grant copy structures.
> It enables netback to deal with the situation where the frontend's
> MAX_SKB_FRAGS is larger than the backend's MAX_SKB_FRAGS.
> 
> It also defines max_skb_frags, which is an estimate of the maximum number
> of frags a guest can send; anything bigger than that is considered
> malicious. For now it is set to 20, which should be enough to accommodate
> Linux (16 to 19) and possibly Windows (19?).
> 
> Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx>
> ---
>  drivers/net/xen-netback/netback.c |  164 +++++++++++++++++++++++++++++--------
>  1 file changed, 128 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 6e8e51a..e27f81a 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -47,9 +47,20 @@
>  #include <asm/xen/hypercall.h>
>  #include <asm/xen/page.h>
>  
> +/*
> + * This is an estimate of the maximum number of frags an SKB might
> + * have; anything larger than this is considered malicious. Typically
> + * Linux has 16 to 19, Windows has 19.

Perhaps rather than this static (although configurable) limit we could
guard against malicious guests by aborting only if we see a slot with
invalid contents? I think the XSA-39 issue doesn't exist so long as each
request results in a response (which a valid slot does).

Although I'm also sympathetic to the idea that this approach is simpler,
and therefore expedient in the short term, and that we can consider the
more complicated approach above in the future.
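
Very roughly, something like this when pulling each slot off the ring
(an untested sketch: the exact validity checks, and the reuse of the
existing xen_netbk_tx_err error path, are illustrative only):

        /* Abort on a slot whose contents cannot possibly be valid,
         * rather than capping the number of slots.  Every slot
         * consumed so far still gets an (error) response, so the
         * XSA-39 style problem shouldn't arise. */
        if (unlikely(txp->offset + txp->size > PAGE_SIZE)) {
                netdev_err(vif->dev, "slot overflows page: offset %u size %u\n",
                           txp->offset, txp->size);
                xen_netbk_tx_err(vif, &txreq, idx);
                continue;
        }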

> + */
> +#define NETBK_MAX_SKB_FRAGS_DEFAULT 20
> +static unsigned int max_skb_frags = NETBK_MAX_SKB_FRAGS_DEFAULT;
> +module_param(max_skb_frags, uint, 0444);

You could at least make this 0644 so root can up the limit dynamically?
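
i.e. just:

        /* 0644 makes the parameter writable via
         * /sys/module/xen_netback/parameters/max_skb_frags, so root
         * can raise the limit without reloading the module. */
        module_param(max_skb_frags, uint, 0644);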

> @@ -968,42 +979,98 @@ static struct gnttab_copy *xen_netbk_get_requests(struct xen_netbk *netbk,
>       struct skb_shared_info *shinfo = skb_shinfo(skb);
>       skb_frag_t *frags = shinfo->frags;
>       u16 pending_idx = *((u16 *)skb->data);
> -     int i, start;
> +     u16 head_idx = 0;
> +     int i, j, start;
> +     struct page *page;
> +     pending_ring_idx_t index;
> +     uint16_t dst_offset;
> +     int total_frags_merged;
> +     unsigned int nr_frags = shinfo->nr_frags; /* This can be as large as max_skb_frags */
> +     struct pending_tx_info *first = NULL;
> +     int nr_txp;
>  
>       /* Skip first skb fragment if it is on same page as header fragment. */
>       start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
>  
> -     for (i = start; i < shinfo->nr_frags; i++, txp++) {
> -             struct page *page;
> -             pending_ring_idx_t index;
> +     /* Coalesce */
> +     total_frags_merged = 0;
> +     for (i = j = start; i < MAX_SKB_FRAGS && j < nr_frags; i++) {

I can't see any code which causes us to drop the frag if we end up with
i >= MAX_SKB_FRAGS. Is it just too subtle for me?
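
I'd have expected something after the coalescing loop along these lines
(untested sketch):

        /* Ran out of frag slots before consuming every ring slot:
         * this skb cannot be represented, so unwind and error it. */
        if (j < nr_frags) {
                netdev_err(vif->dev, "skb needs too many slots\n");
                goto err;
        }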

I wonder if this code might all be a bit clearer if:
        s/nr_frags/nr_slots/
        s/j/slots/
        s/i/frags/
? Or possibly replace i with shinfo->nr_frags (if that works).

I think s/frags/slots/ in max_skb_frags too BTW. Keeping the concept of
skb frags separate from the number of ring slots required to represent
those frags is not only less confusing here but also helpful when you
consider higher order frags.
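
i.e. roughly (untested, with the loop body otherwise unchanged and j++
becoming slots++):

        unsigned int slots;

        /* "slots" counts ring slots consumed; shinfo->nr_frags counts
         * coalesced frags produced.  The two advance independently,
         * and nr_frags ends up correct without any adjustment. */
        for (shinfo->nr_frags = start, slots = start;
             shinfo->nr_frags < MAX_SKB_FRAGS && slots < nr_slots;
             shinfo->nr_frags++) {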

>               struct pending_tx_info *pending_tx_info =
>                       netbk->pending_tx_info;
>  
> -             index = pending_index(netbk->pending_cons++);
> -             pending_idx = netbk->pending_ring[index];
> -             page = xen_netbk_alloc_page(netbk, pending_idx);
> +             page = alloc_page(GFP_KERNEL|__GFP_COLD);

Possibly for future enhancement we could use netdev_alloc_frag() and
take advantage of higher order pages here?
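
i.e. eventually something like this (future work, untested, and it
glosses over the fact that the mmap_pages/grant-copy tracking currently
assumes whole pages):

        /* Carve the destination buffer out of the per-cpu frag cache,
         * which is backed by higher order pages where possible. */
        void *buf = netdev_alloc_frag(PAGE_SIZE);
        if (!buf)
                goto err;
        page = virt_to_head_page(buf);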

>               if (!page)
>                       goto err;
>  
> -             gop->source.u.ref = txp->gref;
> -             gop->source.domid = vif->domid;
> -             gop->source.offset = txp->offset;
> -
> -             gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> -             gop->dest.domid = DOMID_SELF;
> -             gop->dest.offset = txp->offset;
> -
> -             gop->len = txp->size;
> -             gop->flags = GNTCOPY_source_gref;
> +             nr_txp = 0;
> +             dst_offset = 0;
> +             first = NULL;
> +             while (dst_offset < PAGE_SIZE && j < nr_frags) {

I think splitting the handling of a single request slot into a function
would be useful for clarity. e.g. xen_netbk_handle_one_request or
something.
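
e.g. with a signature along these lines (just a sketch of the shape,
not a worked refactor):

        /* Emit the grant copy op(s) for a single tx request slot and
         * do the pending ring accounting for it.  Returns the number
         * of bytes consumed from the destination page. */
        static u16 xen_netbk_handle_one_request(struct xen_netbk *netbk,
                                                struct xenvif *vif,
                                                struct xen_netif_tx_request *txp,
                                                struct gnttab_copy *gop,
                                                struct page *page,
                                                u16 dst_offset);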

> +                     gop->flags = GNTCOPY_source_gref;
> +
> +                     gop->source.u.ref = txp->gref;
> +                     gop->source.domid = vif->domid;
> +                     gop->source.offset = txp->offset;
> +
> +                     gop->dest.domid = DOMID_SELF;
> +
> +                     gop->dest.offset = dst_offset;
> +                     gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> +
> +                     if (dst_offset + txp->size > PAGE_SIZE) {
> +                             /* This page can only merge a portion of txp */
> +                             gop->len = PAGE_SIZE - dst_offset;
> +                             txp->offset += gop->len;
> +                             txp->size -= gop->len;
> +                             dst_offset += gop->len; /* == PAGE_SIZE, will quit loop */

In this case we don't touch pending_cons etc? Is that because when we
hit this case we always go around again and will eventually hit the else
case below ensuring we do the pending_cons stuff exactly once for the
tail end of the buffer? I think that is OK but it is worthy of a
comment.
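
Something like:

                        /*
                         * No pending_cons or response accounting in
                         * this branch: the remainder of this txp will
                         * come around the loop again with a fresh
                         * page and eventually hit the else branch
                         * below, which does that accounting exactly
                         * once per slot.
                         */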

What happens if we hit MAX_SKB_FRAGS right as we hit this case? Might we
fail to respond to the final request which has been split in this way?

> +                     } else {
> +                             /* This txp can be merged in the page */
> +                             gop->len = txp->size;
> +                             dst_offset += gop->len;
> +
> +                             index = pending_index(netbk->pending_cons++);
> +
> +                             pending_idx = netbk->pending_ring[index];
> +
> +                             memcpy(&pending_tx_info[pending_idx].req, txp, sizeof(*txp));
> +                             xenvif_get(vif);
> +
> +                             pending_tx_info[pending_idx].vif = vif;
> +
> +                             /*
> +                              * Poison these fields, corrsponding

"corresponding"

> +                              * fields for head txp will be set to
> +                              * correct values after the loop.
> +                              */
> +                             pending_tx_info[pending_idx].nr_tx_req = (u16)(~(u16)0);
> +                             netbk->mmap_pages[pending_idx] = (void *)(~0UL);
> +                             pending_tx_info[pending_idx].start_idx = -1;
> +
> +                             if (unlikely(!first)) {
> +                                     first = &pending_tx_info[pending_idx];
> +                                     first->start_idx = index;
> +                                     head_idx = pending_idx;
> +                             }
> +
> +                             txp++;
> +                             nr_txp++;
> +                             j++;
> +                     }
>  
> -             gop++;
> +                     gop++;
> +             }
>  
> -             memcpy(&pending_tx_info[pending_idx].req, txp, sizeof(*txp));
> -             xenvif_get(vif);
> -             pending_tx_info[pending_idx].vif = vif;
> -             frag_set_pending_idx(&frags[i], pending_idx);
> +             first->req.offset = 0;
> +             first->req.size = dst_offset;
> +             first->nr_tx_req = nr_txp;
> +             total_frags_merged += nr_txp - 1;
> +             set_page_ext(page, netbk, head_idx);
> +             netbk->mmap_pages[head_idx] = page;
> +             frag_set_pending_idx(&frags[i], head_idx);
>       }
>  
> +     shinfo->nr_frags -= total_frags_merged;

This adjustment seems a bit odd; why not just count them correctly in
the first place?
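
e.g. if i really does end up as the frag count then at the end of the
loop (untested):

        /* The outer loop counter already counts coalesced frags (plus
         * any header-page frag below "start"), so record it directly
         * instead of subtracting a merge count afterwards. */
        shinfo->nr_frags = i;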

> +
>       return gop;
>  err:
>       /* Unwind, freeing all pages and sending error responses. */
> @@ -1025,6 +1092,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
>       struct gnttab_copy *gop = *gopp;
>       u16 pending_idx = *((u16 *)skb->data);
>       struct skb_shared_info *shinfo = skb_shinfo(skb);
> +     struct pending_tx_info *tx_info;
>       int nr_frags = shinfo->nr_frags;
>       int i, err, start;
>  
> @@ -1037,12 +1105,14 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
>       start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
>  
>       for (i = start; i < nr_frags; i++) {
> -             int j, newerr;
> +             int j, newerr = 0, n;
>  
>               pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
> +             tx_info = &netbk->pending_tx_info[pending_idx];
>  
>               /* Check error status: if okay then remember grant handle. */
> -             newerr = (++gop)->status;
> +             for (n = 0; n < tx_info->nr_tx_req; n++)
> +                     newerr |= (++gop)->status;

I don't think the GNTST_* values are OR-able in this way... I think for
each request the response should contain the status of the first failed
grant op relating to that slot, or success.
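
i.e. something like (untested):

                /* GNTST_* are enumerated error codes, not flags, so
                 * record the first failure rather than OR-ing. */
                for (n = 0; n < tx_info->nr_tx_req; n++) {
                        int st = (++gop)->status;
                        if (st != GNTST_okay && newerr == GNTST_okay)
                                newerr = st;
                }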

>               if (likely(!newerr)) {
>                       /* Had a previous error? Invalidate this fragment. */
>                       if (unlikely(err))
> @@ -1267,11 +1337,11 @@ static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
>       struct sk_buff *skb;
>       int ret;
>  
> -     while (((nr_pending_reqs(netbk) + MAX_SKB_FRAGS) < MAX_PENDING_REQS) &&
> +     while (((nr_pending_reqs(netbk) + max_skb_frags) < MAX_PENDING_REQS) &&
>               !list_empty(&netbk->net_schedule_list)) {
>               struct xenvif *vif;
>               struct xen_netif_tx_request txreq;
> -             struct xen_netif_tx_request txfrags[MAX_SKB_FRAGS];
> +             struct xen_netif_tx_request txfrags[max_skb_frags];

I guess this stack variable puts an implicit limit on max_skb_frags. It
also argues against being able to change it dynamically as I suggested
above (imagine a race between setting up the stack array here and
checking against max_skb_frags later).
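
If it were made writable you'd want to snapshot the value once per
iteration, e.g. (sketch):

                /* Snapshot the parameter so the array size and the
                 * later slot-count check cannot disagree if root
                 * changes it concurrently. */
                unsigned int max_slots = ACCESS_ONCE(max_skb_frags);
                struct xen_netif_tx_request txfrags[max_slots];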

> @@ -1548,13 +1626,27 @@ static void xen_netbk_idx_release(struct xen_netbk *netbk, u16 pending_idx,
>       pending_tx_info = &netbk->pending_tx_info[pending_idx];
>  
>       vif = pending_tx_info->vif;
> +     nr = pending_tx_info->nr_tx_req;
> +     start_idx = pending_tx_info->start_idx;
>  
> -     make_tx_response(vif, &pending_tx_info->req, status);
> +     BUG_ON(nr == (u16)(~(u16)0));

Is one of those casts not redundant?
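
The inner one looks redundant to me: the operand of ~ is promoted to
int anyway, so:

        /* (u16)~0 == (u16)(~(u16)0) == 0xffff: the inner cast is
         * undone by integer promotion before ~ is applied. */
        BUG_ON(nr == (u16)~0);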

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

