
Re: [Xen-devel] [PATCH RFC] xen-netback: coalesce frags before copying



On Fri, 2013-03-15 at 10:42 +0000, Ian Campbell wrote:
> On Thu, 2013-03-14 at 18:26 +0000, Wei Liu wrote:
> > This patch tries to coalesce txps when constructing grant copy structures.
> > It enables netback to deal with the situation where the frontend's
> > MAX_SKB_FRAGS is larger than the backend's MAX_SKB_FRAGS.
> > 
> > It also defines max_skb_frags, an estimate of the maximum number of frags
> > a guest can send; anything bigger than that is considered malicious. It is
> > currently set to 20, which should be enough to accommodate Linux (16 to 19)
> > and possibly Windows (19?).
> > 
> > Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx>
> > ---
> >  drivers/net/xen-netback/netback.c |  164 +++++++++++++++++++++++++++++--------
> >  1 file changed, 128 insertions(+), 36 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> > index 6e8e51a..e27f81a 100644
> > --- a/drivers/net/xen-netback/netback.c
> > +++ b/drivers/net/xen-netback/netback.c
> > @@ -47,9 +47,20 @@
> >  #include <asm/xen/hypercall.h>
> >  #include <asm/xen/page.h>
> >  
> > +/*
> > + * This is an estimation of the maximum possible frags a SKB might
> > + * have, anything larger than this is considered malicious. Typically
> > + * Linux has 16 to 19, Windows has 19.
> 
> Perhaps rather than this static (although configurable) limit we could
> avoid malicious guests by aborting only if we see a slot with invalid
> contents? I think the XSA-39 issue doesn't exist so long as each request
> results in a response (which a valid slot does).
> 
> Although I'm also sympathetic to the idea that this approach is simpler
> and therefore expedient in the short term and we can consider the more
> complicated approach above for the future.
> 
> > + */
> > +#define NETBK_MAX_SKB_FRAGS_DEFAULT 20
> > +static unsigned int max_skb_frags = NETBK_MAX_SKB_FRAGS_DEFAULT;
> > +module_param(max_skb_frags, uint, 0444);
> 
> You could at least make this 0644 so root can up the limit dynamically?
> 

This was done on purpose. Changing this value on the fly can cause a race
condition. If we want to allow that, we need to write more code to make
sure all worker threads are in a quiescent state first, which adds a layer
of complexity.

> > @@ -968,42 +979,98 @@ static struct gnttab_copy *xen_netbk_get_requests(struct xen_netbk *netbk,
> >     struct skb_shared_info *shinfo = skb_shinfo(skb);
> >     skb_frag_t *frags = shinfo->frags;
> >     u16 pending_idx = *((u16 *)skb->data);
> > -   int i, start;
> > +   u16 head_idx = 0;
> > +   int i, j, start;
> > +   struct page *page;
> > +   pending_ring_idx_t index;
> > +   uint16_t dst_offset;
> > +   int total_frags_merged;
> > +   unsigned int nr_frags = shinfo->nr_frags; /* This can be as large as max_skb_frags */
> > +   struct pending_tx_info *first = NULL;
> > +   int nr_txp;
> >  
> >     /* Skip first skb fragment if it is on same page as header fragment. */
> >     start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
> >  
> > -   for (i = start; i < shinfo->nr_frags; i++, txp++) {
> > -           struct page *page;
> > -           pending_ring_idx_t index;
> > +   /* Coalesce */
> > +   total_frags_merged = 0;
> > +   for (i = j = start; i < MAX_SKB_FRAGS && j < nr_frags; i++) {
> 
> I can't see any code which causes us to drop the frag if we end up with
> i >= MAX_SKB_FRAGS? Is it just too subtle for me?
> 

The basic assumption is that it is not possible to reach i >= MAX_SKB_FRAGS,
since the maximum packet size the backend supports is 64K and such a packet
can always be accommodated. Any packet larger than 64K causes a fatal error
early on.

I could remove the "i < MAX_SKB_FRAGS" check if it causes confusion; it is
just a redundant check.
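
For reference, the arithmetic behind that claim (illustration only, not part
of the patch): packets larger than 64K are rejected before we get here, and
each coalesced frag holds up to PAGE_SIZE bytes, so with 4K pages we fill at
most 65536 / 4096 = 16 frags, which is <= MAX_SKB_FRAGS. If people want it
spelled out, a compile-time check inside the function could look like:

        /*
         * Illustration only: the largest supported packet (64K) always
         * fits in MAX_SKB_FRAGS coalesced PAGE_SIZE-sized frags.
         */
        BUILD_BUG_ON(65536 / PAGE_SIZE > MAX_SKB_FRAGS);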

> I wonder if this code might all be a bit clearer if:
>       s/nr_frags/nr_slots/
>       s/j/slots/
>       s/i/frags/
> ? Or possibly replace i with shinfo->nr_frags (if that works).
> 
> I think s/frags/slots/ in max_skb_frags too BTW. Keeping the concept of
> skb frags separate from the number of ring slots which it requires to
> represent those frags is less confusing here but also helpful when you
> consider higher order frags.
> 

Sure.
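
Roughly like this, then (just a sketch of the renaming, untested):

        unsigned int nr_slots = shinfo->nr_frags; /* slots sent by the frontend */

        /* 'slots' walks the frontend ring slots, 'frags' counts the
         * coalesced frags we actually build in the backend skb. */
        for (frags = slots = start; frags < MAX_SKB_FRAGS && slots < nr_slots; frags++) {
                ...
                while (dst_offset < PAGE_SIZE && slots < nr_slots) {
                        /* copy one slot (or part of it) into this page */
                        ...
                }
                frag_set_pending_idx(&shinfo->frags[frags], head_idx);
        }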

> >             struct pending_tx_info *pending_tx_info =
> >                     netbk->pending_tx_info;
> >  
> > -           index = pending_index(netbk->pending_cons++);
> > -           pending_idx = netbk->pending_ring[index];
> > -           page = xen_netbk_alloc_page(netbk, pending_idx);
> > +           page = alloc_page(GFP_KERNEL|__GFP_COLD);
> 
> Possibly for future enhancement we could use netdev_alloc_frag() and
> take advantage of higher order pages here?
> 

Sure, but I think that deserves a separate patch. This patch is more of
a fix than an improvement.

> >             if (!page)
> >                     goto err;
> >  
> > -           gop->source.u.ref = txp->gref;
> > -           gop->source.domid = vif->domid;
> > -           gop->source.offset = txp->offset;
> > -
> > -           gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> > -           gop->dest.domid = DOMID_SELF;
> > -           gop->dest.offset = txp->offset;
> > -
> > -           gop->len = txp->size;
> > -           gop->flags = GNTCOPY_source_gref;
> > +           nr_txp = 0;
> > +           dst_offset = 0;
> > +           first = NULL;
> > +           while (dst_offset < PAGE_SIZE && j < nr_frags) {
> 
> I think splitting the handling of a single request slot into a function
> would be useful for clarity. e.g. xen_netbk_handle_one_request or
> something.
> 
> > +                   gop->flags = GNTCOPY_source_gref;
> > +
> > +                   gop->source.u.ref = txp->gref;
> > +                   gop->source.domid = vif->domid;
> > +                   gop->source.offset = txp->offset;
> > +
> > +                   gop->dest.domid = DOMID_SELF;
> > +
> > +                   gop->dest.offset = dst_offset;
> > +                   gop->dest.u.gmfn = virt_to_mfn(page_address(page));
> > +
> > +                   if (dst_offset + txp->size > PAGE_SIZE) {
> > +                           /* This page can only merge a portion of txp */
> > +                           gop->len = PAGE_SIZE - dst_offset;
> > +                           txp->offset += gop->len;
> > +                           txp->size -= gop->len;
> > +                           dst_offset += gop->len; /* == PAGE_SIZE, will quit loop */
> 
> In this case we don't touch pending_cons etc? Is that because when we
> hit this case we always go around again and will eventually hit the else
> case below ensuring we do the pending_cons stuff exactly once for the
> tail end of the buffer? I think that is OK but it is worthy of a
> comment.
> 

Yes, we cannot touch pending_cons in this case, because the remainder of
this slot needs to be handled in the next loop iteration, using another gop.
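
I will add a comment there to make it explicit, along these lines (sketch):

        if (dst_offset + txp->size > PAGE_SIZE) {
                /*
                 * This page can only merge a portion of txp.  Consume the
                 * part that fits but leave txp and pending_cons untouched:
                 * the remainder is copied by another gop when the outer
                 * loop moves on to the next page, and the pending_cons
                 * accounting for this slot happens exactly once, in the
                 * else branch, when its tail end finally fits.
                 */
                gop->len = PAGE_SIZE - dst_offset;
                txp->offset += gop->len;
                txp->size -= gop->len;
                dst_offset += gop->len; /* == PAGE_SIZE, will quit loop */
        }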

> What happens if we hit MAX_SKB_FRAGS right as we hit this case? Might we
> fail to respond to the final request which has been split in this way?
> 

As stated above, we won't hit i >= MAX_SKB_FRAGS.

> > +                   } else {
[...]
> > +           }
> >  
> > -           memcpy(&pending_tx_info[pending_idx].req, txp, sizeof(*txp));
> > -           xenvif_get(vif);
> > -           pending_tx_info[pending_idx].vif = vif;
> > -           frag_set_pending_idx(&frags[i], pending_idx);
> > +           first->req.offset = 0;
> > +           first->req.size = dst_offset;
> > +           first->nr_tx_req = nr_txp;
> > +           total_frags_merged += nr_txp - 1;
> > +           set_page_ext(page, netbk, head_idx);
> > +           netbk->mmap_pages[head_idx] = page;
> > +           frag_set_pending_idx(&frags[i], head_idx);
> >     }
> >  
> > +   shinfo->nr_frags -= total_frags_merged;
> 
> This adjustment seems a bit odd, why not just count them correctly in
> the first place?
> 

Sure.
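
I.e. drop total_frags_merged and set the count directly from the number of
coalesced frags filled, something like (sketch):

        /* 'i' is the number of frags actually filled by the loop above. */
        shinfo->nr_frags = i;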

> > +
> >     return gop;
> >  err:
> >     /* Unwind, freeing all pages and sending error responses. */
> > @@ -1025,6 +1092,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
> >     struct gnttab_copy *gop = *gopp;
> >     u16 pending_idx = *((u16 *)skb->data);
> >     struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +   struct pending_tx_info *tx_info;
> >     int nr_frags = shinfo->nr_frags;
> >     int i, err, start;
> >  
> > @@ -1037,12 +1105,14 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
> >     start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
> >  
> >     for (i = start; i < nr_frags; i++) {
> > -           int j, newerr;
> > +           int j, newerr = 0, n;
> >  
> >             pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
> > +           tx_info = &netbk->pending_tx_info[pending_idx];
> >  
> >             /* Check error status: if okay then remember grant handle. */
> > -           newerr = (++gop)->status;
> > +           for (n = 0; n < tx_info->nr_tx_req; n++)
> > +                   newerr |= (++gop)->status;
> 
> I don't think GNTST_* are or-able in this way... I think for each
> request the response should contain the status of the first failed grant
> op relating to that slot, or success.
> 

Fixed.
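
Something like this in the next version (sketch only): record the status of
the first failed grant op for the slot, while still walking all of its copy
ops so the gop pointer stays in sync:

        newerr = GNTST_okay;
        for (n = 0; n < tx_info->nr_tx_req; n++) {
                int st = (++gop)->status;

                if (st != GNTST_okay && newerr == GNTST_okay)
                        newerr = st;
        }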

> 
> >             if (likely(!newerr)) {
> >                     /* Had a previous error? Invalidate this fragment. */
> >                     if (unlikely(err))
> > @@ -1267,11 +1337,11 @@ static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
> >     struct sk_buff *skb;
> >     int ret;
> >  
> > -   while (((nr_pending_reqs(netbk) + MAX_SKB_FRAGS) < MAX_PENDING_REQS) &&
> > +   while (((nr_pending_reqs(netbk) + max_skb_frags) < MAX_PENDING_REQS) &&
> >             !list_empty(&netbk->net_schedule_list)) {
> >             struct xenvif *vif;
> >             struct xen_netif_tx_request txreq;
> > -           struct xen_netif_tx_request txfrags[MAX_SKB_FRAGS];
> > +           struct xen_netif_tx_request txfrags[max_skb_frags];
> 
> I guess this stack variable puts an implicit limit on max_skb_frags. It
> also argues against being able to change it dynamically as I suggested
> above (imagine a race between setting up the stack here and checking
> against max_skb_frags later)
> 

Yes. See above explanation.
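
To make the race concrete (hypothetical sketch only, not something I plan to
add): with 0644 the value could change between sizing the on-stack array and
the later bound checks, so a writable parameter would at least need a single
snapshot per packet, e.g.:

        /* Take one snapshot so the VLA size and any later bound check
         * cannot disagree; 'slots_limit' is an illustrative name. */
        unsigned int slots_limit = ACCESS_ONCE(max_skb_frags);
        struct xen_netif_tx_request txfrags[slots_limit];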


Wei.

