
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



On Mon, Mar 17, 2014 at 11:33:18PM +0100, Sander Eikelenboom wrote:
> 
> Monday, March 17, 2014, 11:35:24 AM, you wrote:
> 
> > On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote:
> >> 
> >> Wednesday, March 12, 2014, 4:45:01 PM, you wrote:
> >> 
> >> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote:
> >> > [...]
> >> >> 
> >> >> > Sorry, remove the trailing "S". Actually you only need to look at 
> >> >> > netback.c.
> >> >> 
> >> >> What producer index to compare with .. there are quite a few
> >> >> RING_GET_REQUEST calls .. and I see:
> >> >> npo->meta_prod
> >> >> vif->rx.sring->req_prod
> >> >> vif->pending_prod
> >> >> 
> >> >> to name a few ..
> >> >> Any particular RING_GET_REQUEST call and particular producer index you
> >> >> are interested in?
> >> >> 
> >> 
> >> > There are two macros you can use
> >> 
> >> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW.
> >> 
> >> Ah, I already produced my own; the diff to netback is attached ..
> >> 
> >> Netback:
> >> Mar 12 17:41:26 serveerstertje kernel: [  464.778614] vif vif-7-0 vif7.0: 
> >> ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 
> >> vif->rx.req_cons:431007
> >> Mar 12 17:41:26 serveerstertje kernel: [  464.786203] vif vif-7-0 vif7.0: 
> >> ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 
> >> vif->rx.req_cons:431008
> 
> > req_prod < req_cons, so there's an overflow. I'm actually curious how
> > this could happen.
> 
> > Back to the code: before netback enqueues an SKB to its internal queue, it
> > will check whether there's enough room in the ring. Before Paul's changeset,
> > it checked against a static number (the maximum possible slots that can
> > be consumed by an SKB). Paul's changeset made it check against the
> > actual slots the incoming SKB consumes. See
> > interface.c:xenvif_start_xmit.  Another interesting site is where
> > the SKB is broken down later on in the internal queue. See
> > netback.c:xenvif_rx_action. The routine that breaks an SKB down is
> > xenvif_gop_skb.
> 
> > Although they look alright to me, you might want to instrument them
> > a bit more to see what triggers that overflow. It's a bit frustrating,
> > but a bug that cannot be easily reproduced is indeed extremely hard to
> > fix.
> 
> [  554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
> npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751
> [  571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823
> [  571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
> req->gref:19791875 req->id:22
> [  571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
> npo->copy_gref:19791875
> [  571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 
> vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 
> req->id:14
> [  571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
> gso_type:0 gso_size:0 nr_frags:0
> [  571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
> gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
> [  571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
> npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 
> vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 
> req->id:23
> [  571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7.
> [  571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? status:-1 
> i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38
> [  571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
> meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 
> skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 nr_frags:1 
> npo.copy_prod:39, npo.meta_cons:30
> 
> It seems that in 'get_next_rx_buffer' you can get into the situation where
> cons > prod, but the first time, somehow, the code after the actual
> RING_GET_REQUEST isn't run (why?!? a queue stop?) .. and it doesn't lead to
> an immediate problem ..

Your output is rate-limited, so a missing message doesn't mean the code
didn't run. get_next_rx_buffer won't sleep whatsoever, and stopping the
queue won't affect a running RX kthread; you can think of start_xmit
and the kthread as running in parallel.

get_next_rx_buffer doesn't check whether the ring has overflowed; that
check is the caller's responsibility. It might be worth checking
start_new_rx_buffer as well.
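
Something along these lines at the top of get_next_rx_buffer is the
kind of check I mean. This is a hypothetical sketch from memory of the
3.14 netback.c, not a tested patch; RING_REQUEST_CONS_OVERFLOW from
ring.h is the more general form of the same test:

static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
		struct netrx_pending_operations *npo)
{
	struct xenvif_rx_meta *meta;
	struct xen_netif_rx_request *req;

	/* The caller should have guaranteed room; shout if the
	 * consumer has caught up with the producer, which is the
	 * condition visible in your logs above.
	 */
	if ((s32)(vif->rx.req_cons - vif->rx.sring->req_prod) >= 0)
		netdev_warn(vif->dev, "?!? req_cons:%u >= req_prod:%u\n",
			    vif->rx.req_cons, vif->rx.sring->req_prod);

	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);

	meta = npo->meta + npo->meta_prod++;
	meta->gso_type = XEN_NETIF_GSO_TYPE_NONE;
	meta->gso_size = 0;
	meta->size = 0;
	meta->id = req->id;

	npo->copy_off = 0;
	npo->copy_gref = req->gref;

	return meta;
}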

> The second time it does get to the code after the RING_GET_REQUEST in 
> 'get_next_rx_buffer' and that leads to mayhem ...
> 
> So I added a netdev_warn to xenvif_start_xmit for the "stop queue" case ..
> unfortunately it now triggers net_ratelimit at the end:
> 
> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222

I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
req_cons < needed, the queue is stopped.
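
For reference, that routine is roughly the following; this is a
from-memory sketch of the 3.14 interface.c, so treat the details as
approximate rather than as the exact source:

bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
{
	RING_IDX prod, cons;

	do {
		prod = vif->rx.sring->req_prod;
		cons = vif->rx.req_cons;

		/* Enough outstanding requests to take the whole SKB? */
		if (prod - cons >= needed)
			return true;

		/* Not enough: ask the frontend to send an event as
		 * soon as it produces more requests, then re-check in
		 * case that raced with new requests coming in.
		 */
		vif->rx.sring->req_event = prod + 1;
		mb();
	} while (vif->rx.sring->req_prod != prod);

	return false;
}

xenvif_start_xmit computes min_slots_needed for the SKB (one slot for
the header, one per frag, plus one more if the SKB is GSO) and calls
xenvif_stop_queue when this returns false; that is exactly the path
your netdev_warn above is reporting.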

Wei.
