
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



On Mon, Mar 17, 2014 at 11:33:18PM +0100, Sander Eikelenboom wrote:
> 
> Monday, March 17, 2014, 11:35:24 AM, you wrote:
> 
> > On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote:
> >> 
> >> Wednesday, March 12, 2014, 4:45:01 PM, you wrote:
> >> 
> >> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote:
> >> > [...]
> >> >> 
> >> >> > Sorry, remove the trailing "S". Actually you only need to look at 
> >> >> > netback.c.
> >> >> 
> >> >> What producer index to compare with .. there are quite a few
> >> >> RING_GET_REQUEST calls .. and I see:
> >> >> npo->meta_prod
> >> >> vif->rx.sring->req_prod
> >> >> vif->pending_prod
> >> >> 
> >> >> to name a few ..
> >> >> Any particular RING_GET_REQUEST call and particular producer index you
> >> >> are interested in?
> >> >> 
> >> 
> >> > There are two macros you can use
> >> 
> >> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW.
> >> 
> >> Ah, I already produced my own; the diff to netback is attached ..
> >> 
> >> Netback:
> >> Mar 12 17:41:26 serveerstertje kernel: [  464.778614] vif vif-7-0 vif7.0: 
> >> ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 
> >> vif->rx.req_cons:431007
> >> Mar 12 17:41:26 serveerstertje kernel: [  464.786203] vif vif-7-0 vif7.0: 
> >> ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 
> >> vif->rx.req_cons:431008
> 
> > req_prod < req_cons, so there's an overflow. I'm actually curious how
> > this could happen.
> 
> > Back to the code: before netback enqueues an SKB to its internal queue, it
> > will check whether there's enough room in the ring. Before Paul's changeset,
> > it checked against a static number (the maximum possible slots that can
> > be consumed by an SKB). Paul's changeset made it check against the
> > actual slots the incoming SKB consumes. See
> > interface.c:xenvif_start_xmit.  Another interesting site is where
> > the SKB is broken down later on in the internal queue. See
> > netback.c:xenvif_rx_action. The routine that breaks an SKB down is
> > xenvif_gop_skb.
> 
> > Although they look alright to me, you might want to instrument them
> > a bit more to see what triggers that overflow. It's a bit frustrating,
> > but a bug that cannot be easily reproduced is indeed extremely hard to
> > fix.
> 
> [  554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
> npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751
> [  571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823
> [  571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
> req->gref:19791875 req->id:22
> [  571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
> npo->copy_gref:19791875
> [  571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 
> vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 
> req->id:14
> [  571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
> gso_type:0 gso_size:0 nr_frags:0
> [  571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
> gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
> [  571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
> npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 
> vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 
> req->id:23
> [  571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7.
> [  571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? status:-1 
> i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38
> [  571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
> meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 
> skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 nr_frags:1 
> npo.copy_prod:39, npo.meta_cons:30
> 
> It seems that in 'get_next_rx_buffer' you can get into the situation where
> cons > prod, but the first time, somehow, the code after the actual
> RING_GET_REQUEST isn't run (why?!? a queue stop?) .. and it doesn't lead to
> an immediate problem ..

Your output is rate-limited, so a missing message doesn't mean the code
didn't run. get_next_rx_buffer won't sleep whatsoever, and stopping the
queue won't affect a running RX kthread; you can think of start_xmit
and the kthread as running in parallel.

get_next_rx_buffer doesn't check whether the ring has overflowed; that
check is the caller's responsibility. It might be worth checking
start_new_rx_buffer as well.
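
Something along these lines at the top of get_next_rx_buffer is the
kind of check I mean. This is a hypothetical sketch from memory of the
3.14 netback.c, not a tested patch; RING_REQUEST_CONS_OVERFLOW from
ring.h is the more general form of the same test:

static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
		struct netrx_pending_operations *npo)
{
	struct xenvif_rx_meta *meta;
	struct xen_netif_rx_request *req;

	/* The caller should have guaranteed room; shout if the
	 * consumer has caught up with the producer, which is the
	 * condition visible in your logs above.
	 */
	if ((s32)(vif->rx.req_cons - vif->rx.sring->req_prod) >= 0)
		netdev_warn(vif->dev, "?!? req_cons:%u >= req_prod:%u\n",
			    vif->rx.req_cons, vif->rx.sring->req_prod);

	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);

	meta = npo->meta + npo->meta_prod++;
	meta->gso_type = XEN_NETIF_GSO_TYPE_NONE;
	meta->gso_size = 0;
	meta->size = 0;
	meta->id = req->id;

	npo->copy_off = 0;
	npo->copy_gref = req->gref;

	return meta;
}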

> The second time it does get to the code after the RING_GET_REQUEST in 
> 'get_next_rx_buffer' and that leads to mayhem ...
> 
> So I added a netdev_warn to xenvif_start_xmit for the "stop queue" case ..
> unfortunately it now triggers net_ratelimit at the end:
> 
> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222

I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
req_cons < needed, the queue is stopped.
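
For reference, that routine is roughly the following; this is a
from-memory sketch of the 3.14 interface.c, so treat the details as
approximate rather than as the exact source:

bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
{
	RING_IDX prod, cons;

	do {
		prod = vif->rx.sring->req_prod;
		cons = vif->rx.req_cons;

		/* Enough outstanding requests to take the whole SKB? */
		if (prod - cons >= needed)
			return true;

		/* Not enough: ask the frontend to send an event as
		 * soon as it produces more requests, then re-check in
		 * case that raced with new requests coming in.
		 */
		vif->rx.sring->req_event = prod + 1;
		mb();
	} while (vif->rx.sring->req_prod != prod);

	return false;
}

xenvif_start_xmit computes min_slots_needed for the SKB (one slot for
the header, one per frag, plus one more if the SKB is GSO) and calls
xenvif_stop_queue when this returns false; that is exactly the path
your netdev_warn above is reporting.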

Wei.
