
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Tuesday, March 18, 2014, 1:04:01 PM, you wrote:

> On Mon, Mar 17, 2014 at 11:33:18PM +0100, Sander Eikelenboom wrote:
>> 
>> Monday, March 17, 2014, 11:35:24 AM, you wrote:
>> 
>> > On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote:
>> >> 
>> >> Wednesday, March 12, 2014, 4:45:01 PM, you wrote:
>> >> 
>> >> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote:
>> >> > [...]
>> >> >> 
>> >> >> > Sorry, remove the trailing "S". Actually you only need to look at 
>> >> >> > netback.c.
>> >> >> 
>> >> >> Which producer index to compare with .. there are quite a few 
>> >> >> RING_GET_REQUEST calls .. and I see:
>> >> >> npo->meta_prod
>> >> >> vif->rx.sring->req_prod
>> >> >> vif->pending_prod
>> >> >> 
>> >> >> to name a few ..
>> >> >> Any particular RING_GET_REQUEST call and particular producer index 
>> >> >> you are interested in?
>> >> >> 
>> >> 
>> >> > There are two macros you can use:
>> >> 
>> >> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW.
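
For reference, those macros live in xen/interface/io/ring.h; from memory 
their definitions are roughly:

/* Has the frontend produced more requests than can fit in the ring? */
#define RING_REQUEST_PROD_OVERFLOW(_r, _prod) \
	(((_prod) - (_r)->rsp_prod_pvt) > RING_SIZE(_r))

/* Has the consumer index run a whole ring past the responses produced? */
#define RING_REQUEST_CONS_OVERFLOW(_r, _idx) \
	(((_idx) - (_r)->rsp_prod_pvt) >= RING_SIZE(_r))

Both flag a free-running index that has moved more than a ring's worth 
ahead of the private response producer rsp_prod_pvt.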
>> >> 
>> >> Ah, I already produced my own .. the diff to netback is attached ..
>> >> 
>> >> Netback:
>> >> Mar 12 17:41:26 serveerstertje kernel: [  464.778614] vif vif-7-0 vif7.0: 
>> >> ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 
>> >> vif->rx.req_cons:431007
>> >> Mar 12 17:41:26 serveerstertje kernel: [  464.786203] vif vif-7-0 vif7.0: 
>> >> ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 
>> >> vif->rx.req_cons:431008
>> 
>> > req_prod < req_cons, so there's an overflow. I'm actually curious how
>> > this could happen.
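
Worth spelling out: RING_IDX is an unsigned, free-running counter, so once 
cons has run past prod the subtraction wraps instead of going negative. A 
minimal illustration, with names as in netback:

RING_IDX prod = vif->rx.sring->req_prod; /* requests the frontend produced */
RING_IDX cons = vif->rx.req_cons;        /* requests the backend consumed  */

/* With cons == prod + 1 (as in the log above: 431006 vs 431007),
 * prod - cons evaluates to 0xffffffff rather than -1, so any
 * "is there room?" test based on prod - cons is defeated.
 */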
>> 
>> > Back to the code, before netback enqueues SKB to its internal queue, it
>> > will check if there's enough room in the ring. Before Paul's changeset,
>> > it checks against a static number (the possible maximum slots that can
>> > be consumed by an SKB). Paul's changeset made it check against the
>> > actual slots the incoming SKB consumes. See
>> > interface.c:xenvif_start_xmit.  Another interesting site would be when
>> > the SKB is broken down later on in the internal queue. See
>> > netback.c:xenvif_rx_action. The routine to break down an SKB is
>> > xenvif_gop_skb.
>> 
>> > Although they look alright to me, you might want to instrument them
>> > a bit more to see what triggers that overflow. It's a bit frustrating,
>> > but a bug that cannot be easily reproduced is indeed extremely hard to
>> > fix.
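
For reference, from memory the check Paul's changeset introduced in 
interface.c:xenvif_start_xmit looks roughly like this (paraphrased and 
trimmed, not verbatim):

static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct xenvif *vif = netdev_priv(dev);
	int min_slots_needed;

	/* At best we need one slot for the header and one per frag ... */
	min_slots_needed = 1 + skb_shinfo(skb)->nr_frags;

	/* ... plus an extra slot for the GSO metadata. */
	if (skb_is_gso(skb))
		min_slots_needed++;

	/* If the skb can't possibly fit in the remaining slots, stop
	 * the queue to give the ring a chance to drain.
	 */
	if (!xenvif_rx_ring_slots_available(vif, min_slots_needed))
		xenvif_stop_queue(vif);

	skb_queue_tail(&vif->rx_queue, skb);
	xenvif_kick_thread(vif);

	return NETDEV_TX_OK;
}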
>> 
>> [  554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751
>> [  571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823
>> [  571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
>> req->gref:19791875 req->id:22
>> [  571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
>> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
>> npo->copy_gref:19791875
>> [  571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
>> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 
>> vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 
>> req->id:14
>> [  571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
>> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
>> gso_type:0 gso_size:0 nr_frags:0
>> [  571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
>> npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
>> gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
>> [  571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
>> npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 
>> vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 
>> req->id:23
>> [  571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7.
>> [  571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? 
>> status:-1 i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38
>> [  571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? 
>> status:-1 meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 
>> skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 
>> nr_frags:1 npo.copy_prod:39, npo.meta_cons:30
>> 
>> It seems that in 'get_next_rx_buffer' you can get the situation that cons > 
>> prod, but the first time, somehow, the code after the actual RING_GET_REQUEST 
>> isn't run (why?!? queue stop ?) .. and it doesn't lead to an immediate problem ..

> You've rate-limited the output. get_next_rx_buffer won't sleep
> whatsoever. Stopping the queue won't affect a running RX kthread. You
> can think of start_xmit and the kthread as running in parallel.

> get_next_rx_buffer doesn't check if the ring has overflowed; that's the
> caller's responsibility. It might be worth checking start_new_rx_buffer.
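
Indeed. For reference, from memory netback.c:get_next_rx_buffer bumps 
req_cons without any such check, roughly (paraphrased, not verbatim):

static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
						 struct netrx_pending_operations *npo)
{
	struct xenvif_rx_meta *meta;
	struct xen_netif_rx_request *req;

	/* Consumes a request slot unconditionally; nothing here
	 * verifies that req_cons is still behind the frontend's
	 * req_prod.
	 */
	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);

	meta = npo->meta + npo->meta_prod++;
	meta->gso_type = XEN_NETIF_GSO_TYPE_NONE;
	meta->gso_size = 0;
	meta->size = 0;
	meta->id = req->id;

	npo->copy_off = 0;
	npo->copy_gref = req->gref;

	return meta;
}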

Added even more warnings ...

[  297.885969] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:21764 vif->rx.req_cons:21762
[  298.760555] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:3 vif->rx.sring->req_prod:22488 vif->rx.req_cons:22486

[  306.376176] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
[  306.376556] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:1 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
[  306.391863] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
req->gref:4325377 req->id:153

[  306.407599] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
npo->copy_gref:4325377  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:640 
size:640 i:4
[  306.423913] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
npo->copy_gref:4325377 npo->copy_off:640  MAX_BUFFER_OFFSET:4096 bytes:640 
size:0 i:5


[  306.440941] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 
npo->meta_prod:31 old_meta_prod:25 vif->rx.sring->req_prod:28313 
vif->rx.req_cons:28314 gso_type:1 gso_size:1448 nr_frags:1 req->gref:638 
req->id:147
[  306.458334] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
gso_type:0 gso_size:0 nr_frags:0
[  306.476097] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154
[  306.494462] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 before 
npo->meta_prod:32 old_meta_prod:31 vif->rx.sring->req_prod:28313 
vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 
req->id:154 j:0
[  306.513424] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here start   
npo->meta_prod:32 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
npo->copy_gref:4325377 npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:0 size:66 
i:0
[  311.390883] net_ratelimit: 317 callbacks suppressed
[  311.400901] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:3 vif->rx.sring->req_prod:32386 vif->rx.req_cons:32322

- So in this case we are in the 3rd iteration of the loop in 
xenvif_gop_frag_copy ...
- xenvif_start_xmit stops the queue, since it has detected it needs one more 
slot which is unavailable at that time.
- The current RX thread, however, doesn't know this and doesn't check (neither 
in the loop in xenvif_gop_frag_copy nor in get_next_rx_buffer) that the ring 
is full .. so while prod == cons now, it consumes another request ..
- That ring request leads to the bad grant reference reported by the 
hypervisor:

(XEN) [2014-03-18 15:02:58.928] grant_table.c:1857:d0v2 Bad grant reference 
4325377

So should there be a check added there ... or should the caller, 
"xenvif_gop_frag_copy", and the caller of that one, "xenvif_gop_skb", already 
have anticipated that what they were about to do wasn't going to fit anyway?
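
Purely as illustration (hypothetical, not a proposed patch), the kind of 
guard I mean in get_next_rx_buffer would be something like:

	/* Refuse to consume a request the frontend has not produced
	 * yet; reading a stale slot here is what hands the bogus
	 * gref to the grant-copy operation.
	 */
	if (vif->rx.req_cons == vif->rx.sring->req_prod) {
		/* ring is empty: bail out / defer this skb */
	}

	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);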

And of course .. how did Paul's change trigger this?


>> The second time it does get to the code after the RING_GET_REQUEST in 
>> 'get_next_rx_buffer' and that leads to mayhem ...
>> 
>> So I added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. 
>> unfortunately it now triggers net_ratelimit at the end:
>> 
>> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222

> I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
> req_cons < needed, the queue is stopped.

> Wei.
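
For completeness, from memory that helper looks roughly like this 
(paraphrased, not verbatim):

bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
{
	RING_IDX prod, cons;

	do {
		prod = vif->rx.sring->req_prod;
		cons = vif->rx.req_cons;

		if (prod - cons >= needed)
			return true;

		/* Ask the frontend for an event when more requests
		 * arrive, then re-check in case some raced in.
		 */
		vif->rx.sring->req_event = prod + 1;
		mb();
	} while (vif->rx.sring->req_prod != prod);

	return false;
}

And it does stop the queue; the trouble is that the RX thread already inside 
xenvif_gop_skb at that moment keeps consuming regardless.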





 

