[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Wednesday, March 12, 2014, 2:42:43 AM, you wrote:


> Tuesday, March 11, 2014, 4:36:16 PM, you wrote:

>> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote:
>> [...]
>>> >> the issue when using 3.13.6 as a base and ..
>>> >>   - pull all 3.14 patches from the  
>>> >> git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree
>>> >>   - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c 
>>> >> xen-netback: improve guest-receive-side flow control"
>>> >>   - applying annie's v2 patch
>>> >>   - applying your patch
>>> >> as dom0 and using a 3.14-rc5 as domU kernel.
>>> >> 
>>> >> Unfortunately i'm still getting the Bad grant references ..
>>> >> 
>>> 
>>> > :-( That's bad news.
>>> 
>>> > I guess you always have the same DomU kernel when testing? That means we
>>> > can narrow down the bug to netback only.
>>> 
>>> Yes my previous tests (from my previous mail):
>>> 
>>> - First testing a baseline that worked o.k. for several days (3.13.6 for 
>>> both dom0 and domU)
>>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok.
>>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed.
>>> - After that took 3.13.6 as base and first applied all the general xen 
>>> related patches for the dom0 kernel, that works ok.
>>> - After that started to apply the netback changes for 3.14 and that failed 
>>> after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: 
>>> improve guest-receive-side flow control".
>>> 
>>> Also seem to indicate just that, although it could also be something in 
>>> this netback commit that triggers a latent bug in netfront, can't rule that 
>>> one out completly.
>>> 
>>> But the trigger is in that commit &&
>>> annie's and your patch seem to have no effect at all( on this issue) &&
>>> later commits in 3.14 do seems to mask it / make it less likely to trigger, 
>>> but do not fix it.
>>> 

>> Unfortunately I've stared at the same piece of code for some time but
>> don't have immediate clue. Later commits don't look suspecious either.

>> I also looked at netfront code, but there's no slot couting change
>> between 3.13 and 3.14.

>> Do you have some straight setup instructions so that I can try to
>> reproduce.

> Ok you asked for it .. so here we go .. ;-) :

> - Xen-unstable
> - DomU:
>      - PV guest
>      - Debian Wheezy
>      - Kernel: 3.14-rc5 (.config see dom0 .config)
>      - Running netserver
> - Dom0:
>      - Debian Wheezy
>      - Kernel: 3.13.6 + additional patches (see attached git log and .config)
>      - 2 physical NIC's
>      - Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen 
> and mem=1536M for dom0 kernel stanzas in grub
>      - vcpu 0 is pinned on pcpu 0 and exclusively for dom0

> - Networking:
>      - Routed bridge
>      - eth0 = internet
>      - eth1 = lan 172.16.x.x
>      - xen_bridge = bridge for VM's 192.168.1.x
>      - iptables NAT and routing  
>      - attached dom0 and domU ifconfig output
>      - attached ethtool -k output for the bridge, vif and guest eth0

> Triggering workload:
>      - Well that's were the problem is :-)
>      - The Guest has a normal disk and swap (phy/lvm) and
>        shared storage with glusterfs (glusterfs server is on dom0)
>      - The Guest exposes this storage via webdavs 

>      What triggers it is:
>      - The Guest runs it's rsync of the shared storage to a remote client on 
> the internet
>        So this causes traffic from Dom0 to domU (reading storage) ..
>        back from domU to dom0 and via iptables NAT on to eth0 (actual rsync)
>        or vice versa when syncing the other way around
>      - At the same time do a backup from a windows machine from the lan to 
> the webdavs server
>        So this causes traffic from eth1 to domU (webdav) .. 
>        back from domU to dom0 (writing to the shared storage) 

> So in essence it is doing quite some netback && netfront stress testing :-)

> It seems to only trigger when doing both the rsync and the webdav 
> simultaneous.

> I tried my best to emulate any of this with netperf (and multiple instances),
> i tried with various (odd) packet sizes and the packet / byte rates 
> transmitted
> are higher then with the workload above ... but it doesn't seem to trigger 
> with netpref

> So i don't think it will be easy to replicate ...

> Perhaps running through the available logging again .. and try to answer some 
> questions ... this is just with one guest running kernels as before only 
> added debugging to netfront and xen (diffs attached):

> Mar 12 02:00:44 backup kernel: [  496.840646] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840665] net eth0: cons:1346005 slots:1 
> rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840680] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840687] net eth0: cons:1346005 slots:2 
> rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840701] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840712] net eth0: cons:1346005 slots:3 
> rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840733] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840740] net eth0: cons:1346005 slots:4 
> rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840757] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840764] net eth0: cons:1346005 slots:5 
> rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840778] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840784] net eth0: cons:1346005 slots:6 
> rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.840801] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.840807] net eth0: cons:1346005 slots:7 
> rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633 
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.841043] net eth0: rx->offset: 0, size: 
> 4294967295

> Mar 12 02:00:44 backup kernel: [  496.841052] net eth0: cons:1346025 slots:1 
> rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295 
> ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1 
> cons_before:1346024 xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.841067] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.841074] net eth0: cons:1346025 slots:2 
> rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295 
> ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 
> cons_before:1346024 xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [  496.841092] net eth0: rx->offset: 0, size: 
> 4294967295
> Mar 12 02:00:44 backup kernel: [  496.841101] net eth0: cons:1346025 slots:3 
> rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295 
> ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 
> cons_before:1346024 xennet_get_extras_err:0


> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 
> source_off:0 source_len:4096 op->source.offset:0 op->len:1168
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
> source_off:0 source_len:4096 op->source.offset:0 op->len:2476
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 
> source_off:0 source_len:4096 op->source.offset:0 op->len:1634
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
> source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 
> source_off:0 source_len:4096 op->source.offset:0 op->len:4096
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
> 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1

Hrrmm sorry seems i was incomplete in copying and pasting the accompanying xl 
dmesg part:

(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 
source_off:0 source_len:4096 op->source.offset:0 op->len:1168
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
source_off:0 source_len:4096 op->source.offset:0 op->len:2476
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489800 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489799 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
194904079 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489798 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
224788480 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489796 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489795 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794 
source_off:0 source_len:4096 op->source.offset:0 op->len:2892
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794 
source_off:0 source_len:4096 op->source.offset:1634 op->len:1204
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489793 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5490944 
source_off:0 source_len:4096 op->source.offset:0 op->len:3268



> - Sometimes (but not always) netfront also spits out:
>   dev_warn(dev, "Invalid extra type: %d\n", extra->type);
>   where the extra type seems a random value (seen 196, 31 ..)
> - Sometimes (but not always) netfront also spits out:
>   dev_warn(dev, "Need more slots\n");
> - Sometimes (but not always) netfront also spits out:
>   dev_warn(dev, "Missing extra info\n");


> First question that comes to my mind:
> - Are the warnings netfront spits out the cause of xen reporting the bad 
> grant reference ?
>   Or
>   Are the Bad grant references Xen is reporting .. causing netfront to spit 
> out the warnings ?
> - Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in 
> xen-netfront.c doing there and changing cons midway ?
>   The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that 
> introduced the if says:

>     One major change from xen.git is that the guest transmit path (i.e. what
>     looks like receive to netback) has been significantly reworked to remove
>     the dependency on the out of tree PageForeign page flag (a core kernel
>     patch which enables a per page destructor callback on the final
>     put_page). This page flag was used in order to implement a grant map
>     based transmit path (where guest pages are mapped directly into SKB
>     frags). Instead this version of netback uses grant copy operations into
>     regular memory belonging to the backend domain. Reinstating the grant
>     map functionality is something which I would like to revisit in the
>     future.

>   It *is* using grant copy now .. so should this part have been removed ?
>   And
>   Could Paul's commit that seems to be the first to trigger these events 
> affect this ?

> --
> Sander



>> Wei.

>>> > Paul, do you have any idea what might go wrong?
>>> 
>>> > Wei.
>>> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.