|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Wednesday, March 12, 2014, 2:42:43 AM, you wrote:
> Tuesday, March 11, 2014, 4:36:16 PM, you wrote:
>> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote:
>> [...]
>>> >> the issue when using 3.13.6 as a base and ..
>>> >> - pull all 3.14 patches from the
>>> >> git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree
>>> >> - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c
>>> >> xen-netback: improve guest-receive-side flow control"
>>> >> - applying annie's v2 patch
>>> >> - applying your patch
>>> >> as dom0 and using a 3.14-rc5 as domU kernel.
>>> >>
>>> >> Unfortunately i'm still getting the Bad grant references ..
>>> >>
>>>
>>> > :-( That's bad news.
>>>
>>> > I guess you always have the same DomU kernel when testing? That means we
>>> > can narrow down the bug to netback only.
>>>
>>> Yes my previous tests (from my previous mail):
>>>
>>> - First testing a baseline that worked o.k. for several days (3.13.6 for
>>> both dom0 and domU)
>>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok.
>>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed.
>>> - After that took 3.13.6 as base and first applied all the general xen
>>> related patches for the dom0 kernel, that works ok.
>>> - After that started to apply the netback changes for 3.14 and that failed
>>> after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback:
>>> improve guest-receive-side flow control".
>>>
>>> Also seem to indicate just that, although it could also be something in
>>> this netback commit that triggers a latent bug in netfront, can't rule that
>>> one out completly.
>>>
>>> But the trigger is in that commit &&
>>> annie's and your patch seem to have no effect at all( on this issue) &&
>>> later commits in 3.14 do seems to mask it / make it less likely to trigger,
>>> but do not fix it.
>>>
>> Unfortunately I've stared at the same piece of code for some time but
>> don't have immediate clue. Later commits don't look suspecious either.
>> I also looked at netfront code, but there's no slot couting change
>> between 3.13 and 3.14.
>> Do you have some straight setup instructions so that I can try to
>> reproduce.
> Ok you asked for it .. so here we go .. ;-) :
> - Xen-unstable
> - DomU:
> - PV guest
> - Debian Wheezy
> - Kernel: 3.14-rc5 (.config see dom0 .config)
> - Running netserver
> - Dom0:
> - Debian Wheezy
> - Kernel: 3.13.6 + additional patches (see attached git log and .config)
> - 2 physical NIC's
> - Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen
> and mem=1536M for dom0 kernel stanzas in grub
> - vcpu 0 is pinned on pcpu 0 and exclusively for dom0
> - Networking:
> - Routed bridge
> - eth0 = internet
> - eth1 = lan 172.16.x.x
> - xen_bridge = bridge for VM's 192.168.1.x
> - iptables NAT and routing
> - attached dom0 and domU ifconfig output
> - attached ethtool -k output for the bridge, vif and guest eth0
> Triggering workload:
> - Well that's were the problem is :-)
> - The Guest has a normal disk and swap (phy/lvm) and
> shared storage with glusterfs (glusterfs server is on dom0)
> - The Guest exposes this storage via webdavs
> What triggers it is:
> - The Guest runs it's rsync of the shared storage to a remote client on
> the internet
> So this causes traffic from Dom0 to domU (reading storage) ..
> back from domU to dom0 and via iptables NAT on to eth0 (actual rsync)
> or vice versa when syncing the other way around
> - At the same time do a backup from a windows machine from the lan to
> the webdavs server
> So this causes traffic from eth1 to domU (webdav) ..
> back from domU to dom0 (writing to the shared storage)
> So in essence it is doing quite some netback && netfront stress testing :-)
> It seems to only trigger when doing both the rsync and the webdav
> simultaneous.
> I tried my best to emulate any of this with netperf (and multiple instances),
> i tried with various (odd) packet sizes and the packet / byte rates
> transmitted
> are higher then with the workload above ... but it doesn't seem to trigger
> with netpref
> So i don't think it will be easy to replicate ...
> Perhaps running through the available logging again .. and try to answer some
> questions ... this is just with one guest running kernels as before only
> added debugging to netfront and xen (diffs attached):
> Mar 12 02:00:44 backup kernel: [ 496.840646] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840665] net eth0: cons:1346005 slots:1
> rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840680] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840687] net eth0: cons:1346005 slots:2
> rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840701] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840712] net eth0: cons:1346005 slots:3
> rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840733] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840740] net eth0: cons:1346005 slots:4
> rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840757] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840764] net eth0: cons:1346005 slots:5
> rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840778] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840784] net eth0: cons:1346005 slots:6
> rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.840801] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.840807] net eth0: cons:1346005 slots:7
> rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633
> pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
> RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
> xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.841043] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.841052] net eth0: cons:1346025 slots:1
> rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295
> ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1
> cons_before:1346024 xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.841067] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.841074] net eth0: cons:1346025 slots:2
> rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295
> ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1
> cons_before:1346024 xennet_get_extras_err:0
> Mar 12 02:00:44 backup kernel: [ 496.841092] net eth0: rx->offset: 0, size:
> 4294967295
> Mar 12 02:00:44 backup kernel: [ 496.841101] net eth0: cons:1346025 slots:3
> rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295
> ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
> gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1
> cons_before:1346024 xennet_get_extras_err:0
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146
> source_off:0 source_len:4096 op->source.offset:0 op->len:1168
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
> source_off:0 source_len:4096 op->source.offset:0 op->len:2476
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282
> source_off:0 source_len:4096 op->source.offset:0 op->len:1634
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
> source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
> (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
> failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609
> source_off:0 source_len:4096 op->source.offset:0 op->len:4096
> (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
> 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
Hrrmm sorry seems i was incomplete in copying and pasting the accompanying xl
dmesg part:
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146
source_off:0 source_len:4096 op->source.offset:0 op->len:1168
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
source_off:0 source_len:4096 op->source.offset:0 op->len:2476
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489800
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489799
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
194904079 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489798
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
224788480 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489796
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489795
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794
source_off:0 source_len:4096 op->source.offset:0 op->len:2892
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794
source_off:0 source_len:4096 op->source.offset:1634 op->len:1204
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489793
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5490944
source_off:0 source_len:4096 op->source.offset:0 op->len:3268
> - Sometimes (but not always) netfront also spits out:
> dev_warn(dev, "Invalid extra type: %d\n", extra->type);
> where the extra type seems a random value (seen 196, 31 ..)
> - Sometimes (but not always) netfront also spits out:
> dev_warn(dev, "Need more slots\n");
> - Sometimes (but not always) netfront also spits out:
> dev_warn(dev, "Missing extra info\n");
> First question that comes to my mind:
> - Are the warnings netfront spits out the cause of xen reporting the bad
> grant reference ?
> Or
> Are the Bad grant references Xen is reporting .. causing netfront to spit
> out the warnings ?
> - Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in
> xen-netfront.c doing there and changing cons midway ?
> The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that
> introduced the if says:
> One major change from xen.git is that the guest transmit path (i.e. what
> looks like receive to netback) has been significantly reworked to remove
> the dependency on the out of tree PageForeign page flag (a core kernel
> patch which enables a per page destructor callback on the final
> put_page). This page flag was used in order to implement a grant map
> based transmit path (where guest pages are mapped directly into SKB
> frags). Instead this version of netback uses grant copy operations into
> regular memory belonging to the backend domain. Reinstating the grant
> map functionality is something which I would like to revisit in the
> future.
> It *is* using grant copy now .. so should this part have been removed ?
> And
> Could Paul's commit that seems to be the first to trigger these events
> affect this ?
> --
> Sander
>> Wei.
>>> > Paul, do you have any idea what might go wrong?
>>>
>>> > Wei.
>>>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |