|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Tuesday, March 11, 2014, 4:36:16 PM, you wrote:
> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote:
> [...]
>> >> the issue when using 3.13.6 as a base and ..
>> >> - pull all 3.14 patches from the
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree
>> >> - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c
>> >> xen-netback: improve guest-receive-side flow control"
>> >> - applying annie's v2 patch
>> >> - applying your patch
>> >> as dom0 and using a 3.14-rc5 as domU kernel.
>> >>
>> >> Unfortunately i'm still getting the Bad grant references ..
>> >>
>>
>> > :-( That's bad news.
>>
>> > I guess you always have the same DomU kernel when testing? That means we
>> > can narrow down the bug to netback only.
>>
>> Yes my previous tests (from my previous mail):
>>
>> - First testing a baseline that worked o.k. for several days (3.13.6 for
>> both dom0 and domU)
>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok.
>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed.
>> - After that took 3.13.6 as base and first applied all the general xen
>> related patches for the dom0 kernel, that works ok.
>> - After that started to apply the netback changes for 3.14 and that failed
>> after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback:
>> improve guest-receive-side flow control".
>>
>> Also seem to indicate just that, although it could also be something in this
>> netback commit that triggers a latent bug in netfront, can't rule that one
>> out completly.
>>
>> But the trigger is in that commit &&
>> annie's and your patch seem to have no effect at all( on this issue) &&
>> later commits in 3.14 do seems to mask it / make it less likely to trigger,
>> but do not fix it.
>>
> Unfortunately I've stared at the same piece of code for some time but
> don't have immediate clue. Later commits don't look suspecious either.
> I also looked at netfront code, but there's no slot couting change
> between 3.13 and 3.14.
> Do you have some straight setup instructions so that I can try to
> reproduce.
Ok you asked for it .. so here we go .. ;-) :
- Xen-unstable
- DomU:
- PV guest
- Debian Wheezy
- Kernel: 3.14-rc5 (.config see dom0 .config)
- Running netserver
- Dom0:
- Debian Wheezy
- Kernel: 3.13.6 + additional patches (see attached git log and .config)
- 2 physical NIC's
- Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen
and mem=1536M for dom0 kernel stanzas in grub
- vcpu 0 is pinned on pcpu 0 and exclusively for dom0
- Networking:
- Routed bridge
- eth0 = internet
- eth1 = lan 172.16.x.x
- xen_bridge = bridge for VM's 192.168.1.x
- iptables NAT and routing
- attached dom0 and domU ifconfig output
- attached ethtool -k output for the bridge, vif and guest eth0
Triggering workload:
- Well that's were the problem is :-)
- The Guest has a normal disk and swap (phy/lvm) and
shared storage with glusterfs (glusterfs server is on dom0)
- The Guest exposes this storage via webdavs
What triggers it is:
- The Guest runs it's rsync of the shared storage to a remote client on
the internet
So this causes traffic from Dom0 to domU (reading storage) ..
back from domU to dom0 and via iptables NAT on to eth0 (actual rsync)
or vice versa when syncing the other way around
- At the same time do a backup from a windows machine from the lan to the
webdavs server
So this causes traffic from eth1 to domU (webdav) ..
back from domU to dom0 (writing to the shared storage)
So in essence it is doing quite some netback && netfront stress testing :-)
It seems to only trigger when doing both the rsync and the webdav simultaneous.
I tried my best to emulate any of this with netperf (and multiple instances),
i tried with various (odd) packet sizes and the packet / byte rates transmitted
are higher then with the workload above ... but it doesn't seem to trigger with
netpref
So i don't think it will be easy to replicate ...
Perhaps running through the available logging again .. and try to answer some
questions ... this is just with one guest running kernels as before only added
debugging to netfront and xen (diffs attached):
Mar 12 02:00:44 backup kernel: [ 496.840646] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840665] net eth0: cons:1346005 slots:1
rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840680] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840687] net eth0: cons:1346005 slots:2
rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840701] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840712] net eth0: cons:1346005 slots:3
rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840733] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840740] net eth0: cons:1346005 slots:4
rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840757] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840764] net eth0: cons:1346005 slots:5
rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840778] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840784] net eth0: cons:1346005 slots:6
rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840801] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.840807] net eth0: cons:1346005 slots:7
rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841043] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.841052] net eth0: cons:1346025 slots:1
rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295
ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1
cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841067] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.841074] net eth0: cons:1346025 slots:2
rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295
ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1
cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841092] net eth0: rx->offset: 0, size:
4294967295
Mar 12 02:00:44 backup kernel: [ 496.841101] net eth0: cons:1346025 slots:3
rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295
ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1
cons_before:1346024 xennet_get_extras_err:0
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146
source_off:0 source_len:4096 op->source.offset:0 op->len:1168
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
source_off:0 source_len:4096 op->source.offset:0 op->len:2476
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610
source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
- Sometimes (but not always) netfront also spits out:
dev_warn(dev, "Invalid extra type: %d\n", extra->type);
where the extra type seems a random value (seen 196, 31 ..)
- Sometimes (but not always) netfront also spits out:
dev_warn(dev, "Need more slots\n");
- Sometimes (but not always) netfront also spits out:
dev_warn(dev, "Missing extra info\n");
First question that comes to my mind:
- Are the warnings netfront spits out the cause of xen reporting the bad grant
reference ?
Or
Are the Bad grant references Xen is reporting .. causing netfront to spit out
the warnings ?
- Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in
xen-netfront.c doing there and changing cons midway ?
The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that
introduced the if says:
One major change from xen.git is that the guest transmit path (i.e. what
looks like receive to netback) has been significantly reworked to remove
the dependency on the out of tree PageForeign page flag (a core kernel
patch which enables a per page destructor callback on the final
put_page). This page flag was used in order to implement a grant map
based transmit path (where guest pages are mapped directly into SKB
frags). Instead this version of netback uses grant copy operations into
regular memory belonging to the backend domain. Reinstating the grant
map functionality is something which I would like to revisit in the
future.
It *is* using grant copy now .. so should this part have been removed ?
And
Could Paul's commit that seems to be the first to trigger these events affect
this ?
--
Sander
> Wei.
>> > Paul, do you have any idea what might go wrong?
>>
>> > Wei.
>>
Attachment:
ethtool-domu-eth0.txt Attachment:
ethtool-vif.txt Attachment:
ethtool-xenbridge.txt Attachment:
grant_table_debug.diff Attachment:
ifconfig-domu.txt Attachment:
kernel-dom0-dotconfig.txt Attachment:
kernel-dom0-git-log1.txt Attachment:
netfront-debug.diff Attachment:
ifconfig-dom0.txt _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |