
[Xen-users] Xen related networking issue



I have a relatively complicated network and xen setup, but I'll start
with the problem, and provide more details below.

From time to time (approximately 1 to 3 times per day), one or more
servers (usually one at a time) will stop communicating with the network
for anything between a few seconds and a minute (usually around 20 to 40
seconds).

I have 8 physical machines (dom0s), each of which runs one domU VM
(except one, which runs two VMs). The VMs are primarily MS Win 2003 R2;
one is MS Win XP Pro SP3 and one is MS Win 2008 R2.

The problem seems to be load related (although generating network
traffic doesn't trigger it); it usually coincides with busy user times
(start of day and end of day).

It seems to be restricted to the MS Win 2003R2 servers (which are
Terminal Servers), and generally the busiest machines as far as
CPU/disk/network, except for the domain controller which would do more
network and disk but doesn't have this issue.

So far, I've replaced all the cables and the switch (4 different
switches, different models, different manufacturers, etc).

I'm using current Debian Stable packages for Xen:
ii  libxen-4.1                      4.1.4-3+deb7u1  amd64  Public libs for Xen
ii  libxenstore3.0                  4.1.4-3+deb7u1  amd64  Xenstore communications library for Xen
ii  xen-hypervisor-4.1-amd64        4.1.4-3+deb7u1  amd64  Xen Hypervisor on AMD64
ii  xen-linux-system-3.2.0-4-amd64  3.2.41-2        amd64  Xen system with Linux 3.2 on 64-bit PCs (meta-package)
ii  xen-linux-system-amd64          3.2+46          amd64  Xen system with Linux for 64-bit PCs (meta-package)
ii  xen-system-amd64                4.1.4-3+deb7u1  amd64  Xen System on AMD64 (meta-package)
ii  xen-utils-4.1                   4.1.4-3+deb7u1  amd64  XEN administrative tools
ii  xen-utils-common                4.1.4-3+deb7u1  all    Xen administrative tools - common files
ii  xenstore-utils                  4.1.4-3+deb7u1  amd64  Xenstore utilities for Xen

I'm using a simple bridge:
ifconfig -a
eth0      Link encap:Ethernet  HWaddr f4:6d:04:ef:e4:d7 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:59742510 errors:0 dropped:0 overruns:0 frame:0
          TX packets:63945509 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17925182347 (16.6 GiB)  TX bytes:28533598982 (26.5 GiB)
          Interrupt:39 Base address:0x6000

eth1      Link encap:Ethernet  HWaddr a0:36:9f:19:25:af 
          inet addr:10.30.16.31  Bcast:10.30.16.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe19:25af/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:10450615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11577187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:31270341044 (29.1 GiB)  TX bytes:15185740522 (14.1 GiB)
          Memory:fe800000-fe900000

eth2      Link encap:Ethernet  HWaddr a0:36:9f:19:25:ae 
          inet addr:10.30.16.41  Bcast:10.30.16.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe19:25ae/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:10413185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11576680 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:31259680024 (29.1 GiB)  TX bytes:15194685404 (14.1 GiB)
          Memory:fea00000-feb00000

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3152 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3152 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:474856 (463.7 KiB)  TX bytes:474856 (463.7 KiB)

vif1.0    Link encap:Ethernet  HWaddr fe:ff:ff:ff:ff:ff 
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:63234371 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58998093 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:27401913081 (25.5 GiB)  TX bytes:17707566787 (16.4 GiB)

xenbr0    Link encap:Ethernet  HWaddr f4:6d:04:ef:e4:d7 
          inet addr:10.10.10.31  Bcast:10.30.15.255  Mask:255.255.240.0
          inet6 addr: fe80::f66d:4ff:feef:e4d7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1021860 errors:0 dropped:31621 overruns:0 frame:0
          TX packets:711501 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:199158280 (189.9 MiB)  TX bytes:246436545 (235.0 MiB)

eth1 and eth2 are connected to the iSCSI server (different vlan,
different network), eth0 is on the bridge xenbr0:
brctl show
bridge name     bridge id               STP enabled     interfaces
xenbr0          8000.f46d04efe4d7       no              eth0
                                                        vif1.0

I don't see anything in dmesg or xm dmesg at the time of the problem.

I do see regular single-packet drops across various parts of the network
(i.e., a dozen times a day or more), but I don't think this is an issue.
The problem is dropping almost all packets for a period of 10+ seconds.
Note that tcpdump on the dom0, examined in wireshark, showed about a
dozen packets being sent/received during the outage; some packets were
retransmissions, some were ping requests/replies, but 99.9% of the
normal network load was missing. For example, during the outage one ping
packet was not received (the sender's tcpdump showed it had been sent);
the next ping packet was received, and dom0 showed the reply was sent as
well (reply from the domU), but the other machine never received the
reply (missing in tcpdump at the other end).

The only way I see this is that I have two machines which each ping
every IP 60 times (once per second) every minute and record the results
with the date/time. I can then process the logs, and both machines show
the same outage on the same destination machine at the same time.
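
As a rough illustration of that log processing (a sketch only, not the
actual scripts in use): the code below assumes each monitor's log has
already been parsed into (timestamp, reply-received) pairs for one
destination, sorted by time. The function names find_outages and
overlapping, the 5-ping threshold, and the 2-second slack are all
assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_outages(samples, min_lost=5):
    """Return (start, end, lost_count) for each run of consecutive lost pings.

    samples: iterable of (datetime, bool) pairs for ONE destination, sorted
    by time; the bool is True when a reply was received. (Assumed format.)
    Only runs of at least min_lost consecutive losses are reported.
    """
    outages = []
    run_start = run_end = None
    lost = 0
    for ts, ok in samples:
        if not ok:
            if lost == 0:
                run_start = ts
            run_end = ts
            lost += 1
        else:
            if lost >= min_lost:
                outages.append((run_start, run_end, lost))
            lost = 0
    # A run that is still open when the log ends is reported too.
    if lost >= min_lost:
        outages.append((run_start, run_end, lost))
    return outages


def overlapping(a, b, slack=timedelta(seconds=2)):
    """Pair up outages seen by two monitors that overlap in time.

    a, b: outage lists as returned by find_outages(). slack allows for
    small clock differences between the two monitoring machines.
    """
    pairs = []
    for sa, ea, _ in a:
        for sb, eb, _ in b:
            if sa <= eb + slack and sb <= ea + slack:
                pairs.append(((sa, ea), (sb, eb)))
    return pairs
```

An outage that appears in the overlapping() result for both monitors
against the same destination points at that destination (or its path)
rather than at either monitoring machine.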

The switch is currently a Cisco 3560; previously I've used a Netgear
GS716Tv2, a Netgear GS748Tv4, and a Netgear unmanaged 16-port gigabit
switch.

I'm using the onboard network card at the moment, but had the same issue
with an Intel server PCI card.
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

eth1/eth2 are a dual port Intel gigabit ethernet card:
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)

I'm using the GPLPV drivers:
Driver Date: 10/22/2012
Driver Version: 0.11.0.372 (Not signed)
In Windows, the following advanced settings are configured:
Check checksum on RX packets: Enabled
Checksum offload: Enabled
Don't fix the blank checksum on offload: Disabled
Large send offload: 61440
Locally Administered Address: (blank)
MTU: 1500
Scatter/Gather: Enabled

domU config file has networking configured like this:
vif        = ['bridge=xenbr0, mac=00:16:3e:39:26:ac']

The domU doesn't record anything in the event viewer, the switch doesn't
record anything in its logs, and DoS options are disabled on the switch.
The dom0 will respond to pings while the domU doesn't.

For a long time, I thought this was happening to other physical machines
as well, but either it isn't anymore, or it never was. At least the last
4 weeks of ping stats I have show that only the domUs, and of those only
the terminal servers, lose more than 4 consecutive pings (aside from
outages caused by changing hardware/etc).

Any hints on what to look at, additional information needed, how to
diagnose, or any options other than continued hair-pulling would be
immensely appreciated.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au




 

