
Re: [Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220



On Thursday, 1 June 2017 11:56:28 PM AEST Boris Ostrovsky wrote:
> On 05/31/2017 10:25 PM, Steven Haigh wrote:
> > On 2017-05-31 00:37, Steven Haigh wrote:
> >> On 31/05/17 00:18, Boris Ostrovsky wrote:
> >>> On 05/30/2017 06:27 AM, Steven Haigh wrote:
> >>>> Just wanted to give this a nudge to try and get some suggestions on
> >>>> where to go / what to do about this.
> >>>> 
> >>>> On 28/05/17 09:44, Steven Haigh wrote:
> >>>>> The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
> >>>>> 4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
> >>>>> system off the network.
> >>>>> 
> >>>>> This is a new development - but I'm not sure if it's kernel or xen
> >>>>> related.
> >>> 
> >>> Since no one seems to have seen this it would be useful to narrow it
> >>> down a bit.
> >>> 
> >>> Do you observe this on rc5? Or with 4.9.28 kernel? Any particular load
> >>> that you are using? Do you see this on a specific NIC?
> >> 
> >> This install is currently using xen 4.9-rc7 and kernel 4.9.30. I would
> >> say that there may be a connection between disk activity and the
> >> ethernet adapter locking up - but I haven't been able to prove this in
> >> any valid way yet.
> >> 
> >> I am currently running this script on the server in question to try and
> >> get a log of how often the adapter locks up. I only added the logger
> >> line tonight - so I don't have a great deal of historical data to add as
> >> yet.
> >> 
> >> #!/bin/bash
> >> # Ping 10.1.1.2 every 5 seconds; on failure, log it and reset the NIC.
> >> while true; do
> >>         if ! ping -c1 10.1.1.2 > /dev/null 2>&1; then
> >>                 logger 'No response. Resetting enp5s0'
> >>                 mii-tool -R enp5s0
> >>         fi
> >>         sleep 5
> >> done
> > 
> > Just to keep kicking this along a little bit, my logs so far have shown:
> > messages:May 31 00:20:10 No response. Resetting enp5s0
> > messages:May 31 04:20:08 No response. Resetting enp5s0
> > messages:May 31 12:21:37 No response. Resetting enp5s0
> > 
> > It's almost spooky that it's nearly 20 minutes past the hour on each reset.
> > 
> > I've checked against the cron logs, but I can't find anything that
> > would be scheduled on the Dom0 at that time.
> > 
> > The logs also show that after running mii-tool to reset the ethernet
> > adapter, connectivity has returned straight away.
> > 
> > The network adapter uses the r8169 kernel module, and shows as:
> > 05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> > RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
> > 
> > I have a DomU backup script that runs *in* a DomU at 01:00 each night
> > - that causes a lot of disk activity - but alas, that time hasn't
> > lined up with anything as yet...
> > 
> > Still seem to be fidgeting in the dark :(
> 
> Since you've already observed this problem with rc6 and 4.9.29, wouldn't
> it be more useful to go backwards to narrow down where the problem first
> occurred? I am not sure how moving to rc7 and 4.9.30 is going to help
> unless you think this is a temporary regression.

I'm not 100% sure of the cause at the moment. I moved to kernel 4.9 from 4.4
a few weeks before I started to test Xen 4.9. My only thought was that moving
to the latest versions would at least pick up any fixes that have gone into
the Xen 4.9 rc releases.

I have also been updating to the latest 4.9 kernel in case I come across a fix 
- or at least a version of kernel where this no longer occurs.

At this stage, I have nothing that points clearly at either Xen or the kernel,
other than that I never observed this with:
        * Xen 4.7 + kernel 4.4
        * Xen 4.7 + kernel 4.9

I am assuming, however, that because the network stays dead until manual
intervention when it dies in this manner, I would have noticed it under a
different combination of Xen / kernel.

One observation I have made since putting in the extra logging via the 
ethernet reset script posted earlier - the WARNING is not printed for every 
ethernet controller hang. As such, this may actually be a side-effect of 
having the controller stay dead - rather than a cause.
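
One way to check this observation would be to count the reset messages and
the dev_watchdog warnings in the same log. A rough sketch, assuming the logger
output from the reset script and the kernel messages both end up in one file
(the /var/log/messages path in the usage comment is an assumption, and
count_events is just a name made up for illustration):

```shell
#!/bin/bash
# count_events: read log lines on stdin and report how many adapter resets
# and how many dev_watchdog WARNING headers were seen.
# Only the "WARNING: ... dev_watchdog" header line is counted, so the rest
# of a kernel trace that mentions dev_watchdog is not counted twice.
count_events() {
    awk '
        /Resetting enp5s0/          { resets++ }
        /WARNING: .*dev_watchdog/   { warnings++ }
        END { printf "resets=%d warnings=%d\n", resets, warnings }
    '
}

# Usage (log path is an assumption; adjust for your distribution):
#   count_events < /var/log/messages
```

If resets is consistently higher than warnings, that would fit the idea that
the WARNING is a side-effect rather than the cause.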

A second observation is that I don't seem to have seen as many hangs of the
ethernet adapter in recent days. I'm not yet confident whether this is a real
change or coincidence - but I'm hoping that my script, which logs each
detected network issue and resets the ethernet adapter via mii-tool, will
give me enough data to base some kind of conclusion on.

It also seems strange that all three resets occurred at roughly 20 minutes
past the hour. It may well be pure coincidence, but I figure more data
gathering may clear this part up.
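
To see whether the minute-of-hour pattern holds up as more data comes in,
something like the following could tally resets by minute. This is only a
sketch: it assumes the log lines carry an HH:MM:SS timestamp as in the
excerpt quoted above, minute_histogram is a made-up name, and the log path in
the usage comment is an assumption.

```shell
#!/bin/bash
# minute_histogram: read log lines on stdin and print "MM count" for every
# minute-of-hour at which a reset was logged, sorted by minute.
minute_histogram() {
    awk '
        /Resetting enp5s0/ {
            # Find the first HH:MM:SS field and keep only the MM part.
            if (match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/)) {
                minute = substr($0, RSTART + 3, 2)
                count[minute]++
            }
        }
        END { for (m in count) print m, count[m] }
    ' | sort
}

# Usage (log path is an assumption; adjust for your distribution):
#   cat /var/log/messages* | minute_histogram
```

For the three resets logged so far this would print "20 2" and "21 1".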

-- 
Steven Haigh

📧 netwiz@xxxxxxxxx      💻 http://www.crc.id.au
📞 +61 (3) 9001 6090     📱 0412 935 897


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

