
Re: [Xen-devel] new netfront and occasional receive path lockup



 On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug on my side.

Perhaps you have been trying to reproduce it under the wrong conditions?  I
have generally seen this bug when the network is under very light load,
such as a couple of fairly idle dom0<->domU ssh connections.  I'm not sure
that I've seen it under heavy load.

> However, I investigated the code, and there is indeed one race condition
> that probably causes the bug.  See the attached patch.
>
> Could anybody who can see this bug help to try it?  Much appreciated!

Thanks for looking into this.  Your logic seems reasonable, so I'll apply
it.  (I also added a patch to make smartpoll default to "off"; I guess I
can switch that back to defaulting on to make sure the fix gets tested,
while leaving the option as a workaround if there are still problems.)
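
For reference, the knob could look something like this in
drivers/net/xen-netfront.c (just a sketch; the parameter name, default and
wiring below are illustrative, not the actual patch):

    /* Illustrative only: a module parameter for toggling smart polling.
     * The name "use_smartpoll" and its default are assumptions. */
    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static bool use_smartpoll = true;  /* default on so the fix gets exercised */
    module_param(use_smartpoll, bool, 0444);
    MODULE_PARM_DESC(use_smartpoll,
            "Poll the rings from an hrtimer instead of waiting for "
            "event-channel interrupts");

It could then be flipped from the guest's command line with something like
xen_netfront.use_smartpoll=0 (again, a hypothetical name).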

However, I am concerned about these manipulations of a cross-CPU shared
variable without any barriers or other ordering constraints.  Are you sure
this code is correct under any reordering (by either the compiler or the
CPUs), and if the compiler decides to access the variable more or fewer
times than the source says it should?
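
To make the concern concrete, this is the kind of pattern I'd want to see
around a flag that one CPU writes and another tests (purely illustrative;
the structure and function names below are invented, not the code in your
patch):

    /* Illustrative only: an invented cross-CPU "am I polling?" flag,
     * showing ACCESS_ONCE() plus an explicit barrier pair.  Not the
     * actual netfront/smartpoll code. */
    #include <linux/compiler.h>   /* ACCESS_ONCE() */
    #include <asm/system.h>       /* smp_wmb()/smp_rmb() on kernels of this vintage */

    struct smartpoll_state {
            unsigned long active;  /* written by the frontend, read by the other side */
    };

    static void smartpoll_mark_active(struct smartpoll_state *sp)
    {
            /* Publish any ring/index updates before advertising that we
             * are polling, so a reader that sees active == 1 also sees
             * them. */
            smp_wmb();
            ACCESS_ONCE(sp->active) = 1;
    }

    static int smartpoll_seen_active(struct smartpoll_state *sp)
    {
            int active = ACCESS_ONCE(sp->active);

            /* Pairs with the smp_wmb() above before trusting ring
             * contents. */
            smp_rmb();
            return active;
    }

ACCESS_ONCE() stops the compiler from caching, tearing or re-reading the
flag, and the barrier pair constrains CPU reordering; without both, the
"I'm polling, don't send an event" handshake can race.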

Thanks,
    J

> Thanks,
> Dongxiao
>
>
> Jeremy Fitzhardinge wrote:
>>  On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>> Hi Christophe,
>>>>
>>>> Thanks for finding and checking the problem.
>>>> I will try to reproduce the issue and check what caused the problem.
>>>>
>>> Hello,
>>>
>>> Was this issue resolved? Some users have been complaining about
>>> "network freezing up" issues recently in ##xen on IRC.
>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>> leave it off by default). 
>>
>>     J
>>
>>> -- Pasi
>>>
>>>> Thanks,
>>>> Dongxiao
>>>>
>>>> Jeremy Fitzhardinge wrote:
>>>>>  On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been playing with some of the new pvops code, namely the DomU
>>>>>> guest code.  What I've been observing on one of the virtual
>>>>>> machines is that the network (vif) dies after about ten to
>>>>>> sixty minutes of uptime.  The unfortunate thing here is that I can
>>>>>> only reproduce it on a production VM and have been unlucky so far
>>>>>> in triggering the bug on a test machine.  While this has not been
>>>>>> tragic - rebooting fixes the issue - unfortunately I can't spend
>>>>>> very much time on debugging after the issue pops up.
>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>> happened to me then... 
>>>>>
>>>>>
>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>> DomU can send packets to Dom0, and those are visible with tcpdump
>>>>>> on the virtual interface in Dom0, but not the other way around.
>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>> that's what seems to be happening here too.
>>>>>
>>>>>> Now, I have done more than one change at a time (I'd rather not
>>>>>> try to pin down the exact change, since I can only reproduce it on
>>>>>> a production machine, as I said, so suggestions are welcome), but
>>>>>> my suspicion is that it might have to do with the new "smart
>>>>>> polling" feature in xen/netfront.  Note that I have also updated
>>>>>> Dom0 to pull in the latest dom0/backend and netback changes, just
>>>>>> to make sure it's not due to an issue that has been fixed there,
>>>>>> but I'm still seeing the same behaviour.
>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>> into netfront. 
>>>>>
>>>>>     J
>>>>>
>>>>>> The production machine doesn't have much network load, but deals
>>>>>> with a lot of small network requests (mostly DNS and SMTP), a
>>>>>> workload which is hard to reproduce on the test machine.  Heavy
>>>>>> network load (NFS, FTP and so on) for days hasn't triggered the
>>>>>> problem.  Also, segmentation offloading and similar settings don't
>>>>>> have any effect.
>>>>>>
>>>>>> The machine has 2 physical CPUs and the VM 2 virtual CPUs; the
>>>>>> DomU has PREEMPT enabled.
>>>>>>
>>>>>> I've been looking at the code for a possible race condition
>>>>>> somewhere, something like a situation where the hrtimer doesn't
>>>>>> run while Dom0 believes the DomU is polling and therefore doesn't
>>>>>> emit an interrupt, but I'm afraid I don't know enough to judge
>>>>>> this (the spinlocks look safe to me).
>>>>>>
>>>>>> Do you have any suggestions on what to try?  I can trigger the
>>>>>> issue on the production VM again, but debugging should not take
>>>>>> more than a few minutes when it happens.  Access is only possible
>>>>>> via the console.  Neither Dom0 nor the guest shows anything
>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>> normally after the network goes dead (I can also shut down the
>>>>>> guest normally).
>>>>>>
>>>>>> Thanks,
>>>>>>  Christophe
>>>>>>
>>>>>>
>>>>>>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

