Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.

Perhaps you have been trying to reproduce it in the wrong conditions?  I have
generally seen this bug when the networking is under very light load, such as
a couple of fairly idle dom0<->domU ssh connections.  I'm not sure that I've
seen it under heavy load.

> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug.  See the attached patch.
>
> Could anybody who can see this bug help to try it?  Appreciate much!

Thanks for looking into this.  Your logic seems reasonable, so I'll apply it
(however, I also added a patch to make smartpoll default to "off"; I guess I
can switch that back to default-on to make sure it gets tested, but leave the
option as a workaround if there are still problems).

However, I am concerned about these manipulations of a cross-cpu shared
variable without any barriers or other ordering constraints.  Are you sure
this code is correct under any reordering (either by the compiler or the
CPUs), and if the compiler decides to access the variable more or less often
than the source says it should?
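To make it concrete, the pattern I'd expect around that flag is something
like the sketch below.  The names (shared_poll_state, smart_poll_active, and
the two ring helpers) are made up for illustration; this is not your patch or
the actual netfront code, just the ACCESS_ONCE()/smp_mb() pairing I have in
mind.

    /* Illustrative sketch only: hypothetical names, not the real netfront
     * structures.  The intent is to show explicit ordering around the
     * cross-CPU "am I polling?" flag. */

    #include <linux/compiler.h>         /* ACCESS_ONCE() */
    #include <asm/system.h>             /* smp_mb(); header varies by tree */

    struct shared_poll_state {
            int smart_poll_active;      /* written by frontend, read by backend */
    };

    /* hypothetical helpers standing in for the real ring accessors */
    extern int ring_has_unconsumed_responses(struct shared_poll_state *s);
    extern void process_pending_responses(struct shared_poll_state *s);

    static void frontend_start_polling(struct shared_poll_state *s)
    {
            /* Force a single real store (not one the compiler may defer,
             * duplicate, or hoist), then order it before later ring reads. */
            ACCESS_ONCE(s->smart_poll_active) = 1;
            smp_mb();
    }

    static void frontend_stop_polling(struct shared_poll_state *s)
    {
            ACCESS_ONCE(s->smart_poll_active) = 0;
            smp_mb();

            /* Anything the backend queued while it still saw the flag set
             * must be consumed here, otherwise the backend suppresses its
             * interrupt and the receive path stalls forever. */
            if (ring_has_unconsumed_responses(s))
                    process_pending_responses(s);
    }

Whether the real flag needs a full smp_mb() or just smp_wmb()/smp_rmb()
pairs depends on what the backend side does, but something stronger than a
plain assignment looks necessary to me.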
Thanks,
    J

> Thanks,
> Dongxiao
>
>
> Jeremy Fitzhardinge wrote:
>> On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>> Hi Christophe,
>>>>
>>>> Thanks for finding and checking the problem.
>>>> I will try to reproduce the issue and check what caused the problem.
>>>>
>>> Hello,
>>>
>>> Was this issue resolved?  Some users have been complaining about
>>> "network freezing up" issues recently on ##xen on irc.
>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>> leave it off by default).
>>
>> J
>>
>>> -- Pasi
>>>
>>>> Thanks,
>>>> Dongxiao
>>>>
>>>> Jeremy Fitzhardinge wrote:
>>>>> On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been playing with some of the new pvops code, namely DomU
>>>>>> guest code.  What I've been observing on one of the virtual
>>>>>> machines is that the network (vif) dies after about ten to
>>>>>> sixty minutes of uptime.  The unfortunate thing here is that I can
>>>>>> only reproduce it on a production VM and have been unlucky so far
>>>>>> in triggering the bug on a test machine.  While this has not been
>>>>>> tragic - rebooting fixed the issue - unfortunately I can't spend
>>>>>> very much time on debugging once the issue pops up.
>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>> happened to me again now...
>>>>>
>>>>>
>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>> DomU can send packets to Dom0, and those are visible using tcpdump
>>>>>> on the Dom0 on the virtual interface, but not the other way
>>>>>> around.
>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>> that's what seems to be happening here too.
>>>>>
>>>>>> Now, I have made more than one change at a time (I'd like to avoid
>>>>>> pinning it down since I can only reproduce it on a production
>>>>>> machine, as I said, so suggestions are welcome), but my suspicion
>>>>>> is that it might have to do with the new "smart polling" feature
>>>>>> in xen/netfront.  Note that I have also updated Dom0 to pull in
>>>>>> the latest dom0/backend and netback changes, just to make sure
>>>>>> it's not due to an issue that has been fixed there, but I'm still
>>>>>> seeing the same behaviour.
>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>> into netfront.
>>>>>
>>>>>     J
>>>>>
>>>>>> The production machine doesn't have much network load, but deals
>>>>>> with a lot of small network requests (DNS and smtp mostly) - a
>>>>>> workload which is hard to reproduce on the test machine.  Heavy
>>>>>> network load (NFS, FTP and so on) for days hasn't triggered the
>>>>>> problem.  Also, segmentation offloading and similar settings
>>>>>> don't have any effect.
>>>>>>
>>>>>> The machine has 2 physical CPUs and the VM 2 virtual CPUs; the
>>>>>> DomU has PREEMPT enabled.
>>>>>>
>>>>>> I've been looking at the code to see if there might be a race
>>>>>> condition somewhere, something like a situation where the hrtimer
>>>>>> doesn't run while Dom0 believes the DomU is still polling and
>>>>>> therefore doesn't emit an interrupt, but I'm afraid I don't know
>>>>>> enough to judge this (I mean, there are spinlocks, which look
>>>>>> safe to me).
>>>>>>
>>>>>> Do you have any suggestions on what to try?  I can trigger the
>>>>>> issue on the production VM again, but debugging should not take
>>>>>> more than a few minutes if it happens.  Access is only possible
>>>>>> via the console.  Neither Dom0 nor the guest shows anything
>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>> normally after the network goes dead (I am also able to shut down
>>>>>> the guest normally).
>>>>>>
>>>>>> Thanks,
>>>>>> Christophe

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel