RE: [Xen-devel] new netfront and occasional receive path lockup
Jeremy Fitzhardinge wrote:
> On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>>
>> I was frustrated that I couldn't reproduce this bug in my site.
>
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections. I'm not sure that I've seen it under heavy load.
>
>> However I investigated the code, indeed there is one race condition
>> that probably causes the bug. See the attached patch.
>>
>> Could anybody who can see this bug help to try it? Appreciate much!
>
> Thanks for looking into this. Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).
>
> However, I am concerned about these manipulations of a cross-cpu
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs); and if the compiler decides to access it more or
> less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
It is a flag in the shared ring structure, so operations on it follow
the same rules as the other components of the shared ring, e.g. they
are done under the spinlock.

I will keep dom0 and domU connected over ssh for some time to see if
the bug still exists.

Thanks,
Dongxiao

>
> Thanks,
> J
>
>> Thanks,
>> Dongxiao
>>
>>
>> Jeremy Fitzhardinge wrote:
>>> On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>>> Hi Christophe,
>>>>>
>>>>> Thanks for finding and checking the problem.
>>>>> I will try to reproduce the issue and check what caused the
>>>>> problem.
>>>>>
>>>> Hello,
>>>>
>>>> Was this issue resolved? Some users have been complaining "network
>>>> freezing up" issues recently on ##xen on irc..
>>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>>> leave it off by default).
>>>
>>> J
>>>
>>>> -- Pasi
>>>>
>>>>> Thanks,
>>>>> Dongxiao
>>>>>
>>>>> Jeremy Fitzhardinge wrote:
>>>>>> On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've been playing with some of the new pvops code, namely DomU
>>>>>>> guest code. What I've been observing on one of the virtual
>>>>>>> machines is that the network (vif) is dying after about ten to
>>>>>>> sixty minutes of uptime. The unfortunate thing here is that I
>>>>>>> can only reproduce it on a production VM and have been unlucky
>>>>>>> so far to trigger the bug on a test machine. While this has not
>>>>>>> been tragic - rebooting fixed the issue, unfortunately I can't
>>>>>>> spend very much time on debugging after the issue pops up.
>>>>>> Ah, OK. I've seen this a couple of times as well. And it just
>>>>>> happened to me then...
>>>>>>
>>>>>>
>>>>>>> Now, what is happening is that the receive path goes dead. The
>>>>>>> DomU can send packets to Dom0 and those are visible using
>>>>>>> tcpdump on the Dom0 on the virtual interface, but not the other
>>>>>>> way around.
>>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>>> that's what seems to be happening here too.
>>>>>>
>>>>>>> Now, I have done more than one change at a time (I'd like to
>>>>>>> avoid going into pinning it down since I can only reproduce it
>>>>>>> on a production machine, as I said, so suggestions are welcome),
>>>>>>> but my suspicion is that it might have to do with the new "smart
>>>>>>> polling" feature in xen/netfront. Note that I have also
>>>>>>> updated Dom0 to pull in the latest dom0/backend and netback
>>>>>>> changes, just to make sure it's not due to an issue that has
>>>>>>> been fixed there, but I'm still seeing the same.
>>>>>> I agree. I think I started seeing this once I merged smartpoll
>>>>>> into netfront.
>>>>>>
>>>>>> J
>>>>>>
>>>>>>> The production machine is a machine that doesn't have much
>>>>>>> network load, but deals with a lot of small network requests
>>>>>>> (DNS and smtp mostly). A workload which is hard to reproduce
>>>>>>> on the test machine. Heavy network load (NFS, FTP and so on)
>>>>>>> for days hasn't triggered the problem. Also, segmentation
>>>>>>> offloading and similar settings don't have any effect.
>>>>>>>
>>>>>>> The machine has 2 physical and the VM 2 virtual CPUs, DomU has
>>>>>>> PREEMPT enabled.
>>>>>>>
>>>>>>> I've been looking at the code, if there might be a race
>>>>>>> condition somewhere, something like where one could run into a
>>>>>>> situation where the hrtimer doesn't run and Dom0 believes the
>>>>>>> DomU should be polling and doesn't emit an interrupt or
>>>>>>> something, but I'm afraid I don't know enough to judge this (I
>>>>>>> mean, there are spinlocks which look safe to me).
>>>>>>>
>>>>>>> Do you have any suggestions what to try? I can trigger the
>>>>>>> issue on the production VM again, but debugging should not take
>>>>>>> more than a few minutes if it happens. Access is only possible
>>>>>>> via the console. Neither Dom0 nor the guest show anything
>>>>>>> unusual in the kernel message and continue to behave normally
>>>>>>> after the network goes dead (also able to shut down the guest
>>>>>>> normally).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christophe

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
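The ordering concern raised in this thread (a flag shared across CPUs or domains, read and written without explicit barriers) can be illustrated with a minimal user-space sketch. It is not the netfront/netback code and not the attached patch. Apart from the flag name `smartpoll_active`, every name here (`demo_ring`, `backend_push_response`, `frontend_stop_polling`, and so on) is hypothetical, and C11 atomics stand in for the kernel's barrier primitives. The idea it shows: the side that publishes a response must make it visible before reading the flag, and the side that clears the flag must re-check for pending responses afterwards. Otherwise both sides can conclude that the other will act, and the receive path stalls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared ring; NOT the real Xen ring layout. */
struct demo_ring {
    atomic_uint rsp_prod;          /* responses published by the backend   */
    unsigned int rsp_cons;         /* responses consumed, frontend-private */
    atomic_bool smartpoll_active;  /* frontend is polling from its timer   */
};

static void demo_ring_init(struct demo_ring *r, bool polling)
{
    atomic_init(&r->rsp_prod, 0);
    r->rsp_cons = 0;
    atomic_init(&r->smartpoll_active, polling);
}

/* Backend: publish one response. Returns true if an event must be sent. */
static bool backend_push_response(struct demo_ring *r)
{
    atomic_fetch_add_explicit(&r->rsp_prod, 1, memory_order_release);
    /* Full fence: the new producer index must be visible before the flag is read. */
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_load_explicit(&r->smartpoll_active, memory_order_relaxed);
}

/* Frontend: number of responses published but not yet consumed. */
static unsigned int frontend_pending(struct demo_ring *r)
{
    return atomic_load_explicit(&r->rsp_prod, memory_order_acquire) - r->rsp_cons;
}

/* Frontend: consume one response; the caller must know one is pending. */
static void frontend_consume(struct demo_ring *r)
{
    assert(frontend_pending(r) > 0);
    r->rsp_cons++;
}

/*
 * Frontend: stop polling. Returns true if a response slipped in while the
 * flag was being cleared, in which case polling is re-armed and must continue.
 */
static bool frontend_stop_polling(struct demo_ring *r)
{
    atomic_store_explicit(&r->smartpoll_active, false, memory_order_relaxed);
    /* Full fence: the cleared flag must be visible before the ring is re-checked. */
    atomic_thread_fence(memory_order_seq_cst);
    if (frontend_pending(r) > 0) {
        atomic_store_explicit(&r->smartpoll_active, true, memory_order_relaxed);
        return true;
    }
    return false;
}
```

With a full fence on both sides, at least one party observes the other's store: either the backend sees the cleared flag and sends an event, or the frontend sees the new response and keeps polling. Whether a spinlock alone gives the same guarantee depends on both parties taking the same lock, which is not possible when the two sides are in different domains.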