Re: [Xen-devel] new netfront and occasional receive path lockup
On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
> Hi Christophe,
>
> Thanks for finding and checking the problem.
> I will try to reproduce the issue and check what caused the problem.
>

Hello,

Was this issue resolved?
Some users have been complaining about "network freezing up" issues
recently on ##xen on IRC.

-- Pasi

> Thanks,
> Dongxiao
>
> Jeremy Fitzhardinge wrote:
> > On 08/22/2010 09:43 AM, Christophe Saout wrote:
> >> Hi,
> >>
> >> I've been playing with some of the new pvops code, namely DomU guest
> >> code.  What I've been observing on one of the virtual machines is
> >> that the network (vif) is dying after about ten to sixty minutes of
> >> uptime.  The unfortunate thing here is that I can only reproduce it
> >> on a production VM and have been unlucky so far to trigger the bug
> >> on a test machine.  While this has not been tragic - rebooting fixed
> >> the issue, unfortunately I can't spend very much time on debugging
> >> after the issue pops up.
> >
> > Ah, OK.  I've seen this a couple of times as well.  And it just
> > happened to me then...
> >
> >> Now, what is happening is that the receive path goes dead.  The DomU
> >> can send packets to Dom0 and those are visible using tcpdump on the
> >> Dom0 on the virtual interface, but not the other way around.
> >
> > I hadn't got to that level of diagnosis, but I can confirm that
> > that's what seems to be happening here too.
> >
> >> Now, I have done more than one change at a time (I'd like to avoid
> >> going into pinning it down since I can only reproduce it on a
> >> production machine, as I said, so suggestions are welcome), but my
> >> suspicion is that it might have to do with the new "smart polling"
> >> feature in xen/netfront.  Note that I have also updated Dom0 to pull
> >> in the latest dom0/backend and netback changes, just to make sure
> >> it's not due to an issue that has been fixed there, but I'm still
> >> seeing the same.
> >
> > I agree.  I think I started seeing this once I merged smartpoll into
> > netfront.
> >
> >     J
> >
> >> The production machine is a machine that doesn't have much network
> >> load, but deals with a lot of small network requests (DNS and smtp
> >> mostly).  A workload which is hard to reproduce on the test machine.
> >> Heavy network load (NFS, FTP and so on) for days hasn't triggered the
> >> problem.  Also, segmentation offloading and similar settings don't
> >> have any effect.
> >>
> >> The machine has 2 physical and the VM 2 virtual CPUs, DomU has
> >> PREEMPT enabled.
> >>
> >> I've been looking at the code, if there might be a race condition
> >> somewhere, something like where one could run into a situation where
> >> the hrtimer doesn't run and Dom0 believes the DomU should be polling
> >> and doesn't emit an interrupt or something, but I'm afraid I don't
> >> know enough to judge this (I mean, there are spinlocks which look
> >> safe to me).
> >>
> >> Do you have any suggestions what to try?  I can trigger the issue on
> >> the production VM again, but debugging should not take more than a
> >> few minutes if it happens.  Access is only possible via the console.
> >> Neither Dom0 nor the guest show anything unusual in the kernel
> >> message and continue to behave normally after the network goes dead
> >> (also able to shut down the guest normally).
> >>
> >> Thanks,
> >>     Christophe
> >>
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >> http://lists.xensource.com/xen-devel