Re: [Xen-devel] Re: [Xen-users] old issue after 1024 live migrations seems to still exist.
(dropping xen-users to avoid cross-posting)
On Wed, 2010-07-21 at 17:24 +0100, Pasi Kärkkäinen wrote:
> Adding xen-devel to the CC list..
> On Wed, Jul 21, 2010 at 04:38:28PM +0200, Florian Heigl wrote:
> > Hi,
> > last month I did some checking of old Xen issues that I remember and
> > found this one to still exist: if you do a large number of live
> > migrations, at some point the xen daemon chokes and dies.
> > The issue was reported by someone on the list like 4-5 years ago, but
> > it seems it hasn't been fixed (not sure if anyone even replied back
> > then). The Xen version I used to test was 3.4.0 from Oracle VM 2.2.
Do you have a reference to this old issue?
To be honest I think it is unlikely that you are seeing the actual same
issue as a bug that old, even if your symptoms are very similar.
Can you give details of your precise system configuration for both host
and guest: the hypervisor changeset (I don't know what Oracle VM 2.2 has
in it), the kernel changeset for both dom0 and domU, and so on?
I am currently doing some live migration testing with guests under load
(forkbomb) and am regularly doing 4000-5000 successful migrations before I
hit a very subtle deadlock in a PVops domU kernel. Over the past 4-5 years
I have most likely done tens of thousands of iterations of live migration
in various scenarios myself, and we know other people are regularly
running automated and manual tests of these things. So the problem you are
seeing is almost certainly not a generic failure but must be specific to
the version of one or more components in your system.
Are you seeing failure after precisely 1024 migrations in every case or
is that just a rough figure? It might be worth
using /usr/lib/xen/bin/lsevtchn to check what is happening to both the
dom0 and domU event channels after each migration iteration. Once upon a
time I was seeing an evtchn leak in domU (now fixed) but that wouldn't
fail after precisely 1024 iterations since there is always a number of
non-leaking event channels also in use.
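As a rough illustration of the per-iteration check suggested above, here is
a minimal Python sketch that counts the event channel ports lsevtchn reports
for a domain. It assumes that lsevtchn takes a domain ID as its only
argument and prints one line per port, starting with the port number and a
colon; please verify both against your version. The helper names
(count_ports, ports_in_use, growth) are hypothetical.

```python
import re
import subprocess

# Path as given in this mail; adjust if your distribution installs it elsewhere.
LSEVTCHN = "/usr/lib/xen/bin/lsevtchn"


def count_ports(output):
    """Count lines that look like '<port>: ...' (assumed output format)."""
    return sum(1 for line in output.splitlines() if re.match(r"\s*\d+:", line))


def ports_in_use(domid):
    """Run lsevtchn for one domain and return the number of ports listed.

    Assumption: lsevtchn accepts the domain ID as its only argument.
    """
    result = subprocess.run(
        [LSEVTCHN, str(domid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_ports(result.stdout)


def growth(counts):
    """Change in port count between consecutive samples.

    Values that stay positive across many iterations suggest a leak.
    """
    return [later - earlier for earlier, later in zip(counts, counts[1:])]
```

Sampling ports_in_use() for dom0 and for the guest after every migration and
passing the resulting list to growth() should show whether the number of
ports creeps up over time or stays flat.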
Are you able to test with an up-to-date xen-3.4-testing or, even better,
the xen-4.0-testing tree?
> > is it just the gratuitous ARP?
The gratuitous ARP doesn't get sent by current PVops kernels (I don't know
if you are using one since you haven't provided any details about your
system configuration). A fix is pending in the network subsystem
maintainer's tree, which I hope will be backported to 2.6.32.x when it
goes into mainline during the next merge window.
See 06c4648d46d1b757d6b9591a86810be79818b60c and
592970675c9522bde588b945388c7995c8b51328 in net-next-2.6.git. You will
also need to configure sysctl to enable the arp_notify option for the
devices; setting net.ipv4.conf.all.arp_notify = 1 is likely sufficient.
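For illustration only, a minimal Python sketch of setting that option by
writing to /proc/sys, which is where Linux exposes sysctl values. It needs
root, and the change does not persist across reboots. The helper names
(sysctl_path, set_sysctl) are hypothetical, and the simple dot-to-slash
mapping is an assumption that holds for the "all" entry used here.

```python
import os


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys.

    Assumption: no component of the key contains a dot itself, which is
    true for net.ipv4.conf.all.arp_notify.
    """
    return os.path.join(root, *key.split("."))


def set_sysctl(key, value, root="/proc/sys"):
    """Write a value to the sysctl file for the given key (requires root)."""
    with open(sysctl_path(key, root), "w") as handle:
        handle.write(f"{value}\n")


# Example call (run as root on a kernel that has the arp_notify option):
# set_sysctl("net.ipv4.conf.all.arp_notify", 1)
```

To keep the setting across reboots, the usual approach is to put the same
key and value into /etc/sysctl.conf instead.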