
Re: [Xen-devel] [xen-unstable test] 56759: regressions - FAIL



On Tue, 2015-05-26 at 14:29 +0100, Ian Campbell wrote:
> On Wed, 2015-05-20 at 10:56 +0100, Ian Campbell wrote:
> > On Wed, 2015-05-20 at 09:34 +0000, osstest service user wrote:
> > > flight 56759 xen-unstable real [real]
> > > http://logs.test-lab.xenproject.org/osstest/logs/56759/
> > > 
> > > Regressions :-(
> > > 
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >  test-armhf-armhf-xl-multivcpu 17 leak-check/check         fail REGR. vs. 56375
> > 
> > I'm pretty hard pressed to explain this from the set of commits
> > currently under test, but it has happened a few times now (e.g. 56700
> > 56576) so it does seem to be real.
> > 
> > http://logs.test-lab.xenproject.org/osstest/results/bisect.xen-unstable.test-armhf-armhf-xl-multivcpu.leak-check--check.html
> > is working on it and is currently considering the set of changes from:
> > ianc@cosworth:xen.git$ git log --oneline 9ab42~1...45fcc4
> > 45fcc45 use ticket locks for spin locks
> > e13013d libxc/restore: add checkpointed flag to the restore context
> > ce44b40 libxc/restore: introduce setup() and cleanup() on restore
> > c5c5a04 libxc/restore: split read/handle qemu info
> > 9ab42c9 libxc/restore: introduce process_record()
> > 
> > where e13013d is current master which was pushed in by flight 56375.
> > 
> > I think it unlikely the libxc stuff is responsible, given we don't
> > migrate on ARM, which would seem to point to the ticket locks...
> 
> I've now managed to reproduce using the arndale on my desk.

... and now I've confirmed that reverting the spin lock change causes
the issue to not happen any more.

> I'm just starting to dig in to the issue.
> 
> So far the only thing I've concluded is that the message comes from
> netback trying to read the script node for inclusion in the hotplug
> invocation's environment.
> 
> I wonder if perhaps the spinlock change has just exposed a pre-existing
> race?

I'm still confirming, but AFAICT libxl does the right thing and writes
state=Closing and waits for it to hit state=Closed before tearing down
the backend directory. AFAICS it is not timing out while waiting.

Looking at the netback side though it seems like netback_remove is
switching to state=Closed _before_ it calls kobject_uevent(...,
KOBJ_OFFLINE) and it is this which generates the call to netback_uevent
which tries and fails to read script and produces the error message.

Since switching to state=Closed is what prompts libxl to go and delete
the xenstore backend dir, it seems possible that netback_uevent might
not run until the xenstore script key is already gone, prompting it to
write the error nodes. Is there anything else which might guard against
that possibility?

Handwaving a bit (ok, a lot) it's possible that the change of spinlocks
has caused a commonly won race to become a commonly lost one at least
under these circumstances.

My theory is that this is exacerbated on arndale for two reasons: the CPU
is relatively slow (even compared to cubietruck, which has the same core
but faster DRAM etc), and it is dual core while the failing test case
involves a 4 vcpu guest (which is a bit dumb but not invalid), which
loads things even more.

I'm still slightly concerned that perhaps the new spinlock stuff has
some sort of bad behaviour either on arndale specifically or more
generally for ARM systems which has pushed this particular case over the
edge.

Ian.

