Re: [Xen-devel] [XTF PATCH] xtf-runner: fix two synchronisation issues



Wei Liu writes ("Re: [XTF PATCH] xtf-runner: fix two synchronisation issues"):
> On Fri, Jul 29, 2016 at 01:43:42PM +0100, Andrew Cooper wrote:
> > The runner exiting before xl has torn down the guest is very
> > deliberate, because some part of hvm guests is terribly slow to tear
> > down; waiting synchronously for teardown tripled the wallclock time to
> > run a load of tests back-to-back.
> 
> Then you won't know if a guest is leaked or it is being slowly destroyed
> when a dead guest shows up in the snapshot of 'xl list'.
> 
> Also consider that would make back-to-back tests that happen to have a
> guest that has the same name as the one in previous test fail.
> 
> I don't think getting blocked for a few more seconds is a big issue.
> It is important to eliminate such race conditions so that osstest can
> work properly.

IMO the biggest reason for waiting for teardown is that it makes it
possible to accurately identify the xtf test which was responsible
for the failure if a test reveals a bug which causes problems for the
whole host.

Suppose there is a test T1 which, in buggy hypervisors, creates an
anomalous data structure, such that the hypervisor crashes when T1's
guest is finally torn down.

If we start to run the next test T2 as soon as we see success output
from T1, we will observe the host crashing "due to T2", and T1 will
be regarded as having succeeded.

This is why in an in-person conversation with Wei yesterday I
recommended that osstest should after each xtf test (i) wait for
everything to be torn down and (ii) then check that the dom0 is still
up.  (And these two activities are regarded as part of the preceding
test step.)
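
To make the ordering concrete, here is a rough sketch of such a test
step.  It is not osstest code (osstest is not written in Python) and
every name in it is hypothetical: run_test, guest_exists and host_is_up
are caller-supplied probes, which might for example be built on
'xl list' and a liveness check of dom0.  The timeout and poll interval
are arbitrary assumptions.

```python
import time
from typing import Callable


def run_step(run_test: Callable[[], bool],
             guest_exists: Callable[[], bool],
             host_is_up: Callable[[], bool],
             timeout: float = 60.0,
             poll_interval: float = 1.0,
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.monotonic) -> str:
    """Return "pass" only if the test reported success, its guest has
    been fully torn down, and the host is still up afterwards.

    All three probes are hypothetical callables supplied by the caller.
    """
    reported_success = run_test()

    # (i) Wait for teardown.  The wait is bounded so that a leaked guest
    # is reported as a failure of this step rather than waited on forever.
    deadline = clock() + timeout
    while guest_exists():
        if clock() >= deadline:
            return "fail: guest not torn down"
        sleep(poll_interval)

    # (ii) Only now check the host, so that a crash triggered by this
    # test's teardown is charged to this step and not to the next test.
    if not host_is_up():
        return "fail: host down after teardown"

    return "pass" if reported_success else "fail: test reported failure"
```

In the T1 scenario above, run_test would return success but the host
check after teardown would fail, so the step as a whole fails and T2
is never blamed.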

If this leads to over-consumption of machine resources because this
serialisation is too slow then the right approach would be explicit
parallelisation in osstest.  That would still mean that in the
scenario above, T1 would be regarded as having failed, because T1
wouldn't be regarded as having passed until osstest had seen that all
of T1's cleanup had been done and the host was still up.  (T2 would
_also_ be regarded as failed, and that might look like a heisenbug,
but that would be tolerable.)

Wei: I need to check what happens with multiple failing test steps in
the same job.  Specifically, I need to check which one the bisector
is likely to try to attack.

Ian.
