
[Xen-devel] rimava0 (Re: [xen-4.6-testing test] 66466: trouble: broken/fail/pass)



osstest service owner writes ("[xen-4.6-testing test] 66466: trouble: 
broken/fail/pass"):
> flight 66466 xen-4.6-testing real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/66466/
> 
> Failures and problems with tests :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-qemut-rhel6hvm-amd  3 host-install(3) broken REGR. vs. 65639

This is a timeout waiting for rimava0 to fetch its preseed file.

rimava0 was attended to by Yogesh Patel of Credativ (CC'd) on
Thursday, and we found it worked.  The only thing we think we did
that could have made it work was pushing all the cables home.  We
don't really know what changed, although Yogesh had a vague feeling
that the power cable was not properly inserted into the PDU and that
he might have seated it better.

rimava0 then seems to have been rebootable from 2015-12-17 13:57:55 Z
to 2015-12-18 12:04:53 Z (at least).  After that it experienced a
number of successive failures.

I just allocated the machine to myself and found it had been left up
and booted by some previous test job (presumably, the last one that
failed).  Power cycling it caused it to reboot as expected.  I
confirmed the boot order settings in the BIOS.  On exiting the BIOS it
then started running the d-i autoinstall that had been left set up in
the PXE area by the most recent failed test job, as I would expect.

I looked at the serial log and it shows the machine having apparently
been up between Dec 18 12:20:29 (when 66466
test-amd64-i386-migrupgrade xen-boot/dst_host completed) and Dec 18
16:15:59 (when I manually power cycled it).  There is lots of Xen
debug keys output, which will be from the log capture steps of the
failed jobs.  This indicates that the machine was actually
continuously up, and responsive to the serial port, for all of this
time.  The attempts to power cycle it had not actually rebooted it.
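
For reference, the length of that continuous-uptime window can be
worked out from the two serial log timestamps quoted above.  This is
only a sketch of the arithmetic: the year 2015 is an assumption
inferred from the dates elsewhere in this message, and both
timestamps are assumed to be in the same timezone.

```python
from datetime import datetime

# Timestamps quoted from the serial log discussion; the year is assumed.
last_job_done = datetime(2015, 12, 18, 12, 20, 29)
manual_power_cycle = datetime(2015, 12, 18, 16, 15, 59)

# Window during which the machine appears never to have rebooted.
uptime = manual_power_cycle - last_job_done
print(uptime)  # 3:55:30
```

So any power cycle attempted by a test job within those roughly four
hours did not actually reboot the machine.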

I have two theories:


1. Depriving rimava0 of power for a mere 15s (the previous
configuration of PowerCycleTime) is not sufficient to cause rimava0
to reboot, particularly if it is very idle.

This is not particularly convincing.  It does not explain the previous
problem where rimava0 apparently wouldn't boot even when I turned the
power off for (IIRC) 2 minutes; this led me to ask Credativ to
investigate in person.  It does not explain why rimava1 is not
affected.  It does not explain very well why the failures clump so
much.

It also doesn't explain why the serial log doesn't show "Modem lines
changed" messages, which sympathy (our serial concentrator) usually
reports for this machine when the power goes off or on.  In a
series of manual poweron/poweroff tests I found that this modem line
change appeared in the sympathy client UI almost immediately.
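
The check described above (looking for "Modem lines changed" in the
captured serial output) could be scripted roughly as follows.  This
is a minimal sketch, not part of osstest or sympathy: the function
name and the sample log lines are hypothetical, and the only thing
taken from the real setup is the message text itself.

```python
def find_modem_line_changes(log_lines, marker="Modem lines changed"):
    """Return (line_number, text) for each log line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]

# Hypothetical sample data; the real serial log format may differ.
sample = [
    "Dec 18 12:20:29 (hypothetical) some Xen debug key output\n",
    "Dec 18 16:15:59 (hypothetical) Modem lines changed\n",
]
hits = find_modem_line_changes(sample)
for number, text in hits:
    print(number, text)
```

An empty result over a period in which power cycles were attempted
would be consistent with the power never actually having been
removed.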

However, this theory is fairly easy to test.  I have just set the
PowerCycleTime to 120s which should surely be enough, and thrown the
machine back.


2. The PDU has a fault (eg, a sticky relay).  This would explain many
of the symptoms.

We could test this possibility by using a known-good PDU port.  Eg, we
could borrow the port assigned to one of the removed machines.  But
this would require a site visit.

I intend to see what the results look like from the runs over the
weekend and then decide what to do next.


Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

