This might be related to a posting a couple of days ago on random
reboots, but the problem arises from a different environment and
situation.
We are running a two-node cluster. Both nodes run Debian Squeeze
6.0.2.1 + Xen 4.02 on top of OCFS2 1.4.4-3. Kernel is
2.6.32-5-xen-amd64. Both nodes store and run vms on the ocfs2
partition, which is accessed from the 2 boxes via iSCSI. We run a
network stress test in which the 2 vms pass a large file between
them. One vm has an nfs share with the file in it, and the other vm
copies this file (arbitrarily, a large 4.6 GB debian.iso file) back
and forth between the nfs share and its own local directory.
Currently, the network configuration is giving us no problems: no
lost packets, collisions, etc.
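For anyone wanting to reproduce the load, the copy loop can be sketched roughly as below. This is only an illustration, not the exact script in use: the function name stress_copy, the paths, and the round count are placeholders, and the checksum and timestamp lines are additions meant to make it easier to line up the last completed round with the time of a reboot.

```shell
#!/bin/sh
# Hypothetical stress loop (illustrative only): copy a file from the NFS
# share to a local directory and back again, verifying the checksum each round.
stress_copy() {
    src_file=$1     # file on the NFS share (placeholder path)
    local_dir=$2    # local directory on the copying vm (placeholder path)
    rounds=$3       # number of round trips to perform
    name=$(basename "$src_file")
    want=$(md5sum < "$src_file") || return 1
    i=1
    while [ "$i" -le "$rounds" ]; do
        # NFS share -> local directory
        cp "$src_file" "$local_dir/$name" || return 1
        # local directory -> NFS share, under a different name
        cp "$local_dir/$name" "$src_file.copy" || return 1
        got=$(md5sum < "$src_file.copy")
        if [ "$got" != "$want" ]; then
            echo "round $i: checksum mismatch" >&2
            return 1
        fi
        # Timestamped progress line, to compare against the reboot time
        echo "$(date '+%Y-%m-%d %H:%M:%S') round $i ok"
        i=$((i + 1))
    done
}
```

It would be called as, for example, `stress_copy /mnt/nfs/debian.iso /home/user/tmp 1000`, where both paths are made up for the example.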
The vms are lucid instances (ubuntu 10.04) created by the following
command:
sudo xen-create-image --hostname lucidxentest \
    --ip 163.1.86.9 --pygrub
plus these xen-tools.conf params: size = 8 Gb, image = full,
mem. = 512, swap = 512
The stress test proceeds successfully for anywhere from 1 to 12
hours, then one of the nodes reboots. The file copy is interrupted
and the vms crash.
I have noticed occasional reports of a kernel error
(linux/mm/slub.c 2969!), similar to a Debian bug (#634047). But I
find no firm correlation, as kern.log and the messages log often do
not report this kernel error.
Some things I have tried:
a basic reinstall of all the components of the system (squeeze +
xen + ocfs2)
a memtest on both nodes (no problems)
changing the default Debian IO scheduler in combination with ocfs2:
cfq, deadline, anticipatory, noop
still to investigate (not yet tried) are adjusting: (1) the halt
state set in the BIOS; (2) the setting of cpufreq=dom0-kernel,
frequency scaling.
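For reference, one way to check which IO scheduler is active at runtime is through sysfs. The sketch below assumes the disk shows up as sda, which is a placeholder; the actual iSCSI device name will differ. The helper name active_scheduler is made up for the example.

```shell
#!/bin/sh
# Print the active scheduler from a sysfs scheduler line.
# The kernel marks the active one with square brackets, e.g.
#   "noop anticipatory deadline [cfq]"
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Usage (device name sda is an assumption):
#   active_scheduler < /sys/block/sda/queue/scheduler
# Switching at runtime, as root:
#   echo deadline > /sys/block/sda/queue/scheduler
```

A change made this way does not survive a reboot, so it has to be reapplied (or set with the elevator= kernel parameter) before each test run.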
Any suggestions are welcome!
Ben Weaver