Xen Users,
 
We have recently run into a few issues on Xen 3.3.1 and would
appreciate it if one of you could shed some light on them.
 
First of all, our system configuration is:
- a dual Xeon 2.5GHz with 16GB of RAM (8 cores)
- Xen 3.3.1 from the latest xensource distribution, with Linux
kernel 2.6.18.8-xen
- Dom0 is CentOS 5.2, upgraded a few days ago to CentOS 5.3
- 6 HVM domUs are running: the 5 with sporadic issues (see below)
are Fedora-10 x86_64, and 1 domU (no issues so far) is Windows 2003
- The 5 Fedora-10 domUs have the latest package upgrades,
including Linux kernel 2.6.27.19-170.2.35.fc10.x86_64. They have 2 vCPUs each,
between 512MB and 1GB of memory, and 30GB of disk space stored on an internal
SATA disk
- Dom0's vCPU is pinned to core 0 (dom0_vcpus_pin)
- The domUs visibly share cores 1 to 7 (xm vcpu-list),
although no configuration was done to map them to specific CPUs/cores
 
Now here are our observations:
 
(1) The Fedora-10 domUs described above randomly and partially
(see below) freeze after running for several hours.
- If there is a pre-existing ssh session on a hung domU, some
commands such as 'ls', 'ps', 'tail -f <file>' and 'free' can still be
executed, while commands such as 'top' and 'vmstat' hang; sometimes
no command works at all
- xentop displays 0% activity on a hung domU, although I once
observed 100% on another hung one
- There is nothing significant in domU:/var/log/messages,
nor in dom0:/var/log/xen/qemu-dm-…
- Nagios running on dom0 doesn't really pick this condition up,
as the hung domUs still answer ping and Nagios's ssh check; note that ssh
to a hung domU doesn't work, even though the basic Nagios TCP check on
port 22 succeeds
- Their time is completely off (see the next observation
below), with or without ntpd running
- I had the opportunity to run 'free' on a few of them, and they
appeared to have enough free memory, i.e. they were not swapping at all
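Since the basic TCP check on port 22 still passes on a hung domU, one idea (untested on our side) would be a check that also waits for sshd's identification line, which per RFC 4253 begins with "SSH-". A minimal sketch, assuming a Python 3 interpreter on dom0; the function names and the 10-second timeout are our own choices, not part of Nagios:

```python
import socket


def is_ssh_banner(data: bytes) -> bool:
    """True if data starts with an SSH identification string ("SSH-...")."""
    return data.startswith(b"SSH-")


def ssh_banner_ok(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Hypothetical helper: connect and wait up to `timeout` s for the SSH banner.

    A plain TCP check only proves the port accepts connections; this also
    requires sshd to actually send its identification line.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = b""
            # Read until the end of the first line (or a sane size limit),
            # in case the banner arrives in more than one segment.
            while b"\n" not in data and len(data) < 255:
                chunk = sock.recv(256)
                if not chunk:  # server closed the connection
                    break
                data += chunk
    except OSError:  # refused, unreachable, or timed out
        return False
    return is_ssh_banner(data)
```

Wrapped in a small script that exits 0 when ssh_banner_ok() returns True and 2 otherwise, Nagios could presumably run it like any other plugin, since those are its OK and CRITICAL exit codes.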
 
=> I don't want to speculate on the potential root cause; nevertheless,
what would be the most effective next troubleshooting steps?
o Force a domU system dump? And then?
o Dig deeper into the dom0 logs, although a quick look through
them turned up nothing?
o Disable most of the processes on one of these domUs to determine
whether a user process can cause this issue (may be very time-consuming)?
o Set the runlevel to 3 instead of 5?
o etc.
 
(2) The 5 Fedora-10 domUs do not keep their time in sync
 
We have read various pages about time management for a Linux
domU, but we haven't yet found anything conclusive and/or haven't been
able to set this up properly. The facts are:
 
- Our dom0 runs ntpd and is perfectly synchronized with
external public NTP sources
- We initially tried running ntpd on the Fedora-10 domUs,
configured with external public sources, which proved unsuccessful; the
time is usually off by a few minutes
- We tried without ntpd, which should be the proper
configuration according to our reading, as the domUs' hardware clocks
should sync up with their dom0's hardware clock; alas, still unsuccessful.
In this case, the domUs end up lagging significantly behind their dom0's time
- We have read on a few occasions that a parameter must be set
with echo 1 > /proc/sys/xen/independent_wallclock in order to
run ntpd on a domU, but /proc/sys/xen doesn't exist on these Fedora-10
domUs. Is this expected behavior? Should we assume the independent_wallclock
setting is only for PV domUs?
- Note that one of the domUs is a 32-bit Windows 2003 server
and is perfectly on time, i.e. in sync with its dom0. It runs the default
Windows Time service; no ntpd is installed
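To put numbers on the drift, each domU could query dom0's ntpd directly and compute its offset with the usual NTP estimate ((t2 - t1) + (t3 - t4)) / 2. A minimal SNTP sketch, assuming Python 3 in the domU and that dom0's ntpd answers time requests on UDP port 123 from the domUs; "dom0" in the usage below is a placeholder hostname, and the helper names are our own:

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """NTP offset: t1 client send, t2 server receive, t3 server send, t4 client receive."""
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(server: str, timeout: float = 5.0) -> float:
    """Hypothetical helper: offset (s) of `server`'s clock relative to the local one."""
    packet = bytearray(48)
    packet[0] = 0x23  # LI=0, version 4, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
        t4 = time.time()
    if len(data) < 48:
        raise ValueError("short NTP reply")
    # Receive timestamp is at bytes 32-39, transmit timestamp at bytes 40-47.
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", data[32:48])
    return clock_offset(t1, ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f), t4)
```

Running print(query_offset("dom0")) periodically on each domU would give a drift curve; a positive value means dom0's clock is ahead, which is what the lag described above would suggest. `ntpdate -q dom0` should report a comparable figure.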
 
(3) The 5 Fedora-10 domUs were installed as HVM domUs, but their
kernels appear to see them as PV. This may be a misunderstanding on our
side; however, dmesg on the 5 Fedora-10 domUs shows the message:
"Booting paravirtualized kernel on bare hardware"
 
We just installed an HVM CentOS 5.3 domU, and this time the
kernel boot message "Booting …" doesn't appear.
 
Therefore, can we conclude that the presumed HVM Fedora-10 domUs
are in fact PV domUs?
Should /proc/sys/xen be present on a PV domU, or on any
type of domU?
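To compare the guests side by side, a small sketch could pull the platform name out of that boot line (the text after "on") and check for /proc/sys/xen in one go. It assumes Python 3 inside each domU and that the dmesg buffer still holds the boot messages; the helper names are our own:

```python
import os
import re
import subprocess

# Matches the boot line quoted above, e.g.
# "[    0.000000] Booting paravirtualized kernel on bare hardware"
_PV_BOOT_RE = re.compile(r"Booting paravirtualized kernel on ([^\n]*\S)")


def pv_platform(dmesg_text: str):
    """Return the platform name printed after 'on', or None if the line is absent."""
    match = _PV_BOOT_RE.search(dmesg_text)
    return match.group(1) if match else None


def xen_proc_present(path: str = "/proc/sys/xen") -> bool:
    """True if the /proc/sys/xen directory exists in this guest."""
    return os.path.isdir(path)


def read_dmesg() -> str:
    """Hypothetical helper: current kernel ring buffer as text."""
    return subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout


def summarize(dmesg_text: str, proc_path: str = "/proc/sys/xen") -> dict:
    """Collect both observations for one guest."""
    platform = pv_platform(dmesg_text)
    return {
        "pv_boot_line": platform if platform is not None else "(absent)",
        "proc_sys_xen": xen_proc_present(proc_path),
    }
```

Calling summarize(read_dmesg()) on a Fedora-10 domU and on the CentOS 5.3 domU would put the two answers next to each other.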
 
Thank you,
Lionel Raynaud.