
Re: [Xen-devel] vpmu=1 and running 'perf top' within a PVHVM guest eventually hangs dom0 and hypervisor has stuck vCPUS. Romley-EP (model=45, stepping=2)

On Wed, Mar 13, 2013 at 08:33:15AM +0000, Jan Beulich wrote:
> >>> On 12.03.13 at 18:30, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> > This issue I am encountering seems to only happen on multi-socket
> > machines.
> > 
> > It also does not help that the only multi-socket box I have is
> > a Romley-EP (so two-socket SandyBridge CPUs). The other
> > SandyBridge boxes I have (one socket) are not showing this. Granted,
> > they are also a different model (42).
> > 
> > The problem is that when I run 'perf top' within an SMP PVHVM
> > guest, after a couple of seconds or minutes the guest hangs.
> > The hypervisor ends up stuck looping too, and then dom0 ends
> > up hanging as well.
> > 
> > Dumping the CPU registers (Ctrl-A x3, then 'd')
> > shows that the guest is pretty firmly stuck in vmx_vmexit_handler:
> > 
> > (XEN)    [<ffff82c4c01d386f>] vmx_vmexit_handler+0x22f/0x174
> > 
> > and if I let this stay for some time, dom0 detects that some
> > of its VCPUs are hung and it resorts to sending an NMI. NMI
> > is not implemented in pv-ops and then dom0 wedges. In some
> > cases it also wedges itself when doing 'xl list' or any up-calls
> > to the hypervisor.
> Did you try running Xen with its watchdog (and perhaps Dom0
> without)?

Just now I ran it with 'watchdog=1' on the Xen hypervisor line and it
did not spot any issues with the guest. It naturally spotted an
issue with vcpu_sleep_sync, as it was hung/spinning, and gave me a grave
stack trace - which was exactly the same as what 'd' showed.

> > Anyhow, following with Ctrl-A x3, then 'v' tells me:
> > 
> > (XEN) Virtual processor ID = 0x0c02
> > .. snip..
> > (XEN) Virtual processor ID = 0x0fc4
> > (XEN)   VCPU 3
> > 
> > and stays stuck there. Doing the 'Ctrl-A x3' and 'd' to
> > see where it is stuck tells me:
> Perhaps sending 'd' without first sending 'v' might better show where
> the original hang is?

Did that too (I think it was part of the serial output). If I did 'd'
it would tell me that the VCPUs for the guest were all in vmx_vmexit_handler
and also give me a stack dump of the guest. There was no 'vcpu_sleep_sync'
either; the 'vmcs_dump' had never run.

I originally thought that this meant the vmx_vmexit_handler is somehow stuck
- but maybe that is the OK state - meaning when a guest is busily doing
VMEXIT/VMENTER continuously, that is what we would see on the hypervisor.

Looking at the guest stack provided with 'd' made for some interesting
observations. It looks as if one vcpu is doing something in
__switch_context, and two others are in ticket_spin_lock! Then I realized
that in the past I did have to provide a PauseLoopExit value, as the
default would never let me launch an RHEL5 HVM guest.

Adding 'ple_gap=0' to the Xen hypervisor line now seems to be masking the
issue (or perhaps fixing it?). But I doubt it is the fix, as Boris saw this
same issue when running a UP PVHVM guest - and dom0 was running
on a laptop.
