
Re: [Xen-devel] vpmu=1 and running 'perf top' within a PVHVM guest eventually hangs dom0 and hypervisor has stuck vCPUS. Romley-EP (model=45, stepping=2)

We also hit the issue that Dietmar's workaround fixes. I remember the
two of us discussed it by email at the time.

The cause of the interrupt loop is as follows:
It seems that on NHM (at that time), when a PMI arrives at the CPU, the
counter has a value of zero (instead of some other small value, say 3
or 5, as seen on Core 2 Duo). In this case, unmasking the PMI via the
APIC immediately triggers another PMI.
This does not cause a problem with a native kernel, since it typically
programs the counter with another value (as needed to set up the next
sampling point) before unmasking.
For Xen, the PMI handler cannot handle the counter immediately, since
that should be done by the guest. It just records a virtual PMI for the
guest and unmasks the PMI before returning.
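
To make the sequence above concrete, here is a toy model in C. It is
not Xen or Linux code: every name in it (toy_count_pmis, TOY_NATIVE,
TOY_XEN_DEFER, TOY_PERIOD, TOY_MAX_PMIS) is made up for illustration,
and it simply assumes the behaviour described above, i.e. that
unmasking the PMI while the counter still reads zero raises another
PMI at once.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for the toy model only. */
#define TOY_PERIOD   100000u  /* sampling period a native kernel re-arms with */
#define TOY_MAX_PMIS 1000u    /* cap so the "endless" loop terminates here */

enum toy_handler {
    TOY_NATIVE,     /* reprograms the counter, then unmasks the PMI */
    TOY_XEN_DEFER   /* records a virtual PMI for the guest, then unmasks */
};

/*
 * Count the PMIs taken for a single overflow, given the value the
 * counter holds when the first PMI arrives.
 * Assumption (from the description above, not from any spec): a PMI
 * is raised again on unmask if the counter still reads zero.
 */
static unsigned toy_count_pmis(uint64_t value_at_pmi, enum toy_handler h)
{
    uint64_t counter = value_at_pmi;
    unsigned pmis = 0;
    bool pending = true;

    while (pending && pmis < TOY_MAX_PMIS) {
        pmis++;                                   /* PMI delivered, LVT entry masked */
        if (h == TOY_NATIVE)
            counter = (uint64_t)0 - TOY_PERIOD;   /* re-arm for the next sample */
        /* TOY_XEN_DEFER leaves the counter for the guest to reprogram later. */
        pending = (counter == 0);                 /* unmask: re-raised only at zero */
    }
    return pmis;
}
```

In this model, toy_count_pmis(0, TOY_NATIVE) and
toy_count_pmis(3, TOY_XEN_DEFER) each take a single PMI, while
toy_count_pmis(0, TOY_XEN_DEFER) runs until the cap, which is the
loop we saw.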

We don't know whether this is the intended HW behavior, but we hope to
get confirmation from the internal HW team quickly.

Shan Haitao

2013/3/13 Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx>:
> Am Dienstag 12 März 2013, 16:54:11 schrieb Boris Ostrovsky:
>> On 03/12/2013 04:31 PM, Konrad Rzeszutek Wilk wrote:
>> > On Tue, Mar 12, 2013 at 02:50:59PM -0400, Boris Ostrovsky wrote:
>> >> On 03/12/2013 01:30 PM, Konrad Rzeszutek Wilk wrote:
>> >>> This issue I am encountering seems to only happen on multi-socket
>> >>> machines.
>> >> I believe I was able to reproduce this (once) on my laptop.
>> >>
>> >>> It also does not help that the only multi-socket box I have is
>> >>> a Romley-EP (so two socket SandyBridge CPUs). The other
>> >>> SandyBridge boxes I have (one socket) are not showing this. Granted
>> >>> they are also a different model (42).
>> >>>
>> >>> The problem is that when I run 'perf top' within an SMP PVHVM
>> >>> guest, after a couple of seconds or minutes the guest hangs.
>> >>> Hypervisor ends up stuck too looping, and then the dom0 ends
>> >>> up hanging as well.
>> >>>
>> >>> Dumping the cpu registers (Ctrl-A x3, then 'd')
>> >>> shows that the guest is pretty firmly stuck in vmx_vmexit_handler:
>> >>>
>> >>> (XEN)    [<ffff82c4c01d386f>] vmx_vmexit_handler+0x22f/0x174
>> >> And in my case this address is the second instruction after STI, i.e. we
>> >> are right at the point where interrupts got enabled.
>> >>
>> >> So I am wondering whether this has something to do with the counter
>> >> overflow interrupt (which I believe is an NMI).
>> > Interestingly enough, if I run the PVHVM guest with 'nowatchdog'
>> > it runs fine!
>> I think by default perf top runs off timer interrupt so it does not use
>> HW counters. But watchdog
>> is implemented on top of the counters so perhaps it fires the interrupt
>> at a bad time, messing
>> something up.
> This looks like a strange behavior we had on nehalem cpus see
> http://lists.xen.org/archives/html/xen-devel/2010-11/msg01157.html
> For this I added a quirk, see check_pmc_quirk() in vpmu_core2.c
> The model 42 is in the quirk list and it seems to work but Romley-EP is model
> 43 I think which is not in the list.
> Maybe you should add this model and give it a try.
> Dietmar.
> --
> Company details: http://ts.fujitsu.com/imprint.html
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
