
Re: [Xen-devel] State of current Xen debugger


  • To: Roger Cruz <roger.cruz@xxxxxxxxxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, Tim Deegan <Tim.Deegan@xxxxxxxxxx>
  • From: Keir Fraser <keir@xxxxxxx>
  • Date: Tue, 28 Sep 2010 18:06:33 +0100
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 28 Sep 2010 10:07:50 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-topic: [Xen-devel] State of current Xen debugger

Yeah, but the performance counters are driven by the same LAPIC timesource
that drives the main LAPIC timer.

 -- Keir

On 28/09/2010 16:40, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:

> 
> 
> By the APIC timer?  When I traced this code I was under the impression that it
> is driven by the performance counters counting cycles and generating an
> interrupt when the counter overflows.  I found this was the routine being
> called to set up the watchdog:
> 
> static void __pminit setup_p6_watchdog(unsigned counter)
> {
>     unsigned int evntsel;
> 
>     nmi_perfctr_msr = MSR_P6_PERFCTR0;  <--- register
> 
>     clear_msr_range(MSR_P6_EVNTSEL0, 2);
>     clear_msr_range(MSR_P6_PERFCTR0, 2);
> 
>     evntsel = P6_EVNTSEL_INT
>         | P6_EVNTSEL_OS
>         | P6_EVNTSEL_USR
>         | counter;
> 
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
>     write_watchdog_counter("P6_PERFCTR0");
>     apic_write(APIC_LVTPC, APIC_DM_NMI);
>     evntsel |= P6_EVNTSEL0_ENABLE;
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
> }
> 
> and then, in the NMI tick handler, this path was executed:
> 
>         else if ( nmi_perfctr_msr == MSR_P6_PERFCTR0 )
>         {
>             /*
>              * Only P6 based Pentium M need to re-unmask the apic vector but
>              * it doesn't hurt other P6 variants.
>              */
>             apic_write(APIC_LVTPC, APIC_DM_NMI);
>         }
>         write_watchdog_counter(NULL);
> 
> 
> 
> static inline void write_watchdog_counter(const char *descr)
> {
>     u64 count = (u64)cpu_khz * 1000;
> 
>     do_div(count, nmi_hz);
>     if(descr)
>         Dprintk("setting %s to -0x%08Lx\n", descr, count);
>     wrmsrl(nmi_perfctr_msr, 0 - count);
> }
> 
> 
> It is also my understanding that during the CPU C3 state change in cpu_idle.c,
> the APIC timer is turned off.  See the comments below.
> 
>         /*
>          * Before invoking C3, be aware that TSC/APIC timer may be
>          * stopped by H/W. Without carefully handling of TSC/APIC stop issues,
>          * deep C state can't work correctly.
>          */
>         /* preparing APIC stop */
>         lapic_timer_off();  <------------- APIC timer appears to be turned off here.
> 
>         /* Get start time (ticks) */
>         t1 = inl(pmtmr_ioport);
>         /* Trace cpu idle entry */
>         TRACE_2D(TRC_PM_IDLE_ENTRY, cx->idx, t1);
>         /* Invoke C3 */
>         acpi_idle_do_entry(cx);
>         /* Get end time (ticks) */
>         t2 = inl(pmtmr_ioport);
> 
>         /* recovering TSC */
>         cstate_restore_tsc();  <----- this is our backport of an unstable patch to keep TSCs synchronized
>         /* Trace cpu idle exit */
> 
> 
> Thanks Keir!
> 
> Roger
> 
> -----Original Message-----
> From: Keir Fraser on behalf of Keir Fraser
> Sent: Tue 9/28/2010 11:30 AM
> To: Roger Cruz; Dan Magenheimer; Tim Deegan
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] State of current Xen debugger
> 
> On 28/09/2010 16:21, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>> I am still chasing this hard hang in our system with a modified 3.4.2 xen.  I
>> have upgraded the BIOS and the problem still exists.  The only thing that so
>> far had appeared to work was adding max_cstate=0, but now I have a report of
>> one customer whose machine still hung with that flag enabled.  The rest of
>> them have been running successfully for more than a week with this
>> "work-around".  I have isolated the problem to Lenovo machines with Centrino
>> processors.  These will stop the TSC when in C3.
>> 
>> What I really need to understand is why the NMI watchdog in Xen is not
>> firing and forcing a crash when the CPU hangs.  I was under the impression
>> that NMIs couldn't be masked at all.  Is there any way that Xen could be
>> disabling or changing that behavior?  I know the NMI is being driven by a
>> timer set in the NMI handler.  Could there be a case where this timer is
>> disabled?  Any ideas are welcome!
> 
> The NMI counter gets driven by the APIC timer. Perhaps it needs poking
> somehow on wakeup from C3? My suggestion for debugging this would be to take
> a look at what native Linux does. The NMI perfctr poking logic was all taken
> from (rather old now) upstream Linux.
> 
>  -- Keir
> 
>> Thanks
>> Roger R. Cruz
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Roger Cruz
>> Sent: Tuesday, September 14, 2010 11:55 AM
>> To: Dan Magenheimer; Tim Deegan
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> Hi Dan,
>> 
>> I am using 3.4.2 where we have made very minor modifications (some backports,
>> for example).
>> 
>> I have not tried your suggestions yet, so I will do that next. Thanks!
>> 
>> R.
>> 
>> -----Original Message-----
>> From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
>> Sent: Tue 9/14/2010 11:19 AM
>> To: Roger Cruz; Tim Deegan
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> A couple of thoughts:
>> 
>> Have you tried max_cstate=0 (as a Xen boot option)?
>> 
>> Also, you didn't say what version of Xen you are using, but playing around
>> with hpet_broadcast (enabling it or force-disabling it as below) might be
>> worth a try.
>> 
>> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html
>> 
>> 
>> 
>> From: Roger Cruz [mailto:roger.cruz@xxxxxxxxxxxxxxxxxxx]
>> Sent: Tuesday, September 14, 2010 8:56 AM
>> To: Tim Deegan
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> 
>> 
>> Hi Tim, good to hear from you again.
>> 
>> I had a pretty good inkling that one of you hardcore developers would say
>> that :-)  Yes, it is pretty well wedged.  I can cause the problem more rapidly
>> by dropping to a single CPU.  When the hang happens, the Xen console is
>> completely dead.  None of the special keys work.
>> 
>> I do have hopes a BIOS upgrade could fix this as a last resort, but I want to
>> see if at least I can understand the problem.  We have a few different
>> machines that are exhibiting similar symptoms, so I have to see if I can find
>> a work-around without requiring every user to upgrade their BIOS :-(
>> 
>> Just in case, what debugger have you been using?  Are there recent
>> instructions on how to set it up that you can point me to?
>> 
>> Thanks
>> Roger
>> 
>> 
>> -----Original Message-----
>> From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
>> Sent: Tue 9/14/2010 10:30 AM
>> To: Roger Cruz
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: Re: [Xen-devel] State of current Xen debugger
>> 
>> Hi,
>> 
>> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
>>> I am trying to debug a problem where the hypervisor is hanging hard.
>>> Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
>>> up a debugger.
>> 
>> Sorry to bring a counsel of despair but if the NMI watchdog isn't
>> working then your chances of getting a working debugger are slim.  It's
>> likely that at least one CPU is very very stuck.  Does the 'd' debug key
>> work on the serial line when the machine is wedged?
>> 
>> On a more cheerful note, I've twice seen hard hangs like this that
>> turned out to be hardware issues, fixable with BIOS upgrades.
>> 
>> Cheers,
>> 
>> Tim.
>> 
>>> What is the state of the current debuggers out there?
>>> Any input on how I should set it up (kdb, gdb, etc) and pointers to a
>>> good wiki page are much appreciated.  I did perform a Google search
>>> and found some links but I want to hear from the current developers as
>>> to what is most stable and useful for debugging this type of hard
>>> hang.  I only have a serial port PCI-express card to use as the laptop
>>> has no built in port.
>> 
>> --
>> Tim Deegan <Tim.Deegan@xxxxxxxxxx>
>> Principal Software Engineer, XenServer Engineering
>> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
>> 
>> No virus found in this incoming message.
>> Checked by AVG - www.avg.com
>> Version: 9.0.851 / Virus Database: 271.1.1/3119 - Release Date: 09/14/10
>> 02:35:00
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel
> 
> 
> 
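
As a rough illustration of the reload arithmetic in the quoted
write_watchdog_counter(), here is a standalone C sketch. The helper names
watchdog_period_cycles() and watchdog_reload_value() are hypothetical and
do not exist in Xen. The sketch only reproduces the computation
count = cpu_khz * 1000 / nmi_hz and the negated value passed to wrmsrl().
It does not model the hardware counter width, the event being counted, or
any C-state behaviour.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Xen function): number of counted events
 * between two watchdog NMIs, mirroring
 *     count = (u64)cpu_khz * 1000; do_div(count, nmi_hz);
 * from the quoted code.
 */
static inline uint64_t watchdog_period_cycles(uint32_t cpu_khz, uint32_t nmi_hz)
{
    assert(nmi_hz != 0);                    /* avoid division by zero */
    return ((uint64_t)cpu_khz * 1000u) / nmi_hz;
}

/*
 * Hypothetical helper: the value the quoted code hands to wrmsrl(),
 * i.e. "0 - count". The counter is loaded with a negative value, counts
 * upwards, and raises the interrupt when it overflows past zero.
 */
static inline uint64_t watchdog_reload_value(uint32_t cpu_khz, uint32_t nmi_hz)
{
    return 0 - watchdog_period_cycles(cpu_khz, nmi_hz);
}
```

With an assumed cpu_khz of 2000000 (a 2 GHz part) and nmi_hz of 1, the
period comes out as 2,000,000,000 counted events per NMI. The practical
point for this thread is that the NMI only arrives after that many events
have actually been counted, so anything that stops the counter from
advancing also stops the watchdog.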




