
Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.



> -----Original Message-----
> From: Ian Campbell [mailto:ijc@xxxxxxxxxxxxxx]
> Sent: Thursday, November 08, 2012 3:29 PM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> Cc: mrsanna1@xxxxxxxxx; 599161@xxxxxxxxxxxxxxx; xen-
> devel@xxxxxxxxxxxxx; keir@xxxxxxx; JBeulich@xxxxxxxx
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 
> minutes" bug.
> 
> 
> I think Jan was asking for information relating to the system you saw this
> on -- or are you working on the same systems as Mauro?

Oops, excuse me, here is a description: I have the problem on 4 systems, all
with the same hardware.
The problem occurred on each system about once every 2 months on average. Since
January 2012, I have rebooted them all monthly,
and the clock jump occurred only once, in February ...

SYSTEM :  HP ProLiant DL385 G7, with 2 * AMD Opteron(tm) Processor 6174
          (12 cores each) = 24 cores, 16 GB memory
XEN :     (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@xxxxxxxxxx)
          (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
DOM0 :    Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012
          x86_64 GNU/Linux
CPU :     (first entry of /proc/cpuinfo)
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6174
stepping        : 1
cpu MHz         : 3791872.477
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush mmx 
fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow constant_tsc 
rep_good nonstop_tsc extd_apicid amd_dcm pni cx16 popcnt hypervisor lahf_lm 
cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch nodeid_msr
bogomips        : 4400.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

> 
> Of course additional information from Mauro would be useful too in order to
> help spotting any patterns.
> 
> > > Philippe, could you clarify again what CPU model(s) this is being
> > > observed on (the long times between individual steps forward with
> > > this problem perhaps warrant repeating the basics each time, as it's
> > > otherwise quite cumbersome to always look up old pieces of
> > > information).
> >
> > can you provide this information ?
> >     cat /proc/cpuinfo
> >     cat /proc/meminfo
> >     hardware information (manufacturer, model, urls, ...)
> >
> > Thanks, Philippe
> >
> >
> > > -----Original Message-----
> > > From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > > Sent: Thursday, November 08, 2012 10:40 AM
> > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> > > Cc: 599161@xxxxxxxxxxxxxxx; mrsanna1@xxxxxxxxx; Ian Campbell; xen-
> > > devel@xxxxxxxxxxxxx
> > > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
> > > shifts by 50 minutes" bug.
> > >
> > > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xxxxxxx> wrote:
> > > > On 07/11/2012 13:22, "Ian Campbell" <ijc@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> > > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > > >>>> tsc_stamp=e3839fcb0273
> > > >>>
> > > >>> (below is the complete xm dmesg output)
> > > >>>
> > > >>> did that help you ? do you need more info ?
> > > >>
> > > >> I'll leave this to Keir (who wrote the debugging patch) to answer
> > > >> but it looks to me like it should be useful!
> > > >
> > > > I'm scratching my head. plt_wrap is earlier than plt_now, which
> > > > should be impossible. plt_stamp64 oddly has low 32 bits identical
> > > > to new_stamp. That seems very very improbable!
> > >
> > > Is it? My understanding was that plt_stamp64 is just a software
> > > extension to the more narrow HW counter, and hence the low plt_mask
> > > bits would always be expected to be identical.
> > >
> > > > The plt_wrap < plt_now thing of course is entirely unexplainable to
> > > > me too: Considering that plt_scale doesn't change at all post-boot,
> > > > apart from memory corruption I could only see a memory access
> > > > ordering problem to be the reason (platform_timer_stamp and/or
> > > > stime_platform_stamp changing despite platform_timer_lock being
> > > > held). So maybe taking a snapshot of all three static values
> > > > involved in the calculation in __read_platform_stime() between
> > > > acquiring the lock and the first call to __read_platform_stime(),
> > > > and printing them together with the "live" values in a second
> > > > printk() after the one your original patch added could rule that out.
> > >
> > > > But the box doesn't even seem to be NUMA (of course it also doesn't
> > > > help that the log level was kept restricted - hint, hint, Philippe),
> > > > nor does there appear to be any S3 cycle or pCPU bring-up/-down in
> > > > between...
> > >
> > > > Philippe, could you clarify again what CPU model(s) this is being
> > > > observed on (the long times between individual steps forward with
> > > > this problem perhaps warrant repeating the basics each time, as it's
> > > > otherwise quite cumbersome to always look up old pieces of
> > > > information).
> > >
> > > > I wonder whether the overflow handling should just be removed, or
> > > > made conditional on a command-line parameter, or on the 32-bit
> > > > platform counter being at least somewhat likely to overflow before
> > > > a softirq occurs -- it seems lots of systems are using 14MHz HPET,
> > > > and that gives us a couple of minutes for the plt_overflow softirq
> > > > > to do its work before overflow occurs.
> > > > I think we would notice that outage in other ways. :)
> > >
> > > Iirc we added this for a good reason - to cover the, however
> > > unlikely, event of Xen running for very long without preemption.
> > > Presumably most of the cases got fixed meanwhile, and indeed a
> > > wraparound time on the order of minutes should make this
> > > superfluous, but as the case here shows that code did spot a severe
> > > anomaly (whatever that may turn out to be).
> > >
> > > Also recall that there are HPET implementations around that tick at
> > > a much higher frequency than 14MHz.
> > >
> > > So unless we finally reach the understanding that the code is
> > > flawed, I would rather want to keep it.
> > >
> > > Jan
> >
> >
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel