
RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?


  • To: "Xu, Anthony" <anthony.xu@xxxxxxxxx>, "Yang, Fred" <fred.yang@xxxxxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, <xen-ia64-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Magenheimer, Dan (HP Labs Fort Collins)" <dan.magenheimer@xxxxxx>
  • Date: Fri, 23 Dec 2005 10:22:16 -0800
  • Delivery-date: Fri, 23 Dec 2005 18:25:25 +0000
  • List-id: Discussion of the ia64 port of Xen <xen-ia64-devel.lists.xensource.com>
  • Thread-index: AcYBELBGMu7ZHSeYRUaEav49mCDvgwAAVFMwAADLrDAAIbQEoAARHDxgAAFwgsAApRrdUAAOEX2AABW4J/AAUJyDYAAareHQABB8dxAAG8N+UAAfdswg
  • Thread-topic: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?

I got up early and spent several hours trying to debug
this further.  By adding timing loops and other debug code
and moving all the relevant PAL macros around, I proved
conclusively that the ia64_pal_call_static assembly routine
is not returning.  Next I added an infinite loop to the ivt
nested TLB handler (which isn't used by Xen except by some
fast paths that are currently off).  With this loop, the
error message goes away and Xen "freezes".  I think this
proves that the PAL call is inappropriately accessing some
(unpinned) data location with psr.ic off.

You should note that this is the only PAL call that requires
psr.ic to be off.  I suspect that an OS needs to be prepared
for the possibility that a fault occurs during the call.  Linux
is not prepared, so it never calls the routine.  Xen is not
prepared either.

Happy holidays!

Dan

> -----Original Message-----
> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx] 
> Sent: Thursday, December 22, 2005 7:29 PM
> To: Magenheimer, Dan (HP Labs Fort Collins); Yang, Fred; 
> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] 
> Console problem on domU on tip?
> 
> >With CONFIG_IA64_SPLIT_CACHE on, a new user may encounter
> >the problem on a shipping machine and the symptom is that
> >the machine immediately crashes when a domU is launched.
> 
> Dan,
> That means dom0 can boot with CONFIG_IA64_SPLIT_CACHE on, and
> PAL_CACHE_FLUSH has been invoked successfully in the process
> of dom0 boot. So this is not a PAL_CACHE_FLUSH issue; there
> must be some other issue. Could you provide more information
> about the crash, since we can't reproduce this issue?
> 
> Thanks.
> 
> -Anthony
> 
> 
> >-----Original Message-----
> >From: Magenheimer, Dan (HP Labs Fort Collins) [mailto:dan.magenheimer@xxxxxx]
> >Sent: December 22, 2005 21:26
> >To: Yang, Fred; Xu, Anthony; Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?
> >
> >With CONFIG_IA64_SPLIT_CACHE on, a new user may encounter
> >the problem on a shipping machine and the symptom is that
> >the machine immediately crashes when a domU is launched.
> >
> >With CONFIG_IA64_SPLIT_CACHE off, a developer may encounter
> >a different problem on an unreleased machine.
> >
> >I know that you are focused primarily on the unreleased machine,
> >but in this case, I think we should be cautious for the new user
> >as the developer knows to change the option when running
> >on the unreleased machine.
> >
> >I will spend some more time on this when I have a chance.
> >I think it is a real bug (probably PAL accessing some address
> >which isn't pinned) that occurs only on some boxes due
> >to some factor like memory configuration.
> >
> >Thanks,
> >Dan
> >
> >P.S. The debug output just before the crash was:
> >ia64_fault: General Exception: IA-64 Reserved Register/Field fault (data access): reflecting
> >
> >> -----Original Message-----
> >> From: Yang, Fred [mailto:fred.yang@xxxxxxxxx]
> >> Sent: Wednesday, December 21, 2005 10:34 PM
> >> To: Magenheimer, Dan (HP Labs Fort Collins); Xu, Anthony;
> >> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> Subject: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel]
> >> Console problem on domU on tip?
> >>
> >> Dan,
> >>
> >> Can we suggest always turning on #CONFIG_IA64_SPLIT_CACHE as
> >> the default build configuration?  People may not be aware of
> >> this build flag and miss it on each new build.
> >>
> >> All the newer-generation ia64 processors will come with
> >> split I/D caches, as discussed in the previous mail thread,
> >> and the Itanium architecture documents the possibility of
> >> split caches for future implementations.  With the option
> >> off by default, it is a potential bug for all Tiger4 systems
> >> used for daily development and for future platforms to come.
> >>
> >> Your mail also indicates that only the HP rx2620 system
> >> has the issue and not the other HP boxes.  Can you track
> >> down this issue rather than put in a kludge for the rx2620 box?
> >>
> >> Thanks,
> >>
> >> -Fred
> >>
> >>
> >> Magenheimer, Dan (HP Labs Fort Collins) wrote:
> >> > Committed (but without removal of ifdefs until we
> >> > track down this problem).
> >> >
> >> >> -----Original Message-----
> >> >> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >> >> Sent: Monday, December 19, 2005 7:15 PM
> >> >> To: Magenheimer, Dan (HP Labs Fort Collins); Tian, Kevin;
> >> >> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> >> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >> >>
> >> >> I guess the firmware on your machine may not implement
> >> >> this PAL call because there was no split I/D cache at that
> >> >> time, so when you make this PAL call it returns
> >> >> PAL_STATUS_UNIMPLEMENTED.  Could you please turn on
> >> >> CONFIG_IA64_SPLIT_CACHE and try this new patch to see
> >> >> whether your machine can boot domain0?
> >> >> If this patch works, could you please remove all the
> >> >> CONFIG_IA64_SPLIT_CACHE macros?
> >> >>
> >> >> Thanks
> >> >> -Anthony
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: Magenheimer, Dan (HP Labs Fort Collins) [mailto:dan.magenheimer@xxxxxx]
> >> >>> Sent: December 19, 2005 23:48
> >> >>> To: Xu, Anthony; Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> >>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >> >>>
> >> >>> I have been distracted tracking another bug...
> >> >>>
> >> >>> Here's where I got:
> >> >>>
> >> >>> The machine is a new (April 2005) HP rx2620 so it is
> >> >>> not old firmware.   I can't reproduce it on a machine
> >> >>> with an ITP (which does have older firmware).
> >> >>>
> >> >>> This PAL call is never used in Linux, though there is a
> >> >>> routine coded for it.  It is the only
> >> >>> PAL call coded in Linux that occurs with psr.ic off.
> >> >>>
> >> >>> The crash I am seeing occurs either during the PAL call or
> >> >>> immediately upon return.
> >> >>>
> >> >>> Is it OK to
> >> >>>
> >> >>>
> >> >>>> -----Original Message-----
> >> >>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >> >>>> Sent: Monday, December 19, 2005 2:02 AM
> >> >>>> To: Tian, Kevin; Magenheimer, Dan (HP Labs Fort Collins);
> >> >>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> >>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >> >>>>
> >> >>>> Dan,
> >> >>>> Have you got time to verify below discussion?
> >> >>>>
> >> >>>> Thanks
> >> >>>> -Anthony
> >> >>>>
> >> >>>>> -----Original Message-----
> >> >>>>> From: Tian, Kevin
> >> >>>>> Sent: December 16, 2005 10:16
> >> >>>>> To: Xu, Anthony; 'Magenheimer, Dan (HP Labs Fort Collins)';
> >> >>>>> 'xen-ia64-devel@xxxxxxxxxxxxxxxxxxx'
> >> >>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >> >>>>>
> >> >>>>>> From: Xu, Anthony
> >> >>>>>> Sent: December 16, 2005 9:54
> >> >>>>>>
> >> >>>>>>> Also, why panic if it fails?
> >> >>>>>>>
> >> >>>>>
> >> >>>>> Panic is not required here, and we could just print out a
> >> >>>>> warning message. Previously the panic was kept there to help
> >> >>>>> our debugging in the early stage.
> >> >>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> Does the problem happen only on VTI?  Or both VTI and
> >> >>>>>>> non-VTI on split-cache machines?
> >> >>>>>>
> >> >>>>>> Sometimes, it makes domain0 crash at the very beginning of
> >> >>>>>> the domain0 boot process, especially on MP machines.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Thanks
> >> >>>>>> -Anthony
> >> >>>>>
> >> >>>>> One addition: that problem definitely exists on new
> >> >>>>> split-cache processors, for dom0/domU. For VTI domains, we
> >> >>>>> have logic within the device model to ensure consistency.
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Kevin
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> -----Original Message-----
> >> >>>>>>> From: Magenheimer, Dan (HP Labs Fort Collins) [mailto:dan.magenheimer@xxxxxx]
> >> >>>>>>> Sent: December 16, 2005 1:39
> >> >>>>>>> To: Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> >>>>>>> Cc: Xu, Anthony
> >> >>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >> >>>>>>>
> >> >>>>>>>>> Is this code fragment necessary for VTI to boot domU
> >> >>>>>>>>> or is it OK to remove?
> >> >>>>>>>>
> >> >>>>>>>>   The comment is inaccurate; it should say domU. That
> >> >>>>>>>> I/D cache sync step is mandatory to boot domU on new IA64
> >> >>>>>>>> processors, which have a split L2 I/D cache. Without such
> >> >>>>>>>> an I/D cache sync, the control panel loads domU's kernel
> >> >>>>>>>> image, which only affects the D-side cache. If there are
> >> >>>>>>>> stale entries in the I-side cache within the same range
> >> >>>>>>>> as the dom0 image, people will see the machine going
> >> >>>>>>>> weird.
> >> >>>>>>>
> >> >>>>>>> I don't understand... how can there be stale entries in the
> >> >>>>>>> I-cache? The instructions have just been written to memory
> >> >>>>>>> (through D-cache) and no instructions in this domain have
> >> >>>>>>> yet been executed.
> >> >>>>>>> I do see that the D-cache needs to be flushed so that
> >> >>>>>>> memory is coherent but are there better ways to do that
> >> >>>>>>> without a pal call?
> >> >>>>>>>
> >> >>>>>>>>   Normally I/D cache sync shouldn't cause any problem.
> >> >>>>>>>> Possibly there's some problem with the pal calling code,
> >> >>>>>>>> like incorrect ITLB mapping for pal or a similar issue...
> >> >>>>>>>
> >> >>>>>>> Although the ia64_pal_cache_flush routine is defined in
> >> >>>>>>> linux's pal.h, it doesn't appear to be used anywhere in
> >> >>>>>>> Linux so there is no use model to copy.  I suspect there
> >> >>>>>>> is some use model for the call that we don't understand,
> >> >>>>>>> for example maybe it should only be called with physical
> >> >>>>>>> &progress?  It definitely fails every time on one of my
> >> >>>>>>> (newer) machines and disabling the pal call makes the
> >> >>>>>>> problem go away.
> >> >>>>>>>
> >> >>>>>>>> Though it's intermittent, please
> >> >>>>>>>> keep this code
> >> >>>>>>>> there for correctness.
> >> >>>>>>>
> >> >>>>>>> Since the call is definitely failing under some
> >> >>>>>>> circumstances that we don't understand, I'm inclined to at
> >> >>>>>>> least put the code in an #ifdef CONFIG_SPLIT_CACHE
> >> >>>>>>>
> >> >>>>>>> Does the problem happen only on VTI?  Or both VTI and
> >> >>>>>>> non-VTI on split-cache machines?
> >> >>>>>>>
> >> >>>>>>> Thanks,
> >> >>>>>>> Dan
> >> >>>>>>>
> >> >>>>>>> P.S. I tried Anthony's patch (which moves the PAL call
> >> >>>>>>> after new_thread()) but it still crashes.
> >>
> >>
> 
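
The suggestion quoted above, to print a warning rather than panic when
the PAL call fails (for example with PAL_STATUS_UNIMPLEMENTED), could
look roughly like the sketch below.  It is only an illustration and not
the patch discussed in the thread: the function-pointer type, the two
stubs and sync_icache_or_warn() are hypothetical names that stand in
for the real ia64_pal_cache_flush() path so the logic can be compiled
anywhere.  The status values are believed to match Linux's pal.h.  Note
that a status check only helps when the call returns at all; it does
nothing for the case where ia64_pal_call_static never comes back.

```c
#include <stdio.h>

/* Status codes, believed to match Linux's include/asm-ia64/pal.h. */
#define PAL_STATUS_SUCCESS        0
#define PAL_STATUS_UNIMPLEMENTED (-1)

/* Hypothetical stand-in for the wrapper around PAL_CACHE_FLUSH. */
typedef long (*cache_flush_fn)(void);

/* Hypothetical stub: firmware that implements the call. */
static long stub_flush_ok(void)
{
    return PAL_STATUS_SUCCESS;
}

/* Hypothetical stub: firmware that does not implement the call. */
static long stub_flush_unimplemented(void)
{
    return PAL_STATUS_UNIMPLEMENTED;
}

/*
 * Try the I/D cache sync.  On any non-success status, warn and carry
 * on instead of panicking.  Returns 1 if the flush succeeded, 0 if it
 * was skipped with a warning.
 */
static int sync_icache_or_warn(cache_flush_fn flush)
{
    long status = flush();

    if (status == PAL_STATUS_SUCCESS)
        return 1;

    fprintf(stderr,
            "warning: PAL_CACHE_FLUSH returned %ld, "
            "continuing without I/D cache sync\n", status);
    return 0;
}
```

With the stubs above, sync_icache_or_warn(stub_flush_ok) reports
success, while sync_icache_or_warn(stub_flush_unimplemented) prints the
warning and lets the caller continue.
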
_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel

 

