
Re: Problems with APIC on versions 4.9 and later (4.8 works)

On Tue, Jan 26, 2021 at 08:48, Jan Beulich <jbeulich@xxxxxxxx> wrote:
> On 25.01.2021 20:37, Claudemir Todo Bom wrote:
> > I've managed to get the debug messages on the screen using
> > vga=text-80x50,keep and disabling all messages from the kernel. Two
> > images are attached with the output running the debug patch.
> And the 1st of them (161303) was taken at the time of the hang of
> the kernel (or entire system), not any earlier? I ask because one
> part of the reason for the patch was to understand whether the
> rendezvousing itself would fail in some way (like one of the CPUs
> not calling in).

I cannot tell whether it had already hung when I took the picture, but
I can tell that the messages keep appearing after the hang. I tested
this by enabling log messages: the screen became a mess, but I can
assure you that the rendezvous function is run and completes multiple
times after the "freeing memory" message at which the kernel freezes.

> Were new log messages (from the debugging patch) still issued at
> this point, showing Xen itself was still alive?
> The 2nd of the pictures (162313) at least clarifies that indeed
> the commit in question had a functional effect on this system,
> because of
> (XEN) TSC warp detected, disabling TSC_RELIABLE
> I still can't figure though why the change in rendezvous handling
> (from "std" to "tsc") would have broken your system.
> > About the version I've used to test: since 4.14 shows that other
> > bug with the detection of CPU features I mentioned in the other
> > subthread, I chose to work on 4.11, which doesn't show that behaviour.
> >
> > Calling with clocksource on the xen command line changed nothing.
> Oh, right, because the specific feature that causes the change
> of rendezvous functions for you also is a prereq for that mode
> of operation.

Oh, that must be why reverting the code on 4.14 didn't work... it
probably got tangled up with features introduced after 4.11.

> > I don't know if this part of code is intended to execute a lot of
> > times, but when starting with dom0_max_vcpus=1, the system boots up
> > and keeps showing the messages.
> When there's just one CPU, there's no CPU to rendezvous with.
> Iirc you did say that you observe the hang even with as little
> as 2 CPUs? The problem the above quoted message is supposed to
> address is normally coming into play only on multi-socket
> systems. Yet from your initial report I deduce this is a
> single socket system. So in the end I suppose there are two
> problems - one is the hang, and the other is that your system
> gets diagnosed as having an unreliable TSC (at least I didn't
> think Xeon E5 v2 should have a problem there).

It is a single-socket system; I was talking about virtual CPUs for
domain 0.

After the last tests I tried booting with the maxcpus=1 parameter on
the Xen command line. This switched the rendezvous code to "std", and
the system worked on all versions up to 4.14.

Is there any performance penalty in using this parameter and the "std"
rendezvous code?

> I will want to extend the debugging patch, but I'd like to
> have clarification on some of the points above first.

If this information is enough for further tests, please send the patch
and I will test it!

Best regards,


