
Re: Problems with APIC on versions 4.9 and later (4.8 works)

On Fri, Jan 29, 2021 at 11:21, Jan Beulich <jbeulich@xxxxxxxx> wrote:
> On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> > On Thu, Jan 28, 2021 at 06:49, Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >>
> >> On 28.01.2021 10:47, Jan Beulich wrote:
> >>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
> >>>> If this information is good for more tests, please send the patch and
> >>>> I will test it!
> >>>
> >>> Here you go. For simplifying analysis it may be helpful if you
> >>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
> >>> at least "smt=0". Provided the problem still reproduces with
> >>> such options, of course.
> >>
> >> Speaking of command line options - it doesn't look like you have
> >> told us what else you have on the Xen command line, and without
> >> a serial log this isn't visible (e.g. in your video).
> >
> > All tests are done with xen command line:
> >
> > dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> > smt=false vga=text-80x50,keep
> >
> > and kernel command line:
> >
> > loglevel=0 earlyprintk=xen nomodeset
> >
> > this way I can get all xen messages on console.
> >
> > Attached are the frames I captured from a video, I manually selected
> > them starting from the first readable frame.
>
> Okay, so we seem to be hitting two previously noticed issues, neither
> of which so far was necessary to address directly (because there was
> always something else to be tweaked such that the problems went away).
>
> For one, the boot CPU has a TSC value that lags by more than a
> second compared to all secondary CPUs. The way
> time_calibration_tsc_rendezvous() works, together with the way we
> calculate system time from the TSC (get_s_time_fixed() - this is
> where the known issue here is: the function breaks when trying to
> scale a negative delta, hence the absurdly high s= values in the
> screenshots you've provided), allows for small negative deltas
> between CPUs, but expects to bring all CPUs' TSCs forward (i.e. over
> the 1s interval between rendezvous' the lagging CPUs are assumed to
> have made enough progress to be ahead of the more towards the future
> timestamps on the previous run). Secondary lagging behind the boot
> CPU more than this could also be dealt with, but on your system the
> situation is the other way around.
>
> Btw - what kind of BIOS do you have on this system? This way of the
> TSCs being set is pretty odd, and must be - unless you run other
> pre-boot software or an unusual boot loader - caused by the BIOS.
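
The negative-delta failure described above for get_s_time_fixed() can be
illustrated with a minimal sketch. This is not Xen's code: the struct
tsc_scale type, both helper functions, and the 0.5 ns-per-tick scale are
hypothetical and chosen only for illustration. It assumes the delta is
scaled as an unsigned 64-bit value by a 32-bit fixed-point fraction, so a
small negative delta wraps around to a huge positive number of nanoseconds.

```c
#include <stdint.h>

/* Hypothetical TSC-to-nanoseconds scale (illustration only, not Xen's struct). */
struct tsc_scale {
    uint32_t mul_frac;  /* 32-bit fixed-point fraction: ns per tick * 2^32 */
    int shift;          /* pre-shift applied to the delta before multiplying */
};

/* Scale an unsigned tick delta: (delta * mul_frac) >> 32, done in two
 * 32-bit halves so no 128-bit arithmetic is needed. */
static uint64_t scale_delta_unsigned(uint64_t delta, struct tsc_scale s)
{
    if (s.shift < 0)
        delta >>= -s.shift;
    else
        delta <<= s.shift;

    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;

    return hi * s.mul_frac + ((lo * s.mul_frac) >> 32);
}

/* Sign-aware wrapper: scale the magnitude, then restore the sign.
 * Overflow for extreme deltas is not handled in this sketch. */
static int64_t scale_delta_signed(int64_t delta, struct tsc_scale s)
{
    if (delta < 0)
        return -(int64_t)scale_delta_unsigned(-(uint64_t)delta, s);
    return (int64_t)scale_delta_unsigned((uint64_t)delta, s);
}
```

With the assumed scale { 0x80000000, 0 } (0.5 ns per tick), a delta of -2000
ticks passed through the unsigned path comes out near 2^63 ns, whereas the
sign-aware wrapper yields -1000 ns.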

It is a generic mainboard bought from China... it is very poor! I
already suspected that the BIOS is the main issue. Unfortunately, I
don't know how to upgrade it.

> And then this points out (again, afaic at least) that the way we
> kickstart the rendezvous handling is likely inappropriate.
> Especially when TSCs are skewed like they are here, it is unhelpful
> to launch Dom0 before having brought the TSC in sync. (Related to
> this, I also don't think we should arm the respective timer before
> AP bringup was done, or else we risk the first rendezvous instance
> to not hit all CPUs.)
>
> I'll work on addressing both, hoping that in particular for the
> first one you'll be ready to help with testing (through perhaps
> multiple iterations).

I can help a little more until the end of next week. After that I will
move the host to another location and will no longer have quick
"hands-on" access to it.

> > As a sidenote, I managed to get the system working with the parameter
> > "tsc=unstable", performance looks satisfactory but I do not know what
> > problems I may end with this parameter.
>
> I _think_ you'd be running into trouble if you removed dom0_vcpus_pin
> (which imo really no-one should use without reporting a bug, despite
> all the hits to the contrary that one gets when searching the web),
> and if you ran any guests (PV at least) without pinning their vCPU-s
> to pCPU-s.

I just tested it without the CPU pinning, and it worked.

I stress-tested both dom0 and a pv guest with the "yes method"
described here:

Best regards,


