
Re: [ANNOUNCE] Xen 4.15 release schedule and feature tracking



On 28.01.2021 19:26, Dario Faggioli wrote:
> On Thu, 2021-01-14 at 19:02 +0000, Andrew Cooper wrote:
>> 2) "scheduler broken" bugs.  We've had 4 or 5 reports of Xen not
>> working, and very little investigation on whats going on.  Suspicion
>> is
>> that there might be two bugs, one with smt=0 on recent AMD hardware,
>> and
>> one more general "some workloads cause negative credit" and might or
>> might not be specific to credit2 (debugging feedback differs - also
>> might be 3 underlying issue).
>>
> Yep, so, let's try to summarize/collect the ones I think you may be
> referring to:
> 
> 1) There is one report about Credit2 not working, while Credit1 was
> fine. It's this one:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
> 
> It's the one where it somehow happens that one or more vCPUs manage to
> run for a really, really long timeslice, much longer than the scheduler
> would have allowed them to, and this causes problems. _If_ that's it, my
> investigation so far seems to show that this happens despite the
> scheduler code trying to enforce (via timers) the proper timeslice
> limits. When it happens, it makes the scheduler very unhappy. I've seen
> reports of it occurring on both Credit and Credit2, but Credit2
> definitely seems to be more sensitive to it.
> 
> I've actually been trying to track it down for a while now, but I can't
> easily reproduce it, so it's proving to be challenging.
> 
> 2) Then there has been this one:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01005.html
> 
> Here, the reporter said that "[credit1] results in an observable
> delay, unusable performance; credit2 seems to be the only usable
> scheduler". This is the one that Andrew also mentioned, happening on
> Ryzen and with SMT disabled (as this is on QubesOS, IIRC).
> 
> Here, doing "dom0_max_vcpus=1 dom0_vcpus_pin" seemed to mitigate the
> problem but, of course, with obvious limitations. I don't have a Ryzen
> handy, but I have a Zen and a Zen2. I checked there and again could not
> reproduce (although, what I tried was upstream Xen, not QubesOS).
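> 
> (Just as a sketch, and assuming upstream Xen rather than QubesOS'
> packaging of it: the workaround amounts to adding something like the
> following to the hypervisor's boot command line, e.g. in the GRUB
> entry that loads xen.gz.)
> 
>   dom0_max_vcpus=1 dom0_vcpus_pin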
> 
> 3) Then I recall this one:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html
> 
> This also started as a "scheduler, probably Credit2" bug. But it then
> turned out to manifest on both Credit1 and Credit2, and it started to
> happen on 4.14 while it was not there in 4.13... And nothing major
> changed in scheduling between these two releases, I think.
> 
> During the analysis, we thought we had identified a livelock, but then
> could not pinpoint exactly what was going on. Oh, and then it was also
> discovered that Credit2 + PVH dom0 seemed to be a working
> configuration, and it's weird for a scheduling issue to have a (dom0)
> domain type dependency, I think. But that could be anything really...
> and I'm sure happy to keep digging.
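> 
> (Only as a sketch of what "Credit2 + PVH dom0" means in terms of
> configuration: it's roughly a hypervisor command line along the lines
> of the following, with everything else left unchanged.)
> 
>   sched=credit2 dom0=pvh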
> 
> 4) There's the NULL scheduler + ARM + vwfi=native issue:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
> 
> This looks like something that we saw before but which remained
> unfixed, although it's not exactly the same. If it's that one, the
> analysis is done, and we're working on a patch. If it's something else,
> or even something similar but slightly different... well, we'll have to
> see when we have the patch.
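> 
> (Again just a sketch: as far as I understand, the affected setups are
> ARM boxes booting the hypervisor with something like the following on
> its command line.)
> 
>   sched=null vwfi=native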
> 
> 5) We're also dealing with this bug report, although it is being
> reported against Xen 4.13 (openSUSE's packaged version of it):
> 
> https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
> 
> This is again on recent AMD hardware and, here, "dom0_max_vcpus=4
> dom0_vcpus_pin" works ok, but only until a (Windows) HVM guest is
> started. When that happens, we get crashes/hangs.
> 
> If guests are PV, things are apparently fine. If the HVM guests use a
> different set of CPUs than dom0 (e.g., vm.cpumask="4-63" in xl.conf),
> things are fine as well.
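> 
> (Sketching it out, and assuming dom0 is pinned to pCPUs 0-3 via
> "dom0_max_vcpus=4 dom0_vcpus_pin": the working setup keeps guests off
> those pCPUs with something like this in /etc/xen/xl.conf.)
> 
>   # restrict all new domains to pCPUs 4-63, leaving 0-3 to dom0
>   vm.cpumask="4-63"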
> 
> Again, a scheduler issue and a scheduling algorithm dependency were
> theorized and will be investigated (if the user can come back with
> answers, which may take some time, as explained in the report). The
> different behavior with different kinds of guests is a little weird for
> an issue of this kind, IME, but let's see.
> 
> 6) If we want, we can include this too (hopefully just for reference):
> 
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01376.html
> 
> Indeed, the symptoms were similar, such as hanging during boot but
> everything being fine with dom0_max_vcpus=1. However, Jan is currently
> investigating this one, and the investigation is heading toward
> problems with TSC reliability reporting and rendezvous, but let's see.
> 
> Did I forget any?

Going just from my mailbox, where I didn't keep all of the still
unaddressed reports, but only some (another one I have there is among
the ones you've mentioned above):

https://lists.xen.org/archives/html/xen-devel/2020-03/msg01251.html
https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html

Jan



 

