Re: [ANNOUNCE] Xen 4.15 release schedule and feature tracking
On 28.01.2021 19:26, Dario Faggioli wrote:
> On Thu, 2021-01-14 at 19:02 +0000, Andrew Cooper wrote:
>> 2) "scheduler broken" bugs. We've had 4 or 5 reports of Xen not
>> working, and very little investigation into what's going on. The
>> suspicion is that there might be two bugs: one with smt=0 on recent
>> AMD hardware, and one more general "some workloads cause negative
>> credit" one, which might or might not be specific to credit2
>> (debugging feedback differs - there might also be 3 underlying
>> issues).
>>
> Yep, so, let's try to summarize/collect the ones I think you may be
> referring to:
>
> 1) There is one report about Credit2 not working, while Credit1 was
> fine. It's this one:
>
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01561.html
>
> It's the one where it somehow happens that one or more vCPUs manage
> to run for a really, really long timeslice, much longer than the
> scheduler would have allowed them to, and this causes problems. _If_
> that's it, my investigation so far seems to show that this happens
> despite the scheduler code trying to enforce (via timers) the proper
> timeslice limits. When it happens, it makes the scheduler very
> unhappy. I've seen reports of it occurring on both Credit and
> Credit2, but Credit2 definitely seems to be more sensitive to it.
>
> I've actually been trying to track it down for a while now, but I
> can't easily reproduce it, so it's proving to be challenging.
>
> 2) Then there has been this one:
>
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01005.html
>
> Here, the reporter said that "[credit1] results in an observable
> delay, unusable performance; credit2 seems to be the only usable
> scheduler". This is the one that Andrew also mentioned, happening on
> Ryzen and with SMT disabled (as this is on QubesOS, IIRC).
>
> Here, booting with "dom0_max_vcpus=1 dom0_vcpus_pin" seemed to
> mitigate the problem but, of course, with obvious limitations. I
> don't have a Ryzen handy, but I have a Zen and a Zen2. I checked
> there and again could not reproduce it (although what I tried was
> upstream Xen, not QubesOS).
>
> 3) Then I recall this one:
>
> https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg01800.html
>
> This also started as a "scheduler, probably Credit2" bug. But it then
> turned out to manifest on both Credit1 and Credit2, and it started to
> happen on 4.14 while it was not there in 4.13... And nothing major
> changed in scheduling between these two releases, I think.
>
> During the analysis, we thought we had identified a livelock, but
> then could not pinpoint what exactly was going on. Oh, and then it
> was also discovered that Credit2 + PVH dom0 seemed to be a working
> configuration, and it's weird for a scheduling issue to have a (dom0)
> domain type dependency, I think. But that could be anything really...
> and I'm sure happy to keep digging.
>
> 4) There's the NULL scheduler + ARM + vwfi=native issue:
>
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01634.html
>
> This looks like something that we saw before but which remained
> unfixed, although not exactly like that. If it's that one, the
> analysis is done, and we're working on a patch. If it's something
> else, or even something similar but slightly different... well,
> we'll have to see when we have the patch.
>
> 5) We're also dealing with this bug report, although it is being
> reported against Xen 4.13 (openSUSE's packaged version of it):
>
> https://bugzilla.opensuse.org/show_bug.cgi?id=1179246
>
> This is again on recent AMD hardware, and here "dom0_max_vcpus=4
> dom0_vcpus_pin" works ok, but only until a (Windows) HVM guest is
> started. When that happens, we see crashes/hangs.
>
> If the guests are PV, things are apparently fine. If the HVM guests
> use a different set of CPUs than dom0 (e.g., vm.cpumask="4-63" in
> xl.conf, sketched after this message), things are fine as well.
>
> Again, a scheduler issue and a scheduling algorithm dependency were
> theorized and will be investigated (if the user can come back with
> answers, which may take some time, as explained in the report). The
> different behavior with different kinds of guests is a little weird
> for an issue of this kind, IME, but let's see.
>
> 6) If we want, we can include this too (hopefully just for
> reference):
>
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg01376.html
>
> The symptoms were indeed similar, such as hanging during boot, but
> all was fine with dom0_max_vcpus=1. However, Jan is currently
> investigating this one, and they're heading toward problems with TSC
> reliability reporting and rendezvous, but let's see.
>
> Did I forget any?

Going just from my mailbox, where I didn't keep all of the still
unaddressed reports, but only some, there are also these two (another
one I have there is among the ones you've mentioned above):

https://lists.xen.org/archives/html/xen-devel/2020-03/msg01251.html
https://lists.xen.org/archives/html/xen-devel/2020-05/msg01985.html

Jan
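
For reference, a minimal sketch of the workarounds mentioned in items
2 and 5 above. The option values are the ones from the reports; the
file paths and the CPU range are assumptions that depend on the
distribution and on the host:

    # Xen command line (e.g. via GRUB_CMDLINE_XEN_DEFAULT in
    # /etc/default/grub on Debian-style systems): give dom0 a single
    # vCPU and pin it, as tried in item 2.
    dom0_max_vcpus=1 dom0_vcpus_pin

    # /etc/xen/xl.conf: keep guests off the CPUs used by dom0, as in
    # item 5 (the "4-63" range assumes a 64-CPU host booted with
    # dom0_max_vcpus=4 dom0_vcpus_pin).
    vm.cpumask="4-63"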