[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: S3 regression on AMD in 4.20
On Mon, Mar 17, 2025 at 06:54:06PM +0100, Marek Marczykowski-Górecki wrote: > On Mon, Mar 17, 2025 at 06:35:08PM +0100, Roger Pau Monné wrote: > > On Mon, Mar 17, 2025 at 05:35:51PM +0100, Marek Marczykowski-Górecki wrote: > > > On Mon, Mar 17, 2025 at 04:56:15PM +0100, Roger Pau Monné wrote: > > > > On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote: > > > > > On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote: > > > > > > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek > > > > > > Marczykowski-Górecki wrote: > > > > > >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote: > > > > > >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote: > > > > > >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook > > > > > >>>> 445 G7. > > > > > >>>> > > > > > >>>> This one has working S3, so add a test for it here. > > > > > >>>> > > > > > >>>> Signed-off-by: Marek Marczykowski-Górecki > > > > > >>>> <marmarek@xxxxxxxxxxxxxxxxxxxxxx> > > > > > >>>> --- > > > > > >>>> Cc: Jan Beulich <jbeulich@xxxxxxxx> > > > > > >>>> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> > > > > > >>>> > > > > > >>>> The suspend test added here currently fails on staging[1], but > > > > > >>>> passes on > > > > > >>>> staging-4.19[2]. So the regression wants fixing before > > > > > >>>> committing this > > > > > >>>> patch. > > > > > >>>> > > > > > >>>> [1] > > > > > >>>> https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140 > > > > > >>>> [2] > > > > > >>>> https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441 > > > > > >>> We could commit the patch now without the s3 test. > > > > > >>> > > > > > >>> I don't know what the x86 maintainers think about fixing the > > > > > >>> suspend > > > > > >>> bug, but one idea would be to run a bisection between 4.20 and > > > > > >>> 4.19. > > > > > >> I'm on it already, but it's annoying. Lets convert this thread to > > > > > >> discussion about the issue: > > > > > >> > > > > > >> So, I bisected it between staging-4.19 and master. The breakage is > > > > > >> somewhere between (inclusive): > > > > > >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C > > > > > >> and > > > > > >> 47990ecef286 x86/boot: Improve MBI2 structure check > > > > > >> > > > > > >> But, the first one breaks booting on this system and it remains > > > > > >> broken > > > > > >> until the second commit (or its parent) - at which point S3 is > > > > > >> already > > > > > >> broken. So, there is a range of 71 commits that may be > > > > > >> responsible... > > > > > >> > > > > > >> But then, based on a matrix chat and Jan's observation I've tried > > > > > >> reverting f75780d26b2f "xen: move per-cpu area management into > > > > > >> common > > > > > >> code" just on top of 47990ecef286, and that fixed suspend. > > > > > >> Applying "xen/percpu: don't initialize percpu on resume" on top of > > > > > >> 47990ecef286 fixes suspend too. > > > > > >> But applying it on top of master > > > > > >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not > > > > > >> fix it, > > > > > >> but the failure mode is different than without the patch - system > > > > > >> resets > > > > > >> on S3 resume, with no crash message on the serial console (even > > > > > >> with > > > > > >> sync_console), instead of hanging. > > > > > >> And one more data point: reverting f75780d26b2f on top of master > > > > > >> is the > > > > > >> same as applying "xen/percpu: don't initialize percpu on resume" on > > > > > >> master - system reset on S3 resume. > > > > > >> So, it looks like there are more issues... > > > > > > Another bisection round and I have the second culprit: > > > > > > > > > > > > 8e60d47cf011 x86/iommu: avoid MSI address and data writes if > > > > > > IRT index hasn't changed > > > > > > > > > > > > With master+"xen/percpu: don't initialize percpu on resume"+revert > > > > > > of > > > > > > 8e60d47cf011 suspend works again on this AMD system. > > > > > > > > > > That's not surprising in the slightest. > > > > > > > > > > Caching hardware values in Xen isn't safe across S3, which QubesOS has > > > > > found time and time again, and for which we still have outstanding > > > > > bugs. > > > > > > > > > > S3 turns most of the system off. RAM gets preserved, but devices and > > > > > plenty of internal registers don't. > > > > > > > > I think I've spotted the issue. enable_iommu() called on resume > > > > (ab)uses set_msi_affinity() to force an MSI register write, as it's > > > > previous behavior was to unconditionally propagate the values. With > > > > my change it would no longer perform such writes on resume. > > > > > > > > I think the patch below should help. > > > > > > It doesn't, I still got reboot on resume, with no crash message on > > > serial (even with sync_console). > > > > There was an error with the previous patch, and it's also a bug in the > > original patch. Could you give a try to the updated patch below? > > > > Sorry for bothering you, but ATM I haven't found a way to trigger > > this myself. > > This one does fix the issue :) I've now formally sent the patch, could you provide a Tested-by for it? https://lore.kernel.org/xen-devel/20250318082945.52019-1-roger.pau@xxxxxxxxxx/T/#u Thanks, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |