
Re: Serious AMD-Vi(?) issue



On Thu, Mar 28, 2024 at 08:22:31AM -0700, Elliott Mitchell wrote:
> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> > On 27.03.2024 18:27, Elliott Mitchell wrote:
> > > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> > >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > >>>
> > >>> In fact when running into trouble, the usual course of action would be
> > >>> to increase verbosity in both hypervisor and kernel, just to make sure no
> > >>> potentially relevant message is missed.
> > >>
> > >> More/better information might have been obtained if I'd been engaged
> > >> earlier.
> > > 
> > > This is still true, things are in full mitigation mode and I'll be
> > > quite unhappy to go back with experiments at this point.
> > 
> > Well, it very likely won't work without further experimenting by someone
> > able to observe the bad behavior. Recall we're on xen-devel here; it is
> > kind of expected that without clear (and practical) repro instructions
> > experimenting as well as info collection will remain with the reporter.
> 
> The first reporter: https://bugs.debian.org/988477 gave pretty specific
> details about their setups.
> 
> While the exact border isn't very well defined, that seems enough to give
> a pretty good start.  We don't know whether all Samsung SATA devices are
> affected, but most of the recent ones (<5 years old) are.  This requires
> a pair of devices in software RAID1.  Likely reproduces better with AMD
> AM4/AM5 processors, but almost certainly needs a fully operational IOMMU.
> 
> (ASUS motherboards tend to have well set-up IOMMUs)
> 
> I would be surprised if you don't have all of the hardware on-hand.  Only
> issue would be finding an appropriate pair of SATA devices, since those
> tend to remain in service.  I would look for older devices which were
> removed from service due to being too small (128GB 840 PRO from the first
> report), or were pulled from service due to having had too many writes.

Come to think of it, one more possible ingredient to this.  Similar to
the first report, when the problem occurred, the SATA device was plugged
into an on-chipset SATA port, not the extra controller this motherboard
has.  I don't know whether the performance difference of an off-main-chip
controller would influence this, but it might.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

