Re: Serious AMD-Vi issue



On Fri, Jan 24, 2025 at 03:31:30PM +0100, Roger Pau Monné wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently I've been
> > able to reproduce the issue:
> > 
> > https://bugs.debian.org/988477
> > 
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices.  The main line shows up in `xl dmesg`:
> > 
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> 
> I think I've figured out the cause for those faults, and posted a fix
> here:
> 
> https://lore.kernel.org/xen-devel/20250124120112.56678-1-roger.pau@xxxxxxxxxx/
> 
> Fix is patch 5/5, but you likely want to take them all to avoid
> context conflicts.

I haven't tested yet, but here is some analysis from looking at the series.

This seems a plausible explanation for the interrupt IOMMU messages.  As
such I think there is a good chance the reported messages will disappear.
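
To compare fault counts before and after applying the series, the
messages can be tallied per device.  What follows is a minimal sketch,
not a tested tool.  The pattern is modelled only on the message format
quoted above; other Xen versions may print the line differently, and
the names FAULT_RE and count_faults are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted earlier in this thread:
#   (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr <hex> flags 0x8 I
# Assumption: other Xen versions may format this line differently.
FAULT_RE = re.compile(
    r"AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<sbdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-fA-F]+) "
    r"flags (?P<flags>0x[0-9a-fA-F]+)"
)


def count_faults(log_text):
    """Return a Counter mapping PCI device (segment:bus:dev.fn) to fault count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("sbdf")] += 1
    return counts
```

Feeding it the saved output of `xl dmesg` from before and after the
change would show whether the messages have actually stopped for a
given device.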

Nothing in here looks plausible for solving the real problem, that of
RAID1 mirrors diverging (almost certainly getting zeroes during DMA, but
there is a chance stale data is being read).

Worse, since it removes the observed messages, the next person will
almost certainly have severe data loss by the time they realize there is
a problem.  Notably those messages led me to Debian #988477, so I was
able to take action before things got too bad.
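
If the fault messages do go away, mirror divergence has to be caught
some other way.  One option is the md mismatch counter.  The sketch
below assumes the Linux md sysfs layout (/sys/block/<array>/md/mismatch_cnt)
and that a "check" pass has already been run via md/sync_action; the
function names are hypothetical.  A non-zero count on RAID1 is not
always proof of corruption, so treat it as a prompt to investigate
rather than a verdict.

```python
from pathlib import Path


def md_mismatch_counts(sys_block="/sys/block"):
    """Map each md array name to its mismatch_cnt value.

    The counter is only meaningful after a "check" or "repair" pass,
    which is started by writing to the array's md/sync_action file.
    """
    counts = {}
    for counter in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        array_name = counter.parent.parent.name  # e.g. "md0"
        counts[array_name] = int(counter.read_text().strip())
    return counts


def diverged_arrays(sys_block="/sys/block"):
    """Return names of arrays whose last check found mismatches."""
    counts = md_mismatch_counts(sys_block)
    return [name for name, count in counts.items() if count > 0]
```

Running diverged_arrays() from a periodic job after each scheduled
check would give a warning that does not depend on the IOMMU messages.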



I'm not absolutely certain this is a pure Xen bug.  There is a
possibility the RAID1 driver is reusing DMA buffers in a fashion which
violates the DMA interface.  Yet there is also a good chance Xen isn't
implementing its layer properly either.



There is one pattern emerging at this point.  Samsung hardware is badly
affected; other vendors are either unaffected or mildly affected.
Notably the estimated age of the devices meant to be handed off to
someone able to diagnose the issue is >10 years.  The Crucial/Micron
SATA device *should* drastically outperform these, yet it is
unaffected.  The Crucial/Micron NVMe is very mildly affected, yet
should be more than an order of magnitude faster.

The simplest explanation is the flash controller on the Samsung devices
has lower latency than the one used by Micron.


Both present reproductions feature AMD processors and ASUS motherboards.
I'm doubtful of this being an ASUS issue.  This seems more likely a case
of people who use RAID with flash tending to go with a motherboard vendor
which reliably supports ECC on all its motherboards.

I don't know whether this is confined to AMD processors, or not.  The
small number of reproductions suggests few people are doing RAID with
flash storage.  In which case no one may have tried RAID1 with flash on
Intel processors.  On Intel hardware the referenced message would be
absent and people might think their problem was distinct from Debian
#988477.

In fact what seems a likely reproduction on Intel hardware is the Intel
sound card issue.  I notice that issue occurs when sound *starts*
playing.  When a sound device starts, its buffers would be empty and the
first DMA request would be turned around with minimal latency.  That
would match the Samsung SATA devices handling DMA with low latency.


> Can you give it a try and see if it fixes the fault messages, plus
> your issues with the disk devices?

Ick.  I was hoping to avoid reinstalling the known problematic devices
and simply send them to someone better set up for analyzing x86 problems.

Looking at the series, it seems likely to remove the fault messages and
turn this into silent data loss.  I doubt any AMD processors have an
IOMMU, yet omit cmpxchg16b (the older system lacked a full IOMMU, yet
did have cmpxchg16b; the newer system has both).  Even guests have
cmpxchg16b available.
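
For anyone wanting to confirm cmpxchg16b on their own machine, Linux
reports the instruction as the "cx16" flag in /proc/cpuinfo.  A small
sketch follows; the function name is made up, and inside a Xen domain
the flags reflect what the hypervisor exposes to that domain rather
than the raw hardware.

```python
def has_cmpxchg16b(cpuinfo_text):
    """True if every "flags" line in /proc/cpuinfo-style text lists cx16."""
    flag_lines = [
        line.split(":", 1)[1].split()
        for line in cpuinfo_text.splitlines()
        if line.startswith("flags") and ":" in line
    ]
    # No flags lines at all (e.g. non-x86 or empty input) counts as "no".
    return bool(flag_lines) and all("cx16" in flags for flags in flag_lines)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as cpuinfo:
        print(has_cmpxchg16b(cpuinfo.read()))
```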

If you really want this tested, it will be a while before the next
potential downtime window.

Come to think of it, I wonder whether this might fix a particular device
which was having an interrupt problem.  Problem there being it was being
uncooperative with motherboard firmware...


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

