Re: Serious AMD-Vi issue
Hello Elliott,

On 24/01/2025 at 22:31, Elliott Mitchell wrote:
> On Fri, Jan 24, 2025 at 03:31:30PM +0100, Roger Pau Monné wrote:
>> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
>>> Apparently this was first noticed with 4.14, but more recently I've been
>>> able to reproduce the issue:
>>>
>>> https://bugs.debian.org/988477
>>>
>>> The original observation features MD-RAID1 using a pair of Samsung
>>> SATA-attached flash devices. The main line shows up in `xl dmesg`:
>>>
>>> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags
>>> 0x8 I
>>
>> I think I've figured out the cause for those faults, and posted a fix
>> here:
>>
>> https://lore.kernel.org/xen-devel/20250124120112.56678-1-roger.pau@xxxxxxxxxx/
>>
>> Fix is patch 5/5, but you likely want to take them all to avoid
>> context conflicts.
>
> I haven't tested yet, but some analysis from looking at the series.
>
> This seems a plausible explanation for the interrupt IOMMU messages. As
> such I think there is a good chance the reported messages will disappear.
>
> Nothing in here looks plausible for solving the real problem, that of
> RAID1 mirrors diverging (almost certainly getting zeroes during DMA, but
> there is a chance stale data is being read).
>

The message shows that something is going wrong, presumably a lost
interrupt. This can lead to data loss, as it breaks the expectations of
Dom0's drivers. If you still observe data loss after applying these
patches, and the messages have disappeared, it may be due to something
else, but these patches are not meant to hide the fault.

According to the AMD-Vi specification, there appears to be a specific
case where interrupt remapping faults are reported as IO_PAGE_FAULT
(which appears to be what's happening here). The IG bit (133) of the DTE
appears to provide an explanation (SupIOPF can set this behavior
globally).

> IG: ignore unmapped interrupts. 1=Suppress event logging for interrupt
> messages causing IO_PAGE_FAULT events. 0=creation of event log entries
> for IO_PAGE_FAULT events is controlled by SupIOPF in the interrupt
> remapping table entry (see Section 2.2.5 [Interrupt Remapping
> Tables]).

Note that Xen sets this bit to 0 (and this patch doesn't change that),
which means that these faults are reported as IO_PAGE_FAULT events.

> Worse, since it removes the observed messages, the next person will
> almost certainly have severe data loss by the time they realize there is
> a problem. Notably those messages lead me to Debian #988477, so I was
> able to take action before things got too bad.
>
>
> I'm not absolutely certain this is a pure Xen bug. There is a
> possibility the RAID1 driver is reusing DMA buffers in a fashion which
> violates the DMA interface. Yet there is also a good chance Xen isn't
> implementing its layer properly either.
>
>
> There is one pattern emerging at this point. Samsung hardware is badly
> effected, other vendors are either uneffected or mildly effected.
> Notably the estimated age of the devices meant to be handed off to
> someone able to diagnose the issue is >10 years. The uneffected
> Crucial/Micron SATA device *should* drastically outperform these, yet
> instead it is uneffected. The Crucial/Micron NVMe is very mildly
> effected, yet should be more than an order of magnitude faster.
>
> The simplest explanation is the flash controller on the Samsung devices
> is lower latency than the one used by Micron.
>
>
> Both present reproductions feature AMD processors and ASUS motherboards.
> I'm doubtful of this being an ASUS issue. This seems more likely a case
> of people who use RAID with flash tending to go with a motherboard vendor
> who reliably support ECC on all their motherboards.
>
> I don't know whether this is confined to AMD processors, or not. The
> small number of reproductions suggests few people are doing RAID with
> flash storage. In which case no one may have tried RAID1 with flash on
> Intel processors. On Intel hardware the referenced message would be
> absent and people might think their problem was distinct from Debian
> #988477.
>
> In fact what seems a likely reproduction on Intel hardware is the Intel
> sound card issue. I notice that issue occurs when sound *starts*
> playing. When a sound device starts, its buffers would be empty and the
> first DMA request would be turned around with minimal latency. In such
> case this matches the Samsung SATA devices handling DMA with low
> latency.
>
>
>> Can you give it a try and see if it fixes the fault messages, plus
>> your issues with the disk devices?
>
> Ick. I was hoping to avoid reinstalling the known problematic devices
> and simply send them to someone better setup for analyzing x86 problems.
>
> Looking at the series, it seems likely to remove the fault messages and
> turn this into silent data loss. I doubt any AMD processors have an
> IOMMU, yet omit cmpxchg16b (older system lacked full IOMMU, yet did have
> cmpxchg16b, newer system has both). Even guests have cmpxchg16b
> available.
>
> If you really want this tested, it will be a while before the next
> potential downtime window.
>
> Come to think of it, I wonder whether this might fix a particular device
> which was having an interrupt problem. Problem there being it was being
> uncooperative with motherboard firmware...
>
>

As it seems to be a specific corner case, I would not be surprised if it
only shows up in very specific hardware setups.

Teddy

Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech