
Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card



On Mon, Feb 17, 2025 at 09:19:41PM +0100, Paweł Srokosz wrote:
> Hello everyone,

Adding the x86 maintainers plus the Linux Xen maintainer to the email.

> for a few months I've been struggling with a very weird memory corruption
> issue in Xen PV Dom0 with storage backed by a BOSS-S1 RAID-1 card. I noticed
> it when I tried to copy a huge ISO file onto the Dom0 file system and use it
> for a DomU installation. Everything was fine while its contents were cached
> in memory, but when I rebooted the system and the file was read back, some
> parts of the file had changed their contents. At the same time fsck doesn't
> report any problems with the filesystem.
> 
> At the same time, I'm able to reproduce it only when reading and writing
> files on storage backed by two SSDs in hardware RAID-1 (BOSS-S1 SATA AHCI
> RAID-1 fw ver. 2.5.13.3024) and only under Xen PV. Without Xen or with a PVH
> Dom0 everything works correctly. I have reproduced the bug on three servers
> with the same hardware/software specification:

We had similar reports, and IIRC also with software RAID.

> - Platform: Dell PowerEdge R640
> - CPU: 1x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> - RAM: 4x Multi-bit ECC DDR-4 32 GB
> - Storage:
>  - 2x SSD 240 GiB with BOSS-S1 SATA AHCI RAID-1 fw ver. 2.5.13.3024 (where
>  files get corrupted)
>  - 1x SSD 4 TiB SAS PERC H330 Mini JBOD fw ver. 25.5.9.0001 (where files
>  don't get corrupted)
> 
> I reproduced the same situation by writing a file, flushing dirty pages to
> storage (`sync`) and dropping the cached pages
> (`echo 3 > /proc/sys/vm/drop_caches`).
> 
> ```
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 96aad9e4b20b6e3f5fea40b981263e854f6c879472369d5ce8324aae1f6b7556  Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 0ba05ee38c0f2755bce4ccdf6b389963d9177b261505cbc2b41f8198e9f3bc60  Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 972a7f363e48b72a612efe85cc6a2c8ce7314858ec0e7ef08d9d7578c9a10ddc  Win10_22H2_Polish_x64v1.iso
> ```
> 
> The same effect occurred on two other machines with the same hardware. Only
> files written under Xen PV Dom0 were affected. When these files were read
> without Xen, they were consistently corrupted in the affected parts. When
> they were read under Xen, the corruption changed every time we dropped the
> page cache.
> 
> I found that the file is corrupted within 4 kB page boundaries, so it looked
> like a memory issue. I therefore wrote a script that writes a huge file with
> a specific pattern in each 4 kB block (matching the page size) and, after
> flush/drop_caches, mmap's the file and checks the integrity of each block.
> When a block mismatch occurs, it prints the VA and the GFN from
> `/proc/<pid>/pagemap` (using https://github.com/dwks/pagemap). Each page is
> filled with numbers derived from the file offset, so I'm able to correlate
> the contents with a specific offset in case they're shifted or out of
> order.
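> 
> A simplified sketch of the idea (not my exact script; it assumes it runs as
> root so that pagemap exposes real frame numbers, and a fixed test file
> size):
> 
> ```
> #!/usr/bin/env python3
> # Sketch: fill each 4 KiB block with its own file offset, then (after
> # sync + drop_caches) mmap the file and report mismatching blocks together
> # with their VA and the frame number from /proc/self/pagemap.
> import mmap, os, struct, sys
> 
> PAGE = 4096
> SIZE = 8 << 30                      # 8 GiB test file, adjust as needed
> 
> def write_pattern(path):
>     with open(path, "wb") as f:
>         for off in range(0, SIZE, PAGE):
>             f.write(struct.pack("<Q", off) * (PAGE // 8))
>     os.sync()
>     with open("/proc/sys/vm/drop_caches", "w") as f:
>         f.write("3")
> 
> def map_base(path):
>     # VA where the file got mapped, taken from /proc/self/maps
>     with open("/proc/self/maps") as maps:
>         for line in maps:
>             if line.rstrip().endswith(path):
>                 return int(line.split("-")[0], 16)
>     raise RuntimeError("mapping not found")
> 
> def frame_number(va):
>     # /proc/self/pagemap: one u64 per page, bits 0-54 hold the PFN/GFN
>     with open("/proc/self/pagemap", "rb") as pm:
>         pm.seek((va // PAGE) * 8)
>         return struct.unpack("<Q", pm.read(8))[0] & ((1 << 55) - 1)
> 
> def verify(path):
>     with open(path, "rb") as f:
>         m = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
>     base = map_base(path)
>     for off in range(0, SIZE, PAGE):
>         if m[off:off + PAGE] != struct.pack("<Q", off) * (PAGE // 8):
>             got = struct.unpack("<Q", m[off:off + 8])[0]
>             va = base + off
>             print(f"Block mismatch {off:#x} read {got:#x} "
>                   f"va {va:#x} gfn {frame_number(va):#x}")
> 
> if __name__ == "__main__":
>     path = os.path.abspath(sys.argv[1])
>     if sys.argv[2] == "write":
>         write_pattern(path)
>     else:
>         verify(path)
> ```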
> 
> In terms of file offset, the corruption is usually aligned to a 0xffff
> boundary, e.g. mismatched blocks can be found within these file offset
> ranges:
> - 0x248f000-0x248ffff
> - 0xd4944000-0xd494ffff
> - 0xc1fb000-0xc1fffff
> 
> My wild guess is that 0xffff corresponds to a Linux readahead boundary.
> When I try to load two or more files into the page cache, I start to see
> some patterns in the Dom0 Linux PFNs (GFNs?):
> 
> ```
> Block mismatch 0x4f577000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e177000 pm a18000000030e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...8000, ...9000, ...a000>
> 
> Block mismatch 0x4f57f000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e17f000 pm a18000000030e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> ```
> Block mismatch 0xc1fb000 read 0x4f577000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e7fb000 pm a18000000020e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...c000, ...d000, ...e000>
> 
> Block mismatch 0xc203000 read 0x4f57f000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e803000 pm a18000000020e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> which means that when I try to read from GFNs `20e50c-20e514`, I'm getting
> contents that should land in GFNs `30e50c-30e514`. On the other hand,
> `30e50c-30e514` contain only zeroes, although sometimes I see what looks
> like a random portion of some memory. When I'm able to correlate the
> contents, they very often come from a GFN offset by a multiple of 0x100000.
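> 
> (For reference, the GFNs above are just the low 55 bits of the raw `pm`
> value, per Documentation/admin-guide/mm/pagemap.rst; a quick decode:)
> 
> ```
> # pagemap entry layout: bits 0-54 PFN, 55 soft-dirty, 56 exclusively
> # mapped, 61 file-page/shared-anon, 62 swapped, 63 present.
> def decode(pm):
>     return {"pfn": pm & ((1 << 55) - 1),
>             "soft_dirty": bool(pm & (1 << 55)),
>             "exclusive": bool(pm & (1 << 56)),
>             "file_page": bool(pm & (1 << 61)),
>             "swapped": bool(pm & (1 << 62)),
>             "present": bool(pm & (1 << 63))}
> 
> print(hex(decode(0xa18000000020e50c)["pfn"]))   # 0x20e50c (testfile2)
> print(hex(decode(0xa18000000030e50c)["pfn"]))   # 0x30e50c (testfile1)
> ```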
> 
> The corruption isn't limited to the page cache: it makes the whole system
> unstable and from time to time results in a kernel panic or random
> segmentation faults. It's also not easy to reproduce; I need to read/write
> a lot of blocks to trigger it, and the bug looks to be timing-sensitive.
> 
> All three servers behave the same, so it doesn't look like the problem is
> caused by a simple hardware issue. All health checks and tests on
> RAM/storage/other components pass.
> 
> Our BOSS-S1 PCI card uses the following SATA controller: Marvell Technology
> Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11).
> There are well-known problems with this family of controllers, and Linux
> contains a DMA alias fixup for function 1
> (https://github.com/torvalds/linux/blob/2408a807bfc3f738850ef5ad5e3fd59d66168996/drivers/pci/quirks.c#L4316).
> This hardware issue is known to cause some problems on Xen with the IOMMU
> (https://github.com/QubesOS/qubes-issues/issues/5968). I'm not sure whether
> that is related here and whether it causes problems with PV as well.
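> 
> (For completeness, a quick way to list which functions of that controller
> are visible in sysfs; this assumes the usual 1b4b:9230 PCI ID for the
> 88SE9230. The Linux quirk exists because these parts issue DMA with the
> function 1 requester ID.)
> 
> ```
> #!/usr/bin/env python3
> # List visible PCI functions of the Marvell 88SE9230 (assumed ID 1b4b:9230).
> import os
> 
> SYSFS = "/sys/bus/pci/devices"
> for dev in sorted(os.listdir(SYSFS)):
>     with open(f"{SYSFS}/{dev}/vendor") as f:
>         vendor = int(f.read(), 16)
>     with open(f"{SYSFS}/{dev}/device") as f:
>         device = int(f.read(), 16)
>     if (vendor, device) == (0x1b4b, 0x9230):
>         func = dev.rsplit(".", 1)[1]
>         print(f"{dev}: Marvell 88SE9230, function {func}")
> ```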
> 
> By testing the bug under different conditions I also spotted a few more
> correlations:
> 
> - bug occurs on Xen PV Dom0 and was reproduced on Xen versions from 4.16.0 to
>  4.19.2-pre (up to git:4803a3c5b5 from stable-4.19) and Debian 10 to 12
>  (both stable and backports kernels). Somehow that specific commit
>  git:4803a3c5b5 makes the bug harder to trigger, but it may just be a
>  coincidence.

I think it's more likely to be a Linux bug than a Xen (hypervisor)
bug.

> - I was unable to reproduce it when Xen was compiled from the master branch,
>  but once again I'm not sure whether that was just bad timing in triggering
>  the bug.
> - bug occurs only on an ext4 file system on the hardware RAID backed by
>  BOSS-S1
> - bug DOESN'T occur without Xen
> - bug DOESN'T occur on Xen PVH Dom0
> - bug DOESN'T occur on Xen PV Dom0 when Xen was compiled with `NDEBUG`
>  undefined in `xen/arch/x86/pv/dom0_build.c`. When I played with it, I
>  found that I'm unable to reproduce the issue when the code that reverses
>  the MFN<->PFN mapping for Dom0 is active.

So the issue doesn't happen on debug=y builds?

That's unexpected.  I would expect the opposite, that some code in
Linux assumes that pfn + 1 == mfn + 1, and hence breaks when the
relation is reversed.

> - bug DOESN'T occur when using storage other than the one backed by BOSS-S1.
> - bug was tested under a few additional conditions and reproduction does not
>  depend on them:
>  - -O1/-O2/no optimization behaves the same
>  - the PAT patch to use Linux's PAT layout instead of Xen's choice doesn't
>  change anything
>  (https://github.com/QubesOS/qubes-vmm-xen/blob/main/1018-x86-Use-Linux-s-PAT.patch)
>  - different Linux kernel versions don't change anything
>  - vCPU pinning (e.g. a single vCPU pinned to Dom0) doesn't change anything
> - bug was tested only with smt=1 because Xen doesn't boot properly on our
>  machines with smt=0 (it hangs with "(XEN) CPU X still not dead", similar to
>  https://lists.xen.org/archives/html/xen-devel/2019-08/msg00138.html)

Hm, from that thread it seems like the original bug should already be
fixed.

> `xl info` for my testbed:
> 
> ```
> # xl info
> host : <redacted>
> release : 6.12.9+bpo-amd64
> version : #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1~bpo12+1 (2025-01-19)
> machine : x86_64
> nr_cpus : 28
> max_cpu_id : 27
> nr_nodes : 1
> cores_per_socket : 14
> threads_per_core : 2
> cpu_mhz : 2593.905
> hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d29ffffb:00000008:00000100
> virt_caps : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share vmtrace gnttab-v1 gnttab-v2
> total_memory : 130562
> free_memory : 96501
> sharing_freed_memory : 0
> sharing_used_memory : 0
> outstanding_claims : 0
> free_cpus : 0
> xen_major : 4
> xen_minor : 19
> xen_extra : .2-pre
> xen_version : 4.19.2-pre
> xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit2
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : Tue Jan 21 09:21:01 2025 +0100 git:4803a3c5b5
> xen_commandline : placeholder dom0_mem=32G,max:32G dom0_max_vcpus=16 dom0_vcpus_pin=1 no-real-mode edd=off
> cc_compiler : gcc (Debian 12.2.0-14) 12.2.0
> cc_compile_by : root
> cc_compile_domain : <redacted>
> cc_compile_date : Mon Feb 17 17:31:08 UTC 2025
> build_id : 410ba653f1f1fc13770b5d2a8cdf5e4d285b6783
> xend_config_format : 4
> ```
> 
> After collecting all of this information I've hit a roadblock. The effects
> of this bug on memory consistency are pretty serious, but on the other hand
> they occur only under very specific conditions, which makes them difficult
> to track down. I would appreciate any help in finding the root cause of this
> issue.

Can you see if you can reproduce with dom0-iommu=strict in the Xen
command line?

Thanks, Roger.



 

