
Re: [PATCH v3 3/3] PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flag



On Mon, Mar 24, 2025 at 07:58:14PM +0100, Daniel Gomez wrote:
> On Mon, Mar 24, 2025 at 06:51:54PM +0100, Roger Pau Monné wrote:
> > On Mon, Mar 24, 2025 at 03:29:46PM +0100, Daniel Gomez wrote:
> > > 
> > > Hi,
> > > 
> > > On Fri, Mar 21, 2025 at 09:00:09AM +0100, Jürgen Groß wrote:
> > > > On 20.03.25 22:07, Bjorn Helgaas wrote:
> > > > > On Wed, Feb 19, 2025 at 10:20:57AM +0100, Roger Pau Monne wrote:
> > > > > > > Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
> > > > > > > MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
> > > > > > > Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
> > > > > > > event channels, to prevent PCI code from attempting to toggle the mask bit,
> > > > > > > as it's Xen that controls the bit.
> > > > > > > 
> > > > > > > However, the pci_msi_ignore_mask being global will affect devices that use
> > > > > > > MSI interrupts but are not routing those interrupts over event channels
> > > > > > > (not using the Xen pIRQ chip).  One example is devices behind a VMD PCI
> > > > > > > bridge.  In that scenario the VMD bridge configures MSI(-X) using the
> > > > > > > normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
> > > > > > > bridge configure the MSI entries using indexes into the VMD bridge MSI
> > > > > > > table.  The VMD bridge then demultiplexes such interrupts and delivers to
> > > > > > > the destination device(s).  Having pci_msi_ignore_mask set in that scenario
> > > > > > > prevents (un)masking of MSI entries for devices behind the VMD bridge.
> > > > > > > 
> > > > > > > Move the signaling of no entry masking into the MSI domain flags, as that
> > > > > > > allows setting it on a per-domain basis.  Set it for the Xen MSI domain
> > > > > > > that uses the pIRQ chip, while leaving it unset for the rest of the
> > > > > > > cases.
> > > > > > > 
> > > > > > > Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
> > > > > > > with Xen dropping usage the variable is unneeded.
> > > > > > > 
> > > > > > > This fixes using devices behind a VMD bridge on Xen PV hardware domains.
> > > > > > > 
> > > > > > > Albeit devices behind a VMD bridge are not known to Xen, that doesn't mean
> > > > > > > Linux cannot use them.  By inhibiting the usage of
> > > > > > > VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask
> > > > > > > bodge devices behind a VMD bridge do work fine when used from a Linux Xen
> > > > > > > hardware domain.  That's the whole point of the series.
> > > > > > 
> > > > > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > > > > > Reviewed-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > > > > > Acked-by: Juergen Gross <jgross@xxxxxxxx>
> > > > > 
> > > > > Acked-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> > > > > 
> > > > > I assume you'll merge this series via the Xen tree.  Let me know if
> > > > > otherwise.
> > > > 
> > > > I've pushed the series to the linux-next branch of the Xen tree.
> > > > 
> > > > 
> > > > Juergen
> > > 
> > > This patch landed in latest next-20250324 tag causing this crash:
> > > 
> > > [    0.753426] BUG: kernel NULL pointer dereference, address: 0000000000000002
> > > [    0.753921] #PF: supervisor read access in kernel mode
> > > [    0.754286] #PF: error_code(0x0000) - not-present page
> > > [    0.754656] PGD 0 P4D 0
> > > [    0.754842] Oops: Oops: 0000 [#1]
> > > [    0.755080] CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.14.0-rc7-next-20250324 #1 NONE
> > > [    0.755691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > [    0.756349] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [    0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [    0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [    0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [    0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [    0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [    0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [    0.756390] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [    0.756390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [    0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [    0.756390] Call Trace:
> > > [    0.756390]  <TASK>
> > > [    0.756390]  ? __die_body+0x1b/0x60
> > > [    0.756390]  ? page_fault_oops+0x2d0/0x310
> > > [    0.756390]  ? exc_page_fault+0x59/0xc0
> > > [    0.756390]  ? asm_exc_page_fault+0x22/0x30
> > > [    0.756390]  ? msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390]  ? msix_capability_init+0x172/0x2c0
> > > [    0.756390]  ? __pci_enable_msix_range+0x1a8/0x1d0
> > > [    0.756390]  ? pci_alloc_irq_vectors_affinity+0x7c/0xf0
> > > [    0.756390]  ? vp_find_vqs_msix+0x187/0x400
> > > [    0.756390]  ? vp_find_vqs+0x2f/0x250
> > > [    0.756390]  ? snprintf+0x3e/0x50
> > > [    0.756390]  ? vp_modern_find_vqs+0x13/0x60
> > > [    0.756390]  ? init_vq+0x184/0x1e0
> > > [    0.756390]  ? vp_get_status+0x20/0x20
> > > [    0.756390]  ? virtblk_probe+0xeb/0x8d0
> > > [    0.756390]  ? __kernfs_new_node+0x122/0x160
> > > [    0.756390]  ? vp_get_status+0x20/0x20
> > > [    0.756390]  ? virtio_dev_probe+0x171/0x1c0
> > > [    0.756390]  ? really_probe+0xc2/0x240
> > > [    0.756390]  ? driver_probe_device+0x1d/0x70
> > > [    0.756390]  ? __driver_attach+0x96/0xe0
> > > [    0.756390]  ? driver_attach+0x20/0x20
> > > [    0.756390]  ? bus_for_each_dev+0x7b/0xb0
> > > [    0.756390]  ? bus_add_driver+0xe6/0x200
> > > [    0.756390]  ? driver_register+0x5e/0xf0
> > > [    0.756390]  ? virtio_blk_init+0x4d/0x90
> > > [    0.756390]  ? add_boot_memory_block+0x90/0x90
> > > [    0.756390]  ? do_one_initcall+0xe2/0x250
> > > [    0.756390]  ? xas_store+0x4b/0x4b0
> > > [    0.756390]  ? number+0x13b/0x260
> > > [    0.756390]  ? ida_alloc_range+0x36a/0x3b0
> > > [    0.756390]  ? parameq+0x13/0x90
> > > [    0.756390]  ? parse_args+0x10f/0x2a0
> > > [    0.756390]  ? do_initcall_level+0x83/0xb0
> > > [    0.756390]  ? do_initcalls+0x43/0x70
> > > [    0.756390]  ? rest_init+0x80/0x80
> > > [    0.756390]  ? kernel_init_freeable+0x70/0xb0
> > > [    0.756390]  ? kernel_init+0x16/0x110
> > > [    0.756390]  ? ret_from_fork+0x30/0x40
> > > [    0.756390]  ? rest_init+0x80/0x80
> > > [    0.756390]  ? ret_from_fork_asm+0x11/0x20
> > > [    0.756390]  </TASK>
> > > [    0.756390] Modules linked in:
> > > [    0.756390] CR2: 0000000000000002
> > > [    0.756390] ---[ end trace 0000000000000000 ]---
> > > [    0.756390] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [    0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [    0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [    0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [    0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [    0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [    0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [    0.756390] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [    0.756390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [    0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [    0.756390] note: swapper[1] exited with irqs disabled
> > > [    0.782774] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> > > [    0.783560] Kernel Offset: disabled
> > > [    0.783909] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
> > > 
> > > 
> > > msix_prepare_msi_desc+0x39/0x80:
> > > msix_prepare_msi_desc at drivers/pci/msi/msi.c:616
> > >  611            desc->nvec_used                         = 1;
> > >  612            desc->pci.msi_attrib.is_msix            = 1;
> > >  613            desc->pci.msi_attrib.is_64              = 1;
> > >  614            desc->pci.msi_attrib.default_irq        = dev->irq;
> > >  615            desc->pci.mask_base                     = dev->msix_base;
> > > >616<           desc->pci.msi_attrib.can_mask           = !(info->flags & MSI_FLAG_NO_MASK) &&
> > >  617                                                      !desc->pci.msi_attrib.is_virtual;
> > >  618
> > >  619            if (desc->pci.msi_attrib.can_mask) {
> > >  620                    void __iomem *addr = pci_msix_desc_addr(desc);
> > >  621
> > > 
> > > Reverting patch 3 fixes the issue.
> > 
> > Thanks for the report and sorry for the breakage.  Do you have a QEMU
> > command line I can use to try to reproduce this locally?
> > 
> > Will work on a patch ASAP.
> 
> Thanks for the quick reply.
> 
> The issue is that info appears to be uninitialized. So, this worked for me:

Indeed, irq_domain->host_data is NULL, so there's no msi_domain_info.
As this is x86, I was expecting x86 to always use
x86_init_dev_msi_info(), but that doesn't seem to be the case.  I
would like to better understand this.

> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index dcbb4f9ac578..b76c7ec33602 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -609,8 +609,10 @@ void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
>         desc->pci.msi_attrib.is_64              = 1;
>         desc->pci.msi_attrib.default_irq        = dev->irq;
>         desc->pci.mask_base                     = dev->msix_base;
> -       desc->pci.msi_attrib.can_mask           = !(info->flags & MSI_FLAG_NO_MASK) &&
> -                                                 !desc->pci.msi_attrib.is_virtual;
> +       desc->pci.msi_attrib.can_mask =
> +               info ? !(info->flags & MSI_FLAG_NO_MASK) &&
> +                               !desc->pci.msi_attrib.is_virtual :
> +                      1;
> 
>         if (desc->pci.msi_attrib.can_mask) {
>                 void __iomem *addr = pci_msix_desc_addr(desc);
> @@ -743,7 +745,7 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
>         /* Disable INTX */
>         pci_intx_for_msi(dev, 0);
> 
> -       if (!(info->flags & MSI_FLAG_NO_MASK)) {
> +       if (info && !(info->flags & MSI_FLAG_NO_MASK)) {

I think this should rather be:

if (!info || !(info->flags & MSI_FLAG_NO_MASK)) {

So that in case of no info the default action is to mask the entries.
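
To make the difference between the two conditions concrete, here is a
minimal userspace sketch.  The struct, the flag value and the helper
names are stand-ins invented for illustration, not the kernel
definitions; only the two boolean expressions come from the diff and the
suggestion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag value, chosen only for this sketch. */
#define DEMO_MSI_FLAG_NO_MASK (1u << 0)

/* Stand-in for the per-domain info; NULL models the missing host_data. */
struct demo_msi_info {
	unsigned int flags;
};

/* Condition from the quoted diff: a NULL info skips the masking block. */
static bool mask_entries_and(const struct demo_msi_info *info)
{
	return info && !(info->flags & DEMO_MSI_FLAG_NO_MASK);
}

/* Suggested condition: a NULL info falls back to masking the entries. */
static bool mask_entries_or(const struct demo_msi_info *info)
{
	return !info || !(info->flags & DEMO_MSI_FLAG_NO_MASK);
}

/* The two variants agree whenever info exists and differ only for NULL. */
static void self_check(void)
{
	struct demo_msi_info plain = { .flags = 0 };
	struct demo_msi_info no_mask = { .flags = DEMO_MSI_FLAG_NO_MASK };

	assert(mask_entries_and(&plain) && mask_entries_or(&plain));
	assert(!mask_entries_and(&no_mask) && !mask_entries_or(&no_mask));
	assert(!mask_entries_and(NULL));
	assert(mask_entries_or(NULL));
}
```

Both forms avoid the NULL dereference; they only disagree on what
happens when there is no info at all, which is the case hit in the
report.
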

>                 /*
>                  * Ensure that all table entries are masked to prevent
>                  * stale entries from firing in a crash kernel.
> 
> I also noticed that d (struct irq_domain) can be NULL if CONFIG_GENERIC_MSI_IRQ
> is not set, and we are not checking that either.
> 
> I run QEMU with vmctl [1]. This is my command:
> 
> [1] https://github.com/SamsungDS/vmctl
> 
> /usr/bin/qemu-system-x86_64 \
>   -nodefaults \
>   -display "none" \
>   -machine "q35,accel=kvm,kernel-irqchip=split" \
>   -cpu "host" \
>   -smp "4" \
>   -m "8G" \
>   -device "intel-iommu,intremap=on" \
>   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
>   -device "virtio-net-pci,netdev=net0" \
>   -device "virtio-rng-pci" \
>   -drive "id=boot,file=file.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
>   -device "pcie-root-port,id=pcie_root_port0,chassis=1,slot=0" \
>   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port0,mdts=7" \
>   -drive "id=nvm,file=~/nvm.img,format=raw,if=none,discard=unmap,media=disk,read-only=no" \
>   -device "nvme-ns,id=nvm,drive=nvm,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096" \
>   -pidfile "~/vmctl/confdir/run/nvme/pidfile" \
>   -kernel "~/src/kernel/linux/arch/x86_64/boot/bzImage" \
>   -append "root=/dev/vda1 console=ttyS0,115200 audit=0" \
>   -virtfs "local,path=~/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
>   -serial "mon:stdio" \
>   -d "guest_errors" \
>   -D "~/vmctl/confdir/log/nvme/qemu.log"

Can you narrow down the command line to the minimum required to
reproduce the issue?

Can you attach the Kconfig used to build the crashing kernel?

Thanks, Roger.
