[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 5/5] RFC: pci: Migrate pci_mmcfg_{read,write} to pci.c


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Teddy Astie <teddy.astie@xxxxxxxxxx>
  • Date: Tue, 19 May 2026 15:42:46 +0200
  • Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=selector1 header.d=vates.tech header.i="@vates.tech" header.h="From:Subject:Date:Message-ID:To:Cc:MIME-Version:Content-Type:In-Reply-To:References:Feedback-ID"
  • Autocrypt: addr=teddy.astie@xxxxxxxxxx; keydata= xsDNBGn5sK8BDACuzSrrTjpVf4ay06OYB6yY0J1PqKffihoNMtrQRZjAHxoAPC7LTBVHV/XO Zw5HJc+9R71z1JV+iYg6z3jPziGKzX8Fj3ZXlzJPmpf1PuETH3KdbvtJT4ny+OGntnJntUoR KRPhTirr6yNeBk/637O3CQXjtqFUPZnko8OI/o1yawIBhJJAWicutjkkUgd28Bh6HV9EIumH tCBgn5/1A/fpm9624MMgYLsA8qjC4XsoovQvFCaO8HEhvfzrrTZHjn/nPeB9SigxIxXW8YaT VqMdqul07o72m3eA2mf+LMu9a04FX/d4wbxBLtELm+1jIrbtyaFZEMOLv/haSiS/Lj3btJH/ EoucejoZ5SH49ksmVAmKOLktOaTQ8b2gEvP7iaKiIiszCCtOSRohr+2GvDsDeLvVZnlR3I+S PhHar7TPKjFz0G3DPNolyjXywNqOAMpomSPi8lSwjAFsxOtQbcck/qRGRSNk4DAmH70pA+89 MXfQXZ3qt1Q01B1+sU0I8xsAEQEAAc0kVGVkZHkgQXN0aWUgPHRlZGR5LmFzdGllQHZhdGVz LnRlY2g+wsENBBMBCAA3FiEEGAIew9LzHY3pdrqtZg+p0QLLz9AFAmn5sK8FCQWjmoACGwME CwkIBwUVCAkKCwUWAgMBAAAKCRBmD6nRAsvP0ID6DACGOktArFbLKHNzuyOVCskwfUZPla6Z pd3GZ8r61SrAKePIr2BnpgPkd0hV3bSRkRLIrgjzR2NRCzfp0x0HfuhcYfAYPR46XHTvjaJE v99sT/vGUG1BZguYDOScSEpgSNaNlYum3RKZbMuROxdK8G+YHccJY8PvWSq2K2yiae2KGiAv 1yjnZxug9/PtDfX8vQFUSg2w1ukRDf50wvDohN1zUQfFtofOP2xCRsDZiHAlQ0pF+aUjXQhP eP3IdpfWc8cyRLXF06Rk46YMYCytweGtGdHcqAfrVthl84129ZPN422k/voW0sm14gjYlGcT UwgnYlFRk2FLq0QeKEDcS0aj3o3EVAQCrayoGzi1pnlIKE3PRGUcUzjGVvzQ/po24gOjwba9 Egr/Wmu3MQlx/7A8zT5QBzF/n+RYdLNQ0Eu6YnUwf0Z1uieqNaon+olyIRFiLb/hCZHO6ekN f5vrm2clHUbQAYaPQebknujoKBo6ZLHg0WM1gZS01Gz+aUpKsUfOwM0EafmwsAEMAKiQiZa3 yQMmc/h3sDbfVHPSiBA4IMI/NAB7IotzPHq1GzCpsoVILAhF/INbWjxJ3DbVf+en3/FvdVZg 2S38xtnth0njNdlVKpyxm054phKjbdoFDwaknWolS4hrddTmetSG5/52AjtmPFtlXAk0NmLv fJnW3seXVQbgM7sW/MNXPP5UKDpkGnLhnvej+GU0s3109sJeXT5ImVdphFs9cvyZyBT9t1Pb Rowv58EgV0zE4hbAeVkULAbxFV5b/ExTjjGVHoX7CVhWxvCiTqCUoXZRkUE9C3FnkzEFRkKb Yu6NCfiHfEyB3Xyg9hfdrRgjMRq907zCof+nDtWxGz1MSEuvTj1g9GZ049Bennqzjc/Q+0ov XoK4jm+Py0FiUGUaA6yhexficjH+kCR/xDbVnWrMhSLB4AuTBT9HjfZI6gk3uYLhoT8Pig4/ eVtR2Q1wZIJsFToR6ofGuyECwFcs+PUXN7fmGRSiPXgjAr/zIUBdW0VWCE3OGPNqtRk2E5s6 IQARAQABwsD8BBgBCAAmFiEEGAIew9LzHY3pdrqtZg+p0QLLz9AFAmn5sLAFCQWjmoACGwwA CgkQZg+p0QLLz9DncQwAg76IehTemLIfrB8T9WIBZrI4kUV7G7a4rjiVoUiHYN5QwhnbZnsa JDlt+Ezoqy/510eo2bCSzvW5xXYPgyjcuOPwgQo1Qp764QxyX6rld2f2RcWkDuBHun55ZWXj by8o21ginPRwruBVYY5rVf3DV1iBu4NurUeHtyFk/dS0XTOQi2wVUb17sW/+ybCEokdVacZG zOqP/OmwHrF8ylXlXnhQq6e3r+J+T8fuoGJelm/CJiMwyP6cEWE8sxVqX/iqwjwUYkuOCpE+ lOWSvdNHgoEkWR0RXBPQjnGmLKbfTl/QDXLk6NP2/r9uxm2HL6Ei3QJKSEdrp+XZaVnk/Off O485NOTKwGOxyWb006cTMh53xPkAJFQu4Tvdj+odsHz88jqw5wfPG0BYWx0I/FspYj7N9kZR 8ULR9nX0LvpzJ/kB4NgHIUt8YtIL6ZSfM2dbF7fKzvx1UqFfvozJZwFzfEieJLXa4nlGgR6D x9fhaZEsniw8/bYgC3igkk5YJiOa
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Tue, 19 May 2026 13:42:59 +0000
  • Feedback-id: default:8631fc262581453bbf619ec5b2062170:Sweego
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Le 18/05/2026 à 19:38, Andrew Cooper a écrit :
On 18/05/2026 4:21 pm, Teddy Astie wrote:
Key parts of MMCFG access bits are in mmconfig_64.c (in particular
pci_mmcfg_{read,write}()) while PCI configuration primitives (used accross the
codebase) are in pci.c.
This leads to situations where the compiler cannot optimize the `switch (len)`
for MMCFG access for all pci_conf_{read,write}N(), because they are not from
the same file.

Move the pci_mmcfg_{read,write} in pci.c and hint the compiler to inline these
functions such that it's more likely that the compiler eliminates the
`switch (len)``.

Also take the opportunity to migrate to pci_sbdf_t to reduce the parameter count
and drop many parameter domains checks.

On GCC 16.1, this leads to codegen where pci_conf_{read,write}N() doesn't call
pci_mmcfg_{read,write}() anymore and directly perform the MMIO RW.

<pci_conf_read32>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        53                      push   %rbx
        89 f8                   mov    %edi,%eax
        89 f3                   mov    %esi,%ebx
        c1 ef 10                shr    $0x10,%edi
        81 fe ff 00 00 00       cmp    $0xff,%esi
        77 26                   ja     ffff82d040301fab <pci_conf_read32+0x3a>
        85 ff                   test   %edi,%edi
        75 22                   jne    ffff82d040301fab <pci_conf_read32+0x3a>
        0f b7 f8                movzwl %ax,%edi
        c1 e7 08                shl    $0x8,%edi
        83 e3 fc                and    $0xfffffffc,%ebx
        09 df                   or     %ebx,%edi
        81 cf 00 00 00 80       or     $0x80000000,%edi
        ba 04 00 00 00          mov    $0x4,%edx
        be 00 00 00 00          mov    $0x0,%esi
        e8 2a 1c 03 00          call   ffff82d040333bd3 <pci_conf_read>
        eb 22                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>
        81 fb ff 0f 00 00       cmp    $0xfff,%ebx
        77 24                   ja     ffff82d040301fd7 <pci_conf_read32+0x66>
        0f b6 d0                movzbl %al,%edx
        0f b6 f4                movzbl %ah,%esi
        0f b7 ff                movzwl %di,%edi
        e8 f5 fd ff ff          call   ffff82d040301db6 <pci_dev_base>
        48 85 c0                test   %rax,%rax
        74 18                   je     ffff82d040301fde <pci_conf_read32+0x6d>
        89 db                   mov    %ebx,%ebx
        48 01 d8                add    %rbx,%rax
        8b 00                   mov    (%rax),%eax
        48 8b 5d f8             mov    -0x8(%rbp),%rbx
        c9                      leave
        e9 89 12 f0 ff          jmp    ffff82d040203260 <__x86_return_thunk>
        b8 ff ff ff ff          mov    $0xffffffff,%eax
        eb ef                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>
        b8 ff ff ff ff          mov    $0xffffffff,%eax
        eb e8                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>

This is not the whole function because it's missing a pop %rbx.  Also,
right at the bottom here are the -1's from bad error paths (discussed
later).

But, this should be after the ---.  Disassembly this long isn't
interesting to stay in the commit message.


Yes, I guess I could just summary it as the function is now inlined properly.


diff --git a/xen/arch/x86/x86_64/mmconfig_64.c 
b/xen/arch/x86/x86_64/mmconfig_64.c
index 940cf6d747..483dff9c2c 100644
--- a/xen/arch/x86/x86_64/mmconfig_64.c
+++ b/xen/arch/x86/x86_64/mmconfig_64.c
@@ -133,6 +46,25 @@ static void __iomem *mcfg_ioremap(const struct 
acpi_mcfg_allocation *cfg,
      return (void __iomem *) virt;
  }
+char __iomem *pci_mmcfg_base(unsigned int seg, unsigned int *bus)
+{
+    struct acpi_mcfg_allocation *cfg;
+    int cfg_num;
+
+    for (cfg_num = 0; cfg_num < pci_mmcfg_config_num; cfg_num++) {
+        cfg = pci_mmcfg_virt[cfg_num].cfg;
+        if (cfg->pci_segment == seg &&
+            (cfg->start_bus_number <= *bus) &&
+            (cfg->end_bus_number >= *bus)) {
+            *bus -= cfg->start_bus_number;
+            return pci_mmcfg_virt[cfg_num].virt;
+        }
+    }
+
+    /* Fall back to type 0 */
+    return NULL;
+}

This is a horrid function.  Accessing and modifying bus like that causes
poor code generation, and by now having this in a separate translation
unit, the optimiser can't fold it into it's single caller and undo the
poor decisions which went into writing this function.

Instead, you want:

void __iomem *pci_mmcfg_base(pci_sbdf_t sbdf)
{
     ...
}

base which takes care of the bus adjustment internally.  This can be
broken out into a separate patch, and take the opportunity to write it
to Xen style.


That's a good idea, IIUC you want this function to now return a pointer to the ECAM MMIO region for this SBDF. So we won't have to compute it afterward.

diff --git a/xen/arch/x86/x86_64/pci.c b/xen/arch/x86/x86_64/pci.c
index 8d33429103..c37e3edade 100644
--- a/xen/arch/x86/x86_64/pci.c
+++ b/xen/arch/x86/x86_64/pci.c
@@ -11,13 +11,123 @@
  #define PCI_CONF_ADDRESS(sbdf, reg) \
      (0x80000000U | ((sbdf).bdf << 8) | ((reg) & ~3))
+/*
+ * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
+ * on their northbrige except through the * %eax register. As such, you MUST
+ * NOT use normal IOMEM accesses, you need to only use the magic mmio-config
+ * accessor functions.
+ * In fact just use pci_config_*, nothing else please.

I know you're just copying an existing comment, but it's mostly an
opinion and not terribly helpful in place.

"AMD Fam10h CPUs can only make MMCFG accesses via MOV %eax/%ax/%al",
would be better, except...

... this claim cannot be true.  It's been made since the K8 RevF BKWG
and exists even into the latest PPRs, but that's simply not how
load/store ops work in the pipeline.

It was added to Linux in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3320ad994afb2c44ad34b3b34c3c5cf0da297331
but without adequate justification.  I've made some enquiries.


Recent AMD PPR state that you must use eax/ax/al to access MMIO configuration space, so I suppose it's not actually a bug (or was in the past one, but just became a part of the spec ?)

> MMIO configuration space accesses must use the uncacheable (UC) memory
> type.
> Instructions used to read MMIO configuration space are required to
> take the following form:
> mov eax/ax/al, any_address_mode;
> Instructions used to write MMIO configuration space are required to
> take the following form:
> mov any_address_mode, eax/ax/al;
> No other source/target registers may be used other than eax/ax/al.

I can't find anything regarding Intel though.

If that happens to not be the case in practice, and it's fine to use whichever register, that would be preferable; some other parts of the code rely on just volatile pointers to do it.

+ */
+static inline unsigned char mmio_config_readb(void __iomem *pos)
+{
+    u8 val;
+    asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
+    return val;
+}

These need corrections, either in this patch or a followup.

Switch to Xen types, and correct the memory operand constraint to be "m"
(*(uint8_t *)ptr).

The Fam10h BKWG states that any memory encoding is acceptable, and this
allows the optimiser more flexibility (which will get used).


Looks good to me.
I was thinking about using a macro here, but I didn't like the end result. So I guess I will stay with these inline functions.

Agree with styling change (I guess there are (too) many places that want such refactor).

+
+static inline unsigned short mmio_config_readw(void __iomem *pos)
+{
+    u16 val;
+    asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
+    return val;
+}
+
+static inline unsigned int mmio_config_readl(void __iomem *pos)
+{
+    u32 val;
+    asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
+    return val;
+}
+
+static inline void mmio_config_writeb(void __iomem *pos, u8 val)
+{
+    asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writew(void __iomem *pos, u16 val)
+{
+    asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writel(void __iomem *pos, u32 val)
+{
+    asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned 
int devfn)
+{
+    char __iomem *addr;
+
+    addr = pci_mmcfg_base(seg, &bus);
+    if (!addr)
+        return NULL;
+     return addr + ((bus << 20) | (devfn << 12));
+}
+
+static inline
+int pci_mmcfg_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len, u32 
*value)
+{
+    char __iomem *addr;
+
+    /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+    if (unlikely(reg > 4095)) {
+err:        *value = -1;
+        return -EINVAL;
+    }
+
+    addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
+    if (!addr)
+        goto err;
+
+    switch (len) {
+    case 1:
+        *value = mmio_config_readb(addr + reg);
+        break;
+    case 2:
+        *value = mmio_config_readw(addr + reg);
+        break;
+    case 4:
+        *value = mmio_config_readl(addr + reg);
+        break;
+    }
+
+    return 0;
+}

Again, for this patch or a later cleanup, drop the output-by-pointer and
return value directly.  The optimiser is hopefully doing this already
but it also makes the function simpler.

At best, we want ASSERT_UNREACHBLE()'s in the error paths (including no
mapping), and to consistently return -1.  Returning 0 for a bad length
is bogus.

Strictly speaking we also need to check reg & (len - 1) because accesses
must be naturally aligned, but even with ASSERT_UNREACHABLE() and a
failsafe -1, that's still logic emitted and I'm not sure if it's worth
having.  Amongst other things you really need to know that len is 1, 2
or 4 before the alignment check reads correctly.


Making this function inline itself could actually work at allowing the compiler to eliminate many of these checks, but it's not ideal.

We can eventually leave alignment/length asserts to debug code.

+
+inline int pci_mmcfg_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int 
len, u32 value)
+{
+    char __iomem *addr;
+
+    /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+    if (unlikely(reg > 4095))
+        return -EINVAL;
+
+    addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
+    if (!addr)
+        return -EINVAL;
+
+    switch (len) {
+    case 1:
+        mmio_config_writeb(addr + reg, value);
+        break;
+    case 2:
+        mmio_config_writew(addr + reg, value);
+        break;
+    case 4:
+        mmio_config_writel(addr + reg, value);
+        break;
+    }
+
+    return 0;
+}
+
  uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg)
  {
      uint32_t value;
if ( sbdf.seg || reg > 255 )
      {
-        pci_mmcfg_read(sbdf.seg, sbdf.bus, sbdf.devfn, reg, 1, &value);
+        pci_mmcfg_read(sbdf, reg, 1, &value);
          return value;
      }

Not for this patch, but we also need to junk the if() condition here and
elsewhere.

We should be using MMCFG unilaterally if it's available; the IO port
pairs require use of a global spinlock, and behind the scenes all the
CPU is doing is translating it back into MMCFG-like accesses.

At this juncture we should probably change it at the start of the 4.23
dev window to give it maximum time to settle before getting into a
release, so probably best to tack it on as a final commit in this series?


It's actually the plan for [1]. I was thinking of splitting this specific patch and merge it into a new one dedicated to MMCFG refactoring and keep the PCI refactoring around pci_sbdf_t separate (the more I look to PCI subsystem, the more stuff I find to improve there).
~Andrew


Teddy

[1] https://lore.kernel.org/xen-devel/cover.1767804090.git.teddy.astie@xxxxxxxxxx/

Attachment: OpenPGP_0x660FA9D102CBCFD0.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.