Xen project Mailing List

Re: [PATCH] x86/pci: Correct ECS handling with CF8/CFC emulation

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Mon, 3 Apr 2023 15:26:42 +0200

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=iuyfu2/BsHus3kHZGId7MNczCmHstVKYik7fQ1vMBSI=; b=mXe2PDdASLPCtoeViNadZDC16EsdXq7ZDpcQ+uMHM2G66RP/wLaxWwsAA9DLVBetQIFq8T2BSUSdtUOP5v/fbKMbEC7Y7Zg6ZzBkTlSs74TSOsDYTbnUwW55WOCf8inEOT49Cyb1R7TwKXKv7uNKNzpve7DwBsVVbd7+Dwp24KtP6aGETQ2l16q6pMBATlHkGdyDDmC/runZNmPQek2N1L506zrcyBUYACPgCngpY1T8aL9VT5HDKGyHzK0n/qGAcok2RferIDq030L1hhtO7POwwga0ax8RWeqrJdOcDevM5VcX7FGCs7i/sBVRl8YgD8Vd8zOyw4KHfzto9AHP0A==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=mB1mmXF+aylgBGjOWaPSQlAFMio6He+ps3MaWmk1+PqKcp02EeGbaWQoLXrovAtw41RC0MDy1cB0Yf5hCJWlkTLYMYeSntea2+xhKXqeVVASI1ohy3noBUUwggMQS/lpvP/ftCO0XHl4/OCDav7Vc7YNqhbn8E45YCBE7SK//Yr/zhGc0EzRoGG3aopIn8zf6UtchctwFt8Bev4RIeCGREXbRG+pOhyf4fH3Cw7Z2NJMkTPiuaMeZpcYmUbHTIu7dxKsiqVjOsWhRF434E6QfcqXL9zzR0ejPX3kkSkWIgK6fhazsesOeJmDtKwcOfYTnpyXQIAFqRegIMdAJOxeUw==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;

Cc: Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Wei Liu <wl@xxxxxxx>

Delivery-date: Mon, 03 Apr 2023 13:27:11 +0000

Ironport-data: A9a23:lVpx7KibkeP8VPf9HdoBkkubX161RxEKZh0ujC45NGQN5FlHY01je htvUGnUbPnca2Dwe4olOYy0p0MBu8fTy9A3QFM++ysyRn8b9cadCdqndUqhZCn6wu8v7q5Ex 55HNoSfdpBcolv0/ErF3m3J9CEkvU2wbuOgTrWCYmYpHlUMpB4J0XpLg/Q+jpNjne+3CgaMv cKai8DEMRqu1iUc3lg8sspvkzsy+qWi0N8klgZmP6sT4AeFzyB94K83fsldEVOpGuG4IcbiL wrz5OnR1n/U+R4rFuSknt7TGqHdauePVeQmoiM+t5mK2nCulARrukoIHKN0hXNsoyeIh7hMJ OBl7vRcf+uL0prkw4zxWzEAe8130DYvFLXveRBTuuTLp6HKnueFL1yDwyjaMKVBktubD12i+ tQcLQoMNjCcwNipnr/jUttmmYcdN+j0adZ3VnFIlVk1DN4AaLWaGeDv2oUd2z09wMdTAfzZe swVLyJ1awjNaAFOPVFRD48imOCvhT/0dDgwRFC9/PJrpTSMilIvluSyWDbWUoXiqcF9hEGXq 3iA523kKhobKMae2XyO9XfEaurnxHulBd5NROflnhJsqFSP61VQIQUPbGuygN2mt2fhSu0OC VNBr0LCqoB3riRHVOLVTxC+5XKJoBMYc95RCPEhrhGAzLLO5ASUDXRCSSROAPQEnsIrQT0h1 neSgsjkQzdotdW9Vna15rqS6zSoNkAowXQqYCYFSU4A/IPlqYRq1BbXFI4/T+iyk8H/Hiz2z 3aSti8iir4PjMkNkaKm4VTAhDHqrZ/MJuIo2jjqsquexlsRTOaYi0aAsjA3Md4owF6lc2S8

Ironport-hdrordr: A9a23:MY017a18yJU3K80J7x1G2wqjBAkkLtp133Aq2lEZdPWaSK2lfq eV7ZImPH7P+VEssRQb8+xoV5PsfZqxz/JICMwqTNSftOePghrVEGgg1/qe/9XYcxeOidK1rJ 0QDZSWaueRMbEKt7ef3ODiKadY/DDvysnB7ts2jU0dLz2CDZsO0+4TMHf/LqQZfmd77LMCZe uhz/sCiTq8WGgdKv+2DmMCWIH41qf2vaOjTx4aJgItrDKDhzOw6LL8DnGjr2wjegIK77c+0H TP1zf07KW7s/2911v12mLJ445N8eGRuudrNYijitU1Nj6psAquaYh7MofyxAwInA==

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 03, 2023 at 11:16:52AM +0100, Andrew Cooper wrote: > On 03/04/2023 9:57 am, Roger Pau Monné wrote: > > On Fri, Mar 31, 2023 at 06:57:19PM +0100, Andrew Cooper wrote: > >> This started by noticing the dubious Fam17h check in > >> arch_ioreq_server_get_type_addr(), and then realising that checking the > >> host > >> CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path. > >> > >> What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but > >> a) > >> there isn't one, and b) the vestigial remnants of the cross-vendor > >> migration > >> logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the > >> CF8_EXT path unused by any suitably-written OS in the first place. > >> > >> MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a > >> read-zero, > >> write-discard stub), and the ECS extension is unconditionally active, > >> meaning > >> it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs. > >> > >> It turns out that Xen did even have this behaviour in 4.5 and earlier, with > >> this problematic CF8_EXT checking being added in 4.6. Therefore, revert > >> back > >> to Xen's older behaviour - it is objectively less wrong than the current > >> logic. > >> > >> While fixing this, get rid of hvm_pci_decode_addr() - it is more > >> complicated > >> to follow (and to call) than using the CF8* macros in the calling context. > >> Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a > >> comment explaining all about CF8/CFC accesses. > >> > >> There's one rare case when CF8_EXT is visible to guests, and that is for a > >> pinned hwdom. Right now, we permit such a dom0 to modify the CF8_EXT bit, > >> but > >> this seems like a very unwise idea. Leave a TODO for people to consider. > > One weirdness I've noticed is that for vPCI we decode the accesses > > taking the extended CF8 bit after this change, but then if the access > > is relayed to the hardware using vpci_{read,write}_hw it will be > > forwarded to the hardware using pci_conf_{read,write}<size> which > > doesn't have support for CF8_EXT. So if the underlying hardware > > doesn't have MMCFG support and the reg is > 255 it will be dropped. > > It is important to stress that this change does not influence whether > the guest issues ECS accesses or not. All it does is change Xen's > handling of such accesses. > > Previously vPCI blindly ignored ECS accesses, so the vPCI layer > effectively truncated them to BCS accesses. > > Now, from your analysis, when MMCFG isn't active, Xen's PCI layer will > effectively terminate ECS accesses with default behaviour, even on > systems where IO ECS is available. > > So we've changed one valid behaviour for a different valid behaviour. Given vPCI is currently limited to PVH dom0 I doubt there are many systems capable of running PVH dom0 but not having MMCFG support. Best way to fix would be to implement ECS accesses in pci_conf_{read,write}<size>() if the register is > 255 and there's no MMCFG. > > (Quick tangent... Our PCI handling is currently very dumb. > pci_mmcfg_read() returns its value by pointer but the callers never > check. Swapping it to return by value would improve code gen quite a > lot. Also, when MMCFG is active we still pass BCS accesses to IO ports.) I wonder if it's really preferred to access registers below 255 using the IO ports, as Linux seems to do the same (prefer IO port access if possible). > So I think we do want to improve Xen's general behaviour too, but this > difference here doesn't concern me. > > > > >> Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks") > >> Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding") > >> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> > >> --- > >> CC: Jan Beulich <JBeulich@xxxxxxxx> > >> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx> > >> CC: Wei Liu <wl@xxxxxxx> > >> > >> Whoever reviewed those two patches originally was clearly a fool... > >> --- > >> xen/arch/x86/hvm/io.c | 24 ++++++------------------ > >> xen/arch/x86/hvm/ioreq.c | 19 ++----------------- > >> xen/arch/x86/include/asm/hvm/io.h | 4 ---- > >> xen/arch/x86/include/asm/pci.h | 26 ++++++++++++++++++++++++-- > >> xen/arch/x86/pv/emul-priv-op.c | 19 ++++++------------- > >> 5 files changed, 38 insertions(+), 54 deletions(-) > >> > >> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c > >> index 5ae209d3b6b3..b0d3c236e985 100644 > >> --- a/xen/arch/x86/hvm/io.c > >> +++ b/xen/arch/x86/hvm/io.c > >> @@ -248,20 +248,6 @@ void register_g2m_portio_handler(struct domain *d) > >> handler->ops = &g2m_portio_ops; > >> } > >> > >> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr, > >> - pci_sbdf_t *sbdf) > >> -{ > >> - ASSERT(CF8_ENABLED(cf8)); > >> - > >> - sbdf->bdf = CF8_BDF(cf8); > >> - sbdf->seg = 0; > >> - /* > >> - * NB: the lower 2 bits of the register address are fetched from the > >> - * offset into the 0xcfc register when reading/writing to it. > >> - */ > >> - return CF8_ADDR_LO(cf8) | (addr & 3); > >> -} > >> - > >> /* vPCI config space IO ports handlers (0xcf8/0xcfc). */ > >> static bool cf_check vpci_portio_accept( > >> const struct hvm_io_handler *handler, const ioreq_t *p) > >> @@ -275,7 +261,7 @@ static int cf_check vpci_portio_read( > >> { > >> const struct domain *d = current->domain; > >> unsigned int reg; > >> - pci_sbdf_t sbdf; > >> + pci_sbdf_t sbdf = {}; > >> uint32_t cf8; > >> > >> *data = ~(uint64_t)0; > >> @@ -292,7 +278,8 @@ static int cf_check vpci_portio_read( > >> if ( !CF8_ENABLED(cf8) ) > >> return X86EMUL_UNHANDLEABLE; > >> > >> - reg = hvm_pci_decode_addr(cf8, addr, &sbdf); > >> + sbdf.bdf = CF8_BDF(cf8); > >> + reg = CF8_REG(cf8) | (addr & 3); > >> > >> if ( !vpci_access_allowed(reg, size) ) > >> return X86EMUL_OKAY; > >> @@ -308,7 +295,7 @@ static int cf_check vpci_portio_write( > >> { > >> struct domain *d = current->domain; > >> unsigned int reg; > >> - pci_sbdf_t sbdf; > >> + pci_sbdf_t sbdf = {}; > >> uint32_t cf8; > >> > >> if ( addr == 0xcf8 ) > >> @@ -323,7 +310,8 @@ static int cf_check vpci_portio_write( > >> if ( !CF8_ENABLED(cf8) ) > >> return X86EMUL_UNHANDLEABLE; > >> > >> - reg = hvm_pci_decode_addr(cf8, addr, &sbdf); > >> + sbdf.bdf = CF8_BDF(cf8); > >> + reg = CF8_REG(cf8) | (addr & 3); > >> > >> if ( !vpci_access_allowed(reg, size) ) > >> return X86EMUL_OKAY; > >> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c > >> index 0bdcca1e1a5f..325a9d118e52 100644 > >> --- a/xen/arch/x86/hvm/ioreq.c > >> +++ b/xen/arch/x86/hvm/ioreq.c > >> @@ -285,27 +285,12 @@ bool arch_ioreq_server_get_type_addr(const struct > >> domain *d, > >> (p->addr & ~3) == 0xcfc && > >> CF8_ENABLED(cf8) ) > >> { > >> - unsigned int x86_fam, reg; > >> - pci_sbdf_t sbdf; > >> - > >> - reg = hvm_pci_decode_addr(cf8, p->addr, &sbdf); > >> + pci_sbdf_t sbdf = { .bdf = CF8_BDF(cf8) }; > >> + unsigned int reg = CF8_REG(cf8) | (p->addr & 3); > >> > >> /* PCI config data cycle */ > >> *type = XEN_DMOP_IO_RANGE_PCI; > >> *addr = ((uint64_t)sbdf.sbdf << 32) | reg; > >> - /* AMD extended configuration space access? */ > >> - if ( CF8_ADDR_HI(cf8) && > >> - d->arch.cpuid->x86_vendor == X86_VENDOR_AMD && > >> - (x86_fam = get_cpu_family( > >> - d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 && > >> - x86_fam < 0x17 ) > >> - { > >> - uint64_t msr_val; > >> - > >> - if ( !rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) && > >> - (msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) ) > >> - *addr |= CF8_ADDR_HI(cf8); > >> - } > >> } > >> else > >> { > >> diff --git a/xen/arch/x86/include/asm/hvm/io.h > >> b/xen/arch/x86/include/asm/hvm/io.h > >> index 54e0161b492c..3f3fb6403ccb 100644 > >> --- a/xen/arch/x86/include/asm/hvm/io.h > >> +++ b/xen/arch/x86/include/asm/hvm/io.h > >> @@ -144,10 +144,6 @@ void stdvga_deinit(struct domain *d); > >> > >> extern void hvm_dpci_msi_eoi(struct domain *d, int vector); > >> > >> -/* Decode a PCI port IO access into a bus/slot/func/reg. */ > >> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr, > >> - pci_sbdf_t *sbdf); > >> - > >> /* > >> * HVM port IO handler that performs forwarding of guest IO ports into > >> machine > >> * IO ports. > >> diff --git a/xen/arch/x86/include/asm/pci.h > >> b/xen/arch/x86/include/asm/pci.h > >> index f4a58c8acf13..3b814f4ebacf 100644 > >> --- a/xen/arch/x86/include/asm/pci.h > >> +++ b/xen/arch/x86/include/asm/pci.h > >> @@ -3,10 +3,32 @@ > >> > >> #include <xen/mm.h> > >> > >> +/* > >> + * PCI config space accesses with CF8/CFC: > >> + * > >> + * 1) Write {Enable | BDF | Reg} to CF8 to set an address > >> + * 2) Read or write CF{C..F} to access the register > >> + * > >> + * For sub-dword register accesses, the bottom two register address bits > >> come > >> + * from the CF{C..F} address, not from CF8. > >> + * > >> + * AMD have extention to this protocol to access PCIe Extend Config Space > >> by > >> + * storing the 4 extra register address bits in the penultimate nibble of > >> CF8. > >> + * This extention: > >> + * - Is unconditionally active on Fam17h and later > >> + * - Has model specific enablement on Fam10h thru Fam16h > >> + * - Has reserved behaviour in all other cases, including other vendors > >> + * > >> + * For simplicity and because we are permitted to, given "reserved", Xen > >> + * always treats ECS as active when emulating guest PCI config space > >> accesses. > >> + */ > >> #define CF8_BDF(cf8) ( ((cf8) & 0x00ffff00) >> 8) > >> -#define CF8_ADDR_LO(cf8) ( (cf8) & 0x000000fc) > >> -#define CF8_ADDR_HI(cf8) ( ((cf8) & 0x0f000000) >> 16) > >> #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000)) > >> +#define CF8_REG(cf8) \ > >> + ({ \ > >> + unsigned int _c = cf8; \ > >> + ((_c & 0x0f000000) >> 16) | (_c & 0xfc); \ > >> + }) > > What happens on Intel when the bit is set, is it just ignored? > > "the bit" => the ECS nibble, or the CF8_EXT bit? The ECS nibble. > > The ECS nibble is ignored on Intel AFAICT, while the CF8_EXT bit is in a > very AMD-only MSR, so won't exist on Intel. > > > We only allow such accesses for dom0 anyway. > > And guests running on AMD hardware where CF8_EXT is active on the > northbridge of the core we are instantaneously scheduled on. > > >> > >> #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \ > >> || id == 0x01268086 || id == 0x01028086 \ > >> diff --git a/xen/arch/x86/pv/emul-priv-op.c > >> b/xen/arch/x86/pv/emul-priv-op.c > >> index 5da00e24e4ff..008367195c78 100644 > >> --- a/xen/arch/x86/pv/emul-priv-op.c > >> +++ b/xen/arch/x86/pv/emul-priv-op.c > >> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned > >> int start, > >> if ( ro_map && test_bit(machine_bdf, ro_map) ) > >> return false; > >> } > >> - start |= CF8_ADDR_LO(currd->arch.pci_cf8); > >> - /* AMD extended configuration space access? */ > >> - if ( CF8_ADDR_HI(currd->arch.pci_cf8) && > >> - boot_cpu_data.x86_vendor == X86_VENDOR_AMD && > >> - boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 ) > >> - { > >> - uint64_t msr_val; > >> - > >> - if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) ) > >> - return false; > >> - if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) ) > >> - start |= CF8_ADDR_HI(currd->arch.pci_cf8); > >> - } > >> + start |= CF8_REG(currd->arch.pci_cf8); > >> > >> return !write ? > >> xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf, > >> @@ -1104,6 +1092,11 @@ static int cf_check write_msr( > >> if ( !is_hwdom_pinned_vcpu(curr) ) > >> return X86EMUL_OKAY; > >> if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) || > >> + /* > >> + * TODO: this is broken. What happens when dom0 is pinned > >> but > >> + * can't see the full system? CF8_EXT probably ought to be a > >> + * Xen-owned setting, and made symmetric across the system. > >> + */ > > I would assume CF8_EXT would be symmetric across the system, specially > > given that it seems to be phased out and only used in older AMD > > families that where all symmetric? > > The CF8_EXT bit has been phased out. The IO ECS functionality still exists. > > But yes, the more I think about letting dom0 play with this, the more I > think it is a fundamentally broken idea... I bet it was from the very > early AMD Fam10h days where dom0 knew how to turn it on, and Xen was > trying to pretend it didn't have to touch any PCI devices. It seems to me Xen should set CF8_EXT on all threads (when available) and expose it to dom0, so that accesses using pci_{conf,write}_read() work as expected? Thanks, Roger.

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.