[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] x86/hvmloader: select xenpci MMIO BAR UC or WB MTRR cache attribute



Hi Roger,

On 05/06/2025 18:20, Roger Pau Monne wrote:
> The Xen PCI device (vendor ID 0x5853) exposed to x86 HVM guests doesn't
> have the functionality of a traditional PCI device.  The exposed MMIO BAR
> is used by some guests (including Linux) as a safe place to map foreign
> memory, including the grant table itself.
>
> Traditionally BARs from devices have the uncacheable (UC) cache attribute
> from the MTRR, to ensure correct functionality of such devices.  hvmloader
> mimics this behavior and sets the MTRR attributes of both the low and high
> PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
>
> This however causes performance issues for users of the Xen PCI device BAR,
> as for the purposes of mapping remote memory there's no need to use the UC
> attribute.  On Intel systems this is worked around by using iPAT, that
> allows the hypervisor to force the effective cache attribute of a p2m entry
> regardless of the guest PAT value.  AMD however doesn't have an equivalent
> of iPAT, and guest PAT values are always considered.
>
> Linux commit:
>
> 41925b105e34 xen: replace xen_remap() with memremap()
>
> Attempted to mitigate this by forcing mappings of the grant-table to use
> the write-back (WB) cache attribute.  However Linux memremap() takes MTRRs
> into account to calculate which PAT type to use, and seeing the MTRR cache
> attribute for the region being UC the PAT also ends up as UC, regardless of
> the caller having requested WB.
>
> As a workaround to allow current Linux to map the grant-table as WB using
> memremap() introduce an xl.cfg option (xenpci_bar_uc=0) that can be used to
> select whether the Xen PCI device BAR will have the UC attribute in MTRR.
> Such workaround in hvmloader should also be paired with a fix for Linux so
> it attempts to change the MTRR of the Xen PCI device BAR to WB by itself.
>
> Overall, the long term solution would be to provide the guest with a safe
> range in the guest physical address space where mappings to foreign pages
> can be created.
>
> Some vif throughput performance figures provided by Anthoine from a 8
> vCPUs, 4GB of RAM HVM guest(s) running on AMD hardware:
>
> Without this patch:
> vm -> dom0: 1.1Gb/s
> vm -> vm:   5.0Gb/s
>
> With the patch:
> vm -> dom0: 4.5Gb/s
> vm -> vm:   7.0Gb/s
>
> Reported-by: Anthoine Bourgeois <anthoine.bourgeois@xxxxxxxxxx>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> Changes since v2:
>   - Add default value in xl.cfg.
>   - List xenstore path in the pandoc file.
>   - Adjust comment in hvmloader.
>   - Fix commit message MIO -> MMIO.
>
> Changes since v1:
>   - Leave the xenpci BAR as UC by default.
>   - Introduce an option to not set it as UC.
> ---
>   docs/man/xl.cfg.5.pod.in                |  8 ++++
>   docs/misc/xenstore-paths.pandoc         |  5 +++
>   tools/firmware/hvmloader/config.h       |  2 +-
>   tools/firmware/hvmloader/pci.c          | 49 ++++++++++++++++++++++++-
>   tools/firmware/hvmloader/util.c         |  2 +-
>   tools/include/libxl.h                   |  9 +++++
>   tools/libs/light/libxl_create.c         |  1 +
>   tools/libs/light/libxl_dom.c            |  9 +++++
>   tools/libs/light/libxl_types.idl        |  1 +
>   tools/xl/xl_parse.c                     |  2 +
>   xen/include/public/hvm/hvm_xs_strings.h |  2 +
>   11 files changed, 86 insertions(+), 4 deletions(-)
>
> diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
> index c388899306c2..ddbff6fffc16 100644
> --- a/docs/man/xl.cfg.5.pod.in
> +++ b/docs/man/xl.cfg.5.pod.in
> @@ -2351,6 +2351,14 @@ Windows L<https://xenproject.org/windows-pv-drivers/>.
>   Setting B<xen_platform_pci=0> with the default device_model "qemu-xen"
>   requires at least QEMU 1.6.
>
> +
> +=item B<xenpci_bar_uc=BOOLEAN>
> +
> +B<x86 only:> Select whether the memory BAR of the Xen PCI device should have
> +uncacheable (UC) cache attribute set in MTRR.
> +
> +Default is B<true>.
> +
>   =item B<viridian=[ "GROUP", "GROUP", ...]> or B<viridian=BOOLEAN>
>
>   The groups of Microsoft Hyper-V (AKA viridian) compatible enlightenments
> diff --git a/docs/misc/xenstore-paths.pandoc b/docs/misc/xenstore-paths.pandoc
> index 01a340fafcbe..073bed91eec1 100644
> --- a/docs/misc/xenstore-paths.pandoc
> +++ b/docs/misc/xenstore-paths.pandoc
> @@ -234,6 +234,11 @@ These xenstore values are used to override some of the 
> default string
>   values in the SMBIOS table constructed in hvmloader. See the SMBIOS
>   table specification at http://www.dmtf.org/standards/smbios/
>
> +#### ~/hvmloader/pci/xenpci-bar-uc = ("1"|"0") [HVM,INTERNAL]
> +
> +Select whether the Xen PCI device MMIO BAR will have the uncacheable cache
> +attribute set in the MTRRs by hvmloader.
> +
>   #### ~/bios-strings/oem-* = STRING [HVM,INTERNAL]
>
>   1 to 99 OEM strings can be set in xenstore using values of the form
> diff --git a/tools/firmware/hvmloader/config.h 
> b/tools/firmware/hvmloader/config.h
> index 6e1da137d779..c159db30eea9 100644
> --- a/tools/firmware/hvmloader/config.h
> +++ b/tools/firmware/hvmloader/config.h
> @@ -58,7 +58,7 @@ extern uint32_t *cpu_to_apicid;
>   #define ACPI_TIS_HDR_ADDRESS 0xFED40F00UL
>
>   extern uint32_t pci_mem_start;
> -extern const uint32_t pci_mem_end;
> +extern uint32_t pci_mem_end;
>   extern uint64_t pci_hi_mem_start, pci_hi_mem_end;
>
>   extern bool acpi_enabled;
> diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
> index cc67b18c0361..747f6cfb6794 100644
> --- a/tools/firmware/hvmloader/pci.c
> +++ b/tools/firmware/hvmloader/pci.c
> @@ -30,7 +30,7 @@
>   #include <xen/hvm/e820.h>
>
>   uint32_t pci_mem_start = HVM_BELOW_4G_MMIO_START;
> -const uint32_t pci_mem_end = RESERVED_MEMBASE;
> +uint32_t pci_mem_end = RESERVED_MEMBASE;
>   uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>
>   /*
> @@ -116,6 +116,8 @@ void pci_setup(void)
>        * experience the memory relocation bug described below.
>        */
>       bool allow_memory_relocate = 1;
> +    /* Select the MTRR cache attribute of the xenpci device BAR. */
> +    bool xenpci_bar_uc = false;

Since this is meant to be a workaround, I wonder if it makes more sense
to flip the setting (`xenpci_bar_wb`) and make it 0 by default? It also
simplifies the logic for both hvmloader and the consumer (no need for
double negatives).

>
>       BUILD_BUG_ON((typeof(*pci_devfn_decode_type))PCI_COMMAND_IO !=
>                    PCI_COMMAND_IO);
> @@ -130,6 +132,12 @@ void pci_setup(void)
>       printf("Relocating guest memory for lowmem MMIO space %s\n",
>              allow_memory_relocate?"enabled":"disabled");
>
> +    s = xenstore_read(HVM_XS_XENPCI_BAR_UC, NULL);
> +    if ( s )
> +        xenpci_bar_uc = strtoll(s, NULL, 0);
> +    printf("XenPCI device BAR MTRR cache attribute set to %s\n",
> +           xenpci_bar_uc ? "UC" : "WB");
> +
>       s = xenstore_read("platform/mmio_hole_size", NULL);
>       if ( s )
>           mmio_hole_size = strtoll(s, NULL, 0);
> @@ -271,6 +279,43 @@ void pci_setup(void)
>               if ( bar_sz == 0 )
>                   continue;
>
> +            if ( !xenpci_bar_uc &&
> +                 ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
> +                   PCI_BASE_ADDRESS_SPACE_MEMORY) &&
> +                 vendor_id == 0x5853 &&
> +                 (device_id == 0x0001 || device_id == 0x0002) )
> +            {
> +                if ( is_64bar )
> +                {
> +                     printf("xenpci dev %02x:%x unexpected MMIO 64bit 
> BAR%u\n",
> +                            devfn >> 3, devfn & 7, bar);
> +                     continue;
> +                }
> +
> +                if ( bar_sz > pci_mem_end ||
> +                     ((pci_mem_end - bar_sz) & ~(bar_sz - 1)) < 
> pci_mem_start )
> +                {
> +                     printf("xenpci dev %02x:%x BAR%u size %llx overflows 
> low PCI hole\n",
> +                            devfn >> 3, devfn & 7, bar, bar_sz);
> +                     continue;
> +                }
> +
> +                /* Put unconditionally at the end of the low PCI MMIO hole. 
> */
> +                pci_mem_end -= bar_sz;
> +                pci_mem_end &= ~(bar_sz - 1);
> +                bar_data &= ~PCI_BASE_ADDRESS_MEM_MASK;
> +                bar_data |= pci_mem_end;
> +                pci_writel(devfn, bar_reg, bar_data);
> +                pci_devfn_decode_type[devfn] |= PCI_COMMAND_MEMORY;
> +
> +                /* Prefix BAR address with a 0 to match format used below. */
> +                printf("pci dev %02x:%x bar %02x size "PRIllx": 0%08x\n",
> +                       devfn >> 3, devfn & 7, bar_reg,
> +                       PRIllx_arg(bar_sz), bar_data);
> +
> +                continue;
> +            }
> +
>               for ( i = 0; i < nr_bars; i++ )
>                   if ( bars[i].bar_sz < bar_sz )
>                       break;
> @@ -310,7 +355,7 @@ void pci_setup(void)
>           }
>
>           /* Enable bus master for this function later */
> -        pci_devfn_decode_type[devfn] = PCI_COMMAND_MASTER;
> +        pci_devfn_decode_type[devfn] |= PCI_COMMAND_MASTER;
>       }
>
>       if ( mmio_hole_size )
> diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
> index 79c0e6bd4ad2..31b4411db7b4 100644
> --- a/tools/firmware/hvmloader/util.c
> +++ b/tools/firmware/hvmloader/util.c
> @@ -867,7 +867,7 @@ void hvmloader_acpi_build_tables(struct acpi_config 
> *config,
>           config->table_flags |= ACPI_HAS_HPET;
>
>       config->pci_start = pci_mem_start;
> -    config->pci_len = pci_mem_end - pci_mem_start;
> +    config->pci_len = RESERVED_MEMBASE - pci_mem_start;
>       if ( pci_hi_mem_end > pci_hi_mem_start )
>       {
>           config->pci_hi_start = pci_hi_mem_start;
> diff --git a/tools/include/libxl.h b/tools/include/libxl.h
> index b7ad7735ca4c..7ce7678e6836 100644
> --- a/tools/include/libxl.h
> +++ b/tools/include/libxl.h
> @@ -1503,6 +1503,15 @@ void libxl_mac_copy(libxl_ctx *ctx, libxl_mac *dst, 
> const libxl_mac *src);
>    */
>   #define LIBXL_HAVE_CREATEINFO_XEND_SUSPEND_EVTCHN_COMPAT
>
> +/*
> + * LIBXL_HAVE_XENPCI_BAR_UC
> + *
> + * libxl_domain_build_info contains a boolean 'u.hvm.xenpci_bar_uc' field to
> + * signal whether the XenPCI device BAR should have UC cache attribute set in
> + * MTRR.
> + */
> +#define LIBXL_HAVE_XENPCI_BAR_UC
> +
>   typedef char **libxl_string_list;
>   void libxl_string_list_dispose(libxl_string_list *sl);
>   int libxl_string_list_length(const libxl_string_list *sl);
> diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
> index 8bc768b5156c..962fa820faec 100644
> --- a/tools/libs/light/libxl_create.c
> +++ b/tools/libs/light/libxl_create.c
> @@ -313,6 +313,7 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>           libxl_defbool_setdefault(&b_info->u.hvm.usb,                false);
>           libxl_defbool_setdefault(&b_info->u.hvm.vkb_device,         true);
>           libxl_defbool_setdefault(&b_info->u.hvm.xen_platform_pci,   true);
> +        libxl_defbool_setdefault(&b_info->u.hvm.xenpci_bar_uc,      true);
>           libxl_defbool_setdefault(&b_info->u.hvm.pirq,               false);
>
>           libxl_defbool_setdefault(&b_info->u.hvm.spice.enable, false);
> diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
> index 4d67b0d28294..60ec0354d19a 100644
> --- a/tools/libs/light/libxl_dom.c
> +++ b/tools/libs/light/libxl_dom.c
> @@ -819,6 +819,15 @@ static int hvm_build_set_xs_values(libxl__gc *gc,
>               goto err;
>       }
>
> +    if (info->type == LIBXL_DOMAIN_TYPE_HVM &&
> +        libxl_defbool_val(info->u.hvm.xenpci_bar_uc)) {
> +        path = GCSPRINTF("/local/domain/%d/"HVM_XS_XENPCI_BAR_UC, domid);
> +        ret = libxl__xs_printf(gc, XBT_NULL, path, "%d",
> +                               libxl_defbool_val(info->u.hvm.xenpci_bar_uc));
> +        if (ret)
> +            goto err;
> +    }
> +
>       return 0;
>
>   err:
> diff --git a/tools/libs/light/libxl_types.idl 
> b/tools/libs/light/libxl_types.idl
> index 198515383012..6054350b83c7 100644
> --- a/tools/libs/light/libxl_types.idl
> +++ b/tools/libs/light/libxl_types.idl
> @@ -691,6 +691,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>                                          ("vkb_device",       libxl_defbool),
>                                          ("soundhw",          string),
>                                          ("xen_platform_pci", libxl_defbool),
> +                                       ("xenpci_bar_uc",    libxl_defbool),
>                                          ("usbdevice_list",   
> libxl_string_list),
>                                          ("vendor_device",    
> libxl_vendor_device),
>                                          # See libxl_ms_vm_genid_generate()
> diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
> index 219e924779ff..4da3bb9e91ab 100644
> --- a/tools/xl/xl_parse.c
> +++ b/tools/xl/xl_parse.c
> @@ -2917,6 +2917,8 @@ skip_usbdev:
>           xlu_cfg_replace_string (config, "soundhw", &b_info->u.hvm.soundhw, 
> 0);
>           xlu_cfg_get_defbool(config, "xen_platform_pci",
>                               &b_info->u.hvm.xen_platform_pci, 0);
> +        xlu_cfg_get_defbool(config, "xenpci_bar_uc",
> +                            &b_info->u.hvm.xenpci_bar_uc, 0);
>
>           if(b_info->u.hvm.vnc.listen
>              && b_info->u.hvm.vnc.display
> diff --git a/xen/include/public/hvm/hvm_xs_strings.h 
> b/xen/include/public/hvm/hvm_xs_strings.h
> index e1ed078628a0..ebb07b9fba56 100644
> --- a/xen/include/public/hvm/hvm_xs_strings.h
> +++ b/xen/include/public/hvm/hvm_xs_strings.h
> @@ -14,6 +14,8 @@
>   #define HVM_XS_BIOS                    "hvmloader/bios"
>   #define HVM_XS_GENERATION_ID_ADDRESS   "hvmloader/generation-id-address"
>   #define HVM_XS_ALLOW_MEMORY_RELOCATE   "hvmloader/allow-memory-relocate"
> +/* Set xenpci device BAR as UC in MTRR */
> +#define HVM_XS_XENPCI_BAR_UC           "hvmloader/pci/xenpci-bar-uc"
>
>   /* The following values allow additional ACPI tables to be added to the
>    * virtual ACPI BIOS that hvmloader constructs. The values specify the guest



Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.