
Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests


  • To: Julien Grall <julien@xxxxxxx>
  • From: Milan Djokic <milan_djokic@xxxxxxxx>
  • Date: Fri, 13 Feb 2026 04:18:57 +0100
  • Cc: Julien Grall <julien.grall.oss@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Nick Rosbrook <enr0n@xxxxxxxxxx>, George Dunlap <gwd@xxxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Delivery-date: Fri, 13 Feb 2026 03:19:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Julien,

On 12/3/25 16:47, Milan Djokic wrote:
Hi Julien,
On 12/3/25 11:32, Julien Grall wrote:
Hi,

On 02/12/2025 22:08, Milan Djokic wrote:
Hi Julien,

On 11/27/25 11:22, Julien Grall wrote:
We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and
pIOMMU. Considering single vIOMMU model limitation pointed out by
Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the
only proper solution.

   > Does this means in your solution you will end up with multiple
   > vPCI as well and then map pBDF == vBDF? (this because the SID have to be
   > fixed at boot)
   >

To answer your question, yes we will have multiple vPCI nodes with this
model, establishing 1-1 vSID-pSID mapping (same iommu-map range between
pPCI-vPCI).
For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My
understanding is that vBDF->pBDF mapping does not affect vSID->pSID
mapping. Am I wrong here?

   From my understanding, the mapping between a vBDF and vSID is setup at
domain creation (as this is described in ACPI/Device-Tree). As PCI
devices can be hotplug, if you want to enforce vSID == pSID, then you
indirectly need to enforce vBDF == pBDF.


I was not aware of that. I will have to do a detailed analysis on this
and come back with a solution. Right now I'm not sure how and if
enumeration will work with multi vIOMMU/vPCI model. If that's not
possible, we will have to introduce a mapping layer for vSID->pSID and
go back to single vPCI/vIOMMU model.

[...]


I have updated the vIOMMU design following our previous discussion on this
topic and some additional use cases that came up in the meantime. The
implementation now exposes a single vIOMMU to the guest, with a Xen mapping
layer that translates it onto the physical IOMMU layout. This design supports
multiple physical IOMMUs and also aligns with the ongoing vPCI/PCI passthrough
work. The new vIOMMU design is provided below; could you please review it?
Changes compared to the previous design version:
- Switched from an N-N to a 1-N vIOMMU-pIOMMU model, with the addition of a
  vSID->pSID mapping layer
- Added details on the vIOMMU emulation flow (commands, events) and the new
  vSID->pSID mapping layer
- Added assumptions and constraints for vPCI compatibility. PCI support is not
  yet complete and is planned to be implemented in alignment with the ongoing
  PCI passthrough work
- Removed security considerations which are not directly related to vIOMMU
  (xl, libfdt)
- Expanded mitigations for scheduling-related risks
- Added initial performance measurements for the Renesas R-Car platform (to be extended with future PCI support work)


==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author:     Milan Djokic <milan_djokic@xxxxxxxx>
:Date:       2026-02-13
:Status:     Draft

Introduction
============

The SMMUv3 supports two stages of translation, each of which can be
independently enabled. An incoming address is logically translated from VA to
IPA in stage 1, then the IPA is input to stage 2, which translates the IPA to
the output PA. Stage-1 translation support is required to provide isolation
between different devices within the guest OS. Xen already supports stage-2
translation, but there is no support for stage-1 translation.
This design proposal outlines the introduction of stage-1 SMMUv3 support in
Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure
secure DMA and guest-managed I/O memory mappings.
With stage 1 enabled, the guest manages IOVA to IPA mappings through its own
IOMMU driver.

This feature enables:

- Stage-1 translation for the guest domain
- Device passthrough with per-device I/O address space

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
  in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
  handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
  queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
  device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
  dynamic enablement.

A single vIOMMU device is exposed to the guest and mapped to one or more
physical IOMMUs through a Xen-managed translation layer.
The vIOMMU feature provides a generic framework together with a backend
implementation specific to the target IOMMU type. The backend is responsible
for implementing the hardware-specific data structures and command handling
logic (currently only SMMUv3 is supported).

This modular design allows the stage-1 support to be reused
for other IOMMU architectures in the future.
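
As an illustration of this framework/backend split, a minimal sketch of what
such a backend interface could look like is given below. This is only a sketch
under assumptions; the structure and callback names (`viommu_ops`,
`handle_cmd`, etc.) are hypothetical and not taken from the patch series.

::

    /* Hypothetical backend interface for the generic vIOMMU framework. */
    #include <stdint.h>

    struct domain;   /* Xen domain (declaration only for this sketch) */
    struct viommu;   /* per-domain virtual IOMMU instance */

    struct viommu_ops {
        /* Create per-domain emulation state (queues, vSID table, ...). */
        int  (*domain_init)(struct domain *d, struct viommu *v);
        /* Emulate accesses to the virtual IOMMU register file. */
        int  (*mmio_read)(struct viommu *v, uint64_t offset, uint64_t *val);
        int  (*mmio_write)(struct viommu *v, uint64_t offset, uint64_t val);
        /* Process one guest command from the virtual command queue. */
        int  (*handle_cmd)(struct viommu *v, const void *cmd);
        /* Forward a hardware stage-1 fault as a virtual event. */
        void (*inject_event)(struct viommu *v, const void *evt);
    };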

vIOMMU architecture
===================

Responsibilities:

Guest:
 - Configures stage-1 via vIOMMU commands.
 - Handles stage-1 faults received from Xen.

Xen:
 - Emulates the IOMMU interface (registers, commands, events).
 - Provides vSID->pSID mappings.
 - Programs stage-1/stage-2 configuration in the physical IOMMU.
 - Propagates stage-1 faults to the guest.

vIOMMU commands and faults are transmitted between the guest and Xen via
command and event queues (one command queue and one event queue per guest).

vIOMMU command Flow:

::

    Guest:
        smmu_cmd(vSID, IOVA -> IPA)

    Xen:
        trap MMIO read/write
        translate vSID->pSID
        store stage-1 state
        program pIOMMU for (pSID, IPA -> PA)

All hardware programming of the physical IOMMU is performed exclusively by Xen.
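
A simplified sketch of this trap-and-translate path is shown below (assumed
names only; `vsid_to_psid()` and `piommu_sync_ste()` are illustrative helpers,
not the actual patch series API):

::

    #include <errno.h>
    #include <stdint.h>

    struct domain;   /* Xen domain (declaration only for this sketch) */
    struct piommu;   /* physical IOMMU instance */

    /* Hypothetical helpers. */
    extern int vsid_to_psid(struct domain *d, uint32_t vsid,
                            struct piommu **piommu, uint32_t *psid);
    extern int piommu_sync_ste(struct piommu *piommu, uint32_t psid,
                               const void *s1_state);

    /* Emulate one guest command that (re)configures stage 1 for a stream. */
    static int viommu_handle_cfgi(struct domain *d, uint32_t vsid,
                                  const void *s1_state)
    {
        struct piommu *piommu;
        uint32_t psid;

        /* Translate the guest-visible stream ID to the physical one. */
        if ( vsid_to_psid(d, vsid, &piommu, &psid) )
            return -ENODEV;

        /*
         * Store the guest stage-1 state and program the physical STE with
         * a nested configuration: guest stage 1 on top of the Xen-owned
         * stage 2.
         */
        return piommu_sync_ste(piommu, psid, s1_state);
    }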

vIOMMU Stage-1 fault handling flow:

::

    Xen:
        receives stage-1 fault
        triggers vIOMMU callback
        injects virtual fault

    Guest:
        receives and handles fault
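
A minimal sketch of the forwarding step in Xen, assuming helper names
(`psid_to_vsid()`, `viommu_push_event()`, `viommu_notify()`) that are purely
illustrative:

::

    #include <stdint.h>

    struct domain;

    /* Hypothetical helpers. */
    extern int  psid_to_vsid(struct domain *d, uint32_t psid, uint32_t *vsid);
    extern void viommu_push_event(struct domain *d, uint32_t vsid,
                                  uint64_t iova, uint32_t reason);
    extern void viommu_notify(struct domain *d);

    /* Forward a hardware stage-1 fault to the owning guest. */
    static void viommu_forward_fault(struct domain *d, uint32_t psid,
                                     uint64_t iova, uint32_t reason)
    {
        uint32_t vsid;

        /* Reverse lookup: which guest-visible stream does the fault hit? */
        if ( psid_to_vsid(d, psid, &vsid) )
            return; /* not a stage-1 fault owned by this domain */

        /* Write the event into the virtual event queue and kick the guest. */
        viommu_push_event(d, vsid, iova, reason);
        viommu_notify(d);
    }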

vSID Mapping Layer
------------------

Each guest-visible Stream ID (vSID) is mapped by Xen to a physical Stream ID
(pSID). The mapping is maintained per-domain. The allocation policy guarantees
vSID uniqueness within a domain while allowing reuse of pSIDs for different
pIOMMUs.

* Platform devices receive individually allocated vSIDs.
* PCI devices receive a contiguous vSID range derived from RID space.
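
A minimal sketch of such a per-domain mapping table is shown below (the
structure layout and the linear lookup are assumptions for illustration; a
real implementation might use a hash table or radix tree):

::

    #include <stddef.h>
    #include <stdint.h>

    struct piommu;  /* physical IOMMU instance */

    /* Hypothetical per-domain vSID -> (pIOMMU, pSID) mapping entry. */
    struct vsid_map_entry {
        uint32_t       vsid;     /* unique within the domain */
        uint32_t       psid;     /* unique only within its pIOMMU */
        struct piommu *piommu;   /* owning physical IOMMU */
    };

    struct vsid_map {
        struct vsid_map_entry *entries;
        unsigned int           nr_entries;
    };

    static const struct vsid_map_entry *
    vsid_lookup(const struct vsid_map *map, uint32_t vsid)
    {
        for ( unsigned int i = 0; i < map->nr_entries; i++ )
            if ( map->entries[i].vsid == vsid )
                return &map->entries[i];

        return NULL;  /* unknown vSID: the command/event must be rejected */
    }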


Supported Device Model
======================

Currently, the vIOMMU framework supports only devices described via the
Device Tree (DT) model. This includes platform devices and basic PCI device
support instantiated through the vPCI DT node. ACPI-described devices are
not supported.

Platform devices assigned to the guest are mapped via the `iommus` property:

::

    <&pIOMMU pSID> -> <&vIOMMU vSID>

PCI devices use RID-based mapping via the root complex `iommu-map`:

::

    <RID-base &viommu vSID-base length>
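
For reference, the translation rule implied by the standard `iommu-map`
binding can be expressed as follows (a sketch only; the struct and helper
names are hypothetical):

::

    #include <stdbool.h>
    #include <stdint.h>

    /* One iommu-map entry: <RID-base phandle vSID-base length>. */
    struct iommu_map_entry {
        uint32_t rid_base;
        uint32_t vsid_base;
        uint32_t length;
    };

    /* Returns true and fills *vsid when the RID falls inside the entry. */
    static bool rid_to_vsid(const struct iommu_map_entry *e,
                            uint32_t rid, uint32_t *vsid)
    {
        if ( rid < e->rid_base || rid >= e->rid_base + e->length )
            return false;

        *vsid = e->vsid_base + (rid - e->rid_base);
        return true;
    }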

PCI Topology Assumptions and Constraints:

- RID space must be contiguous
- Pre-defined contiguous pSID space (0-0x1000)
- No runtime PCI reconfiguration
- Single root complex assumed
- Mapping is fixed at guest DT construction

Constraints for PCI devices will be addressed as part of the future work on
this feature.

Security Considerations
=======================

Stage-1 translation provides isolation between guest devices by
enforcing a per-device I/O address space, preventing unauthorized DMA.
With the introduction of emulated IOMMU, additional protection
mechanisms are required to minimize security risks.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`)
and logic to write both Stage-1 and Stage-2 entries in the Stream Table
Entry (STE), including an `abort`
field to handle partial configuration states.

**Risk:**
Without proper handling, a partially applied configuration
might leave guest DMA mappings in an inconsistent state, potentially
enabling unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field - the configuration is only taken into
account once it is fully attached. This ensures incomplete or invalid
device configurations are safely ignored by the hypervisor.
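
The intent can be sketched as follows (field and function names are
hypothetical and do not reflect the actual STE layout or driver structures):

::

    #include <stdbool.h>

    /* Hypothetical per-stream configuration state. */
    struct ste_state {
        bool s1_enabled;  /* guest requested stage-1 translation */
        bool s1_ready;    /* s1_cfg fully attached */
        bool s2_ready;    /* s2_cfg (Xen-owned) fully attached */
        bool abort;       /* terminate transactions for this stream */
    };

    static void ste_update_abort(struct ste_state *ste)
    {
        /*
         * Abort transactions while any enabled stage is only partially
         * configured, so a half-written configuration is never exposed
         * to the hardware.
         */
        ste->abort = !ste->s2_ready || (ste->s1_enabled && !ste->s1_ready);

        /* The entry would then be synced to the Stream Table. */
    }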

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidations need to be forwarded
to the SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidations could leave stale mappings in place,
enabling access through old translations and possibly data leakage or
misrouting between devices assigned to the same guest.

**Mitigation:**
The guest must issue appropriate invalidation commands whenever
its stage-1 I/O mappings are modified to ensure that translation caches
remain coherent.

3. Observation:
---------------
Introducing an optional, per-guest feature (`viommu` argument in the xl
guest config) means some guests may opt out.

**Risk:**
Guests without vIOMMU enabled (stage-2 only) could potentially dominate
access to the physical command and event queues, since they bypass the
emulation layer and their processing is faster compared to vIOMMU-enabled
guests.

**Mitigation:**
Audit the impact of emulation overhead on IOMMU processing fairness in a
multi-guest environment.
Consider enabling/disabling stage 1 at a system level instead of per domain.

4. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands such as cache
invalidation, stream table entry configuration, etc. An adversarial guest
may issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and potential
denial-of-service scenarios.

**Mitigation:**

- Implement vIOMMU command execution restart and continuation support
  (see the sketch after this list):

  - Introduce processing budget with only a limited amount of commands
    handled per invocation.
  - If additional commands remain pending after the budget is exhausted,
    defer further processing and resume it asynchronously, e.g. via a
    per-domain tasklet.

- Batch multiple commands of the same type to reduce emulation overhead:

  - Inspect the command queue and group commands that can be processed
    together (e.g. multiple successive invalidation requests or STE
    updates for the same SID).
  - Execute the entire batch in one go, reducing repeated accesses to
    guest memory and emulation overhead per command.
  - This reduces CPU time spent in the vIOMMU command processing loop.
    The optimization is applicable only when consecutive commands of the
    same type operate on the same SID/context.
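
A minimal sketch of the budgeted processing loop with deferred continuation
(the budget value, `struct viommu` contents and helper names are assumptions):

::

    #include <stdbool.h>

    /* Maximum number of guest commands emulated per invocation (assumed). */
    #define CMD_BUDGET 64

    struct viommu;

    /* Hypothetical helpers. */
    extern bool viommu_cmdq_empty(struct viommu *v);
    extern void viommu_emulate_next_cmd(struct viommu *v);
    extern void viommu_schedule_continuation(struct viommu *v);

    static void viommu_process_cmdq(struct viommu *v)
    {
        unsigned int done = 0;

        while ( done < CMD_BUDGET && !viommu_cmdq_empty(v) )
        {
            viommu_emulate_next_cmd(v);
            done++;
        }

        /*
         * Budget exhausted but commands still pending: defer the remainder,
         * e.g. to a per-domain tasklet, so a single guest cannot monopolise
         * hypervisor CPU time.
         */
        if ( !viommu_cmdq_empty(v) )
            viommu_schedule_continuation(v);
    }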

5. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate).

**Risk:**
Excessive command requests from an abusive guest can flood the physical
IOMMU command queue, leading to degraded pIOMMU responsiveness for commands
issued by other guests.

**Mitigation:**

- Batch commands that are propagated to the pIOMMU command queue and
  implement batch execution pause/continuation.
  Rely on the same mechanisms as in the previous observation (command
  continuation and batching of pIOMMU-related commands of the same type
  and context).
- If possible, implement domain penalization by adding a per-domain budget
  for vIOMMU/pIOMMU usage:

  - Apply per-domain dynamic budgeting of allowed IOMMU commands to
    execute per invocation, reducing the budget for guests with
    excessive command requests over a longer period of time.
  - Combine with the command continuation mechanism.

6. Observation:
---------------
The vIOMMU feature includes an event queue used to forward IOMMU events
to the guest (e.g. translation faults, invalid Stream IDs, permission errors).
A malicious guest may misconfigure its IOMMU state or intentionally trigger
faults at a high rate.

**Risk:**
High-frequency IOMMU events can cause Xen to flood the event queue and
disrupt scheduling due to high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults
  occur at high frequency and are not processed by the guest (see the
  sketch after this list):

  - Introduce a per-domain pending event counter.
  - Stop forwarding events to the guest once the number of unprocessed
    events reaches a predefined threshold.

- Consider disabling the emulated event queue for untrusted guests.
- Note that this risk is more general and may also apply to stage-2-only
  guests. This section addresses mitigations in the emulated IOMMU layer
  only. Mitigation of physical event queue flooding should also be
  considered in the target pIOMMU driver.
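
A minimal sketch of the per-domain pending-event threshold described above
(the threshold value, `struct viommu` fields and helper names are
assumptions):

::

    #include <stdbool.h>

    /* Maximum number of unprocessed events before forwarding stops (assumed). */
    #define EVT_PENDING_MAX 256

    struct viommu {
        unsigned int pending_events;   /* events not yet consumed by guest */
        bool         events_disabled;  /* fail-safe latch */
        /* ... queue state, locks, etc. ... */
    };

    /* Hypothetical helper that writes an event to the virtual event queue. */
    extern void viommu_push_event(struct viommu *v, const void *evt);

    static void viommu_try_forward_event(struct viommu *v, const void *evt)
    {
        /*
         * Fail-safe: if the guest is not consuming its event queue, stop
         * forwarding instead of burning hypervisor CPU on event handling.
         */
        if ( v->events_disabled || v->pending_events >= EVT_PENDING_MAX )
        {
            v->events_disabled = true;
            return;
        }

        v->pending_events++;  /* decremented when the guest consumes it */
        viommu_push_event(v, evt);
    }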

Performance Impact
==================

With IOMMU stage-1 and nested translation inclusion, performance overhead is
introduced compared to the existing stage-2-only usage in Xen. Once mappings
are established, translations should not introduce significant overhead.
Emulated paths may introduce moderate overhead, primarily affecting device
initialization and event/command handling.
Testing was performed on a Renesas R-Car platform.
Performance is mostly impacted by emulated vIOMMU operations; results are
shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 645ns, worst-case: 2μs  |
+-------------------------------+---------------------------------+
| Reg write                     | median: 630ns, worst-case: 1μs  |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 2μs, worst-case: 10μs   |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 5μs, worst-case: 100μs  |
+-------------------------------+---------------------------------+

With the vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact the boot
time of a stage-1-enabled guest and compares them with a stage-1-disabled
guest.

NOTE: Device probe execution time varies depending on device complexity.
A USB host controller was selected as the test case due to its extensive
use of dynamic DMA allocations and IOMMU mappings, making it a
representative workload for evaluating stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~10ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~100ms                | ~90ms                  |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1-enabled guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions such as
MMIO read/write and TLB invalidation.

+---------------+---------------------------+--------------------------+
| DMA Op        | Stage-1 Enabled Guest     | Stage-1 Disabled Guest   |
+===============+===========================+==========================+
| dma_alloc     | median: 20µs, worst: 80µs | median: 8µs, worst: 60µs |
+---------------+---------------------------+--------------------------+
| dma_free      | median: 15µs, worst: 60µs | median: 6µs, worst: 30µs |
+---------------+---------------------------+--------------------------+
| dma_map       | median: 12µs, worst: 60µs | median: 3µs, worst: 20µs |
+---------------+---------------------------+--------------------------+
| dma_unmap     | median: 15µs, worst: 70µs | median: 3µs, worst: 20µs |
+---------------+---------------------------+--------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
  implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

Future improvements
===================

- Implement the proposed mitigations to address security risks that are
  not covered by the current design (event batching, command execution
  continuation)
- PCI support
- Support for other IOMMU HW (Renesas, RISC-V, etc.)

References
==========

- Original feature implemented by Rahul Singh:

https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@xxxxxxx/

- SMMUv3 architecture documentation
- Existing vIOMMU code patterns (KVM, QEMU)


Best regards,
Milan



 

