
Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests


  • To: Milan Djokic <milan_djokic@xxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Fri, 29 Aug 2025 16:27:00 +0000
  • Accept-language: en-US
  • Cc: Julien Grall <julien@xxxxxxx>, Julien Grall <julien.grall.oss@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Nick Rosbrook <enr0n@xxxxxxxxxx>, George Dunlap <gwd@xxxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Delivery-date: Fri, 29 Aug 2025 16:27:11 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHcB7wdmg15F24Gc0ClRsxpCVedRA==
  • Thread-topic: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests

Hi Milan,

Thanks, "Security Considerations" sections looks really good. But I have
more questions.

Milan Djokic <milan_djokic@xxxxxxxx> writes:

> Hello Julien, Volodymyr
>
> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>> Hi Milan,
>> Milan Djokic <milan_djokic@xxxxxxxx> writes:
>> 
>>> Hello Julien,
>>>
>>> On 8/13/25 14:11, Julien Grall wrote:
>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>> Hello Julien,
>>>> Hi Milan,
>>>>
>>>>>
>>>>> We have prepared a design document and it will be part of the updated
>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>> details on implementation structure to make review easier.
>>>> I would suggest to just iterate on the design document for now.
>>>>
>>>>> Following is the design document content which will be provided in
>>>>> updated patch series:
>>>>>
>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>> ==========================================================
>>>>>
>>>>> Author: Milan Djokic <milan_djokic@xxxxxxxx>
>>>>> Date:   2025-08-07
>>>>> Status: Draft
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>>
>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>> can be independently enabled. An incoming address is logically
>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>> is required to provide isolation between different devices within the OS.
>>>>>
>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>
>>>>> Motivation
>>>>> ----------
>>>>>
>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>> ensure correct and secure DMA behavior inside guests.
>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>> without
>>>> stage-1.
>>>
>>> Correct in terms of working with guest-managed I/O space. I'll
>>> rephrase this statement; it seems ambiguous.
>>>
>>>>>
>>>>> This feature enables:
>>>>> - Stage-1 translation in guest domain
>>>>> - Safe device passthrough under secure memory translation
>>>>>
>>>>> Design Overview
>>>>> ---------------
>>>>>
>>>>> These changes provide emulated SMMUv3 support:
>>>>>
>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>>      SMMUv3 driver
>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>> pIOMMU? Or a single one?
>>>
>>> Single vIOMMU model is used in this design.
>>>
>>>> Have you considered the pros/cons for both?
>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>>      queue handling
>>>>
>>>
>>> That's a point for consideration.
>>> A single vIOMMU prevails in terms of a less complex implementation and
>>> a simpler guest IOMMU model - a single vIOMMU node, one interrupt path,
>>> one event queue, a single set of trap handlers for emulation, etc.
>>> Cons for a single vIOMMU model could be less accurate hw
>>> representation and a potential bottleneck with one emulated queue and
>>> interrupt path.
>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>> modeling and offers better scalability in case of many IOMMUs in the
>>> system, but this comes with more complex emulation logic and device
>>> tree handling, and the guest has to manage multiple vIOMMUs.
>>> IMO, single vIOMMU model seems like a better option mostly because
>>> it's less complex, easier to maintain and debug. Of course, this
>>> decision can and should be discussed.
>>>
>> Well, I am not sure that this is possible, because of StreamID
>> allocation. The biggest offender is of course PCI, as each Root PCI
>> bridge will require its own SMMU instance with its own StreamID space.
>> But even without PCI you'll need some mechanism to map vStreamID to
>> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>> Actually, PCI/vPCI with vSMMU is its own can of worms...
>> 
>>>> For each pSMMU, we have a single command queue that will receive commands
>>>> from all the guests. How do you plan to prevent a guest hogging the
>>>> command queue?
>>>> In addition to that, AFAIU, the size of the virtual command queue is
>>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>>> with commands before notifying Xen, how do you plan to ensure we don't
>>>> spend too much time in Xen (which is not preemptible)?
>>>>
>>>
>>> We'll have to do a detailed analysis of these scenarios; they are not
>>> covered by the design (nor are some others, as is clear from your
>>> comments). I'll come back with an updated design.
>> I think that can be handled akin to hypercall continuation, which is
>> used in similar places, like P2M code
>> [...]
>> 
>
> I have updated the vIOMMU design document with additional security
> topics covered and performance impact results. I also added some
> additional explanations for the vIOMMU components, following your
> comments. Updated document content:
>
> ==========================================================
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
>
> :Author:     Milan Djokic <milan_djokic@xxxxxxxx>
> :Date:       2025-08-07
> :Status:     Draft
>
> Introduction
> ============
>
> The SMMUv3 supports two stages of translation. Each stage of translation
> can be independently enabled. An incoming address is logically translated
> from VA to IPA in stage 1, then the IPA is input to stage 2 which
> translates the IPA to the output PA. Stage 1 translation support is
> required to provide isolation between different devices within the OS.
> Xen already supports Stage 2 translation but there is no support for
> Stage 1 translation. This design proposal outlines the introduction of
> Stage-1 SMMUv3 support in Xen for ARM guests.
>
> Motivation
> ==========
>
> ARM systems utilizing SMMUv3 require stage-1 address translation to
> ensure secure DMA and guest managed I/O memory mappings.

It is unclear to me what you mean by "guest managed I/O memory mappings",
could you please provide an example?

> This feature enables:
>
> - Stage-1 translation in guest domain
> - Safe device passthrough under secure memory translation
>

As I see it, ARM specs use "secure" mostly when referring to Secure mode
(S-EL1, S-EL2, EL3) and associated secure counterparts of architectural
devices, like secure GIC, secure Timer, etc. So I'd probably not use
this word here, to reduce confusion.

> Design Overview
> ===============
>
> These changes provide emulated SMMUv3 support:
>
> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation
>     support in SMMUv3 driver.

"Nested translation" as in "nested virtualization"? Or is this something else?

> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
>     handling.

I think this is the big topic. You see, apart from SMMU, there is at
least the Renesas IPMMU, which uses a completely different API. And
probably other IO-MMU implementations are possible. Right now the
vIOMMU framework handles only SMMU, which is okay, but we should
probably design it in such a way that other IO-MMUs can be supported as
well. Maybe even IO-MMUs for other architectures (RISC-V maybe?).
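
To make this more concrete: I would expect the vIOMMU framework to boil
down to something like a per-backend ops table, so that vsmmuv3 is just
one provider and an IPMMU or RISC-V IO-MMU backend could slot in later.
Rough sketch, all names below are mine, not taken from the series:

    /* Sketch of a backend-agnostic vIOMMU interface; every name here is
     * illustrative only. */
    struct viommu_ops {
        const char *name;                   /* "vsmmuv3", "vipmmu", ... */
        int  (*domain_init)(struct domain *d);
        void (*domain_destroy)(struct domain *d);
        /* Trap handlers for the emulated register file. */
        int  (*mmio_read)(struct vcpu *v, mmio_info_t *info,
                          register_t *r, void *priv);
        int  (*mmio_write)(struct vcpu *v, mmio_info_t *info,
                           register_t r, void *priv);
        /* Bind a passed-through device (identified by a virtual StreamID
         * or equivalent) to a guest. */
        int  (*attach_device)(struct domain *d, struct device *dev,
                              uint32_t vdevid);
        /* Emit the matching node(s) into the guest device tree. */
        int  (*build_dt_node)(const struct domain *d, void *fdt);
    };

    /* The SMMUv3 emulation would then be registered as one provider: */
    static const struct viommu_ops vsmmuv3_ops = {
        .name        = "vsmmuv3",
        .domain_init = vsmmuv3_domain_init,
        /* ... */
    };

With something like this in place, the question "what happens on a
platform without an SMMUv3" mostly answers itself: a different ops table
(or none) gets registered.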

> - **Register/Command Emulation**: SMMUv3 register emulation and
>     command queue handling.

Continuing the previous paragraph: what about other IO-MMUs? For example,
if the platform provides only a Renesas IPMMU, will the vIOMMU framework
still emulate SMMUv3 registers and queue handling?

> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes
>     to device trees for dom0 and dom0less scenarios.
> - **Runtime Configuration**: Introduces a `viommu` boot parameter for
>     dynamic enablement.
>
> vIOMMU is exposed to the guest as a single device with a predefined set
> of capabilities and supported commands. The single vIOMMU model abstracts
> the details of the actual IOMMU hardware, simplifying usage from the
> guest point of view. The guest OS handles only a single IOMMU, even if
> multiple IOMMU units are available on the host system.

In the previous email I asked how you are planning to handle potential
SID overlaps, especially in the PCI use case. I want to return to this
topic. I am not saying that this is impossible, but I'd like to see this
covered in the design document.
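
To illustrate what I mean: with a single vIOMMU the guest sees one flat
StreamID space, so somewhere there has to be a per-domain table along
these lines (rough sketch, types and names invented by me):

    /* Sketch only: per-domain mapping from guest-visible StreamIDs to the
     * physical SMMU instance and StreamID behind them. */
    struct vsid_map_entry {
        uint32_t vsid;                  /* StreamID seen by the guest */
        struct arm_smmu_device *psmmu;  /* physical SMMU owning the device */
        uint32_t psid;                  /* StreamID on that physical SMMU */
    };

    struct vsmmu_domain {
        struct vsid_map_entry *map;
        unsigned int nr_entries;
    };

    static bool vsid_to_psid(const struct vsmmu_domain *vd, uint32_t vsid,
                             struct arm_smmu_device **psmmu, uint32_t *psid)
    {
        unsigned int i;

        for ( i = 0; i < vd->nr_entries; i++ )
        {
            if ( vd->map[i].vsid == vsid )
            {
                *psmmu = vd->map[i].psmmu;
                *psid = vd->map[i].psid;
                return true;
            }
        }

        return false;           /* vSID never assigned to this domain */
    }

For PCI this gets harder, because each root bridge brings its own SID
space, so the vSID allocator also has to guarantee that two devices
behind different pSMMUs never collide in the guest-visible space. I'd
like to see this table (or whatever replaces it) described in the
document.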

>
> Security Considerations
> =======================
>
> **viommu security benefits:**
>
> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
> - Emulated IOMMU removes guest dependency on IOMMU hardware while
>   maintaining domain isolation.

I am not sure that I got this paragraph. 

>
>
> 1. Observation:
> ---------------
> Support for Stage-1 translation in SMMUv3 introduces new data
> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both
> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including
> an `abort` field to handle partial configuration states.
>
> **Risk:**
> Without proper handling, a partially applied Stage-1 configuration
> might leave guest DMA mappings in an inconsistent state, potentially
> enabling unauthorized access or causing cross-domain interference.
>
> **Mitigation:** *(Handled by design)*
> This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
> to the STE and manages the `abort` field, considering the Stage-1
> configuration only when it is fully attached. This ensures incomplete or
> invalid guest configurations are safely ignored by the hypervisor.
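
Just to check that I read this correctly, is the intended logic roughly
the following? (pseudo-C, field and constant names are mine, not the
driver's)

    /* Pseudo-C sketch of STE programming with the `abort` handling the
     * document describes; names are illustrative only. */
    static void vsmmu_write_ste(struct ste *ste, const struct s1_cfg *s1,
                                const struct s2_cfg *s2)
    {
        /* Stage-2 is always owned and installed by Xen. */
        ste_set_s2(ste, s2);

        if ( s1 && s1->fully_attached )
        {
            /* Guest stage-1 tables are linked in only once the guest has
             * finished configuring them. */
            ste_set_s1(ste, s1);
            ste->config = STE_CFG_NESTED;       /* S1 -> S2 */
        }
        else if ( s1 )
        {
            /* Partially configured stage-1: abort DMA rather than run
             * with an inconsistent context. */
            ste->config = STE_CFG_ABORT;
        }
        else
            ste->config = STE_CFG_S2_ONLY;
    }

If that is the idea, it would be worth stating in the document when the
STE transitions from ABORT back to the nested configuration.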
>
> 2. Observation:
> ---------------
> Guests can now invalidate Stage-1 caches; invalidations need to be
> forwarded to the SMMUv3 hardware to maintain coherence.
>
> **Risk:**
> Failing to propagate cache invalidations could leave stale mappings in
> place, enabling access through old mappings and possibly causing data
> leakage or misrouting.
>
> **Mitigation:** *(Handled by design)*
> This feature ensures that guest-initiated invalidations are correctly
> forwarded to the hardware, preserving IOMMU coherency.
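
For completeness, the property I care about here is that the guest's
ASID/VMID namespace is rewritten by Xen and the invalidation is synced
before the guest's own CMD_SYNC completes. Something like this (sketch,
names are mine and may not match the actual driver structures):

    /* Sketch: forwarding a guest stage-1 TLB invalidation to the pSMMU. */
    static int vsmmu_forward_tlbi_asid(struct domain *d,
                                       const struct vsmmu_cmd *guest_cmd)
    {
        struct arm_smmu_cmdq_ent ent = {
            .opcode    = CMDQ_OP_TLBI_NH_ASID,
            /* The VMID always comes from Xen's own record for this
             * domain, never from the guest-supplied command. */
            .tlbi.vmid = domain_vmid(d),
            .tlbi.asid = guest_cmd->asid,
        };

        /* Issue on every physical SMMU with devices assigned to this
         * domain, followed by CMD_SYNC, so the guest cannot observe stale
         * translations once its own sync completes. */
        return vsmmu_issue_on_assigned_smmus(d, &ent);
    }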
>
> 3. Observation:
> ---------------
> This design introduces substantial new functionality, including the
> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command
> queues, event queues, domain management, and Device Tree modifications
> (e.g., `iommus` nodes and `libxl` integration).
>
> **Risk:**
> Large feature expansions increase the attack surface—potential for
> race conditions, unchecked command inputs, or Device Tree-based
> misconfigurations.
>
> **Mitigation:**
>
> - Sanity checks and error-handling improvements have been introduced
>   in this feature.
> - Further audits have to be performed for this feature and its
>   dependencies in this area. Currently, the feature is marked as *Tech
>   Preview* and is self-contained, reducing the risk to unrelated
>   components.
>
> 4. Observation:
> ---------------
> The code includes transformations to handle nested translation versus
> standard modes and uses guest-configured command queues (e.g.,
> `CMD_CFGI_STE`) and event notifications.
>
> **Risk:**
> Malicious or malformed queue commands from guests could bypass
> validation, manipulate SMMUv3 state, or cause Dom0 instability.

Only Dom0?

>
> **Mitigation:** *(Handled by design)*
> Built-in validation of command queue entries and sanitization
> mechanisms ensure only permitted configurations are applied. This is
> supported via additions in `vsmmuv3` and `cmdqueue` handling code.
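
Good. What I would like the document to spell out is the per-opcode
whitelist, i.e. something along these lines (sketch, invented names):

    /* Sketch: every guest command goes through an explicit whitelist
     * before it can touch any physical SMMU state. */
    static int vsmmu_handle_cmd(struct domain *d, const struct vsmmu_cmd *cmd)
    {
        switch ( cmd->opcode )
        {
        case CMDQ_OP_CFGI_STE:
            /* The SID must belong to this domain and to an assigned device. */
            if ( !vsid_owned_by_domain(d, cmd->cfgi.sid) )
                return -EPERM;
            return vsmmu_update_ste(d, cmd);

        case CMDQ_OP_TLBI_NH_ASID:
        case CMDQ_OP_TLBI_NH_VA:
            return vsmmu_forward_tlbi(d, cmd);

        case CMDQ_OP_CMD_SYNC:
            return vsmmu_cmd_sync(d);

        default:
            /* Anything not explicitly allowed is rejected and reported to
             * the guest through the virtual event queue. */
            return -EOPNOTSUPP;
        }
    }

In particular, it would help to list which opcodes are accepted at all,
since everything else falls into the default branch.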
>
> 5. Observation:
> ---------------
> Device Tree modifications enable device assignment and
> configuration—guest DT fragments (e.g., `iommus`) are added via
> `libxl`.
>
> **Risk:**
> Erroneous or malicious Device Tree injection could result in device
> misbinding or guest access to unauthorized hardware.
>
> **Mitigation:**
>
> - `libxl` performs checks of the guest configuration and parses only
>   predefined DT fragments and nodes, reducing risk.
> - The system integrator must ensure correct resource mapping in the
>   guest Device Tree (DT) fragments.
>
> 6. Observation:
> ---------------
> Introducing optional per-guest features (`viommu` argument in the
> xl guest config) means some guests may opt out.
>
> **Risk:**
> Differences between guests with and without `viommu` may cause
> unexpected behavior or privilege drift.
>
> **Mitigation:**
> Verify that downgrade paths are safe and well-isolated; ensure missing
> support doesn't cause security issues. Additional audits on emulation
> paths and cross-domain interference need to be performed in a
> multi-guest environment.
>
> 7. Observation:
> ---------------
> Guests have the ability to issue Stage-1 IOMMU commands like cache
> invalidation, stream table entry configuration, etc. An adversarial
> guest may issue a high volume of commands in rapid succession.
>
> **Risk:**
> Excessive command requests can cause high hypervisor CPU consumption
> and disrupt scheduling, leading to degraded system responsiveness and
> potential denial-of-service scenarios.
>
> **Mitigation:**
>
> - Xen credit scheduler limits guest vCPU execution time, securing
>   basic guest rate-limiting.

I don't think this feature is available only in the credit scheduler;
AFAIK, all schedulers except the null scheduler will limit vCPU
execution time.

> - Batch multiple commands of the same type to reduce overhead on the
>   virtual SMMUv3 hardware emulation.
> - Implement vIOMMU command execution restart and continuation support.

So, something like "hypercall continuation"?
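
If so, I would expect something along the lines of the sketch below:
process at most N commands per entry into the emulation path and come
back for the rest, so a guest filling its virtual queue cannot keep a
pCPU in Xen for an unbounded time. (Names and the deferral mechanism are
only illustrative.)

    /* Sketch: bounded processing of the virtual command queue, with the
     * remainder deferred, similar in spirit to hypercall continuation. */
    #define VSMMU_CMDS_PER_PASS  32

    static int vsmmu_process_cmdq(struct domain *d, struct vsmmu *vsmmu)
    {
        unsigned int done = 0;

        while ( vsmmu_cmdq_has_work(vsmmu) )
        {
            struct vsmmu_cmd cmd;
            int rc;

            vsmmu_cmdq_pop(vsmmu, &cmd);

            rc = vsmmu_handle_cmd(d, &cmd);
            if ( rc )
                return rc;

            if ( ++done >= VSMMU_CMDS_PER_PASS && vsmmu_cmdq_has_work(vsmmu) )
            {
                /* Progress is already encoded in the consumer index; finish
                 * later, e.g. from a tasklet or on the next guest
                 * notification, instead of looping here. */
                vsmmu_schedule_continuation(vsmmu);
                return -ERESTART;
            }
        }

        return 0;
    }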

>
> 8. Observation:
> ---------------
> Some guest commands issued towards the vIOMMU are propagated to the
> pIOMMU command queue (e.g. TLB invalidate). For each pIOMMU, only one
> command queue is available for all domains.
>
> **Risk:**
> Excessive command requests from an abusive guest can flood the
> physical IOMMU command queue, leading to degraded pIOMMU responsiveness
> for commands issued by other guests.
>
> **Mitigation:**
>
> - Xen credit scheduler limits guest vCPU execution time, securing
>   basic guest rate-limiting.
> - Batch commands which should be propagated towards the pIOMMU command
>   queue and enable support for batch execution pause/continuation.
> - If possible, implement domain penalization by adding a per-domain
>   cost counter for vIOMMU/pIOMMU usage.
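
Regarding the per-domain cost counter: even a very simple credit scheme
would help here, e.g. the sketch below (illustrative only; the period,
credit value and names are made up). A command that cannot be charged
would be deferred rather than forwarded to the pSMMU queue.

    /* Sketch of a per-domain budget for commands forwarded to the
     * physical SMMU command queue. */
    #define VSMMU_CREDITS_PER_PERIOD  64
    #define VSMMU_PERIOD              MILLISECS(10)

    struct vsmmu_budget {
        spinlock_t lock;
        unsigned int credits;     /* pSMMU commands left in this period */
        s_time_t period_end;      /* when credits are replenished */
    };

    static bool vsmmu_charge(struct vsmmu_budget *b, unsigned int cost)
    {
        bool ok;

        spin_lock(&b->lock);

        if ( NOW() > b->period_end )
        {
            b->credits = VSMMU_CREDITS_PER_PERIOD;
            b->period_end = NOW() + VSMMU_PERIOD;
        }

        ok = (b->credits >= cost);
        if ( ok )
            b->credits -= cost;

        spin_unlock(&b->lock);

        return ok;   /* false => defer the command instead of forwarding */
    }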
>
> 9. Observation:
> ---------------
> The vIOMMU feature includes an event queue used for forwarding IOMMU
> events to the guest (e.g. translation faults, invalid stream IDs,
> permission errors). A malicious guest can misconfigure its SMMU state
> or intentionally trigger faults at a high rate.
>
> **Risk:**
> A high rate of IOMMU events can cause Xen to flood the event queue and
> disrupt scheduling due to high hypervisor CPU load for event handling.
>
> **Mitigation:**
>
> - Implement a fail-safe state by disabling event forwarding when faults
>   occur at a high rate and are not processed by the guest.
> - Batch multiple events of the same type to reduce overhead on the
>   virtual SMMUv3 hardware emulation.
> - Consider disabling the event queue for untrusted guests.
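
On the fail-safe point, I think the cheapest robust behaviour is to
mirror what the real EVTQ does on overflow: latch a flag, stop
forwarding, and let the guest recover, so a fault storm costs Xen almost
nothing. Rough sketch (names are mine):

    /* Sketch: stop forwarding events once the virtual event queue is
     * full, mimicking the hardware EVTQ overflow behaviour. */
    static void vsmmu_report_event(struct vsmmu *vsmmu,
                                   const struct vsmmu_event *evt)
    {
        if ( vsmmu->evtq_overflow )
            return;                /* fail-safe: forwarding disabled */

        if ( vsmmu_evtq_full(vsmmu) )
        {
            /* The guest is not consuming events: latch the overflow flag
             * and do no further work until the guest acknowledges it. */
            vsmmu->evtq_overflow = true;
            return;
        }

        vsmmu_evtq_push(vsmmu, evt);
        vsmmu_inject_evtq_irq(vsmmu);
    }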
>
> Performance Impact
> ==================
>
> With the inclusion of IOMMU stage-1 and nested translation, performance
> overhead is introduced compared to the existing stage-2-only usage in
> Xen.
> Once mappings are established, translations should not introduce
> significant overhead.
> Emulated paths may introduce moderate overhead, primarily affecting
> device initialization and event handling.
> Performance impact highly depends on target CPU capabilities. Testing
> was performed on a Cortex-A53 based platform.

Which platform exactly? While QEMU emulates SMMU to some extent, we are
observing somewhat different SMMU behavior on real HW platforms (mostly
due to cache coherence problems). Also, according to MMU-600 errata, it
can have lower than expected performance in some use-cases.

> Performance is mostly impacted by emulated vIOMMU operations; results
> are shown in the following table.
>
> +-------------------------------+---------------------------------+
> | vIOMMU Operation              | Execution time in guest         |
> +===============================+=================================+
> | Reg read                      | median: 30μs, worst-case: 250μs |
> +-------------------------------+---------------------------------+
> | Reg write                     | median: 35μs, worst-case: 280μs |
> +-------------------------------+---------------------------------+
> | Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
> +-------------------------------+---------------------------------+
> | Invalidate STE                | median: 450μs, worst-case: 7ms+ |
> +-------------------------------+---------------------------------+
>
> With vIOMMU exposed to the guest, the guest OS has to initialize the
> IOMMU device and configure stage-1 mappings for devices attached to it.
> The following table shows the initialization stages which impact
> stage-1-enabled guest boot time and compares it with a stage-1-disabled
> guest.
>
> "NOTE: Device probe execution time varies significantly depending on
> device complexity. virtio-gpu was selected as a test case due to its
> extensive use of dynamic DMA allocations and IOMMU mappings, making it
> a suitable candidate for benchmarking stage-1 vIOMMU behavior."
>
> +---------------------+-----------------------+------------------------+
> | Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
> +=====================+=======================+========================+
> | IOMMU Init          | ~25ms                 | /                      |
> +---------------------+-----------------------+------------------------+
> | Dev Attach / Mapping| ~220ms                | ~200ms                 |
> +---------------------+-----------------------+------------------------+
>
> For devices configured with dynamic DMA mappings, the performance of
> DMA allocate/map/unmap operations is also impacted on stage-1-enabled
> guests.
> A dynamic DMA mapping operation triggers emulated IOMMU operations such
> as MMIO reads/writes and TLB invalidations.
> As a reference, the following table shows performance results for
> runtime DMA operations for a virtio-gpu device.
>
> +---------------+-------------------------+----------------------------+
> | DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
> +===============+=========================+============================+
> | dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
> +---------------+-------------------------+----------------------------+
> | dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
> +---------------+-------------------------+----------------------------+
> | dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
> +---------------+-------------------------+----------------------------+
> | dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
> +---------------+-------------------------+----------------------------+
>
> Testing
> =======
>
> - QEMU-based ARM system tests for Stage-1 translation and nested
>   virtualization.
> - Actual hardware validation on platforms such as Renesas to ensure
>   compatibility with real SMMUv3 implementations.
> - Unit/Functional tests validating correct translations (not implemented).
>
> Migration and Compatibility
> ===========================
>
> This optional feature defaults to disabled (`viommu=""`) for backward
> compatibility.
>

-- 
WBR, Volodymyr

 

