Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
Hello Julien, Volodymyr

On 8/27/25 01:28, Volodymyr Babchuk wrote:

Hi Milan,

Milan Djokic <milan_djokic@xxxxxxxx> writes:

Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:

On 13/08/2025 11:04, Milan Djokic wrote:

Hello Julien,

Hi Milan,

We have prepared a design document and it will be part of the updated patch series (added in docs/design). I'll also extend cover letter with details on implementation structure to make review easier.

I would suggest to just iterate on the design document for now.

Following is the design document content which will be provided in updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@xxxxxxxx>
Date: 2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation. Each stage of translation can be independently enabled. An incoming address is logically translated from VA to IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to the output PA. Stage 1 translation support is required to provide isolation between different devices within the OS.

Xen already supports Stage 2 translation but there is no support for Stage 1 translation. This design proposal outlines the introduction of Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without stage-1.

Correct in terms of working with guest managed I/O space.
I'll rephrase this statement, it seems ambiguous.

This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling

So what are you planning to expose to a guest? Is it one vIOMMU per pIOMMU? Or a single one?

Single vIOMMU model is used in this design.

Have you considered the pros/cons for both?

- Register/Command Emulation: SMMUv3 register emulation and command queue handling

That's a point for consideration. A single vIOMMU prevails in terms of less complex implementation and a simple guest IOMMU model: single vIOMMU node, one interrupt path, one event queue, a single set of trap handlers for emulation, etc. Cons for a single vIOMMU model could be less accurate hw representation and a potential bottleneck with one emulated queue and interrupt path.

On the other hand, vIOMMU per pIOMMU provides more accurate hw modeling and offers better scalability in case of many IOMMUs in the system, but this comes with more complex emulation logic and device tree, and with handling multiple vIOMMUs on the guest side.

IMO, single vIOMMU model seems like a better option mostly because it's less complex, easier to maintain and debug. Of course, this decision can and should be discussed.

Well, I am not sure that this is possible, because of StreamID allocation. The biggest offender is of course PCI, as each Root PCI bridge will require its own SMMU instance with its own StreamID space. But even without PCI you'll need some mechanism to map vStreamID to <pSMMU, pStreamID>, because there will be overlaps in SID space.

Actually, PCI/vPCI with vSMMU is its own can of worms...

For each pSMMU, we have a single command queue that will receive commands from all the guests. How do you plan to prevent a guest hogging the command queue?
In addition to that, AFAIU, the size of the virtual command queue is fixed by the guest rather than Xen. If a guest is filling up the queue with commands before notifying Xen, how do you plan to ensure we don't spend too much time in Xen (which is not preemptible)?

We'll have to do a detailed analysis on these scenarios, they are not covered by the design (as well as some others, which is clear after your comments). I'll come back with an updated design.

I think that can be handled akin to hypercall continuation, which is used in similar places, like P2M code

[...]

I have updated the vIOMMU design document with additional security topics covered and performance impact results. I also added some additional explanations for vIOMMU components following your comments.

Updated document content:

==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author: Milan Djokic <milan_djokic@xxxxxxxx>
:Date: 2025-08-07
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation can be independently enabled. An incoming address is logically translated from VA to IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to the output PA. Stage 1 translation support is required to provide isolation between different devices within the OS. XEN already supports Stage 2 translation but there is no support for Stage 1 translation.

This design proposal outlines the introduction of Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure secure DMA and guest managed I/O memory mappings.
This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for dynamic enablement.

The vIOMMU is exposed to the guest as a single device with a predefined set of capabilities and supported commands. The single vIOMMU model abstracts the details of the actual IOMMU hardware, simplifying usage from the guest point of view. The guest OS handles only a single IOMMU, even if multiple IOMMU units are available on the host system.

Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while maintaining domain isolation.

1. Observation:
---------------

Support for Stage-1 translation in SMMUv3 introduces new data structures (`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including an `abort` field to handle partial configuration states.

**Risk:**

Without proper handling, a partially applied Stage-1 configuration might leave guest DMA mappings in an inconsistent state, potentially enabling unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*

This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the STE and manages the `abort` field, taking the Stage-1 configuration into account only if it is fully attached.
This ensures incomplete or invalid guest configurations are safely ignored by the hypervisor.

2. Observation:
---------------

Guests can now invalidate Stage-1 caches; invalidations need to be forwarded to the SMMUv3 hardware to maintain coherence.

**Risk:**

Failing to propagate cache invalidations could leave stale mappings in place, enabling access to old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*

This feature ensures that guest-initiated invalidations are correctly forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------

This design introduces substantial new functionality, including the `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues, event queues, domain management, and Device Tree modifications (e.g., `iommus` nodes and `libxl` integration).

**Risk:**

Large feature expansions increase the attack surface: potential for race conditions, unchecked command inputs, or Device Tree-based misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in this feature.
- Further audits have to be performed for this feature and its dependencies in this area. Currently, the feature is marked as *Tech Preview* and is self-contained, reducing the risk to unrelated components.

4. Observation:
---------------

The code includes transformations to handle nested translation versus standard modes and uses guest-configured command queues (e.g., `CMD_CFGI_STE`) and event notifications.

**Risk:**

Malicious or malformed queue commands from guests could bypass validation, manipulate SMMUv3 state, or cause Dom0 instability.

**Mitigation:** *(Handled by design)*

Built-in validation of command queue entries and sanitization mechanisms ensure only permitted configurations are applied. This is supported via additions in `vsmmuv3` and `cmdqueue` handling code.

5.
Observation:
---------------

Device Tree modifications enable device assignment and configuration: guest DT fragments (e.g., `iommus`) are added via `libxl`.

**Risk:**

Erroneous or malicious Device Tree injection could result in device misbinding or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the guest Device Tree (DT) fragments.

6. Observation:
---------------

Introducing optional per-guest enabled features (`viommu` argument in xl guest config) means some guests may opt out.

**Risk:**

Differences between guests with and without `viommu` may cause unexpected behavior or privilege drift.

**Mitigation:**

Verify that downgrade paths are safe and well-isolated; ensure missing support doesn't cause security issues. Additional audits on emulation paths and inter-domain interference need to be performed in a multi-guest environment.

7. Observation:
---------------

Guests have the ability to issue Stage-1 IOMMU commands like cache invalidation, stream table entry configuration, etc. An adversarial guest may issue a high volume of commands in rapid succession.

**Risk:**

Excessive command requests can cause high hypervisor CPU consumption and disrupt scheduling, leading to degraded system responsiveness and potential denial-of-service scenarios.

**Mitigation:**

- Xen credit scheduler limits guest vCPU execution time, securing basic guest rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the virtual SMMUv3 hardware emulation.
- Implement restart and continuation support for vIOMMU command execution.

8. Observation:
---------------

Some guest commands issued towards the vIOMMU are propagated to the pIOMMU command queue (e.g. TLB invalidate). For each pIOMMU, only one command queue is available for all domains.
**Risk:**

Excessive command requests from an abusive guest can flood the physical IOMMU command queue, leading to degraded pIOMMU responsiveness on commands issued from other guests.

**Mitigation:**

- Xen credit scheduler limits guest vCPU execution time, securing basic guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command queue and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------

The vIOMMU feature includes an event queue used for forwarding IOMMU events to the guest (e.g. translation faults, invalid stream IDs, permission errors). A malicious guest can misconfigure its SMMU state or intentionally trigger faults with high frequency.

**Risk:**

IOMMU events occurring with high frequency can cause Xen to flood the event queue and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults occur with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.

Performance Impact
==================

With IOMMU stage-1 and nested translation included, performance overhead is introduced compared to the existing stage-2 only usage in Xen. Once mappings are established, translations should not introduce significant overhead. Emulated paths may introduce moderate overhead, primarily affecting device initialization and event handling. Performance impact highly depends on target CPU capabilities. Testing was performed on a Cortex-A53 based platform. Performance is mostly impacted by emulated vIOMMU operations; results are shown in the following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With the vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU device and configure stage-1 mappings for devices attached to it. The following table shows the initialization stages which impact the boot time of a stage-1 enabled guest and compares them with a stage-1 disabled guest.

NOTE: Device probe execution time varies significantly depending on device complexity. virtio-gpu was selected as a test case due to its extensive use of dynamic DMA allocations and IOMMU mappings, making it a suitable candidate for benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA allocate/map/unmap operations is also impacted on stage-1 enabled guests. A dynamic DMA mapping operation issues emulated IOMMU functions like MMIO write/read and TLB invalidations. As a reference, the following table shows performance results for runtime DMA operations for a virtio-gpu device.
+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation and nested virtualization.
- Actual hardware validation on platforms such as Renesas to ensure compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward compatibility.

References
==========

- Original feature implemented by Rahul Singh: https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@xxxxxxx/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns
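
To make the continuation-based mitigation from observations 7 and 8 more concrete, here is a minimal sketch in C of a bounded command-queue drain. It is not taken from the patch series: `struct vcmdq`, `vcmdq_process()`, `vcmdq_handle_one()` and `VCMDQ_BUDGET` are hypothetical names, and the queue is modelled with free-running producer/consumer counters rather than the real SMMUv3 PROD/CONS register encoding. The idea is that the emulation handles at most a fixed number of guest commands per invocation and tells the caller whether a continuation is needed, so a guest cannot keep Xen busy for an unbounded time by filling its queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit on commands handled per invocation. */
#define VCMDQ_BUDGET 4u

/* Hypothetical, simplified view of a guest-owned command queue. */
struct vcmdq {
    uint32_t prod;  /* free-running producer counter, written by the guest */
    uint32_t cons;  /* free-running consumer counter, advanced by emulation */
    uint32_t size;  /* number of entries, must be a power of two */
    bool error;     /* set when the guest-provided state is inconsistent */
};

/* Stand-in for validating and emulating the command in slot 'idx'. */
static bool vcmdq_handle_one(struct vcmdq *q, uint32_t idx)
{
    (void)q;
    (void)idx;
    return true;
}

/*
 * Drain at most VCMDQ_BUDGET commands.
 * Returns true if commands are still pending, i.e. the caller should
 * arrange a continuation instead of looping here.
 */
bool vcmdq_process(struct vcmdq *q)
{
    assert(q->size != 0 && (q->size & (q->size - 1)) == 0);

    /* Snapshot the producer once so the guest cannot extend this run. */
    uint32_t prod = q->prod;
    uint32_t pending = prod - q->cons;
    uint32_t done = 0;

    /* More pending entries than slots means the guest state is bogus. */
    if (pending > q->size) {
        q->error = true;
        return false;
    }

    while (pending != 0 && done < VCMDQ_BUDGET) {
        if (!vcmdq_handle_one(q, q->cons & (q->size - 1)))
            q->error = true;  /* a real implementation would report this to the guest */
        q->cons++;
        pending--;
        done++;
    }

    return pending != 0;
}
```

With ten queued commands and a budget of four, three calls are needed to drain the queue; in between, the hypervisor is free to schedule other work. How the continuation is actually triggered (hypercall-continuation style, tasklet, or re-trap) is left open here, as it is in the design.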