
Re: [Xen-devel] [RFC] ARM PCI Passthrough design document



Hi Julien, 
Thanks for posting this. I think some additional topics need to be covered in 
the design document, under 3 main topics:

Hotplug: how will Xen support hotplug? Many rootports may require firmware 
hooks such as ACPI ASL to take care of platform specific MMIO initialization on 
hotplug. Normally firmware (UEFI) would have done that platform specific setup 
at boot. 

AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) be 
recoverable in Xen? 
Will drivers in doms be notified about fatal errors so they can be quiesced 
before Xen does the secondary bus reset? 
Will Xen support Firmware First error handling for AER, i.e. when the platform 
does Firmware First error handling and/or filtering of AER and sends the 
associated ACPI HEST logs to Xen?
How will AER notifications and logs be propagated to the doms: injected ACPI 
HEST?

PCIe DPC (Downstream Port Containment): will it be supported in Xen, and will 
Xen register for the DPC interrupt? When Xen brings the link back up, will it 
send a simulated hotplug event to dom0 to show that the link is back up?

Thanks,
Vikram

-----Original Message-----
From: Julien Grall [mailto:julien.grall@xxxxxxxxxx] 
Sent: Friday, May 26, 2017 12:14 PM
To: Stefano Stabellini <sstabellini@xxxxxxxxxx>
Cc: Julien Grall <julien.grall@xxxxxxxxxx>; xen-devel 
<xen-devel@xxxxxxxxxxxxxxxxxxxx>; edgar.iglesias@xxxxxxxxxx; Steve Capper 
<Steve.Capper@xxxxxxx>; punit.agrawal@xxxxxxx; Wei Chen <Wei.Chen@xxxxxxx>; 
Dave P Martin <Dave.Martin@xxxxxxx>; Sameer Goel <sgoel@xxxxxxxxxxxxxxxx>; 
Sinan Kaya <okaya@xxxxxxxxxxxxxxxx>; Vikram Sethi <vikrams@xxxxxxxxxxxxxxxx>; 
roger.pau@xxxxxxxxxx; manish.jaggi@xxxxxxxxxxxxxxxxxx; Vijaya Kumar K 
<Vijaya.Kumar@xxxxxxxxxxxxxxxxxx>; Andre Przywara <andre.przywara@xxxxxxx>
Subject: [RFC] ARM PCI Passthrough design document

Hi all,

The document below is an RFC version of a design proposal for PCI Passthrough 
in Xen on ARM. It aims to describe, from a high-level perspective, the 
interaction with the different subsystems and how guests will be able to 
discover and access PCI devices.

Currently on ARM, Xen does not have any knowledge about PCI devices. This means 
that an IOMMU or interrupt controller (such as the ITS) that requires specific 
configuration will not work with PCI, even for DOM0.

The PCI Passthrough work could be divided into 2 phases:
        * Phase 1: Register all PCI devices in Xen => will allow
                   ITS and SMMU to be used with PCI in Xen
        * Phase 2: Assign devices to guests

This document aims to describe the 2 phases, but for now only phase
1 is fully described.


I think I was able to gather all of the feedback and come up with a solution 
that will satisfy all the parties. The design document has changed quite a lot 
compared to the early draft sent a few months ago. The major changes are:
        * Provide more details on how PCI works on ARM and the interactions with
        the MSI controller and IOMMU
        * Provide details on the existing host bridge implementations
        * Give more explanation and justification for the approach chosen
        * Describe the hypercalls used and how they should be called

Feedback is welcome.

Cheers,

--------------------------------------------------------------------------------

% PCI pass-through support on ARM
% Julien Grall <julien.grall@xxxxxxxxxx>
% Draft B

# Preface

This document aims to describe the components required to enable PCI 
pass-through on ARM.

This is an early draft and some questions are still unanswered. When this is 
the case, the text will contain XXX.

# Introduction

PCI pass-through allows the guest to receive full control of physical PCI 
devices. This means the guest will have full and direct access to the PCI 
device.

ARM supports a kind of guest that exploits the virtualization support in 
hardware as much as possible. The guest relies on PV drivers only for IO (e.g 
block, network), and interrupts come through the virtualized interrupt 
controller, so no big changes are required within the kernel.

As a consequence, it would be possible to replace PV drivers by assigning real 
devices to the guest for I/O access. Xen on ARM would therefore be able to run 
unmodified operating systems.

To achieve this goal, it looks more sensible to go towards emulating the host 
bridge (there will be more details later). A guest would be able to take 
advantage of the firmware tables, obviating the need for a specific driver for 
Xen.

Thus, in this document we follow the emulated host bridge approach.

# PCI terminologies

Each PCI device under a host bridge is uniquely identified by its Requester ID 
(AKA RID). A Requester ID is a triplet of Bus number, Device number, and 
Function.

When the platform has multiple host bridges, the software can add a fourth 
number called Segment (sometimes called Domain) to differentiate host bridges.
A PCI device will then be uniquely identified by segment:bus:device:function 
(AKA SBDF).

So given a specific SBDF, it would be possible to find the host bridge and the 
RID associated to a PCI device. The pair (host bridge, RID) will often be used 
to find the relevant information for configuring the different subsystems (e.g 
IOMMU, MSI controller). For convenience, the rest of the document will use SBDF 
to refer to the pair (host bridge, RID).
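
As an illustration of the terminology above, here is a minimal sketch of how a 
RID can be packed from and unpacked to its components, using the usual PCI 
layout of 8 bits of bus, 5 bits of device and 3 bits of function. The type 
`struct sbdf` and the two helpers are hypothetical names used only for this 
example, they are not part of any Xen interface.

```c
#include <stdint.h>

/* Hypothetical helper type, used only for this illustration. */
struct sbdf {
    uint16_t segment;   /* software-assigned number selecting the host bridge */
    uint8_t  bus;       /* 8 bits */
    uint8_t  device;    /* 5 bits: 0..31 */
    uint8_t  function;  /* 3 bits: 0..7 */
};

/* RID layout: bus[15:8] | device[7:3] | function[2:0] */
static inline uint16_t sbdf_to_rid(struct sbdf s)
{
    return (uint16_t)((s.bus << 8) |
                      ((s.device & 0x1f) << 3) |
                      (s.function & 0x7));
}

/* Rebuild an SBDF from a segment number and a RID. */
static inline struct sbdf rid_to_sbdf(uint16_t segment, uint16_t rid)
{
    struct sbdf s = {
        .segment  = segment,
        .bus      = (uint8_t)(rid >> 8),
        .device   = (uint8_t)((rid >> 3) & 0x1f),
        .function = (uint8_t)(rid & 0x7),
    };
    return s;
}
```

The segment is deliberately kept outside the RID: it only selects the host 
bridge, while the RID is what the hardware sees on the bus.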

# PCI host bridge

A PCI host bridge enables data transfer between a host processor and PCI bus 
based devices. The bridge is used to access the configuration space of each PCI 
device and, on some platforms, may also act as an MSI controller.

## Initialization of the PCI host bridge

Whilst it would be expected that the bootloader takes care of initializing the 
PCI host bridge, on some platforms it is done in the Operating System.

This may include enabling/configuring the clocks that could be shared among 
multiple devices.

## Accessing PCI configuration space

Accessing the PCI configuration space can be divided into 2 categories:
    * Indirect access, where the configuration spaces are multiplexed. An
    example would be the legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
    similar method is used by the PCIe RCar root complex (see [12]).
    * ECAM access, where each configuration space has its own address space.

Whilst ECAM is a standard, some PCI host bridges will require specific fiddling 
when accessing the registers (see thunder-ecam [13]).

In most of the cases, accessing all the PCI configuration spaces under a given 
PCI host will be done the same way (i.e either indirect access or ECAM access). 
However, there are a few cases, dependent on the PCI devices accessed, which 
will use different methods (see thunder-pem [14]).
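
For the ECAM case, the address of a register follows directly from the bus, 
device, function and register offset. The sketch below shows the standard ECAM 
layout, where each function owns a 4KB window. The helper name `ecam_offset` 
and the `start_bus` parameter (first bus decoded by the window) are 
assumptions made for this example, and quirky host bridges such as the ones 
mentioned above would need more than this.

```c
#include <stdint.h>

/*
 * Offset of a register inside an ECAM window:
 *   bus[27:20] | device[19:15] | function[14:12] | register[11:0]
 * start_bus is the first bus number decoded by the window. The caller is
 * expected to check that bus >= start_bus.
 */
static inline uint64_t ecam_offset(uint8_t start_bus, uint8_t bus,
                                   uint8_t device, uint8_t function,
                                   uint16_t reg)
{
    return ((uint64_t)(bus - start_bus) << 20) |
           ((uint64_t)(device & 0x1f) << 15) |
           ((uint64_t)(function & 0x7) << 12) |
           (uint64_t)(reg & 0xfff);
}
```

The returned offset would be added to the base address of the ECAM region 
before performing the MMIO access.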

## Generic host bridge

For the purpose of this document, the term "generic host bridge" will be used 
to describe any host bridge that is ECAM-compliant and whose initialization, if 
required, has already been done by the firmware/bootloader.

# Interaction of the PCI subsystem with other subsystems

In order to have a PCI device fully working, Xen will need to configure other 
subsystems such as the IOMMU and the Interrupt Controller.

The interaction expected between the PCI subsystem and the other subsystems is:
    * Add a device
    * Remove a device
    * Assign a device to a guest
    * Deassign a device from a guest

XXX: Detail the interaction when assigning/deassigning device

In the following subsections, the interactions will be briefly described from a 
higher level perspective. However, implementation details such as callback, 
structure, etc... are beyond the scope of this document.

## IOMMU

The IOMMU will be used to isolate the PCI device when accessing the memory (e.g 
DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID 
(aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help of 
the firmware tables (see below).

Whilst in theory all the memory transactions issued by a PCI device should go 
through the IOMMU, on certain platforms some of the memory transactions may not 
reach the IOMMU because they are interpreted by the host bridge. For instance, 
this could happen if the MSI doorbell is built into the PCI host bridge or for 
P2P traffic. See [6] for more details.

XXX: I think this could be solved by using direct mapping (e.g GFN == MFN), 
this would mean the guest memory layout would be similar to the host one when 
PCI devices are passed through => Detail it.

## Interrupt controller

PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On 
ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their 
payload to a doorbell belonging to an MSI controller.

### Existing MSI controllers

In this section some of the existing controllers and their interaction with the 
devices will be briefly described. More details can be found in the respective 
specifications of each MSI controller.

MSIs can be distinguished by some combination of
    * the Doorbell
        It is the MMIO address written to. Devices may be configured by
        software to write to arbitrary doorbells which they can address.
        An MSI controller may feature a number of doorbells.
    * the Payload
        Devices may be configured to write an arbitrary payload chosen by
        software. MSI controllers may have restrictions on permitted payload.
        Xen will have to sanitize the payload unless it is known to be always
        safe.
    * Sideband information accompanying the write
        Typically this is neither configurable nor probeable, and depends on
        the path taken through the memory system (i.e it is a property of the
        combination of MSI controller and device rather than a property of
        either in isolation).

### GICv3/GICv4 ITS

The Interrupt Translation Service (ITS) is an MSI controller designed by ARM 
and integrated in the GICv3/GICv4 interrupt controller. For the specification see 
[GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called LPI. 
This interrupt will be configured by the software using a pair (DeviceID, 
EventID).

A platform may have multiple ITS blocks (e.g one per NUMA node), each of them 
belonging to an ITS group.

The DeviceID is a unique identifier within an ITS group for each MSI-capable 
device that can be deduced from the RID with the help of the firmware tables 
(see below).

The EventID is a unique identifier to distinguish the different events sent by 
a device.

The MSI payload will only contain the EventID as the DeviceID will be added 
afterwards by the hardware in a way that will prevent any tampering.
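
To illustrate the (DeviceID, EventID) pair described above, here is a minimal 
sketch of a lookup from that pair to an LPI. The flat table and the names 
`its_map_entry` and `its_lookup` are hypothetical and much simpler than the 
in-memory tables a real ITS uses. The point it tries to show is that the 
DeviceID argument comes from the hardware and not from the MSI payload, so a 
device cannot select another device's mappings by changing what it writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat mapping; a real ITS uses per-device translation tables. */
struct its_map_entry {
    uint32_t device_id;  /* supplied by the hardware alongside the write */
    uint32_t event_id;   /* taken from the MSI payload */
    uint32_t lpi;        /* LPI INTIDs start at 8192 */
};

/* Return the LPI mapped to (device_id, event_id), or 0 if there is none. */
static uint32_t its_lookup(const struct its_map_entry *map, size_t n,
                           uint32_t device_id, uint32_t event_id)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].device_id == device_id && map[i].event_id == event_id)
            return map[i].lpi;

    return 0;
}
```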

The [SBSA] appendix I describes the set of rules for the integration of the ITS 
that any compliant platform should follow. Some of the rules explain the 
security implications of a misbehaving device. They ensure that a guest will 
never be able to trigger an MSI on behalf of another guest.

XXX: The security implication is described in the [SBSA] but I haven't found 
any similar wording in the GICv3 specification. It is unclear to me if non-SBSA 
compliant platforms (e.g embedded) will follow those rules.

### GICv2m

The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique 
interrupts. The specification can be found in the [SBSA] appendix E.

Depending on the platform, the GICv2m will provide one or multiple instances of 
register frames. Each frame is composed of a doorbell and is associated with a 
set of SPIs that can be discovered by reading the register MSI_TYPER.

On an MSI write, the payload will contain the SPI ID to generate. Note that on 
some platforms the MSI payload may contain an offset from the base SPI rather 
than the SPI itself.

The frame will only generate an SPI if the written value corresponds to an SPI 
allocated to the frame. Each VM should have exclusive access to its frame to 
ensure isolation and prevent a guest OS from triggering an MSI on behalf of 
another guest OS.
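
The check a frame performs on a written value can be sketched as follows. The 
MSI_TYPER field positions used here (base SPI in bits [25:16], number of SPIs 
in bits [9:0]) are an assumption based on my reading of existing drivers and 
should be checked against the specification. The names `v2m_frame`, 
`v2m_frame_from_typer` and `v2m_payload_valid` are hypothetical, and only the 
case where the payload is the SPI ID itself is covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MSI_TYPER layout: base SPI in [25:16], number of SPIs in [9:0]. */
#define V2M_TYPER_BASE(t)  (((t) >> 16) & 0x3ffu)
#define V2M_TYPER_NUM(t)   ((t) & 0x3ffu)

/* Hypothetical description of the SPI range owned by one frame. */
struct v2m_frame {
    uint32_t spi_base;
    uint32_t nr_spis;
};

static inline struct v2m_frame v2m_frame_from_typer(uint32_t typer)
{
    struct v2m_frame f = {
        .spi_base = V2M_TYPER_BASE(typer),
        .nr_spis  = V2M_TYPER_NUM(typer),
    };
    return f;
}

/* True if the payload names an SPI allocated to this frame. */
static inline bool v2m_payload_valid(const struct v2m_frame *f,
                                     uint32_t payload)
{
    return payload >= f->spi_base && payload - f->spi_base < f->nr_spis;
}
```

If Xen has to sanitize payloads on behalf of a guest, a check of this shape 
would reject any SPI outside the range of the frame assigned to that guest.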

XXX: Linux seems to consider GICv2m as unsafe by default. From my 
understanding, it is still unclear how we should proceed on Xen, as GICv2m 
should be safe as long as the frame is only accessed by one guest.

### Other MSI controllers

Servers compliant with SBSA level 1 and higher will have to use either ITS or 
GICv2m. However, these are by no means the only MSI controllers available.
A hardware vendor may decide to use a custom MSI controller, which can be 
integrated in the PCI host bridge.

Whether it is possible to securely write an MSI will depend on the MSI 
controller implementation.

XXX: I am happy to give a brief explanation on more MSI controllers (such as 
Xilinx and Renesas) if people think it is necessary.

This design document does not pertain to a specific MSI controller and will try 
to be as agnostic as possible. When possible, it will give insight into how to 
integrate the MSI controller.

# Information available in the firmware tables

## ACPI

### Host bridges

The static table MCFG (see 4.2 in [1]) will describe the host bridges available 
at boot and supporting ECAM. Unfortunately, there are platforms out there (see 
[2]) that re-use MCFG to describe host bridges that are not fully ECAM 
compatible.

This means that Xen needs to account for possible quirks in the host bridge.
The Linux community are working on a patch series for this, see [2] and [3], 
where quirks will be detected with:
    * OEM ID
    * OEM Table ID
    * OEM Revision
    * PCI Segment
    * PCI bus number range (wildcard allowed)
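
A quirk table keyed on the fields listed above could look like the sketch 
below. The structure, the wildcard marker `QUIRK_ANY_BUS` and the function 
`mcfg_quirk_match` are hypothetical names for this example and are not taken 
from the Linux series. The field sizes follow the ACPI table header, where the 
OEM ID is 6 characters and the OEM Table ID is 8 characters.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QUIRK_ANY_BUS 0xffffu  /* hypothetical wildcard for the bus range */

/* Hypothetical quirk entry: which platform and which buses it applies to. */
struct mcfg_quirk {
    char     oem_id[7];        /* 6 characters + NUL */
    char     oem_table_id[9];  /* 8 characters + NUL */
    uint32_t oem_revision;
    uint16_t segment;
    uint16_t bus_start;        /* QUIRK_ANY_BUS matches every bus */
    uint16_t bus_end;          /* inclusive, ignored for the wildcard */
};

/* True if the quirk applies to the given MCFG header and segment/bus. */
static bool mcfg_quirk_match(const struct mcfg_quirk *q,
                             const char *oem_id, const char *oem_table_id,
                             uint32_t oem_revision,
                             uint16_t segment, uint8_t bus)
{
    if (strncmp(q->oem_id, oem_id, 6) != 0 ||
        strncmp(q->oem_table_id, oem_table_id, 8) != 0)
        return false;

    if (q->oem_revision != oem_revision || q->segment != segment)
        return false;

    if (q->bus_start == QUIRK_ANY_BUS)
        return true;

    return bus >= q->bus_start && bus <= q->bus_end;
}
```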

Based on what Linux is currently doing, there are two kinds of quirks:
    * Accesses to the configuration space of certain sizes are not allowed
    * A specific driver is necessary for driving the host bridge

The former is straightforward to solve but the latter will require more thought.
Instantiation of a specific driver for the host controller can be easily done 
if Xen has the information to detect it. However, those drivers may require 
resources described in ASL (see [4] for instance).

The number of platforms requiring a specific PCI host bridge driver is 
currently limited. Whilst it is not possible to predict the future, upcoming 
platforms are expected to have fully ECAM compliant PCI host bridges. 
Therefore, given Xen does not have any ASL parser, the approach suggested is to 
hardcode the missing values. This could be revisited in the future if necessary.

### Finding information to configure IOMMU and MSI controller

The static table [IORT] will provide information that will help to deduce data 
(such as MasterID and DeviceID) to configure both the IOMMU and the MSI 
controller from a given SBDF.

### Finding which NUMA node a PCI device belongs to

On a NUMA system, the NUMA node associated to a PCI device can be found using 
the _PXM method of the host bridge (?).

XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI 
device).

## Device Tree

### Host bridges

Each Device Tree node associated to a host bridge will have at least the 
following properties (see bindings in [8]):
    - device_type: will always be "pci".
    - compatible: a string indicating which driver to instantiate

The node may also contain optional properties such as:
    - linux,pci-domain: assign a fixed segment number
    - bus-range: indicate the range of bus numbers supported

When the property linux,pci-domain is not present, the operating system would 
have to allocate the segment number for each host bridge.

### Finding information to configure IOMMU and MSI controller

#### Configuring the IOMMU

The Device Tree provides a generic IOMMU binding (see [10]) which uses the 
properties "iommu-map" and "iommu-map-mask" to describe the relationship 
between a RID and a MasterID.

These properties will be present in the host bridge Device Tree node. From a 
given SBDF, it will be possible to find the corresponding MasterID.
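
As I understand the binding, each "iommu-map" entry describes a contiguous 
range of RIDs and the MasterID the range starts at, and the RID is first 
masked with "iommu-map-mask". The translation can be sketched as below. The 
structure `id_map_entry` and the function `map_rid` are hypothetical names, 
and the phandle of the target IOMMU that each entry carries is left out for 
brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One map entry with the controller phandle omitted (hypothetical layout). */
struct id_map_entry {
    uint32_t rid_base;  /* first RID covered by the entry */
    uint32_t out_base;  /* ID corresponding to rid_base */
    uint32_t length;    /* number of consecutive RIDs covered */
};

/*
 * Translate a RID through the map. Returns true and stores the result in
 * *out if an entry covers the masked RID, false otherwise.
 */
static bool map_rid(const struct id_map_entry *map, size_t n,
                    uint32_t mask, uint32_t rid, uint32_t *out)
{
    uint32_t masked = rid & mask;

    for (size_t i = 0; i < n; i++) {
        if (masked >= map[i].rid_base &&
            masked - map[i].rid_base < map[i].length) {
            *out = masked - map[i].rid_base + map[i].out_base;
            return true;
        }
    }

    return false;
}
```

The same range-based lookup should apply to the "msi-map" property described 
in [11], with the output being the data for the MSI controller instead of a 
MasterID.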

Note that the ARM SMMU also has a legacy binding (see [9]), but it does not 
have a way to describe the relationship between RID and StreamID. Instead it 
assumes that StreamID == RID. This binding has now been deprecated in favor of 
the generic IOMMU binding.

#### Configuring the MSI controller

The relationship between the RID and data required to configure the MSI 
controller (such as DeviceID) can be found using the property "msi-map"
(see [11]).

This property will be present in the host bridge Device Tree node. From a given 
SBDF, it will be possible to find the corresponding DeviceID.

### Finding which NUMA node a PCI device belongs to

On a NUMA system, the NUMA node associated to a PCI device can be found using 
the property "numa-node-id" (see [15]) present in the host bridge Device Tree 
node.

# Discovering PCI devices

Whilst PCI devices are currently available in the hardware domain, the 
hypervisor does not have any knowledge of them. The first step of supporting 
PCI pass-through is to make Xen aware of the PCI devices.

Xen will require access to the PCI configuration space to retrieve information 
for the PCI devices or access it on behalf of the guest via the emulated host 
bridge.

This means that Xen should be in charge of controlling the host bridge. 
However, for some host controllers, this may be difficult to implement in Xen 
because of dependencies on other components (e.g clocks, see more details in 
"PCI host bridge" section).

For this reason, the approach chosen in this document is to let the hardware 
domain discover the host bridges, scan the PCI devices and then report 
everything to Xen. This does not rule out the possibility of doing everything 
without the help of the hardware domain in the future.

## Who is in charge of the host bridge?

There are numerous implementations of host bridges on ARM. Some of them 
require a specific driver as they cannot be driven by a generic host bridge 
driver. Porting those drivers may be complex due to dependencies on other 
components.

This could be seen as a signal to leave the host bridge drivers in the hardware 
domain. Because Xen would need to access the configuration space, all the 
accesses would have to be forwarded to the hardware domain, which in turn would 
access the hardware.

In this design document, we are considering that the host bridge driver can be 
ported to Xen. In the case where it is not possible, an interface to forward 
configuration space accesses would need to be defined. The details of that 
interface are out of scope.

## Discovering and registering host bridge

The approach taken in the document will require communication between Xen and 
the hardware domain. In this case, they would need to agree on the segment 
number associated to a host bridge. However, this number is not available in 
the Device Tree case.

The hardware domain will register new host bridges using the existing hypercall
PHYSDEVOP_pci_mmcfg_reserved:

#define XEN_PCI_MMCFG_RESERVED 1

struct physdev_pci_mmcfg_reserved {
    /* IN */
    uint64_t    address;
    uint16_t    segment;
    /* Range of bus supported by the host bridge */
    uint8_t     start_bus;
    uint8_t     end_bus;

    uint32_t    flags;
};

Some of the host bridges may not have a separate configuration address space 
region described in the firmware tables. To simplify the registration, the 
field 'address' should contain the base address of one of the regions described 
in the firmware tables.
    * For ACPI, it would be the base address specified in the MCFG or in the
    _CBA method.
    * For Device Tree, this would be the base address of any region
    specified in the "reg" property.

The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.

It is expected that this hypercall is called before any PCI device is 
registered to Xen.
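
From the hardware domain side, preparing the argument of this hypercall could 
look like the sketch below. The helper `make_mmcfg_reserved` is a hypothetical 
name. Issuing the hypercall itself depends on the operating system and is left 
out.

```c
#include <stdint.h>

#define XEN_PCI_MMCFG_RESERVED 1

struct physdev_pci_mmcfg_reserved {
    /* IN */
    uint64_t    address;
    uint16_t    segment;
    /* Range of bus supported by the host bridge */
    uint8_t     start_bus;
    uint8_t     end_bus;

    uint32_t    flags;
};

/* Hypothetical helper: describe one host bridge to Xen. */
static inline struct physdev_pci_mmcfg_reserved
make_mmcfg_reserved(uint64_t address, uint16_t segment,
                    uint8_t start_bus, uint8_t end_bus)
{
    struct physdev_pci_mmcfg_reserved r = {
        .address   = address,    /* base of a region from the firmware tables */
        .segment   = segment,    /* number agreed with the hardware domain */
        .start_bus = start_bus,
        .end_bus   = end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };
    return r;
}
```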

When the hardware domain is in charge of the host bridge, this hypercall will 
be used to tell Xen about the existence of a host bridge in order to find the 
associated information for configuring the MSI controller and the IOMMU.

## Discovering and registering PCI devices

The hardware domain will scan the host bridge to find the list of PCI devices 
available and then report it to Xen using the existing hypercall
PHYSDEVOP_pci_device_add:

#define XEN_PCI_DEV_EXTFN   0x1
#define XEN_PCI_DEV_VIRTFN  0x2
#define XEN_PCI_DEV_PXM     0x4

struct physdev_pci_device_add {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
    uint32_t    flags;
    struct {
        uint8_t bus;
        uint8_t devfn;
    } physfn;
    /*
     * Optional parameters array.
     * First element ([0]) is PXM domain associated with the device (if
     * XEN_PCI_DEV_PXM is set)
     */
    uint32_t optarr[0];
};

When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the 
NUMA node ID associated with the device:
    * For ACPI, it would be the value returned by the method _PXM
    * For Device Tree, this would be the value found in the property
    "numa-node-id".
For more details see the section "Finding which NUMA node a PCI device belongs 
to" in "ACPI" and "Device Tree".
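
Because the NUMA node ID lives in a trailing optional array, the caller has to 
allocate room for it after the fixed part of the structure. The sketch below 
shows one way to do this. The helper `make_device_add` is a hypothetical name, 
the trailing array is written as a C99 flexible array member, and the value 
used for XEN_PCI_DEV_PXM is the one I believe Xen's public header uses, so 
treat it as an assumption. The devfn field uses the usual encoding of 5 bits 
of device and 3 bits of function.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed flag value: a dedicated bit, distinct from EXTFN and VIRTFN. */
#define XEN_PCI_DEV_PXM 0x4

struct physdev_pci_device_add {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
    uint32_t    flags;
    struct {
        uint8_t bus;
        uint8_t devfn;
    } physfn;
    uint32_t    optarr[];  /* optarr[0]: NUMA node if XEN_PCI_DEV_PXM is set */
};

/*
 * Hypothetical helper: allocate and fill the hypercall argument.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static struct physdev_pci_device_add *
make_device_add(uint16_t seg, uint8_t bus, uint8_t device, uint8_t function,
                int has_node, uint32_t node)
{
    size_t size = sizeof(struct physdev_pci_device_add) +
                  (has_node ? sizeof(uint32_t) : 0);
    struct physdev_pci_device_add *add = calloc(1, size);

    if (!add)
        return NULL;

    add->seg = seg;
    add->bus = bus;
    add->devfn = (uint8_t)(((device & 0x1f) << 3) | (function & 0x7));

    if (has_node) {
        add->flags |= XEN_PCI_DEV_PXM;
        add->optarr[0] = node;
    }

    return add;
}
```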

XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and 
XEN_PCI_DEV_VIRTFN will work. AFAICT, the former is used when the bus supports 
ARI and the only usage is in the x86 IOMMU code. The latter is related to IOV 
but I am not sure what devfn and physfn.devfn will correspond to.

Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add 
and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are 
a subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested 
to leave them unimplemented on ARM.

## Removing PCI devices

The hardware domain will be in charge of telling Xen that a device has been 
removed using the existing hypercall PHYSDEVOP_pci_device_remove:

struct physdev_pci_device {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
};

Note that x86 currently provides one more hypercall 
(PHYSDEVOP_manage_pci_remove) to remove PCI devices. However it does not allow 
a segment number to be passed. Therefore it is suggested to leave it 
unimplemented on ARM.

# Glossary

ECAM: Enhanced Configuration Access Mechanism
SBDF: Segment Bus Device Function. The segment is a software concept.
MSI: Message Signaled Interrupt
MSI doorbell: MMIO address written to by a device to generate an MSI
SPI: Shared Peripheral Interrupt
LPI: Locality-specific Peripheral Interrupt
ITS: Interrupt Translation Service

# Specifications
[SBSA]  ARM-DEN-0029 v3.0
[GICV3] IHI0069C
[IORT]  DEN0049B

# Bibliography

[1] PCI firmware specification, rev 3.2
[2] https://www.spinics.net/lists/linux-pci/msg56715.html
[3] https://www.spinics.net/lists/linux-pci/msg56723.html
[4] https://www.spinics.net/lists/linux-pci/msg56728.html
[6] https://www.spinics.net/lists/kvm/msg140116.html
[7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
[8] Documentation/devicetree/bindings/pci
[9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
[10] Documentation/devicetree/bindings/pci/pci-iommu.txt
[11] Documentation/devicetree/bindings/pci/pci-msi.txt
[12] drivers/pci/host/pcie-rcar.c
[13] drivers/pci/host/pci-thunder-ecam.c
[14] drivers/pci/host/pci-thunder-pem.c
[15] Documentation/devicetree/bindings/numa.txt