
Re: [RFC PATCH v2] iommu/xen: Add Xen PV-IOMMU driver



Hello Robin,
Thanks for the thorough review.

>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>> index 0af39bbbe3a3..242cefac77c9 100644
>> --- a/drivers/iommu/Kconfig
>> +++ b/drivers/iommu/Kconfig
>> @@ -480,6 +480,15 @@ config VIRTIO_IOMMU
>>         Say Y here if you intend to run this kernel as a guest.
>> +config XEN_IOMMU
>> +    bool "Xen IOMMU driver"
>> +    depends on XEN_DOM0
>
> Clearly this depends on X86 as well.
>
Well, I don't intend this driver to be X86-only, even though the current
Xen RFC doesn't support ARM (yet). Is there a reason it should be
restricted to X86?

>> +#include <linux/kernel.h>
>> +#include <linux/init.h>
>> +#include <linux/types.h>
>> +#include <linux/iommu.h>
>> +#include <linux/dma-map-ops.h>
>
> Please drop this; it's a driver, not a DMA ops implementation.
>
Sure, along with some other unneeded headers.

>> +#include <linux/pci.h>
>> +#include <linux/list.h>
>> +#include <linux/string.h>
>> +#include <linux/device/driver.h>
>> +#include <linux/slab.h>
>> +#include <linux/err.h>
>> +#include <linux/printk.h>
>> +#include <linux/stddef.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/minmax.h>
>> +#include <linux/string.h>
>> +#include <asm/iommu.h>
>> +
>> +#include <xen/xen.h>
>> +#include <xen/page.h>
>> +#include <xen/interface/memory.h>
>> +#include <xen/interface/physdev.h>
>> +#include <xen/interface/pv-iommu.h>
>> +#include <asm/xen/hypercall.h>
>> +#include <asm/xen/page.h>
>> +
>> +MODULE_DESCRIPTION("Xen IOMMU driver");
>> +MODULE_AUTHOR("Teddy Astie <teddy.astie@xxxxxxxxxx>");
>> +MODULE_LICENSE("GPL");
>> +
>> +#define MSI_RANGE_START        (0xfee00000)
>> +#define MSI_RANGE_END        (0xfeefffff)
>> +
>> +#define XEN_IOMMU_PGSIZES       (0x1000)
>> +
>> +struct xen_iommu_domain {
>> +    struct iommu_domain domain;
>> +
>> +    u16 ctx_no; /* Xen PV-IOMMU context number */
>> +};
>> +
>> +static struct iommu_device xen_iommu_device;
>> +
>> +static uint32_t max_nr_pages;
>> +static uint64_t max_iova_addr;
>> +
>> +static spinlock_t lock;
>
> Not a great name - usually it's good to name a lock after what it
> protects. Although perhaps it is already, since AFAICS this isn't
> actually used anywhere anyway.
>

Yes, that shouldn't be there.

>> +static inline struct xen_iommu_domain *to_xen_iommu_domain(struct
>> iommu_domain *dom)
>> +{
>> +    return container_of(dom, struct xen_iommu_domain, domain);
>> +}
>> +
>> +static inline u64 addr_to_pfn(u64 addr)
>> +{
>> +    return addr >> 12;
>> +}
>> +
>> +static inline u64 pfn_to_addr(u64 pfn)
>> +{
>> +    return pfn << 12;
>> +}
>> +
>> +bool xen_iommu_capable(struct device *dev, enum iommu_cap cap)
>> +{
>> +    switch (cap) {
>> +    case IOMMU_CAP_CACHE_COHERENCY:
>> +        return true;
>
> Will the PV-IOMMU only ever be exposed on hardware where that really is
> always true?
>

On the hypervisor side, the PV-IOMMU interface always implicitly flushes
the IOMMU hardware on map/unmap operations, so by the end of the
hypercall the cache should always be coherent, IMO.

>> +
>> +static struct iommu_device *xen_iommu_probe_device(struct device *dev)
>> +{
>> +    if (!dev_is_pci(dev))
>> +        return ERR_PTR(-ENODEV);
>> +
>> +    return &xen_iommu_device;
>
> Even emulated PCI devices have to have an (emulated, presumably) IOMMU?
>

No, and that's indeed one thing to check.

>> +}
>> +
>> +static void xen_iommu_probe_finalize(struct device *dev)
>> +{
>> +    set_dma_ops(dev, NULL);
>> +    iommu_setup_dma_ops(dev, 0, max_iova_addr);
>

That got changed in 6.10; good to know that this code is not required
anymore.

>
>> +}
>> +
>> +static int xen_iommu_map_pages(struct iommu_domain *domain, unsigned
>> long iova,
>> +                   phys_addr_t paddr, size_t pgsize, size_t pgcount,
>> +                   int prot, gfp_t gfp, size_t *mapped)
>> +{
>> +    size_t xen_pg_count = (pgsize / XEN_PAGE_SIZE) * pgcount;
>
> You only advertise the one page size, so you'll always get that back,
> and this seems a bit redundant.
>

Yes

>> +    struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
>> +    struct pv_iommu_op op = {
>> +        .subop_id = IOMMUOP_map_pages,
>> +        .flags = 0,
>> +        .ctx_no = dom->ctx_no
>> +    };
>> +    /* NOTE: paddr is actually bound to pfn, not gfn */
>> +    uint64_t pfn = addr_to_pfn(paddr);
>> +    uint64_t dfn = addr_to_pfn(iova);
>> +    int ret = 0;
>> +
>> +    //pr_info("Mapping to %lx %zu %zu paddr %x\n", iova, pgsize,
>> pgcount, paddr);
>
> Please try to clean up debugging leftovers before posting the patch (but
> also note that there are already tracepoints and debug messages which
> can be enabled in the core code to give visibility of most of this.)
>

Yes

>> +
>> +    if (prot & IOMMU_READ)
>> +        op.flags |= IOMMU_OP_readable;
>> +
>> +    if (prot & IOMMU_WRITE)
>> +        op.flags |= IOMMU_OP_writeable;
>> +
>> +    while (xen_pg_count) {
>
> Unless you're super-concerned about performance already, you don't
> really need to worry about looping here - you can happily return short
> as long as you've mapped *something*, and the core code will call you
> back again with the remainder. But it also doesn't complicate things
> *too* much as it is, so feel free to leave it in if you want to.
>

Ok
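
For reference, here is the simpler shape I would move to (a sketch only,
untested, reusing the same pv_iommu_op interface quoted above): map at
most one hypercall's worth and return short, letting the core code call
back with the remainder. It also drops the redundant page-size
conversion, since XEN_PAGE_SIZE is the only advertised size:

static int xen_iommu_map_pages(struct iommu_domain *domain, unsigned long iova,
			       phys_addr_t paddr, size_t pgsize, size_t pgcount,
			       int prot, gfp_t gfp, size_t *mapped)
{
	struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
	struct pv_iommu_op op = {
		.subop_id = IOMMUOP_map_pages,
		.ctx_no = dom->ctx_no,
	};
	int ret;

	if (prot & IOMMU_READ)
		op.flags |= IOMMU_OP_readable;

	if (prot & IOMMU_WRITE)
		op.flags |= IOMMU_OP_writeable;

	/* pgsize == XEN_PAGE_SIZE is the only advertised size, so pgcount
	 * is already a count of Xen pages. Submit a single batch and
	 * return short; the core code retries with the remainder. */
	op.map_pages.gfn = pfn_to_gfn(addr_to_pfn(paddr));
	op.map_pages.dfn = addr_to_pfn(iova);
	op.map_pages.nr_pages = min_t(size_t, pgcount, max_nr_pages);

	ret = HYPERVISOR_iommu_op(&op);

	if (mapped)
		*mapped = (size_t)op.map_pages.mapped * XEN_PAGE_SIZE;

	return ret;
}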

>> +        size_t to_map = min(xen_pg_count, max_nr_pages);
>> +        uint64_t gfn = pfn_to_gfn(pfn);
>> +
>> +        //pr_info("Mapping %lx-%lx at %lx-%lx\n", gfn, gfn + to_map -
>> 1, dfn, dfn + to_map - 1);
>> +
>> +        op.map_pages.gfn = gfn;
>> +        op.map_pages.dfn = dfn;
>> +
>> +        op.map_pages.nr_pages = to_map;
>> +
>> +        ret = HYPERVISOR_iommu_op(&op);
>> +
>> +        //pr_info("map_pages.mapped = %u\n", op.map_pages.mapped);
>> +
>> +        if (mapped)
>> +            *mapped += XEN_PAGE_SIZE * op.map_pages.mapped;
>> +
>> +        if (ret)
>> +            break;
>> +
>> +        xen_pg_count -= to_map;
>> +
>> +        pfn += to_map;
>> +        dfn += to_map;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static size_t xen_iommu_unmap_pages(struct iommu_domain *domain,
>> unsigned long iova,
>> +                    size_t pgsize, size_t pgcount,
>> +                    struct iommu_iotlb_gather *iotlb_gather)
>> +{
>> +    size_t xen_pg_count = (pgsize / XEN_PAGE_SIZE) * pgcount;
>> +    struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
>> +    struct pv_iommu_op op = {
>> +        .subop_id = IOMMUOP_unmap_pages,
>> +        .ctx_no = dom->ctx_no,
>> +        .flags = 0,
>> +    };
>> +    uint64_t dfn = addr_to_pfn(iova);
>> +    int ret = 0;
>> +
>> +    if (WARN(!dom->ctx_no, "Tried to unmap page to default context"))
>> +        return -EINVAL;
>
> This would go hilariously wrong... the return value here is bytes
> successfully unmapped, a total failure should return 0.

This check is not useful, as the core code is never going to call unmap
on this domain anyway; I will drop it.

>
>> +    while (xen_pg_count) {
>> +        size_t to_unmap = min(xen_pg_count, max_nr_pages);
>> +
>> +        //pr_info("Unmapping %lx-%lx\n", dfn, dfn + to_unmap - 1);
>> +
>> +        op.unmap_pages.dfn = dfn;
>> +        op.unmap_pages.nr_pages = to_unmap;
>> +
>> +        ret = HYPERVISOR_iommu_op(&op);
>> +
>> +        if (ret)
>> +            pr_warn("Unmap failure (%lx-%lx)\n", dfn, dfn + to_unmap
>> - 1);
>
> But then how would it ever happen anyway? Unmap is a domain op, so a
> domain which doesn't allow unmapping shouldn't offer it in the first
> place...

Unmap failing should be exceptional, but it is possible, e.g. with
transparent superpages (as the Xen IOMMU drivers implement them). The
Xen drivers fold appropriately contiguous mappings into superpage
entries to optimize memory usage and IOTLB pressure. However, if you
unmap in the middle of a region covered by a superpage entry, that
entry is no longer valid, and you need to allocate and fill the lower
levels, which can fail when memory is lacking.

> In this case I'd argue that you really *do* want to return short, in the
> hope of propagating the error back up and letting the caller know the
> address space is now messed up before things start blowing up even more
> if they keep going and subsequently try to map new pages into
> not-actually-unmapped VAs.

While mapping on top of another mapping is fine for us (it just
overrides the previous mapping), I definitely agree that having the
address space messed up is not good.
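
So for the next revision I would stop at the first failure and return
the bytes actually unmapped, so the shortfall propagates up. A sketch
(untested, same interface assumptions as above):

static size_t xen_iommu_unmap_pages(struct iommu_domain *domain,
				    unsigned long iova, size_t pgsize,
				    size_t pgcount,
				    struct iommu_iotlb_gather *iotlb_gather)
{
	struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
	struct pv_iommu_op op = {
		.subop_id = IOMMUOP_unmap_pages,
		.ctx_no = dom->ctx_no,
	};
	size_t remaining = pgcount; /* pgsize == XEN_PAGE_SIZE */
	uint64_t dfn = addr_to_pfn(iova);
	size_t unmapped = 0;

	while (remaining) {
		size_t to_unmap = min_t(size_t, remaining, max_nr_pages);

		op.unmap_pages.dfn = dfn;
		op.unmap_pages.nr_pages = to_unmap;

		/* e.g. a superpage split failing for lack of memory:
		 * return short so the caller sees the failure. */
		if (HYPERVISOR_iommu_op(&op))
			break;

		unmapped += to_unmap;
		remaining -= to_unmap;
		dfn += to_unmap;
	}

	return unmapped * XEN_PAGE_SIZE;
}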

>
>> +
>> +        xen_pg_count -= to_unmap;
>> +
>> +        dfn += to_unmap;
>> +    }
>> +
>> +    return pgcount * pgsize;
>> +}
>> +
>> +int xen_iommu_attach_dev(struct iommu_domain *domain, struct device
>> *dev)
>> +{
>> +    struct pci_dev *pdev;
>> +    struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
>> +    struct pv_iommu_op op = {
>> +        .subop_id = IOMMUOP_reattach_device,
>> +        .flags = 0,
>> +        .ctx_no = dom->ctx_no,
>> +    };
>> +
>> +    pdev = to_pci_dev(dev);
>> +
>> +    op.reattach_device.dev.seg = pci_domain_nr(pdev->bus);
>> +    op.reattach_device.dev.bus = pdev->bus->number;
>> +    op.reattach_device.dev.devfn = pdev->devfn;
>> +
>> +    return HYPERVISOR_iommu_op(&op);
>> +}
>> +
>> +static void xen_iommu_free(struct iommu_domain *domain)
>> +{
>> +    int ret;
>> +    struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
>> +
>> +    if (dom->ctx_no != 0) {
>
> Much like unmap above, this not being true would imply that someone's
> managed to go round the back of the core code to get the .free op from a
> validly-allocated domain and then pass something other than that domain
> to it. Personally I'd consider that a level of brokenness that's not
> even worth trying to consider at all, but if you want to go as far as
> determining that you *have* clearly been given something you couldn't
> have allocated, then trying to kfree() it probably isn't wise either.
>

Yes
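
For the next revision the check would simply go away; a sketch of the
shape I have in mind (the context-destruction subop name below is a
placeholder, since that hunk isn't quoted here):

static void xen_iommu_free(struct iommu_domain *domain)
{
	struct xen_iommu_domain *dom = to_xen_iommu_domain(domain);
	struct pv_iommu_op op = {
		.subop_id = IOMMUOP_destroy_context, /* placeholder name */
		.ctx_no = dom->ctx_no,
	};

	/* Anything handed to .free came from
	 * xen_iommu_domain_alloc_paging(), so ctx_no != 0 holds by
	 * construction and isn't re-checked. */
	if (HYPERVISOR_iommu_op(&op))
		pr_err("Unable to destroy context %u\n", dom->ctx_no);

	kfree(dom);
}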

>> +
>> +static int default_domain_attach_dev(struct iommu_domain *domain,
>> +                     struct device *dev)
>> +{
>> +    int ret;
>> +    struct pci_dev *pdev;
>> +    struct pv_iommu_op op = {
>> +        .subop_id = IOMMUOP_reattach_device,
>> +        .flags = 0,
>> +        .ctx_no = 0 /* reattach device back to default context */
>> +    };
>> +
>> +    pdev = to_pci_dev(dev);
>> +
>> +    op.reattach_device.dev.seg = pci_domain_nr(pdev->bus);
>> +    op.reattach_device.dev.bus = pdev->bus->number;
>> +    op.reattach_device.dev.devfn = pdev->devfn;
>> +
>> +    ret = HYPERVISOR_iommu_op(&op);
>> +
>> +    if (ret)
>> +        pr_warn("Unable to release device %p\n",
>> &op.reattach_device.dev);
>> +
>> +    return ret;
>> +}
>> +
>> +static struct iommu_domain default_domain = {
>> +    .ops = &(const struct iommu_domain_ops){
>> +        .attach_dev = default_domain_attach_dev
>> +    }
>> +};
>
> Looks like you could make it a static xen_iommu_domain and just use the
> normal attach callback? Either way please name it something less
> confusing like xen_iommu_identity_domain - "default" is far too
> overloaded round here already...
>

Yes, although in the future this domain could be either identity or
blocking/paging depending on some upper-level configuration. Should we
then have both identity and blocking domains and only set the relevant
one in iommu_ops, or keep this naming?
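
For the identity case, a sketch of the static-domain approach you
suggest (untested), reusing the normal attach callback with ctx_no 0
meaning the default context:

static struct xen_iommu_domain xen_iommu_identity_domain = {
	.domain = {
		.type = IOMMU_DOMAIN_IDENTITY,
		.ops = &(const struct iommu_domain_ops) {
			.attach_dev = xen_iommu_attach_dev,
		},
	},
	.ctx_no = 0, /* Xen default context */
};

with xen_iommu_ops then pointing .identity_domain and .release_domain
at &xen_iommu_identity_domain.domain.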

>> +static struct iommu_ops xen_iommu_ops = {
>> +    .identity_domain = &default_domain,
>> +    .release_domain = &default_domain,
>> +    .capable = xen_iommu_capable,
>> +    .domain_alloc_paging = xen_iommu_domain_alloc_paging,
>> +    .probe_device = xen_iommu_probe_device,
>> +    .probe_finalize = xen_iommu_probe_finalize,
>> +    .device_group = pci_device_group,
>> +    .get_resv_regions = xen_iommu_get_resv_regions,
>> +    .pgsize_bitmap = XEN_IOMMU_PGSIZES,
>> +    .default_domain_ops = &(const struct iommu_domain_ops) {
>> +        .map_pages = xen_iommu_map_pages,
>> +        .unmap_pages = xen_iommu_unmap_pages,
>> +        .attach_dev = xen_iommu_attach_dev,
>> +        .iova_to_phys = xen_iommu_iova_to_phys,
>> +        .free = xen_iommu_free,
>> +    },
>> +};
>> +
>> +int __init xen_iommu_init(void)
>> +{
>> +    int ret;
>> +    struct pv_iommu_op op = {
>> +        .subop_id = IOMMUOP_query_capabilities
>> +    };
>> +
>> +    if (!xen_domain())
>> +        return -ENODEV;
>> +
>> +    /* Check if iommu_op is supported */
>> +    if (HYPERVISOR_iommu_op(&op) == -ENOSYS)
>> +        return -ENODEV; /* No Xen IOMMU hardware */
>> +
>> +    pr_info("Initialising Xen IOMMU driver\n");
>> +    pr_info("max_nr_pages=%d\n", op.cap.max_nr_pages);
>> +    pr_info("max_ctx_no=%d\n", op.cap.max_ctx_no);
>> +    pr_info("max_iova_addr=%llx\n", op.cap.max_iova_addr);
>> +
>> +    if (op.cap.max_ctx_no == 0) {
>> +        pr_err("Unable to use IOMMU PV driver (no context
>> available)\n");
>> +        return -ENOTSUPP; /* Unable to use IOMMU PV ? */
>> +    }
>> +
>> +    if (xen_domain_type == XEN_PV_DOMAIN)
>> +        /* TODO: In PV domain, due to the existing pfn-gfn mapping we
>> need to
>> +         * consider that under certains circonstances, we have :
>> +         *   pfn_to_gfn(x + 1) != pfn_to_gfn(x) + 1
>> +         *
>> +         * In these cases, we would want to separate the subop into
>> several calls.
>> +         * (only doing the grouped operation when the mapping is
>> actually contigous)
>> +         * Only map operation would be affected, as unmap actually
>> uses dfn which
>> +         * doesn't have this kind of mapping.
>> +         *
>> +         * Force single-page operations to work arround this issue
>> for now.
>> +         */
>> +        max_nr_pages = 1;
>> +    else
>> +        /* With HVM domains, pfn_to_gfn is identity, there is no
>> issue regarding this. */
>> +        max_nr_pages = op.cap.max_nr_pages;
>> +
>> +    max_iova_addr = op.cap.max_iova_addr;
>> +
>> +    spin_lock_init(&lock);
>> +
>> +    ret = iommu_device_sysfs_add(&xen_iommu_device, NULL, NULL,
>> "xen-iommu");
>> +    if (ret) {
>> +        pr_err("Unable to add Xen IOMMU sysfs\n");
>> +        return ret;
>> +    }
>> +
>> +    ret = iommu_device_register(&xen_iommu_device, &xen_iommu_ops,
>> NULL);
>> +    if (ret) {
>> +        pr_err("Unable to register Xen IOMMU device %d\n", ret);
>> +        iommu_device_sysfs_remove(&xen_iommu_device);
>> +        return ret;
>> +    }
>> +
>> +    /* swiotlb is redundant when IOMMU is active. */
>> +    x86_swiotlb_enable = false;
>
> That's not always true, but either way if this is at
> module_init/device_initcall time then it's too late to make any
> difference anyway.
>

Ok
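
On the PV pfn_to_gfn TODO quoted above: instead of forcing
max_nr_pages = 1, one option would be to clamp each batch to the
longest gfn-contiguous run. A sketch with a hypothetical helper
(untested):

/* Length (in pages) of the gfn-contiguous run starting at pfn,
 * capped at max_pages. */
static size_t xen_pv_contig_gfn_run(unsigned long pfn, size_t max_pages)
{
	unsigned long gfn = pfn_to_gfn(pfn);
	size_t n = 1;

	while (n < max_pages && pfn_to_gfn(pfn + n) == gfn + n)
		n++;

	return n;
}

map_pages would then use to_map = xen_pv_contig_gfn_run(pfn, to_map),
so each subop only covers pages whose pfn->gfn mapping is actually
contiguous.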

>> +
>> +    return 0;
>> +}
>> +
>> +void __exit xen_iommu_fini(void)
>> +{
>> +    pr_info("Unregistering Xen IOMMU driver\n");
>> +
>> +    iommu_device_unregister(&xen_iommu_device);
>> +    iommu_device_sysfs_remove(&xen_iommu_device);
>> +}
>
> This is dead code since the Kconfig is only "bool". Either allow it to
> be an actual module (and make sure that works), or drop the pretence
> altogether.
>

Ok, I thought this function was actually a requirement even when not
built as a module.
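
If it stays builtin-only, I would drop xen_iommu_fini() and the
MODULE_* macros altogether and keep only something like:

/* Built-in only: no unload path, so no __exit function. */
device_initcall(xen_iommu_init);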

> Thanks,
> Robin.

Teddy


Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




 

