Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, e.g. the mlx5
driver. (NVIDIA acquired Mellanox recently.)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently, due to on-die memory constraints, mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM. Moving away from MSI-X's MMIO access model allows them to be
swapped to RAM. The cost is that accessing them for update becomes a
command/response operation, not an MMIO write.

The HW is already swapping the queues that cause the interrupts to
RAM, so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense of scale: a 'working set' for the NIC device can
in some cases be hundreds of megabytes of data. System RAM is used to
store this, and precious on-die memory holds some dynamic active set,
much like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.
If the device has died, the driver has code to detect that and
trigger a PCI function reset, which will definitely stop the
interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command,
spin/sleep until the device returns a response, and then continue on.
If the device doesn't return a response within a time window, trigger
a WQ to do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
return immediately.

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM.

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO-triggered swapping work that are not
available to a true PCI-E device.

Jason