[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()



On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> >     roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> >     'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device committed via non-MSI specific means to send only 3 vectors
out of 4 available why should we expect it to send 4? The probability of
a firmware sending 4/4 vectors in this case is equal to the probability
of sending 5/4 or 16/4, with the very same reason - a bug in the firmware.
Moreover, even vector 4/4 would be unexpected by the device driver, though
it is perfectly within the spec.

As of new device revision or firmware update etc. - it is just yet another
case of device driver vs the firmware match/mismatch. Not including this
change does not help here at all IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

A Intel AHCI chipset requires 16 vectors written to MME while advertises
(via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
in device's fallback to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > 
> > As result of this change, architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@xxxxxxxxxxxxxxx
> > Cc: linux-mips@xxxxxxxxxxxxxx
> > Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> > Cc: linux-s390@xxxxxxxxxxxxxxx
> > Cc: x86@xxxxxxxxxx
> > Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
> > Cc: iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
> > Cc: linux-ide@xxxxxxxxxxxxxxx
> > Cc: linux-pci@xxxxxxxxxxxxxxx
> > Signed-off-by: Alexander Gordeev <agordeev@xxxxxxxxxx>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 
> > ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt 
> > b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, 
> > pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() call allows a device driver to
> > +setup 'nvec_mme' number of multiple MSIs with the PCI function, while
> > +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
> > +in operating system. The MSI specification only allows 'nvec_mme' to be
> > +allocated in powers of two, up to a maximum of 2^5 (32).
> > +
> > +This function could be used when a PCI function is known to send 'nvec'
> > +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
> > +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
> > +do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >     return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int 
> > type)
> >  {
> >     struct msi_desc *entry;
> >     int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int 
> > type)
> >  {
> >     if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >             pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, 
> > int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int 
> > type)
> >  {
> >     return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >     }
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int 
> > type)
> >  {
> >     struct zpci_dev *zdev = get_zdev(pdev);
> >     unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int 
> > type)
> >  {
> >     return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >     chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +                            int nvec, int nvec_mme, int type)
> >  {
> >     struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int 
> > nvec, int type)
> >     return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +                          int nvec, int nvec_mme, int type)
> >  {
> >     struct msi_desc *entry;
> >     int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable 
> > register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >     struct msi_desc *entry;
> >     int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int 
> > nvec)
> >     list_add_tail(&entry->list, &dev->msi_list);
> >  
> >     /* Configure MSI capability structure */
> > -   ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +   ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >     if (ret) {
> >             msi_mask_irq(entry, mask, ~mask);
> >             free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >     if (ret)
> >             return ret;
> >  
> > -   ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +   /* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +   ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >     if (ret)
> >             goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +                           int nvec, int nvec_mme, int type)
> >  {
> >     struct pci_bus *bus;
> >     int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, 
> > int nvec, int type)
> >             if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >                     return -EINVAL;
> >  
> > -   ret = arch_msi_check_device(dev, nvec, type);
> > +   ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >     if (ret)
> >             return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable 
> > register
> > + *
> > + * This function tries to allocate @nvec number of interrupts while setup
> > + * device's Multiple Message Enable register with @nvec_mme interrupts.
> > + * It returns a negative errno if an error occurs. If it succeeds, it 
> > returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt 
> > number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +   int maxvec;
> > +   int rc;
> > +
> > +   if (dev->current_state != PCI_D0)
> > +           return -EINVAL;
> > +
> > +   WARN_ON(!!dev->msi_enabled);
> > +
> > +   /* Check whether driver already requested MSI-X irqs */
> > +   if (dev->msix_enabled) {
> > +           dev_info(&dev->dev, "can't enable MSI "
> > +                    "(MSI-X already enabled)\n");
> > +           return -EINVAL;
> > +   }
> > +
> > +   if (!is_power_of_2(nvec_mme))
> > +           return -EINVAL;
> > +   if (nvec > nvec_mme)
> > +           return -EINVAL;
> > +
> > +   maxvec = pci_msi_vec_count(dev);
> > +   if (maxvec < 0)
> > +           return maxvec;
> > +   else if (nvec_mme > maxvec)
> > +           return -EINVAL;
> > +
> > +   rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +   if (rc < 0)
> > +           return rc;
> > +   else if (rc > 0)
> > +           return -ENOSPC;
> > +
> > +   rc = msi_capability_init(dev, nvec, nvec_mme);
> > +   if (rc < 0)
> > +           return rc;
> > +   else if (rc > 0)
> > +           return -ENOSPC;
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >     struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct 
> > msix_entry *entries, int nvec)
> >     if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >             return -EINVAL;
> >  
> > -   status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +   status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >     if (status)
> >             return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int 
> > minvec, int maxvec)
> >             nvec = maxvec;
> >  
> >     do {
> > -           rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +           rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +                                     PCI_CAP_ID_MSI);
> >             if (rc < 0) {
> >                     return rc;
> >             } else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int 
> > minvec, int maxvec)
> >     } while (rc);
> >  
> >     do {
> > -           rc = msi_capability_init(dev, nvec);
> > +           rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >             if (rc < 0) {
> >                     return rc;
> >             } else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int 
> > type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +                     int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev 
> > *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int 
> > nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >                                    int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards,
Alexander Gordeev
agordeev@xxxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.