Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.
This I think needs to state the locking order of the per-domain
pci_lock wrt the vpci->lock. AFAICT that's d->pci_lock first, then
vpci->lock.
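IOW, a sketch of the intended nesting (illustrative only, the BAR field
access is just an example):

    /* d->pci_lock always nests outside of the per-device vpci->lock. */
    read_lock(&d->pci_lock);           /* write_lock() on removal paths */
    spin_lock(&pdev->vpci->lock);
    uint64_t addr = pdev->vpci->header.bars[0].addr;
    spin_unlock(&pdev->vpci->lock);
    read_unlock(&d->pci_lock);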
> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
>
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
>
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
>
> 4. Do not call process_pending_softirqs with any locks held. For that,
> unlock prior to the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
> - in apply_map because of the context it is called (no race condition
> possible)
> - for MSI/MSI-X debug code because it is called at the end of
> pdev->vpci access and no further access to pdev->vpci is made
I assume that's vpci_msix_arch_print(): there are further accesses to
pdev->vpci, but those use the msix local variable, which holds a copy
of the pointer in pdev->vpci->msix, so that last sentence is not true
I'm afraid.
However the code already tries to cater for the pdev going away, and
hence it's IMO fine. IOW: your change doesn't make this any better or
worse.
>
> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidev's lock.
I'm confused by this addition, the more so given that it's not used
anywhere. Can you defer the addition until the patch that makes use of
it?
>
> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
>
> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@xxxxxxxx>
>
> ---
>
> Changes in v8:
> - changed d->vpci_lock to d->pci_lock
> - introducing d->pci_lock in a separate patch
> - extended locked region in vpci_process_pending
> - removed pcidevs_lock in vpci_dump_msi()
> - removed some changes as they are not needed with
> the new locking scheme
> - added handling for hwdom && dom_xen case
> ---
> xen/arch/x86/hvm/vmsi.c | 4 +++
> xen/drivers/passthrough/pci.c | 7 +++++
> xen/drivers/vpci/header.c | 18 ++++++++++++
> xen/drivers/vpci/msi.c | 14 ++++++++--
> xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
> xen/drivers/vpci/vpci.c | 46 +++++++++++++++++++++++++++++--
> xen/include/xen/pci.h | 1 +
> 7 files changed, 129 insertions(+), 13 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..8c1bd66b9c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> {
> unsigned int i;
>
> + ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
> for ( i = 0; i < msix->max_entries; i++ )
> {
> const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> struct pci_dev *pdev = msix->pdev;
>
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&pdev->domain->pci_lock);
> process_pending_softirqs();
> + read_lock(&pdev->domain->pci_lock);
This should be a read_trylock(), much like the spin_trylock() below.
> /* NB: we assume that pdev cannot go away for an alive domain. */
> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
> return -EBUSY;
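Something along these lines (sketch only; callers would then need to
know that both locks have been dropped when -EBUSY is returned):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);

    process_pending_softirqs();

    if ( !read_trylock(&pdev->domain->pci_lock) )
        return -EBUSY;

    /* NB: we assume that pdev cannot go away for an alive domain. */
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
    {
        read_unlock(&pdev->domain->pci_lock);
        return -EBUSY;
    }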
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 5b4632ead2..6f8692cd9c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
> spin_lock_recursive(&_pcidevs_lock);
> }
>
> +int pcidevs_trylock(void)
> +{
> + return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
> void pcidevs_unlock(void)
> {
> spin_unlock_recursive(&_pcidevs_lock);
> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
> } while ( devfn != pdev->devfn &&
> PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>
> + write_lock(&ctxt->d->pci_lock);
> err = vpci_add_handlers(pdev);
> + write_unlock(&ctxt->d->pci_lock);
> if ( err )
> printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
> ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index b41556d007..2780fcae72 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
> if ( rc == -ERESTART )
> return true;
>
> + write_lock(&v->domain->pci_lock);
> spin_lock(&v->vpci.pdev->vpci->lock);
> /* Disable memory decoding unconditionally on failure. */
> modify_decoding(v->vpci.pdev,
> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
> * failure.
> */
> vpci_remove_device(v->vpci.pdev);
> + write_unlock(&v->domain->pci_lock);
> }
The handling in vpci_process_pending() wrt vpci_remove_device() is
racy and will need some thinking to get it solved. Your change
doesn't make it any worse, but I would also be fine with adding a note
in the commit message that vpci_process_pending() is not adjusted to
use the new lock because it needs to be reworked first in order to be
safe against a concurrent vpci_remove_device() call.
>
> return false;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> struct map_data data = { .d = d, .map = true };
> int rc;
>
> + ASSERT(rw_is_locked(&d->pci_lock));
> +
> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> + {
> + /*
> + * It's safe to drop and reacquire the lock in this context
> + * without risking pdev disappearing because devices cannot be
> + * removed until the initial domain has been started.
> + */
> + read_unlock(&d->pci_lock);
> process_pending_softirqs();
> + read_lock(&d->pci_lock);
> + }
Since this is init only code you could likely forego the usage of the
locks, but I guess that's more churn than just using them. In any
case, as this gets called from modify_bars() the locks need to be
dropped/taken in write mode (see comment below).
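I.e. (sketch, assuming modify_bars() ends up holding the lock in write
mode):

    write_unlock(&d->pci_lock);
    process_pending_softirqs();
    write_lock(&d->pci_lock);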
> rangeset_destroy(mem);
> if ( !rc )
> modify_decoding(pdev, cmd, false);
> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> unsigned int i;
> int rc;
>
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
The lock here needs to be taken in write mode I think, so the code can
safely iterate over the contents of each pdev->vpci assigned to the
domain.
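I.e. (sketch):

    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));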
> +
> if ( !mem )
> return -ENOMEM;
>
> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
> struct vpci_bar *bars = header->bars;
> int rc;
>
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
> {
> case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e63152c224 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
> uint16_t control;
> int ret;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
I'm confused by the difference in lock requirements between
init_bars() and init_msi(). In the former you assert for the lock
being taken in read mode, while the latter asserts for write mode.
We want to do initialization in write mode, so that modify_bars()
called by init_bars() has exclusive access to the contents of
pdev->vpci.
> +
> if ( !pos )
> return 0;
>
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>
> void vpci_dump_msi(void)
> {
> - const struct domain *d;
> + struct domain *d;
>
> rcu_read_lock(&domlist_read_lock);
> for_each_domain ( d )
> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>
> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>
> + if ( !read_trylock(&d->pci_lock) )
> + continue;
> +
> for_each_pdev ( d, pdev )
> {
> const struct vpci_msi *msi;
> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
> * holding the lock.
> */
> printk("unable to print all MSI-X entries: %d\n", rc);
> - process_pending_softirqs();
> - continue;
> + goto pdev_done;
> }
> }
>
> spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> + read_unlock(&d->pci_lock);
> process_pending_softirqs();
> + read_lock(&d->pci_lock);
read_trylock().
This is not very safe, as the list could be modified while the lock is
dropped, but it's a debug key handler so I'm not very concerned.
However we should at least add a comment that this relies on the list
not being altered while the lock is dropped.
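Something like (sketch; the bail-out path must then skip the final
read_unlock() for the domain, the label below is hypothetical):

    read_unlock(&d->pci_lock);
    process_pending_softirqs();
    /*
     * Relies on the pdev list not being altered while the lock is
     * dropped; tolerable for a debug key handler.
     */
    if ( !read_trylock(&d->pci_lock) )
        goto domain_done;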
> }
> + read_unlock(&d->pci_lock);
> }
> rcu_read_unlock(&domlist_read_lock);
> }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..9481274579 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
> {
> struct vpci_msix *msix;
>
> + ASSERT(rw_is_locked(&d->pci_lock));
Hm, here you are iterating over pdev->vpci->header.bars for multiple
devices, so I think in addition to the pci_lock in read mode we should
also take the vpci->lock for each pdev.
I think I would like to rework msix_find() so it's msix_get() and
returns with the appropriate vpci->lock taken. Anyway, that's for a
different patch, the usage of the lock in read mode seems correct,
albeit I might want to move the read_lock() call inside of msix_get()
in the future.
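FWIW, the shape I have in mind (hypothetical sketch, reusing the
current msix_find() body):

    static struct vpci_msix *msix_get(const struct domain *d,
                                      unsigned long addr)
    {
        struct vpci_msix *msix;

        ASSERT(rw_is_locked(&d->pci_lock));

        list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
        {
            const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
            unsigned int i;

            for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
                if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
                     VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
                {
                    /* Return with the per-device vpci->lock held. */
                    spin_lock(&msix->pdev->vpci->lock);
                    return msix;
                }
        }

        return NULL;
    }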
> +
> list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
> {
> const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>
> static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
> {
> - return !!msix_find(v->domain, addr);
> + int rc;
> +
> + read_lock(&v->domain->pci_lock);
> + rc = !!msix_find(v->domain, addr);
> + read_unlock(&v->domain->pci_lock);
> +
> + return rc;
> }
>
> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
> static int cf_check msix_read(
> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
> {
> - const struct domain *d = v->domain;
> - struct vpci_msix *msix = msix_find(d, addr);
> + struct domain *d = v->domain;
> + struct vpci_msix *msix;
> const struct vpci_msix_entry *entry;
> unsigned int offset;
>
> *data = ~0ul;
>
> + read_lock(&d->pci_lock);
> +
> + msix = msix_find(d, addr);
> if ( !msix )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_RETRY;
> + }
>
> if ( adjacent_handle(msix, addr) )
> - return adjacent_read(d, msix, addr, len, data);
> + {
> + int rc = adjacent_read(d, msix, addr, len, data);
Nit: missing newline (here and below).
> + read_unlock(&d->pci_lock);
> + return rc;
> + }
>
> if ( !access_allowed(msix->pdev, addr, len) )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_OKAY;
> + }
>
> spin_lock(&msix->pdev->vpci->lock);
> entry = get_entry(msix, addr);
> @@ -404,6 +425,7 @@ static int cf_check msix_read(
> break;
> }
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&d->pci_lock);
>
> return X86EMUL_OKAY;
> }
> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
> static int cf_check msix_write(
> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
> {
> - const struct domain *d = v->domain;
> - struct vpci_msix *msix = msix_find(d, addr);
> + struct domain *d = v->domain;
> + struct vpci_msix *msix;
> struct vpci_msix_entry *entry;
> unsigned int offset;
>
> + read_lock(&d->pci_lock);
> +
> + msix = msix_find(d, addr);
> if ( !msix )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_RETRY;
> + }
>
> if ( adjacent_handle(msix, addr) )
> - return adjacent_write(d, msix, addr, len, data);
> + {
> + int rc = adjacent_write(d, msix, addr, len, data);
> + read_unlock(&d->pci_lock);
> + return rc;
> + }
>
> if ( !access_allowed(msix->pdev, addr, len) )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_OKAY;
> + }
>
> spin_lock(&msix->pdev->vpci->lock);
> entry = get_entry(msix, addr);
> @@ -579,6 +614,7 @@ static int cf_check msix_write(
> break;
> }
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&d->pci_lock);
>
> return X86EMUL_OKAY;
> }
> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
> struct vpci_msix *msix;
> int rc;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
> PCI_CAP_ID_MSIX);
> if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..f22cbf2112 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>
> void vpci_remove_device(struct pci_dev *pdev)
> {
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> if ( !has_vpci(pdev->domain) || !pdev->vpci )
> return;
>
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
> const unsigned long *ro_map;
> int rc = 0;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> if ( !has_vpci(pdev->domain) )
> return 0;
>
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>
> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> uint32_t data = ~(uint32_t)0;
> + rwlock_t *lock;
>
> if ( !size )
> {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> * Find the PCI dev matching the address, which for hwdom also requires
> * consulting DomXEN. Passthrough everything that's not trapped.
> */
> + lock = &d->pci_lock;
> + read_lock(lock);
> pdev = pci_get_pdev(d, sbdf);
> if ( !pdev && is_hardware_domain(d) )
> + {
> + read_unlock(lock);
> + lock = &dom_xen->pci_lock;
> + read_lock(lock);
> pdev = pci_get_pdev(dom_xen, sbdf);
> + }
> if ( !pdev || !pdev->vpci )
> + {
> + read_unlock(lock);
> return vpci_read_hw(sbdf, reg, size);
> + }
>
> spin_lock(&pdev->vpci->lock);
>
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> ASSERT(data_offset < size);
> }
> spin_unlock(&pdev->vpci->lock);
> + read_unlock(lock);
>
> if ( data_offset < size )
> {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
> r->private);
> }
>
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> + ASSERT(rw_is_locked(&d->pci_lock));
> +
> + if ( is_hardware_domain(d) )
> + {
> + ASSERT(rw_is_locked(&dom_xen->pci_lock));
> + read_unlock(&dom_xen->pci_lock);
> + }
> + read_unlock(&d->pci_lock);
> +}
> +
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> uint32_t data)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>
> /*
> * Find the PCI dev matching the address, which for hwdom also requires
> - * consulting DomXEN. Passthrough everything that's not trapped.
> + * consulting DomXEN. Passthrough everything that's not trapped.
> + * If this is hwdom, we need to hold locks for both domain in case if
> + * modify_bars is called()
Typo: the () wants to be at the end of modify_bars().
> */
> + read_lock(&d->pci_lock);
> +
> + /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> + if ( is_hardware_domain(d) )
> + read_lock(&dom_xen->pci_lock);
For modify_bars() we also want the locks to be in write mode (at least
the hw one), so that the position of the BARs can't be changed while
modify_bars() is iterating over them.
Is this something that will be done in a followup change?
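What I have in mind is something along these lines (coarse sketch
taking write mode unconditionally; the unlock helper would need
matching changes):

    /* Write mode so modify_bars() can iterate over all pdevs. */
    write_lock(&d->pci_lock);
    if ( is_hardware_domain(d) )
        write_lock(&dom_xen->pci_lock);   /* still always second */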
> +
> pdev = pci_get_pdev(d, sbdf);
> if ( !pdev && is_hardware_domain(d) )
> pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>
> if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
> vpci_write_hw(sbdf, reg, size, data);
> +
> + unlock_locks(d);
> return;
> }
>
> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> ASSERT(data_offset < size);
> }
> spin_unlock(&pdev->vpci->lock);
> + unlock_locks(d);
There's one issue here: some handlers will call pcidevs_lock(), which
will result in a lock order inversion, as in the previous patch we
agreed that the locking order was pcidevs_lock first, d->pci_lock
after.
For example the MSI control_write() handler will call
vpci_msi_arch_enable() which takes the pcidevs lock. I think I will
have to look into using a dedicated lock for MSI related handling, as
that's the only place where I think we have this pattern of taking the
pcidevs_lock after the d->pci_lock.
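To spell out the inversion (simplified, hypothetical call chain):

    vpci_write()                    takes d->pci_lock (read)
      control_write()               [vPCI MSI control register handler]
        vpci_msi_arch_enable()      takes pcidevs_lock()

i.e. the reverse of the agreed pcidevs_lock -> d->pci_lock order.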
Thanks, Roger.