
Re: [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Tue, 28 Nov 2023 22:24:12 +0000
  • Accept-language: en-US
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Tue, 28 Nov 2023 22:24:27 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Roger,

Thank you for the review.

Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:

> On Thu, Oct 12, 2023 at 10:09:15PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>> 
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>> 
>> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead. To prevent
>> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
>> always lock hwdom first.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>> 
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>> 
>> 4. Do not call process_pending_softirqs with any locks held. For that,
>> unlock prior to the call and re-acquire the locks afterwards. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>> 
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> 6. We are removing multiple ASSERT(pcidevs_locked()) instances because
>> they are too strict now: they should be corrected to
>> ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock)), but the problem
>> is that the mentioned instances do not have access to the domain
>> pointer, and it is not feasible to pass a domain pointer to a function
>> just for debugging purposes.
>> 
>> There is a possible lock inversion in MSI code, as some parts of it
>> acquire pcidevs_lock() while already holding d->pci_lock.
>
> Is this going to be addressed in a further patch?
>

It is actually addressed in v10 of this patch; I just forgot to remove
this sentence from the patch description. My bad.

This was fixed by adding an extra parameter to
allocate_and_map_msi_pirq(), as it is called from two places: from
vpci_msi_enable(), where we already hold d->pci_lock, and from
physdev_map_pirq(), where no locks are taken.

[...]

>> @@ -2908,7 +2908,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int 
>> index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>> +    if ( use_pci_lock )
>> +        pcidevs_lock();
>
> Instead of passing the flag it might be better if the caller can take
> the lock, as to avoid having to pass an extra parameter.
>
> Then we should also assert that either the pcidev_lock or the
> per-domain pci lock is taken?
>

This is a good idea. I'll take the lock in physdev_map_pirq(), as that
is the only place where allocate_and_map_msi_pirq() is called without
holding any lock.

>>      /* Verify or get pirq. */
>>      write_lock(&d->event_lock);
>>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
>> @@ -2924,7 +2925,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int 
>> index, int *pirq_p,
>>  
>>   done:
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>> +    if ( use_pci_lock )
>> +        pcidevs_unlock();
>>      if ( ret )
>>      {
>>          switch ( type )
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 20275260b3..466725d8ca 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>      unsigned int i, mpos;
>>      uint16_t control;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>      pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>  
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>> @@ -988,8 +988,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct 
>> msi_desc **desc,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev )
>>          return -ENODEV;
>>  
>> @@ -1043,8 +1041,6 @@ static int __pci_enable_msix(struct msi_info *msi, 
>> struct msi_desc **desc,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;
>>  
>> @@ -1154,8 +1150,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool 
>> off)
>>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>>                 struct pci_dev *pdev)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -
>
> If you have the pdev in all the above function, you could expand the
> assert to test for pdev->domain->pci_lock?
>

Yes, you are right. This is a leftover from the time when
pci_enable_msi() acquired the pdev pointer by itself.

[...]

>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 767c1ba718..a52e52db96 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        write_lock(&v->domain->pci_lock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>> @@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
>>               * failure.
>>               */
>>              vpci_remove_device(v->vpci.pdev);
>> +        write_unlock(&v->domain->pci_lock);
>>      }
>>  
>>      return false;
>> @@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const 
>> struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->pci_lock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == 
>> -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        read_unlock(&d->pci_lock);
>>          process_pending_softirqs();
>> +        read_lock(&d->pci_lock);
>
> You are asserting the lock is taken in write mode just above the usage
> of read_{un,}lock().  Either the assert is wrong, or the usage of
> read_{un,}lock() is wrong.

Oops, it looks like this is a rebasing artifact. The final version of
the code uses write locks, of course. The patch "vpci/header: handle
p2m range sets per BAR" changes this piece of code too and replaces the
read lock with a write lock. I'll move that change into this patch.

[...]

>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 3bec9a4153..112de56fb3 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>>          return;
>>  
>> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>      const unsigned long *ro_map;
>>      int rc = 0;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t 
>> new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>>      uint32_t data = ~(uint32_t)0;
>> +    rwlock_t *lock;
>>  
>>      if ( !size )
>>      {
>> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, 
>> unsigned int size)
>>       * Find the PCI dev matching the address, which for hwdom also requires
>>       * consulting DomXEN.  Passthrough everything that's not trapped.
>>       */
>> +    lock = &d->pci_lock;
>> +    read_lock(lock);
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>> +    {
>> +        read_unlock(lock);
>> +        lock = &dom_xen->pci_lock;
>> +        read_lock(lock);
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>
> I'm unsure whether devices assigned to dom_xen can change ownership
> after boot, so maybe there's no need for all this lock dance, as the
> device cannot disappear?
>
> Maybe just taking the hardware domain lock is enough to prevent
> concurrent accesses in that case, as the hardware domain is the only
> allowed to access devices owned by dom_xen.

This sounds correct. If there are no objections, I will remove the
extra locking.

-- 
WBR, Volodymyr
