[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] vpci: introduce per-domain lock to protect vpci structure
Hi, Roger! On 10.02.22 18:16, Roger Pau Monné wrote: > On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx> >> >> Introduce a per-domain read/write lock to check whether vpci is present, >> so we are sure there are no accesses to the contents of the vpci struct >> if not. This lock can be used (and in a few cases is used right away) >> so that vpci removal can be performed while holding the lock in write >> mode. Previously such removal could race with vpci_read for example. > Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > pci_remove_device, and likely when vPCI gets also used in > {de}assign_device I think. Yes, this is indeed an issue, but I was not trying to solve it in context of vPCI locking yet. I think we should discuss how do we approach pdev locking, so I can create a patch for that. that being said, I would like not to solve pdev in this patch yet ...I do understand we do want to avoid that, but at the moment a single reliable way for making sure pdev is alive seems to be pcidevs_lock.... > >> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure >> from being removed. >> >> 2. Writing the command register and ROM BAR register may trigger >> modify_bars to run, which in turn may access multiple pdevs while >> checking for the existing BAR's overlap. The overlapping check, if done >> under the read lock, requires vpci->lock to be acquired on both devices >> being compared, which may produce a deadlock. It is not possible to >> upgrade read lock to write lock in such a case. So, in order to prevent >> the deadlock, check which registers are going to be written and acquire >> the lock in the appropriate mode from the beginning. >> >> All other code, which doesn't lead to pdev->vpci destruction and does not >> access multiple pdevs at the same time, can still use a combination of the >> read lock and pdev->vpci->lock. >> >> 3. Optimize if ROM BAR write lock required detection by caching offset >> of the ROM BAR register in vpci->header->rom_reg which depends on >> header's type. >> >> 4. Reduce locked region in vpci_remove_device as it is now possible >> to set pdev->vpci to NULL early right after the write lock is acquired. >> >> 5. Reduce locked region in vpci_add_handlers as it is possible to >> initialize many more fields of the struct vpci before assigning it to >> pdev->vpci. >> >> 6. vpci_{add|remove}_register are required to be called with the write lock >> held, but it is not feasible to add an assert there as it requires >> struct domain to be passed for that. So, add a comment about this requirement >> to these and other functions with the equivalent constraints. >> >> 7. Drop const qualifier where the new rwlock is used and this is appropriate. >> >> 8. This is based on the discussion at [1]. >> >> [1] >> https://urldefense.com/v3/__https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/__;!!GF_29dbcQIUBPA!gObSySzN7s6zSKrcpSEi6vw18fRPls157cuRoqq4KDd7Ic_Nvh_cFlyVXPRpEWBkI38pgsvvfg$ >> [lore[.]kernel[.]org] >> >> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> >> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx> >> >> --- >> This was checked on x86: with and without PVH Dom0. >> --- >> xen/arch/x86/hvm/vmsi.c | 2 + >> xen/common/domain.c | 3 + >> xen/drivers/vpci/header.c | 8 +++ >> xen/drivers/vpci/msi.c | 8 ++- >> xen/drivers/vpci/msix.c | 40 +++++++++++-- >> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++---------- >> xen/include/xen/sched.h | 3 + >> xen/include/xen/vpci.h | 2 + >> 8 files changed, 146 insertions(+), 34 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c >> index 13e2a190b439..351cb968a423 100644 >> --- a/xen/arch/x86/hvm/vmsi.c >> +++ b/xen/arch/x86/hvm/vmsi.c >> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) >> { >> unsigned int i; >> >> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock)); > ^ no need for the double negation. Ok, will update all asserts which use !! > > Also this asserts that the lock is taken, but could be by a different > pCPU. I guess it's better than nothing. Fair enough. Do you still want the asserts or should I remove them? > >> + >> for ( i = 0; i < msix->max_entries; i++ ) >> { >> const struct vpci_msix_entry *entry = &msix->entries[i]; > Since this function is now called with the per-domain rwlock read > locked it's likely not appropriate to call process_pending_softirqs > while holding such lock (check below). You are right, as it is possible that: process_pending_softirqs -> vpci_process_pending -> read_lock Even more, vpci_process_pending may also read_unlock -> vpci_remove_device -> write_lock in its error path. So, any invocation of process_pending_softirqs must not hold d->vpci_rwlock at least. And also we need to check that pdev->vpci was not removed in between or *re-created* > > We will likely need to re-iterate over the list of pdevs assigned to > the domain and assert that the pdev is still assigned to the same > domain. So, do you mean a pattern like the below should be used at all places where we need to call process_pending_softirqs? read_unlock process_pending_softirqs read_lock pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) <continue processing> > >> diff --git a/xen/common/domain.c b/xen/common/domain.c >> index 2048ebad86ff..10558c22285d 100644 >> --- a/xen/common/domain.c >> +++ b/xen/common/domain.c >> @@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid, >> >> #ifdef CONFIG_HAS_PCI >> INIT_LIST_HEAD(&d->pdev_list); >> +#ifdef CONFIG_HAS_VPCI >> + rwlock_init(&d->vpci_rwlock); >> +#endif >> #endif >> >> /* All error paths can depend on the above setup. */ >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >> index 40ff79c33f8f..9e2aeb2055c9 100644 >> --- a/xen/drivers/vpci/header.c >> +++ b/xen/drivers/vpci/header.c >> @@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v) >> if ( rc == -ERESTART ) >> return true; >> >> + read_lock(&v->domain->vpci_rwlock); >> spin_lock(&v->vpci.pdev->vpci->lock); >> /* Disable memory decoding unconditionally on failure. */ >> modify_decoding(v->vpci.pdev, >> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : >> v->vpci.cmd, >> !rc && v->vpci.rom_only); >> spin_unlock(&v->vpci.pdev->vpci->lock); >> + read_unlock(&v->domain->vpci_rwlock); >> >> rangeset_destroy(v->vpci.mem); >> v->vpci.mem = NULL; >> @@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev >> *pdev, >> raise_softirq(SCHEDULE_SOFTIRQ); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. */ >> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool >> rom_only) >> { >> struct vpci_header *header = &pdev->vpci->header; >> @@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev) >> struct vpci_bar *bars = header->bars; >> int rc; >> >> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); >> + >> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f ) >> { >> case PCI_HEADER_TYPE_NORMAL: >> @@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev) >> { >> struct vpci_bar *rom = &header->bars[num_bars]; >> >> + header->rom_reg = rom_reg; >> rom->type = VPCI_BAR_ROM; >> rom->size = size; >> rom->addr = addr; >> @@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev) >> if ( rc ) >> rom->type = VPCI_BAR_EMPTY; > You can also set 'rom_reg = ~0' here. Or move the setting of rom_reg > after the handler has been successfully installed. > > I think it would be easier to just signal no ROM BAR with rom_reg == > 0. There's no header where the ROM BAR is at offset 0. That way you > will only have to set rom_reg on the successful path, but you don't > need to care about the case where there's no ROM BAR. Yes, it is possible > >> } >> + else >> + header->rom_reg = ~(unsigned int)0; > No need for the cast. Ok > >> >> return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0; >> } >> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c >> index 5757a7aed20f..5df3dfa8243c 100644 >> --- a/xen/drivers/vpci/msi.c >> +++ b/xen/drivers/vpci/msi.c >> @@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev) >> uint16_t control; >> int ret; >> >> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); >> + >> if ( !pos ) >> return 0; >> >> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW); >> >> void vpci_dump_msi(void) >> { >> - const struct domain *d; >> + struct domain *d; >> >> rcu_read_lock(&domlist_read_lock); >> for_each_domain ( d ) >> @@ -277,6 +279,9 @@ void vpci_dump_msi(void) >> >> printk("vPCI MSI/MSI-X d%d\n", d->domain_id); >> >> + if ( !read_trylock(&d->vpci_rwlock) ) >> + continue; >> + >> for_each_pdev ( d, pdev ) >> { >> const struct vpci_msi *msi; >> @@ -326,6 +331,7 @@ void vpci_dump_msi(void) >> spin_unlock(&pdev->vpci->lock); >> process_pending_softirqs(); >> } >> + read_unlock(&d->vpci_rwlock); > Same here, you are calling process_pending_softirqs while holding > vpci_rwlock. Same as above > >> } >> rcu_read_unlock(&domlist_read_lock); >> } >> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c >> index 846f1b8d7038..5296d6025d8e 100644 >> --- a/xen/drivers/vpci/msix.c >> +++ b/xen/drivers/vpci/msix.c >> @@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev, >> unsigned int reg, >> pci_conf_write16(pdev->sbdf, reg, val); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. */ >> static struct vpci_msix *msix_find(const struct domain *d, unsigned long >> addr) >> { >> struct vpci_msix *msix; >> @@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain >> *d, unsigned long addr) >> >> static int msix_accept(struct vcpu *v, unsigned long addr) >> { >> - return !!msix_find(v->domain, addr); >> + int rc; >> + >> + read_lock(&v->domain->vpci_rwlock); >> + rc = !!msix_find(v->domain, addr); >> + read_unlock(&v->domain->vpci_rwlock); > Newline before return. Ok > >> + return rc; >> } >> >> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr, >> index fb0947179b79..16bb3b832e6a 100644 >> --- a/xen/drivers/vpci/vpci.c >> +++ b/xen/drivers/vpci/vpci.c >> @@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev) >> } >> #endif /* __XEN__ */ >> >> -static int vpci_register_cmp(const struct vpci_register *r1, >> - const struct vpci_register *r2) >> +static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, >> + unsigned int r2_offset, unsigned int r2_size) >> { >> /* Return 0 if registers overlap. */ >> - if ( r1->offset < r2->offset + r2->size && >> - r2->offset < r1->offset + r1->size ) >> + if ( r1_offset < r2_offset + r2_size && >> + r2_offset < r1_offset + r1_size ) >> return 0; >> - if ( r1->offset < r2->offset ) >> + if ( r1_offset < r2_offset ) >> return -1; >> - if ( r1->offset > r2->offset ) >> + if ( r1_offset > r2_offset ) >> return 1; >> >> ASSERT_UNREACHABLE(); >> return 0; >> } >> >> +static int vpci_register_cmp(const struct vpci_register *r1, >> + const struct vpci_register *r2) >> +{ >> + return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size); >> +} > Seeing how this gets used below I'm not sure it's very helpful to > reuse vpci_register_cmp, see my comment below. Yes, the only thing which is used is the first check for the overlap > >> + >> /* Dummy hooks, writes are ignored, reads return 1's */ >> static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int >> reg, >> void *data) >> @@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, >> unsigned int reg, >> return pci_conf_read32(pdev->sbdf, reg); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. */ >> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, >> vpci_write_t *write_handler, unsigned int offset, >> unsigned int size, void *data) >> @@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t >> *read_handler, >> r->offset = offset; >> r->private = data; >> >> - spin_lock(&vpci->lock); >> - >> /* The list of handlers must be kept sorted at all times. */ >> list_for_each ( prev, &vpci->handlers ) >> { >> @@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t >> *read_handler, >> break; >> if ( cmp == 0 ) >> { >> - spin_unlock(&vpci->lock); >> xfree(r); >> return -EEXIST; >> } >> } >> >> list_add_tail(&r->node, prev); >> - spin_unlock(&vpci->lock); >> >> return 0; >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. */ >> int vpci_remove_register(struct vpci *vpci, unsigned int offset, >> unsigned int size) >> { >> const struct vpci_register r = { .offset = offset, .size = size }; >> struct vpci_register *rm; >> >> - spin_lock(&vpci->lock); >> list_for_each_entry ( rm, &vpci->handlers, node ) >> { >> int cmp = vpci_register_cmp(&r, rm); >> @@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned >> int offset, >> if ( !cmp && rm->offset == offset && rm->size == size ) >> { >> list_del(&rm->node); >> - spin_unlock(&vpci->lock); >> xfree(rm); >> return 0; >> } >> if ( cmp <= 0 ) >> break; >> } >> - spin_unlock(&vpci->lock); >> >> return -ENOENT; >> } >> @@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t >> new, unsigned int size, >> >> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> { >> - const struct domain *d = current->domain; >> + struct domain *d = current->domain; >> const struct pci_dev *pdev; >> const struct vpci_register *r; >> unsigned int data_offset = 0; >> @@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, >> unsigned int size) >> if ( !pdev ) >> return vpci_read_hw(sbdf, reg, size); >> >> + read_lock(&d->vpci_rwlock); > After taking the domain lock you need to check that pdev->vpci != > NULL, as vpci_remove_device will set pdev->vpci == NULL before > removing the device from the domain. Same applies to vpci_add_handlers > which will be called with the device already added to the domain, but > with pdev->vpci == NULL. Sure, will add > > We would also need some kind of protection arround > pci_get_pdev_by_domain so that devices are not removed (from the > domain) while we are iterating over it. Again, this is a problem. And we need to find a way to solve that > >> spin_lock(&pdev->vpci->lock); >> >> /* Read from the hardware or the emulated register handlers. */ >> @@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, >> unsigned int size) >> ASSERT(data_offset < size); >> } >> spin_unlock(&pdev->vpci->lock); >> + read_unlock(&d->vpci_rwlock); >> >> if ( data_offset < size ) >> { >> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev >> *pdev, >> r->private); >> } >> >> +static bool vpci_header_write_lock(const struct pci_dev *pdev, >> + unsigned int start, unsigned int size) > I think this should live in header.c, for consistency. Ok, will move > > I'm also not sure it's worth adding vpci_offset_cmp: you just need to > do a range overlap check, and that can be easily open coded. It's just > the first 'if' in vpci_register_cmp that you want, the rest of the > code is just adding overhead. Agree > >> +{ >> + /* >> + * Writing the command register and ROM BAR register may trigger >> + * modify_bars to run which in turn may access multiple pdevs while >> + * checking for the existing BAR's overlap. The overlapping check, if >> done >> + * under the read lock, requires vpci->lock to be acquired on both >> devices >> + * being compared, which may produce a deadlock. It is not possible to >> + * upgrade read lock to write lock in such a case. So, in order to >> prevent >> + * the deadlock, check which registers are going to be written and >> acquire >> + * the lock in the appropriate mode from the beginning. >> + */ >> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >> + return true; >> + >> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) > No need for the comparison if rom_reg is unset. Also you can OR both > conditions into a single if. If we open code vpci_offset_cmp with a single if all this is going to be a bit clumsy: if ( r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size ) return 0; This is a single check. Now we need to check two registers with the code above and also check that pdev->vpci->header.rom_reg != 0 I think it would be more readable if we have a tiny helper function static bool vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, unsigned int r2_offset, unsigned int r2_size) { /* Return 0 if registers overlap. */ if ( r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size ) return false; return true; } So, then we can have something like static bool vpci_header_write_lock(const struct pci_dev *pdev, unsigned int start, unsigned int size) { if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) || (pdev->vpci->header.rom_reg && !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4)) ) return true; return false; } > >> + return true; >> + >> + return false; >> +} >> + >> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, >> uint32_t data) >> { >> - const struct domain *d = current->domain; >> + struct domain *d = current->domain; >> const struct pci_dev *pdev; >> const struct vpci_register *r; >> unsigned int data_offset = 0; >> const unsigned long *ro_map = pci_get_ro_map(sbdf.seg); >> + bool write_locked = false; >> >> if ( !size ) >> { >> @@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, >> unsigned int size, >> return; >> } >> >> - spin_lock(&pdev->vpci->lock); >> + if ( vpci_header_write_lock(pdev, reg, size) ) >> + { >> + /* Gain exclusive access to all of the domain pdevs vpci. */ >> + write_lock(&d->vpci_rwlock); >> + write_locked = true; > Here you need to check that pdev->vpci != NULL... Will do > >> + } >> + else >> + { >> + read_lock(&d->vpci_rwlock); > ...and also here. Will do > >> + spin_lock(&pdev->vpci->lock); >> + } >> >> /* Write the value to the hardware or emulated registers. */ >> list_for_each_entry ( r, &pdev->vpci->handlers, node ) >> @@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, >> unsigned int size, >> break; >> ASSERT(data_offset < size); >> } >> - spin_unlock(&pdev->vpci->lock); >> + >> + if ( write_locked ) >> + write_unlock(&d->vpci_rwlock); >> + else >> + { >> + spin_unlock(&pdev->vpci->lock); >> + read_unlock(&d->vpci_rwlock); >> + } >> >> if ( data_offset < size ) >> /* Tailing gap, write the remaining. */ >> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >> index 37f78cc4c4c9..ecd34481a7af 100644 >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -444,6 +444,9 @@ struct domain >> >> #ifdef CONFIG_HAS_PCI >> struct list_head pdev_list; >> +#ifdef CONFIG_HAS_VPCI >> + rwlock_t vpci_rwlock; >> +#endif >> #endif >> >> #ifdef CONFIG_HAS_PASSTHROUGH >> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h >> index e8ac1eb39513..e19e462ee5cb 100644 >> --- a/xen/include/xen/vpci.h >> +++ b/xen/include/xen/vpci.h >> @@ -88,6 +88,8 @@ struct vpci { >> * is mapped into guest p2m) if there's a ROM BAR on the device. >> */ >> bool rom_enabled : 1; >> + /* Offset to the ROM BAR register if any. */ >> + unsigned int rom_reg; > Could you please place this field after 'type'? That will avoid some > padding in the structure. Sure > > Thanks, Roger. Thank you, Oleksandr
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |