
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci


  • To: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 8 Feb 2022 11:50:03 +0100
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "paul@xxxxxxx" <paul@xxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 08 Feb 2022 10:50:20 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
> >
> > On 07.02.22 18:37, Jan Beulich wrote:
> >> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
> >>> On 07.02.22 18:15, Jan Beulich wrote:
> >>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
> >>>>> On 07.02.22 17:26, Jan Beulich wrote:
> >>>>>> 1b. Make vpci_write use write lock for writes to command register and 
> >>>>>> BARs
> >>>>>> only; keep using the read lock for all other writes.
> >>>>> I am not quite sure how to do that. Do you mean something like:
> >>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>>>                     uint32_t data)
> >>>>> [snip]
> >>>>>         list_for_each_entry ( r, &pdev->vpci->handlers, node )
> >>>>> {
> >>>>> [snip]
> >>>>>         if ( r->needs_write_lock )
> >>>>>             write_lock(d->vpci_lock)
> >>>>>         else
> >>>>>             read_lock(d->vpci_lock)
> >>>>> ....
> >>>>>
> >>>>> And provide rw as an argument to:
> >>>>>
> >>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
> >>>>>                           vpci_write_t *write_handler, unsigned int offset,
> >>>>>                           unsigned int size, void *data,
> >>>>>                           --->>> bool write_path <<<-----)
> >>>>>
> >>>>> Is this what you mean?
> >>>> This sounds overly complicated. You can derive locally in vpci_write(),
> >>>> from just its "reg" and "size" parameters, whether the lock needs taking
> >>>> in write mode.
> >>> Yes, I started writing a reply with that. So, the summary (ROM
> >>> position depends on header type):
> >>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
> >>> {
> >>>        read PCI_COMMAND and see if memory or IO decoding are enabled.
> >>>        if ( enabled )
> >>>            write_lock(d->vpci_lock)
> >>>        else
> >>>            read_lock(d->vpci_lock)
> >>> }
> >> Hmm, yes, you can actually get away without using "size", since both
> >> command register and ROM BAR are 32-bit aligned registers, and 64-bit
> >> accesses get split in vpci_ecam_write().
> > But the OS may want to read a single byte of the ROM BAR, so I think
> > I'll need to check whether reg+size falls into the PCI_COMMAND and ROM BAR
> > ranges
> >> For the command register the memory- / IO-decoding-enabled check may
> >> end up a little more complicated, as the value to be written also
> >> matters. Maybe read the command register only for the ROM BAR write,
> >> using the write lock uniformly for all command register writes?
> > Sounds good for the start.
> > Another concern is that if we go with a read_lock and then in the
> > underlying code we disable memory decoding and try doing
> > something and calling cmd_write handler for any reason then....
> >
> > I mean that the check in vpci_write is something we can tolerate,
> > but then it must be ensured that no code in the read path
> > performs write-path functions. Which brings a pretty
> > valid use-case: say in read mode we detect an unrecoverable error
> > and need to remove the device:
> > vpci_process_pending -> ERROR -> vpci_remove_device or similar.
> >
> > What do we do then? It is all going to be fragile...
> I have tried to summarize the options we have wrt locking
> and would love to hear from @Roger and @Jan.
> 
> In every variant there is the task of dealing with the overlap
> detection in modify_bars; as of now this is the only place
> which needs special treatment.
> 
> Existing limitations: there is no way to upgrade a read lock to a write
> lock, so paths which may require write lock protection need to use
> write lock from the very beginning. Workarounds can be applied.
> 
> 1. Per-domain rw lock, aka d->vpci_lock
> ==============================================================
> Note: with per-domain rw lock it is possible to do without introducing
> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
> should be required.

Er, no, I think you still need a per-device lock unless you intend to
take the per-domain rwlock in write mode every time you modify data
in vpci. I still think you need pdev->vpci->lock. It's possible this
approach doesn't require moving the lock outside of the vpci struct.

> This is only going to work if vpci_write always takes the write lock,
> vpci_read takes a read lock, and no path in vpci_read is allowed to
> perform write-path operations.

I think that's likely too strong?

You could get away with both vpci_{read,write} only taking the read
lock and using a per-device vpci lock?

Otherwise you are likely to introduce contention in msix_write if a
guest makes heavy use of the MSI-X entry mask bit.
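
As a rough illustration of that alternative (a sketch only; the rwlock
and field names are assumptions, not the actual patch):

/* vpci_read: the per-domain lock in read mode keeps device removal
 * out, while the per-device lock serializes accesses to this device
 * only, so unrelated devices don't contend. */
read_lock(&d->vpci_rwlock);
if ( pdev->vpci )
{
    spin_lock(&pdev->vpci->lock);
    /* ... dispatch to the register handlers ... */
    spin_unlock(&pdev->vpci->lock);
}
read_unlock(&d->vpci_rwlock);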

> vpci_process_pending uses the write lock as it has vpci_remove_device in its
> error path.
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> 
> Cons:
> - all writes are serialized
> - need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> 1.1. Semi read lock upgrade in modify_bars
> --------------------------------------------------------------
> In this case both vpci_read and vpci_write take a read lock and when it comes
> to modify_bars:
> 
> 1. read_unlock(d->vpci_lock)
> 2. write_lock(d->vpci_lock)
> 3. Check that pdev->vpci is still available and is the same object:
> if (pdev->vpci && (pdev->vpci == old_vpci) )
> {
>      /* vpci structure is valid and can be used. */
> }
> else
> {
>      /* vpci has gone, return an error. */
> }
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> - readers and writers are NOT serialized
> - NO need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> Cons:
> - ???
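
For concreteness, the upgrade emulation described in 1.1 could look
like this (a sketch; the exact error handling is an assumption):

old_vpci = pdev->vpci;
read_unlock(d->vpci_lock);
write_lock(d->vpci_lock);
if ( !pdev->vpci || pdev->vpci != old_vpci )
{
    /* vpci was removed or replaced in the unlocked window. */
    write_unlock(d->vpci_lock);
    return -ENODEV;
}
/* Exclusive access to all of the domain's vpci state from here on. */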
> 
> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
> ==============================================================
> In order to solve overlap ABBA, we introduce a per-domain helper
> lock to protect the overlapping code in modify_bars:
> 
>      old_vpci = pdev->vpci;
>      spin_unlock(pdev->vpci_lock);
>      spin_lock(pdev->domain->overlap_chk_lock);

Since you drop the pdev lock you get a window here where either vpci
or even pdev itself could be removed under your feet, so using
pdev->vpci_lock like you do below could dereference a stale pdev.
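
To make that window explicit (an annotated restatement of the quoted
sequence, not new code):

old_vpci = pdev->vpci;
spin_unlock(pdev->vpci_lock);
/* Nothing keeps pdev alive here: without holding e.g. the pcidevs
 * lock, a concurrent removal can free pdev itself... */
spin_lock(pdev->domain->overlap_chk_lock);
spin_lock(pdev->vpci_lock);   /* ...so this may touch freed memory. */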

>      spin_lock(pdev->vpci_lock);
>      if ( pdev->vpci && (pdev->vpci == old_vpci) )
>          for_each_pdev ( pdev->domain, tmp )
>          {
>              if ( tmp != pdev )
>              {
>                  spin_lock(tmp->vpci_lock);
>                  if ( tmp->vpci )
>                      ...
>              }
>          }
> 
> Pros:
> - all accesses are independent, only the same device access is serialized
> - no need to care about readers and writers wrt read lock upgrade issues
> 
> Cons:
> - helper spin lock
> 
> 3. Move overlap detection into process pending
> ==============================================================
> There is Roger's patch [1] which adds the possibility for vpci_process_pending
> to perform different tasks rather than just map/unmap. With this patch
> extended so that it can hold a request queue, it is possible to delay execution
> of the overlap code until no pdev->vpci_lock is held, but before returning to
> the guest after vpci_{read|write} or similar.
> 
> Pros:
> - no need to emulate read lock upgrade
> - fully parallel read/write
> - queue in the vpci_process_pending will later on be used by SR-IOV,
>    so this is going to help the future code
> Cons:
> - ???

Maybe? It's hard to foresee what that would end up looking like, and
whether it wouldn't still require this kind of double locking. We would
still need to prevent doing a rangeset_remove_range for the device we
are trying to set up the mapping for, at which point we still need to
lock the current device plus the device we are iterating against?

Since the code in vpci_process_pending is always executed in guest
vCPU context, requiring all guest vCPUs to be paused when doing a
device addition or removal would prevent devices from going away, but
we could still have issues with concurrent accesses from other vCPUs.

> 
> 4. Re-write overlap detection code
> ==============================================================
> It is possible to re-write the overlap detection code so that the information
> about the mapped/unmapped regions is not read from vpci->header.bars[] of each
> device, but instead there is a per-domain structure which holds the regions
> and implements reference counting.
> 
> Pros:
> - solves ABBA
> 
> Cons:
> - very complex code is expected
> 
> 5. You name it
> ==============================================================
> 
>  From all the above I would recommend we go with option 2, which seems to
> reliably solve the ABBA problem and does not bring the cons of the other
> approaches.

6. per-domain rwlock + per-device vpci lock

Introduce a vpci_header_write_lock(start, {end, size}) helper: return
whether a range requires the per-domain lock in write mode. This will
only return true if the range overlaps with the ROM BAR or the command
register.
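
A minimal sketch of such a helper (the overlap test and the header
type check are assumptions about how it could look; is_hdr_type1() is
a hypothetical helper):

static bool vpci_header_write_lock(const struct pci_dev *pdev,
                                   unsigned int start, unsigned int size)
{
    /* ROM BAR position depends on the header type. */
    unsigned int rom = is_hdr_type1(pdev) ? PCI_ROM_ADDRESS1
                                          : PCI_ROM_ADDRESS;

    /* Writes overlapping the command register may toggle decoding. */
    if ( start < PCI_COMMAND + 2 && start + size > PCI_COMMAND )
        return true;

    /* Writes overlapping the ROM BAR may trigger (un)mapping. */
    if ( start < rom + 4 && start + size > rom )
        return true;

    return false;
}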

In vpci_{read,write}:

if ( vpci_header_write_lock(...) )
    /* Gain exclusive access to all of the domain's pdevs' vpci. */
    write_lock(d->vpci);
else
{
    read_lock(d->vpci);
    spin_lock(vpci->lock);
}
...

The vpci assign/deassign functions would need to be modified to write
lock the per-domain rwlock. The MSI-X table MMIO handler will also
need to read lock the per-domain vpci lock.
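
In sketch form, again with assumed names:

/* Device assignment/removal: exclusive access to the whole domain. */
write_lock(&d->vpci_rwlock);
rc = vpci_add_handlers(pdev);      /* or vpci_remove_device(pdev) */
write_unlock(&d->vpci_rwlock);

/* MSI-X table MMIO handler: shared access is enough. */
read_lock(&d->vpci_rwlock);
/* ... access the MSI-X entry mask bits ... */
read_unlock(&d->vpci_rwlock);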

I think it's either something along the lines of my suggestion above,
or maybe option 3, though you would have to investigate how to
implement option 3.

Thanks, Roger.



 

