Xen project Mailing List

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

To: Jason Gunthorpe <jgg@xxxxxxxx>, "linux-mm@xxxxxxxxx" <linux-mm@xxxxxxxxx>, Jerome Glisse <jglisse@xxxxxxxxxx>, Ralph Campbell <rcampbell@xxxxxxxxxx>, John Hubbard <jhubbard@xxxxxxxxxx>

From: "Kuehling, Felix" <Felix.Kuehling@xxxxxxx>

Date: Tue, 29 Oct 2019 22:04:45 +0000

Accept-language: en-US

Cc: Juergen Gross <jgross@xxxxxxxx>, "Zhou, David(ChunMing)" <David1.Zhou@xxxxxxx>, Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Andrea Arcangeli <aarcange@xxxxxxxxxx>, Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>, "linux-rdma@xxxxxxxxxxxxxxx" <linux-rdma@xxxxxxxxxxxxxxx>, "nouveau@xxxxxxxxxxxxxxxxxxxxx" <nouveau@xxxxxxxxxxxxxxxxxxxxx>, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>, "amd-gfx@xxxxxxxxxxxxxxxxxxxxx" <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>, Michal Hocko <mhocko@xxxxxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>, Jason Gunthorpe <jgg@xxxxxxxxxxxx>, "dri-devel@xxxxxxxxxxxxxxxxxxxxx" <dri-devel@xxxxxxxxxxxxxxxxxxxxx>, "Deucher, Alexander" <Alexander.Deucher@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Petr Cvek <petrcvekcz@xxxxxxxxx>, "Koenig, Christian" <Christian.Koenig@xxxxxxx>, Ben Skeggs <bskeggs@xxxxxxxxxx>

Delivery-date: Tue, 29 Oct 2019 22:04:56 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Thread-topic: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem, see comments inline.

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. 4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages(). Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
>  - An interval tree tracking VA ranges, with per-range callbacks
>  - A read/write locking scheme for the interval tree that avoids
>    sleeping in the notifier path (for OOM killer)
>  - A sequence counter based collision-retry locking scheme to tell
>    device page fault that a VA range is being concurrently invalidated.
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Acked-by: Christian König <christian.koenig@xxxxxxx>
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> ---
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>   * can't go away from under us as exit_mmap holds an mm_count pin
>   * itself.
>   */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +                             struct mm_struct *mm)
>  {
>      struct mmu_notifier *mn;
>      int id;
>
> +    if (mmn_mm->has_interval)
> +        mn_itree_release(mmn_mm, mm);
> +
> +    if (hlist_empty(&mmn_mm->list))
> +        return;

This seems to duplicate the conditions in __mmu_notifier_release. See my
comments below, I think one of them is wrong. I suspect this one,
because __mmu_notifier_release follows the same pattern as the other
notifiers.

> +
>      /*
>       * SRCU here will block mmu_notifier_unregister until
>       * ->release returns.
>       */
>      id = srcu_read_lock(&srcu);
> -    hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +    hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>          /*
>           * If ->release runs before mmu_notifier_unregister it must be
>           * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>          if (mn->ops->release)
>              mn->ops->release(mn, mm);
>
> -    spin_lock(&mm->mmu_notifier_mm->lock);
> -    while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -        mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +    spin_lock(&mmn_mm->lock);
> +    while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +        mn = hlist_entry(mmn_mm->list.first,
>                           struct mmu_notifier,
>                           hlist);
>          /*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>           */
>          hlist_del_init_rcu(&mn->hlist);
>      }
> -    spin_unlock(&mm->mmu_notifier_mm->lock);
> +    spin_unlock(&mmn_mm->lock);
>      srcu_read_unlock(&srcu, id);
>
>      /*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>      synchronize_srcu(&srcu);
>  }
>
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +    struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +    if (mmn_mm->has_interval)
> +        mn_itree_release(mmn_mm, mm);

If mmn_mm->list is not empty, this will be done twice because
mn_hlist_release duplicates this.

> +
> +    if (!hlist_empty(&mmn_mm->list))
> +        mn_hlist_release(mmn_mm, mm);

mn_hlist_release checks the same condition itself.

> +}
> +
>  /*
>   * If no young bitflag is supported by the hardware, ->clear_flush_young can
>   * unmap the address and return 1 or 0 depending if the mapping previously
[snip]
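
To make the double call pointed out in the inline comments concrete, here is a small user-space model of the two release functions. It is only a sketch: every name starting with mock_ is hypothetical, the struct is a stand-in that counts calls rather than the kernel's struct mmu_notifier_mm, and no SRCU or spinlock behaviour is modelled. The "fixed" variant shows one possible resolution, assuming the checks stay in __mmu_notifier_release and are dropped from mn_hlist_release as the review suspects; it is not taken from any later revision of the patch.

```c
/* User-space model only; all mock_* names are hypothetical stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_notifier_mm {
	bool has_interval;   /* models mmn_mm->has_interval */
	size_t hlist_len;    /* > 0 models !hlist_empty(&mmn_mm->list) */
	int itree_releases;  /* how often the itree release path ran */
	int hlist_releases;  /* how often the hlist release path ran */
};

/* Stand-in for mn_itree_release(): only counts invocations. */
static void mock_itree_release(struct mock_notifier_mm *m)
{
	m->itree_releases++;
}

/* Helper as posted in the quoted hunk: it repeats both checks. */
static void mock_hlist_release_as_posted(struct mock_notifier_mm *m)
{
	if (m->has_interval)
		mock_itree_release(m);
	if (m->hlist_len == 0)
		return;
	m->hlist_releases++;
	m->hlist_len = 0;    /* models unhashing every list entry */
}

/* Caller as posted: checks both conditions, then calls the helper. */
static void mock_release_as_posted(struct mock_notifier_mm *m)
{
	if (m->has_interval)
		mock_itree_release(m);
	if (m->hlist_len > 0)
		mock_hlist_release_as_posted(m);
}

/* Assumed fix: the helper handles the hlist only and trusts its caller. */
static void mock_hlist_release_fixed(struct mock_notifier_mm *m)
{
	assert(m->hlist_len > 0);  /* caller has already checked this */
	m->hlist_releases++;
	m->hlist_len = 0;
}

/* Assumed fix: each condition is tested exactly once, in the caller. */
static void mock_release_fixed(struct mock_notifier_mm *m)
{
	if (m->has_interval)
		mock_itree_release(m);
	if (m->hlist_len > 0)
		mock_hlist_release_fixed(m);
}
```

Calling mock_release_as_posted() on a model with has_interval set and a non-empty list runs the itree release path twice, while mock_release_fixed() runs it once. With an empty list both variants run it once, which is why the duplication only shows up when both kinds of notifier are registered.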
