Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On Wed, Oct 23, 2019 at 12:26 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 22.10.19 23:54, Dan Williams wrote:
> > Hi David,
> >
> > Thanks for tackling this!
>
> Thanks for having a look :)
>
> [...]
>
> >> I am probably a little bit too careful (but I don't want to break things).
> >> In most places (besides KVM and vfio that are nuts), the
> >> pfn_to_online_page() check could most probably be avoided by an
> >> is_zone_device_page() check. However, I usually get suspicious when I see
> >> a pfn_valid() check (especially after I learned that people mmap parts of
> >> /dev/mem into user space, including memory without memmaps. Also, people
> >> could memmap offline memory blocks this way :/). As long as this does not
> >> hurt performance, I think we should rather do it the clean way.
> >
> > I'm concerned about using is_zone_device_page() in places that are not
> > known to already have a reference to the page. Here's an audit of
> > current usages, and the ones I think need to be cleaned up. The "unsafe"
> > ones do not appear to have any protections against the device page
> > being removed (get_dev_pagemap()). Yes, some of these were added by
> > me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> > pages into anonymous memory paths and I'm not up to speed on how it
> > guarantees 'struct page' validity vs device shutdown without using
> > get_dev_pagemap().
> >
> > smaps_pmd_entry(): unsafe
> >
> > put_devmap_managed_page(): safe, page reference is held
> >
> > is_device_private_page(): safe? gpu driver manages private page lifetime
> >
> > is_pci_p2pdma_page(): safe, page reference is held
> >
> > uncharge_page(): unsafe? HMM
> >
> > add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> >
> > soft_offline_page(): unsafe
> >
> > remove_migration_pte(): unsafe? HMM
> >
> > move_to_new_page(): unsafe? HMM
> >
> > migrate_vma_pages() and helpers: unsafe? HMM
> >
> > try_to_unmap_one(): unsafe? HMM
> >
> > __put_page(): safe
> >
> > release_pages(): safe
> >
> > I'm hoping all the HMM ones can be converted to
> > is_device_private_page() directly and have that routine grow a nice
> > comment about how it knows it can always safely de-reference its @page
> > argument.
> >
> > For the rest I'd like to propose that we add a facility to determine
> > ZONE_DEVICE by pfn rather than page. The most straightforward way I
> > can think of would be to just add another bitmap to mem_section_usage
> > to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it
> in this thread)
>
> I dislike this for three reasons:
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and removed at
>    any time. Racy.

True, we need to solve that problem too. That seems to want something
lighter weight than the hotplug lock that can be held over pfn lookups
+ use rather than requiring a page lookup in paths where it's not
clear that a page reference would prevent unplug.

> c) We mix in ZONE specific stuff into the core. It should be "just another
>    zone".

Not sure I grok this when the RFC is sprinkling zone-specific
is_zone_device_page() throughout the core?

> What I propose instead (already discussed in
> https://lkml.org/lkml/2019/10/10/87):

Sorry I missed this earlier...

> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

This does not seem to reduce any complexity because it still requires
a pfn to zone lookup at the end of the process. I.e. converting
pfn_to_online_page() to use a new pfn_active() subsection map plus
looking up the zone from pfn_to_page() is more steps than just doing a
direct pfn to zone lookup. What am I missing?

> Especially, driver-reserved device memory will not get set active in
> the subsection bitmap.
>
> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> wonder if we can use RCU:

Ah, yes, exactly what I was thinking above.

> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>
> /* the memmap is guaranteed to remain active under RCU */
> rcu_read_lock();
> if (pfn_active(random_pfn)) {
>         page = pfn_to_page(random_pfn);
>         ... use the page, stays valid
> }
> rcu_read_unlock();
>
> Memory offlining/memremap code:
>
> set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
> synchronize_rcu();
> /* all users saw the bitmap update, we can invalidate the memmap */
> remove_pfn_range_from_zone(zone, pfn, nr_pages);

Looks good to me.

> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >> on x86-64 and PPC.
> >
> > I'll give it a spin, but I don't think the kernel wants to grow more
> > is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>    - Basically never used with ZONE_DEVICE.
>    - We hold a reference!
>    - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>    - Same as 1.
> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>    - We come via virt_to_head_page(), not sure about references
>      (I assume this should be fine as we don't come via random PFNs)
>    - We check that we don't mix Reserved (including device memory) and CMA
>      pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up".
Like we discussed above, it's the missing locking that's the real
cleanup; the pfn_to_online_page() internals are secondary.

> However, I think we also have to clarify if we need the change in 3 at all.
> It comes from
>
> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> Author: Kees Cook <keescook@xxxxxxxxxxxx>
> Date:   Tue Jun 7 11:05:33 2016 -0700
>
>     mm: Hardened usercopy
>
>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>     from Casey Schaufler. Additional non-slab page tests are from Rik van
>     Riel.
>     [...]
>     - otherwise, object must not span page allocations (excepting Reserved
>       and CMA ranges)
>
> Not sure if we really have to care about ZONE_DEVICE at this point.

That check needs to be careful to ignore ZONE_DEVICE pages. There's
nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel