Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@xxxxxxxxxx>
> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx>
>> > wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be
>> >> vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>

Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
CPU accesses it during early entry;

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>

On the bright side of things, arm64 is not as bad as x86 when it comes to
IPI'ing isolated CPUs :-)

I'll add that to my notes, thanks!

> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)
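
[Editorial note: below is a minimal, standalone toy model (user-space C,
compilable as-is) of the batched purge behaviour Jann describes above. All
names prefixed with toy_ are made up for illustration only; the real logic
lives in mm/vmalloc.c around __purge_vmap_area_lazy() and lazy_max_pages(),
and the real threshold scales with the number of online CPUs rather than
being a fixed constant.]

/*
 * Toy model of lazy vunmap purging (NOT kernel code): unmapping only
 * accumulates lazily freed pages, and the expensive system-wide TLB
 * flush is issued in one batch once enough pages have piled up.
 */
#include <stdio.h>

static unsigned long pending_lazy_pages;

/* Stand-in for the heuristic documented above lazy_max_pages(). */
static unsigned long toy_lazy_max_pages(void)
{
        return 32UL * 1024;     /* purge once ~32k pages are pending */
}

static void toy_flush_tlb_kernel_range(void)
{
        /* In the kernel this is the IPI-broadcast flush being deferred. */
        printf("flush: purging %lu lazily freed pages\n", pending_lazy_pages);
        pending_lazy_pages = 0;
}

/* Rough analogue of vunmap(): record the pages, defer the flush. */
static void toy_vunmap(unsigned long nr_pages)
{
        pending_lazy_pages += nr_pages;
        if (pending_lazy_pages > toy_lazy_max_pages())
                toy_flush_tlb_kernel_range();   /* batched purge */
}

int main(void)
{
        for (int i = 0; i < 100; i++)
                toy_vunmap(1024);       /* many unmaps, few flushes */
        return 0;
}

[The tuning hinted at in the last quoted paragraph is effectively this
threshold: raising it means fewer, larger purge-and-flush events (fewer IPIs
reaching isolated CPUs) at the cost of keeping more lazily freed virtual
address space unavailable for reuse.]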