Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
On 19/02/25 10:05, Joel Fernandes wrote:
> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>> On 17/01/25 16:52, Jann Horn wrote:
>> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common
>> >> >> source of interference for isolated NOHZ_FULL CPUs, as they are hit
>> >> >> by the flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the
>> >> >> vmalloc range, these IPIs could be deferred until their next kernel
>> >> >> entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be
>> >> >> vunmap'd and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >
>> > flush_cache_vmap() is about stuff like flushing data caches on
>> > architectures with virtually indexed caches; that doesn't do TLB
>> > maintenance. When you look for its definition on x86 or arm64, you'll
>> > see that they use the generic implementation which is simply an empty
>> > inline function.
>> >
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of
>> >> the "danger zone").
>> >>
>> >> Does that make sense?
>> >
>> > The design idea wrt TLB flushes in the vmap code is that you don't do
>> > TLB flushes when you unmap stuff or when you map stuff, because doing
>> > TLB flushes across the entire system on every vmap/vunmap would be a
>> > bit costly; instead you just do batched TLB flushes in between, in
>> > __purge_vmap_area_lazy().
>> >
>> > In other words, the basic idea is that you can keep calling vmap() and
>> > vunmap() a bunch of times without ever doing TLB flushes until you run
>> > out of virtual memory in the vmap region; then you do one big TLB
>> > flush, and afterwards you can reuse the free virtual address space for
>> > new allocations again.
>> >
>> > So if you "defer" that batched TLB flush for CPUs that are not
>> > currently running in the kernel, I think the consequence is that those
>> > CPUs may end up with incoherent TLB state after a reallocation of the
>> > virtual address space.
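To make the batching scheme described above concrete, here is a minimal
sketch. All names (LAZY_THRESHOLD, the sketch_* helpers, the exact fields
touched) are illustrative stand-ins, not the actual mm/vmalloc.c code,
which also has locking and a much smarter purge policy:

        /* Hypothetical batch size; the real threshold is adaptive. */
        #define LAZY_THRESHOLD  32

        static LIST_HEAD(lazy_vmap_areas);
        static unsigned long lazy_nr;

        static void sketch_purge_vmap_area_lazy(void)
        {
                struct vmap_area *va, *tmp;
                unsigned long start = ULONG_MAX, end = 0;

                /* Compute one range covering every lazily-unmapped area... */
                list_for_each_entry(va, &lazy_vmap_areas, list) {
                        start = min(start, va->va_start);
                        end = max(end, va->va_end);
                }

                /* ...so a single batch of IPIs flushes all stale entries. */
                flush_tlb_kernel_range(start, end);

                /* Only now is it safe to reuse this virtual address space. */
                list_for_each_entry_safe(va, tmp, &lazy_vmap_areas, list)
                        free_vmap_area(va);
                lazy_nr = 0;
        }

        static void sketch_vunmap(struct vmap_area *va)
        {
                /* Tear down page tables, deliberately leaving TLBs stale. */
                vunmap_range_noflush(va->va_start, va->va_end);
                list_add(&va->list, &lazy_vmap_areas);

                /* Batch: only purge once enough unmaps have accumulated. */
                if (++lazy_nr >= LAZY_THRESHOLD)
                        sketch_purge_vmap_area_lazy();
        }

The window between sketch_vunmap() and the purge is exactly the point
being made: any CPU that skips the flush_tlb_kernel_range() IPI can keep
translating the freed range until it performs its deferred flush.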
>>
>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>> CPU accesses it during early entry;
>
> So the issue is:
>
> CPU1: unmaps vmalloc page X which was previously mapped to physical page
> P1.
>
> CPU2: does a whole bunch of vmalloc and vfree, eventually crossing some
> lazy threshold and sending out IPIs. It then goes ahead and does an
> allocation that maps the same virtual page X to physical page P2.
>
> CPU3 is isolated and executes some early entry code before receiving said
> IPIs, which are supposedly deferred by Valentin's patches.
>
> It does not receive the IPI because it is deferred, thus an access by
> early entry code to page X on this CPU results in a UAF access to P1.
>
> Is that the issue?
>

Pretty much so, yeah. That is, *if* there is such a vmalloc'd address
access in early entry code - testing says it's not the case, but I haven't
found a way to instrumentally verify this.
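For illustration, a hedged sketch of the deferral idea under discussion:
skip the IPI for an isolated CPU that is running in userspace, record a
pending flush, and run it on that CPU's next kernel entry.
isolated_cpu_in_user() and both sketch_* hooks are made-up names, not the
mechanism the series actually uses:

        /* Set when a kernel-range TLB flush was skipped for this CPU. */
        static DEFINE_PER_CPU(bool, pending_kernel_tlb_flush);

        static void sketch_do_flush_tlb_all(void *info)
        {
                flush_tlb_all();
        }

        void sketch_flush_tlb_kernel_range(unsigned long start,
                                           unsigned long end)
        {
                struct cpumask mask;
                int cpu;

                cpumask_copy(&mask, cpu_online_mask);
                for_each_online_cpu(cpu) {
                        /* Userspace never dereferences vmalloc addresses, */
                        /* so record the flush instead of IPI'ing the CPU. */
                        if (isolated_cpu_in_user(cpu)) { /* hypothetical */
                                per_cpu(pending_kernel_tlb_flush, cpu) = true;
                                cpumask_clear_cpu(cpu, &mask);
                        }
                }
                on_each_cpu_mask(&mask, sketch_do_flush_tlb_all, NULL, 1);
        }

        /*
         * Runs on kernel entry. Joel's scenario is only safe if this
         * executes before *any* early entry code touches a vmalloc'd
         * address - that is the "danger zone" in the patch description.
         */
        void sketch_kernel_entry_tlb_check(void)
        {
                if (this_cpu_xchg(pending_kernel_tlb_flush, false))
                        flush_tlb_all(); /* conservative: flush everything */
        }

This glosses over the race where the remote CPU enters the kernel between
the isolated_cpu_in_user() check and the skipped IPI; a real implementation
has to close that window, e.g. by atomically sampling the remote CPU's
context-tracking state.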