[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs



On Tue, Jan 14, 2025, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.

...

> This leaves us with static keys and static calls.

...

> @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc 
> *tp, unsigned int nr_entries
>        * First step: add a int3 trap to the address that will be patched.
>        */
>       for (i = 0; i < nr_entries; i++) {
> -             tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
> -             text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
> +             void *addr = text_poke_addr(&tp[i]);
> +
> +             /*
> +              * There's no safe way to defer IPIs for patching text in
> +              * .noinstr, record whether there is at least one such poke.
> +              */
> +             if (is_kernel_noinstr_text((unsigned long)addr))
> +                     cond = NULL;

Maybe pre-check "cond", especially if multiple ranges need to be checked?  I.e.

                if (cond && is_kernel_noinstr_text(...))
> +
> +             tp[i].old = *((u8 *)addr);
> +             text_poke(addr, &int3, INT3_INSN_SIZE);
>       }
>  
> -     text_poke_sync();
> +     __text_poke_sync(cond);
>  
>       /*
>        * Second step: update all but the first byte of the patched range.

...

> +/**
> + * is_kernel_noinstr_text - checks if the pointer address is located in the
> + *                    .noinstr section
> + *
> + * @addr: address to check
> + *
> + * Returns: true if the address is located in .noinstr, false otherwise.
> + */
> +static inline bool is_kernel_noinstr_text(unsigned long addr)
> +{
> +     return addr >= (unsigned long)__noinstr_text_start &&
> +            addr < (unsigned long)__noinstr_text_end;
> +}

This doesn't do the right thing for modules, which matters because KVM can be
built as a module on x86, and because context tracking understands transitions
to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
being in the kernel, and thus will have IPIs deferred.  If KVM uses a static key
or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
patching code won't wait for CPUs to exit guest mode, i.e. KVM could 
theoretically
use the wrong static path.

I don't expect this to ever cause problems in practice, because patching code in
KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
actively running guest code, would be all kinds of crazy.  But I do think we
should plug the hole.

If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
assume module code generally isn't allowed in the entry path, even via NMI?), 
one
idea would be to let KVM register its noinstr section for text poking.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.