[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 11/28] xsplice: Implement support for applying/reverting/replacing patches.



Hey Jan,

Thank you for your review!
Please see below my answers.

Ross,

Could you answer one question below please.

On Fri, Apr 01, 2016 at 07:28:00AM -0600, Jan Beulich wrote:
> >>> On 24.03.16 at 21:00, <konrad.wilk@xxxxxxxxxx> wrote:
> > From: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> > 
> > Implement support for the apply, revert and replace actions.
> > 
> > To perform and action on a payload, the hypercall sets up a data
> > structure to schedule the work.  A hook is added in all the
> > return-to-guest paths to check for work to do and execute it if needed.
> 
> Looking at the diffstat alone I cannot see this being the case.
> Perhaps it's just the description here which needs to become
> more precise.

s/all the/common/ ?

> 
> > In this way, patches can be applied with all CPUs idle and without
> > stacks.  The first CPU to do_xsplice() becomes the master and triggers a
> > reschedule softirq to trigger all the other CPUs to enter do_xsplice()
> > with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
> > and NMIs are ignored. The system is then quiscient and the master
> > performs the action.  After this, all CPUs enable IRQs and NMIs are
> > re-enabled.
> > 
> > Note that it is unsafe to patch do_nmi and the xSplice internal functions.
> > Patching functions on NMI/MCE path is liable to end in disaster.
> 
> So what measures are (planned to be) taken to make sure this
> doesn't happen by accident?

None in this patch.

Please keep in mind that this issue is not only a problem with
xSplice but also with alternative assembler patching.

I haven't figured out a nice way to take care of this and was hoping
to brainstorm that. For right now this part is left out as 'TODO'.

> 
> > The action to perform is one of:
> > - APPLY: For each function in the module, store the first 5 bytes of the
> >   old function and replace it with a jump to the new function.
> > - REVERT: Copy the previously stored bytes into the first 5 bytes of the
> >   old function.
> > - REPLACE: Revert each applied module and then apply the new module.
> > 
> > To prevent a deadlock with any other barrier in the system, the master
> > will wait for up to 30ms before timing out.
> > Measurements found that the patch application to take about 100 μs on a
> > 72 CPU system, whether idle or fully loaded.
> 
> That's for an individual patch I suppose? What if REPLACE has to
> revert dozens or hundreds of patches?

The user-space can change the timeout to a larger value. 
> 
> > We also add an BUILD_ON to make sure that the size of the structure
> > of the payload is not inadvertly changed.
> > 
> > Lastly we unroll the 'vmap_to_page' on x86 as inside the macro there
> > is a posibility of a NULL pointer. Hence we unroll it with extra
> > ASSERTS. Note that asserts on non-debug builds are compiled out hence
> > the extra checks that will just return (and leak memory).
> 
> I'm afraid I can't really make sense of this: What does "unroll" mean
> here? There's no loop involved. And where's that potential for NULL?

That is a very stale comment. That should have been deleted!

> 
> > --- a/docs/misc/xsplice.markdown
> > +++ b/docs/misc/xsplice.markdown
> > @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
> >   * Be able to lookup in the Xen hypervisor the symbol names of functions 
> > from the ELF payload.
> >   * Be able to patch .rodata, .bss, and .data sections.
> >   * Further safety checks (blacklist of which functions cannot be patched, 
> > check
> > -   the stack, make sure the payload is built with same compiler as 
> > hypervisor).
> > +   the stack, make sure the payload is built with same compiler as 
> > hypervisor,
> > +   and NMI/MCE handlers and do_nmi for right now - until an safe solution 
> > is found).
> 
> The whole thing doesn't parse anymore for me with the addition,
> so I can't really conclude what you mean to say here (and hence
> whether that addresses the earlier question).

It is adding handling of NMI/MCE on the 'TODO' list so that I don't forget
about it.

> 
> > @@ -120,6 +121,7 @@ static void idle_loop(void)
> >          (*pm_idle)();
> >          do_tasklet();
> >          do_softirq();
> > +        check_for_xsplice_work(); /* Must be last. */
> >      }
> >  }
> >  
> > @@ -136,6 +138,7 @@ void startup_cpu_idle_loop(void)
> >  
> >  static void noreturn continue_idle_domain(struct vcpu *v)
> >  {
> > +    check_for_xsplice_work();
> >      reset_stack_and_jump(idle_loop);
> >  }
> 
> The placement here kind of contradicts the comment right above.
> And anyway - why in both places? And why here and not in
> common code, e.g. in do_softirq(), or - considering the further
> addition below - inside reset_stack_and_jump() _after_ having
> reset the stack (after all by doing it up front you do not really
> meet your goal of doing the action "without stacks")?

Ross?

> 
> > +void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
> > +{
> > +    uint32_t val;
> 
> The way it's being used below, it clearly should be int32_t.
> 
> > +    uint8_t *old_ptr;
> > +
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > +
> > +    old_ptr = (uint8_t *)func->old_addr;
> 
> (Considering this cast, the "old_addr" member should be
> unsigned long (or void *), not uint64_t: The latest on ARM32
> such would otherwise cause problems.)
> 
> Also - where is the verification that
> func->old_size >= PATCH_INSN_SIZE?

Missing..
..snip..
> > +/* There can be only one outstanding patching action. */
> > +static struct xsplice_work xsplice_work;
> > +
> > +/* Indicate whether the CPU needs to consult xsplice_work structure. */
> > +static DEFINE_PER_CPU(bool_t, work_to_do);
> 
> Peeking ahead to the uses of this, I can't see why this is needed
> alongside xsplice_work.do_work.

Andrew asked for this - he noticed that we pound on the xsplice_work
structure quite often across a lot of CPUs. Instead of pounding on the same
cache line - we could just read from a per-cpu cache line. Hence this addition.

> 
> > @@ -296,6 +331,77 @@ static int secure_payload(struct payload *payload, 
> > struct xsplice_elf *elf)
> >      return rc;
> >  }
> >  
> > +static int check_special_sections(struct payload *payload,
> > +                                  struct xsplice_elf *elf)
> 
> const (twice, but the first parameter appears to be unused anyway)
> 
> > +{
> > +    unsigned int i;
> > +    static const char *const names[] = { ".xsplice.funcs" };
> > +
> > +    for ( i = 0; i < ARRAY_SIZE(names); i++ )
> > +    {
> > +        struct xsplice_elf_sec *sec;
> 
> const
> 
> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> > +        if ( !sec )
> > +        {
> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
> > +                   XSPLICE, elf->name, names[i]);
> > +            return -EINVAL;
> > +        }
> > +
> > +        if ( !sec->sec->sh_size )
> > +            return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> So you check for there being one such section. Is having multiple
> of them okay / meaningful?

/me blinks. You can have multiple ELF sections with the same name?
I will double-check the spec over the weekend to see.

.. snip..
> > +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
> > +    if ( sec )
> 
> Assuming check_special_sections() got invoked before you get here,
> why the conditional?

Andrew asked for it. Albeit it is pointless as 'check_special_sections'
will bail halt the process if this section is not found.

> 
> > +    {
> > +        if ( sec->sec->sh_size % sizeof *payload->funcs )
> > +        {
> > +            dprintk(XENLOG_DEBUG, "%s%s: Wrong size of .xsplice.funcs!\n",
> > +                    XSPLICE, elf->name);
> > +            return -EINVAL;
> > +        }
> > +
> > +        payload->funcs = (struct xsplice_patch_func *)sec->load_addr;
> > +        payload->nfuncs = sec->sec->sh_size / (sizeof *payload->funcs);
> 
> Since this repeats (so far I thought this was more like a mistake),
> and despite being a matter of taste to some degree - may I ask
> that the canonical sizeof() be used, instead of (sizeof ...) or
> whatever else variants?

Sure. I will convert Ross's use of it.
> 
> > +    }
> > +
> > +    for ( i = 0; i < payload->nfuncs; i++ )
> > +    {
> > +        unsigned int j;
> > +
> > +        f = &(payload->funcs[i]);
> > +
> > +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )
> 
> Isn't new_size == 0 a particularly easy to deal with case?

To NOP func? No. I would have to include the headers for the nop[] in the arch 
xSplice
and expand on its patching (or expose the text_poke code).

That will grow this patch which is big enough. I don't mind doing it in further 
patches
(and I believe it is mentioned as a TODO in the design doc so that I won't 
forget).

.. snip..
> > +    case XSPLICE_ACTION_REPLACE:
> > +        rc = 0;
> > +        /* N.B: Use 'applied_list' member, not 'list'. */
> > +        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, 
> > applied_list )
> > +        {
> > +            other->rc = revert_payload(other);
> 
> Why does this set ->rc, but the two earlier ones only set the local
> variable?
> 
> > +            if ( other->rc == 0 )
> > +                other->state = XSPLICE_STATE_CHECKED;
> > +            else
> > +            {
> > +                rc = -EINVAL;
> > +                break;
> > +            }
> > +        }
> > +
> > +        if ( rc != -EINVAL )
> 
> Perhaps better "rc == 0"?
> 
> > +        {
> > +            rc = apply_payload(data);
> > +            if ( rc == 0 )
> > +                data->state = XSPLICE_STATE_APPLIED;
> > +        }
> > +        break;
> > +
> > +    default:
> > +        rc = -EINVAL;
> > +        break;
> > +    }
> > +
> > +    data->rc = rc;
> 
> Oh, here it is. But why would an xsplice_work.cmd (which probably
> shouldn't make it here anyway) mark the payload having an error?

In xsplice_action() we set data->rc to -EAGAIN right before we kick
of the schedule_work(). That way the user can check the 'status'
of the patching as we attempt to do it.

But once the patching has been complete we MUST set it to zero
(or to an error if the patching failed).

> It didn't change state or anything after all.
.. snip..
> > +void check_for_xsplice_work(void)
> > +{
> > +    /* Set at -1, so will go up to num_online_cpus - 1. */
> > +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
> > +    {
> > +        struct payload *p;
> > +        unsigned int total_cpus;
> > +
> > +        p = xsplice_work.data;
> > +        if ( !get_cpu_maps() )
> > +        {
> > +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps 
> > lock!\n",
> > +                   XSPLICE, p->name, cpu);
> > +            per_cpu(work_to_do, cpu) = 0;
> > +            xsplice_work.data->rc = -EBUSY;
> > +            xsplice_work.do_work = 0;
> 
> On x86 such possibly simultaneous accesses may be okay, but is
> that universally the case? Wouldn't it better be only the monarch
> which updates shared data?

Monarch? Oh you mean arch specific code path?

> 
> > +            /*
> > +             * Do NOT decrement semaphore down - as that may cause the 
> > other
> > +             * CPU (which may be at this exact moment checking the ASSERT)
> 
> Which ASSERT()? (I guess the one a few lines up - but does that
> matter?)

It messed me up when I was working on it so thought I would leave that in here
in case anybody wants to modify the code. Or more likely - if I rework
this code that I will recall this.

> > +        /* "Mask" NMIs. */
> > +        saved_nmi_callback = set_nmi_callback(mask_nmi_callback);
> 
> So what if an NMI got raised on a remote CPU right before you set
> this? I.e. doesn't this need to move earlier?

Hmm. Yes we should. Thanks!
> 
> > +        /* All CPUs are waiting, now signal to disable IRQs. */
> > +        xsplice_work.ready = 1;
> > +        smp_wmb();
> > +
> > +        atomic_inc(&xsplice_work.irq_semaphore);
> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, 
> > total_cpus,
> > +                              "Timed out on IRQ semaphore") )
> 
> I'd prefer if the common parts of that message moved into
> xsplice_do_wait() - no reason to have more string literal space
> occupied than really needed. Also, looking back, the respective
> function parameter could do with a more descriptive name.
> 
> And then - does it make sense to wait the perhaps full 30ms
> on this second step? Rendezvoused CPUs should react rather
> quickly to the signal to disable interrupts.

We don't reset the timeout - the timeout is for both calls
to xsplice_do_wait.
> 
> > +        {
> > +            local_irq_save(flags);
> > +            /* Do the patching. */
> > +            xsplice_do_action();
> > +            /* To flush out pipeline. */
> > +            arch_xsplice_post_action();
> 
> The comment needs to become more generic, and maybe the
> function name more specific.

Serialize CPUs? Flush CPU pipeline? Flush CPU i-cache?
> 
> > +            local_irq_restore(flags);
> > +        }
> > +        set_nmi_callback(saved_nmi_callback);
> > +
> > + abort:
> > +        per_cpu(work_to_do, cpu) = 0;
> > +        xsplice_work.do_work = 0;
> > +
> > +        smp_wmb(); /* Synchronize with waiting CPUs. */
> 
> What "synchronization" does this barrier do?

For the ->do_work being set to zero - as they are reading and reading it.

> 
> > +        ASSERT(local_irq_is_enabled());
> 
> Is there really anything between the earlier identical ASSERT() and
> this one which could leave interrupts off?

The patching calls (in later patches) - the load and unload hooks. Those
could inadvertly enabled interrupts.

> 
> > +        put_cpu_maps();
> > +
> > +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
> > +               p->name, p->rc);
> 
> And no record left of what was done with that payload?

? But it does. The rc is mentioned.. Or are you saying that it should
also say whether it was applied/reverted/replaced, etc?

> 
> > +    }
> > +    else
> > +    {
> > +        /* Wait for all CPUs to rendezvous. */
> > +        while ( xsplice_work.do_work && !xsplice_work.ready )
> > +        {
> > +            cpu_relax();
> > +            smp_rmb();
> 
> What is the barrier doing inside this and the other loop below?

The goal is to get the updated value from ->do_work and ->ready.

That is to do the same thing we do with frontend and backends - sync
out the rp/rc so that the other end can read it the updated value.

.. snip..
> >  static int __init xsplice_init(void)
> >  {
> > +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
> 
> What assumptions get broken if these sizes change?

I think you mean the offsets? They can change. I added them to make sure
that on 32-bit hypervisor (ARM) the offsets would be the same as on 64-bit
hypervisor (x86).

The size however MUST remain the same - otherwise the toolstack won't produce
proper payloads anymore.

> 
> > --- a/xen/include/xen/xsplice.h
> > +++ b/xen/include/xen/xsplice.h
> > @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
> >  struct xsplice_elf_sym;
> >  struct xen_sysctl_xsplice_op;
> >  
> > +#include <xen/elfstructs.h>
> > +/*
> > + * The structure which defines the patching. This is what the hypervisor
> > + * expects in the '.xsplice.func' section of the ELF file.
> > + *
> > + * This MUST be in sync with what the tools generate.
> 
> In which case this need to be part of the public interface, I would
> say.

Ah yes. Where should it go? sysctl.h is where the XSPLICE ops are - or
should this have it is own public/xsplice.h  header file?
> 
> > + */
> > +struct xsplice_patch_func {
> > +    const char *name;
> > +    uint64_t new_addr;
> > +    uint64_t old_addr;
> > +    uint32_t new_size;
> > +    uint32_t old_size;
> > +    uint8_t undo[8];
> 
> Shouldn't the size of this array be a per-arch constant?

Yes.
> 
> > +    uint8_t pad[24];
> 
> What is this padding field good for?

I want the size of the structure to be exactly 64-bytes across
all architectures. That way I don't have to mess with compat layer
or have the design doc be different for ARM vs x86.

It gives us enough of space to stash other 'private' fields we may want
or even expand this structure in the future and fill it out
with extra fields. ARgh, but for that we need a version field
somewhere... Maybe another section .xsplice_version which just
has a value of 1?

However the one thing I am not sure about is the padding.

The 'const char *name' on 32-bit ARM will be 4 bytes hence
it will insert 4 byte padding (at least that is what pahole tells me,
and the BUILD_BUG_ON confirms). But if I built with clang or other
compiler it may very well ignore me.

Any suggestions? I could do:

struct xsplice_patch_func {
    const char *name;
#ifdef CONFIG_32
    uint32_t _name_pad;
#endif
    uint64_t new_addr;
    uint64_t old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t undo[8];
    uint8_t pad[24];
}

But that is not very nice.

Also if we make this a public header I can't very well expose the
'undo' internals - it ought to be opaque. Ideas?

Thanks!

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.