Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch



This is a net gain even without ASI. Having "current" hold the previous vCPU on
__context_switch() makes it _a lot_ easier to follow the lazy switch path.

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> On x86 Xen will perform lazy context switches to the idle vCPU, where the
> previously running vCPU context is not overwritten, and only current is updated
> to point to the idle vCPU.  The state is then disjunct between current and
> curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> whose context is loaded on the pCPU.
>
> While in that lazily context-switched state, certain calls (like
> map_domain_page()) will trigger a full synchronization of the pCPU state by
> forcing a context switch.  Note however that calling any such function
> inside the context switch code itself is very likely to trigger an infinite
> recursion loop.
>
> Attempt to limit the window where curr_vcpu != current in the context switch
> code, so as to prevent an infinite recursion loop around sync_local_execstate().
>
> This is required for using map_domain_page() in the vCPU context switch code;
> otherwise using map_domain_page() in that context ends up in a recursive
> sync_local_execstate() loop:
>
> map_domain_page() -> sync_local_execstate() -> map_domain_page() -> ...
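
If it helps to spell the hazard out, its shape boils down to something like the
toy model below (plain standalone C with invented names, not the real Xen
functions: "loaded_v" stands for curr_vcpu, "current_v" for current, and
"map_page()" for map_domain_page()).  The key point is that the full sync
itself needs a mapping before it gets around to updating the discriminant:

    #include <stdio.h>
    #include <stdlib.h>

    struct vcpu { const char *name; };

    static struct vcpu idle_v  = { "idle"  };
    static struct vcpu guest_v = { "guest" };

    static struct vcpu *current_v = &idle_v;   /* what the scheduler set */
    static struct vcpu *loaded_v  = &guest_v;  /* whose state is on the pCPU */

    static unsigned int depth;

    static void map_page(void);

    static void sync_state(void)
    {
        map_page();              /* e.g. map the incoming vCPU's state */
        loaded_v = current_v;    /* too late: we already recursed above */
    }

    static void map_page(void)
    {
        if ( ++depth > 8 )       /* stand-in for blowing the stack */
        {
            puts("recursed until the stack would have overflowed");
            exit(EXIT_FAILURE);
        }
        if ( loaded_v != current_v )  /* lazy state: force a full switch */
            sync_state();
    }

    int main(void)
    {
        map_page();              /* map_page -> sync_state -> map_page -> ... */
        return 0;
    }

The patch sidesteps this by keeping current equal to curr_vcpu for (almost) the
whole of __context_switch(), so a nested map_domain_page() never observes the
disjunct state to begin with.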

More generally, it's worth mentioning that we want to establish an invariant
between a per-cpu variable (curr_vcpu) and the page tables currently in use.
That way curr_vcpu can be used as a discriminant to tell which per-vCPU
mappings are currently active.
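
To make the "discriminant" part concrete, the sort of check this invariant
enables is roughly the below (invented helper name, Xen-style fragment rather
than something compilable on its own; this_cpu()/per_cpu() are the usual
accessors for the per-cpu variable, as also seen in the hunks further down):

    /*
     * Hypothetical helper, for illustration only: once curr_vcpu is
     * guaranteed to match the page tables in use, it alone tells us whose
     * per-vCPU mappings are reachable, without peeking at current or CR3.
     */
    static inline bool pervcpu_mappings_active(const struct vcpu *v)
    {
        return this_cpu(curr_vcpu) == v;
    }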

That's essential for implementing FPU hiding as proposed here:

  
https://lore.kernel.org/xen-devel/20241105143310.28301-1-alejandro.vallejo@xxxxxxxxx/

A shorter form of that should probably be mentioned also...

>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> Changes since v1:
>  - New in this version.
> ---
>  xen/arch/x86/domain.c | 58 +++++++++++++++++++++++++++++++++++--------
>  xen/arch/x86/traps.c  |  2 --
>  2 files changed, 48 insertions(+), 12 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 78a13e6812c9..1f680bf176ee 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
>      per_cpu(full_gdt_loaded, cpu) = false;
>  }
>  
> -static void __context_switch(void)
> +static void __context_switch(struct vcpu *n)
>  {
>      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
>      unsigned int          cpu = smp_processor_id();
>      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> -    struct vcpu          *n = current;
>      struct domain        *pd = p->domain, *nd = n->domain;
>  
>      ASSERT(p != n);
>      ASSERT(!vcpu_cpu_dirty(n));
> +    ASSERT(p == current);
>  
>      if ( !is_idle_domain(pd) )
>      {
> @@ -2036,6 +2036,18 @@ static void __context_switch(void)
>  
>      write_ptbase(n);
>  
> +    /*
> +     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
> +     * window where calls to mapcache_current_vcpu() during the context switch
> +     * could trigger a recursive loop.
> +     *
> +     * Do the current switch immediately after switching to the new guest
> +     * page-tables, so that current is (almost) always in sync with the
> +     * currently loaded page-tables.
> +     */
> +    set_current(n);
> +    per_cpu(curr_vcpu, cpu) = n;

... here. So we're not tempted to move these 2 far off from write_ptbase().
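
Put differently, the window being minimised is the gap between write_ptbase()
and those two stores; anything that can still run in it (an NMI or #MC, say)
finds n's page tables loaded while current/curr_vcpu still name p, so anything
keyed on them would pick the wrong vCPU's mappings.  Annotating the hunk:

    write_ptbase(n);
    /* <-- n's page tables are live, but current/curr_vcpu still say "p" */
    set_current(n);
    per_cpu(curr_vcpu, cpu) = n;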

> +
>  #ifdef CONFIG_PV
> +    /* Prefetch the VMCB if we expect to use it later in the context switch */
>      if ( using_svm() && is_pv_64bit_domain(nd) && !is_idle_domain(nd) )
> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>      if ( pd != nd )
>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> -
> -    per_cpu(curr_vcpu, cpu) = n;
>  }
>  
>  void context_switch(struct vcpu *prev, struct vcpu *next)
> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>  
>      local_irq_disable();
>  
> -    set_current(next);
> -
>      if ( (per_cpu(curr_vcpu, cpu) == next) ||
>           (is_idle_domain(nextd) && cpu_online(cpu)) )
>      {
> +        /*
> +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> +         * context switch happens if/when sync_local_execstate() is called.
> +         */
> +        set_current(next);
>          local_irq_enable();
>      }
>      else
>      {
> -        __context_switch();
> +        /*
> +         * curr_vcpu will always point to the currently loaded vCPU context, as

nit: s/will always point/always points/ ? It's an unconditional invariant,
after all.

> +         * it's not updated when doing a lazy switch to the idle vCPU.
> +         */
> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> +
> +        if ( prev_ctx != current )
> +        {
> +            /*
> +             * Doing a full context switch to a non-idle vCPU from a lazy
> +             * context switched state.  Adjust current to point to the
> +             * currently loaded vCPU context.
> +             */
> +            ASSERT(current == idle_vcpu[cpu]);
> +            ASSERT(!is_idle_vcpu(next));
> +            set_current(prev_ctx);
> +        }
> +        __context_switch(next);
>  
>          /* Re-enable interrupts before restoring state which may fault. */
>          local_irq_enable();
> @@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
>  {
>      unsigned long flags;
>      int switch_required;
> +    unsigned int cpu = smp_processor_id();
> +    struct vcpu *p;
>  
>      local_irq_save(flags);
>  
> -    switch_required = (this_cpu(curr_vcpu) != current);
> +    p = per_cpu(curr_vcpu, cpu);
> +    switch_required = (p != current);
>  
>      if ( switch_required )
>      {
> -        ASSERT(current == idle_vcpu[smp_processor_id()]);
> -        __context_switch();
> +        ASSERT(current == idle_vcpu[cpu]);
> +        /*
> +         * Restore current to the previously running vCPU, __context_switch()
> +         * will update current together with curr_vcpu.
> +         */
> +        set_current(p);
> +        __context_switch(idle_vcpu[cpu]);
>      }
>  
>      local_irq_restore(flags);
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 87b30ce4df2a..487b8c5a78c5 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>  
>  void activate_debugregs(const struct vcpu *curr)
>  {
> -    ASSERT(curr == current);
> -
>      write_debugreg(0, curr->arch.dr[0]);
>      write_debugreg(1, curr->arch.dr[1]);
>      write_debugreg(2, curr->arch.dr[2]);

Cheers,
Alejandro



 

