
Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths



>>> On 07.11.17 at 09:07, <JBeulich@xxxxxxxx> wrote:
>>>> On 02.11.17 at 20:46, <igor.druzhinin@xxxxxxxxxx> wrote:
>>> Any ideas about the root cause of the fault and suggestions how to
>>> reproduce it would be welcome. Does this crash really have something
>>> to do with PML? I doubt it, because the original environment can
>>> hardly be called PML-heavy.
> 
> Well, PML-heaviness doesn't matter. It's the mere fact that PML
> is enabled on the vCPU being destroyed.
> 
>> So we finally have a complete understanding of what's going on:
>> 
>> Some vCPU has just migrated to another pCPU and we switched to idle,
>> but per_cpu(curr_vcpu) on the current pCPU still points to it - this
>> is how the current logic works. While we're idle we issue
>> vcpu_destroy() for some other domain, which eventually calls
>> vmx_vcpu_disable_pml() and trashes the VMCS pointer on the current
>> pCPU. At this moment we get a TLB flush IPI from that same vCPU,
>> which is now context switching in on another pCPU - it is cleaning
>> the TLB after itself. This vCPU has already been marked is_running=1
>> by the scheduler. In the IPI handler we enter __sync_local_execstate()
>> and try to call vmx_ctxt_switch_from() for the migrated vCPU, which is
>> supposed to call vmcs_reload() but doesn't do so because is_running==1.
>> The next VMWRITE crashes the hypervisor.
>> 
>> So the state transition diagram might look like:
>> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->
> 
> I'm not really clear about who/what is "idle" here: pCPU1,
> pCPU2, or yet something else? If vCPUx migrated to pCPU2,
> wouldn't it be put back into runnable state right away, and
> hence pCPU2 can't be idle at this point? Yet for pCPU1 I don't
> think its idleness would matter much, i.e. the situation could
> also arise without it becoming idle afaics. pCPU1 making it to any
> point where softirqs are processed would suffice.
> 
>> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
>> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
>> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
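
For illustration, the guard in question looks roughly like the sketch
below - a paraphrase using the names from this thread, not the exact
tree contents:

static void vmx_ctxt_switch_from(struct vcpu *v)
{
    /*
     * vmcs_reload() would re-establish this vCPU's VMCS on the local
     * pCPU after vmx_vcpu_disable_pml() -> vmcs_clear() invalidated
     * the per-pCPU VMCS pointer.  But pCPU2's scheduler has already
     * set is_running=1 for vCPUx, so the reload is skipped.
     */
    if ( !v->is_running )
        vmcs_reload(v);

    /*
     * The rest of the switch-out path issues VMWRITEs against the
     * (now cleared) VMCS pointer - hence the crash.
     */
    vmx_fpu_leave(v);
    vmx_save_guest_msrs(v);
    vmx_save_dr(v);
}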
>> 
>> We can basically just fix the condition around the vmcs_reload() call,
>> but I'm not completely sure that's the right way to do it - I don't
>> think leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good
>> idea (maybe we need to clear it). What are your thoughts?
> 
> per_cpu(curr_vcpu) can only validly be written inside
> __context_switch(), hence the only way to achieve this would
> be to force __context_switch() to be called earlier, rather
> than only out of the TLB flush IPI handler, perhaps like in
> the (untested!) patch below. Two questions then remain:
> - Should we perhaps rather do this in an arch-independent way
>   (i.e. ahead of the call to vcpu_destroy() in common code)?
> - This deals with only a special case of the more general "TLB
>   flush behind the back of a vmx_vmcs_enter() /
>   vmx_vmcs_exit() section" - does this need dealing with in a
>   more general way? Here I'm thinking of introducing a
>   FLUSH_STATE flag to be passed to flush_mask() instead of
>   the current flush_tlb_mask() in context_switch() and
>   sync_vcpu_execstate(). This could at the same time be used
>   for a small performance optimization: At least for HAP vCPU-s
>   I don't think we really need the TLB part of the flushes here.
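
For the first of these - syncing ahead of the vcpu_destroy() call in
common code - the simplest shape would be something like the sketch
below (an illustration of the idea only; the (untested!) patch referred
to above is not reproduced in this message):

    /*
     * Complete any lazy context switch on this pCPU now, so that
     * per_cpu(curr_vcpu) no longer points at a foreign vCPU by the
     * time the destruction path clears VMCSes, instead of leaving
     * that to a later TLB flush IPI (which is what races with
     * vmcs_clear() above).
     */
    sync_local_execstate();
    vcpu_destroy(v);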

Btw., for this second aspect, below is what I have in mind.

Jan

x86: make CPU state flush requests explicit

Having this be an implied side effect of a TLB flush is not very nice:
It could (at least in theory) lead to unintended state flushes (see e.g.
https://lists.xenproject.org/archives/html/xen-devel/2017-11/msg00187.html 
for context). Introduce a flag to be used in the two places actually
wanting the state flushed, and conditionalize the
__sync_local_execstate() invocation in the IPI handler accordingly.

At the same time also conditionalize the flush_area_local() invocations,
to short-circuit the function ending up as a no-op anyway.

Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
---
I first thought we could also suppress the TLB flush part in the context
switch cases for HAP vCPU-s, but the per-domain mappings require that to
happen.

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -1699,7 +1699,7 @@ void context_switch(struct vcpu *prev, s
                   !cpumask_empty(&dirty_mask)) )
     {
         /* Other cpus call __sync_local_execstate from flush ipi handler. */
-        flush_tlb_mask(&dirty_mask);
+        flush_mask(&dirty_mask, FLUSH_TLB | FLUSH_STATE);
     }
 
     if ( prev != next )
@@ -1808,7 +1808,7 @@ void sync_vcpu_execstate(struct vcpu *v)
         sync_local_execstate();
 
     /* Other cpus call __sync_local_execstate from flush ipi handler. */
-    flush_tlb_mask(v->vcpu_dirty_cpumask);
+    flush_mask(v->vcpu_dirty_cpumask, FLUSH_TLB | FLUSH_STATE);
 }
 
 static int relinquish_memory(
--- unstable.orig/xen/arch/x86/smp.c
+++ unstable/xen/arch/x86/smp.c
@@ -207,9 +207,10 @@ void invalidate_interrupt(struct cpu_use
     unsigned int flags = flush_flags;
     ack_APIC_irq();
     perfc_incr(ipis);
-    if ( __sync_local_execstate() )
+    if ( (flags & FLUSH_STATE) && __sync_local_execstate() )
         flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
-    flush_area_local(flush_va, flags);
+    if ( flags & ~(FLUSH_STATE | FLUSH_ORDER_MASK) )
+        flush_area_local(flush_va, flags);
     cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
 }
 
@@ -219,7 +220,8 @@ void flush_area_mask(const cpumask_t *ma
 
     ASSERT(local_irq_is_enabled());
 
-    if ( cpumask_test_cpu(cpu, mask) )
+    if ( (flags & ~(FLUSH_STATE | FLUSH_ORDER_MASK)) &&
+         cpumask_test_cpu(cpu, mask) )
         flags = flush_area_local(va, flags);
 
     if ( (flags & ~FLUSH_ORDER_MASK) &&
--- unstable.orig/xen/include/asm-x86/flushtlb.h
+++ unstable/xen/include/asm-x86/flushtlb.h
@@ -101,6 +101,8 @@ void write_cr3(unsigned long cr3);
 #define FLUSH_CACHE      0x400
  /* VA for the flush has a valid mapping */
 #define FLUSH_VA_VALID   0x800
+ /* Flush CPU state */
+#define FLUSH_STATE      0x1000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);
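
As a usage illustration (hypothetical caller, not part of the patch):
with this change a caller has to request the state flush explicitly,
while a plain TLB flush no longer implies it:

    /* Flush TLBs and force completion of lazy context switches. */
    flush_mask(mask, FLUSH_TLB | FLUSH_STATE);

    /* Pure TLB flush - the IPI handler now leaves exec state alone. */
    flush_mask(mask, FLUSH_TLB);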


