
Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths



>>> On 02.11.17 at 20:46, <igor.druzhinin@xxxxxxxxxx> wrote:
>> Any ideas about the root cause of the fault and suggestions on how to
>> reproduce it would be welcome. Does this crash really have something
>> to do with PML? I doubt it, because the original environment can
>> hardly be called PML-heavy.

Well, PML-heaviness doesn't matter. It's the mere fact that PML
is enabled on the vCPU being destroyed.
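
To make that concrete - a condensed sketch (not verbatim source, details
omitted) of the PML teardown path: disabling PML has to update the exec
controls of a vCPU that isn't running anywhere, and hence goes through a
vmx_vmcs_enter() / vmx_vmcs_exit() section on the destroying pCPU:

void vmx_vcpu_disable_pml(struct vcpu *v)
{
    if ( !vmx_vcpu_pml_enabled(v) )
        return;

    vmx_vmcs_enter(v);            /* load v's VMCS on this pCPU */

    v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
              v->arch.hvm_vmx.secondary_exec_control);

    vmx_vmcs_exit(v);             /* vmclear; nothing is loaded back when
                                     current is the idle vCPU */
}

That enter/exit pair is what leaves the pCPU's VMCS pointer cleared while
per_cpu(curr_vcpu) still references the lazily switched-out vCPU.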

> So we finally have a complete understanding of what's going on:
> 
> Some vCPU has just migrated to another pCPU and we switched to idle, but
> per_cpu(curr_vcpu) on the current pCPU still points to it - this is how
> the current logic works. While we're in idle we issue vcpu_destroy() for
> some other domain, which eventually calls vmx_vcpu_disable_pml() and
> trashes the VMCS pointer on the current pCPU. At this moment we get a
> TLB flush IPI from that same vCPU, which is now context switching on
> another pCPU - it appears to clean the TLB after itself. That vCPU has
> already been marked is_running=1 by the scheduler. In the IPI handler we
> enter __sync_local_execstate() and try to call vmx_ctxt_switch_from()
> for the migrated vCPU, which is supposed to call vmcs_reload() but
> doesn't, because is_running==1. The next VMWRITE crashes the hypervisor.
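
For reference, the check being skipped here is roughly of this shape
(condensed from the description above, not verbatim source):

static void vmx_ctxt_switch_from(struct vcpu *v)
{
    /* ... */

    /*
     * Reload the VMCS only when the vCPU isn't running elsewhere. In the
     * scenario above pCPU2 has already set is_running, so the reload is
     * skipped even though this pCPU's VMCS pointer was just cleared by
     * the PML teardown.
     */
    if ( !v->is_running )
        vmcs_reload(v);

    /* Subsequent VMWRITEs (e.g. from vmx_fpu_leave()) then run without a
       current VMCS and fault. */
    vmx_fpu_leave(v);
    /* ... */
}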
> 
> So the state transition diagram might look like:
> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->

I'm not really clear about who/what is "idle" here: pCPU1,
pCPU2, or yet something else? If vCPUx migrated to pCPU2,
wouldn't it be put back into runnable state right away, and
hence pCPU2 can't be idle at this point? Yet for pCPU1 I don't
think its idleness would matter much, i.e. the situation could
also arise without it becoming idle, afaics. pCPU1 reaching any point
where softirqs are processed would suffice.

> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
> 
> We can basically just fix the condition around the vmcs_reload() call,
> but I'm not completely sure that's the right way to do it - I don't
> think leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good
> idea (maybe we need to clear it). What are your thoughts?

per_cpu(curr_vcpu) can only validly be written inside
__context_switch(), hence the only way to achieve this would be to
force __context_switch() to be called earlier than (as currently) from
the TLB flush IPI handler, perhaps as in the (untested!) patch below.
Two questions then remain:
- Should we perhaps rather do this in an arch-independent way
  (i.e. ahead of the call to vcpu_destroy() in common code)?
- This deals with only a special case of the more general "TLB
  flush behind the back of a vmx_vmcs_enter() /
  vmx_vmcs_exit() section" - does this need dealing with in a
  more general way? Here I'm thinking of introducing a
  FLUSH_STATE flag to be passed to flush_mask() instead of
  the current flush_tlb_mask() in context_switch() and
  sync_vcpu_execstate(). This could at the same time be used
  for a small performance optimization: At least for HAP vCPU-s
  I don't think we really need the TLB part of the flushes here.
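
To make the FLUSH_STATE idea a little more concrete, a rough sketch
(flag name, value, and call sites purely illustrative, not existing
code):

/* Illustrative only - a new flag alongside the existing FLUSH_* ones. */
#define FLUSH_STATE 0x1000

/* context_switch() / sync_vcpu_execstate() would then use */
flush_mask(&dirty_mask, FLUSH_STATE | FLUSH_TLB);
/* ... or just FLUSH_STATE for HAP vCPU-s, dropping the TLB part, */

/* while the flush IPI handler would sync lazy vCPU state only when
   explicitly asked to, instead of doing so unconditionally: */
if ( flags & FLUSH_STATE )
    __sync_local_execstate();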

Jan

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
 
 void vcpu_destroy(struct vcpu *v)
 {
+    /*
+     * Flush all state for this vCPU before fully tearing it down. This is
+     * particularly important for HVM ones on VMX, so that this flushing of
+     * state won't happen from the TLB flush IPI handler behind the back of
+     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
+     */
+    sync_vcpu_execstate(v);
+
     xfree(v->arch.vm_event);
     v->arch.vm_event = NULL;
 

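For context, sync_vcpu_execstate() currently reads roughly like this
(from memory, abridged); this is how the hunk above arranges for the
state flush - and hence __context_switch() - to have happened before
vmx_vcpu_disable_pml()'s vmx_vmcs_enter() / vmx_vmcs_exit() section runs:

void sync_vcpu_execstate(struct vcpu *v)
{
    /* Sync synchronously when v's state is still dirty on this pCPU. */
    if ( cpumask_test_cpu(smp_processor_id(), v->vcpu_dirty_cpumask) )
        sync_local_execstate();

    /* Other pCPUs sync from the flush IPI handler. */
    flush_tlb_mask(v->vcpu_dirty_cpumask);
}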

