Re: [Xen-devel] [PATCH v1] x86/hvm: Generic instruction re-execution mechanism for execute faults
On 11/22/18 4:49 PM, Roger Pau Monné wrote:
> On Thu, Nov 22, 2018 at 02:48:07PM +0200, Razvan Cojocaru wrote:
>> On 11/22/18 12:58 PM, Roger Pau Monné wrote:
>>> On Thu, Nov 22, 2018 at 12:14:59PM +0200, Razvan Cojocaru wrote:
>>>> On 11/22/18 12:05 PM, Roger Pau Monné wrote:
>>>>> On Wed, Nov 21, 2018 at 08:55:48PM +0200, Razvan Cojocaru wrote:
>>>>>> On 11/16/18 7:04 PM, Roger Pau Monné wrote:
>>>>>>>> +        if ( a == v )
>>>>>>>> +            continue;
>>>>>>>> +
>>>>>>>> +        /* Pause, synced. */
>>>>>>>> +        while ( !a->arch.in_host )
>>>>>>> Why not use a->is_running as a way to know whether the vCPU is
>>>>>>> running?
>>>>>>>
>>>>>>> I think the logic of using vcpu_pause and expecting the running vcpu
>>>>>>> to take a vmexit and thus set in_host is wrong because a vcpu that
>>>>>>> wasn't running when vcpu_pause_nosync is called won't get scheduled
>>>>>>> anymore, thus not taking a vmexit and this function will lockup.
>>>>>>>
>>>>>>> I don't think you need the in_host boolean at all.
>>>>>>>
>>>>>>>> +            cpu_relax();
>>>>>>> Is this really better than using vcpu_pause?
>>>>>>>
>>>>>>> I assume this is done to avoid waiting on each vcpu, and instead doing
>>>>>>> it here likely means less wait time?
>>>>>>
>>>>>> The problem with plain vcpu_pause() is that we weren't able to use it,
>>>>>> for the same reason (which remains unclear as of yet) that we couldn't
>>>>>> use a->is_running: we get CPU stuck hypervisor crashes that way. Here's
>>>>>> one that uses the same logic, but loops on a->is_running instead of
>>>>>> !a->arch.in_host:
>>>
>>> [...]
>>>
>>>>>> Some scheduler magic appears to happen here where it is unclear why
>>>>>> is_running doesn't seem to end up being 0 as expected in our case. We'll
>>>>>> keep digging.
>>>>>
>>>>> There seems to be some kind of deadlock between
>>>>> vmx_start_reexecute_instruction and hap_track_dirty_vram/handle_mmio.
>>>>> Are you holding a lock while trying to put the other vcpus to sleep?
>>>>
>>>> d->arch.rexec_lock, but I don't see how that would matter in this case.
>>>
>>> The trace from pCPU#0:
>>>
>>> (XEN) [ 3668.016989] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v0)
>>> [...]
>>> (XEN) [ 3668.275417] Xen call trace:
>>> (XEN) [ 3668.278714]    [<ffff82d0801327d2>] vcpu_sleep_sync+0x40/0x71
>>> (XEN) [ 3668.284952]    [<ffff82d08010735b>] domain.c#do_domain_pause+0x33/0x4f
>>> (XEN) [ 3668.291973]    [<ffff82d08010879a>] domain_pause+0x25/0x27
>>> (XEN) [ 3668.297952]    [<ffff82d080245e69>] hap_track_dirty_vram+0x2c1/0x4a7
>>> (XEN) [ 3668.304797]    [<ffff82d0801dd8f5>] do_hvm_op+0x18be/0x2b58
>>> (XEN) [ 3668.310864]    [<ffff82d080172aca>] pv_hypercall+0x1e5/0x402
>>> (XEN) [ 3668.317017]    [<ffff82d080250899>] entry.o#test_all_events+0/0x3d
>>>
>>> Shows there's an hypercall executed from Dom0 that's trying to pause
>>> the domain, thus pausing all the vCPUs.
>>>
>>> Then pCPU#3:
>>>
>>> (XEN) [ 3669.062841] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d1v0)
>>> [...]
>>> (XEN) [ 3669.322832] Xen call trace:
>>> (XEN) [ 3669.326128]    [<ffff82d08021006a>] vmx_start_reexecute_instruction+0x107/0x68a
>>> (XEN) [ 3669.333925]    [<ffff82d080210b3e>] p2m_mem_access_check+0x551/0x64d
>>> (XEN) [ 3669.340774]    [<ffff82d0801dee9e>] hvm_hap_nested_page_fault+0x2f2/0x631
>>> (XEN) [ 3669.348051]    [<ffff82d080202c00>] vmx_vmexit_handler+0x156c/0x1e45
>>> (XEN) [ 3669.354899]    [<ffff82d08020820c>] vmx_asm_vmexit_handler+0xec/0x250
>>>
>>> Seems to be blocked in vmx_start_reexecute_instruction, and thus not
>>> getting paused and triggering the watchdog on pCPU#0?
>>>
>>> You should check on which vCPU is the trace from pCPU#0 waiting, if
>>> that's the vCPU running on pCPU#3 (d1v0) you will have to check what's
>>> taking such a long time in vmx_start_reexecute_instruction.
>>
>> Right, so this is what appears to be happening, if the output of my test
>> is to be trusted: https://pastebin.com/YEDqNuwh
>>
>> 1.
>> vmx_start_reexecute_instruction() pauses all VCPUs but self (which
>> appears to be VCPU 1):
>>
>> (XEN) [  195.427141] 0 pause_count 0
>> (XEN) [  195.427142] 2 pause_count 0
>> (XEN) [  195.427143] 3 pause_count 0
>> (XEN) [  195.427144] 4 pause_count 0
>> (XEN) [  195.427146] 5 pause_count 0
>> (XEN) [  195.427147] 6 pause_count 0
>> (XEN) [  195.427148] 7 pause_count 0
>
> The diff below doesn't show where you add this message, neither
> what's actually printing. I guess the first number is the vCPU ID, and
> the second the value of pause_count at some point?

Yes, exactly. So the above tells us that VCPUs 0 and 2-7 have been
paused (nosync) by vmx_start_reexecute_instruction(), which is now doing
a while ( a->is_running ) cpu_relax().

>> 2. The hypercall happens, which calls domain_pause(), which I've
>> modified thus:
>>
>> @@ -959,7 +961,10 @@ static void do_domain_pause(struct domain *d,
>>      atomic_inc(&d->pause_count);
>>
>>      for_each_vcpu( d, v )
>> +    {
>> +        printk("domain_pause %d\n", v->vcpu_id);
>
> Could you print both the domain and the vcpu ids?

Of course, but I think I've found the issue (please see below).

>>          sleep_fn(v);
>> +    }
>>
>>      arch_domain_pause(d);
>>  }
>>
>> and which says:
>>
>> (XEN) [  195.492064] domain_pause 0
>
> This is the hypercall code waiting for domain 1 vCPU 0 to pause?

Yes.

>> 3. At this point, according to addr2line,
>> vmx_start_reexecute_instruction() does "while ( a->is_running )
>> cpu_relax();" for all VCPUs but itself.
>
> Why don't you just start by using:
>
> for_each_vcpu( d, v )
>     if ( v != current )
>         vcpu_pause(v);
>
> Instead of open-coding it in vmx_start_reexecute_instruction.

That's the intention if we can get it to work.
>> Now, d1v0, which, if I'm reading this correctly, is the VCPU that
>> domain_pause() is stuck waiting for, does:
>>
>> (XEN) [  200.829874] Xen call trace:
>> (XEN) [  200.833166]    [<ffff82d0801278c6>] queue_read_lock_slowpath+0x25/0x4d
>> (XEN) [  200.840186]    [<ffff82d08020c1f6>] get_page_from_gfn_p2m+0x14e/0x3b0
>> (XEN) [  200.847121]    [<ffff82d080247213>] hap_p2m_ga_to_gfn_4_levels+0x48/0x299
>> (XEN) [  200.854400]    [<ffff82d080247480>] hap_gva_to_gfn_4_levels+0x1c/0x1e
>> (XEN) [  200.861331]    [<ffff82d08021275c>] paging_gva_to_gfn+0x10e/0x11d
>> (XEN) [  200.867918]    [<ffff82d0801d66e0>] hvm.c#__hvm_copy+0x98/0x37f
>> (XEN) [  200.874329]    [<ffff82d0801d848d>] hvm_fetch_from_guest_virt_nofault+0x14/0x16
>> (XEN) [  200.882130]    [<ffff82d0801d141a>] emulate.c#_hvm_emulate_one+0x118/0x2bc
>> (XEN) [  200.889496]    [<ffff82d0801d16b4>] hvm_emulate_one+0x10/0x12
>> (XEN) [  200.895735]    [<ffff82d0801e0902>] handle_mmio+0x52/0xc9
>> (XEN) [  200.901626]    [<ffff82d0801e09ba>] handle_mmio_with_translation+0x41/0x43
>> (XEN) [  200.908994]    [<ffff82d0801ded1f>] hvm_hap_nested_page_fault+0x133/0x631
>> (XEN) [  200.916271]    [<ffff82d080202c40>] vmx_vmexit_handler+0x156c/0x1e45
>> (XEN) [  200.923117]    [<ffff82d08020824c>] vmx_asm_vmexit_handler+0xec/0x250
>
> What lock is it waiting on? Is this the paging lock? If so you will
> have to figure out who is holding this lock.

It turns out that it's the p2m lock. I've looked at the code more
closely, and hvm_hap_nested_page_fault() takes a p2m lock:

1923     /*
1924      * Take a lock on the host p2m speculatively, to avoid potential
1925      * locking order problems later and to handle unshare etc.
1926      */
1927     hostp2m = p2m_get_hostp2m(currd);

then ends up calling p2m_mem_access_check() with said lock taken. Then
p2m_mem_access_check() also does a bit of gfn_lock(p2m, gfn, 0)
(although it also unlocks those by the time
vmx_start_reexecute_instruction() gets called).
And then, of course, vmx_start_reexecute_instruction() runs, gets stuck
in that loop (it doesn't matter if we manually look at is_running or
call vcpu_pause(), the effect is the same), while the other VCPU ends up
running get_page_from_gfn_p2m(), which tries to also lock the p2m and
deadlocks.

I've placed printk()s before and after p2m_read_lock(p2m); in the
if ( likely(!p2m_locked_by_me(p2m)) ) conditional in
get_page_from_gfn_p2m(), and it seems to confirm this theory - the last
thing that gets printed before the crash is the line before
p2m_read_lock(p2m).

This seems to imply that is_running is cleared much later than in_host
would have been set, so with the in_host code
vmx_start_reexecute_instruction() (and so p2m_mem_access_check() and
hvm_hap_nested_page_fault()) are able to exit in a timely manner,
allowing get_page_from_gfn_p2m() to take the p2m lock afterwards, and
the show goes on.

> Is this on top of plain staging, or do you have other changes applied
> to Xen?

My tests are done on an older, XenServer-based 4.7.5 Xen. I also have a
patch applied that dumps all the CPUs on crash (otherwise I'd only see
one of them, which is not very helpful in situations like these) that
Andrew has kindly provided.

Thanks,
Razvan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel