[Xen-devel] Xen, oprofile, perf, PEBS, event counters, PVHVM, PV
I've been looking at doing some perf analysis of the block and network backends to get a better idea of whether the things I spotted in the ring protocol are indeed a problem. But I ran into problems actually trying to use the performance tools. What I am wondering is whether the folks on this email have also run into similar issues. This is with the oprofile support in the Linux kernel backported by Michel Petullo and fixed by me (though I might have introduced bugs too): http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/oprofile.v1 with Xen 4.3-unstable.

a). 32/64 compat is missing backtrace support. If you run with a 32-bit dom0 and try to set the backtrace depth, the hypervisor ends up with some huge number. It may be that some other hypercalls need compat tweaks as well.

b). The 32-bit dom0 oprofile toolstack truncates the EIP of 64-bit guests (or of the hypervisor). I am not really sure how to solve that except to just not run 64-bit guests/hypervisor with a 32-bit dom0, or to make oprofile and its tools 64-bit aware. The reverse problem does not exist - I can profile 32-bit guests using a 64-bit dom0. (A minimal sketch of the failure mode behind (a) and (b) follows item (d) below.)

c). HVM guest tracing. I tried running oprofile with a PVHVM Linux guest (32 or 64-bit) and I get data, but /dev/oprofile/* worryingly tells me that sample_invalid_eip is about 10%. So 10% of the sample data is unknown.

c1). HVM did not profile on Intel (I had not tried AMD). Andrew Thomas suggested a patch which did indeed fix the problem (see attached).

d). opreport. I had the most difficult time understanding it. I finally figured out that it collapses the guests' vcpu information onto the domain 0 vcpus. So if you run something like this ('xl vcpu-list'):

Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
Domain-0    0     0    0  r--       27.1  0
Domain-0    0     1    1  -b-        6.3  1
Domain-0    0     2    2  -b-        5.3  2
Domain-0    0     3    3  ---       10.6  3
backend     1     0    1  r--       22.2  0-1
backend     1     1    0  -b-       23.7  0-1
netperf     2     0    2  -b-       11.0  2
netserver   3     0    3  r--       12.8  3

opreport is going to show me netperf vcpu0, netserver vcpu0 and backend vcpu0 all on domain 0's vcpu0.
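To make items (a) and (b) concrete, here is a minimal stand-alone sketch (plain C, nothing Xen- or oprofile-specific, and the values are invented) of the two failure modes: a 64-bit RIP being kept in a 32-bit field by a 32-bit toolstack, and a 32-bit value being widened without a compat shim and showing up as a huge number:

/* Illustration only: not code from oprofile or Xen. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t rip64   = 0xffffffff81001234ULL; /* a made-up 64-bit kernel RIP */
    uint32_t eip32   = (uint32_t)rip64;       /* what a 32-bit tool ends up storing */

    int32_t  depth32 = -1;                    /* e.g. an unchecked backtrace depth */
    uint64_t depth64 = (uint64_t)(int64_t)depth32; /* naive sign-extending widen */

    printf("64-bit RIP 0x%016llx truncates to 0x%08x\n",
           (unsigned long long)rip64, (unsigned int)eip32);
    printf("32-bit -1 widens to %llu\n", (unsigned long long)depth64);
    return 0;
}

Compiled and run, this prints 0x81001234 for the RIP and 18446744073709551615 for the depth, which matches the "huge number" symptom in (a).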
Back to (d): so if I run oprofile long enough and do:

opreport -l vmlinux-netperf xen-syms -t

CPU: Intel Sandy Bridge microarchitecture, speed 3092.89 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
         Samples on CPU 0  Samples on CPU 1  Samples on CPU 2  Samples on CPU 3
samples  %        samples  %  samples  %  samples  %  image name       app name  symbol name
13058    3.5286   0        0  0        0  0        0  xen-syms         qemu-dm   get_page_from_gfn_p2m
12671    3.4240   0        0  0        0  0        0  xen-syms         qemu-dm   hap_p2m_ga_to_gfn_4_levels
11382    3.0757   0        0  0        0  0        0  xen-syms         qemu-dm   hap_gva_to_gfn_4_levels
9276     2.5066   0        0  0        0  0        0  xen-syms         qemu-dm   __hvm_copy
9146     2.4715   0        0  0        0  0        0  xen-syms         qemu-dm   paging_gva_to_gfn
8069     2.1804   0        0  0        0  0        0  xen-syms         qemu-dm   guest_walk_tables_4_levels
7228     1.9532   0        0  0        0  0        0  xen-syms         qemu-dm   __get_gfn_type_access
6387     1.7259   0        0  0        0  0        0  xen-syms         qemu-dm   gnttab_copy
6216     1.6797   0        0  0        0  0        0  xen-syms         qemu-dm   ept_get_entry
6091     1.6459   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   inet_sendmsg
5954     1.6089   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   sock_sendmsg
5865     1.5849   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   handle_irq_event_percpu
5851     1.5811   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   handle_irq_event
5771     1.5595   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   xennet_interrupt
5697     1.5395   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   xennet_tx_buf_gc
5567     1.5043   0        0  0        0  0        0  xen-syms         qemu-dm   do_grant_table_op
5520     1.4916   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   sys_sendto
5116     1.3825   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   system_call_fastpath
4857     1.3125   0        0  0        0  0        0  xen-syms         qemu-dm   memcpy
4429     1.1968   0        0  0        0  0        0  xen-syms         qemu-dm   hvm_grant_table_op
4350     1.1755   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   gnttab_end_foreign_access_ref
4326     1.1690   0        0  0        0  0        0  xen-syms         qemu-dm   page_get_owner_and_reference
4256     1.1501   0        0  0        0  0        0  xen-syms         qemu-dm   hvm_do_hypercall
4205     1.1363   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   gnttab_end_foreign_access_ref_v1
4056     1.0960   0        0  0        0  0        0  vmlinux-netperf  qemu-dm   __copy_user_nocache
3817     1.0314   0        0  0        0  0        0  xen-syms         qemu-dm   map_domain_gfn

which seems valid enough - vcpu0 on _all_ of the guests is where the action is happening. But that is not necessarily the right pCPU!
But if I do it for the backend (which has two vcpus):

CPU: Intel Sandy Bridge microarchitecture, speed 3092.89 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
         Samples on CPU 0  Samples on CPU 1  Samples on CPU 2  Samples on CPU 3
samples  %        samples  %  samples  %  samples  %  image name       app name  symbol name
13058    4.2539   0        0  0        0  0        0  xen-syms         qemu-dm   get_page_from_gfn_p2m
12671    4.1278   0        0  0        0  0        0  xen-syms         qemu-dm   hap_p2m_ga_to_gfn_4_levels
11382    3.7079   0        0  0        0  0        0  xen-syms         qemu-dm   hap_gva_to_gfn_4_levels
9276     3.0219   0        0  0        0  0        0  xen-syms         qemu-dm   __hvm_copy
9146     2.9795   0        0  0        0  0        0  xen-syms         qemu-dm   paging_gva_to_gfn
8069     2.6286   0        0  0        0  0        0  xen-syms         qemu-dm   guest_walk_tables_4_levels
7228     2.3547   0        0  0        0  0        0  xen-syms         qemu-dm   __get_gfn_type_access
6387     2.0807   0        0  0        0  0        0  xen-syms         qemu-dm   gnttab_copy
6216     2.0250   0        0  0        0  0        0  xen-syms         qemu-dm   ept_get_entry
5567     1.8136   0        0  0        0  0        0  xen-syms         qemu-dm   do_grant_table_op
4857     1.5823   0        0  0        0  0        0  xen-syms         qemu-dm   memcpy
4429     1.4428   0        0  0        0  0        0  xen-syms         qemu-dm   hvm_grant_table_op
4326     1.4093   0        0  0        0  0        0  xen-syms         qemu-dm   page_get_owner_and_reference
4256     1.3865   0        0  0        0  0        0  xen-syms         qemu-dm   hvm_do_hypercall
3817     1.2435   0        0  0        0  0        0  xen-syms         qemu-dm   map_domain_gfn
3661     1.1926   0        0  0        0  0        0  vmlinux-backend  qemu-dm   xen_netbk_kthread
3319     1.0812   0        0  0        0  0        0  xen-syms         qemu-dm   ept_next_level
3237     1.0545   0        0  0        0  0        0  vmlinux-backend  qemu-dm   kthread
3097     1.0089   0        0  0        0  0        0  xen-syms         qemu-dm   hvm_copy_to_guest_virt_nofault

somehow vcpu1 is gone. Albeit if I split it up:

opreport -l vmlinux-backend xen-syms -t 1 cpu:0
opreport -l vmlinux-backend xen-syms -t 1 cpu:1

I can get some values.

e). If I look at how the guests are utilized, and combine that with 'xl vcpu-list', it appears to me that pCPU0 is the most used one:

Mem: 8208716k total, 6435216k used, 1773500k free    CPUs: 4 @ 3092MHz
NAME       STATE   CPU(sec)  CPU(%)  MEM(k)   MEM(%)  MAXMEM(k)  MAXMEM(%)  VCPUS  NETS  NETTX(k)  NETRX(k)  VBDS  VBD_OO  VBD_RD  VBD_WR  VBD_RSECT  VBD_WSECT  SSID
backend    -----r      3889   172.1  2093004    25.5    2098176       25.6      2     0         0         0     1       0       9       0         96          0     0
domain-0   -----r       125     3.5  2097140    25.5    2097152       25.5      4     0         0         0     0       0       0       0          0          0     0
netperf    --b---      1064    47.8  1044444    12.7    1049600       12.8      1     0         0         0     1       0       8       0         56          0     0
netserver  -----r      1328    59.8  1044444    12.7    1049600       12.8      1     0         0         0     1       0       8       0         56          0     0

and xentrace/xenalyze tell me that the backend domain is busy:

|-- Domain 3 --|
Runstates:
  blocked:                  3   0.00s    21733 {    28204|     28204|      28204}
  partial run:         342975   6.52s    45613 {    14187|     84608|     113476}
  full run:            362709  28.23s   186763 {    83904|    369588|     399732}
  partial contention:     997   0.82s  1977180 {  5596136|   5804072|   47488112}
  concurrency_hazard:  353528   1.72s    11677 {     8977|     10056|     322393}
  full_contention:          6   0.00s     5136 {     5040|      5228|       5400}
  lost:                   864   1.37s  3814508 {351182583|1296246349| 1296246349}

(If I drill in deeper, vcpu0 is busy doing grant table operations (56% of the VMCALL exits are because of it), vcpu1 is busy getting IPIed, and /proc/cpuinfo in the guest corroborates that information.) So I would have expected the oprofile numbers to be more serious than the occasional 4% or so here and there. Oh, that is what the "-a" argument is for! (here it is with a bit of trimming done):
samples  cum. samples  %        cum. %    image name       app name  symbol name
13058           13058  4.2518    4.2518   xen-syms         qemu-dm   get_page_from_gfn_p2m
12671           25729  4.1258    8.3776   xen-syms         qemu-dm   hap_p2m_ga_to_gfn_4_levels
11382           37111  3.7061   12.0837   xen-syms         qemu-dm   hap_gva_to_gfn_4_levels
9276            46387  3.0204   15.1041   xen-syms         qemu-dm   __hvm_copy
9146            55533  2.9780   18.0822   xen-syms         qemu-dm   paging_gva_to_gfn
8069            63602  2.6274   20.7095   xen-syms         qemu-dm   guest_walk_tables_4_levels
7228            70830  2.3535   23.0630   xen-syms         qemu-dm   __get_gfn_type_access
6387            77217  2.0797   25.1427   xen-syms         qemu-dm   gnttab_copy
6216            83433  2.0240   27.1667   xen-syms         qemu-dm   ept_get_entry
5567            89000  1.8127   28.9794   xen-syms         qemu-dm   do_grant_table_op
4857            93857  1.5815   30.5609   xen-syms         qemu-dm   memcpy
4429            98286  1.4421   32.0030   xen-syms         qemu-dm   hvm_grant_table_op
4326           102612  1.4086   33.4116   xen-syms         qemu-dm   page_get_owner_and_reference
4256           106868  1.3858   34.7974   xen-syms         qemu-dm   hvm_do_hypercall
3817           110685  1.2429   36.0402   xen-syms         qemu-dm   map_domain_gfn
3661           114346  1.1921   37.2323   vmlinux-backend  qemu-dm   xen_netbk_kthread

so 36% of the pCPU time is spent in the hypervisor (I think?). Q: Can xenalyze tell me that as well? That pCPU1 is pegged at 100%?

f). oprofile counters on Intel and the hypercalls. It looks like the counters that oprofile is using are the valid MSR_P6_PERFCTR0 and MSR_P6_EVNTSEL0. It has no notion of fixed counters or uncore counters. So for example on SandyBridge it will report 8 counters (and try to clear MSR_P6_PERFCTR0+8), even though there are three event-based ones, four fixed, and one uncore. It is also missing support for some Intel architectures (Model 44 for example - the Westmere). In other words, it looks like the Linux and hypervisor oprofile support has not been maintained for some time. I don't know how it works on AMD as I only took a look at this last week. (A small counter-enumeration sketch follows item (i) below.)

g). There is no code that takes advantage of PEBS or the uncore events to get a full-featured trace (did you know that PEBS can also collect register information!).

g1). Linux perf looks to be using a garden-variety vector instead of doing it via the NMI. Could we do that too? Or would that be inappropriate? I am not sure whether that can actually be done for the event-type counters. Perhaps it is only possible for the fixed and uncore ones.

h). There are some counters in the hypervisor for the oprofile statistics, like lost samples, etc. It does not look like they are exported/printed anywhere. Perhaps a 'register_keyhandler' handler should be written to dump those (and also which domains are being profiled). (A rough sketch of that follows the attached patch at the end of this mail.)

h1). I can't find any reference to DOM_IDLE, which should probably also be accounted for? Or DOM_IO (perhaps less so, since it does not seem to be in use).

i). opreport often tells me:

warning: /domain1-apps could not be found.
warning: /domain1-modules could not be found.
warning: /domain1-xen-unknown could not be found.
warning: /domain2-apps could not be found.
warning: /domain2-modules could not be found.
warning: /domain2-xen-unknown could not be found.
warning: /domain3-apps could not be found.
warning: /domain3-modules could not be found.
warning: /domain3-xen-unknown could not be found.
warning: /vmlinux-unknown could not be found.
warning: /xen-unknown could not be found.

which is odd, b/c I did specify:

opcontrol --start --passive-domains=1,2,3 \
    --passive-images=/shared/vmlinux-backend,/shared/vmlinux-netperf,/shared/vmlinux-netserver \
    --xen=/shared/xen-syms --vmlinux=/shared/vmlinux-dom0 \
    --buffer-size=256000 \
    -c=5 --separate=cpu,kernel --event=CPU_CLK_UNHALTED:90000:0x0:1:0

so is it just the toolstack being silly?
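Coming back to item (f): instead of hard-coding eight counters starting at MSR_P6_PERFCTR0, the core PMU can be enumerated from CPUID leaf 0xA (which covers the programmable and fixed counters, but not the uncore - that has its own discovery). Here is a minimal user-space sketch, assuming gcc's cpuid.h and an Intel CPU; the in-hypervisor driver would do the equivalent once at initialisation:

/* Illustration only: print what CPUID leaf 0xA says about the core PMU. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0xA not supported\n");
        return 1;
    }
    printf("arch PMU version:         %u\n", eax & 0xff);
    printf("general-purpose counters: %u (each %u bits wide)\n",
           (eax >> 8) & 0xff, (eax >> 16) & 0xff);
    if ((eax & 0xff) > 1)
        printf("fixed-function counters:  %u (each %u bits wide)\n",
               edx & 0x1f, (edx >> 5) & 0xff);
    return 0;
}

If I read the SDM right, that is all the driver needs to stop touching non-existent MSRs; the model-specific tables would then only be needed for the event encodings themselves.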
Anyhow, I don't know if it is just me not realizing how to use 'opreport' to get per-guest or per-vcpu information, but I find it hard to use. I like how 'perf top' can instantly give an excellent idea of the whole machine's status. Maybe that is why folks in the Linux community really wanted something other than oprofile.

And it occurs to me it could be possible to make some inroads on making performance monitoring easier:

1). Fix the glaring omissions in oprofile for the new CPUs.
2). Add a register_keyhandler to get some debug info.
3). Piggyback on the oprofile hypercalls and insert some bridge in perf (lots of handwaving here). Or perhaps emulate the wrmsrs in the Linux kernel (so xen_safe_wrmsrs) and have the pvops kernel, based on the MSRs, make the hypercalls to set up the buffers, etc.
3a). New hypercalls? Intercept rdmsr/wrmsrs and stuff the right data in the initial domain? Other thoughts?
4). Extend perf to have '--xen' so it can also look at the xen-hypervisor ELF file.

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index aee1f9e..16db4fe 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -58,6 +58,9 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
+//extern void do_nmi(void);
+extern bool_t alternative_nmi;
+
 static void vmx_ctxt_switch_from(struct vcpu *v);
 static void vmx_ctxt_switch_to(struct vcpu *v);
@@ -2441,8 +2444,16 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         if ( (intr_info & INTR_INFO_INTR_TYPE_MASK) !=
              (X86_EVENTTYPE_NMI << 8) )
             goto exit_and_crash;
+        HVM_DBG_LOG(DBG_LEVEL_VMMU,
+                    "NMI eax=%lx, ebx=%lx, ecx=%lx, edx=%lx, esi=%lx, edi=%lx",
+                    (unsigned long)regs->eax, (unsigned long)regs->ebx,
+                    (unsigned long)regs->ecx, (unsigned long)regs->edx,
+                    (unsigned long)regs->esi, (unsigned long)regs->edi);
         HVMTRACE_0D(NMI);
-        self_nmi(); /* Real NMI, vector 2: normal processing. */
+        if (alternative_nmi)
+            do_nmi(regs);
+        else
+            self_nmi(); /* Real NMI, vector 2: normal processing. */
         break;
     case TRAP_machine_check:
         HVMTRACE_0D(MCE);
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 4fd6711..3e61e3e 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -40,6 +40,8 @@ static unsigned int nmi_p4_cccr_val;
 static DEFINE_PER_CPU(struct timer, nmi_timer);
 static DEFINE_PER_CPU(unsigned int, nmi_timer_ticks);
 
+bool_t alternative_nmi = 0;
+boolean_param("alternative_nmi", alternative_nmi);
 /* opt_watchdog: If true, run a watchdog NMI on each processor. */
 bool_t __initdata opt_watchdog = 0;
 boolean_param("watchdog", opt_watchdog);
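And to put something concrete behind item (h) above and item 2) of the list: a rough, untested sketch of what a xenoprof debug-key handler could look like, written against the keyhandler interface in the current 4.x tree. The handler body is only a placeholder - the lost-sample counters would first have to be exported from xen/common/xenoprof.c before they could be printed here:

/* Rough sketch only: dump which domains are currently being profiled when
 * the 'O' debug key is pressed on the Xen console. */
#include <xen/init.h>
#include <xen/keyhandler.h>
#include <xen/lib.h>
#include <xen/sched.h>

static void dump_xenoprof_state(unsigned char key)
{
    struct domain *d;

    printk("'%c' pressed -> dumping xenoprof state\n", key);
    rcu_read_lock(&domlist_read_lock);
    for_each_domain ( d )
        if ( d->xenoprof != NULL )
            printk("  d%d is being profiled\n", d->domain_id);
    rcu_read_unlock(&domlist_read_lock);
    /* TODO: also print the lost-sample/invalid-EIP counters once exported. */
}

static struct keyhandler xenoprof_keyhandler = {
    .diagnostic = 1,
    .u.fn = dump_xenoprof_state,
    .desc = "dump xenoprof state"
};

static int __init xenoprof_keyhandler_init(void)
{
    register_keyhandler('O', &xenoprof_keyhandler);
    return 0;
}
__initcall(xenoprof_keyhandler_init);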