
Re: [Xen-devel] Question about high CPU load during iperf ethernet testing



On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
> >> Hi, Stefano!
> >> Thank you for your reply!
> >>
> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
> >> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
> >> >> Hello, all!
> >> >>
> >> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
> >> >> Xen version is 4.4.
> >> >> I run only Linux (kernel 3.8) as Dom0, with no other active domains
> >> >> (for clean test results I decided not to start a DomU).
> >> >> The iperf server is started on the host; the iperf client is started
> >> >> on the board with the command line "iperf -c 192.168.2.10 -w 256k -m
> >> >> -f M -d -t 60".
> >> >
> >> > Just to double check: you are running the iperf test in Dom0, correct?
> >>
> >> Yes, iperf is running in Dom0.
> >>
> >> >> During the test I studied CPU load with the top tool in Dom0 and saw
> >> >> that one VCPU is fully loaded, spending about 50% in software IRQs
> >> >> and 50% in system time.
> >> >> Running the same test on plain Linux without Xen, I saw that CPU load
> >> >> is about 2-4%.
> >> >>
> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm
> >> >> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
> >> >> (_r)); _r; })" to read the timer counter before and after the
> >> >> operations I wanted to measure.
> >> >>
> >> >> In this way I found that most of the CPU time is spent in the
> >> >> functions enable_irq/disable_irq_nosync and
> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr
> >> >> @ arch_local_irq_save" / "msr    cpsr_c, %0    @ local_irq_restore").
> >> >> When running without Xen they do not take so much time.
> >> >
> >> > There is nothing Xen specific in the Linux ARM implementation of
> >> > spin_lock_irqsave/spin_unlock_irqrestore and
> >> > enable_irq/disable_irq_nosync.
> >> >
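> >> > For reference, the ARMv6+ versions in Linux's
> >> > arch/arm/include/asm/irqflags.h are essentially the following (plain
> >> > CPSR accesses, no trap or hypercall involved):
> >> >
> >> > static inline unsigned long arch_local_irq_save(void)
> >> > {
> >> >         unsigned long flags;
> >> >
> >> >         asm volatile(
> >> >                 "       mrs     %0, cpsr        @ arch_local_irq_save\n"
> >> >                 "       cpsid   i"
> >> >                 : "=r" (flags) : : "memory", "cc");
> >> >         return flags;
> >> > }
> >> >
> >> > static inline void arch_local_irq_restore(unsigned long flags)
> >> > {
> >> >         asm volatile(
> >> >                 "       msr     cpsr_c, %0      @ local_irq_restore"
> >> >                 : : "r" (flags) : "memory", "cc");
> >> > }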
> >>
> >> That is strange, because my measurements show a lot of time is spent
> >> there: for example, about 20% in spin_unlock_irqrestore (mostly in the
> >> mrs instruction) when running in Dom0.
> >
> > Unless you are doing something wrong in your measurements, if you really
> > have narrowed it down to one instruction then I would try the same on a
> > different SoC from another vendor to see if it is actually a hardware issue.
> >
> >
> >> >> So, could anyone explain me some questions:
> >> >> 1. Is it normal behaviour?
> >> >
> >> > No, it is not normal.
> >> > Assuming that you assign all the memory to Dom0 and as many vcpu as
> >> > physical cpus on your platform then you should get the same numbers as
> >> > native.
> >>
> >> OK, so I might be doing something wrong.
> >>
> >> >> 2. Does the hypervisor trap the cpsr register? I suppose that the
> >> >> hypervisor traps access to the cpsr register, which would lead to
> >> >> additional overhead, but I can't find the place in the sources where
> >> >> that happens.
> >> >
> >> > We don't trap cpsr.
> >>
> >> That is strange, because that was my only assumption as to where the
> >> time could be spent.
> >> So could you please advise where to look to understand the reason for
> >> such a high VCPU load?
> >
> > I don't know. When you say that arch_local_irq_save is the one taking
> > all the time, do you actually have something like:
> >
> > time1 = read CNTVCT;
> > arch_local_irq_save();
> > time2 = read CNTVCT;
> > printk(time2-time1);
> >
> > in your code?
> 
> Almost like this, with the only difference that I accumulate the time
> difference in a variable, and when its value is greater than 1s I print
> a message, like this:
> 
> volatile u64 total;
> #define SECONDS(_s)     ((s64)((_s)  * 1000000000ULL))
> static inline s64 ticks_to_ns(uint64_t ticks)
> {
>     return muldiv64(ticks, SECONDS(1), 1000 * 6144 );
> }
> 
> time1 = read CNTVCT;
> arch_local_irq_save();
> time2 = read CNTVCT;
> total += time2-time1;
> if (ticks_to_ns(total) > SECONDS(1))
> {
>     printk("1 second spent here\n");
>     total = 0;
> }

Doesn't that just tell you that this code is called frequently and hence
clocks up the total faster, as opposed to telling you that the operation
itself is slow?
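
A rough way to separate the two would be to count the calls alongside the
tick total and print the average per call; an illustrative sketch only,
reusing the names from your snippet:

    volatile u64 total, calls;

    time1 = read CNTVCT;
    arch_local_irq_save();
    time2 = read CNTVCT;

    total += time2 - time1;
    calls++;

    if (ticks_to_ns(total) > SECONDS(1)) {
        /* the average distinguishes "called very often" from
           "each call is slow" */
        printk("%llu calls, avg %llu ticks/call\n", calls, total / calls);
        total = 0;
        calls = 0;
    }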

I'd expect IRQs to go on and off quite a lot under normal operation;
certain spinlock variants disable IRQs, for example, and might be
frequently used.

Ian.

> 
> Best regards.
> 
> Iurii Konovalenko | Senior Software Engineer
> GlobalLogic
> 