
Re: [Xen-devel] Question about high CPU load during iperf ethernet testing



On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
> >> Hi, Stefano!
> >> Thank you for your reply!
> >>
> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
> >> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
> >> >> Hello, all!
> >> >>
> >> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
> >> >> Xen version is 4.4.
> >> >> I run only Linux (kernel 3.8) as Dom0, with no other active domains
> >> >> (for clean test results I decided not to start a DomU).
> >> >> The iperf server is started on the host; the iperf client is started
> >> >> on the board with the command line "iperf -c 192.168.2.10 -w 256k -m
> >> >> -f M -d -t 60".
> >> >
> >> > Just to double check: you are running the iperf test in Dom0, correct?
> >>
> >> Yes, iperf is running in Dom0.
> >>
> >> >> During the test I studied CPU load with the top tool in Dom0 and saw
> >> >> that one VCPU is fully loaded, spending about 50% in software IRQs
> >> >> and 50% in system time.
> >> >> Running the same test on plain Linux without Xen, I saw that CPU load
> >> >> is about 2-4%.
> >> >>
> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm
> >> >> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
> >> >> (_r)); _r; })" to read the timer counter before and after the
> >> >> operations I wanted to measure.
> >> >>
> >> >> In this way I found that most of the CPU time is spent in the
> >> >> functions enable_irq/disable_irq_nosync and
> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr
> >> >> @ arch_local_irq_save" / "msr    cpsr_c, %0    @ local_irq_restore").
> >> >> When running without Xen they do not take so much time.
> >> >
> >> > There is nothing Xen specific in the Linux ARM implementation of
> >> > spin_lock_irqsave/spin_unlock_irqrestore and
> >> > enable_irq/disable_irq_nosync.
> >> >
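> >> > For reference, the ARMv6+ versions in Linux's
> >> > arch/arm/include/asm/irqflags.h are essentially the following (plain
> >> > CPSR accesses, no trap or hypercall involved):
> >> >
> >> > static inline unsigned long arch_local_irq_save(void)
> >> > {
> >> >         unsigned long flags;
> >> >
> >> >         asm volatile(
> >> >                 "       mrs     %0, cpsr        @ arch_local_irq_save\n"
> >> >                 "       cpsid   i"
> >> >                 : "=r" (flags) : : "memory", "cc");
> >> >         return flags;
> >> > }
> >> >
> >> > static inline void arch_local_irq_restore(unsigned long flags)
> >> > {
> >> >         asm volatile(
> >> >                 "       msr     cpsr_c, %0      @ local_irq_restore"
> >> >                 : : "r" (flags) : "memory", "cc");
> >> > }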
> >>
> >> That is strange, because my measurements show a lot of time is spent
> >> there: for example, about 20% in spin_unlock_irqrestore (mostly in the
> >> mrs instruction) when running in Dom0.
> >
> > Unless you are doing something wrong in your measurements, if you really
> > have narrowed it down to one instruction then I would try the same on a
> > different SoC from another vendor to see if it is actually a hardware issue.
> >
> >
> >> >> So, could anyone explain me some questions:
> >> >> 1. Is it normal behaviour?
> >> >
> >> > No, it is not normal.
> >> > Assuming that you assign all the memory to Dom0 and as many vcpu as
> >> > physical cpus on your platform then you should get the same numbers as
> >> > native.
> >>
> >> OK, so I might be doing something wrong.
> >>
> >> >> 2. Does the hypervisor trap the cpsr register? I suppose that the
> >> >> hypervisor traps access to the cpsr register, which would lead to
> >> >> additional overhead, but I can't find the place in the sources where
> >> >> that happens.
> >> >
> >> > We don't trap cpsr.
> >>
> >> That is strange, because that was my only assumption as to where the
> >> time could be spent.
> >> So could you please advise where to look to understand the reason for
> >> such a high VCPU load?
> >
> > I don't know. When you say that arch_local_irq_save is the one taking
> > all the time, do you actually have something like:
> >
> > time1 = read CNTVCT;
> > arch_local_irq_save();
> > time2 = read CNTVCT;
> > printk(time2-time1);
> >
> > in your code?
> 
> Almost like this, with the only difference that I accumulate the time
> difference in a variable, and when its value is greater than 1s I print
> a message, like this:
> 
> volatile u64 total;
> #define SECONDS(_s)     ((s64)((_s)  * 1000000000ULL))
> static inline s64 ticks_to_ns(uint64_t ticks)
> {
>     return muldiv64(ticks, SECONDS(1), 1000 * 6144 );
> }
> 
> time1 = read CNTVCT;
> arch_local_irq_save();
> time2 = read CNTVCT;
> total += time2-time1;
> if (ticks_to_ns(total) > SECONDS(1))
> {
>     printk("1 second spent here\n");
>     total = 0;
> }

Doesn't that just tell you that this code is called frequently and hence
clocks up the total faster, as opposed to telling you that the operation
itself is slow?
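
A rough way to separate the two would be to count the calls alongside the
tick total and print the average per call; an illustrative sketch only,
reusing the names from your snippet:

    volatile u64 total, calls;

    time1 = read CNTVCT;
    arch_local_irq_save();
    time2 = read CNTVCT;

    total += time2 - time1;
    calls++;

    if (ticks_to_ns(total) > SECONDS(1)) {
        /* the average distinguishes "called very often" from
           "each call is slow" */
        printk("%llu calls, avg %llu ticks/call\n", calls, total / calls);
        total = 0;
        calls = 0;
    }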

I'd expect IRQs to go on and off quite a lot under normal operation;
certain spinlock variants disable IRQs, for example, and might be
frequently used.

Ian.

> 
> Best regards.
> 
> Iurii Konovalenko | Senior Software Engineer
> GlobalLogic
> 