
Re: [Xen-devel] Question about high CPU load during iperf ethernet testing



On Thu, Sep 25, 2014 at 6:10 PM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
> On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
>> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
>> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
>> >> Hi, Stefano!
>> >> Thank you for your reply!
>> >>
>> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
>> >> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
>> >> >> Hello, all!
>> >> >>
>> >> >> I am running iperf Ethernet tests on a DRA7XX_EVM board (OMAP5).
>> >> >> The Xen version is 4.4.
>> >> >> I run only Linux (kernel 3.8) as Dom0, with no other active domains
>> >> >> (for clean test results I decided not to start a DomU).
>> >> >> The iperf server runs on the host, and the iperf client runs on the
>> >> >> board with the command line "iperf -c 192.168.2.10 -w 256k -m
>> >> >> -f M -d -t 60".
>> >> >
>> >> > Just to double check: you are running the iperf test in Dom0, correct?
>> >>
>> >> Yes, iperf is running in Dom0.
>> >>
>> >> >> During the test I watched the CPU load with the top tool in Dom0
>> >> >> and saw that one VCPU is fully loaded, spending about 50% in
>> >> >> software IRQs and 50% in system time.
>> >> >> Running the same test on bare-metal Linux without Xen, I saw a CPU
>> >> >> load of about 2-4%.
>> >> >>
>> >> >> I decided to debug a bit, so I used the expression
>> >> >> "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14"
>> >> >> ";" : "=r" (_r)); _r; })" to read the timer counter before and
>> >> >> after the operations I want to measure.
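>> >> >>
>> >> >> Wrapped in a helper (the name read_counter is just mine for
>> >> >> illustration), that expression reads the 64-bit generic timer
>> >> >> counter via CP15:
>> >> >>
>> >> >> static inline uint64_t read_counter(void)
>> >> >> {
>> >> >>     uint64_t r;
>> >> >>     /* 64-bit mrrc read; %0 = low word, %H0 = high word */
>> >> >>     asm volatile("mrrc p15, 0, %0, %H0, c14" : "=r" (r));
>> >> >>     return r;
>> >> >> }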
>> >> >>
>> >> >> In this way I found that most of the CPU time is spent in the
>> >> >> functions enable_irq/disable_irq_nosync and
>> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs %0, cpsr
>> >> >> @ arch_local_irq_save" / "msr cpsr_c, %0 @ local_irq_restore").
>> >> >> When running without Xen they do not take nearly as much time.
>> >> >
>> >> > There is nothing Xen-specific in the Linux ARM implementation of
>> >> > spin_lock_irqsave/spin_unlock_irqrestore and
>> >> > enable_irq/disable_irq_nosync.
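>> >> >
>> >> > For reference, the interrupt on/off primitives underneath those
>> >> > compile down to plain CPSR accesses, with no hypercall or trap
>> >> > involved; on ARMv6+ they are roughly:
>> >> >
>> >> > static inline unsigned long arch_local_irq_save(void)
>> >> > {
>> >> >     unsigned long flags;
>> >> >     /* save CPSR, then mask IRQs */
>> >> >     asm volatile("mrs %0, cpsr  @ arch_local_irq_save\n"
>> >> >                  "cpsid i"
>> >> >                  : "=r" (flags) : : "memory", "cc");
>> >> >     return flags;
>> >> > }
>> >> >
>> >> > static inline void arch_local_irq_restore(unsigned long flags)
>> >> > {
>> >> >     /* put the saved mask bits back into CPSR */
>> >> >     asm volatile("msr cpsr_c, %0  @ local_irq_restore"
>> >> >                  : : "r" (flags) : "memory", "cc");
>> >> > }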
>> >> >
>> >>
>> >> That is strange, because my measurements show a lot of time being
>> >> spent there: for example, about 20% in spin_unlock_irqrestore
>> >> (mostly in the mrs instruction) when running in Dom0.
>> >
>> > Unless you are doing something wrong in your measurements, if you have
>> > really narrowed it down to one instruction, then I would try the same
>> > on a different SoC from another vendor to see if it is actually a
>> > hardware issue.
>> >
>> >
>> >> >> So, could anyone answer a few questions for me:
>> >> >> 1. Is this normal behaviour?
>> >> >
>> >> > No, it is not normal.
>> >> > Assuming that you assign all the memory to Dom0 and as many vcpus
>> >> > as there are physical cpus on your platform, you should get the
>> >> > same numbers as native.
>> >>
>> >> OK, so I might do something wrong.
>> >>
>> >> >> 2. Does the hypervisor trap the cpsr register? I suspect that the
>> >> >> hypervisor traps accesses to the cpsr register, which would lead
>> >> >> to additional overhead, but I can't find the place in the sources
>> >> >> where that happens.
>> >> >
>> >> > We don't trap cpsr.
>> >>
>> >> That is strange, because that was my only assumption about where the
>> >> time could be spent.
>> >> So could you please advise where to look to understand the reason
>> >> for such a high VCPU load?
>> >
>> > I don't know. When you say that arch_local_irq_save is the one taking
>> > all the time, do you actually have something like:
>> >
>> > time1 = read CNTVCT;
>> > arch_local_irq_save();
>> > time2 = read CNTVCT;
>> > printk(time2-time1);
>> >
>> > in your code?
>>
>> Almost like this, with the only difference that I accumulate the time
>> deltas in a variable, and when its value is greater than 1s I print,
>> like this:
>>
>> volatile u64 total;
>> #define SECONDS(_s)     ((s64)((_s) * 1000000000ULL))
>>
>> /* the counter ticks at 6.144 MHz: ns = ticks * 1e9 / (1000 * 6144) */
>> static inline s64 ticks_to_ns(uint64_t ticks)
>> {
>>     return muldiv64(ticks, SECONDS(1), 1000 * 6144);
>> }
>>
>> time1 = read CNTVCT;
>> arch_local_irq_save();
>> time2 = read CNTVCT;
>> total += time2 - time1;
>> if (ticks_to_ns(total) > SECONDS(1))
>> {
>>     printk("1 second spent here\n");
>>     total = 0;
>> }
>
> Doesn't that just tell you that this code is called frequently and
> hence clocks up the total faster, as opposed to telling you that the
> operation itself is slow?
>
> I'd expect IRQs to be turned on and off quite a lot under normal
> operation: certain spinlock variants disable IRQs, for example, and
> might be frequently used, as the sketch below shows.
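>
> Roughly, every spin_lock_irqsave/spin_unlock_irqrestore pair expands
> to something like:
>
>     flags = arch_local_irq_save();   /* mrs %0, cpsr; mask IRQs */
>     do_raw_spin_lock(lock);
>     /* ... critical section ... */
>     do_raw_spin_unlock(lock);
>     arch_local_irq_restore(flags);   /* msr cpsr_c, %0 */
>
> so the CPSR accesses you are timing run once per lock/unlock pair,
> however short the critical section is.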
>
> Ian.

Of course it could be due to frequent calls, but then why such a
difference with and without Xen?
I thought the number of spinlock calls was more or less the same with
Xen and without.
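
One way to check that would be to count calls as well as accumulated
time, so that a slow operation can be told apart from a frequent one.
A minimal sketch, reusing SECONDS/ticks_to_ns and the illustrative
read_counter wrapper from earlier in the thread:

static volatile u64 total;
static volatile u64 calls;
u64 time1, time2;

time1 = read_counter();
arch_local_irq_save();
time2 = read_counter();

total += time2 - time1;
calls++;

if (ticks_to_ns(total) > SECONDS(1))
{
    /* the average per-call cost separates "slow" from "frequent" */
    printk("1s over %llu calls (avg %llu ns/call)\n",
           (unsigned long long)calls,
           (unsigned long long)(ticks_to_ns(total) / calls));
    total = 0;
    calls = 0;
}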

Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic
