
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue



Hey Konrad, 

Thanks for running the tests. Very useful data.

Re: Experiment to show latency improvement

I never ran anything on ramdisk.

You should be able to see the latency benefit with the 'orion' tool, but I am
sure other tools can be used as well. For a volume backed by a single disk
drive, keep the number of small random I/Os outstanding at 2 (I think the
"num_small" parameter in orion should do the job) with a 50-50 mix of
writes and reads. Measure the latencies reported by the guest and by Dom-0
and compare them. For LVM volumes that present multiple drives as a single
LUN (inside the guest), the latency improvement will be highest when the
number of outstanding I/Os is 2x the number of spindles. This is the
'moderate I/O' scenario I was describing, and you should see a significant
improvement in latencies.
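
If fio is handier than orion, something along these lines should
approximate the single-spindle case (just a sketch - the device path,
block size and runtime below are placeholders, not what I actually ran):

  # 2 outstanding small random I/Os, 50/50 read/write mix, direct I/O so
  # the page cache stays out of the picture. Compare the latencies fio
  # reports in the guest against what Dom-0 sees for the same device.
  fio --name=moderate-io --filename=/dev/xvdb --direct=1 --ioengine=libaio \
      --rw=randrw --rwmixwrite=50 --bs=4k --iodepth=2 \
      --time_based --runtime=120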


If you let the page cache perform the sequential I/O using dd or another
non-direct sequential I/O generation tool, you should find that the
interrupt rate doesn't go up under high I/O load. Thinking about this, I
think the burstiness of I/O submission as seen by the driver is also a key
player, particularly in the absence of the I/O coalescing waits introduced
by the I/O scheduler. Page cache draining is notoriously bursty.
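
For example (path and size are just placeholders), a plain buffered dd
goes through the page cache and then gets drained to the block layer in
bursts:

  # Non-direct sequential write - no oflag=direct, so the page cache
  # absorbs the writes and flushes them out in bursts.
  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096
  sync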

>>queue depth of 256.

What 'queue depth' is this? If I am not wrong, blkfront-blkback is
restricted to ~32 max pending I/Os due to the limit of one page being used
for the ring (mailbox) entries, no?
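(If I remember the math right: one 4K ring page minus the ~64-byte ring
header, divided by ~112 bytes per request, gives ~36 slots, which the ring
macros round down to a power of two, i.e. 32 outstanding requests.)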

>>But to my surprise the case where the I/O latency is high, the interrupt
>>generation was quite small

If this patch results in an extra interrupt, it will very likely result in
a reduction in latency for the next I/O. If the increase in interrupt
generation is not high, then the number of I/Os whose latencies this patch
has improved is low. It looks like your workload belonged to this category.
Perhaps that's why you didn't see much of an improvement in overall
performance? I think this is close to the high I/O workload scenario I
described.

>>But where the I/O latency was very very small (4 microseconds) the
>>interrupt generation was on average about 20K/s.

This is not a scenario I tested, but the results aren't surprising. This
isn't the high I/O load I was describing though (I didn't test ramdisk);
an SSD is probably the closest real-world workload.
An increase of 20K/sec means this patch very likely improved the latency of
20K I/Os per sec, although the absolute value of the latency improvement
would be smaller in this case. A 20K/sec interrupt rate (50 usec between
interrupts) is something I would be comfortable with if those interrupts
directly translate to latency improvements for the users. The graphs seem
to indicate a 5% increase in throughput for this case - am I reading the
graphs right?

Overall, very useful tests indeed, and I haven't seen anything too
concerning or unexpected, except that I don't think you have seen the 50+%
latency benefit that the patch got me in my moderate I/O benchmark :-)
Feel free to ping me offline if you aren't able to see the latency impact
using the 'moderate I/O' methodology described above.

About IRQ coalescing: Stepping back a bit, there are a few different use
cases that an irq coalescing mechanism would be useful for:

1. Latency sensitive workload: Wait time of 10s of usecs. Particularly
useful for SSDs. 
2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs,
which essentially caps the theoretical interrupt rate at 5K/sec.
3. Excessive CPU consumption mitigation: This is similar to (2) but
includes the case of malicious guests. Perhaps not a big concern unless
you have lots of drives attached to each guest.

I suspect the implementations for (1) and (2) would be different (spin vs
sleep, perhaps). (3) can't be implemented by manipulating 'req_event',
since a guest has the ability to abuse the irq channel independent of what
'blkback' tries to tell 'blkfront' via 'req_event' manipulation.

(3) could be implemented in the hypervisor as a generic irq throttler that
could be leveraged for all irqs heading to Dom-0 from DomUs including
blkback/netback. Such a mechanism could potentially solve (1) and/or (2)
as well. Thoughts ?

One crude way to address (3) for the 'many disk drives' scenario is to pin
all/most blkback interrupts for an instance to the same CPU core in Dom-0
and throttle down the thread wake-ups (wake_up(&blkif->wq) in
blkif_notify_work) that usually result in IPIs. Not an elegant solution,
but it might be a good crutch.
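
(Illustrative only - the irq number below is made up; you would look up
the real one in /proc/interrupts for the blkif instance in question.)

  # Pin a blkif event-channel irq to CPU0 in Dom-0:
  echo 1 > /proc/irq/123/smp_affinity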

Another angle to (1) and (2) is whether these irq coalesce settings should
be controllable by the guest, perhaps within limits set by the
administrator. 

Thoughts ? Suggestions ?

Konrad, I'd love to help out if you are already working on something around
irq coalescing. Otherwise, when I have irq coalescing functionality that can
be consumed by the community, I will certainly submit it.

Meanwhile, I wouldn't want to deny Xen users the advantage of this patch
just because there is no irq coalescing functionality yet, particularly
since the downside is very minimal on the blkfront-blkback stack. My 2
cents..

Thanks much Konrad,

- Pradeep Vincent




On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:

>On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:
>> > >>what were the numbers when it came to high bandwidth numbers
>> > 
>> > Under high I/O workload, where the blkfront would fill up the queue as
>> > blkback works the queue, the I/O latency problem in question doesn't
>> > manifest itself and as a result this patch doesn't make much of a
>> > difference in terms of interrupt rate. My benchmarks didn't show any
>> > significant effect.
>> 
>> I have to rerun my benchmarks. Under high load (so 64Kb, four threads
>> writing as much as they can to an iSCSI disk), the IRQ rate for each
>> blkif went from 2-3/sec to ~5K/sec. But I did not do a good
>> job on capturing the submission latency to see if the I/Os get the
>> response back as fast (or the same) as without your patch.
>> 
>> And the iSCSI disk on the target side was a RAMdisk, so latency
>> was quite small which is not fair to your problem.
>> 
>> Do you have a program to measure the latency for the workload you
>> had encountered? I would like to run those numbers myself.
>
>Ran some more benchmarks over this week. This time I tried to run it on:
>
> - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so
>   the latency is set to 1msec).
> - scsi_debug delay=0 (no delay and as fast as possible. Comes out to be
>   about 4 microseconds completion with a queue depth of one with 32K I/Os).
> - local SATAI 80GB ST3808110AS. Still running as it is quite slow.
>
>With only one PV guest doing a round (three times) of two threads randomly
>writing I/Os with a queue depth of 256. Then a different round of four
>threads writing/reading (80/20) 512 bytes up to 64K randomly over the
>disk.
>
>I used the attached patch against #master
>(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git)
>to gauge how well we are doing (and what the interrupt generation rate
>is).
>
>These workloads I think would be considered 'high I/O' and I was expecting
>your patch to not have any influence on the numbers.
>
>But to my surprise, in the case where the I/O latency is high, the
>interrupt generation was quite small. But where the I/O latency was very
>very small (4 microseconds), the interrupt generation was on average about
>20K/s. And this is with a queue depth of 256 with four threads. I was
>expecting the opposite. Hence I am quite curious to see your use case.
>
>What do you consider the middle I/O and low I/O cases? Do you use 'fio'
>for your testing?
>
>With the high I/O load, the numbers came out to give us about a 1% benefit
>with your patch. However, I am worried (maybe unnecessarily?) about the 20K
>interrupt generation when the iometer tests kicked in (this was only when
>using the unrealistic 'scsi_debug' drive).
>
>The picture of this using iSCSI target:
>http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png
>
>And when done on top of local RAMdisk:
>http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png
>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

