
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue



Hey Konrad, 

Thanks for running the tests. Very useful data.

Re: Experiment to show latency improvement

I never ran anything on ramdisk.

You should be able to see the latency benefit with the 'orion' tool, but I am
sure other tools can be used as well. For a volume backed by a single disk
drive, keep the number of small random I/Os outstanding at 2 (I think the
"num_small" parameter in orion should do the job) with a 50-50 mix of
writes and reads. Measure the latencies reported by the guest and by Dom-0
and compare them. For LVM volumes that present multiple drives as a single
LUN (inside the guest), the latency improvement will be highest when the
number of outstanding I/Os is 2x the number of spindles. This is the
'moderate I/O' scenario I was describing, and you should see a significant
improvement in latencies.
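
If fio is handier than orion, something along these lines should
approximate the single-spindle case (just a sketch - the device path,
block size and runtime below are placeholders, not what I actually ran):

  # 2 outstanding small random I/Os, 50/50 read/write mix, direct I/O so
  # the page cache stays out of the picture. Compare the latencies fio
  # reports in the guest against what Dom-0 sees for the same device.
  fio --name=moderate-io --filename=/dev/xvdb --direct=1 --ioengine=libaio \
      --rw=randrw --rwmixwrite=50 --bs=4k --iodepth=2 \
      --time_based --runtime=120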


If you let the page cache perform the sequential I/O using dd or another
non-direct sequential I/O generation tool, you should find that the
interrupt rate doesn't go up under high I/O load. Thinking about this, I
think the burstiness of I/O submission as seen by the driver is also a key
player, particularly in the absence of the I/O coalescing waits introduced
by the I/O scheduler. Page cache draining is notoriously bursty.
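
For example (path and size are just placeholders), a plain buffered dd
goes through the page cache and then gets drained to the block layer in
bursts:

  # Non-direct sequential write - no oflag=direct, so the page cache
  # absorbs the writes and flushes them out in bursts.
  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096
  sync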

>>queue depth of 256.

What 'queue depth' is this? If I am not wrong, blkfront-blkback is
restricted to ~32 max pending I/Os due to the limit of one page being used
for the ring (mailbox) entries, no?
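(If I remember the math right: one 4K ring page minus the ~64-byte ring
header, divided by ~112 bytes per request, gives ~36 slots, which the ring
macros round down to a power of two, i.e. 32 outstanding requests.)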

>>But to my surprise the case where the I/O latency is high, the interrupt
>>generation was quite small

If this patch results in an extra interrupt, it will very likely result in
a reduction in latency for the next I/O. If the increase in interrupt
generation is not high, then the number of I/Os whose latencies this patch
has improved is low. It looks like your workload belonged to this category.
Perhaps that's why you didn't see much of an improvement in overall
performance? I think this is close to the high I/O workload scenario I
described.

>>But where the I/O latency was very very small (4 microseconds) the
>>interrupt generation was on average about 20K/s.

This is not a scenario I tested, but the results aren't surprising. This
isn't the high I/O load I was describing though (I didn't test ramdisk);
an SSD is probably the closest real-world workload.
An increase of 20K/sec means this patch very likely improved the latency of
20K I/Os per sec, although the absolute value of the latency improvement
would be smaller in this case. A 20K/sec interrupt rate (50 usec between
interrupts) is something I would be comfortable with if those interrupts
directly translate to latency improvements for the users. The graphs seem
to indicate a 5% increase in throughput for this case - am I reading the
graphs right?

Overall, very useful tests indeed, and I haven't seen anything too
concerning or unexpected, except that I don't think you have seen the 50+%
latency benefit that the patch got me in my moderate I/O benchmark :-)
Feel free to ping me offline if you aren't able to see the latency impact
using the 'moderate I/O' methodology described above.

About IRQ coalescing: Stepping back a bit, there are a few different use
cases that an irq coalescing mechanism would be useful for:

1. Latency sensitive workload: Wait time of 10s of usecs. Particularly
useful for SSDs. 
2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs,
which essentially caps the theoretical interrupt rate at 5K/sec.
3. Excessive CPU consumption mitigation: This is similar to (2) but
includes the case of malicious guests. Perhaps not a big concern unless
you have lots of drives attached to each guest.

I suspect the implementations for (1) and (2) would be different (spin vs
sleep, perhaps). (3) can't be implemented by manipulating 'req_event',
since a guest has the ability to abuse the irq channel independent of what
'blkback' tries to tell 'blkfront' via 'req_event' manipulation.

(3) could be implemented in the hypervisor as a generic irq throttler that
could be leveraged for all irqs heading to Dom-0 from DomUs including
blkback/netback. Such a mechanism could potentially solve (1) and/or (2)
as well. Thoughts ?

One crude way to address (3) for the 'many disk drives' scenario is to pin
all/most blkback interrupts for an instance to the same CPU core in Dom-0
and throttle down the thread wake-ups (wake_up(&blkif->wq) in
blkif_notify_work) that usually result in IPIs. Not an elegant solution,
but it might be a good crutch.
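
(Illustrative only - the irq number below is made up; you would look up
the real one in /proc/interrupts for the blkif instance in question.)

  # Pin a blkif event-channel irq to CPU0 in Dom-0:
  echo 1 > /proc/irq/123/smp_affinity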

Another angle to (1) and (2) is whether these irq coalesce settings should
be controllable by the guest, perhaps within limits set by the
administrator. 

Thoughts ? Suggestions ?

Konrad, I'd love to help out if you are already working on something around
irq coalescing. Otherwise, when I have irq coalescing functionality that can
be consumed by the community, I will certainly submit it.

Meanwhile, I wouldn't want to deny Xen users the advantage of this patch
just because there is no irq coalescing functionality yet, particularly
since the downside is very minimal on the blkfront-blkback stack. My 2
cents..

Thanks much Konrad,

- Pradeep Vincent




On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:

>On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:
>> > >>what were the numbers when it came to high bandwidth numbers
>> > 
>> > Under high I/O workload, where the blkfront would fill up the queue as
>> > blkback works the queue, the I/O latency problem in question doesn't
>> > manifest itself and as a result this patch doesn't make much of a
>> > difference in terms of interrupt rate. My benchmarks didn't show any
>> > significant effect.
>> 
>> I have to rerun my benchmarks. Under high load (so 64Kb, four threads
>> writing as much as they can to an iSCSI disk), the IRQ rate for each
>> blkif went from 2-3/sec to ~5K/sec. But I did not do a good
>> job on capturing the submission latency to see if the I/Os get the
>> response back as fast (or the same) as without your patch.
>> 
>> And the iSCSI disk on the target side was a RAMdisk, so latency
>> was quite small which is not fair to your problem.
>> 
>> Do you have a program to measure the latency for the workload you
>> had encountered? I would like to run those numbers myself.
>
>Ran some more benchmarks over this week. This time I tried to run it on:
>
> - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so
>   the latency is set to 1msec).
> - scsi_debug delay=0 (no delay and as fast as possible. Comes out to be
>   about 4 microseconds completion with a queue depth of one with 32K I/Os).
> - local SATAI 80GB ST3808110AS. Still running as it is quite slow.
>
>With only one PV guest doing a round (three times) of two threads randomly
>writing I/Os with a queue depth of 256. Then a different round of four
>threads writing/reading (80/20) 512 bytes up to 64K randomly over the
>disk.
>
>I used the attached patch against #master
>(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git)
>to gauge how well we are doing (and what the interrupt generation rate
>is).
>
>These workloads I think would be considered 'high I/O' and I was expecting
>your patch to not have any influence on the numbers.
>
>But to my surprise, in the case where the I/O latency is high, the
>interrupt generation was quite small. But where the I/O latency was very
>very small (4 microseconds), the interrupt generation was on average about
>20K/s. And this is with a queue depth of 256 with four threads. I was
>expecting the opposite. Hence I am quite curious to see your use case.
>
>What do you consider the middle I/O and low I/O cases? Do you use 'fio'
>for your testing?
>
>With the high I/O load, the numbers came out to give us about a 1% benefit
>with your patch. However, I am worried (maybe unnecessarily?) about the 20K
>interrupt generation when the iometer tests kicked in (this was only when
>using the unrealistic 'scsi_debug' drive).
>
>The picture of this using iSCSI target:
>http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png
>
>And when done on top of local RAMdisk:
>http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png
>


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

