Re: [Xen-devel] 4.2.1: Poor write performance for DomU.



On 06/09/13 23:33, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 05, 2013 at 06:28:25PM +1000, Steven Haigh wrote:
>> On 21/08/13 02:48, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Mar 25, 2013 at 01:21:09PM +1100, Steven Haigh wrote:
>>>> So, based on my tests yesterday, I decided to break the RAID6 and
>>>> pull a drive out of it to test directly on the 2Tb drives in
>>>> question.
>>>>
>>>> The array in question:
>>>> # cat /proc/mdstat
>>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>>> md2 : active raid6 sdd[4] sdc[0] sde[1] sdf[5]
>>>>       3907026688 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
>>>>
>>>> # mdadm /dev/md2 --fail /dev/sdf
>>>> mdadm: set /dev/sdf faulty in /dev/md2
>>>> # mdadm /dev/md2 --remove /dev/sdf
>>>> mdadm: hot removed /dev/sdf from /dev/md2
>>>>
>>>> So, all tests are to be done on /dev/sdf.
>>>> Model Family:     Seagate SV35
>>>> Device Model:     ST2000VX000-9YW164
>>>> Serial Number:    Z1E17C3X
>>>> LU WWN Device Id: 5 000c50 04e1bc6f0
>>>> Firmware Version: CV13
>>>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>
>>>> From the Dom0:
>>>> # dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 30.7691 s, 140 MB/s
>>>>
>>>> Create a single partition on the drive, and format it with ext4:
>>>> Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
>>>> 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>> Disk identifier: 0x98d8baaf
>>>>
>>>>    Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sdf1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> Command (m for help): w
>>>>
>>>> # mkfs.ext4 -j /dev/sdf1
>>>> ......
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> Mount it on the Dom0:
>>>> # mount /dev/sdf1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>> xenhost.lan.crc. 2G   425  94 133607  24 60544  12   973  95 209114  17 296.4   6
>>>> Latency             70971us     190ms     221ms   40369us   17657us     164ms
>>>>
>>>> So from the Dom0: 133MB/sec write, 209MB/sec read.
>>>>
>>>> Now, I'll attach the full disk to a DomU:
>>>> # xm block-attach zeus.vm phy:/dev/sdf xvdc w
>>>>
>>>> And we'll test from the DomU.
>>>>
>>>> # dd if=/dev/zero of=/dev/xvdc bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 32.318 s, 133 MB/s
>>>>
>>>> Partition the same as in the Dom0 and create an ext4 filesystem on it:
>>>>
>>>> I notice something interesting here. In the Dom0, the device is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>>
>>>> In the DomU, it is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>>
>>>> Not sure if this could be related - but continuing testing:
>>>>     Device Boot      Start         End      Blocks   Id  System
>>>> /dev/xvdc1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> # mkfs.ext4 -j /dev/xvdc1
>>>> ....
>>>> Allocating group tables: done
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> # mount /dev/xvdc1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>> zeus.crc.id.au   2G   396  99 116530  23 50451  15  1035  99 176407  23 313.4   9
>>>> Latency             34615us     130ms     128ms   33316us   74401us     130ms
>>>>
>>>> So still... 116MB/sec write, 176MB/sec read to the physical device
>>>> from the DomU. More than acceptable.
>>>>
>>>> It leaves me to wonder... could the Dom0 seeing the drive as having
>>>> 4096-byte sectors, while the DomU sees it as 512-byte sectors, be
>>>> causing an issue?
>>>
>>> There is a certain overhead in it. I still have this in my mailbox,
>>> so I am not sure whether this issue ever got resolved. I know that the
>>> indirect descriptor patches in Xen blkback and blkfront are meant to
>>> resolve some of these issues - by being able to carry a bigger payload.
>>>
>>> Did you ever try v3.11 kernel in both dom0 and domU? Thanks.
>>
>> Ok, so I finally got around to building kernel 3.11 RPMs today for
>> testing. I upgraded both the Dom0 and DomU to the same kernel:
> 
> Woohoo!
>>
>> DomU:
>> # dmesg | grep blkfront
>> blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>> blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>>
>> Looks good.
>>
>> Transfer tests using bonnie++ as per before:
>> # bonnie -d . -u 0:0
>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>> zeus.crc.id.au   2G   603  92 58250   9 62248  14   886  99 295757  30 492.3  13
>> Latency             27305us     124ms     158ms   34222us   16865us     374ms
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> zeus.crc.id.au      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
>>                  16 10048  22 +++++ +++ 17849  29 11109  25 +++++ +++ 18389  31
>> Latency             17775us     154us     180us   16008us      38us      58us
>>
>> Still seems to be a massive discrepancy between Dom0 and DomU write
>> speeds. Interestingly, sequential block reads are nearly 300MB/sec,
>> yet sequential writes are only ~58MB/sec.
> 
> OK, so the other thing people were pointing out is that you can use
> the xen-blkfront.max parameter. By default it is 32, but try 8.
> Or 64. Or 256.

Ahh - interesting.

I used the following:
Kernel command line: ro root=/dev/xvda rd_NO_LUKS rd_NO_DM
LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us
crashkernel=auto console=hvc0 xen-blkfront.max=X
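
(Each run below is with the DomU booted with the corresponding
xen-blkfront.max value. As a sanity check that the setting actually took
effect, the module parameter should be readable back out of sysfs - the
exact path here is from memory, so treat it as an assumption:

# cat /sys/module/xen_blkfront/parameters/max
)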

8:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   696  92 50906   7 46102  11  1013  97 256784  27 496.5  10
Latency             24374us     199ms     117ms   30855us   38008us   85175us

16:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   675  92 58078   8 57585  13  1005  97 262735  25 505.6  10
Latency             24412us     187ms     183ms   23661us   53850us     232ms

32:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   698  92 57416   8 63328  13  1063  97 267154  24 498.2  12
Latency             24264us     199ms   81362us   33144us   22526us     237ms

64:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   574  86 88447  13 68988  17   897  97 265128  27 493.7  13

128:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   702  97 107638  14 70158  15  1045  97 255596  24 491.0  12
Latency             27279us   17553us     134ms   29771us   38392us   65761us

256:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   689  91 102554  14 67337  15  1012  97 262475  24 484.4  12
Latency             20642us     104ms     189ms   36624us   45286us   80023us

So, as a nice summary (sequential block writes):
8: 50MB/sec
16: 58MB/sec
32: 57MB/sec
64: 88MB/sec
128: 107MB/sec
256: 102MB/sec

So, maybe it's a coincidence, maybe it isn't - but the best value
(allowing for margin of error) seems to be 128 - which happens to match
the 128k chunk size of the underlying RAID6 array on the Dom0.

# cat /proc/mdstat
md2 : active raid6 sdd[5] sdc[4] sdf[1] sde[0]
      3906766592 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
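
(Back-of-the-envelope, assuming each indirect segment carries one 4k page:
max=128 allows 128 x 4k = 512k per request, i.e. two full 256k data stripes
(2 data disks x 128k chunk) of this 4-disk RAID6, whereas max=32 caps a
request at 128k - which would at least fit the trend above. The chunk size
itself is easy to double-check:

# mdadm --detail /dev/md2 | grep 'Chunk Size'
)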

> The indirect descriptor allows us to put more I/Os on the ring - and
> I am hoping that will:
>  a) solve your problem

Well, it looks like this solves the issue - at least, increasing the max
almost doubles the write speed - with no change to read speeds (within
margin of error).

>  b) not solve your problem, but demonstrate that the issue is not with
>     the ring, but with something else making your writes slower.
> 
> Hmm, are you by any chance using O_DIRECT when running bonnie++ in
> dom0? xen-blkback tacks O_DIRECT onto all write requests. This is
> done so as not to use the dom0 page cache - otherwise you end up with
> double buffering, where the writes look insanely fast - but with
> absolutely no safety.
> 
> If you want to try disabling that (so no O_DIRECT), I would do this
> little change:
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index bf4b9d2..823b629 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1139,7 +1139,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>                 break;
>         case BLKIF_OP_WRITE:
>                 blkif->st_wr_req++;
> -               operation = WRITE_ODIRECT;
> +               operation = WRITE;
>                 break;
>         case BLKIF_OP_WRITE_BARRIER:
>                 drain = true;

With the above results, is this still useful?
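
(For what it's worth, the double-buffering effect described above is easy
to see from the Dom0 with dd - a rough illustration only, and destructive
to whatever is on the target disk:

# dd if=/dev/zero of=/dev/sdf bs=1M count=4096               <- buffered, via the dom0 page cache
# dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct  <- O_DIRECT, bypasses the cache, as blkback forces
)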

-- 
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299



 

