Re: [Xen-devel] 4.2.1: Poor write performance for DomU.



On 06/09/13 23:33, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 05, 2013 at 06:28:25PM +1000, Steven Haigh wrote:
>> On 21/08/13 02:48, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Mar 25, 2013 at 01:21:09PM +1100, Steven Haigh wrote:
>>>> So, based on my tests yesterday, I decided to break the RAID6 and
>>>> pull a drive out of it to test directly on the 2Tb drives in
>>>> question.
>>>>
>>>> The array in question:
>>>> # cat /proc/mdstat
>>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>>> md2 : active raid6 sdd[4] sdc[0] sde[1] sdf[5]
>>>>       3907026688 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
>>>>
>>>> # mdadm /dev/md2 --fail /dev/sdf
>>>> mdadm: set /dev/sdf faulty in /dev/md2
>>>> # mdadm /dev/md2 --remove /dev/sdf
>>>> mdadm: hot removed /dev/sdf from /dev/md2
>>>>
>>>> So, all tests are to be done on /dev/sdf.
>>>> Model Family:     Seagate SV35
>>>> Device Model:     ST2000VX000-9YW164
>>>> Serial Number:    Z1E17C3X
>>>> LU WWN Device Id: 5 000c50 04e1bc6f0
>>>> Firmware Version: CV13
>>>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>
>>>> From the Dom0:
>>>> # dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 30.7691 s, 140 MB/s
>>>>
>>>> Create a single partition on the drive, and format it with ext4:
>>>> Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
>>>> 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>> Disk identifier: 0x98d8baaf
>>>>
>>>>    Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sdf1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> Command (m for help): w
>>>>
>>>> # mkfs.ext4 -j /dev/sdf1
>>>> ......
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> Mount it on the Dom0:
>>>> # mount /dev/sdf1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>> xenhost.lan.crc. 2G   425  94 133607  24 60544  12   973  95 209114  17 296.4   6
>>>> Latency             70971us     190ms     221ms   40369us   17657us     164ms
>>>>
>>>> So from the Dom0: 133MB/sec write, 209MB/sec read.
>>>>
>>>> Now, I'll attach the full disk to a DomU:
>>>> # xm block-attach zeus.vm phy:/dev/sdf xvdc w
>>>>
>>>> And we'll test from the DomU.
>>>>
>>>> # dd if=/dev/zero of=/dev/xvdc bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 32.318 s, 133 MB/s
>>>>
>>>> Partition the same as in the Dom0 and create an ext4 filesystem on it:
>>>>
>>>> I notice something interesting here. In the Dom0, the device is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>>
>>>> In the DomU, it is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>>
>>>> Not sure if this could be related - but continuing testing:
>>>>     Device Boot      Start         End      Blocks   Id  System
>>>> /dev/xvdc1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> # mkfs.ext4 -j /dev/xvdc1
>>>> ....
>>>> Allocating group tables: done
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> # mount /dev/xvdc1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>>> zeus.crc.id.au   2G   396  99 116530  23 50451  15  1035  99 176407  23 313.4   9
>>>> Latency             34615us     130ms     128ms   33316us   74401us     130ms
>>>>
>>>> So still... 116MB/sec write, 176MB/sec read to the physical device
>>>> from the DomU. More than acceptable.
>>>>
>>>> It leaves me to wonder... could the Dom0 seeing the drive as having
>>>> 4096-byte sectors, while the DomU sees it as 512-byte sectors, be
>>>> causing an issue?
>>>
>>> There is a certain overhead in it. I still have this in my mailbox,
>>> so I am not sure whether this issue ever got resolved. I know that the
>>> indirect descriptor patches in Xen blkback and blkfront are meant to
>>> resolve some of these issues - by being able to carry a bigger payload.
>>>
>>> Did you ever try v3.11 kernel in both dom0 and domU? Thanks.
>>
>> Ok, so I finally got around to building kernel 3.11 RPMs today for
>> testing. I upgraded both the Dom0 and DomU to the same kernel:
> 
> Woohoo!
>>
>> DomU:
>> # dmesg | grep blkfront
>> blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>> blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
>>
>> Looks good.
>>
>> Transfer tests using bonnie++ as per before:
>> # bonnie -d . -u 0:0
>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>> zeus.crc.id.au   2G   603  92 58250   9 62248  14   886  99 295757  30 492.3  13
>> Latency             27305us     124ms     158ms   34222us   16865us     374ms
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> zeus.crc.id.au      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
>>                  16 10048  22 +++++ +++ 17849  29 11109  25 +++++ +++ 18389  31
>> Latency             17775us     154us     180us   16008us      38us      58us
>>
>> Still seems to be a massive discrepancy between Dom0 and DomU write
>> speeds. Interestingly, sequential block reads are nearly 300MB/sec,
>> yet sequential writes are only ~58MB/sec.
> 
> OK, so the other thing people were pointing out is that you can use
> the xen-blkfront.max parameter. By default it is 32, but try 8.
> Or 64. Or 256.

Ahh - interesting.

I used the following:
Kernel command line: ro root=/dev/xvda rd_NO_LUKS rd_NO_DM
LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us
crashkernel=auto console=hvc0 xen-blkfront.max=X
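
(Each run below is with the DomU booted with the corresponding
xen-blkfront.max value. As a sanity check that the setting actually took
effect, the module parameter should be readable back out of sysfs - the
exact path here is from memory, so treat it as an assumption:

# cat /sys/module/xen_blkfront/parameters/max
)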

8:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   696  92 50906   7 46102  11  1013  97 256784  27 496.5  10
Latency             24374us     199ms     117ms   30855us   38008us   85175us

16:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   675  92 58078   8 57585  13  1005  97 262735  25 505.6  10
Latency             24412us     187ms     183ms   23661us   53850us     232ms

32:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   698  92 57416   8 63328  13  1063  97 267154  24 498.2  12
Latency             24264us     199ms   81362us   33144us   22526us     237ms

64:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   574  86 88447  13 68988  17   897  97 265128  27 493.7  13

128:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   702  97 107638  14 70158  15  1045  97 255596  24 491.0  12
Latency             27279us   17553us     134ms   29771us   38392us   65761us

256:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
zeus.crc.id.au   2G   689  91 102554  14 67337  15  1012  97 262475  24 484.4  12
Latency             20642us     104ms     189ms   36624us   45286us   80023us

So, as a nice summary (sequential block writes):
8: 50MB/sec
16: 58MB/sec
32: 57MB/sec
64: 88MB/sec
128: 107MB/sec
256: 102MB/sec

So, maybe it's a coincidence, maybe it isn't - but the best value
(allowing for margin of error) seems to be 128 - which happens to match
the 128k chunk size of the underlying RAID6 array on the Dom0.

# cat /proc/mdstat
md2 : active raid6 sdd[5] sdc[4] sdf[1] sde[0]
      3906766592 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]
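
(Back-of-the-envelope, assuming each indirect segment carries one 4k page:
max=128 allows 128 x 4k = 512k per request, i.e. two full 256k data stripes
(2 data disks x 128k chunk) of this 4-disk RAID6, whereas max=32 caps a
request at 128k - which would at least fit the trend above. The chunk size
itself is easy to double-check:

# mdadm --detail /dev/md2 | grep 'Chunk Size'
)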

> The indirect descriptor allows us to put more I/Os on the ring - and
> I am hoping that will:
>  a) solve your problem

Well, it looks like this solves the issue - at least, increasing the max
almost doubles the write speed - with no change to read speeds (within
margin of error).

>  b) not solve your problem, but demonstrate that the issue is not with
>     the ring, but with something else making your writes slower.
> 
> Hmm, are you by any chance using O_DIRECT when running bonnie++ in
> dom0? xen-blkback tacks O_DIRECT onto all write requests. This is
> done so as not to use the dom0 page cache - otherwise you end up with
> double buffering, where the writes look insanely fast - but with
> absolutely no safety.
> 
> If you want to try disabling that (so no O_DIRECT), I would do this
> little change:
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index bf4b9d2..823b629 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1139,7 +1139,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>                 break;
>         case BLKIF_OP_WRITE:
>                 blkif->st_wr_req++;
> -               operation = WRITE_ODIRECT;
> +               operation = WRITE;
>                 break;
>         case BLKIF_OP_WRITE_BARRIER:
>                 drain = true;

With the above results, is this still useful?
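
(For what it's worth, the double-buffering effect described above is easy
to see from the Dom0 with dd - a rough illustration only, and destructive
to whatever is on the target disk:

# dd if=/dev/zero of=/dev/sdf bs=1M count=4096               <- buffered, via the dom0 page cache
# dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct  <- O_DIRECT, bypasses the cache, as blkback forces
)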

-- 
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299



 

