Re: [Xen-users] Bad iSCSI I/O performance on Xen 4.6
 
- To: Matthieu Cerda <matthieu.cerda@xxxxxxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx
 
- From: Jean-Louis Dupond <jean-louis@xxxxxxxxx>
 
- Date: Thu, 29 Nov 2018 12:40:51 +0100
 
- Delivery-date: Thu, 29 Nov 2018 11:42:33 +0000
 
- List-id: Xen user discussion <xen-users.lists.xenproject.org>
 
 
 
  
  
     Finally we found the root cause(s) of this :) 
    Testing on more recent kernels (4.18+) was broken by the SACK
    compression patch, which caused issues in our case.
    This was fixed in
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=86de5921a3d5dd246df661e09bdd0a6131b39ae3
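
    For anyone who wants to verify whether a given stable kernel already
    carries that fix, one way (a sketch, assuming a local clone of
    linux-stable; the v4.19.6 tag is only an example, substitute the
    version you actually run) is to ask git whether the commit is an
    ancestor of your kernel's release tag:

        # Minimal sketch: check if the SACK compression fix is in a given release.
        cd linux-stable && git fetch --tags origin
        git merge-base --is-ancestor \
            86de5921a3d5dd246df661e09bdd0a6131b39ae3 v4.19.6 \
            && echo "fix present" || echo "fix missing"
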
    With that fixed, the results were almost the same across all kernel
      ranges we tested.
      Some more troubleshooting then showed that TCP window scaling seemed
      to be incorrect.
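
      (For reference, the window scale a connection actually negotiated
      can be read back from the live socket; a sketch, where 192.168.1.10
      stands in for the real iSCSI portal and eth1 for the storage NIC:

          # TCP internals (wscale, cwnd, rtt, ...) of the iSCSI sessions (port 3260):
          ss -nti dst 192.168.1.10
          # Window scale option as negotiated in the SYN/SYN-ACK, captured
          # while logging in to the target:
          tcpdump -ni eth1 'tcp[tcpflags] & (tcp-syn) != 0 and port 3260'
      )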
       
      Finally (many thanks to Eric), we found the root cause!

      This is fixed with the following patch:
      https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg256497.html

    We went from 16.5 MB/s to 391 MB/s read speed inside a domU!
       
      Seems like case closed :) 
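
    (The read numbers in this thread are sequential read speeds; the exact
    benchmark tool isn't stated, but a direct-I/O dd read like the sketch
    below is one simple way to get a comparable figure. /dev/mapper/mpatha
    is a placeholder for the tested LUN in dom0, or e.g. /dev/xvdb for the
    disk inside a domU:

        # Sequential read test, bypassing the page cache:
        dd if=/dev/mapper/mpatha of=/dev/null bs=1M count=4096 iflag=direct
    )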
     
    On 17/10/18 17:51, Jean-Louis Dupond wrote:
     
    
      
      Even did some more tests today :) 
      Xen 4.8 and 4.10 have the same results. No change there
        unfortunately :( 
      But when playing around with dom0 CPUs and pinning, I got the
        following results (see the pinning sketch below):
        dom0 -> 2 CPUs, no pinning => 73.5 MB/s
        dom0 -> 2 CPUs + pinning   => 519 MB/s
        dom0 -> 8 CPUs + pinning   => 124 MB/s
      This was with 4.9.127-32.el6.x86_64 and Xen 4.10.2-1.el6. 
      Performance inside the VM is still really bad: 18.4 MB/s :(
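
      (The "2 CPUs + pinning" dom0 configuration above can be expressed
      with standard Xen boot options, and inspected or changed at runtime
      with xl; a sketch, the exact bootloader file depends on the distro:

          # Xen (hypervisor) command line: limit dom0 to 2 vCPUs and pin them.
          #   dom0_max_vcpus=2 dom0_vcpus_pin

          xl vcpu-list Domain-0      # current vCPU -> pCPU placement
          xl vcpu-pin Domain-0 0 0   # pin dom0 vCPU 0 to pCPU 0
          xl vcpu-pin Domain-0 1 1   # pin dom0 vCPU 1 to pCPU 1
      )
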
      Thanks 
        Jean-Louis 
       
      On 16/10/18 17:28, Jean-Louis Dupond wrote:
       
      
        
        Hi Matthieu, 
        I did all tests on dom0, and the bug mentioned there seems to
          affect only domU, so I doubt it is related.
        Thanks 
          Jean-Louis 
         
        On 16/10/18 17:11, Matthieu Cerda wrote:
         
        
          
          Hello, 
           
           
          
           
           
          We solved the issue by upgrading to the latest backport kernel
            (4.14 at the time); it seemed to be due to a loop device
            regression in Linux.
           
           
          Maybe you should try a more recent kernel?
           
           
          Cheers, 
          -- 
          Matthieu CERDA 
           
           
           
          On 16/10/2018 at 16:53, Jean-Louis Dupond wrote:
           
          
            
            Did even more testing today, and it seems like we hit two
              problems.
            On a plain 4.9.13 kernel, without Xen, I get 686 MB/s.
              With Xen (on dom0) => 58.8 MB/s
            But after upgrading to 4.9.127:
              without Xen => 161 MB/s
              with Xen    => 40.3 MB/s
            Then I tried downgrading to xen-4.6.6-8.el6.x86_64, which gives
              the following:
              With 4.9.13 kernel:  107 MB/s
              With 4.9.127 kernel: 61.9 MB/s
            Could it be the retpoline/spectre/meltdown changes?
              I tried adding 'spectre_v2=off nopti' to the boot line on the
              4.9.127 kernel (without Xen), and then the speed is 171 MB/s
              (10 MB/s faster).
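
            (Whether those mitigations are actually active on a given boot
            can be read back from sysfs on kernels that expose it, which
            makes it easier to confirm that options like the above took
            effect; a sketch:

                # Kernel's view of the Spectre/Meltdown mitigations in effect:
                grep . /sys/devices/system/cpu/vulnerabilities/*
                # Command line the kernel actually booted with:
                cat /proc/cmdline
            )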
             
            Seems like this will be a hard one to debug further :( 
             
            Any other ideas are welcome :)
            Thanks 
              Jean-Louis 
             
            On 11/10/18 17:15, Jean-Louis Dupond wrote:
             
            
              FYI, 
              After some additional debugging, I found out that on this
                machine the speed is perfect when running the stock CentOS 6
                kernel (2.6.32).
                When using a 4.9.x or 4.18.x kernel, the speed is degraded
                again.
              Speed on 2.6.32: 320 MB/s
                Speed on 4.9.x:  55.2 MB/s

                But when I disable GRO on the storage NIC, it boosts to
                157 MB/s (see the ethtool sketch below).
                That is already better, but still way below what we have on
                2.6.32 ...

                I also did tests on a plain machine without Xen, with the
                same results.
                So it doesn't look like it's Xen related, but rather
                iSCSI/kernel related.
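
                (GRO can be checked and toggled per NIC with ethtool; eth1
                below is a placeholder for the storage interface:

                    ethtool -k eth1 | grep generic-receive-offload  # current state
                    ethtool -K eth1 gro off                         # disable GRO
                )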
                 
                Thanks 
                Jean-Louis 
               
               
              On 11-10-18 11:18, Dario Faggioli wrote:
               
              
                [Adding Roger]
On Mon, 2018-10-08 at 13:10 +0200, Jean-Louis Dupond wrote:
 
                
                  Hi,
We are hitting an I/O limitation on some of our Xen hypervisors.
The hypervisors are running CentOS 6 with Xen 4.6.6-12.el6 and 4.9.105+
kernels.
The hypervisors are attached to the SAN network with 10G links, and
there is no congestion at all.
Storage is exported via iSCSI and we use multipathd for failover.
Now we see a performance of about 200 MB/s write speed, but only a poor
20-30 MB/s read speed on a LUN on the SAN.
This is while testing on dom0. Same speeds on domU.
If I do the same test on a Xen 4.4.4-34.el6 hypervisor against the same
LUN (but attached with 1G), I max out the link (100 MB/s read/write).
 
                 
                Right. But, if I've understood correctly, you're changing two things (I
mean between the two tests), i.e., the hypervisor and the NIC.
(BTW, is the dom0 kernel the same, or does that also change?)
This makes it harder to narrow things down to where the problem could
be.
What would be useful to see would be the results of running:
- Xen 4.4.4-34.el6, with 4.9.105+ dom0 kernel on the 10G NIC / host,
  and compare this with Xen 4.6.6-12.el6, with the same kernel on the 
  same NIC / host;
- Xen 4.6.6-12.el6, with 4.9.105+ dom0 kernel on the 1G NIC / host,
  and compare this with Xen 4.4.4-34.el6, with the same kernel on the
  same NIC / host.
This will tell us if there is a regression between Xen 4.4.x and Xen
4.6.x (as that is the _only_ thing that varies).
And this is assuming the versions of the dom0 kernels, and of all the
other components involved are the same. If they're not, we need to go
checking, changing one component at a time.
 
                
                  So it really looks like the Xen 4.6 hypervisors are reaching some
bottleneck. But we haven't been able to locate it yet :)
 
                 
                There seem to be issues, but from the tests you've performed so far, I
don't think we can conclude the problem is in Xen. And we need to know
at least where the problem most likely is, in order to have any chance
to find it! :-)
 
                
                  The hypervisor's dom0 has 8 vCPUs and 8 GB of RAM, which should be plenty!
 
                 
                Probably. But, just in case, have you tried increasing, e.g., the
number of dom0's vcpus? Are things like vcpu-pinning or similar features
being used? Is the host a NUMA box? (Or, more generally, what are the
characteristics of the host[s]?)
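
(For reference, the host topology and current pinning that these
questions ask about can be dumped from dom0 with xl, and the NUMA layout
with numactl if it is installed; a sketch:

    xl info -n          # host characteristics, incl. NUMA/core topology
    xl vcpu-list        # where each vCPU currently runs / is pinned
    numactl --hardware  # NUMA layout as seen by the dom0 kernel
)
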
Regards,
Dario
 
               
             
           
           
          
         
       
       
      
     
  
 _______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users 
 
    