Re: [Xen-users] Bad iSCSI I/O performance on Xen 4.6
 
- To: Matthieu Cerda <matthieu.cerda@xxxxxxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx
 
- From: Jean-Louis Dupond <jean-louis@xxxxxxxxx>
 
- Date: Thu, 29 Nov 2018 12:40:51 +0100
 
- Delivery-date: Thu, 29 Nov 2018 11:42:33 +0000
 
- List-id: Xen user discussion <xen-users.lists.xenproject.org>
 
 
 
  
  
     Finally we found the root cause(s) of this :) 
    Testing on more recent kernels (4.18+) was broken by the SACK
    compression patch, which caused issues in our case.
    This was fixed in
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=86de5921a3d5dd246df661e09bdd0a6131b39ae3
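
    For anyone who wants to verify whether a given stable kernel already
    carries that fix, one way (a sketch, assuming a local clone of
    linux-stable; the v4.19.6 tag is only an example, substitute the
    version you actually run) is to ask git whether the commit is an
    ancestor of your kernel's release tag:

        # Minimal sketch: check if the SACK compression fix is in a given release.
        cd linux-stable && git fetch --tags origin
        git merge-base --is-ancestor \
            86de5921a3d5dd246df661e09bdd0a6131b39ae3 v4.19.6 \
            && echo "fix present" || echo "fix missing"
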
    With that fixed, the results were almost the same across all kernel
      ranges we tested.
      Some more troubleshooting then showed that TCP window scaling seemed
      to be incorrect.
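
      (For reference, the window scale a connection actually negotiated
      can be read back from the live socket; a sketch, where 192.168.1.10
      stands in for the real iSCSI portal and eth1 for the storage NIC:

          # TCP internals (wscale, cwnd, rtt, ...) of the iSCSI sessions (port 3260):
          ss -nti dst 192.168.1.10
          # Window scale option as negotiated in the SYN/SYN-ACK, captured
          # while logging in to the target:
          tcpdump -ni eth1 'tcp[tcpflags] & (tcp-syn) != 0 and port 3260'
      )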
       
      Finally (many thanks to Eric), we found the root cause!

      This is fixed with the following patch:
      https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg256497.html

    We went from 16.5 MB/s to 391 MB/s read speed inside a domU!
       
      Seems like case closed :) 
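
    (The read numbers in this thread are sequential read speeds; the exact
    benchmark tool isn't stated, but a direct-I/O dd read like the sketch
    below is one simple way to get a comparable figure. /dev/mapper/mpatha
    is a placeholder for the tested LUN in dom0, or e.g. /dev/xvdb for the
    disk inside a domU:

        # Sequential read test, bypassing the page cache:
        dd if=/dev/mapper/mpatha of=/dev/null bs=1M count=4096 iflag=direct
    )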
     
    On 17/10/18 17:51, Jean-Louis Dupond wrote:
     
    
      
      Even did some more tests today :) 
      Xen 4.8 and 4.10 have the same results. No change there
        unfortunately :( 
      But when playing around with dom0 CPUs and pinning, I got the
        following results (see the pinning sketch below):
        dom0 -> 2 CPUs, no pinning => 73.5 MB/s
        dom0 -> 2 CPUs + pinning   => 519 MB/s
        dom0 -> 8 CPUs + pinning   => 124 MB/s
      This was with 4.9.127-32.el6.x86_64 and Xen 4.10.2-1.el6. 
      Performance inside the VM is still really bad: 18.4 MB/s :(
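
      (The "2 CPUs + pinning" dom0 configuration above can be expressed
      with standard Xen boot options, and inspected or changed at runtime
      with xl; a sketch, the exact bootloader file depends on the distro:

          # Xen (hypervisor) command line: limit dom0 to 2 vCPUs and pin them.
          #   dom0_max_vcpus=2 dom0_vcpus_pin

          xl vcpu-list Domain-0      # current vCPU -> pCPU placement
          xl vcpu-pin Domain-0 0 0   # pin dom0 vCPU 0 to pCPU 0
          xl vcpu-pin Domain-0 1 1   # pin dom0 vCPU 1 to pCPU 1
      )
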
      Thanks 
        Jean-Louis 
       
      On 16/10/18 17:28, Jean-Louis Dupond wrote:
       
      
        
        Hi Matthieu, 
        I did all tests on dom0, and the bug mentioned there seems to
          affect only domU, so I doubt it is related.
        Thanks 
          Jean-Louis 
         
        On 16/10/18 17:11, Matthieu Cerda wrote:
         
        
          
          Hello, 
           
           
          
           
           
          We solved the issue by upgrading to the latest backport kernel
            (4.14 at the time); it seemed to be due to a loop device
            regression in Linux.
           
           
          Maybe you should try a more recent kernel?
           
           
          Cheers, 
          -- 
          Matthieu CERDA 
           
           
           
          On 16/10/2018 at 16:53, Jean-Louis Dupond wrote:
           
          
            
            Did even more testing today, and it seems like we hit two
              problems.
            On a plain 4.9.13 kernel, without Xen, I get 686 MB/s.
              With Xen (on dom0) => 58.8 MB/s
            But after upgrading to 4.9.127:
              without Xen => 161 MB/s
              with Xen    => 40.3 MB/s
            Then I tried downgrading to xen-4.6.6-8.el6.x86_64, which gives
              the following:
              With 4.9.13 kernel:  107 MB/s
              With 4.9.127 kernel: 61.9 MB/s
            Could it be the retpoline/spectre/meltdown changes?
              I tried adding 'spectre_v2=off nopti' to the boot line on the
              4.9.127 kernel (without Xen), and then the speed is 171 MB/s
              (10 MB/s faster).
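
            (Whether those mitigations are actually active on a given boot
            can be read back from sysfs on kernels that expose it, which
            makes it easier to confirm that options like the above took
            effect; a sketch:

                # Kernel's view of the Spectre/Meltdown mitigations in effect:
                grep . /sys/devices/system/cpu/vulnerabilities/*
                # Command line the kernel actually booted with:
                cat /proc/cmdline
            )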
             
            Seems like this will be a hard one to debug further :( 
             
            Any other ideas are welcome :)
            Thanks 
              Jean-Louis 
             
            On 11/10/18 17:15, Jean-Louis Dupond wrote:
             
            
              FYI, 
              After some additional debugging, I found out that on this
                machine the speed is perfect when running the stock CentOS 6
                kernel (2.6.32).
                When using a 4.9.x or 4.18.x kernel, the speed is degraded
                again.
              Speed on 2.6.32: 320 MB/s
                Speed on 4.9.x:  55.2 MB/s

                But when I disable GRO on the storage NIC, it boosts to
                157 MB/s (see the ethtool sketch below).
                That is already better, but still way below what we have on
                2.6.32 ...

                I also did tests on a plain machine without Xen, with the
                same results.
                So it doesn't look like it's Xen related, but rather
                iSCSI/kernel related.
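
                (GRO can be checked and toggled per NIC with ethtool; eth1
                below is a placeholder for the storage interface:

                    ethtool -k eth1 | grep generic-receive-offload  # current state
                    ethtool -K eth1 gro off                         # disable GRO
                )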
                 
                Thanks 
                Jean-Louis 
               
               
              On 11-10-18 11:18, Dario Faggioli wrote:
               
              
                [Adding Roger]
On Mon, 2018-10-08 at 13:10 +0200, Jean-Louis Dupond wrote:
 
                
                  Hi,
We are hitting an I/O limitation on some of our Xen hypervisors.
The hypervisors are running CentOS 6 with Xen 4.6.6-12.el6 and 4.9.105+
kernels.
The hypervisors are attached to the SAN network with 10G links, and
there is no congestion at all.
Storage is exported via iSCSI and we use multipathd for failover.
Now we see a performance of about 200 MB/s write speed, but only a poor
20-30 MB/s read speed on a LUN on the SAN.
This is while testing on dom0. Same speeds on domU.
If I do the same test on a Xen 4.4.4-34.el6 hypervisor against the same
LUN (but attached with 1G), I max out the link (100 MB/s read/write).
 
                 
                Right. But, if I've understood correctly, you're changing two things (I
mean between the two tests), i.e., the hypervisor and the NIC.
(BTW, is the dom0 kernel the same, or does that also change?)
This makes it harder to narrow things down to where the problem could
be.
What would be useful to see would be the results of running:
- Xen 4.4.4-34.el6, with 4.9.105+ dom0 kernel on the 10G NIC / host,
  and compare this with Xen 4.6.6-12.el6, with the same kernel on the 
  same NIC / host;
- Xen 4.6.6-12.el6, with 4.9.105+ dom0 kernel on the 1G NIC / host,
  and compare this with Xen 4.4.4-34.el6, with the same kernel on the
  same NIC / host.
This will tell us if there is a regression between Xen 4.4.x and Xen
4.6.x (as that is the _only_ thing that varies).
And this is assuming the versions of the dom0 kernels, and of all the
other components involved are the same. If they're not, we need to go
checking, changing one component at a time.
 
                
                  So it really looks like the Xen 4.6 hypervisors are reaching some
bottleneck. But we haven't been able to locate it yet :)
 
                 
                There seem to be issues, but from the tests you've performed so far, I
don't think we can conclude the problem is in Xen. And we need to know
at least where the problem most likely is, in order to have any chance
to find it! :-)
 
                
                  The hypervisor's dom0 has 8 vCPUs and 8 GB of RAM, which should be plenty!
 
                 
                Probably. But, just in case, have you tried increasing, e.g., the
number of dom0's vcpus? Are things like vcpu-pinning or similar features
being used? Is the host a NUMA box? (Or, more generally, what are the
characteristics of the host[s]?)
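
(For reference, the host topology and current pinning that these
questions ask about can be dumped from dom0 with xl, and the NUMA layout
with numactl if it is installed; a sketch:

    xl info -n          # host characteristics, incl. NUMA/core topology
    xl vcpu-list        # where each vCPU currently runs / is pinned
    numactl --hardware  # NUMA layout as seen by the dom0 kernel
)
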
Regards,
Dario
 
               
             
           
           
          
         
       
       
      
     
  
 _______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users 
 
    