
Re: [Xen-devel] Poor network performance between DomU with multiqueue support



Thanks for your detailed explanation, Wei. 

I am wondering whether netfront/netback can be optimized to reach 10Gbps 
throughput between DomUs running on different hosts connected by a 10GE 
network. Currently it seems that RX is the bottleneck, which is also 
consistent with the test results on the Xen wiki: 
http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing

I am wondering which factors prevent RX from reaching higher throughput. You 
have mentioned that one reason is that the guest RX data path still uses 
grant_copy while guest TX uses grant_map to do zero-copy transmit. Do you know 
of any other factors, or of ongoing work to optimize the RX data path?
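
To make sure I understand the copy on the RX side correctly, my mental model 
is roughly the sketch below. It is based only on the public grant-table 
interface (struct gnttab_copy and gnttab_batch_copy), not on the actual 
xen-netback code, and the function and variable names (example_rx_grant_copy, 
guest_rx_ref, frag_gfn) are mine:

    /*
     * Illustration only -- not the real xen-netback RX path.
     * The backend copies one received fragment into a page that the
     * guest has already granted for RX; every byte of guest RX traffic
     * goes through this GNTTABOP_copy hypercall.
     */
    #include <linux/errno.h>
    #include <xen/grant_table.h>
    #include <xen/interface/grant_table.h>
    #include <xen/interface/xen.h>

    /* guest_rx_ref: grant ref taken from the shared RX ring
     * frag_gfn:     backend-local frame holding the received data
     * len:          must fit in the uint16_t len field of the op       */
    static int example_rx_grant_copy(domid_t guest_domid,
                                     grant_ref_t guest_rx_ref,
                                     xen_pfn_t frag_gfn,
                                     unsigned int offset,
                                     unsigned int len)
    {
            struct gnttab_copy op = {
                    /* source: the backend's own frame, addressed by gfn */
                    .source.u.gmfn = frag_gfn,
                    .source.domid  = DOMID_SELF,
                    .source.offset = offset,
                    /* destination: the guest page named by its grant ref */
                    .dest.u.ref    = guest_rx_ref,
                    .dest.domid    = guest_domid,
                    .dest.offset   = 0,
                    .len           = len,
                    .flags         = GNTCOPY_dest_gref,
            };

            gnttab_batch_copy(&op, 1);      /* issues GNTTABOP_copy */
            return op.status == GNTST_okay ? 0 : -EIO;
    }

On the guest TX side, as you said, netback instead maps the guest's granted 
pages (GNTTABOP_map_grant_ref) and no per-byte copy is needed.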

----------
zhangleiqiang (Trump)

Best Regards


> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> Sent: Thursday, December 04, 2014 9:06 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@xxxxxxxxxxxxx; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Thu, Dec 04, 2014 at 12:09:33PM +0000, Zhangleiqiang (Trump) wrote:
> [...]
> > > > However, I find another issue. Even when using 6 queues and making
> > > > sure that all 6 netback processes run with high CPU usage (indeed,
> > > > each of them runs at about 87% CPU), the overall VM receive
> > > > throughput is not much higher than the results with 4 queues: it
> > > > goes from 4.5Gbps to 5.04Gbps with 512-byte TCP messages and from
> > > > 4.3Gbps to 5.78Gbps with 1460-byte TCP messages.
> > > >
> > >
> > > I would like to ask whether you're still using the 4U4G (4 CPU, 4 GB?)
> > > configuration? If so, please make sure there are at least as many vcpus
> > > as queues.
> >
> 
> > Sorry for misleading you: 4U4G means 4 CPUs and 4 GB of memory :). I also
> > found yesterday that the max_queues setting of netback is determined by
> > min(online_cpus, module_param), so when using 6 queues in the previous
> > test I used a VM with 6 CPUs and 6 GB of memory.
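
For reference, the queue counts I observed match something like the following 
paraphrase of that rule. This is only my reading of the behaviour, not the 
driver's actual code, and netback_max_queues_param is a made-up name standing 
in for the module parameter:

    #include <linux/cpumask.h>   /* num_online_cpus() */
    #include <linux/kernel.h>    /* min_t() */

    /*
     * Paraphrase of the observed behaviour: the number of queues is
     * capped both by the module parameter and by the number of online
     * CPUs.
     */
    static unsigned int effective_max_queues(unsigned int netback_max_queues_param)
    {
            return min_t(unsigned int, netback_max_queues_param,
                         num_online_cpus());
    }
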
> 
> >
> > > > According to the test results on the wiki
> > > > (http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing),
> > > > the VM receive throughput is also lower than the VM transmit
> > > > throughput.
> > > >
> > >
> > > I think that's expected, because the guest RX data path still uses
> > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> >
> > As I understand it, the RX process is as follows:
> > 1. The physical NIC receives a packet.
> > 2. The Xen hypervisor delivers an interrupt to Dom0.
> > 3. Dom0's NIC driver does the "RX" operation, and the packet is stored
> >    into an SKB which is also owned/shared with netback.
> > 4. Netback notifies netfront through the event channel that a packet is
> >    arriving.
> > 5. Netfront grants a buffer for receiving and notifies netback of the GR
> >    (if using a grant-reuse mechanism, netfront just notifies netback of
> >    the GR) through the IO ring.
> > 6. Netback does the grant_copy to copy the packet from its SKB into the
> >    buffer referenced by the GR, and notifies netfront through the event
> >    channel.
> > 7. Netfront copies the data from the buffer into the user-level app's SKB.
> >
> > Am I right?
> 
> Step 4 is not correct; netback won't notify netfront at that point.
> 
> Step 5 is not correct; all grant refs are pre-allocated and granted before
> that.
> 
> Other steps look correct.
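
Thanks, that clarifies it. So the frontend pre-grants its RX buffers and the 
refs already sit in the ring before any packet arrives, roughly as in the 
sketch below. Again this is only my illustration built on 
gnttab_grant_foreign_access(), not the actual xen-netfront code; backend_domid 
and push_rx_request are placeholders:

    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <xen/grant_table.h>
    #include <xen/page.h>

    /*
     * Illustration only: pre-grant one RX buffer to the backend and
     * queue its grant ref on the RX ring long before any packet shows
     * up.  push_rx_request() stands in for "fill a request slot and
     * bump req_prod on the shared ring".
     */
    static int example_refill_one_rx_slot(domid_t backend_domid,
                                          void (*push_rx_request)(grant_ref_t))
    {
            struct page *rx_page;
            int ret;

            rx_page = alloc_page(GFP_KERNEL);
            if (!rx_page)
                    return -ENOMEM;

            /* Grant the backend read/write access to this frame. */
            ret = gnttab_grant_foreign_access(backend_domid,
                                              xen_page_to_gfn(rx_page), 0);
            if (ret < 0) {
                    __free_page(rx_page);
                    return ret;
            }

            /* The backend will later grant-copy packet data into
             * rx_page; the frontend only finds out via the event
             * channel. */
            push_rx_request((grant_ref_t)ret);
            return 0;
    }
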
> 
> > Why not use zero-copy transmit in the guest RX data path too?
> >
> 
> A rogue / buggy guest might hold the mapping for an arbitrarily long period
> of time.
> 
> >
> > > > I am wondering why the VM receive throughput cannot reach 8-10Gbps
> > > > under multi-queue, as VM transmit does. I also tried sending packets
> > > > directly from the local Dom0 to the DomU; the DomU receive throughput
> > > > can reach about 8-12Gbps, so I am also wondering why transmitting
> > > > packets from Dom0 to a remote DomU can only reach about 4-5Gbps.
> > >
> > > If the data goes from Dom0 to DomU, then the SKB is probably not
> > > fragmented by the network stack.  You can use tcpdump to check that.
> >
> > In our testing, the MTU is set to 1600. However, even when testing with
> > packets whose length is 1024 (smaller than 1600), the throughput from Dom0
> > to the local DomU is much higher than that from Dom0 to a remote DomU. So
> > maybe fragmentation is not the reason for it.
> >
> 
> Don't have much idea about this, sorry.
> 
> Wei.
> 
> >
> > > Wei.
> > >
> > > >
> > > > > Wei.
> > > > >
> > > > > > ----------
> > > > > > zhangleiqiang (Trump)
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > > > To: Zhangleiqiang (Trump)
> > > > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao
> > > > > > > (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between
> > > > > > > DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang
> > > > > > > (Trump)
> > > wrote:
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: xen-devel-bounces@xxxxxxxxxxxxx
> > > > > > > > > [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of
> > > > > > > > > Wei Liu
> > > > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > > > To: zhangleiqiang
> > > > > > > > > Cc: wei.liu2@xxxxxxxxxx; xen-devel@xxxxxxxxxxxxx
> > > > > > > > > Subject: Re: [Xen-devel] Poor network performance
> > > > > > > > > between DomU with multiqueue support
> > > > > > > > >
> > > > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang
> wrote:
> > > > > > > > > > Hi, all
> > > > > > > > > >     I am testing the performance of the xen
> > > > > > > > > > netfront-netback driver with multi-queue support. The
> > > > > > > > > > throughput from domU to a remote dom0 is 9.2Gb/s, but the
> > > > > > > > > > throughput from domU to a remote domU is only 3.6Gb/s, so I
> > > > > > > > > > think the bottleneck is the throughput from dom0 to the
> > > > > > > > > > local domU. However, we have done some testing and found
> > > > > > > > > > that the throughput from dom0 to the local domU is 5.8Gb/s.
> > > > > > > > > >     And if we send packets from one DomU to 3 other DomUs
> > > > > > > > > > on a different host simultaneously, the sum of the
> > > > > > > > > > throughput can reach 9Gbps. It seems like the bottleneck is
> > > > > > > > > > the receiver?
> > > > > > > > > >     After some analysis, I found that even when the
> > > > > > > > > > max_queue of netfront/back is set to 4, there are some
> > > > > > > > > > strange results, as follows:
> > > > > > > > > >     1. In domU, only one rx queue deals with softirqs.
> > > > > > > > >
> > > > > > > > > Try to bind irq to different vcpus?
> > > > > > > >
> > > > > > > > Do you mean we should try to bind the irqs to different vcpus
> > > > > > > > in DomU? I will try it now.
> > > > > > > >
> > > > > > >
> > > > > > > Yes. Given the fact that you have two backend threads running
> > > > > > > while only one DomU vcpu is busy, it smells like a
> > > > > > > misconfiguration in DomU.
> > > > > > >
> > > > > > > If this phenomenon persists after correctly binding the irqs, you
> > > > > > > might want to check that traffic is being steered correctly to
> > > > > > > the different queues.
> > > > > > >
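
As a side note, my understanding of "bind irq to different vcpus" inside the 
guest is along the lines of the sketch below: write a single-CPU mask to 
/proc/irq/<irq>/smp_affinity for each queue interrupt. The irq numbers here 
are placeholders and have to be looked up in the guest's /proc/interrupts, 
where the netfront queue interrupts are listed per queue:

    #include <stdio.h>

    /*
     * Pin one irq to one cpu by writing a single-bit mask to
     * /proc/irq/<irq>/smp_affinity.  Needs root inside the guest.
     */
    static int bind_irq_to_cpu(int irq, int cpu)
    {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%x\n", 1u << cpu);  /* e.g. 0x4 pins to cpu 2 */
            return fclose(f);
    }

    int main(void)
    {
            /* placeholder irq numbers -- take the real ones from
             * /proc/interrupts for each rx/tx queue interrupt */
            int queue_irq[] = { 72, 73, 74, 75 };
            int i;

            for (i = 0; i < 4; i++)
                    if (bind_irq_to_cpu(queue_irq[i], i) != 0)
                            perror("smp_affinity");
            return 0;
    }
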
> > > > > > > > >
> > > > > > > > > >     2. In dom0, only two of the netback queue processes
> > > > > > > > > > are scheduled; the other two processes aren't scheduled.
> > > > > > > > >
> > > > > > > > > How many Dom0 vcpus do you have? If it only has two, then
> > > > > > > > > there will only be two processes running at a time.
> > > > > > > >
> > > > > > > > Dom0 has 6 vcpus and 6G of memory. There is only one DomU
> > > > > > > > running in Dom0, and so four netback processes are running in
> > > > > > > > Dom0 (because the max_queue param of the netback kernel module
> > > > > > > > is set to 4).
> > > > > > > > The phenomenon is that only 2 of these four netback processes
> > > > > > > > were running, with about 70% CPU usage, and the other two use
> > > > > > > > little CPU.
> > > > > > > > Is there a hash algorithm to determine which netback process
> > > > > > > > handles an input packet?
> > > > > > > >
> > > > > > >
> > > > > > > I think that's whatever default algorithm the Linux kernel is using.
> > > > > > >
> > > > > > > We don't currently support other algorithms.
> > > > > > >
> > > > > > > Wei.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel