
Re: [Xen-devel] Poor network performance between DomU with multiqueue support



> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> Sent: Thursday, December 04, 2014 6:50 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@xxxxxxxxxxxxx; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > Sent: Tuesday, December 02, 2014 11:59 PM
> > > To: Zhangleiqiang (Trump)
> > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao (brian);
> > > Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > > multiqueue support
> > >
> > > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > > Thanks for your reply, Wei.
> > > >
> > > > I do the following testing just now and found the results as follows:
> > > >
> > > > Three DomUs (4U4G) are running on Host A (6U6G) and one DomU (4U4G)
> > > > is running on Host B (6U6G). I send packets from the three DomUs to
> > > > the DomU on Host B simultaneously.
> > > >
> > > > 1. The "top" output of Host B as follows:
> > > >
> > > > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > > > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > > > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> > > > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> > > > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> > > > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> > > > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> > > > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> > > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> > > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> > > >  7440 root      20   0       0      0      0 R 71.10 0.000  8:15.38 vif4.0-q3-guest
> > > >  7434 root      20   0       0      0      0 R 59.14 0.000  9:00.58 vif4.0-q0-guest
> > > >    18 root      20   0       0      0      0 R 33.89 0.000  2:35.06 ksoftirqd/2
> > > >    28 root      20   0       0      0      0 S 20.93 0.000  3:01.81 ksoftirqd/4
> > > >
> > > >
> > > > As shown above, only two netback-related processes (vif4.0-*) are
> > > > running with high CPU usage, and the other 2 netback processes are
> > > > idle. The "ps" output for the vif4.0-* processes is as follows:
> > > >
> > > > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> > > > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> > > > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> > > > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> > > > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> > > > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> > > > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> > > > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> > > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> > > >
> > > >
> > > > 2. The "rx" related content in /proc/interupts in receiver DomU (on Host
> B):
> > > >
> > > > 73:        2        0  2925405        0   xen-dyn-event   eth0-q0-rx
> > > > 75:       43       93        0      118   xen-dyn-event   eth0-q1-rx
> > > > 77:        2     3376       14     1983   xen-dyn-event   eth0-q2-rx
> > > > 79:  2414666        0        9        0   xen-dyn-event   eth0-q3-rx
> > > >
> > > > As shown above, it seems that only q0 and q3 handle the interrupts
> > > > triggered by packet receiving.
> > > >
> > > > Any advice? Thanks.
> > >
> > > Netback selects the queue based on the return value of
> > > skb_get_queue_mapping. The queue mapping is set by the core driver or by
> > > ndo_select_queue (if the individual driver provides one). In this case
> > > netback doesn't have its own implementation of ndo_select_queue, so it's
> > > up to the core driver to decide which queue to dispatch the packet to.
> > > I think you need to inspect why Dom0 only steers traffic to these two
> > > queues but not all of them.
> > >
> > > Don't know which utility is handy for this job. Probably tc(8) is useful?
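
My rough understanding of the path described above, as a simplified sketch
only (this is not the real kernel code): when a driver has no
ndo_select_queue, the core stack hashes the flow and scales that hash onto
the number of TX queues, so a given flow always lands on the same queue:

    /* Simplified sketch (not the actual kernel code) of how a TX queue is
     * chosen when the driver, like the netback vif here, does not implement
     * ndo_select_queue: the core stack derives a flow hash and scales it
     * onto the number of real TX queues; netback later reads the result
     * back with skb_get_queue_mapping(). */
    #include <stdint.h>

    static uint16_t pick_tx_queue(uint32_t flow_hash, uint16_t num_tx_queues)
    {
        /* Same flow hash -> same queue, so one heavy flow keeps one
         * netback thread busy while the others stay idle. */
        return (uint16_t)(((uint64_t)flow_hash * num_tx_queues) >> 32);
    }

If that reading is right, two heavy flows hashing to q0 and q3 would explain
the per-queue numbers shown above.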
> >
> > Thanks Wei.
> >
> 
> > I think the reason for the above results, where only two netback/netfront
> > processes work hard, is the queue selection method. I have tried sending
> > packets from multiple hosts/VMs to one VM, and a few times all of the
> > netback/netfront processes were running with high CPU usage.
> >
> 
> A few times? You might want to check some patches to rework RX stall
> detection by David Vrabel that went in after 3.16.

Thanks for your suggestion. I have switched to the latest stable branch (3.17.4)
and found that the patches you mentioned are not merged into that branch either;
I will apply them and try again.

> > However, I found another issue. Even when using 6 queues and making sure
> > that all 6 netback processes run with high CPU usage (indeed, each of
> > them runs at about 87% CPU usage), the overall VM receive throughput is
> > not much higher than the results with 4 queues. The results go from
> > 4.5Gbps to 5.04Gbps using TCP with a 512-byte payload, and from 4.3Gbps
> > to 5.78Gbps using TCP with a 1460-byte payload.
> >
> 
> I would like to ask if you're still using the 4U4G (4 CPU, 4 G?) configuration.
> If so, please make sure there are at least as many vcpus as queues.

Sorry for misleading you; 4U4G means 4 CPUs and 4 G of memory. :) I also found
yesterday that the max_queue of netback is determined by min(online_cpu,
module_param), so when using 6 queues in the previous testing, I used a VM with
6 CPUs and 6 G of memory.
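
For reference, my understanding of that cap, as a minimal sketch only (the
function and parameter names below are made up for illustration and are not
the real xen-netback code):

    /* Minimal sketch of my understanding (made-up names, not the actual
     * xen-netback source): the number of queues a vif ends up with is
     * bounded by both the max_queues module parameter and the number of
     * online CPUs, and the frontend cannot get more than that bound. */
    static unsigned int effective_queues(unsigned int module_param_max,
                                         unsigned int online_cpus,
                                         unsigned int frontend_requested)
    {
        unsigned int cap = module_param_max < online_cpus ? module_param_max
                                                          : online_cpus;
        return frontend_requested < cap ? frontend_requested : cap;
    }

So with only 4 vcpus and max_queues set to 6, just 4 queues would actually be
used, which is why I moved to a 6-vcpu guest for the 6-queue test.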

> > According to the testing results from the wiki
> > (http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing),
> > the VM receive throughput is also much lower than the VM transmit
> > throughput.
> >
> 
> I think that's expected, because the guest RX data path still uses grant_copy
> while guest TX uses grant_map to do zero-copy transmit.

As I understand it, the RX process is as follows:
1. The physical NIC receives the packet.
2. The Xen hypervisor triggers an interrupt to Dom0.
3. Dom0's NIC driver does the "RX" operation, and the packet is stored into an
   SKB which is also owned/shared with netback.
4. Netback notifies netfront through the event channel that a packet is
   arriving.
5. Netfront grants a buffer for receiving and notifies netback of the grant
   reference (GR) through the IO ring (if using a grant-reuse mechanism,
   netfront just notifies netback of the GR).
6. Netback does the grant_copy to copy the packet from its SKB into the buffer
   referenced by the GR, and notifies netfront through the event channel.
7. Netfront copies the data from the buffer into the user-level app's SKB.

Am I right? Why not use zero-copy transmit in the guest RX data path too?
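
To make step 6 concrete, here is a rough sketch of what I think the
grant-copy step amounts to (all names below are made up for illustration and
are not the real xen-netback API):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative stand-ins only, not the real Xen interfaces. */
    struct rx_request {
        uint32_t gref;     /* grant reference published by netfront via the IO ring */
        uint16_t offset;   /* offset into the granted page */
    };

    /* Stub standing in for the grant-copy hypercall (GNTTABOP_copy). */
    static void hypercall_grant_copy(uint32_t gref, uint16_t offset,
                                     const void *src, size_t len)
    {
        (void)gref; (void)offset; (void)src; (void)len;
    }

    /* Stub standing in for the event-channel notification. */
    static void notify_frontend(void) { }

    /* Step 6 as I understand it: one hypervisor-mediated copy per packet
     * into a guest-granted page.  On the guest TX path the backend instead
     * grant-maps the guest's page (zero copy), which is why RX costs more
     * CPU per byte than TX. */
    static void backend_deliver_packet(const void *skb_data, size_t len,
                                       const struct rx_request *req)
    {
        hypercall_grant_copy(req->gref, req->offset, skb_data, len);
        notify_frontend();
    }

If that is roughly right, every received byte is copied at least once by the
hypervisor (step 6) and possibly again inside the guest (step 7), which would
explain why RX cannot match the zero-copy TX numbers.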


> > I am wondering why the VM receive throughput cannot reach 8-10Gbps like
> > VM transmit under multi-queue.  I also tried to send packets directly
> > from the local Dom0 to the DomU, and the DomU receive throughput can
> > reach about 8-12Gbps, so I am also wondering why transmitting packets
> > from Dom0 to a remote DomU can only reach about 4-5Gbps throughput.
> 
> If data is from Dom0 to DomU then the SKB is probably not fragmented by the
> network stack.  You can use tcpdump to check that.

In our testing, the MTU is set to 1600. However, even when testing with packets
whose length is 1024 bytes (smaller than 1600), the throughput between Dom0 and
the local DomU is much higher than that between Dom0 and the remote DomU. So
fragmentation is probably not the reason for it.


> Wei.
> 
> >
> > > Wei.
> > >
> > > > ----------
> > > > zhangleiqiang (Trump)
> > > >
> > > > Best Regards
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > To: Zhangleiqiang (Trump)
> > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao
> > > > > (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > Subject: Re: [Xen-devel] Poor network performance between DomU
> > > > > with multiqueue support
> > > > >
> > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump) wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: xen-devel-bounces@xxxxxxxxxxxxx
> > > > > > > [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Wei
> > > > > > > Liu
> > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > To: zhangleiqiang
> > > > > > > Cc: wei.liu2@xxxxxxxxxx; xen-devel@xxxxxxxxxxxxx
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between
> > > > > > > DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > > > > > > Hi, all
> > > > > > > >     I am testing the performance of the xen netfront-netback
> > > > > > > > driver with multi-queue support. The throughput from domU to
> > > > > > > > remote dom0 is 9.2Gb/s, but the throughput from domU to remote
> > > > > > > > domU is only 3.6Gb/s, so I think the bottleneck is the
> > > > > > > > throughput from dom0 to local domU. However, we have done some
> > > > > > > > testing and found the throughput from dom0 to local domU is
> > > > > > > > 5.8Gb/s.
> > > > > > > >     And if we send packets from one DomU to 3 other DomUs on
> > > > > > > > different hosts simultaneously, the sum of the throughput can
> > > > > > > > reach 9Gbps. It seems like the bottleneck is the receiver?
> > > > > > > >     After some analysis, I found that even when the max_queue
> > > > > > > > of netfront/back is set to 4, there are some strange results
> > > > > > > > as follows:
> > > > > > > >     1. In domU, only one rx queue deals with softirqs
> > > > > > >
> > > > > > > Try to bind irq to different vcpus?
> > > > > >
> > > > > > Do you mean we try to bind irq to different vcpus in DomU? I
> > > > > > will try it now.
> > > > > >
> > > > >
> > > > > Yes. Given the fact that you have two backend threads running
> > > > > while only one DomU vcpu is busy, it smells like misconfiguration
> > > > > in DomU.
> > > > >
> > > > > If this phenomenon persists after correctly binding irqs, you
> > > > > might want to check that traffic is being steered correctly to
> > > > > different queues.
> > > > >
> > > > > > >
> > > > > > > >     2. In dom0, only two netback queue processes are
> > > > > > > > scheduled; the other two processes aren't scheduled.
> > > > > > >
> > > > > > > How many Dom0 vcpus do you have? If it only has two then
> > > > > > > there will only be two processes running at a time.
> > > > > >
> > > > > > Dom0 has 6 vcpus and 6G memory. There is only one DomU running in
> > > > > > Dom0, and so four netback processes are running in Dom0 (because
> > > > > > the max_queue param of the netback kernel module is set to 4).
> > > > > > The phenomenon is that only 2 of these four netback processes
> > > > > > were running with about 70% CPU usage, and the other two used
> > > > > > little CPU.
> > > > > > Is there a hash algorithm that determines which netback process
> > > > > > handles the incoming packet?
> > > > > >
> > > > >
> > > > > I think that's whatever default algorithm the Linux kernel is using.
> > > > >
> > > > > We don't currently support other algorithms.
> > > > >
> > > > > Wei.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

