
Re: [Xen-devel] Poor network performance between DomU with multiqueue support



> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> Sent: Thursday, December 04, 2014 6:50 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@xxxxxxxxxxxxx; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > Sent: Tuesday, December 02, 2014 11:59 PM
> > > To: Zhangleiqiang (Trump)
> > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao (brian);
> > > Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > > multiqueue support
> > >
> > > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > > Thanks for your reply, Wei.
> > > >
> > > > I have just done the following testing and found the results below:
> > > >
> > > > Three DomUs (4U4G) are running on Host A (6U6G), and one DomU (4U4G) is
> > > > running on Host B (6U6G). I send packets from the three DomUs to the
> > > > DomU on Host B simultaneously.
> > > >
> > > > 1. The "top" output of Host B is as follows:
> > > >
> > > > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > > > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > > > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> > > > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> > > > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> > > > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> > > > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> > > > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> > > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> > > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > > >  7440 root      20   0       0      0      0 R 71.10 0.000   8:15.38 vif4.0-q3-guest
> > > >  7434 root      20   0       0      0      0 R 59.14 0.000   9:00.58 vif4.0-q0-guest
> > > >    18 root      20   0       0      0      0 R 33.89 0.000   2:35.06 ksoftirqd/2
> > > >    28 root      20   0       0      0      0 S 20.93 0.000   3:01.81 ksoftirqd/4
> > > >
> > > >
> > > > As shown above, only two netback-related processes (vif4.0-*) are
> > > > running with high CPU usage, while the other two netback processes are
> > > > idle. The "ps" output for the vif4.0-* processes is as follows:
> > > >
> > > > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> > > > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> > > > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> > > > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> > > > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> > > > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> > > > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> > > > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> > > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> > > >
> > > >
> > > > 2. The "rx"-related content of /proc/interrupts in the receiver DomU (on Host B):
> > > >
> > > > 73:        2        0  2925405        0   xen-dyn-event   eth0-q0-rx
> > > > 75:       43       93        0      118   xen-dyn-event   eth0-q1-rx
> > > > 77:        2     3376       14     1983   xen-dyn-event   eth0-q2-rx
> > > > 79:  2414666        0        9        0   xen-dyn-event   eth0-q3-rx
> > > >
> > > > As shown above, it seems that only q0 and q3 handle the interrupts
> > > > triggered by packet receiving.
> > > >
> > > > Any advice? Thanks.
> > >
> > > Netback selects the queue based on the return value of
> > > skb_get_queue_mapping. The queue mapping is set by the core driver or by
> > > ndo_select_queue (if specified by an individual driver). In this case
> > > netback doesn't have its own implementation of ndo_select_queue, so it's
> > > up to the core driver to decide which queue to dispatch the packet to. I
> > > think you need to inspect why Dom0 only steers traffic to these two
> > > queues but not all of them.
> > >
> > > Don't know which utility is handy for this job. Probably tc(8) is useful?
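
For reference, the selection path described above looks roughly like this; a
minimal sketch based on xenvif_start_xmit() in
drivers/net/xen-netback/interface.c (details vary by kernel version):

<code> //interface.c:xenvif_start_xmit (sketch)
        struct xenvif *vif = netdev_priv(dev);
        /* Netback only reads the mapping the core stack already chose;
         * it does not provide its own ndo_select_queue. */
        unsigned int index = skb_get_queue_mapping(skb);
        struct xenvif_queue *queue = &vif->queues[index];
</code>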
> >
> > Thanks Wei.
> >
> 
> > I think the reason for the above results, where only two netback/netfront
> > processes work hard, is the queue selection method. I have tried sending
> > packets from multiple hosts/VMs to one VM, and a few times all of the
> > netback/netfront processes were running with high CPU usage.
> >
> 
> A few times? You might want to check the patches by David Vrabel that
> rework RX stall detection, which went in after 3.16.
> 
> > However, I found another issue. Even when using 6 queues and making sure
> > that all 6 netback processes run with high CPU usage (indeed, each of
> > them running at about 87% CPU), the overall VM receive throughput is not
> > much higher than with 4 queues: it goes from 4.5 Gbps to 5.04 Gbps using
> > TCP with 512-byte payloads, and from 4.3 Gbps to 5.78 Gbps using TCP with
> > 1460-byte payloads.
> >
> 
> I would like to ask if you're still using the 4U4G (4 CPU, 4 G?)
> configuration. If so, please make sure there are at least as many vcpus as
> queues.
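
As a concrete illustration of this constraint (the values below are
illustrative, not taken from the setup above): the number of queues actually
used is bounded by both the max_queues module parameters and the guest's vcpu
count, so a 6-queue test needs 6 queues allowed on both ends and at least 6
vcpus in the DomU.

<code> # Dom0: allow up to 6 queues per vif (xen-netback module parameter)
modprobe xen-netback max_queues=6

# DomU: allow up to 6 queues in the frontend (kernel cmdline or modprobe)
#   xen_netfront.max_queues=6

# xl guest config: give the DomU at least as many vcpus as queues
vcpus = 6
memory = 4096
</code>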
> 
> > According to the test results on the wiki
> > (http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing),
> > the VM receive throughput is also lower than VM transmit throughput.
> >
> 
> I think that's expected, because the guest RX data path still uses
> grant_copy while guest TX uses grant_map to do zero-copy transmit.

As far as I know, there are three main grant-related operations used in the
split device model: grant mapping, grant transfer and grant copy.
Grant transfer is not used now. Do grant mapping and grant transfer both
involve TLB flush work in the hypervisor, or does only grant transfer have
this overhead?
And does grant copy really have more overhead than grant mapping?

From the code, I see that in TX, netback will do gnttab_batch_copy as well as 
gnttab_map_refs:

<code> //netback.c:xenvif_tx_action
        /* Build the batched grant ops for this TX pass: packet headers
         * are grant-copied (nr_cops), frag pages are grant-mapped (nr_mops). */
        xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);

        if (nr_cops == 0)
                return 0;

        /* Copy the headers out of the guest's grants. */
        gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
        if (nr_mops != 0) {
                /* Map the remaining frag pages for zero-copy transmit. */
                ret = gnttab_map_refs(queue->tx_map_ops,
                                      NULL,
                                      queue->pages_to_map,
                                      nr_mops);
                BUG_ON(ret);
        }
</code>
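
For contrast, on the guest RX path netback has no mapping step at all; every
byte is grant-copied into pages the frontend posted on the rx ring. A sketch
of the corresponding spot (again, details vary by kernel version):

<code> //netback.c:xenvif_rx_action (sketch)
        /* Guest RX: incoming data is grant-copied into guest-provided
         * pages; there is no zero-copy map on this path, which is one
         * reason receive is costlier than transmit. */
        gnttab_batch_copy(queue->grant_copy_op, npo.copy_prod);
</code>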

> > I am wondering why the VM receive throughput cannot reach 8-10 Gbps like
> > VM transmit under multi-queue. I also tried to send packets directly from
> > the local Dom0 to the DomU, and the DomU receive throughput can reach
> > about 8-12 Gbps, so I am also wondering why transmitting packets from
> > Dom0 to a remote DomU can only reach about 4-5 Gbps.
> 
> If the data is from Dom0 to DomU then the SKB is probably not fragmented by
> the network stack. You can use tcpdump to check that.
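
A hedged example of such a check (the interface name comes from the vif
processes above; the port is a typical iperf port, so adjust both to the
actual test):

<code> # run in Dom0; look at the reported packet lengths
tcpdump -i vif4.0 -nn -c 20 'tcp and port 5001'
# lengths well above the MTU indicate large unsegmented GSO skbs;
# lengths near the MTU mean the stack has already segmented them
</code>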
> 
> Wei.
> 
> >
> > > Wei.
> > >
> > > > ----------
> > > > zhangleiqiang (Trump)
> > > >
> > > > Best Regards
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > To: Zhangleiqiang (Trump)
> > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao
> > > > > (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > Subject: Re: [Xen-devel] Poor network performance between DomU
> > > > > with multiqueue support
> > > > >
> > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump) wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: xen-devel-bounces@xxxxxxxxxxxxx
> > > > > > > [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Wei
> > > > > > > Liu
> > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > To: zhangleiqiang
> > > > > > > Cc: wei.liu2@xxxxxxxxxx; xen-devel@xxxxxxxxxxxxx
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between
> > > > > > > DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > > > > > > Hi, all
> > > > > > > >     I am testing the performance of the xen netfront-netback
> > > > > > > > driver with multi-queue support. The throughput from domU to
> > > > > > > > remote dom0 is 9.2 Gb/s, but the throughput from domU to remote
> > > > > > > > domU is only 3.6 Gb/s, so I think the bottleneck is the path
> > > > > > > > from dom0 to the local domU. However, we have done some testing
> > > > > > > > and found that the throughput from dom0 to the local domU is
> > > > > > > > 5.8 Gb/s.
> > > > > > > >     And if we send packets from one DomU to three other DomUs
> > > > > > > > on a different host simultaneously, the summed throughput can
> > > > > > > > reach 9 Gbps. It seems like the bottleneck is the receiver?
> > > > > > > >     After some analysis, I found that even with the max_queue
> > > > > > > > of netfront/back set to 4, there are some strange results:
> > > > > > > >     1. In domU, only one rx queue deals with softirqs
> > > > > > >
> > > > > > > Try to bind irq to different vcpus?
> > > > > >
> > > > > > Do you mean we should try to bind the irqs to different vcpus in
> > > > > > DomU? I will try it now.
> > > > > >
> > > > >
> > > > > Yes. Given the fact that you have two backend threads running while
> > > > > only one DomU vcpu is busy, it smells like a misconfiguration in
> > > > > DomU.
> > > > >
> > > > > If this phenomenon persists after correctly binding irqs, you might
> > > > > want to check that traffic is steered correctly to different queues.
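
A hedged example of the binding step (the irq numbers are the eth0-q*-rx
lines from the /proc/interrupts output earlier; the masks are hex cpu
bitmaps, assuming a 4-vcpu DomU):

<code> # run inside the receiving DomU
echo 1 > /proc/irq/73/smp_affinity   # eth0-q0-rx -> vcpu0
echo 2 > /proc/irq/75/smp_affinity   # eth0-q1-rx -> vcpu1
echo 4 > /proc/irq/77/smp_affinity   # eth0-q2-rx -> vcpu2
echo 8 > /proc/irq/79/smp_affinity   # eth0-q3-rx -> vcpu3
</code>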
> > > > >
> > > > > > >
> > > > > > > > 2. In dom0, only two netback queue processes are scheduled;
> > > > > > > > the other two processes aren't scheduled.
> > > > > > >
> > > > > > > How many Dom0 vcpus do you have? If it only has two then
> > > > > > > there will only be two processes running at a time.
> > > > > >
> > > > > > Dom0 has 6 vcpus and 6 GB of memory. There is only one DomU
> > > > > > running in Dom0, so four netback processes are running in Dom0
> > > > > > (because the max_queues param of the netback kernel module is set
> > > > > > to 4).
> > > > > > The phenomenon is that only 2 of these four netback processes
> > > > > > were running, with about 70% CPU usage, while the other two used
> > > > > > little CPU.
> > > > > > Is there a hash algorithm that determines which netback process
> > > > > > handles an input packet?
> > > > > >
> > > > >
> > > > > I think that's whatever default algorithm the Linux kernel is using.
> > > > >
> > > > > We don't currently support other algorithms.
> > > > >
> > > > > Wei.
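
For completeness, that kernel default amounts to flow hashing: a simplified
sketch loosely following __netdev_pick_tx()/skb_tx_hash() in net/core/ of
3.x kernels (the exact helpers vary by version). One TCP stream hashes to
one queue, so many concurrent flows are needed to exercise all queues.

<code> //net/core (simplified sketch, not the literal kernel code)
static u16 default_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
{
        u32 hash = skb_get_hash(skb);   /* flow hash over L3/L4 headers */
        /* scale the hash into [0, real_num_tx_queues) */
        return (u16)(((u64)hash * dev->real_num_tx_queues) >> 32);
}
</code>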

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

