Re: [Xen-devel] Poor network performance between DomU with multiqueue support
> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> Sent: Thursday, December 04, 2014 6:50 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@xxxxxxxxxxxxx; zhangleiqiang; Luohao (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with multiqueue support
>
> On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > Sent: Tuesday, December 02, 2014 11:59 PM
> > > To: Zhangleiqiang (Trump)
> > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > Subject: Re: [Xen-devel] Poor network performance between DomU with multiqueue support
> > >
> > > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > > Thanks for your reply, Wei.
> > > >
> > > > I did the following testing just now and found the results below.
> > > >
> > > > Three DomUs (4U4G) are running on Host A (6U6G) and one DomU (4U4G) is running on Host B (6U6G); I send packets from the three DomUs to the DomU on Host B simultaneously.
> > > >
> > > > 1. The "top" output of Host B is as follows:
> > > >
> > > > top - 09:42:11 up 1:07, 2 users, load average: 2.46, 1.90, 1.47
> > > > Tasks: 173 total, 4 running, 169 sleeping, 0 stopped, 0 zombie
> > > > %Cpu0 :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> > > > %Cpu1 :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> > > > %Cpu2 :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> > > > %Cpu3 :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> > > > %Cpu4 :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> > > > %Cpu5 :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> > > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> > > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
> > > >  7440 root      20   0       0      0      0 R 71.10 0.000  8:15.38 vif4.0-q3-guest
> > > >  7434 root      20   0       0      0      0 R 59.14 0.000  9:00.58 vif4.0-q0-guest
> > > >    18 root      20   0       0      0      0 R 33.89 0.000  2:35.06 ksoftirqd/2
> > > >    28 root      20   0       0      0      0 S 20.93 0.000  3:01.81 ksoftirqd/4
> > > >
> > > > As shown above, only two netback-related processes (vif4.0-*) are running with high CPU usage, and the other two netback processes are idle. The "ps" result for the vif4.0-* processes is as follows:
> > > >
> > > > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> > > > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> > > > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> > > > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> > > > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> > > > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> > > > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> > > > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> > > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> > > >
> > > > 2. The "rx"-related content of /proc/interrupts in the receiver DomU (on Host B):
> > > >
> > > >  73:        2        0  2925405        0  xen-dyn-event  eth0-q0-rx
> > > >  75:       43       93        0      118  xen-dyn-event  eth0-q1-rx
> > > >  77:        2     3376       14     1983  xen-dyn-event  eth0-q2-rx
> > > >  79:  2414666        0        9        0  xen-dyn-event  eth0-q3-rx
> > > >
> > > > As shown above, it seems that only q0 and q3 handle the interrupts triggered by packet receiving.
> > > >
> > > > Any advice? Thanks.
> > >
> > > Netback selects the queue based on the return value of skb_get_queue_mapping. The queue mapping is set by the core driver or by ndo_select_queue (if specified by the individual driver). In this case netback doesn't have its own implementation of ndo_select_queue, so it's up to the core driver to decide which queue to dispatch the packet to. I think you need to inspect why Dom0 only steers traffic to these two queues but not all of them.
> > >
> > > Don't know which utility is handy for this job. Probably tc(8) is useful?
> >
> > Thanks Wei.
> >
> > I think the reason for the above result, that only two netback/netfront processes work hard, is the queue selection method. I have tried to send packets from multiple hosts/VMs to one VM, and all of the netback/netfront processes ran with high CPU usage a few times.
> >
>
> A few times? You might want to check some patches to rework RX stall detection by David Vrabel that went in after 3.16.
>
> > However, I find another issue. Even when using 6 queues and making sure that all 6 netback processes run with high CPU usage (indeed, each of them runs at about 87% CPU usage), the whole VM receive throughput is not much higher than the results with 4 queues. The results go from 4.5 Gbps to 5.04 Gbps using TCP with 512-byte packets, and from 4.3 Gbps to 5.78 Gbps using TCP with 1460-byte packets.
> >
>
> I would like to ask if you're still using the 4U4G (4 CPU 4 G?) configuration? If so, please make sure there are at least the same number of vcpus as queues.
>
> > According to the testing results from the wiki: http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing, the VM receive throughput is also much lower than VM transmit.
> >
>
> I think that's expected, because the guest RX data path still uses grant_copy while guest TX uses grant_map to do zero-copy transmit.

As far as I know, there are three main grant-related operations used in the split device model: grant mapping, grant transfer and grant copy. Grant transfer is not used now, and grant mapping and grant transfer both involve TLB flush work in the hypervisor, am I right? Or does only grant transfer have this overhead? Does grant copy surely have more overhead than grant mapping? From the code, I see that in TX, netback will do gnttab_batch_copy as well as gnttab_map_refs:

<code>
/* netback.c: xenvif_tx_action */
xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);

if (nr_cops == 0)
        return 0;

gnttab_batch_copy(queue->tx_copy_ops, nr_cops);

if (nr_mops != 0) {
        ret = gnttab_map_refs(queue->tx_map_ops, NULL,
                              queue->pages_to_map,
                              nr_mops);
        BUG_ON(ret);
}
</code>
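For comparison, here is a rough sketch of what the guest-receive side has to do. This is not the real xenvif_rx_action() code, and the helper name fill_rx_copy_op() is invented for illustration: each chunk of received frame data is described by a gnttab_copy operation whose destination is a grant reference supplied by netfront, and the whole batch is submitted with gnttab_batch_copy(), so every byte delivered to the guest is copied by the hypervisor. On the TX path quoted above, by contrast, the copy operations appear to cover only the header portion pulled into the local skb, while the remaining data is grant-mapped.

<code>
/*
 * Illustrative sketch only -- not the actual netback RX code. It fills in
 * one GNTTABOP_copy so that Xen copies "len" bytes from a backend-local
 * frame into a page granted by the frontend.
 */
#include <xen/interface/xen.h>
#include <xen/interface/grant_table.h>
#include <xen/grant_table.h>

static void fill_rx_copy_op(struct gnttab_copy *copy,
			    xen_pfn_t src_gfn, unsigned int src_offset,
			    grant_ref_t dst_gref, domid_t frontend_domid,
			    unsigned int dst_offset, unsigned int len)
{
	copy->source.u.gmfn = src_gfn;      /* frame owned by the backend */
	copy->source.domid  = DOMID_SELF;
	copy->source.offset = src_offset;

	copy->dest.u.ref    = dst_gref;     /* grant ref provided by netfront */
	copy->dest.domid    = frontend_domid;
	copy->dest.offset   = dst_offset;

	copy->len   = len;
	copy->flags = GNTCOPY_dest_gref;    /* destination is a grant reference */
}

/*
 * The prepared batch is then handed to Xen in one go, e.g.
 * gnttab_batch_copy(copy_ops, nr_copy_ops), the same helper the TX path
 * above uses for its header copies.
 */
</code>

If that picture is right, the per-byte hypervisor copy on the receive side (versus mostly mapping on the transmit side) would go some way towards explaining why guest RX tops out below guest TX.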
> > I am wondering why the VM receive throughput cannot reach 8-10 Gbps, as VM transmit does, under multi-queue. I also tried to send packets directly from the local Dom0 to the DomU; the DomU receive throughput can reach about 8-12 Gbps, so I am also wondering why transmitting packets from Dom0 to a remote DomU can only reach about 4-5 Gbps throughput.
>
> If data is from Dom0 to DomU then the SKB is probably not fragmented by the network stack. You can use tcpdump to check that.
>
> Wei.
>
> > >
> > > Wei.
> > > >
> > > > ----------
> > > > zhangleiqiang (Trump)
> > > >
> > > > Best Regards
> > > >
> > > > > -----Original Message-----
> > > > > From: Wei Liu [mailto:wei.liu2@xxxxxxxxxx]
> > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > To: Zhangleiqiang (Trump)
> > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@xxxxxxxxxxxxx; Luohao (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > Subject: Re: [Xen-devel] Poor network performance between DomU with multiqueue support
> > > > >
> > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump) wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: xen-devel-bounces@xxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Wei Liu
> > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > To: zhangleiqiang
> > > > > > > Cc: wei.liu2@xxxxxxxxxx; xen-devel@xxxxxxxxxxxxx
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > > > > > > Hi, all
> > > > > > > > I am testing the performance of the xen netfront-netback driver with multi-queue support. The throughput from domU to remote dom0 is 9.2 Gb/s, but the throughput from domU to remote domU is only 3.6 Gb/s; I think the bottleneck is the throughput from dom0 to the local domU. However, we have done some testing and found that the throughput from dom0 to the local domU is 5.8 Gb/s.
> > > > > > > > And if we send packets from one DomU to three other DomUs on different hosts simultaneously, the sum of the throughput can reach 9 Gbps.
> > > > > > >
> > > > > > > It seems like the bottleneck is the receiver?
> > > > > > >
> > > > > > > > After some analysis, I found that even when the max_queue of netfront/back is set to 4, there are some strange results as follows:
> > > > > > > > 1. In domU, only one rx queue deals with the softirq.
> > > > > > >
> > > > > > > Try to bind irq to different vcpus?
> > > > > >
> > > > > > Do you mean we should try to bind the irqs to different vcpus in DomU? I will try it now.
> > > > > >
> > > > >
> > > > > Yes. Given the fact that you have two backend threads running while only one DomU vcpu is busy, it smells like a misconfiguration in DomU.
> > > > >
> > > > > If this phenomenon persists after correctly binding irqs, you might want to check that traffic is steered correctly to different queues.
> > > > >
> > > > > > > > 2. In dom0, only two netback queue processes are scheduled; the other two processes aren't scheduled.
> > > > > > >
> > > > > > > How many Dom0 vcpus do you have? If it only has two then there will only be two processes running at a time.
> > > > > >
> > > > > > Dom0 has 6 vcpus and 6G memory.
> > > > > > There is only one DomU running on this host, so four netback processes are running in Dom0 (because the max_queue parameter of the netback kernel module is set to 4).
> > > > > > The phenomenon is that only 2 of these four netback processes were running, with about 70% CPU usage, and the other two use little CPU.
> > > > > > Is there a hash algorithm that determines which netback process handles the input packet?
> > > > > >
> > > > >
> > > > > I think that's whatever default algorithm the Linux kernel is using.
> > > > >
> > > > > We don't currently support other algorithms.
> > > > >
> > > > > Wei.
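On the queue-selection question: since netback has no ndo_select_queue hook, the core network stack's default policy picks the transmit queue. The sketch below is a simplification, not actual kernel code; the function name default_tx_queue_guess() is made up, and the real logic (__netdev_pick_tx / skb_tx_hash) additionally honours XPS maps and cached per-socket queue mappings. It only shows why all packets of one flow tend to land on the same vif queue, and hence the same backend thread, in Dom0.

<code>
/*
 * Minimal sketch, assuming no XPS and no cached socket mapping: roughly
 * what the default TX queue choice boils down to when a driver (such as
 * netback here) provides no ndo_select_queue.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static u16 default_tx_queue_guess(struct net_device *dev, struct sk_buff *skb)
{
	/*
	 * skb_get_hash() computes (or reuses) a flow hash over the packet's
	 * addresses and ports, so every packet of a given TCP flow gets the
	 * same hash value.
	 */
	u32 hash = skb_get_hash(skb);

	/*
	 * The hash is then folded onto the number of real TX queues, so one
	 * flow is always serviced by one queue (the exact in-kernel
	 * arithmetic differs, but the effect is hash-to-queue).
	 */
	return (u16)(hash % dev->real_num_tx_queues);
}
</code>

That would be consistent with the observations above: a test with only a few flows can easily leave some of the four (or six) vif threads idle because several flows hash onto the same queue, while traffic from many sources spreads across all queues.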