
Re: [Xen-devel] Poor network performance between DomU with multiqueue support



>On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
>> > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
>> > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
>> > > > [...]
>> > > > > > I think that's expected, because guest RX data path still 
>> > > > > > uses grant_copy while guest TX uses grant_map to do zero-copy transmit.
>> > > > >
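For reference, a minimal sketch of what a per-fragment grant copy on the guest RX path looks like at the hypercall interface. This is illustrative only, not the actual netback code: the helper name rx_copy_one is made up, while struct gnttab_copy and gnttab_batch_copy() are the generic Linux/Xen grant-table interfaces.

    #include <linux/kernel.h>
    #include <xen/grant_table.h>
    #include <xen/interface/grant_table.h>

    /*
     * Illustrative only: copy one received fragment from a backend-owned
     * page into a page the frontend has granted for RX.  Real netback
     * batches many such ops per ring pass, but each batch is still a
     * hypercall into Xen.
     */
    static void rx_copy_one(unsigned long src_gfn, grant_ref_t dst_gref,
                            domid_t otherend, unsigned int len)
    {
            struct gnttab_copy op = {
                    .source.u.gmfn = src_gfn,
                    .source.domid  = DOMID_SELF,
                    .source.offset = 0,
                    .dest.u.ref    = dst_gref,   /* grant ref posted by frontend */
                    .dest.domid    = otherend,
                    .dest.offset   = 0,
                    .len           = len,
                    .flags         = GNTCOPY_dest_gref,
            };

            gnttab_batch_copy(&op, 1);           /* hypercall into Xen */

            if (op.status != GNTST_okay)
                    pr_warn("grant copy failed: %d\n", op.status);
    }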
>> > > > > As far as I know, there are three main grant-related operations
>> > > > > used in the split device model: grant mapping, grant transfer and
>> > > > > grant copy. Grant transfer is not used now, and grant mapping and
>> > > > > grant transfer both involve "TLB" refresh work for the hypervisor,
>> > > > > am I right? Or does only grant transfer have this overhead?
>> > > >
>> > > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
>> > > >
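For comparison, here is a rough sketch of the map/use/unmap pattern the zero-copy TX path relies on, built on the generic gnttab_set_map_op() / gnttab_map_refs() / gnttab_unmap_refs() helpers. It is illustrative rather than the real driver code, but it shows where the unmap-time TLB flush cost comes from.

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <xen/grant_table.h>

    /*
     * Illustrative only: map a page the guest granted for TX, use it,
     * then unmap it.  The unmap is where the hypervisor has to tear down
     * the mapping and flush TLBs, which is the overhead discussed above.
     */
    static int map_use_unmap(grant_ref_t ref, domid_t otherend,
                             struct page *page)
    {
            struct gnttab_map_grant_ref map;
            struct gnttab_unmap_grant_ref unmap;
            unsigned long addr = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
            int err;

            gnttab_set_map_op(&map, addr, GNTMAP_host_map, ref, otherend);
            err = gnttab_map_refs(&map, NULL, &page, 1);
            if (err || map.status != GNTST_okay)
                    return -EIO;

            /* ... hand the mapped data to the NIC for transmission ... */

            gnttab_set_unmap_op(&unmap, addr, GNTMAP_host_map, map.handle);
            gnttab_unmap_refs(&unmap, NULL, &page, 1);   /* TLB flush here */
            return 0;
    }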
>> > > > I saw in an email the other day that XenServer folks have some
>> > > > planned improvement to avoid TLB flushes in Xen, to be upstreamed
>> > > > in the 4.6 window. I can't say for sure it will get upstreamed as
>> > > > I don't work on that.
>> > > >
>> > > > > Does grant copy really have more overhead than grant mapping?
>> > > > >
>> > > >
>> > > > At the very least the zero-copy TX path is faster than the previous copying path.
>> > > >
>> > > > But speaking of the micro operation I'm not sure.
>> > > >
>> > > > There was once a persistent-map prototype netback / netfront that
>> > > > established a memory pool between FE and BE and then used memcpy
>> > > > to copy data. Unfortunately that prototype was not done right, so
>> > > > the result was not good.
>> > >
>> > > The newest mail about persistent grants I can find was sent on 16 Nov
>> > > 2012 (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
>> > > Why was it not done right and not merged upstream?
>> > 
>> > AFAICT there's one more memcpy than necessary, i.e. the frontend
>> > memcpys data into the pool and then the backend memcpys data out of
>> > the pool, when the backend should be able to use the page in the pool
>> > directly.
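To make the copy counts concrete, here is a toy sketch of the difference. The pool layout and function names are made up for illustration and do not correspond to the prototype's actual code.

    #include <string.h>

    #define POOL_SIZE 256

    /* Hypothetical pool of pages granted once and kept mapped on both sides. */
    struct persistent_pool {
            void *page[POOL_SIZE];      /* visible to both FE and BE */
    };

    /* What the prototype reportedly did: two copies per packet. */
    static void tx_two_copies(struct persistent_pool *pool, int slot,
                              const void *skb_data, void *nic_buf, size_t len)
    {
            memcpy(pool->page[slot], skb_data, len);   /* FE: skb  -> pool */
            memcpy(nic_buf, pool->page[slot], len);    /* BE: pool -> NIC  */
    }

    /* What it could do instead: one copy, then let the NIC DMA straight
     * out of the pool page. */
    static void tx_one_copy(struct persistent_pool *pool, int slot,
                            const void *skb_data, size_t len)
    {
            memcpy(pool->page[slot], skb_data, len);   /* FE: skb -> pool  */
            /* BE: point the TX descriptor/frag at pool->page[slot]        */
    }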
>> 
>> Memcpy should be cheaper than grant_copy because the former does not
>> need the hypercall, which causes a "VM exit" into the Xen hypervisor,
>> am I right? For the RX path, using memcpy based on a persistent grant
>> table may give higher performance than using grant copy does now.
>
>In theory yes. Unfortunately nobody has benchmarked that properly.
I have done some testing of RX performance with the persistent grant method and the upstream method (3.17.4 branch); the results show that the persistent grant method does achieve higher performance than the upstream method (from 3.5 Gbps up to about 6 Gbps).
I also find that the persistent grant mechanism is already used in blkfront/blkback, so I am wondering why there is no effort to replace grant copy with persistent grants now, at least on the RX path. Are there other disadvantages of the persistent grant method which stop us from using it?
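The RX path I have in mind would follow the blkback model: the frontend grants a pool of pages once, the backend maps them once and keeps the handles, and per-packet delivery becomes a memcpy instead of a GNTTABOP_copy hypercall. A rough sketch follows; the names (rx_pool, rx_deliver, RX_POOL_PAGES) are hypothetical, not existing netback/netfront code.

    #include <linux/string.h>
    #include <xen/grant_table.h>

    #define RX_POOL_PAGES 256

    /*
     * Hypothetical per-queue RX pool: every page was granted by the
     * frontend at setup time and mapped by the backend once, so there is
     * no map/unmap and no hypercall on the per-packet path.
     */
    struct rx_pool {
            void           *vaddr[RX_POOL_PAGES];   /* backend mapping of FE pages */
            grant_handle_t  handle[RX_POOL_PAGES];  /* released only at teardown   */
    };

    /* Hypothetical backend delivery into a slot the frontend advertised. */
    static void rx_deliver(struct rx_pool *pool, unsigned int slot,
                           const void *pkt, size_t len)
    {
            memcpy(pool->vaddr[slot], pkt, len);
            /*
             * Then push a response on the RX ring; the frontend builds its
             * skb on top of its own page.  No GNTTABOP_copy, no TLB flush.
             */
    }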

PS. I used pkt-gen to send packets from dom0 to a domU running on another dom0; the CPU of each dom0 is an Intel E5640 @ 2.4 GHz, and the two dom0s are connected with a 10GE NIC.


>If you're interested in doing work on optimising RX performance, you
>might want to sync up with XenServer folks?
>
>> I have seen "move grant copy to guest" and "Fix grant copy alignment
>> problem" as optimization methods used in "NetChannel2"
>> (http://www-archive.xenproject.org/files/xensummit_fall07/16_JoseRenatoSantos.pdf).
>> Unfortunately, NetChannel2 seems not to be supported since 2.6.32. Do
>> you know about them, and would they be helpful for RX path optimization
>> under the current upstream implementation?
>
>Not sure, that's long before I ever started working on Xen.
>
>> By the way, after rethinking the testing results for the multi-queue pv
>> (kernel 3.17.4 + Xen 4.4) implementation, I find that when using four
>> queues for netback/netfront, there are about 3 netback processes
>> running with high CPU usage on the receiving Dom0 (about 85% usage per
>> process, each running on one CPU core), and the aggregate throughput is
>> only about 5 Gbps. I suspect there may be some bug or pitfall in the
>> current multi-queue implementation, because using nearly 3 CPU cores
>> for packet receiving to achieve only 5 Gbps of throughput is somewhat
>> abnormal.
>
>3.17.4 doesn't contain David Vrabel's fixes.
>
>Look for
>  bc96f648df1bbc2729abbb84513cf4f64273a1f1
>  f48da8b14d04ca87ffcffe68829afd45f926ec6a
>  ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e
>in David Miller's net tree.
>
>BTW there are some improvements planned for 4.6: "[Xen-devel] [PATCH v3
>0/2] gnttab: Improve scaleability". This is orthogonal to the problem
>you're trying to solve but it should help improve performance in general.
>
>Wei.

