
RE: [Xen-devel] [PATCH] Network Checksum Removal



What does the tx hw csum control actually turn on and off?

I'm surprised there's much benefit to csum offload on the tx side at
all, as it's almost always done as part of a copy.
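
For reference, here's a minimal userspace sketch of what "done as part
of a copy" means: the 16-bit one's-complement sum is accumulated while
the bytes are moved, so the data is only touched once and the checksum
is nearly free. (This is just the idea behind the kernel's
copy-and-checksum helpers, not their actual code.)

#include <stddef.h>
#include <stdint.h>

/*
 * Copy 'len' bytes from src to dst while accumulating the RFC 1071
 * 16-bit one's-complement checksum over the same bytes.
 */
static uint16_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        sum += (uint32_t)(src[i] << 8 | src[i + 1]);
    }
    if (i < len) {                 /* trailing odd byte */
        dst[i] = src[i];
        sum += (uint32_t)(src[i] << 8);
    }
    while (sum >> 16)              /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}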

I'd have thought the main benefit of csum offload was on the rx side,
so that packets received by the NIC are hardware csum'ed, passed
through the bridge, and then into the domU, where the csum
re-calculation is avoided [it would normally need to be done before
the TCP ack is sent, and can't be done as part of a copy as the data
won't be moved out of the skb until the user app does a read].  The
same rx csum check can be avoided for domU <-> domU transfers, and
hence provide a benefit there too.
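
On that model, the rx-side saving is skipping the per-byte
verification pass whenever something upstream (the NIC, or the peer
guest for domU <-> domU) has already validated the checksum.  A
sketch, with invented field names rather than the real skb flags:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rx_pkt {
    const uint8_t *data;
    size_t len;
    bool csum_verified;   /* set when the NIC (or peer guest) validated it */
};

/* RFC 1071 one's-complement sum; a correct packet sums to 0. */
static uint16_t sw_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(p[i] << 8 | p[i + 1]);
    if (i < len)
        sum += (uint32_t)(p[i] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

static bool rx_csum_ok(const struct rx_pkt *pkt)
{
    if (pkt->csum_verified)
        return true;    /* offloaded: skip the per-byte pass entirely */
    return sw_checksum(pkt->data, pkt->len) == 0;
}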

In the figures below, which direction is the data stream heading? (I
presume it's a one-way test, like ttcp?)

It's somewhat surprising that the dom0 bridge code is burning so much
CPU. xenoprofile results will be quite interesting, to see which
functions are eating the CPU.

Ultimately, the best way of doing domU <-> domU networking will be to
allow point-to-point connections where netfronts are connected
directly to other netfronts when the guests are on the same machine.
However, the priority for 3.0 is to optimise the normal
front-back-bridge-back-front path.

Thanks,
Ian

> -----Original Message-----
> From: Andrew Theurer [mailto:habanero@xxxxxxxxxx] 
> Sent: 25 May 2005 15:39
> To: Jon Mason; xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Ian Pratt; bin.ren@xxxxxxxxxxxx
> Subject: Re: [Xen-devel] [PATCH] Network Checksum Removal
> 
> Tests for domU->dom0, domU->host, and domU->domU are completed:
> 
> 3.2 GHz Xeon with Hyperthreading, 2GB memory (correction)
> 
> Benchmark: netperf2 -T TCP_STREAM
> 
> dom0, dom1, and dom2 on cpu0 (first SMT thread on first core)
>  domU to host
>   hw tx csum
>    msg-size: 00064  Mbps: 0186  d0-cpu: 49.38  d1-cpu: 44.35
>    msg-size: 01500  Mbps: 0917  d0-cpu: 62.13  d1-cpu: 37.87
>    msg-size: 16384  Mbps: 0933  d0-cpu: 66.63  d1-cpu: 33.37
>    msg-size: 32768  Mbps: 0928  d0-cpu: 66.96  d1-cpu: 32.66
>   sw tx csum
>    msg-size: 00064  Mbps: 0187  d0-cpu: 49.50  d1-cpu: 44.52
>    msg-size: 01500  Mbps: 0904  d0-cpu: 60.63  d1-cpu: 39.36
>    msg-size: 16384  Mbps: 0924  d0-cpu: 63.98  d1-cpu: 35.98
>    msg-size: 32768  Mbps: 0926  d0-cpu: 64.18  d1-cpu: 35.68
>       ^^hw tx csum gives about a 2% reduction in cpu util on dom1^^
>  domU to dom0
>   hw tx csum
>    msg-size: 00064  Mbps: 0014  d0-cpu: 64.02  d1-cpu: 31.71
>    msg-size: 01500  Mbps: 1087  d0-cpu: 63.34  d1-cpu: 36.67
>    msg-size: 16384  Mbps: 1204  d0-cpu: 67.30  d1-cpu: 32.71
>    msg-size: 32768  Mbps: 1148  d0-cpu: 68.08  d1-cpu: 31.93
>   sw tx csum
>    msg-size: 00064  Mbps: 0014  d0-cpu: 64.88  d1-cpu: 32.39
>    msg-size: 01500  Mbps: 0948  d0-cpu: 62.20  d1-cpu: 37.80
>    msg-size: 16384  Mbps: 1063  d0-cpu: 64.73  d1-cpu: 35.27
>    msg-size: 32768  Mbps: 1012  d0-cpu: 65.71  d1-cpu: 34.30
>       ^^up to 13% throughput increase, with cpu util down ~2% on dom1^^
>         Note the dismal performance for very small msg sizes
>  domU to domU
>   hw tx csum
>    msg-size: 00064  Mbps: 0359  d0-cpu: 27.85  d1-cpu: 53.68  d2-cpu: 18.48
>    msg-size: 01500  Mbps: 0594  d0-cpu: 47.42  d1-cpu: 21.77  d2-cpu: 30.78
>    msg-size: 16384  Mbps: 0619  d0-cpu: 49.66  d1-cpu: 18.81  d2-cpu: 31.53
>    msg-size: 32768  Mbps: 0616  d0-cpu: 49.58  d1-cpu: 18.68  d2-cpu: 31.74
>   sw tx csum
>    msg-size: 00064  Mbps: 0361  d0-cpu: 27.81  d1-cpu: 53.58  d2-cpu: 18.62
>    msg-size: 01500  Mbps: 0584  d0-cpu: 46.22  d1-cpu: 23.18  d2-cpu: 30.60
>    msg-size: 16384  Mbps: 0602  d0-cpu: 47.99  d1-cpu: 20.33  d2-cpu: 31.69
>    msg-size: 32768  Mbps: 0603  d0-cpu: 47.67  d1-cpu: 20.59  d2-cpu: 31.74
>       ^^About a 2% throughput increase, and cpu util down on dom1
>         The cpu wasted on dom1 should be justification enough for
>         domU<->domU communication via point-to-point front-end
>         driver connections.
> dom0 on cpu0, dom1 on cpu2, and dom2 on cpu3 (dom1 and dom2 on same core)
>  domU to host
>   hw tx csum
>    msg-size: 00064  Mbps: 0540  d0-cpu: 92.98  d1-cpu: 100.00
>    msg-size: 01500  Mbps: 0941  d0-cpu: 99.74  d1-cpu: 48.62
>    msg-size: 16384  Mbps: 0941  d0-cpu: 99.71  d1-cpu: 43.32
>    msg-size: 32768  Mbps: 0941  d0-cpu: 99.72  d1-cpu: 43.21
>   sw tx csum
>    msg-size: 00064  Mbps: 0545  d0-cpu: 93.47  d1-cpu: 100.00
>    msg-size: 01500  Mbps: 0941  d0-cpu: 99.76  d1-cpu: 51.43
>    msg-size: 16384  Mbps: 0941  d0-cpu: 99.69  d1-cpu: 46.58
>    msg-size: 32768  Mbps: 0941  d0-cpu: 99.72  d1-cpu: 45.39
>       ^^Finally at wire speed, but at a cost of 100% cpu on dom0
>         This cpu util seems excessive; maybe oprofile will show
>         some problems.  Notice dom1's cpu is ~2% lower with hw tx csum.
>  domU to dom0
>   hw tx csum
>    msg-size: 00064  Mbps: 0390  d0-cpu: 97.92  d1-cpu: 100.00
>    msg-size: 01500  Mbps: 1571  d0-cpu: 97.36  d1-cpu: 54.83
>    msg-size: 16384  Mbps: 1582  d0-cpu: 96.20  d1-cpu: 49.93
>    msg-size: 32768  Mbps: 1596  d0-cpu: 96.32  d1-cpu: 49.63
>   sw tx csum
>    msg-size: 00064  Mbps: 0375  d0-cpu: 97.65  d1-cpu: 100.00 
>    msg-size: 01500  Mbps: 1546  d0-cpu: 96.36  d1-cpu: 52.99
>    msg-size: 16384  Mbps: 1598  d0-cpu: 95.88  d1-cpu: 47.48
>    msg-size: 32768  Mbps: 1641  d0-cpu: 95.89  d1-cpu: 46.37 
>       ^^very slightly better avg throughput, and lower cpu on dom1
>  domU to domU
>   hw tx csum
>    msg-size: 00064  Mbps: 0287  d0-cpu: 84.97  d1-cpu: 100.0  d2-cpu: 75.46
>    msg-size: 01500  Mbps: 1004  d0-cpu: 90.98  d1-cpu: 68.29  d2-cpu: 76.94
>    msg-size: 16384  Mbps: 1018  d0-cpu: 89.78  d1-cpu: 60.82  d2-cpu: 78.12
>    msg-size: 32768  Mbps: 1010  d0-cpu: 89.30  d1-cpu: 59.83  d2-cpu: 77.99
>   sw tx csum
>    msg-size: 00064  Mbps: 0286  d0-cpu: 84.81  d1-cpu: 99.93  d2-cpu: 76.28
>    msg-size: 01500  Mbps: 1018  d0-cpu: 91.30  d1-cpu: 67.27  d2-cpu: 75.08
>    msg-size: 16384  Mbps: 1012  d0-cpu: 88.46  d1-cpu: 55.56  d2-cpu: 71.37
>    msg-size: 32768  Mbps: 1017  d0-cpu: 88.33  d1-cpu: 54.96  d2-cpu: 70.96
>       ^^about the same throughput, but ~4% lower cpu on d1
>         Again, point-to-point front-end comms would be great here.
> 
> 
> IMO, the patch is a good thing.  There are other very major
> issues with networking, like the massive cpu overhead for dom0.
> I wonder if we could have a layer 2 networking model like:
> 
> -Xen has front end ethernet drivers only
> -dom0 has a Xen bridge front end driver, just to put eth0 (or
>  whatever phys dev) on it
> -no domain hosted bridge device or backend ethernet drivers
> 
> With this, Xen acts as an ethernet "switch", switching ethernet
> traffic in Xen itself, without the help of a domain hosted bridge.
> Packets are forwarded to either a domain's front end driver, or
> the front end bridge interface in dom0 (or any other driver
> domain).  With this we may have better control of emulating
> offload functions, and we should avoid some hops (in many cases
> avoiding dom0 entirely) for the network traffic.  Comments?
> 
> -Andrew                                                       
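
To make the "Xen as an ethernet switch" model quoted above concrete,
here is a rough C sketch of the forwarding core it implies: a learned
MAC table mapping destination addresses to domain frontends.  All
names and structures are invented for illustration; this is not
actual Xen code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 64

struct vswitch_port {
    uint8_t mac[6];     /* learned source address */
    int     domid;      /* frontend that owns it  */
};

static struct vswitch_port ports[MAX_PORTS];
static size_t nports;

/* Learn the sender's MAC so replies can be switched directly. */
static void learn(const uint8_t *src_mac, int domid)
{
    for (size_t i = 0; i < nports; i++)
        if (memcmp(ports[i].mac, src_mac, 6) == 0)
            return;
    if (nports < MAX_PORTS) {
        memcpy(ports[nports].mac, src_mac, 6);
        ports[nports].domid = domid;
        nports++;
    }
}

/*
 * Return the domid whose frontend should receive the frame, or -1
 * for unknown/broadcast destinations, which would go to dom0's
 * bridge frontend (or be flooded).
 */
static int switch_frame(const uint8_t *frame, size_t len, int src_domid)
{
    if (len < 14)                       /* not a full ethernet header */
        return -1;
    learn(frame + 6, src_domid);        /* bytes 6..11: source MAC    */
    for (size_t i = 0; i < nports; i++)
        if (memcmp(ports[i].mac, frame, 6) == 0)  /* bytes 0..5: dest */
            return ports[i].domid;
    return -1;
}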

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

