Xen project Mailing List

Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

On 27/03/15 13:06, Jonathan Davies wrote: > On 26/03/15 17:14, Eric Dumazet wrote: >> On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote: >>> Network drivers with slow TX completion can experience poor network >>> transmit >>> throughput, limited by hitting the sk_wmem_alloc limit check in >>> tcp_write_xmit. >>> The limit is 128 KB (by default), which means we are limited to two >>> 64 KB skbs >>> in-flight. This has been observed to limit transmit throughput with >>> xen-netfront >>> because its TX completion can be relatively slow compared to physical >>> NIC >>> drivers. >>> >>> There have been several modifications to the calculation of the >>> sk_wmem_alloc >>> limit in the past. Here is a brief history: >>> >>> * Since TSQ was introduced, the queue size limit was >>> sysctl_tcp_limit_output_bytes. >>> >>> * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit >>> max(skb->truesize, sk->sk_pacing_rate >> 10). >>> This allows more packets in-flight according to the estimated rate. >>> >>> * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") >>> made the >>> limit >>> max_t(unsigned int, sysctl_tcp_limit_output_bytes, >>> sk->sk_pacing_rate >> 10). >>> This ensures at least sysctl_tcp_limit_output_bytes in flight but >>> allowed >>> more if rate estimation shows this to be worthwhile. >>> >>> * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit >>> min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10), >>> sysctl_tcp_limit_output_bytes). >>> This meant that the limit can never exceed >>> sysctl_tcp_limit_output_bytes, >>> regardless of what rate estimation suggests. It's not clear from >>> the commit >>> message why this significant change was justified, changing >>> sysctl_tcp_limit_output_bytes from being a lower bound to an >>> upper bound. >>> >>> This patch restores the behaviour that allows the limit to grow above >>> sysctl_tcp_limit_output_bytes according to the rate estimation. >>> >>> This has been measured to improve xen-netfront throughput from a domU >>> to dom0 >>> from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one >>> domU to >>> another (on the same host), throughput rose from 2.8 Gb/s to 8.0 >>> Gb/s. In the >>> latter case, TX completion is especially slow, explaining the large >>> improvement. >>> These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1" >>> using >>> CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a >>> pair of Xeon >>> E5-2650 v3 CPUs. >>> >>> Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing") >>> Signed-off-by: Jonathan Davies <jonathan.davies@xxxxxxxxxx> >>> --- >>> net/ipv4/tcp_output.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c >>> index 1db253e..3a49af8 100644 >>> --- a/net/ipv4/tcp_output.c >>> +++ b/net/ipv4/tcp_output.c >>> @@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, >>> unsigned int mss_now, int nonagle, >>> * One example is wifi aggregation (802.11 AMPDU) >>> */ >>> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); >>> - limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); >>> + limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes); >>> >>> if (atomic_read(&sk->sk_wmem_alloc) > limit) { >>> set_bit(TSQ_THROTTLED, &tp->tsq_flags); >> >> >> That will get a NACK from me and Google TCP team of course. >> >> Try your patch with low throughput flows, like 64kbps, GSO off... >> >> And observe TCP timestamps and rtt estimations, critical for vegas or >> any CC based on delay. >> >> I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets. >> >> This topic was discussed for wifi recently, and I suggested multiple >> ways to solve the problem on problematic drivers. >> >> There is a reason sysctl_tcp_limit_output_bytes exists : You can change >> its value to whatever you want. >> >> vi +652 Documentation/networking/ip-sysctl.txt >> >> tcp_limit_output_bytes - INTEGER >> Controls TCP Small Queue limit per tcp socket. >> TCP bulk sender tends to increase packets in flight until it >> gets losses notifications. With SNDBUF autotuning, this can >> result in a large amount of packets queued in qdisc/device >> on the local machine, hurting latency of other flows, for >> typical pfifo_fast qdiscs. >> tcp_limit_output_bytes limits the number of bytes on qdisc >> or device to reduce artificial RTT/cwnd and reduce bufferbloat. >> Default: 131072 >> >> Documentation is pretty clear : This is the max value, not a min one. > > Okay, thanks for your feedback. It wasn't clear to me from the commit > message in 605ad7f1 ("tcp: refine TSO autosizing") why > tcp_limit_output_bytes changed from being a lower bound to an upper > bound in that patch. But it's now clear from your reply that that's the > behaviour you intend as correct. > > Thanks for drawing my attention to the wifi thread, which is helpful. As > I understand it, the only available options for fixing the performance > regression experienced by xen-netfront are: > 1. Reduce TX completion latency, or > 2. Bypass TSQ to allow buffering of more than tcp_limit_output_bytes, > finding a way to work around that limit in the driver, e.g. > (a) pretending the sent packets are smaller than in reality, so > sk_wmem_alloc doesn't grow as fast; > (b) using skb_orphan in xennet_start_xmit to allow the skb to be > freed without waiting for TX completion. > > Do you agree, or have I missed something? > > Option 1 seems to me like the right long-term goal, but this would > require substantial changes to xen-netback and probably can't be > addressed in the near-term to fix this performance regression. > > Option 2 risks the problems associated with bufferbloat but may be the > only way to sustain high throughput. I have briefly tried out ideas (a) > and (b); both look like they can give almost the same throughput as the > patch above (around 7 Gb/s domU-to-domU) without the need to modify > tcp_write_xmit. (I don't know much about the implications of using > skb_orphan in ndo_start_xmit, so am not sure whether that may cause > other problems.) > > I would like to seek the opinion of the xen-netfront maintainers (cc'd). > You have a significant performance regression since commit 605ad7f1 > ("tcp: refine TSO autosizing"). Are either of options (a) or (b) viable > to fix it? Or is there a better solution? > > For your reference, the proof-of-concept patch for idea (a) I used was: > > @@ -630,6 +630,16 @@ static int xennet_start_xmit(struct sk_buff *skb, > struct net_device *dev) > > spin_unlock_irqrestore(&queue->tx_lock, flags); > > + /* Pretend we sent less than we did, so we can squeeze more out > + * of the available tcp_limit_output_bytes. > + */ > + if (skb->sk) { > + u32 fraction = (7 * skb->truesize) / 8; > + > + skb->truesize -= fraction; > + atomic_sub(fraction, &skb->sk->sk_wmem_alloc); > + } Playing games like this with skb->truesize looks dodgy to me. > And the proof-of-concept patch for idea (b) I used was: > > @@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb, > struct net_device *dev) > goto drop; > } > > + skb_orphan(skb); > + > page = virt_to_page(skb->data); > offset = offset_in_page(skb->data); > len = skb_headlen(skb); No. This a bunch of allocations and a full memcpy of all the frags. David _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.