[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

On 26/03/15 17:14, Eric Dumazet wrote:
On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote:
Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

  * Since TSQ was introduced, the queue size limit was

  * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit
        max(skb->truesize, sk->sk_pacing_rate >> 10).
    This allows more packets in-flight according to the estimated rate.

  * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the
        max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                            sk->sk_pacing_rate >> 10).
    This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
    more if rate estimation shows this to be worthwhile.

  * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit
        min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
    This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
    regardless of what rate estimation suggests. It's not clear from the commit
    message why this significant change was justified, changing
    sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 using "iperf -c <ip> -i 1" using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies <jonathan.davies@xxxxxxxxxx>
  net/ipv4/tcp_output.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
                 * One example is wifi aggregation (802.11 AMPDU)
                limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+               limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
                        set_bit(TSQ_THROTTLED, &tp->tsq_flags);

That will get a NACK from me and Google TCP team of course.

Try your patch with low throughput flows, like 64kbps, GSO off...

And observe TCP timestamps and rtt estimations, critical for vegas or
any CC based on delay.

I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets.

This topic was discussed for wifi recently, and I suggested multiple
ways to solve the problem on problematic drivers.

There is a reason sysctl_tcp_limit_output_bytes exists : You can change
its value to whatever you want.

vi +652 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
         Controls TCP Small Queue limit per tcp socket.
         TCP bulk sender tends to increase packets in flight until it
         gets losses notifications. With SNDBUF autotuning, this can
         result in a large amount of packets queued in qdisc/device
         on the local machine, hurting latency of other flows, for
         typical pfifo_fast qdiscs.
         tcp_limit_output_bytes limits the number of bytes on qdisc
         or device to reduce artificial RTT/cwnd and reduce bufferbloat.
         Default: 131072

Documentation is pretty clear : This is the max value, not a min one.

Okay, thanks for your feedback. It wasn't clear to me from the commit message in 605ad7f1 ("tcp: refine TSO autosizing") why tcp_limit_output_bytes changed from being a lower bound to an upper bound in that patch. But it's now clear from your reply that that's the behaviour you intend as correct.

Thanks for drawing my attention to the wifi thread, which is helpful. As I understand it, the only available options for fixing the performance regression experienced by xen-netfront are:
  1. Reduce TX completion latency, or
2. Bypass TSQ to allow buffering of more than tcp_limit_output_bytes, finding a way to work around that limit in the driver, e.g. (a) pretending the sent packets are smaller than in reality, so sk_wmem_alloc doesn't grow as fast; (b) using skb_orphan in xennet_start_xmit to allow the skb to be freed without waiting for TX completion.

Do you agree, or have I missed something?

Option 1 seems to me like the right long-term goal, but this would require substantial changes to xen-netback and probably can't be addressed in the near-term to fix this performance regression.

Option 2 risks the problems associated with bufferbloat but may be the only way to sustain high throughput. I have briefly tried out ideas (a) and (b); both look like they can give almost the same throughput as the patch above (around 7 Gb/s domU-to-domU) without the need to modify tcp_write_xmit. (I don't know much about the implications of using skb_orphan in ndo_start_xmit, so am not sure whether that may cause other problems.)

I would like to seek the opinion of the xen-netfront maintainers (cc'd). You have a significant performance regression since commit 605ad7f1 ("tcp: refine TSO autosizing"). Are either of options (a) or (b) viable to fix it? Or is there a better solution?

For your reference, the proof-of-concept patch for idea (a) I used was:

@@ -630,6 +630,16 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)

        spin_unlock_irqrestore(&queue->tx_lock, flags);

+       /* Pretend we sent less than we did, so we can squeeze more out
+         * of the available tcp_limit_output_bytes.
+         */
+       if (skb->sk) {
+               u32 fraction = (7 * skb->truesize) / 8;
+               skb->truesize -= fraction;
+               atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
+       }
        return NETDEV_TX_OK;


And the proof-of-concept patch for idea (b) I used was:

@@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
                        goto drop;

+       skb_orphan(skb);
        page = virt_to_page(skb->data);
        offset = offset_in_page(skb->data);
        len = skb_headlen(skb);


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.