Skip to content

Commit 4bfe744

Browse files
edumazetdavem330
authored andcommitted
tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
I had this bug sitting for too long in my pile, it is time to fix it. Thanks to Doug Porter for reminding me of it! We had various attempts in the past, including commit 0cbe6a8 ("tcp: remove SOCK_QUEUE_SHRUNK"), but the issue is that TCP stack currently only generates EPOLLOUT from input path, when tp->snd_una has advanced and skb(s) cleaned from rtx queue. If a flow has a big RTT, and/or receives SACKs, it is possible that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0 and no more data can be sent until tp->snd_una finally advances. What is needed is to also check if POLLOUT needs to be generated whenever tp->snd_nxt is advanced, from output path. This bug triggers more often after an idle period, as we do not receive ACK for at least one RTT. tcp_notsent_lowat could be a fraction of what CWND and pacing rate would allow to send during this RTT. In a followup patch, I will remove the bogus call to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED) from tcp_check_space(). Fact that we have decided to generate an EPOLLOUT does not mean the application has immediately refilled the transmit queue. This optimistic call might have been the reason the bug seemed not too serious. Tested: 200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2] $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat $ cat bench_rr.sh SUM=0 for i in {1..10} do V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"` echo $V SUM=$(($SUM + $V)) done echo SUM=$SUM Before patch: $ bench_rr.sh 130000000 80000000 140000000 140000000 140000000 140000000 130000000 40000000 90000000 110000000 SUM=1140000000 After patch: $ bench_rr.sh 430000000 590000000 530000000 450000000 450000000 350000000 450000000 490000000 480000000 460000000 SUM=4680000000 # This is 410 % of the value before patch. Fixes: c9bee3b ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Doug Porter <dsp@fb.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1 parent 1fcb8fb commit 4bfe744

3 files changed

Lines changed: 13 additions & 1 deletion

File tree

include/net/tcp.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -620,6 +620,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
620620
void tcp_reset(struct sock *sk, struct sk_buff *skb);
621621
void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
622622
void tcp_fin(struct sock *sk);
623+
void tcp_check_space(struct sock *sk);
623624

624625
/* tcp_timer.c */
625626
void tcp_init_xmit_timers(struct sock *);

net/ipv4/tcp_input.c

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5454,7 +5454,17 @@ static void tcp_new_space(struct sock *sk)
54545454
INDIRECT_CALL_1(sk->sk_write_space, sk_stream_write_space, sk);
54555455
}
54565456

5457-
static void tcp_check_space(struct sock *sk)
5457+
/* Caller made space either from:
5458+
* 1) Freeing skbs in rtx queues (after tp->snd_una has advanced)
5459+
* 2) Sent skbs from output queue (and thus advancing tp->snd_nxt)
5460+
*
5461+
* We might be able to generate EPOLLOUT to the application if:
5462+
* 1) Space consumed in output/rtx queues is below sk->sk_sndbuf/2
5463+
* 2) notsent amount (tp->write_seq - tp->snd_nxt) became
5464+
* small enough that tcp_stream_memory_free() decides it
5465+
* is time to generate EPOLLOUT.
5466+
*/
5467+
void tcp_check_space(struct sock *sk)
54585468
{
54595469
/* pairs with tcp_poll() */
54605470
smp_mb();

net/ipv4/tcp_output.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -82,6 +82,7 @@ static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
8282

8383
NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
8484
tcp_skb_pcount(skb));
85+
tcp_check_space(sk);
8586
}
8687

8788
/* SND.NXT, if window was not shrunk or the amount of shrunk was less than one

0 commit comments

Comments
 (0)