Skip to content

Commit 5442a9d

Browse files
netoptimizerkuba-moo
authored andcommitted
veth: more robust handing of race to avoid txq getting stuck
Commit dc82a33 ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") introduced a race condition that can lead to a permanently stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra Max). The race occurs in veth_xmit(). The producer observes a full ptr_ring and stops the queue (netif_tx_stop_queue()). The subsequent conditional logic, intended to re-wake the queue if the consumer had just emptied it (if (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and traffic halts. This failure is caused by an incorrect use of the __ptr_ring_empty() API from the producer side. As noted in kernel comments, this check is not guaranteed to be correct if a consumer is operating on another CPU. The empty test is based on ptr_ring->consumer_head, making it reliable only for the consumer. Using this check from the producer side is fundamentally racy. This patch fixes the race by adopting the more robust logic from an earlier version V4 of the patchset, which always flushed the peer: (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier are removed. Instead, after stopping the queue, we unconditionally call __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled, making it solely responsible for re-waking the TXQ. This handles the race where veth_poll() consumes all packets and completes NAPI *before* veth_xmit() on the producer side has called netif_tx_stop_queue. The __veth_xdp_flush(rq) will observe rx_notify_masked is false and schedule NAPI. (2) On the consumer side, the logic for waking the peer TXQ is moved out of veth_xdp_rcv() and placed at the end of the veth_poll() function. This placement is part of fixing the race, as the netif_tx_queue_stopped() check must occur after rx_notify_masked is potentially set to false during NAPI completion. This handles the race where veth_poll() consumes all packets, but haven't finished (rx_notify_masked is still true). The producer veth_xmit() stops the TXQ and __veth_xdp_flush(rq) will observe rx_notify_masked is true, meaning not starting NAPI. Then veth_poll() change rx_notify_masked to false and stops NAPI. Before exiting veth_poll() will observe TXQ is stopped and wake it up. Fixes: dc82a33 ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/176295323282.307447.14790015927673763094.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1 parent dfe28c4 commit 5442a9d

1 file changed

Lines changed: 20 additions & 18 deletions

File tree

drivers/net/veth.c

Lines changed: 20 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
392392
}
393393
/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
394394
__skb_push(skb, ETH_HLEN);
395-
/* Depend on prior success packets started NAPI consumer via
396-
* __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
397-
* paired with empty check in veth_poll().
398-
*/
399395
netif_tx_stop_queue(txq);
400-
smp_mb__after_atomic();
401-
if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
402-
netif_tx_wake_queue(txq);
396+
/* Makes sure NAPI peer consumer runs. Consumer is responsible
397+
* for starting txq again, until then ndo_start_xmit (this
398+
* function) will not be invoked by the netstack again.
399+
*/
400+
__veth_xdp_flush(rq);
403401
break;
404402
case NET_RX_DROP: /* same as NET_XMIT_DROP */
405403
drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
900898
struct veth_xdp_tx_bq *bq,
901899
struct veth_stats *stats)
902900
{
903-
struct veth_priv *priv = netdev_priv(rq->dev);
904-
int queue_idx = rq->xdp_rxq.queue_index;
905-
struct netdev_queue *peer_txq;
906-
struct net_device *peer_dev;
907901
int i, done = 0, n_xdpf = 0;
908902
void *xdpf[VETH_XDP_BATCH];
909903

910-
/* NAPI functions as RCU section */
911-
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
912-
peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
913-
914904
for (i = 0; i < budget; i++) {
915905
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
916906

@@ -959,22 +949,27 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
959949
rq->stats.vs.xdp_packets += done;
960950
u64_stats_update_end(&rq->stats.syncp);
961951

962-
if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
963-
netif_tx_wake_queue(peer_txq);
964-
965952
return done;
966953
}
967954

968955
static int veth_poll(struct napi_struct *napi, int budget)
969956
{
970957
struct veth_rq *rq =
971958
container_of(napi, struct veth_rq, xdp_napi);
959+
struct veth_priv *priv = netdev_priv(rq->dev);
960+
int queue_idx = rq->xdp_rxq.queue_index;
961+
struct netdev_queue *peer_txq;
972962
struct veth_stats stats = {};
963+
struct net_device *peer_dev;
973964
struct veth_xdp_tx_bq bq;
974965
int done;
975966

976967
bq.count = 0;
977968

969+
/* NAPI functions as RCU section */
970+
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
971+
peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
972+
978973
xdp_set_return_frame_no_direct();
979974
done = veth_xdp_rcv(rq, budget, &bq, &stats);
980975

@@ -996,6 +991,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
996991
veth_xdp_flush(rq, &bq);
997992
xdp_clear_return_frame_no_direct();
998993

994+
/* Release backpressure per NAPI poll */
995+
smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
996+
if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
997+
txq_trans_cond_update(peer_txq);
998+
netif_tx_wake_queue(peer_txq);
999+
}
1000+
9991001
return done;
10001002
}
10011003

0 commit comments

Comments
 (0)