TCP Nagle's Delay

Someone recently asked me to help diagnose a mysterious delay on a globally deployed Elixir service.

Symptoms

Explanation

The client configuration appeared correct: TCP_NODELAY was set. Further investigation revealed that the configuration parameters weren’t being honored, and we managed to trace it to this client defect. This resulted in the following condition:

[Nagle’s] algorithm interacts badly with TCP delayed acknowledgments (delayed ACK), a feature introduced into TCP at roughly the same time in the early 1980s, but by a different group. With both algorithms enabled, applications that do two successive writes to a TCP connection, followed by a read that will not be fulfilled until after the data from the second write has reached the destination, experience a constant delay of up to 500 milliseconds, the “ACK delay”. It is recommended to disable either, although traditionally it’s easier to disable Nagle, since such a switch already exists for real-time applications.

―Wikipedia
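Since the root cause here was an option that was requested but never applied, one cheap safeguard is to read the option back from the socket instead of trusting the configuration. The following is a minimal sketch in Python rather than Elixir, using only the standard `socket` module; the helper names `nodelay_enabled` and `disable_nagle` are my own and not part of any library.

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    """Return True if the kernel reports TCP_NODELAY as set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def disable_nagle(sock: socket.socket) -> None:
    """Request TCP_NODELAY, then read it back to confirm it took effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if not nodelay_enabled(sock):
        raise RuntimeError("TCP_NODELAY was requested but is not in effect")


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
before = nodelay_enabled(sock)  # Nagle's algorithm is on by default
disable_nagle(sock)
after = nodelay_enabled(sock)
sock.close()
```

A check like this only covers sockets you can reach directly. When a client library owns the socket, as in our case, you would need whatever inspection hook that library exposes, or a packet capture.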

With delayed acknowledgments and Nagle’s algorithm:

sequenceDiagram
    participant Client
    participant Server
    Client->>+Server: "H"
    Note right of Server: ~100ms delay
    Server->>-Client: ACK
    Client->>+Server: "ELLO"
    Server->>-Client: ACK

With delayed acknowledgments but without Nagle’s algorithm:

sequenceDiagram
    participant Client
    participant Server
    Client->>+Server: "H"
    Client->>Server: "ELLO"
    Server->>-Client: ACK
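The write-write-read pattern in these diagrams can be sketched as a small loopback experiment. This is an illustration under assumptions, not a reproduction of the production issue: the function names `serve_once` and `write_write_read` are hypothetical, and loopback timings depend heavily on the operating system's delayed-ACK behavior, so the delay may not show up at all on a given machine.

```python
import socket
import threading
import time


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, wait for all 5 bytes, then echo them back."""
    conn, _ = listener.accept()
    with conn:
        received = b""
        while len(received) < 5:
            chunk = conn.recv(16)
            if not chunk:
                break
            received += chunk
        conn.sendall(received)


def write_write_read(nodelay: bool) -> tuple[bytes, float]:
    """Send "H" then "ELLO" as two writes, then block on the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    server.start()

    client = socket.create_connection(listener.getsockname(), timeout=5)
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)

    start = time.perf_counter()
    client.sendall(b"H")     # first small write goes out immediately
    client.sendall(b"ELLO")  # with Nagle on, this may wait for the ACK of "H"
    reply = b""
    while len(reply) < 5:
        chunk = client.recv(16)
        if not chunk:
            break
        reply += chunk
    elapsed = time.perf_counter() - start

    client.close()
    server.join(timeout=5)
    listener.close()
    return reply, elapsed


reply_nagle, elapsed_nagle = write_write_read(nodelay=False)
reply_nodelay, elapsed_nodelay = write_write_read(nodelay=True)
print(f"Nagle enabled:  {elapsed_nagle * 1000:.1f} ms")
print(f"Nagle disabled: {elapsed_nodelay * 1000:.1f} ms")
```

Both runs return the same bytes; only the elapsed time may differ. Combining the two writes into a single `sendall(b"HELLO")` would also avoid the stall, regardless of the socket option.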

N.B. I have previously helped facilitate SRE training for this scenario.