Vincent Tommi

TCP and the Two Generals Problem: A Funny, Technical Guide to System Design (Day 19 of Learning System Design)

Picture two generals, Kasongo and Riggy, plotting a battle with carrier pigeons that might get lost. Sounds like a wild skit, right? This is the Two Generals Problem, and it's the key to understanding why TCP (Transmission Control Protocol) is so tricky in distributed systems. TCP powers reliable internet communication, but it runs over an asynchronous network, and that makes it a battlefield of uncertainty. In this article, we'll explore TCP's challenges using a hilarious Kasongo-and-Riggy analogy, dive into safety, liveness, timeouts, and DoS risks, and show how these shape system design for reliable, scalable software.

The Two Generals Problem: Pigeons and Pandemonium
Imagine generals Kasongo and Riggy planning to attack a city from opposite hills. They must attack at the same time, but their only communication is via carrier pigeons, which might get lost, eaten by hawks, or just chill in a tree. Here's the chaos:

  • Kasongo sends a pigeon: "Attack at dawn!"

  • Riggy gets it and sends back: "Dawn, I'm in!"

  • But Riggy can't be sure his reply arrived, so Kasongo sends: "Got your reply! Confirm you got this one!"

  • Riggy replies: "I got your confirmation, confirm mine!" And it's a pigeon frenzy.

This is the Two Generals Problem, a classic in distributed systems. It shows that no finite exchange of messages over an unreliable channel (like the internet) can leave both sides certain that they agree: whoever sent the last message never knows whether it arrived. TCP faces the same issue: two computers (endpoints) can't always know each other's state because packets (our pigeons) can get lost or delayed.
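
To make the "last pigeon" argument concrete, here is a minimal sketch in Python. It assumes a simplified model that is not part of the original story: messages are numbered 1, 2, 3, ..., odd ones fly from Kasongo to Riggy, even ones fly back, and a general's "view" is just the list of messages he has received. The function names are made up for this illustration.

```python
def views(delivered):
    """Return (kasongo_received, riggy_received) when the first `delivered`
    messages arrive and every later one is lost.
    Odd-numbered messages travel Kasongo -> Riggy, even ones travel back."""
    riggy_received = [m for m in range(1, delivered + 1) if m % 2 == 1]
    kasongo_received = [m for m in range(1, delivered + 1) if m % 2 == 0]
    return kasongo_received, riggy_received


def sender_can_tell(n):
    """Can the sender of message n distinguish 'n arrived' from 'n was lost'?"""
    arrived = views(n)       # world where message n got through
    lost = views(n - 1)      # world where message n was eaten by a hawk
    sender = 0 if n % 2 == 1 else 1  # 0 = Kasongo, 1 = Riggy
    # The sender can only tell the worlds apart if his own view differs.
    return arrived[sender] != lost[sender]


# No matter how many confirmations are stacked up, the last sender is in the dark.
print(any(sender_can_tell(n) for n in range(1, 100)))  # prints False
```

Adding one more confirmation only moves the uncertainty to the other general, which is why the pigeon frenzy never ends.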

Why it's funny: Picture Kasongo and Riggy buried in pigeon feathers, yelling, "ARE WE ATTACKING OR NOT?!" It's a perfect metaphor for TCP's struggle to sync up.

Two Generals Illustration

This shows the generals stuck in a loop, just like TCP endpoints without a way to guarantee agreement.

TCP's Asynchronous Battle
TCP ensures reliable communication, but its asynchronous model is like Kasongo and Riggy's pigeon problem. Endpoints can't have common knowledge of the connection's state. For example:

  • One endpoint might think the connection is active while the other has closed it.

  • Packets can be delayed, lost, or arrive out of order, like pigeons taking a detour.

In system design, this means building systems that tolerate uncertainty while staying reliable and scalable.

Safety vs. Liveness: The Generals' Strategy
Distributed systems like TCP balance two key properties:

  • Safety: Nothing bad happens. TCP delivers data to the application in order, without gaps, duplicates, or (as far as its checksum can detect) corruption, like Kasongo and Riggy ensuring their attack plan isn't misread.

  • Liveness: Progress happens. TCP wants data to keep flowing, like the generals actually attacking the city.

Here's the rub: safety is guaranteed, but liveness depends on the network. If packets (or pigeons) get lost, progress stalls. TCP makes safe assumptions (e.g., "the connection is open") and hopes for the best, just like Kasongo assuming Riggy got his message.
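
A toy receiver shows the split between the two properties. This is a sketch under simplifying assumptions, not how a real TCP stack is written: it numbers whole segments 0, 1, 2, ... (real TCP numbers bytes), and the class name `InOrderReceiver` is made up for this example.

```python
class InOrderReceiver:
    """Delivers payloads strictly in sequence order and drops duplicates."""

    def __init__(self):
        self.next_seq = 0      # next sequence number the application expects
        self.buffer = {}       # out-of-order segments waiting for a gap to fill
        self.delivered = []    # what the application has actually seen

    def receive(self, seq, payload):
        # Safety: ignore anything already delivered or already buffered.
        if seq < self.next_seq or seq in self.buffer:
            return
        self.buffer[seq] = payload
        # Liveness: deliver as far as the contiguous run allows, then wait.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1


receiver = InOrderReceiver()
receiver.receive(0, "attack")
receiver.receive(2, "dawn")   # segment 1 is still missing
receiver.receive(2, "dawn")   # duplicate pigeon
print(receiver.delivered)     # prints ['attack']
receiver.receive(1, "at")     # the gap is filled
print(receiver.delivered)     # prints ['attack', 'at', 'dawn']
```

While segment 1 is missing, nothing wrong is ever delivered (safety holds), but nothing new is delivered either (liveness stalls until the network cooperates).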

System Design Takeaway: Prioritize safety to avoid disasters (data loss) and design for fault tolerance to handle network failures gracefully.

Timeouts: The Generals' Deadline

To avoid waiting forever for a lost pigeon, TCP uses timeouts. It's like Kasongo saying, "If Riggy doesn't reply in 10 minutes, I'll send another pigeon or call off the attack."

How Timeouts Work

  • An endpoint sends a packet and waits for a response (e.g., an acknowledgment).

  • If the timeout expires, it retries or assumes the connection is broken.

  • Timeouts are crucial because users (and generals) have limited patience.
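
The send, wait, and retry loop above can be sketched as follows. This is an illustration, not TCP's actual code: `send_once` is a hypothetical callback that returns True when an acknowledgment arrives within the given timeout, and doubling the timeout after each failure is one common backoff choice assumed here.

```python
def send_with_retries(send_once, initial_timeout=1.0, max_attempts=4):
    """Call send_once(timeout) until it reports an acknowledgment.
    Returns (acked, timeouts_used)."""
    timeout = initial_timeout
    timeouts_used = []
    for _ in range(max_attempts):
        timeouts_used.append(timeout)
        if send_once(timeout):
            return True, timeouts_used
        timeout *= 2  # back off: wait longer before the next pigeon
    return False, timeouts_used  # give up and treat the connection as broken


def make_lossy_channel(drops):
    """Hypothetical test channel that loses the first `drops` attempts."""
    remaining = [drops]

    def send_once(timeout):
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return send_once


print(send_with_retries(make_lossy_channel(2)))   # prints (True, [1.0, 2.0, 4.0])
print(send_with_retries(make_lossy_channel(10)))  # prints (False, [1.0, 2.0, 4.0, 8.0])
```

Capping the number of attempts is what turns "wait forever" into a clear failure the caller can handle.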

Two Generals Spin: Kasongo sets a "pigeon deadline." If no reply comes, he sends another pigeon or assumes Riggy's camp is lost to hawks. TCP's adaptive timeouts adjust based on network conditions to avoid giving up too soon or waiting too long.
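
For a feel of how an adaptive timeout can work, here is a small sketch based on the smoothing formulas published in RFC 6298 (smoothed round-trip time plus four times its variation, with a 1-second floor). The class name and the clock granularity default are assumptions for this example, and real stacks add more on top, such as backing off after each retransmission and often a lower floor.

```python
class RtoEstimator:
    """Tracks a retransmission timeout (RTO) from round-trip time samples."""

    ALPHA = 1 / 8   # weight of a new sample in the smoothed RTT
    BETA = 1 / 4    # weight of a new sample in the RTT variation
    K = 4           # how many "variations" of safety margin to add

    def __init__(self, min_rto=1.0, clock_granularity=0.001):
        self.srtt = None       # smoothed round-trip time, in seconds
        self.rttvar = None     # round-trip time variation, in seconds
        self.rto = 1.0         # initial timeout before any sample is seen
        self.min_rto = min_rto
        self.granularity = clock_granularity

    def observe(self, rtt):
        """Feed one measured round-trip time and return the updated RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variation is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        margin = max(self.granularity, self.K * self.rttvar)
        self.rto = max(self.min_rto, self.srtt + margin)
        return self.rto


estimator = RtoEstimator()
print(estimator.observe(0.5))  # prints 1.5
print(estimator.observe(0.5))  # prints 1.25
```

Steady pigeons shrink the margin and the deadline tightens; erratic pigeons grow the variation term and the deadline relaxes.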

System Design Lesson

  • Trade-offs: Timeouts balance responsiveness (quick retries) and reliability (avoiding premature retries).

  • User Experience: Choose timeouts that keep users happy, avoiding Kasongo-level impatience.

This shows a client retrying after a timeout, mirroring Kasongo resending a pigeon.

TCP Handshake: Generals Shaking Pigeons
TCP establishes connections with a three-way handshake:

  • Client sends a SYN (Kasongo's "Attack at dawn!").

  • Server responds with a SYN-ACK (Riggy's "Got it!").

  • Client sends an ACK (Kasongo's "We're on!").

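As in the Two Generals Problem, neither side can be sure its last message arrived. After sending the SYN-ACK, the server assumes the connection is forming; if the SYN-ACK or the client's final ACK gets lost, the two sides briefly disagree about the connection's state. This is a benign misunderstanding: the assumption is safe (no data is lost), but progress is delayed until a retransmission or timeout clears it up.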
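
The handshake and its possible misunderstandings can be traced with a tiny simulation. It is a sketch under stated assumptions: a single attempt with no retransmissions, the standard TCP state names written with underscores, and a made-up `handshake` function that takes the set of segments lost in transit.

```python
def handshake(lost=()):
    """Return (client_state, server_state) after one handshake attempt.
    `lost` names segments dropped in transit: "SYN", "SYN-ACK", "ACK"."""
    client, server = "SYN_SENT", "LISTEN"  # client has just sent its SYN
    if "SYN" in lost:
        return client, server
    server = "SYN_RECEIVED"                # server got the SYN, replies SYN-ACK
    if "SYN-ACK" in lost:
        return client, server
    client = "ESTABLISHED"                 # client got the SYN-ACK, sends ACK
    if "ACK" in lost:
        return client, server
    server = "ESTABLISHED"                 # server got the final ACK
    return client, server


print(handshake())                  # prints ('ESTABLISHED', 'ESTABLISHED')
print(handshake(lost={"ACK"}))      # prints ('ESTABLISHED', 'SYN_RECEIVED')
print(handshake(lost={"SYN-ACK"}))  # prints ('SYN_SENT', 'SYN_RECEIVED')
```

In the second case Kasongo already believes the attack is on while Riggy is still waiting for confirmation, which is exactly the disagreement that retransmissions and timeouts exist to repair.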

Handshake Illustration

This shows the ideal handshake, but lost packets could disrupt it, like a hawk snatching a pigeon.

DoS Attacks: Pigeons Overwhelm the Camp
Benign misunderstandings can cause chaos. When a server gets a SYN, it allocates resources (memory) for the connection, expecting the client to finish the handshake. A malicious client can send thousands of SYNs without completing the handshake, causing a SYN flood, a denial of service (DoS) attack. It's like a prankster flooding Riggy's camp with fake pigeons, forcing him to reserve soldiers for a nonexistent attack.

System Design Lesson

  • Scalability: Use SYN cookies, a lightweight way to verify connections without allocating memory until the handshake completes.

  • Security: Monitor for suspicious patterns (e.g., many SYNs from one source) and block attackers.

  • Fault Tolerance: Design systems to handle malicious inputs without crashing.
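
As one way to spot "many SYNs from one source," here is a minimal sliding-window counter. The class name, the limits, and the idea of passing in timestamps by hand are all assumptions made for this sketch. Attackers who spoof source addresses can dodge per-source counting, so treat it as one signal among several rather than a complete defense.

```python
from collections import defaultdict, deque


class SynRateMonitor:
    """Flags sources that send more than `max_syns` SYNs within `window_seconds`."""

    def __init__(self, max_syns=100, window_seconds=1.0):
        self.max_syns = max_syns
        self.window = window_seconds
        self.seen = defaultdict(deque)  # source -> timestamps of recent SYNs

    def record_syn(self, source, now):
        """Record a SYN from `source` at time `now` (seconds).
        Returns True if the source is over the limit."""
        times = self.seen[source]
        times.append(now)
        # Forget SYNs that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) > self.max_syns


monitor = SynRateMonitor(max_syns=3, window_seconds=1.0)
flags = [monitor.record_syn("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(flags)  # prints [False, False, False, True]
```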

Two Generals Spin: Riggy learns to ignore fake pigeons by using a "pigeon code" (like SYN cookies) to verify real messages before committing resources.
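
Riggy's "pigeon code" can be sketched like this. The idea is that the server derives its initial sequence number from the connection details and a secret, stores nothing, and later checks that the client's final ACK echoes that number plus one. This is a simplified illustration: the secret, the function names, and the use of HMAC-SHA256 are assumptions for the example, and real SYN cookie implementations also pack details such as the maximum segment size into the number.

```python
import hashlib
import hmac
import struct

SECRET = b"hypothetical-server-secret"  # assumption: a random per-server key


def make_cookie(src_ip, src_port, dst_ip, dst_port, counter, secret=SECRET):
    """Derive a 32-bit initial sequence number from the connection details.
    `counter` is a slowly increasing time counter (for example, minutes)."""
    message = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}@{counter}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]


def ack_is_valid(ack_number, src_ip, src_port, dst_ip, dst_port, counter, secret=SECRET):
    """Accept the final ACK if it echoes cookie + 1 for the current or previous counter."""
    for c in (counter, counter - 1):
        cookie = make_cookie(src_ip, src_port, dst_ip, dst_port, c, secret)
        if ack_number == (cookie + 1) % 2**32:
            return True
    return False


# The server sends the cookie in its SYN-ACK and keeps no per-connection state.
cookie = make_cookie("10.0.0.1", 5000, "10.0.0.2", 80, counter=7)
client_ack = (cookie + 1) % 2**32
print(ack_is_valid(client_ack, "10.0.0.1", 5000, "10.0.0.2", 80, counter=7))  # prints True
```

Memory is only committed once a valid ACK comes back, so a flood of fake SYNs costs the server a hash computation each instead of a reserved slot.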

System Design Principles in Action
The Two Generals Problem and TCP highlight core system design principles:

  • Reliability: TCP's safety ensures data integrity, even with lost packets.

  • Scalability: Protect against DoS attacks to handle millions of connections.

  • Fault Tolerance: Timeouts and retries manage network failures, like lost pigeons.

  • Trade-offs: Balance responsiveness (short timeouts) with reliability (avoiding premature retries).

  • Security: Mitigate risks like SYN floods to keep systems robust.

These principles are critical for building distributed systems, whether it's a web app, a microservice, or a cloud platform.

Final Thoughts
TCP's fight with the Two Generals Problem is a hilarious yet profound lesson in system design. Like Kasongo and Riggy dodging hawk attacks, TCP uses safety, timeouts, and clever assumptions to keep the internet running. Whether you're building a startup's backend or a global cloud service, these principles will help you conquer network chaos.
Got a funny distributed systems tale or a TCP question? Drop it in the comments!

Top comments (4)

Hayessolo

This is a fun way to learn about system design, but the lessons on timeouts and fault tolerance really hit home. Thanks for sharing.

Vincent Tommi

Thank you for the feedback.

Hayessolo

How do you decide on the right timeout values for TCP in a real-world system? Or is it just trial and error?

Vincent Tommi

To choose TCP timeout values for a real-world system, think of timeouts like how long you wait for a text reply. Set them based on:

Network Speed: Fast networks (e.g., Wi-Fi) need short timeouts (1-2 seconds); slow ones (e.g., mobile) need longer (5-10 seconds).
App Needs: Quick apps (e.g., video calls) use short timeouts; patient apps (e.g., file downloads) can wait longer.
Timeout Types:

Retry wait (e.g., 0.3 seconds for resending lost data).
Connection wait (e.g., 3-5 seconds to connect).
Keep-alive check (e.g., 5 minutes to confirm connection).

Balance: Short timeouts are fast but may fail on slow networks; long ones are reliable but slow.
Start and Test: Use default settings (e.g., 3 seconds to connect), test your app, and adjust if it's too slow or error-prone.
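
As a starting point for experimenting, here is a small sketch using Python's standard socket module. The helper name and the default values are assumptions picked to match the examples above, not recommendations. The retry wait for lost data is managed by the operating system's TCP stack, so this only touches the timeouts an application normally sets itself.

```python
import socket


def configure_timeouts(sock, io_timeout=5.0, keepalive_idle=300):
    """Apply a connect/read/write timeout and enable keep-alive probes."""
    # Applies to connect(), recv(), and send() calls on this socket.
    sock.settimeout(io_timeout)
    # Ask the OS to probe idle connections so dead peers are eventually noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The idle time before the first probe is platform-specific;
    # TCP_KEEPIDLE exists on Linux but not everywhere, so guard it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepalive_idle)
    return sock


sock = configure_timeouts(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.gettimeout())  # prints 5.0
sock.close()
```

From there, measure how often connects and reads time out in your real environment and adjust the numbers.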