Introduction to distributed systems

June 27th 26

I didn't intend to write a post about distributed systems. I'm in fact starting a new series exploring interesting algorithms, ranging from graphics rendering to cryptography, and the first post happens to be about consensus in distributed systems. My goal with this series is to go beyond the theory by implementing each algorithm step by step, focusing on understanding why it works rather than just reproducing it.

Given the programming background of my audience, I thought this would be a useful guide to some of the common concepts I'll be referencing in upcoming posts. If you work with distributed systems or infrastructure and are already familiar with concepts like Byzantine fault tolerance, leader election, and network partitions, feel free to skip this post.

Why distributed systems exist

Before telling you what a distributed system is, let me first lay out the problems it's trying to solve.

Imagine you have to provide digital services to millions or even billions of clients. Think companies like Netflix, Google, or any large bank. You rapidly face challenges related to scale, reliability, cost-efficiency, and latency with the traditional approach of relying on a single server.

The first limitation is simply size. A server can only grow so much before hitting the limits of its underlying hardware: CPU processing power, RAM, storage, or the amount of network traffic it can handle. More importantly, relying on a single server creates a single point of failure (SPOF). If that server goes down, everything stops working. Finally, there's geographical latency. Hosting your application in a single location increases the time it takes for users on the other side of the world to access it.

Now imagine that instead of making one server bigger and bigger (vertical scaling), you build a network of smaller computers and distribute the traffic between them (horizontal scaling).

By doing that, you can address many of the challenges we just discussed:

Scaling becomes as simple as adding more compute resources to the network. That network is called a cluster, and each machine in it is called a node.
By replicating services across multiple nodes and using health checks and failover mechanisms, the system can detect failures, isolate unhealthy nodes, and redirect traffic to healthy ones, greatly reducing downtime.
You can deploy nodes in multiple regions around the world, placing them closer to your users and significantly reducing latency.
A distributed system can also improve cost-efficiency. Combined with an orchestration system, it's possible to dynamically add more nodes to the cluster during traffic peaks and release those resources when demand drops. This allows you to pay only for the compute resources you actually need.

That being said, distributed systems don't automatically reduce costs. They introduce additional infrastructure, operational complexity, and networking overhead, so whether they are more cost-efficient depends heavily on the workload.

Beyond cost, distributed systems also come with inherent challenges. They increase the attack surface from a security perspective, introduce more moving parts to deploy, monitor, and maintain, and make the overall system harder to reason about due to partial failures, concurrency, and network unpredictability. Like anything in engineering, it's always about compromises. In distributed systems, however, this becomes even more explicit: you are constantly trading simplicity, consistency, and performance against scale, availability, and resilience. Designing distributed systems therefore means designing for failure; accepting that failure is inevitable, but building systems that can continue operating seamlessly despite it.

To put it simply,

A distributed system is a collection of independent computers and devices that work together over a network so that, from the outside, they appear to be a single, unified system.

Distributed systems are not limited to web services and cloud infrastructure. They also appear in many other domains, one of the most well-known being blockchain networks, which are distributed by design.

Take Bitcoin as an example. It is a peer-to-peer network where thousands of independent nodes maintain and update a shared ledger of transactions without any central authority. Instead of trusting a single server, the system relies on a collective agreement mechanism called Proof of Work to decide which version of the ledger is valid.

The key difference in this kind of system is that you cannot assume nodes are simply unreliable or slow; you must also assume that some of them may behave maliciously. They might try to cheat the system, send conflicting information, or attempt to rewrite history for their own benefit.

This is where the notion of Byzantine fault tolerance becomes essential. It refers to the ability of a distributed system to reach agreement even when some participants are faulty or actively adversarial. In other words, it extends the idea of consensus beyond simple failures, into environments where trust itself cannot be assumed.

How to make a distrubuted system work

Now that you're managing more than one server, one of the core problems you need to solve is coordination. How do you make every server in the cluster agree on a decision? How do you ensure they all share the same view of the system?

That's where consensus algorithms come into play.

Before looking at how consensus works, let's first walk through a typical scenario where it becomes necessary.

State Machine Replication (SMR)

Imagine you're building a distributed database that must remain available even when some servers fail. A common approach is to replicate the database across multiple servers, called replicas, so that if one goes down, the others can continue serving requests.

However, replication alone isn't enough. Every replica must contain the exact same data at all times. To achieve this, each replica processes the exact same sequence of operations (in our case, database transactions) in the exact same order. Assuming the database behaves deterministically, every replica will produce the same result and therefore remain in a consistent state.

This technique is known as State Machine Replication (SMR). It is one of the fundamental techniques used in distributed systems to achieve fault tolerance and high availability.

But this immediately raises another question: who decides the order of those operations? If two replicas receive different requests at the same time, or if the network delays some messages, how can they all agree on the same sequence of operations?

This is precisely the problem that consensus algorithms solve.

Consensus

Consensus is a fundamental and challenging problem in distributed systems that deals with getting a group of distributed nodes to agree on a single sequence of decisions despite failures and unreliable communication.

Difficulties of distributed consensus

Distributed systems operate under assumptions that don't exist on a single machine. Let's go through some of the most common challenges:

Message delays and loss. Networks are inherently unpredictable. Messages can be delayed, reordered, duplicated, or even lost due to congestion, hardware failures, or broken network links.
Node crashes. A server can crash, stop responding unexpectedly because of hardware failures, software bugs, power outages, or countless other reasons.
Clock disagreement. There is no perfectly synchronized global clock. Since nodes can be spread across the world, each machine has its own local notion of time, making it difficult to determine the true order of events.

As it turns out, distributed consensus isn't just difficult, it is impossible to solve perfectly under these assumptions.

This result is formalized by the FLP impossibility theorem, which states that in a fully asynchronous (no timing guarantees) distributed system where even a single node may crash, there is no deterministic consensus algorithm that can guarantee both safety (never reaching the wrong decision) and liveness (eventually reaching a decision) in every possible execution.

The key idea is surprisingly simple: if messages can be delayed indefinitely, there is no way to distinguish a crashed node from a very slow one. A correct algorithm must therefore keep waiting in case the missing message eventually arrives, but waiting forever means it may never make progress.

This doesn't mean consensus is impossible in practice. It simply means that real-world consensus algorithms must relax some assumptions, rely on timeouts, or accept trade-offs in order to guarantee progress.

Network partitions

One of the most disruptive failures in a distributed system is a network partition. A network partition occurs when a network is divided into smaller, isolated groups of nodes that can no longer communicate with each other, even though each group continues operating normally.

Without proper coordination, each isolated group could believe it is the legitimate one and continue accepting requests independently. This situation is known as a split-brain scenario. Since each partition evolves separately, they may eventually produce conflicting states. When the partition heals, reconciling those conflicting changes can become extremely difficult, or even impossible, without losing data.

Preventing split-brain is one of the primary goals of consensus algorithms, and it naturally leads us to one of the most influential results in distributed systems: the CAP theorem.

The CAP theorem states that when a network partition occurs, a distributed system cannot simultaneously guarantee both consistency and availability. It must sacrifice one in favor of the other.

A consistent system guarantees that every client observes the same data, regardless of which node they contact.
An available system guarantees that every request receives a response, even if that response may not reflect the latest state.
Partition tolerance means the system continues operating despite parts of the network being unable to communicate.

Since network partitions are an unavoidable reality, modern distributed systems must tolerate them. In practice, this means choosing between preserving consistency by temporarily rejecting some requests, or preserving availability by allowing different parts of the system to continue operating independently until communication is restored.

Failure models

Distributed systems are often classified based on how components can fail. Understanding these failure models is important because different consensus algorithms are designed to tolerate different kinds of failures.

Crash failures. A node simply stops working and takes no further action. It may appear offline permanently, but it does not send incorrect or malicious messages. It just stops responding.
Omission failures. A node continues operating but may fail to send or receive messages. For example, it might drop requests, fail to respond in time, or lose incoming messages due to network or software issues.
Byzantine failures. A node can behave arbitrarily. It may send conflicting information to different nodes, lie about its state, or act maliciously in an attempt to disrupt the system. This is the strongest and most general failure model.

How practical consensus algorithms work

Most practical consensus algorithms follow a leader-based approach. Think of the leader as the conductor of an orchestra.

Although the details vary between algorithms, the overall process is remarkably similar.

Elect a leader. One node is elected to coordinate the cluster. To detect failures, the leader periodically sends heartbeats to the other nodes. If followers stop receiving heartbeats for a certain amount of time, they assume the leader has failed and start a new election. The election process itself differs between algorithms, and we'll study one in detail in the next post.
Clients send requests to the leader. Rather than communicating with every node, clients send their requests to the leader. If a client contacts a follower instead, that follower simply redirects the request to the current leader. From the client's perspective, the cluster therefore behaves as a single system.
The leader appends the command to its log. Instead of immediately executing the request, the leader first records it as a new entry in its replicated log.
The log entry is replicated. The leader sends the new log entry to every follower. Each follower appends the entry to its own log and acknowledges that it has been received.
A majority confirms the entry. Once the leader receives acknowledgements from a majority of the cluster, the entry is considered committed. This majority is called a quorum and is typically defined as $Q = \left\lfloor \frac{N}{2} \right\rfloor + 1$ . The key insight, which we'll explore in the next post, is that any two quorums always overlap by at least one node. At this point, the leader knows that the command cannot be lost, even if it crashes immediately afterward.
The command is applied. Only after a command has been committed does each replica execute it on its local state machine. Since every replica applies the same commands in the same order, they all remain in the same state.

Although they are often confused, consensus and Two-Phase Commit (2PC) solve different problems.

Two-Phase Commit (2PC) is a blocking protocol that coordinates a transaction across multiple participants. It requires every participant to respond before the transaction can be committed. If the coordinator or, in some cases, even a single participant fails at the wrong moment, the protocol can block until the failure is resolved. Its goal is atomicity: ensuring that either every participant commits the transaction or none of them do.

Consensus, on the other hand, is a fault-tolerant protocol. It only requires a quorum to make progress, allowing the system to continue operating even if some nodes fail. Its goal is replication: ensuring that all healthy replicas agree on the same sequence of operations.

At this point, we've only described the mechanics of a typical leader-based consensus algorithm. The more interesting question is why this process works. Why does waiting for a quorum guarantee safety? Why does electing a single leader prevent conflicting decisions? And how can the system recover when that leader fails?

These guarantees don't happen by accident. They emerge from carefully designed algorithms. In the next post, we'll build one step by step and understand why it works.

Conclusion

Distributed systems is a very wide and challenging topic. I tried to make this post digestable with a natural progression, focusing on why instead of just laying out concepts out of nowhere. I hope you find it useful. Stay tuned for the upcoming post where we will dive deep implementing a consensus algorithm.

Thanks for reading.