Phase King vs Raft vs Paxos vs ZooKeeper
The comparisons and notes were written in collaboration with Tanel and GPT 5.3.
This page compares four important approaches to agreement in distributed systems:
- Phase King – a synchronous Byzantine agreement algorithm used primarily in theory
- Paxos – a foundational consensus protocol with strong theoretical guarantees
- Raft – a practical, leader-based consensus algorithm designed for understandability
- ZooKeeper (Zab) – a distributed coordination service built on atomic broadcast
There are also:
- A short introductory comparison of blockchain consensus algorithms with the ones above.
- A final chapter: an overview of a compromise algorithm, PBFT (Practical Byzantine Fault Tolerance).
Use cases:
- Phase King: not used in practice.
- Raft: typically embedded as a library inside a system
- Paxos: a family of protocols, adapted per system
- ZooKeeper: an external coordination service
NB! No consensus algorithm can guarantee termination in a fully asynchronous system.
All practical systems (Raft, Paxos, Zab) rely on partial synchrony and timeouts.
Contents
- 1 Comparison to blockchain consensus algorithms
- 2 Do Real Systems Use the Phase King Algorithm?
- 3 Partial Synchrony in Raft
- 4 Phase King vs Raft Consensus Algorithms
- 5 Raft vs Paxos: Use Cases, Assumptions, and Risks
- 6 ZooKeeper vs Raft and Paxos
- 7 How PBFT Differs from Phase King, Raft, Paxos, and ZooKeeper
Comparison to blockchain consensus algorithms
Blockchain consensus can be viewed as an extension of classical distributed consensus:
- Adds Byzantine fault tolerance (as does phase king)
- Removes trusted membership assumptions
- Introduces incentives and cryptography
- Often retains core ideas:
- Leaders / proposers
- Quorums
- Replicated logs
- Raft prioritizes simplicity and performance in controlled environments
- PBFT extends consensus to Byzantine settings with stronger guarantees but higher cost
- Nakamoto consensus sacrifices efficiency for openness and decentralization
| Aspect | Raft | PBFT (Practical Byzantine Fault Tolerance) | Nakamoto Consensus (Proof-of-Work) |
|---|---|---|---|
| Fault model | Crash faults | Byzantine faults | Byzantine + economic adversary |
| Network model | Partial synchrony | Partial synchrony | Fully asynchronous (probabilistic guarantees) |
| Participants | Fixed, known set | Fixed, known set | Open, permissionless |
| Identity model | Static identities | Static identities | Pseudonymous (Sybil-prone) |
| Sybil resistance | Not needed | Not needed | Required (PoW) |
| Fault tolerance | <math>f < n/2</math> | <math>f < n/3</math> | Depends on hash power (< 50% attacker) |
| Consensus type | Leader-based log replication | Leader-based voting protocol | Longest-chain rule |
| Leader role | Strong leader | Primary (leader) with view changes | Miner proposes blocks (implicit leader) |
| Communication pattern | Leader → followers | All-to-all voting (multiple phases) | Gossip-based block propagation |
| Message complexity | <math>O(n)</math> per entry | <math>O(n^2)</math> per decision | <math>O(n)</math> gossip (amortized) |
| Latency | 1–2 RTTs | Multiple rounds (≥2–3 RTTs) | Minutes (block confirmation time) |
| Finality | Immediate (deterministic) | Immediate (deterministic) | Probabilistic (confidence over time) |
| Throughput | High | Moderate | Low |
| Energy usage | Low | Low | Very high (mining) |
| Incentives | None | None | Economic incentives (block rewards, fees) |
| Typical use | Databases, coordination systems | Permissioned blockchains | Public blockchains (e.g., Bitcoin) |
| Examples | etcd, Consul | Tendermint, Hyperledger Fabric | Bitcoin, early Ethereum |
Note:
A Sybil attack is when one participant pretends to be many. Example:
- One attacker creates 1000 nodes
- System treats them as 1000 independent voters
- Attacker now controls the system
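The arithmetic behind that example can be made concrete. A minimal sketch, where the node counts are the illustrative ones from the list above:

```python
# Sybil attack arithmetic: one attacker spawning many fake identities.
honest = 10          # genuinely independent participants (illustrative)
sybils = 1000        # identities created by a single attacker

total = honest + sybils
attacker_share = sybils / total
print(f"Attacker controls {attacker_share:.1%} of the vote")

# Under one-identity-one-vote, the attacker decides every majority outcome.
# This is why open-membership systems need Sybil resistance (e.g. PoW).
assert attacker_share > 0.5
```

Proof-of-work sidesteps this by weighting influence by hash power rather than identity count, so creating extra identities gains the attacker nothing.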
Do Real Systems Use the Phase King Algorithm?
Short Answer
No — real-world distributed systems do not use the Phase King algorithm.
Why Phase King Is Not Used in Practice
1. Unrealistic Timing Assumptions
- Requires a fully synchronous system:
- Known upper bounds on message delays
- Lock-step execution of rounds across all nodes
- Real systems are typically asynchronous or only partially synchronous
- Even small timing violations can break correctness
2. High Communication Cost
- Message complexity: <math>O(n^2 f)</math>
- Each round requires all-to-all communication
- Does not scale well to larger systems
3. Strong (and Expensive) Fault Model
- Designed to tolerate Byzantine faults:
- Nodes may lie, equivocate, or collude
- Most real systems assume weaker fault models:
- Crash faults (nodes stop but do not misbehave)
- Byzantine tolerance is used only when necessary due to cost
4. Lack of Practical Ecosystem
- No widely adopted implementations
- Not integrated with:
- Storage systems
- RPC frameworks
- Production infrastructure
- Exists mainly in textbooks and theoretical research
What Real Systems Use Instead
Crash Fault-Tolerant Systems
- Raft
- Paxos / Multi-Paxos
- Used in systems such as:
- etcd
- Consul
- ZooKeeper (Zab protocol, Paxos-like)
Byzantine Fault-Tolerant Systems
- Practical Byzantine Fault Tolerance (PBFT)
- Tendermint
- HotStuff
- These protocols:
- Work under partial synchrony
- Are optimized for real-world networks
- Avoid strict round-based synchrony like Phase King
Partial Synchrony in Raft
Raft assumes a partially synchronous system:
- There is no known fixed upper bound on message delays
- However, after some unknown point in time, the system becomes "well-behaved":
- Message delays become bounded
- Processing speeds stabilize
- Communication is reliable enough for progress
A partially synchronous system can be understood as having two phases:
- Unstable period
- Messages may be delayed arbitrarily long
- Nodes may timeout incorrectly
- Leader elections may repeatedly fail
- Stable period
- Message delays are bounded by some unknown value <math>\Delta</math>
- Timeouts begin to work reliably
- A leader can be elected and maintain authority
The transition point between these phases is unknown and not detectable by the algorithm.
Why Partial Synchrony Is Needed
In a fully asynchronous system:
- There is no bound on message delay
- It is impossible to distinguish between:
- A slow node
- A failed node
This leads to the FLP impossibility result:
- No deterministic consensus algorithm can guarantee termination in a fully asynchronous system
Raft avoids this by assuming eventual synchrony.
How Raft Uses Partial Synchrony
Raft relies on timeouts for leader election:
- Nodes start elections if they do not hear from a leader
- Randomized timeouts reduce the chance of election conflicts
During the unstable period:
- Multiple nodes may start elections simultaneously
- Elections may fail repeatedly
During the stable period:
- One node times out first
- It becomes leader
- Other nodes receive its messages in time and follow it
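The timeout mechanism above can be sketched as follows. This is a minimal, hedged sketch: the 150–300 ms range is a commonly cited choice for Raft, not a mandated constant, and the node names are illustrative.

```python
import random

# Sketch of Raft-style randomized election timeouts. Each follower picks a
# fresh random timeout; the first one to expire starts an election.
# Randomization makes simultaneous candidacies (split votes) unlikely.

ELECTION_TIMEOUT_MS = (150, 300)  # illustrative, commonly used range

def new_election_timeout(rng: random.Random) -> float:
    """Pick a fresh randomized timeout for one follower."""
    return rng.uniform(*ELECTION_TIMEOUT_MS)

rng = random.Random(42)  # seeded only to make this sketch reproducible
timeouts = {node: new_election_timeout(rng) for node in ("n1", "n2", "n3")}

# During the stable period, the node whose timer fires first becomes candidate.
first_candidate = min(timeouts, key=timeouts.get)
print(first_candidate, sorted(timeouts.values()))

assert all(150 <= t <= 300 for t in timeouts.values())
```

During the unstable period the same mechanism misfires: timers expire before a healthy leader's messages arrive, which is exactly the repeated-election behavior described above.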
Phase King vs Raft Consensus Algorithms
Big Picture
| Aspect | Phase King | Raft |
|---|---|---|
| Model | Synchronous | Partially synchronous / asynchronous |
| Faults | Byzantine (arbitrary/malicious) | Crash faults only |
| Goal | Theoretical agreement under worst-case faults | Practical replicated state machine |
| Typical Use | Theory / teaching Byzantine agreement | Real systems (e.g., etcd, Consul) |
Assumptions
Phase King
- Fully synchronous system
- Known upper bound on message delay
- Lock-step rounds
- Handles Byzantine faults
- Nodes may lie, equivocate, or collude
- Requires: <math>n \ge 3f + 1</math>
Raft
- Partial synchrony
- No fixed delay bound, but eventually messages are timely
- Handles crash faults only
- Nodes may stop but do not behave maliciously
- Requires: <math>n \ge 2f + 1</math>
Guarantees
Phase King
- Agreement: All correct nodes decide the same value
- Validity: If all correct nodes start with the same value, that value is chosen
- Termination: Always terminates in a fixed number of rounds
Raft
- Safety (always):
- No two leaders commit different values at the same log index
- Logs remain consistent across nodes
- Liveness (eventual):
- Progress occurs once the network stabilizes
Algorithm Structure
Phase King
- Runs in <math>f + 1</math> rounds
- Each round:
- Nodes broadcast their values
- A majority value is computed
- A designated king node enforces a decision if needed
- The king resolves ambiguity caused by Byzantine nodes
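The phase structure above can be simulated for binary values. A hedged sketch in the Berman–Garay style: n = 4, f = 1 (so <math>n \ge 3f + 1</math> holds), one Byzantine node that equivocates, and f + 1 = 2 phases. The Byzantine node's value function is an arbitrary illustrative choice; the king's broadcast is modeled by recomputing the (honest) king's own majority.

```python
# Sketch of the Phase King algorithm for binary values.
# Nodes 0-2 are honest; node 3 is Byzantine and sends different values
# to different receivers. King of phase k is node k.

N, F = 4, 1
HONEST = [0, 1, 2]
values = {0: 0, 1: 1, 2: 1}  # honest nodes start with conflicting inputs

def byzantine_value(phase: int, receiver: int) -> int:
    # Illustrative equivocation: different receivers see different values.
    return (phase + receiver) % 2

for phase in range(F + 1):
    king = phase
    new_values = {}
    for i in HONEST:
        # Round 1: gather one value per node, compute majority and its count.
        received = [values[j] for j in HONEST] + [byzantine_value(phase, i)]
        maj = max((0, 1), key=received.count)
        mult = received.count(maj)
        # Round 2: the king broadcasts its own majority value as a tiebreaker.
        if king in HONEST:
            king_recv = [values[j] for j in HONEST] + [byzantine_value(phase, king)]
            king_val = max((0, 1), key=king_recv.count)
        else:
            king_val = byzantine_value(phase, i)  # a faulty king may equivocate
        # Keep own majority only if overwhelming (> n/2 + f); else follow king.
        new_values[i] = maj if mult > N // 2 + F else king_val
    values = new_values

print("decisions:", values)
assert len(set(values.values())) == 1  # all honest nodes agree
```

With f + 1 phases, at least one phase has an honest king, and that phase forces all honest nodes onto a common value which later phases preserve.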
Raft
- Continuous protocol consisting of:
- Leader election
- Log replication
- Commit mechanism
- Leader-based design:
- Leader sends log entries to followers
- Entries are committed once a majority acknowledges
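The majority-acknowledgment rule can be sketched as a simple predicate. Function and variable names here are illustrative, loosely following Raft's `matchIndex` bookkeeping rather than any particular implementation:

```python
# Sketch of Raft's commit rule: an entry at a given log index is committed
# once the leader knows a majority of the cluster stores it.

def majority(n: int) -> int:
    return n // 2 + 1

def committed(match_index: dict[str, int], entry_index: int, n: int) -> bool:
    """match_index: highest log index known replicated on each follower.
    The leader counts itself as one replica of every entry it holds."""
    replicas = 1 + sum(1 for idx in match_index.values() if idx >= entry_index)
    return replicas >= majority(n)

# 5-node cluster: entry 7 is on the leader and three followers -> committed.
# Entry 9 is only on the leader and one follower -> not yet committed.
match = {"f1": 8, "f2": 7, "f3": 3, "f4": 9}
assert committed(match, 7, n=5) is True   # leader + f1 + f2 + f4 = 4 >= 3
assert committed(match, 9, n=5) is False  # leader + f4 = 2 < 3
```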
Efficiency
Phase King
- Rounds: <math>f + 1</math>
- Messages per round: <math>O(n^2)</math>
- Total messages: <math>O(n^2 f)</math>
- High communication overhead
- Requires strict synchrony
- Not used in practice
Raft
- Normal operation:
- <math>O(n)</math> messages per log entry
- Leader election:
- <math>O(n)</math> messages
- Latency:
- Typically 1–2 round-trip times (RTTs)
- Efficient and scalable in practice
Key Differences
Fault Model
- Phase King: tolerates Byzantine faults
- Raft: tolerates crash faults only
Timing Model
- Phase King: requires strict synchrony
- Raft: assumes eventual synchrony
Practicality
- Phase King: theoretical and pedagogical
- Raft: widely used in production systems
Mechanism
- Phase King: rotating "king" authority per round
- Raft: stable leader-based replication
Complexity
- Phase King: <math>O(n^2 f)</math>
- Raft: <math>O(n)</math> per operation
Intuition
- Phase King answers:
- "Can we reach agreement even if some nodes are actively malicious?"
- Raft answers:
- "Can we build a reliable system from nodes that may crash?"
Raft vs Paxos: Use Cases, Assumptions, and Risks
Raft and Paxos are both consensus algorithms designed for crash fault-tolerant distributed systems. They provide similar theoretical guarantees but differ significantly in design philosophy, usability, and typical deployment scenarios.
Where assumptions differ
Raft
- Strong leadership:
- A single leader manages log replication and client interaction
Paxos
- No fixed leader in basic Paxos:
- Roles (proposer, acceptor, learner) are logically separate
- Leader optimization (Multi-Paxos) is commonly used in practice
Design Philosophy
Raft
- Explicit goal: understandability
- Decomposes consensus into:
- Leader election
- Log replication
- Safety rules
- Strong, stable leader simplifies reasoning
Paxos
- Minimal and abstract
- Focuses on safety properties with minimal assumptions
- More flexible but harder to understand and implement correctly
Risks and Failure Modes
Raft
- Leader dependency:
- System performance depends on leader health
- Leader failure triggers re-election delays
- Availability risks:
- Requires a majority of nodes to be available
- Network partitions can stall progress
- Simplicity tradeoff:
- Less flexible for certain optimizations compared to Paxos variants
Paxos
- Liveness challenges:
- Basic Paxos can experience repeated conflicts between proposers
- Requires leader optimization (Multi-Paxos) for efficiency
- Operational difficulty:
- Harder to debug and reason about in production systems
Efficiency and Performance
Raft
- Leader-based replication:
- <math>O(n)</math> messages per log entry
- Typically commits in 1–2 round-trip times (RTTs)
- Predictable performance under normal operation
Paxos
- Basic Paxos:
- Multiple phases per decision
- Higher latency without optimization
- Multi-Paxos (common in practice):
- Similar efficiency to Raft after leader is established
- <math>O(n)</math> messages per decision
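The two phases of Basic Paxos can be sketched for a single decision. This is a sequential, single-proposer sketch under stated simplifications (no message loss, no concurrent proposers); class and method names are illustrative, not from any particular implementation:

```python
# Sketch of single-decree Paxos: prepare/promise, then accept.

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised
        self.accepted = None  # (number, value) pair, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "reject"

def propose(acceptors, n, value):
    """One proposer round: prepare, then accept with the constrained value."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promised: retry with a higher n
    # Safety rule: adopt the value of the highest-numbered accepted proposal.
    prior = [acc for _, acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return value if acks > len(acceptors) // 2 else None

accs = [Acceptor() for _ in range(3)]
assert propose(accs, n=1, value="A") == "A"  # first proposal chooses A
assert propose(accs, n=2, value="B") == "A"  # later proposals must keep A
```

The second assertion shows the safety rule at work: once "A" is chosen, every later proposer is forced to re-propose "A". Multi-Paxos amortizes the prepare phase by letting a stable leader skip straight to the accept phase for subsequent decisions.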
Key Differences Summary
| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Minimality and flexibility |
| Leadership | Strong, explicit leader | Implicit or optional (explicit in Multi-Paxos) |
| Ease of implementation | Easier | Difficult |
| Typical use | Production systems, teaching | Research, highly optimized systems |
| Risk profile | Operational simplicity, leader bottleneck | High complexity, subtle correctness issues |
Practical Insight
- Raft is generally preferred for new systems due to its clarity and strong engineering model.
- Paxos (especially Multi-Paxos) remains influential and is used in systems where fine-grained control and optimization are required.
ZooKeeper vs Raft and Paxos
Overview
Apache ZooKeeper is a distributed coordination service rather than a standalone consensus algorithm. Internally, it uses a consensus protocol called Zab (ZooKeeper Atomic Broadcast), which is similar in spirit to leader-based consensus algorithms such as Raft and Multi-Paxos.
What ZooKeeper Is
- A distributed coordination service providing:
- Configuration management
- Naming/registry services
- Distributed synchronization (locks, leader election)
- Exposes a filesystem-like hierarchical data model (znodes)
- Clients interact with a replicated state maintained by a cluster
Underlying Algorithm
- ZooKeeper uses Zab (ZooKeeper Atomic Broadcast)
- Zab is:
- Leader-based
- Crash fault-tolerant (not Byzantine)
- Designed for high-throughput broadcast of updates
Assumptions
ZooKeeper (Zab)
- Partial synchrony
- Crash fault model
- Majority quorum:
- <math>n \ge 2f + 1</math>
- Single leader at a time
Comparison
| Aspect | ZooKeeper (Zab) | Raft | Paxos |
|---|---|---|---|
| Fault model | Crash | Crash | Crash |
| Timing | Partial synchrony | Partial synchrony | Partial synchrony |
| Leader | Yes | Yes | Optional (explicit in Multi-Paxos) |
Key intuition:
- Zab ≈ a specialized form of Multi-Paxos optimized for ordered broadcast
- Very similar operational model to Raft:
- One leader
- Followers replicate state
- Majority acknowledgment required
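Zab's ordered broadcast rests on its transaction identifiers: every update carries a zxid combining a leader epoch and a per-epoch counter, and a new leader bumps the epoch. The tuple encoding below is a simplification of the real 64-bit zxid (high bits epoch, low bits counter):

```python
# Sketch of Zab's total order on transactions via (epoch, counter) zxids.

def zxid(epoch: int, counter: int) -> tuple[int, int]:
    return (epoch, counter)

log = [zxid(1, 1), zxid(1, 2), zxid(2, 1), zxid(2, 2)]
# Lexicographic comparison on (epoch, counter) yields the delivery order.
assert log == sorted(log)
assert zxid(2, 1) > zxid(1, 99)  # a new epoch dominates any old counter
```

Because any transaction from a new leader sorts after everything from older epochs, followers can never interleave an old leader's updates with the new leader's, which is the ordered-broadcast guarantee Zab is built around.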
Where ZooKeeper Is Used
ZooKeeper is widely used as a coordination backbone in distributed systems:
- Apache Kafka:
- Broker metadata and controller election (historically)
- Apache Hadoop:
- Coordination of distributed components
- Apache HBase:
- Master election and region management
- Apache Solr (SolrCloud):
- Cluster coordination
Typical use cases:
- Leader election
- Distributed locks
- Service discovery
- Configuration storage
Intuition
- ZooKeeper:
- "Provide a shared coordination service for distributed systems."
- Raft:
- "Provide an understandable consensus building block."
- Paxos:
- "Provide a minimal, flexible foundation for consensus."
How PBFT Differs from Phase King, Raft, Paxos, and ZooKeeper
Practical Byzantine Fault Tolerance (PBFT) is a consensus algorithm designed to tolerate Byzantine (malicious) faults in partially synchronous systems. It bridges the gap between:
- Theoretical Byzantine algorithms (e.g., Phase King)
- Practical crash-fault systems (Raft, Paxos, ZooKeeper)
PBFT is widely used as a foundation for modern Byzantine fault-tolerant and permissioned blockchain systems.
Examples:
- Tendermint
- Hyperledger Fabric (variants)
Fault Model
| Algorithm | Fault Model |
|---|---|
| Phase King | Byzantine faults |
| PBFT | Byzantine faults |
| Raft | Crash faults only |
| Paxos | Crash faults only |
| ZooKeeper (Zab) | Crash faults only |
- PBFT tolerates arbitrary (malicious) behavior
- Requires <math>n \ge 3f + 1</math> nodes
- Crash-fault algorithms only require <math>n \ge 2f + 1</math>
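The cost of the stronger fault model shows up directly in cluster sizing. A small sketch of the two bounds above:

```python
# Sketch of minimum cluster sizes for a target fault tolerance f.

def min_nodes_crash(f: int) -> int:
    return 2 * f + 1  # Raft, Paxos, Zab

def min_nodes_byzantine(f: int) -> int:
    return 3 * f + 1  # PBFT, Phase King

for f in (1, 2, 3):
    print(f"f={f}: crash needs {min_nodes_crash(f)}, "
          f"Byzantine needs {min_nodes_byzantine(f)}")

# Tolerating a single Byzantine node already needs 4 replicas instead of 3.
assert min_nodes_crash(1) == 3
assert min_nodes_byzantine(1) == 4
```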
Timing Assumptions
| Algorithm | Timing Model |
|---|---|
| Phase King | Fully synchronous |
| PBFT | Partial synchrony |
| Raft | Partial synchrony |
| Paxos | Partial synchrony |
| ZooKeeper | Partial synchrony |
Algorithm Structure
PBFT
- Leader-based (called the primary)
- Each decision proceeds in multiple phases:
- Pre-prepare
- Prepare
- Commit
- Requires agreement from <math>2f + 1</math> nodes
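The <math>2f + 1</math> threshold applies per phase: a replica advances through prepare and commit only after collecting that many matching messages (counting its own). A minimal sketch, with illustrative function names:

```python
# Sketch of PBFT's per-phase quorum check.

def pbft_quorum(n: int) -> int:
    f = (n - 1) // 3  # max Byzantine faults tolerable at this cluster size
    return 2 * f + 1

def phase_complete(matching_msgs: int, n: int) -> bool:
    """A replica moves to the next phase (e.g. prepare -> commit) once it
    holds 2f + 1 matching messages, its own included."""
    return matching_msgs >= pbft_quorum(n)

# n = 4 tolerates f = 1, so each phase needs 2*1 + 1 = 3 matching messages.
assert pbft_quorum(4) == 3
assert phase_complete(3, n=4) is True
assert phase_complete(2, n=4) is False
```

Intuitively, any two quorums of size <math>2f + 1</math> in a cluster of <math>3f + 1</math> overlap in at least <math>f + 1</math> replicas, so they always share at least one correct replica even if all f faulty ones sit in the intersection.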