No Control Plane: The Playbook for Distributed Coordination Without a Brain
When to delete the orchestrator and let your system coordinate itself
Here is the moment the control plane becomes the bottleneck.
It’s 2:47 AM. Your fleet of 1,200 nodes just detected a configuration drift. The remediation playbook kicks off. Ansible starts its work. And then... nothing. The SSH connection pool saturates. The orchestrator is hammering the same 50 targets while 1,150 sit waiting. Your control plane, designed to coordinate everything, is now the reason nothing is moving.
You’ve built a system where the brain can think faster than the body can act. And the brain just had a stroke.
This is the failure mode nobody warns you about when they sell you on “centralized visibility” and “single pane of glass.” The control plane becomes the ceiling. And when something goes wrong at scale, you hit that ceiling hard.
Last week I published a technical architecture called Ghostfleet: a fire-and-forget system where the orchestrator deploys to seed targets and immediately exits. No persistent connections. No polling loops. No central brain watching everything. The targets coordinate among themselves using a peer-to-peer cascade.
The response was split. Half the readers wanted implementation details. The other half thought I’d lost my mind.
This is the playbook for the first group. And a warning label for the second.
The Premise: Why Delete the Brain?
Control planes exist for good reasons. They provide visibility. They enable rollback. They give you a place to ask “what’s the current state of the world?”
But control planes also concentrate failure. When the orchestrator dies, coordination dies. When the orchestrator slows, everything slows. When the orchestrator’s network path degrades, you’re flying blind.
The question isn’t whether control planes are useful. They are. The question is: what operations can survive without one?
The answer is narrower than you’d think. A control-plane-free architecture works when:
The operation is idempotent (running it twice doesn’t break anything)
The success criteria are local (each node can verify its own completion)
The coordination is embarrassingly parallel (nodes don’t need to wait for each other)
The blast radius is bounded (a misbehaving node can’t cascade damage)
If all four conditions hold, you can delete the brain. If any one fails, you probably shouldn’t.
Design Rules
Building a system that coordinates without central control requires different instincts. Here’s what I’ve learned.
Rule 1: Embed everything at deployment time.
The orchestrator’s job is to prepare the payload, encrypt it, and deploy it to seed nodes. After that, the orchestrator is done. Every piece of information a node needs (target lists, credentials, remediation logic) must be embedded in the payload itself.
This feels wasteful. You’re duplicating data across nodes. But duplication buys you independence. No node ever needs to call home to ask “what should I do next?”
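Here is a minimal sketch of what “embed everything” looks like in practice. The field names and values are illustrative, not Ghostfleet’s actual schema; the point is that the target list, the fanout, the deadline, and a credential reference all travel inside the payload, so no node ever needs a callback.

```python
import json
import time

# A minimal sketch of a self-contained payload. Field names are illustrative,
# not Ghostfleet's actual schema; nothing here requires calling home.
payload = {
    "operation_id": "drift-remediation-0001",          # hypothetical run identifier
    "targets": [f"node-{i:04d}.internal" for i in range(1200)],  # full target list, duplicated on every node
    "fanout": 5,                                        # peers each worker remediates per round
    "deadline_utc": time.time() + 30 * 60,              # hard stop for all propagation (Rule 4)
    "credential_ref": "one-time-token-abc",             # scoped, revoked after the window closes
    "remediation": "apply-baseline-config",             # name of the embedded remediation routine
}

# The orchestrator serializes and encrypts this, pushes it to the seed nodes,
# and exits. It never opens another connection.
blob = json.dumps(payload).encode()
```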
Rule 2: Use deterministic work assignment.
If you have 1,000 targets and 10 seed nodes, each seed should know exactly which 100 targets are its responsibility. No coordinator assigns work. The work assignment is computed from the node’s identity.
Simple modulo distribution works: if your node ID mod 10 equals 3, you handle targets 3, 13, 23, and so on. Every node can compute its own workload without asking anyone.
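A sketch of that distribution, assuming each seed is handed a small integer index at deployment time (if node identities are strings, hash them down to an index first and make sure no two seeds collide):

```python
def assigned_targets(node_index: int, all_targets: list[str], num_workers: int) -> list[str]:
    """Deterministic modulo partition: every node computes its own slice from
    the embedded target list. No coordinator hands out work."""
    # Target i belongs to the worker whose index is congruent to i mod num_workers.
    return [t for i, t in enumerate(all_targets) if i % num_workers == node_index % num_workers]

# With 10 seeds, the node whose index mod 10 is 3 handles targets 3, 13, 23, ...
targets = [f"node-{i:04d}.internal" for i in range(1000)]
print(assigned_targets(3, targets, 10)[:3])
# -> ['node-0003.internal', 'node-0013.internal', 'node-0023.internal']
```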
Rule 3: Make remediated targets into workers.
This is the cascade. When Node A remediates Node B, Node B now has the payload. Node B can remediate Node C. The system grows its own workforce as it executes.
The math is favorable. If each node can remediate five others per minute, and every remediated node becomes a worker, coverage grows geometrically: 10 seeds reach 50 new targets in the first wave, those 50 reach 250 more, and those reach 1,250. The control plane could never parallelize this fast.
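A toy model of that growth, assuming every current worker (the seeds included) keeps remediating five targets per round, which is slightly more generous than the wave-only count above:

```python
def cascade_coverage(seeds: int, fanout: int, fleet_size: int) -> list[int]:
    """Toy model: every current worker remediates `fanout` targets per round,
    and every remediated target joins the workforce for the next round."""
    covered, workers, per_round = 0, seeds, []
    while covered < fleet_size:
        newly = min(workers * fanout, fleet_size - covered)
        covered += newly
        workers += newly            # Rule 3: remediated targets become workers
        per_round.append(covered)
    return per_round

print(cascade_coverage(seeds=10, fanout=5, fleet_size=1200))
# -> [50, 350, 1200]: the whole 1,200-node fleet from the opening scenario
#    is covered in three rounds.
```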
Rule 4: Assume partial failure is normal.
Some nodes will be unreachable. Some will fail mid-remediation. Some will succeed but not report. You won’t know which.
Design for this. Set time-bounded execution windows. Use retries at the node level. Accept that your completion rate will be “good enough” rather than “perfect.” If you need 100% confirmation, you need a control plane. That’s the tradeoff.
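A sketch of the per-node retry loop under those constraints. The ssh invocation and the apply-baseline-config command are stand-ins for whatever remediation step the payload actually embeds:

```python
import subprocess
import time

def remediate_with_retries(target: str, attempts: int = 3, backoff_s: float = 5.0) -> bool:
    """Node-level retries: tolerate flaky targets, then move on. The overall
    run accepts whatever completion rate this produces."""
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                ["ssh", target, "apply-baseline-config"],  # hypothetical remediation command
                timeout=60,
            )
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # hung or unreachable targets are expected, not exceptional
        time.sleep(backoff_s * attempt)  # simple linear backoff between attempts
    return False  # give up on this target; partial completion is the accepted outcome
```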
Failure Modes
Every architecture has failure modes. Control-plane-free systems fail differently than centralized ones. Know the shape of the failures before you commit.
The Runaway Cascade.
If your remediation logic has a bug, and each node propagates to others, you’ve just deployed a self-replicating bug. The same exponential math that makes the pattern fast makes failures catastrophic.
Mitigation: time-bound every operation. Each node should have a hard deadline after which it stops propagating, regardless of completion state. The cascade dies naturally.
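Concretely, that means checking the deadline inside the propagation loop, not once at startup, so a node stops mid-list the moment the window closes. A sketch, reusing remediate_with_retries from the Rule 4 example:

```python
import time

def run_cascade(my_slice: list[str], deadline_utc: float) -> list[str]:
    """Propagation with a hard stop. The deadline ships inside the payload,
    so every node enforces the same cutoff without coordinating."""
    completed = []
    for target in my_slice:
        if time.time() >= deadline_utc:
            break  # kill switch: no more remediation, no new workers recruited
        if remediate_with_retries(target):   # defined in the Rule 4 sketch
            completed.append(target)         # this target now runs its own slice
    return completed
```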
The Silent Failure.
Without a control plane polling for status, you don’t know what succeeded. A node might remediate 50 targets and then crash before logging anything. Those 50 successes are invisible to you.
Mitigation: local receipts. Each node writes a tamper-evident log of its actions before reporting to anyone. More on this next week.
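Receipts get their own article, but the core mechanism fits in a few lines: an append-only local log where each entry includes the hash of the previous one, so edits and gaps are detectable when the logs are eventually collected. The path and field names below are placeholders:

```python
import hashlib
import json
import time

RECEIPT_LOG = "/var/log/remediation-receipts.jsonl"   # hypothetical location

def append_receipt(target: str, outcome: str, prev_hash: str) -> str:
    """Write a tamper-evident receipt before doing anything else with the result."""
    entry = {"ts": time.time(), "target": target, "outcome": outcome, "prev": prev_hash}
    entry_hash = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    with open(RECEIPT_LOG, "a") as f:
        f.write(json.dumps({**entry, "hash": entry_hash}) + "\n")
    return entry_hash   # chain it into the next receipt
```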
The Coordination Collision.
Two nodes might try to remediate the same target simultaneously. Without central locking, you get race conditions.
Mitigation: deterministic assignment (Rule 2) prevents this. If assignments are computed, not coordinated, collisions only happen when the computation is wrong.
The Credential Sprawl.
Embedded credentials travel with the payload. Every node that receives the payload has the credentials. Your attack surface grows with your cascade.
Mitigation: scope credentials to the minimum required, and rotate them after the operation completes. Treat embedded credentials as one-time-use.
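One workable convention, sketched with illustrative fields rather than any particular secrets manager’s API: mint a token scoped to exactly the remediation action, give it a not-after timestamp equal to the cascade’s hard deadline, and revoke it out-of-band once the window closes.

```python
import time

credential = {
    "token": "one-time-token-abc",       # hypothetical, minted for this run only
    "scope": ["config:write"],           # the smallest scope that can apply the fix
    "not_after": time.time() + 30 * 60,  # expires with the propagation deadline
}

def credential_usable(cred: dict) -> bool:
    # Every copy of the payload in the fleet carries a dead credential
    # once the window closes, whether or not revocation has caught up.
    return time.time() < cred["not_after"]
```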
When Not to Use This Pattern
The playbook has boundaries. Here’s when to stay centralized.
When you need confirmation. If the operation requires verified completion before the next step can proceed, you need something tracking state. Fire-and-forget means accepting uncertainty.
When operations aren’t idempotent. If running the same remediation twice causes damage, you need coordination to prevent duplicates. No control plane means no deduplication guarantees.
When blast radius is unbounded. If a misbehaving node can cascade harm to others (deleting data, corrupting configs, exhausting resources), the exponential spread becomes exponential destruction.
When auditability is mandatory. Regulators want logs. They want provenance. They want a single source of truth. Control-plane-free architectures produce distributed, partial evidence. If your compliance team needs a perfect audit trail, this pattern will make them nervous.
The Checklist
Before you delete the control plane, verify:
[ ] Operations are idempotent (safe to run twice)
[ ] Success criteria are locally verifiable
[ ] Work assignment is deterministic, not coordinated
[ ] Time bounds exist on all propagation
[ ] Credentials are scoped and rotatable
[ ] Partial completion is acceptable
[ ] You have a plan for collecting receipts after the fact
If you cannot satisfy all seven conditions, do not use this pattern.
What’s Missing
This playbook covers coordination. It doesn’t cover verification.
The hardest question in a control-plane-free system isn’t “how do I coordinate?” It’s “how do I know what actually happened?” When there’s no central observer, and nodes only have local views, how do you reconstruct the truth after the fact?
That’s the receipts problem. And it’s next.
This article is part of a series on autonomous operations. Previously: Fire-and-Forget: A P2P Architecture for Autonomous Security Remediation at Scale. Next: Receipts Everywhere.
Resources & Further Reading
Foundational Concepts
Control Planes vs Data Planes — Marc Brooker (AWS Sr. Principal Engineer). The clearest explanation of why control planes scale differently than data planes, and when control plane failures matter.
SWIM: Scalable Weakly-consistent Infection-style Process Group Membership — Cornell University (2002). The foundational paper on gossip-based membership and failure detection. This is what HashiCorp’s Serf implements.
Practical Implementations
HashiCorp Serf — GitHub. Production-grade decentralized cluster membership with no single point of failure. Powers Consul and Nomad.
Gossip Protocol in Consul — HashiCorp Docs. How Consul uses Serf’s gossip protocol for membership and message broadcast without central coordination.
Deep Dives
Gossip Protocol Explained — High Scalability. Accessible overview of gossip protocols, eventual consistency, and why epidemic-style information spread works at scale.
Control Plane Scalability Issues in SDN — MDPI Future Internet (2020). Academic review of control plane bottlenecks and the failure modes centralized architectures face at scale.
Book
Designing Data-Intensive Applications — Martin Kleppmann. Chapters 5 and 9 cover replication, consistency, and the tradeoffs between coordination and autonomy. The best single resource on distributed systems design.



