First Principles
The infrastructure industry has spent a decade building workarounds on broken foundations. We're done with workarounds.
A distributed systems problem masquerading as file storage
Terraform state is a coordination problem. Multiple actors (engineers, CI systems, drift detection) need to read and modify overlapping subsets of infrastructure state concurrently. This is a well-studied problem in distributed systems, with established solutions: fine-grained locking, multi-version concurrency control, and transaction isolation.
Instead, we got a global mutex on a JSON file.
The mismatch between granularity of operation and granularity of locking is the root cause of every Terraform scaling problem. It violates a fundamental principle of concurrent systems: non-overlapping operations should not block each other. You're modifying twelve resources but locking 2,847.
The standard response, splitting state files, doesn't solve the problem. It redistributes it. Now you have N coordination problems instead of one, plus the complexity of managing cross-state dependencies. You've traded false contention for distributed transaction coordination.
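To make the granularity mismatch concrete, here is a minimal sketch that contrasts a single file-level lock with per-resource locks. The `ResourceLockTable` class, its methods, the `GLOBAL` key, and the resource addresses are all hypothetical names invented for illustration; this is not how any real Terraform backend is implemented.

```python
class ResourceLockTable:
    """Toy lock table: an operation must hold every resource it touches."""

    def __init__(self):
        self._owner = {}  # resource address -> id of the operation holding it

    def try_acquire(self, op_id, resources):
        """Acquire all requested locks, or none of them."""
        wanted = set(resources)
        if any(self._owner.get(r, op_id) != op_id for r in wanted):
            return False  # another operation holds part of the set
        for r in wanted:
            self._owner[r] = op_id
        return True

    def release(self, op_id):
        """Drop every lock held by op_id."""
        self._owner = {r: o for r, o in self._owner.items() if o != op_id}


GLOBAL = "__entire_state__"  # a file-level lock is one key that everyone needs

locks = ResourceLockTable()
network_change = ["aws_vpc.main", "aws_subnet.a"]
dns_change = ["aws_route53_record.www"]

# File-level locking: the second operation is blocked even though
# the two changes have nothing in common.
global_first = locks.try_acquire("team-a", [GLOBAL])
global_second = locks.try_acquire("team-b", [GLOBAL])
locks.release("team-a")

# Resource-level locking: disjoint sets proceed, overlapping sets still conflict.
fine_first = locks.try_acquire("team-a", network_change)
fine_second = locks.try_acquire("team-b", dns_change)
fine_overlap = locks.try_acquire("team-c", ["aws_subnet.a"])

print(global_first, global_second, fine_first, fine_second, fine_overlap)
```

The sketch is single-threaded and ignores dependencies between resources; it only shows that contention should follow actual overlap, not file boundaries.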
File-based vs Graph-based State
| | File-based | Graph-based |
|---|---|---|
| Lock scope | Entire state file | Affected subgraph only |
| Refresh | All resources | Changed resources |
| Concurrency | One operation at a time | Parallel on disjoint sets |
| Plan time (2,847 resources) | 30 min | 2 sec |
| Query support | Parse JSON | SQL |
| Dependencies | Opaque | First-class edges |
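
As a rough illustration of the "Query support" and "Dependencies" rows, the sketch below stores resources and dependency edges in an in-memory SQLite database and asks which resources transitively depend on a VPC. The two-table schema, the column names, and the resource addresses are assumptions made up for this example, not the product's actual data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (address TEXT PRIMARY KEY, type TEXT NOT NULL);
    CREATE TABLE edges (
        src TEXT NOT NULL REFERENCES resources(address),  -- the dependent
        dst TEXT NOT NULL REFERENCES resources(address)   -- what it depends on
    );
""")
conn.executemany("INSERT INTO resources VALUES (?, ?)", [
    ("aws_vpc.main", "aws_vpc"),
    ("aws_subnet.a", "aws_subnet"),
    ("aws_instance.web", "aws_instance"),
    ("aws_s3_bucket.logs", "aws_s3_bucket"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?)", [
    ("aws_subnet.a", "aws_vpc.main"),
    ("aws_instance.web", "aws_subnet.a"),
])

# Everything that transitively depends on the VPC, via a recursive CTE.
rows = conn.execute("""
    WITH RECURSIVE dependents(address) AS (
        SELECT src FROM edges WHERE dst = ?
        UNION
        SELECT e.src FROM edges e JOIN dependents d ON e.dst = d.address
    )
    SELECT address FROM dependents ORDER BY address
""", ("aws_vpc.main",)).fetchall()

vpc_dependents = [row[0] for row in rows]
print(vpc_dependents)
```

With a JSON blob, the same question means loading the whole file and rebuilding the edges in application code before any traversal can start.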
Infrastructure is a graph. Store it as a graph.
Infrastructure state is inherently a directed graph. Resources have dependencies. Dependencies form edges. Changes propagate along those edges. Terraform already knows this. The internal representation is a graph, and the planner performs graph traversal.
But at the storage layer, we flatten this structure into a blob.
This is like storing a B-tree in a CSV file. You can do it, but you destroy the properties that make the data structure useful. Plans read the entire file because file-based storage offers no alternative. Refreshes query everything because the state file doesn't know what you're about to change. The lock is global because the file is the unit of atomicity.
When state is properly normalized into a graph database, three properties follow naturally. Subgraph isolation means operations on disjoint subgraphs are inherently parallelizable. Precise locking means you acquire locks on the affected resources and their dependencies, not the entire state. Incremental refresh means you compute the minimal refresh set by traversing the dependency graph.
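The traversal behind subgraph isolation and incremental refresh can be sketched in a few lines. The `DEPENDS_ON` map and the `affected_subgraph` function are hypothetical, and the sketch assumes that "affected" means the changed resources plus everything downstream of them; a real implementation might also need to include upstream dependencies.

```python
from collections import deque

# Hypothetical dependency edges: resource -> resources it depends on.
DEPENDS_ON = {
    "aws_vpc.main": [],
    "aws_subnet.a": ["aws_vpc.main"],
    "aws_instance.web": ["aws_subnet.a"],
    "aws_route53_record.www": ["aws_instance.web"],
    "aws_s3_bucket.logs": [],
}


def affected_subgraph(changed, depends_on):
    """Return the changed resources plus everything downstream of them."""
    # Invert the edges so we can walk from a resource to its dependents.
    dependents = {resource: [] for resource in depends_on}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(resource)

    # Breadth-first walk from the changed resources.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


web_change = affected_subgraph(["aws_instance.web"], DEPENDS_ON)
logs_change = affected_subgraph(["aws_s3_bucket.logs"], DEPENDS_ON)

# Disjoint subgraphs share no resources, so they need no common lock.
can_run_in_parallel = web_change.isdisjoint(logs_change)
print(sorted(web_change), sorted(logs_change), can_run_in_parallel)
```

The returned set doubles as the lock set and the refresh set: nothing outside it needs to be locked or queried for that change.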
Apply forty years of database engineering
The database and distributed systems communities solved these problems decades ago. Multi-version concurrency control allows readers to proceed without blocking writers. Write-ahead logging provides durability without sacrificing performance. Transaction isolation levels let operators choose their consistency guarantees. Row-level locking enables concurrent modification of non-overlapping data.
None of this is novel. PostgreSQL has shipped MVCC and write-ahead logging for more than two decades. Yet somehow the infrastructure industry decided that a JSON file with a mutex was acceptable for managing production systems.
We implement MVCC at the Terraform state layer. Each operation acquires locks only on its subgraph. The lock manager uses the dependency graph to ensure consistent ordering. Readers use snapshots without blocking writers. Three teams can run three transactions with zero contention on disjoint resource sets.
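For readers unfamiliar with MVCC, the following toy store shows the core idea of snapshot reads: writes append new versions instead of overwriting, so a reader pinned to an older timestamp keeps seeing a consistent view. `VersionedState` and `Snapshot` are hypothetical names for this illustration, and the sketch is single-threaded with no locking or garbage collection; it is not the implementation described above.

```python
class VersionedState:
    """Toy multi-version store: a write adds a version, it never overwrites."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # timestamp of the latest commit

    def commit(self, writes):
        """Apply a batch of writes under one new timestamp."""
        self._clock += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Pin a reader to the current timestamp."""
        return Snapshot(self, self._clock)

    def read_at(self, key, ts):
        """Return the newest value committed at or before ts, else None."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None


class Snapshot:
    """A read-only view of the store as of one timestamp."""

    def __init__(self, store, ts):
        self._store = store
        self._ts = ts

    def get(self, key):
        return self._store.read_at(key, self._ts)


state = VersionedState()
state.commit({"aws_instance.web": {"instance_type": "t3.micro"}})

reader = state.snapshot()  # for example, a long-running plan
state.commit({"aws_instance.web": {"instance_type": "t3.large"}})  # a later apply

seen_by_reader = reader.get("aws_instance.web")["instance_type"]
seen_by_new_reader = state.snapshot().get("aws_instance.web")["instance_type"]
print(seen_by_reader, seen_by_new_reader)
```

The earlier reader still sees `t3.micro` while a fresh snapshot sees `t3.large`; neither had to wait for the other.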
Kubernetes proved the control plane pattern works
Kubernetes controllers continuously reconcile cluster state with desired configuration. They retry until they succeed. They watch for drift. They handle this at massive scale. Infrastructure needs the same thing: state that reconciles automatically, operations that retry instead of failing, drift that gets fixed instead of reported to a Slack channel nobody monitors.
You can't build continuous reconciliation on top of a flat file that locks globally. You can't parallelize operations when everything shares the same lock. You can't query relationships when state is opaque JSON.
The state system has to change first. Fix the storage primitives, and you unlock reconciliation. Fix reconciliation, and you can react to events instead of polling.
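The reconcile-and-retry pattern can be sketched as a small loop. The `reconcile` function, the fake `flaky_apply` provider, and the example keys are hypothetical; the loop re-observes on every pass for simplicity, whereas a real controller would typically react to watch events and back off between attempts.

```python
def reconcile(desired, observe, apply_change, max_attempts=5):
    """Drive actual state toward desired state, retrying failed changes."""
    for attempt in range(1, max_attempts + 1):
        actual = observe()
        drift = {k: v for k, v in desired.items() if actual.get(k) != v}
        if not drift:
            return attempt  # converged on this pass
        for key, value in drift.items():
            try:
                apply_change(key, value)
            except RuntimeError:
                pass  # leave it for the next pass instead of failing the run
    raise RuntimeError("did not converge")


# Fake provider: the first write to any key fails, as a flaky API might.
actual_state = {"replicas": 1}
failed_once = set()


def flaky_apply(key, value):
    if key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("transient error")
    actual_state[key] = value


passes = reconcile(
    {"replicas": 3, "tier": "web"},
    lambda: dict(actual_state),
    flaky_apply,
)
print(passes, actual_state)
```

A transient failure costs one extra pass rather than a failed run that someone has to notice and rerun by hand.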
Correctness isn't optional
We're building infrastructure that manages other people's infrastructure. State corruption can't be "rare." It has to be impossible.
We chose OCaml because its type system catches entire categories of bugs at compile time that tests miss. Strongly typed data structures catch field errors before deployment. Type-safe SQL queries catch schema mismatches at compile time, before they reach production. Immutability by default removes a whole class of race conditions caused by shared mutable state. When you add a new error case to the system, the compiler tells you every place you aren't handling it.
This isn't academic type theory. Production systems that absolutely cannot fail choose languages where certain failures are impossible, not just unlikely.
The software is free. The bureaucracy isn't.
Infrastructure tooling has become a collection of six-figure line items. One vendor for automation. Another for cost visibility. Another for inventory. Another for policy. Per seat. Per resource. Per month. "Call us for pricing." Enterprise sales teams playing discount games.
For what? Plumbing.
The infrastructure tooling market has convinced itself that basic visibility into your own infrastructure is a premium feature. That knowing what you're spending requires a separate contract. That querying your resources requires yet another vendor. This is absurd. These are table stakes, not profit centers.
The core capabilities should be free. Not free-with-limits. Not free-until-you-need-it. Free by default. We charge for the human overhead: governance teams that need approval workflows, compliance departments that require audit trails, procurement processes that demand paperwork. Real costs for real work.
We're not trying to win this market. We're trying to collapse it.
Fix the foundations
The Terraform ecosystem has built an impressive tower of workarounds. Terragrunt is the poster child: a wrapper that exists solely to compensate for Terraform's inability to handle basic patterns like DRY configuration and cross-stack dependencies. It papers over state fragmentation with more fragmentation. It adds a templating layer because the underlying tool can't express what you need. It's duct tape on a broken foundation, and somehow it became best practice.
The rest of the ecosystem follows the same pattern. Elaborate orchestration to work around the fact that the storage layer can't support concurrent operations. State splitting strategies. External locking mechanisms. Dependency graphs rebuilt in YAML because Terraform lost them when it flattened state to JSON.
These aren't solutions. They're evidence that we're solving the wrong problem.
We're not building another wrapper. We're not adding another layer of abstraction on a broken foundation. We're replacing the storage primitives that everything else depends on. Graph-native state. Resource-level locking. Subgraph operations. MVCC concurrency. Queryable infrastructure.
This isn't revolutionary. It's the application of established distributed systems principles to a problem that's been mischaracterized since its inception.
The infrastructure industry has accepted file-based state as an immutable constraint for too long.
It's not. It's a choice.
And it's the wrong one.