Terraform's unit of work is too large
The issue is not that Terraform is slow at doing work. The issue is that at large enough scale it insists on considering too much of the world before it can do any work.
The workload
A design partner brought us a single Terraform root module that had grown organically over years: around 100 MB of Terraform code, around 800 MB of state. On a large, well-resourced cloud machine, we ran a Terraform no-op against that workload with refresh disabled. No changes intended, no expected diff, just the full traversal Terraform requires before it can tell you nothing needs to happen. It took nearly three hours.
We loaded the same workload into Stategraph. Import took around six minutes. After that, no-op detection ran in 97 seconds, on a laptop. The Stategraph runs and the Terraform run were on different hardware, and we are not claiming a controlled comparison. That disparity actually makes the result more interesting, not less. It points to a difference in what work is being done rather than in raw compute capacity.
The design partner has asked to remain anonymous. The numbers are real. The explanation for the gap is what this post is about.
The result
The observed numbers, presented without embellishment:
| Operation | Environment | Observed time | Interpretation |
|---|---|---|---|
| Stategraph import (full load) | Laptop | ~6 minutes | One-time cost to normalize state into the database |
| Stategraph plan (small change) | Laptop | 71 seconds | Plan scoped to the impacted subgraph only |
| Stategraph no-op detection | Laptop | 97 seconds | Determines nothing needs to change without full traversal |
| Terraform no-op | Large cloud machine | ~3 hours | Full graph traversal, even with refresh disabled |
There is variance here. Network conditions affect API response times, the import is a one-time cost, and the Terraform figure is a single observed run rather than an average. This is a directional data point, not a benchmark. The mechanism behind it is what warrants explanation.
Why this happens
Terraform's scaling problem is fundamentally about granularity. The state file is the unit of work. Every operation, whether it makes one change or no changes, must first account for the entire state before it can proceed. Three things happen together on every run.
Whole-state refresh: By default, Terraform contacts the cloud provider for every resource in state to verify its current condition. With 800 MB of state, that is a lot of resources. You can disable refresh, but then you are flying without instruments.
Coarse state lock: Before any operation begins, a global mutex is acquired over the entire state file. No other operation can proceed in parallel, regardless of whether it touches overlapping resources.
Full graph traversal: The planner walks the entire dependency graph to construct a plan. Even when the answer is "nothing to do," the traversal is complete.
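To make the third point concrete, here is a minimal toy model (in Python, with hypothetical names; Terraform's actual planner is far more involved) of whole-state planning. The point it illustrates is that the walk visits every node in the dependency graph regardless of whether anything changed, so a no-op costs the same traversal as a real plan:

```python
# Toy model of whole-state planning over a dependency graph.
# The graph maps each resource to the resources it depends on.
# `desired` stands in for pending configuration changes; note the
# traversal below never consults it to skip work.

def plan_whole_state(graph, desired):
    """Visit every node in dependency order, even when nothing changed."""
    visited = []

    def visit(node):
        if node in visited:
            return
        for dep in graph.get(node, []):
            visit(dep)
        visited.append(node)  # diff desired vs. recorded state here

    for node in graph:
        visit(node)
    return visited

# A four-node graph with no pending changes still visits all four nodes.
graph = {"vpc": [], "subnet": ["vpc"], "db": ["subnet"], "app": ["db"]}
assert len(plan_whole_state(graph, desired={})) == 4
```

The cost is proportional to the size of the whole graph, not to the size of the change, which is the scaling behavior the three-hour no-op exhibits.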
The core issue
Once state gets big enough, even a no-op becomes a whole-world event. The execution model does not distinguish between "this operation touches 3 resources" and "this operation touches 3,000 resources." The unit of work is the same in both cases. The entire state.
At a few hundred resources, this is fine. The work is fast enough that the granularity does not matter. At 800 MB of state, the whole-state model is the architecture, and the architecture has a cost.
What Stategraph is actually doing
Stategraph stores Terraform state in Postgres. Resources, attributes, and dependencies are normalized into rows and edges rather than a monolithic JSON file. When an operation arrives, Stategraph identifies the subgraph implicated by that operation and scopes the refresh, the plan traversal, and the lock acquisition to that subgraph.
This is not -target. Terraform's -target flag requires the operator to manually specify resources and bypasses normal dependency resolution. Stategraph computes the impacted subgraph automatically from the dependency relationships in the configuration. If resource A has downstream effects on B and C, they are included. Nothing is skipped that should not be skipped.
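The closure computation described above can be sketched as a breadth-first walk over reverse dependency edges. This is a simplified model, not Stategraph's implementation (which operates on rows and edges in Postgres), and the resource names are illustrative:

```python
from collections import deque

def impacted_subgraph(dependents, changed):
    """Changed resources plus everything downstream of them.

    `dependents` maps each resource to the resources that depend on it
    (i.e., reverse dependency edges).
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for downstream in dependents.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen

# Changing A pulls in its dependents B and C; unrelated D stays out of scope.
dependents = {"A": ["B"], "B": ["C"], "D": []}
assert impacted_subgraph(dependents, {"A"}) == {"A", "B", "C"}
```

Because the closure is computed from the dependency edges themselves, nothing downstream can be skipped by operator error, which is the key difference from hand-listing resources with -target.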
Stategraph likewise refreshes only the resources implicated by the subgraph, not all resources. For a small change, that is a small number of API calls. Provider APIs do not become faster, and graph traversal is not mocked. The difference is in what portion of the state each operation includes.
Beyond speed: concurrency and scale
The speed difference is the headline, but the more durable change is in what becomes possible when the unit of work shrinks.
A team with a moderately large state was seeing 60-second Terraform no-ops routinely. In Stategraph, the same operation ran in 15 seconds. Not because 15 seconds is some theoretical minimum, but because the subgraph was genuinely small and the refresh scope reflected that.
The concurrency case is often more practically significant. In standard Terraform, one team acquires the state lock and the other waits, regardless of whether their changes overlap. With row-level locking, if two operations touch non-overlapping resources, both proceed simultaneously.
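The conflict rule reduces to a set-intersection check: two operations can run concurrently exactly when the resource rows they would lock are disjoint. A minimal sketch, with hypothetical resource addresses (the real locking happens at the database row level):

```python
def can_run_concurrently(subgraph_a, subgraph_b):
    """Two operations conflict only if their locked row sets overlap."""
    return not (set(subgraph_a) & set(subgraph_b))

team_a = {"module.billing.aws_lambda_function.worker"}
team_b = {"module.search.aws_opensearch_domain.cluster"}

assert can_run_concurrently(team_a, team_b)      # disjoint rows: both proceed
assert not can_run_concurrently(team_a, team_a)  # same rows: one waits
```

Under a whole-state lock, the first assertion would be false for any pair of operations, which is exactly the contention the coarse mutex creates.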
There is also a cross-state dimension. When a producing state changes an output that a consuming state references through terraform_remote_state, that relationship is typically invisible to either Terraform operation. Someone must manually understand and coordinate it. Stategraph can represent that relationship directly and include both impacted states in a single plan view, surfacing the actual blast radius before anything is applied.
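One way to picture the cross-state case: if the remote-state reference is recorded as an explicit edge from the producer's output to the consumer's resource, the same downstream walk surfaces both sides of the boundary. This is a hypothetical model of the idea, with invented identifiers, not Stategraph's schema:

```python
from collections import deque

def blast_radius(dependents, changed):
    """Downstream closure over a graph that may span multiple states."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        for downstream in dependents.get(queue.popleft(), []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen

# An edge from the producing state's output to the consuming state's
# resource makes the cross-state impact visible in one plan view.
dependents = {
    "network/output.vpc_id": ["app/aws_instance.web"],  # cross-state edge
    "app/aws_instance.web": [],
}
assert "app/aws_instance.web" in blast_radius(dependents, {"network/output.vpc_id"})
```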
The underlying shift
This is not a story about making provider APIs faster. It is a story about shrinking the execution universe. When the universe is smaller, everything that operates within it becomes cheaper. The refresh, the traversal, the lock, the plan, the apply.
Caveats and limits
Is this apples-to-apples?
No. The Stategraph runs were on a laptop; the Terraform no-op was on a large cloud machine. The hardware disparity is real. The explanation for the result is not hardware; it is what work is being done. We cannot claim a controlled comparison, and we are not trying to.
Was refresh disabled on the Terraform run?
Yes, and that is worth stating clearly. The Terraform no-op ran with refresh disabled and still took nearly three hours. The bottleneck on that workload is not refresh time. It is the graph traversal and plan construction over a state file of that size. Stategraph refreshes only the resources implicated by the subgraph, so the comparison is not refresh-off versus refresh-on. It is whole-state traversal versus subgraph-scoped traversal.
What happens when changes overlap?
If two operations target overlapping resources, one must wait and replan after the other completes. The row-level lock means only the overlapping rows are contested. Non-overlapping portions of each subgraph can still proceed. Concurrency conflicts do not disappear. Their scope narrows.
What about manual changes and drift?
Stategraph is designed for teams working in a GitOps workflow, where infrastructure changes go through code. If you are regularly making manual changes outside of Terraform, that is a hygiene problem that no state backend solves. For the edge cases that do slip through, Stategraph supports scheduled drift detection runs and per-resource forced refreshes. But the expectation is that those are the exception, not the primary mechanism.
Does every change collapse to a tiny subgraph?
No. A change to a foundational resource that everything depends on will produce a large subgraph. The benefit scales with how localized a change is. Many real operations are localized. Some are not, and for those the advantage is smaller or absent.
The import time is a real upfront cost. Six minutes to load an 800 MB state file is a one-time payment, but it is not instant.
Why this matters
Infra teams have built an entire ecosystem of wrappers around the symptoms of whole-state execution. State splitting to reduce lock contention. Atlantis queue depths to serialize operations. Terragrunt run-all flags to manage cross-module dependencies. Drift detection jobs that run nightly because running them more frequently is too expensive. These are engineering responses to a real problem, and they work. But they are working around the model rather than changing it.
At small scale, whole-state execution feels acceptable. At large scale, it becomes the architecture. The 3-hour no-op is not a bug in Terraform. It is a predictable outcome of applying a whole-state execution model to a workload that has grown past the scale at which that model is comfortable.
Stategraph is in active development and currently focused on self-hosted installations, in part because state is sensitive and many of our early partners operate in regulated environments. If you are running a Terraform workload where whole-state execution has become a material problem, we are interested in the specifics of what you are seeing.
Try Stategraph free
If your Terraform state has reached the point where routine operations carry a real time cost, the subgraph approach is worth seeing in practice. Stategraph is self-hosted, works with your existing Terraform configuration, and does not require changes to your provider setup.
Get started with the docs or follow along as we build.