
Terraform's unit of work is too large

TL;DR
$ cat terraform-unit-of-work.tldr
• 800 MB state file. Terraform no-op: ~3 hours (cloud VM). Stategraph no-op: 97s (laptop).
• Not a controlled comparison (different hardware). The gap is about what work each model does, not raw speed.
• At large enough scale, even a no-op becomes a whole-world event. Subgraph execution changes the unit of work.

The issue is not that Terraform is slow at doing work. The issue is that at large enough scale it insists on considering too much of the world before it can do any work.

The workload

A design partner brought us a single Terraform root module that had grown organically over years. Around 100 MB of Terraform code, around 800 MB of state. We ran a Terraform no-op on that workload on a large, well-resourced cloud machine, with refresh disabled. No changes intended, no expected diff, just the required full traversal before Terraform can tell you nothing needs to happen. It took nearly three hours.

We loaded the same workload into Stategraph. Import took around six minutes. After that, no-op detection ran in 97 seconds, on a laptop. The Stategraph runs and the Terraform run were on different hardware, and we are not claiming a controlled comparison. That disparity actually makes the result more interesting, not less. It points to a difference in what work is being done rather than in raw compute capacity.

The design partner has asked to remain anonymous. The numbers are real. The explanation for the gap is what this post is about.

The result

The observed numbers, presented without interpretation inflation:

Operation                        Environment          Observed time   Interpretation
Stategraph import (full load)    Laptop               ~6 minutes      One-time cost to normalize state into the database
Stategraph plan (small change)   Laptop               71 seconds      Plan scoped to the impacted subgraph only
Stategraph no-op detection       Laptop               97 seconds      Determines nothing needs to change without full traversal
Terraform no-op                  Large cloud machine  ~3 hours        Full graph traversal with refresh disabled
[Figure: horizontal bar chart of Stategraph operations (import: 6 min, plan: 71s, no-op: 97s) versus the Terraform no-op (~3 hours). The Terraform bar spans the full width; the Stategraph bars are barely visible by comparison.]

There is variance here. Network conditions affect API response times, the import is a one-time cost, and the Terraform figure is a single observed run rather than an average. This is a directional data point, not a benchmark. The mechanism behind it is what warrants the explanation.

Why this happens

Terraform's scaling problem is fundamentally about granularity. The state file is the unit of work. Every operation, whether it makes one change or no changes, must first account for the entire state before it can proceed. Three things happen together on every run.

Whole-state refresh: By default, Terraform contacts the cloud provider for every resource in state to verify its current condition. With 800 MB of state, that is a lot of resources. You can disable refresh, but then you are flying without instruments.

Coarse state lock: Before any operation begins, a global mutex is acquired over the entire state file. No other operation can proceed in parallel, regardless of whether it touches overlapping resources.

Full graph traversal: The planner walks the entire dependency graph to construct a plan. Even when the answer is "nothing to do," the traversal is complete.
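The third point is the one that bites the no-op case. A toy model makes the difference in unit of work concrete. This is an illustrative sketch, not Terraform internals; the graph shape and resource names are invented:

```python
# Toy model: cost of a no-op under whole-state planning vs.
# subgraph-scoped planning. Not Terraform internals; purely illustrative.

def plan_whole_state(graph, changed):
    # Whole-state model: every node is visited, even when nothing changed.
    return list(graph)

def plan_subgraph(graph, changed):
    # Subgraph model: visit only nodes reachable from the changed set.
    stack, visited = list(changed), set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(graph[node])  # follow downstream dependents
    return visited

# 10,000 resources, mostly independent; one small connected change.
graph = {f"r{i}": [] for i in range(10_000)}
graph["r0"] = ["r1"]
graph["r1"] = ["r2"]

print(len(plan_whole_state(graph, {"r0"})))  # 10000: the whole world
print(len(plan_subgraph(graph, {"r0"})))     # 3: just the impacted chain
```

The work done by the whole-state model scales with the size of the state, not the size of the change. That is the whole story of the three-hour no-op.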

[Figure: two dependency graph diagrams side by side. Left: whole-state execution with all nodes highlighted red. Right: subgraph execution with only 3 connected nodes highlighted in teal; the rest are dim.]

The core issue

Once state gets big enough, even a no-op becomes a whole-world event. The execution model does not distinguish between "this operation touches 3 resources" and "this operation touches 3,000 resources." The unit of work is the same in both cases. The entire state.

At a few hundred resources, this is fine. The work is fast enough that the granularity does not matter. At 800 MB of state, the whole-state model is the architecture, and the architecture has a cost.

What Stategraph is actually doing

Stategraph stores Terraform state in Postgres. Resources, attributes, and dependencies are normalized into rows and edges rather than a monolithic JSON file. When an operation arrives, Stategraph identifies the subgraph implicated by that operation and scopes the refresh, the plan traversal, and the lock acquisition to that subgraph.

This is not -target. Terraform's -target flag requires the operator to manually specify resources and bypasses normal dependency resolution. Stategraph computes the impacted subgraph automatically from the dependency relationships in the configuration. If resource A has downstream effects on B and C, they are included. Nothing is skipped that should not be skipped.

stategraph> -- Identify affected subgraph for incoming change
WITH RECURSIVE affected AS (
  SELECT id, type, name FROM resources
  WHERE name = 'api-gateway-prod'
  UNION
  SELECT r.id, r.type, r.name
  FROM resources r
  JOIN dependencies d ON r.id = d.dependent_id
  JOIN affected a ON d.resource_id = a.id
)
SELECT * FROM affected;
→ 8 resources in change scope (0.004s)
→ Compared to: 4,200+ resources in full state (skipped)
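The recursive query amounts to a transitive-closure walk over the dependency edges. The same computation can be sketched in a few lines of Python; the edge list here is made up for illustration, and in Stategraph this data lives in Postgres:

```python
# Sketch of the recursive CTE as a reachability pass over dependency edges.
# The edge list is invented for the example.
from collections import defaultdict

# dependencies: (resource_id, dependent_id) pairs, as in the query
edges = [("api-gateway-prod", "lambda-handler"),
         ("lambda-handler", "log-group"),
         ("unrelated-db", "unrelated-replica")]

dependents = defaultdict(list)
for resource_id, dependent_id in edges:
    dependents[resource_id].append(dependent_id)

def affected(start):
    # Seed with the changed resource, then pull in every downstream dependent.
    scope, frontier = {start}, [start]
    while frontier:
        for dep in dependents[frontier.pop()]:
            if dep not in scope:
                scope.add(dep)
                frontier.append(dep)
    return scope

print(sorted(affected("api-gateway-prod")))
# ['api-gateway-prod', 'lambda-handler', 'log-group']
```

Note that the unrelated database pair never enters the scope. That exclusion, computed rather than hand-specified, is the difference from -target.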

Stategraph also refreshes the resources implicated by the subgraph, not all resources. For a small change, that is a small number of API calls. Provider APIs do not become faster. Graph traversal is not mocked. The difference is in what portion of the state each operation includes.

Beyond speed: concurrency and scale

The speed difference is the headline, but the more durable change is in what becomes possible when the unit of work shrinks.

A team with a moderately large state was seeing 60-second Terraform no-ops routinely. In Stategraph, the same operation ran in 15 seconds. Not because 15 seconds is some theoretical minimum, but because the subgraph was genuinely small and the refresh scope reflected that.

The concurrency case is often more practically significant. In standard Terraform, one team acquires the state lock and the other waits, regardless of whether their changes overlap. With row-level locking, if two operations touch non-overlapping resources, both proceed simultaneously.
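The locking decision reduces to a set-intersection check. A minimal sketch, with invented team workloads; the real mechanism is Postgres row-level locking, not in-process locks:

```python
# Toy model of global vs. row-level locking (illustrative only).
# With one global lock, any two operations serialize. With per-resource
# locks, operations serialize only when their resource sets overlap.

def must_wait(op_a: set, op_b: set, row_level: bool) -> bool:
    if not row_level:
        return True            # global mutex: always contended
    return bool(op_a & op_b)   # row-level: contended only on overlap

team_a = {"vpc-a", "subnet-a1"}
team_b = {"iam-role-ci", "iam-policy-ci"}

print(must_wait(team_a, team_b, row_level=False))  # True: global lock serializes
print(must_wait(team_a, team_b, row_level=True))   # False: disjoint, run in parallel
```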

[Figure: timeline comparing Terraform (Team B waits while Team A holds the global lock, then runs sequentially) with Stategraph (both teams run simultaneously under row-level locks on non-overlapping resources).]

There is also a cross-state dimension. When a producing state changes an output that a consuming state references through terraform_remote_state, that relationship is typically invisible to either Terraform operation. Someone must manually understand and coordinate it. Stategraph can represent that relationship directly and include both impacted states in a single plan view, surfacing the actual blast radius before anything is applied.

The underlying shift

This is not a story about making provider APIs faster. It is a story about shrinking the execution universe. When the universe is smaller, everything that operates within it becomes cheaper. The refresh, the traversal, the lock, the plan, the apply.

Caveats and limits

Is this apples-to-apples?

No. The Stategraph runs were on a laptop; the Terraform no-op was on a large cloud machine. The hardware disparity is real. The explanation for the result is not hardware, it is what work is being done. We cannot claim a controlled comparison and we are not trying to.

Was refresh disabled on the Terraform run?

Yes, and that is worth stating clearly. The Terraform no-op ran with refresh disabled and still took nearly three hours. The bottleneck on that workload is not refresh time. It is the graph traversal and plan construction over a state file of that size. Stategraph refreshes only the resources implicated by the subgraph, so the comparison is not refresh-off versus refresh-on. It is whole-state traversal versus subgraph-scoped traversal.

What happens when changes overlap?

If two operations target overlapping resources, one must wait and replan after the other completes. The row-level lock means only the overlapping rows are contested. Non-overlapping portions of each subgraph can still proceed. Concurrency conflicts do not disappear. Their scope narrows.

What about manual changes and drift?

Stategraph is designed for teams working in a GitOps workflow, where infrastructure changes go through code. If you are regularly making manual changes outside of Terraform, that is a hygiene problem that no state backend solves. For the edge cases that do slip through, Stategraph supports scheduled drift detection runs and per-resource forced refreshes. But the expectation is that those are the exception, not the primary mechanism.

Does every change collapse to a tiny subgraph?

No. A change to a foundational resource that everything depends on will produce a large subgraph. The benefit scales with how localized a change is. Many real operations are localized. Some are not, and for those the advantage is smaller or absent.

The import time is a real upfront cost. Six minutes to load an 800 MB state file is a one-time payment, but it is not instant.

Why this matters

Infra teams have built an entire ecosystem of wrappers around the symptoms of whole-state execution. State splitting to reduce lock contention. Atlantis queue depths to serialize operations. Terragrunt run-all flags to manage cross-module dependencies. Drift detection jobs that run nightly because running them more frequently is too expensive. These are engineering responses to a real problem, and they work. But they are working around the model rather than changing it.

At small scale, whole-state execution feels acceptable. At large scale, it becomes the architecture. The 3-hour no-op is not a bug in Terraform. It is a predictable outcome of applying a whole-state execution model to a workload that has grown beyond the scale that model was designed for.

Stategraph is in active development and currently focused on self-hosted installations, in part because state is sensitive and many of our early partners operate in regulated environments. If you are running a Terraform workload where whole-state execution has become a material problem, we are interested in the specifics of what you are seeing.

Try Stategraph free

If your Terraform state has reached the point where routine operations carry a real time cost, the subgraph approach is worth seeing in practice. Stategraph is self-hosted, works with your existing Terraform configuration, and does not require changes to your provider setup.

Get started with the docs or follow along as we build.