
Yes, Terraform has a graph. That's not the point.

Terraform · State Management · Distributed Systems · Architecture
TL;DR
$ cat terraform-has-a-graph.tldr
• Terraform's graph is ephemeral (built, walked, discarded each run)
• In-run parallelism doesn't fix state-level locking or cross-state coordination
• Stategraph persists the graph, making it queryable infrastructure, not an execution detail
• Persistent graphs enable resource-level locking and multi-state transactions

Every time I talk about Stategraph, someone eventually says "Doesn't Terraform already have graph walking?" Yes. Terraform builds a DAG and walks it during plan and apply. It parallelizes nodes. You can tune it with -parallelism. And none of that solves the problem we're solving.

This is not a "we discovered graphs" moment. Terraform absolutely has a graph. It's a well-engineered DAG scheduler embedded in the CLI. The issue isn't whether Terraform walks a graph. The issue is that Terraform's graph is an implementation detail.

If you think "graph walking" is the story, you've confused a data structure for a control plane.

What Terraform's graph actually does

Let's be precise. On every run, Terraform:

• Builds a DAG from configuration and state
• Walks it during plan and apply, running independent nodes in parallel (tunable with -parallelism)
• Discards it when the run ends

The graph lives inside a single process, for a single run, against a single state. It exists to schedule work. It does not coordinate systems.
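To make "schedule work" concrete, here's a minimal sketch of an in-process DAG walk, using Python's stdlib `graphlib` (this is illustrative, not Terraform's actual scheduler; the resource names are made up):

```python
# Minimal sketch of an in-process DAG walk: build, walk in dependency
# order, then throw the graph away. Not Terraform's real scheduler.
from graphlib import TopologicalSorter

# Hypothetical resources; each maps to the set of things it depends on.
deps = {
    "aws_instance.web": {"aws_subnet.main"},
    "aws_subnet.main": {"aws_vpc.main"},
    "aws_vpc.main": set(),
}

def walk(graph):
    """Walk the DAG. Nodes whose prerequisites are done become 'ready'
    together; that batch is where in-run parallelism comes from."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    order = []
    while ts.is_active():
        ready = list(ts.get_ready())  # these could run concurrently
        order.append(sorted(ready))
        ts.done(*ready)
    return order  # the graph is discarded once the walk finishes

print(walk(deps))
# Nothing about this graph survives the call; the next run rebuilds it.
```

The point of the sketch: the graph exists only for the duration of `walk`. Everything Terraform knows about dependencies evaporates when the process exits.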

Terraform graph lifecycle diagram

Observation

Terraform's graph is ephemeral by design. It's reconstructed on every run from configuration and state. This means the graph can never be more than a local scheduling optimization. It has no memory, no history, and no awareness of other concurrent operations.

Parallelism is not the bottleneck

When someone says "Terraform already does graph walking," what they usually mean is "Terraform already parallelizes things." Sure. But parallelism inside a single apply was never the fundamental scaling constraint.

If your system is small, Terraform is fine.

If your system is large, your bottlenecks look like this:

• Teams queuing on a single state lock
• CI pipelines serializing applies that touch disjoint resources
• Changes spanning multiple states with no way to coordinate them

None of those are fixed by increasing -parallelism from 10 to 50. You can't thread-pool your way out of architectural limits.

Terraform's safety model is effectively one state, one writer. Whatever the backend, the coordination boundary is still the state. One operation owns the write path, and everyone else waits. That means independent subgraphs can't proceed concurrently. The graph inside Terraform only optimizes work after you've acquired the global lock.

CI queues, PR serialization, Terragrunt wrappers. Ceremony around a file lock.

Which is like optimizing the fuel efficiency of a car that's stuck in traffic.
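The "one state, one writer" model is easy to demonstrate. Here's a toy sketch where a single lock stands in for the backend's state lock (names and scopes are illustrative, not Terraform internals):

```python
# Sketch of "one state, one writer": a single lock guards the whole
# state, so two applies touching disjoint resources still serialize.
import threading

state_lock = threading.Lock()  # stand-in for the backend's state lock
timeline = []

def apply(team, resources):
    with state_lock:           # held for the entire run
        timeline.append(f"{team} start")
        for r in resources:    # in-run parallelism would live here,
            pass               # but only after the global lock is held
        timeline.append(f"{team} end")

a = threading.Thread(target=apply, args=("team-a", ["ec2", "sg"]))
b = threading.Thread(target=apply, args=("team-b", ["rds", "vpc"]))
a.start(); b.start(); a.join(); b.join()

# Whichever team starts first, the other waits: runs never interleave.
print(timeline)
```

However fast the work inside each `apply` is, the coordination boundary is the lock, and the lock is global.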

Global lock versus resource-level locking comparison

Terraform walks a DAG. Stategraph operates a DAG.

This is the core difference.

Terraform builds a graph to execute a run. Stategraph treats the graph as the system.

And before someone says "Terraform Cloud already coordinates runs," let's be clear. It coordinates runs around Terraform. Stategraph coordinates runs through the graph. Those are not the same thing.

That means:

• Locks are taken at resource granularity, not per state file
• Disjoint changes proceed concurrently instead of queuing
• The graph can be queried and audited between runs

Terraform's graph is ephemeral. Stategraph's graph is infrastructure. That's not a performance tweak. That's a different execution model.

stategraph> -- Find minimal impacted subgraph
WITH RECURSIVE affected AS (
  SELECT id, type, name FROM resources
  WHERE name = 'prod-db-subnet'
  UNION
  SELECT r.id, r.type, r.name FROM resources r
  JOIN dependencies d ON r.id = d.dependent_id
  JOIN affected a ON d.resource_id = a.id
) SELECT * FROM affected;
5 resources in change scope (0.002s)
Lock only these 5, not all 3,200 in state

Design Principle

When the graph is persistent, it stops being an internal structure and becomes a system of record. Once it's a system of record, it becomes a control plane. And once it's a control plane, you can coordinate at resource granularity instead of file granularity.
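What does "coordinate at resource granularity" look like mechanically? A minimal sketch, assuming a per-resource lock table and a fixed acquisition order (the class and IDs here are hypothetical, not Stategraph's implementation):

```python
# Sketch of resource-granularity coordination: each resource gets its
# own lock, and an operation acquires only the locks in its change
# scope, in a fixed global order to avoid deadlock. Illustrative only.
import threading

class ResourceLocks:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, rid):
        with self._guard:
            return self._locks.setdefault(rid, threading.Lock())

    def acquire(self, resource_ids):
        # Sorting imposes one global lock order, so two operations with
        # overlapping scopes can never deadlock each other.
        ordered = sorted(resource_ids)
        for rid in ordered:
            self._lock_for(rid).acquire()
        return ordered

    def release(self, resource_ids):
        for rid in sorted(resource_ids, reverse=True):
            self._lock_for(rid).release()

locks = ResourceLocks()
scope_a = {"ec2-1", "sg-1"}    # one operation's change scope
scope_b = {"rds-1", "vpc-1"}   # another's: disjoint, so no waiting
locks.acquire(scope_a)
locks.acquire(scope_b)         # succeeds immediately: no shared locks
locks.release(scope_b)
locks.release(scope_a)
print("both scopes held concurrently")
```

File granularity collapses this whole table into one lock; resource granularity lets the two scopes above hold their locks at the same time.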

In-run parallelism versus subgraph execution

Within a single apply, if two nodes don't depend on each other, they can run concurrently. Great. But that still happens inside one process, under one state lock, in one isolated execution context, with no awareness of other runs.

Stategraph takes a different view. Instead of asking "How do we parallelize within one apply?" we ask a different question.

What is the minimal impacted subgraph of this change?

If a change touches 5 resources in a graph of 10,000, why should we reason about 10,000? If two changes touch disjoint subgraphs, why should they block each other? If a database and a CDN are independent in the dependency graph, why are they serialized by a state file boundary?
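Computing that minimal subgraph is just a reachability walk over dependency edges. A sketch, with hypothetical resource names (the `cdn` branch is there to show independent subgraphs never enter the scope):

```python
# Sketch of minimal-impacted-subgraph computation: start from the
# changed resource and walk dependency edges to collect everything
# that could be affected. Resource names are hypothetical.
from collections import deque

# dependents[x] = resources that depend directly on x
dependents = {
    "prod-db-subnet": ["prod-db", "prod-db-sg"],
    "prod-db": ["app-server"],
    "prod-db-sg": [],
    "app-server": [],
    "cdn": ["cdn-dns"],   # independent subgraph: never visited
    "cdn-dns": [],
}

def impacted(changed, dependents):
    """Breadth-first walk over dependent edges from the changed node."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

scope = impacted("prod-db-subnet", dependents)
print(sorted(scope))  # the change scope, not the whole graph
```

The CDN never appears in the result, which is exactly why a CDN change and a database change have no business blocking each other.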

Subgraph execution means:

• Plans reason about the impacted subgraph, not the entire state
• Locks cover only the resources in that subgraph
• Changes with disjoint subgraphs run concurrently

That's not "more parallelism." That's removing the global mutex.

If your coordination primitive is a file lock, your scaling story is a queue.

Subgraph isolation showing concurrent team operations
# Terraform: In-run parallelism
Team A: terraform apply (locks entire state)
→ Creates 10 EC2 instances in parallel
→ Updates 5 security groups in parallel
Team B: BLOCKED (waiting for state lock)
# Stategraph: Subgraph isolation
Team A: stategraph apply
→ Locks: EC2, SG subgraph (15 resources)
→ Creates resources in parallel
Team B: stategraph apply
→ Locks: RDS, VPC subgraph (8 resources)
Proceeds in parallel (disjoint subgraphs)
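The scheduling rule the comparison above relies on is a one-liner: two applies may run concurrently exactly when their locked subgraphs share no resources. A sketch, with illustrative resource IDs:

```python
# Disjoint lock sets mean neither apply can observe or mutate anything
# the other touches, so both are safe to proceed.
def can_run_concurrently(scope_a: set, scope_b: set) -> bool:
    return not (scope_a & scope_b)

team_a = {"ec2-1", "ec2-2", "sg-1"}  # Team A's EC2/SG subgraph
team_b = {"rds-1", "vpc-1"}          # Team B's RDS/VPC subgraph

print(can_run_concurrently(team_a, team_b))    # True: both proceed
print(can_run_concurrently(team_a, {"sg-1"}))  # False: overlap, one waits
```

With a global state lock, the answer to this question is always "no". With subgraph scopes, it's a set intersection.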

Persistence changes everything

Here's the part most people miss. Terraform rebuilds the graph every run. Stategraph persists it.

Once the graph is persistent, you unlock:

• Queries over live infrastructure (what depends on this subnet?)
• Resource-level locks that outlive any single process
• Transactions that span multiple states
• A history of what changed, when, and why

Once the graph is persistent, it stops being a data structure and starts being infrastructure.

That's when coordination moves from "hope the pipeline runs in order" to "the system enforces invariants."

Implementation Detail

Stategraph stores the graph in PostgreSQL as a normalized schema: resources table, dependencies table, transactions log. This enables SQL queries over your infrastructure, ACID transactions across states, and resource-level concurrency control for safe parallel operations.
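The normalized-graph idea is easy to try out. Here's a sketch using SQLite in place of PostgreSQL; the schema and data are illustrative, not Stategraph's actual DDL, and the recursive query mirrors the one shown earlier:

```python
# Sketch of a normalized dependency graph in SQL (SQLite stand-in for
# PostgreSQL; schema and rows are illustrative, not Stategraph's DDL).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE dependencies (resource_id INTEGER, dependent_id INTEGER);
""")
db.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    (1, "aws_subnet", "prod-db-subnet"),
    (2, "aws_db_instance", "prod-db"),
    (3, "aws_instance", "app-server"),
    (4, "aws_cloudfront", "cdn"),  # independent: never in scope
])
# (resource_id, dependent_id): the dependent depends on the resource
db.executemany("INSERT INTO dependencies VALUES (?, ?)", [(1, 2), (2, 3)])

scope = db.execute("""
WITH RECURSIVE affected AS (
    SELECT id, type, name FROM resources WHERE name = 'prod-db-subnet'
    UNION
    SELECT r.id, r.type, r.name
    FROM resources r
    JOIN dependencies d ON r.id = d.dependent_id
    JOIN affected a ON d.resource_id = a.id
)
SELECT name FROM affected
""").fetchall()
print([row[0] for row in scope])  # the change scope, not the whole table
```

Because the graph lives in relational tables, "what's in scope?" is a query the database answers, not something each CLI process reconstructs from scratch.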

Why Terraform works this way

This isn't an accident. Terraform's architecture optimizes for portability and simplicity:

• The CLI is a self-contained binary with no server to run
• State is a file that any backend can store
• The graph is rebuilt from scratch on every run

That makes Terraform easy to reason about and easy to distribute.

It also means Terraform can't accumulate institutional memory: every run starts fresh.

And it hard-codes a coordination boundary at the state file.

Stategraph makes a different tradeoff. We accept a persistent control plane because coordination, not portability, is the scaling constraint in 2026.

A cleaner mental model

Here's the simplest way to think about it:

Terraform: the graph is a means to an end. It's built to schedule one run, then thrown away.

Stategraph: the graph is the end. It's the persistent substrate that runs, locks, and queries operate through.

Terraform's graph is an execution detail. Stategraph's graph is the system.

# Terraform: Graph as optimization
parse HCL → build DAG → walk → update state → discard DAG
↑ Graph is means to an end (scheduling)
# Stategraph: Graph as infrastructure
parse HCL → query graph → plan subgraph → lock nodes → apply → persist
↑ Graph is the end (coordination substrate)

Why this matters now

At small scale, none of this matters. At enterprise scale, it's everything.

When you have hundreds of engineers, dozens of states, monorepos, regulated environments, slow CI queues, and constant lock contention, the bottleneck isn't CPU. It's coordination.

And Terraform's architecture centralizes coordination at the state file boundary. Stategraph decentralizes coordination to the resource boundary.

That's the shift.

Terraform walks a graph.
Stategraph makes the graph the control plane.

Terraform optimized single-run execution. Stategraph optimizes organizational coordination.

At scale, the bottleneck isn't CPU. It's who's allowed to touch what, when.

Stategraph moves that boundary from the state file to the resource.

If your infrastructure execution model is a queue, your engineering org eventually becomes one too.

Stop coordinating. Start shipping.

Graph-based state. Resource-level locking. Multi-state transactions.
The graph becomes infrastructure, not an execution detail.
