
Yes, Terraform has a graph. That's not the point.

Terraform · State Management · Distributed Systems · Architecture
TL;DR
$ cat terraform-has-a-graph.tldr
• Terraform's graph is ephemeral (built, walked, discarded each run)
• In-run parallelism doesn't fix state-level locking or cross-state coordination
• Stategraph persists the graph, making it queryable infrastructure, not an execution detail
• Persistent graphs enable resource-level locking and multi-state transactions

Every time I talk about Stategraph, someone eventually says "Doesn't Terraform already have graph walking?" Yes. Terraform builds a DAG and walks it during plan and apply. It parallelizes nodes. You can tune it with -parallelism. And none of that solves the problem we're solving.

This is not a "we discovered graphs" moment. Terraform absolutely has a graph. It's a well-engineered DAG scheduler embedded in the CLI. The issue isn't whether Terraform walks a graph. The issue is that Terraform's graph is an implementation detail.

If you think "graph walking" is the story, you've confused a data structure for a control plane.

What Terraform's graph actually does

Let's be precise. On every run, Terraform:

• Builds a DAG from configuration and state
• Walks it during plan and apply, running independent nodes in parallel (tunable with -parallelism)
• Discards it when the run ends

The graph lives inside a single process, for a single run, against a single state. It exists to schedule work. It does not coordinate systems.
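To make "schedule work" concrete, here's a minimal sketch of an in-process DAG walk, using Python's stdlib `graphlib` (this is illustrative, not Terraform's actual scheduler; the resource names are made up):

```python
# Minimal sketch of an in-process DAG walk: build, walk in dependency
# order, then throw the graph away. Not Terraform's real scheduler.
from graphlib import TopologicalSorter

# Hypothetical resources; each maps to the set of things it depends on.
deps = {
    "aws_instance.web": {"aws_subnet.main"},
    "aws_subnet.main": {"aws_vpc.main"},
    "aws_vpc.main": set(),
}

def walk(graph):
    """Walk the DAG. Nodes whose prerequisites are done become 'ready'
    together; that batch is where in-run parallelism comes from."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    order = []
    while ts.is_active():
        ready = list(ts.get_ready())  # these could run concurrently
        order.append(sorted(ready))
        ts.done(*ready)
    return order  # the graph is discarded once the walk finishes

print(walk(deps))
# Nothing about this graph survives the call; the next run rebuilds it.
```

The point of the sketch: the graph exists only for the duration of `walk`. Everything Terraform knows about dependencies evaporates when the process exits.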

Terraform graph lifecycle diagram

Observation

Terraform's graph is ephemeral by design. It's reconstructed on every run from configuration and state. This means the graph can never be more than a local scheduling optimization. It has no memory, no history, and no awareness of other concurrent operations.

Parallelism is not the bottleneck

When someone says "Terraform already does graph walking," what they usually mean is "Terraform already parallelizes things." Sure. But parallelism inside a single apply was never the fundamental scaling constraint.

If your system is small, Terraform is fine.

If your system is large, your bottlenecks look like this:

• Teams queuing on a single state lock
• CI pipelines serializing applies that touch disjoint resources
• Changes spanning multiple states with no way to coordinate them

None of those are fixed by increasing -parallelism from 10 to 50. You can't thread-pool your way out of architectural limits.

Terraform's safety model is effectively one state, one writer. Whatever the backend, the coordination boundary is still the state. One operation owns the write path, and everyone else waits. That means independent subgraphs can't proceed concurrently. The graph inside Terraform only optimizes work after you've acquired the global lock.

CI queues, PR serialization, Terragrunt wrappers. Ceremony around a file lock.

Which is like optimizing the fuel efficiency of a car that's stuck in traffic.
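The "one state, one writer" model is easy to demonstrate. Here's a toy sketch where a single lock stands in for the backend's state lock (names and scopes are illustrative, not Terraform internals):

```python
# Sketch of "one state, one writer": a single lock guards the whole
# state, so two applies touching disjoint resources still serialize.
import threading

state_lock = threading.Lock()  # stand-in for the backend's state lock
timeline = []

def apply(team, resources):
    with state_lock:           # held for the entire run
        timeline.append(f"{team} start")
        for r in resources:    # in-run parallelism would live here,
            pass               # but only after the global lock is held
        timeline.append(f"{team} end")

a = threading.Thread(target=apply, args=("team-a", ["ec2", "sg"]))
b = threading.Thread(target=apply, args=("team-b", ["rds", "vpc"]))
a.start(); b.start(); a.join(); b.join()

# Whichever team starts first, the other waits: runs never interleave.
print(timeline)
```

However fast the work inside each `apply` is, the coordination boundary is the lock, and the lock is global.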

Global lock versus resource-level locking comparison

Terraform walks a DAG. Stategraph operates a DAG.

This is the core difference.

Terraform builds a graph to execute a run. Stategraph treats the graph as the system.

And before someone says "Terraform Cloud already coordinates runs," let's be clear. It coordinates runs around Terraform. Stategraph coordinates runs through the graph. Those are not the same thing.

That means:

• Locks are taken at resource granularity, not per state file
• Disjoint changes proceed concurrently instead of queuing
• The graph can be queried and audited between runs

Terraform's graph is ephemeral. Stategraph's graph is infrastructure. That's not a performance tweak. That's a different execution model.

stategraph> -- Find minimal impacted subgraph
WITH RECURSIVE affected AS (
  SELECT id, type, name FROM resources
  WHERE name = 'prod-db-subnet'
  UNION
  SELECT r.id, r.type, r.name FROM resources r
  JOIN dependencies d ON r.id = d.dependent_id
  JOIN affected a ON d.resource_id = a.id
) SELECT * FROM affected;
5 resources in change scope (0.002s)
Lock only these 5, not all 3,200 in state

Design Principle

When the graph is persistent, it stops being an internal structure and becomes a system of record. Once it's a system of record, it becomes a control plane. And once it's a control plane, you can coordinate at resource granularity instead of file granularity.
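What does "coordinate at resource granularity" look like mechanically? A minimal sketch, assuming a per-resource lock table and a fixed acquisition order (the class and IDs here are hypothetical, not Stategraph's implementation):

```python
# Sketch of resource-granularity coordination: each resource gets its
# own lock, and an operation acquires only the locks in its change
# scope, in a fixed global order to avoid deadlock. Illustrative only.
import threading

class ResourceLocks:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, rid):
        with self._guard:
            return self._locks.setdefault(rid, threading.Lock())

    def acquire(self, resource_ids):
        # Sorting imposes one global lock order, so two operations with
        # overlapping scopes can never deadlock each other.
        ordered = sorted(resource_ids)
        for rid in ordered:
            self._lock_for(rid).acquire()
        return ordered

    def release(self, resource_ids):
        for rid in sorted(resource_ids, reverse=True):
            self._lock_for(rid).release()

locks = ResourceLocks()
scope_a = {"ec2-1", "sg-1"}    # one operation's change scope
scope_b = {"rds-1", "vpc-1"}   # another's: disjoint, so no waiting
locks.acquire(scope_a)
locks.acquire(scope_b)         # succeeds immediately: no shared locks
locks.release(scope_b)
locks.release(scope_a)
print("both scopes held concurrently")
```

File granularity collapses this whole table into one lock; resource granularity lets the two scopes above hold their locks at the same time.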

In-run parallelism versus subgraph execution

Within a single apply, if two nodes don't depend on each other, they can run concurrently. Great. But that still happens inside one process, under one state lock, in one isolated execution context, with no awareness of other runs.

Stategraph takes a different view. Instead of asking "How do we parallelize within one apply?" we ask a different question.

What is the minimal impacted subgraph of this change?

If a change touches 5 resources in a graph of 10,000, why should we reason about 10,000? If two changes touch disjoint subgraphs, why should they block each other? If a database and a CDN are independent in the dependency graph, why are they serialized by a state file boundary?
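Computing that minimal subgraph is just a reachability walk over dependency edges. A sketch, with hypothetical resource names (the `cdn` branch is there to show independent subgraphs never enter the scope):

```python
# Sketch of minimal-impacted-subgraph computation: start from the
# changed resource and walk dependency edges to collect everything
# that could be affected. Resource names are hypothetical.
from collections import deque

# dependents[x] = resources that depend directly on x
dependents = {
    "prod-db-subnet": ["prod-db", "prod-db-sg"],
    "prod-db": ["app-server"],
    "prod-db-sg": [],
    "app-server": [],
    "cdn": ["cdn-dns"],   # independent subgraph: never visited
    "cdn-dns": [],
}

def impacted(changed, dependents):
    """Breadth-first walk over dependent edges from the changed node."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

scope = impacted("prod-db-subnet", dependents)
print(sorted(scope))  # the change scope, not the whole graph
```

The CDN never appears in the result, which is exactly why a CDN change and a database change have no business blocking each other.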

Subgraph execution means:

• Plans reason about the impacted subgraph, not the entire state
• Locks cover only the resources in that subgraph
• Changes with disjoint subgraphs run concurrently

That's not "more parallelism." That's removing the global mutex.

If your coordination primitive is a file lock, your scaling story is a queue.

Subgraph isolation showing concurrent team operations
# Terraform: In-run parallelism
Team A: terraform apply (locks entire state)
→ Creates 10 EC2 instances in parallel
→ Updates 5 security groups in parallel
Team B: BLOCKED (waiting for state lock)
# Stategraph: Subgraph isolation
Team A: stategraph apply
→ Locks: EC2, SG subgraph (15 resources)
→ Creates resources in parallel
Team B: stategraph apply
→ Locks: RDS, VPC subgraph (8 resources)
Proceeds in parallel (disjoint subgraphs)
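The scheduling rule the comparison above relies on is a one-liner: two applies may run concurrently exactly when their locked subgraphs share no resources. A sketch, with illustrative resource IDs:

```python
# Disjoint lock sets mean neither apply can observe or mutate anything
# the other touches, so both are safe to proceed.
def can_run_concurrently(scope_a: set, scope_b: set) -> bool:
    return not (scope_a & scope_b)

team_a = {"ec2-1", "ec2-2", "sg-1"}  # Team A's EC2/SG subgraph
team_b = {"rds-1", "vpc-1"}          # Team B's RDS/VPC subgraph

print(can_run_concurrently(team_a, team_b))    # True: both proceed
print(can_run_concurrently(team_a, {"sg-1"}))  # False: overlap, one waits
```

With a global state lock, the answer to this question is always "no". With subgraph scopes, it's a set intersection.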

Persistence changes everything

Here's the part most people miss. Terraform rebuilds the graph every run. Stategraph persists it.

Once the graph is persistent, you unlock:

• Queries over live infrastructure (what depends on this subnet?)
• Resource-level locks that outlive any single process
• Transactions that span multiple states
• A history of what changed, when, and why

Once the graph is persistent, it stops being a data structure and starts being infrastructure.

That's when coordination moves from "hope the pipeline runs in order" to "the system enforces invariants."

Implementation Detail

Stategraph stores the graph in PostgreSQL as a normalized schema: resources table, dependencies table, transactions log. This enables SQL queries over your infrastructure, ACID transactions across states, and resource-level concurrency control for safe parallel operations.
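The normalized-graph idea is easy to try out. Here's a sketch using SQLite in place of PostgreSQL; the schema and data are illustrative, not Stategraph's actual DDL, and the recursive query mirrors the one shown earlier:

```python
# Sketch of a normalized dependency graph in SQL (SQLite stand-in for
# PostgreSQL; schema and rows are illustrative, not Stategraph's DDL).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE dependencies (resource_id INTEGER, dependent_id INTEGER);
""")
db.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    (1, "aws_subnet", "prod-db-subnet"),
    (2, "aws_db_instance", "prod-db"),
    (3, "aws_instance", "app-server"),
    (4, "aws_cloudfront", "cdn"),  # independent: never in scope
])
# (resource_id, dependent_id): the dependent depends on the resource
db.executemany("INSERT INTO dependencies VALUES (?, ?)", [(1, 2), (2, 3)])

scope = db.execute("""
WITH RECURSIVE affected AS (
    SELECT id, type, name FROM resources WHERE name = 'prod-db-subnet'
    UNION
    SELECT r.id, r.type, r.name
    FROM resources r
    JOIN dependencies d ON r.id = d.dependent_id
    JOIN affected a ON d.resource_id = a.id
)
SELECT name FROM affected
""").fetchall()
print([row[0] for row in scope])  # the change scope, not the whole table
```

Because the graph lives in relational tables, "what's in scope?" is a query the database answers, not something each CLI process reconstructs from scratch.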

Why Terraform works this way

This isn't an accident. Terraform's architecture optimizes for portability and simplicity:

• The CLI is a self-contained binary with no server to run
• State is a file that any backend can store
• The graph is rebuilt from scratch on every run

That makes Terraform easy to reason about and easy to distribute.

It also means Terraform can't accumulate institutional memory: every run starts fresh.

And it hard-codes a coordination boundary at the state file.

Stategraph makes a different tradeoff. We accept a persistent control plane because coordination, not portability, is the scaling constraint in 2026.

A cleaner mental model

Here's the simplest way to think about it:

Terraform: the graph is a means to an end. It's built to schedule one run, then thrown away.

Stategraph: the graph is the end. It's the persistent substrate that runs, locks, and queries operate through.

Terraform's graph is an execution detail. Stategraph's graph is the system.

# Terraform: Graph as optimization
parse HCL → build DAG → walk → update state → discard DAG
↑ Graph is means to an end (scheduling)
# Stategraph: Graph as infrastructure
parse HCL → query graph → plan subgraph → lock nodes → apply → persist
↑ Graph is the end (coordination substrate)

Why this matters now

At small scale, none of this matters. At enterprise scale, it's everything.

When you have hundreds of engineers, dozens of states, monorepos, regulated environments, slow CI queues, and constant lock contention, the bottleneck isn't CPU. It's coordination.

And Terraform's architecture centralizes coordination at the state file boundary. Stategraph decentralizes coordination to the resource boundary.

That's the shift.

Terraform walks a graph.
Stategraph makes the graph the control plane.

Terraform optimized single-run execution. Stategraph optimizes organizational coordination.

At scale, the bottleneck isn't CPU. It's who's allowed to touch what, when.

Stategraph moves that boundary from the state file to the resource.

If your infrastructure execution model is a queue, your engineering org eventually becomes one too.

Stop coordinating. Start shipping.

Graph-based state. Resource-level locking. Multi-state transactions.
The graph becomes infrastructure, not an execution detail.
