
Terraliths in Terraform: what they are and why the monolith is the right shape

Terraform · Terralith · State Management · Infrastructure

A terralith is the path of least resistance in Terraform, and most of the pain that gets blamed on it is not really the monolith's fault. Slow plans, lock contention, and scary blast radius are symptoms of storing infrastructure state in a file, not symptoms of keeping your configuration in one root module.

TL;DR
$ cat terralith-terraform.tldr
• A terralith is a single Terraform root module that manages infrastructure in one state file.
• It is easy to start with, and for a surprising amount of time it reflects how infrastructure actually connects.
• Slow plans, wide blast radius, and lock contention are usually storage failures, not architecture failures.
• Splitting state to appease Terraform redistributes the pain and hides real dependencies. Fix the storage layer instead.

If you already use Terraform in production, you have probably felt the pull of keeping everything in one place. One repo, one working directory, one plan, one apply, one state file.

At the beginning, that shape makes perfect sense. A small team defines providers, variables, outputs, and resources in a single module, runs terraform plan, reviews one plan, and ships. That is not laziness. It is often the most direct path to infrastructure code that works.
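A minimal sketch of that shape, with hypothetical provider and resource names, is a single directory like this:

```hcl
# main.tf — one root module, one state file (all names hypothetical)
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# A child module called from the root; everything still lands in one state
module "database" {
  source = "./modules/database"
  vpc_id = aws_vpc.main.id
}

output "vpc_id" {
  value = aws_vpc.main.id
}
```

One `terraform plan` in this directory evaluates all of it: providers, variables, the child module, and the outputs, against one state file.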

The trouble is that a terralith rarely announces itself as a problem. Infrastructure grows, more services arrive, more environments appear, and the same configuration absorbs more modules and more dependencies. The configuration still works, so the project keeps growing, and eventually someone on the team suggests breaking it up to make Terraform fast again.

Before you reach for that fix, it is worth pulling the terralith apart in the other direction, conceptually, and asking what is actually broken.

What is a terralith in Terraform?

A terralith is one root module with one state file. In Terraform terms, a root module is the directory where Terraform runs. Child modules are the modules you call with a module block from that root.

Terraform evaluates all the top-level .tf files in a directory as part of the same module, so splitting a large configuration across main.tf, network.tf, and database.tf does not create isolation. It is still one root module, one execution context, and one state file unless you deliberately introduce separate root modules.
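The lack of isolation is easy to see in practice. A resource defined in one file can reference a resource defined in another with a plain expression, because both files are read into the same module (resource names hypothetical):

```hcl
# network.tf
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# database.tf — same root module, so this reference resolves directly,
# with no outputs, data sources, or cross-state wiring
resource "aws_security_group" "db" {
  vpc_id = aws_vpc.main.id
}
```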

The term is not an official Terraform concept, and you will not find it in the Terraform docs. It is community shorthand for the monolithic pattern, usually invoked right before someone recommends chopping it up.

How a terralith takes shape

Most terraliths grow by accident.

A project starts with a VPC, a few IAM roles, a database, maybe a service or two. The team adds another provider, more variables, another child module, another environment switch, and another set of outputs. Nothing in Terraform forces a split at this stage, and Terraform is perfectly comfortable treating one directory as one module for a long time.

Even when engineers start sensing that the configuration is getting heavy, the easiest next step is usually to keep adding to the existing root module rather than creating multiple root modules, migrating state, and wiring cross-state dependencies. For a while, that instinct is correct. It avoids premature ceremony and, more importantly, it keeps related infrastructure on the same dependency graph.

That organic growth is the real story. A terralith is rarely a failure of discipline. It is what happens when you let infrastructure be shaped by its own dependencies instead of by arbitrary boundaries.

The terralith reflects how infrastructure actually connects

Infrastructure is a graph. VPCs contain subnets. Subnets contain instances. Instances reference security groups. Security groups reference other security groups. Load balancers reference instances. DNS references load balancers. Everything connects to something else.

That is not a quirk of one team's codebase. It is the structure of the problem. When you keep all of that in one root module, the configuration maps cleanly onto the thing it is describing. Dependencies are explicit. References resolve in the graph Terraform already builds. Outputs do not need to be copied across state boundaries to reach the resources that consume them.
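In configuration terms, the graph is just a chain of references. A sketch with hypothetical names:

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id    # subnet depends on VPC
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "app" {
  vpc_id = aws_vpc.main.id        # security group depends on VPC
}

resource "aws_instance" "app" {
  ami                    = var.ami_id          # hypothetical variable
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.app.id   # instance depends on subnet
  vpc_security_group_ids = [aws_security_group.app.id]
}
```

Terraform derives the ordering from those references. Nothing here has to be declared twice or exported across a boundary.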

Design Principle

Tools should conform to the shape of the problem, not the other way around. Infrastructure is interconnected, so a single root module that preserves those connections is the natural representation. The monolith is not the anti-pattern. Pretending the graph is several disjoint graphs is.

This is why we have argued elsewhere that terraliths are the natural shape of infrastructure. The pattern keeps showing up because it matches reality.

What actually hurts at scale

The common story is that a terralith is fine until it gets big, and then it is a liability. The first half is right. The second half is missing the subject.

Three symptoms do show up as infrastructure grows. They are worth taking seriously. They are also worth naming correctly.

Plan and apply times stretch with every added resource

Terraform plan reads the current state, refreshes the tracked objects against the provider, and compares reality to configuration. As the state file grows, every terraform plan has more to read and more to refresh. The feedback loop that once felt immediate starts taking minutes.

The reason is not that 2,000 resources cannot be reasoned about in a single configuration. The reason is that Terraform reads and refreshes the entire state file even when you are changing one resource. A 40 MB JSON blob with a global lock cannot support selective reads or partial refreshes. The slowness is the storage format, not the resource count.

The blast radius is as wide as the state file

Blast radius is the scope of unrelated things that can be affected when something goes wrong. In a terralith, an IAM edit, a networking update, and a database change can all sit inside the same apply against the same state.

That is a real risk, but splitting state only moves the line. Service A and service B now live in different files, yet the database behind them is still shared, the VPC is still shared, and the cross-stack wiring you built to make splitting work is now an invisible coordination surface that Terraform cannot see. You did not reduce blast radius. You traded one kind of risk for another and added a coordination layer on top.

Teams queue behind one lock

Terraform state locking exists for a good reason. If the backend supports it, Terraform locks state for operations that could write state, and if the lock cannot be acquired, Terraform does not continue. See Terraform state locking explained for the details.

The bottleneck is not locking itself. It is the granularity. Terraform locks the whole state file because the file is the unit of atomicity. A platform team updating networking, an application team changing a service, and CI running an approved apply all serialize behind the same lock, even when their changes touch completely disjoint subgraphs.
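The granularity is visible in the backend configuration itself. With the s3 backend, for example, the lock is keyed to the single state object, so every operation against that state competes for the same lock (bucket and table names hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"
    key            = "terralith/terraform.tfstate"  # one state object...
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"              # ...one lock guarding all of it
  }
}
```

Nothing in that block can express "lock only the networking resources." The file is the unit, so the lock is too.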

Observation

Slow plans, wide blast radius, and lock contention are not three separate architectural complaints. They are three faces of the same thing, which is that the state file is the only unit Terraform knows how to reason about at rest.

Why splitting state is not the fix

The standard advice when a terralith starts hurting is to break it into smaller root modules, split by environment, team, service, or account, and wire the pieces together with terraform_remote_state data sources or a wrapper tool.

That advice treats the symptom. It does not remove it. It spreads it out.

Cross-stack dependencies become manual. When stack A creates a VPC and stack B needs the VPC ID, Terraform can no longer track that dependency. You reach for data sources, hardcoded values, or parameter passing through a wrapper. The dependency is still there in your infrastructure, but it is now invisible to Terraform. When stack A changes, stack B does not know. Drift accumulates quietly.
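The usual wiring looks like this, with stack B pulling stack A's output through a data source rather than through a reference Terraform can track end to end (names hypothetical):

```hcl
# In stack B: read stack A's outputs from its state file
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-tf-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id    # hypothetical variable
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.subnet_id
}
```

If stack A renames or removes that output, stack B only finds out at its next plan. The dependency exists, but neither stack's graph contains it.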

Orchestration complexity grows. You now need to know which stack deploys first. Your CI pipeline encodes dependency ordering that should live in the infrastructure graph. When a new dependency appears, you update both the Terraform configuration and the deployment orchestration, and the single source of truth becomes two sources that have to stay in sync.

Lock contention does not go away. It gets smaller boundaries. Three engineers working on the same stack still block each other. You have just decided that blocking inside service boundaries is acceptable. The underlying problem, which is global locks preventing concurrent work on unrelated resources, is still there.

We make that argument in more detail in The Terralith is correct. State fragmentation is the problem. The short version is that splitting state to make Terraform faster is not a solution. It is a way of paying the cost in coordination overhead instead of in plan time.

Pros and cons of a terralith, honestly

A terralith is still worth evaluating on its own terms. Here is the comparison most people draw, with the storage caveat made explicit.

What a terralith gets right

  • One root module keeps the dependency graph intact.
  • Plans and applies operate on the same visible structure your infrastructure actually has.
  • No orchestration layer needed to coordinate root modules.
  • No remote state data sources to keep in sync.
  • Outputs flow between resources through normal references, not across stack boundaries.

What hurts, and why

  • Plan and apply times grow with state size, because the state file is read and refreshed in full.
  • Blast radius tracks the state file boundary, because the file is the unit of atomicity.
  • Lock contention serializes unrelated work, because the lock is global to the file.
  • Every one of these is a property of file-based state, not of the monolithic root module.

Read the second list carefully. None of the entries say the problem is that too many resources live together. They say the problem is that file-based state cannot support selective reads, partial refreshes, or granular locking.

What actually fixes the symptoms

If the state file is the unit that creates the pain, the fix is to stop using a file as the database.

Store state as a graph. Nodes are resources. Edges are dependencies. Plans compute the affected subgraph from configuration changes and read only those resources. Applies lock only the resources they are about to modify, plus their dependents. Multiple engineers working on disjoint subgraphs proceed in parallel without stepping on each other.

# File-based state, global lock on the whole terralith
$ time terraform plan
Acquiring state lock...
Refreshing state... [2847/2847]
Plan: 1 to add, 0 to change, 0 to destroy.
real 30m 0s
 
# Graph-native state, subgraph lock on the affected resources
$ time stategraph plan
Computing affected subgraph... 12 resources
Acquiring subgraph lock...
Refreshing subgraph... [12/12]
Plan: 1 to add, 0 to change, 0 to destroy.
real 0m 2.1s

Stategraph is built on that model. State lives as versioned rows in PostgreSQL with explicit foreign keys for dependencies. Each resource has its own lock. Plans read the subgraph they need. Applies lock only the resources they are changing. The same terralith that took thirty minutes to plan takes seconds, and the teams that used to serialize behind one lock work in parallel.

The configuration does not have to move. The dependency graph does not have to be cut into pieces. The terralith stays where it is, because the problem was never the terralith.

When splitting is still reasonable

There are a few reasons to run more than one root module that have nothing to do with appeasing Terraform. Regulatory or account boundaries that demand separate blast domains. Completely independent products that happen to share a company. A genuine desire to hand a piece of infrastructure to a different team with a different release cadence and no shared resources.

Those splits are fine. They are driven by the shape of the organization or the shape of the compliance regime, not by the shape of a JSON file. The test is simple. If you are splitting because the underlying infrastructure is genuinely disjoint, that is architecture. If you are splitting because terraform plan is slow or because your teams keep colliding on a lock, that is tool appeasement, and you are trading one kind of pain for another.

Pattern Recognition

Every terralith refactor driven by plan time or lock contention eventually reintroduces the same coordination problems inside the new stacks, because the thing you were fighting was the storage model. The stacks just make the fight harder to see.

Terralith FAQs

What is a terralith in Terraform?

A terralith is a single Terraform root module that manages infrastructure with one state file. It is community shorthand, not an official Terraform term, and it describes the pattern of keeping everything in one plan, one apply, and one state.

Is a terralith an anti-pattern?

No. A terralith reflects how infrastructure actually connects, which is as a graph. The pain people attribute to the terralith usually comes from Terraform storing that graph as a file with a global lock, not from the fact that the resources live in one root module.

Why is my Terraform plan slow in a large terralith?

Terraform reads and refreshes the whole state file on every plan. As the state grows, plan time grows with it, because file-based state does not support selective reads or partial refreshes. Switching to graph-native state with subgraph plans removes the cost without splitting the configuration.

Should I split my terralith into multiple root modules?

Only if the split reflects a real boundary in your organization or compliance model. If you are splitting to work around slow plans or lock contention, you are moving the pain rather than removing it, and you are introducing cross-stack dependencies that Terraform can no longer track.

Does state splitting fix lock contention?

It makes the locks smaller, but it does not remove the pattern. Teams working on the same stack still serialize behind the same lock. The underlying problem, which is that the lock is global to the state file instead of scoped to the resources being changed, is still there until the storage layer changes.

Does Stategraph require me to split my terralith?

No. Stategraph keeps the terralith intact and changes what happens underneath. State is stored as a graph in PostgreSQL, plans and applies operate on the affected subgraph, and locks are taken on resources rather than on the whole state file.

The terralith was never the problem

A terralith feels right at the start because infrastructure really is interconnected, and it keeps feeling right for longer than the usual advice admits. When it starts hurting, the symptoms are real, but the diagnosis is usually wrong. Slow plans, wide blast radius, and lock contention are things a file with a global lock cannot avoid, no matter how well you organize the Terraform code sitting on top of it.

Fix the layer that is actually broken. Keep the configuration that matches your infrastructure.