What Bazel taught us about Terraform
Build systems solved fast-and-correct twenty years ago. Bazel's motto is { Fast, Correct } — Choose two: you cannot be fast without being correct, because speed comes from caching, and caching only works when inputs are complete. Stategraph applies that same playbook to Terraform.
Coming from a build systems background, I see infrastructure as code as a build system that takes HCL as input and produces deployed infrastructure as output. This is a concrete instance of what build systems do: they take tasks as input, execute them, and produce outputs, usually called artifacts.
Executing a task involves gathering its inputs. This is complex when a task's inputs include artifacts produced by earlier tasks. This is where graph theory comes into play: a build system must compute the dependency graph between tasks in order to run them in the right order. Dependencies between tasks can be static (main.c depends on main.h) or dynamic (main.c depends on all *.c files). See build systems à la carte for a deeper dive.
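To make the ordering problem concrete, here is a minimal sketch using Python's standard graphlib module. The task names and the graph are made up for illustration; a real build system also has to discover dynamic dependencies, which this toy ignores.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "main.o": {"main.c", "main.h"},
    "util.o": {"util.c", "main.h"},
    "app": {"main.o", "util.o"},
}


def build_order(graph):
    """Return one valid execution order: every task comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(deps)
print(order)
```

Any order returned this way runs main.o and util.o before app, which is all a scheduler needs to know before it starts executing tasks.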
HCL as a dependency graph
In the IaC ecosystem, the dependency graph is expressed in HCL: resources depend on tfvars, modules depend on other modules, and so on. Here, the build artifact is the deployed infrastructure itself. HCL's dependency graph is hard to compute statically: HCL expressions can depend on files on disk, and for_each/count accept arbitrary expressions. That is why we had to write our own HCL evaluator early in our implementation.
How build systems are fast
My build system of choice is Google's Bazel. Bazel's motto is { Fast, Correct } — Choose two. This motto says a lot: to be fast, a build system must be correct. To be fast, you need to cache the results of earlier tasks. If a task's inputs haven't changed, you can reuse the artifacts that task produced previously. For this to be correct, your tasks' inputs must be complete: there should be no implicit dependencies. If an untracked input to your task changes and you reuse its cached artifacts, your outputs become incorrect.
Design Principle
Speed comes from caching. Caching only works when inputs are complete. There is no shortcut: a build system that cannot enumerate every input to a task cannot safely reuse its outputs, and a system that cannot safely reuse outputs cannot be fast.
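The principle can be sketched in a few lines of Python. This is an illustration only, not how Bazel or Stategraph implement their caches: the names input_key, ActionCache and compile_task are hypothetical, and the "task" is a stand-in that just concatenates its inputs.

```python
import hashlib


def input_key(inputs):
    """Hash every declared input (name and content) into one cache key."""
    h = hashlib.sha256()
    for name in sorted(inputs):  # sorted so the key does not depend on dict order
        h.update(name.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(inputs[name]).digest())
    return h.hexdigest()


class ActionCache:
    """Reuse a task's output whenever the hash of its declared inputs is unchanged."""

    def __init__(self):
        self._store = {}
        self.executions = 0  # counts real (non-cached) runs

    def run(self, task, inputs):
        key = input_key(inputs)
        if key not in self._store:
            self.executions += 1
            self._store[key] = task(inputs)
        return self._store[key]


def compile_task(inputs):
    # Stand-in for real work: the output depends only on the declared inputs.
    return b"".join(inputs[name] for name in sorted(inputs))
```

The cache is only as correct as the inputs dictionary is complete: if compile_task also read a file that is not in inputs, a change to that file would leave the key unchanged and the cache would serve a stale result.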
How Stategraph is fast
When implementing Stategraph, we encountered the same situation: we compute a fine-grained graph of dependencies between your HCL blocks and use it to cache computations intelligently. We use the same mechanism as Bazel and friends to detect changed inputs: we hash them. If an input changed, its hash changes, and Stategraph includes its dependents (reverse deps in Bazel jargon; that's the blast radius) in its computation.
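As a rough sketch of that idea (not Stategraph's actual code), the snippet below hashes block contents to find what changed, then walks reverse dependencies to collect the blast radius. The function names and the block addresses in the usage note are hypothetical, and deleted blocks are not handled.

```python
import hashlib
from collections import deque


def block_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()


def changed_blocks(old_hashes, new_sources):
    """Blocks that are new or whose content hash differs from the stored one."""
    return {
        name
        for name, text in new_sources.items()
        if old_hashes.get(name) != block_hash(text)
    }


def blast_radius(deps, changed):
    """Changed blocks plus everything that transitively depends on them."""
    # Invert "block -> what it depends on" into "block -> who depends on it".
    rdeps = {}
    for block, needs in deps.items():
        for need in needs:
            rdeps.setdefault(need, set()).add(block)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        for dependent in rdeps.get(queue.popleft(), ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With deps such as {"aws_db_instance.main": {"var.db_size"}, "aws_instance.api": {"aws_db_instance.main"}, "aws_s3_bucket.assets": set()}, a change to var.db_size reaches the database and the API instance, while the bucket stays out of the plan.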
Bazel only rebuilds affected tasks, while Stategraph only plans the affected subgraph of resources.
db.tf only triggers a replan of its reverse dependencies. Everything else stays cached.
But doesn't Terraform already do this?
Terraform/OpenTofu users will recognise that Terraform does build a dependency graph too. So what makes Stategraph different? We covered this in Terraform has a graph; the short version:
Terraform's graph is ephemeral. Every terraform plan re-parses your .tf files, re-evaluates every expression, and rebuilds the graph from scratch. There is no cross-run cache of HCL evaluation: every plan is a cold build, even when nothing changed.
The state lock is just as coarse: one apply at a time per state file, no matter how disjoint the changes are. Two engineers touching unrelated parts of the same state still serialize through one lock.
Stategraph keeps the graph as a persistent artifact. We hash HCL blocks to detect change, only plan the subgraph the change reaches, and lock at the granularity of what's actually modified. Terraform's graph is a means to an end; Stategraph's graph is the product.
The Core Difference
Terraform discards its dependency graph at the end of every run. Stategraph persists it, hashes its inputs, and uses it to decide what work to skip. That shift unlocks incremental plans, lock-free parallelism, and virtually every feature of Stategraph.
Staying fast in the presence of hundreds of HCL files
Like Bazel, we have to be smart about traversing your HCL files. When there are hundreds of them, rereading all of them on every run hurts performance. This is very similar to Bazel's use of Starlark: a dialect of Python with additional properties (determinism, no I/O, no side effects) that make Starlark evaluation results easy to cache.
By keeping track of the graph of dependencies between HCL files, we can optimise traversal. If you only changed a few HCL files since your last plan, we can reuse part of the previous dependency graph: only edges in/from modified files may have changed.
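Here is a hedged sketch of that reuse, under simplifying assumptions: files are given as in-memory strings, and extract_edges is a toy regex that stands in for real HCL evaluation. None of these names come from Stategraph.

```python
import hashlib
import re


def file_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()


def extract_edges(content):
    # Toy stand-in for HCL evaluation: collect var.* and module.* references.
    return sorted(set(re.findall(r"\b(?:var|module)\.\w+", content)))


def refresh_edges(previous, files, extract):
    """Recompute edges only for files whose hash changed since the last run.

    previous maps path -> (hash, edges) from the last run.
    Returns the new mapping and the list of paths that had to be re-read.
    """
    state, reparsed = {}, []
    for path, content in files.items():
        digest = file_hash(content)
        cached = previous.get(path)
        if cached is not None and cached[0] == digest:
            state[path] = cached  # unchanged file: reuse its edges as-is
        else:
            state[path] = (digest, extract(content))
            reparsed.append(path)
    return state, reparsed
```

On a second run where only one file was edited, reparsed contains just that file, and every other entry is carried over from the previous state untouched.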
However, there is no shortcut for computing what has changed within a given file: we hash HCL blocks and use the usual hash changed -> content changed mechanism. Within a block, it's hard to be finer-grained: HCL scoping is coarse, so any given expression can depend on all modules in scope. In the future we plan to do per-field dependency computation, but it will always be an approximation. Our ability to do this is limited by the design of HCL.
Bazel and Stategraph: sandboxing
In Bazel, builds are performed within sandboxes. This lets you verify that your dependency graph is correct: if an input to a task is undeclared, Bazel will not copy it into the sandbox and the build will fail. Similarly, Stategraph issues terraform/tofu commands in a sandbox directory, into which it copies the subgraph of the HCL code needed to perform your change. In the early days of our implementation, bugs in our computation of the subgraph would surface in this sandbox: the HCL we had copied was incomplete.
Local sandboxing is also the first step towards enabling remote execution, more on that below.
Observation
A sandbox is a correctness oracle for your dependency graph. If the sandbox is missing an input, the build fails loudly instead of silently producing a stale or wrong artifact. In our early days, the same property helped us discover issues in Stategraph's PoC.
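The oracle property is easy to demonstrate with a toy. The sketch below copies only the declared inputs into a temporary directory and runs a task there; run_sandboxed and concat_task are hypothetical names, and the task is a plain Python function rather than a real terraform/tofu invocation.

```python
import shutil
import tempfile
from pathlib import Path


def run_sandboxed(workspace, declared_inputs, task):
    """Copy only the declared inputs into a fresh directory and run the task there."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp)
        for rel in declared_inputs:
            dest = sandbox / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(Path(workspace) / rel, dest)
        return task(sandbox)


def concat_task(sandbox):
    # Stand-in task that needs two files to succeed.
    main = (sandbox / "main.tf").read_text()
    variables = (sandbox / "variables.tf").read_text()
    return main + variables
```

If variables.tf is left out of declared_inputs, concat_task raises FileNotFoundError inside the sandbox, which is exactly the loud failure you want: the missing edge in the dependency graph shows up immediately instead of as a stale result later.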
Bazel and Stategraph: distributed caches
Bazel is a distributed build system in which different actors share build artifacts through various caches. There are caches local to a machine and there are remote caches. Remote caches are typically filled up both by individual developers and by CI.
Stategraph's model is also distributed, just a little simpler: different developers interact with a shared server, which holds the dependency graph, the most recent committed state (i.e. the deployed infra), and all transactions in flight (developers or CI that have run stategraph plan but have not yet finished the transaction with a successful stategraph apply). The server therefore serves as a cache of the dependency graph: when you're planning, you never change the entire graph at once, so your local client retrieves the unchanged part from the server.
So in this case the comparison isn't 1:1, but we're close.
Bazel and Stategraph: remote execution
Bazel allows tasks to be executed on remote machines. This is made possible by ensuring that the inputs of a task are complete. Once you know the exact version of the compiler you are using, all your input source files, and so on, you can delegate compilation to a remote machine, as long as you can provision this compiler automatically.
In Stategraph, remote execution means that calls to terraform/tofu don't have to happen on the machine where you call Stategraph. The machinery for this is already in place in Stategraph's implementation, but finalizing it is still on our backlog. It will let us move intensive tasks to machines that are better suited to performing them, as Bazel does with remote executors.
Parallelism without a global lock
Having the dependency graph means that both Bazel and Stategraph can perform operations in parallel. In Bazel, this manifests as the ability to run tasks in parallel by distributing them to different cores or different remote executors. In Stategraph, this enables lock-free concurrent plans and applies: if different developers (or CI) affect distinct parts of the infrastructure's graph, they can proceed in parallel.
Lock Granularity Principle
A global lock is what you reach for when you don't have a dependency graph. With a graph, you lock at the granularity of what is actually being modified. Disjoint changes proceed in parallel; only the dependent steps serialize.
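A toy model of the principle: instead of one lock per state file, each transaction claims only the resources it touches. ResourceLocks and the transaction names are hypothetical, and this is an in-process illustration of the granularity idea rather than a description of how Stategraph's server coordinates transactions.

```python
import threading


class ResourceLocks:
    """Claim resources per transaction; disjoint claims never block each other."""

    def __init__(self):
        self._owner = {}  # resource -> transaction id
        self._mutex = threading.Lock()  # protects the table itself, not the resources

    def try_acquire(self, txn, resources):
        """All-or-nothing: succeed only if no resource is held by another transaction."""
        with self._mutex:
            if any(self._owner.get(r, txn) != txn for r in resources):
                return False
            for r in resources:
                self._owner[r] = txn
            return True

    def release(self, txn):
        with self._mutex:
            self._owner = {r: t for r, t in self._owner.items() if t != txn}
```

Two transactions touching the database and the CDN both acquire immediately; a third that needs the database is refused until the first one releases, and it does not hold on to anything else in the meantime.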
Drift: an infrastructure-specific problem
In the build system world, produced artifacts are immutable: nobody changes them without going through the build system. In IaC, however, users sometimes change the infrastructure directly, without going through Stategraph. This is the classic drift issue.
Pride of the craft
We're proud to build Stategraph, because we're standing on the shoulders of giants: decades of computer science research in graph theory, and two decades of Bazel development at Google (it started as the internal Blaze project in 2006). We take the solution to a fundamental problem (distributed build systems) and apply it to an industry standard (HCL). In a sense, we're doing boring work; but we're also solving industry problems. It doesn't get much cooler than that.
Conclusion
We leave implementing Stategraph with Bazel as an exercise for the reader. I truly believe one could write an HCL -> Starlark compiler and use Bazel as the execution engine. Build artifacts would be calls to terraform.
Just an idea for you YOLO developers out there! Credit me in the README please.
See the graph in action
Stategraph brings the build-system playbook to Terraform: a fine-grained dependency graph over HCL, hashed inputs, sandboxed execution, a shared graph cache, and lock-free parallelism for disjoint changes.
If you want to dig into how the graph is computed, how transactions work, and what running stategraph plan on the affected subgraph looks like in practice, explore the Stategraph Docs.