Terraform: Declarative Infrastructure Provisioning
Learn Terraform from the ground up—state management, providers, modules, and production-ready patterns for managing cloud infrastructure as code.
Terraform is one of the most widely used tools for declarative infrastructure provisioning. You write configuration files that describe what infrastructure you want, and Terraform figures out how to make it happen. No manual console clicks, no forgotten steps when recreating environments. Your infrastructure becomes code: reviewable, testable, and repeatable.
The core idea is straightforward: declare what you want, and Terraform handles the rest. Whether you’re spinning up a single EC2 instance or orchestrating a multi-cloud Kubernetes cluster, Terraform manages the full lifecycle with a consistent workflow.
Introduction
Terraform is an infrastructure-as-code tool that defines cloud resources in configuration files you can version, reuse, and share. Instead of clicking through a cloud console to provision servers, databases, or networks, you write code that describes the desired state of your infrastructure. Terraform then determines what actions are needed to reach that state and executes them in the right order. This approach turns infrastructure into something you can test, code review, and reproduce across environments.
The workflow is straightforward: write configuration, run terraform plan to preview changes, then run terraform apply to execute them. Terraform tracks every resource it manages in a state file, which acts as the source of truth for what currently exists. This state file is compared against your configuration on every plan and apply operation, so Terraform always knows what needs to change. The workflow stays consistent whether you’re deploying to a single AWS account or managing resources across multiple cloud providers.
Providers are plugins that let Terraform interact with cloud platforms and external APIs. The AWS provider handles EC2 instances, VPCs, and RDS databases. The Azure provider works with AKS clusters and Blob Storage. The GCP provider manages GKE clusters and Cloud Storage buckets. Partner and community providers add support for GitHub, Datadog, Cloudflare, and hundreds of other services. Each provider exposes resource types that map to real infrastructure objects, and you can mix resources from any number of providers in the same configuration. That flexibility makes Terraform useful for multi-cloud architectures.
Terraform Basics and HCL Syntax
HashiCorp Configuration Language (HCL) is Terraform’s domain-specific language. It is designed to be human-readable while being machine-parseable. The syntax uses blocks, attributes, and expressions.
# Define the required providers
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Configure the AWS provider
provider "aws" {
  region = "us-west-2"
}

# Create an S3 bucket
resource "aws_s3_bucket" "app_bucket" {
  bucket = "my-unique-app-bucket-${var.environment}"

  tags = {
    Name        = "App bucket"
    Environment = var.environment
  }
}

# Define a variable
variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "dev"
}
The block type (terraform, provider, resource, variable) declares what kind of object you are defining. Attributes assign values to specific properties. Variables and outputs let you parameterize your configurations and expose values from them. The terraform block configures Terraform itself, including provider version constraints.
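As a small illustration of variables and outputs working together, the sketch below adds a validation rule to the environment variable and exposes the bucket's ARN. It assumes it lives in the same configuration as the example above and replaces the earlier variable declaration. The output name app_bucket_arn and the list of allowed environments are illustrative choices, not part of the original setup.

```hcl
# Replaces the earlier "environment" declaration: same variable, plus validation.
variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "dev"

  validation {
    # Reject values outside the (assumed) set of environments at plan time.
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment value must be one of: dev, staging, production."
  }
}

# Expose the bucket ARN so callers and scripts can read it after apply.
output "app_bucket_arn" {
  description = "ARN of the application bucket"
  value       = aws_s3_bucket.app_bucket.arn
}
```

After an apply, terraform output app_bucket_arn prints the value, and a parent module could read it the same way it reads any other module output.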
You run terraform init to initialize the working directory, downloading the required providers. Then terraform plan shows you what Terraform intends to do, and terraform apply executes those changes. terraform destroy tears everything down when you no longer need it.
State Management and Backends
Terraform uses state to track the real-world resources it manages. The state file is a JSON snapshot of all managed resources and their current attributes. Every time you run terraform plan or apply, Terraform compares your configuration against the state and determines what changes are necessary.
By default, Terraform stores state locally in a file named terraform.tfstate. This works for solo development, but it breaks down in team environments. Local state files get overwritten, cause merge conflicts, and leak secrets if committed to version control.
Remote backends solve these problems. They store state in a shared location, typically an S3 bucket, a Google Cloud Storage bucket, or HCP Terraform (formerly Terraform Cloud), and most of them support state locking to prevent concurrent modifications.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/us-east-1/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
The dynamodb_table setting enables state locking, which prevents two team members from running terraform apply simultaneously. If one person holds the lock, the other gets an error message rather than a corrupted state file.
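The bucket and lock table named in the backend block have to exist before terraform init can use them. A minimal sketch of creating them, typically from a separate bootstrap configuration, might look like the following. The resource labels are hypothetical; the one firm requirement from the S3 backend is that the DynamoDB table has a string partition key named LockID.

```hcl
# Bucket that will hold the state objects.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"
}

# Versioning keeps earlier state revisions recoverable.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: the S3 backend expects a partition key called "LockID" of type string.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources in their own small configuration avoids the chicken-and-egg problem of a backend that stores the state of its own bucket.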
State file security matters. State stores resource attributes in plain text, including sensitive values such as database passwords or API keys. Marking a variable or output with sensitive = true only redacts the value in CLI output; it is still written to state. Always enable encryption at rest, restrict access with IAM policies, and never commit state files to version control.
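To make the role of sensitive = true concrete, here is a minimal sketch with a hypothetical db_password variable. The flag keeps the value out of plan and apply output, but anyone who can read the state file can still read the password, which is why the backend protections above matter.

```hcl
# Hypothetical input: supplied via TF_VAR_db_password or a secrets manager,
# never hard-coded in the configuration.
variable "db_password" {
  description = "Master password for the database"
  type        = string
  sensitive   = true
}

# An output that passes on a sensitive value must itself be marked sensitive.
# The CLI shows it as <sensitive>, but the value is still stored in state.
output "db_password" {
  value     = var.db_password
  sensitive = true
}
```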
Providers and Resource Types
Providers are plugins that Terraform uses to interact with cloud platforms, SaaS services, and other APIs. HashiCorp maintains official providers for AWS, Azure, GCP, and Kubernetes. Partners and the community contribute hundreds more for services like Datadog, GitHub, Stripe, and Cloudflare.
Each provider exposes resource types that map to infrastructure objects. The AWS provider includes resources like aws_instance, aws_vpc, aws_db_instance, and aws_iam_role. You can mix resources from multiple providers in the same configuration to define your entire stack.
# Create a VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "Main VPC"
  }
}

# Create a subnet
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
  # Must be an availability zone in the provider's region (us-west-2 above)
  availability_zone       = "us-west-2a"
  map_public_ip_on_launch = false

  tags = {
    Name = "Private Subnet"
    Type = "Private"
  }
}

# Create an IAM role for EC2
resource "aws_iam_role" "ec2_role" {
  name = "ec2-app-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}
Provider configuration happens in the provider block. You can declare several configurations for the same provider by giving each additional one an alias, then select it per resource or module with the provider meta-argument. This is useful when managing resources across multiple AWS regions or accounts.
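Multiple configurations of one provider are expressed with the alias meta-argument. The sketch below, with a hypothetical replica bucket, shows a default AWS configuration in one region and an aliased one in another. The alias name east and the bucket name are illustrative.

```hcl
# Default configuration: used by any resource that does not name a provider.
provider "aws" {
  region = "us-west-2"
}

# Additional configuration for a second region, selected by alias.
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

# This bucket is created through the aliased configuration.
resource "aws_s3_bucket" "replica" {
  provider = aws.east
  bucket   = "my-unique-app-bucket-replica"
}
```

The same pattern extends to multiple accounts by giving each aliased block its own credentials or assume_role settings.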
Writing Reusable Modules
Modules are containers for related resources. They let you package infrastructure patterns and reuse them across projects. A well-designed module accepts input variables and returns output values, abstracting away the implementation details.
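# modules/networking/main.tf
variable "environment" {
  description = "Environment name used in tags"
  type        = string
}

variable "cidr_block" {
  description = "CIDR block for the VPC"
  type        = string
}

# Availability zones in the provider's region (this module needs at least three)
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = var.environment
  }
}

# One public subnet per availability zone, each a /24 carved from a /16 VPC
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
    Type = "Public"
  }
}

output "vpc_id" {
  value = aws_vpc.main.id
}

output "subnet_ids" {
  value = aws_subnet.public[*].id
}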
Using a module looks like calling a function:
module "networking" {
  source      = "./modules/networking"
  environment = "production"
  cidr_block  = "10.1.0.0/16"
}

# Use the module outputs
module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.networking.vpc_id
  subnet_ids = module.networking.subnet_ids
  # ... other arguments
}
Modules promote consistency. Instead of copying and pasting resource definitions, teams maintain a library of modules that encode best practices. When someone fixes a security issue in a module, every project using that module gets the fix by updating the version.
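Getting a fix "by updating the version" only works when callers pin a version in the first place. One way to do that for a module kept in Git is a ref in the source address, sketched below. The repository URL and tag are hypothetical placeholders.

```hcl
module "networking" {
  # Hypothetical repository; "//networking" selects a subdirectory and
  # "?ref=" pins the module to a specific tag.
  source = "git::https://example.com/infra/terraform-modules.git//networking?ref=v1.4.0"

  environment = "production"
  cidr_block  = "10.1.0.0/16"
}
```

Picking up a fix then becomes a one-line change to the ref, reviewed like any other code change. Modules published to a registry use a separate version argument instead of a ref.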
Workspace Strategies
Terraform workspaces let you manage multiple environments from the same configuration. Each workspace maintains its own state file, so you can use one set of configuration files to provision dev, staging, and production environments.
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    # Backend blocks cannot use interpolation such as ${terraform.workspace}.
    # The S3 backend already stores each non-default workspace under
    # <workspace_key_prefix>/<workspace>/<key>.
    key                  = "terraform.tfstate"
    workspace_key_prefix = "environments"
    region               = "us-east-1"
    encrypt              = true
    dynamodb_table       = "terraform-locks"
  }
}

resource "aws_instance" "server" {
  instance_type = terraform.workspace == "production" ? "t3.large" : "t3.micro"
  ami           = var.ami_id

  tags = {
    Name = "Server-${terraform.workspace}"
  }
}
The terraform.workspace variable returns the current workspace name. You can use it in expressions to vary resource configuration per environment, like using larger instance types in production or enabling extra logging there.
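When more than two environments are involved, chained conditionals get hard to read. A common alternative, sketched here with assumed workspace names and instance sizes, is a map keyed by workspace with a fallback entry.

```hcl
locals {
  # Hypothetical sizing table; keys are workspace names.
  instance_types = {
    default    = "t3.micro"
    staging    = "t3.small"
    production = "t3.large"
  }

  # Fall back to the "default" entry for any workspace not listed above.
  instance_type = lookup(local.instance_types, terraform.workspace, local.instance_types["default"])
}
```

A resource can then set instance_type = local.instance_type, and adding an environment means adding one line to the map.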
Some teams prefer directory-based environments instead, keeping separate folders for each environment with a shared module library. This approach makes environment boundaries clearer and prevents accidental cross-environment changes. The tradeoff is maintaining consistency across duplicate directory structures.
Workspaces work well for environments that differ mostly in sizing and scaling, while directory-based approaches suit environments with structural differences.
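For comparison, a directory-based layout gives each environment its own root configuration and backend key while sharing modules. The sketch below assumes a hypothetical environments/production directory next to the modules directory used earlier.

```hcl
# environments/production/main.tf (hypothetical layout)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# Shared module, production-specific inputs.
module "networking" {
  source      = "../../modules/networking"
  environment = "production"
  cidr_block  = "10.1.0.0/16"
}
```

A staging directory would hold the same file with its own key and inputs, which is where the duplication trade-off mentioned above comes from.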
Tofu and OpenTofu Ecosystem
After HashiCorp switched Terraform to the Business Source License (BSL) in 2023, part of the open-source community rallied around OpenTofu, a Linux Foundation project that maintains an open-source fork of Terraform. OpenTofu was forked from the last MPL-licensed Terraform code and is designed as a drop-in replacement, so existing configurations and providers generally work unchanged.
# Install OpenTofu
curl --proto '=https' --tlsv1.2 -fsSL https://get.opentofu.org/install.sh | sh
# Initialize with OpenTofu instead of Terraform
tofu init
tofu plan
tofu apply
OpenTofu promises to remain open-source under the Linux Foundation governance, removing the licensing uncertainty that HashiCorp introduced. The project has already added features like enhanced templating and improved state encryption that were missing from the last open-source Terraform release.
If you are starting a new project, OpenTofu is worth considering for its community-driven development model. Existing Terraform users can migrate gradually since both tools share the same configuration syntax and state format. The OpenTofu website provides migration guides for users concerned about the licensing direction.
For teams already using Terraform with HCP Terraform (formerly Terraform Cloud) or Terraform Enterprise, the change may not affect your daily workflow. But monitoring the ecosystem for provider compatibility and feature parity between Terraform and OpenTofu makes sense as the fork matures.
For more on managing infrastructure costs, check out our post on Cost Optimization.
When to Use / When Not to Use
When Terraform makes sense
Terraform is the right tool when you need to manage infrastructure across multiple cloud providers from a single workflow. If you are running AWS, GCP, and Azure—or even just multiple AWS accounts—Terraform’s provider model gives you a consistent interface for all of it.
Use Terraform when your infrastructure team and application team are separate. Terraform modules let you hide implementation details behind interfaces. The app team gets a VPC module without needing to understand CIDR blocks or routing tables. That separation of concerns matters at scale.
Terraform is also the obvious choice when compliance requires documented, version-controlled infrastructure. If an auditor needs to see what existed in production on a given date, Git history plus Terraform state gives you that. Manual console clicks do not.
When to use something else
If your infrastructure is mostly ephemeral—serverless functions, containerized workloads managed by Kubernetes—Terraform adds overhead without much value. The cloud provider’s own tooling may be faster and more integrated.
If your team consists of software engineers who do not want to learn HCL, Pulumi or AWS CDK let them manage infrastructure in languages they already know.
For simple, stable infrastructure that rarely changes, the investment in Terraform (state management, module maintenance, workflow overhead) may not pay off. Sometimes a CloudFormation template or even manual provisioning is fine.
Terraform Architecture Flow
flowchart TD
A[Config Files .tf] --> B[terraform init]
B --> C[Provider Plugins Downloaded]
C --> D[terraform plan]
D --> E[Execution Plan Reviewed]
E --> F[terraform apply]
F --> G[Real Infrastructure and State File Updated]
G --> D
G --> H[terraform destroy]
H --> I[Resources Torn Down]
Production Failure Scenarios
Common Terraform Failures
| Failure | Impact | Mitigation |
|---|---|---|
| State drift | Actual infrastructure diverges from Terraform’s view | Run terraform plan regularly; reconcile with terraform apply -refresh-only, and import resources created outside Terraform |
| Lock timeout | Team member blocked from applying changes | Check for a hung apply, retry with -lock-timeout, and force-unlock only once the holder is confirmed dead |
| Plan/apply mismatch | Apply acts on a stale view after manual changes outside Terraform | Enforce policy: all changes via Terraform, no console edits; apply a saved plan file |
| Provider version conflict | Resources fail to create or update | Pin provider versions, test upgrades in staging first |
| Sensitive data in state | Passwords and keys stored in plain text | Use encrypted backends, suppress sensitive outputs, use vault provider |
| Circular dependencies | plan or apply fails with a Cycle error | Break cycles by extracting shared resources |
| Destroying shared resources | Production resource deleted accidentally | Use protection flags, review destroy plans carefully |
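The circular-dependency row is easiest to see with two security groups that need to reference each other. If each group declared its rule inline, Terraform would report a cycle. The sketch below, which assumes an existing VPC passed in as a hypothetical vpc_id variable and PostgreSQL's port 5432, breaks the cycle by moving the rules into their own resources.

```hcl
variable "vpc_id" {
  description = "VPC that holds both security groups (assumed to exist)"
  type        = string
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id
}

# The rules live outside the group blocks, so both groups can be created
# before any rule that mentions the other group.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}

resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
}
```

The dependency graph now runs from groups to rules in one direction only, so there is nothing left to form a cycle.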
State Lock Timeout Recovery
flowchart TD
A[terraform apply blocked] --> B{Lock holder still running?}
B -->|Yes| C[Wait for lock holder]
B -->|No| D[Check lock metadata]
D --> E{Valid lock?}
E -->|Yes| F[Force unlock after timeout]
E -->|No| G[Remove stale lock manually]
F --> H[Re-run apply]
G --> H
Observability Hooks
Track Terraform operations to catch problems early and maintain audit trails.
What to monitor:
- Plan duration and size (large plans mean many pending changes)
- Apply success/failure rate per workspace
- State file size growth over time
- Resource drift detection frequency
- Lock contention events
# Enable Terraform logging
export TF_LOG=TRACE
export TF_LOG_PATH=terraform.log
# Release a stuck state lock (LOCK_ID comes from the lock error message)
terraform force-unlock LOCK_ID
# Audit state changes
terraform state list | wc -l # Count resources
terraform state pull > state-backup.tfstate
# Detect drift and other pending changes
terraform plan -detailed-exitcode
# Exit codes: 0 = no changes, 1 = error, 2 = changes present (including drift)
Common Pitfalls / Anti-Patterns
Storing state locally
Local state files work for learning, but in teams they cause merge conflicts, secret leaks, and no locking. Always use a remote backend for anything beyond solo experiments.
Ignoring destroy plans
Running terraform destroy without reviewing the plan first can take down production. A misplaced filter in your targeting can delete resources across all environments. Always read the destroy plan before confirming.
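Reviewing the plan is the main safeguard, but a lifecycle rule can act as a backstop for resources that must never be removed by accident. A minimal sketch, using a hypothetical shared bucket:

```hcl
resource "aws_s3_bucket" "shared_assets" {
  bucket = "my-shared-assets-bucket"

  lifecycle {
    # Any plan that would destroy this resource fails with an error
    # instead of proceeding.
    prevent_destroy = true
  }
}
```

The guard only applies while the resource block and its lifecycle setting remain in the configuration. Deleting the block removes the protection along with it, so this complements plan review rather than replacing it.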
Using terraform apply without terraform plan
Skipping plan means you approve changes without knowing what they are. CI pipelines that auto-apply without plan review are a recipe for surprises.
Not using version pinning
Provider version drift causes “it worked yesterday” problems. Always pin versions in the required_providers block and test upgrades in staging before production.
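As a sketch of what pinning can look like, the block below constrains both the Terraform CLI and the AWS provider. The specific version numbers are illustrative assumptions; choose ones you have actually tested.

```hcl
terraform {
  # Refuse to run on CLI versions outside this range.
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # "~> 5.40" allows 5.40 and later 5.x releases, but not 6.0.
      version = "~> 5.40"
    }
  }
}
```

Committing the .terraform.lock.hcl file that terraform init generates goes one step further by recording the exact provider version and checksums every teammate and CI run will use.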
Treating modules as macros
Modules that just wrap single resources without adding meaningful abstraction are cargo cult IaC. A module for an EC2 instance that does nothing but pass through arguments adds indirection without value.
Trade-off Analysis
| Aspect | Terraform | Pulumi | CDK for Terraform (CDKTF) |
|---|---|---|---|
| Language | HCL (DSL) | TypeScript, Python, Go, C# | TypeScript, Python, Go, C# |
| State management | Self-managed or HCP Terraform | Pulumi Cloud or self-hosted | Same as Terraform (self-managed or HCP Terraform) |
| Provider ecosystem | Largest (thousands of registry providers) | Large (200+) | Reuses Terraform providers |
| Learning curve | Moderate (HCL) | Steeper (programming) | Steep (CDK concepts) |
| Testing | terraform test (built in), Terratest (external) | Native unit tests | Native unit tests |
| Debugging | Limited tooling | Full IDE support | Full IDE support |
| Governance | HashiCorp (BSL), OpenTofu fork (Linux Foundation) | Pulumi (open-source engine, commercial cloud) | HashiCorp (open core) |
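On the testing row: recent Terraform releases (1.6 and later) include a terraform test command that reads test files written in HCL. The sketch below assumes the S3 bucket example from earlier in this post and a hypothetical file name. It checks the planned bucket name without creating anything, although the AWS provider still needs valid credentials to plan.

```hcl
# tests/bucket.tftest.hcl (hypothetical file name)

# Inputs applied to every run block in this file.
variables {
  environment = "staging"
}

run "bucket_name_includes_environment" {
  # Evaluate a plan only; no resources are created.
  command = plan

  assert {
    condition     = aws_s3_bucket.app_bucket.bucket == "my-unique-app-bucket-staging"
    error_message = "Bucket name should end with the environment name."
  }
}
```

Running terraform test from the configuration's root directory picks up files ending in .tftest.hcl and reports each run block as passed or failed.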
Interview Questions
What is Terraform state, and why does it need protection?
Terraform uses state to track the real-world resources it manages. The state file is a JSON snapshot of all managed resources and their current attributes. Every terraform plan/apply compares configuration against state to determine changes.
State must be protected - it can contain sensitive values like passwords. Without state, Terraform does not know what resources it manages, so destroying state means losing track of infrastructure.
What problems does local state cause for teams?
Local state files get overwritten causing merge conflicts. There is no locking - two team members running apply simultaneously corrupts state. Secrets can be leaked if the state file is committed to version control. There is no state versioning or audit trail for who made changes.
Remote backends with state locking are mandatory for collaboration. They prevent concurrent modifications and provide a single source of truth accessible to the whole team.
How does state locking work with the S3 backend?
The S3 backend supports state locking via a DynamoDB table. The dynamodb_table setting in the backend configuration enables locking. When one person holds the lock, others get an error message instead of being able to apply changes simultaneously. This prevents corrupted state from simultaneous operations.
What is the difference between a provider and a resource?
Providers are plugins that Terraform uses to interact with APIs (AWS, Azure, GCP, etc.). Resources are the actual infrastructure objects that providers expose - like aws_instance, aws_vpc, or aws_db_instance. One provider can expose many resource types.
You configure a provider with a provider block, then use resources from that provider in your configuration. The provider handles the API communication; resources define the actual infrastructure.
What are modules, and why do teams use them?
Modules package related resources into reusable components. Well-designed modules accept input variables and return output values, hiding implementation details behind interfaces.
Teams maintain a library of modules encoding best practices. When someone fixes a security issue in a module, all projects using that module get the fix by updating the version. This means consistent security hardening across all infrastructure.
What are workspaces, and when are they a good fit?
Terraform workspaces let you manage multiple environments from the same configuration. Each workspace maintains its own state file, so you can use one configuration to provision dev, staging, and production.
The terraform.workspace variable returns the current workspace name. Use it in expressions to vary resource configuration per environment. Workspaces work well when environments differ mostly in sizing and scaling. Directory-based approaches suit environments with structural differences.
What is OpenTofu, and when should you consider it?
OpenTofu is a Linux Foundation project, forked from Terraform after HashiCorp switched to BSL licensing. It is fully backward-compatible with existing Terraform configurations.
The project promises to remain open-source under community governance. It has already added features like enhanced templating and improved state encryption. Consider it for new projects if open-source governance matters. Existing users can migrate gradually since both tools share the same syntax and state format.
How do you recover from a stuck state lock?
First check whether the process holding the lock is still running; if it is, wait for it to finish. Check the lock metadata shown in the error message to identify who holds the lock and when it was acquired. If the holder has crashed or hung, release the lock with terraform force-unlock LOCK_ID. If that fails because the lock entry is stale or malformed, remove it manually from DynamoDB. Then re-run the apply that was blocked.
What are the risks of terraform destroy?
Running terraform destroy without reviewing the plan first can take down production. A misplaced filter in your targeting can delete resources across all environments. Destroying shared resources accidentally is a risk - use protection flags and review destroy plans carefully.
Always read the destroy plan before confirming. Use targeted destroy with -target flag when you only need to delete specific resources. Be especially careful with state that references resources managed outside Terraform.
Why should you pin provider versions?
Provider version drift causes "it worked yesterday" problems. New provider versions may introduce breaking changes. Always pin versions in the required_providers block and test upgrades in staging before production. Unpinned versions can silently update and change behavior in ways that break your infrastructure.
What does the core Terraform workflow look like?
Write configuration files (.tf files) with HCL, then run terraform init to initialize the working directory and download providers. Run terraform plan to preview intended changes, then terraform apply to execute them and update state. State file tracks actual resources - plan compares config vs state. terraform destroy tears down resources when no longer needed.
When would you choose directory-based environments over workspaces?
Directory-based approaches suit environments with structural differences. Use when dev, staging, and production have different resource compositions. Workspace-based is better when environments only differ in sizing and scaling.
The directory approach makes boundaries clearer and prevents accidental cross-environment changes. The tradeoff is maintaining consistency across duplicate directory structures.
What is state drift, and how do you detect and fix it?
State drift: actual infrastructure diverges from Terraform's view. Causes include manual changes outside Terraform, failed applies, and provider bugs.
Detection: run terraform plan -detailed-exitcode regularly; exit code 2 means changes are pending, which includes drift. Mitigation: reconcile state with terraform apply -refresh-only, import resources that were created outside Terraform, and enforce a policy that all changes go through Terraform with no console edits.
State can contain sensitive values like database passwords in plain text, whether or not they are marked sensitive = true. Always enable encryption at rest on the state backend. Restrict access with IAM policies on S3/DynamoDB. Never commit state files to version control. Mark sensitive variables and outputs with sensitive = true so they are redacted in CLI output. Consider using the HashiCorp Vault provider for secret management.
How does Terraform compare with Pulumi?
Terraform uses HCL (a domain-specific language) which is easier for ops engineers. Pulumi uses real programming languages (TypeScript, Python, Go, C#) which enables loops, conditionals, and functions - more expressiveness than HCL.
Terraform has a larger ecosystem with more providers (thousands in the public registry). Pulumi offers native unit tests with standard frameworks but has a steeper learning curve for non-programmers.
What are common module anti-patterns?
Modules that just wrap single resources without adding meaningful abstraction are cargo cult IaC. A module for an EC2 instance that just passes through arguments adds indirection without value. Over-abstracting with inheritance chains makes debugging harder.
Modules should encode best practices and hide complexity. Start concrete, and abstract only when repetition demands it.
What causes a plan/apply mismatch?
Manual changes outside Terraform (console edits) cause plan to differ from apply. Apply attempts to reconcile state but Terraform's view is outdated. Resources can be orphaned or incorrectly modified, leaving state inconsistent with actual infrastructure.
Prevention: enforce policy that all changes go through Terraform.
How does Terraform handle circular dependencies?
Circular dependency: Resource A depends on B, B depends on C, C depends on A. Terraform detects the cycle while building the dependency graph and fails with a Cycle error instead of applying anything.
Resolution: break cycles by extracting shared resources to a separate module. Use data sources to break dependency chains. Refactor configuration to eliminate circular references.
What should you monitor for Terraform operations?
Track plan duration and size - large plans mean many pending changes. Monitor apply success/failure rate per workspace. Track state file size growth over time. Monitor drift detection frequency and lock contention events.
Use TF_LOG=TRACE and TF_LOG_PATH for detailed logs when troubleshooting.
When is Terraform not the right tool?
Terraform is not ideal when infrastructure is mostly ephemeral (serverless, containers managed by K8s). The cloud provider's own tooling may be faster and more integrated there.
If your team consists of software engineers who do not want to learn HCL, Pulumi or CDK let them manage infrastructure in languages they already know. For simple, stable infrastructure that rarely changes, the investment in Terraform (state management, module maintenance) may not pay off.
Further Reading
- Terraform Documentation - Official HashiCorp docs
- OpenTofu Documentation - OpenTofu community docs
- Terraform Provider Registry - Browse available providers
- Terratest - Testing framework for Terraform
- Terraform Module Registry - Pre-built modules
- State Backend Comparison - Backend options guide
Conclusion
Key Takeaways
- Terraform uses HCL for declarative infrastructure definition with state management
- Remote backends with state locking are mandatory for team environments
- Modules encode best practices and enable separation of concerns between teams
- OpenTofu offers a community-driven open-source alternative to Terraform
- Always run plan before apply; never skip the review step
Terraform Health Checklist
# Verify Terraform is initialized
terraform init
# Check for configuration drift
terraform plan -detailed-exitcode
# List all managed resources
terraform state list
# Validate configuration syntax
terraform validate
# Review destroy plan before execution
terraform plan -destroy
# List workspaces in the current backend
terraform workspace list
# Back up the current state
terraform state pull > backup.tfstate