Terraform, Ansible, and the Automation That Runs Everything

This is the story of taking a homelab that was 100% manually configured and turning it into something where every change is a git commit, every deployment is a GitHub Actions run, and I never SSH into a box to make a “quick fix” again. It took a lot of hours, a lot of broken credential chains, and one memorable incident where I leaked secrets because of echo output. But it works now, and it works well.

The Before Times#

Before this project, my lab was really just a physical timeline of my career in infrastructure. A FortiGate 60E from when I got deep into network security. A Cisco 3560 from my early networking days. A SuperMicro FreeNAS box that was the first “real” storage I built. A Dell R710 running Docker Compose, handed down from a datacenter decom. A Dell R610 on Proxmox, same story. A SuperMicro chassis with a 1070 GPU that started as a deep learning experiment and became a Plex transcoder. Ruckus APs picked up from a vendor relationship. A couple of Raspberry Pis (model 1 and 2, ancient) from when those were the exciting thing. An Eaton UPS holding it all together. Every piece came from a different era, a different job, a different vendor relationship or surplus sale. The lab was a geological record of everywhere I’d been as an infrastructure architect.

Some of the Docker configs were in git, and a few things were on Tailscale. None of the infrastructure was automated.

I knew how to do all of this properly. I’d been doing IaC and automation professionally for years. But the lab had always been the place where “I’ll clean this up later” won every time. The cobbler’s children had no shoes. I decided to stop half-assing it and bring the whole thing into 2025 the way I’d build it for a client. The goal list was ambitious:

Manage the entire infrastructure with Ansible and Terraform (firewall, hypervisors, Cloudflare, maybe even the ancient Cisco switch)
Add prod and dev environments, both on-prem and in AWS
Deploy a Kubernetes cluster and migrate workloads to it
Monitor everything with the LGTM stack
Use AWS Secrets Manager for secrets, S3 for Terraform state, GitHub Actions for all automation

The Chicken and Egg Problem#

The first real challenge: you can’t automate infrastructure deployment without a CI runner, and you can’t provision a CI runner without infrastructure. Classic bootstrap problem.

The solution was a one-time manual setup. I created a VM template in Proxmox with provisioner credentials, launched a VM from it, and wrote an Ansible playbook to configure it as a GitHub Actions self-hosted runner. The runner needs to execute jobs for two separate repos (ansible-heezy and terraform-heezy), so it needs registration tokens for both. GitHub runner registration requires an authenticated admin to fetch the token, so I generated a personal access token, stored it in AWS Secrets Manager, and wrote the Ansible role to fetch it at runtime.
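As a rough illustration of the token step, here is a minimal Python sketch of the GitHub REST call that returns a runner registration token. It is not the Ansible role itself. The owner name `example-owner` and the `<PAT>` placeholder are assumptions; in the real setup the personal access token comes from AWS Secrets Manager.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def registration_token_request(owner: str, repo: str, pat: str) -> urllib.request.Request:
    """Build (but do not send) the request for a runner registration token."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
        },
    )


def fetch_registration_token(request: urllib.request.Request) -> str:
    """Send the request and return the short-lived token from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]


if __name__ == "__main__":
    # "example-owner" and "<PAT>" are placeholders, not values from the post.
    for repo in ("ansible-heezy", "terraform-heezy"):
        req = registration_token_request("example-owner", repo, pat="<PAT>")
        print(req.get_method(), req.full_url)
```

The runner needs one token per repo, which is why the loop covers both ansible-heezy and terraform-heezy.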

That runner at 10.x.x.x is the linchpin of the entire operation. It’s the one thing that was manually bootstrapped. Everything else flows from it.

Terraform: The Structure#

The terraform-heezy repo manages all infrastructure that can be expressed as resources: FortiGate firewall policies, Proxmox VMs, DHCP configuration, AWS resources, and DNS.

Environment Layout#

terraform-heezy/
├── environments/
│   ├── shared/
│   │   ├── heezy/    # FortiGate interfaces, zones, shared firewall rules, DHCP
│   │   └── aws/      # Shared AWS resources, MCP readonly user
│   ├── production/
│   │   ├── heezy/    # Production VMs, firewall policies, VIP NAT, SNMP
│   │   └── aws/      # Production AWS (ECR, OIDC, IAM)
│   ├── dev/
│   │   ├── heezy/    # Dev VMs, dev firewall rules, dev DHCP
│   │   └── aws/      # Dev AWS resources
│   └── dmz/
│       └── heezy/    # DMZ VMs, DMZ firewall objects, DMZ policies
└── shared/
    └── modules/
        └── proxmox-vm/  # Reusable VM module with Ansible trigger

Each environment directory is an independent Terraform workspace with its own state file in S3. No workspace sharing, no cross-environment state references. If production needs to know about a shared resource, it references it by convention (known IP, known name), not by Terraform remote state. This keeps things simple and avoids the dependency hell that comes with shared state.

The split between heezy/ and aws/ within each environment separates on-prem resources (FortiGate, Proxmox) from cloud resources (ECR, IAM, OIDC). Different providers, different credentials, different blast radius.

The Proxmox VM Module#

Every VM in the lab is created through a shared module that handles cloning from a template, setting CPU/memory/disk, configuring the network VLAN, and then triggering Ansible:

module "dmz_minecraft_java" {
  source = "../../../shared/modules/proxmox-vm"
  
  vm_name       = "dmz-minecraft-java"
  target_node   = "proxmox"
  proxmox_vm_id = 105  # ubuntu-2024-vm-template
  vm_cores      = 4
  vm_memory     = 8192
  vm_disk_size  = 150
  vm_vlan_id    = 3    # DMZ VLAN
  
  ansible_playbooks = "baseline"
}

The magic is in the provisioner block. After the VM is created and gets a DHCP IP, the module uses a local-exec provisioner to fetch a GitHub token from AWS Secrets Manager and trigger the terraform-triggered.yml workflow in ansible-heezy. It passes the new VM’s IP and the playbooks to run. So Terraform creates the VM, and Ansible configures it, all in one push.
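The trigger itself boils down to one GitHub API call. The sketch below shows the shape of a workflow_dispatch request in Python; the actual module does this from a local-exec provisioner, so treat this as an illustration only. The owner name and the input names `target_ip` and `playbooks` are assumptions, since the post doesn't show the workflow's real inputs.

```python
import json
import urllib.request


def dispatch_request(owner, repo, workflow_file, token, vm_ip, playbooks, ref="main"):
    """Build (but do not send) a workflow_dispatch request for an Ansible run."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = {
        "ref": ref,
        # Hypothetical input names; they must match the workflow's declared inputs.
        "inputs": {"target_ip": vm_ip, "playbooks": playbooks},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = dispatch_request(
        "example-owner", "ansible-heezy", "terraform-triggered.yml",
        token="<token from Secrets Manager>", vm_ip="10.0.0.50", playbooks="baseline",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request with `urllib.request.urlopen(req)` would queue the run; GitHub answers a successful dispatch with an empty response.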

The FortiGate Provider#

Managing a FortiGate with Terraform is… an experience. The fortios provider works, but it has sharp edges:

Always use zone names (DMZ, SHARED, USERS), never raw interface names. If zones are configured, interface names cause HTTP 500 errors.
Let the FortiGate auto-assign policy IDs. Explicit IDs collide with internal numbering and produce cryptic errors.
Cross-zone policies need nat = "enable" even for RFC1918-to-RFC1918 traffic.
Use inspection_mode = "flow" unless you specifically need proxy inspection.
Import existing resources before managing them. The FortiGate has a lot of default objects that Terraform doesn’t know about.

I started by importing everything that already existed on the firewall. Interfaces, zones, DHCP servers, existing policies. Then I could manage them going forward without Terraform trying to recreate things that were already there.

One lesson learned the hard way: renaming firewall zones in Terraform is a terrible idea. Every policy that references the zone breaks. Settle on the zone names you want before any policy references them.

What Terraform Manages#

Shared environment:

FortiGate interfaces and zones (SHARED, USERS, DMZ, PROD)
DHCP servers for all VLANs (DNS pointing at dnsmasq, MetalLB VIP reservation)
Shared firewall objects (address objects, service objects)
Cross-zone firewall policies (USERS to SHARED DNS, PROD to SHARED DNS)

Production environment:

Production VMs on Proxmox
Production firewall policies and VIP NAT rules
SNMP configuration for FortiGate monitoring
AWS ECR repositories, OIDC federation, IAM roles

DMZ environment:

Game server VMs (Minecraft, CS 1.6)
DMZ DHCP reservations (so IPs don’t shuffle on reboot)
DMZ firewall address objects and VIP NAT for inbound access
DMZ-specific firewall policies (runner SSH access, inbound game traffic)

Auto-Generated Workflows#

Both repos use scripts that auto-generate GitHub Actions workflow files. This was born out of frustration with inconsistent triggers and combined workflow runs.

Terraform Workflows#

scripts/generate-workflows.sh generates one workflow per workspace:

terraform-shared-heezy-execution.yml
terraform-production-heezy-execution.yml
terraform-production-aws-execution.yml
terraform-dev-heezy-execution.yml
terraform-dmz-heezy-execution.yml
Plus an all-workspaces workflow for manual full runs

Each workflow triggers on pushes to its specific environment directory. Change a file in environments/production/heezy/ and only the production-heezy workflow runs. No cross-contamination.
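To make the path-to-workflow mapping concrete, here is a small Python sketch. The real generator is a shell script and GitHub does the path filtering itself, so this only models the naming rule implied by the file names above. How changes under shared/modules/ are handled isn't covered in the post, so this sketch simply ignores them.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional


def workflow_for_path(changed_file: str) -> Optional[str]:
    """Map a changed file to the workflow that owns its workspace, if any."""
    parts = PurePosixPath(changed_file).parts
    # Expect environments/<environment>/<target>/<file...>
    if len(parts) < 4 or parts[0] != "environments":
        return None
    environment, target = parts[1], parts[2]
    return f"terraform-{environment}-{target}-execution.yml"


def workflows_for_push(changed_files: Iterable[str]) -> List[str]:
    """Return the distinct workflows a push would touch, sorted by name."""
    names = {workflow_for_path(path) for path in changed_files}
    names.discard(None)
    return sorted(names)


if __name__ == "__main__":
    push = [
        "environments/production/heezy/vms.tf",
        "environments/production/heezy/policies.tf",
        "README.md",
    ]
    print(workflows_for_push(push))
```

Two files in the same workspace collapse into a single workflow run, and files outside environments/ match nothing.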

The workflow pattern: fetch runner credentials from AWS Secrets Manager on a GitHub-hosted runner via OIDC, pass them to the self-hosted runner, assume the Terraform backend role, run plan, then apply on main branch merges.

Ansible Workflows#

scripts/generate-workflows.py is smarter. It reads every playbook YAML, parses out which roles each playbook uses, and generates path-based triggers that include the playbook file, its role directories, and the inventory. Change a file in roles/dnsmasq/, and only the dnsmasq playbook workflow triggers.
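A minimal sketch of that role-to-trigger logic, assuming the playbook has already been parsed into Python data (a list of plays, as a YAML loader would return). This is not the actual generator: the `playbooks/` and `inventory/` paths are assumptions, and roles pulled in through include_role tasks are not detected here.

```python
from typing import Any, Dict, List


def roles_in_playbook(plays: List[Dict[str, Any]]) -> List[str]:
    """Collect role names from each play's `roles:` list, in first-seen order."""
    roles: List[str] = []
    for play in plays:
        for entry in play.get("roles") or []:
            # A role entry is either a bare name or a mapping with a `role` key.
            name = entry if isinstance(entry, str) else entry.get("role") or entry.get("name")
            if name and name not in roles:
                roles.append(name)
    return roles


def trigger_paths(playbook_file: str, plays: List[Dict[str, Any]],
                  inventory_dir: str = "inventory") -> List[str]:
    """Build the `paths:` filter for one playbook's workflow."""
    paths = [playbook_file]
    paths += [f"roles/{role}/**" for role in roles_in_playbook(plays)]
    paths.append(f"{inventory_dir}/**")
    return paths


if __name__ == "__main__":
    plays = [{"hosts": "dns", "roles": ["baseline", {"role": "dnsmasq"}]}]
    for path in trigger_paths("playbooks/dnsmasq.yml", plays):
        print(path)
```

The resulting list is what would go under `on.push.paths` in the generated workflow file.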

It also generates deterministic cron schedules (hashed from the playbook name) so every playbook runs weekly on Sunday at a different hour. And it cleans up orphaned workflow files when playbooks are deleted.
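The hashed schedule can be sketched in a few lines of Python. The choice of SHA-256 and the fixed minute of 0 are assumptions, not details from the real script. Python's built-in `hash()` is avoided because it is randomized per process for strings, which would make the schedule change on every run.

```python
import hashlib


def weekly_cron(playbook_name: str) -> str:
    """Return a cron expression for Sunday at an hour derived from the name."""
    digest = hashlib.sha256(playbook_name.encode("utf-8")).digest()
    hour = int.from_bytes(digest[:4], "big") % 24
    # Fields: minute hour day-of-month month day-of-week (0 = Sunday)
    return f"0 {hour} * * 0"


if __name__ == "__main__":
    for name in ("baseline", "dnsmasq", "lgtm"):
        print(name, "->", weekly_cron(name))
```

The same name always maps to the same hour, so regenerating workflows doesn't churn the schedules. A plain hash like this can still put two playbooks in the same hour; the post doesn't say how the real script deals with collisions.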

Every time you add, remove, or rename a playbook, you run the generator and commit the updated workflow files alongside your changes.

Ansible: Containerized Execution#

Ansible doesn’t run on the self-hosted runner directly. It runs inside a Docker container that’s built and pushed to ECR as part of the workflow.

The container image has everything baked in: Ansible, the playbooks, roles, inventory, and all dependencies. The workflow pulls this image on the self-hosted runner and executes the playbook inside it. This means the runner itself stays clean, and the Ansible environment is reproducible.
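As a hedged sketch of what "executes the playbook inside it" can look like, the snippet below assembles a docker run command in Python. The image URI, playbook path, inventory path, and the use of `--limit` are all assumptions for illustration; the real workflow's invocation isn't shown in the post.

```python
import shlex
from typing import List, Optional

# Temporary credentials from the assumed role, passed through by name only.
PASSTHROUGH_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def ansible_container_command(image: str, playbook: str,
                              inventory: str = "inventory",
                              limit: Optional[str] = None) -> List[str]:
    """Build the argv that runs one playbook inside the Ansible image."""
    command = ["docker", "run", "--rm"]
    for name in PASSTHROUGH_ENV:
        # `-e NAME` without a value copies NAME from the caller's environment,
        # so the secret never appears on the command line.
        command += ["-e", name]
    command += [image, "ansible-playbook", "-i", inventory, playbook]
    if limit:
        command += ["--limit", limit]
    return command


if __name__ == "__main__":
    argv = ansible_container_command(
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/ansible-heezy:latest",  # hypothetical
        "playbooks/baseline.yml",
        limit="10.0.0.50",
    )
    print(shlex.join(argv))
```

Building the command as a list and handing it to `subprocess.run(argv, check=True)` avoids shell quoting problems.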

The credential chain for Ansible execution:

GitHub-hosted runner uses OIDC to assume the GitHubActions-MultiRepo role
Fetches static runner AWS keys from Secrets Manager
Passes keys to the self-hosted runner
Self-hosted runner assumes the backend role using those keys
Ansible container runs with the assumed role credentials
Ansible fetches service-specific secrets (SSH keys, API tokens) from Secrets Manager at runtime

What Ansible Manages#

Every host in the lab has a baseline role that handles OS config, Docker installation, AWS CLI, common tools, and hostname setup. On top of that:

github-runner: Self-hosted runner registration for both repos
micro-k8s: MicroK8s cluster setup, addon enablement, node joining
lgtm: Full monitoring stack (Grafana, Prometheus, Loki, Tempo, Mimir)
dnsmasq: Split-horizon DNS with auto-generated host entries
promtail: Log shipping agents on all hosts
mcp-access: SSH access for MCP tooling
Game servers: Minecraft Bedrock, Minecraft Java, CS 1.6 (Docker Compose on DMZ VMs)
docker-compose-updater: Weekly auto-update cron for all Docker Compose services
tailscale: VPN mesh for remote access

The key rule: never run Ansible locally, never SSH in and make manual changes. Edit the role, commit, push, let GitHub Actions handle it. The runner at 10.x.x.x has SSH access to every host in the lab. If you need to run a playbook manually, use gh workflow run.

The Credential Dance#

Getting credentials to flow securely through this pipeline was the hardest part of the whole project. The self-hosted runner doesn’t use OIDC in this setup (the OIDC exchange happens on the GitHub-hosted runner), so it needs static AWS keys. But those keys have almost no permissions of their own: all they can do is assume the backend role, which holds the actual permissions.
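For a sense of what "can only assume the backend role" means in IAM terms, here is a sketch that builds such a policy document in Python. The role ARN and the Sid are made up for the example. The backend role's trust policy would also have to allow the runner's IAM user as a principal.

```python
import json


def assume_only_policy(backend_role_arn: str) -> dict:
    """IAM policy that grants nothing except assuming one specific role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AssumeBackendRoleOnly",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": backend_role_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account ID and role name.
    policy = assume_only_policy("arn:aws:iam::123456789012:role/terraform-backend")
    print(json.dumps(policy, indent=2))
```

If the static keys leak, the attacker still needs to know the role and pass its trust policy before they can do anything useful.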

The flow:

OIDC on GitHub-hosted runner assumes the multi-repo role
Multi-repo role reads static runner keys from Secrets Manager
Static keys are passed as job outputs (masked in logs)
Self-hosted runner uses static keys to assume the backend role
Backend role credentials are used for actual work

I leaked credentials early on because of echo output in the workflow. GitHub’s ::add-mask:: workflow command is critical. Every credential value gets masked before it’s used anywhere. Running set +x before credential operations turns off bash command tracing, so commands that contain secrets aren’t echoed to the log.
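A minimal sketch of the masking step, written in Python rather than the workflow's bash. It prints one `::add-mask::` command per value to stdout, which is where the runner picks up workflow commands. Masking multi-line values line by line is an assumption here, based on each workflow command occupying a single line.

```python
import sys
from typing import Iterable, List


def mask_commands(values: Iterable[str]) -> List[str]:
    """Return one ::add-mask:: workflow command per non-empty line of each value."""
    commands: List[str] = []
    for value in values:
        for line in str(value).splitlines():
            if line.strip():
                commands.append(f"::add-mask::{line}")
    return commands


def mask(values: Iterable[str], stream=None) -> None:
    """Emit the mask commands; call this before the values are used anywhere."""
    out = stream if stream is not None else sys.stdout
    for command in mask_commands(values):
        print(command, file=out)


if __name__ == "__main__":
    # Placeholder values; real ones would come from Secrets Manager.
    mask(["AKIAEXAMPLEKEYID", "example-secret-access-key"])
```

The order matters: a value that is logged before its mask command runs is already in the log in plain text.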

Discord Notifications#

Both repos use a shared _discord-notify.yml reusable workflow that posts to Discord on every workflow completion. Success or failure, I get a notification. When you’re running 15+ workflows across two repos, you need to know when something breaks without staring at the Actions tab.
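The notification itself is a single POST to a Discord webhook. Here is a sketch of the payload in Python; the message format, the User-Agent string, and the example URLs are assumptions, and the real reusable workflow may build its message differently.

```python
import json
import urllib.request


def discord_message(workflow: str, repo: str, conclusion: str, run_url: str) -> dict:
    """Build a plain-text webhook payload describing one workflow run."""
    status = "[OK]" if conclusion == "success" else "[FAILED]"
    return {"content": f"{status} {repo} / {workflow}: {conclusion}\n{run_url}"}


def webhook_request(webhook_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Discord webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Explicit User-Agent, in case the default urllib one is rejected.
            "User-Agent": "homelab-notify/0.1",
        },
    )


if __name__ == "__main__":
    payload = discord_message(
        "dnsmasq", "ansible-heezy", "failure", "https://example.invalid/runs/1"
    )
    req = webhook_request("https://example.invalid/webhook", payload)
    print(req.get_method(), req.data.decode("utf-8"))
```

Posting on both success and failure, as the workflow does, means a missing message is itself a signal that something never ran.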

What I’d Tell Someone Starting This#

It’s okay to start manual. The first runner was hand-built. But then I wrote the Ansible role that provisions runners, so now the automation can rebuild the thing that runs the automation. It’s like the opposite of the snake eating its own tail: instead of consuming itself, it births a replacement. The manual bootstrap is a one-time sin that gets absolved the moment the runner can recreate itself.
Separate your Terraform workspaces aggressively. One workspace per environment per provider. The blast radius of a bad apply should be as small as possible.
Auto-generate your workflows. Writing them by hand is error-prone and they drift. A script that reads your playbooks/workspaces and generates workflows is worth the upfront investment.
Containerize your Ansible. Running Ansible directly on the runner leads to dependency hell. Bake everything into a container image.
Mask everything. If it’s a credential, mask it. If it might be a credential, mask it. If you’re not sure, mask it.
Provision the FortiGate from scratch if you can. Importing existing config into Terraform is always more painful than starting net-new. I still have some leftover manual config that was imported early on, and those resources are the ones that cause the most drift and the most surprises. If you’re starting fresh, define everything in Terraform from day one. Use zone names, not interface names. Let it auto-assign policy IDs. Expect to fight it on cross-zone NAT.
Terraform provisioners for Ansible triggers work great. VM creation automatically triggers configuration. One push, full stack.