Lab Modernization: From Manual Configs to Full Automation
This is the journal of taking a homelab that was held together with SSH sessions and good intentions and turning it into something that manages itself. It took months of evenings and weekends, a lot of broken things, and more hours than I want to admit. But the lab went from “I hope nobody touches that server” to “push to main and walk away.”
These are the raw notes, cleaned up just enough to be readable.
The Starting Point (September 2024)
The lab inventory when this started:
- FortiGate 60E (manually configured)
- Cisco 3560 switch (manually configured, ancient)
- SuperMicro FreeNAS (manually configured)
- Dell R710 bare metal, running 2 Minecraft Docker Compose containers
- Dell R610 on Proxmox, running personal website and other services via Cloudflare ZTNA + Docker Compose
- SuperMicro chassis with a 1070 GPU, running Plex and Steam Headless via Docker Compose
- Ruckus R600 APs
- A few Raspberry Pis (model 1 and 2, from the early days)
- Eaton UPS
The entire lab was manually configured. Some Docker configs were in git and a few things were on Tailscale, but there was no automation and no state management.
I had years of professional experience with all of these technologies: Ansible, Terraform, Kubernetes, CI/CD, and cloud infrastructure. This wasn’t a learning project. It was applying what I already knew from work to my own environment, which made the rollout straightforward once I committed the time to it.
The Objectives
- Manage the entire infrastructure with Ansible and Terraform:
  - FortiGate firewall (policies, VIPs, DHCP, address objects)
  - Proxmox hypervisors (VM provisioning, templates)
  - Cloudflare (DNS, tunnels, ZTNA)
  - Maybe even the ancient Cisco switch
- Deploy a Kubernetes cluster, migrate ZTNA workloads to it (originally thinking Talos + Flux)
- Monitor and alert with the LGTM stack
- Use AWS for all the ancillary infrastructure:
  - Secrets Manager for all secrets (FortiGate, Proxmox, GitHub tokens, VPN credentials)
  - S3 for Terraform state backend
- GitHub Actions for all deployment (no manual applies, no SSH-and-fix)
- AWS as the extension point for anything that outgrows the lab
- IAM Identity Center (SSO) for auth, managed by a CDK-deployed account that handles CI roles and identity federation
- All automated with GitHub Actions
Day 1: Prerequisites (~10 hours)
Started with the AWS integration. I already had an AWS account managed by CDK for some freelance apps. Needed to deploy:
- IAM resources for GitHub OIDC assumption, including on-prem runner keys
- The Terraform backend role (assumable by OIDC and by the private runner keys)
- Logging and trails
- Terraform state bucket
Then hit the chicken-and-egg problem. Can’t deploy automatically without a private runner. Can’t provision a runner without deploying something. Had to do some manual work for the first-time setup.
Created a VM template in Proxmox with provisioner credentials. Launched a VM manually. Wrote the Ansible playbook to configure it as a GitHub Actions runner. The runner needs tokens for two repos (Ansible and Terraform), so Ansible fetches unique registration tokens using a personal access token stored in AWS Secrets Manager.
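The registration-token step can be sketched in Python. This is not the Ansible task from the playbook, only an illustration of the GitHub REST call it would need to make. The owner, repo, and token values are placeholders, and reading the personal access token out of Secrets Manager is assumed to happen before this code runs.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def registration_token_request(owner: str, repo: str, pat: str) -> urllib.request.Request:
    """Build the POST request that asks GitHub for a runner registration token."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def fetch_registration_token(owner: str, repo: str, pat: str) -> str:
    """Call the API and return the short-lived registration token."""
    with urllib.request.urlopen(registration_token_request(owner, repo, pat)) as resp:
        return json.load(resp)["token"]
```

Each repo gets its own call, so a runner serving both the Ansible and the Terraform repo would call `fetch_registration_token` twice with different `repo` values.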
Got the runner up and ready to execute playbooks and Terraform applies.
Day 2: Runners and Workflows (~16 hours)
Secrets placed in AWS, now needed workflows to use them. The self-hosted runner doesn’t have OIDC credentials like GitHub-hosted runners do, so it needs static AWS keys that can assume the backend role.
The credential flow:
- Fetch self-hosted runner AWS access keys from Secrets Manager
- Inject fetched keys into the workflow workspace
- Use injected keys to assume the backend role
- Execute playbook/terraform apply using that role
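The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual workflow code: `client_factory` stands in for `boto3.client`, the secret is assumed to be a JSON document with `access_key_id` and `secret_access_key` fields, and the session name is made up.

```python
import json


def assume_backend_role(client_factory, secret_id, role_arn, session_name="lab-deploy"):
    """Fetch the runner's static keys, then trade them for short-lived role credentials."""
    # Steps 1-2: read the runner's access keys from Secrets Manager.
    secrets = client_factory("secretsmanager")
    keys = json.loads(secrets.get_secret_value(SecretId=secret_id)["SecretString"])

    # Step 3: use those keys to assume the backend role through STS.
    sts = client_factory(
        "sts",
        aws_access_key_id=keys["access_key_id"],          # assumed field name
        aws_secret_access_key=keys["secret_access_key"],  # assumed field name
    )
    creds = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)["Credentials"]

    # Step 4: hand back environment variables for the playbook or terraform process.
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
```

In real use this would be called as `assume_backend_role(boto3.client, ...)`. Passing the factory in as a parameter keeps the function testable without touching AWS.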
Spent a lot of time experimenting with matrix jobs and the safest way to pass credentials between jobs. Leaked credentials because of echo output. Fun times. Added ::add-mask:: everywhere and set +x before any credential operations.
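For reference, `::add-mask::` is a GitHub Actions workflow command: printing it to stdout tells the runner to redact that value in later log output, while `set +x` turns off shell command tracing. A small sketch of a masking helper follows. The function names are hypothetical, and it masks line by line because a multi-line secret needs each line registered separately.

```python
import sys


def mask_commands(secret: str) -> list[str]:
    """Return one ::add-mask:: workflow command per non-empty line of the secret."""
    return [f"::add-mask::{line}" for line in secret.splitlines() if line.strip()]


def register_masks(secret: str, stream=sys.stdout) -> None:
    """Emit the mask commands so the runner redacts the secret in later log lines."""
    for command in mask_commands(secret):
        print(command, file=stream)
```

Calling `register_masks` immediately after fetching a credential, and before anything else can print it, is the point of the exercise.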
Started testing Terraform with the FortiGate provider. FortiManager is terrible, so managing the firewall as code was a high priority. Reorganized the repo structure. Finished the Ansible role for GitHub runners. Added secrets for Proxmox and FortiGate providers. Added Discord notifications for workflow results.
After adding credentials, verified I could launch a VM from the Proxmox template via Terraform. Then figured out the Ansible trigger: using a local-exec provisioner in Terraform to call the GitHub API and trigger a workflow with the new VM’s IP and playbooks as inputs. Brand new host, stood up by Terraform, configured by Ansible, all in one push.
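A sketch of the kind of script a local-exec provisioner could call is shown below. It uses the GitHub `workflow_dispatch` REST endpoint. The input names `host_ip` and `playbooks`, the comma-joined playbook list, and the workflow file name are assumptions for illustration, and the target workflow must declare matching `workflow_dispatch` inputs.

```python
import json
import urllib.request


def dispatch_request(owner, repo, workflow_file, ref, host_ip, playbooks, token):
    """Build the POST request that triggers a workflow_dispatch run."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = {
        "ref": ref,
        # Dispatch inputs are strings, so the playbook list is joined (assumed format).
        "inputs": {"host_ip": host_ip, "playbooks": ",".join(playbooks)},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger(owner, repo, workflow_file, ref, host_ip, playbooks, token) -> int:
    """Send the request and return the HTTP status (204 on success)."""
    req = dispatch_request(owner, repo, workflow_file, ref, host_ip, playbooks, token)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```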
Days 3-4: Refactoring and Network (~9 hours)
Decided to refactor the project after initial provisioning. The flat directory structure at the root wasn’t going to scale. Reorganized into the environment-based layout (shared/production/dev/dmz, each with heezy/ and aws/ subdirectories).
Redeployed the runner, which broke all the automatic provisioning. Had to do a bunch of manual recovery work.
Spent 5 hours getting the rest of the network up with DHCP. Added zones, policies (NAT and security), DHCP servers, DHCP zones. Got bridging working on a new runner that was cut off from GitHub, which meant executing things manually.
Then the leaked credential fix broke the credential passing. Four more hours of troubleshooting.
Day 5: Firewall Imports and Planning
Importing more existing resources into Terraform for the firewall. Decided the original “SERVERS” interface needed to be renamed. Started identifying the naming convention for everything going forward:
- VLAN 1 (native): SHARED
- VLAN 2: USERS
- VLAN 3: DMZ
- VLAN 1000: PROD
- VLAN 2000: DEV
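Kept as data, the convention above becomes a single lookup table that scripts can share. A minimal sketch, where the helper name is hypothetical:

```python
# VLAN ID to zone name, copied from the naming convention above.
VLAN_ZONES = {1: "SHARED", 2: "USERS", 3: "DMZ", 1000: "PROD", 2000: "DEV"}


def zone_for_vlan(vlan_id: int) -> str:
    """Return the zone name for a VLAN, failing loudly on anything off-convention."""
    try:
        return VLAN_ZONES[vlan_id]
    except KeyError:
        raise ValueError(f"VLAN {vlan_id} has no zone in the naming convention") from None
```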
Decided that before Kubernetes, I needed to work on Cloudflare and migrations. The R710 DMZ host was bare metal with 1TB of local storage, no NFS. It was hosting Minecraft servers, and files were backed up by another server on Proxmox that was considered “legacy.”
The plan:
- Migrate DMZ servers, install Proxmox on the R710 for DMZ workloads
- Add a DMZ VLAN on the switch, plumb it to the physical interface
- Deploy new VMs on Proxmox, migrate Minecraft into VMs managed by Ansible
- Rename and document the network architecture
- Start provisioning for Cloudflare and ZTNA workloads with OAuth
- Set up a blog using markdown (looked at Ghost CMS initially)
Day 6: Renaming Zones is a Bad Idea
Tried to rename firewall zones in Terraform. Every policy that referenced the old zone name broke. Lesson learned: start fresh with the naming you want. Don’t rename in place.
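One way to gauge the blast radius before touching a zone name is to list every policy that references it. The sketch below works on plain dictionaries. The `srczone` and `dstzone` keys and the sample policies are made up for illustration and are not the FortiGate provider's schema.

```python
def policies_referencing(policies, zone):
    """Return the names of policies whose source or destination zone matches."""
    return [p["name"] for p in policies if zone in (p["srczone"], p["dstzone"])]


# Hypothetical policy data, for illustration only.
example_policies = [
    {"name": "users-to-servers", "srczone": "USERS", "dstzone": "SERVERS"},
    {"name": "servers-to-wan", "srczone": "SERVERS", "dstzone": "WAN"},
    {"name": "users-to-wan", "srczone": "USERS", "dstzone": "WAN"},
]

affected = policies_referencing(example_policies, "SERVERS")
print(affected)  # ['users-to-servers', 'servers-to-wan']
```

Every name in that list is a policy that would need to change in the same apply as the rename.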
Ordered hardware for the k8s cluster: 1TB NVMe M.2 drives and 32GB memory upgrades for each Lenovo ThinkCentre.
Started working on PXE boot for automated OS installation on the new nodes.
Day 7: Hardware Arrives (~2 hours)
NVMe drives arrived. Memory still in transit.
Updated the workflow triggering. Detection wasn’t working well, getting inconsistent and combined triggers. Built scripts to auto-generate workflows based on playbooks and workspaces. Now jobs are generated and triggered independently.
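A generator along these lines can be sketched in a few lines of Python. This is an assumed shape, not the actual scripts: the `playbooks/` directory, the `ansible-<name>.yml` file naming, and the workflow contents are all placeholders. The idea is one workflow per playbook, each with its own path filter, so a change to one playbook triggers only its own job.

```python
from pathlib import Path
from string import Template

# Assumed minimal workflow body; a real one would also handle credentials.
WORKFLOW_TEMPLATE = Template("""\
name: ansible-$name
on:
  push:
    branches: [main]
    paths:
      - "$playbook"
  workflow_dispatch:
jobs:
  run:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: ansible-playbook $playbook
""")


def generate_workflows(repo_root: Path,
                       playbook_dir: str = "playbooks",
                       output_dir: str = ".github/workflows") -> list[Path]:
    """Write one workflow file per playbook and return the paths written."""
    out = repo_root / output_dir
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for playbook in sorted((repo_root / playbook_dir).glob("*.yml")):
        # Path filters are relative to the repository root.
        rel = playbook.relative_to(repo_root).as_posix()
        target = out / f"ansible-{playbook.stem}.yml"
        target.write_text(WORKFLOW_TEMPLATE.substitute(name=playbook.stem, playbook=rel))
        written.append(target)
    return written
```

The same pattern would extend to Terraform workspaces by globbing workspace directories instead of playbook files.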
Discovered that a single self-hosted runner executes only one job at a time, so multiple workflows queue up and run sequentially. Needed to figure out parallelization, or just beef up the runner to handle the load.
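The effect of that queueing is easy to estimate. The sketch below assigns each queued job to whichever runner frees up first. The job durations are invented for illustration, not measurements from the lab.

```python
import heapq


def makespan(job_minutes, runners):
    """Wall-clock minutes to drain the queue when each runner takes one job at a time."""
    if runners < 1:
        raise ValueError("need at least one runner")
    free_at = [0] * runners  # min-heap of the time each runner next becomes free
    for minutes in job_minutes:
        start = heapq.heappop(free_at)  # earliest available runner picks up the job
        heapq.heappush(free_at, start + minutes)
    return max(free_at)


jobs = [5, 3, 8, 2]       # illustrative durations in minutes
print(makespan(jobs, 1))  # 18
print(makespan(jobs, 2))  # 11
```

A faster runner shortens each job, but only additional runners shorten the queue itself.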
Day 8: PXE Boot and Pivoting
Couldn’t get PXE boot to work with the ThinkCentres. Between BIOS quirks and network boot configuration, it was eating too much time.
Also explored Omni (Sidero Labs’ managed Talos platform) as an alternative to bare-metal Talos installation. Set up Google OAuth secrets and Cloudflare integration in AWS Secrets Manager for it.
Eventually abandoned both PXE boot and Talos. Installed Ubuntu 24.04 manually on all five nodes via USB, went with MicroK8s, and never looked back. Sometimes the pragmatic choice is the right one.
What Came After
The rest of the story is covered in the other blog posts:
- The Kubernetes cluster build
- The Calico BPF incident
- The DNS saga
- Terraform and Ansible automation
- SWAG, Cloudflare, and the blog
The lab went from a collection of manually configured boxes to a fully automated environment where every change is a git commit. It took months. It broke constantly during the transition. But now I push to main and walk away, and that’s worth every hour I put into it.
Time Tracking
For anyone wondering what a project like this actually costs in time:
| Phase | Hours |
|---|---|
| Day 1: AWS prerequisites, runner bootstrap | ~10 |
| Day 2: Workflows, credentials, Terraform testing | ~16 |
| Days 3-4: Refactoring, networking, DHCP | ~9 |
| Day 5: Firewall imports, planning | ~4 |
| Day 6: Zone renaming disaster, PXE boot | ~6 |
| Day 7: Workflow generation, hardware | ~2 |
| Day 8: PXE boot, Talos, pivoting | ~4 |
| K8s cluster build and migration | ~20 |
| DNS, monitoring, ongoing refinement | ~15 |
| Total (rough estimate) | ~86 |
And that’s just the tracked time. The actual number is probably higher. Homelabs are a hobby, and hobbies don’t have timesheets.
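The total in the table can be checked by summing the per-phase estimates:

```python
# Hours per phase, copied from the table above.
PHASE_HOURS = {
    "Day 1": 10,
    "Day 2": 16,
    "Days 3-4": 9,
    "Day 5": 4,
    "Day 6": 6,
    "Day 7": 2,
    "Day 8": 4,
    "K8s cluster build and migration": 20,
    "DNS, monitoring, ongoing refinement": 15,
}

total_hours = sum(PHASE_HOURS.values())
print(total_hours)  # 86
```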