Split DNS, MetalLB, and the dnsmasq Debugging Saga
This is the story of building split-horizon DNS for a homelab with four VLANs, a FortiGate firewall, a 5-node MicroK8s cluster, and a Cloudflare tunnel, and then spending hours debugging why dnsmasq wouldn’t answer queries despite the port being open, the firewall allowing traffic, and the container running fine. The entire implementation and debugging session was driven through MCP tooling.
Contents#
- The Starting Point
- The DNS Architecture
- MetalLB: Solving the Round-Robin Problem
- The DHCP Cutover
- The Debugging Saga
- The Successful Cutover
- Everything That Was Deployed
- Lessons Learned
- Final State
The Starting Point#
The homelab had grown to 20+ services across multiple network zones, but DNS was held together with duct tape:
- No internal domain. Accessing services meant remembering 10.x.x.x:30533 instead of <service>.internal
- Cloudflare hairpinning. LAN clients hitting navidrome.yourdomain.tld went out to Cloudflare and back in, adding latency and a dependency on the internet for local services
- Half-baked dnsmasq. A container existed at 10.x.x.x with entries for 7 of 20+ services
- No failover. DNS entries round-robined across 5 k8s nodes, so if one died, 20% of queries hit a dead IP and clients waited 30+ seconds for a timeout
The goal: proper split-horizon DNS where heezy.local resolves internal services, yourdomain.tld resolves to the cluster directly (skipping Cloudflare), and everything else forwards upstream.
The DNS Architecture#
Two Domains, Two Strategies#
heezy.local: the internal-only domain. Every server, every k8s service, every piece of infrastructure gets a name here. dnsmasq is authoritative, so queries for heezy.local never leave the network.
Three categories of records:
- Infrastructure hosts: auto-generated from Ansible inventory. <service>.internal, <service>.internal, <service>.internal. One IP per host.
- K8s services: round-robin across all 5 nebula nodes. <service>.internal returns all 5 IPs because any node can serve any NodePort via kube-proxy.
- Infra services: single-host entries for non-k8s services. <service>.internal → 10.x.x.x (the LGTM monitoring server).
yourdomain.tld: the public domain, managed by Cloudflare externally. But internally, dnsmasq overrides it so LAN clients go directly to the cluster instead of hairpinning through Cloudflare.
Only the SWAG-proxied subdomains get overrides. Every public-facing service points at the same VIP:
- plex.yourdomain.tld → 10.x.x.x (MetalLB VIP → SWAG → ClusterIP)
- navidrome.yourdomain.tld → 10.x.x.x
- aurral.yourdomain.tld → 10.x.x.x
- nebula-plex.yourdomain.tld → 10.x.x.x
- (plus several other self-hosted services)
These all point at a single MetalLB VIP (10.x.x.x) which is assigned to the SWAG LoadBalancer service. SWAG terminates TLS and reverse-proxies to the k8s ClusterIP services internally. If the node holding the VIP dies, MetalLB moves it to another node in ~2 seconds.
Everything else forwards to 1.1.1.1 and 8.8.8.8. dnsmasq doesn’t try to answer queries it doesn’t own.
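The three-way split described above can be sketched as a small decision function. This is only a model of the behavior, not how dnsmasq is implemented. The record names and addresses in LOCAL_RECORDS are hypothetical placeholders, and the sketch assumes both domains are marked local-only (as the local= lines in the final config suggest), so a name under either domain with no record gets an empty answer instead of being forwarded.

```python
from typing import Dict, List, Tuple

INTERNAL_DOMAIN = "heezy.local"      # internal-only domain from the article
PUBLIC_DOMAIN = "yourdomain.tld"     # public domain placeholder from the article
UPSTREAMS = ["1.1.1.1", "8.8.8.8"]   # upstream resolvers from the article

# Hypothetical records; names and addresses are illustrative only.
LOCAL_RECORDS: Dict[str, List[str]] = {
    "grafana.heezy.local": ["10.0.0.20"],
    "navidrome.yourdomain.tld": ["10.0.0.50"],
}


def in_domain(name: str, domain: str) -> bool:
    """True if name is the domain itself or a subdomain of it."""
    name = name.rstrip(".").lower()
    return name == domain or name.endswith("." + domain)


def resolve(name: str) -> Tuple[str, List[str]]:
    """Return ("local", addresses) or ("forward", upstream resolvers)."""
    key = name.rstrip(".").lower()
    if in_domain(key, INTERNAL_DOMAIN) or in_domain(key, PUBLIC_DOMAIN):
        # Locally owned domain: answer from local records only, never forward.
        return ("local", LOCAL_RECORDS.get(key, []))
    # Not ours: hand the query to the upstream resolvers.
    return ("forward", UPSTREAMS)
```

Calling resolve("example.com") takes the forward branch, while resolve("missing.heezy.local") stays local and returns no addresses.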
The Hostname Strategy#
Every host in the Ansible inventory gets a hostname entry automatically via the hosts.j2 template:
10.x.x.x <service>.internal nebula-1
10.x.x.x <service>.internal nebula-2
10.x.x.x <service>.internal shared-lgtm
10.x.x.x <service>.internal shared-dnsmasq
10.x.x.x <service>.internal dmz-minecraft
The naming convention: <zone>-<purpose> for infrastructure, bare names for k8s nodes. The baseline Ansible role sets each host’s hostname via hostnamectl and configures systemd-resolved to route heezy.local and yourdomain.tld queries to dnsmasq.
K8s services get their own template (k8s-services.j2) with round-robin entries:
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
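To make the round-robin layout concrete, here is a rough Python stand-in for what a template like k8s-services.j2 produces. It is not the author's Jinja2 template. The function name, node addresses, and service names are assumptions for illustration.

```python
from typing import Iterable, List


def round_robin_entries(node_ips: Iterable[str],
                        services: Iterable[str],
                        domain: str) -> List[str]:
    """Emit one hosts-file line per (service, node) pair.

    Listing the same name once per node IP is what makes the resolver
    return every node address for that service.
    """
    ips = list(node_ips)
    lines: List[str] = []
    for service in services:
        for ip in ips:
            lines.append(f"{ip} {service}.{domain}")
    return lines


# Hypothetical inputs: two nodes and one service.
example = round_robin_entries(["10.0.0.1", "10.0.0.2"], ["grafana"], "heezy.local")
```

With five node IPs, each service name expands to five lines, matching the block above.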
Infrastructure services that run on dedicated hosts get single entries (infra-services.j2):
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
DMZ Isolation#
DMZ hosts (game servers on 10.x.x.0/24) never use internal DNS. They’re in an untrusted zone, and allowing DNS queries into the environment would be an attack vector. DMZ DHCP uses the FortiGate’s default DNS (which forwards to public resolvers). No firewall rule exists for DMZ → dnsmasq, and one should never be created.
MetalLB: Solving the Round-Robin Problem#
Round-robin DNS is terrible for reliability. If one of 5 nodes goes down, roughly 20% of connection attempts land on a dead IP first. The client has to wait for a TCP timeout (30+ seconds) before trying the next IP. SSH doesn’t retry at all; it just fails. This defeats the purpose of running a cluster.
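The arithmetic behind that 20% figure is simple enough to sketch. The model below assumes clients pick uniformly among the returned addresses and pay one full timeout when the first pick is dead. Both are simplifications, and the function names are made up for illustration.

```python
def first_try_failure_rate(total_nodes: int, dead_nodes: int) -> float:
    """Fraction of first connection attempts that land on a dead node."""
    return dead_nodes / total_nodes


def expected_extra_wait(total_nodes: int, dead_nodes: int, timeout_s: float) -> float:
    """Average added delay per connection, counting only the first attempt."""
    return dead_nodes * timeout_s / total_nodes


rate = first_try_failure_rate(5, 1)       # one dead node out of five
delay = expected_extra_wait(5, 1, 30.0)   # 30-second TCP timeout
```

Under those assumptions, one dead node out of five costs every connection about six seconds on average, and a full 30 seconds for the unlucky fifth.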
The fix: MetalLB, which was available as a MicroK8s addon but disabled.
microk8s enable metallb:10.x.x.x-10.x.x.x
One IP in the pool. The SWAG LoadBalancer service (which had been stuck in <pending> state for 69 days) immediately grabbed it. Now yourdomain.tld subdomains resolve to a single VIP with instant failover.
The catch: MetalLB VIPs only serve LoadBalancer-type services. NodePorts are not accessible on the VIP. So heezy.local services (which use NodePorts) still round-robin. This is acceptable for internal/admin use since the priority was fixing the public-facing yourdomain.tld resolution.
The DHCP Cutover#
Four VLANs, four DHCP servers, all managed by the FortiGate:
| VLAN | Subnet | Zone | Before | After |
|---|---|---|---|---|
| native | 10.x.x.0/24 | SHARED | FortiGate default | 10.x.x.x + 1.1.1.1 |
| 200 | 10.x.x.0/24 | USERS | FortiGate default | 10.x.x.x + 1.1.1.1 |
| 2000 | 10.x.x.0/24 | PROD | 10.x.x.x + 8.8.8.8 | 10.x.x.x + 1.1.1.1 |
| 3 | 10.x.x.0/24 | DMZ | FortiGate default | Unchanged |
The fallback to 1.1.1.1 is critical. If dnsmasq goes down, clients lose internal names but the internet keeps working.
Firewall Rules for Cross-Zone DNS#
dnsmasq lives on the SHARED VLAN. Clients on SHARED can reach it directly (same zone, no firewall rule needed). But USERS and PROD are different zones:
- Policy 309 (existing, updated): USERS → SHARED dnsmasq, UDP/53 + TCP/53
- Policy 315 (new): PROD → SHARED dnsmasq, UDP/53 + TCP/53
DNS needs both UDP and TCP on port 53, and TCP is easy to forget. Most queries use UDP, but responses too large for a UDP datagram (DNSSEC-signed answers are a common case) come back truncated, and the client retries over TCP.
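A quick way to confirm that both transports pass the firewall is to send the same query over each. The sketch below uses only the Python standard library and builds a minimal A-record query by hand. The function names are illustrative, the server address is whatever dnsmasq host you want to test, and the helpers only check that a reply with a matching query ID comes back, not that the answer is correct.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def frame_for_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with a 2-byte length."""
    return struct.pack("!H", len(message)) + message


def answers_over_udp(server: str, name: str, timeout: float = 2.0) -> bool:
    """True if the server replies over UDP/53 with a matching query ID."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes timeouts
            return False
    return reply[:2] == query[:2]


def answers_over_tcp(server: str, name: str, timeout: float = 2.0) -> bool:
    """True if the server replies over TCP/53 with a matching query ID."""
    query = build_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(frame_for_tcp(query))
            prefix = sock.recv(2)
            if len(prefix) < 2:
                return False
            reply = sock.recv(struct.unpack("!H", prefix)[0])
    except OSError:  # includes refused connections and timeouts
        return False
    return reply[:2] == query[:2]
```

Running answers_over_udp and answers_over_tcp against the dnsmasq address from a client in each zone exercises both halves of the firewall rule.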
The Premature Cutover#
We pushed the DHCP changes before verifying dnsmasq was actually working. The Terraform apply succeeded, the DHCP servers started handing out 10.x.x.x as the DNS server, and… nothing worked. DNS queries to 10.x.x.x timed out from every client.
We immediately reverted the DHCP cutover to stop the bleeding, then spent the next few hours debugging why dnsmasq wouldn’t answer queries.
The Debugging Saga#
Symptom#
nslookup google.com 10.x.x.x timed out from every client: the Mac on USERS VLAN, big-boi on SHARED VLAN, even from the dnsmasq host itself. But docker exec dnsmasq nslookup google.com 127.0.0.1 worked perfectly inside the container.
What We Ruled Out#
- Firewall: FortiGate policy 309 confirmed via API with both UDP/53 and TCP/53. nc -vz 10.x.x.x 53 connected successfully from the Mac.
- Port binding: ss -tlnup showed dnsmasq bound to 0.0.0.0:53 on both TCP and UDP.
- iptables: INPUT chain policy ACCEPT, no rules. No DROP or REJECT anywhere in the chain.
- systemd-resolved: stub listener disabled, not conflicting on port 53.
- Docker bridge networking: initially suspected, switched to network_mode: host. Same problem.
The Docker Bridge Red Herring#
The original setup used Docker bridge networking with port mapping (-p 53:53). We discovered that Docker’s DOCKER-BRIDGE iptables chain had explicit DROP rules that blocked the container’s outbound UDP/53 to upstream resolvers. The container could receive queries but couldn’t forward them.
Switching to network_mode: host fixed the outbound forwarding issue, but queries from outside the container still timed out. This sent us down a rabbit hole investigating PID namespaces, conntrack, and kernel socket routing.
The Real Problem: Two Configuration Options#
After hours of debugging, we bisected the dnsmasq configuration file line by line. Starting from a minimal config that worked:
no-resolv
server=1.1.1.1
server=8.8.8.8
log-queries
We added lines back one at a time until it broke. Two options were the culprits:
listen-address=0.0.0.0: explicitly setting this breaks dnsmasq in Docker host networking mode. Without it, dnsmasq binds to wildcard and works. With it, dnsmasq binds but doesn’t process packets from external sources. The behavior difference is subtle and undocumented. It appears to be related to how dnsmasq enumerates interfaces when listen-address is set vs. when it uses the default wildcard binding.
bogus-priv: this option tells dnsmasq not to forward reverse lookups for private IP ranges. In host networking mode, it also breaks forward query handling. Removing it fixed the issue.
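The add-one-line-at-a-time procedure generalizes to any config file. Here is a minimal sketch, assuming a caller-supplied works callback that applies a trial config and reports whether queries still get answered. In practice that would restart dnsmasq and run a test lookup, which is not shown. The function name and signature are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def find_breaking_lines(base: Sequence[str],
                        candidates: Sequence[str],
                        works: Callable[[List[str]], bool]) -> Tuple[List[str], List[str]]:
    """Add candidate lines to a known-good config one at a time.

    A line that keeps the config working is kept; a line that breaks it is
    recorded as a culprit and left out, so later lines are still tested
    against a working config. That is how independent culprits get found.
    """
    kept = list(base)
    culprits: List[str] = []
    for line in candidates:
        trial = kept + [line]
        if works(trial):
            kept = trial
        else:
            culprits.append(line)
    return kept, culprits
```

Leaving each culprit out before moving on matters here: with two independent breaking options, stopping at the first failure would have found only one of them.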
The Image Problem#
The original jpillora/dnsmasq container also ran without the NET_ADMIN Linux capability (capabilities are granted when the container starts, not baked into the image), which dnsmasq needs to set socket options for proper DNS packet handling. The busybox nslookup inside the container worked because it ran in the same process context, but external queries failed because the listening socket wasn’t configured correctly without NET_ADMIN.
The fix: switch to drpsychick/dnsmasq with cap_add: NET_ADMIN.
The Final Working Configuration#
# docker-compose.yml
services:
  dnsmasq:
    image: drpsychick/dnsmasq
    container_name: dnsmasq
    restart: unless-stopped
    network_mode: host
    cap_add:
      - NET_ADMIN
    volumes:
      - /opt/dnsmasq/config/dnsmasq.conf:/etc/dnsmasq.conf
      - /opt/dnsmasq/hosts:/etc/dnsmasq.d
    command: ["--no-daemon"]
# dnsmasq.conf: note what's NOT here
port=53
domain-needed
# NO bogus-priv
# NO listen-address=0.0.0.0
no-resolv
no-poll
domain=heezy.local
expand-hosts
server=1.1.1.1
server=8.8.8.8
cache-size=1000
local-ttl=30
log-queries
addn-hosts=/etc/dnsmasq.d
local=/heezy.local/
local=/yourdomain.tld/
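Since the two problem options are easy to reintroduce by copying a stock dnsmasq.conf, a small guard can catch them before deployment. This is a sketch, not part of the setup described here. The function name is hypothetical, and it treats "#" as starting a comment, which is enough for configs shaped like the one above.

```python
from typing import List

# The two options that broke external queries in this deployment.
RISKY_OPTIONS = ("bogus-priv", "listen-address")


def find_risky_options(conf_text: str) -> List[str]:
    """Return active (uncommented) config lines that set a risky option."""
    hits: List[str] = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        option = line.split("=", 1)[0].strip()
        if option in RISKY_OPTIONS:
            hits.append(line)
    return hits
```

Run against the config above, it returns an empty list, because the only mentions of the two options are inside comments.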
The Successful Cutover#
With dnsmasq actually responding to queries, we re-applied the DHCP cutover. This time:
$ nslookup google.com 10.x.x.x
Server: 10.x.x.x
Address: 10.x.x.x#53
Non-authoritative answer:
Name: google.com
Address: 172.253.132.113
$ nslookup <service>.internal 10.x.x.x
Server: 10.x.x.x
Address: 10.x.x.x#53
Name: <service>.internal
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x
$ nslookup navidrome.yourdomain.tld 10.x.x.x
Server: 10.x.x.x
Address: 10.x.x.x#53
Name: navidrome.yourdomain.tld
Address: 10.x.x.x
Internal domain, public domain override, and upstream forwarding, all working.
Everything That Was Deployed#
Ansible (ansible-heezy)#
- roles/dnsmasq/templates/dnsmasq.conf.j2: main config, no bogus-priv or listen-address
- roles/dnsmasq/templates/hosts.j2: auto-generated from inventory
- roles/dnsmasq/templates/k8s-services.j2: 17 k8s services, round-robin
- roles/dnsmasq/templates/infra-services.j2: grafana, prometheus, loki, proxmox, fortigate, dnsmasq-ui
- roles/dnsmasq/templates/trentnielsen-overrides.j2: SWAG-proxied subdomains → MetalLB VIP
- roles/dnsmasq/tasks/ubuntu.yml: drpsychick/dnsmasq, host networking, NET_ADMIN
- playbooks/dnsmasq.yml: includes baseline, promtail, mcp-access, dnsmasq roles
Terraform (terraform-heezy)#
- shared/heezy/dhcp.tf: SHARED + USERS DHCP → dnsmasq, MetalLB VIP reservation
- production/heezy/dhcp.tf: PROD DHCP → dnsmasq
- shared/heezy/firewall-objects.tf: TCP/53, TCP/32400 service objects, prod subnet address
- shared/heezy/firewall.tf: policy 309 updated (added TCP/53), policy 315 new (PROD → dnsmasq)
Kubernetes (heezy-k8s)#
- MetalLB enabled: microk8s enable metallb:10.x.x.x-10.x.x.x
- SWAG LoadBalancer service got VIP 10.x.x.x
Lessons Learned#
- listen-address=0.0.0.0 is not the same as omitting it. In Docker host networking mode, explicitly setting listen-address changes how dnsmasq enumerates interfaces and breaks external query handling. Just don’t set it. dnsmasq listens on all interfaces by default.
- bogus-priv breaks more than reverse lookups. In host networking mode with certain dnsmasq versions, it also breaks forward query processing. Remove it unless you specifically need it and have tested it in your exact deployment configuration.
- Docker containers need NET_ADMIN for DNS. dnsmasq uses setsockopt calls that require this capability. Without it, the socket binds but doesn’t process packets correctly. The original jpillora/dnsmasq container ran without it.
- Docker bridge networking blocks outbound UDP/53. The DOCKER-BRIDGE iptables chain has explicit DROP rules for traffic that doesn’t match port-forwarding ACCEPT rules. Containers can receive DNS queries but can’t forward them upstream. Use network_mode: host for DNS servers.
- Always verify DNS works before cutting over DHCP. We pushed the DHCP cutover before confirming dnsmasq was responding, which broke DNS for all clients. Revert fast, debug at leisure.
- Bisect configuration files when debugging. Start with a minimal config that works, add lines back one at a time. We found two independent bugs (listen-address and bogus-priv) that would have been nearly impossible to find by staring at the full config.
- MetalLB VIPs only serve LoadBalancer services. NodePorts are not accessible on the VIP. For single-IP access to NodePort services, you need a reverse proxy (SWAG/nginx) behind the LoadBalancer.
- DMZ hosts should never use internal DNS. It’s an attack vector. Keep DMZ on public resolvers only.
- DNS needs both UDP and TCP on port 53. Don’t forget TCP/53 in firewall rules.
- Split-horizon DNS eliminates Cloudflare hairpinning. LAN clients resolving navidrome.yourdomain.tld now go directly to the cluster VIP instead of out to Cloudflare and back. Lower latency, no internet dependency for local access.
Final State#
| Component | Status |
|---|---|
| dnsmasq (10.x.x.x) | Active: Running, authoritative for heezy.local + yourdomain.tld |
| MetalLB VIP (10.x.x.x) | Active: Assigned to SWAG LoadBalancer |
| yourdomain.tld overrides | Active: Single VIP, instant failover |
| heezy.local k8s services | Partial: Round-robin (acceptable for internal use) |
| SHARED DHCP | Active: Pointing at dnsmasq |
| USERS DHCP | Active: Pointing at dnsmasq |
| PROD DHCP | Active: Pointing at dnsmasq |
| DMZ DHCP | N/A: Unchanged (public DNS only) |
| Firewall rules | Active: USERS + PROD → dnsmasq DNS (UDP+TCP/53) |