Split DNS, MetalLB, and the dnsmasq Debugging Saga

This is the story of building split-horizon DNS for a homelab with four VLANs, a FortiGate firewall, a 5-node MicroK8s cluster, and a Cloudflare tunnel, and then spending hours debugging why dnsmasq wouldn’t answer queries even though the port was open, the firewall allowed the traffic, and the container was running fine. The entire implementation and debugging session was driven through MCP tooling.

The Starting Point#

The homelab had grown to 20+ services across multiple network zones, but DNS was held together with duct tape:

No internal domain. Accessing services meant remembering 10.x.x.x:30533 instead of <service>.internal
Cloudflare hairpinning. LAN clients hitting navidrome.yourdomain.tld went out to Cloudflare and back in, adding latency and a dependency on the internet for local services
Half-baked dnsmasq. A container existed at 10.x.x.x with entries for 7 of 20+ services
No failover. DNS entries round-robined across 5 k8s nodes, so if one died, 20% of queries hit a dead IP and clients waited 30+ seconds for a timeout

The goal: proper split-horizon DNS where heezy.local resolves internal services, yourdomain.tld resolves to the cluster directly (skipping Cloudflare), and everything else forwards upstream.

The DNS Architecture#

Two Domains, Two Strategies#

heezy.local: the internal-only domain. Every server, every k8s service, every piece of infrastructure gets a name here. dnsmasq is authoritative, so queries for heezy.local never leave the network.

Three categories of records:

Infrastructure hosts: auto-generated from Ansible inventory. <service>.internal, <service>.internal, <service>.internal. One IP per host.
K8s services: round-robin across all 5 nebula nodes. <service>.internal returns all 5 IPs because any node can serve any NodePort via kube-proxy.
Infra services: single-host entries for non-k8s services. <service>.internal → 10.x.x.x (the LGTM monitoring server).

yourdomain.tld: the public domain, managed by Cloudflare externally. But internally, dnsmasq overrides it so LAN clients go directly to the cluster instead of hairpinning through Cloudflare.

Only the SWAG-proxied subdomains get overrides. Every public-facing service points at the same VIP:

plex.yourdomain.tld → 10.x.x.x (MetalLB VIP → SWAG → ClusterIP)
navidrome.yourdomain.tld → 10.x.x.x
aurral.yourdomain.tld → 10.x.x.x
nebula-plex.yourdomain.tld → 10.x.x.x
(plus several other self-hosted services)

These all point at a single MetalLB VIP (10.x.x.x) which is assigned to the SWAG LoadBalancer service. SWAG terminates TLS and reverse-proxies to the k8s ClusterIP services internally. If the node holding the VIP dies, MetalLB moves it to another node in ~2 seconds.

Everything else forwards to 1.1.1.1 and 8.8.8.8. dnsmasq doesn’t try to answer queries it doesn’t own.
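
The routing decision described in this section can be sketched as a small classifier. This is only an illustration of the split, not how dnsmasq works internally: the `OVERRIDES` set is a hypothetical subset of the SWAG-proxied names, and names under yourdomain.tld that have no override are flagged separately because their handling depends on the dnsmasq configuration.

```python
INTERNAL_ZONE = "heezy.local"
PUBLIC_ZONE = "yourdomain.tld"
# Illustrative subset only; the real list comes from the overrides template.
OVERRIDES = {"plex.yourdomain.tld", "navidrome.yourdomain.tld"}


def in_zone(name: str, zone: str) -> bool:
    """True if name is the zone itself or a subdomain of it."""
    return name == zone or name.endswith("." + zone)


def classify(name: str) -> str:
    """Return which of the strategies above applies to a query name."""
    name = name.rstrip(".").lower()
    if in_zone(name, INTERNAL_ZONE):
        return "internal"         # answered locally, never leaves the network
    if name in OVERRIDES:
        return "override"         # answered locally with the MetalLB VIP
    if in_zone(name, PUBLIC_ZONE):
        return "public-unlisted"  # no override; behavior depends on the config
    return "forward"              # sent to the upstream resolvers


if __name__ == "__main__":
    for query in ("grafana.heezy.local", "plex.yourdomain.tld", "google.com"):
        print(query, "->", classify(query))
```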

The Hostname Strategy#

Every host in the Ansible inventory gets a hostname entry automatically via the hosts.j2 template:

10.x.x.x <service>.internal nebula-1
10.x.x.x <service>.internal nebula-2
10.x.x.x <service>.internal shared-lgtm
10.x.x.x <service>.internal shared-dnsmasq
10.x.x.x <service>.internal dmz-minecraft

The naming convention: <zone>-<purpose> for infrastructure, bare names for k8s nodes. The baseline Ansible role sets each host’s hostname via hostnamectl and configures systemd-resolved to route heezy.local and yourdomain.tld queries to dnsmasq.

K8s services get their own template (k8s-services.j2) with round-robin entries:

10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
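
As a rough sketch of what the round-robin template produces, the function below emits one hosts-file line per node for a given service. The real entries are rendered by the Jinja2 template from inventory; the node IPs (192.0.2.x documentation addresses) and the service name here are placeholders, not values from the actual network.

```python
def round_robin_entries(service: str, node_ips: list[str],
                        domain: str = "heezy.local") -> list[str]:
    """Build one hosts-file line per node so the name resolves to every node IP."""
    return [f"{ip} {service}.{domain}" for ip in node_ips]


if __name__ == "__main__":
    # Placeholder addresses standing in for the five cluster nodes.
    nodes = [f"192.0.2.{n}" for n in range(11, 16)]
    print("\n".join(round_robin_entries("navidrome", nodes)))
```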

Infrastructure services that run on dedicated hosts get single entries (infra-services.j2):

10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal
10.x.x.x <service>.internal

DMZ Isolation#

DMZ hosts (game servers on 10.x.x.0/24) never use internal DNS. They’re in an untrusted zone, and allowing DNS queries into the environment would be an attack vector. DMZ DHCP uses the FortiGate’s default DNS (which forwards to public resolvers). No firewall rule exists for DMZ → dnsmasq, and one should never be created.

MetalLB: Solving the Round-Robin Problem#

Round-robin DNS is terrible for reliability. If one of 5 nodes goes down, roughly 20% of lookups hand the client a dead IP first. The client has to wait for a TCP timeout (30+ seconds) before trying the next IP. SSH doesn’t retry at all; it just fails. This defeats the purpose of running a cluster.

The fix: MetalLB, which was available as a MicroK8s addon but disabled.

microk8s enable metallb:10.x.x.x-10.x.x.x

One IP in the pool. The SWAG LoadBalancer service (which had been stuck in <pending> state for 69 days) immediately grabbed it. Now yourdomain.tld subdomains resolve to a single VIP that fails over to another node in about 2 seconds.

The catch: MetalLB VIPs only serve LoadBalancer-type services. NodePorts are not accessible on the VIP. So heezy.local services (which use NodePorts) still round-robin. This is acceptable for internal/admin use since the priority was fixing the public-facing yourdomain.tld resolution.

The DHCP Cutover#

Four VLANs, four DHCP servers, all managed by the FortiGate:

VLAN	Subnet	Zone	Before	After
native	10.x.x.0/24	SHARED	FortiGate default	10.x.x.x + 1.1.1.1
200	10.x.x.0/24	USERS	FortiGate default	10.x.x.x + 1.1.1.1
2000	10.x.x.0/24	PROD	10.x.x.x + 8.8.8.8	10.x.x.x + 1.1.1.1
3	10.x.x.0/24	DMZ	FortiGate default	Unchanged

The fallback to 1.1.1.1 is critical. If dnsmasq goes down, clients lose internal names but the internet keeps working.

Firewall Rules for Cross-Zone DNS#

dnsmasq lives on the SHARED VLAN. Clients on SHARED can reach it directly (same zone, no firewall rule needed). But USERS and PROD are different zones:

Policy 309 (existing, updated): USERS → SHARED dnsmasq, UDP/53 + TCP/53
Policy 315 (new): PROD → SHARED dnsmasq, UDP/53 + TCP/53

DNS needs both UDP and TCP on port 53, and TCP is easy to forget. Most queries use UDP, but when a response is too large for UDP (common with DNSSEC), the server marks it as truncated and the client retries over TCP.
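
To check both transports through the firewall, a query can be sent over UDP and then over TCP. The sketch below uses only the Python standard library and hand-builds a minimal A-record query in the standard DNS wire format; the function names are hypothetical, and the server address has to be supplied by the caller. The only difference on the wire is that TCP prefixes each message with a 2-byte length.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def _recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_udp(server: str, name: str, timeout: float = 2.0) -> bytes:
    """Send the query over UDP/53 and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def query_tcp(server: str, name: str, timeout: float = 2.0) -> bytes:
    """Send the same query over TCP/53 (2-byte length prefix) and return the response."""
    message = build_query(name)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(message)) + message)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)
```

Calling `query_udp` and `query_tcp` against the dnsmasq address from a client in each zone exercises both firewall services; a timeout on one but not the other points at a missing rule for that protocol.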

The Premature Cutover#

We pushed the DHCP changes before verifying dnsmasq was actually working. The Terraform apply succeeded, the DHCP servers started handing out 10.x.x.x as the DNS server, and… nothing worked. DNS queries to 10.x.x.x timed out from every client.

We immediately reverted the DHCP cutover to stop the bleeding, then spent the next hour debugging why dnsmasq wouldn’t answer queries.

The Debugging Saga#

Symptom#

nslookup google.com 10.x.x.x timed out from every client: the Mac on USERS VLAN, big-boi on SHARED VLAN, even from the dnsmasq host itself. But docker exec dnsmasq nslookup google.com 127.0.0.1 worked perfectly inside the container.

What We Ruled Out#

Firewall: FortiGate policy 309 confirmed via API with both UDP/53 and TCP/53. nc -vz 10.x.x.x 53 connected successfully from the Mac, although that check only exercises TCP.
Port binding: ss -tlnup showed dnsmasq bound to 0.0.0.0:53 on both TCP and UDP.
iptables: INPUT chain policy ACCEPT, no rules. No DROP or REJECT anywhere in the chain.
systemd-resolved: stub listener disabled, not conflicting on port 53.
Docker bridge networking: initially suspected, switched to network_mode: host. Same problem.

The Docker Bridge Red Herring#

The original setup used Docker bridge networking with port mapping (-p 53:53). We discovered that Docker’s DOCKER-BRIDGE iptables chain had explicit DROP rules that blocked the container’s outbound UDP/53 to upstream resolvers. The container could receive queries but couldn’t forward them.

Switching to network_mode: host fixed the outbound forwarding issue, but queries from outside the container still timed out. This sent us down a rabbit hole investigating PID namespaces, conntrack, and kernel socket routing.

The Real Problem: Two Configuration Options#

After hours of debugging, we bisected the dnsmasq configuration file line by line. Starting from a minimal config that worked:

no-resolv
server=1.1.1.1
server=8.8.8.8
log-queries

We added lines back one at a time until it broke. Two options were the culprits:

listen-address=0.0.0.0: explicitly setting this broke dnsmasq in Docker host networking mode. Without it, dnsmasq binds to the wildcard address and works. With it, dnsmasq still binds but doesn’t process packets from external sources. The difference is subtle, and we found no documentation for it. It appears to be related to how dnsmasq enumerates interfaces when listen-address is set versus when it uses the default wildcard binding.

bogus-priv: this option tells dnsmasq not to forward reverse lookups for private IP ranges. In host networking mode, it also breaks forward query handling. Removing it fixed the issue.
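
The add-one-line-at-a-time procedure can be sketched as a loop around a health check. The `works` callable is an assumption: in practice it would write the trial config, restart the container, and run a query from another host. Here it is replaced by a simulated check that mirrors the two findings above so the example runs on its own.

```python
from typing import Callable


def find_breaking_lines(
    baseline: list[str],
    candidates: list[str],
    works: Callable[[list[str]], bool],
) -> list[str]:
    """Add candidate lines to a known-good config one at a time.

    Lines that keep the config working are kept; lines that break it are
    recorded and left out so later lines are tested against a clean config.
    """
    config = list(baseline)
    culprits = []
    for line in candidates:
        trial = config + [line]
        if works(trial):
            config = trial
        else:
            culprits.append(line)
    return culprits


def simulated_works(config: list[str]) -> bool:
    """Stand-in for 'restart dnsmasq and query it from another host'."""
    broken = {"bogus-priv", "listen-address=0.0.0.0"}
    return not any(line in broken for line in config)


if __name__ == "__main__":
    baseline = ["no-resolv", "server=1.1.1.1", "server=8.8.8.8", "log-queries"]
    candidates = ["domain-needed", "bogus-priv", "listen-address=0.0.0.0", "cache-size=1000"]
    print(find_breaking_lines(baseline, candidates, simulated_works))
```

Leaving each culprit out before moving on is what lets the loop find two independent problems in a single pass; stopping at the first failure would have hidden the second one.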

The Image Problem#

The original jpillora/dnsmasq container also ran without the NET_ADMIN Linux capability, which dnsmasq needs to set socket options for proper DNS packet handling. (Capabilities are granted when the container is started, not baked into the image.) The busybox nslookup inside the container worked because it ran in the same process context, but external queries failed because the listening socket wasn’t configured correctly without NET_ADMIN.

The fix: switch to drpsychick/dnsmasq with cap_add: NET_ADMIN.

The Final Working Configuration#

# docker-compose.yml
services:
  dnsmasq:
    image: drpsychick/dnsmasq
    container_name: dnsmasq
    restart: unless-stopped
    network_mode: host
    cap_add:
      - NET_ADMIN
    volumes:
      - /opt/dnsmasq/config/dnsmasq.conf:/etc/dnsmasq.conf
      - /opt/dnsmasq/hosts:/etc/dnsmasq.d
    command: ["--no-daemon"]

# dnsmasq.conf: note what's NOT here
port=53
domain-needed
# NO bogus-priv
# NO listen-address=0.0.0.0
no-resolv
no-poll
domain=heezy.local
expand-hosts
server=1.1.1.1
server=8.8.8.8
cache-size=1000
local-ttl=30
log-queries
addn-hosts=/etc/dnsmasq.d
local=/heezy.local/
local=/yourdomain.tld/
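
Since the two problem options are easy to reintroduce, a small guard can scan the rendered config before it is deployed. This is a sketch under the assumption that the options appear one per line in the form shown in this article; the function name and the list of risky options are illustrative and reflect only what broke in this setup.

```python
RISKY_OPTIONS = ("bogus-priv", "listen-address=0.0.0.0")


def find_risky_options(config_text: str) -> list[str]:
    """Return the risky options that are active (not commented out) in a config."""
    found = []
    for raw in config_text.splitlines():
        # Drop trailing comments and whitespace before comparing.
        line = raw.split("#", 1)[0].replace(" ", "").strip()
        if line in RISKY_OPTIONS:
            found.append(line)
    return found


if __name__ == "__main__":
    sample = "port=53\n# NO bogus-priv\nno-resolv\nlisten-address=0.0.0.0\n"
    print(find_risky_options(sample))
```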

The Successful Cutover#

With dnsmasq actually responding to queries, we re-applied the DHCP cutover. This time:

$ nslookup google.com 10.x.x.x
Server:    10.x.x.x
Address:   10.x.x.x#53
Non-authoritative answer:
Name:  google.com
Address: 172.253.132.113

$ nslookup <service>.internal 10.x.x.x
Server:    10.x.x.x
Address:   10.x.x.x#53
Name:  <service>.internal
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x
Address: 10.x.x.x

$ nslookup navidrome.yourdomain.tld 10.x.x.x
Server:    10.x.x.x
Address:   10.x.x.x#53
Name:  navidrome.yourdomain.tld
Address: 10.x.x.x

Internal domain, public domain override, and upstream forwarding, all working.
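
The same three checks could be turned into a gate that runs before any DHCP change. The sketch below is deliberately generic: `resolve` is a hypothetical callable that queries the dnsmasq server directly and returns a set of addresses, and the expected values are supplied by the caller. An empty result means every check passed.

```python
from typing import Callable, Optional


def failed_checks(
    resolve: Callable[[str], set[str]],
    expected: dict[str, Optional[set[str]]],
) -> list[str]:
    """Describe every failed lookup; an empty list means it is safe to cut over.

    expected maps each name to the exact address set it must return,
    or None when any non-empty answer is acceptable.
    """
    failures = []
    for name, want in expected.items():
        try:
            got = resolve(name)
        except Exception as exc:  # timeouts, refused connections, and so on
            failures.append(f"{name}: lookup failed ({exc})")
            continue
        if not got:
            failures.append(f"{name}: no addresses returned")
        elif want is not None and got != want:
            failures.append(f"{name}: got {sorted(got)}, expected {sorted(want)}")
    return failures
```

Running a gate like this from a client in each zone, and refusing to apply the DHCP change unless it returns an empty list, would have caught the failure behind the first cutover.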

Everything That Was Deployed#

Ansible (ansible-heezy)#

roles/dnsmasq/templates/dnsmasq.conf.j2: main config, no bogus-priv or listen-address
roles/dnsmasq/templates/hosts.j2: auto-generated from inventory
roles/dnsmasq/templates/k8s-services.j2: 17 k8s services, round-robin
roles/dnsmasq/templates/infra-services.j2: grafana, prometheus, loki, proxmox, fortigate, dnsmasq-ui
roles/dnsmasq/templates/trentnielsen-overrides.j2: SWAG-proxied subdomains → MetalLB VIP
roles/dnsmasq/tasks/ubuntu.yml: drpsychick/dnsmasq, host networking, NET_ADMIN
playbooks/dnsmasq.yml: includes baseline, promtail, mcp-access, dnsmasq roles

Terraform (terraform-heezy)#

shared/heezy/dhcp.tf: SHARED + USERS DHCP → dnsmasq, MetalLB VIP reservation
production/heezy/dhcp.tf: PROD DHCP → dnsmasq
shared/heezy/firewall-objects.tf: TCP/53, TCP/32400 service objects, prod subnet address
shared/heezy/firewall.tf: policy 309 updated (added TCP/53), policy 315 new (PROD → dnsmasq)

Kubernetes (heezy-k8s)#

MetalLB enabled: microk8s enable metallb:10.x.x.x-10.x.x.x
SWAG LoadBalancer service got VIP 10.x.x.x

Lessons Learned#

listen-address=0.0.0.0 is not the same as omitting it. In Docker host networking mode, explicitly setting listen-address changes how dnsmasq enumerates interfaces and breaks external query handling. Just don’t set it. dnsmasq listens on all interfaces by default.
bogus-priv breaks more than reverse lookups. In host networking mode with certain dnsmasq versions, it also breaks forward query processing. Remove it unless you specifically need it and have tested it in your exact deployment configuration.
The dnsmasq container needed NET_ADMIN. dnsmasq uses setsockopt calls that require this capability. Without it, the socket binds but doesn’t process packets correctly. Capabilities are granted at run time (cap_add), not by the image, and the original jpillora/dnsmasq container ran without it.
Docker bridge networking blocks outbound UDP/53. The DOCKER-BRIDGE iptables chain has explicit DROP rules for traffic that doesn’t match port-forwarding ACCEPT rules. Containers can receive DNS queries but can’t forward them upstream. Use network_mode: host for DNS servers.
Always verify DNS works before cutting over DHCP. We pushed the DHCP cutover before confirming dnsmasq was responding, which broke DNS for all clients. Revert fast, debug at leisure.
Bisect configuration files when debugging. Start with a minimal config that works, add lines back one at a time. We found two independent bugs (listen-address and bogus-priv) that would have been nearly impossible to find by staring at the full config.
MetalLB VIPs only serve LoadBalancer services. NodePorts are not accessible on the VIP. For single-IP access to NodePort services, you need a reverse proxy (SWAG/nginx) behind the LoadBalancer.
DMZ hosts should never use internal DNS. It’s an attack vector. Keep DMZ on public resolvers only.
DNS needs both UDP and TCP on port 53. Don’t forget TCP/53 in firewall rules.
Split-horizon DNS eliminates Cloudflare hairpinning. LAN clients resolving navidrome.yourdomain.tld now go directly to the cluster VIP instead of out to Cloudflare and back. Lower latency, no internet dependency for local access.

Final State#

Component	Status
dnsmasq (10.x.x.x)	Active: Running, authoritative for heezy.local + yourdomain.tld
MetalLB VIP (10.x.x.x)	Active: Assigned to SWAG LoadBalancer
yourdomain.tld overrides	Active: Single VIP, failover in ~2 seconds
heezy.local k8s services	Partial: Round-robin (acceptable for internal use)
SHARED DHCP	Active: Pointing at dnsmasq
USERS DHCP	Active: Pointing at dnsmasq
PROD DHCP	Active: Pointing at dnsmasq
DMZ DHCP	N/A: Unchanged (public DNS only)
Firewall rules	Active: USERS + PROD → dnsmasq DNS (UDP+TCP/53)