The LGTM Stack: Monitoring a Homelab Like It’s Production
The monitoring stack runs on a dedicated VM at 10.x.x.x (shared-lgtm), deliberately separate from the Kubernetes cluster it watches. If the cluster goes down, the thing watching it needs to still be running. Seven containers in a single Docker Compose stack handle metrics collection, log aggregation, long-term storage, tracing, and dashboards with alerting. The whole thing is deployed and configured by a single Ansible role (roles/lgtm/) with Jinja2 templates for every config file. Push a change, GitHub Actions runs the playbook, Ansible templates the configs and restarts the stack. This post covers what each component does, what gets scraped, and how alerting works.
Contents#
- The Stack
- How It’s Deployed
- Prometheus: What Gets Scraped
- SNMP: Polling the Cisco Switch
- Loki: Log Aggregation
- Mimir: Long-Term Metrics
- Tempo: Distributed Tracing
- Grafana: Dashboards and Alerting
- Retention Summary
- The Weekly Auto-Update
- What’s Still Manual
The Stack#
LGTM stands for Loki, Grafana, Tempo, Mimir. Add Prometheus and you’ve got the full Grafana observability suite. Each component has a specific job:
| Service | Port | What It Does |
|---|---|---|
| Prometheus | 9090 | Scrapes and stores time-series metrics (30-day retention) |
| Loki | 3100 | Log aggregation (7-day retention) |
| Grafana | 3000 | Dashboards, alerting, visualization |
| Tempo | 3200 | Distributed tracing (Jaeger, Zipkin, OTLP ingest) |
| Mimir | 9009 | Long-term metrics storage (Prometheus remote_write target) |
| SNMP Exporter | 9116 | Translates SNMP polls into Prometheus metrics |
| Promtail | (local) | Ships host logs from the LGTM server itself to Loki |
All containers run as user 1000:1000. This matters because mounted config files and data directories need matching ownership or you get permission denied errors. The Ansible role sets owner: "1000" and group: "1000" on everything it creates under /opt/lgtm/.
How It’s Deployed#
The entire stack is an Ansible role (roles/lgtm/) with Jinja2 templates for every config file. Push a change to the role, GitHub Actions runs the playbook, Ansible templates the configs and restarts the Compose stack. The role also creates a systemd service so the stack starts on boot.
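As a rough sketch of the template-then-restart flow described above, one config file might be handled like this. The template filename, handler name, and `lgtm` systemd unit name are assumptions for illustration, not the role's actual contents:

```yaml
# tasks/main.yml (hypothetical excerpt)
- name: Template Prometheus config
  ansible.builtin.template:
    src: prometheus.yml.j2                    # assumed template name
    dest: /opt/lgtm/config/prometheus.yml
    owner: "1000"                             # matches the container user
    group: "1000"
    mode: "0644"
  notify: Restart lgtm stack                  # only fires when the file changed

# handlers/main.yml (hypothetical excerpt)
- name: Restart lgtm stack
  ansible.builtin.systemd:
    name: lgtm                                # assumed unit name
    state: restarted
```

Using a handler means a playbook run that changes nothing also restarts nothing.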
Directory layout on the host:
/opt/lgtm/
├── config/
│   ├── prometheus.yml
│   ├── alerts.yml
│   ├── snmp.yml
│   ├── loki.yml
│   ├── tempo.yml
│   ├── mimir.yml
│   ├── promtail.yaml
│   └── grafana/
│       ├── datasources/datasources.yml
│       ├── dashboards/dashboards.yml
│       ├── dashboards/*.json
│       └── alerting/
│           ├── alerting.yml
│           ├── contactpoints.yml
│           └── policies.yml
├── data/
│   ├── prometheus/
│   ├── loki/
│   ├── grafana/
│   ├── tempo/
│   └── mimir/
└── docker-compose.yml
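To show how this layout maps into a container, here is a minimal Compose sketch for the Prometheus service only. The in-container paths and the restart policy are assumptions. The host paths, port, user, and 30-day retention come from this post:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    user: "1000:1000"                 # must match ownership of the mounted paths
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
    volumes:
      - /opt/lgtm/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /opt/lgtm/config/alerts.yml:/etc/prometheus/alerts.yml:ro
      - /opt/lgtm/data/prometheus:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped           # assumption
```

The other six services would follow the same shape: a read-only config mount, a writable data mount, and the same user.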
Prometheus: What Gets Scraped#
Prometheus is the metrics engine. It scrapes targets every 15 seconds (30 seconds for SNMP) and stores 30 days of data locally. It also remote_writes everything to Mimir for long-term storage.
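The top of prometheus.yml for a setup like this could look roughly as follows. The `mimir` hostname assumes the Compose service name, the rule file path assumes where the container mounts alerts.yml, and the evaluation interval is a guess. `/api/v1/push` is Mimir's standard remote write endpoint:

```yaml
global:
  scrape_interval: 15s        # default for every job unless overridden
  evaluation_interval: 15s    # assumption: not stated in this post

rule_files:
  - /etc/prometheus/alerts.yml   # assumed in-container path

remote_write:
  - url: http://mimir:9009/api/v1/push   # assumes multi-tenancy is disabled in Mimir
```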
Scrape Jobs#
Self-monitoring: Prometheus, Loki, Grafana, Tempo, and Mimir all expose /metrics endpoints. Prometheus scrapes itself and all its siblings.
Node Exporter (port 9100): Runs on every Linux host. CPU, memory, disk, network. The targets are auto-generated from the Ansible inventory:
- job_name: 'node-exporter'
  static_configs:
    - targets:
        - '10.x.x.x:9100' # nebula-1
        - '10.x.x.x:9100' # nebula-2
        # ... all hosts in the 'linux' inventory group
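A Jinja2 template that produces output like the above might look something like this. It is a sketch that assumes each host defines `ansible_host` in the inventory. The `linux` group name is the one mentioned in the comment above:

```yaml
- job_name: 'node-exporter'
  static_configs:
    - targets:
{% for host in groups['linux'] %}
        - '{{ hostvars[host]['ansible_host'] }}:9100' # {{ host }}
{% endfor %}
```

Adding a host to the `linux` group is then enough to get it scraped on the next playbook run.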
Kubelet and cAdvisor (port 10255): Scraped from all 5 k8s nodes. Container-level CPU, memory, and network metrics.
kube-state-metrics (NodePort 30800): Cluster state as metrics. Deployment replica counts, pod phases, PVC status, node conditions. Single target on the first k8s node.
Exportarr (NodePorts 30707-30710): Application-level metrics from self-hosted services. Queue depths, library sizes, calendar entries. Four instances in one pod.
SNMP (via SNMP Exporter): Network device metrics. More on this below.
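For the Kubernetes jobs, a minimal sketch with one node shown per job could look like this. The job names are illustrative. `/metrics/cadvisor` is the kubelet's standard cAdvisor path, and 10255 is its read-only HTTP port, so no TLS or token config is needed:

```yaml
- job_name: 'kubelet'
  static_configs:
    - targets: ['10.x.x.x:10255']    # repeat for each of the 5 nodes

- job_name: 'cadvisor'
  metrics_path: /metrics/cadvisor    # container-level metrics
  static_configs:
    - targets: ['10.x.x.x:10255']

- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['10.x.x.x:30800']    # NodePort on the first k8s node
```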
SNMP: Polling the Cisco Switch#
The Cisco 3560 switch is the one piece of infrastructure that’s still manually configured. No Terraform, no Ansible (yet). But it does get monitored via SNMP.
Switch-Side Config#
SNMP v2c is configured on the switch with a read-only community string. This is manual CLI work:
snmp-server community heezy-ro RO
snmp-server location heezy-lab
snmp-server contact admin
The switch exposes standard MIBs (IF-MIB for interface stats, ENTITY-MIB for hardware info) to any client that presents the community string.
Prometheus Side#
The SNMP Exporter container runs the default snmp.yml config (extracted from the official image at build time) with a custom auth section injected by Ansible:
auths:
  heezy_v2:
    community: heezy-ro
    security_level: noAuthNoPriv
    auth_protocol: MD5
    priv_protocol: DES
    version: 2
Prometheus scrapes the SNMP exporter with relabeling that routes each target through the exporter:
- job_name: 'snmp'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.x.x.x']
      labels:
        device: 'cisco-switch'
        __param_module: 'if_mib'
  metrics_path: /snmp
  params:
    auth: [heezy_v2]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: snmp-exporter:9116
The relabeling is the standard SNMP exporter pattern: the original target address (the switch IP) becomes a query parameter, and the actual scrape target becomes the exporter container. Prometheus asks the exporter “go poll 10.x.x.x using if_mib with heezy_v2 auth” and gets back interface metrics.
This gives us per-port traffic rates, error counts, utilization percentages, and interface status for every port on the switch.
Loki: Log Aggregation#
Loki stores logs with a 7-day retention period. It uses the TSDB store with filesystem backend (no object storage needed for a homelab).
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 7d
  max_global_streams_per_user: 5000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
compactor:
  working_directory: /loki/compactor
  delete_request_store: filesystem
  retention_enabled: true
  retention_delete_delay: 2h
The delete_request_store: filesystem line is important. Loki 3.x removed the old shared_store field, and if you have it in your config, Loki crash-loops on startup. Found that one during the Calico BPF debugging session.
Log Sources#
Promtail DaemonSet (k8s cluster): Runs on all 5 nebula nodes, scrapes pod logs from /var/log/pods/, ships to Loki at 10.x.x.x:3100. No sidecar needed. If a pod writes to stdout/stderr, Promtail picks it up.
Promtail standalone (LGTM host): Scrapes Docker container logs and system logs from the monitoring server itself.
Promtail standalone (DMZ hosts): Game servers on the DMZ ship logs across the firewall. There’s a specific FortiGate policy allowing DMZ-to-SHARED traffic on TCP/3100 for this.
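A standalone Promtail config for one of these hosts could be as small as the sketch below. The listen port, positions file path, and job labels are assumptions borrowed from the stock Promtail example. The Loki address and port are the ones used throughout this post:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml     # where Promtail remembers read offsets

clients:
  - url: http://10.x.x.x:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs              # assumed label
          __path__: /var/log/*.log  # files to tail
```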
Mimir: Long-Term Metrics#
Prometheus stores 30 days locally. Mimir stores everything Prometheus remote_writes to it, with higher limits:
limits:
  max_global_series_per_user: 500000
  ingestion_rate: 50000
  ingestion_burst_size: 100000
Single-node deployment with filesystem backend. Memberlist for ring coordination (even though it’s just one instance, the config requires it). Mimir shows up as a second Prometheus datasource in Grafana, so you can query historical data beyond the 30-day local window.
Tempo: Distributed Tracing#
Tempo accepts traces via Jaeger (thrift, gRPC), Zipkin, and OTLP (gRPC + HTTP). It’s mostly there for future use. If I ever instrument an application with OpenTelemetry, the backend is ready.
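The receiver section of tempo.yml for that set of protocols might look roughly like this. It is a sketch, not the deployed config. The explicit OTLP endpoints are an assumption, since recent Tempo releases bind to localhost by default, and 4317/4318 are the standard OTLP ports:

```yaml
distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
        grpc:
    zipkin:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
```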
The Grafana datasource config links Tempo to Loki so you can jump from a trace to the corresponding logs:
- name: Tempo
  type: tempo
  jsonData:
    tracesToLogs:
      datasourceUid: loki
      tags: ['job', 'instance', 'pod', 'namespace']
Grafana: Dashboards and Alerting#
Datasources#
Four datasources, all provisioned as code via the Ansible template:
| Name | Type | UID | What |
|---|---|---|---|
| Prometheus | prometheus | prometheus | Default. 15s scrape interval. |
| Loki | loki | loki | Log queries. Linked to Tempo for trace correlation. |
| Tempo | tempo | tempo | Trace queries. Linked to Loki for log correlation. |
| Mimir | prometheus | mimir | Long-term metrics at /prometheus path. |
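For the two Prometheus-type rows in the table, the provisioning file could look something like this. The `prometheus` and `mimir` hostnames assume Compose service names, and `access: proxy` is an assumption. The UIDs, ports, and `/prometheus` path come from this post:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s     # tells Grafana the scrape interval

  - name: Mimir
    type: prometheus        # Mimir speaks the Prometheus query API
    uid: mimir
    access: proxy
    url: http://mimir:9009/prometheus
```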
Dashboards#
Three JSON dashboards deployed by Ansible:
- Node Exporter Host Metrics (community dashboard #1860 rev41): CPU, memory, disk, network per host. The standard node exporter dashboard.
- Kubernetes: Cluster-level metrics from kube-state-metrics and kubelet/cAdvisor.
- Media Library: Application-specific metrics from Exportarr instances.
Dashboards are JSON files in roles/lgtm/files/dashboards/. The Ansible role copies them to the Grafana provisioning directory and cleans up any unmanaged JSON files (so deleting a dashboard from the role actually removes it from Grafana).
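The cleanup step could be sketched with two tasks like these. The `managed_dashboards` variable and the host-side dashboard directory are assumptions about how the role is organized:

```yaml
- name: Find dashboard JSON files currently on the host
  ansible.builtin.find:
    paths: /opt/lgtm/config/grafana/dashboards
    patterns: "*.json"
  register: deployed_dashboards

- name: Remove dashboards that no longer exist in the role
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ deployed_dashboards.files }}"
  when: (item.path | basename) not in managed_dashboards
  vars:
    # Hypothetical: basenames of the JSON files shipped with the role
    managed_dashboards: "{{ query('fileglob', role_path ~ '/files/dashboards/*.json') | map('basename') | list }}"
```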
Alerting#
Alerts fire to Discord via a webhook. The contact point is provisioned as code:
contactPoints:
  - orgId: 1
    name: discord-grafana-alerts
    receivers:
      - type: discord
        settings:
          url: "{{ grafana_discord_webhook_url }}"
The webhook URL is stored in AWS Secrets Manager at production/heezy/grafana/discord-webhook and pulled at Ansible runtime.
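One way to pull that secret at runtime is the Secrets Manager lookup plugin from the `amazon.aws` collection. This is a sketch that assumes the collection is installed, AWS credentials are available to the runner, and the secret is stored as a plain string rather than JSON:

```yaml
# group_vars or role vars (hypothetical placement)
grafana_discord_webhook_url: "{{ lookup('amazon.aws.secretsmanager_secret', 'production/heezy/grafana/discord-webhook') }}"
```

Because the lookup runs on the controller during templating, the URL never has to be committed to the repository.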
Alert Rules#
Two layers of alerting: Prometheus alert rules (evaluated by Prometheus) and Grafana alert rules (evaluated by Grafana against any datasource).
Host alerts:
- CPU > 80% for 30 minutes
- Memory > 90% for 5 minutes
- Load average > 2 for 10 minutes
- Host down (any up == 0) for 1 minute
- Host reboot detected (boot time changed)
Disk alerts:
- Root filesystem < 20% free for 5 minutes (warning)
- Root filesystem < 5% free for 2 minutes (critical)
Switch alerts:
- Port errors > 1/sec for 5 minutes
- Port utilization > 80% for 10 minutes
Game server alerts:
- CS 1.6 player joined (Loki log query: {container="cs16-server"} |= "entered the game")
That last one is my favorite. Grafana queries Loki for the CS 1.6 server logs, pattern-matches on “entered the game,” and pings Discord. I know within seconds when someone connects to the server.
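On the Prometheus side, a few of the rules listed above could be written roughly like this. The alert names, group names, and severity labels are illustrative. The expressions use standard node exporter and IF-MIB metric names, and the thresholds and durations come from the lists above:

```yaml
groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      - alert: HostHighCpu
        # 100 minus the idle percentage, averaged across cores per host
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 30m
        labels:
          severity: warning

      - alert: RootFilesystemLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 20
        for: 5m
        labels:
          severity: warning

  - name: switch
    rules:
      - alert: SwitchPortErrors
        # Combined inbound and outbound errors per second, per interface
        expr: rate(ifInErrors[5m]) + rate(ifOutErrors[5m]) > 1
        for: 5m
        labels:
          severity: warning
```

If annotations with `{{ $labels.instance }}` are added, they need `{% raw %}` guards, because alerts.yml passes through Jinja2 before Prometheus ever sees it.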
Retention Summary#
| Component | Retention | Storage |
|---|---|---|
| Prometheus | 30 days | Local TSDB at /opt/lgtm/data/prometheus/ |
| Loki | 7 days | Local filesystem at /opt/lgtm/data/loki/ |
| Mimir | Unlimited (disk-limited) | Local filesystem at /opt/lgtm/data/mimir/ |
| Tempo | 1 hour block retention | Local filesystem at /opt/lgtm/data/tempo/ |
| Grafana | N/A (config only) | SQLite at /opt/lgtm/data/grafana/ |
Prometheus and Loki have explicit retention. Mimir grows until disk fills up (not a concern yet). Tempo has short retention because I’m not actively generating traces.
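For Tempo, a one-hour block retention is typically set in the compactor section, along the lines of the fragment below. The key path matches Tempo 2.x and may differ in other versions:

```yaml
compactor:
  compaction:
    block_retention: 1h   # blocks older than this are deleted
```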
The Weekly Auto-Update#
The LGTM role includes the docker-compose-updater role at the end, which creates a cron job that pulls latest images and recreates containers every Sunday at 5am Eastern. All the component versions are set to latest in the Ansible vars, so the weekly pull keeps everything current without manual intervention.
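The cron job such a role creates might be sketched like this. The job name and exact command are assumptions, and the schedule only means 5am Eastern if the host clock is set to that timezone:

```yaml
- name: Weekly image pull and container recreate
  ansible.builtin.cron:
    name: lgtm-weekly-update      # hypothetical job name
    weekday: "0"                  # Sunday
    hour: "5"
    minute: "0"
    job: "cd /opt/lgtm && docker compose pull && docker compose up -d"
```

`docker compose up -d` only recreates containers whose image actually changed, so a week with no new releases is a no-op.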
What’s Still Manual#
The Cisco switch SNMP config is manual CLI. I’d like to manage it with Ansible eventually, but the Cisco IOS Ansible modules require a specific network automation setup that I haven’t prioritized. The switch works, SNMP works, and it’s not something that changes often.
The FortiGate is also monitored via SNMP (same pattern, different target), but its SNMP config is managed by Terraform rather than by hand.