The LGTM Stack: Monitoring a Homelab Like It's Production

The monitoring stack runs on a dedicated VM at 10.x.x.x (shared-lgtm), deliberately separate from the Kubernetes cluster it watches. If the cluster goes down, the thing watching it needs to still be running. Seven containers in a single Docker Compose stack handle metrics collection, log aggregation, long-term storage, tracing, and dashboards with alerting. The whole thing is deployed and configured by a single Ansible role (roles/lgtm/) with Jinja2 templates for every config file. Push a change, GitHub Actions runs the playbook, Ansible templates the configs and restarts the stack. This post covers what each component does, what gets scraped, and how alerting works.

The Stack#

LGTM stands for Loki, Grafana, Tempo, Mimir. Add Prometheus and you’ve got the full Grafana observability suite. Each component has a specific job:

Service	Port	What It Does
Prometheus	9090	Scrapes and stores time-series metrics (30-day retention)
Loki	3100	Log aggregation (7-day retention)
Grafana	3000	Dashboards, alerting, visualization
Tempo	3200	Distributed tracing (Jaeger, Zipkin, OTLP ingest)
Mimir	9009	Long-term metrics storage (Prometheus remote_write target)
SNMP Exporter	9116	Translates SNMP polls into Prometheus metrics
Promtail	(local)	Ships host logs from the LGTM server itself to Loki

All containers run as user 1000:1000. This matters because mounted config files and data directories need matching ownership or you get permission denied errors. The Ansible role sets owner: "1000" and group: "1000" on everything it creates under /opt/lgtm/.
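A minimal sketch of how a role could enforce that ownership, assuming a simple loop over the data directories. The task name and the loop list are illustrative, not copied from the actual role:

```yaml
# Hypothetical task: create each data directory owned by the container user.
- name: Create LGTM data directories
  ansible.builtin.file:
    path: "/opt/lgtm/data/{{ item }}"
    state: directory
    owner: "1000"
    group: "1000"
    mode: "0755"
  loop:
    - prometheus
    - loki
    - grafana
    - tempo
    - mimir
```

Quoting the IDs keeps YAML from treating them as integers, which matches the owner: "1000" form used in the role.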

How It’s Deployed#

As covered in the intro, the stack is deployed by the roles/lgtm/ Ansible role, with a Jinja2 template for every config file. A push to the role triggers GitHub Actions, which runs the playbook; Ansible renders the configs and restarts the Compose stack. The role also creates a systemd service so the stack starts on boot.
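The template-then-restart flow could look roughly like this. It is a sketch under assumptions: the template filename, the handler name, and the systemd unit name (lgtm) are all made up for illustration.

```yaml
# tasks/main.yml (sketch): render one config and flag the stack for restart.
- name: Template Prometheus config
  ansible.builtin.template:
    src: prometheus.yml.j2          # assumed template name
    dest: /opt/lgtm/config/prometheus.yml
    owner: "1000"
    group: "1000"
    mode: "0644"
  notify: Restart LGTM stack

# handlers/main.yml (sketch): restart via the systemd unit the role installs.
- name: Restart LGTM stack
  ansible.builtin.systemd:
    name: lgtm                      # assumed unit name
    state: restarted
```

Using a handler means the stack only restarts when a rendered file actually changes, so a no-op playbook run leaves the containers alone.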

Directory layout on the host:

/opt/lgtm/
├── config/
│   ├── prometheus.yml
│   ├── alerts.yml
│   ├── snmp.yml
│   ├── loki.yml
│   ├── tempo.yml
│   ├── mimir.yml
│   ├── promtail.yaml
│   └── grafana/
│       ├── datasources/datasources.yml
│       ├── dashboards/dashboards.yml
│       ├── dashboards/*.json
│       └── alerting/
│           ├── alerting.yml
│           ├── contactpoints.yml
│           └── policies.yml
├── data/
│   ├── prometheus/
│   ├── loki/
│   ├── grafana/
│   ├── tempo/
│   └── mimir/
└── docker-compose.yml
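To show how that layout maps into a container, here is a sketch of a single Compose service. The image tag, container-side paths, and restart policy are assumptions; only the host paths, the port, the user, and the 30-day retention come from this post.

```yaml
# docker-compose.yml (sketch, one of the seven services)
services:
  prometheus:
    image: prom/prometheus:latest
    user: "1000:1000"
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
    volumes:
      - /opt/lgtm/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /opt/lgtm/config/alerts.yml:/etc/prometheus/alerts.yml:ro
      - /opt/lgtm/data/prometheus:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
```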

Prometheus: What Gets Scraped#

Prometheus is the metrics engine. It scrapes targets every 15 seconds (30 seconds for SNMP) and stores 30 days of data locally. It also remote_writes everything to Mimir for long-term storage.
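The top of prometheus.yml for that setup might look like the sketch below. The mimir hostname (a Compose service name), the evaluation interval, and the rule file path are assumptions; /api/v1/push is Mimir's standard remote_write endpoint.

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s          # assumption, not stated in the post

rule_files:
  - /etc/prometheus/alerts.yml      # assumed container-side path

remote_write:
  - url: http://mimir:9009/api/v1/push   # hostname assumed
```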

Scrape Jobs#

Self-monitoring: Prometheus, Loki, Grafana, Tempo, and Mimir all expose /metrics endpoints. Prometheus scrapes itself and all its siblings.

Node Exporter (port 9100): Runs on every Linux host. CPU, memory, disk, network. The targets are auto-generated from the Ansible inventory:

- job_name: 'node-exporter'
  static_configs:
    - targets:
      - '10.x.x.x:9100'  # nebula-1
      - '10.x.x.x:9100'  # nebula-2
      # ... all hosts in the 'linux' inventory group
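The template loop that produces a target list like this could be sketched as follows. It assumes each inventory host defines ansible_host; the real template may use a different variable.

```yaml
# prometheus.yml.j2 (sketch): one target per host in the 'linux' group
- job_name: 'node-exporter'
  static_configs:
    - targets:
{% for host in groups['linux'] %}
      - '{{ hostvars[host]['ansible_host'] }}:9100'  # {{ host }}
{% endfor %}
```

Adding a host to the inventory group is then enough to get it scraped on the next playbook run.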

Kubelet and cAdvisor (port 10255): Scraped from all 5 k8s nodes. Container-level CPU, memory, and network metrics.

kube-state-metrics (NodePort 30800): Cluster state as metrics. Deployment replica counts, pod phases, PVC status, node conditions. Single target on the first k8s node.

Exportarr (NodePorts 30707-30710): Application-level metrics from self-hosted services. Queue depths, library sizes, calendar entries. Four instances in one pod.

SNMP (via SNMP Exporter): Network device metrics. More on this below.
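For reference, the non-SNMP jobs above could be written roughly like this. Job names and the Compose hostnames are assumptions, and the redacted addresses are left as they appear in this post; /metrics/cadvisor is the kubelet's cAdvisor path.

```yaml
- job_name: 'lgtm-self'
  static_configs:
    - targets:
        - 'localhost:9090'          # Prometheus itself
        - 'loki:3100'
        - 'grafana:3000'
        - 'tempo:3200'
        - 'mimir:9009'

- job_name: 'kubelet'
  static_configs:
    - targets: ['10.x.x.x:10255']   # one entry per k8s node

- job_name: 'cadvisor'
  metrics_path: /metrics/cadvisor
  static_configs:
    - targets: ['10.x.x.x:10255']   # same nodes, different path

- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['10.x.x.x:30800']   # first k8s node only

- job_name: 'exportarr'
  static_configs:
    - targets:
        - '10.x.x.x:30707'
        - '10.x.x.x:30708'
        - '10.x.x.x:30709'
        - '10.x.x.x:30710'
```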

SNMP: Polling the Cisco Switch#

The Cisco 3560 switch is the one piece of infrastructure that’s still manually configured. No Terraform, no Ansible (yet). But it does get monitored via SNMP.

Switch-Side Config#

SNMP v2c is configured on the switch with a read-only community string. This is manual CLI work:

snmp-server community heezy-ro RO
snmp-server location heezy-lab
snmp-server contact admin

The switch exposes standard MIBs (IF-MIB for interface stats, ENTITY-MIB for hardware info) to any poller that presents that community string.

Prometheus Side#

The SNMP Exporter container runs the default snmp.yml config (extracted from the official image at build time) with a custom auth section injected by Ansible:

auths:
  heezy_v2:
    community: heezy-ro
    # The next three fields are SNMPv3-only defaults; they are ignored for v2c.
    security_level: noAuthNoPriv
    auth_protocol: MD5
    priv_protocol: DES
    version: 2

Prometheus scrapes the SNMP exporter with relabeling that routes each target through the exporter:

- job_name: 'snmp'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.x.x.x']
      labels:
        device: 'cisco-switch'
        __param_module: 'if_mib'
  metrics_path: /snmp
  params:
    auth: [heezy_v2]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: snmp-exporter:9116

The relabeling is the standard SNMP exporter pattern: the original target address (the switch IP) becomes a query parameter, and the actual scrape target becomes the exporter container. Prometheus asks the exporter “go poll 10.x.x.x using if_mib with heezy_v2 auth” and gets back interface metrics.

This gives us per-port traffic rates, error counts, utilization percentages, and interface status for every port on the switch.

Loki: Log Aggregation#

Loki stores logs with a 7-day retention period. It uses the TSDB store with filesystem backend (no object storage needed for a homelab).

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 7d
  max_global_streams_per_user: 5000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  delete_request_store: filesystem
  retention_enabled: true
  retention_delete_delay: 2h

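The delete_request_store: filesystem line is important. Loki 3.x removed the old compactor shared_store field and requires delete_request_store whenever retention_enabled is true. If the old field is still in your config, or the new one is missing, Loki crash-loops on startup. Found that one during the Calico BPF debugging session.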

Log Sources#

Promtail DaemonSet (k8s cluster): Runs on all 5 nebula nodes, scrapes pod logs from /var/log/pods/, ships to Loki at 10.x.x.x:3100. No sidecar needed. If a pod writes to stdout/stderr, Promtail picks it up.

Promtail standalone (LGTM host): Scrapes Docker container logs and system logs from the monitoring server itself.

Promtail standalone (DMZ hosts): Game servers on the DMZ ship logs across the firewall. There’s a specific FortiGate policy allowing DMZ-to-SHARED traffic on TCP/3100 for this.
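A standalone Promtail config for a Docker host could be sketched like this. The listen port, positions path, job names, and the loki hostname are assumptions; a DMZ host would point the client URL at the LGTM server's address instead.

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml     # assumed path

clients:
  - url: http://loki:3100/loki/api/v1/push   # hostname assumed

scrape_configs:
  # System logs from the host filesystem
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log

  # Docker container logs, discovered through the Docker socket
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Container names arrive as "/name"; strip the leading slash.
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
```

The container label produced by that relabel rule is the kind of label a query such as {container="cs16-server"} selects on.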

Mimir: Long-Term Metrics#

Prometheus stores 30 days locally. Mimir stores everything Prometheus remote_writes to it, with higher limits:

limits:
  max_global_series_per_user: 500000
  ingestion_rate: 50000
  ingestion_burst_size: 100000

Single-node deployment with filesystem backend. Memberlist for ring coordination (even though it’s just one instance, the config requires it). Mimir shows up as a second Prometheus datasource in Grafana, so you can query historical data beyond the 30-day local window.
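A single-node Mimir config along those lines might look like the sketch below, adapted from the shape of Mimir's demo configuration. The /data paths are assumptions, and the limits block shown earlier would sit alongside these sections.

```yaml
multitenancy_enabled: false

server:
  http_listen_port: 9009

blocks_storage:
  backend: filesystem
  filesystem:
    dir: /data/blocks               # assumed container path
  tsdb:
    dir: /data/tsdb
  bucket_store:
    sync_dir: /data/tsdb-sync

compactor:
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 1           # single instance, so no replication

store_gateway:
  sharding_ring:
    replication_factor: 1
```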

Tempo: Distributed Tracing#

Tempo accepts traces via Jaeger (thrift, gRPC), Zipkin, and OTLP (gRPC + HTTP). It’s mostly there for future use. If I ever instrument an application with OpenTelemetry, the backend is ready.
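The receiver side of tempo.yml could be sketched as follows. The storage paths and the explicit OTLP listen addresses are assumptions; the 1h block retention matches the retention table further down.

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
        grpc:
    zipkin:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"  # assumed; listen on all interfaces
        http:
          endpoint: "0.0.0.0:4318"

compactor:
  compaction:
    block_retention: 1h

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal          # assumed container path
    local:
      path: /var/tempo/blocks
```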

The Grafana datasource config links Tempo to Loki so you can jump from a trace to the corresponding logs:

- name: Tempo
  type: tempo
  jsonData:
    tracesToLogs:
      datasourceUid: loki
      tags: ['job', 'instance', 'pod', 'namespace']

Grafana: Dashboards and Alerting#

Datasources#

Four datasources, all provisioned as code via the Ansible template:

Name	Type	UID	What
Prometheus	prometheus	prometheus	Default. 15s scrape interval.
Loki	loki	loki	Log queries. Linked to Tempo for trace correlation.
Tempo	tempo	tempo	Trace queries. Linked to Loki for log correlation.
Mimir	prometheus	mimir	Long-term metrics at `/prometheus` path.
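In Grafana's provisioning format, the metrics and log entries from that table could be written roughly like this. The URLs assume Compose service names; the Tempo entry is omitted here because its tracesToLogs block is shown in the Tempo section.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090     # hostname assumed
    isDefault: true
    jsonData:
      timeInterval: 15s             # matches the scrape interval

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100

  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir:9009/prometheus
```

Pinning the uid values matters because other provisioned config, such as the Tempo tracesToLogs link, refers to datasources by UID.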

Dashboards#

Three JSON dashboards deployed by Ansible:

Node Exporter Host Metrics (community dashboard #1860 rev41): CPU, memory, disk, network per host. The standard node exporter dashboard.
Kubernetes: Cluster-level metrics from kube-state-metrics and kubelet/cAdvisor.
Media Library: Application-specific metrics from Exportarr instances.

Dashboards are JSON files in roles/lgtm/files/dashboards/. The Ansible role copies them to the Grafana provisioning directory and cleans up any unmanaged JSON files (so deleting a dashboard from the role actually removes it from Grafana).
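The dashboards.yml provider file that points Grafana at those JSON files might look like this sketch. The provider name, the refresh interval, and the container-side path are assumptions.

```yaml
apiVersion: 1
providers:
  - name: default                   # assumed provider name
    orgId: 1
    type: file
    disableDeletion: false          # removing a file removes the dashboard
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```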

Alerting#

Alerts fire to Discord via a webhook. The contact point is provisioned as code:

contactPoints:
  - orgId: 1
    name: discord-grafana-alerts
    receivers:
      - type: discord
        settings:
          url: "{{ grafana_discord_webhook_url }}"

The webhook URL is stored in AWS Secrets Manager at production/heezy/grafana/discord-webhook and pulled at Ansible runtime.
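One way to pull that secret at runtime is the amazon.aws lookup plugin. This is a sketch, assuming the secret is stored as a plain string and lives in us-east-1; the actual role may fetch it differently.

```yaml
# Role vars (sketch): resolve the webhook URL when the playbook runs.
grafana_discord_webhook_url: >-
  {{ lookup('amazon.aws.secretsmanager_secret',
            'production/heezy/grafana/discord-webhook',
            region='us-east-1') }}
```

The lookup runs on the Ansible controller, so the runner needs AWS credentials with read access to that secret.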

Alert Rules#

Two layers of alerting: Prometheus alert rules (evaluated by Prometheus) and Grafana alert rules (evaluated by Grafana against any datasource).

Host alerts:

CPU > 80% for 30 minutes
Memory > 90% for 5 minutes
Load average > 2 for 10 minutes
Host down (any up == 0) for 1 minute
Host reboot detected (boot time changed)

Disk alerts:

Root filesystem < 20% free for 5 minutes (warning)
Root filesystem < 5% free for 2 minutes (critical)

Switch alerts:

Port errors > 1/sec for 5 minutes
Port utilization > 80% for 10 minutes
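As a sketch, the host, disk, and switch thresholds above could be expressed as Prometheus rules like these. Alert names, severity labels, the choice of node_load5, and the rate windows are assumptions; the metric names are the standard node exporter and IF-MIB ones.

```yaml
groups:
  - name: host
    rules:
      - alert: HostHighCpu
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 30m
      - alert: HostHighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: HostHighLoad
        expr: node_load5 > 2
        for: 10m
      - alert: HostDown
        expr: up == 0
        for: 1m
      - alert: HostRebooted
        expr: changes(node_boot_time_seconds[10m]) > 0

  - name: disk
    rules:
      - alert: RootFsLowWarning
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 20
        for: 5m
        labels:
          severity: warning
      - alert: RootFsLowCritical
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 5
        for: 2m
        labels:
          severity: critical

  - name: switch
    rules:
      - alert: SwitchPortErrors
        expr: rate(ifInErrors[5m]) + rate(ifOutErrors[5m]) > 1
        for: 5m
      # Inbound only; ifHighSpeed is in Mbit/s, and ports reporting 0 are skipped.
      - alert: SwitchPortHighUtilization
        expr: rate(ifHCInOctets[5m]) * 8 / ((ifHighSpeed > 0) * 1e6) * 100 > 80
        for: 10m
```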

Game server alerts:

CS 1.6 player joined (Loki log query: {container="cs16-server"} |= "entered the game")

That last one is my favorite. Grafana queries Loki for the CS 1.6 server logs, pattern-matches on “entered the game,” and pings Discord. I know within seconds when someone connects to the server.

Retention Summary#

Component	Retention	Storage
Prometheus	30 days	Local TSDB at `/opt/lgtm/data/prometheus/`
Loki	7 days	Local filesystem at `/opt/lgtm/data/loki/`
Mimir	Unlimited (disk-limited)	Local filesystem at `/opt/lgtm/data/mimir/`
Tempo	1 hour block retention	Local filesystem at `/opt/lgtm/data/tempo/`
Grafana	N/A (config only)	SQLite at `/opt/lgtm/data/grafana/`

Prometheus and Loki have explicit retention. Mimir grows until disk fills up (not a concern yet). Tempo has short retention because I’m not actively generating traces.
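If disk usage ever does become a concern, Mimir can cap retention through its limits block. A sketch, with the one-year value picked purely as an example:

```yaml
limits:
  # Blocks older than this are deleted by the compactor; 0 (the default) keeps everything.
  compactor_blocks_retention_period: 1y
```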

The Weekly Auto-Update#

The LGTM role includes the docker-compose-updater role at the end, which creates a cron job that pulls latest images and recreates containers every Sunday at 5am Eastern. All the component versions are set to latest in the Ansible vars, so the weekly pull keeps everything current without manual intervention.
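The cron job that role creates could be sketched like this. The entry name and the exact command are assumptions, and the schedule relies on the host's timezone being set to Eastern.

```yaml
- name: Weekly image refresh for the LGTM stack
  ansible.builtin.cron:
    name: lgtm-weekly-update        # assumed entry name
    weekday: "0"                    # Sunday
    hour: "5"
    minute: "0"
    job: "cd /opt/lgtm && docker compose pull && docker compose up -d"
```

Running up -d after the pull only recreates containers whose image actually changed.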

What’s Still Manual#

The Cisco switch SNMP config is manual CLI. I’d like to manage it with Ansible eventually, but the Cisco IOS Ansible modules require a specific network automation setup that I haven’t prioritized. The switch works, SNMP works, and it’s not something that changes often.

The FortiGate is also monitored via SNMP (same pattern, different target), but its SNMP config is managed by Terraform rather than by hand.
