Cheap NVMe, Dead Talos, and How I Ended Up on MicroK8s

I wanted to experiment with Kubernetes on a budget. The rules:

Five nodes
Intel QuickSync-compatible processors for hardware transcoding
As cheap as possible

I found a bulk refurb reseller and bought five Lenovo ThinkCentres off the used market.

My intent was to move my processing pipelines, homebrewed apps, and other services onto k8s and see how it held up. I’d been fortunate enough in my career to completely skip the k8s cluster wrangling phase, but that also meant I’d never actually touched it. Always heard about it. Never had to deal with it.

To me, managing a k8s cluster sounded like a nightmare. In the era of “pets” and traditionally managed infrastructure, sysadmins and network engineers were fiercely protective of the work they’d done. I left network engineering specifically because I hated constantly tending this living, breathing “thing” that was a network, a storage cluster, a VMware environment, whatever it was. Kubernetes was getting hot right around the time I was learning what a Python function was, and by the time I got to migrating and managing apps in AWS, I never had to summon the grandiose, complicated nature of a k8s cluster. I had ECS Fargate, ALBs, and all the ass-wipe mode managed services that make being a cloud engineer comfortable. A close friend and mentor has said many times, “once you’ve done it with AWS, you know how hard it is to do it the other way.” I’ve kept that close the whole time. But I’ve leveled out in my area of expertise, and I find myself looking at what’s still relevant, still everywhere, and something I should probably learn. So here I am.

I wanted to build a modern k8s cluster on this cheap, non-enterprise, budget desktop mini PC hardware, and rely on it for services I actually use every day. Knowing nothing about installing or running k8s, an API-driven, locked-down OS sounded like the natural starting point. So I decided to try that first.

The idea: install Talos Linux, deploy Flux for GitOps, and have a proper immutable Kubernetes cluster running in a weekend. What actually happened was a week of fighting hardware compatibility, throwing away off-brand NVMe drives, abandoning Talos entirely, and installing Ubuntu like it was 2010. Sometimes the pragmatic choice is the only choice.

The ThinkCentre Bet#

When I decided to build a Kubernetes cluster, I wanted small, quiet, low-power nodes. The Lenovo ThinkCentre M910q fit the bill: tiny form factor (about the size of a thick paperback), Intel desktop CPUs, and they show up on the used market for cheap because businesses cycle through them like office supplies. Companies buy thousands of these for desktops, run them for a few years, then dump them when the lease is up.

I bought five of them, then upgraded each one with RAM and storage.

The Bill#

Item	Cost
5x Lenovo ThinkCentre M910q (refurb)	$263.15
5x 32GB RAM upgrades	$266.21
5x 1TB NVMe drives (name-brand, after the disaster below)	$275.10
Total cluster cost	$804.46

That’s $160.89 per node, fully upgraded.

What You Get for $804#

Spec	Per Node	Cluster Total
CPU	Intel Core i5-7500T (4C/4T, 2.7GHz)	20 cores / 20 threads
RAM	32GB DDR4	160GB
Storage	1TB NVMe	5TB
Power draw	~15-25W idle	~75-125W idle

Some unit economics, each dividing the full $804 by a single resource:

$5.03 per GB of RAM (160GB for $804)
$160.89 per TB of storage (5TB for $804)
$40.22 per CPU core (20 cores for $804)
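The arithmetic behind those figures, as a short sketch. The dollar amounts and per-node specs are copied from the tables above; the variable and function names are only illustrative. Each ratio divides the whole bill by one resource, so the numbers overlap and are not meant to be added together.

```python
# Dollar figures and per-node specs are copied from the two tables above.
COSTS = {
    "5x ThinkCentre M910q (refurb)": 263.15,
    "5x 32GB RAM upgrades": 266.21,
    "5x 1TB NVMe drives": 275.10,
}
NODES = 5
RAM_GB_PER_NODE = 32
STORAGE_TB_PER_NODE = 1
CORES_PER_NODE = 4

total = round(sum(COSTS.values()), 2)


def per_unit(total_cost: float, units: int) -> float:
    """Divide the whole cluster cost by a unit count, rounded to cents."""
    return round(total_cost / units, 2)


per_node = per_unit(total, NODES)
per_gb_ram = per_unit(total, NODES * RAM_GB_PER_NODE)
per_tb_storage = per_unit(total, NODES * STORAGE_TB_PER_NODE)
per_core = per_unit(total, NODES * CORES_PER_NODE)

print(f"total ${total:.2f}, per node ${per_node:.2f}")
print(f"per GB RAM ${per_gb_ram:.2f}, per TB ${per_tb_storage:.2f}, per core ${per_core:.2f}")
```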

I still run an R610 as my Proxmox hypervisor. It’s a great server. But I didn’t want another pizza box. Those single-chassis servers are loud, hot, and pull 200-400W at idle for one host. My five ThinkCentres combined pull less power than a single one of those, spread the workload across five failure domains, and sit on a shelf making zero noise.

For roughly the same money as one of those rack servers, five mini PCs give you more total RAM, more total storage, distributed fault tolerance, and your spouse doesn’t ask why the closet sounds like a jet engine. It’s the modern, multi-host equivalent of a single rack server, except it’s quiet, sips power, and if one node dies the other four keep running.
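To put the idle power comparison in yearly terms, here is a rough sketch. It uses only the wattage ranges quoted above and assumes the machines idle around the clock, which real workloads will not match exactly. No electricity price is assumed, so the output stays in kWh.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a constant load of `watts`."""
    return watts * HOURS_PER_YEAR / 1000


# Idle ranges quoted in the post: 15-25W per mini PC, 200-400W per rack server.
cluster_idle_w = (5 * 15, 5 * 25)
rack_idle_w = (200, 400)

for label, (low, high) in (("five ThinkCentres", cluster_idle_w),
                           ("one rack server", rack_idle_w)):
    print(f"{label}: {annual_kwh(low):.0f}-{annual_kwh(high):.0f} kWh/year at idle")
```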

The Off-Brand NVMe Disaster#

I bought cheap NVMe drives. Not “budget brand” cheap. “Never heard of this manufacturer” cheap. The kind of drives that show up on Amazon with 47 five-star reviews that all read like they were written by the same person.

The drives worked fine in a USB enclosure. They benchmarked fine on a desktop. But when I put them in the ThinkCentres and tried to boot, some of them wouldn’t be recognized by the BIOS at all. Others would show up intermittently. One would work for a few hours and then disappear from the system mid-operation.

The ThinkCentre M910q has a specific M.2 slot (2242 form factor on some models, 2280 on others) and it’s picky about NVMe compatibility. The NVMe controller lives on the drive itself, and the BIOS and Intel chipset in these machines don’t play well with every one of them. Name-brand drives (Samsung, Western Digital, Intel) work fine. The mystery drives from Shenzhen? Coin flip.

I ended up replacing all five with known-good drives. Lesson learned: the $15 you save per drive on off-brand NVMe is not worth the hours of debugging when three out of five don’t work reliably. Buy name-brand storage. Don’t cheap out on the one component that holds all your data.

The PXE Boot Rabbit Hole#

Part of what made Talos attractive was the full lifecycle story. You don’t just install it and forget about it. Talos is designed so that nodes can boot from the network, receive their configuration via API, and provision themselves with zero manual intervention. The OS, the cluster join, the workload scheduling, all of it happens automatically. For a five-node cluster, that means you rack the hardware, plug in ethernet, and walk away. New node? It PXE boots, gets its config, joins the cluster. Dead node? Replace the hardware, same thing. That’s the pitch, and it’s a compelling one.

So before I even tried installing Talos to disk, I went after the full vision: PXE boot on a dedicated VLAN so the nodes could come up from nothing. I spent 20+ hours on it.

I tried running PXE/TFTP infrastructure in containers. I tried different DHCP option configurations on the FortiGate to hand out the right boot filename and next-server. I tried iPXE chainloading to an HTTPS endpoint serving the Talos image. The ThinkCentre BIOS was uncooperative with network boot in ways that were hard to pin down. Sometimes it would PXE boot and pull the initial loader, then fail to chainload. Sometimes it wouldn’t even attempt network boot despite being configured as the first boot option. Different nodes behaved differently with the same configuration.
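For readers who have not set up iPXE chainloading before, the decision a DHCP server has to make looks roughly like the sketch below. It is not the FortiGate configuration used here. The script URL is hypothetical, the file names are the stock iPXE build artifacts, and the architecture codes are the DHCP option 93 values from RFC 4578. It also assumes an iPXE build with HTTPS support enabled.

```python
from typing import Optional

# DHCP option 93 (client system architecture) values from RFC 4578.
BIOS_X86 = 0
UEFI_X64_CODES = {7, 9}

# Hypothetical endpoint serving an iPXE script that boots the Talos image.
IPXE_SCRIPT_URL = "https://boot.example.internal/talos.ipxe"


def choose_boot_file(arch_code: int, user_class: Optional[str]) -> str:
    """Pick the DHCP boot filename for one request.

    Firmware PXE clients get an iPXE binary over TFTP. Once iPXE is running
    it asks again with user class "iPXE", and only then is it handed the
    HTTPS script. Without that check the client reloads iPXE forever.
    """
    if user_class == "iPXE":
        return IPXE_SCRIPT_URL
    if arch_code == BIOS_X86:
        return "undionly.kpxe"
    if arch_code in UEFI_X64_CODES:
        return "ipxe.efi"
    raise ValueError(f"unsupported client architecture code: {arch_code}")


print(choose_boot_file(7, None))
print(choose_boot_file(7, "iPXE"))
```

A node that pulls the first loader and then fails to chainload is stuck somewhere between the second and third branch, which is one reason the failure is hard to pin down from the client side alone.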

Omni (Sidero Labs’ managed Talos platform) was supposed to handle all of this for you. It runs the PXE/bootstrapping infrastructure so you don’t have to stand up your own TFTP and DHCP stack. I got as far as setting up Google OAuth secrets and Cloudflare integration in AWS Secrets Manager for it, but I could never get a host to successfully boot a custom image from the network on this hardware. Containerized PXE pointing at an HTTPS loader on a dedicated VLAN with these ThinkCentre BIOSes just would not work, no matter how many configurations I tried.

Twenty-plus hours. Nothing to show for it. But I still had Talos ISOs, so I figured I’d boot from USB and apply the config manually for now. Get the cluster running first, automate the boot process later.

Talos: The Road Not Taken#

With PXE off the table, I burned Talos ISOs to USB sticks and booted the nodes that way. Talos is an immutable, API-driven OS designed specifically for Kubernetes. No SSH, no shell, no package manager. You manage it entirely through its API. Even without PXE, the workflow is: boot the ISO, apply a machine config via talosctl, and the installer writes itself to disk. Elegant in theory.
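The "apply a machine config" step can be sketched as a small command builder. The node address and file name are placeholders, not values from this cluster. The `--insecure` flag is what talosctl expects while a node is still in maintenance mode and has no client certificates yet.

```python
import shlex
from typing import List


def apply_config_cmd(node_ip: str, config_path: str) -> List[str]:
    """Build the talosctl call that pushes a machine config to a freshly booted node."""
    return [
        "talosctl", "apply-config",
        "--insecure",            # maintenance mode: node has no PKI yet
        "--nodes", node_ip,
        "--file", config_path,
    ]


# Placeholder address (documentation range) and file name.
cmd = apply_config_cmd("192.0.2.11", "controlplane.yaml")
print(shlex.join(cmd))
```

Building the argument list instead of a shell string keeps the call safe to hand to `subprocess.run(cmd, check=True)` once a real node is on the other end.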

In practice, the installer would hang indefinitely on my hardware. I filed a bug with detailed logs, disk info, screenshots, the whole thing. The initial response was that the system_disk volume error could be ignored and to just specify /dev/nvme0n1 directly. Did that. Still hung. The installer would download successfully, print “running Talos installer,” and then sit there forever. Left it overnight, came back, still sitting. The node would loop trying to reach itself on port 6443, but with the install never finishing, nothing ever came up to answer.

A maintainer suggested removing wipe: true from the config. Tried that. Same result. Tried specifying the disk path directly instead of using selectors. Same result. Tried different image factory builds with and without extensions, UEFI and legacy boot. Nothing worked. Every suggestion from the maintainers led to the same place: the installer downloads, prints its message, and hangs.
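For reference, the variations described above all touch the `machine.install` section of the Talos machine config. The sketch below only shows the shape of that section with an explicit disk path and `wipe` turned off. It is not a fix, since none of these settings resolved the hang. The helper name is illustrative. JSON is valid YAML, so the printed output can be merged into a machine config as-is.

```python
import json


def install_patch(disk: str, wipe: bool) -> dict:
    """Return a machine-config fragment pinning the install disk."""
    return {
        "machine": {
            "install": {
                "disk": disk,   # explicit path instead of a disk selector
                "wipe": wipe,   # the maintainer suggested dropping wipe: true
            }
        }
    }


patch = install_patch("/dev/nvme0n1", wipe=False)
print(json.dumps(patch, indent=2))
```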

The issue sat for a while before anyone else chimed in. Then other people started hitting the same thing on different hardware (Hyper-V, other off-brand NVMe drives). One user confirmed that installing to a USB disk worked fine but the NVMe was untouchable. As of writing, the issue is still open with no resolution.

The frustrating part was that Talos detected the drive fine. talosctl get disks showed it correctly: right model, right serial, right bus path, right size. The drive was there. The installer just couldn’t (or wouldn’t) write to it. And because Talos locks down the OS so aggressively (no shell, no strace, no lsof), there was no way to debug what was actually happening during the install. You’re at the mercy of whatever the installer logs tell you, and they didn’t tell me much.

I’d burned days on PXE boot failures and days on Talos install failures. Eventually I accepted that bare metal automation on this hardware wasn’t happening, grabbed a USB stick, and loaded Ubuntu manually on each node. Gross. But it worked in 20 minutes per node, and I had a cluster the same day.

The MicroK8s Pivot#

I installed Ubuntu 24.04 on all five nodes via USB stick. Took about 20 minutes per node. From there, I iteratively developed the Ansible role that installs MicroK8s, enables the addons, generates join tokens, and joins nodes to the cluster. I never did any of the cluster setup by hand. The role went through a lot of revisions as I figured out what worked, but every iteration was Ansible, committed to git, deployed through GitHub Actions.
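The Ansible role itself is not shown in this post, but the steps it automates map onto a handful of commands. The sketch below is not the role. It is a minimal illustration of the install and join flow, with a made-up token and address in the sample output, and a channel chosen to match the 1.32 release this cluster runs.

```python
import re
from typing import List

JOIN_RE = re.compile(r"microk8s join (\S+)")


def install_cmd(channel: str) -> List[str]:
    """snap command that installs MicroK8s from a pinned channel."""
    return ["snap", "install", "microk8s", "--classic", f"--channel={channel}"]


def extract_join_targets(add_node_output: str) -> List[str]:
    """Pull every '<host>:<port>/<token>' target out of `microk8s add-node` output."""
    return JOIN_RE.findall(add_node_output)


def join_cmd(target: str) -> List[str]:
    """Command to run on the node that should join the cluster."""
    return ["microk8s", "join", target]


# Illustrative output only: the address and token are invented.
SAMPLE_ADD_NODE_OUTPUT = """From the node you wish to join to this cluster, run the following:
microk8s join 192.0.2.11:25000/exampletoken/examplehash
"""

targets = extract_join_targets(SAMPLE_ADD_NODE_OUTPUT)
print(" ".join(install_cmd("1.32/stable")))
print(" ".join(join_cmd(targets[0])))
```

In an automated run, the first node would execute `microk8s add-node`, and the parsed target would be passed to each joining node in turn.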

MicroK8s is not as clean as Talos. It’s a snap package, it bundles everything into a single binary called kubelite, and it has opinions about how things should work. You can’t independently manage kube-proxy (it’s baked into kubelite). Snap updates can be disruptive if you’re not pinning versions. I found that out when snap auto-refreshed MicroK8s on one node, caused a version skew across the cluster, and things started falling over. Had to pin the snap channel and block auto-refresh in the Ansible role after that.
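A version-skew check of the kind that would have caught this early can be sketched in a few lines. The node names match the cluster, but the skewed version string is hypothetical, and how the versions are collected (for example from `kubectl get nodes`) is left out. The hold command assumes a snapd recent enough to support `snap refresh --hold`.

```python
from collections import Counter
from typing import Dict, List


def skewed_nodes(versions: Dict[str, str]) -> Dict[str, str]:
    """Return the nodes whose version differs from the most common one."""
    if not versions:
        return {}
    majority, _ = Counter(versions.values()).most_common(1)[0]
    return {node: v for node, v in versions.items() if v != majority}


def hold_refresh_cmd(snap: str = "microk8s") -> List[str]:
    """snap command that blocks automatic refreshes for one snap."""
    return ["snap", "refresh", "--hold", snap]


# Hypothetical snapshot: one node auto-refreshed to a newer release.
observed = {
    "nebula-1": "1.32.9",
    "nebula-2": "1.32.9",
    "nebula-3": "1.33.0",
    "nebula-4": "1.32.9",
    "nebula-5": "1.32.9",
}

print(skewed_nodes(observed))
print(" ".join(hold_refresh_cmd()))
```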

The storage situation was its own journey. I started with Rook-Ceph because that’s what every “production k8s storage” guide recommends. Created a CephCluster, configured rook-ceph-block and rook-cephfs as storage classes, wrote the templates around it. Ceph did not work on these ThinkCentres. The hardware just wasn’t up to it. Ceph wants dedicated disks, real NICs, and nodes with enough headroom to run OSD daemons alongside your actual workloads. On mini PCs with a single NVMe and 32GB of RAM that also need to run 5+ application pods each, Ceph was fighting for resources constantly. I ripped it out within a couple of weeks and switched to Longhorn, which is lightweight, works with a single disk per node, and handles replication without needing a dedicated storage cluster. The git history tells the story: in November 2025 I created the CephCluster, and on December 10th I removed it and deployed Longhorn the same day.
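Replication has a capacity cost worth keeping in mind on single-disk nodes. The sketch below assumes Longhorn's default of three replicas per volume, which this post does not confirm for this cluster. It also ignores the space taken by the OS and by Longhorn's own reservations, so real usable capacity will be lower.

```python
def usable_tb(raw_tb_per_node: float, nodes: int, replicas: int) -> float:
    """Upper bound on volume capacity when every block is stored `replicas` times."""
    return raw_tb_per_node * nodes / replicas


# 1TB per node across 5 nodes, from the spec table.
for replicas in (2, 3):
    print(f"{replicas} replicas: about {usable_tb(1, 5, replicas):.2f} TB usable at most")
```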

As a newcomer to all of this, the hardest part wasn’t any single problem. It was not knowing what I didn’t know. Which storage backends are compatible with MicroK8s? Which CNI plugins work with kubelite’s bundled kube-proxy? What addons conflict with each other? None of this is documented in one place. You piece it together from GitHub issues, forum posts, and trial and error. I wrote a cleanup script early on that was supposed to help with stuck pods, and it ended up crashing the cluster so badly I had to commit a “CRITICAL FIX” to undo it. The Calico CNI works fine in its default mode, but someone (me) enabled eBPF mode without understanding that it conflicts with kubelite’s kube-proxy, and that silently broke all ClusterIP routing across the cluster. That’s a whole separate post.

And here’s the irony: the whole reason I wanted Talos was to avoid having a “pet” cluster. An immutable, API-driven OS where the nodes are cattle, not snowflakes. Instead I ended up with Ubuntu boxes running a snap package, which is exactly the kind of living, breathing thing I spent my career trying to get away from. It’s the same feeling as managing a network, a VMware cluster, or a storage array. There’s state everywhere. The nodes have SSH. They have package managers. They can drift. It’s the replacement for the manually configured infrastructure I was trying to escape, and despite managing it with Ansible and GitHub Actions, it still feels like a pet sometimes. MicroK8s on Ubuntu is not an API-driven platform. It’s infrastructure that needs tending.

But MicroK8s installed in 5 minutes, the addons worked, and I had a running cluster the same day. Within a week I had migrated my entire Docker Compose stack onto it. The cluster has been running for 162+ days now with 27 pods across 5 nodes.

Sometimes “it works” beats “it’s architecturally pure.” In the real world, this is the compromise you’d make at a job too. You’d build the process, document it, train the team on it, and iterate on it for years. It’s not the beautiful zero-touch immutable dream you drew on a whiteboard. But it’s codified, it’s repeatable, it’s not terrible, and people can actually use it. That’s what shipping looks like most of the time.

What the Cluster Looks Like Now#

nebula-1  10.x.x.x  HA standby   Ubuntu 24.04  MicroK8s 1.32.9
nebula-2  10.x.x.x  HA master    Ubuntu 24.04  MicroK8s 1.32.9
nebula-3  10.x.x.x  HA master    Ubuntu 24.04  MicroK8s 1.32.9
nebula-4  10.x.x.x  HA standby   Ubuntu 24.04  MicroK8s 1.32.9
nebula-5  10.x.x.x  HA master    Ubuntu 24.04  MicroK8s 1.32.9

Three HA masters, two standbys. Calico CNI with VXLAN. Longhorn for replicated storage. NFS for shared data. MetalLB for a single LoadBalancer VIP. All managed by Ansible, deployed by GitHub Actions.
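The three-plus-two layout follows from majority quorum. The sketch below assumes the "HA master" nodes are the voting members of the MicroK8s datastore and that standbys do not vote until promoted. That is how MicroK8s HA is documented, but it was not verified against this cluster.

```python
def quorum(voters: int) -> int:
    """Smallest majority of a voter set."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum(voters)


for voters in (1, 3, 5):
    print(f"{voters} voters: quorum {quorum(voters)}, survives {tolerated_failures(voters)} failure(s)")
```

With three voters the control plane survives the loss of one, and a standby is there to be promoted into the vacated seat.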

The full story of what runs on it and how it’s configured is in the Kubernetes cluster post.

Would I Use Talos If I Started Over?#

Maybe. If I had hardware that was known-compatible and I wasn’t also debugging NVMe issues at the same time, Talos would probably be great. The immutable OS concept is solid. No SSH means no configuration drift. The API-driven management is how infrastructure should work.

But I’d want to test it on the exact hardware first, with known-good drives, before committing five nodes to it. Fighting the OS and the hardware simultaneously is a recipe for wasted weekends.

For now, MicroK8s on Ubuntu does the job. The Ansible role keeps the nodes consistent, kernel auto-upgrades are disabled so nothing reboots unexpectedly, and the cluster has been rock solid outside of self-inflicted incidents. Good enough is good enough.