/POST · ALFIE MILLS

Building a K3s Cluster on Hetzner with Terraform and GitOps

For a while now I've been running my own Kubernetes cluster on Hetzner Cloud instead of paying for managed nodes elsewhere. The whole thing is reproducible: Terraform stands up the infrastructure, cloud-init bootstraps K3s, and ArgoCD takes over from there. This is a write-up of how it actually fits together.

Why Hetzner and K3s

Hetzner's pricing is hard to argue with. The cluster runs on cax21 ARM instances (4 vCPU, 8 GB RAM) for the master and the three workers, plus a small cax11 box dedicated to the database. Everything lives in Hetzner's nbg1 (Nuremberg) location behind a private network.

I went with K3s rather than full upstream Kubernetes because it ships as a single binary, it's light on resources, and it lets me switch off the bundled components I don't want. The nodes reach the control plane over a private subnet, and the workers accept no public inbound traffic at all.

Provisioning with Terraform

Everything starts with the hcloud Terraform provider. I define a private network, a subnet, firewalls, and the servers themselves:

resource "hcloud_network" "private_network" {
  name     = "kubernetes-cluster"
  ip_range = "10.0.0.0/16"
}

resource "hcloud_server" "worker-nodes" {
  count       = 3
  name        = "worker-node-${count.index}"
  image       = "ubuntu-24.04"
  server_type = "cax21"
  location    = "nbg1"
  network {
    network_id = hcloud_network.private_network.id
  }
  firewall_ids = [hcloud_firewall.worker_firewall.id]
}

The firewall design is the important bit. The worker firewall has no inbound rules at all. Hetzner Cloud Firewalls don't apply to private network traffic, so the nodes can still talk to each other while the workers stay closed to inbound connections from the public internet. The master firewall opens only SSH (22) and the Kubernetes API (6443).

The master gets a static private IP of 10.0.1.1 so the workers always know where to find it.

Bootstrapping K3s with cloud-init

Rather than SSH in and run install scripts by hand, the nodes self-install K3s on first boot via cloud-init. The master installs the server with a deliberately trimmed config:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik \
  --disable servicelb --disable-cloud-controller \
  --kubelet-arg cloud-provider=external \
  --tls-san 10.0.1.1 --flannel-iface=enp7s0" sh -

I disable the bundled Traefik and servicelb because I bring my own ingress and load balancing through Helm. The --flannel-iface=enp7s0 flag is critical: it pins Flannel to the private network interface, so pod-to-pod traffic between nodes stays off the public NIC.
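
If you're not sure which interface carries the private network on your own nodes, something like the following can find it. This is a minimal sketch rather than part of the actual cloud-init: the function name iface_for_prefix is made up for illustration, and it assumes the private addresses start with 10.0. as in the network range above.

```shell
# Hypothetical helper (not from the original setup): print the name of the
# first interface whose IPv4 address starts with the given prefix.
# Expects `ip -o -4 addr show` style lines on stdin, for example:
#   3: enp7s0    inet 10.0.1.2/32 scope global enp7s0
iface_for_prefix() {
  awk -v p="$1" '!found && $3 == "inet" && index($4, p) == 1 { print $2; found = 1 }'
}

# Example use on a node (assumes the private range begins with "10.0."):
#   ip -o -4 addr show | iface_for_prefix 10.0.
```

The printed name could then be passed to --flannel-iface instead of a hard-coded value, which helps if a different server type names the interface differently.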

The workers are templated through Terraform's templatefile(), which injects an SSH key so each worker can pull the join token straight off the master before installing:

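# Wait until the master's API answers at all (any HTTP response counts as "up")
until curl -sk -o /dev/null https://10.0.1.1:6443; do sleep 5; done
# Fetch the join token over SSH using the injected key
REMOTE_TOKEN=$(ssh cluster@10.0.1.1 sudo cat /var/lib/rancher/k3s/server/node-token)
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.1:6443 \
  K3S_TOKEN="$REMOTE_TOKEN" sh -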

The until loop matters more than it looks. Terraform creates resources in parallel, so without it the workers will happily try to join before the master's API is up. The dependency chain plus the retry loop keeps the bootstrap deterministic.
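
One caveat with a bare until loop is that it spins forever if the master never comes up. A bounded variant is sketched below. The retry function and its argument order are hypothetical, not something taken from the actual cloud-init, and the attempt count in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts.
# Returns 0 on the first success, 1 once MAX_ATTEMPTS attempts have failed.
retry() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example use in a worker's bootstrap (60 attempts, 5 seconds apart):
#   retry 60 5 curl -sk -o /dev/null https://10.0.1.1:6443 || exit 1
```

Failing loudly after a fixed number of attempts leaves a clear message in the cloud-init log instead of a worker that hangs silently.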

Persistent storage on a Storage Box

Hetzner Storage Boxes are cheap SMB/CIFS shares, so I use them for persistent volumes via the SMB CSI driver, installed through the Helm Terraform provider. A StorageClass points at the share and pulls credentials from a Kubernetes secret:

parameters = {
  source = "//${var.storagebox_host}/backup"
  "csi.storage.k8s.io/node-stage-secret-name" = "storagebox-credentials"
}
mount_options = ["dir_mode=0777", "file_mode=0777", "nobrl"]

The nobrl option (which stops the client sending byte-range lock requests to the server) was a hard-won addition. Without it, some workloads choke on SMB locking semantics. The reclaim policy is Retain, so an accidental kubectl delete pvc leaves the volume and its data in place.
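
To double-check the reclaim policy on a live cluster, one option is to list any volumes whose policy is not Retain. This is a rough sketch that assumes kubectl access to the cluster; the function name non_retained_pvs is invented for illustration.

```shell
# Hypothetical helper: read "NAME POLICY" pairs on stdin and print the names
# of PersistentVolumes whose reclaim policy is anything other than Retain.
non_retained_pvs() {
  awk 'NF >= 2 && $2 != "Retain" { print $1 }'
}

# Example use (assumes kubectl is pointed at the cluster):
#   kubectl get pv --no-headers \
#     -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy \
#     | non_retained_pvs
```

Empty output means every existing volume would survive the deletion of its claim.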

Everything else is GitOps

Once the cluster is up, Terraform's job is basically done. From there, ArgoCD runs the show. My Helm charts are organised by namespace under charts/:

  • cert-manager for TLS
  • infisical for secret management
  • monitoring for the Prometheus/Grafana observability stack
  • tailscale for private access to internal services
  • default for the actual application workloads

Secrets never touch Git. Public config lives in values.yaml; anything sensitive is referenced through Infisical's InfisicalSecret custom resources. The workflow is just: edit values.yaml, commit, and push. ArgoCD then syncs the change to the cluster automatically.

The monitoring stack has grown into the most active part of the repo. I've added SNMP exporters for my router, an AdGuard Home exporter, and a handful of custom Grafana dashboards for home network health, all managed as Helm values and synced through Argo.

What I'd tell past me

The two things that bit me hardest were both networking: pinning Flannel to the private interface, and remembering that "no inbound firewall rules" is the correct, secure default for workers rather than an oversight. Get the private network right first, and the rest of the cluster is honestly the easy part.