January 4, 2026 | By George Blackwell, Technical Operations Division


The call came in at fourteen hundred hours. The cluster was dark.

I'd seen this before, of course—in Prague, in '89, when an entire signals intelligence network went silent because someone had misconfigured a single relay. The principle was the same. One bad actor, operating with elevated privileges, and suddenly your entire infrastructure is deaf, dumb, and blind.

In this case, the hostile element was a Tailscale router deployment. Someone had given it hostNetwork: true, which sounds innocuous enough to the uninitiated. But in the shadowy world of Kubernetes networking, it's the equivalent of handing a mole the keys to the signals room.

The Initial Assessment

The API server—kubelite, in MicroK8s parlance—was in a continuous crash loop. Every attempt to reach it via the standard channels (kubectl) returned only silence:

i/o timeout
connection reset

The kind of responses that make field operatives very nervous indeed.

We had to go in through the back door. SSH access to the node itself, bypassing the compromised control plane entirely. Old tradecraft, but still effective.

The logs told a familiar story of betrayal:

Error while dialing: dial unix /var/snap/microk8s/8611/var/kubernetes/backend/kine.sock:12379:
connect: no such file or directory

The database socket was gone. Dqlite—the distributed database that holds all the cluster's secrets, its deployments, its services, its carefully constructed reality—had been severed from the rest of the system.
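Before going further, it's worth confirming the socket really is gone. A quick field check; the path below substitutes the snap's current symlink for the revision number the log named (8611):

```shell
# Hedged: check whether the dqlite/kine socket is where the log expects it.
# "current" is the snap's symlink to the active revision.
SOCK=/var/snap/microk8s/current/var/kubernetes/backend/kine.sock
if [ -S "$SOCK" ]; then
  STATE="present"
else
  STATE="missing"
fi
echo "kine socket: $STATE"
```

If the socket is missing while k8s-dqlite claims to be running, the API server has nothing to dial, and the crash loop follows.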

The Extraction

The first order of business was triage. Stop the bleeding, assess what could be salvaged.

sudo microk8s stop
sudo pkill -9 kubelite || true
sudo pkill -9 k8s-dqlite || true
sudo microk8s start

We killed everything. It's not elegant, but in crisis operations, elegance is a luxury. The restart cleared the corrupted state, and slowly—painfully slowly—the logs began to show signs of life.

The API server was processing pods again. Not all of them, mind you. Many were stuck in limbo, ghosts of deployments that no longer had a home. But the heartbeat had returned.

The next step was surgical: remove the hostile element before it could cause further damage.

sudo microk8s kubectl delete deployment -n cluster-management tailscale-router

The router was gone. Now we had to ensure it could never return in its compromised form.

Fixing the Source

Intelligence work is as much about paperwork as it is about field operations. The repository—our source of truth—still contained the dangerous configuration. If the GitLab Agent reconnected and began syncing, it would simply redeploy the same hostile router.

We had to modify the deployment manifest at the source:

# Before (Compromised)
hostNetwork: true

# After (Sanitized)
hostNetwork: false

One line. The difference between a functioning cluster and complete operational blindness. I've always found it remarkable how much damage can be contained in such small changes.
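For the record, the corrected manifest looks roughly like this. The deployment name and namespace are from the operation; the image, labels, and replica count are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-router          # name and namespace from the incident
  namespace: cluster-management
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tailscale-router
  template:
    metadata:
      labels:
        app: tailscale-router
    spec:
      hostNetwork: false          # the one-line fix: keep the pod in its own netns
      containers:
      - name: tailscale
        image: tailscale/tailscale:stable   # illustrative image tag
```

With hostNetwork set to false, the pod gets its own network namespace and can no longer trample the node's loopback sockets.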

Reinforcements: Adding a Second Node

With the primary station stabilized, attention turned to expansion. A single-node cluster is a single point of failure—unacceptable for any operation of significance.

The asset: a ThinkPad E14, previously dormant. It would become our second node, a worker to share the operational load.

The first contact attempt failed immediately:

OSError: [Errno 113] No route to host

Port 25000 was blocked. The firewall on the primary node—a holdover from earlier security protocols—was refusing the connection. We adjusted:

sudo ufw allow 25000/tcp
sudo ufw allow from 192.168.8.0/24

The second attempt revealed a different problem entirely. The ThinkPad wasn't clean. It carried the digital fingerprints of a previous MicroK8s installation—Calico interfaces (vxlan.calico, cali*) still active, a ghost in the machine.

You cannot join a node that believes it's already part of another cluster. The identity had to be wiped clean:

sudo microk8s leave
sudo snap remove microk8s --purge
sudo rm -rf /var/snap/microk8s
sudo snap install microk8s --classic

A complete reset. Sometimes that's the only way to ensure loyalty.
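A quick sweep confirms the purge left no trace. The patterns below match the vxlan.calico and cali* interfaces observed earlier:

```shell
# Hedged: list any surviving Calico interfaces after the purge.
# Matches vxlan.calico and the per-pod cali* veths.
RESIDUE=$(ls /sys/class/net | grep -E '^(vxlan\.calico|cali)' || true)
if [ -n "$RESIDUE" ]; then
  printf 'Calico residue remains:\n%s\n' "$RESIDUE"
else
  echo "no Calico interfaces left"
fi
```

Anything that survives can be removed with ip link delete before attempting the join.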

The Join Operation

The join token was generated on the primary node—a cryptographic handshake that would bind the two stations together:

sudo microk8s add-node

On the ThinkPad, the join command was executed. And then we waited.
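For the archive, the join command takes this shape. The address and token below are placeholders, not the operational values; the --worker flag keeps the new node out of the dqlite voter set, which matters on a shaky link:

```shell
# Sketch only: placeholder address and token, echoed rather than executed.
PRIMARY="192.168.8.10:25000"         # hypothetical primary-node address on the LAN
TOKEN="replace-with-add-node-token"  # printed by `microk8s add-node` on the primary
JOIN_CMD="sudo microk8s join ${PRIMARY}/${TOKEN} --worker"
echo "$JOIN_CMD"
```

Joining as a worker explains the `<none>` role the node later reports: it runs workloads but takes no part in control-plane consensus.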

The WiFi connection between the nodes was... unreliable. Field conditions, as they say. Average latency of 23 milliseconds, acceptable for most purposes. But the spikes—83ms, 219ms, occasionally worse—these were concerning.

Dqlite, the distributed database at the heart of MicroK8s, is sensitive to latency. It maintains consensus through constant communication. When packets arrive late, or not at all, the nodes lose faith in each other.

rtt min/avg/max/mdev = 3.063/23.581/219.121/48.926 ms

High jitter. The kind of numbers that make database synchronization a gamble rather than a certainty.
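For future operatives: the jitter figure can be pulled straight from ping's summary line. The 10 ms threshold below is a rule of thumb, not doctrine:

```shell
# Parse mdev (jitter) from ping's rtt summary line.
LINE="rtt min/avg/max/mdev = 3.063/23.581/219.121/48.926 ms"
MDEV=$(echo "$LINE" | awk -F'= ' '{print $2}' | awk -F'/' '{print $4}' | tr -d ' ms')
awk -v m="$MDEV" 'BEGIN {
  if (m + 0 > 10) print "high jitter: " m " ms mdev"
  else            print "acceptable: " m " ms mdev"
}'
# → high jitter: 48.926 ms mdev
```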

But the join succeeded. Both nodes reported Ready:

NAME                   STATUS   ROLES           AGE   VERSION
surfstation            Ready    control-plane   45m   v1.31.0
devlin-thinkpad-e14    Ready    <none>          12m   v1.31.0

A two-node cluster. Redundancy achieved.

Re-Establishing Communications

With the physical infrastructure restored, the intelligence apparatus needed reconnection. The GitLab Agent—codename surfshack-agent—had to be reinstalled to resume automated deployments.

The agent serves as the bridge between the repository and the cluster, watching for changes and applying them in real-time. Without it, all deployments would require manual intervention—untenable for any operation at scale.
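Reinstallation typically goes through GitLab's official Helm chart. The token below is a placeholder, and the kas address assumes gitlab.com (self-hosted instances differ); the command is echoed here as a sketch rather than executed:

```shell
# Sketch only: placeholder agent token; kasAddress assumes gitlab.com.
AGENT_CMD="helm upgrade --install surfshack-agent gitlab/gitlab-agent \
  --namespace gitlab-agent --create-namespace \
  --set config.token=<agent-token> \
  --set config.kasAddress=wss://kas.gitlab.com"
echo "$AGENT_CMD"
```

Once the agent pod is running and registered, repository changes flow to the cluster again without manual kubectl work.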

The CI/CD pipeline was reconfigured to use shared runners, removing the dependency on the compromised local infrastructure. The applications began to flow again: Homepage, Job Automation, the various microservices that constitute the operational capability.

Lessons Learned

I've compiled this briefing for the archive, and for future operatives who may face similar situations. The key insights:

  1. hostNetwork: true is a privilege escalation. A container with host networking can interfere with the node's core services. Use it sparingly, if at all, and never for services that manage network traffic.

  2. Single points of failure are unacceptable. The cluster operated on one node for too long. Adding the ThinkPad as a worker provides resilience against hardware failure and allows for rolling updates.

  3. WiFi is a liability for distributed databases. The latency spikes between nodes can cause Dqlite consensus failures. Wired connections are preferable for production clusters.

  4. Always sanitize nodes before joining. Previous installations leave artifacts that can corrupt the join process. A complete reset is the only reliable approach.

  5. Back-channel access is essential. When the API server is unreachable, SSH access to the underlying node is the only way to diagnose and recover. Ensure this access is always available.

Current Status

The cluster is operational. Two nodes, communicating over WiFi with acceptable (if imperfect) latency. The GitLab Agent is active and synchronized. All applications have been redeployed.

The Tailscale router now runs in its own network namespace, properly contained, no longer a threat to the host's networking stack.

It was a long day. They usually are, in this line of work. But the lights are on again, the signals are flowing, and the cluster hums along quietly in its corner of the network.

Tomorrow will bring new challenges. There's always something—a certificate expiring, a pod stuck in CrashLoopBackOff, a migration that didn't quite apply correctly.

But that's tomorrow's operation.

Tonight, we rest.


Technical operations log filed by G. Blackwell. Infrastructure secured via MicroK8s, Calico CNI, Dqlite consensus, and GitLab Agent automation. The coffee was adequate.