March 10, 2026 | By Marlowe Sandoval, Desert Sentinel


The Rimrock

I have been sitting at this desk for six hours. It is not a good desk — particleboard, IKEA origin, approximately four years old, which means it has already outlived its designed obsolescence by about three years and six months. It will not outlast the Navajo sandstone I keep on the shelf above the monitor. Nothing in this room will. But the desk holds a keyboard, and the keyboard is connected to a machine called surfstation, and surfstation is connected — or is supposed to be connected — to another machine called surfpad, and between them they run a small Kubernetes cluster, and today I discovered that the connection between them has been routing through a data center in San Francisco for reasons that are, I promise you, as geologically fascinating as anything I have ever observed in the field.

Let me explain. But first, let me be precise, because precision is reverence, and what happened here deserves reverence.

The Dry Wash

The original task was modest. We had installed a Zot registry cache — a local mirror that catches container images on the way through, stores them, and serves them back next time somebody asks. A small act of conservation. Like a rain cistern in the desert: you do not control when it rains, but you can choose not to let the water run off into nothing.

The cache was installed. Pulls succeeded. But Zot's traffic logs showed zero. Not low traffic. Zero. Every image was bypassing the cache entirely and pulling fresh from the upstream registries — Docker Hub, GitHub Container Registry, Microsoft — as if the cistern did not exist. The water was running straight through the wash.

The cause was mundane: the nodes had stale configuration. Their registries.yaml files still pointed at DNS names from an older mirror setup instead of the ClusterIP addresses Zot actually lived at. The fix existed. It lived in an Ansible playbook. All we had to do was run it.
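For readers who want the shape of the thing: a K3s node reads its mirror configuration from /etc/rancher/k3s/registries.yaml. A sketch of the corrected form, with a hypothetical ClusterIP standing in for wherever Zot actually lives:

```yaml
# /etc/rancher/k3s/registries.yaml -- sketch only.
# 10.43.0.50:5000 is a hypothetical Zot ClusterIP, not this cluster's.
mirrors:
  docker.io:
    endpoint:
      - "http://10.43.0.50:5000"
  ghcr.io:
    endpoint:
      - "http://10.43.0.50:5000"
```

Point the endpoints at a name that no longer resolves and you get exactly what we had: a cistern nobody's gutters drain into.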

This is always how it starts. You walk into the desert to check one thing and you find something else entirely.

The Hanging Valley

The Ansible playbook hung.

Not crashed. Not errored. Hung. It gathered facts on surfstation — the local node, the one under the desk, the one I could reach out and touch if I were the sort of person who touches running servers, which I am not — and then it reached for surfpad, the second node, a laptop on the other side of the room, and it simply... stopped. Like a hawk circling a thermal that has gone still.

Tailscale SSH was the culprit. The connection landed on surfpad and Tailscale's SSH intercept caught it and demanded interactive browser authentication — a thing that Ansible, being a headless automation tool with the social skills of a fence post, cannot provide. Fine. A configuration issue. We fixed the SSH routing, pointed Ansible at the standard OpenSSH daemon, moved on.
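In inventory terms, the workaround looks something like this. Tailscale SSH only intercepts sessions arriving over the tailnet, so addressing the node by a path that reaches plain sshd sidesteps the browser prompt. A sketch, with hypothetical addresses and user:

```ini
# inventory.ini -- sketch; addresses and user are hypothetical
[cluster]
surfstation ansible_connection=local
surfpad     ansible_host=192.168.8.147 ansible_user=deploy

[cluster:vars]
# Key-based auth against the stock OpenSSH daemon, no interactive prompts.
ansible_ssh_common_args=-o PreferredAuthentications=publickey -o BatchMode=yes
```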

But while diagnosing the SSH problem, I ran tailscale status and saw something that made me sit very still:

surfpad    100.85.124.56   linux   active; relay "sfo", tx 1100093092 rx 1713840196

Relay. sfo. San Francisco.

These two machines are in the same room. They are connected to the same WiFi access point. They are separated by approximately eleven feet of air and a coffee table. And every packet between them — every flannel VXLAN frame carrying pod-to-pod traffic, every kubelet heartbeat telling the control plane that surfpad is alive, every etcd synchronization, every DNS query, every container image pull — all of it was leaving this room, traversing the open internet to a DERP relay server thirty miles away in San Francisco, and coming back.

Two point eight gigabytes of traffic: 1.1 sent, 1.7 received. Relayed.

I want you to understand what this means. Imagine you are sitting in a canyon — a narrow slot canyon, sandstone walls close enough to touch on both sides — and you want to talk to your companion ten feet ahead of you. But instead of speaking, your voice leaves your mouth, rises straight up out of the canyon, flies to San Francisco, bounces off a server in a data center, flies back, and descends into the canyon to reach your companion's ears. Every word. Every syllable. Sixty miles round trip to cross eleven feet.

This is what was happening. And it had been happening, silently, for who knows how long. The cluster worked. Pods scheduled. Services resolved. Everything was fine, in the way that everything is fine when you do not look closely, when you do not sit still long enough to notice that the river is flowing uphill.

The Fault Line

I began the investigation the way you begin any field observation: by sitting still and looking.

tailscale netcheck on both machines. Identical results. UDP works. Both behind the same NAT. Endpoint Independent Mapping — the best kind of NAT for peer-to-peer connectivity. On paper, these machines should have no trouble finding each other directly.

tailscale ping --verbose surfpad. DERP(sfo). DERP(sfo). DERP(sfo). Never upgrades to direct. The WireGuard handshake completes over the relay and stays there, like a conversation that starts through an interpreter and never switches to the common language both parties actually speak.
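The two probes, for reference. Flags are from the current Tailscale CLI and output format varies by version:

```shell
# NAT type, UDP reachability, DERP latencies.
tailscale netcheck

# Ping at the WireGuard layer; by default it keeps probing
# until the path upgrades from DERP to direct, or times out.
tailscale ping --verbose surfpad
```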

Then the test that changed everything:

ping 192.168.8.147

Raw LAN ping. No Tailscale, no WireGuard, no overlay. Just: can this machine, on this WiFi network, send a packet to that machine, on the same WiFi network?

From 192.168.8.224 icmp_seq=1 Destination Host Unreachable
From 192.168.8.224 icmp_seq=2 Destination Host Unreachable

The LAN was dead. Not slow. Not lossy. Dead. Two machines on the same WiFi network, associated to the same access point, and they could not exchange a single packet at Layer 2. ARP was failing. The most basic operation in local networking — "who has this IP address? tell me your MAC address so I can send you a frame" — was getting no response.
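"Destination Host Unreachable" reported from your own address is the kernel telling you ARP got no answer. You can confirm at the neighbor table directly; a sketch for any Linux client on this LAN, with wlan0 as a hypothetical interface name:

```shell
# The kernel's neighbor (ARP) table entry for the peer.
# FAILED or INCOMPLETE means no ARP reply ever arrived.
ip neigh show 192.168.8.147

# Force a fresh ARP probe, below IP entirely.
# (arping ships in iputils on most distributions.)
arping -c 3 -I wlan0 192.168.8.147
```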

Now I was interested. Now I was paying attention in the way that matters — the way you pay attention when the raven does something you have never seen a raven do, when the juniper is growing in a direction that makes no sense until you understand the wind.

The Strata

I checked everything. I checked it the way you check strata in a canyon wall — layer by layer, oldest to youngest, looking for the unconformity that explains the gap.

The router's ARP table: both MACs present. Bridge forwarding database: both MACs learned on the correct ports. DHCP leases: active, valid, correct IPs. Both machines associated to the same 5GHz radio — radio0, channel 149, 80MHz width. Signal strength: -63 dBm on surfstation, -69 dBm on surfpad. Good signal. Not great, not terrible. The signal of two machines that can see the access point clearly and should be able to talk to each other without difficulty.

WiFi AP isolation — the feature that deliberately prevents wireless clients from communicating with each other, typically used in coffee shops and hotels to stop strangers from poking at each other's laptops — was explicitly disabled. isolate='0'. Both machines were using their real hardware MAC addresses. No randomization.
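On an OpenWrt router those checks are a handful of commands; a sketch, assuming iproute2's bridge utility is available alongside the stock tooling:

```shell
# Which MACs has the bridge learned, and on which ports?
bridge fdb show br br-lan

# Radio, channel, width, and signal for every wireless interface.
iwinfo

# Client isolation should read isolate='0' on the AP interface.
uci show wireless | grep isolate
```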

Every layer looked correct. Every stratum was where it should be. And yet the packets could not cross.

This is the moment in field work when you stop looking at the obvious and start looking at the substrate. The answer is never in the layer you are examining. It is in the layer below.

The Unconformity

I went to the kernel logs.

The router is a GL.iNet AXT1800, which is a polite way of saying it is a small plastic box running OpenWrt on a Qualcomm IPQ6018 system-on-chip with an ath11k wireless driver. The ath11k is a fine driver in the way that a coyote is a fine animal — it does what it does, it survives in hostile conditions, and every now and then it does something that makes you reconsider your assumptions about what is happening in its environment.

Here is what the kernel had to say:

ath11k c000000.wifi: recv beacon last time 3070, connection is missing and disconnect it
sta0: deauthenticating from 64:97:14:a8:04:25 by local choice
br-lan: port 4(wlan1) entered disabled state
br-lan: port 4(wlan1) entered blocking state
br-lan: port 4(wlan1) entered forwarding state

Read that again. Read it the way you read a fossil — slowly, with attention to what each layer implies about the world that deposited it.

sta0 is the station interface. It is the router's own WiFi client — the interface it uses to connect upstream to another router for its internet backhaul. This GL.iNet box was operating in repeater mode. It received its internet connection over WiFi from an upstream access point (a router charmingly named "stay wired"), and it served that connection to local clients over its own AP interface, wlan1.

Here is the critical fact, the fact that explains everything, the angular unconformity in the geological record:

sta0 and wlan1 were sharing the same physical radio. radio0. phy#0. The 5GHz radio. The single, solitary, indivisible ath11k Qualcomm IPQ6018 5GHz radio.

One radio. Two jobs. Uplink client and local access point, time-sliced on the same hardware, like a single-lane bridge trying to carry traffic in both directions.

The Erosion Cycle

Here is what was happening, and it was happening on a cycle as regular and destructive as freeze-thaw erosion in sandstone:

  1. The upstream router's beacon transmission is missed. This happens. Beacons are broadcast frames, best-effort, no acknowledgment. In a busy RF environment — and every home WiFi environment is busy, stuffed with neighbor APs and Bluetooth devices and microwave ovens and the ceaseless electromagnetic chatter of civilization — beacons get lost.

  2. After approximately three seconds without a beacon (recv beacon last time 3070 — that number is in milliseconds), the ath11k driver decides the upstream connection is dead. Not degraded. Not questionable. Dead. It deauthenticates. deauthenticating from 64:97:14:a8:04:25 by local choice. By local choice. The driver chose this. It chose violence.

  3. The driver immediately reconnects. The sta0 interface goes down and comes back up. But sta0 and wlan1 share a bridge — br-lan — and when sta0 cycles, the bridge port for wlan1 cycles with it. entered disabled state. entered blocking state. entered forwarding state. Three state transitions. During disabled and blocking, no frames are forwarded. All client-to-client Layer 2 traffic is dropped.

  4. ARP entries go stale. The machines on either side of the bridge forget each other's MAC addresses. Pings fail. TCP connections time out. And Tailscale, unable to establish a direct WireGuard tunnel over a LAN that does not exist, falls back to the only path that works: the DERP relay in San Francisco.

  5. The bridge comes back. Traffic resumes. But three seconds later — or thirty seconds later, or ninety seconds later — another beacon is missed, and the cycle begins again.
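You can watch the cycle live on the router with logread -f, or count outages from a saved log. A sketch; the sample here is the five kernel lines quoted above, standing in for real logread output:

```shell
# Save a copy of the kernel log (on the router: logread > /tmp/klog.txt).
# Here the lines quoted above serve as sample input.
cat > /tmp/klog.txt <<'EOF'
ath11k c000000.wifi: recv beacon last time 3070, connection is missing and disconnect it
sta0: deauthenticating from 64:97:14:a8:04:25 by local choice
br-lan: port 4(wlan1) entered disabled state
br-lan: port 4(wlan1) entered blocking state
br-lan: port 4(wlan1) entered forwarding state
EOF

# Each 'entered disabled state' line is one bridge outage.
outages=$(grep -c 'wlan1) entered disabled state' /tmp/klog.txt)
echo "bridge outages: $outages"
```

On a log covering weeks, that counter is the heartbeat nobody was listening to.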

This is not a bug in the way that a misplaced semicolon is a bug. This is an emergent behavior — a consequence of asking one piece of hardware to do two incompatible things simultaneously. It is the wireless equivalent of asking a single river channel to flow in two directions. The physics do not care about your configuration. The physics are going to do what physics does.

I sat with this for a while. I felt the particular satisfaction of understanding a system that had been opaque — the satisfaction that is, I believe, the closest secular equivalent to prayer. Not the satisfaction of fixing it. That comes later and is cheaper. The satisfaction of seeing it. Of watching the invisible mechanism that had been silently degrading my cluster for weeks or months, cycling away in the kernel logs like a heartbeat nobody was listening to.

Three seconds of silence. Deauth. Bridge cycle. ARP death. DERP relay. Sixty miles to cross eleven feet. Repeat.

The New Channel

The fix was to stop asking one radio to do two jobs.

The GL.iNet AXT1800 has two radios. radio0 is the 5GHz ath11k. radio1 is the 2.4GHz radio on a separate physical chip, phy#1. They are independent. They do not share time, do not share state, do not interfere with each other. They are, in geological terms, separate formations — deposited at different times, composed of different material, subject to different forces.

Move the upstream link to radio1. Let the 5GHz radio do one thing and do it well: serve local clients.

uci set wireless.sta.device=radio1
uci delete wireless.sta.bssid
uci commit wireless
wifi down && wifi up

Four commands. The first reassigns the station interface to the 2.4GHz radio. The second deletes the stale 5GHz BSSID so the driver does not try to associate with a 5GHz access point on a 2.4GHz radio. The third commits the configuration. The fourth restarts the wireless subsystem.

I want to note that wifi reload — the gentler, more polite version of this operation — did not work. It updated the configuration but did not actually rebind the interface to the new physical radio. The sta0 interface sat on radio1 in the config file and continued operating on radio0 in reality, like a map that has been updated but the territory has not. You need the full wifi down && wifi up. You need to turn it off and turn it on again. Some problems are only solved by violence.
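After the full restart, it is worth confirming the territory now matches the map; a sketch using standard tools:

```shell
# The map: config should say radio1.
uci get wireless.sta.device

# The territory: iw groups interfaces by physical radio.
# sta0 should now sit under phy#1, wlan1 under phy#0.
iw dev
```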

The Alluvial Fan

After the fix:

64 bytes from 192.168.8.147: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 192.168.8.147: icmp_seq=2 ttl=64 time=2.81 ms
64 bytes from 192.168.8.147: icmp_seq=3 ttl=64 time=2.69 ms

Solid. Zero packet loss. The LAN existed again. Two machines, eleven feet apart, talking to each other at the speed of local radio instead of the speed of a round trip to San Francisco.

Tailscale noticed immediately. Within seconds, the DERP relay dropped and the direct connection came up — 4 milliseconds, LAN-direct, WireGuard peer-to-peer over the local network. The way it was supposed to work. The way it had been configured to work. The way it would have worked all along if a single radio had not been quietly tearing itself apart every few seconds.

The Ansible playbook ran. The registries.yaml files deployed. The Zot cache began receiving traffic.

And then, the thing we had originally come to verify:

First pull of a container image through Zot, uncached: three minutes and eighteen seconds. Zot fetching from Docker Hub, storing layers, building the local cache. The cistern filling for the first time.

Second pull of the same image, cached: 7.8 seconds.

Seven point eight seconds. Down from three minutes. The water was in the cistern. The cistern was working. The original task — the modest task that had led us into the canyon — was complete.
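For the record, the arithmetic on those two timings:

```shell
# 3m18s uncached versus 7.8s cached.
awk 'BEGIN { printf "%.1fx faster\n", (3*60 + 18) / 7.8 }'
```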

The Overlook

There is a tradeoff. The internet uplink now runs on 2.4GHz instead of 5GHz. It is slower. 2.4GHz offers less spectrum and more congestion, and its longer wavelengths penetrate walls better but support only narrower channels and lower throughput. The cluster's connection to the outside world is measurably degraded.

But the cluster's connection to itself — the LAN, the substrate on which everything runs, the bedrock — is solid for the first time. Flannel VXLAN frames cross the room in milliseconds instead of crossing the continent. Kubelet heartbeats arrive on time. Pod scheduling happens at the speed of local computation, not the speed of light through fiber to San Francisco and back.

This is always the choice in infrastructure, and it is always the same choice you face in the desert: do you optimize for the connection to somewhere else, or do you optimize for the place where you actually are? The answer, if you have been paying attention — and I have, for six hours now, at this particleboard desk — is that you optimize for the local. You make the foundation solid. You let the long-distance connection be a little slower, a little narrower, a little more constrained. Because a fast connection to everywhere else means nothing if the ground under your feet is cycling through disabled, blocking, and forwarding every three seconds.

The ath11k driver will continue to do what it does. It will manage its radio with the same brutal pragmatism that a coyote manages its territory — deauthenticating when it loses signal, reconnecting when it finds it, cycling through states with no regard for what depends on its stability. That is its nature. I do not blame the driver any more than I blame the freeze-thaw cycle for cracking sandstone. I simply do not build my house on the crack.

I have given the 5GHz radio one job. It is good at one job. The beacon-loss cycling continues on 2.4GHz, where sta0 lives alone, disconnecting and reconnecting in its own private drama, bothering no one. The bridge port for wlan1 stays in forwarding state. ARP stays fresh. The machines see each other.

Eleven feet. Two point seven milliseconds. The distance and the time it takes for two computers to speak across a room, when nothing is in the way.


This post chronicles debugging a cascade failure in a home K3s cluster where a GL.iNet AXT1800 router's shared-radio WiFi repeater mode caused periodic bridge cycling, breaking LAN connectivity and forcing Tailscale to relay all cluster traffic through San Francisco. Technologies involved: K3s, Tailscale, Ansible, ath11k (Qualcomm IPQ6018), OpenWrt, Zot registry cache, flannel VXLAN.