Technitium HA DNS Cluster Migration
| Version | Last Updated | Author |
|---|---|---|
| 1.0 | 10 May 2026 | Steve James |
1. Executive Summary & Outcome Hypothesis
The home network’s DNS ran on a single instance of Technitium DNS Server in a Docker container on the Unraid media server. That made the Unraid host a single point of failure for all LAN name resolution: a container crash, host reboot, or hardware fault left every device without name resolution until someone intervened manually.
This work delivers a two-node Technitium HA cluster — Primary on Unraid, Secondary on a Raspberry Pi 4 — with native cluster replication for configuration and keepalived / VRRP managing a single virtual IP for client-facing failover. Pi-hole and Unbound, the legacy stack the original Technitium deployment had already replaced, are decommissioned.
Outcome hypothesis. If two Technitium replicas run with native cluster sync and a VRRP-managed virtual IP, then a single-node failure will disrupt LAN name resolution for under five seconds with no client-side reconfiguration. Validated post-cutover: under one second of disruption observed across four induced failure modes (Technitium stopped on Primary, Primary host reboot, keepalived stopped on Primary, keepalived stopped on Secondary).
2. Strategic Context & Alignment
This is the second project in a connected homelab arc. The first, the Autonomous Homelab Sentinel, introduced a monitoring-first agent layer with a human approval gateway for high-risk operations. This project extends the thesis from “we can detect and respond to failures” to “the most critical infrastructure layer survives failures without intervention in the first place.” Together they push the homelab toward operating as a small SRE program rather than a collection of always-on hobby boxes.
The work also closes the two open recommendations from the Network Redesign Memo: promoting the idle Pi to a resilience node, and introducing automatic DNS failover so that name resolution survives any single-host failure on the LAN. Both are marked as resolved by this delivery.
3. Problem Statement
The homelab’s network DNS started life on a Raspberry Pi running Pi-hole (ad/tracker/malware blocking) with Unbound (recursive resolution). The two tools worked but were stitched together rather than designed as one system: limited API, ageing admin UI, no mobile access, and no visibility into the recursive layer. Technitium DNS Server replaced both, consolidating recursion, blocklists, allow/deny lists, and internal zone management into a single engine running as a Docker container on the Unraid media server. That migration freed the Pi and consolidated DNS — but it also made the Unraid server the single point of failure for all LAN name resolution.
If the Technitium container on the Unraid server dies (process crash, container OOM, host reboot, hardware failure), every device on the network loses name resolution until a human notices and intervenes. The Pi sits idle and could be picking up the load instead. There is no automatic recovery path.
The affected users are every device on the LAN — including IoT devices that don’t tolerate DNS outages gracefully — and any human on the network who notices immediately when everyday services stop responding.
4. Goals & Non-Goals
Goals
- Eliminate the single point of failure for network DNS. When the Unraid server dies, name resolution continues without manual intervention.
- Single virtual IP for all clients. Clients see one IP and that IP keeps answering. No client-side reconfiguration; no per-device split-DNS.
- One source of truth for configuration. Adding a blocked domain, an allowed domain, a new internal zone record, or an Advanced Blocking group remains a single action — not “do it on the Unraid server, then do it on the Pi.”
- Decommission Pi-hole and Unbound entirely. One DNS engine, two replicas.
- Recovery time under 5 seconds end-to-end. From “Unraid server’s Technitium dies” to “Pi answers DNS for the same IP that clients have always asked,” fewer than five seconds.
Non-Goals
- Replacing the public DNS layer (Cloudflare). External hostnames continue to resolve via Cloudflare’s authoritative DNS; this work is purely about the LAN-side resolver.
- IPv6 enablement. The LAN is IPv4-only by design.
- Tailscale DNS integration.
- Replacing UDM as the DHCP server, gateway, or firewall.
- Geographic DNS or multi-region failover.
5. Architecture & Discovery Plan
Target architecture
Two layers, conceptually independent.
Cluster layer (config replication). Technitium-native. One Primary, one Secondary; all zones, blocklists, allow/deny, app configs, and admin auth flow Primary to Secondary. The Pi is a faithful mirror.
VRRP layer (IP failover). keepalived on each node negotiates ownership of 192.168.86.57. The Unraid server has higher priority and holds the IP normally. If the Unraid server’s Technitium fails its local health check (a DNS query against itself), keepalived demotes it; the Pi promotes itself within 1 to 3 seconds and starts answering on .57.
The two layers are decoupled deliberately. The cluster handles “what is the truth”; VRRP handles “where does the truth answer.” Either layer can fail without taking the other down.
   UDM DHCP DNS pushes ONLY 192.168.86.57
                      |
              ┌───────▼────────┐
              │    VIP .57     │  keepalived/VRRP managed
              │  (on master)   │
              └────────────────┘
                ▲             ▲
                │             │
┌───────────────┴─┐         ┌─┴───────────────┐
│ UNRAID SRV .56  │         │ PI .192         │
│ Primary node    │  ────►  │ Secondary node  │
│ Technitium      │ cluster │ Technitium      │
│ keepalived M    │  sync   │ keepalived B    │
└─────────────────┘         └─────────────────┘
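The local health check in the VRRP layer (a DNS query against the node itself) could be implemented as a small script that keepalived runs on each node. The following is a minimal sketch, not the deployed check: the probe name `example.com`, the one-second timeout, and all function names are assumptions made for illustration. It sends a single A-record query to 127.0.0.1 over UDP and treats anything other than a matching NOERROR response as unhealthy.

```python
"""Hypothetical health check: does the local resolver answer a DNS query?"""
import secrets
import socket
import struct


def build_query(name: str, txid: int) -> bytes:
    """Encode a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def is_healthy_reply(reply: bytes, txid: int) -> bool:
    """Accept only a response to our own query with RCODE NOERROR."""
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == txid and is_response and rcode == 0


def check(server: str = "127.0.0.1", name: str = "example.com",
          timeout: float = 1.0) -> bool:
    """Send one query to the local resolver; False on timeout or bad reply."""
    txid = secrets.randbelow(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, txid), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # socket timeouts are a subclass of OSError
            return False
    return is_healthy_reply(reply, txid)


def main() -> int:
    """Exit status for the caller: 0 means healthy, 1 means failed."""
    return 0 if check() else 1

# When deployed as a standalone script, finish with: raise SystemExit(main())
```

keepalived treats exit status 0 from a tracked script as healthy and any non-zero status as a failure. Probing a public name ties the check to upstream reachability, so both nodes would fail it together during an internet outage; a hostname from an internal zone may be a safer probe target.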
Discovery & pre-flight
Before any change to the live network, the migration plan was validated by eight read-only probes against the actual infrastructure: IP layout, Docker network driver and bridge behaviour, port bindings, the L2 multicast path between the Unraid server and the Pi, IP-alias semantics, host-network sysctl support on the Unraid kernel, container restart-policy persistence across reboots, and the existing Technitium API surface for cluster operations. The phase took roughly 90 minutes and disproved several pessimistic assumptions in the original plan, collapsing the execution estimate from a half-weekend to about three hours.
The principle behind the phase: every assumption that influences a sequenced live change should be verified, eliminated, or down-weighted before the live work starts, so that execution rests only on known facts. Live execution then followed a “stop on first surprise” rule: any divergence from the plan paused the work until a hypothesis was formed and a revise-or-abandon decision was made. That rule was triggered three times during execution and prevented the kind of compounding state that turns a Sunday evening project into a weekend one.
6. Assumptions & Constraints
Assumptions
- Both nodes run Technitium 15.2+ (clustering became a first-class feature in November 2025).
- The LAN is a single broadcast domain; VRRP multicast flows between the Unraid server and the Pi without unicast_peer mode.
- The Pi has substantial idle resources to run Technitium and keepalived alongside whatever else it does.
- The Unraid server has CPU/RAM headroom for a keepalived Docker container alongside Technitium and the existing 19+ containers.
Constraints
- Single DHCP DNS entry. All clients point at one IP; failover is handled at the infrastructure layer, not the client layer.
- No DNS outage during the migration. A short safety-net window using 1.1.1.1 as a temporary second DHCP DNS entry is acceptable purely during the change, reverted immediately afterwards.
- No new credentials in chat. Any secrets are written to ~/.credentials/ via interactive read -s prompts; nothing pasted into a session log.
7. Non-Functional Requirements
| Category | Requirement | Target |
|---|---|---|
| Privacy | LAN DNS queries never reach a third-party recursive resolver. Recursion stays local; no DoH forwarders. | 0 forwarded queries |
| Availability | A single-node failure must not produce a measurable client-side outage. | MTTR ≤ 5s; failed queries during induced failover = 0 |
| Consistency | Configuration changes on the Primary appear on the Secondary without manual sync. | Replication lag ≤ 30s |
| Operability | Every migration phase has a clean, documented rollback path. | Rollback time ≤ 10 min per phase |
| Manageability | Adding a zone record, blocklist entry, or Advanced Blocking rule is a single action across the cluster. | 1 admin UI / 1 API call |
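The replication-lag target in the Consistency row can be checked by making a change on the Primary and timing how long it takes to become visible on the Secondary. The helper below is a sketch under stated assumptions: `wait_for_replication` and the `is_visible` callback are hypothetical names, and the callback is expected to query the Secondary's standalone IP (not the VIP) for whatever record was just added.

```python
import time
from typing import Callable, Optional


def wait_for_replication(
    is_visible: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until is_visible() returns True.

    Returns the elapsed seconds at the first successful poll, or None if
    the change is still not visible once timeout_s has passed.
    """
    start = clock()
    while True:
        if is_visible():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

The measured lag is only as precise as the polling interval, so a one-second interval is adequate for a 30-second target but would need tightening to confirm a figure in the low single seconds.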
8. Success Metrics
| Metric | Target | Observed |
|---|---|---|
| Failover MTTR (Unraid → Pi) | ≤ 5s | < 1s |
| Failback time (Pi → Unraid after Primary recovery) | ≤ 15s | ~10s |
| Failed DNS queries during induced failure | 0 | 0 across 4 scenarios |
| Cluster replication lag (config change → secondary visible) | ≤ 30s | ~5s |
| Single source of truth for config | 1 admin UI | ✓ |
| DHCP DNS entries pushed by UDM | 1 (192.168.86.57) | ✓ |
Failover scenarios validated: Technitium stopped on Primary, Primary host reboot, keepalived stopped on Primary, keepalived stopped on Secondary.
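One way to produce the MTTR and failed-query figures is to probe the VIP at a fixed interval while a failure is induced, then summarise the samples. The sketch below is illustrative and is not the tooling used for the observed numbers: `collect`, `summarise`, and `FailoverReport` are hypothetical names, and the `probe` callback is assumed to send one DNS query to 192.168.86.57 and return whether it was answered.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[float, bool]  # (monotonic timestamp in seconds, query succeeded)


@dataclass
class FailoverReport:
    total: int
    failed: int
    longest_outage_s: float


def collect(probe: Callable[[], bool], duration_s: float,
            interval_s: float = 0.2) -> List[Sample]:
    """Call probe() repeatedly for duration_s, recording when and whether it worked."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe()))
        time.sleep(interval_s)
    return samples


def summarise(samples: Iterable[Sample]) -> FailoverReport:
    """Count failures and find the longest run of consecutive failed probes.

    An outage runs from the first failed sample to the next successful one,
    or to the final sample if the run never recovers.
    """
    total = failed = 0
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        total += 1
        last_ts = ts
        if ok:
            if outage_start is not None:
                longest = max(longest, ts - outage_start)
                outage_start = None
        else:
            failed += 1
            if outage_start is None:
                outage_start = ts
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return FailoverReport(total, failed, longest)
```

The sampling interval bounds the resolution of the result: a claim of sub-second disruption needs an interval well below one second, and zero failed samples means only that no probe happened to land inside the failover window.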
9. Risks & Mitigation
| Risk | Mitigation |
|---|---|
| DNS outage during the migration window | Temporary 1.1.1.1 DHCP DNS fallback on UDM during execution, reverted immediately after VIP cutover |
| Misconfiguration leading to VRRP split-brain | Pre-flight verification of multicast path; same virtual_router_id and auth_pass on both nodes; tested under simulated failure before going live |
| Cluster domain accidentally locked to a placeholder value | Cluster domain decided up-front and validated before init; surgical recovery path documented |
| Containers fail to start on boot, leaving the cluster degraded | All daemons configured with --restart unless-stopped (Unraid server) or systemctl enable (Pi); persistence verified via deliberate reboot test |
| Pi SD card death wipes the Secondary | Tracked in §10 (Roadmap) — migration to USB SSD planned |
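The split-brain mitigation relies on both nodes agreeing on virtual_router_id and auth_pass, which can be checked mechanically before going live. The following is a rough sketch that assumes one vrrp_instance block per keepalived.conf; the instance name, interface, router ID, priorities, and password in the sample are hypothetical values, not the deployed configuration. Mismatches are reported by key name only, so the shared secret never reaches a log.

```python
import re
from typing import Dict, List

# Hypothetical Primary config, for illustration only.
SAMPLE_PRIMARY = """
vrrp_instance DNS_VIP {
    state MASTER
    interface br0
    virtual_router_id 57
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass example1
    }
    virtual_ipaddress {
        192.168.86.57/24
    }
}
"""


def parse_instance(conf: str) -> Dict[str, str]:
    """Extract the settings that matter for VRRP agreement from one config."""
    settings: Dict[str, str] = {}
    for key in ("virtual_router_id", "priority", "advert_int", "auth_pass"):
        match = re.search(rf"^\s*{key}\s+(\S+)", conf, flags=re.MULTILINE)
        if match:
            settings[key] = match.group(1)
    vips = re.search(r"virtual_ipaddress\s*\{([^}]*)\}", conf)
    settings["virtual_ipaddress"] = " ".join(vips.group(1).split()) if vips else ""
    return settings


def find_mismatches(primary_conf: str, secondary_conf: str) -> List[str]:
    """List settings that must match but differ, without printing their values."""
    a, b = parse_instance(primary_conf), parse_instance(secondary_conf)
    problems = []
    for key in ("virtual_router_id", "auth_pass", "advert_int", "virtual_ipaddress"):
        if a.get(key) != b.get(key):
            problems.append(f"{key} differs between nodes")
    if a.get("priority") == b.get("priority"):
        problems.append("priority is identical, so the intended master is not explicit")
    return problems
```

An empty list from `find_mismatches` means the two files agree on the shared settings and differ in priority, which is the precondition for the Unraid server holding the VIP in normal operation.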
10. Roadmap & Future Hardening
| Item | Driver | Status |
|---|---|---|
| Migrate Pi root filesystem from SD card to USB SSD | SD cards have a known endurance ceiling for always-on workloads; this is the last significant single-component reliability risk on the Secondary node | Planned |
| Independent observability for the Secondary | The VIP can mask Secondary degradation if keepalived and Technitium fail in correlated ways. A health probe on the Pi’s standalone IP plus an Uptime Kuma monitor are in place; the next step is per-zone and replication-lag alerting | In progress |
| Pi-only extended failover drill | Validate that the cluster runs cleanly on the Secondary alone for an extended window, in case the Unraid server is offline for maintenance | Planned |
| Tertiary node | Currently rejected — a two-node cluster is sufficient given the LAN’s blast radius and the cost/complexity of a third node. Revisit if the homelab gains a second always-on Linux host for unrelated reasons | Won’t do (yet) |
11. Acceptance & Sign-off
The migration is considered complete when all Success Metrics in §8 meet their targets and:
- dig @192.168.86.57 google.com returns a recursive answer from any LAN client.
- dig @192.168.86.57 <internal-host>.lan returns the correct internal address for an internal-zone hostname.
- The UDM DHCP DNS entry is back to a single value (192.168.86.57) with the safety-net 1.1.1.1 removed.
- system-monitor.sh reports both DNS nodes healthy on every 5-minute run.
- An Uptime Kuma monitor watches the Pi’s standalone IP independently of the VIP.
- Pi-hole and Unbound services are stopped and disabled; binaries retained for warm rollback for ~30 days, then purged.
Sign-off documentation:
- The original migration plan and pre-flight verification document are archived.
- Recovery documentation, the project’s CLAUDE.md, the Technitium service doc, and the Pi-hole service doc all reflect the new architecture.
- The Network Redesign Memo’s recommendations for Pi promotion to resilience node and automatic DNS failover are marked as resolved by this work.