Technitium HA DNS Cluster Migration

1. Executive Summary & Outcome Hypothesis

The home network’s DNS ran on a single instance of Technitium DNS Server in a Docker container on the Unraid media server. That made the Unraid host a single point of failure for all LAN name resolution: a container crash, host reboot, or hardware fault left every device without name resolution until someone intervened manually.

This work delivers a two-node Technitium HA cluster — Primary on Unraid, Secondary on a Raspberry Pi 4 — with native cluster replication for configuration and keepalived / VRRP managing a single virtual IP for client-facing failover. Pi-hole and Unbound, the legacy stack the original Technitium deployment had already replaced, are decommissioned.

Outcome hypothesis. If two Technitium replicas run with native cluster sync and a VRRP-managed virtual IP, then a single-node failure will disrupt LAN name resolution for under five seconds with no client-side reconfiguration. Validated post-cutover: under one second of disruption observed across four induced failure modes (Technitium stopped on Primary, Primary host reboot, keepalived stopped on Primary, keepalived stopped on Secondary).

2. Strategic Context & Alignment

This is the second project in a connected homelab arc. The first, the Autonomous Homelab Sentinel, introduced a monitoring-first agent layer with a human approval gateway for high-risk operations. This project extends the thesis from “we can detect and respond to failures” to “the most critical infrastructure layer survives failures without intervention in the first place.” Together they push the homelab toward operating as a small SRE program rather than a collection of always-on hobby boxes.

The work also closes the two open recommendations from the Network Redesign Memo: promoting the idle Pi to a resilience node, and introducing automatic DNS failover so that name resolution survives any single-host failure on the LAN. Both are marked as resolved by this delivery.

3. Problem Statement

The homelab’s network DNS started life on a Raspberry Pi running Pi-hole (ad/tracker/malware blocking) with Unbound (recursive resolution). The two tools worked but were stitched together rather than designed as one system: limited API, ageing admin UI, no mobile access, and no visibility into the recursive layer. Technitium DNS Server replaced both, consolidating recursion, blocklists, allow/deny lists, and internal zone management into a single engine running as a Docker container on the Unraid media server. That migration freed the Pi and consolidated DNS — but it also made the Unraid server the single point of failure for all LAN name resolution.

If the Technitium container on the Unraid server dies (process crash, container OOM, host reboot, hardware failure), every device on the network loses name resolution until a human notices and intervenes. The Pi sits idle and could be picking up the load instead. There is no automatic recovery path.

The affected users are every device on the LAN — including IoT devices that don’t tolerate DNS outages gracefully — and any human on the network who notices immediately when everyday services stop responding.

4. Goals & Non-Goals

Goals

Eliminate the single point of failure for network DNS. When the Unraid server dies, name resolution continues without manual intervention.
Single virtual IP for all clients. Clients see one IP and that IP keeps answering. No client-side reconfiguration; no per-device split-DNS.
One source of truth for configuration. Adding a blocked domain, an allowed domain, a new internal zone record, or an Advanced Blocking group remains a single action — not “do it on the Unraid server, then do it on the Pi.”
Decommission Pi-hole and Unbound entirely. One DNS engine, two replicas.
Recovery time under 5 seconds end-to-end. The interval from “Unraid server’s Technitium dies” to “Pi answers DNS on the same IP that clients have always queried” is under five seconds.

Non-Goals

Replacing the public DNS layer (Cloudflare). External hostnames continue to resolve via Cloudflare’s authoritative DNS; this work is purely about the LAN-side resolver.
IPv6 enablement. The LAN is IPv4-only by design.
Tailscale DNS integration.
Replacing UDM as the DHCP server, gateway, or firewall.
Geographic DNS or multi-region failover.

5. Architecture & Discovery Plan

Target architecture

Two layers, conceptually independent.

Cluster layer (config replication). Technitium-native. One Primary, one Secondary; all zones, blocklists, allow/deny, app configs, and admin auth flow Primary to Secondary. The Pi is a faithful mirror.

VRRP layer (IP failover). keepalived on each node negotiates ownership of 192.168.86.57. The Unraid server has higher priority and holds the IP normally. If the Unraid server’s Technitium fails its local health check (a DNS query against itself), keepalived demotes it; the Pi promotes itself within 1 to 3 seconds and starts answering on .57.
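
As an illustration of the local health check described above, the following is a minimal sketch of a "DNS query against itself" written with the Python standard library only. It is not the script used in this deployment; the probe hostname (`example.com`), the one-second timeout, and the function names are all assumptions made for the example. The probe reports healthy only when the local resolver returns a matching response with RCODE NOERROR.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID for the probe (assumption)


def build_query(name: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def is_healthy_response(data: bytes, query_id: int = QUERY_ID) -> bool:
    """True if the reply echoes our ID, is a response, and has RCODE 0."""
    if len(data) < 12:
        return False
    resp_id, flags = struct.unpack("!HH", data[:4])
    is_response = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F
    return resp_id == query_id and is_response and rcode == 0


def check(server: str = "127.0.0.1", name: str = "example.com",
          timeout: float = 1.0) -> bool:
    """Send one UDP query to the local resolver and judge the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and connection errors
            return False
    return is_healthy_response(data)


def main() -> int:
    """Exit status for a caller: 0 = healthy, 1 = unhealthy."""
    return 0 if check() else 1
```

Run as a script (for example with `sys.exit(main())`), a check like this could be referenced from a keepalived `vrrp_script` block and attached to the VRRP instance through `track_script`. Probing a name from an internal zone rather than a public one would keep the check independent of upstream connectivity; which name to probe is a design choice this sketch leaves open.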

The two layers are decoupled deliberately. The cluster handles “what is the truth”; VRRP handles “where does the truth answer.” Either layer can fail without taking the other down.

               UDM DHCP DNS pushes ONLY 192.168.86.57
                              |
                      ┌───────▼────────┐
                      │   VIP .57      │     keepalived/VRRP managed
                      │  (on master)   │
                      └────────────────┘
                        ▲             ▲
                        │             │
        ┌───────────────┴─┐         ┌─┴───────────────┐
        │  UNRAID SRV .56 │         │    PI .192      │
        │  Primary node   │  ────►  │  Secondary node │
        │  Technitium     │ cluster │  Technitium     │
        │  keepalived M   │  sync   │  keepalived B   │
        └─────────────────┘         └─────────────────┘

Discovery & pre-flight

Before any change to the live network, the migration plan was validated by eight read-only probes against the actual infrastructure: IP layout, Docker network driver and bridge behaviour, port bindings, the L2 multicast path between the Unraid server and the Pi, IP-alias semantics, host-network sysctl support on the Unraid kernel, container restart-policy persistence across reboots, and the existing Technitium API surface for cluster operations. The phase took roughly 90 minutes and disproved several pessimistic assumptions in the original plan, collapsing the execution estimate from a half-weekend to about three hours.

The principle behind the phase: every assumption that influences a sequenced live change should be verified or ruled out before the live work starts, so that execution proceeds on confirmed facts only. Live execution then followed a “stop on first surprise” rule: any divergence from the plan paused the work until a hypothesis had been formed and a revise-or-abandon decision made. That rule was triggered three times during execution and prevented the kind of compounding misconfiguration that turns a Sunday evening project into a weekend one.

6. Assumptions & Constraints

Assumptions

Both nodes run Technitium 15.2+ (clustering became a first-class feature in November 2025).
The LAN is a single broadcast domain; VRRP multicast flows between the Unraid server and the Pi without unicast_peer mode.
The Pi has enough idle capacity to run Technitium and keepalived alongside whatever else it does.
The Unraid server has CPU/RAM headroom for a keepalived Docker container alongside Technitium and the existing 19+ containers.
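
The multicast assumption above can be checked by listening on one node for the other node's VRRP advertisements. The sketch below is one possible way to do that in Python, not the probe used during pre-flight. It relies on two facts from the VRRP specification: advertisements are sent to the multicast group 224.0.0.18 and carried as IP protocol 112. The function names and the five-second timeout are assumptions; the listener needs root privileges because it opens a raw socket.

```python
import socket
import struct
from typing import NamedTuple

VRRP_MULTICAST_GROUP = "224.0.0.18"  # VRRP multicast address
VRRP_IP_PROTOCOL = 112               # IP protocol number assigned to VRRP


class Advert(NamedTuple):
    version: int
    vrid: int
    priority: int
    address_count: int


def strip_ipv4_header(packet: bytes) -> bytes:
    """Raw IPv4 sockets deliver the IP header too; skip it using the IHL field."""
    header_len = (packet[0] & 0x0F) * 4
    return packet[header_len:]


def parse_advert(payload: bytes) -> Advert:
    """Parse the leading fields of a VRRP advertisement (IP header removed)."""
    if len(payload) < 8:
        raise ValueError("VRRP header is at least 8 bytes")
    version_type, vrid, priority, count = struct.unpack("!BBBB", payload[:4])
    if version_type & 0x0F != 1:  # type 1 = ADVERTISEMENT
        raise ValueError("not an ADVERTISEMENT packet")
    return Advert(version_type >> 4, vrid, priority, count)


def listen_once(local_ip: str, timeout: float = 5.0) -> Advert:
    """Wait for one advertisement arriving on the interface that owns local_ip.

    Requires root. Raises a timeout error if nothing arrives, which would
    suggest the multicast path is blocked or no peer is advertising.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, VRRP_IP_PROTOCOL) as sock:
        membership = (socket.inet_aton(VRRP_MULTICAST_GROUP)
                      + socket.inet_aton(local_ip))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        sock.settimeout(timeout)
        packet, _ = sock.recvfrom(65535)
    return parse_advert(strip_ipv4_header(packet))
```

Calling `listen_once` on the Pi with its own LAN address while keepalived runs on the Unraid server should return an `Advert` whose `vrid` matches the configured `virtual_router_id`. A timeout would indicate that `unicast_peer` mode needs to be considered after all.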

Constraints

Single DHCP DNS entry. All clients point at one IP; failover is handled at the infrastructure layer, not the client layer.
No DNS outage during the migration. A short safety-net window using 1.1.1.1 as a temporary second DHCP DNS entry is acceptable purely during the change, reverted immediately afterwards.
No new credentials in chat. Any secrets are written to ~/.credentials/ via interactive read -s prompts; nothing pasted into a session log.
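
The credentials constraint names an interactive shell `read -s` prompt. For tooling written in Python, a roughly equivalent pattern can be sketched with `getpass`, which also reads without echoing. This is a substitute technique shown for illustration, not the method used in the migration; the function names are assumptions, and only the `~/.credentials/` location comes from the constraint above.

```python
import getpass
import os
from pathlib import Path


def write_secret(path: Path, secret: str) -> None:
    """Write a secret to a file that only the current user can read."""
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Create with mode 0o600 so the file is never group- or world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    # Tighten permissions in case the file already existed with a looser mode.
    os.chmod(path, 0o600)


def prompt_and_store(name: str, directory: Path = Path.home() / ".credentials") -> Path:
    """Prompt without echo and store the value under directory/name."""
    secret = getpass.getpass(f"{name}: ")
    target = directory / name
    write_secret(target, secret)
    return target
```

Because the value is typed at a no-echo prompt and written straight to disk, it never appears in shell history or in a pasted session log.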

7. Non-Functional Requirements

Category	Requirement	Target
Privacy	LAN DNS queries never reach a third-party recursive resolver. Recursion stays local; no DoH forwarders.	0 forwarded queries
Availability	A single-node failure must not produce a measurable client-side outage.	MTTR ≤ 5s; failed queries during induced failover = 0
Consistency	Configuration changes on the Primary appear on the Secondary without manual sync.	Replication lag ≤ 30s
Operability	Every migration phase has a clean, documented rollback path.	Rollback time ≤ 10 min per phase
Manageability	Adding a zone record, blocklist entry, or Advanced Blocking rule is a single action across the cluster.	1 admin UI / 1 API call

8. Success Metrics

Metric	Target	Observed
Failover MTTR (Unraid → Pi)	≤ 5s	< 1s
Failback time (Pi → Unraid after Primary recovery)	≤ 15s	~10s
Failed DNS queries during induced failure	0	0 across 4 scenarios
Cluster replication lag (config change → secondary visible)	≤ 30s	~5s
Single source of truth for config	1 admin UI	✓
DHCP DNS entries pushed by UDM	1 (`192.168.86.57`)	✓

Failover scenarios validated: Technitium stopped on Primary, Primary host reboot, keepalived stopped on Primary, keepalived stopped on Secondary.
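
The MTTR and failed-query figures above imply some form of repeated probing against the VIP during each induced failure. How the measurements were actually taken is not recorded here; the following is a sketch of one way such numbers could be derived from a series of timestamped probe results. The sampling interval, the function names, and the definition of an outage (time from the last successful probe before a failure run to the next successful probe, which is an upper bound) are assumptions.

```python
import time
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded?)


def collect(probe: Callable[[], bool], duration: float,
            interval: float = 0.2) -> List[Sample]:
    """Call probe() repeatedly for `duration` seconds and record the results."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe()))
        time.sleep(interval)
    return samples


def failed_count(samples: Iterable[Sample]) -> int:
    """Number of probes that did not get an answer."""
    return sum(1 for _, ok in samples if not ok)


def longest_outage(samples: Iterable[Sample]) -> float:
    """Upper bound on the longest outage, in seconds.

    Each run of failures is measured from the last success before it
    (or from its first failure, if there was none) to the next success
    (or to its last failure, if the run never ends).
    """
    worst = 0.0
    last_ok = None    # timestamp of the most recent successful probe
    run_start = None  # start of the current failure run, if any
    last_ts = None
    for ts, ok in samples:
        if ok:
            if run_start is not None:
                worst = max(worst, ts - run_start)
                run_start = None
            last_ok = ts
        elif run_start is None:
            run_start = last_ok if last_ok is not None else ts
        last_ts = ts
    if run_start is not None and last_ts is not None:
        worst = max(worst, last_ts - run_start)
    return worst
```

In use, `probe` would be any function that sends one DNS query to 192.168.86.57 and returns whether it was answered. A run with no failed probes yields a failed count of 0 and an outage of 0.0, in which case the true disruption is shorter than the sampling interval.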

9. Risks & Mitigation

Risk	Mitigation
DNS outage during the migration window	Temporary `1.1.1.1` DHCP DNS fallback on UDM during execution, reverted immediately after VIP cutover
Misconfiguration leading to VRRP split-brain	Pre-flight verification of multicast path; same `virtual_router_id` and `auth_pass` on both nodes; tested under simulated failure before going live
Cluster domain accidentally locked to a placeholder value	Cluster domain decided up-front and validated before init; surgical recovery path documented
Containers fail to start on boot, leaving the cluster degraded	All daemons configured with `--restart unless-stopped` (Unraid server) or `systemctl enable` (Pi); persistence verified via deliberate reboot test
Pi SD card death wipes the Secondary	Tracked in §10 (Roadmap) — migration to USB SSD planned
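
The split-brain mitigation depends on both nodes agreeing on `virtual_router_id` and `auth_pass` while the Primary keeps the higher priority. A small consistency check over the two keepalived configuration files is one way to catch a mismatch before going live. The sketch below is illustrative only: the sample configuration values (instance name, interface names, router ID, priorities, password, and the /24 prefix) are placeholders invented for the example, not this deployment's real settings.

```python
import re
from typing import Dict, List

SHARED_KEYS = ("virtual_router_id", "auth_pass")  # must match on both nodes

# Hypothetical Primary config; every value here is a placeholder.
PRIMARY = """
vrrp_instance DNS_VIP {
    state MASTER
    interface br0
    virtual_router_id 57
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass example1
    }
    virtual_ipaddress {
        192.168.86.57/24
    }
}
"""

# Hypothetical Secondary config derived from the Primary placeholder.
SECONDARY = (PRIMARY.replace("state MASTER", "state BACKUP")
                    .replace("interface br0", "interface eth0")
                    .replace("priority 150", "priority 100"))


def read_settings(conf_text: str) -> Dict[str, str]:
    """Extract the first value of each keyword of interest from config text."""
    settings: Dict[str, str] = {}
    for key in SHARED_KEYS + ("priority",):
        match = re.search(rf"^\s*{key}\s+(\S+)", conf_text, re.MULTILINE)
        if match:
            settings[key] = match.group(1)
    return settings


def find_mismatches(primary: str, secondary: str) -> List[str]:
    """Return a list of human-readable problems; empty means consistent."""
    a, b = read_settings(primary), read_settings(secondary)
    problems = [
        f"{key} differs or is missing"
        for key in SHARED_KEYS
        if a.get(key) is None or a.get(key) != b.get(key)
    ]
    if "priority" not in a or "priority" not in b:
        problems.append("priority is missing on at least one node")
    elif int(a["priority"]) <= int(b["priority"]):
        problems.append("priority on the Primary is not higher than on the Secondary")
    return problems
```

For the placeholder pair above, `find_mismatches(PRIMARY, SECONDARY)` returns an empty list. In practice the two strings would be read from each node's keepalived configuration file.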

10. Roadmap & Future Hardening

Item	Driver	Status
Migrate Pi root filesystem from SD card to USB SSD	SD cards have a known endurance ceiling for always-on workloads; this is the last significant single-component reliability risk on the Secondary node	Planned
Independent observability for the Secondary	The VIP can mask Secondary degradation if `keepalived` and Technitium fail in correlated ways. A health probe on the Pi’s standalone IP plus an Uptime Kuma monitor are in place; the next step is per-zone and replication-lag alerting	In progress
Pi-only operation drill	Validate that the cluster runs cleanly on the Secondary alone for an extended window, in case the Unraid server is offline for maintenance	Planned
Tertiary node	Currently rejected — a two-node cluster is sufficient given the LAN’s blast radius and the cost/complexity of a third node. Revisit if the homelab gains a second always-on Linux host for unrelated reasons	Won’t do (yet)

11. Acceptance & Sign-off

The migration is considered complete when all Success Metrics in §8 meet their targets and:

dig @192.168.86.57 google.com returns a recursive answer from any LAN client.
dig @192.168.86.57 <internal-host>.lan returns the correct internal address for an internal-zone hostname.
The UDM DHCP DNS entry is back to a single value (192.168.86.57) with the safety-net 1.1.1.1 removed.
system-monitor.sh reports both DNS nodes healthy on every 5-minute run.
An Uptime Kuma monitor watches the Pi’s standalone IP independently of the VIP.
Pi-hole and Unbound services are stopped and disabled; binaries retained for warm rollback for ~30 days, then purged.

Sign-off documentation:

The original migration plan and pre-flight verification document are archived.
Recovery documentation, the project’s CLAUDE.md, the Technitium service doc, and the Pi-hole service doc all reflect the new architecture.
The Network Redesign Memo’s recommendations for Pi promotion to resilience node and automatic DNS failover are marked as resolved by this work.

Version

Last Updated

Author