AI-Orchestrated Homelab Supervisor

The Concept

The “Maintenance Tax” of a homelab is real. Traditional automation scripts are brittle: they break when an IP address changes or a log format shifts.

This project moves beyond simple scripting to build an AI-Orchestrated Supervisor. Instead of hard-coded “if/then” rules, the system uses a central AI Agent (built in n8n with LangChain) that understands intent, reasons about system state, and routes tasks to specialised “Tool” workflows. It acts not just as a watchdog, but as a Tier-1 Site Reliability Engineer (SRE) that lives inside the server.

Technical Architecture

The system follows a Supervisor-Worker Pattern designed for modularity and safety.

1. Ingest Layer

The system accepts inputs from multiple asynchronous sources:

User Commands: Natural language via Telegram (e.g., “Why is Plex unreachable?” or “Restart Home Assistant”).
System Alerts: Webhooks from Uptime Kuma when services (Sonarr, Radarr, Pi-hole) transition state.
Scheduled Events: Cron-based health checks triggered internally by n8n.
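
One way to picture the ingest layer is as a normaliser that turns each source into a common event before the agent sees it. The sketch below is illustrative only: the Event fields and the payload keys ("text", "service", "status", "check") are assumptions made for this example, not the actual Telegram, Uptime Kuma, or n8n schemas.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    source: str   # "telegram", "uptime_kuma", or "cron"
    kind: str     # "command", "alert", or "health_check"
    subject: str  # service or check name, "" when not known yet
    text: str     # human-readable content handed to the agent


def normalize(source: str, payload: Dict[str, Any]) -> Event:
    """Map a raw payload to a common Event (payload keys are hypothetical)."""
    if source == "telegram":
        return Event(source, "command", "", str(payload.get("text", "")).strip())
    if source == "uptime_kuma":
        service = str(payload.get("service", "unknown"))
        status = str(payload.get("status", "unknown"))
        return Event(source, "alert", service, f"{service} is now {status}")
    if source == "cron":
        check = str(payload.get("check", "all"))
        return Event(source, "health_check", check, f"Run scheduled health check: {check}")
    raise ValueError(f"unknown source: {source}")
```

With a single event shape, the agent needs one prompt template rather than one per source.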

2. The Supervisor Agent (The Brain)

Unlike a standard automation script, the core is an AI Chain (LangChain/OpenAI node).

Context Awareness: It knows the “Rules of Engagement” (e.g., Never delete data, Ask for approval on destructive actions).
Routing: It analyses the input and decides which Tool Workflow to call. It does not execute commands directly; it delegates.
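
The delegation idea can be sketched as a small dispatcher: the agent only names a tool and its arguments, and deterministic code decides whether that tool exists and runs it. Everything here is hypothetical, including the tool names, the stub functions, and the shape of the decision dictionary, which stands in for whatever structured output the model produces.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[[Dict[str, Any]], str]


def docker_ops(args: Dict[str, Any]) -> str:
    # Stub standing in for the Docker Ops tool workflow.
    return f"docker_ops:{args.get('action', '')}"


def home_assistant(args: Dict[str, Any]) -> str:
    # Stub standing in for the Home Assistant tool workflow.
    return f"home_assistant:{args.get('action', '')}"


# Only tools registered here can ever be called.
TOOLS: Dict[str, ToolFn] = {
    "docker_ops": docker_ops,
    "home_assistant": home_assistant,
}


def dispatch(decision: Dict[str, Any]) -> str:
    """Run the tool named in the agent's decision, or refuse."""
    name = decision.get("tool")
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None:
        return f"refused: unknown tool {name!r}"
    return tool(decision.get("args", {}))
```

Because the registry is the only path to execution, a hallucinated tool name results in a refusal rather than an arbitrary command.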

3. Tool Workflows

Complex operations are encapsulated into deterministic, reusable sub-workflows (“Tools”):

Tool - Docker Ops: A “safe” wrapper around the Docker CLI. It connects via SSH to the Unraid host to list, inspect, or restart containers. It strictly enforces a “No Delete” policy.
Tool - Home Assistant: Interactions with the HA API to check entity states or restart the Core service gracefully.
Tool - Log Analyst: (In Progress) Retrieves the last N lines of logs for error diagnosis.
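
A “No Delete” policy of this kind is commonly enforced with an allowlist: only known-safe actions map to a command, and anything else is rejected before it reaches the host. The following is a minimal sketch under that assumption. The function name, the action names, and the container-name pattern are illustrative, and the SSH transport is left out entirely.

```python
import re
from typing import Dict, List

# Only these actions can be turned into a command; rm, prune, kill, etc.
# are absent, so they cannot be produced at all.
ALLOWED_ACTIONS: Dict[str, List[str]] = {
    "list": ["docker", "ps", "--all"],
    "inspect": ["docker", "inspect"],
    "restart": ["docker", "restart"],
}

# Conservative pattern for container names (an assumption for this sketch).
CONTAINER_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def build_command(action: str, container: str = "") -> List[str]:
    """Return an argv list for an allowed action, or raise."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not permitted: {action}")
    argv = list(ALLOWED_ACTIONS[action])
    if action != "list":
        if not CONTAINER_NAME.fullmatch(container):
            raise ValueError(f"invalid container name: {container!r}")
        argv.append(container)
    return argv
```

Returning an argument list rather than a shell string keeps a crafted container name from being interpreted as extra shell syntax, provided the caller does not join it back into one string.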

4. Human-in-the-Loop

Autonomy has limits. High-risk actions trigger the Approval Gateway:

Agent proposes an action (e.g., “Stop Container X”).
System sends a Telegram message with Interactive Buttons (Approve / Deny).
The workflow pauses (Wait Node) until a callback is received.
Execution resumes only upon explicit human authorisation.
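
The four steps above amount to a small state machine: an action starts as pending and can move exactly once to approved or denied. The sketch below models that logic in memory. The class and method names are hypothetical, and the Telegram message, the buttons, and the n8n Wait Node are not represented.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PendingAction:
    description: str
    status: str = "pending"  # pending -> approved | denied


class ApprovalGateway:
    def __init__(self) -> None:
        self._actions: Dict[str, PendingAction] = {}

    def propose(self, description: str) -> str:
        """Register an action and return the token carried by the buttons."""
        token = secrets.token_urlsafe(8)
        self._actions[token] = PendingAction(description)
        return token

    def handle_callback(self, token: str, decision: str) -> Optional[PendingAction]:
        """Apply a button press; ignore unknown tokens, repeats, and bad input."""
        action = self._actions.get(token)
        if action is None or action.status != "pending":
            return None
        if decision not in ("approve", "deny"):
            return None
        action.status = "approved" if decision == "approve" else "denied"
        return action

    def may_execute(self, token: str) -> bool:
        """True only after an explicit approval."""
        action = self._actions.get(token)
        return action is not None and action.status == "approved"
```

Treating every state other than approved as "do not run" means a lost callback or a malformed one fails closed.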

Current Capabilities

The system is currently live on an Unraid host (managed via n8n-mcp) with the following active workflows:

Chat Home Assistant: A unified chat interface that allows natural language control of both IoT devices (via HA) and Infrastructure (via Docker).
Docker Management Tool: A fully operational sub-workflow that performs safe container operations via SSH.
Approval Gateway: A universal logic flow for handling Telegram button callbacks for sensitive operations.

Roadmap & Evolution

Phase 2: The “Supervisor” Logic (Active)

Migrating from direct point-to-point workflows to a central Router Agent. This will allow the system to handle complex, multi-step requests (e.g., “Check if Radarr is down, and if so, check the VPN container status before restarting”).
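
The example request can be read as a short chain of dependent checks. The sketch below spells out one possible ordering. The container names "radarr" and "vpn" are placeholders, and the branch taken when the VPN is down (escalate instead of restarting) is an assumption made for illustration, since the text does not specify it.

```python
from typing import Callable, List


def handle_radarr_request(is_running: Callable[[str], bool]) -> List[str]:
    """Return the steps taken for the example request, in order."""
    steps: List[str] = []
    if is_running("radarr"):
        steps.append("radarr is running; no action needed")
        return steps
    steps.append("radarr is down")
    if not is_running("vpn"):
        # Assumed behaviour: restarting radarr is pointless without the VPN.
        steps.append("vpn is down; escalate instead of restarting radarr")
        return steps
    steps.append("vpn is running; request restart of radarr")
    return steps
```

In the real system the agent would be expected to derive such a sequence itself; a fixed version like this is mainly useful as a test oracle for its behaviour.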

Phase 3: Long-Term Memory (Vector Store)

Implementing a Vector Database (Qdrant or Pinecone) to allow the agents to “remember” past incidents.

Goal: If Plex crashes 3 times in a week with the same error, the Agent shouldn’t just restart it—it should cite the pattern and suggest a permanent fix.
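
The rule in this goal can be checked with a simple count over recent incidents. The sketch below uses exact string matching on the error as a stand-in for the similarity search a vector store would provide. It does not use Qdrant or Pinecone, and the Incident type, the function name, and the default threshold and window are assumptions taken from the example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):
    service: str
    error: str
    at: datetime


def is_recurring(
    history: List[Incident],
    service: str,
    error: str,
    now: datetime,
    threshold: int = 3,
    window: timedelta = timedelta(days=7),
) -> bool:
    """True if the same service/error pair occurred `threshold` times in the window."""
    cutoff = now - window
    matches = [
        i for i in history
        if i.service == service and i.error == error and cutoff <= i.at <= now
    ]
    return len(matches) >= threshold
```

A vector store would replace the equality test with a nearest-neighbour query, so that differently worded log lines for the same fault still count as one pattern.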

Phase 4: Self-Healing with Diagnosis

Integrating local LLMs (Ollama) to parse raw logs during an incident. The system will move from “Auto-Restart” to “Auto-Diagnosis,” providing a summary of why the service crashed before attempting recovery.
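
The diagnosis step can be sketched as two small functions: take the tail of a log, then wrap it in a prompt for the model. No call to Ollama is made here. The function names, the default of 50 lines, and the prompt wording are all assumptions for illustration.

```python
from typing import List


def tail(log_text: str, n: int) -> List[str]:
    """Return the last n lines of a log; an empty list when n is not positive."""
    if n <= 0:
        return []
    return log_text.splitlines()[-n:]


def build_diagnosis_prompt(service: str, log_text: str, n: int = 50) -> str:
    """Build the text that would be sent to a local model for summarisation."""
    lines = tail(log_text, n)
    header = (
        f"Service '{service}' crashed. Summarise the most likely cause "
        f"from the last {len(lines)} log lines before any restart is attempted."
    )
    return header + "\n\n" + "\n".join(lines)
```

Bounding the log to its last lines keeps the prompt within the context window of a small local model.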