AI-Orchestrated Homelab Supervisor

A multi-agent autonomous system using n8n, LangChain, and Model Context Protocol (MCP) to manage, heal, and interact with a complex Unraid & Home Assistant environment.

Role

Automation Architect

Timeline

Active / Iterative Refinement

Outcome

Production-ready AI Supervisor capable of natural language interaction, autonomous container recovery, and human-in-the-loop approval flows.

Tech Stack

n8n, LangChain, OpenAI (GPT-4), Claude Code, Uptime Kuma, Docker (Unraid), Telegram Bot API, PostgreSQL

The Concept

The “Maintenance Tax” of a homelab is real. Traditional automation scripts are brittle: they break when IP addresses change or log formats shift.

This project moves beyond simple scripting to build an AI-Orchestrated Supervisor. Instead of hard-coded “if/then” rules, the system uses a central AI Agent (built in n8n with LangChain) that understands intent, reasons about system state, and routes tasks to specialised “Tool” workflows. It acts not just as a watchdog, but as a Tier-1 Site Reliability Engineer (SRE) that lives inside the server.

Technical Architecture

The system follows a Supervisor-Worker Pattern designed for modularity and safety.

1. Ingest Layer

The system accepts inputs from multiple asynchronous sources:

  • User Commands: Natural language via Telegram (e.g., “Why is Plex unreachable?” or “Restart Home Assistant”).
  • System Alerts: Webhooks from Uptime Kuma when services (Sonarr, Radarr, Pi-hole) transition state.
  • Scheduled Events: Cron-based health checks triggered internally by n8n.
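One way to picture the ingest layer is as a set of small adapters that turn each source into the same event shape before the Supervisor sees it. The sketch below is illustrative only: the `Event` class and the adapter functions are hypothetical names, and the Uptime Kuma payload fields (`service`, `status`) are assumed for the example rather than taken from its webhook schema. Only the Telegram `message.text` path follows the Bot API.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical common envelope handed to the Supervisor."""
    source: str  # "telegram", "uptime_kuma", or "schedule"
    kind: str    # "command", "alert", or "health_check"
    text: str    # human-readable description for the agent prompt


def from_telegram(update: Dict[str, Any]) -> Event:
    # Telegram Bot API updates carry the user's text at message.text.
    return Event("telegram", "command", update["message"]["text"])


def from_uptime_kuma(payload: Dict[str, Any]) -> Event:
    # ASSUMPTION: "service" and "status" are placeholder field names,
    # not the real Uptime Kuma webhook schema.
    text = f"{payload['service']} is now {payload['status']}"
    return Event("uptime_kuma", "alert", text)


def from_schedule(check_name: str) -> Event:
    # Cron-style triggers carry no payload beyond the name of the check.
    return Event("schedule", "health_check", f"Run health check: {check_name}")
```

With a shared envelope, the Supervisor prompt only needs to handle one input shape regardless of where the event originated.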

2. The Supervisor Agent (The Brain)

Unlike a standard automation script, the core is an AI Chain (LangChain/OpenAI node).

  • Context Awareness: It knows the “Rules of Engagement” (e.g., Never delete data, Ask for approval on destructive actions).
  • Routing: It analyses the input and decides which Tool Workflow to call. It does not execute commands directly; it delegates.
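The delegation rule can be sketched as a dispatcher that accepts a structured decision from the model and only ever calls a registered Tool. This is a simplified stand-in for what the n8n agent node does, not the project's actual workflow: the JSON decision format, the `TOOLS` registry, and the stub tool functions are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict

# A tool takes (action, target) and returns a result string.
ToolFn = Callable[[str, str], str]


def docker_ops(action: str, target: str) -> str:
    # Stub: a real tool would invoke the Docker Ops sub-workflow.
    return f"docker_ops:{action}:{target}"


def home_assistant(action: str, target: str) -> str:
    # Stub: a real tool would call the Home Assistant API.
    return f"home_assistant:{action}:{target}"


# Hypothetical registry: the agent may only name tools listed here.
TOOLS: Dict[str, ToolFn] = {
    "docker_ops": docker_ops,
    "home_assistant": home_assistant,
}


def dispatch(decision_json: str) -> str:
    """Route a model decision to a tool; never execute free-form commands."""
    decision = json.loads(decision_json)
    tool = TOOLS.get(decision.get("tool"))
    if tool is None:
        raise ValueError(f"Unknown tool: {decision.get('tool')!r}")
    return tool(decision["action"], decision["target"])
```

Keeping the model's output to a tool name plus arguments means a bad or manipulated response can at worst select a permitted tool, never run an arbitrary command.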

3. Tool Workflows

Complex operations are encapsulated into deterministic, reusable sub-workflows (“Tools”):

  • Tool - Docker Ops: A “safe” wrapper around the Docker CLI. It connects via SSH to the Unraid host to list, inspect, or restart containers. It strictly enforces a “No Delete” policy.
  • Tool - Home Assistant: Interactions with the HA API to check entity states or restart the Core service gracefully.
  • Tool - Log Analyst: (In Progress) Retrieves the last N lines of logs for error diagnosis.
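A “No Delete” policy is easiest to enforce with an allowlist: the wrapper builds commands only for known-safe actions and refuses everything else. The following is a minimal sketch under that assumption. The `ALLOWED` table and `build_command` are hypothetical names, and the real sub-workflow may enforce the policy differently. The resulting string would be what gets sent over SSH to the Unraid host.

```python
import shlex
from typing import Dict, List, Optional

# Allowlisted actions mapped to their base docker argv.
# Destructive subcommands (rm, rmi, prune, ...) are deliberately absent.
ALLOWED: Dict[str, List[str]] = {
    "list": ["docker", "ps", "--all"],
    "inspect": ["docker", "inspect"],
    "restart": ["docker", "restart"],
}


def build_command(action: str, container: Optional[str] = None) -> str:
    """Return a shell-safe docker command, or raise if the action is not allowed."""
    if action not in ALLOWED:
        raise PermissionError(f"Action {action!r} is not permitted")
    argv = list(ALLOWED[action])
    if action != "list":
        if not container:
            raise ValueError(f"Action {action!r} requires a container name")
        argv.append(container)
    # shlex.join quotes each argument, so a hostile container name
    # cannot smuggle extra shell commands into the SSH session.
    return shlex.join(argv)
```

Because the policy lives in the wrapper rather than in the prompt, it holds even if the agent is persuaded to request a deletion.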

4. Human-in-the-Loop

Autonomy has limits. High-risk actions trigger the Approval Gateway:

  1. Agent proposes an action (e.g., “Stop Container X”).
  2. System sends a Telegram message with Interactive Buttons (Approve / Deny).
  3. The workflow pauses (Wait Node) until a callback is received.
  4. Execution resumes only upon explicit human authorisation.
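The four steps above can be sketched as a small gateway that stores each proposed action under a request ID and releases it only when a matching “approve” callback arrives. This is an in-memory illustration, not the n8n Wait Node implementation: the `ApprovalGateway` class and the `approve:<id>` / `deny:<id>` callback format are assumptions. The message shape mirrors Telegram's inline keyboard, where each button carries a `callback_data` string.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ApprovalGateway:
    # request_id -> proposed action, held until a human responds
    pending: Dict[str, str] = field(default_factory=dict)

    def propose(self, action: str) -> Dict[str, Any]:
        """Register an action and build the Telegram message asking for approval."""
        request_id = uuid.uuid4().hex[:8]
        self.pending[request_id] = action
        return {
            "text": f"Approve action: {action}?",
            "reply_markup": {
                "inline_keyboard": [[
                    {"text": "Approve", "callback_data": f"approve:{request_id}"},
                    {"text": "Deny", "callback_data": f"deny:{request_id}"},
                ]]
            },
        }

    def resolve(self, callback_data: str) -> Optional[str]:
        """Return the action to run if approved; None if denied, unknown, or reused."""
        verdict, _, request_id = callback_data.partition(":")
        # pop() makes each request single-use, so a replayed button press does nothing.
        action = self.pending.pop(request_id, None)
        if action is None:
            return None
        return action if verdict == "approve" else None
```

Making each request single-use matters in practice: an old Approve button left in the chat history cannot trigger the action a second time.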

Current Capabilities

The system is currently live on an Unraid host (managed via n8n-mcp) with the following active components:

  • Chat Home Assistant: A unified chat interface that allows natural language control of both IoT devices (via HA) and Infrastructure (via Docker).
  • Docker Management Tool: A fully operational sub-workflow that performs safe container operations via SSH.
  • Approval Gateway: A universal logic flow for handling Telegram button callbacks for sensitive operations.

Roadmap & Evolution

Phase 2: The “Supervisor” Logic (Active)

Migrating from direct point-to-point workflows to a central Router Agent. This will allow the system to handle complex, multi-step requests (e.g., “Check if Radarr is down, and if so, check the VPN container status before restarting”).

Phase 3: Long-Term Memory (Vector Store)

Implementing a Vector Database (Qdrant or Pinecone) to allow the agents to “remember” past incidents.

  • Goal: If Plex crashes 3 times in a week with the same error, the Agent shouldn’t just restart it; it should cite the pattern and suggest a permanent fix.
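The goal can be illustrated without a vector store at all. The sketch below swaps in exact string matching on the error message, where the planned design would use similarity search over embeddings, so it only catches identical errors. The `recurring_errors` function and the incident tuple layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

# (service, error message, time of incident)
Incident = Tuple[str, str, datetime]


def recurring_errors(
    incidents: Iterable[Incident],
    now: datetime,
    window: timedelta = timedelta(days=7),
    threshold: int = 3,
) -> List[Tuple[str, str]]:
    """Return (service, error) pairs seen at least `threshold` times within `window`."""
    counts: Dict[Tuple[str, str], int] = {}
    for service, error, when in incidents:
        # Ignore incidents older than the window.
        if now - when <= window:
            key = (service, error)
            counts[key] = counts.get(key, 0) + 1
    return [key for key, n in counts.items() if n >= threshold]
```

A vector store would generalise the exact-match key to “errors close to this one”, so that the same fault is still recognised when timestamps or IDs in the log line differ.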

Phase 4: Self-Healing with Diagnosis

Integrating local LLMs (Ollama) to parse raw logs during an incident. The system will move from “Auto-Restart” to “Auto-Diagnosis,” providing a summary of why the service crashed before attempting recovery.