Project Specification: AI-Orchestrated Homelab Supervisor
Version: 2.0
Last Updated: 14 January 2026
Author: Steve James
1. Vision & Core Philosophy
The Concept: We are not building a monitoring dashboard; we are building a Digital Site Reliability Engineer (SRE).
The goal is to move beyond passive alerting (“Plex is down”) to active, autonomous management (“Plex was down; I analysed the logs, saw a database lock, and restarted the container. Service is restored.”).
Operating Principles
- Supervisor-Worker Architecture: A central “Brain” (AI Agent) reasons about intent and delegates tasks to deterministic “Tools” (Sub-workflows).
- Safety First: The AI is never given direct shell access. It must interact with infrastructure through strictly scoped tool definitions (e.g., `docker restart` is allowed; `docker rm` is physically impossible via the tool).
- Human-in-the-Loop: Any action with side effects (stopping services, changing configs) requires explicit confirmation via Telegram.
2. Technical Architecture
The system runs on Unraid and is orchestrated entirely within n8n, utilising the Model Context Protocol (MCP) to standardise how the AI interacts with the environment.
A. Ingestion Layer
- Uptime Kuma Webhooks: The primary trigger for system health.
- Payload: JSON data containing Service Name, Status (UP/DOWN), and Error Message.
- Role: Immediate detection of failures.
- Telegram Bot: The primary interface for human interaction.
- Role: Accepts natural language commands (“Check on Radarr”) and delivers status reports/approvals.
- Scheduled Cron: n8n-native triggers for routine tasks (e.g., “Check disk space every morning”).
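The ingestion layer above can be sketched as a small normaliser that flattens an incoming webhook into the three fields the Supervisor cares about. Note this is a sketch: Uptime Kuma's payload shape is configurable per notification, so the `monitor` / `heartbeat` keys used here are assumptions, not a guaranteed schema.

```python
# Minimal sketch of normalising an Uptime Kuma webhook payload into
# Service Name / Status / Error Message. The "monitor" and "heartbeat"
# keys are assumptions about the payload shape, which is configurable.

def normalise_alert(payload: dict) -> dict:
    """Flatten a webhook payload into the fields the Supervisor uses."""
    monitor = payload.get("monitor", {})
    heartbeat = payload.get("heartbeat", {})
    return {
        "service": monitor.get("name", "unknown"),
        # Assumed convention: heartbeat status 1 == UP, anything else == DOWN.
        "status": "UP" if heartbeat.get("status") == 1 else "DOWN",
        "error": heartbeat.get("msg", ""),
    }

alert = normalise_alert({
    "monitor": {"name": "Plex"},
    "heartbeat": {"status": 0, "msg": "connect ECONNREFUSED"},
})
```

In n8n this would live in a Code node directly after the Webhook trigger, so every downstream branch sees the same three fields regardless of which monitor fired.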
B. Supervisor Agent
- Core Node: A LangChain Orchestrator running on n8n.
- Logic: It does not execute commands directly. It interprets the input (e.g., “Alert: Sonarr is down”) and decides which tool to call.
- Memory (Planned): A Vector Store (Qdrant/Pinecone) to retain context of recent failures (“Sonarr failed 3 times this hour; stop restarting and alert the human”).
C. Tool Layer
Deterministic n8n sub-workflows that perform the actual work.
- Tool - Docker Ops:
- Connects via SSH to Unraid.
- Capabilities: `list`, `inspect`, `logs`, `restart`.
- Constraint: Strictly read-only or safe-restart. No deletion.
- Tool - Home Assistant Ops:
- Connects via API to HA.
- Capabilities: Restart Core, check entity state, toggle specific switches.
- Tool - Network Ops:
- Interacts with Pi-hole to check DNS status or temporarily disable blocking.
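The "physically impossible" constraint on the Tool Layer can be sketched as an allowlist that the command string is built from, rather than a denylist checked afterwards: a verb outside the allowlist simply cannot be turned into a command. Function and variable names here are illustrative, not part of any n8n API.

```python
# Sketch of the Docker Ops safety constraint: commands are constructed
# only from an allowlist of verbs, so "docker rm" cannot exist as an
# output of this tool. The container name is validated too, because it
# is interpolated into a shell command executed over SSH.

import re

ALLOWED_VERBS = {"list", "inspect", "logs", "start", "restart"}  # no rm/stop
NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$")  # Docker name charset

def build_docker_command(verb: str, container: str) -> str:
    if verb not in ALLOWED_VERBS:
        raise PermissionError(f"verb {verb!r} is not in the tool allowlist")
    if not NAME_RE.match(container):
        raise ValueError(f"invalid container name: {container!r}")
    return f"docker {verb} {container}"
```

Validating the container name matters as much as the verb: without it, a name like `plex; rm -rf /` would smuggle a second command into the SSH session.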
3. Functional Requirements
1. Autonomous Remediation (The “Self-Healing” Loop)
Scenario: Uptime Kuma detects “Plex” is DOWN.
- Step 1: Ingest webhook.
- Step 2: Supervisor Agent correlates “Plex” to the `plex` container.
- Step 3: Supervisor calls Tool - Docker Ops to check container status.
- If “Exited”: Call `docker start`.
- If “Unhealthy”: Call `docker restart`.
- Step 4: Verify recovery (Wait 30s -> HTTP Check).
- Step 5: Notify User via Telegram: “Plex went down. I restarted it. It is back up.”
2. Intelligent Diagnostics (The “SRE” Loop)
Scenario: User asks Telegram, “Why is Radarr acting up?”
- Step 1: Supervisor interprets “acting up” as a request for health & logs.
- Step 2: Calls Tool - Docker Ops to fetch the last 50 lines of logs.
- Step 3: Calls LLM (OpenAI/Claude) to analyse the logs for keywords (e.g., “Database locked”, “API timeout”).
- Step 4: Returns a summarised, plain-English diagnosis to the user.
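Before (or alongside) the LLM call in Step 3, a cheap keyword scan over the fetched log lines can classify the most common failures deterministically. The keyword-to-diagnosis table below is illustrative, not an exhaustive fault catalogue:

```python
# Sketch of a pre-LLM keyword scan over container logs. Known faults get
# an immediate plain-English diagnosis; anything unmatched is what gets
# forwarded to the LLM for deeper analysis.

KNOWN_FAULTS = {
    "database is locked": "SQLite database lock; usually cleared by a restart",
    "api timeout": "Upstream API timeout; check network / DNS (Pi-hole)",
    "no space left on device": "Disk full; free space before restarting",
}

def scan_logs(lines: list[str]) -> list[str]:
    """Return the distinct diagnoses whose keywords appear in the logs."""
    findings: list[str] = []
    for line in lines:
        lowered = line.lower()
        for needle, diagnosis in KNOWN_FAULTS.items():
            if needle in lowered and diagnosis not in findings:
                findings.append(diagnosis)
    return findings
```

This keeps token spend down: the LLM is only consulted for log content the static table cannot explain.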
3. Safety & Approval Gateway
Scenario: AI recommends a destructive or risky action (e.g., “Pull new image” or “Stop container”).
- Step 1: Supervisor identifies the tool is flagged as `requires_approval`.
- Step 2: Workflow suspends.
- Step 3: Telegram message sent with Inline Buttons: [✅ Approve] | [❌ Deny].
- Step 4: Workflow resumes only upon receiving the callback from the Approve button.
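The gateway logic above can be sketched as a pending-call queue keyed by a callback id: flagged tools park their call until the Telegram button callback arrives. All names here (`TOOLS`, `request_tool`, the flag values) are illustrative; in n8n this maps to a Wait node resumed by the Telegram callback webhook.

```python
# Sketch of the approval gateway: tools carry a requires_approval flag,
# and flagged calls are held in a pending queue until the matching
# Telegram callback (Approve / Deny) comes back.

import uuid

TOOLS = {
    "docker_logs": {"requires_approval": False},   # read-only: runs freely
    "docker_stop": {"requires_approval": True},    # side effects: gated
    "image_pull":  {"requires_approval": True},
}

pending: dict[str, dict] = {}

def request_tool(tool: str, args: dict) -> dict:
    if TOOLS[tool]["requires_approval"]:
        callback_id = uuid.uuid4().hex  # embedded in the inline buttons
        pending[callback_id] = {"tool": tool, "args": args}
        return {"status": "awaiting_approval", "callback_id": callback_id}
    return {"status": "executed", "tool": tool}

def handle_callback(callback_id: str, approved: bool) -> dict:
    call = pending.pop(callback_id)  # resume exactly once per callback id
    if not approved:
        return {"status": "denied", "tool": call["tool"]}
    return {"status": "executed", "tool": call["tool"]}
```

Popping the pending entry on first callback also gives replay protection for free: a second press of the same button finds nothing to resume.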
4. Environment & Constraints
Infrastructure Map
- Host: Unraid (Docker)
- Network: 192.168.86.x Subnet
- Critical Services: Plex, *arr Stack (Sonarr/Radarr/Lidarr), Home Assistant, Pi-hole.
Security Constraints
- Credential Isolation: SSH keys and API tokens are stored in n8n Credentials, never in the agent prompt.
- Rate Limiting: The “Restart” tool must have a cooldown (e.g., max 3 restarts per hour) to prevent infinite boot loops.
5. Roadmap
Phase 1: Foundation (Current)
- Deploy Supervisor Agent.
- Implement “Docker Ops” and “HA Ops” tools.
- Set up basic Telegram ingest.
Phase 2: Context & Memory
- Implement Vector Database (Qdrant) to allow the agent to recall previous incidents.
- Goal: Agent recognises recurring failure patterns (“This is the 3rd crash this week”).
Phase 3: Local Intelligence
- Replace cloud-based Log Analysis with local Ollama integration.
- Goal: Keep sensitive log data entirely within the local network.