Project Specification: AI-Orchestrated Homelab Supervisor
Version: 2.0
Last Updated: 14 January 2026
Author: Steve James
1. Vision & Core Philosophy
The Concept: We are not building a monitoring dashboard; we are building a Digital Site Reliability Engineer (SRE).
The goal is to move beyond passive alerting (“Plex is down”) to active, autonomous management (“Plex was down; I analysed the logs, saw a database lock, and restarted the container. Service is restored.”).
Operating Principles
- Supervisor-Worker Architecture: A central “Brain” (AI Agent) reasons about intent and delegates tasks to deterministic “Tools” (Sub-workflows).
- Safety First: The AI is never given direct shell access. It must interact with infrastructure through strictly scoped tool definitions (e.g., `docker restart` is allowed; `docker rm` is physically impossible via the tool).
- Human-in-the-Loop: Any action with side effects (stopping services, changing configs) requires explicit confirmation via Telegram.
2. Technical Architecture
The system runs on Unraid and is orchestrated entirely within n8n, utilising the Model Context Protocol (MCP) to standardise how the AI interacts with the environment.
A. Ingestion Layer
- Uptime Kuma Webhooks: The primary trigger for system health.
- Payload: JSON data containing Service Name, Status (UP/DOWN), and Error Message.
- Role: Immediate detection of failures.
- Telegram Bot: The primary interface for human interaction.
- Role: Accepts natural language commands (“Check on Radarr”) and delivers status reports/approvals.
- Scheduled Cron: n8n-native triggers for routine tasks (e.g., “Check disk space every morning”).
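The ingestion layer above can be sketched as a small normaliser that flattens an incoming webhook into the three fields the Supervisor cares about. Note this is a sketch: Uptime Kuma's payload shape is configurable per notification, so the `monitor` / `heartbeat` keys used here are assumptions, not a guaranteed schema.

```python
# Minimal sketch of normalising an Uptime Kuma webhook payload into
# Service Name / Status / Error Message. The "monitor" and "heartbeat"
# keys are assumptions about the payload shape, which is configurable.

def normalise_alert(payload: dict) -> dict:
    """Flatten a webhook payload into the fields the Supervisor uses."""
    monitor = payload.get("monitor", {})
    heartbeat = payload.get("heartbeat", {})
    return {
        "service": monitor.get("name", "unknown"),
        # Assumed convention: heartbeat status 1 == UP, anything else == DOWN.
        "status": "UP" if heartbeat.get("status") == 1 else "DOWN",
        "error": heartbeat.get("msg", ""),
    }

alert = normalise_alert({
    "monitor": {"name": "Plex"},
    "heartbeat": {"status": 0, "msg": "connect ECONNREFUSED"},
})
```

In n8n this would live in a Code node directly after the Webhook trigger, so every downstream branch sees the same three fields regardless of which monitor fired.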
B. Supervisor Agent
- Core Node: A LangChain Orchestrator running on n8n.
- Logic: It does not execute commands directly. It interprets the input (e.g., “Alert: Sonarr is down”) and decides which tool to call.
- Memory (Planned): A Vector Store (Qdrant/Pinecone) to retain context of recent failures (“Sonarr failed 3 times this hour; stop restarting and alert the human”).
C. Tool Layer
Deterministic n8n sub-workflows that perform the actual work.
- Tool - Docker Ops:
- Connects via SSH to Unraid.
- Capabilities: `list`, `inspect`, `logs`, `restart`.
- Constraint: Strictly read-only or safe-restart. No deletion.
- Tool - Home Assistant Ops:
- Connects via API to HA.
- Capabilities: Restart Core, check entity state, toggle specific switches.
- Tool - Network Ops:
- Interacts with Pi-hole to check DNS status or temporarily disable blocking.
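The "physically impossible" constraint on the Tool Layer can be sketched as an allowlist that the command string is built from, rather than a denylist checked afterwards: a verb outside the allowlist simply cannot be turned into a command. Function and variable names here are illustrative, not part of any n8n API.

```python
# Sketch of the Docker Ops safety constraint: commands are constructed
# only from an allowlist of verbs, so "docker rm" cannot exist as an
# output of this tool. The container name is validated too, because it
# is interpolated into a shell command executed over SSH.

import re

ALLOWED_VERBS = {"list", "inspect", "logs", "start", "restart"}  # no rm/stop
NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$")  # Docker name charset

def build_docker_command(verb: str, container: str) -> str:
    if verb not in ALLOWED_VERBS:
        raise PermissionError(f"verb {verb!r} is not in the tool allowlist")
    if not NAME_RE.match(container):
        raise ValueError(f"invalid container name: {container!r}")
    return f"docker {verb} {container}"
```

Validating the container name matters as much as the verb: without it, a name like `plex; rm -rf /` would smuggle a second command into the SSH session.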
3. Functional Requirements
1. Autonomous Remediation (The “Self-Healing” Loop)
Scenario: Uptime Kuma detects “Plex” is DOWN.
- Step 1: Ingest webhook.
- Step 2: Supervisor Agent correlates “Plex” to the `plex` container.
- Step 3: Supervisor calls Tool - Docker Ops to check container status.
- If “Exited”: Call `docker start`.
- If “Unhealthy”: Call `docker restart`.
- Step 4: Verify recovery (Wait 30s -> HTTP Check).
- Step 5: Notify User via Telegram: “Plex went down. I restarted it. It is back up.”
2. Intelligent Diagnostics (The “SRE” Loop)
Scenario: User asks Telegram, “Why is Radarr acting up?”
- Step 1: Supervisor interprets “acting up” as a request for health & logs.
- Step 2: Calls Tool - Docker Ops to fetch the last 50 lines of logs.
- Step 3: Calls LLM (OpenAI/Claude) to analyse the logs for keywords (e.g., “Database locked”, “API timeout”).
- Step 4: Returns a summarised, plain-English diagnosis to the user.
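Before (or alongside) the LLM call in Step 3, a cheap keyword scan over the fetched log lines can classify the most common failures deterministically. The keyword-to-diagnosis table below is illustrative, not an exhaustive fault catalogue:

```python
# Sketch of a pre-LLM keyword scan over container logs. Known faults get
# an immediate plain-English diagnosis; anything unmatched is what gets
# forwarded to the LLM for deeper analysis.

KNOWN_FAULTS = {
    "database is locked": "SQLite database lock; usually cleared by a restart",
    "api timeout": "Upstream API timeout; check network / DNS (Pi-hole)",
    "no space left on device": "Disk full; free space before restarting",
}

def scan_logs(lines: list[str]) -> list[str]:
    """Return the distinct diagnoses whose keywords appear in the logs."""
    findings: list[str] = []
    for line in lines:
        lowered = line.lower()
        for needle, diagnosis in KNOWN_FAULTS.items():
            if needle in lowered and diagnosis not in findings:
                findings.append(diagnosis)
    return findings
```

This keeps token spend down: the LLM is only consulted for log content the static table cannot explain.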
3. Safety & Approval Gateway
Scenario: AI recommends a destructive or risky action (e.g., “Pull new image” or “Stop container”).
- Step 1: Supervisor identifies the tool is flagged as `requires_approval`.
- Step 2: Workflow suspends.
- Step 3: Telegram message sent with Inline Buttons: [✅ Approve] | [❌ Deny].
- Step 4: Workflow resumes only upon receiving the callback from the Approve button.
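The gateway logic above can be sketched as a pending-call queue keyed by a callback id: flagged tools park their call until the Telegram button callback arrives. All names here (`TOOLS`, `request_tool`, the flag values) are illustrative; in n8n this maps to a Wait node resumed by the Telegram callback webhook.

```python
# Sketch of the approval gateway: tools carry a requires_approval flag,
# and flagged calls are held in a pending queue until the matching
# Telegram callback (Approve / Deny) comes back.

import uuid

TOOLS = {
    "docker_logs": {"requires_approval": False},   # read-only: runs freely
    "docker_stop": {"requires_approval": True},    # side effects: gated
    "image_pull":  {"requires_approval": True},
}

pending: dict[str, dict] = {}

def request_tool(tool: str, args: dict) -> dict:
    if TOOLS[tool]["requires_approval"]:
        callback_id = uuid.uuid4().hex  # embedded in the inline buttons
        pending[callback_id] = {"tool": tool, "args": args}
        return {"status": "awaiting_approval", "callback_id": callback_id}
    return {"status": "executed", "tool": tool}

def handle_callback(callback_id: str, approved: bool) -> dict:
    call = pending.pop(callback_id)  # resume exactly once per callback id
    if not approved:
        return {"status": "denied", "tool": call["tool"]}
    return {"status": "executed", "tool": call["tool"]}
```

Popping the pending entry on first callback also gives replay protection for free: a second press of the same button finds nothing to resume.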
4. Environment & Constraints
Infrastructure Map
- Host: Unraid (Docker)
- Network: 192.168.86.x Subnet
- Critical Services: Plex, *arr Stack (Sonarr/Radarr/Lidarr), Home Assistant, Pi-hole.
Security Constraints
- Credential Isolation: SSH keys and API tokens are stored in n8n Credentials, never in the agent prompt.
- Rate Limiting: The “Restart” tool must have a cooldown (e.g., max 3 restarts per hour) to prevent infinite boot loops.
5. Roadmap
Phase 1: Foundation (Current)
- Deploy Supervisor Agent.
- Implement “Docker Ops” and “HA Ops” tools.
- Set up basic Telegram ingest.
Phase 2: Context & Memory
- Implement Vector Database (Qdrant) to allow the agent to recall previous incidents.
- Goal: Agent recognises recurring failure patterns (“This is the 3rd crash this week”).
Phase 3: Local Intelligence
- Replace cloud-based Log Analysis with local Ollama integration.
- Goal: Keep sensitive log data entirely within the local network.