Reddit Product Sentiment & Market Intelligence Agent
Version: 0.2 (POC)
Last Updated: 14 January 2026
Author: Steve James
1. Executive Summary & Outcome Hypothesis
Problem Statement
Authentic user feedback lives on Reddit, but it is buried under layers of noise, memes, and off-topic discussion. For Product Managers and Solo Founders, manually monitoring 50+ subreddits is impossible, and traditional social listening tools rely on keyword matching that fails to capture context (e.g., distinguishing between a feature request and a user error).
Proposed Solution
A “Headless” Market Intelligence Agent that ingests raw community data, uses a multi-stage LLM pipeline to filter and structure the data, and stores it in a vector-ready database. This creates a high-signal dataset that can be interrogated via a RAG (Retrieval-Augmented Generation) Chatbot or summarised by an automated weekly agent.
Outcome Hypothesis
We believe that by separating the “Ingestion” layer (RSS/n8n) from the “Analysis” layer (RAG/Agents), we can create a system that “remembers” community sentiment over time. We will know we’re right when the system can answer complex natural language queries (e.g., “How has sentiment regarding [Competitor X] pricing changed in the last 30 days?”) with >90% accuracy compared to manual analysis.
2. Strategic Context
Innovation Goals
This project serves as a technical proving ground for:
- Agentic Orchestration: Managing multi-step reasoning chains in n8n (Relevance → Sentiment → Summarisation).
- Small Language Models (SLMs): Validating whether cheaper models (GPT-4o-mini, Haiku) can perform classification tasks as well as frontier models, in order to control costs.
- RAG Pipelines: Building a “Chat with your Data” interface for unstructured social commentary.
3. Technical Architecture & Data Flow
Current Pipeline (Ingestion & Processing)
The system is orchestrated via self-hosted n8n using a linear DAG (Directed Acyclic Graph) workflow:
- Trigger: RSS feeds poll specific subreddits (e.g., r/LegalTech, r/SaaS) every 15 minutes.
- Normalisation: A JavaScript node cleanses metadata to ensure a consistent schema across different RSS formats.
- Gatekeeper Agent (LLM 1):
- Input: Post Title + Body.
- Task: Binary classification (True/False) on “Is this relevant to B2B Product Management?”
- Model: Low-cost, high-speed model (GPT-4o-mini via OpenRouter).
- Analyst Agent (LLM 2):
- Trigger: Only runs if Gatekeeper = True.
- Task: Sentiment Analysis (-1 to +1), Entity Extraction (Competitors, Features), and User Intent Classification (Complaint, Question, Praise).
- Output: Structured JSON via an Output Parser (see the sketch following this list).
- Storage: Data is upserted into Supabase (PostgreSQL).
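For illustration, the Gatekeeper → Analyst chain can be sketched in a few lines of Python outside n8n. The OpenRouter endpoint and conditional gating match the pipeline above; the prompt wording, model ID, and JSON keys are illustrative assumptions, not the production prompts held in the n8n workflow.

```python
# Minimal sketch of the Gatekeeper -> Analyst chain outside n8n.
# Assumes an OpenAI-compatible OpenRouter endpoint; prompts, model ID and
# JSON keys are illustrative, not the production configuration.
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def gatekeeper(title: str, body: str) -> bool:
    """LLM 1: binary relevance check using a low-cost model."""
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only 'true' or 'false'."},
            {"role": "user", "content": (
                "Is this post relevant to B2B Product Management?\n\n"
                f"Title: {title}\n\nBody: {body}"
            )},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")

def analyst(title: str, body: str) -> dict:
    """LLM 2: sentiment, entity extraction and intent as structured JSON."""
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Return only JSON with keys: sentiment (float, -1 to 1), "
                "entities (array of strings), intent (Complaint|Question|Praise)."
            )},
            {"role": "user", "content": f"Title: {title}\n\nBody: {body}"},
        ],
    )
    # The production workflow relies on n8n's Output Parser; here we parse directly.
    return json.loads(resp.choices[0].message.content)

def process_post(post: dict) -> dict | None:
    """Run the Analyst only when the Gatekeeper marks the post relevant."""
    if not gatekeeper(post["title"], post["body"]):
        return None
    return analyst(post["title"], post["body"])
```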
Planned Architecture (Consumption Layer)
- Vectorisation: Supabase pgvector extension to embed post summaries.
- Retrieval: A Python/LangChain backend to handle queries.
- Interface: A simple Streamlit or Vercel AI SDK chat interface.
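As a rough sketch of how that retrieval layer could look with LangChain's Supabase integration: the table name (post_summaries), the RPC function name (match_post_summaries) and the embedding model are assumptions, since this layer is not yet built.

```python
# Sketch of the planned consumption layer: pgvector similarity search via
# LangChain's Supabase integration. Table name, query (RPC) name and the
# embedding model are assumptions; this layer is not yet implemented.
import os

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_openai import OpenAIEmbeddings
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

store = SupabaseVectorStore(
    client=supabase,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    table_name="post_summaries",        # hypothetical table of embedded summaries
    query_name="match_post_summaries",  # hypothetical pgvector match function
)

# Example query the RAG chatbot might issue against the historical dataset.
docs = store.similarity_search(
    "How has sentiment regarding competitor pricing changed in the last 30 days?",
    k=8,
)
for doc in docs:
    print(doc.metadata.get("post_id"), doc.page_content[:80])
```

The Streamlit or Vercel AI SDK front end would then pass the retrieved posts, together with the user's question, to a chat model to compose the final answer.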
4. Product Scope & MVP Definition
Phase 1: Ingestion & Filtering (Status: Live)
- Connect n8n to Reddit RSS feeds.
- Implement “Relevance Check” to filter noise.
- Implement “Sentiment Analysis” to structure data.
- Store raw and processed data in Supabase.
Phase 2: The “Active Analyst” (Status: Next)
- Vector Embeddings: Automatically generate embeddings for successful inserts in Supabase.
- Recursive Retrieval: Allow the system to fetch related historical posts when a new trend emerges.
- Weekly Agent: A scheduled workflow that reads the last 7 days of entries, clusters them by topic, and sends a summarised email briefing.
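A plausible shape for that weekly job, assuming a posts table with created_at, summary and stored pgvector embedding columns (column names are illustrative) and simple k-means as the topic-clustering step:

```python
# Sketch of the planned weekly briefing job: pull the last 7 days of entries,
# cluster them by topic, and hand each cluster to a summarisation step.
# Table/column names and the use of k-means are assumptions.
import json
import os
from datetime import datetime, timedelta, timezone

import numpy as np
from sklearn.cluster import KMeans
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
rows = (
    supabase.table("posts")
    .select("id, summary, embedding")
    .gte("created_at", cutoff)
    .execute()
    .data
)
if not rows:
    raise SystemExit("No posts ingested in the last 7 days")

def to_vec(value) -> np.ndarray:
    # pgvector columns may come back as JSON strings via the REST API.
    return np.asarray(json.loads(value) if isinstance(value, str) else value, dtype=float)

embeddings = np.vstack([to_vec(row["embedding"]) for row in rows])

# Group the week's posts into rough topics by clustering their embeddings.
labels = KMeans(n_clusters=min(5, len(rows)), n_init="auto").fit_predict(embeddings)

topics: dict[int, list[str]] = {}
for row, label in zip(rows, labels):
    topics.setdefault(int(label), []).append(row["summary"])
# Each topic's summaries would then be condensed by an LLM into one paragraph
# of the email briefing.
```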
Out of Scope
- Real-time “alerting” (Daily/Weekly batches are sufficient).
- Writing back to Reddit (Read-only mode to prevent bot bans).
- Sentiment analysis of images/video.
5. Non-Functional Requirements
Cost Efficiency
- Constraint: The system must run for <$10/month.
- Strategy: Aggressive pre-filtering using Regex/Keywords before calling LLMs. Use tiered model selection (Cheap models for filtering, Smart models for analysis).
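A minimal sketch of that pre-filter; the keyword list is illustrative, not the production set.

```python
# Cheap regex/keyword pre-filter that runs before any LLM call, so obviously
# off-topic posts never incur token costs. The keyword list is illustrative.
import re

KEYWORD_PATTERN = re.compile(
    r"\b(pricing|churn|onboarding|feature request|integration|roadmap)\b",
    re.IGNORECASE,
)

def cheap_prefilter(title: str, body: str) -> bool:
    """Return True only if the post mentions at least one watched keyword."""
    return bool(KEYWORD_PATTERN.search(f"{title}\n{body}"))
```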
Data Integrity
- De-duplication: RSS feeds often repeat content. The system must use unique IDs (Post GUID) to prevent duplicate DB entries.
- Schema Enforcement: The LLM output parser must guarantee strictly typed JSON (Sentiment = Float, Entities = Array) to prevent database insert errors.
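One way to enforce both constraints at the storage step, sketched with Pydantic for type guarantees and a GUID-keyed upsert; the table and column names are assumptions.

```python
# Sketch of de-duplication plus schema enforcement at the storage step.
# The `posts` table and `guid` conflict column are assumptions.
import os

from pydantic import BaseModel, Field
from supabase import create_client

class PostAnalysis(BaseModel):
    guid: str                                  # RSS Post GUID, the unique key
    sentiment: float = Field(ge=-1.0, le=1.0)  # Sentiment must be a float in [-1, 1]
    entities: list[str]                        # Entities must be an array of strings
    intent: str

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

def store(analysis: PostAnalysis) -> None:
    # Upsert on the GUID so repeated RSS items update rather than duplicate rows.
    supabase.table("posts").upsert(
        analysis.model_dump(), on_conflict="guid"
    ).execute()
```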
6. Success Metrics (Technical)
- Noise Reduction Rate: % of raw RSS items filtered out by the Gatekeeper Agent (Target: >60%).
- Parsing Reliability: % of LLM outputs that successfully parse into JSON without retries (Target: >95%).
- Query Latency: Time to generate a RAG response (Target: <5 seconds).