Product Requirements Document Draft

Reddit Product Sentiment & Market Intelligence Agent

Version: 0.2 (POC)

Last Updated: 14 January 2026

Author: Steve James

1. Executive Summary & Outcome Hypothesis

Problem Statement

Authentic user feedback lives on Reddit, but it is buried under layers of noise, memes, and off-topic discussion. For Product Managers and Solo Founders, manually monitoring 50+ subreddits is impossible, and traditional social listening tools rely on keyword matching that fails to capture context (e.g., distinguishing between a feature request and a user error).

Proposed Solution

A “Headless” Market Intelligence Agent that ingests raw community data, uses a multi-stage LLM pipeline to filter and structure the data, and stores it in a vector-ready database. This creates a high-signal dataset that can be interrogated via a RAG (Retrieval-Augmented Generation) Chatbot or summarised by an automated weekly agent.

Outcome Hypothesis

We believe that by separating the “Ingestion” layer (RSS/n8n) from the “Analysis” layer (RAG/Agents), we can create a system that “remembers” community sentiment over time. We will know we’re right when the system can answer complex natural language queries (e.g., “How has sentiment regarding [Competitor X] pricing changed in the last 30 days?”) with >90% accuracy compared to manual analysis.

2. Strategic Context

Innovation Goals

This project serves as a technical proving ground for:

  1. Agentic Orchestration: Managing multi-step reasoning chains in n8n (Relevance → Sentiment → Summarisation).
  2. Small Language Models (SLMs): Validating whether cheaper models (GPT-4o-mini, Haiku) can match frontier models on classification tasks, keeping costs under control.
  3. RAG Pipelines: Building a “Chat with your Data” interface for unstructured social commentary.

3. Technical Architecture & Data Flow

Current Pipeline (Ingestion & Processing)

The system is orchestrated via self-hosted n8n using a linear DAG (Directed Acyclic Graph) workflow:

  1. Trigger: RSS feeds poll specific subreddits (e.g., r/LegalTech, r/SaaS) every 15 minutes.
  2. Normalisation: A JavaScript node cleanses metadata to ensure a consistent schema across different RSS formats.
  3. Gatekeeper Agent (LLM 1):
    • Input: Post Title + Body.
    • Task: Binary classification (True/False) on “Is this relevant to B2B Product Management?”
    • Model: Low-cost, high-speed model (GPT-4o-mini via OpenRouter).
  4. Analyst Agent (LLM 2):
    • Trigger: Only runs if Gatekeeper = True.
    • Task: Sentiment Analysis (-1 to +1), Entity Extraction (Competitors, Features), and User Intent Classification (Complaint, Question, Praise).
    • Output: Structured JSON via Output Parser.
  5. Storage: Data is upserted into Supabase (PostgreSQL).
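The Analyst Agent's output-parser step can be sketched as a strict JSON validator that sits between the LLM and Supabase. This is a minimal sketch: the field names (`sentiment`, `entities`, `intent`) and the allowed intent labels are illustrative assumptions, not the live workflow's schema.

```python
# Sketch of the Analyst Agent's output parser (step 4 above).
# Field names and intent labels are assumptions for illustration.
import json

ALLOWED_INTENTS = {"Complaint", "Question", "Praise"}

def parse_analyst_output(raw: str) -> dict:
    """Validate LLM JSON so only strictly typed rows reach the database."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    sentiment = float(data["sentiment"])
    if not -1.0 <= sentiment <= 1.0:
        raise ValueError(f"sentiment out of range: {sentiment}")
    entities = data["entities"]
    if not isinstance(entities, list):
        raise ValueError("entities must be an array")
    intent = data["intent"]
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent}")
    return {"sentiment": sentiment, "entities": entities, "intent": intent}
```

Rejecting out-of-range or mistyped fields here, rather than at insert time, is what keeps malformed LLM output from ever reaching PostgreSQL.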

Planned Architecture (Consumption Layer)

  • Vectorisation: Use the Supabase pgvector extension to store embeddings of post summaries.
  • Retrieval: A Python/LangChain backend to handle queries.
  • Interface: A simple Streamlit or Vercel AI SDK chat interface.
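The retrieval step above amounts to ranking stored post embeddings by similarity to a query embedding. A toy sketch of that ranking logic, in plain Python with hypothetical 3-dimensional stand-in vectors (in production this would be a pgvector nearest-neighbour query in Supabase):

```python
# Toy sketch of RAG retrieval: rank stored posts by cosine similarity
# to the query embedding. Vectors are tiny stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, rows, k=3):
    """rows: list of (post_id, embedding). Returns the k most similar post ids."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [post_id for post_id, _ in ranked[:k]]
```

The retrieved posts would then be passed to the LLM as context for answer generation.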

4. Product Scope & MVP Definition

Phase 1: Ingestion & Filtering (Status: Live)

  • Connect n8n to Reddit RSS feeds.
  • Implement “Relevance Check” to filter noise.
  • Implement “Sentiment Analysis” to structure data.
  • Store raw and processed data in Supabase.

Phase 2: The “Active Analyst” (Status: Next)

  • Vector Embeddings: Automatically generate embeddings for each successful insert into Supabase.
  • Recursive Retrieval: Allow the system to fetch related historical posts when a new trend emerges.
  • Weekly Agent: A scheduled workflow that reads the last 7 days of entries, clusters them by topic, and sends a summarised email briefing.
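The Weekly Agent's clustering step can be sketched as a greedy grouping of post embeddings: each post joins the first existing cluster whose seed it resembles, otherwise it starts a new one. This is an illustrative sketch only (the threshold and tiny 2-d vectors are assumptions); a real build might use pgvector plus k-means instead.

```python
# Sketch of the Weekly Agent's topic clustering: greedy grouping by
# cosine similarity to a cluster seed. Threshold is an assumed value.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_by_topic(posts, threshold=0.8):
    """posts: list of (post_id, embedding). Returns lists of grouped post ids."""
    clusters = []  # each entry: (seed_embedding, [post_ids])
    for post_id, emb in posts:
        for seed, members in clusters:
            if cosine(seed, emb) >= threshold:
                members.append(post_id)
                break
        else:
            clusters.append((emb, [post_id]))
    return [members for _, members in clusters]
```

Each resulting group would then be summarised by an LLM into one line of the weekly email briefing.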

Out of Scope

  • Real-time “alerting” (Daily/Weekly batches are sufficient).
  • Writing back to Reddit (Read-only mode to prevent bot bans).
  • Sentiment analysis of images/video.

5. Non-Functional Requirements

Cost Efficiency

  • Constraint: The system must run for <$10/month.
  • Strategy: Aggressive regex/keyword pre-filtering before any LLM call, plus tiered model selection (cheap models for filtering, smarter models for analysis).
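The pre-filter strategy above can be sketched as a single regex gate that runs before the Gatekeeper Agent; the keyword list here is illustrative, not the production list.

```python
# Sketch of the zero-cost regex/keyword pre-filter that runs before any
# LLM call. The keyword list is an illustrative assumption.
import re

KEYWORDS = re.compile(
    r"\b(pricing|feature|bug|churn|onboarding|integration)\b",
    re.IGNORECASE,
)

def worth_llm_call(title: str, body: str) -> bool:
    """Only posts matching at least one domain keyword reach the Gatekeeper."""
    return bool(KEYWORDS.search(f"{title} {body}"))
```

Because most off-topic posts match no keyword at all, this gate removes the bulk of the volume without spending a single token.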

Data Integrity

  • De-duplication: RSS feeds often repeat content. The system must use unique IDs (Post GUID) to prevent duplicate DB entries.
  • Schema Enforcement: The LLM output parser must guarantee strictly typed JSON (sentiment = float, entities = array) to prevent database insert errors.
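The de-duplication requirement can be sketched in two layers: an in-memory pass that drops repeated RSS items by GUID, backed by a conflict-ignoring insert in PostgreSQL. The table and column names below are assumptions for illustration.

```python
# Sketch of GUID-based de-duplication. Table/column names are assumptions.
# In PostgreSQL, a unique index on guid plus ON CONFLICT gives the same
# guarantee at the database layer:
UPSERT_SQL = """
INSERT INTO posts (guid, title, sentiment)
VALUES (%(guid)s, %(title)s, %(sentiment)s)
ON CONFLICT (guid) DO NOTHING;
"""

def dedupe_by_guid(items):
    """Drop repeated RSS items, keeping the first occurrence of each GUID."""
    seen, unique = set(), []
    for item in items:
        if item["guid"] not in seen:
            seen.add(item["guid"])
            unique.append(item)
    return unique
```

Doing both keeps the workflow cheap (no wasted LLM calls on repeats) while the unique index remains the hard guarantee.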

6. Success Metrics (Technical)

  • Noise Reduction Rate: % of raw RSS items filtered out by the Gatekeeper Agent (Target: >60%).
  • Parsing Reliability: % of LLM outputs that successfully parse into JSON without retries (Target: >95%).
  • Query Latency: Time to generate a RAG response (Target: <5 seconds).
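The first two targets can be computed directly from pipeline counters; a minimal sketch, with counter names that are illustrative rather than taken from the live workflow:

```python
# How the Noise Reduction and Parsing Reliability targets could be
# computed from pipeline counters. Counter names are assumptions.

def noise_reduction_rate(raw_items: int, passed_gatekeeper: int) -> float:
    """% of raw RSS items filtered out by the Gatekeeper (target: > 60%)."""
    return 100.0 * (raw_items - passed_gatekeeper) / raw_items

def parsing_reliability(total_outputs: int, parsed_first_try: int) -> float:
    """% of LLM outputs that parse into JSON without retries (target: > 95%)."""
    return 100.0 * parsed_first_try / total_outputs
```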