Product Requirements Document Draft

Reddit Product Sentiment & Market Intelligence Agent

Version: 0.2 (POC)

Last Updated: 14 January 2026

Author: Steve James

1. Executive Summary & Outcome Hypothesis

Problem Statement

Authentic user feedback lives on Reddit, but it is buried under layers of noise, memes, and off-topic discussion. For Product Managers and Solo Founders, manually monitoring 50+ subreddits is impossible, and traditional social listening tools rely on keyword matching that fails to capture context (e.g., distinguishing between a feature request and a user error).

Proposed Solution

A “Headless” Market Intelligence Agent that ingests raw community data, uses a multi-stage LLM pipeline to filter and structure the data, and stores it in a vector-ready database. This creates a high-signal dataset that can be interrogated via a RAG (Retrieval-Augmented Generation) Chatbot or summarised by an automated weekly agent.

Outcome Hypothesis

We believe that by separating the “Ingestion” layer (RSS/n8n) from the “Analysis” layer (RAG/Agents), we can create a system that “remembers” community sentiment over time. We will know we’re right when the system can answer complex natural language queries (e.g., “How has sentiment regarding [Competitor X] pricing changed in the last 30 days?”) with >90% accuracy compared to manual analysis.

2. Strategic Context

Innovation Goals

This project serves as a technical proving ground for:

  1. Agentic Orchestration: Managing multi-step reasoning chains in n8n (Relevance → Sentiment → Summarisation).
  2. Small Language Models (SLMs): Validating whether cheaper models (GPT-4o-mini, Haiku) can match frontier models on classification tasks, keeping costs under control.
  3. RAG Pipelines: Building a “Chat with your Data” interface for unstructured social commentary.

3. Technical Architecture & Data Flow

Current Pipeline (Ingestion & Processing)

The system is orchestrated via self-hosted n8n using a linear DAG (Directed Acyclic Graph) workflow:

  1. Trigger: RSS feeds poll specific subreddits (e.g., r/LegalTech, r/SaaS) every 15 minutes.
  2. Normalisation: A JavaScript node cleanses metadata to ensure a consistent schema across different RSS formats.
  3. Gatekeeper Agent (LLM 1):
    • Input: Post Title + Body.
    • Task: Binary classification (True/False) on “Is this relevant to B2B Product Management?”
    • Model: Low-cost, high-speed model (GPT-4o-mini via OpenRouter).
  4. Analyst Agent (LLM 2):
    • Trigger: Only runs if Gatekeeper = True.
    • Task: Sentiment Analysis (-1 to +1), Entity Extraction (Competitors, Features), and User Intent Classification (Complaint, Question, Praise).
    • Output: Structured JSON via Output Parser.
  5. Storage: Data is upserted into Supabase (PostgreSQL).
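The Analyst Agent's output-parser step can be sketched as a strict JSON validator that sits between the LLM and Supabase. This is a minimal sketch: the field names (`sentiment`, `entities`, `intent`) and the allowed intent labels are illustrative assumptions, not the live workflow's schema.

```python
# Sketch of the Analyst Agent's output parser (step 4 above).
# Field names and intent labels are assumptions for illustration.
import json

ALLOWED_INTENTS = {"Complaint", "Question", "Praise"}

def parse_analyst_output(raw: str) -> dict:
    """Validate LLM JSON so only strictly typed rows reach the database."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    sentiment = float(data["sentiment"])
    if not -1.0 <= sentiment <= 1.0:
        raise ValueError(f"sentiment out of range: {sentiment}")
    entities = data["entities"]
    if not isinstance(entities, list):
        raise ValueError("entities must be an array")
    intent = data["intent"]
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent}")
    return {"sentiment": sentiment, "entities": entities, "intent": intent}
```

Rejecting out-of-range or mistyped fields here, rather than at insert time, is what keeps malformed LLM output from ever reaching PostgreSQL.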

Planned Architecture (Consumption Layer)

  • Vectorisation: Use the Supabase pgvector extension to store embeddings of post summaries.
  • Retrieval: A Python/LangChain backend to handle queries.
  • Interface: A simple Streamlit or Vercel AI SDK chat interface.
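The retrieval step above amounts to ranking stored post embeddings by similarity to a query embedding. A toy sketch of that ranking logic, in plain Python with hypothetical 3-dimensional stand-in vectors (in production this would be a pgvector nearest-neighbour query in Supabase):

```python
# Toy sketch of RAG retrieval: rank stored posts by cosine similarity
# to the query embedding. Vectors are tiny stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, rows, k=3):
    """rows: list of (post_id, embedding). Returns the k most similar post ids."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [post_id for post_id, _ in ranked[:k]]
```

The retrieved posts would then be passed to the LLM as context for answer generation.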

4. Product Scope & MVP Definition

Phase 1: Ingestion & Filtering (Status: Live)

  • Connect n8n to Reddit RSS feeds.
  • Implement “Relevance Check” to filter noise.
  • Implement “Sentiment Analysis” to structure data.
  • Store raw and processed data in Supabase.

Phase 2: The “Active Analyst” (Status: Next)

  • Vector Embeddings: Automatically generate embeddings for each successful insert into Supabase.
  • Recursive Retrieval: Allow the system to fetch related historical posts when a new trend emerges.
  • Weekly Agent: A scheduled workflow that reads the last 7 days of entries, clusters them by topic, and sends a summarised email briefing.
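The Weekly Agent's clustering step can be sketched as a greedy grouping of post embeddings: each post joins the first existing cluster whose seed it resembles, otherwise it starts a new one. This is an illustrative sketch only (the threshold and tiny 2-d vectors are assumptions); a real build might use pgvector plus k-means instead.

```python
# Sketch of the Weekly Agent's topic clustering: greedy grouping by
# cosine similarity to a cluster seed. Threshold is an assumed value.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_by_topic(posts, threshold=0.8):
    """posts: list of (post_id, embedding). Returns lists of grouped post ids."""
    clusters = []  # each entry: (seed_embedding, [post_ids])
    for post_id, emb in posts:
        for seed, members in clusters:
            if cosine(seed, emb) >= threshold:
                members.append(post_id)
                break
        else:
            clusters.append((emb, [post_id]))
    return [members for _, members in clusters]
```

Each resulting group would then be summarised by an LLM into one line of the weekly email briefing.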

Out of Scope

  • Real-time “alerting” (Daily/Weekly batches are sufficient).
  • Writing back to Reddit (Read-only mode to prevent bot bans).
  • Sentiment analysis of images/video.

5. Non-Functional Requirements

Cost Efficiency

  • Constraint: The system must run for <$10/month.
  • Strategy: Aggressive regex/keyword pre-filtering before any LLM call, plus tiered model selection (cheap models for filtering, smarter models for analysis).
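The pre-filter strategy above can be sketched as a single regex gate that runs before the Gatekeeper Agent; the keyword list here is illustrative, not the production list.

```python
# Sketch of the zero-cost regex/keyword pre-filter that runs before any
# LLM call. The keyword list is an illustrative assumption.
import re

KEYWORDS = re.compile(
    r"\b(pricing|feature|bug|churn|onboarding|integration)\b",
    re.IGNORECASE,
)

def worth_llm_call(title: str, body: str) -> bool:
    """Only posts matching at least one domain keyword reach the Gatekeeper."""
    return bool(KEYWORDS.search(f"{title} {body}"))
```

Because most off-topic posts match no keyword at all, this gate removes the bulk of the volume without spending a single token.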

Data Integrity

  • De-duplication: RSS feeds often repeat content. The system must use unique IDs (Post GUID) to prevent duplicate DB entries.
  • Schema Enforcement: The LLM output parser must guarantee strictly typed JSON (sentiment = float, entities = array) to prevent database insert errors.
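The de-duplication requirement can be sketched in two layers: an in-memory pass that drops repeated RSS items by GUID, backed by a conflict-ignoring insert in PostgreSQL. The table and column names below are assumptions for illustration.

```python
# Sketch of GUID-based de-duplication. Table/column names are assumptions.
# In PostgreSQL, a unique index on guid plus ON CONFLICT gives the same
# guarantee at the database layer:
UPSERT_SQL = """
INSERT INTO posts (guid, title, sentiment)
VALUES (%(guid)s, %(title)s, %(sentiment)s)
ON CONFLICT (guid) DO NOTHING;
"""

def dedupe_by_guid(items):
    """Drop repeated RSS items, keeping the first occurrence of each GUID."""
    seen, unique = set(), []
    for item in items:
        if item["guid"] not in seen:
            seen.add(item["guid"])
            unique.append(item)
    return unique
```

Doing both keeps the workflow cheap (no wasted LLM calls on repeats) while the unique index remains the hard guarantee.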

6. Success Metrics (Technical)

  • Noise Reduction Rate: % of raw RSS items filtered out by the Gatekeeper Agent (Target: >60%).
  • Parsing Reliability: % of LLM outputs that successfully parse into JSON without retries (Target: >95%).
  • Query Latency: Time to generate a RAG response (Target: <5 seconds).
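The first two targets can be computed directly from pipeline counters; a minimal sketch, with counter names that are illustrative rather than taken from the live workflow:

```python
# How the Noise Reduction and Parsing Reliability targets could be
# computed from pipeline counters. Counter names are assumptions.

def noise_reduction_rate(raw_items: int, passed_gatekeeper: int) -> float:
    """% of raw RSS items filtered out by the Gatekeeper (target: > 60%)."""
    return 100.0 * (raw_items - passed_gatekeeper) / raw_items

def parsing_reliability(total_outputs: int, parsed_first_try: int) -> float:
    """% of LLM outputs that parse into JSON without retries (target: > 95%)."""
    return 100.0 * parsed_first_try / total_outputs
```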