RAG Knowledge Assistant
Upload documents or crawl web pages, then ask questions answered exclusively from your indexed content using hybrid AI search.
Built with FastAPI, Next.js, PostgreSQL + pgvector, LangChain, and OpenAI. Full-stack RAG with streaming responses, real-time pipeline visibility, and interactive search analysis.
How It Works
A production-grade Retrieval-Augmented Generation system that grounds every answer in your documents -- no hallucinations.
Multi-Source Ingestion
Upload PDF and TXT files, or paste any URL. A headless Chromium browser renders JavaScript-heavy pages to extract their full content.
Hybrid Search
Combines vector similarity search (cosine distance) with PostgreSQL full-text search, merged via Reciprocal Rank Fusion for the best of both worlds.
Streaming Answers
Responses are streamed token-by-token via Server-Sent Events. You see the answer forming in real time instead of waiting for the full generation.
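The token stream can be sketched as an async generator that emits SSE frames; in FastAPI such a generator would be wrapped in `StreamingResponse(..., media_type="text/event-stream")`. The function names here are illustrative, not the app's actual code.

```python
import asyncio

async def sse_stream(tokens):
    """Yield LLM tokens as Server-Sent Events frames.

    Illustrative sketch: in FastAPI this generator would be passed to
    StreamingResponse(..., media_type="text/event-stream").
    """
    for token in tokens:
        # Each SSE frame is a "data:" line followed by a blank line.
        yield f"data: {token}\n\n"
        await asyncio.sleep(0)  # yield control so the frame can be flushed
    yield "data: [DONE]\n\n"    # sentinel telling the client the answer ended

async def collect(gen):
    return [frame async for frame in gen]

frames = asyncio.run(collect(sse_stream(["Hel", "lo"])))
```

The `[DONE]` sentinel lets the browser-side `EventSource` (or fetch reader) know when to stop rendering.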
Pipeline Transparency
Watch every step of the ingestion pipeline live: text extraction, chunking, embedding generation, and vector storage. See the AI thinking.
Search Analysis
Compare semantic, keyword, and hybrid search results side-by-side with relevance scores. Understand why certain chunks are retrieved.
Multi-Tenant Security
Clerk JWT authentication, per-user data isolation, SSRF protection on URLs, rate limiting, and automatic cleanup of inactive sessions.
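The SSRF guard on submitted URLs can be sketched as a scheme-and-address check. This is a minimal illustration, not the app's actual validator: a production check would also resolve hostnames to IPs and re-validate after redirects.

```python
import ipaddress
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Basic SSRF guard: reject URLs that could reach internal services.

    Minimal sketch only -- a real implementation would also resolve
    hostnames and re-check every redirect target.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # no file://, gopher://, etc.
    host = parsed.hostname or ""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname, not an IP literal; DNS-resolution checks would go here.
        return host != "localhost"
    # Block loopback, private ranges, and link-local (cloud metadata) addresses.
    return not (addr.is_loopback or addr.is_private
                or addr.is_link_local or addr.is_reserved)
```

Blocking link-local addresses matters because `169.254.169.254` is the cloud metadata endpoint on most providers.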
The AI Pipeline
Every document goes through a multi-stage pipeline before it can answer your questions. Here is exactly what happens at each step.
Document Ingestion
1. Text Extraction
PDF files are parsed page-by-page with PyPDF2. URLs are rendered in a headless Chromium browser via Playwright, executing JavaScript and expanding collapsed sections.
Handles SPAs, dynamic content, and lazy-loaded pages
2. Recursive Chunking
Text is split using LangChain's RecursiveCharacterTextSplitter, which tries paragraph breaks first, then sentences, then words -- preserving semantic boundaries.
1200 chars per chunk, 200 char overlap
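The idea behind recursive splitting can be sketched in plain Python: try the coarsest separator first, recurse to finer ones only when a piece is still too large, then greedily repack pieces with an overlapping tail. This is an illustration of the technique, not LangChain's actual implementation.

```python
def chunk_text(text, max_chars=1200, overlap=200,
               separators=("\n\n", ". ", " ")):
    """Sketch of recursive chunking: prefer paragraph breaks, then
    sentences, then words, and hard-cut only as a last resort."""
    def split(piece, seps):
        if len(piece) <= max_chars:
            return [piece]
        if not seps:
            # Last resort: hard cut at the character limit.
            return [piece[i:i + max_chars] for i in range(0, len(piece), max_chars)]
        parts = [p for p in piece.split(seps[0]) if p]
        if len(parts) == 1:
            return split(piece, seps[1:])
        out = []
        for p in parts:
            out.extend(split(p, seps[1:]))
        return out

    pieces = split(text, separators)
    # Greedily repack pieces up to max_chars, carrying an overlapping tail.
    chunks, current = [], ""
    for p in pieces:
        candidate = (current + " " + p).strip() if current else p
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            tail = current[-overlap:] if current else ""
            current = (tail + " " + p).strip()
            if len(current) > max_chars:
                current = p  # overlap tail didn't fit; start fresh
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing boundary-straddling facts.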
3. Embedding Generation
Each chunk is sent to OpenAI's text-embedding-3-small model, which converts it into a 1536-dimensional vector capturing its semantic meaning.
Cached in Redis to avoid redundant API calls
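The caching step can be sketched as a content-hash lookup. Here a plain dict stands in for Redis and a deterministic fake stands in for the OpenAI call; both are stand-ins for illustration only.

```python
import hashlib
import json

# Stand-ins for illustration: `cache` would be Redis, and `embed_remote`
# would call OpenAI's text-embedding-3-small endpoint.
cache = {}
calls = {"count": 0}

def embed_remote(text):
    calls["count"] += 1
    # Fake deterministic "embedding" derived from the text's hash.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def embed_cached(text):
    """Return the embedding for `text`, hitting the API at most once per
    distinct chunk. Keying by content hash means re-ingesting an
    unchanged document costs zero API calls."""
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    if key in cache:
        return json.loads(cache[key])
    vec = embed_remote(text)
    cache[key] = json.dumps(vec)  # Redis stores strings/bytes, so serialize
    return vec
```

With Redis, the dict operations become `GET`/`SET` (typically with a TTL) but the hash-keyed flow is the same.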
4. Vector Storage
Vectors are stored in PostgreSQL with the pgvector extension. An HNSW index (m=16, ef_construction=64) enables sub-millisecond approximate nearest-neighbor search.
Plus a GIN index on tsvector for full-text search
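A storage schema along these lines supports both search paths; the table and column names here are illustrative, not the app's actual schema.

```sql
-- Sketch of the chunk table and its two indexes (names are illustrative).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    user_id   TEXT NOT NULL,   -- per-user data isolation
    content   TEXT NOT NULL,
    embedding vector(1536),    -- text-embedding-3-small output size
    tsv       tsvector GENERATED ALWAYS AS
                  (to_tsvector('english', content)) STORED
);

-- Approximate nearest-neighbor search over embeddings.
CREATE INDEX chunks_embedding_hnsw ON chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Full-text search side of the hybrid retrieval.
CREATE INDEX chunks_tsv_gin ON chunks USING gin (tsv);
```

Generating `tsv` as a stored column keeps the full-text index in sync automatically on every insert and update.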
Query and Retrieval
5. Semantic Search
Your question is embedded into the same vector space and compared against all stored chunks using cosine similarity. Finds conceptually related content even with different wording.
Returns top 15 candidates ranked by vector distance
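Cosine ranking is simple to state in code. A minimal sketch (pgvector's `<=>` operator returns cosine *distance*, i.e. one minus the similarity computed here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=15):
    """Rank stored chunks by similarity to the query embedding.
    `chunks` is a list of (chunk_id, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunks]
    return [cid for _, cid in sorted(scored, reverse=True)[:k]]
```

In production this brute-force scan is what the HNSW index replaces: it finds approximately the same top-k without comparing against every vector.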
6. Keyword Search
In parallel, PostgreSQL's full-text search engine matches exact terms, acronyms, and technical identifiers that vector similarity might miss.
plainto_tsquery with ts_rank_cd scoring
7. Reciprocal Rank Fusion
Results from both searches are merged using RRF: each document gets score 1/(k+rank) from each method, then scores are summed. This typically outperforms either method alone.
k=60, top 5 results selected
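The fusion step is a few lines of code. A sketch of the scoring described above, with illustrative document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge ranked result lists: each document scores 1/(k + rank) in
    every list it appears in (rank is 1-based), and scores are summed."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]

semantic = ["d3", "d1", "d7", "d2"]  # ranked by vector similarity
keyword  = ["d1", "d5", "d3"]        # ranked by ts_rank_cd
merged = reciprocal_rank_fusion([semantic, keyword])
```

Here `d1` wins: ranking well in *both* lists (1/61 + 1/62) beats `d3`'s first place in one list (1/61 + 1/63). That bias toward cross-method agreement is why RRF is robust without any score normalization.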
8. Grounded Generation
The top chunks are sent to GPT-4o-mini along with conversation history. The system prompt strictly constrains answers to the provided context -- no hallucinations.
temperature=0.1, last 10 messages of history
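Assembling the grounded prompt can be sketched as below; the prompt wording is illustrative, not the app's actual system prompt.

```python
def build_messages(question, context_chunks, history, max_history=10):
    """Assemble the chat payload: a strict system prompt carrying the
    retrieved chunks, the recent conversation turns, and the question.
    (Prompt text is illustrative only.)"""
    context = "\n\n---\n\n".join(context_chunks)
    system = (
        "Answer ONLY from the context below. If the context does not "
        "contain the answer, say you don't know. Do not use outside "
        "knowledge.\n\nContext:\n" + context
    )
    messages = [{"role": "system", "content": system}]
    messages += history[-max_history:]  # keep the last 10 turns by default
    messages.append({"role": "user", "content": question})
    return messages
```

This message list would then go to the chat completions API with `model="gpt-4o-mini"` and `temperature=0.1`; the low temperature further discourages the model from straying beyond the supplied chunks.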
Tech Stack
Frontend
Next.js 15, React 19, Tailwind
Backend
FastAPI, Python 3.11, LangChain
Database
PostgreSQL 16, pgvector, Redis
AI Models
GPT-4o-mini, text-embedding-3-small
Auth
Clerk (JWT RS256)
Crawler
Playwright (headless Chromium)
Queue
Redis Queue (RQ) + workers
Deploy
Docker Compose, Traefik
Try It Out
Upload a document, ask a question, and explore the pipeline in action.
Get Started
Showcase mode -- data is auto-deleted after 10 minutes of inactivity
