01
Why Agentic RAG?
Standard RAG performs a single retrieval and generates an answer, so it fails on multi-hop questions that need evidence from multiple sources. Agentic RAG classifies question complexity first, then decides how to retrieve and answer.
❌ Standard RAG
→ Single retrieval per query
→ Fails on multi-hop questions
→ No reasoning about what to retrieve
→ Missing cross-document synthesis
✅ Agentic RAG
→ Classifies simple vs complex
→ Decomposes complex questions
→ Retrieves evidence per sub-question
→ Synthesizes final grounded answer
5 pipeline components
3 agent step types
7 API endpoints
18 unit tests
02
System Pipeline
01
DocumentLoader
Loads .txt, .md, .pdf from files or directories
02
TextChunker
Splits into 512-token chunks with 64-token overlap — prevents information loss at boundaries
03
Embedder
sentence-transformers/all-MiniLM-L6-v2 (local, free) or OpenAI text-embedding-ada-002
04
VectorStore (ChromaDB)
Persistent cosine similarity search — stores chunks with embeddings and metadata
05
AgenticRAG
Classify → simple path (single retrieve+generate) or complex path (decompose→retrieve×N→synthesize)
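The TextChunker's 512-token windows with 64-token overlap can be sketched as a simple sliding window over a token list. This is an illustrative sketch only — the project's chunker is paragraph-aware, and the function name and toy tokens here are assumptions:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Slide a fixed-size window over a token list, overlapping
    neighbouring chunks so content at boundaries is not lost."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# Toy example with word-like "tokens":
chunks = chunk_tokens([f"t{i}" for i in range(1000)])
# Each chunk shares its first 64 tokens with the tail of the previous one.
```

The overlap means a sentence split at a chunk boundary still appears whole in at least one chunk, which is what the "prevents information loss at boundaries" note above refers to.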
Agentic Reasoning Flow
# Step 1: Classify
complexity = classify(question)  # "simple" | "complex"

if complexity == "simple":
    # Standard RAG — single retrieval
    chunks = vector_store.search(question, top_k=5)
    answer = llm.complete(context=chunks, question=question)
else:
    # Step 2: Decompose into sub-questions
    sub_questions = decompose(question)  # LLM returns JSON array

    # Step 3: Retrieve per sub-question
    evidence = []
    for sub_q in sub_questions:
        chunks = vector_store.search(sub_q, top_k=3)
        evidence.append({"question": sub_q, "context": chunks})

    # Step 4: Synthesize final answer
    answer = llm.complete(evidence=evidence, original=question)
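Step 2 depends on the LLM actually returning a JSON array. A minimal sketch of the parsing side, with a fallback to single-retrieval behaviour on malformed output — the fallback and function name are assumptions, not the project's documented behaviour:

```python
import json

def parse_sub_questions(raw: str, original: str) -> list[str]:
    """Parse an LLM completion expected to be a JSON array of strings.
    If the model returns anything else, degrade gracefully by treating
    the original question as its own single sub-question."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            return parsed
    except json.JSONDecodeError:
        pass
    return [original]  # degrade to standard single-retrieval RAG

subs = parse_sub_questions(
    '["What is the refund window for physical products?", '
    '"What is the refund policy for digital products?"]',
    "How does the refund policy differ for digital vs physical products?",
)
```

Guarding the decompose step this way keeps a malformed completion from crashing the complex path; the worst case is simply standard RAG.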
03
Code Structure
agentic-rag-knowledge-system/
├── src/
│   ├── rag_pipeline.py   # Document, Loader, Chunker, Embedder, VectorStore, RAGPipeline
│   └── agentic_rag.py    # AgentStep, AgentResponse, AgenticRAG
├── api.py                # FastAPI — /ingest, /query, /agent/query endpoints
├── demo.py               # CLI demo with 3 sample docs + example queries
├── tests/test_rag.py     # 18 unit tests — all mocked, no API key needed
├── requirements.txt
└── .env.example
| Class | File | Responsibility |
|---|---|---|
| DocumentLoader | rag_pipeline.py | Load txt, md, pdf — file or directory |
| TextChunker | rag_pipeline.py | Paragraph-aware chunking with overlap |
| Embedder | rag_pipeline.py | Local or OpenAI embeddings — swappable |
| VectorStore | rag_pipeline.py | ChromaDB wrapper — add, search, clear |
| RAGPipeline | rag_pipeline.py | End-to-end ingest + query orchestration |
| AgenticRAG | agentic_rag.py | Classify → decompose → retrieve → synthesize |
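The AgentStep and AgentResponse shapes can be approximated from the /agent/query response fields. A hedged sketch — the exact field names and types in agentic_rag.py may differ:

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    """One reasoning step, e.g. classify, decompose, retrieve, synthesize."""
    step_type: str
    description: str

@dataclass
class AgentResponse:
    """Mirrors the fields of the /agent/query JSON response."""
    answer: str
    query: str
    is_complex: bool
    reasoning_steps: list[AgentStep] = field(default_factory=list)
    latency_ms: float = 0.0

resp = AgentResponse(
    answer="...",
    query="How does the refund policy differ for digital vs physical products?",
    is_complex=True,
    reasoning_steps=[AgentStep("classify", "Classified as: complex")],
)
```

Dataclasses like these serialize cleanly to the JSON shape shown in the REST API section, which is why the response carries the same field names.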
04
REST API
| Method | Endpoint | Description |
|---|---|---|
| POST | /ingest/text | Add raw text to the knowledge base — returns chunk count |
| POST | /ingest/file | Upload a .txt, .md, or .pdf file — auto-chunked and indexed |
| POST | /query | Standard RAG query — returns answer + sources + latency |
| POST | /agent/query | Agentic query — includes reasoning steps, is_complex flag, all sources |
| GET | /health | Service health check |
| GET | /stats | Vector store stats — chunk count, embedder, model info |
| DELETE | /knowledge | Clear all documents from the knowledge base |
Example — Agentic Query Response
{
  "answer": "Physical products have a 30-day return window while digital products are non-refundable after download. Physical refunds take 5-7 business days...",
  "query": "How does the refund policy differ for digital vs physical products?",
  "is_complex": true,
  "reasoning_steps": [
    "Classified as: complex",
    "Decomposed into 2 sub-questions",
    "Retrieved 3 chunks for sub-question 1 (score: 0.91)",
    "Retrieved 3 chunks for sub-question 2 (score: 0.88)",
    "Synthesized final answer from all evidence"
  ],
  "latency_ms": 842.3
}
"answer": "Physical products have a 30-day return window while digital products are non-refundable after download. Physical refunds take 5-7 business days...",
"query": "How does the refund policy differ for digital vs physical products?",
"is_complex": true,
"reasoning_steps": [
"Classified as: complex",
"Decomposed into 2 sub-questions",
"Retrieved 3 chunks for sub-question 1 (score: 0.91)",
"Retrieved 3 chunks for sub-question 2 (score: 0.88)",
"Synthesized final answer from all evidence"
],
"latency_ms": 842.3
}
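A client consuming this response might surface the reasoning trace alongside the answer. A minimal sketch using only the fields shown above (the truncated answer string is abbreviated here):

```python
import json

# A /agent/query response body, abbreviated from the example above.
raw = '''{
  "answer": "Physical products have a 30-day return window...",
  "query": "How does the refund policy differ for digital vs physical products?",
  "is_complex": true,
  "reasoning_steps": [
    "Classified as: complex",
    "Decomposed into 2 sub-questions"
  ],
  "latency_ms": 842.3
}'''

resp = json.loads(raw)
path = "complex (decompose + multi-retrieve)" if resp["is_complex"] else "simple"
trace = "\n".join(f"  {i + 1}. {step}" for i, step in enumerate(resp["reasoning_steps"]))
print(f"Path: {path}\n{trace}")
```

Exposing reasoning_steps this way is what makes the agentic path auditable: each retrieval and synthesis decision is visible to the caller, not just the final answer.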
05
Setup
# Clone and install
git clone https://github.com/sadhanageddam27/agentic-rag-knowledge-system.git
cd agentic-rag-knowledge-system
pip install -r requirements.txt
# Configure (OpenAI or Ollama)
cp .env.example .env
# Run demo
python demo.py
# Or start API server
python api.py # → http://localhost:8000/docs
# Run tests (no API key needed)
pytest tests/ -v
Run fully offline with Ollama — set LLM_BACKEND=ollama and LLM_MODEL=llama3. Local embeddings via sentence-transformers are used by default — no API key needed for either component.
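Assuming .env.example follows the variables named above, an offline configuration might look like this — any variable beyond LLM_BACKEND and LLM_MODEL is an assumption about what the file contains:

```shell
# .env — fully offline setup: Ollama for generation,
# local sentence-transformers embeddings (the default) for retrieval.
LLM_BACKEND=ollama
LLM_MODEL=llama3

# Only needed when switching to the OpenAI backend (assumed variable name):
# OPENAI_API_KEY=...
```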