Overview
Retrieval-Augmented Generation (RAG) is the pattern at the heart of the Nadoo AI Knowledge Base. Instead of relying solely on what an LLM was trained on, RAG retrieves relevant information from your documents and injects it into the prompt, grounding the response in your own data. This substantially reduces hallucination for questions that your documents can answer.
End-to-End RAG Flow
The RAG pipeline consists of six stages, from the user’s query to the final grounded response.
Query
The user sends a natural language question through the chat interface, API, or a messaging channel. The query enters the workflow at the Start Node and is routed to the Search Knowledge Node.
Embed
The query text is converted into a vector using the same embedding model configured for the knowledge base. This places the query in the same vector space as the indexed document chunks.
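A toy illustration of why the query and the indexed chunks must go through the same embedding function (the `embed` function below is a deterministic stand-in, not a real model such as text-embedding-3-small):

```python
import math

def embed(text, dim=8):
    # Toy bag-of-words embedding. In a real deployment this would be a
    # call to the knowledge base's configured embedding model; the key
    # point is that the SAME function embeds both documents and queries.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-length vector

def cosine(a, b):
    # Cosine similarity reduces to a dot product for unit vectors.
    return sum(x * y for x, y in zip(a, b))

chunk = embed("reset your password from the account settings page")
query = embed("how do I reset my password")
print(f"similarity: {cosine(chunk, query):.3f}")  # shared terms raise the score
```

Because both texts were embedded by the same function, their vectors live in one space and their similarity is meaningful; mixing embedding models between indexing and querying breaks this.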
Search
The query vector (and optionally the raw query text for BM25) is used to retrieve the most relevant chunks from the knowledge base. The search mode — vector, BM25, or hybrid — determines how results are fetched and scored.
Learn about search modes
Rerank
Optionally, retrieved chunks are re-scored using a cross-encoder reranking model. The reranker evaluates each chunk in the context of the original query and produces a more accurate relevance score. This step is especially valuable when the initial retrieval returns a large candidate set.
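The narrowing step can be sketched as follows, with a toy term-overlap score standing in for a real cross-encoder (which would score each query–chunk pair with a model):

```python
def cross_encoder_score(query, chunk):
    # Stand-in for a real cross-encoder: it scores the (query, chunk)
    # PAIR jointly, rather than comparing two independently computed vectors.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

def rerank(query, candidates, rerank_top_k=3):
    # Rescore the broad candidate set from first-stage retrieval,
    # then keep only the strongest few chunks.
    scored = [(cross_encoder_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:rerank_top_k]]

candidates = [  # imagine a top_k=20 candidate set from the search stage
    "billing cycles are monthly by default",
    "to reset your password open account settings",
    "password rules require 12 characters",
    "our office is closed on public holidays",
]
print(rerank("how do I reset my password", candidates, rerank_top_k=2))
```

The shape of the operation is what matters: a large candidate set goes in, a small precisely ordered set comes out.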
Context Assembly
The top-ranked chunks are assembled into a context block that will be injected into the LLM prompt. The assembly process:
- Orders chunks by relevance score (highest first)
- Adds source metadata (document name, page number, heading) for citation
- Truncates or selects chunks to fit within the LLM’s context window
- Deduplicates overlapping content from adjacent chunks
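These assembly steps can be sketched as below; the field names are illustrative, not Nadoo AI’s exact schema, and real deduplication is fuzzier than the exact-match check shown here:

```python
def assemble_context(chunks, max_chars=600):
    # chunks: dicts with "text", "score", and source metadata.
    chunks = sorted(chunks, key=lambda c: c["score"], reverse=True)  # 1. relevance order
    parts, seen, used = [], set(), 0
    for c in chunks:
        text = c["text"].strip()
        if text in seen:                          # 4. drop duplicated overlap text
            continue
        header = f"[{c['doc']}, p.{c['page']}]"   # 2. citation metadata
        block = f"{header} {text}"
        if used + len(block) > max_chars:         # 3. respect the context budget
            break
        parts.append(block)
        seen.add(text)
        used += len(block)
    return "\n\n".join(parts)

context = assemble_context([
    {"doc": "handbook.pdf", "page": 4, "score": 0.91, "text": "Reset passwords in account settings."},
    {"doc": "handbook.pdf", "page": 4, "score": 0.88, "text": "Reset passwords in account settings."},
    {"doc": "faq.md", "page": 1, "score": 0.73, "text": "Passwords must be 12+ characters."},
])
print(context)
```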
Generate
The LLM receives the assembled context along with the user’s question and generates a response grounded in the retrieved content.
Integration with Workflows
In the Nadoo AI workflow engine, the RAG pipeline is implemented through a combination of nodes. The most common pattern connects a Search Knowledge Node to an AI Agent Node.
Basic RAG Workflow
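As a conceptual sketch of this wiring (the node types and field names below are illustrative, not Nadoo AI’s actual workflow schema):

```python
# Hypothetical node-graph sketch; the real workflow format may differ.
workflow = {
    "nodes": [
        {"id": "start", "type": "start"},
        {"id": "search", "type": "search_knowledge",
         "config": {"knowledge_base": "support-docs", "top_k": 5, "rerank": True}},
        {"id": "agent", "type": "ai_agent",
         "config": {"model": "gpt-4o", "inject_context": "system_prompt"}},
    ],
    "edges": [
        {"from": "start", "to": "search"},   # the user query enters here
        {"from": "search", "to": "agent"},   # assembled context flows to the LLM
    ],
}

# Sanity check: the search node must feed the agent node.
assert {"from": "search", "to": "agent"} in workflow["edges"]
```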
The Search Knowledge Node handles stages 2-4 (embed, search, rerank) and passes the assembled context to the AI Agent Node, which handles stage 6 (LLM generation).
RAG Workflow with Reranking
For higher precision, add an explicit reranking step:
Multi-Source RAG Workflow
Query multiple knowledge bases in parallel for comprehensive coverage:
Context Injection Strategies
The way retrieved context is presented to the LLM affects response quality. Nadoo AI supports several injection strategies.
- System Prompt Injection
- User Message Injection
- Structured Context
Inject the retrieved context into the system prompt. This approach treats the context as authoritative background information.
Best for: Most use cases. Keeps context separate from the user’s message.
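A minimal sketch of system prompt injection, assuming an OpenAI-style messages array (the prompt wording and delimiters are illustrative):

```python
def build_system_prompt(context_block):
    # Context goes into the system prompt as authoritative background,
    # leaving the user's own message untouched.
    return (
        "You are a support assistant. Answer ONLY from the context below. "
        "If the context does not cover the question, say you don't know.\n\n"
        "--- CONTEXT ---\n"
        f"{context_block}\n"
        "--- END CONTEXT ---"
    )

messages = [
    {"role": "system", "content": build_system_prompt(
        "[handbook.pdf, p.4] Reset passwords in account settings.")},
    {"role": "user", "content": "How do I reset my password?"},  # clean user turn
]
print(messages[0]["content"])
```

Keeping the context in the system role signals to the model that it is trusted background rather than part of the conversation.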
Citation Tracking
Nadoo AI tracks which documents and chunks contributed to each response. This provides transparency and allows users to verify the source of information.
How Citations Work
- Each chunk passed to the LLM carries metadata: document ID, filename, heading, and page number.
- The prompt instructs the LLM to reference sources in its response.
- The platform records the mapping between the response and the source chunks.
- Citations are returned in the API response and displayed in the chat UI.
Citation Response Format
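The exact schema depends on the API version; as a hypothetical illustration only, a response with citations might carry fields like these (consult the Nadoo AI API reference for the real format):

```python
# Hypothetical response shape, not the documented schema.
response = {
    "answer": "You can reset your password from the account settings page. [1]",
    "citations": [
        {
            "marker": "[1]",
            "document_id": "doc_8f3a",     # illustrative identifier
            "filename": "handbook.pdf",
            "heading": "Account Security",
            "page": 4,
            "chunk_id": "chunk_021",
        }
    ],
}

# Render a human-readable source line for each citation.
for c in response["citations"]:
    print(f"{c['marker']} {c['filename']} p.{c['page']} ({c['heading']})")
```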
Best Practices
Chunk size and overlap
- Chunk size controls the granularity of retrieval. Smaller chunks (300-500 chars) are better for precise fact retrieval. Larger chunks (1000-2000 chars) provide more context per result.
- Chunk overlap (default: 200 chars) ensures that information near chunk boundaries is not lost. Higher overlap produces more chunks but improves boundary coverage.
- Start with the defaults (1000/200) and adjust based on retrieval quality.
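The sliding-window behavior implied by these settings can be sketched as below (a simplified character-based chunker, not Nadoo AI’s actual implementation):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    # Each window starts (chunk_size - overlap) characters after the
    # previous one, so text near a boundary appears in two chunks.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500
chunks = chunk_text(doc)            # defaults: 1000 chars, 200 overlap
print([len(c) for c in chunks])     # three chunks cover the document
```

With the defaults, a 2,500-character document yields three chunks whose 200-character overlaps keep boundary-spanning sentences retrievable.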
Embedding model selection
- General purpose: text-embedding-3-small (OpenAI) is an excellent default.
- Higher quality: text-embedding-3-large (OpenAI) for improved retrieval accuracy at higher cost.
- Domain-specific: Fine-tuned HuggingFace models for specialized content (medical, legal, scientific).
- Multilingual: multilingual-e5-large for knowledge bases with content in multiple languages.
- Self-hosted: Ollama or local models when data cannot leave your infrastructure.
Search mode selection
- Use hybrid search as the default. It handles the widest range of query types.
- Switch to vector-only if all queries are natural language and your embedding model is domain-tuned.
- Switch to BM25-only for keyword-heavy queries like error codes or product identifiers.
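Hybrid search has to merge the vector and BM25 rankings into one. Reciprocal Rank Fusion (RRF) is one common fusion method; whether or not Nadoo AI uses RRF internally, it illustrates how the two result lists can be combined:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank)
    # per document, so documents ranked well by EITHER retriever rise.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
bm25_hits   = ["doc_c", "doc_a", "doc_d"]   # keyword ranking (e.g., error codes)
print(rrf([vector_hits, bm25_hits]))
```

doc_a ranks first because it scores well in both lists, while documents found by only one retriever still make the merged list.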
Reranking for precision
- Enable reranking when retrieval quality matters more than latency.
- Set a higher initial top_k (e.g., 20) and let the reranker narrow to rerank_top_k (e.g., 3-5).
- Cross-encoder rerankers are slower but significantly more accurate than bi-encoder similarity.
Context window management
- Be aware of the LLM’s context window size. Injecting too many chunks wastes tokens on less relevant content.
- Prioritize quality over quantity: 3-5 highly relevant chunks typically outperform 15-20 marginally relevant ones.
- Use reranking to ensure only the best chunks make it into the context.
Monitoring and Debugging
Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
| Responses ignore relevant documents | Chunks are not being retrieved — check score_threshold | Lower score_threshold or increase top_k |
| Responses cite irrelevant content | top_k is too high or score_threshold is too low | Increase score_threshold or enable reranking |
| “I don’t have information about that” | Document is not indexed (check status) or query is out of scope | Verify document status is indexed; expand knowledge base |
| Responses are generic despite context | Context injection may not be reaching the LLM | Check that the Search Knowledge Node output is connected to the AI Agent Node input |