Overview
Documents are the foundation of the Nadoo AI Knowledge Base. You upload files or provide URLs, and the platform parses and chunks them, generates embeddings, and indexes them for fast retrieval. Once indexed, documents power your RAG pipelines and AI agent workflows with grounded, source-backed answers.

Supported Formats
Nadoo AI accepts a range of document formats, each with format-specific parsing to preserve structure and meaning.

| Format | Extensions | Processing Details |
|---|---|---|
| PDF | .pdf | Text extraction with layout preservation. OCR fallback for scanned documents. |
| Microsoft Word | .docx, .doc | Heading hierarchy, tables, and inline images are preserved during parsing. |
| Plain Text | .txt | Direct ingestion with no conversion required. |
| Markdown | .md, .mdx | Heading levels are used as chunk boundaries and metadata. |
| Excel | .xlsx, .xls | Each worksheet is processed as a separate logical document. Column headers become metadata. |
| Web Pages | URL | HTML is fetched, cleaned of navigation and boilerplate, and converted to plain text. |
File size limits depend on your deployment configuration. The default limit is 50 MB per file. For self-hosted deployments, adjust the MAX_UPLOAD_SIZE environment variable.

Uploading Documents
Via the UI
- Navigate to your workspace and open the Knowledge Base section.
- Select a knowledge base or create a new one.
- Click Upload Documents and drag files into the upload zone, or click to browse.
- Optionally add custom metadata tags before confirming.
- The upload begins and documents enter the processing pipeline automatically.
Via the API
Use the document upload endpoint to add files programmatically.

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | The document file to upload |
| metadata | JSON | No | Custom metadata to attach to the document |
| chunk_size | Integer | No | Override the knowledge base default chunk size |
| chunk_overlap | Integer | No | Override the knowledge base default chunk overlap |
Document Processing Pipeline
Every uploaded document passes through a four-stage pipeline before it becomes searchable.

Upload
The file is received, validated for format and size, and stored in the platform’s object storage. A document record is created with status processing.

Parse
A Celery background worker extracts text from the file using format-specific parsers. Structure such as headings, tables, and page boundaries is preserved as metadata annotations.
Chunk
The extracted text is split into overlapping segments. The chunking strategy respects document structure — splits prefer paragraph and heading boundaries over arbitrary character positions.

Default settings:
- Chunk size: 1000 characters
- Chunk overlap: 200 characters
- Separator: \n\n (falls back to sentence and word boundaries)
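The splitting behavior described above can be approximated with a short greedy splitter. This is an illustrative sketch, not the platform's actual implementation: it prefers a paragraph break, then a sentence end, then a plain space, and steps back by the overlap between windows.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring natural boundaries.
    Simplified sketch of structure-aware chunking (not the real pipeline)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer a paragraph break, then a sentence end, then a space.
            for sep in ("\n\n", ". ", " "):
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

Note how consecutive chunks share their trailing/leading characters: the overlap preserves context that would otherwise be lost at a hard cut.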
Embed
Each chunk is converted into a vector using the knowledge base’s configured embedding model. Embedding generation runs in batches for efficiency.
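Batched embedding can be sketched as follows. Here embed_batch stands in for whatever batch API your configured embedding model exposes; its signature is an assumption for illustration.

```python
from typing import Callable

def embed_chunks(chunks: list[str],
                 embed_batch: Callable[[list[str]], list[list[float]]],
                 batch_size: int = 32) -> list[list[float]]:
    """Embed chunks in fixed-size batches to amortize per-call overhead.
    embed_batch is a placeholder for the model's batch embedding call."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors
```

Batching matters because embedding APIs and GPU inference are far more efficient per text when given many inputs at once.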
Document Status
Each document has a status that reflects its position in the processing pipeline.

| Status | Description |
|---|---|
| processing | The document has been uploaded and is being parsed, chunked, and embedded. |
| indexed | Processing is complete. The document’s chunks are searchable. |
| failed | An error occurred during processing. Check the error details for the specific failure reason. |
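A client can wait on these statuses with a simple polling loop. The sketch below assumes a get_status callable that returns one of the status strings from the table; the actual retrieval route depends on your API client.

```python
import time

def wait_for_indexing(get_status, doc_id: str,
                      timeout: float = 300.0, interval: float = 2.0) -> str:
    """Poll until the document leaves 'processing' or the timeout expires.
    get_status(doc_id) is assumed to return 'processing' | 'indexed' | 'failed'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(doc_id)
        if status in ("indexed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"document {doc_id} still processing after {timeout}s")
```

For bulk uploads, prefer a longer interval (or a webhook, if your deployment exposes one) over tight polling.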
Metadata Extraction
During parsing, the platform automatically extracts structural metadata from documents. This metadata is stored alongside chunks and can be used for filtered retrieval.

Automatically Extracted Metadata
| Metadata Field | Source | Description |
|---|---|---|
| title | Document title or first heading | Used for display and citation |
| headings | Section headings in the document | Hierarchical heading path for each chunk |
| page_number | PDF page numbers | Maps chunks back to their source page |
| sheet_name | Excel worksheet name | Identifies which sheet a chunk came from |
| source_url | Web page URL | The original URL for web-sourced documents |
| file_type | File extension | The format of the uploaded file |
| word_count | Computed | Word count per chunk |
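As an illustration, the metadata attached to a chunk from a PDF might look like the following. The field names come from the table above; all values are invented for the example.

```json
{
  "title": "Employee Handbook",
  "headings": ["Benefits", "Paid Time Off"],
  "page_number": 12,
  "file_type": "pdf",
  "word_count": 184
}
```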
Custom Metadata
You can attach custom key-value metadata at upload time. Custom metadata is propagated to every chunk generated from the document, enabling filtered search.

Managing Documents
List Documents
Retrieve all documents in a knowledge base with pagination.

Delete a Document
Remove a document and all its associated chunks and vectors from the knowledge base.

Re-process a Document
If you change the embedding model or chunking settings for a knowledge base, you can trigger reprocessing for existing documents.

Best Practices
Choose the right chunk size
- Smaller chunks (300-500 chars): Better for precise, fact-based Q&A where each answer fits in a few sentences.
- Larger chunks (1000-2000 chars): Better for questions that require broader context, such as summarization or multi-step reasoning.
- Default (1000 chars): A good starting point for most use cases.
Use metadata for filtering
Tag documents with department, version, or topic metadata at upload time. This allows agents to restrict search to relevant subsets, improving both relevance and speed.
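The idea can be sketched as a pre-filter over chunks before (or alongside) vector search. The chunk shape and field names below are assumptions for illustration, not the platform's internal representation.

```python
def filter_chunks(chunks: list[dict], **filters) -> list[dict]:
    """Keep only chunks whose metadata matches every filter key/value.
    Hypothetical chunk shape: {"text": ..., "metadata": {...}}."""
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]

corpus = [
    {"text": "PTO policy...", "metadata": {"department": "hr", "version": "2024"}},
    {"text": "Release notes...", "metadata": {"department": "eng", "version": "2024"}},
]
hr_only = filter_chunks(corpus, department="hr")
```

Restricting the candidate set this way shrinks the vector search space, which is why metadata filtering improves both relevance and speed.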
Keep documents up to date
Set up a regular schedule to re-upload updated documents. Stale documents can lead to incorrect or outdated answers from your agents.
Monitor processing status
After bulk uploads, check for failed documents and address errors promptly. A knowledge base with unprocessed documents has gaps in its coverage.