
Overview

Documents are the foundation of the Nadoo AI Knowledge Base. You upload files or provide URLs; the platform parses and chunks them, generates embeddings, and indexes them for fast retrieval. Once indexed, documents power your RAG pipelines and AI agent workflows with grounded, source-backed answers.

Supported Formats

Nadoo AI accepts a range of document formats, each with format-specific parsing to preserve structure and meaning.
  • PDF (.pdf): Text extraction with layout preservation. OCR fallback for scanned documents.
  • Microsoft Word (.docx, .doc): Heading hierarchy, tables, and inline images are preserved during parsing.
  • Plain Text (.txt): Direct ingestion with no conversion required.
  • Markdown (.md, .mdx): Heading levels are used as chunk boundaries and metadata.
  • Excel (.xlsx, .xls): Each worksheet is processed as a separate logical document. Column headers become metadata.
  • Web Pages (URL): HTML is fetched, cleaned of navigation and boilerplate, and converted to plain text.
File size limits depend on your deployment configuration. The default limit is 50 MB per file. For self-hosted deployments, adjust the MAX_UPLOAD_SIZE environment variable.

Uploading Documents

Via the UI

  1. Navigate to your workspace and open the Knowledge Base section.
  2. Select a knowledge base or create a new one.
  3. Click Upload Documents and drag files into the upload zone, or click to browse.
  4. Optionally add custom metadata tags before confirming.
  5. The upload begins and documents enter the processing pipeline automatically.

Via the API

Use the document upload endpoint to add files programmatically.
curl -X POST \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/document.pdf" \
  -F 'metadata={"tags": ["engineering", "onboarding"]}'
Request parameters:
  • file (File, required): The document file to upload.
  • metadata (JSON, optional): Custom metadata to attach to the document.
  • chunk_size (Integer, optional): Override the knowledge base default chunk size.
  • chunk_overlap (Integer, optional): Override the knowledge base default chunk overlap.
Response:
{
  "id": "doc_abc123",
  "filename": "document.pdf",
  "status": "processing",
  "format": "pdf",
  "size_bytes": 1048576,
  "metadata": {
    "tags": ["engineering", "onboarding"]
  },
  "created_at": "2026-03-09T10:30:00Z"
}
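The same upload can be scripted from Python. This is a minimal sketch using the third-party requests library; the URL builder and function names are illustrative, not part of a platform SDK:

```python
import json

def document_upload_url(base: str, workspace_id: str, kb_id: str) -> str:
    """Build the document upload endpoint URL shown above."""
    return f"{base}/workspaces/{workspace_id}/knowledge/{kb_id}/documents"

def upload_document(base, workspace_id, kb_id, file_path, api_key, metadata=None):
    """POST a file (and optional metadata) to a knowledge base."""
    import requests  # third-party dependency: pip install requests
    data = {}
    if metadata is not None:
        data["metadata"] = json.dumps(metadata)
    with open(file_path, "rb") as f:
        resp = requests.post(
            document_upload_url(base, workspace_id, kb_id),
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=data,
        )
    resp.raise_for_status()
    return resp.json()
```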

Document Processing Pipeline

Every uploaded document passes through a five-stage pipeline before it becomes searchable.
1. Upload

The file is received, validated for format and size, and stored in the platform’s object storage. A document record is created with status processing.
2. Parse

A Celery background worker extracts text from the file using format-specific parsers. Structure such as headings, tables, and page boundaries is preserved as metadata annotations.
3. Chunk

The extracted text is split into overlapping segments. The chunking strategy respects document structure: splits prefer paragraph and heading boundaries over arbitrary character positions.

Default settings:
  • Chunk size: 1000 characters
  • Chunk overlap: 200 characters
  • Separator: \n\n (falls back to sentence and word boundaries)
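The splitting behavior described above can be sketched as follows. This is a simplified illustration, not the platform's actual chunker; in particular it omits the sentence- and word-boundary fallbacks:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # NOTE: a single paragraph longer than chunk_size stays whole in
        # this sketch; a real splitter would fall back to finer boundaries.
        if current and len(current) + 2 + len(para) > chunk_size:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = (current[-overlap:] + "\n\n" + para) if overlap else para
        elif current:
            current = current + "\n\n" + para
        else:
            current = para
    if current:
        chunks.append(current)
    return chunks
```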
4. Embed

Each chunk is converted into a vector using the knowledge base’s configured embedding model. Embedding generation runs in batches for efficiency.
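A minimal sketch of this batching step, with the embedding call abstracted behind a function you supply:

```python
from typing import Callable

def embed_in_batches(chunks: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 32) -> list[list[float]]:
    """Embed chunks in fixed-size batches to reduce per-request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```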
5. Index

Vectors are inserted into the vector store (pgvector by default) and indexed for fast approximate nearest neighbor search. Once indexing completes, the document status transitions to indexed.
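Conceptually, retrieval over the index reduces to nearest-neighbor search by vector similarity. The brute-force cosine-similarity sketch below illustrates the idea; pgvector instead uses approximate indexes (HNSW or IVFFlat) to avoid scanning every vector:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```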

Document Status

Each document has a status that reflects its position in the processing pipeline.
  • processing: The document has been uploaded and is being parsed, chunked, and embedded.
  • indexed: Processing is complete. The document’s chunks are searchable.
  • failed: An error occurred during processing. Check the error details for the specific failure reason.
You can check document status via the API:
curl -X GET \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response for a failed document:
{
  "id": "doc_xyz789",
  "filename": "corrupted.pdf",
  "status": "failed",
  "error": "Unable to extract text from PDF: file appears to be encrypted",
  "created_at": "2026-03-09T10:30:00Z",
  "updated_at": "2026-03-09T10:31:15Z"
}
Documents with failed status do not contribute to search results. Review the error message, fix the source file, and re-upload.
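A polling helper can wrap the status endpoint shown above. In this sketch, fetch_doc is any function that GETs the document endpoint and returns the parsed JSON body; the helper itself is illustrative:

```python
import time
from typing import Callable

def wait_until_done(fetch_doc: Callable[[], dict],
                    timeout: float = 300.0,
                    interval: float = 5.0) -> dict:
    """Poll until the document leaves 'processing', then return its record."""
    deadline = time.monotonic() + timeout
    while True:
        doc = fetch_doc()
        if doc["status"] != "processing":
            return doc  # "indexed" or "failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still processing after {timeout:.0f}s")
        time.sleep(interval)
```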

Metadata Extraction

During parsing, the platform automatically extracts structural metadata from documents. This metadata is stored alongside chunks and can be used for filtered retrieval.

Automatically Extracted Metadata

  • title (from the document title or first heading): Used for display and citation.
  • headings (from section headings in the document): Hierarchical heading path for each chunk.
  • page_number (from PDF page numbers): Maps chunks back to their source page.
  • sheet_name (from the Excel worksheet name): Identifies which sheet a chunk came from.
  • source_url (from the web page URL): The original URL for web-sourced documents.
  • file_type (from the file extension): The format of the uploaded file.
  • word_count (computed): Word count per chunk.

Custom Metadata

You can attach custom key-value metadata at upload time. Custom metadata is propagated to every chunk generated from the document, enabling filtered search.
{
  "metadata": {
    "department": "engineering",
    "version": "2.1",
    "confidentiality": "internal",
    "tags": ["api", "backend", "architecture"]
  }
}
Filtering with metadata at query time:
{
  "query": "How does authentication work?",
  "search_mode": "hybrid",
  "top_k": 5,
  "filters": {
    "department": "engineering",
    "tags": ["api"]
  }
}
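To illustrate how such filters narrow the candidate set, here is a sketch of one plausible matching rule: scalar fields must match exactly, and list-valued filters require at least one value in common. The actual filter semantics are defined by your deployment and may differ:

```python
def matches_filters(chunk_metadata: dict, filters: dict) -> bool:
    """Return True if a chunk's metadata satisfies every filter clause."""
    for key, wanted in filters.items():
        actual = chunk_metadata.get(key)
        if isinstance(wanted, list):
            # List filters: require at least one value in common.
            if not set(wanted) & set(actual or []):
                return False
        elif actual != wanted:
            return False
    return True
```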

Managing Documents

List Documents

Retrieve all documents in a knowledge base with pagination.
curl -X GET \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents?page=1&per_page=20" \
  -H "Authorization: Bearer YOUR_API_KEY"
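A sketch of collecting every page from Python. The response shape (an "items" array) is an assumption here, since the list response body is not shown above:

```python
from typing import Callable

def list_all_documents(fetch_page: Callable[[int, int], dict],
                       per_page: int = 20) -> list[dict]:
    """Walk the paginated list endpoint until a short page signals the end.

    `fetch_page(page, per_page)` should GET the list endpoint and return
    its parsed JSON body, assumed to contain an "items" list.
    """
    docs: list[dict] = []
    page = 1
    while True:
        items = fetch_page(page, per_page).get("items", [])
        docs.extend(items)
        if len(items) < per_page:
            return docs
        page += 1
```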

Delete a Document

Remove a document and all its associated chunks and vectors from the knowledge base.
curl -X DELETE \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
Deleting a document is irreversible. All chunks and embeddings for the document are permanently removed from the vector store.

Re-process a Document

If you change the embedding model or chunking settings for a knowledge base, you can trigger reprocessing for existing documents.
curl -X POST \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}/reprocess" \
  -H "Authorization: Bearer YOUR_API_KEY"
This deletes the existing chunks and vectors, then re-runs the full pipeline with the current settings.

Best Practices

Choose a chunk size that matches how your agents will use the knowledge base:
  • Smaller chunks (300-500 chars): Better for precise, fact-based Q&A where each answer fits in a few sentences.
  • Larger chunks (1000-2000 chars): Better for questions that require broader context, such as summarization or multi-step reasoning.
  • Default (1000 chars): A good starting point for most use cases.
Other recommendations:
  • Tag documents with department, version, or topic metadata at upload time. This lets agents restrict search to relevant subsets, improving both relevance and speed.
  • Re-upload updated documents on a regular schedule. Stale documents can lead to incorrect or outdated answers from your agents.
  • After bulk uploads, check for failed documents and address errors promptly. A knowledge base with unprocessed documents has gaps in its coverage.

Next Steps