
Overview

Documents are the foundation of the Nadoo AI Knowledge Base. You upload files or provide URLs; the platform parses and chunks them, generates embeddings, and indexes them for fast retrieval. Once indexed, documents power your RAG pipelines and AI agent workflows with grounded, source-backed answers.

Supported Formats

Nadoo AI accepts a range of document formats, each with format-specific parsing to preserve structure and meaning.
  • PDF (.pdf): Text extraction with layout preservation. OCR fallback for scanned documents.
  • Microsoft Word (.docx, .doc): Heading hierarchy, tables, and inline images are preserved during parsing.
  • Plain Text (.txt): Direct ingestion with no conversion required.
  • Markdown (.md, .mdx): Heading levels are used as chunk boundaries and metadata.
  • Excel (.xlsx, .xls): Each worksheet is processed as a separate logical document. Column headers become metadata.
  • Web Pages (URL): HTML is fetched, cleaned of navigation and boilerplate, and converted to plain text.
File size limits depend on your deployment configuration. The default limit is 50 MB per file. For self-hosted deployments, adjust the MAX_UPLOAD_SIZE environment variable.

Uploading Documents

Via the UI

  1. Navigate to your workspace and open the Knowledge Base section.
  2. Select a knowledge base or create a new one.
  3. Click Upload Documents and drag files into the upload zone, or click to browse.
  4. Optionally add custom metadata tags before confirming.
  5. The upload begins and documents enter the processing pipeline automatically.

Via the API

Use the document upload endpoint to add files programmatically.
curl -X POST \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/document.pdf" \
  -F 'metadata={"tags": ["engineering", "onboarding"]}'
Request parameters:
  • file (File, required): The document file to upload.
  • metadata (JSON, optional): Custom metadata to attach to the document.
  • chunk_size (Integer, optional): Override the knowledge base default chunk size.
  • chunk_overlap (Integer, optional): Override the knowledge base default chunk overlap.
Response:
{
  "id": "doc_abc123",
  "filename": "document.pdf",
  "status": "processing",
  "format": "pdf",
  "size_bytes": 1048576,
  "metadata": {
    "tags": ["engineering", "onboarding"]
  },
  "created_at": "2026-03-09T10:30:00Z"
}
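The same upload can be scripted from Python. This is a minimal sketch using the third-party requests library; the URL builder and function names are illustrative, not part of a platform SDK:

```python
import json

def document_upload_url(base: str, workspace_id: str, kb_id: str) -> str:
    """Build the document upload endpoint URL shown above."""
    return f"{base}/workspaces/{workspace_id}/knowledge/{kb_id}/documents"

def upload_document(base, workspace_id, kb_id, file_path, api_key, metadata=None):
    """POST a file (and optional metadata) to a knowledge base."""
    import requests  # third-party dependency: pip install requests
    data = {}
    if metadata is not None:
        data["metadata"] = json.dumps(metadata)
    with open(file_path, "rb") as f:
        resp = requests.post(
            document_upload_url(base, workspace_id, kb_id),
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=data,
        )
    resp.raise_for_status()
    return resp.json()
```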

Document Processing Pipeline

Every uploaded document passes through a five-stage pipeline before it becomes searchable.
1. Upload

The file is received, validated for format and size, and stored in the platform’s object storage. A document record is created with status processing.
2. Parse

A Celery background worker extracts text from the file using format-specific parsers. Structure such as headings, tables, and page boundaries is preserved as metadata annotations.
3. Chunk

The extracted text is split into overlapping segments. The chunking strategy respects document structure: splits prefer paragraph and heading boundaries over arbitrary character positions.

Default settings:
  • Chunk size: 1000 characters
  • Chunk overlap: 200 characters
  • Separator: \n\n (falls back to sentence and word boundaries)
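The splitting behavior described above can be sketched as follows. This is a simplified illustration, not the platform's actual chunker; in particular it omits the sentence- and word-boundary fallbacks:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # NOTE: a single paragraph longer than chunk_size stays whole in
        # this sketch; a real splitter would fall back to finer boundaries.
        if current and len(current) + 2 + len(para) > chunk_size:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = (current[-overlap:] + "\n\n" + para) if overlap else para
        elif current:
            current = current + "\n\n" + para
        else:
            current = para
    if current:
        chunks.append(current)
    return chunks
```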
4. Embed

Each chunk is converted into a vector using the knowledge base’s configured embedding model. Embedding generation runs in batches for efficiency.
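A minimal sketch of this batching step, with the embedding call abstracted behind a function you supply:

```python
from typing import Callable

def embed_in_batches(chunks: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 32) -> list[list[float]]:
    """Embed chunks in fixed-size batches to reduce per-request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```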
5. Index

Vectors are inserted into the vector store (pgvector by default) and indexed for fast approximate nearest neighbor search. Once indexing completes, the document status transitions to indexed.
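Conceptually, retrieval over the index reduces to nearest-neighbor search by vector similarity. The brute-force cosine-similarity sketch below illustrates the idea; pgvector instead uses approximate indexes (HNSW or IVFFlat) to avoid scanning every vector:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```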

Document Status

Each document has a status that reflects its position in the processing pipeline.
  • processing: The document has been uploaded and is being parsed, chunked, and embedded.
  • indexed: Processing is complete. The document’s chunks are searchable.
  • failed: An error occurred during processing. Check the error details for the specific failure reason.
You can check document status via the API:
curl -X GET \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response for a failed document:
{
  "id": "doc_xyz789",
  "filename": "corrupted.pdf",
  "status": "failed",
  "error": "Unable to extract text from PDF: file appears to be encrypted",
  "created_at": "2026-03-09T10:30:00Z",
  "updated_at": "2026-03-09T10:31:15Z"
}
Documents with failed status do not contribute to search results. Review the error message, fix the source file, and re-upload.
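A polling helper can wrap the status endpoint shown above. In this sketch, fetch_doc is any function that GETs the document endpoint and returns the parsed JSON body; the helper itself is illustrative:

```python
import time
from typing import Callable

def wait_until_done(fetch_doc: Callable[[], dict],
                    timeout: float = 300.0,
                    interval: float = 5.0) -> dict:
    """Poll until the document leaves 'processing', then return its record."""
    deadline = time.monotonic() + timeout
    while True:
        doc = fetch_doc()
        if doc["status"] != "processing":
            return doc  # "indexed" or "failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still processing after {timeout:.0f}s")
        time.sleep(interval)
```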

Metadata Extraction

During parsing, the platform automatically extracts structural metadata from documents. This metadata is stored alongside chunks and can be used for filtered retrieval.

Automatically Extracted Metadata

  • title (from the document title or first heading): Used for display and citation.
  • headings (from section headings in the document): Hierarchical heading path for each chunk.
  • page_number (from PDF page numbers): Maps chunks back to their source page.
  • sheet_name (from the Excel worksheet name): Identifies which sheet a chunk came from.
  • source_url (from the web page URL): The original URL for web-sourced documents.
  • file_type (from the file extension): The format of the uploaded file.
  • word_count (computed): Word count per chunk.

Custom Metadata

You can attach custom key-value metadata at upload time. Custom metadata is propagated to every chunk generated from the document, enabling filtered search.
{
  "metadata": {
    "department": "engineering",
    "version": "2.1",
    "confidentiality": "internal",
    "tags": ["api", "backend", "architecture"]
  }
}
Filtering with metadata at query time:
{
  "query": "How does authentication work?",
  "search_mode": "hybrid",
  "top_k": 5,
  "filters": {
    "department": "engineering",
    "tags": ["api"]
  }
}
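To illustrate how such filters narrow the candidate set, here is a sketch of one plausible matching rule: scalar fields must match exactly, and list-valued filters require at least one value in common. The actual filter semantics are defined by your deployment and may differ:

```python
def matches_filters(chunk_metadata: dict, filters: dict) -> bool:
    """Return True if a chunk's metadata satisfies every filter clause."""
    for key, wanted in filters.items():
        actual = chunk_metadata.get(key)
        if isinstance(wanted, list):
            # List filters: require at least one value in common.
            if not set(wanted) & set(actual or []):
                return False
        elif actual != wanted:
            return False
    return True
```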

Managing Documents

List Documents

Retrieve all documents in a knowledge base with pagination.
curl -X GET \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents?page=1&per_page=20" \
  -H "Authorization: Bearer YOUR_API_KEY"
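A sketch of collecting every page from Python. The response shape (an "items" array) is an assumption here, since the list response body is not shown above:

```python
from typing import Callable

def list_all_documents(fetch_page: Callable[[int, int], dict],
                       per_page: int = 20) -> list[dict]:
    """Walk the paginated list endpoint until a short page signals the end.

    `fetch_page(page, per_page)` should GET the list endpoint and return
    its parsed JSON body, assumed to contain an "items" list.
    """
    docs: list[dict] = []
    page = 1
    while True:
        items = fetch_page(page, per_page).get("items", [])
        docs.extend(items)
        if len(items) < per_page:
            return docs
        page += 1
```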

Delete a Document

Remove a document and all its associated chunks and vectors from the knowledge base.
curl -X DELETE \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
Deleting a document is irreversible. All chunks and embeddings for the document are permanently removed from the vector store.

Re-process a Document

If you change the embedding model or chunking settings for a knowledge base, you can trigger reprocessing for existing documents.
curl -X POST \
  "https://your-instance.example.com/api/v1/workspaces/{workspace_id}/knowledge/{kb_id}/documents/{doc_id}/reprocess" \
  -H "Authorization: Bearer YOUR_API_KEY"
This deletes the existing chunks and vectors, then re-runs the full pipeline with the current settings.

Best Practices

Choose a chunk size that matches how your agents will use the knowledge base:
  • Smaller chunks (300-500 chars): Better for precise, fact-based Q&A where each answer fits in a few sentences.
  • Larger chunks (1000-2000 chars): Better for questions that require broader context, such as summarization or multi-step reasoning.
  • Default (1000 chars): A good starting point for most use cases.
Other recommendations:
  • Tag documents with department, version, or topic metadata at upload time. This lets agents restrict search to relevant subsets, improving both relevance and speed.
  • Re-upload updated documents on a regular schedule. Stale documents can lead to incorrect or outdated answers from your agents.
  • After bulk uploads, check for failed documents and address errors promptly. A knowledge base with unprocessed documents has gaps in its coverage.

Next Steps