Overview
The Document Extract Node processes uploaded documents within a running workflow, extracting text content, tables, and images into structured data that downstream nodes can consume. Unlike the Knowledge Base ingestion pipeline (which processes documents asynchronously for long-term storage), the Document Extract Node operates synchronously during workflow execution, making it ideal for on-the-fly document processing.
Supported Formats
| Format | Extensions | Extraction Capabilities |
|---|---|---|
| PDF | .pdf | Full text, tables, embedded images, page-level metadata |
| Microsoft Word | .docx, .doc | Text, tables, inline images, heading structure |
| Plain Text | .txt, .md | Raw text content |
| Spreadsheets | .xlsx, .csv | Tabular data with headers and cell values |
| HTML | .html, .htm | Parsed text content with structural information |
How It Works
Receive Document
The node reads a file reference from the workflow context. This is typically a file uploaded by the user through the Start Node or a Form Node.
Detect Format
The node detects the document format from the file extension and MIME type, then selects the matching extraction engine.
Extract Content
Text, tables, and images are extracted from the document. For multi-page documents, content is organized by page.
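The format-detection step can be sketched as follows. This is a minimal illustration, assuming the extension is checked first and the MIME type is used as a fallback; the engine names and the `detect_engine` helper are hypothetical and not part of the node's actual implementation.

```python
from pathlib import Path

# Hypothetical engine names keyed by file extension (from the Supported Formats table).
EXTENSION_ENGINES = {
    ".pdf": "pdf",
    ".docx": "word", ".doc": "word",
    ".txt": "text", ".md": "text",
    ".xlsx": "spreadsheet", ".csv": "spreadsheet",
    ".html": "html", ".htm": "html",
}

# Standard MIME types for the same formats, used when the extension is missing or unknown.
MIME_ENGINES = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "word",
    "application/msword": "word",
    "text/plain": "text",
    "text/markdown": "text",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheet",
    "text/csv": "spreadsheet",
    "text/html": "html",
}


def detect_engine(filename, mime_type=None):
    """Return an engine name for the file, or raise ValueError if unsupported."""
    extension = Path(filename).suffix.lower()
    if extension in EXTENSION_ENGINES:
        return EXTENSION_ENGINES[extension]
    if mime_type in MIME_ENGINES:
        return MIME_ENGINES[mime_type]
    raise ValueError(f"Unsupported document format: {filename!r} ({mime_type})")
```

For example, `detect_engine("Report.PDF")` and `detect_engine("upload", "application/pdf")` both resolve to the same engine, while an unknown extension with no recognized MIME type raises an error that a workflow could route to an error handler.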
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_variable | string | (none) | Reference to the uploaded file in the workflow context |
| extract_text | boolean | true | Extract text content from the document |
| extract_tables | boolean | true | Extract tables as structured arrays |
| extract_images | boolean | false | Extract embedded images (increases processing time) |
| page_range | string or null | null | Page range to process (e.g., "1-5", "3,7,10"). null processes all pages. |
| output_variable | string | "extracted_content" | Context variable to store the extraction results |
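To make the defaults concrete, here is a small sketch that merges user overrides onto the documented default values. The `build_config` helper and the `"uploaded_file"` variable name are assumptions for illustration; the node's real configuration format may differ.

```python
# Defaults taken from the configuration table above.
DEFAULTS = {
    "extract_text": True,
    "extract_tables": True,
    "extract_images": False,
    "page_range": None,
    "output_variable": "extracted_content",
}


def build_config(input_variable, **overrides):
    """Return a full node configuration; reject parameters the table does not list."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown parameters: {sorted(unknown)}")
    return {"input_variable": input_variable, **DEFAULTS, **overrides}


# Hypothetical usage: only the first five pages, with image extraction turned on.
config = build_config("uploaded_file", extract_images=True, page_range="1-5")
```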
Output Structure
The node produces a structured extraction result:
| Field | Description |
|---|---|
| text | Concatenated text from all processed pages |
| pages | Array of page-level extraction results |
| pages[].page_number | 1-based page index |
| pages[].text | Text content of the page |
| pages[].tables | Array of tables found on the page |
| metadata | File-level metadata including name, format, and size |
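As an illustration of the shape described in the table, the sketch below builds a made-up result and walks its pages. The sample values, the `name`/`format`/`size` key names inside `metadata`, and the `summarize_pages` helper are assumptions; only the top-level fields come from the table.

```python
# Illustrative result; every value here is invented for the example.
extracted_content = {
    "text": "Quarterly report\nRevenue grew.",
    "pages": [
        {"page_number": 1, "text": "Quarterly report", "tables": []},
        {"page_number": 2, "text": "Revenue grew.", "tables": []},
    ],
    # Key names below are assumed from "name, format, and size".
    "metadata": {"name": "report.pdf", "format": "pdf", "size": 48213},
}


def summarize_pages(result):
    """Return (page_number, character count, table count) for each page."""
    return [
        (page["page_number"], len(page["text"]), len(page["tables"]))
        for page in result["pages"]
    ]


summary = summarize_pages(extracted_content)
```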
Example: Document Q&A Workflow
A workflow that lets users upload a document and ask questions about its content:
- The user uploads a PDF and asks a question.
- Document Extract Node extracts the full text content.
- AI Agent Node receives both the user’s question and the extracted text as context, then generates an answer.
- End Node delivers the response.
AI Agent System Prompt
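A minimal sketch of how such a system prompt might be assembled from the extraction result, assuming the text is available as a plain string. The prompt wording, the `build_system_prompt` helper, and the `max_chars` cutoff are illustrative assumptions, not the product's built-in prompt.

```python
def build_system_prompt(document_text, max_chars=20000):
    """Wrap the extracted text in instructions that keep the agent grounded in the document."""
    # Truncate very long documents so the prompt stays within a manageable size.
    excerpt = document_text[:max_chars]
    return (
        "You answer questions using only the document below. "
        "If the answer is not in the document, say so.\n\n"
        "--- DOCUMENT START ---\n"
        f"{excerpt}\n"
        "--- DOCUMENT END ---"
    )


prompt = build_system_prompt("Revenue grew 12% year over year.")
```

The user's question would then be passed as the user message, so the document context and the question stay clearly separated.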
Example: Invoice Processing
Extract structured data from uploaded invoices. The AI Agent Node is configured to extract specific fields (invoice number, date, line items, total) from the extracted text, and the Python Code Node validates and formats the structured output.
Working with Tables
When extract_tables is enabled, tables are extracted as structured arrays with headers and rows. This is particularly useful for spreadsheets and PDF documents containing tabular data.
Access table data in downstream nodes:
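One way this might look in a Python Code Node, assuming each table is a dictionary with `headers` and `rows` keys (the exact table shape is an assumption based on the description above, and the sample data is made up):

```python
def table_to_records(table):
    """Turn one table into a list of dicts keyed by column header."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


def all_records(result):
    """Collect records from every table on every page, in page order."""
    records = []
    for page in result["pages"]:
        for table in page["tables"]:
            records.extend(table_to_records(table))
    return records


# Illustrative extraction result with a single two-row table.
sample = {
    "pages": [
        {
            "page_number": 1,
            "text": "",
            "tables": [
                {"headers": ["item", "qty"], "rows": [["Widget", 2], ["Gadget", 5]]}
            ],
        }
    ]
}

records = all_records(sample)
```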
Table extraction quality depends on the document format. PDF tables with clear grid lines extract most reliably. Complex merged cells or borderless tables may require post-processing with an AI Agent Node.
Page Range Selection
For large documents, use page_range to process only the pages you need:
| Syntax | Behavior |
|---|---|
| "1-5" | Pages 1 through 5 |
| "3,7,10" | Only pages 3, 7, and 10 |
| "1-3,8-10" | Pages 1-3 and 8-10 |
| null | All pages (default) |
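The syntax in the table can be interpreted with a short parser like the one below. This is a sketch, not the node's actual parser: the `parse_page_range` helper is hypothetical, and silently dropping pages beyond the end of the document is an assumption about the desired behavior.

```python
def parse_page_range(page_range, total_pages):
    """Expand a page_range string into a sorted list of 1-based page numbers."""
    if page_range is None:
        return list(range(1, total_pages + 1))

    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range segment: {part!r}")
        # Assumption: pages past the end of the document are ignored, not an error.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

With a 20-page document, `parse_page_range("1-3,8-10", 20)` yields pages 1, 2, 3, 8, 9, and 10.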
Best Practices
Limit page ranges for large documents
Processing a 200-page PDF will be slow and produce a very large context. Use page_range, or add a preceding Question Node that asks the user which section they need.
Disable image extraction unless needed
Image extraction significantly increases processing time. Only enable it when the workflow specifically needs to analyze embedded images (e.g., with an Image Understand Node downstream).
Use the full text for LLM context
For Q&A workflows, pass extracted_content.text to the AI Agent Node’s system prompt or user context. The LLM can then answer questions about the entire document.
Validate extraction quality
For critical workflows, add a Condition Node after extraction to check metadata.page_count or text length, and route to an error handler if the extraction seems incomplete.
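The kind of check such a condition might perform can be sketched in Python. The `extraction_looks_complete` helper and the `min_chars_per_page` threshold are hypothetical, and falling back to the length of `pages` when `metadata.page_count` is absent is an assumption.

```python
def extraction_looks_complete(result, min_chars_per_page=20):
    """Heuristic: flag results with no pages, no text, or very little text per page."""
    text = result.get("text") or ""
    metadata = result.get("metadata") or {}
    # Prefer metadata.page_count; otherwise count the extracted pages.
    page_count = metadata.get("page_count") or len(result.get("pages") or [])
    if page_count == 0 or not text.strip():
        return False
    return len(text) / page_count >= min_chars_per_page


# A scanned PDF with no text layer would typically fail this check.
scanned = {"text": "  ", "pages": [{"page_number": 1, "text": "", "tables": []}]}
ok = {"text": "x" * 500, "metadata": {"page_count": 2}}
```

A workflow could route to the error handler whenever this returns False, for example to ask the user for a text-based version of the file.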