Overview

The Document Extract Node processes uploaded documents within a running workflow, extracting text content, tables, and images into structured data that downstream nodes can consume. Unlike the Knowledge Base ingestion pipeline (which processes documents asynchronously for long-term storage), the Document Extract Node operates synchronously during workflow execution, making it ideal for on-the-fly document processing.

Supported Formats

| Format | Extensions | Extraction Capabilities |
|---|---|---|
| PDF | .pdf | Full text, tables, embedded images, page-level metadata |
| Microsoft Word | .docx, .doc | Text, tables, inline images, heading structure |
| Plain Text | .txt, .md | Raw text content |
| Spreadsheets | .xlsx, .csv | Tabular data with headers and cell values |
| HTML | .html, .htm | Parsed text content with structural information |

How It Works

  1. Receive Document: The node reads a file reference from the workflow context, typically a file uploaded by the user through the Start Node or a Form Node.
  2. Detect Format: The document format is detected from the file extension and MIME type, and the appropriate extraction engine is selected.
  3. Extract Content: Text, tables, and images are extracted from the document. For multi-page documents, content is organized by page.
  4. Structure Output: The extracted data is written to the workflow context as a structured object, ready for downstream nodes to process.
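The four steps above can be condensed into a single function. The following is a minimal, hypothetical sketch: the engine names, context shape, and `run_document_extract` helper are illustrative, not the node's actual implementation.

```python
import os

# Illustrative extension-to-engine mapping, mirroring the Supported Formats table.
EXTENSION_ENGINES = {
    ".pdf": "pdf_engine",
    ".docx": "word_engine", ".doc": "word_engine",
    ".txt": "text_engine", ".md": "text_engine",
    ".xlsx": "spreadsheet_engine", ".csv": "spreadsheet_engine",
    ".html": "html_engine", ".htm": "html_engine",
}

def run_document_extract(context, input_variable, output_variable):
    # 1. Receive Document: read the file reference from the workflow context.
    file_ref = context[input_variable]

    # 2. Detect Format: pick an engine from the file extension.
    ext = os.path.splitext(file_ref["filename"])[1].lower()
    engine = EXTENSION_ENGINES.get(ext)
    if engine is None:
        raise ValueError(f"Unsupported format: {ext}")

    # 3. Extract Content: stubbed here; a real engine returns per-page
    # text and tables in the shape shown under "Output Structure".
    pages = [{"page_number": 1, "text": "...", "tables": []}]

    # 4. Structure Output: write the result back to the context.
    context[output_variable] = {
        "text": "\n".join(p["text"] for p in pages),
        "pages": pages,
        "metadata": {"filename": file_ref["filename"], "format": ext.lstrip(".")},
    }
    return context
```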

Configuration

{
  "type": "document-extract-node",
  "config": {
    "input_variable": "{{uploaded_file}}",
    "extract_text": true,
    "extract_tables": true,
    "extract_images": false,
    "page_range": null,
    "output_variable": "extracted_content"
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_variable | string | (required) | Reference to the uploaded file in the workflow context |
| extract_text | boolean | true | Extract text content from the document |
| extract_tables | boolean | true | Extract tables as structured arrays |
| extract_images | boolean | false | Extract embedded images (increases processing time) |
| page_range | string or null | null | Page range to process (e.g., "1-5", "3,7,10"); null processes all pages |
| output_variable | string | "extracted_content" | Context variable to store the extraction results |
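For example, a configuration that extracts text and tables from only the first three pages and stores the result under a custom variable (summary_source is an arbitrary example name):

```json
{
  "type": "document-extract-node",
  "config": {
    "input_variable": "{{uploaded_file}}",
    "extract_text": true,
    "extract_tables": true,
    "extract_images": false,
    "page_range": "1-3",
    "output_variable": "summary_source"
  }
}
```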

Output Structure

The node produces a structured extraction result:
{
  "extracted_content": {
    "text": "Full extracted text content of the document...",
    "pages": [
      {
        "page_number": 1,
        "text": "Page 1 text content...",
        "tables": [
          {
            "headers": ["Name", "Role", "Department"],
            "rows": [
              ["Alice Kim", "Engineer", "Platform"],
              ["Bob Park", "Designer", "Product"]
            ]
          }
        ]
      },
      {
        "page_number": 2,
        "text": "Page 2 text content..."
      }
    ],
    "metadata": {
      "filename": "quarterly-report.pdf",
      "format": "pdf",
      "page_count": 12,
      "file_size_bytes": 245760
    }
  }
}
| Field | Description |
|---|---|
| text | Concatenated text from all processed pages |
| pages | Array of page-level extraction results |
| pages[].page_number | 1-based page index |
| pages[].text | Text content of the page |
| pages[].tables | Array of tables found on the page |
| metadata | File-level metadata including name, format, and size |
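A downstream Python Code Node can walk this structure directly. A small illustrative helper (the function name and returned keys are assumptions, not part of the node's API):

```python
def summarize_extraction(extracted_content):
    """Collect simple stats from a Document Extract result.
    Expects the output structure shown above: text, pages, metadata."""
    meta = extracted_content["metadata"]
    # pages[].tables is optional per page, so default to an empty list.
    table_count = sum(
        len(page.get("tables", [])) for page in extracted_content["pages"]
    )
    return {
        "filename": meta["filename"],
        "pages_processed": len(extracted_content["pages"]),
        "total_pages": meta["page_count"],
        "tables_found": table_count,
        "characters": len(extracted_content["text"]),
    }
```

Note that `pages_processed` can be smaller than `metadata.page_count` when a page_range is set.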

Example: Document Q&A Workflow

A workflow that lets users upload a document and ask questions about its content:
  1. The user uploads a PDF and asks a question.
  2. Document Extract Node extracts the full text content.
  3. AI Agent Node receives both the user’s question and the extracted text as context, then generates an answer.
  4. End Node delivers the response.

AI Agent System Prompt

You are a document analysis assistant. Answer the user's question based solely on the following document content:

---
{{extracted_content.text}}
---

If the answer is not found in the document, say so clearly.

Example: Invoice Processing

A workflow that extracts structured data from uploaded invoices: the AI Agent Node is configured to pull specific fields (invoice number, date, line items, total) from the extracted text, and a Python Code Node validates and formats the structured output.
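The validation step in a Python Code Node might look like the following sketch. The field names, date format, and line-item shape are assumptions about what the AI Agent Node was prompted to produce, not a fixed schema:

```python
from datetime import datetime

# Illustrative required fields; align these with your agent's prompt.
REQUIRED_FIELDS = ("invoice_number", "date", "line_items", "total")

def validate_invoice(data):
    """Validate invoice fields produced by the AI Agent Node."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in data or data[field] in (None, "", []):
            errors.append(f"missing field: {field}")
    if not errors:
        # Check the date parses (assumed ISO format).
        try:
            datetime.strptime(data["date"], "%Y-%m-%d")
        except ValueError:
            errors.append(f"bad date format: {data['date']}")
        # Check that line items sum to the stated total.
        items_sum = round(sum(i["amount"] for i in data["line_items"]), 2)
        if items_sum != round(float(data["total"]), 2):
            errors.append(f"line items sum {items_sum} != total {data['total']}")
    return {"valid": not errors, "errors": errors}
```

A Condition Node can then branch on the `valid` flag, routing failures back to a human-review step.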

Working with Tables

When extract_tables is enabled, tables are extracted as structured arrays with headers and rows. This is particularly useful for spreadsheets and PDF documents containing tabular data. Access table data in downstream nodes:
{{extracted_content.pages[0].tables[0].headers}}  // ["Name", "Role", "Department"]
{{extracted_content.pages[0].tables[0].rows[0]}}   // ["Alice Kim", "Engineer", "Platform"]
Table extraction quality depends on the document format. PDF tables with clear grid lines extract most reliably. Complex merged cells or borderless tables may require post-processing with an AI Agent Node.
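Because each extracted table is a plain object with `headers` and `rows`, a Python Code Node can convert it into per-row records for easier lookups. A small illustrative helper (`table_to_records` is not a built-in):

```python
def table_to_records(table):
    """Convert an extracted table ({"headers": [...], "rows": [...]})
    into a list of dicts keyed by header name."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]
```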

Page Range Selection

For large documents, use page_range to process only the pages you need:
| Syntax | Behavior |
|---|---|
| "1-5" | Pages 1 through 5 |
| "3,7,10" | Only pages 3, 7, and 10 |
| "1-3,8-10" | Pages 1-3 and 8-10 |
| null | All pages (default) |
This reduces processing time and keeps the extracted content focused on the relevant sections.
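To make the expansion concrete, here is a hypothetical parser for the syntax above (not the node's internal implementation; out-of-range pages are silently dropped in this sketch):

```python
def parse_page_range(spec, page_count):
    """Expand a page_range string like "1-3,8-10" into a sorted list
    of 1-based page numbers. None means all pages."""
    if spec is None:
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Clamp to the document's actual page count.
    return sorted(p for p in pages if 1 <= p <= page_count)
```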

Best Practices

- Processing a 200-page PDF will be slow and produce a very large context. Use page_range or a preceding Question Node to ask the user which section they need.
- Image extraction significantly increases processing time. Only enable it when the workflow specifically needs to analyze embedded images (e.g., with an Image Understand Node downstream).
- For Q&A workflows, pass extracted_content.text to the AI Agent Node's system prompt or user context. The LLM can then answer questions about the entire document.
- For critical workflows, add a Condition Node after extraction to check metadata.page_count or text length, and route to an error handler if the extraction seems incomplete.
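The completeness check can be implemented as a small Python Code Node feeding the Condition Node. This is a heuristic sketch; the function name and the characters-per-page threshold are illustrative assumptions, so tune them for your documents:

```python
def extraction_looks_complete(result, min_chars_per_page=20):
    """Heuristic check for a Condition Node: flag extractions whose
    text is suspiciously short for the reported page count."""
    page_count = result["metadata"]["page_count"]
    if page_count == 0:
        return False
    return len(result["text"]) >= min_chars_per_page * page_count
```

Scanned (image-only) PDFs typically fail this check, since they contain pages but little extractable text.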

Next Steps