Overview
Reflection Mode implements a self-critique loop where the LLM generates a response, evaluates it against configurable quality criteria, and iteratively revises it until it meets a quality threshold or reaches the maximum number of reflection cycles. This mode is designed for tasks where the first draft is rarely good enough — content creation, code generation, technical writing, and any output that benefits from systematic review and revision.
How It Works
Generate
The LLM produces an initial draft response to the prompt.
Reflect
The response is evaluated against each configured criterion (accuracy, completeness, clarity, etc.). The LLM assigns a score and identifies specific weaknesses.
Decide
If the overall quality score meets or exceeds quality_threshold, the response is accepted. Otherwise, the cycle continues.
Revise
The LLM generates a targeted critique highlighting what needs improvement, then produces a revised response that addresses the identified issues.
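The loop above can be sketched in a few lines of Python. Note that generate, reflect, and revise below are hypothetical stubs standing in for the real LLM calls; they are not part of any actual API:

```python
# Minimal sketch of the reflection loop (hypothetical stubs, not the real API).

def generate(prompt):
    """Stand-in for the initial LLM generation call."""
    return "draft"

def reflect(response, criteria):
    """Stand-in for the self-critique call: scores each criterion 0.0-1.0."""
    base = 0.9 if response.endswith("(revised)") else 0.6
    return {criterion: base for criterion in criteria}

def revise(response, scores):
    """Stand-in for the targeted-revision call."""
    return response + " (revised)"

def run_reflection(prompt, criteria, quality_threshold=0.8, max_reflections=3):
    """Generate a draft, then reflect/decide/revise up to max_reflections times."""
    response = generate(prompt)
    for cycle in range(1, max_reflections + 1):
        scores = reflect(response, criteria)          # Reflect: per-criterion scores
        overall = sum(scores.values()) / len(scores)  # overall = average of criteria
        if overall >= quality_threshold:              # Decide: accept and stop
            break
        response = revise(response, scores)           # Revise: targeted rewrite
    return response, overall
```

With these stubs, the draft scores 0.6, gets revised once, and the revision scores 0.9, so the loop stops after two cycles.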
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| agent_mode | string | — | Must be "reflection" |
| reflection_config.max_reflections | number | 3 | Maximum reflection-revision cycles |
| reflection_config.quality_threshold | float | 0.8 | Score (0.0-1.0) required to accept the response |
| reflection_config.criteria | string[] | ["accuracy", "completeness", "clarity"] | Evaluation criteria for self-critique |
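A minimal configuration sketch using the parameters above (the field names come from the table; how this fragment nests inside a node definition depends on your workflow schema):

```json
{
  "agent_mode": "reflection",
  "reflection_config": {
    "max_reflections": 3,
    "quality_threshold": 0.8,
    "criteria": ["accuracy", "completeness", "clarity", "tone"]
  }
}
```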
Evaluation Criteria
The criteria array defines what the LLM evaluates during each reflection cycle. Each criterion is scored from 0.0 to 1.0, and the overall quality score is the average across all criteria.
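For instance, a minimal computation of the overall score (the per-criterion values here are illustrative):

```python
# Overall quality score = mean of the per-criterion scores (illustrative values).
scores = {"accuracy": 0.9, "completeness": 0.85, "clarity": 0.95}
overall = sum(scores.values()) / len(scores)
accepted = overall >= 0.8  # compared against quality_threshold
```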
Built-in Criteria
| Criterion | What It Evaluates |
|---|---|
| accuracy | Are the facts, data, and claims correct? |
| completeness | Does the response fully address the user’s question? |
| clarity | Is the response easy to understand and well-organized? |
| conciseness | Is the response free of unnecessary repetition or filler? |
| tone | Does the tone match the intended audience and context? |
| formatting | Is the output properly formatted (headings, lists, code blocks)? |
| relevance | Does every part of the response relate to the question? |
Custom Criteria
You can include custom criteria strings alongside the built-in ones. The LLM will interpret them and incorporate them into its evaluation.
Quality Threshold
The quality_threshold parameter controls when the reflection loop terminates:
| Threshold | Behavior |
|---|---|
| 0.6 | Lenient — accepts responses after minimal revision |
| 0.8 | Balanced — good default for most use cases |
| 0.9 | Strict — pushes for near-perfect output, may use all reflection cycles |
| 1.0 | Maximum — will always use all max_reflections (score of 1.0 is very hard to achieve) |
Setting quality_threshold too high (above 0.9) may cause the agent to use all reflection cycles without meaningful improvement in later iterations. A threshold of 0.8 typically strikes the right balance between quality and efficiency.
SSE Events
Reflection mode emits these events during execution:
| Event | When | Payload |
|---|---|---|
| node_started | Node begins | { node_id } |
| llm_token | Each token generated | { token, node_id } |
| agent_reflection | Each reflection cycle | { cycle, scores, critique, overall_score, node_id } |
| llm_finished | Final response generated | { node_id, total_tokens } |
| node_finished | Node completes | { node_id, status, reflections_used } |
The agent_reflection event is unique to Reflection mode. It provides real-time visibility into the self-critique process.
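A payload might look like the following. The scores, critique text, node_id value, and the event/data framing are invented for illustration; only the field names come from the table above:

```json
{
  "event": "agent_reflection",
  "data": {
    "cycle": 1,
    "scores": { "accuracy": 0.80, "completeness": 0.60, "clarity": 0.85 },
    "overall_score": 0.75,
    "critique": "The response does not address the second part of the question.",
    "node_id": "reflect_1"
  }
}
```

Here the overall score of 0.75 falls below the default 0.8 threshold, so a revision cycle follows.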
The Reflection Cycle in Detail
Cycle 1: Initial Generation + First Reflection
Cycle 2: Revision + Second Reflection
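An illustrative score progression across these two cycles (the numbers are invented for the example; a real run's scores will differ):

```json
[
  { "cycle": 1, "overall_score": 0.70, "scores": { "accuracy": 0.80, "completeness": 0.55, "clarity": 0.75 } },
  { "cycle": 2, "overall_score": 0.85, "scores": { "accuracy": 0.85, "completeness": 0.85, "clarity": 0.85 } }
]
```

Cycle 1 falls below the default 0.8 threshold and triggers a revision; cycle 2 clears it and the response is accepted.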
Example: Content Creation Workflow
A workflow that generates polished blog posts can place a Reflection-mode node after the drafting step, with criteria tuned for writing quality (for example, clarity, tone, and formatting).
Example: Code Review Workflow
Use Reflection mode to review and improve generated code, with criteria focused on correctness and completeness.
Performance Characteristics
| Metric | Reflection Mode |
|---|---|
| LLM calls per execution | 2-7 (generation + 1-3 reflect/revise pairs) |
| Latency | Moderate-High (multiple generation rounds) |
| Token usage | 2-4x Standard (each reflection cycle is an additional generation) |
| Quality improvement | High for content and writing tasks |
Cost-Quality Tradeoff
Setting max_reflections to 2 is often the sweet spot for cost-effectiveness.
Best Practices
Use specific, measurable criteria
Vague criteria like “good quality” produce vague evaluations. Use specific criteria: “All dates are in ISO 8601 format” is better than “dates are formatted correctly.”
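For example, a criteria array mixing a built-in name with specific custom strings (the custom strings here are hypothetical):

```json
{
  "criteria": [
    "clarity",
    "All dates are in ISO 8601 format",
    "Every code block compiles without modification"
  ]
}
```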
Start with 2 max_reflections
Two reflection cycles are usually sufficient. The first cycle catches major issues, and the second refines details. Add a third only if you consistently see improvement in cycle 3.
Set quality_threshold to 0.8
This is the recommended starting point. Adjust based on observed output quality — if the first generation consistently scores above 0.8, you do not need Reflection mode for that task.
Pair with Research Nodes for grounded content
Reflection mode improves the form of the response (clarity, completeness, tone), but it cannot fix missing information. Use Search Knowledge or ReAct upstream to gather the facts, then use Reflection to polish the output.
Monitor reflection events for optimization
Track the agent_reflection events to see which criteria consistently score low. This can inform system prompt improvements that reduce the need for reflection.