
Overview

Nadoo AI uses Celery for asynchronous background task processing. Document ingestion, embedding generation, graph extraction, and scheduled maintenance all run in Celery workers, keeping the API server responsive. The production deployment runs three Celery-related containers:
| Container | Role |
| --- | --- |
| nadoo-celery-prod | Worker that processes tasks from all queues |
| nadoo-celery-beat-prod | Scheduler that dispatches periodic tasks |
| nadoo-backend-prod | API server that enqueues tasks |

Architecture

Task Queues

Celery is configured with seven queues to prioritize different workloads:
| Queue | Routing Key | Tasks |
| --- | --- | --- |
| default | task.# | General tasks, channel gateway delivery, scheduler jobs |
| high_priority | high.# | Web search (CRAG fallback) and time-sensitive operations |
| low_priority | low.# | Cleanup, audit log statistics |
| documents | document.# | Document processing, paragraph chunking, GraphRAG extraction |
| exports | export.# | Data exports |
| emails | email.# | Email delivery |
| embeddings | embeddings | Vector embedding generation |

Task Routing

Tasks are routed to queues automatically by module path or custom task name:
task_routes = {
    "src.tasks.document_tasks.*": {"queue": "documents"},
    "src.tasks.cleanup_tasks.*":  {"queue": "low_priority"},
    "graphrag.*":                 {"queue": "documents"},
    "audit.*":                    {"queue": "low_priority"},
    "web_search.*":               {"queue": "high_priority"},
    "channels.*":                 {"queue": "default"},
    "scheduler.*":                {"queue": "default"},
}
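Routing works by matching a task's fully qualified name against these glob patterns; the first matching pattern decides the queue. A minimal sketch of that resolution logic, using Python's `fnmatch` to approximate Celery's glob matching (the `resolve_queue` helper is illustrative, not part of the codebase):

```python
from fnmatch import fnmatch

# Illustrative copy of the routing table above, mapping pattern -> queue.
TASK_ROUTES = {
    "src.tasks.document_tasks.*": "documents",
    "src.tasks.cleanup_tasks.*": "low_priority",
    "graphrag.*": "documents",
    "audit.*": "low_priority",
    "web_search.*": "high_priority",
    "channels.*": "default",
    "scheduler.*": "default",
}

def resolve_queue(task_name: str, default: str = "default") -> str:
    """Return the queue a task name resolves to; first matching pattern wins."""
    for pattern, queue in TASK_ROUTES.items():
        if fnmatch(task_name, pattern):
            return queue
    return default

print(resolve_queue("graphrag.extract_entities"))        # documents
print(resolve_queue("src.tasks.document_tasks.ingest"))  # documents
print(resolve_queue("some.unrouted.task"))               # default
```

Any task whose name matches no pattern falls through to the default queue, which is why new task modules should be added to the routing table explicitly.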

Docker Configuration

Worker

celery:
  image: nadoo-backend:latest
  container_name: nadoo-celery-prod
  env_file:
    - ../packages/backend/.env
  environment:
    NADOO_DB_HOST: postgres
    NADOO_DB_PORT: 5432
    NADOO_REDIS_HOST: redis
    NADOO_REDIS_PORT: 6379
    NADOO_DEBUG: "false"
    ENVIRONMENT: production
  volumes:
    - backend_uploads:/app/uploads
    - backend_logs:/app/logs
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  restart: on-failure:5
  command: >
    celery -A src.core.celery_app worker
    --loglevel=info
    --pool=threads
    -c 4
    -Q documents,default,embeddings,high_priority,low_priority

Beat Scheduler

celery-beat:
  image: nadoo-backend:latest
  container_name: nadoo-celery-beat-prod
  env_file:
    - ../packages/backend/.env
  environment:
    NADOO_DB_HOST: postgres
    NADOO_DB_PORT: 5432
    NADOO_REDIS_HOST: redis
    NADOO_REDIS_PORT: 6379
    NADOO_DEBUG: "false"
    ENVIRONMENT: production
  volumes:
    - backend_logs:/app/logs
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  restart: on-failure:5
  command: celery -A src.core.celery_app beat --loglevel=info
Always run exactly one Celery Beat instance; running multiple Beat containers causes duplicate periodic tasks. The RedBeat distributed lock (nadoo:redbeat:lock) mitigates this, but a single instance is the safest configuration.

Worker Configuration

Pool Type

The production configuration uses --pool=threads instead of the default prefork pool:
celery -A src.core.celery_app worker --pool=threads -c 4
This avoids fork-related crashes with PyTorch and sentence-transformers (SIGSEGV issues) and works well for I/O-bound tasks like API calls to AI providers, document uploads, and embedding requests.

Key Settings

| Setting | Value | Purpose |
| --- | --- | --- |
| task_time_limit | 1800 (30 min) | Hard kill for stuck tasks |
| task_soft_time_limit | 1500 (25 min) | Raises SoftTimeLimitExceeded for graceful cleanup |
| worker_prefetch_multiplier | 4 | Number of tasks prefetched per worker thread |
| worker_max_tasks_per_child | 1000 | Recycle worker threads after 1000 tasks to prevent memory leaks |
| task_acks_late | true | Acknowledge tasks only after completion (prevents loss on crash) |
| task_reject_on_worker_lost | true | Requeue tasks if the worker process is killed |
| task_default_retry_delay | 60 | Wait 60 seconds before retrying a failed task |
| task_max_retries | 3 | Maximum retry attempts per task |
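As a sketch, the core of these settings maps onto standard Celery configuration keys roughly as follows (values copied from the table; the exact layout inside src/core/celery_app.py is an assumption, and retry delay/max retries are commonly set per task via decorator arguments such as `@task(default_retry_delay=60, max_retries=3)`):

```python
# Illustrative worker configuration dict using standard Celery config keys.
worker_settings = {
    "task_time_limit": 1800,          # hard kill after 30 minutes
    "task_soft_time_limit": 1500,     # SoftTimeLimitExceeded after 25 minutes
    "worker_prefetch_multiplier": 4,  # tasks prefetched per consumer
    "worker_max_tasks_per_child": 1000,
    "task_acks_late": True,           # ack only after the task finishes
    "task_reject_on_worker_lost": True,
}

# The soft limit must fire before the hard kill so tasks get a cleanup window.
assert worker_settings["task_soft_time_limit"] < worker_settings["task_time_limit"]
```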

Broker Reliability

broker_connection_retry_on_startup = True
broker_connection_retry = True
broker_connection_max_retries = 10
broker_transport_options = {
    "visibility_timeout": 3600,
    "socket_keepalive": True,
    "retry_on_timeout": True,
}
The visibility_timeout of 3600 seconds (1 hour) must exceed task_time_limit (1800 seconds). If a task is not acknowledged within this window, Redis makes it visible to other workers.
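That invariant can be enforced with a small startup check; the helper below is illustrative, not part of the codebase:

```python
def check_broker_timeouts(visibility_timeout: int, task_time_limit: int) -> None:
    """Fail fast if Redis could redeliver a task that is still running."""
    if visibility_timeout <= task_time_limit:
        raise ValueError(
            f"visibility_timeout ({visibility_timeout}s) must exceed "
            f"task_time_limit ({task_time_limit}s), or Redis may hand an "
            "unacknowledged, still-running task to a second worker"
        )

check_broker_timeouts(3600, 1800)  # passes with the production values
```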

Periodic Tasks (Beat Schedule)

Celery Beat dispatches these tasks on a cron schedule:
| Task | Schedule | Queue | Purpose |
| --- | --- | --- | --- |
| audit.cleanup_expired_logs | Daily at 02:00 UTC | low_priority | Remove expired audit log entries |
| audit.get_statistics | Hourly | low_priority | Compute audit log statistics |
| cleanup.recover_orphaned_documents | Every 10 minutes | low_priority | Re-process documents stuck > 30 min |
| cleanup.cleanup_vlm_temp_files | Daily at 03:00 UTC | low_priority | Delete VLM temp files older than 24 hours |
| channels.process_delivery_retries | Every 30 seconds | default | Retry failed channel message deliveries |
| channels.health_check | Every 5 minutes | default | Check channel connector health |
| channels.sync_webhooks | Every 6 hours | default | Re-register channel webhooks |
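As a sketch, the first few entries of that schedule might look like this in Celery's beat_schedule format (entry names and the exact config layout are assumptions; task names, queues, and timings come from the table above):

```python
from celery.schedules import crontab

beat_schedule = {
    "audit-cleanup-expired-logs": {
        "task": "audit.cleanup_expired_logs",
        "schedule": crontab(hour=2, minute=0),  # daily at 02:00 UTC
        "options": {"queue": "low_priority"},
    },
    "cleanup-recover-orphaned-documents": {
        "task": "cleanup.recover_orphaned_documents",
        "schedule": crontab(minute="*/10"),     # every 10 minutes
        "options": {"queue": "low_priority"},
    },
    "channels-process-delivery-retries": {
        "task": "channels.process_delivery_retries",
        "schedule": 30.0,                       # every 30 seconds
        "options": {"queue": "default"},
    },
}
```

Interval schedules can be given as plain seconds, while cron-style timings use `crontab`; the `options` dict routes each dispatch to its queue.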

Scaling Workers

Horizontal Scaling

Run multiple worker containers for higher throughput. Assign each worker to specific queues to isolate workloads:
# Worker 1: Document-heavy tasks
celery -A src.core.celery_app worker --pool=threads -c 4 -Q documents,embeddings -n worker-docs@%h

# Worker 2: High-priority and general tasks
celery -A src.core.celery_app worker --pool=threads -c 4 -Q high_priority,default -n worker-general@%h

# Worker 3: Low-priority maintenance
celery -A src.core.celery_app worker --pool=threads -c 2 -Q low_priority -n worker-maint@%h
Use the -n flag to give each worker a unique name. This is required when running multiple workers on the same host to avoid name collisions.

Adjusting Concurrency

Increase -c (concurrency) based on available CPU and memory:
# High-throughput server (8+ cores, 32 GB RAM)
celery -A src.core.celery_app worker --pool=threads -c 16 -Q documents,default,embeddings,high_priority,low_priority

Monitoring with Flower

Flower provides a real-time web dashboard for Celery workers.
# Install Flower
pip install flower

# Start the dashboard
celery -A src.core.celery_app flower --port=5555

# Or run as a Docker container
docker run -d --name flower \
  --network nadoo-prod-network \
  -p 5555:5555 \
  mher/flower:latest \
  celery flower --broker=redis://redis:6379/0 --port=5555
Access the Flower UI at http://localhost:5555. It displays:
  • Active, processed, and failed task counts
  • Worker status and resource usage
  • Task execution time distributions
  • Queue lengths and throughput
Flower exposes detailed internal state. In production, protect it with authentication or restrict access to internal networks only.

Troubleshooting

Worker crashes with SIGSEGV. This is typically caused by forking with PyTorch or sentence-transformers loaded. Ensure you are using --pool=threads instead of the default --pool=prefork:
celery -A src.core.celery_app worker --pool=threads -c 4
Tasks stay queued and never run. Check that the worker is consuming from the correct queue:
# List active queues on the worker
celery -A src.core.celery_app inspect active_queues

# Verify task routing
celery -A src.core.celery_app inspect registered
If the task queue is not listed, restart the worker with the -Q flag including the missing queue name.
Periodic tasks fire more than once. Running more than one Celery Beat instance causes duplicate task dispatching. Verify only one Beat container is active:
docker ps | grep celery-beat
RedBeat uses a distributed lock in Redis at nadoo:redbeat:lock. If a stale lock exists after an unclean shutdown, clear it:
docker exec nadoo-redis redis-cli DEL nadoo:redbeat:lock
Tasks are lost when Redis restarts. Enable AOF persistence on Redis (already configured in the production Docker Compose):
redis-server --appendonly yes --appendfsync everysec
Combined with task_acks_late=True, unfinished tasks are requeued automatically when Redis recovers.