Overview
Nadoo AI uses Celery for asynchronous background task processing. Document ingestion, embedding generation, graph extraction, and scheduled maintenance all run in Celery workers, keeping the API server responsive. The production deployment runs three Celery-related containers:

| Container | Role |
|---|---|
| nadoo-celery-prod | Worker that processes tasks from all queues |
| nadoo-celery-beat-prod | Scheduler that dispatches periodic tasks |
| nadoo-backend-prod | API server that enqueues tasks |
Architecture
Task Queues
Celery is configured with seven queues to prioritize different workloads:

| Queue | Routing Key | Tasks |
|---|---|---|
| default | task.# | General tasks, channel gateway delivery, scheduler jobs |
| high_priority | high.# | Web search (CRAG fallback) and time-sensitive operations |
| low_priority | low.# | Cleanup, audit log statistics |
| documents | document.# | Document processing, paragraph chunking, GraphRAG extraction |
| exports | export.# | Data exports |
| emails | email.# | Email delivery |
| embeddings | embeddings | Vector embedding generation |
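The `#` in these routing keys is the AMQP topic wildcard, matching zero or more dot-separated words. A minimal sketch of the matching semantics, with the table's patterns as plain data (in the real app these would be kombu `Queue` objects rather than a dict):

```python
# Queue name -> routing-key pattern, mirroring the table above.
TASK_QUEUES = {
    "default":       "task.#",
    "high_priority": "high.#",
    "low_priority":  "low.#",
    "documents":     "document.#",
    "exports":       "export.#",
    "emails":        "email.#",
    "embeddings":    "embeddings",  # exact key, no wildcard
}

def key_matches(routing_key: str, pattern: str) -> bool:
    """AMQP-style match for the trailing-# patterns used here."""
    if pattern.endswith(".#"):
        prefix = pattern[:-2]
        # '#' matches zero or more words, so the bare prefix also matches.
        return routing_key == prefix or routing_key.startswith(prefix + ".")
    return routing_key == pattern
```

So a task published with routing key `document.process` lands in the `documents` queue, while `documents` (no dot) would not match `document.#`.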
Task Routing
Tasks are routed to queues automatically by module path or custom task name.
Docker Configuration
Worker
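A sketch of what the worker service might look like in Docker Compose (the image name, build details, and module path are assumptions; only the container name and the pool/queue settings come from this page):

```yaml
nadoo-celery-prod:
  image: nadoo-backend:latest          # assumed: typically shares the API image
  command: >
    celery -A app.celery_app worker
    --pool=threads -c 8
    -Q default,high_priority,low_priority,documents,exports,emails,embeddings
  depends_on:
    - redis
```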
Beat Scheduler
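A corresponding Beat sketch (the RedBeat scheduler is inferred from the troubleshooting section below; image name and module path are assumptions):

```yaml
nadoo-celery-beat-prod:
  image: nadoo-backend:latest          # assumed
  command: celery -A app.celery_app beat --scheduler redbeat.RedBeatScheduler
  depends_on:
    - redis
```

Only one Beat container should ever run at a time; see "Duplicate periodic tasks" under Troubleshooting.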
Worker Configuration
Pool Type
The production configuration uses `--pool=threads` instead of the default prefork pool.
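The prefork pool forks worker processes, which can crash when libraries such as PyTorch or sentence-transformers are already loaded in the parent (see "Worker crashes with SIGSEGV" below); a thread pool avoids the fork entirely. The flag appears on the worker command line, roughly (module path and concurrency value are assumptions):

```shell
celery -A app.celery_app worker --pool=threads --concurrency=8
```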
Key Settings
| Setting | Value | Purpose |
|---|---|---|
| task_time_limit | 1800 (30 min) | Hard kill for stuck tasks |
| task_soft_time_limit | 1500 (25 min) | Raises SoftTimeLimitExceeded for graceful cleanup |
| worker_prefetch_multiplier | 4 | Number of tasks prefetched per worker thread |
| worker_max_tasks_per_child | 1000 | Recycle worker threads after 1000 tasks to prevent memory leaks |
| task_acks_late | true | Acknowledge tasks only after completion (prevents loss on crash) |
| task_reject_on_worker_lost | true | Requeue tasks if the worker process is killed |
| task_default_retry_delay | 60 | Wait 60 seconds before retrying a failed task |
| task_max_retries | 3 | Maximum retry attempts per task |
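For reference, the same values expressed as a plain mapping (a sketch; in the real app they would be applied via `app.conf.update(...)`, and note that in Celery the retry delay and max retries are usually per-task attributes rather than global keys):

```python
# Values mirror the table above; all times are in seconds.
CELERY_CONF = {
    "task_time_limit": 1800,           # hard kill after 30 min
    "task_soft_time_limit": 1500,      # SoftTimeLimitExceeded after 25 min
    "worker_prefetch_multiplier": 4,
    "worker_max_tasks_per_child": 1000,
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
    "task_default_retry_delay": 60,
    "task_max_retries": 3,
}

# The soft limit must fire before the hard kill, or cleanup code never runs.
assert CELERY_CONF["task_soft_time_limit"] < CELERY_CONF["task_time_limit"]
```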
Broker Reliability
The Redis broker's `visibility_timeout` of 3600 seconds (1 hour) must exceed `task_time_limit` (1800 seconds). If a task is not acknowledged within this window, Redis redelivers it to other workers.
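This invariant lives in the broker transport options; a sketch (the `visibility_timeout` key is Celery's Redis transport option, values come from this page):

```python
task_time_limit = 1800  # hard limit from the worker configuration above

# Passed to the Redis broker via app.conf.broker_transport_options.
broker_transport_options = {
    "visibility_timeout": 3600,  # seconds before Redis redelivers an unacked task
}

# A stuck task must be killed (and requeued by the worker) before Redis
# independently redelivers it, or the same task could run twice concurrently.
assert broker_transport_options["visibility_timeout"] > task_time_limit
```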
Periodic Tasks (Beat Schedule)
Celery Beat dispatches these tasks on a cron schedule:

| Task | Schedule | Queue | Purpose |
|---|---|---|---|
| audit.cleanup_expired_logs | Daily at 02:00 UTC | low_priority | Remove expired audit log entries |
| audit.get_statistics | Hourly | low_priority | Compute audit log statistics |
| cleanup.recover_orphaned_documents | Every 10 minutes | low_priority | Re-process documents stuck > 30 min |
| cleanup.cleanup_vlm_temp_files | Daily at 03:00 UTC | low_priority | Delete VLM temp files older than 24 hours |
| channels.process_delivery_retries | Every 30 seconds | default | Retry failed channel message deliveries |
| channels.health_check | Every 5 minutes | default | Check channel connector health |
| channels.sync_webhooks | Every 6 hours | default | Re-register channel webhooks |
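A sketch of how a few of these entries might look in a `beat_schedule` (entry names and exact kwargs are assumptions; the task names, queues, and timings come from the table; sub-minute intervals are given as plain seconds, which Beat accepts):

```python
from celery.schedules import crontab

beat_schedule = {
    "cleanup-expired-audit-logs": {
        "task": "audit.cleanup_expired_logs",
        "schedule": crontab(minute=0, hour=2),  # daily at 02:00 UTC
        "options": {"queue": "low_priority"},
    },
    "recover-orphaned-documents": {
        "task": "cleanup.recover_orphaned_documents",
        "schedule": crontab(minute="*/10"),     # every 10 minutes
        "options": {"queue": "low_priority"},
    },
    "process-delivery-retries": {
        "task": "channels.process_delivery_retries",
        "schedule": 30.0,                       # every 30 seconds
        "options": {"queue": "default"},
    },
}
```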
Scaling Workers
Horizontal Scaling
Run multiple worker containers for higher throughput. Assign each worker to specific queues to isolate workloads.
Adjusting Concurrency
Increase `-c` (concurrency) based on available CPU and memory.
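Both ideas combined, as it might look across two workers (the service split, queue assignments, and concurrency values are illustrative assumptions; the module path is assumed as well):

```shell
# Worker 1: heavy document and embedding work, higher concurrency
celery -A app.celery_app worker --pool=threads -c 16 -Q documents,embeddings

# Worker 2: everything else
celery -A app.celery_app worker --pool=threads -c 8 \
  -Q default,high_priority,low_priority,exports,emails
```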
Monitoring with Flower
Flower provides a real-time web dashboard for Celery workers, available at http://localhost:5555. It displays:
- Active, processed, and failed task counts
- Worker status and resource usage
- Task execution time distributions
- Queue lengths and throughput
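Flower typically runs as its own process or container; a sketch of the invocation (module path and port binding are assumptions):

```shell
celery -A app.celery_app flower --port=5555
```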
Troubleshooting
Worker crashes with SIGSEGV
This is typically caused by forking a process with PyTorch or sentence-transformers already loaded. Ensure the worker is started with `--pool=threads` instead of the default `--pool=prefork`.
Tasks stuck in PENDING state
Check that the worker is consuming from the correct queue. If the queue is not listed, restart the worker with the `-Q` flag including the missing queue name.
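The queues a live worker is consuming can be listed with Celery's inspect command (the module path is an assumption):

```shell
celery -A app.celery_app inspect active_queues
```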
Duplicate periodic tasks
Running more than one Celery Beat instance causes duplicate task dispatching, so verify that only one Beat container is active. RedBeat uses a distributed lock in Redis at `nadoo:redbeat:lock`; if a stale lock remains after an unclean shutdown, clear it.
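A sketch of both checks (the lock key comes from this page; how you reach `redis-cli` depends on your deployment and is an assumption here):

```shell
# 1. Confirm exactly one Beat container is running
docker ps --filter "name=celery-beat"

# 2. Clear a stale RedBeat lock after an unclean shutdown
redis-cli DEL nadoo:redbeat:lock
```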
Tasks lost after Redis restart
Enable AOF persistence on Redis (already configured in the production Docker Compose). Combined with `task_acks_late=True`, unfinished tasks are requeued automatically when Redis recovers.
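The relevant Redis settings, as they might appear in `redis.conf` or as container command flags (an illustrative fragment; `everysec` is Redis's default AOF fsync policy):

```conf
appendonly yes
appendfsync everysec
```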