How to tune worker processes, worker_max_tasks_per_child

Issue Summary

Celery processes often saturate a CPU core, and eventually the host machine reaches 100% CPU utilization.
I would like to tune the parameters related to the Celery worker.

Technical details:

  • Docker Image: redash:6.0.0.b8537
  • OS: Ubuntu 16.04 LTS
  • 2 cores, 8 GB
  • Current environment variables in the worker container:
REDASH_REDIS_URL=redis://xxx
PYTHONUNBUFFERED=0
QUEUES=queries,scheduled_queries,celery
CELERY_WORKER_PREFETCH_MULTIPLIER=0
SHLVL=0
HOME=/home/redash
WORKERS_COUNT=2
REDASH_DATABASE_URL=postgresql://xxx

I intend to tune worker_max_tasks_per_child, but a default value is already set in docker-entrypoint:
https://github.com/getredash/redash/blob/master/bin/docker-entrypoint/#L10

--max-tasks-per-child=10

Is worker_max_tasks_per_child useful when Celery hangs on CPU?
My understanding is that worker_max_tasks_per_child is a memory-related setting. However, it does not seem possible to tune it in the code above.

First of all, the real cause of the CPU hang may be a heavy query.
Also, I am aware of the roadmap for migrating from Celery to RQ.
For now, I just want to know how these parameters can be tuned.

Any advice would be appreciated.

Tuning worker_max_tasks_per_child is unlikely to affect CPU usage. It tells Celery when it should recycle workers (i.e. after n tasks it will kill the worker process and fork a new one) in order to avoid potential memory leaks.
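
To illustrate the recycling behaviour, here is a minimal sketch using Python's standard multiprocessing.Pool, whose maxtasksperchild option works on the same principle. It is an analogy, not Celery itself: the function names report_pid and pids_for_tasks are made up for illustration, and the "fork" start method assumes a Unix-like host such as the Ubuntu container above.

```python
import multiprocessing as mp
import os


def report_pid(_task_id):
    # Each task only reports which worker process handled it.
    return os.getpid()


def pids_for_tasks(n_tasks, max_tasks_per_child):
    # Assumption: a Unix-like host, so the "fork" start method is available.
    ctx = mp.get_context("fork")
    # A single worker process that is replaced after max_tasks_per_child tasks.
    with ctx.Pool(processes=1, maxtasksperchild=max_tasks_per_child) as pool:
        # chunksize=1 makes every item count as one task for recycling.
        return pool.map(report_pid, range(n_tasks), chunksize=1)


if __name__ == "__main__":
    # The printed PID should change as workers are recycled.
    print(pids_for_tasks(n_tasks=6, max_tasks_per_child=2))
```

The tasks themselves run exactly as before; only the process that hosts them is replaced. That is why the setting helps with memory that accumulates over time, but does nothing for a single task that is busy on the CPU.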

Yes, migration of query executions to RQ is right around the corner - is there anything stopping you from changing the value of worker_max_tasks_per_child in your docker-entrypoint until the migration is complete?

If it’s important for you to set this through an environment variable and you don’t want your docker-entrypoint to diverge from master, go ahead and submit this as a PR and add me as a reviewer.
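
As a rough idea of what such an override could look like, here is a small Python sketch of the logic: read an environment variable and fall back to the current default of 10. The variable name MAX_TASKS_PER_CHILD and the function max_tasks_flag are hypothetical, not something Redash defines today, and the real change would live in the docker-entrypoint shell script.

```python
import os


def max_tasks_flag(environ=None, default=10):
    # MAX_TASKS_PER_CHILD is a hypothetical variable name used for illustration.
    environ = os.environ if environ is None else environ
    raw = environ.get("MAX_TASKS_PER_CHILD", "").strip()
    # Fall back to the default when the variable is unset or empty.
    value = int(raw) if raw else default
    if value < 1:
        raise ValueError("MAX_TASKS_PER_CHILD must be a positive integer")
    return f"--max-tasks-per-child={value}"


if __name__ == "__main__":
    print(max_tasks_flag())
```

Keeping 10 as the fallback means existing deployments behave the same unless the variable is set explicitly.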

Thanks for your advice, @rauchy.
I had mistakenly assumed that max_tasks_per_child might improve CPU usage, because some CPU hang cases in the Celery community were reported as max_tasks_per_child issues.

In my case, I decided that the first countermeasure is to set timeouts so that stuck queries are detected (REDASH_ADHOC_QUERY_TIME_LIMIT, REDASH_JOB_EXPIRY_TIME).
After that, I will consider kernel parameter tuning (e.g. somaxconn, tcp_tw_reuse).
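
Before changing any kernel parameters, it may help to record their current values. A minimal sketch, assuming a Linux host where sysctl values are exposed under /proc/sys (the helper names sysctl_path and read_sysctls are my own):

```python
from pathlib import Path

# sysctl names map to files under /proc/sys, with dots replaced by slashes.
SYSCTLS = ("net.core.somaxconn", "net.ipv4.tcp_tw_reuse")


def sysctl_path(name, root="/proc/sys"):
    return Path(root, *name.split("."))


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    values = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            # Not readable here (non-Linux host or a restricted container).
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        print(f"{name} = {value}")
```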