Celery worker stuck with 100% CPU

In our Redash setup we have a problem where a Celery worker gets stuck at 100% CPU and fully occupies a core.
This goes on until all our cores are used up and Redash crashes.

Initially we thought it was related to the Celery bug https://github.com/celery/celery/issues/1845, but we are seeing it on the latest Redash version with a newer Celery version.
One thing to note: strace hangs and shows no output when we attach it to the stuck process.

Please let us know how we can find the cause of this behavior.
Thanks for your help!

Maybe it’s related to this?

Oh, wait. Did it start before v6?

Yes, it started before v6 and is still there.
Initially we thought it was a Celery issue, but even after upgrading it's still the same.

Any pointers would be appreciated.

Hey can you please suggest anything regarding this.

What types of data sources are you using?

I seem to have found the problem:
it's how Celery handles SIGINT on cancellation of a query.

When a query was cancelled, the worker process took up 100% CPU and that core was unusable afterwards. I changed the signal to SIGKILL, which simply removes the process on cancellation.

If SIGKILL makes sense here (it does to me, since the query is being cancelled anyway), I can open a PR for this.
Please let me know your thoughts and I'll proceed accordingly.
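The core of the issue is that SIGINT can be caught or ignored by the target process, while SIGKILL cannot. This is not Redash code, just a minimal stdlib sketch showing how a process that swallows SIGINT keeps spinning at 100% CPU until a SIGKILL removes it (POSIX only):

```python
import signal
import subprocess
import sys
import time

# Child process that ignores SIGINT and busy-loops, mimicking a worker
# whose interrupt is swallowed and that then pegs a core at 100% CPU.
child_src = (
    "import signal\n"
    "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
    "while True:\n"
    "    pass\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src])
time.sleep(1.0)                    # let the child install its handler

proc.send_signal(signal.SIGINT)    # ignored: the child keeps spinning
time.sleep(0.5)
assert proc.poll() is None         # still alive, still burning CPU

proc.send_signal(signal.SIGKILL)   # cannot be caught or ignored
proc.wait(timeout=5)
print(proc.returncode)
```

On POSIX the return code is -9, i.e. killed by signal 9 (SIGKILL), regardless of what the child's signal handling looks like.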


Hmmm, I wonder if this is the cause behind this problem report:

We've been having the 100% CPU Celery issue for months (still running v4), but we couldn't figure out what was causing it.

I just ran a query and cancelled it, and immediately saw another Celery process stuck at 100% (using the MySQL data source in this case). The Celery process stays stuck until we restart Redash. Sometimes we have even had to flush Redis to clear out waiting queries. We're using a dockerized version of Redash, FWIW.

I'm going to try the SIGKILL fix mentioned above. A PR for this seems like it would make sense. Thanks for figuring this out, Rohit!

I found an old commit from 2013 that replaced SIGINT with SIGKILL.

Not sure how SIGINT crept back in for cancellation.
@arikfr what do you think?

SIGINT has the benefit of allowing our process to do cleanup or proper cancellation, as in the case of MySQL following a recent change.
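For illustration of that trade-off (again not Redash code, and the exit code and message here are made up), this is the graceful path SIGINT enables when the handler cooperates: the process gets a chance to cancel the query server-side before exiting, which SIGKILL never allows:

```python
import signal
import subprocess
import sys
import time

# Hypothetical worker that traps SIGINT, performs cleanup (think:
# cancelling the running query on the database server), then exits.
child_src = (
    "import signal, sys, time\n"
    "def handler(signum, frame):\n"
    "    print('cleanup: cancelled query server-side', flush=True)\n"
    "    sys.exit(42)\n"
    "signal.signal(signal.SIGINT, handler)\n"
    "while True:\n"
    "    time.sleep(0.1)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_src],
    stdout=subprocess.PIPE, text=True,
)
time.sleep(1.0)                  # let the child install its handler
proc.send_signal(signal.SIGINT)  # handler runs, cleanup happens
out, _ = proc.communicate(timeout=5)
print(out.strip())
print(proc.returncode)           # the child's own chosen exit code
```

The catch, as this thread shows, is that the cleanup path only works when the handler actually returns or exits; if the interrupted code swallows the signal and loops, you get the stuck-at-100%-CPU behavior instead.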

Did anyone get over this issue by any chance?

I am using Redash v7 in a dockerized environment on an AWS EC2 Ubuntu 18.04 instance (4 CPU cores, 8 GiB memory), and I still have this issue.

My Celery worker stays stuck at 100% CPU unless I restart the server.

One thing I want to try is increasing the number of workers, but let me know if anyone has any pointers on this issue.

This is what my docker-compose file looks like:
version: '2'
x-redash-service: &redash-service
  image: redash/redash:7.0.0.b18042
  depends_on:
    - postgres
    - redis
  env_file: /redash/env
  restart: always
services:
  server:
    <<: *redash-service
    command: server
    ports:
      - "5000:5000"
    environment:
      REDASH_WEB_WORKERS: 4
  scheduler:
    <<: *redash-service
    command: scheduler
    environment:
      QUEUES: "celery"
      WORKERS_COUNT: 1
  scheduled_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
  adhoc_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 2
  redis:
    image: redis:3.0-alpine
    restart: always
  postgres:
    image: postgres:9.5.6-alpine
    env_file: /redash/env
    volumes:
      - /redash/postgres-data:/var/lib/postgresql/data
    restart: always
  nginx:
    image: redash/nginx:latest
    ports:
      - "80:80"
    depends_on:
      - server
    links:
      - server:redash
    restart: always

You can modify the code and use it; that's how we are running it.
Use the base Docker image as-is and add the file with the fix on top.

Increasing the workers would help, but it will just delay the problem under load.

@rohit-conn Thanks for the response.
Can you please elaborate on what you mean by changing the code? A list of steps would be great, as this is a production setup and I'd like to be cautious when troubleshooting.


As mentioned above, the main issue is how Celery handles SIGINT.
I've changed that to SIGKILL in the queries file.
We then built a custom Docker image on top of the original Redash image:
FROM redash/redash:6.0.0.b8537
USER root
COPY queries.py /app/redash/tasks/queries.py
USER redash
This overlays the queries.py with the SIGKILL change into the image.

The downside is that before each upgrade you have to check whether that file changed in the codebase.

Ah, I get it.
That's a pain: we'd have to redo this every time we upgrade to a newer version of the codebase.

Thanks for the pointers.

@arikfr Is there a way to incorporate this SIGKILL change into the current Redash image, or to deploy a patched image?

Please suggest an alternative if none of the above works; I just don't want to maintain a custom image, since that would make it harder to update to the latest Redash later.

Thanks

We are still seeing this issue… any ideas on how to fix it?