Celery worker stuck with 100% CPU

In our Redash setup we have a problem where a Celery worker gets stuck at 100% CPU and fully occupies a core.
This goes on until all our cores are used up and Redash crashes.

Initially we thought it was related to the Celery bug https://github.com/celery/celery/issues/1845, but we are seeing it on the latest Redash version with a newer Celery version.
One thing to note: strace hangs and shows no output when we attach it to the stuck process.

Please let us know how we can find the cause of this behavior.
Thanks for your help!

Maybe it’s related to this?

Oh, wait. Did it start before v6?

Yes, it started before v6 and is still there.
Initially we thought it was a Celery issue, but even after upgrading it's still the same.

Any pointers would be appreciated.

Hey can you please suggest anything regarding this.

What types of data sources are you using?

I seem to have found the problem:
it's how Celery handles SIGINT on cancellation of a query.

When a query was cancelled, the worker process took up 100% CPU and that core was unusable afterwards. I changed the signal to SIGKILL, which simply removes the process on cancellation.

If SIGKILL makes sense here (it does to me, since the query is being cancelled anyway), I can open a PR for this.
Please let me know your thoughts and I'll proceed accordingly.
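The core of the issue is that SIGINT can be caught or ignored by the target process, while SIGKILL cannot. This is not Redash code, just a minimal stdlib sketch showing how a process that swallows SIGINT keeps spinning at 100% CPU until a SIGKILL removes it (POSIX only):

```python
import signal
import subprocess
import sys
import time

# Child process that ignores SIGINT and busy-loops, mimicking a worker
# whose interrupt is swallowed and that then pegs a core at 100% CPU.
child_src = (
    "import signal\n"
    "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
    "while True:\n"
    "    pass\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src])
time.sleep(1.0)                    # let the child install its handler

proc.send_signal(signal.SIGINT)    # ignored: the child keeps spinning
time.sleep(0.5)
assert proc.poll() is None         # still alive, still burning CPU

proc.send_signal(signal.SIGKILL)   # cannot be caught or ignored
proc.wait(timeout=5)
print(proc.returncode)
```

On POSIX the return code is -9, i.e. killed by signal 9 (SIGKILL), regardless of what the child's signal handling looks like.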


Hmmm, I wonder if this is the cause behind this problem report:

We've been having the 100% CPU Celery issue for months (still running v4), but we couldn't figure out what was causing it.

I just ran a query and cancelled it, and immediately saw another Celery process stuck at 100% (using the MySQL data source in this case). The Celery process stays stuck until we restart Redash. Sometimes we have even had to flush Redis to clear out waiting queries. We're using a dockerized version of Redash, FWIW.

I'm going to try the SIGKILL fix mentioned above. A PR for this seems like it would make sense. Thanks for figuring this out, Rohit!

I found an old commit from 2013 that replaced SIGINT with SIGKILL.

Not sure how SIGINT crept back in for cancellation.
@arikfr what do you think?

SIGINT has the benefit of allowing our process to do cleanup or proper cancellation, as in the case of MySQL following a recent change.
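For illustration of that trade-off (again not Redash code, and the exit code and message here are made up), this is the graceful path SIGINT enables when the handler cooperates: the process gets a chance to cancel the query server-side before exiting, which SIGKILL never allows:

```python
import signal
import subprocess
import sys
import time

# Hypothetical worker that traps SIGINT, performs cleanup (think:
# cancelling the running query on the database server), then exits.
child_src = (
    "import signal, sys, time\n"
    "def handler(signum, frame):\n"
    "    print('cleanup: cancelled query server-side', flush=True)\n"
    "    sys.exit(42)\n"
    "signal.signal(signal.SIGINT, handler)\n"
    "while True:\n"
    "    time.sleep(0.1)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_src],
    stdout=subprocess.PIPE, text=True,
)
time.sleep(1.0)                  # let the child install its handler
proc.send_signal(signal.SIGINT)  # handler runs, cleanup happens
out, _ = proc.communicate(timeout=5)
print(out.strip())
print(proc.returncode)           # the child's own chosen exit code
```

The catch, as this thread shows, is that the cleanup path only works when the handler actually returns or exits; if the interrupted code swallows the signal and loops, you get the stuck-at-100%-CPU behavior instead.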

Did anyone get over this issue by any chance?

I am using Redash v7 in a dockerized environment on an AWS EC2 Ubuntu 18.04 instance (4 CPU cores, 8 GiB memory), and I still have this issue.

My Celery worker stays stuck at 100% CPU unless I restart the server.

One thing I want to try is increasing the number of workers, but let me know if anyone has any pointers on this issue.

This is what my docker-compose file looks like:
version: '2'
x-redash-service: &redash-service
  image: redash/redash:7.0.0.b18042
  depends_on:
    - postgres
    - redis
  env_file: /redash/env
  restart: always
services:
  server:
    <<: *redash-service
    command: server
    ports:
      - "5000:5000"
    environment:
      REDASH_WEB_WORKERS: 4
  scheduler:
    <<: *redash-service
    command: scheduler
    environment:
      QUEUES: "celery"
      WORKERS_COUNT: 1
  scheduled_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
  adhoc_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 2
  redis:
    image: redis:3.0-alpine
    restart: always
  postgres:
    image: postgres:9.5.6-alpine
    env_file: /redash/env
    volumes:
      - /redash/postgres-data:/var/lib/postgresql/data
    restart: always
  nginx:
    image: redash/nginx:latest
    ports:
      - "80:80"
    depends_on:
      - server
    links:
      - server:redash
    restart: always

You can modify the code and use it; that's how we are running it.
Use the base Docker image as-is and add the file with the fix on top.

Increasing the workers would help, but it will just delay the problem under load.

@rohit-conn Thanks for the response.
Can you please elaborate on what you mean by changing the code? A list of steps would be great, as this is a production setup and I'd like to be cautious when troubleshooting.


As mentioned above, the main issue is how Celery handles SIGINT.
I've changed that to SIGKILL in the queries file.
We then built a custom Docker image on top of the original Redash image:
FROM redash/redash:6.0.0.b8537
USER root
COPY queries.py /app/redash/tasks/queries.py
USER redash
This overlays the queries.py with the SIGKILL change into the image.

The downside is that before each upgrade you have to check whether that file changed in the codebase.

Ah, I get it.
That's a pain: we'd have to redo this every time we upgrade to a newer version of the codebase.

Thanks for the pointers.

@arikfr Is there a way to incorporate this SIGKILL change into the current Redash image, or to deploy a patched image?

Please suggest an alternative if none of the above works; I just don't want to maintain a custom image, since that would make it harder to update to the latest Redash later.

Thanks

We are still seeing this issue… any ideas on how to fix it?