Celery worker stuck with 100% CPU

#1

In our Redash setup, a Celery worker gets stuck at 100% CPU and fully occupies a core.
This keeps happening until all our cores are used up and Redash crashes.

Initially we thought it was related to the Celery bug https://github.com/celery/celery/issues/1845, but we are still facing it on the latest Redash version, which uses a newer Celery version.
One point to note is that strace gets stuck when we run it on the process and shows no output.
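
When strace attaches but prints nothing, the process may be spinning in user space without making system calls, so a Python-level stack dump can be more informative. A minimal sketch using the standard library's `faulthandler`, assuming a few lines can be added to the worker's startup code before it gets stuck. The helper name `enable_stack_dumps`, the choice of SIGUSR1, and the log path are hypothetical and not part of Redash or Celery:

```python
import faulthandler
import os
import signal
import tempfile
import time


def enable_stack_dumps(log_path):
    """Dump all thread stacks to log_path whenever the process gets SIGUSR1.

    Hypothetical helper: it has to run inside the worker process before
    the hang occurs. Returns the open file, which must stay open.
    """
    log_file = open(log_path, "w")
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Self-contained demo: register the handler, then signal ourselves.
    path = os.path.join(tempfile.mkdtemp(), "stacks.log")
    log_file = enable_stack_dumps(path)
    os.kill(os.getpid(), signal.SIGUSR1)  # from a shell: kill -USR1 <pid>
    time.sleep(0.2)  # give the handler a moment to write
    with open(path) as f:
        print(f.read())
```

The dump is written by a C-level handler, so it should still work when the Python code is blocked inside a C extension such as a database driver.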

Please let us know how we can find the cause of this behavior.
Thanks for your help.

#2

Maybe it’s related to this?

#3

Oh, wait. Did it start before v6?

#4

Yes.
It started before v6 and is still there.
Initially we thought it was a Celery issue, but even after the upgrade it's still the same.

Any pointers would be appreciated.

#5

Hey, can you please suggest anything regarding this?

#6

What types of data sources are you using?

#7

I seem to have found the problem.
The problem was how Celery handled SIGINT on cancellation of the query.

So when a query was cancelled, the worker process took up 100% CPU and the core was unusable after that. I changed the signal to SIGKILL, and now the process is simply removed on cancellation.

If it makes sense to use SIGKILL (I think it does, since the query is cancelled anyway), I can make a PR for this.
Please let me know your thoughts and I’ll proceed accordingly.
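
To illustrate the difference between the two signals, here is a minimal sketch that does not use Redash or Celery at all. It assumes a POSIX system and uses a stand-in child process that ignores SIGINT; the real worker may misbehave in a different way (for example, looping inside its handler), so this only shows why SIGKILL always ends the process while SIGINT may not. `CHILD_SOURCE` and `demo` are hypothetical names:

```python
import os
import signal
import subprocess
import sys

# Stand-in for a stuck worker child: it ignores SIGINT and never exits on its own.
CHILD_SOURCE = """
import signal, time
signal.signal(signal.SIGINT, signal.SIG_IGN)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def demo():
    """Return (survived_sigint, returncode_after_sigkill) for the child."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # wait until the child has set up SIGINT handling

    os.kill(child.pid, signal.SIGINT)  # polite cancellation: can be ignored
    try:
        child.wait(timeout=1)
        survived_sigint = False
    except subprocess.TimeoutExpired:
        survived_sigint = True

    os.kill(child.pid, signal.SIGKILL)  # cannot be caught or ignored
    returncode = child.wait(timeout=5)
    child.stdout.close()
    return survived_sigint, returncode


if __name__ == "__main__":
    print(demo())
```

The trade-off is that SIGKILL gives the process no chance to clean up, for example to close its database connection, so anything it held open is left for the operating system and the server side to reclaim.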

#8

Hmmm, I wonder if this is the cause behind this problem report:

#9

We’ve been having the 100% CPU celery issue for months (still running v4) but we couldn’t figure out what was causing it.

I just ran a query and cancelled it, and immediately started seeing another Celery process stuck at 100% (using the MySQL data source in this case). The Celery process stays stuck until we restart Redash. Sometimes we have even had to flush Redis to clear out waiting queries. We’re using a dockerized version of Redash, FWIW.

I’m going to try the SIGKILL fix mentioned above. A PR for this seems like it would make sense. Thanks for figuring this out, Rohit!

#10

I actually found an old commit from 2013 that replaced SIGINT with SIGKILL.

Not sure how SIGINT cropped up again for cancellation.
@arikfr what do you think?