In our Redash setup, a Celery worker gets stuck at 100% CPU and fully occupies a core.
This keeps happening until all of our cores are used up and Redash crashes.
Initially we thought it was related to the Celery bug https://github.com/celery/celery/issues/1845, but we are still seeing it on the latest Redash version, which uses a newer Celery version.
One thing to note: strace hangs when we run it on the process and shows no output.
Please let us know how we can find the cause of this behavior.
Thanks for your help.
I seem to have found the problem.
The problem was how Celery handled SIGINT when a query was cancelled.
When a query was cancelled, the worker process went to 100% CPU and the core was unusable after that. I changed the signal to SIGKILL, and now the process is simply removed on cancellation.
If it makes sense to use SIGKILL (I think it does, since the query has been cancelled anyway), I can make a PR for this.
Please let me know your thoughts and I’ll proceed accordingly.
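To illustrate why the two signals behave differently, here is a minimal sketch. It is not Redash or Celery code: the child script and the `cancel_with` helper are hypothetical stand-ins for a worker that does not exit on SIGINT, and it assumes a POSIX system. A process can catch or ignore SIGINT, but it cannot catch or ignore SIGKILL.

```python
import signal
import subprocess
import sys

# Hypothetical child: ignores SIGINT, standing in for a worker that
# does not shut down when a cancellation signal arrives.
CHILD_SOURCE = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def cancel_with(sig, wait_seconds=1.0):
    """Start the child, send `sig`, and return its exit code.

    Returns None if the child is still running after `wait_seconds`.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        # Wait until the child has installed its SIGINT disposition.
        child.stdout.readline()
        child.send_signal(sig)
        try:
            return child.wait(timeout=wait_seconds)
        except subprocess.TimeoutExpired:
            # The signal did not stop the child; clean it up ourselves.
            child.kill()
            child.wait()
            return None
    finally:
        child.stdout.close()


if __name__ == "__main__":
    print("SIGINT  ->", cancel_with(signal.SIGINT))
    print("SIGKILL ->", cancel_with(signal.SIGKILL))
```

On POSIX, `subprocess` reports a negative return code when the child was terminated by a signal, so a result of `None` here means the child survived the signal and a negative number means it was killed. The trade-off with SIGKILL is that the process gets no chance to run any cleanup code.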
We’ve been having the 100% CPU celery issue for months (still running v4) but we couldn’t figure out what was causing it.
I just ran a query and cancelled it, and immediately saw another Celery process stuck at 100% (using the MySQL data source in this case). The Celery process stays stuck until we restart Redash. Sometimes we have even had to flush Redis to clear out waiting queries. We’re using a dockerized version of Redash, FWIW.
I’m going to try the SIGKILL fix mentioned above. A PR for this seems like it would make sense. Thanks for figuring this out, Rohit!
@rohit-conn Thanks for the response.
Can you please elaborate on what you mean by changing the code? It would be great if you could give me a list of steps to perform, since mine is a production setup and I would like to be cautious while troubleshooting.
So, as mentioned above, the main issue here is how Celery handles SIGINT.
I changed that to SIGKILL in the queries file (redash/tasks/queries.py).
After that, we built a custom Docker image on top of the original Redash Docker image:
FROM redash/redash:6.0.0.b8537
USER root
COPY /queries.py /app/redash/tasks/queries.py
USER redash
This should add the queries.py with the SIGKILL change to the image.
The downside is that before each upgrade you have to check whether that file has changed in the upstream code base.
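One way to do that pre-upgrade check is to compare the original queries.py from the image you patched against the one in the image you want to upgrade to. The sketch below assumes you have already copied both files out of the two images; the function names and file paths are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def upstream_changed(old_upstream, new_upstream):
    """Return True if the two copies of the upstream file differ."""
    return file_digest(old_upstream) != file_digest(new_upstream)


if __name__ == "__main__":
    # Hypothetical paths: unmodified queries.py extracted from the old
    # and new Redash images.
    old_copy = "old/queries.py"
    new_copy = "new/queries.py"
    if upstream_changed(old_copy, new_copy):
        print("Upstream queries.py changed; re-apply the patch by hand.")
    else:
        print("Upstream queries.py is unchanged; the patched file can be reused.")
```

If the digests differ, the patched file should be rebuilt from the new upstream version rather than copied over it, otherwise the custom image would silently drop any upstream changes to that file.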
@arikfr Is there a way to incorporate this SIGKILL change into the current Redash image, or to deploy a patched image?
Please suggest an alternative if neither is possible. We would rather not maintain a custom image, since we might then fall behind on updating it to the latest Redash release.
Probably not. In the next release (V9) we drop Celery completely in favor of RQ. We’re already running this way internally. If you’re running Redash V8 or older you can use the patch described above. After V9 releases this won’t be an issue for anyone who upgrades regularly.