Issue Summary
Whenever a worker container shuts down, in-flight queries on that worker can remain stuck in the “started” state in RQ. This prevents any user from refreshing that query, since every refresh attempt picks up the existing stuck query instead of starting a new one.
Technical details:
- Redash Version: 9.0.0-beta.b42121
- Browser/OS: Debian
- How did you install Redash: Self-hosted Kubernetes
Environment
We run our workers on an autoscaling group inside AWS, and I noticed this behavior on instance termination. We send SIGTERM to the process and wait for it to shut down before terminating the node. The worker container runs the following command: docker-entrypoint worker
A query that can take a few minutes to complete started on a worker that was shut down during query execution:
[2020-10-15 12:25:00,758][PID:15][INFO][rq.worker] queries: 4a033c0a-d4e6-493a-865d-fd24726e432e
[2020-10-15 12:25:00,782][PID:147][INFO][rq.job.redash.tasks.queries.execution] job.func_name=redash.tasks.queries.execution.execute_query job.id=4a033c0a-d4e6-493a-865d-fd24726e432e job=execute_query state=load_ds ds_id=2
[2020-10-15 12:25:01,019][PID:147][INFO][rq.job.redash.tasks.queries.execution] job.func_name=redash.tasks.queries.execution.execute_query job.id=4a033c0a-d4e6-493a-865d-fd24726e432e job=execute_query state=executing_query query_hash=255f676cfc06ebe953c96a286e8aa08f type=rds_mysql ds_id=2 job_id=4a033c0a-d4e6-493a-865d-fd24726e432e queue=queries query_id=12 username=<redacted>
Oct 15 07:30:31 shutting down, got signal: Terminated
I don’t have any more logs from the container itself after it received SIGTERM, but I can see that the whole process tree exited within the same second as the shutdown message. Under /api/admin/queries/rq_status, I can see that the worker is eventually removed, but the query itself remains in the list of started queries. In Redis, the status is still listed as “started”:
$ redis-cli hget rq:job:4a033c0a-d4e6-493a-865d-fd24726e432e "status"
"started"
I expected there to be a 10 second delay between SIGTERM and process exit, based upon the supervisord default stopwaitsecs (http://supervisord.org/configuration.html#program-x-section-settings) and the RQ signal handling (https://python-rq.org/docs/workers/#taking-down-workers). There’s no indication in the logs that the underlying RQ process is handling this signal.
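To make the expectation concrete, here is a minimal Python sketch of the "warm shutdown" behavior described in the RQ docs linked above: on SIGTERM the worker finishes the job it is running and then stops taking new ones. This is a generic illustration, not RQ's or Redash's actual code; GracefulShutdown and work_loop are hypothetical names.

```python
import signal


class GracefulShutdown:
    """Hypothetical helper: record that SIGTERM arrived instead of exiting."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Replace the default SIGTERM action (immediate termination)
        # with a handler that only sets a flag. Must be called from
        # the main thread.
        signal.signal(signal.SIGTERM, self.handle)

    def handle(self, signum, frame):
        self.requested = True


def work_loop(jobs, shutdown):
    """Run jobs in order; stop picking up new ones once shutdown is requested.

    A job that is already running when the signal arrives is allowed
    to finish, so its result is still recorded.
    """
    results = []
    for job in jobs:
        if shutdown.requested:
            break
        results.append(job())
    return results
```

With install() called once at worker startup, a SIGTERM that arrives mid-job would let that job complete and be marked finished, which is what I expected to see here instead of the immediate exit.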
It is possible to recover manually by running redis-cli del rq:job:4a033c0a-d4e6-493a-865d-fd24726e432e and then refreshing the query inside Redash.
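If several queries are stuck at once, the manual recovery could be scripted along these lines. This is only a sketch: find_orphaned_jobs and cleanup_commands are hypothetical helpers, and it assumes you have already collected each job's status and the set of job IDs that live workers report as running (collecting those is not shown).

```python
def find_orphaned_jobs(job_statuses, live_worker_job_ids):
    """Return IDs of jobs marked "started" that no live worker is running.

    job_statuses: dict mapping job ID -> status string from Redis.
    live_worker_job_ids: set of job IDs currently claimed by live workers.
    """
    return sorted(
        job_id
        for job_id, status in job_statuses.items()
        if status == "started" and job_id not in live_worker_job_ids
    )


def cleanup_commands(orphaned_job_ids):
    """Build the same redis-cli command used above, one per orphaned job."""
    return [f"redis-cli del rq:job:{job_id}" for job_id in orphaned_job_ids]
```

The commands are returned as strings rather than executed, so they can be reviewed before anything is deleted from Redis.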