Worker shutdowns in v9 resulting in stuck queries

Issue Summary

Whenever a worker container shuts down, there is a possibility that in-flight queries on that worker will remain stuck in the “started” state in RQ. This prevents any user from refreshing that query, since every refresh attempt reuses the existing stuck job.

Technical details:

  • Redash Version: 9.0.0-beta.b42121
  • Browser/OS: Debian
  • How did you install Redash: Self-hosted Kubernetes

Environment

We run our workers on an autoscaling group inside AWS, and I noticed this behavior on instance termination. We send SIGTERM to the process and wait for it to shut down before terminating the node. The worker container runs the following command: docker-entrypoint worker

A query that can take a few minutes to complete was started on a worker that was shut down during query execution:

[2020-10-15 12:25:00,758][PID:15][INFO][rq.worker] queries: 4a033c0a-d4e6-493a-865d-fd24726e432e
[2020-10-15 12:25:00,782][PID:147][INFO][rq.job.redash.tasks.queries.execution] job.func_name=redash.tasks.queries.execution.execute_query job.id=4a033c0a-d4e6-493a-865d-fd24726e432e job=execute_query state=load_ds ds_id=2
[2020-10-15 12:25:01,019][PID:147][INFO][rq.job.redash.tasks.queries.execution] job.func_name=redash.tasks.queries.execution.execute_query job.id=4a033c0a-d4e6-493a-865d-fd24726e432e job=execute_query state=executing_query query_hash=255f676cfc06ebe953c96a286e8aa08f type=rds_mysql ds_id=2  job_id=4a033c0a-d4e6-493a-865d-fd24726e432e queue=queries query_id=12 username=<redacted>
Oct 15 07:30:31 shutting down, got signal: Terminated

I don’t have any more logs from the container itself after it received SIGTERM, but I can see that the whole process tree exited within the same second as the shutdown message. Under /api/admin/queries/rq_status, I can see that the worker is eventually removed, but the query itself remains in the list of started queries. In Redis, the status is still listed as “started”:

$ redis-cli hget rq:job:4a033c0a-d4e6-493a-865d-fd24726e432e "status"
"started"

I expected there to be a 10 second delay between SIGTERM and process exit based on the supervisord default stopwaitsecs (http://supervisord.org/configuration.html#program-x-section-settings) and RQ’s signal handling (https://python-rq.org/docs/workers/#taking-down-workers). There’s no indication in the logs that the underlying RQ process handled the signal.
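For reference, the supervisord behavior I expected corresponds to settings like these (a sketch; the program name is an assumption, and stopwaitsecs=10 and stopsignal=TERM are the documented defaults):

```ini
[program:worker]
command=docker-entrypoint worker
stopsignal=TERM      ; signal sent on shutdown (TERM is the default)
stopwaitsecs=10      ; wait this long after the stop signal before SIGKILL
```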

It is possible to manually recover this by running redis-cli del rq:job:4a033c0a-d4e6-493a-865d-fd24726e432e and refreshing the query inside Redash.


I’ve submitted a pull request that should address part of this here: https://github.com/getredash/redash/pull/5214

Thank you for the detailed report and the pull request. But I wonder if it’s worth addressing this at the root by adding logic to detect queries that are in the started state but whose worker is no longer around. Even if we try to perform a clean shutdown, it’s always possible that something will go wrong.
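The root-level check could be sketched as a small reaper: compare the jobs RQ has registered as started against the jobs currently claimed by live workers, and treat the difference as orphans. This is not Redash code, just a minimal sketch; the function name and the wiring in the comments are assumptions:

```python
def find_orphaned_job_ids(started_job_ids, busy_job_ids):
    """Return job ids that are registered as started but are not
    claimed by any live worker, i.e. candidates for cleanup."""
    return sorted(set(started_job_ids) - set(busy_job_ids))


# Hypothetical wiring against RQ (StartedJobRegistry, Worker.all and
# get_current_job_id are real RQ APIs; the queue name is assumed):
#
# from redis import Redis
# from rq import Worker
# from rq.registry import StartedJobRegistry
#
# conn = Redis()
# registry = StartedJobRegistry("queries", connection=conn)
# busy = {w.get_current_job_id() for w in Worker.all(connection=conn)}
# orphans = find_orphaned_job_ids(registry.get_job_ids(), busy)
# # ...then delete or re-enqueue each orphan.
```

A periodic task running this check would recover stuck queries regardless of how the worker died.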

Did you have a chance to look into this?

Btw, we’re upgrading to RQ 1.5, so maybe something has changed in this regard in the new version?


I think the process you describe is a helpful long-term solution, but in my investigation of the source I didn’t find a quick path to implementing it. The current model also allows two workers to run the same query at the same time in the event of a network partition. That could be fixed with a lease system, where a worker stops running a query once it loses the lease.
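The lease idea could look something like this. This is a hypothetical sketch, not Redash code: an in-memory dict stands in for Redis (where acquisition would typically be `SET key owner NX EX ttl`), and a worker would call `renew` periodically during execution, aborting the query if renewal fails:

```python
import time


class QueryLease:
    """Sketch of a per-query execution lease. A worker may only keep
    executing a query while it holds the lease and keeps renewing it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._leases = {}  # query_id -> (owner, expires_at)

    def acquire(self, query_id, owner):
        """Take the lease unless another owner holds an unexpired one."""
        held = self._leases.get(query_id)
        now = self.clock()
        if held and held[0] != owner and held[1] > now:
            return False  # another live worker holds the lease
        self._leases[query_id] = (owner, now + self.ttl)
        return True

    def renew(self, query_id, owner):
        """Extend the lease; a failure means the worker must stop the query."""
        held = self._leases.get(query_id)
        now = self.clock()
        if not held or held[0] != owner or held[1] <= now:
            return False  # lease lost (expired or taken over)
        self._leases[query_id] = (owner, now + self.ttl)
        return True
```

With this model, a partitioned worker stops renewing, its lease expires, and another worker can safely take over the query.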

I’ve looked into the RQ 1.5 docs: it does provide retries, which could be a useful addition during a graceful shutdown. I haven’t found anything that would help with a hard shutdown or a network partition.
