Maintenance is not running

Issue Summary

Hello everyone,

We noticed that after upgrading Redash from v8 to v10, the job that cleans up unused query results no longer seems to be running.

As a result, the free space in Redash’s database gets depleted and we need to scale it up in order to keep the service up and running.

Any ideas what may be going wrong here?

Cheers,
Theodore

Technical details:

  • Redash Version: 10.1.0
  • How did you install Redash: Using the setup script to deploy v8. Then upgraded to v10 following the instructions provided in the release notes

Welcome to the user forum and sorry for this late reply.

There are a few possible causes for this issue:

  1. The periodic cleanup job is not being scheduled
  2. The job is being scheduled but no worker process picks it up
  3. The job is scheduled and acquired by a worker but fails

Most of the time the first one is the culprit.

Can you check that at least one of your worker processes is monitoring the periodic jobs queue?
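A quick way to verify, assuming the service names from the standard docker-compose setup (adjust if yours differ):

```shell
# Check that the QUEUES env var inside the worker container includes "periodic"
docker-compose exec worker env | grep QUEUES

# RQ workers also log the queues they listen on when they start
docker-compose logs worker | grep -i "listening on"
```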

Related: High Disk Size issue with Redash Postgres

Hi Jesse,

Thank you for your reply.

Most of the time the first one is the culprit.

This is my understanding as well. I believe if either 2 or 3 were the case, it would appear in the logs.

Can you check that at least one of your worker processes is monitoring the periodic jobs queue?

It does. We have followed the instructions provided here to add a new service for general RQ jobs and our configuration is almost identical to the one in step 5 (please see below). Emails and tasks in the default queue are executed normally.

  worker:
    <<: *redash-service
    command: worker
    mem_limit: 3000m
    environment:
      QUEUES: "periodic emails default"
      WORKERS_COUNT: 25

Following up on High Disk Size issue with Redash Postgres:

REDASH_QUERY_RESULTS_CLEANUP_ENABLED env var has not been set to false.

We have, however, set the following to mitigate the issue and clean up the database:

REDASH_QUERY_RESULTS_CLEANUP_MAX_AGE=1
REDASH_QUERY_RESULTS_CLEANUP_COUNT=200

but to no avail. Task scheduling is not consistent.


Thanks for the details. If you check your docker logs, do you see a message like this from the scheduler service when it starts?

[2022-03-15 17:20:25,452][PID:8][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.

The interval is in seconds. So 300 here means it will run every five minutes.
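To put that in perspective, the interval and REDASH_QUERY_RESULTS_CLEANUP_COUNT together cap how many results can be deleted per day. A quick calculation with the 300-second interval and the default count of 100:

```shell
# The task runs every 300 seconds, i.e. 86400 / 300 = 288 times per day
runs_per_day=$((86400 / 300))
echo "$runs_per_day"  # 288

# Each run deletes at most REDASH_QUERY_RESULTS_CLEANUP_COUNT results (default 100),
# so the daily cleanup ceiling is:
echo $((runs_per_day * 100))  # 28800
```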

When it runs you should see a log message from one of the worker services that observes the periodic queue. The message will look like this:

[2022-03-15 17:25:33,910][PID:25][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 0 days old or more)

In my case I set it to clean up all query results to make it easy to verify.

Do you see any messages like this?

I have checked the logs since Feb 21st (that’s when Redash was upgraded to v10) and here are my findings:

If you check your docker logs, do you see a message like this from the scheduler service when it starts?

There are only a few entries with a message like this. My understanding is that I should expect to see it every time the task starts; is that correct?

Logs:
[2022-02-21 15:59:56,635][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-02-22 10:43:58,998][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-02 09:12:02,255][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-09 14:23:05,815][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-11 16:47:31,106][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.

Do you see any messages like this?

Yes, but not consistently. In the beginning the task ran a few times, but it didn't stick to its schedule, and then it stopped.
On Mar 2nd we received an alert for high disk space utilization from the Redash database and started looking into it.
We realized that the cleanup task wasn't running but couldn't understand why, so we restarted all the services and tried to manually clean up some old cached query results.

After that the task ran normally, as expected, up until March 8th when it stopped again (I am not posting all the logs from that period, but I can do so if needed).
Since then there have been only a couple of entries in the logs. We saw the task being scheduled once after restarting Redash, and then nothing again.

Logs:
[2022-02-21 16:04:14,015][PID:229][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-02-21 16:04:17,031][PID:229][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 100 unused query results.
[2022-02-22 10:48:11,001][PID:207][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-02-22 10:48:12,699][PID:207][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 100 unused query results.
[2022-03-02 08:30:53,816][PID:4653][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=325de3b5237b058d20e318b14fbf4bee0aaf2f76 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-03-02 08:30:54,392][PID:4653][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=325de3b5237b058d20e318b14fbf4bee0aaf2f76 Deleted 100 unused query results.
[...]
[2022-03-08 10:18:46,643][PID:17436][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-03-08 10:18:46,948][PID:17436][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 0 unused query results.
[2022-03-09 14:27:19,547][PID:212][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-11 16:52:08,715][PID:219][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-11 16:52:11,014][PID:219][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.

I have restarted all the services a couple of times with no luck; the task still does not seem to be scheduled.

I have tried both:

docker-compose up -d

and

docker-compose up -d --force-recreate --build

Not quite. This message is emitted exactly once, when the scheduler service starts, i.e. when you run docker-compose up -d. If you ran docker-compose restart scheduler you would see the message again. As a debugging step, try restarting your scheduler and verifying that the cleanup task registration log appears.

That said, a different message will appear when the cleanup task actually runs. That different message will be emitted not by the scheduler, but by the worker.

This is normal. When the scheduler starts it emits log messages saying which tasks it will schedule, and at what interval. The scheduler never executes tasks; it only schedules them. It does not emit a message every time it schedules a task.

So you might ask, “how will I know the task is running if the scheduler doesn’t tell me?” That’s where the worker logs come in. When a worker pulls a job from the queue, it emits a log message specifying the job id, the task name, and a timestamp. So in simple terms, what you should expect to see is:

00:00 scheduler: I’m going to schedule this task to run every five minutes
00:00 worker: I just picked up a task to run, I’m doing that now
00:01 worker: I finished the task, here’s the outcome
00:05 worker: I just picked up a task to run, I’m doing that now
00:06 worker: I finished the task, here’s the outcome
00:10 worker: I just picked up a task to run, I’m doing that now
00:11 worker: I finished the task, here’s the outcome

…and so on.

One thing that may be happening is that the cleanup job is taking too long to execute. What are your settings for the number of results to clean up and the interval?

I see, thanks for clarifying this!

After issuing a

docker-compose restart scheduler

the task seems to have both been registered and executed as expected.
Logs below are just to confirm that the task has been registered and has had 2 successful executions; other than that, I confirm that it’s been running consistently every 5 minutes, up until the time I am posting this message. It doesn’t make sense…

Logs:
[2022-03-17 10:43:44,795][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-17 10:43:44,812][PID:32678][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-17 10:43:45,985][PID:32678][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.
[2022-03-17 10:48:45,050][PID:360][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-17 10:48:45,962][PID:360][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.

We have set the following values in /opt/redash/env:

REDASH_QUERY_RESULTS_CLEANUP_MAX_AGE=1
REDASH_QUERY_RESULTS_CLEANUP_COUNT=200

The interval hasn’t been changed; I couldn’t find any cleanup-related env var for the interval, but from this I am guessing the interval is hardcoded to 5 minutes?

I’m looking forward to adding this breakdown to our documentation. Thanks for your patience while I gathered the info!

You’re right. I didn’t read closely enough (I’m updating the docs for Redash settings right now btw).

It sounds like for the moment your issue is resolved. But if you haven’t changed any settings it seems likely that the issue of database bloat will continue. I wonder if the cause ultimately is that the job is running but…

a) it crashes or hangs midway through
b) it runs normally, but the number of results to clean up is growing faster than the cleanup job can clear it
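Point (b) is easy to sanity-check with arithmetic: with REDASH_QUERY_RESULTS_CLEANUP_COUNT=200 and the 300-second interval, cleanup tops out at 57,600 results per day. A hedged sketch (the creation rate of 70,000/day is a made-up number; substitute your own):

```shell
# Daily cleanup ceiling with COUNT=200 and a 300-second interval
cleanup_per_day=$(( (86400 / 300) * 200 ))
echo "$cleanup_per_day"  # 57600

# Hypothetical rate at which new unused results accumulate
new_results_per_day=70000

# If this is positive, the backlog grows and the database keeps bloating
echo $((new_results_per_day - cleanup_per_day))  # 12400
```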

Of course!


I do have some additional questions:

There was no indication in the logs that the task was running.
Instead, the logs indicate it had stopped running unexpectedly.

I would expect to see an error in the logs, but there weren’t any. Or at least none showed up when filtering with the term: cleanup_query_results.

Here I must ask: Why was there no indication in the logs that the task was running? :thinking:


Now the most interesting part:

I hear you, and I raised the number of results to be cleaned up to:

REDASH_QUERY_RESULTS_CLEANUP_COUNT=500

(from 200) and then I issued a:

docker-compose up -d

and didn’t see any logs indicating that the scheduler had scheduled the task.

I waited for a while and then I issued a

docker-compose restart scheduler

but again no logs indicating the cleanup task registration.

Finally, after issuing another

docker-compose restart scheduler

I saw that the task was registered

[2022-03-17 17:23:10,493][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.

What is going on? :sweat:

Hmmm, very strange not to see the registration, although it’s possible that waiting longer would have shown it. For consistency, I never run docker-compose up -d on its own to restart services. I run stop first, like this:

$ docker-compose stop && docker-compose up -d

This guarantees that everything stops and is restarted.

I don’t usually run docker-compose restart (without specifying a service) because it won’t guarantee that environment variable changes are propagated (not sure why, but it’s a Docker issue, not specific to Redash).

Possible causes:

  • It could be an insufficient number of workers monitoring the periodic queue. If another periodic job is hogging the workers then the cleanup task might never run. To check for this, observe how many jobs are in the periodic queue when you stop seeing the task run. If the number stays static or increases, there aren’t enough workers.
  • The worker is crashing before it can emit the log saying it’s picked up the task. To check for this, don’t grep for cleanup_query_results, but instead see if there are signs of a worker thread crashing.
  • The scheduler stops scheduling the job. To check for this, you can connect directly to the redis instance and pattern search for cleanup_query_results in the redis database.
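A few hedged diagnostics for the causes above, assuming the standard compose service names (worker, redis) and RQ’s default Redis key naming (rq:queue:&lt;name&gt;):

```shell
# 1. Queue depth: RQ queues are Redis lists, so LLEN shows pending jobs
docker-compose exec redis redis-cli llen rq:queue:periodic

# 2. Worker crashes: search for tracebacks rather than the task name
docker-compose logs worker | grep -iE "traceback|work-horse|killed"

# 3. Scheduler state: pattern-search Redis for the cleanup job's keys
docker-compose exec redis redis-cli --scan --pattern '*cleanup_query_results*'
```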

Depending on which of those is the cause there would be different steps to fix.

I’m not sure. Keep an eye on it and ping me with what you find or if you have questions. We’ll get to the bottom of it, I’m sure.