This is my understanding as well. I believe if either 2 or 3 were the case, it would appear in the logs.
Can you check that at least one of your worker processes is monitoring the periodic jobs queue?
It does. We have followed the instructions provided here to add a new service for general RQ jobs and our configuration is almost identical to the one in step 5 (please see below). Emails and tasks in the default queue are executed normally.
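For reference, a worker service that watches the periodic queue looks roughly like this in the setup's docker-compose.yml (a sketch with illustrative values, not the poster's actual configuration):

```yaml
worker:
  <<: *redash-service        # shared image/env anchor from the Redash compose file
  command: worker
  environment:
    QUEUES: "periodic emails default"   # must include "periodic" for the cleanup job
    WORKERS_COUNT: 1
```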
Thanks for the details. If you check your docker logs, do you see a message like this from the scheduler service when it starts?
[2022-03-15 17:20:25,452][PID:8][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
The interval is in seconds. So 300 here means it will run every five minutes.
When it runs you should see a log message from one of the worker services that observes the periodic queue. The message will look like this:
[2022-03-15 17:25:33,910][PID:25][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 0 days old or more)
In my case I set it to clean up all query results to make it easy to verify.
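For context, the cleanup behavior is controlled by environment variables in /opt/redash/env. The values below are illustrative, not necessarily what any given instance uses; a MAX_AGE of 0 matches the "0 days old or more" in the log line above:

```shell
REDASH_QUERY_RESULTS_CLEANUP_ENABLED=true   # turn the periodic cleanup job on/off
REDASH_QUERY_RESULTS_CLEANUP_COUNT=100      # max unused results removed per run
REDASH_QUERY_RESULTS_CLEANUP_MAX_AGE=0      # minimum age in days; 0 = any unused result
```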
I have checked the logs since Feb 21st (that’s when Redash was upgraded to v10) and here are my findings:
If you check your docker logs, do you see a message like this from the scheduler service when it starts?
There are only a few entries with a message like this. My understanding is that I should be expecting to see it every time the task was starting, is that correct?
[2022-02-21 15:59:56,635][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-02-22 10:43:58,998][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-02 09:12:02,255][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-09 14:23:05,815][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-11 16:47:31,106][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
Do you see any messages like this?
Yes, but not consistently. At first the task was scheduled a few times without sticking to its schedule, and then it stopped appearing altogether.
On Mar 2nd we received an alert for high disk space utilization from the Redash database and we started looking into it.
We realized that the cleanup task wasn’t running, but we couldn’t understand why, so we restarted all services and manually cleaned up some old cached query results.
Since then the task was being scheduled normally as expected, up until March 8th when it stopped again (I am not posting all the logs from that time but I can do so if needed).
Since then there were only a couple of entries in the logs. We saw the task being scheduled once after restarting Redash and then again nothing.
[2022-02-21 16:04:14,015][PID:229][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-02-21 16:04:17,031][PID:229][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 100 unused query results.
[2022-02-22 10:48:11,001][PID:207][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-02-22 10:48:12,699][PID:207][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 100 unused query results.
[2022-03-02 08:30:53,816][PID:4653][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=325de3b5237b058d20e318b14fbf4bee0aaf2f76 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-03-02 08:30:54,392][PID:4653][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=325de3b5237b058d20e318b14fbf4bee0aaf2f76 Deleted 100 unused query results.
[...]
[2022-03-08 10:18:46,643][PID:17436][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
[2022-03-08 10:18:46,948][PID:17436][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 0 unused query results.
[2022-03-09 14:27:19,547][PID:212][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-11 16:52:08,715][PID:219][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-11 16:52:11,014][PID:219][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.
I have restarted all the services a couple of times with no luck; the task still does not seem to be scheduled.
Not quite. This message is emitted exactly once, when the scheduler service starts, i.e. when you run docker-compose up -d. If you run docker-compose restart scheduler you should see the message again. As a debugging step, try restarting your scheduler and verifying that the cleanup task registration log appears.
That said, a different message will appear when the cleanup task actually runs. That different message will be emitted not by the scheduler, but by the worker.
Exactly right. This is normal. When the scheduler starts it emits logs to say which tasks it will schedule, and at what interval. The scheduler never executes tasks, it only schedules them. It does not emit a message every time it schedules a task.
So you might ask: “how will I know the task is running if the scheduler doesn’t tell me?” That’s where the worker logs come in. When a worker pulls a job off the queue, it emits a log message with the job id, the task name, and a timestamp. So in simple terms, what you should expect to see is:
00:00 scheduler: I’m going to schedule this task to run every five minutes
00:00 worker: I just picked up a task to run, I’m doing that now
00:01 worker: I finished the task, here’s the outcome
00:05 worker: I just picked up a task to run, I’m doing that now
00:06 worker: I finished the task, here’s the outcome
00:10 worker: I just picked up a task to run, I’m doing that now
00:11 worker: I finished the task, here’s the outcome
…and so on.
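That five-minute cadence is easy to verify mechanically from the worker logs. Here's a small sketch (log lines abridged; the timestamp format is Redash's standard log prefix) that parses the timestamps and measures the gap between consecutive runs:

```python
from datetime import datetime

def parse_ts(line):
    """Extract the leading [YYYY-MM-DD HH:MM:SS,mmm] timestamp from a Redash log line."""
    stamp = line.split("]")[0].lstrip("[")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

def run_intervals(lines):
    """Seconds between consecutive 'Running query results clean up' entries."""
    times = [parse_ts(l) for l in lines if "Running query results clean up" in l]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Abridged worker log lines like the ones in this thread:
logs = [
    "[2022-03-17 10:43:44,812][PID:32678][INFO][rq.job...] Running query results clean up",
    "[2022-03-17 10:48:45,050][PID:360][INFO][rq.job...] Running query results clean up",
]
print(run_intervals(logs))  # gaps close to 300 s mean the schedule is being honored
```

If the gaps are much larger than 300 seconds, or the list is suspiciously short for the time window, the task isn't keeping its schedule.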
One thing that may be happening is that the cleanup job is taking too long to execute. What are your settings for the number of results to clean up and for the interval?
the task seems to have both been registered and executed as expected.
Logs below are just to confirm that the task has been registered and has had 2 successful executions; other than that, I confirm that it’s been running consistently every 5 minutes, up until the time I am posting this message. It doesn’t make sense…
[2022-03-17 10:43:44,795][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-03-17 10:43:44,812][PID:32678][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-17 10:43:45,985][PID:32678][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.
[2022-03-17 10:48:45,050][PID:360][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 200 unused results, that are 1 days old or more)
[2022-03-17 10:48:45,962][PID:360][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 200 unused query results.
We have set the following values in /opt/redash/env:
The interval hasn’t been changed; I couldn’t find any cleanup-related env vars for the interval, but from this I am guessing the interval is hardcoded to 5 minutes?
I’m looking forward to adding this breakdown to our documentation. Thanks for your patience while I gathered the info!
You’re right. I didn’t read closely enough (I’m updating the docs for Redash settings right now btw).
It sounds like for the moment your issue is resolved. But if you haven’t changed any settings it seems likely that the issue of database bloat will continue. I wonder if the cause ultimately is that the job is running but…
a) it crashes or hangs midway through
b) it runs normally, but the number of results to clean up is growing faster than the cleanup job can clear it
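Cause (b) can be ruled in or out with simple arithmetic. A sketch using the settings mentioned in this thread and a hypothetical daily volume of new unused results (the 80,000 figure is purely illustrative):

```python
# Throughput of the cleanup job with the settings from this thread:
cleanup_count = 200        # results removed per run (REDASH_QUERY_RESULTS_CLEANUP_COUNT)
interval_seconds = 300     # the job runs every 5 minutes

runs_per_day = 24 * 3600 // interval_seconds
max_deleted_per_day = cleanup_count * runs_per_day
print(max_deleted_per_day)  # 57600

# Hypothetical figure for illustration: if the instance generates more
# unused results per day than the job can delete, the table grows without bound.
new_results_per_day = 80_000
print(new_results_per_day - max_deleted_per_day)  # 22400 extra rows per day
```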
Hmmm, very strange not to see the registration, although it’s possible that waiting longer would have shown one. For consistency, I never run docker-compose up -d on its own to restart services. I run stop first, like this:
$ docker-compose stop && docker-compose up -d
This guarantees that everything stops and is restarted.
I don’t usually run docker-compose restart (without specifying a service) because it won’t guarantee that environment variable changes are propagated (not sure why; it’s a Docker issue, not specific to Redash).
Possible causes:
It could be an insufficient number of workers monitoring the periodic queue. If another periodic job is hogging the workers then the cleanup task might never run. To check for this, observe how many jobs are in the periodic queue when you stop seeing the task run. If the number stays static or increases, there aren’t enough workers.
The worker is crashing before it can emit the log saying it’s picked up the task. To check for this, don’t grep for cleanup_query_results, but instead see if there are signs of a worker thread crashing.
The scheduler stops scheduling the job. To check for this, you can connect directly to the redis instance and pattern search for cleanup_query_results in the redis database.
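The queue and scheduler checks above can be done directly against Redis. A sketch assuming the standard container names from the Redash docker-compose file, using the key layout RQ and rq-scheduler use (the job id is the one from the scheduler logs earlier in the thread):

```shell
# 1) How many jobs are waiting in the periodic queue? A number that only
#    grows suggests there aren't enough workers draining it.
docker-compose exec redis redis-cli llen rq:queue:periodic

# 2) Is the cleanup job still registered with the scheduler?
#    rq-scheduler keeps scheduled job ids in a sorted set:
docker-compose exec redis redis-cli zrange rq:scheduler:scheduled_jobs 0 -1

# 3) Inspect the job itself (id taken from the scheduler log above):
docker-compose exec redis redis-cli hgetall rq:job:e27209059575fcc17c527c47d0957cb21756e551
```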
Depending on which of those is the cause there would be different steps to fix.
Not sure. Keep an eye on it and ping what you find or if you have questions. We’ll get to the bottom of it I’m sure.
Circling back on this as the cleanup_query_results seems to have stopped running again.
Unfortunately we only noticed because we got a notification about the Redash DB running out of storage and realized it wasn’t running for the past month or so.
As before, the task was scheduled successfully after the service was restarted, ran once, and then nothing again. Redash has been restarted a couple of times since then to see if the job would run consistently, but I keep seeing the same behavior: it gets scheduled, runs once, then nothing.
[2022-08-05 07:11:25,916][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
[2022-08-05 07:15:37,847][PID:212][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 500 unused results, that are 1 days old or more)
I can see that job details do correlate with the logs above on schedule and execution. It also seems it was enqueued to run again but there were no signs of it in the logs. Not sure how tasks are registered in Redis but I was also not able to find the specific key again after 15 minutes or so (and of course the cleanup job hasn’t run ever since).
Hi, I am checking to see if there is any follow-up available. We seem to have a similar issue: the “periodic” queue is scheduled and one of our four workers is busy, with three idle. However, the query_results do not seem to be getting cleaned up.