Issue Summary

We have a self-hosted Redash instance running on AWS EC2. We have already increased its computational resources, yet Redash is not consuming all of them when our users start reporting high response times. Once CPU utilization goes above 35%, response times start climbing. The top waits in Performance Insights are always CPU-related.

Technical details:

  • Redash Version: 10.0.0 (9c928bd1)
  • Browser/OS: Chrome
  • How did you install Redash: docker-compose deployment

What do you mean by increased response times?

  • Are queries remaining in queue for extended periods?
  • Do web requests load more slowly?
  • Are queries taking longer to execute?

Hello Jesse,

Yes, Redash dashboards are taking longer to return the requested information. Say an analyst wants to look up information about a client: it takes 2 to 5 minutes to get results, so the queries behind the dashboard’s visualizations are taking longer to execute.

Are the queries sitting in queue for extended periods? Or taking a long time to execute after they begin executing?

The queries are taking a long time to execute after they begin executing

When this happens, do everyone’s queries take a long time to execute? Or is it specific only to the queries in use on a given dashboard?

Also, how do you know that the queries are taking a long time to execute vs a long time in queue if you are on the dashboard screen? Have you checked the network requests in your browser console?

When this happens, everyone’s queries take a long time to execute. It’s not dashboard-specific.

Sometimes DiskQueueDepth on the read replica is high, which shows that some queries are queuing. What’s weird here is that computational resources are not being fully consumed, yet response times are still high. We scaled up the read replica and the EC2 instance where Redash is running, and the behavior improved only slightly and for a very short period; users are currently complaining about response times again.

We need to know which specific component or configuration to change, because the approach we have been following is not showing good results. We scaled up the read replica and the EC2 instance, but the issue persists.

I’m asking specific questions for a reason. It’s important to know what exactly is taking a long time. When response times are high, you can check it this way:

  1. Visit a query that’s taking a long time
  2. Click the refresh button
  3. Observe the message that appears beneath the query editor

Does it say “Query in queue” for a long time? Or does it say “Executing” for a long time?

This is harder to see on the dashboard screen because it just shows a spinner for both states (queueing vs. executing). But on the query screen you can see it more clearly.

If the queries sit in queue for a long time that means you don’t have enough worker threads to handle that many concurrent queries.

If the queries begin executing immediately then you have enough worker threads, but the queries may just take a long time to execute. Execution time depends on many factors, not all of which are part of Redash.

Since the issues we are dealing with all involve dashboards, we only see the spinner, so as you mentioned it’s harder to tell whether a query is in queue or executing. However, I just tested two visualizations from one of our dashboards, and after executing them I saw the “Query in queue” message.

On the other hand, as I mentioned before, we created a read replica just for Redash and also scaled up the EC2 instance Redash runs on. Following what you said, since the queries are sitting in queue, it means we don’t have enough worker threads to handle that many concurrent queries. How can we increase the number of worker threads?

Additionally, could you please share documentation with best practices for monitoring how many worker threads are available and how many concurrent queries we have?

You do this by increasing the WORKERS_COUNT environment variable for the worker service. Then restart the worker service for the change to take effect.

I wrote my previous post from my mobile late at night. Just a few follow-up items.

It’s harder, yes, but it’s still possible to see the status using your browser’s web inspector. Just look in your network requests tab for calls to api/jobs/<job_id>. These API calls are the front-end application polling the server for the execution status.

If you see that the code in one of these responses is 1, that means the query is in queue and has not started executing yet. If you see code 2, the query has begun executing but has not finished yet. There’s a full list of codes here.
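
If you want to watch this outside the browser, here is a minimal sketch (not an official Redash client) that polls the same endpoint with the requests library. The instance URL, the API key, and the status labels beyond 1 and 2 are placeholders and assumptions; check the full list of codes for your version before relying on them.

```python
# Minimal sketch: poll api/jobs/<job_id> the same way the front end does.
# REDASH_URL, API_KEY and the job_id are placeholders you fill in yourself.
import time
import requests

REDASH_URL = "https://redash.example.com"   # hypothetical instance URL
API_KEY = "your_user_api_key"               # hypothetical user API key

# 1 = in queue, 2 = executing (per above); the remaining labels are assumptions,
# so double-check them against the full list of codes for your Redash version.
STATUS_LABELS = {1: "in queue", 2: "executing", 3: "finished", 4: "failed", 5: "cancelled"}

def watch_job(job_id, poll_seconds=2):
    """Print the job status every few seconds until it leaves the queued/executing states."""
    headers = {"Authorization": f"Key {API_KEY}"}
    while True:
        resp = requests.get(f"{REDASH_URL}/api/jobs/{job_id}", headers=headers)
        resp.raise_for_status()
        job = resp.json()["job"]
        print(f"job {job_id}: {STATUS_LABELS.get(job['status'], job['status'])}")
        if job["status"] not in (1, 2):     # anything past "executing" is terminal
            return job
        time.sleep(poll_seconds)
```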

If the problem is not enough workers, then that kind of change (scaling the read replica or the EC2 instance) would not have an effect.

Admins of your Redash instance can click their user avatar and then System Status to see an overview of all queues, workers, jobs, and their status.
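
If you prefer to watch this from a script or a monitoring system rather than the UI, the System Status page is backed by a /status.json endpoint that an admin’s API key can read. The field names vary between Redash versions, so this sketch just dumps the whole document so you can see which queue and worker metrics are available:

```python
# Rough monitoring sketch: fetch the JSON behind the System Status page.
# REDASH_URL and ADMIN_API_KEY are placeholders; the key must belong to an admin user.
import json
import requests

REDASH_URL = "https://redash.example.com"    # hypothetical instance URL
ADMIN_API_KEY = "an_admin_users_api_key"     # hypothetical admin API key

resp = requests.get(
    f"{REDASH_URL}/status.json",
    headers={"Authorization": f"Key {ADMIN_API_KEY}"},
)
resp.raise_for_status()

# Print the whole structure once, then pick out the queue sizes / worker counts
# you care about and ship them to your monitoring system.
print(json.dumps(resp.json(), indent=2))
```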

Jesse,

Thank you very much for your answers. They have helped us identify the main issue. For now, we will proceed to increase the WORKERS_COUNT environment variable for the worker service. Just to double-check: after restarting the worker service for the change to take effect, will we have downtime on the Redash instance? Do you have an idea of how long it would take?

The downtime lasts only as long as it takes to restart a Docker service, usually less than a minute. You don’t need to restart every service, just the worker services.

To avoid any downtime whatsoever you could define an additional worker service and spin it up before you stop and restart the existing ones. Just be mindful of the RAM and CPU capabilities of your server so that you don’t choke the whole system down.

Great, thank you!

Could you please let us know the best way to adjust this WORKERS_COUNT environment variable? How many workers should we define? Is there any way we could do some kind of capacity planning in that regard?

Most users use trial-and-error to determine their needs. There is no one-size-fits-all recipe for it.

It depends on your needs. You can theoretically add as many worker threads as you want, up to the RAM and CPU limits of your hardware. Just be careful not to over-provision; if you do, it can slow the worker service to a crawl.

This is because each worker thread uses a fixed amount of RAM. The exact amount varies depending on how many query runners you have enabled (even if you don’t have any of those data sources defined, the Python dependencies are still imported). So if one worker thread needs 200 MB of RAM, then two will require 400 MB, three will require 600 MB, and so on.
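
As a back-of-the-envelope sketch of that arithmetic (the 200 MB figure and the headroom below are assumptions; measure your own workers before relying on them):

```python
# Back-of-the-envelope capacity estimate. All numbers are placeholders:
# measure your own per-worker RAM footprint and decide your own headroom.
ram_total_mb = 16 * 1024      # e.g. a 16 GB server
ram_headroom_mb = 4 * 1024    # reserve for the web service, Redis, OS, etc.
ram_per_worker_mb = 200       # observed footprint of a single worker thread

max_workers = (ram_total_mb - ram_headroom_mb) // ram_per_worker_mb
print(f"Roughly {max_workers} worker threads fit within the remaining RAM")
```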

You can tune it by gradually increasing the number of workers. If you have 8 threads today and your RAM utilisation is acceptable (under 80%, for example), you could move to 12 threads. Then check your queues to see if the overall number of jobs begins to decrease. If not, increase from 12 to 20 and check your RAM utilisation again.

If you hit the RAM limit and the queue is still the same size, this suggests you may have a lot of long-running queries hogging the workers, or that you simply need more RAM on your server. There are various ways to change this behaviour, from scheduling the long-running queries during non-peak times all the way up to setting up a dedicated worker and a special queue just for those queries. You can also disable any query runners that aren’t needed in your installation (you can do this with an environment variable or by modifying the source code directly).

And of course, if you increase the worker count to the point that the queue shrinks to an acceptable level, you don’t need any more workers.