This question spawned from a question with @jesse

Issue Summary

I am easily able to “crash” Redash by opening 10-15 Query tabs in a single browser. I can reproduce with 100% consistency.

Technical details:

  • Redash Version: 8.0.2+b37747 (a9d7ca43)
  • Browser/OS: Firefox 95
  • How did you install Redash:

We run Redash as a collection of containers, running as ECS Services in Amazon.

  • scheduler, scheduled_worker and adhoc_worker are each independent ECS Services
  • We use AWS ElastiCache instead of a self-managed Redis
  • We use an RDS database for the backend
  • “server” and “nginx” containers run together in a single ECS Task, which runs via an ECS Service.

To reproduce:
From the /queries endpoint I command-click each query in the list (all 20 on the first page), to open each Query Detail page in its own tab. When I do this, the first few will load quickly, but the remainder will just sit and spin for a minute or so, and then all return 5XX errors.

Behind the scenes, I see that the AWS Load Balancer is taking the Task out of service due to health check timeout. Example event from ECS Service Events …

service (REDACTED) (instance REDACTED) (port 49159) is unhealthy in target-group (REDACTED) due to (reason Request timed out)

It seems that after maybe 10 or so requests, Redash is unable to serve requests fast enough, so the health check times out and the task gets taken out of service. The health check is configured with a 30 second timeout.

So it looks like our issue here is that Redash can’t serve requests quickly if it gets many all at once. I haven’t done any more exhaustive testing. For example, I haven’t tried loading the same page 10 times or loading from 10 different workstations at once.
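
A rough way to run those extra tests without a browser is a small script that fires many requests at once and records the status and latency of each. This is only a sketch under assumptions: the URLs are placeholders, authentication is left out entirely, and a single GET per URL only approximates what a real Query Detail tab does (a tab issues several requests). The function names are made up for illustration.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=60.0):
    """Issue one GET and return (status, seconds).

    status is None when the request times out or the connection fails.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # 4XX/5XX responses arrive as exceptions; keep the status code.
        status = exc.code
    except (urllib.error.URLError, OSError):
        status = None
    return status, time.monotonic() - start


def load_concurrently(urls, fetcher=fetch):
    """Request every URL at the same time, one thread per URL.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(fetcher, urls))


def summarize(results):
    """Count responses per status code and report the slowest request."""
    counts = Counter(status for status, _ in results)
    slowest = max((seconds for _, seconds in results), default=0.0)
    return dict(counts), slowest


# Example (hypothetical host and path; adjust to your deployment):
# urls = ["https://redash.example.com/queries/%d" % i for i in range(1, 21)]
# print(summarize(load_concurrently(urls)))
```

Passing the same URL 10 times would cover the "same page 10 times" case, and running the script from several machines would cover the multi-workstation case. Comparing the slowest latency against the 30 second health check timeout should show whether the server is simply queueing requests.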

This is the extent of the investigation I’ve done. I haven’t looked at logs to figure out why/where things are choking. Any guidance would be greatly appreciated. Thank you
-wes

You didn’t specify which version of Redash you’re using.

I assume something is broken with your configuration, as this isn’t a Redash problem. I tested locally with a single worker in our local development setup. I can open 50 concurrent tabs and execute a sizeable query in 10+ tabs without crashing anything. Just the standard delay where some queries must wait until a worker is available.

Testing with our V8 AMI upgraded to V10.1, it doesn’t crash under these circumstances either.

oops, sorry about that. sloppy copy/paste. fixed. version is 8.0.2+b37747 (a9d7ca43)

yeah, something is probably wrong with the configuration, we’re just not sure what it is. Do you think the worker is the likely bottleneck?