This question came out of a discussion with @jesse.
Issue Summary
I am easily able to “crash” Redash by opening 10-15 Query tabs in a single browser. I can reproduce with 100% consistency.
Technical details:
- Redash Version: 8.0.2+b37747 (a9d7ca43)
- Browser/OS: Firefox 95
- How did you install Redash:
We run Redash as a collection of containers, running as ECS Services in Amazon.
- scheduler, scheduled_worker and adhoc_worker are each independent ECS Services
- We use AWS ElastiCache instead of a self-managed Redis
- We use an RDS database for the backend
- “server” and “nginx” containers run together in a single ECS Task, which runs via an ECS Service.
To reproduce:
From the /queries endpoint I command-click each query in the list (all 20 on the first page), to open each Query Detail page in its own tab. When I do this, the first few will load quickly, but the remainder will just sit and spin for a minute or so, and then all return 5XX errors.
Behind the scenes, I see that the AWS Load Balancer is taking the Task out of service due to health check timeout. Example event from ECS Service Events …
service (REDACTED) (instance REDACTED) (port 49159) is unhealthy in target-group (REDACTED) due to (reason Request timed out)
It seems that after maybe 10 or so requests, Redash is unable to serve requests fast enough and the task gets taken out of service. The health check is configured with a 30-second timeout.
So it looks like our issue here is that Redash can’t serve requests quickly when many arrive at once. I haven’t done any more exhaustive testing. For example, I haven’t tried loading the same page 10 times or loading from 10 different workstations at once.
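In case it helps anyone reproduce this without a browser, here is a rough, untested-against-our-deployment sketch of a script that fires a burst of concurrent requests and reports status codes and latencies. Everything specific in it is an assumption: the base URL, the query IDs, and the API key are placeholders, and hitting `/api/queries/<id>` with an `Authorization: Key ...` header only approximates what a Query Detail tab does (a real tab makes several requests). The helper names `fetch`, `burst`, and `summarize` are made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch(url, headers=None, timeout=60):
    """GET one URL; return (status, seconds). status is None if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with urlopen(Request(url, headers=headers or {}), timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except HTTPError as err:  # 4xx/5xx responses land here
        status = err.code
    except (URLError, OSError):  # connection refused, reset, or timed out
        status = None
    return status, time.monotonic() - start


def burst(urls, fetch_one=fetch):
    """Request every URL at the same time, one thread per URL, like opening N tabs."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(fetch_one, urls))


def summarize(results):
    """Count successes, 5xx errors, and requests that got no response at all."""
    statuses = [status for status, _ in results]
    return {
        "ok": sum(1 for s in statuses if s is not None and s < 400),
        "5xx": sum(1 for s in statuses if s is not None and s >= 500),
        "no_response": statuses.count(None),
        "slowest_s": max((secs for _, secs in results), default=0.0),
    }


if __name__ == "__main__":
    # Placeholders (assumptions): replace with a real host, query IDs, and API key.
    base = "https://redash.example.com"
    auth = {"Authorization": "Key YOUR_API_KEY"}
    urls = [f"{base}/api/queries/{query_id}" for query_id in range(1, 21)]
    results = burst(urls, fetch_one=lambda u: fetch(u, headers=auth))
    for url, (status, secs) in zip(urls, results):
        print(f"{status}  {secs:6.2f}s  {url}")
    print(summarize(results))
```

Passing the same URL 10 times instead of 20 different ones would cover the "same page 10 times" case, and running the script from several machines would cover the multi-workstation case.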
This is the extent of the investigation I’ve done. I haven’t looked at logs to figure out why or where things are choking. Any guidance would be greatly appreciated. Thank you.
-wes