Self-hosted on AWS using the eu-west-1 AMI.
Redash version 8.
We are finding that Redash repeatedly crashes throughout the day, and the EC2 instance needs to be rebooted before it will work again. The GUI freezes and then returns a 504 error.
From the metrics we have, it looks like the memory is maxing out.
It looks like it might be related to this issue:
I am thinking our problem might be that people are running large queries, Redash doesn’t limit them in any way, the memory maxes out, and then it crashes. We have some large tables (several over 200 GB each), but most of the queries run throughout the day are small. If we resized the instance, it would have to be pretty massive just to handle the odd occasion when someone runs a big query without a LIMIT clause.
If it is related to the GitHub issue above, is there any kind of workaround? It’s hard to believe that people have just lived with it repeatedly crashing for 5+ years, so presumably there is something else going on in our situation.
What you’re describing sounds similar to an issue I’m working on with V10 performance (here). The first thing I’d recommend is increasing your instance provision, at least temporarily, so you can debug without absurd delays. If you’re on a t2.small, go up to a t2.medium, for example.
Next, I have a few clarifying questions for you:
How large are your query results? Redash is built for visualising results, not for making large data extracts. The GitHub issue you referenced is about improving performance when individual query results are sizable. Until we can address this, you should expect degraded front-end performance if your query results exceed around 20 MB in size. However, this will not cause a 504 timeout.
You mentioned the front-end “freezes”. Under what circumstances? When you execute a large query?
How much RAM are your containers using? You can find this by SSHing into the EC2 instance and running sudo docker stats. You can also run the top command to see whether kswapd0 is running frequently, which would indicate heavy swapping.
How many types of data source are you using? If you only need a few data sources, you can reduce Redash’s memory footprint by disabling the query runners you don’t need.
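If it helps, the usual way to do that (if I remember the setting name correctly) is the REDASH_ENABLED_QUERY_RUNNERS environment variable, which takes a comma-separated list of query runner modules. A minimal sketch, assuming you only need PostgreSQL and your compose file follows the stock layout:

```yaml
# Sketch only: restrict the enabled query runners so unused ones never load.
# Set this on every Redash service (server and workers), or in the env file
# the setup script creates, if you use one.
server:
  environment:
    REDASH_ENABLED_QUERY_RUNNERS: "redash.query_runner.pg"  # PostgreSQL only; module path assumed
```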
I ran some experiments and think I can clarify a bit better what’s going on.
We have an instance with 128GB memory (r5.4xlarge).
This is pointed to two PostgreSQL data sources. These sources have several tables which are 200+GB in size.
Typical queries are fairly small as most people use the tool appropriately (not for making large data extracts). Outside of these crashing queries, the max memory I’ve seen is around 7GB.
sudo docker stats shows typical memory usage is quite low (the highest is redash_server_1 at 1.7 GB; the rest are between 10 MB and 400 MB).
The problem occurs when people aren’t using the tool properly: they untick the LIMIT 1000 checkbox and don’t set their own LIMIT clause, so they run something like SELECT * FROM 200gb_table with no limit at all. We can see the query start to run, memory usage climbs until it hits the maximum, then Redash crashes and the EC2 needs rebooting before anyone can use it again.
The behaviour when crashing is:
the query hangs
the query says it cannot connect
shortly afterwards the page gets a 504 error if you refresh, until the instance is rebooted
The screenshot below shows the memory metrics during one of these crashes (we resized to 128 GB after this).
We could keep resizing the EC2 until it can handle every full-table read, but as you say, that’s not what Redash is for. It would be very expensive and only ever needed for the odd occasion someone runs a massive table extract without thinking. The problem, as I see it, is that it’s too easy for users to bring the whole instance down and make it unusable for everyone else.
What we’d like is to handle these bad queries better. We tried a timeout, but a simple query on a large table can crash the instance quite quickly. Ideally, Redash would run out of memory on a single query, stay alive, and just return a failure for that query without taking the instance down for everyone else. Maybe we need to host it on two load-balanced EC2 instances, so that if a large query kills one instance the other can keep serving queries without users being impacted?
Your memory usage looks dead-on to me (1.7 GB on the main server container, 10–400 MB for the other services).
It’s normal for a worker process to crash if it pulls too much data. But it’s not normal for this to take down the entire worker container or the entire EC2 instance. Ideally, the worker will just fail gracefully, restart itself, and everything will stabilise.
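One thing worth trying towards that behaviour, assuming your compose file uses the version 2 format the Redash setup script generates, is capping the worker container’s memory and letting Docker restart it, so the kernel’s OOM killer takes out just that container rather than starving the whole host. A rough sketch (the service name and the 4g figure are placeholders, not values from your setup):

```yaml
# Sketch: cap the ad-hoc worker so an out-of-memory query kills only this
# container, then have Docker bring it back automatically.
adhoc_worker:
  # image/command/env as in your existing service definition
  restart: always
  mem_limit: 4g   # placeholder cap; leave headroom for the other services
```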
Can you share your docker-compose file? I wonder how many worker services you have configured.
It shouldn’t make a difference, but is there any change if you unify the formatting of your QUEUES values? Some of them are space-separated while others are comma-separated.
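For comparison, the worker services in the stock v8 compose file look roughly like this, with every QUEUES value comma-separated (I’m going from memory, so treat the exact names and counts as approximate):

```yaml
# Rough sketch of the stock worker services with a consistent,
# comma-separated QUEUES format.
scheduler:
  environment:
    QUEUES: "celery"
    WORKERS_COUNT: 1
scheduled_worker:
  environment:
    QUEUES: "scheduled_queries,schemas"
    WORKERS_COUNT: 1
adhoc_worker:
  environment:
    QUEUES: "queries"
    WORKERS_COUNT: 2
```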