Unable to execute query on the table having 5,000,000 records?

aak4405 · May 29, 2022, 11:04am

Hi, is there any limit on redash to work on databases having large number of records?
I am trying to execute a query in redash having 50 Lakh plus records, but it does not show anything. However, when I exectute the query after clicking on Limit 1000 Option this works but I only get 1000 records. Can anyone suggest how can I see execute the query and get all the 50 lakh plus records?

jesse · June 2, 2022, 12:09pm

Redash is not designed to pull 5m records into your browser. It’s meant to visualise aggregated data sets under 20mb in size. For visualisation purposes there is no need to pull this much data. If that is part of your requirement then you’ll need a much more powerful vis tool like PowerBI or Tableau.

SawlStone · June 24, 2022, 7:21am

20 MB is not the limit.)) We had been uploading data up to 150-300 MB (400K-600K rows)

jesse · June 24, 2022, 11:06am

I never said it was. 20mb is the advised size. Beyond that your mileage will vary.

Also there is a difference between pulling a large result vs trying to visualize it in Plotly.

aak4405 · June 24, 2022, 11:24am

So upto what number of records or output size (query result) will be suitable for redash to handle that query?

SawlStone · June 24, 2022, 12:07pm

Postgres cell can accept and save max 1GB of data. So, for query_result data field this is maximum.
Also this depends from NGINX pass capacity. By default, I think, is 500mb

jesse · June 24, 2022, 2:46pm

What matters is the size of the query result in bytes, not the number of rows. Because not all rows are created equal. Some rows have 10 fields. Some rows have 1000 fields. And not all fields are the same either. Some fields are a boolean value. Others could hold binary data, or long strings.

There is no hard limit enforced by Redash. You can try SELECTing any size of data and see if it works. If you pull too much data it will just fail, as you’ve seen. There are a few places it can fail. To understand these you should understand the lifetime of a query result, which goes like this:

A query job is created when a user presses Execute.
A worker thread picks up the job and sends the query to your database, then waits for the result.
The worker receives the result, then serialises it into JSON and saves it to Redash’s postgres metadata database on the query_results table.
The worker sets the job status from step 1 to COMPLETE.
This whole time, the web application has been polling for the status of the job created in step 1. When it sees the status is COMPLETE, it makes a network request for the query result.
The server pulls the query result from postgres and sends it to the browser.
The browser now has a JSON serialised query result. This is given to the visualisation component to render. The visualisation component could be a table or a Plotly chart.
The visualisation component renders the data onto the screen.

Knowing this, here’s where will fail depending on the size of the query result…

Step 2: result size alone crashes worker

The worker thread can run out of RAM while receiving the result from your database. If your worker thread only has 1GB available to it and the query result is 1.2GB, the worker doesn’t have enough memory to receive the whole thing. If it runs out of RAM, it just dies and the query fails.

Step 3: result size crashes worker during serialisation

Serialising to JSON occurs in-memory. And a serliaised result will always be larger than the query result in the format it arrives from your database. If the worker has 1GB of RAM and the query result is 500mb, it will need at least 1GB of space to contain both the original result and the serialised value. So if the result exceeds 500mb it will run out of RAM, the worker dies, and the query fails.

Step 6: result size overloads the browser

We now have a query result that is passed to your local browser. If your machine is RAM constrained, this network request could crash the browser page. There is no exact limit here as it will vary from one computer to the next.

Step 8: visualisation component is not optimised for large results

The Plotly visualisation tool which Redash uses draws a single point for every row of data in your query result. And it’s not designed to display many thousands or millions of points. If you pass it too large of a query result, it could crash. But more likely it will churn for a long time as it processes the visualisation and renders it. You’ll experience degraded performance at least.

The Redash table visualisation is actually pretty good with large results. It uses pagination to avoid rendering too many DOM nodes at once so it won’t be sluggish. So you can pull more records into a table visualisation than you could show with Plotly.

So returning to your question: how many rows can Redash handle? The answer is that the upper limit entirely depends. Redash is built to comfortably handle query results around 20mb in size. If each row of data contains 1kb of data, that would be 20,000 rows.

“But what if my workers have more RAM than 1GB?”

Then you can probably show larger results.

“But what if I’m only making table visualisations?”

Then you can probably show larger results.

“But what if my browser has lots of RAM and no other tabs open?”

Then you can probably show larger results.

"But what if I don’t need to visualise anything and I just want to fetch results using the API?

Then you can probably download larger results. Doing this bypasses the problems with steps 5-8. If the limit for good performance in the browser is 20mb, the limit for strictly API access is closer to 300mb.

"But what if I need to visualise 5m rows at once and none of this works?

Then Redash is the wrong tool for you. And I’d question whether you really need to show that much data to make a meaningful chart. Would a user even notice the difference if there were 4.9m points instead of an even 5m? No. They wouldn’t.

“Why is the recommendation so low at only 20mb?”

Because the average query result being rendered is less than 1mb in size. Redash isn’t a data science and engineering platform. It’s meant for making simple charts of business data.

This is correct, but the real limit will be lower. Your workers will run out of RAM / browser performance will plunge long before this value. The ceiling in my room is 3 meters tall, but that doesn’t mean I can jump that high

I’m not sure this is accurate. Since you can absolutely download a 1GB excel file from Redash if you want to

SawlStone · June 24, 2022, 3:39pm

Not so absolutely. I had tried to do this (~ 500k row), but it was very unstable. The excel script (csv is much faster) processed the data quite long (depends on column amount), and frontend-backend connection had been losing. So I had to add my own feature for downloading files

By the way about serialization, this serializer works quite faster than simplejson:

jesse · June 24, 2022, 3:43pm

Fair enough. Glad you were able to hack together a solution that works for your specific need. That’s the beauty of open source

▼Categories

▼Tags

Unable to execute query on the table having 5,000,000 records?

Step 2: result size alone crashes worker

Step 3: result size crashes worker during serialisation

Step 6: result size overloads the browser

Step 8: visualisation component is not optimised for large results