Downloading large dataset with datesize >500K rows

kpeng · February 21, 2017, 10:56pm

Hi All,

We have a use case were we want to run queries that may return more than 500K rows and we want to be able to download those datasets.

I set up a docker version of reDash using the setup guide and I pulled in the majestic data set into the docker postgres instance.

After running a couple test queries I noticed that I can run queries that return 10K rows worth of data and properly download them from the UI. When I goto 100K the UI hangs, but I am able to retrieve the data via the API link: <redash_url>/api/queries/<query_id>/results.csv?api_key=<api_key>, which is great, but I noticed that every time I go revisit the query on the UI, since it never really pulled the result into the UI the page would hang and I can’t really modify anything like setting the reoccurring run schedule (using chrome).

Another issue I am having is since this dataset is 1 million rows, I created another query removing the 100K limit and saved it. I have been trying to run this query, but it fails in the UI due to the SIG 9, memory issue. One way I was trying to get around this is to call the refresh api, but that is not working for me. I am using curl -X post <redash_url>/api/queries/<query_id>/refresh?api_key=<api_key> but I keep on getting 400 errors.

My question is whether I am going in the right direction or if there is anything that I am missing on my side.

Thanks,

kpeng

arikfr · March 26, 2017, 9:09am

The refresh call fails because you’re using a query API key. If you will use some user’s API key, it will work (it’s a bug).

But it doesn’t matter if you trigger the refresh from the API or the browser – the SIGKILL message will happen anyway, as this is the out of memory killer killing the worker trying to run this query.

There were previous discussions about this (and an open issue on GitHub): Redash was never optimized for large datasets. There are several issues around it, and while I’ll be happy to support large datasets it was never a priority compared to other things.

galangel · May 27, 2018, 9:30am

Hi, i’m trying to handle the same issue, the UI freezes and unresponsive.
I want to add an option to just download result, skipping displaying in UI.
what would be the best approach for doing this?
I’d love if you point me to a place to start.
Thanks!

▼Categories

▼Tags

Downloading large dataset with datesize >500K rows