Query a BIG query result


#1

Hello, new to redash and very interested so far. However, I’ve got a use case that I’m stumped with how to do. We have regular data queries that are run that we need to join the ID’s from that against a different data source to pull in a bunch of extra metadata. The data queries are in the 10k to 500k size, but the source that holds the metadata has 15-20M entries in couchbase. The query_result query_runner looks like it works by pulling the entire 15M entries and then trying to join that in a sqlite database. What we actually want is the 100k ID’s from the data query being passed to our couchbase query_runner to handle there. I think I can probably modify query_result to chunk somehow, but I didn’t know if there was a way to essentially pass a query result to a query runner instead? Hope that makes sense! Certainly a cross-platform joined query like we’re doing here is probably a bit of a corner case, but I’m happy to contribute back anything we do if there is a smart way to accomplish this.
Thanks!


#2

This is indeed a corner case and considering the volumes of data, you might be better served by some simple ETL process that syncs the metadata into a single place. Of course it depends on the frequency you need to do this join, but if it’s frequent enough it might be worth it.

Having said that, if you do find some interesting ways to serve this use case in Redash, I’m happy to discuss :slight_smile:


#3

I don’t control the location of the data so ETL wasn’t really possible. I ended up creating a custom query runner that did the metadata pull and then used pydatasql to join up the results and then pass that back to redash. It’s not dissimilar to the query_result query_runner, though it doesn’t generalize past our use case. It’s delightfully easy to implement my own query_runners!
Thanks!