Altering slow athena result fetching implementation

Shitij · February 25, 2020, 4:45pm

Hi,
I am using athena with Redash and found that it is pretty slow for bigger queries. ~20 lac rows

The query finished withing 2 minutes in athena but took more than 12-13 minutes in Redash to get the results. After looking at the code, i found that pyathena cursor is being used which I think may be the bottleneck, as it fetches results and serialises them into JSON. This has become a bit of a blocker at our side.

I’ve worked on an alternate implementation which involves transferring the CSV result file directly from Athena S3 to Redash machine and converting the records to json with python’s csv DictReader. It has improved the speed of fetching data to 2-3 minutes for the same query.

bucket, key = parse_output_location(cursor.output_location)
s3 = boto3.client('s3', **self._get_iam_credentials(user=user))
athena_query_results_file = athena_query_id
with open(athena_query_results_file, 'wb') as w:
s3.download_fileobj(bucket, key, w)
with open(athena_query_results_file, 'r+') as f:
rows = list(csv.DictReader(f))
Crux of the change

P.S - Should I create a PR or first get the idea reviewed here?

jesse · February 25, 2020, 10:14pm

How many rows are you returning?

Shitij · February 26, 2020, 7:51am

Tried returning 1 lac, 10 lac, 20 lac rows. All queries in redash going over 10 minutes. While fetching the output file directly reduces it to 2 minutes.

I know redash is not meant for large datasets but there are some usecases where this is needed.
Running redash server on a EC2 machine

jesse · February 28, 2020, 4:36am

Correct.

Then use other tools. Redash works best with results under 20mb. It’s a visualization tool, not a big data platform. 100k rows is well outside our scope.

The delay with the query runner is serialization of all that data into JSON. As you found…

This is expected behavior for Redash. Not a bug.

Shitij · February 28, 2020, 6:00am

But even if we discard the big results issue, won’t it be beneficial to have an implementation which is faster in any case? I fail to see how slow serialization of data is a feature. I would happily create a PR for my implementation but wanted to get the community’s views on whether it would be worthwhile.

Also practically speaking, its sometimes hard to enforce this on the people making queries. We are not stopping the user from getting big data into redash, it’s just a best practice. Why not improve the implementation and let people who query decide what’s best for them? Only point being that if it is a matter of 10 rows or 10k rows, we can have a more efficient system with minimal changes

jesse · February 28, 2020, 11:26am

Sorry I was unclear. I meant that we can’t avoid serialization altogether even though it takes longer for large results.

Also, we don’t exclude users with big data sets. Its browser performance that gets in the way. The tab crashes when someone charts 100k+ rows. Your proposal means that person will wait 2 minutes instead of 10 minutes , but the app will crash anyway

For reference, we know that 95% of queries in Redash (SaaS) execute within 10 seconds. If you can make that faster, by all means do so. But I’d hate for you to invest much time before you realize why Redash was the wrong tool for your needs.

Shitij · March 4, 2020, 4:01pm

Thanks susodapop for clarifying.
Got your point and I agree with you.
I had to change the implementation in a redash fork for my company for unavoidable reasons. But I believe it will improve the time for regular short athena queries too. In my experiments a query showing 10-15 seconds on redash, got reduced to around 5-6 seconds (depending on the result count). Not a massive diff yes, but still I think it will improve the user experience.
As this is my first foray in open source, very thankful for helping me understand this. I would like to create a PR for this optimization.

▼Categories

▼Tags

Altering slow athena result fetching implementation