I am using Athena with Redash and found that it is pretty slow for bigger queries (~20 lakh, i.e. ~2 million, rows).
The query finished within 2 minutes in Athena but took 12-13 minutes in Redash to return results. After looking at the code, I found that the PyAthena cursor is being used, which I think is the bottleneck, as it fetches results row by row and serialises them into JSON. This has become a bit of a blocker on our side.
I’ve worked on an alternate implementation that transfers the CSV result file directly from the Athena S3 output location to the Redash machine and converts the records to JSON with Python’s csv.DictReader. It has brought the fetch time down to 2-3 minutes for the same query.
import csv

import boto3

# Locate the CSV that Athena wrote to S3 for this query.
bucket, key = parse_output_location(cursor.output_location)
s3 = boto3.client('s3', **self._get_iam_credentials(user=user))

# Download the result file to local disk, then parse it into dicts.
athena_query_results_file = athena_query_id
with open(athena_query_results_file, 'wb') as w:
    s3.download_fileobj(bucket, key, w)
with open(athena_query_results_file, 'r') as f:
    rows = list(csv.DictReader(f))
That is the crux of the change.
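For reference, the same idea can be sketched without writing a temp file at all, by streaming the S3 body straight into csv.DictReader. This is only a sketch under assumptions: `parse_output_location` here is a hypothetical stand-in for the helper used above (PyAthena ships a similar utility), and the `get_object` streaming path at the bottom is untested against a real bucket.

```python
import csv
import io
from urllib.parse import urlparse


def parse_output_location(output_location):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical stand-in for the helper referenced in the snippet above.
    """
    parsed = urlparse(output_location)
    return parsed.netloc, parsed.path.lstrip("/")


def rows_from_csv_stream(text_stream):
    """Parse an Athena CSV result stream into a list of dicts."""
    return list(csv.DictReader(text_stream))


# Streaming variant (assumption, not tested here): wrap the S3 body so
# the CSV never touches local disk.
#
#   body = s3.get_object(Bucket=bucket, Key=key)["Body"]
#   rows = rows_from_csv_stream(io.TextIOWrapper(body, encoding="utf-8"))
```

Whether streaming is worth it over the temp-file version probably depends on result size and available disk; the temp file at least gives a retryable artifact.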
P.S. Should I create a PR, or should the idea be reviewed here first?