Athena launched today and looks like a great natural fit for Redash. It’s Presto operating on top of S3, with hive DDL.
I’ve only had a chance to try Athena through the AWS console but given it’s simply Presto doesn’t Redash already have support for it? I’ll be trying tomorrow but I don’t expect issues.
From brief investigation it appears Redash’s Presto support doesn’t map to Athena out of the box.
Athena only speaks JDBC (as does Presto): http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
Redash uses pyhive, which talks to Presto’s REST interface: https://github.com/dropbox/PyHive/blob/master/pyhive/presto.py
I don’t know enough about Redash’s internals to know if generic JDBC support underpins another database (I couldn’t find it).
We don’t use JDBC for anything else in Redash. I did see some Python library that supports JDBC, but it feels a bit sketchy. I thought of writing a simple Java proxy that will use Athena’s JDBC connector and expose a REST API to use it.
It could be nice to deploy it as a Lambda function, but not sure it will be economical as some queries might take a long time to complete.
I just took Athena for a test drive, looks like converting my Redshift warehouses could be a huge win especially because I already persist most of my data to S3. What’s sketchy about adding a JDBC connectory @arikfr? If you ask AWS they might let you in on the roadmap for Athena under an NDA and you can see if they plan on supporting any other connection. I’d love to see Athena supported so I can cut over and lower my AWS bill
Now I succeeded in connecting with Athena using Py4j.
I’d like to get your feedback.
it requires java8, and java execution in subprocess.
Well done, @kakakazuma! I posted my thoughts on your blog post, but if you prefer - I can repost here, so we can have the discussion over here.
@mikeghen I didn’t mean JDBC was sketchy, but rather that Python connector that supports it.
I’m almost done with my solution (the proxy) and there is also the work @kakakazuma has done.
Thanks for your comment on my blog post!
I’d like you to repost here, and discuss after official release of Redash’s Athena support.
I’m looking forward to the official release!
For context here’s @kakakazuma’s blog post: http://qiita.com/yskazuma/items/99f5892d5a303528aad3. It’s in Japanese but easy to understand by just following the code examples and/or using Google Translate.
Here’s my comment re. his approach (using Py4J):
"Last week I started working on Athena support myself and got a working prototype but didn’t have the chance to release it yet as it needed some more work. The approach I took is somewhat similar but different - I wrote a REST API/Proxy in Java that takes Athena connection details, runs the query and returns results in “Redash format”.
At first it seemed to me that the Py4J approach might be better, but to the best of my understanding Py4J requires you to run the gateway program along side Redash. In this case it has limited value compared to having just a REST proxy, while the later is easier to deploy/scale/maintain. Also we reduce cases of bugs in a middleware (Py4J) impacting the stability of the solution.
Once I release it, I will post a link here to my codebase, so we can have a discussion on top of the code and decide which one to keep."
- Pull request with query runner: https://github.com/getredash/redash/pull/1468
- Java Proxy to JDBC: https://github.com/getredash/redash-amazon-athena-proxy
Some more details in this blog post: https://blog.redash.io/amazon-athena-in-redash-support-6b71c91aa747#.rq96frkup
I will follow up later in this thread with instructions on how to back port the query runner to v0.12.
@kakakazuma what do you think? Do you prefer the API proxy or the Py4J approach?
I’m very happy to hear Redash supported Amazon Athena officially!
I read your code, and prefered API proxy because REST communicate is seemed to be better than Py4J socket based communicate.
Now I use Py4J approach in production env, So I will switch to your approach and report if some problem occurs.
Huge thanks for official’s quick support!
I dockerized it and deployed with ECS. The Dockerfile is super simple:
FROM openjdk:8-alpine EXPOSE 4567 COPY athena.jar ./ CMD java -jar athena.jar
Only in the data source.
I just installed Redash 0.12.0+b2449 on an EC2 instance running Ubuntu 16.04.
I was half expecting Athena to show up as one of the data source types but clearly it’s not baked in yet :).
Any update on the blog post you mentioned you’ll be writing, @arikfr?
FWIW, I created a Docker build using the Dockerfile in the athena-proxy repo and have the container running on the same instance as Redash.
The question is, how do I make Athena appear in the list of data sources on Redash?
I didn’t realize Redash was so close to a 1.0 release - I’ll just wait for it before diving any deeper.
FWIW, I did upgrade to rc2 and checked the /opt/redash/current/redash/query_runner directory and athena.py does exist, However, I still don’t see Athena listed as an option in the drop down while adding a new data source. Not sure what’s going on but wanted to point it out in case this is a bug in RC2.