At re:Invent 2016, AWS launched Athena, a new service that lets you query data in S3 without dealing with the usual complexities of setting up infrastructure. In a lot of ways, this looks like the equivalent of AWS Lambda for the data world: just run your processes/queries and AWS takes care of the heavy lifting with respect to infrastructure.
From the product page:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Under the hood, it uses tools/libraries which are fairly well known in the AWS and “big data” world.
- S3: Amazon's object store, where you typically keep all your files, especially when AWS is part of your infrastructure.
- Hive (metastore): stores your metadata, such as table definitions and instructions for reading and making sense of your files (comma-separated, regex matches, etc.)
- Presto: Facebook's now-open-source project that allows fast interactive queries on top of your data in HDFS or any distributed data store
The process is really simple. You create a table describing how your file is structured and where it lives. This CREATE TABLE is purely logical and only specifies how to read the file. There is no data loading, so it takes just a couple of seconds. Amazon's interface looks a lot like an earlier version of Hue.
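As a sketch, a table over comma-separated log files in S3 might be defined like this (the bucket name, path, and columns here are hypothetical, chosen only for illustration):

```sql
-- Hypothetical external table over CSV access logs stored in S3.
-- No data is loaded; this only tells Athena how to parse the files at LOCATION.
CREATE EXTERNAL TABLE access_logs (
  request_time string,
  client_ip    string,
  status_code  int,
  bytes_sent   bigint
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/logs/';
```

Once the table exists, a plain `SELECT * FROM access_logs LIMIT 10;` reads directly from the files in S3.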
The not-so-unique parts
External tables: Databases like Oracle and Microsoft SQL Server have supported external tables for a while. This lets you query data without the need to "load" it first. Loading data is a time-consuming operation and most databases have a limited amount of space, so if you have, say, 50 TB of logs that you query very rarely, external tables are a good option.
Big data queries at interactive speeds: Presto and other engines like Impala allow you to query large data sets at interactive speeds. Athena uses Presto as the engine that processes your query.
The unique (and awesome) parts:
S3!!: While external tables existed in Oracle and SQL Server, very few use cases I have seen had more than a few terabytes of data. Typically these files sit on the same server as the database, so this is fairly restricted. On the other hand, it is common to see hundreds of terabytes of data in S3, so this solution scales.
Serverless: While the tools existed, the interactive-query option above still had the overhead of setting up and managing a Hadoop cluster, or at the very least using a service like EMR. Companies have come up with interesting solutions like spinning up EMR clusters on demand for a job, but that requires a lot of tooling and knowledge of these libraries, when all you want to do is query the data in your files.
Regular query tools: Not specific to Athena, but it's great to just connect your SQL tools (like SQL Workbench) and visualization tools directly to this service.
Different formats: It's not just CSV files anymore. JSON and custom SerDes are all fair game.
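For instance, a table over JSON files swaps the delimiter clause for a SerDe. The sketch below assumes the OpenX JSON SerDe is available (it is one of the JSON SerDes the Hive ecosystem supports); the table, fields, and bucket are made up:

```sql
-- Hypothetical table over newline-delimited JSON event files.
-- Each top-level JSON key maps to a column of the same name.
CREATE EXTERNAL TABLE events (
  user_id    string,
  event_type string,
  created_at string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-example-bucket/events/';
```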
Permissions and Security
Current permissions are easier to understand if you think of this as two different AWS services. For a user to have access to the data, they need permission to use Athena as a service and separate access to the underlying S3 data. S3 access is already well controlled, so reusing those existing policies requires no additional work.
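As a sketch, the S3 half of that access is just a standard IAM policy granting read access to the bucket holding the data (the bucket name below is hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```

If a team already attaches policies like this to its users, Athena can lean on them as-is for the data side of the permission check.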
Presto/Impala serve the distinct use case of interactive querying (queries that typically take a few seconds to a few minutes). MapReduce and other frameworks are still the go-to option for jobs that take hours. It is easiest to think of Athena vs. EMR in the same way: short interactive queries on Athena, and long-running jobs on EMR.
Pricing is based on the amount of data you scan, so you should think carefully about partitioning your S3 data and using the right compression and file formats. There are some good guidelines and some limitations mentioned here.
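Partitioning is the most direct lever here: if a query filters on the partition column, only the matching S3 prefixes are read (and billed). A hypothetical sketch, with made-up bucket and column names:

```sql
-- Hypothetical table partitioned by date; each day's files live under
-- s3://my-example-bucket/logs/dt=YYYY-MM-DD/
CREATE EXTERNAL TABLE access_logs_partitioned (
  client_ip   string,
  status_code int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/logs/';

-- Register one day's partition with its S3 location.
ALTER TABLE access_logs_partitioned
  ADD PARTITION (dt = '2016-12-01')
  LOCATION 's3://my-example-bucket/logs/dt=2016-12-01/';

-- Filtering on the partition column scans only that day's files,
-- instead of the whole table's data.
SELECT count(*) FROM access_logs_partitioned WHERE dt = '2016-12-01';
```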
Surprises and open questions
- I was surprised to see that AWS has no compute component in the pricing for this service. My initial assumption was that only SELECTs were allowed, in which case pricing based on just data scanned makes sense. But if I join a billion-row data set to another million-row data set, the compute and memory needed to do the job are expensive. Maybe AWS doesn't expect a lot of workloads with joins, or maybe we are all working on a "shared compute" pool. In that case it would be interesting to see how jobs are prioritized and how they might be terminated when a user runs really massive queries and joins. There is currently a (soft) query time limit of 30 minutes.
- There is no aws-cli access yet, so the initial setup still needs to be done from the console. You should still save these queries in your source control, at the very least.
Overall, it looks like a service that combines a lot of stable tools and eases the administrative burden. Thankfully, you no longer need to understand big data lingo just to query data from a file. #lessbuzzwords