Simple way to query Amazon Athena in python with boto3

ETL takes time and it's a lot to maintain. Sometimes it breaks when you didn't expect a string to contain emojis. You might decide the transformation needs to be changed, which means you need to refresh all your data. So what can you do to avoid this?

Data virtualization helps tremendously with that. It's not the answer to everything of course, but it makes certain things loads easier. When you don't have to worry too much about the structure in which to store your data, you can choose to keep it in its richest, untouched format. All of it. So when you need it in a specific format, it's all there. With data virtualization, you impose a schema on read, not on write, and you can change that schema as you wish.

Towards the end of 2016, Amazon launched Athena - and it's pretty awesome. We can e.g. store our raw JSON data in S3, define virtual databases with virtual tables on top of them and query these tables with SQL. And on top of everything, it is quite simple to take into use.

To give it a go, just dump some raw data files (e.g. CSV, JSON or log files) into an S3 bucket, head over to Amazon Athena and run a wizard that takes you through a virtual table creation step-by-step. If you are familiar with Hive, you will find that the Data Definition Language is identical. Once a table is created, it's ready to be queried.

If you wish to run queries to Athena from e.g. python, you have a few options, for example

PyAthenaJDBC
JayDeBeApi
boto3

From a user experience point of view the PyAthenaJDBC would have been my preferred order too, as the first two would have let me query easily into a pandas DataFrame, but I was too lazy to compile the PyAthenaJDBC on my Windows machine (would've required Visual C++ Build Tools which I didn't have). JayDeBeApi looked like a hassle to set up.

Boto3 was something I was already familiar with. With boto3, you specify the S3 path where you want to store the results, wait for the query execution to finish and fetch the file once it is there. And clean up afterwards. Once all of this is wrapped in a function, it gets really manageable.

If you want to see the code, go ahead and copy-paste this gist: query Athena using boto3. I'll explain the code below.

First let's start with our configurations. Fill these with your own details of course.

import boto3
import pandas as pd
import io
import re
import time

params = {
    'region': 'eu-central-1',
    'database': 'databasename',
    'bucket': 'your-bucket-name',
    'path': 'temp/athena/output',
    'query': 'SELECT * FROM tablename LIMIT 100'
}

session = boto3.Session()

The following function will dispatch the query to Athena with our details and return an execution object. Our query will be handled in the background by Athena asynchronously.

def athena_query(client, params):
    
    response = client.start_query_execution(
        QueryString=params["query"],
        QueryExecutionContext={
            'Database': params['database']
        },
        ResultConfiguration={
            'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        }
    )
    return response

To make things a little easier for us, the next function will do a few things for us. It will:

Dispatch the query to Athena
Poll the results and once the query is finished
Return the filename in S3 where the query results are stored

def athena_to_s3(session, params, max_execution = 5):
    client = session.client('athena', region_name=params["region"])
    execution = athena_query(client, params)
    execution_id = execution['QueryExecutionId']
    state = 'RUNNING'

    while (max_execution > 0 and state in ['RUNNING', 'QUEUED']):
        max_execution = max_execution - 1
        response = client.get_query_execution(QueryExecutionId = execution_id)

        if 'QueryExecution' in response and \
                'Status' in response['QueryExecution'] and \
                'State' in response['QueryExecution']['Status']:
            state = response['QueryExecution']['Status']['State']
            if state == 'FAILED':
                return False
            elif state == 'SUCCEEDED':
                s3_path = response['QueryExecution']['ResultConfiguration']['OutputLocation']
                filename = re.findall('.*\/(.*)', s3_path)[0]
                return filename
        time.sleep(1)
    
    return False

If you want to clean things up after, this piece of code removes all objects in the path:

# Deletes all files in your path so use carefully!
def cleanup(session, params):
    s3 = session.resource('s3')
    my_bucket = s3.Bucket(params['bucket'])
    for item in my_bucket.objects.filter(Prefix=params['path']):
        item.delete()

Best be careful with that one ;)

Once you have all these defined, you can call

# Query Athena and get the s3 filename as a result
s3_filename = athena_to_s3(session, params)

# Removes all files from the s3 folder you specified, so be careful
cleanup(session, params)

If you enjoyed these instructions, would you mind commenting below 👇🏻?
Also, if there's something this did not help you with regarding querying Athena with boto3, I'd be happy to improve the article.

Last update 25.3.2019

Comments

UnknownMarch 26, 2019 at 12:29 PM
This is very clear and concise. Have you tried such code in a context of a query for a web page backend? I am curious what performance of something like this look like and if it is only functional for ETL.
AnonymousMay 13, 2019 at 11:49 AM
Hi!
Thanks for the guide!

In the configurations section:

session = boto3.Session(profile_name='profile-optional')

What do you mean with "profile_name='profile-optional'"?

Search This Blog

Ilkka Peltola

Simple way to query Amazon Athena in python with boto3

Comments

Post a Comment

Popular posts from this blog

ThumbmarkJS: A free, open source device fingerprinting JavaScript library for the web

How to access AWS S3 with pyspark locally using AWS profiles tutorial