Thursday, September 6, 2018

How to ETL in Amazon AWS? AWS Glue for dummies

You can do ETL in AWS in a few different ways:

  1. Glue
  2. DataPipeline
  3. A custom solution, e.g. a Docker

I've used a custom solution for a while, but recently decided to move to Glue, gradually. Why? Because when it is set up, you have so much less to worry about. Glue is the preferred choice when you need to move data around. If you're unsure what route to take, stick to Glue. If you find it doesn't fit your needs well, only then look elsewhere.

The building blocks you need to understand
  1. Glue Data Catalog
  2. Crawlers
  3. Jobs

Glue Data Catalog

The tables in your Glue Data Catalog hold the metadata for your data (where it is stored, what format it is in). Tables are grouped by databases, which is simply a way to group your tables. Lastly, connections describe how Glue can connect to your data, e.g. through a JDBC connection.

You need the tables in your data catalog in order to run ETL jobs.


Crawlers help you catalog your data into the Data Catalog. Your data needs to be in S3, DynamoDB or you need to have specified a JDBC connection for it. A crawler will have a look at your data and generate the tables in your Data Catalog - interpreting the schema from the data.


Jobs do the ETL work and they are essentially python or scala scripts. When using the wizard for creating a Glue job, the source needs to be a table in your Data Catalog. You can load the output to another table in your data catalog, or you can choose a connection and tell Glue to create/update any tables it may find in the target data store. You can schedule jobs with triggers.

Advanced stuff

A role for Glue

You'll obviously need a role for Glue. The policies I have added are AmazonS3FullAccess, AWSGlueServiceRole, AWSGlueConsoleFullAccess and AmazonAthenaFullAccess. If you are a strickler on security, I'm sure you can narrow done the rights.


When you create a job with the wizard, it creates a bunch of code for you. Here's what it looks like in python. It includes annotations (lines starting with ##). These are only necessary for drawing the diagram on the left. 

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "analytics", table_name = "events", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "analytics", table_name = "events", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("category", "string", "category", "string"), ("timestamp", "bigint", "timestamp", "bigint"), ("environment", "string", "environment", "string"), ("source", "string", "source", "string"), ("owner", "string", "owner", "string"), ("id", "string", "id", "string"), ("sessionid", "string", "sessionid", "string"), ("payload", "string", "payload", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("category", "string", "category", "string"), ("timestamp", "string", "timestamp", "bigint"), ("environment", "string", "environment", "string"), ("source", "string", "source", "string"), ("owner", "string", "owner", "string"), ("id", "string", "id", "string"), ("sessionid", "string", "sessionid", "string"), ("payload", "string", "payload", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "Analytics DB Testprod", connection_options = {"dbtable": "events", "database": "reporting"}, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "Analytics data warehouse", connection_options = {"dbtable": "event", "database": "analytics"}, transformation_ctx = "datasink4")

The most interesting lines start from datasource0 =. The default code works like this:

So fairly simple stuff. Except when you run into problems.

Developing the ETL scripts

What you'll quickly find is that you need to further develop the ETL script. After modifying the ETL script you launch it, only to realize it takes a good few minutes for it to start, and a few others to run. Iterating the script gets tedious, but luckily you can use Glue's Dev Endpoints.

Dev Endpoints are reserved instances that you can use to iterate ETL scripts faster. Setting these up is a whole another story though and would deserve their own blog post.

No comments:

Post a Comment