Simple Big Data setup on Amazon AWS

Everyone wants to do some big data stuff, right? In all honesty, no one cares whether your data is big or small - size doesn't matter. What matters is your ability to take data of any size and generate understanding from it. At some point the data you are gathering might become inconvenient to process with more traditional tools. Big data tools might help you there - or not. The bottom line is, they are tools you want to have in your toolbox.

So you want to do some big data stuff. You've got two options:
  1. Run stuff locally, or
  2. Run stuff in the cloud
If you don't have a cloud, then surely you can just go with option #1. For that, the Hortonworks Sandbox in a Docker container seems like a great place to start. But I would not recommend it: it's a hassle and there are very few benefits, except that you only pay for your own electricity.

If you don't have a cloud, just get an AWS account and start doing things the way they will be done in production. If you already have data in AWS, e.g. in an S3 bucket or on an EC2 instance, then I've got just the right thing for you.

Oftentimes, one of the biggest obstacles keeping others in your company from doing simple data analysis themselves is that there's a lot of overhead. In its simplest form, big data is just SQL, as in the case of Apache Hive. Now if big data is just SQL, why aren't more people doing it? It's because of the complicated overhead. Do you know:
  1. ...which applications you need?
  2. ...how to launch and destroy clusters?
  3. ...how to write and debug a job?
  4. ...how to deploy jobs to the cluster?
  5. ...how to check the results?
If your job script is perfect, then you only need to figure out once how to harness an EMR cluster to run your script and terminate once it's done. Usually, however, you need to iterate your jobs quite a bit before they're just right. If developing your job includes a lot of overhead, chances are you won't do it. If instead you can iterate very efficiently and setting up tools is a breeze, chances are other people will do it too.

While running a local sandbox is fine, it doesn't solve the problem of data availability. If your big data lives in AWS, how easy is it for you to access it from a local sandbox, or to make copies of it? It's a lot easier to launch your cluster in AWS, right next to your data. You want to explore data, too!

For example, I have some JSON data in an S3 bucket. Ideally, I would like to have an interactive browser editor for writing SQL-like queries, which would run as big data jobs against my S3 data. I would like to see the results in my browser, like I do with my desktop SQL clients, or be able to store them back to S3.

If you read to the end, you'll know how you can do this, easily.

Hue, an open source Analytics Workbench for self service BI.

Launching an EMR cluster on AWS is super easy with the UI. When you launch an EMR cluster on AWS, you can include Hue, which is excellent. Hue is a browser-based editor that lets you run all sorts of scripts on your cluster and instantly see the results. Did I say it was excellent?

The only problem with Hue on EMR is that... there's no open internet access to EMR clusters. Your cluster's master node has a public hostname, but you will not reach it without some tricks.

You need two tricks:
  1. An SSH tunnel to your EC2 instance, and
  2. A SOCKS proxy for your browser.
The first creates the information tunnel and the second instructs your browser to use it for specific traffic. To create an SSH tunnel, you will need a private key file. This is a .ppk file on a Windows platform (PuTTY Private Key) or a .pem file on other platforms (used by ssh).

For this to work, you also need to have set up the AWS CLI. Additionally, I am assuming you are using profiles to assume IAM roles with the AWS CLI. If not, then, well, you'll know not to set the profile when creating the boto3.Session.
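For reference, a profile that assumes an IAM role lives in `~/.aws/config` and looks something like the snippet below. The account ID and role name here are placeholders - use your own:

```ini
[profile sandbox]
role_arn = arn:aws:iam::123456789012:role/YourSandboxRole
source_profile = default
region = eu-central-1
```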

A list of prerequisites:
  1. The AWS CLI configured with a profile called sandbox
  2. Python 3 with boto3 and configparser
  3. plink on Windows (comes with PuTTY), ssh otherwise
  4. An EC2 private key (create one from the EC2 console; on Windows you'll need to convert it to .ppk)
  5. The Chrome browser (if you know how to configure a SOCKS proxy, any browser will do)
You can get the code from the Git gist, place it in the same directory as the private key file and run it. Here are some parts of the code. Let's define some variables first. If you are using the gist, it will do a lot of this through a config file.

import os
import platform
import tempfile

chromepath    = 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
key_extension = '.ppk' if platform.system() == 'Windows' else '.pem'
keyname       = 'your_key_name' # expects the .pem or .ppk file to have the same name
keyfile       = keyname + key_extension
profile       = 'sandbox'
region        = 'eu-central-1'
instance_type = 'm4.large'
subnet        = "" # you can specify a subnet. I choose this randomly later.
release_label = 'emr-5.13.0'
instances     = 1
apps          = [{'Name': 'Hive'}, {'Name': 'Hue'}]
profiledir    = os.path.join(tempfile.gettempdir(), "chrome_emr_socks_session")
cluster_name  = 'your cluster name'
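The gist reads most of these values from an INI file with configparser. A minimal sketch of what that can look like, with a hypothetical section and key names (inlined here with read_string; the gist would use config.read on an actual file):

```python
import configparser

config = configparser.ConfigParser()
# In a real script: config.read('your_config.ini')
config.read_string("""
[emr]
profile   = sandbox
region    = eu-central-1
instances = 1
""")

emr       = config['emr']                       # hypothetical section name
profile   = emr.get('profile', 'sandbox')       # fall back to sensible defaults
region    = emr.get('region', 'eu-central-1')
instances = emr.getint('instances', 1)          # getint parses the string for you
```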

You might want to choose a subnet, but if you don't, a random one from your VPC will be selected.

import random

import boto3

session = boto3.Session(profile_name = profile)
ec2 = session.resource('ec2', region_name = region)

first_vpc = list(ec2.vpcs.all())[0]
subnets = list(first_vpc.subnets.all())
subnet = random.choice(subnets).id

The following launches the cluster. You'll need suitable IAM roles for EMR_EC2_DefaultRole and EMR_DefaultRole (the AWS CLI can create these for you with aws emr create-default-roles). Mine have the AmazonElasticMapReduceforEC2Role policy with a trust relationship to EC2, and the AmazonElasticMapReduceRole policy with a trust relationship to elasticmapreduce, respectively.

client = session.client('emr', region_name = region)

response = client.run_job_flow(
    Name = cluster_name,
    ReleaseLabel = release_label,
    Applications = apps,
    Instances = {
        'MasterInstanceType': instance_type,
        'SlaveInstanceType': instance_type,
        'InstanceCount': instances,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': subnet,
        'Ec2KeyName': keyname,
    },
    JobFlowRole = 'EMR_EC2_DefaultRole',
    ServiceRole = 'EMR_DefaultRole',
)

# better store the job flow id
job_flow_id = response['JobFlowId']

If you want to check how it's doing, have a look at the EMR console, or check with a command:

cluster_info = client.describe_cluster(ClusterId = job_flow_id)
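If you'd rather block until the cluster is ready instead of refreshing the console, a small polling helper does the trick (boto3 also ships a cluster_running waiter for the same purpose). This is a sketch; client is anything with a describe_cluster method:

```python
import time

def wait_for_cluster(client, job_flow_id, poll_seconds=30):
    """Poll describe_cluster until the cluster is ready (or has failed)."""
    while True:
        status = client.describe_cluster(ClusterId=job_flow_id)
        state = status['Cluster']['Status']['State']
        if state in ('WAITING', 'RUNNING'):
            return state                 # cluster is up and usable
        if state in ('TERMINATED', 'TERMINATED_WITH_ERRORS'):
            raise RuntimeError('Cluster ended in state ' + state)
        time.sleep(poll_seconds)         # still STARTING / BOOTSTRAPPING
```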

Once the cluster is set up, you'll see it as waiting and it has a public master DNS name.

master_public_dns_name = cluster_info['Cluster']['MasterPublicDnsName']

We'll use that to create an SSH tunnel to it.

import platform
import subprocess as sp

# This opens the SSH tunnel on Windows
if platform.system() == 'Windows':
    process_id = sp.Popen('echo y | plink.exe -i ' + keyfile
        + ' -N -D 8157 hadoop@' + master_public_dns_name
        , shell=True)
else:
    # I haven't tested this, but on Linux / Mac this should work.
    process_id = sp.Popen(['ssh', '-o', 'StrictHostKeyChecking=no'
        , '-i', keyfile, '-N', '-D', '8157'
        , 'hadoop@' + master_public_dns_name])

# Just ensure your keyfile points to the .pem or the .ppk file

One last thing before you can connect: the SOCKS proxy. With Chrome this is pretty easy.

command = [chromepath
           # 8157 is the local port the SSH tunnel listens on
           , '--proxy-server=socks5://127.0.0.1:8157'
           , '--user-data-dir=' + profiledir
           # Hue listens on port 8888 on the master node
           , 'http://' + master_public_dns_name + ':8888'
          ]
chrome = sp.Popen(command, stdout=sp.PIPE, stderr=sp.PIPE)

Nice! You can now create credentials for your Hue and start scripting! When you're done, the Python script will destroy the cluster for you, like so:

# Stop the SSH tunnel
process_id.terminate()

# Terminate the cluster
response = client.terminate_job_flows( JobFlowIds=[job_flow_id] )

All of the above is in a neat gist, which you can get here. You just need to have your private key file.

