How to access AWS S3 with PySpark locally using AWS profiles
At Zervant, we currently use Databricks for our ETL processes, and it's quite great. However, there's been some difficulty in setting up scripts that work both locally and on the Databricks cloud. Specifically, Databricks uses its own proprietary libraries to connect to AWS S3, based on Hadoop 2.7, and that version does not support authenticating with AWS profiles.
Internally, we use SSO to create temporary credentials for an AWS profile that then assumes a role. Therefore, reading a static access key ID and secret access key straight from the credentials file is something we want to avoid.
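If you want to sanity-check that the profile actually resolves temporary credentials before involving Spark, a minimal sketch with boto3 (assuming boto3 is installed; the profile name below is a placeholder):
import boto3

# "your-aws-profile" is a placeholder; use the SSO/assume-role profile you normally export as AWS_PROFILE.
session = boto3.Session(profile_name="your-aws-profile")
creds = session.get_credentials()
# For an SSO/assume-role profile these are temporary credentials resolved on demand.
print("resolved" if creds is not None else "no credentials found for this profile")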
To accomplish this, we need to set one Hadoop configuration on the Spark context:
- key: fs.s3a.aws.credentials.provider
- value: com.amazonaws.auth.profile.ProfileCredentialsProvider
This is done by running this line of code:
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.profile.ProfileCredentialsProvider")
Note! You need to set the AWS_PROFILE environment variable to the name of the profile you want to use.
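For reference, the same property can also be set through the SparkSession builder using the spark.hadoop. prefix, which forwards it into the Hadoop configuration; a sketch, assuming the required jars are already on the classpath (the app name is hypothetical):
from pyspark.sql import SparkSession

# spark.hadoop.* settings are copied into the Hadoop configuration at startup.
spark = (SparkSession.builder
         .appName("s3a-profile-example")  # hypothetical app name
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.profile.ProfileCredentialsProvider")
         .getOrCreate())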
As mentioned, Databricks is on Hadoop 2.7, which doesn't support the above-mentioned credentials provider, so for the local setup we need a newer Hadoop 3.x distribution.
The full list of files in our current setup is as follows:
- spark-3.0.2-bin-hadoop3.2
- hadoop-aws-3.2.2
- aws-java-sdk-bundle-1.11.563
Picking the correct hadoop-aws version (3.2.2) is fairly obvious, since the Spark distribution bundles Hadoop 3.2. Picking the right aws-java-sdk-bundle version requires looking into the dependencies of hadoop-aws-3.2.2.
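If you'd rather not track that dependency by hand, an alternative is to let Spark resolve the jars from Maven with --packages, which pulls in the matching aws-java-sdk-bundle transitively; a sketch (this must be set before the SparkContext is created):
import os

# Resolves hadoop-aws 3.2.2 and its aws-java-sdk-bundle dependency from Maven Central.
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--master local[*] '
    '--packages org.apache.hadoop:hadoop-aws:3.2.2 '
    'pyspark-shell')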
Our setup is compatible with the following MySQL, PostgreSQL and Snowflake libraries:
- snowflake-jdbc-3.12.17
- spark-snowflake_2.12-2.8.4-spark_3.0
- mysql-connector-java-8.0.23
- postgresql-42.2.19
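As an illustration of using one of those drivers, a JDBC read from PostgreSQL could look roughly like this; the host, database, table and credentials are placeholders, and it assumes a SparkSession named spark like the one created in the example below:
# All connection details below are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.invoices")
      .option("user", "report_user")
      .option("password", "secret")
      .option("driver", "org.postgresql.Driver")
      .load())
df.show(5)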
When everything is done, running this will work:
import findspark, os

# These must be set before the SparkContext is created, or they won't take effect.
os.environ['AWS_PROFILE'] = "your-aws-profile"  # In case you haven't already set the ENV variable
# Jar paths are relative to the working directory; adjust them if the jars live elsewhere.
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--master local[*] '
    '--jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,'
    'mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2.jar,aws-java-sdk-bundle-1.11.563.jar '
    'pyspark-shell')  # pyspark-shell is required when launching Spark from a plain Python process

findspark.init()
import pyspark
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext(appName="Pi")
# Tell the S3A filesystem to pick up credentials from the active AWS profile.
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
                                  "com.amazonaws.auth.profile.ProfileCredentialsProvider")
spark = SparkSession(sc)

s3File = sc.textFile("s3a://bucket/file")
s3File.count()
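The same s3a:// scheme works through the DataFrame reader as well; a small sketch with a hypothetical bucket and key:
# Hypothetical path; the csv, json and parquet readers work the same way over s3a://.
df = spark.read.csv("s3a://bucket/path/to/data.csv", header=True, inferSchema=True)
df.printSchema()
df.count()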