AWS Glue ETL Workflow for Data Analytics

Build Athena Tables through Glue Tool Suite

Image from AWS Glue Concepts

TL;DR

Glue Tools Used for This ETL Example

  • Glue Dev endpoints: create a local Spark environment in my Jupyter Notebook (yes, Jupyter Notebook is my comfort zone; you can choose PyCharm Professional or another IDE as you prefer) and connect it to cloud clusters for better compute performance. This makes it much easier to debug and test scripts before deploying code on the AWS platform.
  • Glue Jobs: deploy my ETL scripts and run the ETL jobs as configured (a minimal job-script skeleton is sketched after this list)
  • Glue Triggers: automation tool that monitors job status or a time schedule and then triggers the next action of the process
  • Glue Workflows: orchestrates the execution of multiple Glue job steps
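
For context, a Glue PySpark job script typically starts with boilerplate like the minimal sketch below. This is only an assumed skeleton, not the exact script used in this workflow; the argument names output_dir and temp_dir are hypothetical job parameters.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# read job parameters passed in by the Glue Job configuration
# ('output_dir' and 'temp_dir' are hypothetical argument names)
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'output_dir', 'temp_dir'])

# standard Glue boilerplate: Spark context, Glue context and Job wrapper
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

output_dir = args['output_dir']
temp_dir = args['temp_dir']

# ... ETL transformations go here ...

job.commit()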

Background Info about Data Sources

Challenges for Data Analysts

Glue Crawler can certainly build a data catalog for each individual source, but it won't flatten the details hidden in Struct fields, so we need extra ETL steps to extract and transform the important information.
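
For example, nested Struct fields can be pulled out into flat, queryable columns with plain PySpark. A minimal sketch, assuming df is the raw DataFrame; the 'context.device.*' field names are made-up examples, not the actual schema:

import pyspark.sql.functions as f

# flatten a few nested Struct fields into top-level columns
# (the nested field names below are hypothetical)
df_flat = df.select(
    f.col('anonymous_id'),
    f.col('timestamp'),
    f.col('context.device.type').alias('device_type'),
    f.col('context.device.model').alias('device_model'),
    f.col('context.os.version').alias('os_version'),
)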

Real-time analysis is not needed, but analysts usually check user activities on a daily basis, so we want newly arrived source data to be processed and loaded to the destination every day. That raises another challenge: we need to orchestrate and automate the main steps of the ETL process.

How To Solve?

Data Integration: We want to combine all user activities from different devices and have an integrated view to understand their behavior better.

# union Android and iOS data together
df = df_ios.unionByName(df_android)

Repartition: The original partitions in epoch timestamp format are not friendly. Reconciling the event timestamps and repartitioning them into Year-Month-Date format is more human-readable and easier for analysts to query.

Destination S3 partitions
from pyspark.context import SparkContext
import pyspark.sql.functions as f
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

# glueContext, job, output_dir and temp_dir are assumed to be initialized
# as in the job-script skeleton sketched earlier

# parse the ISO 8601 UTC timestamp string into a proper timestamp column
# (the shift to PST itself, e.g. via f.from_utc_timestamp, is not shown here)
df_mobile = df.withColumn(
    'utc_timestamp',
    f.to_timestamp('timestamp',
                   "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
)

# create year, month, date columns from the parsed timestamp
df_mobile = (df_mobile
             .withColumn('year', f.year('utc_timestamp'))
             .withColumn('month', f.month('utc_timestamp'))
             .withColumn('date', f.to_date('utc_timestamp')))

# set new partitions
partition_by = ["year", "month", "date"]

# convert Spark DataFrame to Glue DynamicFrame
ddf_mobile = DynamicFrame.fromDF(
    df_mobile,
    glueContext,
    "ddf_mobile")

# write DynamicFrame into S3 destination with new partitions
df_output = glueContext.write_dynamic_frame.from_options(
    frame=ddf_mobile,
    connection_type="s3",
    connection_options={
        "path": output_dir,
        "staging_path": temp_dir,
        "partitionKeys": partition_by
    },
    format="csv",
    transformation_ctx="df_output")

job.commit()
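
The snippet above assumes df already holds the day's raw events. One way to restrict a daily run to newly arrived data is a pushdown predicate on the catalog read; this is only a sketch, and the database, table and partition-column names below are assumptions:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# read only the target day's partition from the catalog instead of the full table
# ('segmentlogs', 'segmentlogs_ios' and the 'dt' partition column are hypothetical names)
ddf_ios = glueContext.create_dynamic_frame.from_catalog(
    database="segmentlogs",
    table_name="segmentlogs_ios",
    push_down_predicate="dt = '2021-01-01'",
    transformation_ctx="ddf_ios")

df_ios = ddf_ios.toDF()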

The detailed transformation steps are not the focus of this story; please refer to the AWS Glue documentation if you are interested.

Sequential Steps of This Work Flow

There are in total 8 steps:

  1. mobile-user-start: This is a scheduled Trigger that starts the daily ETL process once the configured time (e.g. 13:00:00 UTC) has passed
  2. segmentlogs-IOS: This is a Crawler that automatically captures the iOS data schema from the S3 raw data source and updates the newly ingested partitions in the data catalog
  3. android-crawler-trigger: This is a Trigger that responds to the previous crawler; once segmentlogs-IOS reaches the SUCCEEDED state, it triggers the next crawler, segmentlogs-Android, to run (see the boto3 sketch after this list for how such a conditional trigger can be created)
  4. segmentlogs-Android: This is a Glue Crawler that automatically captures the Android data schema from the S3 raw data source and updates the newly ingested partitions
  5. mobile-user-etl-trigger: This is a Glue Trigger that responds to the previous crawler; once segmentlogs-Android reaches the SUCCEEDED state, it triggers the PySpark script mobile-user-daily to run
  6. mobile-user-daily: This is the actual ETL Job, whose script reads the source folders, drops unnecessary fields, combines the iOS and Android data, parses out the target date range, repartitions the data into PST year-month-date, and then pipes the transformed data into a destination S3 folder for analytics consumption
  7. mobile-crawler-trigger: This is a Trigger that responds to the ETL job mobile-user-daily; once that step reaches the SUCCEEDED state, it triggers the next crawler to run
  8. segmentlogs-Mobile: This is the final Crawler, used to capture the data catalog of the output data that sits in the destination folder and make it available for Athena queries
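
The Glue console is enough to wire these steps together, but the same conditional triggers can also be created programmatically. Below is a minimal boto3 sketch of step 5 (start the ETL job once the Android crawler succeeds); the workflow name is an assumption:

import boto3

glue = boto3.client("glue")

# conditional trigger: start the ETL job once the Android crawler succeeds
# ('mobile-user-workflow' is a hypothetical workflow name)
glue.create_trigger(
    Name="mobile-user-etl-trigger",
    WorkflowName="mobile-user-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "segmentlogs-Android",
            "CrawlState": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "mobile-user-daily"}],
)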

Summary

Notes:

CloudWatch is another AWS tool that can be used for monitoring the steps of the workflow, where you can check execution details, identify issues, and set alarms and automatic reactions.
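
As one possible way to set up an automatic reaction, an EventBridge (CloudWatch Events) rule can match failed runs of the ETL job and forward them to an SNS topic. This is just a sketch; the rule name and the SNS topic ARN below are placeholders:

import json

import boto3

events = boto3.client("events")

# match Glue job state-change events where mobile-user-daily fails or times out
events.put_rule(
    Name="mobile-user-daily-failed",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["mobile-user-daily"],
            "state": ["FAILED", "TIMEOUT"],
        },
    }),
)

# send matching events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule="mobile-user-daily-failed",
    Targets=[{"Id": "notify-sns",
              "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)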

Glue Dev endpoints can be expensive if you keep them alive. You are charged based on the Data Processing Unit (DPU) hours used for the time your development endpoints are provisioned, which means you pay for idle time as well. Please make sure to delete the dev endpoints once you finish the ETL deployment if you don't want to surprise your manager with a high AWS bill.
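
If you prefer to clean up from a script, a one-liner like the following works; the endpoint name is a placeholder for whatever you provisioned:

import boto3

glue = boto3.client("glue")

# delete the development endpoint once the ETL scripts are deployed
# ('my-dev-endpoint' is a placeholder name)
glue.delete_dev_endpoint(EndpointName="my-dev-endpoint")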

References:

https://docs.aws.amazon.com/glue/index.html

