AWS Glue ETL Workflow for Data Analytics

Build Athena Tables through Glue Tool Suite

Allen Ma
5 min read · Mar 21, 2021
Image from AWS Glue Concepts

TL;DR

We first used Glue Crawlers to build data catalogs for the S3 source data, then deployed customized ETL scripts through Glue Jobs. The processed result was saved into a new S3 destination with updated partitions. Next, we ran another crawler on the output data so that an Athena table over the results was created/updated for SQL queries. Finally, we applied Triggers and Workflows to orchestrate and automate the ETL steps as a daily job.

Glue Tools Used for This ETL Example

  • Glue Crawler: builds a data catalog of the source data and makes S3 data available to Athena in table format regardless of the source format (CSV, Parquet, ORC, etc.), so analysts can write SQL queries in Athena or connect BI visualization tools (see the boto3 sketch after this list)
  • Glue Dev endpoints: let me connect my Jupyter Notebook (yes, Jupyter Notebook is my comfort zone; you can choose PyCharm Professional or another IDE as you prefer) to a Spark environment running on cloud clusters for better compute performance. It makes my life easier to debug and test scripts before deploying code on the AWS platform.
  • Glue Jobs: deploy my ETL scripts and run the ETL jobs as configured
  • Glue Triggers: an automation tool that monitors job status or a time schedule and then fires the next action in the process
  • Glue Workflows: orchestrate the execution of multiple Glue job steps
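As a side note, crawlers can also be started programmatically. Below is a minimal boto3 sketch, not part of the original pipeline: the crawler name matches one used later in this post, and the polling interval is arbitrary.

import time
import boto3

glue = boto3.client("glue")

# kick off the crawler that catalogs the iOS source data
glue.start_crawler(Name="segmentlogs-IOS")

# poll until the crawler finishes and goes back to READY
while glue.get_crawler(Name="segmentlogs-IOS")["Crawler"]["State"] != "READY":
    time.sleep(30)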

Background Info about Data Sources

We have mobile event logs sent through a third-party platform, Segment. The original logs are automatically copied from Segment's S3 to our S3 bucket whenever a configured time interval or data size limit is reached. The data arriving in our bucket is automatically partitioned by source-id and received-date. For example, we could have Android and iOS data sitting in two separate subfolders, and within each folder the data is further partitioned by date in Unix epoch timestamp format.
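For illustration, the raw layout looks roughly like this; the bucket name, source IDs, and epoch folder value below are made-up placeholders, where the folder name is the received date as an epoch timestamp in milliseconds:

s3://our-segment-logs/segment-data/<android-source-id>/1616284800000/...
s3://our-segment-logs/segment-data/<ios-source-id>/1616284800000/...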

Challenges for Data Analysts

The source data lives in separate folders with different schemas. The event logs are nested JSON strings with so many user attributes and event details that they are not easily accessible through SQL queries.

Glue Crawler can certainly build a data catalog for each individual source, but it won't flatten the details hidden inside Struct columns, so we need extra steps to extract and transform the important information through ETL scripts.
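As a rough sketch of what those extra steps can look like in the ETL script (the database, table, and nested field names below are hypothetical, not the actual Segment schema):

import pyspark.sql.functions as f
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# read the crawler-built catalog table into a DataFrame
# (the database and table names here are hypothetical)
df_ios = glueContext.create_dynamic_frame.from_catalog(
    database="segmentlogs", table_name="ios").toDF()

# pull a few nested attributes out of the Struct into flat columns;
# the field paths below are illustrative, not the real Segment schema
df_ios = df_ios.select(
    f.col("anonymousId").alias("anonymous_id"),
    f.col("event"),
    f.col("timestamp"),
    f.col("context.device.model").alias("device_model"),
)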

Real-time analysis is not needed, but analysts usually check user activities on a daily basis, so we want newly arrived source data to be processed and loaded to the destination daily. That raises another challenge: we need to orchestrate and automate the main steps of the ETL process.

How To Solve?

To solve the above challenges, we need to build a workflow with these major steps: 1) use Crawlers to build data catalogs of the raw data in the source bucket so the source data is easier to extract with ETL scripts; 2) write a PySpark ETL script that integrates the different raw sources based on the catalogs built, does further transformations, repartitions the data, and loads the results into a new destination; 3) use another Crawler to build a data catalog and create an Athena table on top of the result data.

Data Integration: We want to combine all user activities from different devices and have an integrated view to understand their behavior better.

# union Android and iOS data together
df = df_ios.unionByName(df_android)
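One hedged note, since the full script is not shown here: the two sources do not share an identical schema, so a common approach is to select a shared column list before the union, or use unionByName's allowMissingColumns flag on Spark 3.1+. The column names below are illustrative:

# keep only a shared, illustrative set of columns before the union
common_cols = ["anonymous_id", "event", "timestamp", "device_model"]
df = df_ios.select(common_cols).unionByName(df_android.select(common_cols))

# alternatively, on Spark 3.1+:
# df = df_ios.unionByName(df_android, allowMissingColumns=True)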

Repartition: the original partitions in epoch timestamp format are not friendly. Reconciling the event timestamps and repartitioning them into Year-Month-Date format is more human-readable and easier for analysts to query.

Destination S3 partitions
import sys

from pyspark.context import SparkContext
import pyspark.sql.functions as f
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# standard Glue job setup; df is the combined iOS + Android DataFrame from above,
# and output_dir / temp_dir are S3 paths passed in as job parameters
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# parse the ISO 8601 UTC string into a timestamp, then shift it to PST
df_mobile = df.withColumn(
    'utc_timestamp',
    f.to_timestamp('timestamp', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
).withColumn(
    'pst_timestamp',
    f.from_utc_timestamp('utc_timestamp', 'America/Los_Angeles')
)

# create year, month, date columns from the PST timestamp
df_mobile = (df_mobile
    .withColumn('year', f.year('pst_timestamp'))
    .withColumn('month', f.month('pst_timestamp'))
    .withColumn('date', f.to_date('pst_timestamp')))

# set new partitions
partition_by = ["year", "month", "date"]

# convert Spark DataFrame to Glue DynamicFrame
ddf_mobile = DynamicFrame.fromDF(df_mobile, glueContext, "ddf_mobile")

# write the DynamicFrame into the S3 destination with the new partitions
df_output = glueContext.write_dynamic_frame.from_options(
    frame=ddf_mobile,
    connection_type="s3",
    connection_options={
        "path": output_dir,
        "staging_path": temp_dir,
        "partitionKeys": partition_by
    },
    format="csv",
    transformation_ctx="df_output")
job.commit()

The detailed transformation steps are not the focus of this story; please refer to the AWS Glue documentation if you are interested.
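Once the output table exists in the data catalog (step 8 in the next section), analysts can prune on the new partition columns. Here is a hedged boto3 sketch of submitting such an Athena query; the database, table, and result-bucket names are placeholders, and the partition values are compared as strings since they come from folder names:

import boto3

athena = boto3.client("athena")

# illustrative query filtering on the year/month/date partitions
athena.start_query_execution(
    QueryString="""
        SELECT event, count(*) AS events
        FROM mobile_analytics.segmentlogs_mobile
        WHERE year = '2021' AND month = '3' AND "date" = '2021-03-21'
        GROUP BY event
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)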

Sequential Steps of This Workflow

There are 8 steps in total (a boto3 sketch of the conditional trigger wiring follows the list):

  1. mobile-user-start: a scheduled Trigger that starts the daily ETL process once the clock passes your configured time (e.g., 13:00:00 UTC)
  2. segmentlogs-IOS: a Crawler that automatically captures the iOS data schema from the S3 raw data source and updates the newly ingested partitions in the data catalog
  3. android-crawler-trigger: a Trigger that responds to the previous crawler; once segmentlogs-IOS reaches SUCCEEDED status, it triggers the next crawler, segmentlogs-Android, to run
  4. segmentlogs-Android: a Glue Crawler that automatically captures the Android data schema from the S3 raw data source and updates the newly ingested partitions
  5. mobile-user-etl-trigger: a Glue Trigger that responds to the previous crawler; once segmentlogs-Android reaches SUCCEEDED status, it triggers the PySpark script mobile-user-daily to run
  6. mobile-user-daily: the actual ETL Job, which runs the script that reads the source folders, drops unnecessary fields, combines the iOS and Android data, parses out the target date range, repartitions the data into PST year-month-date, and then pipes the transformed data into a destination S3 folder for analytics consumption
  7. mobile-crawler-trigger: a Trigger that responds to the ETL job mobile-user-daily; once that job reaches SUCCEEDED status, it triggers the final crawler to run
  8. segmentlogs-Mobile: the final Crawler, which captures the data catalog of the output data sitting in the destination folder and makes it available for Athena queries
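For reference, here is a minimal boto3 sketch of how one of the conditional triggers above could be defined; the trigger and crawler names match the steps in this list, while the workflow name and other settings are assumptions:

import boto3

glue = boto3.client("glue")

# step 3: fire the Android crawler once the iOS crawler succeeds
glue.create_trigger(
    Name="android-crawler-trigger",
    WorkflowName="mobile-user-workflow",  # hypothetical workflow name
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "segmentlogs-IOS",
            "CrawlState": "SUCCEEDED",
        }]
    },
    Actions=[{"CrawlerName": "segmentlogs-Android"}],
)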

Summary

We ran Crawlers to get the schemas of the two mobile data sources; wrote a Glue ETL job to combine and clean the data, repartition it into year-month-date format, and load the result into a new destination; and ran another Crawler to build a data catalog of the output data and make it available as an Athena table for SQL queries or BI tools.

Notes:

CloudWatch is another AWS tool that can be used for monitoring the steps of the workflow; there you can check execution details, identify issues, and set alarms and automatic reactions.

Glue Dev endpoints can be expensive if you keep them alive. You are charged based on the Data Processing Unit (DPU) hours used for the time your development endpoints are provisioned, which means you pay for idle time as well. Please make sure to delete the dev endpoints once you finish the ETL deployment if you don't want to surprise your manager with a high AWS bill.
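As a hedged reminder, the teardown is a one-liner with boto3 (the endpoint name below is a placeholder):

import boto3

glue = boto3.client("glue")
glue.delete_dev_endpoint(EndpointName="my-notebook-endpoint")  # placeholder name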

References:

https://docs.aws.amazon.com/glue/index.html
