Create Glue Workflow

In this section, we’ll set up the AWS Glue components required to make our QLDB data in S3 available for query via Amazon Athena. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. The service consists of a metadata repository, or data catalog, an ETL job execution environment, and a job scheduling facility. The data catalog provides applications and other AWS services with the information they need to discover, connect to, and use data wherever it is stored. The ETL engine executes Python or Scala code in a managed Apache Spark environment to perform the work of extracting data from a data source, transforming it into other usable forms, and loading it into other systems and storage locations.

Our Glue workflow will work as follows:

  1. A Glue job will transform the QLDB exported data from its Amazon Ion format into partitioned Apache Parquet files. We will leave the original data untouched and store the newly transformed Parquet data in a second S3 bucket. This is a common data lake pattern. (A sketch of this transformation follows the list.)
  2. A Glue crawler will read and catalog the newly created Parquet-formatted version of our QLDB data, making it available to services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
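
For illustration, here is a minimal PySpark sketch of the kind of conversion the Glue job performs. It is not the lab's actual job script: the bucket names are placeholders, it assumes the export files contain Ion text documents, and it requires the amazon.ion package to be available on the Spark workers.

    import json

    import amazon.ion.simpleion as ion
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("qldb-ion-to-parquet").getOrCreate()
    sc = spark.sparkContext

    def simplify(v):
        # Reduce Ion wrapper types to plain Python so rows JSON-serialize cleanly.
        if isinstance(v, dict):
            return {str(k): simplify(x) for k, x in v.items()}
        if isinstance(v, (list, tuple)):
            return [simplify(x) for x in v]
        if v is None or isinstance(v, (bool, int, float, str)):
            return v
        return str(v)  # timestamps, decimals, blobs: stringify for this sketch

    # Read each exported object whole; QLDB export files hold Ion documents.
    raw = sc.wholeTextFiles("s3://<export-bucket>/<export-prefix>/")

    # Parse every top-level Ion value in each file and flatten to plain dicts.
    docs = raw.values().flatMap(
        lambda text: [simplify(v) for v in ion.loads(text, single_value=False)]
    )

    # Let Spark infer a schema from the JSON form, then write Parquet.
    df = spark.read.json(docs.map(json.dumps))
    df.write.mode("overwrite").parquet("s3://<output-bucket>/parquet/")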

Create the Glue workflow

We’ll create all of our Glue components with CloudFormation in your desired region.

This template will be used to create the following items:

  • S3 Bucket
  • Two Lambda functions
  • Glue Workflow
  • Glue Job
  • Glue Crawlers
  • Glue Databases
  • IAM Roles
Launch the CloudFormation template for your region:

  • US East (N. Virginia): Launch Stack in us-east-1
  • US East (Ohio): Launch Stack in us-east-2
  • US West (Oregon): Launch Stack in us-west-2
  • Europe (Frankfurt): Launch Stack in eu-central-1

Alternatively, download the template file to your local workstation and create a CloudFormation stack by uploading it.
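
If you prefer to create the stack from code rather than the console, a minimal boto3 sketch follows; the local template filename is an assumption, and you should substitute your own region.

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")

    # Assumed local filename for the downloaded template.
    with open("qldb-athena-lab-create-workflow.yaml") as f:
        template_body = f.read()

    # CAPABILITY_IAM is required because the template creates IAM roles.
    cfn.create_stack(
        StackName="qldb-athena-lab-create-workflow",
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],
    )

    # Block until the stack reaches CREATE_COMPLETE.
    cfn.get_waiter("stack_create_complete").wait(
        StackName="qldb-athena-lab-create-workflow"
    )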

You will see the Quick create stack page. In the Stack name block, leave the stack name as qldb-athena-lab-create-workflow.

Leave the Parameters block as-is if you are following this lab.

If you select UseYourOwnS3Bucket, enter the S3 location of your QLDB export in the format bucket/folder.

Check the “I acknowledge that CloudFormation might create IAM resources.” box and click Create stack.


The stack will take several minutes to create. When it finishes, its status will update to CREATE_COMPLETE.

Run the Glue Workflow

Return to the main page of the AWS console by clicking the AWS logo in the upper-left corner of any console page. Go to the Glue page by typing Glue in the Find Services box or by clicking AWS Glue under the Analytics section of All Services.


If you’re taken to the Glue Welcome page, click Get Started.

In the left-hand navigation of the Glue page, click Workflows.

On the Workflows page, select the QLDBExportWorkflow workflow by clicking the radio button next to it.

Click the Actions button and then click Run in its drop-down menu.

The Glue workflow will now start. A diagram in the lower half of the screen shows the steps in the workflow. The workflow starts at the StartTrigger step. The next step is a Glue job (“QLDBWorkflowJob”) that converts the Ion documents to Parquet. This transformation is necessary because Amazon Athena does not currently support Ion. Once the Glue job succeeds, a crawler reads and catalogs the newly created Parquet documents. When the crawler is done, the QLDB export is ready to be queried using Amazon Athena.

The Glue Workflow should take approximately 10 minutes to complete.
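
If you would rather start and watch the run from code than from the console, a minimal boto3 sketch:

    import time

    import boto3

    glue = boto3.client("glue")
    run_id = glue.start_workflow_run(Name="QLDBExportWorkflow")["RunId"]

    # Poll until the workflow run finishes (roughly 10 minutes).
    while True:
        run = glue.get_workflow_run(Name="QLDBExportWorkflow", RunId=run_id)["Run"]
        print(run["Status"])
        if run["Status"] in ("COMPLETED", "STOPPED", "ERROR"):
            break
        time.sleep(30)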

This exercise performs a very basic transformation: converting data from Amazon Ion format to Parquet. Customers can perform much more sophisticated transformations by modifying the Python or Scala scripts in their Glue job; a small example follows. The Optional: QLDB Export and PySpark section of this lab introduces writing custom PySpark scripts against the QLDB export.
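
As one illustration of a richer transformation, the sketch below re-reads the Parquet output and rewrites it partitioned by date so Athena can prune partitions when querying. The blockTimestamp column and bucket paths are assumptions for illustration, not names from the lab's actual script.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("s3://<output-bucket>/parquet/")

    # Derive a date column from a (hypothetical) timestamp field and write the
    # data back partitioned by that date, which lets Athena skip whole days.
    (df.withColumn("export_date", F.to_date(F.col("blockTimestamp")))
       .write.mode("overwrite")
       .partitionBy("export_date")
       .parquet("s3://<output-bucket>/parquet-partitioned/"))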

View Glue job output files

Return to the main page of the AWS console by clicking the AWS logo in the upper-left corner of any console page. Navigate to S3 by finding it under Storage in the list of services or by typing S3 in the search bar at the top.


In the list of buckets, click the bucket created by the CloudFormation stack above. If you're not sure of the bucket's name, check the Outputs section of the CloudFormation stack you created earlier.
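
You can also read the stack's outputs programmatically; a minimal boto3 sketch:

    import boto3

    cfn = boto3.client("cloudformation")
    stack = cfn.describe_stacks(StackName="qldb-athena-lab-create-workflow")["Stacks"][0]

    # Print every output, including the name of the Glue job output bucket.
    for output in stack.get("Outputs", []):
        print(output["OutputKey"], "=", output["OutputValue"])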

Click through the directories in the Glue job output bucket and explore the files created by the job.
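
To browse the same output from code, a minimal boto3 listing sketch; the bucket name is a placeholder for the one in your stack outputs.

    import boto3

    s3 = boto3.client("s3")

    # Page through every object the Glue job wrote.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="<glue-output-bucket>"):
        for obj in page.get("Contents", []):
            print(f'{obj["Size"]:>12}  {obj["Key"]}')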

Optional: S3 Select

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need.

For a quick inspection of the files created by the Glue job, S3 Select is an easy option.

From the S3 console, select the check box next to a Parquet file that was just created and open the Actions drop-down menu.


Now, select Query with S3 Select.


Under Input settings, make sure Apache Parquet is selected; under Output settings, select JSON.


Now, in the SQL query block, click Run SQL query.


In the Query results block, the converted QLDB export data can now be seen as human-readable JSON.
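
The same inspection can be run programmatically with the SelectObjectContent API; a minimal boto3 sketch, with the bucket and object key as placeholders:

    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="<glue-output-bucket>",          # placeholder
        Key="<prefix>/part-00000.parquet",      # placeholder object key
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s LIMIT 5",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )

    # The response payload is an event stream; print the record events.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"), end="")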
