In this section, we’ll set up the AWS Glue components required to make our QLDB data in S3 available for query via Amazon Athena. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. The service consists of a metadata repository, or data catalog, an ETL job execution environment, and a job scheduling facility. The data catalog provides applications and other AWS services with the information they need to discover, connect to, and use data wherever it is stored. The ETL engine executes Python or Scala code in a managed Apache Spark environment to perform the work of extracting data from one data source, transforming it into other usable forms, and loading it into other systems and storage locations.
Our Glue workflow will work as follows:
We’ll create all of our Glue components with CloudFormation in your desired region.
This template will be used to create the following items:
|Region|Launch CloudFormation Template|
|---|---|
|US East (Virginia)|Launch Stack in us-east-1|
|US East (Ohio)|Launch Stack in us-east-2|
|US West (Oregon)|Launch Stack in us-west-2|
|Europe (Frankfurt)|Launch Stack in eu-central-1|
Or, download the file to your local workstation and create a CloudFormation stack by uploading the template.
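If you prefer to create the stack programmatically, the sketch below shows the request parameters a boto3 `create_stack` call would take. The stack name and template URL here are placeholders, not values from this lab; the `CAPABILITY_IAM` entry corresponds to the IAM acknowledgement checkbox you would otherwise tick in the console.

```python
# Hypothetical sketch: creating the CloudFormation stack via the API instead
# of the console. Stack name and template URL are assumptions, not lab values.
import json

params = {
    "StackName": "qldb-glue-lab",  # placeholder name; choose your own
    "TemplateURL": "https://example-bucket.s3.amazonaws.com/template.yaml",  # placeholder
    # The template creates IAM roles, so this capability must be acknowledged,
    # just like the checkbox on the Quick create stack page.
    "Capabilities": ["CAPABILITY_IAM"],
}

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudformation").create_stack(**params)
print(json.dumps(params, indent=2))
```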
You will see the Quick create stack page as shown below. In the Stack name block, leave the default Stack name as-is. Leave the Parameters block as-is if you are following this lab. If you are using your own QLDB export, set UseYourOwnS3Bucket, then enter the S3 location of your QLDB export in the format bucket/folder. Check the “I acknowledge that CloudFormation might create IAM resources.” box and click Create stack.
The stack will take several minutes to create. When it finishes, its status will update to CREATE_COMPLETE.
Return to the main page of the AWS console by clicking the AWS logo in the upper-left corner of any console page. Go to the Glue page by typing Glue in the Find Services box or by clicking AWS Glue under the Analytics section of All Services.
If you’re taken to the Glue Welcome page, click Get started.
In the left-hand navigation of the Glue page, click Workflows.
On the Workflows page, select the QLDBExportWorkflow workflow by clicking the radio button next to it. Click the Actions button, then click Run in its drop-down menu.
The Glue workflow will now start. A diagram in the lower half of the screen shows the steps in the workflow, beginning at the StartTrigger step. The next step is a Glue job (“QLDBWorkflowJob”) that converts the Ion documents to Parquet. This transformation is necessary because Amazon Athena does not currently support Ion. Once the Glue job succeeds, a crawler reads and catalogs the created Parquet documents. When the crawler is done, the QLDB export is ready to be queried using Amazon Athena.
The Glue Workflow should take approximately 10 minutes to complete.
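If you’d rather monitor the run from code than watch the console diagram, a small polling helper captures the idea. This is a dependency-free sketch: `get_status` stands in for a real boto3 call (`glue.get_workflow_run(...)["Run"]["Status"]`), which is an assumption about how you would wire it up, not part of this lab.

```python
# Sketch of polling a Glue workflow run until it reaches a terminal state.
# In real use, get_status would wrap boto3, e.g.:
#   glue = boto3.client("glue")
#   get_status = lambda: glue.get_workflow_run(Name=name, RunId=run_id)["Run"]["Status"]
import time

def wait_for_workflow(get_status, poll_seconds=30, max_checks=40):
    """Poll a status callable until the workflow run finishes or we give up."""
    for _ in range(max_checks):
        status = get_status()
        if status in ("COMPLETED", "STOPPED", "ERROR"):
            return status
        time.sleep(poll_seconds)  # the lab's run takes roughly 10 minutes
    raise TimeoutError("Glue workflow run did not finish in time")
```

With 30-second polls and 40 checks, the helper waits up to 20 minutes, comfortably past the ~10 minutes this workflow needs.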
This exercise performs a very basic transformation: conversion of data from Amazon Ion format to Parquet. Customers can perform much more sophisticated transformations by modifying the Python or Scala scripts in their Glue job. The Optional: QLDB Export and PySpark section of this lab introduces writing custom PySpark scripts against the QLDB export.
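To give a flavor of what such a custom transformation might do, here is a dependency-free sketch of flattening nested export documents into flat rows suited to a columnar format like Parquet. A real Glue job would operate on DynamicFrames in the managed Spark environment rather than plain dicts; the record shown is invented example data, not from the lab’s export.

```python
# Plain-Python sketch of a transformation a custom Glue script might perform:
# flattening nested QLDB-style documents into single-level rows. A real Glue
# job would do this with DynamicFrames/Spark; dicts stand in here.

def flatten(doc, prefix=""):
    """Flatten a nested document into one level of dotted column names."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

# Invented example record: nested owner fields become flat columns.
record = {"id": 1, "owner": {"first": "A", "last": "B"}}
print(flatten(record))
# {'id': 1, 'owner.first': 'A', 'owner.last': 'B'}
```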
Return to the main page of the AWS console by clicking the AWS logo in the upper-left corner of any console page. Navigate to S3 by finding it under Storage in the list of services, or by typing S3 in the search bar at the top.
In the list of buckets, click on the bucket created in the CloudFormation stack above. If you’re not sure of the name of the bucket that you created, check the Outputs section of the CloudFormation stack you created, as described above.
Click through the directories in the Glue job output bucket and explore the files created by the job.
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need.
S3 Select is an easy option for quickly inspecting the files created by the Glue job.
From the S3 console, select the check box next to a Parquet file that was just created and open the Actions drop-down menu. Click Query with S3 Select. In the Input settings, make sure Apache Parquet is selected; select JSON for the Output settings. Now, in the SQL query block, click Run SQL query.
In the Query results block, you can now see the converted QLDB export data in human-readable JSON.
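The same inspection can be done through the API. The sketch below builds the parameters a boto3 `select_object_content` request would take, mirroring the console’s Input and Output settings above; the bucket name and object key are placeholders, and the expression shown is the console’s default S3 Select query.

```python
# Hypothetical sketch: the console's S3 Select query expressed as API
# parameters. Bucket and key are placeholders, not values from this lab.
params = {
    "Bucket": "your-glue-output-bucket",  # placeholder bucket name
    "Key": "path/to/part-00000.parquet",  # placeholder object key
    "Expression": "SELECT * FROM s3object s LIMIT 5",
    "ExpressionType": "SQL",
    "InputSerialization": {"Parquet": {}},   # matches the Apache Parquet input setting
    "OutputSerialization": {"JSON": {}},     # matches the JSON output setting
}

# With boto3 and credentials configured, the call would be:
#   boto3.client("s3").select_object_content(**params)
print(params["Expression"])
```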