First Look: AWS Glue DataBrew

Introduction

This is a post about a new vendor service which blew up a blog series I had planned, and I’m not mad. With a greater reliance on data science comes a greater emphasis on data engineering, and I had planned a blog series about building a pipeline with AWS services. That all changed when AWS released DataBrew, which is a managed data profiling and preparation service. The announcement is at https://aws.amazon.com/blogs/aws/announcing-aws-glue-databrew-a-visual-data-preparation-tool-that-helps-you-clean-and-normalize-data-faster/, but the main thing to know is that DataBrew is a visual tool for analyzing and preparing datasets. It’s powerful without requiring a lot of programming. Despite its ease of use and numerous capabilities, DataBrew will not replace data engineers; instead, DataBrew will make it easier to set up and perform a great deal of the simple, rote data preparation activities, freeing data engineers to focus on the really hard problems. We’ll look into use cases and capabilities in future blog posts. Spoiler alert: we’re still going to need that pipeline I was going to write about, just more streamlined. I’ll lay out the updated series in future posts.

DataBrew is not a stand-alone service, but is instead a component of AWS Glue. This makes sense, since it adds a lot of missing capabilities to Glue and can take advantage of Glue’s job scheduling and workflows. Some of what I was planning to write involved Glue anyway, so this is convenient for me.

In this “First Look” post I’m working my way through the DataBrew screens as you first encounter them, so if you have an AWS account, it might be useful to open DataBrew and move through the screens as you read. No worries if you don’t; I’ll cover the features in more depth in future posts.

DataBrew Overview

There are four main parts of DataBrew: Datasets, Projects, Recipes and Jobs. These are just where we start; there is a lot of ground to cover.

DataBrew parts

Datasets

DataBrew can work directly with files stored in S3, or via the Glue catalog to access data in S3, Redshift or RDS. If you’re using Lake Formation, it appears DataBrew (since it is part of Glue) will honor the AuthZ (authorization) configuration. Exactly how this works is a topic for future exploration.

If you’re connecting directly to S3, DataBrew can work with CSV, parquet, and JSON files. At the time of writing, XML is not supported, so you’d need to do a conversion upstream in a Lambda or Spark job. One cool feature is the ability to create parameterized paths to S3, even using a regex. This isn’t available through the Glue Catalog, only when connecting directly to S3. I work with a lot of files which have a date stamp as part of the filename, so this will be helpful.

DataBrew datasets
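
If you prefer to script dataset registration rather than click through the console, the same operation is exposed through the API. Here is a minimal sketch using the boto3 databrew client; the bucket, key, and dataset names are placeholders I made up for illustration.

```python
import boto3

# DataBrew has its own client in boto3.
databrew = boto3.client("databrew")

# Register a CSV file sitting in S3 as a DataBrew dataset.
# The bucket, key, and dataset names are placeholders.
response = databrew.create_dataset(
    Name="orders-raw",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-landing-bucket",
            "Key": "orders/orders-2020-11-11.csv",
        }
    },
)
print(response["Name"])
```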

Projects

Holey moley there’s a lot of stuff here! The Projects screen is where the real action is, and we’ll spend a lot of time here in the future.

DataBrew Projects

Sample View

As we explore the Sample View, it’s important to keep in mind that DataBrew is meant for actual data preparation work, not just lightweight profiles. This sample view is kept to a small window so we can explore the effects of transformations and monitor their impact on quality.

The majority of this page is taken up with a sample of the dataset and some lightweight profiling, including each column’s type, the number of unique values, the most common values, and the first few rows of the sample. The sample size and position in the set can be changed. This sample view is a great way to test transformations and enrichments, which we’ll look into later.

DataBrew Sample

The profile view can be changed to explore the schema, which is inferred from CSV and JSON files or taken from the metadata in parquet files or the Glue Catalog.

DataBrew Schema

The third profile view is correlations and summaries. If you’ve run several profiles, the history is available to browse. The “missing cells” statistic is something we will revisit for the dataset I have loaded here. Also, for my sample dataset, the correlation isn’t that interesting because the majority of the columns are part of an address, so they should correlate. But in other datasets, this could be really interesting.

DataBrew Profile Overview

The profile view also has data insights into individual columns, showing several quality metrics for the selected column.

DataBrew Column Stats

Transformations

DataBrew currently has over 250 built-in transformations, which AWS confusingly calls “Recipe actions” in parts of its documentation.

DataBrew Transformations

The transformations are categorized in the menu bar above the profile grid. They include removing invalid values, removing nulls, flagging columns, replacing values, joins, aggregates, splits and more. Most of these should be familiar to a data professional. With a join you can enrich one dataset by joining it to other datasets.
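
To make the “recipe action” terminology concrete: each transformation you apply is recorded as an action with an operation name and a parameters map. A rough illustration of the shape follows, as you would see it in a downloaded recipe or pass it to the API; the operation and parameter names here are assumptions to be checked against the recipe actions reference.

```python
# One recipe step: an action (operation plus parameters) and, optionally,
# condition expressions that limit which rows it applies to.
# Operation and parameter names are illustrative; verify them against the
# DataBrew recipe actions reference.
rename_step = {
    "Action": {
        "Operation": "RENAME",
        "Parameters": {
            "sourceColumn": "cust_nm",        # placeholder source column
            "targetColumn": "customer_name",  # placeholder new name
        },
    }
}
```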

Recipes

When we’re in the Projects tab and we apply a transformation to a column, we’re creating a recipe step. One or more recipe steps form a recipe, and there isn’t a published maximum number of recipes per dataset. Since each recipe can be called by a separate job, this provides a great deal of flexibility in our data prep processes. Recipe steps can only be edited on the Projects tab; the Recipes tab lists the existing recipes, allows recipes to be downloaded and handles some other administrative tasks. Recipes can be downloaded and published via the console or the CLI, providing a rudimentary way to share them across datasets.

DataBrew Recipes Tab
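
The Recipes tab’s actions are available programmatically as well. Here is a hedged boto3 sketch that should roughly mirror the console’s create, publish, and download actions; the recipe name, column names, and file name are placeholders, and the step reuses the shape shown above.

```python
import json
import boto3

databrew = boto3.client("databrew")

# A minimal one-step recipe; the operation and parameter names are
# illustrative and should be checked against the recipe actions reference.
step = {
    "Action": {
        "Operation": "RENAME",
        "Parameters": {"sourceColumn": "cust_nm", "targetColumn": "customer_name"},
    }
}
databrew.create_recipe(Name="orders-cleanup", Steps=[step])

# Publishing snapshots the working version as a new published version.
databrew.publish_recipe(Name="orders-cleanup", Description="First published version")

# "Download" the recipe by describing it and saving its steps as JSON,
# which can then be checked into source control by hand.
recipe = databrew.describe_recipe(Name="orders-cleanup")
with open("orders-cleanup.recipe.json", "w") as f:
    json.dump(recipe["Steps"], f, indent=2, default=str)
```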

Opening a recipe brings up summaries of the recipe’s versions, and the other tab on this page opens up the data lineage for the recipe. This lineage is not the data lineage through your enterprise, just the pathway through the recipe. My simple example here isn’t that impressive, but if you build a more complex flow with joins to other datasets and more recipes, this will be a nice view. Although you can preview the datasets and recipes at the various steps, this is not a graphical workflow editor.

DataBrew Lineage

This is also a convenient screen to access CloudTrail logs for the recipes.

Jobs

There are two types of jobs in DataBrew: “recipe” and “profile”.

DataBrew Job Types

A profile job examines up to 20,000 rows of data (more if you request an increase). The results of a profiling job include:

  • data type
  • unique values count
  • missing values count
  • most common values and occurrences
  • minimum and maximum values with occurrences
  • percentiles, means, standard deviation, median
  • and more…

One feature missing in the profiling is determining the pattern or length of text values. The profiling results are in JSON format, can be saved to S3, and there is an option to create a QuickSight dataset for reporting. Anything more than QuickSight will require some custom processing of the JSON output. Although it took this long in a blog post to discuss profiling jobs, a profile is really something which should be created before building recipes.
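
Profile jobs can also be created and started outside the console. A minimal boto3 sketch follows; the dataset name, output bucket, and role ARN are placeholders, and the role is assumed to have the necessary DataBrew and S3 permissions.

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job that writes its JSON results to S3.
# Dataset name, bucket, and role ARN are placeholders.
databrew.create_profile_job(
    Name="orders-profile",
    DatasetName="orders-raw",
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
    RoleArn="arn:aws:iam::123456789012:role/databrew-service-role",
)

# Kick off a run and keep the run id for later status checks.
run = databrew.start_job_run(Name="orders-profile")
print(run["RunId"])
```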

A recipe job configures a published recipe to be run against a selected dataset. In a Dataset job we choose the dataset, recipe and recipe version we want to use.

DataBrew Dataset Job

The other recipe job option is a Project job, which uses a saved project defined on the Projects tab. In this job, the only thing we need to configure is the project.

DataBrew Project Job

The original dataset is not modified in DataBrew; instead, we configure the S3 location, output file format, and compression for storing the results.

DataBrew Output File Type DataBrew Output Compression

The output can be partitioned on a particular column, and we can choose whether to overwrite the files from the previous run or keep each run’s files. Please use encryption.

DataBrew Output Partitioning
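
The same output choices appear as parameters when a recipe job is created programmatically. A sketch with boto3; the bucket, role ARN, recipe version, and partition column are placeholders, and the format and compression values are worth double-checking against the API reference.

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job that runs a published recipe version against a dataset
# and writes partitioned, compressed Parquet to S3. Names and ARNs are
# placeholders for illustration.
databrew.create_recipe_job(
    Name="orders-cleanup-job",
    DatasetName="orders-raw",
    RecipeReference={"Name": "orders-cleanup", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/databrew-service-role",
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output", "Key": "orders-clean/"},
            "Format": "PARQUET",
            "CompressionFormat": "SNAPPY",
            "PartitionColumns": ["order_date"],  # placeholder column
            "Overwrite": True,
        }
    ],
)
```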

Once configured, jobs can be scheduled. You can have a maximum of two schedules per job. If you need more than two schedules you’ll need to create an identical job.

DataBrew Job Schedule
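
Schedules can be scripted too. A minimal sketch, assuming the recipe job from the previous snippet; the cron expression is just an example (daily at 06:00 UTC).

```python
import boto3

databrew = boto3.client("databrew")

# Attach a daily schedule (06:00 UTC) to the recipe job created earlier.
databrew.create_schedule(
    Name="orders-cleanup-daily",
    JobNames=["orders-cleanup-job"],
    CronExpression="cron(0 6 * * ? *)",
)
```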

Either type of job can be run on a schedule, on-demand or as part of other workflows (see “Jobs Integrations” below). There is only one recipe and one dataset per job, so processing multiple recipes and/or multiple datasets would require additional workflow.

Jobs Integrations

Aside from the console or a schedule, how else can a DataBrew job be started? For starters, the DataBrew API exposes all the functionality in the console, including running a job. When coupled with Lambda functions, this provides a great deal of flexibility in starting jobs.
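
For example, a Lambda function triggered by an S3 put event could kick off a DataBrew job whenever a new file lands. A rough sketch follows; the job name is a placeholder, and the function’s execution role would need permission to call databrew:StartJobRun.

```python
import boto3

databrew = boto3.client("databrew")

def handler(event, context):
    # Triggered by an S3 put event; log the new object keys, then start the
    # DataBrew job. The job name is a placeholder.
    for record in event.get("Records", []):
        print("New object:", record["s3"]["object"]["key"])
    run = databrew.start_job_run(Name="orders-cleanup-job")
    return {"RunId": run["RunId"]}
```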

A second option is to use a Jupyter notebook (vanilla Jupyter, not SageMaker notebook yet) and the plugin found at https://github.com/aws/aws-glue-databrew-jupyter-extension.

Source Control Integration

Recipes and jobs have a form of versioning, but it seems closer to S3 object versioning than a real source control workflow: a new version is created with every published update.

DataBrew Publish Recipe

DataBrew Recipe Versions

However, as with most of AWS’s online editors, there is no direct source control integration. The best you can do is to download recipes and jobs as JSON and check them in manually. Better than nothing but still surprising since AWS has CodeCommit.

Infrastructure as Code

At this time, neither Terraform nor Pulumi supports DataBrew, but CloudFormation can be used to script DataBrew; see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_DataBrew.html for the resource documentation and examples. The CLI is another scripting option; its documentation is at https://awscli.amazonaws.com/v2/documentation/api/latest/reference/databrew/index.html.

Cloud Native AWS Development On A Chromebook Using VS Code

Whew! That’s a lot to unpack there. Let’s start in the middle, with the Chromebook. Chromebooks are not widely known as development machines. They are limited in RAM, persistent storage and processor speed. The OS has limited application support, especially for applications which rely on arbitrary code execution, like most IDEs. On the other hand, they have great battery life and a small size, and are usually easy to procure (except in the current return-to-school COVID shortage).

The newer Chromebooks (see https://sites.google.com/a/chromium.org/dev/chromium-os/chrome-os-systems-supporting-linux for a list of supported devices) have an interesting beta feature: the ability to install a Linux virtual machine. Termed Crostini, this is a supported way to install a Linux distro (Debian stretch). This opens up all kinds of possibilities, especially since the user’s home folder of the VM is mounted as local storage in Chrome OS.

Microsoft’s VS Code is a fantastic free cross-platform code editor, and can be installed via the Linux command line. Not to give away any spoilers, but VS Code (and its extensions) runs seamlessly as if it is a native application even though it’s installed in the Linux VM. There is a rich ecosystem of extensions, a few of which will be useful for us.

Cloud native development is more like infrastructure configuration and control; the definition is closer to “we’re connecting and orchestrating AWS’s services, rather than coding everything from scratch”. This is why you sometimes hear the term “infrastructure as code” (IaC): we use a configuration language to write instructions directing the activities of AWS’s services, deploy those instructions to the service, and then test our IaC. This works in our favor because we can write the code locally and deploy it via AWS’s command line interface. Fortunately, the AWS CLI tooling can also be installed and run on the Linux VM, giving us a way to deploy our code from a Chromebook. AWS’s infrastructure language is CloudFormation, and VS Code has several CloudFormation extensions, including a template library, to facilitate our work. When a custom coded application is needed, the serverless application model (SAM) is used; the SAM pattern is small services with an API interface, written in a cross-platform language and deployed in a container.

It’s sounding more promising that we can use a Chromebook for AWS development, isn’t it? One last thing you’ll need is an AWS account, so get one if you don’t have one already. There is a free tier, and everything I’ll be doing will use services or service limits which should be “free forever”.

Here’s how I set up my development environment. By the way–we’ll be using the Linux command line interface (CLI), which I am still learning also. If you’re not familiar, I’ve found this book to be invaluable: http://linuxcommand.org/.

  1. Install Crostini

The Crostini beta makes this step as simple as enabling the feature and waiting for several minutes. At this time, the VM is loaded on the main drive and consumes a lot of space, so depending on what else you want to do on your machine, you may need to minimize the number of other things you install or save on the main drive.

The installation instructions are at https://support.google.com/chromebook/answer/9145439. Make sure you read the entire page, since there is important information beyond just installation.

After you’re done, make sure to open the Terminal and run “sudo apt-get update” and “sudo apt-get upgrade” to make sure you have the latest patches.

  2. Install VS Code

Here we are in the CLI! Microsoft’s instructions can be found at https://code.visualstudio.com/docs/setup/linux#_debian-and-ubuntu-based-distributions. Crostini runs a newer Debian-based distro, so it’s only a couple of simple steps.

There is an easier way to install, though. You can download the .deb package from the official site, then right-click the package and choose “Install with Linux (Beta)” from the menu. The installer will run for a minute or two, and presto! VS Code will be installed.

install VS Code from .deb package

  3. Add CloudFormation Extensions

A moment of truth! Open VS Code, and if it runs, great job! Now it’s time for an extension to make our development easier, the CloudFormation extension, https://marketplace.visualstudio.com/items?itemName=aws-scripting-guy.cform.

CloudFormation instructions are written as templates, in either JSON or YAML; a deployed template is called a stack (and a group of stacks across accounts or regions is a stack set). Templates are also difficult to write correctly from scratch. The starter templates in this extension will make our lives much, much easier.

  4. Install and Configure the AWS CLI v2 for Linux x86

AWS provides us a couple ways to administer its services. You can interact via the web-based console but ultimately we want to get to a higher degree of repeatability and automation in our deployment processes, and the CLI gets us a little closer. Installation instructions are at https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html, and configuration is the next section.

Once installation is complete, you can open the Terminal and run “aws --version”. If that works, you’re good to go with configuration. The most important configuration is the access keys the CLI will use to connect to AWS service APIs. Although you can use your root user login, it’s not a good idea. Instead, you should create a new IAM user for the CLI. This is done in the AWS console, and there are a lot of steps in creating a new IAM user, but you can’t really screw anything up. Worst case, delete the user and start over. In step 2 of the wizard (image below) you’re asked to put the new user in a group or attach policies directly. For now, it will be easiest if you attach the AdministratorAccess policy directly. We can always titrate this down later, but there are 580+ policies to sort through and we don’t know which ones we need yet.

user wizard step 2

Once your new IAM user is created, you’ll need to follow the configuration for credential file settings at https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html. That’s all we need for now.
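
If you plan to script AWS from Python as well, a quick way to confirm the credentials file is being picked up is a short STS call with boto3; this assumes you’ve installed boto3 with pip, which isn’t part of the steps above.

```python
import boto3

# Uses the same ~/.aws/credentials file the CLI configuration just created.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```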

  5. Install git

Source control is always a best practice. VS Code, the extensions and the SDKs are all cross platform, so if you install the Windows or Mac bits on another machine, you can easily transfer your code from machine to machine with version history and backup.

Additionally, we can use GitHub Actions to deploy our code, but that’s for a future post. I plan to start with the CLI.

Installation instructions for Linux are at https://git-scm.com/download/linux.

Notes:

The crouton project can reportedly be used to install Ubuntu side-by-side with Chrome OS (see https://ubuntu.com/tutorials/install-ubuntu-on-chromebook#1-overview), but this is not a supported installation. It would provide a more native Linux experience, though.

Deeper info about Crostini is available at https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md.

Feedback? Find me on Twitter at @rj_dudley.