Data Prep with AWS Glue DataBrew

Scenario

Now that we’ve had our first look at AWS Glue DataBrew, it’s time to try it out with a real data preparation activity.  After nearly a year of COVID-19, and several rounds of financial relief, one interesting dataset is the loan data from the SBA’s Paycheck Protection Program (PPP).  As with all government programs, there is a great deal of interest in how the money was allocated.  The PPP was a decentralized program, with local banks approving and disbursing funds.  The loan data was released in several batches, and early indications are that it’s a bit of a mess, making it difficult for groups without a data prep function to analyze.  As data professionals, we can help with that.

Setup

The most recent PPP data is available at https://www.sba.gov/funding-programs/loans/coronavirus-relief-options/paycheck-protection-program/ppp-data.  I downloaded the Nov 24, 2020 dataset and uploaded the files to an S3 bucket.

Our work with DataBrew begins, as many things in AWS do, by creating a service-level IAM role and granting it permission to our data, as documented at https://docs.aws.amazon.com/databrew/latest/dg/setting-up-iam.html.
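
If you prefer to script that setup rather than click through the IAM console, a minimal boto3 sketch might look like the following.  The role name, policy name, and bucket are placeholders, and this only covers the S3 access piece; follow the setup guide above for the full set of permissions DataBrew needs.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the DataBrew service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "databrew.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="databrew-ppp-role",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
print(role["Role"]["Arn"])  # we'll need this ARN for datasets and jobs later

# Inline policy granting access to the bucket holding the PPP files.
# Job output locations also need write access (s3:PutObject).
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-ppp-bucket",    # placeholder bucket
            "arn:aws:s3:::my-ppp-bucket/*",
        ],
    }],
}

iam.put_role_policy(
    RoleName="databrew-ppp-role",
    PolicyName="databrew-ppp-s3-access",
    PolicyDocument=json.dumps(s3_policy),
)
```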

After we’ve uploaded our data and given DataBrew permission, it’s time to create a Dataset, which is basically a pointer to the data files we’ll be using.  We’ll need a Dataset for every different batch of data we want to use.

image
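
Datasets can also be created programmatically, which helps when a new batch of loan data is released.  Here’s a rough boto3 sketch; the dataset name, bucket, and key are placeholders for wherever you uploaded the PPP CSV files.

```python
import boto3

databrew = boto3.client("databrew")

# A Dataset is just a named pointer at files in S3; it doesn't copy any data.
databrew.create_dataset(
    Name="ppp-loans-nov-2020",      # placeholder name for this batch
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-ppp-bucket",    # placeholder bucket
            "Key": "ppp-data/loans.csv",  # placeholder key for one of the files
        }
    },
)
```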

Initial Profiling

The first thing I like to do when I get an unknown dataset is profile as much of the data as I can.  With DataBrew, I can easily set up a Profile Job to gather statistics about the entire dataset.  To start, we navigate to Jobs >> Profile Jobs >> Create Job.  The configuration looks like the image below.

image
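
The equivalent API call looks roughly like this; the job name, output location, and role ARN are placeholders, and the output location is simply where DataBrew writes the profile results.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="ppp-profile-job",            # placeholder job name
    DatasetName="ppp-loans-nov-2020",  # the Dataset created earlier
    RoleArn="arn:aws:iam::123456789012:role/databrew-ppp-role",  # placeholder ARN
    OutputLocation={
        "Bucket": "my-ppp-bucket",     # placeholder bucket
        "Key": "profile-output/",      # where the profile results land
    },
)
```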

The profiling job takes a little over one minute to run, since DataBrew will profile a maximum of 20,000 rows even if you select “Full dataset” (you can request a limit increase).  Once complete, we can choose “View data profile”, then the “Column Statistics” tab, to check for completeness, type and validity.
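
If you created the job through the API, starting it and checking on it is just as simple.  This sketch reuses the job name from above.

```python
import boto3

databrew = boto3.client("databrew")

# Kick off the profile job and check its status.
run = databrew.start_job_run(Name="ppp-profile-job")
status = databrew.describe_job_run(Name="ppp-profile-job", RunId=run["RunId"])
print(status["State"])  # e.g. RUNNING, then SUCCEEDED after a minute or so
```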

Most of the columns are 100% valid, which would be fantastic if true, although I suspect unknown values may be represented by a value that DataBrew does not recognize as “unknown” or “invalid”.  We’ll need to investigate further.  Also, ZIP Code was identified as a numeric column, which is a very common mistake made by data profilers.  Many US ZIP Codes start with zero, and need to be treated as strings in order to retain that leading zero.

image

State claims to be 100% valid, so let’s take a look at the values.  Of the 20,000 records profiled, all were in Kansas.  Deep sigh.  We’re going to need to try a random sample somehow.

image

Cities are where the fun usually begins.  Looking at the Top 50 values, we see that the casing is inconsistent, and DataBrew treats “OLATHE” and “Olathe” as different values; we see the same with “LAWRENCE” and “Lawrence”.  That’s something we can try to fix in our data prep.  Trivia note: “Pittsburg” is spelled correctly here; only Pittsburgh, PA has the “h” at the end.

image

Random Sampling

That’s a good start, but let’s see what else we can find with a random sample.  To do a random sample, we need to create a project, using the same dataset, and configure the sampling to be 5000 random rows.

image
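
Projects and their sampling configuration can also be created with boto3.  A rough sketch is below; it assumes a recipe named ppp-cleanup already exists to attach the project to (the console takes care of creating a new recipe for you), and the names and role ARN are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_project(
    Name="ppp-random-sample",          # placeholder project name
    DatasetName="ppp-loans-nov-2020",  # the Dataset created earlier
    RecipeName="ppp-cleanup",          # assumed to already exist when using the API
    Sample={"Type": "RANDOM", "Size": 5000},  # 5,000 random rows instead of the first N
    RoleArn="arn:aws:iam::123456789012:role/databrew-ppp-role",  # placeholder ARN
)
```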

After the sampling is complete, we’re taken to the projects tab, where we can review the sampled data.  Right away we can see that “Not Available” and “N/A” are very common answers, and we need to work with our business partners to decide if these are values we want to count, values we want to convert to a different value, or if we want to count them as invalid results.

1-6a-sample

Looking at some of the ZIP Codes, we can see that the column was profiled as a number, and some of the MA ZIP Codes lost their leading 0.  We’ll need to change the column type and put the leading 0 back using a transformation.

1-8a-sample

Looking at the State column, the random sampling did improve the sample somewhat—we now have 5 states represented instead of just one.

1-8b-sample

Recipe

Now that we have a couple of columns that need a transformation, and a decent random sample, it’s time to create our first recipe.  We’ll clean up both the ZIP Code and City columns and let our business users work with the data while we look for additional transformations.

ZIP Code

Since ZIP Code was incorrectly typed as a numeric column, we need to correct this before we produce an output for our users.  This means we need to re-type the column as a string and pad the leading zero where it was stripped off.

To change the type of the column, click on the “#” next to the column name and choose “string”.  This will add a recipe step to convert the type of this column, but will not replace the leading zero.

image

In order to replace the leading 0, we can rely on the old trick of prefixing every value with a 0 and then taking the right five characters to recreate the full ZIP Code.  This is a two-step process in our recipe.  First, we pad all values with 0 by activating the Clean menu, selecting “Add prefix…”, then entering a prefix text of 0.

image

This prefixing will be applied to all values, which will make most ZIP Codes six characters long.  To fix this, we take the right five characters by activating the Functions menu, selecting Text Functions, then Right.

image

This operation will create a new column, which by default is labeled “Zip_RIGHT”, and we configure the number of characters to keep.

image

And when we preview the change, this is how it looks.

image
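
Outside of DataBrew, the prefix-then-right-five trick is easy to sanity-check.  Here is a small pandas sketch on made-up values (zfill would accomplish the same thing in one step):

```python
import pandas as pd

# Hypothetical sample: the ZIP column was read as a number, so a
# Massachusetts code like 02134 came through as 2134.
df = pd.DataFrame({"Zip": [2134, 66044, 15201]})

# Same logic as the recipe: cast to string, prefix a "0", keep the right five characters.
df["Zip"] = "0" + df["Zip"].astype(str)
df["Zip_RIGHT"] = df["Zip"].str[-5:]

print(df["Zip_RIGHT"].tolist())  # ['02134', '66044', '15201']
```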

City

As we saw in the profile results, city names appear in both mixed case and all uppercase, which causes miscounts by city.  We need to standardize the capitalization to alleviate this.  For our needs, it doesn’t matter whether we use all uppercase or not, just as long as we’re consistent.  I’ll use proper case because it doesn’t look like I’m being yelled at.  We activate the Format menu (either from the menu bar or from the ellipses menu), then choose “Change to capital case”.

image

We can then see an example of what each formatting option will do.  Capital case is the closest option to how most city names are capitalized.  It’s not perfect, but it’s consistent, and we’d need an address verification system to do better.  This option changes the value in the original column; it does not create a new column.

image

We can even preview the changes, and see how ANCHORAGE and Anchorage are now combined into a single value of Anchorage.

image
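
Just to illustrate the effect of this step outside of DataBrew, here is a small pandas sketch with made-up values showing how standardizing the casing collapses the duplicate counts:

```python
import pandas as pd

# Hypothetical sample showing the mixed-casing problem from the profile.
df = pd.DataFrame({"City": ["OLATHE", "Olathe", "LAWRENCE", "Lawrence"]})

print(df["City"].nunique())   # 4 distinct values before standardizing

# Roughly what the capital case step does: one casing, so values are counted together.
df["City"] = df["City"].str.title()

print(df["City"].value_counts())  # Olathe: 2, Lawrence: 2
```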

Our recipe now looks like this, which is good enough for now.

image

Publishing and Using the Recipe

In order to run this recipe against the full dataset, or to run it again, we need to publish it and then create a Job.  From the Recipe panel, we click the Publish button.  Recipes are versioned, so in the pop-up we add some version notes and Publish.  Once Published, we can use it in a Job.

image
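
Publishing can also be scripted, which is handy if you want to manage recipes as code.  The recipe name here is a placeholder:

```python
import boto3

databrew = boto3.client("databrew")

# Publishing creates a new numbered version of the recipe (1.0, 2.0, ...).
databrew.publish_recipe(
    Name="ppp-cleanup",  # placeholder recipe name
    Description="Fix ZIP Code leading zeros and standardize city casing",
)
```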

I covered Jobs in detail in the First Look: AWS Glue DataBrew post, so here is just how I configured this job:

image
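
For reference, roughly the same job defined with boto3; the names, role ARN, output location, and recipe version are all placeholders:

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="ppp-cleanup-job",            # placeholder job name
    DatasetName="ppp-loans-nov-2020",  # the Dataset created earlier
    RecipeReference={
        "Name": "ppp-cleanup",         # the recipe published above
        "RecipeVersion": "1.0",
    },
    Outputs=[{
        "Location": {"Bucket": "my-ppp-bucket", "Key": "cleaned/"},  # placeholder output
        "Format": "PARQUET",
    }],
    RoleArn="arn:aws:iam::123456789012:role/databrew-ppp-role",  # placeholder ARN
)

# Run the job against the full dataset.
databrew.start_job_run(Name="ppp-cleanup-job")
```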

Parquet is a great storage format: it has a schema, it’s compact, it’s columnar for performant queries, and it’s native to many of AWS’s services.  Once the job has completed, how do we ensure it worked?  Simple: we use the output as a new Dataset and profile the results.  Viewing the results of the profile of the cleanup job, we can see the top 50 City names are all capital case.

image

Similarly, we can see the ZIP Codes are all 5 characters long and have the leading zero (fortunately, the profile job sampled New Jersey).

image

Congratulations, we can now start to make this data available to our users!  We know they’ll find more steps we need to add to our recipe as they begin to work with the data, but this is a great start.  Find me on Twitter @rj_dudley and let me know how you find DataBrew.

15 S3 Facts for S3’s 15th

s3-15th

To celebrate S3’s 15th birthday on 3/14/2021, and to kick off AWS Pi Week, I tweeted out 15 facts about S3.  Here they are as a blog post, to make them easier to read.  Because of the rapid pace of innovation in AWS services, including S3, some things may have changed if you’re reading this in the future.

1. S3 is designed for “eleven 9s” of durability. When you take into account redundancy in and across availability zones, in 10,000,000 years you’d lose only one of 10,000 objects. Read more at https://aws.amazon.com/blogs/aws/new-amazon-s3-reduced-redundancy-storage-rrs/.

2. S3 is region-bound, which means all S3 buckets in that region are partying in the same publicly available cloud ether. You can restrict access to a VPC but the bucket is still located outside the VPC. Related: https://cloudonaut.io/does-your-vpc-endpoint-allow-access-to-half-of-the-internet/.

3. S3 is a very versatile storage service. The trillions of objects it stores are the basis for many workloads, including serving websites, video streaming and analytics.

4. The return of INI files! With a first byte latency of milliseconds, S3 is suitable for storing configuration settings in an available and inexpensive way. Databases are no longer a fixed cost and there is no need for one just for configuration.

5. S3 is designed for “infinite storage”. Each object can be up to 5TB in size, and there is no limit to the number of objects you can store in a bucket. Analytics aren’t constrained by a file or disk size. It’s like a TARDIS, or bag of holding!

6. How do you perform operations on hundreds, thousands or more objects? S3 Batch Operations allow you to copy objects, restore from Glacier, or even invoke a Lambda function for each object. For more information, see https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/.

7. S3 is a “consumption model”, so you pay only for what you use when you use it. No more provisioning fixed-size network storage solutions with large up-front costs.

8. But what if you need massive object storage closer to your location? S3 on Outposts puts S3 on-premises, right where you collect or process your data. For more info, start at https://aws.amazon.com/s3/outposts/.

9. If your bandwidth is limited or non-existent, you can use Snowball Data Transfer to move TB to PB of data in and out of AWS. Learn more at https://aws.amazon.com/snowball/.

10. For data collection and object generation at the most extreme edges there is Snowball Edge Storage. Snowball Edge can even run processing workloads. Read more at https://docs.aws.amazon.com/snowball/latest/developer-guide/whatisedge.html.

11. Although you can upload files to S3 via the console, CLI and REST API, wouldn’t it be great if you could just drag a file to a network share and have it appear in the cloud? With a File Gateway, you can do exactly that! See https://aws.amazon.com/storagegateway/file/.

12. S3 offers multiple storage classes, so you can optimize cost, latency and retention period. Standard offers the lowest latency but at the highest cost, while Glacier Deep Archive is perfect for yearslong retention. Read more at https://aws.amazon.com/s3/storage-classes/.

13. S3 Storage Lens is a central dashboard organizations can use for insight into S3 utilization and to get recommendations for optimizing cost. Read more at https://aws.amazon.com/blogs/aws/s3-storage-lens/.

14. S3 can version objects, so if you accidentally delete or incorrectly overwrite an object, you can recover the most recent save, or any of many prior versions (see the sketch after this list).

15. S3 is a very secure service. IAM policies can be applied at the bucket and object level with a great deal of granularity. Additionally, VPC endpoints bind S3 traffic to a specific VPC only.
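
To go with fact 14, here is a small boto3 sketch of turning versioning on and restoring an older version of an object after a bad overwrite; the bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-important-bucket"  # placeholder bucket

# Versioning is enabled once per bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# After a bad overwrite, list the object's versions (returned newest first)...
versions = s3.list_object_versions(Bucket=bucket, Prefix="reports/q1.csv")
previous = versions["Versions"][1]["VersionId"]  # [0] is the current (bad) copy

# ...and copy the previous version back over the current one to restore it.
s3.copy_object(
    Bucket=bucket,
    Key="reports/q1.csv",
    CopySource={"Bucket": bucket, "Key": "reports/q1.csv", "VersionId": previous},
)
```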

And one to grow on (for everyone): AWS recently released three new S3 training courses: https://aws.amazon.com/about-aws/whats-new/2021/01/announcing-three-new-digital-courses-for-amazon-s3/.