Resolving Amazon Macie’s “A job with the name … has already been submitted with a different ‘clientToken’” Error

While doing some work with Amazon Macie and Terraform, I ran into this error message:

EXEC : error : creating Macie ClassificationJob: ResourceInUseException: A job with the name ‘Redacted Job Name’ has already been submitted with a different ‘clientToken’ [c:\src\redacted\path]
         status code: 400

This isn’t a very clear error for what’s really happening.  Macie jobs are immutable: you can’t change any property of a job, including the description (you can update the job_status, although the update may be ignored depending on the schedule).  Instead, to make whatever change you’re trying to make, you have to create a new job with a slightly different name and disable the old job.

Per AWS’s documentation at https://docs.aws.amazon.com/macie/latest/APIReference/jobs.html:

Note that you can’t change any settings for a job after you create it. This helps to ensure that you have an immutable history of sensitive data findings and discovery results for data privacy and protection audits or investigations that you perform.

Terraform is stateful, but it has to conform to the AWS API, so changing the name of the job creates a new job instead of renaming the existing one.  Be careful of typos (or so I’ve been told…).
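For completeness, here is a minimal boto3 sketch of the workaround described above, assuming the macie2 client; the job ID, job name, account ID and bucket are hypothetical, and whether you cancel or pause the old job depends on your retention needs:

import uuid
import boto3

macie = boto3.client("macie2")

# Disable the old, immutable job (CANCELLED is permanent; USER_PAUSED is reversible)
macie.update_classification_job(
    jobId="existing-job-id",
    jobStatus="CANCELLED",
)

# Create a replacement job with a slightly different name and a fresh clientToken
macie.create_classification_job(
    clientToken=str(uuid.uuid4()),  # a new token avoids the ResourceInUseException
    name="Redacted Job Name v2",
    jobType="ONE_TIME",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "123456789012", "buckets": ["example-bucket"]}
        ]
    },
)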

Side Project Chronicles, ep. 4: CLI/SDK

tl;dr

This post got a little long, so here are the learnings:

  1. The AWS SDKs, including boto3, as well as commercial IaC tools, all have Lightsail libraries.
  2. The access key ID and secret created in the Lightsail console do not grant CLI/SDK access, so to use an SDK you need to create a user via regular IAM.
  3. Lightsail libraries in SDKs are limited to only the functionality of the Lightsail console.
  4. Lightsail S3 buckets do not appear in the ListBuckets results from regular S3 components in the SDKs, but can be addressed directly using a regular S3 library if you know the bucket name.
  5. When using a regular S3 library, the functionality is again limited to what Lightsail supports; attempting an unsupported action returns an “Access Denied” exception.
  6. The async .NET SDKs for .NET 5+ do not implement all the methods found in the full .NET Framework versions.  I switched to boto3 and Python rather than install .NET 4.x to test ListBuckets and similar actions; see #4 for how that worked out.

The Full Adventure

The Lightsail console provides a lot of functionality, but it’s not easy to audit the changes we make using the console.  The console is a manual process and we have to remember to always check the same settings, which is why IaC is a best practice.  Based on our look at the S3 bucket, we know more is happening via Lightsail than we can see, and we assume good decisions are being made.  Something specific I’d like to check is whether objects are encrypted at rest.  Since a lot of automated compliance tooling uses the API or an SDK to check adherence to enterprise rules, we want to make sure we can use these to access the settings we’re interested in.  As it turns out, we have a number of options for SDK/CLI/IaC for Lightsail:

(SDKs are also available for several languages other than .NET)

I want to try the AWS SDK for .NET, since that’s my most native programming language.  The AWS SDK specifically for Lightsail is available via Nuget, which describes Lightsail as “[a]n extremely simplified VM creation and management service.”  Despite that outdated description, the SDK is current.

The Lightsail S3 buckets were not visible in the usual S3 console, so I wanted to see if they are visible to the CLI or SDK.  AWS has an example of how to list buckets with the SDK, at https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/dotnetv3/S3/ListBucketsExample/ListBuckets.cs

As it turns out, the bucket access keys we created in the Lightsail console do not grant permissions to use the CLI/SDK.  This is one instance where we need to use the normal AWS control panel rather than the Lightsail control panel, and create an IAM user with more privileged permissions (https://lightsail.aws.amazon.com/ls/docs/en_us/articles/amazon-lightsail-bucket-management-policies).

With a more privileged set of user credentials in place, we can run the AWS sample, and see that the Lightsail buckets are not listed in the response.  That makes sense, since we didn’t see them in the API, but it’s good to check.

If we know the name of the bucket, we can access it directly, but the actions we can perform are limited.  It was at this point I realized not all functions in the .NET Framework version have been implemented in the .NET version; instead of installing .NET 4.x, I switched to Python and boto3.  What I found is that, when using a regular S3 library, you can list_objects but not get_bucket_encryption.  get_bucket_encryption returns an Access Denied error, even when using credentials for the root user.
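To make the boto3 findings concrete, here is a rough sketch of what I ran; the bucket name is hypothetical, and the credentials are the IAM user’s, not the Lightsail bucket keys:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# The Lightsail bucket does not appear in the regular ListBuckets results
print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# ...but addressing it directly by name works for supported actions
bucket = "my-lightsail-bucket"
resp = s3.list_objects_v2(Bucket=bucket)
print([obj["Key"] for obj in resp.get("Contents", [])])

# Unsupported actions come back as Access Denied
try:
    s3.get_bucket_encryption(Bucket=bucket)
except ClientError as err:
    print(err.response["Error"]["Code"])  # AccessDenied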

To wrap this all up, you can use either a Lightsail SDK, or a regular S3 SDK, to work with Lightsail buckets.  Either way, functionality is limited to what Lightsail supports.  You’ll just have to take it on faith that AWS’ defaults are secure enough for your needs.  It’s unlikely policy scanning tools can detect or validate best practices on your Lightsail buckets.

Side Project Chronicles, ep. 3: Lightsail Bucket Storage

In ep. 2 we briefly looked at the Lightsail control panel and saw that we can create Bucket Storage. In this post, we’ll look deeper into Lightsail Bucket Storage.

To create a new bucket, we first choose a region, storage plan, and name the bucket. Since Lightsail is not available in all regions, there is a shorter list of regions to choose from than regular S3. As with S3, bucket names must be globally unique.


Once the bucket is created, we’re taken to a bucket management page with several tabs. The Getting Started tab welcomes us and guides us to some documentation for important settings.

The Objects tab lists the folders and objects in our bucket, and the properties of any we select. To add objects, we can upload an entire directory or a single file using an upload dialog, or drag and drop instead. Selecting an uploaded object shows the permissions, in addition to the size, type, tags and versions. Object tags can be set here as well.

By default, Lightsail buckets are private, and objects inherit these permissions. Private buckets can still be accessed from instances we attach, as well as from services and applications which can use access keys. Access keys are created on the Permissions tab, and cross-account access is also configured there.

The Metrics tab displays the storage consumption and a graph of storage growth, and lets us set alarms in case we get too close to our limits. Since Lightsail buckets do not appear in S3, their metrics do not appear in CloudWatch.

The Versioning tab is where we turn on versioning for objects stored in the bucket. Every version counts against the storage limit, so this is something to enable only if it’s needed, and if we have an alarm set.

We configure CloudWatch-like logs using the Logging tab. Since Lightsail buckets are not part of regular S3, their logs do not appear in CloudWatch. Instead, logs must be stored in a Lightsail bucket in the same account (see https://lightsail.aws.amazon.com/ls/docs/en_us/articles/amazon-lightsail-bucket-access-logs). This can be the same bucket as our objects, or a different bucket. It may take a couple of hours for logs to appear, but once they do, we can download them for analysis. A file may contain as little as one entry, and the entries are formatted like CloudWatch logs. It’s not very convenient to read or analyze logs this way, but there doesn’t appear to be a better option at this time. If you want to monitor access patterns, it looks like you’ll need to implement logging in your web application and keep the bucket completely locked down.

Since Lightsail Bucket Storage is based on S3, what do we see if we look at the S3 console? As it turns out, Lightsail buckets are not available via the S3 console. This means we have to manage buckets via the Lightsail CLI, or an application like S3 Drive, for which we’ll need access keys created in the Permissions tab. Using S3 Drive we can interact with the bucket just like any other removable storage, so we can transfer files and open them directly.

Lightsail Bucket Storage simplifies S3 and seems to have good security defaults, but the limited size and lack of CloudWatch integration make it suitable only for hosting web assets.

Series Contents

Side Project Chronicles, ep. 1: Hosting

Side Project Chronicles, ep. 2: Tour of the Lightsail Control Panel

Side Project Chronicles, ep. 3: Lightsail Bucket Storage (this post)

Side Project Chronicles, ep. 2: Tour of the Lightsail Control Panel

I apologize in advance, this is going to be a long post because of the screenshots. When I explore an AWS service, I like to look through the control panel to get an idea of what settings are important and available. This helps me learn the IaC options too. You don’t incur any charges just poking around the control panel, either, so it’s a good practice for any service.

The first thing you notice is how very different the Lightsail control panel is from the rest of AWS. It greets you like a wizard, and has a decidedly non-AWS UI. Lightsail services are their own distinct offering, but are built on top of other AWS services. This means that although the object storage is built on top of S3, access to your buckets should be through Lightsail endpoints and the billing is Lightsail pricing.

Clicking through the welcoming robot screen for the first time, you’re greeted with a page which has several tabs: Instances, Containers, Databases, Networking, Storage, Domains & DNS and Snapshots. I’ll talk about each of these tabs and their top-level options below. In later posts I’ll examine some of the more detailed settings as I set up each service.

Every tab has a link to an overview of that specific service, and it’s worth reading those.

Instances

Instances are the virtual private servers (VPSs) you’ve created. A VPS is a type of virtual machine, with the full instance of the OS installed, running in a multi-tenant environment. If you’ve created any VPSs, they will be listed on the main page.

To create an instance, you can choose the OS (Linux or Windows). With Linux you can stick with the base OS, or choose one of the prepackaged applications such as WordPress, GitLab, Joomla and more. As you select the OS or prepackaged application, the prices are displayed at the bottom of the page.

Windows instances offer either the base OS or SQL Server Express (2016 or 2019). Note that SQL Server Express runs as a Lightsail Instance and not as a Lightsail Database. Lightsail Instances manage EC2 and AMIs behind the scenes. Click on the images below for a larger view.

You can choose different instance sizes, and set a few options for both the instance and the prepackaged application. I’ll dig into one or two of these in future posts.

Containers

Lightsail Containers are built on ECS, and can use Docker containers from any public registry or pushed from your local machine. Access to these containers should be through either the Lightsail endpoints or a custom domain you configure in Lightsail.

Databases

Lightsail databases can be either MySQL or PostgreSQL (Lightsail does have an option for SQL Server Express hosted on Windows, but that is set up as a Windows Lightsail Instance, not as a Lightsail Database; see Instances above). There is a lot of documentation about database parameters, importing data, snapshots and so on. You can use your favorite database tool for managing your databases, but you have to put them into Public mode; it does not appear that SSH tunneling is an option at this time. You could probably set up another Instance with phpMyAdmin (or similar), and there is a cPanel option in Instances, but cPanel requires a paid license.

Networking

Networking is where you can configure a static IP, load balancers and a CDN. You can have up to five static IPs attached to instances at no cost. The load balancer supports both HTTP and HTTPS, but HTTPS requires you to obtain an SSL/TLS certificate via Lightsail (see https://lightsail.aws.amazon.com/ls/docs/en_us/articles/understanding-tls-ssl-certificates-in-lightsail-https).

Storage

Lightsail Storage is either Bucket (built on S3) or Disk (built on EBS). I will look deeper into Bucket in the next post. There does not appear to be an option to attach existing S3 buckets or EBS disks to a Lightsail application. In Bucket storage, 250GB is the maximum you can configure, although for an “overage fee” it looks like you can exceed this. That’s not a lot of space for what S3 gets used for in general, but for what we’re doing in Lightsail it should be plenty, and you can have more than one bucket. You can configure up to 16TB of Disk with the Custom option, but at $1 per 10GB that will run about $1,600/month.

Domains & DNS

Domains & DNS is where you can register a domain name and manage its nameservers. If you already have a domain name, you can use it and just configure the DNS zone. For domains registered elsewhere, you can use external nameservers, but it’s recommended to use Lightsail’s DNS.

If you register a domain name via Lightsail, the DNS zone is automatically configured. The TLDs available to register are listed at https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/registrar-tld-list.html#registrar-tld-list-index-generic, and the price varies based on the TLD.

Lightsail DNS is built on top of Route 53, but supports only the A, AAAA, CNAME, MX, NS, SRV and TXT record types. These are the most common record types for web applications. If you need other record types, you can use Route 53 instead. You can have up to 6 DNS zones (one per domain name) at no cost.

Snapshots

Snapshots are backups of Instances, and are configured on the Instances tab.

Summary

That’s the tour of the Lightsail control panel and some of the configuration pages. I’ll look deeper into some of these in future posts.

Series Contents

Side Project Chronicles, ep. 1: Hosting

Side Project Chronicles, ep. 2: Tour of the Lightsail Control Panel (this post)

Side Project Chronicles, ep. 3: Lightsail Bucket Storage

Side Project Chronicles, ep. 1: Hosting

It’s been a long time since I’ve written my own thing from the ground up. How long, you may ask? Since shared hosting was the only game in town other than colo. That’s a long time. Shared hosting was great. Someone else took care of the servers and networks and firewalls, you were only responsible for the code. I’ve missed tinkering around on side projects, no matter how ridiculous they become and even if they amount to nothing. So it’s time to tinker a little.

There are plenty of shared hosting services around, and they’re great (this blog is hosted on one, as are a few other sites I maintain). But I wondered: can I replicate the shared hosting model in AWS? Or get close? I’ve built some pretty big APIs and data services in AWS, but I’ve never really explored the website hosting aspects. As an AWS Community Builder (curious? Check out https://aws.amazon.com/developer/community/community-builders/), I get credits to do exactly this type of tinkering. I’d also like to use AWS because I may want to expand into trying other services. And IaC if possible, probably Pulumi since it also supports C# and Python. I know a tiny amount about Pulumi, but YOLO.

The stack I want to use is .NET Core on Windows or Linux (it doesn’t matter anymore with .NET), plus a relational database; I’m not partial to either MySQL or PostgreSQL, since SQL Server is very difficult to host properly in AWS. Again, I’m more interested in tinkering with the code than the hosting aspects, so I want as much managed as possible. Just like old times.

AWS has an overview of its website hosting options at https://aws.amazon.com/websites/. I won’t be using static pages, so I can eliminate S3 as an option. Since I want to use ASP.NET server-side scripting, that eliminates Amplify, so Amazon Lightsail it is. Of course, Lightsail can also run containers, which weren’t a thought until now.

Lightsail is not as well documented as other AWS services. There is a documentation mini-site at https://lightsail.aws.amazon.com/ls/docs/en_us/all, but also some courses on Cloud Academy (https://cloudacademy.com/search/?q=lightsail). I’ll have to check these out; Cloud Academy has good content, and AWS CBs have a complimentary subscription. The sparse documentation makes Lightsail a bit of an enigma, and makes me a little worried that this service is only half-baked, but hey, sometimes it’s about the journey.

Lightsail has a lot of cool capabilities. I’ll be running either a virtual private server (VPS) or a container. As with shared hosting services, it has some preconfigured applications like WordPress, GitLab, Plesk, Drupal, Ghost, Joomla and more (see https://aws.amazon.com/lightsail/features/). MySQL and PostgreSQL are available, plus block and object storage. The object storage is based on S3 but is simplified (ref. https://lightsail.aws.amazon.com/ls/docs/en_us/articles/buckets-in-amazon-lightsail). For scaling, there is a CDN and load balancers which can be added on later.

This is clearly going to be a series, posted at irregular intervals, but there is a lot to look forward to tinkering with in Lightsail.

Series Contents

Side Project Chronicles, ep. 1: Hosting (this post)

Side Project Chronicles, ep. 2: Tour of the Lightsail Control Panel

Side Project Chronicles, ep. 3: Lightsail Bucket Storage

Can a customer managed IAM policy override AWS managed IAM policies?

Introduction

Writing custom IAM policies can be difficult, especially when a job function uses a bunch of services.  AWS manages several IAM policies for particular job functions (such as data scientist), which are a great help, but what if we want to restrict access to certain services altogether, or to certain actions, or even to specific buckets?

A common pattern in lake house architecture is to have an S3 bucket of raw data, a process to tokenize/scrub the data of sensitive information, and then a “cleansed” bucket with cleansed data that can be used in analyses.  The AWS-managed DataScientist job role policy is complex, and we’d prefer to use that as our base policy but put additional restrictions on it.  The question became, can we simply attach an additional policy to a role and have it override some of the settings in the AWS-managed policy?  As it turns out, we can.

Tighter Restrictions

The first question we had was, can we make restrictions tighter than an AWS-managed policy by adding one of our own?  Here’s what I did.  I first created a user, with only AmazonS3FullAccess, which allowed me to access all objects in all buckets.  I then created the following policy and attached it as an inline policy to my test user.
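The policy itself was shown in a screenshot, but a minimal sketch of the kind of inline deny I attached might look like this (the bucket name and user name are hypothetical):

import json
import boto3

iam = boto3.client("iam")

deny_list_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-restricted-bucket",
        }
    ],
}

# Attach the statement as an inline policy on the test user
iam.put_user_policy(
    UserName="test-user",
    PolicyName="deny-list-restricted-bucket",
    PolicyDocument=json.dumps(deny_list_policy),
)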

The results were exactly what I wanted to see—no ability to list the objects in the bucket.
image

I repeated this experiment, but this time creating and attaching a customer-managed policy.  The result was the same: the user could list the bucket’s objects when my custom policy was not attached, and could not list the objects when the policy was attached.

Looser Restrictions

The second question we had was whether or not we could loosen restrictions in an IAM-managed policy by attaching one of our own.  To test this, I used the same user as above, but removed all policies, and then added AmazonS3ReadOnlyAccess.  Then, I confirmed a folder could not be created:

image

I then created a policy which allowed PutObject, attached it to the user, and confirmed I could now create a folder:

image

So again, a customer managed policy can override an AWS-managed policy.

Conflicting Policies

So then we wondered: what happens if you attach conflicting policies?  I attached both AmazonS3ReadOnlyAccess and AmazonS3FullAccess to my test user:
image

I could once again create a folder:
image

This isn’t surprising, since an explicit allow overrides the implicit (default) deny.  One final question we wanted to test was what happens with two explicit permissions, one allow and one deny, for the same action.  I created two policies, one which explicitly denied listing buckets and one which explicitly allowed it, attached them to the same user one at a time, and confirmed they worked as intended when attached individually.

image

image

When attached together, the explicit deny overrides the explicit allow.

image

Conclusion

Customer-managed policies can override actions that are only implicitly allowed or denied by AWS-managed policies.  This means we can make use of the complex AWS-managed IAM policies and still have the ability to make some modifications when needed.

AWS describes the order of evaluation at https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_evaluation-logic.html.  The results here are in line with the logic described—we could allow an action which wasn’t explicitly denied, but an explicit deny took precedence over an explicit allow.
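If you want to confirm the effective decision without clicking through the console, the IAM policy simulator API is handy; here is a quick boto3 sketch (the ARNs are hypothetical):

import boto3

iam = boto3.client("iam")

result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:user/test-user",
    ActionNames=["s3:ListBucket"],
    ResourceArns=["arn:aws:s3:::example-restricted-bucket"],
)

for evaluation in result["EvaluationResults"]:
    # "explicitDeny" wins over any allow in the attached policies
    print(evaluation["EvalActionName"], evaluation["EvalDecision"])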

Creating folders and listing objects are easy tests, but they’re not the full story.  It would merit some deeper investigation into individual actions before concluding all actions behave the same way.  Also, this emphasizes the need for specifically and carefully defining the actions you want to allow or deny.

Preview Review: AWS Outpost Micro Appliance

I recently had the opportunity to review a forthcoming AWS Outpost Micro appliance and was asked to provide feedback.  The review was uncompensated, and the device had to be returned, so my agreement was that when a more public release approached I could put my thoughts into a blog post, and here we are.

The AWS Outposts family (see https://aws.amazon.com/outposts/) is a category of appliances which extend the AWS cloud into on-premises data centers.  They come in a variety of configurations to suit corporate workloads.  Although the Outpost Micro is part of the Outpost family, its capabilities and resources are scaled to the power smart home user.

Even in its preview form, the Outpost Micro showed a lot of potential.  The second generation prototype I used has 4 CPU cores and 16GB RAM, plus a bunch of storage (see below).  The Outpost Micro does not support services such as EC2, EFS, EBS, SES, etc.  This also means services like API GW and GWLB, which rely on EC2, are not available.  For a couple of these services that’s OK; ISPs usually have provisions against hosting websites from home, which the API gateways would otherwise let you do.

The preview appliance did support S3, Lambda, ECS, DynamoDB, SNS, some IoT services, EventBridge and Fargate.  Most compelling was the S3 media streaming.  As mentioned above, the Outpost Micro is designed for smart home storage and computational workloads, so there was seamless integration with FireTV devices.  Forthcoming features include integration with local Alexa skills and with Echo Show and Ring devices.

If you’re familiar with developing for AWS services, you can also deploy your own applications to your device.  I was able to set up some Lambda functions and do some data processing in a local environment similar to what I do at my day job.  I did not have it long enough to set up Octoprint and drive a fleet of 3D printers but maybe when I get a real one.

Since you always need an architecture diagram to make anything official, this is basically how the Outpost Micro connects to AWS:

micro

As with Kindle and Fire devices, the Outpost Micro is factory configured with your Amazon account, so you just connect it to your network router, turn it on, and hit the config page from a laptop (mobile app coming soon).  The appliance uses a Customer Gateway VPN to extend your AWS account on-prem into your own home; other Outposts directly extend a VPC, but this is designed as a consumer device and is somewhat self-sufficient.  The Customer Gateway is technically part of the appliance and isn’t something you need to set up yourself, aside from an initial setup wizard and T&C acceptances.

Since I had the device during sports season, I decided to see how I could extend it beyond my home.  The power outlet in my Honda Pilot was not sufficient to power the device, but my buddy’s Ford pickup could power it, and coupled with a small wifi router we had a portable LAN which the kids loved for media and gameplay on a couple of long sports trips.  Other cars stayed within portable wifi range so the rest of the team could participate.  Thinking back to the LAN parties of old, this is happily similar in concept but almost absurd in its portability.

The OM device has limited access to the rest of your home’s network, so it isn’t suitable as a print server or media server for anything outside of the AWS fleet of devices and apps.  After some begging and arm twisting, I learned my device had about 20TB of storage, but final versions may have more, less, or the same.  This isn’t a 20TB NAS; the storage space is partitioned and used across services, so you may only have 5TB of extended S3, with any overflow going to the AWS cloud.  It’s clear this is meant to be a cloud-connected device with a local cache serving edge computation and streaming needs.

I miss my old Windows Home Server, but with a little config (and in the future, some apps) the Outpost Micro is an exciting piece of home technology.

For more information or to sign up for the next round of preview, click here: https://bit.ly/2ObV8Lh

Data Prep with AWS Glue DataBrew

Scenario

Now that we’ve had our first look at AWS Glue DataBrew, it’s time to try it out with a real data preparation activity.  After nearly a year of COVID-19, and several rounds of financial relief, one interesting dataset is that from the SBA’s Paycheck Protection Program (PPP).  As with all government programs, there is a great deal of interest in how the money was allocated.  The PPP was a decentralized program, with local banks approving and disbursing funds.  The loan data for the program was released in several batches, and early indication is that the data is a bit of a mess, making it difficult for groups without a data prep organization to analyze the data.  As data professionals, we can help with that.

Setup

The most recent PPP data is found at https://www.sba.gov/funding-programs/loans/coronavirus-relief-options/paycheck-protection-program/ppp-data.  I downloaded the Nov 24, 2020 dataset, and uploaded the files to an S3 bucket.

Our work with DataBrew begins, as  many things in AWS do, by creating a service level IAM role and granting permission to our data, as documented at https://docs.aws.amazon.com/databrew/latest/dg/setting-up-iam.html.

After we’ve uploaded our data and given DataBrew permission, it’s time to create a Dataset, which is basically a pointer to the data files we’ll be using.  We’ll need a Dataset for every different batch of data we want to use.

image
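The console is the easiest way to do this, but for reference, here is a hedged boto3 sketch of creating the same kind of Dataset programmatically (the dataset name, bucket and key are hypothetical):

import boto3

databrew = boto3.client("databrew")

# Register a CSV file in S3 as a DataBrew Dataset
databrew.create_dataset(
    Name="ppp-loans-nov-2020",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-ppp-data-bucket",
            "Key": "ppp/loans.csv",
        }
    },
)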

Initial Profiling

The first thing I like to do when I get an unknown dataset is profile as much of the data as I can.  With DataBrew, I can easily set up a Profile Job to gather statistics about the entire dataset.  To start, we navigate Jobs >> Profile Jobs >> Create Job.  The configuration looks like the image below.

image
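If you prefer to script it, a rough boto3 equivalent of creating and starting this profile job might look like the following (the job name, role ARN and output bucket are hypothetical):

import boto3

databrew = boto3.client("databrew")

# Create the profile job against the Dataset, then kick off a run
databrew.create_profile_job(
    Name="ppp-profile-job",
    DatasetName="ppp-loans-nov-2020",
    RoleArn="arn:aws:iam::123456789012:role/databrew-service-role",
    OutputLocation={"Bucket": "my-databrew-output-bucket"},
)

databrew.start_job_run(Name="ppp-profile-job")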

The profiling job takes a little over one minute to run since DataBrew will profile a maximum of 20,000 rows even if you select “Full dataset” (you can request a limit increase).  Once complete, we can choose to “View data profile”, then the “Column Statistics” tab to check for completeness, type and validity.

Most of the columns are 100% valid, which would be fantastic if true, although I suspect unknown values may be represented by a value which DataBrew does not recognize as “unknown” or “invalid”.  We’ll need to investigate further.  Also, ZIP Code was identified as a numeric column, which is a very common mistake made by data profilers.  Many US Zip Codes start with zero, and need to be treated as strings in order to retain that leading zero.

image

State claims to be 100% valid, so let’s take a look at the values.  Of the 20,000 records profiled, all were in Kansas.  Deep sigh.  We’re going to need to try a random sample somehow.

image

Cities are where the fun usually begins, and looking at the Top 50 values, we see that there is inconsistent casing and DataBrew treats “OLATHE” and “Olathe” differently. We see the same treatment with “LAWRENCE” and “Lawrence”, too.  That’s something we can try and fix in our data prep.  Trivia note: “Pittsburg” is spelled correctly here, only Pittsburgh, PA has the “h” at the end.

image

Random Sampling

That’s a good start, but let’s see what else we can find with a random sample.  To do a random sample, we need to create a project, using the same dataset, and configure the sampling to be 5000 random rows.

image

After the sampling is complete, we’re taken to the projects tab, where we can review the sampled data.  Right away we can see that “Not Available” and “N/A” are very common answers, and we need to work with our business partners to decide if these are values we want to count, values we want to convert to a different value, or if we want to count them as invalid results.

1-6a-sample

Looking at some of the ZIP Codes, we can see that the column was profiled as a number, and some of the MA ZIP Codes lost their leading 0.  We’ll need to change the column type and put the leading 0 back using a transformation.

1-8a-sample

Looking at the State column, the random sampling did improve the sample somewhat—we now have 5 states represented instead of just one.

1-8b-sample

Recipe

Now that we have a couple of columns which need a transformation, and a decent random sample, it’s time to create our first recipe.  We’ll clean up both the ZIP Code and City name columns and let our business users work with the data while we look for additional transformations.

ZIP Code

Since ZIP Code was incorrectly typed as a numeric column, we need to correct this before we produce an output for our users.  This means we need to re-type the column as a string and pad the leading zero where it was stripped off.

To change the type of the column, click on the “#” next to the column name and choose “string”.  This will add a recipe step to convert the type of this column, but will not replace the leading zero.

image

In order to replace the leading 0, we can rely on the old trick of prefixing every value with a 0 and taking the right five characters to create the full ZIP code.  This is a two-step process in our recipe.  First, we pad all values with 0 by activating the Clean menu, selecting “Add prefix…”, then entering a prefix text of 0.

image

This prefixing will be applied to all values, which will make most ZIP Codes six characters long.  To fix this, we take the right five characters by activating the Functions menu, selecting Text Functions, then Right.

image

This operation will create a new column, which by default is labeled “Zip_RIGHT”, and we configure the number of characters to keep.

image

And when we preview the change, this is how it looks.

image
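Outside of DataBrew, the same zero-pad trick is easy to express in code; here is a quick pandas sketch of the equivalent logic, using made-up sample values:

import pandas as pd

df = pd.DataFrame({"Zip": [2149, 66061, 15222]})   # 02149 lost its leading zero when typed as a number
df["Zip"] = df["Zip"].astype(str)                  # re-type the column as a string
df["Zip_RIGHT"] = ("0" + df["Zip"]).str[-5:]       # prefix a 0, then keep the right five characters
print(df)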

City

As we saw in the profile results, city names are both mixed case and all uppercase, which is causing mis-counts by city.  We need to standardize the capitalization to alleviate this.  For our needs, it doesn’t matter if we use all uppercase or not, just as long as we’re consistent.  I’ll use proper case because it doesn’t look like I’m being yelled at.  We activate the Format menu (either from the menu bar or using the ellipses menu), then choose “Change to capital case”.

image

We can then see an example of what each formatting option will do.  Capital case is the closest option for how most city names are capitalized.  It’s not perfect, but it’s consistent, and we’d need an address verification system to do better.  This option changes the value in the original column, it does not create a new column.

image

We can even preview the changes, and see how ANCHORAGE and Anchorage are now combined into a single value of Anchorage.

image

image

Our recipe now looks like this, which is good enough for now.

image

Publishing and Using the Recipe

In order to run this recipe against the full dataset, or to run it again, we need to publish it and then create a Job.  From the Recipe panel, we click the Publish button.  Recipes are versioned, so in the pop-up we add some version notes and Publish.  Once Published, we can use it in a Job.

image
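For automation, here is a hedged boto3 sketch of publishing a recipe and wiring it into a recipe job; the recipe, dataset, job, role and bucket names are hypothetical:

import boto3

databrew = boto3.client("databrew")

# Publishing creates a new recipe version that jobs can reference
databrew.publish_recipe(Name="ppp-cleanup-recipe", Description="ZIP Code and City fixes")

databrew.create_recipe_job(
    Name="ppp-cleanup-job",
    DatasetName="ppp-loans-nov-2020",
    RecipeReference={"Name": "ppp-cleanup-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/databrew-service-role",
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output-bucket", "Key": "ppp-clean/"},
            "Format": "PARQUET",
        }
    ],
)

databrew.start_job_run(Name="ppp-cleanup-job")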

I covered Jobs in detail in the First Look: AWS Glue DataBrew, so here is how I configured the job:

image

Parquet is a great storage format: it has a schema, it’s compact and columnar for performant queries, and it’s native to many AWS services.  Once the job has completed, how do we ensure it worked?  Simple: we use the output as a new Dataset and profile the results.  Viewing the profile of the cleanup job’s output, we can see the top 50 City names are all capital case.
image

Similarly, we can see the ZIP Codes are all 5 characters long and have the leading zero (fortunately, the profile job sampled New Jersey).

image

Congratulations, we can now start to make this data available to our users!  We know they’ll find more steps we need to add to our recipe as they begin to work with the data, but this is a great start.  Find me on Twitter @rj_dudley and let me know how you find DataBrew.

15 S3 Facts for S3’s 15th

s3-15th

To celebrate S3’s 15th birthday on 3/14/2021, and to kick off AWS Pi Week, I tweeted out 15 facts about S3.  Here they are as a blog post, to make them easier to read.  Because of the rapid pace of innovation in AWS services, including S3, some things may have changed if you’re reading this in the future.

1. S3 is designed for “eleven 9s” of durability. When you take into account redundancy in and across availability zones, in 10,000,000 years you’d only lose one of 10,000 objects. Read more at https://aws.amazon.com/blogs/aws/new-amazon-s3-reduced-redundancy-storage-rrs/.

2. S3 is region-bound, which means all S3 buckets in that region are partying in the same publicly available cloud ether. You can restrict access to a VPC but the bucket is still located outside the VPC. Related: https://cloudonaut.io/does-your-vpc-endpoint-allow-access-to-half-of-the-internet/.

3. S3 is a very versatile storage service. The trillions of objects it stores are the basis for many workloads, including serving websites, video streaming and analytics.

4. The return of INI files! With a first byte latency of milliseconds, S3 is suitable for storing configuration settings in an available and inexpensive way. Databases are no longer a fixed cost and there is no need for one just for configuration.

5. S3 is designed for “infinite storage”. Each object can be up to 5TB in size, and there is no limit to the number of objects you can store in a bucket. Analytics aren’t constrained by a file or disk size. It’s like a TARDIS, or bag of holding!

6. How do you perform operations on hundreds, thousands or more objects? S3 Batch Operations allow you to copy objects, restore from Glacier, or even call a lambda for each file. For more information, see https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/.

7. S3 is a “consumption model”, so you pay only for what you use when you use it. No more provisioning fixed-size network storage solutions with large up-front costs.

8. But what if you need massive object storage closer to your location? S3 on Outposts puts S3 on-premises, right where you collect or process your data. For more info, start at https://aws.amazon.com/s3/outposts/.

9. If your bandwidth is limited or non-existent, you can use Snowball Data Transfer to move TB to PB of data in and out of AWS. Learn more at https://aws.amazon.com/snowball/.

10. For data collection and object generation at the most extreme edges there is Snowball Edge Storage. Snowball Edge can even run processing workloads. Read more at https://docs.aws.amazon.com/snowball/latest/developer-guide/whatisedge.html.

11. Although you can upload files to S3 via the console, CLI and REST API, wouldn’t it be great if you could just drag a file to a network share and have it appear in the cloud? With a File Gateway, you can do exactly that! See https://aws.amazon.com/storagegateway/file/.

12. S3 offers multiple storage classes, so you can optimize cost, latency and retention period. Standard offers the lowest latency but at the highest cost, while Glacier Deep Archive is perfect for yearslong retention. Read more at https://aws.amazon.com/s3/storage-classes/.

13. S3 Storage Lens is a central dashboard organizations can use for insight into S3 utilization and to get recommendations to optimize price. Read more at https://aws.amazon.com/blogs/aws/s3-storage-lens/.

14. S3 can version objects, so if you accidentally delete or overwrite an object, you can recover the most recent save or many prior versions, too.

15. S3 is a very secure service. IAM policies can be applied at the bucket and object level with a great deal of granularity. Additionally, VPC endpoints bind S3 traffic to a specific VPC only.

And one to grow on (for everyone): AWS recently released three new S3 training courses: https://aws.amazon.com/about-aws/whats-new/2021/01/announcing-three-new-digital-courses-for-amazon-s3/.

First Look: AWS Glue DataBrew

Introduction

This is a post about a new vendor service which blew up a blog series I had planned, and I’m not mad. With a greater reliance on data science comes a greater emphasis on data engineering, and I had planned a blog series about building a pipeline with AWS services. That all changed when AWS released DataBrew, which is a managed data profiling and preparation service. The announcement is at https://aws.amazon.com/blogs/aws/announcing-aws-glue-databrew-a-visual-data-preparation-tool-that-helps-you-clean-and-normalize-data-faster/, but the main thing to know is that DataBrew is a visual tool for analyzing and preparing datasets. It’s powerful without a lot of programming. Despite its ease of use and numerous capabilities, DataBrew will not replace data engineers; instead, DataBrew will make it easier to set up and perform a great deal of the simple, rote data preparation activities, freeing data engineers to focus on the really hard problems. We’ll look into use cases and capabilities in future blog posts. Spoiler alert: we’re still going to need that pipeline I was going to write about, just more streamlined. Updated series in future posts.

DataBrew is not a stand-alone component, but is instead a component of AWS Glue. This makes sense, since it adds a lot of missing capabilities into Glue, but can also take advantage of Glue’s job scheduling and workflows. Some of what I was planning to write involved Glue anyway, so this is convenient for me.

In this “First Look” post I’m working my way through the DataBrew screens as you first encounter them, so if you have an AWS account, it might be useful to open DataBrew and move through the screens as you read. No worries if you don’t, I’ll cover features more in-depth as I work through future posts.

DataBrew Overview

There are four main parts of DataBrew: Datasets, Projects, Recipes and Jobs. These are just where we start, there is a lot of ground to cover.

DataBrew parts

Datasets

DataBrew can work directly with files stored in S3, or via the Glue catalog to access data in S3, RedShift or RDS. If you’re using Lake Formation, it appears DataBrew (since it is part of Glue) will honor the AuthZ (authorization) configuration. Exactly how this works is a topic for future exploration.

If you’re connecting directly to S3, DataBrew can work with CSV, parquet, and JSON files. At the time of writing, XML is not supported so you’d need to do a conversion upstream in a Lambda or Spark job. One cool feature is the ability to create parameterized paths to S3, even using a regex. This isn’t something available in the Glue Catalog, only directly to S3. I work with a lot of files which have a date stamp as part of the filename, so this will be helpful.

DataBrew datasets

Projects

Holey moley there’s a lot of stuff here! The Projects screen is where the real action is, and we’ll spend a lot of time here in the future.

DataBrew Projects

Sample View

As we explore the Sample View, it’s important to keep in mind that DataBrew is meant for actual data preparation work, not just lightweight profiles. This sample view is kept to a small window so we can explore the effects of transformations and monitor effects on quality.

The majority of this page is taken up with a sample of the dataset and some lightweight profiling, including the type, number of unique values in the sample, the most common values in the sample, and the first few rows of the sample. The sample size and position in the set can be changed. This sample view is a great way to test transformations and enrichments, which we’ll look into later.

DataBrew Sample

The profile view can be changed to explore the schema, which will be inferred from CSV and JSON files, or use the metadata in parquet or Glue Catalog.

DataBrew Schema

The third profile view is correlations and summaries. If you’ve run several profiles, the history is available to browse. The “missing cells” statistic is something we will revisit for the dataset I have loaded here. Also, for my sample dataset, the Correlation isn’t that interesting because the majority of the columns are part of an address, so they should correlate. But in other datasets, this could be really interesting.

DataBrew Profile Overview

The profile view also has data insights into individual columns, showing several quality metrics for the selected column.

DataBrew Column Stats

Transformations

DataBrew currently has over 250 built-in transformations, which AWS confusingly calls “Recipe actions” in parts of its documentation.

DataBrew Transformations

The transformations are categorized in the menu bar above the profile grid. Transformations include removing invalid values, removing nulls, flagging columns, replacing values, joins, aggregates, splits, and so on. Most of these should be familiar to a data professional. With a join you can enrich one dataset by joining it to other datasets.

Recipes

When we’re in the Projects tab and we apply a transformation to a column, we’re creating a recipe step. One or more recipe steps form a recipe, and there isn’t a published maximum number of recipes per dataset. Since each recipe can be called by a separate job, this provides a great deal of flexibility in our data prep processes. Recipe steps can only be edited on the Projects tab; the Recipes tab lists the existing recipes, allows for downloading of recipes and some other administrative tasks. Recipes can be downloaded and published via CloudFormation or the CLI, providing a rudimentary sharing ability across datasets.

DataBrew Recipes Tab

Opening a recipe brings up summaries of the recipe’s versions, and the other tab on this page opens up the data lineage for the recipe. This lineage is not the data lineage through your enterprise, just the pathway through the recipe. My simple example here isn’t that impressive, but if you build a more complex flow with joins to other datasets and more recipes, this will be a nice view. Although you can preview the datasets and recipes at the various steps, this is not a graphical workflow editor.

DataBrew Lineage

This is also a convenient screen to access CloudTrail logs for the recipes.

Jobs

There are two types of jobs in DataBrew–“recipe” and “profile”.

DataBrew Job Types

A profile job examines up to 20,000 rows of data (more if you request an increase). The results of a profiling job include:

  • data type
  • unique values count
  • missing values count
  • most common values and occurrences
  • minimum and maximum values with occurrences
  • percentiles, means, standard deviation, median
  • and more…

One feature missing from profiling is determining the pattern or length of text values.  The profiling results are saved in JSON format to S3, and there is an option to create a QuickSight dataset for reporting.  Anything more than QuickSight will require some custom processing of the JSON output.  Although it took this long in a blog post to discuss profiling jobs, a profile is something which really should be created before building recipes.

A recipe job configures a published recipe to be run against a selected dataset. In a Dataset job we choose the dataset, recipe and recipe version we want to use.

DataBrew Dataset Job

The other recipe job option is a Project job, which uses a saved project defined on the Projects tab. In this job, the only thing we need to configure is the project.

DataBrew Project Job

The original dataset is not modified in DataBrew; instead, we configure the S3 location, output file format, and compression for storing the results.

DataBrew Output File Type DataBrew Output Compression

The output can be partitioned on a particular column, and we can choose whether to overwrite the files from the previous run or keep each run’s files. Please use encryption.

DataBrew Output Partitioning

Once configured, jobs can be scheduled. You can have a maximum of two schedules per job. If you need more than two schedules you’ll need to create an identical job.

DataBrew Job Schedule

Either type of job can be run on a schedule, on-demand or as part of other workflows (see “Jobs Integrations” below). There is only one recipe and one dataset per job, so processing multiple recipes and/or multiple datasets would require additional workflow.

Jobs Integrations

Aside from the console or a schedule, how else can a DataBrew job be started? For starters, the DataBrew API exposes all the functionality in the console, including running a job. When coupled with lambdas, this exposes a great amount of flexibility in starting a job.
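As a rough illustration of that pattern, a Lambda handler that kicks off a DataBrew job could be as small as this (the job name in the event is hypothetical):

import boto3

databrew = boto3.client("databrew")

def lambda_handler(event, context):
    # Expects something like {"job_name": "ppp-cleanup-job"} in the triggering event
    run = databrew.start_job_run(Name=event["job_name"])
    return {"run_id": run["RunId"]}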

A second option is to use a Jupyter notebook (vanilla Jupyter, not SageMaker notebook yet) and the plugin found at https://github.com/aws/aws-glue-databrew-jupyter-extension.

Source Control Integration

Recipes and jobs have a form of versioning, but it works more like S3 object versioning than a real source control workflow: a new version is created with every published update.

DataBrew Publish Recipe

DataBrew Recipe Versions

However, as with most of AWS’s online editors, there is no direct source control integration. The best you can do is to download recipes and jobs as JSON and check them in manually. Better than nothing but still surprising since AWS has CodeCommit.

Infrastructure as Code

At this time, neither Terraform nor Pulumi supports DataBrew, but CloudFormation can be used to script DataBrew; see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_DataBrew.html for the documentation and examples. The CLI is another scripting option; its documentation is at https://awscli.amazonaws.com/v2/documentation/api/latest/reference/databrew/index.html.