Can a customer managed IAM policy override AWS managed IAM policies?


Writing custom IAM policies can be difficult, especially when job function utilizes bunch of services.  AWS manages several IAM policies for particular job functions (such as data scientist), which are a great help, but what if we want to restrict access to certain services all together, or certain actions, or even specific buckets?

A common pattern in lake house architecture is to have an S3 bucket of raw data, a process to tokenize/scrub the data of sensitive information, and then a “cleansed” bucket with cleansed data that can be used in analyses.  The AWS-managed DataScientist job role policy is complex, and we’d prefer to use that as our base policy but put additional restrictions on it.  The question became, can we simply attach an additional policy to a role and have it override some of the settings in the AWS-managed policy?  As it turns out, we can.

Tighter Restrictions

The first question we had was, can we make restrictions tighter than an AWS-managed policy by adding one of our own?  Here’s what I did.  I first created a user, with only AmazonS3FullAccess, which allowed me to access all objects in all buckets.  I then created the following policy and attached it as an inline policy to my test user.

The results were exactly what I wanted to see—no ability to list the objects in the bucket.

I repeated this experiment, but this time creating and attaching the a customer-managed policy.  The result was the same—the user could list the bucket’s objects when my custom policy was not attached, and could not list the objects when the policy was attached.

Looser Restrictions

The second question we had was whether or not we could loosen restrictions in an IAM-managed policy by attaching one of our own.  To test this, I used the same user as above, but removed all policies, and then added AmazonS3ReadOnlyAccess.  Then, I confirmed a folder could not be created:


I then created a policy which allowed PutObject, attached it to the user, and confirmed I could now create a folder:


So again, a customer managed policy can override an AWS-managed policy.

Conflicting Policies

So then we wondered, what happens if you attach conflicting policies.  So I attached both AmazonS3ReadOnly and AmazonS3FullAccess to my test user:

I could once again create a folder:

This isn’t surprising, since explicit permissions overrule implicit permissions.  One final question we wanted to test was what happens with two explicit permissions—one allow and one deny for the same action.  I created two policies–one which explicitly denied listing buckets, and one which explicitly allowed listing of buckets–and attached them to the same user one at a time.  After confirming they worked as intended when attached individually.



When attached together, the explicit deny overrides the explicit allow.



Customer-managed policies can be used to override actions when implicitly allowed or denied in AWS-managed policies.  This means we can make use of the complex AWS-managed IAM policies and still have the ability to make some modifications when needed.

AWS describes the order of evaluation at  The results here are in line with the logic described—we could allow an action which wasn’t explicitly denied, but an explicit deny took precedence over an explicit allow.

Creating folders and listing objects are easy tests, but they’re not the full story.  It would merit some deeper investigation into individual actions before concluding all actions behave the same way.  Also, this emphasizes the need for specifically and carefully defining the actions you want to allow or deny.

Preview Review: AWS Outpost Micro Appliance

I recently had the opportunity to review a forthcoming AWS Outpost Micro appliance and was asked to provide feedback.  The review was uncompensated, and the device had to be returned, so my agreement was that when a more public release approached I could put my thoughts into a blog post, and here we are.

The AWS Outposts family (see is a category of appliances which extend the AWS cloud into on-premises data centers.  They come in a variety of configurations to suit corporate workloads.  Although the Outpost Micro is part of the Outpost family, its capabilities and resources are scaled to the power smart home user.

Even in its preview form, the Outpost Micro showed a lot of potential.  The second generation prototype I used has 4 CPU cores and 16GB RAM, plus a bunch of storage (see below).  The Outpost Micro does not support services such as EC2, EFS, EBS, SES, etc.  This also means services like API GW and GWLB which have a reliance on EC2 are not available.  For a couple of these services that’s OK, ISPs usually have provisions about hosting websites from home which the API gateways would allow you to do.

The preview appliance did support S3, Lambda, ECS, DynamoDB, SNS, some IoT services, EventBridge and Fargate.  Most compelling was the S3 media streaming.  As mentioned above, the Outpost Micro is designed for smart home storage and computational workloads, so there was seamless integration with FireTV devices.  Forthcoming features include integration for local Alexa skills, integration with Echo Show and Ring devices.

If you’re familiar with developing for AWS services, you can also deploy your own applications to your device.  I was able to set up some Lambda functions and do some data processing in a local environment similar to what I do at my day job.  I did not have it long enough to set up Octoprint and drive a fleet of 3D printers but maybe when I get a real one.

Since you always need an architecture diagram to make anything official, this is basically how the Outpost Micro connects to AWS:


As with Kindle and Fire devices, the Outpost Micro is factory configured with your Amazon account, so you just connect it to your network router, turn it on, and hit the config page from a laptop (mobile app coming soon).  The appliance uses a Customer Gateway VPN to extend your AWS account on-prem into your own home; other outposts directly extend VPC but this is designed as consumer device and is somewhat self-sufficient.  The Customer Gateway is technically part of the appliance and isn’t something you need to set up yourself aside from some initial setup wizard and T&C acceptances.

Since I had the device during sports season, I decided to see how I could extend the device beyond my home.  The power outlet in my Honda Pilot was not sufficient to power the device, but my buddy’s Ford pickup could power it, and when coupled with a small wifi router had a portable LAN which the kids loved on a couple long sports trips for media and gameplay.  Other cars stayed within portable wifi range so the rest of the team could participate.  Thinking back to the LAN parties of old, this is happily similar in concept but almost absurd in its portability.

The OM device has limited access to the rest of your home’s network, so it isn’t suitable as a print server or media server for something outside of the AWS fleet of devices and apps.  After some begging and arm twisting, learned my device had about 20TB of storage but final versions may have more or less or the same.  This isn’t a 20TB NAS, the storage space is partitioned and used across services, so you may only have 5TB of extended S3 and any overflow is in AWS cloud.  It’s clear this is meant to be a cloud-connected device with local cache serving edge computation and streaming needs.

I miss my old Windows Home Server, but with a little config (and in the future, some apps) the Outpost Micro is an exciting piece of home technology.

For more information or to sign up for the next round of preview, click here:

Data Prep with AWS Glue DataBrew


Now that we’ve had our first look at AWS Glue DataBrew, it’s time to try it out with a real data preparation activity.  After nearly a year of COVId-19, and several rounds of financial relief, one interesting dataset is that from the SBA’s Paycheck Protection Program (PPP).  As with al government programs, there is a great deal of interest in how the money was allocated.  The PPP was a decentralized program, with local banks approving and disbursing funds.  The loan data for the program was released in several batches, and early indication is that the data is a bit of a mess, making it difficult for groups without a data prep organization to analyze the data.  As data professionals, we can help with that.


The link to the most recent PPP data is available is found at  I downloaded the Nov 24, 2020 dataset, and uploaded the files to an S3 bucket.

Our work with DataBrew begins, as  many things in AWS do, by creating a service level IAM role and granting permission to our data, as documented at

After we’ve uploaded our data and given DataBrew permission, it’s time to create a Dataset, which is basically a pointer to the data files we’ll be using.  We’ll need a Dataset for every different batch of data we want to use.


Initial Profiling

The first thing I like to do when I get an unknown dataset is profile as much of the data as I can.  With DataBrew, I can easily set up a Profile Job to gather statistics about the entire dataset.  To start, we navigate Jobs >> Profile Jobs >> Create Job.  The configuration looks like the image below.


The profiling job takes a little over one minute to run since DataBrew will profile a maximum of 20,000 rows even if you select “Full dataset” (you can request a limit increase).  Once complete, we can choose to “View data profile”, then the “Column Statistics” tab to check for completeness, type and validity.

Most of the columns are 100% valid, which would be fantastic if true, although I suspect unknown values may be represented by a value which DataBrew does not recognize as “unknown” or “invalid”.  We’ll need to investigate further.  Also, ZIP Code was identified as a numeric column, which is a very common mistake made by data profilers.  Many US Zip Codes start with zero, and need to be treated as strings in order to retain that leading zero.


State claims to be 100% valid, so let’s take a look at the values.  Of the 20,000 records profiled, all were in Kansas.  Deep sigh.  We’re going to need to try a random sample somehow.


Cities are where the fun usually begins, and looking at the Top 50 values, we see that there is inconsistent casing and DataBrew treats “OLATHE” and “Olathe” differently. We see the same treatment with “LAWRENCE” and “Lawrence”, too.  That’s something we can try and fix in our data prep.  Trivia note: “Pittsburg” is spelled correctly here, only Pittsburgh, PA has the “h” at the end.


Random Sampling

That’s a good start, but let’s see what else we can find with a random sample.  To do a random sample, we need to create a project, using the same dataset, and configure the sampling to be 5000 random rows.


After the sampling is complete, we’re taken to the projects tab, where we can review the sampled data.  Right away we can see that “Not Available” and “N/A” are very common answers, and we need to work with our business partners to decide if these are values we want to count, values we want to convert to a different value, or if we want to count them as invalid results.


Looking at some of the ZIP Codes, we can see that the column was profiled as a number, and some of the MA ZIP Codes lost their leading 0.  We’ll need to change the column type and put the leading 0 back using a transformation.


Looking at the State column, the random sampling did improve the sample somewhat—we now have 5 states represented instead of just one.



Now that we have a couple columns which need a transformation, and a decent random sample, it’s time to start create our first recipe.  We’ll clean up both the ZIP Code and City name column and let our business users work with the data while we look for some additional transformations.

ZIP Code

Since ZIP Code was incorrectly typed as a numeric column, we need to correct this before we produce an output for our users.  This means we need to re-type the column as a string and pad the leading zero where it was stripped off.

To change the type of the column, click on the “#” next to the column name and choose “string”.  This will add a recipe step to convert the type of this column, but will not replace the leading zero.


In order to replace the leading 0, we can rely on the old trick of prefixing every value with a 0 and take the right five characters to create the full zip code.  This is a two-step process in our recipe.  First, we pad all values with 0 by activating the Clean menu, selecting “Add prefix…”, then entering a prefix text of 0.


This prefixing will be applied to all values, which will make most ZIP Codes six characters long.  To fix this, we take the right five characters by activating the Functions menu, selecting Text Functions, then Right.


This operation will create a new column, which by default is labeled “Zip_RIGHT”, and we configure the number of characters to keep.


And when we preview the change, this is how it looks.



As we saw in the profile results, city names are both mixed case and all uppercase, which is causing mis-counts by city.  We need to standardize the capitalization to alleviate this.  For our needs, it doesn’t matter if we use all uppercase or not, just as long as we’re consistent.  I’ll use proper case because it doesn’t look like I’m being yelled at.  We can either activate the Format menu (either from the menu bar or using the ellipses menu) then choosing “Change to capital case”.  


We can then see an example of what each formatting option will do.  Capital case is the closest option for how most city names are capitalized.  It’s not perfect, but it’s consistent, and we’d need an address verification system to do better.  This option changes the value in the original column, it does not create a new column.


We can even preview the changes, and see how ANCHORAGE and Anchorage are now combined into a single value of Anchorage.


Our recipe now looks like this, which is good enough for now.


Publishing and Using the Recipe

In order to run this recipe against the full dataset, or to run it again, we need to publish it and then create a Job.  From the Recipe panel, we click the Publish button.  Recipes are versioned, so in the pop-up we add some version notes and Publish.  Once Published, we can use it in a Job.


I covered Jobs in detail in the First Look: AWS Glue DataBrew, so here is how I configured the job:


Parquet is a great storage format, it has a schema, it’s compact, columnar for performant query, and it’s native to many of AWS’s services.  Once the job has completed, how do we ensure it worked?  Simple, we use the output as a new Dataset and profile the results.  Viewing the results of the profile of the cleanup job, we can see the top 50 City names are all capital case.

Similarly, we can see the ZIP Codes are all 5 characters long and have the leading zero (fortunately, the profile job sampled New Jersey).


Congratulations, we can now start to make this data available to our users!  We know they’ll find more steps we need to add to our recipe as they begin to work with the data, but this is a great start.  Find me on Twitter @rj_dudley and let me know how you find DataBrew.

15 S3 Facts for S3’s 15th


To celebrate S3’s 15th birthday on 3/14/2021, and to kick off AWS Pi Week, I tweeted out 15 facts about S3.  Here they are as a blog post, to make them easier to read.  Because of the rapid pace of innovation in AWS services, including S3, so if you’re reading this in the future, some things may have changed.

1. S3 is designed for “eleven 9s” of durability. When you take into account redundancy in and across availability zones, in 10,000,000 years you’d lose only lose one of 10,000 objects. Read more at

2. S3 is region-bound, which means all S3 buckets in that region are partying in the same publicly available cloud ether. You can restrict access to a VPC but the bucket is still located outside the VPC. Related:

3. S3 is a very versatile storage service. The trillions of objects it stores are the basis for many workloads, including serving websites, video streaming and analytics.

4. The return of INI files! With a first byte latency of milliseconds, S3 is suitable for storing configuration settings in an available and inexpensive way. Databases are no longer a fixed cost and there is no need for one just for configuration.

5. S3 is designed for “infinite storage”. Each object can be up to 5TB in size, and there is no limit to the number of objects you can store in a bucket. Analytics aren’t constrained by a file or disk size. It’s like a TARDIS, or bag of holding!

6. How do you perform operations on hundreds, thousands or more objects? S3 Batch Operations allow you to copy objects, restore from Glacier, or even call a lambda for each file. For more information, see

7. S3 is a “consumption model”, so you pay only for what you use when you use it. No more provisioning fixed-size network storage solutions with large up-front costs.

8. But what if you need massive object storage closer to your location? S3 on Outposts puts S3 on-premises, right where you collect or process your data. For more info, start at

9. If your bandwidth is limited or non-existent, you can use Snowball Data Transfer to move TB to PB of data in and out of AWS. Learn more at

10. For data collection and object generation at the most extreme edges there is Snowball Edge Storage. Snowball Edge can even run processing workloads. Read more at

11. Although you can upload files to S3 via the console, CLI and REST API, wouldn’t it be great if you could just drag a file to a network share and have it appear in the cloud? With a File Gateway, you can do exactly that! See

12. S3 offers multiple storage classes, so you can optimize cost, latency and retention period. Standard offers the lowest latency but at the highest cost, while Glacier Deep Archive is perfect for yearslong retention. Read more at

13. S3 Storage Lens is a central dashboard organizations can use for insight into S3 utilization and to get recommendation to optimize price. Read more at

14. S3 can version objects, so if you accidentally delete or profoundly update an object, you can recover from the most recent save or many prior versions, too.

15. S3 is a very secure service. IAM policies can be applied at the bucket and object level with a great deal of granularity. Additionally, VPC endpoints bind S3 traffic to a specific VPC only.

And one to grow on (for everyone): AWS recently released three new S3 training courses:

First Look: AWS Glue DataBrew


This is a post about a new vendor service which blew up a blog series I had planned, and I’m not mad. With a greater reliance on data science comes a greater emphasis on data engineering, and I had planned a blog series about building a pipeline with AWS services. That all changed when AWS released DataBrew, which is a managed data profiling and preparation service. The announcement is at, but the main thing to know is that DataBrew is a visual tool for analyzing and preparing datasets. It’s powerful without a lot of programming. Despite its ease of use and numerous capabilities, DataBrew will not replace data engineers; instead, DataBrew will make it easier to set up and perform a great deal of the simple, rote data preparation activities, freeing data engineers to focus on the really hard problems. We’ll look into use cases and capabilities in future blog posts. Spoiler alert: we’re still going to need that pipeline I was going to write about, just more streamlined. Updated series in future posts.

DataBrew is not a stand-alone component, but is instead a component of AWS Glue. This makes sense, since it adds a lot of missing capabilities into Glue, but can also take advantage of Glue’s job scheduling and workflows. Some of what I was planning to write involved Glue anyway, so this is convenient for me.

In this “First Look” post I’m working my way through the DataBrew screens as you first encounter them, so if you have an AWS account, it might be useful to open DataBrew and move through the screens as you read. No worries if you don’t, I’ll cover features more in-depth as I work through future posts.

DataBrew Overview

There are four main parts of DataBrew: Datasets, Projects, Recipes and Jobs. These are just where we start, there is a lot of ground to cover.

DataBrew parts


DataBrew can work directly with files stored in S3, or via the Glue catalog to access data in S3, RedShift or RDS. If you’re using Lake Formation, it appears DataBrew (since it is part of Glue) will honor the AuthN (“authorization”) configuration. Exactly how this works is a topic for future exploration.

If you’re connecting directly to S3, DataBrew can work with CSV, parquet, and JSON files. At the time of writing, XML is not supported so you’d need to do a conversion upstream in a Lambda or Spark job. One cool feature is the ability to create parameterized paths to S3, even using a regex. This isn’t something available in the Glue Catalog, only directly to S3. I work with a lot of files which have a date stamp as part of the filename, so this will be helpful.

DataBrew datasets


Holey moley there’s a lot of stuff here! The Projects screen is where the real action is, and we’ll spend a lot of time here in the future.

DataBrew Projects

Sample View

As we explore the Sample View, it’s important to keep in mind that DataBrew is meant for actual data preparation work, not just lightweight profiles. This sample view is kept to a small windows so we can explore the effects of transformations and monitor effects on quality.

The majority of this page is taken up with a sample of the dataset and some lightweight profiling, including the type, number of unique values in the sample, the most common values in the sample, and the first few rows of the sample. The sample size and position in the set can be changed. This sample view is a great way to test transformations and enrichments, which we’ll look into later.

DataBrew Sample

The profile view can be changed to explore the schema, which will be inferred from CSV and JSON files, or use the metadata in parquet or Glue Catalog.

DataBrew Schema

he third profile view is correlations and summaries. If you’ve runs several profiles, the history is available to browse. The “missing cells” statistic is something we will revisit for the dataset I have loaded here. Also, for my sample dataset, the Correlation isn’t that interesting because the majority of the columns are part of an address so they should correlate. But in other datasets, this could be really interesting.

DataBrew Profile Overview

The profile view also has data insights into individual columns, showing several quality metrics for the selected column.

DataBrew Column Stats


DataBrew currently has over 250 built-in transformations, which AWS confusingly calls “Recipe actions” in parts of its documentation.

DataBrew Transformations

The transformations are categorized in the menu bar above the profile grid. Transformations include removing invalid values, remove nulls, flag column, replace values, joins, aggregates, splits, etc. Most of these should be familiar to a data professional. With a join you can enrich one dataset by joining to other datasets.


When we’re in the Projects tab, and we apply a transformation to a column, we’re creating a recipe step. One or more recipe steps form a recipe, and there isn’t a published maximum number of recipes per dataset. Since each recipe can be called by a separate job, this provides a great deal of flexibility in our data prep processes. Recipe steps can only be edited on the Projects tab; the Recipes tab lists the existing recipes, allows for downloading of recipes and some other administrative tasks. Recipes can be downloaded and published via CloudFront or the CLI, providing a rudimentary sharing ability across datasets.

DataBrew Recipes Tab

Opening a recipe brings up summaries of the recipe’s versions, and the other tab on this page opens up the data lineage for the recipe. This lineage is not the data lineage through your enterprise, just the pathway through the recipe. My simple example here isn’t that impressive, but if you build a more complex flow with joins to other datasets and more recipes, this will be a nice view. Although you can preview the datasets and recipes at the various steps, this is not a graphical workflow editor.

DataBrew Lineage

This is also a convenient screen to access CloudTrail logs for the recipes


There are two types of jobs in DataBrew–“recipe” and “profile”.

DataBrew Job Types

A profile job examines up to 20,000 rows of data (more if you request an increase). The results of a profiling job include:

  • data type
  • unique values count
  • missing values count
  • most common values and occurrences
  • minimum and maximum values with occurrences
  • percentiles, means, standard deviation, median
  • and more…

One feature missing in the Profiling is determining the pattern or length of text values. The Profiling results are saved in JSON format, can be saved in S3, and there is an option to create a QuickSight dataset for reporting. Anything more than QuickSight will require some custom processing of the JSON output. Although it took this long in a blog post to discuss profiling jobs, a profile is something which really should be created before building recipes.

A recipe job configures a published recipe to be run against a selected dataset. In a Dataset job we choose the dataset, recipe and recipe version we want to use.

DataBrew Dataset Job

The other recipe job option is is a Project job, which uses a saved project defined on the Projects tab. In this job, the only thing we need to configure is the project.

DataBrew Project Job

The original dataset is not modified in DataBrew; instead, we configure the S3 location, output file format, and compression for storing the results.

DataBrew Output File Type DataBrew Output Compression

The output can be partitioned on a particuar column, and we can choose whether to overrite the files from the previous run or keep each run’s files. Please use encryption.

DataBrew Output Partitioning

Once configured, jobs can be scheduled. You can have a maximum of two schedules per job. If you need more than two schedules you’ll need to create an identical job.

DataBrew Job Schedule

Either type of job can be run on a schedule, on-demand or as part of other workflows (see “Jobs Integrations” below). There is only one recipe and one dataset per job, so processing multiple recipes and/or multiple datasets would require additional workflow.

Jobs Integrations

Aside from the console or a schedule, how else can a DataBrew job be started? For starters, the DataBrew API exposes all the functionality in the console, including running a job. When coupled with lambdas, this exposes a great amount of flexibility in starting a job.

A second option is to use a Jupyter notebook (vanilla Jupyter, not SageMaker notebook yet) and the plugin found at

Source Control Integration

Recipes and jobs have a form of versioning, but it seems to be S3 object versioning since there isn’t a real source control workflow, but rather a new version is created with every published update.

DataBrew Publish Recipe

DataBrew Recipe Versions

However, as with most of AWS’s online editors, there is no direct source control integration. The best you can do is to download recipes and jobs as JSON and check them in manually. Better than nothing but still surprising since AWS has CodeCommit.

Infrastructure as Code

At this time, neither Terraform nor Pulumi support DataBrew, but CloudFormation can be used to script DataBrew; see the for the API documentation and examples. The CLI is another scripting option, the documentation for the CLI is at

Cloud Native AWS Development On A Chromebook Using VS Code

Whew! That’s a lot to unpack there. Let’s start in the middle, with the Chromebook, They’re not widely known as development machines. They are limited in RAM, persistent storage and processor speed. The OS has limited application support, especially for ones which rely on arbitray code execution like most IDEs. They have great battery life, small size, and are usually easy to procure (exept in the current return to school COVID shortage).

The newer Chromebooks (see for a list of supported devices) have an interesting beta feature–the ability to install a Linux virtual machine. Termed Crostini, this is a supported way to install a Linux distro (Debian stretch). This opens up all kinds of possibilities, especially since the user’s home folder of the VM is mounted as local storage in the Chrome OS.

Microsoft’s VS Code is a fantastic free cross-platform code editor, and can be installed via the Linux command line. Not to give away any spoilers, but VS Code (and its extensions) runs seamlessly as if it is a native application even though it’s installed in the Linux VM. There is a rich ecosystem of extensions, a few of which will be useful for us.

Ckoud native development is more like infrstructure configuration and control; the definition is closer to “we’re connecting and orchestrating AWS’s services, rather than coding everything from scratch”. This is why you sometimes hear the term “infrastructure as code” (IAC)–we use a configuration language to write instructions to direct the activities of AWS’s services, then deploy the instructions to the service, and test our IAC. This works in our favor because we can write the code locally and deploy the code via AWS’s command line interface. Fortunately, the AWS CLI tooling can also be installed and run on the Linux VM, giving us a way to deploy our code from a Chromebook. AWS’s infrastructure language is defined using Cloud Formation, and VS Code has several Cloud Formation extensions, including a template library, to facilitate our work. When a custom coded application is needed, the serverless application model is used (the SAM pattern characteristics are small services with an API interface written in a cross-platform language and deployed in a container).

It’s sounding more promising that we can use a Chromebook for AWS development, doesn’t it? One last thing you’ll need–an AWS account, so get one if you don’t have one already. There is a free-tier, and everything I’ll be doing will use services or service limits which should be “free forever”.

Here’s how I set up my development environment. By the way–we’ll be using the Linux command line interface (CLI), which I am still learning also. If you’re not familiar, I’ve found this book to be invaluable:

  1. Install Crostini

The Crostini beta makes this step as simple as enabling the feature and waiting for several minutes. At this time, the VM is loaded on the main drive and consumes a lot of space, so depending on what else you want to do on your machne, you may need to minimize the number of other things you install or save on the main drive.

The installation instructions are at Make sure you read the entire page, since there is important information beyond just installation.

After you’re done, make sure to open the Terminal and run “sudo apt-get update” and “sudo apt-get upgrade” to make sure you have the latest patches.

  1. Install VS Code

Here we are in the CLI! Microsoft’s instructions can be found at Crostini is a newer Debian-based distro so it’s only a couple simple steps.

There is an easier way to install, though. You can downoad the .deb package from the official site, then right-click the package and choose “Install with Linux (Beta)” from the menu. The installer will run for a minute or two, and presto! VS Code will be installed.

install VS Code from .deb package

  1. Add CloudFormation Extensions

A moment of truth! Open VS Code, and if it runs, great job! Now it’s time for an extension to make our development easier, the CloudFormation extension,

CloudFormation instructions are called stacks (plural: stackset), and can be written in either JSON or YAML. Stacks are also difficult to write from scratch correctly. The templates in this extension will make our lives much, much easier.

  1. Install and Configure the AWS CLI v2 for Linux x86

AWS provides us a couple ways to administer its services. You can interact via the web-based console but ultimately we want to get to a higher degree of repeatability and automation in our deployment processes, and the CLI gets us a little closer. Installation instructions are at, and configuration is the next section.

Once installation is complete, you can open the Terminal and run “aws –version”. If that works, you’re good to go with configuration. The most important configuration is the access keys the CLI will use to connect to AWS service APIs. Although you can use your root user login, it’s not a good idea. Instead, you should create a new IAM user for the CLI. This is done in the AWS console, and there are a lot of steps in creating a new IAM user, but you can’t really screw anything up. Worst case, delete the user and start over. In step 2 of the wizard (image below) you’re asked to put the new user in a group or attach directly to an IAM profile. For now, it will be easiest if you attach directly to the AdministrativeAccess profile. We can always titrate this down later, but there are 580+ profiles to sort through and we don’t know which ones we need yet.

user wizard step 2

Once your new IAM user is created, you’ll need to follow the configuration for credential file settings at That’s all we need for now.

  1. Install git

Source control is always a best practice. VS Code, the extensions and the SDKs are all cross platform, so if you install the Windows or Mac bits on another machine, you can easily transfer your code from machine to machine with version history and backup.

Assitionally, we can also use GitHub Actions to deploy our code, but that’s for a future post. I plan to start with the CLI.

Installation for Linuz is found at


The crouton project may be able to be used to install Ubuntu side-by-side with Chrome OS (see but this is not a supported installtion. It would provide a more native Linux experience, though.

Deeper info about Crostini at

Feedback? Find me on Twitter at @rj_dudley.

GitHub/BitBucket For The People Who Just Got a Raspberry Pi and Are In Facebook Groups Asking Questions

(work in progress – kids woke up before I was done but I wanted to share what I had, updates will come)

Few things are more awesome than being curious and being able to spend a few bucks on a small board to follow your curiosities.  There is a lot of fun ahead, but there are also going to be frustrations.  I’ve been a professional software programmer for almost 20 years and I am frequently still stumped.  When you hit these times, don’t be discouraged, ask questions and experiment because the success is worth 100 frustrations.  I promise you.

I’m writing this because I’m in a couple Raspberry Pi groups on Facebook, and I see questions about better ways to store scripts and save documents every couple of days.  If you’ve asked that, your instincts are spot-on, and even if you’ve never written a line of code this instinct alone puts you above about 10% of the people I’ve ever interviewed for a job.  It probably also means you’re experimenting with code a bit (experimenting with code is how you get past those frustrations), and you need a better way to keep track of your changes than naming your files “”.  Been there, done that.  Let me explain the better way.

BTW – this blog post is meant to complement and explain already available documentation, not replace what exists.  So I’ll explain the fundamentals of what you’re doing but link to official documentation.

What are GitHub and Bitbucket?

GitHub and Bitbucket are two commercial services which provide source code management (SCM).  This means they’re meant to keep your code files safe and organized, and keep a history of your changes.  You can use this history to make something work, commit that working file, then screw it all up on your device trying something new, then be able to retrieve a previous version.  Because of the version history, you’ll sometimes hear these called “version control systems”, or just “VCS”.

Both GitHub (GH) and Bitbucket (BB) offer free plans with both public (for sharing) and private (because sometimes my code is embarrassingly bad when I’m tinkering) repositories.  They both offer very similar features, either one is a good choice for you, but GitHub is a lot more popular in the open source world.  I used Bitbucket for my personal work for a long time, but the popularity of other projects made GitHub my main system now.

Both GH and BB are based on an open source version control system called “git“, which was developed in 2005 by Linus Torvalds to use for developing Linux (see for the rest of the story).  GH and BB provide the cloud-based server side of keeping your code safe, you’ll still need something on your computer.  The basic git uses the BASH command-line shell on every OS (Linux, Mac, Windows, Raspbian, etc.), but there are also plenty of GUI clients available.  Because Rapsbian is derived from the same Linux kernel as Ubuntu, almost anyone which runs on Ubuntu will run on Raspbian.  That being said, I really recommend learning the BASH commands so you have a little better understanding of how git works, but also the Raspberry Pi has limited processing and the shell doesn’t take much to run.  Most of the git examples you’ll see use the BASH commands, and the GUIs just put the same commands behind a button and then most just run the shell anyway.

VCS Basics

I’m going to really simplify what you new RPi hackers need to know.  Some of the features are meant for teams of developers who automatically want to push code to a cloud computing environment and run all kinds of tests, while you’re just trying to get an LED to light up when you push a button.  Bottom line–you won’t need a lot of what GH and BB offer for a long, long while, if ever.

First and foremost is understanding a little code management strategy.  Both GH and BB offer unlimited repositories, so make use of that.  At its most basic, think of a repository as a main folder for your work.  If you get the LED to light up, and want to now try making a buzzer sound, you wouldn’t put those in the same folder on your RPi–you should separate the two.  You’d then make a corresponding separate repository (“repo” for short) for each project in GH or BB.  To get started you might copy the code from the LED project into your buzzer project, and that’s cool, but you’d still keep the two things separate.

When you need a new repo, there are two ways to get started.  If you’ve already started your project, your workflow would look like this (and I’ll explain a little more below):

  1. Create the remote repo on GH/BB (called “remote” because it’s not on your machine)
  2. On your device, use “git init” to create the local repo
  3. Link your local and remote repos
  4. add/commit/push your files

If you haven’t started a project yet, your workflow would look like this:

  1. Create the remote repo
  2. “git clone” to your device to create the local repo
  3. do your work
  4. add/commit/push your files

Either way, once you have your repo set up and first code committed, everything works the same.  Neither way of getting started is “wrong”, both exist because sometimes we code first and sometimes we repo first.

Also, one note, I’m kind of lazy, so when I say “GitHub” below, know that everything also applied to Bitbucket.  The web pages are different between the two, but since both are based on git, all the other aspects are almost exactly the same.

Explanation – Code First

Summary–since you already have a folder on your device full of code, we need to turn it into a local repository.  This is called “initialization”, and uses the “git init” command.  We also need to create a repository (aka folder) on GitHub, and then link the two using the “git remote add” command.  Finally, you add the files you want to put under version control using “git add”, commit the changes using “git commit”, and then copy the change from your local repo to your remote repo using “git push”.  You only need to “init” once per repo, and usually “remote add” once per repo, but the add/commit/push you’ll do over and over and over again, it’s how you send the different version to GitHub.

Read this to create a repo in GitHub:  This creates that “remote folder”.

Once you have your remote repository, you’ll need to initialize (“init”) your local folder to make it a local repo, choose the files you want to add to the repo, commit those files to version control, and finally push the changes to your repo.  Usually, using python on a RPi you just want to add everything in the folder (in professional development our tools–like Visual Studio or Eclipse–add a lot of other files, like settings files for color preferences and font size, we don’t share).  To init your existing code to GitHub, and add/commit/push, follow the guide at

When you commit changes, you’re prompted to add a commit message to the set of changes.  This is super-important, it’s where you make notes to yourself about what you’re doing, making it easier for you to go back to a specific version of your code.  A sentence or two is pretty good, but don’t slack off.  Trust me, you’ll kick yourself later if you do.

Log Analysis in Azure Data Lake, Chapter 3

Time for some real profiling, and right away I discover a reason to do this.  On my first run my script failed with this error:

Error occurred while extracting row after processing 12360 record(s) in the vertex' input split. Column index: 10, column name: 'user_agent'.


Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected.
This error can occur if double-quotes within the field are not correctly escaped as two double-quotes.

I hunted down the offending entry and found this data in the user_agent column (truncated for safety and space):

"}__test|O:21:\"JDatabaseDriverMysqli\":3:{s:2:\"fc\";O:17:\"JSimplepieFactory\":0:{}s:21:\"\\0\\0\\0disconnectHandlers\";a:1:{i:0;a:2:{i:0;O:9:\"SimplePie\":5:{s:8:\"sanitize\";O:20:\"JDatabaseDriverMysql\":0:{}s:8:\"feed_url\";s:3814:\" (lots more I'm not showing here)

This error occurs several times across a number of files, so it’s not just a one-off fluke.  Through some searching I found this was a Joomla exploit, but WordPress doesn’t seem to be affected.  This is interesting, and confirms exploits are being attempted against sites frequently.

Because this log entry isn’t parsing correctly, and is throwing a fatal error, I need to figure out both how to keep processing past the errors, and also how to catch and log these errors.

According to U-SQL Error: Unexpected number of columns in input record, there is a “silent : true” option I can add to my EXTRACT statement, but this is meant for times when the column number differs, not other types of errors.  Catching other errors is a little more complicated, and according to the same article the best option right now is with a custom extractor, so it looks like I might need that after all.  Eliminating invalid entries is one strong reason to have a transformed zone.

User Defined Extractor

Microsoft has some basic documentation on programming user defined extractors.  There isn’t any discussion about best practices for structuring a project or unit testing, but there’s a basic example.  At the time of his blog post the documentation is pretty lacking, so the rest of this post represents a couple weeks of trial and error.  I will be contributing to the documentation, too.  The code sample for this project can be found at

User-defined extractors (UDEs) are one of several types of user-defined objects (UDOs) supported by the Azure Data Lake.  If you’re familiar with the common language runtime (CLR) in SQL Server, UDOs are analogous to CLR Objects.  If you’re not familiar with CLR Objects or Azure Data Lake UDOs, you just need to know they are both .NET classes which can be called by the query code.  The methods in the UDOs can be called like a standard function.  Needless to say, UDOs can be very powerful.  In fact, one of the “killer features” of Azure Data Lake is being able to easily extend the functionality of the query language using languages and tooling you’re very familiar with.

To begin creating a UDE, you start by adding a new “Class Library For U-SQL Solution” to your solution.  This is a template project with all the necessary references, you could use a regular class library and then add all the references yourself.  Be lazy, use the template.


Since a UDE is a .NET class (in this example, C#), I also recommend unit testing.  There is a “U-SQL Test Project” but that’s meant for testing the U-SQL code.  Just use a regular class library and include your favorite unit test framework (my sample uses NUnit).

In your UDE project, add a new class file.  This class needs to implement IExtractor, and needs to be decorated with the SqlUserDefinedExtractor attribute.  This attribute indicates this is the class with the entry point.  IExtractor has one required method, named Extract, which as two parameters–input and output.  They are exactly what they sound like–input is the data we’ll be acting on, and output is the object we’ll build up through our processing.

public class Class1 : IExtractor    
     public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
         throw new NotImplementedException();

Input is an unstructured object, which means processing is completely up to us.  Whether it’s deserializing JSON or XML, or splitting on some other delimiter, or anything else, that’s probably the first thing we do.  On my case, splitting on a space was not the best solution because of the data structure.  Instead, I’ll use a regular expression (yes, I know, now I’ll have two problems) to parse the entry, and then piece the data together in a more coherent manner.

Inside the Extract function is where we do the initial splitting of the input, call our helper methods, and return the output.  The SELECT I wrote in Chapter 2 retrieves an entire file at one time, and passes that into the input (refer to  I then need to split the file on the newline character to get an array of the individual rows, then iterate through the rows, as seen in

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
string line;
//Read the input line by line
foreach (Stream current in input.Split(_row_delim))
using (StreamReader streamReader = new StreamReader(current, this._encoding))
line = streamReader.ReadToEnd().Trim();
LogRowParser splitter = new LogRowParser();
LogRowElements parts = new LogRowElements();
parts = splitter.ParseElements(line);
// …
yield return output.AsReadOnly();

view raw


hosted with ❤ by GitHub

I added two additional classes: LogRowParser and LogRowElements.  I did this to make unit testing easier.  Most of the actual work I need to do will be done in LogRowParser.

To reiterate, it’s inside the Extract method where we process the input–whether we deserialize XML or JSON, or split a file on a line terminator, or whatever.  The result will be a collection of objects, or lines of a file, or whatever, which we then process using additional classes.  We build a row of data using output.Set<T>(), which is then passed back into the U-SQL EXTRACT statement.

Returning data back to the EXTRACT statement looks weird but makes sense when you think about it.  The “output” object is of type IUpdateableRow, so we have to set the type and value of each column using Set<T>(index, value) method.  For this to work correctly, you need to follow the same order as specified in the EXTRACT statement.

Once all the inputs have been processed, we then have a dataset we can perform one or more SELECT statements against.  In our model, the results of these SELECT statements would then be persisted in our transformed zone.  That’s the next chapter, though.

Once the UDE code is written and our tests pass, it’s time to really test it out.  If you’re so inclined you can create a local debugging instance (see but that seems like a lot of effort for limited functionality.  Personally I found it easier to deploy to the cloud instance and run there, and you need to get the assemblies there anyway so it’s not wasted knowledge.

Just as you need to register SQL CLR objects in your SQL Server database, you also need to register UDOs in your U-SQL catalog.  Documentation is at

To save you some time: in Visual Studio, you register assembly by right-slicking on the C# project, choosing “Register Assembly”, change any values you need to (I kept all defaults for the purposes of this blog series), and clicking Submit.


Once your assembly is registered, you can submit your U-SQL script.  If all goes well, your results will end up in your transformed zone.  This took me weeks of my “after the kids are in bed” time to get this all working.

End Notes

This blog post went through many, many edits as I worked this all out over several weeks, and it’s still terribly stream of consciousness for which I apologize.  It took some time to put all the pieces together, so if your stuff doesn’t work at first just keep working on it.  As I mentioned above I’ll contribute to the official documentation as well as continuing this blog series.

Turns out using a regex to parse web server logs is a fairly common practice.  Most of the examples focus on a pattern for the whole string, such as and for a couple examples.  My host’s log format is a little different than the Apache combined log format, so I needed to tweak that regex a bit.  I used to build my regex, which is saved at  I did try to use this pattern, but it proved to be difficult.  I may try again later in this series to improve the code, but to just get the thing working I decided to split on quoted text pattern, using  This means I’ll have to do some manipulations to the data during extraction but that’s an opportunity to learn.

Data Warehouse vs Data Lake

In discussions of data lakes, questions about the differences between and the future of data warehouses inevitably arise.  In the modern corporate information factory there is a place for both, and with the availability of managed cloud offerings it’s possible to do both even on a small scale.

Let’s start by reviewing data warehouses.  A data warehouse is a highly curated, highly structured database.  A data warehouse may use fact tables or dimensional cubes, and the ETL process is designed to put only the highest quality data into rigid structures.

Data warehouses excel and answering the same questions over and over with high fidelity and high performance.  This is an important business capability since many enterprises rely on periodic sales reports or inventory levels or manufacturing details.  Additionally, data warehouses are designed to return these answers quickly, and reports can often be run on demand with up-to-the-minute data.

Because they are structured for a purpose, and both ETL and queries are schema bound, changing a data warehouse is not a trivial task.  This doesn’t lend itself to just keeping data around to explore “what-if” scenarios or exploratory data analysis.  Data warehouses aren’t good structures to keep data around “just in case”, so there are occasions where you’ll find yourself thinking “if only we had the rest of the data!”

Data lakes are designed to be that way to keep the rest of the data in a cheap and explorable way.  Data lakes are designed to store large amounts of data in its raw form, and to provide different ways of exploring the data.  Although you could set up a data lake for reporting, that’s not the best use of the technologies.  Data lake technologies are less suited for repeated answers, and are more designed for exploratory analytics.  Data lakes are essentially unindexed heaps, so queries may take some time run when executed on demand.  A lot of answers (or inputs to analyses) from data lakes are precalculated using map/reduce jobs.

Let’s run thorough a thought experiment and see how a data warehouse and a data lake can work together.  Imagine we’re a manufacturing company with a data warehouse which we use to report on manufacturing output, error rates, scrap amounts, etc.  Raw data are written on paper forms, and supervisors enter the data into an application.  An ETL process takes the data from this application’s database and loads both the data warehouse and the data lake.

Some of the company leaders have been reading studies which show the affect of changing the brightness of lights in a factory on manufacturing output.  It has been decided that we’ll run week-long tests with brighter and dimmer lighting, and see which one gives us better results compared to a previous week.

After a week of elevated lighting and a week of decreased lighting, we run our usual reports out of the data warehouse and see that elevated lighting seemed to improve output and decrease shrink by a small percentage, while decreased lighting had a small negative effect.  It seems clear that we should brighten the lights to squeeze out a little more productivity.  But then someone has an observation–the week we turned the lights up was unseasonably sunny, while the week we turned the lights down was typical crappy January.

How easy would it be to test the effect of weather on our study using our data warehous?  Since weather could include temperature, sunlight and precipitation, all of which would need to be added to dimensional models it probably wouldn’t be easy.  And since we’re operating on a hunch, the work may not be worth it.  However, it would be very easy to download historical data for the weeks of our test, or even for the past several years, into our data lake and run some ad-hoc analyses looking at one or more weather feature.  If it turns out we want to add weather to our reports, we can update our data warehouse, ETLs and reports.  We could even download forecasts and do some predictive analytics.

The allure of big data technologies isn’t just the ability to work with huge datasets, but being able to run dozens or hundreds of analyses on medium-to-small datasets for analyses just like this.  When paired with traditional warehousing, data professionals have two very powerful tools.

Dirty Bits: Flynn, or Flinn? PennDOT vs USPS

In discussions around data quality, one of the challenges I frequently cite is numbered roads which also have a name.  Most roads in my area are this way–PA Route 68 is Evans City Road, PA Route 356 is New Castle Road, PA Route 19 is William Penn Highway, and PA Route 8 is William Flynn Highway.  Mailing addresses reference the name, but when you give verbal directions most people use the number.  The original location of my wife’s flower shop was 190 New Castle Road, and that’s what we printed on our business forms.  But when giving directions, we’d tell people “at the intersection of 68 and 356”.

Because “190 PA Route 356” and “190 New Castle Road” are synonymous, they don’t really fall into one of the best known data quality dimensions.  Both are accurate, and so we need an easy way to know one refers to the other for searches and matching.  There are different ways of handling this, but those aren’t the focus of this post.

Working for a mortgage company, addresses are essential to us, but touch so many other parts of our daily lives without us knowing it.  With multiple ways to abbreviate “William” and “Highway”, and having both a name and a number, PA Route 8 becomes a serious data quality challenge.  Here is a small set of data from the Allegheny County Health Department (


Having all these forms for the same road makes it very difficult to look up an inspected facility.  ACHD could use a hand in standardization and canonical forms of addresses.

But wait, there’s more.  And I’m not making this up.

What if, for as long as anyone can remember, the road signs spelled the name wrong?  That’s exactly what happened here.  The road was named for a prominent builder and developer, William Flinn (  There are even historical markers with the correct spelling where the road begins.  But every road sign, and hence every map, GPS,  business sign and stationery were all printed with the wrong spelling of Flinn.  This goes back decades, based on my informal poll of old timers.

According to Google StreetView imagery, PennDOT replaced the road signs sometime between 2007 and 2011 (Google also got significantly better cameras in that time).

2007 2011
image image

This is a road I drove every work day for 15 years and noticed this only recently, but it was reported back in 2013 (  How did this happen?  According to the article,

Steve Cowan, a PennDOT spokesman, told me that no one at the agency knew how the signs with the “Flynn” spelling came to be along the state route. He confirmed that the “Flinn” spelling is correct and requested that the public notify PennDOT if they see any road signs that say otherwise.

Google Maps was updated sometime between 2016 and 2018, but other map applications still have the incorrect spelling.  That’s because the USPS still uses the spelling “Flynn”, and actually corrects Flinn to Flynn:


And now we’ve entered into the classic “system of truth” dilemma.  Do we use the PennDOT spelling, which is correct, or do we use the USPS, which is generally accepted as truth by address verification systems and determines delivery point verification?

I wondered if the ACHD adopted the correct spelling yet…



Are business planning to change?  According to the TribLive article,

“It should be ‘Y-N-N,’ ” said Carole Wolfe, office assistant at John Utz Agency, an insurance firm at 4485 William F-L-Y-N-N Highway. Her sentiment was shared by several other residents and business owners I talked with on Route 8 this week.

“Actually,” I told her, “it’s supposed to be ‘I-N-N.’ ”

Though she said her curiosity is piqued by the name issue, it is not enough that the office letterhead will ever change.

“I don’t pay attention to it because in my mind, it’s Flynn.”

Again, nope.

Handling this is going to take some creative solutions, and we’re still working on the details.  We can’t rely on our AVSs at this time, so we’ll probably have to insert some asserted equivalencies and data quality flags to make sure the addresses aren’t re-AVSd and incorrectly changed.  Cowabunga, dude, as if a vanity name wasn’t enough.