Data Warehouse vs Data Lake

In discussions of data lakes, questions about the differences between and the future of data warehouses inevitably arise.  In the modern corporate information factory there is a place for both, and with the availability of managed cloud offerings it’s possible to do both even on a small scale.

Let’s start by reviewing data warehouses.  A data warehouse is a highly curated, highly structured database.  A data warehouse may use fact tables or dimensional cubes, and the ETL process is designed to put only the highest quality data into rigid structures.

Data warehouses excel and answering the same questions over and over with high fidelity and high performance.  This is an important business capability since many enterprises rely on periodic sales reports or inventory levels or manufacturing details.  Additionally, data warehouses are designed to return these answers quickly, and reports can often be run on demand with up-to-the-minute data.

Because they are structured for a purpose, and both ETL and queries are schema bound, changing a data warehouse is not a trivial task.  This doesn’t lend itself to just keeping data around to explore “what-if” scenarios or exploratory data analysis.  Data warehouses aren’t good structures to keep data around “just in case”, so there are occasions where you’ll find yourself thinking “if only we had the rest of the data!”

Data lakes are designed to be that way to keep the rest of the data in a cheap and explorable way.  Data lakes are designed to store large amounts of data in its raw form, and to provide different ways of exploring the data.  Although you could set up a data lake for reporting, that’s not the best use of the technologies.  Data lake technologies are less suited for repeated answers, and are more designed for exploratory analytics.  Data lakes are essentially unindexed heaps, so queries may take some time run when executed on demand.  A lot of answers (or inputs to analyses) from data lakes are precalculated using map/reduce jobs.

Let’s run thorough a thought experiment and see how a data warehouse and a data lake can work together.  Imagine we’re a manufacturing company with a data warehouse which we use to report on manufacturing output, error rates, scrap amounts, etc.  Raw data are written on paper forms, and supervisors enter the data into an application.  An ETL process takes the data from this application’s database and loads both the data warehouse and the data lake.

Some of the company leaders have been reading studies which show the affect of changing the brightness of lights in a factory on manufacturing output.  It has been decided that we’ll run week-long tests with brighter and dimmer lighting, and see which one gives us better results compared to a previous week.

After a week of elevated lighting and a week of decreased lighting, we run our usual reports out of the data warehouse and see that elevated lighting seemed to improve output and decrease shrink by a small percentage, while decreased lighting had a small negative effect.  It seems clear that we should brighten the lights to squeeze out a little more productivity.  But then someone has an observation–the week we turned the lights up was unseasonably sunny, while the week we turned the lights down was typical crappy January.

How easy would it be to test the effect of weather on our study using our data warehous?  Since weather could include temperature, sunlight and precipitation, all of which would need to be added to dimensional models it probably wouldn’t be easy.  And since we’re operating on a hunch, the work may not be worth it.  However, it would be very easy to download historical data for the weeks of our test, or even for the past several years, into our data lake and run some ad-hoc analyses looking at one or more weather feature.  If it turns out we want to add weather to our reports, we can update our data warehouse, ETLs and reports.  We could even download forecasts and do some predictive analytics.

The allure of big data technologies isn’t just the ability to work with huge datasets, but being able to run dozens or hundreds of analyses on medium-to-small datasets for analyses just like this.  When paired with traditional warehousing, data professionals have two very powerful tools.

Dirty Bits: Flynn, or Flinn? PennDOT vs USPS

In discussions around data quality, one of the challenges I frequently cite is numbered roads which also have a name.  Most roads in my area are this way–PA Route 68 is Evans City Road, PA Route 356 is New Castle Road, PA Route 19 is William Penn Highway, and PA Route 8 is William Flynn Highway.  Mailing addresses reference the name, but when you give verbal directions most people use the number.  The original location of my wife’s flower shop was 190 New Castle Road, and that’s what we printed on our business forms.  But when giving directions, we’d tell people “at the intersection of 68 and 356”.

Because “190 PA Route 356” and “190 New Castle Road” are synonymous, they don’t really fall into one of the best known data quality dimensions.  Both are accurate, and so we need an easy way to know one refers to the other for searches and matching.  There are different ways of handling this, but those aren’t the focus of this post.

Working for a mortgage company, addresses are essential to us, but touch so many other parts of our daily lives without us knowing it.  With multiple ways to abbreviate “William” and “Highway”, and having both a name and a number, PA Route 8 becomes a serious data quality challenge.  Here is a small set of data from the Allegheny County Health Department (http://webapps.achd.net/Restaurant/):

image

Having all these forms for the same road makes it very difficult to look up an inspected facility.  ACHD could use a hand in standardization and canonical forms of addresses.

But wait, there’s more.  And I’m not making this up.

What if, for as long as anyone can remember, the road signs spelled the name wrong?  That’s exactly what happened here.  The road was named for a prominent builder and developer, William Flinn (http://surewhynotnow.blogspot.com/2017/12/the-william-flinn-not-flynn-highway.html).  There are even historical markers with the correct spelling where the road begins.  But every road sign, and hence every map, GPS,  business sign and stationery were all printed with the wrong spelling of Flinn.  This goes back decades, based on my informal poll of old timers.

According to Google StreetView imagery, PennDOT replaced the road signs sometime between 2007 and 2011 (Google also got significantly better cameras in that time).

2007 2011
image image

This is a road I drove every work day for 15 years and noticed this only recently, but it was reported back in 2013 (http://triblive.com/opinion/nafarivanaski/4304741-74/road-signs-flynn).  How did this happen?  According to the article,

Steve Cowan, a PennDOT spokesman, told me that no one at the agency knew how the signs with the “Flynn” spelling came to be along the state route. He confirmed that the “Flinn” spelling is correct and requested that the public notify PennDOT if they see any road signs that say otherwise.

Google Maps was updated sometime between 2016 and 2018, but other map applications still have the incorrect spelling.  That’s because the USPS still uses the spelling “Flynn”, and actually corrects Flinn to Flynn:

image

And now we’ve entered into the classic “system of truth” dilemma.  Do we use the PennDOT spelling, which is correct, or do we use the USPS, which is generally accepted as truth by address verification systems and determines delivery point verification?

I wondered if the ACHD adopted the correct spelling yet…

image

Nope. 

Are business planning to change?  According to the TribLive article,

“It should be ‘Y-N-N,’ ” said Carole Wolfe, office assistant at John Utz Agency, an insurance firm at 4485 William F-L-Y-N-N Highway. Her sentiment was shared by several other residents and business owners I talked with on Route 8 this week.

“Actually,” I told her, “it’s supposed to be ‘I-N-N.’ ”

Though she said her curiosity is piqued by the name issue, it is not enough that the office letterhead will ever change.

“I don’t pay attention to it because in my mind, it’s Flynn.”

Again, nope.

Handling this is going to take some creative solutions, and we’re still working on the details.  We can’t rely on our AVSs at this time, so we’ll probably have to insert some asserted equivalencies and data quality flags to make sure the addresses aren’t re-AVSd and incorrectly changed.  Cowabunga, dude, as if a vanity name wasn’t enough.

BBQ Sizing for Relative Story Estimates

I’m not a fan of story points.  Because they’re numbers, there is always the temptation to try and turn them into hours or days or fractions of a sprint.  Plus you frequently have to re-explain your scheme (Fibonacci?  Doubling?  Custom?).  T-shit sizing is a little better, but what does “size” mean?  If it’s time, then you’ve baselined against hours or days.  Or is it effort?  Or riskiness?

When planning work, the questions usually boil down to “what can we get done quickly” and “what it going to take a while”.  Sprints are filled to capacity, but so are stomachs.  Instead of saying “we can produce 24 story points”, think instead of “we can cook 8 hot dogs, six burgers, two steaks and a rack of ribs”, or “we can only get one Freddy Flintstone rack of ribs done”.  In short, you’re not planning sprints anymore, you’re planning tailgate parties.

  Estimate Meaning
Hot dog with mustard.png hot dog Cooks quickest on a grill, so something which can be finished quickly.
hamburger Takes a little longer to cook, but still pretty quick.
steak Longer, but so much more worth it than a hot dog.  You also have the option of “well done” or otherwise to indicate compromises may may you sick.
ribs Even longer, these require attention to detail so they don’t burn.
freddy Freddy Flintstone rack of ribs The longest cook time.  Depending on the team, this may be all you can do in a sprint.