I’m fascinated by data lakes and the analytics they can power. Done poorly, though, a data lake can turn into a data swamp (see http://rjdudley.com/prevent-swampification-in-your-data-lake/). I love what Microsoft is doing with the Azure Data Lake and its storage, analytics and U-SQL components. I’ve been looking for an excuse to work with it, and I’ve been interested in better ways to analyze website logs. I’m writing these blog posts as I build out the data lake, so they may wander or things may change as I progress. We’ll see how this works out.
First things first: setting up your storage. With a data lake, you have the option of just throwing everything into one heap, but that’s not a great idea. This would leave us trying to do analysis directly against raw data, and having to account for any data quality issues or conversions in our analysis queries. Over time, the format of our raw data may change, adding a layer of complexity to these queries. Natural lakes have different zones (see http://www.lakeaccess.org/ecology/lakeecologyprim9.html) based on temperature/light/nutrient characteristics, so if we’re extending the lake metaphor, data lakes should also have a number of zones.
While traditional data warehouses are designed for ETL (extract, transform, load), data lakes are designed for LET (load, extract, transform). I’ll follow the LET pattern and have a raw “zone” (actually a folder) where we land the log files, use U-SQL for the extraction and transformation, then store the results in a transformed “zone” (again, just a folder). At least that’s the initial idea. Since this project is small scale, I can use simple folders; larger data may need a different strategy.
Another complication is that my host names the files by week-of-year, but with no year, so after 52 weeks I would start overwriting files. I have a couple of options here: I could prepend a year onto the filename when I load it, or I could load the files into folders named by year. Since this is small and I’m manually uploading files, I’m going with subfolders in the raw zone, named for the year. This may change later as I automate the load process. Fortunately I can rearrange the raw zone with some PowerShell scripts or the Azure Storage Explorer. Again, YMMV, so I highly recommend burning every penny of Azure credits you have until you figure out the best strategy for your needs.
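To make the two naming options concrete, here’s a minimal Python sketch of how the destination path in the raw zone could be built either way. Everything named here is a placeholder for illustration: the `/raw` folder name, the function names, the `week07.log` filename and the year are assumptions, not what my host or my lake actually uses. It only builds paths and doesn’t touch Azure at all.

```python
from pathlib import PurePosixPath

# Hypothetical name for the raw zone folder; pick whatever fits your lake.
RAW_ZONE = PurePosixPath("/raw")


def raw_path_by_folder(filename: str, year: int) -> PurePosixPath:
    """Keep the host's filename and place it in a per-year subfolder."""
    return RAW_ZONE / str(year) / filename


def raw_path_by_prefix(filename: str, year: int) -> PurePosixPath:
    """Prepend the year to the filename and keep one flat raw folder."""
    return RAW_ZONE / f"{year}_{filename}"


if __name__ == "__main__":
    # "week07.log" is a made-up example of a week-of-year filename.
    print(raw_path_by_folder("week07.log", 2016))  # /raw/2016/week07.log
    print(raw_path_by_prefix("week07.log", 2016))  # /raw/2016_week07.log
```

Either way, the same week number from two different years ends up at two different paths, which is all that’s needed to stop the overwriting. The subfolder version leaves the original filenames untouched, which is handy when comparing against what the host provides.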
Now that I have my two zones, plus raw subfolders, and the log files for the last few months uploaded, it’s time to start the E and T.