Log Analysis in Azure Data Lake, Chapter 2

My task this time is profiling the log data.  I love profiling data; to me, it’s like people-watching at the mall.  You find a lot of interesting stuff just by looking.  The profiling task will actually accomplish four things:

  1. I’ll get to test recursive or wildcard querying in U-SQL.
  2. I’ll get to see if the subfolders named for the log year are a good idea, based on how messy #1 is.
  3. I’ll have some profiled data I can use when I start my analysis queries.  I know from past experience that one of the columns in a log file is the byte size of the resource being requested, but not all resources have a byte size (like a default route URI).  In this case the server stores a “-” instead of a numeric value.  This would break my analytics unless I handled this situation.
  4. I’ll see if saving results to a flat file in an output folder is a good idea, or if I should switch to database tables (see the sketch after this list).  Profiling is an intensive task, and sometimes it’s good to offload objects from memory, or run steps in parallel.  In either of these cases, outputting to a flat file isn’t always the best idea.
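
For #4, the database-table option would look roughly like the sketch below.  The database and table names are placeholders I’m making up for illustration, and @input stands for the rowset I extract later in this post:

// A sketch only: LogAnalysis and LogProfileStaging are placeholder names,
// and @input is the rowset extracted later in this post.
CREATE DATABASE IF NOT EXISTS LogAnalysis;

DROP TABLE IF EXISTS LogAnalysis.dbo.LogProfileStaging;

// CTAS: land the extracted rows in a managed table instead of a flat file
CREATE TABLE LogAnalysis.dbo.LogProfileStaging
(
    INDEX idx_ip CLUSTERED(ip ASC) DISTRIBUTED BY HASH(ip)
) AS SELECT * FROM @input;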

Before I start any profiling, I need to get my log data out of my files and into a consolidated dataset.  That’s the focus of this blog post; the actual profiling will start in the next post.

Results: Getting one file to work

First thing I’m going to do is get data from just one hardcoded file.  If I know one file works, I always have a fallback if things are being wonky.  The Azure Data Lake Explorer in Visual Studio (not to be confused with the Azure Storage Explorer I mentioned previously, which is a standalone application) has several useful functions on its right-click menu.  When I drill down to a file, I can Preview the first 100 lines, which will give me an idea of how the U-SQL EXTRACT function will interpret the file with zero work on my part.  The second function is creating that EXTRACT script, which I can use as-is or tweak to suit my needs.

[Image: the Azure Data Lake Explorer right-click menu in Visual Studio]

With just a click of a button, I get this, and I can see how the files will be processed.  Apache server logs are sort of space delimited, but not really, which is why they’re so hard to process in standard tools.  For instance, the timestamp is wrapped in square brackets, with a space between the time and the offset.  This results in Column 3 having the date and time, plus a leading “[”, and Column 4 having the offset and a trailing “]”.  Also, the method, URL and protocol are all quoted in the log file, so they end up in the same column, Column 5.

[Image: File Preview of the Apache log, showing how the columns are interpreted]

Since this log is for a blog hosted on a shared server, I don’t have control over the file format, but the challenges are what make this project fun.

To get useful data, I’m going to have to clean up and reassemble the timestamp, and also split Column 5 into its components.  This is exactly why I have a raw and a transformed layer in my storage: I don’t want to have to do these (and other) transforms in the query every time I want to run some analysis; I’d prefer to run my analyses against the transformed data.
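
As a rough sketch of where I’m headed, the cleanup might look something like this in U-SQL.  I’m using the column names from the EXTRACT script later in this post, and the exact expressions will probably need tweaking once I see the profiling results:

@cleaned =
    SELECT ip,
           remote_user,
           // strip the brackets and put the time and offset back together
           timestamp.TrimStart('[') + " " + offset.TrimEnd(']') AS request_time,
           // break the quoted "METHOD /path PROTOCOL" value into its parts
           method_file_protocol.Trim('"').Split(' ')[0] AS method,
           method_file_protocol.Trim('"').Split(' ')[1] AS url,
           method_file_protocol.Trim('"').Split(' ')[2] AS protocol,
           response,
           size
    FROM @input;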

Something to consider at this point is that I could build a user-defined function which would process the rows a little more intelligently (see Extending U-SQL Expressions with User-Code), and maybe I’ll do that later on.  The temptation would be to run analyses against the raw data and use the UDF to basically create an in-memory transformed layer at query time.  I think I’d still want the transformed layer, though, and save on the CPU cycles and memory usage.
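
If I do go the UDF route, the call would look something like the snippet below.  LogAnalysis.Udfs.CleanTimestamp is a hypothetical code-behind (C#) method I haven’t written; it’s just here to show the shape of the approach:

@parsed =
    SELECT ip,
           // hypothetical code-behind method that strips the brackets and
           // returns a proper timestamp; it does not exist yet
           LogAnalysis.Udfs.CleanTimestamp(timestamp, offset) AS request_time,
           response,
           size
    FROM @input;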

The File Preview also has a “Create EXTRACT Script” button, which opens the EXTRACT Script Builder.  Here I can change column names and adjust types.  Note that the preview and column type inference only look at the first 100 lines of the file, so some excitement probably lurks further down (which is why I’m profiling in the first place).

[Image: the EXTRACT Script Builder]

Column 7 is the byte size of the requested URL, which I know to be a possible problem spot.  The script builder interprets this as an integer, but scrolling down a few lines confirms my suspicions: two entries have a byte size of “-”, which won’t parse as an integer.  At this exact moment I can’t decide whether I want to convert “-” to zero or null (zero will be counted in averages, null won’t), so for now, since I’m just building a profiling script, I’m going to type this column as a string and deal with it after I analyze my profiling results.

[Image: preview rows showing “-” in the byte size column]
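
When I do make that call, the conversion should be a one-liner in the transform query.  Something like this sketch, which treats “-” as null so those rows stay out of the averages (swap the null for 0L if zero turns out to be the better choice):

@typed =
    SELECT ip,
           response,
           // "-" means no byte size was recorded; null keeps those rows out of averages
           (size == "-" ? (long?) null : long.Parse(size)) AS size_bytes
    FROM @input;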

Now it’s time to profile my selected file.  This EXTRACT script is generated to run against the one file I selected to preview, so later on I’ll still need to work in wildcards or recursive queries.  As a starting point, though, one file is perfect.

Before I Submit this script, I also need to add an OUTPUT statement, since that’s part of the definition of a complete U-SQL script.  This has the net effect of rewriting the raw file into the transformed folder with a comma between the columns, but that’s fine for a first run.  Here’s the full script of my initial test:

@input =
     EXTRACT ip string,
             identd string,
             remote_user string,
             timestamp string,
             offset string,
             method_file_protocol string,
             response int,
             size string,
             domain string,
             url string,
             user_agent string,
             forwarded string
     FROM "adl://redactedprojctname.azuredatalakestore.net/mylogs_raw/2017/access.log.45"
     USING Extractors.Text(delimiter:' ');
 

OUTPUT @input
TO "adl://redactedprojectname.azuredatalakestore.net/mylogs_transformed/input45.txt"
USING Outputters.Csv();

I Submit the script, and success!  But that’s only one file.  How to process all of my logs?

Results: Wildcard and subfolders

This could not have been easier in U-SQL.  All I had to do was change the FROM clause to read as below, and every log file in both folders was processed into one @input.  I did not have to make a list of subfolders, then loop through the subfolders and read files one at a time like we would have to do in C#.  U-SQL handled everything for me.

FROM "adl://redactedprojectname.azuredatalakestore.net/mylogs_raw/{*}/access.{*}"

I can start querying the whole dataset now!  Your mileage may vary based on how much data you have, but this works for me for now.

For a little more information about using files and filesets, see Files and File Sets as Inputs and Outputs (U-SQL).
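
One more fileset trick I’ll probably want later: naming the folder token instead of using {*} exposes it as a virtual column, so each row remembers which year folder it came from.  I haven’t tried combining a named token with the {*} wildcard yet, so treat this as a sketch:

@input =
    EXTRACT ip string,
            identd string,
            remote_user string,
            timestamp string,
            offset string,
            method_file_protocol string,
            response int,
            size string,
            domain string,
            url string,
            user_agent string,
            forwarded string,
            log_year string  // virtual column populated from the folder name
    FROM "adl://redactedprojectname.azuredatalakestore.net/mylogs_raw/{log_year}/access.{*}"
    USING Extractors.Text(delimiter:' ');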

Saving Windows RT

I consider the release of Windows RT to the consumer market to be one of the worst decisions Microsoft has made in recent years, and I have an $853MM writedown to back me up.  RT shipped primarily on the Surface RT, which isn’t an attractive personal device—it’s small, relatively costly, difficult to connect to the usual suite of peripherals and doesn’t sit well in your lap.  Additionally, here was a version of Windows which wouldn’t run any previous Windows program.  Consumers were used to getting a new computer with a new version of Windows and simply reinstalling their favorite old greeting card maker or photo editor.  Months later, when Windows 8 was released, confusion multiplied—now there were two versions of Windows—a “right one” and a “wrong one”, and your average consumer couldn’t tell the difference by looking.  Consumers literally needed someone with technical knowledge to tell the devices apart.  Add to that an app store which had few desirable apps, and it’s no wonder interest in RT was so low.  The release of the Surface 3 running only Windows 8 puts the future of RT into greater doubt.

Having said that, RT could still be one of the greatest versions of Windows of all time.  How?  Improve the concept of enterprise application stores, and make RT the next Windows Embedded.  It’s not as crazy as it sounds.  I’ve helped manage installations of WinTerms for sales teams, and hundreds of handheld and lift mount devices in multiple warehouses, and this idea is a bit of a dream come true.

Windows 8 ships with a hard-coded attachment to the Microsoft store.  Make it simpler for enterprises to set up their own internal app store, and control the store setting via group policy.  Enterprises could easily distribute their in-house apps, or those supplied by ERP/WMS/etc vendors to the issued devices.  At a previous employer—a warehousing company—we had to manage hundreds of devices in multiple warehouses around the country.  We had to have someone onsite manually dock each one, and we had to go through a complicated set of steps to update the wimpy onboard apps.  If we could have posted an updated app on our internal store and have every device update itself automatically in seconds, that would have been a dream come true.  Intermec and Symbol should be all over this idea.

Take this one step further.  Remember the fires in the Tesla Model S?  A software fix to correct how the car rides at freeway speed was downloaded to all the Model Ss.  Now imagine Ford replacing Sync with RT, and being able to do the same for control or entertainment systems.  Speaking of entertainment systems, keep the linkage to the movies and music stores so movies can be downloaded while parked at a McDonald’s.  The capabilities in RT would put Ford years ahead of its competitors when it comes to onboard systems.  This could be extended into onboard systems for trucks as well.

Take this one more step.  Imagine battlefield updates to combat systems, downloaded via AWACS or properly equipped drones from a secure DOD app store.  It’s not too far-fetched.

Vehicles and warehouse equipment alone offer the potential of millions of devices running RT.  By looking at RT as a new Windows Embedded, Microsoft thinks big by thinking small.

Slides for “The Data Bath” at Pittsburgh Tech Fest

Thanks to everyone who attended! You can download the slides handout at The Data Bath Handout.

If you’re a SlideShare fan, you can find these same slides at http://www.slideshare.net/rjdudley/the-data-bath.

Additional references for the SimMetrics library are at the end, but the main reference for installing into SQL Server is Beyond SoundEx-Functions for Fuzzy Searching in MS SQL Server. All the algorithms have great entries in Wikipedia.

How we did EDI via AS2 with /n software’s AS2 Connector and BizTalk 2009

Two “lives” ago, I led the team of enterprise developers.  We did everything from the data warehouse/BI to LOB apps to systems integration.  It was good times; we kept busy.  It is an amazing company, small in headcount but with big revenues and big needs.  As our trading partners and services grew, we needed to significantly upgrade our EDI capabilities, including AS2.  After several months of evaluating solutions, we settled on BizTalk because it was very flexible with EDI mapping, could multicast documents (which we needed to do), and would handle other types of messaging as well (we had a requirement for XML between several systems).  We went with BizTalk 2009, which, as we found out later, had its share of issues and limitations.

One of the limitations of BizTalk’s AS2 connector is that it had to run on the same machine as BizTalk (I don’t know if this has changed or not).  This meant either having a second license of BizTalk just for AS2 (cost prohibitive), putting a production server in the DMZ (stupid) or poking a hole into our internal network (over the network admin’s dead body).  Time to find a new, simple, cost-effective solution. 

This time the decision was significantly easier.  We looked at a number of options, from hosted solutions to AS2 apps, but /n software’s AS2 Connector was exactly what we needed (they moved the current version of the connector to their RSS Bus product line, so don’t panic since the company brands don’t match).  Just to clarify, /n software’s EDI integrator is a component for building your own AS2 solutions.  The AS2 Connector is a pre-built application with most or all of the functionality you need—this is what fit the bill for us.

In a nutshell, here’s what we did:

1. Installed the AS2 Connector on a web server in our DMZ.  Since we had several web servers already, and AS2 is pretty low bandwidth, nothing additional was required here besides the SSL certificate.  Setup and config was insanely easy on our IIS box.

2. The version we used dropped all the AS2 files into one folder. To make it easy for BizTalk’s processing rules, we needed to sort them by trading partner.  The connector did have the ability to call a batch file after a receive was complete.  We wrote a PowerShell script (called by a BAT file) to read the ISA line, and move the files to a folder named for the trading partner ID.  We also had T and P folders, based on the test indicator.  This was back in 2009—I think the current version does this now without needing a “sorting hat” script.

3. On that same web server, we had an FTP server set up.  We secured it to only accept connections from a particular IP (corresponding to our BizTalk server), and had a specific firewall route exclusively for the BizTalk server into the DMZ.

4. We scheduled BizTalk to check the folders every few minutes.  One of the downsides to this approach is that you lose BizTalk’s file system watcher capabilities.  BizTalk picked up the files via FTP and processed them per the rules we had configured.

What we ended up with was a very flexible system that was easy to expand as we brought on new trading partners, and we could meet all kinds of crazy new requirements.  We actually started to become the go-to integration partner because of how fast we could adapt to changes and the processing we could do on the received information.

Of huge importance for a couple of our trading partners we brought on later was having a Drummond Certified solution.  Fortunately, the AS2 Connector was (and still is) Drummond certified.

Something to remember is that AS2 is not EDI—AS2 is just a way of transferring files.  You can send nearly any file type via AS2.

Yahoo’s CAPTCHA Broken…Is a Spam Tsunami in the Offing?

Uh oh…

The CAPTCHA security system that Yahoo and many other email service providers adopt to prevent spam may not be secure, according to Russian security researchers. The researchers claim to have found a way in which the security system can be compromised. This would result in a huge increase in spam coming from Yahoo and other email accounts.

Full story at http://internetcommunications.tmcnet.com/topics/broadband-mobile/articles/18772-yahoos-captcha-brokenis-spam-tsunami-the-offing.htm