Setting Up Neo4j on Azure VM

NB: I’m leaving this up for continuity purposes, but MS Open Tech no longer exists, so the VM Depot is no longer being updated (see https://msopentech.com/blog/2015/04/17/nextchapter/).  Newer versions of Neo4j will need to be installed the usual way using VMs.

It’s time for me to get back to experimenting with different datastores and data structures (and burn some Azure credits I’m basically wasting).  One datastore I’m interested in for my day job is the graph database Neo4j.  Relationships are fascinating to me, and a graph database stores relationships as data you can query.  There are DBaaS (managed, cloud-based Neo4j) providers such as graphstory, but for getting started and learning it’s probably cheaper to run your own instance, and here I’ll show you one way to set one up.  Fortunately, Neo Technology (the company behind Neo4j) created a VM image on Microsoft’s VM Depot, which we can use to spin up an Azure VM.

  1. Obviously, you need an Azure account.  If you don’t have one, you need to create one.  Despite the promise of “Free Account”, running VMs is not free on Azure, and the cheapest option for me was $13/month (prices at https://azure.microsoft.com/en-us/pricing/details/virtual-machines/#Linux).   It’s not terrible, especially if you remember to turn off your VM when you’re not using it.  The day job gets me MSDN credits, and anyone in the same boat can probably run a small VM without worries.
  2. It would also be a good idea to know some Linux, because that’s the OS.  If you don’t know the difference between SSH and LTS, you might want to pick up a used copy of Ubuntu Unleashed for 12.04 LTS for a buck or so.  It’s scary thick, but don’t panic, it’s organized well enough to be used as a reference.
  3. In order to publish a VM Depot image to your Azure account, you need a PublishSettings file (which is similar to a WebDeploy file, if you know what those are).  Just click https://manage.windowsazure.com/publishsettings/index?client=xplat and save the file locally.  You don’t need to do anything else, even though there are additional instructions on the page.
  4. Find the Neo4j Community on Ubuntu VM.  This VM is Neo4j 2.0.1 and the current Neo4j is 2.3, so it’s a little behind but good enough as a sandbox.  (This link might change if the Ubuntu OS or Neo4j version is updated, so if it’s broken let me know and I’ll update this post.)
  5. On the VM Depot page, click the “Create Virtual Machine” button.  If you haven’t logged in you’ll be prompted to do so, and then you’ll need to provide your PublishSettings file.
  6. Next you’ll get to choose your DNS name, VM username and a few more options.  Pay attention to the ADVANCED settings: the default machine size will cost you about $65/month, so this would be a good time to scale it down a bit.  This is also a good time to change the default ports for Neo4j or SSH if you want to.
  7. Now wait about 10 minutes for everything to get set up.  The publish runs as a background process, so you can close the window; you’ll get an email once it’s complete.

Once you get the confirmation, you’re ready to start using Neo4j!

Transactional vs. ODS Talking Points

When considering implementing an operational data store, discussion always includes the differences between an ODS and a transactional database.  Transactional databases store the data for an application.  An ODS consolidates data from one or more transactional systems, and can serve as a source of master data, as a reporting source, or as one source for a data warehouse.  While the purposes are pretty clear, how they differ at a design level is less clear.  Here are the talking points I’ve used in the past to describe the differences.

Transactional databases

  • are optimized for write performance and ensuring consistency of data
  • mainly inserts and updates, no table rebuilds
  • low level of indexing, mainly primary keys and the lookups needed
  • high use of foreign keys
  • use of history and archive tables for no longer current data
  • index and data fragmentation are a concern due to updates, and maintenance jobs need to be utilized
  • data are normalized
  • however, frequently updated data are often separated from less frequently updated data to reduce table fragmentation
  • data are raw

Operational data stores

  • are optimized for reads
  • mainly inserts and table rebuilds via ETL from transactional systems, few updates
  • high level of indexing to support querying
  • low use of foreign keys, since relations are maintained in the transactional databases
  • no history or archive tables–ODSs are for current data
  • low level of normalization, since updates are usually on the same schedule and in a batch process
  • data are sometimes calculated or rolled-up (rather than saving a birthdate, use a demographic age)
  • data may be bucketed
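
To make the contrast concrete, here is a minimal sketch using Python’s built-in sqlite3 module.  The table names (customer, address, ods_customer), the columns, and the age buckets are all made up for illustration; they are assumptions, not a recommended schema.  The transactional side is normalized with a foreign key and stores the raw birthdate, while the ODS side is dropped and rebuilt in one batch, flattened into a single table, indexed for reads, and stores a bucketed age instead of the birthdate.

```python
import sqlite3
from datetime import date


def age_bucket(birthdate: date, today: date) -> str:
    """Roll a raw birthdate up into a coarse demographic age bucket."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    age = today.year - birthdate.year - (0 if had_birthday else 1)
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 65:
        return "35-64"
    return "65+"


def rebuild_ods(conn: sqlite3.Connection, today: date) -> None:
    """Batch ETL step: drop and rebuild the ODS table from the transactional tables."""
    conn.execute("DROP TABLE IF EXISTS ods_customer")
    # Flattened, no foreign keys: relations live in the transactional tables.
    conn.execute(
        "CREATE TABLE ods_customer ("
        "customer_id INTEGER, name TEXT, city TEXT, age_bucket TEXT)"
    )
    rows = conn.execute(
        "SELECT c.id, c.name, a.city, c.birthdate "
        "FROM customer c JOIN address a ON a.id = c.address_id"
    ).fetchall()
    conn.executemany(
        "INSERT INTO ods_customer VALUES (?, ?, ?, ?)",
        [
            (cid, name, city, age_bucket(date.fromisoformat(born), today))
            for cid, name, city, born in rows
        ],
    )
    # Extra index to support querying; built once per rebuild.
    conn.execute("CREATE INDEX ix_ods_customer_bucket ON ods_customer (age_bucket)")
    conn.commit()


# Hypothetical transactional schema: normalized, raw data, foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        birthdate TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(id)
    );
    INSERT INTO address VALUES (1, 'Columbus'), (2, 'Dayton');
    INSERT INTO customer VALUES
        (1, 'Ann', '1980-06-15', 1),
        (2, 'Bob', '2010-01-02', 2);
    """
)

rebuild_ods(conn, today=date(2016, 1, 1))
for row in conn.execute("SELECT * FROM ods_customer ORDER BY customer_id"):
    print(row)
```

Calling rebuild_ods again simply replaces the table, which is why fragmentation and update handling matter much less on the ODS side than on the transactional side.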

Exactly when to use an ODS and how the schema is designed is a discussion about balancing data duplication vs application architecture.

The update schedule of an ODS is determined partly by the needs of the ODS data consumers, and partly by what the transactional databases can tolerate.  Usually ODS updates are a batch job which runs once or several times a day.  For more frequent updates, commanding could be used.