The IoT Journey: Introducing the Raspberry Pi Expansion Capabilities

In my previous post in this series, I introduced the Raspberry Pi 3 and provided the steps necessary to build and configure it to run C# applications developed using Visual Studio on Windows. You need to know those basics in order to proceed to more IoT-relevant projects, so if you're just getting started in the IoT world with the Raspberry Pi, please make sure you understand the concepts discussed in the first post.

In any IoT project, you'll likely want some external sensor capability beyond what the Raspberry Pi itself provides: temperature and humidity sensors, barometric pressure sensors, altimeters, and so on. The list is limited only by your creativity and imagination.

The Raspberry Pi is very versatile here: its expansion capabilities allow us to both read and write data using several common interface protocols.

In this post, I'll discuss the Raspberry Pi's expansion header and how to write data to and read data from externally connected "things". Admittedly, this post is going to enter "uber geek" territory quickly. If you're the type of person who is easily intimidated by raw electronics, I'd suggest that even though it might look intimidating and complicated, working with electronic components and circuitry is not all that difficult and can be extremely rewarding. So please, read on and give it a go. You won't be disappointed.

Step One – Obtaining the Necessary Components

When I was much younger, one of my favorite things to do was to grab my allowance savings and free battery card (ah, the memories) and run down to the local Radio Shack store to see what cool things I could afford. Although the stores have declined in popularity and capability somewhat, they have seen a resurgence due to the popularity of the "maker" community, and many stores do stock the components necessary to experiment with IoT. That said, I have found some very comprehensive kits from SunFounder that put all of the parts into a single package. While there are many different kits available, I found that for my experimentation the following two were the most useful:

Super Starter Kit – This kit has pretty much everything you’ll want, although it is a little light on some of the more useful sensors.

Sensor Kit – This kit makes up for the missing sensors above and then some.

For the purposes of this post, you’ll need the following components (which are available in the Super Starter Kit):

  • Breadboard – The breadboard allows you to place the components together in a circuit without the need to solder anything
  • Raspberry Pi GPIO Expansion Board – This simply plugs into the breadboard and makes it easy to connect the components to the Pi
  • Expansion Cable – The Raspberry Pi expansion header is 40 pins and the cable will connect the header on the Pi to the Breadboard
  • 220 Ohm resistor – This limits the current flowing through the LED so that neither the LED nor the Pi's GPIO pin is damaged
  • LED – We will write code to turn the LED on and off
  • Switch – We will use code to read the state of the switch and take action


Once you have the components, you'll want to connect everything to the expansion header on the Raspberry Pi. Pay close attention to the pin layout: if you use the rainbow cable from the SunFounder kit mentioned above, connect it with the black wire (pin 1) closest to the end of the Raspberry Pi that contains the SD card. This can be confusing because the cable will fit in either direction, but it's extremely important to connect the correct pin numbers to the breadboard. Otherwise you could find yourself applying the wrong signal or voltage to a sensor, at best destroying the sensor and at worst destroying the Pi.


Just make sure that the cable is correctly connected, and you’ll be OK.

Step Two – Prepare the Raspberry Pi

Once you’ve obtained the components and have plugged the expansion cable into the Pi, we need to make sure we have the appropriate development libraries installed in order to take advantage of the capabilities of the Pi.

The primary expansion capability on the Raspberry Pi is accessed through a series of General Purpose Input-Output (GPIO) pins. In short, the GPIO interface allows you to connect external devices directly to the Raspberry Pi and then interact with them. The "General Purpose" nature of this interface is exactly what makes the Raspberry Pi so popular with "makers", and is what allows us to connect so many different devices to the Pi. That's the good news. The bad news, of course, is that because the interface is "General Purpose", we need to write all of the code necessary to interact with the device ourselves. Fortunately, there are a number of very good libraries already available in the public domain that are designed to do exactly that.

Arguably the most popular and useful library is WiringPi, made available by Gordon Henderson. Gordon has made his library available through various channels, and he is very good about reacting quickly to community feedback and requests for enhancements. This really does demonstrate the power of the open source community and is one reason I've chosen to use open source tools for all of my IoT work.

The easiest way to install the WiringPi library is to use git (a tool that allows you to interact directly with source code repositories) and build it directly on the Pi. To do this, you'll first need to make sure you have the current version of git installed. Connect to your Pi via SSH (see the first post in this series for a refresher on how to do this) and then run the following command: sudo apt-get install git-core . (Git is usually preinstalled on the Raspbian image, but you want the latest version, which this command will install.)


Once git is installed, change directories to the DevOps folder that you created for the Hello application (created in the first post in this series) and then execute the following command: git clone git://git.drogon.net/wiringPi . This will copy the wiringPi source code onto your Pi.


Once the source code is on the Pi, you'll need to compile it so it can be used by your applications. Fortunately, Gordon makes this simple and has provided a build script to do all of this. Change directories to the wiringPi directory and start the script by typing ./build .


You are now ready to start developing applications for the Pi that interact directly with the GPIO interface.

Step Three – Preparing your Development Environment

Now that your Pi is configured and ready to receive your IoT applications, you need to prepare your development environment so that you can develop code that will run on the Pi and interact with the wiringPi library. This can get tricky, because we're going to be developing our code in C# on Windows, but the wiringPi library is written in C for Linux. The first step is to create shared libraries from the code we just compiled on the Pi. For our purposes, there are three specific libraries that we will create. Change to the wiringPi folder (inside the wiringPi folder that you cloned) and execute the following commands to create the libraries:

  • cc -shared wiringPi.o -o libwiringPi.so – This command uses the compiler to create the basic library – this is the core library that we will use
  • cc -shared wiringPiI2C.o -o libwiringPiI2C.so – This command creates the library specific to the I2C protocol – more on this in later posts
  • cc -shared wiringPiSPI.o -o libwiringPiSPI.so – This command creates the library specific to the SPI protocol – more on this in later posts


The next step in utilizing the wiringPi library is to build a C# wrapper that encapsulates the functionality of the C libraries and allows you to use the C-on-Linux library as just another reference in your project. This is done in C# / Visual Studio by declaring external references in a code file, which tell the C# code where to "enter" the C code and what the specific interface looks like. As you might imagine, it can get fairly complex trying to read through the C source code to locate the routines that you want to use and then generate the appropriate wrapper. Fortunately for us, a gentleman by the name of Daniel Riches has taken the time to do just that and has made his work available as open source (yet another reason to love the open source community!).
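To give a sense of what such a wrapper involves, here is a minimal, hypothetical sketch of the pattern: each native wiringPi function is exposed to C# through a DllImport declaration that points at the shared library we built above. The class and namespace names here are illustrative only; Daniel's actual WiringPi.Net wrapper is far more complete and may name things differently.

using System.Runtime.InteropServices;

namespace WiringPiSketch
{
    // Hypothetical wrapper class - the real WiringPi.Net project exposes similar calls
    public static class NativeGpio
    {
        // libwiringPi.so is the shared library compiled on the Pi earlier in this post
        [DllImport("libwiringPi.so", EntryPoint = "wiringPiSetupPhys")]
        public static extern int WiringPiSetupPhys();   // returns -1 if initialization fails

        [DllImport("libwiringPi.so", EntryPoint = "pinMode")]
        public static extern void PinMode(int pin, int mode);

        [DllImport("libwiringPi.so", EntryPoint = "digitalWrite")]
        public static extern void DigitalWrite(int pin, int value);
    }
}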

In order to take advantage of the work Daniel has done, you'll need to clone his git repository on your Windows development environment. You should already have Git for Windows installed, but if not, download and install it, then start the command line utility (which is PowerShell-based), switch to the root of the folder where you want to install the library, and issue the command git clone https://github.com/danriches/WiringPi.Net.git


This will install the source code and test harness for the WiringPi.Net library, which you will use for all of the code that you develop moving forward.

Once the library is installed, open Visual Studio and load the WiringPi.sln file that is included in the library. Right-click on the WiringPi project and then select Build.


This will create a .dll that you will reference in all of your other projects. Alternatively you can simply add the WiringPi code directly to each of your new solutions moving forward, but I personally find it easier to just add the .dll reference. (note, the default is to build the .dll in Debug mode, which is fine for our experiments, but if you are going to build production applications you’ll probably want to build it in Release mode)

Once your development environment is set up, you're ready to develop the "Hello World" equivalent in the IoT world…

Step Four – Create a Simple IoT Application

In any IoT application, the basic components are a device or sensor that collects information and a repository for that information. For the purposes of this post, we will focus on the sensor aspect of IoT. In most cases, a sensor in a real-world application collects data in analog form and converts it to a digital signal in order to communicate with the host (in our case via the GPIO pins on the Raspberry Pi). Converting analog information into a digital signal is a non-trivial process, so when you are beginning to learn to develop IoT applications, you want to start with the most basic tasks and work your way up to the more complex ones, learning along the way. One of the most basic tasks you can perform is using code to light an LED. (It may sound like a very simple and easy task, but there is a lot to be learned from this simple exercise.)

For this step, you'll need an LED (any color will do), a 220 ohm resistor (color code red-red-brown on a four-band resistor, or red-red-black-black-brown on a five-band one) and the wires necessary to connect them on the breadboard.

Wire the Breadboard

If you are using the same GPIO expansion header that I am from the Sunfounder kit, this will be relatively easy for you since the GPIO signals are marked on the header and it’s easy to understand what you are connecting to. If you aren’t using that, you will definitely want to pay attention to the Raspberry Pi pinout diagram below:

[Raspberry Pi GPIO pinout diagram]

There is a fair amount of confusion around how the pins are referenced by documentation and code, and this pinout diagram is one of the best available for showing the translation between the physical pin number and the pin number as referenced by code. Because hardware does occasionally change, most software libraries reference pins by function rather than by physical location.

In our case, we are going to use GPIO 5 (physical pin 29) to drive the LED, and will use the 3.3v power (physical pin 1 or 17) as our power source. On the breadboard, plug the LED into an open space, and then connect the shorter leg to GPIO 5 and the longer leg to the resistor. (This is very important: the LED has two legs, one longer than the other, and it must be installed in the correct orientation to work.) Connect the other side of the resistor to the 3.3v supply. The circuit might look like this:


Note that I am using a header board that automatically connects the power to the edges of the breadboard, so I simply ran the 3.3v from that rail instead of from pin 1 or 17.

Once the circuit is complete, you can build a very simple application to turn the light on and off.

Developing the Hello World Application

In Visual Studio, create a new C# Windows Console Application and name it PiLED.


Once Visual Studio creates the project, right-click on the references node and select Add Reference. Select Browse, and browse to the location where you compiled the wiringPi library earlier. Select wiringPi.dll and then Add it to the project references.


This will ensure that the appropriate references are present for the code we’re about to write.

The basic process that we need to follow in order to instruct the Raspberry Pi to turn the LED on and off is:

  1. Initialize the GPIO Interface and tell it how we will reference the pins.
  2. Tell the GPIO Interface that we are planning to write to it
  3. Instruct the appropriate GPIO pin to turn LOW (which enables current to flow from the 3.3v power through the resistor and LED, thus lighting the LED)
  4. Pause (so that we can witness the LED in an ON state)
  5. Instruct the appropriate GPIO pin to turn HIGH (and thus disable current flow through the LED)
  6. Wait for a key press
  7. Exit the program

The completed code (in the file program.cs in your project) looks like this:

using System;
using WiringPi;

namespace PiLED
{
    class Program
    {
        // Enum to represent the pin mode (READ/WRITE) and pin state (HIGH/LOW) values used by wiringPi
        public enum PinMode
        {
            HIGH = 1,
            LOW = 0,
            READ = 0,
            WRITE = 1
        }

        const int RedLedPin = 29; // LED is on GPIO 5 (Physical Pin 29)

        // This is a console application with no GUI interface. Everything that happens will be in the shell.
        static void Main(string[] args)
        {
            Console.WriteLine("Initializing GPIO Interface"); // Tell the user that we are attempting to start the GPIO
            // Init.WiringPiSetupPhys() initializes the GPIO interface using physical pin numbering
            // and returns -1 if the initialization fails.
            if (Init.WiringPiSetupPhys() != -1)
            {
                GPIO.pinMode(RedLedPin, (int)PinMode.WRITE);      // Set the mode of the GPIO pin to WRITE (the method requires an integer, so cast it)
                GPIO.digitalWrite(RedLedPin, (int)PinMode.HIGH);  // Ensure that the LED is OFF
                Console.WriteLine("GPIO Initialization Complete");
                Console.WriteLine("Press Any Key to Turn LED On");
                Console.ReadKey();                                // Pause and wait for the user to press a key
                GPIO.digitalWrite(RedLedPin, (int)PinMode.LOW);   // Turn the LED on
                Console.WriteLine("LED should be on");
                Console.WriteLine("Press Any Key to turn the LED Off and Exit");
                Console.ReadKey();
                GPIO.digitalWrite(RedLedPin, (int)PinMode.HIGH);  // Turn the LED off
            }
            else
            {
                Console.WriteLine("GPIO Init Failed!"); // If we reach this point, the GPIO interface did not successfully initialize
            }
        }
    }
}

 

Deploying the HelloWorld Application

Once you have the code in Visual Studio, compile it by using the Build option. Once the code is compiled, you will need to use a file transfer program to copy the .exe as well as the wiringPi.dll file to a directory on the Pi. (I use FileZilla, and created a new folder in the ~/DevOps folder called PiLED.)


Once the files are copied, you will need to execute the application on the Raspberry Pi.

Executing the HelloWorld Application

Connect to the Raspberry Pi and then change to the directory where you copied the files (in my case ~/DevOps/PiLED) and execute the application by using the mono command. In this case, since we are interfacing directly with the hardware, we must run the application as root by using the sudo command. The command is sudo mono PiLED.exe


Once you start the application you will see the "Initializing" message. Press a key and the LED will turn on. Press another key and the LED will turn off and the program will exit.
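If you want to experiment a little further, the same wrapper calls can be used to blink the LED in a loop. The sketch below is a hypothetical helper method you could drop into the Program class above and call after the GPIO interface has been initialized; it assumes the RedLedPin constant and PinMode enum defined earlier.

static void Blink(int count)
{
    for (int i = 0; i < count; i++)
    {
        GPIO.digitalWrite(RedLedPin, (int)PinMode.LOW);  // pin low: current flows from 3.3v through the LED, so it lights
        System.Threading.Thread.Sleep(500);              // leave it on for half a second
        GPIO.digitalWrite(RedLedPin, (int)PinMode.HIGH); // pin high: no current flows, the LED goes dark
        System.Threading.Thread.Sleep(500);              // leave it off for half a second
    }
}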

Congratulations! You’ve just written your first application that will become the foundation for further IoT application development.

Future posts in this series will build on what you’ve created here and will demonstrate some of the other protocols as well as how to send data “to the cloud” to truly experience IoT development.

The IoT Journey : Getting Started with the Raspberry Pi 3

If you are involved with “Big Data”, “Advanced Analytics” or “Cloud Computing”, you’ve likely heard all the hype around the “Internet of Things” or “IoT”. It used to be that IoT meant things like the connected refrigerator, or networked thermostats. Now it seems like IoT is being applied to just about anything that can connect and share information, be it “wearables” that track fitness information, RFID tags that track objects, or more complex ideas like Smart Meters and connected buildings. In short, IoT is currently in the midst of a major hype cycle, and as such everyone seems to be talking about it, or wondering what it’s all about.

Simply put, IoT at its core is a connected device (it doesn't have to be connected to the Internet) that shares some data about itself somewhere with something.

One of the most talked about devices over the last couple of years has been the credit-card sized computer, Raspberry Pi. The Raspberry Pi was originally designed to be used in the classroom to teach kids about programming and electronics, but due to its capability (there are those who use the Pi as their primary computer!) and price (you can buy a new Raspberry Pi for $35 in the US), an entire community of hobbyists and professionals use the Raspberry Pi for their work and play.

I have been working with a Raspberry Pi 2 for the last year, but I had never gotten the WiFi adapter that I purchased to work properly, so I was really excited to hear that not only is the Raspberry Pi 3 faster, it also has onboard WiFi. I want to utilize WiFi so that the device can be portable and used wherever I might be working.

If you want to learn how to develop IoT applications, there is no better place to start than with a Raspberry Pi and the Pi community. Be warned though, this community will suck you in, and you will find yourself immersed in a world of code, wires, sensors and frameworks before you know it!

The First Steps

One of the major announcements that Microsoft made with Windows 10 is that there is a version that will run on the Raspberry Pi. This version of Windows is called “Windows IoT Core” and is designed to be the OS layer for IoT devices. If you are a developer that focuses on Windows technologies, this is a natural platform for you to gravitate towards. Of course the “New Microsoft” embraces Open Source platforms as well as our own, so I thought it would be interesting to see how far I could extend my Windows development skills into the IoT world using the popular open source Linux Operating System. This post marks the beginning of that journey…

Step One – Obtain the Hardware

There are many places where you can buy a Raspberry Pi 3, including Amazon, Fry's Electronics (in the US) and RS-Online (in the UK). For this project, I decided to buy the kit offered by the Microsoft Store (I was already there, and had other things to buy, so why not?). The specific items I purchased are:

  • Raspberry Pi 3 – This kit comes complete with an SD card with the NOOBS installer already configured. In my case, though, the SD card was NOT a Class 10 card, meaning it did not actually work properly with Windows. It was fine for Linux, which is what I ended up using for the reasons stated above, but it is something to look out for. The Microsoft Store has subsequently fixed that issue, so any new orders will include the correct card.
  • Raspberry Pi Enclosure – I wanted to make sure that the little computer was properly protected and looked good, so I decided to buy the official enclosure. There are plenty of different cases available for the Pi, but this is the "official" one with the proper logo.
  • Power Adapter – This is an important piece! Power is supplied to the Raspberry Pi via the ubiquitous micro-USB port, and most of us have tons of these just lying around. I wanted to make sure, though, that I had a proper adapter that supplies the full 2.5A that the Pi will demand.

Once I returned home and unpacked everything, it all looked like this:

 


Assembly is a very simple process: just drop the board onto the posts in the case bottom, and then snap the rest of the pieces together around it:


Once everything is assembled, you simply plug the SD card into the Pi, attach a USB keyboard and mouse along with an HDMI cable connected to your TV or monitor, and you're ready to go. You should also plug in a regular network cable if you can; it will give the setup utility the ability to download the latest OS.


There is no power switch on the Pi, so to boot it you simply plug the power adapter into your AC power.

Step Two – Boot and Configure

If you purchased a version of the Raspberry Pi with NOOBS included on the SD card, you simply have to insert the card and boot. If not, you'll need to first download a copy of the NOOBS installer and follow the instructions on how to set it up. It really is a simple process (you just format a new SD card and then copy the contents of the NOOBS archive onto the card), so there's nothing to be concerned about. Once the Pi boots, you'll see the configuration page:


Once NOOBS has started and displayed the available OS images to install, make sure you select the appropriate region and keyboard settings at the bottom of the page, select the Raspbian image (it should be the first one on the menu), and then select Install. This will start the installer, which will take about 10-15 minutes.


Once the installer completes, the Pi will reboot, start Raspbian Linux, and then automatically log in as the user "pi".


Once the Pi has restarted properly and Raspbian is up and running, you can configure the WiFi connection by clicking the network icon (upper-right of the screen, just to the left of the speaker icon) and joining an available WiFi network.


Once you are connected to the WiFi network, you’re ready to configure the remote access capabilities of the Pi, as well as changing the password of the Pi user.

Step Three – Configure for Remote Access

If you are a Windows-only person and haven’t spent much time using Linux, this is where things are going to get a little confusing. If you’re used to the Linux world, then most of what I put here is going to be extremely basic for you.

Linux was designed first and foremost to be a Server Operating System platform, and as such much of the work that you will do will be from a command line shell. The shell is accessed via the Terminal program, which is started by double-clicking on the icon (It looks like a monitor and is located to the right of the menu button in the GUI).


Once in the shell, execute the command sudo raspi-config, which will start the configuration manager. (By the way, in Linux case matters when typing commands; "Sudo" and "sudo", for example, are not equal.) The sudo command basically tells the OS that you want to execute the command as an administrator ("root" in Linux terms).


Use the arrow keys to navigate to Advanced Options, and then select the SSH option. Enable the SSH Server. (SSH is the remote access tool used to enable a secure shell from one computer to another)


Once the SSH server is enabled, you will also want to change the time zone (located in the Internationalisation settings) as well as the password for the pi user. You can select the Finish option at this point and then type exit to leave the terminal program/shell.

Note: There is a very good guide located here that explains how to enable the remote GUI access as well, along with instructions on how to obtain and download appropriate programs for Windows to access the Pi remotely. For me, I don’t use the GUI that often, but when I do I use a program called Ultra VNC Viewer which is very lightweight and simple. For my remote shell, I’m using the Insider Preview of Windows 10, which includes the Bash Shell. For those not in the insider program, you can use the PuTTY tool mentioned in the article or any remote SSH tool. For file transfer, I’ve found that FileZilla is probably the easiest to use.

Configuring SSH to use Certificate Authentication

I hate having to type passwords, and tend to do the extra up-front work to configure any remote machine that I'm working with to use certificate authentication instead of passwords. If you don't mind typing passwords, you can skip this section. But if you're like me and hate typing them, start your preferred shell on the machine you will connect from (in my case, the Bash shell on Windows 10) and execute the ssh-keygen command. Do not enter a passphrase for the key. This will create a new key pair and install it into the ~/.ssh folder.


Once the key is generated, execute the ssh-copy-id command, using the IP address of the Raspberry Pi (either WiFi or Cabled, depending on which method you’re using to connect) as the destination and the user “pi” as the user. You will be prompted for the password of the Pi user, but after that you will not be prompted again.


Once this is done you are ready to test the ssh command to see if you can connect without password authentication. Type ssh pi@<ip address or name of remote machine> to connect:


Congrats! You now have a remote connection to the Raspberry Pi without using a password.

 

Step Four – Preparing the Development and Test Environment

Once remote access is configured and working, you are ready to prepare the Pi for IoT development and testing. I am a big fan of Microsoft Visual Studio (which you can download for free), and since most of the development work I do is related to the various demos and proof-of-concept projects I build for customer presentations, I didn't really want to learn a new environment just to play with the Raspberry Pi. I also thought it would be an interesting test to continue to write code in C# on Windows, then deploy and execute that code on the Raspberry Pi. As you will see, this turns out to be an almost trivial task for simple applications, although as later posts in this series will show, it does present some serious challenges as well.

The first step to enabling the execution of C# code on the Raspberry Pi is to download and install the Mono framework. The Mono project is an open source implementation of the Microsoft .NET Framework that allows developers to easily build cross-platform applications. It is very easy to install on the Pi using the built-in Linux package management commands.

To install the Mono framework on the Raspberry Pi, first update all of the application repositories by using the apt-get update command. (remember to execute the command as root by using the sudo command)


Once the update is complete (depending on the version of Raspbian and the speed of your Internet connection, it can take as little as a minute or as long as 10 minutes) you can then install the complete Mono framework by executing the apt-get install mono-complete command. (again, don’t forget to run it as root by using the sudo command)


Once Mono is installed (This will likely take several minutes to complete) you are ready to develop and deploy your first simple application.

Step Five – Hello Raspberry Pi!

No “how to” article would be complete without a “Hello World” application, and I certainly don’t want to disappoint. To start, on your Windows PC, launch Visual Studio and create a new C# Windows Console Application. Title it “PiHelloWorld”.


Then in the Program.cs file, add the following code. Note that you are targeting any CPU and using the .NET 4.5 framework.
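A minimal version of the program looks something like the sketch below; the exact message text is up to you, as anything that writes to the console will do.

using System;

namespace PiHelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            // The same compiled binary runs on Windows and, via Mono, on the Raspberry Pi
            Console.WriteLine("Hello World from the Raspberry Pi!");
        }
    }
}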


Then once you are happy with the code, select Build to build the application. Once the application builds without errors, copy the PiHelloWorld.exe file to the Raspberry Pi using a file transfer utility as discussed above. (I use FileZilla)


Once the file is copied, switch back to the Raspberry Pi and execute the code with the mono command. Remember that Linux is case-sensitive!


This will execute the application and prove that the app is actually running on Linux on the Pi, even though it was developed on Windows in C#.

Conclusion

This blog post details the start of a journey, and explains how to get a Raspberry Pi 3 ready to develop and deploy IoT applications written in C#. Following posts in this series will explore some of the sensors that can be connected to the expansion port of the Pi, and will also explain the process of connecting the Pi to Microsoft Azure and the Microsoft IoT Suite of tools available in Azure.

Using Power BI Desktop to Create Data Visualizations and Explore Data

In previous posts, I discussed how to create an Azure SQL Database using data from the City of Chicago Data portal. (See previous posts on this blog for more information) Part of the process was showing how Microsoft Excel can be used in conjunction with Power Pivot and Power View in order to create data visualizations. Using tools such as Excel to build visualizations is very practical because many people are comfortable working with Excel, and sharing the data visualizations is a matter of simply sharing the Excel workbook. There are, however, situations where a richer design canvas is required, or you want to publish your visualizations to a wide audience that may not have Excel installed.

Microsoft has recently released the Power BI Desktop, a tool that will allow you to create full-featured data visualizations on your desktop, and then publish them to the Power BI service (which has a free tier as well as a professional, paid tier) allowing you to share your visualizations with anyone that has a web browser and Internet connectivity.

Downloading and Installing Power BI Desktop

Before attempting to install the Power BI Desktop application, read through the system requirements to make sure that your system will support it. To download, click the download button on the Power BI Desktop page at https://powerbi.microsoft.com/desktop. Save the file to a local folder and then double-click it to install. Follow the instructions on the screen and install the Power BI Desktop application along with any prerequisites the installer determines it needs. Once the install completes, start Power BI Desktop, which will open the following screen:

Once the application starts, you are ready to build your first dashboard. For the purposes of this post, we’ll recreate the visualizations that we built from the same data source using Excel Power View. (See Part 3 in my previous posts)

Connecting to Azure SQL Database and Loading Data

Click on the Get Data button and select the Microsoft Azure SQL Database option.

Click the Connect button and then enter your Azure SQL Database Server and Database information and then click OK.

When prompted, select Database Authentication and enter your connection credentials.

If you receive an error related to the firewall rules, follow the instructions from Managing the Database in Part 1 and then click the Refresh Preview button to refresh the data connection. If you do not receive an error, skip to the Navigator section below.

Once the refresh is complete you will see a list of objects in the database (which will look slightly different than the picture below as there are more objects in my database than what I’ve documented in previous posts)

On the Close & Load button, select Close to return to the blank canvas. When the canvas loads, select the Recent Sources button and select your database connection. This will open the Navigator.

Select the ChicagoCrimes, Community, DateDimension, PoliceStations and SocioEconomicFactors tables and then click the Load button.

The data load will start (during this process, Power BI Desktop is loading the data into a local Data Model that is using the same xVelocity data engine that Power Pivot uses) and will take some time to complete.

Once the data is loaded, we will want to ensure all of the appropriate relationships are in place and will also use the DAX language to create a very simple calculated measure to ensure that appropriate visualizations are easy to create. You will also want to click the Save command to save the data model at this time.

Managing Data Relationships

When the data load completes, you’ll be returned to the Power BI Desktop canvas. You will want to verify that the relationships between the fact table and dimension tables were properly detected. (At the time I’m writing this article, there is an issue with Power BI Desktop where it did not detect the relationship between the DateDimension table and the ChicagoCrimes fact table). Click on the Relationships icon on the left side (the bottom button) to open the data diagram view.

If there are any missing relationships (in the image above, the relationship between the ChicagoCrimes table and the DateDimension table is missing), click the Manage Relationships button on the ribbon to open the Manage Relationships dialog.

Click New to create a new relationship. Select the ChicagoCrimes table from the first drop-down, and then highlight the DateOnly column. In the next drop down select the DateDimension table and verify that the Date column is selected. And then select OK.

Select Close and verify that the relationship has been created. When all the relationships are created, the data model should look like this:

 

Once the relationships are properly created, we’ll want to add a calculated measure to simplify the creation of visualizations using the model.

Creating a New Calculated Measure

For the purposes of this post, we will create a very simple calculated measure that provides a count of the number of crime reports in the data. To create the measure, we’ll use the DAX language (that is typically thought of as a Power Pivot language, but since Power BI Desktop uses the same engine, the language is used here as well) to create the measure.

Select the Data icon on the left side (the middle icon) to open the data view, and ensure that the ChicagoCrimes table is selected in the Fields area on the right.

Click the New Measure button in the ribbon to enable the formula bar. Enter the DAX expression TotalReports = COUNT([ID]) and then click the checkmark icon to ensure that the expression is correct.

 

Once the new measure is created, we will want to change the data type of the Year column in the DateDimension table to ensure that appropriate reports can be easily created. Because Power BI Desktop is designed to minimize errors and simplify the creation of reports, there are sometimes manipulations that need to be made to the data in order for Power BI to properly render visualizations of it. In the case of the DateDimension table, we want to use the Year column as an axis in a timeline. Because the Year column is currently defined as an Integer type, Power BI Desktop will treat that column as a measure, which makes it difficult to use on a chart axis. The simple workaround is to change the data type of the column to Text.

In the fields section, select the DateDimension table and then select the Year column. From the ribbon, change the Data Type to Text.

Once these changes are made, you are ready to create a report based on the model.

Creating a Report

For the purposes of this post, we’ll recreate the visualization that was created in Excel Power View in Part 3 of the earlier blog series. Click the Report icon on the left side (the top icon) to open the report design canvas.

From the Fields area, select the TotalReports measure in the ChicagoCrimes table and then expand the DateDimension table and select the Year column. A column chart will be created on the canvas.

In the Visualizations area, select the Line Chart icon to change the visualization to a line chart. On the canvas, click a blank area and then in the Fields section, select the TotalReports measure as well as the Location Description field in the ChicagoCrimes table. On the canvas, drag the resulting visualization to the right of the line chart and resize it appropriately.

With the new chart selected, in the Visualizations area select the Bar Chart icon to change the visualization type to a bar chart. On the Bar chart, click the ellipsis (3 dots in the upper right) and sort the chart by the TotalReports field.

Click in a blank area on the canvas and then in the Fields area, select TotalReports from the ChicagoCrimes table and then expand the Community table and select the Communityad field. In the Visualizations area select the Map icon to change the visualization to a map, and then from the Fields area drag the Communityad field to the Location area under Visualizations. Resize the visualization appropriately.

 

After completing all of the steps above, you will have a report that details all types of crime reports in Chicago from 2001 until the last date you loaded into your database. To filter the data to a specific type of crime, drag the Primary Type field from the ChicagoCrimes table to the Page Level Filters area under Visualizations. To replicate the report that was created in Excel, select the Homicide type in the filter.

Conclusion

In this blog post, we created a report using Power BI Desktop that connected to a Microsoft Azure SQL Database and detailed Homicides in the City of Chicago from 2001 until the last day of data that you loaded into your database.

Power BI Desktop is a powerful data analysis and visualization tool that allows you to gain insight from your data very quickly, and also share that insight by posting reports to the Power BI Service (future posts will discuss posting reports to the Power BI service).

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 4

The previous posts of this series (part 1, part 2, and part 3) focused on using publicly available crime report data to build a "real world" database in Azure, then showed how you can use some of the Power BI tools in Microsoft Excel to create interesting visualizations with that data.

In this post, I’ll walk through how to manually update the database with the latest data from the City of Chicago data portal so that your visualizations will be current. In future posts we’ll discuss how to automate this process, but for now, we’ll use a simplistic manual process.

Obtaining the Latest Data

The first step in updating the database is to understand what date and time the last police report was filed that you have added to your database. In order to accomplish this, you will need to use SQL Management Studio and connect to your database (if you need a refresher on how to do this, see the other posts in this series) and execute the following query:

SELECT MAX([Date]) FROM ChicagoCrimes;

 

This will return a single row with a DATETIME value such as: 2015-07-01 23:58:00.000

The way the City of Chicago Data Portal works is that crime data is made available in 24-hour cycles, and complete cycles are posted to the portal on a 7-day lag. (In other words, generally speaking, the data in the portal runs through the last complete day from one week ago.) Based on the date above (7/1/2015 at 2 minutes to midnight – Iron Maiden reference not intended), we can assume that all police reports through July 1, 2015 have been uploaded to our database, and we'll need to start with police reports from July 2.

To download the latest police report data, browse to the City of Chicago’s data portal which is available at https://data.cityofchicago.org/ and select the Crimes – 2001 to present link:

This will open the dataset in a browser window with several options available to you:

In the upper-right-hand section of the window, you will see a Filter option. Select that option to open the filter pane:

Select the Add a New Filter Condition button to add a filter. Select the Date field, and then select the Is After option, and then select the target date (in my example, July 2) as shown:

After a few moments, the window will refresh and you will have a much smaller dataset than what you started with. Once the screen refreshes, in the upper-right-hand side of the screen, select the Export button:

This will open the Download dialog, where you will want to choose the CSV for Excel option and then save the file to your local hard drive.

Once the file is saved, you will be ready to upload it to your Azure database environment.

Updating the Database

The first post in this series discussed downloading the initial data from the City of Chicago and uploading it to a temporary table in Azure SQL Database. The beginning of the update process is exactly the same: we'll upload the new data to a temporary table (for the purposes of this post, we'll use the same table we created during that process) and then use T-SQL code to update the fact table in our database.

Preparing the Temporary Table

Before we upload the new data, we will want to truncate the temporary table. This step is not technically necessary, as the T-SQL query we'll run will not insert duplicates into the fact table; however, for efficiency's sake we'll want to start with a blank table.

Connect to your Azure SQL Database instance and execute the following T-SQL command:

TRUNCATE TABLE CRIMES_NEW;

 

This will ensure that the temporary table that we used during the initial creation of the database is ready to use again. If for some reason you deleted the initial temporary table, you can create it with the following T-SQL script:

CREATE TABLE [dbo].[Crimes_New](
    [ID] [varchar](50) NOT NULL PRIMARY KEY,
    [Case Number] [varchar](50) NULL,
    [Date] [varchar](50) NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](50) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [varchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](50) NULL,
    [Domestic] [varchar](50) NULL,
    [Beat] [varchar](50) NULL,
    [District] [varchar](50) NULL,
    [Ward] [varchar](50) NULL,
    [Community Area] [varchar](50) NULL,
    [FBI Code] [varchar](50) NULL,
    [X Coordinate] [varchar](50) NULL,
    [Y Coordinate] [varchar](50) NULL,
    [Year] [varchar](50) NULL,
    [Updated On] [varchar](50) NULL,
    [Latitude] [varchar](50) NULL,
    [Longitude] [varchar](50) NULL,
    [Location] [varchar](50) NULL
);

 

Once the table is truncated, we’re ready to use the Import/Export Wizard to upload the new data.

Uploading New Data

Start the SQL Server Import Export Wizard (From the Start menu, search for SQL Server Import and Export Data) and then choose a flat file source from the Data source drop down menu. Browse for the file that you downloaded earlier, and add a double quote to the Text Qualifier textbox:

Select the Columns option on the left-hand side and verify that the columns are aligned properly:

Then select the Advanced option on the left-hand side and select the Description column. Change the Data Type to text stream [DT_TEXT] to ensure that the full width of the Description column will be uploaded.

Click Next and then select the SQL Server Native Client x.x (if you have SQL Server 2014 tools installed, your Native Client version will be 11.0) option from the Destination drop down. Enter your Azure SQL Database server name in the Server name: textbox and then choose the Use SQL Server Authentication radio button. Enter your User name and Password to connect to the Azure SQL Database instance and then select the ChicagoCrime database in the Database drop down menu.

Choose Next, and then select the [dbo].[Crimes_New] table in the Destination:<servername> field:

Click Next, and ensure the Run immediately checkbox is selected and then click Finish to begin the import.

Once the import is complete (depending on how much time has elapsed between the initial load and the update, this process will either be very quick or could take a few minutes) you will see the following:

Click Close to close the Import Export Wizard. Once the temporary table has been loaded, you can connect to the database and transfer the data to the fact table in the database.

Updating the Fact Table with the New Data

Adding the new data to the ChicagoCrimes fact table is a relatively straightforward process. You could easily modify the process above to load the data directly into the fact table from the text file; however, by separating the process of uploading from the process of updating, you ensure that any errors that occur can be corrected easily.

To update the fact table, connect to your database instance using SQL Server Management Studio and execute the following script:

INSERT ChicagoCrimes
SELECT [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
WHERE [Date] > (SELECT MAX([Date]) FROM ChicagoCrimes);

 

When the script finishes, you should see the same number of rows affected as you saw transferred in the Import Export Wizard:

Once the script is complete, the fact table in your database will be updated with the latest information available. You can verify that the data was updated properly by re-running the following T-SQL query and comparing the date returned with the initial date from above:

SELECT MAX([Date]) FROM ChicagoCrimes;

 

You should see a newer date returned.

Updating the Excel Workbook

In Part 3 of this series we used an Excel Workbook, a Power Pivot model and Power View to create a simple visualization of crime data. Once the Fact Table has been updated, you’ll want to return to the workbook and refresh the data to ensure the new rows are loaded into the visualizations.

Start Excel and open the workbook you created in Part 3, and when the workbook is opened, click the Enable Content button:

Select Power Pivot from the ribbon and select the Manage command to open the Power Pivot window:

From the ribbon, select the Refresh drop-down and select Refresh All and when prompted, enter the password for the database. When the refresh is complete, you should see the following dialog:

Note that the refresh will repopulate the entire Power Pivot model, not just the added rows.

Conclusion

In this post, we downloaded the latest information from the City of Chicago data portal and manually updated the database with the new information. You also updated the Excel Workbook with visualizations that you created earlier. This process can be re-run any time you want to add updated data to your database.

Future posts in this series will detail how to use the Azure Data Factory to automate the process of updating the data. I will also be creating a new series of posts that will use the data from this series to create an Azure Machine Learning predictive analytics model.

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 3

In the previous posts of this series (Part 1 and Part 2) I walked through the creation and population of an Azure SQL Database that contains “real world” crime statistics (with granularity down to the individual police report). This is a very simple database design, but it does contain a fair amount of data (over 5M rows in the fact table if you used the entire dataset) and provides enough diversity to allow for some very interesting analytics queries.

In this post, I'll detail creating some very specific indexes that will support basic analytics, and will also walk through some analysis of the data using Microsoft Excel and Power Pivot, along with Excel Power View. If you don't already have these products installed, you can obtain them by signing up for an Office 365 ProPlus subscription (which gives you a ton of additional features as well). In many cases you can obtain these products through your company's "Home Use" program if it has a Microsoft Enterprise Agreement.

Create Indexes to Support Queries

Because the data model that we are working with is a relatively simple star schema, we don’t have to worry too much about complicated index analysis to create indexes that will support our queries.

As a review, this is what the data model looks like:

Each of the tables in the schema has a Primary Key constraint, and the fact table has Foreign Key constraints that reference each of the Dimension tables. Each of the Primary Keys is implemented by creating a Clustered Index, which means that each table is automatically sorted by the Primary Key column. For the most part, this is good if we are doing a lot of targeted searches of our data and looking up crime statistics by the primary key. (For example, in the ChicagoCrimes fact table, the primary key is defined on the ID column. We will likely never use the ID column to search for a crime, so we will want to create additional indexes to support how we actually use the data.) It is not very efficient, however, when you execute queries that look up or summarize data by a different field.

In order to understand what additional indexes you will need, it’s important to understand what types of questions that you will be asking of your data. For the purposes of this blog post, we will limit these questions to the following:

  • How many murders have been committed? How many murders occur each year?
    • For this question, we will be searching the data by the Primary Type field in the fact table.
  • What type of location (street, sidewalk, etc.) has the majority of crimes?
    • For this question, we will be searching the data by the Location Description field.
  • What part of the city has the most crime?
    • For this question, we will be searching the data by the Community Area field.
  • Which police station is the busiest?
    • For this question, we will be searching the data by the District field.

In SQL Server (and Azure SQL DB) we can have only a single Clustered Index per table. This index defines the storage for the table, and the data will always be ordered according to the Clustered Index. In order to create additional indexes, we will need to make them Non-Clustered, which means that additional storage will be used for the index keys.

For each of the above questions, we’ll create an additional non-clustered index on the fields in the fact table that will be used. Since we’ll also be interested in summarizing reports by the date that they occurred, we’ll also create an index on the DateOnly field.

Creating the Indexes

In order to create the additional indexes, you’ll use the SQL Server Management Studio and connect directly to the database instance in Azure. Start SQL Management Studio and connect to your database in Azure (if you need a reminder on how to do this, see Part 1 of this blog series). Ensure that you have selected the ChicagoCrime database and execute the following T-SQL script for each index you need to create (note that each script will take several minutes to run):

Primary Type Index

CREATE NONCLUSTERED INDEX [IDX_PrimaryType] ON [dbo].[ChicagoCrimes]
(
    [Primary Type] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Location Description Index

CREATE NONCLUSTERED INDEX [IDX_LocationDescription] ON [dbo].[ChicagoCrimes]
(
    [Location Description] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Community Area Index

CREATE NONCLUSTERED INDEX [IDX_CommunityArea] ON [dbo].[ChicagoCrimes]
(
    [Community Area] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

District Index

CREATE NONCLUSTERED INDEX [idx_District] ON [dbo].[ChicagoCrimes]
(
    [District] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Date Index

CREATE NONCLUSTERED INDEX [idx_Date] ON [dbo].[ChicagoCrimes]
(
    [DateOnly] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Now that the indexes are created, we can run some simple queries to answer some questions about the data.

Asking Questions of the Data

If we understand the structure of the data and know some specific questions that we want to ask, we can use T-SQL to write a query that will return a dataset that answers the question. For example, to implement the questions above, we can use the following T-SQL queries from SQL Management Studio while connected to the database:

Murders by Year

SELECT
    DATEPART(yy, [DateOnly]) AS [Year]
    ,COUNT([Primary Type]) AS NumberMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide'
GROUP BY DATEPART(yy, [DateOnly])
ORDER BY [Year];

Top 10 Crime Locations

SELECT TOP 10
    [Primary Type] AS CrimeType
    ,[Location Description] AS [Location]
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes
GROUP BY [Location Description], [Primary Type]
ORDER BY NumCrimes DESC;

Top 10 Worst Crime Areas

SELECT TOP 10
    [Community]
    ,[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
    ON cc.[Community Area] = c.Community_id
GROUP BY cc.[Primary Type], c.[Community]
ORDER BY NumCrimes DESC;

 

(note: This query is a little misleading because some of the crime reports do not include a community, therefore they are assigned to “Chicago”)

Busiest Police Stations

SELECT
    [Primary Type] AS CrimeType
    ,ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
    ON cc.[District] = ps.[District]
GROUP BY ps.[Address], cc.[Primary Type]
ORDER BY NumCrimes DESC;

 

These queries give you a good idea of the types of questions you can ask of your data with T-SQL; however, as you can see, for every question you ask you need to execute a new query and wait for the results. For the most part, when analyzing data, it is a good idea to visualize the data in a tool that end users are comfortable with and that supports ad-hoc analysis. For the purposes of this blog series, we'll use Microsoft Excel for the analysis.

Connecting Microsoft Excel to the Data

Even though the database we’ve created is relatively large (if you pulled in the full dataset, you’ll have more than 5.5 million records in the fact table) Microsoft Excel is a viable tool to visualize the data. For the purposes of this blog series, we’ll use a technology embedded in Excel called Power Pivot, which is based on an in-memory column-oriented database technology called xVelocity that makes use of very efficient compression techniques to store massive amounts of data in relatively little memory. As mentioned above, if you do not have a copy of Excel from Office 2013 installed on your machine, you can sign up for an Office 365 subscription to obtain it.

Using Power Pivot to Download Data

The first step to analyzing data with Excel is to connect Power Pivot to the data. Power Pivot should be an option on the Excel Ribbon when you start Excel as shown below (I have a few extra plug-ins enabled, so don’t expect to see all of the ribbon items):

If you do not see the Power Pivot ribbon item, make sure it is enabled by selecting File, then Options, then Add-Ins; choose COM Add-ins from the Manage drop-down and select Go. Make sure that Power Pivot is enabled (and while you are here, ensure that Power View is enabled as well):

Once you have Power Pivot enabled, select the Power Pivot ribbon item and then select the Manage option:

Select the From Database option, select SQL Server, and then enter your database connection information:

Select Next, and then select the Select from a list of …. option:

Select the ChicagoCrimes table, and then select the Select Related Tables option (note my database has additional tables that have not been discussed yet):

Click Finish to begin the import of data:

The import operation will take several minutes to complete. When it’s done press Close and you will see the data represented in the Power Pivot management interface:

At the bottom of the screen you’ll see the various tables represented. Click through each of them to see the data stored in Power Pivot, then click Diagram View to see a data diagram (which is created automatically from the relationships inherited from the data model).

Close the Power Pivot window (select the X in the upper right) and then save the Excel workbook. Note that even with more than 5 million rows in the fact table, the total size of the workbook is under 500MB.

From the blank Excel worksheet, choose the Insert command on the ribbon and then select Power View:

Note that you will see a blank report canvas and a list of available fields. (If you do not have Silverlight installed on your machine, you will be prompted to install it; Silverlight is required for the embedded version of Power View at this time.)

For this example, we will implement a simple dashboard that will answer the same questions that we used above. To start the process, expand the ChicagoCrimes field on the right side and select the Primary Type field. In the FIELDS box, change the drop-down option to Count (Not Blank)

Then scroll down and expand the DateDimension field. Select the Year item and then in the FIELDS area, drag the Year item above the # Count of Primary Type. Then select the Year item and change the drop down to Do Not Summarize. This results in a table that represents the total number of crimes per year and should look like this:

In the Power View Fields list, scroll back to the ChicagoCrimes item and drag the Primary Type field to the Filters area. Select Homicide from the list of Primary Type fields. This will result in a view that shows the total number of Homicides per year.

Once you have the table showing the data you want to visualize, click any of the rows in the table, and then from the Ribbon, select Other Chart and then Line. This will change the table to a line chart representing the number of Homicides per year. Drag the corners of the window to size the graph to fit your canvas.

On the report canvas, click outside of the graph that you just created, and then from the Power View Fields area, select the Location Description item in the ChicagoCrimes table. Also from the Power View Fields area select the Primary Type field and then from the Values area, choose the Primary Type field and change the drop down to Count (Not Blank). The result should look like:

Click any row in the new table and then from the Ribbon, select Bar Chart and then select Stacked Bar. This will change the table to a stacked bar chart. Above the chart, select the Count of Primary Type field in the sort by area, and then select desc to sort in descending order.

Click on a blank area of the canvas, and from the Power View Fields area, select Primary Type and then expand the Community table and select the Communityad field. In the Values area, change the Primary Type field to Count (Not Blank).

Click on any row in the new table, and from the Ribbon, select Map. When the map renders, drag the corners of the window to resize the map accordingly.

You can also click the arrow in the Filters area to collapse the filter, and then type a title into the Title area.

The result is a very nice dashboard-like visualization of Chicago homicide data that is easy to interpret. You can save the Excel workbook at this time to ensure that the dashboard is saved. You can add additional dashboards by selecting the + item at the bottom of the page and then repeating the steps above to add different fields to the canvas.

Conclusion

In this post, we continued the construction of our demonstration database by adding indexes to improve query efficiency and then we used Microsoft Excel, Power Pivot and Power View to create compelling visualizations from the connected data.

In the next post in this series, I will explain how to update the database with new data and then refresh the Excel workbook to include the new data. In future posts we will also use Microsoft Power BI to visualize the data and will also connect the data to Microsoft Azure Machine Learning to perform predictive analytics on the Crime Data.

 

 

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 2

In the previous article in this series, I wrote about provisioning the database in Azure and performing the download of the data from the City of Chicago crime statistics.

In this post, I will discuss transforming the initial data into the various tables necessary to perform meaningful analysis. Keep in mind that the steps I’m laying out here are not the only way to accomplish the task; I’m just providing a relatively simple and straightforward approach to creating data structures that can later be consumed by the other tools that will be discussed in later posts in this series.

If you haven’t performed the steps in the previous article, you’ll want to do that first to have the basic environment available in order to follow the steps laid out here.

Creating the Basic Table Structure

Since the end-state of this database is to provide a structure for efficient queries and analysis, we’ll stick with a simple denormalized design. We will also be creating some tables that are purpose-built for specific queries, so there will be duplicated data within the database we’re creating. This should by no means be seen as a best practice, but rather as a way to build data structures that support specific use cases.

With all of the above said, the database structure will look like this:

This design is primarily focused on providing efficient query operations and there are likely many different opinions on how the data could be better structured. Again, for the purpose of this series of blog posts we’re not going to focus on best practices, but rather on getting the job done.

This database design is known in the Data Warehouse world as a star schema. The schema consists of a central Fact Table and surrounding Dimension Tables. In short the fact table contains the information we’re interested in, and the dimension tables contain the descriptors for that information.

Creating the Fact Table (ChicagoCrimes)

The heart of the database will be the ChicagoCrimes fact table. For the purposes of this blog post, we’re going to create a table that uses the same column names as the original data that we downloaded. This means creating columns that have spaces in their names, something to which I am generally averse; however, it makes it easier to understand how all the pieces fit together in the final puzzle. To create this table, connect to your database instance in Azure using SQL Server Management Studio (see the instructions in the first post if you don’t know how to do this), change to the database you created, and execute the following T-SQL script:

CREATE TABLE [dbo].[ChicagoCrimes]
(
    [ID] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Case Number] [varchar](50) NOT NULL,
    [Date] [datetime] NOT NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](10) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [nvarchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](10) NULL,
    [Domestic] [varchar](10) NULL,
    [Beat] [varchar](10) NULL,
    [District] [int] NULL,
    [Ward] [varchar](10) NULL,
    [Community Area] [int] NULL,
    [FBI Code] [varchar](10) NULL,
    [X Coordinate] [int] NULL,
    [Y Coordinate] [int] NULL,
    [Year] [int] NULL,
    [Updated On] [datetime] NULL,
    [Latitude] [float] NULL,
    [Longitude] [float] NULL,
    [Location] [varchar](50) NULL,
    [DateOnly] [date] NULL
);

 

This should execute very quickly. The above script creates the ChicagoCrimes table, marking the ID column as the Primary Key which will also create a clustered index based on the ID column.

Create the Remaining Tables

The next step is to create the remaining tables. Execute the following T-SQL script:

CREATE TABLE [dbo].[Community]
(
    [Community_id] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community] [varchar](30) NULL,
    [Area] [varchar](20) NULL,
    [Side] [varchar](15) NULL,
    [Communityad] [varchar](50) NULL,
    [WeatherStation] [varchar](4) NULL
);

CREATE TABLE [dbo].[DateDimension]
(
    [PKIDDate] [varchar](50) NOT NULL PRIMARY KEY CLUSTERED,
    [Date] [date] NOT NULL UNIQUE,
    [Day] [int] NULL,
    [DaySuffix] [varchar](50) NULL,
    [DayOfWeek] [int] NULL,
    [DayOfWeekName] [varchar](50) NULL,
    [DOWInMonth] [int] NULL,
    [DayOfYear] [int] NULL,
    [WeekOfYear] [int] NULL,
    [WeekOfMonth] [int] NULL,
    [Month] [int] NULL,
    [MonthName] [varchar](50) NULL,
    [Quarter] [int] NULL,
    [QuarterName] [varchar](50) NULL,
    [Year] [int] NULL,
    [StandardDate] [date] NULL,
    [HolidayText] [varchar](50) NULL,
    [Season] [varchar](50) NULL
);

CREATE TABLE [dbo].[PoliceStations]
(
    [District] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Address] [varchar](255) NULL,
    [City] [varchar](50) NULL,
    [State] [varchar](50) NULL,
    [Zip] [varchar](50) NULL,
    [Website] [varchar](500) NULL,
    [Phone] [varchar](50) NULL,
    [Fax] [varchar](50) NULL,
    [TTY] [varchar](50) NULL,
    [Location] [varchar](255) NULL
);

CREATE TABLE [dbo].[SocioEconomicFactors]
(
    [Community Area Number] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community Area Name] [varchar](50) NULL,
    [Percent of Housing Crowded] [float] NULL,
    [Percent Households Below Poverty] [float] NULL,
    [Percent Aged 16+ Unemployed] [float] NULL,
    [Percent Aged 25+ Without High School Diploma] [float] NULL,
    [Percent Aged Under 18 or Over 64] [float] NULL,
    [Per Capita Income] [float] NULL,
    [Hardship Index] [int] NULL
);

The above script will create each of the primary tables needed in the database. The next step is to create the constraints that will ensure referential integrity between the tables.

Create the Foreign Key Constraints

This step is not strictly necessary in order to have a functional database, but it provides a framework that ensures the data stored in the fact table is properly related to the data in the dimension tables. While we’re not aiming for a perfect model, we will be using this database for analysis, so it’s important to do what we can to ensure that the data we analyze is at least consistent with the model we build.

To create the Foreign Key constraints, execute the following T-SQL script:

ALTER TABLE ChicagoCrimes ADD
    CONSTRAINT FK_Community FOREIGN KEY ([Community Area]) REFERENCES dbo.Community (Community_id),
    CONSTRAINT FK_DateDim FOREIGN KEY ([DateOnly]) REFERENCES dbo.DateDimension ([Date]),
    CONSTRAINT FK_PoliceStation FOREIGN KEY (District) REFERENCES dbo.PoliceStations (District),
    CONSTRAINT FK_Socio FOREIGN KEY ([Community Area]) REFERENCES dbo.SocioEconomicFactors ([Community Area Number]);

Once the constraints are added, the dimension tables can be populated.
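As a quick sanity check, you can confirm that the constraints were created by querying the catalog views (a sketch; the names returned should match the CONSTRAINT names in the script above):

-- List the foreign keys defined on the ChicagoCrimes fact table
SELECT fk.name AS ConstraintName,
       OBJECT_NAME(fk.referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys fk
WHERE fk.parent_object_id = OBJECT_ID('dbo.ChicagoCrimes');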

Populating the Dimension Tables

In a star schema, the dimension tables provide descriptive data about related information in the fact table. The dimension tables will need to be populated before the fact table due to the foreign key constraints that were just added.

Populating the Date Dimension

Arguably the most important dimension of any star schema is the date dimension. The more descriptive information that can be provided about a date in the dimension table, the more interesting your analysis can be. For simplicity's sake, I will provide an export of the date dimension that I use most often, which has been populated with many descriptive columns and seasons relevant to the northern hemisphere. You can download the .csv file here. (Save the file to your local drive)

Once the file is downloaded, start the SQL Server Import Export Wizard (refer back to the first article in this series if you need a refresher on how to start the wizard)

Click Next and choose Flat File Source from the drop down menu. Select the file that you just downloaded in the File name textbox. Uncheck the Column Names in First Data Row checkbox.

Click Advanced on the left side, and change the Data Type to Unicode String [DT_WSTR] on columns 0,3,5,11,13,16 and 17 (this is to ensure the data types are converted properly when loaded into the dimension table)

Click Preview on the left side to verify that the file is being properly read.

Choose Next and then select SQL Native Client .. in the Destination drop down and then enter your server connection information, then select the ChicagoCrime database:

Click Next and ensure the DateDimension table is selected in the Destination field:

Click Next and ensure the Convert checkbox is selected for every row that requires a conversion (denoted by a caution icon on the left side of the dialog):

Click Next twice and then click Finish to load the data into the DateDimension table. You should see 54787 rows loaded.
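If you'd like a quick look at what you just loaded, a simple query will show the range of dates the dimension covers along with the row count:

-- Check the first and last dates in the dimension, plus the total row count
SELECT MIN([Date]) AS FirstDate,
       MAX([Date]) AS LastDate,
       COUNT(*) AS NumRows
FROM DateDimension;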

Populating the Community Dimension

The Community dimension table is used to relate the area that the crime report occurred in to a community in Chicago. This table also has geographic breakdowns (North Side, South Side, etc) as well as a relevant weather station (that will be used in later posts in this series). You can download the data file for the Community dimension here (save the file to your local machine).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox.

Choose Advanced on the left side and change the data type of Column 0 to 4 byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]

Choose Preview on the left side to verify that the columns are properly recognized

Choose Next and then select SQL Server Native Client xx in the Destination drop down and then enter your server information. Choose ChicagoCrime in the Database drop down.

Click Next and make sure that the Community table is shown in the Destination

Click Next twice and then Finish to populate the table. You should see 78 rows transferred.

Populating the PoliceStations Dimension

The PoliceStations dimension is used to relate the area that a crime occurred to the police precinct that services the area. You can download the data file for the PoliceStations dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox.

Click Advanced on the left side. Change the data type of Column 0 to 4 byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]. Change the Output column width on columns 5 and 9 to 255.

Click Preview on the left side to ensure that the columns are correctly displayed.

Click Next and select SQL Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and verify that the PoliceStations table is listed in the Destination.

Click Next twice and then Finish to load the PoliceStations table. You should see 23 rows transferred.

Populating the SocioEconomicFactors Dimension

The SocioEconomicFactors dimension is used to describe the social and economic factors of a given neighborhood in Chicago. This will be useful in later analysis of the crime data.

You can download the data file for the SocioEconomicFactors dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox. Change the Code page to 1252 (ANSI – Latin I).

On the left side, select Preview to ensure the columns are properly displayed.

 

Click Next and select SQL Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and make sure that the SocioEconomicFactors table is listed in the Destination. (You will have to select the SocioEconomicFactors table in the drop down since the name will be incorrect when the list is initially populated)

Click Next twice and then Finish to load the table. You should see 78 rows transferred.

When the transfer is complete, you will have loaded all of the dimension tables and are ready to load the fact table.
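Before loading the fact table, it's worth double-checking that the row counts match the numbers above (54,787 dates, 78 communities, 23 police stations, and 78 socioeconomic rows). A simple set of counts confirms it:

-- Row counts for each dimension table
SELECT 'DateDimension' AS TableName, COUNT(*) AS NumRows FROM DateDimension
UNION ALL
SELECT 'Community', COUNT(*) FROM Community
UNION ALL
SELECT 'PoliceStations', COUNT(*) FROM PoliceStations
UNION ALL
SELECT 'SocioEconomicFactors', COUNT(*) FROM SocioEconomicFactors;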

Populating the Fact Table

In a star schema design, the fact table contains all of the detailed information that will be used for analysis. It is the largest table in the design (in the case of this database, it will contain more than 5.5 million rows).

The process of populating the fact table involves writing a T-SQL query that reads data from the temporary table created previously (follow the steps outlined in Part 1 of this series if you do not have the temporary table) and writes it to the ChicagoCrimes table. This is a very simple process that relies on implicit conversion to convert from the default data types in the temporary table to the data types configured in the ChicagoCrimes table.

The first step in populating the ChicagoCrimes table is to connect to the database and open a new query window. Follow the steps in the first article in this series if you are unfamiliar with doing this. Once you are connected to the database, in the new query window execute the following query:

INSERT ChicagoCrimes
SELECT [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
ORDER BY [Date];

The above query reads the data from the Crimes_New table (which was populated with data from the City of Chicago data portal in the previous article in this series) and populates the new ChicagoCrimes table, using implicit conversion for the data types. The ORDER BY clause ensures that the crime data is read in sequential order by date. This query will take several minutes to complete, as there should be more than 5.5 million rows of data in the Crimes_New table.
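A simple way to verify that the load completed cleanly is to compare the row counts of the staging table and the fact table; the two numbers should match:

-- The staging table and the fact table should contain the same number of rows
SELECT (SELECT COUNT(*) FROM [dbo].[Crimes_New]) AS StagingRows,
       (SELECT COUNT(*) FROM [dbo].[ChicagoCrimes]) AS FactRows;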

If you loaded the dimension tables properly, the INSERT will complete with no errors and you will end up with a fully populated database. You can then run queries against the ChicagoCrimes table like the following:

SELECT
    dd.Season AS [Season]
    ,dd.DayOfWeekName AS [DayOfWeek]
    ,cc.[DateOnly] AS [IncidentDate]
    ,CAST(cc.[Date] AS TIME) AS [IncidentTime]
    ,cc.[Case Number] AS CaseNo
    ,cc.[Primary Type] AS CrimeType
    ,cc.Description AS [CrimeDescription]
    ,cc.[Location Description] AS [CrimeLocation]
    ,cc.Latitude AS [Latitude]
    ,cc.Longitude AS [Longitude]
    ,cc.Beat AS [Beat]
    ,c.Community AS [Community]
    ,c.Area AS [Area]
    ,c.Side AS [Side]
FROM ChicagoCrimes cc
JOIN [DateDimension] dd
ON cc.DateOnly = dd.[Date]
JOIN [dbo].[Community] c
ON cc.[Community Area] = c.Community_id
WHERE cc.[Date] > (GETDATE()-30);

This query returns a list of crimes that have been reported within the last 30 days (assuming that you have downloaded the latest data from the City of Chicago within the last 30 days).

Conclusion

In this post we continued the process of building a usable database for analysis by implementing a star-schema design and populating both the dimension and fact tables. We then executed a query to generate a Crime Report for crimes within the last 30 days in order to test the data.

In future posts I will detail the creation of indexes and views that will help make analysis of this data easier, and will also discuss tools such as Excel and Power BI for analysis.

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 1

One of the major issues that people who want to get started with “Data Science” or “Big Data Analytics” face is finding datasets that are both compelling and diverse enough to provide a useful playground for exploration.

In my previous life, I blogged about this subject and wrote instructions on how to create a large SQL Server database from data made available under the Home Mortgage Disclosure Act (HMDA). You can read that entry here, although I’m not 100% positive that the steps laid out in that post are still valid in terms of data locations and availability.

The purpose of this series of posts is to provide instructions on how to create a SQL database in Microsoft Azure that can be used as a data source for advanced analytics. I’ll be posting future entries on this blog that refer back to this database. Also, to be clear, while I’m focusing on creating the database in Microsoft Azure, there is no reason why you couldn’t use the same techniques to create the database on a local instance of SQL Server.

For those that want to use Azure but don’t have a subscription, you can sign up for a 30-day trial here. If all you want to do is host a simple database in Azure, you can do so for around US $5 per month, up to 2GB in size. For a larger database, you’ll want to look at the standard tier, which starts at US $15 per month for 250GB. If you follow the instructions that I lay out in this post, you’ll end up with a database of about 6GB in size.

For this first article in the series, I will discuss setting up the database environment in Azure and downloading the initial data. I will also discuss the initial loading of the data into a temporary table in the database. Additional articles in this series will focus on turning the temporary table into a useful data structure as well as analyzing the data.

The Use Case

For this series, the use case I’ll be working with is crime, based on data from the city of Chicago. The reason I chose this particular dataset is that it’s a subject that most people can relate to in one way or another, and it does lend itself to some advanced analytics capabilities very well.

The City of Chicago maintains a data portal where they publish crime detail on a regular basis. This is significant because the level of detail published is very granular – down to the individual police report. When combined with other data, such as NOAA Weather data, there are a number of interesting analytics possibilities that can be realized. The solution that will be developed throughout this series will be surfaced using Microsoft PowerBI and will result in a dashboard that looks like this:

Configuring The Database Environment

The first step in building the database is to ensure that you’re properly set up in the Microsoft Azure portal. Follow the instructions here to sign up for an account (a trial account will work just fine for this, but remember that it’s only good for 30 days). Also remember that the instructions I’m providing here will work with a local SQL Server instance as well; you’ll just have to modify how and where you load your data appropriately.

Once you are logged in to the Azure environment, you should have a portal that looks something like this (likely with fewer pre-configured resources however):

To create a new database, choose SQL Databases from the left side navigation, and then select New at the bottom of the page, then select Data Services / SQL Database / Quick Create and then fill out the form, choosing a New SQL database server along with a region close to you.

Once the form is filled out appropriately, choose Create SQL Database which will then submit everything to Azure and begin the process of provisioning your new database. You will see a status message appear, and it will take a few minutes to complete the process.
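As an aside, once you have set up the firewall rule described later in this post and can connect to the master database on your new server, additional databases can also be created with T-SQL rather than the portal. This is just a sketch; the database name, edition, service objective, and size shown here are assumptions that you should adjust to match your own subscription and budget:

-- Run while connected to the master database on your Azure SQL Database server
CREATE DATABASE ChicagoCrime
(
    EDITION = 'Standard',
    SERVICE_OBJECTIVE = 'S0',
    MAXSIZE = 250 GB
);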

If you receive a message about enabling auditing on the new database, you can choose to ignore it, unless you want to experiment with auditing throughout this process.

Once the database is created, you can click the details link to view the status of the job:

Managing the Database

Now that the database has been created, you’ll want some tools to manage it. The good news there is that the same tools that manage local SQL Server instances work just fine with Azure SQL Database. If you don’t already have a local instance of SQL Server Management Studio to work with, you can download a free version here (Click the link to download Express Edition, and then select the appropriate 32 or 64-bit Management Studio option – Follow the instructions to install it on your local machine). Make sure that you install all of the management tools, as you will be using the Import/Export Wizard to populate the initial table in the database.

The first step to enable managing your new database is to enable your client IP Address in the Firewall rules. From the Azure Management Portal, choose SQL Databases on the left side, then select the Servers tab, then select the server that you just created.

Select the Configure tab, and then select Add to the Allowed IP Addresses. Choose Yes next to Windows Azure Services (you will need this option later) and then choose Save at the bottom of the screen. This will add your local IP address to the firewall rules to enable your local machine to connect to the Azure SQL Database server.
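Once you can connect, additional firewall rules can also be managed with T-SQL from the master database. A minimal sketch, assuming a rule name of ClientHome and a placeholder address that you would replace with your own client IP:

-- Run in the master database; replace the placeholder address with your client IP
EXECUTE sp_set_firewall_rule
    @name = N'ClientHome',
    @start_ip_address = '203.0.113.25',
    @end_ip_address = '203.0.113.25';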

Once the firewall rules are saved, you’ll use SQL Server Management Studio (SSMS) to manage the database and server. To connect to the database, start SQL Server Management Studio, and when prompted, login to the new Database Server that you created above, using SQL Server Authentication and the username and password that you provided when you initially provisioned the database. (notice that you will use the fully-qualified name of the database server, which is <servername_you_provided>.database.windows.net)

Once connected, you should see your server in the Object Explorer window. You can expand Databases to see the database that you provisioned. (Note: my personal preference is to automatically open a new query window when I start SSMS; the default is to not open one. If you want to configure this option, it is available under Tools > Options > Environment > Startup.)

Once you have successfully connected to the database, you are ready to proceed to importing the data into the initial staging area.

Downloading the Crime Data

Downloading the data from the City of Chicago is a very easy process. For this initial load, we will download the entire dataset and load it in a single pass. Since the data is updated on a regular basis in the portal, later entries in this series will explain how to keep your data in sync with the portal.

Using your browser, connect to the Chicago Data Portal (https://data.cityofchicago.org/ ) and select the Crimes 2001-Present option from the middle pane of the page.

This will open the Crimes dataset in the online explorer (which is very nicely done, and allows for a wide-range of analysis directly from the portal).

In the upper-right corner of the portal, choose the Export option and then choose CSV for Excel. This will eventually open a Save File dialog (it can take a few minutes to generate the export file)

Choose Save, and the file will begin downloading. This will take several minutes to download, depending on your Internet connection speed.

Now that the file is downloaded, you can import the data into your Azure SQL Database.

Importing The Data

Note: The process that we will use to import the data is a very simplistic process. There are more efficient ways to accomplish this task, but I wanted to use the simple and easy approach to load the initial data.

To load the initial data, start the SQL Server 2014 Import Export Wizard. (It was installed along with the SSMS tools above and can be found in your Start menu. Make sure you choose the appropriate version of the tool – 64-bit or 32-bit depending on your operating system)

When the wizard starts, click Next and then choose Flat File Source. Click the Browse button and select the file that you downloaded in the previous step (in order to see the file in the window, you will need to select the CSV files option next to the File textbox) and then choose Open.

In the Text qualifier textbox, enter a double-quote (").

Ensure that the options are configured as in the image above. Choose the Columns option on the left side to ensure that the fields are lining up properly.

Once you are sure that the columns are properly lined up, select the Advanced option on the left side and then choose the Description Column. Change the data type to a text stream (DT_TEXT)

Then choose the Next button and then select the SQL Server Native Client destination. Enter your Server Name and login information, then select the ChicagoCrime database in the drop down.

Click the Next button and change the destination table name to [dbo].[Crimes_New]

Choose the Edit Mappings button, select the Edit SQL button, and then add PRIMARY KEY CLUSTERED to the [ID] column as shown below. Azure SQL Database works best when every table has a clustered index.

Choose the OK button and then choose the Next button and then choose Finish twice to start the import operation.

This operation will run for several minutes as there are over 5 million rows of data to import.

Once the operation is complete, you can switch back to SSMS and verify that the initial table has been created and populated by executing a query against the ChicagoCrime database.
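A simple row count against the staging table (a sketch; adjust the name if you chose a different destination table in the wizard) is enough to confirm the load:

-- Confirm the staging table was created and loaded with the full dataset
SELECT COUNT(*) AS NumRows
FROM [dbo].[Crimes_New];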

Preparing for the Next Step

In this blog post, we setup a new Azure SQL Database and imported some crime data into a temporary table in the SQL Database. You will note that we did not manipulate the data in any way, so all of the data types are currently inherited from the text file. In the next article in this series, I’ll walk through how to create the permanent table and how to populate it with the data, converting the data types to the appropriate values.

 

 

 

 

 

 

 

Welcome to the Big Data and Advanced Analytics blog!

In my work with the US Federal Government customer base, I find that most of the people I interact with are very interested in applying advanced analytics and “Big Data” technology to their work, but struggle with how to get started.

I plan to use this blog to post relevant information on building solutions using Microsoft technologies, with a focus on “Big Data” and advanced analytics.

Everything that I post here is meant to reflect my own personal opinion related to using Microsoft Technologies to solve the problems faced by our US Federal Customer base, but anyone interested in “Big Data” and “Advanced Analytics”  should find the content here interesting.

Feel free to ask questions and post comments, and I’ll respond to them as I can.

Again, Welcome to my blog!

 

Ted Malone

Solutions Architect

Microsoft Federal