Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 3

In the previous posts of this series (Part 1 and Part 2) I walked through the creation and population of an Azure SQL Database that contains “real world” crime statistics (with granularity down to the individual police report). This is a very simple database design, but it does contain a fair amount of data (over 5M rows in the fact table if you used the entire dataset) and provides enough diversity to allow for some very interesting analytics queries.

In this post, I’ll detail creating some very specific indexes that will support basic analytics, and will also walk through some analysis of the data using Microsoft Excel and Power Pivot, along with Excel Power View. If you don’t already have these products installed, you can obtain them by signing up for an Office 365 ProPlus subscription (which gives you a ton of additional features as well). In many cases you can also obtain these products through your company’s “Home Use” program if your company has a Microsoft Enterprise Agreement.

Create Indexes to Support Queries

Because the data model that we are working with is a relatively simple star schema, we don’t have to worry too much about complicated index analysis to create indexes that will support our queries.

As a review, this is what the data model looks like:

Each of the tables in the schema has a Primary Key constraint, and the fact table has Foreign Key constraints that reference each of the Dimension Tables. Each Primary Key is implemented as a Clustered Index, which means each table is physically sorted by its Primary Key column. That works well for targeted searches that look up crime statistics by the primary key, but it is not very efficient for queries that look up (or summarize) data by a different field. For example, in the ChicagoCrimes fact table the primary key is defined on the ID column; we will likely never search for a crime by ID, so we will want additional indexes that support how we actually use the data.

In order to understand what additional indexes you will need, it’s important to understand the types of questions you will be asking of your data. For the purposes of this blog post, we will limit these questions to the following:

  • How many murders have been committed? How many murders occur each year?
    • For this question, we will be searching the data by the Primary Type field in the fact table.
  • What type of location (street, sidewalk, etc.) has the majority of crimes?
    • For this question, we will be searching the data by the Location Description field.
  • What part of the city has the most crime?
    • For this question, we will be searching the data by the Community Area field.
  • Which police station is the busiest?
    • For this question, we will be searching the data by the District field.

In SQL Server (and Azure SQL DB), a table can have only one Clustered Index. That index defines the storage for the table, and the data is always ordered according to it. Any additional indexes must be Non-Clustered, which means additional storage will be used for the index keys.

For each of the above questions, we’ll create an additional non-clustered index on the fields in the fact table that will be used. Since we’ll also be interested in summarizing reports by the date that they occurred, we’ll also create an index on the DateOnly field.
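Because non-clustered indexes consume additional storage for their keys, it can be interesting to see the impact. One simple check (entirely optional) is to run sp_spaceused against the fact table before and after creating the indexes in the next section, since it reports data size and index size separately:

-- Reports reserved, data, and index sizes for the fact table
EXEC sp_spaceused N'dbo.ChicagoCrimes';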

Creating the Indexes

To create the additional indexes, you’ll use SQL Server Management Studio and connect directly to the database in Azure. Start SQL Server Management Studio and connect to your database in Azure (if you need a reminder on how to do this, see Part 1 of this blog series). Ensure that you have selected the ChicagoCrime database and execute the following T-SQL script for each index you need to create (note that each script will take several minutes to run):

Primary Type Index

CREATE NONCLUSTERED INDEX [IDX_PrimaryType] ON [dbo].[ChicagoCrimes]
(
    [Primary Type] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Location Description Index

CREATE NONCLUSTERED INDEX [IDX_LocationDescription] ON [dbo].[ChicagoCrimes]
(
    [Location Description] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Community Area Index

CREATE NONCLUSTERED INDEX [IDX_CommunityArea] ON [dbo].[ChicagoCrimes]
(
    [Community Area] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

District Index

CREATE NONCLUSTERED INDEX [idx_District] ON [dbo].[ChicagoCrimes]
(
    [District] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Date Index

CREATE NONCLUSTERED INDEX [idx_Date] ON [dbo].[ChicagoCrimes]
(
    [DateOnly] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);
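Once the scripts above have finished, you can optionally confirm that all five indexes were created with a quick query against the sys.indexes catalog view (this will also list the existing clustered primary key index):

-- Lists every index on the fact table along with its type
SELECT name AS IndexName, type_desc AS IndexType
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.ChicagoCrimes')
ORDER BY name;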

 

Now that the indexes are created, we can run some simple queries to answer some questions about the data.

Asking Questions of the Data

If we understand the structure of the data and know some specific questions that we want to ask, we can use T-SQL to write a query that will return a dataset that answers the question. For example, to implement the questions above, we can use the following T-SQL queries from SQL Management Studio while connected to the database:

Murders by Year

SELECT
    DATEPART(yy,[DateOnly]) AS [Year]
    ,COUNT([Primary Type]) AS NumberMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide'
GROUP BY DATEPART(yy,[DateOnly])
ORDER BY [Year];
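The first half of the question above (how many murders have been committed in total) needs only the overall count; a quick variant of the same filter returns it:

-- Total number of homicide reports in the dataset
SELECT COUNT(*) AS TotalMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide';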

Top 10 Crime Locations

SELECT TOP 10
    [Primary Type] AS CrimeType
    ,[Location Description] AS [Location]
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes
GROUP BY [Location Description],[Primary Type]
ORDER BY NumCrimes DESC;

Top 10 Worst Crime Areas

SELECT TOP 10
    [Community]
    ,[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
ON cc.[Community Area] = c.Community_id
GROUP BY cc.[Primary Type],c.[Community]
ORDER BY NumCrimes DESC;

 

(Note: this query is a little misleading because some crime reports do not include a community and are therefore assigned to “Chicago”.)
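If you would rather exclude those catch-all reports, a small variation of the query (a sketch that assumes the catch-all community is literally named “Chicago” in the Community table) simply filters them out:

SELECT TOP 10
    c.[Community]
    ,cc.[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
ON cc.[Community Area] = c.Community_id
WHERE c.[Community] <> 'Chicago' -- assumes the catch-all rows are labeled 'Chicago'
GROUP BY cc.[Primary Type],c.[Community]
ORDER BY NumCrimes DESC;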

Busiest Police Stations

SELECT
    [Primary Type] AS CrimeType
    ,ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
ON cc.[District] = ps.[District]
GROUP BY ps.[Address],cc.[Primary Type]
ORDER BY NumCrimes DESC;
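Because the query above groups by both station address and crime type, it really ranks station/crime-type combinations. If you want a single overall ranking of the busiest stations, dropping Primary Type from the select list and the grouping gives the raw totals:

-- Overall number of reports handled per police station address
SELECT TOP 10
    ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
ON cc.[District] = ps.[District]
GROUP BY ps.[Address]
ORDER BY NumCrimes DESC;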

 

These queries give you a good idea of the types of questions you can ask of your data with T-SQL; however, as you can see, every question requires writing a new query and waiting for the results. When analyzing data, it is generally a good idea to visualize it in a tool that end users are comfortable with and that supports ad-hoc analysis. For the purposes of this blog series, we’ll use Microsoft Excel for the analysis.

Connecting Microsoft Excel to the Data

Even though the database we’ve created is relatively large (if you pulled in the full dataset, you’ll have more than 5.5 million records in the fact table), Microsoft Excel is a viable tool for visualizing the data. For the purposes of this blog series, we’ll use a technology embedded in Excel called Power Pivot, which is based on an in-memory, column-oriented database engine called xVelocity that uses very efficient compression techniques to store massive amounts of data in relatively little memory. As mentioned above, if you do not have a copy of Excel from Office 2013 installed on your machine, you can sign up for an Office 365 subscription to obtain it.

Using Power Pivot to Download Data

The first step in analyzing data with Excel is to connect Power Pivot to the data. Power Pivot should appear on the Excel Ribbon when you start Excel, as shown below (I have a few extra plug-ins enabled, so your ribbon may not show all of the same items):

If you do not see the Power Pivot Ribbon item, enable it by selecting File, then Options, then Add-Ins; choose COM Add-ins in the Manage drop-down and select Go. Make sure that Power Pivot is enabled (and, while you are here, ensure that Power View is enabled as well):

Once you have Power Pivot enabled, select the Power Pivot ribbon item and then select the Manage option:

Select the From Database option, select SQL Server, and then enter your database connection information:

Select Next, and then select the Select from a list of …. option:

Select the ChicagoCrimes table, and then select the Select Related Tables option (note my database has additional tables that have not been discussed yet):

Click Finish to begin the import of data:

The import operation will take several minutes to complete. When it’s done press Close and you will see the data represented in the Power Pivot management interface:

At the bottom of the screen you’ll see the various tables represented. Click through each of them to see the data stored in Power Pivot, then click Diagram View to see a data diagram (which is created automatically with the relationships inherited from the data model).

Close the Power Pivot window (select the X in the upper right) and then save the Excel workbook. Note that even with more than 5 million rows in the fact table, the total size of the workbook is under 500MB.

From the blank Excel worksheet, choose the Insert command on the ribbon and then select Power View:

Note that you will see a blank report canvas and a list of available fields. (If you do not have Silverlight installed on your machine, you will be prompted to install it; Silverlight is required for the embedded version of Power View at this time.)

For this example, we will implement a simple dashboard that answers the same questions we used above. To start, expand the ChicagoCrimes table on the right side and select the Primary Type field. In the FIELDS box, change the drop-down option to Count (Not Blank).

Then scroll down and expand the DateDimension table. Select the Year item, and in the FIELDS area drag Year above # Count of Primary Type. Then select the Year item and change the drop-down to Do Not Summarize. This results in a table representing the total number of crimes per year, and should look like this:

In the Power View Fields list, scroll back to the ChicagoCrimes item and drag the Primary Type field to the Filters area. Select Homicide from the list of Primary Type fields. This will result in a view that shows the total number of Homicides per year.

Once you have the table showing the data you want to visualize, click any of the rows in the table, and then from the Ribbon, select Other Chart and then Line. This will change the table to a line chart representing the number of Homicides per year. Drag the corners of the window to size the graph to fit your canvas.

On the report canvas, click outside of the graph you just created, and then from the Power View Fields area select the Location Description item in the ChicagoCrimes table. Also select the Primary Type field, then in the Values area choose Primary Type and change the drop-down to Count (Not Blank). The result should look like this:

Click any row in the new table and then from the Ribbon, select Bar Chart and then select Stacked Bar. This will change the table to a stacked bar chart. Above the chart, select the Count of Primary Type field in the sort by area, and then select desc to sort in descending order.

Click on a blank area of the canvas, and from the Power View Fields area, select Primary Type and then expand the Community table and select the Communityad field. In the Values area, change the Primary Type field to Count (Not Blank).

Click on any row in the new table, and from the Ribbon, select Map. When the map renders, drag the corners of the window to resize the map accordingly.

You can also click the arrow in the Filters area to collapse the filter, and then type a title into the Title area.

The result is a very nice dashboard-like visualization of Chicago homicide data that is easy to interpret. You can save the Excel workbook at this time to ensure that the dashboard is saved. You can add additional dashboards by selecting the + item at the bottom of the page and then repeating the steps above to add different fields to the canvas.

Conclusion

In this post, we continued the construction of our demonstration database by adding indexes to improve query efficiency and then we used Microsoft Excel, Power Pivot and Power View to create compelling visualizations from the connected data.

In the next post in this series, I will explain how to update the database with new data and then refresh the Excel workbook to include it. In future posts we will also use Microsoft Power BI to visualize the data and connect it to Microsoft Azure Machine Learning to perform predictive analytics on the crime data.

 

 

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 2

In the previous article in this series, I wrote about provisioning the database in Azure and downloading the crime statistics data from the City of Chicago.

In this post, I will discuss transforming the initial data into the various tables necessary to perform meaningful analysis. Keep in mind that the steps I’m laying out here are not the only way to accomplish the task at hand; I’m just providing a relatively simple and straightforward approach to creating data structures that can later be used/consumed by other tools that will be discussed in later posts in this series.

If you haven’t performed the steps in the previous article, you’ll want to do that first to have the basic environment available in order to follow the steps laid out here.

Creating the Basic Table Structure

Since the end state of this database is to provide a structure for efficient queries and analysis, we’ll stick with a simple denormalized design. We will also be creating some tables that are purpose-built for specific queries, so there will be duplicated data within the database we’re creating. This should by no means be seen as a best practice, but rather as a way to build data structures that support specific use cases.

With all of the above said, the database structure will look like this:

This design is primarily focused on providing efficient query operations and there are likely many different opinions on how the data could be better structured. Again, for the purpose of this series of blog posts we’re not going to focus on best practices, but rather on getting the job done.

This database design is known in the Data Warehouse world as a star schema. The schema consists of a central Fact Table and surrounding Dimension Tables. In short the fact table contains the information we’re interested in, and the dimension tables contain the descriptors for that information.

Creating the Fact Table (ChicagoCrimes)

The heart of the database will be the ChicagoCrimes fact table. For the purposes of this blog post, we’re going to create a table that uses the same column names as the original data that we downloaded. This means creating columns with spaces in their names, something to which I am generally averse; however, it makes it easier to understand how all the pieces fit together in the final puzzle. To create this table, connect to your database instance in Azure using SQL Server Management Studio (see the instructions in the first post if you don’t know how to do this), change to the database you created, and execute the following T-SQL script:

CREATE TABLE [dbo].[ChicagoCrimes]
(
    [ID] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Case Number] [varchar](50) NOT NULL,
    [Date] [datetime] NOT NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](10) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [nvarchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](10) NULL,
    [Domestic] [varchar](10) NULL,
    [Beat] [varchar](10) NULL,
    [District] [int] NULL,
    [Ward] [varchar](10) NULL,
    [Community Area] [int] NULL,
    [FBI Code] [varchar](10) NULL,
    [X Coordinate] [int] NULL,
    [Y Coordinate] [int] NULL,
    [Year] [int] NULL,
    [Updated On] [datetime] NULL,
    [Latitude] [float] NULL,
    [Longitude] [float] NULL,
    [Location] [varchar](50) NULL,
    [DateOnly] [date] NULL
);

 

This should execute very quickly. The above script creates the ChicagoCrimes table, marking the ID column as the Primary Key, which also creates a clustered index based on the ID column.

Create the Remaining Tables

The next step is to create the remaining tables. Execute the following T-SQL script:

CREATE TABLE [dbo].[Community]
(
    [Community_id] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community] [varchar](30) NULL,
    [Area] [varchar](20) NULL,
    [Side] [varchar](15) NULL,
    [Communityad] [varchar](50) NULL,
    [WeatherStation] [varchar](4) NULL
);

 

CREATE TABLE [dbo].[DateDimension]
(
    [PKIDDate] [varchar](50) NOT NULL PRIMARY KEY CLUSTERED,
    [Date] [date] NOT NULL UNIQUE,
    [Day] [int] NULL,
    [DaySuffix] [varchar](50) NULL,
    [DayOfWeek] [int] NULL,
    [DayOfWeekName] [varchar](50) NULL,
    [DOWInMonth] [int] NULL,
    [DayOfYear] [int] NULL,
    [WeekOfYear] [int] NULL,
    [WeekOfMonth] [int] NULL,
    [Month] [int] NULL,
    [MonthName] [varchar](50) NULL,
    [Quarter] [int] NULL,
    [QuarterName] [varchar](50) NULL,
    [Year] [int] NULL,
    [StandardDate] [date] NULL,
    [HolidayText] [varchar](50) NULL,
    [Season] [varchar](50) NULL
);

 

CREATE TABLE [dbo].[PoliceStations]
(
    [District] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Address] [varchar](255) NULL,
    [City] [varchar](50) NULL,
    [State] [varchar](50) NULL,
    [Zip] [varchar](50) NULL,
    [Website] [varchar](500) NULL,
    [Phone] [varchar](50) NULL,
    [Fax] [varchar](50) NULL,
    [TTY] [varchar](50) NULL,
    [Location] [varchar](255) NULL
);

 

CREATE TABLE [dbo].[SocioEconomicFactors]
(
    [Community Area Number] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community Area Name] [varchar](50) NULL,
    [Percent of Housing Crowded] [float] NULL,
    [Percent Households Below Poverty] [float] NULL,
    [Percent Aged 16+ Unemployed] [float] NULL,
    [Percent Aged 25+ Without High School Diploma] [float] NULL,
    [Percent Aged Under 18 or Over 64] [float] NULL,
    [Per Capita Income] [float] NULL,
    [Hardship Index] [int] NULL
);

The above script will create each of the primary tables needed in the database. The next step is to create the constraints that will ensure referential integrity between the tables.

Create the Foreign Key Constraints

This step is not strictly necessary in order to have a functional database, but it will provide a framework that ensures the data stored in the fact table is properly related to the data in the dimension tables. While we’re not focusing on perfection in modeling this particular database, we are going to use it for analysis, so it’s important to do what we can to ensure that the data we analyze is at least consistent with the model we build.

To create the Foreign Key constraints, execute the following T-SQL script:

ALTER TABLE ChicagoCrimes ADD
    CONSTRAINT FK_Community FOREIGN KEY ([Community Area]) REFERENCES dbo.Community (Community_id),
    CONSTRAINT FK_DateDim FOREIGN KEY ([DateOnly]) REFERENCES dbo.DateDimension ([Date]),
    CONSTRAINT FK_PoliceStation FOREIGN KEY (District) REFERENCES dbo.PoliceStations (District),
    CONSTRAINT FK_Socio FOREIGN KEY ([Community Area]) REFERENCES dbo.SocioEconomicFactors ([Community Area Number]);
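If you’d like to confirm that the constraints were created (optional), the sys.foreign_keys catalog view lists them for the fact table:

-- Lists the foreign keys defined on the fact table and the tables they reference
SELECT name AS ConstraintName,
       OBJECT_NAME(referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID('dbo.ChicagoCrimes');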

Once the constraints are added, the dimension tables can be populated.

Populating the Dimension Tables

In a star schema, the dimension tables provide descriptive data about related information in the fact table. The dimension tables will need to be populated before the fact table due to the foreign key constraints that were just added.

Populating the Date Dimension

Arguably the most important dimension in any star schema is the date dimension. The more descriptive information you can provide about a date in the dimension table, the more interesting your analysis can be. For simplicity’s sake, I will provide an export of the date dimension that I use most often, which has been populated with many descriptive columns and seasons for the northern hemisphere. You can download the .csv file here. (Save the file to your local drive.)

Once the file is downloaded, start the SQL Server Import/Export Wizard (refer back to the first article in this series if you need a refresher on how to start the wizard).

Click Next and choose Flat File Source from the drop down menu. Select the file that you just downloaded in the File name textbox. Uncheck the Column Names in First Data Row checkbox.

Click Advanced on the left side, and change the Data Type to Unicode String [DT_WSTR] on columns 0, 3, 5, 11, 13, 16, and 17 (this ensures the data types are converted properly when loaded into the dimension table).

Click Preview on the left side to verify that the file is being properly read.

Choose Next and then select SQL Native Client .. in the Destination drop down and then enter your server connection information, then select the ChicagoCrime database:

Click Next and ensure the DateDimension table is selected in the Destination field:

Click Next and ensure the Convert checkbox is selected for every row that requires a conversion (denoted by a caution icon on the left side of the dialog):

Click Next twice and then click Finish to load the data into the DateDimension table. You should see 54787 rows loaded.

Populating the Community Dimension

The Community dimension table is used to relate the area that the crime report occurred in to a community in Chicago. This table also has geographic breakdowns (North Side, South Side, etc) as well as a relevant weather station (that will be used in later posts in this series). You can download the data file for the Community dimension here (save the file to your local machine).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox.

Choose Advanced on the left side and change the data type of Column 0 to 4 byte unsigned integer [DT_UI4], and change the remaining column data types to Unicode string [DT_WSTR].

Choose Preview on the left side to verify that the columns are properly recognized

Choose Next and then select SQL Server Native Client xx in the Destination drop down and then enter your server information. Choose ChicagoCrime in the Database drop down.

Click Next and make sure that the Community table is shown in the Destination

Click Next twice and then Finish to populate the table. You should see 78 rows transferred.

Populating the PoliceStations Dimension

The PoliceStations dimension is used to relate the area that a crime occurred to the police precinct that services the area. You can download the data file for the PoliceStations dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox.

Click Advanced on the left side. Change the data type of Column 0 to 4 byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]. Change the Output column width on columns 5 and 9 to 255.

Click Preview on the left side to ensure that the columns are correctly displayed.

Click Next and select SQL Server Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and verify that the PoliceStations table is listed in the Destination.

Click Next twice and then Finish to load the PoliceStations table. You should see 23 rows transferred.

Populating the SocioEconomicFactors Dimension

The SocioEconomicFactors dimension is used to describe the social and economic factors of a given neighborhood in Chicago. This will be useful in later analysis of the crime data.

You can download the data file for the SocioEconomicFactors dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox. Change the Code page to 1252 (ANSI – Latin I).

On the left side, select Preview to ensure the columns are properly displayed.

 

Click Next and select SQL Server Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and make sure that the SocioEconomicFactors table is listed in the Destination. (You will have to select the SocioEconomicFactors table in the drop down since the name will be incorrect when the list is initially populated)

Click Next twice and then Finish to load the table. You should see 78 rows transferred.

When the transfer is complete, you will have loaded all of the dimension tables and are ready to load the fact table.
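Before moving on to the fact table, a quick row-count check (optional; the expected counts are the ones noted in the import steps above) confirms that each dimension loaded completely:

SELECT 'Community' AS TableName, COUNT(*) AS NumRows FROM dbo.Community        -- expect 78
UNION ALL
SELECT 'DateDimension', COUNT(*) FROM dbo.DateDimension                        -- expect 54787
UNION ALL
SELECT 'PoliceStations', COUNT(*) FROM dbo.PoliceStations                      -- expect 23
UNION ALL
SELECT 'SocioEconomicFactors', COUNT(*) FROM dbo.SocioEconomicFactors;         -- expect 78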

Populating the Fact Table

In a star schema design, the fact table contains all of the information that will be used for analysis. It is the largest table in the entire design (in the case of this database, more than 5.5 million rows).

The process of populating the fact table involves writing a T-SQL query that reads data from the temporary table created previously (follow the steps outlined in Part 1 of this series if you do not have the temporary table) and writes it to the ChicagoCrimes table. This is a very simple process and will make use of implicit conversion to convert from the default data types in the temporary table to the data types configured in the ChicagoCrimes table.

The first step in populating the ChicagoCrimes table is to connect to the database and open a new query window. Follow the steps in the first article in this series if you are unfamiliar with doing this. Once you are connected to the database, in the new query window execute the following query:

INSERT ChicagoCrimes
SELECT
    [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
ORDER BY [Date];

The above query reads the data from the Crimes_New table (which was populated with data from the City of Chicago data portal in the previous article in this series) and populates the new ChicagoCrimes table, using implicit conversion for the data types. The ORDER BY clause ensures that the crime data is read in sequential order by date. This query will take several minutes to complete, as there should be more than 5.5 million rows of data in the Crimes_New table.
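When the insert finishes, a simple count (it should come back with more than 5.5 million rows, matching the Crimes_New staging table) confirms that everything made it across:

SELECT COUNT(*) AS TotalCrimeReports
FROM dbo.ChicagoCrimes;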

If you loaded the dimension tables properly, you will find that the query completes with no errors and you will end up with a fully-populated database. You can run some queries against the ChicagoCrimes table like:

SELECT
    dd.Season AS [Season]
    ,dd.DayOfWeekName AS [DayOfWeek]
    ,cc.[DateOnly] AS [IncidentDate]
    ,CAST(cc.[Date] AS TIME) AS [IncidentTime]
    ,cc.[Case Number] AS CaseNo
    ,cc.[Primary Type] AS CrimeType
    ,cc.Description AS [CrimeDescription]
    ,cc.[Location Description] AS [CrimeLocation]
    ,cc.Latitude AS [Latitude]
    ,cc.Longitude AS [Longitude]
    ,cc.Beat AS [Beat]
    ,c.Community AS [Community]
    ,c.Area AS [Area]
    ,c.Side AS [Side]
FROM ChicagoCrimes cc
JOIN [DateDimension] dd
ON cc.DateOnly = dd.[Date]
JOIN [dbo].[Community] c
ON cc.[Community Area] = c.Community_id
WHERE cc.[Date] > (GETDATE()-30);

This query returns a list of crimes that have been reported within the last 30 days (assuming that you have downloaded the latest data from the City of Chicago within the last 30 days).

Conclusion

In this post we continued the process of building a usable database for analysis by implementing a star-schema design and populating both the dimension and fact tables. We then executed a query to generate a Crime Report for crimes within the last 30 days in order to test the data.

In future posts I will detail the creation of indexes and views that will help make analysis of this data easier, and will also discuss tools such as Excel and Power BI for analysis.