A New Blog Home

Microsoft has decided to retire the MSDN blogs platform, which is where I had hosted my blog for the past few years. Coincident with this event, my wife Mary decided that she wants to start blogging, so that meant I had to find the time to set up a new WordPress environment that would host a couple of virtual servers, and then figure out how to migrate all of the posts from my previous blog over.

Thanks to the good people at Bluehost.com, this turned out to be a lot easier than I had originally thought.

So, this blog now has a new home (https://blog.sqltrainer.com), and very soon, you’ll start seeing some interesting non-technical content from my bride over on her very first blog.

Using Power BI Desktop to Create Data Visualizations and Explore Data

In previous posts, I discussed how to create an Azure SQL Database using data from the City of Chicago Data portal. (See previous posts on this blog for more information) Part of the process was showing how Microsoft Excel can be used in conjunction with Power Pivot and Power View in order to create data visualizations. Using tools such as Excel to build visualizations is very practical because many people are comfortable working with Excel, and sharing the data visualizations is a matter of simply sharing the Excel workbook. There are, however, situations where a richer design canvas is required, or you want to publish your visualizations to a wide audience that may not have Excel installed.

Microsoft has recently released the Power BI Desktop, a tool that will allow you to create full-featured data visualizations on your desktop, and then publish them to the Power BI service (which has a free tier as well as a professional, paid tier) allowing you to share your visualizations with anyone that has a web browser and Internet connectivity.

Downloading and Installing Power BI Desktop

Before attempting to install the Power BI Desktop application, read through the system requirements to make sure that your system will support it. To download, click the download button on the Power BI Desktop page: https://powerbi.microsoft.com/desktop. Save the file to a local folder and then double-click it to install. Follow the instructions on the screen and install the Power BI Desktop application along with any prerequisites the installer determines it needs. Once the install completes, start Power BI Desktop, which will open the following screen:

Once the application starts, you are ready to build your first dashboard. For the purposes of this post, we’ll recreate the visualizations that we built from the same data source using Excel Power View. (See Part 3 in my previous posts)

Connecting to Azure SQL Database and Loading Data

Click on the Get Data button and select the Microsoft Azure SQL Database option.

Click the Connect button and then enter your Azure SQL Database Server and Database information and then click OK.

When prompted, select Database Authentication and enter your connection credentials.

If you receive an error related to the firewall rules, follow the instructions from Managing the Database in Part 1 and then click the Refresh Preview button to refresh the data connection. If you do not receive an error, skip to the Navigator section below.

Once the refresh is complete you will see a list of objects in the database (which will look slightly different than the picture below as there are more objects in my database than what I’ve documented in previous posts)

Using the Close & Load button, select Close to return to the blank canvas. When the canvas loads, select the Recent Sources button and select your database connection. This will open the Navigator.

Select the ChicagoCrimes, Community, DateDimension, PoliceStations and SocioEconomicFactors tables and then click the Load button.

The data load will start (during this process, Power BI Desktop is loading the data into a local Data Model that is using the same xVelocity data engine that Power Pivot uses) and will take some time to complete.

Once the data is loaded, we will want to ensure all of the appropriate relationships are in place and will also use the DAX language to create a very simple calculated measure to ensure that appropriate visualizations are easy to create. You will also want to click the Save command to save the data model at this time.

Managing Data Relationships

When the data load completes, you’ll be returned to the Power BI Desktop canvas. You will want to verify that the relationships between the fact table and dimension tables were properly detected. (At the time I’m writing this article, there is an issue with Power BI Desktop where it does not detect the relationship between the DateDimension table and the ChicagoCrimes fact table.) Click the Relationships icon on the left side (the bottom button) to open the data diagram view.

If there are any missing relationships (in the image above, the relationship between the ChicagoCrimes table and the DateDimension table is missing), click the Manage Relationships button on the ribbon to open the Manage Relationships dialog.

Click New to create a new relationship. Select the ChicagoCrimes table from the first drop-down, and then highlight the DateOnly column. In the next drop-down, select the DateDimension table and verify that the Date column is selected. Then click OK.

Select Close and verify that the relationship has been created. When all the relationships are created, the data model should look like this:

 

Once the relationships are properly created, we’ll want to add a calculated measure to simplify the creation of visualizations using the model.

Creating a New Calculated Measure

For the purposes of this post, we will create a very simple calculated measure that provides a count of the number of crime reports in the data. To create the measure, we’ll use the DAX language (typically thought of as a Power Pivot language, but since Power BI Desktop uses the same engine, it applies here as well).

Select the Data icon on the left side (the middle icon) to open the data view, and ensure that the ChicagoCrimes table is selected in the Fields area on the right.

Click the New Measure button in the ribbon to enable the formula bar. Enter the DAX expression TotalReports = COUNT([ID]) and then click the checkmark icon to ensure that the expression is correct.
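As a quick sanity check, the value the measure returns should match a simple aggregate run directly against the source database. A minimal comparison query, assuming the ChicagoCrimes table built earlier in this series:

-- The measure should return the same value as this aggregate over the source table
SELECT COUNT([ID]) AS TotalReports
FROM dbo.ChicagoCrimes;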

 

Once the new measure is created, we will want to change the data type of the Year column in the DateDimension table to ensure that appropriate reports can be easily created. Because Power BI Desktop is designed to minimize errors and simplify the creation of reports, there are sometimes manipulations to the data that need to be made in order for Power BI to properly render visualizations of that data. In the case of the DateDimension table, we will want to use the Year column as an axis in a timeline. Because the Year column is currently defined as an Integer type, Power BI Desktop will consider that column a measure and make it difficult to use on a chart axis. The simple workaround is to change the data type of the column to Text.

In the fields section, select the DateDimension table and then select the Year column. From the ribbon, change the Data Type to Text.

Once these changes are made, you are ready to create a report based on the model.

Creating a Report

For the purposes of this post, we’ll recreate the visualization that was created in Excel Power View in Part 3 of the earlier blog series. Click the Report icon on the left side (the top icon) to open the report design canvas.

From the Fields area, select the TotalReports measure in the ChicagoCrimes table and then expand the DateDimension table and select the Year column. A column chart will be created on the canvas.

In the Visualizations area, select the Line Chart icon to change the visualization to a line chart. On the canvas, click a blank area and then in the Fields section, select the TotalReports measure as well as the Location Description field in the ChicagoCrimes table. On the canvas, drag the resulting visualization to the right of the line chart and resize it appropriately.

With the new chart selected, in the Visualizations area select the Bar Chart icon to change the visualization type to a bar chart. On the Bar chart, click the ellipsis (3 dots in the upper right) and sort the chart by the TotalReports field.

Click in a blank area on the canvas and then in the Fields area, select TotalReports from the ChicagoCrimes table and then expand the Community table and select the Communityad field. In the Visualizations area select the Map icon to change the visualization to a map, and then from the Fields area drag the Communityad field to the Location area under Visualizations. Resize the visualization appropriately.

 

After completing all of the steps above, you will have a report that details all types of crime reports in Chicago from 2001 until the last date you loaded into your database. To filter the data to a specific type of crime, drag the Primary Type field from the ChicagoCrimes table to the Page Level Filters area under Visualizations. To replicate the report that was created in Excel, select the Homicide type in the filter.

Conclusion

In this blog post, we created a report using Power BI Desktop that connected to a Microsoft Azure SQL Database and detailed Homicides in the City of Chicago from 2001 until the last day of data that you loaded into your database.

Power BI Desktop is a powerful data analysis and visualization tool that allows you to gain insight from your data very quickly, and also share that insight by posting reports to the Power BI Service (future posts will discuss posting reports to the Power BI service).

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 4

The previous posts of this series (part 1, part 2, and part 3) focused on using publicly available crime report data to build a “real world” database in Azure, then showed how you can use some of the Power BI tools in Microsoft Excel to create interesting visualizations with that data.

In this post, I’ll walk through how to manually update the database with the latest data from the City of Chicago data portal so that your visualizations will be current. In future posts we’ll discuss how to automate this process, but for now, we’ll use a simplistic manual process.

Obtaining the Latest Data

The first step in updating the database is to understand what date and time the last police report was filed that you have added to your database. In order to accomplish this, you will need to use SQL Management Studio and connect to your database (if you need a refresher on how to do this, see the other posts in this series) and execute the following query:

SELECT MAX([Date]) FROM ChicagoCrimes;

 

This will return a single row with a DATETIME value such as: 2015-07-01 23:58:00.000

The way the City of Chicago Data Portal works is that crime data is made available on a 24-hour cycle, and complete cycles are posted to the portal on a 7-day lag. (In other words, generally speaking, the data in the portal runs through the last complete day from one week ago.) Based on the date above (7/1/2015 at 2 minutes to midnight – Iron Maiden reference not intended), we can assume that all police reports through July 1, 2015 have been uploaded to our database, and we’ll need to start with police reports from July 2.
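If you’d rather not work out that starting date by hand, it can be computed directly from the fact table. A minimal query, assuming the ChicagoCrimes table from the earlier posts:

-- The day after the newest report already in the database is the first date to request from the portal
SELECT DATEADD(DAY, 1, CAST(MAX([Date]) AS date)) AS FirstDateToLoad
FROM dbo.ChicagoCrimes;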

To download the latest police report data, browse to the City of Chicago’s data portal which is available at https://data.cityofchicago.org/ and select the Crimes – 2001 to present link:

This will open the dataset in a browser window with several options available to you:

In the upper-right-hand section of the window, you will see a Filter option. Select that option to open the filter pane:

Select the Add a New Filter Condition button to add a filter. Select the Date field, and then select the Is After option, and then select the target date (in my example, July 2) as shown:

After a few moments, the window will refresh and you will have a much smaller dataset than what you started with. Once the screen refreshes, in the upper-right-hand side of the screen, select the Export button:

This will open the Download dialog, where you will want to choose the CSV for Excel option and then save the file to your local hard drive.

Once the file is saved, you will be ready to upload it to your Azure database environment.

Updating the Database

The first post in this series discussed downloading the initial data from the City of Chicago and uploading it to a temporary table in Azure SQL Database. The first part of the process to update the data is exactly the same. The first step will be to upload the new data to a temporary table (for the purposes of this post, we’ll use the same table we created during that process) and then we’ll use T-SQL code to update the fact table in our database.

Preparing the Temporary Table

Before we upload the new data, we will want to truncate the temporary table. This step is not strictly necessary, as the T-SQL query we’ll run will not insert duplicates into the fact table; however, for efficiency’s sake we’ll want to start with a blank table.

Connect to your Azure SQL Database instance and execute the following T-SQL command:

TRUNCATE TABLE Crimes_New;

 

This will ensure that the temporary table that we used during the initial creation of the database is ready to use again. If for some reason you deleted the initial temporary table, you can create it with the following T-SQL script:

CREATE TABLE [dbo].[Crimes_New](
    [ID] [varchar](50) NOT NULL PRIMARY KEY,
    [Case Number] [varchar](50) NULL,
    [Date] [varchar](50) NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](50) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [varchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](50) NULL,
    [Domestic] [varchar](50) NULL,
    [Beat] [varchar](50) NULL,
    [District] [varchar](50) NULL,
    [Ward] [varchar](50) NULL,
    [Community Area] [varchar](50) NULL,
    [FBI Code] [varchar](50) NULL,
    [X Coordinate] [varchar](50) NULL,
    [Y Coordinate] [varchar](50) NULL,
    [Year] [varchar](50) NULL,
    [Updated On] [varchar](50) NULL,
    [Latitude] [varchar](50) NULL,
    [Longitude] [varchar](50) NULL,
    [Location] [varchar](50) NULL
);

 

Once the table is truncated, we’re ready to use the Import/Export Wizard to upload the new data.

Uploading New Data

Start the SQL Server Import and Export Wizard (from the Start menu, search for SQL Server Import and Export Data), choose Flat File Source from the Data source drop-down menu, browse for the file that you downloaded earlier, and add a double quote (") to the Text qualifier textbox:

Select the Columns option on the left-hand side and verify that the columns are aligned properly:

Then select the Advanced option on the left-hand side and select the Description column. Change the Data Type to text stream [DT_TEXT] to ensure that the full width of the Description column will be uploaded.

Click Next and then select the SQL Server Native Client x.x (if you have SQL Server 2014 tools installed, your Native Client version will be 11.0) option from the Destination drop down. Enter your Azure SQL Database server name in the Server name: textbox and then choose the Use SQL Server Authentication radio button. Enter your User name and Password to connect to the Azure SQL Database instance and then select the ChicagoCrime database in the Database drop down menu.

Choose Next, and then select the [dbo].[Crimes_New] table in the Destination:<servername> field:

Click Next, and ensure the Run immediately checkbox is selected and then click Finish to begin the import.

Once the import is complete (depending on how much time has elapsed between the initial load and the update, this process will either be very quick or could take a few minutes) you will see the following:

Click Close to close the Import Export Wizard. Once the temporary table has been loaded, you can connect to the database and transfer the data to the fact table in the database.

Updating the Fact Table with the New Data

Adding the new data to the ChicagoCrimes fact table is a relatively straightforward process. You could easily modify the process above to load the data directly into the fact table from the text file; however, by separating the upload from the update, you ensure that any errors that occur can be corrected easily.
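If you’d like to know ahead of time how many staging rows the update will actually move, you can run an optional check that mirrors the WHERE clause of the update script below:

-- Count the staging rows that are newer than anything already in the fact table
SELECT COUNT(*) AS RowsToInsert
FROM dbo.Crimes_New
WHERE [Date] > (SELECT MAX([Date]) FROM dbo.ChicagoCrimes);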

To update the fact table, connect to your database instance using SQL Server Management Studio and execute the following script:

INSERT ChicagoCrimes
SELECT [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
WHERE [Date] > (SELECT MAX([Date]) FROM ChicagoCrimes);

 

When the script finishes, you should see the same number of rows affected as you saw transferred in the Import Export Wizard:

Once the script is complete, the fact table in your database will be updated with the latest information available. You can verify that the data was updated properly by re-running the following T-SQL query and comparing the date returned with the initial date from above:

SELECT MAX([Date]) FROM ChicagoCrimes;

 

You should see a newer date returned.

Updating the Excel Workbook

In Part 3 of this series we used an Excel Workbook, a Power Pivot model and Power View to create a simple visualization of crime data. Once the Fact Table has been updated, you’ll want to return to the workbook and refresh the data to ensure the new rows are loaded into the visualizations.

Start Excel and open the workbook you created in Part 3, and when the workbook is opened, click the Enable Content button:

Select Power Pivot from the ribbon and select the Manage command to open the Power Pivot window:

From the ribbon, select the Refresh drop-down and select Refresh All and when prompted, enter the password for the database. When the refresh is complete, you should see the following dialog:

Note that the refresh will repopulate the entire Power Pivot model, not just the added rows.

Conclusion

In this post, we downloaded the latest information from the City of Chicago data portal and manually updated the database with the new information. You also updated the Excel Workbook with visualizations that you created earlier. This process can be re-run any time you want to add updated data to your database.

Future posts in this series will detail how to use the Azure Data Factory to automate the process of updating the data. I will also be creating a new series of posts that will use the data from this series to create an Azure Machine Learning predictive analytics model.

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 3

In the previous posts of this series (Part 1 and Part 2) I walked through the creation and population of an Azure SQL Database that contains “real world” crime statistics (with granularity down to the individual police report). This is a very simple database design, but it does contain a fair amount of data (over 5M rows in the fact table if you used the entire dataset) and provides enough diversity to allow for some very interesting analytics queries.

In this post, I’ll detail creating some very specific indexes that will support basic analytics, and will also walk through some analysis of the data using Microsoft Excel and Power Pivot, along with Excel Power View. If you don’t already have these products installed, you can obtain them by signing up for an Office 365 ProPlus subscription (which gives you a ton of additional features as well). In many cases you can obtain these products through your company’s “Home Use” program if they have a Microsoft Enterprise Agreement.

Create Indexes to Support Queries

Because the data model that we are working with is a relatively simple star schema, we don’t have to worry too much about complicated index analysis to create indexes that will support our queries.

As a review, this is what the data model looks like:

Each of the tables in the schema has a Primary Key constraint, and the fact table has Foreign Key constraints that reference each of the dimension tables. Each Primary Key is implemented by creating a Clustered Index, which means that each table is automatically sorted by the Primary Key column. For the most part this is good if we are doing a lot of targeted searches of our data and looking up crime statistics by the primary key (for example, in the ChicagoCrimes fact table the primary key is defined on the ID column; we will likely never use the ID column to search for a crime, so we will want to create additional indexes to support how we actually use the data), but it is not very efficient when you execute queries that look up or summarize data by a different field.
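If you want to see exactly which indexes the fact table already has before adding new ones, the catalog views will tell you. An optional check, assuming the dbo.ChicagoCrimes table created in Part 2:

-- List the existing indexes on the fact table and the columns they cover
SELECT i.name AS IndexName,
       i.type_desc AS IndexType,
       c.name AS ColumnName
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns AS c
    ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE i.object_id = OBJECT_ID('dbo.ChicagoCrimes')
ORDER BY i.index_id, ic.key_ordinal;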

In order to understand what additional indexes you will need, it’s important to understand what types of questions that you will be asking of your data. For the purposes of this blog post, we will limit these questions to the following:

  • How many murders have been committed? How many murders occur each year?
    • For this question, we will be searching the data by the Primary Type field in the fact table.
  • What type of location (street, sidewalk, etc.) has the majority of crimes?
    • For this question, we will be searching the data by the Location Description field.
  • What part of the city has the most crime?
    • For this question, we will be searching the data by the Community Area field.
  • Which police station is the busiest?
    • For this question, we will be searching the data by the District field.

In SQL Server (and Azure SQL DB) we can only have a single Clustered Index. This Index defines the storage for the table and the data will always be ordered according to the Clustered Index. In order to create additional indexes, we will need to make them Non-Clustered, which means that additional storage will be used for the Index keys.

For each of the above questions, we’ll create an additional non-clustered index on the fields in the fact table that will be used. Since we’ll also be interested in summarizing reports by the date that they occurred, we’ll also create an index on the DateOnly field.

Creating the Indexes

In order to create the additional indexes, you’ll use the SQL Server Management Studio and connect directly to the database instance in Azure. Start SQL Management Studio and connect to your database in Azure (if you need a reminder on how to do this, see Part 1 of this blog series). Ensure that you have selected the ChicagoCrime database and execute the following T-SQL script for each index you need to create (note that each script will take several minutes to run):

Primary Type Index

CREATE NONCLUSTERED INDEX [IDX_PrimaryType] ON [dbo].[ChicagoCrimes]
(
    [Primary Type] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Location Description Index

CREATE NONCLUSTERED INDEX [IDX_LocationDescription] ON [dbo].[ChicagoCrimes]
(
    [Location Description] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Community Area Index

CREATE NONCLUSTERED INDEX [IDX_CommunityArea] ON [dbo].[ChicagoCrimes]
(
    [Community Area] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

District Index

CREATE NONCLUSTERED INDEX [idx_District] ON [dbo].[ChicagoCrimes]
(
    [District] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Date Index

CREATE NONCLUSTERED INDEX [idx_Date] ON [dbo].[ChicagoCrimes]
(
    [DateOnly] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Now that the indexes are created, we can run some simple queries to answer some questions about the data.

Asking Questions of the Data

If we understand the structure of the data and know some specific questions that we want to ask, we can use T-SQL to write a query that will return a dataset that answers the question. For example, to implement the questions above, we can use the following T-SQL queries from SQL Management Studio while connected to the database:

Murders by Year

SELECT
    DATEPART(yy, [DateOnly]) AS [Year]
    ,COUNT([Primary Type]) AS NumberMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide'
GROUP BY DATEPART(yy, [DateOnly])
ORDER BY [Year];

Top 10 Crime Locations

SELECT TOP 10
    [Primary Type] AS CrimeType
    ,[Location Description] AS [Location]
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes
GROUP BY [Location Description], [Primary Type]
ORDER BY NumCrimes DESC;

Top 10 Worst Crime Areas

SELECT TOP 10
    c.[Community]
    ,cc.[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
    ON cc.[Community Area] = c.Community_id
GROUP BY cc.[Primary Type], c.[Community]
ORDER BY NumCrimes DESC;

 

(Note: this query is a little misleading because some of the crime reports do not include a community, and those reports are assigned to “Chicago”.)
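To see how large that catch-all bucket actually is, you can group the reports by community. A quick follow-up query, assuming the Community dimension from Part 2:

-- Reports per community; the 'Chicago' row represents reports with no specific community assigned
SELECT c.[Community], COUNT(*) AS NumReports
FROM ChicagoCrimes cc
JOIN Community c
    ON cc.[Community Area] = c.Community_id
GROUP BY c.[Community]
ORDER BY NumReports DESC;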

Busiest Police Stations

SELECT
    cc.[Primary Type] AS CrimeType
    ,ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
    ON cc.[District] = ps.[District]
GROUP BY ps.[Address], cc.[Primary Type]
ORDER BY NumCrimes DESC;

 

These queries will give you a good idea of the types of questions that you can ask of your data with T-SQL; however, as you can see, every question you ask requires executing a new query and waiting for the results. For the most part, when analyzing data, it is a good idea to visualize the data in a tool that end users are comfortable with and that supports ad-hoc analysis. For the purposes of this blog series, we’ll use Microsoft Excel for the analysis.

Connecting Microsoft Excel to the Data

Even though the database we’ve created is relatively large (if you pulled in the full dataset, you’ll have more than 5.5 million records in the fact table), Microsoft Excel is a viable tool to visualize the data. For the purposes of this blog series, we’ll use a technology embedded in Excel called Power Pivot, which is based on an in-memory, column-oriented database technology called xVelocity that uses very efficient compression techniques to store massive amounts of data in relatively little memory. As mentioned above, if you do not have a copy of Excel from Office 2013 installed on your machine, you can sign up for an Office 365 subscription to obtain it.

Using Power Pivot to Download Data

The first step to analyzing data with Excel is to connect Power Pivot to the data. Power Pivot should be an option on the Excel Ribbon when you start Excel as shown below (I have a few extra plug-ins enabled, so don’t expect to see all of the ribbon items):

If you do not see the Power Pivot ribbon item, make sure that it is enabled by selecting File, then Options, then Add-Ins; choose COM Add-ins in the Manage drop-down and select Go. Make sure that Power Pivot is enabled (and while you are here, ensure that Power View is enabled as well):

Once you have Power Pivot enabled, select the Power Pivot ribbon item and then select the Manage option:

Select the From Database option, select SQL Server, and then enter your database connection information:

Select Next, and then select the Select from a list of …. option:

Select the ChicagoCrimes table, and then select the Select Related Tables option (note my database has additional tables that have not been discussed yet):

Click Finish to begin the import of data:

The import operation will take several minutes to complete. When it’s done press Close and you will see the data represented in the Power Pivot management interface:

At the bottom of the screen you’ll see the various tables represented. Click through each of them to see the data stored in Power Pivot, then click on Diagram View to view a data diagram (which is automatically created with the relationships inherited from the data model).

Close the Power Pivot window (select the X in the upper right) and then save the Excel workbook. Note that even with more than 5 million rows in the fact table, the total size of the workbook is under 500MB.

From the blank Excel worksheet, choose the Insert command on the ribbon and then select Power View:

Note that you will see a blank report canvas and a list of fields available. (if you do not have Silverlight installed on your machine, you will be prompted to install it. Silverlight is required for the embedded version of Power View at this time)

For this example, we will implement a simple dashboard that will answer the same questions that we used above. To start the process, expand the ChicagoCrimes table on the right side and select the Primary Type field. In the FIELDS box, change the drop-down option to Count (Not Blank).

Then scroll down and expand the DateDimension table. Select the Year item and then, in the FIELDS area, drag the Year item above # Count of Primary Type. Then select the Year item and change the drop-down to Do Not Summarize. This results in a table that represents the total number of crimes per year and should look like this:

In the Power View Fields list, scroll back to the ChicagoCrimes item and drag the Primary Type field to the Filters area. Select Homicide from the list of Primary Type fields. This will result in a view that shows the total number of Homicides per year.

Once you have the table showing the data you want to visualize, click any of the rows in the table, and then from the Ribbon, select Other Chart and then Line. This will change the table to a line chart representing the number of Homicides per year. Drag the corners of the window to size the graph to fit your canvas.

On the report canvas, click outside of the graph that you just created, and then from the Power View Fields area, select the Location Description item in the ChicagoCrimes table. Also from the Power View Fields area select the Primary Type field and then from the Values area, choose the Primary Type field and change the drop down to Count (Not Blank). The result should look like:

Click any row in the new table and then from the Ribbon, select Bar Chart and then select Stacked Bar. This will change the table to a stacked bar chart. Above the chart, select the Count of Primary Type field in the sort by area, and then select desc to sort in descending order.

Click on a blank area of the canvas, and from the Power View Fields area, select Primary Type and then expand the Community table and select the Communityad field. In the Values area, change the Primary Type field to Count (Not Blank).

Click on any row in the new table, and from the Ribbon, select Map. When the map renders, drag the corners of the window to resize the map accordingly.

You can also click the arrow in the Filters area to collapse the filter, and then type a title into the Title area.

The result is a very nice dashboard-like visualization of Chicago homicide data that is easy to interpret. You can save the Excel workbook at this time to ensure that the dashboard is saved. You can add additional dashboards by selecting the + item at the bottom of the page and then repeating the steps above to add different fields to the canvas.

Conclusion

In this post, we continued the construction of our demonstration database by adding indexes to improve query efficiency and then we used Microsoft Excel, Power Pivot and Power View to create compelling visualizations from the connected data.

In the next post in this series, I will explain how to update the database with new data and then refresh the Excel workbook to include the new data. In future posts we will also use Microsoft Power BI to visualize the data and will also connect the data to Microsoft Azure Machine Learning to perform predictive analytics on the Crime Data.

 

 

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 2

In the previous article in this series, I wrote about provisioning the database in Azure and performing the download of the data from the City of Chicago crime statistics.

In this post, I will discuss transforming the initial data into the various tables necessary to perform meaningful analysis. Keep in mind that the steps I’m laying out here are not the only way to accomplish the task at hand; I’m just providing a relatively simple and straightforward approach to creating data structures that can later be used and consumed by the other tools that will be discussed in later posts in this series.

If you haven’t performed the steps in the previous article, you’ll want to do that first to have the basic environment available in order to follow the steps laid out here.

Creating the Basic Table Structure

Since the end-state of this database is to provide a structure for efficient queries and analysis, we’ll stick with a simple denormalized design. We will also be creating some tables that are purpose-built for some specific queries, so there will be duplicated data within the database we’re creating. This should by no means be seen as best practices, but rather as a way to build data structures that support specific use-cases.

With all of the above said, the database structure will look like this:

This design is primarily focused on providing efficient query operations and there are likely many different opinions on how the data could be better structured. Again, for the purpose of this series of blog posts we’re not going to focus on best practices, but rather on getting the job done.

This database design is known in the Data Warehouse world as a star schema. The schema consists of a central Fact Table and surrounding Dimension Tables. In short, the fact table contains the information we’re interested in, and the dimension tables contain the descriptors for that information.

Creating the Fact Table (ChicagoCrimes)

The heart of the database will be the ChicagoCrimes fact table. For the purposes of this blog post, we’re going to create a table that uses the same column names as the original data that we downloaded. This will mean creating columns that have spaces in the name, something to which I am generally averse; however, it makes it easier to understand how all the pieces fit together in the final puzzle. To create this table, connect to your database instance in Azure using SQL Server Management Studio (see the instructions in the first post if you don’t know how to do this), change to the database you created, and execute the following T-SQL script:

CREATE TABLE [dbo].[ChicagoCrimes]
(
    [ID] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Case Number] [varchar](50) NOT NULL,
    [Date] [datetime] NOT NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](10) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [nvarchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](10) NULL,
    [Domestic] [varchar](10) NULL,
    [Beat] [varchar](10) NULL,
    [District] [int] NULL,
    [Ward] [varchar](10) NULL,
    [Community Area] [int] NULL,
    [FBI Code] [varchar](10) NULL,
    [X Coordinate] [int] NULL,
    [Y Coordinate] [int] NULL,
    [Year] [int] NULL,
    [Updated On] [datetime] NULL,
    [Latitude] [float] NULL,
    [Longitude] [float] NULL,
    [Location] [varchar](50) NULL,
    [DateOnly] [date] NULL
);

 

This should execute very quickly. The above script creates the ChicagoCrimes table, marking the ID column as the Primary Key which will also create a clustered index based on the ID column.

Create the Remaining Tables

The next step is to create the remaining tables. Execute the following TSQL script:

CREATE TABLE [dbo].[Community]
(
    [Community_id] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community] [varchar](30) NULL,
    [Area] [varchar](20) NULL,
    [Side] [varchar](15) NULL,
    [Communityad] [varchar](50) NULL,
    [WeatherStation] [varchar](4) NULL
);

CREATE TABLE [dbo].[DateDimension]
(
    [PKIDDate] [varchar](50) NOT NULL PRIMARY KEY CLUSTERED,
    [Date] [date] NOT NULL UNIQUE,
    [Day] [int] NULL,
    [DaySuffix] [varchar](50) NULL,
    [DayOfWeek] [int] NULL,
    [DayOfWeekName] [varchar](50) NULL,
    [DOWInMonth] [int] NULL,
    [DayOfYear] [int] NULL,
    [WeekOfYear] [int] NULL,
    [WeekOfMonth] [int] NULL,
    [Month] [int] NULL,
    [MonthName] [varchar](50) NULL,
    [Quarter] [int] NULL,
    [QuarterName] [varchar](50) NULL,
    [Year] [int] NULL,
    [StandardDate] [date] NULL,
    [HolidayText] [varchar](50) NULL,
    [Season] [varchar](50) NULL
);

CREATE TABLE [dbo].[PoliceStations]
(
    [District] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Address] [varchar](255) NULL,
    [City] [varchar](50) NULL,
    [State] [varchar](50) NULL,
    [Zip] [varchar](50) NULL,
    [Website] [varchar](500) NULL,
    [Phone] [varchar](50) NULL,
    [Fax] [varchar](50) NULL,
    [TTY] [varchar](50) NULL,
    [Location] [varchar](255) NULL
);

CREATE TABLE [dbo].[SocioEconomicFactors]
(
    [Community Area Number] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community Area Name] [varchar](50) NULL,
    [Percent of Housing Crowded] [float] NULL,
    [Percent Households Below Poverty] [float] NULL,
    [Percent Aged 16+ Unemployed] [float] NULL,
    [Percent Aged 25+ Without High School Diploma] [float] NULL,
    [Percent Aged Under 18 or Over 64] [float] NULL,
    [Per Capita Income] [float] NULL,
    [Hardship Index] [int] NULL
);

The above script will create each of the primary tables needed in the database. The next step is to create the constraints that will ensure referential integrity between the tables.

Create the Foreign Key Constraints

This step is not strictly necessary in order to have a functional database, but it will provide a framework that ensures the data stored in the fact table is properly related to the data in the dimension tables. While we’re not focused on perfect modeling for this particular database, we are going to use it for analysis purposes, so it’s important to do what we can to ensure that the data we analyze is at least consistent with the model we build.

To create the Foreign Key constraints, execute the following TSQL script:

ALTER TABLE ChicagoCrimes ADD
    CONSTRAINT FK_Community FOREIGN KEY ([Community Area]) REFERENCES dbo.Community (Community_id),
    CONSTRAINT FK_DateDim FOREIGN KEY ([DateOnly]) REFERENCES dbo.DateDimension ([Date]),
    CONSTRAINT FK_PoliceStation FOREIGN KEY (District) REFERENCES dbo.PoliceStations (District),
    CONSTRAINT FK_Socio FOREIGN KEY ([Community Area]) REFERENCES dbo.SocioEconomicFactors ([Community Area Number]);
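If you want to confirm that all four constraints were created, a quick look at the catalog views will do it (an optional check):

-- List the foreign keys defined on the fact table and the tables they reference
SELECT fk.name AS ConstraintName,
       OBJECT_NAME(fk.referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys AS fk
WHERE fk.parent_object_id = OBJECT_ID('dbo.ChicagoCrimes');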

Once the constraints are added, the dimension tables can be populated.

Populating the Dimension Tables

In a star schema, the dimension tables provide descriptive data about related information in the fact table. The dimension tables will need to be populated before the fact table due to the foreign key constraints that were just added.

Populating the Date Dimension

Arguably the most important dimension of any star schema is the date dimension. The more descriptive information that can be provided about a date in the dimension table, the more interesting your analysis can be. For simplicity’s sake, I will provide an export of the date dimension that I use most often, which has been populated with many descriptive columns and seasons related to the northern hemisphere. You can download the .csv file here. (Save the file to your local drive.)
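If you would rather generate the date dimension yourself instead of downloading the CSV, the core columns can be produced with a recursive CTE. This is only a minimal sketch (it leaves DaySuffix, HolidayText, Season, and the other descriptive columns NULL, and you should extend the end date to cover your newest crime reports):

-- Generate one row per day and load the core columns of the date dimension
;WITH Dates AS
(
    SELECT CAST('2001-01-01' AS date) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Dates WHERE [Date] < '2020-12-31'
)
INSERT INTO dbo.DateDimension
    (PKIDDate, [Date], [Day], [DayOfWeek], [DayOfWeekName], [DayOfYear],
     [WeekOfYear], [Month], [MonthName], [Quarter], [Year])
SELECT CONVERT(varchar(8), [Date], 112),  -- e.g. 20150701 as a simple surrogate key
       [Date],
       DAY([Date]),
       DATEPART(WEEKDAY, [Date]),
       DATENAME(WEEKDAY, [Date]),
       DATEPART(DAYOFYEAR, [Date]),
       DATEPART(WEEK, [Date]),
       MONTH([Date]),
       DATENAME(MONTH, [Date]),
       DATEPART(QUARTER, [Date]),
       YEAR([Date])
FROM Dates
OPTION (MAXRECURSION 0);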

Once the file is downloaded, start the SQL Server Import Export Wizard (refer back to the first article in this series if you need a refresher on how to start the wizard)

Click Next and choose Flat File Source from the drop down menu. Select the file that you just downloaded in the File name textbox. Uncheck the Column Names in First Data Row checkbox.

Click Advanced on the left side, and change the Data Type to Unicode String [DT_WSTR] on columns 0,3,5,11,13,16 and 17 (this is to ensure the data types are converted properly when loaded into the dimension table)

Click Preview on the left side to verify that the file is being properly read.

Choose Next and then select SQL Native Client .. in the Destination drop down and then enter your server connection information, then select the ChicagoCrime database:

Click Next and ensure the DateDimension table is selected in the Destination field:

Click Next and ensure the Convert checkbox is selected for every row that requires a conversion (denoted by a caution icon on the left side of the dialog):

Click Next twice and then click Finish to load the data into the DateDimension table. You should see 54787 rows loaded.

Populating the Community Dimension

The Community dimension table is used to relate the area that the crime report occurred in to a community in Chicago. This table also has geographic breakdowns (North Side, South Side, etc) as well as a relevant weather station (that will be used in later posts in this series). You can download the data file for the Community dimension here (save the file to your local machine).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (“) in the Text qualifier textbox.

Choose Advanced on the left side and change the data type of Column 0 to four-byte unsigned integer [DT_UI4], and change the remaining column data types to Unicode string [DT_WSTR].

Choose Preview on the left side to verify that the columns are properly recognized

Choose Next and then select SQL Server Native Client xx in the Destination drop down and then enter your server information. Choose ChicagoCrime in the Database drop down.

Click Next and make sure that the Community table is shown in the Destination

Click Next twice and then Finish to populate the table. You should see 78 rows transferred.

Populating the PoliceStations Dimension

The PoliceStations dimension is used to relate the area that a crime occurred to the police precinct that services the area. You can download the data file for the PoliceStations dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (“) in the Text qualifier textbox.

Click Advanced on the left side. Change the data type of Column 0 to 4 byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]. Change the Output column width on columns 5 and 9 to 255.

Click Preview on the left side to ensure that the columns are correctly displayed.

Click Next and select SQL Server Native Client xx in the Destination drop-down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database drop-down.

Click Next and verify that the PoliceStations table is listed in the Destination.

Click Next twice and then Finish to load the PoliceStations table. You should see 23 rows transferred.

Populating the SocioEconomicFactors Dimension

The SocioEconomicFactors dimension is used to describe the social and economic factors of a given neighborhood in Chicago. This will be useful in later analysis of the crime data.

You can download the data file for the SocioEconomicFactors dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (“) in the Text qualifier textbox. Change the Code page to 1252 (ANSI – Latin I).

On the left side, select Preview to ensure the columns are properly displayed.

 

Click Next and select SQL Server Native Client xx in the Destination drop-down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database drop-down.

Click Next and make sure that the SocioEconomicFactors table is listed in the Destination. (You will have to select the SocioEconomicFactors table in the drop down since the name will be incorrect when the list is initially populated)

Click Next twice and then Finish to load the table. You should see 78 rows transferred.

When the transfer is complete, you will have loaded all of the dimension tables and are ready to load the fact table.

Populating the Fact Table

In a star schema design, the fact table is the table that contains all of the information that will be used to analyze the database. This is the largest table of the entire design (In the case of this database, it will contain more than 5.5M rows).

The process of populating the fact table involves writing a TSQL query that will read data from the temporary table that was created previously (follow the steps outlined in Part 1 of this series if you do not have the temporary table) and write the data to the ChicagoCrimes table. This is a very simple process and will make use of implicit conversion to convert from the default data types that the temporary table uses to the data types configured in the ChicagoCrimes table.
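Because the load relies on implicit conversion, it can be worth spot-checking the staging data for values that will not convert cleanly before running the full insert. An optional check using TRY_CAST:

-- Any rows returned here have a Date value that will not convert to datetime
SELECT TOP 10 [ID], [Date]
FROM dbo.Crimes_New
WHERE TRY_CAST([Date] AS datetime) IS NULL
  AND [Date] IS NOT NULL;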

The first step in populating the ChicagoCrimes table is to connect to the database and open a new query window. Follow the steps in the first article in this series if you are unfamiliar with doing this. Once you are connected to the database, in the new query window execute the following query:

INSERT ChicagoCrimes
SELECT [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
ORDER BY [Date];

The above query will read the data from the Crimes_New table (which was populated using data from the City of Chicago data portal in the previous article in this series) and then populate the new ChicagoCrimes table, using implicit conversion for the data types. The reason there is an Order By clause in the query is to ensure that the crime data is read in sequential order by date. This query will take several minutes to complete as there should be more than 5.5 million rows of data in the Crimes_New table.
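Once the insert finishes, a quick row-count comparison confirms that everything made it across (an optional check):

-- The two counts should match after the initial load
SELECT
    (SELECT COUNT(*) FROM dbo.Crimes_New)    AS StagingRows,
    (SELECT COUNT(*) FROM dbo.ChicagoCrimes) AS FactRows;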

If you loaded the dimension tables properly, you will find that the query completes with no errors and you will end up with a fully-populated database. You can run some queries against the ChicagoCrimes table like:

SELECT
    dd.Season AS [Season]
    ,dd.DayOfWeekName AS [DayOfWeek]
    ,cc.[DateOnly] AS [IncidentDate]
    ,CAST(cc.[Date] AS TIME) AS [IncidentTime]
    ,cc.[Case Number] AS CaseNo
    ,cc.[Primary Type] AS CrimeType
    ,cc.Description AS [CrimeDescription]
    ,cc.[Location Description] AS [CrimeLocation]
    ,cc.Latitude AS [Latitude]
    ,cc.Longitude AS [Longitude]
    ,cc.Beat AS [Beat]
    ,c.Community AS [Community]
    ,c.Area AS [Area]
    ,c.Side AS [Side]
FROM ChicagoCrimes cc
JOIN [DateDimension] dd
    ON cc.DateOnly = dd.[Date]
JOIN [dbo].[Community] c
    ON cc.[Community Area] = c.Community_id
WHERE cc.[Date] > (GETDATE() - 30);

This query will return a list of crimes that have been reported within the last 30 days (assuming that you have downloaded the latest data from the City of Chicago within the last 30 days).

Conclusion

In this post we continued the process of building a usable database for analysis by implementing a star-schema design and populating both the dimension and fact tables. We then executed a query to generate a Crime Report for crimes within the last 30 days in order to test the data.

In future posts I will detail the creation of indexes and views that will help make analysis of this data easier, and will also discuss tools such as Excel and Power BI for analysis.

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 1

One of the major issues that people who want to get started with “Data Science” or “Big Data Analytics” face is finding datasets that are both compelling and diverse enough to provide a useful playground for exploration.

In my previous life, I blogged about this subject and wrote instructions on how to create a large SQL Server database from data available from the Home Mortgage Disclosure Act (HMDA). You can read that entry here although I’m not 100% positive that the steps I’ve laid out in that blog are still valid in terms of data locations and availability.

The purpose of this series of posts is to provide instructions on how to create a SQL database in Microsoft Azure that can be used as a data source for advanced analytics. I’ll be posting future entries on this blog that refer back to this database. Also, to be clear, while I’m focusing on creating the database in Microsoft Azure, there is no reason why you couldn’t use the same techniques to create the database on a local instance of SQL Server.

For those who want to use Azure but don’t have a subscription, you can sign up for a 30-day trial here. If all you want to do is host a simple database in Azure, you can do so for around US $5 per month for a database up to 2GB in size. For a larger database, you’ll want to look at the standard tier, which starts at US $15 per month for up to 250GB. If you follow the instructions that I lay out in this post, you’ll end up with a database of about 6GB in size.

For this first article in the series, I will discuss setting up the database environment in Azure and downloading the initial data. I will also discuss the initial loading of the data into a temporary table in the database. Additional articles in this series will focus on turning the temporary table into a useful data structure as well as analyzing the data.

The Use Case

For this series, the use case I’ll be working with is crime, based on data from the city of Chicago. The reason I chose this particular dataset is that it’s a subject that most people can relate to in one way or another, and it does lend itself to some advanced analytics capabilities very well.

The City of Chicago maintains a data portal where they publish crime detail on a regular basis. This is significant because the level of detail published is very granular – down to the individual police report. When combined with other data, such as NOAA Weather data, there are a number of interesting analytics possibilities that can be realized. The solution that will be developed throughout this series will be surfaced using Microsoft PowerBI and will result in a dashboard that looks like this:

Configuring The Database Environment

The first step in building the database is to ensure that you’re properly setup in the Microsoft Azure portal. Follow the instructions here to sign up for an account (a trial account will work just fine for this, but remember that it’s only good for 30 days). Also remember that the instructions I’m providing here will also work with a local SQL Server instance, you’ll just have to modify how and where you load your data appropriately.

Once you are logged in to the Azure environment, you should have a portal that looks something like this (likely with fewer pre-configured resources however):

To create a new database, choose SQL Databases from the left side navigation, and then select New at the bottom of the page, then select Data Services / SQL Database / Quick Create and then fill out the form, choosing a New SQL database server along with a region close to you.

Once the form is filled out appropriately, choose Create SQL Database which will then submit everything to Azure and begin the process of provisioning your new database. You will see a status message appear, and it will take a few minutes to complete the process.

If you receive a message about enabling auditing on the new database, you can choose to ignore it, unless you want to experiment with auditing throughout this process.

Once the database is created, you can click the details link to view the status of the job:

Managing the Database

Now that the database has been created, you’ll want some tools to manage it. The good news is that the same tools that manage local SQL Server instances work just fine with Azure SQL Database. If you don’t already have a local instance of SQL Server Management Studio to work with, you can download a free version here (click the link to download the Express Edition, then select the appropriate 32- or 64-bit Management Studio option and follow the instructions to install it on your local machine). Make sure that you install all of the management tools, as you will be using the Import/Export Wizard to populate the initial table in the database.

The first step to enable managing your new database is to enable your client IP Address in the Firewall rules. From the Azure Management Portal, choose SQL Databases on the left side, then select the Servers tab, then select the server that you just created.

Select the Configure tab, and then select Add to the Allowed IP Addresses. Choose Yes next to Windows Azure Services (you will need this option later) and then choose Save at the bottom of the screen. This will add your local IP address to the firewall rules to enable your local machine to connect to the Azure SQL Database server.
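As an aside, once your own IP address is allowed and you can connect, firewall rules can also be managed with T-SQL against the master database of the server. A hedged example (the rule name and IP addresses below are placeholders):

-- Run in the master database of your Azure SQL Database server
EXECUTE sp_set_firewall_rule
    @name = N'HomeOffice',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';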

Once the firewall rules are saved, you’ll use SQL Server Management Studio (SSMS) to manage the database and server. To connect to the database, start SQL Server Management Studio, and when prompted, login to the new Database Server that you created above, using SQL Server Authentication and the username and password that you provided when you initially provisioned the database. (notice that you will use the fully-qualified name of the database server, which is <servername_you_provided>.database.windows.net)

Once connected, you should see your server in the Object Explorer window. You can expand Databases to see the database that you provisioned (note: My personal preference is to automatically open a new query window when I start SSMS. The default is to not open a new query. If you want to configure this option, it is available under Tools/Startup )

Once you have successfully connected to the database, you are ready to proceed to importing the data into the initial staging area.

Downloading the Crime Data

Downloading the data from the City of Chicago is a very easy process. For this initial load, we will download the entire dataset and load it in a single pass. Since the data is updated on a regular basis in the portal, later entries in this series will explain how to keep your data in sync with the portal.

Using your browser, connect to the Chicago Data Portal (https://data.cityofchicago.org/ ) and select the Crimes 2001-Present option from the middle pane of the page.

This will open the Crimes dataset in the online explorer (which is very nicely done, and allows for a wide-range of analysis directly from the portal).

In the upper-right corner of the portal, choose the Export option and then choose CSV for Excel. This will eventually open a Save File dialog (it can take a few minutes to generate the export file)

Choose Save, and the file will begin downloading. This will take several minutes to download, depending on your Internet connection speed.

Now that the file is downloaded, you can import the data into your Azure SQL Database.

Importing The Data

Note: The process that we will use to import the data is a very simplistic process. There are more efficient ways to accomplish this task, but I wanted to use the simple and easy approach to load the initial data.

To load the initial data, start the SQL Server 2014 Import Export Wizard. (It was installed along with the SSMS tools above and can be found in your Start menu. Make sure you choose the appropriate version of the tool – 64-bit or 32-bit depending on your operating system)

When the wizard starts, click Next and then choose Flat File Source. Click the Browse button and select the file that you downloaded in the previous step (in order to see the file in the window, you will need to select the CSV files option next to the File textbox) and then choose Open.

In the Text qualifier textbox, enter a double-quote (").

Ensure that the options are configured as in the image above. Choose the Columns option on the left side to ensure that the fields are lining up properly.

Once you are sure that the columns are properly lined up, select the Advanced option on the left side and then choose the Description Column. Change the data type to a text stream (DT_TEXT)

Then choose the Next button and then select the SQL Server Native Client destination. Enter your Server Name and login information, then select the ChicagoCrime database in the drop down.

Click the Next button and change the destination table name to [dbo].[Crimes_New]

Choose the Edit Mappings button, and select the Edit SQL button, and then add PRIMARY KEY CLUSTERED to the [ID] column as shown below. Azure SQL Database likes to see Clustered indexes on tables.
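The edited statement in the Edit SQL dialog ends up looking roughly like the script below (a sketch; the wizard defaults most columns to varchar(50), and the only change you are making is adding PRIMARY KEY CLUSTERED to the ID column):

-- Approximate shape of the wizard-generated statement after the edit
CREATE TABLE [dbo].[Crimes_New] (
    [ID] varchar(50) PRIMARY KEY CLUSTERED,
    [Case Number] varchar(50),
    [Date] varchar(50),
    [Block] varchar(50),
    [IUCR] varchar(50),
    [Primary Type] varchar(50),
    [Description] varchar(max),
    [Location Description] varchar(50),
    [Arrest] varchar(50),
    [Domestic] varchar(50),
    [Beat] varchar(50),
    [District] varchar(50),
    [Ward] varchar(50),
    [Community Area] varchar(50),
    [FBI Code] varchar(50),
    [X Coordinate] varchar(50),
    [Y Coordinate] varchar(50),
    [Year] varchar(50),
    [Updated On] varchar(50),
    [Latitude] varchar(50),
    [Longitude] varchar(50),
    [Location] varchar(50)
);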

Choose the OK button and then choose the Next button and then choose Finish twice to start the import operation.

This operation will run for several minutes as there are over 5 million rows of data to import.

Once the operation is complete, you can switch back to SSMS and verify that the initial table has been created and populated with a quick query against the ChicagoCrime database.
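A simple row count is enough for this check, for example:

-- Count the rows loaded into the staging table
SELECT COUNT(*) AS RowsLoaded
FROM [dbo].[Crimes_New];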

Preparing for the Next Step

In this blog post, we setup a new Azure SQL Database and imported some crime data into a temporary table in the SQL Database. You will note that we did not manipulate the data in any way, so all of the data types are currently inherited from the text file. In the next article in this series, I’ll walk through how to create the permanent table and how to populate it with the data, converting the data types to the appropriate values.

 

 

 

 

 

 

 

Welcome to the Big Data and Advanced Analytics blog!

In my work with the US Federal Government customer base, I find that most of the people I interact with are very interested in applying advanced analytics and “Big Data” technology to their work, but struggle with how to get started.

I plan to use this blog to post relevant information on building solutions using Microsoft technologies, with a focus on “Big Data” and advanced analytics.

Everything that I post here is meant to reflect my own personal opinion related to using Microsoft Technologies to solve the problems faced by our US Federal Customer base, but anyone interested in “Big Data” and “Advanced Analytics”  should find the content here interesting.

Feel free to ask questions and post comments, and I’ll respond to them as I can.

Again, Welcome to my blog!

 

Ted Malone

Solutions Architect

Microsoft Federal