Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 3

In the previous posts of this series (Part 1 and Part 2) I walked through the creation and population of an Azure SQL Database that contains “real world” crime statistics (with granularity down to the individual police report). This is a very simple database design, but it does contain a fair amount of data (over 5M rows in the fact table if you used the entire dataset) and provides enough diversity to allow for some very interesting analytics queries.

In this post, I’ll detail creating some very specific indexes that will support basic analytics, and will also walk through some analysis of the data using Microsoft Excel and Power Pivot, along with Excel Power View. If you don’t already have these products installed, you can obtain them by signing up for an Office 365 ProPlus subscription (which gives you a ton of additional features as well). In many cases you can also obtain these products through your company’s “Home Use” program if your company has a Microsoft Enterprise Agreement.

Create Indexes to Support Queries

Because the data model that we are working with is a relatively simple star schema, we don’t have to worry too much about complicated index analysis to create indexes that will support our queries.

As a review, this is what the data model looks like:

Each of the tables in the schema has a Primary Key constraint, and the fact table has Foreign Key constraints that reference each of the Dimension Tables. Each Primary Key is implemented as a Clustered Index, which means each table is physically sorted by its Primary Key column. That works well for targeted searches that look up crime statistics by the primary key, but it is not very efficient for queries that look up (or summarize) data by a different field. For example, in the ChicagoCrimes fact table the primary key is defined on the ID column; we will likely never search for a crime by ID, so we will want additional indexes that support how we actually use the data.

In order to understand what additional indexes you will need, it’s important to understand the types of questions you will be asking of your data. For the purposes of this blog post, we will limit these questions to the following:

  • How many murders have been committed? How many murders occur each year?
    • For this question, we will be searching the data by the Primary Type field in the fact table.
  • What type of location (street, sidewalk, etc.) has the majority of crimes?
    • For this question, we will be searching the data by the Location Description field.
  • What part of the city has the most crime?
    • For this question, we will be searching the data by the Community Area field.
  • Which police station is the busiest?
    • For this question, we will be searching the data by the District field.

In SQL Server (and Azure SQL DB), a table can have only one Clustered Index. That index defines the storage for the table, and the data is always ordered according to it. Any additional indexes must be Non-Clustered, which means additional storage will be used for the index keys.

For each of the above questions, we’ll create an additional non-clustered index on the fields in the fact table that will be used. Since we’ll also be interested in summarizing reports by the date that they occurred, we’ll also create an index on the DateOnly field.
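Because non-clustered indexes consume additional storage for their keys, it can be interesting to see the impact. One simple check (entirely optional) is to run sp_spaceused against the fact table before and after creating the indexes in the next section, since it reports data size and index size separately:

-- Reports reserved, data, and index sizes for the fact table
EXEC sp_spaceused N'dbo.ChicagoCrimes';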

Creating the Indexes

To create the additional indexes, you’ll use SQL Server Management Studio and connect directly to the database in Azure. Start SQL Server Management Studio and connect to your database in Azure (if you need a reminder on how to do this, see Part 1 of this blog series). Ensure that you have selected the ChicagoCrime database and execute the following T-SQL script for each index you need to create (note that each script will take several minutes to run):

Primary Type Index

CREATE NONCLUSTERED INDEX [IDX_PrimaryType] ON [dbo].[ChicagoCrimes]
(
    [Primary Type] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Location Description Index

CREATE NONCLUSTERED INDEX [IDX_LocationDescription] ON [dbo].[ChicagoCrimes]
(
    [Location Description] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Community Area Index

CREATE NONCLUSTERED INDEX [IDX_CommunityArea] ON [dbo].[ChicagoCrimes]
(
    [Community Area] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

District Index

CREATE NONCLUSTERED INDEX [idx_District] ON [dbo].[ChicagoCrimes]
(
    [District] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Date Index

CREATE NONCLUSTERED INDEX [idx_Date] ON [dbo].[ChicagoCrimes]
(
    [DateOnly] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);
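Once the scripts above have finished, you can optionally confirm that all five indexes were created with a quick query against the sys.indexes catalog view (this will also list the existing clustered primary key index):

-- Lists every index on the fact table along with its type
SELECT name AS IndexName, type_desc AS IndexType
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.ChicagoCrimes')
ORDER BY name;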

 

Now that the indexes are created, we can run some simple queries to answer some questions about the data.

Asking Questions of the Data

If we understand the structure of the data and know some specific questions that we want to ask, we can use T-SQL to write a query that will return a dataset that answers the question. For example, to implement the questions above, we can use the following T-SQL queries from SQL Management Studio while connected to the database:

Murders by Year

SELECT
    DATEPART(yy,[DateOnly]) AS [Year]
    ,COUNT([Primary Type]) AS NumberMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide'
GROUP BY DATEPART(yy,[DateOnly])
ORDER BY [Year];
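The first half of the question above (how many murders have been committed in total) needs only the overall count; a quick variant of the same filter returns it:

-- Total number of homicide reports in the dataset
SELECT COUNT(*) AS TotalMurders
FROM ChicagoCrimes
WHERE [Primary Type] = 'Homicide';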

Top 10 Crime Locations

SELECT TOP 10
    [Primary Type] AS CrimeType
    ,[Location Description] AS [Location]
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes
GROUP BY [Location Description],[Primary Type]
ORDER BY NumCrimes DESC;

Top 10 Worst Crime Areas

SELECT TOP 10
    [Community]
    ,[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
ON cc.[Community Area] = c.Community_id
GROUP BY cc.[Primary Type],c.[Community]
ORDER BY NumCrimes DESC;

 

(Note: this query is a little misleading because some crime reports do not include a community and are therefore assigned to “Chicago”.)
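If you would rather exclude those catch-all reports, a small variation of the query (a sketch that assumes the catch-all community is literally named “Chicago” in the Community table) simply filters them out:

SELECT TOP 10
    c.[Community]
    ,cc.[Primary Type] AS CrimeType
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN Community c
ON cc.[Community Area] = c.Community_id
WHERE c.[Community] <> 'Chicago' -- assumes the catch-all rows are labeled 'Chicago'
GROUP BY cc.[Primary Type],c.[Community]
ORDER BY NumCrimes DESC;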

Busiest Police Stations

SELECT
    [Primary Type] AS CrimeType
    ,ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
ON cc.[District] = ps.[District]
GROUP BY ps.[Address],cc.[Primary Type]
ORDER BY NumCrimes DESC;
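Because the query above groups by both station address and crime type, it really ranks station/crime-type combinations. If you want a single overall ranking of the busiest stations, dropping Primary Type from the select list and the grouping gives the raw totals:

-- Overall number of reports handled per police station address
SELECT TOP 10
    ps.[Address] AS PoliceDistrict
    ,COUNT(*) AS NumCrimes
FROM ChicagoCrimes cc
JOIN PoliceStations ps
ON cc.[District] = ps.[District]
GROUP BY ps.[Address]
ORDER BY NumCrimes DESC;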

 

These queries give you a good idea of the types of questions you can ask of your data with T-SQL; however, as you can see, every question requires writing a new query and waiting for the results. When analyzing data, it is generally a good idea to visualize it in a tool that end users are comfortable with and that supports ad-hoc analysis. For the purposes of this blog series, we’ll use Microsoft Excel for the analysis.

Connecting Microsoft Excel to the Data

Even though the database we’ve created is relatively large (if you pulled in the full dataset, you’ll have more than 5.5 million records in the fact table), Microsoft Excel is a viable tool for visualizing the data. For the purposes of this blog series, we’ll use a technology embedded in Excel called Power Pivot, which is based on an in-memory, column-oriented database engine called xVelocity that uses very efficient compression techniques to store massive amounts of data in relatively little memory. As mentioned above, if you do not have a copy of Excel from Office 2013 installed on your machine, you can sign up for an Office 365 subscription to obtain it.

Using Power Pivot to Download Data

The first step in analyzing data with Excel is to connect Power Pivot to the data. Power Pivot should appear on the Excel Ribbon when you start Excel, as shown below (I have a few extra plug-ins enabled, so your ribbon may not show all of the same items):

If you do not see the Power Pivot Ribbon item, enable it by selecting File, then Options, then Add-Ins; choose COM Add-ins in the Manage drop-down and select Go. Make sure that Power Pivot is enabled (and, while you are here, ensure that Power View is enabled as well):

Once you have Power Pivot enabled, select the Power Pivot ribbon item and then select the Manage option:

Select the From Database option, select SQL Server, and then enter your database connection information:

Select Next, and then select the Select from a list of …. option:

Select the ChicagoCrimes table, and then select the Select Related Tables option (note my database has additional tables that have not been discussed yet):

Click Finish to begin the import of data:

The import operation will take several minutes to complete. When it’s done press Close and you will see the data represented in the Power Pivot management interface:

At the bottom of the screen you’ll see the various tables represented. Click through each of them to see the data stored in Power Pivot, then click Diagram View to see a data diagram (which is created automatically with the relationships inherited from the data model).

Close the Power Pivot window (select the X in the upper right) and then save the Excel workbook. Note that even with more than 5 million rows in the fact table, the total size of the workbook is under 500MB.

From the blank Excel worksheet, choose the Insert command on the ribbon and then select Power View:

Note that you will see a blank report canvas and a list of available fields. (If you do not have Silverlight installed on your machine, you will be prompted to install it; Silverlight is required for the embedded version of Power View at this time.)

For this example, we will implement a simple dashboard that answers the same questions we used above. To start, expand the ChicagoCrimes table on the right side and select the Primary Type field. In the FIELDS box, change the drop-down option to Count (Not Blank).

Then scroll down and expand the DateDimension table. Select the Year item, and in the FIELDS area drag Year above # Count of Primary Type. Then select the Year item and change the drop-down to Do Not Summarize. This results in a table representing the total number of crimes per year, and should look like this:

In the Power View Fields list, scroll back to the ChicagoCrimes item and drag the Primary Type field to the Filters area. Select Homicide from the list of Primary Type fields. This will result in a view that shows the total number of Homicides per year.

Once you have the table showing the data you want to visualize, click any of the rows in the table, and then from the Ribbon, select Other Chart and then Line. This will change the table to a line chart representing the number of Homicides per year. Drag the corners of the window to size the graph to fit your canvas.

On the report canvas, click outside of the graph you just created, and then from the Power View Fields area select the Location Description item in the ChicagoCrimes table. Also select the Primary Type field, then in the Values area choose Primary Type and change the drop-down to Count (Not Blank). The result should look like this:

Click any row in the new table and then from the Ribbon, select Bar Chart and then select Stacked Bar. This will change the table to a stacked bar chart. Above the chart, select the Count of Primary Type field in the sort by area, and then select desc to sort in descending order.

Click on a blank area of the canvas, and from the Power View Fields area, select Primary Type and then expand the Community table and select the Communityad field. In the Values area, change the Primary Type field to Count (Not Blank).

Click on any row in the new table, and from the Ribbon, select Map. When the map renders, drag the corners of the window to resize the map accordingly.

You can also click the arrow in the Filters area to collapse the filter, and then type a title into the Title area.

The result is a very nice dashboard-like visualization of Chicago homicide data that is easy to interpret. You can save the Excel workbook at this time to ensure that the dashboard is saved. You can add additional dashboards by selecting the + item at the bottom of the page and then repeating the steps above to add different fields to the canvas.

Conclusion

In this post, we continued the construction of our demonstration database by adding indexes to improve query efficiency and then we used Microsoft Excel, Power Pivot and Power View to create compelling visualizations from the connected data.

In the next post in this series, I will explain how to update the database with new data and then refresh the Excel workbook to include it. In future posts we will also use Microsoft Power BI to visualize the data and connect it to Microsoft Azure Machine Learning to perform predictive analytics on the crime data.

 

 

Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 2

In the previous article in this series, I wrote about provisioning the database in Azure and downloading the crime statistics data from the City of Chicago.

In this post, I will discuss transforming the initial data into the various tables necessary to perform meaningful analysis. Keep in mind that the steps I’m laying out here are not the only way to accomplish the task at hand; I’m just providing a relatively simple and straightforward approach to creating data structures that can later be used/consumed by other tools that will be discussed in later posts in this series.

If you haven’t performed the steps in the previous article, you’ll want to do that first to have the basic environment available in order to follow the steps laid out here.

Creating the Basic Table Structure

Since the end state of this database is to provide a structure for efficient queries and analysis, we’ll stick with a simple denormalized design. We will also be creating some tables that are purpose-built for specific queries, so there will be duplicated data within the database we’re creating. This should by no means be seen as a best practice, but rather as a way to build data structures that support specific use cases.

With all of the above said, the database structure will look like this:

This design is primarily focused on providing efficient query operations and there are likely many different opinions on how the data could be better structured. Again, for the purpose of this series of blog posts we’re not going to focus on best practices, but rather on getting the job done.

This database design is known in the Data Warehouse world as a star schema. The schema consists of a central Fact Table and surrounding Dimension Tables. In short the fact table contains the information we’re interested in, and the dimension tables contain the descriptors for that information.

Creating the Fact Table (ChicagoCrimes)

The heart of the database will be the ChicagoCrimes fact table. For the purposes of this blog post, we’re going to create a table that uses the same column names as the original data that we downloaded. This means creating columns with spaces in their names, something to which I am generally averse; however, it makes it easier to understand how all the pieces fit together in the final puzzle. To create this table, connect to your database instance in Azure using SQL Server Management Studio (see the instructions in the first post if you don’t know how to do this), change to the database you created, and execute the following T-SQL script:

CREATE TABLE [dbo].[ChicagoCrimes]
(
    [ID] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Case Number] [varchar](50) NOT NULL,
    [Date] [datetime] NOT NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](10) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [nvarchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](10) NULL,
    [Domestic] [varchar](10) NULL,
    [Beat] [varchar](10) NULL,
    [District] [int] NULL,
    [Ward] [varchar](10) NULL,
    [Community Area] [int] NULL,
    [FBI Code] [varchar](10) NULL,
    [X Coordinate] [int] NULL,
    [Y Coordinate] [int] NULL,
    [Year] [int] NULL,
    [Updated On] [datetime] NULL,
    [Latitude] [float] NULL,
    [Longitude] [float] NULL,
    [Location] [varchar](50) NULL,
    [DateOnly] [date] NULL
);

 

This should execute very quickly. The above script creates the ChicagoCrimes table, marking the ID column as the Primary Key, which also creates a clustered index based on the ID column.

Create the Remaining Tables

The next step is to create the remaining tables. Execute the following T-SQL script:

CREATE TABLE [dbo].[Community]
(
    [Community_id] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community] [varchar](30) NULL,
    [Area] [varchar](20) NULL,
    [Side] [varchar](15) NULL,
    [Communityad] [varchar](50) NULL,
    [WeatherStation] [varchar](4) NULL
);

 

CREATE TABLE [dbo].[DateDimension]
(
    [PKIDDate] [varchar](50) NOT NULL PRIMARY KEY CLUSTERED,
    [Date] [date] NOT NULL UNIQUE,
    [Day] [int] NULL,
    [DaySuffix] [varchar](50) NULL,
    [DayOfWeek] [int] NULL,
    [DayOfWeekName] [varchar](50) NULL,
    [DOWInMonth] [int] NULL,
    [DayOfYear] [int] NULL,
    [WeekOfYear] [int] NULL,
    [WeekOfMonth] [int] NULL,
    [Month] [int] NULL,
    [MonthName] [varchar](50) NULL,
    [Quarter] [int] NULL,
    [QuarterName] [varchar](50) NULL,
    [Year] [int] NULL,
    [StandardDate] [date] NULL,
    [HolidayText] [varchar](50) NULL,
    [Season] [varchar](50) NULL
);

 

CREATE TABLE [dbo].[PoliceStations]
(
    [District] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Address] [varchar](255) NULL,
    [City] [varchar](50) NULL,
    [State] [varchar](50) NULL,
    [Zip] [varchar](50) NULL,
    [Website] [varchar](500) NULL,
    [Phone] [varchar](50) NULL,
    [Fax] [varchar](50) NULL,
    [TTY] [varchar](50) NULL,
    [Location] [varchar](255) NULL
);

 

CREATE TABLE [dbo].[SocioEconomicFactors]
(
    [Community Area Number] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community Area Name] [varchar](50) NULL,
    [Percent of Housing Crowded] [float] NULL,
    [Percent Households Below Poverty] [float] NULL,
    [Percent Aged 16+ Unemployed] [float] NULL,
    [Percent Aged 25+ Without High School Diploma] [float] NULL,
    [Percent Aged Under 18 or Over 64] [float] NULL,
    [Per Capita Income] [float] NULL,
    [Hardship Index] [int] NULL
);

The above script will create each of the primary tables needed in the database. The next step is to create the constraints that will ensure referential integrity between the tables.

Create the Foreign Key Constraints

This step is not strictly necessary in order to have a functional database, but it will provide a framework that ensures the data stored in the fact table is properly related to the data in the dimension tables. While we’re not focusing on perfection in modeling this particular database, we are going to use it for analysis, so it’s important to do what we can to ensure that the data we analyze is at least consistent with the model we build.

To create the Foreign Key constraints, execute the following T-SQL script:

ALTER TABLE ChicagoCrimes ADD
    CONSTRAINT FK_Community FOREIGN KEY ([Community Area]) REFERENCES dbo.Community (Community_id),
    CONSTRAINT FK_DateDim FOREIGN KEY ([DateOnly]) REFERENCES dbo.DateDimension ([Date]),
    CONSTRAINT FK_PoliceStation FOREIGN KEY (District) REFERENCES dbo.PoliceStations (District),
    CONSTRAINT FK_Socio FOREIGN KEY ([Community Area]) REFERENCES dbo.SocioEconomicFactors ([Community Area Number]);
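If you’d like to confirm that the constraints were created (optional), the sys.foreign_keys catalog view lists them for the fact table:

-- Lists the foreign keys defined on the fact table and the tables they reference
SELECT name AS ConstraintName,
       OBJECT_NAME(referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID('dbo.ChicagoCrimes');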

Once the constraints are added, the dimension tables can be populated.

Populating the Dimension Tables

In a star schema, the dimension tables provide descriptive data about related information in the fact table. The dimension tables will need to be populated before the fact table due to the foreign key constraints that were just added.

Populating the Date Dimension

Arguably the most important dimension in any star schema is the date dimension. The more descriptive information you can provide about a date in the dimension table, the more interesting your analysis can be. For simplicity’s sake, I will provide an export of the date dimension that I use most often, which has been populated with many descriptive columns and seasons for the northern hemisphere. You can download the .csv file here. (Save the file to your local drive.)

Once the file is downloaded, start the SQL Server Import/Export Wizard (refer back to the first article in this series if you need a refresher on how to start the wizard).

Click Next and choose Flat File Source from the drop down menu. Select the file that you just downloaded in the File name textbox. Uncheck the Column Names in First Data Row checkbox.

Click Advanced on the left side, and change the Data Type to Unicode String [DT_WSTR] on columns 0, 3, 5, 11, 13, 16, and 17 (this ensures the data types are converted properly when loaded into the dimension table).

Click Preview on the left side to verify that the file is being properly read.

Choose Next and then select SQL Native Client .. in the Destination drop down and then enter your server connection information, then select the ChicagoCrime database:

Click Next and ensure the DateDimension table is selected in the Destination field:

Click Next and ensure the Convert checkbox is selected for every row that requires a conversion (denoted by a caution icon on the left side of the dialog):

Click Next twice and then click Finish to load the data into the DateDimension table. You should see 54787 rows loaded.

Populating the Community Dimension

The Community dimension table is used to relate the area that the crime report occurred in to a community in Chicago. This table also has geographic breakdowns (North Side, South Side, etc) as well as a relevant weather station (that will be used in later posts in this series). You can download the data file for the Community dimension here (save the file to your local machine).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox.

Choose Advanced on the left side and change the data type of Column 0 to 4 byte unsigned integer [DT_UI4], and change the remaining column data types to Unicode string [DT_WSTR].

Choose Preview on the left side to verify that the columns are properly recognized

Choose Next and then select SQL Server Native Client xx in the Destination drop down and then enter your server information. Choose ChicagoCrime in the Database drop down.

Click Next and make sure that the Community table is shown in the Destination

Click Next twice and then Finish to populate the table. You should see 78 rows transferred.

Populating the PoliceStations Dimension

The PoliceStations dimension is used to relate the area that a crime occurred to the police precinct that services the area. You can download the data file for the PoliceStations dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox.

Click Advanced on the left side. Change the data type of Column 0 to 4 byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]. Change the Output column width on columns 5 and 9 to 255.

Click Preview on the left side to ensure that the columns are correctly displayed.

Click Next and select SQL Server Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and verify that the PoliceStations table is listed in the Destination.

Click Next twice and then Finish to load the PoliceStations table. You should see 23 rows transferred.

Populating the SocioEconomicFactors Dimension

The SocioEconomicFactors dimension is used to describe the social and economic factors of a given neighborhood in Chicago. This will be useful in later analysis of the crime data.

You can download the data file for the SocioEconomicFactors dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double quote (") in the Text qualifier textbox. Change the Code page to 1252 (ANSI – Latin I).

On the left side, select Preview to ensure the columns are properly displayed.

 

Click Next and select SQL Server Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrime database from the Database dropdown.

Click Next and make sure that the SocioEconomicFactors table is listed in the Destination. (You will have to select the SocioEconomicFactors table in the drop down since the name will be incorrect when the list is initially populated)

Click Next twice and then Finish to load the table. You should see 78 rows transferred.

When the transfer is complete, you will have loaded all of the dimension tables and are ready to load the fact table.
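Before moving on to the fact table, a quick row-count check (optional; the expected counts are the ones noted in the import steps above) confirms that each dimension loaded completely:

SELECT 'Community' AS TableName, COUNT(*) AS NumRows FROM dbo.Community        -- expect 78
UNION ALL
SELECT 'DateDimension', COUNT(*) FROM dbo.DateDimension                        -- expect 54787
UNION ALL
SELECT 'PoliceStations', COUNT(*) FROM dbo.PoliceStations                      -- expect 23
UNION ALL
SELECT 'SocioEconomicFactors', COUNT(*) FROM dbo.SocioEconomicFactors;         -- expect 78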

Populating the Fact Table

In a star schema design, the fact table contains all of the information that will be used for analysis. It is the largest table in the entire design (in the case of this database, more than 5.5 million rows).

The process of populating the fact table involves writing a T-SQL query that reads data from the temporary table created previously (follow the steps outlined in Part 1 of this series if you do not have the temporary table) and writes it to the ChicagoCrimes table. This is a very simple process and will make use of implicit conversion to convert from the default data types in the temporary table to the data types configured in the ChicagoCrimes table.

The first step in populating the ChicagoCrimes table is to connect to the database and open a new query window. Follow the steps in the first article in this series if you are unfamiliar with doing this. Once you are connected to the database, in the new query window execute the following query:

INSERT ChicagoCrimes
SELECT
    [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
ORDER BY [Date];

The above query reads the data from the Crimes_New table (which was populated with data from the City of Chicago data portal in the previous article in this series) and populates the new ChicagoCrimes table, using implicit conversion for the data types. The ORDER BY clause ensures that the crime data is read in sequential order by date. This query will take several minutes to complete, as there should be more than 5.5 million rows of data in the Crimes_New table.
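When the insert finishes, a simple count (it should come back with more than 5.5 million rows, matching the Crimes_New staging table) confirms that everything made it across:

SELECT COUNT(*) AS TotalCrimeReports
FROM dbo.ChicagoCrimes;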

If you loaded the dimension tables properly, you will find that the query completes with no errors and you will end up with a fully-populated database. You can run some queries against the ChicagoCrimes table like:

SELECT
    dd.Season AS [Season]
    ,dd.DayOfWeekName AS [DayOfWeek]
    ,cc.[DateOnly] AS [IncidentDate]
    ,CAST(cc.[Date] AS TIME) AS [IncidentTime]
    ,cc.[Case Number] AS CaseNo
    ,cc.[Primary Type] AS CrimeType
    ,cc.Description AS [CrimeDescription]
    ,cc.[Location Description] AS [CrimeLocation]
    ,cc.Latitude AS [Latitude]
    ,cc.Longitude AS [Longitude]
    ,cc.Beat AS [Beat]
    ,c.Community AS [Community]
    ,c.Area AS [Area]
    ,c.Side AS [Side]
FROM ChicagoCrimes cc
JOIN [DateDimension] dd
ON cc.DateOnly = dd.[Date]
JOIN [dbo].[Community] c
ON cc.[Community Area] = c.Community_id
WHERE cc.[Date] > (GETDATE()-30);

This query returns a list of crimes that have been reported within the last 30 days (assuming that you have downloaded the latest data from the City of Chicago within the last 30 days).

Conclusion

In this post we continued the process of building a usable database for analysis by implementing a star-schema design and populating both the dimension and fact tables. We then executed a query to generate a Crime Report for crimes within the last 30 days in order to test the data.

In future posts I will detail the creation of indexes and views that will help make analysis of this data easier, and will also discuss tools such as Excel and Power BI for analysis.