Update to Creating a Real-World Database in Azure

Almost 4 years ago (amazing how time flies!) I wrote a series of posts on creating a large-scale, real-world database in Azure that could be used for demos and testing. I've gotten a remarkable amount of mileage out of this database over the years: it has become my go-to dataset for a number of scenarios, and I've used it as the basis for discussing many new Azure technologies with customers and partners. It occurred to me recently, though, that I hadn't updated the original posts to take advantage of newer Azure data integration tools such as Azure Data Factory and, specifically, Mapping Data Flows, so this post will begin to do just that.

Updating the Database

The first step in keeping your database up to date is to understand how new and updated data is made available. In the case of the demo database discussed here, the City of Chicago posts the crime dataset to its data portal every day. The dataset contains all police reports that have been filed, with a lag of approximately 7 days (i.e., the most recent report is about a week old). The dataset can be exported manually to CSV (as I detailed in the last post in this series), but it is also available via an HTTP web service that uses the SODA protocol, which is very easy to work with. The basic process to update the database will be as follows (a rough Python sketch of the retrieval step appears after the list):

  1. Identify the last date that is currently in the database
  2. Connect to the City of Chicago Data Portal and retrieve records after the last date
  3. Transform the data returned to fit the schema of the database
  4. Insert new records into the database
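
Before building this in Data Factory, it may help to see the retrieval step expressed as plain code. The sketch below is a rough Python equivalent of step 2 and is not part of the pipeline itself; it assumes the requests package, uses the SODA endpoint referenced later in this post, and takes a placeholder value for the last date that step 1 would normally supply.

```python
import requests

# The SODA endpoint used later in this post.
SODA_URL = "https://data.cityofchicago.org/resource/crimes.json"

def fetch_new_reports(last_date, limit=50000):
    """Return crime reports filed after last_date, oldest first."""
    params = {
        "$limit": limit,
        "$order": "date",
        # SoQL expects the timestamp literal wrapped in single quotes.
        "$where": f"date > '{last_date}'",
    }
    response = requests.get(SODA_URL, params=params, timeout=60)
    response.raise_for_status()
    return response.json()  # an array (list) of JSON objects

# Placeholder for whatever step 1 finds as the latest date in the database.
new_reports = fetch_new_reports("2019-06-01T00:00:00")
print(f"Retrieved {len(new_reports)} new reports")
```

The Data Factory pipeline described below performs the same request; the $where, $order, and $limit parameters will reappear when we build the Relative URL for the HTTP dataset.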

Fortunately, Azure Data Factory gives us a very good platform to execute each of the steps above in a reusable data pipeline. In this post, we'll focus on the first two steps, using a data pipeline in Azure Data Factory to accomplish them. The result will be a pipeline that looks like this:

Building a Data Pipeline

For the purposes of this post, I will assume that you already have an Azure account and have created an Azure Data Factory. If not, you can follow the steps in this article to create a free Azure Account and deploy a Data Factory instance. Once you’ve gone through the steps in the tutorial above and have created a Data Factory instance, choose the Author and Monitor link to load the design canvas. The result will be a blank canvas that looks something like this:

Note: I am using GitHub integration with my Data Factory which is described here: https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-tools-now-supports-github-integration/

Create a new pipeline named "New Crime Reports" by clicking the ellipsis next to Pipelines, selecting Add New Pipeline, and typing the name in the Name dialog. Next, drag a "Lookup" activity from the General section of the toolbox on the left side and name it "GetLatestDate". The canvas will look like this:

For this step, we will connect to the existing ChicagoCrime database (described in the earlier set of posts mentioned above) and look up the latest date/time stored in the ChicagoCrimes table. To accomplish this, we need to create a dataset that queries the database and returns the result. Click on the Settings tab, select New to the right of the Source dataset drop-down, and then select Azure SQL Database as the source:

When you click Finish to add the dataset, you will be prompted to enter additional information about the dataset. Enter ChicagoCrimeLastDate as the name, and then select the Connection tab. Click New to the right of the Linked service dialog and enter the appropriate information for your Azure Database connection:

Once you have entered the correct information to connect to your database, select Finish, and then in the Table drop-down, select the v_LastDate view. (If you do not have this view in your database, the DDL can be found in the Git repository linked at the end of this post.)

Note: You don't need to create this view; you can keep the dataset more generic by selecting a different table and then using a query in the Lookup activity. I chose to use a view here for simplicity.
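
The view definition isn't reproduced here, but conceptually it just returns the most recent report date. As a sanity check outside Data Factory, here is a minimal Python sketch of the same lookup; it assumes the ChicagoCrimes table has a Date column, and the connection string is a placeholder.

```python
import pyodbc

# Placeholder connection string for the ChicagoCrime database in Azure SQL.
CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=ChicagoCrime;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

# The query the v_LastDate view presumably wraps (table/column names assumed).
LAST_DATE_QUERY = "SELECT MAX([Date]) AS MaxDate FROM dbo.ChicagoCrimes;"

with pyodbc.connect(CONN_STR) as conn:
    row = conn.cursor().execute(LAST_DATE_QUERY).fetchone()
    print("Latest report in the database:", row.MaxDate)
```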

Now that the required information has been entered for the connection and dataset, you can return to the pipeline tab and select Preview data to verify that the correct information is returned.

At this point your canvas will look like this:

If you want to verify the functionality of the Lookup activity, select Debug and note that the pipeline will deploy and execute:

It will run for a few seconds and then succeed. To view the output, click the Output icon (the arrow pointing right, away from the box) under Actions, and the JSON results from the activity will be displayed:

In order to use the results from the Lookup activity elsewhere in the pipeline, we will store them in a pipeline variable. (Strictly speaking this isn't required, since you can reference the output of a previous step directly, but I prefer to use variables in my pipelines.) To create a variable, select an open area on the canvas (make sure the Lookup activity is NOT selected) and then select the Variables tab. Click New, and then enter the name, type, and default value for the variable. For this pipeline, the name is LastDate, the type is String, and the default value is 01/01/2001 (you can choose any arbitrary value here).

Once the LastDate variable is created, drag a Set Variable activity onto the canvas to the right of the Lookup activity. Drag the output (the green box) of the Lookup activity to the Set Variable activity (this sets the Set Variable activity to execute on success of the Lookup). Name the activity PopulateLastDate, select the Variables tab, select the LastDate variable in the drop-down, and then click in the Value box to add dynamic content. In the expression editor, use @{activity('GetLatestDate').output.firstRow.MaxDate} as the expression.

Click Finish to complete the Value dialog. To test the pipeline and verify that you properly set the variable, choose Debug to start the pipeline, and then click the output (under Actions) for the PopulateLastDate activity and verify that the last date in your database table is returned.
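
If you're wondering where firstRow.MaxDate comes from, the Lookup activity (with the default "First row only" setting) exposes its result under a firstRow property. The debug output is JSON roughly of the shape below, shown here as a Python literal with a made-up date and other runtime properties omitted.

```python
# Illustrative shape of the Lookup activity's output (date value is made up).
lookup_output = {
    "firstRow": {
        "MaxDate": "2019-06-24T23:55:00"
    },
}

# @{activity('GetLatestDate').output.firstRow.MaxDate} resolves to the same
# value as this lookup:
last_date = lookup_output["firstRow"]["MaxDate"]
print(last_date)
```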

Now that you have retrieved the last date and populated the LastDate variable, you can connect to the City of Chicago web service and query the dataset. To do this, drag a Copy Data activity from the Move & Transform section of the toolbox on the left, and enter the name GetLatestReports in the Name field. Drag the green box on the Set Variable activity to the Copy Data activity to connect them. Select the Source tab, and click New to the right of the Source dataset drop-down. Search for, and select, HTTP in the New Dataset dialog and press Finish.

Name the dataset ChicagoCrimesSODA and select the Connection tab. Select New to the right of the Linked service dialog. Enter CityOfChicagoCrimes for the Name, https://data.cityofchicago.org/resource/crimes.json for the URL, and Anonymous for the Authentication.

Select Test connection to verify that the URL was entered correctly, and then click Finish to create the Linked Service. In the Relative Url dialog, click Add dynamic content, then enter @concat('?$limit=50000&$order=date&$where=date>''',dataset().LastDate,'''') as the expression. Pay close attention to the number of single quotes within the concat function (two consecutive single quotes inside an expression string literal escape to one). This constructs an appropriate query string (see the API documentation linked above for more information) that orders the dataset by date, starts from the last date available in the database, and limits the result to 50,000 rows (a hard limit imposed by the software that the City of Chicago uses to provide the interface). Click Finish to save the Dynamic Content, and then click Test connection to verify that it is correct.
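
If the quoting in that expression is hard to read, the sketch below builds the same string in Python. Because doubled single quotes escape to one, the concat wraps the date value in single quotes, as the SODA $where clause expects; the date shown is just an example.

```python
# Python equivalent of the Relative Url expression, with an example date.
last_date = "2019-06-24T23:55:00"  # supplied at runtime via the LastDate parameter

relative_url = "?$limit=50000&$order=date&$where=date>'" + last_date + "'"
print(relative_url)
# ?$limit=50000&$order=date&$where=date>'2019-06-24T23:55:00'
```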

Verify that the Request Method is GET, select JSON format for the File format, and select Array of objects for the File pattern.

Note: For the purposes of this exercise, you will leave the schema unbound. In most cases, you would want to bind the schema, but since we are using dynamically-constructed query strings to query the web service, we won’t be able to directly import the schema. If you want to bind the schema, you can provide a static relative URL that will enable you to import the schema, and then replace it with dynamic content after the schema is imported. For this exercise, Data Factory will properly handle the schema mapping without being directly bound.

Click on the Parameters tab and add a New Parameter named LastDate as a String type with a default value of a valid date (I use 01/01/2001 since that is the first available date in the dataset).

Now that the source connection is configured, click on the New Crime Reports pipeline tab and, in the Dataset parameters section, set the LastDate parameter to the value @{variables('LastDate')}. This populates the dataset parameter with the value of the LastDate variable assigned by the Set Variable activity; the parameter is then used in the Relative URL expression to complete the query string with the appropriate date. (You cannot reference a pipeline variable directly in the dataset configuration, which is why this two-step process is needed.)

Now that the Source is configured properly, click on the Sink tab and select New to the right of the Sink dataset drop-down. Name the new Dataset ChicagoCrimesFile.

Before continuing, switch back to the Azure portal, add a new Storage account, and then create a Blob container in that account called newcrimereports.

Switch back to the Data Factory canvas, and in the Connection tab for the ChicagoCrimesFile sink, click New to the right of the Linked service drop-down and enter the appropriate information for the Blob Storage account that you just created.

Click Finish to create the connection, and then click Browse in the File path and select the newcrimereports container that you created earlier.

In the name section of the File path dialog, enter NewCrimeReports.csv. Select Text format in the File format drop-down, Comma in the Column delimiter drop-down, and Carriage Return + Line feed in the Row delimiter drop-down.

Select Column names in the first row, and enter a double quote (") in the Quote character dialog.
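
To make it concrete what these settings produce, the sketch below mimics the copy in plain Python: an array of JSON objects becomes comma-delimited rows with a header row, double-quoted values, and CRLF line endings. The field names and values are illustrative only; Data Factory performs this mapping at runtime.

```python
import csv, io

# A couple of made-up records in the shape the web service returns
# (an array of JSON objects; field names are an illustrative subset).
records = [
    {"id": "11700001", "date": "2019-06-20T14:30:00", "primary_type": "THEFT"},
    {"id": "11700002", "date": "2019-06-20T15:05:00", "primary_type": "BATTERY"},
]

out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=list(records[0].keys()),
    quoting=csv.QUOTE_ALL,   # double-quote every value
    lineterminator="\r\n",   # Carriage Return + Line feed
)
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```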

Like the source, you will not bind the schema to the sink for the purposes of this exercise. Data Factory will infer the schema at runtime and will properly create the output file. The web service will return records in an Array of JSON objects, and the sink will copy the JSON to a standard CSV text file. At this point, your Pipeline should look like this:

To validate that everything is configured correctly, select Debug to deploy and execute the pipeline. This should take a couple of minutes to run fully. Once complete, check the output of each of the activities. Note that the Copy Data activity has a pair of eyeglasses next to it that lets you view specifics such as the number of records read and written to the output file.

You can verify that the output file was created by switching to the Azure portal, browsing to the storage account that you created earlier, clicking Blobs, and then selecting the newcrimereports container. Inside the container you should see the NewCrimeReports.csv file. You can click the ellipsis next to the file and select View/Edit blob to see its contents.
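
If you prefer to check the result from code rather than the portal, here is a short sketch using the azure-storage-blob (v12) package; the connection string is a placeholder, and the container and file names match the ones used in this post.

```python
# Optional: verify the output file from code instead of the portal.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<your-storage-account-connection-string>",
    container_name="newcrimereports",
    blob_name="NewCrimeReports.csv",
)

csv_text = blob.download_blob().readall().decode("utf-8")
lines = csv_text.splitlines()
print(lines[0])                  # header row
print(len(lines) - 1, "data rows")
```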

Conclusion

In this post, I discussed how to use an Azure Data Factory pipeline to connect to a web service and return records that are written to a file stored in an Azure storage blob container. In the next post in this series, I will detail how to use Azure Data Factory Mapping Data Flows to transform the data and insert it into the ChicagoCrime database.