Azure Data Factory – Mapping Data Flows

In my previous post, I discussed the process of connecting to a web service and retrieving updated records to be inserted into a database through the use of Azure Data Factory pipelines. In this post, I’ll continue the process by using Azure Data Factory (ADF) Mapping Data Flows to transform the data and integrate the Data Flow with the pipeline that was created in the previous post.

Introduction to ADF Mapping Data Flows

One of the challenges that data engineers face in today’s “Big Data” world is the need for a data transformation process that can scale. The typical tools used in ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes (think SQL Server Integration Services or InfoSphere DataStage) weren’t really designed to scale out to meet the volume and velocity requirements imposed today. To meet this need, data engineers have turned to systems such as Apache Spark and Hadoop, which require specialized knowledge and coding skills.

ADF Mapping Data Flows has been designed to provide developers with a fully visual, drag-and-drop design experience, without the need to write code. The resulting data flow is integrated into an ADF pipeline and executed within the context of a Spark cluster (implemented by Azure Databricks), which can be scaled as necessary to support the workload.

As of this writing, ADF Mapping Data Flows is in limited preview; you can request access here: https://aka.ms/dataflowpreview. (At some point in the near future, the feature will move out of preview and there will be no need to request access.)

Getting Started with ADF Mapping Data Flows

For the purposes of this post, I will continue where the previous post left off. We currently have an Azure Data Factory pipeline that connects to the City of Chicago Crime dataset and returns new records that we do not have in our ChicagoCrime database. (If you haven’t yet built the ChicagoCrime database, read the series of posts here)

The process that we will use looks like this:

  1. Connect to the CSV file that was created by the pipeline
  2. Transform the data (add appropriate data type conversions)
  3. Load the data into the Crimes_New table in the database

This is a very simple example and will not effectively demonstrate the power of Mapping Data Flows, but it will give you a good idea of how to integrate them into your data pipelines.

To build the data flow, open the Azure Portal, browse to your Data Factory instance, and click the Author & Monitor link. Under Factory Resources, click the ellipses (…) next to Data Flows, and add a New Data Flow. This will activate the Mapping Data Flow wizard:

Click the Finish button and name the Data Flow Transform New Reports. Click the Add Source box and step through the informational dialog explaining the layout of the Data Flow Source object.

Name the output stream NewCrimesFile and then click New to the right of the Source dataset drop down:

Select DelimitedText and click Finish to open the New Dataset dialog. Name the Dataset NewCrimes, choose the BlobStore linked service that you previously created, then click Browse and navigate to the NewCrimeReports.csv file that was created earlier. Select First row as header, and Import schema From connection/store.

Click Finish to create the Dataset. At the top of the UI, slide the Data Flow Debug switch to the right to enable debug. This will take some time as a new Azure Databricks cluster will be provisioned and started.

Once the cluster is started (you will know this by the green dot to the right of the debug switch), click the Source Options tab and choose Delete source files in the After completion radio button. (This will delete the file after it is processed by the data flow.)

Click the Projection tab and view the schema that was inferred from the file. Note that every field is of type String. This is normal, since the CSV file is all text. This is where you would perform any data type conversions that would be necessary for other ETL operations. For the purposes of this exercise, we will leave everything as String types, since the actual data conversions will be done in the database. (If we wanted to perform some aggregations or other ETL functions, we’d want to convert the data types here to make sure that the correct types were presented downstream in the data flow.)

Since this is a relatively small dataset, there is no need to optimize partitioning, which is what you would select under the Optimize tab. Select the Data Preview tab, and then click Fetch latest preview data to load a sample of the data.

Now that the source is properly configured for the data flow, click the + in the lower right of the NewCrimesFile source and select the Select object under Row modifier to add a column select object.

Enter RemoveColumns in the Output stream name dialog. Scroll to the bottom of the list and remove the last 4 columns (location.latitude, location.human-address, location.needs_recoding, and location.longitude) by clicking the trash can icon next to each column. (These columns are not used in the database, but have been included in the JSON results from the web service) The result should look like this:

Select the Data Preview tab and then click Fetch latest preview data to view the resulting dataset:

In the ChicagoCrime database, there is a column named Location, which simply contains the Latitude and Longitude of the report in the format “(lat,lon)”. This column is used by some reporting tools to provide map visuals. Since this column is not present in the result returned by the web service, we will need to use Mapping Data Flows to construct it.

To add the column, click the + to the right of the RemoveColumns object and select Derived column under Schema modifier:

Enter LocationColumn in the Output stream name dialog, and then type location in the Columns dialog. Add the function concat('(',latitude,',',longitude,')') in the expression editor (this will build a string matching the correct format), and then click Save and exit.

Select the Data Preview tab and click Fetch latest data preview, then scroll to the right to verify that the column has been correctly added and populated.

The final step for the data flow is to write the transformed dataset to the Crimes_New table in the database.

Click the + next to the LocationColumn activity and add a new Sink (you may have to scroll the dialog down).

Click Finish to dismiss the text next to the Sink activity, enter NewCrimesTable in the Output stream name dialog, and then click New to the right of the Sink dataset drop-down to create a new Dataset. Select Azure SQL Database and click Finish. Name the new dataset Crimes_NewTable, select the ChicagoCrimesDB linked service (created previously), select the Crimes_New table in the Table drop-down, and click Finish.

Select the Settings tab and choose Truncate table as the Table action. This will ensure that the table is empty before loading any data.

Select the Mapping tab, turn off the Auto Mapping function, and then ensure that each of the columns is appropriately mapped to the destination columns. (The UI does not automatically recognize destination column names that contain spaces.)

Now that the data flow is complete (you can verify by selecting the Data Preview tab and then selecting Fetch latest data preview) you can add the data flow to the pipeline that was created earlier. On the left side, under Factory Resources, select the New Crime Reports pipeline. Under Move & Transform, select the Data Flow activity and drag it to the canvas to the right of the Copy Data activity. Select the Transform New Reports data flow.

Select Finish, and connect the output of the GetLatestReports Copy Data activity to the input of the Transform New Reports data flow. The resulting pipeline will look like this:

 

To verify that the pipeline is fully functional, choose the Debug option to start the pipeline execution. Note that this will take several minutes to complete, as the Azure Databricks cluster that executes the data flow will need to be provisioned and started. You will likely need to refresh the status, as the auto refresh stops after 5 minutes.

Once the execution is complete, click on the status (the eyeglass) icon to view the results:

The final step in the update process is to copy the data from the Crimes_New table to the ChicagoCrimes table. This is accomplished via the prc_AddNewReports stored procedure that was created when you built the database.

In the Data Factory UI, drag a Stored Procedure activity from the General section to the canvas. Connect the output of the Transform New Reports activity to the input, and name the activity AddNewReports.

Select the SQL Account tab and select the ChicagoCrimeDB linked service.

Select the Stored Procedure tab and select the prc_AddNewReports stored procedure.

Conclusion

Azure Data Factory and ADF Mapping Data Flows combine to create a very powerful ETL/ELT system that allows data engineers to develop complex data pipelines with little or no code development.

This post extended the last post in this series by adding a simple Mapping Data Flows process that transformed the output of a web service to a database table.

Additional posts in this series will explore some of the more advanced features of Mapping Data Flows.

Update to Creating a Real-World Database in Azure

Almost 4 years ago (amazing how time flies!) I wrote a series of posts on creating a large-scale, real-world database in Azure that could be used for demos and testing. It’s amazing to me how much mileage I’ve gotten out of this database over the years. It has become my go-to dataset for a number of scenarios, and I’ve used it as the basis to discuss many new Azure-based technologies with customers and partners. One thing that occurred to me recently, though, is that I hadn’t really updated the series to take advantage of new Azure data integration tools such as Azure Data Factory and, specifically, Mapping Data Flows, so this post will begin to do just that.

Updating the Database

The first step in ensuring that your database is up to date with the latest available information is to understand how new/updated data is made available. In the case of the demo database discussed here, the City of Chicago posts the crime dataset to the data portal every day. This dataset usually contains all police reports that have been filed and is approximately 7 days old (i.e., the last report is about a week old). The dataset is available to manually export to CSV (as I detailed in the last post in this series), but it is also available via an HTTP web service that uses the SODA protocol, which is very easy to work with. The basic process to update the database will be:

  1. Identify the last date that is currently in the database
  2. Connect to the City of Chicago Data Portal and retrieve records after the last date
  3. Transform the data returned to fit the schema of the database
  4. Insert new records into the database

Fortunately, Azure Data Factory gives us a very good platform to execute each of the steps above in a reusable data pipeline. In this post, we’ll focus on the first two steps above, using a data pipeline in Azure Data Factory to accomplish the tasks. The result will be a pipeline that looks like this:

Building a Data Pipeline

For the purposes of this post, I will assume that you already have an Azure account and have created an Azure Data Factory. If not, you can follow the steps in this article to create a free Azure Account and deploy a Data Factory instance. Once you’ve gone through the steps in the tutorial above and have created a Data Factory instance, choose the Author and Monitor link to load the design canvas. The result will be a blank canvas that looks something like this:

Note: I am using GitHub integration with my Data Factory which is described here: https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-tools-now-supports-github-integration/

Create a new Pipeline named “New Crime Reports” by clicking the ellipsis next to Pipelines, selecting Add New Pipeline, and then typing the name in the Name dialog. Next, drag a “Lookup” object from the General toolbox on the left side. Name it “GetLatestDate”. The canvas will look like this:

For this step, we will be connecting to the existing ChicagoCrime database (which was described in the earlier set of posts mentioned above) and looking up the latest date/time that I stored in the ChicagoCrimes table. In order to accomplish this, we will need to create a dataset to query the database and return the result. Click on the Settings tab and then select New to the right of the Source dataset drop-down, and then select Azure SQL Database as the source:

When you click Finish to add the dataset, you will be prompted to enter additional information about the dataset. Enter ChicagoCrimeLastDate as the name, and then select the Connection tab. Click New to the right of the Linked service dialog and enter the appropriate information for your Azure Database connection:

Once you have entered the correct information to connect to your database, select Finish, and then in the Table drop-down, select the v_LastDate view. (If you do not have this view in your database, the DDL can be found at the git repository linked at the end of this post)

Note: You don’t need to create this view and can leave the dataset more generic by selecting a different table and then using a Query in the Lookup activity. I chose to use a view here for simplicity

Now that the required information has been entered for the connection and dataset, you can return to the pipeline tab and select Preview data to verify that the correct information is returned.

At this point your canvas will look like this:

If you want to verify the functionality of the Lookup activity, select Debug and note that the pipeline will deploy and execute:

It will run for a few seconds, and then will succeed. To view the output click the Output icon (arrow pointing to the right away from the box) under Actions, and the JSON results from the activity will be displayed:

In order to use the results from the Lookup activity elsewhere in the pipeline, we will need to store them in a pipeline variable. (Strictly speaking, you can reference the output of a previous step directly; however, I prefer to use variables in my pipelines.) To create a variable, select an open area on the canvas (make sure the Lookup activity is NOT selected) and then select the Variables tab. Click New, and then enter the name, type, and default value for the variable. For this pipeline, the name is LastDate, the type is String, and the default value is 01/01/2001 (you can choose any arbitrary value here).

Once the LastDate variable is created, drag a SetVariable activity onto the canvas to the right of the Lookup activity. Drag the output (the green box) of the Lookup activity to the SetVariable activity (this will set the SetVariable activity to execute on success of the lookup). Name the activity PopulateLastDate, then select the Variables tab, select the LastDate variable in the drop-down, and then click in the Value box to add dynamic content. In the expression editor use @{activity('GetLatestDate').output.firstRow.MaxDate} as the expression.

Click Finish to complete the Value dialog. To test the pipeline and verify that you properly set the variable, choose Debug to start the pipeline, and then click the output (under Actions) for the PopulateLastDate activity and verify that the last date in your database table is returned.

Now that you have retrieved the last date and populated the LastDate variable, you can connect to the City of Chicago web service and query the dataset. To do this, drag a Copy Data activity from the Move & Transform section of the toolbox on the left, and enter the name GetLatestReports in the Name field. Drag the green box on the Set Variable activity to the Copy Data activity to connect them. Select the Source tab, and click New to the right of the Source dataset drop-down. Search for, and select, HTTP in the New Dataset dialog and press Finish.

Name the dataset ChicagoCrimesSODA and select the Connection tab. Select New to the right of the Linked service dialog. Enter CityOfChicagoCrimes for the Name, https://data.cityofchicago.org/resource/crimes.json for the URL, and Anonymous for the Authentication.

Select Test connection to verify that the URL was entered correctly and then click Finish to create the Linked Service. In the Relative Url dialog, click to Add Dynamic Content, then enter @concat('?$limit=50000&$order=date&$where=date>''',dataset().LastDate,'''') as the function. Pay close attention to the number of single quotes within the concat function. This will construct an appropriate query string (see the API documentation linked above for more information) and will make sure that the dataset is ordered by date, starting from the last date available in the database, and limited to 50,000 rows returned (this is a hard limit imposed by the software that the City of Chicago uses to provide the interface). Click Finish to save the Dynamic Content, and then click Test connection to verify that it is correct.
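
If you want to sanity-check the query string outside of Data Factory, the short C# console sketch below (my own illustration, not part of the pipeline; the class name and the sample date are placeholders) issues the same request that the Copy Data activity will end up making:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class SodaQueryCheck
{
    static async Task Main()
    {
        // Sample value only; in the pipeline this comes from the LastDate variable.
        string lastDate = "2019-01-01T00:00:00";

        // Build the same query the @concat expression produces, URL-encoded:
        // ?$limit=50000&$order=date&$where=date>'<LastDate>'
        string where = Uri.EscapeDataString($"date>'{lastDate}'");
        string url = "https://data.cityofchicago.org/resource/crimes.json" +
                     $"?$limit=50000&$order=date&$where={where}";

        using (var client = new HttpClient())
        {
            string json = await client.GetStringAsync(url);
            // Show just the beginning of the JSON array that is returned.
            Console.WriteLine(json.Substring(0, Math.Min(500, json.Length)));
        }
    }
}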

Verify that the Request Method is GET, and select JSON format for the File format, and Array of objects for the File pattern.

Note: For the purposes of this exercise, you will leave the schema unbound. In most cases, you would want to bind the schema, but since we are using dynamically-constructed query strings to query the web service, we won’t be able to directly import the schema. If you want to bind the schema, you can provide a static relative URL that will enable you to import the schema, and then replace it with dynamic content after the schema is imported. For this exercise, Data Factory will properly handle the schema mapping without being directly bound.

Click on the Parameters tab, and add a New Parameter named LastDate as a String Type with a default value of a valid date (I use 01/01/2001 since that is the first available date in the dataset)

Now that the source connection is configured, click on the New Crime Reports pipeline tab and, in the Dataset parameters section, set the LastDate parameter to the value @{variables('LastDate')}. This will populate the parameter with the value of the LastDate variable that is assigned during the Set Variable activity, and the parameter will be used in the Relative URL to complete the query string by adding the appropriate date. (You cannot reference the variable directly in the Dataset configuration, which is why we need this two-step process.)

Now that the Source is configured properly, click on the Sink tab and select New to the right of the Sink dataset drop-down. Name the new Dataset ChicagoCrimesFile.

Before continuing, switch back to the Azure Portal and add a new Storage account, and then create a Blob container in that account called newcrimereports

Switch back to the Data Factory canvas, and in the Connection tab for the ChicagoCrimesFile sink, click New to the right of the Linked service drop-down and enter the appropriate information for the Blob Storage account that you just created.

Click Finish to create the Connection, then click Browse in the File path and select the newcrimereports container that you created earlier.

In the name section of the File path dialog, enter NewCrimeReports.csv and select Text format in the File format drop down, select Comma in the Column delimiter drop-down and Carriage Return + Line feed in the Row delimiter drop-down.

Select Column names in the first row, and enter a double-quote (“) in the Quote character dialog.

Like the source, you will not bind the schema to the sink for the purposes of this exercise. Data Factory will infer the schema at runtime and will properly create the output file. The web service will return records in an Array of JSON objects, and the sink will copy the JSON to a standard CSV text file. At this point, your Pipeline should look like this:

To validate that everything is configured correctly, select Debug to deploy and execute the pipeline. This should take a couple of minutes to run fully. Once complete, check the output of each of the activities. Note that the Copy Data activity has a pair of eyeglasses next to it that will allow you to view specifics such as the number of records read and written to the output file.

You can verify that the output file was created by switching to the Azure portal and browsing to the storage account that you created earlier, clicking on Blobs, and then selecting the newcrimereports container. Inside the container you should see the NewCrimeReports.csv file. You can click the ellipses next to the file and select View/Edit blob to see the contents of the file.

Conclusion

In this post, I discussed how to use an Azure Data Factory pipeline to connect to a web service and return records that are written to a file stored in an Azure storage blob container. In the next post in this series, I will detail how to use Azure Data Factory Mapping Data Flows to transform the data and insert it into the ChicagoCrime database.

IoT and the Intelligent Edge

In previous posts, I have discussed using a Raspberry Pi device connected to Microsoft Azure IoT. While this gives you a good example of how smart devices can communicate with the cloud and deliver useful information, it really just treats the devices as communication tools and doesn’t take advantage of any of the processing power that exists within the Raspberry Pi. When you think of a complete “Internet of Things” (IoT), you can picture the “thing” as being a device that bridges the gap between on-premises activity and the cloud. At Microsoft, we call this the “Intelligent Edge”, and we see this as a major opportunity to bring the power of the cloud to the world of IoT.

image

The Intelligent Edge

The heart of the Intelligent Edge is the new Azure IoT Edge service. In short, this service is built for those who want to push some of the cloud processing power down to the edge devices, so that they can independently analyze data (think about deploying an AI model to the edge and allowing the device to perform a portion of the work instead of having to rely on sending everything to the cloud and waiting for the results).

image

The IoT Edge service consists of cloud components, such as Azure IoT Hub, Azure Stream Analytics, and other Azure Services. The IoT Edge also includes a runtime service that is deployed to the device. This service is based on Docker, and is designed to simplify the deployment and management of code modules that are deployed to the device, as well as facilitate communication both with the cloud, and with any downstream devices that might be part of the environment (Think of the Edge device as being a controller that monitors the local environment, performs analysis there, and then reports status upstream to the cloud). The overall architecture of a connected IoT Edge device looks like this:

IoT Edge runtime sends insights and reporting to IoT Hub

The IoT Edge runtime supports both Windows and Linux devices, and simplifies the deployment and management of code modules.

Next Steps

This post has been an introduction to the Intelligent Edge and is really meant to provide an overview of the term and the services necessary to implement it. In follow-on posts on this topic, I will walk through how to configure a Raspberry Pi device to become an Intelligent Edge device and how to deploy intelligent modules to it.

Flight Tracking, the Raspberry Pi, and the Internet of Things

As part of my day-to-day job responsibilities, I look for ways to help my customers understand how new technologies and methodologies can be applied to their business processes. One major component of this is incorporating the vast amount of volatile data that they generate into useful business information. This can be an interesting conversation, because when many people hear the phrase, “Internet of Things” or “IoT”, they immediately dismiss it as not something that would help them, since they aren’t in manufacturing, or don’t deal with “sensors”. Over the last year, I’ve found myself struggling somewhat to come up with an IoT story that a majority of people could relate to and yet would still be relevant to their business operations. What follows is a breakdown of such a story, and how I decided to tackle the problem at hand.

The Problem

I live in a relatively rural area northeast of Phoenix, Arizona called the Rio Verde Foothills. It’s an area that used to be a working cattle ranch and was turned into home sites with acreage many years ago. The area is popular with “horse people”, and there are many world-class horse ranches out here, along with people who have decided to move to a more rural area and enjoy a bit of a slower pace while staying relatively close to a major city and all the amenities that offers. The area is roughly 30 square miles, and if you were to view it on an aerial map it would look like this:

image

An aerial view of the Rio Verde Foothills

We are about 10 miles away from the Scottsdale Airport, and about 15 miles from the Phoenix Deer Valley Airport, both of which have a healthy pilot-training industry. This means that our community is an ideal practice area and is used relatively extensively for flight training operations. The US Federal Aviation Administration (FAA) is very specific in many of its rules, but one rule that is often open to interpretation is Code of Federal Regulations (CFR) 91.119, which details the minimum safe altitudes at which aircraft may operate. Generally speaking, aircraft must remain 1000’ above the ground in “congested areas” and 500’ in “other than congested” areas, unless it’s a “sparsely populated” area, in which case the pilot must maintain 500’ separation from any person, vessel, vehicle, or structure. The FAA has never published an official definition of “congested area” or “sparsely populated” area, so each violation of this CFR is taken on a case-by-case basis. During flight training operations, it’s very common to perform “engine out” maneuvers, where you simulate an engine failure and practice setting up for a safe landing, which involves gliding to the chosen area and then, when the instructor is satisfied that the student has successfully performed the task, applying full power and climbing back to altitude. Typically, this results in a large burst of engine noise relatively close to the ground. This is a concern for people who have horses and don’t want them “spooked”, and is also a concern for people who moved to the area expecting it to be quiet and peaceful. (Personal note: I’m a pilot and I love the engine noise, so I don’t fall into the “concerned” category.)

As our community grows, and as flight schools become more active, we see more and more complaints about the engine noise and flight operations. Since I consider myself a Data Scientist, I figured it would be an interesting endeavor to collect real data about flight operations in our area, and provide a data-driven analysis of actual flight operations above our community.

The Mechanics

For an aircraft to be certified for flight, the FAA requires certain equipment to be on board (there are exceptions to the following, but for the purposes of this article they are irrelevant). One of the requirements is that the aircraft have a transponder. Simply put, the transponder is a device that allows air traffic control (ATC) to query the airplane and learn things about it. In the United States, airspace is broken down into several classes, and each class has specific requirements for how you operate within it. For example, Class C airspace extends from the surface to 4000’ above an airport within a 5-nautical-mile radius, and from 1200’ to 4000’ within a 10-nautical-mile radius, and is usually depicted on a chart by a cyan arc:

image

A portion of the Phoenix Sectional Chart, showing the airspace designations

To operate in controlled airspace, aircraft are currently required to have what is known as a Mode C transponder, which transmits the altitude of the aircraft along with other flight information to ATC. This transmission occurs at 1090 MHz and is not encrypted. The transmission does not include any location information, but ATC can correlate radar returns with transponder information in order to get an accurate position and altitude for an aircraft.

As part of the FAA’s Next Generation Airspace initiative, a new technology known as Automatic Dependent Surveillance-Broadcast, or ADS-B, will be required on all aircraft operating in controlled airspace. ADS-B is already being used by many European countries, and the infrastructure is in place throughout the United States. ADS-B essentially extends existing transponder functions by including GPS location information, and is already being installed on many aircraft. (Basically all commercial airliners are equipped with it, and most general aviation aircraft with “modern” avionics have it as well.) ADS-B uses the same unencrypted 1090 MHz signal, so it is relatively easy to capture with inexpensive radio receiver technology.

The advent of ADS-B technology has afforded an opportunity for companies that provide flight tracking information, such as FlightAware and Flightradar24.

image

image

Examples of FlightAware (top) and flightradar24 (bottom)

These companies can collect the ADS-B information transmitted by aircraft and provide extremely accurate flight tracking information. If you haven’t used their sites or apps yet, do yourself a favor and check them out. They offer a free service, but also have commercial tiers that provide ad-free browsing and other features that are only available to paid accounts.

One challenge that the above companies and others like them are faced with is the fact that not all areas are covered by Government-sponsored ADS-B receivers, meaning that there can be large gaps in their flight-tracking coverage.

In order to solve the coverage problem, these companies have made it very easy for hobbyists and the general public to participate in the ADS-B system by providing easy-to-use solutions that allow for collection of ADS-B data and transmission to these sites. The added advantage to this is that when there are multiple ADS-B receivers in a given area, they can use a technique known as “multilateration” (MLAT) to pinpoint the location and altitude of aircraft that are using non-ADS-B equipped transponders. Basically anyone with the desire and a little technical ability can construct an ADS-B receiver that can then be used to transmit the information to the sites, enhancing their coverage and MLAT accuracy. In return for doing this, the sites offer you free membership (which in the case of Flightradar24, is worth about US$800 per year and removes all advertising from their app and website)

In any given 24-hour period, there are just over 2000 aircraft within range of my ADS-B receiver, and those aircraft report about 500,000 positions. That is a fair amount of data, and if harnessed for more than just tracking individual flights, it could be used for all sorts of analytics, including the ability to actually quantify the amount of air traffic in our area, along with altitude, speed, and so on. When collected and analyzed using a tool such as Microsoft Power BI, this data can prove to be very useful.

image

An example of analytics possible with the ADS-B data and Power BI

This is where IoT can prove to be a very useful tool to answer questions outside of the typical manufacturing or sensor-driven use case.

For the remainder of this post, I’ll describe how to build an ADS-B receiver for under US$80 using the popular Raspberry Pi computing platform and will discuss how to connect it to FlightAware and flightradar24. In follow-on posts, I’ll describe how to feed the data into Microsoft Azure IoT Suite, and finally will describe how to analyze historical information with Power BI.

Procuring the Hardware

There are many different ways to build an ADS-B receiver, several of which are probably more effective than what I’m going to detail (If you really want to get serious about this, take a look at the Virtual Radar Server project for example), but the way shown here results in very little expense and is relatively easy to do, even if you don’t consider yourself a hard-core geek. The shopping list (along with links to the item on Amazon.com) for this project is as follows:

  • Raspberry Pi 3 Model B – The Raspberry Pi 2 can also be used, but the 3 is faster and has integrated WiFi. I used a starter kit from Vilros that I purchased on Amazon as it had the case and integrated heat sinks.
  • A fast SDHC Card at least 8GB in size – You don’t want to skimp on this; the faster the card, the better MLAT and the local web interface will work. I used this one from SanDisk.
  • A USB ADS-B Receiver – FlightAware makes a nice one that includes an integrated filter for under US$20.
  • A 1090 Mhz antenna – There are several of these on the market, but the simple version to start with can be found here. This one, when placed in a window, will receive ADS-B signals from up to 50 miles away. Once you decide to get more serious about collecting the data, you can use a more effective antenna which can reach out to more than 250 miles, but will need to be externally-mounted.

Once you have the hardware, you will need to download and install an operating system to run the Raspberry Pi. You can follow the directions here to download, install and configure Raspbian Jessie (I use the desktop version, but the Lite version will work as well).

If you need more detail on setting up the Raspberry Pi, you can follow the steps (through step 3) from this earlier blog post that I wrote on the topic.

Installing the FlightAware Software

After you have installed and configured the operating system, and plugged in the FlightAware receiver with antenna, you will want to ensure that everything is up to date with the latest patches and repository information. To do so, connect to the Pi (either via SSH from another machine, or open a terminal session on the desktop of the Pi) and issue the sudo apt-get update and sudo apt-get upgrade commands.

image

apt-get update

image

apt-get upgrade

These steps were documented using the following version of Raspbian Jessie:

image

Once this is complete, you will install the PiAware application, which collects data from the ADS-B receiver and transmits it to FlightAware. There are very detailed instructions listed here, but the following steps will work just fine:

  • Install the FlightAware repository so that the package installer can find the piaware source.
wget http://flightaware.com/adsb/piaware/files/packages/pool/piaware/p/piaware-support/piaware-repository_3.5.1_all.deb
sudo dpkg -i piaware-repository_3.5.1_all.deb

image

    • Update the repositories
sudo apt-get update

image

    • Install the Dump1090-fa software (this is the software that decodes the 1090 MHz signal into the digital ADS-B information needed to transmit to FlightAware).
sudo apt-get install dump1090-fa

image

    • Install the FlightAware piaware software.
sudo apt-get install piaware

image

    • Configure piaware to support updates
sudo piaware-config allow-auto-updates yes
sudo piaware-config allow-manual-updates yes

image

    • Reboot the Pi to allow the configuration to take effect.
sudo reboot now

Viewing the Results

Once the Pi has rebooted, open a browser (either on another machine, or directly on the desktop of the Pi) and browse to <ip address of Pi>:8080 (for example, if the IP address of your Pi is 192.168.1.10, browse to http://192.168.1.10:8080). This will open the PiAware Skyview web application. If everything is working, and if there are aircraft nearby that are transmitting ADS-B signals, you should see them represented on your screen, along with a link asking you to connect to FlightAware and claim your receiver. Click the link and claim your receiver (if you don’t already have a FlightAware account, use the link to register for one). Once you claim your receiver, your account will be upgraded to an Enterprise premium account, which is normally worth over $1000.00 per year.

image

The Skyview application showing aircraft detected.

Once you have claimed your receiver, you can verify functionality by checking the piaware log file, located at /var/log/piaware.log. Use the following command: sudo cat /var/log/piaware.log to view the entire log, or sudo tail /var/log/piaware.log to view just the end of the file.

image

Conclusion

In this post, we have discussed the use of ADS-B signals to collect information on nearby aircraft, and have demonstrated how to build an ADS-B receiver that will transmit the information to FlightAware.

Future posts in this series will discuss how to extend this solution to other sites, as well as collecting the information via Microsoft Azure IoT Suite to make it available for historical analysis.

The IoT Journey — Visualizing IoT Streaming Data with Power BI

In my previous posts in this series (see posts one, two, three, four, five and six) I discussed the construction of a system to collect sensor data from a device and send that data to the Microsoft Azure IoT Suite. The final step in this journey is to build an application that will use the data that we’re sending to the cloud. There are many approaches to building the visualization layer (for a complete discussion of this topic, see the IoT Suite Remote Monitoring solution here: https://www.microsoft.com/en-us/server-cloud/remotemonitoring/Index.html), but I wanted to incorporate Microsoft Power BI to demonstrate a Software as a Service (SaaS) solution for visualizing the output of the sensor platform.

Obviously this is overkill for temperature and humidity data for a single sensor, but imagine having a worldwide network of these sensors reporting data. The beauty of the cloud and the SaaS platform for visualization, is that there is virtually an unlimited amount of capacity available to you with very little work on the front end to build the solution.

The first step in this process is to obtain and provision a Microsoft Power BI subscription if you don’t already have one. Power BI is available in a free tier that will work for this example, so you do not need to purchase a subscription in order to build the solution.

Step One – Obtain a Power BI Subscription

Sign up for Microsoft Power BI at www.powerbi.com and select the Get Started now button. Follow the instructions on the screen to sign up. Note that you must use a corporate email address (not an @gmail, @outlook or @hotmail address). If you want, you can sign up for a 30-day trial of Office 365, or sign up for a $5 per month plan and then use that address as your Power BI login. The good news there is that after the 30-day trial, Power BI will continue to function normally.  Once you’ve signed up for Power BI and logged in, you’ll see the following screen:

image

Once Power BI is successfully provisioned, the next step is to configure the Azure IoT Hub to send data to Power BI so that we can add it to a dataset.

Step Two – Send IoT Data to Power BI

One of the services available in the Azure IoT Suite is Azure Stream Analytics (ASA). ASA is a fully-managed cloud service that enables real-time processing of streaming data. It is a very powerful service, and when coupled with Azure Event Hubs, can scale to millions of devices.

For the purpose of this post, we will use ASA to receive data from the IoT Hub that we created in the previous post, and then output the data to a dataset that is read by Power BI to build a report and dashboard representing the data being sent by our sensors. As mentioned earlier, this is overkill for a single sensor, but it will give you an idea of how simple building this solution is, and of course it can easily be scaled to a virtually unlimited number of sensors.

As you recall, in the last post we created an IoT Hub (in my case it was named FedIoTHubDemo) and we configured it to receive data from our Raspberry Pi device. Now we will use that hub to send data to ASA so that it can be viewed in Power BI. You will need to open the Azure Management Portal and navigate to the IoT Hub that you created in the previous post.

image

Make sure that you make note of the name and region where the resources are located.

To connect ASA to the IoT Hub, we will perform the following steps:

  • Create a new Stream Analytics Job
  • Configure an Input to ASA
  • Write a Query using the Azure Stream Analytics Query Language
  • Configure an Output to Power BI (as of the time of this post, this is not supported in the current management portal; we will need to use the legacy portal to perform this step)
  • Start the Job

In the Azure Management Portal, select New in the upper left, and then select everything in the Marketplace pane, and then type Stream Analytics Job in the search box:

image

Then select Stream Analytics Job from the search results:

image

 

Then select Create to open the ASA blade:

image

Give the job a name, select the resource group that you used for the Hub and then select a location for the resources. Then select Create to create the Job. This will start the deployment process and you will receive a message in the portal when the deployment is completed:

image

Once the deployment is complete, click on the job to open the dashboard and settings blade:

image

Note that there are no inputs or outputs configured. For our example, we will configure an Input that uses the IoT Hub that we previously created.

Click on the cloud icon in the essentials section of the pane to open the Quick Start Wizard:

image

Then click Add inputs. Give the input a name, and then select IoT Hub as the source. Make sure the drop-downs are filled with the appropriate names from the IoT Hub that you created earlier. Use the Shared Access Key for the iothubowner policy that you copied to the text file in the last post (or copy it again from the IoT Hub that you created).

image

Once all of the fields are filled out (don’t worry about the Consumer Group dialog; it is named $default unless you chose a different name previously), click Create to create the Input. The system will create the input and then test it to ensure that it is properly configured.

Once the input is created, click on Step 2 of the quick start wizard, Develop a Query. This will open the query pane with a sample query:

image

The ASA Query Language is very similar to T-SQL, with some extensions specifically for streaming data. In our scenario, the message that is sent from the Raspberry Pi to the Azure IoT Hub is very basic, consisting of 7 fields (deviceID, temperatureF, temperatureC, humidity, latitude, longitude and timestamp):

This message is sent to Azure approximately every 3 seconds. In our case, we will want to create a query that collects the data in the stream, groups it appropriately, and then assigns a time window to the collection times so that we know what the groupings refer to. The most appropriate groupings are deviceID and latitude & longitude, and the time window is controlled by the timestamp value. In theory this will be every 3 seconds, but there is no guarantee that the Pi will send the data on that schedule, so we will create a Tumbling Window to represent a 5-second interval. (In production we would likely use a wider window, as we have no driving need to see the temperature every 5 seconds.) The resulting query will look like this:

image
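
Since the query itself only survives as a screenshot, the following plain C#/LINQ sketch (my own illustration, not ASA syntax; the field names are taken from the message description above) shows conceptually what the grouping and 5-second tumbling window do:

using System;
using System.Collections.Generic;
using System.Linq;

// Shape of the telemetry message described above.
record Reading(string DeviceId, double TemperatureF, double TemperatureC,
               double Humidity, double Latitude, double Longitude, DateTime Timestamp);

static class TumblingWindowSketch
{
    // Group readings by device and location, bucket them into 5-second windows,
    // and average the measurements in each bucket - conceptually what the ASA query does.
    public static void Summarize(IEnumerable<Reading> readings)
    {
        long windowTicks = TimeSpan.FromSeconds(5).Ticks;

        var windows = readings.GroupBy(r => new
        {
            r.DeviceId,
            r.Latitude,
            r.Longitude,
            WindowStart = new DateTime(r.Timestamp.Ticks - (r.Timestamp.Ticks % windowTicks))
        });

        foreach (var w in windows)
        {
            Console.WriteLine($"{w.Key.DeviceId} at {w.Key.WindowStart:HH:mm:ss}: " +
                              $"avg {w.Average(r => r.TemperatureF):F1} F, {w.Average(r => r.Humidity):F0}% humidity");
        }
    }
}

In the ASA query language itself, the same idea is expressed with a TumblingWindow(second, 5) term in the GROUP BY clause.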

Click Save to save the query. Make sure that the FROM table you reference in the query is the same name that you gave the Input earlier. Typically you would test the query at this time, but currently testing is not supported in the Azure Management Portal. (It should be available soon, and I’ll update this post when it is)

image

Once the query is saved, you will need to launch the legacy Azure Management Portal (http://manage.windowsazure.com ) as the current portal does not support creating a Power BI output sink for ASA.

 image

Launch the legacy portal and navigate to the Stream Analytics Job that you created.

image

Select the OUTPUTS tab, and add a new output. Select Power BI and then click the next arrow:

image

Then click the Authorize Now button and sign in using the credentials that you used to provision your Power BI environment. This will bring you to the configuration dialog for the output:

image

Note that the values you use for the DataSet and Table will appear in Power BI, so choose a name that is meaningful. Click the OK CheckMark to create and test the output.

image

Now that the output has been configured, you can exit the legacy portal and return to the new portal where you will see the output you just created listed in the outputs section of the job:

image

Next, in the ASA Job blade, click the Start button, select Now, and click the Start command. This will start the ASA Job, but will take several minutes to complete.

Once the job is started, you will see a message in the portal:

image

Now that the job has been started, we will need to start the sensor application on the Raspberry Pi and start sending data to Azure.

Step Three – Send the Data to Azure

This is the easiest step, since we’ve already built and tested the application. Follow the steps from the previous post to use an SSH client to connect to your Raspberry Pi and start the DHT11Test application to start sending data to Azure:

image

Let the application run for a few minutes before proceeding to the next step.

Step Four – Build the Power BI Dashboard

After the application has been running for a while, launch Power BI and expand the workspace on the left side of the screen (note the arrow icon at the bottom left).

image

You will see the DataSet name that you provided above in the Datasets section. Click the dataset to open the designer. Note the table name that you specified earlier, and the fields from the Raspberry Pi application:

image

For the purposes of this post, we will build a simple report and dashboard to visualize our environmental data. Click on the deviceid, temperaturef, temperaturec and timestamp fields. Note that you will have a table displayed with each of these values:

image

Note that the numeric values are summarized at the bottom of the table. Since we do not want this, choose each of the numeric fields in the Values section and select Don’t Summarize in the drop down list. Your table should look like this:

image

Since we want to build a nice dashboard instead of a table of values, let’s switch to a gauge to display each of the values. In the Visualizations pane, select the Gauge icon. Drag the temperaturec field to the Values box, and in the drop-down, select Average. Leave the other textboxes blank. Resize the Gauge to make it look nice. Your canvas should now look like this:

image

Now click on a blank area in the report canvas, and then select the humidity field. Click on the Gauge icon in visualizations, and repeat the above step to visualize the average humidity. Repeat for the temperaturef field as well.

image

Now click on a blank area under the gauges, and select the timestamp field. In the visualizations select a line chart, and drag the temperaturef field to the values and the timestamp field to the axis. Resize the graph and then click on an open area in the canvas. Repeat these steps choosing the humidity field, and then resize the graph. It should look like this:

image

Now, click a blank area under the line graphs and select the Map visualization. Drag the devicelatitude field to the latitude box, the devicelongitude field to the longitude box, and the temperaturef field to the size box. Resize the graph to fit the canvas. It should look like this:

image

You have now built a report, using data from the sensors on the Raspberry Pi and streamed to Power BI via Azure Streaming Analytics!

On the report canvas, choose File and then Save and then give the report a name. Once the report is saved, click the Pin icon on each graph to pin to the dashboard (the first pin will require you to name the dashboard). Once everything is pinned, select the dashboard in the left side and resize the elements as desired.

Now you have a dashboard connected to the report. Notice that when you click on a dashboard element, it drills down to the report page that contains the visualization.

Spend some time experimenting with various visualizations, and explore the power of Power BI.

Conclusion

In this post we completed the journey by connecting the data collected from sensors on our Raspberry Pi to Power BI by using the Azure IoT Suite.

This showcases how powerful the Azure IoT Suite can be, while it remains relatively easy to develop solutions that utilize the suite.

If you purchased the sensor kit from Sunfounder that I mentioned in earlier posts, you’ll find that you have several additional sensors to play with that can be added to your Raspberry Pi environment and connected to Power BI.

The IoT Journey: Connecting to the Cloud

In the previous posts in this series (see posts one, two, three, four and five) we’ve walked through designing and building an application that reads a temperature and humidity sensor connected to a Raspberry Pi. In this post, we’ll create the cloud-based components necessary to receive the information from the Pi, and we’ll modify our application to transmit the data to the Azure IoT Suite.

The Azure IoT Suite is a comprehensive set of cloud services and Application Programming Interfaces that enable you to construct a highly-scalable Internet of YOUR Things.

image

The use cases for the Azure IoT Suite are basically limited only by your imagination. Many organizations are turning to IoT technologies to gain more and better insight into how their products and services are being used, along with creating tools and applications to make the lives of their customers better. (While there is a huge debate about the privacy concerns, one great example of this in my mind is how the OnStar service works; I have OnStar activated in my vehicle, and once per month I receive an email that gives me diagnostic information, such as tire pressure, oil life, upcoming maintenance, and mileage. I also have the ability to use the service to locate my vehicle in a parking lot, or remote start it. This is all made possible by the fact that my vehicle is “connected” to the cloud.)

The first step in connecting the Raspberry Pi to Azure IoT Suite is to provision an instance of the IoT suite in an Azure account. If you do not already have an Azure account, you can sign up for a free account here: https://azure.microsoft.com/en-us/free/

The free account will give you a $200 credit for one month that allows you to use any of the resources available in Microsoft Azure, and after the month if you choose not to pay for a subscription, you can still use free services including the IoT Suite. (Details are available at the link above)

Once you have an Azure account setup, you are ready to provision an instance of the IoT Suite.

Step One – Provision the IoT Suite

This is strangely the easiest part of the whole journey, even though the technology behind the scenes is fairly complex. Browse to the Azure Management Portal (http://portal.azure.com) and select the New item in the upper-left, then select Internet of Things, and then select IoT Hub:

image

This will open the IoT Hub Configuration blade. You will need to give the hub a name, select a pricing level (the Free tier will work just fine for our purposes here) and then provide a name for a resource group that will be a container to hold all of the services that comprise the IoT Hub. Then select a location close to you:

 

image

Once you’ve selected the required elements, click the Create button to create the IoT Hub. This will take a few minutes to complete. During the process you can click the bell icon on the management portal to receive status updates of the provisioning process:

image

Once the deployment completes, you will see the following status message and you should also see a new icon in your management dashboard that represents the IoT Hub that you created.

image

Click on the new icon on the management portal (if the icon does not appear, use the browse option on the left and then choose all/IoT Hubs and you will see it listed) to open the management blade for the new IoT Hub that you created:

image

Once the site is provisioned, you will need to obtain the connection string and authorization key in order to allow client applications to send data to the hub.

Step Two – Provision Your Device

The Azure IoT Suite is designed from the ground up with security in mind. Nothing can be sent to, or received from, the IoT environment that you’ve provisioned without proper credentials. In our case, we simply want to connect a single device (our Raspberry Pi sensor platform) and send data to the hub. This will involve provisioning a device and obtaining the appropriate connection string / shared access key for the device.

For the purposes of this tutorial, we’re going to take a simple path and not configure access roles or develop a full-featured application to manage the provisioning of devices on the hub (there is currently no mechanism to manually provision devices in the Azure IoT Hub; you must provision the device from the device itself).

In order to provision a device, we will need to create a simple application. To build this application, we need the following information from the hub we just created:

  • Host Name
  • Connection String
  • Shared Access Signature that allows provisioning of devices

These values can be found in the Azure Management Portal. From the management blade that you opened above, click the Hostname value that is listed in the upper-center of the management blade (in the example above, the value is FedIoTHubDemo.azure-devices.net ) and then copy the value to the clipboard. Save this value to a text file (open Notepad and paste the value) as you will need to retrieve it later. Next click on Shared access policies in the settings blade to open the policies, and then select the iothubowner policy:

image

Copy the Primary Key and Connection string – primary key to the text file you created above. You will need these as well. Note that we are using the owner key, which gives us full access to the IoT Hub environment. In a production application we would not use the owner key here, but would rather create appropriate policies for the device and then use those keys. Since this is a simple “getting started” tutorial, we are using the simple path to test the application.  I highly recommend that you read the IoT Hub Developer Guide to understand the ramifications of using the owner key before attempting to build a production application using the Azure IoT Hub.

The process to provision a new device in the Azure IoT Hub is:

  1. Connect to the hub using an appropriate Shared Access Signature
  2. Read the device registry to ensure that the device isn’t already provisioned
  3. Add a new device entry to the registry
  4. Obtain the new device Shared Access Signature

Typically the code to accomplish the above would be built into a client application that executes on the device. To simplify matters, we will build a separate application to register the device and will copy the Shared Access Signature into the application that we’ve previously developed on the Raspberry Pi.

To provision the device, start Visual Studio and create a new Windows Console application named ProvisionPiHub:

image

Once the solution is created, open the NuGet Package Manager (Project/Manage NuGet Packages) and then select the Browse tab. Type Microsoft.Azure.Devices into the search box, and then select the Microsoft.Azure.Devices package. Click Install, and then accept the license agreement when asked. This will add the necessary components to the project to connect to the Azure IoT Hub.

image

Once the package is installed, open the Program.cs file and add using statements for Microsoft.Azure.Devices and Microsoft.Azure.Devices.Common.Exceptions to the top of the file.

image

This will add references to the Azure IoT SDK.  Next you will add static fields to the Program class that represent the RegistryManager as well as the connection string that you copied earlier as follows:

image

 

Next you will want to add an async method to register your device as follows (make sure you choose an appropriate name for your device):

image

Now in the Main method, add code to invoke the Registry Manager as well as the method above:

image
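
The code in the screenshots isn't reproduced above, so here is a minimal sketch of what the complete provisioning console app might look like (the device name myRaspberryPi and the placeholder connection string are my own choices; substitute the values you saved to your text file):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Devices;
using Microsoft.Azure.Devices.Common.Exceptions;

class Program
{
    // Connection string for the iothubowner policy that you saved earlier (placeholder here).
    static string connectionString = "HostName=<your hub>.azure-devices.net;SharedAccessKeyName=iothubowner;SharedAccessKey=<key>";
    static RegistryManager registryManager;

    static async Task RegisterDeviceAsync()
    {
        string deviceId = "myRaspberryPi";   // choose an appropriate name for your device
        Device device;
        try
        {
            device = await registryManager.AddDeviceAsync(new Device(deviceId));
        }
        catch (DeviceAlreadyExistsException)
        {
            // The device is already in the registry; just read it back.
            device = await registryManager.GetDeviceAsync(deviceId);
        }
        Console.WriteLine("Device key: {0}", device.Authentication.SymmetricKey.PrimaryKey);
    }

    static void Main(string[] args)
    {
        registryManager = RegistryManager.CreateFromConnectionString(connectionString);
        RegisterDeviceAsync().Wait();
        Console.WriteLine("Press Enter to exit.");
        Console.ReadLine();
    }
}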

Run the program and note the generated Shared Access Signature that is displayed. Press Enter, and then Mark/Copy the generated signature and paste it into the text file you created earlier so that you have it saved. (To Mark, click the mouse pointer in the upper-left corner of the console window and select Mark, use the mouse to highlight the device key, and then press Enter to copy it to the clipboard.) Once the device key is copied, you can press Enter to exit the application.

image

 

If for some reason you aren’t able to copy the key from the console application, you can refresh the Management Portal, and then select the devices blade and select the new device, and copy the Primary key from the details pane.

image

Now that you have created the device and copied the Shared Access Signature, you are ready to extend the application that was created in the last post to send the sensor data to the Azure IoT Hub.

Step Three – Extend the Sensor Application

Now that the device has been registered, we can extend the application that we developed in the previous post in this series to send sensor data to the cloud.

Since the application has already been created to collect the sensor data that we want to use, we will simply extend the application to transmit the data to Azure as well as writing it to the console. The process to communicate with Azure is relatively simple:

  • Create an instance of the Azure Device Client class
  • Use the appropriate Shared Access Signature to connect to the Azure IoT Hub
  • Create a telemetry data point, using data from the sensors
  • Serialize the telemetry data point to JSON and add it to a message
  • Add the message to the Client class, and transmit to Azure

Remember that this is a simple tutorial, so there is no real exception handling or retry logic involved. For production applications, be sure you understand transient fault handling, as you will encounter transient faults.
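
As a hint of what that could look like, here is a deliberately minimal retry wrapper (my own sketch, not part of the original application; a library such as Polly would be a better choice for real workloads):

using System;
using System.Threading.Tasks;

static class Retry
{
    // Retry an async operation a few times with a simple linear back-off.
    public static async Task RunAsync(Func<Task> operation, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts)
            {
                Console.WriteLine($"Attempt {attempt} failed: {ex.Message}; retrying");
                await Task.Delay(TimeSpan.FromSeconds(2 * attempt));
            }
        }
    }
}

A call such as deviceClient.SendEventAsync(message) in the code described below could then be wrapped as await Retry.RunAsync(() => deviceClient.SendEventAsync(message));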

To extend the DHT11Test application, open the solution in Visual Studio, and go to the NuGet package manager (Project / Manage NuGet Packages) and install the Microsoft.Azure.Devices and Microsoft.Azure.Devices.Client packages. Since we will be executing this application on the Raspberry Pi with Mono, we will also want to add the Mono.Security package. Once these packages are added, open the Program.cs file and add using statements for Microsoft.Azure.Devices.Client and Newtonsoft.Json.

image

Then, add static fields to the program class to represent your device and client. Note that part of the telemetry payload will include a location for the device. Since we do not have GPS enabled on our device, we manually look up our geolocation and add it. For the ConnectionString and HubURI, make sure you use the values that you saved earlier, not the values that are present in the device information.

image

Then, in the main method, add a line to instantiate the device client. Add the code after you have started the DHT11 sensor.

image

Then, create an async method to send Device to Cloud messages. This will be called every time the DHT11 sensor returns data. We will also write the message to the console so that we can see what is being transmitted.

image

Then, in the DHT11NewData event handler, call the SendDeviceToCloudMessagesAsync method and pass the DHT11 sensor data:

image
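
Pulling the screenshots together, the additions described above look roughly like the sketch below (the hub URI, device key, device name, and hard-coded location are placeholders from my setup, and the exact property names should match whatever your ASA query and Power BI dataset expect):

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;
using Newtonsoft.Json;

class Program
{
    // Values saved earlier; placeholders here.
    static string iotHubUri = "<your hub>.azure-devices.net";
    static string deviceKey = "<device primary key>";
    static string deviceId = "myRaspberryPi";

    // No GPS on the device, so the location is looked up manually and hard-coded.
    static double deviceLatitude = 33.7;
    static double deviceLongitude = -111.9;

    static DeviceClient deviceClient;

    static void Main(string[] args)
    {
        // ... start the DHT11 sensor as in the previous post, then:
        deviceClient = DeviceClient.Create(iotHubUri,
            new DeviceAuthenticationWithRegistrySymmetricKey(deviceId, deviceKey));

        // ... in the DHT11 NewData event handler, call:
        // SendDeviceToCloudMessagesAsync(tempC, tempF, humidity).Wait();
        Console.ReadLine();
    }

    static async Task SendDeviceToCloudMessagesAsync(double tempC, double tempF, double humidity)
    {
        var telemetryDataPoint = new
        {
            deviceId,
            temperatureC = tempC,
            temperatureF = tempF,
            humidity,
            latitude = deviceLatitude,
            longitude = deviceLongitude,
            timestamp = DateTime.UtcNow
        };

        string messageString = JsonConvert.SerializeObject(telemetryDataPoint);
        var message = new Message(Encoding.ASCII.GetBytes(messageString));

        await deviceClient.SendEventAsync(message);
        Console.WriteLine("{0} > Sent: {1}", DateTime.Now, messageString);
    }
}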

This will ensure that messages are sent when the DHT11 sensor reports new data (which happens every 3 seconds in our example). Build the application and repair any errors that might have cropped up. Pay attention to the NuGet packages and make sure that you have all of the appropriate packages added.

Now that the application has been extended, you will need to deploy the application to the Raspberry Pi.

Step Four – Deploy the New Application

In earlier examples, deployment has been relatively easy because the libraries that we have used have been for the most part already present on the Raspberry Pi. In this case, however, there are several .dlls that we will have to deploy as part of our application. If you examine the output folder for the build, you’ll notice that there are many files that have been generated.

image

We will need to copy the .exe, along with all .dll files and the .xml files as well to the Raspberry Pi.

Use whatever copy program that you’ve used in previous examples (I use FileZilla) to copy the files to a folder on the Raspberry Pi. I made a new folder to hold the new version of the application, but it is entirely up to you how you want to store the program files on the Pi.

image

Once the application is deployed, you will need to ensure that the Pi is properly configured for https communication. Some versions of the OS have incomplete certificates configured for https communication, so it’s important to ensure that the Pi is ready.

Use an SSH client to connect to the Pi, and then execute the following command:

image

This will download and install the latest root certificates into the local client store. You will need to do this twice, once as a normal user, and once using sudo to ensure both certificate stores are updated.

Now that the application is deployed and the certificate store updated, execute the application (don’t forget to execute with sudo) and watch the messages transmitted to Azure.

image

This will send a telemetry payload message containing the temperature in Fahrenheit and Celsius, as well as humidity and lat/long, every 3 seconds. Leave the app running, switch back to the Azure Portal, and notice that the Usage counter in the diagnostics now shows data being received from your device.

Conclusion

Congrats! You’ve now built a sensor platform that collects data from a sensor and transmits it to the cloud every 3 seconds (if you let the app run long enough, you will eventually run into an unhandled exception due to the nature of cloud communications. As mentioned above, we did not build in any retry logic or exception handling, so this will happen at some point)

In the next post of this series, we will take the final step on this journey and connect Microsoft Power BI to the IoT hub so that we can visualize the results of our sensor platform.