Stream SmartThings data to Cosmos DB and PowerBI Part 2 - Data Storage

In part one of this series, we looked at how to collect data from SmartThings and send it over to an Azure Event Hub. We’re now going to take the next step and work on storing that data so that we can use it in our reports and do historical analysis. For this, we are going to use Azure Cosmos DB for storage and Azure Stream Analytics to move the data.

Cosmos DB

The first thing we need to do is create a Cosmos DB account to store the data. Head over to the Azure portal and create a Cosmos DB account instance. When you do this, it will ask you to select the API to use; I chose SQL, but pick whichever you prefer. You also want to put this account in the same region as your Event Hub to reduce latency.

At this point, you have created an account which will contain your database. This account is in a single region and uses the default Session consistency level. That was fine for my needs, but if you want to replicate your data to more regions or alter the consistency level, you can do so now.

We now need to create a database and a collection in Cosmos DB to store our data. The easiest way to do this is to scroll down to the collections section, click browse, and then click the “Add Collection” option at the top. For the database, select “Create new” and then enter a name for the database and for the collection.

You then need to pick a size, either fixed or unlimited. The fixed option uses a single partition, which imposes a limit of 10GB but means you don’t have to manage partition keys. To give you an idea of the data volume, I have had this running for around 4 months and have generated 80MB of data, so I would recommend the fixed option to keep things simple. Finally, we need to select the amount of throughput we want the database to support, measured in Request Units (RUs). The lowest you can go is 400 RUs, and that has been more than enough for my setup, which has around 20 sensors reporting in. Once you have completed all that, create the database and collection.

The great thing about Cosmos DB is that it is NoSQL-based, so we don’t need to worry about creating schemas and tables. All we need to do now is send the data across.

Stream Analytics

At the moment our sensor data has arrived at the Event Hub and is sitting there doing nothing. Event Hub is not intended as a long-term storage mechanism and will only retain this data for a maximum of 7 days. We need to create a process to move this data over to our new Cosmos DB.

In this example, I have chosen to use Azure Stream Analytics. This is Microsoft’s tool for processing streams of data and performing analytics on them. I chose this option because it is simple and easy to set up, it processes data quickly, and it can undertake data transformation if required. It also supports sending data to many different destinations, including the option to stream data to PowerBI for real-time reporting. We will be looking at historical reporting here, but that option is available.

There is a pretty significant downside to using Stream Analytics for a home project: the cost. Running a single streaming unit full time costs around $80 a month. In a future post, we will look at ways to reduce this cost, such as running Stream Analytics jobs in scheduled batches, or using other tools like Azure Data Factory and Azure Functions to process the data instead.

Stream Analytics Job

Using the Azure portal, create a new Stream Analytics job. Again, it should be in the same region as the Event Hub and Cosmos DB, and you should select the “Cloud” hosting option. We then need to define how many streaming units we need. Streaming units are instances of the job; more units mean more parallel processing of data. As we are running this job full time, the processing of the data will be almost immediate. Given this, and the low amount of data, we should only need a single streaming unit. When we look at batch processing the data later, we may want to use more units to increase throughput.

After creating the Stream Analytics job, we need to define the query that will tell it how to process the data. Before we can do that, we need to define our inputs and outputs.

Inputs

In the Stream Analytics job, select the “Inputs” option under “Job Topology” and then click “Add Stream Input”. In the drop-down, select “Event Hub”. The window that opens should populate with the details of your existing Event Hubs. Provide an alias for this input and then select the Namespace, Event Hub and Policy Name that we created in the previous article. Leave the rest of the options at their defaults, including an empty consumer group.

Outputs

We’ll follow a similar process to create the output. Under the same “Job Topology” menu, select “Outputs”. Click “Add”, and then pick “Cosmos DB” from the drop-down. Again, this will fill in the Cosmos DB account, Key, and Database for you. The value to use for the “Collection Pattern Name” field depends on what you chose for the Cosmos DB size. If you selected the fixed 10GB size, you can enter the collection name here. If you selected the unlimited option, you need to enter a pattern to locate the appropriate collection partitions. Finally, the “Document ID” field allows you to specify a field in the records to use as the document ID. For simplicity’s sake, I chose to leave this blank and use the default.

Query

Now that we have our input and output, we need to create the query to transfer data from one to the other. Again, under the “Job Topology” menu, select “Query”. This will open the visual query editor, and it should look like this:
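For reference, the default query template that Stream Analytics generates is roughly the following:

```sql
SELECT
    *
INTO
    [YourOutputAlias]
FROM
    [YourInputAlias]
```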

In the editor, select [YourOutputAlias] and replace it with the alias of the Cosmos DB output we created. Then select [YourInputAlias] and replace it with the alias of the Event Hub input. There is a list of the aliases on the left if you need them. We will leave the SELECT clause as “*” for now so that we send all the data to Cosmos DB. If you want to, you can refine this query to reduce the amount of data sent, as sketched below.
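As a sketch, assuming the Event Hub input was aliased smartthings-input and the Cosmos DB output cosmosdb-output (substitute your own aliases), the finished pass-through query would be as follows. The commented-out filter is purely illustrative: the 'name' field is hypothetical and should be adjusted to match whatever your SmartThings app actually sends.

```sql
-- Send every event from the Event Hub input straight to the Cosmos DB output
SELECT
    *
INTO
    [cosmosdb-output]
FROM
    [smartthings-input]

-- A possible refinement (commented out): only store temperature readings.
-- Assumes the events contain a 'name' field; adjust to match your payload.
-- SELECT * INTO [cosmosdb-output] FROM [smartthings-input] WHERE name = 'temperature'
```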

Now that we have our query defined, all we need to do is start the job. Click back on the overview tab and then click the “Start” option. The job will take a few minutes to start; when it does, you should start to see events appear in the metrics charts.

Data Validation

At this point, we can go back to our Cosmos DB instance and check that data is arriving. Select your Cosmos DB account and then click on the “Data Explorer” option. Expand the database and collection, then click on the “Documents” option to see a list of all the documents in the collection.
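If you would rather query the data than browse the list, Data Explorer also lets you run SQL queries against the collection. A minimal sketch, assuming the default indexing policy, which uses the built-in _ts (last-modified timestamp) system property to pull back the newest documents:

```sql
-- Return the ten most recently written documents
SELECT TOP 10 *
FROM c
ORDER BY c._ts DESC
```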

Clicking on a document lets us view its content and confirm that we are seeing the data from SmartThings.

We now have a process in place to take the data from the Event Hub and store it in Cosmos DB, ready for reporting. As long as the Stream Analytics job is running, it will transfer the data as soon as it arrives at the Event Hub. If you wish to stop the charges for the Stream Analytics job, you can select the option to stop it; Stream Analytics jobs only cost money whilst they are running.

In part 3 of this series, we are going to look at taking the data in Cosmos DB and building some reports in PowerBI.