Prometheus has become the default metrics collection mechanism for Kubernetes clusters, providing a way to collect time series metrics for your pods, nodes and clusters. This metric collection allows you to monitor for issues, review performance over time, and provide metrics to the scaling functionality in Kubernetes. The Horizontal Pod Autoscaler, for example, can use custom metrics sourced from Prometheus to decide when to scale out. If you're running your applications in Azure on AKS, Prometheus works just fine inside an AKS cluster and reports data on nodes and pods as expected (it won't give you much data on masters, as these are hidden from you). However, what Prometheus won't give you out of the box is a way to collect metrics from Azure Monitor, the monitoring platform in Azure.
Why would you want to do this, you ask? Think about a scenario where you have an application running in Kubernetes that processes data arriving on an Azure Service Bus queue. The more data there is in the queue, the more load there is on your application, and you may want to scale the number of pods processing data as the queue grows. With the default Prometheus setup you would need to monitor the metrics of the pods themselves and decide when they need scaling based on that, but wouldn't it be better to watch the queue itself, see that it is getting larger, and scale based on that? The metrics for Service Bus queue size are all collected by Azure Monitor, but to use them for scaling we need to get them into Prometheus. This is where the open source project called Promitor comes in.
Promitor is an Azure Monitor scraper for Prometheus. Prometheus expects to be presented with HTTP endpoints that it can query and that expose metric data in a format it can collect. These metrics are then pulled into Prometheus and stored. Promitor is an application that talks to Azure Monitor, collects the required metrics, and then formats and presents them at an endpoint that Prometheus can scrape. It works in both standalone Docker and Kubernetes; we'll be looking to implement this in Kubernetes.
Currently, Promitor supports monitoring the following Azure resources, but more will be added in the future:
- Azure Container Instances
- Azure Container Registry
- Azure Service Bus Queue
- Azure Virtual Machine
- Azure Network Interface
- Azure Storage Queue
- Azure Cosmos DB
To be able to run Promitor you need a Kubernetes cluster; this doesn't have to be AKS, it can be from any provider. You also need to have deployed Prometheus to this cluster. I won't be covering the deployment of Prometheus here, but I would strongly recommend you deploy this using the Prometheus operator for Kubernetes, rather than a manual deployment. This makes configuring scraping of the endpoint much easier, and we will be using the operator in this tutorial. If you are new to Prometheus, this is a good article explaining how it works and how to deploy it (it mentions AKS, but it works for all deployments).
You also need resources in Azure that you want to monitor. For this tutorial, I will be using a simple Service Bus Queue, but you can get as complex as you wish and monitor as many resources as you wish. One thing to bear in mind is that currently, a single instance of Promitor can monitor resources in a single resource group. If you need to monitor resources in multiple resource groups, you will need to deploy multiple instances.
Finally, you need a service principal for Promitor to use to authenticate to Azure. This service principal needs to be granted the "Monitoring Reader" role on the resource group you wish to monitor. Make a note of the application ID and password of the service principal, along with the subscription ID, tenant ID and name of the resource group you want to monitor.
Before we deploy Promitor we need to create a YAML file that defines what we want to monitor. Below is an example file which is set up to monitor my Service Bus Queue:
azureMetadata:
  tenantId: xxxxxxxx
  subscriptionId: yyyyyyy
  resourceGroupName: promitorDemo
metricDefaults:
  aggregation:
    interval: 00:01:00
  scraping:
    schedule: "0 * * ? * *"
metrics:
  - name: demo_queue_size
    description: "Amount of active messages of the 'kubernetesjobssb' queue"
    resourceType: ServiceBusQueue
    namespace: kubernetesjobssb
    queueName: jobs
    scraping:
      schedule: "0 */2 * ? * *"
    azureMetricConfiguration:
      metricName: ActiveMessages
      aggregation:
        type: Total
        interval: 00:01:00
The first "azureMetadata" section contains the ID of the tenant and subscription where your resources are, and the resource group they are in. Then in "metricDefaults" we configure two default values that can be overridden later:
- Aggregation Interval - The period your collected results will be aggregated into when sent to Prometheus. Here I have selected 1-minute intervals, so even if Azure Monitor collects this data every 10 seconds, I will be aggregating this into 1-minute intervals for Prometheus.
- Scraping Schedule - How often Promitor will go to Azure and collect the data. This is a cron expression; here I am scraping every minute.
The next section is where we list the actual metrics we want to collect. We provide a name and description, and then the type of resource we are collecting, in this case ServiceBusQueue. You can find documentation on the types of metrics here. We then define the details of the specific resource to query; for Service Bus that means the namespace and the queue name. We can also specify a schedule, which overrides the default schedule we supplied above (in this example, every two minutes instead of every minute). Lastly, we specify which metric we are collecting. This supports any of the metrics provided by Azure Monitor for Service Bus Queues, so you can check the Azure Monitor documentation for the metric names to use. We can then provide another override for the aggregation settings; here we aggregate by the total value, into 1-minute intervals.
Save this file; we will use it when we deploy.
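To make the two schedule values in the declaration a little more concrete, here is a minimal Python sketch that checks whether a six-field cron expression fires at a given moment. It assumes the fields are ordered second, minute, hour, day-of-month, month, day-of-week, and it only understands the syntax used above ('*', '?', '*/n' and plain numbers). It is an illustration of the schedules, not how Promitor itself evaluates them; the function names are made up for this example.

```python
from datetime import datetime


def field_matches(expr: str, value: int) -> bool:
    """Match one cron field; supports only '*', '?', '*/n' and a plain number."""
    if expr in ("*", "?"):
        return True
    if expr.startswith("*/"):
        return value % int(expr[2:]) == 0
    return value == int(expr)


def schedule_fires(schedule: str, moment: datetime) -> bool:
    """Return True if the six-field schedule fires at the given moment.

    Assumed field order: second, minute, hour, day-of-month, month, day-of-week.
    """
    parts = schedule.split()
    if len(parts) != 6:
        raise ValueError("expected six fields, got %d" % len(parts))
    values = (
        moment.second,
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        # Weekday numbering differs between cron dialects; the schedules in
        # the declaration only use '*' or '?' here, so it is never exercised.
        moment.isoweekday() % 7,
    )
    return all(field_matches(p, v) for p, v in zip(parts, values))


if __name__ == "__main__":
    default_schedule = "0 * * ? * *"     # metricDefaults: second 0 of every minute
    override_schedule = "0 */2 * ? * *"  # demo_queue_size: second 0 of every even minute
    for minute in range(4):
        moment = datetime(2019, 1, 1, 10, minute, 0)
        print(moment.time(),
              schedule_fires(default_schedule, moment),
              schedule_fires(override_schedule, moment))
```

Read this way, the default schedule fires once a minute, while the override on demo_queue_size fires only on even minutes.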
Deploy to Kubernetes
We're now ready to deploy Promitor. The project recently added a Helm chart, which makes things much easier. However, it hasn't yet made it into the main Helm repo, so we will need to download it. The easiest way is to clone the Promitor repo from here, then open the command line and cd into the charts folder. We can then run the Helm install command, providing the service principal application ID and password and the path to the metrics configuration file. I would recommend you deploy this into the same namespace as your Prometheus install; in my case, this is called "monitoring".
helm install --name promitor-scraper ./promitor-scraper \
  --namespace monitoring \
  --set azureAuthentication.appId='<azure-ad-app-id>' \
  --set azureAuthentication.appKey='<azure-ad-app-key>' \
  --values /path/to/metric-declaration.yaml
This should deploy correctly, and within a minute or two, you should see the Promitor pod running.
Once this is running, we can check that it is pulling metrics correctly by looking at the scraping URL. This is only exposed inside the cluster so we will need to port forward.
kubectl port-forward <name of Promitor pod> 8080:9090 -n monitoring
We can then go to the URL http://127.0.0.1:8080/prometheus/scrape, and we should see a page displaying the metric we created, a value for the metric and a timestamp.
If you don't see this data, have a look at the pod's logs to see if there are any issues. If you do, we have our collector up and running.
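If you would rather check the scrape page from a script than by eye, the following is a minimal Python sketch that parses text in the Prometheus exposition format and pulls out the samples. The sample text in it is hypothetical (the value and timestamp are invented, and the real page may carry labels or extra metrics), and the parser is deliberately simple: it skips comment lines and does not handle label values containing "}" or escaped quotes.

```python
import re
from typing import Dict, List, NamedTuple, Optional

# One sample line: name, optional {labels}, value, optional millisecond timestamp.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+(?P<timestamp>-?\d+))?\s*$"
)


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


def parse_scrape(text: str) -> List[Sample]:
    """Parse exposition-format text into samples, ignoring '# HELP' / '# TYPE' lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this simple parser understands
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group("labels") or ""))
        timestamp = match.group("timestamp")
        samples.append(Sample(
            name=match.group("name"),
            labels=labels,
            value=float(match.group("value")),
            timestamp_ms=int(timestamp) if timestamp else None,
        ))
    return samples


# Hypothetical scrape output, shaped like the page described above.
EXAMPLE_PAGE = """\
# HELP demo_queue_size Amount of active messages of the 'kubernetesjobssb' queue
# TYPE demo_queue_size gauge
demo_queue_size 12 1546300800000
"""

if __name__ == "__main__":
    for sample in parse_scrape(EXAMPLE_PAGE):
        print(sample.name, sample.value, sample.timestamp_ms)
```

With the port-forward running, the text to parse could be fetched from http://127.0.0.1:8080/prometheus/scrape with urllib.request.urlopen and passed to parse_scrape.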
Now that we have Promitor collecting data and presenting it, we need to configure Prometheus to collect it. Previously this would have involved creating scrape jobs in a config file, but with the advent of the Prometheus operator, we can instead create a Kubernetes object called a ServiceMonitor, which tells Prometheus to collect from this resource. We do this by creating a YAML file defining our service monitor, which looks like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: promitor-scraper
  labels:
    prometheus: kube-prometheus
    app: promitor-scraper
spec:
  selector:
    matchLabels:
      app: promitor-scraper
  endpoints:
    - port: http
      path: /prometheus/scrape
      interval: 15s
This is a reasonably simple resource; the important things to call out are:
- You must use the label "prometheus: kube-prometheus"; this is how Prometheus knows that this is a service monitor it needs to use.
- The selector should be configured to look for "app: promitor-scraper"; this is a label that the Helm chart has already attached to our Promitor service.
- The port name should be "http" to match the service, and we need to define the path because it differs from the Prometheus default of /metrics. This may change to /metrics in the future.
We can then deploy this to Kubernetes using this command.
kubectl create -f <path to yaml file> -n monitoring
It is vital that this ServiceMonitor be deployed to the same namespace as Prometheus, or it will not work.
Once created, wait a few minutes, and then we can check to see if this data is being collected. The easiest way to check this is to port-forward the Prometheus UI using this command:
kubectl port-forward -n monitoring prometheus-kube-prometheus-0 9090
Check the name of your Prometheus pod if this does not work.
Once we have the port-forward running, we can go to http://127.0.0.1:9090/ in the browser. The first thing to do is go to the status menu and go to "targets". You should see listed here the promitor-scraper service monitor, and it should show as up. If it is not working it should show an error.
If that is all working as expected, we can then go to the graph tab and look in the metric list for our metric, which in this case is called "demo_queue_size". Select it and click Execute, and we should see a text representation of the data. Click on the Graph tab to see it over time.
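The same check can be scripted against the Prometheus HTTP API, which exposes instant queries at /api/v1/query. The Python sketch below builds the query URL and extracts the values from a response; it assumes the port-forward above is still running on 127.0.0.1:9090, and the function names and the sample response are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def extract_values(response: dict) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector API response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise RuntimeError("query failed: %s" % response.get("error", "unknown error"))
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])
    # Each result carries its labels and a [unix_time, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query_prometheus(base_url: str, promql: str, timeout: float = 10.0):
    """Run an instant query against a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as reply:
        return extract_values(json.load(reply))


# Hypothetical response, shaped like an instant-vector reply from the API.
EXAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "demo_queue_size"}, "value": [1546300800.0, "12"]},
        ],
    },
}

if __name__ == "__main__":
    print(build_query_url("http://127.0.0.1:9090", "demo_queue_size"))
    print(extract_values(EXAMPLE_RESPONSE))
```

Calling query_prometheus("http://127.0.0.1:9090", "demo_queue_size") while the port-forward is active should return the current queue size along with whatever labels Prometheus attached to the series.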
Using these Metrics
Now that we have configured our metric collection we can use this in whatever way we need. The first thing most people want to do is configure some graphing on a dashboard; we can do this in Grafana. If you used the Prometheus operator, you should already have a working Grafana instance with a Prometheus data source. All you need to do is create a new chart and use the query generated when we tested in the Prometheus dashboard. So in this example, to get the message queue size, the query is simply the metric name:
demo_queue_size
This gives us a similar chart to that we saw in Prometheus.
We can also now start to use these metrics as part of our scaling configuration, and scale our resources based on what we see on the Azure side, not just what is happening in the cluster. There is a good article on using Prometheus metrics for scaling here.
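As a rough illustration of what scaling on queue size means in practice, here is a small Python sketch of the replica calculation the Horizontal Pod Autoscaler documents, ceil(currentReplicas × currentMetricValue / desiredMetricValue), applied to a per-pod average target. With an average target the current value is the queue length divided by the current replica count, so the formula reduces to the queue length divided by the target per pod. The target of 25 messages per pod and the replica limits are invented numbers, and the sketch ignores the autoscaler's tolerance and stabilisation behaviour.

```python
import math


def desired_replicas(queue_length: float,
                     target_per_pod: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Estimate the replica count for a per-pod average target on queue length."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    raw = math.ceil(queue_length / target_per_pod)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for length in (0, 30, 100, 1000):
        print(length, desired_replicas(length, target_per_pod=25))
```

The point of the sketch is that the decision is driven by demo_queue_size, a number that only exists on the Azure side, rather than by CPU or memory inside the pods.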
If you've got any exciting scenarios you plan to implement using these metrics I would love to hear about them, and if you want to see more Azure metrics available in Promitor, I would encourage you to consider committing some code to the project.