It is the moral responsibility of code to crash, of bugs to appear, of functionality to break, and of developers to fix it all. But to fix a bug, developers need to know when it appears and in what context. To get this first-hand information, we set up monitoring services, which track the health of a service and trigger an alert when things aren’t normal. In this series, we discuss how we set up a minimal monitoring stack for Kubernetes at our startup, built on Thanos, Prometheus, and Grafana.
This is the first article of the series. Here we establish the background of monitoring and describe the tools that we are going to use. The next part will be more hands-on: we will dive into the implementation details of using Thanos, Prometheus, and Grafana for monitoring Kubernetes.
What is Monitoring?
Monitoring, or to be more specific, IT monitoring, means, as the name suggests, constantly observing various attributes of a service and taking some action when an anomaly is detected. It is the process of gathering metrics about the operations of an IT environment’s hardware and software to ensure everything functions as expected to support applications and services.
I talked about attributes, but what are those attributes? Well, let us take an example of a backend service that exposes 2-3 APIs. The attributes to be monitored can be:
- API latency
- Number of requests served
- HTTP status codes
- Physical memory of the server
- CPU utilization of the server, etc.
By monitoring each of these attributes, we measure the health of the service. We define the thresholds ourselves, and whenever an attribute crosses its threshold, the monitoring service triggers an alert, and the developers pick up their favorite job, DEBUGGING!
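The idea of thresholds can be sketched in a few lines of Python. This is only an illustration: the metric names, limits, and readings below are made up, and a real setup would express this as Prometheus alerting rules rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # name of the monitored attribute (hypothetical names below)
    limit: float  # value above which we want an alert


def breached(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the names of metrics whose current value exceeds its limit."""
    return [
        t.metric
        for t in thresholds
        if t.metric in samples and samples[t.metric] > t.limit
    ]


# Hypothetical thresholds and readings, for illustration only.
thresholds = [
    Threshold("api_latency_seconds", 3.0),
    Threshold("cpu_utilization_percent", 90.0),
    Threshold("memory_used_percent", 85.0),
]
current = {
    "api_latency_seconds": 4.2,
    "cpu_utilization_percent": 55.0,
    "memory_used_percent": 91.0,
}

alerts = breached(current, thresholds)
print(alerts)  # ['api_latency_seconds', 'memory_used_percent']
```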
Now that we understand what monitoring is and why it is used, let us discuss the architecture of Thanos, Prometheus, and Grafana and how they are used for monitoring Kubernetes here at retailpulse.
We have deployed our application in a Kubernetes environment on GCP. It interacts with a SQL Database. The diagram below shows a snapshot of our GCP cluster and how the different components interact.
Prometheus, Grafana, and Thanos are described in detail later in the article; for now, we just want to focus on the interaction between these services. It is worth mentioning that the stack can run even without Thanos, but we use Thanos for better storage and a bunch of other features that we will dive into.
- The application server records some metrics and keeps their current values in memory. It also exposes an endpoint for Prometheus to query the metrics.
- Prometheus is a pull-based service. It hits the application service at the exposed endpoint and fetches the metrics. With the standard Prometheus client libraries, a scrape does not clear the metrics from application memory: counters keep accumulating, and Prometheus stores the value it reads at each scrape.
- The Prometheus pod runs as a StatefulSet. If used alone, it saves the metrics on its own local storage.
- If Thanos is involved, the Thanos Sidecar running next to Prometheus uploads the stored metrics to object storage, which in turn enables features like compaction, downsampling, etc.
- Grafana is the visualisation tool. It queries Prometheus (or, when Thanos is involved, the Thanos Querier) to fetch data and display it.
- When Thanos is involved, the Querier fetches recent data from Prometheus through the Thanos Sidecar and older data from object storage, merges it, and sends it to Grafana.
Thus a metric generated at the application level is saved in Prometheus, or in some other permanent storage with the help of Thanos, and is visualized using Grafana. Now let us study each of the components.
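To make the first two steps concrete, here is a minimal sketch of what the application side might look like: counters held in memory and rendered in the Prometheus text exposition format when the metrics endpoint is hit. The metric name `http_requests_total`, its labels, and the helper functions are assumptions for illustration; in practice an official Prometheus client library would do this for you.

```python
from collections import Counter

# In-memory counter keyed by (path, status code); values only ever increase.
requests_total: Counter = Counter()


def record_request(path: str, status: int) -> None:
    """Called by the application each time it serves a request."""
    requests_total[(path, status)] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Number of requests served.",
        "# TYPE http_requests_total counter",
    ]
    for (path, status), value in sorted(requests_total.items()):
        lines.append(
            f'http_requests_total{{path="{path}",code="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


# Hypothetical traffic, for illustration only.
record_request("/orders", 200)
record_request("/orders", 200)
record_request("/orders", 500)

# This is the body the metrics endpoint would return to a scrape.
print(render_metrics())
```

Rendering the metrics twice returns the same values, which reflects the pull model: reading the endpoint does not reset anything on the application side.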
Prometheus is, at its heart, a time-series database. Though it does much more than that, in its purest form it is persistent storage that is good at storing time-series metrics. In the diagram above, we have tried to cover most of the components that constitute Prometheus.
- Service Monitor:
Typically, an application comprises many services running simultaneously. In order to monitor the application, we need to monitor all the services properly. This is where service monitors come to the rescue. A Service Monitor describes the targets from which the metrics have to be scraped. It defines the sources, and the endpoints exposed by those sources, which the scraper can hit in order to get the metrics.
- Scraper:
The scraper is the component responsible for running at a fixed interval and hitting the endpoints described by each of the service monitors to fetch the metrics. These metrics are then stored in the Storage.
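A scrape cycle can be sketched as follows. This is a simplified model, not Prometheus's actual implementation: the target name, the `fake_app` function standing in for an HTTP request, and the dictionary used as storage are all assumptions, and the parser only handles plain `name{labels} value` lines without the optional timestamp.

```python
import time
from typing import Callable

# Storage stand-in: (target, series) -> list of (timestamp, value) samples.
Storage = dict[tuple[str, str], list[tuple[float, float]]]


def parse_samples(body: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines, skipping comments and blank lines."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def scrape_once(targets: dict[str, Callable[[], str]], storage: Storage,
                now: float) -> None:
    """Pull metrics from every target and append them to storage."""
    for target, fetch in targets.items():
        for series, value in parse_samples(fetch()).items():
            storage.setdefault((target, series), []).append((now, value))


def fake_app() -> str:
    # Hypothetical response body; a real scraper would issue an HTTP GET.
    return (
        "# TYPE http_requests_total counter\n"
        'http_requests_total{path="/orders"} 3\n'
    )


storage: Storage = {}
targets = {"orders-service:8080": fake_app}
for _ in range(2):  # a real scraper would wait for the scrape interval here
    scrape_once(targets, storage, now=time.time())
print(storage)
```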
- Rules and Alerts:
Rules and alerts are essentially the configuration of Prometheus. As already mentioned in the article, the monitoring setup triggers an alert when some anomaly or unexpected behavior is detected. Rules and alerts are the definitions of those anomalies, for example: trigger an alert when the API latency is more than 3 seconds.
- Alertmanager:
Alertmanager is another component of Prometheus. When an anomaly is detected and an alert is triggered, there has to be a channel through which the developer gets to know about the alert. Developers across the world use different mediums: some prefer to get alerts on workplace chat such as Slack, some by SMS or email, and some even by phone call. It is the responsibility of the Alertmanager to make sure that an alert triggered by the rules is successfully routed to the preferred channel described by the client (developer).
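The routing idea can be illustrated with a small sketch: each route lists the labels an alert must carry, and the first matching route decides the channel. The receiver names, labels, and first-match-wins behaviour here are assumptions chosen for illustration; the real Alertmanager is configured through its own routing tree rather than code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str  # channel to notify, e.g. "slack" (hypothetical names)
    match: dict[str, str] = field(default_factory=dict)  # labels that must match


def pick_receiver(alert_labels: dict[str, str], routes: list[Route],
                  default: str) -> str:
    """Return the receiver of the first route whose labels all match."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route.match.items()):
            return route.receiver
    return default


# Hypothetical routing table: critical alerts page someone, backend goes to chat.
routes = [
    Route("phone-call", {"severity": "critical"}),
    Route("slack", {"team": "backend"}),
]

print(pick_receiver({"severity": "critical", "team": "backend"}, routes, "email"))
print(pick_receiver({"severity": "warning", "team": "backend"}, routes, "email"))
print(pick_receiver({"severity": "warning", "team": "data"}, routes, "email"))
```

With this table, the three calls resolve to `phone-call`, `slack`, and `email` respectively.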
- Storage:
Storage, as stated, is the heart of Prometheus. All the metrics that have been scraped from the data sources pointed at by the service monitors are saved in the storage. With the Thanos Sidecar in place, the load on storage decreases: older data is pushed out to mass storage options like storage buckets. But storage still remains the heart of Prometheus.
Prometheus in itself is a complete solution to fetch, store, and expose metrics for monitoring purposes. But in certain cases, even Prometheus starts showing some problems. Prometheus in Kubernetes runs on a pod. For simplicity, let us assume a pod is a machine with some processing power and memory. But the memory here is limited. What happens when we want to store metrics with high cardinality for a long period of time? We hit a memory bottleneck. We can use pods with more memory, but that is not a cost-effective solution. What happens if we want to aggregate past data? Say we have a per-minute value of the number of requests received by a certain API, but for old data (say, the last quarter) we want to store only a per-day value of this metric. This is where Thanos comes to the rescue.
Thanos is a layer built on top of Prometheus that provides extra features like high availability, virtually unlimited storage, downsampling, etc.
Let us discuss the various components of Thanos.
- Object Storage:
As stated above, the Prometheus pod has limited storage. To overcome this, Thanos provides integration with external bulk object storage, such as S3 buckets. Please note that object storage is not a part of Thanos; we have shown it as a Thanos component only to make the Thanos architecture easier to understand.
- Thanos Sidecar:
The Thanos Sidecar sits next to Prometheus and works very closely with it. It has two major functions: it exposes Prometheus to the Querier so that the Querier can fetch recent metrics from Prometheus, and it takes older metrics from Prometheus and stores them in bulk object storage.
- Thanos Querier:
Visualization tools like Grafana display the metrics in various forms like graphs and charts. But in order to display these graphs, they need to fetch the data from persistent storage. In very simple terms, they can directly query the Prometheus storage and get the metrics to display the charts. But with Thanos in the picture, some blocks of data are saved in low-cost persistent storage like buckets. Thus arises the need for the Querier. Grafana hits the Querier and asks for the data. It is the responsibility of the Querier to decide whether it has to hit Prometheus (for new data), the object storage behind Thanos (for old data), or both. If the Querier decides to query both, it is again its responsibility to merge the data from the different sources and return it to Grafana.
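The two responsibilities of the Querier, choosing the sources and merging their results, can be sketched like this. It is a simplified illustration, not how Thanos implements it: the `cutoff` parameter, the source names, and the rule of preferring the recent source on overlapping timestamps are all assumptions.

```python
Sample = tuple[int, float]  # (unix timestamp in seconds, value)


def sources_for(start: int, end: int, cutoff: int) -> list[str]:
    """Decide which backends hold data for the range [start, end].

    cutoff is the oldest timestamp still available in Prometheus (assumed).
    """
    sources = []
    if start < cutoff:
        sources.append("object-storage")
    if end >= cutoff:
        sources.append("prometheus")
    return sources


def merge_series(recent: list[Sample], historical: list[Sample]) -> list[Sample]:
    """Merge two sample lists by timestamp; on overlap keep the recent source."""
    merged = dict(historical)
    merged.update(recent)
    return sorted(merged.items())


# Hypothetical data: the two sources overlap at timestamp 180.
historical = [(60, 1.0), (120, 2.0), (180, 3.0)]
recent = [(180, 3.0), (240, 5.0)]

print(sources_for(start=60, end=240, cutoff=180))  # ['object-storage', 'prometheus']
print(merge_series(recent, historical))
# [(60, 1.0), (120, 2.0), (180, 3.0), (240, 5.0)]
```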
- Thanos Store Gateway:
The Thanos Store Gateway is the interface to object storage. When the Querier wants to fetch data from object storage, it uses the Thanos Store Gateway to resolve the type of object storage (GCP bucket, S3 bucket, etc.), fetch data from the source, and provide it to the Querier in a format that it understands.
- Thanos Compactor:
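The Compactor facilitates efficient storage and retrieval of data. Its main features are compaction and downsampling. Compaction means merging several small blocks of data into one larger block, which helps in better utilization of disk space. Downsampling stores older data at a coarser resolution (Thanos keeps 5-minute and 1-hour aggregates alongside the raw data), which works like a zoom-out function: queries over long time ranges read far fewer points. Note that Thanos downsamples only; it does not upsample.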
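Both operations are easy to picture with a toy example. The sketch below merges small blocks into one and then sums per-minute request counts into 5-minute windows. It is only an illustration under simplifying assumptions: the sample data is made up, and summing is just one possible aggregate, whereas real downsampling keeps several aggregates per window.

```python
from collections import defaultdict

Sample = tuple[int, float]  # (unix timestamp in seconds, value)


def compact(blocks: list[list[Sample]]) -> list[Sample]:
    """Merge several small blocks into one block sorted by timestamp."""
    return sorted(sample for block in blocks for sample in block)


def downsample(samples: list[Sample], window: int) -> list[Sample]:
    """Sum values into fixed windows; the window start is the new timestamp."""
    buckets: dict[int, float] = defaultdict(float)
    for ts, value in samples:
        buckets[ts - ts % window] += value
    return sorted(buckets.items())


# Hypothetical per-minute request counts split across two small blocks.
blocks = [
    [(0, 10.0), (60, 12.0)],
    [(120, 8.0), (300, 5.0)],
]

one_block = compact(blocks)
print(one_block)                          # [(0, 10.0), (60, 12.0), (120, 8.0), (300, 5.0)]
print(downsample(one_block, window=300))  # [(0, 30.0), (300, 5.0)]
```

The same `downsample` function with `window=86400` would turn per-minute counts into per-day counts, which is the kind of aggregation older data usually needs.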
These were two major components of the monitoring stack that we use here at our organization. In the next part of the series, we shall cover Grafana, along with a hands-on walkthrough of how to use Thanos, Prometheus, and Grafana for monitoring Kubernetes.