How We Monitor Our Kubernetes Cluster

Nanit has been running Kubernetes in production from its early days, for almost two years now. As with every large and complicated system, we have experienced failures on all levels:
Every failure led us to push our monitoring capabilities further, so that we learn about failures as early as possible and can investigate our system and resolve them as fast as possible. In this post I’ll go over our monitoring tools and how we use them to monitor our Kubernetes cluster.

Our Monitoring Stack
It is important to note that all of this infrastructure runs inside our Kubernetes cluster, so we had to port some of these tools to work on Kubernetes. We saw this as a great opportunity to open source our StatsD + Graphite Kubernetes cluster and our RabbitMQ Kubernetes cluster. We monitor our cluster in two steps:
Let’s look at some specific examples. For each example we will explore what we are monitoring, why it is important to know about such failures, and how we set up the data collection and alerts.

Monitoring Dead Nodes

What: During the time we have used Kubernetes we encountered numerous failures of “dead nodes”. We call a node a “dead node” when its Ready condition is in the NotReady state. Kubernetes evicts all pods from such a node and will not schedule new pods on it as long as it is marked NotReady. Sometimes these nodes return to the Ready state by themselves, and sometimes they remain NotReady indefinitely.

Why: It is important to know about dead nodes mainly for the following three reasons:
How: We issue a GET /api/v1/nodes request to the Kubernetes API. Each node is returned with an array of node conditions: NetworkUnavailable, MemoryPressure, OutOfDisk and Ready. For each condition, we send StatsD the number of nodes failing that condition, and we can then set up an alert to notify us whenever the number of failing nodes crosses our threshold. Sometimes a single misbehaving pod with no suitable resource requests/limits repeatedly chokes the node it is running on. That is why we also send Slack the private IP of the dead node, along with the set of pods currently running on it. This lets us identify patterns of specific service pods that cause node failures.
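As a rough illustration (a minimal sketch, not the exact script we run), assuming the check runs in-cluster with a service account allowed to list nodes and that StatsD is reachable over UDP, with illustrative metric names:

```ruby
# A sketch, not the exact script we run: count the nodes failing each
# condition and push the counts to StatsD as gauges.
require "net/http"
require "json"
require "socket"

TOKEN_PATH  = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH     = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
API_HOST    = ENV.fetch("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
STATSD_HOST = ENV.fetch("STATSD_HOST", "statsd.kube-system") # assumed address
STATSD_PORT = 8125

def api_get(path)
  uri  = URI("https://#{API_HOST}#{path}")
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.ca_file = CA_PATH
  req = Net::HTTP::Get.new(uri)
  req["Authorization"] = "Bearer #{File.read(TOKEN_PATH).strip}"
  JSON.parse(http.request(req).body)
end

def send_gauge(name, value)
  # Plain StatsD line protocol over UDP: "<metric>:<value>|g"
  UDPSocket.new.send("#{name}:#{value}|g", 0, STATSD_HOST, STATSD_PORT)
end

nodes = api_get("/api/v1/nodes")["items"]

# Ready is healthy when "True"; the pressure/unavailability conditions
# are healthy when "False".
failing = Hash.new(0)
nodes.each do |node|
  (node.dig("status", "conditions") || []).each do |cond|
    healthy = cond["type"] == "Ready" ? cond["status"] == "True" : cond["status"] == "False"
    failing[cond["type"]] += 1 unless healthy
  end
end

%w[Ready MemoryPressure OutOfDisk NetworkUnavailable].each do |type|
  send_gauge("kubernetes.nodes.failing.#{type}", failing[type])
end
```

The Slack notification with the node’s private IP and the pods running on it is left out of the sketch; it follows the same pattern, with a pod listing filtered by node name.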
Monitoring Missing Pods

What: Kubernetes has several ways to specify the redundancy of a service, mainly Deployments, Replication Controllers, ReplicaSets and StatefulSets. It may happen that the number of active pods does not match the replicas field specified on each of them. The two main reasons for this are insufficient resources to schedule a pod onto a node, and pods that fail to start for application-level reasons.

Why:

How: Each resource type has its own API endpoint, but the general flow is the same. Taking Deployments as an example, we list all deployments via a GET /apis/apps/v1beta1/deployments request to the Kubernetes API. Each deployment has a status field, and in it two parameters we can use: replicas and availableReplicas. Subtracting the latter from the former gives the number of pods currently missing from the deployment. We can then define an alert to notify us whenever a deployment’s missing-pods count rises above a certain threshold.

Source Code

This is the Ruby script we use to send metrics of Kubernetes nodes, deployments, statefulsets and replication controllers to StatsD:
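As a minimal sketch of just the deployment part (an illustration rather than the script itself, assuming the same in-cluster service-account credentials and plain UDP StatsD reporting as in the node example, with illustrative metric names):

```ruby
# A sketch, not the script itself: report each deployment's missing pods to StatsD.
require "net/http"
require "json"
require "socket"

def api_get(path)
  host = ENV.fetch("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
  uri  = URI("https://#{host}#{path}")
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  req = Net::HTTP::Get.new(uri)
  req["Authorization"] = "Bearer #{File.read("/var/run/secrets/kubernetes.io/serviceaccount/token").strip}"
  JSON.parse(http.request(req).body)
end

def send_gauge(name, value)
  UDPSocket.new.send("#{name}:#{value}|g", 0, ENV.fetch("STATSD_HOST", "statsd.kube-system"), 8125)
end

# apps/v1beta1 matches the API version mentioned above; newer clusters expose apps/v1.
deployments = api_get("/apis/apps/v1beta1/deployments")["items"]

deployments.each do |deploy|
  name      = deploy.dig("metadata", "name")
  desired   = deploy.dig("status", "replicas").to_i
  available = deploy.dig("status", "availableReplicas").to_i

  # replicas - availableReplicas = pods currently missing from this deployment
  send_gauge("kubernetes.deployments.#{name}.missing_pods", desired - available)
end
```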
Monitoring RabbitMQ

What: Even though RabbitMQ is not a core component of Kubernetes, I think it is worth mentioning since it is a critical part of our infrastructure. Almost all of our services communicate with RabbitMQ in one way or another, so by monitoring RabbitMQ itself we can catch failures across the system. The things we are looking for are:

Each of these failures may occur due to a RabbitMQ failure or an application-level failure. Once the alert is sent, it is up to us to investigate and understand the root cause.

Why: Each of these symptoms may imply a different system failure:
How: We use the RabbitMQ HTTP Ruby gem, which wraps the RabbitMQ HTTP REST API, and report the following metrics to StatsD:
By having these metrics, we can easily define the appropriate alerts to detect system failures as early as possible.

Source Code

This is the Ruby script we use for reporting:
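As a minimal sketch of this kind of reporter (an illustration rather than the script itself; for brevity it calls the RabbitMQ management HTTP API directly with net/http instead of going through the gem, and the host, credentials and metric names are assumptions):

```ruby
# A sketch, not the script itself: pull per-queue stats from the RabbitMQ
# management HTTP API and push them to StatsD as gauges.
require "net/http"
require "json"
require "socket"

RABBIT_URL  = ENV.fetch("RABBITMQ_MGMT_URL", "http://rabbitmq:15672") # assumed service name
RABBIT_USER = ENV.fetch("RABBITMQ_USER", "guest")
RABBIT_PASS = ENV.fetch("RABBITMQ_PASS", "guest")

def send_gauge(name, value)
  UDPSocket.new.send("#{name}:#{value}|g", 0, ENV.fetch("STATSD_HOST", "statsd.kube-system"), 8125)
end

uri = URI("#{RABBIT_URL}/api/queues")
req = Net::HTTP::Get.new(uri)
req.basic_auth(RABBIT_USER, RABBIT_PASS)
res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
queues = JSON.parse(res.body)

queues.each do |queue|
  prefix = "rabbitmq.queues.#{queue["name"]}"
  send_gauge("#{prefix}.messages_ready",          queue["messages_ready"].to_i)
  send_gauge("#{prefix}.messages_unacknowledged", queue["messages_unacknowledged"].to_i)
  send_gauge("#{prefix}.consumers",               queue["consumers"].to_i)
end
```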
Conclusion

Monitoring is an issue each and every company faces, and it is always interesting to see the variety of monitoring stacks and tools each company embraces. I hope that by sharing our monitoring stack and procedures here I have given some of you ideas and tools that will help you gain visibility into your system and learn about failures as soon as possible. You are more than welcome to share your thoughts and suggestions in the comments section :)