
How We Monitor Our Kubernetes Cluster

Erez Rabih

Nanit has been using Kubernetes in production since its early days, for almost two years now. As with every large and complicated system, we have experienced failures at all levels:

  1. The Kubernetes level: Node failures, Pod allocation failures, etc.
  2. The application infrastructure level: Redis, RabbitMQ, etc.
  3. The application level: Nanit’s web services and video processing mechanisms.

Every failure pushed us to extend our monitoring capabilities, so that we learn about failures as early as possible and can interrogate our system and resolve them as fast as possible.

In this post I’ll go over our monitoring tools and how we monitor our Kubernetes cluster using them.

Our Monitoring Stack

  1. For metric collection and in-memory aggregation we use StatsD
  2. For metric storage and serving we use Graphite
  3. For visualization we use Grafana
  4. For alerting we use Cabot

It is important to note that all of this infrastructure runs inside our Kubernetes cluster, so we had to port some of these tools to work on Kubernetes. We saw this as a great opportunity to open-source our StatsD + Graphite Kubernetes cluster and our RabbitMQ Kubernetes cluster.

We monitor our cluster in two steps:

  1. A script runs repeatedly every X seconds, collects data from the Kubernetes or RabbitMQ API, and reports it to StatsD (sketched below). The data is persisted in Graphite.
  2. We define an alert in Cabot that checks each of the reported Graphite metrics and verifies it is within an acceptable range. We also set a grace period before alerting, to give the failure a chance to recover on its own.
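
In skeleton form, step 1 is just a periodic loop that polls an API and pushes a gauge to StatsD. The sketch below assumes the statsd-ruby client; the StatsD host, metric name and 30-second interval are placeholders, not our production values:

```ruby
# Minimal sketch of step 1: poll, report a gauge, sleep, repeat.
require 'statsd' # from the statsd-ruby gem

statsd = Statsd.new('statsd.default.svc.cluster.local', 8125)

# Placeholder collector; a real one would call the Kubernetes or RabbitMQ API.
def collect_metric
  rand(0..3)
end

loop do
  # Gauges keep their last value in StatsD and are flushed to Graphite periodically.
  statsd.gauge('kubernetes.nodes.not_ready', collect_metric)
  sleep 30 # "every X seconds"
end
```

Cabot then watches the resulting Graphite series (with the default StatsD naming, something like stats.gauges.kubernetes.nodes.not_ready) and alerts when it leaves the acceptable range for longer than the grace period.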

Let’s look at some specific examples. For each one we will explore what we are monitoring, why it is important to know about such failures, and how we set up the data collection and alerts.

Monitoring Dead Nodes

What: During our time with Kubernetes we have encountered numerous “dead nodes”. We call a node a “dead node” when its Ready condition is in the NotReady state. Kubernetes evicts all pods from such nodes and will not schedule new pods on them as long as they are marked NotReady. Sometimes these nodes return to the Ready state by themselves, and sometimes they remain NotReady indefinitely.

Why: It is important to know about dead nodes for three main reasons:

  1. The total amount of usable resources in our cluster is less than we intend it to be, which can later lead to insufficient resources when trying to schedule a pod.
  2. We are paying for instances which are not in use by any service.
  3. If there are several dead nodes, we might have a cluster-wide problem we need to attend to.

How: We issue a GET /api/v1/nodes request to the Kubernetes API. Each node in the response carries an array of node conditions: NetworkUnavailable, MemoryPressure, OutOfDisk and Ready. For each condition, we send StatsD the number of nodes failing that condition. We can then set up an alert to notify us whenever the number of failing nodes crosses our threshold.
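
As a rough illustration of that collection step, the sketch below counts the nodes failing each condition and reports the counts as gauges. It assumes kubectl proxy is serving the API on localhost:8001 and uses the statsd-ruby client; the metric names are illustrative:

```ruby
# Sketch: count nodes failing each condition and report the counts to StatsD.
require 'net/http'
require 'json'
require 'statsd' # statsd-ruby gem

statsd = Statsd.new('statsd.default.svc.cluster.local', 8125)

# Assumes `kubectl proxy` (or equivalent) is exposing the API locally.
nodes = JSON.parse(Net::HTTP.get(URI('http://127.0.0.1:8001/api/v1/nodes')))['items']

%w[NetworkUnavailable MemoryPressure OutOfDisk Ready].each do |type|
  failing = nodes.count do |node|
    cond = node['status']['conditions'].find { |c| c['type'] == type }
    # Ready is healthy when "True"; the other conditions signal trouble when "True".
    type == 'Ready' ? (cond.nil? || cond['status'] != 'True') : (!cond.nil? && cond['status'] == 'True')
  end
  statsd.gauge("kubernetes.nodes.failing.#{type.downcase}", failing)
end
```

The gauge for the Ready condition is the “dead nodes” count; an alert on it catches the failure described above.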

Sometimes a single misbehaving pod without suitable resource requests/limits can repeatedly choke the node it is running on. That’s why we also send Slack the private IP of the dead node, along with the set of pods currently running on it. This lets us identify patterns of specific services whose pods cause node failures.

Monitoring Missing Pods

What: Kubernetes has various ways to specify the redundancy of a service, mainly Deployments, Replication Controllers, ReplicaSets and StatefulSets. It may happen that the number of active pods does not match the replicas field specified on them. The two main reasons for this are insufficient resources to schedule a pod onto a node, and pods that fail to start for application-level reasons.

Why:

  1. We may be missing the redundancy we configured. If, for example, we set the replicas of a certain service to 2 and one of its pods is not running, that service becomes a single point of failure in our system.
  2. We are not investing the computing power we intended in that task. If we have a background job service that consumes jobs from a queue, we now have fewer workers (pods) handling that queue. The same goes for web APIs, which are now served by fewer pods; this may lead to slow response times and a bad user experience.
  3. If we don’t have enough resources to schedule the pod, we might want to either add more instances to the cluster or change the instance type so that our pods can fit (resource-wise).

How: Each resource type has its own API endpoint, but the general flow is the same. Taking Deployments as an example, we list all deployments via a GET /apis/apps/v1beta1/deployments request to the Kubernetes API. Each deployment has a status field containing two parameters we can use: replicas and availableReplicas. Subtracting the latter from the former gives the number of pods currently missing from the deployment. We can then define an alert to notify us whenever a deployment’s missing-pods value exceeds a certain threshold.
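
A stripped-down version of that calculation might look like the following; it again assumes kubectl proxy on localhost:8001, the statsd-ruby client and illustrative metric names:

```ruby
# Sketch: report each Deployment's missing-pods count (replicas - availableReplicas).
require 'net/http'
require 'json'
require 'statsd' # statsd-ruby gem

statsd = Statsd.new('statsd.default.svc.cluster.local', 8125)

# apps/v1beta1 matches the text above; newer clusters serve /apis/apps/v1/deployments.
deployments = JSON.parse(
  Net::HTTP.get(URI('http://127.0.0.1:8001/apis/apps/v1beta1/deployments'))
)['items']

deployments.each do |d|
  desired   = d['status']['replicas'].to_i
  available = d['status']['availableReplicas'].to_i # nil (treated as 0) when no pods are available
  statsd.gauge("kubernetes.deployments.#{d['metadata']['name']}.missing_pods", desired - available)
end
```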

Source Code

The Ruby script we use sends metrics of Kubernetes nodes, Deployments, StatefulSets and Replication Controllers to StatsD.

Monitoring RabbitMQ

What: Even though RabbitMQ is not a core component of Kubernetes, I think it is worth mentioning since it is a critical part of our infrastructure. Almost all of our services communicate with RabbitMQ one way or another, so by monitoring RabbitMQ itself we can catch failures across the system. The things we look for are:

  1. Exchanges with no bindings
  2. Queues with no consumers
  3. Full queues — queues that have more than X messages waiting on them.

Each of these failures may occur due to a RabbitMQ failure or an application-level failure. Once the alert fires, it is up to us to investigate and understand what the root cause is.

Why: Each of these symptoms may imply a different system failure:

  1. Exchanges with no bindings may imply that a service could not successfully create and bind a queue to an exchange, and thus does not receive any messages from RabbitMQ.
  2. Queues with no consumers may imply that a service cannot successfully connect and consume messages from the queue, or even worse, that the consumer service’s pods are failing to start. It usually leads to the third symptom, a full queue, since messages arrive at the queue and no consumer is there to handle them.
  3. Full queues can be either a result of symptom 2 or of a message arrival rate higher than the consumption rate. In the latter case, we may need to scale out our consumer service to match the rate of messages arriving at that queue.

How: We use the RabbitMQ HTTP Ruby gem, which wraps the RabbitMQ HTTP REST API, and report the following metrics to StatsD:

  1. For each exchange, the number of bindings (both exchange bindings and queue bindings).
  2. For each queue, the number of consumers connected to that queue.
  3. For each queue, the number of pending messages on that queue.

With these metrics in place, we can easily define the appropriate alerts to detect system failures as early as possible.
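
As an illustration of the same idea, the sketch below collects those three metrics by calling the management REST API directly with net/http instead of going through the gem; the RabbitMQ host, credentials and metric names are placeholders:

```ruby
# Sketch: per-exchange binding counts and per-queue consumer/message counts.
require 'net/http'
require 'json'
require 'statsd' # statsd-ruby gem

statsd = Statsd.new('statsd.default.svc.cluster.local', 8125)

# Helper for GET requests against the RabbitMQ management API (placeholder host/credentials).
def rabbit_get(path)
  uri = URI("http://rabbitmq.default.svc.cluster.local:15672/api/#{path}")
  req = Net::HTTP::Get.new(uri)
  req.basic_auth('guest', 'guest')
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  JSON.parse(res.body)
end

bindings = rabbit_get('bindings')

# 1. Exchanges with no bindings: count bindings whose source is the exchange.
rabbit_get('exchanges').each do |ex|
  next if ex['name'].empty? # skip the default exchange
  statsd.gauge("rabbitmq.exchanges.#{ex['name']}.bindings",
               bindings.count { |b| b['source'] == ex['name'] })
end

# 2 + 3. Queues with no consumers, and full queues.
rabbit_get('queues').each do |q|
  statsd.gauge("rabbitmq.queues.#{q['name']}.consumers", q['consumers'].to_i)
  statsd.gauge("rabbitmq.queues.#{q['name']}.messages",  q['messages'].to_i)
end
```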

Source Code

We use a second Ruby script for the RabbitMQ reporting.

Conclusion

Monitoring is a challenge each and every company faces, and it is always interesting to see the variety of monitoring stacks and tools each company adopts.

I hope that by sharing our monitoring stack and procedures here I have given some of you ideas and tools that will help you gain visibility into your system and learn about failures as soon as possible.

You are more than welcome to share your thoughts and suggestions in the comments section :)
