Managing a Kubernetes cluster requires a keen eye for detail and a deep understanding of its complex structure.

To ensure smooth operation of your applications and optimal performance, it is vital to monitor a wide range of metrics across the different components of your cluster.

In this article, we will discuss key Kubernetes metrics that can be used to monitor both self-managed and cloud-managed Kubernetes environments, helping you keep your cluster running at its best.

🗒️
How Incident Management Tools Can Help You Improve Your IT Operations!

The Role of Kubernetes Metrics in Monitoring

Kubernetes metrics play a crucial role in monitoring and maintaining the health and performance of a Kubernetes cluster. These metrics provide valuable insights into the resource utilization, workload distribution, and overall efficiency of the cluster.

Some of the benefits of monitoring Kubernetes metrics are:

Ensure resource optimization:

Metrics like CPU and memory usage allow you to monitor resource utilization and identify potential bottlenecks or inefficiencies.

Scale applications effectively:

By monitoring metrics related to pod and node usage, you can determine the workload distribution and make informed decisions about scaling applications horizontally or vertically to meet the demand.

Detect and troubleshoot issues:

Metrics such as pod restarts, error rates, and latency help in identifying issues and performance anomalies.

Capacity planning:

By analyzing historical metrics data, you can identify usage patterns, predict resource requirements, and plan for future capacity needs. This enables you to allocate resources effectively and avoid potential resource constraints.

💡
Learn more about incident management KPIs here!

Top 18 Metrics for Monitoring a Kubernetes Cluster

When monitoring a Kubernetes cluster, there are several key metrics that you should pay attention to.

These metrics provide insights into the health, performance, and resource utilization of your Kubernetes cluster.

Here are some of the top Kubernetes metrics to monitor:

Monitoring Node Memory Pressure

When managing a Kubernetes cluster, it is important to keep an eye on the memory usage of each individual node. One way to do this is by setting an alert for when a node exceeds a certain threshold of memory consumption.

For example, if a node reaches 90% memory usage, an alert can be triggered to notify the administrator so that action can be taken before any disruption to the application or service occurs.
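
As a rough illustration, here is a minimal Python sketch using the official kubernetes client that compares each node's memory usage from the Metrics API against its allocatable memory and flags nodes above the 90% threshold mentioned above. The simplified quantity parsing and the in-cluster names are assumptions for the example; it requires metrics-server to be running.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
core = client.CoreV1Api()
metrics = client.CustomObjectsApi()

SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def to_bytes(quantity: str) -> float:
    # Simplified parser for memory quantities such as "3091676Ki";
    # it does not cover every Kubernetes quantity suffix.
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)

# Allocatable memory per node, from the core API.
allocatable = {
    node.metadata.name: to_bytes(node.status.allocatable["memory"])
    for node in core.list_node().items
}

# Current usage per node, from the Metrics API (requires metrics-server).
node_metrics = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
for item in node_metrics["items"]:
    name = item["metadata"]["name"]
    pct = 100 * to_bytes(item["usage"]["memory"]) / allocatable[name]
    if pct > 90:  # example alert threshold from the article
        print(f"WARNING: node {name} is at {pct:.1f}% memory usage")
```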

Some may argue that Kubernetes will handle high memory usage by evicting pods to reclaim resources and ensure there is enough memory available. However, eviction can lead to the deletion of critical pods that do not have multiple replicas, resulting in disruptions to the application or service.

Additionally, if there are no available nodes to schedule evicted pods, the cluster may need to be scaled up, which can be costly and may not be feasible.

To avoid these issues, it is important to monitor node memory usage and take action before it leads to eviction or disruptive scaling. By proactively addressing high memory usage, you can ensure the smooth operation of your Kubernetes cluster.

Monitoring High Node CPU Utilization

Monitoring the CPU utilization of each node in a Kubernetes cluster is important for ensuring the smooth operation of your application or service. Unlike memory, a Pod that exceeds its CPU limit is not killed or evicted; instead, its CPU usage is throttled.

However, some Pods may be sensitive to CPU throttling, leading to readiness probe failures, Pod restarts, and ultimately, a non-operational application or service.

It is important to distinguish between CPU and memory usage, as they are different resources that require different monitoring strategies. A Pod can enter a CPU-throttled state, but memory cannot be throttled: when a Pod exceeds its memory limit, it gets OOMKilled (Out of Memory killed).

By monitoring node CPU usage, administrators can anticipate and address high CPU usage before it leads to disruptions such as Pod restarts and non-operational applications or services.

It can be a hard lesson to learn, but monitoring this metric is key to ensuring the smooth operation of your Kubernetes cluster.
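
If you scrape node metrics with Prometheus and node-exporter (the stack suggested later in this article), a sketch like the following can flag nodes with sustained high CPU utilization. The Prometheus URL and the 80% threshold are placeholders.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder; point this at your Prometheus server
# Average per-node CPU utilization over the last 5 minutes (node-exporter metric).
QUERY = '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    cpu_pct = float(sample["value"][1])
    if cpu_pct > 80:  # example threshold
        print(f"WARNING: node {sample['metric']['instance']} CPU utilization is {cpu_pct:.1f}%")
```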

Node Not in Ready State

Monitoring the state of each node in a Kubernetes cluster is crucial for ensuring the smooth operation of your application or service.

The ideal state for a node is "Ready", which indicates that the node is healthy and able to accept and run Pods. However, it is normal for a node to be in a non-Ready state for brief periods of time due to node draining, upgrades, or other routine maintenance.

If a node remains in a non-Ready state for an extended period of time, it can indicate a problem with the node.

For example, an "Unknown" state means that the controller was unable to communicate with the node to identify its state, and this is definitely something worth investigating to ensure the cluster is running smoothly.

A non-Ready node also affects its workloads: the Pods running on that node can become unavailable, which may cause your application or service to degrade or go down.

This can be a major issue in a production environment, where availability is of the utmost importance. To avoid these issues, it is important to monitor the state of nodes and take action as soon as possible if a node is not in a ready state.
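
A quick check for nodes that are not Ready, sketched with the Python kubernetes client:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    # Each node reports a "Ready" condition; anything other than "True"
    # (including "Unknown") deserves investigation.
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    if ready.status != "True":
        print(f"Node {node.metadata.name} is not Ready: "
              f"status={ready.status}, reason={ready.reason}")
```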

📑
What is incident response lifecycle? Learn about NIST and SANS framework here!

Node Storage Monitoring

Monitoring the usage of disk space on nodes is an essential aspect of keeping track of the overall health of a node, alongside monitoring CPU and memory usage. It is important to note that running low on disk space can result in node malfunctions, even if most of the storage is defined and used outside of the nodes.

To avoid potential issues, it is crucial to take a proactive approach and regularly monitor the available disk space. Even if it seems trivial in some cases, it is always better to be safe than sorry.
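
Assuming node-exporter metrics in Prometheus, here is a sketch that flags nodes whose root filesystem is running low on space; the URL, mountpoint, and 10% threshold are assumptions.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
# Fraction of space still available on each node's root filesystem.
QUERY = ('node_filesystem_avail_bytes{mountpoint="/"} '
         '/ node_filesystem_size_bytes{mountpoint="/"}')

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
for sample in resp.json()["data"]["result"]:
    free_fraction = float(sample["value"][1])
    if free_fraction < 0.10:  # example: less than 10% free
        print(f"WARNING: {sample['metric']['instance']} has only "
              f"{free_fraction:.0%} disk space left on /")
```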

Network Traffic Tracking

Another critical metric that often gets overlooked is the monitoring of network traffic in and out of your Kubernetes nodes. By tracking network metrics, you can quickly identify potential issues and respond to alerts indicating a lack of traffic to and from a node.

This could indicate a serious problem with the node and needs to be dealt with to ensure the seamless operation of your Kubernetes cluster. Therefore, it is crucial to keep an eye on network traffic to address any potential problems that may arise.
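
A sketch that flags nodes with suspiciously little inbound traffic, again assuming node-exporter metrics in Prometheus (the URL and threshold are placeholders):

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
# Bytes received per second per node over the last 5 minutes, ignoring loopback.
QUERY = 'sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
for sample in resp.json()["data"]["result"]:
    bytes_per_sec = float(sample["value"][1])
    if bytes_per_sec < 1024:  # example threshold: almost no inbound traffic
        print(f"WARNING: {sample['metric']['instance']} receives only "
              f"{bytes_per_sec:.0f} B/s of traffic")
```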

Namespace Health: Monitor Expected Pods

Knowing the expected number of Pods for each namespace is crucial to ensure the health of your environment. In some environments, each namespace runs a fixed number of Pods, while in others, there might be a minimum number of Pods required.

Monitoring this metric can help you detect issues with the namespace's health if fewer Pods are running than expected. This metric is not limited to Pods only and can be applied to other Kubernetes resources, including services, ConfigMaps, and Secrets.
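
A sketch that compares the number of running Pods in each namespace against an expected minimum; the namespaces and expected counts are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Hypothetical expectations: minimum number of running Pods per namespace.
EXPECTED_MIN_PODS = {"payments": 4, "frontend": 6}

for namespace, minimum in EXPECTED_MIN_PODS.items():
    pods = core.list_namespaced_pod(namespace).items
    running = sum(1 for p in pods if p.status.phase == "Running")
    if running < minimum:
        print(f"Namespace {namespace}: only {running}/{minimum} expected Pods are running")
```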

Deployment Health: Track Running Replicas

Deployments are a critical component in Kubernetes that help manage the desired number of replicas. Monitoring the desired number of replicas versus the actual number of running replicas is a useful metric to track deployment health.

Any discrepancy between the desired and actual number of replicas indicates an issue that is preventing all the replicas from running correctly. Keeping track of this metric can help ensure that your deployments are running smoothly and without any issues.
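
A minimal sketch that lists Deployments whose available replicas fall short of the desired count:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    desired = dep.spec.replicas or 0
    available = dep.status.available_replicas or 0  # None when no replicas are available yet
    if available < desired:
        print(f"{dep.metadata.namespace}/{dep.metadata.name}: "
              f"{available}/{desired} replicas available")
```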

Ensure Optimal Scaling: Monitor Pod Creation and Deletion Rates

To keep up with the demands of your Deployment, it's crucial to monitor the rate of pod creation and deletion. These metrics offer valuable insight into how quickly your system is scaling and can help identify any issues that need to be addressed.

By tracking pod creation and deletion rates, you can pinpoint scaling problems and take action to rectify them. This might involve adjusting resource limits or troubleshooting and resolving underlying issues.
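
The creation side of this can be roughly approximated from the API by counting Pods created within a recent window, as in the sketch below; the 10-minute window is arbitrary, and since deleted Pods are no longer visible this way, deletion rates are better tracked from your monitoring system.

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

WINDOW = timedelta(minutes=10)  # arbitrary example window
cutoff = datetime.now(timezone.utc) - WINDOW

# Count Pods whose creation timestamp falls inside the window.
recent = [
    p for p in core.list_pod_for_all_namespaces().items
    if p.metadata.creation_timestamp and p.metadata.creation_timestamp > cutoff
]
print(f"{len(recent)} Pods created in the last {WINDOW}")
```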

Preventing OOMkill Events: The Importance of Setting Memory Limits for Pods

It's widely acknowledged that setting memory requests and limits for pods is best practice. Doing so ensures fair resource allocation, prevents out-of-memory issues, and guarantees each pod the minimum resources it needs.

However, memory limits are "hard limits," unlike CPU limits, which Kubernetes can throttle without killing the pod. When a pod exceeds its memory limits, it triggers an "OOMkill event," resulting in the pod's termination.

While it's not always necessary to monitor OOMkill events (especially if you're using mechanisms like VPA to manage memory usage), a sudden increase in these events could indicate a problem that needs attention.
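
A sketch that surfaces containers whose last termination was an OOMkill, using the Python kubernetes client:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        # last_state.terminated carries the reason for the most recent termination.
        term = cs.last_state.terminated if cs.last_state else None
        if term and term.reason == "OOMKilled":
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container {cs.name} was OOMKilled at {term.finished_at}")
```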

Prevent Performance Issues: Monitor CPU Throttling in Kubernetes Pods

CPU throttling is a potential issue that can impact the performance of your Kubernetes application. Although it does not kill or re-create Pods, CPU throttling can result in slower response times and dissatisfied customers.

By monitoring CPU throttling in your Pods, you can identify and address performance issues early.
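
Assuming cAdvisor metrics scraped by Prometheus, here is a sketch that reports containers spending a large share of their CPU periods throttled; the URL and the 25% threshold are placeholders.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
# Share of CPU periods in which each container was throttled (cAdvisor metrics).
QUERY = ('rate(container_cpu_cfs_throttled_periods_total[5m]) '
         '/ rate(container_cpu_cfs_periods_total[5m])')

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
for sample in resp.json()["data"]["result"]:
    ratio = float(sample["value"][1])
    if ratio > 0.25:  # example threshold: throttled in more than 25% of periods
        labels = sample["metric"]
        print(f"{labels.get('namespace')}/{labels.get('pod')} "
              f"container {labels.get('container')} throttled {ratio:.0%} of the time")
```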

🗒️
Learn more about Pagerduty alternatives here!

Ensure Application Availability: Track Readiness Probe Failures in Kubernetes

Kubernetes uses readiness probes to determine whether Pods are ready to receive traffic. When a readiness probe fails, the Pod is marked as not ready and is removed from its Service's endpoints, so it stops receiving traffic until the probe succeeds again.

By tracking readiness probe failures, you can ensure that your application is available and ready to handle traffic at all times.
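
Probe failures are recorded as "Unhealthy" events, so a sketch like this can list recent readiness probe failures with the Python kubernetes client:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Probe failures are recorded as events with reason "Unhealthy".
events = core.list_event_for_all_namespaces(field_selector="reason=Unhealthy")
for ev in events.items:
    if "Readiness probe failed" in (ev.message or ""):
        print(f"{ev.involved_object.namespace}/{ev.involved_object.name}: "
              f"{ev.message} (count={ev.count}, last={ev.last_timestamp})")
```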

Optimize Node Selection: Verify Pods Run on the Right Nodes in Kubernetes

Kubernetes provides several ways to ensure that Pods are running on specific nodes. Using node selectors, affinity rules, and anti-affinity rules, you can configure Pods to run on specific groups of nodes, known as node pools.

By monitoring node selection, you can ensure that Pods are running on the right nodes, optimizing performance and resource utilization.
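
A sketch that prints where Pods in a given namespace actually landed so you can compare against the intended node pool; the namespace and the node label used for reporting are hypothetical choices.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "payments"                           # hypothetical namespace
POOL_LABEL = "node.kubernetes.io/instance-type"  # example node label to report

# Map each node to the label value we care about.
nodes = {n.metadata.name: (n.metadata.labels or {}).get(POOL_LABEL, "<unlabelled>")
         for n in core.list_node().items}

for pod in core.list_namespaced_pod(NAMESPACE).items:
    print(f"{pod.metadata.name}: node={pod.spec.node_name} "
          f"pool={nodes.get(pod.spec.node_name)} selector={pod.spec.node_selector}")
```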

Avoiding Too Many Restarts and CrashLoopBackOff

Pod restarts can happen for various reasons and may be normal at times, but too many restarts in a short period of time can be a sign of trouble. This can be caused by issues like out-of-memory kills, probe failures, and application-level failures.

If left unchecked, repeated restarts can lead to a status called "CrashLoopBackOff", which can cause further problems. It's important to watch for frequent restarts within short periods of time, as well as for Pods stuck in the "CrashLoopBackOff" status.
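
A sketch that flags Pods with many restarts or containers stuck in CrashLoopBackOff; the restart threshold is only an example.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

RESTART_THRESHOLD = 5  # example threshold

for pod in core.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting if cs.state else None
        crash_looping = waiting is not None and waiting.reason == "CrashLoopBackOff"
        if cs.restart_count >= RESTART_THRESHOLD or crash_looping:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} container {cs.name}: "
                  f"restarts={cs.restart_count}"
                  f"{' (CrashLoopBackOff)' if crash_looping else ''}")
```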

Troubleshooting Pending Pods

In the lifecycle of a Pod, it can be in different states, one of them being "Pending", which means the Pod is waiting to be scheduled.

If a Pod remains in a Pending state for too long, it could mean something is wrong, such as the Pod not being configured to match any nodes or there being no available nodes to schedule the Pod.

To ensure that your service/application is running smoothly, it's important to attend to Pods in a Pending state until the issue is resolved.
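
A sketch that lists Pods stuck in Pending longer than a few minutes; the age threshold is arbitrary.

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

MAX_PENDING = timedelta(minutes=5)  # arbitrary example threshold
now = datetime.now(timezone.utc)

pending = core.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    age = now - pod.metadata.creation_timestamp
    if age > MAX_PENDING:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} pending for {age}")
```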

Identify and Remove Zombie Pods

To maintain a clean and organized environment, it is essential to keep an eye on the state of your Pods. Pods in an "Unknown" state or "Zombie Pods" can cause confusion and clutter, as they may appear to be running in the same namespace as their healthy counterparts. This metric helps identify Pods that were not successfully terminated and need to be removed.
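
A short sketch that lists Pods reporting an "Unknown" phase:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    # Pods whose node stopped reporting end up with an "Unknown" phase.
    if pod.status.phase == "Unknown":
        print(f"{pod.metadata.namespace}/{pod.metadata.name} is in Unknown state "
              f"on node {pod.spec.node_name}")
```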

Control Plane Level Metrics

For those who manage their Kubernetes cluster independently, these metrics are essential to monitor:

Investigating Pod Creation Latency

If Pods are taking an excessive amount of time to create and start running, it may indicate an issue with the Kubelet or API server.

To ensure that Pods are created and begin running within an acceptable timeframe, it is critical to monitor the latency of the Pod creation process.
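
If you scrape kubelet metrics with Prometheus, pod start latency can be tracked from the kubelet's kubelet_pod_start_duration_seconds histogram; the sketch below assumes that metric is available (it depends on the kubelet version) and uses a placeholder Prometheus URL.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
# 95th percentile pod start latency reported by the kubelets over the last 5 minutes.
QUERY = ('histogram_quantile(0.95, '
         'sum by (le) (rate(kubelet_pod_start_duration_seconds_bucket[5m])))')

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
result = resp.json()["data"]["result"]
if result:
    print(f"p95 pod start latency: {float(result[0]['value'][1]):.1f}s")
```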

Keeping an Eye on kubelet State

Kubelet is a vital component that plays a key role in ensuring the health and performance of your cluster. When issues arise with Kubelet, it can lead to problems such as Pod scheduling delays, slow Pod creation, and delayed Pod startup. As a result, it is essential to keep a close eye on the state of Kubelet and detect any issues as early as possible.
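
Assuming Prometheus scrapes the kubelets, a quick availability check looks like this; the job label depends entirely on your scrape configuration.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
QUERY = 'up{job="kubelet"}'          # job label depends on your scrape config

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
for sample in resp.json()["data"]["result"]:
    if float(sample["value"][1]) == 0:
        print(f"kubelet on {sample['metric'].get('instance')} is down or not being scraped")
```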

Monitoring kube-controller-manager State

While Kubelet is a crucial component of the Kubernetes control plane, it is not the only one. kube-controller-manager is responsible for managing a collection of controllers that reconcile tasks and ensure that the actual state of objects, such as ReplicaSets, Deployments, and PersistentVolumes, meets the desired state.

Therefore, it is just as important to monitor kube-controller-manager and ensure that it is functioning properly to maintain the overall health of your Kubernetes environment.
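
The same pattern works for the controller manager; its workqueue_depth metric is also a useful signal that reconciliation is falling behind. The job label and Prometheus URL are assumptions.

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder
QUERIES = {
    "availability": 'up{job="kube-controller-manager"}',  # job label depends on scrape config
    "queue depth": 'sum(workqueue_depth{job="kube-controller-manager"})',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    for sample in resp.json()["data"]["result"]:
        print(f"{name}: {sample['metric']} -> {sample['value'][1]}")
```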

Conclusion

Successfully managing a Kubernetes cluster requires attention to detail and ongoing monitoring. By understanding and monitoring the key metrics discussed in this article, you can optimize performance, ensure resource availability, and maintain reliable applications.

What is the Kubernetes CPU limit metric?

There is no single cluster-wide CPU limit; CPU limits are set per container in the Pod spec. For example, a container might be limited to 0.5 CPU (500 millicores) and 128MiB of memory. A container that exceeds its CPU limit is throttled rather than killed.

How to check Kubernetes metrics?

The simplest way to keep an eye on your Kubernetes cluster is to use Prometheus to gather metrics and store them in its time-series database (optionally backed by object storage such as S3, via tools like Thanos, for long-term retention), and Grafana to display and aggregate the data.

What Does Kubernetes Monitoring Mean? 

Kubernetes monitoring includes gathering metrics and logs from the cluster's many nodes, pods, and services and analyzing them to learn more about the cluster's behavior. CPU consumption, memory usage, network traffic, and disk usage are some of the important metrics that are frequently tracked in a Kubernetes cluster.

Which metrics affect a Kubernetes cluster's performance?

The primary Kubernetes metrics that affect the performance of the cluster are CPU, memory, network traffic, and disk consumption.

Why is monitoring node memory pressure important?

Monitoring node memory pressure is crucial because it allows administrators to proactively address high memory usage before it leads to evictions or disruptions in the application or service. 

How many nodes can a cluster of Kubernetes support?

Kubernetes is designed to support clusters that meet the following criteria:

  • Maximum of 110 pods per node.
  • Maximum of 5,000 nodes.
  • Maximum of 150,000 total pods.
  • Maximum of 300,000 total containers.

How does network traffic tracking benefit Kubernetes clusters?

Monitoring network traffic in and out of Kubernetes nodes helps administrators quickly identify potential issues and ensure seamless cluster operation. A lack of network traffic to and from a node may indicate a serious problem that requires attention.

What distinguishes Metric Server from Kube-State Metrics?

The Kubernetes metrics server offers insights into resource utilization within the cluster, such as CPU and memory usage. It's primarily used for scaling purposes. On the other hand, kube-state-metrics focuses on the overall health of Kubernetes objects in the cluster, including pod availability and node readiness.

How can I monitor my pod metrics?

Use the kubectl top command, which displays CPU and memory usage for nodes (kubectl top node) or pods (kubectl top pod, optionally with --containers for per-container figures). Note that it relies on the Kubernetes Metrics Server being installed in the cluster.