Managing a Kubernetes cluster requires a keen eye for detail and a deep understanding of its complex structure. To ensure smooth operation of your applications and optimal performance, it is vital to monitor a wide range of metrics across the different components of your cluster. In this article, we will discuss key metrics that can be used to monitor both self-managed and cloud-managed Kubernetes environments, helping you to keep your cluster running at its best.

Monitoring Node Memory Pressure

When managing a Kubernetes cluster, it is important to keep an eye on the memory usage of each individual node. One way to do this is by setting an alert for when a node exceeds a certain threshold of memory consumption. For example, if a node reaches 90% memory usage, an alert can notify the administrator to take action before the application or service is disrupted.
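The threshold check can be sketched as follows; the 90% figure and the sample node sizes are illustrative, not taken from a live cluster:

```python
# Minimal sketch of a node-memory alert; threshold and values are illustrative.
def memory_alert(usage_bytes: int, capacity_bytes: int, threshold: float = 0.9) -> bool:
    """Return True when a node's memory usage crosses the alert threshold."""
    return usage_bytes / capacity_bytes >= threshold

GIB = 1024 ** 3
assert memory_alert(15 * GIB, 16 * GIB)      # 93.75% usage -> alert
assert not memory_alert(8 * GIB, 16 * GIB)   # 50% usage -> no alert
```

In practice the usage and capacity figures would come from your metrics pipeline (e.g. a node exporter), not be hard-coded.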

Some may argue that Kubernetes will handle high memory usage by evicting pods to reclaim resources and ensure there is enough memory available. However, eviction can lead to the deletion of critical pods that do not have multiple replicas, resulting in disruptions to the application or service. Additionally, if there are no available nodes to schedule evicted pods, the cluster may need to be scaled up, which can be costly and may not be feasible.

To avoid these issues, it is important to monitor node memory usage and take action before it leads to eviction or disruptive scaling. By proactively addressing high memory usage, you can ensure the smooth operation of your Kubernetes cluster.

Role of Node CPU High Utilization

Monitoring the CPU utilization of each node in a Kubernetes cluster is important for ensuring the smooth operation of your application or service. Unlike memory, a Pod exceeding its CPU limit is not killed or evicted; instead, its containers are throttled. However, some Pods are sensitive to CPU throttling, which can lead to readiness probe failures, Pod restarts, and ultimately a non-operational application or service.

It is important to distinguish between CPU and memory because they are different resources that require different monitoring strategies. While a Pod can keep running in a throttled state when it hits its CPU limit, there is no equivalent for memory: when a Pod exceeds its memory limit, it is OOMKilled (Out Of Memory killed).
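The distinction can be summarized in a small sketch; the resource names and values are illustrative:

```python
# Hedged model of what happens when a container hits its limit:
# CPU over the limit is throttled; memory over the limit is OOMKilled.
def limit_breach_effect(resource: str, usage: float, limit: float) -> str:
    if usage <= limit:
        return "ok"
    return "throttled" if resource == "cpu" else "OOMKilled"

assert limit_breach_effect("cpu", 1.5, 1.0) == "throttled"
assert limit_breach_effect("memory", 600e6, 512e6) == "OOMKilled"
```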

By monitoring node CPU usage, administrators can anticipate and address high CPU usage before it leads to disruptions such as Pod restarts and non-operational applications or services. It's a hard lesson to learn but monitoring this metric is key to ensure the smooth operation of your Kubernetes cluster.

Ideal State Compromised: Node Not in Ready State

Monitoring the state of each node in a Kubernetes cluster is crucial for ensuring the smooth operation of your application or service. The ideal state for a node is "Ready" which indicates that the node is healthy and able to accept and run Pods. However, it is normal for a node to be in a non-Ready state for brief periods of time due to node draining, upgrades, or other routine maintenance.

If a node remains in a non-Ready state for an extended period of time, it can indicate a problem with the node. For example, an "Unknown" state means that the node controller was unable to communicate with the node to determine its state, which is definitely worth investigating to ensure the cluster is running smoothly.

A non-Ready node causes knock-on problems: the Pods running on it also become non-ready, which can make your application or service crash or become unavailable. This can be a major issue in a production environment, where availability is of the utmost importance. To avoid these problems, monitor the state of your nodes and act as soon as possible when a node is not Ready.
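The Ready condition can be read from a node's `status.conditions` list as reported by the Kubernetes API; the sample data below is fabricated for illustration:

```python
# Read the Ready condition from a node's status.conditions list.
def node_ready_status(conditions: list) -> str:
    """Return 'True', 'False', or 'Unknown' from the Ready condition."""
    for cond in conditions:
        if cond["type"] == "Ready":
            return cond["status"]
    return "Unknown"

conditions = [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "Ready", "status": "Unknown"},  # node controller lost contact
]
assert node_ready_status(conditions) == "Unknown"
```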

Node Storage Monitoring

Monitoring the usage of disk space on nodes is an essential aspect of keeping track of the overall health of a node, alongside monitoring CPU and memory usage. It is important to note that running low on disk space can result in node malfunctions, even if most of the storage is defined and used outside of the nodes. To avoid potential issues, it is crucial to take a proactive approach and regularly monitor the available disk space. Even if it seems trivial in some cases, it is always better to be safe than sorry.
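A minimal local free-space probe is sketched below; the 80% threshold is an assumption to tune per environment, and in a cluster you would read these figures from node metrics rather than the local filesystem:

```python
import shutil

def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def disk_alert(path: str = "/", threshold: float = 80.0) -> bool:
    """True when disk usage at `path` crosses the alert threshold."""
    return disk_usage_percent(path) >= threshold
```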

Network Traffic Tracking

Another critical metric that often gets overlooked is network traffic in and out of your Kubernetes nodes. By tracking network metrics, you can quickly identify potential issues and respond to alerts indicating a lack of traffic to or from a node, which could signal a serious problem with that node. Keeping an eye on network traffic lets you address such problems before they affect the operation of your cluster.
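A "no traffic" alert can be derived from two samples of a node's network byte counters; the counter values and the 1 B/s floor below are fabricated for illustration:

```python
def traffic_rate(prev_bytes: int, curr_bytes: int, interval_s: float) -> float:
    """Bytes per second between two counter samples."""
    return (curr_bytes - prev_bytes) / interval_s

def no_traffic_alert(rate_bps: float, floor_bps: float = 1.0) -> bool:
    """Flag a node whose traffic rate has fallen below an expected floor."""
    return rate_bps < floor_bps

# A counter that has not moved between scrapes should fire the alert.
assert no_traffic_alert(traffic_rate(1_000_000, 1_000_000, 30.0))
```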

Namespace Health: Monitor Expected Pods

Knowing the expected number of Pods for each namespace is crucial to ensuring the health of your environment. In some environments, each namespace runs a fixed number of Pods, while in others there is a minimum number of Pods required. Monitoring this metric helps you detect a problem with a namespace's health when fewer Pods are running than expected. The same approach is not limited to Pods and can be applied to other Kubernetes resources, including Services, ConfigMaps, and Secrets.

Deployment Health: Track Running Replicas

Deployments are a critical component in Kubernetes that help manage the desired number of replicas. Monitoring the desired number of replicas versus the actual number of running replicas is a useful metric to track deployment health. Any discrepancy between the desired and actual number of replicas indicates an issue that is preventing all the replicas from running correctly. Keeping track of this metric can help ensure that your deployments are running smoothly and without any issues.
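The desired-versus-running comparison can be read from a Deployment's status block; the sample dict below mirrors the real API field names (`replicas`, `readyReplicas`) but its values are fabricated:

```python
def replica_mismatch(status: dict) -> int:
    """Number of desired replicas that are not currently ready."""
    return status.get("replicas", 0) - status.get("readyReplicas", 0)

assert replica_mismatch({"replicas": 3, "readyReplicas": 2}) == 1  # one failing
assert replica_mismatch({"replicas": 3, "readyReplicas": 3}) == 0  # healthy
```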

Ensure Optimal Scaling: Monitor Pod Creation and Deletion Rates

To keep up with the demands of your Deployment, it's crucial to monitor the rate of pod creation and deletion. These metrics offer valuable insight into how quickly your system is scaling and can help identify any issues that need to be addressed.

By tracking pod creation and deletion rates, you can pinpoint scaling problems and take action to rectify them. This might involve adjusting resource limits or troubleshooting and resolving underlying issues.
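Creation and deletion rates can be computed from a list of timestamped events; the `(timestamp, kind)` event shape here is an assumption for illustration, not a Kubernetes API format:

```python
from collections import Counter

def churn_rates(events: list, window_s: float) -> dict:
    """events: (timestamp, 'created' | 'deleted'); returns events/sec by kind."""
    counts = Counter(kind for _, kind in events)
    return {kind: n / window_s for kind, n in counts.items()}

rates = churn_rates([(0, "created"), (10, "created"), (20, "deleted")], window_s=60.0)
assert abs(rates["created"] - 2 / 60) < 1e-9
```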

Preventing OOMkill Events: The Importance of Setting Memory Limits for Pods

It's widely acknowledged that setting memory requests and limits for pods is best practice. Doing so ensures fair resource allocation, helps prevent out-of-memory issues, and guarantees each pod a minimum amount of resources.

However, memory limits are hard limits: unlike CPU limits, where the container is throttled rather than killed, a pod that exceeds its memory limit triggers an "OOMkill event" and is terminated.
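An illustrative container `resources` stanza with an explicit memory limit might look like this; the sizes are assumptions, and in practice you would derive them from observed usage:

```yaml
resources:
  requests:
    memory: "256Mi"   # scheduler reserves at least this much
  limits:
    memory: "512Mi"   # hard limit: exceeding it triggers an OOMkill
```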

While it's not always necessary to monitor OOMkill events (especially if you're using mechanisms like VPA to manage memory usage), a sudden increase in these events could indicate a problem that needs attention.

Prevent Performance Issues: Monitor CPU Throttling in Kubernetes Pods

CPU throttling is a potential issue that can impact the performance of your Kubernetes application. Despite not killing or re-creating Pods, CPU throttling can result in slower response times and unsatisfied customers. By monitoring CPU throttling in your Pods, you can identify and address performance issues early.
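A throttling ratio can be computed from CFS period counters; the Prometheus metric names mentioned in the docstring are an assumption about a cAdvisor-based monitoring stack:

```python
def throttle_ratio(throttled_periods: int, total_periods: int) -> float:
    """Fraction of CFS scheduling periods in which the container was
    throttled (cf. container_cpu_cfs_throttled_periods_total /
    container_cpu_cfs_periods_total)."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods

assert throttle_ratio(25, 100) == 0.25  # throttled in a quarter of periods
```

A sustained ratio well above zero for a latency-sensitive Pod is usually a signal to raise its CPU limit or reduce its load.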

Ensure Application Availability: Track Readiness Probe Failures in Kubernetes

Kubernetes uses readiness probes to decide whether a Pod is ready to receive traffic. When a readiness probe fails, the Pod is marked not ready and removed from its Service's endpoints until the probe succeeds again; traffic is only routed to ready Pods. By tracking readiness probe failures, you can ensure that your application is available and ready to handle traffic at all times.
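A sketch of an HTTP readiness probe is shown below; the path, port, and timings are illustrative assumptions:

```yaml
readinessProbe:
  httpGet:
    path: /healthz      # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3   # marked not-ready after three consecutive failures
```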

Optimize Node Selection: Verify Pods Run on the Right Nodes in Kubernetes

Kubernetes provides several ways to ensure that Pods are running on specific nodes. Using node selectors, affinity rules, and anti-affinity rules, you can configure Pods to run on specific groups of nodes, known as node pools. By monitoring node selection, you can ensure that Pods are running on the right nodes, optimizing performance and resource utilization.
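The core of `nodeSelector` semantics is simple label matching: a Pod fits a node when every selector label is present on the node with the same value. The labels below are fabricated for illustration:

```python
def matches_node(node_labels: dict, node_selector: dict) -> bool:
    """True when every selector key/value pair appears in the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

node = {"kubernetes.io/arch": "amd64", "pool": "high-memory"}
assert matches_node(node, {"pool": "high-memory"})
assert not matches_node(node, {"pool": "gpu"})
```

Affinity and anti-affinity rules add richer operators on top of this, but the same label-matching idea underpins them.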

Avoiding Too Many Restarts and CrashLoopBackOff

Pod restarts can happen for various reasons and may be normal at times, but too many restarts in a short period of time can be a sign of trouble. Common causes include out-of-memory kills, probe failures, and application-level failures. If left unchecked, repeated restarts can land a Pod in the "CrashLoopBackOff" status, which causes further problems. It's important to track frequent restarts over short periods, as well as Pods stuck in "CrashLoopBackOff".
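A restart-frequency check might look like this; the 10-minute window and the limit of three restarts are assumptions to tune per workload:

```python
def too_many_restarts(restart_times: list, window_s: float = 600.0,
                      max_restarts: int = 3) -> bool:
    """True when more than max_restarts occur within window_s of the
    most recent restart."""
    if not restart_times:
        return False
    latest = restart_times[-1]
    recent = [t for t in restart_times if latest - t <= window_s]
    return len(recent) > max_restarts

assert too_many_restarts([0, 100, 200, 300, 400])   # 5 restarts in ~7 minutes
assert not too_many_restarts([0, 5000, 10000])      # spread over hours
```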

Troubleshooting Pending Pods

In the lifecycle of a Pod, it can be in different states, one of them being "Pending" which means the Pod is waiting to be scheduled. If a Pod remains in a Pending state for too long, it could mean something is wrong, such as the Pod not being configured to match any nodes or there being no available nodes to schedule the Pod. To ensure that your service/application is running smoothly, it's important to attend to Pods in a Pending state until the issue is resolved.
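A "stuck Pending" check needs only the Pod's phase and how long it has waited; the 5-minute grace period below is an assumption:

```python
def stuck_pending(phase: str, created_at: float, now: float,
                  grace_s: float = 300.0) -> bool:
    """True when a Pod has been Pending for longer than the grace period."""
    return phase == "Pending" and (now - created_at) > grace_s

assert stuck_pending("Pending", created_at=0.0, now=400.0)    # 400 s > 5 min
assert not stuck_pending("Running", created_at=0.0, now=400.0)
```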

Identify and Remove Zombie Pods

To maintain a clean and organized environment, it is essential to keep an eye on the state of your Pods. Pods in an "Unknown" state or "Zombie Pods" can cause confusion and clutter, as they may appear to be running in the same namespace as their healthy counterparts. This metric helps identify Pods that were not successfully terminated and need to be removed.

Control Plane Level

For those who manage their Kubernetes cluster independently, these metrics are essential to monitor:

Investigating Pod Creation Latency

If Pods are taking an excessive amount of time to create and start running, it may indicate an issue with the Kubelet or API server. To ensure that Pods are created and begin running within an acceptable timeframe, it is critical to monitor the latency of the Pod creation process.
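Startup latency is just the gap between a Pod's creation and the moment it becomes Ready; the timestamps and the 30-second threshold below are fabricated for illustration:

```python
def startup_latency_s(created_ts: float, ready_ts: float) -> float:
    """Seconds from Pod creation to the Pod reporting Ready."""
    return ready_ts - created_ts

def latency_alert(latencies: list, threshold_s: float = 30.0) -> bool:
    """Alert when the slowest recent Pod start exceeds the threshold."""
    return max(latencies) > threshold_s

assert latency_alert([startup_latency_s(0.0, 45.0), 4.0])  # one Pod took 45 s
assert not latency_alert([4.0, 12.0])
```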

Keeping an Eye on kubelet State

Kubelet is a vital component that plays a key role in ensuring the health and performance of your cluster. When issues arise with Kubelet, it can lead to problems such as Pod scheduling delays, slow Pod creation, and delayed Pod startup. As a result, it is essential to keep a close eye on the state of Kubelet and detect any issues as early as possible.

Monitoring kube-controller-manager State

While the kubelet is crucial to cluster operation, it is not the only component to watch. kube-controller-manager is a core control plane component responsible for running a collection of controllers that reconcile the actual state of objects, such as ReplicaSets, Deployments, and PersistentVolumes, with their desired state. It is therefore just as important to monitor kube-controller-manager and ensure that it is functioning properly to maintain the overall health of your Kubernetes environment.