When your pod can’t be scheduled
A Kubernetes horror story
I’ve been getting to know Kubernetes, trial-by-fire style. Recently, I had a close encounter with a Kubernetes-based video analytics system. In this system, customer IP cameras are bound to dedicated camera-reading pods. The lifetime of a camera reader should match the business hours of the corresponding customer location. Some of these camera readers were not starting on time.
In this architecture, there is a main service (literally called main-service) that orchestrates location-based job scheduling. Another service (called k8s-manager) starts and stops camera reader pods on demand. The main-service makes an API call to k8s-manager when it’s time to launch (or destroy) a camera reader.
At first, I suspected an issue in the timezone-handling logic, since the problem was concentrated at West Coast locations. I studied the logs and added more log events. It appeared that the main-service was initiating scheduling on time.
But I noticed in the main-service logs that some of the API calls to k8s-manager were failing with connection errors.
Next, I examined the k8s-manager monitoring dashboard and noticed a curious sawtooth pattern in its memory usage.
I found a cry for help in the k8s-manager log.
Inspecting the deployment configuration, I discovered this stanza:
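The actual stanza isn’t reproduced here; a representative resources section, consistent with the 400-mebibyte limit discussed below, would look something like this (the requests and CPU values are hypothetical):

```yaml
resources:
  requests:
    memory: "200Mi"   # a node must have this much free memory to schedule the pod
    cpu: "250m"       # hypothetical value
  limits:
    memory: "400Mi"   # the ceiling the pod kept breaching
    cpu: "500m"       # hypothetical value
```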
The limits constraint defines an upper bound on how much memory/CPU a given pod may consume. The related requests constraint defines how much free memory/CPU a node must have for the pod to be scheduled on it. Kubernetes relies on these user-supplied parameters to ensure pods get the resources they need without exhausting a node’s resources.
In my scenario, the singleton k8s-manager instance was being killed every few minutes whenever its memory usage breached the 400-mebibyte limit. With this job-launching service repeatedly down, the system was unable to start customer jobs promptly.
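A container killed this way leaves a telltale trace in its pod status, visible via `kubectl describe pod` or `kubectl get pod -o yaml`. The field names below follow the real Kubernetes API shape; the values are illustrative, not the actual k8s-manager output:

```yaml
# Excerpt of a container status for an OOM-killed container
lastState:
  terminated:
    exitCode: 137     # 128 + SIGKILL(9)
    reason: OOMKilled
restartCount: 42      # climbs with every kill; illustrative value
```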
Relaxing the memory constraint restored prompt scheduling.
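In manifest terms, the fix was simply raising the ceiling (the new value below is illustrative, not the one actually deployed):

```yaml
resources:
  limits:
    memory: "800Mi"   # illustrative; raised from 400Mi
```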
At some point in the past, 400 mebibytes was fully sufficient for this pod. Over time, the pod’s workload grew until it exhausted its appointed memory. Perhaps there’s an approach to time-based scheduling of pods that avoids making k8s-manager a single point of failure. I’m open to suggestions.
Another concern: this problem persisted for over two weeks before anyone noticed. A critical Kubernetes pod was in a continuous restart loop without any alarm bells going off. The existing Prometheus/Grafana setup clearly had a gap. With some quick googling, I found a project called kube-slack that publishes Slack notifications when a pod fails. It works well enough, though in this system it produces a lot of noise, and kube-slack appears to be deprecated. Perhaps there’s a more canonical Kubernetes approach to setting up basic alarms on system stability. Perhaps you can tell me about it.
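One common approach, assuming kube-state-metrics is already scraped by the existing Prometheus: alert on the container restart counter. A sketch of such a rule (the alert name, threshold, and windows are my own choices):

```yaml
groups:
  - name: pod-stability
    rules:
      - alert: PodRestartingFrequently
        # kube_pod_container_status_restarts_total comes from kube-state-metrics
        expr: increase(kube_pod_container_status_restarts_total[30m]) > 3
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: >-
            Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted
            more than 3 times in the last 30 minutes
```

Alertmanager can then route this alert to Slack, which would cover the same ground as kube-slack with standard tooling.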
The moral of this story? Kubernetes is not a cuddly teddy bear; it’s a powerful, sophisticated beast. Give it the respect it deserves.