Did you recently update your Kubernetes certificates manually, and now your pods are stuck in the ContainerCreating state while throwing Flannel certificate verification errors? If so, this blog post will show you exactly how to find and fix the issue. Let's first look at the symptoms to verify the root cause.
1. Kubernetes pods are failing to progress to Running state

Do you see your Kubernetes pods stuck in the ContainerCreating state, never progressing to Running, as below?
$ kubectl get pods -o wide
NAME           READY   STATUS              RESTARTS   AGE   IP       NODE          NOMINATED NODE   READINESS GATES
<pod_name_1>   0/1     ContainerCreating   0          56m   <none>   <node_ip_1>   <none>           <none>
<pod_name_2>   0/1     ContainerCreating   0          7s    <none>   <node_ip_2>   <none>           <none>
<pod_name_3>   0/1     ContainerCreating   0          56m   <none>   <node_ip_3>   <none>           <none>
What does this mean?
Kubernetes pods get stuck in the ContainerCreating state when the Kubernetes services fail to satisfy the prerequisites for running the pod. This can be due to a network, connectivity, or Kubernetes service issue. To find a clue, the next obvious step is to check the startup events of the pod.
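As a quick tip, you do not have to describe pods one by one: the cluster-wide event stream, sorted by time, often surfaces the same sandbox errors in a single view (both flags below are standard kubectl options):

```
$ kubectl get events --all-namespaces --sort-by='.lastTimestamp'
```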
2. Kubernetes pods are failing due to a missing /run/flannel/subnet.env file
When you run the kubectl describe pod <pod_name> command to find a clue about the pod stuck in ContainerCreating, you will see the below error messages under Events:
$ kubectl describe pod <pod_name>
...
Events:
  Type     Reason                  Age    From                  Message
  ----     ------                  ----   ----                  -------
  Normal   Scheduled               3m50s  default-scheduler     Successfully assigned default/<pod_name> to <node_name>
  Warning  FailedCreatePodSandBox  3m48s  kubelet, <node_name>  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "chfehyuh9vg328dsd82u47897dguhru37410248df8v93429jjdfhtvkjczn12k9" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  3m46s  kubelet, <node_name>  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cy8j2iqlw21opcnbg3stzmk4qo093hdd8hdj9wd9494jiw3g7zxqwijskw3q5jgq" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  3m44s  kubelet, <node_name>  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cqfeowjc9ej2jdkve4kkdow9lmv4lpspqj6cnx8ejekn2m5c9cskopejt53mkvk8" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
...
What does this mean?
The /run/flannel/subnet.env file is created on each node by the Flannel pod running there when it sets up its subnet, and the parameters in that file show you the configuration used by that Flannel pod.
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true
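If you want to check this file on every node quickly, here is a minimal sketch (the function name is ours; the variable list matches the four keys shown above) that verifies a subnet.env file exists and is complete:

```shell
#!/bin/sh
# check_subnet_env: verify that a Flannel subnet.env file exists and defines
# the four variables Flannel normally writes. The helper name is illustrative;
# run it on each node (e.g. over ssh) against /run/flannel/subnet.env.
check_subnet_env() {
  f="$1"
  if [ ! -f "$f" ]; then
    echo "MISSING: $f"
    return 1
  fi
  # shellcheck disable=SC1090
  . "$f"
  for v in FLANNEL_NETWORK FLANNEL_SUBNET FLANNEL_MTU FLANNEL_IPMASQ; do
    eval "val=\$$v"
    if [ -z "$val" ]; then
      echo "INCOMPLETE: $v not set in $f"
      return 1
    fi
  done
  echo "OK: $f looks complete"
}
```

If check_subnet_env prints MISSING on a node, the Flannel pod on that node never got far enough to write the file, which matches the sandbox errors above.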
Since the kubectl describe pod <pod_name> output complains that this file is missing, there is a good chance that the Flannel pods are not getting created or are throwing errors. Therefore, to find our next clue, we need to check the pods running in the kube-system namespace.
3. Flannel & CoreDNS pods in the kube-system namespace are failing due to certificate issues
If the Flannel & CoreDNS pods are in either the CrashLoopBackOff or Error state, it is likely that the issue lies in our Kubernetes configuration, network, or sometimes hardware.
$ kubectl get all -n kube-system
NAME                              READY   STATUS             RESTARTS   AGE
...
pod/coredns-XXXXXXXXX-XXXXX       0/1     Error              47         264d
pod/coredns-YYYYYYYYY-YYYYY      0/1     Error              11         44m
pod/coredns-ZZZZZZZZZ-ZZZZZ      0/1     Evicted            0          360d
...
pod/kube-flannel-ds-amd64-XXXXX   0/1     CrashLoopBackOff   16         365d
pod/kube-flannel-ds-amd64-YYYYY   0/1     CrashLoopBackOff   13         259d
pod/kube-flannel-ds-amd64-ZZZZZ   0/1     CrashLoopBackOff   21         365d
...
At this stage, check why the Flannel & CoreDNS pods are failing by looking at their logs.
$ kubectl logs pod/coredns-fb8b8dccf-cb4dh -n kube-system
E0606 12:01:33.904148       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://<subnet_ip_2>:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>
E0606 12:01:33.904148       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://<subnet_ip_2>:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-XXXXXXXXX-XXXXX.unknownuser.log.ERROR.20200606-120133.1: no such file or directory
$ kubectl logs pod/kube-flannel-ds-amd64-XXXXX -n kube-system
I0606 12:50:15.602728       1 main.go:514] Determining IP address of default interface
I0606 12:50:15.603419       1 main.go:527] Using interface with name ensXXX and address <node_ip>
I0606 12:50:15.603463       1 main.go:544] Defaulting external address to interface address (<node_ip>)
E0606 12:50:15.705089       1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-XXXXX': Get https://<subnet_ip_2>:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-XXXXX: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>
What does this mean?
x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2> means that the certificates used by your Kubernetes cluster (the API server and other components) do not match what the Flannel and CoreDNS pods expect. This can happen when the cluster has been updated with the wrong certificates.
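You can confirm the mismatch yourself by listing the Subject Alternative Names baked into the API server certificate and comparing them with the IP in the error. A small sketch (the default path is the usual kubeadm location and is an assumption about your setup):

```shell
#!/bin/sh
# cert_sans: print the Subject Alternative Names of a certificate so you can
# compare them with the IP in the "x509: certificate is valid for ..." error.
# /etc/kubernetes/pki/apiserver.crt is the usual kubeadm path (an assumption).
cert_sans() {
  openssl x509 -in "${1:-/etc/kubernetes/pki/apiserver.crt}" -noout -text \
    | grep -A1 "Subject Alternative Name"
}
```

If the <subnet_ip_2> from the error does not appear in the printed list, the certificates were generated with the wrong configuration.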
How do the wrong certificates get created?
Once we set up our cluster, we usually do not touch the certificates or keys because Kubernetes manages them for us. But on v1.14 and earlier, you have to manually renew the certificates and keys every year. Also, regardless of the version, there can be exceptional situations where you have manually updated your certificates.
In such cases, there is a high chance that you used a configuration different from the original one used to create the cluster (mostly by mistake, or because you simply forgot how you did it back then). Kubernetes would then generate certificates for the wrong subnet IPs you supplied, or fall back to default IPs, and the other pods cannot run with that configuration.
What is the fix?
Some discussions suggest rebooting the nodes and running the below commands to recreate the Flannel pods, but that alone won't work, since this is a certificate issue.
# This won't work for a certificate issue
$ kubectl apply -f <k8_specs_directory>/kube-flannel.yml
$ kubectl apply -f <k8_specs_directory>/kube-flannel-rbac.yml

# This won't work for a certificate issue
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The correct fix is to re-create your certificates and keys with the original configuration used to set up your cluster; otherwise your pods won't pick them up. For instance, your commands will have the below format (note the --config flag passing the correct configuration file, as in your original kubeadm init ... commands). For the exact steps to manually generate certificates, please see our previous blog post.
# The correct format of the certificate creation commands
# ** Complete steps: https://platformengineer.com/fix-kubernetes-bootstrap-client-certificate-expired-error/
$ sudo kubeadm init phase certs all --config /<k8_specs_directory>/kubeadm_config.yaml
$ sudo kubeadm init phase kubeconfig all --config /<k8_specs_directory>/kubeadm_config.yaml
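For reference, the kubeadm_config.yaml passed via --config is the file that holds your original cluster spec. Below is a hedged, illustrative example for a v1.14-era cluster; every value (version, endpoint, subnets, SANs) must come from your original setup, not from this sketch:

```yaml
# Illustrative only -- all values must match the configuration used when the
# cluster was first created, or the regenerated certificates will not match.
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.14.1
controlPlaneEndpoint: "<node_ip>:6443"
networking:
  podSubnet: "10.244.0.0/16"      # should match FLANNEL_NETWORK shown earlier
  serviceSubnet: "10.96.0.0/12"
apiServer:
  certSANs:
    - "<subnet_ip_1>"
    - "<node_ip>"
```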
Using the original ClusterConfiguration spec with the certificate creation commands will fix this issue, and all your pods will start to run without any errors.
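One hedged final note from our experience: after regenerating the certificates, the old pods may keep crash-looping with their cached credentials until they are restarted. Restarting the kubelet and deleting the failing pods forces them to be recreated (the labels below are the ones the standard Flannel and CoreDNS manifests use; verify yours with kubectl get pods --show-labels):

```
$ sudo systemctl restart kubelet
$ kubectl delete pod -n kube-system -l app=flannel
$ kubectl delete pod -n kube-system -l k8s-app=kube-dns
```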
✅ Tested OS's: RHEL 7+, CentOS 7+, Ubuntu 18.04+, Debian 8+
✅ Tested Gear: Cloud (AWS EC2), On-Prem (Bare Metal)