Did you recently update your Kubernetes certificates manually, and now your pods are stuck at the ContainerCreating stage while throwing Flannel certificate verification errors?

If so, this blog post will show you exactly how to pinpoint the issue and fix it. Let’s first walk through the symptoms to verify the root cause.

1. Kubernetes pods are failing to progress to Running state

Do you see your Kubernetes pods stuck in the ContainerCreating state, never progressing to the Running state, as shown below?

$ kubectl get pods -o wide
NAME                READY   STATUS              RESTARTS   AGE   IP       NODE          NOMINATED NODE   READINESS GATES
<pod_name_1>        0/1     ContainerCreating   0          56m   <none>   <node_ip_1>   <none>           <none>
<pod_name_2>        0/1     ContainerCreating   0          7s    <none>   <node_ip_2>   <none>           <none>
<pod_name_3>        0/1     ContainerCreating   0          56m   <none>   <node_ip_3>   <none>           <none>

What does this mean?

Kubernetes pods get stuck in the ContainerCreating state when the cluster fails to satisfy the prerequisites for running the pod. This can be caused by a network, connectivity, or Kubernetes service issue. To find a clue, the next obvious step is to check the pod’s startup events.
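
If you want a quick overview before describing each pod, you can also list the recent events in the namespace (an optional shortcut; the default namespace below is just an example):

// Optional: list recent events sorted by creation time (namespace is an example)
$ kubectl get events -n default --sort-by=.metadata.creationTimestamp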

2. Kubernetes pods are failing due to missing /run/flannel/subnet.env file

When you run the kubectl describe pod <pod_name> command to investigate a pod stuck in the ContainerCreating state, you will see error messages like the following under the Events section.

$ kubectl describe pod <pod_name>
...
Events:
  Type     Reason                  Age       From                    Message
  ----     ------                  ----      ----                    -------
  Normal   Scheduled               3m50s     default-scheduler       Successfully assigned default/<pod_name> to <node_name>
  Warning  FailedCreatePodSandBox  3m48s     kubelet, <node_name>    Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "chfehyuh9vg328dsd82u47897dguhru37410248df8v93429jjdfhtvkjczn12k9" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  3m46s     kubelet, <node_name>    Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cy8j2iqlw21opcnbg3stzmk4qo093hdd8hdj9wd9494jiw3g7zxqwijskw3q5jgq" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  3m44s     kubelet, <node_name>    Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "cqfeowjc9ej2jdkve4kkdow9lmv4lpspqj6cnx8ejekn2m5c9cskopejt53mkvk8" network for pod "<pod_name>": NetworkPlugin cni failed to set up pod "<pod_name>_default" network: open /run/flannel/subnet.env: no such file or directory
  ...

What does this mean?

The /run/flannel/subnet.env file is created on each node by the Flannel pod running there when it sets up its subnet, and the parameters in that file show you the configuration that Flannel pod is using.

$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true
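
If you have SSH access to your nodes, a quick sketch like the one below (assuming the nodes are reachable by their kubectl node names) confirms which nodes are missing the file:

// Check every node for /run/flannel/subnet.env
// (assumes SSH access to each node by its Kubernetes node name)
$ for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo -n "$node: "
    ssh "$node" 'test -f /run/flannel/subnet.env && echo present || echo MISSING'
  done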

If the kubectl describe pod <pod_name> output complains that this file is missing, there is a good chance that the Flannel pods are not getting created or are throwing errors themselves. Therefore, to find our next clue, we need to check the pods running in the kube-system namespace.

3. Flannel & CoreDNS pods in kube-system namespace are failing due to certificate issues

If the Flannel and CoreDNS pods are in either the CrashLoopBackOff or Error state, the issue most likely lies in our Kubernetes configuration, the network, or occasionally the hardware.

$ kubectl get all -n kube-system
NAME                                            READY   STATUS              RESTARTS   AGE
...
pod/coredns-XXXXXXXXX-XXXXX                     0/1     Error               47         264d
pod/coredns-YYYYYYYYY-YYYYY                     0/1     Error               11         44m
pod/coredns-ZZZZZZZZZ-ZZZZZ                     0/1     Evicted             0          360d
...
pod/kube-flannel-ds-amd64-XXXXX                 0/1     CrashLoopBackOff    16         365d
pod/kube-flannel-ds-amd64-YYYYY                 0/1     CrashLoopBackOff    13         259d
pod/kube-flannel-ds-amd64-ZZZZZ                 0/1     CrashLoopBackOff    21         365d
...

At this stage, check the logs to see why the Flannel and CoreDNS pods are failing.

$ kubectl logs pod/coredns-fb8b8dccf-cb4dh -n kube-system
E0606 12:01:33.904148       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://<subnet_ip_2>:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>
E0606 12:01:33.904148       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://<subnet_ip_2>:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-XXXXXXXXX-XXXXX.unknownuser.log.ERROR.20200606-120133.1: no such file or directory

$ kubectl logs pod/kube-flannel-ds-amd64-XXXXX -n kube-system
I0606 12:50:15.602728       1 main.go:514] Determining IP address of default interface
I0606 12:50:15.603419       1 main.go:527] Using interface with name ensXXX and address <node_ip>
I0606 12:50:15.603463       1 main.go:544] Defaulting external address to interface address (<node_ip>)
E0606 12:50:15.705089       1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-XXXXX': Get https://<subnet_ip_2>:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-XXXXX: x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2>

What does this mean?

The error x509: certificate is valid for <subnet_ip_1>, <node_ip>, not <subnet_ip_2> means that the certificate presented by the Kubernetes API server does not cover the IP address (<subnet_ip_2>) that the Flannel and CoreDNS pods use to reach it. This typically happens when the cluster certificates have been regenerated with the wrong configuration.
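
You can confirm the mismatch yourself by inspecting the Subject Alternative Names in the API server certificate on a control-plane node. The path below is the kubeadm default; adjust it if your cluster keeps its certificates elsewhere:

// Inspect the SANs of the current API server certificate (kubeadm default path)
$ sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'

If <subnet_ip_2> from the error message is not in that list, the certificate was indeed generated with the wrong configuration.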

How do the wrong certificates get created?

Once we set up our cluster, we usually never touch the certificates or keys because Kubernetes manages them for us. However, on v1.14 and earlier you have to renew the certificates and keys manually every year, and regardless of version there can be exceptional situations where you end up updating the certificates by hand.

In such cases, there is a high chance that you use a configuration different from the one originally used to create the cluster (usually by mistake, or simply because you forgot how you did it back then). Kubernetes then generates certificates for the wrong subnet IPs you supplied, or falls back to default IPs, and other pods can no longer run with that configuration.
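
For context, the values that end up in those certificates come from the kubeadm configuration. Below is a hypothetical, minimal ClusterConfiguration; the subnets and IPs are placeholders and the apiVersion depends on your kubeadm version, so treat it as an illustration rather than your actual file:

// Hypothetical example only: your original values will differ
$ cat /<k8_specs_directory>/kubeadm_config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: "10.244.0.0/16"      # should match FLANNEL_NETWORK
  serviceSubnet: "10.96.0.0/12"   # the first IP of this range becomes a certificate SAN
apiServer:
  certSANs:
  - "<node_ip>"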

What is the fix?

Some discussions suggest rebooting the nodes and running the commands below to recreate the Flannel pods, but that alone won’t work here, because the root cause is a certificate mismatch.

// This won't work for a certificate issue
$ kubectl apply -f <k8_specs_directory>/kube-flannel.yml
$ kubectl apply -f <k8_specs_directory>/kube-flannel-rbac.yml

// This won't work for a certificate issue
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

The correct fix is to re-create your certificates and keys with the original configuration used to set up your cluster; otherwise your pods won’t recover. For instance, your commands will have the format below (note the --config flag passing the correct ClusterConfiguration spec to the kubeadm init ... commands). For the exact steps to manually regenerate the certificates, please see our previous blog post.

// The correct format of certificate creation commands
// ** Complete steps: https://platformengineer.com/fix-kubernetes-bootstrap-client-certificate-expired-error/
$ sudo kubeadm init phase certs all --config /<k8_specs_directory>/kubeadm_config.yaml
$ sudo kubeadm init phase kubeconfig all --config /<k8_specs_directory>/kubeadm_config.yaml

Using the original ClusterConfiguration spec with the certificate creation commands will fix this issue, and all your pods will start running without errors.
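
After regenerating the certificates, the failing pods usually need to be recreated so they pick up the change. A short sketch is below; the label selectors assume the upstream Flannel and CoreDNS manifests, so double-check them against your own deployments:

// Recreate the failing pods so they pick up the regenerated certificates
// (label selectors assume the upstream Flannel/CoreDNS manifests)
$ kubectl delete pod -n kube-system -l app=flannel
$ kubectl delete pod -n kube-system -l k8s-app=kube-dns

// Verify everything comes back to Running
$ kubectl get pods -n kube-system
$ kubectl get pods -o wide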


✅ Tested OSes : RHEL 7+, CentOS 7+, Ubuntu 18.04+, Debian 8+
✅ Tested Gear : Cloud (AWS EC2), On-Prem (Bare Metal)

👉 Any questions? Please comment below.

