13.3 Understanding the operation of the ReplicaSet controller
In the previous sections, you saw how changing the replicas and template fields within the ReplicaSet object causes Kubernetes to do something with the Pods that belong to the ReplicaSet. The Kubernetes component that performs these actions is called the controller. Most of the object types you create through your cluster’s API have an associated controller. For example, in the previous chapter you learned about the Ingress controller, which manages Ingress objects. There’s also the Endpoints controller for the Endpoints objects, the Namespace controller for the Namespace objects, and so on.
Not surprisingly, ReplicaSets are managed by the ReplicaSet controller. Any change you make to a ReplicaSet object is detected and processed by this controller. When you scale the ReplicaSet, the controller is the one that creates or deletes the Pods. Each time it does this, it also creates an Event object that informs you of what it’s done. As you learned in chapter 4, you can see the events associated with an object at the bottom of the output of the kubectl describe command, as shown in the next snippet, or by using the kubectl get events command to specifically list the Event objects.
$ kubectl describe rs kiada
...
Events:
Type     Reason             Age   From                    Message
----     ------             ----  ----                    -------
Normal   SuccessfulDelete   34m   replicaset-controller   Deleted pod: kiada-k9hn2
Normal   SuccessfulCreate   30m   replicaset-controller   Created pod: kiada-dl7vz
Normal   SuccessfulCreate   30m   replicaset-controller   Created pod: kiada-dn9fb
Normal   SuccessfulCreate   16m   replicaset-controller   Created pod: kiada-z9dp2
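If you only want the Events and not the rest of the description, the kubectl get events command accepts a field selector, so a command like the following should limit the output to the Events that reference the kiada ReplicaSet. Keep in mind that Events are short-lived; by default they’re deleted after an hour, so the list you see depends on what the controller has done recently:
$ kubectl get events --field-selector involvedObject.kind=ReplicaSet,involvedObject.name=kiada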
To understand ReplicaSets, you must understand the operation of their controller.
13.3.1 Introducing the reconciliation control loop
As shown in the following figure, a controller observes the state of both the owner and the dependent objects. After each change in this state, the controller compares the state of the dependent objects with the desired state specified in the owning object. If these two states differ, the controller makes changes to the dependent object(s) to reconcile the two states. This is the so-called reconciliation control loop that you’ll find in all controllers.
Figure 13.5 A controller's reconciliation control loop
The ReplicaSet controller’s reconciliation control loop consists of observing ReplicaSets and Pods. Each time a ReplicaSet or Pod changes, the controller checks the list of Pods associated with the ReplicaSet and ensures that the actual number of Pods matches the desired number specified in the ReplicaSet. If the actual number of Pods is lower than the desired number, it creates new replicas from the Pod template. If the number of Pods is higher than desired, it deletes the excess replicas. The flowchart in the following figure explains the entire process.
Figure 13.6 The ReplicaSet controller’s reconciliation loop
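To make the comparison step in the flowchart more concrete, the following is a rough shell approximation of the check the controller performs on each pass of the loop. This is only an illustration of the logic, not how the controller is actually implemented; the real controller talks to the API server directly and uses the label selector stored in the ReplicaSet object rather than the hard-coded labels shown here:
$ desired=$(kubectl get rs kiada -o jsonpath='{.spec.replicas}')
$ actual=$(kubectl get pods -l app=kiada,rel=stable --no-headers | wc -l)
$ if [ "$actual" -lt "$desired" ]; then
    echo "Too few Pods; create $((desired - actual)) Pod(s) from the template"
  elif [ "$actual" -gt "$desired" ]; then
    echo "Too many Pods; delete $((actual - desired)) Pod(s)"
  else
    echo "Actual state matches desired state; nothing to do"
  fi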
13.3.2 Understanding how the ReplicaSet controller reacts to Pod changes
You’ve seen how the controller responds immediately to changes in the ReplicaSet’s replicas field. However, that’s not the only way the desired number and the actual number of Pods can differ. What if no one touches the ReplicaSet, but the actual number of Pods changes? The ReplicaSet controller’s job is to make sure that the number of Pods always matches the specified number. Therefore, it should also come into action in this case.
Deleting a Pod managed by a ReplicaSet
Let’s look at what happens if you delete one of the Pods managed by the ReplicaSet. Select one and delete it with kubectl delete:
$ kubectl delete pod kiada-z9dp2
pod "kiada-z9dp2" deleted
Now list the Pods again:
$ kubectl get pods -l app=kiada
NAME          READY   STATUS    RESTARTS   AGE
kiada-dl7vz   2/2     Running   0          34m
kiada-dn9fb   2/2     Running   0          34m
kiada-rfkqb   2/2     Running   0          47s
The Pod you deleted is gone, but a new Pod has appeared to replace the missing Pod. The number of Pods again matches the desired number of replicas set in the ReplicaSet object. Again, the ReplicaSet controller reacted immediately and reconciled the actual state with the desired state.
Even if you delete all kiada Pods, three new ones will appear immediately so that they can serve your users. You can see this by running the following command:
$ kubectl delete pod -l app=kiada
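If you want to watch the controller do its work, run the delete command in one terminal while watching the Pods in another (press Ctrl-C to stop watching):
$ kubectl get pods -l app=kiada --watch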
Creating a Pod that matches the ReplicaSet’s label selector
Just as the ReplicaSet controller creates new Pods when it finds that there are fewer Pods than needed, it also deletes Pods when it finds too many. You’ve already seen this happen when you reduced the desired number of replicas, but what if you manually create a Pod that matches the ReplicaSet’s label selector? From the controller’s point of view, one of the Pods must disappear.
Let’s create a Pod called one-kiada-too-many. The name doesn’t match the prefix that the controller assigns to the ReplicaSet’s Pods, but the Pod’s labels match the ReplicaSet’s label selector. You can find the Pod manifest in the file pod.one-kiada-too-many.yaml. Apply the manifest with kubectl apply to create the Pod, and then immediately list the kiada Pods as follows:
$ kubectl get po -l app=kiada
NAME                 READY   STATUS        RESTARTS   AGE
kiada-jp4vh          2/2     Running       0          11m
kiada-r4k9f          2/2     Running       0          11m
kiada-shfgj          2/2     Running       0          11m
one-kiada-too-many   0/2     Terminating   0          3s
As expected, the ReplicaSet controller deletes the Pod as soon as it detects it. The controller doesn’t like it when you create Pods that match the label selector of a ReplicaSet. As shown, the name of the Pod doesn’t matter. Only the Pod’s labels matter.
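For reference, a simplified manifest for such a Pod might look like the following sketch. The actual manifest in the book’s code archive defines two containers (hence the 0/2 in the READY column above), and the image tag shown here is only a placeholder; the part that matters to the controller is the two labels, which match the ReplicaSet’s label selector:
apiVersion: v1
kind: Pod
metadata:
  name: one-kiada-too-many
  labels:
    app: kiada     # these two labels are what the
    rel: stable    # ReplicaSet’s selector matches on
spec:
  containers:
  - name: kiada
    image: luksa/kiada:0.5   # placeholder; use the same image as the ReplicaSet’s Pod template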
What happens when a node that runs a ReplicaSet’s Pod fails?
In the previous examples, you saw how a ReplicaSet controller reacts when someone tampers with the Pods of a ReplicaSet. Although these examples do a good job of illustrating how the ReplicaSet controller works, they don’t really show the true benefit of using a ReplicaSet to run Pods. The best reason to create Pods via a ReplicaSet instead of directly is that the Pods are automatically replaced when your cluster nodes fail.
WARNING
In the next example, you’ll cause a cluster node to fail. In a poorly configured cluster, this can bring down the entire cluster. Therefore, perform this exercise only if you’re willing to rebuild the cluster from scratch if necessary.
To see what happens when a node stops responding, you can disable its network interface. If you created your cluster with the kind tool, you can disable the network interface of the kind-worker2 node with the following command:
$ docker exec kind-worker2 ip link set eth0 down
NOTE
Pick a node that has at least one of your kiada Pods running on it. List the Pods with the -o wide option to see which node each Pod runs on.
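For example, the following command lists the kiada Pods together with the node each one is scheduled to:
$ kubectl get pods -l app=kiada -o wide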
NOTE
If you’re using GKE, you can log into the node with the gcloud compute ssh command and shut down its network interface with the sudo ifconfig eth0 down command. The ssh session will stop responding, so you’ll need to close it by pressing Enter, followed by “~.” (tilde and dot, without the quotes).
Soon, the status of the Node object representing the cluster node changes to NotReady:
$ kubectl get node
NAME                 STATUS     ROLES                  AGE    VERSION
kind-control-plane   Ready      control-plane,master   2d3h   v1.21.1
kind-worker          Ready      <none>                 2d3h   v1.21.1
kind-worker2         NotReady   <none>                 2d3h   v1.21.1
This status indicates that the Kubelet running on the node hasn’t contacted the API server for some time. Since this isn’t a clear sign that the node is down, as it could just be a temporary network glitch, this doesn’t immediately affect the status of the Pods running on the node. They’ll continue to show as Running. However, after a few minutes, Kubernetes realizes that the node is down and marks the Pods for deletion.
NOTE
The time that elapses between a node becoming unavailable and its Pods being deleted can be configured using the Taints and Tolerations mechanism, which is explained in chapter 23.
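You can already get a glimpse of this mechanism by inspecting one of your Pods. In a cluster with default settings, a command like the following should show two tolerations with tolerationSeconds set to 300, which is why the Pods are typically marked for deletion roughly five minutes after the node becomes unreachable (replace <pod-name> with the name of one of your kiada Pods):
$ kubectl get po <pod-name> -o jsonpath='{.spec.tolerations}'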
Once the Pods are marked for deletion, the ReplicaSet controller creates new Pods to replace them. You can see this in the following output.
$ kubectl get pods -l app=kiada -o wide
NAME          READY   STATUS        RESTARTS   AGE   IP             NODE
kiada-ffstj   2/2     Running       0          35s   10.244.1.150   kind-worker
kiada-l2r85   2/2     Terminating   0          37m   10.244.2.173   kind-worker2
kiada-n98df   2/2     Terminating   0          37m   10.244.2.174   kind-worker2
kiada-vnc4b   2/2     Running       0          37m   10.244.1.148   kind-worker
kiada-wkpsn   2/2     Running       0          35s   10.244.1.151   kind-worker
As you can see in the output, the two Pods on the kind-worker2 node are marked as Terminating and have been replaced by two new Pods scheduled to the healthy node kind-worker. Again, three Pod replicas are running as specified in the ReplicaSet.
The two Pods that are being deleted remain in the Terminating state until the node comes back online. In reality, the containers in those Pods are still running because the Kubelet on the node can’t communicate with the API server and therefore doesn’t know that they should be terminated. However, when the node’s network interface comes back online, the Kubelet terminates the containers, and the Pod objects are deleted. The following commands restore the node’s network interface:
$ docker exec kind-worker2 ip link set eth0 up
$ docker exec kind-worker2 ip route add default via 172.18.0.1
Your cluster may be using a gateway IP other than 172.18.0.1. To find it, run the following command:
$ docker network inspect kind -f '{{ (index .IPAM.Config 0).Gateway }}'
NOTE
If you’re using GKE, you must remotely reset the node with the gcloud compute instances reset <node-name> command.
When do Pods not get replaced?
The previous sections have demonstrated that the ReplicaSet controller ensures that there are always as many healthy Pods as specified in the ReplicaSet object. But is this always the case? Is it possible to get into a state where the number of Pods matches the desired replica count, but the Pods can’t provide the service to their clients?
Remember the liveness and readiness probes? If a container’s liveness probe fails, the container is restarted. If the probe fails multiple times, there’s a significant time delay before the container is restarted. This is due to the exponential backoff mechanism explained in chapter 6. During the backoff delay, the container isn’t in operation. However, it’s assumed that the container will eventually be back in service. If the container fails the readiness rather than the liveness probe, it’s also assumed that the problem will eventually be fixed.
For this reason, Pods whose containers continually crash or fail their probes are never automatically deleted, even though the ReplicaSet controller could easily replace them with Pods that might run properly. Therefore, be aware that a ReplicaSet doesn’t guarantee that you’ll always have as many healthy replicas as you specify in the ReplicaSet object.
You can see this for yourself by failing one of the Pods’ readiness probes with the following command:
$ kubectl exec rs/kiada -c kiada -- curl -X POST localhost:9901/healthcheck/fail
NOTE
If you specify the ReplicaSet instead of the Pod name when running the kubectl exec command, the specified command is run in one of the Pods, not all of them, just as with kubectl logs.
After about thirty seconds, the kubectl get pods command indicates that one of the Pod’s containers is no longer ready:
$ kubectl get pods -l app=kiada
NAME          READY   STATUS    RESTARTS   AGE
kiada-78j7m   1/2     Running   0          21m
kiada-98lmx   2/2     Running   0          21m
kiada-wk99p   2/2     Running   0          21m
The Pod no longer receives any traffic from the clients, but the ReplicaSet controller doesn’t delete and replace it, even though it’s aware that only two of the three Pods are ready and accessible, as indicated by the ReplicaSet status:
$ kubectl get rs
NAME    DESIRED   CURRENT   READY   AGE
kiada   3         3         2       2h
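The DESIRED, CURRENT, and READY columns come from the ReplicaSet’s spec and status fields, which you can also read directly, as in the following example:
$ kubectl get rs kiada -o jsonpath='{.spec.replicas} {.status.replicas} {.status.readyReplicas}'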
IMPORTANT
A ReplicaSet only ensures that the desired number of Pods are present. It doesn’t ensure that their containers are actually running and ready to handle traffic.
If this happens in a real production cluster and the remaining Pods can’t handle all the traffic, you’ll have to delete the bad Pod yourself. But what if you want to find out what’s wrong with the Pod first? How can you quickly replace the faulty Pod without deleting it so you can debug it?
You could scale the ReplicaSet up by one replica, but then you’ll have to scale back down when you finish debugging the faulty Pod. Fortunately, there’s a better way. It’ll be explained in the next section.
13.3.3 Removing a Pod from the ReplicaSet’s control
You already know that the ReplicaSet controller is constantly making sure that the number of Pods that match the ReplicaSet’s label selector matches the desired number of replicas. So, if you remove a Pod from the set of Pods that match the selector, the controller replaces it. To do this, you simply change the labels of the faulty Pod, as shown in the following figure.
Figure 13.7 Changing a Pod’s labels to remove it from the ReplicaSet
The ReplicaSet controller replaces the Pod with a new one, and from that point on, no longer pays attention to the faulty Pod. You can calmly figure out what’s wrong with it while the new Pod takes over the traffic.
Let’s try this with the Pod whose readiness probe you failed in the previous section. For a Pod to match the ReplicaSet’s label selector, it must have the labels app=kiada and rel=stable. Pods without these labels aren’t considered part of the ReplicaSet. So, to remove the broken Pod from the ReplicaSet, you need to remove or change at least one of these two labels. One way is to change the value of the rel label to debug as follows:
$ kubectl label po kiada-78j7m rel=debug --overwrite
Since only two Pods now match the label selector, one less than the desired number of replicas, the controller immediately creates another Pod, as shown in the following output:
$ kubectl get pods -l app=kiada -L app,rel
NAME          READY   STATUS    RESTARTS   AGE   APP     REL
kiada-78j7m   1/2     Running   0          60m   kiada   debug
kiada-98lmx   2/2     Running   0          60m   kiada   stable
kiada-wk99p   2/2     Running   0          60m   kiada   stable
kiada-xtxcl   2/2     Running   0          9s    kiada   stable
As you can see from the values in the APP and REL columns, three Pods match the selector, while the broken Pod doesn’t. This Pod is no longer managed by the ReplicaSet. Therefore, when you’re done inspecting the Pod, you need to delete it manually.
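When you’ve finished debugging it, delete it as you would any other Pod:
$ kubectl delete po kiada-78j7m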
NOTE
When you remove a Pod from a ReplicaSet, the reference to the ReplicaSet object is removed from the Pod’s ownerReferences field.
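You can verify this yourself. The following command should print nothing for the relabeled Pod, whereas for the Pods still owned by the ReplicaSet it prints a reference to the kiada ReplicaSet object:
$ kubectl get po kiada-78j7m -o jsonpath='{.metadata.ownerReferences}'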
Now that you’ve seen how the ReplicaSet controller responds to all the events shown in this and previous sections, you understand everything you need to know about this controller.