15.1 Introducing StatefulSets
Before you learn about StatefulSets and how they differ from Deployments, it’s good to know how the requirements of stateful workloads differ from those of their stateless counterparts.
15.1.1 Understanding stateful workload requirements
A stateful workload is a piece of software that must store and maintain state in order to function. This state must be maintained when the workload is restarted or relocated. This makes stateful workloads much more difficult to operate.
Stateful workloads are also much harder to scale because you can’t simply add and remove replicas without considering their state, as you can with stateless workloads. If the replicas can share state by reading and writing the same files, adding new replicas isn’t a problem. However, for this to be possible, the underlying storage technology must support it. On the other hand, if each replica stores its state in its own files, you’ll need to allocate a separate volume for each replica. With the Kubernetes resources you’ve encountered so far, this is easier said than done. Let’s look at these two options to understand the issues associated with both.
Sharing state across multiple Pod replicas
In Kubernetes, you can use PersistentVolumes with the ReadWriteMany access mode to share data across multiple Pods. However, in most cloud environments, the underlying storage technology typically only supports the ReadWriteOnce and ReadOnlyMany access modes, not ReadWriteMany, meaning you can’t mount the volume on multiple nodes in read/write mode. Therefore, Pods on different nodes can’t read and write to the same PersistentVolume.
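For reference, sharing a single volume across replicas would require a claim like the following sketch. The claim name and the storage class are assumptions; the class would have to be backed by a provisioner that actually supports ReadWriteMany (NFS-based storage, for example):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical name
spec:
  storageClassName: nfs        # assumption: a class backed by RWX-capable storage
  accessModes:
  - ReadWriteMany              # all replicas could then mount the volume in read/write mode
  resources:
    requests:
      storage: 1Gi

Most cloud block-storage classes can’t satisfy this access mode, so such a claim either isn’t bound or the volume can’t be attached to more than one node, which is why this approach usually fails in practice.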
Let’s demonstrate this problem using the Quiz service. Can you scale the quiz Deployment to, say, three replicas? Let’s see what happens. The kubectl scale command is as follows:
$ kubectl scale deploy quiz --replicas 3
deployment.apps/quiz scaled
Now check the Pods like so:
$ kubectl get pods -l app=quiz
NAME                   READY   STATUS             RESTARTS      AGE
quiz-6f4968457-2c8ws   2/2     Running            0             10m
quiz-6f4968457-cdw97   0/2     CrashLoopBackOff   1 (14s ago)   22s
quiz-6f4968457-qdn29   0/2     Error              2 (16s ago)   22s
As you can see, only the Pod that existed before the scale-up is running, while the two new Pods aren’t. Depending on the type of cluster you’re using, these two Pods may not start at all, or they may start but immediately terminate with an error message. For example, in a cluster on Google Kubernetes Engine, the containers in the new Pods don’t start because the PersistentVolume can’t be attached to them: its access mode is ReadWriteOnce, so the volume can’t be attached to multiple nodes at once. In kind-provisioned clusters, the containers start, but the mongo container fails with an error message, which you can see as follows:
$ kubectl logs quiz-6f4968457-cdw97 -c mongo
..."msg":"DBException in initAndListen, terminating","attr":{"error":"DBPathInUse: Unable to lock the lock file: /data/db/mongod.lock (Resource temporarily unavailable). Another mongod instance is already running on the /data/db directory"}}
The error message indicates that you can’t use the same data directory in multiple instances of MongoDB. The three quiz Pods use the same directory because they all use the same PersistentVolumeClaim and therefore the same PersistentVolume, as illustrated in the next figure.
Figure 15.1 All Pods from a Deployment use the same PersistentVolumeClaim and PersistentVolume.
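To see why this happens, recall that the quiz Deployment’s Pod template references a single, fixed claim, roughly like the following excerpt (the volume name shown here is an assumption; only the claimName matters for this discussion):

      volumes:
      - name: quiz-data                  # volume name inside the Pod template; assumed
        persistentVolumeClaim:
          claimName: quiz-data           # every replica created from this template mounts this same claim

Because the claimName is hard-coded in the template, every Pod stamped out of it binds to the same PersistentVolume.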
Since this approach doesn’t work, the alternative is to use a separate PersistentVolume for each Pod replica. Let’s look at what this means and whether you can do it with a single Deployment object.
Using a dedicated PersistentVolume for each replica
As you learned in the previous section, MongoDB only supports a single instance by default. If you want to deploy multiple MongoDB instances with the same data, you must create a MongoDB replica set that replicates the data across those instances (here the term “replica set” is a MongoDB-specific term and doesn’t refer to the Kubernetes ReplicaSet resource). Each instance needs its own storage volume and a stable address that other replicas and clients can use to connect to it. Therefore, to deploy a MongoDB replica set in Kubernetes, you need to ensure that:
- each Pod has its own PersistentVolume,
- each Pod is addressable by its own unique address,
- when a Pod is deleted and replaced, the new Pod is assigned the same address and PersistentVolume.
You can’t do this with a single Deployment and Service, but you can do it by creating a separate Deployment, Service, and PersistentVolumeClaim for each replica, as shown in the following figure.
Figure 15.2 Providing each replica with its own volume and address.
Each Pod has its own Deployment, so the Pod can use its own PersistentVolumeClaim and PersistentVolume. The Service associated with each replica gives it a stable address that always resolves to the IP address of the Pod, even if the Pod is deleted and recreated elsewhere. This is necessary because with MongoDB, as with many other distributed systems, you must specify the address of each replica when you initialize the replica set. In addition to these per-replica Services, you may need yet another Service to make all Pods accessible to clients at a single address. So, the whole system looks daunting.
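For illustration, a per-replica Service might look like the following sketch. The instance label is hypothetical; you’d have to add it to the Pod template of the corresponding per-replica Deployment so that only that one Pod matches the selector:

apiVersion: v1
kind: Service
metadata:
  name: quiz-0                 # one such Service per replica
spec:
  selector:
    app: quiz
    instance: quiz-0           # hypothetical per-replica label; only this replica's Pod matches
  ports:
  - name: mongodb
    port: 27017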
It gets worse from here. If you need to increase the number of replicas, you can’t use the kubectl scale command; you have to create additional Deployments, Services, and PersistentVolumeClaims, which adds to the complexity.
Even though this approach is feasible, it’s complex and it would be difficult to operate this system. Fortunately, Kubernetes provides a better way to do this with a single Service and a single StatefulSet object.
NOTE
You don’t need the quiz Deployment and the quiz-data PersistentVolumeClaim anymore, so please delete them as follows: kubectl delete deploy/quiz pvc/quiz-data.
15.1.2 Comparing StatefulSets with Deployments
A StatefulSet is similar to a Deployment, but is specifically tailored to stateful workloads. However, there are significant differences in the behavior of these two objects. This difference is best explained with the Pets vs. Cattle analogy that you may have heard of. If not, let me explain.
NOTE
StatefulSets were originally called PetSets. The name comes from this Pets vs. Cattle analogy.
The Pets vs. Cattle analogy
We used to treat our hardware infrastructure and workloads like pets. We gave each server a name and took care of each workload instance individually. However, it turns out that it’s much easier to manage hardware and software if you treat them like cattle and think of them as indistinguishable entities. That makes it easy to replace each unit without worrying that the replacement isn’t exactly the unit that was there before, much like a farmer treats cattle.
Figure 15.3 Treating entities as pets vs. as cattle
Stateless workloads deployed via Deployments are like cattle. If a Pod is replaced, you probably won’t even notice. Stateful workloads, on the other hand, are like pets. If a pet gets lost, you can’t just replace it with a new one. Even if you give the replacement pet the same name, it won’t behave exactly like the original. However, in the hardware/software world, this is possible if you can give the replacement the same network identity and state as the replaced instance. And this is exactly what happens when you deploy an application with a StatefulSet.
Deploying Pods with a StatefulSet
As with Deployments, in a StatefulSet you specify a Pod template, the desired number of replicas, and a label selector. However, you can also specify a PersistentVolumeClaim template. Each time the StatefulSet controller creates a new replica, it creates not only a new Pod object, but also one or more PersistentVolumeClaim objects.
The Pods created from a StatefulSet aren't exact copies of each other, as is the case with Deployments, because each Pod points to a different set of PersistentVolumeClaims. In addition, the names of the Pods aren't random. Instead, each Pod is given a unique ordinal number, as is each PersistentVolumeClaim. When a StatefulSet Pod is deleted and recreated, it’s given the same name as the Pod it replaced. Also, a Pod with a particular ordinal number is always associated with PersistentVolumeClaims with the same number. This means that the state associated with a particular replica is always the same, no matter how often the Pod is recreated.
Figure 15.4 A StatefulSet, its Pods, and PersistentVolumeClaims
Another notable difference between Deployments and StatefulSets is that, by default, the Pods of a StatefulSet aren't created concurrently. Instead, they’re created one at a time, similar to a rolling update of a Deployment. When you create a StatefulSet, only the first Pod is created initially. Then the StatefulSet controller waits until the Pod is ready before creating the next one.
A StatefulSet can be scaled just like a Deployment. When you scale a StatefulSet up, new Pods and PersistentVolumeClaims are created from their respective templates. When you scale down the StatefulSet, the Pods are deleted, but the PersistentVolumeClaims are either retained or deleted, depending on the policy you configure in the StatefulSet.
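For example, in clusters where this feature is available (it’s configured through the persistentVolumeClaimRetentionPolicy field; older Kubernetes versions may require the corresponding feature gate to be enabled), the policy might be set like this:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete      # delete the claims of replicas removed by a scale-down
    whenDeleted: Retain     # keep the claims if the whole StatefulSet is deleted

If you don’t set the field, both sub-fields default to Retain, so the claims survive scale-downs and the deletion of the StatefulSet.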
15.1.3 Creating a StatefulSet
In this section, you’ll replace the quiz Deployment with a StatefulSet. Each StatefulSet must have an associated headless Service that exposes the Pods individually, so the first thing you must do is create this Service.
Creating the governing Service
The headless Service associated with a StatefulSet gives the Pods their network identity. You may recall from chapter 11 that a headless Service doesn’t have a cluster IP address, but you can still use it to communicate with the Pods that match its label selector. Instead of a single A or AAAA DNS record pointing to the Service’s IP, the DNS record for a headless Service points to the IPs of all the Pods that are part of the Service.
As you can see in the following figure, when using a headless Service with a StatefulSet, an additional DNS record is created for each Pod so that the IP address of each Pod can be looked up by its name. This is how stateful Pods maintain their stable network identity. These DNS records don’t exist when the headless Service isn’t associated with a StatefulSet.
Figure 15.5 A headless Service used in combination with a StatefulSet
You already have a Service called quiz that you created in the previous chapters. You could change it into a headless Service, but let’s create an additional Service instead, because the new Service will expose all quiz Pods, whether they’re ready or not.
This headless Service will allow you to resolve individual Pods, so let’s call it quiz-pods. Create the Service with the kubectl apply command. You can find the Service manifest in the svc.quiz-pods.yaml file, whose contents are shown in the following listing.
Listing 15.1 Headless Service for the quiz StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: quiz-pods
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: quiz
  ports:
  - name: mongodb
    port: 27017
In the listing, the clusterIP field is set to None, which makes this a headless Service. If you set publishNotReadyAddresses to true, the DNS records for each Pod are created immediately when the Pod is created, rather than only when the Pod is ready. This way, the quiz-pods Service will include all quiz Pods, regardless of their readiness status.
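As a quick sanity check, you can list the Service after creating it. For a headless Service, the CLUSTER-IP column shows None (the age and exact formatting will differ in your cluster):

$ kubectl get svc quiz-pods
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
quiz-pods   ClusterIP   None         <none>        27017/TCP   1m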
Creating the StatefulSet
After you create the headless Service, you can create the StatefulSet. You can find the object manifest in the sts.quiz.yaml file. The most important parts of the manifest are shown in the following listing.
Listing 15.2 The object manifest for a StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: quiz
spec:
  serviceName: quiz-pods
  podManagementPolicy: Parallel
  replicas: 3
  selector:
    matchLabels:
      app: quiz
  template:
    metadata:
      labels:
        app: quiz
        ver: "0.1"
    spec:
      volumes:
      - name: db-data
        persistentVolumeClaim:
          claimName: db-data
      containers:
      - name: quiz-api
        ...
      - name: mongo
        image: mongo:5
        command:
        - mongod
        - --bind_ip
        - 0.0.0.0
        - --replSet
        - quiz
        volumeMounts:
        - name: db-data
          mountPath: /data/db
  volumeClaimTemplates:
  - metadata:
      name: db-data
      labels:
        app: quiz
    spec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
      - ReadWriteOnce
The manifest defines an object of kind StatefulSet from the API group apps, version v1. The name of the StatefulSet is quiz. In the StatefulSet spec, you’ll find some fields you know from Deployments and ReplicaSets, such as replicas, selector, and template, explained in the previous chapter, but this manifest contains other fields that are specific to StatefulSets. In the serviceName field, for example, you specify the name of the headless Service that governs this StatefulSet.
By setting podManagementPolicy to Parallel, the StatefulSet controller creates all Pods simultaneously. Since some distributed applications can’t handle multiple instances being launched at the same time, the default behavior of the controller is to create one Pod at a time. However, in this example, the Parallel option makes the initial scale-up less involved.
In the volumeClaimTemplates field, you specify the templates for the PersistentVolumeClaims that the controller creates for each replica. Unlike the Pod template, where you omit the name field, you must specify the name in the PersistentVolumeClaim template. This name must match the name in the volumes section of the Pod template.
Create the StatefulSet by applying the manifest file as follows:
$ kubectl apply -f sts.quiz.yaml
statefulset.apps/quiz created
15.1.4 Inspecting the StatefulSet, Pods, and PersistentVolumeClaims
After you create the StatefulSet, you can use the kubectl rollout status command to see its status like so:
$ kubectl rollout status sts quiz
Waiting for 3 pods to be ready...
NOTE The shorthand for StatefulSets is sts.
After kubectl prints this message, it doesn’t continue. Interrupt its execution by pressing Control-C and check the StatefulSet status with the kubectl get command to investigate why.
$ kubectl get sts
NAME   READY   AGE
quiz   0/3     22s
NOTE
As with Deployments and ReplicaSets, you can use the -o wide option to display the names of the containers and images used in the StatefulSet.
The value in the READY column shows that none of the replicas are ready. List the Pods with kubectl get pods as follows:
$ kubectl get pods -l app=quiz
NAME     READY   STATUS    RESTARTS   AGE
quiz-0   1/2     Running   0          56s
quiz-1   1/2     Running   0          56s
quiz-2   1/2     Running   0          56s
NOTE
Did you notice the Pod names? They don’t contain a template hash or random characters. The name of each Pod is composed of the StatefulSet name and an ordinal number, as explained in the introduction.
You’ll notice that only one of the two containers in each Pod is ready. If you examine a Pod with the kubectl describe command, you’ll see that the mongo container is ready, but the quiz-api container isn’t, because its readiness check fails. This is because the endpoint called by the readiness probe (/healthz/ready) checks whether the quiz-api process can query the MongoDB server. The failed readiness probe indicates that this isn’t possible. If you check the logs of the quiz-api container as follows, you’ll see why:
$ kubectl logs quiz-0 -c quiz-api
... INTERNAL ERROR: connected to mongo, but couldn't execute the ping command: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: 127.0.0.1:27017, Type: RSGhost, State: Connected, Average RTT: 898693 }, ] }
As indicated in the error message, the connection to MongoDB has been established, but the server doesn’t allow the ping command to be executed. The reason is that the server was started with the --replSet option configuring it to use replication, but the MongoDB replica set hasn’t been initiated yet. To do this, run the following command:
$ kubectl exec -it quiz-0 -c mongo -- mongosh --quiet --eval 'rs.initiate({
    _id: "quiz",
    members: [
      {_id: 0, host: "quiz-0.quiz-pods.kiada.svc.cluster.local:27017"},
      {_id: 1, host: "quiz-1.quiz-pods.kiada.svc.cluster.local:27017"},
      {_id: 2, host: "quiz-2.quiz-pods.kiada.svc.cluster.local:27017"}]})'
NOTE
Instead of typing this long command, you can also run the initiate-mongo-replicaset.sh shell script, which you can find in this chapter’s code directory.
If the MongoDB shell gives the following error message, you probably forgot to create the quiz-pods Service beforehand:
MongoServerError: replSetInitiate quorum check failed because not all proposed set members responded affirmatively: ... caused by :: Could not find address for quiz-2.quiz-pods.kiada.svc.cluster.local:27017: SocketException: Host not found
If the initiation of the replica set is successful, the command prints the following message:
{ ok: 1 }
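If you want to confirm that replication is working before moving on, you can print the state of each member; one should report PRIMARY and the other two SECONDARY. This check isn’t required, just informative:

$ kubectl exec -it quiz-0 -c mongo -- mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'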
All three quiz Pods should be ready shortly after the replica set is initiated. If you run the kubectl rollout status command again, you’ll see the following output:
$ kubectl rollout status sts quiz
partitioned roll out complete: 3 new pods have been updated...
Inspecting the StatefulSet with kubectl describe
As you know, you can examine an object in detail with the kubectl describe command. Here you can see what it displays for the quiz StatefulSet:
$ kubectl describe sts quiz
Name:               quiz
Namespace:          kiada
CreationTimestamp:  Sat, 12 Mar 2022 18:05:43 +0100
Selector:           app=quiz
Labels:             app=quiz
Annotations:        <none>
Replicas:           3 desired | 3 total
Update Strategy:    RollingUpdate
  Partition:        0
Pods Status:        3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  ...
Volume Claims:
  Name:          db-data
  StorageClass:
  Labels:        app=quiz
  Annotations:   <none>
  Capacity:      1Gi
  Access Modes:  [ReadWriteOnce]
Events:
  Type    Reason            Age   From                    Message
  ----    ------            ----  ----                    -------
  Normal  SuccessfulCreate  10m   statefulset-controller  create Claim db-data-quiz-0 Pod quiz-0 in StatefulSet quiz success
  Normal  SuccessfulCreate  10m   statefulset-controller  create Pod quiz-0 in StatefulSet quiz successful
  ...
As you can see, the output is very similar to that of a ReplicaSet and Deployment. The most noticeable difference is the presence of the PersistentVolumeClaim template, which you won’t find in the other two object types. The events at the bottom of the output show you exactly what the StatefulSet controller did. Whenever it creates a Pod or a PersistentVolumeClaim, it also creates an Event that tells you what it did.
Inspecting the Pods
Let’s take a closer look at the manifest of the first Pod to see how it compares to Pods created by a ReplicaSet. Use the kubectl get command to print the Pod manifest like so:
$ kubectl get pod quiz-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: quiz
    controller-revision-hash: quiz-7576f64fbc
    statefulset.kubernetes.io/pod-name: quiz-0
    ver: "0.1"
  name: quiz-0
  namespace: kiada
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: quiz
spec:
  containers:
  ...
  volumes:
  - name: db-data
    persistentVolumeClaim:
      claimName: db-data-quiz-0
status:
  ...
The only label you defined in the Pod template in the StatefulSet manifest was app, but the StatefulSet controller added two additional labels to the Pod:
- The label controller-revision-hash serves the same purpose as the label pod-template-hash on the Pods of a ReplicaSet. It allows the controller to determine to which revision of the StatefulSet a particular Pod belongs.
- The label statefulset.kubernetes.io/pod-name specifies the Pod name and allows you to create a Service for a specific Pod instance by using this label in the Service’s label selector, as shown in the example after this list.
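For example, a Service that targets only the quiz-1 Pod could be defined as follows. The Service name is arbitrary, and such a Service isn’t needed for the Kiada suite; the sketch merely illustrates the label’s purpose:

apiVersion: v1
kind: Service
metadata:
  name: quiz-1-only                               # hypothetical name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: quiz-1    # matches exactly one Pod of the StatefulSet
  ports:
  - name: mongodb
    port: 27017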
Since this Pod object is managed by the StatefulSet, the ownerReferences field indicates this fact. Unlike Deployments, where Pods are owned by ReplicaSets, which in turn are owned by the Deployment, StatefulSets own the Pods directly. The StatefulSet takes care of both replication and updating of the Pods.
The Pod’s containers match the containers defined in the StatefulSet’s Pod template, but that’s not the case for the Pod’s volumes. In the template you specified the claimName as db-data, but here in the Pod it’s been changed to db-data-quiz-0. This is because each Pod instance gets its own PersistentVolumeClaim. The name of the claim is made up of the claimName and the name of the Pod.
Inspecting the PersistentVolumeClaims
Along with the Pods, the StatefulSet controller creates a PersistentVolumeClaim for each Pod. List them as follows:
$ kubectl get pvc -l app=quiz
NAME             STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-data-quiz-0   Bound    pvc...1bf8ccaf   1Gi        RWO            standard       10m
db-data-quiz-1   Bound    pvc...c8f860c2   1Gi        RWO            standard       10m
db-data-quiz-2   Bound    pvc...2cc494d6   1Gi        RWO            standard       10m
You can check the manifest of these PersistentVolumeClaims to make sure they match the template specified in the StatefulSet. Each claim is bound to a PersistentVolume that’s been dynamically provisioned for it. These volumes don’t yet contain any data, so the Quiz service doesn’t currently return anything. You’ll import the data next.
15.1.5 Understanding the role of the headless Service
An important requirement of distributed applications is peer discovery—the ability for each cluster member to find the other members. If an application deployed via a StatefulSet needs to find all other Pods in the StatefulSet, it could do so by retrieving the list of Pods from the Kubernetes API. However, since we want applications to remain Kubernetes-agnostic, it’s better for the application to use DNS and not talk to Kubernetes directly.
For example, a client connecting to a MongoDB replica set must know the addresses of all the replicas, so it can find the primary replica when it needs to write data. You must specify the addresses in the connection string you pass to the MongoDB client. For your three quiz Pods, the following connection URI can be used:
mongodb://quiz-0.quiz-pods.kiada.svc.cluster.local:27017,quiz-1.quiz-pods.kiada.svc.cluster.local:27017,quiz-2.quiz-pods.kiada.svc.cluster.local:27017
If the StatefulSet was configured with additional replicas, you’d need to add their addresses to the connection string, too. Fortunately, there’s a better way.
Exposing stateful Pods through DNS individually
In chapter 11 you learned that a Service object not only exposes a set of Pods at a stable IP address but also makes the cluster DNS resolve the Service name to this IP address. With a headless Service, on the other hand, the name resolves to the IPs of the Pods that belong to the Service. However, when a headless Service is associated with a StatefulSet, each Pod also gets its own A or AAAA record that resolves directly to the individual Pod’s IP. For example, because you combined the quiz StatefulSet with the quiz-pods headless Service, the IP of the quiz-0 Pod is resolvable at the following address:
quiz-0.quiz-pods.kiada.svc.cluster.local

All the other replicas created by the StatefulSet are resolvable in the same way.
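If you want to see this for yourself, you can resolve the name from a throwaway Pod. The image used here is just one example of an image that includes the dig utility; the command should print the IP address of the quiz-0 Pod:

$ kubectl run dns-test --rm -it --restart=Never --image=nicolaka/netshoot -- dig +short quiz-0.quiz-pods.kiada.svc.cluster.local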
Exposing stateful Pods via SRV records
In addition to the A and AAAA records, each stateful Pod also gets SRV records. These can be used by the MongoDB client to look up the addresses and port numbers used by each Pod so you don’t have to specify them manually. However, you must ensure that the SRV record has the correct name. MongoDB expects the SRV record to start with _mongodb. To ensure that’s the case, you must set the port name in the Service definition to mongodb like you did in listing 15.1. This ensures that the SRV record is as follows:

_mongodb._tcp.quiz-pods.kiada.svc.cluster.local
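You can inspect these SRV records with the same throwaway-Pod approach used earlier, again assuming an image that ships with dig. Each line of the answer lists a priority, weight, port, and the DNS name of one quiz Pod:

$ kubectl run dns-test --rm -it --restart=Never --image=nicolaka/netshoot -- dig +short SRV _mongodb._tcp.quiz-pods.kiada.svc.cluster.local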
Using SRV records allows the MongoDB connection string to be much simpler. Regardless of the number of replicas in the set, the connection string is always as follows:
mongodb+srv://quiz-pods.kiada.svc.cluster.local
Instead of specifying the addresses individually, the mongodb+srv scheme tells the client to find the addresses by performing an SRV lookup for the domain name _mongodb._tcp.quiz-pods.kiada.svc.cluster.local. You’ll use this connection string to import the quiz data into MongoDB, as explained next.
Importing quiz data into MongoDB
In the previous chapters, an init container was used to import the quiz data into the MongoDB store. The init container approach is no longer valid since the data is now replicated, so if you were to use it, the data would be imported multiple times. Instead, let’s move the import to a dedicated Pod.
You can find the Pod manifest in the file pod.quiz-data-importer.yaml. The file also contains a ConfigMap with the data to be imported. The following listing shows the contents of the manifest file.
Listing 15.3 The manifest of the quiz-data-importer Pod
apiVersion: v1
kind: Pod
metadata:
  name: quiz-data-importer
spec:
  restartPolicy: OnFailure
  volumes:
  - name: quiz-questions
    configMap:
      name: quiz-questions
  containers:
  - name: mongoimport
    image: mongo:5
    command:
    - mongoimport
    - mongodb+srv://quiz-pods.kiada.svc.cluster.local/kiada?tls=false
    - --collection
    - questions
    - --file
    - /questions.json
    - --drop
    volumeMounts:
    - name: quiz-questions
      mountPath: /questions.json
      subPath: questions.json
      readOnly: true
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: quiz-questions
  labels:
    app: quiz
data:
  questions.json: ...
The quiz-questions ConfigMap is mounted into the quiz-data-importer Pod through a configMap volume. When the Pod's container starts, it runs the mongoimport command, which connects to the primary MongoDB replica and imports the data from the file in the volume. The data is then replicated to the secondary replicas.
Since the mongoimport container only needs to run once, the Pod's restartPolicy is set to OnFailure. If the import fails, the container will be restarted as many times as necessary until the import succeeds. Deploy the Pod using the kubectl apply command and verify that it completed successfully. You can do this by checking the status of the Pod as follows:
$ kubectl get pod quiz-data-importer
NAME                 READY   STATUS      RESTARTS   AGE
quiz-data-importer   0/1     Completed   0          50s
If the STATUS column displays the value Completed, it means that the container exited without errors. The logs of the container will show the number of imported documents. You should now be able to access the Kiada suite via curl or your web browser and see that the Quiz service returns the questions you imported. You can delete the quiz-data-importer Pod and the quiz-questions ConfigMap at will.
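For example, you can remove both with a single command, using the same slash notation as in the earlier cleanup note:

$ kubectl delete pod/quiz-data-importer cm/quiz-questions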
Now answer a few quiz questions and use the following command to check if your answers are stored in MongoDB:
$ kubectl exec quiz-0 -c mongo -- mongosh kiada --quiet --eval 'db.responses.find()'
When you run this command, the mongosh shell in pod quiz-0 connects to the kiada database and displays all the documents stored in the responses collection in JSON form. Each of these documents represents an answer that you submitted.
NOTE
This command assumes that quiz-0 is the primary MongoDB replica, which should be the case unless you deviated from the instructions for creating the StatefulSet. If the command fails, try running it in the quiz-1 and quiz-2 Pods. You can also find the primary replica by running the MongoDB command rs.hello().primary in any quiz Pod.
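If you need to look up the primary, the command might look like this; it prints something like the host:port of the current primary member:

$ kubectl exec quiz-0 -c mongo -- mongosh --quiet --eval 'rs.hello().primary'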