CloneSet in Kubernetes

Making Kubernetes stronger

After attending KubeCon North America for the first time this past November in Chicago (a blog post about this is coming later this year; I have one month to complete it), I learned about the CloneSet resource for the first time ever. It was a real surprise to me: I have been working with Kubernetes for the last 5 years, I have been interviewed many times for Kubernetes-related jobs, and I completed the CKA certification, yet I was never asked about CloneSets. The talk was “On the right tack: Kubernetes at Uber Scale”. You can see the actual object image in there:

After learning a little about this type of workload, I understood why: this resource is only used by “Big Tech” in environments with more than 1 million pods, that is, Uber, Netflix, Apple, etc. Indeed, I have never worked on those types of clusters or applications.

Before jumping into CloneSets in Kubernetes and understanding their use cases, let’s first talk about OpenKruise.

What is OpenKruise?

OpenKruise (official site: https://openkruise.io) is a CNCF (Cloud Native Computing Foundation) incubating project. It consists of several controllers that extend and complement the Kubernetes core controllers for workload and application management.

OpenKruise is an extended component suite for Kubernetes that focuses mainly on application automation, such as deployment, upgrade, ops, and availability protection. Most features provided by OpenKruise are built on CRD extensions. They can work in pure Kubernetes clusters without any other dependencies.

Advanced Workloads

OpenKruise contains a set of advanced workloads, such as CloneSet, Advanced StatefulSet, Advanced DaemonSet, BroadcastJob. They all support not only the basic features which are similar to the original Workloads in Kubernetes, but also more advanced abilities like in-place update, configurable scale/upgrade strategies, and parallel operations.

In-place Update is a new methodology to update container images and even environments. It only restarts the specific container with the new image and the Pod will not be recreated, which leads to a much faster update process and much fewer side effects on other sub-systems such as scheduler, CNI or CSI.

Decoupled Application Management

OpenKruise provides several decoupled ways to manage sidecar containers and multi-domain deployments for applications, which means you can manage these things without modifying the Workloads of the applications.

For example, SidecarSet can help you inject sidecar containers into all matching Pods during creation and even update them in place with no effect on the other containers in the Pod. WorkloadSpread constrains the spread of stateless workloads, which gives a single workload the ability to do multi-domain and elastic deployment.
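To make the SidecarSet idea concrete, here is a minimal sketch of what such an object might look like. The name `sample-sidecarset`, the `log-agent` container, and its image are hypothetical placeholders chosen for illustration; the selector reuses the `app: sample` label from this post's CloneSet example. Check the OpenKruise documentation for the exact fields supported by your version.

```yaml
# Sketch only: injects a hypothetical "log-agent" sidecar into every Pod
# labeled app=sample at creation time, without touching the workload itself.
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sample-sidecarset        # hypothetical name
spec:
  selector:
    matchLabels:
      app: sample                # Pods to inject into
  containers:
  - name: log-agent              # hypothetical sidecar container
    image: busybox:latest
    command: ["sleep", "999d"]
  updateStrategy:
    type: RollingUpdate
    maxUnavailable: 1            # update sidecars one Pod at a time
```

Because the sidecar lives in its own object, changing the sidecar image means editing the SidecarSet rather than the application's workload manifest.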

High-availability Protection

OpenKruise works hard on protecting the high availability of applications. It can now protect your Kubernetes resources from cascading deletion, including CRDs, Namespaces, and almost all kinds of Workloads. In voluntary disruption scenarios, PodUnavailableBudget helps prevent application disruption or SLA degradation: it is compatible with the Kubernetes PDB protection for the Eviction API, and it also extends that protection to the other voluntary disruption scenarios.
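As a rough illustration of PodUnavailableBudget, the sketch below protects the Pods of a CloneSet named `sample` (the one used later in this post). The object name `sample-pub` and the 20% budget are assumptions made for the example, not values from the talk or from any real cluster.

```yaml
# Sketch only: at most 20% of the Pods owned by the "sample" CloneSet
# may be unavailable at once during voluntary disruptions.
apiVersion: policy.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: sample-pub               # hypothetical name
spec:
  targetRef:                     # the workload whose Pods are protected
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: sample
  maxUnavailable: 20%            # assumed budget, tune for your SLA
```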

With that covered, let’s talk about this specific controller, the CloneSet:

CloneSet

This controller provides advanced features to efficiently manage stateless applications in large-scale scenarios that do not have instance order requirements during scaling and rollout. Analogously, CloneSet can be seen as an enhanced version of the upstream Deployment workload, but it does much more.

As the name suggests, CloneSet is a **Set**-suffix controller that manages Pods directly. A sample CloneSet YAML looks like the following:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  labels:
    app: sample
  name: sample
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: nginx
        image: nginx:alpine

CloneSet CRD belongs to a family of CRDs: Kruise. It’s part of Alibaba’s open source effort.

Alibaba Group adopted K8s in production early on, and it now runs one of the world’s largest clusters. During the migration to and operation of K8s, many problems with upstream K8s surfaced. For example, Deployment doesn’t support canary rolling updates. StatefulSet does, by using partition updates, but since it’s a StatefulSet, you have to update pods one by one. Assuming you have hundreds of pods, how long will it take to update them? Really, the rollout strategies available from upstream workloads are limited and scattered across different workloads. That’s understandable, since K8s is a framework and cannot simply satisfy all use cases.

Another example is that Deployment gives random names to pods, which creates issues with monitoring after a service reboot. StatefulSet does enforce a strict naming order. However, it starts and updates pods one by one, serially, without enough flexibility.

So, Alibaba Cloud created the open source project Kruise. Under it there are several CRDs that have been proven in real production environments. They are now shared with everyone. CloneSet is the representative workload CRD, and it has quite a few unique characteristics.

As you might know, in Kubernetes there is a naming convention for controllers/CRDs. The suffix “Set” suggests the CRD works on pods directly, like “StatefulSet”, “ReplicaSet”, or “DaemonSet”. In the same way, CloneSet works on individual pods. But StatefulSet emphasizes “stateful” workloads, while CloneSet focuses on “stateless” workloads. This is the key and most important reason this resource was created.

Feature-wise, CloneSet is a workload for stateless pods. It supports all the rollout strategies that the other upstream Kubernetes workloads offer.

Here is the table of comparison:

Features

In-Place Update

This approach means the pod stays as is during an update. Only the container’s image gets a refresh. Everything else about the Pod – like its IP, PVC, and more – remains unchanged. Since Kubernetes operates at the pod level rather than the container level, as far as Kubernetes is concerned, the pod remains unaltered.

This feature is particularly handy when your pod houses multiple containers, and you’re looking to update a secondary container rather than the primary one. Take, for instance, updating Istio: when the Pod is recreated to pick up the new sidecar, the main container is restarted along with it. Is this what you’re aiming for?
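Here is a minimal sketch of how in-place update might be switched on for the sample CloneSet from earlier. It assumes the `InPlaceIfPossible` update type, which asks the controller to update in place when only the image changed and to fall back to recreating the Pod otherwise; the 10-second grace period is an arbitrary value picked for illustration.

```yaml
# Sketch only: same sample CloneSet, with in-place update enabled.
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: sample
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sample
  updateStrategy:
    type: InPlaceIfPossible      # update the container in place when possible
    inPlaceUpdateStrategy:
      gracePeriodSeconds: 10     # assumed pause before the container restarts
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: nginx
        image: nginx:alpine      # changing only this tag triggers an in-place update
```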

maxSurge

This one is all about smoothly rolling out updates to your pods while keeping the number of replicas constant. Think of it like having an extra space in memory during a swap – maxSurge essentially sets this extra space. Say you have five replicas and set maxSurge to 20%; this means you get one extra pod as a buffer during updates.

In Deployments specifically, maxSurge is effective only in conjunction with maxUnavailable.

In CloneSet configurations, however, you can mix maxSurge with Partition and maxUnavailable, and even apply it to in-place updates. This flexibility makes it a more robust choice than standard deployments in complex scenarios.
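As a sketch of that combination, the fragment below mixes maxSurge, maxUnavailable, and partition in one update strategy. The concrete numbers are assumptions for illustration, applied to the five-replica example above; partition here is taken to mean the number of Pods kept on the old revision.

```yaml
# Sketch only: update strategy fragment for a CloneSet with 5 replicas.
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: sample
spec:
  # ...
  replicas: 5
  updateStrategy:
    type: InPlaceIfPossible
    maxSurge: 20%                # 20% of 5 replicas = 1 extra Pod as a buffer
    maxUnavailable: 0            # never drop below the desired replica count
    partition: 3                 # keep 3 Pods on the old revision (canary of 2)
```

Lowering `partition` step by step (3, then 1, then 0) is one way to widen the rollout gradually.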

Selective Pod Deletion

With CloneSet, you get to choose which pod gets removed first during a scale-down operation. This isn’t the same as manually deleting a pod using kubectl delete pod. Unlike with Deployments and StatefulSets, where the removal order is preset and unchangeable by the user, CloneSet lets you prioritize which pods to remove first. Here’s how it works:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # ...
  replicas: 4
  scaleStrategy:
    podsToDelete:
    - sample-9m4hp

In this scenario, the pod named ‘sample-9m4hp’ is set to be the first one deleted during scaling down. This feature is useful for strategically freeing up resources on nodes during a scale-down. It essentially hands you the reins for more direct control.

Per pod PVC

CloneSet also introduces an individual PVC for each pod. In a StatefulSet, each pod is assigned a volume that matches its name. But with CloneSet, you avoid the constraints of StatefulSets. In a Deployment scenario, pods are assigned random volume names that don’t correlate with the pod names, making it hard to track volumes post-update. CloneSet’s per-pod PVC feature allows each pod to maintain stateful data without being managed by a StatefulSet.
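A minimal sketch of per-pod PVCs, assuming CloneSet accepts `volumeClaimTemplates` in the same shape StatefulSet does. The volume name `data-vol`, the mount path, and the 1Gi size are hypothetical values chosen for the example.

```yaml
# Sketch only: each Pod of the CloneSet gets its own PVC from the template.
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: sample
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        volumeMounts:
        - name: data-vol               # must match the claim template name
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data-vol                   # hypothetical volume name
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi                 # assumed size
```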

CloneSet is packed with many more features. For a comprehensive list and more details, you can visit the CloneSet CRD’s Git repository or the official documentation here: https://openkruise.io/docs/user-manuals/cloneset/

Live Example of CloneSet

Please refer to this GitHub example and try it yourself.

https://github.com/openkruise/kruise/blob/master/docs/tutorial/cloneset.md

Conclusion

In essence, CloneSet stands out as a robust solution for managing Kubernetes applications at scale, offering enhanced flexibility, efficiency, and control. Its features are particularly aligned with the needs of large environments, where managing complexity, ensuring high availability, and efficient resource utilization are paramount.

Build On!