What is a disruption?
A keen reader like yourself has definitely noticed the word disruption in PodDisruptionBudget. So what is a disruption?
A disruption, for the purpose of this blog post, refers to an event in which a pod needs to be killed and respawned. Disruptions are inevitable, and they need to be handled delicately, otherwise we will have an outage. Imagine having a service where none of the backing pods are available. Bad, right?
PodDisruptionBudget (pdb) is a useful layer of defence provided by Kubernetes to deal with this kind of issue. If you have a Kubernetes cluster currently running in production, try running
kubectl get pdb --all-namespaces
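On a cluster that has pdbs defined, the output looks roughly like this (the namespaces, names and values below are purely illustrative):

NAMESPACE     NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
default       nginx-pdb   1               N/A               1                     7d
kube-system   coredns     N/A             1                 1                     90d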
Chances are, you will see some pdbs floating around in various namespaces. Of course, kubectl might tell you there isn't any pdb in this cluster. In that case, please read on, as I will discuss what a pdb is, and why it is essential in production landscapes.
What is PodDisruptionBudget (pdb)?
pdb stands for PodDisruptionBudget. The Kubernetes API first introduced this resource as a beta feature under policy/v1beta1. The API remained fairly stable until version 1.21, when it graduated and was promoted to policy/v1. policy/v1beta1 is currently deprecated, and will be removed in version 1.25.
A pdb defines the budget of voluntary disruption. In essence, a human operator lets the cluster know the minimum number of available pods it needs to guarantee in order to ensure a baseline of availability or performance. The word budget is used as in error budget, in the sense that any voluntary disruption within this budget is acceptable.
A typical pdb looks like
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
If we take a closer look at this sample, we will notice that:
It selects other resources based on their labels.
It demands that at least one of the selected pods is running at all times.
There are three possible fields in a pdb:
.spec.selector (required): a selector field specifying the resource this pdb applies to. A pdb can be applied to:
Deployment
ReplicationController
ReplicaSet
StatefulSet
Most frequently, pdb is used in conjunction with Deployment. You can also use pdb with arbitrary controllers, but that is beyond the scope of this blog.
one .spec.minAvailable or one .spec.maxUnavailable (required; either one, but not both): You can define minAvailable or maxUnavailable as an absolute number (e.g. have at least two pods available / at most two pods unavailable) or as a percentage (e.g. have at least 10% of pods available / at most 10% of pods unavailable). The sample above uses minAvailable; a maxUnavailable equivalent is sketched below.
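The maxUnavailable flavour caps how many matching pods may be down at once instead of guaranteeing a minimum. A minimal sketch (the name and labels here are illustrative, not taken from the demo repository):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 1        # at most one matching pod may be down due to a voluntary disruption
  selector:
    matchLabels:
      app: nginx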
If you want to see pdb in action, please check out my sample repository here. If you haven't used pdb before, this demo will help you understand the concepts that follow.
Voluntary Disruption?
You may have noticed that I mentioned pdb is only good for voluntary disruptions.
Voluntary disruptions can take the form of:
A node group replacement, from an incompatible change or a cluster upgrade.
Scaling nodes up or down (in both cases, the node ends up being cordoned and drained, as sketched below).
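A minimal sketch of what the platform team typically runs when draining a node, assuming a hypothetical node name:

kubectl cordon ip-10-0-1-23.ec2.internal        # stop new pods from being scheduled onto the node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data   # evict the running pods

kubectl drain evicts pods through the eviction API, and the eviction API is exactly where a pdb is enforced.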
Oftentimes, the responsibility of managing an application workload is separated from the responsibility of managing the cluster, with each usually picked up by a different team, such as an application team and a platform team.
There can be a conflict of interest between them:
An application team wants their apps running at all times, with 100% availability and endpoints as responsive as possible.
A platform team needs to make changes to the cluster. Those changes will take down nodes, along with the pods running on them.
A pdb is, in all fairness, a compromise between the application team and the platform team. The application team acknowledges the necessity of scheduled/voluntary disruptions, and provides a guideline that helps the platform team carry the rollout to completion.
Of course, there are involuntary disruptions as well, such as a power outage or a node kernel crash. A pdb won't protect your workload from those, understandably.
What if I don’t have pdb in my cluster?
Your workload might go offline whenever a cluster maintenance event takes place. Yes, even if you have replicas set to a value greater than 1.
For the sake of brevity, let's say you have a deployment of nginx with replicas = 2. If both of them are scheduled onto a single node, what happens when that particular node is recycled? Downtime, until the pods are re-scheduled onto another node.
What if I have podAntiAffinity in place to ensure my pods span multiple nodes? Still not 100% safe. Let's reuse our nginx example and do a thought experiment. Now we have two nginx pods (nginx-1 and nginx-2) running on nodes A and B. A scheduled node rollout is imminent, and node A starts evicting pods. nginx-1 on A is destroyed and goes back to Pending. If the node group is large enough, with plenty of spare computing capacity, nginx-1 might be rescheduled immediately onto another node, e.g. node C. However, what if the node group is already fully utilised? nginx-1 will stay in Pending. If the rollout, unaware of the seriousness of the situation, then sends node B to be killed, nginx-2 goes back to Pending too. Now none of the nginx pods are alive, a.k.a. downtime.
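For reference, a minimal sketch of what such a podAntiAffinity rule might look like on the nginx Deployment (the names and image are assumptions, not taken from the demo repository):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname   # never co-locate two nginx pods on the same node
      containers:
      - name: nginx
        image: nginx:1.25

It keeps the pods apart, but as the thought experiment shows, it does nothing to stop both nodes from being drained at roughly the same time.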
When a pdb is set inside your cluster, Kubernetes will honour the budget by refusing any eviction that would drop the number of available replicas below the agreed minimum. In example 1, that means a replacement pod will be spawned on another node before the original node is scrapped. In example 2, nginx-1 must be up and running on another node before nginx-2, and in turn node B, can be destroyed.
Without pdb, you are essentially leaving availability to probability. It probably will be fine.
I have set maxUnavailable on my Deployment, do I still need pdb?
Yes. maxUnavailable in a Deployment governs the behaviour during a pod rollout, i.e. when the Deployment itself is updated. A pdb, on the other hand, governs the behaviour when a node rollout is being performed.
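To make the distinction concrete, here is an illustrative excerpt showing where the Deployment-side setting lives (only the relevant part of the spec is shown):

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # only consulted when the Deployment rolls out, e.g. a new image
      maxSurge: 1

Node drains go through the eviction API, which only consults the pdb, so you need both settings if you care about both kinds of rollout.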
What if I set minAvailable to 100%?
This is equivalent to setting maxUnavailable to 0%. Kubernetes will ensure that none of your pods is ever evicted by a voluntary disruption.
In practice, what happens is that when the cluster administrator starts a rollout, the process never completes. The nodes hosting your pods never finish draining, because none of the pods is allowed to go offline. There is no disruption budget at all.
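If you have ever watched a drain fight a zero budget, it looks roughly like this (the node and pod names are made up, the error is the one kubectl prints):

kubectl drain node-a --ignore-daemonsets
evicting pod default/nginx-6d8f4c7b9-abcde
error when evicting pods/"nginx-6d8f4c7b9-abcde" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The drain retries forever and the node never empties.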
You are making the life of your humble platform team miserable. Don’t do that.
How should I decide on the numbers?
The exact number or percentage should be decided by a concerted effort between the application team (app owner) and the platform team (cluster admin).
The app owner should understand that the higher minAvailable goes, the slower and more painful the node rollout process becomes. Surge capacity may be needed, which drives up cost. The cluster admin should realise that, however easy and clean a brutal replacement may seem, it can bring downtime to the app and/or response-time fluctuation due to the change in pod count.
Therefore, both parties should come to terms with reality and reach a compromise that keeps everyone happy.
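As a purely illustrative starting point (the numbers are assumptions, not a recommendation): with 8 replicas, a percentage-based budget keeps the arithmetic easy for both sides.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 25%      # with 8 replicas, at most 2 pods can be evicted at any one time
  selector:
    matchLabels:
      app: nginx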
Summary
PodDisruptionBudget is quite important if your team has an SLA. Granted, it is not absolutely mandatory, as discussed before - if the cluster you manage has enough spare CPU/memory capacity, the rollout will, more often than not, finish uneventfully without impacting the workload. Nevertheless, a pdb is still the recommended way to stay in control in the event of a voluntary disruption.
Innablr is a leading cloud engineering and next generation platform consultancy. We are hiring! Want to be an Innablr? Reach out!