What Is PodDisruptionBudget, And Why You Should Use It in Kubernetes
What is a disruption?
A keen reader like yourself has definitely noticed the word disruption in the word PodDisruptionBudget. What is a disruption?
A disruption, for the purposes of this blog post, refers to an event in which a pod needs to be killed and respawned. Disruptions are inevitable, and they need to be handled delicately, otherwise we will have an outage. Imagine we have a service but none of its backing pods are available. Bad, right?
A PodDisruptionBudget (`pdb`) is a useful layer of defence provided by Kubernetes to deal with this kind of issue.
If you have a Kubernetes cluster currently running in production, try running:

```shell
kubectl get pdb --all-namespaces
```

Chances are, you will see some `pdb`s floating around in various namespaces. Of course, `kubectl` might tell you there isn't any `pdb` in this cluster. In that case, please read on, as I will discuss what a `pdb` is, and why they are essential in production landscapes.
`pdb` stands for PodDisruptionBudget. The Kubernetes API introduced this resource as a beta feature in version 1.5. The API remained fairly stable until version 1.21, when it was graduated and promoted to `policy/v1`. The older `policy/v1beta1` is currently deprecated, and will be removed in version 1.25.
A `pdb` defines a budget of voluntary disruption. In essence, a human operator makes the cluster aware of a minimum threshold of available pods that the cluster needs to guarantee in order to ensure a baseline of availability or performance. The word budget is used as in error budget, in the sense that any voluntary disruption within this budget is acceptable.
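To make the budget arithmetic concrete, here is a minimal sketch (not the actual controller code) of how a budget translates into the number of evictions currently allowed. The function name and the simplification of percentage rounding are my own assumptions for illustration:

```python
from math import ceil

def disruptions_allowed(healthy_pods: int, total_pods: int, min_available) -> int:
    """Illustrative sketch: how many voluntary evictions fit in the budget.

    min_available may be an absolute number (e.g. 1) or a percentage
    string (e.g. "50%"), mirroring the two forms a pdb accepts.
    """
    if isinstance(min_available, str) and min_available.endswith("%"):
        # Percentages are resolved against the expected pod count, rounding up.
        required = ceil(total_pods * int(min_available[:-1]) / 100)
    else:
        required = int(min_available)
    # The budget can never go negative.
    return max(0, healthy_pods - required)

print(disruptions_allowed(2, 2, 1))      # 1: one voluntary eviction fits the budget
print(disruptions_allowed(2, 2, "100%")) # 0: no eviction is allowed
```

With `minAvailable: 1` and two healthy pods, one pod may be evicted at a time; with `minAvailable: 100%`, the budget is always zero.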
A simple `pdb` looks like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
```
If we take a closer look at this sample, we will notice that:
- It selects other resources based on labels.
- It demands that there be at least one pod running.
There are three possible fields in a `pdb` spec:
- `.spec.selector` (required): a selector field specifying the pods to which this `pdb` is applied. A `pdb` can be applied to pods managed by the built-in controllers (Deployment, ReplicaSet, StatefulSet, ReplicationController). Most commonly, a `pdb` is used in conjunction with a Deployment. You can also use a `pdb` with arbitrary controllers, but that is beyond the scope of this blog.
- `.spec.minAvailable` / `.spec.maxUnavailable` (one of the two is required, but not both together): You can define `minAvailable` or `maxUnavailable` as an absolute number (i.e. have at least two pods available / at most two pods unavailable) or as a percentage (i.e. have at least 10% of pods available / at most 10% of pods unavailable).
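For illustration, here are the two forms side by side, reusing the `app: nginx` selector from the earlier sample (the metadata names are assumed):

```yaml
# Absolute form: keep at least 2 matching pods available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-absolute
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
---
# Percentage form: allow at most 10% of matching pods to be unavailable.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-percentage
spec:
  maxUnavailable: "10%"
  selector:
    matchLabels:
      app: nginx
```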
If you want to see a `pdb` in action, please check out my sample repository here. If you haven't used a `pdb` before, this demo will help you understand the concepts that follow.
You may have noticed that I said a `pdb` is only good for voluntary disruption.
Voluntary disruptions can take the form of:
- A node group replacement, due to an incompatible change or a cluster upgrade.
- Scaling nodes up/down.
Oftentimes, the responsibility of managing an application workload is separated from the responsibility of managing the cluster, with each usually picked up by a separate team, such as an application team and a platform team.
There can be a conflict of interest between them:
- An application team wants their apps running at all times, with 100% availability and the endpoints as responsive as possible.
- A platform team needs to make changes to the cluster. Those changes will take down nodes, along with the pods running on them.
A `pdb` is, in all fairness, a compromise between the application team and the platform team. The application team acknowledges the necessity of scheduled/voluntary disruptions, and provides a guideline that assists the platform team in carrying out the rollout.
Of course, there are involuntary disruptions as well, such as an electricity outage or a node kernel crash. Understandably, a `pdb` won't protect your workload from those.
What if I don’t have a `pdb` in my cluster?
Your workload might go offline when a cluster maintenance event takes place. Yes, even if you have `replicas` set to a value greater than 1.
For the sake of brevity, let's say you have a deployment with `replicas = 2`. If both pods are scheduled onto a single node, what will happen when that particular node is recycled? Downtime, until the pods are re-scheduled onto another node.
What if I have `podAntiAffinity` in place to ensure my pods span multiple nodes? Still not 100% safe. Let's reuse our `nginx` example and do a thought experiment. Now, we have two `nginx` pods: `nginx-1` running on node `A`, and `nginx-2` running on node `B`. A scheduled node rollout is imminent, and node `A` starts evicting pods.
`nginx-1` on node `A` is destroyed, and goes back to `Pending`. If the node group is large enough with plenty of spare computing capacity, `nginx-1` might be rescheduled immediately onto another node, i.e. node `C`. However, what if the node group is already fully utilised? `nginx-1` will stay in `Pending`. If the controller manager, unaware of the seriousness of the situation, then sends node `B` to be killed, `nginx-2` will go back to `Pending` too. We have none of the `nginx` pods alive, a.k.a. downtime.
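For reference, the `podAntiAffinity` used in this thought experiment could be sketched like this inside the Deployment's pod template (the `app: nginx` label and the hostname topology key are assumptions for illustration):

```yaml
# Sketch: require nginx pods to land on different nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx
        topologyKey: kubernetes.io/hostname
```

Note that, as the scenario shows, spreading pods across nodes only reduces the blast radius of a single node failure; it does not stop a rollout from taking both nodes down in sequence.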
If a `pdb` is set inside your cluster, the controller manager will honour the clause by maintaining a minimum count of available replicas. In example 1, that means a pod will be spawned on another node before the original node is scrapped. In example 2, `nginx-1` will be spawned onto another node before `nginx-2`, and in turn node `B`, is destroyed.
Without a `pdb`, you are essentially leaving availability to probability. It probably will be fine.
I have set `maxUnavailable` on my `Deployment`, do I still need a `pdb`?
Yes. `maxUnavailable` on a `Deployment` mandates the behaviour when a pod rollout is performed. A `pdb`, on the other hand, mandates the behaviour when a node rollout is being performed.
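To sketch the two knobs side by side (the names and the `nginx` image are assumptions for illustration): the first governs pod replacement during an image rollout, the second governs evictions during node maintenance such as a drain.

```yaml
# Deployment knob: applies when you roll out a new pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # during an image rollout, never drop below 2 pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
---
# pdb knob: applies when nodes are drained for maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
```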
What if I set `minAvailable` to 100%?
This is equal to setting `maxUnavailable` to 0%. Kubernetes will ensure that none of your pods is ever taken down by a voluntary disruption.
In practice, what happens is that when the cluster administrator starts a rollout, the process never completes. The nodes running your pods never finish draining, because none of the pods is allowed to go offline. There is no disruption budget at all.
You are making the life of your humble platform team miserable. Don’t do that.
How should I decide on the numbers?
The exact number or percentage should be decided by a concerted effort between the application team (app owner) and the platform team (cluster admin).
The app owner should understand that the higher `minAvailable` goes, the slower and more difficult the node rollout process will be. Surge capacity may be needed, which can drive up cost. The cluster admin should realise that, as easy and clean as a brutal replacement may be, it can bring downtime to the app and/or response-time fluctuation due to the change in pod counts.
Therefore, both parties should come to terms with reality and reach a compromise that makes everyone happy.
PodDisruptionBudget is quite important if your team has an SLA. Granted, it is not absolutely mandatory, as discussed before: if the cluster you manage has enough spare capacity in CPU/memory, the rollout can, more often than not, finish uneventfully without impacting the workload. Nevertheless, it is still recommended as a way to stay in control in the event of a voluntary disruption.
Innablr is a leading cloud engineering and next generation platform consultancy. We are hiring! Want to be an Innablr? Reach out!