A keen reader like yourself has definitely noticed the word disruption in PodDisruptionBudget. What is a disruption?
A disruption, for the purpose of this blog post, refers to an event where a pod needs to be killed and respawned. Disruptions are inevitable, and they need to be handled delicately, otherwise we will have an outage. Imagine having a service where none of the backing pods is available. Bad, right?
PodDisruptionBudget (pdb) is a useful layer of defence provided by Kubernetes to deal with this kind of issue.
If you have a Kubernetes cluster currently running in production, try running
kubectl get pdb --all-namespaces
Chances are, you will see some pdbs floating around in various namespaces. Of course, kubectl might tell you there isn't any pdb in this cluster. In that case, please read on, as I will discuss what a pdb is, and why they are essential in production landscapes.
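For a cluster that does have budgets defined, the output looks something like this (illustrative only; the names and values below are made up):
NAMESPACE     NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
default       nginx-pdb     1               N/A               1                     30d
kube-system   coredns-pdb   1               N/A               1                     120d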
What is a pdb?
pdb stands for PodDisruptionBudget. The Kubernetes API introduced this resource under the policy/v1beta1 group. The API remained fairly stable until version 1.21, when it was graduated and promoted to policy/v1. policy/v1beta1 is currently deprecated, and will be removed in version 1.25.
A pdb defines a budget of voluntary disruption. In essence, a human operator lets the cluster know the minimum number of available pods that the cluster needs to guarantee in order to ensure baseline availability or performance. The word budget is used as in error budget, in the sense that any voluntary disruption within this budget is acceptable.
A typical pdb looks like this:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
If we take a closer look at this sample, we will notice that there are three possible fields in a pdb:
.spec.selector (required): a label selector specifying the pods to which this pdb applies. A pdb can be applied to pods managed by one of the built-in controllers: Deployment, ReplicaSet, StatefulSet or ReplicationController. Most frequently, a pdb is used in conjunction with a Deployment. You can also use a pdb with arbitrary controllers, though that is beyond the scope of this blog.
.spec.minAvailable or .spec.maxUnavailable (required; either but not both): you can define minAvailable or maxUnavailable as an absolute number (e.g. have at least two pods available / at most two pods unavailable) or as a percentage (e.g. have at least 10% of pods available / at most 10% of pods unavailable). A percentage-based variant is sketched right after this list.
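For instance, a pdb that tolerates losing up to a quarter of the matching pods (a sketch; the name, label, and percentage are arbitrary choices of mine) might look like:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: nginx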
If you want to see a pdb in action, please check out my sample repository here. If you haven't used pdb before, this demo will help you understand the concepts that follow.
You may have noticed that I said a pdb is only good for voluntary disruptions. Voluntary disruptions can take the form of, for example:
- a cluster administrator draining a node for repair, upgrade, or scale-down
- an application owner deleting a deployment, or updating a pod template and triggering a restart
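The archetypal voluntary disruption is a node drain, which evicts pods while honouring any pdb that covers them (node-a below is a hypothetical node name):
kubectl drain node-a --ignore-daemonsets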
Oftentimes, the responsibility of managing an application workload is separated from the responsibility of managing the cluster, with each usually picked up by a separate team, such as an application team and a platform team. There can be a conflict of interest between them: the application team wants its workload available at all times, while the platform team needs to disrupt workloads from time to time to patch, upgrade, and resize nodes.
A pdb is, in all fairness, a compromise between an application team and a platform team. The application team acknowledges the necessity of scheduled/voluntary disruptions, and provides a guideline to assist in completing the rollout, which is carried out by the platform team.
Of course, there are involuntary disruptions as well, such as a power outage or a node kernel crash. Understandably, a pdb won't protect your workload from those.
Why do I need a pdb in my cluster?
Your workload might go offline when a cluster maintenance event takes place. Yes, even if you have replicas set to a value greater than 1.
For the sake of brevity, let's say you have a deployment of nginx with replicas = 2. If both pods are scheduled onto a single node, what will happen when that particular node is recycled? Downtime, until the pods are re-scheduled onto another node.
What if I have podAntiAffinity in place to ensure my pods span multiple nodes? Still not 100% safe. Let's reuse our nginx example and do a thought experiment. Now, we have two nginx pods (nginx-1 and nginx-2) running on nodes A and B. A scheduled node rollout is imminent, and node A starts evicting pods. nginx-1 on A is destroyed and goes back to Pending. If the node group is large enough, with plenty of spare computing capacity, nginx-1 might be rescheduled immediately onto another node, e.g. C. However, what if the node group is already fully utilised? nginx-1 will stay in Pending. If the controller manager, unaware of the seriousness of the situation, then sends B to be killed, nginx-2 will go back to Pending too. None of the nginx pods is alive, a.k.a. downtime.
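For reference, the kind of podAntiAffinity clause assumed in this thought experiment would sit in the deployment's pod template and look roughly like this (the app label is carried over from the earlier sample):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx
        topologyKey: kubernetes.io/hostname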
When a pdb is set inside your cluster, the controller manager will honour the clause by maintaining a minimum count of available replicas. In example 1, that would mean a new pod is spawned on another node before the original node is scrapped. In example 2, nginx-1 will spawn onto another node before nginx-2, and in turn node B, is destroyed.
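You can see this protection during a drain: when evicting a pod would violate the budget, the eviction is refused and retried until a replacement pod is available. The output looks roughly like this (illustrative; pod and namespace names are made up):
evicting pod default/nginx-2
error when evicting pods/"nginx-2" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.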
Without a pdb, you are essentially leaving availability to probability. It probably will be fine.
If I have maxUnavailable on my Deployment, do I still need a pdb?
Yes. maxUnavailable in a Deployment mandates the behaviour when a pod rollout is called. A pdb, on the other hand, mandates the behaviour when a node rollout is being performed.
What if I set minAvailable to 100%?
This is equal to setting maxUnavailable to 0%. Kubernetes will ensure that none of your pods will ever be disrupted by a voluntary disruption.
In practice, what happens is that when the cluster administrator starts a rollout, the process never completes. The nodes hosting your pods never finish draining, because none of the pods is allowed to go offline. There is no disruption budget at all.
You are making the life of your humble platform team miserable. Don't do that.
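In other words, avoid a spec like this (a sketch of the anti-pattern):
spec:
  minAvailable: "100%"   # zero budget: every eviction is refused, drains block forever
  selector:
    matchLabels:
      app: nginx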
The exact number or percentage should be decided by a concerted effort between the application team (app owner) and the platform team (cluster admin).
The app owner should understand that the higher minAvailable goes, the slower and more difficult the node rollout process becomes. Surge capacity may be needed, which drives up cost. The cluster admin should realise that, however easy and clean a brutal replacement may seem, it can bring downtime to the app and/or response-time fluctuation due to changes in pod count.
Therefore, both parties should come to terms with reality and reach a compromise that keeps everyone happy.
PodDisruptionBudget is quite important if your team has an SLA. Granted, it is not absolutely mandatory, as discussed before - if the cluster you manage has enough spare capacity in CPU/memory, a rollout can, more often than not, finish uneventfully without impacting the workload. Nevertheless, it is still recommended, as it gives you control in the event of a voluntary disruption.
Innablr is a leading cloud engineering and next generation platform consultancy. We are hiring! Want to be an Innablr? Reach out!