By default, Kubernetes schedules workloads (pods) onto nodes based mainly on their resource requests (CPU, memory, ephemeral storage). This works fine for the vast majority of users, where neither nodes nor pods have anything special about them.

Now, imagine that a subset of your nodes has a specific type of hardware, TPUs, necessary for machine learning workloads. Only some of your workloads need access to TPU-enabled nodes.

How can we schedule pods on the nodes that have specialized hardware?

While one might make TPUs ubiquitous and have them available on every node in the cluster, this would likely be wasteful and inefficient, as not all nodes would be running applications that require TPUs.

Kubernetes has just the right mechanism to solve this problem: nodeSelector. It is a way for a workload to select the nodes it runs on based on node labels.

You can provide multiple key-value pairs within nodeSelector. All of them are ANDed, meaning that for a pod to be scheduled on a node, the node must match all of the labels set via nodeSelector. A node can have many labels that are not used by the nodeSelector of this particular workload but can be useful elsewhere.


In our TPU example, the pod's selector would look like this:

    nodeSelector:
      tpu: enabled
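For this selector to match, a node must carry the corresponding label. As a sketch, here is the relevant excerpt of such a Node object (the node name `node-1` is a hypothetical placeholder):

```yaml
# Excerpt of a Node object; in practice the label is usually applied
# with `kubectl label nodes node-1 tpu=enabled` or by the node provisioner.
apiVersion: v1
kind: Node
metadata:
  name: node-1        # hypothetical node name
  labels:
    tpu: enabled      # the label matched by the pod's nodeSelector
```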

Important takeaways about nodeSelector:

  • Defined by the workload as a way to express hard requirements for nodes.
  • When multiple key-value pairs are provided, all must be satisfied (AND logic) by the labels on a node.
  • Can result in unschedulable pods if no nodes meet the requirements.
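To illustrate the AND semantics, here is a sketch of a nodeSelector with two entries (topology.kubernetes.io/zone is a well-known Kubernetes label; the zone value is a hypothetical placeholder):

```yaml
nodeSelector:
  tpu: enabled
  topology.kubernetes.io/zone: us-east-1a  # hypothetical zone
```

A node must carry both labels for the pod to be scheduled on it; matching only one is not enough.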


The explanation for the above diagram:

  • The first example is straightforward: the pod matches a node with exactly the same labels.
  • The 2nd example is more interesting. The pod’s criterion is to select nodes that match shape: triangle, which is met by two different kinds of nodes (triangle and empty-triangle).
  • The 3rd example showcases that nodes may have extra labels, and the pod simply ignores any keys outside of its nodeSelector criteria. This is how colour: green is a sufficient criterion for the pod to get scheduled on the diamond node.
  • The last example is a warning to use nodeSelector with care. When your pod’s node selector criteria are not met by any of the existing nodes, it will remain unscheduled and Pending. This can have disastrous consequences in production.

How can nodes only accept certain kinds of pods?

Great, we are now able to schedule our pods onto nodes with TPUs. However, we’ll soon notice that our TPU nodes are running a lot of workloads that do not need TPUs at all, which leaves less capacity for TPU workloads. Our next challenge is to prevent non-TPU workloads from being scheduled on these nodes. This way we keep the entire node capacity dedicated to TPU workloads.

Kubernetes has considered this scenario as well. We need to apply taints to our nodes and tolerations to our pods. A node with a NoSchedule-effect taint effectively becomes unschedulable for all pods except those with a matching toleration.

In our TPU example, the node’s YAML may look like this:

    taints:
    - effect: NoSchedule
      key: tpu
      value: enabled

while pods that want to be scheduled on TPU nodes will have to match the taint with a toleration:

    tolerations:
    - key: "tpu"
      operator: "Equal"
      value: "enabled"
      effect: "NoSchedule"
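As an aside, a toleration does not have to match the taint's value exactly; with the Exists operator it matches any taint with the given key, a variant worth knowing about:

```yaml
tolerations:
- key: "tpu"
  operator: "Exists"   # tolerates the tpu taint regardless of its value
  effect: "NoSchedule"
```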

Important takeaways about taints and tolerations:

  • Nodes define taints as a way to forbid accidental workload scheduling.
  • Pods define tolerations to comply with node’s taints and get scheduled.
  • When multiple taints are provided, all must be satisfied (AND logic) by the tolerations on pods.
  • Overly restrictive taints can leave nodes that no pod is able to schedule onto, wasting capacity.


Explanation for the above diagram:

  • The first node can schedule the 1st pod because the pod matches the colour: orange taint with a toleration.
  • The second node can schedule the 1st and 2nd pods because both tolerate shape: triangle.
  • The third node has no taints and can schedule any pod.
  • The fourth node cannot schedule any pod because no pods have matching tolerations.

Combining nodeSelector and tolerations together

The above techniques can be used in isolation; however, for predictable, reliable scheduling, I recommend always combining nodeSelector and tolerations. This way, the intention to schedule a pod on a specific kind of node is very clearly defined, while nodes are protected from accidentally scheduling generic workloads that can run elsewhere.
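Putting it together, here is a minimal pod sketch for our TPU example (the pod name, container name, and image are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-workload            # hypothetical name
spec:
  # Hard requirement: only nodes labelled tpu=enabled are considered.
  nodeSelector:
    tpu: enabled
  # Permission: tolerate the NoSchedule taint that keeps generic pods off.
  tolerations:
  - key: "tpu"
    operator: "Equal"
    value: "enabled"
    effect: "NoSchedule"
  containers:
  - name: trainer               # hypothetical container name
    image: registry.example.com/tpu-trainer:latest  # hypothetical image
```

The nodeSelector steers the pod toward TPU nodes, while the toleration lets it through the taint; either one alone leaves a gap.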