ReplicaSet Controllerは削除するPodをどうやって決めるのか

これは「Kubernetes Advent Calendar 2021 (2)」 21日目の記事です。昨年も参加させて頂いたのですが、1年経つの早すぎますね。

参加表明したものの何を書こうか全く決めていませんでしたので、最近勉強した内容をシェアさせてください。私が誤解している箇所やお気づきの点あれば、Twitterでおしえていただけると嬉しいです。

素朴な疑問
どこを調べるか
調べてみた
さいごに
参考

素朴な疑問

KubernetesでPodを作成したいとき、ご存知の通りDeploymentやReplicaSet を使う方法があります。DeploymentやReplicaSetのレプリカ数を増やした場合は、Schedulerのロジックによって適切なノードに追加のPodがスケジュールされるんだろうなと想像できます。Schedulerについては、いくつも有益な記事があります。例えば：

一方、ReplicaSetでレプリカ数を減らした場合に、削除するPodをどうやって選んでいるのか知りませんでした(知らないことに最近気づきました)。ググってみてもよくわからなかったので、今回調べてみます。

どこを調べるか

今回考えるのはDeployment及びReplicaSetを利用してPodを管理するときです。Deployment がReplicaSetを管理し、ReplicaSetはPodを管理するという依存関係にあるので、ReplicaSet Controllerを調べればよさそうです。

まず、ドキュメントだとこの辺に書かれていて、アルゴリズムも丁寧に説明されています。

ReplicaSet

A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. Usually, you define a Deployment and let that Deployment manage Re...(続く)

でもこれで終わりだと記事にならないので、ReplicaSet Controllerのコードを読んでみようと思います。ReplicaSet Controllerのコードは pkg/controller/replicaset にあります。 release-1.23ブランチを見ていきます。記事の中に該当箇所のコードを張ってますが、弊ブログのシンタックスハイライトがしょぼいせいで読みにくいと思います。ごめんなさい。

なお、KubernetesのContorollerや実装については、この辺の本がわかりやすいです。

調べてみた

syncReplicaSet()

syncReplicaSet() は、その名の通りReplicaSetの同期を行うメソッドです。

kubernetes/pkg/controller/replicaset/replica_set.go at release-1.23 · kubernetes/kubernetes

Production-Grade Container Scheduling and Management - kubernetes/kubernetes

まず与えられたkeyをnamespaceとnameに分割して、対象となるReplicaSetを取得しています。そして、このあたりでReplicaSetのSelectorを取得しています。

1	selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)

次にPodをフィルターします。まず、Namespace内の全Podを取得したあと、activeでないPodを除外し、さらに先程のSelectorにマッチするPodに絞り、filteredPods とします。

allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())

if err != nil {

return err

}

// Ignore inactive pods.

filteredPods := controller.FilterActivePods(allPods)

// NOTE: filteredPods are pointing to objects from cache - if you need to

// modify them, you need to copy it first.

filteredPods, err = rsc.claimPods(ctx, rs, selector, filteredPods)

このfilteredPods を引数にして、manageReplicas()を呼び出しています。

var manageReplicasErr error

if rsNeedsSync && rs.DeletionTimestamp == nil {

manageReplicasErr = rsc.manageReplicas(ctx, filteredPods, rs)

}

manageReplicas()

manageReplicas() が、ReplicaSetのレプリカ数を確認・更新します。僕が知りたかった内容はこのメソッドに書いてそうです。

kubernetes/pkg/controller/replicaset/replica_set.go at release-1.23 · kubernetes/kubernetes

Production-Grade Container Scheduling and Management - kubernetes/kubernetes

まず、対象となるReplicaSetオブジェクトで定義されている最新のレプリカ数とfilteredPods の数を比較し、diffを取ります。今回調べているのは、レプリカ数を減らしたときなので、diff > 0 のケースです。

diff := len(filteredPods) - int(*(rs.Spec.Replicas))

rsKey, err := controller.KeyFunc(rs)

if err != nil {

utilruntime.HandleError(fmt.Errorf("couldn't get key for %v %#v: %v", rsc.Kind, rs, err))

return nil

}

diff > 0 のケースでは、relatedPodsというものをgetIndirectlyRelatedPods() により取得しています。コードを読むとgetIndirectlyRelatedPods()は、対象のReplicaSetのOwnerが管理するPodをsliceを返していることがわかります。ここでいうOwnerとは、Deploymentのことです。

ん？それってrelatedPodsと何が違うの？という疑問が浮かぶかもしれません。これはDeploymentのローリングアップデートにおいて、新しいReplicaSetと古いReplicaSetが同時に存在するタイミングを考慮しています。

} else if diff > 0 {

if diff > rsc.burstReplicas {

diff = rsc.burstReplicas

}

klog.V(2).InfoS("Too many replicas", "replicaSet", klog.KObj(rs), "need", *(rs.Spec.Replicas), "deleting", diff)

relatedPods, err := rsc.getIndirectlyRelatedPods(rs)

utilruntime.HandleError(err)

そして、次の処理でついに削除すべきPodを決めているようです！getPodsToDelete() を見てみましょう。

1 2	// Choose which Pods to delete, preferring those in earlier phases of startup. podsToDelete := getPodsToDelete(filteredPods, relatedPods, diff)

getPodsToDelete()

getPodsToDelete() で、削除すべきPodを決めます。

kubernetes/pkg/controller/replicaset/replica_set.go at release-1.23 · kubernetes/kubernetes

Production-Grade Container Scheduling and Management - kubernetes/kubernetes

manageReplicas() で呼び出されるときの引数をおさらいしておきます。

filteredPods: 当該ReplicaSetと同じNamespace内のPodで、現在active かつSelectorにマッチするPod
relatedPods: 当該ReplicaSetと同じOwnerが管理するPod。Deploymentのローリングアップデート時を考慮
diff: filteredPods の数と当該ReplicaSetのレプリカ数の差

diffの定義から[filteredPodsの数] – diff = [ReplicaSetのレプリカ数] (≥0)なので、diff ≤ [filteredPodsの数]が成り立つとわかります。diff = [filteredPodsの数] のときは、何も考えずfilteredPodsを消せばよい。一方 diff < [filteredPodsの数]のときはどのPodを削除するか優先度をつける必要があります。

if文の中身を見ていきます。まずgetPodsRankedByRelatedPodsOnSameNode() でRankとやらをつけたPodたちを取得し、それをsort.Sort()で並び替え、先頭からdiff個のPodを削除対象としているのが見て取れます。

func getPodsToDelete(filteredPods, relatedPods []*v1.Pod, diff int) []*v1.Pod {

// No need to sort pods if we are about to delete all of them.

// diff will always be <= len(filteredPods), so not need to handle > case.

if diff < len(filteredPods) {

podsWithRanks := getPodsRankedByRelatedPodsOnSameNode(filteredPods, relatedPods)

sort.Sort(podsWithRanks)

reportSortingDeletionAgeRatioMetric(filteredPods, diff)

}

return filteredPods[:diff]

}

getPodsRankedByRelatedPodsOnSameNode()

kubernetes/pkg/controller/replicaset/replica_set.go at release-1.23 · kubernetes/kubernetes

Production-Grade Container Scheduling and Management - kubernetes/kubernetes

getPodsRankedByRelatedPodsOnSameNode() では、第一引数として与えられたPodのslice(=filteredPods)をラップし、各PodについてRankを計算して返します。なお、relatedPodsはRankを計算するのに使います。

podsOnNode というmapを定義しています。このmapのkeyはノード名が、valueにはrelatedPodsのうちkeyのノード上にあってかつactiveなPodの数が入ります。

// getPodsRankedByRelatedPodsOnSameNode returns an ActivePodsWithRanks value

// that wraps podsToRank and assigns each pod a rank equal to the number of

// active pods in relatedPods that are colocated on the same node with the pod.

// relatedPods generally should be a superset of podsToRank.

func getPodsRankedByRelatedPodsOnSameNode(podsToRank, relatedPods []*v1.Pod) controller.ActivePodsWithRanks {

podsOnNode := make(map[string]int)

for _, pod := range relatedPods {

if controller.IsPodActive(pod) {

podsOnNode[pod.Spec.NodeName]++

}

次にPodのRank付けを行います。第一引数として与えられたPodのslice(=filteredPods)の各Podがどのノード上にあるか見ていき、そのノードをキーとして対応するpodsOnNodeの値をRankとします。そして、PodにRankと現在のタイムスタンプ(Now: metav1.Now())を付与して返します。現在の時間もソート時に考慮されます。

ranks := make([]int, len(podsToRank))

for i, pod := range podsToRank {

ranks[i] = podsOnNode[pod.Spec.NodeName]

}

return controller.ActivePodsWithRanks{Pods: podsToRank, Rank: ranks, Now: metav1.Now()}

}

getPodsToDelete()のつづき

getPodsToDelete() の続きを見ていきます。getPodsRankedByRelatedPodsOnSameNode() によって返されたRank付きのPodの情報podsWithRanks をソートします。

func getPodsToDelete(filteredPods, relatedPods []*v1.Pod, diff int) []*v1.Pod {

// No need to sort pods if we are about to delete all of them.

// diff will always be <= len(filteredPods), so not need to handle > case.

if diff < len(filteredPods) {

podsWithRanks := getPodsRankedByRelatedPodsOnSameNode(filteredPods, relatedPods)

sort.Sort(podsWithRanks)

reportSortingDeletionAgeRatioMetric(filteredPods, diff)

}

return filteredPods[:diff]

}

ソートはsort.Sort()で行っているようです。またpodsWithRanksはcontroller.ActivePodsWithRanks型です。したがって、controllerパッケージのActivePodsWithRanks構造体の実装を見てみます。以下のようになっています。

100

type ActivePodsWithRanks struct {

// Pods is a list of pods.

Pods []*v1.Pod

// Rank is a ranking of pods. This ranking is used during sorting when

// comparing two pods that are both scheduled, in the same phase, and

// having the same ready status.

Rank []int

// Now is a reference timestamp for doing logarithmic timestamp comparisons.

// If zero, comparison happens without scaling.

Now metav1.Time

}

func (s ActivePodsWithRanks) Len() int {

return len(s.Pods)

}

func (s ActivePodsWithRanks) Swap(i, j int) {

s.Pods[i], s.Pods[j] = s.Pods[j], s.Pods[i]

s.Rank[i], s.Rank[j] = s.Rank[j], s.Rank[i]

}

// Less compares two pods with corresponding ranks and returns true if the first

// one should be preferred for deletion.

func (s ActivePodsWithRanks) Less(i, j int) bool {

// 1. Unassigned < assigned

// If only one of the pods is unassigned, the unassigned one is smaller

if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {

return len(s.Pods[i].Spec.NodeName) == 0

}

// 2. PodPending < PodUnknown < PodRunning

if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {

return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]

}

// 3. Not ready < ready

// If only one of the pods is not ready, the not ready one is smaller

if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {

return !podutil.IsPodReady(s.Pods[i])

}

// 4. lower pod-deletion-cost < higher pod-deletion cost

if utilfeature.DefaultFeatureGate.Enabled(features.PodDeletionCost) {

pi, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[i].Annotations)

pj, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[j].Annotations)

if pi != pj {

return pi < pj

}

// 5. Doubled up < not doubled up

// If one of the two pods is on the same node as one or more additional

// ready pods that belong to the same replicaset, whichever pod has more

// colocated ready pods is less

if s.Rank[i] != s.Rank[j] {

return s.Rank[i] > s.Rank[j]

}

// TODO: take availability into account when we push minReadySeconds information from deployment into pods,

// see https://github.com/kubernetes/kubernetes/issues/22065

// 6. Been ready for empty time < less time < more time

// If both pods are ready, the latest ready one is smaller

if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {

readyTime1 := podReadyTime(s.Pods[i])

readyTime2 := podReadyTime(s.Pods[j])

if !readyTime1.Equal(readyTime2) {

if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {

return afterOrZero(readyTime1, readyTime2)

} else {

if s.Now.IsZero() || readyTime1.IsZero() || readyTime2.IsZero() {

return afterOrZero(readyTime1, readyTime2)

}

rankDiff := logarithmicRankDiff(*readyTime1, *readyTime2, s.Now)

if rankDiff == 0 {

return s.Pods[i].UID < s.Pods[j].UID

}

return rankDiff < 0

}

// 7. Pods with containers with higher restart counts < lower restart counts

if maxContainerRestarts(s.Pods[i]) != maxContainerRestarts(s.Pods[j]) {

return maxContainerRestarts(s.Pods[i]) > maxContainerRestarts(s.Pods[j])

}

// 8. Empty creation time pods < newer pods < older pods

if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {

if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {

return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)

} else {

if s.Now.IsZero() || s.Pods[i].CreationTimestamp.IsZero() || s.Pods[j].CreationTimestamp.IsZero() {

return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)

}

rankDiff := logarithmicRankDiff(s.Pods[i].CreationTimestamp, s.Pods[j].CreationTimestamp, s.Now)

if rankDiff == 0 {

return s.Pods[i].UID < s.Pods[j].UID

}

return rankDiff < 0

}

return false

}

ご存知の通り、sort.Interfaceはこのようになっていて、構造体のソートはLen(), Swap(), Less()を用意すれば実現できます。

ActivePodsWithRanks構造体に関しては、Len(), Swap()はまあ想像通りです。一方、Less()がややこしそうですが、コメントで丁寧に説明されています。まとめると以下のようになります。ソートされたPodのうち”小さい”ものから削除されていきます。

Unassigned < assigned: NodeにアサインされていないPodが小さい
PodPending < PodUnknown < PodRunning: PodのPhaseによる判定
Not ready < ready: Not ReadyなPodが小さい
lower pod-deletion-cost < higher pod-deletion cost: Pod Deletion Costが小さいPodが小さい
Doubled up < not doubled up: 2重化されたPod(doubled-up pods)が小さい
Been ready for empty time < less time < more time: Readyになった時間が一番若いPodが小さい
Pods with containers with higher restart counts < lower restart counts: コンテナのRestart回数多いPodが小さい
Empty creation time pods < newer pods < older pods: 作成時間が若いPodが小さい。作成時間が空のPodはもっと小さい

たくさんありますが、当然そうだろうなというものがほとんどです。例えば、PendingなPodの削除を優先したほうがいいだろうし、コンテナのRestartが少ないPodは安定して動いているので削除しないほうがいいだろうし。ただ5. Doubled up < not doubled upはなんでや？そもそもdoubled upとは？と私には理解できませんでした。

これがmanageReplicas()の節にも少し書いたDeploymentのローリングアップデートへの考慮です。以下のCommitログにいろいろ説明が載っていました。

manageReplicas()のつづき

最後にmanageReplicas()に戻って記事も終わりたいと思います。Rank付けをしたりソートもしたりして、削除対象のPodが決まりました。その名の通りpodsToDeleteという変数がそれを保持しています。そしてpodsToDeleteをforで回しつつ、goroutineを使って並列処理でPodを削除しているのがわかります。podControl.DeletePod()は、Pod IDを指定してPodを削除するメソッドです。

// Snapshot the UIDs (ns/name) of the pods we're expecting to see

// deleted, so we know to record their expectations exactly once either

// when we see it as an update of the deletion timestamp, or as a delete.

// Note that if the labels on a pod/rs change in a way that the pod gets

// orphaned, the rs will only wake up after the expectations have

// expired even if other pods are deleted.

rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))

errCh := make(chan error, diff)

var wg sync.WaitGroup

wg.Add(diff)

for _, pod := range podsToDelete {

go func(targetPod *v1.Pod) {

defer wg.Done()

if err := rsc.podControl.DeletePod(ctx, rs.Namespace, targetPod.Name, rs); err != nil {

// Decrement the expected number of deletes because the informer won't observe this deletion

podKey := controller.PodKey(targetPod)

rsc.expectations.DeletionObserved(rsKey, podKey)

if !apierrors.IsNotFound(err) {

klog.V(2).Infof("Failed to delete %v, decremented expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)

errCh <- err

}

}(pod)

}

wg.Wait()

select {

case err := <-errCh:

// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.

if err != nil {

return err

}

default:

}

return nil

これで無事に、ReplicaSet Controllerは削除対象のPodを決定し、削除することができました！えらい！

さいごに

2021年ありがとうございました！ブログ更新めちゃくっちゃサボった年だったので、2022年は頑張りたいです。

素朴な疑問

どこを調べるか

調べてみた

syncReplicaSet()

manageReplicas()

getPodsToDelete()

getPodsRankedByRelatedPodsOnSameNode()

getPodsToDelete()のつづき

関連するCommit 980b640

manageReplicas()のつづき

さいごに

参考

コメント