Kubernetes Eviction Manager Source Code Analysis
Where the Kubernetes Eviction Manager Is Started
When the kubelet instantiates its Kubelet object, it calls eviction.NewManager to create an evictionManager object.
pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
	...
	thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
	}
	...
	// setup eviction manager
	evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
	}
	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
	...
}
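When the kubelet starts working in its Run method, it launches a goroutine that calls updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the runtime is up, it calls initializeRuntimeDependentModules through kl.oneTimeInitializer.Do to initialize the modules that depend on the runtime.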
pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
}
pkg/kubelet/kubelet.go:2040
func (kl *Kubelet) updateRuntimeUp() {
	...
	kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
	...
}
Following the code into initializeRuntimeDependentModules shows that the runtime-dependent modules are cadvisor and evictionManager, and that initializing them simply means calling each module's Start method.
pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
	if err := kl.cadvisor.Start(); err != nil {
		// Fail kubelet and rely on the babysitter to retry starting kubelet.
		// TODO(random-liu): Add backoff logic in the babysitter
		glog.Fatalf("Failed to start cAdvisor %v", err)
	}
	// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
	if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
		kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
	}
}
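The pattern used here, a periodic check that triggers a one-time initialization once a precondition holds, can be sketched in isolation. The following is a minimal illustration, not kubelet code: the node type, its fields and the simulate helper are hypothetical names, and the sketch assumes the one-time guard behaves like Go's sync.Once.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for the kubelet; none of these names
// come from the Kubernetes source.
type node struct {
	once         sync.Once
	runtimeReady bool
	started      []string // records the order in which modules were started
}

// initModules plays the role of initializeRuntimeDependentModules:
// cadvisor first, then the eviction manager.
func (n *node) initModules() {
	n.started = append(n.started, "cadvisor")
	n.started = append(n.started, "evictionManager")
}

// updateRuntimeUp is called on every tick. Initialization only happens
// after the runtime is ready, and sync.Once guarantees it happens once.
func (n *node) updateRuntimeUp() {
	if !n.runtimeReady {
		return
	}
	n.once.Do(n.initModules)
}

// simulate runs the given number of ticks; the runtime becomes ready
// at tick readyAt. It returns the modules that were started.
func simulate(ticks, readyAt int) []string {
	n := &node{}
	for i := 0; i < ticks; i++ {
		if i == readyAt {
			n.runtimeReady = true
		}
		n.updateRuntimeUp()
	}
	return n.started
}

func main() {
	// The runtime becomes ready on the third tick; the modules are
	// started exactly once even though two more ticks follow.
	fmt.Println(simulate(5, 2))
}
```

However many ticks follow the moment the runtime becomes ready, each module appears in the started list only once, which is the property the kubelet relies on here.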
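From here on, the analysis moves into evictionManager itself.
Definition of the Kubernetes Eviction Manager
As shown above, the kubelet starts evictionManager while initializing its runtime-dependent modules during startup. Before looking at that startup, we first need to see how the Eviction Manager is defined.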
pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
	// used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []Threshold
	// resourceToRankFunc maps a resource to ranking function for that resource.
	resourceToRankFunc map[v1.ResourceName]rankFunc
	// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
	notifiersInitialized bool
}
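managerImpl is the concrete definition of evictionManager. The fields to focus on are:
config
- The evictionManager configuration, which includes:
PressureTransitionPeriod (--eviction-pressure-transition-period)
MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)
Thresholds (--eviction-hard, --eviction-soft)
KernelMemcgNotification (--experimental-kernel-memcg-notification)
killPodFunc
- The function used to kill a pod during eviction. When the kubelet calls NewManager, it passes in the function returned by killPodNow (pkg/kubelet/pod_workers.go:285).
imageGC
- When the node has the DiskPressure condition, imageGC deletes unused images to reclaim disk space.
summaryProvider
- Provides the latest summary of stats for the node and for all pods on it, i.e. NodeStats and []PodStats.
thresholdsFirstObservedAt
- Records the time at which each threshold was first observed.
thresholdsMet
- Holds the thresholds that have been met, grace period included, but not yet resolved. In synchronize it is assigned from the result of thresholdsMetGracePeriod.
resourceToRankFunc
- Defines, for each resource, the ranking function used to choose pods for eviction.
resourceToNodeReclaimFuncs
- Defines, for each resource, the ordered list of functions called to reclaim that resource at node level.
lastObservations
- The eviction signal observations from the previous synchronize pass. Comparing against them ensures that thresholds are only updated from observations in the correct time order.
notifiersInitialized
- A bool indicating whether the threshold notifiers have been initialized, which is what allows kernel memcg notification to be used to speed up eviction. It is false when the manager is created; whether kernel memcg notification is used at all depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.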
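How thresholdsFirstObservedAt and the grace period work together may be easier to see in a small model. The sketch below is an assumption-laden simplification, not the Kubernetes implementation: thresholds are keyed by signal name only, and the threshold type and the helpers soft, at, firstObserved and metGracePeriod are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// threshold is a simplified, hypothetical eviction threshold.
// A gracePeriod of 0 models a hard threshold.
type threshold struct {
	signal      string
	gracePeriod time.Duration
}

// soft builds a soft threshold with a grace period given in seconds.
func soft(signal string, graceSeconds int64) threshold {
	return threshold{signal: signal, gracePeriod: time.Duration(graceSeconds) * time.Second}
}

// at returns a fixed point in time, graceSeconds-style, for deterministic examples.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// firstObserved keeps the earlier timestamp for thresholds that are still
// met, records now for newly met thresholds, and drops the rest.
func firstObserved(met []threshold, last map[string]time.Time, now time.Time) map[string]time.Time {
	result := map[string]time.Time{}
	for _, t := range met {
		if seen, ok := last[t.signal]; ok {
			result[t.signal] = seen
		} else {
			result[t.signal] = now
		}
	}
	return result
}

// metGracePeriod returns the thresholds that have been continuously
// observed for at least their grace period.
func metGracePeriod(met []threshold, observedAt map[string]time.Time, now time.Time) []threshold {
	var out []threshold
	for _, t := range met {
		seen, ok := observedAt[t.signal]
		if !ok || now.Sub(seen) < t.gracePeriod {
			continue
		}
		out = append(out, t)
	}
	return out
}

func main() {
	met := []threshold{soft("memory.available", 30)}
	seen := firstObserved(met, nil, at(0))
	fmt.Println(len(metGracePeriod(met, seen, at(10)))) // still inside the grace period
	seen = firstObserved(met, seen, at(40))
	fmt.Println(len(metGracePeriod(met, seen, at(40)))) // grace period has elapsed
}
```

In this model a soft threshold only becomes actionable once it has stayed met for the whole grace period, while a hard threshold (grace period 0) is actionable on the pass in which it is first observed.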
The kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager. The code of eviction.NewManager is simple: it just assigns fields.
pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
	summaryProvider stats.SummaryProvider,
	config Config,
	killPodFunc KillPodFunc,
	imageGC ImageGC,
	recorder record.EventRecorder,
	nodeRef *v1.ObjectReference,
	clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
	manager := &managerImpl{
		clock:           clock,
		killPodFunc:     killPodFunc,
		imageGC:         imageGC,
		config:          config,
		recorder:        recorder,
		summaryProvider: summaryProvider,
		nodeRef:         nodeRef,
		nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
		thresholdsFirstObservedAt:    thresholdsObservedAt{},
	}
	return manager, manager, nil
}
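One point is important, though: besides the evictionManager object, NewManager also returns a lifecycle.PodAdmitHandler, which the kubelet stores as evictionAdmitHandler. As the statement return manager, manager, nil shows, both return values are the same *managerImpl instance, exposed through two different interfaces. The kubelet uses evictionAdmitHandler for an admission check before it creates a pod, and only goes on to create the pod when the check passes. The check is performed by the Admit(attrs *lifecycle.PodAdmitAttributes) method: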
pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	// the node has memory pressure, admit if not best-effort
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
		if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}
	// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
	glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  reason,
		Message: fmt.Sprintf(message, m.nodeConditions),
	}
}
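The decision order in Admit can be reduced to a small pure function, which makes the individual cases easy to check. This is a sketch that mirrors the branches of the excerpt above under simplifying assumptions: node conditions are plain strings, and the qosClass and pod types and the admit and hasCondition functions are hypothetical stand-ins for the real Kubernetes types.

```go
package main

import "fmt"

// qosClass and pod are simplified, hypothetical stand-ins for the real types.
type qosClass string

const (
	bestEffort qosClass = "BestEffort"
	burstable  qosClass = "Burstable"
	guaranteed qosClass = "Guaranteed"
)

type pod struct {
	qos      qosClass
	critical bool
}

// hasCondition reports whether the named condition is in the list.
func hasCondition(conditions []string, name string) bool {
	for _, c := range conditions {
		if c == name {
			return true
		}
	}
	return false
}

// admit follows the same branch order as the Admit excerpt:
// no conditions means admit; under memory pressure, anything that is
// not best-effort (or is critical) is admitted; everything else is rejected.
func admit(conditions []string, p pod) bool {
	if len(conditions) == 0 {
		return true
	}
	if hasCondition(conditions, "MemoryPressure") {
		if p.qos != bestEffort || p.critical {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(admit(nil, pod{qos: bestEffort}))
	fmt.Println(admit([]string{"MemoryPressure"}, pod{qos: bestEffort}))
	fmt.Println(admit([]string{"MemoryPressure"}, pod{qos: burstable}))
	fmt.Println(admit([]string{"DiskPressure"}, pod{qos: guaranteed}))
}
```

Under these assumptions the four calls in main cover the four typical cases: a healthy node, a best-effort pod under memory pressure, a burstable pod under memory pressure, and any pod under disk pressure only.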
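This pod admission logic is the effect of the Eviction Manager on pod placement that the Scheduler section of the companion article on how the Kubernetes Eviction Manager works describes:
The kubelet periodically reports Node Conditions to kube-apiserver, which stores them in etcd. Once kube-scheduler sees a pressure Node Condition through its watch, it stops binding more pods to that node according to the following policy.
Node Condition | Scheduler Behavior |
---|---|
MemoryPressure | No new BestEffort pods are scheduled to the node. |
DiskPressure | No new pods are scheduled to the node. |
The killPodNow code is analyzed later.
That essentially settles what evictionManager is and where it comes from. Next, we look at how evictionManager starts.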
Starting the Kubernetes Eviction Manager
As analyzed above, the kubelet starts evictionManager while initializing its runtime-dependent modules, through the call kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod). So let's look at the Start method first:
pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
	// start the eviction manager monitoring
	go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
	return nil
}
It is simple: Start launches a goroutine in which, each time m.synchronize finishes, the loop waits for monitoringInterval (10s) and then runs m.synchronize again, indefinitely.
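The run, wait, repeat behaviour described here can be sketched with only the standard library. The until function below is a simplified stand-in written for this article, not the wait.Until implementation from Kubernetes, and countRuns is a hypothetical helper used to make the behaviour observable.

```go
package main

import (
	"fmt"
	"time"
)

// until runs f, waits for period, and repeats until stop is closed.
// The wait starts only after f has returned, so a slow f delays the
// next run instead of overlapping with it.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Do not start another run if we have already been stopped.
		select {
		case <-stop:
			return
		default:
		}
		f()
		// Wait for the next period, or return early when stopped.
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// countRuns closes the stop channel from inside the n-th run and
// reports how many times the function was executed in total.
func countRuns(n int) int {
	stop := make(chan struct{})
	runs := 0
	until(func() {
		runs++
		if runs == n {
			close(stop)
		}
	}, time.Millisecond, stop)
	return runs
}

func main() {
	fmt.Println(countRuns(3))
}
```

In the kubelet the stop channel is wait.NeverStop, so the loop never exits; the sketch adds a real stop channel only so that it terminates.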
Next comes the key workflow of evictionManager:
pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
	// if we have nothing to do, just return
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return
	}
	// build the ranking functions (if not yet known)
	if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
		// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
		hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
		if err != nil {
			return
		}
		m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
	}
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
	if err != nil {
		glog.Errorf("eviction manager: unexpected err: %v", err)
		return
	}
	// attempt to create a threshold notifier to improve eviction response time
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
		m.notifiersInitialized = true
		// start soft memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
			glog.Infof("soft memory eviction threshold crossed at %s", desc)
			// TODO wait grace period for soft memory limit
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
		}
		// start hard memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
		}
	}
	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)
	// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	// track when a threshold was first observed
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	// the set of node conditions that are triggered by currently observed thresholds
	nodeConditions := nodeConditions(thresholds)
	// track when a node condition was last observed
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	m.lastObservations = observations
	m.Unlock()
	// determine the set of resources under starvation
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return
	}
	// rank the resources to reclaim by eviction priority
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)
	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
	// record an event about the resources we are now attempting to reclaim via eviction
	m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)
	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
		glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return
	}
	glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)
	// rank the pods for eviction
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	if !ok {
		glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
		return
	}
	// the only candidates viable for eviction are those pods that had anything running.
	activePods := podFunc()
	if len(activePods) == 0 {
		glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return
	}
	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)
	glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))
	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		// record that we are evicting the pod
		m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// this is a blocking call and should only return when the pod and its containers are killed.
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
			continue
		}
		// success, so we return until the next housekeeping interval
		glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
		return
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
}
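The code is tidy and well commented. The key flow is as follows:
- buildResourceToRankFunc and buildResourceToNodeReclaimFuncs register, respectively, the per-resource ranking functions used when evicting pods and the reclaim functions used to recover node-level resources.
- makeSignalObservations obtains the eviction signal observations from the cAdvisor-backed summary stats, together with the pod statsFunc that is needed later to rank pods.
- If the kubelet runs with --experimental-kernel-memcg-notification set to true, startMemoryThresholdNotifier starts the soft and hard memory notifiers. As soon as system usage crosses the soft or hard memory threshold, the kubelet is notified and evictionManager.synchronize is triggered to reclaim resources, which makes eviction more responsive.
- thresholdsMet takes the observations and the configured thresholds and returns the thresholds that are met in this pass, regardless of grace period.
- thresholdsMet is called again with the observations and m.thresholdsMet to find the previously recorded thresholds that are still unresolved (min-reclaim not yet satisfied). The result is merged with the thresholds from the previous step.
- thresholdsUpdatedStats compares the signal timestamps in lastObservations with those in observations and keeps only the thresholds whose stats have been updated since the last sync.
- thresholdsFirstObservedAt, nodeConditions and nodeConditionsLastObservedAt are updated.
- thresholdsMetGracePeriod keeps only the thresholds whose grace period has elapsed between the first observed time and now.
- The internal state of the evictionManager object is updated: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet (set to thresholds) and lastObservations (set to observations).
- starvedResources is derived from thresholds and sorted. If memory is among the starved resources, it sorts first.
- The first resource in starvedResources is passed to reclaimNodeLevelResources, which tries to reclaim that resource at node level. If, after reclaiming, available satisfies thresholdValue + evictionMinimumReclaim, the flow ends and no user pods are evicted.
- If reclaimNodeLevelResources does not free enough, the manager goes on to evict user pods. It first ranks all active pods with the function registered by buildResourceToRankFunc.
- Following that ranking, it calls killPodNow (through m.killPodFunc) on the pods in order. If killing a pod fails, that pod is skipped and the next one in the ranking is tried. As soon as one pod has been killed successfully, the function returns, so a single pass kills at most one pod.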
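Two of the steps above, sorting memory to the front of the starved resources and stopping after the first successful kill, can be sketched in a few lines. The sketch is an illustration under assumptions: resources and pods are plain strings, and memoryFirst, evictOne and failOn are hypothetical helpers, not functions from the Kubernetes source.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// memoryFirst reorders resources in place so that "memory" comes first,
// keeping the relative order of everything else.
func memoryFirst(resources []string) {
	sort.SliceStable(resources, func(i, j int) bool {
		return resources[i] == "memory" && resources[j] != "memory"
	})
}

// evictOne walks the ranked pods in order and returns after the first
// successful kill, so at most one pod is evicted per call.
func evictOne(ranked []string, kill func(pod string) error) (string, bool) {
	for _, p := range ranked {
		if err := kill(p); err != nil {
			// Killing this pod failed; try the next one in the ranking.
			continue
		}
		return p, true
	}
	return "", false
}

// failOn returns a kill function that fails for exactly one pod name.
func failOn(name string) func(string) error {
	return func(pod string) error {
		if pod == name {
			return errors.New("kill failed for " + pod)
		}
		return nil
	}
}

func main() {
	resources := []string{"nodefs", "memory", "imagefs"}
	memoryFirst(resources)
	fmt.Println(resources)

	pod, ok := evictOne([]string{"pod-a", "pod-b", "pod-c"}, failOn("pod-a"))
	fmt.Println(pod, ok)
}
```

In the second call, killing pod-a fails, pod-b is evicted, and pod-c is left alone until a later pass decides whether pressure persists.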
Two steps in this flow matter most: reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).
pkg/kubelet/eviction/eviction_manager.go:340
// reclaimNodeLevelResources attempts to reclaim node level resources. returns true if thresholds were satisfied and no pod eviction is required.
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
	nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// attempt to reclaim the pressured resource.
		reclaimed, err := nodeReclaimFunc()
		if err == nil {
			// update our local observations based on the amount reported to have been reclaimed.
			// note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
			signal := resourceToSignal[resourceToReclaim]
			value, ok := observations[signal]
			if !ok {
				glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
				continue
			}
			value.available.Add(*reclaimed)
			// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
			if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
				return true
			}
		} else {
			glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}
	}
	return false
}
pkg/kubelet/pod_workers.go:283
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
	return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
		// determine the grace period to use when killing the pod
		gracePeriod := int64(0)
		if gracePeriodOverride != nil {
			gracePeriod = *gracePeriodOverride
		} else if pod.Spec.TerminationGracePeriodSeconds != nil {
			gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
		}
		// we timeout and return an error if we don't get a callback within a reasonable time.
		// the default timeout is relative to the grace period (we settle on 2s to wait for kubelet->runtime traffic to complete in sigkill)
		timeout := int64(gracePeriod + (gracePeriod / 2))
		minTimeout := int64(2)
		if timeout < minTimeout {
			timeout = minTimeout
		}
		timeoutDuration := time.Duration(timeout) * time.Second
		// open a channel we block against until we get a result
		type response struct {
			err error
		}
		ch := make(chan response)
		podWorkers.UpdatePod(&UpdatePodOptions{
			Pod:        pod,
			UpdateType: kubetypes.SyncPodKill,
			OnCompleteFunc: func(err error) {
				ch <- response{err: err}
			},
			KillPodOptions: &KillPodOptions{
				PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
					return status
				},
				PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
			},
		})
		// wait for either a response, or a timeout
		select {
		case r := <-ch:
			return r.err
		case <-time.After(timeoutDuration):
			recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
			return fmt.Errorf("timeout waiting to kill pod")
		}
	}
}
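The timeout arithmetic and the wait in killPodNow can be tried out separately from the pod workers. The sketch below mirrors the arithmetic of the excerpt (one and a half times the grace period, with a floor of 2 seconds) and the select on a result channel versus time.After; killTimeout and waitForKill are hypothetical helper names, and main uses a buffered channel so that a late sender would not block, which is a choice made for this sketch only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// killTimeout mirrors the arithmetic in the excerpt: the timeout is the
// grace period plus half of it (integer division), and never below 2s.
func killTimeout(gracePeriodSeconds int64) time.Duration {
	timeout := gracePeriodSeconds + gracePeriodSeconds/2
	if timeout < 2 {
		timeout = 2
	}
	return time.Duration(timeout) * time.Second
}

// waitForKill blocks until a result arrives on done or the timeout
// elapses, whichever happens first.
func waitForKill(done <-chan error, timeout time.Duration) error {
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("timeout waiting to kill pod")
	}
}

func main() {
	fmt.Println(killTimeout(0), killTimeout(30))

	// A kill that completes before the timeout returns its own result.
	done := make(chan error, 1)
	done <- nil
	fmt.Println(waitForKill(done, killTimeout(0)))

	// A kill that never reports back ends in a timeout error.
	fmt.Println(waitForKill(make(chan error), 10*time.Millisecond))
}
```

For a hard eviction the grace period override is 0, so the wait is bounded by the 2 second floor; for a soft eviction with, say, a 30 second maximum pod grace period, the eviction manager may block for up to 45 seconds on a single pod.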
That completes the analysis of the main flow of evictionManager.