CN116627586A - Container cluster management method and device, electronic equipment and storage medium - Google Patents

Container cluster management method and device, electronic equipment and storage medium

Info

Publication number
CN116627586A
Authority
CN
China
Prior art keywords
pod
race
original
detection index
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310602263.9A
Other languages
Chinese (zh)
Inventor
胡东旭
陈存利
刘畅
司禹
赵鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202310602263.9A priority Critical patent/CN116627586A/en
Publication of CN116627586A publication Critical patent/CN116627586A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/4557Distribution of virtual machine instances; Migration and load balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a container cluster management method, a device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a historical value of a race detection index of an original pod of a managed node; predicting a predicted value of the race detection index of the original pod based on the historical value; when a new pod is added to the managed node, acquiring an actual value of the race detection index of the original pod; and judging, based on the actual value and the predicted value of the race detection index of the original pod, whether a resource race occurs on the managed node.

Description

Container cluster management method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for managing a container cluster, an electronic device, and a storage medium.
Background
A container cluster (cluster) is typically made up of a plurality of nodes (nodes). Programs running on a node are packaged into containers (containers). Containers are not scheduled directly; instead, one or more containers are packaged into one scheduling unit, a pod, which is the smallest unit of resource scheduling in the container cluster. Kubernetes is a container cluster management system that, on top of the container engine Docker, provides a complete set of functions such as deployment, resource scheduling, service discovery and dynamic scaling for containerized applications, greatly improving the convenience of large-scale container cluster management; internally, it distinguishes manager nodes from managed nodes, and its minimum scheduling and management unit is the pod. Kubernetes deploys applications as containers: containers are isolated from one another, each container has its own file system, processes in different containers cannot affect each other, and their computing resources can be separated. Compared with a virtual machine, a container can be deployed quickly and, because it is decoupled from the underlying infrastructure and the host file system, can migrate between different clouds and different operating system versions. In a hybrid (co-located) deployment, multiple pods are deployed on the same managed node and the service types in their containers are not uniform; for example, the deployment may include both online and offline services.
In the related art, when a new pod needs to be scheduled, Kubernetes calculates, through various algorithms and according to the resource usage of the cluster, the most suitable node on which to run the current pod. However, the increased container density on that node may cause interference between the different services running on it, that is, a resource race, and the related art cannot detect such resource races effectively.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a container cluster management method, an apparatus, an electronic device, and a storage medium, so as to solve the problem of how to effectively detect a resource race on a managed node to which a pod has been newly added.
According to an aspect of the present invention, there is provided a container cluster management method, including: acquiring a historical value of a race detection index of an original pod of a managed node; predicting a predicted value of the race detection index of the original pod based on the historical value; when a pod is newly added to the managed node, acquiring an actual value of the race detection index of the original pod; and judging, based on the actual value and the predicted value of the race detection index of the original pod, whether a resource race occurs on the managed node.
According to another aspect of the present invention, there is provided a container cluster management apparatus including: a historical data acquisition module configured to acquire a historical value of a race detection index of an original pod of a managed node; a data prediction module configured to predict a predicted value of the race detection index of the original pod based on the historical value; an actual data acquisition module configured to acquire an actual value of the race detection index of the original pod after a pod is newly added to the managed node; and a resource race determination module configured to judge, based on the actual value and the predicted value of the race detection index of the original pod, whether a resource race occurs on the managed node.
According to another aspect of the present invention, there is provided an electronic device including: a processor; and a memory storing a program, wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the container cluster management method described above.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the container cluster management method described above.
According to one or more technical solutions provided by the embodiments of the present application, the predicted value of the race detection index of the original pod can be predicted from its historical value, and then, based on the actual value and the predicted value of the race detection index of the original pod, whether a resource race occurs on the managed node can be judged quickly and accurately.
Drawings
Further details, features and advantages of the application are disclosed in the following description of exemplary embodiments with reference to the following drawings, in which:
FIG. 1 is a schematic diagram of an application scenario of a container cluster management method according to some embodiments of the present disclosure;
FIG. 2 is a schematic block diagram of a container cluster management apparatus according to some embodiments of the present disclosure;
FIG. 3 is a schematic flow chart of a container cluster management method according to some embodiments of the present disclosure;
FIG. 4 shows a block diagram of an exemplary electronic device that can be used to implement an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the application are shown in the drawings, it should be understood that the application may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the application will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the application are for illustration purposes only and are not intended to limit the scope of protection of the present application.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those skilled in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The following describes aspects of the invention with reference to the drawings.
FIG. 1 is a schematic diagram of an application scenario of a container cluster management method according to some embodiments of the present disclosure. As shown in FIG. 1, the application scenario 100 of the container cluster management method may include a container cluster 110.
The container cluster 110 may include one manager node (master-node) and a plurality of managed nodes (worker-nodes).
In some embodiments, the manager node may include at least a kube-apiserver component, a kube-controller-manager component, and a kube-scheduler component.
In some embodiments, each managed node may be a physical machine or a virtual machine and may include at least a kubelet component and a kube-proxy component, where only one kubelet component is deployed on each managed node. Each managed node may deploy at least one pod in which a business service process runs; the service may be a stateless service, a stateful service (e.g., Redis, MySQL, etc.), an offline service, and so on.
The plurality of managed nodes of the container cluster 110 may be deployed in a plurality of resource pools, each resource pool having at least one managed node deployed therein, and the managed nodes in different resource pools may serve different functions. For example only, as shown in FIG. 1, the container cluster 110 may include a resource pool 1 (pool-1) in which worker-node-1, worker-node-2, worker-node-3 … worker-node-n are deployed for running business services. A report-agent may be deployed on each worker-node where a user deploys business services; the report-agent collects, in real time, the running information of each pod on the worker-node, for example, resource usage, the actual value of the race detection index, and the like. The report-agent may run in daemonset mode and is by default allowed to communicate via RPC (Remote Procedure Call) with all the pods on the managed node.
As shown in FIG. 1, the container cluster 110 may also include a resource pool x (pool-x) in which at least one managed node (e.g., worker-node-x, etc.) is deployed. At least one kafka pod may be deployed on the managed nodes in the resource pool x, and the kafka pod may be used to receive the running information (e.g., the actual value of the race detection index, etc.) of each pod collected by the report-agents of the managed nodes in the resource pool 1. At least one flink pod may also be deployed on the managed nodes in the resource pool x, and the kafka pod may send the received running information of each pod to the flink pod.
As shown in FIG. 1, the container cluster 110 may further include a data analysis server (Analyzer-server) 120. The flink pod is configured to generate a flink real-time data stream based on the received running information of each pod and write the stream into the data analysis server 120 in consumer mode, and the data analysis server 120 may determine, according to the written flink real-time data stream, whether a resource race occurs on a managed node in the resource pool 1 to which a pod has been newly added. The data analysis server 120 may also persist the determination result to a MySQL database as an archive for auditing.
In some embodiments, the data analysis server 120 may be a virtual machine or a physical machine deployed independently of the container cluster 110. In some embodiments, the data analysis server 120 may be a managed node in the container cluster 110; for example, it may be a managed node in the resource pool x.
In some embodiments, when the data analysis server 120 determines that a resource race occurs on a certain managed node, the kube-controller-manager of the manager node may migrate the pods deployed on that managed node, so as to prevent the resource race from continuing as much as possible.
For further description of the container cluster management method, reference may be made to fig. 3 and its related description, which are not repeated here.
FIG. 2 is a schematic block diagram of a container cluster management apparatus according to some embodiments of the present disclosure. As shown in FIG. 2, the container cluster management device 200 may include a historical data acquisition module 210, a data prediction module 220, an actual data acquisition module 230, and a resource race determination module 240.
The historical data acquisition module 210 may be configured to acquire a historical value of the race detection index of an original pod of a managed node.
In some embodiments, the race detection index includes a process CPI.
The data prediction module 220 may be configured to predict a predicted value of the race detection indicator of the original pod based on a historical value of the race detection indicator of the original pod.
In some embodiments, the data prediction module 220 may predict the predicted value of the process CPI for the original pod based on the historical value of the process CPI for the original pod.
In some embodiments, the data prediction module 220 may determine a target model based on a length of time corresponding to a historical value of the race detection indicator of the original pod, where the target model is an SVM model or an LSTM model; and predicting the predicted value of the competitive detection index of the original pod based on the historical value of the competitive detection index of the original pod through the target model.
The actual data acquisition module 230 may be configured to acquire an actual value of the race detection index of the original pod after a pod is newly added to the managed node.
The resource race determination module 240 may be configured to determine whether a managed node has a resource race based on an actual value and a predicted value of a race detection index of an original pod.
In some embodiments, the resource race determination module 240 may determine whether the managed node has a resource race based on a deviation between the actual value and the predicted value of the process CPI of the original pod.
In some embodiments, the race detection indicator includes a target link time consumption, and when determining that the managed node has a resource race based on a deviation between an actual value and a predicted value of a process CPI of the original pod, the resource race determination module 240 may determine forwarding performance fluctuation based on the actual value of the target link time consumption, and determine whether the managed node has the resource race based on the forwarding performance fluctuation.
In some embodiments, the race detection indicator includes a service response time, and when it is determined that the managed node does not have a resource race based on the forwarding performance fluctuation, the resource race determination module 240 may determine the service response time fluctuation based on an actual value of the service response time, and determine whether the managed node has a resource race based on the service response time fluctuation.
In some embodiments, the container cluster management device 200 may further include a pod migration module 250, where the pod migration module 250 may be configured to migrate at least one pod of the managed node when the resource race determination module 240 determines that the managed node has a resource race.
In some embodiments, the pod migration module 250 may determine the pod to be migrated based on the service types of the pods of the managed node.
In some embodiments, when the service type of the pod to be migrated is a Redis service, the pod migration module 250 may start a target pod on a target managed node, initialize the data of the Redis container of the target pod to be empty, obtain the IP of the Redis container of the target pod, access the Redis container of the target pod based on that IP through RDMA technology, transfer a copy of the memory data of the pod to be migrated to the Redis container of the target pod, and, after the transfer is completed, take the pod to be migrated offline.
In some embodiments, for a pod in the managed node whose service type is a MySQL service, the pod migration module 250 may intercept its MySQL data packets and forward the intercepted packets to a corresponding storage medium. When the service type of the pod to be migrated is a MySQL service, the pod migration module 250 may start a target pod on a target managed node, map the storage medium corresponding to the pod to be migrated to the target pod, and, after the mapping is completed, take the pod to be migrated offline, where the storage medium corresponding to the pod to be migrated is used to store the MySQL data packets intercepted and forwarded from the target pod.
For further description of the container cluster management device 200, reference may be made to fig. 3 and its associated description, which is not repeated here.
It should be noted that the description of the container cluster management device 200 and its modules is for convenience only and is not intended to limit the present disclosure to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily or a subsystem may be constructed in connection with other modules without departing from such principles. In some embodiments, the historical data acquisition module 210, the data prediction module 220, the actual data acquisition module 230, the resource race determination module 240, and the pod migration module 250 disclosed in fig. 2 may be different modules in a system, or may be one module to implement the functions of two or more modules. For example, each module may share one memory module, or each module may have a respective memory module. Such variations are within the scope of the present description.
Fig. 3 is an exemplary flow chart of a method of container cluster management, according to some embodiments of the present description. In some embodiments, the container cluster management method 300 may be performed by the container cluster management device 200. As shown in fig. 3, the container cluster management method 300 includes the following steps.
Step 310: obtain the historical value of the race detection index of the original pod of the managed node. In some embodiments, step 310 may be performed by the historical data acquisition module 210 or the report-agent of the managed node.
The race detection index may be information related to the running condition of the services deployed on a pod. For example, the race detection index may include at least one, or any combination, of the process CPI (cycles per instruction), the target link time consumption, the service response time, and the like, where the target link time consumption may be the time consumption of a core link of the pod (e.g., a key function of IPVS), and the service response time may include at least one, or any combination, of the average time consumption, percentile values (e.g., P50, P95, P99.99), and the like. The average time consumption characterizes the average of the response times of all requests. If the requests are sorted by response time from small to large, the response time of the request at the median position is the P50 value, and the response time of the request at the 95% position is the P95 value.
An original pod of the managed node is a pod already deployed on the managed node before the new pod is added.
The historical value of the race detection index of the original pod may be the value of the race detection index of the original pod at a historical time point.
In some embodiments, the historical data acquisition module 210 or the report-agent of the managed node may acquire the historical value of the race detection index of the original pod in any manner.
For example, the historical data acquisition module 210 or the report-agent of the managed node may acquire the number of CPU clock cycles consumed and the total number of instructions effectively executed by a process on a pod during a historical period, and determine the historical value of the process CPI of that process for that period. For example only, the historical data acquisition module 210 or the report-agent of the managed node may divide the number of CPU clock cycles in the historical period by the total number of instructions executed in that period to obtain the historical value of the process CPI for the process in that period.
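A minimal sketch of such a CPI sample is given below; it is not the patent's report-agent but an illustration under the assumption that a Linux host with the perf tool is available and that the caller has permission to profile the target process. The PID and sampling interval are placeholders.

```python
# Sketch: estimate a process's CPI (cycles per instruction) over a sampling
# window with `perf stat`. Assumes Linux, perf installed, permission to
# profile the target PID; PID and interval below are illustrative.
import subprocess

def sample_cpi(pid: int, seconds: int = 10) -> float:
    # -x, requests CSV output; -e selects the two hardware counters we need.
    out = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "cycles,instructions",
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stderr  # perf stat writes the counter values to stderr
    counters = {}
    for line in out.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in ("cycles", "instructions"):
            try:
                counters[fields[2]] = float(fields[0])
            except ValueError:
                pass  # e.g. "<not counted>"
    return counters["cycles"] / counters["instructions"]  # CPI = cycles / instructions

if __name__ == "__main__":
    print(sample_cpi(pid=12345, seconds=10))
```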
For another example, the historical data acquisition module 210 or the report-agent of the managed node may inject, into the core link of the pod, code that automatically collects the time consumed by that core link. For example only, the historical data acquisition module 210 or the report-agent of the managed node may inject code into the key functions of IPVS (e.g., ip_vs_out, ip_vs_in, etc.) through eBPF (extended Berkeley Packet Filter) technology, and this code collects the time consumption of the IPVS key functions.
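The sketch below shows one possible way to measure the latency of an IPVS key function with eBPF, using the BCC Python bindings with an embedded eBPF C program. It assumes a kernel with the ip_vs module loaded and BCC installed, and must run as root; it is an illustration, not the patent's implementation.

```python
# Sketch: histogram of ip_vs_in latency via kprobe/kretprobe (BCC).
from bcc import BPF
import time

prog = r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp) {
        dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
        start.delete(&tid);
    }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="ip_vs_in", fn_name="trace_entry")
b.attach_kretprobe(event="ip_vs_in", fn_name="trace_return")

time.sleep(10)                   # collect for 10 seconds
b["dist"].print_log2_hist("ns")  # per-call latency histogram in nanoseconds
```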
For another example, the historical data acquisition module 210 or the report-agent of the managed node may monitor the response time of the pod's requests in real time. For example only, the historical data acquisition module 210 or the report-agent of the managed node may collect the response times of all requests of the pod in a period and calculate the response-time percentile P95 of the period using a percentile-value algorithm.
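As a simple illustration of the percentile-value computation described above, the following sketch derives P50 and P95 from the response times collected in one period using the nearest-rank method; the sample values are made up.

```python
# Sketch: per-period response-time percentiles (nearest-rank method).
def percentile(samples, p):
    """Nearest-rank percentile of a list of response times (ms)."""
    ordered = sorted(samples)
    # smallest value such that roughly p% of the samples are <= it
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

response_times_ms = [12.0, 15.3, 9.8, 40.1, 11.2, 13.7, 95.4, 10.5]
print("P50:", percentile(response_times_ms, 50))
print("P95:", percentile(response_times_ms, 95))
print("avg:", sum(response_times_ms) / len(response_times_ms))
```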
Step 320, predicting the predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod. In some embodiments, step 320 may be performed by data prediction module 220 or data analysis server 120.
The predicted value of the race detection index of the original pod may be the value of the race detection index of the original pod at a future point in time.
In some embodiments, the data prediction module 220 or the data analysis server 120 may predict a plurality of predicted values of the race detection index of the original pod based on a plurality of historical values of that index in any manner. For example, the data prediction module 220 or the data analysis server 120 may predict the predicted value of the race detection index of the original pod from its historical values through a data prediction model; the input of the data prediction model may include a plurality of historical values of the race detection index of the original pod, and its output may include a plurality of predicted values of that index. The data prediction model may be one, or any combination, of a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a multi-layer perceptron (MLP), a generative adversarial network (GAN), and the like.
In some embodiments, before using the plurality of historical values of the race detection index of the original pod to predict its predicted values, the data prediction module 220 or the data analysis server 120 may pre-process the historical values and filter out abnormal data among them. For example, the data prediction module 220 or the data analysis server 120 may employ an outlier detection algorithm to filter out spike values that deviate greatly from the rest of the historical values.
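A minimal sketch of such spike filtering is shown below, using a simple z-score rule; the threshold of 3 standard deviations is an assumption for illustration, not a value taken from the patent.

```python
# Sketch: drop spike values before feeding the history into the predictor.
from statistics import mean, stdev

def filter_outliers(history, z_threshold=3.0):
    """Keep only historical index values within z_threshold std devs of the mean."""
    if len(history) < 2:
        return list(history)
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return list(history)
    return [v for v in history if abs(v - mu) / sigma <= z_threshold]
```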
In some embodiments, the data prediction module 220 or the data analysis server 120 can predict the predicted value of the process CPI for the original pod based on the historical value of the process CPI for the original pod. For example, the data prediction module 220 or the data analysis server 120 may predict multiple predicted values of the process CPI of the original pod by the data prediction model using multiple historical values of the process CPI of the original pod. For example only, the data prediction model may predict a predicted value of the process CPI for the original pod for a future day based on a history of the process CPI for the original pod for the past week.
In some embodiments, the data prediction module 220 or the data analysis server 120 may determine the target model based on a time length corresponding to a historical value of the race detection index of the original pod, where the target model is a SVM (Support Vector Machine) model or an LSTM (Long Short-Term Memory) model, and predict the predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod through the target model.
In some embodiments, when the time length corresponding to the historical values of the race detection index of the original pod is short (e.g., less than or equal to one day, one week, etc.), the data prediction module 220 or the data analysis server 120 may use the SVM model to predict the predicted value of the race detection index of the original pod based on its historical values; when the time length corresponding to the historical values is longer (e.g., greater than one day, one week, etc.), the data prediction module 220 or the data analysis server 120 may use the LSTM model instead. The LSTM model may be optimized with the Adam algorithm during training to speed up convergence.
For example only, the data prediction module 220 or the data analysis server 120 may predict a predicted value of the race detection index of the original pod one hour in the future based on a historical value of the race detection index of the original pod in the past day using an SVM model; and predicting the predicted value of the competitive detection index of the original pod in the future week based on the historical value of the competitive detection index of the original pod in the past month by using the LSTM model.
When the time length corresponding to the historical values of the race detection index of the original pod is short, the amount of historical data cannot support long-horizon time-series prediction, so the SVM model is used for short-horizon prediction, which guarantees the accuracy of the predicted values. When the time length corresponding to the historical values is long, the amount of historical data can support long-horizon time-series prediction, and the LSTM model can produce the long-horizon predicted values in a single pass, without repeated predictions, which reduces the computational cost.
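The following sketch illustrates this model selection under stated assumptions: a one-day cutoff, one-minute samples, a sliding-window formulation, and the scikit-learn SVR / Keras LSTM hyperparameters are all illustrative choices, not values specified by the patent. Forecasting then consists of feeding the latest window into the fitted model and appending its output.

```python
# Sketch: short history -> SVM regression; long history -> LSTM (Adam optimizer).
import numpy as np
from sklearn.svm import SVR

ONE_DAY_POINTS = 24 * 60  # one-minute samples, illustrative cutoff

def make_windows(series, window):
    X = [series[i:i + window] for i in range(len(series) - window)]
    y = [series[i + window] for i in range(len(series) - window)]
    return np.array(X), np.array(y)

def fit_predictor(history, window=30):
    X, y = make_windows(np.asarray(history, dtype=float), window)
    if len(history) <= ONE_DAY_POINTS:
        model = SVR(kernel="rbf")          # short history: SVM regression
        model.fit(X, y)
        return model
    # long history: an LSTM trained with the Adam optimizer
    from tensorflow import keras
    model = keras.Sequential([
        keras.layers.LSTM(32, input_shape=(window, 1)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[..., None], y, epochs=10, verbose=0)
    return model
```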
Step 330: when a pod is newly added to the managed node, obtain the actual value of the race detection index of the original pod. In some embodiments, step 330 may be performed by the actual data acquisition module 230 or the report-agent of the managed node.
For more description of how the actual data acquisition module 230 or the report-agent of the managed node obtains the actual value of the race detection index, reference may be made to the above description of obtaining the historical value of the race detection index of the original pod, which is not repeated here.
Step 340: determine whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod. In some embodiments, step 340 may be performed by the resource race determination module 240 or the data analysis server 120.
In some embodiments, the resource race determination module 240 or the data analysis server 120 may determine whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod in any manner.
For example, the resource race determination module 240 or the data analysis server 120 may determine whether the managed node has a resource race based on the differences between the actual values and the corresponding predicted values of the race detection index of the original pod at a plurality of time points. For example only, when the differences between the actual values and the corresponding predicted values of the race detection index of the original pod at a plurality of time points are greater than a difference threshold, the resource race determination module 240 or the data analysis server 120 may determine that the managed node has a resource race.
For another example, the resource race determination module 240 or the data analysis server 120 may use a race judgment model to determine whether the managed node has a resource race based on the actual values and the corresponding predicted values of the race detection index of the original pod at a plurality of time points; the input of the race judgment model may include those actual values and corresponding predicted values, and its output may include the judgment result of whether the managed node has a resource race. The race judgment model may be one, or any combination, of a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a multi-layer perceptron (MLP), a generative adversarial network (GAN), and the like.
In some embodiments, the resource race determination module 240 or the data analysis server 120 may determine whether the managed node has a resource race based on the deviation between the actual value and the predicted value of the process CPI of the original pod.
For example, when, for at least one original pod of the managed node, the deviation between the actual value of the process CPI and the corresponding predicted value within a certain period of time is greater than a deviation threshold, the resource race determination module 240 or the data analysis server 120 may determine that the managed node has a resource race.
For example only, after the managed node newly adds pod D, if for at least one of the original pods (e.g., pod A, pod B, pod C) the actual value of the process CPI deviates from the predicted value by more than 20% in each of 6 consecutive periods (e.g., periods of 10s, 30s, 60s, etc.), the resource race determination module 240 or the data analysis server 120 may determine that the managed node has a resource race.
By means of the deviation between the actual value and the predicted value of the process CPI of the original pod, whether the managed node has a resource race can be judged rapidly.
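A minimal sketch of this check is given below, using the example figures above (20% relative deviation, 6 consecutive periods); both numbers are the example thresholds, not mandated values.

```python
# Sketch: flag a suspected resource race when the actual process CPI deviates
# from the predicted CPI by more than 20% in each of 6 consecutive periods.
def cpi_race_suspected(actual, predicted, rel_threshold=0.20, periods=6):
    """actual / predicted: per-period CPI values for one original pod."""
    recent = list(zip(actual, predicted))[-periods:]
    if len(recent) < periods:
        return False
    return all(p > 0 and abs(a - p) / p > rel_threshold for a, p in recent)
```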
In some embodiments, when it is determined that the managed node has a resource race based on a deviation between an actual value and a predicted value of the process CPI of the original pod, the resource race determination module 240 or the data analysis server 120 may determine forwarding performance fluctuation based on an actual value of the target link time consumption, and determine whether the managed node has a resource race based on the forwarding performance fluctuation.
For example, when it is determined that the managed node has a resource race based on a deviation between an actual value and a predicted value of a process CPI of the original pod, the resource race determination module 240 or the data analysis server 120 may determine forwarding performance fluctuation based on an actual value of a target link time consumption for a current period of time and an actual value of a target link time consumption for a previous period of time, and when the forwarding performance fluctuation is greater than a forwarding performance fluctuation threshold, the resource race determination module 240 or the data analysis server 120 may determine that the managed node has a resource race.
In some embodiments, the resource race determination module 240 or the data analysis server 120 may determine the forwarding performance fluctuations of the current 6 consecutive periods (e.g., periods of 10s, 30s, 60s, etc.) based on the actual values of the target link time consumption in the current 6 consecutive periods and the actual values of the target link time consumption in the previous 6 consecutive periods. For example, the resource race determination module 240 or the data analysis server 120 may determine the forwarding performance fluctuation of the first of the current 6 consecutive periods based on the actual value of the target link time consumption in the first of the current 6 consecutive periods and the actual value of the target link time consumption in the first of the previous 6 consecutive periods. For example only, when the forwarding performance fluctuations corresponding to the current 6 consecutive periods are all greater than 2%, the resource race determination module 240 or the data analysis server 120 may determine that a resource race exists on the managed node.
When it has been judged, based on the deviation between the actual value and the predicted value of the process CPI of the original pod, that the managed node has a resource race, further judging whether the resource race exists based on the actual value of the target link time consumption can further improve the accuracy of the resource race judgment for the managed node and reduce misjudgment.
In some embodiments, when it is determined based on the forwarding performance fluctuation that the managed node does not have a resource race, the resource race determination module 240 or the data analysis server 120 may determine the service response time fluctuation based on the actual value of the service response time, and determine whether the managed node has a resource race based on the service response time fluctuation.
For example, when it is determined based on the forwarding performance fluctuation that the managed node does not have a resource race, the resource race determination module 240 or the data analysis server 120 may determine the service response time fluctuation of the current period based on the actual value of the service response time of the current period (e.g., P50 time consumption, P95 time consumption, P99.99 time consumption, etc.) and the actual value of the service response time of the previous period, and when the service response time fluctuation of the current period is greater than a service response time fluctuation threshold, the resource race determination module 240 or the data analysis server 120 may determine that the managed node has a resource race.
In some embodiments, the resource race determination module 240 or the data analysis server 120 may determine the service response time fluctuations of the current 6 consecutive periods (e.g., periods of 10s, 30s, 60s, etc.) based on the actual values of the service response time (e.g., P50 time consumption, P95 time consumption, P99.99 time consumption, etc.) in the current 6 consecutive periods and the actual values of the service response time in the previous 6 consecutive periods. For example, the resource race determination module 240 or the data analysis server 120 may determine the service response time fluctuation of the first of the current 6 consecutive periods based on the actual value of the P95 time consumption in the first of the current 6 consecutive periods and the actual value of the P95 time consumption in the first of the previous 6 consecutive periods. For example only, when the service response time fluctuations of the current 6 consecutive periods are all greater than 2%, the resource race determination module 240 or the data analysis server 120 may determine that a resource race exists on the managed node.
When it is judged, based on the deviation between the actual value and the predicted value of the process CPI of the original pod, that the managed node has a resource race, but it is judged based on the forwarding performance fluctuation that it does not, further judging whether the managed node has a resource race based on the service response time fluctuation can further improve the accuracy of the resource race judgment for the managed node and reduce misjudgment.
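The sketch below illustrates this cascaded confirmation: after the CPI check suspects a race, the target-link time consumption of the current 6 periods is compared with the same periods of the previous window; if that does not confirm the race, the service-response-time (e.g., P95) fluctuation is checked. The 2% threshold and 6-period window are the example values above, not fixed requirements.

```python
# Sketch: cascaded confirmation of a suspected resource race.
def fluctuation_exceeds(current, previous, threshold=0.02, periods=6):
    """True if every one of the last `periods` values grew by more than `threshold`."""
    cur, prev = current[-periods:], previous[-periods:]
    if len(cur) < periods or len(prev) < periods:
        return False
    return all(p > 0 and (c - p) / p > threshold for c, p in zip(cur, prev))

def confirm_race(link_cur, link_prev, p95_cur, p95_prev):
    if fluctuation_exceeds(link_cur, link_prev):   # forwarding performance fluctuation
        return True
    return fluctuation_exceeds(p95_cur, p95_prev)  # service response time fluctuation
```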
In some embodiments, the container cluster management method 300 may further include step 350: migrating at least one pod of the managed node when it is determined that the managed node has a resource race. In some embodiments, step 350 may be performed by the pod migration module 250 or the manager node of the container cluster 110.
In some embodiments, the pod migration module 250 or the manager node of the container cluster 110 may migrate at least one pod of the managed node in any manner. For example, the pod migration module 250 or the manager node of the container cluster 110 may migrate the newly added pod to another managed node of the container cluster 110 (e.g., another managed node in resource pool 1).
When it is judged that the managed node has a resource race, migrating at least one pod of the managed node can reduce or eliminate the resource race on the managed node as much as possible and improve the user experience.
In some embodiments, the pod migration module 250 or the manager node of the container cluster 110 may determine the pod to be migrated based on the service types of the pods of the managed node.
In some embodiments, the service types may include stateless services, stateful services, offline services, and the like. Different service types may correspond to different priorities. For example, stateless services correspond to the highest priority, stateful basic services such as Redis and MySQL correspond to medium priority, and offline services (e.g., MapReduce, etc.) correspond to low priority. The pod migration module 250 or the manager node of the container cluster 110 may determine the pod to be migrated according to the priority corresponding to the service type of each pod of the managed node: for example, low-priority pods are migrated first, then medium-priority pods, and finally high-priority pods.
By determining the migration priority of each pod of the managed node based on its service type, and selecting the pod to be migrated accordingly, the pods whose migration has less impact on the user experience can be migrated first, which improves the user experience.
In some embodiments, if the services deployed on all the pods of the managed node are stateless services, the pod migration module 250 or the manager node of the container cluster 110 may determine the pod to be migrated according to the CPU usage of all the pods of the managed node. For example, the pod migration module 250 or the manager node of the container cluster 110 may take the pod with the highest CPU usage on the managed node as the pod to be migrated.
Preferentially migrating the pod with the highest CPU usage makes it possible to eliminate the resource race of the managed node with fewer migrations.
In some embodiments, if the services deployed on all the pods of the managed node are stateful services, the pod migration module 250 or the manager node of the container cluster 110 may preferentially migrate the pods where cache-type services (e.g., Redis, etc.) are located, and then migrate the pods where database-type services (e.g., MySQL, etc.) are located.
In some embodiments, if the services deployed on all the pods of the managed node are offline services, the pod migration module 250 or the manager node of the container cluster 110 may preferentially migrate the pods of MapReduce-type services, and then the pods of Spark- and Presto-type services.
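The sketch below combines the selection rules above into one ranking: offline services first (MapReduce before Spark/Presto), then stateful services (cache before database), then stateless services, with CPU usage as the tie-breaker within a class. The pod descriptions (dicts with "type" and "cpu") and the tie-breaking beyond the stateless case are assumptions for illustration.

```python
# Sketch: pick the next pod to migrate according to the priority rules above.
MIGRATION_ORDER = {
    "mapreduce": 0, "spark": 1, "presto": 1,  # offline services, low priority
    "redis": 2, "mysql": 3,                   # stateful services, medium priority
    "stateless": 4,                           # stateless services, high priority
}

def pick_pod_to_migrate(pods):
    """pods: e.g. [{"name": "pod-a", "type": "stateless", "cpu": 0.73}, ...]"""
    # lower migration-order value first; within equal order, higher CPU usage first
    return min(pods, key=lambda p: (MIGRATION_ORDER[p["type"]], -p["cpu"]))
```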
In some embodiments, when the service type of the pod to be migrated is a Redis service, the pod migration module 250 or the manager node of the container cluster 110 may start a target pod on the target managed node, initialize the data of the Redis container of the target pod to be empty, obtain the IP of the Redis container of the target pod, access the Redis container of the target pod based on that IP through RDMA (Remote Direct Memory Access) technology, transfer a copy of the memory data of the pod to be migrated to the Redis container of the target pod, and, after the transfer is completed, take the pod to be migrated offline.
For example only, before migrating the pod to be migrated, the pod migration module 250 or the manager node of the container cluster 110 may modify the Redis core source code of the pod to be migrated so that it supports the RDMA communication protocol. When a pod needs to be migrated, the pod migration module 250 or the manager node of the container cluster 110 may first determine a target managed node, then start a pod on the target managed node as the target pod, initialize the data of the Redis container of the target pod to be empty, and notify the Redis container of the pod to be migrated of the IP of the Redis container of the target pod. Using the RDMA protocol, the Redis container of the pod to be migrated and the Redis container of the target pod bypass their operating system kernels and communicate directly through the smart network cards. After the copy of the data in the Redis container of the pod to be migrated is completed, the pod migration module 250 or the manager node of the container cluster 110 may take the pod to be migrated offline.
In some embodiments, the pod migration module 250 or the manager node of the container cluster 110 may determine the target managed node in any manner. For example, the pod migration module 250 or the manager node of the container cluster 110 may determine the target managed node from among the managed nodes in resource pool 1 other than the managed node where the resource race occurred. By way of example only, the managed node in resource pool 1, other than the one where the resource race occurred, that has the largest amount of remaining computing resources may be taken as the target managed node. In some embodiments, the pod migration module 250 or the manager node of the container cluster 110 may also determine the target managed node based on the CPU usage rate and the average load.
By accessing the Redis container of the target pod based on its IP through RDMA (Remote Direct Memory Access) technology and transferring a copy of the memory data of the pod to be migrated to the Redis container of the target pod, the Redis pod to be migrated can be migrated smoothly to another managed node.
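The orchestration of this flow could look like the sketch below, which uses the official kubernetes Python client: start an empty target Redis pod on the chosen managed node, wait for its IP, hand that IP to the data-copy step, then take the source pod offline. The RDMA memory copy depends on the modified Redis core described above and is only a placeholder call here; the pod names, image and namespace are assumptions.

```python
# Sketch: orchestrate the Redis pod migration with the kubernetes Python client.
import time
from kubernetes import client, config

def migrate_redis_pod(src_pod: str, target_node: str, namespace: str = "default"):
    config.load_kube_config()
    v1 = client.CoreV1Api()

    target = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"{src_pod}-migrated"),
        spec=client.V1PodSpec(
            node_name=target_node,  # pin the target pod to the chosen managed node
            containers=[client.V1Container(name="redis", image="redis:7")],
        ),
    )
    v1.create_namespaced_pod(namespace=namespace, body=target)

    # wait until the target pod has an IP (its Redis container starts empty)
    target_ip = None
    while not target_ip:
        time.sleep(1)
        target_ip = v1.read_namespaced_pod(f"{src_pod}-migrated", namespace).status.pod_ip

    rdma_copy_memory(src_pod, target_ip)  # placeholder for the RDMA memory transfer

    # transfer finished: take the source pod offline
    v1.delete_namespaced_pod(name=src_pod, namespace=namespace)

def rdma_copy_memory(src_pod: str, target_ip: str):
    """Placeholder: the actual copy uses the RDMA-enabled Redis core."""
    raise NotImplementedError
```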
In some embodiments, for a pod in the managed node whose service type is a MySQL service, the pod migration module 250 or the manager node of the container cluster 110 may intercept its MySQL data packets and forward them to a corresponding storage medium. When the service type of the pod to be migrated is a MySQL service, the pod migration module 250 or the manager node of the container cluster 110 may start a target pod on the target managed node, map the storage medium corresponding to the pod to be migrated to the target pod, and, after the mapping is completed, take the pod to be migrated offline, where the storage medium corresponding to the pod to be migrated is used to store the MySQL data packets intercepted and forwarded from the target pod.
For example only, when a pod in the managed node whose service type is a MySQL service needs to store data, the pod migration module 250 or the manager node of the container cluster 110 may use the smart network card to intercept the MySQL data packets and forward them to the corresponding storage medium for storage; when data needs to be read for computation, the smart network card may read the data from the storage medium corresponding to the pod, where the storage medium may be a remote storage medium, for example, a storage medium outside the container cluster 110. When the service type of the pod to be migrated is a MySQL service, the pod migration module 250 or the manager node of the container cluster 110 may first determine the target managed node, then start a pod on the target managed node as the target pod, and map the storage medium corresponding to the pod to be migrated to the target pod through the smart network card. After the mapping is completed, the smart network card may intercept the MySQL data packets of the target pod and forward them to the storage medium corresponding to the pod to be migrated for storage, and the pod to be migrated may then be taken offline. In some embodiments, the mapping relationship between a pod and a storage medium may be established through PV (Persistent Volume) mapping.
By intercepting the MySQL data packets of pods whose service type is a MySQL service and forwarding them to the corresponding storage medium, compute-storage separation is achieved for MySQL pods. Mapping the storage medium corresponding to the pod to be migrated to the target pod then enables fast migration of MySQL pods: a large amount of data does not need to be copied from the storage medium to the target pod, which saves a large amount of time.
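Because the data already lives on an external storage medium, the Kubernetes side of such a migration can reduce to starting a target pod that mounts the same PersistentVolumeClaim and then taking the source pod offline, as in the sketch below. The claim name, image, namespace and mount path are assumptions, and the smart-NIC interception is outside the scope of this sketch.

```python
# Sketch: remap the PV-backed storage of a MySQL pod to a new target pod.
from kubernetes import client, config

def migrate_mysql_pod(src_pod: str, claim_name: str, target_node: str,
                      namespace: str = "default"):
    config.load_kube_config()
    v1 = client.CoreV1Api()

    target = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"{src_pod}-migrated"),
        spec=client.V1PodSpec(
            node_name=target_node,
            containers=[client.V1Container(
                name="mysql", image="mysql:8",
                volume_mounts=[client.V1VolumeMount(
                    name="data", mount_path="/var/lib/mysql")],
            )],
            volumes=[client.V1Volume(
                name="data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name=claim_name),
            )],
        ),
    )
    v1.create_namespaced_pod(namespace=namespace, body=target)   # map storage to the target pod
    v1.delete_namespaced_pod(name=src_pod, namespace=namespace)  # then take the source pod offline
```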
In some embodiments, after one pod migration is completed, the resource race determination module 240 or the data analysis server 120 may perform resource race detection again on the remaining pods of the managed node. If the managed node still has a resource race, the next pod to be migrated may be determined, and the pod migration module 250 or the manager node of the container cluster 110 may migrate it, repeating this process until it is determined that the remaining pods of the managed node no longer have a resource race, thereby eliminating the resource race of the managed node and improving service stability.
For example, after one pod migration is completed, the actual data acquisition module 230 or the report-agent of the managed node may acquire again the actual values of the race detection index of the remaining original pods, and the resource race determination module 240 or the data analysis server 120 may determine whether the managed node has a resource race according to the acquired actual values of the remaining original pods and the corresponding predicted values. For further description of this determination, refer to step 340 and its related description, which are not repeated here.
In some embodiments, to improve the efficiency of eliminating a resource race, the pod migration module 250 or the manager node of the container cluster 110 may determine, in one pass, one or more pods that need to be migrated from the managed node where the resource race occurred, and migrate them together. After the migration of these pods is completed, the resource race of the managed node is eliminated, and there is no need to perform resource race detection on the managed node again after the migration.
In some embodiments, the pod migration module 250 or the manager node of the container cluster 110 may determine, according to relevant information about the pods on the managed node where the resource race occurred, the one or more pods that need to be migrated from that node. For example, the pod migration module 250 or the manager node of the container cluster 110 may use a pod determination model to determine the one or more pods to be migrated based on relevant information about the pods (e.g., one or any combination of the actual and predicted values of the race detection index of the original pods, the service types of all pods, the CPU usage rates, the average loads, etc.). The pod determination model may be one, or any combination, of a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a multi-layer perceptron (MLP), a generative adversarial network (GAN), and the like.
In some embodiments, after the migration of the pod to be migrated is completed, the target managed node has a newly added pod, so the resource race determination module 240 or the data analysis server 120 may perform resource race detection on the target managed node. When it is determined that the target managed node has a resource race, the pod migration module 250 or the manager node of the container cluster 110 may migrate a pod of the target managed node and then detect again whether the target managed node has a resource race, until no resource race exists on any managed node in the container cluster 110. For more description of resource race detection and pod migration for the target managed node, refer to the relevant descriptions of steps 310 to 350, which are not repeated here.
The container cluster management method can predict the predicted value of the race detection index of the original pod from its historical value and then, based on the actual value and the predicted value of the race detection index of the original pod, quickly and accurately judge whether the managed node has a resource race.
It should be noted that the foregoing description of the container cluster management method is merely for illustration and description, and does not limit the application scope of the present disclosure. Various modifications and changes to the container cluster management method may be made by those skilled in the art under the guidance of this specification. However, such modifications and variations are still within the scope of the present specification.
The exemplary embodiments of the invention also provide an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the electronic device to perform a method according to an embodiment of the invention.
The exemplary embodiments of the present invention also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present invention.
The exemplary embodiments of the invention also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the invention.
Referring to fig. 4, a block diagram of an electronic device 400, which may be a server or a client of the present invention, will now be described; it is an example of a hardware device that may be applied to aspects of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the electronic device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 408 may include, but is not limited to, magnetic disks and optical disks. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, e.g., Bluetooth(TM) devices, Wi-Fi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the respective methods and processes described above. For example, in some embodiments, the container cluster management method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. In some embodiments, the computing unit 401 may be configured to perform the container cluster management method by any other suitable means (e.g., by means of firmware).
Program code for carrying out the methods of the present invention may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (12)

1. A method of container cluster management, comprising:
acquiring a historical value of a race detection index of an original pod of a managed node;
predicting a predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod;
when the managed node adds a pod, acquiring an actual value of the race detection index of the original pod;
and judging whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod.
2. The container cluster management method of claim 1, wherein the race detection index includes a process CPI;
the predicting the predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod comprises:
predicting a predicted value of the process CPI of the original pod based on the historical value of the process CPI of the original pod;
the determining whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod includes:
and judging whether the managed node has a resource race based on the deviation between the actual value and the predicted value of the process CPI of the original pod.
3. The container cluster management method of claim 2, wherein the race detection index comprises a target link time consumption;
the step of judging whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod, further includes:
when it is determined, based on the deviation between the actual value and the predicted value of the process CPI of the original pod, that the managed node has a resource race, determining a forwarding performance fluctuation based on the actual value of the target link time consumption, and judging whether the managed node has a resource race based on the forwarding performance fluctuation.
4. The container cluster management method of claim 3, wherein the race detection index comprises a service response time;
the step of judging whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod, further includes:
and when it is judged, based on the forwarding performance fluctuation, that the managed node does not have a resource race, determining a service response time fluctuation based on the actual value of the service response time, and judging whether the managed node has a resource race based on the service response time fluctuation.
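Claims 2 to 4 describe a cascaded judgment: a process CPI deviation first raises suspicion, the forwarding performance (target link time consumption) is then examined, and the service response time serves as a final check. A minimal Python sketch of that cascade follows; the thresholds and the baseline values are assumptions for illustration and are not fixed by the claims.

```python
def judge_resource_race(cpi_actual: float, cpi_predicted: float,
                        link_time_actual: float, link_time_baseline: float,
                        resp_time_actual: float, resp_time_baseline: float,
                        cpi_dev_th: float = 0.2,
                        link_fluct_th: float = 0.3,
                        resp_fluct_th: float = 0.3) -> bool:
    """Cascaded judgment sketched from claims 2-4 (thresholds are assumed)."""
    # Claim 2: deviation between actual and predicted process CPI.
    cpi_deviation = abs(cpi_actual - cpi_predicted) / max(cpi_predicted, 1e-9)
    if cpi_deviation <= cpi_dev_th:
        return False                       # CPI looks normal -> no race

    # Claim 3: confirm via forwarding performance fluctuation
    # (target link time consumption).
    link_fluctuation = (link_time_actual - link_time_baseline) / max(link_time_baseline, 1e-9)
    if link_fluctuation > link_fluct_th:
        return True                        # forwarding degraded -> race

    # Claim 4: otherwise fall back to service response time fluctuation.
    resp_fluctuation = (resp_time_actual - resp_time_baseline) / max(resp_time_baseline, 1e-9)
    return resp_fluctuation > resp_fluct_th

# Example: CPI deviates, forwarding is stable, but responses slow down.
print(judge_resource_race(1.5, 1.0, 10.5, 10.0, 80.0, 50.0))   # True
```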
5. The container cluster management method according to any one of claims 1 to 4, wherein predicting the predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod comprises:
determining a target model based on a time length corresponding to the historical value of the race detection index of the original pod, wherein the target model is an SVM model or an LSTM model;
and predicting the predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod through the target model.
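Claim 5 selects the prediction model according to the time length of the available history. The sketch below illustrates one possible selection, assuming a 24-hour boundary, a sliding-window SVR from scikit-learn for the short-history branch, and leaving the LSTM branch merely indicated; none of these specifics is mandated by the claim.

```python
import numpy as np
from sklearn.svm import SVR

def choose_model(history_hours: float, boundary_hours: float = 24.0) -> str:
    """Pick the predictor family from the length of the history
    (the 24 h boundary is an assumed example, not fixed by the claim)."""
    return "svm" if history_hours < boundary_hours else "lstm"

def predict_next_svm(history: np.ndarray, window: int = 5) -> float:
    """Fit an SVR on sliding windows of the index history and predict the
    next value; a minimal stand-in for the SVM branch of claim 5."""
    X = np.array([history[i:i + window] for i in range(len(history) - window)])
    y = history[window:]
    model = SVR(kernel="rbf", C=10.0).fit(X, y)
    return float(model.predict(history[-window:].reshape(1, -1))[0])

cpi_history = np.array([1.0, 1.02, 0.99, 1.01, 1.03, 1.0, 0.98, 1.02, 1.01, 1.0])
if choose_model(history_hours=6.0) == "svm":
    print(predict_next_svm(cpi_history))
else:
    # For longer histories an LSTM-based sequence model (e.g. torch.nn.LSTM)
    # would be trained instead; omitted here for brevity.
    pass
```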
6. The container cluster management method according to any one of claims 1 to 4, further comprising:
and when judging that the managed node has a resource race, migrating at least one pod of the managed node.
7. The container cluster management method of claim 6, wherein said migrating at least one pod of said managed node comprises:
determining the pod to be migrated based on the service type of the pods of the managed node.
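A minimal sketch of choosing the pod to be migrated by service type, as in claim 7, is given below; the priority ordering (stateless services first, then Redis, then MySQL) is an assumption for illustration, not part of the claim.

```python
# Choose the pod to migrate by service type; the ordering is an assumed example.
MIGRATION_PRIORITY = {"stateless": 0, "redis": 1, "mysql": 2}

def pick_pod_by_service_type(pods: dict) -> str:
    """`pods` maps pod name -> service type; return the pod assumed cheapest to move."""
    return min(pods, key=lambda name: MIGRATION_PRIORITY.get(pods[name], 99))

print(pick_pod_by_service_type({"web-1": "stateless", "db-1": "mysql", "cache-1": "redis"}))
# -> "web-1": stateless pods are assumed cheapest to migrate first
```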
8. The container cluster management method of claim 6, wherein said migrating at least one pod of said managed node comprises:
when the service type of the to-be-migrated pod is a Redis service, starting a target pod at a target managed node, with the data of the Redis container of the target pod initialized to be empty, acquiring the IP address of the Redis container of the target pod, accessing the Redis container of the target pod based on the IP address through RDMA technology, transmitting a copy of the memory data of the to-be-migrated pod to the Redis container of the target pod, and, after the transmission is completed, taking the to-be-migrated pod offline.
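Claim 8 migrates a Redis pod by starting an empty Redis container on the target managed node and copying the in-memory data over before taking the original pod offline. The sketch below covers only the data-copy step and uses redis-py DUMP/RESTORE over TCP as a simplified stand-in for the RDMA-based transfer; the pod IP addresses are hypothetical, and starting the target pod and taking the original pod offline (e.g., via the Kubernetes API) are omitted.

```python
import redis

def copy_redis_data(source_ip: str, target_ip: str, port: int = 6379) -> int:
    """Copy all keys from the to-be-migrated pod's Redis container to the
    (initially empty) Redis container of the target pod.  DUMP/RESTORE over
    TCP is used here as a simplified stand-in for the RDMA transfer."""
    src = redis.Redis(host=source_ip, port=port)
    dst = redis.Redis(host=target_ip, port=port)
    copied = 0
    for key in src.scan_iter(count=1000):
        ttl = src.pttl(key)                      # preserve the remaining TTL
        dst.restore(key, ttl if ttl > 0 else 0, src.dump(key), replace=True)
        copied += 1
    return copied

# After the copy completes, the to-be-migrated pod would be taken offline
# and traffic switched to the target pod; that step is omitted here.
# copied = copy_redis_data("10.0.1.23", "10.0.2.45")   # hypothetical pod IPs
```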
9. The container cluster management method of claim 6, further comprising:
intercepting MySQL data packets of the pod with the service type of MySQL service in the managed node and forwarding the MySQL data packets to a corresponding storage medium;
said migrating at least one pod of said managed node, comprising:
when the service type of the to-be-migrated pod is a MySQL service, starting a target pod at a target managed node, mapping the storage medium corresponding to the to-be-migrated pod to the target pod, and, after the mapping is completed, taking the to-be-migrated pod offline, wherein the storage medium corresponding to the to-be-migrated pod is used for storing the MySQL data packets intercepted and forwarded from the to-be-migrated pod.
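Claim 9 maps the storage medium of the to-be-migrated pod into the target pod before taking the original pod offline. The sketch below models only the mapping step, assuming the storage medium is exposed as a Kubernetes PersistentVolumeClaim that is mounted into a new pod via the official Kubernetes Python client; the claim name, node name, and image are hypothetical, and the MySQL packet interception itself is outside the scope of the sketch.

```python
from kubernetes import client, config

def start_mysql_target_pod(pvc_name: str, target_node: str,
                           pod_name: str = "mysql-target",
                           namespace: str = "default") -> None:
    """Start the target pod on the target managed node with the storage
    medium (modelled here as a PVC) of the to-be-migrated pod mapped into it."""
    config.load_kube_config()                      # or load_incluster_config()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=pod_name),
        spec=client.V1PodSpec(
            node_name=target_node,                 # pin to the target managed node
            containers=[client.V1Container(
                name="mysql",
                image="mysql:8.0",
                volume_mounts=[client.V1VolumeMount(
                    name="mysql-data", mount_path="/var/lib/mysql")],
            )],
            volumes=[client.V1Volume(
                name="mysql-data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name=pvc_name))],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)
    # Once the mapping is complete and the target pod is serving, the
    # to-be-migrated pod would be taken offline (omitted here).

# start_mysql_target_pod("mysql-pod-a-data", "node-2")   # hypothetical names
```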
10. A container cluster management device, comprising:
a historical data acquisition module, used for acquiring a historical value of a race detection index of an original pod of a managed node;
a data prediction module, used for predicting a predicted value of the race detection index of the original pod based on the historical value of the race detection index of the original pod;
an actual data acquisition module, used for acquiring an actual value of the race detection index of the original pod after the managed node adds a pod;
and a resource race determination module, used for judging whether the managed node has a resource race based on the actual value and the predicted value of the race detection index of the original pod.
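The device of claim 10 can be pictured as four cooperating modules. The skeleton below is only an organizational sketch in Python; the method bodies are placeholders, not an implementation of the claimed device.

```python
class HistoricalDataAcquisitionModule:
    def get_history(self, node: str, pod: str) -> list:
        """Collect historical values of the race detection index of an original pod."""
        raise NotImplementedError

class DataPredictionModule:
    def predict(self, history: list) -> float:
        """Predict the race detection index value from its history (e.g., SVM/LSTM)."""
        raise NotImplementedError

class ActualDataAcquisitionModule:
    def get_actual(self, node: str, pod: str) -> float:
        """Collect the actual index value after the managed node adds a pod."""
        raise NotImplementedError

class ResourceRaceDeterminationModule:
    def has_race(self, actual: float, predicted: float) -> bool:
        """Judge whether the managed node has a resource race."""
        raise NotImplementedError
```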
11. An electronic device, comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any of claims 1-9.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
CN202310602263.9A 2023-05-25 2023-05-25 Container cluster management method and device, electronic equipment and storage medium Pending CN116627586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310602263.9A CN116627586A (en) 2023-05-25 2023-05-25 Container cluster management method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310602263.9A CN116627586A (en) 2023-05-25 2023-05-25 Container cluster management method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116627586A true CN116627586A (en) 2023-08-22

Family

ID=87636157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310602263.9A Pending CN116627586A (en) 2023-05-25 2023-05-25 Container cluster management method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116627586A (en)

Similar Documents

Publication Publication Date Title
US11106560B2 (en) Adaptive thresholds for containers
US10887167B2 (en) Adaptive software-defined storage for cloud storage workloads
US10289973B2 (en) System and method for analytics-driven SLA management and insight generation in clouds
US8782215B2 (en) Performance testing in a cloud environment
Téllez et al. A tabu search method for load balancing in fog computing
WO2018076791A1 (en) Resource load balancing control method and cluster scheduler
US11182216B2 (en) Auto-scaling cloud-based computing clusters dynamically using multiple scaling decision makers
US20180241812A1 (en) Predictive autoscaling in computing systems
CN111796908B (en) System and method for automatic elastic expansion and contraction of resources and cloud platform
CN111158613B (en) Data block storage method and device based on access heat and storage equipment
CN108475207A (en) The joint auto zoom of cloud application
US10516571B2 (en) Server load management
US20200311600A1 (en) Method and system for prediction of application behavior
US11799901B2 (en) Predictive rate limiting system for cloud computing services
US11042410B2 (en) Resource management of resource-controlled system
US11750711B1 (en) Systems and methods for adaptively rate limiting client service requests at a blockchain service provider platform
CN113381944A (en) System current limiting method, apparatus, electronic device, medium, and program product
US20150058474A1 (en) Quality of service agreement and service level agreement enforcement in a cloud computing environment
CN103713935A (en) Method and device for managing Hadoop cluster resources in online manner
AlOrbani et al. Load balancing and resource allocation in smart cities using reinforcement learning
US11423326B2 (en) Using machine-learning methods to facilitate experimental evaluation of modifications to a computational environment within a distributed system
CN104135525B (en) The resource expansion method and apparatus of cloud platform ELB components
Rego et al. Decision tree-based approaches for handling offloading decisions and performing adaptive monitoring in MCC systems
US20210240459A1 (en) Selection of deployment environments for applications
Ali et al. Probabilistic normed load monitoring in large scale distributed systems using mobile agents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination