CN113391913A - Distributed scheduling method and device based on prediction


Info

Publication number
CN113391913A
CN113391913A
Authority
CN
China
Prior art keywords
pod
utilization rate
node
cluster
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110782812.6A
Other languages
Chinese (zh)
Inventor
朱宗卫
唐鑫
熊义昆
周学海
李曦
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110782812.6A
Publication of CN113391913A
Legal status: Pending


Classifications

    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F9/5016 Allocation of resources to service a request, the resource being the memory
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06N3/08 Neural networks; learning methods
    • G06N5/04 Inference or reasoning models
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • G06F2009/45583 Memory management, e.g. access or allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a distributed scheduling method and device based on prediction. The method acquires the historical resource utilization of the currently scheduled Pod and the resource utilization of the available nodes in the cluster; inputs the historical resource utilization of the currently scheduled Pod into a deep learning model for training and inference to obtain the predicted resource utilization of the Pod; and screens the available nodes in the cluster according to the predicted resource utilization of the currently scheduled Pod and the resource utilization of the available nodes to determine the target scheduling node of the currently scheduled Pod. According to the embodiment of the invention, the resource utilization of a Pod is predicted with a deep learning model and the nodes in the cluster are scored based on the prediction data, which mitigates the situation in which a Pod remains in an abnormal state because of an unbalanced cluster load, automatically rebalances the cluster load, and raises the performance ceiling of the cluster.

Description

Distributed scheduling method and device based on prediction
Technical Field
The embodiment of the invention relates to the technical field of cluster resource scheduling, in particular to a distributed scheduling method and device based on prediction.
Background
Kubernetes is a very popular container orchestration tool; its advanced design concept has attracted industry attention, and it is widely applied in practical production environments. An important task of Kubernetes is to select a suitable node to run each Pod (the smallest unit of creation and deployment in Kubernetes, i.e. a running instance). The load of the whole cluster is determined by the resource utilization of each node in the cluster, and the utilization of each node depends on the Pods running on it. Therefore, the scheduling policy of the cluster greatly affects its load status and resource utilization.
In general, when the Kubernetes default scheduling algorithm schedules computing tasks whose resource demands deviate widely, a load imbalance appears across the servers of the cluster.
In the prior art, this problem is generally addressed by extending the Kubernetes scheduler, but existing methods either do not consider time factors or ignore low-priority tasks, and still need improvement.
Disclosure of Invention
The invention provides a prediction-based distributed scheduling method and device, which mitigate the situation in which a Pod remains in an abnormal state because of an unbalanced cluster load and automatically rebalance the cluster load, thereby raising the performance ceiling of the cluster.
In a first aspect, an embodiment of the present invention provides a distributed scheduling method based on prediction, applied to a Kubernetes cluster, including:
acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster;
inputting the historical resource utilization rate of the current scheduling Pod into a deep learning model for training and reasoning to obtain the resource prediction utilization rate of the current scheduling Pod;
and screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster to determine the target scheduling node of the current scheduling Pod.
In a second aspect, an embodiment of the present invention further provides a distributed scheduling apparatus based on prediction, configured in a Kubernetes cluster, including:
the acquisition module is used for acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster;
the prediction module is used for inputting the historical resource utilization rate of the current scheduling Pod into a deep learning model for training and reasoning so as to obtain the resource prediction utilization rate of the current scheduling Pod;
and the determining module is used for screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster so as to determine the target scheduling node of the current scheduling Pod.
According to the method and the device, the resource utilization of a Pod is predicted with a deep learning model, the nodes in the cluster are scored based on the prediction data, and the optimal target scheduling node is determined for the currently scheduled Pod. This mitigates the situation in which a Pod remains in an abnormal state because of an unbalanced cluster load, automatically rebalances the cluster load, and raises the performance ceiling of the cluster.
Drawings
Fig. 1 is a flowchart of a prediction-based distributed scheduling method according to an embodiment of the present invention;
fig. 2 is a processing diagram of a prediction-based distributed scheduling method according to an embodiment of the present invention;
fig. 3 is a flowchart of node scoring according to an embodiment of the present invention;
fig. 4 is a working sequence diagram of a local working mode of the model according to the second embodiment of the present invention;
fig. 5 is a schematic diagram illustrating utilization rates of a CPU and a memory of a cluster node in the prior art according to a third embodiment of the present invention;
fig. 6 is a schematic diagram illustrating utilization rates of a CPU and a memory of a cluster node after the prediction-based distributed scheduling method provided by the third embodiment of the present invention is used;
fig. 7 is a schematic structural diagram of a distributed scheduling apparatus based on prediction according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a prediction-based distributed scheduling method according to an embodiment of the present invention. The embodiment is applicable to the situation in which a Kubernetes scheduler schedules Pods, and the method may be executed by a prediction-based distributed scheduling apparatus.
The prediction-based distributed scheduling method provided in this embodiment first acquires the resource usage of Pods, that is, each item of Pod-level resource usage data is collected in real time. The collected data is used to train a deep learning model; the trained model is deployed in the cluster and predicts Pod resource occupancy in the form of a service. During scheduling, the predicted resource usage of the Pod also participates in node scoring, and the node with the highest score is finally selected to bind and run the Pod. The specific processing flow is shown in fig. 2.
Further, the distributed scheduling method based on prediction provided by the embodiment of the present invention specifically includes the following steps:
s110, acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster.
In this embodiment, the historical resource utilization rate of the currently scheduled Pod and the resource utilization rate of the available node in the cluster are obtained by deploying a cluster resource monitoring policy of Prometheus + Grafana in the cluster.
The resources specifically comprise three indexes: CPU utilization, memory utilization, and IOwait rate. Specifically, the embodiment may obtain this data in real time from the exposed data interface of the Prometheus server through the following three formulas:
a) CPU utilization rate:
(1-sum(increase(node_cpu_seconds_total{mode="idle"}[10s]))by(instance)/sum(increase(node_cpu_seconds_total[10s]))by(instance))*100;
b) the memory utilization rate is as follows:
(1-((node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes))*100;
c) IOwait rate:
(sum(increase(node_cpu_seconds_total{mode="iowait"}[10s]))by(instance)/sum(increase(node_cpu_seconds_total[10s]))by(instance))*100。
since the present embodiment can use the nodecoort method to perform service exposure, the data acquisition process can be completed on any node in the cluster, so that each node in the cluster exposes a Port, and data can be acquired from the designated IP + Port. Another method is LoadBalancer, which exposes data as a domain name that can be used to access data both within and outside the cluster.
S120, inputting the historical resource utilization rate of the current scheduling Pod into a deep learning model for training and reasoning to obtain the resource prediction utilization rate of the current scheduling Pod.
After the resource usage data for the Pod is acquired, the acquired data is used for training of the model. Data for both model training and reasoning can be obtained in the manner provided by S110.
Because the collected data is one-dimensional, it must be preprocessed: the CPU utilization and memory utilization are combined into a single sample, and the prediction length can be set according to actual conditions. The processed data is placed in a designated file so that automatic training can be performed subsequently.
For data prediction, the acquired real-time data is fed into the model and the inference program is executed; the prediction results are exposed at a specified endpoint for scoring the nodes during subsequent Pod scheduling.
The deep learning network model in this embodiment may be an RNN model, an LSTM model, or an improved Multi-attention feature Memory (M-ACM) model. The M-ACM model consists of three working units, each using a bidirectional LSTM structure augmented with an attention mechanism, and a fully connected layer is finally added for data output.
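Since the text does not fix layer sizes or the exact attention form, the following PyTorch sketch of the M-ACM structure is an assumption-laden illustration: three bidirectional-LSTM-with-attention units followed by a fully connected output layer, sized for the 180-step input and 60-step prediction windows used later in this document.

```python
# A hedged sketch of the M-ACM structure; hidden sizes and the additive
# attention form are assumptions, not specified by the source.
import torch
import torch.nn as nn

class AttnBiLSTMUnit(nn.Module):
    """One working unit: bidirectional LSTM plus attention over time steps."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len, in_dim)
        h, _ = self.lstm(x)                     # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weight per time step
        return h * w                            # reweighted sequence, same shape

class MACM(nn.Module):
    """Three stacked units followed by a fully connected output layer."""
    def __init__(self, in_dim=2, hidden=64, horizon=60):
        super().__init__()
        self.units = nn.Sequential(
            AttnBiLSTMUnit(in_dim, hidden),
            AttnBiLSTMUnit(2 * hidden, hidden),
            AttnBiLSTMUnit(2 * hidden, hidden),
        )
        self.fc = nn.Linear(2 * hidden, horizon * in_dim)
        self.horizon, self.in_dim = horizon, in_dim

    def forward(self, x):                       # x: (batch, 180, 2) CPU/memory
        h = self.units(x)[:, -1, :]             # last step of the final unit
        return self.fc(h).view(-1, self.horizon, self.in_dim)
```

A forward pass on a `(batch, 180, 2)` tensor of CPU/memory samples yields a `(batch, 60, 2)` prediction window under these assumptions.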
S130, screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster to determine the target scheduling node of the current scheduling Pod.
In this embodiment, after the acquired data is used for training the deep learning model and making predictions, the prediction results feed the subsequent scheduling strategy. Based on the predicted Pod resource usage, the data allocation imbalance that arises in multidimensional resource allocation can be addressed. During cluster resource allocation, the problem arises that, when a new service is to be deployed, some resources remain in the cluster yet the new service cannot be deployed.
Monitoring the cluster's resource usage shows that even when cluster resources satisfy the resource requests of some Pods, a large number of Pods still remain in the waiting state, i.e. stuck in preparation before node assignment. Many causes can produce this state, for example image pull failure, insufficient resources, restrictions of Pod scheduling rules, and taint settings; the first two are the main ones. This embodiment therefore ensures that CPU and memory load levels are kept as close as possible when a Pod is scheduled for the first time, and, when certain conditions are met and a Pod must be evicted and rescheduled, feeds the Pod's previous resource usage to the model to obtain its predicted resource usage and performs node scoring on that predicted data.
Specifically, the screening the available nodes in the cluster according to the resource prediction utilization rate of the currently scheduled Pod and the resource utilization rate of the available nodes in the cluster to determine the target scheduling node of the currently scheduled Pod includes:
pre-screening the available nodes based on a built-in algorithm of Kubernetes to obtain candidate available nodes;
and screening the candidate available nodes according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the candidate available nodes so as to determine the target scheduling node of the current scheduling Pod.
In this embodiment, in the preselection (Predicate) stage of Scheduler scheduling, the available nodes are pre-screened with the Kubernetes built-in algorithm under a relaxed election condition: any node meeting the most basic scheduling conditions passes preselection and becomes a candidate available node. Since the Kubernetes built-in algorithm is prior art, it is not described in detail.
Further, after the available nodes are subjected to pre-screening, the candidate available nodes passing through the pre-screening are subjected to further grading screening. Specifically, the candidate available nodes are screened according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the candidate available nodes, so as to determine the target scheduling node of the current scheduling Pod, which includes two parts:
a) statistical scoring of remaining resource rate of server
Namely, according to the resource utilization rate of each candidate available node, determining a first score of each candidate available node, wherein a calculation formula of the first score is as follows:
Score1 = 10 - (U_nc + U_nm) * 5;
where Score1 represents the first score, U_nc represents the CPU utilization of the candidate available node, and U_nm represents the memory utilization of the candidate available node.
The first score favors nodes with low resource utilization, thereby ensuring coarse-grained load balance across the nodes. Each node receives a base score according to its usage: the lower a node's resource utilization, the higher its score.
b) Differential statistical scoring of CPU and memory utilization
If the currently scheduled Pod runs for the first time, the second score of each candidate available node is determined from the node's resource utilization and the resource application proportion of the Pod. Specifically, the Pod's CPU application proportion and memory application proportion are added to the node's original resource utilizations to obtain the expected CPU utilization and expected memory utilization respectively; the smaller the difference between the two, the higher the score. The second score is calculated as follows:
Score2-1 = 10 - (U_nc + R_pc - (U_nm + R_pm)) * 10;
where Score2-1 represents the second score when the currently scheduled Pod runs for the first time, U_nc and R_pc respectively represent the CPU utilization of the candidate available node and the CPU application proportion of the currently scheduled Pod, and U_nm and R_pm respectively represent the memory utilization of the candidate available node and the memory application proportion of the currently scheduled Pod.
If the currently scheduled Pod is not running for the first time, the second score of each candidate available node is determined from the node's resource utilization and the predicted resource utilization of the Pod. That is, the Pod has run for some time but was forcibly removed by an eviction policy or manually deleted by the user after an error. In this case, the deep learning model predicts the resource usage from the Pod's previous data, yielding the expected CPU and memory occupancy over a coming period. Again, the smaller the difference between the two, the higher the node's score; the corresponding second score is calculated as follows:
Score2-2 = 10 - (U_nc + U_pcp - (U_nm + U_pmp)) * 10;
where Score2-2 represents the second score when the currently scheduled Pod is not running for the first time, U_nc and U_pcp respectively represent the CPU utilization of the candidate available node and the predicted CPU utilization of the currently scheduled Pod, and U_nm and U_pmp respectively represent the memory utilization of the candidate available node and the predicted memory utilization of the currently scheduled Pod.
Further, after the first score and the second score of each candidate node are obtained, the candidate available nodes are ranked by these scores to determine the target scheduling node of the currently scheduled Pod: the sum of the first score and the second score is taken as each candidate available node's target score, and the candidate available node with the highest target score is taken as the target scheduling node.
Further, on the basis of the above embodiment, this embodiment also binds Pods providing the same service to different physical nodes, that is, Pods are deployed in a scattered manner through anti-affinity. By introducing a Pod anti-affinity (Anti-Affinity) algorithm, Pods providing the same service are kept from running on the same node as much as possible, improving the availability of the service. To determine the anti-affinity between Pods, each Pod must be tagged with a corresponding label; Pods possessing the same label are said to have anti-affinity between them. The more Pods on a node share a label with the currently scheduled Pod, the lower that node's score. The anti-affinity score of a node is computed as follows, where AAPod_i represents the i-th Pod carrying an anti-affinity label:
Score_A = 10 - ∑(AAPod_i) / 10
In a Kubernetes cluster, mutual exclusion between Pods can be enforced with a MatchLabel policy, but that is a hard policy: nodes are filtered out directly when the condition is not satisfied. Here the aim is both to widen the range of nodes on which Pods can run and to avoid Pods of the same application or service having to communicate across nodes. The scheduling algorithm mainly considers the node's actual CPU and memory usage and the change in resource usage after the Pod is scheduled. Therefore, during scoring, nodes with lower load and a smaller difference between CPU and memory utilization after the Pod's requested resources are allocated are preferentially selected as the optimal scheduling node.
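The three scores above can be combined as in the following minimal sketch. The absolute value in the difference terms is an interpretation of "the smaller the difference, the higher the score"; the dictionary keys and the summation of the anti-affinity score into the total are assumptions for illustration.

```python
def score_node(u_nc, u_nm, pod, anti_affinity_count=0):
    """Total score of one candidate node for the currently scheduled Pod.

    u_nc, u_nm: node CPU / memory utilization on a 0..1 scale.
    pod: dict with 'first_run' plus either the application proportions
         (R_pc, R_pm) or the predicted utilizations (U_pcp, U_pmp).
    """
    # Score1: statistical score of the server's remaining resource rate.
    score1 = 10 - (u_nc + u_nm) * 5

    # Score2: difference score of CPU and memory utilization; abs() encodes
    # "the smaller the difference, the higher the score" (an interpretation).
    if pod["first_run"]:
        diff = (u_nc + pod["R_pc"]) - (u_nm + pod["R_pm"])    # Score2-1
    else:
        diff = (u_nc + pod["U_pcp"]) - (u_nm + pod["U_pmp"])  # Score2-2
    score2 = 10 - abs(diff) * 10

    # Score_A: anti-affinity; more same-label Pods on the node, lower score.
    score_a = 10 - anti_affinity_count / 10

    return score1 + score2 + score_a
```

Under this sketch, a node whose expected CPU and memory utilizations come out equal receives the full Score2 of 10, matching the preference for balanced nodes described above.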
FIG. 3 illustrates the workflow of S130. If the currently scheduled Pod runs for the first time, Score1 and Score2-1 of the algorithm are used directly; if the Pod ran before but was deleted for some reason, the Score2-2 part of the algorithm is scored based on the queried predicted resource utilization of the Pod.
The embodiment of the invention uses deep learning to predict Pod resource usage and schedules the Pod to a more suitable node in advance, rather than adjusting after the load imbalance actually occurs. This mitigates the situation in which a Pod remains in an abnormal state because of an unbalanced cluster load, automatically rebalances the cluster load, and raises the performance ceiling of the cluster.
Example two
The prediction-based distributed scheduling method provided by this embodiment extends the Kubernetes Scheduler by means of an Extender and embeds a custom algorithm in the Extender. In the Extender client, data is obtained from a specified endpoint; this endpoint may be a URL, and the data is exposed in Restful style.
1. Algorithm workflow: the prediction-based distributed scheduling method provided by this embodiment is added to the scheduling process of the Kubernetes platform in the form of an extension or plug-in, which facilitates later extension or replacement with different algorithms. Following the relevant Kubernetes documentation, the algorithm is embedded into the Kubernetes scheduling process using the Extender mechanism; the proposed prediction-based scheduling scoring algorithm takes effect in the Priority stage of the Scheduler. This stage takes the currently scheduled Pod object and the list of available node objects, and outputs a scored node list.
First, each node is traversed. For each node, the Kubernetes built-in algorithm first gives the node a basic score; the node's CPU and memory utilization are then obtained from the collector, and the first-part score is computed from the formula Score1 = 10 - (U_nc + U_nm) * 5. In this part, the higher a node's load, the lower its score, because servers with lower load are preferred so as to narrow the load gap between different servers.
For the second-part score, the Pod of the current scheduling round must first be examined. If the Pod is created for the first time, no prediction of its resource usage can be obtained, but resource usage is still kept as balanced as possible: resources are pre-allocated according to the formula Score2-1 = 10 - (U_nc + R_pc - (U_nm + R_pm)) * 10, and the CPU-memory load difference is calculated from the pre-allocation result, the hope being that after the Pod is scheduled onto a node, the difference between that node's CPU and memory load is as small as possible.
If the Pod is not running for the first time, a prefix match on the Pod name is performed in the ETCD to obtain the Pod's resource prediction result. The stored data covers the average CPU and memory usage over the most recent 30 minutes at a 10-second collection interval, and the averages of the Pod's CPU and memory usage ratios serve as the final prediction. The final second-part score is then computed from the formula Score2-2 = 10 - (U_nc + U_pcp - (U_nm + U_pmp)) * 10. In this part, a larger difference between CPU and memory usage means that scheduling the current Pod onto the node would cause a greater degree of resource imbalance, so the node's score is reduced to preserve load balance across the multidimensional resources.
After all nodes are traversed, the Scheduler Extender produces the final score of each available node, sorts the nodes by score, and sends the scoring result to the Scheduler. The Scheduler writes the ID of the best node into the Pod's NodeName field and writes the binding information into the ETCD through the API Server; when the Kubelet observes that the Pod information has changed, it calls Docker to create the container and writes the container creation information into the ETCD through the API Server. The Pod of the current scheduling round then actually runs on a server node in the Kubernetes cluster, and the next Pod in the scheduling queue goes through the same process to select its best scheduling node.
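As a hedged illustration of where the Extender plugs in, the following sketches its priority endpoint as a small HTTP service, following the Kubernetes Scheduler Extender convention of receiving the Pod and node list and returning a host/score list. Flask, the URL path, the port, and the stubbed utilization lookup are assumptions of this sketch.

```python
# A minimal sketch of the Extender's priority endpoint, under the stated
# assumptions. Only Score1 is computed inline; Score2 and the anti-affinity
# score would be added the same way.
from flask import Flask, request, jsonify

app = Flask(__name__)

def node_utilization(name):
    # Placeholder: the full program queries these values from Prometheus.
    return 0.5, 0.5

@app.route("/prioritize", methods=["POST"])
def prioritize():
    args = request.get_json()            # {"Pod": ..., "Nodes": {"items": [...]}}
    result = []
    for node in args["Nodes"]["items"]:
        name = node["metadata"]["name"]
        u_nc, u_nm = node_utilization(name)
        score = 10 - (u_nc + u_nm) * 5   # Score1; other parts added likewise
        result.append({"Host": name, "Score": int(score)})
    return jsonify(result)               # host/score list back to the Scheduler

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)   # assumed port for the Extender service
```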
2. Model embedding
Since the algorithm only needs to obtain the prediction result of the model, two methods can be considered.
(1) Local operation model
In this manner, model training, reasoning, and predictive result exposure may be performed in the GPU servers in the cluster. In this way, the model works as follows:
1) data acquisition: data was obtained from the Prometheus exposed data endpoint and data cleansing was performed.
2) Model training: the cleaned data is preprocessed and fed into the model for training. Since the preprocessing step is placed inside the model's training code, only cleaned data needs to be fetched from Prometheus. Considering the poor parallelism of the RNN family of models, the data is standardized during preprocessing to accelerate the model's convergence, reduce training time, and improve the efficiency of feature storage.
3) Model inference: the trained model is stored as a weight file in a local folder on the GPU server; at inference time, data is fetched from the Prometheus endpoint and fed to the model for inference.
4) Prediction result exposure: the latest prediction of the model is exposed by a client program at a fixed endpoint; the Extender program only needs to pull the prediction result from the specified endpoint. A minimal sketch of this exposure follows.
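The following sketch assumes a Flask service whose in-memory record is refreshed by the inference loop; the URL path and port are illustrative.

```python
# A minimal sketch of fixed-endpoint exposure of the latest predictions.
import threading
import time
from flask import Flask, jsonify

app = Flask(__name__)
latest = {}  # {pod_name_prefix: [cpu_avg, mem_avg]}, refreshed by the loop below

def inference_loop():
    while True:
        # Placeholder: pull data from Prometheus, run the model, update `latest`.
        time.sleep(10)

@app.route("/predictions")
def predictions():
    return jsonify(latest)   # the Extender pulls this record during scoring

if __name__ == "__main__":
    threading.Thread(target=inference_loop, daemon=True).start()
    app.run(host="0.0.0.0", port=9001)  # assumed fixed endpoint
```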
Fig. 4 shows the working sequence of the model's local working mode: the obtained prediction result is exposed at a designated endpoint by a client program, and during Pod scheduling the Extender program queries the corresponding prediction record from that endpoint for the scoring operation of the scheduling algorithm.
The advantage of this approach is that the model's parameters and training state are persisted without applying for a PV volume or other Pod storage persistence operations; the disadvantage is that both the model and the data exposure endpoint live outside the cluster, so communication incurs some latency.
(2) Running the model in a container
The model may be trained first; the model weight file, training code, and inference code are then packed into a Docker container, and model inference and retraining are performed in Pod form. In this mode the model works as follows:
1) Data acquisition: in this mode, Prometheus does not need to expose the collected data outside the cluster; the client program running in the Pod can acquire data from the IP address of a server node in the cluster plus the port specified by Prometheus.
2) Model training: the training work of the model is completed locally, and the process is similar to the working mode of the local model.
3) Model packing: the trained model is packed into a Docker image. If a GPU is used for inference, the specified Nvidia library files must be mounted into an accessible directory inside the container when the model's Docker image is deployed, and the Nvidia-Device-Plugin must be deployed in advance. If the CPU is used for inference, no GPU configuration is performed.
4) Model inference: the trained model runs in Pod form and can expose data outward as a Service; however, because it runs in a Pod, the data can only be exposed inside the cluster in the form of the cluster virtual IP plus port.
5) Prediction result exposure: in the Extender, data can be obtained in the form of IP + Port; if access from outside the cluster is needed, the endpoint containing the prediction data must be exposed via NodePort or LoadBalancer.
This mode works similarly to the local model, but the model training environment must be packaged into a Docker image, and the prediction results must be persisted to disk via PV or PVC.
Both modes can complete data acquisition, model inference, and result exposure; if the code must be manually tuned at regular intervals, or the data must be kept safe and convenient to access, the work of data acquisition, model training and inference, and result exposure can be completed directly on the local machine.
3. Algorithm embedding
Based on the prediction results of the deep learning model, the cluster scheduling process is intervened in so that relatively good nodes are selected to run Pods, cluster load balance is ensured, and the usable upper limit of cluster resources is raised while resource imbalance is reduced.
Further, the overall working process of scheduling work after the scheduling algorithm is added is as follows:
(1) Data acquisition: first, a Prometheus Server is deployed in the cluster; its built-in TSDB (time series database) stores the collected cluster data, keeping the latest two hours of data in memory and persisting older data to the local disk. This storage, however, shares the life cycle of the Prometheus Server Pod: after the Prometheus Server restarts, all data is lost. The prediction results of Pod resource usage are therefore written into the ETCD key-value storage system.
(2) Data cleaning: after acquisition, the data is subsampled with a step size of 3, which raises the frequency of feature transitions to some extent so that abnormal conditions can be predicted earlier. Samples whose values are zero or very close to zero indicate a Pod workload of almost 0, so these samples, less than 0.1% of the total, are filtered out. The samples are then written to a local file as a one-dimensional array of two-element items (a cleaning sketch is given after step (6)).
(3) Model training: after the resource usage data is acquired, the model reads it from a specified path, which can be implemented in two ways. One approach is to store the data locally and mount the local folder to a designated location in the container when the model is deployed. After the data is acquired, a data generator module produces 180:60 data samples (180 input steps, 60 predicted steps) and feeds them into the model for training. If training or inference must run on a GPU, the required Nvidia library must be mounted into the container; after mounting, training and inference can only run on the designated GPU server, which puts some pressure on the GPU servers in the cluster. After training completes, the model is saved to a specified directory, and the path is configured in the inference module.
(4) Resource usage inference: the most recent CPU and memory usage is read from the file written by the data acquisition module, the trained model is loaded, and inference is performed; this part only needs the timed prediction function of the prediction module. In the experiments, inference over 10000 samples took less than 40 seconds, so a single sample of length 180 takes well under 1 second. When the resource usage inference for all Pods is complete, the averages of the inference results are written into the ETCD as key-value pairs: the key is the part of the Pod name left after the generated random suffix is removed, and the value is the averages of the CPU and memory utilization, stored as a one-dimensional two-element array. If Pods of the same kind run multiple times, several prediction results are produced; to avoid unbalanced load states as much as possible, the current record is overwritten by the record with the largest difference between the CPU and memory utilization averages (a storage sketch is given after step (6)).
(5) Scheduling by the Scheduler: the Scheduler continuously scans, through the API Server, the Pods in the ETCD whose NodeName field is empty; these Pods are organized into a queue and fed into the Scheduler together with the list of available nodes, where the elements of the Pod and available-node queues are corresponding object pointers. The Scheduler enters the preselection phase, which mainly selects the nodes whose resources are sufficient to support the Pod's operation; some selector controllers strictly stipulate that the Pod may only run on nodes of a specified type, and available nodes must meet these hard conditions. Unavailable nodes are filtered out by the built-in algorithm, producing a list of available nodes together with the unavailable nodes and the reasons they are unavailable. When the Extender configuration is set, this information is sent to the entry address specified by the Extender for secondary filtering; no additional processing is done in this part, and the available nodes only need to satisfy the index conditions specified by Kubernetes, so the original preselection result computed by the Scheduler's built-in preselection algorithm is returned at the preselection stage. The Scheduler then enters the priority stage and, after running its built-in algorithm, sends the node scoring or sorting result and the Pod object pointer to the entry address of the Extender. In this part, if the Pod has a previous running record, the key-value record with the current Pod's ID prefix is looked up from the ETCD. The nodes are then fed into the Priority module in the Extender, scored based on each node's actual CPU and memory utilization and the model's prediction result, and the updated scoring result is sent back.
(6) Node binding and Pod running: after receiving the Extender's scoring result, the Scheduler writes the highest-scoring node's ID into the Pod's NodeName field and then writes the Pod-node binding into the ETCD through the API Server. The next round of Pod scheduling then proceeds.
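As referenced in step (2), the following is a minimal sketch of the cleaning rule, assuming 10-second samples arrive as (cpu, mem) pairs, reading "very close to zero" as a fixed small threshold; the file name is illustrative.

```python
# A sketch of the data-cleaning step under the stated assumptions.
import json

def clean(samples, out_path="pod_usage.json"):
    """samples: list of (cpu_util, mem_util) pairs collected at 10 s intervals."""
    kept = []
    for cpu, mem in samples[::3]:        # subsample with step size 3
        if cpu < 1e-3 and mem < 1e-3:    # workload ~0: filter the sample out
            continue
        kept.append([cpu, mem])          # one-dimensional array of 2-element items
    with open(out_path, "w") as f:
        json.dump(kept, f)
    return kept
```

And as referenced in step (4), a hedged sketch of the ETCD write, here using the python-etcd3 client (an assumption; the source does not name a client library). The suffix-stripping rule, key prefix, and address are illustrative.

```python
# Keep, per Pod-name prefix, the record with the largest CPU/memory gap.
import json
import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)  # assumed ETCD address

def store_prediction(pod_name, cpu_avg, mem_avg):
    key = "/predictions/" + pod_name.rsplit("-", 1)[0]  # drop the random suffix
    old_raw, _ = etcd.get(key)
    if old_raw is not None:
        old_cpu, old_mem = json.loads(old_raw)
        if abs(old_cpu - old_mem) >= abs(cpu_avg - mem_avg):
            return  # the existing record already has the larger difference
    etcd.put(key, json.dumps([cpu_avg, mem_avg]))
```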
This embodiment provides a technical scheme for the Kubernetes cluster resource imbalance problem: resource utilization is predicted by a multi-attention feature memory model, and a Kubernetes distributed scheduling strategy is designed around this predictive model.
EXAMPLE III
The embodiment provides an experimental verification method and a verification result of a distributed scheduling method based on prediction.
1. Experimental Environment
The experiments in this example were performed on a Kubernetes distributed cluster of 26 nodes.
2. Introduction to testing tools
In this embodiment, PowerfulSeal is used as the test-case debugging tool. Its functions include:
(1) Obtaining detailed information about each running container object, so that configurations required for fault simulation can be manually modified or deleted during testing.
(2) Stopping, starting, and deleting server nodes in the cluster.
(3) Killing a container on a designated node.
(4) Finding specified Pods, Deployments, and Namespaces through the API Server.
(5) Providing an interactive mode, so that cluster changes after an operation can be checked immediately.
In the experiments, PowerfulSeal randomly deletes Pods in the cluster to simulate a random scheduling scenario over multiple Pods; the cluster performance testing process is described below.
3. Cluster optimized pressure measurement performance and analysis
During cluster performance testing, Kube-Stresscheck is used as the basic test case. Six test applications with different degrees of load imbalance are simulated by adjusting the container resource upper limits when the test-case Pods run; the applications run in the cluster as multiple copies until a newly added Pod stays in the waiting state, which simulates cluster resource usage reaching its upper limit. PowerfulSeal then deletes a specified number of Pods; because of the replica controller, the cluster recreates and reschedules them, but based on the current resource usage they may not be scheduled to the nodes they were first bound to. PowerfulSeal is needed to delete Pods because, at creation time, copies of the same Pod started by one deployment or daemon configuration file occupy consecutive positions in the scheduling queue, and if resources allow, they are immediately bound to designated nodes and start running. Some copies of heterogeneous Pods must therefore be deleted manually to simulate a random scheduling scenario with Pods of different CPU and memory application proportions.
The upper performance limit of the cluster was tested using six example applications with different resource application ratios as follows:
[Table: six example applications with different CPU and memory application ratios]
in the running process of daily calculation tasks, due to the difference of intermediate calculation operations, the use conditions of different resources by the Pod are constantly changed, and the change may cause that a task using more CPU resources occupies a larger proportion of memory than a CPU at a certain time during execution, which may affect the judgment of the calculation task type. In order to verify the performance of the algorithm, the used test cases always occupy the resource quantity close to the resource distribution upper limit, and the problem that the experimental result is influenced by the change of task properties caused by different resource occupancy rates generated by the calculation task in different calculation stages is avoided. Three types of applications are used as test cases in the experiment, the proportion of CPU and memory is equal, the number of applications with more CPU and more memory is two, and different granularity levels are used in the similar application experiment, so that the accuracy of the experiment result is ensured, wherein 1000m represents that a CPU with one core is applied for Pod application, 1Gi represents that the Pod application uses 1GB memory space, and 1Mi represents that the memory space of 1MB is applied.
The Pod upper limit of each node in the cluster is set to 200, and when deploying, the number of copies of each example application is set to 90 in the configuration file. After all applications start running, about half of the example application copies are deleted at random until new Pods can be scheduled again, disturbing the scheduling order as much as possible. In the original cluster, Prometheus is used to obtain the CPU and memory utilization of each node server, shown in fig. 5; by calculation, the average remaining CPU rate is 11.4% and the remaining memory idle rate is 10.8%.
After the prediction-based distributed scheduling algorithm provided by the embodiment of the present invention is added, the cluster is again driven to the full-load state with the example Pods; the resulting CPU and memory utilization of all nodes in the cluster is shown in fig. 6, with an average CPU remaining rate of 5.1% and an average memory remaining rate of 4.9%. Compared with the original scheduling strategy, CPU and memory utilization improve, the degree of CPU-memory load imbalance decreases, and the remaining resources can accommodate tasks with stricter resource requirements.
In this embodiment, the average per-node difference between CPU and memory utilization measures a node's load imbalance, and the variances of memory and CPU utilization across nodes measure the load balance (a computation sketch follows). For the cluster before optimization, the average CPU-memory imbalance is 3.89%, the variance of node CPU utilization is 10.30, and the variance of memory utilization is 10.97. With the algorithm provided by the embodiment of the invention, the average imbalance is 2.27%, the variance of node CPU utilization is 6.41, and the variance of memory utilization is 3.94. The optimized cluster resources are clearly used more evenly.
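A small sketch of these two balance metrics, assuming per-node utilization lists in percent:

```python
# Mean per-node |CPU - memory| difference (imbalance degree) and the
# variance of each utilization across nodes (balance degree).
import numpy as np

def balance_metrics(cpu_utils, mem_utils):
    cpu = np.asarray(cpu_utils, dtype=float)  # per-node CPU utilization, %
    mem = np.asarray(mem_utils, dtype=float)  # per-node memory utilization, %
    return {
        "mean_offset": float(np.mean(np.abs(cpu - mem))),
        "cpu_variance": float(np.var(cpu)),
        "mem_variance": float(np.var(mem)),
    }
```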
4. Summary of Experimental optimization
In the embodiment of the invention, the traditional model is improved and optimized for cluster data, and a custom scheduling scoring algorithm is implemented on top of the model. This mitigates the situation in which a Pod remains in an abnormal state because of an unbalanced cluster load, and automatically rebalances the cluster load, thereby raising the performance ceiling of the cluster.
In the technical scheme provided by the embodiment of the invention, when the target scheduling node is determined, the server's actual resource usage and the prediction of the Pod's resource usage are considered at the same time, and the difference between CPU and memory utilization serves as one of the scoring criteria. Scheduling is divided into the preselection and priority stages. The proposed scheduling strategy does not further restrict the range of available nodes: any node meeting the basic Pod operation conditions joins the available node list. In the priority stage, however, the resource usage of previously run Pods is predicted from the previously collected information, and the prediction guides the Scheduler's scheduling decisions. The scheduling algorithm mainly aims to keep each node in the cluster as stable as possible, reducing the server load imbalance caused by continuously scheduling many tasks with different application proportions. Experiments prove that with the prediction-based distributed scheduling method of the embodiment of the invention, the average CPU idle rate of the Kubernetes cluster is reduced by 6.3%, the average memory idle rate by 5.9%, the variance of CPU utilization by 3.90, and the variance of memory utilization by 7.02.
Example four
Referring to fig. 7, the present invention provides a prediction-based distributed scheduling apparatus, including:
an obtaining module 210, configured to obtain a historical resource utilization rate of a currently scheduled Pod and a resource utilization rate of an available node in a cluster;
the prediction module 220 is configured to input the historical resource utilization rate of the currently scheduled Pod into a deep learning model for training and reasoning, so as to obtain a resource prediction utilization rate of the currently scheduled Pod;
a determining module 230, configured to screen available nodes in the cluster according to the resource prediction utilization of the current scheduling Pod and the resource utilization of the available nodes in the cluster, so as to determine a target scheduling node of the current scheduling Pod.
The determining module 230 includes a pre-screening unit and a preferred unit, where the pre-screening unit is configured to pre-screen the available nodes based on the built-in algorithm of Kubernetes to obtain candidate available nodes;
the preferred unit is configured to screen the candidate available nodes according to the resource prediction utilization of the current scheduling Pod and the resource utilization of the candidate available nodes, so as to determine a target scheduling node of the current scheduling Pod.
Further, the preferred unit is specifically configured to: determining a first score of each candidate available node according to the resource utilization rate of each candidate available node;
if the current scheduling Pod is operated for the first time, determining a second score of each candidate available node according to the resource utilization rate of each candidate available node and the resource application proportion of the current scheduling Pod;
if the current scheduling Pod is not operated for the first time, determining a second score of each candidate available node according to the resource utilization rate of each candidate available node and the resource prediction utilization rate of the current scheduling Pod;
and sequencing the candidate available nodes according to the first score and the second score so as to determine a target scheduling node of the current scheduling Pod.
Wherein, the calculation formula of the first score is as follows:
Score1 = 10 - (U_nc + U_nm) * 5;
where Score1 represents the first score, U_nc represents the CPU utilization of the candidate available node, and U_nm represents the memory utilization of the candidate available node.
If the current scheduling Pod is operated for the first time, the calculation formula of the second score is as follows:
Score2-1 = 10 - (U_nc + R_pc - (U_nm + R_pm)) * 10;
where Score2-1 represents the second score when the currently scheduled Pod runs for the first time, U_nc and R_pc respectively represent the CPU utilization of the candidate available node and the CPU application proportion of the currently scheduled Pod, and U_nm and R_pm respectively represent the memory utilization of the candidate available node and the memory application proportion of the currently scheduled Pod.
If the current schedule is not operated for the first time, the calculation formula of the second score is as follows:
Score2-2 = 10 - (U_nc + U_pcp - (U_nm + U_pmp)) * 10;
where Score2-2 represents the second score when the currently scheduled Pod is not running for the first time, U_nc and U_pcp respectively represent the CPU utilization of the candidate available node and the predicted CPU utilization of the currently scheduled Pod, and U_nm and U_pmp respectively represent the memory utilization of the candidate available node and the predicted memory utilization of the currently scheduled Pod.
Specifically, the step of ranking the candidate available nodes according to the first score and the second score to determine the target scheduling node of the current scheduling Pod includes:
taking the sum of the first score and the second score as a target score of each candidate available node;
and taking the candidate available node with the highest target score as the target scheduling node of the current scheduling Pod.
Further, the acquiring module is specifically configured to:
acquire the historical resource utilization rate of the currently scheduled Pod and the resource utilization rate of the available nodes in the cluster by deploying the Prometheus and Grafana cluster resource monitoring strategy in the cluster.
The prediction-based distributed scheduling apparatus provided in the embodiment of the present invention may execute the prediction-based distributed scheduling method provided in any embodiment of the present invention, has functional modules and beneficial effects corresponding to the execution method, and is not described in detail again.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A distributed scheduling method based on prediction, applied to a Kubernetes cluster, comprising the following steps:
acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster;
inputting the historical resource utilization rate of the current scheduling Pod into a deep learning model for training and inference to obtain the resource prediction utilization rate of the current scheduling Pod;
and screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster to determine the target scheduling node of the current scheduling Pod.
2. The method of claim 1, wherein screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster to determine the target scheduling node of the current scheduling Pod comprises:
pre-screening the available nodes based on a built-in algorithm of Kubernetes to obtain candidate available nodes;
and screening the candidate available nodes according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the candidate available nodes so as to determine the target scheduling node of the current scheduling Pod.
3. The method of claim 2, wherein screening the candidate available nodes according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the candidate available nodes to determine the target scheduling node of the current scheduling Pod comprises:
determining a first score of each candidate available node according to the resource utilization rate of each candidate available node;
if the current scheduling Pod is run for the first time, determining a second score of each candidate available node according to the resource utilization rate of each candidate available node and the resource request proportion of the current scheduling Pod;
if the current scheduling Pod is not run for the first time, determining a second score of each candidate available node according to the resource utilization rate of each candidate available node and the resource prediction utilization rate of the current scheduling Pod;
and ranking the candidate available nodes according to the first score and the second score to determine the target scheduling node of the current scheduling Pod.
4. The method of claim 3, wherein the first score is calculated as:
Score1 = 10 - (U_nc + U_nm) * 5;
where Score1 represents the first score, U_nc represents the CPU utilization rate of the candidate available node, and U_nm represents the memory utilization rate of the candidate available node.
5. The method of claim 3, wherein if the current scheduling Pod is run for the first time, the second score is calculated as:
Score2-1 = 10 - (U_nc + R_pc - (U_nm + R_pm)) * 10;
where Score2-1 represents the second score when the current scheduling Pod is run for the first time, U_nc and R_pc respectively represent the CPU utilization rate of the candidate available node and the CPU request proportion of the current scheduling Pod, and U_nm and R_pm respectively represent the memory utilization rate of the candidate available node and the memory request proportion of the current scheduling Pod.
6. The method of claim 3, wherein if the current scheduling Pod is not run for the first time, the second score is calculated as:
Score2-2 = 10 - (U_nc + U_pcp - (U_nm + U_pmp)) * 10;
where Score2-2 represents the second score when the current scheduling Pod is not run for the first time, U_nc and U_pcp respectively represent the CPU utilization rate of the candidate available node and the predicted CPU utilization rate of the current scheduling Pod, and U_nm and U_pmp respectively represent the memory utilization rate of the candidate available node and the predicted memory utilization rate of the current scheduling Pod.
7. The method of claim 3, wherein ranking the candidate available nodes according to the first score and the second score to determine the target scheduling node for the current scheduling Pod comprises:
taking the sum of the first score and the second score as a target score of each candidate available node;
and taking the candidate available node with the highest target score as the target scheduling node of the current scheduling Pod.
8. The method of claim 1, wherein acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster comprises:
acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster through a Prometheus and Grafana cluster resource monitoring scheme deployed in the cluster.
9. A distributed scheduling apparatus based on prediction, configured in a Kubernetes cluster, comprising:
the acquisition module is used for acquiring the historical resource utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster;
the prediction module is used for inputting the historical resource utilization rate of the current scheduling Pod into a deep learning model for training and inference to obtain the resource prediction utilization rate of the current scheduling Pod;
and the determining module is used for screening the available nodes in the cluster according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the available nodes in the cluster so as to determine the target scheduling node of the current scheduling Pod.
10. The apparatus of claim 9, wherein the determining module is specifically configured to:
pre-screening the available nodes based on a built-in algorithm of Kubernetes to obtain candidate available nodes;
and screening the candidate available nodes according to the resource prediction utilization rate of the current scheduling Pod and the resource utilization rate of the candidate available nodes so as to determine the target scheduling node of the current scheduling Pod.
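For background on the training-and-inference step recited in claims 1 and 9, a minimal sketch follows; the patent does not disclose the model architecture, so the LSTM, the window length, and the synthetic stand-in data here are all assumptions:

import torch
import torch.nn as nn

class UtilizationPredictor(nn.Module):
    """Predicts a Pod's next [CPU, memory] utilization pair from a
    sliding window of its historical utilization samples."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # -> (cpu_pred, mem_pred)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, 2) historical [cpu, mem] utilization rates
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

model = UtilizationPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training data: 64 windows of 10 samples each.
history = torch.rand(64, 10, 2)
targets = torch.rand(64, 2)
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(history), targets)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    prediction = model(history[:1])  # predicted [cpu, mem] utilization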

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782812.6A CN113391913A (en) 2021-07-12 2021-07-12 Distributed scheduling method and device based on prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110782812.6A CN113391913A (en) 2021-07-12 2021-07-12 Distributed scheduling method and device based on prediction

Publications (1)

Publication Number Publication Date
CN113391913A (en) 2021-09-14

Family

ID=77625785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782812.6A Pending CN113391913A (en) 2021-07-12 2021-07-12 Distributed scheduling method and device based on prediction

Country Status (1)

Country Link
CN (1) CN113391913A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886055A (en) * 2021-12-07 2022-01-04 中国电子科技集团公司第二十八研究所 Intelligent model training resource scheduling method based on container cloud technology
CN113886055B (en) * 2021-12-07 2022-04-15 中国电子科技集团公司第二十八研究所 Intelligent model training resource scheduling method based on container cloud technology
US11916807B2 (en) 2022-01-31 2024-02-27 Microsoft Technology Licensing, Llc Evaluation framework for cloud resource optimization
CN115297112A (en) * 2022-07-31 2022-11-04 南京匡吉信息科技有限公司 Dynamic resource quota and scheduling component based on Kubernetes
CN115550371A (en) * 2022-12-05 2022-12-30 安超云软件有限公司 Pod scheduling method and system based on Kubernetes and cloud platform
CN115550371B (en) * 2022-12-05 2023-03-21 安超云软件有限公司 Pod scheduling method and system based on Kubernetes and cloud platform

Similar Documents

Publication Publication Date Title
CN113391913A (en) Distributed scheduling method and device based on prediction
Mahgoub et al. OPTIMUSCLOUD: Heterogeneous configuration optimization for distributed databases in the cloud
Ramakrishnan et al. Scheduling data-intensive workflows onto storage-constrained distributed resources
Soualhia et al. Task scheduling in big data platforms: a systematic literature review
Tian et al. Towards optimal resource provisioning for running mapreduce programs in public clouds
CN104050042B (en) The resource allocation methods and device of ETL operations
Sethi et al. RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
CN110740079B (en) Full link benchmark test system for distributed scheduling system
CN113110914A (en) Internet of things platform construction method based on micro-service architecture
Zheng et al. Deploying high throughput scientific workflows on container schedulers with makeflow and mesos
CN104298550A (en) Hadoop-oriented dynamic scheduling method
Ding et al. Kubernetes-oriented microservice placement with dynamic resource allocation
Overeinder et al. A dynamic load balancing system for parallel cluster computing
CN109614227A (en) Task resource concocting method, device, electronic equipment and computer-readable medium
Mateescu Quality of service on the grid via metascheduling with resource co-scheduling and co-reservation
CN113157421A (en) Distributed cluster resource scheduling method based on user operation process
Gandomi et al. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach
CN117056048A (en) Container cloud resource scheduling method and scheduling system based on digital twin technology
CN110825526A (en) Distributed scheduling method and device based on ER relationship, equipment and storage medium
Javanmardi et al. An architecture for scheduling with the capability of minimum share to heterogeneous Hadoop systems
CN111367632B (en) Container cloud scheduling method based on periodic characteristics
Gopalakrishna et al. Untangling cluster management with Helix
Ghazali et al. CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning
Guo et al. Handling data skew at reduce stage in Spark by ReducePartition
Yang et al. Deep reinforcement agent for failure-aware job scheduling in high-performance computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination