CN116932255A

CN116932255A - Scheduling method and device in cloud environment, electronic equipment and storage medium

Info

Publication number: CN116932255A
Application number: CN202210340576.7A
Authority: CN
Inventors: 田国良; 王鑫; 李映; 王坚; 樊野
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2023-10-24

Abstract

The invention provides a scheduling method and device in a cloud environment, an electronic device, and a storage medium. The scheduling method comprises: acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm features corresponding to the alarm data; inputting the alarm features into a fault prediction model to obtain the fault category corresponding to the alarm data, wherein the fault prediction model is trained on the fault alarm features and the fault categories corresponding to historical alarm data; and, according to the fault category and a mapping table between fault categories and self-healing policies, automatically triggering a portable container orchestration and management tool to execute a self-healing operation under the self-healing policy corresponding to the fault category, wherein the self-healing operation implements resource scheduling in the cloud environment. The method and device thereby achieve intelligent resource scheduling in the cloud environment.

Description

Scheduling method and device in cloud environment, electronic equipment and storage medium

Technical Field

The present invention relates to the field of intelligent operation and maintenance technologies, and in particular, to a scheduling method and apparatus in a cloud environment, an electronic device, and a storage medium.

Background

For development and operations staff, virtual machines have the drawbacks of slow startup, a large footprint, and difficult migration. Containerization addresses these problems well: it does not virtualize an entire operating system but only a small-scale environment, it starts very quickly, and it consumes almost no system resources beyond those used by the applications running inside it. However, as cloud computing is developed and applied more widely, the number of containers keeps growing, which makes container operation and maintenance difficult to manage.

Existing container operation and maintenance management technology has the following problems: 1. Scheduling in existing container clouds, including scaling, fault repair, and grayscale release, is mostly implemented with hard rules or hard-coded logic. 2. Fault detection only checks whether a process exists; it does not detect whether the process is reporting errors or whether it is operating normally.

Disclosure of Invention

The invention provides a scheduling method, a scheduling device, electronic equipment and a storage medium in a cloud environment, which are used for solving the defect of container operation and maintenance management in the prior art and realizing intelligent scheduling of resources in the cloud environment.

The invention provides a scheduling method in a cloud environment, which comprises the following steps:

acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data;

inputting the alarm characteristics into a fault prediction model to obtain fault categories corresponding to the alarm data, wherein the fault prediction model is obtained by training according to the fault alarm characteristics corresponding to the historical alarm data and the fault categories corresponding to the historical alarm data;

according to the fault category and a mapping table between fault categories and self-healing policies, automatically triggering a portable container orchestration and management tool to execute a self-healing operation under the self-healing policy corresponding to the fault category, wherein the self-healing operation implements resource scheduling in the cloud environment.

According to the scheduling method in the cloud environment, the fault prediction model is obtained through training of the following steps:

collecting historical alarm data from cloud applications and/or pod, wherein the historical alarm data comprises alarm indexes corresponding to self-healing scenes;

counting the alarming times of each target alarming index according to a preset time interval, wherein the target alarming index is an alarming index which is screened from the historical alarming index information and can reflect the application performance fault and/or the application service fault;

Acquiring fault alarm characteristics and fault categories corresponding to the fault alarm characteristics based on the counted alarm times of the target alarm indexes;

and training a classification combination model by utilizing the fault alarming characteristics and the fault categories corresponding to the fault alarming characteristics to obtain the fault prediction model.

According to the scheduling method in the cloud environment provided by the invention, obtaining the fault alarm features and the fault categories corresponding to the fault alarm features based on the counted alarm times of each target alarm index comprises:

obtaining a first matrix by using a first smoothing time window, wherein the columns of the first matrix are the target alarm indexes, the rows of the first matrix are the alarm times of each target alarm index in each time unit, and the first smoothing time window is larger than the preset time interval;

merging the data in the first matrix by column by using a second smoothing time window, and converting the merged alarm times in the matrix into fixed-digit values to form a second matrix, wherein the second smoothing time window is larger than the first smoothing time window;

Marking the fault category of each row of data of the second matrix, and adding the fault category of each row of data to the last column of the second matrix to form a third matrix;

splicing the target alarm indexes in the third matrix in a pairwise manner to obtain a plurality of spliced alarm indexes, and determining the alarm times and fault types of each spliced alarm index;

performing sparse processing on the alarming times of each spliced alarming index to obtain a plurality of characteristics, and calculating the importance of each characteristic in the plurality of characteristics by adopting a random forest algorithm;

and screening out features with importance higher than a preset threshold value from the features as fault alarm features, and determining fault categories corresponding to the fault alarm features.

According to the scheduling method in the cloud environment provided by the invention, training a classification combination model by using the fault alarm features and the fault categories corresponding to the fault alarm features to obtain the fault prediction model comprises:

dividing a data set consisting of fault alarm characteristics and fault categories corresponding to the fault alarm characteristics into a training set and a testing set;

training, based on the training set and by K-fold cross-validation, a first logistic regression model, a random forest model, and an XGBoost model in the classification combination model, to obtain the prediction results corresponding to the training set;

combining the prediction results of the training sets corresponding to the first logistic regression model, the random forest model and the XGBoost model respectively to obtain three features;

and inputting the three features into a second logistic regression model in the classification combination model for training, testing the trained second logistic regression model by using the test set after training is finished, adjusting model parameters according to test results, and obtaining the fault prediction model after testing is finished.

According to the scheduling method in the cloud environment provided by the invention, processing the alarm data to obtain the alarm features corresponding to the alarm data comprises:

counting the number of alarms corresponding to each alarm index in the alarm data according to a preset time interval;

obtaining a third matrix by using a third smoothing time window mode based on the alarm times corresponding to the alarm indexes in the alarm data, wherein the columns of the third matrix are the alarm indexes, the rows of the third matrix are the alarm times of the alarm indexes in each time unit, and the third smoothing time window is larger than the preset time interval;

Combining the data in the third matrix according to columns by using a fourth smoothing time window mode, and converting the number of alarms combined in the matrix into a fixed digit number to form a fourth matrix, wherein the fourth smoothing time window is larger than the third smoothing time window;

marking the fault category of each row of data of the fourth matrix, and adding the fault category of each row of data to the last column of the fourth matrix to form a fifth matrix;

splicing alarm indexes in the fifth matrix in a pairwise manner to obtain a plurality of spliced alarm indexes, and determining the alarm times of each spliced alarm index;

performing sparse processing on the alarming times of each spliced alarming index to obtain a plurality of characteristics, and calculating the importance of each characteristic by adopting a random forest algorithm;

and screening alarm features with importance higher than a preset threshold value from the plurality of features to serve as alarm features corresponding to the alarm data.

According to the scheduling method in the cloud environment provided by the invention, inputting the alarm features into the fault prediction model to obtain the fault category corresponding to the alarm data comprises:

The alarm features are respectively input into a first logistic regression model, a random forest model and an XGBoost model in the fault prediction model to obtain a first feature, a second feature and a third feature;

and inputting the first feature, the second feature and the third feature into a second logistic regression model in the fault prediction model, and obtaining the fault category corresponding to the alarm data output by the second logistic regression model.

The invention also provides a scheduling method in the cloud environment, which further comprises the following steps:

obtaining a mapping relation between fault categories and a self-healing strategy;

and constructing a mapping table of the fault category and the self-healing strategy based on the mapping relation of the fault category and the self-healing strategy.

The invention also provides a scheduling device in the cloud environment, which comprises:

the characteristic acquisition module is used for acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data;

the fault type prediction module is used for inputting the alarm characteristics into a fault prediction model to obtain fault types corresponding to the alarm data, and the fault prediction model is obtained by training according to the fault alarm characteristics corresponding to the historical alarm data and the fault types corresponding to the historical alarm data;

the self-healing module is used for automatically triggering, according to the fault category and a mapping table between fault categories and self-healing policies, a portable container orchestration and management tool to execute a self-healing operation under the self-healing policy corresponding to the fault category, wherein the self-healing operation implements resource scheduling in the cloud environment.

The invention also provides an electronic device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the scheduling method in a cloud environment as described in any of the above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a scheduling method in a cloud environment as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements a scheduling method in a cloud environment as described in any of the above.

With the scheduling method and device in a cloud environment, the electronic device, and the storage medium provided by the invention, an AI algorithm is used for fault detection and root cause localization, and the portable container orchestration and management tool is then automatically triggered to execute a self-healing operation under the self-healing policy corresponding to the fault category, thereby achieving intelligent resource scheduling in the cloud environment.

Drawings

To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of a scheduling method in a cloud environment provided by the invention;

Fig. 2 is a schematic flow chart of training the fault prediction model provided by the invention;

Fig. 3 is a schematic flow chart of obtaining the fault alarm features and the fault categories corresponding to the fault alarm features based on the counted alarm times of each target alarm index;

Fig. 4 is a schematic flow chart of training a classification combination model by using the fault alarm features and the fault categories corresponding to the fault alarm features to obtain the fault prediction model;

Fig. 5 is a schematic structural diagram of a scheduling device in a cloud environment according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

For development and operations staff, virtual machines have the drawbacks of slow startup, a large footprint, and difficult migration. Containerization addresses these problems well: it does not virtualize an entire operating system but only a small-scale environment, it starts very quickly, and it consumes almost no system resources beyond those used by the applications running inside it. However, as cloud computing is developed and applied more widely, the number of containers keeps growing, which makes container operation and maintenance difficult to manage. Driven by business needs, k8s was introduced. It provides a brand-new, leading distributed architecture solution based on container technology and is a major breakthrough and innovation for the container technology field as a whole.

k8s, i.e., Kubernetes, is a portable container orchestration and management tool built for container services. k8s is currently developing well, has come to dominate cloud service workflows, and has driven the adoption of popular technologies such as microservice architecture. At the architecture design level, k8s addresses availability and scalability; at the deployment and operations level, it provides good solutions for service deployment, service monitoring, application scaling, and fault handling. Specifically, it mainly provides: 1. service discovery and scheduling; 2. load balancing; 3. service self-healing; 4. elastic service scaling; 5. horizontal scaling; 6. storage volume mounting.

Scheduling in existing cloud environments, such as elastic scaling, mainly works by monitoring the resource utilization (for example, CPU) of the Pods and, once a set threshold is reached, applying a policy that decides whether the number of Pods should be increased or decreased. The policy is evaluated over a period, which may be fixed at, for example, 5 minutes. When CPU usage is detected to stay high for five minutes, the policy increases the number of Pods to relieve the pressure; conversely, when the detected pressure stays low for five minutes, the number of Pods is decreased.
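
The fixed-threshold rule described above can be sketched as follows. This is a minimal illustration rather than Kubernetes code: the function name `decide_replicas`, the 0.8 and 0.3 thresholds, and the one-reading-per-minute sampling are assumptions made for the example; only the fixed 5-minute period comes from the text.

```python
from typing import Sequence


def decide_replicas(cpu_samples: Sequence[float], replicas: int,
                    high: float = 0.8, low: float = 0.3) -> int:
    """Return the new Pod count after one fixed decision period.

    cpu_samples holds one CPU-utilization reading per minute of the
    period (five readings for a 5-minute period). The thresholds are
    assumed values, hard-coded in the way the text criticizes.
    """
    if cpu_samples and all(s >= high for s in cpu_samples):
        return replicas + 1          # pressure stayed high: scale out
    if cpu_samples and all(s <= low for s in cpu_samples):
        return max(1, replicas - 1)  # pressure stayed low: scale in
    return replicas                  # mixed readings: no change
```

Because both the period and the thresholds are constants, the rule cannot adapt to the kind of fault that actually occurred, which is the limitation the embodiment sets out to address.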

The prior art has the following problems:

1. Scheduling in existing container clouds, including scaling, fault repair, and grayscale release, is mostly implemented with hard rules or hard-coded logic. For example, as mentioned above, the detection period used for decisions is fixed at 5 minutes.

2. Existing detection only checks whether a process exists; it does not detect whether the process is reporting errors or whether it is operating normally.

In order to solve the above problems, the embodiment of the invention provides a scheduling method in a cloud environment.

Fig. 1 is a flow chart of a scheduling method in a cloud environment according to an embodiment of the present invention, as shown in fig. 1, where the method includes: step 100, step 101 and step 102.

Step 100, acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data;

The basic data on which the invention relies is the alarm data of applications, including performance index alarm data, business index alarm data, and the like. The invention does not limit how the performance index alarm data and business index alarm data of an application are monitored and obtained.

The alarm data includes, but is not limited to, alarm entity, alarm time, alarm value, alarm index.

In order to perform fault detection on the alarm data, the alarm data needs to be processed to obtain alarm characteristics corresponding to the alarm data.

Step 101, inputting the alarm characteristics into a fault prediction model to obtain fault categories corresponding to the alarm data;

after the alarm characteristics are obtained, the alarm characteristics are input into a fault prediction model, and fault categories output by the fault prediction model are obtained, wherein the fault categories are fault categories corresponding to alarm data to be detected in a cloud environment.

The fault prediction model is obtained by training through an AI algorithm according to fault alarm characteristics corresponding to the historical alarm data and fault categories corresponding to the historical alarm data.

The embodiment of the invention can realize fault detection and root cause positioning by utilizing the fault prediction model.

Step 102, according to the fault category and a mapping table between fault categories and self-healing policies, automatically triggering the portable container orchestration and management tool to execute a self-healing operation under the self-healing policy corresponding to the fault category, wherein the self-healing operation implements resource scheduling in the cloud environment.

After the fault category corresponding to the alarm data to be detected in the cloud environment is obtained, the self-healing policy corresponding to that fault category can be determined from a pre-built mapping table between fault categories and self-healing policies. Self-healing policies include, but are not limited to, intelligent scaling (scale-out and scale-in), intelligent fault repair, and intelligent grayscale release.

After the fault prediction model detects the fault category, the portable container orchestration and management tool can be automatically triggered to execute the self-healing operation under the self-healing policy corresponding to the fault category, achieving fault self-healing and resource scheduling.
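
The lookup-and-trigger flow of step 102 can be sketched as below. The category keys, policy names, and the `execute` callback are hypothetical placeholders introduced for illustration; the source does not specify the contents of the mapping table or how the orchestration tool is invoked.

```python
from typing import Callable, Dict

# Hypothetical mapping table: fault category -> self-healing policy.
SELF_HEALING_POLICIES: Dict[str, str] = {
    "cpu_resource_shortage": "scale_out",
    "memory_resource_problem": "restart_pod",
}


def trigger_self_healing(fault_category: str,
                         execute: Callable[[str], None]) -> str:
    """Look up the policy for a predicted fault category and run it.

    `execute` stands in for the call into the orchestration tool (for
    example, a Kubernetes API client). It is injected so the lookup
    logic can be exercised without a cluster.
    """
    policy = SELF_HEALING_POLICIES.get(fault_category, "no_action")
    if policy != "no_action":
        execute(policy)
    return policy


executed = []
chosen = trigger_self_healing("cpu_resource_shortage", executed.append)
```

Keeping the mapping in a table rather than in branching code means new fault categories can be supported by adding an entry, without changing the dispatch logic.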

In the embodiment of the invention, an AI algorithm performs fault detection and root cause localization, and the portable container orchestration and management tool is then automatically triggered to execute the self-healing operation under the self-healing policy corresponding to the fault category. This solves the problem that existing container clouds can only be scheduled with hard-coded logic or hard rules, and achieves intelligent resource scheduling in the cloud environment.

Optionally, as shown in fig. 2, the fault prediction model is trained by the following steps:

step 200, collecting historical alarm data from cloud applications and/or pod, wherein the historical alarm data comprises alarm indexes corresponding to self-healing scenes;

Specifically, acquiring historical alarm index information of cloud application and/or pod in a period of time, wherein the historical alarm index information needs to contain alarm indexes corresponding to self-healing scenes, otherwise, AI model training cannot be performed.

Step 201, counting the alarming times of each target alarming index according to a preset time interval, wherein the target alarming index is an alarming index which is screened from the historical alarming index information and can reflect the application performance fault and/or the application service fault;

It should be noted that a cloud application product normally has hundreds to thousands of indexes, most of which have no correlation with alarms. To improve the efficiency of feature engineering and model training, a subset of important alarm indexes is selected, and the number of alarms generated by each of them is counted at a fixed time interval. This step requires a preliminary screening: alarm indexes that can reflect application performance faults and/or application service faults are selected as the fault feature indexes for subsequent AI training. For convenience of description, the alarm indexes screened from the historical alarm index information that can reflect application performance faults and/or application service faults are called target alarm indexes.

The number of alarms generated by each target alarm index is counted at a fixed time interval, for example, every minute; if a target alarm index generates no alarm within an interval, its count is 0.
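
The per-interval counting in step 201 can be sketched as follows, assuming each alarm is available as a `(timestamp, index name)` pair and the interval is one minute. The function name `count_alarms` and the index names are illustrative, not taken from the source.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def count_alarms(alarms: Iterable[Tuple[datetime, str]],
                 indexes: List[str],
                 start: datetime, minutes: int) -> List[Dict[str, int]]:
    """Count alarms per target alarm index in one-minute buckets.

    Returns one dict per minute; an index with no alarm in a bucket
    is counted as 0, as the text requires.
    """
    counts = Counter()
    for ts, index in alarms:
        bucket = int((ts - start).total_seconds() // 60)
        if 0 <= bucket < minutes and index in indexes:
            counts[(bucket, index)] += 1
    return [{name: counts[(m, name)] for name in indexes}
            for m in range(minutes)]


start = datetime(2021, 4, 22, 15, 0)
alarms = [
    (start + timedelta(seconds=5), "cpu_usage_high"),
    (start + timedelta(seconds=40), "cpu_usage_high"),
    (start + timedelta(seconds=70), "memory_usage_high"),
]
rows = count_alarms(alarms, ["cpu_usage_high", "memory_usage_high"], start, 3)
```

Each element of `rows` corresponds to one row of the count matrix used in the later steps, with one entry per target alarm index.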

Step 202, obtaining fault alarm characteristics and fault categories corresponding to the fault alarm characteristics based on the counted alarm times of the target alarm indexes;

and carrying out data processing based on the counted alarm times of the target alarm indexes to obtain fault alarm characteristics and fault categories corresponding to the fault alarm characteristics.

Optionally, as shown in fig. 3, step 202, based on the counted alarm times of the target alarm indexes, obtains a fault alarm feature and a fault category corresponding to the fault alarm feature, including:

step 2021, obtaining a first matrix by using a first smoothing time window, wherein the columns of the first matrix are the target alarm indexes, the rows of the first matrix are the alarm times of each target alarm index in each time unit, and the first smoothing time window is larger than the preset time interval;

Specifically, the first smoothing time window is used to obtain a matrix of the number of occurrences of each target alarm index within the window, namely the first matrix, in which the column headers are the target alarm indexes and each row holds the number of occurrences of each target alarm index in one time unit. The first smoothing time window is chosen to be larger than the preset time interval of step 201, for example, 5 minutes, and the smoothing step is one time unit. For example, the number of alarms per minute of each target alarm index within the first smoothing time window is calculated and a two-dimensional matrix is constructed, as shown in Table 1.

Table 1 example of first matrix

Step 2022, merging the data in the first matrix according to columns by using a second smooth time window, and converting the number of alarm times after merging in the matrix into a fixed number of digits to form a second matrix, wherein the second smooth time window is larger than the first smooth time window;

Specifically, the data in the matrix are merged by column, that is, by target alarm index: the values in each column are spliced together, each with a fixed number of digits, to form new data. To this end, each value must first be converted into a binary value with a fixed number of digits. Merging the matrix from step 2021 yields a new matrix, as shown in Table 2.

Table 2 example of new matrix obtained after matrix data combination

Based on Table 2, conversion according to the second smoothing time window yields a new matrix, namely the second matrix, composed of the binary data computed for each historical smoothing time window and each target alarm index. The second matrix is shown in Table 3.

Table 3 second matrix example
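
The column-wise merging of step 2022 can be sketched as below. The sketch assumes that each count is zero-padded to four decimal digits and that five consecutive one-minute counts are spliced per row, which is what the 20-digit strings in Table 5 suggest; the text itself only says "fixed number of digits", so the width, the window length, and the names `merge_window` and `build_second_matrix` are assumptions.

```python
from typing import Dict, List


def merge_window(counts: List[int], window: int = 5, width: int = 4) -> List[str]:
    """Slide a window over per-minute counts and splice each window's
    values into one fixed-width string (zero-padded decimal, assumed)."""
    merged = []
    for end in range(window, len(counts) + 1):
        chunk = counts[end - window:end]
        merged.append("".join(f"{value:0{width}d}" for value in chunk))
    return merged


def build_second_matrix(columns: Dict[str, List[int]], window: int = 5,
                        width: int = 4) -> Dict[str, List[str]]:
    """Apply merge_window to every target alarm index column."""
    return {name: merge_window(series, window, width)
            for name, series in columns.items()}


per_minute = {
    "cpu_usage_high": [5, 7, 3, 6, 3, 3],
    "memory_usage_high": [3, 4, 7, 2, 4, 4],
}
second_matrix = build_second_matrix(per_minute)
```

With these sample counts, the first row for `cpu_usage_high` is `00050007000300060003` and the next row shifts by one minute, matching the pattern of consecutive rows in Table 5. Counts wider than `width` digits would break the fixed-width layout, so a real implementation would need to cap or widen them.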

Step 2023, marking the fault category of each row of data of the second matrix, and adding the fault category of each row of data to the last column of the second matrix to form a third matrix;

It should be noted that the fault category of each row is determined from the joint fault behavior shown by several key indexes within the smoothing window. For example, when high-CPU-usage alarms remain relatively frequent over 5 minutes and the number of alarms for "average application response time greater than 5 seconds" increases far more than usual, the row may be labeled as insufficient CPU resources. When the number of high-memory-usage alarms keeps increasing and the service request failure rate rises, the fault category may be determined as an application memory resource problem. Adding the fault category makes the subsequent self-healing process easier to handle.

The fault category is added to the second matrix shown in Table 3 to obtain the third matrix, as shown in Table 4.

Table 4 third matrix example

Step 2024, splicing the target alarm indexes in the third matrix in pairs to obtain a plurality of spliced alarm indexes, and determining the alarm times and fault types of each spliced alarm index;

based on the third matrix, the invention adopts a feature crossing method to conduct feature derivation, for example: the values of the two features are stitched to form a new feature combination.

The process is as follows:

Assume that the matrix has N features. Splicing the features in pairs yields $C_N^2 = N(N-1)/2$ newly derived features in total. A specific derivation example: the features "high CPU usage" and "high memory usage" are combined into "high CPU usage and high memory usage". The advantage of this combination is that correlated fault characteristics can be described accurately, so that, together with the subsequent classification algorithm, the fault category can be predicted more accurately. The combined data structure is shown in Table 5.

Table 5 data examples derived from features

Time	High CPU usage and high memory usage	……
2021/04/22 15:04:00	00050007000300060003&00030004000700020004	……
2021/04/22 15:05:00	00070003000600030003&00040007000200040004	……
……	……	……
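
The pairwise splicing of step 2024 can be sketched as follows. The `&` separator follows Table 5; the column names `cpu`, `mem`, and `rt`, the values in the `rt` column, and the function name `cross_features` are made up for the example.

```python
from itertools import combinations
from typing import Dict, List


def cross_features(matrix: Dict[str, List[str]], sep: str = "&") -> Dict[str, List[str]]:
    """Splice every unordered pair of columns row by row."""
    crossed = {}
    for left, right in combinations(matrix, 2):
        crossed[f"{left}{sep}{right}"] = [
            a + sep + b for a, b in zip(matrix[left], matrix[right])
        ]
    return crossed


matrix = {
    "cpu": ["00050007000300060003", "00070003000600030003"],
    "mem": ["00030004000700020004", "00040007000200040004"],
    "rt": ["00000001000000020000", "00010000000200000001"],
}
crossed = cross_features(matrix)
```

Three input columns give three crossed columns, in line with the N(N-1)/2 count for unordered pairs, and `crossed["cpu&mem"]` reproduces the two data rows of Table 5.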

Step 2025, performing sparse processing on the alarm times of each spliced alarm index to obtain a plurality of features, and calculating the importance of each feature in the plurality of features by adopting a random forest algorithm;

further, in order to enable the spliced data to be effectively adapted to a subsequent related fault classification algorithm, the alarm times of each spliced alarm index are subjected to sparse processing, so that a plurality of characteristics are obtained. The process is as follows:

Assuming that the feature "high CPU usage and high memory usage" is F1, the features F1-1, F1-2 … F1-40 are derived from it, as shown in Table 6 below.

TABLE 6 multiple features after sparseness processing

Feature selection is then performed on the sparse matrix, with feature importance calculated by a random forest algorithm. Random forest is a widely used classification algorithm; during category prediction it first computes feature importance and then performs the prediction based on it.

Step 2026, screening out features with importance higher than a preset threshold from the plurality of features as fault alarm features, and determining fault categories corresponding to the fault alarm features.
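
The threshold screening of step 2026 can be sketched as below, assuming the importance scores have already been computed by a random forest. The scores, the threshold value, and the function name `select_features` are illustrative; the source does not state a threshold.

```python
from typing import Dict, List


def select_features(importance: Dict[str, float], threshold: float) -> List[str]:
    """Keep features whose importance exceeds the threshold,
    ordered from most to least important."""
    kept = [name for name, score in importance.items() if score > threshold]
    return sorted(kept, key=lambda name: importance[name], reverse=True)


# Illustrative scores only; real values would come from a trained random forest.
importance = {"F1-2": 0.08, "F1-6": 0.21, "F2-5": 0.30, "F3-1": 0.01}
selected = select_features(importance, threshold=0.05)
```

The selected feature names, together with the fault category labels of the corresponding rows, form the data set used to train the classification combination model.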

Step 203, training a classification combination model by using the fault alarm features and the fault categories corresponding to the fault alarm features to obtain the fault prediction model.

In the embodiment of the invention, the fault alarm features are obtained by applying smoothing, merging, fault category labeling, feature derivation, sparse processing, and importance screening to the historical alarm data collected from cloud applications and/or pods, which effectively improves the training accuracy of the fault prediction model.

The invention adopts a classification combination method. The base models use three algorithms, Logistic Regression, Random Forest, and XGBoost, and the three models are combined by stacking to train a new logistic regression model, so that the combined algorithm achieves high accuracy.

Stacking first trains several base learners on the initial training data and then uses their predictions as a new training set to train a new learner. The base layer of stacking usually contains different learning algorithms, so stacking ensembles tend to be heterogeneous.

Optionally, as shown in Fig. 4, step 203 of training the classification combination model by using the fault alarm features and the fault categories corresponding to the fault alarm features to obtain the fault prediction model includes:

step 2031, dividing a data set composed of the fault alarm features and the fault categories corresponding to the fault alarm features into a training set and a test set;

the application does not limit the division ratio; for example, the data set formed by the fault alarm features and the fault categories corresponding to the fault alarm features may be divided into a training set and a test set at a ratio of 4:1.

Illustratively, the fault alarm features and the fault categories corresponding to the fault alarm features may be understood as the final features "F2-5", "F1-6", "F2-9", "F3-6", "F1-8", "F1-2", "F4-4", "F4-3" obtained in step 2026 above together with their corresponding labels, which are divided into a training set and a test set at a ratio of 4:1.
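A 4:1 division of this kind can be sketched as below. The sample data, the shuffling and the fixed seed are assumptions made for illustration; the application itself does not prescribe how the split is drawn.

```python
import random

def train_test_split(rows, train_parts=4, test_parts=1, seed=0):
    """Shuffle rows and split them in the ratio train_parts:test_parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical samples: (feature vector, fault category label).
dataset = [([i % 2, i % 3], i % 2) for i in range(20)]
train_data, test_data = train_test_split(dataset)
print(len(train_data), len(test_data))
```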

Step 2032, training a first logistic regression model in the classification combination model in a K-fold cross validation mode based on the training set, and training a random forest model and an XGBoost model to obtain a prediction result corresponding to the training set;

specifically, the base algorithms Logistic Regression, Random Forest and XGBoost are each trained with 5-fold cross validation. The training data is divided into 5 folds and input into each of the three algorithms, the parameters are tuned, and the models are trained separately. Each model may be saved to a specified path, typically in the form of a .pkl file. Through the 5 rounds of cross validation, each model obtains prediction results for all of train_data. test_data is then input into each of the three models to obtain the prediction results for test_data.
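The way K-fold cross validation yields a prediction for every training sample can be sketched as follows. This is an illustration under stated assumptions: `fit_threshold` is a toy one-dimensional stand-in for a base learner (not Logistic Regression, Random Forest or XGBoost), the folds are contiguous, and the data is made up.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def fit_threshold(xs, ys):
    """Toy base learner: a threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut) if mean1 > mean0 else int(x < cut)

def out_of_fold_predictions(xs, ys, k=5):
    """Predict each sample with a model that never saw it during training."""
    oof = [None] * len(xs)
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit_threshold([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in val_idx:
            oof[i] = model(xs[i])
    return oof

# Made-up one-feature data with alternating labels.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
oof = out_of_fold_predictions(xs, ys, k=5)
print(oof)
```

Because every sample falls into exactly one validation fold, the five rounds together cover all of the training data.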

Logistic Regression mainly uses the Sigmoid function to solve binary classification problems and can be extended to multi-class problems. The model is simple, easy to understand and highly interpretable. For the optimization algorithm, the solver parameter is set to sag, i.e. the stochastic average gradient descent method, which is faster than SGD. L2 regularization is used to prevent overfitting.
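For reference, the Sigmoid function and an L2-regularized logistic loss can be written as below. The symbols (weights w, bias b, regularization strength λ, sample count N) are notation assumed here and do not appear in the original text.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
P(y = 1 \mid x) = \sigma(w^{\top} x + b)

J(w, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
  + \frac{\lambda}{2} \lVert w \rVert_2^2,
\qquad p_i = \sigma(w^{\top} x_i + b)
```

The penalty term grows with the magnitude of w, which is how L2 regularization discourages overfitting.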

Random Forest is a classifier that contains multiple decision trees, and its output category is the mode of the categories output by the individual trees. The randomness of the random forest lies in the random selection of data and the random selection of candidate features. The algorithm has several important parameters. One is n_estimators, the number of weak learners (decision trees): a value that is too small tends to under-fit, and a value that is too large tends to over-fit. Also important are the maximum number of features max_features and the maximum depth max_depth of the decision trees. The invention may perform a grid search through GridSearchCV to select the most suitable n_estimators, max_features and max_depth.
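The idea behind such a grid search can be sketched in plain Python as follows. This is not the GridSearchCV implementation: `grid_search` and `toy_score` are hypothetical names, the parameter names follow scikit-learn's random forest (`n_estimators`, `max_depth`, `max_features`), and the scoring function is a made-up stand-in for a cross-validated accuracy.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for a cross-validated accuracy.
def toy_score(params):
    return -abs(params["n_estimators"] - 100) / 100 - abs(params["max_depth"] - 6) / 10

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 6, 9],
    "max_features": ["sqrt", "log2"],
}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

In practice the score for each combination would come from cross validation on the training set rather than from a closed-form function.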

Similar to GBDT, the boosting strategy of XGBoost is to reduce the residual of the previous round by building a new model in the direction in which the residual decreases (the negative gradient). XGBoost differs from GBDT in that the loss function may be user-defined: it may be a squared error loss, a logistic loss, a hinge loss or a custom loss function. Boosting can be performed knowing only the first and second derivatives of the loss function, which further increases the generalization ability of the model. Among the most important XGBoost parameters is booster, which selects the model used in each iteration; the tree-based model is far superior to the linear model, so gbtree is chosen. The eta parameter, i.e. the learning rate, is used to avoid overfitting. max_depth represents the maximum depth of the tree. objective represents the loss function. eta, max_depth and objective are selected using a GridSearchCV grid search.
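How the first and second derivatives drive a boosting step can be sketched for the logistic loss as below. The function names and the regularization term `reg_lambda` are introduced for this illustration; the derivatives are taken with respect to the current raw prediction, and the leaf value uses the standard second-order formula -G / (H + λ).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(raw_score, label):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - label, p * (1.0 - p)

def optimal_leaf_weight(raw_scores, labels, reg_lambda=1.0):
    """Second-order (Newton) leaf value: -sum(g) / (sum(h) + lambda)."""
    stats = [logistic_grad_hess(s, y) for s, y in zip(raw_scores, labels)]
    g_sum = sum(g for g, _ in stats)
    h_sum = sum(h for _, h in stats)
    return -g_sum / (h_sum + reg_lambda)

# Four samples in one leaf, all starting from a raw score of 0 (p = 0.5).
w = optimal_leaf_weight([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 0])
print(w)
```

Since three of the four samples are positive, the leaf value is positive and pushes the predictions toward class 1.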

Step 2033, respectively merging the prediction results of the training sets corresponding to the first logistic regression model, the random forest model and the XGBoost model to obtain three features;

specifically, the prediction results that each model produces on its validation folds (which together give a prediction for each piece of data in the training set train_data) are merged to form new training data, which is input as 3 features into the new logistic regression model to train the next-layer model.
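Assembling the three prediction columns into the new training data can be sketched as follows. The prediction values are hypothetical fault category ids, and `build_meta_features` is a name introduced here for illustration.

```python
def build_meta_features(lr_preds, rf_preds, xgb_preds):
    """Combine the three per-sample prediction lists into 3-feature rows."""
    if not (len(lr_preds) == len(rf_preds) == len(xgb_preds)):
        raise ValueError("all base models must predict every training sample")
    return [list(row) for row in zip(lr_preds, rf_preds, xgb_preds)]

# Hypothetical cross-validation predictions (category ids) for five samples.
lr_preds = [0, 1, 2, 1, 0]
rf_preds = [0, 1, 2, 2, 0]
xgb_preds = [0, 1, 1, 1, 0]
meta_X = build_meta_features(lr_preds, rf_preds, xgb_preds)
print(meta_X)
```

Each row of `meta_X` pairs with the original label of the same sample when the second-layer model is trained.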

Step 2034, inputting the three features to a second logistic regression model in the classification combination model for training, testing the trained second logistic regression model by using the test set after training, adjusting model parameters according to test results, and obtaining the fault prediction model after testing.

The fault prediction model is composed of three trained basic models (namely a first logistic regression model, a random forest model and an XGBoost model) and a second trained logistic regression model.

In the embodiment of the invention, the fault prediction model is trained and generated by adopting the classification combination method, so that the precision is very high, and the accurate prediction of the cloud application fault can be realized.

Optionally, in the step 100, the processing the alarm data to obtain the alarm feature corresponding to the alarm data includes:

step 1001, counting the number of alarms corresponding to each alarm index in the alarm data according to a preset time interval;

step 1002, obtaining a third matrix by using a third smoothing time window based on the number of alarms corresponding to each alarm indicator in the alarm data, wherein a column of the third matrix is each alarm indicator, a row of the third matrix is the number of alarms of each alarm indicator in each time unit, and the third smoothing time window is larger than the preset time interval;

step 1003, merging the data in the third matrix according to columns by using a fourth smoothing time window, and converting the number of alarm times after merging in the matrix into a fixed digit number to form a fourth matrix, wherein the fourth smoothing time window is larger than the third smoothing time window;

step 1004, marking the fault category of each row of data of the fourth matrix, and adding the fault category of each row of data to the last column of the fourth matrix to form a fifth matrix;

Step 1005, combining alarm indexes in the fifth matrix two by two to obtain a plurality of spliced alarm indexes, and determining the alarm times of each spliced alarm index;

step 1006, performing sparse processing on the alarm times of each spliced alarm index to obtain a plurality of features, and calculating the importance of each feature by adopting a random forest algorithm;

step 1007, screening out alarm features with importance higher than a preset threshold from the multiple features, as alarm features corresponding to the alarm data.

It should be noted that, for the understanding of the above steps 1001 to 1007, reference may be made to the related descriptions of the above steps 2021 to 2026, which are not repeated here.
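Two of the steps above, the per-interval counting of step 1001 and the pairwise combination of step 1005, can be sketched as follows. Several details are assumptions made for illustration: the indicator names, the 60-second interval, the fixed width of 2 digits, and in particular the reading of "splicing" as concatenation of zero-padded counts.

```python
from collections import Counter
from itertools import combinations

def count_alarms(events, interval_seconds):
    """Count alarms per (time bucket, indicator); events are (timestamp, indicator)."""
    counts = Counter()
    for timestamp, indicator in events:
        counts[(timestamp // interval_seconds, indicator)] += 1
    return counts

def splice_pairs(row, width=2):
    """Pair indicators two by two and concatenate their zero-padded counts."""
    spliced = {}
    for a, b in combinations(sorted(row), 2):
        spliced[f"{a}+{b}"] = f"{row[a]:0{width}d}{row[b]:0{width}d}"
    return spliced

# Hypothetical alarm events: (timestamp in seconds, alarm indicator).
events = [(5, "cpu_high"), (20, "cpu_high"), (42, "mem_high"),
          (65, "cpu_high"), (70, "req_fail")]
counts = count_alarms(events, interval_seconds=60)
row0 = {ind: n for (bucket, ind), n in counts.items() if bucket == 0}
spliced = splice_pairs(row0)
print(spliced)
```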

Optionally, in step 101, the step of inputting the alarm feature into a fault prediction model to obtain a fault class corresponding to the alarm data includes:

For example, alarm data such as "CPU usage is high", "memory usage is high" and "service request failure rate is increased" collected at the time of detection is processed to generate the final features "F2-5", "F1-6", "F2-9", "F3-6", "F1-8", "F1-2", "F4-4", "F4-3". The final features are input into each of the three trained base models, namely Logistic Regression, Random Forest and XGBoost, to obtain three prediction results, and the 3 prediction results are then input as 3 features into the next-layer logistic regression model to obtain the final predicted category.
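The two-layer inference described above can be sketched as follows. Everything concrete here is a stand-in: the three base models are hand-written stubs rather than trained models, the second layer is a plain majority vote rather than a trained logistic regression, and the feature values and category names are hypothetical.

```python
from collections import Counter

class StackedFaultPredictor:
    """Two-layer predictor: base models feed their outputs to a meta model."""

    def __init__(self, base_models, meta_model):
        self.base_models = base_models  # callables: feature dict -> category
        self.meta_model = meta_model    # callable: list of categories -> category

    def predict(self, features):
        base_predictions = [model(features) for model in self.base_models]
        return self.meta_model(base_predictions)

# Hypothetical stand-ins for the three trained base models.
def lr_stub(features):
    return "resource_exhaustion" if features["F1-6"] > 0 else "normal"

def rf_stub(features):
    return "resource_exhaustion" if features["F2-5"] + features["F1-2"] > 1 else "normal"

def xgb_stub(features):
    return "service_degraded" if features["F3-6"] > 0 else "normal"

# Stand-in for the second layer: a majority vote, not a trained model.
def vote_stub(base_predictions):
    return Counter(base_predictions).most_common(1)[0][0]

predictor = StackedFaultPredictor([lr_stub, rf_stub, xgb_stub], vote_stub)
features = {"F2-5": 1, "F1-6": 1, "F2-9": 0, "F3-6": 1,
            "F1-8": 0, "F1-2": 1, "F4-4": 0, "F4-3": 0}
category = predictor.predict(features)
print(category)
```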

Optionally, on the basis of the foregoing embodiments, the method further includes:

It should be noted that, in order to implement automatic triggering of the arrangement management tool of the portable container to execute the self-healing operation according to the self-healing policy corresponding to the fault category, the present invention also needs to construct a mapping table of the fault category and the self-healing policy. The specific construction method comprises the steps of firstly obtaining the mapping relation between the fault category and the self-healing strategy, and then constructing the mapping table between the fault category and the self-healing strategy based on the mapping relation between the fault category and the self-healing strategy.
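Constructing the mapping table and using it to trigger a self-healing operation can be sketched as below. The fault category names, the policy names and the `executor` callback are hypothetical; the callback stands in for whatever call the container arrangement management tool actually exposes.

```python
def build_mapping_table(relations):
    """Build a fault category -> policy table from (category, policy) pairs."""
    table = {}
    for category, policy in relations:
        if table.get(category, policy) != policy:
            raise ValueError(f"conflicting policies for {category!r}")
        table[category] = policy
    return table

def trigger_self_healing(category, table, executor):
    """Look up the policy for a predicted category and hand it to the executor."""
    policy = table.get(category)
    if policy is None:
        return None  # no policy registered: nothing is triggered
    return executor(policy)

# Hypothetical mapping relations between fault categories and policies.
relations = [
    ("resource_exhaustion", "scale_out_replicas"),
    ("pod_unresponsive", "restart_pod"),
    ("node_fault", "migrate_pod"),
]
mapping_table = build_mapping_table(relations)

executed = []
def recording_executor(policy):
    """Stand-in executor that only records which policy was requested."""
    executed.append(policy)
    return f"executed:{policy}"

result = trigger_self_healing("pod_unresponsive", mapping_table, recording_executor)
print(result)
```

Keeping the table as data separate from the executor means a new fault category can be supported by adding a mapping relation, without changing the triggering logic.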

In the embodiment of the invention, fault detection and root cause localization are performed on the alarm data by using an AI algorithm to obtain the fault category, and the arrangement management tool of the portable container can be automatically triggered to execute the self-healing operation according to the self-healing policy corresponding to the fault category. This solves the problem that the existing container cloud can only rely on hard-coded logic or fixed rules for scheduling, and thereby realizes intelligent resource scheduling in the cloud environment.

The scheduling device in the cloud environment provided by the invention is described below, and the scheduling device in the cloud environment described below and the scheduling method in the cloud environment described above can be referred to correspondingly.

Fig. 5 is a schematic structural diagram of a scheduling device in a cloud environment according to an embodiment of the present invention. As shown in fig. 5, the scheduling apparatus 500 in the cloud environment includes:

the feature acquisition module 510 is configured to acquire alarm data to be detected in a cloud environment, and process the alarm data to obtain alarm features corresponding to the alarm data;

the fault class prediction module 520 is configured to input the alarm feature into a fault prediction model, and obtain a fault class corresponding to the alarm data, where the fault prediction model is obtained by training according to a fault alarm feature corresponding to the historical alarm data and a fault class corresponding to the historical alarm data;

The self-healing module 530 is configured to automatically trigger an arrangement management tool of the portable container to execute a self-healing operation according to the self-healing policy corresponding to the fault category according to the fault category and a mapping table of the fault category and the self-healing policy, where the self-healing operation is used to implement resource scheduling in the cloud environment.

Optionally, the fault prediction model is obtained through training of the following steps:

Optionally, the obtaining the fault alarm feature and the fault category corresponding to the fault alarm feature based on the counted alarm times of the target alarm indexes includes:

Optionally, training the classification combination model by using the fault alarm feature and the fault category corresponding to the fault alarm feature to obtain the fault prediction model, including:

Optionally, the processing the alarm data to obtain alarm characteristics corresponding to the alarm data includes:

Optionally, the inputting the alarm feature into a fault prediction model to obtain a fault class corresponding to the alarm data includes:

Optionally, the apparatus further comprises:

the mapping relation acquisition module is used for acquiring the mapping relation between the fault category and the self-healing strategy;

the construction module is used for constructing a mapping table of the fault category and the self-healing strategy based on the mapping relation of the fault category and the self-healing strategy.

It should be noted that, the scheduling device in the cloud environment provided by the embodiment of the present invention can implement all the method steps implemented by the scheduling method embodiment in the cloud environment, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those of the method embodiment in the embodiment are omitted.

Fig. 6 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include: a processor 610, a communication interface (Communications Interface) 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform a scheduling method in a cloud environment, the method comprising: acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data; inputting the alarm characteristics into a fault prediction model to obtain fault categories corresponding to the alarm data, wherein the fault prediction model is obtained by training according to the fault alarm characteristics corresponding to the historical alarm data and the fault categories corresponding to the historical alarm data; according to the fault category and a mapping table of the fault category and the self-healing policy, automatically triggering an arrangement management tool of the portable container to execute self-healing operation according to the self-healing policy corresponding to the fault category, wherein the self-healing operation is used for realizing resource scheduling in the cloud environment.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute a scheduling method in a cloud environment provided by the above methods, and the method includes: acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data; inputting the alarm characteristics into a fault prediction model to obtain fault categories corresponding to the alarm data, wherein the fault prediction model is obtained by training according to the fault alarm characteristics corresponding to the historical alarm data and the fault categories corresponding to the historical alarm data; according to the fault category and a mapping table of the fault category and the self-healing policy, automatically triggering an arrangement management tool of the portable container to execute self-healing operation according to the self-healing policy corresponding to the fault category, wherein the self-healing operation is used for realizing resource scheduling in the cloud environment.

In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the scheduling method in a cloud environment provided by the above methods, the method comprising: acquiring alarm data to be detected in a cloud environment, and processing the alarm data to obtain alarm characteristics corresponding to the alarm data; inputting the alarm characteristics into a fault prediction model to obtain fault categories corresponding to the alarm data, wherein the fault prediction model is obtained by training according to the fault alarm characteristics corresponding to the historical alarm data and the fault categories corresponding to the historical alarm data; according to the fault category and a mapping table of the fault category and the self-healing policy, automatically triggering an arrangement management tool of the portable container to execute self-healing operation according to the self-healing policy corresponding to the fault category, wherein the self-healing operation is used for realizing resource scheduling in the cloud environment.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The scheduling method in the cloud environment is characterized by comprising the following steps:

2. The scheduling method in a cloud environment according to claim 1, wherein the fault prediction model is obtained by training:

3. The scheduling method according to claim 2, wherein the obtaining the fault alert feature and the fault category corresponding to the fault alert feature based on the counted number of alerts of the target alert indicators includes:

4. A scheduling method in a cloud environment according to claim 2 or 3, wherein training the classification combination model by using the fault alert feature and the fault class corresponding to the fault alert feature to obtain the fault prediction model includes:

5. The scheduling method in the cloud environment according to claim 1, wherein the processing the alarm data to obtain the alarm feature corresponding to the alarm data includes:

6. The scheduling method in the cloud environment according to claim 1, wherein the inputting the alarm feature into the fault prediction model to obtain the fault class corresponding to the alarm data includes:

7. The scheduling method in a cloud environment of claim 1, wherein the method further comprises:

8. A scheduling apparatus in a cloud environment, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the scheduling method in the cloud environment of any of claims 1 to 7 when the program is executed by the processor.

10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements a scheduling method in a cloud environment according to any of claims 1 to 7.