CN113010389B - Training method, fault prediction method, related device and equipment

Info

Publication number: CN113010389B
Application number: CN201911325501.6A
Authority: CN
Inventors: 殷征; 谭宇; 邱能俊; 张纪宽
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2024-03-01
Anticipated expiration: 2039-12-20
Also published as: CN113010389A

Abstract

An embodiment of the invention provides a training method, a fault prediction method, a related device and equipment. The training method comprises the following steps: selecting a target service to be simulated and a target fault; calling a service template corresponding to the target service to simulate the target service, and calling a fault template corresponding to the target fault to simulate the target fault; marking index data of the target service in the process of simulating the target fault, wherein the marked index data is used for reflecting that the target fault of the target service is starting to occur; and training, according to the marked index data, a fault prediction model of the target service for the target fault. The embodiment of the invention can train a model for predicting service faults and thereby makes service fault prediction possible.

Description

Training method, fault prediction method, related device and equipment

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a training method, a fault prediction method, a related device and equipment.

Background

Online platforms now host an increasing number of services. An online shopping platform, for example, typically includes a shopping cart service, a user purchase service, a merchant selling service, a payment service, and many others, all of which are generally controlled by a service system. Given the large number of services controlled by the service system, it is particularly necessary to ensure that the services run smoothly and thereby to guarantee service stability (that is, the capability of providing service continuously and stably).

One basis for ensuring the smooth operation of a service is fault prediction for the service. However, current technical schemes mainly perform fault detection once the service has already failed, and no available technical scheme can predict a fault before the service may fail. Therefore, how to train a model capable of predicting service faults has become a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a training method, a fault prediction method, a related device and equipment, so as to train a model capable of predicting a service fault and to implement fault prediction for the service based on the trained model.

In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:

a training method, comprising:

selecting a target service to be simulated and a target fault;

calling a service template corresponding to the target service to simulate the target service, and calling a fault template corresponding to the target fault to simulate the target fault;

marking index data of the target service in the process of simulating the target fault, wherein the marked index data is used for reflecting that the target fault of the target service is starting to occur;

and training, according to the marked index data, a fault prediction model of the target service for the target fault.

The embodiment of the invention also provides a fault prediction method, which comprises the following steps:

acquiring real-time index data of a service;

determining an abnormal index set in the real-time index data according to a pre-predicted index baseline of the service; the index baseline is a data limit value reflecting index abnormality;

invoking a pre-trained fault prediction model corresponding to the service to predict a target fault type corresponding to the abnormal index set; the fault prediction model is obtained by training marked index data based on the process of simulating the faults of the service, and the marked index data is used for reflecting the occurrence of the faults of the service.

The embodiment of the invention also provides a training device, which comprises:

the selector is used for selecting a target service to be simulated and a target fault;

the simulator is used for calling a service template corresponding to the target service to simulate the target service, and calling a fault template corresponding to the target fault to simulate the target fault;

the marker is used for marking the index data of the target service in the process of simulating the target fault, wherein the marked index data is used for reflecting that the target fault of the target service is starting to occur;

and the trainer is used for training, according to the marked index data, a fault prediction model of the target service for the target fault.

The embodiment of the invention also provides a fault prediction device, which comprises:

the real-time data acquisition module is used for acquiring real-time index data of the service;

the abnormal index determining module is used for determining an abnormal index set in the real-time index data according to a pre-predicted index baseline of the service; the index baseline is a data limit value reflecting index abnormality;

the fault type prediction module is used for calling a pre-trained fault prediction model corresponding to the service and predicting a target fault type corresponding to the abnormal index set; the fault prediction model is obtained by training marked index data based on the process of simulating the faults of the service, and the marked index data is used for reflecting the occurrence of the faults of the service.

The embodiment of the invention also provides a first server platform, which comprises: at least one memory and at least one processor, said memory storing a program, said processor invoking said program to perform the training method described above.

The embodiment of the invention also provides a second server platform, which comprises: at least one memory storing a program and at least one processor invoking the program to perform the fault prediction method described above.

The embodiment of the invention also provides a storage medium which stores a program for executing the training method, or a program for executing the fault prediction method.

The training method provided by the embodiment of the invention can simulate the target fault under the target service through the service template of the target service and the fault template of the target fault; therefore, in the process of simulating the target fault, the embodiment of the invention can label the index data of the target service, and the labeled index data is used for reflecting the starting occurrence of the target fault of the target service; furthermore, according to the embodiment of the invention, the fault prediction model corresponding to the target fault of the target service can be obtained through training according to the marked index data.

The process of simulating the target fault can be regarded as the process from the start of the simulation of the target fault to the occurrence of the target fault. When simulation of the target fault of the target service starts, the target fault has not actually occurred but is developing toward occurrence. The embodiment of the invention therefore marks the index data of the target service during this process, so that the marked index data reflects that the target fault of the target service is starting to occur. A fault prediction model of the target service for the target fault that is trained on the marked index data can then predict the target fault in advance, based on the index data of the target service before the target fault occurs. That is, the training method provided by the embodiment of the invention can train a model for predicting service faults and makes fault prediction for the service possible.

Further, by marking, in the process of simulating the target fault of the target service, the index data that reflects that the target fault is starting to occur, the embodiment of the invention can mark the index data automatically, increase the number of positive samples reflecting that the target fault of the target service is starting to occur, and reduce the imbalance between positive and negative samples in model training. In addition, because this index data is obtained by simulating the target fault of the target service, it is collected under conditions in which the target fault is in fact starting to occur, and therefore has higher accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.

FIG. 1 is a diagram illustrating a training method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a training method according to an embodiment of the present invention;

FIG. 3 is a flowchart of predicting an index baseline provided by an embodiment of the present invention;

FIG. 4 is a flowchart of a fault prediction method according to an embodiment of the present invention;

FIG. 5 is a flowchart of determining propagation relationships of anomaly metrics provided by an embodiment of the present invention;

FIG. 6 is an exemplary diagram of a propagation map;

FIG. 7 is a schematic illustration of an embodiment of the present invention;

FIG. 8 is a block diagram of a training device provided by an embodiment of the present invention;

FIG. 9 is another block diagram of a training device according to an embodiment of the present invention;

FIG. 10 is a further block diagram of a training device according to an embodiment of the present invention;

FIG. 11 is a further block diagram of a training device according to an embodiment of the present invention;

FIG. 12 is yet another block diagram of a training device provided by an embodiment of the present invention;

FIG. 13 is a hardware block diagram of a first server platform;

FIG. 14 is a block diagram of a failure prediction apparatus according to an embodiment of the present invention;

FIG. 15 is another block diagram of a fault prediction apparatus according to an embodiment of the present invention.

Detailed Description

As service systems grow larger, fault detection alone cannot meet the operation and maintenance requirements of services: fault detection cannot predict the fault type of a service before the service may fail. If a fault is not predicted in advance and is only detected when it occurs, operation and maintenance personnel may be unable to handle it in time, which is very likely to cause service loss. Meanwhile, current fault detection relies on the index data of the service recorded in a database and detects the fault type only once the service has failed. As service systems grow, the number of index types of a service keeps increasing, and it is now difficult to achieve accurate fault detection using only one or a few indexes.

Based on the above, the embodiment of the invention provides a training method of a fault prediction model, so as to provide a model capable of carrying out fault prediction on a service, and provide a foundation for realizing the fault prediction of the service. The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Fig. 1 shows a stage example of the training method provided by an embodiment of the present invention. As shown in fig. 1, the training method may be performed cooperatively by a user and a first server platform. The first server platform may be a server group platform formed by a plurality of servers, or a server platform formed by combining a single server and a database. In an optional implementation, the first server platform may be an offline platform, i.e., embodiments of the present invention may train the fault prediction model offline.

As shown in fig. 1, the process of training the fault prediction model is divided into four phases: selecting a service and a fault, loading the service template and the fault template, simulating the fault, and importing the trained model into the model engine.

Based on the stage example of the training method shown in fig. 1, fig. 2 shows an optional flow of the training method provided by the embodiment of the present invention, and referring to fig. 2, the flow may include:

Step S10, selecting a target service to be simulated and a target fault.

The embodiment of the invention can provide a variety of services and faults that can be simulated, and a user can select, as needed, the target service to be simulated from these services and the target fault to be simulated from these faults.

By way of example, the embodiment of the invention can provide services such as a shopping cart service, a user purchase service, a merchant selling service, and a payment service, and faults such as disk IO (input/output) full load, CPU full load, and network fault, from which the user selects the target service and the target fault.

The target service may refer to any service, and the target fault may refer to any fault, which may be specifically selected by a user according to needs.

Step S11, calling a service template corresponding to the target service to simulate the target service, and calling a fault template corresponding to the target fault to simulate the target fault.

The embodiment of the invention can define a service template for each service that can be simulated. The service template records the information needed to run the service, so that the corresponding service can be simulated from the recorded running information. Taking a purchase service as an example, the embodiment of the invention can define a service template corresponding to the purchase service and simulate the purchase process from the purchase process information recorded by that template;

correspondingly, after the user selects the target service to be simulated, the embodiment of the invention can call the service template corresponding to the target service from the preset service templates corresponding to the simulated services, thereby simulating the target service based on the information of the target service operation recorded by the called service template.

Meanwhile, the embodiment of the invention can define fault templates for various simulative faults so as to simulate corresponding faults, and the fault templates can record various information of fault operation, so that the simulation of the corresponding faults is realized through the information of fault operation recorded by the fault templates; optionally, based on the index type focused by the user, the user can also customize the fault template;

Taking a CPU full load fault as an example, the embodiment of the invention can define a fault template corresponding to the CPU full load, thereby recording the running information of the CPU full load (such as running intensive programs to occupy the CPU load and the like) through the fault template and simulating the CPU full load fault;

correspondingly, after a user selects a target fault to be simulated, the embodiment of the invention can call the fault template corresponding to the target fault from the preset fault templates, so that the target fault is simulated based on the running information of the target fault recorded by the called fault template. Simulating a fault may be regarded as simulating the occurrence process of the fault, i.e. the process from the time the fault simulation starts until the fault occurs.
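The template lookup described above can be sketched as follows. This is a minimal illustration only: the class names, the fields `steps` and `actions`, the registry dictionaries, and the example entries are all hypothetical and are not defined by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class ServiceTemplate:
    """Hypothetical record of the information needed to run a service."""
    name: str
    steps: list  # ordered operations that make up the simulated service


@dataclass
class FaultTemplate:
    """Hypothetical record of the information needed to drive a fault."""
    name: str
    actions: list  # operations that push the system toward the fault


# Preset templates (illustrative entries only).
SERVICE_TEMPLATES = {
    "purchase": ServiceTemplate("purchase", ["browse", "add_to_cart", "pay"]),
}
FAULT_TEMPLATES = {
    "cpu_full_load": FaultTemplate("cpu_full_load", ["run_cpu_intensive_program"]),
}


def load_templates(target_service, target_fault):
    """Return the (service template, fault template) pair for a selection."""
    try:
        return SERVICE_TEMPLATES[target_service], FAULT_TEMPLATES[target_fault]
    except KeyError as exc:
        raise ValueError(f"no template defined for {exc.args[0]!r}") from exc
```

Keeping the two registries separate mirrors the text: any selected service can be paired with any selected fault.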

Optionally, the embodiment of the invention can use a stress testing system to simulate the fault based on the fault template corresponding to the target fault. A stress testing system is a system for performing stress tests; it can provide basic quality-of-service assurance testing and is part of software testing work. The resources exercised in a software stress test generally include memory, CPU availability, disk space, network bandwidth, and the like.

Based on step S11, the embodiment of the present invention may simulate the target fault in the scenario of the simulated target service, that is, simulate the target fault in the environment of the target service.

Optionally, in addition to simulating the target fault of the target service based on the service template corresponding to the target service and the fault template corresponding to the target fault, the embodiment of the invention can simulate the target fault based on data from target faults that occurred when the target service actually ran. By way of example, the embodiment of the invention can automatically collect data on target faults of the target service and feed the data back to the stress testing system, so that the stress testing system can perform the corresponding fault simulation.

Step S12, marking index data of the target service in the process of simulating the target fault, wherein the marked index data is used for reflecting that the target fault of the target service is starting to occur.

The process of simulating the target fault may be regarded as a process of simulating the occurrence of the target fault, that is, the process from the start of the simulation of the target fault to the occurrence of the target fault. During this process, the embodiment of the invention can label the index data of the target service, for example the data of indexes such as CPU utilization, memory utilization, RT (Response Time), QPS (Queries Per Second), and TPS (Transactions Per Second) of the target service. The types of indexes can of course be set according to the actual situation, and the embodiment of the invention is not limited in this respect.

In the process of simulating the target fault, that is, from the start of the simulation of the target fault of the target service to the occurrence of the target fault, the index data of the target service changes in the direction that reflects the occurrence of the target fault. During this process, the embodiment of the invention can monitor the change of the index data of the target service and label it, so that the labeled index data serves as index data reflecting that the target fault of the target service is starting to occur; i.e., the labeled index data can be regarded as the index data before the target fault of the target service formally occurs.

That is, when simulation of the target fault of the target service starts, the target fault has not actually occurred but is developing toward occurrence. The index data of the target service is marked during the process from the start of the simulation to the occurrence of the target fault, and the fault prediction model of the target service for the target fault is then trained on the marked index data, so that the trained model can predict the target fault in advance based on the index data of the target service before the target fault occurs. In short, simulation of the target fault is started, the index data of the target service before the target fault occurs is marked, and the corresponding fault prediction model is trained on it; the trained model can then predict, from the index data of the target service, whether the target fault will subsequently occur before it actually does.

Optionally, the index data may be time series data; for example, the index data may be the index values of each index recorded in time order.
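The labeling window described above can be sketched as follows. The function name, the `(timestamp, metrics)` sample layout, and the decision to drop samples collected after the fault has occurred are assumptions made for illustration; the embodiment does not specify them.

```python
def label_index_data(samples, sim_start, fault_time):
    """Label time-ordered index samples around a simulated fault.

    samples    : list of (timestamp, metrics_dict) pairs (assumed layout)
    sim_start  : timestamp at which the fault simulation starts
    fault_time : timestamp at which the simulated fault actually occurs

    Samples in [sim_start, fault_time) are labeled 1 (fault starting to
    occur), samples before sim_start are labeled 0 (normal). Samples at or
    after fault_time are dropped, which is an assumption of this sketch.
    """
    labeled = []
    for ts, metrics in samples:
        if ts >= fault_time:
            continue  # the fault has formally occurred; not a pre-fault sample
        label = 1 if ts >= sim_start else 0
        labeled.append((ts, metrics, label))
    return labeled
```

Because the window is defined by the simulation itself, no manual annotation is needed to obtain the positive samples.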

Step S13, training, according to the marked index data, a fault prediction model of the target service for the target fault.

The marked index data can be used for reflecting the occurrence of the target fault of the target service, and can be used as training data for training a corresponding fault prediction model, so that the fault prediction model of the target service corresponding to the target fault is trained, and whether the target fault occurs or not can be predicted based on the index data before the target fault of the target service occurs.

Optionally, the index data marked by the embodiment of the invention can be combined with the index data under the condition that the target service does not have the target fault, so as to be used as training data for training a fault prediction model of the target service corresponding to the target fault; optionally, the marked index data can be used as a positive sample for training the fault prediction model, and the index data under the condition that the target service does not have the target fault can be used as a negative sample for training the fault prediction model, so that the training of the fault prediction model is realized by using the positive sample and the negative sample.

Optionally, the index data marked by the embodiment of the invention can be preprocessed and then used as training data for training the fault prediction model. Taking time series index data as an example, the preprocessing may include: cleaning the time series data, filling missing values with the median, standardizing the data, normalizing the data, and the like.
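The preprocessing steps named above can be sketched with the Python standard library. The function names are hypothetical, missing values are assumed to be represented as `None`, and z-score standardization and min-max normalization are assumed as the concrete forms of "standardization" and "normalization".

```python
import statistics


def fill_missing_with_median(series):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in series if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in series]


def standardize(series):
    """Z-score standardization: zero mean, unit (population) std."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(v - mean) / std for v in series]


def min_max_normalize(series):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]
```

In practice only one of the two rescaling steps may be needed for a given training method; the sketch shows both because the text lists both.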

Optionally, when training the fault prediction model of the target service for the target fault, the embodiment of the invention may use multiple training methods, for example random forest, GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting), GRU (gated recurrent unit), and other training methods; deep-learning training methods such as CNN (convolutional neural network), DNN (deep neural network), and LSTM (long short-term memory) may of course also be used.

The embodiment of the invention can arrange a plurality of training methods, for example by using grid search, so that a fault prediction model is trained separately with each arranged training method on the marked index data (which may be preprocessed before being used for model training). A fault prediction model is thus obtained for each training method, and the optimal one is selected based on the training results of these models as the trained fault prediction model of the target service for the target fault.

Taking a CPU full-load fault as the target fault, for example, the embodiment of the invention can label the index data of the target service while simulating the occurrence of the CPU full-load fault of the target service, automatically apply the training methods arranged by grid search, and train a fault prediction model with each training method on the labeled index data. For each trained fault prediction model, the embodiment of the invention can determine its accuracy and recall, and finally automatically select the model with the optimal accuracy and recall as the trained fault prediction model of the target service for the CPU full-load fault.
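The select-the-best-candidate loop can be sketched as follows. To keep the example dependency-free, the candidates are toy stand-ins (a majority-class predictor and single-index threshold rules) rather than random forest, GBDT, or the other methods named above. The scoring rule, the mean of accuracy and recall, is also an assumption, since the text does not say how the two are combined.

```python
def train_majority(train_set):
    """Toy candidate: always predict the majority label of the training set."""
    positives = sum(label for _, label in train_set)
    majority = 1 if positives * 2 >= len(train_set) else 0
    return lambda x: majority


def make_threshold_trainer(feature):
    """Toy candidate: threshold one index midway between the class means.

    Assumes positive samples tend to have higher values of the index.
    """
    def train(train_set):
        pos = [x[feature] for x, y in train_set if y == 1]
        neg = [x[feature] for x, y in train_set if y == 0]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return lambda x: 1 if x[feature] >= threshold else 0
    return train


def accuracy_and_recall(model, data):
    """Compute accuracy and recall of a model on (features, label) pairs."""
    correct = sum(1 for x, y in data if model(x) == y)
    true_pos = sum(1 for x, y in data if y == 1 and model(x) == 1)
    positives = sum(1 for _, y in data if y == 1)
    recall = true_pos / positives if positives else 0.0
    return correct / len(data), recall


def select_best_model(trainers, train_set, validation_set):
    """Train every candidate and keep the one with the best combined score."""
    best = None
    for name, trainer in trainers.items():
        model = trainer(train_set)
        acc, rec = accuracy_and_recall(model, validation_set)
        score = (acc + rec) / 2  # assumed combination of accuracy and recall
        if best is None or score > best[0]:
            best = (score, name, model)
    return best[1], best[2]
```

With real training methods, each entry in `trainers` would wrap one method and hyperparameter setting from the grid; the selection loop itself stays the same.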

Based on the training process shown in fig. 1 and fig. 2, under the condition that different services to be simulated and faults to be simulated are selected by a user, the embodiment of the invention can train the corresponding fault prediction models of the different services under different faults, thereby providing a basis for realizing the advanced prediction of the different faults of the different services.

For each fault of each service, the embodiment of the invention can obtain the corresponding fault prediction model based on the flow shown in fig. 2. The fault prediction models can be imported into a model engine for persistence. The model engine can be regarded as a set of fault prediction models for different faults of different services, in which one fault of one service corresponds to one fault prediction model; for example, the CPU full-load fault of a service corresponds to one fault prediction model in the model engine, and the disk IO full-load fault of that service corresponds to another.
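One way to picture the model engine is as a registry keyed by (service, fault). The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and `pickle` is used only as a simple stand-in for whatever persistence mechanism an implementation would choose.

```python
import pickle


class ModelEngine:
    """Hypothetical registry: one model per (service, fault) pair."""

    def __init__(self):
        self._models = {}

    def import_model(self, service, fault, model):
        """Register the trained model for one fault of one service."""
        self._models[(service, fault)] = model

    def get(self, service, fault):
        """Return the model registered for this (service, fault) pair."""
        return self._models[(service, fault)]

    def models_for_service(self, service):
        """Return {fault: model} for every fault registered for a service."""
        return {f: m for (s, f), m in self._models.items() if s == service}

    def dumps(self):
        """Serialize all models (they must be picklable) for persistence."""
        return pickle.dumps(self._models)

    @classmethod
    def loads(cls, blob):
        """Rebuild an engine from bytes produced by dumps()."""
        engine = cls()
        engine._models = pickle.loads(blob)
        return engine
```

Looking up all models for one service is useful at prediction time, when the fault type is not yet known and every fault model of that service may need to be consulted.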

The following describes the advantages of the training scheme provided by the embodiment of the present invention compared with the existing fault detection approach.

It should be noted that the current fault detection approach cannot predict a fault before the fault of a service occurs. Meanwhile, the learning data on which fault detection relies is marked manually, so the marking of the learning data contains a certain amount of error. Moreover, because faults occur far less often than normal operation (for example, only a few faults may occur across many service instances over a period of time), the data corresponding to faults and the data corresponding to normal operation are severely imbalanced (i.e., the positive and negative samples are severely imbalanced).

The scheme provided by the embodiment of the invention can solve the above problems. Specifically, the index data of the service is marked during the process from the start of the fault simulation to the occurrence of the fault, so a fault prediction model trained on the marked index data can predict the fault before it occurs, rather than only detecting the fault when it occurs. Meanwhile, by simulating faults and marking, during the simulated occurrence, the index data that reflects that the fault is starting to occur, the embodiment of the invention marks such index data automatically, increases the number of positive samples, and reduces the imbalance between positive and negative samples. Further, because this index data is obtained through fault simulation, it is collected under conditions in which the fault is in fact starting to occur, and therefore has higher accuracy.

According to the scheme provided by the embodiment of the invention, various faults can be simulated repeatedly under different service scenarios to obtain the index data of each service scenario under each fault, so that a fault prediction model is trained for each service scenario under each fault. The scheme has at least the following advantages. First, the proportion of positive and negative samples is balanced, so a more robust fault prediction model can be trained. Second, the index data is marked automatically from the moment the fault simulation starts; since the fault does not occur immediately when the simulation starts, the index data collected from the start of the simulation to the occurrence of the fault can be regarded as a snapshot taken before the fault really occurs, and using it as training data gives the fault prediction model the capability of predicting the fault.

Based on the fault prediction model obtained by training, the embodiment of the invention can implement fault prediction for the service. The main idea of fault prediction in the embodiment of the invention is as follows: abnormal indexes are identified from the real-time index data of the service, and the fault type corresponding to the abnormal indexes is then predicted by using the fault prediction model corresponding to the service, so that the fault is predicted before the service fails.

Based on this idea, the embodiment of the invention can also provide an index predictor for predicting index baselines, so that abnormal indexes of a service can be identified from the real-time index data of the service based on the index baseline of the service; the index baseline may be regarded as a data limit value reflecting index abnormality. The index predictor, which may be a program module or a hardware module, can predict the index baseline of each service. As an optional implementation, fig. 3 illustrates an optional flow for predicting an index baseline provided by an embodiment of the present invention. The flow may be implemented by the index predictor, which may specifically run on the offline first server platform. Referring to fig. 3, the flow may include:
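The prediction idea can be sketched in two steps: compare real-time index values against baselines, then hand the abnormal set to the fault prediction models of the service. The representation of a baseline as a `(lower, upper)` pair, the representation of a model as a callable returning a truthy value, and all names below are assumptions for illustration.

```python
def find_abnormal_indexes(realtime, baselines):
    """Return the indexes whose real-time value lies outside its baseline.

    realtime  : {index_name: value}
    baselines : {index_name: (lower_limit, upper_limit)} (assumed layout)
    """
    abnormal = {}
    for name, value in realtime.items():
        if name not in baselines:
            continue  # no baseline predicted for this index; skip it
        lower, upper = baselines[name]
        if value < lower or value > upper:
            abnormal[name] = value
    return abnormal


def predict_fault(realtime, baselines, models):
    """Predict a fault type from real-time index data, or None if healthy.

    models : {fault_type: callable(abnormal_indexes) -> bool}, a stand-in
             for the trained fault prediction models of one service.
    """
    abnormal = find_abnormal_indexes(realtime, baselines)
    if not abnormal:
        return None
    for fault_type, model in models.items():
        if model(abnormal):
            return fault_type
    return None
```

Filtering by baseline first means the models are only consulted when at least one index is abnormal.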

Step S20, classifying the historical index data of the service to obtain periodic historical index data and aperiodic historical index data.

Optionally, the service may be any service, and the embodiment of the present invention may predict the index baseline of each service based on the flow shown in fig. 3.

The embodiment of the invention can classify the data characteristics of the historical index data of the service so as to obtain the periodic historical index data and the aperiodic historical index data; by way of example, the historical index data of the service may be historical index time sequence data of the service, and the embodiment of the invention may perform feature classification on the historical index time sequence data to obtain the historical index time sequence data with periodic data features and the historical index time sequence data with non-periodic data features.

Optionally, the historical index data of the service may be collected by the stream processing platform and then stored in the database.

Step S21, determining a period corresponding to the periodic historical index data by adopting FFT.

FFT (Fast Fourier Transform) is a general term for efficient algorithms that compute the Discrete Fourier Transform (DFT) on a computer; the embodiment of the present invention may apply an FFT to the periodic historical index data to determine the period corresponding to the periodic historical index data.
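By way of illustration only, the period can be read off the frequency bin with the largest magnitude. The sketch below uses a textbook recursive radix-2 FFT and a synthetic series; the function names, the power-of-two length restriction, and the sample data are assumptions of the example rather than part of the embodiment.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_period(series):
    """Return the period, in samples, of the strongest non-constant frequency."""
    n = len(series)
    mean = sum(series) / n
    # Remove the mean so the zero-frequency bin does not dominate.
    spectrum = fft([x - mean for x in series])
    # For real-valued input only the first half of the spectrum is informative.
    best_k = max(range(1, n // 2), key=lambda k: abs(spectrum[k]))
    return n / best_k


# Synthetic index series: 256 samples with a 32-sample cycle.
series = [50 + 10 * math.sin(2 * math.pi * t / 32) for t in range(256)]
period = dominant_period(series)
```

For this synthetic series the strongest bin is the eighth one, which corresponds to a period of 32 samples.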

Step S22, predicting an index baseline of the periodic historical index data by using an STL method based on the period.

STL (Seasonal-Trend decomposition procedure based on Loess) is a time-series decomposition method that uses robust locally weighted regression as its smoothing method. The embodiment of the invention can predict the index baseline of the periodic historical index data by using the STL method, so as to obtain the data limit value reflecting index abnormality in the periodic historical index data (i.e. the index baseline of the periodic historical index data).
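The sketch below is a deliberately simplified stand-in for this step and is not STL itself: it has no Loess smoothing and no trend component. It averages the values that fall on the same phase of the cycle and places the baseline a fixed number of residual standard deviations around that seasonal profile. The function name, the multiplier `k`, and the sample data are assumptions of the example.

```python
import statistics


def seasonal_baseline(series, period, k=3.0):
    """Return (lower, upper) bound lists, one entry per phase of the cycle."""
    # Seasonal profile: mean of all values that share the same phase.
    profile = [statistics.fmean(series[p::period]) for p in range(period)]
    # Residual: what is left after removing the seasonal profile.
    residuals = [x - profile[i % period] for i, x in enumerate(series)]
    spread = statistics.pstdev(residuals)
    lower = [m - k * spread for m in profile]
    upper = [m + k * spread for m in profile]
    return lower, upper


# Six cycles of a 4-sample pattern, alternately shifted by +1 and -1.
pattern = [10, 20, 30, 20] * 6
series = [base + (1 if (i // 4) % 2 == 0 else -1) for i, base in enumerate(pattern)]
lower, upper = seasonal_baseline(series, period=4)
```

An actual STL implementation would additionally separate a slowly varying trend and down-weight outliers when fitting.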

Step S23, for the first part of aperiodic historical index data, predicting an index baseline of the first part of aperiodic historical index data through an isolation forest method.

The aperiodic historical index data can be divided into two parts: a first part of aperiodic historical index data and a second part of aperiodic historical index data. The first part may be larger than the second part, e.g., 80% for the first part and 20% for the second part; the specific values here are merely optional examples.

For the first part of aperiodic historical index data, the embodiment of the present invention may use an isolation forest method to predict its index baseline. It should be noted that the isolation forest is an unsupervised anomaly detection method suitable for continuous data; by applying it to the first part of aperiodic historical index data, the embodiment of the invention obtains the data limit value reflecting index abnormality in that data (i.e. the index baseline of the first part of aperiodic historical index data).

Optionally, the embodiment of the invention may first denoise the first part of aperiodic historical index data and then predict the index baseline of the denoised data by the isolation forest method.
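For illustration, a compact one-dimensional isolation forest can be sketched as follows. Points that are isolated by few random cuts receive scores close to 1 and may be treated as lying outside the baseline. The function names, the tree count, the sample size, the fixed seed, and the synthetic history are assumptions of the example; how scores are turned into a concrete baseline is not specified by the embodiment.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split 1-D points at uniformly random cut values."""
    if len(points) <= 1 or depth >= max_depth:
        return ("leaf", len(points))
    lo, hi = min(points), max(points)
    if lo == hi:
        return ("leaf", len(points))
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p < cut]
    right = [p for p in points if p >= cut]
    return ("node", cut,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus a correction for unsplit leaf contents."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, cut, left, right = tree
    return path_length(left if x < cut else right, x, depth + 1)


def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    """Build n_trees trees, each on a random subsample (needs at least 2 points)."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(data, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, x):
    """Score in (0, 1]; values near 1 indicate points that are easy to isolate."""
    trees, size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / average_path_length(size))


# Synthetic history: 63 values near 10 and one spike at 100.
history = [10.0 + math.sin(i) for i in range(63)] + [100.0]
forest = fit_forest(history)
outlier_score = anomaly_score(forest, 100.0)
inlier_score = anomaly_score(forest, 10.0)
```

In this sketch the spike is usually separated by the very first cut, so its score is clearly higher than that of a typical value.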

Step S24, for the second part of aperiodic historical index data, determining the relation between the data change rate and the service or task in the second part of aperiodic historical index data through time sequence decomposition, and predicting the index baseline of the second part of aperiodic historical index data based on the relation.

For the second part of aperiodic historical index data, the embodiment of the invention can determine the relation between the data change rate and the service or task through time sequence decomposition, and thereby predict the index baseline of the second part of aperiodic historical index data. For example, where the second part of aperiodic historical index data is time sequence data, the embodiment of the invention can analyze how the data changes over time to obtain the rising or falling slope of the data against time (an optional representation of the data change rate), and then determine the relation between this slope and the service or task. Based on this relation, the index baseline of the second part of aperiodic historical index data is predicted, giving the data limit value reflecting index abnormality in that data (i.e., the index baseline of the second part of aperiodic historical index data).

Optionally, in the embodiment of the present invention, the service data and the task data in the second part of aperiodic historical index data may be separated; for example, different task scenarios of the same service are simulated first, and then the same task scenario of different services is simulated, so as to distinguish tasks from services.
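As a minimal sketch of the slope analysis, the rising or falling slope of each task's series can be estimated by ordinary least squares against the sample index, and a newly observed slope can then be compared with the expected one. The task names, the relative tolerance, and the sample data are hypothetical and serve only to make the example concrete.

```python
def trend_slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def rate_is_abnormal(observed, expected, tolerance=0.5):
    """Flag a slope that deviates from the expected slope by more than the tolerance."""
    return abs(observed - expected) > tolerance * abs(expected)


# Hypothetical history of one index, recorded separately per task.
history_by_task = {
    "task_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "task_b": [5.0, 5.5, 6.0, 6.5, 7.0],
}
expected_slopes = {task: trend_slope(v) for task, v in history_by_task.items()}
```

With these values, `rate_is_abnormal(3.5, expected_slopes["task_a"])` is true, while `rate_is_abnormal(2.4, expected_slopes["task_a"])` is false.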

Based on the flow shown in fig. 3, the embodiment of the invention can divide the historical index data of the service into periodic historical index data and aperiodic historical index data, predict the index baseline of each, and thereby determine the index baseline of the service.

Based on the main idea of fault prediction provided by the embodiment of the invention, the embodiment of the invention can perform fault prediction on the real-time data of the service, so that a fault is predicted before it occurs. Optionally, fig. 4 illustrates an optional flow of the fault prediction method provided by the embodiment of the present invention; the flow may be performed online, and specifically may be implemented by an online second server platform, which may be a server platform different from the first server platform. Referring to fig. 4, the flow may include:

Step S30, acquiring real-time index data of the service.

The real-time index data of the service may be index data collected in real time while the service is running, including index data such as CPU utilization, memory utilization, RT (response time), QPS (queries per second) and TPS (transactions per second); of course, the types of index may also be set according to the actual situation, and the embodiment of the invention is not limited in this respect. Optionally, the service may refer to any service.

Optionally, the real-time index data of the service may be collected in real time by an open-source stream processing platform (e.g., Kafka).

Step S31, determining an abnormal index set in the real-time index data according to the pre-predicted index baseline of the service.

Optionally, the embodiment of the invention can learn the index baseline of the service in advance through an unsupervised learning method, so that the index baseline of the service reflects the data limit value of index abnormality. In the real-time index data, an index whose value does not conform to the index baseline of the service can be regarded as an abnormal index, and the set of such indexes is obtained (namely, the abnormal index set).

In one example, the index baseline may include an index baseline upper boundary value and an index baseline lower boundary value, and an index with an index value greater than the baseline upper boundary value or less than the baseline lower boundary value in the real-time index data may be considered as an abnormal index, so that the embodiment of the invention may obtain an abnormal index set with an index value greater than the baseline upper boundary value or less than the baseline lower boundary value from the real-time index data.
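The upper and lower boundary check in this example can be sketched as follows. The index names and boundary values are hypothetical, and skipping an index for which no baseline has been learned is a choice made for the example.

```python
def abnormal_index_set(realtime, baselines):
    """Return the names of indexes whose value lies outside [lower, upper]."""
    abnormal = set()
    for name, value in realtime.items():
        bounds = baselines.get(name)
        if bounds is None:
            continue  # no baseline learned for this index (assumed behaviour)
        lower, upper = bounds
        if value < lower or value > upper:
            abnormal.add(name)
    return abnormal


# Hypothetical baselines: index name -> (lower boundary, upper boundary).
baselines = {
    "cpu_util": (0.05, 0.80),
    "rt_ms": (1.0, 200.0),
    "qps": (100.0, 5000.0),
}
realtime = {"cpu_util": 0.93, "rt_ms": 150.0, "qps": 40.0}
abnormal = abnormal_index_set(realtime, baselines)
```

Here `cpu_util` exceeds its upper boundary and `qps` falls below its lower boundary, so both enter the abnormal index set while `rt_ms` does not.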

Optionally, the index baseline of the service may be learned offline by the index predictor based on the historical index data of the service and stored in an offline rule engine, so that the embodiment of the present invention may call the index baseline of the service stored in the rule engine to implement step S31.

Optionally, step S31 may be performed by a real-time stream computing engine (e.g., Blink); for the real-time index data of the service collected by the open-source stream processing platform, the real-time stream computing engine may call the index baseline of the service stored in the rule engine to determine the abnormal index set in the real-time index data.

Step S32, calling a pre-trained fault prediction model corresponding to the service to predict the target fault type corresponding to the abnormal index set.

According to the embodiment of the invention, a fault of the service can be simulated through a preset fault template of the service, and the fault prediction model of the service is trained on the index data marked during the period from the start of the fault simulation to the occurrence of the fault, where the marked index data reflects that the fault of the service is beginning to occur. Using the trained fault prediction model corresponding to the service, the embodiment of the invention can perform fault classification on the abnormal index set, so that the fault type corresponding to the abnormal index set (namely, the target fault type) is predicted.

Optionally, each fault of the service has a corresponding fault prediction model; for example, one fault type of the service may correspond to one fault prediction model, that is, different types of faults of the service may have different fault prediction models. The embodiment of the present invention may pre-train the fault prediction models corresponding to the service under different faults, so as to obtain multiple fault prediction models corresponding to the service. The embodiment of the invention can then apply each of these fault prediction models to the abnormal index set and take the fault type with the highest confidence among the prediction results as the target fault type.

Optionally, step S32 may be implemented by the real-time stream computing engine calling the fault prediction model corresponding to the service and outputting the predicted target fault type.
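A minimal sketch of selecting the most confident prediction among the per-fault models is given below. Each model is represented here by a plain function returning a confidence between 0 and 1; the fault names, index names, and stand-in models are hypothetical and only illustrate the selection step.

```python
def predict_target_fault(models, abnormal_indexes):
    """Run every per-fault model and keep the fault type with the highest confidence."""
    scores = {fault: model(abnormal_indexes) for fault, model in models.items()}
    target = max(scores, key=scores.get)
    return target, scores[target]


# Stand-in models: one callable per fault type (assumption for the example).
models = {
    "disk_io_full": lambda idx: 0.9 if "io_wait" in idx else 0.1,
    "cpu_full": lambda idx: 0.8 if "cpu_util" in idx else 0.2,
}
target_fault, confidence = predict_target_fault(models, {"io_wait", "qps"})
```

In practice each entry would be a trained fault prediction model for one fault type of the service rather than a hand-written rule.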

In an optional implementation, steps S31 and S32 may be implemented by a predictor provided in the real-time stream computing engine; optionally, the predictor may be a program module or a hardware module provided in the real-time stream computing engine. On the one hand, the predictor calls the index baseline of the service stored in the rule engine and determines the abnormal index set in the real-time index data of the service; on the other hand, it calls the pre-trained fault prediction model corresponding to the service to predict the target fault type corresponding to the abnormal index set, so that fault prediction for the service is realized based on the real-time index data of the service.

Optionally, further, based on the predicted target fault type, the embodiment of the invention may initiate a corresponding fault decision behavior to handle the fault, and the fault decision behavior initiated may differ for different predicted fault types. For example, when a service has a disk IO full-load fault and the QPS of other service instances drops, the embodiment of the invention may, after identifying the IO full-load fault of the service, give a corresponding fault decision behavior, such as recommending a disk IO throttling operation to limit the IO operations of the disk. If the disk IO throttling operation cannot be initiated, the embodiment of the invention may output an alarm and generate a propagation graph of the abnormal indexes to help locate the root cause of the fault, thereby helping operation and maintenance personnel perform the operation and maintenance actions manually.

In a further implementation, the embodiment of the invention can analyze the root cause of a service fault, so that the root cause is determined after the abnormal indexes of the service are detected. The root cause may be obtained by analysis based on the propagation relationship of the abnormal indexes. Optionally, fig. 5 shows an optional flow for determining the propagation relationship of abnormal indexes provided by the embodiment of the present invention; the flow may be implemented by a root cause analysis engine, which may be a program module or a hardware module and may specifically run on the offline first server platform. Referring to fig. 5, the flow may include:

Step S40, scanning the abnormal indexes corresponding to the service fault, and determining the fluctuation condition corresponding to the abnormal indexes.

For a service fault, the embodiment of the invention can determine the fluctuation condition of the abnormal indexes based on the abnormal indexes of the service fault.

Step S41, analyzing the fluctuation condition through a correlation analysis algorithm to obtain propagation graph data corresponding to the abnormal indexes.

A service fault may correspond to multiple abnormal indexes, and abnormality propagates among them in a certain order; the embodiment of the invention can express the propagation relationship of the abnormal indexes, such as the propagation order, through the obtained propagation graph data.

Taking a timeout fault as an example, as shown in fig. 6, the timeout fault may be caused by MySQL.

Optionally, the correlation analysis method used to analyze the fluctuation condition may be Pearson, Spearman, cross-correlation, etc. It should be noted that Pearson correlation is used to calculate the correlation between interval or ratio data, or continuous data, whose values are not limited to integers; Spearman correlation is used to calculate the relationship between ordinal (ranked) data, which is characterized by ordered grades in which the difference between two successive grades is not necessarily equal.
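To make the difference between the two coefficients concrete, the sketch below computes both directly from their definitions: Pearson on the raw values and Spearman as the Pearson correlation of the ranks. The function names and the sample series are assumptions of the example, and constant series, which would cause a division by zero, are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average 1-based ranks, so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Two index series that rise together, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because the relationship is monotonic but not linear, the Spearman coefficient equals 1 while the Pearson coefficient is slightly below 1.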

Optionally, the embodiment of the present invention may further mine an anomaly detection rule from the propagation graph data, the rule expressing the regularity among the propagation relationships of abnormal indexes reflected by the propagation graph data. It can be understood that, by simulating the same type of fault of the service repeatedly, the embodiment of the invention can collect multiple abnormal index lists whose propagation relationships are quite similar; after multiple propagation relationships of the abnormal indexes of the same fault of the service are obtained, they can be clustered, so that the anomaly detection rule of the service under that fault is mined, and the rule reflects the root cause analysis rule of the service under that fault.

Based on the above flow, the embodiment of the invention can discover abnormal data features through the index predictor to obtain the index baseline reflecting index abnormality. After the abnormal indexes in the index data of the service are determined based on the index baseline, the root cause analysis engine can determine the propagation relationship of the abnormal indexes and mine, by clustering over these propagation relationships, the anomaly detection rule used to locate the root cause of the abnormality. Optionally, the fault prediction process provided by the embodiment of the invention may further include: calling the anomaly detection rule, recorded in the rule engine, that corresponds to the service under the target fault type, and analyzing the root cause corresponding to the abnormal index set according to the anomaly detection rule, so as to determine the root cause of the abnormal indexes and help operation and maintenance personnel select fault handling behaviors according to that root cause.
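One possible way to cluster propagation relationships is sketched below, purely as an assumption-laden illustration: each propagation relationship is represented as a set of directed edges between abnormal indexes, relationships are grouped greedily by Jaccard similarity, and the rule of a cluster is taken to be the edges shared by all of its members. The similarity measure, the threshold, the greedy grouping, and the index names are all choices made for the example and are not prescribed by the embodiment.

```python
def jaccard(a, b):
    """Jaccard similarity of two edge sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_propagations(graphs, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for graph in graphs:
        for cluster in clusters:
            if jaccard(graph, cluster[0]) >= threshold:
                cluster.append(graph)
                break
        else:
            clusters.append([graph])
    return clusters


def mine_rule(cluster):
    """Keep the edges that appear in every propagation relationship of the cluster."""
    return frozenset.intersection(*cluster)


# Hypothetical propagation relationships from three simulated fault runs;
# each edge is (upstream abnormal index, downstream abnormal index).
runs = [
    frozenset({("mysql.rt", "app.rt"), ("app.rt", "gateway.timeout")}),
    frozenset({("mysql.rt", "app.rt"), ("app.rt", "gateway.timeout"),
               ("app.rt", "app.qps")}),
    frozenset({("disk.io", "app.rt")}),
]
clusters = cluster_propagations(runs)
rule = mine_rule(clusters[0])
```

The first two runs share two of three edges and fall into one cluster, whose mined rule consists of those two shared edges; the third run forms a cluster of its own.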

Optionally, as shown in fig. 7, an example of the scheme provided by the embodiment of the present invention may include: an offline part executed by the first server platform, and an online part executed by the second server platform; the offline part may provide support for the execution of the online part;

As shown in fig. 7, the offline part may be divided into a model engine part and a rule engine part, where the model engine part mainly aggregates the fault prediction models, trained by the embodiment of the present invention, that correspond to each service under each fault, and the rule engine part mainly aggregates the index baselines of the services predicted by the index predictor and the anomaly detection rules generated by the root cause analysis engine;

optionally, in the model engine part, the user can select the service and the fault to be simulated, so that the corresponding service template and fault template are loaded and the fault of the service is simulated; during the period from the start of the fault simulation to the occurrence of the fault, the embodiment of the invention can mark the index data of the service, and train a fault prediction model of the selected service under the selected fault based on the marked index data; in this way, the fault prediction model corresponding to each service under each fault can be obtained through training and aggregated into the model engine;

in the rule engine part, the index predictor may predict index baselines reflecting abnormal indexes of the respective services based on historical index data of the services, and the predicted index baselines of the abnormal indexes of the respective services may be summarized to the rule engine;

On the other hand, after determining the abnormal index of the service based on the index baseline of the service, the root cause analysis engine can analyze the propagation relationship of the abnormal index through a correlation analysis algorithm based on the fluctuation condition of the abnormal index of the service and mine the abnormal detection rule in the propagation relationship; the anomaly detection rules may be summarized to a rules engine.

Optionally, in the above process, the fault prediction model may be obtained by a supervised training method, while the index baseline, the propagation relationship and the anomaly detection rule may be obtained by unsupervised learning methods.

For the online part, the open-source stream processing platform can collect real-time index data of the service, so that the predictor of the real-time stream computing engine can call the index baseline of the service in the rule engine to determine the abnormal index set in the real-time index data; furthermore, the predictor of the real-time stream computing engine can call the corresponding fault prediction model of the service in the rule engine to predict the fault type corresponding to the abnormal index set, so as to realize the prediction of the fault type of the service;

optionally, further, the predictor of the real-time stream computing engine may call the anomaly detection rule, recorded in the rule engine, that corresponds to the service under the predicted fault type, so as to analyze the root cause corresponding to the abnormal index set according to the anomaly detection rule;

Optionally, further, after the fault type of the service is predicted, the predicted fault type may be output to the decision-making actuator, so that the decision-making actuator initiates a corresponding fault decision behavior based on the predicted fault type.

The database instance of the online part can record the data of the business fault, so that the data is further fed back to the fault simulation stage of the offline part to enrich the data of fault simulation.

The scheme provided by the embodiment of the invention can provide a model capable of predicting service faults, realize fault prediction for the service, mine the root cause of service faults, and provide a decision basis for operation and maintenance decisions.

The foregoing describes several embodiments of the present invention. Where no conflict arises, the alternatives presented in the various embodiments may be combined and cross-referenced with one another to extend to further possible embodiments, all of which are considered embodiments disclosed by the present invention.

The training device provided by the embodiment of the present invention is described below, and the training device described below may be considered as a functional module required to implement the training method provided by the embodiment of the present invention. The content of the training device described below may be referred to in correspondence with the content of the training method described above.

Optionally, fig. 8 shows an optional block diagram of a training device provided by an embodiment of the present invention, and referring to fig. 8, the training device may include:

a selector 100 for selecting a target service and a target fault to be simulated;

a simulator 110, configured to invoke a service template corresponding to the target service, simulate the target service, and invoke a fault template corresponding to the target fault, simulate the target fault;

the labeler 120 is configured to label, in a process of simulating the target fault, index data of a target service, where the index data is used to reflect that the target fault of the target service begins to occur;

and the trainer 130 is used for training and obtaining a fault prediction model corresponding to the target service in the target fault according to the marked index data.

Optionally, the simulator 110 is configured to invoke a service template corresponding to the target service to simulate the target service, and may specifically include:

calling a business template corresponding to the target business from preset business templates corresponding to each simulative business, wherein the business template records information of the operation of the target business;

simulating the target service according to the information of the target service operation recorded by the service template.

Optionally, the simulator 110 is configured to invoke a fault template corresponding to the target fault, and simulate the target fault, and may specifically include:

calling a fault template corresponding to a target fault from preset fault templates corresponding to each simulatable fault, wherein the fault template records information of target fault operation;

and simulating the target fault according to the information of the target fault operation recorded by the fault template.

Optionally, the trainer 130 is configured to train to obtain a fault prediction model corresponding to the target fault of the target service according to the marked index data, and may specifically include:

respectively training a fault prediction model by using various preset training methods according to the marked index data;

and selecting an optimal fault prediction model based on training results of each fault prediction model obtained through training so as to obtain a fault prediction model corresponding to the target service in the target fault.
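The train-then-select procedure performed by the trainer can be sketched as follows. The two stand-in training methods (a single-feature threshold classifier and a majority-class classifier), the use of validation accuracy as the selection criterion, and the sample data are assumptions of the example; the embodiment does not name the preset training methods or the criterion for the optimal model.

```python
def train_threshold(rows, labels):
    """Pick the cut on a single feature that best separates the two labels."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(rows)):
        acc = sum((r >= cut) == bool(y) for r, y in zip(rows, labels)) / len(rows)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return lambda r: int(r >= best_cut)


def train_majority(rows, labels):
    """Always predict the more frequent label (a trivial reference method)."""
    majority = int(sum(labels) * 2 >= len(labels))
    return lambda r: majority


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def select_best(methods, train, valid):
    """Train one model per method and keep the one with the best validation accuracy."""
    trained = {name: fit(*train) for name, fit in methods.items()}
    scores = {name: accuracy(model, *valid) for name, model in trained.items()}
    best = max(scores, key=scores.get)
    return best, trained[best], scores


# Marked index data: one feature per sample, label 1 = pre-fault snapshot.
train = ([0.2, 0.3, 0.4, 0.8, 0.9, 0.95], [0, 0, 0, 1, 1, 1])
valid = ([0.1, 0.5, 0.85, 0.7], [0, 0, 1, 1])
methods = {"threshold": train_threshold, "majority": train_majority}
best_name, best_model, scores = select_best(methods, train, valid)
```

On this data the threshold method reaches a validation accuracy of 0.75 against 0.5 for the majority method, so it is the one retained.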

Optionally, the training device provided by the embodiment of the invention can be further used for: and preprocessing the marked index data before training to obtain a fault prediction model of the target service corresponding to the target fault according to the marked index data.

Optionally, fig. 9 shows another block diagram of a training device according to an embodiment of the present invention, and in combination with fig. 8 and fig. 9, the training device may further include:

the model engine 140 is configured to aggregate the fault prediction models corresponding to the respective services under the respective faults.

Optionally, fig. 10 shows a further block diagram of a training device according to an embodiment of the present invention, and in combination with fig. 9 and fig. 10, the training device may further include:

the index predictor 150 is configured to determine an index baseline of the service according to historical index data of the service, where the index baseline is a data limit value reflecting an index anomaly.

Optionally, the index predictor 150 is configured to determine, according to historical index data of a service, an index baseline of the service, and may specifically include:

classifying the historical index data of the service to obtain periodic historical index data and aperiodic historical index data;

for periodic historical index data, determining a period corresponding to the periodic historical index data by adopting fast Fourier transform, and predicting an index baseline of the periodic historical index data by using a time sequence decomposition method based on the period;

for aperiodic historical index data, dividing the aperiodic historical index data into a first part of aperiodic historical index data and a second part of aperiodic historical index data, wherein the data volume of the first part of aperiodic historical index data is larger than that of the second part of aperiodic historical index data; for the first part of aperiodic historical index data, predicting an index baseline by an isolation forest method; and for the second part of aperiodic historical index data, determining the relation between the data change rate and the service or task through time sequence decomposition, and predicting an index baseline of the second part of aperiodic historical index data based on the relation.

Further, fig. 11 shows a further block diagram of a training device according to an embodiment of the present invention, and in combination with fig. 10 and fig. 11, the training device may further include:

the rule engine 160 is configured to aggregate the index baseline of the service.

Optionally, fig. 12 shows still another block diagram of a training device provided in an embodiment of the present invention, and in conjunction with fig. 11 and fig. 12, the training device may further include:

the root cause analysis engine 170 is configured to scan an abnormal indicator corresponding to a service fault, determine a fluctuation condition corresponding to the abnormal indicator, where the abnormal indicator is an indicator that the indicator value does not conform to an indicator baseline of the service; and analyzing the fluctuation condition through a correlation analysis algorithm to obtain propagation map data corresponding to the abnormal index, wherein the propagation map data is used for representing the propagation relationship of the abnormal index.

Optionally, the root cause analysis engine 170 may also be configured to: cluster a plurality of propagation relationships of the abnormal indexes of the same fault of the service, and mine an anomaly detection rule for the fault root cause of the service.

Optionally, in an embodiment of the present invention, the rule engine 160 may be further configured to aggregate the anomaly detection rule.

The modules in the training device provided by the embodiment of the invention can be realized by hardware or programs; in an alternative implementation, the training device provided in the embodiment of the present invention may be implemented by a program module, and the embodiment of the present invention may further provide a first server platform, where the first server platform may implement the training method provided in the embodiment of the present invention by loading the training device in a program form.

Optionally, the hardware structure of the first server platform may be as shown in fig. 13, including: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the invention, the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with each other through the communication bus 4;

optionally, the processor 1 may be a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an NPU (embedded neural network processor), an FPGA (Field Programmable Gate Array), a TPU (Tensor Processing Unit), an AI chip, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention, or the like.

The memory 3 may comprise a high-speed RAM memory or may further comprise a non-volatile memory, such as at least one disk memory.

The memory 3 stores a program, and the processor 1 calls the program stored in the memory 3 to implement the training method provided by the embodiment of the present invention.

The embodiment of the invention also provides a storage medium which can store a program for executing the training method provided by the embodiment of the invention.

The following describes a fault prediction device provided by an embodiment of the present invention, and the fault prediction device described below may be considered as a functional module that is required to be set to implement the fault prediction method provided by the embodiment of the present invention. The contents of the failure prediction apparatus described below may be referred to in correspondence with the contents of the method described above.

Optionally, fig. 14 shows a block diagram of a fault prediction device provided by an embodiment of the present invention, and referring to fig. 14, the fault prediction device may include:

the real-time data acquisition module 200 is used for acquiring real-time index data of the service;

an anomaly index determining module 210, configured to determine an anomaly index set in the real-time index data according to a pre-predicted index baseline of the service; the index baseline is a data limit value reflecting index abnormality;

the fault type prediction module 220 is configured to invoke a pre-trained fault prediction model corresponding to the service, and predict a target fault type corresponding to the abnormal index set; the fault prediction model is obtained by training marked index data based on the process of simulating the faults of the service, and the marked index data is used for reflecting the occurrence of the faults of the service.

Alternatively, the real-time data acquisition module 200, the anomaly index determination module 210 and the fault type prediction module 220 may be integrated into a real-time stream computation engine; the embodiment of the present invention may set a predictor in the real-time stream calculation engine to integrate the real-time data acquisition module 200, the anomaly index determination module 210 and the fault type prediction module 220.

Optionally, the fault type prediction module 220 is configured to invoke a pre-trained fault prediction model corresponding to the service to predict a target fault type corresponding to the abnormal indicator set, and may specifically include:

invoking a plurality of fault prediction models corresponding to the service, each predicting a fault type for the abnormal index set, and taking the fault type with the highest confidence among the prediction results as the target fault type; wherein each fault of the service corresponds to one fault prediction model.
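As a non-limiting illustration, the selection among a plurality of fault prediction models may be sketched as follows. The model interface (a callable returning a confidence between 0 and 1), the toy overlap-based scorer, and all fault and index names are assumptions made for illustration only; the embodiment does not prescribe them.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical model interface: given an abnormal index set, return a
# confidence in [0, 1] that the service is experiencing "its" fault.
FaultModel = Callable[[FrozenSet[str]], float]


def predict_target_fault(
    models: Dict[str, FaultModel], abnormal: FrozenSet[str]
) -> Tuple[str, float]:
    """Run every per-fault model and keep the fault with the highest confidence."""
    if not models:
        raise ValueError("no fault prediction models registered for this service")
    scores = {fault: model(abnormal) for fault, model in models.items()}
    target = max(scores, key=scores.get)
    return target, scores[target]


def overlap_model(signature: FrozenSet[str]) -> FaultModel:
    """Toy stand-in for a trained model: Jaccard overlap with a fixed signature."""
    def score(abnormal: FrozenSet[str]) -> float:
        union = signature | abnormal
        return len(signature & abnormal) / len(union) if union else 0.0
    return score


# One model per fault of the service (names are illustrative assumptions).
models = {
    "cpu_saturation": overlap_model(frozenset({"cpu_util", "load_avg"})),
    "network_delay": overlap_model(frozenset({"rt_p99", "tcp_retrans"})),
}
target, confidence = predict_target_fault(
    models, frozenset({"rt_p99", "tcp_retrans", "error_rate"})
)
```

In this sketch each model is consulted independently, which matches the one-model-per-fault arrangement; a real model would be a trained classifier rather than a set-overlap score.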

Optionally, fig. 15 shows another block diagram of a fault prediction device provided by an embodiment of the present invention, and in combination with fig. 14 and fig. 15, the fault prediction device may further include:

the decision maker 230 is configured to initiate a corresponding fault decision behavior based on the predicted target fault type.
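As a non-limiting illustration, the decision maker may be sketched as a lookup from the predicted target fault type to a decision behavior. The behavior names, the fallback behavior, and the `issued` log are hypothetical and serve only to make the sketch observable; the embodiment does not specify which behaviors are initiated.

```python
from typing import Callable, Dict, List

issued: List[str] = []  # records the behaviors initiated, for inspection only


def make_action(name: str) -> Callable[[str], None]:
    """Build a placeholder decision behavior that only logs what it would do."""
    def action(service: str) -> None:
        issued.append(f"{name}:{service}")
    return action


# Hypothetical mapping from fault type to decision behavior.
DECISION_TABLE: Dict[str, Callable[[str], None]] = {
    "network_delay": make_action("switch_traffic"),
    "cpu_saturation": make_action("scale_out"),
}
DEFAULT_ACTION = make_action("alert_on_call")  # assumed fallback for unmapped faults


def decide(service: str, target_fault_type: str) -> None:
    """Initiate the decision behavior corresponding to the predicted fault type."""
    DECISION_TABLE.get(target_fault_type, DEFAULT_ACTION)(service)


decide("payment", "network_delay")
decide("payment", "disk_full")
```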

Optionally, the anomaly index determining module 210 is configured to determine, according to a pre-predicted index baseline of the service, an anomaly index set in the real-time index data, and may specifically include:

retrieving the index baseline of the service recorded in a rule engine;

and determining, according to the retrieved index baseline, the set of abnormal indexes in the real-time index data whose index values do not conform to the index baseline.
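As a non-limiting illustration, the determination of the abnormal index set may be sketched as follows. The representation of the rule engine as an in-memory dictionary, the modelling of the baseline as a pair of lower and upper limits per index, and the service and index names are all assumptions; the embodiment only states that the baseline is a data limit value reflecting index abnormality.

```python
from typing import Dict, Set, Tuple

# Hypothetical rule-engine record: service -> {index name: (lower, upper) limits}.
RULE_ENGINE_BASELINES: Dict[str, Dict[str, Tuple[float, float]]] = {
    "payment": {
        "rt_ms": (0.0, 300.0),
        "error_rate": (0.0, 0.01),
        "qps": (50.0, 5000.0),
    }
}


def abnormal_index_set(service: str, realtime: Dict[str, float]) -> Set[str]:
    """Return the names of indexes whose real-time value violates the baseline."""
    baseline = RULE_ENGINE_BASELINES[service]
    abnormal: Set[str] = set()
    for name, value in realtime.items():
        limits = baseline.get(name)
        if limits is None:
            continue  # no baseline recorded for this index, so it cannot be judged
        low, high = limits
        if not (low <= value <= high):
            abnormal.add(name)
    return abnormal


sample = {"rt_ms": 820.0, "error_rate": 0.002, "qps": 12.0, "gc_pause_ms": 40.0}
abnormal = abnormal_index_set("payment", sample)
```

Here `rt_ms` exceeds its upper limit and `qps` falls below its lower limit, so both enter the abnormal index set, while `gc_pause_ms` is skipped because no baseline is recorded for it.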

Optionally, in an embodiment of the present invention, the predictor in the real-time stream computation engine may be further configured to: retrieve the anomaly detection rule, recorded in a rule engine, that corresponds to the service under the target fault type, and analyze the root cause corresponding to the abnormal index set according to the anomaly detection rule.
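As a non-limiting illustration, root cause analysis against recorded anomaly detection rules may be sketched as follows. The rule structure (a set of indexes that must all be abnormal, paired with a root cause label), the tie-breaking by rule specificity, and all names are assumptions made for illustration; the embodiment does not fix the form of the rules.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class DetectionRule:
    root_cause: str
    required: FrozenSet[str]  # indexes that must all be abnormal for the rule to match


# Hypothetical rule-engine record keyed by (service, fault type).
RULES: Dict[Tuple[str, str], List[DetectionRule]] = {
    ("payment", "network_delay"): [
        DetectionRule("downstream_db_link", frozenset({"rt_ms", "db_conn_wait"})),
        DetectionRule("gateway_congestion", frozenset({"rt_ms", "tcp_retrans"})),
    ]
}


def analyze_root_cause(
    service: str, fault_type: str, abnormal: FrozenSet[str]
) -> Optional[str]:
    """Return the root cause of the most specific matching rule, or None."""
    candidates = [
        rule
        for rule in RULES.get((service, fault_type), [])
        if rule.required <= abnormal  # every required index is abnormal
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda rule: len(rule.required)).root_cause


cause = analyze_root_cause(
    "payment", "network_delay", frozenset({"rt_ms", "tcp_retrans", "error_rate"})
)
```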

The modules in the fault prediction device provided by the embodiment of the invention may be implemented in hardware or as programs. In an optional implementation, the fault prediction device may be implemented as program modules, and the embodiment of the present invention may further provide a second server platform, which may implement the fault prediction method provided by the embodiment of the present invention by loading the fault prediction device in program form.

Optionally, the structure of the second server platform may be as shown in fig. 13, including: at least one memory and at least one processor, wherein the memory stores a program, and the processor invokes the program to execute the fault prediction method provided by the embodiment of the invention.

The embodiment of the invention also provides a storage medium, which can store a program for executing the fault prediction method provided by the embodiment of the invention.

Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of protection of the invention shall therefore be determined by the appended claims.

Claims

1. A training method, comprising:

selecting a target service to be simulated and a target fault;

2. The training method of claim 1, wherein the invoking the service template corresponding to the target service, simulating the target service comprises:

the calling the fault template corresponding to the target fault, and simulating the target fault comprises the following steps:

3. The training method according to claim 1 or 2, wherein training to obtain the fault prediction model corresponding to the target service under the target fault according to the marked index data comprises:

4. The training method according to claim 1, wherein before training, according to the marked index data, to obtain the fault prediction model corresponding to the target service under the target fault, the method further comprises:

preprocessing the marked index data.

5. The training method of claim 1, further comprising:

aggregating, in a model engine, the fault prediction models corresponding to the respective services under the respective faults.

6. The training method of claim 1, further comprising:

determining an index baseline of the service according to historical index data of the service, wherein the index baseline is a data limit value reflecting index abnormality.

7. The training method of claim 6, wherein determining the index baseline of the service according to the historical index data of the service comprises:

8. The training method of claim 6 or 7, further comprising:

aggregating the index baselines of the service in a rule engine.

9. The training method of claim 6, further comprising:

scanning the abnormal indexes corresponding to a fault of the service, and determining the fluctuation of each abnormal index; wherein an abnormal index is an index whose index value does not conform to the index baseline of the service;

and analyzing the fluctuation through a correlation analysis algorithm to obtain propagation graph data corresponding to the abnormal indexes, wherein the propagation graph data represents the propagation relationship among the abnormal indexes.

10. The training method of claim 9, further comprising:

clustering a plurality of propagation relationships of the abnormal indexes under the same fault of the service, and mining an anomaly detection rule for the root cause of the fault of the service.

11. The training method of claim 10, further comprising:

aggregating the anomaly detection rules in a rule engine.

12. A fault prediction method, comprising:

acquiring real-time index data of a service;

invoking a pre-trained fault prediction model corresponding to the service to predict a target fault type corresponding to the abnormal index set; the fault prediction model is obtained by training on index data marked during the process of simulating faults of the service, and the marked index data reflects the onset of a fault of the service; wherein the fault prediction model is trained in accordance with the training method of any one of claims 1-11.

13. The fault prediction method of claim 12, wherein invoking the pre-trained fault prediction model corresponding to the service to predict the target fault type corresponding to the abnormal index set comprises:

invoking a plurality of fault prediction models corresponding to the service, each predicting a fault type for the abnormal index set, and taking the fault type with the highest confidence among the prediction results as the target fault type; wherein each fault of the service has a corresponding fault prediction model.

14. The fault prediction method of claim 12 or 13, further comprising:

initiating a corresponding fault decision behavior based on the predicted target fault type.

15. The fault prediction method of claim 12, wherein determining the abnormal index set in the real-time index data according to the pre-predicted index baseline of the service comprises:

retrieving the index baseline of the service recorded in a rule engine;

16. The fault prediction method according to claim 12, further comprising:

retrieving the anomaly detection rule, recorded in a rule engine, that corresponds to the service under the target fault type, and analyzing the root cause corresponding to the abnormal index set according to the anomaly detection rule.

17. A training device, comprising:

18. A fault prediction device, comprising:

a fault type prediction module, configured to invoke a pre-trained fault prediction model corresponding to the service and predict a target fault type corresponding to the abnormal index set; the fault prediction model is obtained by training on index data marked during the process of simulating faults of the service, and the marked index data reflects the onset of a fault of the service; wherein the fault prediction model is trained in accordance with the training method of any one of claims 1-11.

19. A first server platform, comprising: at least one memory storing a program and at least one processor invoking the program to perform the training method of any of claims 1-11.

20. A second server platform, comprising: at least one memory storing a program and at least one processor invoking the program to perform the fault prediction method of any of claims 12-16.

21. A storage medium storing a program for executing the training method according to any one of claims 1 to 11, or storing a program for executing the fault prediction method according to any one of claims 12 to 16.