CN117034143A

CN117034143A - Distributed system fault diagnosis method and device based on machine learning

Info

Publication number: CN117034143A
Application number: CN202311303999.2A
Authority: CN
Inventors: 徐小龙; 刘畅; 周鑫; 张继杰; 王林
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-10-10
Filing date: 2023-10-10
Publication date: 2023-11-10
Anticipated expiration: 2043-10-10
Also published as: CN117034143B

Abstract

The invention discloses a machine-learning-based fault diagnosis method and device for distributed systems, relating to the technical field of intelligent operation and maintenance. The method comprises the following steps: preprocessing historical system fault data; constructing a filling model that infers the missing values in the historical system fault data; fitting the filled data with a preprocessing model based on a shallow neural network to obtain deep feature embeddings among the features; combining the feature embeddings with the original features to obtain an extended data set; training three base learners on the extended data set, taking the probability outputs of the base learners as meta-features that serve as the input of a logistic regression meta-learner, and training the meta-learner to obtain the diagnosis model; and obtaining the meta-features of a new data sample in the same way and inputting them into the diagnosis model for prediction. The invention can improve the efficiency of intelligent operation and maintenance of distributed systems, reduce the human resources required, and save cost.

Description

Distributed system fault diagnosis method and device based on machine learning

Technical Field

The invention relates to the technical field of intelligent operation and maintenance, in particular to a distributed system fault diagnosis method and device based on machine learning.

Background

In the big data era, distributed systems have become the dominant systems for information storage and processing. Compared with traditional systems, distributed systems are larger and more complex, faults are more likely on average, and operation and maintenance is considerably more difficult. When a node in a distributed system fails, the fault can propagate along the topology of the system, causing anomalies in the KPI indexes of the node itself and of its adjacent nodes as well as a large number of abnormal logs, which affects the overall operation of the system. When the system fails, operation and maintenance staff must identify the failed node in the shortest possible time, determine the fault type of that node, and then take corresponding measures. If a large amount of fault alarm information arrives at the same time, handling the faults consumes substantial human resources, is inefficient, and may even lead to the same fault information being processed repeatedly. It is therefore necessary to design an automated fault diagnosis technique for distributed systems. Traditional fault diagnosis methods generally judge whether the system is in a normal state according to its characteristics, parameters, or state information, and then use some means to discriminate the type of fault. In recent years, with the rise of intelligent operation and maintenance, artificial intelligence has been used to assist or even partially replace human decision-making, which can further improve the quality and efficiency of operation and maintenance.

Typical fault diagnosis methods include signal-processing-based methods, knowledge-based methods, and model-based methods. Signal-processing-based methods diagnose faults by processing certain information and extracting signal features, generally analyzing the relevant signals with tools such as spectrum analysis, correlation functions, and the wavelet transform to judge whether a fault has occurred. This approach reduces the dependence on a mathematical model but lacks the ability to diagnose some latent faults. Model-based methods usually require an accurate mathematical model; a fault is then judged to have occurred when the detection residual (the difference between the estimate of the system output that the fault detection unit generates from information about the normally operating system and the actual output value) reaches a set threshold. Although such methods need no extra hardware to implement the fault diagnosis algorithm, obtaining an accurate mathematical model is very difficult. Knowledge-based methods generally depend on an expert system, which can only diagnose within the scope of its existing knowledge and also suffers from difficulty in acquiring new knowledge and from a fixed reasoning mode, which easily leads to errors. For a typical neural network, it is difficult to collect a data set containing a large amount of associated fault information, and the features of the fault information data are difficult to determine. In addition, fault types may be unevenly distributed in the fault information, which produces a data imbalance problem, causes the model to overfit, and degrades the final result.

Disclosure of Invention

The present invention has been made in view of the above-described problems.

Therefore, the technical problems solved by the invention are as follows: how to improve the efficiency of system operation and maintenance and reduce the loss caused by system faults.

In order to solve the technical problems, the invention provides the following technical scheme:

in a first aspect, an embodiment of the present invention provides a distributed system fault diagnosis method based on machine learning, including:

collecting state sample data of the distributed system in a historical time interval, and preprocessing the data;

classifying fault types of the preprocessed sample data set, counting the fault types in the sample data set, and constructing a system fault type knowledge base;

fitting the data set by using a multi-layer perceptron preprocessing model, and extracting deep feature embedding among fault features; splicing the sample data sets with the embedded and preprocessed deep features together to obtain an extended data set;

taking the characteristics in the extended data set as the input of a base learner, taking the labels in the extended data set as the targets of the base learner, training the base learner, storing the base learner and the parameters thereof, and predicting the characteristics in the extended data set by using the base learner to obtain the occurrence probability of each fault type;

taking the probability of each fault type as the meta-feature input of a meta-learner and the labels in the extended data set as the output of the meta-learner, training the meta-learner, and storing the meta-learner and its parameters to obtain a diagnosis model whose input is the feature representation and whose output is the final fault type probability;

acquiring a distributed system state information sample data set to be subjected to fault diagnosis, inputting state information of nodes of each sample in the data set into a preprocessing model based on a multi-layer perceptron to generate deep feature embedding, merging the deep feature embedding with original features to obtain an expanded data set, and inputting the expanded data set into a base learner to obtain meta features of a meta learner;

inputting the meta-characteristics into a diagnosis model, comparing the meta-characteristics with fault types in a system fault type knowledge base to obtain a probability set of which fault type each sample finally belongs to, and determining the fault type of the fault node according to the confidence degree given by the diagnosis model.

As a preferred embodiment of the machine learning-based distributed system fault diagnosis method, wherein:

collecting state sample data of the distributed system in the historical time interval, and preprocessing the data, wherein the collecting state sample data comprises the following steps:

Reading in the characteristic columns of all the sample data, extracting the tag columns in the sample data at the same time, and deleting the repeated data according to the repeated condition of the sample data;

acquiring a characteristic column set containing missing data, and sequencing according to the missing quantity in each characteristic column;

carrying out standardized processing on the data in the feature column, and eliminating the numerical problem of the feature column;

selecting the feature columns with the least missing quantity as the current prediction targets, and filling 0 for the missing values of the rest feature columns;

taking a sample without a missing value in the current feature column as a training set, taking a sample containing the missing value as a test set, training a filling model, storing the filling model and corresponding parameters, and taking the prediction of the filling model on the test set as a filling value;

and selecting the feature columns to be filled according to the missing quantity, repeating filling until filling of all missing values is completed, and finally returning to the filled sample set.

the classifying the fault types of the preprocessed sample data set, counting the fault types in the sample data set, and constructing a system fault type knowledge base comprises the following steps:

reading a tag column in sample data according to the filled sample set;

If the label column is a classification variable, the label column is digitized by utilizing integer mapping, otherwise, the label column is not processed;

and taking the unique value of the object in the tag column, classifying the unique value, counting the occurrence times of each fault category, calculating the occurrence frequency of the fault category, and constructing a system fault category knowledge base.

fitting the data set by using a multi-layer perceptron preprocessing model, and extracting deep feature embedding among fault features comprises the following steps:

reading in a characteristic column of sample data and a corresponding tag column according to the filled sample set, taking the characteristic column as input, and taking the tag column as output, and training a shallow neural network;

and predicting the feature columns by using the shallow neural network to obtain deep feature embedding among the features.

taking the features in the extended data set as the input of the base learner, taking the labels in the extended data set as the targets of the base learner, training the base learner, storing the base learner and its parameters, and predicting the features in the extended data set with the base learner to obtain the probability of each fault type comprises the following steps:

obtaining the fault type probabilities 3 times, once for each base learner, to obtain three base learners and the fault type probabilities that each outputs on the test set;

splicing the fault type probabilities output by the three base learners to obtain the meta-features;

wherein obtaining the fault type probabilities comprises: randomly dividing the features in the extended data set and the corresponding label column into 5 mutually disjoint subsets, evaluating the base learner by stratified cross-validation, and obtaining the probability of each fault type output by the base learner on the validation set;

repeating the evaluation k times to obtain k base learner models together with the probability of each fault type output by the k base learners, wherein the base learners weight the weak learners through the following formula:

$$F_m(x) = F_{m-1}(x) + \arg\min_{h \in H} \sum_{i=1}^{n} L\big(y_i,\ F_{m-1}(x_i) + h(x_i)\big)$$

where $h \in H$ represents a weak learner, $F_m(x)$ represents the final learner, and $L$ represents the loss; in the loss expression, $y_i$ represents the target value of the current optimization and $F_{m-1}(x_i) + h(x_i)$ represents the currently optimized model.
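The weighting of weak learners described above can be illustrated with a minimal sketch in which each round adds the weak learner that minimizes the loss of the current model plus that learner. The sketch assumes a squared-error loss and one-split regression stumps on a single feature; both choices, and the names `fit_stump` and `boost`, are hypothetical and are not taken from the patent, whose base learners are existing boosting implementations.

```python
def fit_stump(xs, residuals):
    """Return the one-split stump that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right) if right else 0.0
        loss = (sum((r - left_value) ** 2 for r in left)
                + sum((r - right_value) ** 2 for r in right))
        if best is None or loss < best[0]:
            best = (loss, threshold, left_value, right_value)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def boost(xs, ys, rounds=20):
    """Additive boosting: each round adds the stump that best fits the current error."""
    learners = []

    def model(x):
        # The current model is the sum of all weak learners added so far.
        return sum(h(x) for h in learners)

    for _ in range(rounds):
        # For squared error, minimizing L(y, F(x) + h(x)) over h is the same
        # as fitting h to the residuals y - F(x).
        residuals = [y - model(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return model


model = boost([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0])
print(model(1.5), model(3.5))
```

Practical boosting libraries add a learning rate, regularization, and multi-feature trees on top of this additive update; the sketch omits them for brevity.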

taking the probability of each fault type as the meta-feature input of the meta-learner and the labels in the extended data set as the output of the meta-learner, training the meta-learner, and storing the meta-learner and its parameters to obtain a diagnosis model whose input is the feature representation and whose output is the final fault type probability comprises the following steps:

reading in the meta-features and the label column corresponding to the meta-features; and training the meta-learner with the meta-features as input and the corresponding label column as output to obtain the diagnosis model that outputs the final fault type probability.

inputting the meta-features into a diagnosis model, comparing the meta-features with fault types in a system fault type knowledge base to obtain a probability set of which fault type each sample finally belongs to, and determining the fault type of a fault node according to the confidence degree given by the diagnosis model comprises the following steps:

inputting meta-features generated based on a distributed system state information sample data set to be subjected to fault diagnosis into a diagnosis model to obtain possible fault type probabilities of the sample;

comparing the possible fault types with fault types in a system fault type knowledge base, and removing corresponding samples if the types which do not exist in the system fault type knowledge base exist;

and after the comparison is completed, acquiring a final fault diagnosis type according to the confidence degree given by the diagnosis model.

In a second aspect, an embodiment of the present invention provides an apparatus adopting the machine learning-based distributed system fault diagnosis method according to any one of the embodiments of the present invention, including:

The preprocessing module is used for collecting state sample data of the distributed system in the historical time interval and preprocessing the data;

the knowledge base construction module is used for classifying fault types of the preprocessed sample data set, counting the fault types in the sample data set and constructing a system fault type knowledge base;

the extraction module is used for fitting the data set by using the multi-layer perceptron preprocessing model and extracting deep feature embedding among fault features; splicing the sample data sets with the embedded and preprocessed deep features together to obtain an extended data set;

the training module is used for taking the characteristics in the extended data set as the input of the base learner, taking the label in the extended data set as the target of the base learner, training the base learner, storing the base learner and the parameters thereof, and predicting the characteristics in the extended data set by using the base learner to obtain the occurrence probability of each fault type;

the diagnosis model generation module is used for inputting the probability of each fault type as the meta characteristic of the meta learner, taking the label in the extended data set as the output of the meta learner, training the meta learner, storing the meta learner and the parameters thereof to obtain a diagnosis model with the input as characteristic representation and the output as the final fault type probability;

The meta-feature generation module is used for acquiring a distributed system state information sample dataset for fault diagnosis, inputting the state information of the nodes of each sample in the dataset into a preprocessing model based on a multi-layer perceptron to generate deep feature embedding, merging the deep feature embedding with the original feature to obtain an expanded dataset, and inputting the expanded dataset into the base learner to obtain the meta-feature of the meta-learner;

the diagnosis module is used for inputting the meta-characteristics into the diagnosis model, comparing the meta-characteristics with fault types in the system fault type knowledge base to obtain a probability set of which fault type each sample finally belongs to, and determining the fault type of the fault node according to the confidence degree given by the diagnosis model.

In a third aspect, embodiments of the present invention provide a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to implement a machine learning based distributed system fault diagnosis method according to any of the embodiments of the present invention.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the machine-learning-based distributed system fault diagnosis method.

The beneficial effects of the invention are as follows. Traditional fault diagnosis techniques are generally implemented with an expert system or a model-based method, usually require rich domain knowledge, and are tailored to the particular system they were designed for, while different systems may have different characteristics. By exploiting machine learning and the automatic feature engineering of a shallow neural network, the invention can analyze system state information directly to diagnose faults without specialist domain knowledge. Traditional fault diagnosis models are highly system-specific and their robustness still needs improvement; by using a multi-layer stacked ensemble algorithm, the invention effectively improves the robustness and generalization ability of the fault diagnosis model, so that it can handle the fault diagnosis problems of more complex real-world systems.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is an overall flow chart of a machine learning based distributed system fault diagnosis method according to a first embodiment of the present invention;

FIG. 2 is a diagram illustrating an example of node failure KPI anomalies for a machine learning based distributed system fault diagnosis method according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of a shallow neural network architecture for a machine learning based distributed system fault diagnosis method according to a first embodiment of the present invention;

FIG. 4 is a diagram of a meta-feature acquisition architecture of a machine learning based distributed system fault diagnosis method according to a first embodiment of the present invention;

FIG. 5 is a schematic diagram of confusion matrix on training set in a simulation example of a machine learning based distributed system fault diagnosis method according to a second embodiment of the present invention;

FIG. 6 is a schematic diagram of confusion matrix on a verification set in a simulation example of a machine learning based distributed system fault diagnosis method according to a second embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Example 1

Referring to fig. 1-4, a first embodiment of the present invention provides a distributed system fault diagnosis method based on machine learning, including:

S1: collecting state sample data of the distributed system in a historical time interval, and preprocessing the data;

S1.1: reading in the feature columns of all the sample data, extracting the label column of the sample data at the same time, and deleting duplicate samples;

S1.2: obtaining the set of feature columns that contain missing data, and sorting them by the number of missing values in each feature column;

S1.3: standardizing the data in the feature columns to eliminate differences in numerical scale between feature columns;

S1.4: selecting the feature column with the fewest missing values as the current prediction target, and filling the missing values of the remaining feature columns with 0;

S1.5: taking the samples without a missing value in the current feature column as the training set and the samples with a missing value as the test set, training a filling model, which is a multi-layer perceptron regression model (Multi-Layer Perceptron Regressor, MLPRegressor), storing the filling model and its parameters, and taking the model's predictions on the test set as the filling values;

S1.6: selecting the next feature column to be filled according to the number of missing values and repeating S1.5 until all missing values are filled, and finally returning the filled sample set.
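The column-by-column filling order of S1.2 and S1.4 to S1.6 can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented as `None`, the standardization of S1.3 is omitted, every column has at least one observed value, and `mean_fill_model` is a hypothetical stand-in for the MLPRegressor filling model (it simply predicts the mean of the observed targets). Any function with the same `(train_x, train_y, test_x)` interface could be plugged in instead.

```python
def mean_fill_model(train_x, train_y, test_x):
    # Hypothetical stand-in for the MLPRegressor filling model:
    # ignores the inputs and predicts the mean of the observed targets.
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_x)


def fill_missing(rows, fit_predict=mean_fill_model):
    """Fill None entries column by column, fewest missing values first."""
    data = [list(r) for r in rows]  # work on a copy of the samples
    n_cols = len(data[0])
    gaps = {c: sum(r[c] is None for r in data) for c in range(n_cols)}
    # S1.2: columns with missing data, sorted by number of missing values.
    order = sorted((c for c in gaps if gaps[c] > 0), key=gaps.get)
    for target in order:  # S1.4 and S1.6: one prediction target at a time
        others = [c for c in range(n_cols) if c != target]

        def inputs(row):
            # S1.4: missing values in the remaining columns are replaced by 0.
            return [0.0 if row[c] is None else row[c] for c in others]

        train = [r for r in data if r[target] is not None]
        test = [r for r in data if r[target] is None]
        # S1.5: fit on complete samples, predict the missing ones.
        preds = fit_predict([inputs(r) for r in train],
                            [r[target] for r in train],
                            [inputs(r) for r in test])
        for row, value in zip(test, preds):
            row[target] = value
    return data


samples = [[1.0, None, 3.0], [2.0, 4.0, None], [3.0, 6.0, None], [4.0, 8.0, 5.0]]
print(fill_missing(samples))
```

Because columns are filled in place, values filled in an earlier pass are available as inputs when a later column is predicted.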

S2: classifying the fault types of the preprocessed sample data set, counting the fault types in the sample data set, and constructing a system fault type knowledge base;

S2.1: reading the label column of the sample data from the sample set obtained in S1;

S2.2: if the label column is a categorical variable, converting it to numeric form by integer mapping; otherwise leaving it unprocessed;

S2.3: taking the unique values in the label column, classifying them, counting the number of occurrences of each fault category, calculating the frequency of each fault category, and constructing the system fault type knowledge base;
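As an illustration of S2.2 and S2.3, the sketch below maps categorical labels to integers and records the count and frequency of each fault category. The function name `build_knowledge_base`, the dictionary layout, and the example label strings are assumptions made for this example; the patent does not specify how the knowledge base is stored.

```python
from collections import Counter


def build_knowledge_base(labels):
    """Return (encoded labels, label-to-integer mapping, per-category statistics)."""
    categorical = any(not isinstance(v, (int, float)) for v in labels)
    if categorical:
        # S2.2: integer mapping for categorical labels (sorted for a stable order).
        mapping = {name: i for i, name in enumerate(sorted(set(labels), key=str))}
    else:
        # Numeric labels are left unprocessed.
        mapping = {v: v for v in set(labels)}
    encoded = [mapping[v] for v in labels]
    # S2.3: occurrences and frequency of each fault category.
    counts = Counter(encoded)
    total = len(encoded)
    knowledge_base = {code: {"count": n, "frequency": n / total}
                      for code, n in counts.items()}
    return encoded, mapping, knowledge_base


encoded, mapping, kb = build_knowledge_base(["disk", "net", "disk", "cpu"])
print(mapping, kb)
```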

S3: fitting the data set with a multi-layer perceptron preprocessing model, and extracting deep feature embeddings among the fault features; splicing the deep feature embeddings together with the preprocessed sample data set to obtain an extended data set;

S3.1: reading in the feature columns of the sample data and the corresponding label column from the sample set obtained in S1, training a shallow neural network with the feature columns as input and the label column as output, and storing the model and its parameters;

S3.2: running the feature columns through the model from S3.1 to obtain the deep feature embeddings among the features;

S3.3: combining the feature columns with the feature embeddings to obtain the new features;
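Steps S3.2 and S3.3 can be sketched as a forward pass that stops at the last hidden layer, followed by concatenation with the original features. The sketch assumes the network has already been trained (S3.1) and that its hidden layers use ReLU activations; the weights shown, the activation choice, and the function names are hypothetical and only serve to make the example concrete.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    # weights[j] holds the incoming weights of hidden unit j.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def last_hidden_embedding(x, hidden_layers):
    """S3.2: propagate x through the hidden layers and return the last activation."""
    h = list(x)
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    return h


def extend_features(x, hidden_layers):
    """S3.3: original features followed by the deep feature embedding."""
    return list(x) + last_hidden_embedding(x, hidden_layers)


# One hidden layer with two units; the weights are made up for illustration.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0])]
print(extend_features([2.0, 3.0], layers))
```

Applying `extend_features` to every sample yields the extended data set used by the base learners.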

S4: taking the features in the extended data set as the input of a base learner and the labels in the extended data set as the targets of the base learner, training the base learner, storing the base learner and its parameters, and predicting the features in the extended data set with the base learner to obtain the probability of each fault type;

S4.1: reading in the new features obtained in S3, randomly dividing the new features and the corresponding label column into 5 mutually disjoint subsets, evaluating the base learner by stratified cross-validation, and obtaining the probability of each fault type output by the base learner on the validation set;

S4.2: repeating S4.1 k times to obtain k base learner models together with the probability of each fault type output by the k base learners, wherein the base learners all use a Boosting-based method, which obtains a better result by weighting the weak learners through the following formula:

$$F_m(x) = F_{m-1}(x) + \arg\min_{h \in H} \sum_{i=1}^{n} L\big(y_i,\ F_{m-1}(x_i) + h(x_i)\big)$$

where $h \in H$ represents a weak learner, $F_m(x)$ represents the final learner, and $L$ represents the loss; in the loss expression, $y_i$ represents the target value of the current optimization and $F_{m-1}(x_i) + h(x_i)$ represents the currently optimized model; the weak learner $h$ that minimizes the loss is eventually obtained and added, with its weight, to the final learner $F_m(x)$;

S4.3: repeating S4.2 3 times, once for each base learner, to obtain three base learners and the fault type probabilities that each outputs on the test set;

S4.4: splicing the fault type probabilities output by the three base learners obtained in S4.3 to obtain the input of the final meta-learner, namely the meta-features.

It should be noted that, after the meta-feature is obtained in this step, the meta-feature is used as an input of the final model.
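The stratified 5-fold evaluation of S4.1 and the splicing of S4.4 can be sketched as below. To keep the example self-contained, `PriorLearner` is a hypothetical stand-in that only predicts smoothed class priors; in practice each factory would return one of the boosting-based base learners. The function names and the fixed random seed are likewise assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into disjoint folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


class PriorLearner:
    """Hypothetical stand-in for a boosting base learner: smoothed class priors."""

    def __init__(self, n_classes, smoothing):
        self.n_classes, self.smoothing = n_classes, smoothing

    def fit(self, X, y):
        counts = [self.smoothing] * self.n_classes
        for label in y:
            counts[label] += 1
        total = sum(counts)
        self.probs = [c / total for c in counts]
        return self

    def predict_proba(self, X):
        return [list(self.probs) for _ in X]


def out_of_fold_probabilities(make_learner, X, y, n_folds=5):
    """S4.1: each sample is scored by a model that did not see it in training."""
    oof = [None] * len(X)
    for held_out in stratified_folds(y, n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = make_learner().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(held_out, model.predict_proba([X[i] for i in held_out])):
            oof[i] = p
    return oof


def build_meta_features(learner_factories, X, y):
    """S4.4: splice the per-class probabilities of all base learners per sample."""
    per_learner = [out_of_fold_probabilities(f, X, y) for f in learner_factories]
    return [sum((probs[i] for probs in per_learner), []) for i in range(len(X))]


X = [[float(i)] for i in range(20)]
y = [i % 2 for i in range(20)]
factories = [lambda: PriorLearner(2, 1.0), lambda: PriorLearner(2, 2.0),
             lambda: PriorLearner(2, 0.5)]
meta = build_meta_features(factories, X, y)
print(len(meta), len(meta[0]))
```

With three base learners and C fault types, each meta-feature row has 3C entries, which is the input width of the meta-learner.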

S5: taking the probability of each fault type as the meta-feature input of a meta-learner and the labels in the extended data set as the output of the meta-learner, training the meta-learner, and storing the meta-learner and its parameters to obtain a diagnosis model whose input is the feature representation and whose output is the final fault type probability;

S5.1: reading in the meta-features and the label column corresponding to the meta-features;

S5.2: training a meta-learner with the meta-features as input and the corresponding label column as output, the invention using logistic regression (LR) as the meta-learner;

S5.3: storing the model obtained in S5.2 and its parameters, which constitute the final distributed system fault diagnosis model.
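A minimal sketch of the logistic regression meta-learner of S5.2 follows, written as multinomial (softmax) regression trained by batch gradient descent. The learning rate, epoch count, absence of regularization, and function names are assumptions of this sketch; a library implementation of logistic regression would normally be used instead.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, W, b):
    """Final fault type probabilities for one meta-feature vector."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bk
                    for row, bk in zip(W, b)])


def train_meta_learner(meta_X, y, n_classes, lr=0.5, epochs=500):
    """Fit softmax regression on meta-features with batch gradient descent."""
    n, n_features = len(meta_X), len(meta_X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(meta_X, y):
            p = predict_proba(x, W, b)
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to score k.
                err = p[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j in range(n_features):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(n_features):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


meta_X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, b = train_meta_learner(meta_X, labels, n_classes=2)
print(predict_proba([0.85, 0.15], W, b))
```

The stored pair `(W, b)` plays the role of the saved diagnosis model in S5.3.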

S6: acquiring a distributed system state information sample data set to be subjected to fault diagnosis, inputting state information of nodes of each sample in the data set into a preprocessing model based on a multi-layer perceptron to generate deep feature embedding, merging the deep feature embedding with original features to obtain an expanded data set, and inputting the expanded data set into a base learner to obtain meta features of a meta learner;

It should be noted that a new sample data set of distributed system state information for one day is obtained for fault diagnosis, where each sample stores the node it belongs to and that node's state information. Following the methods of S4 and S5, the state information of the node of each sample in the new data set is input into the multi-layer-perceptron-based preprocessing model to generate deep feature embeddings, the deep feature embeddings are merged with the original features to obtain an extended data set, and the extended data set is input into the base learners to obtain the meta-features for the meta-learner.

S7: inputting the meta-characteristics into a diagnosis model, comparing the meta-characteristics with fault types in a system fault type knowledge base to obtain a probability set of which fault type each sample finally belongs to, and determining the fault type of the fault node according to the confidence degree given by the diagnosis model.

S7.1: passing the current preprocessed data through each of the base learners obtained in S4 to obtain the prediction outputs of the three base learners, and splicing them together along the first dimension to obtain the meta-features of the current data;

S7.2: passing the meta-features of the current data through the distributed system fault diagnosis model obtained in S5 to obtain the probability the model assigns to each possible fault type for each sample;

S7.3: comparing the possible fault types with the fault types in the system fault type knowledge base, and removing the corresponding sample if its fault type does not exist in the knowledge base;

S7.4: after the comparison is completed, obtaining the final fault diagnosis type according to the confidence given by the model, and uploading all fault diagnosis results for this batch of system state information for further processing.
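Steps S7.2 to S7.4 amount to taking the most probable fault type per sample, checking it against the knowledge base, and reporting it with its confidence. The sketch below assumes the knowledge base is a set of known fault type names and that `index_to_fault` maps output positions to those names; both representations and the example names are hypothetical.

```python
def diagnose(probabilities, knowledge_base, index_to_fault):
    """Return (sample index, fault type, confidence) for samples with a known type."""
    results = []
    for sample_id, probs in enumerate(probabilities):
        # S7.4: the index with the highest probability gives the diagnosis.
        best = max(range(len(probs)), key=probs.__getitem__)
        fault = index_to_fault.get(best)
        if fault not in knowledge_base:
            # S7.3: drop samples whose predicted type is not in the knowledge base.
            continue
        results.append((sample_id, fault, probs[best]))
    return results


probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
names = {0: "cpu", 1: "disk", 2: "unlisted"}
print(diagnose(probs, {"cpu", "disk"}, names))
```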

It should be noted that, from the large amount of abnormal log information generated by nodes in a distributed system, the invention screens out useless and duplicate log information, fills in part of the missing information, and accurately diagnoses the fault type of abnormal nodes, which improves the operation and maintenance efficiency of the system and reduces the losses caused by system faults. Using machine learning to assist fault diagnosis of system nodes makes effective use of the large amount of abnormal log information generated under fault conditions and greatly reduces the time that redundant information adds to fault type diagnosis. At present, fault diagnosis methods for distributed systems are rare; they generally judge whether the system is in a normal state according to its characteristics, parameters, or state information, and then use some means to discriminate the fault category. Common fault diagnosis methods include signal-processing-based methods, knowledge-based methods, and model-based methods. The invention combines machine learning with knowledge-based fault diagnosis: the machine learning method automatically extracts the relationships within a large amount of abnormal log information, and fault diagnosis is then performed on the nodes suspected of being faulty.

When a node in the distributed system fails, the fault can propagate along the topology of the distributed system, causing anomalies in the KPI indexes of the node itself and of its adjacent nodes as well as a large number of abnormal logs. As shown in fig. 1, if fault 1 occurs at a node, related indexes such as the KPI indexes feature1, feature5, and feature15 produce abnormal values, while anomalies in KPI indexes such as feature1, feature5, feature15, and feature6 further indicate that fault 3 has occurred at a node. The distributed system state information log in a specific historical time interval is collected to obtain 10000 groups of sample data with state information and fault types, where the state information is the KPI index data at the time the system fails and the KPI indexes comprise 107 indexes (feature0, feature1, and so on).

Because system state information is inevitably lost during collection for mechanical or human reasons, and partial loss of system state information may destroy the correlations within the information, specific methods are used to fill the missing values. The invention trains a separate filling model for each type of missing system state information and uses it to fill the missing data. After denoising, the filled data is used to train a designed MLP model; the output of the last hidden layer of the MLP model is extracted as additional features and combined with the original data set to form an extended data set. Three base learners, namely XGBoost, CatBoost, and LightGBM, are then designed and each trained on the extended data set; the probabilities of each fault type output by the three base learners are obtained and spliced together as the meta-feature input of the final fault diagnosis model. Finally, an LR model is designed and trained on the meta-features to obtain a model that can judge the fault type of a system node from the system state information. For a new group of system state information samples, the meta-features of the samples are obtained after preprocessing steps such as missing-value filling and are input into the trained model to obtain a prediction result. From the prediction result output by the model, the index of the maximum probability output value along the specified dimension is taken and the fault type mapped to that index is looked up, giving the final fault diagnosis result, which is then passed on for further processing.

The above is a schematic scheme of the machine learning-based distributed system fault diagnosis method of the present embodiment. It should be noted that the technical solution of the machine learning-based distributed system fault diagnosis device and that of the machine learning-based distributed system fault diagnosis method belong to the same concept; for details of the device that are not described in this embodiment, refer to the description of the technical solution of the method.

The distributed system fault diagnosis apparatus based on machine learning in this embodiment is characterized by comprising:

The present embodiment also provides a computing device, which is suitable for the case of the distributed system fault diagnosis method based on machine learning, and includes:

a memory and a processor; the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to implement the distributed system fault diagnosis method based on machine learning as set forth in the above embodiment.

The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the distributed system fault diagnosis method based on machine learning as proposed in the above embodiment.

The storage medium proposed in the present embodiment belongs to the same inventive concept as the machine learning-based distributed system fault diagnosis method proposed in the above embodiment, and technical details not described in detail in the present embodiment can be seen in the above embodiment, and the present embodiment has the same advantageous effects as the above embodiment.

Example 2

Referring to fig. 5-6, for one embodiment of the present invention, a distributed system fault diagnosis method based on machine learning is provided, and in order to verify the beneficial effects of the present invention, scientific demonstration is performed through simulation experiments.

In order to facilitate the understanding of the technical solution of the present invention, some concepts are defined below:

Definition 1: Missing state information

After a node fails, some KPI index data related to the node changes accordingly, and some data is unavoidably missing when the state information is collected. So as not to destroy the interrelation between the index data, the missing data must be counted and filled.

According to this definition, the collected system state data is traversed to check whether each KPI index has missing values. If it does, the KPI index name, its index, and the number of missing values are recorded and stored. The specific implementation steps are as follows:

(1) If the KPI index data has missing values, store the missing information in the format (KPI index name, index, number of missing values).

(2) Traverse all KPI indexes, repeating (1), to record all KPI indexes containing missing data, and sort the KPI indexes by the number of missing values.

(3) Store the sorted indexes in n_index.
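Steps (1) to (3) can be sketched as follows. This is a minimal illustration that assumes the state data is held in a pandas DataFrame; the function name `missing_summary` and the demo data are hypothetical, and ascending order is assumed because the filling step later proceeds in order of increasing number of missing values.

```python
import numpy as np
import pandas as pd


def missing_summary(features: pd.DataFrame):
    """Return (KPI index name, column index, missing count) for every column with gaps."""
    records = []
    for idx, name in enumerate(features.columns):
        n_missing = int(features[name].isna().sum())
        if n_missing > 0:
            records.append((name, idx, n_missing))
    # Sort by the number of missing values, fewest first (assumed order).
    records.sort(key=lambda record: record[2])
    return records


# Hypothetical demo data: feature0 has two gaps, feature2 has one.
demo = pd.DataFrame(
    {
        "feature0": [1.0, np.nan, 3.0, np.nan],
        "feature1": [1.0, 2.0, 3.0, 4.0],
        "feature2": [np.nan, 2.0, 3.0, 4.0],
    }
)
n_index = missing_summary(demo)
```

Columns without missing values, such as feature1 here, do not appear in `n_index`.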

Definition 2: denoising process

The denoising process of the invention consists of filling in the missing state information in the collected system state information.

Definition 3: feature representation

To pass through the model and be diagnosed accurately, the system state information must be converted into a meaningful feature representation. The feature representation of the invention is an additional feature representation of the state information obtained with an MLP-based shallow neural network. The information of each layer of the shallow neural network is shown in table 1. By learning a nonlinear mapping with the ReLU activation function, the MLP transforms the original 107-dimensional system state features into a 16-dimensional low-dimensional feature representation that serves as the additional feature. The mapping function can be expressed as:

$Y = f_L(X_L) = X_L W_L + b_L, \qquad X_{l+1} = \sigma(X_l W_l + b_l), \quad l = 1, \dots, L-1$

where $Y$ represents the final output of the MLP; in the present invention, since the final fault category has 6 classes, the dimension of $Y$ is $N \times 6$, where $N$ is the number of input samples. $f_L$ is the mapping function of the $L$-th layer of the MLP; in the present invention, the $L$-th layer is the output layer. $X_L$ is the input of the $L$-th layer, that is, the output of the preceding hidden layer, with dimension $N \times d$, where $d$ is the number of neurons of the last hidden layer; in the present invention, $d$ is 16. $W_L$ is the weight of the $L$-th layer, with dimension $d \times 6$; $b_L$ is a bias term; $\sigma$ is the activation function, for which the ReLU activation function is used in the present invention. In the present invention, the input $X_L$ of the final output layer is taken out as an additional feature and spliced onto the original features as an extended input, finally giving the feature representation used by the base learners.

TABLE 1 details of each layer of shallow neural network

Definition 4: meta-features

After the feature representation finally used for training the base learners is obtained, three base learners based on ensemble algorithms, namely XGBoost, LightGBM, and CatBoost, are trained. Taking the given features as input, each base learner outputs the probability of each fault type, and the probability outputs of the three base learners are spliced together as the meta-features.

Definition 5: node fault class knowledge base

When a node in the distributed system fails, the fault can propagate along the topology of the distributed system, causing anomalies in the KPI indexes of the node itself and of its adjacent nodes, as well as a large number of log anomalies. The system state information that occurs sufficiently often in the training set is counted, the state information of the same fault type is grouped together, and the number of occurrences is counted, thereby forming the system node fault category knowledge base shown in table 2.

Table 2 System failure class knowledge base sample exemplary Table

After the set of fault nodes in one day is found by using a machine learning method, the fault category information of all the fault nodes is compared with a knowledge base, and if the fault category information does not exist in the knowledge base, the fault category information is directly screened out.

Definition 6: failure category frequency

After the node fault class knowledge base is constructed, the number of occurrences of each class of fault is counted, and the frequency of a given fault class is calculated using formula (1):

$f_i = \frac{n_i}{N}$ (1)

where $f_i$ is the fault class frequency of class $i$, $n_i$ is the number of occurrences of the class $i$ fault, and $N$ is the total number of fault nodes.

With the above method, the fault type of a system node can be determined by machine learning: the KPI index data related to the system node is analyzed to estimate the possible fault types of the node, and the most probable fault type is given.

The invention takes state information samples of distributed system topology nodes as an example to determine the fault type of the node to which newly given system state information belongs. The specific operation steps of the embodiment of the invention are as follows:

step 1: state sample data of the distributed system in a specific time interval of the history is collected, repeated state information deletion is carried out on the samples, a filling model is constructed, and data preprocessing such as filling of missing values is carried out. The preprocessing of data of a certain day is specifically described as follows:

(1) Read in the feature columns of all sample data (107 columns in the invention, all KPI index data) and store them in the DataFrame train_feature; extract the label column of the sample data and store it in train_label; then update train_feature and train_label according to the duplicated data found in train_feature;

(2) If the KPI index data has missing values, store the missing information in the format (KPI index name, index, number of missing values);

(3) Traverse all KPI indexes, repeating (2), to record all KPI indexes containing missing data, sort the indexes by the number of missing values, and store the sorted indexes in n_index;

(4) Normalize train_feature using StandardScaler, then traverse n_index to determine the feature column currently to be filled;

(5) Take the samples without missing values in the current feature column as the training set and the samples with missing values as the test set, while filling the missing values in the remaining feature columns with 0 (after normalization, filling with 0 has the least influence). Train a filling model (the invention uses MLPRegressor), take the model's predictions on the test set as the final filling values, and save the filling model and its related parameters. Traverse all values in n_index, repeating this in order of increasing number of missing values, until all missing values are filled, and finally return the filled sample set.
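Sub-steps (3) to (5) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the function name `fill_missing`, the hidden layer size `(32,)`, `max_iter=300`, and the synthetic demo data are all hypothetical choices. It relies on scikit-learn's `StandardScaler` ignoring NaN when fitting and preserving it when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def fill_missing(train_feature: pd.DataFrame, seed: int = 0):
    """Fill missing values column by column, one MLPRegressor per column."""
    scaler = StandardScaler()
    scaled = pd.DataFrame(
        scaler.fit_transform(train_feature),
        columns=train_feature.columns,
        index=train_feature.index,
    )
    # n_index: columns that contain missing values, fewest missing first.
    n_missing = scaled.isna().sum()
    n_index = n_missing[n_missing > 0].sort_values(kind="stable").index.tolist()

    fill_models = {}
    for name in n_index:
        is_missing = scaled[name].isna()
        # Remaining columns, with their own gaps filled by 0 (the mean after scaling).
        others = scaled.drop(columns=name).fillna(0.0)
        model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
        model.fit(others[~is_missing], scaled.loc[~is_missing, name])
        # The predictions on the rows with gaps become the filling values.
        scaled.loc[is_missing, name] = model.predict(others[is_missing])
        fill_models[name] = model
    return scaled, scaler, fill_models


# Hypothetical demo data with gaps in two columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"feature{i}" for i in range(5)])
demo.iloc[:10, 1] = np.nan
demo.iloc[5:10, 3] = np.nan
filled, scaler, fill_models = fill_missing(demo)
```

Keeping `scaler` and `fill_models` allows the same transformation to be replayed on newly collected samples.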

Step 2: After the preprocessed complete sample data set is obtained, read its KPI index columns to obtain the feature set train_feature; read its tag column to obtain train_label, namely the fault type information; then extract the unique values in train_label to obtain target_label, namely all fault types; then count the number of samples of each fault type, calculate the frequency of each fault type according to formula (1), and construct the system fault type knowledge base.
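The counting in step 2 can be illustrated with a short sketch. The function name `build_knowledge_base`, the column names, and the label values below are hypothetical; the frequency column is the count of each fault type divided by the total number of samples.

```python
import pandas as pd


def build_knowledge_base(train_label: pd.Series) -> pd.DataFrame:
    """Count the samples of each fault type and compute each type's frequency."""
    counts = train_label.value_counts().sort_index()
    kb = pd.DataFrame({"fault_type": counts.index, "count": counts.values})
    kb["frequency"] = kb["count"] / kb["count"].sum()
    return kb


# Hypothetical labels; fault types 3 and 4 never occur in this toy set.
train_label = pd.Series([0, 1, 1, 2, 2, 2, 5, 5])
target_label = sorted(train_label.unique())
knowledge_base = build_knowledge_base(train_label)
```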

Step 3: Input train_feature into the MLP-based preprocessing model and fit the data. Specifically, each sample in train_feature is passed through an MLP with hidden layer sizes (64, 32, 16); the architecture of the MLP is shown in fig. 3. For each sample, the MLP produces a 1×6 logits output, and the cross-entropy loss between this output and the sample's label in train_label is used to train the MLP model. Finally, the output of the hidden layer of size 16 is extracted as the deep relationship representation between the KPI indexes.
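A minimal sketch of step 3 follows. It uses scikit-learn's `MLPClassifier` as a stand-in, which is an assumption: the source does not name the framework. `MLPClassifier` trains with a cross-entropy (log-loss) objective, and its fitted `coefs_` and `intercepts_` allow the last hidden activation to be recomputed by hand. The helper name `last_hidden_output`, `max_iter=50`, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def last_hidden_output(mlp: MLPClassifier, X: np.ndarray) -> np.ndarray:
    """Forward X through the hidden layers only and return the last hidden activation."""
    h = np.asarray(X, dtype=float)
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layer
    return h


rng = np.random.default_rng(0)
train_feature = rng.normal(size=(300, 107))  # synthetic stand-in for 107 KPI columns
train_label = np.arange(300) % 6             # synthetic labels covering 6 fault types

mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16), activation="relu",
                    max_iter=50, random_state=0)
mlp.fit(train_feature, train_label)

# 16-dimensional deep feature for every sample.
hidden_feature = last_hidden_output(mlp, train_feature)
```

With so few iterations the fit may emit a convergence warning; the sketch only demonstrates how the hidden representation is read out.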

Step 4: Splice the hidden layer output with the initial KPI index data train_feature to obtain the expanded feature set, which serves as the feature input for training the base learners.
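The splicing in step 4 amounts to a column-wise concatenation, sketched below with a hypothetical helper `expand_features` and placeholder arrays. With 107 KPI columns and a 16-dimensional hidden output, each expanded sample has 123 features.

```python
import numpy as np


def expand_features(train_feature, hidden_feature) -> np.ndarray:
    """Append the hidden layer output to the original KPI features, column-wise."""
    train_feature = np.asarray(train_feature, dtype=float)
    hidden_feature = np.asarray(hidden_feature, dtype=float)
    if train_feature.shape[0] != hidden_feature.shape[0]:
        raise ValueError("both inputs must describe the same samples")
    return np.hstack([train_feature, hidden_feature])


# Placeholder arrays standing in for real KPI data and hidden layer output.
expanded_feature = expand_features(np.zeros((4, 107)), np.ones((4, 16)))
```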

Step 5: Take the expanded feature set as the input of the base learners and train_label as their target, train the models, and save the models and their parameters. The trained models are then applied to the features of the input data set to obtain the probability representation of the output labels, which serves as the meta-feature input of the meta-learner. The specific method is as follows:

(1) Randomly divide the expanded feature set and the corresponding train_label into 5 mutually disjoint subsets; use 4 of the subsets as the training set and the remaining subset as the verification set to train the base learner (the base learners in the invention are LightGBM, XGBoost, and CatBoost); repeat this process 5 times, and store the probability output of the base learner on the verification sets as output_prob_lgb;

(2) Train the remaining base learners XGBoost and CatBoost in turn in the same way to obtain their outputs output_prob_xgb and output_prob_cat;

(3) Splice the obtained output_prob_lgb, output_prob_xgb, and output_prob_cat to obtain the meta-features.
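The 5-fold procedure above can be sketched as an out-of-fold prediction loop. To keep the example dependent on scikit-learn only, three scikit-learn ensemble classifiers are used as stand-ins for LightGBM, XGBoost, and CatBoost; this substitution is an assumption of the sketch, and `LGBMClassifier`, `XGBClassifier`, and `CatBoostClassifier` expose the same `fit` and `predict_proba` interface. The helper name `out_of_fold_prob` and the synthetic data are hypothetical, and labels are assumed to be the integers 0 to 5.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import KFold


def out_of_fold_prob(model, X, y, n_classes, n_splits=5, seed=0):
    """Train on 4 folds, predict the held-out fold, and repeat for every fold."""
    oof = np.zeros((len(X), n_classes))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        # Integer labels double as column positions of the probability matrix.
        oof[np.ix_(val_idx, fold_model.classes_)] = fold_model.predict_proba(X[val_idx])
    return oof


rng = np.random.default_rng(0)
expanded_feature = rng.normal(size=(180, 12))  # synthetic stand-in for the expanded set
train_label = np.arange(180) % 6

# Stand-ins for LightGBM, XGBoost, and CatBoost (assumption of this sketch).
base_learners = [
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    RandomForestClassifier(n_estimators=20, random_state=0),
    ExtraTreesClassifier(n_estimators=20, random_state=0),
]
output_probs = [out_of_fold_prob(m, expanded_feature, train_label, n_classes=6)
                for m in base_learners]
meta_feature = np.hstack(output_probs)  # 3 learners x 6 classes = 18 columns
```

Because every sample is predicted by a model that never saw it during training, the meta-features do not leak the training labels into the meta-learner.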

Step 6: Use the obtained meta-features as the input of the meta-learner LR and train_label as its output, train the meta-learner, and save the meta-learner and its parameters, obtaining a model whose input is the feature representation and whose output is the final fault type probability.

At this point, the preparation for the embodiment of the invention is complete. The specific operation steps of the diagnosis flow are as follows:
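Training the meta-learner of step 6 can be sketched with scikit-learn's `LogisticRegression`. The meta-features below are synthetic probability blocks generated only for illustration, and `max_iter=1000` is an assumed setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_label = np.arange(300) % 6

# Synthetic meta-features: three blocks of 6 class probabilities, each block
# loosely favouring the true class, standing in for the base learner outputs.
blocks = []
for _ in range(3):
    scores = rng.normal(size=(300, 6))
    scores[np.arange(300), train_label] += 2.0
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    blocks.append(probs)
meta_feature = np.hstack(blocks)

meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(meta_feature, train_label)

# Final fault type probabilities, one row per sample and one column per fault type.
output_prob = meta_learner.predict_proba(meta_feature)
```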

Step 7: Acquire the new distributed system state information sample data set collected in practice within one day, where each sample stores the node it belongs to and that node's state information. Following the methods of steps 3, 4, and 5, input the node state information of each sample in the new data set into the MLP-based preprocessing model to generate the deep feature representation, obtain the expanded data set, and input the expanded data set into the base learners to obtain the meta-features for the meta-learner. The specific operation steps are as follows:

(1) Acquire the new system state sample data set collected in practice within one day. Pass it through the filling model and the preprocessing model obtained in steps 1 and 3 to obtain the hidden layer output, and splice the hidden layer output with the KPI index features of the new samples to obtain the expanded new data set;

(2) Pass the expanded new data set through the three base learners obtained in step 5 to obtain their probability outputs output_prob_lgb, output_prob_xgb, and output_prob_cat;

(3) Splice output_prob_lgb, output_prob_xgb, and output_prob_cat together as the meta-features of the new data set.

Step 8: Input all the meta-features into the model saved in step 6, compare the results with the fault types in the system fault type knowledge base to obtain, for each sample, the set of probabilities of belonging to each fault type, and determine the fault type of the faulty node according to the confidence given by the model. The specific operation steps are as follows:

(1) Pass the meta-features obtained in step 7 into the fault diagnosis model obtained in step 6 to obtain the probability output_prob of each fault type;

(2) Compare the fault types in output_prob with the types in the system fault type knowledge base, and remove those whose fault type does not exist in the knowledge base;

(3) Finally, according to the confidence given by the model, determine the fault type with the highest probability in output_prob to obtain the final diagnosis result of the model.
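Sub-steps (2) and (3) can be sketched as follows. The sketch adopts one possible reading of the screening step, namely that fault types absent from the knowledge base are excluded before the most probable type is selected; the function name `diagnose` and the example values are hypothetical.

```python
import numpy as np


def diagnose(output_prob, class_labels, known_faults):
    """Pick, per sample, the most probable fault type that the knowledge base contains."""
    output_prob = np.asarray(output_prob, dtype=float)
    class_labels = np.asarray(class_labels)
    known = np.isin(class_labels, list(known_faults))
    # Types missing from the knowledge base can never win the argmax.
    masked = np.where(known, output_prob, -np.inf)
    best = masked.argmax(axis=1)
    confidence = output_prob[np.arange(len(best)), best]
    return class_labels[best], confidence


# Hypothetical probabilities for two samples over three fault types.
output_prob = [[0.1, 0.7, 0.2],
               [0.6, 0.3, 0.1]]
fault_type, confidence = diagnose(output_prob, class_labels=[0, 1, 2], known_faults={0, 2})
```

In this toy example fault type 1 is not in the knowledge base, so the first sample is assigned type 2 even though type 1 has the larger raw probability.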

Based on an actual application scenario, the invention collects distributed system state information sample data within a specific historical period and uses it as the training set. The data scale is 10000, and each sample consists of a sample number, KPI indexes, and a fault label. The sample number uniquely identifies each sample. The KPI indexes have 107 dimensions in total; they represent the state information of one node in the distributed system and also serve as the sample features of the training set. The fault label represents the fault type of the node; its value ranges from 0 to 5, and each value corresponds to a specific fault type in the system fault class knowledge base. The specific format of the training set is shown in table 3.

TABLE 3 training data concrete format

Using the data of the training set, the invention trains a Random Forest (RF) fault diagnosis model, a Support Vector Machine (SVM) fault diagnosis model, a fault diagnosis model based on Deep Residual Shrinkage Networks (DRSN), and the ensemble model used by the fault diagnosis method provided by the invention. The evaluation index is F1-macro, and the performance of each method is shown in table 4.

Table 4 performance of models on training set

The confusion matrix of the method proposed by the present invention on the training set is shown in fig. 5. The performance of the fault diagnosis method provided by the invention generally exceeds that of common traditional fault diagnosis methods, with higher accuracy.
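For reference, the F1-macro index and the confusion matrix can be computed with scikit-learn as sketched below. The labels are made-up values for illustration only and are not results from the experiments.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up true and predicted fault labels for six samples.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# F1-macro: the unweighted mean of the per-class F1 scores.
f1_macro = f1_score(y_true, y_pred, average="macro")

# Rows are true fault types, columns are predicted fault types.
cm = confusion_matrix(y_true, y_pred)
```

Here the per-class F1 scores are 1, 2/3, and 0.8, so the macro average weights the three fault types equally regardless of how many samples each has.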

To reflect the complex running environment of a distributed system in a real environment, a verification set for testing the performance of the system was collected. Its data size is 1000, and its specific format is shown in table 5.

Table 5 Specific format of the verification data

The performance of each fault diagnosis model on the validation set is shown in table 6.

Table 6 performance of models on validation set

The confusion matrix of the method proposed by the present invention on the verification set is shown in fig. 6.

It can be seen that although some models used in traditional methods perform well on the training data, their performance degrades to varying degrees in practical use scenarios. This is because the verification data was deliberately not collected to match the distribution of the training data exactly; instead, some samples exhibiting covariate shift were collected to form the verification data, so as to simulate the complex environment in which a distributed system operates in a real scenario.

The data show that the fault diagnosis method provided by the invention achieves accuracy comparable to that of common traditional methods while also having good generalization capability, and can cope with complex environments in real scenarios.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A machine learning-based distributed system fault diagnosis method, comprising:

2. The machine learning based distributed system fault diagnosis method according to claim 1, wherein the gathering of the state sample data of the distributed system in the history time interval and the data preprocessing comprises:

3. The machine learning based distributed system fault diagnosis method according to claim 2, wherein said classifying the fault type of the preprocessed sample dataset, counting the fault types in the sample dataset, and constructing the system fault type knowledge base includes:

Reading a tag column in sample data according to the filled sample set;

4. The machine learning based distributed system fault diagnosis method according to claim 3, wherein said fitting a dataset using a multi-layer perceptron preprocessing model, extracting deep feature embedding between fault features comprises:

5. The machine learning based distributed system fault diagnosis method according to claim 4, wherein the obtaining the probability of occurrence of each fault type by taking the feature in the extended data set as the input of the base learner, taking the tag in the extended data set as the target of the base learner, training the base learner, saving the base learner and the parameters thereof, and predicting the feature in the extended data set by the base learner comprises:


6. The machine learning based distributed system fault diagnosis method as claimed in claim 5, wherein the inputting the probability of occurrence of each fault type as meta characteristic of the meta learner, using the tag in the extended data set as output of the meta learner, training the meta learner, and saving the meta learner and its parameters to obtain the diagnostic model whose input is characteristic representation and whose output is probability of final fault type comprises:

7. The machine learning based distributed system fault diagnosis method according to claim 6, wherein the inputting the meta-feature into the diagnosis model, comparing the meta-feature with fault categories in a system fault type knowledge base to obtain a probability set of what fault type each sample finally belongs to, and determining the fault type of the fault node according to the confidence degree given by the diagnosis model comprises:

8. An apparatus for performing the machine learning-based distributed system fault diagnosis method according to any one of claims 1 to 7, comprising:

9. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by a processor, implement the steps of the machine-learning-based distributed system fault diagnosis method of any one of claims 1 to 7.

10. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the machine learning based distributed system fault diagnosis method of any one of claims 1 to 7.