CN112181758A - Fault root cause positioning method based on network topology and real-time alarm

Info

Publication number: CN112181758A
Application number: CN202010835820.8A
Authority: CN
Inventors: 徐康; 李熠轩; 刘海琦; 张晓伟; 叶宁; 王汝传
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2021-01-05
Anticipated expiration: 2040-08-19
Also published as: CN112181758B

Abstract

The invention discloses a fault root cause positioning method based on network topology and real-time alarm, comprising the following steps: inputting an alarm data set, processing the data, extracting the features contained in the alarms of the corresponding nodes as a feature set, and obtaining the time and node information in each piece of alarm information; according to the current node information, obtaining the upper and lower nodes from the topological relation, screening out the alarm information of the upper and lower nodes within a certain time interval according to the time information, and constructing the alarm features of the upper and lower nodes with the feature set of the current node; dividing the alarm data set into a training set and a test set, screening all obtained feature information, inputting it into a classification algorithm, taking the feature set with the best prediction performance as the model's classification features, training the classification algorithm on the feature values of the training set to obtain a prediction model, predicting the data in the test set with the trained model and outputting the prediction result, and obtaining the final predicted root cause from the number of candidate root causes in the prediction result and the time information.

Description

Fault root cause positioning method based on network topology and real-time alarm

Technical Field

The invention relates to the technical field of intelligent operation and maintenance, in particular to a fault root cause positioning method based on network topology and real-time alarm.

Background

A large e-commerce platform involves mutual calls among hundreds of methods internally and generates tens of thousands of alarm records every day. How to use network topology information and alarm data to filter and analyze alarms promptly and effectively, and finally report valid alarms and suspected root causes, is a major challenge for network operation and maintenance. When a node in the network topology fails, the nodes connected to it often become abnormal as well, producing a large number of alarms that bury the true root cause alarm. When a large number of alarms occur, they need to be analyzed and processed so that invalid alarms are filtered out, candidate root cause nodes are accurately located, and fault locating time is shortened.

Disclosure of Invention

The invention aims to provide a fault root cause positioning method based on network topology and real-time alarm, which can accurately and quickly position network faults, can improve the operation and maintenance efficiency of a first-line network and reduce the loss generated by the network faults.

The invention adopts the following technical scheme for realizing the aim of the invention:

the invention provides a fault root cause positioning method based on network topology and real-time alarm, which comprises the following steps:

inputting an alarm data set, processing the alarm data set, extracting from all alarm information the features contained in the corresponding nodes as a feature set, obtaining the time and node information in each piece of alarm information, and extracting the features contained in each piece of alarm information using the obtained feature set;

according to the processed current node information, obtaining the upper and lower nodes from the topological relation, screening out the alarm information of the upper and lower nodes within a certain time interval according to the time information, constructing the alarm features of the upper and lower nodes with the feature set of the current node, and thereby obtaining the global feature information of each piece of alarm information;

dividing the alarm data set into a training set and a test set, screening all obtained feature information, inputting it into a classification algorithm, taking the feature set with the best prediction performance as the model's classification features, training the classification algorithm on the feature values of the training set to obtain a prediction model, predicting the data in the test set with the trained classification model and outputting the prediction result, and obtaining the final predicted root cause from the number of candidate root causes in the prediction result and the time information.

Further, the method for inputting an alarm data set, performing data processing on the alarm data set, and extracting features contained in a current corresponding node from all alarm information as a feature set specifically includes:

preprocessing the alarm data set that provides the alarm information, merging all files, removing all irrelevant information with a regular expression, extracting features and feature values, deduplicating to obtain all features, using these features as a feature matching set, and customizing, for each feature, a regular expression that extracts its feature value.

Further, the method for acquiring time and node information in each piece of alarm information specifically includes:

extracting the time and node information of each piece of alarm information with a regular expression, and building a dictionary from them to facilitate searching and matching.

Further, the method for extracting the features included in each piece of alarm information by combining the obtained feature set specifically includes:

traversing the processed file line by line, matching each piece of alarm information against the features in the feature set, and, where a feature matches, filling the extracted feature value into the corresponding feature of that piece of alarm information.

Further, the method for screening out the alarm information of the upper node and the lower node within a certain time interval according to the time information specifically comprises the following steps:

traversing the alarm information within one minute before and after each piece of alarm information, taking its node information as input, screening out the alarm information of the upper and lower nodes from it, and performing feature matching on all associated nodes to obtain their features.

Further, the method for constructing the alarm characteristics of the upper and lower nodes by combining the feature set to obtain the global characteristic information of each alarm information specifically includes:

the feature set is processed into four data sets: T0, containing only the local node features; T1, containing the local node and upper node features; T2, containing the local node and lower node features; and T3, containing the local node, upper node and lower node features.

Further, dividing the alarm data set into a training set and a test set, then screening all the obtained feature information, inputting a classification algorithm, taking a feature set with the best prediction performance as a model classification feature, inputting the feature values contained in the training set into the classification algorithm to train to obtain a prediction model, predicting data in the test set by using the trained classification model and outputting a prediction result, and obtaining a final prediction root cause result according to the number of candidate root causes in the prediction result and time information specifically comprises the following steps:

selecting the T2 data set, which contains the local node and lower node features, inputting it into an XGBoost classification model, and training to obtain root cause predictions;

balancing the data set with Borderline SMOTE;

for the T2 data set containing the local node and lower node features, training the XGBoost classification model with the training set, and then performing root cause prediction on the test set to obtain all candidate root cause information;

and judging the candidate root causes by combining the occurrence time and the number of occurrences to obtain the root cause prediction result.

The invention has the following beneficial effects:

the invention locates the root cause node that triggered the alarms and outputs its alarm information, which makes it easier for operation and maintenance personnel to troubleshoot the fault; the method is flexible and extensible: the classification algorithm can be replaced, and the accuracy of root cause positioning may be further improved by substituting a classification algorithm better suited to a given working environment and by training the model on different training sets.

Drawings

FIG. 1 is a schematic overall flow chart according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of step S10 according to the embodiment of the present invention;

FIG. 3 is a schematic flow chart of step S20 according to the embodiment of the present invention;

FIG. 4 is a schematic flow chart of step S30 according to the embodiment of the present invention.

Detailed Description

The invention is further described with reference to specific examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention relates to a fault root cause positioning method based on network topology and real-time alarm, which performs root cause prediction on alarm information containing time information and topological relations. Features of the alarm information are extracted and input into a classification algorithm, and a model that predicts the root cause node is obtained through training. To improve prediction accuracy, the topological relation between alarm nodes is used: the upper and lower nodes of each alarm node are looked up in the topology graph. After the associated nodes are located, alarm information with a causal connection is screened out using the time information, the features it contains are determined, and the upper and lower node feature information is added to the input features of the classification algorithm. All obtained features are combined and screened to obtain the feature information with the highest F1 value, a data set is constructed from that feature information, and the data set is balanced with Borderline SMOTE. The data set is input into a machine learning classification algorithm, suspected root cause information is obtained after classification, and the root cause node is located by combining the time information and the classification counts. The invention outputs the root cause node that triggered the alarms together with its alarm information, which makes it easier for operation and maintenance personnel to troubleshoot the fault.

This embodiment is applicable to filtering and analyzing alarms using network topology information and alarm data so as to finally report valid alarms and suspected root causes. The method can be executed by a machine learning module, which can be implemented in software and/or hardware, and can be applied to alarm handling on, for example, an e-commerce platform. FIG. 1 is a flow diagram provided by an embodiment of the invention; the method specifically comprises the following steps:

in step S10, an alarm data set is input and processed, the features contained in all alarm information are extracted as a feature set, the alarm information is traversed to obtain the time and node information of each alarm, and the features contained in each piece of alarm information are extracted using the obtained feature set;

in step S20, according to the node information obtained in S10, the upper and lower nodes are obtained from the topological relation, the alarm information of the upper and lower nodes within a certain time interval is screened out according to the time information, and the alarm features of the upper and lower nodes are constructed with the feature set from S10, so as to obtain the global feature information of each piece of alarm information;

in step S30, the alarm data set is divided into a training set and a test set, all the feature information obtained in step S20 is screened to remove noise and input into a classification algorithm, the feature set with the best prediction performance is used as the model's classification features, the classification algorithm is trained on the feature values of the training set to obtain a prediction model, the trained classification model is used to predict the data in the test set and output a prediction result, and the final predicted root cause is obtained from the number of candidate root causes in the prediction result and the time information.
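The selection of "the feature set with the best prediction performance" in step S30 can be sketched as follows. This is a minimal illustration only: it assumes that performance is measured by the F1 value of the root cause class (the criterion named elsewhere in this description), that label 1 marks a root cause alarm, and that predictions for each candidate feature set have already been produced by some classifier. The function names `f1_score` and `pick_best_feature_set` are hypothetical and do not come from the original text.

```python
def f1_score(y_true, y_pred):
    """F1 value for the positive class (label 1 = root cause alarm)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_best_feature_set(y_true, predictions_by_set):
    """Score every candidate feature set and return (best name, all scores)."""
    scores = {name: f1_score(y_true, pred)
              for name, pred in predictions_by_set.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical predictions for two candidate feature sets.
labels = [1, 0, 0, 1, 0]
best, scores = pick_best_feature_set(
    labels,
    {"T0": [1, 1, 1, 0, 0], "T2": [1, 0, 0, 1, 1]},
)
```

In this sketch ties are resolved in favor of the first feature set encountered; the original text does not say how ties are handled.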

Preferably, the alarm text data is processed with regular expressions, and a feature dictionary is constructed to improve the efficiency of feature extraction. As shown in FIG. 2, the specific steps are as follows:

in step S101, the test set files and training set files that provide the alarm information are preprocessed, all files are merged, all irrelevant information is removed with a regular expression, features and feature values are extracted, duplicates are removed to obtain all features, these features are used as a feature matching set, and a regular expression that extracts the feature value is customized for each feature;

in step S102, the time and node information of each piece of alarm information is extracted with a regular expression, and a dictionary is built from them to facilitate searching and matching;

in step S103, the processed file is traversed line by line, each piece of alarm information is matched against the features in the feature set, and, where a feature matches, the extracted feature value is filled into the corresponding feature of that piece of alarm information.
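Steps S101 to S103 can be sketched with Python's `re` module. The sketch below is illustrative only and rests on several assumptions that are not in the original text: alarm strings start with a prefix of the form `host node_<number>`, the measured value is the first number followed by the unit `ms`, and a feature without a numeric value is recorded as `1`. The names `split_feature`, `build_feature_set` and `encode` are hypothetical. In the method described above a separate regular expression is customized per feature; a single generic pattern is used here only to keep the example short.

```python
import re

NODE_PREFIX = re.compile(r"host node_\d+\s*")  # assumed node prefix format
VALUE = re.compile(r"(\d+)(?=ms)")             # assumed: value is a number before "ms"


def split_feature(trigger):
    """Return (feature template, feature value) for one alarm string."""
    body = NODE_PREFIX.sub("", trigger, count=1)
    match = VALUE.search(body)
    if match is None:
        return body, None
    # Cut the measured value out of the text to get a reusable template.
    template = body[:match.start()] + body[match.end():]
    return template, int(match.group(1))


def build_feature_set(triggers):
    """S101: deduplicate the templates found in all alarms."""
    return sorted({split_feature(t)[0] for t in triggers})


def encode(trigger, feature_set):
    """S103: fill the extracted value into the matching feature column."""
    template, value = split_feature(trigger)
    row = dict.fromkeys(feature_set)  # None means the feature is absent
    if template in row:
        row[template] = value if value is not None else 1
    return row


gc_alarm = ("host node_61 FullGC average elapsed time: 2118ms "
            "(greater than threshold: 1000ms)")
port_alarm = "host node_60 port 80 communication exception"
features = build_feature_set([gc_alarm, port_alarm, gc_alarm])
```

Calling `encode(gc_alarm, features)` then yields one row in which the FullGC feature holds 2118 and the port feature is absent.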

Preferably, the method for obtaining the features of the upper and lower nodes of the current node from the topological relation in step S20, as shown in FIG. 3, specifically includes the following steps:

in step S201, the upper and lower nodes of a node are located using the topological relation between nodes, and the upper and lower node information of the current node is output;

in step S202, the alarm information within one minute before and after each piece of alarm information is traversed, its node information is taken as input, the alarm information of the upper and lower nodes is screened out from it, and feature matching is performed on all associated nodes to obtain their features;

in step S203, the feature set is processed into four data sets: T0, containing only the local node features; T1, containing the local node and upper node features; T2, containing the local node and lower node features; and T3, containing the local node, upper node and lower node features; all data sets share a uniform format, and each feature column carries a label for the feature it contains.
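The construction of the four data sets in step S203 can be sketched as below. The sketch assumes that the local, upper node and lower node features of one alarm are already available as dictionaries mapping a feature name to its value, and it labels each column with a `local:`, `upper:` or `lower:` prefix. Both the input layout and the prefix convention are illustrative assumptions, as is the function name `build_datasets`.

```python
def build_datasets(local, upper, lower):
    """Return the T0..T3 feature rows for one piece of alarm information."""

    def labelled(prefix, features):
        # Prefix every column so its origin stays visible in the data set.
        return {f"{prefix}:{name}": value for name, value in features.items()}

    t0 = labelled("local", local)               # local node only
    t1 = {**t0, **labelled("upper", upper)}     # local + upper nodes
    t2 = {**t0, **labelled("lower", lower)}     # local + lower nodes
    t3 = {**t1, **labelled("lower", lower)}     # local + upper + lower nodes
    return {"T0": t0, "T1": t1, "T2": t2, "T3": t3}


# Hypothetical feature values for one alarm.
rows = build_datasets(
    local={"FullGC average elapsed time": 2118},
    upper={},
    lower={"port communication exception": 1},
)
```

Keeping all four rows side by side makes it straightforward to train the same classifier on each variant and compare the results.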

The step S30, as shown in fig. 4, specifically includes:

in step S301, to reduce noise and adjust the input features, the T2 data set containing the local node and lower node features is selected and input into the XGBoost classification model, and root cause predictions are obtained through training;

in step S302, since the data set is imbalanced between root cause alarms and non-root-cause alarms, Borderline SMOTE is selected to balance the data set;

in step S303, for the T2 data set containing the local node and lower node features, the XGBoost classification model is trained with the training set, and root cause prediction is then performed on the test set to obtain all candidate root cause information;

in step S304, since each file contains only one root cause, the candidate root causes are judged by combining the occurrence time and the number of occurrences, and the root cause prediction result is obtained.
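Step S304 can be sketched as a small selection routine. The original text only states that occurrence time and number of occurrences are combined; the concrete rule used here (most occurrences wins, and a tie goes to the node whose earliest alarm came first) is an assumption made for illustration, and `pick_root_cause` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime


def pick_root_cause(candidates):
    """Pick one node from (node, datetime) pairs predicted as root causes.

    Assumed rule: highest occurrence count first, earliest alarm time second.
    """
    stats = defaultdict(lambda: [0, None])  # node -> [count, earliest time]
    for node, when in candidates:
        entry = stats[node]
        entry[0] += 1
        if entry[1] is None or when < entry[1]:
            entry[1] = when
    if not stats:
        return None  # the classifier produced no candidates for this file
    return min(stats, key=lambda node: (-stats[node][0], stats[node][1]))


# Hypothetical candidates predicted for a single file.
candidates = [
    ("node_60", datetime(2019, 6, 4, 1, 14)),
    ("node_61", datetime(2019, 6, 4, 1, 14)),
    ("node_60", datetime(2019, 6, 4, 1, 15)),
]
root = pick_root_cause(candidates)
```

Because each file is stated to contain exactly one root cause, the routine returns a single node rather than a ranked list.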

Operation example:

Assume a training set is given that contains nodes, times and alarm information, with the root cause alarm nodes labeled. Two records from the training set of this embodiment are as follows:

Example 1, time: "2019/6/4 1:14", triggername: "host node_61 FullGC average elapsed time: 2118ms (greater than threshold: 1000ms)", is_root: "0".

Example 2, time: "2019/6/4 1:14", triggername: "host node_60 port 80 communication exception", is_root: "1".

The topological relation between the nodes is also given; for example, each node and its lower nodes are stored as a dictionary: {"node_50": ["node_4", "node_83", "node_33", "node_17"], "node_0": ["node_4", "node_83", "node_33", "node_17"], ...}, and the like.
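Since the dictionary above stores only lower nodes, the upper nodes of a node can be derived by inverting it. The following is a minimal sketch under that assumption; the original text does not say how upper nodes are stored, and `invert_topology` is a hypothetical helper name.

```python
from collections import defaultdict


def invert_topology(lower_nodes):
    """Turn a node -> lower nodes mapping into a node -> upper nodes mapping."""
    upper_nodes = defaultdict(list)
    for parent, children in lower_nodes.items():
        for child in children:
            upper_nodes[child].append(parent)
    return dict(upper_nodes)


lower_nodes = {
    "node_50": ["node_4", "node_83", "node_33", "node_17"],
    "node_0": ["node_4", "node_83", "node_33", "node_17"],
}
upper_nodes = invert_topology(lower_nodes)
```

With both mappings available, looking up the upper or lower nodes of the current alarm node is a single dictionary access.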

S1: Local feature extraction. Step (1): traverse the alarm information and process each alarm with a regular expression; for example, the expression r'host node_\d+' removes the node designation from the triggername in Example 1, leaving: "FullGC average elapsed time: 2118ms (greater than threshold: 1000ms)". Step (2): process the remaining text again with a regular expression customized for each kind of information; for the remaining "FullGC average elapsed time: 2118ms (greater than threshold: 1000ms)", a customized regular expression (beginning r'[FullGC average elapsed time:) reads out the value it contains, and the specific feature value is replaced by '+'. Step (3): deduplicate the information contained in the processed triggername fields to obtain all features as a feature set; the features contained in each alarm can then be extracted with this feature set. For example, the alarm at "2019/6/4 1:14" with triggername "host node_61 FullGC average elapsed time: 2118ms (greater than threshold: 1000ms)" contains the feature "FullGC average elapsed time: ms (greater than threshold: 1000ms)" with feature value 2118;

S2: Customize regular expressions according to the feature set obtained in S1, then match each piece of alarm information to obtain the features contained in all alarm information. Using the given topological relation, the specific association between nodes can be obtained, and with it the upper and lower nodes of the current node; for example, if the obtained node is "node_50", its lower nodes are "node_4", "node_83", "node_33" and "node_17". Combined with the time information, the alarm information of all upper/lower nodes within one minute of a given alarm can be located, that is, the alarm information of nodes "node_4", "node_83", "node_33" and "node_17" occurring in the same minute. The features it contains are then screened out using the feature set obtained in S1, which yields the upper/lower node features of the alarm;

S3: Screen all the feature information obtained in S2 and use the data set containing the local node and lower node features as the training set; balance the data set with Borderline SMOTE; adopt the XGBoost algorithm and, after training with the training set, perform root cause prediction on the processed test set containing the local node and lower node features to obtain all candidate root cause information; then judge the candidate root causes by combining the occurrence time and the number of occurrences to obtain the root cause prediction result.

When preprocessing the data, the method extracts alarm features from all data sets and merges all files into one, which makes it easy to locate alarm information accurately, to preview root cause nodes directly, and to add alarm features and operate on files later. When a classification algorithm is used to predict the root cause, the extracted alarm features serve as input: the alarm information is traversed and the features it contains are extracted with regular expressions as the basis for classification. Because nodes are causally associated, the topological relation between nodes is used as well: the upper and lower nodes of an alarm are located in the topology graph, the alarm information of these related nodes is obtained using the alarm time information, and its features are extracted; adding these features improves prediction accuracy. Four data sets are obtained after processing; to reduce noise and ensure that the added feature information is effective, different data sets can be used in different situations, thereby improving prediction accuracy.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A fault root cause positioning method based on network topology and real-time alarm is characterized by comprising the following steps:

inputting an alarm data set, performing data processing on the alarm data set, extracting features contained in a current corresponding node from all alarm information to be used as a feature set, acquiring time and node information in each piece of alarm information, and extracting the features contained in each piece of alarm information by combining the obtained feature set; according to the processed current node information, upper and lower nodes are obtained by combining the topological relation, alarm information of the upper and lower nodes in a certain time interval is screened out according to time information, alarm characteristics of the upper and lower nodes can be constructed by combining a characteristic set of the current node, and global characteristic information of each alarm information is obtained;

2. The method for locating the fault root cause based on the network topology and the real-time alarm according to claim 1, wherein the method for inputting the alarm data set, processing the alarm data set, and extracting the features contained in the current corresponding nodes from all the alarm information as the feature set specifically comprises:

3. The method for locating the fault root cause based on the network topology and the real-time alarm according to claim 2, wherein the method for obtaining the time and node information in each piece of alarm information specifically comprises:

4. The method for locating the fault root cause based on the network topology and the real-time alarm according to claim 3, wherein the method for extracting the features included in each piece of alarm information by combining the obtained feature set specifically comprises:

5. The method for locating the fault root cause based on the network topology and the real-time alarm according to claim 1, wherein the method for screening out the alarm information of the upper node and the lower node within a certain time interval according to the time information specifically comprises:

6. The method for locating the fault root cause based on the network topology and the real-time alarm according to claim 5, wherein the alarm characteristics of the upper and lower nodes can be constructed by combining the characteristic set of the current node, and the method for obtaining the global characteristic information of each alarm information specifically comprises:

7. The method according to claim 1, wherein the step of dividing the alarm data set into a training set and a test set, the step of screening all the obtained feature information, the step of inputting a classification algorithm, the step of inputting a feature set with the best prediction performance as a model classification feature, the step of inputting the feature values contained in the training set into the classification algorithm for training to obtain a prediction model, the step of predicting the data in the test set by using the trained classification model and outputting a prediction result, and the step of obtaining a final prediction root result according to the number of candidate root causes in the prediction result and time information specifically comprises the steps of:

selecting the T2 data set, which contains the local node and lower node features, inputting it into an XGBoost classification model, and training to obtain root cause predictions; balancing the data set with Borderline SMOTE;