WO2022007108A1 - A deep-learning-based network alarm localization method - Google Patents

A deep-learning-based network alarm localization method

Info

Publication number
WO2022007108A1
WO2022007108A1 (PCT/CN2020/108816)
Authority
WO
WIPO (PCT)
Prior art keywords
root cause
node
alarm information
sample
nodes
Prior art date
Application number
PCT/CN2020/108816
Other languages
English (en)
French (fr)
Inventor
徐小龙
黄寄
赵娟
徐佳
姜宇
孙维
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学
Publication of WO2022007108A1 publication Critical patent/WO2022007108A1/zh

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0677 Localisation of faults
    • H04L 41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L 41/065 Management of faults, events, alarms or notifications using root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L 41/14 Network analysis or design
    • H04L 41/142 Network analysis or design using statistical or mathematical methods
    • H04L 41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 41/147 Network analysis or design for predicting network behaviour

Definitions

  • The invention relates to computer network operation and maintenance, and in particular to a network alarm localization method based on deep learning.
  • Fault management is an important part of network management, covering fault discovery, fault diagnosis, and fault repair.
  • The difficulty lies in determining the source of a fault, that is, the root cause node.
  • The interiors of today's large platforms involve mutual calls among hundreds of systems, and a large amount of alarm information is generated between their network nodes.
  • If a node fails, the nodes that call it or need to use its resources may also fail, producing a large amount of alarm information, and such alarms may even be issued earlier than the alarm of the root cause node itself.
  • These problems make fault localization very difficult.
  • Common alarm correlation methods include rule-based reasoning methods, artificial intelligence methods, and cause-and-effect diagram methods.
  • Rule-based reasoning requires designing a set of rules for the occurrence of alarm information; this is very difficult to implement, cannot handle situations that were not anticipated in advance, and lacks stability.
  • For general artificial intelligence methods, it is difficult to collect a data set of correlated alarm information and to determine suitable features for the alarm data.
  • Root cause alarm samples are generally scarce, so there is also a data imbalance problem, which leads to model overfitting and poor final results.
  • The cause-and-effect diagram method likewise performs rule inference on the connection relationships of the alarm information to obtain the root cause node.
  • None of these methods operate in real time: when new alarm information is generated, it cannot be matched in real time against established correlation rules, so the real-time requirements of alarm correlation analysis are hard to meet.
  • The purpose of the present invention is to provide a deep-learning-based network alarm localization method that improves the efficiency of network operation and maintenance and reduces the losses caused by network failures.
  • The method filters out the alarm information sent by non-root-cause nodes and finally locates the root cause node in real time.
  • The present invention provides a deep-learning-based network alarm localization method comprising the following steps:
  • Step 1: collect historical alarm information sample data within a specific time interval, and preprocess these samples by deleting repeated alarm information;
  • Step 2: after the repeated alarm information is removed, also filter out the samples of isolated nodes. After filtering, classify the alarm information of the nodes in all samples, count the types of root cause alarm information, and build a knowledge base of root cause node alarm information categories;
  • Step 3: combine the node information and alarm information of each sample and input the combination into a word representation model based on the distributional hypothesis to obtain the sample's feature representation.
  • Each sample then contains two pieces of information: the feature representation and the root cause label;
  • Step 4: divide the sample data set into two subsets according to whether the root cause label is 1 or 0. For the subset labelled 1, use a sample expansion method to expand its number of samples until it equals the number of samples in the subset labelled 0;
  • Step 5: use the feature representations of the expanded data set as the input of an LSTM model and the root cause labels as its output, train the model, and save the model and its parameters.
  • The result is a model whose input is a feature representation
  • and whose output is the probability that the sample is a root cause node;
  • Step 6: obtain the sample data set of alarm information collected in practice for a new day.
  • Each sample stores the node of the sample and its alarm information.
  • As in step 3, the node and alarm information of each sample in the new data set are combined and input into the word representation model based on the distributional hypothesis to generate each sample's feature representation, yielding the corresponding feature representation set;
  • Step 7: input the feature representations of all samples into the model saved in step 5 to obtain, for each sample, the probability of being a suspected root cause node. Store all sample nodes whose probability exceeds the threshold as the set of suspected root cause nodes;
  • Step 8: compare the alarm information sent by the nodes of the suspected root cause node set with the alarm types in the knowledge base of root cause node alarm information categories built in step 2, and delete the nodes whose alarm types do not exist in the knowledge base. If the suspected root cause node set then has no elements, there is no root cause node for this day. Otherwise, the root cause node is selected using the distance relationships between nodes and the knowledge base.
  • The specific process of step 1 is:
  • Step 11: store the node alarm information, corresponding node, and root cause label of every sample as a (node, alarm information, root cause label) triple, and create an empty dictionary;
  • Step 12: put the alarm information within one day into a queue;
  • Step 13: judge whether the queue is empty; if it is empty, go directly to step 15; if not, dequeue an element;
  • Step 14: judge whether the dequeued element exists in the dictionary; if it does, do nothing; if not, add it to the dictionary. Return to step 13;
  • Step 15: store all elements of the dictionary as the sample data set from which duplicate alarm information has been removed.
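The dictionary-and-queue deduplication of steps 11 to 15 can be sketched as follows. This is a minimal Python sketch: the record layout as (node, alarm information, root cause label) triples comes from step 11, while the function name and sample values are illustrative.

```python
from collections import deque

def deduplicate(records):
    """Steps 11-15: remove duplicate alarm triples for one day, keeping order."""
    queue = deque(records)      # step 12: enqueue the day's alarm triples
    seen = {}                   # step 11: empty dictionary
    while queue:                # step 13: loop until the queue is empty
        triple = queue.popleft()
        if triple not in seen:  # step 14: keep only triples not yet in the dictionary
            seen[triple] = True
    return list(seen)           # step 15: the deduplicated sample data set
```

Because Python dictionaries preserve insertion order, the surviving triples come out in their original order.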
  • The specific process of step 2 is:
  • Step 21: build an adjacency matrix from the daily connection relationships of the alarm nodes; the matrix entry for a connected pair of nodes is set to 1, and the entry for an unconnected pair is set to 0;
  • Step 22: compute, for each node, the sum of all elements of its row and its column;
  • Step 23: remove the nodes for which this row-and-column sum is 0;
  • Step 24: in the remaining samples, classify the alarm information of all samples, store it in the knowledge base of root cause node alarm information categories, and compute the occurrence frequency of each category.
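Steps 21 to 23 amount to computing a per-node degree from the adjacency matrix and dropping zero-degree nodes. A minimal NumPy sketch (the node names, edge list, and helper name are illustrative, not the patent's):

```python
import numpy as np

def isolated_nodes(nodes, edges):
    """Steps 21-23: nodes whose adjacency-matrix row sum + column sum is 0."""
    idx = {n: i for i, n in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)), dtype=int)
    for u, v in edges:                 # step 21: 1 where a connection exists
        adj[idx[u], idx[v]] = 1
    # step 22: row sum (outgoing) plus column sum (incoming) per node
    degree = adj.sum(axis=1) + adj.sum(axis=0)
    # step 23: a node with total 0 neither connects nor is connected
    return [n for n in nodes if degree[idx[n]] == 0]
```

For example, with edges only between v0 and v1, node v2 is reported as isolated and its samples would be deleted.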
  • The specific process of step 4 is:
  • Step 41: let T_1 be the subset whose root cause label is 1 and T_0 the subset whose label is 0. Compute the Euclidean distances between all samples in T_1, and record for each sample its k nearest samples (k is preferably 3 in the present invention), obtaining the k nearest neighbours (k = 3) of each sample;
  • Step 42: create an empty list T_new;
  • Step 43: if the number of samples in T_new plus the number in T_1 equals the number of samples in T_0, skip directly to step 46;
  • Step 44: randomly select a sample from T_1 and take its node-and-alarm-information feature representation x; then randomly draw one of its k nearest neighbours and take that sample's node-and-alarm-information feature representation x′. Compute the new sample's node-and-alarm-information feature representation x_new by the following formula, where rand(0, 1) is a random value between 0 and 1:
  • x_new = x + rand(0, 1)·|x − x′|
  • Step 45: form a 2-tuple from the newly constructed x_new and its root cause label, which is always 1, and add it to the T_new list as a newly expanded sample. Return to step 43;
  • Step 46: add all samples of T_new to T_1.
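Steps 41 to 46 describe a SMOTE-style interpolation between a minority sample and one of its k nearest neighbours. A hedged sketch, assuming the feature representations are plain numeric vectors and applying x_new = x + rand(0, 1)·|x − x′| exactly as stated in step 44 (the function name, seed, and example data are illustrative):

```python
import numpy as np

def expand(T1, T0_size, k=3, seed=0):
    """Steps 41-46: expand the root-cause set T1 until it has T0_size samples."""
    rng = np.random.default_rng(seed)
    T1 = np.asarray(T1, dtype=float)
    # step 41: pairwise Euclidean distances, then k nearest neighbours per sample
    d = np.linalg.norm(T1[:, None, :] - T1[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    knn = np.argsort(d, axis=1)[:, :k]
    new = []                                 # step 42: the empty list T_new
    while len(T1) + len(new) < T0_size:      # step 43: stop when sizes match
        i = rng.integers(len(T1))            # step 44: random sample x from T1 ...
        x = T1[i]
        xp = T1[rng.choice(knn[i])]          # ... and a random k-NN sample x'
        new.append(x + rng.random() * np.abs(x - xp))  # x_new; its label is 1
    return np.vstack([T1] + new) if new else T1        # step 46
```

Note that standard SMOTE interpolates with x + rand·(x′ − x); the absolute-value form here follows the patent's own formula.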
  • After the suspected root cause node set is generated in step 7, denote it S_s. The specific process of step 8 is:
  • Step 81: create an empty list S_c, compare the alarm information of all nodes in S_s with the knowledge base of root cause node alarm information categories generated in step 2, and keep only the nodes of S_s whose alarm information category exists in the knowledge base;
  • Step 82: if S_s is an empty set, there is no root cause node in this day, and the process ends. If S_s has only one node, that node is the root cause node, and the process ends;
  • Step 83: build an adjacency matrix from the connection relationships of the nodes in all samples of the day, with the weight of every edge set to 1. From these connections, compute the shortest distance between each node of S_s and all sample nodes of the day, count the number of nodes within the root cause node fault propagation range (set to within 2 hops in the present invention), form a 2-tuple of the node and the number of nodes within its fault propagation range, and add it to S_c;
  • Step 84: take from S_c the element whose count of nodes within the fault propagation range is largest. If this element is unique, the node of its tuple is the root cause node. Otherwise, select the root cause node according to the occurrence frequencies of the nodes' alarm information categories in the knowledge base: the node whose alarm category has the largest frequency is the root cause node.
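Since every edge weight is 1, the shortest-path computation of step 83 reduces to breadth-first search. A sketch of steps 81 to 84 under assumed data structures (adjacency as a dict of neighbour lists, alarms and the knowledge base as dicts; all names are illustrative):

```python
from collections import deque

def within_hops(adj, src, max_hops=2):
    """Step 83: count nodes within max_hops of src (unit weights, so BFS)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] == max_hops:          # do not expand past the hop limit
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return len(dist) - 1                 # exclude src itself

def pick_root_cause(suspects, alarms, kb, adj):
    """Steps 81-84. alarms: node -> alarm type; kb: alarm type -> frequency."""
    kept = [n for n in suspects if alarms[n] in kb]       # step 81: screen by KB
    if not kept:
        return None                                        # step 82: no root cause
    counts = [(within_hops(adj, n), n) for n in kept]      # step 83
    best = max(c for c, _ in counts)
    tied = [n for c, n in counts if c == best]             # step 84: break ties
    return max(tied, key=lambda n: kb[alarms[n]])          # ... by KB frequency
```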
  • Compared with the prior art, the present invention has the following advantages:
  • 1. Traditional root cause node localization for alarm information is generally realized by alarm correlation methods, which require association rules, and different systems may have different alarm information rules. The present invention instead uses deep learning together with a knowledge base built from historical alarm information: no association rules need to be designed, and any system that generates alarm information can use the method.
  • 2. In network alarm information, root cause sample data are generally far fewer than non-root-cause sample data, so artificial intelligence methods suffer from class imbalance and the final model predictions overfit. The present invention expands the root cause alarm samples of the training set to the same number as the non-root-cause alarm samples, solving the class imbalance problem.
  • Figure 1 is an example diagram of a node connection relationship;
  • Figure 2 is an example diagram of an adjacency matrix;
  • Figure 3 is a structure diagram of the Embedding layer of the BERT model;
  • Figure 4 is a network structure diagram of the BERT model;
  • Figure 5 is a pre-preparation flow chart of an embodiment of the present invention;
  • Figure 6 is a flow diagram of an embodiment of the present invention.
  • The present invention screens out the useless and repeated alarm information among the large amount of alarm information generated by network nodes, accurately locates the node that sends out the root cause alarm information, improves the efficiency of network operation and maintenance, and reduces the losses caused by network failures.
  • Using deep learning to assist root cause localization of network node alarms filters out a large number of non-root-cause nodes and greatly reduces the localization time.
  • Methods dedicated to alarm localization are relatively scarce; root cause screening is generally performed with alarm correlation methods. Common alarm correlation methods include case- or rule-based reasoning expert systems, cause-and-effect diagrams, and dependency diagrams.
  • The present invention combines deep learning with the alarm correlation method: the alarm information is first filtered down to a set of suspected root cause nodes by deep learning, and the root cause node is then located within this set according to the characteristics of root cause nodes.
  • When node v0 fails, the nodes v1, v2, v3, v4, and v5 within its fault propagation range may also fail.
  • Alarm information logs are collected in a specific historical time interval to obtain 100 groups of sample data with alarm nodes and alarm information; each group contains several samples, and each sample is manually labelled as to whether it comes from a root cause node. These data are used as the training set.
  • Within each group, samples with the same node and the same alarm information are deduplicated so that only one copy is kept. The connection relationships of the nodes in each group are then made into an adjacency matrix to check whether the group's faulty nodes contain isolated nodes, and the alarm samples of isolated nodes are deleted, yielding the denoised training set. The host node number is combined with the node's alarm information, and the word embedding features of the alarm information are obtained from a pre-trained word representation model based on the distributional hypothesis.
  • The nodes in the node samples form a set of suspected nodes. From the types of root cause alarm information in the training set, a knowledge base of (root cause node alarm type, occurrence frequency) pairs is built. The connection relationships of all nodes in the suspected set are made into an adjacency matrix with every edge weight equal to 1, and Dijkstra's single-source shortest path method is used to count, for each suspected node, the number of nodes in the group within 2 hops of it.
  • The node alarm information is put into the (node, alarm information, root cause label) format, and a dictionary sized to the day's alarm information is created. The (node, alarm information, root cause label) triples within the day are then stored in a queue and traversed.
  • the specific implementation steps are as follows:
  • Definition 2, isolated node: among the nodes that send out alarm information in one day, some may neither be connected by other nodes nor connect to other nodes; as shown in Figure 1, such nodes are called isolated nodes. The connection relationships of all nodes that issue alarm information in a day are first stored as an adjacency matrix, and the matrix is then traversed to compute each node's row sum and column sum. A node for which both sums are 0 is connected to no other node and is connected by no other node, so it can be regarded as an isolated node.
  • The specific method is to form an adjacency matrix from the connection relationships of the nodes within a day.
  • The matrix value corresponding to a connected pair of nodes is 1, and the value corresponding to an unconnected pair is 0.
  • In the adjacency matrix of Figure 2, a node whose row sum and column sum are both 0 can be regarded as an isolated node.
  • The denoising processing of the present invention consists of deleting repeated alarm information and deleting isolated nodes from the alarm information within a day.
  • The feature representation of the present invention is obtained from the text by a word representation model based on the distributional hypothesis.
  • The present invention takes BERT as an example to obtain the word feature representation of the alarm information.
  • BERT provides a word representation based on the distributional hypothesis, mapping natural language words into word vectors in a defined way.
  • Distributed representation means that no individual dimension of the feature vector is interpretable: no dimension corresponds to a specific feature of the text.
  • Each dimension is a new feature that the neural network combines from many different features of the text, so each vector in the word vectors obtained as the feature representation is a combination of many features of the text.
  • The [CLS] mark corresponds to a final hidden state that aggregates the information of all the following words.
  • The [SEP] mark records sentence boundary information; since each alarm message targeted by the present invention is a single sentence, there is only one trailing [SEP] mark. Each remaining token is the j-th word of the alarm information in the i-th sample.
  • The alarm information is passed through three Embedding layers: E_A is the Token Embedding layer of word vectors, E_B is the Segment Embedding layer, and E_C is the Position Embedding layer.
  • E_A is responsible for mapping words into word vectors;
  • E_B is responsible for recording which sentence a word belongs to;
  • E_C is responsible for recording the position information vector of each word.
  • The results of the three Embedding layers are added to form the final Embedding of each word.
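The element-wise sum of the three Embedding layers can be illustrated numerically. This sketch only mirrors the combination described above; the vocabulary size, dimensions, and random tables are hypothetical, whereas the real BERT uses its pre-trained embedding tables.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, dim = 100, 16, 8          # illustrative sizes, not BERT's
E_A = rng.normal(size=(vocab, dim))       # Token Embedding (word vectors)
E_B = rng.normal(size=(2, dim))           # Segment Embedding (sentence id)
E_C = rng.normal(size=(max_len, dim))     # Position Embedding

def embed(token_ids):
    """Final embedding of each token = E_A + E_B + E_C, summed element-wise.
    Alarm messages are a single sentence, so every segment id is 0."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    segments = np.zeros(len(token_ids), dtype=int)
    return E_A[token_ids] + E_B[segments] + E_C[positions]

out = embed([5, 9, 3])                    # e.g. [CLS], one word, [SEP]
```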
  • Suspected root cause node: the LSTM model computes, for each input sample, the probability that the sample is a root cause node.
  • A threshold is set (0.9 in the present invention); any sample whose predicted root cause probability exceeds this threshold is listed as a suspected root cause node.
  • The set of all suspected root cause nodes within one day is the suspected root cause node set.
  • Definition 6, root cause node alarm information knowledge base: count sufficient root cause alarm information in the training set, group alarm information of the same type, and count the occurrences of each type, forming the knowledge base of root cause alarm information categories shown in Table 1.
  • Table 1: Sample of the knowledge base of root cause alarm information categories

        Alarm information category   Alarm information content                              Frequency
        0                            Port 80 communication exception                        0.24
        8                            Url: http://{node number:port number}// access failed  0.12
        1                            Ping packet loss rate 100%, server down                0.08
        ...                          ...                                                    ...
  • The alarm information sent by each suspected root cause node is compared with the knowledge base; a node whose alarm information does not exist in the knowledge base is screened out directly.
  • The frequency of each category is computed as f_i = n_i / N, where f_i is the frequency of alarm information of type i, n_i is the number of occurrences of alarm information of type i, and N is the total number of root cause alarms.
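A small sketch of the knowledge base of Table 1, the frequency f_i = n_i / N, and the screening of suspected nodes against it. The alarm strings, counts, and function name are illustrative, not the patent's actual data.

```python
from collections import Counter

# Hypothetical root-cause alarms collected from a training set
root_cause_alarms = (["Port 80 communication exception"] * 6
                     + ["Ping packet loss rate 100%, server down"] * 2)
counts = Counter(root_cause_alarms)                 # n_i per alarm type
N = sum(counts.values())                            # total root-cause alarms
kb = {alarm: n / N for alarm, n in counts.items()}  # f_i = n_i / N

def screen(node_alarms, kb):
    """Drop suspected nodes whose alarm type is absent from the knowledge base."""
    return {node: a for node, a in node_alarms.items() if a in kb}
```

With these counts, "Port 80 communication exception" gets frequency 6/8 = 0.75, and a node raising an alarm type unknown to the knowledge base is removed from the suspected set.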
  • Root cause node fault propagation range: a node failure in the network topology often causes the nodes connected to it to become abnormal, producing a large number of alarms. To bound this large-scale propagation, a root cause node fault propagation range is defined; the nodes within this range may or may not fail because of the root cause node's failure.
  • Root cause label: the label marking whether a sample comes from the root cause node; a value of 1 means the sample's alarm information was generated by the root cause node, and a value of 0 means it was not.
  • The fault propagation range used in the present invention is the set of nodes within two hops of the root cause node. As shown in Figure 1, the nodes v1, v2, v3, v4, and v5 within two hops of the root cause node v0 constitute its fault propagation range.
  • Suspected root cause nodes can be obtained by the deep learning method alone, but the present invention additionally screens the suspected root cause node set using the knowledge base of root cause node alarm information and the distance relationships between nodes, considering the influence a root cause node may have on the nodes within its fault propagation range.
  • The invention takes the alarm information samples of an e-commerce platform as an example to locate the root cause node among the alarm information of a new day.
  • The pre-preparation flow chart of the embodiment of the present invention is shown in Figure 5.
  • the specific operation steps are as follows:
  • Step 1: collect the historical alarm information sample data within a specific time interval and preprocess these samples by deleting repeated alarm information. For the data of one day, the repeated alarm information is preprocessed as in steps 11 to 15 above.
  • Step 2: after removing the repeated alarm information, delete the isolated nodes. Make the connection relationships of the nodes in the daily alarm information into an adjacency matrix.
  • The matrix entry for a connected pair of nodes is 1, and the entry for an unconnected pair is 0. It is then only necessary to check whether the sum of a node's row and column is 0: if it is, the node is an isolated node.
  • Deleting the sample data of isolated nodes yields a sample data set S_train from which repeated alarm information and isolated nodes have been removed.
  • Each sample in S_train is in the (node, alarm information, root cause label) format. Statistics are then made on the alarm information types of the root cause nodes in S_train, forming a knowledge base V of root cause node alarm information types.
  • Step 3: input S_train into a word representation model based on the distributional hypothesis.
  • BERT is taken as an example, and S_train is input into BERT's pre-trained model.
  • The specific method is to combine the node and alarm information of each sample and pass the combination through three Embedding layers: E_A, the word vector Embedding layer (Token Embedding); E_B, the sentence Embedding layer (Segment Embedding); and E_C, the position Embedding layer (Position Embedding). E_A maps words into word vectors, E_B records which sentence a word belongs to, and E_C records the position information vector of each word. The results of the three Embedding layers are added to form the final Embedding of each word.
  • The parameters of BERT's pre-trained model were set by the Google team, so the final feature representation set T_train can be obtained by directly inputting the node and alarm information.
  • Each sample in T_train has the format (feature representation of the combined node and alarm information, root cause label).
  • Step 4: divide T_train into T_1 and T_0 according to the root cause label, where T_1 holds the samples labelled 1 and T_0 the samples labelled 0.
  • Data expansion is performed on the small root cause node alarm sample set T_1 until the number of samples in T_1 matches the number of samples in T_0.
  • The specific method is:
  • compute the Euclidean distances between all samples in T_1, then record for each sample its k nearest samples (k is 3 in the present invention), and expand as in steps 42 to 46.
  • Step 5: combine T_1 and T_0 into T_new_train, take T_new_train as the training set, input it into the LSTM neural network model, and train to obtain a model whose input is the node-and-alarm-information feature representation and whose output is the probability that the node is predicted to be the root cause.
  • Save the model and its parameters as M.
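The training of step 5 can be sketched with PyTorch. The patent specifies neither a framework nor hyperparameters, so the layer sizes, optimizer, number of steps, and synthetic data below are all illustrative assumptions; only the input/output contract (feature sequence in, root-cause probability out) comes from the text.

```python
import torch
import torch.nn as nn

class RootCauseLSTM(nn.Module):
    """Maps a sequence of feature vectors to P(sample is a root cause node)."""
    def __init__(self, feat_dim=8, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, feat_dim)
        _, (h, _) = self.lstm(x)          # final hidden state of the sequence
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

torch.manual_seed(0)
model = RootCauseLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCELoss()
x = torch.randn(4, 10, 8)                 # 4 synthetic samples, 10 tokens, dim 8
y = torch.tensor([1., 0., 1., 0.])        # root cause labels
for _ in range(5):                        # a few illustrative training steps
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
state = model.state_dict()                # the parameters that step 5 saves as M
```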
  • Step 6: obtain the sample data set S_test of the alarm information collected in practice for a new day.
  • The data in S_test are stored in the (node, alarm information) format.
  • The node and alarm information of each sample in S_test are combined and input into the word representation model based on the distributional hypothesis to generate each sample's feature representation, yielding the feature representation set T_test corresponding to S_test.
  • Step 7: create a new empty list S_s, input all samples of T_test into the model M obtained in step 5, and obtain the probability that each sample is predicted to be a root cause node.
  • A threshold is set (0.9 in the present invention), and the sample nodes whose predictions exceed the threshold are stored in S_s.
  • The set of suspected root cause nodes S_s is thus obtained.
  • Step 8: create a new empty list S_c. Compare the alarm information of all nodes in S_s with the alarm information in V, and remove the nodes whose alarm information type does not exist in V. If S_s is then an empty set, there is no root cause node on this day; if S_s has exactly one element, that node is the root cause node. If S_s has more than one element, make the connection relationships of all nodes in S_test into an adjacency matrix with the weight of every edge set to 1.
  • Dijkstra's method is used to compute the distances between the nodes in S_s and the nodes in S_test, and for each node of S_s the number of nodes within the root cause node fault propagation range (set to within 2 hops in the present invention) is counted. A (node, number of nodes within the fault propagation range) tuple is then formed and added to the list S_c. Take from S_c the element set v_max with the largest count of nodes within the fault propagation range; if the element in v_max is unique, that node is the root cause node. Otherwise, select the root cause node according to the frequency of alarm information in V, choosing the node whose alarm information type has the largest occurrence frequency.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a deep-learning-based network alarm localization method. Historical alarm information data are first preprocessed; a knowledge base of root cause node alarm information categories is built; a word representation model based on the distributional hypothesis maps each node-and-alarm-information combination into a feature representation; the root cause node sample set is expanded to the same number of samples as the non-root-cause node sample set; the expanded data set is used as the training set to train an LSTM model; new data samples are converted to feature representations in the same way and input into the saved model, yielding the set of probabilities that each sample is a root cause node; the samples whose predicted probability exceeds a threshold are filtered out and stored in the set of suspected root cause nodes; the root cause node is then determined from the connection relationships of the suspected root cause node set. The method improves the efficiency of network operation and maintenance and saves cost.

Description

A deep-learning-based network alarm localization method

Technical Field

The invention relates to computer network operation and maintenance, and in particular to a deep-learning-based network alarm localization method.

Background Art

Fault management is an important part of network management, including fault discovery, fault diagnosis, and fault repair; the difficulty lies in determining the source of a fault, that is, the root cause node. The interiors of today's large platforms involve mutual calls among hundreds of systems, and a large amount of alarm information is generated between their network nodes. If a node in the network fails, the nodes that call it or need to use its resources may fail in turn, generating a large amount of alarm information, and such alarms may even be issued earlier than the alarm of the root cause node. These problems make fault localization very difficult. Whenever a network alarm occurs, operation and maintenance staff must correctly judge the correlations among the alarms in the shortest possible time, identify the root cause node, and take corresponding measures. If massive amounts of alarm information occur, manual processing of these alarms consumes substantial human resources, is inefficient, and may even involve repeated alarm information, so an automated technique for locating the root cause of network alarm information is very necessary. Because the network is so large, faults cannot be avoided during its operation. The usual practice is to use alarm correlation methods to find the relationships between alarms, filter out the irrelevant alarm information, and keep the relevant alarm information.

Common alarm correlation methods include rule-based reasoning, artificial intelligence methods, and cause-and-effect diagram methods. Rule-based reasoning requires designing a set of rules for the occurrence of alarm information, which is very difficult to implement, cannot handle situations that were not anticipated, and lacks stability. For general artificial intelligence methods, it is difficult to collect a data set of correlated alarm information and to determine features for the alarm data; moreover, root cause alarm samples are generally few, so a data imbalance problem arises, leading to model overfitting and poor final results. The cause-and-effect diagram method likewise derives the root cause node by rule inference over the connection relationships of the alarm information. None of these methods are real-time: when new alarm information is generated, it cannot be matched in real time against established correlation rules, so the real-time requirements of alarm correlation analysis are difficult to meet.
Summary of the Invention

Object of the invention: the object of the present invention is to provide a deep-learning-based network alarm localization method that improves the efficiency of network operation and maintenance and reduces the losses caused by network failures. In an environment where network nodes call one another and generate a large amount of alarm data, the method filters the alarm information sent by non-root-cause nodes and finally locates the root cause node in real time.

Technical solution: the present invention provides a deep-learning-based network alarm localization method comprising the following steps:

Step 1: collect historical alarm information sample data within a specific time interval, and preprocess these samples by deleting repeated alarm information;

Step 2: after the repeated alarm information is removed, also filter out the samples of isolated nodes. After filtering, classify the alarm information of the nodes in all samples, count the types of root cause alarm information, and build a knowledge base of root cause node alarm information categories;

Step 3: combine the node information and alarm information of each sample and input the combination into a word representation model based on the distributional hypothesis to obtain the sample's feature representation. Each sample then contains two pieces of information: the feature representation and the root cause label;

Step 4: divide the sample data set into two subsets according to whether the root cause label is 1 or 0. For the subset labelled 1, use a sample expansion method to expand its number of samples until it equals the number of samples in the subset labelled 0;

Step 5: use the feature representations of the expanded data set as the input of an LSTM model and the root cause labels as its output, train the model, and save the model and its parameters, obtaining a model whose input is a feature representation and whose output is the probability that the sample is a root cause node;

Step 6: obtain the sample data set of alarm information collected in practice for a new day. Each sample stores the node of the sample and its alarm information. As in step 3, combine the node and alarm information of each sample in the new data set, input the combination into the word representation model based on the distributional hypothesis to generate each sample's feature representation, and obtain the corresponding feature representation set;

Step 7: input the feature representations of all samples into the model saved in step 5 to obtain the probability that each sample is predicted to be a suspected root cause node, and store all sample nodes whose probability exceeds the threshold as the set of suspected root cause nodes;

Step 8: compare the alarm information sent by the nodes of the suspected root cause node set with the alarm types in the knowledge base of root cause node alarm information categories built in step 2, and delete the nodes that do not exist in the knowledge base. If the suspected root cause node set has no elements, there is no root cause node on this day; otherwise the root cause node is selected using the distance relationships between nodes and the knowledge base.
Further, the specific process of step 1 is as follows:
Step 11: store every sample's node alarm information, its node, and its root-cause label as (node, alarm information, root-cause label) triples, and create an empty dictionary;
Step 12: put one day's alarm information into a queue;
Step 13: check whether the queue is empty; if it is, go directly to step 15; otherwise dequeue one element;
Step 14: check whether the dequeued element exists in the dictionary; if it does, do nothing, otherwise add it to the dictionary; return to step 13;
Step 15: store all elements of the dictionary as the sample data set with duplicate alarm information removed.
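Steps 11 to 15 above can be sketched in Python as follows; this is an illustrative sketch only, with toy data shapes assumed, not the patent's implementation:

```python
# Duplicate-alarm removal: triples are queued, then kept only on first
# occurrence, using a dictionary as the seen-set.
from collections import deque

def dedup_alarms(samples):
    """samples: list of (node, alarm_text, root_label) triples for one day."""
    seen = {}                       # step 11: empty dictionary
    queue = deque(samples)          # step 12: enqueue the day's alarms
    while queue:                    # step 13: loop until the queue is empty
        triple = queue.popleft()
        if triple not in seen:      # step 14: keep only the first occurrence
            seen[triple] = True
    return list(seen)               # step 15: deduplicated sample set

day = [("v1", "port 80 down", 1), ("v2", "ping loss", 0),
       ("v1", "port 80 down", 1)]
print(dedup_alarms(day))  # the repeated v1 alarm is dropped
```

Because Python dictionaries preserve insertion order, the surviving samples keep the order in which they were first reported.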
Further, the specific process of step 2 is as follows:
Step 21: build an adjacency matrix from the connection relations of each day's alarm nodes, setting the matrix entry to 1 for connected node pairs and to 0 for unconnected ones;
Step 22: compute the sum of all elements in the row and column representing each node;
Step 23: remove the nodes whose row and column sums are both 0;
Step 24: among the remaining samples, classify the alarm information of all samples, store it into the knowledge base of root-cause-node alarm classes, and compute the occurrence frequency of each class.
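A minimal sketch of the isolated-node filter in steps 21 to 23 (node names and edges are stand-ins for illustration):

```python
# Build an adjacency matrix and drop nodes whose row sum and column sum
# are both zero, i.e. nodes with no incoming or outgoing connection.
import numpy as np

def remove_isolated(nodes, edges):
    idx = {n: i for i, n in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)), dtype=int)
    for u, v in edges:               # step 21: 1 where a connection exists
        A[idx[u], idx[v]] = 1
    return [n for n in nodes         # steps 22-23: keep row sum + col sum > 0
            if A[idx[n], :].sum() + A[:, idx[n]].sum() > 0]

print(remove_isolated(["v0", "v1", "v2"], [("v0", "v1")]))  # ['v0', 'v1']
```

Node v2 has neither row nor column entries, so it is treated as isolated and removed.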
Further, the specific process of step 4 is as follows:
Step 41: let T_1 be the subset with root-cause label 1 and T_0 the subset with label 0; compute the Euclidean distances between all samples in T_1 and record, for each sample, its k nearest samples (k is preferably 3 in the present invention), obtaining each sample's k nearest neighbors (k = 3);
Step 42: create an empty list T_new;
Step 43: if the sample count of T_new plus that of T_1 equals that of T_0, skip directly to step 46;
Step 44: randomly select the node-and-alarm feature representation x of one sample in T_1, then randomly draw one sample from its k nearest neighbors and take its feature representation x′, and compute the new sample's feature representation x_new with the following formula, where rand(0,1) denotes a random value drawn from 0 to 1:
x_new = x + rand(0,1)·|x − x′|
Step 45: build the newly constructed x_new and its root-cause label, which is always 1 here, into a pair, and add the pair to the list T_new as a newly augmented sample; return to step 43;
Step 46: add all samples of T_new to T_1.
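Steps 41 to 46 describe a SMOTE-style over-sampling loop. A hedged sketch follows; the feature dimension, k, and the random seed are assumptions for illustration, and the interpolation uses the patent's formula x_new = x + rand(0,1)·|x − x′|:

```python
# Augment the minority (root-cause) set T_1 until it matches T_0 in size.
import numpy as np

def augment(T1, T0_size, k=3, rng=np.random.default_rng(0)):
    T1 = [np.asarray(x, dtype=float) for x in T1]
    T_new = []
    while len(T_new) + len(T1) < T0_size:          # step 43: stop at balance
        x = T1[rng.integers(len(T1))]              # step 44: random minority sample
        d = [np.linalg.norm(x - y) for y in T1]
        nn = np.argsort(d)[1:k + 1]                # its k nearest neighbours
        xp = T1[rng.choice(nn)]
        x_new = x + rng.random() * np.abs(x - xp)  # x_new = x + rand(0,1)|x - x'|
        T_new.append((x_new, 1))                   # step 45: label fixed to 1
    return T_new

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new = augment(minority, T0_size=10)
print(len(new))  # 6 synthetic samples bring the minority set to 10
```

Note that classical SMOTE interpolates along x′ − x rather than |x − x′|; the sketch follows the patent's variant as written.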
Further, after the suspected-root-cause-node set of step 7 has been generated, denote it S_s; the specific process of step 8 is then:
Step 81: create an empty list S_c, compare the alarm information of all nodes in S_s against the knowledge base of root-cause-node alarm classes generated in step 2, and keep only the nodes of S_s whose alarm class exists in the knowledge base;
Step 82: if S_s is empty, there is no root-cause node on that day and the procedure ends; if S_s contains exactly one node, that node is the root-cause node and the procedure ends;
Step 83: build an adjacency matrix from the connection relations of the nodes of all samples of the day, setting every edge weight to 1; from the connection relations, compute the shortest distances between S_s and the nodes of all samples of the day, count the nodes within the root-cause fault propagation range (set to 2 hops in the present invention), form a pair containing the node and the number of nodes within its fault propagation range, and add it to S_c;
Step 84: take from S_c the element with the largest number of nodes within the fault propagation range; if it is unique, the node of that tuple is the root-cause node; otherwise select the root-cause node by the occurrence frequency of the nodes' alarm classes in the knowledge base, the node whose alarm class has the highest frequency being the root-cause node.
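With unit edge weights, the Dijkstra computation in step 83 reduces to breadth-first hop counting. A minimal sketch (graph and suspect nodes are toy assumptions) scores each suspect by how many nodes lie within the 2-hop propagation range:

```python
# Count, for a source node, the nodes reachable in at most max_hops steps.
from collections import deque

def hops_within(adj, src, max_hops=2):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] == max_hops:        # do not expand past the range
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return len(dist) - 1               # exclude the source itself

adj = {"v0": ["v1", "v2"], "v1": ["v3"], "v2": [], "v3": ["v4"]}
scores = {n: hops_within(adj, n) for n in ["v0", "v1"]}
print(max(scores, key=scores.get))  # 'v0' covers the most nodes within 2 hops
```

In a tie, step 84 falls back to the knowledge-base frequencies rather than this score.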
Beneficial effects: compared with the prior art, the present invention has the following advantages:
1. Conventional techniques for locating the root-cause node of alarms generally rely on alarm correlation and therefore on correlation rules, and different systems may follow different alarm rules. The present invention instead uses deep learning together with a knowledge base built from historical alarms, so no correlation rules need to be designed, and any system that produces alarm information can use it.
2. In network alarm data, root-cause samples are generally far fewer than non-root-cause samples, so artificial-intelligence methods suffer from a class-imbalance problem that makes the final model overfit. The present invention augments the root-cause alarm samples of the training set until they match the non-root-cause samples in number, solving the class-imbalance problem.
Brief Description of the Drawings
Figure 1 is an example diagram of node connection relations;
Figure 2 is an example diagram of an adjacency matrix;
Figure 3 shows the structure of the Embedding layers of the Bert model;
Figure 4 shows the network structure of the Bert model;
Figure 5 is the flow chart of the preparatory stage of the embodiment of the present invention;
Figure 6 is the flow chart of the embodiment of the present invention.
Detailed Description
Given the large volume of alarm information produced by network nodes, the present invention filters out useless and duplicate alarms and accurately locates the node that issued the root-cause alarm, improving the efficiency of network operation and maintenance and reducing the losses caused by network faults. Using deep learning to assist root-cause localization screens out large numbers of non-root-cause nodes and greatly shortens localization time. Methods for alarm localization are currently scarce; root-cause screening is generally performed after an alarm-correlation step, and common correlation methods include case- or rule-based expert reasoning systems, causal graphs, and dependency graphs. The present invention combines deep learning with alarm correlation: a deep learning method screens the alarms into a suspected-root-cause-node set, and the root-cause node is then located within that set according to the characteristics of root-cause nodes.
Host nodes are interconnected, and these connection relations are intricate. If one node suffers a fundamental error, the nodes within its fault propagation range are also likely to fail. As shown in Figure 1, when node v_0 fails, the nodes v_1, v_2, v_3, v_4, and v_5 within its fault propagation range may also fail. Alarm logs from a specific historical time interval are collected to obtain 100 groups of samples carrying alarm nodes and alarm information, each group containing several such samples; each sample is manually labeled as root-cause node or not, and these data serve as the training set.
A faulty node may keep issuing alarms, so within each group the samples with the same node and the same alarm information are deduplicated, keeping only one. The connection relations of each group's nodes are then built into an adjacency matrix to observe whether the group's faulty nodes include isolated nodes, whose alarm samples are deleted. After this denoising, the denoised training set is obtained. The host node number is combined with the node's alarm information, and the word-embedding features of the alarms are obtained by pre-training with a word-representation model based on the distributional hypothesis. Since there is generally one root-cause node or none, while non-root-cause nodes are many, the two classes are highly imbalanced in each group, so a data-augmentation method expands the root-cause samples until their number matches that of the non-root-cause samples; the augmented and original samples together form the training set. An LSTM model is then designed and trained on this set, yielding a model that can screen whether a sample is a root-cause node. A new group of alarm samples is then taken; after the denoising and Bert pre-training steps, their word-embedding features are fed into the trained model, and the nodes of the samples predicted as root-cause nodes form the suspected-node set. From the classes of root-cause alarms in the training set, a (root-cause alarm class, occurrence frequency) knowledge base is compiled. The connection relations of all nodes in the suspected set are built into an adjacency matrix with every edge weight taken as 1, and the Dijkstra single-source shortest-path method counts, for each suspected node, the nodes of the group whose shortest distance to it is less than 2.
To ease understanding of the technical solution of the present invention, some concepts are defined below:
Definition 1 (duplicate alarm information): after a node fails, it issues an alarm; if the fault is not resolved in time, the same alarm is reported again after an interval. Therefore, within the same day, later samples of the same alarm issued by the same node should be removed.
According to this definition, node alarms are stored in (node, alarm information, root-cause label) format and a dictionary sized to one day's alarms is created. The day's (node, alarm information, root-cause label) triples are then put into a queue and traversed. The concrete steps are:
①: if the (node, alarm information, root-cause label) queue is empty, stop; otherwise dequeue the head element and go to ②.
②: check whether the dequeued (node, alarm information, root-cause label) exists in the dictionary; if it does, return to ①, otherwise add it to the dictionary and return to ①.
Definition 2 (isolated node): among the nodes issuing alarms on a day, some may neither be connected to by other nodes nor connect to any other node, as shown in Figure 1; such a node is called an isolated node. The connection relations of all alarm-issuing nodes of the day are first stored as an adjacency matrix, which is then traversed to compute row sums and column sums; a node whose row sum and column sum are both 0 necessarily connects to no node and is connected to by none, and can therefore be regarded as isolated.
Concretely, the adjacency matrix is formed from the day's node connection relations, with entry 1 for connected node pairs and 0 for unconnected ones. As the adjacency matrix of Figure 2 shows, a node whose row sum and column sum are both 0 can be regarded as an isolated node.
Definition 3 (denoising): in the present invention, denoising means deleting duplicate alarms and isolated nodes from one day's alarm information.
Definition 4 (feature representation): for alarm text to be recognized by a computer, its features must be represented in a machine-readable format. The feature representation addressed by the present invention is obtained with a word-representation model based on the distributional hypothesis, using Bert as the example to obtain the word features of the alarms. Bert produces distributional-hypothesis word representations, mapping natural-language words into word vectors by a certain method. In a distributed representation, no single dimension of the feature vector is interpretable, and no dimension corresponds to a concrete feature of the text; each dimension is a new feature composed by the neural network from many different features of the text. Every vector in the feature representation is therefore a combination of many features of the text.
As shown in Figure 3, the [CLS] token corresponds, in the final hidden state, to a marker containing the information of all following words. The [SEP] token records sentence-boundary information, but the alarms addressed by the present invention are single sentences, so there is only one trailing [SEP] token.
Let w_j^i denote the j-th character of the alarm information in the i-th sample. The alarm information passes through three Embedding layers: E_A, the token embedding layer; E_B, the segment embedding layer; and E_C, the position embedding layer. E_A maps each character to a character vector, E_B records which sentence the character belongs to, and E_C records the character's position-information vector. The results of the three Embedding layers are then summed to form each character's final embedding, which is fed into the Transformer structure shown in Figure 4 to obtain the sample's feature representation.
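The three-way embedding sum of Definition 4 can be illustrated with a toy NumPy sketch; the vocabulary size, sequence length, dimensions, and random tables below are assumptions, not Bert's actual parameters:

```python
# The final embedding of each character is the element-wise sum of the
# token (E_A), segment (E_B), and position (E_C) embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, dim = 20, 6, 8
E_A = rng.normal(size=(vocab, dim))       # token embedding table
E_B = rng.normal(size=(2, dim))           # segment embedding table
E_C = rng.normal(size=(seq_len, dim))     # position embedding table

token_ids = np.array([2, 5, 7, 7, 3, 0])  # toy alarm-text character ids
segment_ids = np.zeros(seq_len, dtype=int)  # single sentence -> segment 0
final = E_A[token_ids] + E_B[segment_ids] + E_C[np.arange(seq_len)]
print(final.shape)  # (6, 8): one summed vector per character
```

In the actual model this summed matrix is what enters the Transformer stack of Figure 4.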
Definition 5 (suspected root-cause node): the LSTM model computes, for an input sample, the probability that it is a root-cause node. A threshold is set (0.9 in the present invention); any sample whose predicted probability exceeds this threshold is listed as a suspected root-cause node. The set of all suspected root-cause nodes of one day is the suspected-root-cause-node set.
Definition 6 (knowledge base of root-cause-node alarm classes): a sufficiently large number of root-cause alarms in the training set are tallied, alarms of the same class are grouped, and their occurrences counted, yielding the knowledge base of root-cause alarm classes shown in Table 1.
Table 1: sample of the knowledge base of root-cause alarm classes
Alarm class | Alarm content | Occurrence frequency
0 | Abnormal communication on port 80 | 0.24
8 | Url: http://{node number:port number}// access failed | 0.12
1 | Ping packet loss rate 100%, server down | 0.08
After the suspected-root-cause-node set of a day has been found by the deep learning method, the alarms issued by all suspected nodes are compared with the knowledge base, and any node whose alarm is not present in the knowledge base is screened out directly.
Definition 7 (alarm-class frequency): once the alarm-class knowledge base is built, the occurrences of each alarm class are counted and the frequency of class i is computed with formula (1):
f_i = n_i / N      (1)
where f_i denotes the frequency of alarm class i, n_i the number of occurrences of class-i alarms, and N the total number of root-cause alarms.
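Formula (1) is a plain relative frequency; a one-function sketch (the class ids below are the toy values from Table 1):

```python
# f_i = n_i / N: count each alarm class and divide by the total number
# of root-cause alarms.
from collections import Counter

def class_frequencies(root_alarm_classes):
    counts = Counter(root_alarm_classes)
    N = len(root_alarm_classes)
    return {c: n / N for c, n in counts.items()}

freqs = class_frequencies([0, 0, 8, 1, 0, 8, 0, 0])
print(freqs)  # {0: 0.625, 8: 0.25, 1: 0.125}
```

These frequencies are what step 84 consults to break ties among suspected root-cause nodes.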
Definition 8 (root-cause fault propagation range): when a node in the network topology fails, other nodes connected to it often become abnormal as well, generating large numbers of alarms. Owing to the sound design of network nodes, the anomalies caused by a root-cause fault do not propagate over a wide area; there is usually a root-cause fault propagation range, and the nodes within it may or may not fail as a consequence of the root-cause node's fault.
Definition 9 (root-cause label): the label marking whether a sample is a root-cause node; a value of 1 means the sample's alarm information was produced by a root-cause node, and a value of 0 means it was not.
The fault propagation range used in the present invention is set to the nodes within two hops of the root-cause node. As shown in Figure 1, the nodes v_1, v_2, v_3, v_4, and v_5 within two hops of root-cause node v_0 constitute its fault propagation range.
By the method of the present invention, the suspected root-cause nodes can be obtained with deep learning. To further determine the unique root-cause node among them, the present invention also filters the suspected set using the knowledge base of root-cause alarms and the distance relations among nodes, and takes into account the influence a root-cause node may exert on the nodes within its fault propagation range.
Taking the alarm samples of an e-commerce platform as an example, the present invention determines the root-cause alarm localization for a newly observed day. The flow chart of the preparatory stage of the embodiment is shown in Figure 5. The concrete operating steps are as follows:
Step 1: collect alarm samples from a specific historical time interval and preprocess them by deleting duplicate alarms. For one day's data the duplicate-alarm preprocessing is:
① Store the node alarms in (node, alarm information, root-cause label) format and create an empty dictionary.
② Put the day's alarms into a queue.
③ Check whether the queue is empty; if so, go to ⑤; otherwise dequeue one element.
④ Check whether the dequeued (node, alarm information, root-cause label) exists in the dictionary; if it does, do nothing, otherwise add it; go to ③.
⑤ Store all elements of the dictionary as the sample data set with duplicate alarms removed.
Step 2: after removing duplicate alarms, delete isolated nodes by building an adjacency matrix from the connection relations of the nodes in each day's alarms, with entry 1 for connected node pairs and 0 for unconnected ones. It then suffices to check whether each node's row and column sums are 0: if both sums are 0, the node is isolated, and its sample data are deleted. This yields the sample data set S_train with duplicate alarms and isolated nodes removed, each sample in (node, alarm information, root-cause label) format. The root-cause alarm classes in S_train are then tallied to form the knowledge base V of root-cause-node alarm classes.
Step 3: feed S_train into a word-representation model based on the distributional hypothesis. Taking Bert as the example, S_train is fed into a pre-trained Bert model: each sample's node and alarm information are combined and passed through three Embedding layers, E_A the token embedding layer, E_B the segment embedding layer, and E_C the position embedding layer. E_A maps characters to character vectors, E_B records which sentence a character belongs to, and E_C records the character position vectors. The results of the three layers are summed to form each character's final embedding, which is fed into the Transformer structure of Figure 4 to obtain the sample's feature representation. The parameters of the pre-trained Bert model are those set by the Google team; simply inputting the nodes and alarms yields the final feature-representation set T_train, each sample in (feature representation of node-and-alarm combination, root-cause label) format.
Step 4: split T_train by root-cause label into T_1, the samples labeled 1, and T_0, the samples labeled 0. The small root-cause alarm sample set T_1 is then augmented until its sample count equals that of T_0, as follows:
① Compute the Euclidean distances between all samples in T_1 and record each sample's k nearest samples (k = 3 in the present invention), obtaining each sample's k nearest neighbors (k = 3).
② Create an empty list T_new.
③ If the sample count of T_new plus that of T_1 equals that of T_0, go to ⑥.
④ Randomly select the node-and-alarm feature representation x of one sample in T_1, then randomly draw one sample from its k nearest neighbors and take its feature representation x′, and compute the new sample's feature representation x_new with formula (2), where rand(0,1) denotes a random value drawn from 0 to 1:
x_new = x + rand(0,1)·|x − x′|      (2)
⑤ Build x_new into (x_new, root-cause label) and add it to the list T_new as a newly augmented sample; go to ③.
⑥ Add all samples of T_new to T_1.
Step 5: merge T_1 and T_0 into T_new_train, use T_new_train as the training set for an LSTM neural-network model, and train to obtain the parameters of a model whose input is the node-and-alarm feature representation and whose output is the predicted root-cause probability; save the model and its parameters as M.
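The shape of the model in step 5 can be illustrated with a single-cell LSTM forward pass in NumPy. This is a hedged sketch only: the weights below are random stand-ins, whereas the patent trains them (e.g. with a framework LSTM layer), and the dimensions are assumptions:

```python
# One LSTM cell over a feature sequence, followed by a sigmoid head that
# outputs the probability that the sample is a root-cause node.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_root_cause_prob(seq, params):
    W, U, b, w_out, b_out = params
    h = c = np.zeros(W.shape[0] // 4)
    for x in seq:                          # standard LSTM recurrence
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)        # input, forget, output, candidate
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                  # cell state update
        h = o * np.tanh(c)                 # hidden state
    return sigmoid(w_out @ h + b_out)      # probability of "root cause"

rng = np.random.default_rng(0)
d_in, d_h = 8, 4
params = (rng.normal(size=(4 * d_h, d_in)), rng.normal(size=(4 * d_h, d_h)),
          np.zeros(4 * d_h), rng.normal(size=d_h), 0.0)
p = lstm_root_cause_prob(rng.normal(size=(6, d_in)), params)
print(0.0 < p < 1.0)  # True: a valid probability
```

Step 7 then compares this probability against the 0.9 threshold to decide whether the node is a suspected root-cause node.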
This completes the preparatory stage of the embodiment of the present invention. The flow chart of the embodiment is shown in Figure 6. The concrete operating steps are as follows:
Step 6: obtain the newly collected one-day alarm sample data set S_test, whose entries are stored in (node, alarm information) format. Following step 3, combine the node and alarm information of each sample in S_test and feed the combination into the word-representation model based on the distributional hypothesis to generate each sample's feature representation, obtaining the feature-representation set T_test corresponding to S_test.
Step 7: create an empty list S_s, feed all samples of T_test into the model M obtained in step 5, and obtain the probability that each sample is a root-cause node. With a threshold set (0.9 in the present invention), store the sample nodes whose prediction exceeds the threshold into S_s, obtaining the suspected-root-cause-node set S_s.
Step 8: create an empty list S_c. Compare the alarms of all nodes in S_s with the alarm information in V and remove the nodes whose alarm classes do not exist in V. If S_s is then empty, there is no root-cause node that day; otherwise, if S_s has exactly one element, that node is the root-cause node. If S_s has more than one element, build the connection relations of all nodes in S_test into an adjacency matrix with every edge weight set to 1. To obtain the distances between nodes, use the Dijkstra method to compute the distances between the nodes in S_s and those in S_test, and count the nodes whose distance lies within the root-cause fault propagation range (set to 2 hops in the present invention). Form (node, number of nodes within the fault propagation range) tuples and add them to the list S_c. Take from S_c the element set v_max with the largest number of nodes within the fault propagation range; if v_max has a single element, that node is the root-cause node. If it is not unique, select the root-cause node by the occurrence frequency of alarms in V, the node whose alarm class has the highest frequency being the root-cause node.

Claims (5)

  1. A deep learning based network alarm localization method, characterized by comprising the following steps:
    Step 1: collecting alarm-information sample data from a specific historical time interval and preprocessing these samples by deleting duplicate alarm information;
    Step 2: after removing duplicate alarms, also filtering out the samples of isolated nodes; after filtering, classifying the alarm information of the nodes in all samples, compiling statistics on the classes of root-cause alarms, and building a knowledge base of root-cause-node alarm classes;
    Step 3: combining each sample's node information and alarm information and feeding the combination into a word-representation model based on the distributional hypothesis to obtain the sample's feature representation, each sample then containing both the feature representation and the root-cause label;
    Step 4: splitting the sample data set into two subsets according to the root-cause label, 1 or 0, and, for the samples in the subset labeled 1, using a sample-augmentation method to expand their number until it equals that of the subset labeled 0;
    Step 5: using the feature representations of the augmented data set as the input of an LSTM model and the root-cause labels as its output, training the model, and saving the model and its parameters, thereby obtaining a model whose input is a feature representation and whose output is the probability that the sample is a root-cause node;
    Step 6: obtaining a newly collected one-day alarm-information sample data set in which each sample stores its node and its alarm information, and, following the method of step 3, combining the node and alarm information of each sample in the new data set and feeding the combination into the word-representation model based on the distributional hypothesis to generate each sample's feature representation, obtaining the corresponding feature-representation set;
    Step 7: feeding the feature representations of all samples into the model saved in step 5 to obtain the set of probabilities that each sample is a suspected root-cause node, and storing all sample nodes whose probability exceeds the threshold as the suspected-root-cause-node set;
    Step 8: comparing the alarms issued by the nodes in the suspected-root-cause-node set against the alarm classes in the knowledge base built in step 2 and deleting the nodes not present in the knowledge base; if the suspected set has no element, there is no root-cause node on that day; otherwise the root-cause node is picked out using the distance relations among nodes and the knowledge base.
  2. The deep learning based network alarm localization method according to claim 1, characterized in that the specific process of step 1 is:
    Step 11: storing every sample's node alarm information, node, and root-cause label as (node, alarm information, root-cause label) triples, and creating an empty dictionary;
    Step 12: putting one day's alarm information into a queue;
    Step 13: checking whether the queue is empty; if it is, going directly to step 15; otherwise dequeuing one element;
    Step 14: checking whether the dequeued element exists in the dictionary; if it does, doing nothing, otherwise adding it to the dictionary, and returning to step 13;
    Step 15: storing all elements of the dictionary as the sample data set with duplicate alarm information removed.
  3. The deep learning based network alarm localization method according to claim 1, characterized in that the specific process of step 2 is:
    Step 21: building an adjacency matrix from the connection relations of each day's alarm nodes, setting the matrix entry to 1 for connected node pairs and to 0 for unconnected ones;
    Step 22: computing the sum of all elements in the row and column representing each node;
    Step 23: removing the nodes whose row and column sums are both 0;
    Step 24: among the remaining samples, classifying the alarm information of all samples, storing it into the knowledge base of root-cause-node alarm classes, and computing the occurrence frequency of each class.
  4. The deep learning based network alarm localization method according to claim 1, characterized in that the specific process of step 4 is:
    Step 41: letting T_1 be the subset with root-cause label 1 and T_0 the subset with label 0, computing the Euclidean distances between all samples in T_1, and recording each sample's k nearest samples, obtaining each sample's k nearest neighbors;
    Step 42: creating an empty list T_new;
    Step 43: if the sample count of T_new plus that of T_1 equals that of T_0, skipping directly to step 46;
    Step 44: randomly selecting the node-and-alarm feature representation x of one sample in T_1, then randomly drawing one sample from its k nearest neighbors and taking its feature representation x′, and computing the new sample's feature representation x_new with the following formula, where rand(0,1) denotes a random value drawn from 0 to 1:
    x_new = x + rand(0,1)·|x − x′|
    Step 45: building the newly constructed x_new and its root-cause label, which is always 1 here, into a pair, adding it to the list T_new as a newly augmented sample, and returning to step 43;
    Step 46: adding all samples of T_new to T_1.
  5. The deep learning based network alarm localization method according to claim 1, characterized in that, the suspected-root-cause-node set of step 7 having been generated and denoted S_s, the specific process of step 8 is:
    Step 81: creating an empty list S_c, comparing the alarm information of all nodes in S_s against the knowledge base of root-cause-node alarm classes generated in step 2, and keeping only the nodes of S_s whose alarm class exists in the knowledge base;
    Step 82: if S_s is empty, there is no root-cause node on that day and the procedure ends; if S_s has exactly one node, that node is the root-cause node and the procedure ends;
    Step 83: building an adjacency matrix from the connection relations of the nodes of all samples of the day with every edge weight set to 1, computing from the connection relations the shortest distances between S_s and the nodes of all samples of the day, counting the nodes within the root-cause fault propagation range, forming a pair containing the node and the number of nodes within its fault propagation range, and adding it to S_c;
    Step 84: taking from S_c the element with the largest number of nodes within the fault propagation range; if it is unique, the node of that tuple is the root-cause node; otherwise selecting the root-cause node by the occurrence frequency of the nodes' alarm classes in the knowledge base, the node whose alarm class has the highest frequency being the root-cause node.
PCT/CN2020/108816 2020-07-07 2020-09-28 一种基于深度学习的网络告警定位方法 WO2022007108A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010649254.1 2020-07-07
CN202010649254.1A CN112003718B (zh) 2020-09-25 2020-09-25 一种基于深度学习的网络告警定位方法

Publications (1)

Publication Number Publication Date
WO2022007108A1 true WO2022007108A1 (zh) 2022-01-13

Family

ID=73467004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108816 WO2022007108A1 (zh) 2020-07-07 2020-09-28 一种基于深度学习的网络告警定位方法

Country Status (2)

Country Link
CN (1) CN112003718B (zh)
WO (1) WO2022007108A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637649A (zh) * 2022-03-01 2022-06-17 清华大学 一种基于oltp数据库系统的告警根因分析方法及装置
CN115051907A (zh) * 2022-06-10 2022-09-13 中国电信股份有限公司 告警日志数据的处理方法及装置、非易失性存储介质
CN115086148A (zh) * 2022-07-15 2022-09-20 中国电信股份有限公司 光网络告警处理方法、系统、设备及存储介质
CN115150253A (zh) * 2022-06-27 2022-10-04 杭州萤石软件有限公司 一种故障根因确定方法、装置及电子设备
CN116991620A (zh) * 2023-08-03 2023-11-03 北京优特捷信息技术有限公司 一种解决方案确定方法、装置、设备及介质
CN117194459A (zh) * 2023-09-22 2023-12-08 天翼爱音乐文化科技有限公司 基于运维事件的运维知识库更新方法、系统、装置与介质
CN117527527A (zh) * 2024-01-08 2024-02-06 天津市天河计算机技术有限公司 多源告警处理方法和系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254254B (zh) * 2021-07-14 2021-11-30 南京中兴新软件有限责任公司 系统故障的根因定位方法、装置、存储介质及电子装置
CN113780597B (zh) * 2021-09-16 2023-04-07 睿云奇智(重庆)科技有限公司 影响传播关系模型构建和告警影响评估方法、计算机设备、存储介质
CN113901126A (zh) * 2021-09-18 2022-01-07 中兴通讯股份有限公司 告警因果关系挖掘方法、告警因果挖掘装置及存储介质
CN114124676B (zh) * 2021-11-19 2024-04-02 南京邮电大学 一种面向网络智能运维系统的故障根因定位方法及其系统
CN114968727B (zh) * 2022-06-29 2023-02-10 北京柏睿数据技术股份有限公司 基于人工智能运维的数据库贯穿基础设施的故障定位方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016070642A1 (zh) * 2014-11-05 2016-05-12 中兴通讯股份有限公司 一种多故障数据解耦方法和装置
CN110147387A (zh) * 2019-05-08 2019-08-20 腾讯科技(上海)有限公司 一种根因分析方法、装置、设备及存储介质
CN110351118A (zh) * 2019-05-28 2019-10-18 华为技术有限公司 根因告警决策网络构建方法、装置和存储介质
CN110609759A (zh) * 2018-06-15 2019-12-24 华为技术有限公司 一种故障根因分析的方法及装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106603293A (zh) * 2016-12-20 2017-04-26 南京邮电大学 虚拟网络环境下一种基于深度学习的网络故障诊断方法
CN108540330B (zh) * 2018-04-24 2021-04-02 南京邮电大学 一种异构网络环境下的基于深度学习的网络故障诊断方法
CN109034368B (zh) * 2018-06-22 2021-10-15 北京航空航天大学 一种基于dnn的复杂设备多重故障诊断方法
CN110309009B (zh) * 2019-05-21 2022-05-13 北京云集智造科技有限公司 基于情境的运维故障根因定位方法、装置、设备及介质
CN110351150B (zh) * 2019-07-26 2022-08-16 中国工商银行股份有限公司 故障根源确定方法及装置、电子设备和可读存储介质
CN111342997B (zh) * 2020-02-06 2022-08-09 烽火通信科技股份有限公司 一种深度神经网络模型的构建方法、故障诊断方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016070642A1 (zh) * 2014-11-05 2016-05-12 中兴通讯股份有限公司 一种多故障数据解耦方法和装置
CN110609759A (zh) * 2018-06-15 2019-12-24 华为技术有限公司 一种故障根因分析的方法及装置
CN110147387A (zh) * 2019-05-08 2019-08-20 腾讯科技(上海)有限公司 一种根因分析方法、装置、设备及存储介质
CN110351118A (zh) * 2019-05-28 2019-10-18 华为技术有限公司 根因告警决策网络构建方法、装置和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG ZHAOPENG, LIN YEGUI;LUO FEIPENG: "Research and Application of Log-Based Machine Learning Method to Realize Fast Delimitation of Faults", DESIGNING TECHNIQUES OF POSTS AND TELECOMMUNICATIONS, DESIGNING INSTITUTE OF MPT OF CHINA, CN, no. 12, 20 December 2018 (2018-12-20), CN , pages 23 - 26, XP055885506, ISSN: 1007-3043, DOI: 10.12045/j.issn.1007-3043.2018.12.005 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637649A (zh) * 2022-03-01 2022-06-17 清华大学 一种基于oltp数据库系统的告警根因分析方法及装置
CN115051907A (zh) * 2022-06-10 2022-09-13 中国电信股份有限公司 告警日志数据的处理方法及装置、非易失性存储介质
CN115150253A (zh) * 2022-06-27 2022-10-04 杭州萤石软件有限公司 一种故障根因确定方法、装置及电子设备
CN115150253B (zh) * 2022-06-27 2024-03-08 杭州萤石软件有限公司 一种故障根因确定方法、装置及电子设备
CN115086148A (zh) * 2022-07-15 2022-09-20 中国电信股份有限公司 光网络告警处理方法、系统、设备及存储介质
CN115086148B (zh) * 2022-07-15 2024-01-30 中国电信股份有限公司 光网络告警处理方法、系统、设备及存储介质
CN116991620A (zh) * 2023-08-03 2023-11-03 北京优特捷信息技术有限公司 一种解决方案确定方法、装置、设备及介质
CN116991620B (zh) * 2023-08-03 2024-02-23 北京优特捷信息技术有限公司 一种解决方案确定方法、装置、设备及介质
CN117194459A (zh) * 2023-09-22 2023-12-08 天翼爱音乐文化科技有限公司 基于运维事件的运维知识库更新方法、系统、装置与介质
CN117194459B (zh) * 2023-09-22 2024-05-10 天翼爱音乐文化科技有限公司 基于运维事件的运维知识库更新方法、系统、装置与介质
CN117527527A (zh) * 2024-01-08 2024-02-06 天津市天河计算机技术有限公司 多源告警处理方法和系统
CN117527527B (zh) * 2024-01-08 2024-03-19 天津市天河计算机技术有限公司 多源告警处理方法和系统

Also Published As

Publication number Publication date
CN112003718B (zh) 2021-07-27
CN112003718A (zh) 2020-11-27

Similar Documents

Publication Publication Date Title
WO2022007108A1 (zh) 一种基于深度学习的网络告警定位方法
WO2019238109A1 (zh) 一种故障根因分析的方法及装置
Zhang et al. Identification of core-periphery structure in networks
CN112217674B (zh) 基于因果网络挖掘和图注意力网络的告警根因识别方法
WO2022134794A1 (zh) 新闻事件的舆情处理方法及装置、存储介质、计算机设备
CN106570513A (zh) 大数据网络系统的故障诊断方法和装置
CN106628097A (zh) 一种基于改进径向基神经网络的船舶设备故障诊断方法
CN107506389A (zh) 一种提取职位技能需求的方法和装置
WO2023029654A1 (zh) 一种故障根因确定方法、装置、存储介质及电子装置
CN110032463A (zh) 一种基于贝叶斯网络的系统故障定位方法和系统
Thaler et al. Towards a neural language model for signature extraction from forensic logs
CN110110334A (zh) 一种基于自然语言处理的远程会诊记录文本纠错方法
CN116541510A (zh) 一种基于知识图谱的故障案例推荐方法
CN116225760A (zh) 一种基于运维知识图谱的实时根因分析方法
CN110598787B (zh) 一种基于自定步长学习的软件bug分类方法
CN113254675A (zh) 基于自适应少样本关系抽取的知识图谱构建方法
CN117034143A (zh) 一种基于机器学习的分布式系统故障诊断方法及装置
Amani et al. A case-based reasoning method for alarm filtering and correlation in telecommunication networks
CN117221087A (zh) 告警根因定位方法、装置及介质
CN112507720A (zh) 基于因果语义关系传递的图卷积网络根因识别方法
Li et al. Contrastive deep nonnegative matrix factorization for community detection
CN114385403A (zh) 基于双层知识图谱架构的分布式协同故障诊断方法
CN111737107B (zh) 一种基于异质信息网络的重复缺陷报告检测方法
CN114465875A (zh) 故障处理方法及装置
CN114157553A (zh) 一种数据处理方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944605

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20944605

Country of ref document: EP

Kind code of ref document: A1