CN115801557A - Fault root cause positioning method and device and readable storage medium - Google Patents

Fault root cause positioning method and device and readable storage medium

Info

Publication number
CN115801557A
Authority
CN
China
Prior art keywords
fault
node
matrix
root
root cause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111055032.8A
Other languages
Chinese (zh)
Inventor
王凯
范晓晖
李宜铮
罗达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111055032.8A priority Critical patent/CN115801557A/en
Publication of CN115801557A publication Critical patent/CN115801557A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the application provides a fault root cause positioning method, a fault root cause positioning device and a readable storage medium, wherein the method comprises the following steps: determining node topology information according to the network topology graph; acquiring fault weight information, wherein the fault weight information represents the probability distribution condition of fault root causes in the network topology graph; and determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.

Description

Fault root cause positioning method and device and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of communication, in particular to a fault root cause positioning method and device and a readable storage medium.
Background
An industrial field network is a general term for the network access technologies used by end-side nodes at industrial sites. It connects the terminals, machines, sensors, systems and other equipment at the site, and serves the site's sensing, data, positioning, control and management needs. The main research subjects of industrial field networks include passive communication (e.g., Radio Frequency Identification (RFID)), short-range communication (e.g., Star Flash), and Time Sensitive Networking (TSN). As industrial field networks spread into buildings, hospitals, shopping malls and supermarkets, and industrial parks, the difficulty of managing heterogeneous networks and the low efficiency of network operation and maintenance have become increasingly prominent, and enterprise users have a growing demand for unified and convenient network management.
As shown in FIG. 1, in an RFID scenario of an industrial field network, the complete node topology includes an electronic tag (tag), a reader antenna, a reader, and a platform (host computer). When a fault occurs, dozens of alarms are often raised at the same time, and accurately and quickly locating the cause of the fault among these alarms is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application aims to provide a fault root cause positioning method, a fault root cause positioning device and a readable storage medium, and solves the problem of how to accurately and quickly position the fault root cause.
In a first aspect, a method for locating a fault root cause is provided, including:
determining node topology information according to the network topology graph;
acquiring fault weight information, wherein the fault weight information represents the probability distribution condition of fault root causes in the network topological graph;
and determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.
Optionally, the step of determining node topology information according to the network topology map includes:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and/or the fault-related path, wherein elements in the global correlation matrix are used for representing the fault correlation relation of each node in the network topology diagram aiming at a specific fault;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
Optionally, the step of obtaining the fault weight information includes:
determining whether a local fault database of a node user or a fault root cause data model of a node manufacturer has been updated;
if the local fault database of the node user or the fault root cause data model of the node manufacturer has been updated, updating the fault weight information by using the updated local fault database or the updated fault root cause data model; otherwise, using the existing fault weight information.
Optionally, the method further comprises:
reporting a local fault database to a node manufacturer, wherein the local fault database is generated by a node user based on a digital twin;
and obtaining an updated fault root cause data model from the node manufacturer, wherein the fault root cause data model is updated by the node manufacturer based on the local fault databases reported by a plurality of node users.
Optionally, the step of determining a fault root cause probability of each node in the network topology according to the node topology information and the fault weight information includes:
calculating an objective function X' by the formula shown in the original publication as image BDA0003254269410000031 (not reproduced in this text), where n denotes the number of fault alarms and n is greater than 1;
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of node i by prop_i = ∑∑ f_i * X;
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
Optionally, the method further comprises:
updating a fault root cause probability distribution matrix according to a local sample, wherein the local sample comprises: a fault alarm and a root cause of the fault alarm.
Optionally, the step of updating the fault root probability distribution matrix according to the local sample includes:
generating a target fault root probability distribution matrix in a specified time range according to samples in a preset range;
updating the fault root probability distribution matrix according to preset weight parameters and the target fault root probability distribution matrix to obtain an updated fault root probability distribution matrix;
obtaining a prediction probability after normalization processing according to the product of the updated fault root cause probability distribution matrix and the node topology matrix;
constructing a loss function according to the prediction probability;
and training a fault root probability distribution matrix according to the local sample and the loss function, and optimizing the weight parameters.
In a second aspect, a fault root cause locating device is provided, which includes:
the node topology module is used for determining node topology information according to the network topology graph;
the fault weighting module is used for acquiring fault weighting information, and the fault weighting information represents the probability distribution condition of fault root causes in the network topological graph;
and the root cause positioning module is used for determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.
Optionally, the node topology module is further configured to:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and/or the fault-related path, wherein elements in the global correlation matrix are used for representing the fault correlation of each node in the network topology diagram aiming at a specific fault;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
Optionally, the fault weighting module is further configured to:
judging whether a local fault database of a node user side or a fault root data model of a node manufacturing side is updated or not;
if the local fault database of the node user side or the fault root factor data model of the node manufacturing side is updated, the updated local fault database or the updated fault root factor data model is used for updating the fault weight information; otherwise, the existing failure weight information is used.
Optionally, the apparatus further comprises:
the updating module is used for reporting a local fault database to the node manufacturing party, wherein the local fault database is generated by the node using party based on the digital twinning;
and acquiring an updated fault root data model from the node manufacturer, wherein the fault root data model is obtained by updating the local fault database reported by a plurality of node users by the node manufacturer.
Optionally, the root cause location module is further configured to:
calculating an objective function X' by the formula shown in the original publication as image BDA0003254269410000061 (not reproduced in this text), where n denotes the number of fault alarms and n is greater than 1;
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of node i by prop_i = ∑∑ f_i * X;
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
In a third aspect, an electronic device is provided, including: a processor, a memory and a program stored on the memory and executable on the processor, which program, when executed by the processor, performs the steps of the method according to the first aspect.
In a fourth aspect, a readable storage medium is provided, on which a program is stored which, when being executed by a processor, carries out steps comprising the method according to the first aspect.
In the embodiment of the application, the fault root cause probability of each node in the network topology graph is determined according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause, and the dependence on a specific mechanism model is reduced, so that the fault cause can be accurately and quickly positioned in a plurality of alarms.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of an RFID service in a field network;
fig. 2 is a flowchart of a fault root cause locating method provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a fault root cause locating device according to an embodiment of the present application;
fig. 4 is a schematic diagram of a process executed by each module in the fault root cause locating apparatus according to the embodiment of the present application;
FIG. 5 is a schematic of a topology of a node;
FIG. 6 is a schematic of a topology of nodes in an RFID scenario;
fig. 7 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprises," "comprising," or any other variation thereof, in the description and claims of this application, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the use of "and/or" in the specification and claims means at least one of the connected objects; e.g., A and/or B covers three cases: A alone, B alone, and both A and B.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
It is noted that the techniques described in the embodiments of the present application are not limited to Long Term Evolution (LTE)/LTE-Advanced (LTE-A) systems, but may also be used in other wireless communication systems, such as Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Orthogonal Frequency Division Multiple Access (OFDMA), Single-carrier Frequency Division Multiple Access (SC-FDMA), and other systems. The terms "system" and "network" are often used interchangeably in embodiments of the present application, and the described techniques may be used for both the above-mentioned systems and radio technologies, as well as for other systems and radio technologies. The following description uses a New Radio (NR) system as an example, and NR terminology is used in much of the description below, although the techniques may also be applied beyond NR systems, for example to 6th Generation (6G) communication systems.
Current technical solutions for locating the root cause of node faults in industrial field networks fall into the following four types:
1. Node detection: following the alarm prompts, the nodes that may be related to an alarm are tested one by one, and candidate faulty nodes are ruled out one by one.
2. Mechanism modeling: based on the node mechanism, faults and their characteristics are simulated to build a fault characteristic library, which is then compared with the actual fault to determine the fault source.
3. Data analysis: the fault cause is inferred from historical data by statistical analysis.
4. Digital twin: data such as physical models, sensor updates and operation history are fully used to integrate a multidisciplinary, multi-physical-quantity, multi-scale and multi-probability simulation process, and the mapping is completed in a virtual space so as to reflect the full life cycle of the corresponding physical equipment.
When an industrial field network fault occurs, a conventional platform often raises dozens of related alarms or errors. How quickly the fault source can be determined from these alarms or errors determines the efficiency of fault removal.
The existing root cause positioning technology has the following defects:
1. Disadvantages of node detection: in an industrial field network, a single node failure may trigger dozens of abnormal alarms on related service indicators. Testing the related nodes for every alarm consumes a great deal of time and effort, is inefficient, and gives a poor user experience.
2. Disadvantages of mechanism modeling: it demands a great deal of accumulated domain expertise, which generally only a few high-quality node vendors possess; because this expertise is a competitive asset, vendors are unwilling to share it. Node procurement is therefore constrained, and nodes can only be bought at high prices from a few vendors. In addition, industrial field networks span several fields (passive communication, short-range communication and TSN) whose high-quality suppliers differ, so field network nodes of different types come with different operation and maintenance systems, which increases system management cost.
3. Disadvantages of data analysis: the samples of a single enterprise are limited, and insufficient samples cause statistical bias; because data analysis requires sufficient data accumulation, real-time positioning is difficult to achieve.
4. Disadvantages of existing digital twin technology: across the full life cycle of a node, digital twins are rarely applied across life stages, and most are digital twins of a single life stage. The different life stages of a node often depend on different vendors, so information is difficult to circulate between them. Through means such as permission control, a digital twin can help promote the circulation of information across life stages.
Referring to fig. 2, an embodiment of the present application provides a method for locating a fault root cause, which includes the specific steps of: step 201, step 202 and step 203.
Step 201: determining node topology information according to the network topology graph;
Taking a topological graph built from the nodes of an industrial field network as an example, topological subgraphs are constructed from the node topological graph, yielding a plurality of topological subgraphs each capable of completing a basic service. Combinations of these subgraphs can express the topological subgraph corresponding to a complex service. This gives an abstract representation of the industrial field network nodes and establishes the correspondence between services and topological subgraphs (nodes).
Step 202: acquiring fault weight information, wherein the fault weight information represents the probability distribution condition of fault root causes in the network topology graph;
The fault weight information includes, for a given fault, the probability that each node in the network topology is the fault root cause. For example, referring to fig. 5, for fault A the fault root cause probability of node 1 is 30%, of node 2 is 20%, of node 5 is 20%, of node 3 is 60%, of node 6 is 30%, and of node 4 is 0%. These probabilities may be determined based on historical data. It can be understood that, when no historical data is available, the fault root cause probabilities of the nodes may be set to an average value; taking fig. 5 as an example, the fault root cause probability of each of nodes 1 to 6 may be set to 20%.
The fault weight information may be determined based on the probability, provided by the node user, that each type of service fault is caused by each node related to that service; alternatively, the initial function may be an average distribution function. As the industrial field network is operated and maintained, the fault weight information is adjusted based on accumulated data, so that it becomes better suited to the local environment and the accuracy of root cause positioning improves.
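The initialization and data-driven adjustment described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function names, the node names, and the choice to estimate probabilities as normalized counts of confirmed root causes (so that they sum to 1) are all assumptions.

```python
from collections import Counter


def uniform_weights(nodes):
    """Assumed starting point: every node is equally likely to be the root cause."""
    return {node: 1.0 / len(nodes) for node in nodes}


def weights_from_history(nodes, confirmed_root_causes):
    """Estimate per-node root cause probabilities from confirmed historical root causes.

    Falls back to the uniform distribution when no usable history exists.
    """
    counts = Counter(confirmed_root_causes)
    total = sum(counts[node] for node in nodes)
    if total == 0:
        return uniform_weights(nodes)
    return {node: counts[node] / total for node in nodes}


# Hypothetical RFID node names and a hypothetical fault history.
nodes = ["tag", "reader_antenna", "reader", "platform"]
initial = uniform_weights(nodes)
history = ["reader", "reader", "reader_antenna", "reader"]
learned = weights_from_history(nodes, history)
```

With no history every node starts at the same weight; once confirmed root causes accumulate, the weights shift toward the nodes that actually caused faults at this site.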
Step 203: and determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.
In an embodiment of the present application, the step of determining a node topology matrix according to a network topology map includes:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and the fault-related path;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
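The three steps above can be illustrated with a small sketch. The text does not define the exact form of the matrices, so everything here is an assumption: the topology matrix is taken to be a directed adjacency matrix built from service paths, the global correlation matrix is taken to be a 0/1 mask over nodes on the fault-related path, and the "product" is taken to be element-wise. All function and node names are hypothetical.

```python
def adjacency_from_paths(nodes, paths):
    """Build a directed adjacency matrix from service paths (assumed topology matrix)."""
    index = {name: k for k, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for path in paths:
        for src, dst in zip(path, path[1:]):
            matrix[index[src]][index[dst]] = 1
    return matrix


def correlation_mask(nodes, fault_path):
    """Assumed global correlation matrix: 1 where both nodes lie on the fault-related path."""
    on_path = set(fault_path)
    return [[1 if a in on_path and b in on_path else 0 for b in nodes] for a in nodes]


def elementwise_product(left, right):
    """Element-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]


nodes = ["tag", "reader_antenna", "reader", "platform"]
topology = adjacency_from_paths(nodes, [["tag", "reader_antenna", "reader", "platform"]])
correlation = correlation_mask(nodes, ["reader_antenna", "reader"])
node_topology = elementwise_product(topology, correlation)
```

In this toy case only the link from the reader antenna to the reader survives the masking, which is the part of the topology relevant to the hypothetical fault.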
In an embodiment of the present application, the step of obtaining the fault weighting function includes:
judging whether a local fault database of a node user side or a fault root data model of a node manufacturing side is updated or not;
if the local fault database of the node user side or the fault root factor data model of the node manufacturing side is updated, the updated local fault database or the updated fault root factor data model is used for updating the fault weight function; otherwise, the existing fault weighting function is used.
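The update check above amounts to a simple cache-invalidation rule. The sketch below assumes each data source exposes a version number and that retraining is supplied as a callback; these names and the versioning scheme are illustrative assumptions, not part of the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FaultWeightState:
    weights: Dict[str, float]
    local_db_version: int       # hypothetical version of the user's local fault database
    vendor_model_version: int   # hypothetical version of the manufacturer's data model


def refresh_fault_weights(state: FaultWeightState,
                          local_db_version: int,
                          vendor_model_version: int,
                          retrain: Callable[[], Dict[str, float]]) -> bool:
    """Retrain only when either source changed; otherwise keep the existing weights."""
    changed = (local_db_version != state.local_db_version
               or vendor_model_version != state.vendor_model_version)
    if changed:
        state.weights = retrain()
        state.local_db_version = local_db_version
        state.vendor_model_version = vendor_model_version
    return changed


state = FaultWeightState({"reader": 0.5, "platform": 0.5}, 1, 1)
unchanged = refresh_fault_weights(state, 1, 1, lambda: {"reader": 1.0, "platform": 0.0})
kept = dict(state.weights)
updated = refresh_fault_weights(state, 2, 1, lambda: {"reader": 0.75, "platform": 0.25})
```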
Based on a continuous learning mechanism, the fault weight information is periodically retrained on a continuously updated training data set, giving the method self-learning and self-adaptive capabilities, so that the more it is used, the smarter it becomes.
In one embodiment of the present application, the method further comprises:
reporting, to a node manufacturer, a local fault database generated by a node user based on a digital twin;
and obtaining an updated fault root cause data model from the node manufacturer, wherein the fault root cause data model is updated by the node manufacturer based on the local fault databases reported by a plurality of node users.
In an embodiment of the present application, the step of determining, according to the node topology matrix and the fault weighting function, that each node in the network topology graph is a fault root probability includes:
calculating an objective function X' of fault alarms 1 to n by the following formula (shown in the original publication as image BDA0003254269410000121, which is not reproduced in this text);
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of node i by the following formula:
prop_i = ∑∑ f_i * X
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
The element-wise product of the fault root cause probability distribution matrix and the topology matrix is a probability matrix, and the normalized probability matrix represents the probability that each node of the industrial field network is the root cause of a service fault. The probability matrices of a plurality of service faults occurring in the same time period are superposed. The true fault root cause is traced back in the attribution of many of these service faults, whereas other nodes may be related to the attribution of one particular service fault but are unlikely to be traced back by most of the service faults, so the true root cause stands out.
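The superposition idea can be sketched in a few lines. Because the formula for X' is only available as an image in the original, the sketch assumes X' is the sum, over alarms, of the element-wise product of each alarm's node topology matrix and its root cause probability matrix; it also assumes that normalization divides by the total and that the integral filter f_i is a mask selecting row i. These are illustrative choices, and the matrices are toy values.

```python
def hadamard(left, right):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(rl, rr)] for rl, rr in zip(left, right)]


def objective(node_topology_matrices, root_cause_matrices):
    """Assumed form of X': superpose the per-alarm probability matrices."""
    size = len(node_topology_matrices[0])
    total = [[0.0] * size for _ in range(size)]
    for topo, prob in zip(node_topology_matrices, root_cause_matrices):
        term = hadamard(topo, prob)
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return total


def normalize(matrix):
    """Scale the matrix so that all entries sum to 1 (assumed normalization)."""
    s = sum(sum(row) for row in matrix)
    return [[v / s for v in row] for row in matrix] if s else matrix


def row_filter(size, i):
    """Assumed integral filter f_i: a mask that keeps only row i."""
    return [[1 if r == i else 0 for _ in range(size)] for r in range(size)]


def root_cause_probability(x, i):
    """prop_i: double sum over the Hadamard product of f_i and X."""
    return sum(sum(row) for row in hadamard(row_filter(len(x), i), x))


# Two alarms over three nodes; node 1 is implicated by both alarms.
m1 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
h1 = [[0.5, 0, 0], [0, 0.5, 0], [0, 0, 0]]
m2 = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
h2 = [[0, 0, 0], [0, 0.5, 0], [0, 0, 0.5]]
x = normalize(objective([m1, m2], [h1, h2]))
probabilities = [root_cause_probability(x, i) for i in range(3)]
predicted = max(range(3), key=lambda i: probabilities[i])
```

Each alarm alone cannot separate node 1 from its neighbor, but after superposition node 1 receives twice the weight of the others and is selected as the predicted root cause.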
Optionally, the method further comprises:
updating a fault root probability distribution matrix according to a local sample, wherein the local sample comprises: a fault alarm and a root cause of the fault alarm.
Optionally, the step of updating the fault root probability distribution matrix according to the local sample includes:
generating a target fault root probability distribution matrix in a specified time range according to samples in a preset range;
updating the fault root probability distribution matrix according to preset weight parameters and the target fault root probability distribution matrix to obtain an updated fault root probability distribution matrix;
obtaining a prediction probability after normalization processing according to the product of the updated fault root probability distribution matrix and the node topology matrix;
constructing a loss function according to the prediction probability;
and training a fault root probability distribution matrix according to the local sample and the loss function, and optimizing the weight parameter.
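A rough sketch of this update loop follows. The application does not specify the update rule, the loss, or the training procedure, so the sketch makes three labeled assumptions: the update is a convex blend controlled by a weight parameter alpha, the loss is the negative log-likelihood of the confirmed root causes, and "training" is reduced to a grid search over alpha. Per-node vectors stand in for the matrices to keep the example short.

```python
import math


def blend(current, target, alpha):
    """Assumed update rule: convex combination of current and target distributions."""
    return [(1.0 - alpha) * c + alpha * t for c, t in zip(current, target)]


def predict(weights, relevance):
    """Multiply by the topology relevance and normalize to get prediction probabilities."""
    scores = [w * r for w, r in zip(weights, relevance)]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


def loss(predicted, true_roots, eps=1e-9):
    """Assumed loss: negative log-likelihood of the confirmed root cause nodes."""
    return -sum(math.log(max(predicted[root], eps)) for root in true_roots)


def best_alpha(current, target, relevance, true_roots, candidates):
    """Grid search standing in for training of the weight parameter."""
    return min(
        candidates,
        key=lambda a: loss(predict(blend(current, target, a), relevance), true_roots),
    )


current = [0.5, 0.5]      # existing distribution over two nodes
target = [0.0, 1.0]       # distribution estimated from recent samples
relevance = [1, 1]        # both nodes lie on the fault-related path
local_roots = [1, 1, 0]   # confirmed root cause node index for three local alarms
updated = blend(current, target, 0.5)
alpha = best_alpha(current, target, relevance, local_roots, [0.0, 0.5, 1.0])
```

Here the middle value of alpha wins: ignoring recent samples underfits them, while trusting them completely assigns near-zero probability to a node that did cause one of the local faults.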
In the embodiment of the application, the fault root cause probability of each node in the network topology graph is determined according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause, and the dependence on a specific mechanism model is reduced, so that the fault cause can be accurately and quickly positioned in a plurality of alarms.
Referring to fig. 3, an embodiment of the present application provides a fault root cause locating device, where the device 300 includes: a node topology module 301, a fault weight module 302, and a root cause location module 303.
A node topology module 301, configured to determine node topology information according to a network topology map;
a fault weighting module 302, configured to obtain fault weighting information, where the fault weighting information indicates a probability distribution of a fault root in the network topology diagram;
a root cause positioning module 303, configured to determine, according to the node topology information and the fault weight information, a fault root cause probability of each node in the network topology map, where a node with a highest fault root cause probability is a predicted fault root cause.
In an embodiment of the present application, the node topology module is further configured to:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and/or the fault-related path, wherein elements in the global correlation matrix are used for representing the fault correlation relation of each node in the network topology diagram aiming at a specific fault;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
In one embodiment of the present application, the failure weight module is further configured to:
judging whether a local fault database of a node user side or a fault root data model of a node manufacturing side is updated or not;
if the local fault database of the node user side or the fault root factor data model of the node manufacturing side is updated, the updated local fault database or the updated fault root factor data model is used for updating the fault weight information; otherwise, the existing failure weight information is used.
In one embodiment of the present application, the apparatus further comprises:
the updating module is used for reporting a local fault database to the node manufacturing party, wherein the local fault database is generated by the node using party based on the digital twinning;
and obtaining an updated fault factor data model from the node manufacturer, wherein the fault factor data model is obtained by updating the local fault database reported by the node manufacturer based on a plurality of node users.
In an embodiment of the present application, the root cause location module is further configured to:
calculating an objective function X' by the formula shown in the original publication as image BDA0003254269410000151 (not reproduced in this text), where n denotes the number of fault alarms and n is greater than 1;
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of node i by prop_i = ∑∑ f_i * X;
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
Optionally, the apparatus further comprises:
an update module, configured to update the fault root probability distribution matrix according to a local sample, where the local sample includes: a fault alarm and a root cause of the fault alarm.
Optionally, the update module is further configured to:
generating a target fault root probability distribution matrix within a specified time range according to samples within a preset range;
updating the probability distribution matrix of the fault root cause according to a preset weight parameter and the probability distribution matrix of the target fault root cause to obtain an updated probability distribution matrix of the fault root cause;
obtaining a prediction probability after normalization processing according to the product of the updated fault root cause probability distribution matrix and the node topology matrix;
constructing a loss function according to the prediction probability;
and training a fault root probability distribution matrix according to the local sample and the loss function, and optimizing the weight parameter.
Further, the node topology module 301 is responsible for building and monitoring the topological relationship of the nodes. It collects the relationships between nodes from the transmission path that each task takes through the nodes, and from these builds a global node topology matrix. The node topology module 301 also monitors node changes, and updates the node topology matrix promptly when nodes are added or removed.
Further, the fault weight module 302 is responsible for assigning fault root cause weights for the alarms of the various nodes. The fault weight module 302 may obtain the root cause weight assignment from the data of the device manufacturer and from the local fault database of the device user. Based on the digital twin, device fault data flows between device manufacturers and device users, creating additional value.
Further, the root cause location module 303 is responsible for calling the relevant data from the node topology module 301 and the fault weight module 302 to compute and infer the fault when the system receives a large number of device alarms within a short period of time.
The device provided in the embodiment of the present application can implement each process implemented in the method embodiment shown in fig. 2, and achieve the same technical effect, and is not described here again to avoid repetition.
Referring to fig. 4, the node topology module executes the following process:
Step a: the node topology module monitors whether the node connection topology has changed.
If not, the original node topology matrix continues to be used.
If so, the node topology matrix is updated; for the updating manner, refer to the node topology matrix construction described below.
Step b: the updated node topology matrix is pushed to the root cause positioning module.
With continued reference to FIG. 4, the fault weight module executes the following process:
Step a: the fault weight module monitors whether the local fault database of the device user has been updated.
If so, the retraining function of the fault weight module is started and the fault weight function is updated.
Step b: the fault weight module monitors whether the fault root cause data model of the device manufacturer has been updated.
If so, the retraining function of the fault weight module is started and the fault weight function is updated.
Step c: the updated fault weight function is pushed to the root cause positioning module on request.
Before step a, the process further includes step d: based on the digital twin, the device user periodically reports its local fault database, after desensitization, to the device manufacturer; the device manufacturer updates the fault root cause data model in time based on the data from multiple device users and pushes the new fault root cause data model to the device users.
With continued reference to fig. 4, the root cause positioning module executes the following process:
Step a: determine whether the system's fault alarm frequency (number of alarms per unit time) exceeds a preset threshold.
Step b: if the threshold is exceeded, the root cause positioning module is started; it retrieves the stored node topology matrix, calls the fault weight function based on the collected alarm information, and infers the fault root cause.
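The threshold check in step a can be sketched as a sliding-window counter. This is only an illustration: the class name `AlarmRateTrigger`, the window length, and the threshold value are assumptions, not part of the application.

```python
from collections import deque


class AlarmRateTrigger:
    """Hypothetical helper: fires when more than `threshold` alarms fall within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def record(self, timestamp: float) -> bool:
        """Record one alarm; return True if root cause positioning should be started."""
        self._times.append(timestamp)
        # Drop alarms that have fallen out of the time window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.threshold


# Illustrative values only: more than 2 alarms within 10 seconds triggers positioning.
trigger = AlarmRateTrigger(threshold=2, window_s=10.0)
fired = [trigger.record(t) for t in (0.0, 1.0, 2.0, 30.0)]
```

With these placeholder values the third alarm exceeds the threshold, while the isolated alarm at 30 s does not.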
To facilitate understanding of the embodiments of the present application, the following technical points are introduced:
1. Construction of the node topology matrix.
A graph-based search method is used to determine the distribution of all given nodes, and an overall topology matrix and a global correlation matrix are constructed from the service node paths related to the current task; these are used to calculate the node topology matrix.
The specific steps are as follows:
Step a: label the operator nodes in the computation graph according to a unified rule, for example: first label, in sequence, the nodes in the branch with the largest number of nodes, then label all nodes in the branch with the second largest number of nodes, and so on until all operator nodes in the computation graph are labeled.
Step b: for each terminal node (nodes 1 and 6), take the node as the starting point, search for the paths from that node to the platform node (node 4), and record each path until all paths are recorded. Each path is then normalized: omitted nodes are filled with 0 so that the vector length of every path is the same.
Taking fig. 5 as an example, the recorded paths are: 1-2-3-4, 1-5-3-4, 6-0-3-4.
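Step b can be sketched as a depth-first path search followed by zero padding. The edge list below is inferred from the three recorded paths rather than taken from fig. 5 itself, and the padding rule (zeros inserted directly after the terminal node) is an assumption chosen because it reproduces 6-0-3-4; the application does not state the rule.

```python
def find_paths(graph, start, goal, path=None):
    """Return every simple path from start to goal in a directed graph."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # avoid revisiting nodes
            paths.extend(find_paths(graph, nxt, goal, path))
    return paths


def pad_path(path, length):
    """ASSUMED rule: fill omitted nodes with 0 right after the terminal node."""
    missing = length - len(path)
    return path[:1] + [0] * missing + path[1:]


# Edges inferred from the recorded paths 1-2-3-4, 1-5-3-4 and 6-3-4 (assumption).
graph = {1: [2, 5], 2: [3], 5: [3], 6: [3], 3: [4]}
terminals, platform = [1, 6], 4

raw_paths = [p for t in terminals for p in find_paths(graph, t, platform)]
width = max(len(p) for p in raw_paths)
padded_paths = [pad_path(p, width) for p in raw_paths]
```

Stacking `padded_paths` row by row gives one candidate layout for the overall topology matrix, with one row per path.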
Step c: based on the paths, generate the overall topology matrix A_{p×q} and the integral filter f_{p×q} of each node.
Taking FIG. 5 as an example, the overall topology matrix is
Figure BDA0003254269410000181
The integral filter f_1 of node 1 is
Figure BDA0003254269410000182
The integral filter f_4 of node 4 is
Figure BDA0003254269410000183
The integral filter is used in the superposition-based attribution step.
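Because the matrices themselves appear only as figures, the following is a guess at their structure rather than a reproduction. It assumes that the overall topology matrix stacks the zero-padded paths as rows, and that the integral filter of node i is a binary mask marking where node i occurs in that matrix; the function name `integral_filter` is hypothetical.

```python
import numpy as np

# ASSUMED layout: one zero-padded path per row (paths 1-2-3-4, 1-5-3-4, 6-0-3-4).
A = np.array([
    [1, 2, 3, 4],
    [1, 5, 3, 4],
    [6, 0, 3, 4],
])


def integral_filter(topology, node):
    """ASSUMED definition: 1 where `node` appears in the topology matrix, else 0."""
    return (topology == node).astype(int)


f1 = integral_filter(A, 1)  # node 1 starts the first two paths
f4 = integral_filter(A, 4)  # node 4 (platform) ends every path
```

Under this reading, the Hadamard product of f_i with a matrix of the same shape keeps only the entries that belong to node i, which matches how the filter is used in the attribution step.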
Step d: according to the fault type, determine the fault-related paths and generate the global correlation matrix D_{q×p}.
For example, the service-unresponsive fault 1 of node 1 corresponds to
Figure BDA0003254269410000191
Combining the above, the node topology matrix E = D · A is obtained.
2. Learning of the fault root cause probability distribution matrix.
One of the problems to be solved by the invention is to establish a root cause analysis method that does not depend entirely on a mechanism model or on historical data; a superposition method commonly used in the field of signal processing is therefore adopted. The advantage of the superposition method is that, even when only a few mechanism models and no historical data are available, the fault weights can be set by averaging and good results can still be achieved.
For a single specific fault type, the attribution can be modeled according to the mechanism or historical data, which improves the attribution accuracy. Two related designs are provided for this purpose:
(1) Initial fault root cause probability distribution matrix. By means of the digital twin, equipment manufacturers can periodically push updates based on their own data and mechanisms, and some manufacturers on their most advanced mechanism models, so that a valuable fault root cause probability distribution matrix can be provided. Following the data format provided by the equipment manufacturer, the fault root cause probability distribution matrix H_{1×q} is provided as the initial value.
When a manufacturer is unable or refuses to provide a fault root cause probability distribution matrix, the initial function H_{1×q} can be defined as an averaging function.
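One plausible reading of the averaging fallback is a uniform distribution over the q positions of a path. The helper name below is hypothetical, and treating "averaging function" as equal weights is an assumption.

```python
import numpy as np


def average_root_cause_matrix(q):
    """ASSUMED fallback: every one of the q positions gets the same weight 1/q."""
    return np.full((1, q), 1.0 / q)


H0 = average_root_cause_matrix(4)  # 1x4 matrix with every entry equal to 0.25
```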
(2) Learning and updating the fault root cause probability distribution matrix based on local small samples.
The fault root cause probability distribution matrix provided by the equipment manufacturer is derived from many different operating environments and operators, so it does not reflect the actual local environment and operating level. That information is contained in the local samples, which can therefore be used to update the fault root cause probability distribution matrix. The specific steps are as follows:
Step a: generate the recent fault root cause probability distribution matrix J_{1×q} from the small samples.
Step b: set the weight parameter
Figure BDA0003254269410000192
with an initial value of 0.5, and generate the updated fault root cause probability distribution matrix:
Figure BDA0003254269410000193
Step c: generate the prediction probability T' = H · E and apply L1 normalization, so that the normalized prediction probability is calculated as:
Figure BDA0003254269410000201
Step d: construct the loss function from the prediction probability:
Figure BDA0003254269410000202
where t_i is the predicted result and t'_i is the ground truth, i.e., the correctly labeled data.
Step e: based on the local samples (each fault alarm together with its root cause is one sample) and the loss function, train the fault root cause probability distribution matrix and optimize the weight parameter
Figure BDA0003254269410000204
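Steps a to e can be sketched as follows. The update formula, the loss function, and the optimizer are given only as figures, so this sketch substitutes its own choices and labels them as assumptions: a convex combination of the manufacturer matrix and the local matrix, a squared-error loss, and a coarse grid search over the weight parameter (instead of starting from 0.5 and training). All numeric values are placeholders.

```python
import numpy as np


def l1_normalize(v):
    """Scale v so that its absolute values sum to 1 (all-zero input is returned unchanged)."""
    total = np.abs(v).sum()
    return v / total if total > 0 else v


def blend(H0, J, alpha):
    """ASSUMED update rule: alpha * manufacturer matrix + (1 - alpha) * local matrix."""
    return alpha * H0 + (1.0 - alpha) * J


def loss(H, samples):
    """ASSUMED loss: mean squared error between normalized prediction and ground truth."""
    total = 0.0
    for E, truth in samples:
        t = l1_normalize(H @ E)  # step c: T' = H . E, then L1 normalization
        total += float(((t - truth) ** 2).sum())
    return total / len(samples)


def fit_alpha(H0, J, samples, grid=None):
    """ASSUMED optimizer: choose the grid value of alpha with the lowest loss."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    return float(min(grid, key=lambda a: loss(blend(H0, J, a), samples)))


q = 4
H0 = np.full((1, q), 1.0 / q)             # placeholder manufacturer matrix
J = np.array([[0.7, 0.1, 0.1, 0.1]])      # hypothetical matrix from local samples
E = np.eye(q)                             # placeholder node topology matrix
truth = np.array([[1.0, 0.0, 0.0, 0.0]])  # hypothetical labeled root cause
samples = [(E, truth)] * 5

alpha = fit_alpha(H0, J, samples)
H = blend(H0, J, alpha)
```

In this toy setting the local matrix already points at the labeled root cause, so the search puts all weight on it; with real samples the balance would depend on how well each matrix predicts the labeled root causes.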
3. Superposition-based attribution.
Based on the node topology matrix and the fault root cause probability distribution matrix, the root cause of the fault is located by superposition-based prediction, which achieves fast and efficient root cause prediction.
Figure BDA0003254269410000203
where A is the overall topology matrix (from the node topology matrix construction), D_error is the global correlation matrix corresponding to the fault (from the node topology matrix construction), and H_error is the fault root cause probability distribution matrix (from the learning of the fault root cause probability distribution matrix). After L1 normalization of X', the target root cause prediction matrix X_{p×q} is obtained. The integral filter f_{p×q} of each node (from the node topology matrix construction) is then used to calculate the fault root cause probability of node i:
prop_i = ∑∑ f_i * X_{p×q}
Note: * denotes the Hadamard product.
The node with the highest probability is the predicted fault root cause.
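The normalization and filtering steps can be sketched as below. Since the expression for X' is given only as a figure, the sketch takes the per-alarm contribution matrices as inputs instead of deriving them from A, D_error and H_error. The contribution values, the binary-mask form of the filters, and the function name `attribute` are assumptions for illustration.

```python
import numpy as np


def attribute(contributions, filters):
    """Superpose per-alarm matrices, L1-normalize, then score each node via its filter."""
    X_prime = np.sum(contributions, axis=0)  # superposition over all alarms
    total = np.abs(X_prime).sum()
    X = X_prime / total if total > 0 else X_prime  # L1 normalization
    # prop_i = sum over all entries of the Hadamard product f_i * X
    probs = {node: float((f * X).sum()) for node, f in filters.items()}
    root = max(probs, key=probs.get)
    return root, probs


# ASSUMED layout: rows are the zero-padded paths 1-2-3-4, 1-5-3-4, 6-0-3-4.
A = np.array([[1, 2, 3, 4], [1, 5, 3, 4], [6, 0, 3, 4]])
filters = {node: (A == node).astype(float) for node in range(1, 7)}

# Hypothetical per-alarm contributions (placeholders, not values from the application).
contributions = [
    np.array([[0.4, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]),
    np.array([[0.4, 0.0, 0.0, 0.0], [0.4, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]),
]

root, probs = attribute(contributions, filters)
```

Here node 1 collects the largest share because both placeholder alarms load on its entries.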
Referring to fig. 6, an application scenario of the embodiment of the present application is described by taking RFID as an example.
In the RFID scenario of the industrial field network, the node paths are largely similar, each consisting of a D-exciter, C-tag environmental interference, a B-receiver and an A-gateway. As shown in FIG. 6, the RFID scenario includes 5 node paths: A1-B1-C1-D1, A2-B2-C2-D2, A3-B3-C3-D3, A4-B4-C4-D4, and A5-B5-C5-D5.
For the counting service of a single tag, only one topological subgraph needs to participate; for the tag positioning service, three-point or four-point positioning is usually required, so 3 to 4 topological subgraphs need to participate.
The objective function is composed of the node topology matrix and the fault root cause probability distribution matrix:
Figure BDA0003254269410000211
Overall topology matrix
Figure BDA0003254269410000212
The RFID-based positioning scenario depicted in FIG. 6 has 5 node paths, and the maximum number of nodes on a single node path is 4.
The global correlation matrix D_error is a 5×4 matrix that depends on the fault being alarmed. Locating one RFID tag requires 3 links (assuming two-dimensional xy positioning): the positioning accuracy of RFID tag t1 corresponds to 3 links (A1-B1-C1-D1, A2-B2-C2-D2 and A4-B4-C4-D4), and the positioning accuracy of RFID tag t2 corresponds to 3 links (A1-B1-C1-D1, A3-B3-C3-D3 and A5-B5-C5-D5). The topology function differs for different error types.
Table 1 shows examples of the topology functions corresponding to faults of industrial field network nodes.
Figure BDA0003254269410000213
Figure BDA0003254269410000221
The fault root cause probability distribution matrix H_error is a 1×4 matrix; it is the weight function of the industrial field network nodes corresponding to the fault, where 4 is the maximum number of nodes on a single node path. Still taking the RFID-based positioning scenario depicted in fig. 6 as an example, there are 4 nodes on each path.
The initial values of the functions shown in Table 2 should be provided by the hardware manufacturer and optimized according to data collected during operation. If the hardware manufacturer's data cannot be provided, the following is adopted:
Figure BDA0003254269410000222
When a node fails, there are usually dozens of related fault alarms, which may include, but are not limited to: fault alarms of the node itself, fault alarms of related nodes, and fault alarms related to Quality of Experience (QoE). Each fault alarm is calculated according to the formula:
Figure BDA0003254269410000231
L1 normalization is applied to X' to obtain the target root cause prediction matrix X_{4×5}. The integral filter f_{4×5} of each node is then used to calculate the fault root cause probability of each node:
prop = ∑∑ f_{4×5} * X_{4×5}
Note: * denotes the Hadamard product.
For example, suppose 3 fault alarms are received within a short time: fault alarm error1 is an excessive positioning deviation of Tag-t1, fault alarm error2 is an excessive positioning deviation of Tag-t2, and fault alarm error3 is that D1 has no status data.
For the fault alarm error =1,
Figure BDA0003254269410000232
for the fault alarm error =2,
Figure BDA0003254269410000233
for the fault alarm error =3,
Figure BDA0003254269410000234
Figure BDA0003254269410000241
normalizing X' to obtain
Figure BDA0003254269410000242
For each node in turn, the node's integral filter is used to calculate its fault root cause probability. Taking node D1 as an example, its integral filter is
Figure BDA0003254269410000251
The fault root cause probability is calculated as follows:
Figure BDA0003254269410000252
Sorting the fault root cause probabilities of all nodes shows that D1 has the highest probability, so the root cause of this series of faults is determined to be the D1 exciter.
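The RFID example can be replayed end to end as a rough sketch. The actual matrices are in figures that are not reproduced, so every number here is a placeholder: H_error is taken as uniform (0.25 per position), each D_error is a binary 5×4 matrix marking the links named in the text, the alarm "D1 has no status data" is assumed to mark only D1, and the per-alarm contribution is assumed to be D_error weighted position-wise by H_error and transposed to 4×5. The overall topology matrix is not used in this simplified version.

```python
import numpy as np

POSITIONS = "ABCD"  # gateway, receiver, tag environment, exciter
N_PATHS = 5


def correlation(paths, positions=POSITIONS):
    """ASSUMED D_error: 5x4 binary matrix, 1 for each (path, position) tied to the alarm."""
    D = np.zeros((N_PATHS, len(POSITIONS)))
    for p in paths:
        for pos in positions:
            D[p - 1, POSITIONS.index(pos)] = 1.0
    return D


H = np.full((1, 4), 0.25)  # placeholder uniform weights

alarms = [
    correlation([1, 2, 4]),   # error1: Tag-t1 positioning deviation (links 1, 2, 4)
    correlation([1, 3, 5]),   # error2: Tag-t2 positioning deviation (links 1, 3, 5)
    correlation([1], "D"),    # error3: D1 has no status data (assumed to mark D1 only)
]

# ASSUMED contribution per alarm: weight positions by H, transpose to 4x5, then superpose.
X_prime = sum((D * H).T for D in alarms)
X = X_prime / X_prime.sum()  # L1 normalization (all entries are non-negative)

probs = {}
for r, pos in enumerate(POSITIONS):
    for c in range(N_PATHS):
        f = np.zeros_like(X)
        f[r, c] = 1.0  # binary filter selecting this node's entry
        probs[f"{pos}{c + 1}"] = float((f * X).sum())

root = max(probs, key=probs.get)
```

Under these placeholder inputs the highest score falls on D1, which agrees with the conclusion stated in the text; the actual probabilities in the application come from matrices that are not reproduced here.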
As shown in fig. 7, the embodiment of the present application further provides a communication device 700, which includes a processor 701, a memory 702, and a program or an instruction stored on the memory 702 and executable on the processor 701, and when the program or the instruction is executed by the processor 701, the program or the instruction implements the processes of the embodiment of the method in fig. 2, and can achieve the same technical effect. To avoid repetition, further description is omitted here.
An embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the method embodiment shown in fig. 2, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is a processor in the first communication device or the second communication device described in the above embodiments. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or may be embodied in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable hard disk, a compact disk, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may be carried in a core network interface device. Of course, the processor and the storage medium may reside as discrete components in a core network interface device.
Those skilled in the art will recognize that in one or more of the examples described above, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
The above embodiments further describe in detail the objects, technical solutions and advantages of the present application. It should be understood that the above are only specific embodiments of the present application and are not intended to limit its scope of protection; any modification, equivalent substitution or improvement made on the basis of the technical solutions of the present application shall fall within the scope of protection of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications can be made in the embodiments of the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (14)

1. A fault root cause positioning method is characterized by comprising the following steps:
determining node topology information according to the network topology graph;
acquiring fault weight information, wherein the fault weight information represents the probability distribution condition of fault root causes in the network topology graph;
and determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.
2. The method of claim 1, wherein the step of determining node topology information from the network topology map comprises:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and/or the fault-related path, wherein elements in the global correlation matrix are used for representing the fault correlation of each node in the network topology diagram aiming at a specific fault;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
3. The method of claim 1, wherein the step of obtaining fault weight information comprises:
judging whether a local fault database of a node user side or a fault root data model of a node manufacturing side is updated or not;
if the local fault database of the node user side or the fault root factor data model of the node manufacturing side is updated, the updated local fault database or the updated fault root factor data model is used for updating the fault weight information; otherwise, the existing failure weight information is used.
4. The method of claim 3, further comprising:
reporting a local failure database to a node manufacturer, wherein the local failure database is generated by a node user based on a digital twin;
and obtaining an updated fault root cause data model from the node manufacturer, wherein the fault root cause data model is updated by the node manufacturer based on the local fault databases reported by a plurality of node users.
5. The method according to claim 2, wherein the step of determining the root probability of the failure of each node in the network topology map according to the node topology information and the failure weight information comprises:
calculating an objective function X' by
Figure FDA0003254269400000021
wherein n represents the number of fault alarms and n is greater than 1;
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm, elements of which represent the probability that each node is the fault root cause;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of a node i by prop_i = ∑∑ f_i * X, wherein the node i is any node in the network topology graph;
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
6. The method of claim 5, further comprising:
updating a fault root cause probability distribution matrix according to a local sample, wherein the local sample comprises: a fault alarm and a root cause of the fault alarm.
7. The method of claim 6, wherein the step of updating the probability distribution matrix of the fault root according to the local samples comprises:
generating a target fault root probability distribution matrix in a specified time range according to samples in a preset range;
updating the fault root probability distribution matrix according to preset weight parameters and the target fault root probability distribution matrix to obtain an updated fault root probability distribution matrix;
obtaining a prediction probability after normalization processing according to the product of the updated fault root cause probability distribution matrix and the node topology matrix;
constructing a loss function according to the prediction probability;
and training a fault root probability distribution matrix according to the local sample and the loss function, and optimizing the weight parameters.
8. A fault root cause locating device, comprising:
the node topology module is used for determining node topology information according to the network topology graph;
the fault weighting module is used for acquiring fault weighting information, and the fault weighting information represents the fault root probability distribution condition in the network topological graph;
and the root cause positioning module is used for determining the fault root cause probability of each node in the network topology graph according to the node topology information and the fault weight information, wherein the node with the highest fault root cause probability is a predicted fault root cause.
9. The apparatus of claim 8, wherein the node topology module is further configured to:
determining a topology matrix and an integral filter of each node in the network topology graph according to the path information in the network topology graph;
generating a global correlation matrix according to the fault type and/or the fault-related path, wherein elements in the global correlation matrix are used for representing the fault correlation relation of each node in the network topology diagram aiming at a specific fault;
and obtaining a node topology matrix according to the product of the topology matrix and the global correlation matrix.
10. The apparatus of claim 8, wherein the failure weight module is further configured to:
judging whether a local fault database of a node user side or a fault root cause data model of a node manufacturing side is updated;
if the local fault database of the node user side or the fault root data model of the node manufacturer side is updated, updating the fault weight information by using the updated local fault database or the updated fault root data model; otherwise, the existing failure weight information is used.
11. The apparatus of claim 10, further comprising:
the updating module is used for reporting a local fault database to the node manufacturing party, wherein the local fault database is generated by the node using party based on digital twinning;
and obtaining an updated fault root cause data model from the node manufacturer, wherein the fault root cause data model is updated by the node manufacturer based on the local fault databases reported by a plurality of node users.
12. The apparatus of claim 8, wherein the root cause location module is further configured to:
calculating an objective function X' by
Figure FDA0003254269400000051
wherein n represents the number of fault alarms and n is greater than 1;
wherein A is the topology matrix, D_error is the global correlation matrix corresponding to a fault alarm, and H_error is the fault root cause probability distribution matrix corresponding to a fault alarm;
normalizing the objective function X' to obtain a target root cause prediction matrix X;
calculating the fault root cause probability of a node i by prop_i = ∑∑ f_i * X, wherein the node i is any node in the network topology graph;
wherein f_i denotes the integral filter of node i, and * denotes the Hadamard product.
13. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the method of any one of claims 1 to 7.
14. A readable storage medium, characterized in that it has stored thereon a program which, when being executed by a processor, carries out steps comprising the method according to any one of claims 1 to 7.
CN202111055032.8A 2021-09-09 2021-09-09 Fault root cause positioning method and device and readable storage medium Pending CN115801557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111055032.8A CN115801557A (en) 2021-09-09 2021-09-09 Fault root cause positioning method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111055032.8A CN115801557A (en) 2021-09-09 2021-09-09 Fault root cause positioning method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN115801557A true CN115801557A (en) 2023-03-14

Family

ID=85416953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111055032.8A Pending CN115801557A (en) 2021-09-09 2021-09-09 Fault root cause positioning method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN115801557A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640346A (en) * 2024-01-25 2024-03-01 中兴系统技术有限公司 Communication equipment fault diagnosis method, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination