CN115809183A - Knowledge graph-based method for discovering and handling Xinchuang terminal faults

Info

Publication number
CN115809183A
Authority
CN
China
Prior art keywords
fault
alarm
data
knowledge
knowledge graph
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211454686.2A
Other languages
Chinese (zh)
Inventor
张迪
孙元田
聂郁徐
朱宪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Application filed by Inspur Software Group Co Ltd
Priority to CN202211454686.2A
Publication of CN115809183A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a knowledge graph-based method for discovering and handling faults of Xinchuang (information technology application innovation) terminals. The method uses knowledge graph technology to construct a fault prediction knowledge graph and a historical fault knowledge graph, so that data processing, fault discovery, and fault handling are made intelligent. Through a centralized monitoring system, system faults are discovered promptly and fault handling time is reduced; users obtain an accurate and timely view of the running state of systems and applications; system personnel are supported in carrying out necessary system optimization and configuration changes; and a sound basis is provided for system upgrades and capacity expansion.

Description

Knowledge graph-based method for discovering and handling Xinchuang terminal faults
Technical Field
The invention relates to the technical field of fault handling, and in particular to a knowledge graph-based method for discovering and handling faults of Xinchuang (information technology application innovation) terminals.
Background
Software and hardware are currently developing rapidly, and so are application systems built on domestic basic software and hardware. More and more customers are considering or adopting a centralized approach to their business systems. However, as business on the Xinchuang platform grows rapidly, it also becomes more complex, which increases the operation and maintenance workload and makes the system itself more complex. Effective system and application monitoring therefore becomes the key to understanding how service resources are used, discovering possible system faults in time, and guaranteeing system operation.
As business systems on the current Xinchuang platform become more centralized and more complex, operation and maintenance become correspondingly harder. Effective system and application monitoring, which tracks the usage state of business resources on the Xinchuang platform and promptly discovers hidden dangers that may cause system faults, is the key to guaranteeing system operation.
In view of this, the invention provides a knowledge graph-based method for discovering and handling Xinchuang terminal faults.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a simple and efficient knowledge graph-based method for discovering and handling Xinchuang terminal faults.
The invention is realized by the following technical scheme:
A knowledge graph-based method for discovering and handling Xinchuang terminal faults, characterized in that: a fault prediction knowledge graph and a historical fault knowledge graph are constructed using knowledge graph technology, so that data processing, fault discovery, and fault handling are made intelligent;
the method comprises the following steps:
Step S1, data acquisition and processing
The system's data is diverse and heterogeneous. Performance data and alarm data are combined for preprocessing, and the correlations among data of different types and levels in the system are modeled, which improves the accuracy and reliability of fault prediction for the information system.
First, the acquired data is cleaned, structured, and normalized to obtain a unified data structure; then a fault prediction knowledge graph is constructed from the preprocessed information data;
Step S1.1, real-time data acquisition
An acquisition client collects time series data for the monitored metrics, including metric data, performance data, log data, and the like; a threshold parameter is set, and a simple rule-based filter is specified according to the characteristics of key metrics and the key requirements of the application, so as to screen out a portion of suspected anomalies;
Step S1.2, data preprocessing
First, the acquired data is cleaned to remove incomplete data and redundant data;
then the data is structured: knowledge is extracted from the semi-structured data by wrapper-based extraction, and entity extraction is performed after the semi-structured data has been converted into structured data;
the semi-structured data includes, but is not limited to, historical fault data, historical fault repair records, the empirical data of operation and maintenance engineers, technical manuals, and user manuals;
finally, the data is normalized to unify the fault prediction standard;
Step S2, fault discovery
Step S2.1, fault prediction
A fault prediction knowledge graph is constructed and pre-trained to obtain a fault prediction model, which predicts faults hidden in the collected data;
Step S2.2, constructing the historical fault knowledge graph
For each fault type, historical fault data is trained on and analyzed using a causal discovery algorithm, so as to construct the characteristic changes in metric data for each fault type; these serve as the historical fault knowledge graph and are used to judge fault similarity;
Step S2.3, through the above steps the associated fault data is aggregated into an alarm knowledge graph, and a fault rate prediction module further predicts the fault rate from the similarity between the knowledge graph to be predicted and the historical fault knowledge graphs, improving the accuracy of fault prediction;
Step S2.4, the correlation of abnormal events is calculated using a causal discovery algorithm, highly correlated links are used to represent the propagation paths and characteristics of the abnormal events, and the root cause of the abnormal events is located;
Step S3, fault handling
False alarms are identified and suppressed by associating the context of abnormal events through the knowledge graph: the authenticity of an alarm is judged by its similarity to the context of historical alarms that were false alarms, thereby suppressing false alarms;
the alarm convergence module, based on the correlations among metric data, filters, compresses, merges, and deduplicates alarms and finally aggregates them into a single effective message;
the alarm self-healing module introduces artificial intelligence technology to train on historical fault repair data and constructs a Bayesian fault self-healing model, so that automated fault self-healing replaces manual fault handling and terminal faults are handled quickly and automatically;
the event scheduling module automatically repairs events that support fault self-healing through the alarm self-healing module, and routes events that need manual repair to operation and maintenance personnel through automatic work order dispatching;
and the alarm notification module, configured through an alarm notification policy, notifies users of alarm events and reminds operation and maintenance personnel to repair faults in time.
In step S1.1, the threshold parameter is set by an alarm policy, or is configured automatically by training a model on historical data.
In step S2.1, the fault prediction knowledge graph is constructed as follows:
Step S2.1.1, determining the relevance of each information entity in the information data to the preset fault types, based on a preset information system equipment manual and a related knowledge base;
Step S2.1.2, selecting, based on the relevance between the information entities and the preset fault types, some of the information entities as nodes of the fault prediction knowledge graph, and determining the relationships between adjacent nodes;
the information data is trained on with several causal discovery algorithms (such as the PC algorithm), and the final causal edges are determined from the causal edges output by each algorithm combined with manual review and screening; a causal edge comprises the node information corresponding to entities in the graph and the edge information corresponding to the relationships between entities;
Step S2.1.3, constructing the fault prediction knowledge graph from all nodes and association relations
The fault prediction knowledge graph is constructed from all information entities, the association relations, and the logic between entities.
In step S2.1.1, the relevance is determined as follows:
Step S2.1.1.1, calculating fingerprint information for each alarm with the MD5 algorithm from the policy, the rule, the unique terminal ID, the metric field, and the metric tag attributes, wherein alarms with the same fingerprint are regarded as the same alarm message;
Step S2.1.1.2, extracting the association relations of the modules to which the alarm messages belong;
Step S2.1.1.3, when an equipment room fails, aggregating the alarm messages generated in that equipment room into a single effective message.
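The fingerprinting in step S2.1.1.1 can be illustrated with a minimal sketch. The field names, the delimiter, and the helper names `alarm_fingerprint` and `deduplicate` are assumptions made for illustration only; the description specifies just that MD5 is applied to the policy, rule, unique terminal ID, metric field, and metric tag attributes.

```python
import hashlib


def alarm_fingerprint(policy, rule, terminal_id, metric_field, metric_tags):
    """Return an MD5 hex digest over the identifying attributes of an alarm."""
    # Sort the tags so the same tag set always yields the same fingerprint.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(metric_tags.items()))
    # "\x1f" (unit separator) is an assumed delimiter that keeps fields apart.
    key = "\x1f".join([policy, rule, terminal_id, metric_field, tag_part])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def deduplicate(alarms):
    """Keep the first alarm seen for each fingerprint."""
    unique = {}
    for alarm in alarms:
        fp = alarm_fingerprint(
            alarm["policy"], alarm["rule"], alarm["terminal_id"],
            alarm["metric"], alarm["tags"],
        )
        unique.setdefault(fp, alarm)
    return list(unique.values())
```

Two alarms that differ only in the order of their tags map to the same fingerprint and are therefore treated as one alarm message.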
In step S2.2, the historical fault knowledge graph is constructed as follows:
Step S2.2.1, event generation
The real-time metric data is checked with the fault prediction model, and an abnormal event is generated when an anomaly occurs;
Step S2.2.2, fault propagation graph construction
The association relations of the abnormal events are mined in the same way as in the construction of the fault prediction knowledge graph in step S2.1, so as to construct a fault propagation graph;
Step S2.2.3, classification and merging
The generated fault propagation graph and the historical fault knowledge graphs are classified and merged using a clustering algorithm;
Step S2.2.4, labeling
The classified and merged fault knowledge graphs are labeled with root causes, and each class of historical fault knowledge graph is given a fault type label.
In step S2.3, the fault probability corresponding to the knowledge graph to be predicted is determined as follows:
Step S2.3.1, determining the common subgraphs as the intersection of the subgraph set of the knowledge graph to be predicted and the fault subgraph set of a given historical fault knowledge graph;
Step S2.3.2, determining the similarity between the knowledge graph to be predicted and that historical fault knowledge graph from the common subgraphs, the weights of the common subgraphs, and the union of the subgraph set and the fault subgraph set;
Step S2.3.3, sorting the similarities to obtain the historical fault knowledge graph most similar to the knowledge graph to be predicted, wherein that similarity is taken as the fault rate of the knowledge graph to be predicted.
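Steps S2.3.1 to S2.3.3 name the inputs to the similarity (common subgraphs, their weights, and the union of the two subgraph sets) but give no formula. The sketch below assumes a weighted Jaccard ratio as one plausible reading; the function names, the representation of subgraphs as hashable labels, and the default weight of 1.0 are all illustrative assumptions.

```python
def graph_similarity(candidate_subgraphs, fault_subgraphs, weights=None):
    """Weighted Jaccard ratio between two sets of subgraph labels (assumed formula)."""
    weights = weights or {}
    common = candidate_subgraphs & fault_subgraphs   # step S2.3.1: intersection
    union = candidate_subgraphs | fault_subgraphs    # union used as the denominator
    if not union:
        return 0.0

    def weight(subgraph):
        # Subgraphs without an explicit weight count as 1.0 (assumption).
        return weights.get(subgraph, 1.0)

    return sum(weight(s) for s in common) / sum(weight(s) for s in union)


def most_similar_fault(candidate_subgraphs, historical, weights=None):
    """Return (fault_type, similarity) for the best-matching historical graph.

    `historical` maps a fault type label to its set of fault subgraphs and
    must not be empty.
    """
    scored = [
        (graph_similarity(candidate_subgraphs, subgraphs, weights), fault_type)
        for fault_type, subgraphs in historical.items()
    ]
    best_score, best_type = max(scored)  # step S2.3.3: pick the highest similarity
    return best_type, best_score
```

Under this reading, the returned similarity of the best match is the value the description treats as the fault rate of the graph to be predicted.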
In step S2.4, the fault prediction knowledge graph contains the causal derivation relations accumulated from historical alarm events, and root cause localization based on the fault prediction knowledge graph proceeds as follows:
Step S2.4.1, inputting the time slice sample of the abnormal event into the alarm knowledge graph to obtain an alarm node causal graph;
Step S2.4.2, calculating the weight of each causal edge in the alarm node causal graph, wherein the weight of a causal edge reflects the probability that it lies on the suspected root cause path of the abnormal event in that time slice;
Step S2.4.3, sorting the causal edge weights and selecting the causal edge with the highest weight as the final root cause path.
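The ranking in step S2.4.3 can be sketched as follows. How the edge weights are computed in step S2.4.2 is not specified, so the sketch assumes they are already available as a mapping from (cause, effect) pairs to weights; the function names are hypothetical.

```python
def rank_causal_edges(edge_weights):
    """Sort (cause, effect) edges by weight, highest first."""
    return sorted(edge_weights.items(), key=lambda item: item[1], reverse=True)


def root_cause_edge(edge_weights):
    """Return the highest-weight causal edge, or None if the graph has no edges."""
    if not edge_weights:
        return None
    return rank_causal_edges(edge_weights)[0][0]
```

Keeping the full ranking available, rather than only the top edge, lets operation and maintenance personnel inspect the runner-up paths when the top candidate turns out not to be the cause.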
In step S3, fault handling is implemented as follows:
Step S3.1, alarm suppression
Step S3.1.1, parsing and storing call chain log data
An analysis window is set; based on the alarm log, the call chains within the time window of the alarm are parsed and stored in the database;
Step S3.1.2, graph construction
A complete entity tree structure is constructed from the reference relations in the call chain;
Step S3.1.3, metric table construction
The structure of the call chain is analyzed to find the related metrics on the call chain within the window, the frequency with which those metrics occur within the analysis window is counted, and the statistics are recorded in a metric table;
Step S3.1.4, alarm authenticity analysis
Based on the call chain of the current alarm, historical call chains that had the same metrics but produced false alarms are retrieved, together with the related metrics on the current call chain and their occurrence frequencies; the similarity between the current alarm and the historical alarms is calculated from the call chains and the related metrics on them; if the similarity is higher than a user-defined threshold, the alarm is considered a false alarm, otherwise it is a real alarm.
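The authenticity check in step S3.1.4 can be sketched under stated assumptions. The description does not name a similarity measure, so the sketch assumes that only historical false alarms with an identical call chain are compared and that similarity is the cosine similarity of the metric frequency counts; the record layout and the default threshold of 0.9 are likewise illustrative.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two metric-frequency dictionaries."""
    keys = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(k, 0) * freq_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_false_alarm(current, historical_false_alarms, threshold=0.9):
    """Flag the current alarm as false if it closely matches a past false alarm."""
    best = 0.0
    for past in historical_false_alarms:
        # Only compare alarms raised on the same call chain (assumption).
        if past["call_chain"] != current["call_chain"]:
            continue
        best = max(best, cosine_similarity(current["metric_freq"], past["metric_freq"]))
    return best > threshold
```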
Step S3.2, alarm convergence
Step S3.2.1, setting the time slice granularity
Alarm data within the time slice, that is, the statistical period of the alarm rule, is acquired in real time;
Step S3.2.2, alarm classification
The raw alarm data is classified by its collection metric; the collection metric is any one of CPU utilization, disk utilization, network traffic, or the unique terminal identifier.
Step S3.2.3, converging alarm events
The alarm knowledge graph is queried and alarm events are converged per system; the convergence format is: system 1: { node 1: [alarm type 1, alarm type 2, ...], node 2: [alarm type 1, alarm type 2, ...], ... };
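The convergence format of step S3.2.3 maps naturally onto nested dictionaries. In the sketch below, each alarm record is assumed to already carry its system and node; in the described method that mapping would come from querying the alarm knowledge graph. The field names and the function name are hypothetical.

```python
from collections import defaultdict


def converge_alarms(alarms):
    """Group alarms as {system: {node: [alarm types]}}, dropping repeated types."""
    grouped = defaultdict(lambda: defaultdict(list))
    for alarm in alarms:
        types = grouped[alarm["system"]][alarm["node"]]
        if alarm["type"] not in types:
            types.append(alarm["type"])
    # Convert to plain dictionaries so the result is easy to serialize.
    return {system: dict(nodes) for system, nodes in grouped.items()}
```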
Step S3.3, fault self-healing
Step S3.3.1, preprocessing fault parameter data
The root cause path data obtained in the root cause localization step is organized into fault parameter data and moved into the historical fault database, and the fault type and its corresponding repair program are stored;
Step S3.3.2, establishing the Bayesian fault self-healing model
The prior probability is calculated from the performance metrics of the equipment, and the conditional probability, adjustment factor, and posterior probability are calculated from the fault parameter data; the prior probability is then optimized with a Markov transition matrix model, and the adjustment factor and posterior probability are recalculated from the fault parameter data to obtain the self-healing probability for each fault type;
the root cause path data acquired in real time is input into the fault self-healing model to obtain the self-healing operability index for that fault type; when the self-healing operability index is greater than a user-defined threshold, the repair program corresponding to the fault type is used for automatic repair;
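The Bayesian part of step S3.3.2 can be illustrated with plain Bayes' rule, where the posterior equals the prior multiplied by an adjustment factor (the likelihood divided by the evidence). The sketch below is a simplification under stated assumptions: the Markov transition matrix refinement of the prior is omitted, the posterior is used directly as the self-healing operability index, and the function names, parameters, and default threshold of 0.8 are hypothetical.

```python
def self_heal_posterior(prior, p_obs_given_heal, p_obs_given_fail):
    """Posterior probability that automatic repair succeeds, by Bayes' rule."""
    # Total probability of the observed fault parameters.
    evidence = p_obs_given_heal * prior + p_obs_given_fail * (1.0 - prior)
    if evidence == 0.0:
        return 0.0
    # Adjustment factor: how much the observation raises or lowers the prior.
    adjustment_factor = p_obs_given_heal / evidence
    return prior * adjustment_factor


def should_self_heal(prior, p_obs_given_heal, p_obs_given_fail, threshold=0.8):
    """Trigger the repair program only when the posterior exceeds the threshold."""
    return self_heal_posterior(prior, p_obs_given_heal, p_obs_given_fail) > threshold
```

Events for which `should_self_heal` returns False would fall through to the manual work order path of step S3.4.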
Step S3.4, event scheduling
For events that need manual repair, a work order is initiated through automatic work order dispatching and operation and maintenance personnel repair the fault manually; for events that can self-heal, the fault self-healing model calls the corresponding repair program to repair the fault automatically;
Step S3.5, alarm notification
The alarm notification module notifies users of alarm events according to the configured alarm notification policy, and the notification content supports both preset templates and user-defined templates; under the alarm silence policy, the alarm notification module ignores alarms that meet the silence conditions during the silence period, that is, it sends no notification for them.
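The silence policy of step S3.5 can be sketched as a time-window check. The rule layout (an alarm type plus a daily start and end time) and the function names are assumptions; the description states only that alarms meeting the silence conditions within the silence period are not notified.

```python
from datetime import datetime, time


def in_silence_window(now, start, end):
    """True if `now` falls in [start, end); handles windows that cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_notify(alarm, now, silence_rules):
    """Return False when any silence rule matches the alarm at time `now`."""
    for rule in silence_rules:
        if rule["alarm_type"] == alarm["type"] and in_silence_window(
            now, rule["start"], rule["end"]
        ):
            return False
    return True
```

For example, a rule with start 22:00 and end 06:00 silences matching alarms overnight while leaving daytime alarms and other alarm types unaffected.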
A device for discovering and handling Xinchuang terminal faults based on a knowledge graph, characterized in that: it comprises a memory and a processor; the memory is configured to store a computer program, and the processor is configured to carry out the above method steps when executing the computer program.
A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the above-mentioned method steps.
The beneficial effects of the invention are as follows: with the knowledge graph-based method for discovering and handling Xinchuang terminal faults, system faults are discovered promptly through the centralized monitoring system and fault handling time is reduced; users obtain an accurate and timely view of the running state of systems and applications; system personnel are supported in carrying out necessary system optimization and configuration changes; and a sound basis is provided for system upgrades and capacity expansion.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the knowledge graph-based method for discovering and handling Xinchuang terminal faults.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
With the statistics, notification, and handling modules of the alarm center, maintenance personnel can quickly analyze the cause of a fault and are freed from tedious manual work. Faults in the system are discovered promptly through the centralized monitoring system, and fault handling time is shortened. At the same time, a basis is provided for upgrading and expanding the system.
In the knowledge graph-based method for discovering and handling Xinchuang terminal faults, knowledge graph technology is used to construct a fault prediction knowledge graph and a historical fault knowledge graph, so that data processing, fault discovery, and fault handling are made intelligent;
the method comprises the following steps:
Step S1, data acquisition and processing
The system's data is diverse and heterogeneous. Performance data and alarm data are combined for preprocessing, and the correlations among data of different types and levels in the system are modeled, which improves the accuracy and reliability of fault prediction for the information system.
First, the acquired data is cleaned, structured, and normalized to obtain a unified data structure; then a fault prediction knowledge graph is constructed from the preprocessed information data;
Step S1.1, real-time data acquisition
An acquisition client collects time series data for the monitored metrics, including metric data, performance data, log data, and the like; a threshold parameter is set, and a simple rule-based filter is specified according to the characteristics of key metrics and the key requirements of the application, so as to screen out a portion of suspected anomalies;
Step S1.2, data preprocessing
First, the acquired data is cleaned to remove incomplete data (for example, data from terminals that were online too briefly, data with acquisition intervals that are too long or too short, and data truncated within an interval) and redundant data (data that has no influence on faults is filtered out to reduce the workload of subsequent analysis; such data includes data set information and system log data within the normal threshold range);
then the data is structured: knowledge is extracted from the semi-structured data by wrapper-based extraction, and entity extraction is performed after the semi-structured data has been converted into structured data;
the semi-structured data includes, but is not limited to, historical fault data, historical fault repair records, the empirical data of operation and maintenance engineers, technical manuals, and user manuals;
finally, the data is normalized to unify the fault prediction standard;
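The cleaning and normalization in step S1.2 can be sketched as follows. The description does not fix a normalization method, so min-max scaling is assumed here; treating "redundant" as exact duplicates is a simplification of the description's broader notion of data with no influence on faults. The record layout and function names are hypothetical.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete data: any required field is absent or None.
        if any(record.get(field) is None for field in required_fields):
            continue
        key = tuple(record[field] for field in required_fields)
        if key in seen:  # duplicate of a record already kept
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; constant input maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```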
Step S2, fault discovery
Fault discovery implements, on the basis of the knowledge graph, the processing from collected data to fault root cause information, and comprises fault prediction, construction of the historical fault knowledge graph, fault rate prediction, and root cause localization;
Step S2.1, fault prediction
A fault prediction knowledge graph is constructed and pre-trained to obtain a fault prediction model, which predicts faults hidden in the collected data;
Step S2.2, constructing the historical fault knowledge graph
For each fault type, historical fault data is trained on and analyzed using a causal discovery algorithm, so as to construct the characteristic changes in metric data for each fault type; these serve as the historical fault knowledge graph and are used to judge fault similarity;
Step S2.3, through the above steps the associated fault data is aggregated into an alarm knowledge graph, and a fault rate prediction module further predicts the fault rate from the similarity between the knowledge graph to be predicted and the historical fault knowledge graphs, improving the accuracy of fault prediction;
Step S2.4, the root cause service and the service where the anomaly was first observed usually both lie on an anomaly propagation chain made up of a series of abnormal services. The root cause localization module therefore analyzes the anomaly propagation chains hidden in the alarm knowledge graph to determine a group of candidate root cause services from the possible propagation chains; on this basis, it calculates the correlation of abnormal events using a causal discovery algorithm, uses highly correlated links to represent the propagation paths and characteristics of the abnormal events, and locates their root cause;
Step S3, fault handling
Fault handling makes automatic fault diagnosis decisions through the steps of alarm suppression, alarm convergence, fault self-healing, event scheduling, and alarm notification; faults self-heal in some scenarios and work orders are dispatched automatically in the others, which improves operation and maintenance efficiency and reduces fault handling time.
Because of the uncertainty of artificial intelligence prediction, large fluctuations in the monitored metrics, non-periodic patterns, and similar factors, a certain number of false alarms are produced; when the method is used at scale in a production environment, false alarms accumulate and are likely to form an alarm storm. To solve this problem, false alarms are identified and suppressed by associating the context of abnormal events through the knowledge graph: the authenticity of an alarm is judged by its similarity to the context of historical alarms that were false alarms, thereby suppressing false alarms;
the alarm convergence module, based on the correlations among metric data, filters, compresses, merges, and deduplicates alarms and finally aggregates them into a single effective message; this avoids alarm storms, and correlation-based compression can reduce invalid alarms by 90%, which improves alarm efficiency.
The alarm self-healing module introduces artificial intelligence technology to train on historical fault repair data and constructs a Bayesian fault self-healing model, so that automated fault self-healing replaces manual fault handling and terminal faults are handled quickly and automatically;
the event scheduling module automatically repairs events that support fault self-healing through the alarm self-healing module, and routes events that need manual repair to operation and maintenance personnel through automatic work order dispatching;
and the alarm notification module, configured through an alarm notification policy, notifies users of alarm events and reminds operation and maintenance personnel to repair faults in time.
In step S1.1, the threshold parameter is set through an alarm policy, or is configured automatically by training a model on historical data.
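The two ways of obtaining the threshold in step S1.1 can be sketched as follows. A fixed threshold would simply be supplied by the alarm policy; for the learned case, the description does not name a model, so the sketch assumes the simple rule "mean plus k standard deviations" over historical samples. The function names and the default k of 3.0 are hypothetical.

```python
import statistics


def learn_threshold(history, k=3.0):
    """Assumed rule: mean of the history plus k population standard deviations."""
    return statistics.mean(history) + k * statistics.pstdev(history)


def suspected_anomalies(samples, threshold):
    """Return the (timestamp, value) samples whose value exceeds the threshold."""
    return [(ts, value) for ts, value in samples if value > threshold]
```

The filter is intentionally simple: it only narrows the stream to suspected anomalies, leaving the actual prediction to the later knowledge graph steps.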
The collected historical data contains a large number of correlations and causal relations. The inherent topology of the terminal equipment and the call relations of system applications can quickly form the entities and relations of the knowledge graph. In step S2.1, the fault prediction knowledge graph is constructed as follows:
Step S2.1.1, determining the relevance of each information entity in the information data to the preset fault types, based on a preset information system equipment manual and a related knowledge base;
in step S2.1.1, the relevance is determined as follows:
Step S2.1.1.1, alarm messages may share the same dimension attributes, such as the same policy name and rule name, the same deployment attributes (terminal, server, host, and the like), and the same metric type (memory, CPU, network utilization, and the like). Fingerprint information is calculated for each alarm with the MD5 algorithm from the policy, the rule, the unique terminal ID, the metric field, and the metric tag attributes, and alarms with the same fingerprint are regarded as the same alarm message;
Step S2.1.1.2, extracting the association relations of the modules to which the alarm messages belong;
when module A calls module B and module B is abnormal, module A also shows related anomalies; this association can be mined from historical abnormal events.
Step S2.1.1.3, when an equipment room fails, the alarm messages generated in that equipment room need not be generated or received one by one; they are finally aggregated into a single effective message.
Step S2.1.2, selecting, based on the relevance between the information entities and the preset fault types, some of the information entities as nodes of the fault prediction knowledge graph, and determining the relationships between adjacent nodes;
the information data is trained on with several causal discovery algorithms (such as the PC algorithm), and the final causal edges are determined from the causal edges output by each algorithm combined with manual review and screening; a causal edge comprises the node information corresponding to entities in the graph and the edge information corresponding to the relationships between entities;
Step S2.1.3, constructing the fault prediction knowledge graph from all nodes and association relations
The relevance of each information entity in the information data to the preset fault types is determined by combining a preset system equipment manual with a related knowledge base accumulated from the experience of operation and maintenance experts; based on this relevance, some of the information entities are selected as nodes of the fault prediction knowledge graph, and the relationships between adjacent nodes are determined. The fault prediction knowledge graph is constructed from all information entities, the association relations, and the logic between entities.
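Step S2.1.2 combines the causal edges output by several causal discovery algorithms with manual review, without stating how the outputs are merged. The sketch below assumes a simple voting rule (keep an edge proposed by at least `min_votes` algorithms) followed by removal of edges rejected by a reviewer. The causal discovery algorithms themselves are not implemented here; their outputs are taken as given sets of (cause, effect) pairs, and all names are hypothetical.

```python
from collections import Counter


def consensus_edges(edge_sets, min_votes=2):
    """Keep edges proposed by at least `min_votes` of the algorithms."""
    # set(edges) ensures one algorithm cannot vote twice for the same edge.
    votes = Counter(edge for edges in edge_sets for edge in set(edges))
    return {edge for edge, count in votes.items() if count >= min_votes}


def apply_manual_review(candidate_edges, rejected_edges):
    """Remove the edges that a human reviewer has rejected."""
    return set(candidate_edges) - set(rejected_edges)
```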
In step S2.2, the historical fault knowledge graph is constructed as follows:
Step S2.2.1, event generation
The real-time metric data is checked with the fault prediction model, and an abnormal event is generated when an anomaly occurs;
Step S2.2.2, fault propagation graph construction
The association relations of the abnormal events are mined in the same way as in the construction of the fault prediction knowledge graph in step S2.1, so as to construct a fault propagation graph;
Step S2.2.3, classification and merging
The generated fault propagation graph and the historical fault knowledge graphs are classified and merged using a clustering algorithm;
Step S2.2.4, labeling
The classified and merged fault knowledge graphs are labeled with root causes, and each class of historical fault knowledge graph is given a fault type label.
In step S2.3, the fault probability of the knowledge graph to be predicted is determined as follows:
Step S2.3.1, the shared subgraphs are determined as the intersection of the subgraph set of the knowledge graph to be predicted and the fault subgraph set of a given historical fault knowledge graph;
Step S2.3.2, the similarity between the knowledge graph to be predicted and that historical fault knowledge graph is determined from the shared subgraphs, the weights of the shared subgraphs, and the union of the subgraph set and the fault subgraph set;
Step S2.3.3, the similarities are sorted to find the historical fault knowledge graph most similar to the knowledge graph to be predicted; that similarity is taken as the fault rate of the knowledge graph to be predicted.
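Steps S2.3.1 to S2.3.3 can be read as a weighted overlap measure between two sets of subgraphs. The sketch below assumes a weighted Jaccard form (weight of shared subgraphs divided by weight of the union); the text does not give the exact formula, and the subgraph keys, weights, and fault labels are hypothetical.

```python
def weighted_similarity(candidate, fault, weights, default_weight=1.0):
    """Assumed weighted Jaccard similarity between two sets of subgraph keys."""
    shared = candidate & fault   # S2.3.1: intersection of the two subgraph sets
    union = candidate | fault    # S2.3.2: union of the two subgraph sets
    if not union:
        return 0.0
    total = lambda items: sum(weights.get(s, default_weight) for s in items)
    return total(shared) / total(union)

def predict_fault_rate(candidate, history, weights):
    """S2.3.3: rank historical graphs and return the best match and its score."""
    scores = {label: weighted_similarity(candidate, subs, weights)
              for label, subs in history.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Subgraphs are represented here by hashable string keys (an assumption).
candidate = {"a->b", "b->c", "c->d"}
history = {
    "disk_full": {"a->b", "b->c"},
    "net_down": {"x->y", "c->d"},
}
weights = {"a->b": 2.0}
best, rate = predict_fault_rate(candidate, history, weights)
```

With these inputs the best match is `disk_full`, because its shared subgraphs carry most of the union's weight.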
In step S2.4, the fault prediction knowledge graph contains the causal derivation relationships accumulated from historical alarm events, and root cause localization based on the fault prediction knowledge graph proceeds as follows:
Step S2.4.1, the time slice samples of the abnormal event are input into the alarm knowledge graph to obtain a causal graph of alarm nodes;
Step S2.4.2, the weight of each causal edge in the alarm node causal graph is calculated, where the weight of a causal edge reflects the probability that it lies on the suspected root cause path of the abnormal event in that time slice;
Step S2.4.3, the causal edges are sorted by weight, and the causal edge with the highest weight is selected as the final root cause path.
The resulting set of candidate root cause propagation paths generally contains the true root cause. The method enables automatic root cause derivation, shortens the time operation and maintenance personnel must spend intervening, and produces a visualized derivation chain that is interpretable and easy to review and optimize.
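The ranking in steps S2.4.2 and S2.4.3 can be sketched as below. The weighting rule is an assumption: each causal edge is weighted by the fraction of time slices in which the effect alarms given that the cause alarms. The patent does not state how the weights are computed, and the node names are hypothetical.

```python
def rank_causal_edges(edges, time_slices):
    """Rank causal edges by an assumed co-occurrence weight.

    edges: list of (cause, effect) pairs from the alarm node causal graph.
    time_slices: list of sets, each holding the nodes alarming in one slice.
    """
    weights = {}
    for cause, effect in edges:
        cause_slices = [s for s in time_slices if cause in s]
        both = sum(1 for s in cause_slices if effect in s)
        weights[(cause, effect)] = both / len(cause_slices) if cause_slices else 0.0
    # Highest weight first; the top entry is the suspected root cause path.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)

edges = [("disk_full", "db_slow"), ("net_jitter", "db_slow")]
time_slices = [
    {"disk_full", "db_slow"},
    {"disk_full", "db_slow"},
    {"net_jitter"},
    {"net_jitter", "db_slow"},
]
ranked = rank_causal_edges(edges, time_slices)
top_edge, top_weight = ranked[0]
```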
In step S3, fault handling is implemented as follows:
Step S3.1, alarm suppression
The context of an abnormal event is associated through an interpretable knowledge graph, and false alarms are identified and suppressed in the following steps:
Step S3.1.1, parsing and storing call chain log data
An analysis window is set, the call chains within the time window of the alarm are parsed from the alarm log, and the call chains are stored in the database;
Step S3.1.2, graph construction
A complete entity tree structure is constructed according to the reference relationships in the call chain;
Step S3.1.3, index table construction
The structure of the call chain is analyzed, the relevant indexes on the call chain within the window time are found, their occurrence frequency within the analysis window is analyzed, and their statistics are recorded in an index table;
Step S3.1.4, alarm authenticity analysis
Based on the call chain of the current alarm, historical call chains that had the same indexes but produced false alarms are searched for, together with the relevant indexes generated on the current call chain and their occurrence frequency; the similarity between the current alarm and the historical alarms is then calculated from the call chain and the relevant indexes on it. If the similarity is higher than a user-defined threshold, the alarm is considered a false alarm; otherwise, it is a true alarm.
The above steps accumulate historical false alarm records and judge false alarms by the similarity between the current alarm's context and those records, thereby suppressing false alarms.
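A minimal sketch of the authenticity check in step S3.1.4 follows. It assumes that similarity is the cosine similarity between index-frequency vectors of alarms that share the same call chain; the patent does not name a similarity measure, and the record layout, field names, and threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_false_alarm(current, historical_false_alarms, threshold=0.9):
    """Flag the alarm as false when it closely matches a past false alarm.

    Each record has 'chain' (tuple of service names) and
    'metrics' (index name -> occurrence frequency in the analysis window).
    """
    scores = [
        cosine(current["metrics"], past["metrics"])
        for past in historical_false_alarms
        if past["chain"] == current["chain"]   # same call chain only
    ]
    return bool(scores) and max(scores) > threshold

history = [{"chain": ("gateway", "auth", "db"), "metrics": {"latency": 8, "errors": 2}}]
same_chain = {"chain": ("gateway", "auth", "db"), "metrics": {"latency": 4, "errors": 1}}
other_chain = {"chain": ("gateway", "cache"), "metrics": {"latency": 4, "errors": 1}}
```

An alarm on a call chain with no recorded false alarms is treated as a true alarm in this sketch, which errs on the side of notifying.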
Step S3.2, alarm convergence
Alarm convergence gathers alarm events into the knowledge graph on a per-system basis, based on the relevance between alarm events, thereby reducing both the number and the frequency of alarm messages. Alarm convergence comprises the following main steps:
Step S3.2.1, setting the time slice granularity
Alarm data is acquired in real time within each time slice, where the time slice is the statistical period of the alarm rule;
Step S3.2.2, alarm classification
The original alarm data is classified according to its acquisition index; the acquisition index is any one of CPU utilization, disk utilization, network traffic, or a unique terminal identifier such as an IP or MAC address.
Step S3.2.3, converging alarm events
The alarm knowledge graph is queried, and alarm events are converged on a per-system basis; the convergence format is: system 1: {node 1: [alarm type 1, alarm type 2, ...], node 2: [alarm type 1, alarm type 2, ...], ...};
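The convergence format above maps naturally onto a nested dictionary. The sketch below assumes that the lookup from node to system comes from the alarm knowledge graph; here it is replaced by a plain dictionary, and the alarm field names are hypothetical.

```python
from collections import defaultdict

def converge(alarms, node_to_system):
    """Group alarms as {system: {node: [alarm types]}}, dropping repeats."""
    grouped = defaultdict(lambda: defaultdict(list))
    for alarm in alarms:
        # Stand-in for querying the alarm knowledge graph for the owning system.
        system = node_to_system.get(alarm["node"], "unknown")
        types = grouped[system][alarm["node"]]
        if alarm["type"] not in types:
            types.append(alarm["type"])
    return {system: dict(nodes) for system, nodes in grouped.items()}

node_to_system = {"node 1": "system 1", "node 2": "system 1"}
alarms = [
    {"node": "node 1", "type": "cpu"},
    {"node": "node 1", "type": "cpu"},
    {"node": "node 1", "type": "disk"},
    {"node": "node 2", "type": "network"},
]
converged = converge(alarms, node_to_system)
```

Four raw alarms collapse into one entry for `system 1` holding three distinct alarm types across two nodes.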
Step S3.3, fault self-healing
Historical fault repair data is used to train a Bayesian fault self-healing model with artificial intelligence techniques, so that automated fault self-healing replaces manual fault handling and terminal faults are handled quickly and automatically. The fault self-healing model is constructed as follows:
Step S3.3.1, preprocessing fault parameter data
The root cause path data obtained in the root cause localization step is organized into fault parameter data and moved into a historical fault database, and the fault type and its corresponding repair program are stored;
Step S3.3.2, establishing the Bayesian fault self-healing model
The prior probability is calculated from the performance indexes of the equipment, and the conditional probability, adjustment factor and posterior probability are calculated from the fault parameter data; the prior probability is then optimized using a Markov transition matrix model, and the adjustment factor and posterior probability are recalculated from the fault parameter data to obtain the self-healing probability for each fault type;
Root cause path data acquired in real time is input into the fault self-healing model to obtain the self-healing operability index for the fault type; when this index is greater than a user-defined threshold, the repair program corresponding to the fault type is used to repair the fault automatically;
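The Bayesian part of step S3.3.2 can be illustrated with the standard identity posterior = prior x adjustment factor, where the adjustment factor is the likelihood divided by the evidence. The sketch below is a simplified reading under that assumption; it omits the Markov transition matrix optimization of the prior, and the probabilities, threshold, and repair program name are hypothetical.

```python
def self_heal_probability(prior, likelihood, evidence):
    """Posterior P(self-heal | symptoms) = prior * (likelihood / evidence).

    prior: P(self-heal), assumed to come from device performance indexes.
    likelihood: P(symptoms | self-heal), assumed from fault parameter data.
    evidence: P(symptoms).
    """
    if evidence <= 0:
        raise ValueError("evidence must be positive")
    adjustment_factor = likelihood / evidence
    return min(1.0, prior * adjustment_factor)

def choose_action(fault_type, posterior, repair_programs, threshold=0.8):
    """Run the stored repair program only above the user-defined threshold."""
    if posterior > threshold and fault_type in repair_programs:
        return ("auto_repair", repair_programs[fault_type])
    return ("work_order", None)

repair_programs = {"disk_full": "clean_tmp.sh"}  # hypothetical mapping
posterior = self_heal_probability(prior=0.5, likelihood=0.9, evidence=0.5)
action = choose_action("disk_full", posterior, repair_programs)
```

Faults that fall below the threshold, or that have no stored repair program, fall back to a manual work order in this sketch.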
Step S3.4, event scheduling
For events that require manual repair, a work order is initiated through automatic work order scheduling, and operation and maintenance personnel repair the event manually; for events that can self-heal, the fault self-healing model calls the corresponding repair program to repair them automatically;
Step S3.5, alarm notification
The alarm notification module notifies the user of alarm events according to the configured alarm notification policy; notification channels include, but are not limited to, email, SMS, in-app notification and DingTalk. The notification content supports both preset templates and user-defined templates; under the alarm notification silence policy, the alarm notification module ignores alarms that meet the silence conditions during the silent period, that is, no alarm notification is sent.
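A silence policy of this kind might be checked as in the sketch below. The rule layout (alarm type plus a daily silent window), the template string, and the `send` callback are all assumptions made for illustration; the patent does not describe the policy format.

```python
from datetime import datetime, time

def in_silence(now, start, end):
    """True if now falls in [start, end), allowing windows that cross midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end

def notify(alarm, now, silence_rules, send, template="[{level}] {type} on {node}"):
    """Send a templated notification unless a silence rule matches."""
    for rule in silence_rules:
        if rule["type"] == alarm["type"] and in_silence(now, rule["start"], rule["end"]):
            return False  # silenced: no notification is sent
    send(template.format(**alarm))
    return True

sent = []
rules = [{"type": "cpu", "start": time(22, 0), "end": time(6, 0)}]
alarm = {"level": "warn", "type": "cpu", "node": "pc-01"}
night = notify(alarm, datetime(2023, 1, 1, 23, 30), rules, sent.append)
noon = notify(alarm, datetime(2023, 1, 1, 12, 0), rules, sent.append)
```

Passing `send` as a callback keeps the channel (email, SMS, and so on) separate from the silence logic.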
The method for discovering and disposing the fault of the information-creating terminal based on the knowledge graph is compatible with various domestic CPUs, with operating systems such as Kylin, Deepin and Puhua, and with the Firefox and Chromium browsers in domestic software and hardware environments, and therefore has good universality, flexibility and portability.
The device for discovering and disposing the fault of the information creating terminal based on the knowledge graph comprises a memory and a processor; the memory is adapted to store a computer program, and the processor is adapted to carry out the above-mentioned method steps when executing the computer program.
The readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the above-mentioned method steps.
The method for discovering and disposing the faults of the information-creating terminal based on the knowledge graph is compatible with domestic system environments. It can clean system data regularly to keep the system running stably, and can back up system data regularly to prevent data loss or to allow timely recovery from the backup when data is lost.
In use, data cleaning or backup can be configured dynamically as needed; multiple tasks can be configured and started without affecting one another, which makes the system flexible.
In addition, the method for discovering and disposing the fault of the information-creating terminal based on the knowledge graph can be used for backing up through the plug-in or script uploaded by the user, and a cleaning backup task is started or closed at any time.
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method for discovering and disposing an information-creating terminal fault based on a knowledge graph, characterized in that: a fault prediction knowledge graph and a historical fault knowledge graph are constructed using knowledge graph technology, so as to make data processing, fault discovery and fault disposal intelligent;
the method comprises the following steps:
Step S1, data acquisition and processing
Performance data and alarm data are combined for preprocessing, and the relevance among data of different types and different levels in the system is modeled; the acquired data is first cleaned, structured and normalized to obtain a unified data structure, and a fault prediction knowledge graph is then constructed based on the preprocessed information data;
Step S1.1, real-time data acquisition
Time series data of the monitored indexes, including index data, performance data, log data and the like, is collected through an acquisition client; a threshold parameter is set, and a simple rule-based filter is specified according to the key index characteristics and key requirements of the application, so as to screen out some suspected anomalies;
Step S1.2, data preprocessing
First, the acquired data is cleaned to remove incomplete data and redundant data;
the data is then structured: knowledge is wrapped and extracted from the semi-structured data, and entity extraction is performed after the semi-structured data has been converted into structured data;
the semi-structured data includes, but is not limited to, historical fault data, historical fault repair records, the empirical data of operation and maintenance engineers, technical manuals, and user manuals;
finally, the data is normalized to unify the fault prediction standard;
Step S2, fault discovery
Step S2.1, fault prediction
A fault prediction knowledge graph is constructed and pre-trained to obtain a fault prediction model, which predicts faults hidden in the collected data;
Step S2.2, establishing a historical fault knowledge graph
For each fault type, historical fault data is trained on and analyzed using a causal discovery algorithm to construct the index data change characteristics of that fault type, which serve as the historical fault knowledge graph for judging fault similarity;
Step S2.3, the associated fault data is converged into an alarm knowledge graph through the above steps, and a fault rate prediction module further predicts the fault rate from the similarity between the knowledge graph to be predicted and the historical fault knowledge graphs, thereby improving fault prediction accuracy;
Step S2.4, the correlation of abnormal events is calculated using a causal discovery algorithm, the propagation path and characteristics of the abnormal events are represented by the highly correlated links, and the root cause of the abnormal events is located;
Step S3, fault handling
False alarms are identified and suppressed by associating the context of abnormal events through the knowledge graph; the authenticity of an alarm is judged by the similarity between its context and that of false alarms among the historical alarms, thereby suppressing false alarms;
the alarm convergence module, based on the relevance between index data, filters, compresses, merges and de-duplicates alarms and finally consolidates them into one effective message;
the alarm self-healing module trains on historical fault repair data using artificial intelligence techniques, constructs a Bayesian fault self-healing model, and replaces manual fault handling with automated fault self-healing, so that terminal faults are handled quickly and automatically;
the event scheduling module automatically repairs events that support fault self-healing through the alarm self-healing module, and arranges manual repair of events that require it through automatic work order scheduling;
and the alarm notification module, configured through an alarm notification policy, notifies the user of alarm events and reminds operation and maintenance personnel to repair faults in time.
2. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S1.1, the threshold parameter is set through an alarm policy, or is configured automatically by model training on historical data.
3. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S2.1, the fault prediction knowledge graph is constructed as follows:
Step S2.1.1, determining the relevance between each information entity in the information data and a preset fault type based on a preset information system equipment manual and a related knowledge base;
Step S2.1.2, selecting a subset of the information entities as nodes of the fault prediction knowledge graph based on the relevance between the information entities and the preset fault type, and determining the relationships between adjacent nodes;
the information data is analyzed with multiple causal discovery algorithms, such as the PC algorithm, and the final causal edges are determined from the causal edges output by the algorithms combined with manual review and screening; a causal edge comprises the node information corresponding to entities in the graph and the edge information corresponding to the relationships between entities;
Step S2.1.3, constructing the fault prediction knowledge graph based on all nodes and association relationships
The fault prediction knowledge graph is constructed from all information entities and association relationships, following entity-relationship-entity logic.
4. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 3, wherein: in step S2.1.1, the relevance is determined as follows:
Step S2.1.1.1, calculating fingerprint information for each alarm using the MD5 algorithm over the policy, the rule, the unique terminal id, the index field and the index tag attributes, wherein alarms with the same fingerprint are regarded as the same alarm message;
Step S2.1.1.2, extracting the association relationships of the module to which the alarm message belongs;
Step S2.1.1.3, when a machine room fails, summarizing and consolidating the alarm messages generated in the same machine room into one effective message.
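The fingerprinting in step S2.1.1.1 can be sketched with Python's standard `hashlib.md5`. The field names, the separator, and the way tag attributes are serialized are assumptions; the claim only lists which attributes feed the hash.

```python
import hashlib

# Hypothetical field names for the attributes listed in step S2.1.1.1.
FINGERPRINT_FIELDS = ("policy", "rule", "terminal_id", "metric", "tags")

def fingerprint(alarm):
    """MD5 hex digest over the identifying fields of an alarm."""
    parts = []
    for field in FINGERPRINT_FIELDS:
        value = alarm.get(field, "")
        if isinstance(value, dict):
            # Sort tag keys so that ordering does not change the fingerprint.
            value = ",".join(f"{k}={value[k]}" for k in sorted(value))
        parts.append(str(value))
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

def deduplicate(alarms):
    """Keep the first alarm seen for each fingerprint."""
    unique = {}
    for alarm in alarms:
        unique.setdefault(fingerprint(alarm), alarm)
    return list(unique.values())

first = {"policy": "p1", "rule": "cpu>90", "terminal_id": "t-01",
         "metric": "cpu", "tags": {"room": "A", "os": "kylin"}, "ts": 1}
repeat = {"policy": "p1", "rule": "cpu>90", "terminal_id": "t-01",
          "metric": "cpu", "tags": {"os": "kylin", "room": "A"}, "ts": 2}
unique_alarms = deduplicate([first, repeat])
```

The timestamp is deliberately left out of the hashed fields so that repeated firings of the same alarm share one fingerprint.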
5. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S2.2, the historical fault knowledge graph is constructed as follows:
Step S2.2.1, event generation
The real-time index data is checked with the fault prediction model, and an abnormal event is generated when an anomaly occurs;
Step S2.2.2, fault propagation graph construction
The association relationships among abnormal events are mined in the same way as in the construction of the fault prediction knowledge graph in step S2.1, thereby constructing a fault propagation graph;
Step S2.2.3, classification and merging
The generated fault propagation graph is classified and merged with the historical fault knowledge graphs using a clustering algorithm;
Step S2.2.4, labeling
Root cause labeling is performed on the classified and merged fault knowledge graphs, and each class of historical fault knowledge graph is tagged with a fault type label.
6. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S2.3, the fault probability of the knowledge graph to be predicted is determined as follows:
Step S2.3.1, determining the shared subgraphs as the intersection of the subgraph set of the knowledge graph to be predicted and the fault subgraph set of a given historical fault knowledge graph;
Step S2.3.2, determining the similarity between the knowledge graph to be predicted and that historical fault knowledge graph from the shared subgraphs, the weights of the shared subgraphs, and the union of the subgraph set and the fault subgraph set;
Step S2.3.3, sorting the similarities to find the historical fault knowledge graph most similar to the knowledge graph to be predicted, wherein that similarity is taken as the fault rate of the knowledge graph to be predicted.
7. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S2.4, the fault prediction knowledge graph contains the causal derivation relationships accumulated from historical alarm events, and root cause localization based on the fault prediction knowledge graph proceeds as follows:
Step S2.4.1, inputting the time slice samples of the abnormal event into the alarm knowledge graph to obtain a causal graph of alarm nodes;
Step S2.4.2, calculating the weight of each causal edge in the alarm node causal graph, wherein the weight of a causal edge reflects the probability that it lies on the suspected root cause path of the abnormal event in that time slice;
Step S2.4.3, sorting the causal edges by weight, and selecting the causal edge with the highest weight as the final root cause path.
8. The method for discovering and disposing an information-creating terminal fault based on a knowledge graph of claim 1, wherein: in step S3, fault handling is implemented as follows:
Step S3.1, alarm suppression
Step S3.1.1, parsing and storing call chain log data
An analysis window is set, the call chains within the time window of the alarm are parsed from the alarm log, and the call chains are stored in the database;
Step S3.1.2, graph construction
A complete entity tree structure is constructed according to the reference relationships in the call chain;
Step S3.1.3, index table construction
The structure of the call chain is analyzed, the relevant indexes on the call chain within the window time are found, their occurrence frequency within the analysis window is analyzed, and their statistics are recorded in an index table;
Step S3.1.4, alarm authenticity analysis
Based on the call chain of the current alarm, historical call chains that had the same indexes but produced false alarms are searched for, together with the relevant indexes generated on the current call chain and their occurrence frequency; the similarity between the current alarm and the historical alarms is then calculated from the call chain and the relevant indexes on it; if the similarity is higher than a user-defined threshold, the alarm is considered a false alarm; otherwise, it is a true alarm;
Step S3.2, alarm convergence
Step S3.2.1, setting the time slice granularity
Alarm data is acquired in real time within each time slice, where the time slice is the statistical period of the alarm rule;
Step S3.2.2, alarm classification
The original alarm data is classified according to its acquisition index; the acquisition index is any one of CPU utilization, disk utilization, network traffic, or a unique terminal identifier;
Step S3.2.3, converging alarm events
The alarm knowledge graph is queried, and alarm events are converged on a per-system basis; the convergence format is: system 1: {node 1: [alarm type 1, alarm type 2, ...], node 2: [alarm type 1, alarm type 2, ...], ...};
Step S3.3, fault self-healing
Step S3.3.1, preprocessing fault parameter data
The root cause path data obtained in the root cause localization step is organized into fault parameter data and moved into a historical fault database, and the fault type and its corresponding repair program are stored;
Step S3.3.2, establishing the Bayesian fault self-healing model
The prior probability is calculated from the performance indexes of the equipment, and the conditional probability, adjustment factor and posterior probability are calculated from the fault parameter data; the prior probability is then optimized using a Markov transition matrix model, and the adjustment factor and posterior probability are recalculated from the fault parameter data to obtain the self-healing probability for each fault type;
root cause path data acquired in real time is input into the fault self-healing model to obtain the self-healing operability index for the fault type; when this index is greater than a user-defined threshold, the repair program corresponding to the fault type is used to repair the fault automatically;
Step S3.4, event scheduling
For events that require manual repair, a work order is initiated through automatic work order scheduling, and operation and maintenance personnel repair the event manually; for events that can self-heal, the fault self-healing model calls the corresponding repair program to repair them automatically;
Step S3.5, alarm notification
The alarm notification module notifies the user of alarm events according to the configured alarm notification policy, and the notification content supports both preset templates and user-defined templates; under the alarm notification silence policy, the alarm notification module ignores alarms that meet the silence conditions during the silent period, that is, no alarm notification is sent.
9. A device for discovering and disposing an information-creating terminal fault based on a knowledge graph, characterized in that: it comprises a memory and a processor; the memory is adapted to store a computer program which, when executed by the processor, implements the method steps of any of claims 1 to 8.
10. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 8.
CN202211454686.2A 2022-11-21 2022-11-21 Method for discovering and disposing information-creating terminal fault based on knowledge graph Pending CN115809183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211454686.2A CN115809183A (en) 2022-11-21 2022-11-21 Method for discovering and disposing information-creating terminal fault based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211454686.2A CN115809183A (en) 2022-11-21 2022-11-21 Method for discovering and disposing information-creating terminal fault based on knowledge graph

Publications (1)

Publication Number Publication Date
CN115809183A true CN115809183A (en) 2023-03-17

Family

ID=85483592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211454686.2A Pending CN115809183A (en) 2022-11-21 2022-11-21 Method for discovering and disposing information-creating terminal fault based on knowledge graph

Country Status (1)

Country Link
CN (1) CN115809183A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225769B (en) * 2023-05-04 2023-07-11 北京优特捷信息技术有限公司 Method, device, equipment and medium for determining root cause of system fault
CN116225769A (en) * 2023-05-04 2023-06-06 北京优特捷信息技术有限公司 Method, device, equipment and medium for determining root cause of system fault
CN116366477A (en) * 2023-05-30 2023-06-30 中车工业研究院(青岛)有限公司 Train network communication signal detection method, device, equipment and storage medium
CN116366477B (en) * 2023-05-30 2023-08-18 中车工业研究院(青岛)有限公司 Train network communication signal detection method, device, equipment and storage medium
CN117192373B (en) * 2023-08-08 2024-05-07 浙江凌骁能源科技有限公司 Power battery fault analysis method, device, computer equipment and storage medium
CN117192373A (en) * 2023-08-08 2023-12-08 浙江凌骁能源科技有限公司 Power battery fault analysis method, device, computer equipment and storage medium
CN116976859A (en) * 2023-08-11 2023-10-31 杭州学沃网络科技有限公司 Intelligent campus management dormitory warranty maintenance method and system based on big data application
CN117389230A (en) * 2023-11-16 2024-01-12 广州中健中医药科技有限公司 Antihypertensive traditional Chinese medicine extract production control method and system
CN117389230B (en) * 2023-11-16 2024-06-07 广州中健中医药科技有限公司 Antihypertensive traditional Chinese medicine extract production control method and system
CN117647697A (en) * 2023-11-21 2024-03-05 广东电网有限责任公司江门供电局 Knowledge graph-based fault positioning method and system for electric power metering assembly line
CN117647697B (en) * 2023-11-21 2024-05-14 广东电网有限责任公司江门供电局 Knowledge graph-based fault positioning method and system for electric power metering assembly line
CN117527527B (en) * 2024-01-08 2024-03-19 天津市天河计算机技术有限公司 Multi-source alarm processing method and system
CN117527527A (en) * 2024-01-08 2024-02-06 天津市天河计算机技术有限公司 Multi-source alarm processing method and system

Similar Documents

Publication Publication Date Title
CN115809183A (en) Method for discovering and disposing information-creating terminal fault based on knowledge graph
CN111061620B (en) Intelligent detection method and detection system for server abnormity of mixed strategy
KR101984730B1 (en) Automatic predicting system for server failure and automatic predicting method for server failure
CN108415789B (en) Node fault prediction system and method for large-scale hybrid heterogeneous storage system
CN108038049B (en) Real-time log control system and control method, cloud computing system and server
CN108964995B (en) Log correlation analysis method based on time axis event
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
CN111176879A (en) Fault repairing method and device for equipment
CN113282635B (en) Method and device for positioning fault root cause of micro-service system
CN111339175B (en) Data processing method, device, electronic equipment and readable storage medium
CN112769605B (en) Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform
CN112328425A (en) Anomaly detection method and system based on machine learning
CN111949480B (en) Log anomaly detection method based on component perception
CN104574219A (en) System and method for monitoring and early warning of operation conditions of power grid service information system
CN113360722B (en) Fault root cause positioning method and system based on multidimensional data map
CN115865649B (en) Intelligent operation and maintenance management control method, system and storage medium
CN112636967A (en) Root cause analysis method, device, equipment and storage medium
CN111581056B (en) Software engineering database maintenance and early warning system based on artificial intelligence
Chen et al. Graph-based incident aggregation for large-scale online service systems
CN115733762A (en) Monitoring system with big data analysis capability
CN106649034B (en) Visual intelligent operation and maintenance method and platform
CN117331790A (en) Machine room fault detection method and device for data center
Li et al. Microservice anomaly detection based on tracing data using semi-supervised learning
CN116468423A (en) Operation and maintenance emergency coordination method, system and terminal equipment
CN115514627A (en) Fault root cause positioning method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination