CN115809183A

CN115809183A - Knowledge graph-based method for discovering and handling faults of Xinchuang terminals

Info

Publication number: CN115809183A
Application number: CN202211454686.2A
Authority: CN
Inventors: 张迪; 孙元田; 聂郁徐; 朱宪
Original assignee: Inspur Software Group Co Ltd
Current assignee: Inspur Software Group Co Ltd
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-03-17

Abstract

The invention relates to a knowledge graph-based method for discovering and handling faults of Xinchuang (information technology application innovation) terminals. In the method, a fault prediction knowledge graph and a historical fault knowledge graph are constructed using knowledge graph technology, making data processing, fault discovery and fault handling intelligent. With the method, system faults can be discovered in time through the centralized monitoring system and fault handling time is reduced; users obtain an accurate and timely view of the running state of the system and applications; system personnel are helped to carry out necessary system optimization and configuration changes; and a sound basis is provided for upgrading and expanding the system.

Description

Knowledge graph-based method for discovering and handling faults of Xinchuang terminals

Technical Field

The invention relates to the technical field of fault handling, and in particular to a knowledge graph-based method for discovering and handling faults of Xinchuang (information technology application innovation) terminals.

Background

At present, software and hardware are developing rapidly, and so are application systems built on domestic basic software and hardware. More and more customers are considering or adopting a centralized business model. However, as business on the Xinchuang platform grows rapidly, it becomes increasingly complex, which increases the operation and maintenance workload and makes the system itself more complex. Effective system and application monitoring therefore becomes key to understanding how business resources are used, discovering possible system faults in time, and guaranteeing system operation.

As business systems on current Xinchuang platforms become more centralized and complex, operation and maintenance becomes correspondingly more difficult. Effective system and application monitoring is key to guaranteeing system operation: it reveals the usage state of the platform's business resources and uncovers, in time, hidden dangers that may cause system faults.

In view of this situation, the invention provides a knowledge graph-based method for discovering and handling faults of a Xinchuang terminal.

Disclosure of Invention

In order to remedy the defects of the prior art, the invention provides a simple and efficient knowledge graph-based method for discovering and handling faults of a Xinchuang terminal.

The invention is realized by the following technical scheme:

a knowledge graph-based method for discovering and handling faults of a Xinchuang terminal, characterized in that: a fault prediction knowledge graph and a historical fault knowledge graph are constructed using knowledge graph technology, so that data processing, fault discovery and fault handling are made intelligent;

the method comprises the following steps:

step S1, data acquisition and processing

System data is diverse and heterogeneous. Performance data and alarm data are preprocessed together, and the correlations among data of different types and levels in the system are modeled, which improves the accuracy and reliability of information system fault prediction.

First, the acquired data is cleaned, structured and normalized to obtain a unified data structure; a fault prediction knowledge graph is then constructed from the preprocessed information data;

step S1.1, real-time data acquisition

An acquisition client collects time-series data for the monitored metrics, including metric data, performance data, log data and the like; threshold parameters are set, and a simple rule-based filter is specified according to the key metric characteristics and key requirements of the application, so as to screen out some suspected anomalies;

step S1.2, data preprocessing

First, the acquired data is cleaned to remove incomplete and redundant data;

then the data is structured: knowledge is encapsulated and extracted from the semi-structured data, which is converted into structured data before entity extraction is performed;

the semi-structured data includes, but is not limited to, historical fault data, historical fault repair records, empirical data from operation and maintenance engineers, technical manuals and user manuals;

finally, the data is normalized to unify the fault prediction standard;

step S2, fault discovery

Step S2.1, fault prediction

A fault prediction knowledge graph is constructed and pre-trained to obtain a fault prediction model, which predicts faults hidden in the collected data;

step S2.2, constructing a historical fault knowledge graph

For each fault type, historical fault data is trained on and analyzed with a causal discovery algorithm to derive the characteristic changes in metric data for that fault type; these characteristics serve as the historical fault knowledge graph and are used to judge fault similarity;

step S2.3, the associated fault data is aggregated into an alarm knowledge graph through the above steps, and a fault rate prediction module further predicts the fault rate from the similarity between the knowledge graph to be predicted and the historical fault knowledge graphs, improving the accuracy of fault prediction;

step S2.4, the correlation between abnormal events is calculated with a causal discovery algorithm, high-correlation links are used to characterize the propagation path and features of the abnormal events, and the root cause of the abnormal events is located;

step S3, fault handling

False alarms are identified and suppressed by using the knowledge graph to associate each abnormal event with its context: the authenticity of an alarm is judged by the similarity between its context and the contexts of historical false alarms, so that false alarms are suppressed;

the alarm convergence module filters, compresses, merges and deduplicates alarms based on the correlations among the metric data, and finally aggregates them into one effective message;

the alarm self-healing module uses artificial intelligence techniques to train on historical fault repair data and constructs a Bayesian fault self-healing model, so that automated self-healing replaces manual fault handling and terminal faults are handled quickly and automatically;

the event scheduling module has events that support fault self-healing repaired automatically by the alarm self-healing module, and dispatches events that need manual repair to personnel through automatic work order scheduling;

and the alarm notification module, configured through an alarm notification policy, notifies users of alarm events and reminds operation and maintenance personnel to repair faults in time.

In step S1.1, the threshold parameter is set by an alarm policy, or the threshold parameter is automatically configured by performing model training according to historical data.
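
As a non-limiting illustration of step S1.1, the following Python sketch screens suspected anomalies with a simple per-metric threshold rule and, alternatively, derives a threshold from historical data. The metric names, the `Sample` type, the threshold values and the mean-plus-k-standard-deviations rule are assumptions made for the example; the method itself does not prescribe them.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    terminal_id: str
    metric: str
    value: float


# Hypothetical static thresholds, standing in for an alarm policy.
POLICY_THRESHOLDS = {"cpu_usage": 90.0, "disk_usage": 85.0}


def learn_threshold(history, k=3.0):
    """Assumed rule: threshold = mean + k * standard deviation of history."""
    return mean(history) + k * pstdev(history)


def screen_suspected(samples, thresholds):
    """Return the samples whose value exceeds the threshold of their metric."""
    suspected = []
    for s in samples:
        limit = thresholds.get(s.metric)
        # Metrics without a configured threshold are never flagged.
        if limit is not None and s.value > limit:
            suspected.append(s)
    return suspected
```

Samples flagged by such a filter are only candidates; in the method they are examined further in the fault discovery steps.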

In step S2.1, the fault prediction knowledge graph is constructed as follows:

step S2.1.1, determining the relevance between each information entity in the information data and the preset fault types, based on a preset information system equipment manual and a related knowledge base;

step S2.1.2, selecting some of the information entities as nodes of the fault prediction knowledge graph based on that relevance, and determining the relationships between adjacent nodes;

the information data is trained on with several causal discovery algorithms, such as the PC algorithm, and the final causal edges are determined from the causal edges output by each algorithm, combined with manual review and screening; the causal edges refer to the node information corresponding to entities in the graph and the edge information corresponding to the relationships between entities;

step S2.1.3, constructing the fault prediction knowledge graph based on all nodes and association relationships

The fault prediction knowledge graph is constructed based on all information entities, the association relationships, and the logic among the entities.
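
As a non-limiting illustration of how the causal edges output by several causal discovery algorithms might be combined in step S2.1.2, the sketch below keeps edges proposed by at least `min_votes` algorithms and sets the rest aside for manual review. The voting rule, the `min_votes` parameter and the edge names are assumptions; the method only states that the outputs are combined with manual review and screening.

```python
from collections import Counter


def combine_causal_edges(edge_sets, min_votes=2):
    """Split candidate edges into (accepted, needs_review) by vote count.

    edge_sets: one collection of (cause, effect) tuples per algorithm.
    """
    # Each algorithm votes at most once per edge.
    votes = Counter(edge for edges in edge_sets for edge in set(edges))
    accepted = {e for e, n in votes.items() if n >= min_votes}
    needs_review = {e for e, n in votes.items() if n < min_votes}
    return accepted, needs_review
```

Edges in the second set would be passed to an operator for screening rather than discarded automatically.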

In step S2.1.1, the relevance is determined as follows:

step S2.1.1.1, computing fingerprint information for each alarm with the MD5 algorithm from the policy, the rule, the unique terminal id, the metric field and the metric tag attributes, alarms with the same fingerprint being regarded as the same alarm message;

step S2.1.1.2, extracting the association relationships of the modules to which the alarm messages belong;

step S2.1.1.3, when a machine room fails, aggregating the alarm messages generated in the same machine room into one effective message.
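
As a non-limiting illustration of the alarm fingerprint in step S2.1.1.1, the sketch below hashes the policy, rule, unique terminal id, metric field and metric tags with MD5 and treats alarms with equal fingerprints as the same alarm message. The dictionary keys, the `|` separator and the tag ordering are assumptions made for the example.

```python
import hashlib


def alarm_fingerprint(policy, rule, terminal_id, metric_field, tags):
    """Return the MD5 hex digest of the identifying alarm attributes."""
    # Sort the tags so that the same attributes always hash identically.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    raw = "|".join([policy, rule, terminal_id, metric_field, tag_part])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def deduplicate(alarms):
    """Keep the first alarm seen for each fingerprint."""
    unique = {}
    for a in alarms:
        fp = alarm_fingerprint(
            a["policy"], a["rule"], a["terminal_id"],
            a["metric_field"], a.get("tags", {}),
        )
        unique.setdefault(fp, a)
    return list(unique.values())
```

MD5 is used here only as a compact identifier for deduplication, not for any security purpose.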

In step S2.2, the historical fault knowledge graph is constructed as follows:

step S2.2.1, event generation

Real-time metric data is checked with the fault prediction model, and an abnormal event is generated when an anomaly occurs;

step S2.2.2, fault propagation graph construction

The association relationships between abnormal events are mined in the same way as in the fault prediction knowledge graph construction of step S2.1, so as to construct a fault propagation graph;

step S2.2.3, classification and merging

The generated fault propagation graph and the historical fault knowledge graphs are classified and merged using a clustering algorithm;

step S2.2.4, labeling

Root causes are labeled on the classified and merged fault knowledge graphs, and each class of historical fault knowledge graph is marked with a fault type label.

In step S2.3, the fault rate corresponding to the knowledge graph to be predicted is determined as follows:

step S2.3.1, determining the common subgraphs as the intersection of the subgraph set of the knowledge graph to be predicted and the fault subgraph set of a given historical fault knowledge graph;

step S2.3.2, determining the similarity between the knowledge graph to be predicted and that historical fault knowledge graph based on the common subgraphs, the weights corresponding to the common subgraphs, and the union of the subgraph set and the fault subgraph set;

step S2.3.3, sorting the similarities to obtain the historical fault knowledge graph most similar to the knowledge graph to be predicted, that similarity being taken as the fault rate of the knowledge graph to be predicted.
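
As a non-limiting illustration of steps S2.3.1 to S2.3.3, the sketch below computes a weighted overlap between two subgraph sets and returns the most similar historical fault knowledge graph. The exact formula (sum of weights over the intersection divided by sum of weights over the union), the default weight of 1.0 and the subgraph identifiers are assumptions; the method only names the intersection, the weights and the union as inputs.

```python
def graph_similarity(subgraphs, fault_subgraphs, weights):
    """Weighted overlap of two subgraph sets (weight defaults to 1.0)."""
    common = subgraphs & fault_subgraphs
    union = subgraphs | fault_subgraphs
    total = sum(weights.get(g, 1.0) for g in union)
    if total == 0:
        return 0.0
    return sum(weights.get(g, 1.0) for g in common) / total


def predict_fault_rate(subgraphs, historical, weights):
    """Return (fault_type, similarity) of the most similar historical graph.

    historical: dict mapping a fault type label to its set of subgraph ids.
    """
    if not historical:
        return None, 0.0
    scored = {
        name: graph_similarity(subgraphs, fault_set, weights)
        for name, fault_set in historical.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]
```

With this formula the similarity lies between 0 and 1, which is consistent with reading it as a fault rate.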

In step S2.4, the fault prediction knowledge graph contains causal derivation relationships accumulated from historical alarm events, and root cause localization based on it proceeds as follows:

step S2.4.1, inputting a time-slice sample of abnormal events into the alarm knowledge graph to obtain an alarm node causal graph;

step S2.4.2, calculating the weight of each causal edge in the alarm node causal graph, the weight reflecting the probability that the edge lies on the suspected root cause path of the abnormal event in that time slice;

step S2.4.3, sorting the causal edge weights and selecting the causal edge with the highest weight as the final root cause path.
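
As a non-limiting illustration of step S2.4.3, the sketch below ranks the causal edges of an alarm node causal graph by weight and returns the top edge. The edge weights are taken as given, since the way they are computed is not specified here, and the node names used when calling the function are hypothetical.

```python
def locate_root_cause(causal_edges):
    """Rank causal edges by weight and return (best_edge, ranked_edges).

    causal_edges: dict mapping (cause, effect) -> weight in [0, 1].
    """
    ranked = sorted(causal_edges.items(), key=lambda item: item[1], reverse=True)
    best_edge = ranked[0][0] if ranked else None
    return best_edge, ranked
```

Returning the full ranking alongside the top edge lets an operator inspect the runner-up candidates as well.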

In step S3, the fault handling implementation procedure is as follows:

step S3.1, alarm suppression

step S3.1.1, parsing and storing call chain log data

An analysis window is set, the call chains within the time window of the alarm are parsed from the alarm log, and the call chains are stored in a database;

step S3.1.2, graph construction

A complete entity tree structure is constructed from the reference relationships in the call chain;

step S3.1.3, metric table construction

The call chain structure is analyzed to find the related metrics on the call chain within the window, the number of occurrences of each related metric within the analysis window is counted, and these statistics are recorded in a metric table;

step S3.1.4, alarm authenticity analysis

Based on the call chain of the current alarm, historical call chains that had the same metrics but produced false alarms are retrieved; the related metrics on the current call chain and their occurrence counts are retrieved; and the similarity between the current alarm and each historical alarm is calculated from the call chains and their related metrics; if the similarity exceeds a user-defined threshold, the alarm is considered a false alarm, otherwise it is a real alarm.
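
As a non-limiting illustration of step S3.1.4, the sketch below compares the current alarm with historical false alarms by combining the overlap of their call chains with the similarity of their metric occurrence counts. The choice of Jaccard and cosine similarity, the equal weighting, the field names `chain` and `freq`, and the threshold of 0.8 are all assumptions; the method only requires some similarity measure and a user-defined threshold.

```python
import math


def jaccard(a, b):
    """Overlap of two collections of call chain nodes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def cosine(f1, f2):
    """Cosine similarity of two metric -> occurrence count mappings."""
    keys = set(f1) | set(f2)
    dot = sum(f1.get(k, 0) * f2.get(k, 0) for k in keys)
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def is_false_alarm(current, historical_false, threshold=0.8):
    """True if the current alarm resembles any historical false alarm."""
    for past in historical_false:
        sim = 0.5 * jaccard(current["chain"], past["chain"]) \
            + 0.5 * cosine(current["freq"], past["freq"])
        if sim > threshold:
            return True
    return False
```

An alarm judged to be false in this way would be suppressed; all other alarms continue to alarm convergence.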

Step S3.2, alarm convergence

Step S3.2.1, setting time slice granularity

Alarm data within the time slice, i.e. the statistical period of the alarm rule, is acquired in real time;

step S3.2.2, alarm classification

The raw alarm data is classified by its collection metric; the collection metric is any one of CPU utilization, disk utilization, network traffic, or the unique terminal identifier.

Step S3.2.3, converging the alarm events

The alarm knowledge graph is queried, and alarm events are converged with the system as the unit; the convergence format is: system 1: { node 1: [alarm type 1, alarm type 2, ...], node 2: [alarm type 1, alarm type 2, ...], ... };
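
As a non-limiting illustration of the convergence format in step S3.2.3, the sketch below groups alarm events by system and node and collects the distinct alarm types per node. It assumes that each alarm already carries the `system` and `node` it belongs to (in the method these would be looked up in the alarm knowledge graph); the field names are hypothetical.

```python
from collections import defaultdict


def converge(alarms):
    """Group alarms as {system: {node: [alarm types]}}, without duplicates."""
    grouped = defaultdict(lambda: defaultdict(list))
    for a in alarms:
        types = grouped[a["system"]][a["node"]]
        if a["type"] not in types:
            types.append(a["type"])
    # Convert to plain dicts so the result is easy to serialize.
    return {system: dict(nodes) for system, nodes in grouped.items()}
```

Each top-level entry of the result can then be sent as one effective message per system.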

step S3.3, fault self-healing

Step S3.3.1, preprocessing fault parameter data

The root cause path data obtained in the root cause localization step is organized into fault parameter data and moved into the historical fault database, where the fault type and its corresponding repair program are stored;

step S3.3.2, establishing a Bayesian fault self-healing model

The prior probability is calculated from the performance metrics of the equipment, and the conditional probability, adjustment factor and posterior probability are calculated from the fault parameter data; the prior probability is then optimized using a Markov transition matrix model, and the adjustment factor and posterior probability are calculated from the fault parameter data to obtain the self-healing probability for the fault type;

the root cause path data acquired in real time is input into the fault self-healing model to obtain the preset self-healing operability index for the fault type; when the index exceeds a user-defined threshold, the fault is repaired automatically using the repair program corresponding to the fault type;
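
As a non-limiting illustration of step S3.3.2, the sketch below applies one step of a Markov transition matrix to a prior distribution and then uses Bayes' rule to obtain a posterior self-healing probability. Interpreting the adjustment factor as the likelihood divided by the evidence, the two-state model (repair succeeds, repair fails) and all numeric values used when calling the functions are assumptions; the method does not give the concrete formulas.

```python
def markov_adjust(prior, transition):
    """One step of a state distribution under a row-stochastic matrix."""
    n = len(prior)
    return [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]


def self_heal_probability(prior_ok, p_obs_given_ok, p_obs_given_fail):
    """Posterior probability that automatic repair succeeds (Bayes' rule).

    prior_ok: prior probability of successful self-healing.
    p_obs_given_ok / p_obs_given_fail: probability of the observed fault
    parameters given that repair succeeds or fails.
    """
    evidence = prior_ok * p_obs_given_ok + (1.0 - prior_ok) * p_obs_given_fail
    if evidence == 0:
        return 0.0
    adjustment = p_obs_given_ok / evidence  # assumed "adjustment factor"
    return prior_ok * adjustment


def should_self_heal(probability, threshold=0.9):
    """Automatic repair is attempted only above a user-defined threshold."""
    return probability > threshold
```

When `should_self_heal` returns false, the event would instead be handed to event scheduling for manual repair.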

step S3.4, event scheduling

For events that need manual repair, a work order is initiated through automatic work order scheduling and operation and maintenance personnel repair them manually; for events that can self-heal, the fault self-healing model calls the corresponding repair program to repair them automatically;

step S3.5, alarm notification

The alarm notification module notifies users of alarm events according to the configured alarm notification policy, and the notification content supports both preset and user-defined templates; under the alarm notification silence policy, the module ignores alarms that meet the silence conditions during the silence period, i.e. it does not send notifications for them.
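
As a non-limiting illustration of the silence policy in step S3.5, the sketch below skips notification for alarms that match a silence rule during its time window. The rule structure (a `match` predicate plus `start` and `end` times), the severity field and the example window are assumptions made for the example.

```python
from datetime import datetime, time


def in_silence_window(now, start, end):
    """True if now falls in [start, end); the window may wrap past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_notify(alarm, now, silence_rules):
    """Return False if any silence rule matches the alarm at this time."""
    for rule in silence_rules:
        if rule["match"](alarm) and in_silence_window(now, rule["start"], rule["end"]):
            return False
    return True


# Hypothetical rule: silence low-severity alarms from 22:00 to 06:00.
NIGHT_RULE = {
    "match": lambda alarm: alarm.get("severity") == "low",
    "start": time(22, 0),
    "end": time(6, 0),
}
```

Alarms that pass this check would then be rendered with the configured template and sent through the configured channels.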

A knowledge graph-based device for discovering and handling faults of a Xinchuang terminal, characterized in that it comprises a memory and a processor; the memory is used to store a computer program, and the processor is used to carry out the above method steps when executing the computer program.

A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, carries out the above method steps.

The beneficial effects of the invention are as follows: with the knowledge graph-based method for discovering and handling faults of a Xinchuang terminal, system faults can be discovered in time through the centralized monitoring system and fault handling time is reduced; users obtain an accurate and timely view of the running state of the system and applications; system personnel are helped to carry out necessary system optimization and configuration changes; and a sound basis is provided for upgrading and expanding the system.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the knowledge graph-based method for discovering and handling faults of a Xinchuang terminal.

Detailed Description

In order that those skilled in the art may better understand the technical solution of the present invention, the technical solution is described clearly and completely below with reference to the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Using the statistics, notification, processing and other modules of the alarm center, maintenance personnel can quickly analyze the cause of a fault and are freed from tedious manual work. Faults in the system are discovered in time through the centralized monitoring system, and fault handling time is shortened. A basis is also provided for upgrading and expanding the system.

In the knowledge graph-based method for discovering and handling faults of a Xinchuang terminal, knowledge graph technology is used to construct a fault prediction knowledge graph and a historical fault knowledge graph, making data processing, fault discovery and fault handling intelligent;

the method comprises the following steps:

step S1, data acquisition and processing

step S1.1, real-time data acquisition

An acquisition client collects time-series data for the monitored metrics, including metric data, performance data, log data and the like; threshold parameters are set, and a simple rule-based filter is specified according to the key metric characteristics and key requirements of the application, so as to screen out some suspected anomalies;

step S1.2, data preprocessing

First, the acquired data is cleaned to remove incomplete data (for example, data from terminals that were online too briefly, data with collection intervals that are too long or too short, and data truncated between intervals) and redundant data (data that has no bearing on faults, such as data set information and system log data within normal threshold ranges, is filtered out to reduce the workload of subsequent analysis);
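
As a non-limiting illustration of the cleaning in step S1.2, the sketch below drops records with missing values, exact duplicates, and records whose collection interval is too short or too long. The record fields, the interval bounds and the decision to drop the record that follows an irregular interval are assumptions made for the example.

```python
def clean(records, min_gap=5.0, max_gap=120.0):
    """Filter time-ordered records of one or more (terminal, metric) series.

    records: list of dicts with 'terminal_id', 'metric', 'value' and 'ts'
    (timestamp in seconds), sorted by 'ts'.
    """
    seen = set()
    last_ts = {}
    cleaned = []
    for r in records:
        if r.get("value") is None:
            continue  # incomplete record
        key = (r["terminal_id"], r["metric"], r["ts"])
        if key in seen:
            continue  # redundant duplicate
        seen.add(key)
        series = (r["terminal_id"], r["metric"])
        prev = last_ts.get(series)
        last_ts[series] = r["ts"]
        # Drop records that arrive after an irregular collection interval.
        if prev is not None and not (min_gap <= r["ts"] - prev <= max_gap):
            continue
        cleaned.append(r)
    return cleaned
```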

step S2, fault discovery

Fault discovery implements, on the basis of knowledge graphs, the processing flow from collected data to fault root cause information, and comprises fault prediction, construction of the historical fault knowledge graph, fault rate prediction and root cause localization;

step S2.1, fault prediction

step S2.2, constructing a historical fault knowledge graph

For each fault type, historical fault data is trained on and analyzed with a causal discovery algorithm to derive the characteristic changes in metric data for that fault type; these characteristics serve as the historical fault knowledge graph and are used to judge fault similarity;

step S2.3, the associated fault data is aggregated into an alarm knowledge graph through the above steps, and a fault rate prediction module further predicts the fault rate from the similarity between the knowledge graph to be predicted and the historical fault knowledge graphs, improving the accuracy of fault prediction;

step S2.4, the root cause service and the service where the anomaly was first observed usually both lie on an anomaly propagation chain made up of a series of abnormal services. The root cause localization module therefore analyzes the anomaly propagation chains hidden in the alarm knowledge graph to determine a set of candidate root cause services from the possible propagation chains; on this basis it calculates the correlation between abnormal events with a causal discovery algorithm, uses high-correlation links to characterize the propagation path and features of the abnormal events, and locates their root cause;

step S3, fault handling

Fault handling makes automatic fault diagnosis decisions through alarm suppression, alarm convergence, fault self-healing, event scheduling and alarm notification; faults are self-healed in some scenarios and work orders are scheduled automatically in the others, which improves operation and maintenance efficiency and reduces fault handling time.

The uncertainty of AI-based prediction, large fluctuations in the monitored metrics, aperiodic patterns and similar factors cause a certain number of false alarms; when the method is used at scale in a production environment, false alarms accumulate and can easily form an alarm storm. To solve this problem, false alarms are identified and suppressed by using the knowledge graph to associate each abnormal event with its context, and the authenticity of an alarm is judged by the similarity between its context and the contexts of historical false alarms, so that false alarms are suppressed;

the alarm convergence module filters, compresses, merges and deduplicates alarms based on the correlations among the metric data and finally aggregates them into one effective message; on the one hand this avoids alarm storms, and on the other hand correlation-based compression can reduce invalid alarms by 90%, improving alarm efficiency.

The alarm self-healing module uses artificial intelligence techniques to train on historical fault repair data and constructs a Bayesian fault self-healing model, so that automated self-healing replaces manual fault handling and terminal faults are handled quickly and automatically;

the event scheduling module has events that support fault self-healing repaired automatically by the alarm self-healing module, and dispatches events that need manual repair to personnel through automatic work order scheduling;

In step S1.1, the threshold parameter is set through an alarm policy, or is configured automatically by training a model on historical data.

The collected historical data contains a large number of correlations and causal relationships. The inherent topology of the terminal equipment and the call relationships between system applications can quickly form the entities and relationships of the knowledge graph. In step S2.1, the fault prediction knowledge graph is constructed as follows:

in step S2.1.1, the relevance is determined as follows:

step S2.1.1.1, alarm messages may share the same dimension attributes, such as the same policy name, rule name, deployment attribute (terminal, server, host and the like) and metric type (memory, CPU, network utilization and the like). Fingerprint information is computed for each alarm with the MD5 algorithm from the policy, the rule, the unique terminal id, the metric field and the metric tag attributes, and alarms with the same fingerprint are regarded as the same alarm message;

step S2.1.1.2, when module A calls module B and module B becomes abnormal, module A will show a related anomaly as well; this association can be mined from historical abnormal events.

step S2.1.1.3, when a machine room fails, the alarm messages generated in the same machine room need not be generated or collected one by one; they are aggregated into one effective message.

step S2.1.2, selecting some of the information entities as nodes of the fault prediction knowledge graph based on the relevance between the information entities and the preset fault types, and determining the relationships between adjacent nodes;

The relevance between each information entity in the information data and the preset fault types is determined by combining a preset system equipment manual with a knowledge base accumulated from the experience of operation and maintenance experts; based on this relevance, some of the entities are selected as nodes of the fault prediction knowledge graph and the relationships between adjacent nodes are determined. The fault prediction knowledge graph is constructed based on all information entities, the association relationships, and the logic among the entities.

step S2.2.1, event Generation

step S2.2.2, fault propagation diagram construction

step S2.2.3, classification and merging

The generated fault propagation graph and the historical fault knowledge graphs are classified and merged using a clustering algorithm;

step S2.2.4, labeling

Root causes are labeled on the classified and merged fault knowledge graphs, and each class of historical fault knowledge graph is marked with a fault type label.

In step S2.4, the fault prediction knowledge graph contains causal derivation relationships accumulated from historical alarm events, and root cause localization based on it proceeds as follows:

step S2.4.1, inputting a time-slice sample of abnormal events into the alarm knowledge graph to obtain an alarm node causal graph;

step S2.4.2, calculating the weight of each causal edge in the alarm node causal graph, the weight reflecting the probability that the edge lies on the suspected root cause path of the abnormal event in that time slice;

step S2.4.3, sorting the causal edge weights and selecting the causal edge with the highest weight as the final root cause path.

The resulting set of possible root cause propagation paths generally contains the real root cause. The method automates root cause derivation and shortens the time operation and maintenance personnel must spend intervening; the visualized derivation link is readily interpretable and makes post-incident review and optimization easy.

In step S3, the fault handling implementation procedure is as follows:

step S3.1, alarm suppression

False alarms are identified and suppressed by associating each abnormal event with its context through an interpretable knowledge graph, as follows:

step S3.1.1, analyzing and warehousing call chain log data

step S3.1.2, map construction

step S3.1.3, index table construction

step S3.1.4, alarm authenticity analysis

Based on the call chain of the current alarm, historical call chains that had the same metrics but produced false alarms are retrieved; the related metrics on the current call chain and their occurrence counts are retrieved; and the similarity between the current alarm and each historical alarm is calculated from the call chains and their related metrics; if the similarity exceeds a user-defined threshold, the alarm is considered a false alarm, otherwise it is a real alarm.

The above steps accumulate a record of historical false alarms; an alarm is judged to be false by the similarity between its context and the contexts of those records, so that false alarms are suppressed.

Step S3.2, alarm convergence

Alarm convergence aggregates alarm events into a knowledge graph organized by system, based on the correlations between the alarm events, thereby reducing the number and frequency of alarm messages. Alarm convergence comprises the following main steps:

step S3.2.1, setting time slice granularity

Alarm data within the time slice, i.e. the statistical period of the alarm rule, is acquired in real time;

step S3.2.2, alarm classification

The raw alarm data is classified by its collection metric; the collection metric is any one of CPU utilization, disk utilization, network traffic, or a unique terminal identifier such as an IP or MAC address.

Step S3.2.3, convergence is carried out on the alarm event

step S3.3, fault self-healing

Training historical fault repair data by adopting an artificial intelligence technology, constructing a Bayesian fault self-healing model, and realizing artificial intelligence fault self-healing to replace manual fault treatment, thereby realizing rapid and automatic processing of terminal faults; the fault self-healing model construction method used by the patent comprises the following steps:

step S3.3.1, preprocessing fault parameter data

Sorting the root cause path data acquired in the root cause positioning step into fault parameter data, moving the fault parameter data into a historical fault database, and storing the fault type and a corresponding repair program;

Step S3.3.2, establishing the Bayesian fault self-healing model

The prior probability is calculated from the performance indicators of the equipment, and the conditional probability, adjustment factor, and posterior probability are calculated from the fault parameter data; the prior probability is then refined using a Markov transition matrix model, and the adjustment factor and posterior probability are recalculated from the fault parameter data to obtain the self-healing probability corresponding to each fault type;

Root cause path data acquired in real time is input into the fault self-healing model to obtain the preset self-healing operability index of the fault type; when this index is greater than the user-defined threshold, the fault is repaired automatically using the repair program corresponding to the fault type;
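The source does not give the formulas for this step. A minimal sketch, assuming two states per fault type (self-healable, needs manual repair), a standard Bayes update in which the adjustment factor is the likelihood divided by the evidence, and a single Markov transition step applied to the prior, might look like the following. The function names, the example probabilities, and the default threshold of 0.8 are all hypothetical.

```python
def markov_adjust_prior(prior, transition):
    """One Markov step: new_prior[j] = sum_i prior[i] * transition[i][j]."""
    n = len(prior)
    return [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]


def bayes_posterior(prior, likelihood):
    """Bayes' rule over states: posterior[i] = prior[i] * adjustment[i].

    The adjustment factor is likelihood[i] / evidence, where the evidence is
    the total probability of the observed fault parameters.
    """
    evidence = sum(p * l for p, l in zip(prior, likelihood))
    if evidence == 0:
        raise ValueError("observation has zero probability under the model")
    adjustment = [l / evidence for l in likelihood]
    return [p * a for p, a in zip(prior, adjustment)]


def should_auto_repair(self_healing_probability, threshold=0.8):
    """Assumed decision rule: repair automatically above the threshold."""
    return self_healing_probability > threshold


# Hypothetical numbers for one fault type.
# State 0 = self-healable, state 1 = needs manual repair.
prior = [0.6, 0.4]                     # from equipment performance indicators
transition = [[0.9, 0.1], [0.2, 0.8]]  # assumed Markov transition matrix
likelihood = [0.7, 0.2]                # P(observed root cause path | state)

adjusted_prior = markov_adjust_prior(prior, transition)
posterior = bayes_posterior(adjusted_prior, likelihood)
auto_repair = should_auto_repair(posterior[0])
```

With these illustrative numbers the adjusted prior is roughly [0.62, 0.38] and the posterior probability of the self-healable state is about 0.85, so the assumed rule selects automatic repair.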

Step S3.4, event scheduling

For events that require manual repair, a work order is initiated through automatic work order scheduling, and operation and maintenance personnel repair the event manually; for events that can self-heal, the fault self-healing model calls the corresponding repair program to repair them automatically;
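The routing described in this step can be illustrated with a short sketch. It assumes that each event carries a fault type, that repair programs are registered per fault type as callables, and that work orders are simply appended to a list; `schedule_event` and all field names are hypothetical and do not come from the source.

```python
def schedule_event(event, healing_index, threshold, repair_programs, work_orders):
    """Route an event to automatic repair or to a manual work order."""
    repair = repair_programs.get(event["fault_type"])
    if repair is not None and healing_index > threshold:
        # Self-healable: call the repair program stored for this fault type.
        repair(event)
        return "auto_repaired"
    # Otherwise open a work order for operation and maintenance personnel.
    work_orders.append({"event_id": event["id"], "fault_type": event["fault_type"]})
    return "work_order_created"
```

Falling back to a work order when no repair program is registered keeps unknown fault types on the manual path, which is the conservative choice under these assumptions.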

Step S3.5, alarm notification

The alarm notification module notifies users of alarm events according to the configured alarm notification strategy; notification channels include, but are not limited to, e-mail, SMS, in-application notification, and DingTalk. The notification content supports both preset templates and user-defined templates; under the alarm notification silence strategy, the alarm notification module ignores alarms that meet the silence conditions during the silence period, that is, no alarm notification is sent for them.
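As an illustration of the silence strategy and of template-based content, the sketch below checks an alarm against silence rules before rendering it. The rule format (a metric plus a daily start and end time), the alarm fields, and the function names are assumptions made here; the source does not specify them. Only the Python standard library is used.

```python
from datetime import datetime, time
from string import Template

# Hypothetical preset template; users could pass their own Template instead.
DEFAULT_TEMPLATE = Template("[$level] $system: $message")


def in_silence_window(now, start, end):
    """True if `now` falls inside [start, end); windows may cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def build_notification(alarm, now, silence_rules, template=DEFAULT_TEMPLATE):
    """Return the rendered message, or None if a silence rule suppresses it."""
    for rule in silence_rules:
        if rule["metric"] == alarm["metric"] and in_silence_window(
            now, rule["start"], rule["end"]
        ):
            return None  # silenced: no notification is sent
    return template.substitute(alarm)
```

A caller would send the returned string through whichever channel is configured and skip sending when the result is `None`.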

The method for discovering and disposing of information-creating terminal faults based on the knowledge graph is compatible with operating systems such as NeoKylin, Deepin, and iSoft (Puhua) running on various domestic CPUs, and with the Firefox and Chromium browsers in domestic software and hardware environments, and therefore has good universality, flexibility, and portability.

The device for discovering and disposing of information-creating terminal faults based on the knowledge graph comprises a memory and a processor; the memory is used to store a computer program, and the processor implements the above method steps when executing the computer program.

The readable storage medium has a computer program stored thereon which, when executed by a processor, implements the above method steps.

The method for discovering and disposing of information-creating terminal faults based on the knowledge graph is compatible with domestic system environments. It can clean system data at regular intervals to keep the system running stably, and it can back up system data at regular intervals to prevent data loss, or restore from the backup promptly when data is lost.

In use, data cleaning and backup can be configured dynamically as needed; multiple tasks can be configured and started without affecting one another, which makes the system relatively flexible.

In addition, the method can perform backups through plug-ins or scripts uploaded by the user, and a cleaning or backup task can be started or stopped at any time.

The above-described embodiment is only one specific embodiment of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within the protection scope of the present invention.

Claims

1. A method for discovering and disposing of information-creating terminal faults based on a knowledge graph, characterized in that: a fault prediction knowledge graph and a historical fault knowledge graph are constructed using knowledge graph technology, so that data processing, fault discovery, and fault disposal are made intelligent;

the method comprises the following steps:

Step S1, data acquisition and processing

Performance data and alarm data are combined and preprocessed, and the correlations among data of different types and levels in the system are modeled; the acquired data is first cleaned, structured, and normalized to obtain a unified data structure, and a fault prediction knowledge graph is then constructed from the preprocessed information data;

step S1.1, real-time data acquisition

step S1.2, data preprocessing

step S2, fault discovery

Step S2.1, fault prediction

Step S2.2, establishing a historical fault knowledge graph

Step S2.4, calculating the correlation of abnormal events using a causal discovery algorithm, representing the propagation paths and characteristics of the abnormal events by highly correlated links, and locating the root cause of the abnormal events;

Step S3, fault disposal

False alarms are identified and suppressed by using the knowledge graph to associate abnormal events with their context; the authenticity of an alarm is judged by the similarity between its context and that of false alarms among the historical alarms, thereby suppressing false alarms;

based on the correlation between metric data, the alarm convergence module filters, compresses, merges, and de-duplicates alarms and finally consolidates them into one effective message;

and the alarm notification module, configured through an alarm notification strategy, notifies users of alarm events and reminds operation and maintenance personnel to repair faults in time.

2. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S1.1, the threshold parameter is either set through an alarm strategy, or configured automatically by training a model on historical data.

3. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S2.1, the fault prediction knowledge graph is constructed as follows:

the information data is used to train a plurality of causal discovery algorithms (PC), and the final causal edges are determined from the causal edges output by the algorithms in combination with manual review and screening; a causal edge refers to the node information corresponding to entities in the graph and the edge information corresponding to the relationships between entities;

and a fault prediction knowledge graph is constructed from all the information according to the entity-relationship-entity logic.

4. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 3, wherein: in step S2.1.1, the correlation is determined as follows:

Step S2.1.1.3, when a machine room fails, the alarm messages generated in that machine room are consolidated into one effective message.

5. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S2.2, the historical fault knowledge graph is constructed as follows:

step S2.2.1, event generation

step S2.2.2, fault propagation diagram construction

step S2.2.3, sort merge

step S2.2.4, labeling

6. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S2.3, the fault probability corresponding to the knowledge graph to be predicted is determined, and the steps are as follows:

7. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S2.4, the fault prediction knowledge graph contains the causal derivation relationships accumulated from historical alarm events, and root cause localization based on the fault prediction knowledge graph proceeds as follows:

8. The knowledge graph-based method for discovering and disposing of information-creating terminal faults according to claim 1, wherein: in step S3, fault disposal is implemented as follows:

step S3.1, alarm suppression

Step S3.1.1, parsing and storing call chain log data

step S3.1.2, graph construction

step S3.1.3, index table construction

step S3.1.4, alarm authenticity analysis

Based on the call chain of the current alarm, historical call chains that had the same metric but produced a false alarm are searched for, together with the related metrics generated on the current call chain and their occurrence frequency; the similarity between the current alarm and the historical alarms is calculated from the call chains and the related metrics on them; if the similarity is higher than the user-defined threshold, the alarm is considered a false alarm, otherwise it is a true alarm;

step S3.2, alarm convergence

Step S3.2.1, setting time slice granularity

step S3.2.2, alarm classification

The original alarm data is classified according to its collection metric; the collection metric may be any one of CPU utilization, disk utilization, network traffic, or a unique terminal identifier;

step S3.2.3, converging the alarm events

step S3.3, fault self-healing

Step S3.3.1, preprocessing fault parameter data

step S3.3.2, establishing a Bayesian fault self-healing model

step S3.4, event scheduling

step S3.5, alarm notification

The alarm notification module notifies users of alarm events according to the configured alarm notification strategy, and the notification content supports both preset templates and user-defined templates; under the alarm notification silence strategy, the alarm notification module ignores alarms that meet the silence conditions during the silence period, that is, no alarm notification is sent for them.

9. A device for discovering and disposing of information-creating terminal faults based on a knowledge graph, characterized in that: it comprises a memory and a processor; the memory is used to store a computer program which, when executed by the processor, implements the method steps of any one of claims 1 to 8.

10. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 8.