CN114978877B

CN114978877B - Abnormality processing method, abnormality processing device, electronic equipment and computer readable medium

Info

Publication number: CN114978877B
Application number: CN202210517872.XA
Authority: CN
Inventors: 张静; 张宪波
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2022-05-13
Filing date: 2022-05-13
Publication date: 2024-04-05
Anticipated expiration: 2042-05-13
Also published as: CN114978877A

Abstract

The application discloses an exception handling method, an exception handling device, electronic equipment and a computer readable medium, and relates to the technical field of computers, wherein the exception handling method comprises the following steps: in response to detecting the abnormal event, determining an abnormal time period corresponding to the abnormal event; acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period; acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation; positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity; and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity. The classifying capability of the abnormal categories of the operation and maintenance time sequence data is improved, the checking flow is concise, and the abnormality checking efficiency is high.

Description

Abnormality processing method, abnormality processing device, electronic equipment and computer readable medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to an exception handling method, an exception handling device, an electronic device, and a computer readable medium.

Background

When anomaly detection is performed on operation and maintenance time-series data, existing approaches do not adapt well to massive numbers of monitoring indexes and have difficulty identifying the anomaly type at an abnormal time point. When complex patterns are extracted from time-series index data, important structural information, such as the anomaly information between adjacent abnormal points, cannot be preserved, so the ability to classify anomaly types of operation and maintenance time-series data is weak.

In the process of implementing the present application, the inventor finds that at least the following problems exist in the prior art:

In the prior art, troubleshooting the root cause of a fault from the perspective of index anomalies depends on expert experience. After receiving a monitoring index anomaly alarm, an operation and maintenance expert has to investigate across the various operation and maintenance monitoring platforms, including host monitoring, database monitoring, middleware monitoring and the like, so the troubleshooting process is cumbersome.

Disclosure of Invention

In view of this, the embodiments of the present application provide an exception handling method, apparatus, electronic device, and computer readable medium, which can solve the problem that troubleshooting the root cause of a fault from the perspective of index anomalies depends on expert experience and that, after receiving a monitoring index anomaly alarm, an operation and maintenance expert has to investigate across each operation and maintenance monitoring platform, including host monitoring, database monitoring, middleware monitoring and the like, which makes the troubleshooting process cumbersome.

To achieve the above object, according to one aspect of the embodiments of the present application, there is provided an exception handling method, including:

in response to detecting the abnormal event, determining an abnormal time period corresponding to the abnormal event;

acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period;

acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation;

positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity;

and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity.

Optionally, determining the corresponding abnormality index includes:

determining an index shape corresponding to the index information;

and determining an abnormal index corresponding to the abnormal time period according to the index shape.

Optionally, determining the abnormality index corresponding to the abnormality time period includes:

determining a business layer index shape and a basic resource layer index shape corresponding to the abnormal event from the index shapes;

calculating the similarity of the business layer index shape and the basic resource layer index shape;

and determining the associated abnormal index corresponding to the abnormal time period according to the similarity.

Optionally, acquiring the corresponding static entity relationship according to the abnormal time period includes:

acquiring corresponding log information according to the abnormal time period;

and determining the static entity relation according to the log information.

Optionally, after outputting the indicator anomaly type and the anomaly entity, the method further comprises:

displaying the abnormal event, the abnormal entity corresponding to the abnormal event, the total abnormal index corresponding to the abnormal entity and the index abnormal type in a knowledge graph in a preset form.

Optionally, determining the total anomaly index corresponding to the anomaly entity includes:

determining one or more abnormal events corresponding to the abnormal entity based on the knowledge graph;

and summarizing the abnormal indexes corresponding to the one or more abnormal events to obtain a total abnormal index.

Optionally, before determining the abnormal time period corresponding to the abnormal event, the method further includes:

in response to detecting that the scene changes, switching to a corresponding abnormal threshold based on the changed scene;

obtaining detection results of all indexes based on the anomaly detection model and the baseline model;

and triggering an abnormal event in response to the detection result exceeding the abnormal threshold.

In addition, the application also provides an exception handling device, which comprises:

an abnormal time period determining unit configured to determine an abnormal time period corresponding to the abnormal event in response to detecting the abnormal event;

an abnormality index determination unit configured to acquire index information associated with an abnormality event, and further determine a corresponding abnormality index based on the index information and the abnormality period;

the knowledge graph generation unit is configured to acquire a corresponding static entity relationship according to the abnormal time period, and further generate a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relationship;

the total abnormal index determining unit is configured to locate an abnormal entity based on the knowledge graph and further determine a total abnormal index corresponding to the abnormal entity;

the abnormality processing unit is configured to input the total abnormality index into the abnormality detection model to obtain a corresponding index abnormality type, and output the index abnormality type and the abnormality entity.

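Optionally, the abnormality index determination unit is further configured to:

determine an index shape corresponding to the index information;

and determine an abnormal index corresponding to the abnormal time period according to the index shape.

Optionally, the abnormality index determination unit is further configured to:

determine a business layer index shape and a basic resource layer index shape corresponding to the abnormal event from the index shapes;

calculate the similarity of the business layer index shape and the basic resource layer index shape;

and determine the associated abnormal index corresponding to the abnormal time period according to the similarity.

Optionally, the knowledge-graph generation unit is further configured to:

acquire corresponding log information according to the abnormal time period;

and determine the static entity relation according to the log information.

Optionally, the exception handling device further comprises a presentation unit configured to:

display the abnormal event, the abnormal entity corresponding to the abnormal event, the total abnormal index corresponding to the abnormal entity and the index abnormal type in a knowledge graph in a preset form.

Optionally, the total anomaly index determining unit is further configured to:

determine one or more abnormal events corresponding to the abnormal entity based on the knowledge graph;

and summarize the abnormal indexes corresponding to the one or more abnormal events to obtain a total abnormal index.

Optionally, the exception handling apparatus further comprises an exception event triggering unit configured to:

in response to detecting that the scene changes, switch to a corresponding abnormal threshold based on the changed scene;

obtain detection results of all indexes based on the anomaly detection model and the baseline model;

and trigger an abnormal event in response to the detection result exceeding the abnormal threshold.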

In addition, the application also provides an exception handling electronic device, which comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the exception handling method as described above.

In addition, the application also provides a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the exception handling method as described above.

One embodiment of the above invention has the following advantages or benefits: in the method, an abnormal time period corresponding to an abnormal event is determined in response to the detection of the abnormal event; acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period; acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation; positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity; and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity. When an abnormal event occurs, the knowledge graph of the abnormal event is constructed by collecting index information of the abnormal event of the system. The knowledge graph is used for storing abnormal events, the graph reasoning method is used for positioning abnormal entities, and the entity part affected by the faults and related abnormal indexes are obtained, so that the related abnormal indexes are classified. The classifying capability of the abnormal categories of the operation and maintenance time sequence data is improved, the checking flow is concise, and the abnormality checking efficiency is high.

Further effects of the optional implementations described above are explained below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the present application and are not to be construed as unduly limiting the present application. Wherein:

FIG. 1 is a schematic diagram of the main flow of an exception handling method provided according to one embodiment of the present application;

FIG. 2 is a schematic diagram of the main flow of an exception handling method provided according to one embodiment of the present application;

FIG. 3 is a flow chart of event map construction for root cause localization according to an exception handling method provided in one embodiment of the present application;

FIG. 4 is a schematic diagram of fault node associations of an exception handling method provided in accordance with one embodiment of the present application;

FIG. 5 is a flowchart of a multi-index anomaly detection algorithm for an anomaly handling method according to one embodiment of the present application;

FIG. 6 is a schematic diagram of the main units of an exception handling apparatus according to an embodiment of the present application;

FIG. 7 is an exemplary system architecture diagram in which embodiments of the present application may be applied;

FIG. 8 is a schematic diagram of a computer system suitable for use in implementing the terminal device or server of the embodiments of the present application.

Detailed Description

Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. The data acquisition, storage, use, processing and the like in the technical scheme meet the relevant regulations of national laws and regulations.

Fig. 1 is a schematic diagram of main flow of an exception handling method according to an embodiment of the present application, and as shown in fig. 1, the exception handling method includes:

Step S101, in response to detecting the abnormal event, an abnormal time period corresponding to the abnormal event is determined.

In this embodiment, the execution body of the exception handling method (for example, a server) may continuously detect, over a wired or wireless connection, whether an abnormal event exists. An abnormal event may be, for example, an event associated with an abnormal index, and is triggered when the abnormal index appears.

After detecting the abnormal event, the executing body can determine an abnormal time period corresponding to the abnormal event, for example, 10:10-10:30 a.m.

Specifically, before determining the abnormal time period corresponding to the abnormal event, the method further includes: in response to detecting that the scene changes, switching to a corresponding abnormal threshold based on the changed scene; obtaining detection results of all indexes based on the anomaly detection model and the baseline model; and triggering an abnormal event in response to the detection result exceeding the abnormal threshold.

The anomaly detection model is a model trained with a deep learning method. Deep learning is introduced so that complex anomaly patterns can be handled properly, implicit rules can be learned from the data, and the anomaly information between adjacent abnormal points is well preserved. The baseline models may include a clustering-based CBLOF baseline model, an angle-based ABOD baseline model, an ensemble-based iForest (isolation forest) baseline model, and a deep-learning-based GRU baseline model. The four baseline models extract and learn the characteristics of the index data from different angles, so that the combined output of the baseline models is not biased and takes the overall picture into account. In the embodiment of the application, the F1-score is used as the objective function, and the optimal voting weights are searched adaptively based on the L-BFGS optimization method, which meets the need for adaptive, threshold-free monitoring of massive operation and maintenance time-series indexes. Integrating the machine learning baseline models with the deep learning anomaly detection model enhances the ability to classify anomaly types of operation and maintenance time-series data. Detection results for each operation and maintenance time-series index are obtained based on the anomaly detection model and the baseline models; when the execution body determines that a detection result exceeds the abnormal threshold, an abnormal event is triggered. The anomaly detection model can be trained on historical data in which most of the data is normal.
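The weighted voting described above can be illustrated with a minimal sketch. It assumes each baseline detector has already produced a binary vote per time point (1 = abnormal), and it replaces the L-BFGS search named in the text with a coarse grid search, purely to keep the example dependency-free. The function names, the vote data and the 0.5 cutoff are hypothetical and do not come from the application.

```python
from itertools import product


def weighted_vote(votes, weights, cutoff=0.5):
    """Fuse binary detector votes (1 = abnormal) into one decision per time point."""
    total = sum(weights)
    fused = []
    for point_votes in zip(*votes):  # one tuple of detector votes per time point
        score = sum(w * v for w, v in zip(weights, point_votes)) / total
        fused.append(1 if score >= cutoff else 0)
    return fused


def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def search_weights(votes, y_true, grid=(0.0, 0.5, 1.0)):
    """Grid search used here as a stand-in for the L-BFGS weight search."""
    best_weights, best_f1 = None, -1.0
    for weights in product(grid, repeat=len(votes)):
        if sum(weights) == 0:
            continue  # at least one detector must carry weight
        score = f1_score(y_true, weighted_vote(votes, weights))
        if score > best_f1:
            best_weights, best_f1 = weights, score
    return best_weights, best_f1


# Hypothetical votes from four baseline detectors over six time points.
votes = [
    [0, 1, 0, 0, 1, 0],  # CBLOF
    [0, 1, 0, 1, 1, 0],  # ABOD
    [1, 1, 0, 0, 0, 0],  # iForest
    [0, 1, 0, 0, 1, 0],  # GRU
]
labels = [0, 1, 0, 0, 1, 0]
best_weights, best_f1 = search_weights(votes, labels)
```

Because the F1-score is piecewise constant in the weights, a gradient-based optimizer such as L-BFGS would in practice need a smoothed surrogate or numerical gradients; how the application handles this is not stated.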

The embodiment of the application adopts a self-adaptive threshold-free monitoring technology, so that the abnormal threshold value can be automatically regulated according to scene change, and the result is more accurate; different types of scenes can be adapted, and the robustness is better.
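As a rough illustration of switching the abnormal threshold when the scene changes and then triggering an abnormal event, consider the sketch below. The scene names, threshold values, class name and the index name `tp99_latency` are all hypothetical; the application does not say how thresholds are stored or derived, so a simple lookup table is assumed here.

```python
# Hypothetical scenes and thresholds; real values would come from the monitoring setup.
SCENE_THRESHOLDS = {"normal": 0.8, "promotion": 0.95, "maintenance": 0.6}


class ThresholdMonitor:
    def __init__(self, scene="normal"):
        self.scene = scene
        self.threshold = SCENE_THRESHOLDS[scene]

    def on_scene_change(self, new_scene):
        """Switch to the abnormal threshold registered for the new scene."""
        self.scene = new_scene
        self.threshold = SCENE_THRESHOLDS[new_scene]

    def check(self, index_name, score):
        """Return an abnormal event record if the score exceeds the threshold, else None."""
        if score > self.threshold:
            return {"index": index_name, "scene": self.scene, "score": score}
        return None


monitor = ThresholdMonitor()
event_before = monitor.check("tp99_latency", 0.9)  # exceeds 0.8, so an event is triggered
monitor.on_scene_change("promotion")
event_after = monitor.check("tp99_latency", 0.9)   # below 0.95, so no event
```

The same detection score therefore triggers an event in one scene and not in another, which is the behaviour the paragraph above describes.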

Step S102, index information associated with the abnormal event is acquired, and then a corresponding abnormal index is determined based on the index information and the abnormal time period.

When an abnormal event occurs, all index information associated with the event, such as logs, alarms, changes and configurations, is queried by association, taking the abnormal event as the starting point. When a fault occurs, at the index level, the abnormal fluctuation interval (that is, the abnormal time period) of a service layer index is often accompanied by abnormal shape changes in the monitoring indexes of the basic resource layer. The association between fault nodes across indexes is calculated through index shape similarity; that is, the similarity of the abnormal fluctuation of each index within the abnormal time period is compared based on the cross-correlation distance measure SBD (shape-based distance). Because the occurrence of one abnormal event is accompanied by fluctuations in multiple indexes, the abnormal indexes that fluctuate within the abnormal time period along with the abnormal event can be determined. Specifically, an index whose index information within the abnormal time period is similar to the index information associated with the abnormal event may be determined to be an abnormal index.
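The SBD comparison mentioned above can be sketched as follows. The application does not give a formula, so this follows the common definition of shape-based distance from the k-Shape clustering literature: z-normalize both series, take the maximum normalized cross-correlation over all shifts, and subtract it from 1. The function names and sample series are illustrative assumptions.

```python
import math


def z_normalize(series):
    """Rescale to zero mean and unit variance so that only the shape remains."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(v - mean) / std for v in series]


def cross_correlation(x, y, shift):
    """Sum of x[i + shift] * y[i] over the overlapping part of the two series."""
    n = len(x)
    return sum(x[i + shift] * y[i] for i in range(n) if 0 <= i + shift < n)


def sbd(x, y):
    """Shape-based distance: 1 minus the best normalized cross-correlation."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    x, y = z_normalize(x), z_normalize(y)
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0:
        return 1.0  # a flat series has no shape to compare
    n = len(x)
    best = max(cross_correlation(x, y, s) for s in range(-(n - 1), n))
    return 1.0 - best / norm


# Hypothetical data: a business-layer bump and a host-level bump of the same
# shape that arrives one sample later and is measured on a different scale.
business = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0]
host_cpu = [10, 10, 10, 20, 60, 100, 60, 20, 10, 10]
distance = sbd(business, host_cpu)
```

A distance close to 0 means the two indexes fluctuate with nearly the same shape despite the phase offset and the different scale. Handling indexes that move in opposite directions would need an extra step, for example also comparing against the negated series; that is an assumption, not something the text specifies.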

Step S103, according to the abnormal time period, a corresponding static entity relation is obtained, and further, a knowledge graph corresponding to the abnormal event is generated based on the abnormal event, the abnormal index and the static entity relation.

Specifically, according to the abnormal time period, obtaining the corresponding static entity relationship includes: acquiring corresponding log information according to the abnormal time period; and determining the static entity relation according to the log information.

The execution body can query and acquire, according to the abnormal time period, the real-time log information of the event that generated the abnormality. The real-time log information records the date and timestamp of routine events or misoperation alarms related to the system, or records the static entity relationship. The static entity relationship may specifically be the static entity relationship described in the configuration management system. In a configuration management system, the static library (also called the backup library or product library) holds the various baseline files kept for backup and stores the static entity relationships; the configuration items of the static library are placed under full configuration management. By way of example, the static entity relationship may be the relationship among hosts, databases, indexes, logs and abnormal events in the configuration management system. The abnormal indexes are written into a graph database, and the knowledge graph of the abnormal event is generated by combining them with the static entity relationship described by the configuration management system. The generated knowledge graph represents the association relationships between the abnormal event and the hosts, databases, abnormal indexes and logs.
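A minimal sketch of combining the abnormal indexes with static entity relationships is given below. It keeps the graph as an in-memory set of (source, relation, target) triples rather than a graph database, and all entity names and relation labels are hypothetical.

```python
def build_event_graph(event_id, abnormal_indexes, static_relations):
    """Return the event knowledge graph as a set of (source, relation, target) triples."""
    edges = set()
    # Dynamic part: link the event to the indexes that fluctuated with it.
    for index_name in abnormal_indexes:
        edges.add((event_id, "has_abnormal_index", index_name))

    # Static part: follow configuration relations outward from the abnormal
    # indexes until no new edge can be added.
    reachable = set(abnormal_indexes)
    grew = True
    while grew:
        grew = False
        for source, relation, target in static_relations:
            if source in reachable and (source, relation, target) not in edges:
                edges.add((source, relation, target))
                reachable.add(target)
                grew = True
    return edges


# Hypothetical static entity relationships from a configuration management system.
static_relations = [
    ("cpu_usage", "measured_on", "host-1"),
    ("db_conn", "measured_on", "db-1"),
    ("db-1", "runs_on", "host-1"),
    ("host-1", "writes", "log-1"),
    ("disk_io", "measured_on", "host-2"),
]
graph = build_event_graph("event-1", ["cpu_usage", "db_conn"], static_relations)
```

Only entities reachable from the abnormal indexes enter the graph, so `disk_io` and `host-2` are left out in this example.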

Step S104, positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity.

Specifically, determining a total anomaly index corresponding to an anomaly entity includes:

determining one or more abnormal events corresponding to the abnormal entity based on the knowledge graph; and summarizing the abnormal indexes corresponding to the one or more abnormal events to obtain a total abnormal index.

In the embodiment of the application, the knowledge graph corresponding to the abnormal event is generated from the abnormal event, and the abnormal entity (such as a host) corresponding to the abnormal event is located according to the generated knowledge graph. In the knowledge graph, one abnormal entity may correspond to one or more abnormal events; in that case, the execution body can aggregate the abnormal indexes corresponding to those abnormal events to obtain the total abnormal index. It can be understood that the total abnormal index may be the set of all abnormal indexes corresponding to one abnormal entity.
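The aggregation described above amounts to a set union over the events that point at one entity. A small sketch, with hypothetical event, entity and index names and a hypothetical function name:

```python
def total_abnormal_index(entity, event_entities, event_indexes):
    """Collect the abnormal indexes of every event that points at the entity."""
    related_events = sorted(
        event for event, entities in event_entities.items() if entity in entities
    )
    total = set()
    for event in related_events:
        total |= set(event_indexes.get(event, ()))
    return related_events, sorted(total)


# Hypothetical contents read from the knowledge graph.
event_entities = {
    "event-1": {"host-1"},
    "event-2": {"host-1", "db-1"},
    "event-3": {"host-2"},
}
event_indexes = {
    "event-1": ["cpu_usage", "tp99_latency"],
    "event-2": ["cpu_usage", "db_conn"],
    "event-3": ["disk_io"],
}
events, total = total_abnormal_index("host-1", event_entities, event_indexes)
```

Here `host-1` is tied to two events, and the index `cpu_usage`, which appears in both, is counted once in the total.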

Step S105, inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity.

The indexes affected by the abnormal event as a whole are extracted and used as the input of the anomaly detection model, which classifies the time-series indexes.

Specifically, after outputting the index exception type and the exception entity, the exception handling method further includes: displaying the abnormal event, the abnormal entity corresponding to the abnormal event, the total abnormal index corresponding to the abnormal entity and the index abnormal type in a knowledge graph in a preset form. By way of example, the preset form may be a form of a pictorial representation.

On the basis of the knowledge graph corresponding to the abnormal event, a graph-analysis derivation model is applied: the dynamically changing indexes are combined with the static system configuration relationship, and the entity that generated the abnormality (such as the abnormal submodule of the system) is located and extracted, yielding a graph representation, within the knowledge graph, of the scope and the indexes affected by the abnormal event. The user can therefore see clearly the scope of influence of the abnormal index and remedy it in time, reducing losses.

In the embodiment, an abnormal time period corresponding to an abnormal event is determined by responding to the detection of the abnormal event; acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period; acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation; positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity; and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity. When an abnormal event occurs, the knowledge graph of the abnormal event is constructed by collecting index information of the abnormal event of the system. The knowledge graph is used for storing abnormal events, the graph reasoning method is used for positioning abnormal entities, and the entity part affected by the faults and related abnormal indexes are obtained, so that the related abnormal indexes are classified. The classifying capability of the abnormal categories of the operation and maintenance time sequence data is improved, the checking flow is concise, and the abnormality checking efficiency is high.

Fig. 2 is a main flow diagram of an exception handling method according to an embodiment of the present application, and as shown in fig. 2, the exception handling method includes:

Step S201, in response to detecting the abnormal event, an abnormal time period corresponding to the abnormal event is determined.

An abnormal event may be triggered when the value of a certain index is detected to exceed a preset threshold. The time period during which the value of that index exceeds the preset threshold may be taken as the abnormal time period corresponding to the abnormal event. The embodiment of the present application does not specifically limit the abnormal time period.
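One simple way to derive such a time period is to collect each run of consecutive samples above the threshold. The sketch below assumes timestamped samples in chronological order; the sample values, the threshold of 80 and the function name are hypothetical.

```python
def abnormal_periods(samples, threshold):
    """Return (start, end) pairs, one per run of consecutive samples above threshold."""
    periods = []
    start = previous = None
    for timestamp, value in samples:
        if value > threshold:
            if start is None:
                start = timestamp  # a new run begins
            previous = timestamp
        elif start is not None:
            periods.append((start, previous))  # the run just ended
            start = None
    if start is not None:
        periods.append((start, previous))  # run still open at the end of the data
    return periods


# Hypothetical index samples as (time, value) pairs.
samples = [("10:00", 40), ("10:10", 85), ("10:20", 92), ("10:30", 88), ("10:40", 45)]
periods = abnormal_periods(samples, threshold=80)
```

With these samples the single abnormal time period runs from 10:10 to 10:30.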

Step S202, index information associated with the abnormal event is acquired.

In step S203, an index shape corresponding to the index information is determined.

An index coordinate system is established with the index value as the ordinate and the index as the abscissa. The index shape corresponding to the index information may be the shape formed in this coordinate system by the index values corresponding to the index information. The embodiment of the present application does not specifically limit the index shape.

Step S204, determining an abnormal index corresponding to the abnormal time period according to the index shape.

As shown in fig. 4, after receiving the operation and maintenance monitoring index data of the system, the execution body may extract the abnormal pattern. After the abnormal pattern is extracted, the execution body calculates the similarity between the index information corresponding to the abnormal pattern and all operation and maintenance monitoring index data within the abnormal time period, so as to perform anomaly matching and screen out the operation and maintenance monitoring index data whose similarity to the index information corresponding to the abnormal pattern is higher than a threshold. As shown in fig. 4, the data screened out in this way may differ from the index information corresponding to the abnormal pattern in phase (not synchronized), in dimension (different units or scales), or in direction (inconsistent increase and decrease). This ensures good robustness, high accuracy and strong adaptability when identifying abnormal index data.

Specifically, determining an abnormality index corresponding to an abnormality period includes:

determining a business layer index shape and a basic resource layer index shape corresponding to the abnormal event from the index shapes; calculating the similarity of the service layer index shape and the basic resource layer index shape; and determining the associated abnormal index corresponding to the abnormal time period according to the similarity.

When an abnormal event is triggered, a fault node is shown to exist. At the index level, the abnormal fluctuation interval of a service layer index is often accompanied by abnormal shape changes in the monitoring indexes of the basic resource layer. The association between fault nodes across indexes is calculated through index shape similarity; that is, the similarity of the abnormal fluctuation of each index within the abnormal time period is compared based on the cross-correlation distance measure SBD. The occurrence of one event is accompanied by fluctuations in multiple indexes, and indexes with fluctuation correlation can be placed in the same abnormal event knowledge graph.

Specifically, indexes having fluctuation correlation may be indexes whose magnitudes differ within the same time period, indexes whose values rise or fall together within the same time period, or indexes whose fluctuations are not synchronized within the same time period. An index that has fluctuation correlation with the abnormal index corresponding to the abnormal event within the abnormal time period is an associated abnormal index corresponding to the abnormal time period. Here, the abnormal index corresponding to the abnormal event may be the index that fluctuates strongly when the abnormal event is triggered or the index bound to the abnormal event; the embodiment of the application does not specifically limit it. This addresses the problems of asynchronous fluctuation phases, different dimensions, and inconsistent increase and decrease among indexes.

The execution body determines, from the index shapes, the business layer index shape and the basic resource layer index shape corresponding to the abnormal event. When there are multiple basic resource layer index shapes, the execution body can calculate the similarity between the business layer index shape and each basic resource layer index shape. To do so, it acquires the phase, dimension and index value increase/decrease information of the business layer index shape and of each basic resource layer index shape within the abnormal time period, and from these determines the phase similarity, the dimension similarity and the index value increase/decrease similarity between the business layer index shape and each basic resource layer index shape; the similarity may include these three components. The associated abnormal index corresponding to the abnormal time period is then determined according to the similarity. This improves the accuracy of determining the abnormal indexes associated with the abnormal event and lays a foundation for accurately determining the abnormal index category.

Step S205, according to the abnormal time period, the corresponding static entity relation is obtained, and further, based on the abnormal event, the abnormal index and the static entity relation, a knowledge graph corresponding to the abnormal event is generated.

Step S206, positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity.

Step S207, inputting the total anomaly index into the anomaly detection model to obtain a corresponding index anomaly type, and outputting the index anomaly type and the anomaly entity.

In the anomaly type classification process, the operation and maintenance time-series indexes are first divided, according to their characteristics, into the anomaly types listed below. Multiple groups of anomaly detection model combinations are configured based on the index anomaly types, which speeds up the adaptation between the anomaly detection model and the indexes and improves accuracy. The 7 anomaly types of operation and maintenance time-series data are as follows:

(1) burr (spike) anomaly;

(2) context anomaly;

(3) cliff break / abrupt slope anomaly;

(4) climbing anomaly;

(5) amplitude anomaly;

(6) wave frequency anomaly;

(7) amplitude + wave frequency anomaly.

The execution body inputs the total abnormal index into the abnormality detection model, and outputs one or more of the above 7 operation and maintenance time sequence data abnormality types, and then the execution body can output the obtained index abnormality type and an abnormality entity corresponding to the abnormality event. The method has the advantages of improving the classification capability of the abnormal categories of the operation and maintenance time sequence data, along with simple investigation flow and high abnormality investigation efficiency.
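The output of this step can be pictured as a small typed record pairing the abnormal entity with the anomaly types of its indexes. The sketch below is only an assumed data shape: the enum member names paraphrase the 7 types above, and the class names, the `build_report` helper and the sample model output are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class IndexAnomalyType(Enum):
    BURR = 1
    CONTEXT = 2
    CLIFF_BREAK_OR_ABRUPT_SLOPE = 3
    CLIMBING = 4
    AMPLITUDE = 5
    WAVE_FREQUENCY = 6
    AMPLITUDE_AND_WAVE_FREQUENCY = 7


@dataclass(frozen=True)
class AnomalyReport:
    entity: str
    index_types: dict  # index name -> tuple of IndexAnomalyType


def build_report(entity, model_output):
    """Convert raw type numbers (1 to 7) per index into a typed report."""
    typed = {
        index: tuple(IndexAnomalyType(n) for n in sorted(set(numbers)))
        for index, numbers in model_output.items()
    }
    return AnomalyReport(entity=entity, index_types=typed)


# Hypothetical model output: one or more type numbers per abnormal index.
report = build_report("host-1", {"cpu_usage": [4], "tp99_latency": [6, 5]})
```

An index may carry more than one type, matching the statement that the model outputs one or more of the 7 types.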

Fig. 3 is an application scenario diagram of an exception handling method according to an embodiment of the present application. The exception handling method of the embodiment of the application can be applied to anomaly detection scenarios for operation and maintenance time-series data. Classifying the anomaly type first requires determining the scope of influence of the anomaly and the abnormal indexes. Querying and reasoning are carried out with a method based on the abnormal event knowledge graph. The embodiment is as follows: when an abnormal event occurs, a knowledge graph of the abnormal event is constructed by collecting information related to the abnormal event; a derivation model is applied to extract information from the graph and determine the module that generated the abnormal event; finally, a rule reasoning method is applied to analyze and infer the root cause, thereby obtaining the abnormal indexes to be classified.

For example, as shown in fig. 3, the first step is to construct the graph.

(1) When an abnormal event is generated, carrying out associated inquiry on all information such as logs, alarms, changes, configurations and the like which are associated with the event;

(2) Taking the event as a starting point, all index information related to the abnormal event is queried by association. When a fault occurs, abnormal changes in the shape of business layer monitoring indexes are often accompanied, at the index level, by abnormal fluctuations of basic resource layer indexes in the same interval. The relevance between the indexes at the fault node is therefore calculated through index shape similarity: the abnormal fluctuation of each index within the fault time period is compared using the cross-correlation-based distance measure SBD (shape-based distance). One event is usually accompanied by multiple indexes, and indexes whose fluctuations are correlated are placed in the same event graph. This addresses the problems of index fluctuations being out of phase, having different dimensions, and increasing or decreasing inconsistently.

(3) Inquiring and acquiring real-time log information of the transaction generating the abnormality according to the information of the abnormality time point;

(4) And generating a knowledge graph of the abnormal event by combining the static entity relationship described by the configuration management system.
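As a non-limiting illustration of the index shape similarity mentioned in step (2), the following sketch computes a shape-based distance as one minus the maximum normalized cross-correlation over all shifts, after z-normalizing both series. The function names, the equal-length requirement and the sample values are assumptions made for illustration; the application does not specify an implementation.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit variance (removes dimension)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in series]


def shape_based_distance(x, y):
    """SBD = 1 - max over shifts of the normalized cross-correlation.

    Values near 0 mean similar shapes; the result lies in [0, 2].
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-length series")
    x, y = z_normalize(x), z_normalize(y)
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        return 1.0  # a constant series carries no shape information
    n = len(x)
    best = -1.0
    for shift in range(-(n - 1), n):
        # Sum over the overlapping part of x and y displaced by `shift`.
        cc = sum(x[i] * y[i - shift]
                 for i in range(max(0, shift), min(n, n + shift)))
        best = max(best, cc / norm)
    return 1.0 - best


# Hypothetical business layer index and a delayed, rescaled resource index.
business = [0, 0, 1, 5, 1, 0, 0, 0]
resource = [0, 0, 0, 0, 10, 50, 10, 0]
print(shape_based_distance(business, resource))
```

Because the maximum is taken over all shifts and both series are z-normalized, the measure tolerates phase offsets and differing units; indexes whose distance falls below a chosen cutoff could then be attached to the same event graph.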

And secondly, root cause positioning.

On the basis of the abnormal event knowledge graph, a deduction model of graph analysis is applied. The dynamically changing indexes are combined with the static system configuration relationship, and the abnormal sub-module is located and extracted, giving a graph representation, within the knowledge graph, of the range and the indexes affected by the abnormal event.

Thirdly, root cause analysis.

Based on the second step, the root cause of the event can be clarified. By combining the logs generated by the abnormality, a graph structure reasoning method is introduced. For example, when a plurality of abnormal events (such as abnormal event 1, abnormal event 2 and abnormal event 3) simultaneously point to one node (such as host 1), that node can be considered abnormal based on rule reasoning. The sub-module obtained by positioning is then further deduced and supplemented to obtain the final root cause of the abnormal event. The indexes affected by the whole abnormal event are extracted and used as the input of time sequence index abnormality detection and classification.
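As a non-limiting illustration of the rule described above, the following sketch treats the knowledge graph as a list of (abnormal event, node) edges and marks a node as abnormal when at least `min_events` distinct events point to it. The edge representation, the function name and the default cutoff of 2 are assumptions made for illustration.

```python
from collections import defaultdict


def locate_abnormal_nodes(edges, min_events=2):
    """Return nodes pointed to by at least `min_events` distinct events.

    `edges` is an iterable of (event, node) pairs taken from the graph.
    """
    events_by_node = defaultdict(set)
    for event, node in edges:
        events_by_node[node].add(event)
    return {
        node: sorted(events)
        for node, events in events_by_node.items()
        if len(events) >= min_events
    }


edges = [
    ("abnormal event 1", "host 1"),
    ("abnormal event 2", "host 1"),
    ("abnormal event 3", "host 1"),
    ("abnormal event 3", "host 2"),
]
print(locate_abnormal_nodes(edges))
# {'host 1': ['abnormal event 1', 'abnormal event 2', 'abnormal event 3']}
```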

In the abnormal type classification process, the operation and maintenance time sequence indexes are first summarized, according to their characteristics, into the 7 abnormal types of operation and maintenance time sequence data listed above. A plurality of groups of abnormality detection model combinations are arranged based on these index abnormal types, which improves the speed and accuracy with which the models are adapted to the indexes.

In order to adapt to operation and maintenance monitoring indexes with different time granularities, the abnormality detection algorithms are divided into two sets according to the acquisition granularity of the time sequence index, standardizing the algorithms for second-level granularity and minute-level granularity. A relative majority voting method is introduced, which avoids the influence of subjective factors caused by manually assigning weights to the models, avoids the computation required to learn the weights, and saves calculation cost. Introducing deep learning makes it possible to handle complex abnormal patterns reasonably, learn implicit rules from the data, and preserve the abnormal information between adjacent abnormal points; integrating machine learning and deep learning abnormality detection models enhances the capability of classifying abnormal categories of operation and maintenance time sequence data. It should be noted that integration often gives better results when the individual models being combined differ significantly from one another. Based on modeling experience with massive operation and maintenance monitoring indexes, algorithms should therefore be selected from angles that are as different as possible when adding models. In the practical deployment of intelligent operation and maintenance, four types of baseline models are applied: clustering-based CBLOF, angle-based ABOD, ensemble-based iForest, and deep-learning-based GRU. The four types of baseline models extract and learn data features from different angles, so that the results are not biased and the overall picture is taken into account. According to the embodiment of the application, the f1-score is used as the objective function, and the optimal weighted number of votes is adaptively searched using the L-BFGS optimization search method, which meets the requirement of adaptive threshold-free monitoring of massive operation and maintenance time sequence indexes.
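As a non-limiting illustration of relative majority voting, the following sketch lets each detector cast one unweighted vote and returns the label with the most votes. The function name, the label strings and the handling of ties are assumptions made for illustration; the L-BFGS search for the optimal number of votes is not shown.

```python
from collections import Counter


def relative_majority_vote(votes, tie_label="normal"):
    """Return the label with the most votes; no per-model weights are used.

    `votes` holds one label per detector. On a tie, `tie_label` is returned
    (a choice made for this sketch only).
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else tie_label


# One vote each from four hypothetical detectors (CBLOF, ABOD, iForest, GRU).
votes = ["abnormal", "abnormal", "normal", "abnormal"]
print(relative_majority_vote(votes))  # abnormal
```

Because every detector counts equally, no weight has to be assigned by hand or learned, which is the cost saving described above.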

Finally, the technology of the embodiment of the application associates the abnormality detection result with the baseline detection result. In the calculation of the baseline upper and lower limits, rmse and f1-score are introduced as two objective functions, and the baseline and the upper and lower limits of each index are predicted adaptively using the L-BFGS optimization search method. The abnormality detection model has better real-time performance, while the baseline model better reflects the historical behavior of the index; the two detection results complement each other, which improves the coverage and accuracy of the overall algorithm in identifying index abnormal types. The adaptive threshold-free monitoring technology adopted by the embodiment of the application automatically adjusts the abnormal threshold as the scene changes, so the results are more accurate; it can also adapt to different types of scenes, so the robustness is better.
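As a non-limiting illustration, the two objective functions named above and one possible way of associating the two detection results can be sketched as follows. Combining the results with a logical OR is an assumption made for this sketch, as are all function names; the optimization search itself is not shown.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed values and the baseline."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def f1_score(y_true, y_pred):
    """F1 score for binary abnormal (truthy) / normal (falsy) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combined_flags(model_flags, values, lower, upper):
    """Flag a point if the model flags it OR it leaves the baseline band."""
    return [
        flagged or not (lo <= value <= hi)
        for flagged, value, lo, hi in zip(model_flags, values, lower, upper)
    ]


print(combined_flags([False, True, False], [5, 5, 12], [0, 0, 0], [10, 10, 10]))
# [False, True, True]
```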

In the embodiment of the application, the knowledge graph is used to store abnormal events, and the graph reasoning method is used for root cause positioning to obtain the part affected by the fault and the related indexes; a dynamic knowledge graph is generated by combining the static entity relationship of the configuration system with the time sequence indexes. Based on years of threshold-free monitoring experience, the abnormalities of operation and maintenance time sequence indexes are summarized into 7 types: burr type abnormality, contextual abnormality, cliff/abrupt slope abnormality, climbing abnormality, amplitude abnormality, wave frequency abnormality and amplitude + wave frequency abnormality, which can cover 90% of the abnormal types of online operation and maintenance time sequence data. In order to adapt to operation and maintenance monitoring indexes with different time granularities, the abnormality detection algorithms are divided into two sets according to the acquisition granularity of the time sequence index, standardizing the algorithms for second-level granularity and minute-level granularity, and the relative majority voting method applied by the multi-index abnormality detection algorithm shown in fig. 5 is introduced. This avoids the influence of subjective factors caused by manually assigning weights to each model, avoids the computation required to learn the weights, and saves calculation cost. Selecting algorithms from different angles and integrating machine learning and deep learning abnormality detection models enhances the capability of classifying abnormal categories of operation and maintenance time sequence data. As shown in fig. 5, in the practical deployment of intelligent operation and maintenance, four types of baseline models are selected: clustering-based CBLOF, angle-based ABOD, ensemble-based iForest, and deep-learning-based GRU. The four types of methods extract and learn data features from different angles, so that the results are not biased and the overall picture is taken into account. According to the embodiment of the application, the f1-score is used as the objective function, and the optimal weighted number of votes is adaptively searched using the L-BFGS optimization search method, which meets the requirement of adaptive threshold-free monitoring of massive operation and maintenance time sequence indexes. The abnormality detection result is associated with the baseline detection result; rmse and f1-score are introduced as two objective functions in the calculation of the baseline upper and lower limits; the baseline and the upper and lower limits of each index are predicted adaptively using the L-BFGS optimization search method; and the abnormal threshold is adjusted automatically as the scene changes. The abnormality detection model has better real-time performance, while the baseline model better reflects the historical behavior of the index; the two detection results are complementary, which improves the coverage and accuracy of the overall algorithm in identifying abnormal types.

According to the embodiment of the application, the operation and maintenance knowledge graph is applied to fault location: a knowledge graph of the abnormal event is constructed, which improves the overall troubleshooting efficiency from fault discovery to root cause positioning. Firstly, when the system is abnormal, a knowledge graph of the abnormal event is constructed by collecting the related information; a deduction model is then applied to extract information from the graph and determine the module generating the abnormality; finally, a rule reasoning method is applied to analyze and deduce the root cause, thereby obtaining the abnormal indexes to be classified. Combining the characteristics of massive operation and maintenance data, the abnormalities of operation and maintenance time sequence indexes are summarized into 7 types: burr type anomalies, context anomalies, cliff/abrupt slope anomalies, climbing anomalies, amplitude anomalies, wave frequency anomalies and amplitude + wave frequency anomalies. The abnormal type is combined with a multi-layer unsupervised abnormality detection model corresponding to the index sensitivity level (high, medium or low), which improves the coverage of the abnormality detection model over the abnormal types of operation and maintenance time sequence indexes. Secondly, the abnormality detection algorithms are divided into two sets according to the acquisition granularity of the time sequence index, standardizing the algorithms for second-level granularity and minute-level granularity, and a relative majority voting method is introduced. This avoids the influence of subjective factors caused by manually assigning weights to the models, avoids the computation required to learn the weights, and saves calculation cost. Introducing the knowledge graph to store abnormal events makes query and display convenient. Meanwhile, the reasoning method of the graph structure can obtain the association relations among entities more completely. Introducing the deep learning method makes it possible to handle complex abnormal patterns reasonably, learn implicit rules from the data, and preserve the abnormal information between adjacent abnormal points; integrating machine learning and deep learning abnormality detection models enhances the capability of classifying abnormal categories of operation and maintenance time sequence data and meets the requirement of adaptive threshold-free monitoring of massive operation and maintenance time sequence indexes. Finally, the embodiment of the application associates the abnormality detection result with the baseline detection result. The abnormality detection model has better real-time performance, while the baseline model better reflects the historical behavior of the index; the two detection results are complementary, which improves the coverage and accuracy of the overall algorithm in identifying abnormal types.

Fig. 6 is a schematic diagram of main units of the abnormality processing apparatus according to the embodiment of the present application. As shown in fig. 6, the abnormality processing apparatus 600 includes an abnormal time period determining unit 601, an abnormality index determination unit 602, a knowledge graph generating unit 603, a total anomaly index determining unit 604, and an anomaly processing unit 605.

An abnormal time period determining unit 601 configured to determine an abnormal time period corresponding to an abnormal event in response to detection of the abnormal event;

an abnormality index determination unit 602 configured to acquire index information associated with an abnormality event, and further determine a corresponding abnormality index based on the index information and the abnormality period;

the knowledge graph generating unit 603 is configured to obtain a corresponding static entity relationship according to the abnormal time period, and further generate a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relationship;

a total anomaly index determining unit 604 configured to locate an anomaly entity based on the knowledge graph, and further determine a total anomaly index corresponding to the anomaly entity;

the anomaly processing unit 605 is configured to input the total anomaly index into the anomaly detection model to obtain a corresponding index anomaly type, and output the index anomaly type and the anomaly entity.

In some embodiments, the abnormality index determination unit 602 is further configured to: determining an index shape corresponding to the index information; and determining an abnormal index corresponding to the abnormal time period according to the index shape.

In some embodiments, the abnormality index determination unit 602 is further configured to: determining a business layer index shape and a basic resource layer index shape corresponding to the abnormal event from the index shapes; calculating the similarity of the service layer index shape and the basic resource layer index shape; and determining the associated abnormal index corresponding to the abnormal time period according to the similarity.

In some embodiments, the knowledge-graph generation unit 603 is further configured to: acquiring corresponding log information according to the abnormal time period; and determining the static entity relation according to the log information.

In some embodiments, the exception handling apparatus further comprises a presentation unit, not shown in fig. 6, configured to: displaying the abnormal event, the abnormal entity corresponding to the abnormal event, the total abnormal index corresponding to the abnormal entity and the index abnormal type in a knowledge graph in a preset form.

In some embodiments, the total anomaly index determination unit 604 is further configured to: determining one or more abnormal events corresponding to the abnormal entity based on the knowledge graph; and summarizing the abnormal indexes corresponding to the one or more abnormal events to obtain a total abnormal index.
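As a non-limiting illustration of the summarizing step, the following sketch collects the abnormal indexes of every abnormal event linked to one abnormal entity, dropping duplicates while preserving order. The two dictionaries stand in for lookups against the knowledge graph; their structure, the function name and the sample index names are assumptions made for illustration.

```python
def total_abnormal_indexes(entity, events_by_entity, indexes_by_event):
    """Merge the abnormal indexes of all events linked to `entity`."""
    total, seen = [], set()
    for event in events_by_entity.get(entity, []):
        for index_name in indexes_by_event.get(event, []):
            if index_name not in seen:
                seen.add(index_name)
                total.append(index_name)
    return total


events_by_entity = {"host 1": ["abnormal event 1", "abnormal event 2"]}
indexes_by_event = {
    "abnormal event 1": ["cpu_usage", "response_time"],
    "abnormal event 2": ["response_time", "error_rate"],
}
print(total_abnormal_indexes("host 1", events_by_entity, indexes_by_event))
# ['cpu_usage', 'response_time', 'error_rate']
```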

In some embodiments, the exception handling apparatus further comprises an exception event triggering unit, not shown in fig. 6, configured to: in response to detecting that the scene changes, switching to a corresponding abnormal threshold based on the changed scene; obtaining detection results of all indexes based on the anomaly detection model and the baseline model; and triggering an abnormal event in response to the detection result exceeding the abnormal threshold.
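As a non-limiting illustration of the triggering logic, the following sketch switches to the abnormal threshold of the current scene and reports every index whose detection result exceeds it. The scene names, the threshold values, the default threshold and the use of a single numeric score per index are all assumptions made for illustration.

```python
def indexes_triggering_event(scene, scores, thresholds, default_threshold=0.8):
    """Return the indexes whose detection score exceeds the scene threshold.

    `scores` maps an index name to a detection result; `thresholds` maps a
    scene name to its abnormal threshold.
    """
    threshold = thresholds.get(scene, default_threshold)
    return [name for name, score in scores.items() if score > threshold]


thresholds = {"promotion": 0.9, "normal": 0.6}   # hypothetical scenes
scores = {"cpu_usage": 0.7, "response_time": 0.95}

print(indexes_triggering_event("promotion", scores, thresholds))
# ['response_time']
print(indexes_triggering_event("normal", scores, thresholds))
# ['cpu_usage', 'response_time']
```

An abnormal event would be triggered whenever the returned list is non-empty.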

In the present application, the exception handling method and the exception handling apparatus have a corresponding relationship in terms of the specific implementation contents, and therefore, the description is not repeated.

FIG. 7 illustrates an exemplary system architecture 700 in which the exception handling method or exception handling apparatus of embodiments of the present application may be applied.

As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 705 via the network 704 using the terminal devices 701, 702, 703 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 701, 702, 703.

The terminal devices 701, 702, 703 may be various electronic devices having an exception handling screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 705 may be a server providing various services, such as a background management server (by way of example only) that provides support for abnormal events detected by users using the terminal devices 701, 702, 703. The background management server can respond to the detection of the abnormal event and determine an abnormal time period corresponding to the abnormal event; acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period; acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation; positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity; and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity. When an abnormal event occurs, the knowledge graph of the abnormal event is constructed by collecting index information of the abnormal event of the system. The knowledge graph is used for storing abnormal events, the graph reasoning method is used for positioning abnormal entities, and the entity part affected by the faults and related abnormal indexes are obtained, so that the related abnormal indexes are classified. The classifying capability of the abnormal categories of the operation and maintenance time sequence data is improved, the checking flow is concise, and the abnormality checking efficiency is high.

It should be noted that, the exception handling method provided in the embodiments of the present application is generally executed by the server 705, and accordingly, the exception handling apparatus is generally disposed in the server 705.

It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing the terminal device of an embodiment of the present application. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present application.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM803, various programs and data required for the operation of the computer system 800 are also stored. The CPU801, ROM802, and RAM803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, mouse, etc.; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is installed into the storage section 808 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 801.

It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes an abnormal time period determination unit, an abnormal index determination unit, a knowledge graph generation unit, a total abnormal index determination unit, and an abnormal processing unit. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.

As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer-readable medium carries one or more programs that, when executed by one of the devices, cause the device to determine an abnormal time period corresponding to the abnormal event in response to detecting the abnormal event; acquiring index information associated with the abnormal event, and further determining a corresponding abnormal index based on the index information and the abnormal time period; acquiring a corresponding static entity relation according to the abnormal time period, and further generating a knowledge graph corresponding to the abnormal event based on the abnormal event, the abnormal index and the static entity relation; positioning the abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity; and inputting the total abnormal index into an abnormal detection model to obtain a corresponding index abnormal type, and outputting the index abnormal type and the abnormal entity. When an abnormal event occurs, the knowledge graph of the abnormal event is constructed by collecting index information of the abnormal event of the system. The knowledge graph is used for storing abnormal events, the graph reasoning method is used for positioning abnormal entities, and the entity part affected by the faults and related abnormal indexes are obtained, so that the related abnormal indexes are classified.

According to the technical scheme of the embodiment of the application, when the abnormal event occurs, the knowledge graph of the abnormal event is constructed by collecting index information of the abnormal event of the system. The method has the advantages that the knowledge graph is adopted to store abnormal events, the graph reasoning method is adopted to locate abnormal entities, the entity part affected by faults and related abnormal indexes are obtained, so that the related abnormal indexes are classified, the capability of classifying abnormal categories of operation and maintenance time sequence data can be improved, the troubleshooting flow is concise, and the troubleshooting efficiency is high.

The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. An exception handling method, comprising:

in response to detecting an abnormal event, determining an abnormal time period corresponding to the abnormal event;

acquiring index information associated with the abnormal event, further determining an index shape corresponding to the index information, determining a business layer index shape and a basic resource layer index shape corresponding to the abnormal event from the index shapes, calculating the similarity of the business layer index shape and the basic resource layer index shape, and determining an associated abnormal index corresponding to the abnormal time period according to the similarity;

positioning an abnormal entity based on the knowledge graph, and further determining a total abnormal index corresponding to the abnormal entity;

2. The method of claim 1, wherein the obtaining the corresponding static entity relationship according to the abnormal time period comprises:

acquiring corresponding log information according to the abnormal time period;

and determining a static entity relation according to the log information.

3. The method of claim 1, wherein after said outputting the indicator exception type and the exception entity, the method further comprises:

displaying the abnormal event, the abnormal entity corresponding to the abnormal event, the total abnormal index corresponding to the abnormal entity and the index abnormal type in the knowledge graph in a preset form.

4. The method of claim 1, wherein the determining the total anomaly metrics for the anomaly entity comprises:

5. The method of claim 1, wherein prior to the determining the anomaly time period for the anomaly event, the method further comprises:

6. An abnormality processing apparatus, comprising:

an abnormal time period determining unit configured to determine an abnormal time period corresponding to an abnormal event in response to detection of the abnormal event;

an abnormal index determining unit configured to obtain index information associated with the abnormal event, further determine an index shape corresponding to the index information, determine a business layer index shape and a base resource layer index shape corresponding to the abnormal event from the index shapes, calculate a similarity of the business layer index shape and the base resource layer index shape, and determine an associated abnormal index corresponding to the abnormal time period according to the similarity;

a total anomaly index determining unit configured to locate an anomaly entity based on the knowledge graph, and further determine a total anomaly index corresponding to the anomaly entity;

and the abnormality processing unit is configured to input the total abnormality index into an abnormality detection model to obtain a corresponding index abnormality type and output the index abnormality type and the abnormality entity.

7. An exception handling electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.

8. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.