CN114205216B - Root cause positioning method and device for micro service fault, electronic equipment and medium

Info

Publication number: CN114205216B
Application number: CN202111513439.0A
Authority: CN
Inventors: 谢伟; 武文轩; 王豪赞; 王磊
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2024-02-06
Anticipated expiration: 2041-12-07
Also published as: CN114205216A

Abstract

The present disclosure provides a root cause positioning method for micro-service faults, which can be applied to the financial field or other fields. The method comprises the following steps: obtaining a detection model; acquiring a system operation and maintenance map; detecting the real-time log based on the detection model; when the detected result is abnormal, marking abnormal nodes on the system operation and maintenance map; and determining an abnormal root node based on the abnormal node. The present disclosure also provides a root cause positioning apparatus, device, storage medium and program product for micro-service failures.

Description

Root cause positioning method and device for micro service fault, electronic equipment and medium

Technical Field

The present disclosure relates to the field of information technology, and may also be used in the financial field or other fields, and in particular, to a method, apparatus, electronic device, medium, and program product for root cause localization of micro-service faults.

Background

Anomaly detection and root cause localization have long been subjects of study. In the digitized world, the amount of data produced exceeds the human capacity to study it manually, so automated data analysis is necessary. An important data analysis task is detecting anomalies in the data. Anomalies are data points that deviate from the normal distribution of the dataset as a whole, and anomaly detection is the technique for finding them. The effects of anomalies are field-dependent. In network activity data, anomalies may indicate intrusion attacks; anomalies in financial transactions may suggest financial fraud; and anomalies in medical images may be caused by disease. Other applications of anomaly detection include industrial damage detection, data leak prevention, identification of security vulnerabilities, and military surveillance.

Micro-service systems in financial technology place high demands on data consistency and on the response to and handling of anomalies. During system operation, a large amount of monitoring data and log data is generated, which makes it very challenging for IT experts to investigate, analyze and resolve system faults, and one or more innovative technologies are needed to address these problems. A key question is how to locate the root cause of a problem quickly and accurately when a fault occurs.

Disclosure of Invention

In view of the foregoing, the present disclosure provides a root cause positioning method, apparatus, electronic device, medium, and program product for micro-service faults.

According to a first aspect of the present disclosure, there is provided a root cause positioning method of a micro service failure, including: obtaining a detection model; acquiring a system operation and maintenance map; detecting the real-time log based on the detection model; when the detected result is abnormal, marking abnormal nodes on the system operation and maintenance map; and determining an abnormal root node based on the abnormal node.

According to an embodiment of the present disclosure, the method further comprises: updating the detection model when the detection result is abnormal.

According to an embodiment of the present disclosure, the step of constructing the detection model includes: acquiring a history log; preprocessing the history log to obtain history characteristic data; and constructing a detection model based on the historical characteristic data.

According to an embodiment of the present disclosure, the preprocessing includes: extracting key data of the history log; grouping the key data according to preset conditions to obtain grouped key data; and abstracting the grouped key data to obtain historical characteristic data.

According to an embodiment of the disclosure, the detection model includes a first detection model and a second detection model, wherein the first detection model is a call graph pattern library; and the second detection model is a variable model.

According to an embodiment of the present disclosure, the step of constructing the first detection model includes: mining the call graph patterns of the history log based on the historical characteristic data, and generating the call graph pattern library.

According to an embodiment of the present disclosure, the step of constructing the second detection model includes: performing data characteristic analysis on the historical characteristic data to obtain a configuration file, wherein the configuration file comprises data characteristics and key fields; and constructing a variable model based on the data characteristics and the key fields.

According to an embodiment of the disclosure, the step of detecting the real-time log based on the detection model includes: judging whether a calling graph mode of the real-time log is in the calling graph mode library or not based on the first detection model, and obtaining a first detection result; and judging whether the field variable of the real-time log is abnormal or not based on the second detection model, and obtaining a second detection result.

According to an embodiment of the present disclosure, the method further comprises: determining that the detection result is abnormal when the first detection result is abnormal and/or the second detection result is abnormal.

According to an embodiment of the disclosure, the step of determining an abnormal root node based on the abnormal node includes: determining all non-abnormal nodes on the system operation and maintenance map that have direct edges to abnormal nodes; calculating the anomaly degree of each of these non-abnormal nodes; and sorting the anomaly degrees of these non-abnormal nodes in descending order to determine the abnormal root cause node.

According to an embodiment of the present disclosure, the calculation formula of the degree of abnormality is:

$$\text{degree of abnormality} = \sum_{j=1}^{n} a_j$$

wherein: n is the number of abnormal nodes directly associated with a non-abnormal node; a_j is the anomaly confidence corresponding to the j-th abnormal node.

A second aspect of the present disclosure provides a root cause positioning device for a micro-service failure, comprising: the first acquisition module is used for acquiring a detection model; the second acquisition module is used for acquiring a system operation and maintenance map; the detection module is used for detecting the real-time log; the labeling module is used for labeling abnormal nodes on the system operation and maintenance map when the detection result is abnormal; and an analysis module for determining an anomaly root node based on the anomaly node.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the root cause localization method of the micro-service failure described above.

A fourth aspect of the present disclosure also provides a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the root cause localization method of a micro-service failure described above.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the root cause localization method of a micro-service fault as described above.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario diagram of a root cause localization method, apparatus, device, medium and program product of micro-service failures according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a root cause localization method of a micro-service failure according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a schematic diagram of a root cause positioning system architecture for micro-service failures in accordance with an embodiment of the present disclosure;

FIG. 4 schematically illustrates a schematic diagram of the data preprocessing apparatus 1 shown in FIG. 3 according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a schematic diagram of the dynamic map modeling apparatus 2 shown in FIG. 3, according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a schematic diagram of the abnormality detection apparatus 3 shown in FIG. 3 according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a schematic diagram of root cause analysis device 4 shown in FIG. 3, according to an embodiment of the present disclosure;

FIG. 8 schematically illustrates an example flow diagram of the micro-service failure provisioning service illustrated in FIG. 3, in accordance with an embodiment of the present disclosure;

FIG. 9 schematically illustrates a block diagram of a root cause positioning device of a micro-service failure according to an embodiment of the disclosure; and

FIG. 10 schematically illustrates a block diagram of an electronic device adapted to implement a root cause localization method of a micro-service failure according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

In this context, micro-services are a software development technique that advocates dividing a single application into a set of small services that coordinate and interact with each other to provide the final value to the user.

Micro-service systems in financial technology place high demands on data consistency and on the response to and handling of anomalies. During system operation, a large amount of monitoring data and log data is generated, which makes it very challenging for IT experts to investigate, analyze and resolve system faults, and one or more innovative technologies are needed to address these problems. Among the key problems are:

1. Anomaly identification. At present, whether a service has failed is judged mainly from a predefined dictionary of error codes. It is necessary to combine the various indicators and parameters recorded in the logs and to perform modeling analysis, so that anomalies such as business failures and database connection timeouts are identified more intelligently.

2. Root cause analysis. When a fault occurs, the root cause of the problem needs to be located quickly and accurately, including in particular the deployment location of the failed node (e.g., container, host, template group, application, etc.) and the failed service (e.g., service name, type, etc.).

Therefore, how to provide a new solution to the above-mentioned problems is a technical problem to be solved in the art.

The disclosure aims to provide a root cause localization and analysis method for faults. The method associates and fuses log information from different modules and further analyzes log graph patterns, numerical patterns and statistical distributions, so that repetitions, anomalies and similar problems are identified automatically, the cause of a problem can be searched for and located in one stop, and the causes and effects of anomalies can be displayed on a dynamic graph extracted from the logs.

The embodiment of the disclosure provides a root cause positioning method of a micro-service fault, comprising the following steps: obtaining a detection model; acquiring a system operation and maintenance map; detecting the real-time log based on the detection model; when the detected result is abnormal, marking abnormal nodes on the system operation and maintenance map; and determining an abnormal root node based on the abnormal node.

It should be noted that the root cause positioning method and device of the present disclosure can be used for root cause positioning of micro-service faults in the financial field, and can also be used for root cause positioning of faults in any field other than the financial field; the application field of the method and device is not limited.

Fig. 1 schematically illustrates an application scenario diagram of a root cause localization method, apparatus, device, medium and program product of micro-service failure according to an embodiment of the present disclosure.

As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that, the root cause positioning method of the micro service fault provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the root cause positioning device of the micro service failure provided by the embodiments of the present disclosure may be generally disposed in the server 105. The root cause positioning method of the micro service failure provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the root cause positioning device of the micro service failure provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically illustrates a flow chart of a root cause localization method of a micro-service failure according to an embodiment of the present disclosure.

As shown in fig. 2, the root cause positioning method of the micro service failure of this embodiment includes operations S201 to S205.

In operation S201, a detection model is acquired.

The detection model is obtained by analyzing and processing history logs, and comprises a graph model reflecting the relationships between the history logs and a variable model reflecting variable characteristics. The detection model is also a dynamic model, because it is updated according to each detection result. The process of establishing the detection model comprises the following steps:

(1) Collecting raw data and preprocessing the history logs: key business logs are screened, and effective features are extracted from the raw data to the greatest extent possible for use by algorithms and models. Logs from different sources have definite association relationships and need to be linked together; specifically, logs from different sources are associated through business keywords.

(2) Performing data characteristic analysis on the log data: the statistical characteristics of the key values of different fields in logs from different sources are modeled and stored, and key fields are screened for repetition detection tasks and graph modeling tasks.

(3) Generating the dynamic graph model: according to the information in the key fields of the logs, the history log streams are linked and merged spatially, finally producing an abstract dynamic graph model.

(4) Modeling variables based on the dynamic graph: using the dynamic graph model obtained in the previous step, variables are extracted from the different nodes in the dynamic graph and modeled for each node according to a message field knowledge base. Different modeling methods are used for variables of different data types, and the variable patterns of all types of variables are finally stored.

In operation S202, a system operation and maintenance map is acquired. The system operation and maintenance map is composed of different levels of nodes of different types such as application, service, template group, container, host and the like, and dynamic interaction and association thereof.

In operation S203, the real-time log is detected based on the detection model. In the detection process, the preprocessed real-time log is matched against the established dynamic graph model and dynamic graph variables to find suspicious anomalies that do not conform to the patterns.

In operation S204, when the detection result is abnormal, abnormal nodes are labeled on the system operation and maintenance map.

In operation S205, an anomaly root node is determined based on the abnormal node. Propagation path analysis is performed on the anomalies using the dynamic graph model, and the root cause of the problem is located and output according to the propagation direction.

According to an embodiment of the present disclosure, the method further comprises: updating the detection model when the detection result is abnormal. Because the detection model is a dynamic model, it can be updated dynamically according to the real-time detection results. Once a real-time log has been detected, it becomes a history log; the modeling process is then repeated, and the detection model is updated after the history log data has been processed.

According to an embodiment of the present disclosure, the step of constructing the detection model includes: acquiring a history log; preprocessing the history log to obtain history characteristic data; and constructing a detection model based on the historical characteristic data. After the original log is obtained, it needs to be collected and processed first. Since the log data contains a large amount of numerical and character type characteristic information, information redundancy may exist, and the purpose of preprocessing is to extract effective characteristics from the original data to the maximum extent for algorithm and model use.

According to an embodiment of the present disclosure, the preprocessing includes: extracting key data of the history log; grouping the key data according to preset conditions to obtain grouped key data; and abstracting the grouped key data to obtain historical characteristic data. The basis of grouping may be determined according to practical situations, such as according to transaction type, according to transaction initiator, etc., which is not limited in this disclosure.

According to an embodiment of the disclosure, the detection model includes a first detection model and a second detection model, wherein the first detection model is a call graph pattern library; and the second detection model is a variable model. Wherein the call graph schema library comprises graph schemas of micro-service calls, and the variable model comprises data characteristics and key fields in a history log.

According to an embodiment of the disclosure, the step of detecting the real-time log based on the detection model includes: judging whether a calling graph mode of the real-time log is in the calling graph mode library or not based on the first detection model, and obtaining a first detection result; and judging whether the field variable of the real-time log is abnormal or not based on the second detection model, and obtaining a second detection result. It should be noted that, the real-time log needs to be preprocessed before detection, a calling graph mode and a field variable of the real-time log are extracted, and then the first detection model and the second detection model are used for respectively judging to detect whether the real-time log is abnormal or not.

According to an embodiment of the present disclosure, the method further comprises: determining that the detection result is abnormal when the first detection result is abnormal and/or the second detection result is abnormal. That is, if the transaction pattern of the real-time log is not in the call graph pattern library, or the variables in the real-time log are inconsistent with the variable model, the real-time log is considered abnormal; either condition alone is sufficient.
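By way of illustration only, the combination of the two detection results may be sketched as follows. The function name `detect`, the representation of a call graph pattern as a frozenset of (caller, callee) edges, and the per-field check functions are assumptions made for this sketch and are not taken from the disclosure.

```python
def detect(call_pattern, field_values, pattern_library, variable_checks):
    """Combine the first and second detection results with a logical OR.

    call_pattern:    frozenset of (caller, callee) edges (assumed representation)
    field_values:    dict mapping field name to the observed value
    pattern_library: set of known call graph patterns
    variable_checks: dict mapping field name to a function that returns
                     True when the value is consistent with the variable model
    """
    # First detection result: abnormal if the call graph pattern is unknown.
    first_abnormal = call_pattern not in pattern_library
    # Second detection result: abnormal if any checked field fails its check.
    second_abnormal = any(
        not is_normal(field_values[name])
        for name, is_normal in variable_checks.items()
        if name in field_values
    )
    return {
        "first_abnormal": first_abnormal,
        "second_abnormal": second_abnormal,
        "abnormal": first_abnormal or second_abnormal,
    }
```

Under these assumptions, a log whose call pattern is known but whose latency field falls outside the modeled range would still be reported as abnormal, because either result alone is sufficient.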

It should be noted that a non-abnormal node may have direct edges to a plurality of abnormal nodes, and each abnormal node has its own anomaly confidence. The anomaly confidences of the abnormal nodes associated with a non-abnormal node are added to obtain the anomaly degree of that non-abnormal node. The anomaly confidence reflects the probability that a node is abnormal. Different nodes in the system operation and maintenance map have different anomaly confidences: nodes that are prone to problems have a higher anomaly confidence, and nodes that rarely cause problems have a lower one.

The root cause positioning method of the micro-service fault can address problems in system operation and maintenance work such as incomplete detection of the service system, low efficiency of root cause analysis, and a heavy workload for operation and maintenance personnel. The system can integrate transaction (service) logs from different modules and analyze log graph patterns, numerical patterns and statistical distributions, so that repetitions, anomalies and the like are identified automatically. In addition, the cause of a problem can be searched for and located in one stop, and the causes and effects of anomalies can be displayed on a dynamic graph extracted from the logs.

Fig. 3 schematically illustrates a schematic diagram of a root cause localization system architecture of a micro service failure according to an embodiment of the present disclosure.

As shown in fig. 3, the root cause positioning system of a micro service failure of an embodiment of the present disclosure includes: a data preprocessing device 1, a dynamic diagram modeling device 2, an abnormality detection device 3 and a root cause analysis device 4. The data preprocessing device 1 is connected with the dynamic diagram modeling device 2; the dynamic diagram modeling device 2 is connected with the abnormality detection device 3; the dynamic diagram modeling device 2 is connected with the root cause analysis device 4; the abnormality detection device 3 is connected to the root cause analysis device 4.

The data preprocessing device 1 is used for acquiring a history log and then preprocessing it. Since the log data contains a large amount of numerical and character type characteristic information, information redundancy may exist, and the purpose of preprocessing is to extract effective characteristics from the original data to the maximum extent for algorithm and model use.

The dynamic graph modeling apparatus 2 is used for performing offline modeling analysis on log data. During preprocessing, transaction (business) sequences are extracted from historical log data based on the transaction (business) identification number; offline modeling is then performed on these sequences in two steps, one for the first detection model and one for the second detection model.

The anomaly detection device 3 is used for matching the preprocessed real-time log against the established dynamic graph model and finding suspicious anomalies that do not conform to the model. The detection results based on dynamic graph variables are used as a supplement; the anomaly information is merged, and the suspicious anomalies are output together with information that assists problem diagnosis.

The root cause analysis device 4 is used for marking the obtained abnormal information on a system operation and maintenance map, analyzing the propagation path of the abnormality by using the system operation and maintenance map, and positioning and outputting the root cause of the problem according to the propagation direction.

Fig. 4 schematically shows a schematic diagram of the data preprocessing apparatus 1 shown in fig. 3 according to an embodiment of the present disclosure.

As shown in fig. 4, the data preprocessing apparatus 1 includes a key data extraction unit 11, a key data grouping unit 12, and a key data abstraction unit 13.

The key data extraction unit 11 is configured to extract key data, for example by linking transaction (service) logs according to the transaction (service) serial number so that related logs are associated together.

The key data grouping unit 12 is used for grouping the key data, for example according to the initiator of the transaction.

The key data abstraction unit 13 is configured to abstract the key data. It processes structured log data and, for each log, selects fields that represent the characteristics of the log concisely, removing redundant information to improve analysis efficiency. For example, the values of fields such as service name, method name and log type are screened out, separated by the "-" symbol, and joined into a character string that serves as an abstract tuple.
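By way of illustration only, the three preprocessing units may be sketched as follows. The record layout and the field names `serial_no`, `initiator`, `service`, `method` and `log_type` are assumptions made for this sketch; the disclosure names only the kinds of fields involved.

```python
from collections import defaultdict

# Fields joined into the abstract tuple (names are assumptions).
ABSTRACT_FIELDS = ("service", "method", "log_type")


def link_by_serial(logs):
    """Unit 11 (sketch): link logs that share a transaction serial number."""
    transactions = defaultdict(list)
    for record in logs:
        transactions[record["serial_no"]].append(record)
    return dict(transactions)


def group_by_initiator(transactions):
    """Unit 12 (sketch): group transactions by the initiator of their first record."""
    groups = defaultdict(dict)
    for serial_no, records in transactions.items():
        groups[records[0]["initiator"]][serial_no] = records
    return dict(groups)


def abstract_tuple(record):
    """Unit 13 (sketch): join the selected field values with '-' into one string."""
    return "-".join(record[name] for name in ABSTRACT_FIELDS)


def preprocess(logs):
    """Run extraction, grouping and abstraction in sequence."""
    grouped = group_by_initiator(link_by_serial(logs))
    return {
        initiator: {
            serial_no: [abstract_tuple(r) for r in records]
            for serial_no, records in transactions.items()
        }
        for initiator, transactions in grouped.items()
    }
```

Under these assumptions, a log record with service `auth`, method `login` and log type `INFO` would be abstracted to the string `auth-login-INFO`.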

Fig. 5 schematically illustrates a schematic diagram of the dynamic map modeling apparatus 2 shown in fig. 3, according to an embodiment of the present disclosure.

As shown in fig. 5, the dynamic diagram modeling apparatus 2 includes a data feature analysis unit 21, a dynamic diagram pattern mining unit 22, and a dynamic diagram variable modeling unit 23.

The data characteristic analysis unit 21 is configured to perform data characteristic analysis on the key data abstracted by the key data abstraction unit 13, including field type identification and field data distribution analysis, and to output a configuration file that comprises data characteristics and key field descriptions.

The dynamic graph pattern mining unit 22 is configured to mine graph patterns of micro-service calls on a per-transaction basis: from the data grouped by the key data grouping unit 12 it mines the micro-service call graph patterns of the different groups and stores them in the call graph pattern library. Specific graph pattern mining algorithms include gSpan, CloseGraph, SPIN and the like, and can be selected according to actual conditions.
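By way of illustration only, the construction of a per-group call graph pattern library may be sketched as follows. This sketch does not implement gSpan, CloseGraph or SPIN, which mine frequent subgraphs; it substitutes a much simpler stand-in that counts whole call graphs and keeps those seen at least `min_support` times. The edge-set representation and the `min_support` parameter are assumptions.

```python
from collections import Counter


def call_graph(calls):
    """Represent one transaction's calls as a frozenset of (caller, callee) edges."""
    return frozenset(calls)


def build_pattern_library(grouped_transactions, min_support=2):
    """Keep, per group, every whole call graph seen at least min_support times.

    grouped_transactions: dict mapping a group name to a list of transactions,
    where each transaction is a list of (caller, callee) pairs.
    """
    library = {}
    for group, transactions in grouped_transactions.items():
        counts = Counter(call_graph(calls) for calls in transactions)
        library[group] = {
            pattern for pattern, count in counts.items() if count >= min_support
        }
    return library
```

Because edges are stored as a set, two transactions that make the same calls in a different order map to the same pattern in this sketch.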

The dynamic graph variable modeling unit 23 is configured to model variable fields under different calling relationships separately to obtain a variable model, where the variable fields are derived from the configuration file output by the data characteristic analysis unit 21. In the dynamic graph, nodes (fields) of the same type may appear in call contexts of different structures, so variable fields under different calling relationships need to be modeled separately; these fields are called combined fields and originate from the configuration file. Other, non-combined fields in the configuration file are modeled directly. Specific models include ARIMA, Isolation Forest, K-means and the like, and can be selected according to actual conditions.
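By way of illustration only, the separate modeling of a combined field under different calling relationships may be sketched as follows. In place of ARIMA, Isolation Forest or K-means, the sketch uses a plain mean and standard deviation per (call context, field) pair with a z-score test; this substitution, the context labels, the threshold of 3.0, and the choice to treat an unseen pair as abnormal are all assumptions.

```python
from statistics import mean, pstdev


def fit_variable_model(samples):
    """Fit one (mean, std) pair per (call_context, field) key.

    samples: iterable of (call_context, field, value) triples.
    """
    grouped = {}
    for context, field, value in samples:
        grouped.setdefault((context, field), []).append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in grouped.items()}


def is_abnormal(model, context, field, value, z_threshold=3.0):
    """Return True when the value deviates strongly from its own context's model."""
    if (context, field) not in model:
        return True  # assumption: an unseen context/field pair is abnormal
    mu, sigma = model[(context, field)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

The design point is the dictionary key: the same field is judged against the statistics of its own calling relationship, so a value that is normal in one call context can be abnormal in another.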

Fig. 6 schematically illustrates a schematic diagram of the abnormality detection apparatus 3 shown in fig. 3 according to an embodiment of the present disclosure.

As shown in fig. 6, the abnormality detection device 3 includes a dynamic diagram abnormality detection unit 31 and a dynamic diagram variable abnormality detection unit 32.

The dynamic graph anomaly detection unit 31 is used for identifying anomalies according to graph patterns in the call graph pattern library. In the dynamic graph modeling apparatus 2, the system establishes a call graph pattern library of various transactions. For a new transaction, the dynamic graph anomaly detection unit 31 needs to match in the call graph pattern library to find out whether there is a call pattern that can be matched. And comparing the marginal distances of the graphs to obtain an anomaly score, and comparing a predefined threshold to judge whether the new transaction is anomaly. The predefined threshold may be determined according to an actual situation, and any threshold that can determine whether the new transaction is abnormal based on an actual need may be used, which is not limited in the present disclosure.
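By way of illustration only, scoring a new transaction against the call graph pattern library may be sketched as follows. The disclosure does not define the graph distance in detail; this sketch assumes, purely for illustration, that the distance between two call graphs is the number of edges present in one but not the other, and that the anomaly score is the smallest such distance to any known pattern. The function names and the default threshold are hypothetical.

```python
def edge_distance(pattern_a, pattern_b):
    """Assumed distance: count of edges that appear in only one of the graphs."""
    return len(pattern_a ^ pattern_b)


def anomaly_score(pattern, library):
    """Smallest distance from the new pattern to any pattern in the library."""
    if not library:
        return float("inf")
    return min(edge_distance(pattern, known) for known in library)


def is_abnormal_transaction(pattern, library, threshold=1):
    """Compare the anomaly score with a predefined threshold."""
    return anomaly_score(pattern, library) > threshold
```

With this assumed distance, an exact match scores 0, and the threshold controls how many differing edges are tolerated before a transaction is flagged.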

The dynamic graph variable anomaly detection unit 32 is configured to determine, for each numeric field in the new transaction, whether the field variable is anomalous using a time-series-based anomaly detection method. In the dynamic graph modeling apparatus 2, the system establishes a variable model, and the dynamic graph variable anomaly detection unit 32 determines whether a field variable in a new transaction is anomalous based on that variable model.
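To make the idea of one variable model per call relationship concrete, the sketch below fits a mean and standard deviation for each (call context, field) pair and flags values by z-score. This is a simple substitute for the ARIMA, Isolation Forest, or K-means models named in the disclosure, not the disclosed method. The names, the sample data, and the threshold of 3 standard deviations are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (call context, field name)


def fit_variable_model(
    samples: Iterable[Tuple[str, str, float]],
) -> Dict[Key, Tuple[float, float]]:
    """Fit one (mean, standard deviation) pair per call context and field."""
    grouped = defaultdict(list)
    for context, field, value in samples:
        grouped[(context, field)].append(value)
    return {key: (mean(values), pstdev(values)) for key, values in grouped.items()}


def field_is_anomalous(model: Dict[Key, Tuple[float, float]], context: str,
                       field: str, value: float, z_threshold: float = 3.0) -> bool:
    """Flag a value more than z_threshold standard deviations from its mean."""
    if (context, field) not in model:
        return True  # assumption: an unseen context/field pair counts as anomalous
    mu, sigma = model[(context, field)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical history: the same field behaves differently in two call contexts.
history = (
    [("gateway->auth", "latency_ms", v) for v in (10, 12, 11, 9, 13)]
    + [("auth->ledger", "latency_ms", v) for v in (100, 110, 90, 105, 95)]
)
model = fit_variable_model(history)
```

With this data, a latency of 100 ms is ordinary for `auth->ledger` but far outside the range learned for `gateway->auth`, which is why the field is modeled separately per call relationship.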

Fig. 7 schematically illustrates a schematic diagram of the root cause analysis device 4 shown in fig. 3, according to an embodiment of the disclosure.

As shown in fig. 7, the root cause analysis device 4 includes a system operation and maintenance map unit 41, an anomaly labeling unit 42, and a root cause positioning unit 43.

The system operation and maintenance map unit 41 is configured to construct a dynamic operation and maintenance map of the system based on a predefined system operation and maintenance data model, taking the logs in the time window to be analyzed as input. The map includes nodes of different levels and types, such as applications, services, templates, template groups, containers, and hosts, together with their dynamic interactions and associations.

The anomaly labeling unit 42 is configured to label the anomalies detected by the abnormality detection device 3 on the constructed dynamic operation and maintenance map. Labeling means finding the node in the map whose key or value matches the anomalous object and updating the corresponding database record with error information, including the source and time of the anomaly.
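A minimal in-memory sketch of such a map and of the labeling step is shown below. The disclosure stores node records in a database and builds the map from logs according to a predefined data model; here a pair of dataclasses stands in for both. The class names, the `kind:name` key scheme, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    key: str                                  # e.g. "service:auth" (hypothetical scheme)
    kind: str                                 # application, service, container, host, ...
    errors: List[dict] = field(default_factory=list)

    @property
    def abnormal(self) -> bool:
        return bool(self.errors)


@dataclass
class OpsMap:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_relation(self, src: Tuple[str, str], dst: Tuple[str, str]) -> None:
        """Add two (key, kind) nodes if missing and connect them."""
        for key, kind in (src, dst):
            self.nodes.setdefault(key, Node(key, kind))
        self.edges.add((src[0], dst[0]))

    def label_anomaly(self, key: str, source: str, time: str) -> bool:
        """Attach error information to the node matching key; False if none matches."""
        node = self.nodes.get(key)
        if node is None:
            return False
        node.errors.append({"source": source, "time": time})
        return True


ops_map = OpsMap()
ops_map.add_relation(("service:auth", "service"), ("container:auth-1", "container"))
ops_map.add_relation(("container:auth-1", "container"), ("host:h1", "host"))
labeled = ops_map.label_anomaly(
    "service:auth", source="call graph detection", time="2021-12-07T10:00:00"
)
```

Keeping the error records on the node, rather than a single flag, preserves the source and time of each anomaly for later queries.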

The root cause locating unit 43 is configured to find all non-abnormal nodes that have a direct edge to an abnormal node, traverse each of these non-abnormal nodes, and calculate its degree of abnormality. Finally, the degrees of abnormality are sorted in descending order to obtain the top K possible root cause nodes.
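The candidate selection and top-K ranking can be sketched as below. The formula for the degree of abnormality is not reproduced in this text, so the sketch takes the scoring function as a parameter instead of guessing it. The usage lines pass Python's built-in `sum` only as a placeholder, and all node names and confidence values are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple


def rank_root_causes(
    edges: Iterable[Tuple[str, str]],
    confidence: Dict[str, float],
    score: Callable[[Sequence[float]], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank non-abnormal neighbours of abnormal nodes by degree of abnormality.

    edges       undirected links of the operation and maintenance map
    confidence  abnormal node -> anomaly confidence
    score       maps the confidences of a candidate's abnormal neighbours to
                its degree of abnormality (supplied by the caller, because the
                disclosed formula is not available here)
    """
    abnormal_neighbours: Dict[str, Set[str]] = {}
    for u, v in edges:
        for candidate, other in ((u, v), (v, u)):
            # A candidate is a node that is not abnormal itself but touches one that is.
            if candidate not in confidence and other in confidence:
                abnormal_neighbours.setdefault(candidate, set()).add(other)

    scored = [
        (node, score([confidence[n] for n in sorted(found)]))
        for node, found in abnormal_neighbours.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


edges = [
    ("svc:auth", "host:h1"),
    ("svc:pay", "host:h1"),
    ("svc:pay", "host:h2"),
    ("svc:query", "host:h2"),
]
confidence = {"svc:auth": 0.9, "svc:pay": 0.8}
ranking = rank_root_causes(edges, confidence, score=sum, top_k=2)  # sum is a placeholder
```

In this example `host:h1` ranks first because it is adjacent to two abnormal services, while `svc:query` is not a candidate at all because none of its neighbours is abnormal.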

Fig. 8 schematically illustrates an example flow in which the system illustrated in fig. 3 provides the micro-service fault service, in accordance with an embodiment of the present disclosure.

As shown in fig. 8, an example flow of providing the micro-service fault service includes the following steps:

Step S801: the system starts up.

Step S802: the system acquires historical transaction logs and other related data, performs preprocessing on key data, including extraction, grouping and abstraction, performs data characteristic analysis on the preprocessed data, and outputs a configuration file.

Step S803: the system performs dynamic graph modeling and variable modeling according to the preprocessed data and the configuration file to obtain a call graph pattern library and a variable model.

Step S804: the system performs anomaly detection on new transaction logs input to the system according to the call graph pattern library and the variable model, and outputs anomaly information if an anomaly is detected.

Step S805: the system labels the anomaly information on the system operation and maintenance map according to the anomaly detection result, and outputs the root cause analysis result.

Step S806: the system records the root cause analysis result of the fault and makes it available for query.

Step S807: the user queries the detected faults and root cause analysis results through the various interfaces of the system (including a graphical interface, an API, a command line, and the like).

Based on the root cause positioning method of the micro-service fault, the disclosure also provides a root cause positioning device of the micro-service fault. The device will be described in detail below in connection with fig. 9.

Fig. 9 schematically illustrates a block diagram of a root cause positioning device of a micro-service failure according to an embodiment of the present disclosure.

As shown in fig. 9, the root cause positioning device 900 of the micro service fault of this embodiment includes a first acquisition module 910, a second acquisition module 920, a detection module 930, a labeling module 940, and an analysis module 950.

The first acquisition module 910 is configured to acquire a detection model. In an embodiment, the first acquisition module 910 may be configured to perform the operation S201 described above, which is not repeated here.

The second acquisition module 920 is configured to acquire a system operation and maintenance map. In an embodiment, the second acquisition module 920 may be configured to perform the operation S202 described above, which is not repeated here.

The detection module 930 is configured to detect the real-time log. In an embodiment, the detection module 930 may be configured to perform the operation S203 described above, which is not repeated here.

The labeling module 940 is configured to label the abnormal node on the system operation and maintenance map when the detection result is abnormal. In an embodiment, the labeling module 940 may be configured to perform the operation S204 described above, which is not repeated here.

The analysis module 950 is configured to determine an abnormal root node based on the abnormal node. In an embodiment, the analysis module 950 may be configured to perform the operation S205 described above, which is not repeated here.

According to embodiments of the present disclosure, any number of the first acquisition module 910, the second acquisition module 920, the detection module 930, the labeling module 940, and the analysis module 950 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first acquisition module 910, the second acquisition module 920, the detection module 930, the labeling module 940, and the analysis module 950 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuits, or in any one of the three implementation manners of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the first acquisition module 910, the second acquisition module 920, the detection module 930, the labeling module 940, and the analysis module 950 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.

As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1001 may also include on-board memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.

In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiment of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, which is also connected to the bus 1004. The electronic device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.

The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 1002 and/or RAM 1003 and/or one or more memories other than ROM 1002 and RAM 1003 described above.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.

The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1001. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of signals on a network medium, distributed, and downloaded and installed via the communication section 1009, and/or installed from the removable medium 1011. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages. In particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.

The embodiments of the present disclosure are described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims

1. A root cause positioning method of a micro-service fault, comprising:

obtaining a detection model;

acquiring a system operation and maintenance map;

detecting the real-time log based on the detection model;

when the detection result is abnormal, marking abnormal nodes on the system operation and maintenance map; and

determining an abnormal root cause node based on the abnormal node;

wherein the detection model comprises a first detection model and a second detection model,

the first detection model is a call graph mode library; and

the second detection model is a variable model;

wherein the step of constructing the first detection model includes:

mining a call graph mode of the history log based on the history feature data, and generating a call graph mode library;

wherein the step of constructing the second detection model includes:

carrying out data characteristic analysis on the historical characteristic data to obtain a configuration file, wherein the configuration file comprises data characteristics and key fields; and

constructing a variable model based on the data characteristics and the key fields;

wherein, based on the detection model, the step of detecting the real-time log comprises:

judging whether a calling graph mode of the real-time log is in the calling graph mode library or not based on the first detection model, and obtaining a first detection result; and

judging whether field variables of the real-time log are abnormal or not based on the second detection model, and obtaining a second detection result;

when the first detection result is abnormal, and/or when the second detection result is abnormal, determining that the detection result is abnormal;

wherein, based on the abnormal node, the step of determining the abnormal root cause node comprises:

determining all non-abnormal nodes on the system operation and maintenance map, wherein the non-abnormal nodes are nodes with direct edges with the abnormal nodes;

calculating the anomaly degree of all non-anomaly nodes respectively; and

sequencing the abnormality degrees of all the non-abnormal nodes from big to small, and determining abnormal root cause nodes;

wherein, the calculation formula of the degree of anomaly is:

wherein: n is the number of abnormal nodes directly associated with a non-abnormal node; and a_j is the anomaly confidence corresponding to the j-th abnormal node, the anomaly confidence being used to represent the probability that the node is abnormal.

2. The method according to claim 1, wherein the method further comprises:

updating the detection model when the detection result is abnormal.

3. The method of claim 1, wherein the step of constructing the detection model comprises:

acquiring a history log;

preprocessing the history log to obtain history characteristic data; and

constructing a detection model based on the historical characteristic data.

4. A method according to claim 3, wherein the pre-processing comprises:

extracting key data of the history log;

grouping the key data according to preset conditions to obtain grouped key data; and

abstracting the grouped key data to obtain historical characteristic data.

5. A root cause positioning device for a micro-service failure, comprising:

a first acquisition module for acquiring a detection model, wherein the detection model comprises a first detection model and a second detection model,

the first detection model is a call graph mode library; and

the second detection model is a variable model;

wherein the step of constructing the first detection model includes:

wherein the step of constructing the second detection model includes:

a second acquisition module for acquiring a system operation and maintenance map;

a detection module for detecting the real-time log, wherein the step of detecting the real-time log comprises:

a labeling module for labeling abnormal nodes on the system operation and maintenance map when the detection result is abnormal; and

an analysis module for determining an abnormal root node based on the abnormal node, wherein the step of determining the abnormal root node based on the abnormal node comprises:

calculating the anomaly degree of all non-anomaly nodes respectively; and

wherein, the calculation formula of the degree of anomaly is:

wherein: n is the number of abnormal nodes directly associated with a non-abnormal node; and a_j is the anomaly confidence corresponding to the j-th abnormal node, the anomaly confidence being used to represent the probability that the node is abnormal.

6. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-4.

7. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-4.