CN117194092A - Root cause positioning method, root cause positioning device, computer equipment and storage medium


Info

Publication number
CN117194092A
CN117194092A (application CN202311213288.6A)
Authority
CN
China
Prior art keywords
service
root cause
service node
node
maintenance system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311213288.6A
Other languages
Chinese (zh)
Inventor
罗森
刘智敏
朱少华
刘亚丹
Current Assignee
Guangzhou Quyan Network Technology Co ltd
Original Assignee
Guangzhou Quyan Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Quyan Network Technology Co ltd filed Critical Guangzhou Quyan Network Technology Co ltd
Priority to CN202311213288.6A
Publication of CN117194092A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a root cause positioning method, a root cause positioning device, computer equipment, and a storage medium. The method comprises the following steps: in response to a root cause positioning instruction triggered by an operation and maintenance system, determining a plurality of service nodes to which the operation and maintenance system has responded within a current time window, where each service node comprises at least one service request and the operation and maintenance system executes the at least one service request to realize the service function corresponding to that node; acquiring the operation and maintenance data generated when the operation and maintenance system executes the service requests of each service node; performing root cause evaluation on each service node based on the operation and maintenance data to obtain an evaluation result for each node, the root cause evaluation assessing how likely that service node is to have caused the system abnormality in the operation and maintenance system; and determining the root cause node from the plurality of service nodes based on the evaluation results. The method effectively improves the efficiency and accuracy of root cause positioning for a faulty system.

Description

Root cause positioning method, root cause positioning device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a root cause positioning method, a root cause positioning device, a computer device, and a computer readable storage medium.
Background
With the development and broad application of computer technology, more and more complex software systems adopt a modular approach, i.e., large service applications are decomposed into hundreds of smaller, independent, easier-to-manage service nodes. These service nodes cooperate through lightweight communication mechanisms, forming a high-cohesion, low-coupling system architecture. However, owing to the complex dependencies between service nodes, a fault in a small number of root cause nodes may propagate along the service call chain, affecting associated nodes and ultimately causing service-level availability problems. Therefore, when a system abnormality is detected, operation and maintenance personnel need to locate the root cause service node of the fault quickly and accurately, so as to prevent the fault from spreading further.
Current methods for locating root cause nodes rely mainly on expert experience and manual investigation. Such methods depend on deep manual knowledge of the system and require manually analyzing a large amount of operation and maintenance data, which limits their application in complex systems, leaves root cause positioning inefficient, and makes its accuracy hard to guarantee.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a root cause positioning method, a root cause positioning apparatus, a computer device, a computer-readable storage medium, and a computer program product that can improve root cause positioning efficiency and accuracy.
In a first aspect, the present application provides a root cause positioning method. The method comprises the following steps:
responding to a root cause positioning instruction triggered by an operation and maintenance system, and determining a plurality of service nodes responded by the operation and maintenance system in a current time window; wherein, each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
acquiring operation and maintenance data correspondingly generated when the operation and maintenance system respectively executes service requests of the service nodes;
based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
determining a root cause node from the plurality of service nodes based on the evaluation result;
wherein the root cause node is the abnormal service node whose induced system abnormality in the operation and maintenance system triggered the root cause positioning instruction.
In one embodiment, the performing root cause evaluation on each service node based on the operation and maintenance data to obtain an evaluation result for each service node includes:
determining performance indexes and function indexes of the operation and maintenance system when responding to each service node respectively based on the operation and maintenance data; the performance index is used for representing the system performance condition of the operation and maintenance system when responding to the service node, and the functional index is used for representing the functional service condition applied by the operation and maintenance system when responding to the service node;
performing first scoring on performance indexes of all the service nodes based on a preset score factor to obtain a first evaluation score aiming at the performance indexes; and
performing second scoring on the functional indexes of the service nodes based on preset weight factors to obtain second evaluation scores aiming at the functional indexes;
and determining an evaluation result for each service node based on the first evaluation score and the second evaluation score.
In one embodiment, the performance indicator is characterized by a corresponding indicator value;
after the determining the performance indexes of the operation and maintenance system when responding to each service node respectively, the method further comprises:
acquiring a historical index value of a historical performance index of a historical service node responded by the operation and maintenance system in a historical time window; the historical time window includes the current time window;
inputting the historical index value of the historical performance index into a pre-trained index prediction model to predict the safety performance, so as to obtain a predicted safety performance interval;
and determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node in the current time window and the safety performance interval.
In one embodiment, the determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node within the current time window and the safety performance interval includes the following two steps:
if the index values of the performance indexes of the service nodes are all in the safety performance interval within the preset time length, determining that the real-time state of the performance indexes of the service nodes within the preset time length is a normal state;
And in the preset time length, if the index values of the performance indexes of the service nodes are not in the safety performance interval, determining that the real-time state of the performance indexes of the service nodes in the preset time length is an abnormal state.
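The two branches above amount to an interval check over the sampled index values. The sketch below is illustrative only: in the patent the safety performance interval comes from the pre-trained index prediction model, whereas here a fixed (low, high) pair stands in for it, and all names are assumptions.

```python
def realtime_state(index_values, safety_interval):
    """Classify a performance index over the preset duration.

    index_values: index values sampled within the preset time length.
    safety_interval: assumed (low, high) pair standing in for the
        predicted safety performance interval.
    """
    low, high = safety_interval
    if all(low <= v <= high for v in index_values):
        return "normal"    # every sample lies within the safety interval
    return "abnormal"      # at least one sample falls outside the interval

print(realtime_state([0.12, 0.15, 0.11], (0.05, 0.20)))  # normal
print(realtime_state([0.12, 0.35, 0.11], (0.05, 0.20)))  # abnormal
```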
In one embodiment, the first scoring the performance index of each service node based on a preset score factor to obtain a first evaluation score for the performance index, including:
taking a score factor of a performance index corresponding to the real-time state as a first evaluation score aiming at the performance index;
wherein, different fractional factors are correspondingly preset for the performance indexes of different real-time states.
In one embodiment, the functional indicator is characterized by a corresponding indicator value;
performing a second scoring on the functional index of each service node based on a preset weight factor to obtain a second evaluation score aiming at the functional index, including:
taking the product value between the index value of the functional index and the corresponding weight factor as the second evaluation score for the functional index;
wherein different weight factors are correspondingly preset for different functional indexes.
In one embodiment, the performance index at least includes P99 time consumption, log error rate, request failure rate, and data failure rate of the operation and maintenance system in responding to a service node; the function index at least comprises the application number proportion, the abnormal application proportion and the quantity of operation and maintenance data corresponding to the change type when the operation and maintenance system responds to the service node;
the determining an evaluation result for each service node based on the first evaluation score and the second evaluation score includes:
and determining a total score between the first evaluation score of each performance index and the second evaluation score of each functional index for each service node, and taking the total score as an evaluation result of the service node.
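The scoring described in this and the preceding embodiments can be sketched as follows. This is a minimal illustration, not the patent's implementation: the score factors, weight factors, and index names are assumptions.

```python
# Assumed score factor per real-time state (first scoring) and assumed
# weight factor per functional index (second scoring).
SCORE_FACTORS = {"normal": 0.0, "abnormal": 1.0}
WEIGHT_FACTORS = {"abnormal_app_ratio": 2.0, "change_count": 0.5}

def evaluate_node(perf_states, func_values):
    """Total score = sum of first evaluation scores (per performance index)
    plus sum of second evaluation scores (per functional index)."""
    first = sum(SCORE_FACTORS[state] for state in perf_states.values())
    second = sum(value * WEIGHT_FACTORS[name] for name, value in func_values.items())
    return first + second

score = evaluate_node(
    {"p99_latency": "abnormal", "log_error_rate": "normal"},
    {"abnormal_app_ratio": 0.4, "change_count": 3},
)
print(score)  # 1.0 + (0.4*2.0 + 3*0.5) = 3.3
```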
In one embodiment, the determining, based on the evaluation result, a root cause node from the plurality of service nodes includes:
and taking the target service node with the highest total score as a root cause node in each service node.
In one embodiment, the responding to the root cause positioning instruction triggered by the operation and maintenance system includes:
acquiring a fault alarm initiated by a terminal user;
And in response to the number of the fault alarms being greater than a preset number in the current sliding time window, automatically triggering a root cause positioning instruction.
In a second aspect, the application also provides a root cause positioning device. The device comprises:
an instruction triggering unit configured to determine, in response to a root cause positioning instruction triggered by an operation and maintenance system, a plurality of service nodes to which the operation and maintenance system has responded within a current time window; wherein each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
a path reference unit configured to perform determining a reference processing path from a plurality of candidate processing paths stored in a preset path library based on the dialogue sentence and the account information; the candidate processing path is a workflow for connecting a plurality of task nodes according to a preset sequence;
a data acquisition unit configured to perform acquisition of operation and maintenance data correspondingly generated by the operation and maintenance system when service requests of the service nodes are respectively executed;
the node evaluation unit is configured to perform root cause evaluation on the service nodes based on the operation and maintenance data to obtain evaluation results of the service nodes; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
a node screening unit configured to determine a root cause node from the plurality of service nodes based on the evaluation result;
wherein the root cause node is the abnormal service node whose induced system abnormality in the operation and maintenance system triggered the root cause positioning instruction.
In a third aspect, the present application also provides a computer device comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the root cause localization method as described above.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium includes program data therein, which when executed, implements the root cause localization method as described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises program instructions which, when executed, implement the root cause localization method as described above.
The root cause positioning method and apparatus, computer device, computer-readable storage medium, and computer program product first determine, in response to a root cause positioning instruction triggered by the operation and maintenance system, the plurality of service nodes to which the operation and maintenance system has responded within the current time window; each service node comprises at least one service request, and the operation and maintenance system executes the at least one service request to realize the service function corresponding to that node. The operation and maintenance data generated when the operation and maintenance system executes the service requests of each service node are then acquired, and root cause evaluation is performed on each service node based on those data to obtain an evaluation result for each node; the root cause evaluation assesses how likely each service node is to have caused the system abnormality. Finally, the root cause node, i.e., the abnormal service node whose system abnormality triggered the root cause positioning instruction, is determined from the plurality of service nodes based on the evaluation results.
On the one hand, in this scheme, when the operation and maintenance system triggers the root cause positioning instruction, the plurality of responded service nodes are first determined, the operation and maintenance data corresponding to each service node are then acquired, and finally each service node is evaluated against those data to locate the root cause node among the plurality of service nodes. This optimizes the root cause positioning process, effectively improves the root cause positioning efficiency for a faulty system, and reduces the consumption of manpower and material resources. On the other hand, when the operation and maintenance system becomes abnormal, the plurality of affected service nodes are determined within the current time window, and the operation and maintenance data generated when the system executed each node's service requests are used to evaluate how likely each node is to have caused the abnormality, thereby determining the root cause node. This optimizes the root cause positioning of service nodes, effectively improves the rationality and accuracy of root cause positioning, and facilitates the subsequent timely repair of the system abnormality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is an application environment diagram illustrating a root cause positioning method according to an example embodiment.
FIG. 2 is a flow chart illustrating a root cause positioning method according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating the steps of an operation and maintenance system triggering root cause positioning instruction according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating steps for determining a root cause node from a serving node, according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating a process for root cause evaluation for a service node, according to an example embodiment.
FIG. 6 is a flowchart illustrating a real-time status step of determining a performance metric, according to an exemplary embodiment.
Fig. 7 is a flow chart illustrating a root cause positioning method according to another exemplary embodiment.
Fig. 8 is a block diagram illustrating a root cause positioning method according to another exemplary embodiment.
FIG. 9 is a block diagram of a root cause positioning device, according to an example embodiment.
FIG. 10 is a block diagram of a computer device for root cause positioning, according to an example embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The terms "first," "second," and the like in this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, although the terms "first," "second," etc. may be used multiple times to describe various operations (or various thresholds or various applications or various instructions or various elements), etc., these operations (or thresholds or applications or instructions or elements) should not be limited by these terms. These terms are only used to distinguish one operation (or threshold or application or instruction or element) from another operation (or threshold or application or instruction or element).
The root cause positioning method provided by the embodiment of the application can be applied to an application environment shown in figure 1. Wherein the terminal 102 communicates with the server 104 via a communication network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server.
In some embodiments, referring to fig. 1, the server 104 first determines, in response to a root cause positioning instruction triggered by the operation and maintenance system, a plurality of service nodes to which the operation and maintenance system has responded within the current time window; each service node comprises at least one service request, and the operation and maintenance system executes the at least one service request to realize the service function corresponding to the service node. The server 104 then acquires the operation and maintenance data generated when the operation and maintenance system executes the service requests of each service node, and performs root cause evaluation on each service node based on those data to obtain an evaluation result for each node; the root cause evaluation assesses how likely each service node is to have caused the system abnormality in the operation and maintenance system. Finally, the server 104 determines the root cause node from the plurality of service nodes based on the evaluation results; the root cause node is the abnormal service node whose system abnormality in the operation and maintenance system triggered the root cause positioning instruction.
In some embodiments, the terminal 102 may be implemented in various forms. The terminal 102 may be a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a portable handheld device, a personal digital assistant (PDA, Personal Digital Assistant), or a tablet (PAD), or it may be a fixed terminal such as an automated teller machine (Automated Teller Machine, ATM), an automatic all-in-one machine, a digital TV, a desktop computer, or a stationary computer.
In the following, it is assumed that the terminal 102 is a fixed terminal. However, those skilled in the art will appreciate that the configuration according to the disclosed embodiments of the present application can also be applied to a mobile type terminal 102 if there are operations or elements specifically for the purpose of movement.
In some embodiments, the data processing components running on the server 104 may load and execute any of a variety of additional server applications and/or middle-tier applications, including, for example, HTTP (hypertext transfer protocol), FTP (file transfer protocol), CGI (common gateway interface), RDBMS (relational database management system), and the like.
In some embodiments, the server 104 may be implemented as a stand-alone server or as a server cluster. The server 104 may run one or more application services or software components serving the terminal 102 described in the foregoing disclosure.
In some embodiments, the application services may include a service interface that provides root cause positioning to the user, and corresponding program services, among others. Among other things, the software components may include an application (SDK) or a client (APP) that performs root cause location functions.
In some embodiments, the application or client with root cause location functionality provided by the server 104 includes a portal that provides one-to-one application services to users in the foreground, and a plurality of business systems that perform data processing in the background, extending the root cause location functionality to the APP or client so that users can use and access it anytime, anywhere.
In some embodiments, a user may input corresponding code data or control parameters to the APP or client through a preset input device or an automatic control program to execute application services of a computer program in the server 104 and display application services in a user interface.
In some embodiments, the operating system run by the APP or client may include various versions of Microsoft Windows, Apple macOS, and/or Linux, various commercial or UNIX-like operating systems (including but not limited to the various GNU/Linux operating systems, Google Chrome OS, etc.), and/or mobile operating systems such as Windows Phone, iOS, Android OS, and BlackBerry OS, as well as other online or offline operating systems, without particular limitation here.
In one embodiment, as shown in fig. 2, a root cause positioning method is provided, and the method is applied to the server in fig. 1 for illustration, and specifically includes the following steps:
step S11: and responding to the root cause positioning instruction triggered by the operation and maintenance system, and determining a plurality of service nodes responded by the operation and maintenance system in the current time window.
In an embodiment, based on a monitoring task preconfigured by the server, when the server monitors that the operation and maintenance system has a system fault and a system abnormality, the server automatically responds to a root cause positioning instruction triggered by the operation and maintenance system to start executing the root cause positioning task.
In one embodiment, the server may include the following steps in response to the operation and maintenance system triggering the root cause positioning instruction:
step one: and acquiring a fault alarm initiated by the terminal user.
Step two: and in response to the number of fault alarms being greater than a preset number in the current sliding time window, automatically triggering a root cause positioning instruction.
As an example, the server first acquires in real time the fault alarms sent by end users for the "customer service system" (i.e., the fault reports submitted from the user side), and then checks the number of fault alarms received for the "customer service system" within the current sliding time window. The current sliding time window ends at the current moment and is 10 minutes long, within which the server has received 5 fault alarms: X1, X2, X3, X4, and X5. The threshold the server presets for fault alarms is 4, so the number of fault alarms received within the current sliding time window exceeds the preset number, and the server automatically triggers a root cause positioning instruction for the "customer service system".
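The example above can be sketched as a sliding-window count check. The window length (10 minutes) and threshold (4) follow the example; the function name and the representation of alarms as timestamps in seconds are assumptions.

```python
WINDOW_SECONDS = 10 * 60   # 10-minute sliding window, as in the example
ALARM_THRESHOLD = 4        # preset number of alarms, as in the example

def should_trigger_root_cause_positioning(alarm_times, now):
    """True when the alarms falling in the window ending at `now` exceed the threshold."""
    in_window = [t for t in alarm_times if now - WINDOW_SECONDS <= t <= now]
    return len(in_window) > ALARM_THRESHOLD

# Five alarms (X1..X5) received within the last 10 minutes exceed the threshold of 4.
alarms = [30, 120, 260, 410, 550]
print(should_trigger_root_cause_positioning(alarms, now=600))  # True
```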
In an exemplary embodiment, as shown in fig. 3, fig. 3 is an interface schematic diagram of an embodiment of an operation and maintenance system trigger root cause positioning instruction in the present application. The interface is a background monitoring interface S1 for monitoring an abnormal state of the operation and maintenance system, and the background monitoring interface S1 includes an object type display area S2, an object information display area S3, an alarm level display area S4, an alarm time display area S5 and an alarm content display area S6. The object type display area S2 displays a monitoring object "customer service system" in the server monitoring operation and maintenance system, the object information display area S3 displays system information related to the "customer service system", the alarm level display area S4 displays an alarm level "middle level" of the current server for dividing the "customer service system", the alarm time display area S5 displays the time when the server divides the alarm level for the "customer service system", and the alarm content display area S6 displays a fault alarm X1, a fault alarm X2, a fault alarm X3, a fault alarm X4 and a fault alarm X5 which are sent by the terminal user and received by the server.
In one embodiment, the root cause positioning instruction triggered by the operation and maintenance system is used first to determine the plurality of service nodes to which the operation and maintenance system has responded within the current time window, and then to analyze, from those service nodes, the cause of the abnormality in the operation and maintenance system so as to determine the root cause node responsible for it, allowing the root cause node to be repaired and the operation and maintenance system restored to normal.
The current time window is a fixed-length time window of preset duration that ends at the moment the root cause positioning instruction is triggered. For example, if the fixed-length window is 3 hours long and the operation and maintenance system triggers the root cause positioning instruction at 16:00 today, the server determines all service nodes to which the operation and maintenance system responded between 13:00 and 16:00 today.
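The window computation in this example can be sketched as follows; the function name and the concrete date are illustrative assumptions.

```python
from datetime import datetime, timedelta

def current_time_window(trigger_time, window=timedelta(hours=3)):
    """The window ends at the instruction trigger moment and spans the preset length."""
    return trigger_time - window, trigger_time

# Instruction triggered at 16:00 -> window covers 13:00-16:00.
start, end = current_time_window(datetime(2024, 1, 1, 16, 0))
print(start.time(), end.time())
```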
In one embodiment, the service node is an entity module or an abstract module of an operation and maintenance system that implements a single function, including, for example, a micro-service, a server, middleware, a business application, a business module, and the like.
Each service node comprises at least one corresponding service request in the service nodes responded by the operation and maintenance system, and the operation and maintenance system is used for executing the at least one service request so as to realize the service function corresponding to the service node.
Step S12: and acquiring operation and maintenance data correspondingly generated by the operation and maintenance system when the operation and maintenance system respectively executes service requests of all the service nodes.
Specifically, when the operation and maintenance system executes a service request of a service node, corresponding operation and maintenance data are generated; then, the server stores the operation and maintenance data in corresponding storage media respectively so as to extract the operation and maintenance data to be used from the corresponding storage media when the operation and maintenance system responds to the root cause positioning instruction.
In some embodiments, the operation and maintenance system performs various service requests of various service nodes, and the types of operation and maintenance data correspondingly generated are various. Including various types such as "code logic" types, "change" types, "third party application" types, "system infrastructure" types, "system architecture design" types, "system specific application" types, and the like.
Based on statistics compiled by operation and maintenance engineers over all operation and maintenance data generated by the operation and maintenance system across a long time span (e.g., 365 days), operation and maintenance data of the "change", "system infrastructure", and "system specific application" types account for more than 90% of all operation and maintenance data. The server therefore designates data of these three types as the core operation and maintenance data.
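Selecting the core operation and maintenance data then reduces to filtering records by type, as in the hypothetical sketch below; the record shape (dicts with a "type" key) is an assumption.

```python
# The three "core" types identified by the statistic above.
CORE_TYPES = {"change", "system infrastructure", "system specific application"}

def core_ops_data(records):
    """Keep only operation and maintenance records of the core types."""
    return [r for r in records if r["type"] in CORE_TYPES]

records = [
    {"id": 1, "type": "change"},
    {"id": 2, "type": "code logic"},
    {"id": 3, "type": "system infrastructure"},
]
print([r["id"] for r in core_ops_data(records)])  # [1, 3]
```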
Further, in some embodiments, the server obtains the operation and maintenance data generated by the operation and maintenance system corresponding to the service requests of the service nodes respectively, which may be the operation and maintenance data of "change" type, the "system infrastructure" type and the "system specific application" type.
Step S13: and based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained.
In one embodiment, root cause assessment is used to assess the degree of likelihood of system anomalies in the operational system caused by the service nodes.
In some embodiments, the server inputs the operation and maintenance data corresponding to each service node into a pre-trained root cause positioning model for root cause evaluation, which outputs an evaluation score for each service node; the server then takes the evaluation score as the evaluation result of that service node.
The root cause positioning model predicts, across multiple dimensions, the probability score that a service node caused the system abnormality of the operation and maintenance system, and outputs a total score aggregated over those dimensions.
In some embodiments, the root cause positioning model may be a neural network model (e.g., CNN, VGG, ResNet, etc.), a sequence model (e.g., Transformer, attention-based RNN, LSTM, etc.), or the like.
In some embodiments, the root cause positioning model may be a machine learning model. By learning the accumulated feedback values obtained after a large amount of training data take different input parameter data (i.e. values of the training data), the machine learning model obtains the optimal feedback value range (i.e. the optimal prediction score range of the model's learned parameters) for different action strategies (i.e. values of different types of operation and maintenance data) under each set of initial input parameter data.
Step S14: and determining the root cause node from the plurality of service nodes based on the evaluation result.
In one embodiment, the server determining the root cause node from the plurality of service nodes includes: taking, among the service nodes, the target service node with the highest corresponding total score as the root cause node.
The root cause node is an abnormal service node which causes the system abnormality of the operation and maintenance system to trigger the root cause positioning instruction.
As an example, the server first obtains a plurality of service nodes P1-P5 that the "customer service system" responded to within the last 3 hours. The server then determines the total root cause evaluation scores: 0.65 for service node P1, 0.60 for service node P2, 0.47 for service node P3, 0.55 for service node P4, and 0.59 for service node P5. Finally, the server selects service node P1, which has the highest total score among service nodes P1-P5, as the root cause node.
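The selection in this example amounts to taking the service node with the maximum total score; a minimal Python sketch (scores taken from the example, variable names assumed):

```python
# Total root cause evaluation scores of each service node from the example
scores = {"P1": 0.65, "P2": 0.60, "P3": 0.47, "P4": 0.55, "P5": 0.59}

# Step S14 reduces to an argmax over the evaluation results
root_cause_node = max(scores, key=scores.get)  # -> "P1"
```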
In an exemplary embodiment, as shown in fig. 4, fig. 4 is an interface schematic diagram of an embodiment of determining a root cause node from service nodes in the present application. The interface is a result display interface L1 for displaying a root cause positioning result, and includes an object type display area L2, a root cause node display area L3, an anomaly description display area L4, an anomaly time display area L5, a root cause score display area L6 and a root cause ranking display area L7. The object type display area L2 displays the monitored object, the "customer service system", in the server-monitored operation and maintenance system; the root cause node display area L3 displays the correspondingly determined root cause node, "service node P1"; the anomaly description display area L4 displays the description information of the current "customer service system" abnormality; the anomaly time display area L5 displays the time at which the server judged the "customer service system" to be abnormal; the root cause score display area L6 displays the total root cause evaluation score of 0.65 for "service node P1"; and the root cause ranking display area L7 displays the ranking of "service node P1" by total root cause evaluation score among all service nodes, which is first.
In the root cause positioning process, the server responds to the root cause positioning instruction triggered by the operation and maintenance system to determine a plurality of service nodes responded by the operation and maintenance system in the current time window; each service node comprises at least one service request, and the operation and maintenance system is used for executing the at least one service request so as to realize the service function corresponding to the service node; acquiring operation and maintenance data correspondingly generated when the operation and maintenance system respectively executes service requests of all service nodes; based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained; root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node; determining a root cause node from a plurality of service nodes based on the evaluation result; the root cause node is an abnormal service node which causes the system abnormality of the operation and maintenance system to trigger the root cause positioning instruction. 
On the one hand, in the scheme, when the operation and maintenance system triggers the root cause positioning instruction, firstly, a plurality of service nodes responded by the operation and maintenance system are determined, then operation and maintenance data corresponding to each service node are acquired, and finally root cause evaluation is carried out on each service node by utilizing the operation and maintenance data so as to position the root cause node in the plurality of service nodes, thereby optimizing the root cause positioning process, effectively improving the root cause positioning efficiency of a fault system and reducing the consumption of manpower and material resources; on the other hand, when the operation and maintenance system is abnormal, the plurality of service nodes affected by the abnormality are determined in the current time window, and then the operation and maintenance data generated when the operation and maintenance system executes the service request corresponding to each service node are utilized to evaluate the possibility degree of the system abnormality caused by the service nodes, so that the root cause node of the abnormality is determined, the root cause positioning mode of the service nodes is optimized, the rationality and accuracy of the root cause positioning are effectively improved, and the subsequent timely repair of the system abnormality is facilitated.
It will be appreciated by those skilled in the art that the methods disclosed in the above embodiments may be implemented in more specific manners. For example, the embodiment in which the server performs root cause evaluation on each service node based on the operation and maintenance data and obtains the evaluation result for each service node is merely illustrative.
Illustratively, the manner in which the server determines the plurality of service nodes responded to by the operation and maintenance system within the current time window, or the manner in which the server determines the root cause node from the plurality of service nodes, is merely one set division of steps; in practice another division may be adopted. For example, the plurality of service nodes in the current time window and the root cause nodes among them may be combined or integrated into another system, or some features may be omitted or not performed.
In an exemplary embodiment, referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of root cause evaluation for a service node according to the present application. That is, in step S13, the server performs root cause evaluation on each service node based on the operation and maintenance data, and the process of obtaining the evaluation result for each service node may be specifically implemented by executing the following modes:
Step S131: and determining the performance index and the function index of the operation and maintenance system when responding to each service node respectively based on the operation and maintenance data.
In some embodiments, the performance metrics include at least P99 time consumption, log error rate, request failure rate, and data failure rate of the operation and maintenance system in responding to the service node.
The performance index characterizes the system performance condition of the operation and maintenance system when responding to a service node, and each performance index of a service node has a corresponding index value.
Wherein P99 time consumption characterizes the number of milliseconds within which the operation and maintenance system completes 99% of the service requests it executes when responding to a service node.
As an example, when the operation and maintenance system responds to service node A1, 99% of the executed service requests complete within 250ms; the index value of the P99 time consumption for service node A1 is therefore 250ms.
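A sketch of how a P99 index value could be computed from raw request latencies; the nearest-rank percentile method is an assumption, as the source only defines the meaning of P99:

```python
import math

def p99_time_consumption(latencies_ms):
    """Latency (ms) within which 99% of requests complete
    (nearest-rank percentile; the exact method is an assumption)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank 99th percentile
    return ordered[rank - 1]

# 100 requests: 99 complete in 250 ms, one slow outlier at 900 ms
latencies = [250] * 99 + [900]
p99 = p99_time_consumption(latencies)  # -> 250
```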
The log error rate characterizes the proportion of log data containing the keyword "error" in the log file correspondingly generated when the operation and maintenance system responds to the service node.
As an example, when the operation and maintenance system responds to the service node A2, the operation and maintenance system correspondingly generates 100 pieces of log data and stores the 100 pieces of log data in the log file, and if 10 pieces of log data have a keyword "error", the index value of the log error rate corresponding to the service node A2 is 10%.
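The log error rate in this example can be sketched as a simple keyword count over the log file (the exact matching rule is an assumption):

```python
def log_error_rate(log_lines, keyword="error"):
    """Proportion of log entries containing the error keyword."""
    if not log_lines:
        return 0.0
    hits = sum(1 for line in log_lines if keyword in line.lower())
    return hits / len(log_lines)

# 100 log entries, 10 of which carry the keyword "error"
logs = ["error: request timeout"] * 10 + ["request handled ok"] * 90
rate = log_error_rate(logs)  # -> 0.1, i.e. 10%
```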
The request failure rate characterizes the proportion of failed service requests in all service requests correspondingly processed when the operation and maintenance system responds to the service nodes.
As an example, when the operation and maintenance system responds to service node A3, it processes 10 service requests, and 2 of those 10 service requests fail; the index value of the request failure rate corresponding to service node A3 is therefore 20%.
The data fault rate characterizes the proportion that the fault data generated when the operation and maintenance system responds to a service node occupies among all fault data generated within the last 3 hours.
As an example, in response to service node A4, the operation and maintenance system correspondingly generates 30 pieces of fault data of the "change" type, 20 pieces of the "infrastructure" type and 20 pieces of the "core application" type; and the operation and maintenance system generated 100 pieces of fault data in total in the last 3 hours, so the index values of the data fault rate corresponding to service node A4 among those 100 pieces are 30%, 20% and 20%, respectively.
In some embodiments, the function index includes at least the application population ratio, the abnormal application proportion, and the quantity of "change"-type operation and maintenance data of the operation and maintenance system when responding to the service node.
The function index characterizes the functional service condition of the applications used by the operation and maintenance system when responding to a service node, and each function index of a service node likewise has a corresponding index value.
The quantity of change-type operation and maintenance data characterizes the number of "change"-type records among all operation and maintenance data correspondingly generated when the operation and maintenance system responds to the service node.
As an example, when responding to service node A5, the operation and maintenance system correspondingly generates 100 pieces of "change"-type operation and maintenance data; and the operation and maintenance system generated 1000 pieces of operation and maintenance data in total within the last 3 hours, so the index value for the quantity of change data corresponding to service node A5 among those 1000 pieces is 10%.
The application population ratio characterizes, when the operation and maintenance system responds to a service node, the ratio of the number of users of the current application currently invoked by the operation and maintenance system to the number of users of the target application with the largest number of users within the last 3 hours.
The current application and the target application may be the same application program or different application programs.
The number of users of an application program measures the fault impact surface of that application; if the number of users is low, the number of users affected when the application fails is correspondingly small.
As an example, the operation and maintenance system invoked 20 applications in total within the last 3 hours, and the target application with the largest number of users has 1000 users; the current application invoked by the operation and maintenance system when responding to service node A6 has 200 users, so the index value of the application population ratio for service node A6 is 20%.
The abnormal application proportion characterizes the proportion of abnormal applications among all applications invoked by the operation and maintenance system based on the call chain when responding to the service node.
As an example, in response to service node A7, the operation and maintenance system invokes 20 applications in total based on the call chain of service node A7, 10 of which are abnormal, so the index value of the abnormal application proportion for service node A7 is 50%.
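The three function index values in the examples above are all simple ratios; a sketch with hypothetical helper names, using the figures from service nodes A5-A7:

```python
def change_data_ratio(change_count, total_count):
    """Share of "change"-type data among all operation and maintenance data (node A5)."""
    return change_count / total_count

def application_population_ratio(current_users, max_users):
    """Users of the current application vs. the most-used target application (node A6)."""
    return current_users / max_users

def abnormal_application_ratio(abnormal_count, total_apps):
    """Abnormal applications among all applications on the call chain (node A7)."""
    return abnormal_count / total_apps
```

With the figures above: 100/1000 = 10% for A5, 200/1000 = 20% for A6 and 10/20 = 50% for A7.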
In an embodiment, after the server determines the performance indexes of the operation and maintenance system when responding to each service node, it further needs to determine whether each performance index is in a normal or abnormal state, so as to evaluate the service nodes according to that state.
In an exemplary embodiment, referring to fig. 6, fig. 6 is a flowchart illustrating an embodiment of determining a real-time status of a performance index according to the present application. After step S131, i.e. after determining the performance index of the operation and maintenance system when responding to each service node, respectively, the server may further perform the following manner:
Step a1: and acquiring a historical index value of the historical performance index of the historical service node responded by the operation and maintenance system in the historical time window.
The historical time window includes the current time window. For example, the current time window is a window ending at the time point when the root cause positioning instruction is triggered, with a length of 3 hours, while the historical time window ends at the same time point with a length of 72 hours. Thus, the historical service nodes responded to by the operation and maintenance system within the historical time window include all service nodes responded to within the current time window, and the number of historical service nodes is not less than the number of service nodes currently responded to.
Step a2: and inputting the historical index value of the historical performance index into a pre-trained index prediction model to perform safety performance prediction, so as to obtain a predicted safety performance interval.
In an embodiment, the safety performance interval predicted by the index prediction model is a safety interval for the performance index value, and the safety performance interval has an effective duration with a preset size.
As an example, a predicted safety performance interval is denoted as (x1, x2; 60min), where x1 is the lower boundary of the safety performance interval, x2 is the upper boundary, and 60min is its effective duration; that is, the interval serves as a valid reference within 60 minutes after the index prediction model outputs it.
Step a3: and determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node in the current time window and the safety performance interval.
In an embodiment, the server determines the real-time status of the performance index of each service node, including the following two cases:
case one: and in the preset time length, if the index values of the performance indexes of the service nodes are all in the safety performance interval, determining that the real-time state of the performance indexes of the service nodes in the preset time length is a normal state.
And a second case: and in the preset time length, if the index values of the performance indexes of the service nodes are not in the safety performance interval, determining that the real-time state of the performance indexes of the service nodes in the preset time length is an abnormal state.
Wherein the preset time length is smaller than the effective duration of the safety performance interval.
For case one, in one example, the safety performance interval is (x1, x2; 60min), and the preset time length is 5min. If the index value of the performance index of a certain service node stays within the interval (x1, x2) for a continuous 5min, the real-time state of that performance index within those 5min is the normal state.
For case two, in one example, the safety performance interval is (x3, x4; 120min), and the preset time length is 10min. If the index value of the performance index of a certain service node stays outside the interval (x3, x4) for a continuous 10min, the real-time state of that performance index within those 10min is the abnormal state.
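The two cases above can be sketched as follows; treating the interval boundaries as exclusive and the mixed case (some samples inside, some outside) as undecided are assumptions, since the source does not specify them:

```python
def real_time_state(samples, lower, upper):
    """Classify a performance index over samples collected during the preset
    time length (which must be shorter than the interval's effective
    duration).  Returns "normal" if every sample lies in (lower, upper),
    "abnormal" if every sample lies outside, and None for the mixed case
    the source leaves unspecified."""
    inside = [lower < v < upper for v in samples]
    if all(inside):
        return "normal"
    if not any(inside):
        return "abnormal"
    return None

state = real_time_state([1.2, 1.4, 1.3], lower=1.0, upper=2.0)  # -> "normal"
```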
Step S132: and performing first scoring on the performance indexes of each service node based on a preset score factor to obtain a first evaluation score aiming at the performance indexes.
In some embodiments, the server performing a first scoring on the performance indexes of each service node to obtain a first evaluation score for the performance indexes includes: taking the score factor corresponding to the real-time state of the performance index as the first evaluation score for that index.
Wherein different score factors are preset for performance indexes in different real-time states.
As an example, based on the experience of operation and maintenance experts, the server configures the preset score factor of P99 time consumption as 0 for the normal state and 0.25 for the abnormal state; the corresponding first score is the value of the score factor for the actual real-time state. That is, if the P99 time consumption of a certain service node is in the normal state, the first evaluation score for P99 time consumption is 0; if it is in the abnormal state, the first evaluation score is 0.25.
As another example, the server configures the preset score factor of the log error rate as 0 for the normal state and 0.1 for the abnormal state. That is, if the log error rate of a certain service node is in the normal state, the first evaluation score for the log error rate is 0; if it is in the abnormal state, the first evaluation score is 0.1.
As another example, the server configures the preset score factors of the request failure rate and the data failure rate as 0 for the normal state and 0.15 for the abnormal state. That is, if the request failure rate or the data failure rate of a certain service node is in the normal state, the corresponding first evaluation score is 0; if it is in the abnormal state, the corresponding first evaluation score is 0.15.
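The first scoring of step S132 then reduces to a lookup of the preset score factors by index name and real-time state; a sketch using the factors from the three examples above (the table layout and key names are assumptions):

```python
# Preset score factors per performance index and real-time state
SCORE_FACTORS = {
    "p99_time_consumption": {"normal": 0.0, "abnormal": 0.25},
    "log_error_rate":       {"normal": 0.0, "abnormal": 0.10},
    "request_failure_rate": {"normal": 0.0, "abnormal": 0.15},
    "data_failure_rate":    {"normal": 0.0, "abnormal": 0.15},
}

def first_scores(states):
    """First evaluation score = score factor of the index's real-time state."""
    return {name: SCORE_FACTORS[name][state] for name, state in states.items()}

scores = first_scores({
    "p99_time_consumption": "abnormal",
    "log_error_rate":       "normal",
    "request_failure_rate": "abnormal",
    "data_failure_rate":    "normal",
})
```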
Step S133: and performing second scoring on the function indexes of the service nodes based on preset weight factors to obtain second evaluation scores aiming at the function indexes.
In some embodiments, the server performing a second scoring on the function indexes of each service node to obtain a second score for the function indexes includes: taking the product of the index value of the function index and the corresponding weight factor as the second evaluation score for that function index.
Wherein, different weight factors are correspondingly preset for different function indexes.
As an example, based on the experience of operation and maintenance experts, the server presets a weight factor of 0.25 for the quantity of "change"-type operation and maintenance data, and the corresponding second score is the product of the quantity of change data and the weight factor; that is, if a certain service node includes X pieces of "change"-type operation and maintenance data, the second evaluation score for that quantity is X×0.25.
As another example, the server presets a weight factor of 0.1 for the application population ratio, and takes the product of that ratio and the weight factor as the corresponding second score; that is, if the application population ratio of a certain service node is S1, the second evaluation score for it is S1×0.1.
As another example, the server presets a weight factor of 0.15 for the abnormal application proportion, and the corresponding second score is the product of that proportion and the weight factor; that is, if the abnormal application proportion of a certain service node is S2, the second evaluation score for it is S2×0.15.
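The second scoring of step S133 is a weighted product; a sketch with the weight factors from the three examples above (key names and the sample index values are assumptions):

```python
# Preset weight factors per function index
WEIGHT_FACTORS = {
    "change_data_quantity":         0.25,  # X pieces of "change"-type data
    "application_population_ratio": 0.10,  # S1
    "abnormal_application_ratio":   0.15,  # S2
}

def second_scores(index_values):
    """Second evaluation score = index value x preset weight factor."""
    return {k: v * WEIGHT_FACTORS[k] for k, v in index_values.items()}

scores = second_scores({
    "change_data_quantity":         4,    # X = 4 (sample value)
    "application_population_ratio": 0.2,  # S1 = 0.2 (sample value)
    "abnormal_application_ratio":   0.5,  # S2 = 0.5 (sample value)
})
```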
Step S134: based on the first evaluation score and the second evaluation score, an evaluation result for each service node is determined.
In some embodiments, the server determining the evaluation results for each service node includes: determining, for each service node, the total of the first evaluation scores of its performance indexes and the second evaluation scores of its function indexes, and taking that total score as the evaluation result of the service node.
As an example, for a certain service node P, its first evaluation scores for the performance indexes are 0.25 for P99 time consumption, 0.1 for the log error rate, 0.15 for the request failure rate and 0.15 for the data failure rate, and its second evaluation scores for the function indexes are X×0.25 for the quantity of "change"-type operation and maintenance data, S1×0.1 for the application population ratio and S2×0.15 for the abnormal application proportion. The total score of service node P is therefore 0.25+0.1+0.15+0.15+X×0.25+S1×0.1+S2×0.15, and this total score is the evaluation result of service node P. Wherein X, S1 and S2 are constants.
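The total score of step S134 is simply the sum of both score sets; a sketch using the example's factors with assumed sample constants X = 2, S1 = 0.2 and S2 = 0.4:

```python
def total_score(first, second):
    """Step S134: evaluation result = sum of first and second evaluation scores."""
    return sum(first.values()) + sum(second.values())

# First evaluation scores from the example for service node P
first = {"p99": 0.25, "log_error_rate": 0.10,
         "request_failure_rate": 0.15, "data_failure_rate": 0.15}
# Second evaluation scores with sample constants X = 2, S1 = 0.2, S2 = 0.4
second = {"change_quantity": 2 * 0.25,      # X x 0.25
          "population_ratio": 0.2 * 0.10,   # S1 x 0.1
          "abnormal_ratio": 0.4 * 0.15}     # S2 x 0.15
score = total_score(first, second)  # 0.65 + 0.58 = 1.23
```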
In order to clarify the root cause positioning method provided by the embodiments of the present disclosure more fully, the method is described below through a specific embodiment. In an exemplary embodiment, referring to fig. 7 and 8, fig. 7 is a flowchart of a root cause positioning method according to an exemplary embodiment, and fig. 8 is a block diagram of a root cause positioning method according to an exemplary embodiment; the root cause positioning method is used in a server and specifically includes the following:
Step S21: in response to the trigger condition for root cause positioning being met, the system automatically triggers the root cause positioning execution program.
The triggering conditions of root cause positioning comprise the following two types:
(1) the number of failure reports from users reaches a threshold within the last 10 minutes and the operation and maintenance system raises an alarm, thereby triggering root cause positioning; (2) the user inputs a time point in the root cause positioning operation interface and clicks confirm, instructing the operation and maintenance system to trigger root cause positioning.
The root cause positioning execution program can be executed through a pre-trained root cause positioning model and is used for locating the root cause node of a failed operation and maintenance system. The root cause positioning execution program includes the following steps S22 to S28.
Step S22: the network node to which the system responded within the last 3 hours is determined.
The network node is an entity module or an abstract module for realizing a single function in the operation and maintenance system, such as a micro-service, a server, middleware, a business application, a business module and the like.
Wherein a plurality of service requests performed by the system are included in each network node.
Step S23: from the data warehouse, the primary type of operation and maintenance data generated by the system in response to each network node respectively is collected.
Wherein the main types of operation and maintenance data include three types: the "change" type, the "infrastructure" type and the "specific application" type.
Wherein collecting the operation and maintenance data of the 'change' type comprises: the method comprises the steps of collecting release data in a release platform, audit data recorded in a springboard machine and related to editing codes, script data executed in a work platform, change data for a database (data table), change data of various operation events and release data in an important change notification group.
For the release data in the key change notification group, an engineer first edits the release content in an editing page; the release content is then uploaded to the key change notification group for publication; finally, the system stores the release content in a preset event center platform.
The key change notification group is a virtual group chat room in social applications such as DingTalk, WeChat and QQ; the event center platform is a database for storing data.
Wherein collecting "infrastructure"-type operation and maintenance data includes: collecting storage data, network private line data, LB data, NAT network data and DB data (including data in Kafka, Etcd, MongoDB, and the like) in DNS, cloud data of third-party cloud vendors, data in k8s, and related data.
Wherein collecting the "core application" type of operation and maintenance data includes: program data and call chain data generated by a plurality of preset application programs are collected.
Step S24: determine, according to the operation and maintenance data, the health indexes, the quantity of change data, the Qps proportion and the abnormal application proportion of the system when responding to each network node.
The health indexes of the network nodes comprise P99 time consumption, log error rate, failure rate of processing service requests and historical failure probability.
Wherein P99 time consumption characterizes the number of milliseconds within which the system completes 99% of the service requests it executes when responding to a network node.
As an example, in response to network node A1, the time consumed by the system for 99% of the service requests performed is within 250ms, and then the time consumed by P99 corresponding to network node A1 is 250ms.
The log failure rate characterizes, when the system responds to a network node, the proportion of log data containing the keyword "error" in the correspondingly generated log file.
As an example, when the system responds to the network node A2, the system correspondingly generates 100 pieces of log data and stores the 100 pieces of log data in the log file, and if the keyword "error" exists in 30 pieces of log data in the 100 pieces of log data, the log failure rate corresponding to the network node A2 is 30%.
Wherein, the failure rate of processing the service request characterizes the proportion of processing the failed service request in all the service requests correspondingly processed when the system responds to the network node.
As an example, when the system responds to the network node A3, the system processes 10 service requests correspondingly, and 2 service requests among the 10 service requests fail to be processed, and the failure rate of processing the service requests corresponding to the network node A3 is 20%.
The historical fault probability characterizes, for the fault data generated when the system responds to a network node, the proportion that the "change"-type fault data occupies among all fault data in the last 3 hours; and/or the proportion that the "infrastructure"-type fault data occupies among all fault data in the last 3 hours; and/or the proportion that the "core application"-type fault data occupies among all fault data in the last 3 hours.
As an example, in response to network node A4, the system correspondingly generates 30 pieces of "change" type fault data, 20 pieces of "infrastructure" type fault data, and 20 pieces of "core application" type fault data; and the system generates 100 pieces of fault data in total within the last 3 hours, so that the historical fault probabilities corresponding to the network node A4 in the 100 pieces of fault data are 30%, 20% and 20%, respectively.
The quantity of change data characterizes the number of "change"-type records among all operation and maintenance data correspondingly generated when the system responds to a network node.
As an example, in response to network node A5, the system correspondingly generates 100 pieces of "change"-type operation and maintenance data; and the system generated 1000 pieces of operation and maintenance data in total within the last 3 hours, so the proportion of change data corresponding to network node A5 among those 1000 pieces is 10%.
Wherein the Qps proportion characterizes, when the system responds to a network node, the ratio of the number of users of the application currently invoked to the number of users of the target application with the largest number of users within the last 3 hours.
The current application and the target application may be the same application program or different application programs.
The number of users of an application is used to measure the fault impact surface of that application: if an application has few users, few users are affected when it fails.
As an example, the system calls 20 applications in total within the last 3 hours, and the target application with the largest number of users has 1000 users; the application invoked by the system when responding to the network node A6 has 200 users, so the Qps proportion of the network node A6 is 20%.
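The Qps-proportion arithmetic can be sketched as follows (an illustrative sketch only; the function name and the dictionary of per-application user counts are assumptions):

```python
def qps_proportion(current_app_users, users_by_app):
    """Ratio of the current application's user count to that of the
    busiest application called within the last 3 hours."""
    return current_app_users / max(users_by_app.values())

# Network node A6 from the example: 20 applications were called in the
# window; the busiest has 1000 users and the currently invoked one has 200.
users_by_app = {f"app-{i}": n for i, n in enumerate([1000, 200] + [50] * 18)}
ratio = qps_proportion(200, users_by_app)  # 0.2, i.e. 20%
```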
The abnormal application proportion characterizes the proportion of abnormal applications among all applications called by the system, based on the call chain, when responding to a network node.
As an example, when responding to the network node A7, the system calls 20 applications in total based on the call chain of the network node A7, of which 10 are abnormal, so the abnormal application proportion of the network node A7 is 50%.
Step S25: and judging the state of the health index corresponding to each network node, and determining the real-time state of each health index.
The real-time state of the health index corresponding to the network node comprises a normal state or an abnormal state.
The state judging process comprises the following steps: (1) extracting the historical network nodes within the 5 days preceding the current timestamp from the data warehouse Prometheus; (2) acquiring the timestamps and health index values of all historical network nodes; (3) feeding the timestamp and health index value of each historical network node into a pre-trained Prophet time-series model for interval prediction, so as to output a predicted safety value interval; wherein the safety value interval represents the interval of safe values within which the health index value of the network node should fall during the next hour; (4) comparing, in real time, the health index value corresponding to each network node within the last 3 hours against the safety value interval; (5) if the health index values corresponding to a network node all fall outside the safety value interval for 5 consecutive minutes, determining that the real-time state of the corresponding health index is an abnormal state; if the health index values corresponding to a network node all fall within the safety value interval for 5 consecutive minutes, determining that the real-time state of the corresponding health index is a normal state.
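The comparison logic of steps (4)-(5) can be sketched as below. The sketch assumes the safety value interval has already been produced by the pre-trained Prophet time-series model; only the state judgment itself is shown, and the function name and return values are illustrative assumptions.

```python
def real_time_state(values, interval, window=5):
    """Judge a health index from its most recent per-minute values.

    values: index samples for the last few minutes, newest last.
    interval: (low, high) safety value interval, e.g. as predicted by a
    Prophet model fitted on the previous 5 days of samples.
    Returns "abnormal" if the last `window` consecutive samples all fall
    outside the interval, "normal" if they all fall inside it, and
    "undecided" while the index oscillates around the boundary.
    """
    low, high = interval
    recent = values[-window:]
    if len(recent) < window:
        return "undecided"
    if all(v < low or v > high for v in recent):
        return "abnormal"
    if all(low <= v <= high for v in recent):
        return "normal"
    return "undecided"

# P99 time consumption (ms) drifting above a predicted safe interval:
state = real_time_state([95, 110, 130, 135, 140, 150, 155], (80, 120))
# last 5 samples are all above 120 -> "abnormal"
```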
Step S26: determining a first score of the health index for each network node based on a preset score factor and a real-time state of the health index; and determining respective second scores of the quantity of change data, the Qps proportion and the abnormal application proportion for each network node based on preset weight factors.
As an example, based on the experience of operation and maintenance experts, the preset score factor for P99 time consumption is 0 in the normal state and 0.25 in the abnormal state; the preset score factor for the log error rate is 0 in the normal state and 0.1 in the abnormal state; and the preset score factors for the failure rate of processing service requests and for the historical fault probability are both 0 in the normal state and 0.15 in the abnormal state. In each case, the corresponding first score is the score factor of the index's current real-time state.
As another example, based on the experience of operation and maintenance experts, the weight factor preset for "change"-type operation and maintenance data is 0.25, and the corresponding second score is the product of the quantity of change data and this weight factor; the weight factor preset for the Qps proportion is 0.1, and the corresponding second score is the product of the Qps proportion and this weight factor; and the weight factor preset for the abnormal application proportion is 0.15, and the corresponding second score is the product of the abnormal application proportion and this weight factor.
Step S27: a total score for each network node is determined as the sum of that network node's first scores and second scores.
Step S28: among the network nodes, the target network node with the highest corresponding total score is used as the root cause node of the system fault.
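Steps S26-S28 can be combined into one small scoring sketch. The score and weight factors mirror the illustrative expert-chosen values above; the quantity of change data is used here in its normalized form (the 10% of the A5 example), and all names and the second node are assumptions made for the demonstration, not part of the claimed embodiment.

```python
SCORE_FACTORS = {  # first scores: fixed factor selected by real-time state
    "p99_time_consumption":  {"normal": 0.0, "abnormal": 0.25},
    "log_error_rate":        {"normal": 0.0, "abnormal": 0.10},
    "request_failure_rate":  {"normal": 0.0, "abnormal": 0.15},
    "historical_fault_prob": {"normal": 0.0, "abnormal": 0.15},
}
WEIGHT_FACTORS = {  # second scores: index value x preset weight
    "change_data": 0.25,
    "qps_proportion": 0.10,
    "abnormal_app_proportion": 0.15,
}

def total_score(states, values):
    """Step S27: sum of all first scores and all second scores."""
    first = sum(SCORE_FACTORS[k][s] for k, s in states.items())
    second = sum(v * WEIGHT_FACTORS[k] for k, v in values.items())
    return first + second

def root_cause_node(nodes):
    """Step S28: the node with the highest total score is the root cause."""
    return max(nodes, key=lambda name: total_score(*nodes[name]))

# Two hypothetical nodes; A3 has abnormal P99 and request-failure states.
nodes = {
    "A3": ({"p99_time_consumption": "abnormal", "log_error_rate": "normal",
            "request_failure_rate": "abnormal", "historical_fault_prob": "normal"},
           {"change_data": 0.10, "qps_proportion": 0.20,
            "abnormal_app_proportion": 0.50}),
    "A6": ({"p99_time_consumption": "normal", "log_error_rate": "normal",
            "request_failure_rate": "normal", "historical_fault_prob": "normal"},
           {"change_data": 0.0, "qps_proportion": 0.20,
            "abnormal_app_proportion": 0.0}),
}
root = root_cause_node(nodes)  # "A3" (total 0.52 vs 0.02)
```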
After the server determines, through the root cause positioning executable, the root cause node causing the fault of the operation and maintenance system, an operation and maintenance engineer may inspect that root cause node again to verify whether it is accurate, so as to mark the validity of each root cause node.
In this scheme, on the one hand, when the operation and maintenance system triggers the root cause positioning instruction, the plurality of service nodes responded to by the operation and maintenance system are first determined, the operation and maintenance data corresponding to each service node are then acquired, and finally root cause evaluation is performed on each service node using the operation and maintenance data so as to locate the root cause node among the plurality of service nodes; this streamlines the root cause positioning process, effectively improves the root cause positioning efficiency for a faulty system, and reduces the consumption of manpower and material resources. On the other hand, when the operation and maintenance system becomes abnormal, the plurality of service nodes affected by the abnormality are determined within the current time window, and the operation and maintenance data generated when the operation and maintenance system executes the service requests corresponding to each service node are then used to evaluate how likely each service node is to have caused the system abnormality, so as to determine the root cause node of the abnormality; this improves the way service nodes are root-cause positioned, effectively increases the rationality and accuracy of root cause positioning, and facilitates timely repair of the system abnormality.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include a plurality of sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; the execution order of these sub-steps or stages is not necessarily sequential, and they may be executed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, an embodiment of the application also provides a root cause positioning device for implementing the root cause positioning method described above. The implementation of the solution provided by this device is similar to the implementation described for the method above, so for the specific limitations in one or more embodiments of the device provided below, reference may be made to the limitations of the root cause positioning method above, which are not repeated here.
In one embodiment, as shown in FIG. 9, a root cause positioning device 10 is provided, comprising: an instruction triggering unit 11, a data acquisition unit 12, a node evaluation unit 13, and a node screening unit 14, wherein:
an instruction triggering unit 11 configured to, in response to a root cause positioning instruction triggered by the operation and maintenance system, determine a plurality of service nodes responded to by the operation and maintenance system within the current time window; wherein each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
a data acquisition unit 12 configured to perform acquisition of operation and maintenance data correspondingly generated by the operation and maintenance system when the operation and maintenance system performs service requests of the service nodes respectively;
a node evaluation unit 13 configured to perform root cause evaluation on each of the service nodes based on the operation and maintenance data, respectively, to obtain an evaluation result for each of the service nodes; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
a node screening unit 14 configured to determine a root cause node from the plurality of service nodes based on the evaluation result;
The root cause node is the abnormal service node that caused the system abnormality of the operation and maintenance system which triggered the root cause positioning instruction.
In some embodiments, the performing root cause evaluation on each service node based on the operation and maintenance data to obtain an evaluation result for each service node includes:
determining performance indexes and function indexes of the operation and maintenance system when responding to each service node respectively based on the operation and maintenance data; the performance index is used for representing the system performance condition of the operation and maintenance system when responding to the service node, and the functional index is used for representing the functional service condition applied by the operation and maintenance system when responding to the service node;
performing first scoring on performance indexes of all the service nodes based on a preset score factor to obtain a first evaluation score aiming at the performance indexes; and
performing second scoring on the functional indexes of the service nodes based on preset weight factors to obtain second evaluation scores aiming at the functional indexes;
and determining an evaluation result for each service node based on the first evaluation score and the second evaluation score.
In some embodiments, the performance indicator characterizes a corresponding indicator value;
after determining the performance indexes of the operation and maintenance system when responding to each service node respectively, the method further comprises:
acquiring a historical index value of a historical performance index of a historical service node responded by the operation and maintenance system in a historical time window; the historical time window includes the current time window;
inputting the historical index value of the historical performance index into a pre-trained index prediction model to predict the safety performance, so as to obtain a predicted safety performance interval;
and determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node in the current time window and the safety performance interval.
In some embodiments, the determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node and the security performance interval in the current time window includes the following two steps:
if the index values of the performance indexes of the service nodes are all in the safety performance interval within the preset time length, determining that the real-time state of the performance indexes of the service nodes within the preset time length is a normal state;
And if, within the preset time length, the index values of the performance indexes of the service nodes all fall outside the safety performance interval, determining that the real-time state of the performance indexes of the service nodes within the preset time length is an abnormal state.
In some embodiments, the first scoring the performance index of each service node based on a preset score factor, to obtain a first evaluation score for the performance index, including:
taking a score factor of a performance index corresponding to the real-time state as a first evaluation score aiming at the performance index;
wherein different score factors are correspondingly preset for the performance indexes of different real-time states.
In some embodiments, the functional indicator is characterized by a corresponding indicator value;
performing a second scoring on the functional index of each service node based on a preset weight factor to obtain a second evaluation score aiming at the functional index, including:
taking the product value between the index value of the functional index and the corresponding weight factor as a second evaluation score aiming at the functional index;
wherein different weight factors are correspondingly preset for different functional indexes.
In some embodiments, the performance metrics include at least P99 time consumption, log error rate, request failure rate, and data failure rate of the operation and maintenance system in responding to a service node; the function index at least comprises the application number proportion, the abnormal application proportion and the quantity of operation and maintenance data corresponding to the change type when the operation and maintenance system responds to the service node;
the determining an evaluation result for each service node based on the first evaluation score and the second evaluation score includes:
and determining, for each service node, a total score as the sum of the first evaluation scores of the performance indexes and the second evaluation scores of the functional indexes, and taking the total score as the evaluation result of the service node.
In some embodiments, the determining a root cause node from the plurality of service nodes based on the evaluation result includes:
and taking the target service node with the highest total score as a root cause node in each service node.
In some embodiments, the responding to the root cause positioning instruction triggered by the operation and maintenance system comprises:
acquiring a fault alarm initiated by a terminal user;
And in response to the number of the fault alarms being greater than a preset number in the current sliding time window, automatically triggering a root cause positioning instruction.
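The alarm-count trigger can be sketched with a deque-based sliding window. The 5-minute window and the threshold of 10 are illustrative assumptions; the text only requires that the alarm count exceed a preset number within the current sliding time window.

```python
from collections import deque

class AlarmWindow:
    """Auto-trigger root cause positioning when end-user fault alarms
    inside a sliding time window exceed a preset number."""

    def __init__(self, window_seconds=300, threshold=10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._alarms = deque()  # alarm timestamps, oldest first

    def report(self, timestamp):
        """Record one fault alarm; return True if positioning should trigger."""
        self._alarms.append(timestamp)
        # Drop alarms that have slid out of the current window.
        while self._alarms and self._alarms[0] <= timestamp - self.window_seconds:
            self._alarms.popleft()
        return len(self._alarms) > self.threshold

# Eleven alarms within a few seconds: only the eleventh exceeds the threshold.
w = AlarmWindow(window_seconds=300, threshold=10)
triggered = [w.report(t) for t in range(11)]  # only the last entry is True
```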
In one embodiment, a computer device is provided, which, depending on the program logic to be applied, may be an electronic device or a server; its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus.
Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies.
Wherein the computer program when executed implements a root cause positioning method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the architecture shown in fig. 10 is merely a block diagram of part of the architecture relevant to the present inventive arrangements and does not limit the computer devices (including servers and electronic devices) to which the present inventive arrangements are applicable; a particular computer device may include more or fewer components than those shown, combine some of the components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed performs the steps of:
responding to a root cause positioning instruction triggered by an operation and maintenance system, and determining a plurality of service nodes responded by the operation and maintenance system in a current time window; wherein, each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
acquiring operation and maintenance data correspondingly generated when the operation and maintenance system respectively executes service requests of the service nodes;
based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
Determining a root cause node from the plurality of service nodes based on the evaluation result;
the root cause node is the abnormal service node that caused the system abnormality of the operation and maintenance system which triggered the root cause positioning instruction.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
responding to a root cause positioning instruction triggered by an operation and maintenance system, and determining a plurality of service nodes responded by the operation and maintenance system in a current time window; wherein, each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
acquiring operation and maintenance data correspondingly generated when the operation and maintenance system respectively executes service requests of the service nodes;
based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
Determining a root cause node from the plurality of service nodes based on the evaluation result;
the root cause node is the abnormal service node that caused the system abnormality of the operation and maintenance system which triggered the root cause positioning instruction.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a root cause positioning method, a root cause positioning device, a server, a computer apparatus, a computer readable storage medium, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of root cause positioning methods, root cause positioning apparatus, servers, computer devices, computer readable storage media, or computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the program instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
The foregoing examples illustrate only a few embodiments of the application and are described in relative detail, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (12)

1. A root cause positioning method, the method comprising:
responding to a root cause positioning instruction triggered by an operation and maintenance system, and determining a plurality of service nodes responded by the operation and maintenance system in a current time window; wherein, each service node includes at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
acquiring operation and maintenance data correspondingly generated when the operation and maintenance system respectively executes service requests of the service nodes;
based on the operation and maintenance data, root cause evaluation is respectively carried out on each service node, and an evaluation result aiming at each service node is obtained; the root cause evaluation is used for evaluating the possibility degree of system abnormality of the operation and maintenance system caused by the service node;
determining a root cause node from the plurality of service nodes based on the evaluation result;
the root cause node is the abnormal service node that caused the system abnormality of the operation and maintenance system which triggered the root cause positioning instruction.
2. The method according to claim 1, wherein the performing root cause evaluation on each service node based on the operation and maintenance data to obtain an evaluation result for each service node includes:
Determining performance indexes and function indexes of the operation and maintenance system when responding to each service node respectively based on the operation and maintenance data; the performance index is used for representing the system performance condition of the operation and maintenance system when responding to the service node, and the functional index is used for representing the functional service condition applied by the operation and maintenance system when responding to the service node;
performing first scoring on performance indexes of all the service nodes based on a preset score factor to obtain a first evaluation score aiming at the performance indexes; and
performing second scoring on the functional indexes of the service nodes based on preset weight factors to obtain second evaluation scores aiming at the functional indexes;
and determining an evaluation result for each service node based on the first evaluation score and the second evaluation score.
3. The method of claim 2, wherein the performance indicator is characterized by a corresponding indicator value;
after determining the performance indexes of the operation and maintenance system when responding to each service node respectively, the method further comprises:
acquiring a historical index value of a historical performance index of a historical service node responded by the operation and maintenance system in a historical time window; the historical time window includes the current time window;
Inputting the historical index value of the historical performance index into a pre-trained index prediction model to predict the safety performance, so as to obtain a predicted safety performance interval;
and determining the real-time state of the performance index of each service node based on the magnitude relation between the index value of the performance index of each service node in the current time window and the safety performance interval.
4. A method according to claim 3, wherein said determining the real-time status of the performance index of each of the service nodes based on the magnitude relation between the index value of the performance index of each of the service nodes and the security performance interval within the current time window comprises:
if the index values of the performance indexes of the service nodes are all in the safety performance interval within the preset time length, determining that the real-time state of the performance indexes of the service nodes within the preset time length is a normal state;
and if, within the preset time length, the index values of the performance indexes of the service nodes all fall outside the safety performance interval, determining that the real-time state of the performance indexes of the service nodes within the preset time length is an abnormal state.
5. A method according to claim 3, wherein said first scoring of the performance index of each of the service nodes based on a preset score factor to obtain a first evaluation score for the performance index comprises:
taking a score factor of a performance index corresponding to the real-time state as a first evaluation score aiming at the performance index;
wherein different score factors are correspondingly preset for the performance indexes of different real-time states.
6. The method of claim 2, wherein the functional indicator is characterized by a corresponding indicator value;
performing a second scoring on the functional index of each service node based on a preset weight factor to obtain a second evaluation score aiming at the functional index, including:
taking the product value between the index value of the functional index and the corresponding weight factor as a second evaluation score aiming at the functional index;
wherein different weight factors are correspondingly preset for different functional indexes.
7. The method of claim 2, wherein the performance indexes include at least the P99 time consumption, log error rate, request failure rate, and data failure rate of the operation and maintenance system when responding to a service node; and the functional indexes include at least the application number proportion, the abnormal application proportion, and the quantity of change-type operation and maintenance data when the operation and maintenance system responds to the service node;
wherein determining the evaluation result for each service node based on the first evaluation score and the second evaluation score comprises:
for each service node, determining the total score of the first evaluation scores of its performance indexes and the second evaluation scores of its functional indexes, and taking the total score as the evaluation result of the service node.
8. The method of claim 7, wherein determining the root cause node from the plurality of service nodes based on the evaluation result comprises:
taking, among the service nodes, the target service node with the highest total score as the root cause node.
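The aggregation and selection of claims 7 and 8 can be sketched as below; node names and scores are illustrative, not from the disclosure:

```python
def root_cause_node(node_scores):
    """node_scores maps each service node to a pair
    ([first evaluation scores], [second evaluation scores]).
    Claim 7: sum both lists into a total score per node.
    Claim 8: the node with the highest total is the root cause node."""
    totals = {
        node: sum(firsts) + sum(seconds)
        for node, (firsts, seconds) in node_scores.items()
    }
    return max(totals, key=totals.get)

# Hypothetical example: "payment" accumulates the higher total.
scores = {
    "login": ([0.0, 1.0], [0.2]),
    "payment": ([1.0, 1.0], [0.6, 0.3]),
}
```

With these numbers, `root_cause_node(scores)` selects `"payment"` as the node most likely responsible for the system abnormality.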
9. The method of claim 1, wherein responding to the root cause positioning instruction triggered by the operation and maintenance system comprises:
acquiring fault alarms initiated by terminal users;
and automatically triggering the root cause positioning instruction in response to the number of fault alarms within the current sliding time window being greater than a preset number.
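The sliding-window trigger of claim 9 might be sketched as follows; the window length and threshold are illustrative assumptions, as the disclosure only requires "a preset number" within "the current sliding time window":

```python
from collections import deque

class AlarmWindow:
    """Counts end-user fault alarms in a sliding time window and signals
    when the count exceeds a preset number (claim 9)."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def record_alarm(self, now):
        """Register one fault alarm at time `now`; returns True when the
        root cause positioning instruction should be auto-triggered."""
        self.timestamps.append(now)
        # Drop alarms that have slid out of the current window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

Gating on an alarm count rather than a single alarm avoids launching a full root cause evaluation on isolated, transient user reports.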
10. A root cause positioning device, comprising:
an instruction triggering unit configured to respond to a root cause positioning instruction triggered by an operation and maintenance system and determine a plurality of service nodes responded to by the operation and maintenance system in a current time window; wherein each service node comprises at least one service request, and the operation and maintenance system is used for executing the at least one service request to realize the service function corresponding to the service node;
a data acquisition unit configured to acquire the operation and maintenance data correspondingly generated when the operation and maintenance system executes the service requests of the service nodes;
a node evaluation unit configured to perform root cause evaluation on each service node based on the operation and maintenance data to obtain an evaluation result for each service node; wherein the root cause evaluation is used for evaluating the degree of possibility that the service node causes a system abnormality of the operation and maintenance system;
a node screening unit configured to determine a root cause node from the plurality of service nodes based on the evaluation results;
wherein the root cause node is the abnormal service node that causes the system abnormality of the operation and maintenance system which triggers the root cause positioning instruction.
11. A computer device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the root cause positioning method of any one of claims 1 to 9.
12. A computer-readable storage medium comprising program data, wherein the program data, when executed by a processor of a computer device, enable the computer device to perform the root cause positioning method of any one of claims 1 to 9.
CN202311213288.6A 2023-09-19 2023-09-19 Root cause positioning method, root cause positioning device, computer equipment and storage medium Pending CN117194092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311213288.6A CN117194092A (en) 2023-09-19 2023-09-19 Root cause positioning method, root cause positioning device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311213288.6A CN117194092A (en) 2023-09-19 2023-09-19 Root cause positioning method, root cause positioning device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117194092A true CN117194092A (en) 2023-12-08

Family

ID=88988542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311213288.6A Pending CN117194092A (en) 2023-09-19 2023-09-19 Root cause positioning method, root cause positioning device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117194092A (en)

Similar Documents

Publication Publication Date Title
WO2020259421A1 (en) Method and apparatus for monitoring service system
US11586972B2 (en) Tool-specific alerting rules based on abnormal and normal patterns obtained from history logs
CN110321371B (en) Log data anomaly detection method, device, terminal and medium
US10467533B2 (en) System and method for predicting response time of an enterprise system
US9436535B2 (en) Integration based anomaly detection service
US11805005B2 (en) Systems and methods for predictive assurance
US9397906B2 (en) Scalable framework for monitoring and managing network devices
US11281522B2 (en) Automated detection and classification of dynamic service outages
US20200012990A1 (en) Systems and methods of network-based intelligent cyber-security
CN112631887A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer-readable storage medium
CN112308126A (en) Fault recognition model training method, fault recognition device and electronic equipment
US9860109B2 (en) Automatic alert generation
Bogojeska et al. Classifying server behavior and predicting impact of modernization actions
US20230038164A1 (en) Monitoring and alerting system backed by a machine learning engine
US7617313B1 (en) Metric transport and database load
CN113722134A (en) Cluster fault processing method, device and equipment and readable storage medium
CN114202256A (en) Architecture upgrading early warning method and device, intelligent terminal and readable storage medium
CN107480703B (en) Transaction fault detection method and device
CN112416896A (en) Data abnormity warning method and device, storage medium and electronic device
CN117194092A (en) Root cause positioning method, root cause positioning device, computer equipment and storage medium
CN114676021A (en) Job log monitoring method and device, computer equipment and storage medium
CN115098326A (en) System anomaly detection method and device, storage medium and electronic equipment
CN110413482B (en) Detection method and device
CN112764957A (en) Application fault delimiting method and device
Watanabe et al. Failure prediction for cloud datacenter by hybrid message pattern learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination