CN116701031A - Root cause model training method, analysis method and device in micro-service system - Google Patents

Info

Publication number
CN116701031A
Authority
CN
China
Prior art keywords
root cause
log
fault
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575528.0A
Other languages
Chinese (zh)
Inventor
潘晓华
尹建伟
黄逸东
李莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ronghe Intelligent Manufacturing Technology Co ltd
Binjiang Research Institute Of Zhejiang University
Original Assignee
Hangzhou Ronghe Intelligent Manufacturing Technology Co ltd
Binjiang Research Institute Of Zhejiang University
Application filed by Hangzhou Ronghe Intelligent Manufacturing Technology Co ltd, Binjiang Research Institute Of Zhejiang University
Priority to CN202310575528.0A
Publication of CN116701031A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a root cause analysis method in a micro-service system and belongs to the technical field of cloud computing. It addresses problems of existing approaches, such as low accuracy. The root cause analysis method in the micro-service system comprises the following steps: step S5: collecting distributed tracking log data of a target system, constructing a service call chain from the distributed tracking data, and determining potential abnormal nodes; step S6: collecting logs of the target system, processing the logs, extracting events and parameters, and determining abnormal events; step S7: performing fault root cause analysis based on the potential abnormal nodes and the abnormal events to obtain an analysis result. The application reduces the operation and maintenance cost of locating fault root causes in a micro-service system and enables operation and maintenance personnel to discover root causes in the system more quickly.

Description

Root cause model training method, analysis method and device in micro-service system
Technical Field
The application belongs to the technical field of cloud computing, and particularly relates to a root cause model training method, an analysis method and an analysis device in a micro-service system.
Background
With the development of micro-service technology, more and more applications use a micro-service architecture to provide services. A system may rely on hundreds of hosts and thousands of services. Because of the complexity of the system itself and of its business logic, nodes in the system frequently become abnormal and generate a large amount of alarm information; in serious cases the system stops providing services externally, which greatly harms its performance and reliability.
Currently, the related art relies on the experience of operation and maintenance engineers and development engineers, who manually guess and check possible fault points using rules accumulated through extensive experience.
The existing approach has low accuracy, depends on the experience of engineers, and makes the fault checking process time-consuming.
Disclosure of Invention
The first object of the present application is to provide a root cause model training method, an analysis method and a device in a micro-service system, which perform intelligent root cause analysis based on historical data, quickly locate the root cause that may have caused an event when the event occurs, reduce the processing time of the event, and improve system stability and performance.
The first object of the present application can be achieved by the following technical scheme: the root cause model training method in the micro-service system is characterized by comprising the following steps of:
step S1: acquiring a historical distributed tracking log and a historical running log from the system's historical normal operation, together with the problem feedback data for the events corresponding to the historical distributed tracking log and the historical running log;
step S2: processing the historical distributed tracking log and the historical running log to obtain the historical potential abnormal nodes and the characteristics of the historical potential abnormal logs;
step S3: using the historical potential abnormal nodes and the problem feedback data as a group of training data, inputting the training data into the ranking sub-model of the fault root cause positioning model, and training the ranking sub-model with the training data;
step S4: and taking the characteristic information and the problem feedback data of the historical operation log as a group of training data, inputting the training data into the cause sub-model of the fault root cause positioning model, and training the cause sub-model of the fault root cause positioning model through the training data.
In the root cause model training method in the micro-service system, the on-line problem feedback data is generated by executing preset operation on the equipment running the target system.
In the root cause model training method in the micro-service system, the input data in the training data and the input data in the using process have the same or similar forms.
In the root cause model training method in the micro-service system, the historical operation log is processed through a Drain method and a GPT method, so that characteristic information of the historical operation log is obtained; and analyzing the historical distributed tracking log, and analyzing the response time to obtain the historical potential abnormal nodes.
In the root cause model training method in the micro-service system, in the step S3, the aggregated historical potential abnormal nodes and the aggregated problem feedback data in the same time window are used as a set of training data, a preset loss function is adopted to calculate a loss value between the predicted fault root cause and the problem feedback data, and model parameters of a ranking sub-model of the fault root cause positioning model are adjusted according to the loss value.
In the root cause model training method in the micro service system, in the step S4, the aggregated historical log feature information and the aggregated problem feedback data in the same time window are used as a set of training data, a preset loss function is adopted to calculate a loss value between the predicted fault root cause and the problem feedback data, and model parameters of a cause sub-model of the fault root cause positioning model are adjusted according to the loss value.
In the root cause model training method in the micro-service system, the training samples and the data are expanded.
The second object of the present application can be achieved by the following technical scheme: the root cause analysis method in the micro-service system is characterized by further comprising the following steps of:
step S5: collecting distributed tracking log data of a target system, constructing a service call chain from the distributed tracking data, and determining potential abnormal nodes;
step S6: collecting a log of a target system, processing the log, extracting an event and a parameter, and determining an abnormal event;
step S7: and carrying out fault root cause analysis processing based on the potential abnormal nodes and the abnormal events to obtain an analysis result.
In the root cause analysis method in the micro-service system, the step S5 includes a step S5.1: collecting distributed tracking log data of the running target system, constructing a service call chain according to the tracking ID and span ID, analyzing the response time of each span, and identifying potential fault nodes;
the step S6 comprises the step S6.1 of: acquiring operation log information corresponding to a target system, and analyzing the log by using a Drain technology to acquire abnormal log information;
the step S7 comprises the following steps:
step S7.1: extracting the characteristics of the abnormal log information to obtain characteristic information;
step S7.2: inputting the potential abnormal nodes into a ranking sub-model of a fault root cause positioning model, and analyzing and processing the potential abnormal nodes to obtain a fault node probability ranking;
step S7.3: inputting the log characteristic information into a cause sub-model of the fault root cause positioning model, and analyzing the actual root cause behind the fault node to obtain the possible causes of the fault node;
step S7.4: and inputting the possible reasons of the fault nodes and the probability ranking of the fault nodes into a determination layer of a fault root cause positioning model for analysis and processing to obtain the target fault root cause.
The third object of the present application can be achieved by the following technical scheme: a root cause analysis device in a micro-service system, comprising a communication component, a power component, an audio component, a display, one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the root cause analysis method in the micro-service system as described above.
Compared with the prior art, the application reduces the operation and maintenance cost of locating fault root causes in a micro-service system and enables operation and maintenance personnel to discover root causes in the system more quickly. Because it uses artificial intelligence technology, the model keeps learning as the business system evolves, which lowers subsequent maintenance costs and makes problems easier to find; root causes no longer have to be found by relying on the experience of operation and maintenance personnel, which improves their efficiency.
Drawings
FIG. 1 is a flow chart of a root cause model training method in a micro-service system according to the present application.
FIG. 2 is a flow chart of a root cause analysis method in a micro-service system according to the present application.
FIG. 3 is a schematic diagram of a fault root cause positioning model of a root cause analysis method in a micro-service system according to the present application.
FIG. 4 is a training diagram of a root cause analysis method in a micro-service system according to the present application.
FIG. 5 is a schematic diagram of a fault root cause determining apparatus according to the present application.
FIG. 6 is a schematic diagram of a root cause analysis device in a micro-service system according to the present application.
Detailed Description
The following are specific embodiments of the present application and the technical solutions of the present application will be further described with reference to the accompanying drawings, but the present application is not limited to these embodiments.
As shown in fig. 1 to 6, the first object of the application can be achieved by the following technical solutions: a root cause model training method in a micro-service system comprises the following steps:
step S1: acquiring a historical distributed tracking log and a historical running log from the system's historical normal operation, together with the problem feedback data for the events corresponding to the historical distributed tracking log and the historical running log;
These serve as the data basis for training the model. To distinguish them from the distributed tracking data and running log described in S5 and S6, the distributed tracking data and running log used for training are referred to herein as the historical distributed tracking log and the historical running log; they are generated earlier than the distributed tracking data and running log described in S5 and S6. Their types and forms are basically the same as those of the distributed tracking log and the running log, and are neither repeated nor limited here.
Step S2: processing the historical distributed tracking log and the historical running log to obtain the historical potential abnormal nodes and the characteristics of the historical potential abnormal logs;
step S3: using the historical potential abnormal nodes and the problem feedback data as a group of training data, inputting the training data into the ranking sub-model of the fault root cause positioning model, and training the ranking sub-model with the training data;
step S4: and taking the characteristic information and the problem feedback data of the historical operation log as a group of training data, inputting the training data into the cause sub-model of the fault root cause positioning model, and training the cause sub-model of the fault root cause positioning model through the training data.
Further, the on-line problem feedback data is generated by performing a preset operation on the device running the target system.
In addition, problem feedback data needs to be obtained. In one example scenario, a user performs some preset operations on the device running the target system while using the target system, thereby generating online problem feedback data. The type and content of the problem feedback data may differ between target systems and are not limited here; for ease of understanding, a few examples are provided below.
Example one: user A uses a target system running on the device; a preset operation on the device makes the CPU usage of the target system too high, which delays operation responses during use, and the time at which the problem was found and the cause of the problem are recorded.
Example two: user B uses the target system running on the device; a preset operation on the device makes the memory usage of the target system too high, which delays operation responses during use, and the time at which the problem was found and the cause of the problem are recorded.
Further, the input data in the training data and the input data during use have the same or similar form.
In the embodiment of the application, in order to ensure the use effect of the trained anomaly detection model, the input data in the training data and the input data in the using process are required to have the same or similar form.
Further, the historical operation log is processed through a Drain method and a GPT method, so that characteristic information of the historical operation log is obtained; and analyzing the historical distributed tracking log, and analyzing the response time to obtain the historical potential abnormal nodes.
In connection with the descriptions of steps S6.1 and S7.1, the historical running log is processed by the Drain method and the GPT method to obtain the log features. In connection with the description of step S5.1, the historical distributed tracking log is analyzed, including its response times, to obtain the historical potential abnormal nodes.
Further, in the step S3, the aggregated historical potential abnormal nodes and the aggregated problem feedback data in the same time window are used as a set of training data, a loss value between the predicted fault root cause and the problem feedback data is calculated by using a preset loss function, and model parameters of a ranking sub-model of the fault root cause positioning model are adjusted according to the loss value.
In order to ensure the matching degree of the input and output of the model in time and realize accurate detection of the root cause of the system in the time dimension, in the embodiment of the application, when training data are formed, the aggregated historical potential abnormal nodes and the aggregated problem feedback data of the same time window are used as a group of training data; in this way, multiple sets of training data may be obtained based on different time windows.
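The window-based grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the 300-second window length, the tuple layouts, and the names `window_index` and `build_training_groups` are hypothetical and do not come from the application.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # assumed window length; the application does not specify one


def window_index(timestamp, window=WINDOW_SECONDS):
    """Map a Unix timestamp to the index of the time window that contains it."""
    return int(timestamp // window)


def build_training_groups(abnormal_nodes, feedback, window=WINDOW_SECONDS):
    """Pair abnormal nodes and feedback labels that fall in the same window.

    abnormal_nodes: iterable of (timestamp, node_name)
    feedback:       iterable of (timestamp, root_cause_label)
    Returns a list of (window_index, node names, labels); windows that lack
    either inputs or labels are dropped.
    """
    nodes_by_window = defaultdict(set)
    labels_by_window = defaultdict(set)
    for ts, node in abnormal_nodes:
        nodes_by_window[window_index(ts, window)].add(node)
    for ts, label in feedback:
        labels_by_window[window_index(ts, window)].add(label)

    shared = sorted(nodes_by_window.keys() & labels_by_window.keys())
    return [(i, sorted(nodes_by_window[i]), sorted(labels_by_window[i])) for i in shared]


# Hypothetical sample data: four abnormal-node observations, two feedback records.
nodes = [(10.0, "cart"), (120.0, "payment"), (400.0, "inventory"), (950.0, "auth")]
labels = [(200.0, "payment: cpu"), (450.0, "inventory: memory")]
groups = build_training_groups(nodes, labels)
```

Each element of `groups` is one group of training data; the window containing only `auth` is discarded because no feedback falls inside it.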
In the embodiment of the application, the historical potential abnormal nodes are input into the ranking sub-model of the fault root cause positioning model, a loss value between the predicted fault root cause and the problem feedback data is then calculated with a preset loss function, and the model parameters of the ranking sub-model are adjusted according to the loss value.
Further, in the step S4, the aggregated historical log feature information and the aggregated problem feedback data in the same time window are used as a set of training data, a loss value between the predicted fault root cause and the problem feedback data is calculated by using a preset loss function, and model parameters of a cause sub-model of the fault root cause positioning model are adjusted according to the loss value.
In order to ensure the matching degree of the input and output of the model in time and realize accurate detection of the root cause of the system in the time dimension, in the embodiment of the application, when training data are formed, the aggregated historical log characteristic information and the aggregated problem feedback data of the same time window are used as a group of training data; in this way, multiple sets of training data may be obtained based on different time windows.
In the embodiment of the application, the historical log characteristic information is input into the cause sub-model of the fault root cause positioning model, a loss value between the predicted fault root cause and the problem feedback data is then calculated with a preset loss function, and the model parameters of the cause sub-model are adjusted according to the loss value.
In addition, the fault root cause positioning model trained by the embodiment of the application can be applied to intelligent cloud computing scenarios and to scenarios for determining faults of new energy equipment.
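The loss-driven parameter adjustment can be illustrated with a minimal sketch. The application does not name the preset loss function or the model form; the softmax cross-entropy loss, the linear scoring model, the learning rate, and all identifiers below are assumptions chosen only to make the update loop concrete.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, features, target_index, lr=0.5):
    """One gradient step; weights[c][j] is the weight of feature j for cause c.

    Returns the loss measured before the update.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    loss = -math.log(probs[target_index])  # cross-entropy against the feedback label
    for c, row in enumerate(weights):
        grad_score = probs[c] - (1.0 if c == target_index else 0.0)
        for j, x in enumerate(features):
            row[j] -= lr * grad_score * x
    return loss


# Hypothetical data: three candidate causes, three log features, and problem
# feedback saying that cause 1 was the real one.
weights = [[0.0, 0.0, 0.0] for _ in range(3)]
features = [1.0, 0.0, 2.0]
losses = [train_step(weights, features, target_index=1) for _ in range(20)]
```

The list `losses` records how the loss value shrinks as the parameters are adjusted toward the feedback label.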
Further, the training samples are expanded and the data are expanded.
In the embodiment of the application, expanding the training samples and the data improves the robustness of the fault root cause positioning model, so that the trained model can accurately determine the root cause of a system fault, which further safeguards system quality.
The second object of the application can be achieved by the following technical scheme: the root cause analysis method in the micro-service system is based on the root cause model training method in the micro-service system, and further comprises the following steps:
step S5: collecting distributed tracking log data of a target system, constructing a service call chain from the distributed tracking data, and determining potential abnormal nodes;
The target system refers to any software system to be detected in the embodiment of the present application, and its function is not limited herein. The distributed tracking data of the target system includes, but is not limited to: tracking ID (TraceID), span ID (SpanID), parent span ID, service name, and response time. The tracking ID is the unique ID of a distributed trace across the whole response flow; the span ID is the ID of the trace segment executed in a given service; the parent span ID is the span ID of the preceding (calling) service; the service name is the name of the running service of the target system; and the response time is the time the service took to respond to the request.
In one possible implementation, the service call chain graph may be generated by traversing the span IDs and parent span IDs. In practical applications the construction method is not limited to this traversal, and this scheme does not limit the specific implementation of S5.
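A minimal sketch of the span traversal, assuming each span is a dictionary with the hypothetical keys `trace_id`, `span_id`, `parent_span_id` and `service` (the application lists these fields but not a concrete data layout):

```python
def build_call_graph(spans):
    """Return {trace_id: {caller_service: set of callee services}}."""
    spans_by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    graph = {}
    for span in spans:
        edges = graph.setdefault(span["trace_id"], {})
        edges.setdefault(span["service"], set())  # every service becomes a node
        parent = spans_by_id.get((span["trace_id"], span["parent_span_id"]))
        if parent is not None:
            # The parent span's service called this span's service.
            edges.setdefault(parent["service"], set()).add(span["service"])
    return graph


# Hypothetical trace: gateway -> order -> {payment, inventory}
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "order"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "service": "payment"},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "b", "service": "inventory"},
]
call_graph = build_call_graph(spans)
```

Keying the lookup on the pair (tracking ID, span ID) keeps spans from different requests apart even if span IDs repeat across traces.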
Step S6: collecting a log of a target system, processing the log, extracting an event and a parameter, and determining an abnormal event;
The log data of the target system includes, but is not limited to: log ID, log time, log level, log producer, and log event. The log ID is the globally unique ID of a log entry of the target system; the log time is the time at which the entry was generated; the log level is the severity level of the entry; the log producer is the function or object that emitted the entry; and the log event is the message printed by the entry.
In an alternative embodiment, after the running logs are obtained, the running logs may be parsed to obtain log events and log parameters.
Further, after the running log is obtained, an abnormal log in the running log and an event occurring in the abnormal log are determined. The exception log is at least part of the running log.
Step S7: and carrying out fault root cause analysis processing based on the potential abnormal nodes and the abnormal events to obtain an analysis result.
The fault root cause positioning model is obtained through pre-training, and the target fault root cause can be accurately predicted based on the input information.
In the embodiment of the application, the fault root cause positioning model may combine a plurality of sub-models, each of which adopts a corresponding algorithm model, so that combining the sub-models enables accurate prediction of the target fault root cause. Other fault root cause positioning models may employ the personalized PageRank algorithm (a graph ranking algorithm), GPT (a natural language processing model), XGBoost (a machine learning model), and the like. These algorithm models can be trained in advance to predict the target fault root cause individually, or can be integrated to predict it jointly.
Furthermore, the target fault root cause refers to the condition or link in the causal chain that gave rise to the corresponding fault, specifically the fundamental, underlying, deepest or initial cause.
In the embodiment of the application, the target fault root cause is sent to the terminal equipment, so that operation and maintenance personnel can obtain it in time and then repair the system fault.
In this method, a call chain is constructed from the distributed tracking data of the target system and analyzed to obtain potential abnormal nodes; the running log of the target system is parsed into log events and log parameters, which are analyzed to obtain the abnormal logs; the potential abnormal nodes and the potential abnormal logs are then input into the fault root cause positioning model for fault root cause analysis, yielding the potential root causes of the fault. The trained fault root cause analysis model thus assists operation and maintenance personnel in completing root cause analysis of the target system, which lowers the expertise required of them and avoids complex maintenance, thereby improving the efficiency of system root cause analysis and reducing implementation difficulty and cost.
Further, the step S5 includes a step S5.1: collecting distributed tracking log data of the running target system, constructing a service call chain according to the tracking ID and span ID, analyzing the response time of each span, and identifying potential fault nodes;
Based on the response time of each span, a response time that exceeds the set threshold indicates that the service may be faulty. To reduce false alarms, a normal distribution is used for the judgment: if the response time exceeds the mean by more than three standard deviations (3σ), the node may be faulty and is treated as a potential fault node. The upstream and downstream neighbors of the node are then traversed until all potential fault nodes are found.
A new fault subgraph is then reconstructed from the fault nodes for subsequent processing.
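The 3σ check and the upstream/downstream traversal can be sketched as below. The breadth-first expansion from a seed node, the data layout, and the function names are assumptions made for illustration; the application states the rule but not a concrete procedure.

```python
from collections import deque
from statistics import mean, stdev


def is_anomalous(history, observed, k=3.0):
    """True if `observed` exceeds the historical mean by more than k sigma."""
    return observed > mean(history) + k * stdev(history)


def find_fault_subgraph(edges, history, current, seed, k=3.0):
    """Walk up- and downstream from `seed`, keeping only anomalous nodes.

    edges: set of (caller, callee) pairs taken from the service call chain.
    Returns (fault_nodes, fault_edges).
    """
    neighbors = {}
    for caller, callee in edges:
        neighbors.setdefault(caller, set()).add(callee)  # downstream
        neighbors.setdefault(callee, set()).add(caller)  # upstream
    if not is_anomalous(history[seed], current[seed], k):
        return set(), set()
    fault_nodes, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in fault_nodes and is_anomalous(history[nxt], current[nxt], k):
                fault_nodes.add(nxt)
                queue.append(nxt)
    fault_edges = {(a, b) for a, b in edges if a in fault_nodes and b in fault_nodes}
    return fault_nodes, fault_edges


# Hypothetical response times in milliseconds.
edges = {("gateway", "order"), ("order", "payment"), ("order", "inventory")}
baseline = [100, 102, 98, 101, 99]
history = {n: baseline for n in ("gateway", "order", "payment", "inventory")}
current = {"gateway": 180, "order": 175, "payment": 170, "inventory": 101}
fault_nodes, fault_edges = find_fault_subgraph(edges, history, current, seed="payment")
```

Here `inventory` stays outside the fault subgraph because its response time remains within three standard deviations of its baseline.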
The step S6 comprises the step S6.1 of: and acquiring running log information corresponding to the target system, and analyzing the log by using a Drain technology to acquire abnormal log information.
The Drain technique (an online log parsing approach based on a fixed-depth tree) parses log samples to obtain log events and log parameters.
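To show what separating a log line into an event and its parameters looks like, the sketch below uses a deliberately simplified stand-in for Drain: any token containing a digit is treated as a parameter and replaced by the wildcard `<*>`. It does not implement Drain's fixed-depth parse tree or its similarity threshold, and the sample log lines are invented.

```python
import re


def parse_line(message):
    """Split a log message into (event template, list of parameters)."""
    template, params = [], []
    for token in message.split():
        if re.search(r"\d", token):
            template.append("<*>")  # variable part of the message
            params.append(token)
        else:
            template.append(token)  # constant part of the message
    return " ".join(template), params


logs = [
    "Connection to 10.0.0.5 timed out after 3000 ms",
    "Connection to 10.0.0.9 timed out after 4500 ms",
    "User 42 logged in",
]

# Group the parameters of all lines that share the same event template.
events = {}
for line in logs:
    template, params = parse_line(line)
    events.setdefault(template, []).append(params)
```

The two timeout lines collapse into a single log event, while their addresses and durations are kept as log parameters.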
Log information from the fault period of the fault node is then searched, because some hidden anomalies that precede the fault are not otherwise visible. The fault period refers to a period of time before and after the occurrence of the fault; it may be a preset period, for example the hour before the fault occurred, and the specific fault period may be chosen as needed.
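Selecting the logs of a fault node within the fault period might look like the following sketch. The one-hour look-back follows the example in the text; the ten-minute look-ahead, the `service` key used to tie a log entry to a node, and the sample entries are assumptions.

```python
from datetime import datetime, timedelta


def logs_in_fault_period(logs, node, fault_time,
                         before=timedelta(hours=1), after=timedelta(minutes=10)):
    """Return the log entries of `node` whose time lies in the fault period."""
    start, end = fault_time - before, fault_time + after
    return [e for e in logs if e["service"] == node and start <= e["time"] <= end]


fault_time = datetime(2023, 5, 1, 12, 0, 0)
logs = [
    {"service": "payment", "time": datetime(2023, 5, 1, 10, 30), "event": "started"},
    {"service": "payment", "time": datetime(2023, 5, 1, 11, 45), "event": "gc pause"},
    {"service": "order", "time": datetime(2023, 5, 1, 11, 50), "event": "retrying"},
    {"service": "payment", "time": datetime(2023, 5, 1, 12, 5), "event": "timeout"},
]
selected = logs_in_fault_period(logs, "payment", fault_time)
```

Both bounds are parameters, so the fault period can be widened or narrowed as needed.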
The step S7 comprises the following steps:
step S7.1: extracting the characteristics of the abnormal log information to obtain characteristic information;
The abnormal log information is voluminous, so the data must be processed to extract the important content for analysis and reduce interference from other information; specifically, in the embodiment of the application, features are extracted from the abnormal log information using GPT (a natural language processing model) to obtain latent information that may exist in the log.
Step S7.2: inputting the potential abnormal nodes into a ranking sub-model of a fault root cause positioning model, and analyzing and processing the potential abnormal nodes to obtain a fault node probability ranking;
the fault root cause positioning model can be an integrated model integrating one or more sub-models; wherein the ranking sub-model will determine the root cause node of the fault; the ranking sub-model adopts the PageRank algorithm, and the ranking sub-model of the PageRank algorithm is used as the basis of the fault root cause analysis model, so that the method has good performance.
In addition, the ranking sub-model is pre-trained and can analyze the potential fault nodes to obtain the fault node probability ranking, in which a higher-ranked node is more likely to be the root cause of the fault. In the embodiment of the application, the top rank may be shared by several nodes, for example when multiple faults are caused by both insufficient memory resources and a network fault.
Step S7.3: inputting the log characteristic information into a cause sub-model of the fault root cause positioning model, and analyzing the actual root cause behind the fault node to obtain the possible causes of the fault node;
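A PageRank-style ranking over the fault subgraph can be sketched with a plain power iteration. The edge direction (caller to callee), the damping factor of 0.85, the use of anomaly scores as the restart (personalization) vector, and the sample numbers are all assumptions; the application names the algorithm but gives no parameters.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Power-iteration PageRank with a restart distribution.

    edges:           {node: iterable of successor nodes}
    personalization: {node: non-negative preference weight}
    """
    nodes = sorted(set(edges) | {v for vs in edges.values() for v in vs})
    total = sum(personalization.get(n, 0.0) for n in nodes)
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            targets = list(edges.get(n, ()))
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical call graph and anomaly scores used as restart weights.
edges = {"gateway": ["order"], "order": ["payment", "inventory"]}
anomaly = {"gateway": 0.3, "order": 0.3, "payment": 0.4, "inventory": 0.0}
scores = personalized_pagerank(edges, anomaly)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the walk follows calls from callers to callees and restarts at anomalous services, score accumulates at services that many anomalous paths depend on, which places them at the top of the ranking.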
The cause sub-model can be built jointly from the Granger causality technique and the PageRank algorithm. Using Granger causality, the causal relationship between the error signals emitted by services and the logs can be inferred. It is assumed that abnormal behavior of a faulty component is likely to make adjacent components (micro-services), that is, components that interact directly or indirectly with the faulty component, emit error signals. However, nodes unrelated to the error may also receive a high causal score. To avoid such false positives, the error causes are treated as candidates and ranked with the personalized PageRank algorithm, which can assign a higher weight to the error cause that actually led to the fault.
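The idea behind a Granger causality test can be shown with a single-lag sketch: a series "Granger-causes" another if adding its past values reduces the prediction error of the other series. The code below compares a restricted and an unrestricted least-squares model through an F statistic. The lag of one, the synthetic signals and every identifier are assumptions; a practical implementation would more likely rely on a statistics library and test several lags.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(rows, targets):
    """Residual sum of squares of the least-squares fit of targets on rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(cause, effect):
    """F statistic for 'cause Granger-causes effect' using one lag."""
    targets = effect[1:]
    restricted = [[1.0, effect[t]] for t in range(len(effect) - 1)]
    full = [[1.0, effect[t], cause[t]] for t in range(len(effect) - 1)]
    rss_r, rss_u = rss(restricted, targets), rss(full, targets)
    return (rss_r - rss_u) / (rss_u / (len(targets) - 3))


# Synthetic signals: the service error signal follows the abnormal-log signal
# with a delay of one time step.
rng = random.Random(0)
log_signal = [rng.gauss(0, 1) for _ in range(200)]
error_signal = [0.0] + [0.8 * log_signal[t - 1] + 0.1 * rng.gauss(0, 1)
                        for t in range(1, 200)]
f_forward = granger_f(log_signal, error_signal)
f_reverse = granger_f(error_signal, log_signal)
```

A large `f_forward` next to a small `f_reverse` suggests that the log signal helps predict the error signal but not the other way round.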
Step S7.4: and inputting the possible reasons of the fault nodes and the probability ranking of the fault nodes into a determination layer of a fault root cause positioning model for analysis and processing to obtain the target fault root cause.
The determination layer applies preset weights to the possible causes of the fault nodes and to the fault node probability ranking, and determines the target fault root cause.
The two sub-models thus produce, in two different ways, the possible causes of the fault nodes and the fault node probability ranking, and the determination layer determines the final target fault root cause; in the embodiment of the application, the preset weights of the determination layer can be obtained through pre-training.
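One simple way to picture the determination layer is a weighted sum of the two sub-model outputs. The linear form, the weights 0.6 and 0.4, and the sample scores are assumptions for illustration only; the application states that preset weights are used and may be pre-trained, not how the outputs are combined.

```python
def determine_root_cause(node_scores, cause_scores, w_node=0.6, w_cause=0.4):
    """Combine ranking and cause scores and return the best (node, cause) pair.

    node_scores:  {node: probability-like score from the ranking sub-model}
    cause_scores: {(node, cause): score from the cause sub-model}
    """
    combined = {
        (node, cause): w_node * node_scores.get(node, 0.0) + w_cause * score
        for (node, cause), score in cause_scores.items()
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical sub-model outputs.
node_scores = {"payment": 0.5, "order": 0.3, "gateway": 0.2}
cause_scores = {
    ("payment", "cpu saturation"): 0.7,
    ("payment", "memory exhaustion"): 0.2,
    ("order", "network fault"): 0.8,
}
best, combined = determine_root_cause(node_scores, cause_scores)
```

With these numbers the pair `("payment", "cpu saturation")` wins even though the strongest individual cause score belongs to `order`, because the node ranking carries more weight.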
In an alternative embodiment, the possible causes of the fault nodes and the fault node probability ranking may also be sent directly to the terminal equipment, which displays them. The server then receives from the terminal equipment the root cause located by operation and maintenance personnel, which they determine from the displayed possible causes and probability ranking, and determines the target fault root cause from the personnel-located root cause, the possible causes of the fault nodes and the fault node probability ranking.
That is, operation and maintenance personnel determine the potential cause of the fault (the personnel-located root cause) from the displayed possible causes and ranking and enter it into the terminal equipment; the terminal equipment sends it to the server; and the server determines the final target fault root cause from the personnel-located root cause, the possible causes of the fault nodes and the fault node probability ranking, which improves the accuracy of the target fault root cause.
And sending the target fault root cause to the terminal equipment so that the terminal equipment displays the target fault root cause to the operation and maintenance personnel.
In the embodiment of the application, the fault root positioning model integrates a plurality of sub-models, and fully utilizes the fault causal graph; and supplementing each sub-model to obtain the accurate target fault root cause finally.
The third object of the application can be achieved by the following technical scheme: a fault root cause determination device 500, the fault root cause determination device 500 comprising:
an obtaining module 501, configured to obtain a plurality of pieces of running log information and distributed tracking information during a system failure, where each distributed tracking log includes: trace ID, span ID, parent span ID, service name, and response time. Each running log includes: log ID, log time, log level, log producer, and log event.
The data processing module 502 is configured to process the distributed tracking information and the running log information to obtain potential abnormal nodes and running log feature information.
The input module 503 is configured to input the potential abnormal nodes and the running log feature information into the fault root cause positioning model for fault root cause analysis and processing, obtaining the target fault root cause of the fault.
The sending module 504 is configured to send the target fault root cause to the terminal device, so that the terminal device displays the target fault root cause to operation and maintenance personnel.
The fault root cause determining device provided by the embodiment of the application obtains potential fault nodes and running log feature information by processing the running log information and the distributed tracking information, and, by taking these as the input of the fault root cause positioning model, can accurately determine the target fault root cause of the fault.
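The trace processing performed by the data processing module could be sketched as follows. The span fields follow the list given for the obtaining module; the class and function names, the per-service baseline response times, and the threshold factor are hypothetical and only illustrate one simple way of flagging potential abnormal nodes from response times.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span of a trace
    service_name: str
    response_time_ms: float


def build_call_chain(spans: List[Span]) -> Dict[str, List[Span]]:
    """Map each parent span ID to its child spans (spans of a single trace)."""
    children: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        if span.parent_span_id is not None:
            children[span.parent_span_id].append(span)
    return dict(children)


def potential_abnormal_services(
    spans: List[Span], baseline_ms: Dict[str, float], factor: float = 3.0
) -> Set[str]:
    """Flag services whose response time exceeds factor times an assumed baseline.

    Services without a baseline are never flagged.
    """
    return {
        s.service_name
        for s in spans
        if s.response_time_ms > factor * baseline_ms.get(s.service_name, float("inf"))
    }


# Hypothetical trace collected during a failure.
trace = [
    Span("t1", "s1", None, "frontend", 950.0),
    Span("t1", "s2", "s1", "cart", 900.0),
    Span("t1", "s3", "s2", "db", 880.0),
    Span("t1", "s4", "s1", "ads", 12.0),
]
chain = build_call_chain(trace)
suspects = potential_abnormal_services(
    trace, {"frontend": 100.0, "cart": 60.0, "db": 20.0, "ads": 10.0}
)
```

The resulting set of suspect services, together with the call chain, is the kind of input that the ranking sub-model would then order by likelihood.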
The fourth object of the application can be achieved by the following technical scheme: a root cause analysis device in a micro-service system, comprising an input, an output, one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing a root cause model analysis method in a micro-service system as described above.
A memory 601 for storing a program. In addition to the programs described above, the memory 601 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 601 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 602 is not limited to a central processing unit (CPU), but may also be a graphics processing unit (GPU), a field programmable gate array (FPGA), an embedded neural network processor (NPU), an artificial intelligence (AI) chip, or the like. The processor 602 is coupled to the memory 601 and executes the program stored in the memory 601; when executed, the program performs the root cause analysis method in the micro-service system described above.
The communication component 603 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 3G, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component 603 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 603 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 604 provides power to the various components of the electronic device. The power components 604 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.
The audio component 605 is configured to output and/or input audio signals. For example, the audio component 605 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 601 or transmitted via the communication component 603. In some embodiments, the audio component 605 also includes a speaker for outputting audio signals.
The display 606 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the application. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the application or exceeding the scope of the application as defined in the accompanying claims.
Although certain terms are used frequently herein, the use of other terms is not precluded. These terms are used merely for convenience in describing and explaining the nature of the application; construing them as any additional limitation would be contrary to the spirit of the application.

Claims (10)

1. The root cause model training method in the micro-service system is characterized by comprising the following steps of:
step S1: acquiring historical distributed tracking logs and historical running logs from the historical normal operation of the system, together with the problem feedback data for the events corresponding to the historical distributed tracking logs and the historical running logs;
step S2: processing the historical distributed tracking logs and the historical running logs to obtain historical potential abnormal nodes and feature information of the historical potential abnormal logs;
step S3: taking the historical potential abnormal nodes and the problem feedback data as a group of training data, inputting the training data into the ranking sub-model of the fault root cause positioning model, and training the ranking sub-model of the fault root cause positioning model with the training data;
step S4: taking the feature information of the historical running logs and the problem feedback data as a group of training data, inputting the training data into the cause sub-model of the fault root cause positioning model, and training the cause sub-model of the fault root cause positioning model with the training data.
2. The root cause model training method in a micro-service system according to claim 1, wherein the on-line problem feedback data is generated by performing a preset operation on a device running the target system.
3. The method of claim 1, wherein the input data in the training data and the input data in the use process have the same or similar form.
4. The root cause model training method in a micro-service system according to claim 1, wherein the historical operation log is processed by a Drain method and a GPT method, so that characteristic information of the historical operation log is obtained; and analyzing the historical distributed tracking log, and analyzing the response time to obtain the historical potential abnormal nodes.
5. The root cause model training method in a micro-service system according to claim 1, wherein in the step S3, aggregated historical potential abnormal nodes and aggregated problem feedback data of the same time window are used as a set of training data, a loss value between a predicted fault root cause and the problem feedback data is calculated by using a preset loss function, and model parameters of the ranking sub-model of the fault root cause positioning model are adjusted according to the loss value.
6. The root cause model training method in a micro service system according to claim 1, wherein in the step S4, aggregated historical log feature information and aggregated problem feedback data of the same time window are used as a set of training data, a loss value between a predicted fault root cause and the problem feedback data is calculated by using a preset loss function, and model parameters of a cause sub-model of a fault root cause positioning model are adjusted according to the loss value.
7. The method of claim 1, wherein the training samples are expanded and the data is expanded.
8. Root cause analysis method in a micro-service system, characterized in that it is based on a root cause model training method in a micro-service system according to any of claims 1-7, further comprising the steps of:
step S5: collecting distributed tracking log data of a target system, constructing by the distributed tracking data, and determining potential abnormal nodes;
step S6: collecting a log of a target system, processing the log, extracting an event and a parameter, and determining an abnormal event;
step S7: and carrying out fault root cause analysis processing based on the potential abnormal nodes and the abnormal events to obtain an analysis result.
9. The root cause analysis method in a micro-service system according to claim 8, wherein the step S5 comprises step S5.1: collecting distributed tracking log data of the running target system, constructing a service call chain according to the trace ID and the span ID, analyzing the response time of each span, and identifying potential fault nodes;
the step S6 comprises step S6.1: acquiring running log information corresponding to the target system, and parsing the log by using the Drain technique to acquire abnormal log information;
the step S7 comprises the following steps:
step S7.1: extracting the characteristics of the abnormal log information to obtain characteristic information;
step S7.2: inputting the potential abnormal nodes into a ranking sub-model of a fault root cause positioning model, and analyzing and processing the potential abnormal nodes to obtain a fault node probability ranking;
step S7.3: inputting the log feature information into the cause sub-model of the fault root cause positioning model, and analyzing the root cause that actually caused the fault at the fault node to obtain the possible causes of the fault node;
step S7.4: and inputting the possible reasons of the fault nodes and the probability ranking of the fault nodes into a determination layer of a fault root cause positioning model for analysis and processing to obtain the target fault root cause.
10. Root cause analysis device in a micro-service system, characterized by comprising a communication component, a power supply component, an audio component, a display, one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the root cause analysis method in a micro-service system according to any of claims 8-9.
CN202310575528.0A 2023-05-19 2023-05-19 Root cause model training method, analysis method and device in micro-service system Pending CN116701031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310575528.0A CN116701031A (en) 2023-05-19 2023-05-19 Root cause model training method, analysis method and device in micro-service system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310575528.0A CN116701031A (en) 2023-05-19 2023-05-19 Root cause model training method, analysis method and device in micro-service system

Publications (1)

Publication Number Publication Date
CN116701031A true CN116701031A (en) 2023-09-05

Family

ID=87831908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310575528.0A Pending CN116701031A (en) 2023-05-19 2023-05-19 Root cause model training method, analysis method and device in micro-service system

Country Status (1)

Country Link
CN (1) CN116701031A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149501A (en) * 2023-10-31 2023-12-01 中邮消费金融有限公司 Problem repair system and method
CN117149501B (en) * 2023-10-31 2024-02-06 中邮消费金融有限公司 Problem repair system and method
CN117349129A (en) * 2023-12-06 2024-01-05 广东无忧车享科技有限公司 Abnormal optimization method and system for vehicle sales process service system
CN117349129B (en) * 2023-12-06 2024-03-29 广东无忧车享科技有限公司 Abnormal optimization method and system for vehicle sales process service system
CN117493068A (en) * 2024-01-03 2024-02-02 安徽思高智能科技有限公司 Root cause positioning method, equipment and storage medium for micro-service system
CN117493068B (en) * 2024-01-03 2024-03-26 安徽思高智能科技有限公司 Root cause positioning method, equipment and storage medium for micro-service system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 310000 301, building 2, No. 66, Dongxin Avenue, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Binjiang Research Institute of Zhejiang University

Applicant after: Hangzhou Ronghe Intelligent Manufacturing Technology Co.,Ltd.

Address before: Room 301, Building 2, No. 66 Dongxin Avenue, Jiang District, Hangzhou City, Zhejiang Province, 310000

Applicant before: Binjiang Research Institute of Zhejiang University

Country or region before: China

Applicant before: Hangzhou Ronghe Intelligent Manufacturing Technology Co.,Ltd.
