CN116737436A - Root cause positioning method and system for micro-service system facing mixed deployment scene - Google Patents


Info

Publication number
CN116737436A
Authority
CN
China
Prior art keywords
service
micro
abnormal
services
dependency graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310569212.0A
Other languages
Chinese (zh)
Inventor
王健
严誉翔
李兵
张泽锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310569212.0A priority Critical patent/CN116737436A/en
Publication of CN116737436A publication Critical patent/CN116737436A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a root cause positioning method and system for micro-service systems in a hybrid deployment scenario. The method comprises the following steps: first, a preliminary chaos engineering experiment is conducted on the hybrid deployment system to collect a fault data set; second, container-level metrics and service-level metrics are collected from the different micro-service systems in the hybrid deployment scenario; third, an unsupervised learning algorithm is used to obtain the calling relations of the different micro-service systems, and a single-system abnormal service dependency graph is constructed for each micro-service system; then, a frequent itemset mining algorithm and a causal inference algorithm are used to obtain the associations between micro-services of different systems, and a multi-system abnormal service dependency graph is constructed; next, the anomaly weights in the multi-system abnormal service dependency graph are updated; finally, a personalized random walk algorithm ranks the abnormal services across the systems to achieve root cause positioning.

Description

Root cause positioning method and system for micro-service system facing mixed deployment scene
Technical Field
The application relates to the field of computers, in particular to a root cause positioning method and system of a micro-service system for a hybrid deployment scene.
Background
In recent years, as the business of software companies has grown, user bases have expanded and data have become more diverse, the cost of designing, developing, deploying, testing and maintaining software built on a traditional monolithic architecture has risen steadily. In a micro-service system, an application is broken down into fine-grained, componentized, loosely coupled, autonomous and decentralized services. The services communicate with each other through lightweight mechanisms such as HTTP, and automated continuous integration and continuous deployment significantly reduce the complexity faced by developers and operation and maintenance personnel.
As the demands of software companies grow and their business expands, modern software keeps increasing in both scale and quantity. CPU and memory resources are valuable, yet they are often underutilized. Hybrid deployment has therefore become an important means of improving resource utilization, reducing costs for software companies and breaking down resource isolation between departments. However, with the growing scale of services, the increasingly complex dependencies between services, agile development and the use of DevOps tools, code commits and version updates can occur hundreds of times a day. The cost and complexity of manually detecting errors and locating their possible causes are rising accordingly. Automating root cause positioning is therefore a critical task.
In recent years, academia and industry have done a great deal of work on root cause positioning for micro-service systems. Chinese patent document CN115576732A proposes a root cause positioning method and system that screens candidate virtual machines from a virtual machine cluster according to traffic change information, loads historical data of the candidate virtual machines at the associated fault time points, determines anomaly information of the candidate virtual machines in preset root cause positioning dimensions according to the historical data, and finally determines a target virtual machine among the candidates based on the anomaly information. That method is not suitable for micro-service systems: its positioning granularity is the host level, which is too coarse to pinpoint the actual root cause micro-service. Chinese patent document CN115756919A proposes a root cause positioning method and system for multi-dimensional data, which acquires data in a time window before and after an anomaly occurs, preprocesses the data, predicts the expected value of each attribute combination, calculates a deviation score for each attribute combination from the true value and the predicted expected value, performs clustering, judges whether the data conform to the ripple effect, and selects a root cause positioning algorithm accordingly. That method focuses on a single deployment scenario and does not explore hybrid deployment of micro-service systems. How to build a micro-service root cause positioning method suited to hybrid deployment scenarios therefore remains a challenge for cloud-native intelligent operation and maintenance.
Disclosure of Invention
Aiming at the problem that accurate root cause positioning of micro-service in a mixed deployment scene can not be realized in the prior art, the application provides a root cause positioning method of a micro-service system for the mixed deployment scene.
The technical scheme of the application is as follows:
A first aspect provides a root cause positioning method for micro-service systems in a hybrid deployment scenario, comprising the following steps:
S1: conducting a preliminary chaos engineering experiment on the micro-service systems in the hybrid deployment scenario, collecting a chaos engineering data set, and continuously monitoring and collecting service-level metrics and container-level metrics through a monitoring tool;
S2: obtaining the calling relations of the different micro-service systems in the hybrid deployment scenario by an unsupervised learning algorithm, and constructing a single-system abnormal service dependency graph for each micro-service system, wherein the nodes of the single-system abnormal service dependency graph are abnormal services and the edges represent the calling relations among the abnormal services;
S3: obtaining the associations between micro-services of different systems in the hybrid deployment scenario by a frequent itemset mining algorithm and a causal inference algorithm, and constructing a multi-system abnormal service dependency graph, wherein the nodes of the multi-system abnormal service dependency graph are abnormal services and the edges represent the dependency relations among the abnormal services;
S4: updating the weights of the multi-system abnormal service dependency graph according to the correlation between the container-level metrics and the service-level metrics of two services under the same micro-service system;
S5: executing a personalized random walk algorithm on the weight-updated multi-system abnormal service dependency graph to achieve root cause positioning.
In one embodiment, step S1 includes:
using a chaos engineering tool to inject anomalies into micro-service system instances in the hybrid deployment scenario and collecting a chaos engineering data set, wherein the injected anomaly types comprise instance anomalies, network anomalies, file system anomalies and stress anomalies;
the service-level metrics comprise the average latency, P90 latency and P99 latency of each micro-service, and the container-level metrics comprise CPU, memory, network and file system metrics during micro-service operation.
In one embodiment, step S2 includes:
performing cluster analysis on the P90 latency metrics between different micro-services by an unsupervised learning algorithm to find the candidate set of abnormal services: if the input latency data are grouped into a single cluster, the collected latency data are considered stable; if the input latency data are grouped into multiple clusters, the collected latency data are considered dispersed, in which case the latency between the micro-services is regarded as abnormal and the call between them is regarded as an abnormal call;
constructing the single-system abnormal service dependency graph with the abnormal services as nodes and the calling relations among the abnormal services as edges.
In one embodiment, step S3 includes:
S3.1: constructing frequent itemsets from the chaos engineering data set using the Apriori algorithm;
S3.2: mining strong association relations between different micro-service systems based on the constructed frequent itemsets;
S3.3: testing the causal relations between strongly associated abnormal services using the Granger causality test, and constructing the multi-system abnormal service dependency graph.
In one embodiment, step S3.1 comprises:
scanning all abnormal micro-services in the chaos engineering data set, treating each distinct micro-service as an item, and generating the candidate 1-itemsets, which together form the set $C_1$;
counting each item and deleting from the 1-itemsets those that do not meet the minimum support, thereby obtaining the set $L_1$ of frequent 1-itemsets;
applying the self-join and pruning strategy to $L_1$ to generate the set $C_2$ of candidate 2-itemsets, scanning the chaos engineering data set to count each itemset in $C_2$, and deleting the itemsets that do not meet the minimum support, thereby obtaining the set $L_2$ of frequent 2-itemsets; similarly, applying the self-join and pruning strategy to $L_{k-1}$ to generate the set $C_k$ of candidate k-itemsets, scanning the transaction set to count each itemset in $C_k$, and deleting the itemsets that do not meet the minimum support, thereby obtaining the set $L_k$ of frequent k-itemsets.
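The level-wise procedure of step S3.1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the application: the function name `apriori`, the service names in the example, and the use of an absolute transaction count as the minimum support are all hypothetical choices made for the sketch.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    transactions: iterable of sets of abnormal service names
                  (assumed: one set per fault-injection run).
    min_support:  minimum number of transactions (assumed: a count, not a ratio).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Scan the data set once and count each candidate itemset
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1: every distinct item is a candidate 1-itemset
    c1 = {frozenset([item]) for t in transactions for item in t}
    frequent = {c: n for c, n in count(c1).items() if n >= min_support}  # L1
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Self-join: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}  # Lk
        result.update(frequent)
        k += 1
    return result


# Hypothetical example: abnormal services observed together in five fault injections
transactions = [
    {"sysA.frontend", "sysA.cart", "sysB.orders"},
    {"sysA.frontend", "sysA.cart"},
    {"sysA.frontend", "sysB.orders"},
    {"sysA.cart", "sysB.orders"},
    {"sysA.frontend", "sysA.cart", "sysB.orders"},
]
frequent_itemsets = apriori(transactions, min_support=3)
```

In this toy data every single service and every pair reaches the minimum support, while the triple is pruned at the counting stage because it appears in only two transactions.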
In one embodiment, step S3.2 comprises:
for each frequent k-itemset, generating all of its non-empty subsets;
letting two itemsets be X and Y, an association rule is defined as $X \Rightarrow Y$, meaning that itemset Y can be derived from itemset X; for the association rule $X \Rightarrow Y$, the confidence is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$; when $\mathrm{conf}(X \Rightarrow Y) \ge \mathrm{conf}_{min}$, the rule $X \Rightarrow Y$ is obtained, indicating that the occurrence of itemset X leads to the occurrence of itemset Y with probability (confidence) $\mathrm{conf}(X \Rightarrow Y)$, where $\mathrm{conf}_{min}$ denotes the minimum confidence.
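The confidence filter of step S3.2 can be illustrated with a short Python sketch. It assumes the frequent itemsets are already available as a dictionary mapping each itemset to its transaction count; the function name `association_rules` and the example counts are hypothetical and serve only to show the arithmetic.

```python
from itertools import combinations


def association_rules(frequent, min_conf):
    """Return a list of (X, Y, confidence) for rules X => Y meeting min_conf.

    frequent: {frozenset: support_count}; by the Apriori property it is assumed
              to contain every non-empty subset of each frequent itemset.
    """
    rules = []
    for itemset, count_xy in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                x = frozenset(subset)
                y = itemset - x
                # confidence = transactions containing X and Y / transactions containing X
                conf = count_xy / frequent[x]
                if conf >= min_conf:
                    rules.append((x, y, conf))
    return rules


# Hypothetical counts: service "a" was abnormal in 4 runs, "b" in 5, both together in 4
frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 5,
    frozenset({"a", "b"}): 4,
}
strong_rules = association_rules(frequent, min_conf=0.9)
```

With these counts the rule a => b has confidence 4/4 = 1.0 and is kept, while b => a has confidence 4/5 = 0.8 and is dropped at a minimum confidence of 0.9.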
In one embodiment, step S3.3 includes:
checking whether the abnormal services in the single-system abnormal service dependency graphs appear in the frequent itemsets; if some of the abnormal services appear in the frequent itemsets, testing the container-level metrics in the hybrid deployment scenario with the Granger causality test, and if changes in a container-level metric show a causal relation, this indicates that the anomalies of the services in the different micro-service systems are causally related;
if no abnormal service appears in the frequent itemsets, performing the causality test on all abnormal services across the different micro-service systems, and when changes in a container-level metric are found to show a causal relation, this indicates that the abnormal services of the different systems are causally related.
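To make the Granger causality check concrete, the following Python sketch computes the F statistic of the test for two metric series using only the standard library. It is a simplified illustration under stated assumptions: the function names, the default lag of 1 and the synthetic series are hypothetical, and the sketch stops at the F statistic, which would still have to be compared against a critical value of the F distribution (or converted to a p-value by a statistics library) before concluding that a causal relation exists.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations, Gauss-Jordan)."""
    m = len(rows[0])
    # Augmented system [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(m):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [u - factor * v for u, v in zip(a[row], a[col])]
    beta = [a[i][m] / a[i][i] for i in range(m)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(x, y, lag=1):
    """F statistic for the hypothesis 'x Granger-causes y' with the given lag."""
    steps = range(lag, len(y))
    targets = [y[t] for t in steps]
    # Restricted model: y_t explained by its own past values only
    restricted = [[1.0] + [y[t - i] for i in range(1, lag + 1)] for t in steps]
    # Unrestricted model: past values of x are added as regressors
    unrestricted = [r + [x[t - i] for i in range(1, lag + 1)]
                    for r, t in zip(restricted, steps)]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Synthetic, hypothetical container-level series: metric_b follows metric_a by one step
rng = random.Random(0)
metric_a = [rng.gauss(0, 1) for _ in range(200)]
metric_b = [0.0] + [0.8 * metric_a[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_a_to_b = granger_f(metric_a, metric_b)
f_b_to_a = granger_f(metric_b, metric_a)
```

A large F statistic for one direction and a small one for the reverse direction suggests which service's metric change precedes the other; the decision threshold is left to the reader.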
In one embodiment, step S4 includes:
extracting all container-level metrics of two services under the same micro-service system and the P90 latency data between the two services;
calculating the Pearson correlation coefficient between each extracted container-level metric and the P90 latency data, taking the largest positive correlation coefficient as the weight of the directed edge between the two services under the same micro-service system, and updating the weights of the multi-system abnormal service dependency graph accordingly.
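The weight update of step S4 can be sketched in Python as follows. The function names, the metric names and the fallback weight of 0.0 when no coefficient is positive are assumptions of this sketch; the text above only specifies that the largest positive Pearson coefficient becomes the edge weight.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant series carries no correlation signal
    return cov / sqrt(var_a * var_b)


def edge_weight(container_metrics, p90_latency):
    """Largest positive correlation between any container metric and the P90 latency.

    container_metrics: {metric_name: series}; falls back to 0.0 (assumption)
    when no metric is positively correlated with the latency.
    """
    coefficients = [pearson(series, p90_latency) for series in container_metrics.values()]
    return max([0.0] + coefficients)


# Hypothetical series for one pair of services
p90 = [1, 2, 3, 4, 5]
weight = edge_weight({"cpu": [2, 4, 6, 8, 10], "memory": [5, 4, 3, 2, 1]}, p90)
```

Here the CPU series rises with the latency and the memory series falls, so the CPU coefficient is the one that determines the weight.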
In one embodiment, step S5 includes:
S5.1: the basic transition matrix M of the multi-system abnormal service dependency graph (MSDG) is defined by formula (1):
$M = [m_{ij}]_{n \times n}$ (1)
for each node v in the MSDG, assume it has k outgoing edges connected to nodes $u_1, u_2, \dots, u_k$; the element in row i and column j of M is the weight $w_{ij}$ of the corresponding edge divided by the out-degree k of node v, as given by formula (2):
$m_{ij} = w_{ij} / k$ (2)
wherein each element of M represents the transition probability from one node to another;
S5.2: introducing a completely random transition matrix E, in which the transition probability from any node to any other node is 1/n, where n is the number of nodes in the MSDG, and defining a damping factor d, with $0 \le d \le 1$, to control the proportion between M and E;
S5.3: obtaining the complete transition matrix P of the MSDG as the weighted average of M and E, i.e. $P = dM + (1 - d)E$; using P as the transition matrix of a general random-walk Markov chain and iterating, multiplying the current state vector by P in each iteration to obtain a new state vector, and repeating this process until the state vector converges to a stationary distribution R, where R is an n-dimensional vector whose components sum to 1, each component being the score of the corresponding node in the MSDG, i.e. its PageRank value, which reflects the importance and influence of that node in the MSDG; R is given by formula (3):
$R = [PR(v_1), PR(v_2), \dots, PR(v_n)]^T$ (3)
where $PR(v_1)$ and $PR(v_n)$ denote the PageRank values of node $v_1$ and node $v_n$, respectively;
S5.4: sorting the PageRank values of the nodes in the MSDG in descending order and taking the top-ranked service as the root cause micro-service, i.e. the micro-service most likely to have caused the anomaly.
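The ranking of step S5 can be sketched in Python with a plain power iteration. The sketch follows formulas (1) and (2) and the combination P = dM + (1 - d)E as stated; everything else is an assumption made for illustration: the function name `rank_services`, the default damping factor of 0.85, the convergence tolerance, the uniform matrix E (no personalization vector is specified in the text), and the renormalization of the state vector in each iteration so that the scores keep summing to 1 even when edge weights are below 1 or a node has no outgoing edge.

```python
def rank_services(nodes, edges, d=0.85, tol=1e-10, max_iter=1000):
    """Rank nodes of a weighted directed graph by a damped random walk.

    nodes: list of service names.
    edges: {(source, target): weight}.
    Returns [(node, score)] sorted by descending score.
    """
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    out_degree = [0] * n
    for source, _target in edges:
        out_degree[index[source]] += 1
    # Formula (2): m_ij = w_ij / k, with k the out-degree of node i
    m = [[0.0] * n for _ in range(n)]
    for (source, target), weight in edges.items():
        m[index[source]][index[target]] = weight / out_degree[index[source]]
    # P = d*M + (1-d)*E, where every entry of E is 1/n
    p = [[d * m[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]
    state = [1.0 / n] * n
    for _ in range(max_iter):
        new_state = [sum(state[i] * p[i][j] for i in range(n)) for j in range(n)]
        total = sum(new_state)
        new_state = [x / total for x in new_state]  # assumption: keep scores summing to 1
        converged = max(abs(a - b) for a, b in zip(new_state, state)) < tol
        state = new_state
        if converged:
            break
    return sorted(zip(nodes, state), key=lambda item: item[1], reverse=True)


# Hypothetical MSDG: two abnormal services both depend on a third one
ranking = rank_services(
    ["sysA.frontend", "sysB.orders", "sysB.db"],
    {("sysA.frontend", "sysB.db"): 1.0, ("sysB.orders", "sysB.db"): 1.0},
)
root_cause_candidate = ranking[0][0]
```

In this toy graph both edges point to the same node, so the walk concentrates there and that node is ranked first.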
Based on the same inventive concept, a second aspect of the present application provides a root cause positioning system for micro-service systems in a hybrid deployment scenario, comprising:
a data collection module, configured to conduct a preliminary chaos engineering experiment on the micro-service systems in the hybrid deployment scenario, collect a chaos engineering data set, and continuously monitor and collect service-level metrics and container-level metrics through a monitoring tool;
a single-application anomaly graph construction module, configured to obtain the calling relations of the different micro-service systems in the hybrid deployment scenario by an unsupervised learning algorithm and to construct a single-system abnormal service dependency graph for each micro-service system, wherein the nodes of the single-system abnormal service dependency graph are abnormal services and the edges represent the calling relations among the abnormal services;
a multi-application anomaly graph construction module, configured to obtain the associations between micro-services of different systems in the hybrid deployment scenario by a frequent itemset mining algorithm and a causal inference algorithm and to construct a multi-system abnormal service dependency graph, wherein the nodes of the multi-system abnormal service dependency graph are abnormal services and the edges represent the dependency relations among the abnormal services;
a comprehensive ranking module, configured to update the weights of the multi-system abnormal service dependency graph according to the correlation between the container-level metrics and the service-level metrics of two services under the same micro-service system,
and to execute a personalized random walk algorithm on the weight-updated multi-system abnormal service dependency graph to achieve root cause positioning.
Compared with the prior art, the technical scheme provided by the application has at least the following technical effects:
the root cause positioning method respectively constructs an abnormal service dependency graph of a single system and an abnormal service dependency graph of multiple systems; on one hand, the mixed deployment situation of a plurality of micro service systems can be processed, and on the other hand, the service level index and the container level index of the micro service are fused, so that the health condition of the running environment of the micro service can be comprehensively reflected, and the root cause positioning accuracy is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a root cause positioning method of a micro service system facing a hybrid deployment scenario provided by an embodiment of the present application;
FIG. 2 is a block diagram of a micro-service system facing a hybrid deployment scenario in an embodiment of the present application;
FIG. 3 is a schematic diagram of generating multiple SSDGs based on the collected service-level metrics in an embodiment of the method of the present application;
FIG. 4 is a diagram showing the construction of an MSDG based on a plurality of SSDGs and weighting of the MSDGs in an embodiment of the method of the present application;
FIG. 5 shows experimental results on Online-Boutique, Sock-Shop and Train-Ticket in an embodiment of the method of the present application.
Detailed Description
The application provides a root cause positioning method for micro-service systems in a hybrid deployment scenario. The method comprises the following steps: first, a preliminary chaos engineering experiment is conducted on the hybrid deployment system to collect a fault data set; second, container-level metrics and service-level metrics are collected from the different micro-service systems in the hybrid deployment scenario; third, an unsupervised learning algorithm is used to obtain the calling relations of the different micro-service systems, and a single-system abnormal service dependency graph is constructed for each micro-service system; then, a frequent itemset mining algorithm and a causal inference algorithm are used to obtain the associations between micro-services of different systems, and a multi-system abnormal service dependency graph is constructed; next, the anomaly weights in the multi-system abnormal service dependency graph are updated; finally, a personalized random walk algorithm ranks the abnormal services across the systems to achieve root cause positioning.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Example 1
The application discloses a root cause positioning method for micro-service systems in a hybrid deployment scenario. Referring to FIG. 1, the method comprises the following steps:
S1: conducting a preliminary chaos engineering experiment on the micro-service systems in the hybrid deployment scenario, collecting a chaos engineering data set, and continuously monitoring and collecting service-level metrics and container-level metrics through a monitoring tool;
S2: obtaining the calling relations of the different micro-service systems in the hybrid deployment scenario by an unsupervised learning algorithm, and constructing a single-system abnormal service dependency graph for each micro-service system, wherein the nodes of the single-system abnormal service dependency graph are abnormal services and the edges represent the calling relations among the abnormal services;
S3: obtaining the associations between micro-services of different systems in the hybrid deployment scenario by a frequent itemset mining algorithm and a causal inference algorithm, and constructing a multi-system abnormal service dependency graph, wherein the nodes of the multi-system abnormal service dependency graph are abnormal services and the edges represent the dependency relations among the abnormal services;
S4: updating the weights of the multi-system abnormal service dependency graph according to the correlation between the container-level metrics and the service-level metrics of two services under the same micro-service system;
S5: executing a personalized random walk algorithm on the weight-updated multi-system abnormal service dependency graph to achieve root cause positioning.
The abnormal service dependency graph of a single system is also called a single application service dependency graph, and is abbreviated as SSDG. The abnormal service dependency graph of the multi-system is also called a multi-application service dependency graph, and is abbreviated as MSDG.
Compared with the prior art, the application has the following advantages and technical effects:
1. The proposed root cause positioning method for micro-service systems in a hybrid deployment scenario can handle the hybrid deployment of multiple micro-service systems, whereas current research focuses mainly on single deployment scenarios and pays little attention to hybrid deployment.
2. The proposed metric-based root cause positioning method for hybrid deployment scenarios fuses the service-level metrics of the micro-services with their container-level metrics, including CPU, memory, network and file system metrics. Together, the service-level and container-level metrics comprehensively reflect the health of the environment in which a micro-service runs, which improves the accuracy of root cause positioning.
In one embodiment, step S1 includes:
using a chaos engineering tool to inject anomalies into micro-service system instances in the hybrid deployment scenario and collecting a chaos engineering data set, wherein the injected anomalies comprise instance anomalies, network anomalies, file system anomalies and stress anomalies;
the service-level metrics comprise the average latency, P90 latency and P99 latency of each micro-service, and the container-level metrics comprise CPU, memory, network and file system metrics during micro-service operation.
Specifically, the chaos engineering experiment uses a chaos engineering tool to inject anomalies into the micro-service instances of the hybrid deployment. Instance anomalies include instance failure and instance kill; network anomalies include network partition, packet loss, network delay, packet duplication and packet corruption; file system anomalies include delays added to file system calls and errors returned by file system calls; stress anomalies include full CPU load and full memory load.
Continuously monitoring and collecting the service-level and container-level metrics through a monitoring tool means that these metrics are collected by Prometheus.
In one embodiment, step S2 includes:
performing cluster analysis on the P90 latency metrics between different micro-services by an unsupervised learning algorithm to find the candidate set of abnormal services: if the input latency data are grouped into a single cluster, the collected latency data are considered stable; if the input latency data are grouped into multiple clusters, the collected latency data are considered dispersed, in which case the latency between the micro-services is regarded as abnormal and the call between them is regarded as an abnormal call;
constructing the single-system abnormal service dependency graph with the abnormal services as nodes and the calling relations among the abnormal services as edges.
Specifically, the latency data are the P90 latency metrics. FIG. 3 is a schematic diagram of generating the abnormal micro-service call topologies of a plurality of micro-service systems based on the collected service-level metrics in an embodiment. As shown in FIG. 3, the first column is the P90 latency from one micro-service to another; for example, front_add service & P90 denotes the P90 latency between the front micro-service and the add service. Based on the latency data of the micro-service systems, the abnormal service dependency graphs of the micro-service systems are obtained with the BIRCH algorithm.
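The single-cluster versus multi-cluster decision can be illustrated with a short Python sketch. Note that this is not BIRCH itself: it is a simplified stand-in that keeps only BIRCH's idea of absorbing a point into an existing sub-cluster when it lies within a radius threshold of the centroid, and it omits the clustering feature tree. The function names, the threshold value and the latency samples are hypothetical; an actual implementation might instead rely on a library class such as scikit-learn's `sklearn.cluster.Birch`.

```python
def threshold_clusters(values, threshold):
    """Group 1-D values incrementally; each cluster is stored as [count, sum].

    A value joins the nearest existing cluster whose centroid is within
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for v in values:
        best = None
        for c in clusters:
            distance = abs(v - c[1] / c[0])
            if distance <= threshold and (
                    best is None or distance < abs(v - best[1] / best[0])):
                best = c
        if best is None:
            clusters.append([1, v])
        else:
            best[0] += 1
            best[1] += v
    return clusters


def is_abnormal_call(p90_series, threshold):
    """Flag a call as abnormal when its P90 latencies form more than one cluster."""
    return len(threshold_clusters(p90_series, threshold)) > 1


# Hypothetical P90 latency samples (milliseconds) for two service-to-service calls
stable_call = [10, 11, 10, 12, 11]
spiky_call = [10, 11, 10, 250, 260, 11]
stable_flag = is_abnormal_call(stable_call, threshold=5)
spiky_flag = is_abnormal_call(spiky_call, threshold=5)
```

The stable series collapses into a single cluster and is left out of the graph, whereas the series with latency spikes splits into several clusters, so the corresponding call would become an edge of the SSDG.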
In one embodiment, step S3 includes:
S3.1: constructing frequent item sets with the Apriori algorithm based on the chaos engineering data set;
S3.2: mining strong association relations between different microservice systems based on the constructed frequent item sets;
S3.3: verifying the causal relations between strongly associated abnormal services with the Granger causality test, and constructing the multi-system abnormal service dependency graph.
In one embodiment, step S3.1 comprises:
scanning all abnormal microservices in the chaos engineering data set, where different microservices are treated as different items and 1-item sets are generated by arranging and combining the items, each 1-item set belonging to the set $C_1$;
counting each item and, based on the minimum support, deleting from all 1-item sets those that do not meet the minimum support, thereby obtaining the set $L_1$ of frequent 1-item sets;
generating the set $C_2$ of 2-item sets from $L_1$ by the self-join and pruning strategy, scanning the chaos engineering data set and counting each item set in $C_2$, and deleting the item sets that do not meet the minimum support, thereby obtaining the set $L_2$ of frequent 2-item sets; similarly, generating the set $C_k$ of k-item sets from $L_{k-1}$ by the self-join and pruning strategy, scanning the transaction set and counting each item set in $C_k$, and deleting the item sets that do not meet the minimum support to obtain the set $L_k$ of frequent k-item sets.
In a specific implementation, $L_k$ is the set of frequent k-item sets and $C_k$ is the set of k-item sets.
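The level-wise generation of frequent item sets in step S3.1 can be sketched as below. This is a minimal Apriori sketch under stated assumptions: each transaction is taken to be the set of abnormal services observed in one chaos experiment, the minimum support is expressed as an absolute count, and the function name `apriori` and the service labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent item set together with its support count."""

    def count(candidates):
        # Support count = number of transactions containing the candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C_1: all 1-item sets; L_1: those meeting the minimum support.
    c1 = {frozenset([item]) for t in transactions for item in t}
    current = {s: n for s, n in count(c1).items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Self-join: unite pairs from L_{k-1} that yield a k-item set.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        # Scan the transactions and keep the candidates meeting min_support.
        current = {s: n for s, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

For example, with four transactions over the hypothetical services `A.cart`, `B.order`, and `B.pay` and a minimum support of 2, the pair `{B.order, B.pay}` is dropped at level 2, so the 3-item candidate is pruned without another scan.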
In one embodiment, step S3.2 comprises:
generating, for each frequent k-item set, all of its non-empty subsets;
letting two item sets be X and Y, the association rule is defined as $X \Rightarrow Y$, meaning that item set Y can be derived from item set X; the confidence of the association rule $X \Rightarrow Y$ is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$; when $\mathrm{conf}(X \Rightarrow Y) \ge conf_{min}$, the rule $X \Rightarrow Y$ is obtained, indicating that the occurrence of item set X leads to the occurrence of item set Y with probability (confidence) $\mathrm{conf}(X \Rightarrow Y)$, where $conf_{min}$ denotes the minimum confidence.
Specifically, mining strong associations between services of different systems based on the frequent item sets means using the set $L_k$ of frequent k-item sets to construct the strong associations between different services.
All non-empty subsets of a frequent item set are also frequent, which ensures that every strong association rule generated involves only frequent k-item sets and their subsets.
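The confidence test of step S3.2 can be illustrated with a short sketch. It assumes the frequent item sets are available as a mapping from item set to support count (for instance the output of an Apriori run); the function name `strong_rules` and the item labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def strong_rules(frequent: Dict[FrozenSet[str], int], min_conf: float) -> List[Rule]:
    """Derive the rules X => Y whose confidence reaches min_conf.

    confidence(X => Y) = support(X union Y) / support(X).
    Because every non-empty subset of a frequent item set is frequent,
    support(X) can be looked up directly in `frequent`.
    """
    rules: List[Rule] = []
    for itemset, support_xy in frequent.items():
        if len(itemset) < 2:
            continue
        # Enumerate every non-empty proper subset X; Y is the remainder.
        for size in range(1, len(itemset)):
            for x_items in combinations(sorted(itemset), size):
                x = frozenset(x_items)
                y = itemset - x
                conf = support_xy / frequent[x]
                if conf >= min_conf:
                    rules.append((x, y, conf))
    return rules
```

With supports `{A}: 4`, `{B}: 2`, `{A, B}: 2` and a minimum confidence of 0.8, only `B => A` (confidence 1.0) is kept, since `A => B` reaches only 0.5. Restricting the rules to pairs of services from different microservice systems would be an additional filter on top of this sketch.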
In one embodiment, step S3.3 includes:
detecting whether the abnormal services in the abnormal service dependency graphs (SSDGs) of the individual systems appear in the frequent item sets; if an abnormal service appears in a frequent item set, testing the container-level metrics in the hybrid deployment scenario with the Granger causality test, and if the changes of a container-level metric exhibit a causal relation, the service anomalies in the different microservice systems are considered causally related;
if no abnormal service appears in the frequent item sets, performing the causality test on all abnormal services across the different microservice systems; when the changes of a container-level metric are found to exhibit a causal relation, the abnormal services in the different systems are considered causally related.
Specifically, verifying the causal relations between strongly associated abnormal services with the Granger causality test and constructing the multi-system abnormal service dependency graph (MSDG) means that the causal relations between services of different systems must be determined before the MSDG is constructed.
In a specific implementation, as shown in part (a) of FIG. 4, if abnormal service $S_A$ of microservice system A is the cause of abnormal service $S_B$ of microservice system B, a directed edge from $S_A$ to $S_B$ is added to the MSDG. The direction of the edge represents the direction of the causal relation and can be interpreted as a dependency between the abnormal services.
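The Granger causality test used in step S3.3 compares how well one series is predicted from its own past with and without the past of another series. The sketch below computes the F statistic for a single lag in pure Python. The lag order, the function names, and any decision threshold applied to the statistic are assumptions; the text does not specify them.

```python
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular (e.g. a constant series).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def _rss(rows: List[List[float]], targets: List[float]) -> float:
    """Residual sum of squares of an ordinary least-squares fit."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f_lag1(x: Sequence[float], y: Sequence[float]) -> float:
    """F statistic for the hypothesis 'x Granger-causes y' with one lag.

    Restricted model:   y_t = a0 + a1 * y_{t-1}
    Unrestricted model: y_t = b0 + b1 * y_{t-1} + b2 * x_{t-1}
    """
    targets = list(y[1:])
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = _rss(restricted, targets)
    rss_u = _rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)
```

A large statistic for the metric of $S_A$ against the metric of $S_B$ would support adding the directed edge from $S_A$ to $S_B$. A production setup would more likely rely on a statistics library, for example `grangercausalitytests` in `statsmodels.tsa.stattools`, which also reports p-values for several lag orders.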
In one embodiment, step S4 includes:
extracting all container-level indexes of two services under the same micro-service system and P90 delay data between the two services;
and calculating the Pearson correlation coefficient between the extracted container level index and the P90 delay data, taking the obtained value of the maximum positive correlation coefficient as the weight of the directed edge between services under the same micro-service system, and updating the weight of the abnormal service dependency graph of the multiple systems.
Specifically, for an edge between services in the same microservice system, all container-level metrics of the two services and the P90 delay data between them are extracted, and the Pearson correlation coefficient between each extracted container-level metric and the P90 delay data is calculated. The largest positive correlation coefficient obtained is taken as the weight of the directed edge between the services. In a specific implementation, a positive Pearson correlation coefficient indicates that the two series are positively correlated, and the stronger the correlation, the larger the value of the coefficient r. Therefore, by calculating the Pearson correlation coefficients between the container-level metrics of the services and the delay data between the services, the degree of correlation between services in the same microservice system can be found and used as the weight of the directed edge.
In one embodiment, for edges between services of different microservice applications, the weight is set to a fixed value α (α ∈ [0, 1]). α can be tuned by the developer and typically takes the value 0.4.
The MSDG after the weight update is shown in part (b) of FIG. 4, which shows the weighted MSDG of systems A and B.
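The edge weighting of step S4 can be sketched as follows. This is an illustration under assumptions: the metric names are hypothetical, and returning 0.0 when no metric is positively correlated (or when a series is constant) is a choice made here, not something the text specifies.

```python
from math import sqrt
from typing import Dict, Sequence

ALPHA = 0.4  # fixed weight for edges between different systems (typical value)


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant series carries no correlation
    return cov / sqrt(var_a * var_b)


def intra_system_weight(container_metrics: Dict[str, Sequence[float]],
                        p90_delay: Sequence[float]) -> float:
    """Largest positive Pearson coefficient between any container-level
    metric and the P90 delay; 0.0 if none is positive (assumption)."""
    best = max((pearson(series, p90_delay) for series in container_metrics.values()),
               default=0.0)
    return max(best, 0.0)


def edge_weight(same_system: bool,
                container_metrics: Dict[str, Sequence[float]],
                p90_delay: Sequence[float]) -> float:
    """Weight of a directed MSDG edge: correlation-based within one system,
    the fixed value ALPHA across systems."""
    if same_system:
        return intra_system_weight(container_metrics, p90_delay)
    return ALPHA
```

For instance, if CPU usage rises in step with the P90 delay while memory usage falls, the CPU coefficient (close to 1) becomes the edge weight and the negative memory coefficient is ignored.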
In one embodiment, step S5 includes:
S5.1: the basic transition matrix M of the multi-system abnormal service dependency graph (MSDG) is defined by formula (1):
$M = [m_{ij}]_{n \times n}$ (1)
for each node v in the MSDG, suppose it has k outgoing edges connecting it to nodes $u_1, u_2, \ldots, u_k$; the element in row i and column j of M is set to the weight $w_{ij}$ of the edge divided by the out-degree k of node v, as in formula (2):
$m_{ij} = w_{ij} / k$ (2)
where each element of M represents the transition probability from one node to another;
S5.2: introducing a completely random transition matrix E, in which the transition probability from one node to any node is 1/n, where n is the number of nodes in the MSDG, and defining a damping factor d that controls the proportion between M and E, i.e. the linear combination coefficient, where $0 \le d \le 1$;
S5.3: obtaining the complete transition matrix P of the MSDG as the weighted average of M and E, i.e. $P = dM + (1-d)E$, and using P as the transition matrix of a general random-walk Markov chain for iterative computation: in each iteration the current state vector is multiplied by P to obtain a new state vector, and the process is repeated until the state vector converges to the stationary distribution R, where R is an n-dimensional vector whose components sum to 1 and each component is the score (PageRank value) of the corresponding node in the MSDG, representing the importance and influence of that node; R is given by formula (3):
$R = [PR(v_1), PR(v_2), \ldots, PR(v_n)]^{T}$ (3)
where $PR(v_1)$ and $PR(v_n)$ denote the PageRank values of nodes $v_1$ and $v_n$, respectively;
S5.4: sorting the PageRank values of the nodes in the MSDG in descending order and taking the top-ranked service as the root-cause microservice, i.e. the microservice most likely to have caused the anomaly.
Specifically, the nodes in the MSDG correspond one-to-one to the microservices in the microservice systems. Ranking the PageRank values of the nodes in the MSDG yields the microservices with the highest scores, from which the possible root causes are determined. In this process, the service that ranks first is generally considered the root-cause microservice, i.e. the microservice most likely to have caused the anomaly.
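Steps S5.1 to S5.4 can be sketched as a power iteration. This is a minimal sketch under assumptions: the default damping factor 0.85 is the conventional PageRank value rather than one given in the text, the state vector is renormalized in every iteration so that its components sum to 1, d is taken to be strictly below 1, and the node names and the function name `rank_root_causes` are hypothetical. The edge orientation is taken as given by the input graph.

```python
from typing import Dict, List, Tuple


def rank_root_causes(nodes: List[str],
                     edges: Dict[Tuple[str, str], float],
                     d: float = 0.85,
                     tol: float = 1e-10,
                     max_iter: int = 1000) -> List[Tuple[str, float]]:
    """Rank MSDG nodes by the stationary distribution of P = d*M + (1-d)*E."""
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}

    # Out-degree k of every node.
    out_degree = [0] * n
    for src, _dst in edges:
        out_degree[index[src]] += 1

    # Formula (2): m_ij = w_ij / k.
    m = [[0.0] * n for _ in range(n)]
    for (src, dst), w in edges.items():
        i, j = index[src], index[dst]
        m[i][j] = w / out_degree[i]

    # P = d*M + (1-d)*E, with every entry of E equal to 1/n.
    p = [[d * m[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]

    # Power iteration: multiply the state vector by P until it converges.
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(r[i] * p[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)  # positive whenever d < 1
        nxt = [v / total for v in nxt]
        converged = max(abs(a - b) for a, b in zip(nxt, r)) < tol
        r = nxt
        if converged:
            break

    # S5.4: descending order of PageRank value.
    return sorted(zip(nodes, r), key=lambda item: item[1], reverse=True)
```

The first entry of the returned list is the candidate root-cause microservice; returning the full ranking rather than only the top entry makes it easy to inspect the next few candidates as well.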
In this embodiment, the present application has been tested on the open-source microservice systems Online-Boutique, Sock-Shop, and Train-Ticket.
FIG. 5 shows the experimental results of the present application on Online-Boutique, Sock-Shop, and Train-Ticket, where hybrid MRCL denotes the present application. The present application is compared with Random Walk (RW) and with two variants each of MicroRCA and FRL-MFPG. In one variant, the links between the multiple systems and their nodes are added manually so that MicroRCA and FRL-MFPG can run in a hybrid deployment environment; in the other, the relations between the multiple microservice systems are constructed with the method of the present application. The experimental results show that the present application achieves higher accuracy than the existing methods.
Embodiment Two
Based on the same inventive concept, this embodiment discloses a root cause positioning system for microservice systems in a hybrid deployment scenario. Referring to FIG. 2, the system includes:
a data collection module, configured to conduct preliminary chaos engineering experiments on the microservice systems in the hybrid deployment scenario, collect a chaos engineering data set, and continuously monitor and collect service-level metrics and container-level metrics through a monitoring tool;
a single-application abnormal graph construction module, configured to obtain the calling relations of the different microservice systems in the hybrid deployment scenario with an unsupervised learning algorithm and to construct, for each microservice system, an abnormal service dependency graph of the single system, in which the nodes are abnormal services and the edges represent the calling relations among the abnormal services;
a multi-application abnormal graph construction module, configured to obtain the relations between different microservices in the hybrid deployment scenario with a frequent item set mining algorithm and a causal inference algorithm and to construct a multi-system abnormal service dependency graph, in which the nodes are abnormal services and the edges represent the dependency relations between the abnormal services;
a comprehensive ranking module, configured to update the weights of the multi-system abnormal service dependency graph according to the association between the container-level metrics and the service-level metrics of two services in the same microservice system,
and to execute a personalized random walk algorithm on the weight-updated multi-system abnormal service dependency graph to achieve root cause positioning.
Since the system described in Embodiment Two is the system used to implement the root cause positioning method for microservice systems in a hybrid deployment scenario of Embodiment One, a person skilled in the art can understand its specific structure and variations from the method described in Embodiment One, so the details are not repeated here. Any system used to implement the method of Embodiment One falls within the scope of protection of the present application.
Embodiment Three
Based on the same inventive concept, the present application also provides a computer-readable storage medium on which a computer program is stored; when executed, the program implements the method described in Embodiment One.
Since the computer-readable storage medium described in Embodiment Three is the one used to implement the root cause positioning method for microservice systems in a hybrid deployment scenario of Embodiment One, a person skilled in the art can understand its specific structure and variations from the method described in Embodiment One, so the details are not repeated here. Any computer-readable storage medium used to implement the method of Embodiment One falls within the scope of protection of the present application.
Embodiment Four
Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method of Embodiment One when executing the program.
Since the computer device described in Embodiment Four is the one used to implement the root cause positioning method for microservice systems in a hybrid deployment scenario of Embodiment One, a person skilled in the art can understand its specific structure and variations from the method described in Embodiment One, so the details are not repeated here. Any computer device used to implement the method of Embodiment One falls within the scope of protection of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit or scope of the embodiments of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is also intended to include such modifications and variations.

Claims (10)

1. The root cause positioning method of the micro-service system for the mixed deployment scene is characterized by comprising the following steps:
S1: conducting preliminary chaos engineering experiments on the micro-service systems in the mixed deployment scene, collecting a chaos engineering data set, and continuously monitoring and collecting a service level index and a container level index through a monitoring tool;
S2: acquiring calling relations of different micro-service systems in the mixed deployment scene by adopting an unsupervised learning algorithm, and constructing an abnormal service dependency graph of a single system for each micro-service system, wherein nodes in the abnormal service dependency graph of the single system are abnormal services, and the edges represent the calling relations among the abnormal services;
S3: adopting a frequent item set mining algorithm and a causal inference algorithm to obtain the relations between different micro services of the mixed deployment scene, and constructing an abnormal service dependency graph of a plurality of systems, wherein nodes in the abnormal service dependency graph of the plurality of systems are abnormal services, and the edges represent the dependency relationship between the abnormal services;
S4: updating the weight of the abnormal service dependency graph of the plurality of systems according to the association between the container level index and the service level index of two services under the same micro-service system;
S5: executing a personalized random walk algorithm on the abnormal service dependency graph of the plurality of systems after the weight is updated, thereby realizing root cause positioning.
2. The root cause positioning method of a micro service system for a hybrid deployment scenario of claim 1, wherein step S1 comprises:
collecting a chaos engineering data set by using a chaos engineering tool to inject anomalies into micro-service system instances in the mixed deployment scene, wherein the types of the injected anomalies comprise instance anomalies, network anomalies, file system anomalies and stress anomalies;
the service level index comprises an average delay index, a P90 delay index and a P99 delay index of each micro service, and the container level index comprises CPU, memory, network and file system indexes during micro-service operation.
3. The root cause positioning method of a micro service system for a hybrid deployment scenario according to claim 2, wherein step S2 comprises:
performing cluster analysis on the P90 delay indexes between different micro services by adopting an unsupervised learning algorithm to find the candidate set of abnormal services, wherein if the input delay data are grouped into a single cluster, the collected delay data are considered stable, and if the input delay data are grouped into multiple clusters, the collected delay data are considered dispersed, in which case the delay between the micro services is regarded as an abnormal delay and the call between the micro services is regarded as an abnormal call;
and constructing the abnormal service dependency graph of the single system by taking the abnormal services as nodes and the calling relations among the abnormal services as edges.
4. The root cause positioning method of a micro service system for a hybrid deployment scenario according to claim 1, wherein step S3 comprises:
S3.1: constructing frequent item sets by using the Apriori algorithm based on the chaos engineering data set;
S3.2: mining strong association relations between different micro-service systems based on the constructed frequent item sets;
S3.3: verifying the causal relations between strongly associated abnormal services by using the Granger causality test, and constructing the abnormal service dependency graph of the plurality of systems.
5. The method for root cause positioning of a micro service system for a hybrid deployment scenario of claim 4, wherein step S3.1 comprises:
scanning all abnormal micro-services in the chaos engineering data set, wherein different micro-services are used as different items, 1-item sets are generated by arranging and combining the items, and each 1-item set belongs to the set $C_1$;
counting each item, and deleting from all 1-item sets those which do not meet the minimum support, thereby obtaining the set $L_1$ of frequent 1-item sets;
generating the set $C_2$ of 2-item sets from $L_1$ by the self-join and pruning strategy, scanning the chaos engineering data set and counting each item set in $C_2$, and deleting the item sets which do not meet the minimum support, thereby obtaining the set $L_2$ of frequent 2-item sets; similarly, generating the set $C_k$ of k-item sets from $L_{k-1}$ by the self-join and pruning strategy, scanning the transaction set and counting each item set in $C_k$, and deleting the item sets which do not meet the minimum support to obtain the set $L_k$ of frequent k-item sets.
6. The method for root cause positioning of a micro service system for a hybrid deployment scenario of claim 5, wherein step S3.2 comprises:
generating, for each frequent k-item set, all of its non-empty subsets;
setting two item sets as X and Y respectively, and defining the association rule as $X \Rightarrow Y$, meaning that item set Y can be derived from item set X; for the association rule $X \Rightarrow Y$, the confidence is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted $\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$; wherein, when $\mathrm{conf}(X \Rightarrow Y) \ge conf_{min}$, the rule $X \Rightarrow Y$ is obtained, representing that the occurrence of item set X will cause the occurrence of item set Y with a probability or confidence of $\mathrm{conf}(X \Rightarrow Y)$, and $conf_{min}$ represents the minimum confidence.
7. The method for root cause positioning of a micro service system for a hybrid deployment scenario of claim 4, wherein step S3.3 comprises:
detecting whether the abnormal services in the abnormal service dependency graphs of the plurality of single systems appear in the frequent item sets; if an abnormal service appears in a frequent item set, testing the container-level indexes in the mixed deployment scene by using the Granger causality test, and if the change of a certain container-level index exhibits a causal relation, indicating that the service anomalies among the different micro-service systems are causally related;
if no abnormal service appears in the frequent item sets, carrying out the causality test on all abnormal services among the different micro-service systems, and when the change of a certain container-level index is found to exhibit a causal relation, indicating that the abnormal services among the different systems are causally related.
8. The root cause positioning method of a micro service system for a hybrid deployment scenario of claim 1, wherein step S4 comprises:
extracting all container-level indexes of two services under the same micro-service system and P90 delay data between the two services;
and calculating the Pearson correlation coefficient between the extracted container level index and the P90 delay data, taking the obtained value of the maximum positive correlation coefficient as the weight of the directed edge between services under the same micro-service system, and updating the weight of the abnormal service dependency graph of the multiple systems.
9. The root cause positioning method of a micro service system for a hybrid deployment scenario of claim 1, wherein step S5 comprises:
S5.1: the basic transition matrix M of the abnormal service dependency graph MSDG of the plurality of systems is defined by formula (1):
$M = [m_{ij}]_{n \times n}$ (1)
for each node v in the MSDG, it is assumed that it has k outgoing edges connecting it to nodes $u_1, u_2, \ldots, u_k$; the element in row i and column j of M is set to the weight $w_{ij}$ of the edge divided by the out-degree k of node v, as represented by formula (2):
$m_{ij} = w_{ij} / k$ (2)
wherein each element in M represents a transition probability from one node to another;
S5.2: introducing a completely random transition matrix E, wherein the transition probability from one node to any node is 1/n, n is the number of nodes in the MSDG, and a damping factor d is defined for controlling the proportion between M and E, where $0 \le d \le 1$;
S5.3: obtaining the complete transition matrix P of the MSDG by weighted averaging of M and E, i.e. $P = dM + (1-d)E$, performing iterative computation using P as the transition matrix of a general random-walk Markov chain, multiplying the current state vector by P in each iteration to obtain a new state vector, and repeating this process until the state vector converges to the stationary distribution R, where R is an n-dimensional vector whose components sum to 1, each component representing the score of the corresponding node in the MSDG, i.e. its PageRank value, which represents the importance and influence of the node in the MSDG, and R is represented by formula (3):
$R = [PR(v_1), PR(v_2), \ldots, PR(v_n)]^{T}$ (3)
wherein $PR(v_1)$ and $PR(v_n)$ respectively represent the PageRank values of node $v_1$ and node $v_n$;
S5.4: sorting the PageRank values of the nodes in the MSDG in descending order, and taking the top-ranked service as the root cause micro-service, namely the micro-service most likely to cause the abnormal condition.
10. The root cause positioning system of the micro service system for the mixed deployment scene is characterized by comprising the following components:
a data collection module, configured to conduct preliminary chaos engineering experiments on the micro-service systems in the mixed deployment scene, collect a chaos engineering data set, and continuously monitor and collect service level indexes and container level indexes through a monitoring tool;
a single-application abnormal graph construction module, configured to acquire calling relations of different micro-service systems in the mixed deployment scene by adopting an unsupervised learning algorithm, and to construct an abnormal service dependency graph of a single system for each micro-service system, wherein nodes in the abnormal service dependency graph of the single system are abnormal services, and the edges represent the calling relations among the abnormal services;
a multi-application abnormal graph construction module, configured to obtain the relations between different micro services of the mixed deployment scene by adopting a frequent item set mining algorithm and a causal inference algorithm, and to construct an abnormal service dependency graph of a plurality of systems, wherein nodes in the abnormal service dependency graph of the plurality of systems are abnormal services, and the edges represent the dependency relationship between the abnormal services;
a comprehensive ranking module, configured to update the weight of the abnormal service dependency graph of the plurality of systems according to the association between the container level index and the service level index of two services under the same micro-service system,
and to execute a personalized random walk algorithm on the abnormal service dependency graph of the plurality of systems after the weight is updated, thereby realizing root cause positioning.
CN202310569212.0A 2023-05-17 2023-05-17 Root cause positioning method and system for micro-service system facing mixed deployment scene Pending CN116737436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310569212.0A CN116737436A (en) 2023-05-17 2023-05-17 Root cause positioning method and system for micro-service system facing mixed deployment scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310569212.0A CN116737436A (en) 2023-05-17 2023-05-17 Root cause positioning method and system for micro-service system facing mixed deployment scene

Publications (1)

Publication Number Publication Date
CN116737436A true CN116737436A (en) 2023-09-12

Family

ID=87903534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310569212.0A Pending CN116737436A (en) 2023-05-17 2023-05-17 Root cause positioning method and system for micro-service system facing mixed deployment scene

Country Status (1)

Country Link
CN (1) CN116737436A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149500A (en) * 2023-10-30 2023-12-01 安徽思高智能科技有限公司 Abnormal root cause obtaining method and system based on index data and log data
CN117149500B (en) * 2023-10-30 2024-01-26 安徽思高智能科技有限公司 Abnormal root cause obtaining method and system based on index data and log data
CN118054974A (en) * 2024-04-15 2024-05-17 浙江保融科技股份有限公司 Flow control method in private deployment scene
CN118427578A (en) * 2024-07-04 2024-08-02 安徽思高智能科技有限公司 Micro-service system data evaluation method, device and medium based on chaotic engineering

Similar Documents

Publication Publication Date Title
US10867244B2 (en) Method and apparatus for machine learning
US8918431B2 (en) Adaptive ontology
US8392760B2 (en) Diagnosing abnormalities without application-specific knowledge
CN116737436A (en) Root cause positioning method and system for micro-service system facing mixed deployment scene
US11514308B2 (en) Method and apparatus for machine learning
US20160055044A1 (en) Fault analysis method, fault analysis system, and storage medium
CN108320171A (en) Hot item prediction technique, system and device
CN103513983A (en) Method and system for predictive alert threshold determination tool
CN115237717A (en) Micro-service abnormity detection method and system
CN111104242A (en) Method and device for processing abnormal logs of operating system based on deep learning
CN115756929A (en) Abnormal root cause positioning method and system based on dynamic service dependency graph
CN117061322A (en) Internet of things flow pool management method and system
CN117560275B (en) Root cause positioning method and device for micro-service system based on graphic neural network model
CN114463072A (en) E-business service optimization method based on business demand AI prediction and big data system
AU2021240196B1 (en) Utilizing machine learning models for determining an optimized resolution path for an interaction
Wang et al. Aistar: an intelligent system for online it ticket automation recommendation
Grishma et al. Software root cause prediction using clustering techniques: A review
Liu et al. A multi-source approach for bug triage
JP2010272004A (en) Discriminating apparatus, discrimination method, and computer program
CN111695583A (en) Feature selection method based on causal network
US20230161637A1 (en) Automated reasoning for event management in cloud platforms
CN109685308A (en) A kind of complication system critical path appraisal procedure and system
EP4149075B1 (en) Automatic suppression of non-actionable alarms with machine learning
CN118427578B (en) Micro-service system data evaluation method, device and medium based on chaotic engineering
US20240248836A1 (en) Bootstrap method for continuous deployment in cross-customer model management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination