CN112118127B - Service reliability guarantee method based on fault similarity


Info

Publication number: CN112118127B (application publication CN112118127A)
Authority: CN (China)
Prior art keywords: fault, service, execution, request, normal
Legal status: Active (granted)
Application number: CN202010789841.0A
Other languages: Chinese (zh)
Inventors: 王焘 (Wang Tao), 陈皓 (Chen Hao), 张文博 (Zhang Wenbo), 许源佳 (Xu Yuanjia)
Current Assignee: Institute of Software of CAS
Original Assignee: Institute of Software of CAS
Application filed by: Institute of Software of CAS
Priority and filing date: 2020-08-07
Publication of CN112118127A: 2020-12-22
Grant and publication of CN112118127B: 2021-11-09

Classifications

    • H (Electricity) > H04 (Electric communication technique) > H04L (Transmission of digital information, e.g. telegraphic communication) > H04L41/00 (Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks) > H04L41/06 (Management of faults, events, alarms or notifications)
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a service reliability guarantee method based on fault similarity, which comprises the following steps: parsing an execution trace to construct a directed weighted graph; comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal; if the execution trace is a faulty execution trace, converting the services called in the trace, in calling order, into an unknown fault request string; extracting the fault call strings from a known fault database and computing their similarity to the unknown fault request string to obtain the fault cause of the execution trace; detecting, according to the fault cause, whether the service failure type exists; if the service failure type exists, restarting the service; and if the service failure type does not exist, collecting the service container metrics and performing adaptive matching of the service container. The method is based on directional fault injection, couples execution traces with fault causes, and can judge faults of the monitored nodes.

Description

Service reliability guarantee method based on fault similarity
Technical Field
The invention belongs to the technical field of software, and particularly relates to a service reliability guarantee method based on fault similarity.
Background
In the face of the massive requests of the internet era, even a short service failure degrades user experience and ratings, and a prolonged service failure can cause serious economic losses to enterprises. With the rapid growth of cloud computing data volumes and the continuous expansion of cluster sizes, high cluster reliability has become increasingly important. Service reliability refers to the probability that software runs without error within a given time and in a specific environment; accurately measuring reliability requires specifying the time, the running environment, the conditions, and the functions. Service monitoring and fault-handling techniques help operation and maintenance personnel monitor the running state of the containers in a distributed service cluster, allocate resources, and ensure reliable operation of the whole service system.
Existing service reliability guarantee methods fall mainly into three categories: analysis of metric information, analysis of log files, and analysis of execution traces. Methods based on metric analysis collect the metric values of a logical measurement unit or of a time window; through a series of operations, fixed or dynamically changing thresholds can be set and used as alarm rules for system anomalies, sending warnings to operation and maintenance personnel or serving as scheduling rules for cluster tasks. A representative method is described in the literature (S. Chouliaras and S. Sotiriadis, "Real-Time Anomaly Detection of NoSQL Systems Based on Resource Usage Monitoring," IEEE Transactions on Industrial Informatics, vol. 16, no. 9, pp. 6042-6049, Sept. 2020, doi:10.1109/TII.2019.2958606), which compares different machine learning models against the operating patterns of NoSQL systems based on real-time monitoring of resource usage, and characterizes them through resource usage monitoring and process information extraction to identify different anomaly patterns. Such methods do not need to know the structure and relationships of the services in the system, but they must know the anomaly types and anomaly characteristics in advance, so their flexibility is poor. Methods based on log file analysis collect metadata from discrete log files; the log files record the massive scattered events or request information generated during system operation, and runtime errors can be found by configuring retrieval patterns, which helps to locate the causes of system anomalies. A representative method is described in the literature (Y. Yuan, W. Shi, B. Liang and B. Qin, "An Approach to Cloud Execution Failure Diagnosis Based on Exception Logs in OpenStack," 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 2019, pp. 124-131, doi:10.1109/CLOUD.2019.00031), which focuses on identifying the exception logs generated by system tasks and compares them, in a lightweight manner, with the logs of historical system tasks to recognize particular anomaly types. Such methods must collect a large number of scattered log files and extract key fault information from them; because log collection and information extraction lag behind, it is difficult to analyze the faults present in the system in real time. Methods based on execution traces collect all the information of a single request and reconstruct the internal structural characteristics of the system; when the system is abnormal the request processing trajectory deviates, so anomalies can be located and fault causes diagnosed by analyzing the trajectory. A representative method is described in the literature (S. Zhang, Y. Wang, W. Li and X. Qiu, "Service failure diagnosis in service function chaining," 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS), Seoul, 2017, pp. 70-75, doi:10.1109/APNOMS.2017.8094181), which proposes a low-cost detection method that computes the topology of the network functions, avoids repeated probing of the links between network functions, and adds timestamps to network headers to analyze quality of service. Such methods can diagnose system performance problems, but an overly fine monitoring granularity brings huge monitoring and analysis resource overhead, so it is difficult to balance monitoring granularity against monitoring overhead.
In summary, existing service reliability guarantee methods have the following problems. (1) Among the various monitoring approaches, methods based on metric analysis and on log file analysis must know the anomaly characteristics in advance, so they have difficulty coping with sudden anomalies, while methods based on execution traces lack monitoring of the system operating indicators (such as CPU, memory, disk, and network). (2) Execution traces of the same request are similar to one another, but there is no index for measuring the similarity between execution traces, so the prior art has difficulty discovering service anomalies by analyzing that similarity, and faults therefore cannot be discovered and diagnosed quickly. (3) Diagnosing the cause of a fault depends on the characteristic information of historical system faults, and there is no method for diagnosing the possible causes of unknown faults; the prior art therefore has to keep iterating its fault-characteristic diagnosis rules, and the diagnosis of unknown faults depends on the skill and experience of operation and maintenance personnel, so the reliability guarantee method is inflexible and can hardly cope with rare system faults.
Disclosure of Invention
Aiming at these problems of existing service reliability guarantee methods, namely the lack of an operating-indicator monitoring method for sudden anomalies, of comparison rules for the execution-trace information of service requests, and of a fault-cause detection method that can flexibly cope with unknown faults, the invention provides a service reliability guarantee method based on fault similarity.
The technical solution of the invention comprises:
a service reliability guarantee method based on fault similarity comprises the following steps:
1) analyzing an execution trace generated during system operation to construct a directed weighted graph, wherein a vertex in the directed weighted graph is represented by a [service id, service requester id, callee id, request elapsed time, other method information] tuple, and the other method information is represented by a [service name, start timestamp, tags contained in the request] tuple; a directed edge is denoted e(a→b), where a denotes the service requester identifier, b denotes the callee identifier, and the weight of the directed edge is the request elapsed time;
2) comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal according to the request elapsed time and the normal request elapsed time;
3) if the execution trace is a faulty execution trace, converting each service name, in the calling order of the services called in the trace, into a corresponding fixed-length string, and concatenating the strings in calling order to obtain an unknown fault request string;
4) extracting all the fault call strings of the known faults from a known fault database, and computing the similarity between the unknown fault request string and each fault call string to obtain the fault cause of the execution trace;
5) obtaining the corresponding service failure type according to the fault cause, and detecting whether the service failure type exists;
6) if the service failure type exists, restarting the service; and if the service failure type does not exist, collecting the service container metrics and performing adaptive matching of the service container.
Further, the execution traces are collected by setting the Mixer-specified execution trace collection interface zipkin-address-url to the open interface of Jaeger.
Further, the known fault database is built by the following steps (a construction sketch follows this list):
a) injecting a plurality of sample faults into a normally running system and sending a plurality of requests to obtain a plurality of known fault execution traces;
b) constructing the known fault directed weighted graphs of the plurality of known fault execution traces;
c) classifying all the known fault directed weighted graphs according to their fault causes to obtain a plurality of groups of known fault directed weighted graphs, which form the known fault database.
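A minimal Python sketch of this database construction is given below; inject_fault and collect_fault_trace are hypothetical stand-ins for the fault-injection tool and the trace collector, which the patent does not name, and the number of traces collected per fault is an arbitrary illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

def build_known_fault_database(
    fault_causes: List[str],
    inject_fault: Callable[[str], None],     # hypothetical: inject one sample fault of the given cause
    collect_fault_trace: Callable[[], Any],  # hypothetical: send a request, return its directed weighted graph
    traces_per_fault: int = 20,
) -> Dict[str, List[Any]]:
    """Known fault database: fault cause -> list of known-fault directed weighted graphs."""
    kfdb: Dict[str, List[Any]] = defaultdict(list)
    for cause in fault_causes:
        inject_fault(cause)                            # a) inject a sample fault into the normally running system
        for _ in range(traces_per_fault):
            kfdb[cause].append(collect_fault_trace())  # a)+b) obtain the trace and its directed weighted graph
    return dict(kfdb)                                  # c) graphs classified by fault cause form the database
```

The resulting mapping from fault cause to directed weighted graphs is what the later similarity matching consults.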
Further, whether the execution trace is normal is judged by the following steps:
1) calculating, from the weights of all directed edges of the plurality of normal directed weighted graphs, the upper limit t_max = μ + z_α·σ/√n and the lower limit t_min = μ - z_α·σ/√n of the normal request elapsed time, where μ denotes the average elapsed time of the n normal requests of the normal execution traces, σ denotes the standard deviation of the elapsed time of the normal requests, and z_α denotes the α quantile of the normal distribution;
2) if t_min ≤ request elapsed time ≤ t_max, the execution trace is a normal execution trace; if the request elapsed time < t_min or the request elapsed time > t_max, the execution trace is a faulty execution trace.
Further, the fault call string is obtained by the following steps:
1) for the execution trace of each known fault in the known fault database, retrieving in chronological order the vertex S_1 of the known fault directed weighted graph, and converting the service name corresponding to vertex S_1 into a fixed-length string c_1 using a mapping function;
2) obtaining, according to the calling relations, each next connected vertex S_i of the directed weighted graph, and converting the service name corresponding to vertex S_i into a fixed-length string c_i using the mapping function, where 2 ≤ i ≤ p and p is the total number of services contained in the execution trace of the known fault;
3) obtaining the fault call string from the string set {c_1, …, c_i, …, c_p}.
Further, the similarity between the unknown fault request string and each fault call string is s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of known fault execution traces corresponding to the known fault f, d_i is the edit distance between the request string of each known fault execution trace fi and that of e, l_fi is the known fault string length of each known fault execution trace fi, l_e is the request string length of the execution trace corresponding to the unknown fault e, and m is the number of known fault execution traces corresponding to the known fault f.
Further, the fault cause comprises a queue of known faults arranged in descending order of similarity.
Further, the method of detecting the presence of the service failure type includes performing survivability detection using a probe.
Further, the service failure type includes that the status code is not 0 when the container exits, the designated port of the IP address is not opened, and the response code of the container HTTP request of the designated port is less than 200 or greater than or equal to 400.
Further, adaptive matching of the service container is performed according to the following policies:
1) when a metric of the service container exceeds the upper limit of the expected metric, calculating the number of expanded service container replicas and distributing them;
2) when the metric of the service container is below the upper limit of the expected metric, deleting some of the service container replicas.
Compared with the prior art, the invention has the following advantages:
(1) By adopting execution traces as the monitoring indicator of the system, the dependency relationships involved in a fault and the detailed execution information of the faulty request can be collected from the request. Execution traces serve as the indicator for anomaly discovery, and the similarity between a request's execution trace and the fault execution traces serves as the basis for judging the fault cause, so that execution traces are linked to fault causes. Combined with the monitoring of cluster metric information, faults of the monitored nodes can also be judged.
(2) The degree of similarity between execution traces can be quantified. By using the string edit distance to express the similarity between faults, the size of the edit distance can be used as the degree of likelihood when judging the fault cause.
(3) The known fault database based on directional fault injection can simulate, in advance, the various faults a system may encounter, describe and record the execution-trace characteristics of these faults, and reduce the dependence on the experience of operation and maintenance personnel.
Drawings
FIG. 1 is a flow chart of an implementation of the method of the present invention;
fig. 2 is a schematic view of a usage environment according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to specific embodiments and the accompanying drawings.
the service reliability guarantee method based on fault similarity matching of the invention is shown in fig. 1, and the implementation steps are as follows:
in the first step, monitoring data collection of directional pre-injection faults is carried out. A fault of known cause is first injected into the system, and for a distributed system, a request may be split into multiple calls in parallel. For an execution trace T, the present invention adopts a structure of a directed weighted graph to represent the call relationship of sending and receiving events, wherein: each vertex corresponds to a garment in the application dependency relationshipAffairs; for the application with service A and service B and the service A generates a call request to the service B, a directed edge is established between the vertex a and the vertex B corresponding to the service A and the service B in the directed weighted graph
Figure BDA0002623359800000053
The weight value is the time t consumed by receiving the call request; depending on the content of the request and the function of the application, multiple calls to the same service may result in different call relationships, and therefore the requested execution trace T and the corresponding directed graph D with rights may not be unique.
To record dependencies between services, the vertices will use tuple MiExpressed as (MID, requestID, callerID, duration, info), where MID is the service id; the requestID is the id of the service request; the callerID is the id of the calling service; duration is the request-consuming time; the info contains other information of the method, and is represented by tuple info ═ s (serviceName, startTime, tags), where serviceName is a service name; startTime is a start timestamp; tag s is the label that the request contains. Each vertex S in the directed weighted graph D contains a fixed-length character string converted from a vertex service name through a mapping algorithm as a unique identifier of the service;
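A minimal Python sketch of this data structure is given below. The span field names (spanID, traceID, parentID, serviceName, startTime, duration, tags) and the build_graph helper are illustrative assumptions about the collector output rather than part of the patented method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Vertex:
    """One vertex M_i = (MID, requestID, callerID, duration, info) of the directed weighted graph."""
    mid: str            # service id
    request_id: str     # id of the service request
    caller_id: str      # id of the calling service
    duration_ms: float  # request elapsed time (edge weight toward this vertex)
    service_name: str   # info.serviceName
    start_time: int     # info.startTime (epoch timestamp)
    tags: Dict[str, str] = field(default_factory=dict)  # info.tags

@dataclass
class TraceGraph:
    """Directed weighted graph D of one execution trace T."""
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    # edge e(a -> b) keyed by (caller id, callee id), weight = request elapsed time
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)

def build_graph(spans: List[dict]) -> TraceGraph:
    """Build the directed weighted graph of one request from a list of span dictionaries
    (the span key names below are assumptions about the collector output)."""
    g = TraceGraph()
    for s in spans:
        v = Vertex(
            mid=s["spanID"],
            request_id=s.get("traceID", ""),
            caller_id=s.get("parentID", ""),
            duration_ms=s["duration"],
            service_name=s["serviceName"],
            start_time=s["startTime"],
            tags=s.get("tags", {}),
        )
        g.vertices[v.mid] = v
        if v.caller_id:  # the root span has no caller, hence no incoming edge
            g.edges[(v.caller_id, v.mid)] = v.duration_ms
    return g
```

With such a graph, the weight stored on edge (a, b) is exactly the elapsed time used as the judgment indicator in the next step.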
and secondly, fault finding. The invention firstly counts the correct execution of the system request and records the corresponding execution trace TiAnd execution time tjA Service Profile Database (SPDB) is constructed and stores an execution trace of the correct execution of requests by the system. The TPDB is not only established as a record of system normal requests to the system, but also as a criterion for determining system abnormality. In the design of the present invention, the system exception is often accompanied by the change of the service execution and the call time in the execution tracking, and therefore the execution time is adopted as the index of the main judgment. And counting the requests of the same type for multiple times among the same vertexes, wherein due to randomness caused by factors such as network delay and the like, the calling time can present a normal distribution rule, and the execution time approximately falls within a certain interval. The vertex of such a section is defined as a boundary value falling withinExecution traces outside of the interval may be considered an exception. Thus, for a new execution request, the present invention calculates its boundary values using equations 1 and 2:
Figure BDA0002623359800000051
Figure BDA0002623359800000052
wherein, tmaxUpper bound, t, representing normal execution timeminA lower bound for the normal execution time is indicated,
Figure BDA0002623359800000061
representing the average execution time of the service request, sigma representing the standard deviation of the execution time of the service request, n representing the number of sample executions traces, zαRepresenting the alpha quantile of a normal distribution, the present invention uses a 95% confidence probability.
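A minimal sketch of this thresholding step is shown below. It assumes the reconstructed form of Equations 1 and 2 (mean ± z_α·σ/√n); the exact normalization in the original formula images is not recoverable, so treat the half-width computation as an assumption.

```python
import math
import statistics
from typing import List, Tuple

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution

def normal_bounds(normal_durations: List[float], z_alpha: float = Z_95) -> Tuple[float, float]:
    """Compute (t_min, t_max) for one edge from the elapsed times of its normal requests.

    Assumes the reconstructed Equations 1 and 2: mean +/- z_alpha * sigma / sqrt(n).
    """
    n = len(normal_durations)
    mean = statistics.fmean(normal_durations)
    sigma = statistics.stdev(normal_durations) if n > 1 else 0.0
    half_width = z_alpha * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def is_faulty(duration: float, t_min: float, t_max: float) -> bool:
    """An execution trace edge is judged faulty if its elapsed time leaves [t_min, t_max]."""
    return duration < t_min or duration > t_max
```

An edge whose observed weight falls outside the interval returned by normal_bounds marks the whole execution trace as faulty and triggers the similarity matching of the next step.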
In the third step, the cause of the anomaly is judged. The system adopts directional fault injection, collecting in advance the execution traces produced when the system fails and establishing a Known Fault Database (KFDB). The KFDB stores not only the abnormal execution traces produced by injected failures of a single service and of the calls among several services, but also the execution traces caused by anomalies of the cluster's physical resources; after the processing described in the invention, these results can be quantified and compared by similarity. When the system is judged to be abnormal, the observed fault execution trace is compared with the execution traces caused by the known faults in the KFDB, and the most similar fault is computed by the corresponding algorithm introduced in the invention; this helps operation and maintenance personnel determine where the system actually failed in the production environment and minimize the loss.
The method specifically comprises the following steps:
(1) For each trace T in the KFDB, retrieve the vertex S_1 with the earliest start time as the starting vertex, representing the first service of the request. For vertex S_1, take its service name N_1 and convert it into a fixed-length string c_1 using a mapping function.
(2) Move to the next vertex S_2 according to the calling relation, take its service name N_2, and convert it into a fixed-length string c_2 using the same mapping function as in step (1).
(3) Repeat step (2) to convert every vertex in T, obtaining a string set C. Concatenate the strings in C end to end in numbering order to form a string t_f representing the whole call request.
(4) Repeat steps (1), (2) and (3) to convert every fault in the KFDB into a fault call string. When an unknown fault e occurs, form its request string t_r according to steps (1) and (2).
(5) Both a known fault f in the KFDB and an unknown fault e in the system can therefore have their corresponding requests translated into request strings. For any injected fault f in the KFDB, the similarity s between it and the request string of the unknown fault e is expressed using the edit distance d; since the same fault f may have several execution traces, s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of all execution traces corresponding to fault f, d_i is the edit distance between the request string of each execution trace and that of e, l_fi is the known fault string length of each execution trace, l_e is the request string length of the unknown fault e, and m is the number of execution traces of f.
(6) Sort the faults in the KFDB by their similarity s to form a queue of k members as the result of the anomaly detection.
Thus, for a fault that actually occurs, the invention returns a fault queue sorted by fault similarity. This queue contains the related injected faults, and the result serves as the basis for judging the likely cause of the actual fault.
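A minimal Python sketch of this string matching is given below. The fixed-length mapping (a truncated hash) and the normalization of the edit distance by the longer string length are illustrative assumptions; the patent's formula fixes only the symbols d_i, l_fi, l_e and m.

```python
import hashlib
from typing import Dict, List

def service_to_code(service_name: str, width: int = 8) -> str:
    """Map a service name to a fixed-length string (here: a truncated SHA-1 hex digest)."""
    return hashlib.sha1(service_name.encode("utf-8")).hexdigest()[:width]

def trace_to_string(service_names_in_call_order: List[str]) -> str:
    """Concatenate the fixed-length codes of the called services in calling order."""
    return "".join(service_to_code(n) for n in service_names_in_call_order)

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fault_similarity(unknown_string: str, known_fault_strings: List[str]) -> float:
    """Average normalized similarity between the unknown request string and the m
    request strings of one known fault (assumed normalization: 1 - d / max length)."""
    total = 0.0
    for kf in known_fault_strings:
        d = edit_distance(unknown_string, kf)
        total += 1.0 - d / max(len(kf), len(unknown_string), 1)
    return total / len(known_fault_strings)

def rank_faults(unknown_string: str, kfdb: Dict[str, List[str]]) -> List[str]:
    """Return the fault names of the KFDB sorted by descending similarity (the fault queue)."""
    return sorted(kfdb, key=lambda f: fault_similarity(unknown_string, kfdb[f]), reverse=True)
```

rank_faults returns the ordered fault queue described above; calling it with the request string of a newly detected anomalous trace and the KFDB yields the candidate causes, most similar first.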
In the fourth step, the anomaly is handled. Service anomalies are divided into two types, service failure and performance degradation, corresponding to two detection modes: the former uses a probe for survivability detection, while the latter monitors the load information of the service. For a service failure anomaly, the survival probe identifies the following three abnormal cases: the status code on container exit is not 0; the designated port of the IP address is not open; or the response code of the container HTTP request on the designated port is less than 200 or greater than or equal to 400. A service failure anomaly is handled by restarting the service. For a performance degradation anomaly, the resource usage of the service falls outside the expected range. The system automatically tracks and computes the service metrics at regular intervals from the service measurement information. Once a service metric exceeds the upper limit of the expected metric, the number of expanded replicas of the service container is calculated, the number of containers is automatically expanded, and the replicas are distributed to the working nodes of the cluster to reduce the load; conversely, when the metric of the service container is far below the upper limit of the expected metric, some service replicas are deleted to avoid wasting cluster resources on starting too many containers.
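The replica adjustment can be sketched as follows. The proportional rule (current replicas scaled by the ratio of the observed metric to its expected upper limit, as in the Kubernetes Horizontal Pod Autoscaler) and the 0.5 factor used to interpret "far below" are assumptions; the patent only states that the number of expanded replicas is calculated.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     expected_upper: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Decide the number of service container replicas from one observed metric.

    Assumption: a proportional rule similar to the Kubernetes Horizontal Pod Autoscaler,
    desired = ceil(current * current_metric / expected_upper), clamped to a fixed range.
    """
    if current_metric > expected_upper:
        # performance degradation: scale out to reduce the load
        target = math.ceil(current_replicas * current_metric / expected_upper)
    elif current_metric < 0.5 * expected_upper:
        # metric far below the expected upper limit: scale in to save cluster resources
        target = math.ceil(current_replicas * current_metric / expected_upper)
    else:
        target = current_replicas
    return max(min_replicas, min(max_replicas, target))
```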
Aiming at the characteristic that execution traces record service requests, the invention adopts directional fault injection to record in advance both the execution traces of injected faults whose causes are known and the execution traces of normal operation; it describes the fault characteristics and the normal operating characteristics of the system through execution traces, thereby linking service execution traces to the system state. When the system in the production environment executes a new request, the invention considers whether the request execution time deviates from the range of normal operation and judges a request whose elapsed time deviates from normal as an abnormal request. The edit distance is used to describe the similarity between the request's execution trace and the fault execution traces, so that pre-injected faults whose characteristics are most similar to those of the current system can be retrieved and the possible fault causes diagnosed. In addition, for services that may be abnormal, the abnormal services are identified by a survival probe and by metric monitoring respectively, and service restart and automatic scaling are performed, thereby guaranteeing service reliability.
In the use environment of the method of the embodiment of the invention, as shown in fig. 2, five KVM virtual machines are deployed on a physical host and a Kubernetes cluster is built (Kubernetes is currently the most popular container orchestration system; it helps manage the underlying containers of a distributed system, and a Kubernetes-managed cluster provides automatic service discovery, load balancing, resource allocation tracking and container scaling, https://kubernetes.io/). One virtual machine serves as the Master node controlling the whole cluster; one virtual machine serves as the anomaly detection node and deploys Jaeger (a distributed call tracing tool, https://www.jaegertracing.io/), Elasticsearch (a distributed search and analysis engine, https://www.elastic.co/) and a MySQL database; and three virtual machines serve as working nodes that deploy the whole application and process the workload. Jaeger collects the tracing information of the application and stores it in the Elasticsearch database; the anomaly detection system extracts the tracing information from Elasticsearch, converts each request into a directed weighted graph, and stores it in the MySQL database. When the anomaly detection system finds that the weight of a directed edge of some request's directed weighted graph exceeds the threshold, the strings of the directionally injected faults stored in advance in MySQL are taken out for comparison, and the most similar fault cause is found. The method provided by the invention is implemented in the anomaly detection system.
The method of the embodiment of the invention comprises the following steps:
(1) the Mixer of Istio (a service mesh technology that intercepts the traffic in Kubernetes, https://istio.io/) is the intermediate component decoupling the front-end application from the back-end monitoring; the zipkin-address-url specified by the Mixer is set to the open interface of Jaeger, so that the tracing information of the services is collected;
(2) various faults are injected in advance, and the directed weighted graphs of their execution traces are stored in separate tables of the MySQL database as the execution information of the pre-injected faults;
(3) requests are sent many times in advance to the normally running system, and the directed weighted graphs of the execution traces in the normal running environment are recorded as the judgment thresholds for fault discovery;
(4) while the system is running, the tracing metrics are collected in real time; when the weight of a directed edge exceeds the threshold of normal operation, the corresponding execution trace directed weighted graph is mapped to its string and passed into the anomaly comparison;
(5) the mapped strings of the fault directed weighted graphs stored in each table of the MySQL database are extracted, and the average edit distance d between the strings is calculated;
(6) the faults are sorted by the size of the average value d and put into a queue as possible fault options, finally yielding an ordered fault queue (a sketch of this detection loop is given after this list).
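The detection loop of the embodiment can be sketched as below. fetch_recent_traces, load_normal_bounds and load_fault_strings are hypothetical stand-ins for the Jaeger/Elasticsearch and MySQL access code, which the patent does not specify, and the sketch reuses build_graph, is_faulty, trace_to_string and rank_faults from the earlier sketches.

```python
import time
from typing import Callable, Dict, List, Tuple

def detection_loop(
    fetch_recent_traces: Callable[[], List[List[dict]]],                            # spans of recently finished requests
    load_normal_bounds: Callable[[], Dict[Tuple[str, str], Tuple[float, float]]],   # per-edge (t_min, t_max)
    load_fault_strings: Callable[[], Dict[str, List[str]]],                         # KFDB: fault cause -> request strings
    interval_s: float = 10.0,
) -> None:
    """Periodically check new traces against the normal bounds and rank the known faults."""
    bounds = load_normal_bounds()
    kfdb = load_fault_strings()
    while True:
        for spans in fetch_recent_traces():
            graph = build_graph(spans)                  # step 1: directed weighted graph of the request
            abnormal = any(
                is_faulty(weight, *bounds[edge])
                for edge, weight in graph.edges.items() if edge in bounds
            )                                           # step 2: any edge weight outside [t_min, t_max]?
            if abnormal:
                names = [v.service_name for v in
                         sorted(graph.vertices.values(), key=lambda v: v.start_time)]
                queue = rank_faults(trace_to_string(names), kfdb)  # step 3: ordered fault queue
                print("possible fault causes, most similar first:", queue)
        time.sleep(interval_s)
```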
For services in the fault queue that may be abnormal, the number of container replicas of those services is scaled according to their metric information, so that the service anomaly is resolved automatically. In addition, the cause of each directionally injected fault in the ordered fault queue serves as a reference for operation and maintenance personnel when repairing the anomaly and finding the actual fault cause and its handling method.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (10)

1. A service reliability guarantee method based on fault similarity comprises the following steps:
1) analyzing an execution trace generated during system operation to construct a directed weighted graph, wherein a vertex in the directed weighted graph is represented by a [service id, service requester id, callee id, request elapsed time, other method information] tuple, the other method information is represented by a [service name, start timestamp, tags contained in the request] tuple, and a directed edge is denoted e(a→b), where a denotes the service requester identifier, b denotes the callee identifier, and the weight of the directed edge is the request elapsed time;
2) comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal according to the request elapsed time and the normal request elapsed time;
3) if the execution trace is a faulty execution trace, converting each service name, in the calling order of the services called in the trace, into a corresponding fixed-length string, and concatenating the strings in calling order to obtain an unknown fault request string;
4) extracting all the fault call strings of the known faults from a known fault database, and computing the similarity between the unknown fault request string and each fault call string to obtain the fault cause of the execution trace;
5) obtaining the corresponding service exception type according to the fault cause, wherein the service exception types comprise service failure and performance degradation;
6) if the service has failed, restarting the service; and if the performance has degraded, performing adaptive matching of the service container.
2. The method of claim 1, wherein the execution traces are collected by setting the Mixer-specified execution trace collection interface zipkin-address-url to the open interface of Jaeger.
3. The method of claim 1, wherein the known fault database is created by:
a) injecting a plurality of sample faults into a normally running system and sending a plurality of requests to obtain a plurality of known fault execution traces;
b) constructing the known fault directed weighted graphs of the plurality of known fault execution traces;
c) classifying all the known fault directed weighted graphs according to their fault causes to obtain a plurality of groups of known fault directed weighted graphs, which form the known fault database.
4. The method of claim 1, wherein whether the execution trace is normal is judged by the following steps:
1) calculating, from the weights of all directed edges of the plurality of normal directed weighted graphs, the upper limit t_max = μ + z_α·σ/√n and the lower limit t_min = μ - z_α·σ/√n of the normal request elapsed time, where μ denotes the average elapsed time of the n normal requests of the normal execution traces, σ denotes the standard deviation of the elapsed time of the normal requests, and z_α denotes the α quantile of the normal distribution;
2) if t_min ≤ request elapsed time ≤ t_max, the execution trace is a normal execution trace; if the request elapsed time < t_min or the request elapsed time > t_max, the execution trace is a faulty execution trace.
5. The method of claim 1, wherein the fault call string is obtained by the following steps:
1) for the execution trace of each known fault in the known fault database, retrieving in chronological order the vertex S_1 of the known fault directed weighted graph, and converting the service name corresponding to vertex S_1 into a fixed-length string c_1 using a mapping function;
2) obtaining, according to the calling relations, each next connected vertex S_i of the directed weighted graph, and converting the service name corresponding to vertex S_i into a fixed-length string c_i using the mapping function, where 2 ≤ i ≤ p and p is the total number of services contained in the execution trace of the known fault;
3) obtaining the fault call string from the string set {c_1, …, c_i, …, c_p}.
6. The method of claim 1, wherein the similarity between the unknown fault request string and each fault call string is s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of known fault execution traces corresponding to the known fault f, d_i is the edit distance between the request string of each known fault execution trace fi and that of e, l_fi is the known fault string length of each known fault execution trace fi, l_e is the request string length of the execution trace corresponding to the unknown fault e, and m is the number of known fault execution traces corresponding to the known fault f.
7. The method of claim 1, wherein the fault cause comprises a queue of known faults arranged in descending order of similarity.
8. The method of claim 1, wherein the method of detecting the type of service failure comprises viability detection using a probe.
9. The method of claim 1, wherein the service failure type includes a state code of not 0 at the time of container exit, a designated port of the IP address is not opened, and a response code of a container HTTP request of the designated port is less than 200 or equal to or greater than 400.
10. The method of claim 1, wherein adaptive matching of the service container is performed according to the following policies:
1) when a metric of the service container exceeds the upper limit of the expected metric, calculating the number of expanded service container replicas and distributing them;
2) when the metric of the service container is below the upper limit of the expected metric, deleting some of the service container replicas.
CN202010789841.0A 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity Active CN112118127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010789841.0A CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010789841.0A CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Publications (2)

Publication Number Publication Date
CN112118127A CN112118127A (en) 2020-12-22
CN112118127B (en) 2021-11-09

Family

ID=73803671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010789841.0A Active CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Country Status (1)

Country Link
CN (1) CN112118127B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925668B (en) * 2021-02-25 2024-04-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for evaluating server health
CN113190373B (en) * 2021-05-31 2022-04-05 中国人民解放军国防科技大学 Micro-service system fault root cause positioning method based on fault feature comparison

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109873812A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and computer equipment
CN109933452A (en) * 2019-03-22 2019-06-25 中国科学院软件研究所 A kind of micro services intelligent monitoring method towards anomalous propagation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255127B2 (en) * 2015-09-30 2019-04-09 International Business Machines Corporation Optimized diagnostic data collection driven by a ticketing system
CN110262972B (en) * 2019-06-17 2020-12-08 中国科学院软件研究所 Failure testing tool and method for micro-service application

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109873812A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and computer equipment
CN109933452A (en) * 2019-03-22 2019-06-25 中国科学院软件研究所 A kind of micro services intelligent monitoring method towards anomalous propagation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A fault detection method for cloud computing systems based on adaptive monitoring; Wang Tao et al.; Chinese Journal of Computers (《计算机学报》); 2016-10-29; full text *

Also Published As

Publication number Publication date
CN112118127A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN108415789B (en) Node fault prediction system and method for large-scale hybrid heterogeneous storage system
KR102522005B1 (en) Apparatus for VNF Anomaly Detection based on Machine Learning for Virtual Network Management and a method thereof
Gu et al. Online anomaly prediction for robust cluster systems
Xu et al. Online system problem detection by mining patterns of console logs
US8655623B2 (en) Diagnostic system and method
US7502971B2 (en) Determining a recurrent problem of a computer resource using signatures
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
Lan et al. Toward automated anomaly identification in large-scale systems
US20060188011A1 (en) Automated diagnosis and forecasting of service level objective states
CN112118127B (en) Service reliability guarantee method based on fault similarity
CN104903866A (en) Management system and method for assisting event root cause analysis
CN110489317B (en) Cloud system task operation fault diagnosis method and system based on workflow
CN109918313B (en) GBDT decision tree-based SaaS software performance fault diagnosis method
Fu et al. Performance issue diagnosis for online service systems
CN114528175A (en) Micro-service application system root cause positioning method, device, medium and equipment
KR20220166760A (en) Apparatus and method for managing trouble using big data of 5G distributed cloud system
CN114154035A (en) Data processing system for dynamic loop monitoring
CN114327964A (en) Method, device, equipment and storage medium for processing fault reasons of service system
Chen et al. Trace-based intelligent fault diagnosis for microservices with deep learning
CN110175100B (en) Storage disk fault prediction method and prediction system
CN111339052A (en) Unstructured log data processing method and device
CN116069618A (en) Application scene-oriented domestic system evaluation method
CN115860709A (en) Software service guarantee system and method
CN111506422B (en) Event analysis method and system
Lingrand et al. Analyzing the EGEE production grid workload: application to jobs submission optimization

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant