CN112118127B - Service reliability guarantee method based on fault similarity


Info

Publication number: CN112118127B (application publication CN112118127A)
Authority: CN (China)
Prior art keywords: fault, service, execution, request, normal
Legal status: Active (granted)
Application number: CN202010789841.0A
Other languages: Chinese (zh)
Inventors: 王焘 (Wang Tao), 陈皓 (Chen Hao), 张文博 (Zhang Wenbo), 许源佳 (Xu Yuanjia)
Current Assignee: Institute of Software of CAS
Original Assignee: Institute of Software of CAS
Application filed by: Institute of Software of CAS
Priority and filing date: 2020-08-07
Publication of CN112118127A: 2020-12-22
Grant and publication of CN112118127B: 2021-11-09

Classifications

    • H (Electricity) > H04 (Electric communication technique) > H04L (Transmission of digital information, e.g. telegraphic communication) > H04L41/00 (Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks) > H04L41/06 (Management of faults, events, alarms or notifications)
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a service reliability guarantee method based on fault similarity, which comprises the following steps: parsing an execution trace to construct a directed weighted graph; comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal; if the execution trace is a faulty execution trace, converting the services called in the trace, in calling order, into an unknown fault request string; extracting the fault call strings from a known fault database and computing their similarity to the unknown fault request string to obtain the fault cause of the execution trace; detecting, according to the fault cause, whether the service failure type exists; if the service failure type exists, restarting the service; and if the service failure type does not exist, collecting the service container metrics and performing adaptive matching of the service container. The method is based on directional fault injection, couples execution traces with fault causes, and can judge faults of the monitored nodes.

Description

Service reliability guarantee method based on fault similarity
Technical Field
The invention belongs to the technical field of software, and particularly relates to a service reliability guarantee method based on fault similarity.
Background
In the face of the massive requests of the internet era, even a short service failure degrades user experience and ratings, and a prolonged service failure can cause serious economic losses to enterprises. With the rapid growth of cloud computing data volumes and the continuous expansion of cluster sizes, high cluster reliability has become increasingly important. Service reliability refers to the probability that software runs without error within a given time and in a specific environment; accurately measuring reliability requires specifying the time, the running environment, the conditions, and the functions. Service monitoring and fault-handling techniques help operation and maintenance personnel monitor the running state of the containers in a distributed service cluster, allocate resources, and ensure reliable operation of the whole service system.
Existing service reliability guarantee methods fall mainly into three categories: analysis of metric information, analysis of log files, and analysis of execution traces. Methods based on metric analysis collect the metric values of a logical measurement unit or of a time window; through a series of operations, fixed or dynamically changing thresholds can be set and used as alarm rules for system anomalies, sending warnings to operation and maintenance personnel or serving as scheduling rules for cluster tasks. A representative method is described in the literature (S. Chouliaras and S. Sotiriadis, "Real-Time Anomaly Detection of NoSQL Systems Based on Resource Usage Monitoring," IEEE Transactions on Industrial Informatics, vol. 16, no. 9, pp. 6042-6049, Sept. 2020, doi:10.1109/TII.2019.2958606), which compares different machine learning models against the operating patterns of NoSQL systems based on real-time monitoring of resource usage, and characterizes them through resource usage monitoring and process information extraction to identify different anomaly patterns. Such methods do not need to know the structure and relationships of the services in the system, but they must know the anomaly types and anomaly characteristics in advance, so their flexibility is poor. Methods based on log file analysis collect metadata from discrete log files; the log files record the massive scattered events or request information generated during system operation, and runtime errors can be found by configuring retrieval patterns, which helps to locate the causes of system anomalies. A representative method is described in the literature (Y. Yuan, W. Shi, B. Liang and B. Qin, "An Approach to Cloud Execution Failure Diagnosis Based on Exception Logs in OpenStack," 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 2019, pp. 124-131, doi:10.1109/CLOUD.2019.00031), which focuses on identifying the exception logs generated by system tasks and compares them, in a lightweight manner, with the logs of historical system tasks to recognize particular anomaly types. Such methods must collect a large number of scattered log files and extract key fault information from them; because log collection and information extraction lag behind, it is difficult to analyze the faults present in the system in real time. Methods based on execution traces collect all the information of a single request and reconstruct the internal structural characteristics of the system; when the system is abnormal the request processing trajectory deviates, so anomalies can be located and fault causes diagnosed by analyzing the trajectory. A representative method is described in the literature (S. Zhang, Y. Wang, W. Li and X. Qiu, "Service failure diagnosis in service function chaining," 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS), Seoul, 2017, pp. 70-75, doi:10.1109/APNOMS.2017.8094181), which proposes a low-cost detection method that computes the topology of the network functions, avoids repeated probing of the links between network functions, and adds timestamps to network headers to analyze quality of service. Such methods can diagnose system performance problems, but an overly fine monitoring granularity brings huge monitoring and analysis resource overhead, so it is difficult to balance monitoring granularity against monitoring overhead.
In summary, existing service reliability guarantee methods have the following problems. (1) Among the various monitoring approaches, methods based on metric analysis and on log file analysis must know the anomaly characteristics in advance, so they have difficulty coping with sudden anomalies, while methods based on execution traces lack monitoring of the system operating indicators (such as CPU, memory, disk, and network). (2) Execution traces of the same request are similar to one another, but there is no index for measuring the similarity between execution traces, so the prior art has difficulty discovering service anomalies by analyzing that similarity, and faults therefore cannot be discovered and diagnosed quickly. (3) Diagnosing the cause of a fault depends on the characteristic information of historical system faults, and there is no method for diagnosing the possible causes of unknown faults; the prior art therefore has to keep iterating its fault-characteristic diagnosis rules, and the diagnosis of unknown faults depends on the skill and experience of operation and maintenance personnel, so the reliability guarantee method is inflexible and can hardly cope with rare system faults.
Disclosure of Invention
Aiming at these problems of existing service reliability guarantee methods, namely the lack of an operating-indicator monitoring method for sudden anomalies, of comparison rules for the execution-trace information of service requests, and of a fault-cause detection method that can flexibly cope with unknown faults, the invention provides a service reliability guarantee method based on fault similarity.
The technical solution of the invention comprises:
a service reliability guarantee method based on fault similarity comprises the following steps:
1) analyzing an execution trace generated during system operation to construct a directed weighted graph, wherein a vertex in the directed weighted graph is represented by a [service id, service requester id, callee id, request elapsed time, other method information] tuple, and the other method information is represented by a [service name, start timestamp, tags contained in the request] tuple; a directed edge is denoted e(a→b), where a denotes the service requester identifier, b denotes the callee identifier, and the weight of the directed edge is the request elapsed time;
2) comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal according to the request elapsed time and the normal request elapsed time;
3) if the execution trace is a faulty execution trace, converting each service name, in the calling order of the services called in the trace, into a corresponding fixed-length string, and concatenating the strings in calling order to obtain an unknown fault request string;
4) extracting all the fault call strings of the known faults from a known fault database, and computing the similarity between the unknown fault request string and each fault call string to obtain the fault cause of the execution trace;
5) obtaining the corresponding service failure type according to the fault cause, and detecting whether the service failure type exists;
6) if the service failure type exists, restarting the service; and if the service failure type does not exist, collecting the service container metrics and performing adaptive matching of the service container.
Further, the execution traces are collected by setting the Mixer-specified execution trace collection interface zipkin-address-url to the open interface of Jaeger.
Further, the known fault database is built by the following steps (a construction sketch follows this list):
a) injecting a plurality of sample faults into a normally running system and sending a plurality of requests to obtain a plurality of known fault execution traces;
b) constructing the known fault directed weighted graphs of the plurality of known fault execution traces;
c) classifying all the known fault directed weighted graphs according to their fault causes to obtain a plurality of groups of known fault directed weighted graphs, which form the known fault database.
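A minimal Python sketch of this database construction is given below; inject_fault and collect_fault_trace are hypothetical stand-ins for the fault-injection tool and the trace collector, which the patent does not name, and the number of traces collected per fault is an arbitrary illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

def build_known_fault_database(
    fault_causes: List[str],
    inject_fault: Callable[[str], None],     # hypothetical: inject one sample fault of the given cause
    collect_fault_trace: Callable[[], Any],  # hypothetical: send a request, return its directed weighted graph
    traces_per_fault: int = 20,
) -> Dict[str, List[Any]]:
    """Known fault database: fault cause -> list of known-fault directed weighted graphs."""
    kfdb: Dict[str, List[Any]] = defaultdict(list)
    for cause in fault_causes:
        inject_fault(cause)                            # a) inject a sample fault into the normally running system
        for _ in range(traces_per_fault):
            kfdb[cause].append(collect_fault_trace())  # a)+b) obtain the trace and its directed weighted graph
    return dict(kfdb)                                  # c) graphs classified by fault cause form the database
```

The resulting mapping from fault cause to directed weighted graphs is what the later similarity matching consults.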
Further, whether the execution trace is normal is judged by the following steps:
1) calculating, from the weights of all directed edges of the plurality of normal directed weighted graphs, the upper limit t_max = μ + z_α·σ/√n and the lower limit t_min = μ - z_α·σ/√n of the normal request elapsed time, where μ denotes the average elapsed time of the n normal requests of the normal execution traces, σ denotes the standard deviation of the elapsed time of the normal requests, and z_α denotes the α quantile of the normal distribution;
2) if t_min ≤ request elapsed time ≤ t_max, the execution trace is a normal execution trace; if the request elapsed time < t_min or the request elapsed time > t_max, the execution trace is a faulty execution trace.
Further, the fault call string is obtained by the following steps:
1) for the execution trace of each known fault in the known fault database, retrieving in chronological order the vertex S_1 of the known fault directed weighted graph, and converting the service name corresponding to vertex S_1 into a fixed-length string c_1 using a mapping function;
2) obtaining, according to the calling relations, each next connected vertex S_i of the directed weighted graph, and converting the service name corresponding to vertex S_i into a fixed-length string c_i using the mapping function, where 2 ≤ i ≤ p and p is the total number of services contained in the execution trace of the known fault;
3) obtaining the fault call string from the string set {c_1, …, c_i, …, c_p}.
Further, the similarity between the unknown fault request string and each fault call string is s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of known fault execution traces corresponding to the known fault f, d_i is the edit distance between the request string of each known fault execution trace fi and that of e, l_fi is the known fault string length of each known fault execution trace fi, l_e is the request string length of the execution trace corresponding to the unknown fault e, and m is the number of known fault execution traces corresponding to the known fault f.
Further, the fault cause comprises a queue of known faults arranged in descending order of similarity.
Further, the method of detecting the presence of the service failure type includes performing survivability detection using a probe.
Further, the service failure type includes that the status code is not 0 when the container exits, the designated port of the IP address is not opened, and the response code of the container HTTP request of the designated port is less than 200 or greater than or equal to 400.
Further, adaptive matching of the service container is performed according to the following policies:
1) when a metric of the service container exceeds the upper limit of the expected metric, calculating the number of expanded service container replicas and distributing them;
2) when the metric of the service container is below the upper limit of the expected metric, deleting some of the service container replicas.
Compared with the prior art, the invention has the following advantages:
(1) By adopting execution traces as the monitoring indicator of the system, the dependency relationships involved in a fault and the detailed execution information of the faulty request can be collected from the request. Execution traces serve as the indicator for anomaly discovery, and the similarity between a request's execution trace and the fault execution traces serves as the basis for judging the fault cause, so that execution traces are linked to fault causes. Combined with the monitoring of cluster metric information, faults of the monitored nodes can also be judged.
(2) The degree of similarity between execution traces can be quantified. By using the string edit distance to express the similarity between faults, the size of the edit distance can be used as the degree of likelihood when judging the fault cause.
(3) The known fault database based on directional fault injection can simulate, in advance, the various faults a system may encounter, describe and record the execution-trace characteristics of these faults, and reduce the dependence on the experience of operation and maintenance personnel.
Drawings
FIG. 1 is a flow chart of an implementation of the method of the present invention;
fig. 2 is a schematic view of a usage environment according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to specific embodiments and the accompanying drawings.
the service reliability guarantee method based on fault similarity matching of the invention is shown in fig. 1, and the implementation steps are as follows:
in the first step, monitoring data collection of directional pre-injection faults is carried out. A fault of known cause is first injected into the system, and for a distributed system, a request may be split into multiple calls in parallel. For an execution trace T, the present invention adopts a structure of a directed weighted graph to represent the call relationship of sending and receiving events, wherein: each vertex corresponds to a garment in the application dependency relationshipAffairs; for the application with service A and service B and the service A generates a call request to the service B, a directed edge is established between the vertex a and the vertex B corresponding to the service A and the service B in the directed weighted graph
Figure BDA0002623359800000053
The weight value is the time t consumed by receiving the call request; depending on the content of the request and the function of the application, multiple calls to the same service may result in different call relationships, and therefore the requested execution trace T and the corresponding directed graph D with rights may not be unique.
To record dependencies between services, the vertices will use tuple MiExpressed as (MID, requestID, callerID, duration, info), where MID is the service id; the requestID is the id of the service request; the callerID is the id of the calling service; duration is the request-consuming time; the info contains other information of the method, and is represented by tuple info ═ s (serviceName, startTime, tags), where serviceName is a service name; startTime is a start timestamp; tag s is the label that the request contains. Each vertex S in the directed weighted graph D contains a fixed-length character string converted from a vertex service name through a mapping algorithm as a unique identifier of the service;
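A minimal Python sketch of this data structure is given below. The span field names (spanID, traceID, parentID, serviceName, startTime, duration, tags) and the build_graph helper are illustrative assumptions about the collector output rather than part of the patented method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Vertex:
    """One vertex M_i = (MID, requestID, callerID, duration, info) of the directed weighted graph."""
    mid: str            # service id
    request_id: str     # id of the service request
    caller_id: str      # id of the calling service
    duration_ms: float  # request elapsed time (edge weight toward this vertex)
    service_name: str   # info.serviceName
    start_time: int     # info.startTime (epoch timestamp)
    tags: Dict[str, str] = field(default_factory=dict)  # info.tags

@dataclass
class TraceGraph:
    """Directed weighted graph D of one execution trace T."""
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    # edge e(a -> b) keyed by (caller id, callee id), weight = request elapsed time
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)

def build_graph(spans: List[dict]) -> TraceGraph:
    """Build the directed weighted graph of one request from a list of span dictionaries
    (the span key names below are assumptions about the collector output)."""
    g = TraceGraph()
    for s in spans:
        v = Vertex(
            mid=s["spanID"],
            request_id=s.get("traceID", ""),
            caller_id=s.get("parentID", ""),
            duration_ms=s["duration"],
            service_name=s["serviceName"],
            start_time=s["startTime"],
            tags=s.get("tags", {}),
        )
        g.vertices[v.mid] = v
        if v.caller_id:  # the root span has no caller, hence no incoming edge
            g.edges[(v.caller_id, v.mid)] = v.duration_ms
    return g
```

With such a graph, the weight stored on edge (a, b) is exactly the elapsed time used as the judgment indicator in the next step.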
and secondly, fault finding. The invention firstly counts the correct execution of the system request and records the corresponding execution trace TiAnd execution time tjA Service Profile Database (SPDB) is constructed and stores an execution trace of the correct execution of requests by the system. The TPDB is not only established as a record of system normal requests to the system, but also as a criterion for determining system abnormality. In the design of the present invention, the system exception is often accompanied by the change of the service execution and the call time in the execution tracking, and therefore the execution time is adopted as the index of the main judgment. And counting the requests of the same type for multiple times among the same vertexes, wherein due to randomness caused by factors such as network delay and the like, the calling time can present a normal distribution rule, and the execution time approximately falls within a certain interval. The vertex of such a section is defined as a boundary value falling withinExecution traces outside of the interval may be considered an exception. Thus, for a new execution request, the present invention calculates its boundary values using equations 1 and 2:
Figure BDA0002623359800000051
Figure BDA0002623359800000052
wherein, tmaxUpper bound, t, representing normal execution timeminA lower bound for the normal execution time is indicated,
Figure BDA0002623359800000061
representing the average execution time of the service request, sigma representing the standard deviation of the execution time of the service request, n representing the number of sample executions traces, zαRepresenting the alpha quantile of a normal distribution, the present invention uses a 95% confidence probability.
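A minimal sketch of this thresholding step is shown below. It assumes the reconstructed form of Equations 1 and 2 (mean ± z_α·σ/√n); the exact normalization in the original formula images is not recoverable, so treat the half-width computation as an assumption.

```python
import math
import statistics
from typing import List, Tuple

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution

def normal_bounds(normal_durations: List[float], z_alpha: float = Z_95) -> Tuple[float, float]:
    """Compute (t_min, t_max) for one edge from the elapsed times of its normal requests.

    Assumes the reconstructed Equations 1 and 2: mean +/- z_alpha * sigma / sqrt(n).
    """
    n = len(normal_durations)
    mean = statistics.fmean(normal_durations)
    sigma = statistics.stdev(normal_durations) if n > 1 else 0.0
    half_width = z_alpha * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def is_faulty(duration: float, t_min: float, t_max: float) -> bool:
    """An execution trace edge is judged faulty if its elapsed time leaves [t_min, t_max]."""
    return duration < t_min or duration > t_max
```

An edge whose observed weight falls outside the interval returned by normal_bounds marks the whole execution trace as faulty and triggers the similarity matching of the next step.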
In the third step, the cause of the anomaly is judged. The system adopts directional fault injection, collecting in advance the execution traces produced when the system fails and establishing a Known Fault Database (KFDB). The KFDB stores not only the abnormal execution traces produced by injected failures of a single service and of the calls among several services, but also the execution traces caused by anomalies of the cluster's physical resources; after the processing described in the invention, these results can be quantified and compared by similarity. When the system is judged to be abnormal, the observed fault execution trace is compared with the execution traces caused by the known faults in the KFDB, and the most similar fault is computed by the corresponding algorithm introduced in the invention; this helps operation and maintenance personnel determine where the system actually failed in the production environment and minimize the loss.
The method specifically comprises the following steps:
(1) For each trace T in the KFDB, retrieve the vertex S_1 with the earliest start time as the starting vertex, representing the first service of the request. For vertex S_1, take its service name N_1 and convert it into a fixed-length string c_1 using a mapping function.
(2) Move to the next vertex S_2 according to the calling relation, take its service name N_2, and convert it into a fixed-length string c_2 using the same mapping function as in step (1).
(3) Repeat step (2) to convert every vertex in T, obtaining a string set C. Concatenate the strings in C end to end in numbering order to form a string t_f representing the whole call request.
(4) Repeat steps (1), (2) and (3) to convert every fault in the KFDB into a fault call string. When an unknown fault e occurs, form its request string t_r according to steps (1) and (2).
(5) Both a known fault f in the KFDB and an unknown fault e in the system can therefore have their corresponding requests translated into request strings. For any injected fault f in the KFDB, the similarity s between it and the request string of the unknown fault e is expressed using the edit distance d; since the same fault f may have several execution traces, s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of all execution traces corresponding to fault f, d_i is the edit distance between the request string of each execution trace and that of e, l_fi is the known fault string length of each execution trace, l_e is the request string length of the unknown fault e, and m is the number of execution traces of f.
(6) Sort the faults in the KFDB by their similarity s to form a queue of k members as the result of the anomaly detection.
Thus, for a fault that actually occurs, the invention returns a fault queue sorted by fault similarity. This queue contains the related injected faults, and the result serves as the basis for judging the likely cause of the actual fault.
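A minimal Python sketch of this string matching is given below. The fixed-length mapping (a truncated hash) and the normalization of the edit distance by the longer string length are illustrative assumptions; the patent's formula fixes only the symbols d_i, l_fi, l_e and m.

```python
import hashlib
from typing import Dict, List

def service_to_code(service_name: str, width: int = 8) -> str:
    """Map a service name to a fixed-length string (here: a truncated SHA-1 hex digest)."""
    return hashlib.sha1(service_name.encode("utf-8")).hexdigest()[:width]

def trace_to_string(service_names_in_call_order: List[str]) -> str:
    """Concatenate the fixed-length codes of the called services in calling order."""
    return "".join(service_to_code(n) for n in service_names_in_call_order)

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fault_similarity(unknown_string: str, known_fault_strings: List[str]) -> float:
    """Average normalized similarity between the unknown request string and the m
    request strings of one known fault (assumed normalization: 1 - d / max length)."""
    total = 0.0
    for kf in known_fault_strings:
        d = edit_distance(unknown_string, kf)
        total += 1.0 - d / max(len(kf), len(unknown_string), 1)
    return total / len(known_fault_strings)

def rank_faults(unknown_string: str, kfdb: Dict[str, List[str]]) -> List[str]:
    """Return the fault names of the KFDB sorted by descending similarity (the fault queue)."""
    return sorted(kfdb, key=lambda f: fault_similarity(unknown_string, kfdb[f]), reverse=True)
```

rank_faults returns the ordered fault queue described above; calling it with the request string of a newly detected anomalous trace and the KFDB yields the candidate causes, most similar first.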
In the fourth step, the anomaly is handled. Service anomalies are divided into two types, service failure and performance degradation, corresponding to two detection modes: the former uses a probe for survivability detection, while the latter monitors the load information of the service. For a service failure anomaly, the survival probe identifies the following three abnormal cases: the status code on container exit is not 0; the designated port of the IP address is not open; or the response code of the container HTTP request on the designated port is less than 200 or greater than or equal to 400. A service failure anomaly is handled by restarting the service. For a performance degradation anomaly, the resource usage of the service falls outside the expected range. The system automatically tracks and computes the service metrics at regular intervals from the service measurement information. Once a service metric exceeds the upper limit of the expected metric, the number of expanded replicas of the service container is calculated, the number of containers is automatically expanded, and the replicas are distributed to the working nodes of the cluster to reduce the load; conversely, when the metric of the service container is far below the upper limit of the expected metric, some service replicas are deleted to avoid wasting cluster resources on starting too many containers.
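The replica adjustment can be sketched as follows. The proportional rule (current replicas scaled by the ratio of the observed metric to its expected upper limit, as in the Kubernetes Horizontal Pod Autoscaler) and the 0.5 factor used to interpret "far below" are assumptions; the patent only states that the number of expanded replicas is calculated.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     expected_upper: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Decide the number of service container replicas from one observed metric.

    Assumption: a proportional rule similar to the Kubernetes Horizontal Pod Autoscaler,
    desired = ceil(current * current_metric / expected_upper), clamped to a fixed range.
    """
    if current_metric > expected_upper:
        # performance degradation: scale out to reduce the load
        target = math.ceil(current_replicas * current_metric / expected_upper)
    elif current_metric < 0.5 * expected_upper:
        # metric far below the expected upper limit: scale in to save cluster resources
        target = math.ceil(current_replicas * current_metric / expected_upper)
    else:
        target = current_replicas
    return max(min_replicas, min(max_replicas, target))
```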
Aiming at the characteristic that execution traces record service requests, the invention adopts directional fault injection to record in advance both the execution traces of injected faults whose causes are known and the execution traces of normal operation; it describes the fault characteristics and the normal operating characteristics of the system through execution traces, thereby linking service execution traces to the system state. When the system in the production environment executes a new request, the invention considers whether the request execution time deviates from the range of normal operation and judges a request whose elapsed time deviates from normal as an abnormal request. The edit distance is used to describe the similarity between the request's execution trace and the fault execution traces, so that pre-injected faults whose characteristics are most similar to those of the current system can be retrieved and the possible fault causes diagnosed. In addition, for services that may be abnormal, the abnormal services are identified by a survival probe and by metric monitoring respectively, and service restart and automatic scaling are performed, thereby guaranteeing service reliability.
In the use environment of the method of the embodiment of the invention, as shown in fig. 2, five KVM virtual machines are deployed on a physical host and a Kubernetes cluster is built (Kubernetes is currently the most popular container orchestration system; it helps manage the underlying containers of a distributed system, and a Kubernetes-managed cluster provides automatic service discovery, load balancing, resource allocation tracking and container scaling, https://kubernetes.io/). One virtual machine serves as the Master node controlling the whole cluster; one virtual machine serves as the anomaly detection node and deploys Jaeger (a distributed call tracing tool, https://www.jaegertracing.io/), Elasticsearch (a distributed search and analysis engine, https://www.elastic.co/) and a MySQL database; and three virtual machines serve as working nodes that deploy the whole application and process the workload. Jaeger collects the tracing information of the application and stores it in the Elasticsearch database; the anomaly detection system extracts the tracing information from Elasticsearch, converts each request into a directed weighted graph, and stores it in the MySQL database. When the anomaly detection system finds that the weight of a directed edge of some request's directed weighted graph exceeds the threshold, the strings of the directionally injected faults stored in advance in MySQL are taken out for comparison, and the most similar fault cause is found. The method provided by the invention is implemented in the anomaly detection system.
The method of the embodiment of the invention comprises the following steps:
(1) the Mixer of Istio (a service mesh technology that intercepts the traffic in Kubernetes, https://istio.io/) is the intermediate component decoupling the front-end application from the back-end monitoring; the zipkin-address-url specified by the Mixer is set to the open interface of Jaeger, so that the tracing information of the services is collected;
(2) various faults are injected in advance, and the directed weighted graphs of their execution traces are stored in separate tables of the MySQL database as the execution information of the pre-injected faults;
(3) requests are sent many times in advance to the normally running system, and the directed weighted graphs of the execution traces in the normal running environment are recorded as the judgment thresholds for fault discovery;
(4) while the system is running, the tracing metrics are collected in real time; when the weight of a directed edge exceeds the threshold of normal operation, the corresponding execution trace directed weighted graph is mapped to its string and passed into the anomaly comparison;
(5) the mapped strings of the fault directed weighted graphs stored in each table of the MySQL database are extracted, and the average edit distance d between the strings is calculated;
(6) the faults are sorted by the size of the average value d and put into a queue as possible fault options, finally yielding an ordered fault queue (a sketch of this detection loop is given after this list).
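The detection loop of the embodiment can be sketched as below. fetch_recent_traces, load_normal_bounds and load_fault_strings are hypothetical stand-ins for the Jaeger/Elasticsearch and MySQL access code, which the patent does not specify, and the sketch reuses build_graph, is_faulty, trace_to_string and rank_faults from the earlier sketches.

```python
import time
from typing import Callable, Dict, List, Tuple

def detection_loop(
    fetch_recent_traces: Callable[[], List[List[dict]]],                            # spans of recently finished requests
    load_normal_bounds: Callable[[], Dict[Tuple[str, str], Tuple[float, float]]],   # per-edge (t_min, t_max)
    load_fault_strings: Callable[[], Dict[str, List[str]]],                         # KFDB: fault cause -> request strings
    interval_s: float = 10.0,
) -> None:
    """Periodically check new traces against the normal bounds and rank the known faults."""
    bounds = load_normal_bounds()
    kfdb = load_fault_strings()
    while True:
        for spans in fetch_recent_traces():
            graph = build_graph(spans)                  # step 1: directed weighted graph of the request
            abnormal = any(
                is_faulty(weight, *bounds[edge])
                for edge, weight in graph.edges.items() if edge in bounds
            )                                           # step 2: any edge weight outside [t_min, t_max]?
            if abnormal:
                names = [v.service_name for v in
                         sorted(graph.vertices.values(), key=lambda v: v.start_time)]
                queue = rank_faults(trace_to_string(names), kfdb)  # step 3: ordered fault queue
                print("possible fault causes, most similar first:", queue)
        time.sleep(interval_s)
```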
For services in the fault queue that may be abnormal, the number of container replicas of those services is scaled according to their metric information, so that the service anomaly is resolved automatically. In addition, the cause of each directionally injected fault in the ordered fault queue serves as a reference for operation and maintenance personnel when repairing the anomaly and finding the actual fault cause and its handling method.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (10)

1. A service reliability guarantee method based on fault similarity comprises the following steps:
1) analyzing an execution trace generated during system operation to construct a directed weighted graph, wherein a vertex in the directed weighted graph is represented by a [service id, service requester id, callee id, request elapsed time, other method information] tuple, the other method information is represented by a [service name, start timestamp, tags contained in the request] tuple, and a directed edge is denoted e(a→b), where a denotes the service requester identifier, b denotes the callee identifier, and the weight of the directed edge is the request elapsed time;
2) comparing the directed weighted graph with a plurality of normal directed weighted graphs constructed from a plurality of normal execution traces, and judging whether the execution trace is normal according to the request elapsed time and the normal request elapsed time;
3) if the execution trace is a faulty execution trace, converting each service name, in the calling order of the services called in the trace, into a corresponding fixed-length string, and concatenating the strings in calling order to obtain an unknown fault request string;
4) extracting all the fault call strings of the known faults from a known fault database, and computing the similarity between the unknown fault request string and each fault call string to obtain the fault cause of the execution trace;
5) obtaining the corresponding service exception type according to the fault cause, wherein the service exception types comprise service failure and performance degradation;
6) if the service has failed, restarting the service; and if the performance has degraded, performing adaptive matching of the service container.
2. The method of claim 1, wherein the execution traces are collected by setting the Mixer-specified execution trace collection interface zipkin-address-url to the open interface of Jaeger.
3. The method of claim 1, wherein the known fault database is created by:
a) injecting a plurality of sample faults into a normally running system and sending a plurality of requests to obtain a plurality of known fault execution traces;
b) constructing the known fault directed weighted graphs of the plurality of known fault execution traces;
c) classifying all the known fault directed weighted graphs according to their fault causes to obtain a plurality of groups of known fault directed weighted graphs, which form the known fault database.
4. The method of claim 1, wherein whether the execution trace is normal is judged by the following steps:
1) calculating, from the weights of all directed edges of the plurality of normal directed weighted graphs, the upper limit t_max = μ + z_α·σ/√n and the lower limit t_min = μ - z_α·σ/√n of the normal request elapsed time, where μ denotes the average elapsed time of the n normal requests of the normal execution traces, σ denotes the standard deviation of the elapsed time of the normal requests, and z_α denotes the α quantile of the normal distribution;
2) if t_min ≤ request elapsed time ≤ t_max, the execution trace is a normal execution trace; if the request elapsed time < t_min or the request elapsed time > t_max, the execution trace is a faulty execution trace.
5. The method of claim 1, wherein the fault call string is obtained by the following steps:
1) for the execution trace of each known fault in the known fault database, retrieving in chronological order the vertex S_1 of the known fault directed weighted graph, and converting the service name corresponding to vertex S_1 into a fixed-length string c_1 using a mapping function;
2) obtaining, according to the calling relations, each next connected vertex S_i of the directed weighted graph, and converting the service name corresponding to vertex S_i into a fixed-length string c_i using the mapping function, where 2 ≤ i ≤ p and p is the total number of services contained in the execution trace of the known fault;
3) obtaining the fault call string from the string set {c_1, …, c_i, …, c_p}.
6. The method of claim 1, wherein the similarity between the unknown fault request string and each fault call string is s = (1/m)·Σ_{fi∈T} (1 - d_i/max(l_fi, l_e)), where T is the set of known fault execution traces corresponding to the known fault f, d_i is the edit distance between the request string of each known fault execution trace fi and that of e, l_fi is the known fault string length of each known fault execution trace fi, l_e is the request string length of the execution trace corresponding to the unknown fault e, and m is the number of known fault execution traces corresponding to the known fault f.
7. The method of claim 1, wherein the fault cause comprises a queue of known faults arranged in descending order of similarity.
8. The method of claim 1, wherein the method of detecting the type of service failure comprises viability detection using a probe.
9. The method of claim 1, wherein the service failure type includes a state code of not 0 at the time of container exit, a designated port of the IP address is not opened, and a response code of a container HTTP request of the designated port is less than 200 or equal to or greater than 400.
10. The method of claim 1, wherein adaptive matching of the service container is performed according to the following policies:
1) when a metric of the service container exceeds the upper limit of the expected metric, calculating the number of expanded service container replicas and distributing them;
2) when the metric of the service container is below the upper limit of the expected metric, deleting some of the service container replicas.
CN202010789841.0A 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity Active CN112118127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010789841.0A CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010789841.0A CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Publications (2)

Publication Number Publication Date
CN112118127A CN112118127A (en) 2020-12-22
CN112118127B (en) 2021-11-09

Family

ID=73803671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010789841.0A Active CN112118127B (en) 2020-08-07 2020-08-07 Service reliability guarantee method based on fault similarity

Country Status (1)

Country Link
CN (1) CN112118127B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925668B (en) * 2021-02-25 2024-04-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for evaluating server health
CN113190373B (en) * 2021-05-31 2022-04-05 中国人民解放军国防科技大学 Micro-service system fault root cause positioning method based on fault feature comparison

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109873812A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and computer equipment
CN109933452A (en) * 2019-03-22 2019-06-25 中国科学院软件研究所 A kind of micro services intelligent monitoring method towards anomalous propagation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255127B2 (en) * 2015-09-30 2019-04-09 International Business Machines Corporation Optimized diagnostic data collection driven by a ticketing system
CN110262972B (en) * 2019-06-17 2020-12-08 中国科学院软件研究所 Failure testing tool and method for micro-service application

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109873812A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and computer equipment
CN109933452A (en) * 2019-03-22 2019-06-25 中国科学院软件研究所 A kind of micro services intelligent monitoring method towards anomalous propagation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A fault detection method for cloud computing systems based on adaptive monitoring; Wang Tao et al.; Chinese Journal of Computers (《计算机学报》); 2016-10-29; full text *

Also Published As

Publication number Publication date
CN112118127A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN108415789B (en) Node fault prediction system and method for large-scale hybrid heterogeneous storage system
KR102522005B1 (en) Apparatus for VNF Anomaly Detection based on Machine Learning for Virtual Network Management and a method thereof
Gu et al. Online anomaly prediction for robust cluster systems
Xu et al. Online system problem detection by mining patterns of console logs
US8655623B2 (en) Diagnostic system and method
US7502971B2 (en) Determining a recurrent problem of a computer resource using signatures
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
Lan et al. Toward automated anomaly identification in large-scale systems
US20060188011A1 (en) Automated diagnosis and forecasting of service level objective states
CN112118127B (en) Service reliability guarantee method based on fault similarity
CN104903866A (en) Management system and method for assisting event root cause analysis
CN110489317B (en) Cloud system task operation fault diagnosis method and system based on workflow
CN109918313B (en) GBDT decision tree-based SaaS software performance fault diagnosis method
Fu et al. Performance issue diagnosis for online service systems
CN114528175A (en) Micro-service application system root cause positioning method, device, medium and equipment
KR20220166760A (en) Apparatus and method for managing trouble using big data of 5G distributed cloud system
CN114154035A (en) Data processing system for dynamic loop monitoring
CN114327964A (en) Method, device, equipment and storage medium for processing fault reasons of service system
Chen et al. Trace-based intelligent fault diagnosis for microservices with deep learning
CN110175100B (en) Storage disk fault prediction method and prediction system
CN111339052A (en) Unstructured log data processing method and device
CN116069618A (en) Application scene-oriented domestic system evaluation method
CN115860709A (en) Software service guarantee system and method
CN111506422B (en) Event analysis method and system
Lingrand et al. Analyzing the EGEE production grid workload: application to jobs submission optimization

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant