CN117640544A

CN117640544A - Distributed service call management system and method thereof

Info

Publication number: CN117640544A
Application number: CN202311585920.XA
Authority: CN
Inventors: 朱利鲁; 付琨; 王洋; 黄凯; 刘添瑞
Original assignee: Suzhou Aerospace Information Research Institute
Current assignee: Suzhou Aerospace Information Research Institute
Priority date: 2023-11-24
Filing date: 2023-11-24
Publication date: 2024-03-01

Abstract

The invention discloses a distributed service call management system, which comprises: the agent access module realizes the convergence of service-component-method level call information through an integrated multi-source agent, the service call link generation module realizes the display of service call information through the reconstruction of a service call link and a global call topology, and the call link abnormality detection module realizes the positive feedback closed loop of service abnormality discovery, restoration and flow evaluation by establishing a full link abnormality detection restoration flow based on a knowledge base and environmental feedback. The invention can effectively treat complex call management scenes of large-scale and multi-dependent services.

Description

Distributed service call management system and method thereof

Technical Field

The invention relates to a service operation monitoring technology, in particular to a distributed service call management system and a method thereof.

Background

With the rapid development and widespread adoption of microservices, virtualization, and related technologies, modern enterprises and Internet service providers (ISPs) are building increasingly large and complex system architectures. On the one hand, microservice technology makes these systems easier to scale and upgrade. On the other hand, containerization technologies (e.g., Docker) and automation tools (e.g., Kubernetes) allow these systems to run consistently and portably across different environments, enabling rapid deployment and horizontal scaling. However, as system complexity increases and versions iterate rapidly, troubleshooting service anomalies and optimizing performance become increasingly difficult. As cross-service dependencies grow more complex, a single service call may need to traverse multiple services and involve a large number of network transfers and data operations. Once a system failure or performance bottleneck occurs, it is difficult to accurately locate the root-cause service or component. Therefore, a reliable service call management system is needed that can accurately track and record key information such as the path, time, and status of each service call in a distributed environment, so as to enable rapid troubleshooting and performance optimization. Currently, several call tracing systems, such as Zipkin, Jaeger, OpenTelemetry, and SkyWalking, have emerged in the field of distributed service monitoring. These systems provide service call tracing and performance analysis functions that help operation and maintenance personnel locate and resolve performance bottlenecks in a microservice architecture. However, because their configuration and deployment are relatively complex and obstacles remain in multi-language and multi-framework integration, they are difficult to adapt to large distributed platforms or complex business systems.
In academic research, researchers have proposed call chain anomaly detection methods based on rules, machine learning, and anomaly pattern mining. However, these methods are mostly oriented toward theoretical research; in practical application scenarios they still need to be adapted to specific business requirements according to factors such as system characteristics, data characteristics, and computing resources. In summary, the related field currently lacks research on a complete service call management system and method, that is, an integrated solution that provides intelligent service management based on a full-link service call topology.

Disclosure of Invention

The invention aims to provide a distributed service call management system and a method thereof. The work is supported by a Suzhou frontier technology research project (No: SYG 202335).

The technical solution for realizing the purpose of the invention is as follows: a distributed service call governance system comprises a proxy access module, an information fusion analysis module, a service call link generation module and a call link anomaly detection module, wherein:

the proxy access module integrates a gateway agent, an access agent and a log agent, and gathers call operation information among distributed services and among the internal components of services, thereby supporting multi-level, fine-grained service call link monitoring; distributed acquisition and persistent storage of the call information are realized through the open source Filebeat-Logstash-Elasticsearch components;

the information fusion analysis module is used for supporting the monitoring of multi-level calling links of the service-component-method through correlation fusion analysis of calling information of different sources and different levels, further providing effective data support for the visual construction of the calling links, providing a software identification alignment and timestamp alignment method, realizing the alignment processing of the calling information of different levels of the service, the component, the interface and the method, and simultaneously supporting the online analysis and calculation capability of service response time, service throughput, service error rate and SLA violation rate of different time granularities;

the service call link generation module analyzes the call information of each single cross-service call in real time through batch processing tasks, dynamically organizes the service call link and the global call topology based on the link ID and the software identifier, and realizes linkage analysis of local-global service call information, wherein the service call link supports call index analysis and dynamic presentation at different time granularities; the global call topology supports distinguishing service call states by color and dynamically presenting component call relations, call method stacks and call exception information;

the call link anomaly detection module realizes a positive feedback closed loop of service anomaly repair and flow evaluation by establishing a full-link anomaly detection and repair flow: it identifies the state of service call links through an improved 3sigma method and screens out abnormal call links; it comprehensively evaluates the service anomaly probability through service call anomaly classification and service anomaly propagation behavior analysis, and screens out root cause abnormal services; and, by establishing a full-link anomaly detection and repair flow based on a knowledge base and environmental feedback, namely 'rule matching, rule fusion, rule execution and rule scoring', it realizes a positive feedback closed loop of automatic service anomaly repair and repair rule evaluation.

Further, the proxy access module comprises a call agent integrated deployment module and a call information aggregation storage module, wherein:

the call agent integrated deployment module comprises a gateway agent, an access agent and a log agent; the gateway agent is a component for managing and controlling service access in a distributed system, which serves as an intermediary between service providers and service consumers and is responsible for processing requests, routing traffic, and monitoring and collecting data; service-level request call information is extracted by analyzing the gateway agent log, and the format convention is [ call time ] [ caller identification ] [ callee identification ] [ response time ] [ status code ]; the access agent is a component for managing and coordinating communication between services, which serves as a service access intermediary in a distributed architecture and is a non-invasive runtime call capture technology; interface-level request call information is extracted and analyzed through the access agent, and the format convention is [ call time ] [ caller identification ] [ callee identification ] [ link ID ] [ interface name ] [ response time ] [ status code ]; the log agent is a component for processing and outputting application log information, which serves as an intermediary of the log stream and records key log information about the application's operations on databases and caches and about call method stacks; component-level and method-level operation information is extracted and analyzed through the log agent, and the format convention is [ operation time ] [ thread name ] [ log level ] [ operator identifier ] [ interface name ] [ method name ] [ code line number ] [ service type ] [ operation user ] [ target subject address ] [ operation result ] [ exception information ];

the call information aggregation storage module deploys Filebeat components at the server nodes where the services and gateways are located, and realizes correct collection of log data by agreeing on the log storage path and the log file naming convention; the collected logs are aggregated from Filebeat to Logstash through the https communication protocol, and a query service is provided through Elasticsearch; a timing task queries and processes log data from Elasticsearch in batches into service call key logs that conform to the agreed format, and the key logs are centrally persisted to a relational database to support subsequent processing.

Further, the information fusion analysis module comprises a multi-source call information alignment processing module and a service call information aggregation and index calculation module, wherein:

the multi-source call information alignment processing module aligns request call information from different sources through a service identifier alignment method and a timestamp alignment method, wherein the service identifier is agreed to be in a dot-separated three-segment format: service ID. service type, wherein the service ID is the unique identifier of the service; the service identifier alignment method processes the identifiers of the calling party or the operating party into a uniform format and processes the identifiers of the called party or the operated party into a uniform format, so that call information with the same service identifier is classified and gathered, which supports subsequent centralized processing of the call information of the same service; the timestamp alignment method standardizes the call time format as yyyy-MM-dd HH:mm:ss.SSS, namely 'year-month-day hour:minute:second.millisecond', wherein the year has 4 digits, the millisecond has 3 digits and every other field has 2 digits, and call information whose time difference lies within an agreed time threshold δ is regarded as time-correlated;

the service call information aggregation and index calculation module supports aggregation analysis of service call information at different time granularities; it groups the service call information from the different agents with time granularity τ: for the gateway agent, grouping is carried out according to the service-service call path; for the access agent, grouping is carried out according to the link ID and the STSID; for the log agent, grouping is carried out according to the service-component call path; the following indexes are calculated for each group:

response time, i.e., the time interval of a cross-service call, in milliseconds, specifically including Average, P95 and P99, which represent the average response time, the 95th-percentile response time and the 99th-percentile response time, respectively;

throughput, i.e., number of calls per unit time; error rate, i.e., the ratio of the number of failed calls to the total number of calls;

SLA violation rate, i.e., the ratio of the total violation time to the total response time; a response time exceeding a threshold [symbol not preserved] is defined as a violation, and the violation time is expressed as [formula not preserved];

The group indexes are aggregated according to service identifier and timestamp: the same index of the same group from different agents is averaged, and different groups are associated through the service identifier to further calculate the service indexes; for response time and SLA violation rate, the maximum of the corresponding index over the groups is taken; throughput is obtained by summing the group throughput indexes.
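As an illustration of the per-group indexes listed above, the following minimal sketch computes the average, P95 and P99 response times, the throughput and the error rate for one group. The function and field names are hypothetical, and the nearest-rank percentile definition is an assumption, since the text does not specify how the percentile lines are computed. The SLA violation rate is left out because its expression is not available in the text.

```python
import math
from typing import Dict, List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile (an assumed definition, not taken from the source)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def group_metrics(response_ms: List[float], failed: int, window_s: float) -> Dict[str, float]:
    """Indexes for one group of calls observed within a window of window_s seconds."""
    total = len(response_ms)
    return {
        "avg": sum(response_ms) / total,      # average response time (ms)
        "p95": percentile(response_ms, 95),   # 95th-percentile response time (ms)
        "p99": percentile(response_ms, 99),   # 99th-percentile response time (ms)
        "throughput": total / window_s,       # calls per second
        "error_rate": failed / total,         # failed calls / total calls
    }


metrics = group_metrics([10, 20, 30, 40, 100], failed=1, window_s=60)
```

Here the window length plays the role of the time granularity τ; a real deployment would choose it per the configured granularity (minutes, hours, days or months).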

Further, the service call link generation module includes a service call analysis and topology construction module and a global call analysis and topology construction module, wherein:

the service call analysis and topology construction module records a call link generated by service call, wherein the call link is a cross-service and cross-component communication path for completing a specific request call process; on the basis of a service call link, supporting link index analysis and dynamic presentation with different time granularity, wherein the time granularity comprises minutes, hours, days and months, and the link index comprises link response time, link throughput, link access error rate and link SLA violation rate;

the global call analysis and topology construction module records the global call topology among all services and components in the distributed system, and the global call topology is constructed from call links cross-correlated through service identifiers; on the basis of the global call topology, the call state of each service is distinguished by color, wherein red, yellow, green and black respectively indicate that the service operation status is fatal, serious, general and informational, so that the call state of all services can be grasped in real time; the global call topology also supports dynamic presentation of component call relations, call method stacks and call exception information.
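The relation between per-request call links and the global call topology can be sketched as follows. This is only an illustration under assumptions: the record shape `(link_id, caller_id, callee_id)` and the function names are hypothetical, and the sketch ignores components, method stacks and status colors.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record shape: (link_id, caller_id, callee_id)
Call = Tuple[str, str, str]


def build_links(calls: Iterable[Call]) -> Dict[str, List[Tuple[str, str]]]:
    """Group caller->callee edges by link ID to form one call link per request."""
    links: Dict[str, List[Tuple[str, str]]] = {}
    for link_id, caller, callee in calls:
        links.setdefault(link_id, []).append((caller, callee))
    return links


def build_topology(links: Dict[str, List[Tuple[str, str]]]) -> Dict[str, Set[str]]:
    """Merge all call links into a global caller -> callees adjacency map."""
    topology: Dict[str, Set[str]] = {}
    for edges in links.values():
        for caller, callee in edges:
            topology.setdefault(caller, set()).add(callee)
            topology.setdefault(callee, set())  # register leaf services as nodes
    return topology


calls = [("l1", "gw", "a"), ("l1", "a", "b"), ("l2", "gw", "b")]
links = build_links(calls)
topology = build_topology(links)
```

Services that appear in several links (here `gw` and `b`) become shared nodes, which is what cross-correlating links through service identifiers amounts to in this simplified view.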

Further, the call link anomaly detection module comprises a link anomaly discovery module, a root cause anomaly tracing module and a heuristic strategy repair module, wherein:

the link anomaly discovery module identifies the state of service call links through a 3sigma method and screens out abnormal service call links; assuming that the time series data of a link index follow a Gaussian distribution, the probability that a data point falls within μ±3σ is 99.73%, so link index data outside μ±3σ are determined to be anomaly points, wherein μ is the mean of the time series data and σ is its standard deviation; the link indexes comprise link response time, link throughput, link access error rate and link SLA violation rate;
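A minimal sketch of the basic three-sigma check on one link index series is given below. It takes σ as the population standard deviation, as in the usual three-sigma rule, and does not attempt to reproduce the improved variant mentioned elsewhere in the text; the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def three_sigma_outliers(series: List[float]) -> List[int]:
    """Return the indexes of points lying outside mu +/- 3*sigma."""
    mu = mean(series)
    sigma = pstdev(series)  # population standard deviation of the series
    return [i for i, x in enumerate(series) if abs(x - mu) > 3 * sigma]


# Twenty normal response times followed by one spike (illustrative values).
response_times = [100.0] * 20 + [1000.0]
outliers = three_sigma_outliers(response_times)
```

Because an extreme point also inflates μ and σ, the rule needs a reasonably long baseline before a single spike falls outside the band; this is one reason a production system might compute μ and σ on a trailing window that excludes the point under test.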

the root cause anomaly tracing module comprehensively evaluates the service anomaly probability through service call index analysis, service anomaly log analysis and service anomaly propagation behavior analysis, screens out the anomaly service, and comprises the following processes:

(1) traversing the abnormal call link according to the call direction to acquire the call index of the current service;

(2) analyzing the service operation log and judging the service anomaly category. The service anomaly categories include: run-time anomalies, database anomalies, computing resource anomalies, network access anomalies, third-party call anomalies, and other anomalies. If an anomaly of the first three categories occurs, the current service is judged to be the abnormal service, the abnormal service is added to the abnormal service queue, and the flow goes to step (4). If an anomaly of the fourth or fifth category occurs, the anomaly is judged to have occurred in the called service, the called service node is acquired, and the flow goes to step (4); if the current service has no service operation log or an anomaly of the other category occurs, the flow goes to step (3);

(3) Acquiring a global call topology associated with a current service, simulating a service exception propagation behavior based on a PageRank algorithm to comprehensively evaluate service exception probability, and screening out exception services;

assume that: a) an anomaly propagates to the calling party with probability d (0<d<1) and, with probability 1-d, appears randomly in any service; b) a service that is invoked by more services has a higher anomaly probability; c) the higher the anomaly probability of a service, the higher the anomaly probability of the services it calls; then the probability $Pr(s_i)$ that an arbitrary service $s_i$ is abnormal is represented by formula (5):

$$Pr(s_i) = \frac{1-d}{N} + d \sum_{s_j \in M(s_i)} \frac{Pr(s_j)}{L(s_j)} \qquad (5)$$

where N represents the total number of services in the global call topology model, $M(s_i)$ represents the set of services that call $s_i$, and $L(s_j)$ represents the number of services that call $s_j$;

in addition, the initial probability $Pr^0(s_i)$ that an arbitrary service $s_i$ is abnormal is defined by formula (6):

$$Pr^0(s_i) = \frac{N_1}{N_2} \qquad (6)$$

wherein $N_1$ represents the number of service requests whose response time is in violation, and $N_2$ represents the total number of service requests;

initializing the abnormal probability of the service node by using a formula (6), iteratively solving the new abnormal probability of the service after the abnormal propagation by using a formula (5), wherein the iteration ending condition is that the abnormal probability converges or reaches the upper limit of the iteration times, and if the abnormal probability of the service exceeds a threshold value gamma, judging that the service is abnormal and adding the service into an abnormal service queue;
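The iteration described above can be sketched as follows, assuming that formula (5) has the usual PageRank update form Pr(s_i) = (1-d)/N + d·Σ_{s_j∈M(s_i)} Pr(s_j)/L(s_j). The sets M and counts L are passed in directly as defined in the text, so the sketch takes no position on how they are derived from the topology. The function names, the default d = 0.85, the tolerance and the iteration limit are assumptions for illustration.

```python
from typing import Dict, List, Set


def propagate_anomaly(
    init: Dict[str, float],   # Pr^0(s_i) = N1 / N2, per formula (6)
    M: Dict[str, Set[str]],   # M(s_i), as defined in the text
    L: Dict[str, int],        # L(s_j), as defined in the text
    d: float = 0.85,          # assumed value; the text only requires 0 < d < 1
    max_iter: int = 100,      # assumed upper limit on iterations
    tol: float = 1e-9,        # assumed convergence tolerance
) -> Dict[str, float]:
    """Iterate the assumed PageRank-style update until convergence or max_iter."""
    n = len(init)
    pr = dict(init)
    for _ in range(max_iter):
        new = {
            s: (1 - d) / n
            + d * sum(pr[j] / L[j] for j in M.get(s, ()) if L.get(j, 0) > 0)
            for s in pr
        }
        converged = max(abs(new[s] - pr[s]) for s in pr) < tol
        pr = new
        if converged:
            break
    return pr


def abnormal_services(pr: Dict[str, float], gamma: float) -> List[str]:
    """Services whose anomaly probability exceeds the threshold gamma."""
    return sorted(s for s, p in pr.items() if p > gamma)


result = propagate_anomaly(
    init={"a": 0.5, "b": 0.1},
    M={"a": set(), "b": {"a"}},
    L={"a": 1, "b": 1},
)
queue = abnormal_services(result, gamma=0.1)
```

In this two-service example the probabilities settle after three iterations, and only `b` exceeds the threshold γ = 0.1.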

(4) judging whether the current abnormal link has been completely traversed; if so, stopping and waiting for the timing task to analyze the next abnormal call link; otherwise, acquiring the next service caller node and turning to step (1).
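The branching in step (2) of the tracing process above can be summarized in a short sketch. The enum, the function name and the returned labels are hypothetical; `None` stands for a service that has no operation log.

```python
from enum import Enum
from typing import Optional


class AnomalyCategory(Enum):
    RUNTIME = 1
    DATABASE = 2
    COMPUTING_RESOURCE = 3
    NETWORK_ACCESS = 4
    THIRD_PARTY_CALL = 5  # rendered as "three-way call" in some passages
    OTHER = 6


# First three categories: the current service itself is the abnormal service.
LOCAL_CATEGORIES = {
    AnomalyCategory.RUNTIME,
    AnomalyCategory.DATABASE,
    AnomalyCategory.COMPUTING_RESOURCE,
}
# Fourth and fifth categories: the anomaly is attributed to the called service.
CALLEE_CATEGORIES = {
    AnomalyCategory.NETWORK_ACCESS,
    AnomalyCategory.THIRD_PARTY_CALL,
}


def locate_anomaly(category: Optional[AnomalyCategory]) -> str:
    """Decide where the tracing flow attributes the anomaly for one service."""
    if category in LOCAL_CATEGORIES:
        return "current_service"
    if category in CALLEE_CATEGORIES:
        return "called_service"
    # No operation log, or the "other" category: fall back to step (3).
    return "propagation_analysis"
```

The fallback branch corresponds to the PageRank-based propagation analysis of step (3).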

The heuristic strategy repair module realizes the positive feedback closed loop of service exception repair and flow evaluation by establishing a heuristic exception repair flow based on a knowledge base and environmental feedback, namely by 'rule matching, rule fusion, rule execution and rule scoring', and the process is as follows:

(1) acquiring an abnormal call link, sequentially acquiring abnormal services in the current abnormal call link from an abnormal service queue in the reverse link direction, and then acquiring service abnormal information according to a service ID, wherein the service abnormal information comprises information such as a service identifier, service response time, service throughput, service access error rate, SLA violation rate, service abnormal category and the like;

(2) if the service abnormality class exists, matching an abnormality repair rule according to the service abnormality class, and turning to the step (5), otherwise turning to the step (3); the abnormal repair rules are predefined and stored in a rule knowledge base, and comprise resource upgrading, configuration updating, service upgrading, service migration, service restarting, service capacity expansion and flow limitation, the predefined rules realize standardized rule development interfaces which support being mounted on an abnormal repair flow in a pluggable mode, and the rule development interfaces support secondary development rules of users;

The predefined rules establish matching relations with the anomaly categories in the knowledge base. Under initial conditions, run-time anomalies match the service upgrade rule, computing resource anomalies match the resource upgrade rule and the service restart rule, network access anomalies match the service migration rule and the flow limitation rule, third-party call anomalies match the configuration update rule, and other anomalies support user-defined matching relations;

(3) retrieving a repair rule set which is the same as the current service exception category from a rule knowledge base, screening out the exception repair rule with the highest priority, and turning to the step (5) if the exception repair rule exists, otherwise turning to the step (4);

(4) calculating and acquiring the repair rule of the abnormal service whose indexes have the highest similarity (Pearson coefficient) with those of the current abnormal service; if rule acquisition fails, the service restart rule is selected by default;

(5) acquiring a service control mode according to the service ID, including an IP deployment mode, a port, a configuration mode, a start-stop mode and the like, and then implementing exception repair on the current service by using a repair rule;

(6) detecting whether the call link is restored to normal within the timeout period; if so, continuing to judge whether all service nodes in the call link have been traversed; if so, turning to step (7), otherwise turning to step (1);

(7) Executing link alarm, requesting to manually participate in repair;

(8) according to the link state and the link repair time, comprehensively evaluating all rules for executing the link repair, updating a rule knowledge base, and ending the current calling link repair process; specifically, rule knowledge is defined as triples < S, R, P >, where S represents the environmental state in which the abnormal service is located, and consists of (service identification, link ID, service index, abnormal class); r represents an abnormality repair rule; p represents the priority of the exception repair flow, which is calculated using equation (7);

where T represents the total consumption of successful exception repair, T is a time threshold, α is an indicator function, α=1 represents successful exception repair, α=0 represents failed exception repair, μ and ε represent the base priorities of successful and failed exception repair, respectively, and μ > ε.
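Steps (3) and (4) of the repair flow, which pick a rule from the knowledge base, can be sketched as follows. The `RuleKnowledge` record is a simplified, hypothetical rendering of the triple <S, R, P>, the rule names are illustrative labels, and an empty knowledge base is treated as a failed rule acquisition. The priority update of formula (7) is not implemented because its expression is not available in the text.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RuleKnowledge:
    category: str             # anomaly category, part of the environment state S
    metrics: Sequence[float]  # service indexes recorded in S
    rule: str                 # R: the anomaly repair rule
    priority: float           # P: priority learned from past repairs


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_rule(kb: List[RuleKnowledge], category: str, metrics: Sequence[float]) -> str:
    """Highest-priority rule for the same category, else the most similar case, else restart."""
    same = [k for k in kb if k.category == category]
    if same:
        return max(same, key=lambda k: k.priority).rule
    if kb:
        return max(kb, key=lambda k: pearson(k.metrics, metrics)).rule
    return "service_restart"  # default rule when acquisition fails


kb = [
    RuleKnowledge("database", [1, 2, 3], "configuration_update", 0.4),
    RuleKnowledge("database", [1, 2, 3], "service_restart", 0.9),
    RuleKnowledge("network_access", [3, 2, 1], "service_migration", 0.5),
]
```

For example, `select_rule(kb, "database", [1, 1, 1])` picks the higher-priority database rule, while a category with no stored rules falls through to the similarity search.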

A distributed service call governance method is based on the distributed service call governance system to realize distributed service call governance.

A computer device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein when the processor executes the computer program, the processor implements distributed service call governance based on the distributed service call governance system.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements distributed service call governance based on the distributed service call governance system.

Compared with the prior art, the invention has the remarkable advantages that: 1) By fusing various proxy call information, fine granularity monitoring of the service-component-method multi-level call link is realized. 2) The full-link anomaly detection and repair process based on the knowledge base and the environmental feedback is established, the positive feedback closed loop of service anomaly repair and process evaluation is realized, and the traditional low-efficiency management mode of unsupervised link tracking is changed. Particularly, a heuristic type exception repairing method is provided, and the accuracy and efficiency of online repairing of service call exceptions are remarkably improved.

Drawings

FIG. 1 is an overall block diagram of the distributed service call governance system of the present invention.

Fig. 2 is a block diagram of a proxy access module.

Fig. 3 is a block diagram of an information fusion analysis module.

Fig. 4 is a flow chart of information fusion analysis.

Fig. 5 is a block diagram of a service call link generation module.

Fig. 6 is a diagram of the call link anomaly detection module.

Fig. 7 is a flow chart of link anomaly identification and root cause tracing.

FIG. 8 is a flow chart for heuristic exception repair.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The invention integrates the multi-source call information fusion technology, the distributed call link tracking technology, the call anomaly detection and fault recovery technology and the like, so that the treatment process of the multi-source service full call link in the distributed environment is realized, and the overall system structure is shown in figure 1.

The invention discloses a distributed service call management system and a method thereof. The system comprises a proxy access module, an information fusion analysis module, a service call link generation module and a call link anomaly detection module, and the flow for governing the call links of distributed services comprises the following steps:

a. The proxy access module collects, to the greatest extent possible, call information among distributed services and among the internal components of services by integrating multiple agent tools, so that fine-grained service call link monitoring can be supported. The gateway agent collects service-level call information, the access agent collects interface-level call information, and the log agent collects component-level and method-level call information. The collected call information is transmitted through the open source Filebeat component, aggregated and queried through Logstash and Elasticsearch, and stored uniformly in a standardized storage structure.

b. The information fusion analysis module establishes associations among multi-level call information and analyzes call indexes by aligning call information from different sources and different levels. The module provides methods such as software identifier alignment and timestamp alignment, and realizes alignment of call information at the service, component, interface and method levels. It also supports online analysis and calculation of indexes such as service response time, service throughput, service error rate and SLA violation rate at different time granularities.

c. The service call link generation module analyzes the call information of each single cross-service call in real time through batch tasks, dynamically organizes the service call link and the global call topology based on the link ID and the software identifier, and realizes linkage analysis of local-global service call information. The service call link supports call index analysis and dynamic display at different time granularities; the global call topology supports distinguishing service call states by color and dynamically displaying component call relations, call method stacks and call exception information.

d. The call link anomaly detection module establishes a full-link anomaly detection and repair flow to realize a positive feedback closed loop of service anomaly repair and flow evaluation. First, the state of service call links is identified through an improved 3sigma method, and abnormal call links are screened out. Then, the service anomaly probability is comprehensively evaluated through service call anomaly classification and analysis of service anomaly propagation behavior, and root cause abnormal services are screened out. Finally, a full-link anomaly detection and repair flow based on a knowledge base and environmental feedback is established, namely, automatic repair of service anomalies and a positive feedback closed loop of repair rule evaluation are realized through rule matching, rule fusion, rule execution and rule scoring.

The components and functions of the modules are described in detail below with reference to fig. 2-8.

The proxy access module is used for maximally collecting call information among distributed services and among service internal components, and unified access management is carried out on the multi-source service call information through a standardized storage structure, so that a multi-level and fine-grained service call management process is supported. Referring to the proxy access module structure diagram shown in fig. 2, the specific implementation steps are as follows:

(1) Invoking proxy integration deployment

Referring to fig. 2, the call agents include gateway agents, access agents, log agents.

The gateway agent is a component for managing and controlling service access in a distributed system, such as Nginx, which acts as an intermediary between service providers and service consumers and is responsible for tasks such as handling requests, routing traffic, and monitoring and collecting data. Service-level request call information is extracted by analyzing the gateway agent log, and the format convention is [ call time ] [ caller identification ] [ callee identification ] [ response time ] [ status code ].

The access agent is a component for managing and coordinating communication between services, such as the JVM Sandbox, which acts as a service access intermediary in a distributed architecture, and is a non-invasive runtime call capture technique. The request call information at the interface level is extracted and analyzed by the access agent, and the format convention is [ call time ] [ caller identification ] [ callee identification ] [ link ID ] [ interface name ] [ response time ] [ status code ].

The log agent is a component for processing and outputting application log information, which acts as an intermediary for log streams, recording key log information of call method stacks of other components (such as databases, caches, etc.) operated by the application. The operation information of the component level and the method level is extracted and analyzed through the log agent, and the format convention is [ operation time ] [ thread name ] [ log level ] [ operator identifier ] [ operated side identifier ] [ method name ] [ code line number ] [ service type ] [ operation user ] [ target subject address ] [ abnormal information ].
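To illustrate how records following the bracketed format conventions above might be parsed, the sketch below splits a line into named fields. The field names, the sample identifiers and the sample log line are hypothetical; only the bracketed layout and the field order come from the conventions.

```python
import re
from typing import Dict, List

# Hypothetical field names for the bracketed positions in the conventions.
GATEWAY_FIELDS = ["call_time", "caller_id", "callee_id", "response_time", "status_code"]
ACCESS_FIELDS = ["call_time", "caller_id", "callee_id", "link_id",
                 "interface_name", "response_time", "status_code"]

# Matches one "[ value ]" segment and captures the value without padding.
_BRACKET = re.compile(r"\[\s*([^\[\]]*?)\s*\]")


def parse_record(line: str, fields: List[str]) -> Dict[str, str]:
    """Split a '[a] [b] ...' log line into a dict keyed by the agreed field names."""
    values = _BRACKET.findall(line)
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


# Illustrative access agent record (values are made up for the example).
record = parse_record(
    "[2022-01-14 09:51:16.604] [svc-a] [svc-b] [link-1] [getUser] [35] [200]",
    ACCESS_FIELDS,
)
```

Rejecting lines with an unexpected number of fields keeps malformed logs out of the key log store rather than silently misaligning fields.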

In a practical application scenario, a system is usually composed of numerous services that are developed by multiple teams with different architectures, operation protocols and specifications, so it is difficult to unify the service call agent mode. Each service therefore integrates one or more of the gateway agent, the access agent and the log agent according to its own situation.

(2) Call information aggregation and storage

Referring to fig. 2, the present system employs the open source Filebeat-Logstash-Elasticsearch components to realize transmission, aggregation, querying and persistent storage of service call information. The specific implementation steps are as follows:

(1) Filebeat components are deployed at the server nodes where the services and gateways are located, and correct acquisition of log data is realized by agreeing on the log storage path and the log file naming convention.

(2) The collected logs are aggregated from Filebeat to Logstash via the lightweight https communication protocol, and a query service is provided via Elasticsearch.

(3) The log data is queried and processed in batches from the elastic search through a timing task, so that the log data becomes a service call key log meeting the appointed format. The critical logs are centrally persisted to a relational database to support subsequent processing.
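The persistence part of step (3) can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for the unspecified relational database, the `call_log` table layout simply mirrors the access-agent field order, and the rows are hypothetical values that a timed task would obtain from the query service (the query itself is not shown).

```python
import sqlite3

# Assumed table layout: one column per field of the access-agent format.
SCHEMA = """
CREATE TABLE IF NOT EXISTS call_log (
    call_time   TEXT,
    caller_id   TEXT,
    callee_id   TEXT,
    link_id     TEXT,
    interface   TEXT,
    response_ms REAL,
    status_code INTEGER
)
"""


def persist_batch(conn: sqlite3.Connection, rows) -> int:
    """Store one batch of key logs and return the total number of stored rows."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO call_log VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM call_log").fetchone()[0]


# Hypothetical batch produced by the timed task.
batch = [
    ("2022-01-14 09:51:16.604", "order", "stock", "link-001", "/stock/reserve", 35.0, 200),
    ("2022-01-14 09:51:16.650", "stock", "db", "link-001", "/db/query", 12.0, 200),
]
conn = sqlite3.connect(":memory:")
stored = persist_batch(conn, batch)
```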

The information fusion analysis module is used for the association fusion of call information of different sources and different levels, so as to support the fine granularity monitoring of a service-component-method multi-level call link. Referring to fig. 3 and 4, the specific implementation steps are as follows:

(1) Multi-source call information alignment processing

Referring to fig. 3, the module aligns request call information from different sources through methods such as service identification alignment and timestamp alignment. The service identification follows an agreed three-segment format: service ID. service type, where the service ID is the unique identification of the service. The service identification alignment method converts the identifiers of the calling (called) party into a consistent format, so that call information with the same service identification can be classified and converged, which supports subsequent centralized processing of the call information of the same service.

Further, the timestamp alignment method unifies the time format of the multi-source call information to yyyy-MM-dd HH:mm:ss.SSS (i.e., "year-month-day hour:minute:second.millisecond") so that call information can be correlated in time. The year has 4 digits, the millisecond has 3 digits, and every other field has 2 digits, for example: 2022-01-14 09:51:16.604. In addition, because the same call may show time differences as it passes between different access agents, the present system agrees that pieces of call information are considered time-correlated only when the time difference between them is less than or equal to a threshold δ (unit: milliseconds).
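A minimal sketch of the timestamp alignment and the δ check is given below. The list of accepted input formats in `KNOWN_FORMATS` is an assumption made for illustration; the function names are hypothetical.

```python
from datetime import datetime

# Target layout yyyy-MM-dd HH:mm:ss.SSS, written with Python format codes.
TARGET = "%Y-%m-%d %H:%M:%S.%f"

# Assumed input layouts that different agents might emit.
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y/%m/%d %H:%M:%S",
]


def align_timestamp(raw: str) -> str:
    """Convert a timestamp in any known layout to the agreed target layout."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        # %f prints six digits (microseconds); drop the last three for milliseconds.
        return parsed.strftime(TARGET)[:-3]
    raise ValueError(f"unsupported timestamp layout: {raw!r}")


def time_correlated(a: str, b: str, delta_ms: float) -> bool:
    """True when two aligned timestamps differ by at most delta_ms milliseconds."""
    ta = datetime.strptime(a, TARGET)
    tb = datetime.strptime(b, TARGET)
    return abs((ta - tb).total_seconds()) * 1000.0 <= delta_ms


aligned = align_timestamp("2022/01/14 09:51:16")
```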

(2) Service invocation information aggregation and index computation

Referring to fig. 4, the present system supports aggregate analysis of service call information at different time granularities. First, service call information originating from different agents is grouped using a time granularity τ (unit: milliseconds). Specifically, call information from the gateway agent is grouped by the call path "service→service (STS)". Referring to fig. 3, an STS represents a directed cross-service call relationship (e.g., a→b) and is uniquely identified by "<caller service identification, callee service identification>" (STSID); call information from the access agent is grouped by link ID and STSID; call information from the log agent is grouped by the call path "service→component (STC)".

Further, the following indicators are computed for each group: (1) response time (RT), i.e., the time interval of a cross-service call (unit: milliseconds). RT specifically includes Average, P95 and P99, representing the average response time, the 95th-percentile response time and the 99th-percentile response time, respectively; (2) throughput (TH), i.e., the number of calls per unit time; (3) error rate (ER), i.e., the ratio of the number of failed calls to the total number of calls; (4) SLA violation rate, i.e., the ratio of total violation time to total response time, where an average response time exceeding a threshold is defined as a violation. [The threshold symbol and the violation-time expression are missing from the source text.]
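The grouping by time granularity τ and STSID, together with the response time, throughput and error rate indicators, can be sketched as follows. Several points are assumptions: timestamps are taken as integer milliseconds, P95 and P99 use the nearest-rank percentile method, and throughput is expressed as calls per second. The SLA violation rate is left out because its violation-time expression is not available here. All names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile (an assumed choice of percentile method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def group_metrics(calls, tau_ms):
    """calls: iterable of (timestamp_ms, caller, callee, response_ms, ok)."""
    groups = defaultdict(list)
    for ts, caller, callee, rt, ok in calls:
        window = ts // tau_ms                      # index of the tau-sized window
        groups[(window, caller, callee)].append((rt, ok))

    result = {}
    for key, items in groups.items():
        rts = [rt for rt, _ in items]
        failures = sum(1 for _, ok in items if not ok)
        result[key] = {
            "avg": sum(rts) / len(rts),
            "p95": percentile(rts, 95),
            "p99": percentile(rts, 99),
            "th": len(items) / (tau_ms / 1000.0),  # calls per second
            "er": failures / len(items),
        }
    return result


# Hypothetical calls from service A to service B.
calls = [
    (0, "A", "B", 10, True),
    (100, "A", "B", 20, True),
    (200, "A", "B", 30, False),
    (900, "A", "B", 40, True),
    (1500, "A", "B", 50, True),
]
metrics = group_metrics(calls, tau_ms=1000)
```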

Further, the group indicators are aggregated according to the service identification, the timestamp and the time threshold δ, and the same indicator of the same group reported by different agents is averaged. Meanwhile, different groups are associated through the service identification to further calculate the service indicators: for response time and SLA violation rate, the maximum of the corresponding indicator over the groups is taken; for throughput, the sum of the group throughputs is taken.
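A small sketch of these two aggregation steps follows. It assumes the groups to be merged have already been matched by service identification and by the δ time check, and it uses hypothetical indicator values and function names.

```python
def merge_agent_views(views):
    """Average each indicator of the same group as reported by different agents."""
    names = views[0].keys()
    return {name: sum(v[name] for v in views) / len(views) for name in names}


def service_indicators(groups):
    """Service level: maximum for response time and SLA rate, sum for throughput."""
    return {
        "rt": max(g["rt"] for g in groups),
        "sla": max(g["sla"] for g in groups),
        "th": sum(g["th"] for g in groups),
    }


# Hypothetical views of the same group from the gateway agent and the access agent.
gateway_view = {"rt": 20.0, "th": 4.0, "sla": 0.0}
access_view = {"rt": 30.0, "th": 6.0, "sla": 0.5}
merged = merge_agent_views([gateway_view, access_view])

# A second, hypothetical group that belongs to the same service.
other_group = {"rt": 40.0, "th": 1.0, "sla": 0.0}
service = service_indicators([merged, other_group])
```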

The service call link generation module is used for constructing service call links and the global call topology and for displaying call statistical analysis information, including link states, link indicators, service states, service indicators, method call stacks, anomaly categories and the like. Referring to fig. 5, the specific implementation steps are as follows:

(1) Service call analysis and topology construction

A service call link is the cross-service, cross-component communication path that completes a particular request in a distributed system (i.e., the call topology of a particular request). By displaying the service call link and presenting the statistical analysis indicators on it, the call state of the link can be grasped in real time. Referring to fig. 5, taking "link 1" as an example, a service call link is shown as nodes (small circles) and edges (directed arrows), where the arrows represent the direction of data flow.

In the system, for services that integrate an access agent, call information is associated by link ID to construct the service call link. The link ID uniquely identifies the communication path of one full-link call. For services that have not integrated an access agent but have integrated one or both of the gateway agent and the log agent, call links are constructed from the service identification, the timestamp association between services, and the operation relationships between services and components.
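For the link-ID case, link construction can be sketched as below. The record layout (aligned call time, caller, callee, link ID) and the function name are assumptions for illustration; the association by service identification and timestamp for services without an access agent is not shown.

```python
from collections import defaultdict


def build_links(records):
    """records: (call_time, caller, callee, link_id) tuples.

    Returns, per link ID, the ordered list of directed edges (caller, callee).
    """
    links = defaultdict(list)
    # Aligned timestamps sort chronologically as plain strings.
    for call_time, caller, callee, link_id in sorted(records):
        edge = (caller, callee)
        if edge not in links[link_id]:
            links[link_id].append(edge)
    return dict(links)


# Hypothetical records from two requests.
records = [
    ("2022-01-14 09:51:16.650", "B", "C", "link-1"),
    ("2022-01-14 09:51:16.604", "A", "B", "link-1"),
    ("2022-01-14 09:51:17.000", "D", "B", "link-2"),
]
links = build_links(records)
```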

Further, on the basis of the service call link, call link indicators are shown at different time granularities. The time granularities include minute (min), hour (hour), day (day) and month (month). The call link indicators include the link response time (Link Response Time, LRT), the link throughput (Link Throughput, LTH), the link error rate (Link Error Rate, LER) and the link SLA violation rate (Link SLA Violation Rate, LSLA), and are calculated by formulas (1) to (4).

Here, RT(i), TH(i), ER(i) and SLA(i) respectively represent the call indicators of the i-th STS on the link access path, and Min and Max respectively denote taking the minimum and the maximum of the indicator.

(2) Global call analysis and topology construction

The global call topology refers to the call relationships among all services and components in the distributed system. By showing the global service call relationships and presenting the statistical analysis indicators on the service nodes, the call state of all services can be grasped in real time. Referring to fig. 5, the global call topology shows the call relationships between services, and between services and components, as nodes (small circles, small squares, etc.) and edges (directed arrows).

Referring to fig. 5, the global call topology is constructed by associating service call links through the service identification. Specifically, nodes with the same service identification in different service call links are the same service node, and different call links are cross-linked through these shared service nodes to form a service call relationship network. On the global call topology, the call state of a service is distinguished by color: red, yellow, green and black respectively represent the four alarm levels fatal, serious, general and prompt. The global call topology also supports dynamic presentation of call relationships between components, call method stacks and call anomaly information.
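The cross-linking of call links into a global topology can be sketched as a union of edges keyed by service identification. The adjacency-set representation and the names below are assumptions; the color mapping simply restates the four alarm levels from the text.

```python
# Alarm level to display color, as stated in the description.
ALARM_COLORS = {"fatal": "red", "serious": "yellow", "general": "green", "prompt": "black"}


def build_global_topology(links):
    """links: {link_id: [(caller, callee), ...]} -> {service: set of callees}.

    Nodes with the same service identification collapse into one node, so
    links that share a service become connected through it.
    """
    topology = {}
    for edges in links.values():
        for caller, callee in edges:
            topology.setdefault(caller, set()).add(callee)
            topology.setdefault(callee, set())
    return topology


# Two hypothetical links that share service B.
links = {
    "link-1": [("A", "B"), ("B", "C")],
    "link-2": [("D", "B"), ("B", "E")],
}
topology = build_global_topology(links)
```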

The call link abnormality detection module is used for service call abnormality discovery, root cause tracing and fault repair. The module realizes the positive feedback closed loop of service abnormality repair and flow evaluation by establishing a full-link abnormality detection repair flow based on a knowledge base and environmental feedback. Referring to fig. 6-8, the specific implementation steps are as follows:

(1) Link anomaly discovery

Abnormal call links are screened out by identifying the state of each service call link with an improved 3σ method. Assuming that the time-series data of a link indicator follows a Gaussian distribution, the probability that a data point falls within μ±3σ is 99.73%, so link indicator data outside μ±3σ can be judged as outliers, where μ is the mean of the time-series data and σ is its standard deviation. The link indicators include LRT, LTH, LER and LSLA. A call link is judged abnormal when any one of these indicators is abnormal.
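The basic 3σ check can be sketched as follows. The text does not specify how the method is improved, so this sketch shows only the plain rule; the function names and sample history are hypothetical.

```python
from statistics import fmean, pstdev

LINK_INDICATORS = ("LRT", "LTH", "LER", "LSLA")


def is_outlier(history, value, k=3.0):
    """True when value lies outside mu +/- k*sigma of the historical series."""
    mu = fmean(history)
    sigma = pstdev(history)          # population standard deviation
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


def link_is_abnormal(history_by_indicator, latest):
    """A link is abnormal as soon as any one indicator is an outlier."""
    return any(
        is_outlier(history_by_indicator[name], latest[name])
        for name in LINK_INDICATORS
    )


# Hypothetical history: mean 100, standard deviation about 1.87.
history = [100, 102, 98, 101, 99, 100, 103, 97]
histories = {name: history for name in LINK_INDICATORS}
normal = {"LRT": 100, "LTH": 100, "LER": 100, "LSLA": 100}
spiking = {"LRT": 120, "LTH": 100, "LER": 100, "LSLA": 100}
```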

(2) Root-cause anomaly tracing

Referring to fig. 6 and 7, the abnormal service is located by comprehensively evaluating the service abnormality probability through service call index analysis, service abnormality log analysis, and service abnormality propagation behavior analysis. The specific implementation steps are as follows:

(1) Traverse the abnormal call link in the call direction and acquire the call indicators of the current service.

(2) Analyze the service operation log and determine the service anomaly category. The service anomaly categories include: runtime anomalies, database anomalies, computing resource anomalies, network access anomalies, third-party call anomalies, and other anomalies. If one of the first three categories occurs, the current service is judged to be the abnormal service, it is added to the abnormal service queue, and the process goes to step (4). If the fourth or fifth category occurs, the anomaly is judged to occur in the called service, the called service node is acquired, and the process goes to step (4). If the current service has no operation log or another anomaly occurs, the process goes to step (3).

(3) Acquire the global call topology associated with the current service, simulate the service anomaly propagation process based on the PageRank algorithm to comprehensively evaluate the service anomaly probability, and screen out the abnormal services. Assume that: a) an anomaly propagates to the calling party with probability d (0<d<1) and appears randomly in any service with probability 1-d; b) the more services are invoked, the higher the anomaly probability; c) the higher the anomaly probability of a service, the higher the anomaly probability of the services it calls. Then the anomaly probability Pr(s_i) of any service s_i can be represented by formula (5).

Here N represents the total number of services in the global call topology model, M(s_i) represents the set of services that call s_i, and L(s_j) represents the number of services that call s_j. In addition, the initial anomaly probability Pr⁰(s_i) of any service s_i is defined by formula (6).

Pr⁰(s_i) = N₁ / N₂    (6)

Here N₁ represents the number of service requests whose response time is in violation, and N₂ represents the total number of service requests.

The service node anomaly probabilities are initialized using formula (6), and the new anomaly probabilities after anomaly propagation are solved iteratively using formula (5). The iteration ends when the anomaly probabilities converge or the upper limit on the number of iterations is reached. If the anomaly probability of a service exceeds the threshold γ, the service is judged abnormal and added to the abnormal service queue, and the process goes to step (4).
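The iterative evaluation in step (3) can be sketched as below. Formula (5) is not reproduced in this text, so the sketch assumes it takes the standard PageRank form Pr(s_i) = (1 - d)/N + d · Σ Pr(s_j)/L(s_j), summed over s_j in M(s_i). It further assumes that M(s_i) is the set of services that call s_i and that L(s_j) is the number of call edges leaving s_j, which is the usual PageRank normalization; the wording of the description leaves this direction open. The values d = 0.85 and γ = 0.1, the convergence tolerance, and the initial probabilities are illustrative only.

```python
def anomaly_rank(callers, initial, d=0.85, tol=1e-9, max_iter=100):
    """callers[s]: services that call s (assumed reading of M(s)).
    initial[s]: starting probability from formula (6), N1 / N2.
    """
    services = list(initial)
    n = len(services)

    # Assumed reading of L(s_j): number of services that s_j calls.
    out_degree = {s: 0 for s in services}
    for callee, caller_set in callers.items():
        for caller in caller_set:
            out_degree[caller] += 1

    prob = dict(initial)
    for _ in range(max_iter):
        new = {}
        for s in services:
            inflow = sum(prob[c] / out_degree[c] for c in callers.get(s, ()))
            new[s] = (1 - d) / n + d * inflow
        converged = max(abs(new[s] - prob[s]) for s in services) < tol
        prob = new
        if converged:
            break
    return prob


# Hypothetical chain A -> B -> C with assumed initial violation ratios.
callers = {"B": {"A"}, "C": {"B"}}
initial = {"A": 0.1, "B": 0.2, "C": 0.3}
ranked = anomaly_rank(callers, initial)

gamma = 0.1  # assumed threshold value
abnormal_queue = [s for s in initial if ranked[s] > gamma]
```

With a damped propagation of this kind the result depends mainly on the topology, since the contribution of the initial probabilities shrinks with every iteration.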

(3) Heuristic strategy repair

Referring to fig. 6 and 8, the module realizes the positive feedback closed loop of service exception repair and process evaluation by establishing a heuristic exception repair process based on a knowledge base and environmental feedback, namely by 'rule matching, rule fusion, rule execution and rule scoring'. The specific implementation steps are as follows:

(1) Acquire the abnormal call link, and take the abnormal services of the current abnormal call link from the abnormal service queue one by one in the reverse link direction. Then acquire the service anomaly information according to the service ID, including the service identification, service response time, service throughput, service access error rate, SLA violation rate, service anomaly category and the like.

(2) If the service anomaly category exists, match an anomaly repair rule according to the category and go to step (5); otherwise, go to step (3). Specifically, referring to fig. 6, the anomaly repair rules include predefined rules and custom rules. The predefined rules include resource upgrade, configuration update, service upgrade, service migration, service restart, service capacity expansion and traffic restriction. The predefined rules are matched to anomaly categories in the knowledge base. Initially, runtime anomalies match the service upgrade rule; computing resource anomalies match the resource upgrade rule; database anomalies match the service restart rule; network access anomalies match the service migration rule and the traffic restriction rule; third-party call anomalies match the configuration update rule; other anomalies support custom matching relationships. The predefined rules implement a standardized rule development interface, can be mounted into the anomaly repair flow in a pluggable manner, and the rule development interface supports secondary development of rules by users.

(3) Retrieve from the rule knowledge base the set of repair rules whose anomaly category is the same as that of the current service, and select the anomaly repair rule with the highest priority (if several rules qualify, one is taken at random). If such a rule exists, go to step (5); otherwise, go to step (4).

(4) Compute the similarity (Pearson coefficient) between the indicators of the current abnormal service and those of previously repaired abnormal services, and take the repair rule of the abnormal service with the highest similarity. If no rule can be obtained, the "service restart" rule is selected by default.

(5) Obtain the service control mode according to the service ID, including the IP deployment mode, port, configuration mode, start-stop mode and the like, and then apply the repair rule to repair the current service.

(6) Detect whether the call link returns to normal within the timeout period. If so, go to step (8); otherwise, judge whether all service nodes in the call link have been traversed: if so, go to step (7), otherwise go to step (1).

(7) Raise a link alarm and request manual intervention in the repair.

(8) According to the link state and the link repair time, comprehensively score all rules executed during the link repair, update the rule knowledge base, and end the repair process for the current call link. Specifically, rule knowledge is defined as a triple <S, R, P>, where S represents the environment state of the abnormal service and consists of (service identification, link ID, service indicators, anomaly category); R represents the anomaly repair rule; and P represents the priority of the anomaly repair rule. The priority is calculated using formula (7).

Here T represents the total time consumed to successfully repair the whole link anomaly, and T is a time threshold. α is an indicator function: α=1 indicates that the link anomaly repair succeeded, and α=0 indicates that it failed. μ and ε represent the base priorities of successful and failed link anomaly repairs, respectively, with μ > ε.
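The rule selection in steps (2) to (4) can be sketched as a three-stage lookup. The category keys, rule names and the knowledge-base record layout below are hypothetical encodings of the matching relationships described above, and the indicator vector is assumed to hold response time, throughput, error rate and SLA violation rate in that order. The priority update of formula (7) is not sketched because the formula is not reproduced in this text.

```python
import math
import random

# Initial category-to-rule matching, as listed in step (2).
PREDEFINED = {
    "runtime": ["service_upgrade"],
    "computing_resource": ["resource_upgrade"],
    "database": ["service_restart"],
    "network_access": ["service_migration", "traffic_restriction"],
    "third_party_call": ["configuration_update"],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_rules(category, indicators, knowledge):
    """knowledge: list of dicts with keys category, indicators, rule, priority."""
    if category in PREDEFINED:                                   # step (2)
        return list(PREDEFINED[category])
    same = [k for k in knowledge if k["category"] == category]   # step (3)
    if same:
        best = max(k["priority"] for k in same)
        top = [k for k in same if k["priority"] == best]
        return [random.choice(top)["rule"]]
    scored = [                                                    # step (4)
        (pearson(indicators, k["indicators"]), k["rule"])
        for k in knowledge
        if len(k["indicators"]) == len(indicators)
    ]
    if scored:
        return [max(scored)[1]]
    return ["service_restart"]                                    # default rule


# Hypothetical knowledge base entries.
knowledge = [
    {"category": "other", "indicators": [1.0, 2.0, 3.0, 4.0],
     "rule": "service_capacity_expansion", "priority": 0.9},
    {"category": "other", "indicators": [4.0, 3.0, 2.0, 1.0],
     "rule": "traffic_restriction", "priority": 0.4},
]
by_category = select_rules("database", [5.0, 1.0, 0.2, 0.1], knowledge)
by_priority = select_rules("other", [5.0, 1.0, 0.2, 0.1], knowledge)
by_similarity = select_rules(None, [2.0, 4.0, 6.0, 8.0], knowledge)
fallback = select_rules(None, [1.0, 2.0], [])
```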

In summary, the invention realizes the monitoring of the service-component-method multi-level call link by fusing various proxy call information, realizes the association of local-global service call information by dynamically organizing the service call link and global call topology, realizes the positive feedback closed loop of service exception repair and process evaluation by establishing a full-link exception detection repair process based on knowledge base and environment feedback, and can effectively cope with complex call monitoring scenes of large-scale and multi-dependent services.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.

The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A distributed service invocation governance system, comprising: the system comprises a proxy access module, an information fusion analysis module, a service calling link generation module and a calling link abnormality detection module, wherein:

the information fusion analysis module is used for supporting the monitoring of multi-level calling links of the service-component-method through correlation fusion analysis of calling information of different sources and different levels, further providing effective data support for the visual construction of the calling links, providing a software identification alignment and timestamp alignment method, realizing alignment treatment of services, components, interfaces and methods, and simultaneously supporting the online analysis and calculation capability of service response time, service throughput, service error rate and SLA violation rate of different time granularity;

2. The distributed service invocation governance system of claim 1, wherein the proxy access module comprises an invocation proxy integration deployment module and an invocation information aggregation storage module, wherein:

3. The distributed service invocation governance system of claim 1, wherein the information fusion analysis module comprises a multi-source invocation information alignment processing module and a service invocation information aggregation and indicator calculation module, wherein:

response time, i.e., time interval across service calls, units: millisecond, specifically including Average, P95, and P99, represent Average response time, 95 line response time, and 99 line response time, respectively;

4. The distributed service invocation governance system of claim 1, wherein the service call link generation module includes a service call analysis and topology construction module and a global call analysis and topology construction module, wherein:

the global call analysis and topology construction module records the global call topology among all services and components in the distributed system, and the global call topology is constructed by cross-associating call links through the service identification; on the basis of the global call topology, the call state of a service is distinguished by color, wherein red, yellow, green and black respectively represent that the service operation is at the fatal, serious, general and prompt alarm level, so that the call state of all services is grasped in real time; the global call topology also supports dynamic presentation of component call relationships, call method stacks and call anomaly information.

5. The distributed service invocation governance system of claim 2, wherein the call link anomaly detection module comprises a link anomaly discovery module, a root-cause anomaly tracing module, and a heuristic strategy repair module, wherein:

(2) analyzing the service operation log and determining the service anomaly category, wherein the service anomaly categories comprise: runtime anomalies, database anomalies, computing resource anomalies, network access anomalies, third-party call anomalies, and other anomalies; if one of the first three categories occurs, judging that the current service is the abnormal service, adding it to the abnormal service queue, and going to step (4); if the fourth or fifth category occurs, judging that the anomaly occurs in the called service, acquiring the called service node, and going to step (4); if the current service has no service operation log or another anomaly occurs, going to step (3);

assume that: a) an anomaly propagates to the calling party with probability d, 0<d<1, and appears randomly in any service with probability 1-d; b) the more services are invoked, the higher the anomaly probability; c) the higher the anomaly probability of a service, the higher the anomaly probability of the services it calls; then the anomaly probability Pr(s_i) of any service s_i is represented by formula (5);

wherein N represents the total number of services in the global call topology model, M(s_i) represents the set of services that call s_i, and L(s_j) represents the number of services that call s_j;

(4) Judging whether the current abnormal link is traversed to be over, stopping when the current abnormal link is traversed to be over, waiting for the timing task to analyze the next abnormal call link, otherwise, acquiring the next service calling party node, and turning to the step (1);

(1) acquiring an abnormal call link, sequentially acquiring abnormal services in the current abnormal call link from an abnormal service queue in the reverse link direction, and then acquiring service abnormal information according to a service ID, wherein the service abnormal information comprises a service identifier, a service response time, a service throughput, a service access error rate, an SLA violation rate and a service abnormal class;

(4) calculating and acquiring a repair rule of the abnormal service with highest similarity with the current abnormal service index, and defaulting to select a service restarting rule if the rule acquisition fails;

(5) acquiring a service control mode according to the service ID, including an IP deployment mode, a port, a configuration mode and a start-stop mode, and then implementing exception repair on the current service by using a repair rule;

(7) Executing link alarm and requesting repair;

6. A distributed service call governance method, characterized in that distributed service call governance is implemented based on the distributed service call governance system of any of claims 1-5.

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements distributed service call governance based on the distributed service invocation governance system of any one of claims 1 to 5.

8. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements distributed service call governance based on the distributed service invocation governance system of any one of claims 1 to 5.