CN111682964B - Rapid recovery method for combined Web service failure

Rapid recovery method for combined Web service failure

Info

Publication number
CN111682964B
CN111682964B
Authority
CN
China
Prior art keywords
web service
agent
service
combined web
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010445889.XA
Other languages
Chinese (zh)
Other versions
CN111682964A (en)
Inventor
王鹏
陈亮
熊达鹏
衣双辉
蔡军
包阳
何骏
施寅生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese People's Liberation Army 32801
Original Assignee
Chinese People's Liberation Army 32801
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Chinese People's Liberation Army 32801 filed Critical Chinese People's Liberation Army 32801
Priority to CN202010445889.XA priority Critical patent/CN111682964B/en
Publication of CN111682964A publication Critical patent/CN111682964A/en
Application granted granted Critical
Publication of CN111682964B publication Critical patent/CN111682964B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/0677 Localisation of faults
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L67/133 Protocols for remote procedure calls [RPC]
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for quickly recovering from combined Web service failures. The method comprises the following steps: 1) setting a fault checkpoint in each atomic service of the combined Web service, where the fault checkpoint monitors the atomic service and sends monitoring information outward; 2) setting a detection Agent M-Agent at the server side of the combined Web service, where the detection Agent M-Agent receives the monitoring information sent by the fault checkpoints and acquires the actual execution topology of the combined Web service; 3) the detection Agent M-Agent locates the failed atomic service of the combined Web service; 4) setting multi-level recovery Agents R-Agent in the combined Web service and restarting the failed combined Web service using a micro-restart-based recursive recovery method. With this rapid recovery method, failure recovery of the combined Web service can be completed quickly and efficiently, and the reliability of the combined Web service is improved.

Description

Rapid recovery method for combined Web service failure
Technical Field
The invention relates to the field of computer networks and applications, and in particular to a rapid recovery method for combined Web service failures.
Background
Distributed computing based on Web services is a main trend in the development of the Internet. The rapid development and wide application of Web services have driven the prosperity of Web service composition technology: Web technology is highly interoperable, cross-platform, and loosely coupled, and large-scale composite services can be realized through cooperation among individual Web services. However, the network environment in which Web services run is unreliable, heterogeneous, and dynamic; network failures, denial-of-service attacks, infrastructure failures, and updates or upgrades of atomic services can cause service operation failures, unavailability, and loss of responses, which hinders the application and popularization of Web service composition technology. Therefore, how to recover a service quickly when a failure occurs, keep the service continuously available, and improve service reliability is an important and urgent problem for the application and popularization of Web service composition technology.
At present, methods for recovering from combined Web service failures fall mainly into two types: node fault handling and protocol fault handling. The former implements failure recovery mainly through predefined fault handling mechanisms; the latter mainly provides failure recovery mechanisms in the peripheral operating environment of the composite service. Both approaches have limitations.
In the node fault handling approach, the logical structure of the workflow system is analyzed, atomic services that are prone to failure are backed up redundantly, and fault checkpoints are set; the user predefines fault handling mechanisms according to their own requirements and prepares handling plans in advance for possible operation faults, thereby achieving fault tolerance and fault recovery for workflow services. This approach lacks flexibility: fault handling is fast for nodes with preset checkpoints and recovery mechanisms, but recovery is inefficient for fault points without checkpoints. When the set of services involved in the composite service is large, improving fault tolerance and recovery efficiency with this approach requires predefining fault handling mechanisms for a large number of nodes, which occupies substantial service resources and reduces the overall efficiency of the combined Web service.
In the protocol fault handling approach, fault handling for the composite service is provided mainly on the basis of the fault handling and transaction mechanisms built into the WS-BPEL language framework, and a failure recovery mechanism is provided in the peripheral operating environment of the composite service by extending existing protocol standards to improve existing WS-BPEL fault handling. The advantage of this approach is that its influence on the composite service is small; its defect is that atomic services which do not need to be restarted may nevertheless be restarted when the combined Web service restarts, so the extra system overhead is large and the failure recovery efficiency is low.
Disclosure of Invention
The invention provides a method for rapidly recovering the failure of a combined Web service, which can realize the active monitoring of the failure of the combined Web service and improve the failure detection efficiency and the failure recovery efficiency.
An embodiment of the invention comprises the following steps:
S1: setting a fault checkpoint in each atomic service of the combined Web service, where the fault checkpoint monitors the atomic service and sends monitoring information outward;
S2: setting a detection Agent M-Agent at the server side of the combined Web service, where the detection Agent M-Agent receives the monitoring information sent by the fault checkpoints and acquires the actual execution topology of the combined Web service;
S3: the detection Agent M-Agent locates the failed atomic service of the combined Web service;
S4: setting multi-level recovery Agents R-Agent in the combined Web service, and restarting the failed combined Web service using a micro-restart-based recursive recovery method.
As a further improvement of the embodiment of the present invention, the fault checkpoint is implemented in the following manner: the combination logic of each atomic service is obtained from the BPEL file of the combined Web service, and an AOP-based monitoring probe is inserted into each atomic service.
As a further improvement of the embodiment of the present invention, the monitoring information acquired by the detection Agent M-Agent through each monitoring probe includes: the called information, heartbeat information, and exception information of each atomic service.
As a further improvement of the embodiment of the present invention, the detection Agent M-Agent locates the failed atomic service of the combined Web service according to the actual execution topology of the combined Web service and the heartbeat information and exception notification information of the failed atomic service.
As a further improvement of the embodiment of the present invention, the step of acquiring the actual execution topology of the composite Web service includes:
S21: the fault checkpoint sends the called information of the atomic service to the detection Agent M-Agent, with the information structure 'URL of the atomic service + execution start' or 'URL of the atomic service + execution end';
S22: the detection Agent M-Agent collects all the marks to form a marker string;
S23: the execution topology of the combined Web service is restored using the marker string.
As a further improvement of the embodiment of the present invention, the marker string is represented as: "URL of atomic service 1 Begin … URL of atomic service n Begin … URL of atomic service n End … URL of atomic service 1 End".
As a further improvement of the embodiment of the present invention, the step of the recursive recovery method based on micro-restart comprises:
S41: finding the direct recovery agent of the failed atomic service through the BPEL file of the combined Web service, and restarting the whole restart tree with the direct recovery agent as its vertex;
S42: if the restart fails, finding the upper-level recovery agent, and restarting the whole restart tree with that upper-level recovery agent as its vertex;
S43: repeating step S42 until the combined Web service is restarted successfully or restarting by the root recovery agent fails.
The invention has the following effects: the method monitors the atomic services by setting fault checkpoints, locates failed atomic services by combining the actual execution topology tree of the combined Web service, and sets multi-level recovery agents to restore normal operation of the combined Web service using a micro-restart-based recursive recovery method.
Compared with the prior art, the method occupies fewer resources, is more flexible, and achieves higher failure recovery efficiency. Specifically, compared with the node fault handling methods of the prior art, the method does not need to predefine a large number of fault handling mechanisms, which improves flexibility; it actively monitors faults of the combined Web service through probe insertion, which improves fault detection efficiency and shortens failure recovery time. Compared with the protocol fault handling of the prior art, the invention optimizes the failure decision of the combined Web service by adopting a micro-restart-based recursive recovery method, which improves the failure recovery efficiency of the combined Web service. In short, the failure recovery method can quickly and efficiently complete failure recovery of the combined Web service.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a general flow chart of a combined Web service failure rapid recovery method provided by the present invention;
FIG. 2 illustrates an AOP code implantation principle diagram;
FIG. 3 shows a schematic diagram of the failure detection principle of heartbeat detection combined with a monitor probe;
FIG. 4 is a schematic diagram showing the generation of a marker string from the execution output of a composite Web service;
FIG. 5 is a diagram illustrating the execution topology of a combined Web service restored from its marker string;
FIG. 6 shows a schematic diagram of a combined Web service into which mark-outputting AOP-based monitoring probes are inserted;
FIG. 7 illustrates a logical relationship diagram of a composite Web service;
fig. 8 shows a schematic diagram of a recursive recovery process based on a micro-restart.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a rapid recovery method for combined Web service failures, which mainly comprises setting fault checkpoints, acquiring the execution topology, detecting and locating failures, and performing recursive failure recovery.
Fig. 1 is a general flowchart of a combined Web service failure quick recovery method provided by the present invention.
S1: Setting a fault checkpoint.
Setting the fault checkpoints means acquiring the combination logic of each atomic service from the BPEL file of the combined Web service, inserting an AOP-based monitoring probe into each atomic service, and performing failure detection by combining heartbeats with the AOP-based monitoring probes. The specific steps for setting a fault checkpoint are as follows:
S11: an AOP-based monitoring probe is inserted into the execution code of each atomic service of the combined Web service, with the probe's matching point being the main function of the Web service code;
S12: during operation of the Web service, the notification function of the monitoring probe sends out heartbeat signals and fault notification information;
S13: finally, the heartbeat signals and fault notification information are received by the message agent on the server, and any captured exceptions are processed.
Fig. 2 shows the working principle of a fault checkpoint. Before a fault checkpoint is set in an atomic service, the atomic service fails when an exception occurs and the exception is not handled; this behaviour is referred to herein as fail-silence. To address silent service failure, the method provides an anomaly detection approach that combines heartbeat detection with monitoring probes. An AOP-based monitoring probe is inserted into the execution code of the atomic service, and its notification function sends messages and operation instructions to the detection Agent M-Agent of the combined Web service, changing the fail-silent behaviour of the atomic service. After learning of an exception, the detection Agent M-Agent of the combined Web service handles the abnormal condition of the atomic service: it sends an exception notification to the user side, and it sends a failure recovery instruction to the recovery agent of the combined Web service to initiate recovery measures.
Fig. 3 shows a schematic diagram of the failure detection principle combining heartbeat detection with monitoring probes. In the basic flow of Web service invocation, a user selects and binds Web services through the WSDL file descriptions in a registry. The method improves this basic flow: as shown in fig. 3, a detection Agent M-Agent is established locally on the server that deploys the Web Service; requests and responses between the user and the Web Service are forwarded through the M-Agent, and the M-Agent also receives the heartbeat signals sent by the atomic Services and the exception notifications sent by the monitoring probes, so that the running state and abnormal state of each atomic service are known in time. The heartbeat signal indicates that the Web service is in a normally connected state, and the notification function of the AOP probe sends exception information to the M-Agent when an atomic service is abnormal.
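As an illustration of such a probe, the following Java/AspectJ sketch weaves a monitoring aspect around the main entry method of an atomic service and reports heartbeat, Begin/End, and exception information to the M-Agent. The pointcut pattern, the M-Agent endpoint URL, and the message format are assumptions introduced for this example and are not specified by the patent.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical AOP-based monitoring probe woven into an atomic service.
@Aspect
public class MonitoringProbe {

    // Assumed M-Agent endpoint and atomic service URL (illustrative only).
    private static final String M_AGENT_ENDPOINT = "http://localhost:8080/m-agent/report";
    private static final String SERVICE_URL =
            System.getProperty("atomic.service.url", "urn:example:atomic-service");

    // Pointcut: the main entry method of the atomic service implementation (pattern is an assumption).
    @Around("execution(public * *..*ServiceImpl.main(..))")
    public Object monitor(ProceedingJoinPoint pjp) throws Throwable {
        report(SERVICE_URL + "|HEARTBEAT"); // heartbeat: the service is reachable and executing
        report(SERVICE_URL + "|Begin");     // called-information mark: execution start
        try {
            Object result = pjp.proceed(); // run the original atomic service logic
            report(SERVICE_URL + "|End");  // called-information mark: execution end
            return result;
        } catch (Throwable t) {
            // Break "fail-silence": notify the M-Agent of the exception together with the service URL.
            report(SERVICE_URL + "|EXCEPTION|" + t.getClass().getName());
            throw t;
        }
    }

    // Minimal HTTP POST to the detection Agent; error handling kept deliberately simple.
    private void report(String message) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(M_AGENT_ENDPOINT).openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(message.getBytes("UTF-8"));
            }
            con.getResponseCode(); // force the request to be sent
            con.disconnect();
        } catch (Exception e) {
            // A failed report must not break the monitored service itself.
        }
    }
}
```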
S2: topology acquisition is performed.
Execution topology acquisition uses the probes to send execution feedback information when an atomic service is called, and thereby obtains the actual execution topology of the combined Web service. The specific steps for obtaining the execution topology are as follows:
S21: the AOP probe inserted in each atomic service of the combined Web service sends a mark containing the atomic service URL to the detection Agent M-Agent at the beginning and at the end of that atomic service's execution;
S22: the detection Agent M-Agent collects all the marks to form a marker string, expressed as: "URL of atomic service 1 Begin … URL of atomic service n Begin … URL of atomic service n End … URL of atomic service 1 End";
S23: the execution topology of the combined Web service is restored using the marker string.
Take the simplest composite Web service of two atomic services A and B as an example. After marking, if A calls B during execution, the flow is: service A starts and outputs A0; B is called during execution, so B starts and outputs B0; B finishes and outputs B1; execution returns to A, which finishes and outputs A1. This yields the string tag A0B0B1A1. Any marked combined Web service, once executed, yields a marker string consistent with its execution topology. Each atomic service of the composite service on the left side of fig. 4 is marked in this way; when execution completes, the marker string A0B0D0D1B1C0E0G0G1H0H1E1F0F1C1A1 is obtained. Such a marker string is reduced to an execution topology tree by a restore algorithm, as shown in fig. 5. In this way, the execution topology of the composite Web service is restored indirectly by obtaining a string that represents the topology.
The algorithm principle of the execution topology structure of the marked character string restoration is as follows:
Taking the combined Web service execution string tag A0B0B1A1 as an example, A0 and A1 are the head and tail identifiers of service A, and B0 and B1 are the head and tail identifiers of service B. A is the parent node of B if the following two conditions are met:
1. in the execution string, the substring from B0 to B1 is completely contained in the substring from A0 to A1;
2. the substring from A0 to A1 is the shortest such substring that contains the substring from B0 to B1.
The execution topology is restored from the character string step by step according to the above-mentioned rules, and a specific example is shown in fig. 6.
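As a concrete illustration of the restore rule above, the following Java sketch rebuilds the execution topology tree from a sequence of Begin/End marks in a single stack pass. The mark format ("url|Begin" / "url|End") and class names are assumptions made for this example, not details given in the patent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// A service B becomes a child of A when B's Begin/End marks fall inside A's Begin/End marks.
public class TopologyRestorer {

    static class Node {
        final String url;
        final List<Node> children = new ArrayList<>();
        Node(String url) { this.url = url; }
    }

    public static Node restore(List<String> marks) {
        Node virtualRoot = new Node("(root)");
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(virtualRoot);
        for (String mark : marks) {
            String[] parts = mark.split("\\|");
            String url = parts[0], kind = parts[1];
            if ("Begin".equals(kind)) {
                Node node = new Node(url);
                stack.peek().children.add(node); // the innermost open service is the parent
                stack.push(node);                // this service is now the innermost open one
            } else {                             // "End": the innermost open service finishes
                stack.pop();
            }
        }
        return virtualRoot.children.isEmpty() ? virtualRoot : virtualRoot.children.get(0);
    }

    // Example corresponding to the two-service case in the text: A calls B.
    public static void main(String[] args) {
        List<String> marks = List.of("A|Begin", "B|Begin", "B|End", "A|End");
        Node root = restore(marks);
        System.out.println(root.url + " -> " + root.children.get(0).url); // prints: A -> B
    }
}
```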
S3: failure detection positioning
The failure detection positioning is to position the failure atomic service of the combined Web service according to the marking character string of the combined Web service and the fault detection information sent by the probe; failures occurring during the operation of a composite Web service mainly include two types: the first is node failure due to atomic service function failure and the second is communication failure where messages are lost due to communication link failure.
For the first type of failure, the method detects node failures through the AOP probe: when an atomic service in the combined Web service throws an exception while being called, the AOP monitoring probe captures the exception information and notifies the detection Agent M-Agent. For the second type of failure, the method locates failures of the composite service by adding heartbeat signals to the notification function of the AOP monitoring probe and combining them with the execution marker string of the combined Web service.
The specific algorithm flow is as follows:
S31: the AOP monitoring probe matches the main function main() of the Web service invocation and captures its execution;
S32: the notification function of the AOP monitoring probe sends a heartbeat signal to the monitoring receiver;
S33: during service execution, if an exception signal is captured, go to S34; if no exception signal is captured, go to S35;
S34: send an exception notification containing the URL of the failed atomic service to the detection Agent M-Agent, and exit the procedure;
S35: check whether the current atomic service is the last atomic service in the BPEL file; if so, exit the procedure; if not, go to S31.
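The following Java sketch illustrates, under assumptions about the message handling and timeout value, how the detection Agent M-Agent could combine the two signals described above: an exception notification locates a node failure directly, while a heartbeat timeout for a service whose Begin mark has no matching End mark points to a communication failure.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative M-Agent-side localization logic; names and thresholds are assumptions.
public class FailureLocator {

    private static final long HEARTBEAT_TIMEOUT_MS = 5_000;

    private final Map<String, Long> lastHeartbeat = new HashMap<>();    // service URL -> last heartbeat time
    private final Map<String, Boolean> openExecution = new HashMap<>(); // Begin seen, End not yet seen

    public void onHeartbeat(String serviceUrl) { lastHeartbeat.put(serviceUrl, System.currentTimeMillis()); }
    public void onBegin(String serviceUrl)     { openExecution.put(serviceUrl, Boolean.TRUE); }
    public void onEnd(String serviceUrl)       { openExecution.put(serviceUrl, Boolean.FALSE); }

    // Node failure: the probe itself reported an exception for this service.
    public String onException(String serviceUrl) {
        return "NODE_FAILURE at " + serviceUrl;
    }

    // Communication failure: called periodically; reports a service that started executing
    // but stopped sending heartbeats before its End mark arrived.
    public String checkCommunicationFailure() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Boolean> e : openExecution.entrySet()) {
            if (e.getValue()) {
                Long beat = lastHeartbeat.get(e.getKey());
                if (beat == null || now - beat > HEARTBEAT_TIMEOUT_MS) {
                    return "COMMUNICATION_FAILURE at " + e.getKey();
                }
            }
        }
        return null; // no failure detected
    }
}
```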
Fig. 7 is a schematic diagram showing the logical relationships of an example composite Web Service. When an atomic Service (Service1-Service6) encounters a code execution exception, this is a node failure; a failure between atomic services (e.g., between Service4 and Service5) caused by loss of interaction data or network interruption is a communication failure.
S4: failure recursive recovery
In failure recursive recovery, several recovery agents (R-Agents) are arranged in the combined Web service according to the micro-restart-based recursive recovery method; when a failure occurs, restart recovery is executed starting from the recovery agent closest to the failure until the combined Web service is fully recovered. The specific steps of the micro-restart-based recursive recovery method are as follows:
S41: finding the direct recovery agent of the failed atomic service through the BPEL file of the combined Web service, and restarting the whole restart tree with the direct recovery agent R-Agentc as its vertex;
S42: if the restart fails, finding the upper-level recovery agent, and restarting the whole restart tree with that upper-level recovery agent as its vertex;
S43: repeating step S42 until the combined Web service is restarted successfully or restarting by the root recovery agent fails.
To increase the speed of failure recovery, the method adopts a micro-restart-based recursive recovery approach. During micro-restart-based recursive recovery, all descendant nodes below the failed node are restarted, so the invocation information passed to the restarted nodes by the parent of the failed node must be preserved. To make it convenient to store this invocation information, fixed nodes are designated in the restart tree as the vertices of restart trees; these restart nodes are called recovery agents (R-Agents).
To explain the restart process, the combined Web service shown in fig. 7 is taken as an example. Suppose 4 recovery agents have been set in the service, so that the logical structure of the composite Web service is as shown in fig. 8, where A, B, and C are atomic services and R-Agent1, R-Agent2, R-Agent3, and R-Agent4 are recovery agents. When Service5 fails, the direct recovery agent of Service5, R-Agent3, is found according to the micro-restart-based recursive recovery method, R-Agent3 is taken as the current recovery agent, and the whole subtree with R-Agent3 as its vertex is restarted. If the combined Web service is successfully recovered, the recovery procedure exits; if the recovery fails, it is checked whether R-Agent3 is the root recovery agent. If it is, a recovery failure message is sent and the recovery procedure ends; if not, the restart scope is widened: the upper-level recovery agent R-Agent1 is taken as the current recovery agent, and the whole tree with R-Agent1 as its vertex is restarted. These steps are repeated, recursively recovering level by level until recovery succeeds.
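The escalation logic of this example can be sketched as follows in Java; the RecoveryAgent tree, the restartSubtree() outcome, and the lookup of the direct recovery agent are illustrative assumptions rather than the patent's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of micro-restart-based recursive recovery: restart the failed service's subtree,
// then escalate to the parent recovery agent until recovery succeeds or the root agent fails.
public class MicroRestartRecovery {

    static class RecoveryAgent {
        final String name;
        final RecoveryAgent parent;                   // null for the root recovery agent
        final List<RecoveryAgent> children = new ArrayList<>();
        RecoveryAgent(String name, RecoveryAgent parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
        // Restart the whole restart tree that has this agent as its vertex.
        boolean restartSubtree() {
            System.out.println("Restarting subtree rooted at " + name);
            return Math.random() > 0.3; // placeholder for the real restart outcome
        }
    }

    // Escalate from the failed service's direct recovery agent towards the root.
    public static boolean recover(RecoveryAgent directAgent) {
        RecoveryAgent current = directAgent;
        while (current != null) {
            if (current.restartSubtree()) {
                return true;          // composite Web service restored
            }
            current = current.parent; // widen the restart scope to the upper-level agent
        }
        return false;                 // even the root recovery agent failed
    }

    public static void main(String[] args) {
        RecoveryAgent root = new RecoveryAgent("R-Agent1", null);
        RecoveryAgent rAgent3 = new RecoveryAgent("R-Agent3", root);
        new RecoveryAgent("R-Agent2", root);
        boolean ok = recover(rAgent3); // e.g. Service5 fails -> its direct agent is R-Agent3
        System.out.println(ok ? "composite service recovered" : "recovery failed at root agent");
    }
}
```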
The algorithms provided herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

Claims (6)

1. A method for rapidly recovering combined Web service failure is characterized by comprising the following steps:
S1: setting a fault checkpoint in each atomic service of the combined Web service; the fault checkpoint is an AOP-based monitoring probe inserted into each atomic service according to the combination logic of each atomic service obtained from the BPEL file of the combined Web service; the fault checkpoint monitors the atomic service in which it is located and sends monitoring information outward;
S2: setting a detection Agent M-Agent at the server side of the combined Web service, wherein the detection Agent M-Agent receives the monitoring information sent by the fault checkpoints and acquires the actual execution topology of the combined Web service;
S3: the detection Agent M-Agent locates the failed atomic service of the combined Web service;
S4: setting multi-level recovery Agents R-Agent in the combined Web service, and restarting the failed combined Web service using a micro-restart-based recursive recovery method.
2. The method of claim 1, wherein the monitoring information acquired by the detection Agent M-Agent through each monitoring probe includes: the called information, heartbeat information, and exception information of each atomic service.
3. The method of claim 2, wherein the detection Agent M-Agent locates the failed atomic service of the combined Web service according to the actual execution topology of the combined Web service and the heartbeat information and exception information of the failed atomic service.
4. The method of claim 3, wherein the step of acquiring the actual execution topology of the combined Web service comprises:
S21: the fault checkpoint sends the called information of the atomic service to the detection Agent M-Agent, with the information structure 'atomic service URL + execution start' or 'atomic service URL + execution end';
S22: the detection Agent M-Agent collects all the marks to form a marker string;
S23: the execution topology of the combined Web service is restored using the marker string.
5. The method of claim 4, wherein the marker string is represented as: "URL of atomic service 1 Begin … URL of atomic service n Begin … URL of atomic service n End … URL of atomic service 1 End".
6. The method of claim 5, wherein the micro-restart-based recursive recovery method comprises the following steps:
S41: finding the direct recovery agent R-Agentc of the failed atomic service through the BPEL file of the combined Web service, and restarting the whole restart tree with the direct recovery agent R-Agentc as its vertex;
S42: if the restart fails, finding the upper-level recovery agent of the direct recovery agent R-Agentc, and restarting the whole restart tree with that upper-level recovery agent as its vertex;
S43: repeating step S42 until the combined Web service is restarted successfully or restarting by the root recovery agent fails.
CN202010445889.XA 2020-05-22 2020-05-22 Rapid recovery method for combined Web service failure Active CN111682964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010445889.XA CN111682964B (en) 2020-05-22 2020-05-22 Rapid recovery method for combined Web service failure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010445889.XA CN111682964B (en) 2020-05-22 2020-05-22 Rapid recovery method for combined Web service failure

Publications (2)

Publication Number Publication Date
CN111682964A CN111682964A (en) 2020-09-18
CN111682964B (en) 2020-12-29

Family

ID=72434332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010445889.XA Active CN111682964B (en) 2020-05-22 2020-05-22 Rapid recovery method for combined Web service failure

Country Status (1)

Country Link
CN (1) CN111682964B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320343A (en) * 2008-07-18 2008-12-10 哈尔滨工程大学 Universal fast self-recovery method based on recursion micro-reboot technology
CN101742711B (en) * 2008-11-14 2013-04-10 复旦大学 Dynamic service recovery method in wireless mobile self-organization network
US8219532B2 (en) * 2008-12-22 2012-07-10 International Business Machines Corporation Smart mediation system in a service oriented architecture
CN103490938A (en) * 2013-10-15 2014-01-01 河海大学 Layering-based cloud service combination failure recovery system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108964973A (en) * 2018-05-25 2018-12-07 浙江工业大学 Quality of service monitor method of the web oriented based on Bigraph replacement algorithm
CN110262972A (en) * 2019-06-17 2019-09-20 中国科学院软件研究所 A kind of failure testing tool and method towards micro services application

Also Published As

Publication number Publication date
CN111682964A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN108076098B (en) Service processing method and system
CN112506702B (en) Disaster recovery method, device, equipment and storage medium for data center
US7093013B1 (en) High availability system for network elements
CN110933142A (en) ICFS cluster network card monitoring method, device and equipment and medium
CN109462507B (en) Configuration updating method, device and system and electronic equipment
CN109150977B (en) Method for automatically generating global serial number based on transaction link
CN110618864A (en) Interrupt task recovery method and device
CN112527484A (en) Workflow breakpoint continuous running method and device, computer equipment and readable storage medium
CN111176783A (en) High-availability method and device for container treatment platform and electronic equipment
US6185702B1 (en) Method and system for process state management using checkpoints
CN114422386B (en) Monitoring method and device for micro-service gateway
CN116016123A (en) Fault processing method, device, equipment and medium
EP3937017A1 (en) Maze-driven self-diagnostics using reinforcement learning
CN111682964B (en) Rapid recovery method for combined Web service failure
JP2006309413A (en) Software failure restoring system
CN113364885A (en) Micro-service calling method and device, electronic equipment and readable storage medium
CN113055203B (en) Method and device for recovering exception of SDN control plane
CN113515352B (en) Distributed transaction different-library mode anti-transaction calling method and device
KR100832890B1 (en) Process obstacle lookout method and recovery method for information communication
CN110113395B (en) Shared file system maintenance method and device
US7406678B2 (en) Manager component resource addition and/or resource removal on behalf of distributed software application
CN112068935A (en) Method, device and equipment for monitoring deployment of kubernets program
US7873941B2 (en) Manager component that causes first software component to obtain information from second software component
JP2005018179A (en) Obstacle monitoring device
CN116643733B (en) Service processing system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant