CN111343260B - Stream processing system fault tolerance method for multi-cloud deployment

Info

Publication number: CN111343260B
Application number: CN202010101719.XA
Authority: CN
Inventors: 沃天宇; 贾宵雷; 林学练; 谢天宇; 罗彦林
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2021-05-28
Anticipated expiration: 2040-02-19
Also published as: CN111343260A

Abstract

The invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly comprises the following three steps: step 1, setting the resource management and allocation architecture as a three-layer architecture: task manager-cloud manager-execution terminal; step 2, taking cloud boundaries as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes and the remaining nodes are called cloud interior nodes; and step 3, performing fault tolerance judgment and processing, wherein a caching and confirmation fault tolerance method is adopted for faults between the boundary nodes of two processing stages, and a globally consistent distributed snapshot fault tolerance method within the cloud is adopted for faults of cloud interior nodes.

Description

Stream processing system fault tolerance method for multi-cloud deployment

Technical Field

The invention relates to a fault tolerance method for a stream processing system, in particular to a fault tolerance method for a stream processing system for multi-cloud deployment.

Background

Active backup and passive backup are two early and closely related fault tolerance techniques. In a stream processing system using active backup, each node has a backup node; the two nodes process data at the same time and periodically synchronize their state. When one node fails, the system can quickly switch to the other node, thereby achieving fault tolerance. Unlike active backup, in a system using passive backup the backup node does not process data, and the master node periodically synchronizes its state to the backup node. When a fault occurs, the system likewise switches to the backup node to continue processing data.

In the prior art, stream processing systems achieve fault tolerance in three main ways: a data source caching and confirmation mechanism, a neighboring-node caching and confirmation mechanism, and a globally consistent distributed snapshot mechanism. However, the fault tolerance mechanisms of existing distributed stream processing systems are mostly designed for a single-cloud environment, and a multi-cloud environment differs greatly from a single-cloud environment in many respects, especially in its network environment. Because of these differences, the disadvantages of the existing mechanisms are amplified and their advantages are diminished.

The problem with the prior-art data source caching and confirmation mechanism is that, if a window operation with a relatively long time span is used in the data processing flow, the timeout of the timer needs to be increased. Otherwise, the system will frequently resend data, which not only wastes system resources but may also affect the final result. On the other hand, increasing the timeout lengthens the time the system takes to discover a fault, which in turn lengthens its fault response time. In addition, this fault tolerance mechanism cannot provide exactly-once semantic guarantees to upper-layer applications.

The problem with the prior-art neighboring-node caching and confirmation mechanism is excessive resource consumption. The method ensures that node faults are discovered and recovered in time, and it can also guarantee exactly-once processing semantics. However, since each node needs to cache data and must persist its state after processing each piece of data, resource consumption is excessive and the processing delay of the system increases.
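The caching and confirmation idea between neighboring nodes can be illustrated with a minimal in-memory sketch. It is not taken from the patent: the class names `UpstreamBuffer` and `Downstream`, the use of sequence numbers, and the cumulative acknowledgment are all assumptions made for illustration, and in-order delivery is assumed. A real system would also have to persist the buffer and the downstream state, which is the source of the resource cost described above.

```python
from collections import OrderedDict


class UpstreamBuffer:
    """Hypothetical sender side: caches records until they are acknowledged."""

    def __init__(self):
        self._pending = OrderedDict()  # sequence number -> record
        self._next_seq = 0

    def send(self, record):
        """Cache the record and return the (seq, record) pair to transmit."""
        seq = self._next_seq
        self._pending[seq] = record
        self._next_seq += 1
        return seq, record

    def ack(self, seq):
        """Drop every cached record with a sequence number <= seq."""
        for s in [s for s in self._pending if s <= seq]:
            del self._pending[s]

    def replay(self):
        """Unacknowledged records, in original order, to resend after a fault."""
        return list(self._pending.items())


class Downstream:
    """Hypothetical receiver side: ignores records it has already applied."""

    def __init__(self):
        self.last_applied = -1
        self.output = []

    def receive(self, seq, record):
        if seq > self.last_applied:  # replayed duplicates are skipped
            self.output.append(record)
            self.last_applied = seq
        return self.last_applied     # cumulative acknowledgment


buf = UpstreamBuffer()
down = Downstream()
msgs = [buf.send(r) for r in ["a", "b", "c"]]
buf.ack(down.receive(*msgs[0]))  # "a" is delivered and acknowledged
down.receive(*msgs[1])           # "b" is delivered, but its ack is lost
for seq, rec in buf.replay():    # recovery: resend everything unacknowledged
    buf.ack(down.receive(seq, rec))
```

In this sketch the duplicate "b" is resent but applied only once, because the receiver compares sequence numbers before applying a record.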

The globally consistent distributed snapshot fault tolerance mechanism in the prior art has the advantages of low resource consumption, low processing delay for each piece of data, short fault discovery time, and the ability to guarantee exactly-once processing semantics; its disadvantage is a long fault recovery time, because the failure of any node requires recovery of the entire pipeline. If the probability of failure in the deployment environment is high, the system will perform failure recovery frequently, which seriously affects its availability.

Disclosure of Invention

The invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly comprises the following three steps: step 1, resource management and allocation, wherein a single-cloud management module is provided so that the resource management and allocation architecture has three layers: task manager-cloud manager-execution terminal; step 2, taking cloud boundaries as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes and the remaining nodes are called cloud interior nodes; and step 3, adopting a caching and confirmation fault tolerance technique for faults between the boundary nodes of two processing stages, and adopting a globally consistent distributed snapshot fault tolerance technique within the cloud for faults of cloud interior nodes.

Compared with traditional processing methods, a stream processing system equipped with this fault tolerance mechanism for multi-cloud deployment offers better convenience, reliability and real-time performance when processing data distributed across multiple clouds. Existing distributed stream processing systems are mostly designed for a single-cloud environment, and a multi-cloud environment differs from a single-cloud environment in many respects, particularly in its network environment; if an existing system is deployed directly to a multi-cloud environment, the reliability and efficiency of the stream processing system are seriously affected. By optimizing the fault tolerance mechanism of the conventional stream processing system, the system becomes better suited to the multi-cloud environment and can better process data distributed across multiple clouds.

Drawings

FIG. 1 is a resource management and allocation architecture diagram of the present invention;

FIG. 2 is a schematic diagram illustrating the division of the stream processing task into stages according to the present invention;

FIG. 3 is a diagram of the failure recovery mechanism of the present invention;

FIG. 4 is an overall flow chart of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 4, which is the overall flowchart of the present invention, the present invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly includes three steps: step 1, resource management and allocation, wherein a single-cloud management module is provided so that the resource management and allocation architecture has three layers: task manager-cloud manager-execution terminal; step 2, taking cloud boundaries as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes and the remaining nodes are called cloud interior nodes; and step 3, adopting a caching and confirmation fault tolerance method for faults between the boundary nodes of two processing stages, and adopting a globally consistent distributed snapshot fault tolerance method within the cloud for faults of cloud interior nodes.

The invention treats a single cloud as a whole, on the premise of not compromising the security of the cloud's internal network. A single-cloud management module, CloudManager, is added to the original resource management and allocation architecture, so that the architecture has three layers: task manager (JobManager)-cloud manager (CloudManager)-execution terminal (Executor). In the three-layer architecture, the JobManager communicates only with the CloudManager nodes in each cloud to complete task distribution and resource monitoring; the specific flow is shown in fig. 1.

The task manager is responsible for receiving the stream processing application submitted by the user, generating an execution plan and issuing it to each cloud manager. It is also responsible for collecting the execution state of the application from the cloud managers and displaying it to the user, and for receiving user instructions such as suspending or canceling a task. The cloud manager runs in each cloud, with at least one cloud manager per cloud. It receives the tasks issued by the task manager and issues them to specific execution nodes according to the attributes of each task, monitors the execution state of the tasks and reports it to the task manager, and performs fault recovery when a fault occurs. The execution terminal is a node in each cloud that is responsible for executing tasks; it can be a physical machine or a virtual machine, and its main job is to receive and execute the tasks issued by the cloud manager. The layers communicate with each other through RPC calls. Task distribution means that, of all the tasks forming an application, the tasks belonging to a given cloud are sent to the corresponding cloud manager through an RPC call; resource monitoring means collecting the node information reported by all cloud managers.
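The division of responsibilities among the three layers can be sketched as follows. This is only an illustration under stated assumptions: plain method calls stand in for the RPC calls, the round-robin placement inside `CloudManager.submit` is a placeholder for placement "according to task attributes", and all method names, the plan format and the status strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Execution terminal: runs the tasks it is given."""
    name: str
    tasks: list = field(default_factory=list)

    def run(self, task):
        self.tasks.append(task)
        return f"{task}@{self.name}: RUNNING"


@dataclass
class CloudManager:
    """One per cloud: places tasks on executors and tracks their state."""
    cloud: str
    executors: list
    status: dict = field(default_factory=dict)

    def submit(self, tasks):
        # Placeholder placement policy (assumption): simple round-robin.
        for i, task in enumerate(tasks):
            executor = self.executors[i % len(self.executors)]
            self.status[task] = executor.run(task)

    def report(self):
        return dict(self.status)


@dataclass
class JobManager:
    """Talks only to cloud managers, never directly to executors."""
    cloud_managers: dict  # cloud name -> CloudManager

    def deploy(self, plan):
        """plan maps task name -> cloud name (hypothetical plan format)."""
        per_cloud = {}
        for task, cloud in plan.items():
            per_cloud.setdefault(cloud, []).append(task)
        for cloud, tasks in per_cloud.items():
            self.cloud_managers[cloud].submit(tasks)  # stands in for an RPC

    def collect_status(self):
        merged = {}
        for cm in self.cloud_managers.values():
            merged.update(cm.report())  # stands in for an RPC
        return merged


jm = JobManager({
    "cloudA": CloudManager("cloudA", [Executor("a1"), Executor("a2")]),
    "cloudB": CloudManager("cloudB", [Executor("b1")]),
})
jm.deploy({"source": "cloudA", "map": "cloudA", "sink": "cloudB"})
status = jm.collect_status()
```

The point of the sketch is that `JobManager` holds references only to `CloudManager` objects, so nodes inside a cloud are never addressed from outside it.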

The division of the stream processing task into stages is shown in fig. 2. The invention divides the globally consistent distributed snapshot into several relatively independent stages, each of which performs distributed snapshots and recovery independently. Using the distributed snapshot with one stage as the unit reduces system coupling, narrows the scope affected by a fault, speeds up recovery and reduces dependence on the central node.
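Segmentation along cloud boundaries can be sketched as a small graph computation. The sketch assumes one stage per cloud and represents the task as a list of directed edges plus a node-to-cloud mapping; the function name `split_into_stages` and the data layout are illustrative and not taken from the patent.

```python
def split_into_stages(edges, cloud_of):
    """Group nodes by cloud and classify each as boundary or interior.

    edges:    iterable of (source_node, destination_node) pairs
    cloud_of: dict mapping node name -> cloud name
    """
    stages = {}
    # Start with every node as an interior node of its cloud's stage.
    for node, cloud in cloud_of.items():
        stage = stages.setdefault(cloud, {"boundary": set(), "interior": set()})
        stage["interior"].add(node)
    # Any node touching an edge that crosses clouds becomes a boundary node.
    for src, dst in edges:
        if cloud_of[src] != cloud_of[dst]:
            for node in (src, dst):
                stage = stages[cloud_of[node]]
                stage["interior"].discard(node)
                stage["boundary"].add(node)
    return stages


edges = [("src", "map"), ("map", "agg"), ("agg", "sink")]
cloud_of = {"src": "A", "map": "A", "agg": "B", "sink": "B"}
stages = split_into_stages(edges, cloud_of)
```

Here only the edge from `map` to `agg` crosses clouds, so those two nodes are boundary nodes and `src` and `sink` remain interior nodes of their stages.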

The failure recovery mechanism within and between stream processing stages is shown in fig. 3. To reduce coupling, a single cloud is taken as one stage, and the processing flow of a job is segmented accordingly. To achieve decentralization, the cloud manager (CM) in each cloud acts as the error monitor and recovery scheduler. When a node anywhere in the system fails, the CM of the cloud where the failed node is located schedules the nodes in that cloud and recovers them without affecting the operation of nodes in other clouds, which speeds up failure recovery and improves the availability of the system.

The specific steps of scheduling and recovery by the cloud manager are as follows. Fault detection between the cloud manager and the execution nodes in its cloud is performed through heartbeat packets. When the cloud manager finds that a node has failed, it first determines the type of the node and then executes the recovery algorithm for that node type. If the node is a cloud interior node, the globally consistent snapshot recovery algorithm is executed: the tasks on all interior nodes in the cloud are suspended, the task of the failed node is migrated to an idle node, and all tasks are restarted. If the node is a boundary node, only the task on the failed node needs to be suspended, migrated to an idle node and restarted.
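The detection and dispatch logic described above can be sketched as two small functions. The heartbeat timeout, the `Node` fields and the tuple-based recovery plan are assumptions made for illustration; restoring state from a snapshot and replaying cached data are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    is_boundary: bool
    last_heartbeat: float        # time of the last heartbeat received
    task: Optional[str] = None   # None means the node is idle


def failed_nodes(nodes, now, timeout):
    """Nodes running a task whose heartbeat is older than the timeout."""
    return [n for n in nodes
            if n.task is not None and now - n.last_heartbeat > timeout]


def plan_recovery(failed, nodes, idle):
    """Ordered recovery steps for one failed node, by node type."""
    if failed.is_boundary:
        # Boundary node: only the failed node's task is touched.
        affected = [failed.task]
    else:
        # Interior node: every interior task in the cloud is suspended.
        affected = [n.task for n in nodes
                    if not n.is_boundary and n.task is not None]
    steps = [("suspend", t) for t in affected]
    steps.append(("migrate", failed.task, idle.name))
    steps += [("restart", t) for t in affected]
    return steps


nodes = [
    Node("n1", False, 100.0, "src"),
    Node("n2", False, 91.0, "map"),   # heartbeat is stale
    Node("n3", True, 100.0, "out"),
    Node("n4", False, 100.0),         # idle spare
]
down = failed_nodes(nodes, now=101.0, timeout=5.0)
interior_plan = plan_recovery(down[0], nodes, idle=nodes[3])
boundary_plan = plan_recovery(nodes[2], nodes, idle=nodes[3])
```

Comparing `interior_plan` with `boundary_plan` shows the difference in scope: the interior failure suspends and restarts both interior tasks, while the boundary failure affects only the task on the failed node.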

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A fault tolerance method for a stream processing system deployed across multiple clouds, characterized by mainly comprising the following three steps: step 1, setting the resource management and allocation architecture as a three-layer architecture: task manager-cloud manager-execution terminal; step 2, taking cloud boundaries as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes and the remaining nodes are called cloud interior nodes; step 3, performing fault tolerance judgment and processing, wherein a caching and confirmation fault tolerance method is adopted for faults between the boundary nodes of two processing stages, and a globally consistent distributed snapshot fault tolerance method within the cloud is adopted for faults of cloud interior nodes; in the three-layer architecture, the task manager receives the stream processing application submitted by a user, generates an execution plan and issues the execution plan to each cloud manager, and collects the execution state of the application from the cloud managers and displays it to the user; the cloud manager runs in each cloud, receives the tasks issued by the task manager, issues the tasks to specific execution nodes according to the attributes of each task, monitors the execution state of the tasks, reports the execution state to the task manager, and is responsible for fault recovery when a fault occurs; and the execution terminal is a node in each cloud that is responsible for executing tasks, and receives and executes the tasks issued by the cloud manager.

2. The method of claim 1, wherein the task manager communicates only with the cloud manager nodes in each cloud to complete task distribution and resource monitoring, the communication between the task manager and the cloud manager nodes is performed through RPC, the tasks belonging to a given cloud among all the tasks forming an application are issued to the corresponding cloud manager through RPC calls, and the resource monitoring collects the node information reported by all the cloud managers.

3. The method according to claim 2, wherein in step 3, the globally consistent snapshot recovery algorithm suspends the tasks on all interior nodes in the cloud, migrates the task of the failed node to an idle node, and restarts all tasks; and the caching and confirmation fault tolerance method is implemented by suspending only the task on the failed node, migrating it to an idle node and restarting its execution.