CN111343260B - Stream processing system fault tolerance method for multi-cloud deployment - Google Patents

Stream processing system fault tolerance method for multi-cloud deployment

Publication number
CN111343260B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010101719.XA
Other languages
Chinese (zh)
Other versions
CN111343260A (en)
Inventor
沃天宇
贾宵雷
林学练
谢天宇
罗彦林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-02-19
Filing date
2020-02-19
Publication date
2021-05-28
Application filed by Beihang University
Priority to CN202010101719.XA
Publication of CN111343260A
Application granted
Publication of CN111343260B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly comprises the following three steps: step 1, setting the resource management and allocation architecture as a three-layer architecture: task manager-cloud manager-execution terminal; step 2, taking the cloud boundary as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes, and the remaining nodes are called cloud interior nodes; and step 3, performing fault tolerance judgment and processing, wherein a fault between the boundary nodes of two processing stages is handled by a cache-and-acknowledgment fault tolerance method, and a fault of a cloud interior node is handled by a globally consistent distributed snapshot fault tolerance method within the cloud.

Description

Stream processing system fault tolerance method for multi-cloud deployment
Technical Field
The invention relates to a fault tolerance method for a stream processing system, in particular to a fault tolerance method for a stream processing system for multi-cloud deployment.
Background
Active backup and passive backup are two early, closely related fault tolerance techniques. In a stream processing system with active backup, each node has a backup node; the two nodes process data at the same time and synchronize their state periodically. When one node fails, the system can quickly switch to the other node, thereby achieving fault tolerance. Unlike active backup, in a system using passive backup the backup node does not process data; instead, the master node periodically synchronizes its state to the backup node. When a fault occurs, the system likewise switches to the backup node to continue processing data.
In the prior art, stream processing systems achieve fault tolerance in three main ways: a data source cache-and-acknowledgment mechanism, a neighboring node cache-and-acknowledgment mechanism, and a globally consistent distributed snapshot mechanism. However, the fault tolerance mechanisms of existing distributed stream processing systems are mostly designed for a single-cloud environment, and a multi-cloud environment differs greatly from a single-cloud environment in many respects, especially in its network environment. Because of these differences, the disadvantages of the existing mechanisms are amplified and their advantages are diminished.
The problem with the data source cache-and-acknowledgment mechanism in the prior art is that, if the data processing flow uses a window operation with a relatively long time span, the timeout of the timer must be increased. Otherwise, the system will frequently resend data, which not only increases system overhead and wastes system resources, but may also affect the final result. On the other hand, increasing the timeout increases the fault discovery time of the system, which in turn increases its fault response time. In addition, this fault tolerance mechanism cannot provide exactly-once semantics to upper-layer applications.
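The timeout trade-off described above can be illustrated with a minimal Python sketch. It is not taken from the patent or from any particular stream processing system: the class name `SourceBuffer`, its methods, and the timing values are hypothetical, and time is passed in explicitly so that the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class SourceBuffer:
    """Hypothetical source-side cache: keeps each record until it is acknowledged."""
    timeout: float                               # seconds to wait for an ack before resending
    pending: dict = field(default_factory=dict)  # record id -> (record, time last sent)

    def emit(self, record_id, record, now):
        """Cache the record and hand it to the pipeline."""
        self.pending[record_id] = (record, now)
        return record

    def ack(self, record_id):
        """The pipeline finished processing the record; drop it from the cache."""
        self.pending.pop(record_id, None)

    def overdue(self, now):
        """Return ids whose ack is overdue and restart their timers (they are resent)."""
        late = [rid for rid, (_, sent) in self.pending.items()
                if now - sent >= self.timeout]
        for rid in late:
            record, _ = self.pending[rid]
            self.pending[rid] = (record, now)
        return late


# A record held in a 30-second window cannot be acknowledged before the window closes.
short = SourceBuffer(timeout=10.0)
short.emit("r1", {"value": 1}, now=0.0)
resent_without_failure = short.overdue(now=10.0)   # resent although nothing failed

# A longer timeout avoids the spurious resend, but a real failure at t=0
# would also go unnoticed until t=60.
long_ = SourceBuffer(timeout=60.0)
long_.emit("r1", {"value": 1}, now=0.0)
nothing_resent_yet = long_.overdue(now=30.0)
```

With the short timeout the record is resent while the window is still open; with the long timeout it is not, at the cost of a slower reaction to real faults.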
The problem with the neighboring node cache-and-acknowledgment mechanism in the prior art is excessive resource consumption. This method ensures that a node fault is discovered and recovered from in time, and it also guarantees exactly-once processing semantics. However, because each node needs to cache data, and state must be persisted after each data item is processed, it consumes excessive resources and increases the processing delay of the system.
The globally consistent distributed snapshot fault tolerance mechanism in the prior art has the advantages of low resource consumption, low processing delay per data item, short fault discovery time, and guaranteed exactly-once processing semantics; its disadvantage is a long fault recovery time. A failure of any node requires the entire pipeline to be recovered. If the probability of failure in the deployment environment is high, the system will perform fault recovery frequently, which seriously affects the availability of the system.
Disclosure of Invention
The invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly comprises the following three steps: step 1, resource management and allocation, wherein a single-cloud management module is added so that the resource management and allocation architecture has three layers: task manager-cloud manager-execution terminal; step 2, taking the cloud boundary as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes, and the remaining nodes are called cloud interior nodes; and step 3, adopting a cache-and-acknowledgment fault tolerance technique for a fault between the boundary nodes of two processing stages, and adopting a globally consistent distributed snapshot fault tolerance technique within the cloud for a fault of a cloud interior node.
Compared with traditional processing methods, a stream processing system that uses this fault tolerance mechanism and is suited to multi-cloud deployment offers better convenience, reliability, and real-time performance when processing data distributed across multiple clouds. Existing distributed stream processing systems are mostly designed for a single-cloud environment, and a multi-cloud environment differs from a single-cloud environment in many respects, particularly in its network environment; if an existing system is deployed directly to a multi-cloud environment, the reliability and efficiency of the stream processing system are seriously affected. By optimizing the fault tolerance mechanism of conventional stream processing systems, the system becomes better suited to the multi-cloud environment and can better process data distributed across multiple clouds.
Drawings
FIG. 1 is a resource management and allocation architecture diagram of the present invention;
FIG. 2 is a schematic diagram illustrating the division of the stream processing task stages according to the present invention;
FIG. 3 is a diagram of the failure recovery mechanism of the present invention;
FIG. 4 is an overall flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 4, which is the overall flowchart of the present invention, the present invention provides a fault tolerance method for a stream processing system deployed across multiple clouds, which mainly includes three steps: step 1, resource management and allocation, wherein a single-cloud management module is added so that the resource management and allocation architecture has three layers: task manager-cloud manager-execution terminal; step 2, taking the cloud boundary as the basis for segmentation, dividing a cross-cloud stream processing task into a plurality of processing stages, wherein the nodes in each stage that are connected to other stages are called boundary nodes, and the remaining nodes are called cloud interior nodes; and step 3, adopting a cache-and-acknowledgment fault tolerance method for a fault between the boundary nodes of two processing stages, and adopting a globally consistent distributed snapshot fault tolerance method within the cloud for a fault of a cloud interior node.
The invention treats a single cloud as a whole without compromising the security of the cloud's internal network. A single-cloud management module, CloudManager, is added to the original resource management and allocation architecture, so that the architecture has three layers: task manager (JobManager), cloud manager (CloudManager), and execution terminal (Executor). In the three-layer architecture, the JobManager communicates only with the CloudManager nodes in each cloud to complete task distribution and resource monitoring; the specific flow is shown in FIG. 1.
The task manager is responsible for receiving the stream processing application submitted by the user, generating an execution plan, and issuing the execution plan to each cloud manager. It is also responsible for collecting the execution state of the application from the cloud managers and displaying it to the user, and for receiving user instructions such as suspending task execution or canceling a task.
The cloud manager runs in each cloud, and each cloud has at least one cloud manager. The cloud manager receives the tasks issued by the task manager and issues them to specific execution nodes according to the specific task attributes, monitors the execution state of the tasks and reports it to the task manager, and performs fault recovery when a fault occurs.
The execution terminal is a node in each cloud that is responsible for executing tasks. It may be a physical machine or a virtual machine, and it is mainly responsible for receiving and executing the tasks issued by the cloud manager.
The layers communicate with each other through RPC calls. Task distribution means that, of all the tasks that make up an application, the tasks belonging to a given cloud are sent to the corresponding cloud manager through RPC calls; resource monitoring means collecting the node information reported by all the cloud managers.
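The division of responsibilities among the three layers can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names follow the JobManager, CloudManager, and Executor terms used in the description, but every method name, the `Task` type, and the round-robin placement are hypothetical, and plain method calls stand in for the RPC calls between layers.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cloud: str          # the cloud this task is planned to run in


@dataclass
class Executor:
    """Execution terminal: receives tasks from its cloud manager and runs them."""
    node_id: str
    tasks: list = field(default_factory=list)

    def run(self, task):
        self.tasks.append(task.name)
        return "RUNNING"


@dataclass
class CloudManager:
    """One per cloud: places tasks on executors and tracks their state."""
    cloud: str
    executors: list
    status: dict = field(default_factory=dict)

    def deploy(self, tasks):
        # Round-robin placement is an assumption of this sketch; the description
        # only says placement depends on the specific task attributes.
        for i, task in enumerate(tasks):
            executor = self.executors[i % len(self.executors)]
            self.status[task.name] = executor.run(task)

    def report(self):
        return dict(self.status)


@dataclass
class JobManager:
    """Talks only to cloud managers, never directly to executors."""
    cloud_managers: dict    # cloud name -> CloudManager

    def submit(self, tasks):
        by_cloud = defaultdict(list)
        for task in tasks:
            by_cloud[task.cloud].append(task)
        for cloud, part in by_cloud.items():
            self.cloud_managers[cloud].deploy(part)   # stands in for an RPC call

    def collect_status(self):
        return {cloud: cm.report() for cloud, cm in self.cloud_managers.items()}


cm_a = CloudManager("A", [Executor("a1"), Executor("a2")])
cm_b = CloudManager("B", [Executor("b1")])
jm = JobManager({"A": cm_a, "B": cm_b})
jm.submit([Task("source", "A"), Task("map", "A"), Task("sink", "B")])
```

In this sketch the JobManager never holds a reference to an Executor, which mirrors the point that only the cloud manager of each cloud needs to be reachable from outside that cloud.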
The division of stream processing task stages in the invention is shown in FIG. 2. The invention divides the globally consistent distributed snapshot into several relatively independent stages, and each stage performs distributed snapshots and recovery independently. Applying the distributed snapshot with one stage as the unit reduces system coupling, narrows the scope affected by a fault, speeds up recovery, and reduces dependence on the central node.
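As a minimal sketch of this division, assuming the job is given as a mapping from node to cloud plus a list of directed edges (both representations, and the function name `split_into_stages`, are hypothetical), boundary and cloud interior nodes can be told apart by checking whether an edge crosses a cloud boundary:

```python
def split_into_stages(node_cloud, edges):
    """Group nodes into one stage per cloud and classify each node.

    node_cloud: dict mapping node name -> cloud name
    edges:      iterable of (upstream, downstream) node pairs
    Returns a dict: cloud name -> {"boundary": set, "interior": set}.
    """
    stages = {}
    # Start by treating every node as an interior node of its own cloud.
    for node, cloud in node_cloud.items():
        stage = stages.setdefault(cloud, {"boundary": set(), "interior": set()})
        stage["interior"].add(node)
    # Any node on an edge that crosses clouds is connected to another stage,
    # so it becomes a boundary node.
    for src, dst in edges:
        if node_cloud[src] != node_cloud[dst]:
            for node in (src, dst):
                stage = stages[node_cloud[node]]
                stage["interior"].discard(node)
                stage["boundary"].add(node)
    return stages


node_cloud = {"source": "A", "map": "A", "join": "B", "sink": "B"}
edges = [("source", "map"), ("map", "join"), ("join", "sink")]
stages = split_into_stages(node_cloud, edges)
```

Here only the edge from `map` to `join` crosses a cloud boundary, so those two nodes are boundary nodes and the other two are cloud interior nodes.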
The failure recovery mechanism within and between stream processing stages is shown in FIG. 3. To reduce coupling, each single cloud is treated as one stage, and the processing flow of a job is segmented accordingly. To achieve decentralization, the cloud manager (CM) in each cloud serves as the fault monitor and recovery scheduler. When a node in the system fails, the CM of the cloud where the failed node is located schedules the nodes in that cloud and recovers them without affecting the operation of nodes in other clouds, which speeds up fault recovery and improves the availability of the system.
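The description does not spell out how caching and acknowledgment between boundary nodes is implemented. One possible arrangement, sketched below in Python, lets the upstream boundary node keep every record it sends across the cloud boundary until the downstream boundary node acknowledges it, so that the downstream cloud can recover on its own and simply ask for a replay. The class names, sequence numbers, cumulative acknowledgments, and the in-order delivery assumption are all illustrative choices, not details from the patent.

```python
from collections import OrderedDict


class BoundarySender:
    """Upstream boundary node: caches records until the next stage acknowledges them."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()    # sequence number -> record

    def send(self, record):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = record
        return seq, record

    def on_ack(self, seq):
        # Cumulative acknowledgment: everything up to and including seq is safe.
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def replay(self):
        """Records to resend after the downstream boundary node has been restarted."""
        return list(self.unacked.items())


class BoundaryReceiver:
    """Downstream boundary node: accepts each sequence number at most once."""

    def __init__(self):
        # In a real system this counter would be part of the recovered state
        # (an assumption of this sketch).
        self.last_seq = -1
        self.received = []

    def receive(self, seq, record):
        if seq > self.last_seq:         # ignore duplicates produced by a replay
            self.received.append(record)
            self.last_seq = seq
        return self.last_seq            # value to acknowledge to the sender


sender, receiver = BoundarySender(), BoundaryReceiver()
sender.on_ack(receiver.receive(*sender.send("a")))   # delivered and acknowledged
sender.send("b")                                     # sent while the receiver is down
sender.send("c")
lost = sender.replay()
for seq, record in lost:                             # receiver restarted: replay the cache
    sender.on_ack(receiver.receive(seq, record))
```

Because the cache lives on the upstream side of the boundary, the upstream cloud keeps running while the downstream cloud recovers, which matches the goal of not affecting nodes in other clouds.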
The specific steps of scheduling and recovery by the cloud manager are as follows. Fault detection between the cloud manager and the execution nodes in the cloud is performed through heartbeat packets. When the cloud manager finds that a node has failed, it first determines the type of the node and then executes the recovery algorithm corresponding to that type. If the node is a cloud interior node, the cloud manager executes the globally consistent snapshot recovery algorithm: it suspends the tasks on all interior nodes in the cloud, migrates the tasks of the failed node to an idle node, and restarts all tasks. If the node is a boundary node, the cloud manager only needs to suspend the tasks on the failed node, migrate them to an idle node, and restart them.
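The steps above can be sketched in Python as follows. This is a simplified illustration rather than the patented implementation: the names `Node` and `CloudRecoveryManager`, the heartbeat timeout value, and the action log are hypothetical, and restoring state from the latest snapshot is reduced to a "restart" entry in the log.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str                 # "boundary", "interior", or "idle"
    last_heartbeat: float
    tasks: list = field(default_factory=list)


class CloudRecoveryManager:
    """Hypothetical cloud manager role: heartbeat monitoring plus recovery scheduling."""

    def __init__(self, nodes, idle_nodes, heartbeat_timeout):
        self.nodes = {n.node_id: n for n in nodes}
        self.idle = list(idle_nodes)
        self.timeout = heartbeat_timeout
        self.log = []         # recorded actions, in order

    def heartbeat(self, node_id, now):
        self.nodes[node_id].last_heartbeat = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return [n for n in self.nodes.values()
                if now - n.last_heartbeat > self.timeout]

    def recover(self, failed):
        """Move the failed node's tasks to an idle node, scoped by node type."""
        spare = self.idle.pop(0)
        if failed.kind == "interior":
            # Snapshot-based recovery: all interior nodes of this cloud restart.
            restart = [n for n in self.nodes.values()
                       if n.kind == "interior" and n is not failed]
        else:
            # Boundary node: only the failed node's own tasks are affected.
            restart = []
        for n in restart:
            self.log.append(("suspend", n.node_id))
        spare.kind, spare.tasks, failed.tasks = failed.kind, failed.tasks, []
        del self.nodes[failed.node_id]
        self.nodes[spare.node_id] = spare
        self.log.append(("migrate", failed.node_id, spare.node_id))
        for n in restart + [spare]:
            self.log.append(("restart", n.node_id))
        return spare


cm = CloudRecoveryManager(
    nodes=[Node("n1", "interior", 0.0, ["map"]),
           Node("n2", "interior", 0.0, ["filter"]),
           Node("n3", "boundary", 0.0, ["sink"])],
    idle_nodes=[Node("s1", "idle", 0.0), Node("s2", "idle", 0.0)],
    heartbeat_timeout=3.0,
)
cm.heartbeat("n1", 10.0)
cm.heartbeat("n2", 10.0)                 # n3 stays silent
for node in cm.failed_nodes(now=11.0):
    cm.recover(node)                     # boundary failure: only n3's tasks move
```

A boundary-node failure touches one node, while an interior-node failure suspends and restarts every interior node of that cloud; in both cases the other clouds are left alone.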
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (3)

1. A stream processing system fault tolerance method for multi-cloud deployment is characterized by mainly comprising the following three steps: step 1, setting a resource management and distribution architecture as a three-layer architecture: task manager-cloud manager-execution terminal; step 2, with the cloud boundary as a segmentation basis, segmenting the cross-cloud stream processing task into a plurality of processing stages, wherein nodes connected with other stages in each stage are called boundary nodes, and the rest nodes are called cloud interior nodes; step 3, fault tolerance judgment and processing are carried out, a fault tolerance method of caching and confirmation is adopted for the fault between the boundary nodes of the two processing stages, and a distributed snapshot fault tolerance method of global consistency in the cloud is adopted for the fault of the nodes in the cloud; in the three-layer architecture, the task manager receives stream processing applications submitted by a user, generates an execution plan and issues the execution plan to each cloud manager, and collects the execution states of the applications from the cloud managers and displays the execution states to the user; the cloud manager runs in each cloud, receives the tasks issued by the task manager, issues the tasks to specific execution nodes according to specific task attributes, monitors the execution states of the tasks, reports the execution states to the task manager, and is responsible for fault recovery when a fault occurs; and the execution terminal is a node in each cloud which is responsible for executing the task, receives the task issued by the cloud manager and executes the task.
2. The method of claim 1, wherein a task manager only communicates with cloud manager nodes in each cloud to complete task distribution and resource monitoring, the communication between the task manager and the cloud manager nodes is performed through RPC, part of tasks belonging to a certain cloud in all tasks forming an application are issued to the corresponding cloud manager through RPC calls, and the resource monitoring is used for collecting node information reported by all the cloud managers.
3. The method according to claim 2, wherein in step 3, the globally consistent snapshot recovery algorithm is to suspend tasks on all internal nodes in the cloud, migrate the tasks of a failed node to an idle node, and restart all tasks; the fault-tolerant method of cache and confirmation is realized by only suspending tasks on a failed node, migrating the tasks to an idle node and restarting execution.
CN202010101719.XA 2020-02-19 2020-02-19 Stream processing system fault tolerance method for multi-cloud deployment Active CN111343260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010101719.XA CN111343260B (en) 2020-02-19 2020-02-19 Stream processing system fault tolerance method for multi-cloud deployment

Publications (2)

Publication Number Publication Date
CN111343260A CN111343260A (en) 2020-06-26
CN111343260B CN111343260B (en) 2021-05-28

Family

ID=71185472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010101719.XA Active CN111343260B (en) 2020-02-19 2020-02-19 Stream processing system fault tolerance method for multi-cloud deployment

Country Status (1)

Country Link
CN (1) CN111343260B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547812A (en) * 2011-11-04 2012-07-04 南京邮电大学 Fault detection method of wireless sensor network and event detection method thereof
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN104115469A (en) * 2011-09-23 2014-10-22 混合电路逻辑有限公司 System for live -migration and automated recovery of applications in a distributed system
CN105959356A (en) * 2016-04-26 2016-09-21 华中科技大学 Method of realizing multi-cloud storage fault-tolerance conversion mechanism
CN107370802A (en) * 2017-07-10 2017-11-21 中国人民解放军国防科学技术大学 A kind of collaboration storage dispatching method based on alternating direction multiplier method
CN109063841A (en) * 2018-08-27 2018-12-21 北京航空航天大学 A kind of failure mechanism intelligent analysis method based on Bayesian network and deep learning algorithm

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013016584A1 (en) * 2011-07-26 2013-01-31 Nebula, Inc. Systems and methods for implementing cloud computing
CN103716182B (en) * 2013-12-12 2016-08-31 中国科学院信息工程研究所 A kind of fault detect towards real-time cloud platform and fault-tolerance approach and system
US9990372B2 (en) * 2014-09-10 2018-06-05 Panzura, Inc. Managing the level of consistency for a file in a distributed filesystem
US9830233B2 (en) * 2016-01-29 2017-11-28 Netapp, Inc. Online backup to an object service using bulk export
US20190313157A1 (en) * 2018-04-09 2019-10-10 James Fitzgerald System and Method for a Scalable IPTV Recorder and Cloud DVR

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
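"A Scalable Internet-of-Vehicles Service over Joint Clouds"; 林学练 et al.; IEEE; 20150517; entire document *
"Research on Collaborative Computing Methods under a Distributed Multi-Cloud Architecture"; 司旭; China Master's Theses Full-text Database; 20180430; Chapters 3 to 4, Figs. 3.3, 3.4 *
"Research Progress on the Fundamental Theory and Methods of Software-Defined JointCloud Computing"; 沃天宇 et al.; China Basic Science; 20190608; entire document *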

Similar Documents

Publication Publication Date Title
US20200104222A1 (en) Systems and methods for managing server cluster environments and providing failure recovery therein
EP3180695B1 (en) Systems and methods for auto-scaling a big data system
US8983961B2 (en) High availability for cloud servers
CN109936622B (en) Unmanned aerial vehicle cluster control method and system based on distributed resource sharing
CN100426751C (en) Method for ensuring accordant configuration information in cluster system
WO2017067484A1 (en) Virtualization data center scheduling system and method
US20110083046A1 (en) High availability operator groupings for stream processing applications
US8082344B2 (en) Transaction manager virtualization
CN102833310B (en) Workflow engine trunking system based on virtualization technology
CN102629906A (en) Design method for improving cluster business availability by using cluster management node as two computers
CN107741876A (en) A kind of virtual machine process monitoring system and method
CN111124806A (en) Equipment state real-time monitoring method and system based on distributed scheduling task
US20020083116A1 (en) Buffered coscheduling for parallel programming and enhanced fault tolerance
CN110784539A (en) Data management system and method based on cloud computing
CN105183591A (en) High-availability cluster implementation method and system
CN111418187A (en) Scalable statistics and analysis mechanism in cloud networks
WO2021139174A1 (en) Faas distributed computing method and apparatus
CN111343260B (en) Stream processing system fault tolerance method for multi-cloud deployment
Ali et al. Probabilistic normed load monitoring in large scale distributed systems using mobile agents
CN112737934A (en) Cluster type Internet of things edge gateway device and method
CN113742073B (en) LSB interface-based cluster control method
CN111290767A (en) Container group updating method and system with service quick recovery function
Grant et al. RaDD runtimes: Radical and different distributed runtimes with smartnics
CN115391058A (en) SDN-based resource event processing method, resource creating method and system
CN111614702A (en) Edge calculation method and edge calculation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant