CN106874142B - Real-time data fault-tolerant processing method and system - Google Patents


Info

Publication number
CN106874142B
Authority
CN
China
Prior art keywords
node
instance
real
time data
pull
Prior art date
Legal status
Active
Application number
CN201510923282.7A
Other languages
Chinese (zh)
Other versions
CN106874142A
Inventor
单卫华
林铭
殷晖
李旭良
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201510923282.7A
Priority to PCT/CN2016/099585 (WO2017097006A1)
Publication of CN106874142A
Application granted
Publication of CN106874142B



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The embodiments of the invention disclose a real-time data fault-tolerant processing method and system. At least two instances are deployed in the system for a service, each instance is allocated corresponding physical resources, and each node in each instance has a peer node in the other instances. When a node processing real-time data of the service fails, the instance where the node is located is determined according to the physical resources corresponding to the node; in the determined instance, the failed node is replaced with a failure pull-up node; the failed node is updated to the failure pull-up node in an association information table; and the cache data of the peer node of the failed node is sent to the failure pull-up node according to the peer node information in the table, so that the failure pull-up node resumes data processing. The system can thus manage real-time data processing in a unified way and ensure that, when a node fails, the node's state before the downtime is restored and the node quickly rejoins the system.

Description

Real-time data fault-tolerant processing method and system
Technical Field
The invention relates to the field of real-time computation, in particular to a real-time data fault-tolerant processing method and system.
Background
In fields such as finance, telecommunications, energy, and healthcare, many business systems require 24/7 ("7 x 24 hours") service continuity, and a service interruption for any reason is unacceptable. This high fault-tolerance requirement gave rise to active-active (dual-active) systems: redundant system elements are provided so that the system maintains service continuity when various faults occur, preserving data integrity and system functionality. However, resource consumption has always been the weak point of the active-active solution: two sets of independent resources must be prepared, and during service operation the two independent systems each deploy, manage, and maintain their own operating units.
The industry widely adopts real-time stream computing platforms to build real-time online systems. The most widely used real-time stream computing component is Storm, a free, open-source, distributed, and highly fault-tolerant real-time computing system. Storm is often used in fields such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL.
Storm processes are stateless, which makes fail-fast behavior easy to implement and underpins Storm's robustness. Storm provides no built-in support for persisting the state data cached by a node, so after a node goes down, Storm only needs to pull the node's service up again without loading any state data; this yields an HA mechanism of fast failure and fast recovery. However, current business demands across industries frequently involve scenarios that require state data to be saved, and in such scenarios a node that cannot restore its state data is unacceptable. After fault recovery, the node is required to load the state data it held before the fault and return to its pre-fault state, so that it can quickly rejoin the system, respond to data processing requests, and take on service processing.
Disclosure of Invention
The embodiments of the invention provide a real-time data fault-tolerant processing method and system that manage real-time data processing in a unified way and ensure that, when a node fails, the node's state before the downtime is restored.
In a first aspect, a method for fault-tolerant processing of real-time data is provided, which includes:
when a node processing real-time data of a service in a system fails, determining the instance where the node is located according to the physical resources corresponding to the node, wherein at least two instances of the service are deployed in the system, each instance comprises at least one node with a topological relationship, each instance is allocated corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in the other instances;
in the determined instance, replacing the failed node with a failure pull-up node;
updating the failed node to the failure pull-up node in an association information table, wherein the association information table comprises peer node information of the at least two instances;
and sending the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
In this way the system manages real-time data processing in a unified way and ensures that, when a node fails, the node's pre-failure state is restored and the node quickly rejoins the system.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the method further includes:
and controlling the at least one node in each instance to process the real-time data respectively.
Each instance is deployed on physical resources that it exclusively controls, and the real-time data exchanged between the nodes of an instance flows only within that instance.
With reference to the first aspect, in a second possible implementation manner of the first aspect, the allocated physical resources include at least one physical machine, and the correspondence between the at least one node in each instance and the allocated physical resources includes: each of the physical machines corresponding to at least one node.
With reference to the first aspect, or the first possible implementation manner of the first aspect, or the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the method further includes:
when the system adds physical resources to the service, the added physical resources are allocated to at least two instances of the service.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the method further includes:
transferring real-time data of at least one node corresponding to the physical resource with the load higher than a set value in each instance to at least one node corresponding to the added physical resource allocated to the instance; or
Allocating the physical resources added by each of the instances to the failed pull-up node.
For the physical resources added in each instance, the real-time data of the nodes can be migrated according to the loads of the physical resources distributed to the nodes, so that load balance is ensured, and the physical resources are utilized in a balanced manner; and physical resources added by each instance can also be directly allocated to the new fault pull-up node, and the operation is simple.
With reference to the first aspect or the first possible implementation manner of the first aspect or the second possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the method further includes:
when physical resources in an instance need to be reduced, stopping the real-time data processing of the nodes corresponding to the reduced physical resources;
replacing the node that stopped processing real-time data with a failure pull-up node, wherein the remaining physical resources in the instance are reallocated to at least one node performing real-time data processing in the instance;
updating the node that stopped processing real-time data to the failure pull-up node in the association information table;
and sending the cache data of the peer node of the node that stopped processing real-time data to the failure pull-up node according to the peer node information.
When reducing the physical resources of an instance, the real-time data processing of the nodes corresponding to the removed resources must be stopped and those nodes replaced with failure pull-up nodes, which ensures that data processing is not interrupted by the resource reduction.
In a second aspect, a real-time data fault-tolerant processing system is provided, which has the function of implementing the system behavior in the above method. The function may be implemented by hardware executing corresponding software, the software including one or more modules corresponding to the above function.
The real-time data fault-tolerant processing system comprises:
a determining unit, configured to determine, when a node processing real-time data of a service in a system fails, the instance where the node is located according to the physical resources corresponding to the node, wherein at least two instances of the service are deployed in the system, each instance comprises at least one node with a topological relationship, each instance is allocated corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in the other instances;
a replacing unit, configured to replace the failed node with a failure pull-up node in the determined instance;
an updating unit, configured to update the failed node to the failure pull-up node in an association information table, wherein the association information table comprises peer node information of the at least two instances;
and a sending unit, configured to send the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
Implementing the real-time data fault-tolerant processing method and system provided by the embodiments of the invention has the following beneficial effects:
At least two instances are deployed in the system for a service, each instance is allocated corresponding physical resources, and each node in each instance has a peer node in the other instances. When a node processing real-time data of the service fails, the instance where the node is located is determined according to the physical resources corresponding to the node; in the determined instance, the failed node is replaced with a failure pull-up node; the failed node is updated to the failure pull-up node in the association information table; and the cache data of the peer node of the failed node is sent to the failure pull-up node according to the peer node information in the table, so that the failure pull-up node resumes data processing. The system thus manages real-time data processing in a unified way and ensures that, when a node fails, its pre-downtime state is restored and it quickly rejoins the system.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a real-time data fault-tolerant processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an exemplary prior art Storm application operation;
FIG. 3 is a schematic diagram of a fault-tolerant topology deployment structure in a Storm platform according to an example embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating fast recovery from fault downtime on a Storm platform in accordance with an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another method for fault-tolerant processing of real-time data according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a real-time data fault-tolerant processing system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of another real-time data fault-tolerant processing system according to an embodiment of the present invention.
Detailed Description
Fig. 1 is a schematic flowchart of a real-time data fault-tolerant processing method according to an embodiment of the present invention, where the method includes the following steps:
s101, when real-time data of a node processing service in a system fails, determining an instance where the node is located according to physical resources corresponding to the node.
The system deploys two or more instances for each service; each instance comprises one or more nodes with a topological relationship, the instances have the same number of nodes, and the nodes have the same topological relationships. Each instance is allocated corresponding physical resources, which can be controlled only by that instance; the nodes in each instance have a correspondence with the allocated physical resources, that is, physical resources are allocated to each instance, and the resources allocated to an instance are in turn allocated to the nodes within it. Each node in each instance has a peer node in the other instances, i.e., a node occupying the same position and performing the same function in the topological relationship.
The system in this embodiment may be used to process real-time data. When a node processing real-time data of the service fails, the instance where the node is located must first be determined: when a failure occurs, the failed physical resource can be identified first, so from the failed physical resource and the correspondence between nodes and physical resources, the instance where the node is located can be determined.
S102: in the determined instance, the failed node is replaced with a failure pull-up node.
After the instance where the failed node is located is determined, the failed node is replaced, within that instance, by a failure pull-up node; the replacement is performed only within that instance and does not affect the normal operation of the other instances.
S103: the failed node is updated to the failure pull-up node in an association information table.
An association information table is maintained in the system; it records the peer node information of the two or more instances the system deploys for each service. For example, node 1 in instance A and node 1' in instance B are recorded as peer nodes. After the failed node is replaced with the failure pull-up node, the failed node must be updated to the failure pull-up node in the association information table. For example, if node 1 in instance A fails and is replaced with node 7, then node 1 of instance A must be replaced with node 7 in the association information table.
S104: the cache data of the peer node of the failed node is sent to the failure pull-up node according to the peer node information in the association information table, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
The peer node of the failed node is looked up according to the peer node information in the association information table. Because the peer node is operating normally when the node fails and keeps cache data for a period of time, the peer node's cache data can be sent to the failure pull-up node of the failed node, so that the failure pull-up node resumes the node's data processing from the point of failure according to the received cache data.
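To make steps S101 to S104 concrete, the following minimal Java sketch strings them together. It is an illustration under assumed names only: FaultTolerantController, DataPlane, and the two maps are hypothetical and are not APIs of Storm, Zookeeper, or the patented system.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the S101-S104 recovery flow; every name here is
// hypothetical and does not correspond to a real Storm or Zookeeper API.
class FaultTolerantController {
    // node id -> instance id, derived from the physical-resource allocation
    private final Map<String, String> nodeToInstance = new ConcurrentHashMap<>();
    // node id -> peer node id in the other instance (the association table)
    private final Map<String, String> associationTable = new ConcurrentHashMap<>();
    private final DataPlane dataPlane; // assumed transport for cache data

    FaultTolerantController(DataPlane dataPlane) { this.dataPlane = dataPlane; }

    void onNodeFailure(String failedNode) {
        // S101: locate the instance via the node <-> physical-resource mapping
        String instance = nodeToInstance.remove(failedNode);

        // S102: start a replacement ("failure pull-up") node in that instance only
        String pullUpNode = startReplacementNode(instance);
        nodeToInstance.put(pullUpNode, instance);

        // S103: swap the failed node for the pull-up node in the association
        // table (a peer node is assumed to exist for every node)
        String peer = associationTable.remove(failedNode);
        associationTable.put(pullUpNode, peer);
        associationTable.put(peer, pullUpNode);

        // S104: ship the peer's cached state to the pull-up node so it can
        // resume processing from the point of failure
        byte[] cache = dataPlane.fetchCache(peer);
        dataPlane.sendCache(pullUpNode, cache);
    }

    private String startReplacementNode(String instance) {
        // placeholder: restart the service on a machine owned by this instance
        return instance + "-pullup-" + System.nanoTime();
    }

    interface DataPlane {
        byte[] fetchCache(String node);
        void sendCache(String node, byte[] cache);
    }
}
```

In the patent's terms, nodeToInstance captures the node-to-physical-resource correspondence and associationTable plays the role of the association information table.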
As can be seen from the above description, the physical resource allocation, the recording of peer node information, the replacement of nodes, and so on for the two or more instances the system deploys for each service are managed by the system in a unified way. When a node of an instance fails, the cache data of its peer node can be obtained, the failed node's pre-downtime state can be restored, and the node can quickly rejoin the system.
According to the real-time data fault-tolerant processing method provided by this embodiment of the invention, the system manages real-time data processing in a unified way, and when a node fails it restores the node's pre-downtime state so that the node quickly rejoins the system.
The real-time data fault-tolerant processing method provided by the embodiments of the invention is now described in detail, taking the Storm system/platform as an example:
two types of nodes are included in the Storm system: the Spout is responsible for sending data stream in the application (generating data stream itself or receiving external data stream and sending out), namely, the data source in the whole Storm application; the code in the Bolt realizes service logic processing, which is responsible for processing data flow, and complex service logic realization requires cooperation of a plurality of bolts. In the Storm system, Tuple refers to a single data stream circulating in the Storm application, which is a piece of data that continuously circulates at various nodes. Topology (Topology) describes the connection order and connection relationship between the Spout and the Bolt, and may also set a transmission policy of data streams between the Spout and the Bolt in the application, an operation configuration of the entire application, and the like.
Fig. 2 is a schematic diagram of the operation of a prior-art Storm application. The Storm application code (a Jar) is submitted to the Storm cluster with a submit command; the running Storm application becomes a system-service-like Storm task that can be stopped only by an explicit command. The Storm task keeps running in the Storm cluster, and whenever data matching the task's requirements arrives, the task processes the received data and outputs the result.
Storm has a built-in reliability mechanism to guarantee the normal operation of every component and node. When a component or node fails or behaves abnormally, Storm automatically restarts it, initializes and loads the relevant configuration into the task, and thereby keeps the whole Storm task running normally.
Data enters a Storm task through a Spout node; the Spout, via the business logic it inherits and implements, parses and processes the incoming data and sends it to the next connected Bolt node. A Bolt node receives the data, processes it with its own business logic, and passes it on to the next Bolt node. Multiple Bolt nodes can be chained in series or in parallel, and streams can also branch and merge; the data-flow structure follows the principle of a directed acyclic graph, so various combinations can form complex business-logic models and meet application requirements.
Storm has a complete high-availability (HA) mechanism for downtime recovery. Storm stores and shares node configuration information through the synchronization and storage module Zookeeper. When a node crashes, the master control node detects the failure via a heartbeat mechanism and tries to pull the failed node up again; while pulling it up, it reads the node's configuration information from Zookeeper, and once pulled up the node rejoins the application Topology.
Fig. 3 is a schematic diagram of a fault-tolerant topology deployment structure on a Storm platform according to an embodiment of the present invention. The embodiment of the invention adds a Storm extension module: the Fault Tolerance (FT) Topology module. The user creates, deploys, and runs applications with FT Topology, following the implementation and operation mode of this newly added module. Storm applications built using the FT Topology module are referred to herein as FT-capable Storm applications.
When an FT-capable Storm application is deployed, two identical Storm application instances, denoted FT Topology A and FT Topology B, are generated automatically in the Storm cluster. The physical resources of the Storm cluster are then allocated to the two instances according to a set policy (even split or proportional split); each instance is deployed on physical resources it exclusively controls, so data streams between the nodes of an instance circulate only within that instance. That is, a data stream flowing in the FT Topology A instance is never sent to the FT Topology B instance; the two instances are completely isolated with respect to processing the data streams. In this embodiment a physical resource may be a physical machine, and one physical machine may host several nodes; as shown in Fig. 3, Host1 is allocated to node 1 and node 4 of instance A, Host2 to node 2 and node 5 of instance A, and so on.
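A minimal sketch of the even-split allocation policy described above, matching Fig. 3 where Host1 to Host3 serve instance A and Host4 to Host6 serve instance B; the ResourceSplitter class and the instance labels are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical even-split policy: the first half of the cluster's machines
// goes to instance A, the second half to instance B, so that each host is
// owned by exactly one instance.
final class ResourceSplitter {
    static Map<String, List<String>> evenSplit(List<String> hosts) {
        int half = hosts.size() / 2;
        Map<String, List<String>> groups = new HashMap<>();
        groups.put("FT-Topology-A", new ArrayList<>(hosts.subList(0, half)));
        groups.put("FT-Topology-B", new ArrayList<>(hosts.subList(half, hosts.size())));
        return groups;
    }
}
// evenSplit(List.of("Host1","Host2","Host3","Host4","Host5","Host6"))
// -> {FT-Topology-A=[Host1, Host2, Host3], FT-Topology-B=[Host4, Host5, Host6]}
```

A proportional split would simply replace the midpoint with a configured ratio.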
Once an FT-capable Storm application is deployed, each Storm application instance has an independent group of physical resources; only nodes of the same instance run in the same resource group, and nodes of the two instances never run on machines of the same resource group at the same time, whether during failure pull-up or any other operation. As shown in Fig. 3, no node of FT Topology A can ever run on Host4, Host5, or Host6; when a node of FT Topology A goes down and is pulled up again, the newly pulled-up node can only run on one of Host1, Host2, or Host3.
Fig. 4 is a schematic diagram illustrating fast recovery on the Storm platform according to an embodiment of the present invention. After the FT-capable Storm application is deployed, the peer position and function of each Spout/Bolt node between the two jointly deployed instances (the FT Topology A and FT Topology B instances), i.e., the peer node information, is recorded in an association information table, which can be stored in the Zookeeper module. Storm's heartbeat mechanism is used so that peer Spout/Bolt nodes (nodes of the same position and function) in the two instances monitor each other's operating state. If a Spout/Bolt node goes down due to a fault while its peer in the other instance runs normally, the normally running node sends its own cache data to the newly pulled-up Spout/Bolt node, accelerating the recovery of the failed node.
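The following Java sketch illustrates the mutual-monitoring and cache-sync idea. It is a sketch under stated assumptions: the PeerLink transport, the five-second timeout, and the cache-snapshot supplier are all hypothetical, and the document's actual mechanism reuses Storm's own heartbeats rather than a class like this.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical peer watchdog: a node tracks its peer's heartbeats and, once
// the peer has been replaced after a failure, pushes its current cache to
// the replacement so the new node can rebuild its state quickly.
final class PeerWatchdog {
    private static final long TIMEOUT_MS = 5_000; // assumed detection window
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
    private final PeerLink link;                   // assumed transport
    private final Supplier<byte[]> cacheSnapshot;  // current local cache

    PeerWatchdog(PeerLink link, Supplier<byte[]> cacheSnapshot) {
        this.link = link;
        this.cacheSnapshot = cacheSnapshot;
    }

    void onHeartbeat() { lastHeartbeat.set(System.currentTimeMillis()); }

    void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            if (System.currentTimeMillis() - lastHeartbeat.get() > TIMEOUT_MS) {
                // peer considered down: wait for its pull-up replacement,
                // then send our cache so it can resume from the failure point
                String newPeer = link.awaitReplacementNode();
                link.sendCache(newPeer, cacheSnapshot.get());
                lastHeartbeat.set(System.currentTimeMillis()); // avoid re-firing
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    interface PeerLink {
        String awaitReplacementNode();
        void sendCache(String node, byte[] cache);
    }
}
```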
This process is implemented by having the Storm control node (Nimbus) invoke Zookeeper's caching and synchronization of per-node information to create and maintain the association information table of same-position, same-function Spout/Bolt nodes between the two instances. When a Spout/Bolt node in an instance is pulled up again after a fault, the old node's connections are broken, and the connection relationships of the corresponding nodes are adjusted using the creation information of the newly pulled-up Spout/Bolt node stored in Zookeeper; that is, the association information table is updated, replacing the old downed node in the Topology with the new failure pull-up node.
After a node goes down, the normally running node in the corresponding other instance, as recorded in the association information table, sends its cache data to the new node, helping the new node quickly rebuild its data model and restore its data-processing capability.
When an FT-capable Storm application performs disaster-recovery operations for a node on the Storm platform, the grouping information of the failed physical resource is checked against the FT Topology instances, and recovery is carried out only on the faulty FT Topology instance; the operation of the other instance is unaffected.
As shown in Fig. 4, when node 3 of instance A fails, node 7 is used to replace node 3: node 3 is replaced with node 7 in the association information table, and the cache data of node 3's peer node 3' is sent to the new node 7 according to the peer node information recorded in the table, so that node 7 resumes data processing from the point of failure.
Fig. 5 is a schematic flowchart of another real-time data fault-tolerant processing method according to an embodiment of the present invention, where the method includes the following steps:
s201, controlling at least one node in each instance to process real-time data respectively.
The system deploys two or more instances for the service, the nodes in the instances process real-time data respectively, namely, each instance performs distributed deployment on physical resources independently dominated by the instance, and the data streams among the nodes in the respective instances can only circulate in the same instance.
S202: when a node processing real-time data of the service in the system fails, the instance where the node is located is determined according to the physical resources corresponding to the node.
S203: in the determined instance, the failed node is replaced with a failure pull-up node.
S204: the failed node is updated to the failure pull-up node in the association information table.
S205: the cache data of the peer node of the failed node is sent to the failure pull-up node according to the peer node information in the association information table, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
S202 to S205 constitute the data fault-tolerance procedure when a node in an instance fails; they are the same as S101 to S104 of the embodiment shown in Fig. 1 and are not repeated here.
S206: when the system adds physical resources for the service, the added physical resources are allocated to the at least two instances of the service.
S207: the physical resources added to each instance are allocated to the failure pull-up node.
S206 and S207 form the physical-resource expansion procedure. When the system adds physical resources for a service, the added resources are allocated, generally evenly, to the two or more instances the system has deployed for that service. The added resources allocated to an instance can be given directly to the failure pull-up node: since the pull-up node is new and has no physical resources yet, assigning the added resources to it directly simplifies the operation. Alternatively, the system can first determine which nodes in each instance correspond to physical resources whose load exceeds a set value, decide which nodes should receive the added resources, and migrate the real-time data streams of the overloaded nodes to the nodes that received the added resources, balancing the load across nodes and making rational use of the physical resources. Expanding physical resources requires no action on the processing each node is carrying out.
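A sketch of the two expansion strategies just described; ExpansionPolicy, Migrator, and the 0.8 load threshold are assumed names and values, not part of the patent or of Storm.

```java
import java.util.Map;

// Hypothetical expansion logic: either hand the new host to the pull-up node,
// or relieve overloaded nodes by migrating their streams to the new host.
final class ExpansionPolicy {
    private static final double LOAD_THRESHOLD = 0.8; // assumed "set value"

    static void onHostAdded(String newHost,
                            String pullUpNodeOrNull,
                            Map<String, Double> nodeLoad,   // node id -> load
                            Migrator migrator) {
        if (pullUpNodeOrNull != null) {
            // strategy 1: a pull-up node exists and owns no resources yet,
            // so the new host can be assigned to it directly
            migrator.assignHost(pullUpNodeOrNull, newHost);
            return;
        }
        // strategy 2: migrate the streams of overloaded nodes to the new host
        nodeLoad.forEach((node, load) -> {
            if (load > LOAD_THRESHOLD) {
                migrator.migrateStream(node, newHost);
            }
        });
    }

    interface Migrator {
        void assignHost(String node, String host);
        void migrateStream(String fromNode, String toHost);
    }
}
```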
S208: when physical resources in an instance need to be reduced, the real-time data processing of the nodes corresponding to the reduced physical resources is stopped.
S209: the node that stopped processing real-time data is replaced with a failure pull-up node.
S210: the node that stopped processing real-time data is updated to the failure pull-up node in the association information table.
S211: the cache data of the peer node of the node that stopped processing real-time data is sent to the failure pull-up node according to the peer node information.
S208 to S211 form the procedure for reducing physical resources in an instance. The removed resources may correspond to only some of the nodes; those nodes must stop their current data processing and are replaced with failure pull-up nodes. After the remaining physical resources in the instance are allocated to the failure pull-up nodes and the other, unstopped nodes, the same operations as in the embodiment of Fig. 1 allow the failure pull-up node to resume the data processing at the point where the stopped node left off. The remaining physical resources in the instance are reallocated to at least one node in the instance that is performing real-time data processing.
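A shrink-path sketch that reuses the hypothetical FaultTolerantController from the earlier sketch; again, every name here is an assumption for illustration.

```java
import java.util.List;

// Hypothetical shrink path: stop the nodes losing their hosts, then reuse the
// failure pull-up flow so processing resumes on the remaining resources.
final class ShrinkPolicy {
    private final FaultTolerantController controller; // from the earlier sketch

    ShrinkPolicy(FaultTolerantController controller) { this.controller = controller; }

    void removeHost(String host, List<String> nodesOnHost) {
        for (String node : nodesOnHost) {
            stopProcessing(node);             // S208: halt processing first
            controller.onNodeFailure(node);   // S209-S211: replace the node,
                                              // update the table, sync peer cache
        }
    }

    private void stopProcessing(String node) {
        // placeholder: signal the node to stop consuming real-time data
        System.out.println("stopping " + node);
    }
}
```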
It should be noted that when an FT-capable Storm application is scaled up or down on the Storm platform, only the Topology instance that needs adjustment is redeployed; the instance that needs no adjustment keeps running.
According to the real-time data fault-tolerant processing method provided by this embodiment of the invention, the system manages real-time data processing in a unified way; when a node goes down, its pre-downtime state is restored and it quickly rejoins the system. Physical resources added to an instance can be allocated directly to the new failure pull-up node, which is simple to operate. When reducing the physical resources of an instance, the real-time data processing of the nodes corresponding to the removed resources is stopped and those nodes are replaced with failure pull-up nodes, ensuring that data processing is not interrupted by the resource reduction.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined actions; however, those skilled in the art will recognize that the present invention is not limited by the order of the actions described, since some steps may be performed in other orders or concurrently. Furthermore, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
Fig. 6 is a schematic structural diagram of a real-time data fault-tolerant processing system according to an embodiment of the present invention, where the system 1000 includes:
the determining unit 11 is configured to determine, when real-time data of a node processing service in the system fails, an instance where the node is located according to a physical resource corresponding to the node.
The system deploys two or more instances to each service, each instance comprises one or more nodes with topological relations, the instances have the same number of nodes, and the nodes have the same topological relations. Each instance is allocated with corresponding physical resources, the allocated physical resources can only be controlled by the instance, the nodes in each instance have corresponding relations with the allocated physical resources, namely, the physical resources are allocated to each instance, and the physical resources allocated to each instance are respectively allocated to the nodes in the instance. And each node in each instance has a peer node in the other instances, i.e. a node having the same location and performing the same function in a topological relationship.
The system in this embodiment may be used to process real-time data. When a node in the system fails to process real-time data of the service, the determining unit 11 first needs to determine an instance where the node is located: when a failure occurs, it can be determined which physical resource has failed first, and therefore, according to the failed physical resource and the correspondence between the node and the physical resource, the instance where the node is located can be determined.
A replacing unit 12, configured to, in the determined example, replace the failed pull-up node with the failed node.
After determining the instance where the failed node is located, in this instance, the replacing unit 12 replaces the failed node with a failed pull-up node, and this replacing operation is performed only in this instance, and does not affect the normal operation of other instances.
An updating unit 13 is configured to update the failed node to the failure pull-up node in the association information table.
An association information table is maintained in the system; it records the peer node information of the two or more instances the system deploys for each service. For example, node 1 in instance A and node 1' in instance B are recorded as peer nodes. After the failed node is replaced with the failure pull-up node, the updating unit 13 needs to update the failed node to the failure pull-up node in the association information table. For example, if node 1 in instance A fails and is replaced with node 7, then node 1 of instance A must be replaced with node 7 in the association information table.
A sending unit 14 is configured to send the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information in the association information table, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
The peer node of the failed node is looked up according to the peer node information in the association information table. Because the peer node is operating normally when the node fails and keeps cache data for a period of time, the sending unit 14 can send the peer node's cache data to the failure pull-up node, so that the failure pull-up node resumes the failed node's data processing from the point of failure according to the received cache data.
As can be seen from the above description, the physical resource allocation, the recording of peer node information, the replacement of nodes, and so on for the two or more instances the system deploys for each service are managed by the system in a unified way. When a node of an instance fails, the cache data of its peer node can be obtained, the failed node's pre-downtime state can be restored, and the node can quickly rejoin the system.
With the real-time data fault-tolerant processing system provided by this embodiment of the invention, the system manages real-time data processing in a unified way, and when a node fails it restores the node's pre-downtime state so that the node quickly rejoins the system.
Fig. 7 is a schematic structural diagram of another real-time data fault-tolerant processing system according to an embodiment of the present invention, where the system 2000 includes:
A processing unit 21 is configured to control the at least one node in each instance to process real-time data respectively.
The system deploys two or more instances for the service, and the nodes in each instance process real-time data separately; that is, each instance is deployed in a distributed fashion on the physical resources it exclusively controls, and data streams between the nodes of an instance circulate only within that instance.
The determining unit 22 is configured to determine, when a node processing real-time data of a service in the system fails, the instance where the node is located according to the physical resources corresponding to the node.
A replacing unit 23 is configured to replace the failed node with a failure pull-up node in the determined instance.
An updating unit 24 is configured to update the failed node to the failure pull-up node in the association information table.
A sending unit 25 is configured to send the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information in the association information table, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
The above is the data fault-tolerance procedure when a node in an instance fails; the units function the same as the corresponding units in the embodiment shown in Fig. 6, and the description is not repeated here.
An allocating unit 26 is configured to, when the system adds physical resources for the service, allocate the added physical resources to the at least two instances of the service.
The allocating unit 26 is further configured to allocate the physical resources added to each instance to the failure pull-up node.
The above is the physical-resource expansion procedure. When the system adds physical resources for a service, the added resources are allocated, generally evenly, to the two or more instances the system has deployed for that service. The added resources allocated to an instance can be given directly to the failure pull-up node: since the pull-up node is new and has no physical resources yet, assigning the added resources to it directly simplifies the operation. Alternatively, the system can first determine which nodes in each instance correspond to physical resources whose load exceeds a set value, decide which nodes should receive the added resources, and migrate the real-time data streams of the overloaded nodes to the nodes that received the added resources, balancing the load across nodes and making rational use of the physical resources. Expanding physical resources requires no action on the processing each node is carrying out.
The processing unit 21 is further configured to stop the real-time data processing of the nodes corresponding to the reduced physical resources when physical resources in the instance need to be reduced.
The replacing unit 23 is further configured to replace the node that stopped processing real-time data with the failure pull-up node.
The updating unit 24 is further configured to update the node that stopped processing real-time data to the failure pull-up node in the association information table.
The sending unit 25 is further configured to send the cache data of the peer node of the node that stopped processing real-time data to the failure pull-up node according to the peer node information.
The above is the procedure for reducing physical resources in an instance. The removed resources may correspond to only some of the nodes; those nodes must stop their current data processing and are replaced with failure pull-up nodes. After the remaining physical resources in the instance are allocated to the failure pull-up nodes and the other, unstopped nodes, the same operations as in the embodiment of Fig. 1 allow the failure pull-up node to resume the data processing at the point where the stopped node left off. The remaining physical resources in the instance are reallocated to at least one node in the instance that is performing real-time data processing.
It should be noted that when an FT-capable Storm application is scaled up or down on the Storm platform, only the Topology instance that needs adjustment is redeployed; the instance that needs no adjustment keeps running.
With the real-time data fault-tolerant processing system provided by this embodiment of the invention, the system manages real-time data processing in a unified way; when a node goes down, its pre-downtime state is restored and it quickly rejoins the system. Physical resources added to an instance can be allocated directly to the new failure pull-up node, which is simple to operate. When reducing the physical resources of an instance, the real-time data processing of the nodes corresponding to the removed resources is stopped and those nodes are replaced with failure pull-up nodes, ensuring that data processing is not interrupted by the resource reduction.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be apparent to those skilled in the art that the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions described above may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another, and storage media may be any available media accessible by a computer. By way of example and not limitation, computer-readable media may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact-disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer. In addition, any connection may properly be considered a computer-readable medium: for example, if software is transmitted from a website, server, or other remote source over a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then those cables and wireless technologies are included in the definition of the medium. Disks and discs as used herein include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically and discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
In summary, the above description covers only preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (12)

1. A real-time data fault-tolerant processing method is characterized by comprising the following steps:
when a node processing real-time data of a service in a system fails, determining the instance where the node is located according to the physical resources corresponding to the node, wherein at least two instances of the service are deployed in the system, each instance comprises at least one node with a topological relationship, each instance is allocated corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in the other instances;
in the determined instance, replacing the failed node with a failure pull-up node;
updating the failed node to the failure pull-up node in an association information table, wherein the association information table comprises peer node information of the at least two instances;
and sending the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
2. The method of claim 1, further comprising:
and controlling the at least one node in each instance to process the real-time data respectively.
3. The method of claim 1, wherein the allocated physical resources include at least one physical machine, and the correspondence between the at least one node in each instance and the allocated physical resources comprises: each of the physical machines corresponding to at least one node.
4. The method of any one of claims 1-3, further comprising:
when the system adds physical resources to the service, the added physical resources are allocated to at least two instances of the service.
5. The method of claim 4, wherein the method further comprises:
transferring real-time data of at least one node corresponding to the physical resource with the load higher than a set value in each instance to at least one node corresponding to the added physical resource allocated to the instance; or
Allocating the physical resources added by each of the instances to the failed pull-up node.
6. The method of any one of claims 1-3, further comprising:
when physical resources in an instance need to be reduced, stopping the real-time data processing of the nodes corresponding to the reduced physical resources;
replacing the node that stopped processing real-time data with a failure pull-up node, wherein the remaining physical resources in the instance are reallocated to at least one node performing real-time data processing in the instance;
updating the node that stopped processing real-time data to the failure pull-up node in the association information table;
and sending the cache data of the peer node of the node that stopped processing real-time data to the failure pull-up node according to the peer node information.
7. A real-time data fault tolerant processing system, comprising:
a determining unit, configured to determine, when a node processing real-time data of a service in a system fails, the instance where the node is located according to the physical resources corresponding to the node, wherein at least two instances of the service are deployed in the system, each instance comprises at least one node with a topological relationship, each instance is allocated corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in the other instances;
a replacing unit, configured to replace the failed node with a failure pull-up node in the determined instance;
an updating unit, configured to update the failed node to the failure pull-up node in an association information table, wherein the association information table comprises peer node information of the at least two instances;
and a sending unit, configured to send the cache data of the peer node of the failed node to the failure pull-up node according to the peer node information, so that the failure pull-up node resumes the data processing of the failed node according to the received cache data.
8. The system of claim 7, further comprising:
and the processing unit is used for controlling the at least one node in each instance to process the real-time data respectively.
9. The system of claim 7, wherein the allocated physical resources include at least one physical machine, and the correspondence between the at least one node in each instance and the allocated physical resources comprises: each of the physical machines corresponding to at least one node.
10. The system of any one of claims 7-9, further comprising:
an allocating unit, configured to, when the system adds physical resources to the service, allocate the added physical resources to at least two instances of the service.
11. The system of claim 10, wherein the system further comprises:
the migration unit is used for migrating the real-time data of at least one node corresponding to the physical resource with the load higher than the set value in each instance to at least one node corresponding to the increased physical resource distributed to the instance; or
The allocation unit is further configured to allocate the physical resource added by each of the instances to the failed pull-up node.
12. The system of claim 8, wherein:
the processing unit is further configured to stop the real-time data processing of the nodes corresponding to the reduced physical resources when physical resources in the instance need to be reduced;
the replacing unit is further configured to replace the node that stopped processing real-time data with the failure pull-up node, wherein the remaining physical resources in the instance are reallocated to at least one node performing real-time data processing in the instance;
the updating unit is further configured to update the node that stopped processing real-time data to the failure pull-up node in the association information table;
and the sending unit is further configured to send the cache data of the peer node of the node that stopped processing real-time data to the failure pull-up node according to the peer node information.
CN201510923282.7A 2015-12-11 2015-12-11 Real-time data fault-tolerant processing method and system Active CN106874142B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510923282.7A CN106874142B (en) 2015-12-11 2015-12-11 Real-time data fault-tolerant processing method and system
PCT/CN2016/099585 WO2017097006A1 (en) 2015-12-11 2016-09-21 Real-time data fault-tolerance processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510923282.7A CN106874142B (en) 2015-12-11 2015-12-11 Real-time data fault-tolerant processing method and system

Publications (2)

Publication Number Publication Date
CN106874142A CN106874142A (en) 2017-06-20
CN106874142B true CN106874142B (en) 2020-08-07

Family

ID=59013692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510923282.7A Active CN106874142B (en) 2015-12-11 2015-12-11 Real-time data fault-tolerant processing method and system

Country Status (2)

Country Link
CN (1) CN106874142B (en)
WO (1) WO2017097006A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614460B (en) * 2018-06-20 2020-11-06 东莞市李群自动化技术有限公司 Distributed multi-node control system and method
CN111193759B (en) * 2018-11-15 2023-08-01 中国电信股份有限公司 Distributed computing system, method and apparatus
CN110351122B (en) * 2019-06-17 2022-02-25 腾讯科技(深圳)有限公司 Disaster recovery method, device, system and electronic equipment
CN111124696B (en) * 2019-12-30 2023-06-23 北京三快在线科技有限公司 Unit group creation, data synchronization method, device, unit and storage medium
CN111930515B (en) * 2020-09-16 2021-09-10 北京达佳互联信息技术有限公司 Data acquisition and distribution method, device, server and storage medium
CN113312210B (en) * 2021-05-28 2022-07-29 北京航空航天大学 Lightweight fault-tolerant method of streaming processing system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699599A (en) * 2013-12-13 2014-04-02 华中科技大学 Message reliable processing guarantee method of real-time flow calculating frame based on Storm
CN103701906A (en) * 2013-12-27 2014-04-02 北京奇虎科技有限公司 Distributed real-time calculation system and data processing method thereof
CN103716182A (en) * 2013-12-12 2014-04-09 中国科学院信息工程研究所 Failure detection and fault tolerance method and failure detection and fault tolerance system for real-time cloud platform
CN104683445A (en) * 2015-01-26 2015-06-03 北京邮电大学 Distributed real-time data fusion system
CN104794015A (en) * 2015-04-16 2015-07-22 华中科技大学 Real-time streaming computing flow speed perceiving elastic execution tolerant system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060080319A1 (en) * 2004-10-12 2006-04-13 Hickman John E Apparatus, system, and method for facilitating storage management
EP2002670B1 (en) * 2006-03-31 2019-03-27 Cisco Technology, Inc. System and method for active geographic redundancy
CN104283950B (en) * 2014-09-29 2019-01-08 杭州华为数字技术有限公司 A kind of method, apparatus and system of service request processing
CN104836850A (en) * 2015-04-16 2015-08-12 华为技术有限公司 Instance node management method and management equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716182A (en) * 2013-12-12 2014-04-09 中国科学院信息工程研究所 Failure detection and fault tolerance method and failure detection and fault tolerance system for real-time cloud platform
CN103699599A (en) * 2013-12-13 2014-04-02 华中科技大学 Message reliable processing guarantee method of real-time flow calculating frame based on Storm
CN103701906A (en) * 2013-12-27 2014-04-02 北京奇虎科技有限公司 Distributed real-time calculation system and data processing method thereof
CN104683445A (en) * 2015-01-26 2015-06-03 北京邮电大学 Distributed real-time data fusion system
CN104794015A (en) * 2015-04-16 2015-07-22 华中科技大学 Real-time streaming computing flow speed perceiving elastic execution tolerant system

Also Published As

Publication number Publication date
WO2017097006A1 (en) 2017-06-15
CN106874142A (en) 2017-06-20

Similar Documents

Publication Publication Date Title
CN106874142B (en) Real-time data fault-tolerant processing method and system
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
CN202798798U (en) High availability system based on cloud computing technology
CN107231399B (en) Capacity expansion method and device for high-availability server cluster
US10177994B2 (en) Fault tolerant federation of computing clusters
CN102355369B (en) Virtual clustered system as well as processing method and processing device thereof
CN108200124B (en) High-availability application program architecture and construction method
CN110362381A (en) HDFS cluster High Availabitity dispositions method, system, equipment and storage medium
EP2643771B1 (en) Real time database system
CN105159798A (en) Dual-machine hot-standby method for virtual machines, dual-machine hot-standby management server and system
CN105337780B (en) A kind of server node configuration method and physical node
CN113032085A (en) Management method, device, server, management system and medium of cloud operating system
CN104679579A (en) Virtual machine migration method and device in cluster system
CN102983996A (en) Dynamic allocation method and system for high-availability cluster resource management
CN106919473A (en) A kind of data disaster recovery and backup systems and method for processing business
CN115297124A (en) System operation and maintenance management method and device and electronic equipment
CN111045602A (en) Cluster system control method and cluster system
CN102487332B (en) Fault processing method, apparatus thereof and system thereof
CN115225642B (en) Elastic load balancing method and system of super fusion system
CN108984602B (en) Database control method and database system
CN114598711B (en) Data migration method, device, equipment and medium
CN115378800A (en) Distributed fault-tolerant system, method, apparatus, device and medium without server architecture
CN116192885A (en) High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system
CN112131201B (en) Method, system, equipment and medium for high availability of network additional storage
CN107590032A (en) The method and storage cluster system of storage cluster failure transfer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220216

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.
