WO2017097006A1 - Method and system for real-time data fault-tolerant processing

Method and system for real-time data fault-tolerant processing

Info

Publication number
WO2017097006A1
WO2017097006A1 PCT/CN2016/099585 CN2016099585W
Authority
WO
WIPO (PCT)
Prior art keywords
node
instance
real-time data
fault
Prior art date
Application number
PCT/CN2016/099585
Other languages
English (en)
Chinese (zh)
Inventor
单卫华
林铭
殷晖
李旭良
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2017097006A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Definitions

  • the present invention relates to the field of real-time computing, and in particular, to a real-time data fault-tolerant processing method and system.
  • Real-time streaming computing platforms are widely used in industry to build real-time online system architectures, and the most widely used real-time stream computing component is Storm.
  • Storm is a free, open source, distributed, and highly fault-tolerant real-time computing system.
  • Storm is often used in real-time analytics, online machine learning, continuous computing, distributed remote calls, and ETL.
  • Storm's deployment and management are very simple, and among similar stream computing tools Storm's performance is outstanding, making it a first-choice solution for building a real-time computing system architecture.
  • Storm's processes are stateless, which makes it easy to achieve fast failover and guarantees Storm's robustness.
  • Storm does not provide a function for saving a node's cached state data. Therefore, after a node goes down, Storm only needs to pull up the node service; HA protection with fast failure and recovery can be achieved without loading state data.
  • However, in many scenarios state data needs to be saved, and a node that does not support recovering state data is unacceptable there. Users urgently need to ensure that a node loads the state data from before the failure and restores the state it had before going down, so that it can quickly re-access the system, respond to data-processing requests, and take on business processing.
  • The embodiments of the invention provide a real-time data fault-tolerant processing method and system that manage real-time data processing in a unified way and can restore a node's pre-failure state when the node fails.
  • A real-time data fault-tolerant processing method includes:
  • when a node in the system fails while processing real-time data of a service, determining the instance where the node is located according to the physical resource corresponding to the node, where the service deploys at least two instances in the system, each instance includes at least one node having a topological relationship, each instance is assigned corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in another instance;
  • in the determined instance, replacing the failed node with a fault pull-up node;
  • updating the failed node to the fault pull-up node in a connection information table, where the connection information table includes the peer node information of the at least two instances;
  • sending, according to the peer node information, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data.
  • In this way the system can manage real-time data processing in a unified manner, and when a node fails it can restore the state the node had before going down and quickly re-access the system.
  • the method further includes:
  • the at least one node in each instance is controlled to process the real-time data separately.
  • Each instance is deployed in a distributed manner on its own independent physical resources, ensuring that the real-time data between the nodes of each instance circulates only within the same instance.
  • The allocated physical resources include at least one physical machine, and the correspondence between the at least one node in each instance and the allocated physical resources includes: each physical machine corresponds to at least one node.
  • the method further includes:
  • the added physical resources are allocated to at least two instances of the service.
  • the method further includes:
  • the physical resources added by each of the instances are assigned to the fault pull-up node.
  • The real-time data of nodes can be migrated according to the load on the physical resources allocated to the nodes, ensuring load balancing and balanced use of the physical resources; alternatively, the physical resources added to each instance can be directly assigned to a new fault pull-up node, which is simple to operate.
  • the method further includes:
  • the fault pull-up node replaces the node that stops processing real-time data;
  • when physical resources are reduced, the real-time data processing of the nodes corresponding to the reduced physical resources must be stopped, and the fault pull-up node replaces each stopped node, ensuring that data processing is not interrupted by the reduction of physical resources.
  • A real-time data fault-tolerant processing system has the functions of implementing the system behavior in the above method.
  • the functions can be implemented by hardware executing corresponding software.
  • the software includes one or more modules corresponding to the functions described above.
  • the real-time data fault-tolerant processing system includes:
  • a determining unit, configured to determine, when a node in the system fails while processing real-time data of a service, the instance where the node is located according to the physical resource corresponding to the node, where the service deploys at least two instances in the system;
  • each instance includes at least one node having a topological relationship, each instance is assigned corresponding physical resources, the at least one node in each instance has a correspondence with the allocated physical resources, and each node in each instance has a peer node in another instance;
  • a replacing unit, configured to replace, in the determined instance, the failed node with the fault pull-up node;
  • an updating unit, configured to update, in the connection information table, the failed node to the fault pull-up node, where the connection information table includes the peer node information of the at least two instances;
  • a sending unit, configured to send, according to the peer node information, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data. A minimal interface sketch of these four units follows.
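  • As a reading aid, the four units can be pictured as a single Java interface; the method names below mirror the claim language and are hypothetical, not an API defined by the patent.

    // Hypothetical sketch of the four claimed units; all names are illustrative.
    public interface FaultTolerantSystem {
        // Determining unit: locate the instance of a failed node via its physical resource.
        String determineInstance(String failedNode);
        // Replacing unit: within that instance, replace the failed node with a fault pull-up node.
        String replaceWithPullUpNode(String instance, String failedNode);
        // Updating unit: record the failed-node -> pull-up-node change in the connection information table.
        void updateConnectionTable(String failedNode, String pullUpNode);
        // Sending unit: forward the peer node's cached data to the pull-up node for recovery.
        void sendPeerCache(String peerNode, String pullUpNode);
    }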
  • Each instance is assigned corresponding physical resources, and each node in each instance has a peer node in another instance. When a node in the system fails while processing real-time data of a service, the fault pull-up node replaces the failed node, the failed node is updated to the fault pull-up node in the connection information table, and, according to the peer node information in the connection information table, the cached data of the failed node's peer node is sent to the fault pull-up node so that the fault pull-up node resumes data processing. The system can thus manage real-time data processing in a unified manner and, when a node fails, restore the state the node had before going down and quickly re-access the system.
  • FIG. 1 is a schematic flowchart of a real-time data fault-tolerant processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an exemplary operation of an existing Storm application
  • FIG. 3 is a schematic structural diagram of a fault tolerant topology deployment in a Storm platform according to an example of an embodiment of the present invention
  • FIG. 4 is a schematic diagram of fast recovery from a fault-caused downtime in a Storm platform according to an example of an embodiment of the present invention;
  • FIG. 5 is a schematic flowchart of another real-time data fault-tolerant processing method according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a real-time data fault-tolerant processing system according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of another real-time data fault-tolerant processing system according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a real-time data fault-tolerant processing method according to an embodiment of the present invention, where the method includes the following steps:
  • The system deploys two or more instances for each service; each instance consists of one or more nodes having topological relationships, with the same number of nodes and the same topological relationships between nodes across instances.
  • Each instance is assigned corresponding physical resources, and the allocated physical resources can only be used by that instance; each node in each instance has a correspondence with the allocated physical resources. That is, physical resources are allocated to each instance, and the physical resources allocated to each instance are in turn assigned to the nodes in that instance.
  • each node in each instance has a peer node in other instances, that is, a node that has the same location and performs the same function in the topology relationship.
  • the system in this embodiment can be used to process real-time data.
  • S101. When a node in the system fails while processing the real-time data of a service, it is first necessary to determine the instance where the node is located: because of the failure, the faulty physical resource can be determined first; then, from the faulty physical resource and the correspondence between nodes and physical resources, the instance where the node is located can be determined.
  • S102. The fault pull-up node replaces the failed node.
  • After the instance of the failed node is determined, the failed node is replaced with a fault pull-up node within this instance; the replacement is performed only in this instance and does not affect the normal operation of the other instances.
  • S103. A connection information table is set in the system, where the connection information table includes the peer node information of the two or more instances the system deploys for each service, for example recording that node 1 in instance A and node 1' in instance B are peer nodes. After the fault pull-up node replaces the failed node, the failed node needs to be updated to the fault pull-up node in the connection information table. For example, if node 1 in instance A fails and node 7 replaces node 1, then node 1 in instance A needs to be replaced with node 7 in the connection information table.
  • S104. Send, according to the peer node information in the connection information table, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data.
  • The peer node of the failed node is looked up. Because the peer node keeps operating normally while the node is failed, it holds cached data covering that period of time. The peer node's cached data can therefore be sent to the failed node's fault pull-up node, so that the fault pull-up node resumes the node's data processing from the point of failure according to the received cached data.
  • The physical resource allocation, the recording of peer node information, and the replacement of nodes for the two or more deployed instances are all managed by the system in a unified way; when a node in an instance fails, the cached data of the node's peer node can be obtained, the failed node's pre-downtime state can be restored, and the node can quickly re-access the system.
  • The real-time data fault-tolerant processing method provided by this embodiment enables the system to manage real-time data processing in a unified manner and ensures that when a node fails, the node's pre-downtime state is restored and the node quickly re-accesses the system.
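  • As a summary of S101-S104, the following is a minimal Java sketch of the recovery flow, assuming simple in-memory maps in place of the resource correspondence, the connection information table, and the peer caches that the system actually manages; all names are illustrative.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FaultRecovery {
        // Node -> instance correspondence, derived from the physical resources
        // allocated to each instance (the input to S101).
        final Map<String, String> nodeToInstance = new HashMap<>();
        // Connection information table: node -> peer node in the other instance.
        final Map<String, String> peerTable = new HashMap<>();
        // Cached data held by each normally running node for a period of time.
        final Map<String, List<String>> peerCache = new HashMap<>();

        void onNodeFailure(String failedNode) {
            String instance = nodeToInstance.remove(failedNode); // S101: locate the instance
            String pullUp = startPullUpNode(instance);           // S102: in-instance replacement
            nodeToInstance.put(pullUp, instance);
            String peer = peerTable.remove(failedNode);          // S103: update the table,
            peerTable.put(pullUp, peer);                         // failed node -> pull-up node
            peerTable.put(peer, pullUp);
            replayTo(pullUp, peerCache.get(peer));               // S104: send the peer's cache
        }

        String startPullUpNode(String instance) {
            // Placeholder: start a new node on the instance's own physical resources.
            return "pull-up-node";
        }

        void replayTo(String node, List<String> cachedData) {
            // Placeholder: deliver the cached data so the node resumes from the failure point.
        }
    }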
  • Spout is responsible for emitting data streams in the application (either generating data streams itself or receiving external data streams and sending them on); it is the data source of the entire Storm application. Bolt is where the code implements the business logic; it is responsible for processing the data streams, and complex business logic requires multiple Bolts.
  • A Tuple is the unit of the data stream that circulates in a Storm application; it is a piece of data continually flowing through the nodes. A Topology describes the connection order and relationships of the Spouts and Bolts; in the application one can also configure the transmission strategy of the data flow between the nodes and the running configuration of the entire application. A minimal wiring example is sketched below.
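  • For orientation, a minimal Spout/Bolt/Topology wiring with the Storm 1.x Java API (org.apache.storm) might look as follows; the component classes and names are illustrative, not taken from the patent.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class DemoTopology {
        // Spout: the data source; here it simply emits a constant tuple forever.
        public static class SourceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
                this.collector = c;
            }
            public void nextTuple() { collector.emit(new Values("event")); }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("data"));
            }
        }

        // Bolt: the business-logic processing step.
        public static class PrintBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector c) {
                System.out.println(input.getStringByField("data"));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {}
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("source", new SourceSpout(), 1);
            // shuffleGrouping sets the transmission strategy between the nodes.
            builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("source");
            Config conf = new Config();
            conf.setNumWorkers(2); // running configuration of the whole application
            StormSubmitter.submitTopology("demo-topology", conf, builder.createTopology());
        }
    }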
  • FIG. 2 is a schematic diagram of the operation of the existing Storm application.
  • To run a Storm application, the application code is packaged as a Jar and submitted to the Storm cluster to run.
  • The running Storm application becomes an independent Storm task served by the system; it can only be shut down by a manual command and otherwise keeps running in the Storm cluster. As soon as data meeting the task's requirements arrives, it processes the received data and outputs the result.
  • Storm has a built-in set of reliability mechanisms to ensure the normal operation of each component and node. When a component or node fails or becomes abnormal, Storm automatically restarts the component/node, initializes the relevant configuration, and loads it back into the task, ensuring the normal operation of the entire Storm task.
  • The Spout carries the relevant business logic, analyzes and processes the incoming data, and sends it to the next connected Bolt node. The Bolt node receives the data, processes it as defined by the business logic, and then sends it on to the next Bolt node. Multiple Bolt nodes can be connected in series or in parallel, and flows can also be split and merged; the data-flow structure follows the principle of a directed acyclic graph, and complex business-logic models can be formed through various combinations to meet application needs.
  • Storm has a complete high-availability cluster / dual-machine cluster (English: High Availability, HA for short) mechanism.
  • Storm uses Zookeeper, a module for synchronized saving and sharing, to save and share the configuration information of the nodes.
  • The main control node detects node faults through the heartbeat mechanism and tries to pull up the faulty node; when the faulty node is pulled up, it reads the node configuration information from Zookeeper and loads it (see the sketch below), and the node re-joins the application topology once pulled up.
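  • As an illustration of loading node configuration from Zookeeper during pull-up, the sketch below uses the standard Zookeeper Java client; the znode path and payload format are assumptions, not Storm's actual internal layout.

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class NodeConfigLoader {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> {});
            String path = "/storm/nodes/bolt-3"; // hypothetical config znode
            // Read the saved node configuration so a pulled-up node can reload it,
            // and watch for later changes to that configuration.
            byte[] config = zk.getData(path, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    System.out.println("node config changed: " + event.getPath());
                }
            }, null);
            System.out.println("loaded config: " + new String(config));
            zk.close();
        }
    }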
  • FIG. 3 is a schematic structural diagram of a fault tolerant topology deployment in a Storm platform according to an example of an embodiment of the present invention.
  • The embodiment of the present invention designs a Storm extension module: the fault tolerance (English: fault tolerance, FT) Topology module.
  • Users create applications and deploy them to run with FT Topology, following the implementation and operation mode of the new FT Topology module.
  • A Storm application built with the FT Topology module is referred to here as an FT-capable Storm application.
  • The physical resources in the Storm cluster are distributed to the two instances according to an established policy (even allocation or proportional allocation), and each instance is deployed in a distributed manner on the physical resources it independently controls, ensuring that the data streams between the nodes of each instance circulate only within the same instance.
  • The data streams circulating in the FT Topology A instance are not sent to the FT Topology B instance; the two instances are completely isolated in terms of data-stream processing.
  • A physical resource may be a physical machine, and one physical machine may support multiple nodes running on it. As shown in FIG. 3, the physical machine Host1 is assigned to node 1 and node 4 in instance A, Host2 is assigned to node 2 and node 5 in instance A, and so on. A grouping sketch follows.
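  • The grouping of FIG. 3 can be pictured with a simple data structure; the sketch below is purely illustrative, and the placements on Host3, Host5 and Host6 are inferred from the "and so on".

    import java.util.List;
    import java.util.Map;

    public class ResourceGroups {
        public static void main(String[] args) {
            // Instance A owns Host1-Host3, instance B owns Host4-Host6;
            // each host runs only nodes of its own instance.
            Map<String, Map<String, List<String>>> groups = Map.of(
                "FT Topology A", Map.of(
                    "Host1", List.of("node 1", "node 4"),
                    "Host2", List.of("node 2", "node 5"),
                    "Host3", List.of("node 3", "node 6")),
                "FT Topology B", Map.of(
                    "Host4", List.of("node 1'", "node 4'"),
                    "Host5", List.of("node 2'", "node 5'"),
                    "Host6", List.of("node 3'", "node 6'")));
            // A node pulled up for instance A may only be scheduled on A's hosts.
            System.out.println(groups.get("FT Topology A").keySet());
        }
    }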
  • Each Storm application instance has its own independent physical resource group. Only nodes of the same Storm application instance run within a given physical resource group, whether for fault pull-up or any other running operation, and nodes from the two Storm application instances never run on a machine of the same physical resource group at the same time. As shown in Figure 3, no node of FT Topology A can run on Host4, Host5 or Host6 at any time; when a node in FT Topology A is pulled up, the newly pulled-up node can only run on a machine among Host1, Host2 and Host3.
  • FIG. 4 is a schematic diagram of fast recovery from a fault-caused downtime in a Storm platform according to an example of an embodiment of the present invention.
  • After the FT-capable Storm application is deployed, the connection information table records, for each Spout/Bolt node, the peer of the same position and function between the two associated instances (that is, the FT Topology A instance and the FT Topology B instance), i.e., the peer node information; the connection information table can be saved in the Zookeeper module, for example as sketched below.
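  • One conceivable way to keep this table in Zookeeper is a single znode holding a serialized copy; the path and serialization below are illustrative assumptions, and the sketch also assumes the parent znode already exists.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class PeerTableStore {
        // Persist a serialized connection information table under one znode.
        public static void save(ZooKeeper zk, String serializedTable) throws Exception {
            String path = "/ft-topology/connection-info"; // hypothetical znode path
            byte[] data = serializedTable.getBytes();
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1); // -1: overwrite regardless of version
            }
        }
    }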
  • Storm's heartbeat mode is used to monitor the running status of the peer Spout/Bolt nodes (namely the nodes with the same position and function) in the two instances.
  • The normally running Spout/Bolt node sends its own cached data synchronously to the newly pulled-up Spout/Bolt node to speed up the recovery of the failed node.
  • This process is implemented by the main control node (nimbus), which calls Zookeeper to cache and synchronize the information of each node and creates and maintains the connection information table of same-position, same-function Spout/Bolt nodes between the two instances.
  • When a node goes down, the main control node disconnects the old node, records the creation information of the newly pulled-up Spout/Bolt node in Zookeeper, and adjusts the connection relationships of the corresponding nodes; that is, the connection information table is updated, replacing the old downed node in the topology with the new fault pull-up node.
  • The normally running peer node in the other instance, found through the connection information table, sends its cached data to the new node, helping it quickly rebuild its data model and restore its data-processing capability.
  • Fault tolerance is carried out within each FT Topology instance, so a fault in one instance does not affect the operation of the other instance.
  • For example, when node 3 in instance A fails, node 7 replaces node 3, node 3 is replaced with node 7 in the connection information table, and, based on the peer node information recorded in the connection information table, the cached data of node 3's peer node 3' is sent to the new node 7, so that node 7 resumes data processing from the point at which the failure occurred.
  • FIG. 5 is a schematic flowchart of another real-time data fault-tolerant processing method according to an embodiment of the present invention, where the method includes the following steps:
  • S201. The system deploys two or more instances for the service.
  • The nodes in these instances process real-time data separately; that is, each instance is deployed in a distributed manner on its own independent physical resources, ensuring that the data streams between the nodes of each instance circulate only within the same instance.
  • S203. The fault pull-up node replaces the failed node.
  • S205. Send, according to the peer node information in the connection information table, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data.
  • S202-S205 are the same as S101-S104 in the embodiment shown in FIG. 1 and are not described again here.
  • S206-S207 are a process of expanding physical resources.
  • The added physical resources are allocated to the two or more instances the system deploys for the service, generally evenly.
  • The added physical resources assigned to each instance can be allocated directly to the fault pull-up node: because the fault pull-up node is a new node that has not yet been allocated physical resources, directly allocating added physical resources to it simplifies the operation. Alternatively, it can first be determined which physical resources in each instance correspond to one or more nodes whose load is higher than a set value, and the added physical resources are then allocated accordingly. A sketch of such a policy follows.
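  • A minimal sketch of this expansion policy, with an assumed load threshold standing in for the set value; all names are illustrative.

    import java.util.Map;
    import java.util.Optional;

    public class ExpansionPolicy {
        static final double LOAD_THRESHOLD = 0.8; // the "set value" (assumed)

        // Decide which node an added host should serve: a waiting fault pull-up
        // node gets it directly; otherwise relieve the most loaded node above
        // the threshold by migrating part of its real-time data stream.
        String assignAddedHost(Optional<String> pendingPullUpNode,
                               Map<String, Double> nodeLoad) {
            if (pendingPullUpNode.isPresent()) {
                return pendingPullUpNode.get(); // new node, no resources yet
            }
            return nodeLoad.entrySet().stream()
                    .filter(e -> e.getValue() > LOAD_THRESHOLD)
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null); // no overloaded node: keep the host in reserve
        }
    }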
  • S208-S211 are processes for reducing physical resources in the instance.
  • When the physical resources in an instance are reduced, some nodes may have their physical resources taken away.
  • The nodes whose physical resources are removed need to stop their current data processing, and each stopped node is replaced by a fault pull-up node; then, with the remaining nodes in the instance, the same operation as in the embodiment shown in FIG. 1 is used so that the fault pull-up node resumes the data processing from the point where the stopped node stopped.
  • the remaining physical resources in the instance are reassigned to at least one node in the instance that is undergoing real-time data processing.
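  • The reduction path can reuse the FaultRecovery sketch given after the description of FIG. 1; again, the names below are illustrative.

    import java.util.List;

    public class ReductionPolicy {
        // Remove one host: stop its nodes, then reuse the recovery flow so a
        // fault pull-up node placed on the remaining hosts takes over each one.
        void removeHost(List<String> nodesOnHost, FaultRecovery recovery) {
            for (String node : nodesOnHost) {
                stopProcessing(node);          // halt real-time processing first
                recovery.onNodeFailure(node);  // pull up a replacement and replay the peer cache
            }
            // Remaining physical resources are then reassigned to the nodes
            // still processing real-time data.
        }

        void stopProcessing(String node) { /* flush in-flight data and stop the node */ }
    }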
  • The real-time data fault-tolerant processing method provided by this embodiment enables the system to manage real-time data processing in a unified manner and ensures that when a node fails, the node's pre-downtime state is restored and the node quickly re-accesses the system.
  • When physical resources are added to the instances, the physical resources added to each instance can be allocated directly to the new fault pull-up node, which is simple to operate; when the physical resources of the instances are reduced, the real-time data processing of the nodes corresponding to the reduced physical resources must be stopped, and the fault pull-up node replaces each node that stops processing real-time data, ensuring that data processing is not interrupted by the reduction of physical resources.
  • FIG. 6 is a schematic structural diagram of a real-time data fault-tolerant processing system according to an embodiment of the present invention.
  • the system 1000 includes:
  • The determining unit 11 is configured to determine, when a node in the system fails while processing real-time data of a service, the instance where the node is located according to the physical resource corresponding to the node.
  • The system deploys two or more instances for each service; each instance consists of one or more nodes having topological relationships, with the same number of nodes and the same topological relationships between nodes across instances.
  • Each instance is assigned corresponding physical resources, and the allocated physical resources can only be used by that instance; each node in each instance has a correspondence with the allocated physical resources. That is, physical resources are allocated to each instance, and the physical resources allocated to each instance are in turn assigned to the nodes in that instance.
  • each node in each instance has a peer node in other instances, that is, a node that has the same location and performs the same function in the topology relationship.
  • the system in this embodiment can be used to process real-time data.
  • When a node fails, the determining unit 11 first needs to determine the instance where the node is located: because of the failure, the faulty physical resource can be determined first; then, from the faulty physical resource and the correspondence between nodes and physical resources, the instance where the node is located can be determined.
  • The replacing unit 12 is configured to replace, in the determined instance, the failed node with the fault pull-up node.
  • After the instance of the failed node is determined, the replacing unit 12 replaces the failed node with a fault pull-up node within this instance; the replacement is performed only in this instance and does not affect the normal operation of the other instances.
  • The updating unit 13 is configured to update, in the connection information table, the failed node to the fault pull-up node.
  • A connection information table is set in the system; the connection information table includes the peer node information of the two or more instances the system deploys for each service, for example recording that node 1 in instance A and node 1' in instance B are peer nodes.
  • After the fault pull-up node replaces the failed node, the updating unit 13 needs to update the failed node to the fault pull-up node in the connection information table; for example, if node 1 in instance A fails and node 7 replaces node 1, then node 1 in instance A needs to be replaced with node 7 in the connection information table.
  • The sending unit 14 is configured to send, according to the peer node information in the connection information table, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data.
  • The peer node of the failed node is looked up. Because the peer node keeps operating normally while the node is failed, it holds cached data covering that period of time. The sending unit 14 can therefore send the peer node's cached data to the failed node's fault pull-up node, so that the fault pull-up node resumes the node's data processing from the point of failure according to the received cached data.
  • The physical resource allocation, the recording of peer node information, and the replacement of nodes for the two or more deployed instances are all managed by the system in a unified way; when a node in an instance fails, the cached data of the node's peer node can be obtained, the failed node's pre-downtime state can be restored, and the node can quickly re-access the system.
  • The real-time data fault-tolerant processing system provided by this embodiment manages real-time data processing in a unified manner and ensures that when a node fails, the node's pre-downtime state is restored and the node quickly re-accesses the system.
  • FIG. 7 is a schematic structural diagram of another real-time data fault-tolerant processing system according to an embodiment of the present invention.
  • the system 2000 includes:
  • the processing unit 21 is configured to control at least one node in each instance to process real-time data separately.
  • the system deploys two or more instances to the service.
  • The nodes in these instances process real-time data separately; that is, each instance is deployed in a distributed manner on its own independent physical resources, ensuring that the data streams between the nodes of each instance circulate only within the same instance.
  • The determining unit 22 is configured to determine, when a node in the system fails while processing real-time data of a service, the instance where the node is located according to the physical resource corresponding to the node.
  • The replacing unit 23 is configured to replace, in the determined instance, the failed node with the fault pull-up node.
  • The updating unit 24 is configured to update, in the connection information table, the failed node to the fault pull-up node.
  • The sending unit 25 is configured to send, according to the peer node information in the connection information table, the cached data of the failed node's peer node to the fault pull-up node, so that the fault pull-up node resumes the node's data processing according to the received cached data.
  • the allocating unit 26 is configured to allocate the added physical resources to at least two instances of the service when the system adds physical resources to the service.
  • the allocation unit 26 is further configured to allocate the physical resource added by each of the instances to the fault pull node.
  • the above is the process of expanding the physical resources.
  • The added physical resources are allocated to the two or more instances the system deploys for the service, generally evenly.
  • The added physical resources allocated to each instance can be assigned directly to the fault pull-up node: because the fault pull-up node is a new node with no physical resources allocated yet, directly allocating added physical resources to it simplifies the operation. Alternatively, it can first be determined which physical resources in each instance correspond to one or more nodes whose load is higher than a set value, and to which nodes the added physical resources should be allocated; the real-time data streams of the nodes whose load exceeds the set value are then migrated to the nodes on the added physical resources for continued processing, balancing the load between nodes and making rational use of the physical resources. Note that when physical resources are expanded, the operations being processed by the nodes do not need to be stopped.
  • the processing unit 21 is further configured to stop real-time data processing of the node corresponding to the reduced physical resource when it is required to reduce physical resources in the instance.
  • The replacing unit 23 is further configured to replace the node that stops processing the real-time data with the fault pull-up node.
  • The updating unit 24 is further configured to update, in the connection information table, the node that stops processing the real-time data to the fault pull-up node.
  • the sending unit 25 is further configured to send the cached data of the peer node of the node that stops processing the real-time data to the fault pull node according to the peer node information.
  • the above is the process of reducing the physical resources in the instance.
  • When the physical resources in the instance are reduced, some nodes may have their physical resources taken away; those nodes need to stop their current data processing, and each stopped node is replaced by the fault pull-up node. After the remaining physical resources in the instance are allocated to the fault pull-up node and the other nodes that were not stopped, the same operation as in the embodiment shown in FIG. 1 is used, causing the fault pull-up node to resume the data processing from the point where the stopped node stopped.
  • the remaining physical resources in the instance are reassigned to at least one node in the instance that is undergoing real-time data processing.
  • The real-time data fault-tolerant processing system provided by this embodiment enables the system to manage real-time data processing in a unified manner and ensures that when a node fails, the node's pre-downtime state is restored and the node quickly re-accesses the system.
  • When physical resources are added to the instances, the physical resources added to each instance can be allocated directly to the new fault pull-up node, which is simple to operate; when the physical resources of the instances are reduced, the real-time data processing of the nodes corresponding to the reduced physical resources must be stopped, and the fault pull-up node replaces each node that stops processing real-time data, ensuring that data processing is not interrupted by the reduction of physical resources.
  • Computer readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one location to another.
  • a storage medium may be any available media that can be accessed by a computer.
  • The computer readable medium may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM) and compact disc read-only memory (CD-ROM).
  • Any connection may suitably be a computer readable medium.
  • For example, if the software is transmitted from a website, server or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL) or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL or wireless technologies such as infrared, radio and microwave are included in the definition of the medium.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to a real-time data fault-tolerant processing method and system. At least two instances are deployed for a service in a system, corresponding physical resources are allocated to each instance, and each node in each instance has a peer node in the other instances. The method includes: when a fault occurs while a node in the system is processing real-time data of the service, determining the instance in which the node is located according to the physical resource corresponding to the node (S101); replacing the failed node with a fault pull-up node in the determined instance (S102); updating the failed node to the fault pull-up node in a connection information table (S103); and sending, according to peer node information in the connection information table, cached data of a peer node of the failed node to the fault pull-up node, so that the fault pull-up node resumes the data processing of the node according to the received cached data (S104). The method can manage real-time data processing in a unified manner, can restore the state a node had before going down when a fault occurs in the node, and enables fast re-access to the system.
PCT/CN2016/099585 2015-12-11 2016-09-21 Method and system for real-time data fault-tolerant processing WO2017097006A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510923282.7A CN106874142B (zh) 2015-12-11 2015-12-11 Real-time data fault-tolerant processing method and system
CN201510923282.7 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017097006A1 (fr)

Family

ID=59013692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/099585 WO2017097006A1 (fr) Method and system for real-time data fault-tolerant processing

Country Status (2)

Country Link
CN (1) CN106874142B (fr)
WO (1) WO2017097006A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210240163A1 (en) * 2018-06-20 2021-08-05 Qkm Technology (Dong Guan) Co., Ltd Distributed multi-node control system and method, and control node
CN113312210A (zh) * 2021-05-28 2021-08-27 北京航空航天大学 Lightweight fault-tolerance method for a stream processing system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111193759B (zh) * 2018-11-15 2023-08-01 中国电信股份有限公司 Distributed computing system, method and device
CN110351122B (zh) * 2019-06-17 2022-02-25 腾讯科技(深圳)有限公司 Disaster recovery method, apparatus and system, and electronic device
CN111124696B (zh) * 2019-12-30 2023-06-23 北京三快在线科技有限公司 Unit group creation and data synchronization method, apparatus, unit and storage medium
CN111930515B (zh) * 2020-09-16 2021-09-10 北京达佳互联信息技术有限公司 Data acquisition and distribution method, apparatus, server and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006040264A1 * 2004-10-12 2006-04-20 International Business Machines Corporation Apparatus, system and method for facilitating memory management
CN101444122A (zh) * 2006-03-31 2009-05-27 思达伦特网络公司 System and method for active geographic redundancy
CN103701906A (zh) * 2013-12-27 2014-04-02 北京奇虎科技有限公司 Distributed real-time computing system and data processing method thereof
CN104283950A (zh) * 2014-09-29 2015-01-14 杭州华为数字技术有限公司 Service request processing method, apparatus and system
CN104836850A (zh) * 2015-04-16 2015-08-12 华为技术有限公司 Instance node management method and management device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716182B (zh) * 2013-12-12 2016-08-31 中国科学院信息工程研究所 Fault detection and fault-tolerance method and system for real-time cloud platforms
CN103699599B (zh) * 2013-12-13 2016-10-05 华中科技大学 Reliable message processing guarantee method based on the Storm real-time stream computing framework
CN104683445A (zh) * 2015-01-26 2015-06-03 北京邮电大学 Distributed real-time data fusion system
CN104794015B (zh) * 2015-04-16 2017-08-18 华中科技大学 Flow-rate-aware elastic execution fault-tolerant system for real-time stream computing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006040264A1 * 2004-10-12 2006-04-20 International Business Machines Corporation Apparatus, system and method for facilitating memory management
CN101444122A (zh) * 2006-03-31 2009-05-27 思达伦特网络公司 System and method for active geographic redundancy
CN103701906A (zh) * 2013-12-27 2014-04-02 北京奇虎科技有限公司 Distributed real-time computing system and data processing method thereof
CN104283950A (zh) * 2014-09-29 2015-01-14 杭州华为数字技术有限公司 Service request processing method, apparatus and system
CN104836850A (zh) * 2015-04-16 2015-08-12 华为技术有限公司 Instance node management method and management device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210240163A1 (en) * 2018-06-20 2021-08-05 Qkm Technology (Dong Guan) Co., Ltd Distributed multi-node control system and method, and control node
US11681271B2 (en) * 2018-06-20 2023-06-20 Qkm Technology (Dong Guan) Co., Ltd Distributed multi-node control system and method, and control node
CN113312210A (zh) 2021-05-28 2021-08-27 Lightweight fault-tolerance method for a stream processing system
CN113312210B (zh) 2021-05-28 2022-07-29 Lightweight fault-tolerance method for a stream processing system

Also Published As

Publication number Publication date
CN106874142B (zh) 2020-08-07
CN106874142A (zh) 2017-06-20

Similar Documents

Publication Publication Date Title
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
WO2017097006A1 (fr) Procédé et système de traitement de tolérance aux anomalies de données en temps réel
US11163653B2 (en) Storage cluster failure detection
US10177994B2 (en) Fault tolerant federation of computing clusters
US20100023564A1 (en) Synchronous replication for fault tolerance
US20140376362A1 (en) Dynamic client fail-over during a rolling patch installation based on temporal server conditions
US20180091586A1 (en) Self-healing a message brokering cluster
CN110362381A HDFS cluster high-availability deployment method, system, device and storage medium
US9378078B2 (en) Controlling method, information processing apparatus, storage medium, and method of detecting failure
CN105069152B Data processing method and apparatus
CN111935244B Service request processing system and hyper-converged all-in-one machine
CN111147274B System and method for creating highly available quorum sets for cluster solutions
CN111045602A Cluster system control method and cluster system
US8621260B1 (en) Site-level sub-cluster dependencies
US10977276B2 (en) Balanced partition placement in distributed databases
CN115225642B Elastic load balancing method and system for hyper-converged systems
CN115378800A Serverless distributed fault-tolerant system, method, apparatus, device and medium
CN111400098B Replica management method, apparatus, electronic device and storage medium
CN112035250A High-availability LAN service management method, device and deployment architecture
JP5353378B2 HA cluster system and clustering method therefor
CN104298553B Virtual machine migration method, VRMS and system
US20230334026A1 (en) Asynchronous metadata replication and migration between compute sites
US20240069778A1 (en) Integrating mirrored storage to remote replication site
CN115878269A Cluster migration method, related apparatus and storage medium
CN116112512A Distributed storage system based on fault domains

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872201

Country of ref document: EP

Kind code of ref document: A1