WO2018019023A1 - Data disaster recovery method, apparatus and system - Google Patents

Data disaster recovery method, apparatus and system

Info

Publication number
WO2018019023A1
WO2018019023A1 (PCT/CN2017/086105, CN2017086105W)
Authority
WO
WIPO (PCT)
Prior art keywords
node
logical unit
data
standby
primary
Prior art date
Application number
PCT/CN2017/086105
Other languages
English (en)
French (fr)
Inventor
张文
孙勇福
祝百万
李�瑞
郑寒
郝志刚
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP17833328.2A priority Critical patent/EP3493471B1/en
Publication of WO2018019023A1 publication Critical patent/WO2018019023A1/zh
Priority to US16/203,376 priority patent/US10713135B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2082Data synchronisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0784Routing of error reports, e.g. with a specific transmission path or data flow
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1471Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1658Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2028Failover techniques eliminating a faulty processor or activating a spare
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2064Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring while ensuring consistency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2069Management of state, configuration or failover
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/069Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80Database-specific techniques

Definitions

  • The present application relates to the field of communications technologies, and in particular, to a data disaster recovery method, apparatus, and system.
  • Background of the invention
  • Data disaster recovery refers to establishing a redundant data system so that user data remains secure when the system fails, and application services can even continue uninterrupted.
  • In a traditional data disaster recovery solution, there are at least two devices in the system: one serves as the primary device (Master) and the other as the standby device (Slave). The primary device provides services externally, while the standby device serves as a backup of the primary device and takes its place when the primary device fails.
  • Although this active/standby replication solution can achieve disaster recovery to a certain extent, most of its operations only support manual switching, so when a fault occurs the service cannot be switched to the standby device in time.
  • For this reason, the prior art has proposed a disaster recovery solution based on mutual primary/standby replication, in which the two devices act as primary and standby for each other and any data change on one device is synchronized to the other, so that both devices can provide services externally and mirror each other; when one of the devices fails, the service can be switched directly to the other device without intervention by operation and maintenance personnel.
  • Summary of the invention
  • The embodiment of the present application provides a data disaster recovery method, device, and system, which can not only ensure the consistency of data before and after the active/standby switchover, but also remove the need to strictly distinguish between devices at the service layer; the implementation is relatively simple, and the system availability is high.
  • The embodiment of the present application provides a data disaster recovery method, including: monitoring each node in a logical unit, where the nodes include a primary node and multiple standby nodes; when an abnormality of the primary node is detected, obtaining log information of the multiple standby nodes, where the log information of a standby node includes a time point at which the standby node synchronizes data with the primary node; selecting, from the multiple standby nodes, the standby node whose time point is closest to the current time as a target node; and updating the primary node to the target node.
  • The embodiment of the present application further provides a data disaster recovery device, including a processor and a memory, where the memory stores instructions executable by the processor, and when executing the instructions the processor is configured to: monitor each node in a logical unit, where the nodes include a primary node and multiple standby nodes; when an abnormality of the primary node is detected, obtain log information of the multiple standby nodes, where the log information of a standby node includes a time point at which the standby node synchronizes data with the primary node; select, from the multiple standby nodes, the standby node whose time point is closest to the current time as a target node; and update the primary node to the target node.
  • the embodiment of the present application further provides a data disaster tolerance system, which may include any data disaster tolerance device provided by the embodiment of the present application.
  • FIG. 1a is a schematic diagram of a data disaster recovery system according to an embodiment of the present disclosure
  • FIG. 1b is a schematic diagram of a data disaster recovery method provided by an embodiment of the present application
  • FIG. 2a is a schematic diagram of an architecture of a data disaster recovery device according to an embodiment of the present application
  • FIG. 2b is a schematic diagram of a scenario when a temporary node is registered in a data disaster recovery method according to an embodiment of the present application
  • FIG. 2c is a flowchart of a data disaster recovery method according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a logical unit backup provided by an embodiment of the present application
  • FIG. 4a is a schematic structural diagram of a data disaster recovery device according to an embodiment of the present disclosure
  • FIG. 4b is another schematic structural diagram of a data disaster recovery device according to an embodiment of the present disclosure
  • the embodiment of the present application provides a data disaster tolerance method, apparatus, and system.
  • the data disaster recovery system may include any data disaster recovery device provided by the embodiment of the present application.
  • the number of the data disaster recovery device may be determined according to actual application requirements.
  • The disaster recovery system may also include other devices, such as an access gateway (GW, Gate Way) and one or more user equipments; when a user newly establishes a connection, the access gateway may receive a connection establishment request sent by the user equipment.
  • The user equipment subsequently sends data processing requests to the access gateway, and the access gateway forwards each data processing request to the corresponding data disaster recovery device based on the connection relationship; the data disaster recovery device then performs data processing according to the data processing request, such as reading, writing, deleting or changing data, and so on.
  • In an embodiment, the data disaster recovery device may be implemented as one entity, or may be implemented by multiple entities; that is, the data disaster recovery device may include multiple network devices, for example, as shown in FIG. 2a.
  • the data disaster recovery device may include a scheduler and a logic unit, where the logic unit may include a primary node and multiple standby nodes, and the nodes may be deployed in the same equipment room or in different equipment rooms, and may be located in the same room. Areas can also be located in different areas.
  • The master node is mainly used to perform data processing according to data processing requests, for example, reading or writing data; a standby node backs up the data in the master node by means of data synchronization, and can provide a "read" function.
  • The scheduler is mainly used to monitor the state of the nodes in the logical unit and to control switching between the primary and standby nodes. For example, each node in the logical unit can be monitored; if an abnormality is detected on the primary node, the log information of the multiple standby nodes is obtained, where the log information of a standby node includes a time point at which the standby node synchronizes data with the primary node; the standby node whose time point is closest to the current time is then selected as the target node, and the primary node is updated to this target node, that is, the target node is taken as the new primary node, while the original primary node can be demoted to a standby node, and so on.
  • In an embodiment, the data disaster recovery device may further include a proxy server, configured to receive the data processing request sent by the access gateway, send the data processing request to the corresponding node of the logical unit according to preset routing information, interact with the scheduler to obtain the corresponding routing information, and so on.
  • The data disaster recovery device may be implemented as a single entity, for example integrated in one server, or may be implemented by multiple entities; besides the scheduler and the logical unit, it may also include other devices such as a proxy server.
  • A data disaster recovery method includes: monitoring each node in a logical unit; when an abnormality of the primary node is detected, acquiring log information of the multiple standby nodes, where the log information of a standby node includes a time point at which the standby node synchronizes data with the primary node; selecting the standby node whose time point is closest to the current time as the target node; and updating the primary node to the target node.
  • Each node in the logical unit may include a primary node and multiple standby nodes.
  • The master node is mainly used for data processing according to data processing requests, for example, reading or writing data; the standby nodes back up the data in the master node by means of data synchronization, and a standby node can provide the function of "reading".
  • These nodes can be deployed in the same equipment room or in different equipment rooms, and they can be located in the same area or in different areas.
  • In addition, data between the nodes can be transmitted through a dedicated line.
  • There are various ways to monitor each node (including the primary node and the standby nodes) in the logical unit. For example, the running of each node's database instance, such as a mysql (a relational database) instance, may be monitored, the transactions performed by each node may be monitored, and/or the hardware status of each node in the logical unit and the running status of the core program may be monitored, and so on. For example, the specifics can be as follows:
  • The working state of the synchronization (transaction) thread from the master node is monitored separately on each standby node in the logical unit; if the working state indicates that the thread is currently in a non-working state, for example the working state is displayed as not running, it is determined that an exception has occurred on the corresponding standby node.
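  • The patent does not name a concrete mechanism for reading that working state; below is a minimal sketch under the assumption that the standby nodes are MySQL instances reachable with the pymysql driver, using SHOW SLAVE STATUS and its Slave_IO_Running / Slave_SQL_Running fields as the "working state" of the synchronization thread.

```python
# Minimal sketch (assumptions: MySQL standbys, pymysql driver).
# The working state of the standby's synchronization thread is read from
# SHOW SLAVE STATUS; either thread not running maps to the "non-working
# state" described above.
import pymysql

def standby_thread_abnormal(host, port, user, password):
    conn = pymysql.connect(host=host, port=port, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
        if status is None:                 # not configured as a standby at all
            return True
        return (status["Slave_IO_Running"] != "Yes"
                or status["Slave_SQL_Running"] != "Yes")
    finally:
        conn.close()
```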
  • In an embodiment, before the step of "monitoring the nodes in the logical unit", the data disaster recovery method may further include: obtaining a service request sent by the access gateway, selecting multiple nodes according to the service request, creating a logical unit from the selected nodes, and determining the primary/standby relationship of each node in the logical unit.
  • In addition, while the data in the primary node is synchronized to each standby node, the data can also be backed up to other storage devices, such as a distributed file system (HDFS, Hadoop Distributed File System).
  • For example, the corresponding image (mirror) file of the logical unit is saved, so that when a node in the logical unit fails, the data in the node can be restored to a specified time point at a very fast rate (for example, in seconds) based on the image file of the current day and the log information at the specified time point.
  • In an embodiment, a corresponding node may be selected from the standby nodes as a cold standby node to perform the data backup operation; that is, after the step of "determining the active/standby relationship of each node in the logical unit", the data disaster recovery method may further include:
  • Selecting, according to a preset policy, a standby node from the multiple standby nodes as a cold standby node, where the cold standby node backs up the data to the distributed file system in a pipelined (streaming) manner.
  • The so-called cold standby node refers to a node that backs up data to another location based on a cold standby mechanism.
  • When backing up data to the distributed file system, the cold standby node can avoid the peak operating period of the logical unit; for example, it can back up once a day, and a random time of the day can be selected for the backup, and so on.
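  • As an illustration of the pipelined backup described above, the sketch below streams a physical backup straight into HDFS without an intermediate local file; xtrabackup and HDFS are named in this document, but the exact flags, paths, and scheduling are assumptions.

```python
# Illustrative sketch: the cold standby node streams an xtrabackup backup
# into the distributed file system (HDFS) in a pipelined manner.
# Paths and flags are assumptions, not taken from the patent.
import datetime
import subprocess

def stream_backup_to_hdfs(datadir="/var/lib/mysql", hdfs_dir="/backup/mysql"):
    day = datetime.date.today().isoformat()
    backup = subprocess.Popen(
        ["xtrabackup", "--backup", "--stream=xbstream", "--datadir", datadir],
        stdout=subprocess.PIPE)
    upload = subprocess.Popen(
        ["hdfs", "dfs", "-put", "-", f"{hdfs_dir}/{day}.xbstream"],
        stdin=backup.stdout)
    backup.stdout.close()      # let xtrabackup receive SIGPIPE if the upload dies
    upload.communicate()
    return upload.returncode == 0
```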
  • Then, when an abnormality of the primary node is detected, the log information of the multiple standby nodes is obtained respectively.
  • the log information of the standby node may include a time point at which the standby node synchronizes data with the primary node.
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • The log information of the standby node can be expressed in various forms, for example, as a binary log (binlog).
  • In an embodiment, the operation of "acquiring the log information of the multiple standby nodes" may also be triggered in other ways. For example, the primary node can be demoted to a standby node, so that no primary node exists in the logical unit and the logical unit generates status information indicating that the primary node does not exist; if the scheduler subsequently detects this status information, it may stop the input/output (I/O, Input/Output) interface for the synchronization log of each standby node and trigger execution of the operation of "acquiring the log information of the multiple standby nodes".
  • In an embodiment, when an abnormality of the primary node is detected, the data in the primary node can also be repaired; that is, the data disaster recovery method may further include: obtaining the log information of the primary node, and repairing the data in the primary node according to the log information of the primary node. For example, the details can be as follows:
  • The log information of the primary node may include a time point at which the primary node updates data (that is, a time point at which the primary node updates its own data).
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • the log information of the master node can be expressed in various forms, for example, a binary log (binlog) or the like.
  • In general, after the user equipment sends a data processing request, such as a structured query language (sql, Structured Query Language) statement, the request can only be returned normally after the log information (for example, the binlog) corresponding to that data processing request on the primary node has been synchronized to a standby node; otherwise a timeout error is returned to the user device. Therefore, when the primary node suddenly becomes abnormal, some log information may not yet have been sent to the other standby nodes; if the primary node then performs data recovery purely according to its own log information, it will end up with more data than the standby nodes (that is, at this time the primary node has more data than the standby nodes).
  • Therefore, before repairing the data, the data disaster recovery method may further include: determining whether the data in the primary node has all been synchronized to the standby nodes, and if not, performing flashback processing on the log information corresponding to the data that has not been synchronized. In addition, before performing the flashback processing, it may also be determined whether the log information corresponding to the unsynchronized data can be flashed back; if yes, the flashback processing is performed on that log information; otherwise, if it cannot be flashed back, the mirror data is fully pulled from the target node to reconstruct the data in the primary node.
  • That is, when the log information cannot be flashed back, the mirror data can be pulled from the target node to reconstruct the data. The data disaster recovery device can automatically pull the mirror data in full from the target node, or the operation and maintenance personnel can intervene, that is, manually pull the mirror data in full from the target node to reconstruct the data in the primary node.
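  • The flashback idea above can be illustrated with a simplified sketch: for every row change that was written on the old primary but never synchronized, an inverse operation is generated (INSERT becomes DELETE, DELETE becomes INSERT, UPDATE swaps its before/after images) and applied in reverse order. Parsing the binlog row events is omitted, and the event dictionary format is an assumption for illustration only.

```python
# Simplified illustration of flashback: build the inverse of each
# unsynchronized row change so the old primary can be rolled back to the
# state the new primary has.
def invert_event(ev):
    if ev["type"] == "INSERT":
        return {"type": "DELETE", "table": ev["table"], "row": ev["row"]}
    if ev["type"] == "DELETE":
        return {"type": "INSERT", "table": ev["table"], "row": ev["row"]}
    if ev["type"] == "UPDATE":                       # swap before/after images
        return {"type": "UPDATE", "table": ev["table"],
                "row": ev["after"], "after": ev["row"]}
    # e.g. DDL cannot be inverted: the node must instead be rebuilt from the
    # new primary's mirror data, as described above.
    raise ValueError(f"cannot flash back event type {ev['type']}")

def flashback(unsynced_events):
    return [invert_event(ev) for ev in reversed(unsynced_events)]
```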
  • From the multiple standby nodes, the standby node whose time point (that is, the time point at which it last synchronized data with the primary node) is closest to the current time is selected as the target node.
  • For example, suppose the current time is 12:10:01 on July 1, the time point at which standby node A synchronized data with the primary node is 12:10:00 on July 1, the time point at which standby node B synchronized data with the primary node is 12:09:59 on July 1, and the time point at which standby node C synchronized data with the primary node is 12:10:01 on July 1; then standby node C is selected as the target node, thereby ensuring consistency between the data in the target node and the data of the primary node.
  • The primary node is then updated to the target node, that is, the target node is used as the new primary node, thereby switching the service on the original primary node to the target node.
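  • A minimal sketch of this selection rule follows, reusing the July 1 example above (the year and the data structure are illustrative assumptions): the standby whose last synchronization time point is the most recent, that is, closest to the current time, becomes the target node.

```python
# Minimal sketch of target-node selection: choose the standby whose last
# synchronization time point with the primary is the most recent.
from datetime import datetime

def pick_target(standbys):
    # standbys: mapping of node name -> last synchronization time point
    return max(standbys, key=standbys.get)

standbys = {
    "A": datetime(2016, 7, 1, 12, 10, 0),   # year is arbitrary for the example
    "B": datetime(2016, 7, 1, 12, 9, 59),
    "C": datetime(2016, 7, 1, 12, 10, 1),
}
print(pick_target(standbys))                # -> "C", as in the example above
```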
  • In an embodiment, in order to further improve data security, the data in the logical unit may also be backed up in a different place; for example, if the logical unit is located at location A, it may be backed up at location B, so that when an abnormality occurs in the logical unit at location A, for example when a disaster occurs at location A, the service of the logical unit at location A can be switched to the backup logical unit at location B. That is, the data disaster recovery method may further include:
  • The logical unit is backed up off-site to obtain a standby logical unit, and when the logical unit fails, the service of the logical unit is switched to the standby logical unit.
  • The standby logical unit has a plurality of nodes corresponding to the nodes of the logical unit.
  • In an embodiment, the data backed up at location B can also be used to reconstruct the data of the logical unit at location A; that is, after the step of "switching the service of the logical unit to the standby logical unit", the data disaster recovery method may further include:
  • The data of each node in the standby logical unit is synchronized to the corresponding node in the logical unit; when the delay (that is, the delay between the data of each node in the logical unit and in the standby logical unit) is determined to be less than a preset value, the standby logical unit is set to read-only, and when the delay is 0, the service is switched back to the logical unit.
  • As can be seen from the above, in this embodiment, each node in the logical unit is monitored; when an abnormality of the primary node is detected, the log information of the multiple standby nodes is obtained respectively, where the log information of a standby node includes the time point at which the standby node synchronizes data with the primary node; the standby node whose time point is closest to the current time is then selected as the target node, and the primary node is updated to the target node, thereby implementing switching between the primary and the standby. Since the logical unit in this scheme may include multiple standby nodes and, when an abnormality occurs on the primary node, the standby node with the latest data is selected as the new primary node, the consistency of data between the original primary node and the new primary node before and after the switchover can be ensured. Moreover, since the logical unit has only one primary node, the large-scale primary key conflict problem faced in the prior art does not occur, and there is no need to distinguish between individual devices at the service layer, so the implementation is simpler and the system availability can be greatly improved.
  • In this embodiment, description is given by taking an example in which the data disaster recovery system includes a user equipment, an access gateway, and a plurality of data disaster recovery devices, and each data disaster recovery device includes a proxy server, a scheduler, and a logical unit (set).
  • (1) User equipment: the user equipment is configured to establish a connection relationship with a data disaster recovery device through the access gateway, and to send data processing requests to the corresponding data disaster recovery device through the access gateway to obtain the corresponding service.
  • (2) Access gateway:
  • the access gateway can obtain the load information of the data disaster recovery device, select a matching data disaster recovery device according to the load information, and establish a connection relationship between the user equipment and the matched data disaster recovery device.
  • the access gateway may also save related information of the connection relationship, such as saving session information of the connection.
  • the data processing request can be forwarded to the corresponding data disaster recovery device according to the connected session information, thereby achieving the purpose of load balancing.
  • In an embodiment, a service list may be set in the access gateway to store information about the data disaster recovery devices that can currently provide services, so that when a data disaster recovery device needs to be selected, the required device can be selected according to the information in the service list. At the same time, in order to keep the information in the service list accurate and up to date, the access gateway can also periodically detect the status of each data disaster recovery device and update the information in the service list based on that status; for example, a faulty data disaster recovery device can be deleted from the service list in time, and when it is detected that the device has returned to normal, it can be added back to the service list, and so on.
  • Because the access gateway saves the session information of existing connections, even if the data disaster recovery devices are expanded, for example by adding more data disaster recovery devices, existing user connections will not be affected; therefore, the solution can be easily scaled out.
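  • The gateway behaviour described above (service list, load-based selection, saved session information) can be sketched roughly as follows; the data structures and the "active session count" load metric are assumptions, not details given in the patent.

```python
# Schematic sketch of the access gateway: keep a service list of healthy
# devices, pick the least-loaded one for a new connection, and remember the
# session so later requests are forwarded to the same device.
class AccessGateway:
    def __init__(self):
        self.service_list = {}    # device_id -> load (e.g. active sessions)
        self.sessions = {}        # session_id -> device_id

    def update_service_list(self, device_id, load, healthy=True):
        if healthy:
            self.service_list[device_id] = load
        else:                     # drop faulty devices from the list in time
            self.service_list.pop(device_id, None)

    def connect(self, session_id):
        device = min(self.service_list, key=self.service_list.get)
        self.sessions[session_id] = device
        self.service_list[device] += 1
        return device

    def route(self, session_id, request):
        # forward based on the saved session information of the connection
        return self.sessions[session_id], request
```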
  • (3) Data disaster recovery device: the data disaster recovery device may include a proxy server, a scheduler, and a logical unit, as follows:
  • A. Proxy server
  • The proxy server is configured to obtain and save the routing information from the scheduler, and to forward the data processing request (such as an sql statement) sent by the user equipment to the logical unit according to the routing information.
  • In an embodiment, the proxy server may also authenticate the received data processing request, for example according to information carried in the data processing request such as the identifier of the user equipment, for example an Internet Protocol (IP) address, a username, and/or a user password, and so on.
  • B. Scheduler: the scheduler serves as the scheduling management center of the data disaster recovery device, and can perform operations such as creating and deleting logical units and selecting and replacing nodes in a logical unit.
  • the scheduler can also monitor each node in the logical unit. When an abnormality is detected on the primary node, the active/standby switchover process is initiated. For example, the log information of the multiple standby nodes can be separately obtained, and according to This log information selects the new primary node, and so on.
  • In an embodiment, the scheduler can include multiple Schedulers, and these Schedulers cooperate to complete the main operations of the scheduler, such as creating and deleting logical units, selecting and replacing nodes in a logical unit, monitoring the status of each node, and initiating the active/standby switchover process when it is determined that an abnormality has occurred on the primary node, and so on.
  • The interaction between the Scheduler and other parts can be done through an open-source distributed application coordination service such as ZooKeeper; for example, ZooKeeper can receive the status information reported by each node and provide that status information to the Scheduler, or the Scheduler can monitor ZooKeeper and obtain the status information of the corresponding node from it, and so on.
  • Because one ZooKeeper may serve multiple logical units, in order to facilitate management of each logical unit, as shown in FIG. 2b, when a logical unit is established, a temporary node corresponding to that logical unit may be registered in ZooKeeper, and the temporary node handles the affairs of that logical unit, such as receiving the status information of each node in the logical unit, and the like.
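  • A sketch of registering such a temporary node is given below, using the kazoo ZooKeeper client (kazoo and the znode path layout are assumptions; the patent only names ZooKeeper). The key point is ephemeral=True: the node disappears automatically when the reporting process or its session dies, which is what lets the Scheduler detect a failure.

```python
# Sketch: register the per-logical-unit temporary (ephemeral) node in
# ZooKeeper and refresh its status data. Library and paths are assumptions.
import json
from kazoo.client import KazooClient

def register_logical_unit(zk_hosts, unit_id, status):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.ensure_path("/disaster_recovery/units")
    zk.create(f"/disaster_recovery/units/{unit_id}",
              json.dumps(status).encode(), ephemeral=True)
    return zk   # keep the session alive; losing it deletes the temporary node

def report_status(zk, unit_id, status):
    # the proxy module overwrites the temporary node with fresh status info
    zk.set(f"/disaster_recovery/units/{unit_id}", json.dumps(status).encode())
```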
  • C. Logical unit: the logical unit may include a primary node and a plurality of standby nodes.
  • The primary node is mainly used for data processing according to data processing requests, for example, reading or writing data; a standby node backs up the data in the primary node by means of data synchronization, and can provide the function of "reading".
  • These nodes can be deployed in the same equipment room or in different equipment rooms. They can be located in the same area or in different areas.
  • a proxy module can also be set for each node, which can be independent of each node or integrated in each node, for example, see Fig. 2a.
  • The proxy module can periodically access the database instance (DB) of its node through a short connection to detect whether the instance is readable and writable. If it is readable and writable, the node is normal; if it is not readable and/or not writable, the node is abnormal. In that case, the proxy module can generate corresponding status information indicating the abnormality of the node and report that status information to the scheduler, for example to the corresponding temporary node in ZooKeeper, so that the scheduler can detect that an abnormality has occurred on the node and then perform the active/standby switchover.
  • In an embodiment, the period at which the proxy module accesses the database instance of the node to which it belongs may be determined according to the requirements of the actual application; for example, it may be set to check every 3 seconds, and the like.
  • In addition, the proxy module can also collect statistics such as the delay time of transactions to be executed on the node and the number of delayed transactions, and periodically report status information carrying these statistical results to the scheduler, for example to the corresponding temporary node in ZooKeeper, so that the scheduler can determine whether to initiate the active/standby switchover process; for example, if a statistical result exceeds a preset threshold, the active/standby switchover is performed, and so on.
  • The proxy module can also rebuild the active/standby relationship according to the new primary node, and a newly added database instance can also use a method such as xtrabackup (a data backup tool) to reconstruct its data from the primary node; therefore, the data reconstruction in this scheme can be automated without the intervention of a DBA (Database Administrator).
  • Because the temporary node corresponding to the logical unit is registered in ZooKeeper, if the logical unit has a hardware failure and/or a core program abnormality, such as a crash of the proxy module's core process, the corresponding temporary node in ZooKeeper is also deleted accordingly. Therefore, if the Scheduler detects that the temporary node has disappeared, it can determine that the primary node in the logical unit is abnormal, and in this case the active/standby switchover can be performed.
  • As shown in FIG. 2c, a data disaster recovery flow can be as follows:
  • 201. Each proxy module in the logical unit monitors the node to which it belongs, obtains status information, and sends the status information to the scheduler.
  • For example, taking the primary node A as an example, the proxy module A of the primary node A can monitor the primary node A as follows:
  • The proxy module A periodically accesses the database instance of the primary node A to detect whether the database instance is readable and writable; if it is readable and writable, it is determined that the primary node A is normal; if it is not readable and/or not writable, it is determined that an abnormality has occurred on the primary node A, corresponding status information indicating the abnormality of the node is generated, and that status information is sent to the scheduler, for example to the corresponding temporary node in ZooKeeper.
  • the proxy module A periodically acquires the hardware state information of the self (ie, the proxy module A) and the master node A, and the core program running state information, and sends the hardware state information and the core program running state information to the scheduler. For example, it can be sent to the corresponding temporary node in Zookeeper, and so on.
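  • A rough sketch of this agent-side check in step 201 is shown below: every few seconds a short connection is opened, readability and writability of the instance are verified, and the result is reported upward (for example to the temporary node in ZooKeeper). The heartbeat table and the 3-second period are assumptions consistent with the description above.

```python
# Sketch of the proxy-module check: open a short connection, verify the
# database instance is readable and writable, and report the result.
import time
import pymysql

def instance_ok(host, port, user, password):
    try:
        conn = pymysql.connect(host=host, port=port, user=user,
                               password=password, database="monitor",
                               connect_timeout=2)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")                        # readable?
            cur.execute("REPLACE INTO heartbeat(id, ts) "
                        "VALUES (1, NOW())")               # writable?
        conn.commit()
        conn.close()
        return True
    except pymysql.MySQLError:
        return False

def agent_loop(report, node, period=3):
    while True:
        state = "normal" if instance_ok(**node) else "abnormal"
        report({"node": node["host"], "state": state})     # e.g. via ZooKeeper
        time.sleep(period)
```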
  • 202. After receiving the status information, the scheduler determines, according to the status information, whether the corresponding primary node has an abnormality; if an abnormality has occurred, step 203 is performed, and if no abnormality has occurred, the flow may return to step 201.
  • For example, the status information may be received by the corresponding temporary node in ZooKeeper, and the status information is then provided to the Scheduler.
  • If the Scheduler determines, according to the status information, that the primary node has an abnormality, step 203 is performed. In addition, if the Scheduler detects that the temporary node corresponding to the logical unit has disappeared, it can likewise determine that the primary node is abnormal, and step 203 may also be performed at this time.
  • 203. The scheduler demotes the primary node to a new standby node and stops the I/O interface of each standby node's synchronization log (binlog), and then performs step 204.
  • For example, the primary node may be demoted to a new standby node by the Scheduler, and a thread stop instruction may be sent to each standby node through ZooKeeper, so that the standby node can stop the I/O interface of its synchronization log (binlog) according to the thread stop instruction, and so on.
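  • The "stop the I/O interface of the synchronization log" action in step 203 corresponds, for MySQL standbys, to stopping the replication I/O thread; a minimal sketch is shown below (the use of STOP SLAVE IO_THREAD as the concrete instruction is an assumption, since the patent only speaks of a thread stop instruction).

```python
# Sketch of step 203: stop binlog pulling on every standby before the new
# primary is elected, so that their relay logs stop growing.
import pymysql

def stop_standby_io(standbys, user, password):
    for host, port in standbys:
        conn = pymysql.connect(host=host, port=port, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("STOP SLAVE IO_THREAD")
        finally:
            conn.close()
```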
  • 204. The scheduler separately obtains the log information of the multiple standby nodes in the logical unit.
  • the scheduler may obtain log information of multiple spare nodes in the logical unit, such as a relay log.
  • the log information of the standby node may include a time point at which the standby node synchronizes data with the primary node.
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • 205. The scheduler selects, according to the log information and from the multiple standby nodes, the standby node whose time point (that is, the time point at which the standby node synchronizes data with the primary node) is closest to the current time as the target node.
  • For example, the target node may be selected by the Scheduler. Suppose the current time is 12:10:01 on July 1, the time point at which standby node A synchronized data with the primary node is 12:10:00 on July 1, the time point at which standby node B synchronized data with the primary node is 12:09:59 on July 1, and the time point at which standby node C synchronized data with the primary node is 12:10:01 on July 1; then the Scheduler can select standby node C as the target node.
  • 206. The scheduler updates the primary node to the target node, that is, the target node is used as the new primary node, so that the service on the original primary node is switched to the target node.
  • the master node may be updated by the scheduler to the target node.
  • For example, indication information indicating that the primary node is updated to the target node may be generated and then sent through ZooKeeper to the proxy module of the target node, so that the proxy module can rebuild the active/standby relationship based on the indication information, for example by determining that its own node (that is, the target node) is the new primary node and the other nodes are standby nodes, generating the corresponding routing information, and then reporting that routing information to the Scheduler through ZooKeeper, and so on.
  • In addition, in order to improve data security, while the data in the primary node is synchronized to each standby node, the data can also be backed up to other storage devices, such as a distributed file system.
  • For example, a standby node may be selected among the multiple standby nodes as a cold standby node to perform the backup of data to the distributed file system.
  • When the cold standby node performs the backup, it can stream the data to the distributed file system in a pipelined manner to improve transmission efficiency; further, a random time of the day can be selected for the backup, thereby avoiding the peak operating period of the logical unit.
  • In addition, when it is determined that the primary node is abnormal (that is, in step 202), besides performing the active/standby switchover, the data in the primary node may also be restored; that is, the data disaster recovery method may also perform step 207, as follows:
  • 207. The scheduler obtains the log information of the primary node, and then performs step 209.
  • the scheduler can obtain the log information of the primary node, such as a relay log.
  • the log information of the primary node may include a time point at which the primary node updates data (that is, a time point at which the primary node updates its own data).
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • In addition, before the repair, it may be determined whether the data in the primary node has all been synchronized to the standby nodes; if not, the log information corresponding to the data that has not been synchronized may be flashback processed. Before the flashback processing, it may also be determined whether the log information corresponding to the unsynchronized data can be flashed back; if yes, the step of performing flashback processing on the log information corresponding to the data that has not been synchronized is executed; otherwise, if it cannot be flashed back, the mirror data may be fully pulled from the target node at this time to reconstruct the data in the primary node.
  • Step 207 and steps 203-206 may be performed in no particular order.
  • 209. The scheduler repairs the data in the primary node according to the log information of the primary node.
  • As can be seen from the above, in this embodiment, each node in the logical unit is monitored; when an abnormality of the primary node is detected, the log information of the multiple standby nodes is obtained, where the log information of a standby node includes the time point at which the standby node synchronizes data with the primary node; the standby node closest to the current time is selected as the target node, and the primary node is updated to the target node, thereby implementing switching between the active and standby nodes. Since the logical unit in this scheme may include multiple standby nodes and, when an abnormality occurs on the primary node, the standby node with the latest data is selected as the new primary node, data consistency between the original primary node and the new primary node can be ensured before and after the switchover. Moreover, since the logical unit has only one primary node, the problem of large-scale primary key conflicts faced in the prior art does not occur, and there is no need to distinguish between individual devices at the service layer, so the implementation is simpler and the availability of the system can be greatly improved.
  • Further, in this embodiment, when an abnormality occurs on the primary node, the log information corresponding to the data that has not been synchronized to the standby nodes can be flashed back; therefore, data consistency among the nodes before and after the switchover can be further improved.
  • In addition, since this solution can select a standby node among the multiple standby nodes as a cold standby node and back up the data to the distributed file system through that cold standby node, the data backup efficiency can be greatly improved and the impact of the backup on the operation of each node can be reduced, which helps to improve the performance of the entire data disaster recovery system.
  • In an embodiment, in order to further improve data security, the data in the logical unit may also be backed up in a different place; for example, if the logical unit is located at location A, it may be backed up at location B, so that when an abnormality occurs in the logical unit at location A, for example when a disaster occurs at location A, the service of the logical unit at location A can be switched to the backup logical unit at location B, and so on.
  • the following is an example where the source logical unit is located in city A and the backup logical unit is located in city B.
  • the data in the logical unit of the A city can be backed up to the B city.
  • For example, the data of the primary node in the A-city logical unit can be asynchronously transmitted to the primary node in the B-city logical unit, and the primary node in the B-city logical unit then synchronizes the data to the other standby nodes in that unit (that is, in the B-city logical unit).
  • In this way, when a disaster occurs in city A, the service of the A-city logical unit can be switched to the B-city logical unit, and the data in the B-city logical unit can be used to repair the A-city logical unit; after the data of the A-city logical unit is recovered, the service can be switched back to the A-city logical unit. For example, referring to Figure 3b, the details can be as follows:
  • The data processing requests that the service originally sent to the A-city logical unit are forwarded to the B-city logical unit; that is, at this time the primary node in the B-city logical unit receives the data processing requests and performs data processing according to them.
  • Then, a data synchronization request may be sent to the B-city logical unit (that is, data synchronization is requested), where the request may carry information such as the global transaction identifier (GTID, Global Transaction ID) of the A-city logical unit and the log information of each node in the A-city logical unit. After receiving the data synchronization request, the B-city logical unit synchronizes the data of the B-city logical unit to the A-city logical unit based on these GTIDs and log information, together with its own GTID and the log information of each node in the B-city logical unit, and so on.
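  • The patent does not spell out how the missing transactions are computed from the GTIDs; one standard way, sketched below under the assumption of MySQL GTID replication, is to compare the two units' executed GTID sets with the built-in GTID_SUBTRACT function to see what the A-city unit still lacks.

```python
# Sketch: compare executed GTID sets of the two logical units to find the
# transactions city A is still missing; shipping them is not shown here.
import pymysql

def executed_gtid_set(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]

def missing_in_a(conn_a, conn_b):
    gtid_a = executed_gtid_set(conn_a)
    gtid_b = executed_gtid_set(conn_b)
    with conn_b.cursor() as cur:
        # transactions executed in the B-city unit but not yet in the A-city unit
        cur.execute("SELECT GTID_SUBTRACT(%s, %s)", (gtid_b, gtid_a))
        return cur.fetchone()[0]
```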
  • When it is determined that the delay between the data of the A-city logical unit and the B-city logical unit is less than a preset value, the B-city logical unit is set to read-only, and when the delay is equal to 0, the service is switched back to the A-city logical unit; that is, at this time the primary node in the A-city logical unit can again receive the data processing requests sent by the corresponding service and perform data processing according to them.
  • the preset value can be set according to the requirements of the actual application, and details are not described herein again. It can be seen that the present embodiment can not only realize the beneficial effects that the embodiment shown in FIG. 2c can achieve, but also support cross-city disaster tolerance and greatly improve data security.
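  • The failback logic above (freeze writes on the B-city unit once the delay drops below the preset value, then switch back when the delay reaches 0) can be sketched as a simple polling loop; replication_delay(), set_read_only() and switch_route_to() are placeholders for the real probes and actions, which the patent does not specify.

```python
# Sketch of the failback described above: make the B-city unit read-only
# when the delay is small enough, wait for it to drain to 0, then switch
# the service back to the A-city unit.
import time

def fail_back(replication_delay, set_read_only, switch_route_to,
              preset=1.0, poll=0.5):
    while replication_delay() >= preset:
        time.sleep(poll)
    set_read_only("B-city", True)        # stop new writes on the standby unit
    while replication_delay() > 0:
        time.sleep(poll)
    switch_route_to("A-city")            # delay == 0: switch the service back
```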
  • the embodiment of the present application further provides a data disaster recovery device.
  • The data disaster recovery device includes a monitoring unit 401, an obtaining unit 402, a selecting unit 403, and an updating unit 404, as follows:
  • (1) Monitoring unit 401
  • the monitoring unit 401 is configured to monitor each node in the logical unit.
  • Each node may include a primary node and multiple standby nodes.
  • the nodes may be deployed in the same equipment room, or may be deployed in different equipment rooms, and may be located in the same area, or may be located in different areas. For details, refer to the previous embodiments, and details are not described herein.
  • There are various ways to monitor each node in the logical unit. For example, the running of the database instance (such as a mysql instance) of each node may be monitored, the transactions performed by each node may be monitored, and/or the hardware status and core program running status of each node in the logical unit may be monitored, and so on. For example, the details can be as follows:
  • the monitoring unit 401 is specifically configured to monitor the running of the database instance of each node in the logical unit, for example, as follows:
  • the monitoring unit 401 is specifically configured to monitor transactions performed by each node in the logical unit, for example, as follows:
  • Or, the monitoring unit 401 may be specifically configured to monitor the hardware status and the core program running status of each node in the logical unit, for example, as follows:
  • the obtaining unit 402 is configured to obtain log information of the multiple standby nodes when an abnormality is detected on the primary node.
  • the log information of the standby node may include a time point at which the standby node synchronizes data with the primary node.
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • the log information of the standby node can be expressed in various forms, for example, a binary log (binlog).
  • the selecting unit 403 is configured to select, from the plurality of standby nodes, the standby node that is closest to the current time at the time point as the target node.
  • the updating unit 404 is configured to update the primary node to the target node.
  • the update unit 404 can use the target node as a new primary node, thereby switching the traffic in the original primary node to the target node.
  • In an embodiment, the primary node may be demoted to a standby node, so that no primary node exists in the logical unit and the logical unit generates status information indicating that the primary node does not exist; if the scheduler subsequently monitors this status information, the I/O interface of the synchronization log (binlog) of each standby node may be stopped, and the obtaining unit 402 is triggered to perform the operation of "acquiring the log information of the multiple standby nodes". That is, as shown in FIG. 4b, the data disaster recovery device may further include a processing unit 405, as follows:
  • the processing unit 405 is specifically configured to: when the monitoring unit 401 detects that an abnormality occurs in the primary node, demoting the primary node to a new standby node, and stopping the I/O interface of the standby log (binlog) of the standby node. And triggering the obtaining unit 402 to perform the operation of acquiring the log information of the multiple standby nodes respectively.
  • In an embodiment, the data disaster recovery device may further include a repairing unit 406, configured to: when the monitoring unit 401 detects that the primary node has an abnormality, obtain the log information of the primary node, and repair the data in the primary node according to the log information of the primary node.
  • the log information of the primary node may include a time point at which the primary node updates data (that is, a time point at which the primary node updates its own data).
  • In an embodiment, the log information may further include other information, such as the name of the executed transaction, the type and/or size of the data, and so on.
  • the log information of the master node can be expressed in various forms. For example, it can be a binary log (binlog).
  • In an embodiment, the repairing unit 406 may be further configured to determine whether the data in the primary node has been synchronized to the standby nodes; if yes, perform the process of acquiring the log information of the primary node; if not, perform flashback processing on the log information corresponding to the data that has not been synchronized.
  • In an embodiment, the repairing unit 406 may be further configured to determine whether the log information corresponding to the data that has not been synchronized can be flashed back; if yes, perform the flashback processing on the log information corresponding to the data that has not been synchronized; if not, pull the mirror data from the target node to reconstruct the data in the primary node.
  • In an embodiment, the logical unit may be preset, or may be established by the system according to a specific service request.
  • the data disaster recovery device may further include an establishing unit 407, as follows:
  • the establishing unit 407 may be configured to obtain a service request sent by the access gateway, select multiple nodes according to the service request, create a logical unit according to the selected node, and determine a primary/standby relationship of each node in the logical unit, so that the logical unit Includes the primary node and multiple standby nodes.
  • In an embodiment, while the data in the primary node is synchronized to each standby node, the data can also be backed up to other storage devices, such as a distributed file system; that is, as shown in FIG. 4b,
  • the data disaster recovery device may further include a backup unit 408, as follows:
  • the backup unit 408 may be configured to: after the establishing unit 407 determines the active/standby relationship of each node in the logical unit, select a standby node from the multiple standby nodes as a cold standby node according to a preset policy, and use the cold standby node to Pipelined way to back up data to a distributed file system.
  • In an embodiment, in order to further improve data security, the data in the logical unit may also be backed up in a different place; for example, if the logical unit is located at location A, it may be backed up at location B, so that when an abnormality occurs in the logical unit at location A, for example when a disaster occurs at location A, the service of the logical unit at location A can be switched to the backup logical unit at location B.
  • That is, the data disaster recovery device may further include a remote disaster recovery unit 409, as follows:
  • the remote disaster recovery unit 409 is configured to perform offsite backup on the logical unit to obtain an alternate logical unit, and switch the service of the logical unit to the standby logical unit when the logical unit fails.
  • the remote disaster tolerance unit 409 can also be configured to synchronize data of each node in the standby logical unit to a corresponding node in the logical unit; when determining that the delay between data of each node in the logical unit and the standby logical unit is less than When the preset value is set, the standby logical unit is set to read-only, and after determining that the delay is equal to 0, the service is switched back to the logical unit.
  • In specific implementations, each of the foregoing units may be implemented as a separate entity, or combined in any manner and implemented as one or several entities. For example, the data disaster recovery device may include a proxy server, a scheduler, a logical unit, and so on.
  • As can be seen from the above, the monitoring unit 401 in the data disaster recovery device of this embodiment can monitor each node in the logical unit; when an abnormality of the primary node is detected, the obtaining unit 402 obtains the log information of the multiple standby nodes, where the log information of a standby node includes the time point at which that standby node last synchronized data with the primary node; the selection unit 403 then selects the standby node whose time point is closest to the current time as the target node, and the update unit 404 updates the primary node to the target node, thereby implementing the switch between primary and standby. Because the logical unit of this scheme may include multiple standby nodes and, when the primary node becomes abnormal, the standby node with the most recent data is selected as the new primary node, data consistency between the original primary node and the new primary node can be ensured before and after the switch. Moreover, because the logical unit has only one primary node, the large-scale primary key conflict problem faced in the prior art does not arise, there is no need to distinguish between devices at the service layer, the implementation is simpler, and system availability can be greatly improved.
  • Correspondingly, an embodiment of the present application further provides a data disaster recovery system, which includes any data disaster recovery device provided by the embodiments of the present application, for example as follows:
  • The data disaster recovery device is configured to monitor each node in the logical unit; when an abnormality of the primary node is detected, to obtain the log information of the multiple standby nodes, where the log information of a standby node includes the time point at which that standby node synchronized data with the primary node; to select, from the multiple standby nodes, the standby node whose time point is closest to the current time as the target node; and to update the primary node to the target node, for example as in the sketch below.
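For illustration, a minimal sketch of the selection rule, using the same example timestamps as the description; the dictionary layout is an assumption:

```python
from datetime import datetime
from typing import Dict

def pick_new_primary(standby_sync_points: Dict[str, datetime]) -> str:
    """Choose the standby whose last data synchronization with the failed primary is the most recent."""
    return max(standby_sync_points, key=standby_sync_points.get)

standbys = {
    "standby-A": datetime(2016, 7, 1, 12, 10, 0),
    "standby-B": datetime(2016, 7, 1, 12, 9, 59),
    "standby-C": datetime(2016, 7, 1, 12, 10, 1),
}
print(pick_new_primary(standbys))   # standby-C holds the freshest data and becomes the target node
```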
  • For example, the data disaster recovery device may specifically monitor the running of the database instance of each node in the logical unit, the transactions executed by each node in the logical unit, and/or the hardware state and core program running state of each node in the logical unit, and so on. One way to probe a database instance is sketched below.
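A minimal probe sketch under the assumption that `try_read` and `try_write` wrap whatever read/write statements the agent issues against the database instance; any exception is treated as "unreadable" or "unwritable":

```python
import time

def is_ok(check) -> bool:
    """Run a read or write probe and map any exception to 'abnormal'."""
    try:
        check()
        return True
    except Exception:
        return False

def probe_node(try_read, try_write, report_abnormal, period_seconds=3.0, rounds=None):
    """Periodically verify that the node's database instance is readable and writable."""
    n = 0
    while rounds is None or n < rounds:
        readable, writable = is_ok(try_read), is_ok(try_write)
        if not (readable and writable):
            report_abnormal({"readable": readable, "writable": writable})
        time.sleep(period_seconds)
        n += 1
```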
  • There may be multiple data disaster recovery devices; the specific quantity may be determined according to the requirements of the actual application.
  • The data disaster recovery system may also include other devices, for example an access gateway, as follows:
  • The access gateway is configured to receive a connection establishment request sent by a user equipment, obtain load information of the multiple data disaster recovery devices according to the connection establishment request, select a matching data disaster recovery device according to the load information, and establish a connection relationship between the user equipment and the matched data disaster recovery device; and to receive a data processing request sent by the user equipment and send the data processing request to the corresponding data disaster recovery device based on that connection relationship, for example as sketched below.
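A toy sketch of that behavior (device names, load values, and the session table are invented for the example):

```python
class AccessGateway:
    """Route a new connection to the least-loaded disaster recovery device, then pin the session to it."""

    def __init__(self, device_load):
        self.device_load = device_load       # e.g. {"dr-device-1": 0.7, "dr-device-2": 0.2}
        self.sessions = {}                   # user id -> device chosen at connection time

    def establish(self, user_id):
        device = min(self.device_load, key=self.device_load.get)   # pick by reported load
        self.sessions[user_id] = device
        return device

    def route(self, user_id, request):
        return self.sessions[user_id], request                     # existing connections stay put

gw = AccessGateway({"dr-device-1": 0.7, "dr-device-2": 0.2})
gw.establish("user-42")
print(gw.route("user-42", "read latest order"))
```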
  • The data disaster recovery system may further include the user equipment, configured to send the connection establishment request to the access gateway and to send data processing requests.
  • Since the data disaster recovery system can include any data disaster recovery device provided by the embodiments of the present application, it can achieve the beneficial effects achievable by any such device; refer to the foregoing embodiments, which are not repeated here.
  • In addition, an embodiment of the present application further provides a server. FIG. 5 shows a schematic structural diagram of the server involved in the embodiments of the present application, specifically:
  • The server may include a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a radio frequency (RF) circuit 503, a power supply 504, an input unit 505, a display unit 506, and other components. Those skilled in the art will understand that the server structure illustrated in FIG. 5 does not limit the server, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently. Among them:
  • The processor 501 is the control center of the server. It connects the various parts of the server through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 502 and invoking the data stored in the memory 502, executes the various functions of the server and processes data, thereby monitoring the server as a whole. Optionally, the processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles the operating system, the user interface, applications and the like, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 501.
  • The memory 502 can be used to store software programs and modules; the processor 501 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 502. The memory 502 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system, applications required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area can store data created according to the use of the server, and the like. In addition, the memory 502 can include high-speed random access memory and can also include non-volatile memory, such as at least one magnetic disk storage device or flash memory device, or other volatile solid-state storage devices. Accordingly, the memory 502 can also include a memory controller to provide the processor 501 with access to the memory 502.
  • The RF circuit 503 can be used to receive and send signals in the course of receiving and transmitting information; in particular, after receiving downlink information from a base station, it hands the information to the one or more processors 501 for processing, and it also sends uplink-related data to the base station. Typically, the RF circuit 503 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and so on. In addition, the RF circuit 503 can also communicate with networks and other devices via wireless communication.
  • The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
  • The server also includes a power supply 504 (such as a battery) for powering the various components. Preferably, the power supply 504 can be logically coupled to the processor 501 through a power management system, so that charging, discharging, power consumption management, and other functions are managed through the power management system. The power supply 504 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
  • The server may also include an input unit 505, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, in one embodiment, the input unit 505 can include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also known as a touch screen or touchpad, collects the user's touch operations on or near it (such as operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected apparatus according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the position of the user's touch and the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 501, and can receive and execute commands sent by the processor 501. In addition, the touch-sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch-sensitive surface, the input unit 505 can also include other input devices, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as a volume control key or a power key), a trackball, a mouse, a joystick, and the like.
  • The server may also include a display unit 506, which can be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the server; these graphical user interfaces may be composed of graphics, text, icons, video, or any combination thereof. The display unit 506 can include a display panel, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch-sensitive surface may cover the display panel; when the touch-sensitive surface detects a touch operation on or near it, it passes the operation to the processor 501 to determine the type of touch event, and the processor 501 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 5 the touch-sensitive surface and the display panel are implemented as two separate components to realize input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to realize input and output functions.
  • Although not shown, the server may also include a camera, a Bluetooth module, and the like, which are not described here.
  • Specifically, in this embodiment, the processor 501 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, thereby implementing various functions, as follows:
  • Each node in the logical unit is monitored; when an abnormality of the primary node is detected, the log information of the multiple standby nodes is obtained, where the log information of a standby node includes the time point at which that standby node synchronized data with the primary node; the standby node whose time point is closest to the current time is selected from the multiple standby nodes as the target node; and the primary node is updated to the target node.
  • The logical unit may be preset or may be established by the system according to a specific service request; that is, the processor 501 may also implement the function of obtaining the service request sent by the access gateway, selecting multiple nodes according to the service request, creating the logical unit from the selected nodes, and determining the primary/standby relationship of each node in the logical unit, so that the logical unit includes the primary node and the multiple standby nodes.
  • Optionally, while the data in the primary node is synchronized to each standby node, the data may also be backed up to other storage devices, such as a distributed file system, so that when a node in the logical unit fails, the data of that node can be quickly restored to a specified time point based on the day's image file and the log information up to that time point, for example as in the sketch below.
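A sketch of such point-in-time recovery, reusing the `LogEntry` shape from the earlier sketch; `load_daily_image` and `apply_entry` stand in for the actual restore and replay mechanisms:

```python
from datetime import datetime

def restore_to_point_in_time(load_daily_image, log_entries, apply_entry, target_time: datetime):
    """Rebuild a failed node: load the day's mirror file, then replay log entries up to the target time."""
    load_daily_image()                         # snapshot previously written by the cold standby
    for entry in sorted(log_entries, key=lambda e: e.applied_at):
        if entry.applied_at > target_time:
            break                              # stop exactly at the requested time point
        apply_entry(entry)
```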
  • In addition, the processor 501 can also implement the following function: selecting one standby node from the multiple standby nodes as a cold standby node according to a preset policy, and backing up the data to the distributed file system through the cold standby node in a pipelined manner.
  • Optionally, the data in the logical unit may also be backed up off-site, so that when the source logical unit becomes abnormal, the service of the source logical unit can be switched to the standby logical unit; that is, the processor 501 can also implement the following function: performing off-site backup of the logical unit to obtain a standby logical unit, and switching the service of the logical unit to the standby logical unit when the logical unit fails.
  • Furthermore, the standby logical unit can be used to rebuild the data of the logical unit at location A; that is, the processor 501 can also implement the following function: synchronizing the data of each node in the standby logical unit to the corresponding node in the logical unit, setting the standby logical unit to read-only when the delay between the data of the corresponding nodes in the logical unit and the standby logical unit is determined to be less than a preset value, and switching the service back to the logical unit when the delay is determined to equal 0.
  • The preset value can be set according to the requirements of the actual application and is not described in detail here. It can be seen that the server in this embodiment can monitor each node in the logical unit; when an abnormality of the primary node is detected, the log information of the multiple standby nodes is obtained, where the log information of a standby node includes the time point at which that standby node synchronized data with the primary node; the standby node whose time point is closest to the current time is then selected as the target node, and the primary node is updated to the target node, thereby implementing the switch between primary and standby. Because the logical unit of this scheme may include multiple standby nodes and, when the primary node becomes abnormal, the standby node with the most recent data is selected as the new primary node, data consistency between the original primary node and the new primary node can be ensured before and after the switch. Moreover, because the logical unit has only one primary node, the large-scale primary key conflict problem faced in the prior art does not arise, there is no need to distinguish between devices at the service layer, the implementation is simpler, and system availability can be greatly improved.
  • A person of ordinary skill in the art can understand that all or some of the steps of the methods in the foregoing embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium can include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Environmental & Geological Engineering (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a data disaster recovery method, device, and system. The embodiments monitor each node in a logical unit; when an abnormality of the primary node is detected, the log information of multiple standby nodes is obtained, where the log information of a standby node includes the time point at which that standby node synchronized data with the primary node; then, the standby node whose time point is closest to the current time is selected from the multiple standby nodes as a target node, and the primary node is updated to the target node, thereby implementing the switch between primary and standby.

Description

一种数据容灾方法、 装置和系统
本申请要求于 2016 年 7 月 27 日提交中国专利局、 申请号为 201610603383.0, 申请名称为"一种数据容灾方法、 装置和系统"的中国 专利申请的优先权, 其全部内容通过引用结合在本申请中。 技术领域
本申请涉及通信技术领域, 具体涉及一种数据容灾方法、 装置和系 统。 发明背景
数据容灾, 指的是建立一个数据系统, 使得系统发生故障时, 能够 保证用户数据的安全性, 甚至, 可以提供不间断的应用服务。
在传统的数据容灾方案中, 系统至少存在两个设备, 一个作为主用 设备 (Master) , 另一个作为备用设备 (Slave) , 其中, 主用设备对外提 供服务, 而备用设备则作为主用设备的备份, 当主用设备发生故障时, 替代主用设备。 这种主备复制容灾方案虽然可以在一定程度上达到容灾 的目的, 但是, 由于该方案大部分搡作只支持手动切换, 因此, 当故障 发生时, 并无法及时切换至备用设备中。 为此, 现有技术又提出了主主 复制的容灾方案, 即两台设备之间互为主备, 任何一个设备中数据发送 变化, 均为同步到另一设备中, 这样, 两台设备可以同时对外提供服务, 并互为镜像, 当其中一台设备发生故障时, 业务便可以直接切换到另一 台设备中, 而无需运維人员干预。 发明内容
本申请实施例提供一种数据容灾方法、 装置和系统, 不仅可以实现 主备切换前后数据的一致性, 而且, 无需在业务层上对设备进行严格区 分, 实现较为简单, 系统可用性高。
本申请实施例提供一种数据容灾方法, 包括:
对逻辑单元内各个节点进行监控, 所述各个节点包括主节点和多个 备用节点;
当监控到所述主节点发生异常时, 分别获取所述多个备用节点的日 志信息, 所述备用节点的日志信息包括该备用节点与所述主节点进行数 据同步的时间点;
从所述多个备用节点中选择所述时间点最接近当前时间的一个备用 节点作为目标节点;
将所述主节点更新为所述目标节点。
相应的, 本申请实施例还提供一种数据容灾装置, 包括处理器和存 储器, 所述存储器中存储可被所述处理器执行的指令, 当执行所述指令 时, 所述处理器用于:
对逻辑单元内各个节点进行监控, 所述各个节点包括主节点和多个 备用节点;
在监控到所述主节点发生异常时, 分别获取所述多个备用节点的日 志信息, 所述备用节点的日志信息包括该备用节点与所述主节点进行数 据同步的时间点;
从所述多个备用节点中选择所述时间点最接近当前时间的一个备用 节点作为目标节点;
将所述主节点更新为所述目标节点。
此外, 本申请实施例还提供一种数据容灾系统, 可以包括本申请实 施例所提供的任一种数据容灾装置。 附图简要说明
为了更清楚地说明本申请实施例中的技术方案, 下面将对实施例描 述中所需要使用的附图作简单地介绍, 显而易见地, 下面描述中的附图 仅仅是本申请的一些实施例, 对于本领域技术人员来讲, 在不付出创造 性劳动的前提下, 还可以根据这些附图获得其他的附图。
图 1 a是本申请实施例提供的数据容灾系统的场景示意图; 图 1 b是本申请实施例提供的数据容灾方法的场景示意图; 图 1 c是本申请实施例提供的数据容灾方法的流程图;
图 2a是本申请实施例提供的数据容灾装置的架构示例图; 图 2b 是本申请实施例提供的数据容灾方法中临时节点注册时的场 景示意图;
图 2c是本申请实施例提供的数据容灾方法的另一流程图; 图 3 a是本申请实施例提供的逻辑单元备份的场景示意图; 图 3 b是本申请实施例提供的跨城容灾流程的场景示意图; 图 4a是本申请实施例提供的数据容灾装置的结构示意图; 图 4b是本申请实施例提供的数据容灾装置的另一结构示意图; 图 5是本申请实施例提供的服务器的结构示意图。 实施方式
下面将结合本申请实施例中的附图, 对本申请实施例中的技术方案 进行清楚、 完整地描述, 显然, 所描述的实施例仅仅是本申请一部分实 施例, 而不是全部的实施例。 基于本申请中的实施例, 本领域技术人员 在没有作出创造性劳动前提下所获得的所有其他实施例, 都属于本申请 保护的范围。
在对现有技术的研究和实践过程中发现, 现有的主主复制的容灾方 案虽然在一定程度上可以保证切换前后数据的一致性, 但是, 由于存在 两个主用设备, 因此, 需要在业务层上对这两个设备进行严格区分, 否 则业务随意选择主用设备会造成大规模主键冲突, 所以, 该方案实现较 为复杂, 系统可用性不高。
本申请实施例提供一种数据容灾方法、 装置和系统。
其中, 该数据容灾系统可以包括本申请实施例所提供的任一种数据 容灾装置,该数据容灾装置的数量可以根据实际应用的需求而定, 此外, 如图 la 所示, 该数据容灾系统还可以包括其他的设备, 比如接入网关 (GW, Gate Way) , 以及一个或多个用户设备等; 其中, 在用户新建连 接时, 接入网关可以接收用户设备发送的连接建立请求, 根据该连接建 立请求获取多个数据容灾装置的负载信息, 然后, 根据该负载信息选择 匹配的数据容灾装置, 并建立该用户设备与匹配到的数据容灾装置之间 的连接关系, 这样, 用户设备后续向接入网关发送数据处理请求, 接入 网关便可以基于该连接关系将数据处理请求转发至相应的数据容灾装 置中, 由该数据容灾装置根据该数据处理请求进行数据处理, 比如进行 数据的读取、 写入、 删除或更改, 等等。
其中, 具体实现时, 该数据容灾装置可以作为一个实体来实现, 也 可以由多个实体共同来实现, 即该数据容灾装置可以包括多个网络设 备, 比如, 如图 lb 所示, 该数据容灾装置可以包括调度器和逻辑单元 (set) , 其中, 该逻辑单元可以包括主节点和多个备用节点, 这些节点 可以部署在同一机房中,也可以部署在不同机房中, 可以位于同一区域, 也可以是位于不同区域。
其中, 主节点主要用于根据数据处理请求进行数据处理, 比如, 进 行数据的读或写等搡作; 而备用节点则是可以通过数据同步的方式来对 主节点中的数据进行备份, 备用节点可以提供 "读" 的功能。 调度器主 要用于对逻辑单元中的节点的状态进行监控、 以及对主备节点的切换进 行控制, 比如, 具体可以对逻辑单元内各个节点进行监控, 若监控到主 节点发生异常, 则分别获取该多个备用节点的日志信息, 其中, 该备用 节点的日志信息包括备用节点与主节点进行数据同步的时间点, 然后, 选择该时间点最接近当前时间的备用节点作为目标节点, 并将主节点更 新为该目标节点, 即将该目标节点作为新的主节点, 而原主节点则可以 降为备用节点, 等等。
此外, 该数据容灾装置还可以包括代理服务器, 用于接收接入网关 发送的数据处理请求, 并根据预设路由信息将该数据处理请求发送至相 应的逻辑单元的节点中进行处理, 以及, 与调度器进行交互, 以获取相 应的路由信息, 等等。
以下将分别进行详细说明。
本实施例将从数据容灾装置的角度进行描述, 该数据容灾装置可以 作为一个实体来实现, 也可以由多个实体共同来实现, 即该数据容灾装 置具体可以集成在服务器等设备中, 或者, 也可以包括多个网络设备, 比如可以包括调度器和逻辑单元 (set) 等, 此外, 该数据容灾装置还可 以包括其他设备, 比如代理服务器等。
一种数据容灾方法, 包括: 对逻辑单元内各个节点进行监控, 当监 控到该主节点发生异常时, 分别获取该多个备用节点的日志信息, 该备 用节点的日志信息包括备用节点与主节点进行数据同步的时间点, 选择 该时间点最接近当前时间的备用节点作为目标节点, 将主节点更新为该 目标节点。
如图 lc所示, 该数据容灾方法的具体流程可以如下:
101、 对逻辑单元内各个节点进行监控。
其中, 该逻辑单元内的各个节点可以包括主节点和多个备用节点, 主节点主要用于根据数据处理请求进行数据处理, 比如, 进行数据的读 或写等搡作; 而备用节点则是可以通过数据同步的方式来对主节点中的 数据进行备份, 备用节点可以提供 "读" 的功能。 这些节点可以部署在 同一机房中, 也可以部署在不同机房中, 可以位于同一区域, 也可以是 位于不同区域; 可选的, 为了提高传输效率, 各个节点之间还可以通过 专线进行传输。
其中, 对逻辑单元内各个节点 (包括主节点和备用节点) 进行监控 的方式可以有多种, 比如, 可以对数据库实例, 如 mysql (—种关系型 数据库) 实例进行监控、 对各个节点所执行的事务进行监控、 和 /或对逻 辑单元内各个节点的硬件状态和核心程序运行状态进行监控, 等等, 例 口, 具体可以 下:
( 1 ) 对逻辑单元内各个节点的数据库实例的运行进行监控, 如下: 分别对逻辑单元内各个节点中的数据库实例进行周期性访问, 若该 数据库实例不可读和 /或不可写, 则确定相应的节点发生异常。
(2) 对逻辑单元内各个节点所执行的事务进行监控, 如下: 对主节点执行事务时的工作状态进行监控, 若该工作状态指示当前 处于非工作状态, 则确定主节点发生异常。
此外, 分别对逻辑单元内各个备用节点从主节点上拉取事务线程的 工作状态进行监控, 若该工作状态指示当前处于非工作状态, 比如, 该 工作状态为 "ΝΟ", 则确定相应的备用节点发生异常。
(3)对逻辑单元内各个节点的硬件状态和核心程序运行状态进行监 控, 如下:
定时获取逻辑单元内各个节点的硬件状态信息、 以及核心程序运行 状态信息, 当根据该硬件状态信息确定节点发生硬件故障时, 和 /或, 根 据该核心程序运行状态信息确定节点发生软件故障时, 确定相应的节点 发生异常。
其中, 该逻辑单元可以是预设的, 也可以由系统根据具体的服务请 求进行建立, 即在步骤 "对逻辑单元内的节点进行监控" 之前, 该数据 容灾方法还可以包括:
获取接入网关发送的服务请求, 根据该服务请求选择多个节点, 根 据选择的节点创建逻辑单元, 并确定逻辑单元中各个节点的主备关系, 使得该逻辑单元包括主节点和多个备用节点。
可选的, 为了提高数据的安全性, 在将主节点中的数据同步到各个 备用节点的同时, 还可以将数据备份到其他的存储设备, 比如分布式文 件系统 (HDFS, Hadoop Distribute File System) 中, 即在分布式文件系 统中保存该逻辑单元相应的镜像文件, 这样, 当该逻辑单元中的节点发 生故障时, 便可以基于当天的镜像文件以及指定时间点的日志信息, 将 该逻辑单元节点中的数据, 以极快的速度 (比如秒级) 恢复到指定时间 点。
可选的, 为了提高数据备份的效率, 以及减少数据备份时对各个节 点运行的影响, 还可以在备用节点中选择相应的节点, 作为冷备节点, 来执行数据备份的搡作, 即, 在步骤 "确定逻辑单元中各个节点的主备 关系" 之后, 该数据容灾方法还可以包括:
按照预设策略从多个备用节点中选择一个备用节点作为冷备节点, 通过该冷备节点以管道流的方式将数据备份至分布式文件系统。 所谓冷 备节点, 是指基于冷备机制将数据备份至其他位置的节点。
可选的, 为了进一步提高备份数据的效率, 以及减少对各个节点运 行的影响, 冷备节点在将数据备份至分布式文件系统时, 可以避开逻辑 单元运作的高峰期, 比如, 可以一天备份一次, 并且, 可以选择一天中 的一个随机事件来进行备份, 等等。 102、 当监控到该主节点发生异常时, 分别获取该多个备用节点的日 志信息。
其中, 该备用节点的日志信息可以包括备用节点与主节点进行数据 同步的时间点, 此外, 可选的, 该日志信息还可以包括其他的信息, 比 如所执行的事务的名称、 数据的类型、 和 /或大小, 等等。 该备用节点的 日志信息的表现形式可以有多种,比如,具体可以为二进制日志(binlog) 等。
可选的, 在监控到该主节点发生异常时, 除了直接触发 "获取该多 个备用节点的日志信息"搡作之外, 还可以通过其他的方式来触发执行 该 "获取该多个备用节点的日志信息 " 的搡作, 比如, 可以将该主节 点降级为一个备用节点, 从而使得该逻辑单元中不存在主节点, 这样, 该逻辑单元便会生成指示不存在主节点的状态信息, 而后续调度器若监 控到该指示不存在主节点的状态信息, 便可以停止该备用节点的同步日 志的输入输出 (I/O, Input/Output) 接口, 并触发执行该 "获取该多个 备用节点的日志信息 " 的搡作; 也就是说, 在步骤 "分别获取该多个备 用节点的日志信息" 之前, 该数据容灾方法还可以包括:
将该主节点降级为一个备用节点,并停止该备用节点的同步日志(比 如 binlog) 的 I/O接口。
在监控到该主节点发生异常时,还可以对主节点中的数据进行修复, 比^, 具体可以 口下:
获取该主节点的日志信息, 根据该主节点的日志信息对该主节点中 的数据进行修复, 等等。
其中, 该主节点的日志信息可以包括主节点更新数据的时间点 (即 主节点更新自身数据的时间点), 此外, 可选的, 该日志信息还可以包 括其他的信息, 比如所执行的事务的名称、 数据的类型、 和 /或大小, 等 等。 该主节点的日志信息的表现形式可以有多种, 比如, 具体可以为二 进制日志 (binlog) 等。
需说明的是, 由于在本申请实施例中, 用户设备在发出一条数据处 理请求, 比如结构化查询语言 (sql, Structured Query Language) 后, 需 要等待主节点将该数据处理请求对应的日志信息 (比如 binlog) 同步到 备用节点上时, 才能正常返回, 否则用户设备会返回超时的错误。 因此, 当主节点突然发生异常, 日志信息可能还没来得及发送到其他备用节点 时, 如果此时主节点依照自身的日志信息进行数据恢复, 则结果会多出 这笔数据 (即此时主节点比备用节点多出这一笔数据), 因而, 为了避 免出现这种情况, 在对主节点进行数据恢复之前, 可以对这笔数据所对 应的日志信息做闪回处理, 即, 在步骤 "获取该主节点的日志信息 " 之 前, 该数据容灾方法还可以包括:
确定主节点中的数据是否已同步至备用节点, 若是, 则执行获取该 主节点的日志信息的步骤; 若否, 则对尚未同步的数据所对应的日志信 息进行闪回处理。
可选的,在对尚未同步的数据所对应的日志信息进行闪回处理之前, 还可以确定该尚未同步的数据所对应的日志信息是否能够被闪回, 若 是, 才执行该对尚未同步的数据所对应的日志信息进行闪回处理的步 骤; 否则, 若不能被闪回, 则从该目标节点中全量拉取镜像数据, 以重 构该主节点中的数据。
比如, 对于删除表 (drop table) 之类的, 就无法被闪回, 因此, 对 于此类数据, 可以从目标节点中全量拉取镜像数据, 来进行数据的重构, 。
需说明的是, 除了可以由该数据容灾装置自动从目标节点中全量拉 取镜像数据之外, 也可以由运維人员进行干预, 即由人工从目标节点中 全量拉取镜像数据, 以重构该主节点中的数据。
103、 根据该日志信息, 从多个备用节点中选择该时间点 (即备用节 点与主节点进行数据同步的时间点) 最接近当前时间的备用节点作为目 标节点。
比如, 当前时间点为 7月 1 日的 12:10:01, 而备用节点 A与主节点 进行数据同步的时间点为 7月 1 日的 12:10:00分,备用节点 B与主节点 进行数据同步的时间点为 7月 1 日的 12:09:59分,备用节点 C与主节点 进行数据同步的时间点为 7月 1 日的 12:10:01分, 则此时, 可以选择备 用节点 C作为目标节点, 从而保证该目标节点中的数据与主节点的数据 的一致性。
104、 将主节点更新为该目标节点, 即, 将该目标节点作为新的主节 点, 从而将原主节点中的业务切换至该目标节点上。
可选的, 为了提高数据的安全性, 还可以对该逻辑单元中的数据进 行异地备份, 比如, 若该逻辑单元位于 A地, 则可以在 B地对该逻辑单 元进行备份, 从而使得当 A地的逻辑单元发生异常, 比如 A地发生灾难 时, 可以将 A地逻辑单元的业务切换至 B地的备份逻辑单元; 即该数据 容灾方法还可以包括:
对该逻辑单元进行异地备份, 得到备用逻辑单元, 在该逻辑单元发 生故障时, 将该逻辑单元的业务切换至该备用逻辑单元中。 其中, 该备 用逻辑单元具有和逻辑单元——对应的多个节点。
此外,还可以利用 B地备份的数据对 A地的逻辑单元进行数据重建, 即在步骤 "将该逻辑单元的业务切换至该备用逻辑单元中" 之后, 该数 据容灾方法还可以包括:
将该备用逻辑单元中各个节点的数据同步至该逻辑单元内相应节点 中; 当确定该逻辑单元和备用逻辑单元中各个节点的数据之间的延迟小 于预设值时, 将该备用逻辑单元设置为只读, 并在该延迟 (即该逻辑单 元和备用逻辑单元中各个节点的数据之间的延迟) 为 0时, 将该业务切 换回该逻辑单元中。
其中, 该预设值可以根据实际应用的需求进行设置, 在此不再赘述。 由上可知, 本实施例采用对逻辑单元内各个节点进行监控, 当监控 到主节点发生异常时, 分别获取多个备用节点的日志信息, 其中, 备用 节点的日志信息包括备用节点与主节点进行数据同步的时间点, 然后, 选择时间点最接近当前时间的备用节点作为目标节点, 并将主节点更新 为该目标节点, 从而实现主备之间的切换; 由于该方案的逻辑单元可以 包括多个备用节点, 然后, 在主节点发生异常时, 从中选择出具有最新 数据的备用节点作为新的主节点, 因此, 可以保证切换前后, 原主节点 与新的主节点之间数据的一致性; 而且, 由于该逻辑单元只有一个主节 点, 因此, 不会出现如现有技术中所面临的大规模主键冲突问题, 可以 不在业务层对各个设备进行区分, 实现更为简单, 可以大大提高系统的 可用性。
根据上述实施例所描述的方法, 以下将举例作进一步详细说明。 在本实施例中, 将以数据容灾系统包括用户设备、 接入网关、 以及 多个数据容灾装置, 而该数据容灾装置包括代理服务器、 调度器和逻辑 单元 (set) 为例进行详细说明。
其中, 参见图 la、 图 lb和图 2a, 该数据容灾系统中各个设备的具 体功能可以如下:
( 1 ) 用户设备;
用户设备, 用于通过接入网关与数据容灾装置建立连接关系, 并基 于该连接关系通过接入网关向相应的数据容灾装置发送数据处理请求, 以获取相应的服务。 (2) 接入网关;
在用户新建连接时, 接入网关可以获取多个数据容灾装置的负载信 息, 根据这些负载信息选择匹配的数据容灾装置, 并建立用户设备与匹 配到的数据容灾装置之间的连接关系; 此外, 接入网关中还可以保存该 连接关系的相关信息, 比如保存该连接的会话信息等。 而对于已经建立 连接的用户请求, 比如用户设备发送的数据处理请求, 则可以根据该连 接的会话信息, 将该数据处理请求转发到对应的数据容灾装置上, 从而 达到负载均衡的目的。
其中, 为了可以及时且准确地获取各个数据容灾装置的负载信息, 还可以设置一服务列表, 用于保存有多个可以提供服务的数据容灾装置 的信息, 这样, 在需要选择数据容灾装置时, 便可以根据该服务列表中 的信息, 来选择所需的数据容灾装置; 与此同时, 为了保持该服务列表 中信息的准确性和实时性, 接入网关还可以周期性地探测各个数据容灾 装置的状态, 并基于这些状态对该服务列表中的信息进行更新, 比如, 可以将故障的数据容灾装置及时从服务列表中删除, 而当检测到该故障 的数据容灾装置恢复正常, 则可以添加回服务列表中, 等等。
需说明的是, 由于该接入网关保存了连接的会话信息, 因此, 即便 对数据容灾装置进行扩容, 比如增加多几个数据容灾装置, 也不会对已 经存在的用户连接造成影响, 所以, 该方案可以很方便地进行扩容。
(3) 数据容灾装置;
参见图 lb, 该数据容灾装置可以包括代理服务器 (proxy)、 调度器 和逻辑单元 (set) 等设备, 具体可以如下:
A、 代理服务器;
代理服务器, 用于从调度器中获取和保存路由信息, 并根据该路由 信息将用户设备发送的数据处理请求 (比如 sql语句) 转发至逻辑单元。 此外, 该代理服务器, 还可以对接收到的数据处理请求进行鉴权, 比如,根据该数据处理请求中携带的用户设备的标识,比如网际协议(IP, Internet Protocol) 地址、 用户名和 /或用户密码等信息进行鉴权, 等等。
B、 调度器;
该调度器作为该数据容灾装置的调度管理中心, 可以进行逻辑单元
(set) 的创建、 删除、 以及逻辑单元中节点的选择、 以及替换等操作。 此外, 该调度器还可以对逻辑单元中的各个节点进行监控, 当监控到该 主节点发生异常时, 则发起主备切换流程, 比如, 可以分别获取该多个 备用节点的日志信息, 并根据该日志信息选择新的主节点, 等等。
其中, 该调度器可以包括多个调度程序 (Scheduler) , 然后由这多个
Scheduler来协作完成该调度器的主要搡作,比如,创建和删除逻辑单元, 以及对逻辑单元中的节点进行选择和替换, 以及对各个节点的状态进行 监控, 并在确定主节点发生异常时, 发起主备切换流程, 等等。 而该 Scheduler与其他部分 (比如代理服务器和逻辑单元) 之间的交互则可以 通过开源的分布式应用程序协调服务, 比如 Zookeeper来完成; 比如, 该 Zookeeper可以接收各个节点上报的状态信息, 并将该状态信息提供 给 Scheduler, 或者, Scheduler 也可以对 Zookeeper 进行监控, 并从 Zookeeper中获取到相应的节点的状态信息, 等等。
需说明的是, 由于 Zookeeper可能会面对多个逻辑单元, 因此, 为 了方便对各个逻辑单元进行管理, 如图 2b 所示, 在建立逻辑单元时, 可以在 Zookeeper中注册一个与该逻辑单元对应的临时节点, 并由该临 时节点来处理该逻辑单元的事务, 比如接收该逻辑单元中各个节点的状 态信息, 等等。
需说明的是, Scheduler的容灾可以通过 Zookeeper的选举机制完成, 在此不作细述。 C、 逻辑单元 (set) ;
逻辑单元, 可以包括主节点和多个备用节点, 主节点主要用于根据 数据处理请求进行数据处理, 比如, 进行数据的读或写等搡作; 而备用 节点则是可以通过数据同步的方式来对主节点中的数据进行备份, 备用 节点可以提供 "读" 的功能。 这些节点可以部署在同一机房中, 也可以 部署在不同机房中, 可以位于同一区域, 也可以是位于不同区域。
为了便于对这些节点进行管理和数据传输, 还可以为每一个节点分 别设置一个代理模块 (Agent) , 该代理模块可以独立于各个节点, 也可 以集成在各个节点中, 例如, 参见图 2a。
该代理模块, 可以通过短连接的方式周期性访问所属节点的数据库 实例 (DB), 以检测是否可读和可写, 若可读且可写, 则表示该节点正 常, 若不可读和 /或不可写, 则表示该节点发生异常, 此时, 代理模块便 可以生成相应的表示节点异常的状态信息, 并将该状态信息上报给调度 器, 比如, 具体可以上报给 Zookeeper中相应的临时节点, 从而使得调 度器可以检测到该节点发生异常, 进而进行主备切换。
其中, 该代理模块对所属节点的数据库实例的访问周期可以根据实 际应用的需求而定, 比如, 可以设置为每 3秒钟监测一次, 等等。
该代理模块, 还可以对所属节点在执行事务时的延时时间、 以及延 迟的事务数量进行统计, 并定期将携带统计结果的状态信息上报给调度 器, 比如, 具体可以上报给 Zookeeper中相应的临时节点, 以便调度器 可以据此确定是否发起主备切换流程, 比如, 如果统计结果超过预设阈 值, 则进行主备切换, 等等。
此外, 如发生了主备切换, 代理模块还可以根据新的主节点重建主 备关系, 而对于新增的数据库实例也可以采用如 xtrabackup (—种数据 备份工具) 等方式, 通过主节点重建数据, 因此, 该方案的数据重建均 可自动进行, 而无需数据库管理员 (DBA, Database Administrator) 干 预。
需说明的是, Zookeeper中注册了与逻辑单元对应的临时节点, 如果 该逻辑单元发生硬件故障, 和 /或, 核心程序发生异常, 比如该代理模块 发生核心 (core) 故障, 则 Zookeeper 中相应的临时节点也会相应被删 除, 所以, 如果 Scheduler检测到该临时节点消失了, 则可以确定该逻 辑单元中的主节点发生异常, 此时, 可以进行主备切换。
基于上述数据容灾系统的架构, 以下将举例对其执行流程进行详细 说明。
如图 2c所示, 一种数据容灾方法, 具体流程可以如下:
201、逻辑单元中的各个代理模块对各自所属的节点进行监控, 得到 状态信息, 并将状态信息发送给调度器。
例如, 以主节点 A为例, 则主节点 A的代理模块 A对主节点 A的 监控方式可以如下:
代理模块 A周期性访问主节点 A的数据库实例, 以检测该数据库实 例是否可读且可写, 若可读且可写, 则确定主节点 A正常, 若不可读和 /或不可写, 则确定主节点 A 发生异常, 生成相应的表示节点异常的状 态信息, 并将该表示节点异常的状态信息发送给调度器, 比如, 具体可 以发送给 Zookeeper中相应的临时节点。
和 /或, 代理模块 A 定时获取自身 (即该代理模块 A) 和主节点 A 的硬件状态信息、 以及核心程序运行状态信息, 并将该硬件状态信息、 以及核心程序运行状态信息发送给调度器, 比如, 具体可以发送给 Zookeeper中相应的临时节点, 等等。
需说明的是,上述代理模块 A对主节点 A的监控方式同样适用于其 他节点, 在此不再赘述。 202、调度器接收到该状态信息后, 根据该状态信息确定相应的主节 点是否发生异常, 若发生异常, 则执行步骤 203, 而若没有发生异常, 则可以返回执行步骤 201的步骤。
例如, 具体可以由 Zookeeper中相应的临时节点来接收该状态信息, 并将该状态信息提供该 Scheduler, 当 Scheduler根据该状态信息确定主 节点发生异常时, 则执行步骤 203。
需说明的是, 由于当逻辑单元发生硬件故障, 和 /或, 该逻辑单元中 的核心程序发生异常时, 该逻辑单元相应的临时节点会消失, 因此, 若 Scheduler检测到该临时节点消失了, 则也可以确定该逻辑单元中的主节 点发生了异常, 因此, 此时也可以执行步骤 203。
203、调度器将该主节点降级为一个新的备用节点, 并停止该备用节 点的同步日志 (binlog) 的 I/O接口, 然后执行步骤 204。
例如, 具体可以由 Scheduler 来将该主节点降级为一个新的备用节 点, 并通过 Zookeeper向该备用节点发送线程停止指令, 以便该备用节 点可以根据该线程停止指令停止同步日志 (binlog) 的 I/O接口, 等等。
204、 调度器分别获取该逻辑单元中原有的多个备用节点的日志信 息。
例如,具体可以由 Scheduler来获取该逻辑单元中原有的多个备用节 点的日志信息, 比如 relaylog (中继日志) 等。
其中, 该备用节点的日志信息可以包括备用节点与主节点进行数据 同步的时间点, 此外, 可选的, 该日志信息还可以包括其他的信息, 比 如所执行的事务的名称、 数据的类型、 和 /或大小, 等等。
205、调度器根据该日志信息, 从原有的多个备用节点中选择时间点 (即备用节点与主节点进行数据同步的时间点) 最接近当前时间的备用 节点作为目标节点。 例如, 具体可以由 Scheduler来选择该目标节点; 比如, 当前时间点 为 7月 1 日的 12: 10:01, 而备用节点 A与主节点进行数据同步的时间点 为 7月 1 日的 12: 10:00分,备用节点 B与主节点进行数据同步的时间点 为 7月 1 日的 12:09:59分,备用节点 C与主节点进行数据同步的时间点 为 7月 1 日的 12: 10:01分, 则此时, Scheduler可以选择备用节点 C作 为目标节点。
206、调度器将主节点更新为该目标节点, 即将目标节点作为新的主 节点, 以便将原主节点中的业务切换至该目标节点上。
例如, 具体可以由 Scheduler将该主节点更新为该目标节点, 比如, 可以生成表示主节点更新为该目标节点的指示信息, 然后, 通过 Zookeeper 向该目标节点的代理模块发送该指示信息, 这样, 该代理模 块便可以基于该指示信息重建主备关系, 比如, 确定自身 (即该目标节 点) 为新的主节点, 而其他节点为备用节点, 并生成相应的路由信息, 然后, 将该路由信息通过 Zookeeper, 提供给 Scheduler, 等等。
可选的, 如图 2a所示, 为了提高数据的安全性, 在将主节点中的数 据同步到各个备用节点的同时, 还可以将数据备份到其他的存储设备, 比如分布式文件系统中。
需说明的是, 为了提高数据备份的效率, 以及减少数据备份时对各 个节点运行的影响, 可以在多个备用节点中选择一个备用节点作为冷备 节点, 来执行将数据备份到分布式文件系统的搡作; 而且, 冷备节点在 进行备份时, 可以采用流管道的方式向分布式文件系统进行传输, 以提 高传输效率; 进一步, 还可以选择一天中的一个随机事件来进行备份, 从而避开逻辑单元运作的高峰期。
可选的, 在确定主节点发生异常 (即步骤 202) 时, 除了可以进行 主备切换之外, 还可以对主节点中的数据进行恢复, 即该数据容灾方法 还可以执行步骤 207, 如下:
207、 调度器获取该主节点的日志信息, 然后执行步骤 209。
例如, 具体可以由 Scheduler 来获取该主节点的日志信息, 比如 relaylog等。
其中, 该主节点的日志信息可以包括主节点更新数据的时间点 (即 主节点更新自身数据的时间点), 此外, 可选的, 该日志信息还可以包 括其他的信息, 比如所执行的事务的名称、 数据的类型、 和 /或大小, 等 等。
需说明的是, 为了避免主节点的日志信息中出现多余的数据 (具体 可参见图 lc所示实施例中的说明), 可以先确定该主节点中的数据是否 已同步至备用节点, 若是, 才执行获取该主节点的日志信息的步骤 (即 步骤 207), 否则, 此时可以对尚未同步的数据所对应的日志信息进行闪 回处理。
可选的,在对尚未同步的数据所对应的日志信息进行闪回处理之前, 还可以对该尚未同步的数据所对应的日志信息是否能够被闪回进行判 断, 若可以被闪回, 才执行该对尚未同步的数据所对应的日志信息进行 闪回处理的步骤; 否则, 若不能被闪回, 则此时可以从该目标节点中全 量拉取镜像数据, 以重构该主节点中的数据, 具体可参见上述实施例, 在此不再赘述。
还需说明的是,步骤 207与步骤 203~206的执行顺序可以不分先后。
208、 调度器根据该主节点的日志信息对该主节点中的数据进行修 复。
由上可知, 本实施例采用对逻辑单元内各个节点进行监控, 当监控 到主节点发生异常时, 分别获取多个备用节点的日志信息, 其中, 该备 用节点的日志信息包括备用节点与主节点进行数据同步的时间点, 然 后, 选择时间点最接近当前时间的备用节点作为目标节点, 并将主节点 更新为该目标节点, 从而实现主备之间的切换; 由于该方案的逻辑单元 可以包括多个备用节点, 然后, 在主节点发生异常时, 从中选择出具有 最新数据的备用节点作为新的主节点, 因此, 可以保证切换前后, 原主 节点与新的主节点之间数据的一致性; 而且, 由于该逻辑单元只有一个 主节点, 因此, 不会出现如现有技术中所面临的大规模主键冲突问题, 可以不在业务层对各个设备进行区分, 实现更为简单, 可以大大提高系 统的可用性。
可选的, 由于该方案在对原主节点中的数据进行恢复时, 还可以对 尚未同步至备用节点的数据所对应的日志信息作闪回处理, 因此, 可以 进一步提高各节点间, 以及切换前后数据的一致性。
此外, 由于该方案可以在多个备用节点中选择相应的备用节点作为 冷备节点, 并通过该冷备节点将数据备份至分布式文件系统, 因此, 可 以大大提高数据备份的效率, 以及减少数据备份时对各个节点运行的影 响, 有利于提高整个数据容灾系统的性能。
在上述实施例的基础上, 可选的, 提高数据的安全性, 还可以对该 逻辑单元中的数据进行异地备份, 比如, 若该逻辑单元位于 A地, 则可 以在 B地对该逻辑单元进行备份,从而使得当 A地的逻辑单元发生异常, 比如 A地发生灾难时, 可以将 A地逻辑单元的业务切换至 B地的备份 逻辑单元, 等等。
以下将以源逻辑单元位于 A城市,备份逻辑单元位于 B城市为例进 行说明。
如图 3a所示, 可以将 A城市逻辑单元中的数据备份至 B城市, 比 如, 具体可以将 A城市逻辑单元中主节点的数据, 异步传输至 B城市逻 辑单元中的主节点, 然后由 B城市逻辑单元中的主节点将数据同步至逻 辑单元 (即 B城市逻辑单元) 中的其他备用节点。
当 A城市的逻辑单元发生故障, 比如, 当 A城市发生自然灾害如水 灾或地震等, 或者发生一些其他人为的灾难如战争或爆炸等, 而导致该 逻辑单元发生故障的时候,便可以将 A城市逻辑单元的业务切换至 B城 市逻辑单元, 并利用 B城市逻辑单元中的数据对其进行修复, 在等待 A 城市逻辑单元的数据恢复后, 便可以将业务切换回 A城市逻辑单元; 例 如, 参见图 3b, 具体可以如下:
301、 当 A城市逻辑单元中的业务切换到 B城市逻辑单元中之后, 原本业务发送至 A城市逻辑单元的数据处理请求,也会转发至 B城市逻 辑单元, 即此时, B城市逻辑单元中的主节点可以接收该数据处理请求, 并根据该数据处理请求进行数据处理。
302、 当 A城市逻辑单元需要进行数据重建时, 可以向 B城市逻辑 单元发送进行数据同步的请求 (即请求数据同步)。
303、 B城市逻辑单元接收到该进行数据同步的请求后,将该 B城市 逻辑单元的数据同步至 A城市逻辑单元。
例如, 可以获取 A城市逻辑单元的全局事务标识 (GTID, Global Transaction ID)、 A城市逻辑单元中各个节点的日志信息、 B城市逻辑单 元接收到该进行数据同步的请求后, 将 B城市逻辑单元的 GTID、 以及 B城市逻辑单元中各个节点的日志信息等, 基于这些 GTID和日志信息 将该 B城市逻辑单元的数据同步至 A城市逻辑单元, 等等。
304、 当确定 A城市逻辑单元和 B城市逻辑单元中的数据之间的延 迟小于预设值时,将 B城市逻辑单元设置为只读,并在该延迟等于 0时, 将业务切换回 A城市逻辑单元, 即此时, A城市逻辑单元中的主节点可 以接收相应业务发送的数据处理请求, 并根据该数据处理请求进行数据 处理。 其中, 该预设值可以根据实际应用的需求进行设置, 在此不再赘述。 可见,本实施例不仅可以实现如图 2c所示的实施例所能实现的有益 效果之外, 还可以支持跨城容灾, 大大提高数据的安全性。
为了更好地实施以上方法,本申请实施例还提供一种数据容灾装置, 如图 4a所示, 该数据容灾装置包括监控单元 401、 获取单元 402、 选择 单元 403和更新单元 404, 如下:
( 1 ) 监控单元 401 ;
监控单元 401, 用于对逻辑单元内各个节点进行监控。
其中, 各个节点可以包括主节点和多个备用节点。 这些节点可以部 署在同一机房中, 也可以部署在不同机房中, 可以位于同一区域, 也可 以是位于不同区域, 具体可参见前面的实施例, 在此不再赘述。
其中, 对逻辑单元内各个节点进行监控的方式可以有多种, 比如, 可以对数据库实例, 如 mysql实例进行监控、 对各个节点所执行的事务 进行监控、和 /或对逻辑单元内各个节点的硬件状态和核心程序运行状态 进行监控, 等等, 例如, 具体可以如下:
监控单元 401, 具体可以用于对逻辑单元内各个节点的数据库实例 的运行进行监控, 比如, 可以如下:
分别对逻辑单元内各个节点中的数据库实例进行周期性访问, 若该 数据库实例不可读和 /或不可写, 则确定相应的节点发生异常。
和 /或, 监控单元 401, 具体可以用于对逻辑单元内各个节点所执行 的事务进行监控, 比如, 可以如下:
分别对逻辑单元内各个备用节点从主节点上拉取事务线程的工作状 态进行监控; 若该工作状态指示当前处于非工作状态, 比如, 如该工作 状态为 "NO", 则确定相应的节点发生异常。
和 /或, 监控单元 401, 具体可以用于对逻辑单元内各个节点的硬件 状态和核心程序运行状态进行监控, 比如, 可以如下:
定时获取逻辑单元内各个节点的硬件状态信息、 以及核心程序运行 状态信息, 当根据该硬件状态信息确定节点发生硬件故障时, 和 /或, 根 据该核心程序运行状态信息确定节点发生软件故障时, 确定相应的节点 发生异常。
或者, 注册与逻辑单元对应的临时节点, 其中, 当逻辑单元发生硬 件故障, 和 /或, 核心程序发生异常时, 临时节点被删除; 当检测到临时 节点消失时, 确定主节点发生异常。
(2) 获取单元 402;
获取单元 402, 用于在监控到该主节点发生异常时, 分别获取该多 个备用节点的日志信息。
其中, 该备用节点的日志信息可以包括备用节点与主节点进行数据 同步的时间点, 此外, 可选的, 该日志信息还可以包括其他的信息, 比 如所执行的事务的名称、 数据的类型、 和 /或大小, 等等。 该备用节点的 日志信息的表现形式可以有多种,比如,具体可以为二进制日志(binlog) 等。
(3) 选择单元 403 ;
选择单元 403, 用于从多个备用节点中选择该时间点最接近当前时 间的备用节点作为目标节点。
(4) 更新单元 404 ;
更新单元 404, 用于将主节点更新为该目标节点。
即, 更新单元 404可以将该目标节点作为新的主节点, 从而将原主 节点中的业务切换至该目标节点上。
可选的,在监控到该主节点发生异常时,除了直接触发获取单元 402 执行 "获取该多个备用节点的日志信息"搡作之外, 还可以通过其他的 方式来触发执行该 "获取该多个备用节点的日志信息" 的搡作, 比如, 可以将该主节点降级为备用节点, 从而使得该逻辑单元中不存在主节 点, 这样, 该逻辑单元便会生成指示不存在主节点的状态信息, 而后续 调度器若监控到该状态信息, 便可以停止备用节点的同步日志 (binlog) 的 I/O接口, 并触发获取单元 402执行该 "获取该多个备用节点的日志 信息" 的搡作; 即如图 4b 所示, 该数据容灾装置还可以包括处理单元 405, 如下:
该处理单元 405, 具体可以用于在监控单元 401 监控到该主节点发 生异常时, 将该主节点降级为一个新的备用节点, 停止该备用节点的同 步日志 (binlog) 的 I/O接口, 并触发获取单元 402执行分别获取该多个 备用节点的日志信息的搡作。
此外, 监控到该主节点发生异常时, 还可以对主节点中的数据进行 修复, 即如图 4b所示,该数据容灾装置还可以包括修复单元 406,如下: 该修复单元 406, 可以用于在监控单元 401 监控到该主节点发生异 常时, 获取该主节点的日志信息, 根据该主节点的日志信息对该主节点 中的数据进行修复。
其中, 该主节点的日志信息可以包括主节点更新数据的时间点 (即 主节点更新自身数据的时间点), 此外, 可选的, 该日志信息还可以包 括其他的信息, 比如所执行的事务的名称、 数据的类型、 和 /或大小, 等 等。 该主节点的日志信息的表现形式可以有多种, 比如, 具体可以为二 进制日志 (binlog) 等。
需说明的是, 为了避免主节点的日志信息中出现多余的数据 (具体 可参见图 lc所示的实施例中的说明), 可以先确定该主节点中的数据是 否已同步至备用节点, 若是, 才执行获取该主节点的日志信息的搡作, 否则, 此时可以对尚未同步的数据所对应的日志信息进行闪回处理。即: 该修复单元 406, 还可以用于确定主节点中的数据是否已同步至备 用节点, 若是, 则执行获取该主节点的日志信息的搡作; 若否, 则对尚 未同步的数据所对应的日志信息进行闪回处理。
可选的, 为了减少无效搡作, 在对尚未同步的数据所对应的日志信 息进行闪回处理之前, 还可以对该尚未同步的数据所对应的日志信息是 否能够被闪回进行判断, 在确定可以被闪回时才进行闪回处理, 即: 该修复单元 406, 还可以用于确定尚未同步的数据所对应的日志信 息是否能够被闪回; 若是, 则执行该对尚未同步的数据所对应的日志信 息进行闪回处理的搡作; 若否, 则从该目标节点中拉取镜像数据, 以重 构该主节点中的数据, 具体可参见上述实施例, 在此不再赘述。
其中, 该逻辑单元可以是预设的, 也可以由系统根据具体的服务请 求进行建立,即如图 4b所示,该数据容灾装置还可以包括建立单元 407, 如下:
该建立单元 407, 可以用于获取接入网关发送的服务请求; 根据该 服务请求选择多个节点, 根据选择的节点创建逻辑单元, 并确定逻辑单 元中各个节点的主备关系, 使得该逻辑单元包括主节点和多个备用节 点。
可选的, 为了提高数据的安全性, 在将主节点中的数据同步到各个 备用节点的同时, 还可以将数据备份到其他的存储设备, 比如分布式文 件系统中, 即如图 4b所示, 该数据容灾装置还可以包括备份单元 408, 如下:
该备份单元 408, 可以用于在该建立单元 407确定逻辑单元中各个 节点的主备关系之后, 按照预设策略从多个备用节点中选择一个备用节 点作为冷备节点, 通过该冷备节点以管道流的方式将数据备份至分布式 文件系统。 可选的, 为了提高数据的安全性, 还可以对该逻辑单元中的数据进 行异地备份, 比如, 若该逻辑单元位于 A地, 则可以在 B地对该逻辑单 元进行备份,从而使得当 A地的逻辑单元发生异常, 比如 A地发生灾难 时, 可以将 A地逻辑单元的业务切换至 B地的备份逻辑单元; 即如图 4b所示, 该数据容灾装置还可以包括异地容灾单元 409, 如下:
该异地容灾单元 409, 用于对该逻辑单元进行异地备份, 得到备用 逻辑单元, 在该逻辑单元发生故障时, 将该逻辑单元的业务切换至该备 用逻辑单元中。
此外,还可以利用 B地备份的数据对 A地的逻辑单元进行数据重建, 即:
该异地容灾单元 409, 还可以用于将该备用逻辑单元中各个节点的 数据同步至该逻辑单元内相应节点中; 当确定该逻辑单元和备用逻辑单 元中各个节点的数据之间的延迟小于预设值时, 将该备用逻辑单元设置 为只读, 并在确定该延迟等于 0, 将该业务切换回该逻辑单元中。
其中,该预设值可以根据实际应用的需求进行设置,在此不再赘述。 具体实施时, 以上各个单元可以作为独立的实体来实现, 也可以进 行任意组合, 作为同一或若干个实体来实现, 比如, 参见图 2c和图 3b 所示的实施例, 该数据容灾装置可以包括代理服务器、 调度器和逻辑单 元, 等等。 以上各个单元的具体实施可参见前面的方法实施例, 在此不 再赘述。
由上可知, 本实施例的数据容灾装置中的监控单元 401可以对逻辑 单元内各个节点进行监控,当监控到主节点发生异常时,由获取单元 402 分别获取多个备用节点的日志信息, 其中, 备用节点的日志信息包括备 用节点与主节点进行数据同步的时间点, 然后, 由选择单元 403选择时 间点最接近当前时间的备用节点作为目标节点, 并由更新单元 404将主 节点更新为该目标节点, 从而实现主备之间的切换; 由于该方案的逻辑 单元可以包括多个备用节点, 然后, 在主节点发生异常时, 从中选择出 具有最新数据的备用节点作为新的主节点, 因此, 可以保证切换前后, 原主节点与新的主节点之间数据的一致性; 而且, 由于该逻辑单元只有 一个主节点, 因此, 不会出现如现有技术中所面临的大规模主键冲突问 题, 可以不在业务层对各个设备进行区分, 实现更为简单, 可以大大提 高系统的可用性。
相应的, 本申请实施例还提供一种数据容灾系统, 包括本申请实施 例所提供的任一种数据容灾装置, 具体可参见上述实施例, 例如, 可以 如下:
数据容灾装置, 用于对逻辑单元内各个节点进行监控, 当监控到该 主节点发生异常时, 分别获取该多个备用节点的日志信息, 该备用节点 的日志信息包括备用节点与主节点进行数据同步的时间点, 从多个备用 节点中选择该时间点最接近当前时间的备用节点作为目标节点, 将主节 点更新为该目标节点。
例如, 该数据容灾装置, 具体可以用于对逻辑单元内各个节点的数 据库实例的运行、 逻辑单元内各个节点所执行的事务、 和 /或逻辑单元内 各个节点的硬件状态和核心程序运行状态进行监控, 等等。
该数据容灾装置的具体实施可参见前面的实施例, 在此不再赘述。 其中, 该数据容灾装置可以为多个, 具体数量可以根据实际应用的 需求而定。
此外, 该数据容灾系统还可以包括其他的设备, 比如, 还可以包括 接入网关, 下:
该接入网关, 用于接收用户设备发送的连接建立请求, 根据该连接 建立请求获取多个数据容灾装置的负载信息, 根据该负载信息选择匹配 的数据容灾装置, 建立该用户设备与匹配到的数据容灾装置之间的连接 关系; 以及, 接收用户设备发送的数据处理请求, 基于该连接关系将该 数据处理请求发送至相应的数据容灾装置中。
该数据容灾系统还可以包括用户设备, 用于向接入网关发送连接建 立请求, 以及发送数据处理请求等。
以上各个设备的具体搡作可参见前面的实施例, 在此不再赘述。 由于该数据容灾系统可以包括本申请实施例所提供的任一种数据容 灾装置, 因此, 可以实现本申请实施例所提供的任意一种数据容灾装置 所能实现的有益效果, 相机前面的实施例, 在此不再赘述。
此外, 本申请实施例还提供一种服务器, 如图 5所示, 其示出了本 申请实施例所涉及的服务器的结构示意图, 具体来讲:
该服务器可以包括一个或者一个以上处理核心的处理器 501、 一个 或一个以上计算机可读存储介^的存储器 502、 射频 (Radio Frequency, RF) 电路 503、 电源 504、 输入单元 505、 以及显示单元 506等部件。 本领域技术人员可以理解, 图 5中示出的服务器结构并不构成对服务器 的限定, 可以包括比图示更多或更少的部件, 或者组合某些部件, 或者 不同的部件布置。 其中:
处理器 501是该服务器的控制中心, 利用各种接口和线路连接整个 服务器的各个部分, 通过运行或执行存储在存储器 502内的软件程序和 /或模块, 以及调用存储在存储器 502内的数据, 执行服务器的各种功能 和处理数据, 从而对服务器进行整体监控。 可选的, 处理器 501可包括 一个或多个处理核心; 优选的, 处理器 501可集成应用处理器和调制解 调处理器, 其中, 应用处理器主要处理搡作系统、 用户界面和应用程序 等, 调制解调处理器主要处理无线通信。 可以理解的是, 上述调制解调 处理器也可以不集成到处理器 501 中。 存储器 502可用于存储软件程序以及模块, 处理器 501通过运行存 储在存储器 502的软件程序以及模块, 从而执行各种功能应用以及数据 处理。 存储器 502可主要包括存储程序区和存储数据区, 其中, 存储程 序区可存储搡作系统、 至少一个功能所需的应用程序 (比如声音播放功 能、 图像播放功能等) 等; 存储数据区可存储根据服务器的使用所创建 的数据等。 此外, 存储器 502可以包括高速随机存取存储器, 还可以包 括非易失性存储器, 例如至少一个磁盘存储器件、 闪存器件、 或其他易 失性固态存储器件。 相应地, 存储器 502还可以包括存储器控制器, 以 提供处理器 501对存储器 502的访问。
RF电路 503可用于收发信息过程中, 信号的接收和发送, 特别地, 将基站的下行信息接收后, 交由一个或者一个以上处理器 501处理; 另 外, 将涉及上行的数据发送给基站。 通常, RF 电路 503 包括但不限于 天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM) 卡、 收发信机、 耦合器、 低噪声放大器 (LNA, Low Noise Amplifier) , 双工器等。 此外, RF 电路 503 还可以通过无线通信与网络和其他设备 通信。 所述无线通信可以使用任一通信标准或协议, 包括但不限于全球 移动通讯系统 ( GSM, Global System of Mobile communication )、 通用 分组无线月良务 (GPRS , General Packet Radio Service ) , 码分多址 (CDMA, Code Division Multiple Access) , 宽带码分多址 (WCDMA, Wideband Code Division Multiple Access) ^ 长期演进 (LTE, Long Term Evolution )、 电子邮件、 短消息月良务 (SMS, Short Messaging Service) 等。
服务器还包括给各个部件供电的电源 504 (比如电池), 优选的, 电 源 504可以通过电源管理系统与处理器 501逻辑相连, 从而通过电源管 理系统实现管理充电、 放电、 以及功耗管理等功能。 电源 504还可以包 括一个或一个以上的直流或交流电源、再充电系统、 电源故障检测电路、 电源转换器或者逆变器、 电源状态指示器等任意组件。
该服务器还可包括输入单元 505, 该输入单元 505 可用于接收输入 的数字或字符信息, 以及产生与用户设置以及功能控制有关的键盘、 鼠 标、 搡作杆、 光学或者轨迹球信号输入。 具体地, 在一个具体的实施例 中, 输入单元 505可包括触敏表面以及其他输入设备。 触敏表面, 也称 为触摸显示屏或者触控板, 可收集用户在其上或附近的触摸搡作 (比如 用户使用手指、 触笔等任何适合的物体或附件在触敏表面上或在触敏表 面附近的搡作), 并根据预先设定的程式驱动相应的连接装置。 可选的, 触敏表面可包括触摸检测装置和触摸控制器两个部分。 其中, 触摸检测 装置检测用户的触摸方位, 并检测触摸搡作带来的信号, 将信号传送给 触摸控制器; 触摸控制器从触摸检测装置上接收触摸信息, 并将它转换 成触点坐标, 再送给处理器 501, 并能接收处理器 501 发来的命令并加 以执行。 此外, 可以采用电阻式、 电容式、 红外线以及表面声波等多种 类型实现触敏表面。 除了触敏表面, 输入单元 505还可以包括其他输入 设备。 具体地, 其他输入设备可以包括但不限于物理键盘、 功能键 (比 如音量控制按键、 开关按键等)、 轨迹球、 鼠标、 搡作杆等中的一种或 多种。
该服务器还可包括显示单元 506, 该显示单元 506可用于显示由用 户输入的信息或提供给用户的信息以及服务器的各种图形用户接口, 这 些图形用户接口可以由图形、 文本、 图标、 视频和其任意组合来构成。 显示单元 506可包括显示面板, 可选的, 可以采用液晶显示器 (LCD, Liquid Crystal Display) ^有机发光二极管 (OLED, Organic Light-Emitting Diode) 等形式来配置显示面板。 进一步的, 触敏表面可覆盖显示面板, 当触敏表面检测到在其上或附近的触摸搡作后, 传送给处理器 501以确 定触摸事件的类型, 随后处理器 501根据触摸事件的类型在显示面板上 提供相应的视觉输出。 虽然在图 5中, 触敏表面与显示面板是作为两个 独立的部件来实现输入和输入功能, 但是在某些实施例中, 可以将触敏 表面与显示面板集成而实现输入和输出功能。
尽管未示出, 服务器还可以包括摄像头、 蓝牙模块等, 在此不再赘 述。 具体在本实施例中, 服务器中的处理器 501会按照如下的指令, 将 一个或一个以上的应用程序的进程对应的可执行文件加载到存储器 502 中, 并由处理器 501来运行存储在存储器 502中的应用程序, 从而实现 各种功能, 如下:
对逻辑单元 (set) 内各个节点进行监控, 当监控到该主节点发生异 常时, 分别获取该多个备用节点的日志信息, 该备用节点的日志信息包 括该备用节点与主节点进行数据同步的时间点, 从多个备用节点中选择 该时间点最接近当前时间的备用节点作为目标节点, 将主节点更新为该 目标节点。
其中, 该逻辑单元可以是预设的, 也可以由系统根据具体的服务请 求进行建立, 即处理器 501还可以实现如下功能:
获取接入网关发送的服务请求, 根据该服务请求选择多个节点, 根 据选择的节点创建逻辑单元, 并确定逻辑单元中各个节点的主备关系, 使得该逻辑单元包括主节点和多个备用节点。
可选的, 为了提高数据的安全性, 在将主节点中的数据同步到各个 备用节点的同时, 还可以将数据备份到其他的存储设备, 比如分布式文 件系统中, 这样, 当该逻辑单元中的节点发生故障时, 便可以基于当天 的镜像文件以及指定时间点的日志信息, 将该逻辑单元节点中的数据快 速恢复到指定时间点。 即, 处理器 501还可以实现如下功能:
按照预设策略从多个备用节点中选择一个备用节点作为冷备节点, 通过该冷备节点以管道流的方式将数据备份至分布式文件系统。
可选的, 为了提高数据的安全性, 还可以对该逻辑单元中的数据进 行异地备份, 从而使得源逻辑单元发生异常时, 可以将源逻辑单元的业 务切换至备用逻辑单元; 即处理器 501还可以实现如下功能:
对该逻辑单元进行异地备份, 得到备用逻辑单元, 在该逻辑单元发 生故障时, 将该逻辑单元的业务切换至该备用逻辑单元中。
此外, 还可以利用备用逻辑单元对 A地的逻辑单元进行数据重建, 即处理器 501还可以实现如下功能:
将该备用逻辑单元中各个节点的数据同步至该逻辑单元内相应节点 中, 确定该逻辑单元和备用逻辑单元中各个节点的数据之间的延迟小于 预设值时, 将该备用逻辑单元设置为只读, 在确定该延迟等于 0时, 将 该业务切换回该逻辑单元中。
其中, 该预设值可以根据实际应用的需求进行设置, 在此不再赘述。 由上可知,本实施例的服务器可以对逻辑单元内各个节点进行监控, 当监控到主节点发生异常时, 分别获取多个备用节点的日志信息, 其中, 备用节点的日志信息包括备用节点与主节点进行数据同步的时间点, 然 后, 选择时间点最接近当前时间的备用节点作为目标节点, 并将主节点 更新为该目标节点, 从而实现主备之间的切换; 由于该方案的逻辑单元 可以包括多个备用节点, 然后, 在主节点发生异常时, 从中选择出具有 最新数据的备用节点作为新的主节点, 因此, 可以保证切换前后, 原主 节点与新的主节点之间数据的一致性; 而且, 由于该逻辑单元只有一个 主节点, 因此, 不会出现如现有技术中所面临的大规模主键冲突问题, 可以不在业务层对各个设备进行区分, 实现更为简单, 可以大大提高系 统的可用性。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部 分步骤是可以通过程序来指令相关的硬件来完成, 该程序可以存储于一 计算机可读存储介^中, 存储介 ^可以包括: 只读存储器 (ROM, Read Only Memory)、 随机存耳又记忆体 (RAM, Random Access Memory)、 磁 盘或光盘等。
以上对本申请实施例所提供的一种数据容灾方法、 装置和系统进行 了详细介绍, 本文中应用了具体个例对本申请的原理及实施方式进行了 阐述, 以上实施例的说明只是用于帮助理解本申请的方法及其核心思 想; 同时, 对于本领域的技术人员, 依据本申请的思想, 在具体实施方 式及应用范围上均会有改变之处, 综上所述, 本说明书内容不应理解为 对本申请的限制。

Claims

1. A data disaster recovery method, comprising:
monitoring each node in a logical unit, the nodes comprising a primary node and a plurality of standby nodes;
when an abnormality of the primary node is detected, respectively obtaining log information of the plurality of standby nodes, the log information of a standby node comprising a time point at which the standby node synchronizes data with the primary node;
selecting, from the plurality of standby nodes, one standby node whose time point is closest to the current time as a target node; and
updating the primary node to the target node.
2. The method according to claim 1, wherein the monitoring each node in the logical unit comprises:
monitoring the running of a database instance of each node in the logical unit; and/or,
monitoring transactions executed by each node in the logical unit; and/or,
monitoring a hardware state and a core program running state of each node in the logical unit.
3. The method according to claim 2, wherein the monitoring the running of the database instance of each node in the logical unit comprises:
periodically accessing the database instance in each node in the logical unit; and
if the database instance is unreadable and/or unwritable, determining that the corresponding node is abnormal.
4. The method according to claim 2, wherein the monitoring the hardware state and the core program running state of each node in the logical unit comprises:
periodically obtaining hardware state information and core program running state information of each node in the logical unit; and
when it is determined from the hardware state information that a node has a hardware fault, and/or when it is determined from the core program running state information that a node has a software fault, determining that the corresponding node is abnormal.
5. The method according to claim 2, wherein the monitoring the hardware state and the core program running state of each node in the logical unit comprises:
registering a temporary node corresponding to the logical unit, wherein the temporary node is deleted when the logical unit has a hardware fault and/or the core program becomes abnormal; and
when it is detected that the temporary node has disappeared, determining that the primary node is abnormal.
6. The method according to any one of claims 1 to 5, wherein before the respectively obtaining the log information of the plurality of standby nodes, the method further comprises:
demoting the primary node to a new standby node, and stopping the input/output interface of the synchronization log of that standby node.
7. The method according to any one of claims 1 to 5, wherein when the abnormality of the primary node is detected, the method further comprises:
obtaining log information of the primary node, the log information of the primary node comprising a time point at which the primary node updates data; and
repairing the data in the primary node according to the log information of the primary node.
8. The method according to claim 7, wherein before the obtaining the log information of the primary node, the method further comprises:
determining whether the data in the primary node has been synchronized to the standby nodes;
if so, performing the step of obtaining the log information of the primary node; and
if not, performing flashback processing on the log information corresponding to the data that has not been synchronized.
9. The method according to claim 8, wherein before the performing flashback processing on the log information corresponding to the data that has not been synchronized, the method further comprises:
determining whether the log information corresponding to the data that has not been synchronized can be flashed back;
if so, performing the step of performing flashback processing on the log information corresponding to the data that has not been synchronized; and
if not, pulling mirror data from the target node to reconstruct the data in the primary node.
10. The method according to any one of claims 1 to 5, wherein before the monitoring each node in the logical unit, the method further comprises:
obtaining a service request sent by an access gateway;
selecting a plurality of nodes according to the service request; and
creating the logical unit from the selected nodes, and determining a primary/standby relationship of each node in the logical unit, so that the logical unit comprises the primary node and the plurality of standby nodes.
11. The method according to claim 10, wherein after the determining the primary/standby relationship of each node in the logical unit, the method further comprises:
selecting one standby node from the plurality of standby nodes as a cold standby node according to a preset policy, and backing up data to a distributed file system through the cold standby node in a pipelined manner.
12. The method according to any one of claims 1 to 5, further comprising:
performing off-site backup of the logical unit to obtain a standby logical unit; and
when the logical unit fails, switching the service of the logical unit to the standby logical unit.
13. The method according to claim 12, wherein after the switching the service of the logical unit to the standby logical unit, the method further comprises:
synchronizing the data of each node in the standby logical unit to the corresponding node in the logical unit; and
when it is determined that the delay between the data of the corresponding nodes in the logical unit and the standby logical unit is less than a preset value, setting the standby logical unit to read-only, and when the delay is 0, switching the service back to the logical unit.
14. A data disaster recovery device, comprising a processor and a memory, the memory storing instructions executable by the processor, wherein when executing the instructions, the processor is configured to:
monitor each node in a logical unit, the nodes comprising a primary node and a plurality of standby nodes;
when an abnormality of the primary node is detected, respectively obtain log information of the plurality of standby nodes, the log information of a standby node comprising a time point at which the standby node synchronizes data with the primary node;
select, from the plurality of standby nodes, one standby node whose time point is closest to the current time as a target node; and
update the primary node to the target node.
15. The device according to claim 14, wherein when executing the instructions, the processor is further configured to:
monitor the running of a database instance of each node in the logical unit; and/or,
monitor transactions executed by each node in the logical unit; and/or,
monitor a hardware state and a core program running state of each node in the logical unit.
16. The device according to claim 15, wherein when executing the instructions, the processor is further configured to: periodically access the database instance in each node in the logical unit; and if the database instance is unreadable and/or unwritable, determine that the corresponding node is abnormal.
17. The device according to claim 15, wherein when executing the instructions, the processor is further configured to:
periodically obtain hardware state information and core program running state information of each node in the logical unit; and when it is determined from the hardware state information that a node has a hardware fault, and/or when it is determined from the core program running state information that a node has a software fault, determine that the corresponding node is abnormal.
18. The device according to claim 15, wherein when executing the instructions, the processor is further configured to: register a temporary node corresponding to the logical unit, wherein the temporary node is deleted when the logical unit has a hardware fault and/or the core program becomes abnormal; and when it is detected that the temporary node has disappeared, determine that the primary node is abnormal.
19. The device according to any one of claims 14 to 18, wherein when executing the instructions, the processor is further configured to: perform off-site backup of the logical unit to obtain a standby logical unit, and when the logical unit fails, switch the service of the logical unit to the standby logical unit.
20. The device according to claim 19, wherein when executing the instructions, the processor is further configured to: synchronize the data of each node in the standby logical unit to the corresponding node in the logical unit; and when it is determined that the delay between the data of the corresponding nodes in the logical unit and the standby logical unit is less than a preset value, set the standby logical unit to read-only, and when the delay is 0, switch the service back to the logical unit.
21. A data disaster recovery system, comprising the data disaster recovery device according to any one of claims 14 to 20.
22. The system according to claim 21, further comprising an access gateway, wherein the access gateway is configured to receive a connection establishment request sent by a user equipment, obtain load information of a plurality of data disaster recovery devices according to the connection establishment request, select a matching data disaster recovery device according to the load information, and establish a connection relationship between the user equipment and the matched data disaster recovery device; and to receive a data processing request sent by the user equipment and send the data processing request to the corresponding data disaster recovery device based on the connection relationship.
PCT/CN2017/086105 2016-07-27 2017-05-26 一种数据容灾方法、装置和系统 WO2018019023A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17833328.2A EP3493471B1 (en) 2016-07-27 2017-05-26 Data disaster recovery method, apparatus and system
US16/203,376 US10713135B2 (en) 2016-07-27 2018-11-28 Data disaster recovery method, device and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610603383.0 2016-07-27
CN201610603383.0A CN106254100B (zh) 2016-07-27 2016-07-27 一种数据容灾方法、装置和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/203,376 Continuation-In-Part US10713135B2 (en) 2016-07-27 2018-11-28 Data disaster recovery method, device and system

Publications (1)

Publication Number Publication Date
WO2018019023A1 true WO2018019023A1 (zh) 2018-02-01

Family

ID=57604892

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/086105 WO2018019023A1 (zh) 2016-07-27 2017-05-26 一种数据容灾方法、装置和系统

Country Status (4)

Country Link
US (1) US10713135B2 (zh)
EP (1) EP3493471B1 (zh)
CN (1) CN106254100B (zh)
WO (1) WO2018019023A1 (zh)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918360A (zh) * 2019-02-28 2019-06-21 携程旅游信息技术(上海)有限公司 数据库平台系统、创建方法、管理方法、设备及存储介质
CN109936642A (zh) * 2019-01-28 2019-06-25 中国银行股份有限公司 一种分布式系统中生成机器id的方法、装置及系统
EP3617886A1 (en) * 2018-08-30 2020-03-04 Baidu Online Network Technology (Beijing) Co., Ltd. Hot backup system, hot backup method, and computer device
CN111092754A (zh) * 2019-11-29 2020-05-01 贝壳技术有限公司 实时接入服务系统及其实现方法
CN111400404A (zh) * 2020-03-18 2020-07-10 中国建设银行股份有限公司 一种节点初始化方法、装置、设备及存储介质
CN112202853A (zh) * 2020-09-17 2021-01-08 杭州安恒信息技术股份有限公司 数据同步方法、系统、计算机设备和存储介质
CN112231407A (zh) * 2020-10-22 2021-01-15 北京人大金仓信息技术股份有限公司 PostgreSQL数据库的DDL同步方法、装置、设备和介质
CN112667440A (zh) * 2020-12-28 2021-04-16 紫光云技术有限公司 一种高可用MySQL的异地灾备方法
CN112769634A (zh) * 2020-12-09 2021-05-07 航天信息股份有限公司 一种基于Zookeeper的可横向扩展的分布式系统及开发方法
CN112783694A (zh) * 2021-02-01 2021-05-11 紫光云技术有限公司 一种高可用Redis的异地灾备方法
CN113312384A (zh) * 2020-02-26 2021-08-27 阿里巴巴集团控股有限公司 图数据的查询处理方法、装置及电子设备
CN114598711A (zh) * 2022-03-29 2022-06-07 百果园技术(新加坡)有限公司 一种数据迁移方法、装置、设备及介质
CN113489601B (zh) * 2021-06-11 2024-05-14 海南视联通信技术有限公司 基于视联网自治云网络架构的抗毁方法和装置

Families Citing this family (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254100B (zh) 2016-07-27 2019-04-16 腾讯科技(深圳)有限公司 一种数据容灾方法、装置和系统
CN107689879A (zh) * 2016-08-04 2018-02-13 中兴通讯股份有限公司 虚拟网元的管理方法及装置
WO2018049552A1 (en) * 2016-09-13 2018-03-22 Thomson Licensing Method and apparatus for controlling network sensors
CN106844145A (zh) * 2016-12-29 2017-06-13 北京奇虎科技有限公司 一种服务器硬件故障预警方法和装置
CN106815097A (zh) * 2017-01-18 2017-06-09 北京许继电气有限公司 数据库容灾系统和方法
CN107071351B (zh) * 2017-03-30 2019-11-05 杭州瑞网广通信息技术有限公司 一种车站多级容灾架构及方法
CN106911524B (zh) * 2017-04-27 2020-07-07 新华三信息技术有限公司 一种ha实现方法及装置
CN108964948A (zh) * 2017-05-19 2018-12-07 北京金山云网络技术有限公司 主从服务系统、主节点故障恢复方法及装置
CN108984569A (zh) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 数据库切换方法、系统和计算机可读存储介质
CN107070753A (zh) * 2017-06-15 2017-08-18 郑州云海信息技术有限公司 一种分布式集群系统的数据监控方法、装置及系统
CN109257404B (zh) * 2017-07-14 2022-04-05 迈普通信技术股份有限公司 数据备份方法、装置及系统
CN109274986B (zh) * 2017-07-17 2021-02-12 中兴通讯股份有限公司 多中心容灾方法、系统、存储介质和计算机设备
CN107577700B (zh) * 2017-07-26 2020-11-10 创新先进技术有限公司 数据库容灾的处理方法及装置
CN109308643B (zh) * 2017-07-27 2022-04-08 阿里巴巴集团控股有限公司 一种打底数据生成方法、数据容灾方法及相关设备
CN110019063B (zh) * 2017-08-15 2022-07-05 厦门雅迅网络股份有限公司 计算节点数据容灾回放的方法、终端设备及存储介质
CN107708085B (zh) * 2017-08-29 2020-11-13 深圳市盛路物联通讯技术有限公司 一种中继器保障方法及接入点
US10705880B2 (en) * 2017-09-22 2020-07-07 Vmware, Inc. Cluster updating using temporary update-monitor pod
CN110377570B (zh) * 2017-10-12 2021-06-11 腾讯科技(深圳)有限公司 节点切换方法、装置、计算机设备及存储介质
CN110019515A (zh) * 2017-11-10 2019-07-16 中兴通讯股份有限公司 数据库切换方法、装置、系统及计算机可读存储介质
CN107846476B (zh) * 2017-12-18 2020-06-16 东软集团股份有限公司 一种信息同步方法、设备及存储介质
CN110071870B (zh) * 2018-01-24 2022-03-18 苏宁云商集团股份有限公司 基于Alluxio的多HDFS集群的路由方法及装置
CN108628717A (zh) * 2018-03-02 2018-10-09 北京辰森世纪科技股份有限公司 一种数据库系统及监控方法
JP6866532B2 (ja) * 2018-03-26 2021-04-28 株式会社Fuji スレーブ、作業機、及びログ情報を記憶する方法
CN110348826B (zh) * 2018-04-08 2024-05-10 财付通支付科技有限公司 异地多活容灾方法、系统、设备及可读存储介质
CN108717384B (zh) * 2018-05-18 2021-11-23 创新先进技术有限公司 一种数据备份方法及装置
CN108984337B (zh) * 2018-05-29 2021-04-16 杭州网易再顾科技有限公司 一种数据同步异常的修复方法、修复装置、介质和计算设备
CN108881452B (zh) * 2018-06-27 2021-11-16 咪咕文化科技有限公司 一种数据同步的方法、装置及存储介质
CN109117312B (zh) * 2018-08-23 2022-03-01 北京小米智能科技有限公司 数据恢复方法及装置
CN110858168B (zh) * 2018-08-24 2023-08-18 浙江宇视科技有限公司 集群节点故障处理方法、装置及集群节点
CN109635038B (zh) * 2018-11-20 2022-08-19 福建亿榕信息技术有限公司 一种结构化数据异地双读写方法
CN109714394B (zh) * 2018-12-05 2021-11-09 深圳店匠科技有限公司 跨境多服务端的信息同步方法、系统和存储介质
CN111342986B (zh) * 2018-12-19 2022-09-16 杭州海康威视系统技术有限公司 分布式节点管理方法及装置、分布式系统、存储介质
US10812320B2 (en) * 2019-03-01 2020-10-20 At&T Intellectual Property I, L.P. Facilitation of disaster recovery protection for a master softswitch
CN110196832A (zh) * 2019-06-04 2019-09-03 北京百度网讯科技有限公司 用于获取快照信息的方法及装置
CN110659157A (zh) * 2019-08-30 2020-01-07 安徽芃睿科技有限公司 一种无损恢复的分布式多语种检索平台及其方法
CN112822227B (zh) * 2019-11-15 2022-02-25 北京金山云网络技术有限公司 分布式存储系统的数据同步方法、装置、设备及存储介质
CN111030846A (zh) * 2019-11-18 2020-04-17 杭州趣链科技有限公司 一种基于区块链的数据上链异常重试方法
CN112948484A (zh) * 2019-12-11 2021-06-11 中兴通讯股份有限公司 分布式数据库系统和数据灾备演练方法
CN111125738A (zh) * 2019-12-26 2020-05-08 深圳前海环融联易信息科技服务有限公司 防止核心数据丢失的方法、装置、计算机设备及存储介质
CN111352959B (zh) * 2020-02-28 2023-04-28 中国工商银行股份有限公司 数据同步补救、存储方法及集群装置
CN115088235B (zh) * 2020-03-17 2024-05-28 深圳市欢太科技有限公司 主节点选取方法、装置、电子设备以及存储介质
US11392617B2 (en) * 2020-03-26 2022-07-19 International Business Machines Corporation Recovering from a failure of an asynchronous replication node
CN111581287A (zh) * 2020-05-07 2020-08-25 上海茂声智能科技有限公司 一种数据库管理的控制方法、系统和存储介质
CN111708787A (zh) * 2020-05-07 2020-09-25 中国人民财产保险股份有限公司 多中心业务数据管理系统
CN111866094B (zh) * 2020-07-01 2023-10-31 天津联想超融合科技有限公司 一种定时任务处理方法、节点及计算机可读存储介质
CN112069018B (zh) * 2020-07-21 2024-05-31 上海瀚银信息技术有限公司 一种数据库高可用方法及系统
CN114079612B (zh) * 2020-08-03 2024-06-04 阿里巴巴集团控股有限公司 容灾系统及其管控方法、装置、设备、介质
CN113761075A (zh) * 2020-09-01 2021-12-07 北京沃东天骏信息技术有限公司 切换数据库的方法、装置、设备和计算机可读介质
CN112131191B (zh) * 2020-09-28 2023-05-26 浪潮商用机器有限公司 一种namenode文件系统的管理方法、装置及设备
CN112256497B (zh) * 2020-10-28 2023-05-12 重庆紫光华山智安科技有限公司 一种通用的高可用服务实现方法、系统、介质及终端
CN112437146B (zh) * 2020-11-18 2022-10-14 青岛海尔科技有限公司 一种设备状态同步方法、装置及系统
CN112416655A (zh) * 2020-11-26 2021-02-26 深圳市中博科创信息技术有限公司 一种基于企业服务门户的存储灾备系统及数据复制方法
CN112492030B (zh) * 2020-11-27 2024-03-15 北京青云科技股份有限公司 数据存储方法、装置、计算机设备和存储介质
CN112732999B (zh) * 2021-01-21 2023-06-09 建信金融科技有限责任公司 静态容灾方法、系统、电子设备及存储介质
CN113055461B (zh) * 2021-03-09 2022-08-30 中国人民解放军军事科学院国防科技创新研究院 一种基于ZooKeeper的无人集群分布式协同指挥控制方法
CN112988475B (zh) * 2021-04-28 2022-10-28 厦门亿联网络技术股份有限公司 容灾测试方法、装置、测试服务器及介质
BE1029472B1 (nl) * 2021-06-08 2023-01-16 Unmatched Bv Werkwijze voor opvolgen en afhandelen van alarmen in een alarmeringssysteem en een alarmeringssysteem
CN113472566A (zh) * 2021-06-11 2021-10-01 北京市大数据中心 一种联盟区块链的状态监控方法及主节点状态监控系统
US11604768B2 (en) 2021-06-29 2023-03-14 International Business Machines Corporation Optimizing resources in a disaster recovery cleanup process
RU2771211C1 (ru) * 2021-07-12 2022-04-28 Акционерное общество "Научно-исследовательский институт "Субмикрон" (АО "НИИ "Субмикрон") Вычислительная система с холодным резервом
CN113836179B (zh) * 2021-08-23 2023-10-27 辽宁振兴银行股份有限公司 一种交易读写分离装置
CN114095343A (zh) * 2021-11-18 2022-02-25 深圳壹账通智能科技有限公司 基于双活系统的容灾方法、装置、设备及存储介质
CN114201117B (zh) * 2021-12-22 2023-09-01 苏州浪潮智能科技有限公司 缓存数据的处理方法、装置、计算机设备及存储介质
CN114338370A (zh) * 2022-01-10 2022-04-12 北京金山云网络技术有限公司 Ambari的高可用方法、系统、装置、电子设备和存储介质
CN114844951B (zh) * 2022-04-22 2024-03-19 百果园技术(新加坡)有限公司 请求处理方法、系统、设备、存储介质及产品
CN114785849A (zh) * 2022-04-27 2022-07-22 郑州小鸟信息科技有限公司 一种基于多级节点网络实现的应用高可用方法
CN114860782B (zh) * 2022-07-04 2022-10-28 北京世纪好未来教育科技有限公司 数据查询方法、装置、设备及介质
CN115277379B (zh) * 2022-07-08 2023-08-01 北京城市网邻信息技术有限公司 分布式锁容灾处理方法、装置、电子设备及存储介质
CN115396327B (zh) * 2022-09-13 2023-11-21 中国农业银行股份有限公司 系统访问切换方法与装置
CN115567395B (zh) * 2022-11-10 2023-03-07 苏州浪潮智能科技有限公司 一种主从节点确定方法、装置、电子设备及存储介质
CN116069792A (zh) * 2022-12-06 2023-05-05 北京奥星贝斯科技有限公司 一种数据库容灾系统、方法、装置、存储介质及电子设备
CN115934428B (zh) * 2023-01-10 2023-05-23 湖南三湘银行股份有限公司 一种mysql数据库的主灾备切换方法、装置及电子设备
CN116595085B (zh) * 2023-07-17 2023-09-29 上海爱可生信息技术股份有限公司 数据库主从切换方法和分布式数据库
CN116991635B (zh) * 2023-09-26 2024-01-19 武汉吧哒科技股份有限公司 数据同步方法和数据同步装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679604A (zh) * 2015-02-12 2015-06-03 大唐移动通信设备有限公司 一种主节点和备节点切换的方法和装置
US20150256383A1 (en) * 2014-03-05 2015-09-10 Electronics And Telecommunications Research Institute Method for transiting operation mode of routing processor
US20160165463A1 (en) * 2014-12-03 2016-06-09 Fortinet, Inc. Stand-by controller assisted failover
CN106254100A (zh) * 2016-07-27 2016-12-21 腾讯科技(深圳)有限公司 一种数据容灾方法、装置和系统

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8689043B1 (en) * 2003-06-30 2014-04-01 Symantec Operating Corporation Fast failover with multiple secondary nodes
US7523110B2 (en) * 2005-03-03 2009-04-21 Gravic, Inc. High availability designated winner data replication
US8301593B2 (en) * 2008-06-12 2012-10-30 Gravic, Inc. Mixed mode synchronous and asynchronous replication system
CN101729290A (zh) * 2009-11-04 2010-06-09 中兴通讯股份有限公司 用于实现业务系统保护的方法及装置
US8572031B2 (en) * 2010-12-23 2013-10-29 Mongodb, Inc. Method and apparatus for maintaining replica sets
CN104036043B (zh) * 2014-07-01 2017-05-03 浪潮(北京)电子信息产业有限公司 一种mysql高可用的方法及管理节点
CN104537046B (zh) * 2014-12-24 2018-09-11 北京奇虎科技有限公司 数据补全方法和装置
CN104933132B (zh) * 2015-06-12 2019-11-19 深圳巨杉数据库软件有限公司 基于操作序列号的分布式数据库有权重选举方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150256383A1 (en) * 2014-03-05 2015-09-10 Electronics And Telecommunications Research Institute Method for transiting operation mode of routing processor
US20160165463A1 (en) * 2014-12-03 2016-06-09 Fortinet, Inc. Stand-by controller assisted failover
CN104679604A (zh) * 2015-02-12 2015-06-03 大唐移动通信设备有限公司 一种主节点和备节点切换的方法和装置
CN106254100A (zh) * 2016-07-27 2016-12-21 腾讯科技(深圳)有限公司 一种数据容灾方法、装置和系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3493471A4 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3617886A1 (en) * 2018-08-30 2020-03-04 Baidu Online Network Technology (Beijing) Co., Ltd. Hot backup system, hot backup method, and computer device
US11397647B2 (en) 2018-08-30 2022-07-26 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Hot backup system, hot backup method, and computer device
CN109936642A (zh) * 2019-01-28 2019-06-25 中国银行股份有限公司 一种分布式系统中生成机器id的方法、装置及系统
CN109918360A (zh) * 2019-02-28 2019-06-21 携程旅游信息技术(上海)有限公司 数据库平台系统、创建方法、管理方法、设备及存储介质
CN111092754A (zh) * 2019-11-29 2020-05-01 贝壳技术有限公司 实时接入服务系统及其实现方法
CN111092754B (zh) * 2019-11-29 2022-07-29 贝壳技术有限公司 实时接入服务系统及其实现方法
CN113312384A (zh) * 2020-02-26 2021-08-27 阿里巴巴集团控股有限公司 图数据的查询处理方法、装置及电子设备
CN113312384B (zh) * 2020-02-26 2023-12-26 阿里巴巴集团控股有限公司 图数据的查询处理方法、装置及电子设备
CN111400404A (zh) * 2020-03-18 2020-07-10 中国建设银行股份有限公司 一种节点初始化方法、装置、设备及存储介质
CN112202853A (zh) * 2020-09-17 2021-01-08 杭州安恒信息技术股份有限公司 数据同步方法、系统、计算机设备和存储介质
CN112202853B (zh) * 2020-09-17 2022-07-22 杭州安恒信息技术股份有限公司 数据同步方法、系统、计算机设备和存储介质
CN112231407B (zh) * 2020-10-22 2023-09-15 北京人大金仓信息技术股份有限公司 PostgreSQL数据库的DDL同步方法、装置、设备和介质
CN112231407A (zh) * 2020-10-22 2021-01-15 北京人大金仓信息技术股份有限公司 PostgreSQL数据库的DDL同步方法、装置、设备和介质
CN112769634A (zh) * 2020-12-09 2021-05-07 航天信息股份有限公司 一种基于Zookeeper的可横向扩展的分布式系统及开发方法
CN112769634B (zh) * 2020-12-09 2023-11-07 航天信息股份有限公司 一种基于Zookeeper的可横向扩展的分布式系统及开发方法
CN112667440A (zh) * 2020-12-28 2021-04-16 紫光云技术有限公司 一种高可用MySQL的异地灾备方法
CN112783694A (zh) * 2021-02-01 2021-05-11 紫光云技术有限公司 一种高可用Redis的异地灾备方法
CN113489601B (zh) * 2021-06-11 2024-05-14 海南视联通信技术有限公司 基于视联网自治云网络架构的抗毁方法和装置
CN114598711A (zh) * 2022-03-29 2022-06-07 百果园技术(新加坡)有限公司 一种数据迁移方法、装置、设备及介质
CN114598711B (zh) * 2022-03-29 2024-04-16 百果园技术(新加坡)有限公司 一种数据迁移方法、装置、设备及介质

Also Published As

Publication number Publication date
CN106254100B (zh) 2019-04-16
EP3493471A1 (en) 2019-06-05
EP3493471A4 (en) 2019-06-05
EP3493471B1 (en) 2020-07-29
US10713135B2 (en) 2020-07-14
CN106254100A (zh) 2016-12-21
US20190095293A1 (en) 2019-03-28

Similar Documents

Publication Publication Date Title
WO2018019023A1 (zh) 一种数据容灾方法、装置和系统
US9715522B2 (en) Information processing apparatus and control method
US9031910B2 (en) System and method for maintaining a cluster setup
US10565071B2 (en) Smart data replication recoverer
US11892922B2 (en) State management methods, methods for switching between master application server and backup application server, and electronic devices
WO2016202051A1 (zh) 一种通信系统中管理主备节点的方法和装置及高可用集群
WO2016070375A1 (zh) 一种分布式存储复制系统和方法
CN110224871A (zh) 一种Redis集群的高可用方法及装置
JP2017528809A (ja) 記憶不具合後の安全なデータアクセス
JP2009507280A (ja) Id保存を介するエンタープライズサービス利用可能性
KR20110044858A (ko) 데이터 센터들에 걸쳐 데이터 서버들내 데이터 무결정의 유지
WO2022036901A1 (zh) 一种Redis副本集的实现方法及装置
US7069317B1 (en) System and method for providing out-of-band notification of service changes
TWI677797B (zh) 主備資料庫的管理方法、系統及其設備
WO2018010501A1 (zh) 全局事务标识gtid的同步方法、装置及系统、存储介质
US20120278429A1 (en) Cluster system, synchronization controlling method, server, and synchronization controlling program
US10097630B2 (en) Transferring data between sites
CN106294795A (zh) 一种数据库切换方法及系统
CN111865632A (zh) 分布式数据存储集群的切换方法及切换指令发送方法和装置
US20130205108A1 (en) Managing reservation-control in a storage system
CN112783694B (zh) 一种高可用Redis的异地灾备方法
KR20160004721A (ko) 데이터 손실 없는 데이터베이스 리두 로그 이중화 방법 및 그를 위한 시스템
US20240061754A1 (en) Management of logs and cache for a graph database
US20120191645A1 (en) Information processing apparatus and database system
CN116668269A (zh) 一种用于双活数据中心的仲裁方法、装置及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17833328

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017833328

Country of ref document: EP

Effective date: 20190227