CN116302691A - Disaster recovery method, device and system - Google Patents


Info

Publication number: CN116302691A
Authority: CN (China)
Prior art keywords: machine room, data, database instance, node, metadata
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202310181688.7A
Other languages: Chinese (zh)
Inventors: 郭鹏 (Guo Peng), 杜众舒 (Du Zhongshu), 刘图明 (Liu Tuming)
Current and original assignee: Alibaba Cloud Computing Ltd
Application filed by Alibaba Cloud Computing Ltd
Priority: CN202310181688.7A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1458: Management of the backup or restore process
    • G06F 11/1461: Backup scheduling policy
    • G06F 11/1469: Backup restoration techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 10/00: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A 10/40: Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Abstract

Embodiments of this specification provide a disaster recovery method, device, and system. The disaster recovery method is applied to a scheduling platform that provides a single set of management services for a first machine room and a second machine room, and includes: when the first machine room fails, invoking the switching service of the second machine room and updating the data storage nodes and the metadata database instance nodes in the second machine room; invoking the management and control service unit in the second machine room, distributing the data processing tasks corresponding to the first machine room to the second machine room, and updating the database instance nodes in the second machine room; synchronizing data to the second machine room based on the updated data storage nodes, metadata database instance nodes, and database instance nodes; and, according to the data synchronization result, sending the second machine room an execution instruction for executing the metadata processing tasks and the data processing tasks.

Description

Disaster recovery method, device and system
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a disaster recovery method, device, and system.
Background
With the rapid development of computer technology, more and more data are stored and processed electronically, and the data management system has become key to enterprise operation. While electronic storage and processing improve convenience, data are easily lost or damaged once a disaster strikes the data management system. In the prior art, a primary machine room and a standby machine room typically provide the data management service, each deployed with its own independent data service; when the primary machine room fails, the standby machine room takes over its work. However, this deployment mode requires deploying and managing two data services, which raises deployment cost and operation-and-maintenance complexity, and primary-standby switching is slow. A disaster recovery method is therefore needed to solve these problems.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a disaster recovery method. One or more embodiments of the present disclosure relate to a disaster recovery device, a disaster recovery system, a computing device, a computer-readable storage medium, and a computer program for solving the technical drawbacks of the prior art.
According to a first aspect of the embodiments of this specification, a disaster recovery method is provided and applied to a scheduling platform, where the scheduling platform provides one set of management services for a first machine room and a second machine room, and the method includes:
when the first machine room fails, invoking the switching service of the second machine room, and updating the data storage nodes and the metadata database instance nodes in the second machine room;
invoking the management and control service unit in the second machine room, distributing the data processing tasks corresponding to the first machine room to the second machine room, and updating the database instance nodes in the second machine room;
synchronizing data to the second machine room based on the updated data storage nodes, metadata database instance nodes, and database instance nodes in the second machine room; and
sending, to the second machine room according to the data synchronization result, an execution instruction for executing the metadata processing tasks and the data processing tasks.
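The four claimed steps can be sketched as a small orchestration routine. This is an illustrative sketch only: the class names, node names, and task names below are assumptions, not part of the patent.

```python
# Hypothetical sketch of the claimed failover flow: one scheduling
# platform managing both machine rooms with a single set of services.

class Room:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.serving_storage_node = None   # which storage node serves externally
        self.serving_metadata_node = None  # which metadata node serves externally
        self.tasks = []                    # pending data-processing tasks

class SchedulingPlatform:
    """One shared set of management services for both rooms."""

    def failover(self, primary, standby):
        if primary.healthy:
            return "no action"
        # Step 1: call the standby room's switching service and promote
        # one data storage node and one metadata database instance node.
        standby.serving_storage_node = "storage-0"
        standby.serving_metadata_node = "meta-0"
        # Step 2: redistribute the failed room's data-processing tasks
        # via the standby room's management and control service unit.
        standby.tasks.extend(primary.tasks)
        # Step 3: data synchronization through the promoted nodes is
        # assumed to complete here (Paxos replication in the patent).
        synced = True
        # Step 4: issue the execution instruction once data are in sync.
        return "execute" if synced else "wait"

primary, standby = Room("room-1"), Room("room-2")
primary.tasks = ["task-a", "task-b"]
primary.healthy = False
result = SchedulingPlatform().failover(primary, standby)
```

After the call, `result` is `"execute"` and the standby room holds the redistributed tasks and the promoted node roles.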
According to a second aspect of the embodiments of this specification, a disaster recovery device is provided, including:
an updating module configured to, when the first machine room fails, invoke the switching service of the second machine room and update the data storage nodes and the metadata database instance nodes in the second machine room;
a distribution module configured to invoke the management and control service unit in the second machine room, distribute the data processing tasks corresponding to the first machine room to the second machine room, and update the database instance nodes in the second machine room;
a synchronization module configured to synchronize data to the second machine room based on the updated data storage nodes, metadata database instance nodes, and database instance nodes in the second machine room; and
a sending module configured to send, to the second machine room according to the data synchronization result, an execution instruction for executing the metadata processing tasks and the data processing tasks.
According to a third aspect of embodiments of the present disclosure, there is provided a disaster recovery system, including:
the system comprises a first machine room, a second machine room and a dispatching platform, wherein the dispatching platform stores data synchronization executable instructions, and the data synchronization executable instructions are used for realizing the steps of the disaster recovery method when being executed by the dispatching platform and distributing data and data processing tasks stored in the first machine room to the second machine room.
According to a fourth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which when executed by the processor, implement the steps of the disaster recovery method described above.
According to a fifth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the disaster recovery method described above.
According to a sixth aspect of the embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the disaster recovery method described above.
The disaster recovery method provided by the embodiments of this specification is applied to a scheduling platform that provides one set of management services for a first machine room and a second machine room. When the first machine room fails, the switching service of the second machine room is invoked to update the data storage nodes and the metadata database instance nodes in the second machine room; the management and control service unit in the second machine room is invoked to distribute the data processing tasks corresponding to the first machine room to the second machine room and to update the database instance nodes in the second machine room; data are synchronized to the second machine room based on the updated data storage nodes, metadata database instance nodes, and database instance nodes; and, according to the data synchronization result, an execution instruction for executing the metadata processing tasks and the data processing tasks is sent to the second machine room.
Because the first machine room and the second machine room use the same set of management services provided by the scheduling platform, service deployment cost and operation-and-maintenance complexity are reduced. When the first machine room fails, the switching service of the second machine room can be invoked directly so that the second machine room replaces the first machine room and continues to provide service externally, which improves switching speed and failure recovery efficiency while meeting the switching requirements of a disaster recovery system.
Drawings
FIG. 1 is a schematic diagram of a disaster recovery method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a disaster recovery method according to one embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a disaster recovery method according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of a disaster recovery system according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a disaster recovery device according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of this specification. However, this specification can be implemented in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; therefore, this specification is not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a "first" may also be referred to as a "second", and similarly a "second" as a "first", without departing from the scope of one or more embodiments of this specification. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present specification will be explained.
K8S (Kubernetes) is a portable, extensible, open-source platform for managing containerized workloads and services, facilitating declarative configuration and automation.
DBaaS (Database as a Service): here, a database service provided by a cloud vendor. Compared with a traditional self-built database, a cloud vendor's DBaaS includes not only the database kernel but also a complete ecosystem around it, including lifecycle management, backup and recovery, and manual or automatic operation-and-maintenance services.
RPO (Recovery Point Objective): the recovery point objective, measured in time: the point in time to which the system and its data must be restored after a disaster. RPO marks the maximum amount of data loss the system can tolerate; the less data loss the system tolerates, the smaller the RPO.
RTO (Recovery Time Objective): the recovery time objective, measured in time: the time within which the information system must be restored after a disaster, counted from the moment service stops. RTO marks the longest service outage the system can tolerate; the more urgent the requirements on the system's services, the smaller the RTO.
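A toy calculation illustrates the two metrics side by side. The timestamps below are invented for illustration only.

```python
# RPO vs. RTO with assumed timestamps: RPO measures how much recent
# data is lost; RTO measures how long the service stays down.
from datetime import datetime, timedelta

last_replicated  = datetime(2023, 1, 1, 12, 0, 0)   # last committed sync point
disaster_at      = datetime(2023, 1, 1, 12, 0, 30)  # failure occurs
service_restored = datetime(2023, 1, 1, 12, 2, 30)  # standby takes over

rpo = disaster_at - last_replicated     # data written in this window is lost
rto = service_restored - disaster_at    # duration of the service outage

print(rpo)  # 0:00:30 -> up to 30 seconds of data loss
print(rto)  # 0:02:00 -> 2 minutes of outage
```

Zero data loss, as claimed for the same-city three-machine-room deployment, corresponds to an RPO of zero.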
Paxos protocol: one of the few strongly consistent, highly available, decentralized distributed protocols proven in engineering practice. In Paxos there is a set of completely peer participating nodes; each node votes on a proposal, and a proposal takes effect once more than half of the nodes agree to it. As long as more than half of the nodes work normally, the Paxos protocol copes well with abnormal conditions such as node crashes and network partitions.
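The "more than half" rule can be sketched as a quorum check. This is a minimal illustration of the majority condition only, not an implementation of the full Paxos protocol.

```python
# Majority-quorum check as used by Paxos-style protocols: a proposal
# takes effect only when a strict majority of nodes has accepted it.

def quorum_reached(acks: int, cluster_size: int) -> bool:
    """True when more than half of the cluster's nodes have accepted."""
    return acks > cluster_size // 2

# A 3-node cluster (as in the XDB instances described later) tolerates
# one failed node; a 5-node cluster tolerates two.
three_ok = quorum_reached(2, 3)   # 2 of 3 acks -> committed
three_no = quorum_reached(1, 3)   # 1 of 3 acks -> not committed
five_ok  = quorum_reached(3, 5)   # 3 of 5 acks -> committed
```

This is why an XDB cluster needs at least three nodes in the normal state: with fewer, no strict majority can survive a node loss.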
Three machine rooms in the same city: a proprietary-cloud deployment model characterized by: 1. the three machine rooms are interconnected; 2. being in the same city, the primary and standby machine rooms have low latency (network latency under 1.5 ms); 3. the third machine room is a lightweight machine room in which no cloud base or ecosystem services are deployed; it is used only for databases. Together with the primary and standby machine rooms, the third machine room forms a lightweight distributed disaster recovery environment, reducing the cost of disaster recovery construction while achieving zero data loss.
Multi-active: an application deployment mode in which multiple process replicas run on different physical machines without distinguishing roles such as primary and standby; every replica provides service.
Primary-standby: an application deployment mode in which one or more of the multiple process replicas act as the primary node and provide service, while the rest are standby nodes serving as data backups of the primary; in some applications the primary and standby roles can switch with each other.
XDB: a product form of RDS-MySQL; a clustered MySQL instance that uses the Paxos protocol to guarantee data consistency among multiple nodes, with at least three nodes in the normal state.
Fig. 1 is a schematic structural diagram of a disaster recovery method provided in an embodiment of this specification. As shown in Fig. 1, a resource scheduling platform provides the same set of management services for a first machine room and a second machine room. The first machine room corresponds to a first switching service, a management service layer, database instances, data storage nodes, and metadata database instance nodes; correspondingly, the second machine room corresponds to a second switching service, a management service layer, database instances, data storage nodes, and metadata database instance nodes. Data synchronization is implemented via the Paxos protocol between the database instances of the two machine rooms, between their data storage nodes, and between their metadata database instance nodes. When the first machine room fails, the resource scheduling platform invokes the switching service of the second machine room and updates the data storage nodes and the metadata database instance nodes in the second machine room; invokes the management and control service unit in the second machine room, distributes the data processing tasks corresponding to the first machine room to the second machine room, and updates the database instance nodes in the second machine room; synchronizes data to the second machine room based on the updated data storage nodes, metadata database instance nodes, and database instance nodes; and, according to the data synchronization result, sends the second machine room an execution instruction for executing the metadata processing tasks and the data processing tasks.
A single set of management services manages all database instances and control components, and metadata is physically replicated. Given the low latency between the first and second machine rooms in the same city (typically under 1.5 milliseconds), rapid data synchronization can be guaranteed and data loss avoided as much as possible. Kernel-level physical replication requires no external components and is safe and efficient. Using one set of management services also keeps the control components multi-active, minimizes the time consumed by control-service switching, and reduces operation-and-maintenance cost.
In the present specification, a disaster recovery method is provided, and the present specification relates to a disaster recovery device, a disaster recovery system, a computing device, a computer readable storage medium and a computer program, which are described in detail in the following embodiments one by one.
Referring to Fig. 2, Fig. 2 shows a flowchart of a disaster recovery method according to an embodiment of this specification. The method is applied to a scheduling platform that provides one set of management services for a first machine room and a second machine room, and specifically includes the following steps.
Step S202: when the first machine room fails, invoke the switching service of the second machine room, and update the data storage nodes and the metadata database instance nodes in the second machine room.
Specifically, disaster tolerance means that when a disaster such as a natural disaster, equipment failure, or damaging manual operation occurs, the system's tasks keep running uninterrupted while as little system data as possible is lost. A disaster recovery system generally establishes two or more sets of systems with the same functions, between which health-state detection and function switching can be performed; when one system stops working because of an accident (such as fire or earthquake), the whole application can switch to another system so that its functions continue to work normally. Correspondingly, the first machine room and the second machine room are two such functionally identical systems. Each has the same machine-room structure, generally composed of a cloud base layer, a management and control service layer, and a kernel service layer. The first machine room may be the primary or the standby machine room; when the first machine room is the primary, the second machine room is the standby and provides service when the first machine room fails. In the disaster recovery method of this embodiment, a scheduling platform provides one set of management services for the first and second machine rooms, where the scheduling platform includes, but is not limited to, a K8s platform. The switching service of the second machine room is a service deployed in the second machine room that is invoked when the first machine room fails so that the second machine room takes over the tasks of the first machine room. The cloud base layer of the second machine room includes a metadata database instance unit and a data storage unit: the metadata database instance unit contains multiple metadata database instance nodes that manage metadata for data synchronization, and the data storage unit contains multiple data storage nodes for data synchronization.
On this basis, the scheduling platform can provide the same set of management services for both machine rooms at the same time, managing the data storage units and the metadata database instance units in the first and second machine rooms. When the first machine room fails, the switching service of the second machine room is invoked to update the data storage nodes of the data storage unit and the metadata database instance nodes of the metadata database instance unit in the second machine room, changing the attributes of those nodes so that they provide service externally.
In practical applications, to improve disaster tolerance, a third machine room is included in addition to the first and second machine rooms, forming a same-city three-machine-room disaster recovery architecture; no cloud base or control components need to be deployed in the third machine room, which reduces the material requirements and operation-and-maintenance cost of the architecture. One set of cloud base is deployed across the first and second machine rooms to manage their database instances and control components; the raw data in the cloud base can be physically replicated, and given the low latency between the first and second machine rooms, rapid data synchronization can be guaranteed and data loss avoided. Because the two machine rooms share the same set of management services, database instances can be deployed across machine rooms, and kernel-level physical replication can be achieved without external components, improving the safety and efficiency of data synchronization.
Further, the second machine room contains at least one second data storage node and at least one second metadata database instance node, but only one node of each kind can provide service externally at a time, while the remaining nodes are used only for data synchronization. Therefore, when the second machine room becomes the machine room that provides service externally, the second data storage nodes and the second metadata database instance nodes it contains need to be updated, which is implemented as follows:
invoke the switching service of the second machine room, select a data storage node from the second data storage nodes contained in the second machine room, and update it, where the updated data storage node receives the storage node data submitted to the second machine room; and invoke the switching service of the second machine room, select a metadata database instance node from the second metadata database instance nodes in the second machine room, and update it, where the updated metadata database instance node receives the metadata database instance data submitted to the second machine room.
Specifically, the second data storage nodes are the data storage nodes in the second machine room used for data synchronization. The corresponding data storage node is one or more nodes selected from the multiple second data storage nodes to provide data service externally and to receive the storage node data submitted to the second machine room; the remaining second data storage nodes are used only for data synchronization and need not provide data service externally. Likewise, the second metadata database instance nodes are the metadata database instance nodes in the second machine room used for data synchronization; the metadata database instance node is one or more nodes selected from the multiple second metadata database instance nodes to provide data service externally and to receive the data submitted to the second machine room, while the remaining second metadata database instance nodes are used only for data synchronization and need not provide data service externally.
Based on the switching service of the second machine room, a data storage node is selected from the at least two second data storage nodes contained in the second machine room and updated; the updated data storage node receives the storage node data submitted to the second machine room. Before the first machine room fails, the second data storage nodes are in a standby state and are used only for data synchronization between the two machine rooms; when the first machine room fails and the second machine room takes over to provide service, one data storage node is selected from the at least two second data storage nodes and updated to a serving state to provide service externally. Likewise, the switching service of the second machine room is invoked to select a metadata database instance node from the at least two second metadata database instance nodes and update it; the updated metadata database instance node receives the metadata database instance data submitted to the second machine room. Correspondingly, before the first machine room fails, the second metadata database instance nodes are in a standby state and are used only for data synchronization between the two machine rooms; when the first machine room fails and the second machine room takes over, one metadata database instance node is selected from the at least two second metadata database instance nodes and updated to a serving state to provide service externally.
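The node-promotion step above can be sketched as follows. Node names and role labels are assumptions for illustration, not identifiers from the patent.

```python
# Sketch of promoting one standby node to the serving state while the
# remaining nodes keep replicating only, as in the switching service.

def promote_one(nodes):
    """Pick one standby node, mark it serving, and return the roles."""
    serving = nodes[0]  # any standby node is eligible for promotion
    roles = {n: "replicating" for n in nodes}
    roles[serving] = "serving"
    return serving, roles

# Applies equally to data storage nodes and metadata database
# instance nodes in the second machine room.
storage_nodes = ["storage-a", "storage-b", "storage-c"]
serving, roles = promote_one(storage_nodes)
```

After promotion, `serving` names the node that accepts externally submitted data, and the other entries in `roles` remain `"replicating"` for synchronization only.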
For example, besides the database kernel, a database service includes various peripheral control systems that implement functions such as instance lifecycle management. When a machine-room-level disaster occurs, high availability of both the control system and the database instances must be guaranteed, and services must be switched to the second machine room quickly and stably while preserving data consistency and functional integrity. In the same-city three-machine-room scenario, the first machine room fails and the second machine room takes over to provide service in its place. The managed metadata database in the second machine room contains three XDB nodes, and the K8s base correspondingly has three etcd (metadata service) nodes. When the first machine room fails and the second machine room is enabled, one of the three XDB follower nodes is arbitrarily selected and promoted to XDB leader to provide service externally, while the other two continue asynchronous data synchronization; likewise, one of the three etcd nodes is selected and promoted to etcd leader to provide service externally, while the other two continue asynchronous data synchronization.
In summary, a data storage node is selected from the second data storage nodes in the second machine room and updated, and a metadata database instance node is selected from the second metadata database instance nodes and updated, thereby updating part of the nodes in the second machine room and ensuring continuity of external service.
Step S204: invoke the management and control service unit in the second machine room, distribute the data processing tasks corresponding to the first machine room to the second machine room, and update the database instance nodes in the second machine room.
Specifically, a management and control service layer is deployed in each of the first machine room and the second machine room, and the second machine room contains management and control service units equal in number to those of the first machine room; that is, a duplicate of each management and control service unit of the first machine room is deployed in the second machine room to provide management and control services for the machine room. When the first machine room fails, the data processing tasks corresponding to the first machine room are distributed to the second machine room, wherein a data processing task is a task for data processing received during the running of the first machine room. The database instance node refers to the database instance corresponding to the service layer in the second machine room and is used for data processing; a strongly consistent data synchronization protocol, such as the Paxos protocol, is adopted between the database instance units corresponding to the first machine room and the second machine room respectively for synchronizing data.
Based on the above, under the condition that the first machine room fails, the management and control service unit in the second machine room is called, the data processing task corresponding to the first machine room is distributed to the second machine room, after the switching service of the second machine room is called, the data storage node in the second machine room and the metadata database instance node in the second machine room are updated, the management and control service in the second machine room can normally run, the data processing task is received, and the data processing is performed. And calling a management and control service unit in the second machine room, updating the database instance node in the second machine room, and starting the database instance node in the second machine room in a processing preparation state for providing services to the outside.
In practical application, before the first machine room fails and while it can still provide normal service, the database instance node in the first machine room is the leader node, namely the master node. When the first machine room operates normally, the leader node provides write service to the outside, and the written data is synchronized through the Paxos protocol to the database instance node (follower node) in the second machine room and to the database instance node (follower node) in the third machine room. When the first machine room fails and can no longer provide service, the second machine room is started, the database instance node (follower node) in the second machine room is updated into the leader node, and the written data is synchronized to the database instance node (follower node) in the third machine room through the Paxos protocol, so that normal and uninterrupted service can still be provided to the outside in the case of a first machine room failure.
Further, considering that before the first machine room fails, the database instance node in the second machine room is always in a ready state, after the first machine room fails, the second machine room needs to take over the first machine room to continue to provide service, and at this time the database instance node in the second machine room needs to provide service to the outside, which is specifically implemented as follows:
and calling the management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and updating the state information and the attribute information of the database instance node in the second machine room, wherein the updated database instance node is used for receiving the database instance data submitted for the second machine room, and the management and control service unit is used for carrying out state management on the database instance node in the second machine room.
Specifically, the state information refers to the state information of a database instance node, including but not limited to a ready state and a service state. When the database instance node in the second machine room is in the ready state, no service is provided externally; when it is in the service state, service is provided externally. Corresponding to the state of the database instance node, the node also has attribute information: when the database instance node in the second machine room is a leader node, it can provide service externally, and when it is a follower node, it does not provide service externally and only realizes data synchronization between the first machine room and the second machine room.
Based on this, the management and control service unit in the second machine room is called, the data processing task corresponding to the first machine room is distributed to the second machine room, and the state information and the attribute information of the database instance node in the second machine room are updated. Before the first machine room fails, the database instance node in the second machine room is in the ready state and is a follower node; after the first machine room fails, the second machine room provides service to the outside, so the ready state of the database instance node in the second machine room is updated to the service state and the follower node is updated to a leader node. The updated database instance node is used for receiving the database instance data submitted for the second machine room, and the management and control service unit is used for carrying out state management on the database instance node in the second machine room.
Along with the above example, when the second machine room takes over the first machine room to provide service, the database instance node (XDB instance) in the second machine room is updated from follower to leader for providing service to the outside. The management and control service deploys a copy in the second machine room and adopts a multi-active architecture, and k8s automatically transfers the traffic to the second machine room when the first machine room fails.
In summary, the state information and the attribute information of the database instance node in the second machine room are updated, so that the service is provided to the outside based on the database instance node in the second machine room, and the continuity of the service provided to the outside is ensured under the condition of the first machine room failure.
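The joint update of state information and attribute information can be sketched as follows. The type `DBInstanceNode` and the function `take_over` are illustrative names introduced here for clarity, not part of the embodiment:

```python
from dataclasses import dataclass

@dataclass
class DBInstanceNode:
    state: str      # "ready" or "service" (the state information)
    attribute: str  # "leader" or "follower" (the attribute information)

def take_over(node: DBInstanceNode) -> DBInstanceNode:
    """Update a ready follower in the second machine room into a serving
    leader; any other combination means the node is not a valid standby."""
    if node.state != "ready" or node.attribute != "follower":
        raise ValueError("node is not a ready standby; refusing takeover")
    node.state = "service"
    node.attribute = "leader"
    return node

standby = DBInstanceNode(state="ready", attribute="follower")
serving = take_over(standby)
```

The guard clause reflects that only a node that is both in the ready state and a follower may be switched into service.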
Further, considering the security of data processing and data storage, a third machine room can be deployed on the basis of the first machine room and the second machine room. The management service, the switching service and the management and control service need not be deployed in the third machine room; the third machine room only comprises database instance nodes for synchronizing the data of the database instance units in the second machine room, which is specifically implemented as follows:
and synchronizing the instance data in the second machine room to a third database instance node in a third machine room based on the updated database instance node, wherein the third database instance node is used for asynchronous storage of the instance data.
Specifically, the third machine room refers to another machine room deployed in addition to the first machine room and the second machine room. The third machine room only comprises database instance nodes, which are used for synchronizing the data corresponding to the database instance nodes of the second machine room based on a data synchronization protocol; the management and control services, management services and the like need not be deployed in the third machine room, and only the database instance nodes need to be deployed.
Based on the above, on the basis of the first machine room and the second machine room, a third machine room is further included, and under the condition that the second machine room can normally provide services to the outside, the instance data corresponding to the instance nodes of the database in the second machine room are synchronized to the instance nodes of the third database in the third machine room, and the data synchronization is performed through an asynchronous synchronization strategy.
Along with the above example, besides the first machine room and the second machine room, a third machine room is deployed; the XDB instance of the third machine room is a follower node, and data synchronization is achieved between the XDB instance of the third machine room and the XDB instance of the second machine room through the Paxos protocol.
In summary, on the basis of the first machine room and the second machine room, the system further comprises a third machine room, the database instance nodes in the third machine room are used for realizing asynchronous synchronization of data, and unnecessary management and control services and management services are not required to be deployed in the third machine room, so that material consumption is reduced, and the application cost of the machine room is reduced.
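The asynchronous synchronization toward the third machine room can be sketched as a background log-shipping loop. This is a minimal illustrative sketch, assuming an in-memory record list stands in for the replica; `replicate_async` is a hypothetical name:

```python
import queue
import threading

def replicate_async(records, replica):
    """Ship committed records of the second machine room to the
    third-machine-room instance node on a background thread; the
    source enqueues and continues without waiting per record
    (asynchronous synchronization)."""
    q = queue.Queue()

    def apply_worker():
        while True:
            rec = q.get()
            if rec is None:      # shutdown sentinel
                q.task_done()
                break
            replica.append(rec)  # apply the record on the replica
            q.task_done()

    t = threading.Thread(target=apply_worker, daemon=True)
    t.start()
    for rec in records:
        q.put(rec)               # enqueue and continue immediately
    q.put(None)
    q.join()                     # drain only at shutdown, for the demo

third_room_copy = []
replicate_async(["row-1", "row-2", "row-3"], third_room_copy)
```

In a real deployment the shipping would of course go over the Paxos replication channel rather than a local queue; the point of the sketch is that the leader does not block on the third-room replica.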
Further, considering disaster tolerance of the machine room architecture, in order to reduce data loss as much as possible, task continuity is ensured when the first machine room and the second machine room are switched, machine room switching can be realized by adopting a mode of actively switching the machine room or passively switching the machine room, and the following is specifically realized:
invoking a switching service of the second machine room, and sending an updating task to a management and control service unit in the second machine room; invoking a management and control service unit in the second machine room, which receives the update task, and updating the database instance node in the second machine room; or receiving an update instruction aiming at the second machine room, and updating the database instance node in the second machine room based on the update instruction; and the updated database instance node in the second machine room is used for receiving the data operation task.
Specifically, the update task refers to a task issued to a management and control service unit of a machine room to be provided with service when a first machine room fails or a machine room switching requirement exists, and is used for informing the management and control service unit in the machine room of the task of machine room switching, and further calling the management and control service unit, and performing switching between the first machine room and a second machine room in an auxiliary manner through the management and control service unit; correspondingly, the updating instruction refers to a computer command submitted by a user aiming at the dispatching platform when the first machine room fails, and the computer command is used for realizing switching between the first machine room and the second machine room.
Based on the above, under the condition that the first machine room fails, the first machine room reports the failure to the dispatching platform, the dispatching platform calls the switching service of the second machine room, and an update task is sent to the management and control service unit in the second machine room; and after the management and control service unit receives the update task, the scheduling platform calls the management and control service unit which receives the update task in the second machine room, and updates the database instance node in the second machine room. Or under the condition that the first machine room is in fault, the user can perceive the first machine room fault, so that the user can submit an update instruction to the scheduling platform aiming at the second machine room, after receiving the update instruction aiming at the second machine room, the scheduling platform updates the database instance node in the second machine room based on the update instruction, and the updated database instance node in the second machine room is used for receiving the data operation task to realize that the second machine room takes over the first machine room to provide service.
In the above example, when the database instance is switched between the first machine room and the second machine room, the switching may be automatic or may be triggered manually by the user. The switching of the etcd (metadata service) and the management and control metadata base of the cloud base, as well as the switching of the database instances without an automatic switching function, are uniformly handled by the disaster recovery switching service, which ensures orderly switching; under special conditions the user is also given the option of not switching the service to the second machine room, sacrificing service continuity to ensure the consistency of the system data. The switching of database instances requires a case-by-case discussion: since the strongly consistent Paxos protocol is adopted between the XDB instances of the first machine room and the second machine room, RPO can be guaranteed to be equal to zero in the case of a first machine room failure, and the leader node is automatically determined in the second machine room; at this time both the RPO and the RTO are determined by the kernel. For database instances that do not have an automatic switching function, the switching task is issued through the disaster recovery switching service, and the master-slave switching is performed through the management and control service; at this time the RPO and RTO of the database instance are determined by the recovery speed of the management and control service, and since the recovery of the management and control service depends on the recovery of the base management and control metadata base, they are finally determined by the RTO of the k8s base metadata base.
In summary, under the condition of the first machine room fault, the switching between the first machine room and the second machine room can be automatically performed through the dispatching platform, and the manual switching can be performed by the user according to the user's needs, so that the flexibility of switching between the first machine room and the second machine room is improved, the switching between the first machine room and the second machine room is automatically performed through the dispatching platform, the real-time fault detection by the user is avoided, and the fault response speed is improved.
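The case-by-case handling by the disaster recovery switching service can be sketched as a simple planning function. This is a hypothetical sketch; the field names and `plan_switchover` are illustrative and not from the embodiment:

```python
def plan_switchover(instances, user_allows_switch=True):
    """Classify database instances during a first machine room failure:
    instances with kernel-level automatic failover (e.g. Paxos-based,
    RPO = 0) elect a new leader themselves and need no task; the rest
    receive an explicit switch task from the disaster recovery switching
    service. The user may also veto the switch, sacrificing continuity
    to preserve data consistency."""
    if not user_allows_switch:
        return []  # no tasks issued; service stays down by choice
    tasks = []
    for inst in instances:
        if inst["auto_failover"]:
            continue  # the kernel decides RPO and RTO on its own
        tasks.append({"instance": inst["name"], "action": "switch_master"})
    return tasks

instances = [
    {"name": "xdb-cluster", "auto_failover": True},   # Paxos, self-electing
    {"name": "legacy-db",   "auto_failover": False},  # needs a switch task
]
tasks = plan_switchover(instances)
```

Issuing tasks only for non-self-electing instances matches the uniform handling described above, where the switching service guarantees ordering for everything the kernel cannot switch by itself.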
Further, considering that the time of the first machine room failure is uncertain, when the first machine room fails, the metadata in the second machine room may be inconsistent with that of the first machine room, and therefore metadata correction needs to be performed by the management and control service unit to achieve recovery of the metadata, which is specifically implemented as follows:
acquiring historical data corresponding to the first machine room fault; and calling the management and control service unit in the second machine room, and updating metadata of the metadata database instance node in the second machine room based on the historical data in a preset recovery time.
Specifically, the historical data refers to metadata contained in the first machine room before the first machine room fails; the preset recovery time is a preset data recovery time, which means that when a disaster occurs, data and service need to be recovered within a preset recovery time RTO, wherein the RTO marks the longest time that service can be tolerated to stop, and the higher the urgency requirement of service recovery, the smaller the RTO value. When the first machine room fails, after the management and control service in the second machine room is recovered, the metadata correction is firstly carried out, so that the follow-up service can be ensured to be normally carried out.
Based on the above, when the first machine room fails, after the management and control service unit in the second machine room is recovered, the historical data stored in the dispatching platform before the first machine room fails is obtained. And calling a management and control service unit in the second machine room, and updating metadata of the metadata database instance nodes in the second machine room based on the acquired historical data in a preset recovery time so that the subsequent second machine room can take over the first machine room to provide service to the outside.
Along with the above example, the consistency of the management and control metadata during primary-backup switching in the DBaaS system is very important: an inconsistent state between the management and control metadata and the kernel may cause a wrong switching task to flow down and cause a secondary disaster. Therefore, the metadata is corrected at the first moment after the management and control service is recovered; this time is determined by the RTO of the management and control service, and since the recovery of the management and control service depends on the recovery of the base management and control metadata base, it is finally determined by the RTO of the base metadata base.
In summary, after the management and control service unit in the second machine room is restored, the metadata in the second machine room is corrected based on the historical data corresponding to the fault of the first machine room, so that the subsequent second machine room can take over the first machine room to provide service to the outside.
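The metadata correction bounded by the RTO can be sketched as follows. The function `correct_metadata` and the key names are illustrative assumptions, not from the embodiment:

```python
import time

def correct_metadata(current, historical, rto_seconds):
    """Overwrite the second machine room's stale metadata entries with the
    pre-failure historical snapshot; fail fast if the correction would
    exceed the preset recovery time objective (RTO)."""
    deadline = time.monotonic() + rto_seconds
    for key, value in historical.items():
        if time.monotonic() > deadline:
            raise TimeoutError("metadata correction exceeded the RTO")
        current[key] = value
    return current

# Hypothetical example: the stale copy still points at the failed room
stale   = {"primary": "room-1", "epoch": 7}
history = {"primary": "room-2", "epoch": 8}
corrected = correct_metadata(stale, history, rto_seconds=5.0)
```

The deadline check mirrors the requirement that data and service must be recovered within the preset recovery time, with a smaller RTO for more urgent services.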
Step S206: and synchronizing data to the second machine room based on the updated data storage node, the metadata database instance node and the database instance node in the second machine room.
Based on the above, after the update of the data storage node, the metadata database instance node and the database instance node in the second machine room is completed, the data can be synchronized to the second machine room based on the updated data storage node, metadata database instance node and database instance node in the second machine room, so that the second machine room can take over the first machine room to provide services.
In practical application, the disaster recovery architecture formed by the first machine room, the second machine room and the third machine room has the capability of switching between the main machine room and the standby machine room, ensures high availability, can be automatically switched from the first machine room to the second machine room when a disaster occurs, and can also manually determine to temporarily stop the first machine room or the second machine room to provide services so as to avoid the problem that partial database examples generate inconsistent data. By adopting a three-machine room disaster recovery architecture, a database kernel with disaster recovery capability is supported to achieve that RPO is equal to 0, and the disaster recovery requirement of a financial grade is met.
Further, because the data volume processed in the first machine room is large, data synchronization between the first machine room and the second machine room would conventionally require additionally deploying an external component. In order to improve the efficiency of data synchronization and the data consistency, the data synchronization between the first machine room and the second machine room can instead be performed based on a data synchronization protocol, which is specifically implemented as follows:
determining a data storage space, reading storage data in the data storage space, and synchronizing the storage data to updated data storage nodes in the second machine room based on a data synchronization protocol; reading metadata in the data storage space, and synchronizing the metadata to updated metadata database instance nodes in the second machine room based on the data synchronization protocol; and reading the instance data in the data storage space, and synchronizing the instance data to the updated database instance node in the second machine room based on the data synchronization protocol.
Specifically, the data storage space may be a local storage space for data storage or a cloud storage space for data storage, where the data storage space is used for recording data in the first machine room in real time, and data backup may be performed in a snapshot manner, so that subsequent data recovery based on the backed-up data is facilitated; correspondingly, the stored data is backup data corresponding to the snapshot of the first machine room before the first machine room fails; the metadata is the metadata used for carrying out data synchronization on the second machine room; the instance data is the instance data used for carrying out data synchronization on the second machine room. The snapshot has the main function of being capable of carrying out online data recovery, and can carry out timely data recovery when the storage equipment has application faults or files are damaged, so that the data is recovered to a state of a snapshot generation time point; the data synchronization protocol may be a protocol for data synchronization such as the strongly consistent Paxos protocol.
Based on the data, determining a data storage space for storing the snapshot of the first machine room, reading storage data corresponding to the first machine room in the data storage space, and synchronizing the storage data to the updated data storage nodes in the second machine room based on a data synchronization protocol. Reading metadata in a data storage space, and synchronizing the metadata to updated metadata database instance nodes in a second machine room based on a data synchronization protocol; and reading the instance data in the data storage space, synchronizing the instance data to the updated database instance node in the second machine room based on the data synchronization protocol, and synchronizing the updated data storage node, the metadata database instance node and the data in the database instance node in the second machine room to be in a state before the first machine room fails, so that the second machine room can take over the first machine room to continue to provide services to the outside.
Along with the above example, when data synchronization is performed between the first machine room and the second machine room, data synchronization between the XDB instance of the first machine room and the XDB instance of the second machine room is realized based on the Paxos protocol, data synchronization between the management and control metadata base of the first machine room and that of the second machine room is realized based on the Paxos protocol, and data synchronization between the etcd (metadata service) node of the first machine room and the etcd node of the second machine room is realized based on the Paxos protocol.
In summary, based on the data synchronization protocol, the data storage node, the metadata database instance node and the database instance node between the first machine room and the second machine room are synchronized, so that the efficiency and the accuracy of the data synchronization are improved.
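The layered replay of the snapshot into the three kinds of updated nodes can be sketched as follows. The dictionary layout and the name `sync_from_snapshot` are illustrative assumptions:

```python
def sync_from_snapshot(snapshot, room):
    """Replay each layer of the first machine room's snapshot into the
    updated nodes of the second machine room: storage data first, then
    metadata, then database instance data."""
    room["data_storage_node"].update(snapshot["storage"])
    room["metadata_instance_node"].update(snapshot["metadata"])
    room["database_instance_node"].update(snapshot["instances"])
    return room

# Hypothetical snapshot taken before the first machine room failed
snapshot = {
    "storage":   {"blk-1": b"\x00\x01"},
    "metadata":  {"primary": "room-2"},
    "instances": {"orders": ["row-1"]},
}
second_room = {
    "data_storage_node": {},
    "metadata_instance_node": {},
    "database_instance_node": {},
}
sync_from_snapshot(snapshot, second_room)
```

After the replay, the three node layers of the second machine room hold the state of the first machine room as of the snapshot point, so the second machine room can take over and continue to provide service.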
Step S208: and sending an execution instruction for executing the metadata processing task and the data processing task to the second machine room according to the data synchronization result.
Based on the above, in the case of the failure of the first machine room, the switching service of the second machine room is invoked, and the data storage node in the second machine room and the metadata database instance node in the second machine room are updated. And calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, updating the database instance node in the second machine room, wherein after updating, the second machine room can replace the first machine room in faults to provide service to the outside, and the data synchronization result is a data synchronization result obtained after the data in the first machine room are synchronized to the second machine room, so that the data of each layer of the first machine room are synchronized to the second machine room, the data processing task corresponding to the first machine room is sent to the second machine room, and the second machine room replaces the first machine room to execute the data processing task. And the dispatching platform sends an execution metadata processing task and an execution instruction for executing the data processing task to the second machine room, and the second machine room carries out metadata processing and data processing, wherein the metadata processing task refers to the data processing task corresponding to the metadata database instance node, and the corresponding data processing task is the data processing task corresponding to the database instance unit.
In practical application, before the first machine room fails, metadata processing tasks are executed by the metadata instance nodes of the first machine room, data processing tasks are executed by the database instance units of the first machine room, data synchronization between the metadata instance nodes in the first machine room and the metadata instance nodes in the second machine room is achieved through a Paxos protocol, and data synchronization between the database instance units in the first machine room and the database instance units in the second machine room is achieved through the Paxos protocol. After the first machine room fails, the database instance unit and the metadata instance node in the first machine room cannot normally provide services, so that the database instance unit and the metadata instance node in the second machine room can replace the first machine room to provide services to the outside after the state update is completed, metadata processing tasks and data processing tasks are executed, and meanwhile data corresponding to the database instance unit in the second machine room are synchronized to the database instance unit in the third machine room through a Paxos protocol.
Further, considering that after the first machine room fails, a worker checks and eliminates the fault of the first machine room, after the fault recovery of the first machine room, the dispatching platform can switch the second machine room back to the first machine room, so that the first machine room continues to provide service and the second machine room is restored to the state before the failure of the first machine room, which is specifically implemented as follows:
Under the condition of the fault recovery of the first machine room, calling the switching service of the first machine room, and updating the data storage node of the database storage unit in the first machine room and the metadata database instance node in the first machine room; invoking the management and control service unit in the first machine room, distributing the data processing task corresponding to the second machine room to the first machine room, and updating the database instance node in the first machine room; synchronizing data to the first machine room based on the updated data storage node, metadata database instance node and database instance node in the first machine room; and sending an execution instruction for executing the metadata processing task and the data processing task to the first machine room according to the data synchronization result.
Based on this, under the condition that the fault of the first machine room is recovered, the first machine room can provide service to the outside again, so the second machine room can be switched back to the first machine room, the second machine room is restored to the state before the first machine room failed, and the first machine room takes over the second machine room to provide service to the outside. After the first machine room recovers from the fault, the switching service of the first machine room is called, and the data storage node of the database storage unit in the first machine room and the metadata database instance node in the first machine room are updated; the management and control service unit in the first machine room is called, the data processing task corresponding to the second machine room is distributed to the first machine room, and the database instance node in the first machine room is updated; data is synchronized to the first machine room based on the updated data storage node, metadata database instance node and database instance node in the first machine room; and an execution instruction for executing the metadata processing task and the data processing task is sent to the first machine room according to the data synchronization result.
Along with the above example, after the failure of the first machine room is recovered, the second machine room can be switched back to the first machine room, the management and control service, the XDB example, the management and control metadata database and the k8s base in the second machine room no longer provide services to the outside, the second machine room is recovered to a preparation state, and the first machine room after the failure recovery takes over the services of the second machine room to provide services to the outside.
In summary, after the first machine room is failed and recovered, the second machine room providing service to the outside can be switched to the first machine room, that is, the first machine room replaces the second machine room to provide service to the outside, and the second machine room is recovered to the state before the first machine room fails, so that the switching between the first machine room and the second machine room after the first machine room fails and the normal operation of the first machine room are realized.
Further, considering that only one machine room is required to provide service to the outside between the first machine room and the second machine room, the other machine room is in a standby state, so after the fault of the first machine room is recovered, the machine room providing service to the outside can be switched back to the first machine room, and meanwhile, the second machine room is set to be in a standby state, so that only the data synchronization with the first machine room is realized, the service is not provided to the outside, and the following is specifically realized:
Restoring the data storage nodes in the second machine room and the metadata database instance nodes in the second machine room; and calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and restoring the database instance node in the second machine room.
Specifically, the restoration refers to updating the data storage nodes and the metadata database instance nodes in the second machine room, which were providing services to the outside, back to the ready state, that is, to the state before the first machine room failed, in which only asynchronous data synchronization with the first machine room is performed and no service is provided to the outside.
Based on the above, after the first machine room fault is recovered, updating the service state of the second machine room to the state before the first machine room fault occurs, and recovering the data storage node, the database instance node and the metadata database instance node in the second machine room; and calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and restoring the database instance node in the second machine room.
Following the above example, after the failure of the first machine room is recovered, the management and control service, the XDB instance, the management and control metadata database, and the k8s base in the second machine room no longer provide services to the outside: the XDB instances are updated to followers, the XDB leader is updated to an XDB follower, and the etcd leader is updated to an etcd follower, i.e. the second machine room is restored to a standby state that synchronizes data from the first machine room.
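The demotion just described, in which every promoted node in the second machine room is returned to a non-serving replication role, can be sketched as follows. This is a minimal illustration only; the node names and the `demote_standby` helper are hypothetical, not part of the disclosed system.

```python
# Illustrative sketch of the failback step: after the first machine room
# recovers, the standby room's promoted nodes (leader/followers) are all
# demoted back to "learner", so they only asynchronously replicate data
# and no longer provide services to the outside.

def demote_standby(roles):
    """Map every role in the standby room back to 'learner'."""
    return {node: "learner" for node in roles}

# Example: the standby room's etcd nodes after they had been promoted.
standby_etcd = {"etcd-b1": "leader", "etcd-b2": "follower", "etcd-b3": "follower"}
restored = demote_standby(standby_etcd)
assert all(role == "learner" for role in restored.values())
```

The same demotion applies uniformly to the xdb metadata nodes and the database instances, which is why the sketch takes an arbitrary role map rather than a node type.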
In summary, the disaster recovery method provided in one embodiment of the present disclosure is applied to a dispatching platform, where the dispatching platform provides a set of management services for a first machine room and a second machine room, and in case of a failure of the first machine room, the dispatching platform invokes a switching service of the second machine room to update a data storage node in the second machine room and a metadata database instance node in the second machine room; calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and updating the database instance node in the second machine room; synchronizing data to the second machine room based on the updated data storage node, the metadata database instance node and the database instance node in the second machine room; and sending an execution instruction for executing the metadata processing task and the data processing task to the second machine room according to the data synchronization result.
The first machine room and the second machine room use the same group of management services provided by the dispatching platform, so that service deployment cost and operation and maintenance complexity are reduced, and under the condition of failure of the first machine room, the standby switching service of the second machine room can be directly called to enable the second machine room to replace the first machine room to continuously provide services to the outside, and service switching speed and failure recovery efficiency are improved while switching requirements of a disaster recovery system are met.
The disaster recovery method provided in the present specification is further described below with reference to fig. 3 by taking an application of the disaster recovery method in the same city machine room as an example. Fig. 3 is a flowchart illustrating a processing procedure of a disaster recovery method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S302: in case of a failure of the main machine room, invoking the switching service of the standby machine room, selecting a data storage node from the second data storage nodes contained in the standby machine room, and updating the data storage node.
Step S304: and calling the switching service of the standby machine room, selecting a metadata database instance node from the second metadata database instance nodes in the standby machine room, and updating the metadata database instance node.
Step S306: and calling a management and control service unit in the standby machine room, distributing a data processing task corresponding to the host machine room to the standby machine room, and updating the state information and attribute information of the database instance node in the standby machine room.
Step S308: and synchronizing the instance data in the standby machine room to a third database instance node in a third machine room based on the updated database instance node, wherein the third database instance node is used for asynchronous storage of the instance data.
Step S310: acquiring historical data corresponding to the failure of the main machine room, calling the management and control service unit in the standby machine room, and updating metadata of the metadata database instance nodes in the standby machine room based on the historical data within a preset recovery time.
Step S312: and under the condition of recovering the fault of the main machine room, switching the standby machine room to the main machine room.
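The sequence of steps S302 through S312 above can be sketched as a simple ordered pipeline. The step labels mirror the flowchart of fig. 3; the function bodies are placeholders, not the actual platform logic.

```python
# Toy driver showing the ordering of steps S302–S312: failover steps run
# when the main machine room fails, and the switch-back step runs once
# the main machine room has recovered.

def run_disaster_recovery(failed=True, recovered=True):
    steps = []
    if failed:
        steps += [
            "S302: promote a data storage node in the standby room",
            "S304: promote a metadata database instance node",
            "S306: reassign data processing tasks, update instance nodes",
            "S308: sync instance data to the third machine room",
            "S310: correct metadata from historical data",
        ]
    if recovered:
        steps.append("S312: switch back to the main machine room")
    return steps

trace = run_disaster_recovery()
assert trace[0].startswith("S302") and trace[-1].startswith("S312")
```

Note that S312 only executes after recovery of the main machine room, matching the condition stated in the step itself.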
Fig. 4 is a schematic diagram of a disaster recovery system according to an embodiment of the present disclosure, where a main machine room, a standby machine room, and a third machine room form a basic disaster recovery system. The main machine room and the standby machine room are taken over by the same set of k8s, and the cloud-native database system is layered architecturally, so that the disaster recovery system has the capability of switching between the main machine room and the standby machine room: the main machine room and the standby machine room can be switched automatically when a disaster occurs, and a user can also manually decide to temporarily stop providing services to avoid inconsistent data being generated by some database instances. As shown in fig. 4, a k8s base, a management and control metadata database, database instances (XDB instances), a management and control service and a disaster recovery switching service are deployed in the main machine room; correspondingly, the same set of k8s base, management and control metadata database, database instances (XDB instances), management and control service and disaster recovery switching service-b is deployed in the standby machine room; global load balancing is implemented through the k8s service; the switching service-a of the main machine room and the switching service-b of the standby machine room can be managed by a user server; and the third machine room only deploys database instances (XDB instances).
The k8s base in the main machine room provides the k8s service, and the disaster recovery system also provides global load balancing (GLB) internally or externally. The k8s base contains three etcd nodes (the metadata service of k8s), namely a leader node and two follower nodes: the leader node provides services to the outside, and the two follower nodes perform data synchronization. The k8s base in the standby machine room provides the k8s service and contains three etcd follower nodes for asynchronous data synchronization. The management and control metadata database of the main machine room comprises three xdb nodes, a leader node and two follower nodes, where the leader node provides services to the outside and the two follower nodes perform data synchronization; the management and control metadata database of the standby machine room comprises three xdb follower nodes for asynchronous data synchronization. The main machine room holds the leader database instance, and the standby machine room and the third machine room hold follower database instances. Data synchronization between the nodes of the main machine room and the standby machine room and between database instances is performed through the Paxos protocol, and data is likewise synchronized to the follower database instances of the standby machine room and the third machine room through the Paxos protocol, so that both the main machine room and the standby machine room provide highly available management and control services.
That is, the 6 etcd nodes of the main machine room and the standby machine room form a Paxos cluster. The three nodes of the main machine room have the roles leader, follower and follower, ensuring single-node disaster recovery capability within the machine room; the three etcd nodes of the standby machine room have the learner role and asynchronously synchronize the data of the main machine room. The xdb used for the management and control metadata database similarly forms a 6-node Paxos cluster across the main machine room and the standby machine room: the three nodes of the main machine room are leader, follower and follower, ensuring single-node disaster recovery capability within the machine room, while the three nodes of the standby machine room have the learner role and asynchronously synchronize the data of the main machine room. The management and control service simply deploys a copy in the standby machine room; since the main machine room and the standby machine room belong to one k8s cluster, services are provided to the outside through the k8s service. The stateless management and control services are thus in a multi-active state; the stateful management and control services need distributed modification so that they serve normally when both main and standby copies are attached to the service back end at the same time. The database instances are deployed across three availability zones to form a cluster (shown as the three XDB nodes in fig. 4) for data synchronization via the strongly consistent (Paxos) protocol.
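A key property of the topology above is that the standby room's learner nodes replicate data without voting, so the quorum is formed entirely within the main machine room. The toy model below illustrates this; the node names are purely illustrative.

```python
# Toy model of the 6-node etcd/xdb deployment: three voting members
# (leader + two followers) in the main room and three learners in the
# standby room. Learners replicate data asynchronously but do not count
# toward the consensus quorum, so the main room tolerates one node
# failure on its own while the standby stays a pure asynchronous replica.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str  # "leader", "follower", or "learner"

cluster = [
    Node("a1", "leader"), Node("a2", "follower"), Node("a3", "follower"),
    Node("b1", "learner"), Node("b2", "learner"), Node("b3", "learner"),
]

voters = [n for n in cluster if n.role in ("leader", "follower")]
quorum = len(voters) // 2 + 1  # majority of voting members only
assert len(voters) == 3 and quorum == 2  # one main-room node may fail
```

Because the quorum is 2 out of 3 voters, writes keep committing after a single in-room node failure, which is the "single-node disaster recovery capability within the machine room" the passage describes.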
The main machine room and the standby machine room each have their own disaster recovery switching service, and the two are mutually independent: the service of the main machine room has the capability of switching the base and the instances back to the main machine room; the service of the standby machine room has the capability of switching the base and the instances back to the standby machine room; and in case of a single-room failure, the system has the capability of switching between the main machine room and the standby machine room.
When the main machine room fails, the disaster recovery switching service of the standby machine room is invoked, and the three etcd learner nodes of the standby machine room are promoted to leader, follower and follower to provide services to the outside. The disaster recovery switching service of the standby machine room is likewise invoked to promote the three xdb learner nodes of the standby machine room's management and control metadata database to leader, follower and follower to provide services to the outside. Management and control service layer: each management and control service deploys a copy in the standby machine room and adopts a multi-active architecture. When the main machine room fails, traffic is automatically routed to the standby machine room. The metadata database on which the management and control service depends becomes usable once the three xdb learner nodes of the management and control metadata database are promoted to provide services to the outside, so the management and control service of the standby machine room can also run normally. Kernel service layer: (1) Since the XDB shown in fig. 4 adopts the strongly consistent Paxos protocol, RPO can be guaranteed to equal zero when the main machine room goes down, and a new master can be automatically elected in the standby machine room. In this case, both RPO and RTO are determined by the kernel. In addition, in a DBaaS system, the primary/standby switchover of the management and control metadata is also very important: an inconsistent state between the management and control metadata and the kernel may cause a wrong switching task to be issued, resulting in a secondary disaster, so the metadata is corrected as soon as the management and control service is recovered, and this time is determined by the RTO of the management and control service.
(2) The kernels of some database instances do not have an automatic master-switching function; for these, the disaster recovery switching service issues a switching task to the management and control service, which performs the master-slave switchover. In this case, the RPO and RTO of the database instance are determined by the recovery speed of the management and control service, and since the recovery of the management and control service depends on the recovery of the base's management and control metadata database, they are ultimately determined by the RTO of the base metadata database. Whether it is the etcd of the cloud base, the management and control metadata database, or database instances without an automatic switching function, switching is handled uniformly by the disaster recovery switching service to ensure an orderly switchover; in special cases, the user is also given the option of not switching services to the standby machine room, sacrificing service continuity in exchange for system data consistency.
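The two responsibilities of the disaster recovery switching service described above, promoting the standby room's learner nodes to "leader, follower, follower" and switching the components in a fixed order, can be sketched as follows. The names, the deterministic leader pick, and the particular switch order are all assumptions made for illustration, not the disclosed implementation.

```python
# Sketch of the standby room's disaster recovery switching service:
# (a) promote learners so one becomes leader and the rest followers;
# (b) switch components in a fixed, orderly sequence (base first, then
#     the management metadata database, then database instances).

def promote_learners(names):
    """Pick one learner as leader, promote the rest to followers."""
    if not names:
        raise ValueError("no learner nodes to promote")
    leader, *followers = sorted(names)  # deterministic choice for the sketch
    roles = {leader: "leader"}
    roles.update({name: "follower" for name in followers})
    return roles

# Hypothetical uniform switching order ensuring orderly switchover.
SWITCH_ORDER = ["k8s-etcd", "control-metadata-db", "db-instances"]

def ordered_switch(components):
    rank = {name: i for i, name in enumerate(SWITCH_ORDER)}
    return sorted(components, key=lambda c: rank[c])

roles = promote_learners(["etcd-b2", "etcd-b1", "etcd-b3"])
assert roles["etcd-b1"] == "leader"
assert ordered_switch(["db-instances", "k8s-etcd"]) == ["k8s-etcd", "db-instances"]
```

Ordering the base etcd and metadata database before the database instances reflects the dependency noted in the passage: the management and control service cannot act on instances until its own metadata database is serving again.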
In summary, by adopting the three-machine-room disaster recovery architecture, a database kernel with disaster recovery capability can achieve an RPO equal to 0, meeting financial-grade disaster recovery requirements. The third machine room does not deploy unnecessary base components and management and control components, reducing material requirements and operation and maintenance costs. One cloud base is used to manage all database instances and management and control components, and the metadata of the base is physically replicated; given the low latency (often less than 1.5 milliseconds) between the main and standby machine rooms of three machine rooms in the same city, fast synchronization of the k8s etcd and management and control metadata database data can be guaranteed, avoiding data loss as much as possible. The cloud base allows database instances to be deployed across machine rooms without depending on external components, achieving kernel-level physical replication that is safe and efficient. Using one set of cloud base management and control components makes the management and control services multi-active and minimizes the time consumed by switching management and control services. Using one set of cloud bases to manage all database instances and management and control components means there is no need to maintain two cloud bases and two management and control services, and no need to handle problems such as inconsistent metadata or inconsistent database instance states between the main and standby machine rooms, reducing operation and maintenance costs.
One embodiment of the present disclosure proposes a disaster recovery system that uses one set of cloud base to manage the data plane and management and control plane services of three machine rooms. Based on the Paxos protocol, kernel-level physical replication is performed at the cloud base layer, reducing cost and latency and realizing high-performance same-city three-machine-room disaster recovery; at the management and control service layer, seamless dual-machine-room multi-active switching and machine-room-level disaster recovery based on one set of cloud base are realized; at the database instance layer, the financial-grade disaster recovery requirement of RPO equal to zero is met by the physically replicated database kernel, and, combined with the disaster recovery capability of the management and control service, fast switching of both the kernel and the management and control is realized, thereby achieving disaster recovery for the DBaaS service.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a disaster recovery device, and fig. 5 shows a schematic structural diagram of the disaster recovery device provided in one embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:
an updating module 502 configured to invoke a switching service of the second machine room, update a data storage node in the second machine room, and a metadata database instance node in the second machine room in case of a failure of the first machine room;
the allocation module 504 is configured to call a management and control service unit in the second machine room, allocate a data processing task corresponding to the first machine room to the second machine room, and update a database instance node in the second machine room;
a synchronization module 506 configured to synchronize data to the second machine room based on the updated data storage node, the metadata database instance node, and the database instance node in the second machine room;
and the sending module 508 is configured to send an execution instruction for executing the metadata processing task and the data processing task to the second machine room according to the data synchronization result.
In an alternative embodiment, the update module 502 is further configured to:
invoking a switching service of the second machine room, selecting a data storage node from the second data storage nodes contained in the second machine room, and updating the data storage node, wherein the updated data storage node is used for receiving storage node data submitted for the second machine room; and invoking the switching service of the second machine room, selecting a metadata database instance node from the second metadata database instance nodes in the second machine room, and updating the metadata database instance node, wherein the updated metadata database instance node is used for receiving metadata database instance data submitted for the second machine room.
In an alternative embodiment, the allocation module 504 is further configured to:
invoking a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and updating the state information and attribute information of the database instance node in the second machine room, wherein the updated database instance node is used for receiving database instance data submitted for the second machine room, and the management and control service unit is used for managing the state of the database instance nodes in the second machine room.
In an alternative embodiment, the synchronization module 506 is further configured to:
determining a data storage space, reading storage data in the data storage space, and synchronizing the storage data to updated data storage nodes in the second machine room based on a data synchronization protocol; reading metadata in the data storage space, and synchronizing the metadata to updated metadata database instance nodes in the second machine room based on the data synchronization protocol; and reading the instance data in the data storage space, and synchronizing the instance data to the updated database instance node in the second machine room based on the data synchronization protocol.
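The three synchronization passes just listed, storage data, metadata, and instance data, each read from the data storage space and push to the corresponding updated node. A minimal sketch, with illustrative names only:

```python
# Sketch of the three synchronization passes: each kind of data is read
# from the data storage space and delivered to the updated node of the
# matching type in the second machine room. The real system would use a
# data synchronization protocol (e.g. Paxos-based replication); here the
# transfer is modeled as a plain dictionary copy.

def synchronize(storage_space, target_nodes):
    synced = {}
    for kind in ("storage_data", "metadata", "instance_data"):
        target = target_nodes[kind]            # updated node for this data kind
        synced[target] = storage_space.get(kind, [])
    return synced

space = {"storage_data": [1], "metadata": [2], "instance_data": [3]}
targets = {"storage_data": "dsn-b", "metadata": "meta-b", "instance_data": "xdb-b"}
result = synchronize(space, targets)
assert result == {"dsn-b": [1], "meta-b": [2], "xdb-b": [3]}
```

The point of the sketch is that all three passes draw on the same data storage space but land on different node types, which is why the method updates all three node kinds before synchronizing.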
In an alternative embodiment, the allocation module 504 is further configured to:
and synchronizing the instance data in the second machine room to a third database instance node in a third machine room based on the updated database instance node, wherein the third database instance node is used for asynchronous storage of the instance data.
In an alternative embodiment, the allocation module 504 is further configured to:
invoking a switching service of the second machine room, and sending an updating task to a management and control service unit in the second machine room; invoking a management and control service unit in the second machine room, which receives the update task, and updating the database instance node in the second machine room; or receiving an update instruction aiming at the second machine room, and updating the database instance node in the second machine room based on the update instruction; and the updated database instance node in the second machine room is used for receiving the data operation task.
In an alternative embodiment, the allocation module 504 is further configured to:
acquiring historical data corresponding to the first machine room fault; and calling the management and control service unit in the second machine room, and updating metadata of the metadata database instance node in the second machine room based on the historical data in a preset recovery time.
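One plausible reading of the metadata-correction step above is that historical records are replayed, restricted to a preset recovery window around the failure. The window semantics, field names, and helper below are assumptions made purely for illustration.

```python
# Hedged sketch of metadata correction from historical data: keep only
# history records whose timestamps fall within a preset recovery window
# ending at the failure time, then apply them to the metadata database
# instance node (application itself is omitted here).

def select_history_for_recovery(history, failure_ts, recovery_window):
    """Keep records with timestamps in [failure_ts - recovery_window, failure_ts]."""
    lo = failure_ts - recovery_window
    return [rec for rec in history if lo <= rec["ts"] <= failure_ts]

records = [{"ts": 5, "op": "a"}, {"ts": 40, "op": "b"}, {"ts": 90, "op": "c"}]
selected = select_history_for_recovery(records, failure_ts=100, recovery_window=70)
assert [r["op"] for r in selected] == ["b", "c"]
```

Bounding the replay by a window keeps the correction time predictable, consistent with the passage's concern that this step contributes to the management and control service's RTO.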
In an alternative embodiment, the sending module 508 is further configured to:
in case of failure recovery of the first machine room, invoking the switching service of the first machine room, and updating a data storage node of a database storage unit in the first machine room and a metadata database instance node in the first machine room; invoking a management and control service unit in the first machine room, distributing the data processing task corresponding to the second machine room to the first machine room, and updating the database instance node in the first machine room; synchronizing data to the first machine room based on the updated data storage node, metadata database instance node and database instance node in the first machine room; and sending an execution instruction for executing the metadata processing task and the data processing task to the first machine room according to the data synchronization result.
In an alternative embodiment, the sending module 508 is further configured to:
restoring the data storage nodes in the second machine room and the metadata database instance nodes in the second machine room; and calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and restoring the database instance node in the second machine room.
In summary, the disaster recovery device provided in one embodiment of the present disclosure is applied to a dispatching platform, where the dispatching platform provides a set of management services for a first machine room and a second machine room, and in case of a failure of the first machine room, the dispatching platform invokes a switching service of the second machine room to update a data storage node in the second machine room and a metadata database instance node in the second machine room; calling a management and control service unit in the second machine room, distributing the data processing task corresponding to the first machine room to the second machine room, and updating the database instance node in the second machine room; synchronizing data to the second machine room based on the updated data storage node, the metadata database instance node and the database instance node in the second machine room; and sending an execution instruction for executing the metadata processing task and the data processing task to the second machine room according to the data synchronization result.
The first machine room and the second machine room use the same group of management services provided by the dispatching platform, so that service deployment cost and operation and maintenance complexity are reduced, and under the condition of failure of the first machine room, the standby switching service of the second machine room can be directly called to enable the second machine room to replace the first machine room to continuously provide services to the outside, and service switching speed and failure recovery efficiency are improved while switching requirements of a disaster recovery system are met.
The foregoing is a schematic scheme of a disaster recovery device in this embodiment. It should be noted that, the technical solution of the disaster recovery device and the technical solution of the disaster recovery method belong to the same concept, and details of the technical solution of the disaster recovery device, which are not described in detail, can be referred to the description of the technical solution of the disaster recovery method.
Corresponding to the method embodiment, the present disclosure further provides a disaster recovery system embodiment, including: the system comprises a first machine room, a second machine room and a dispatching platform, wherein the dispatching platform stores data synchronization executable instructions, and the data synchronization executable instructions are used for realizing the steps of the disaster recovery method when being executed by the dispatching platform and distributing data and data processing tasks stored in the first machine room to the second machine room.
The foregoing is a schematic scheme of a disaster recovery system of this embodiment. It should be noted that, the technical solution of the disaster recovery system and the technical solution of the disaster recovery method belong to the same concept, and details of the technical solution of the disaster recovery system, which are not described in detail, can be referred to the description of the technical solution of the disaster recovery method.
Fig. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of computing device 600 include, but are not limited to, memory 610 and processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to hold data.
Computing device 600 also includes an access device 640, which enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the public switched telephone network (PSTN, Public Switched Telephone Network), a local area network (LAN, Local Area Network), a wide area network (WAN, Wide Area Network), a personal area network (PAN, Personal Area Network), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC, Network Interface Controller), for example an IEEE 802.11 wireless local area network (WLAN, Wireless Local Area Network) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a universal serial bus (USB, Universal Serial Bus) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC, Near Field Communication) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smart phone), a wearable computing device (e.g., smart watch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, Personal Computer). Computing device 600 may also be a mobile or stationary server. The processor 620 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the disaster recovery method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the disaster recovery method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the disaster recovery method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the disaster recovery method described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the disaster recovery method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the disaster recovery method.
An embodiment of the present disclosure further provides a computer program, where the computer program when executed in a computer causes the computer to perform the steps of the disaster recovery method described above.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the disaster recovery method belong to the same conception, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the disaster recovery method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately added or deleted according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely intended to help clarify the present specification. The alternative embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teachings of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and utilize the invention. This specification is to be limited only by the claims and their full scope and equivalents.

Claims (12)

1. A disaster recovery method, applied to a dispatching platform, wherein the dispatching platform provides a set of management services for a first machine room and a second machine room, the method comprising:
in response to a failure of the first machine room, invoking a switching service of the second machine room, and updating a data storage node in the second machine room and a metadata database instance node in the second machine room;
invoking a management and control service unit in the second machine room, allocating a data processing task corresponding to the first machine room to the second machine room, and updating a database instance node in the second machine room;
synchronizing data to the second machine room based on the updated data storage node, metadata database instance node, and database instance node in the second machine room; and
sending, to the second machine room according to a data synchronization result, an execution instruction for executing a metadata processing task and the data processing task.
2. The method of claim 1, wherein the invoking the switching service of the second machine room and updating the data storage node in the second machine room and the metadata database instance node in the second machine room comprises:
invoking the switching service of the second machine room, selecting a data storage node from second data storage nodes in the second machine room, and updating the selected data storage node, wherein the updated data storage node is configured to receive storage node data submitted for the second machine room; and
invoking the switching service of the second machine room, selecting a metadata database instance node from second metadata database instance nodes in the second machine room, and updating the selected metadata database instance node, wherein the updated metadata database instance node is configured to receive metadata database instance data submitted for the second machine room.
3. The method of claim 1, wherein the invoking the management and control service unit in the second machine room, allocating the data processing task corresponding to the first machine room to the second machine room, and updating the database instance node in the second machine room comprises:
invoking the management and control service unit in the second machine room, allocating the data processing task corresponding to the first machine room to the second machine room, and updating state information and attribute information of the database instance node in the second machine room, wherein the updated database instance node is configured to receive database instance data submitted for the second machine room, and the management and control service unit is configured to perform state management on the database instance node in the second machine room.
4. The method of claim 1, wherein the synchronizing data to the second machine room based on the updated data storage node, the metadata database instance node, and the database instance node in the second machine room comprises:
determining a data storage space, reading stored data in the data storage space, and synchronizing the stored data to the updated data storage node in the second machine room based on a data synchronization protocol;
reading metadata in the data storage space, and synchronizing the metadata to the updated metadata database instance node in the second machine room based on the data synchronization protocol; and
reading instance data in the data storage space, and synchronizing the instance data to the updated database instance node in the second machine room based on the data synchronization protocol.
5. The method of claim 1, further comprising, after the updating the database instance node in the second machine room:
synchronizing instance data in the second machine room to a third database instance node in a third machine room based on the updated database instance node, wherein the third database instance node is configured for asynchronous storage of the instance data.
6. The method of claim 1, wherein the updating the database instance node in the second machine room comprises:
invoking a switching service of the second machine room, and sending an update task to a management and control service unit in the second machine room; and invoking the management and control service unit in the second machine room that receives the update task, and updating the database instance node in the second machine room;
or,
receiving an update instruction for the second machine room, and updating the database instance node in the second machine room based on the update instruction, wherein the updated database instance node in the second machine room is configured to receive a data operation task.
7. The method of claim 1, further comprising, after the updating the database instance node in the second machine room:
acquiring historical data corresponding to the failure of the first machine room; and
invoking the management and control service unit in the second machine room, and updating metadata of the metadata database instance node in the second machine room based on the historical data within a preset recovery time.
8. The method of claim 1, further comprising, after the sending, to the second machine room according to the data synchronization result, the execution instruction for executing the metadata processing task and the data processing task:
in response to recovery of the first machine room from the failure, invoking a switching service of the first machine room, and updating a data storage node of a database storage unit in the first machine room and a metadata database instance node in the first machine room;
invoking a management and control service unit in the first machine room, allocating a data processing task corresponding to the second machine room to the first machine room, and updating a database instance node in the first machine room;
synchronizing data to the first machine room based on the updated data storage node, metadata database instance node, and database instance node in the first machine room; and
sending, to the first machine room according to a data synchronization result, an execution instruction for executing a metadata processing task and a data processing task.
9. The method of claim 8, wherein the invoking the management and control service unit in the first machine room, allocating the data processing task corresponding to the second machine room to the first machine room, and updating the database instance node in the first machine room further comprises:
restoring the data storage node in the second machine room and the metadata database instance node in the second machine room; and
invoking the management and control service unit in the second machine room, allocating the data processing task corresponding to the first machine room to the second machine room, and restoring the database instance node in the second machine room.
10. A disaster recovery system, comprising:
a dispatching platform storing data synchronization executable instructions, wherein the data synchronization executable instructions, when executed by the dispatching platform, implement the steps of the disaster recovery method of any one of claims 1 to 9, and are used for allocating data stored in a first machine room and a data processing task to a second machine room.
11. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions, and the computer executable instructions, when executed by the processor, implement the steps of the disaster recovery method of any one of claims 1 to 9.
12. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the disaster recovery method of any one of claims 1 to 9.
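Read together, claims 1 to 4 describe a failover sequence (invoke the switching service, reassign tasks via the management and control service, synchronize data to the updated nodes, then issue execution instructions), and claim 8 describes the symmetric failback once the first machine room recovers. The sketch below is a minimal, hypothetical Python rendering of that control flow only; all class and method names are illustrative assumptions not taken from the patent, and the node updates and data synchronization of the claims are reduced to in-memory dictionary operations.

```python
# Illustrative sketch only: names and data structures are assumptions,
# not the patent's implementation.

class MachineRoom:
    """A machine room holding a data storage node, a metadata database
    instance node, and a database instance node, plus assigned tasks."""
    def __init__(self, name):
        self.name = name
        self.storage = {}    # data storage node contents
        self.metadata = {}   # metadata database instance contents
        self.instances = {}  # database instance contents
        self.tasks = []      # assigned data processing tasks

class DispatchingPlatform:
    """Provides one set of management services for two machine rooms
    (claim 1) and drives failover and failback between them."""
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def _switch(self, src, dst):
        # Management and control service: reassign src's data processing
        # tasks to dst and update dst's database instance node (claim 3).
        dst.tasks.extend(src.tasks)
        src.tasks = []
        # Synchronize stored data, metadata, and instance data to the
        # updated nodes in dst (claim 4), here modeled as dict merges.
        dst.storage.update(src.storage)
        dst.metadata.update(src.metadata)
        dst.instances.update(src.instances)
        # Send execution instructions according to the sync result.
        return [f"execute:{t}" for t in dst.tasks]

    def fail_over(self):
        """First machine room has failed: switch to the second (claim 1)."""
        return self._switch(self.first, self.second)

    def fail_back(self):
        """First machine room has recovered: switch back (claim 8)."""
        return self._switch(self.second, self.first)
```

A short usage example under the same assumptions: after `fail_over()`, the second room holds the first room's tasks and data and receives the execution instructions; `fail_back()` reverses the roles, mirroring the symmetry between claims 1 and 8.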
CN202310181688.7A 2023-02-23 2023-02-23 Disaster recovery method, device and system Pending CN116302691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310181688.7A CN116302691A (en) 2023-02-23 2023-02-23 Disaster recovery method, device and system


Publications (1)

Publication Number Publication Date
CN116302691A true CN116302691A (en) 2023-06-23

Family

ID=86831772




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination