CN114168380A - Database configuration method, device, system and storage medium - Google Patents

Database configuration method, device, system and storage medium

Info

Publication number
CN114168380A
Authority
CN
China
Prior art keywords
node
data
shared storage
storage system
checkpoint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111395416.4A
Other languages
Chinese (zh)
Inventor
郑涔
林云
夏德军
李竟成
李敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111395416.4A priority Critical patent/CN114168380A/en
Publication of CN114168380A publication Critical patent/CN114168380A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 - Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 - Saving, restoring, recovering or retrying
    • G06F 11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 - Management of the data involved in backup or backup restore
    • G06F 11/1451 - Management of the data involved in backup or backup restore by selection of backup contents
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/202 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F 11/2046 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share persistent storage
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/1734 - Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/174 - Redundancy elimination performed by the file system
    • G06F 16/1748 - De-duplication implemented within the file system, e.g. based on file segments
    • G06F 16/1756 - De-duplication implemented within the file system, e.g. based on file segments based on delta files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/176 - Support for shared access to files; File sharing support
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/211 - Schema design and management
    • G06F 16/213 - Schema design and management with details for schema evolution support
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 - Updating
    • G06F 16/2365 - Ensuring data consistency and integrity
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/275 - Synchronous replication

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a database configuration method, device, system and storage medium. The method includes: starting a plurality of nodes of a database system on a shared storage system, where the plurality of nodes include a first node serving as a master node and a second node serving as a standby node, and the plurality of nodes share the data files stored in the shared storage system; writing, by the first node, modification operations corresponding to the data files into a transaction log, periodically generating checkpoints corresponding to the data files, and writing the transaction log and the checkpoints into the shared storage system; and periodically loading, by the second node, the latest checkpoint from the shared storage system to achieve data synchronization with the first node. Based on the shared storage mechanism, data storage cost is reduced, no data copying between different nodes is needed, deployment flexibility is improved, and the data-layout inconsistency that easily arises when the standby node synchronizes by replaying the master node's transaction log is avoided.

Description

Database configuration method, device, system and storage medium
Technical Field
The present invention relates to the field of database technologies, and in particular, to a method, device, system, and storage medium for configuring a database.
Background
Databases typically use a primary-standby synchronization mechanism to ensure high availability of data and services. In an architecture adopting this mechanism there are one master node and several standby nodes (the nodes may be implemented as processes); every node holds the same data, only the master node provides read-write service while the standby nodes provide read-only service, and the master node sends synchronization logs to the standby nodes over the network so that the standby nodes' data stays up to date.
Such an architecture also typically requires an election mechanism to determine which node acts as the master. When the master node becomes unavailable, a new master node is elected from the standby nodes to continue providing service, thereby ensuring high availability of data and service. Taking the database system MongoDB as an example, a common deployment consists of one master node and at least two standby nodes; the master node synchronizes the latest data to the standby nodes through the oplog, and when the master node goes down a new master node can be elected from the standby nodes.
The existing deployment architecture has the problem that each node keeps an independent copy of the data, so every newly added node must first perform a full data synchronization, which takes a long time, offers poor elasticity, and brings high storage cost.
Disclosure of Invention
Embodiments of the invention provide a database configuration method, device, system and storage medium, which reduce data storage cost and provide better deployment flexibility.
In a first aspect, an embodiment of the present invention provides a database configuration method, where the method includes:
starting a plurality of nodes in a database system on a shared storage system, wherein the plurality of nodes comprise a first node serving as a main node and a second node serving as a standby node, and the plurality of nodes share data files stored in the shared storage system;
writing, by the first node, a modification operation corresponding to the data file into a transaction log, periodically generating a checkpoint corresponding to the data file, and writing the transaction log and the checkpoint into the shared storage system;
periodically loading, by the second node, a latest checkpoint from the shared storage system to achieve data synchronization with the first node.
In a second aspect, an embodiment of the present invention provides a database configuration apparatus, where the apparatus includes:
a starting module, configured to start a plurality of nodes in a database system on a shared storage system, wherein the plurality of nodes comprise a first node serving as a main node and a second node serving as a standby node, and the plurality of nodes share data files stored in the shared storage system;
a processing module, configured to write, by the first node, a modification operation corresponding to the data file into a transaction log, periodically generate a checkpoint corresponding to the data file, and write the transaction log and the checkpoint into the shared storage system; and periodically loading, by the second node, a latest checkpoint from the shared storage system to achieve data synchronization with the first node.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the database configuration method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the database configuration method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a database system, including:
the system comprises a shared storage system and a plurality of nodes, wherein the shared storage system stores data files, and the plurality of nodes are started on the shared storage system and comprise a first node serving as a main node and a second node serving as a standby node;
the first node is used for writing modification operation corresponding to the data file into a transaction log, periodically generating a check point corresponding to the data file, and writing the transaction log and the check point into the shared storage system;
the second node is used for loading the latest check point from the shared storage system periodically so as to realize data synchronization with the first node.
In embodiments of the invention, a shared storage mechanism is adopted to store the data files of the database system, and the master node and the standby nodes are configured to use the data files stored in the shared storage system. The deployment architecture therefore has a fixed storage cost that is independent of the number of nodes: only one shared copy of the data files is stored no matter how many nodes there are. Moreover, when the specification of the database system changes, for example when the memory and CPU of the physical or virtual machine hosting it can no longer meet requirements and an upgrade is needed, no data copying between nodes on different physical or virtual machines is required thanks to the shared storage mechanism; since copying a large data file takes a long time, this provides better deployment flexibility.
In addition, embodiments of the invention realize data synchronization between the master and standby nodes through the shared storage system. Specifically, the master node periodically generates checkpoints and writes them into the shared storage system in sequence, and the standby node periodically loads the latest checkpoint stored in the shared storage system, so that the standby node sees the data files as of the latest checkpoint and thereby stays synchronized with the master node. This synchronization mode avoids the inconsistency in the physical layout of data between the master and standby nodes that easily arises in the conventional scheme where the master node sends its transaction log (a logical log) directly to the standby node for replay.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a deployment manner of a database system according to an embodiment of the present invention;
FIG. 2 is a diagram of a database system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another database system provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating a loading checkpoint state according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a primary and standby node synchronization process according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a primary/standby switching process according to an embodiment of the present invention;
fig. 7 is a flowchart of a database configuration method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a database configuration apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device corresponding to the database configuration apparatus provided in the embodiment shown in fig. 8.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
As described in the background section, in an architecture where the master node and at least one standby node of a database system each independently maintain their own copy of the data files, N copies must be stored if there are N (N > 1) nodes in total, which is costly in storage, and whenever a node is added a full copy of the data must be made again, which takes a long time and provides poor elasticity. Here "one copy of the data files" does not mean a single file, but the complete set of data files present in the database system.
To avoid the high storage cost, a shared storage mechanism is provided: all nodes in the database system (the master node and every standby node) share one copy of the data files, and the carrier storing these data files is called the shared storage system.
An alternative implementation of database system deployment and data synchronization between active and standby nodes based on a shared storage mechanism is illustrated in the following with reference to fig. 1.
In fig. 1, a case is assumed where a database system is deployed in a physical machine or a virtual machine (referred to herein as the target machine for convenience of description). In this case a shared storage system may be configured on the disk of the target machine, that is, each data file of the database system is stored under the same root directory, and a plurality of nodes (i.e., a plurality of processes) including a master node and at least one standby node (for simplicity, only one standby node is illustrated in fig. 1) are started on the shared storage system. In this embodiment the database system MongoDB is taken as an example, and MongoDB uses WiredTiger as its built-in storage engine.
As shown in fig. 1, data synchronization between the master and standby nodes can be achieved by replaying (or applying) the Journal log on the standby node in real time to ensure that the standby node's data stays in the latest state. Here, the Journal is the transaction log.
That is, on receiving a user's modification operation on some data file in the database system, the master node first writes the modification operation into the Journal, then modifies the corresponding data file accordingly, and transmits the Journal to the standby node in real time; after receiving it, the standby node replays the log and thereby obtains the master node's modification result for the data file, i.e. the latest state of the data.
However, the Journal of MongoDB's storage engine WiredTiger is a logical log, and directly replaying a logical log on the standby node cannot guarantee that the physical layout of the standby node's data is consistent with that of the master node; in particular the physical layout of data in memory may diverge, which can cause the database system to stop working normally.
A logical log is a concept defined in contrast to a physical log. In a database system, data is organized in pages. Briefly, in embodiments of the invention a logical log records, for a modification operation, the modification made to a certain key rather than to a specific page, whereas a physical log records, for a modification operation, the modification made to a certain page. For example, if a modification operation deducts 100 from user account x, the logical log records the modification made to the key of user account x, while the physical log records the modification made to the page that stores the account's data, including information such as the exact storage location of that page in the shared storage system. In short, a logical log records logical changes of the database system, and a physical log records physical changes of the database, i.e., the changed data itself.
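By way of illustration only, the following sketch contrasts the two record shapes described above. The field names (key, page_id, offset_in_file, before/after images) are hypothetical and do not reflect WiredTiger's actual on-disk log format.

```python
# Hypothetical record shapes, for illustration only; not WiredTiger's real formats.

# A logical log record describes the change in terms of the key that was modified:
logical_record = {
    "op": "update",
    "key": "user_account_x",    # which logical record changed
    "delta": -100,              # what changed, expressed in application terms
}

# A physical log record describes the change in terms of the page that was modified:
physical_record = {
    "op": "page_write",
    "page_id": 4711,            # which page of the data file changed
    "offset_in_file": 0x2A000,  # exact storage location in the shared storage system
    "before_image": b"...",     # page bytes before the change
    "after_image": b"...",      # page bytes after the change
}
```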
As an example of inconsistent physical data layout: if some data occupies contiguous storage locations in the master node's memory but discontiguous locations in the standby node's memory, the physical layouts of the data in the two nodes' memories are inconsistent. Such inconsistency can ultimately render the database system inoperable, because when the master node and the standby node use the same data file they cannot work correctly if they understand its data layout differently.
To provide a highly available MongoDB based on shared storage, conventional MongoDB could be modified; however, the existing modification schemes all replace MongoDB's storage engine WiredTiger with a storage engine based on a Log-Structured Merge tree (LSM tree), such as RocksDB, which is friendlier to shared storage. Such schemes are too intrusive to the MongoDB kernel (i.e., its original code implementation), the modification difficulty and cost are too high, and good compatibility with MongoDB's native syntax cannot be maintained.
Embodiments of the invention therefore provide another shared-storage-based database system deployment scheme. In this scheme, a warm standby mode is added to MongoDB by slightly modifying the MongoDB kernel while still using its own storage engine WiredTiger. The modified MongoDB supports starting one master node and several standby nodes (called warm standby nodes) on one shared copy of the stored data, and allows one of the standby nodes to quickly become the new master node and provide service when the master node is unavailable. In this scheme MongoDB's native oplog-based master-standby synchronization mechanism is not used between the master node and the standby nodes; instead, data is synchronized through the shared storage. The standby node's data is not up to date in real time but lags the master node by a fixed delay (typically around one minute). When the master node becomes unavailable, a new master node is elected among the standby nodes through a mechanism based on distributed file locks; the elected node brings its data to the latest state through the shared storage and then performs the master-standby switch to take over service.
Some related concepts are explained first.
Hot standby mode. As described in the background section, in a conventional database architecture the master node and each standby node hold independent data, and data synchronization between them is completed through a data synchronization log (e.g., the oplog, a type of log different from the transaction log that is used for master-standby data synchronization). Because the nodes are independent, each node persists its local data through its own data persistence mechanism. That is, the master node ensures the persistence of its local data (so that no data is lost in the database system), and the standby node, when synchronizing on the basis of the data synchronization log obtained from the master node, writes the corresponding data modifications into its own local data files and likewise must ensure the persistence of its local data. Data persistence is generally implemented with checkpoints and a transaction log. When this synchronization mode is used, the mode is called the hot standby mode and the standby node is called a hot standby node.
The warm standby mode is defined relative to the hot standby mode; in the warm standby mode the standby node is called a warm standby node. In the warm standby mode provided by embodiments of the invention, the standby node does not interact directly with the master node to complete data synchronization; instead, the standby node and the master node both achieve data synchronization through the shared storage system, and the standby node's data persistence capability is turned off. As a result, the shared storage system can be modified by only one node (the master node) at any time; if different nodes modified the shared storage system at the same time, the data would be corrupted and the database system could not work. Turning off the standby node's data persistence capability means that the node, while operating as a standby node, no longer performs any operation that generates checkpoints or writes the transaction log.
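As a minimal sketch of this role-dependent configuration (the attribute names are illustrative and are not MongoDB configuration options), a warm standby node simply has its journal writing and checkpoint generation disabled:

```python
class NodeConfig:
    """Hypothetical per-node configuration for the warm standby mode."""

    def __init__(self, role: str):
        self.role = role  # "master" or "standby"
        # Only the master keeps its data persistence capability: it alone writes
        # the transaction log and generates checkpoints, so the shared storage
        # system is modified by exactly one node at any time.
        self.journal_writes_enabled = (role == "master")
        self.checkpoint_generation_enabled = (role == "master")
        # A warm standby node serves reads only, from the checkpoint it last loaded.
        self.read_only = (role != "master")
```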
The following embodiments are used to illustrate the composition and data synchronization process of the database system using the warm standby mode.
Fig. 2 is a schematic diagram of a database system according to an embodiment of the present invention. As shown in fig. 2, in an alternative embodiment the database system may be deployed in a single target machine: a shared storage system is formed on the target machine's disk to store the data files of the database system, and a plurality of nodes are started in the target machine and configured to share that shared storage system.
A locking mechanism, such as a distributed file lock, is configured in the shared storage system. The nodes perform locking operations on the shared storage system according to the configured distributed locking algorithm, and the locking results determine which of the started nodes serves as the master node (the node that locks successfully becomes the master node) while the other nodes serve as standby nodes. The data persistence capability of the standby nodes is then turned off.
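The following Python sketch illustrates the election idea with a non-blocking exclusive lock on a lock file in the shared directory. It is a simplification: the lock path is hypothetical, fcntl.flock is a local POSIX file lock, and a real deployment over shared storage would rely on a distributed file lock as described above.

```python
import fcntl
import os
import time

LOCK_PATH = "/shared_storage/mongod.lock"  # hypothetical lock file in the shared directory

def try_become_master(lock_path: str = LOCK_PATH):
    """Attempt a non-blocking exclusive lock; the holder acts as the master."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd        # lock acquired: this node becomes the master
    except BlockingIOError:
        os.close(fd)
        return None      # lock held elsewhere: keep running as a standby node

if __name__ == "__main__":
    while True:
        lock_fd = try_become_master()
        if lock_fd is not None:
            print("elected master: enable journal writes and checkpoint generation")
            break
        print("standby: master lock held by another node, retrying")
        time.sleep(5)
```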
Fig. 3 is a schematic diagram of another database system according to an embodiment of the present invention. As shown in fig. 3, in another alternative embodiment the database system may be built on cloud resources. Specifically, the shared storage system may be deployed in the cloud while the nodes are located on the user side (where the deployer of the database system is located). In this case the nodes may reside in the same or different target machines, for example started in two different virtual machines, and the master node and the standby nodes are determined by election.
The warm standby mode provided by the embodiment of the invention can be applied to different database system construction modes shown in fig. 2 and 3.
In summary, in the warm standby mode, the data synchronization between the master node and the standby node is performed in the following manner:
the master node writes modification operations corresponding to the data files into a transaction log, periodically generates checkpoints corresponding to the data files, and writes the transaction log and the checkpoints into the shared storage system;
the standby node periodically loads the latest checkpoint from the shared storage system to achieve data synchronization with the master node.
The data files refer to data files included in the shared storage system.
Specifically, after a plurality of nodes are started on the shared storage system and a master node and standby nodes are selected from them based on, for example, a distributed file locking mechanism, the master node performs data persistence processing through its data persistence capability.
Starting the plurality of nodes on the shared storage system means that the nodes are started and configured to point to the shared storage system, i.e., they use the shared storage system in common.
The plurality of nodes elect the master node and the standby node by competing for the distributed file lock, and this process can be implemented by using the existing related technology, which is not described herein.
When a certain node is not elected as a main node, the node works as a standby node. During the time that the node is acting as a standby node, its data persistence capabilities, including the ability to write transaction logs and the ability to generate checkpoints, are turned off so that the standby node does not modify data files in the shared storage system.
In practice, a user accessing the database system may trigger data modification operations, which can be understood to include insertions, deletions, and updates. After receiving a modification operation, the master node writes it into the transaction log, which guarantees data persistence. The normal data modification flow is to write the modification operation into the transaction log first and only then flush the modified data (i.e., write the modified data into the corresponding data file); once the modification operation has been written into the transaction log, recovery can be performed from the transaction log even if the subsequent flush of the modified data fails.
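A minimal write-ahead-logging sketch of this ordering is given below, assuming a toy key-value model in which each key stands in for a page; the journal format and method names are hypothetical, not WiredTiger's.

```python
import json
import os
import time

class MasterNode:
    """Toy master node: the modification operation is persisted in the transaction
    log (journal) before the in-memory data is changed, so a crash before the
    modified data is flushed can be repaired by replaying the journal."""

    def __init__(self, journal_path: str):
        self.journal_path = journal_path
        self.data = {}           # in-memory image of the data (one "page" per key)
        self.dirty_keys = set()  # keys modified since the last checkpoint

    def apply(self, key: str, value):
        # 1. Write the modification operation into the transaction log first.
        record = {"op": "set", "key": key, "value": value, "ts": time.time()}
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(record) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # 2. Only then modify the data in memory; it reaches the data file later,
        #    when the next checkpoint is generated.
        self.data[key] = value
        self.dirty_keys.add(key)
```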
In addition, the master node is configured to periodically (e.g., every minute or another set interval) generate checkpoints corresponding to the data files, thereby ensuring data persistence through the checkpoint mechanism. It will be appreciated that each checkpoint generated by the master node applies to the corresponding data files in the shared storage system; optionally, if no data modification has occurred for some data file by the time a checkpoint is generated, that checkpoint may be skipped for it, i.e., not recorded in the corresponding data file.
Specifically, the master node periodically generates checkpoints corresponding to the data files. Each checkpoint in the sequence covers the transaction log records generated between the previous checkpoint and the current one, and each checkpoint is associated with an execution time, so that when a checkpoint is triggered (its execution time is reached) the master node writes the dirty data currently in memory into the corresponding data files.
The above describes how the master node implements data persistence through the transaction log and checkpoints. As the master node writes the transaction log and each time it generates a checkpoint, the corresponding transaction log and checkpoint are written into the shared storage system; a checkpoint is written into a data file.
In the embodiment of the present invention, the master node periodically generates a checkpoint and writes the generated checkpoint into each data file, but it should be noted that, for a data file, the master node may sequentially write a plurality of checkpoints into the data file, that is, the data file may retain a plurality of checkpoints generated in the latest period of time.
That is, the master node writes a plurality of checkpoints generated in sequence into the data file, so that the plurality of checkpoints remain in the data file, and the offset addresses corresponding to the plurality of checkpoints in the data file are different.
For a database system such as MongoDB, its own storage engine WiredTiger supports this out-of-place-update persistence feature. In-place update means that, when persisting a modification to a record in the database system, the data is updated at the location where the record was originally stored.
For example, with in-place-update persistence, a checkpoint A is written to a data file at offset address D1 reserved for checkpoint information; when another checkpoint B is generated, it must also be written at offset address D1, overwriting the previously stored checkpoint A. With the out-of-place-update persistence feature, however, checkpoint B can be written to the data file at another offset address D2, so more than one checkpoint remains in the data file. In this embodiment the number of checkpoints retained in the data file may be preset, or several recently generated checkpoints may be retained by giving each checkpoint a lifetime, for example a lifetime of at least twice the checkpoint generation period.
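The sketch below illustrates the out-of-place behaviour and the retention of the last k checkpoints, continuing the toy model from the previous sketch; the JSON-line file layout is hypothetical and chosen only for readability.

```python
import json
import time

class CheckpointWriter:
    """Toy checkpoint writer: every checkpoint is appended at a new offset instead
    of overwriting its predecessor, and the most recent k checkpoints are retained
    so that a slightly lagging standby node can still find one to load."""

    def __init__(self, data_file_path: str, keep_last: int = 3):
        self.data_file_path = data_file_path
        self.keep_last = keep_last
        self.retained = []  # (offset, timestamp) of the checkpoints still retained

    def write_checkpoint(self, dirty_pages: dict) -> int:
        """Flush the dirty pages (key -> value in this toy model) as a checkpoint."""
        with open(self.data_file_path, "ab") as f:
            offset = f.tell()  # append mode: the new checkpoint gets a fresh offset
            payload = {"ts": time.time(), "pages": dirty_pages}
            f.write(json.dumps(payload).encode() + b"\n")
        self.retained.append((offset, payload["ts"]))
        # Keep only the most recent k checkpoints; older ones become reclaimable,
        # e.g. once their lifetime of two or more checkpoint periods has elapsed.
        self.retained = self.retained[-self.keep_last:]
        return offset
```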
The multiple checkpoints are retained in the data file to facilitate data synchronization between the master and standby nodes, because the standby node is configured to periodically load the latest checkpoint from the shared storage system to synchronize with the master node. The latest checkpoint is the most recently generated checkpoint that the standby node detects among the checkpoints stored in the shared storage system each time it performs a checkpoint-loading operation.
The following illustrates the synchronization process between the primary and standby nodes in conjunction with fig. 4.
In fig. 4 it is assumed that the most recently generated checkpoints retained by the master node in the shared storage system are checkpoint C1, checkpoint C2 and checkpoint C3, where C1 was generated first and C3 last. Assume that at some time T1 after checkpoint C3 is generated the standby node's checkpoint-loading time arrives; the standby node then detects these three checkpoints in the shared storage system. Because checkpoint C3 is the latest, the standby node synchronizes with the master node by loading checkpoint C3.
If the master node generates a new checkpoint C4 (for example at time T2 shown in the figure) while the standby node is loading a checkpoint, the newly generated C4 is not yet visible to the standby node at loading time T1 and cannot be loaded, so the standby node's synchronization with the master node lags slightly.
In fact, the processing capacity and efficiency of the standby node are lower than those of the master node, and loading a checkpoint itself takes time. In the example above, if multiple checkpoints were not retained in the data file by means of the out-of-place-update persistence feature, the earlier checkpoint C3 would be deleted when checkpoint C4 is generated. The standby node might not yet see C4 when it performs its loading action, and if C3 has already been deleted at that moment (the master node processes faster than the standby node), the standby node could not synchronize with the master node at the current loading time. This is why the master node retains several checkpoints generated before the most recent one.
It should be noted that the standby node's "loading" of a checkpoint and the master node's "generation" of a checkpoint are two different events. Generating a checkpoint means that, when the checkpoint is triggered, the current dirty pages in memory are flushed into the data file. Loading a checkpoint means loading, from the data file, the metadata corresponding to the latest checkpoint currently available. This metadata contains information about each data page in the corresponding data file, and by loading the checkpoint the standby node can see the data state in the shared storage system as of that latest checkpoint, thereby achieving synchronization with the master node.
The standby node periodically detects and loads the latest checkpoint from the shared storage system in order to keep the lag between its own data (the data visible to the standby node) and the master node's data (the data visible to the master node) within a fixed range.
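Continuing the same toy layout, a standby node's loading loop might look like the sketch below: it never writes the journal or generates checkpoints, it only re-scans the shared data file, picks the newest retained checkpoint and loads its metadata as a read-only view.

```python
import json
import os
import time

class StandbyNode:
    """Toy warm standby node: periodically loads the latest checkpoint's metadata."""

    def __init__(self, data_file_path: str, period_seconds: float = 60.0):
        self.data_file_path = data_file_path
        self.period_seconds = period_seconds
        self.loaded_ts = 0.0
        self.visible_pages = {}  # view of the data as of the checkpoint last loaded

    def load_latest_checkpoint(self):
        if not os.path.exists(self.data_file_path):
            return
        newest = None
        with open(self.data_file_path, "rb") as f:
            for line in f:                          # scan the retained checkpoints
                checkpoint = json.loads(line)
                if newest is None or checkpoint["ts"] > newest["ts"]:
                    newest = checkpoint
        if newest is not None and newest["ts"] > self.loaded_ts:
            self.visible_pages = newest["pages"]    # load checkpoint metadata only
            self.loaded_ts = newest["ts"]

    def run(self):
        while True:
            self.load_latest_checkpoint()           # lag stays within one period
            time.sleep(self.period_seconds)
```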
It should also be noted that the standby node synchronizes with the master node by periodically loading the latest checkpoint in the shared storage system, and this process involves no transaction log replay, so the inconsistent physical data layout that easily arises when a standby node synchronizes by replaying transaction logs sent by the master node does not occur.
Therefore, in the scheme provided by embodiments of the invention, the warm standby mode exploits the out-of-place-update persistence feature of the MongoDB storage engine WiredTiger and shares data between the master and standby nodes through the shared storage system, so that the warm standby node can periodically preload the master node's most recently generated checkpoint and thereby stay synchronized with the master node. The scheme sacrifices real-time freshness of the standby node's data (its synchronization with the master node lags slightly) in order to avoid the problem that the logical transaction log used by WiredTiger is unfriendly to shared storage, while avoiding heavy modification of the MongoDB kernel and preserving good compatibility with MongoDB's native syntax. "Unfriendly to shared storage" means that, because of inconsistent data layouts between the master and standby nodes, the two nodes would understand the data layout of the same data file in the shared storage system differently.
In practice, a node elected as master at some point may later fail and become unavailable, at which point master-standby switching is needed.
Specifically, when the database system is first started, the started nodes determine one master node and several standby nodes by election: each node keeps trying to lock a specific data file in the shared storage system, and if a node X locks it successfully, X becomes the master node and the other nodes become standby nodes. Afterwards every node still keeps trying to lock that data file; if a node Y locks it successfully at some time, this indicates that the former master node X has become unavailable and has released its lock, and node Y becomes the new master node. Once node Y has locked successfully, the master-standby switching flow begins. This flow mimics the database's startup recovery process: it first ensures that node Y has loaded (or "opened") the latest checkpoint stored in the shared storage system, which is normally already satisfied by the standby node's periodic loading logic, then restores node Y's data to the latest state by replaying the transaction log generated after that checkpoint, and performs some other initialization actions. Node Y can then switch over to become the new master node and provide service.
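A sketch of this switchover, reusing the hypothetical StandbyNode and journal format from the previous sketches, is shown below: the newly elected node first makes sure the latest checkpoint is loaded (phase one, normally already done by its periodic loading), then replays only the journal records written after that checkpoint (phase two), and only then starts serving writes.

```python
import json
import os

def promote_to_master(standby, journal_path: str):
    """Toy master-standby switchover for the StandbyNode sketched earlier."""
    # Phase 1: make sure the latest checkpoint is loaded (usually cheap, because
    # the standby node has been loading checkpoints periodically all along).
    standby.load_latest_checkpoint()

    # Phase 2: replay the transaction log records generated after that checkpoint
    # to bring the node's view of the data to the latest state.
    if os.path.exists(journal_path):
        with open(journal_path) as journal:
            for line in journal:
                record = json.loads(line)
                if record.get("ts", 0) > standby.loaded_ts:
                    standby.visible_pages[record["key"]] = record["value"]

    # The node can now enable journal writes and checkpoint generation and take
    # over read-write service as the new master node.
```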
In the scheme provided by embodiments of the invention, the warm standby mode uses the out-of-place-update persistence feature of the MongoDB storage engine WiredTiger and shares data between the master and standby nodes through the shared storage system, so that in normal operation the warm standby node regularly preloads the latest checkpoint generated by the master node but does not replay the latest transaction log; log replay is performed only during master-standby switching. In other words, the startup recovery flow of the database system involves two phases: the first phase loads the checkpoint and the second phase replays the transaction log. In this embodiment the standby node completes the work of the first phase in advance, i.e., before it becomes the new master node, so that after becoming the new master node it only needs to perform the second phase, replaying the transaction log generated by the master node after the latest checkpoint, which shortens the recovery time of the database system.
That is, the standby node periodically preloads the master node's latest checkpoint (the full data), and only the transaction log generated after that checkpoint (the incremental data) needs to be replayed when the master-standby switch occurs.
It should be noted that the foregoing merely takes the MongoDB database system with the WiredTiger storage engine as an example; the solution provided by embodiments of the invention may also be applied to other database systems with similar characteristics, namely systems whose transaction logs are logical logs.
To facilitate understanding of the implementation of the database system described above, an example is provided in conjunction with fig. 5 and 6.
In fig. 5 it is assumed that the first node is currently the master node and the second node is the standby node. The shared storage system contains a file lock (mongod.lock), the transaction log records generated in sequence by the master node in response to the modification operations that users trigger on the database system (a "transaction log" here can be understood as a single record of the log), and the checkpoints stored in a certain table (table 1). The master node is in read-write mode and the standby node is in read-only mode.
The first node, as the master node, periodically generates checkpoints and keeps the latest k checkpoints in the shared storage system, with k greater than 1. In fig. 5 it is assumed that the first node generates two checkpoints in sequence, checkpoint1 and checkpoint2. The second node, as the standby node, periodically loads the latest checkpoint in the shared storage system; assume that at the current loading time checkpoint1 is loaded into the second node. The second node thus synchronizes with the master node based on the latest checkpoint it has loaded.
In addition, as shown in fig. 5, the first node and the second node may perform locking operations periodically. If the second node locks successfully at some time, the master-standby switching process is triggered; as shown in fig. 6, the second node, originally the standby node, becomes the new master node.
At this point the second node performs a load operation on the latest checkpoint currently stored in the shared storage system, such as checkpoint2, to ensure it is loaded. The second node then acquires from the shared storage system the transaction log records generated after that checkpoint; in fig. 6, assuming the records REC1, REC2 and CKPT2 were generated after checkpoint2, these three records are replayed. In practice the master-standby switching process may also involve some other initialization operations of the new master node, which are not described here; through this switching process the second node provides service as the new master node.
Fig. 7 is a flowchart of a database configuration method according to an embodiment of the present invention, where the method may be executed by an electronic device, and the electronic device may be the target machine mentioned above. As shown in fig. 7, the method may include the steps of:
701. starting a plurality of nodes in a database system on a shared storage system, wherein the plurality of nodes comprise a first node serving as a main node and a second node serving as a standby node, and the plurality of nodes share data files stored in the shared storage system.
In practice, the second node's data persistence capabilities, including the ability to write the transaction log and the ability to generate checkpoints, are turned off while it operates as a standby node.
702. Writing, by the first node, the modification operation corresponding to the data file into the transaction log, periodically generating a checkpoint corresponding to the data file, and writing the transaction log and the checkpoint into the shared storage system.
Periodically generating, by the first node, the checkpoint corresponding to the data file comprises: when the checkpoint is triggered, writing the dirty data in memory into the data file through the first node.
Writing the periodically generated checkpoints into the shared storage system through the first node comprises: writing a plurality of sequentially generated checkpoints into the data file through the first node so that the plurality of checkpoints are retained in the data file, the offset addresses of the checkpoints in the data file being different.
703. Periodically loading, by the second node, the latest checkpoint from the shared storage system to achieve data synchronization with the first node.
Specifically, metadata corresponding to the currently acquired latest checkpoint may be loaded from the data file by the second node to achieve data synchronization with the first node.
In addition, if the second node elects to become a new main node, the second node acquires a transaction log generated after the current latest checkpoint from the shared storage system and plays back the transaction log.
For the specific implementation process of the above steps, reference may be made to the relevant descriptions in the foregoing embodiments, which are not described herein again.
The database configuration apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these means can each be constructed using commercially available hardware components and by performing the steps taught in this disclosure.
Fig. 8 is a schematic structural diagram of a database configuration apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus comprises a starting module 11 and a processing module 12.
The starting module 11 is configured to start a plurality of nodes in a database system on a shared storage system, where the plurality of nodes include a first node serving as a master node and a second node serving as a slave node, and the plurality of nodes share data files stored in the shared storage system.
A processing module 12, configured to write, by the first node, a modification operation corresponding to the data file into a transaction log, periodically generate a checkpoint corresponding to the data file, and write the transaction log and the checkpoint into the shared storage system; and periodically loading, by the second node, a latest checkpoint from the shared storage system to achieve data synchronization with the first node.
Optionally, the processing module 12 is further configured to: if the second node is elected to become a new master node, acquire, through the second node, the transaction log generated after the latest checkpoint from the shared storage system, and replay the transaction log.
Optionally, the starting module 11 is further configured to: turn off the data persistence capabilities of the second node while the second node serves as a standby node, the data persistence capabilities including the ability to write the transaction log and the ability to generate checkpoints.
Optionally, in the process of writing the checkpoint into the shared storage system, the processing module 12 is specifically configured to: write a plurality of sequentially generated checkpoints into the data file through the first node so that the data file retains the checkpoints, where the offset addresses of the checkpoints in the data file differ.
Optionally, in the process of periodically generating the checkpoint corresponding to the data file, the processing module 12 is specifically configured to: write the dirty data in memory into the data file through the first node when the checkpoint is triggered.
Optionally, in the process of periodically loading the latest checkpoint from the shared storage system through the second node, the processing module 12 is specifically configured to: load, through the second node, the metadata corresponding to the currently acquired latest checkpoint from the data file.
The apparatus shown in fig. 8 can perform the steps in the foregoing embodiments, and the detailed performing process and technical effects refer to the descriptions in the foregoing embodiments, which are not described herein again.
In one possible design, the structure of the database configuration apparatus shown in fig. 8 may be implemented as an electronic device. As shown in fig. 9, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 has stored thereon executable code which, when executed by the processor 21, causes the processor 21 to implement at least the database configuration method of the foregoing embodiments.
Additionally, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a server, causes the processor to at least implement a database configuration method as provided in the foregoing embodiments.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement them without creative effort.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented with the addition of a necessary general-purpose hardware platform, and of course also by a combination of hardware and software. Based on this understanding, the parts of the above technical solutions that in essence contribute beyond the prior art may be embodied in the form of a computer program product, which may be stored on one or more computer-usable storage media containing computer-usable program code, including but not limited to disk storage, CD-ROM and optical storage.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A database configuration method is applied to a database system and comprises the following steps:
starting a plurality of nodes in a database system on a shared storage system, wherein the plurality of nodes comprise a first node serving as a main node and a second node serving as a standby node, and the plurality of nodes share data files stored in the shared storage system;
writing, by the first node, a modification operation corresponding to the data file into a transaction log, periodically generating a checkpoint corresponding to the data file, and writing the transaction log and the checkpoint into the shared storage system;
periodically loading, by the second node, a latest checkpoint from the shared storage system to achieve data synchronization with the first node.
2. The method of claim 1, further comprising:
and if the second node elects to become a new main node, acquiring a transaction log generated after the latest check point from the shared storage system through the second node, and playing back the transaction log.
3. The method of claim 1, wherein the second node is configured to turn off data persistence capabilities during operation as a standby node, the data persistence capabilities including an ability to write a transaction log and an ability to generate a checkpoint.
4. The method of claim 1, wherein said writing, by the first node, the checkpoint to the shared storage system comprises:
writing a plurality of check points generated in sequence into the data file through the first node so as to enable the data file to retain the check points, wherein offset addresses corresponding to the check points in the data file are different.
5. The method of claim 1, wherein said periodically generating, by the first node, checkpoints corresponding to the data files comprises:
and writing the dirty data in the memory into the data file through the first node when the checkpoint is triggered.
6. The method of claim 1, wherein said periodically loading, by the second node, a latest checkpoint from the shared storage system comprises:
and loading metadata corresponding to the currently acquired latest check point from the data file through the second node.
7. The method of claim 1, further comprising:
performing, by the plurality of nodes, locking operations on the shared storage system based on a distributed locking algorithm configured on the shared storage system, so as to determine, among the plurality of nodes, the first node serving as the main node and the second node serving as the standby node.
8. A database system, comprising:
the system comprises a shared storage system and a plurality of nodes, wherein the shared storage system stores data files, and the plurality of nodes are started on the shared storage system and comprise a first node serving as a main node and a second node serving as a standby node;
the first node is used for writing modification operation corresponding to the data file into a transaction log, periodically generating a check point corresponding to the data file, and writing the transaction log and the check point into the shared storage system;
the second node is used for loading the latest check point from the shared storage system periodically so as to realize data synchronization with the first node.
9. The system of claim 8, wherein the second node is further configured to retrieve a transaction log generated after the latest checkpoint from the shared storage system and play back the transaction log when electing to become a new master node.
10. The system of claim 8, wherein the second node is configured to turn off data persistence capabilities during operation as a standby node, the data persistence capabilities including an ability to write a transaction log and an ability to generate a checkpoint.
11. The system according to claim 8, wherein the first node is configured to write a plurality of checkpoints generated in sequence into the data file, so that the plurality of checkpoints remain in the data file, and the plurality of checkpoints have different offset addresses in the data file.
12. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the database configuration method of any of claims 1 to 7.
13. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the database configuration method of any one of claims 1 to 7.
CN202111395416.4A 2021-11-23 2021-11-23 Database configuration method, device, system and storage medium Pending CN114168380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111395416.4A CN114168380A (en) 2021-11-23 2021-11-23 Database configuration method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111395416.4A CN114168380A (en) 2021-11-23 2021-11-23 Database configuration method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN114168380A (en) 2022-03-11

Family

ID=80480034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111395416.4A Pending CN114168380A (en) 2021-11-23 2021-11-23 Database configuration method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114168380A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024104046A1 (en) * 2022-11-15 2024-05-23 华为云计算技术有限公司 Active change method and system, server, and storage medium
CN116244040A (en) * 2023-03-10 2023-06-09 安超云软件有限公司 Main and standby container cluster system, data synchronization method thereof and electronic equipment
CN116244040B (en) * 2023-03-10 2024-05-03 安超云软件有限公司 Main and standby container cluster system, data synchronization method thereof and electronic equipment

Similar Documents

Publication Publication Date Title
US7779295B1 (en) Method and apparatus for creating and using persistent images of distributed shared memory segments and in-memory checkpoints
US6823474B2 (en) Method and system for providing cluster replicated checkpoint services
Ardekani et al. A Self-Configurable Geo-Replicated Cloud Storage System
US7383293B2 (en) Database backup system using data and user-defined routines replicators for maintaining a copy of database on a secondary server
US5675802A (en) Version control system for geographically distributed software development
CN103597463B (en) Restore automatically configuring for service
US8099627B1 (en) Persistent images of distributed shared memory segments and in-memory checkpoints
US7584190B2 (en) Data files systems with hierarchical ranking for different activity groups
CN114168380A (en) Database configuration method, device, system and storage medium
CN109558215A (en) Backup method, restoration methods, device and the backup server cluster of virtual machine
JP6195834B2 (en) System and method for persisting transaction records in a transactional middleware machine environment
CN103562904A (en) Replaying jobs at a secondary location of a service
CN102158540A (en) System and method for realizing distributed database
CN107357688B (en) Distributed system and fault recovery method and device thereof
CN111651523B (en) MySQL data synchronization method and system of Kubernetes container platform
CN111880956B (en) Data synchronization method and device
CN102937955A (en) Main memory database achieving method based on My structured query language (SQL) double storage engines
JP5201133B2 (en) Redundant system, system control method and system control program
JP7215971B2 (en) METHOD AND APPARATUS FOR PROCESSING DATA LOCATION IN STORAGE DEVICE, COMPUTER DEVICE AND COMPUTER-READABLE STORAGE MEDIUM
CN108762982A (en) A kind of database restoring method, apparatus and system
CN115955488B (en) Distributed storage copy cross-machine room placement method and device based on copy redundancy
JP4289028B2 (en) Hard disk backup recovery system and information processing apparatus
WO2022033269A1 (en) Data processing method, device and system
CN115587099A (en) Distributed meter lock application method and device, storage medium and electronic equipment
CN111176886B (en) Database mode switching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination