CN116401313A - Shared storage database cluster information synchronization method - Google Patents

Shared storage database cluster information synchronization method

Info

Publication number
CN116401313A
CN116401313A (application CN202310335286.8A)
Authority
CN
China
Prior art keywords
lsn
data page
page
data
wal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310335286.8A
Other languages
Chinese (zh)
Inventor
郑晓军
苗健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Highgo Base Software Co ltd
Original Assignee
Highgo Base Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Highgo Base Software Co ltd filed Critical Highgo Base Software Co ltd
Priority to CN202310335286.8A priority Critical patent/CN116401313A/en
Publication of CN116401313A publication Critical patent/CN116401313A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/23 Updating
    • G06F16/2358 Change logging, detection, and notification
    • G06F16/2365 Ensuring data consistency and integrity
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a shared storage database cluster information synchronization method, which comprises: predefining two variables, a start LSN and a latest LSN; after the master node is started and conventional data integrity detection is completed, executing the following steps: acquiring the latest LSN in the database WAL; storing that latest LSN into the 'start LSN' and the 'latest LSN', and transmitting the 'start LSN' and the 'latest LSN' to each slave node; whenever the master node triggers any operation involving a data update, executing the following steps: writing the WAL, and obtaining the corresponding latest LSN after a log record is written; storing that latest LSN into the 'latest LSN', and collecting all data page numbers wal_page_no involved in the log record; and sending the 'latest LSN' and all collected wal_page_no to each slave node. The method achieves strong independence of the nodes with little mutual influence and improves data synchronization efficiency.

Description

Shared storage database cluster information synchronization method
Technical Field
The application relates to the technical field of databases, in particular to a shared storage database cluster information synchronization method.
Background
Shared-storage database clusters have evolved over many years in the relational database market. The main technical approaches fall into three categories: Oracle RAC, Informix SDS, and the cloud database schemes of the internet companies:
1. Oracle RAC
Oracle RAC is the earliest shared storage database cluster. It began with the Oracle Cluster version around 1992, and with the introduction of the Cache Fusion technology around 1998 it developed into what it is today. Oracle RAC provides peer-to-peer service externally, i.e. every node can provide readable and writable service. Oracle RAC provides a technique for fusing memory across multiple database server nodes, called Cache Fusion. When handling an access, each server checks which server holds the required memory data block; if it resides on another node, that node is asked to send the data block over together with control of it. Oracle also developed an ASM (Automatic Storage Management) module dedicated to managing multi-node shared storage.
The drawback of Oracle RAC is its excessively high coupling: the memory of all nodes in the cluster is managed almost as a whole through the Cache Fusion mechanism, and each node must communicate with other nodes to confirm or transfer memory data blocks whenever it handles a data access. Such an architecture is difficult to scale well, and apart from Oracle itself the industry has not adopted this clustering scheme.
2. Informix SDS
Informix SDS is a shared storage cluster product introduced by Informix in 2010, by which time Oracle RAC had been on the market for more than ten years. Voices in the industry held that RAC's coupling was too high: a faulty node in the cluster easily affects the other nodes, and the Cache Fusion mechanism transfers too much data, too often, at run time. Informix therefore proposed a synchronized one-master-multi-slave cluster mode, in which only one database node provides read-write service externally while the other nodes provide read-only service. This is only the view from the database core, however; database routing technology was already mature by then, and there were various ways to provide transparent service to the user. Several common modes of transparent service are briefly described later. In Informix SDS, Informix embedded a forwarding module in the secondary nodes that forwards received write requests to the primary node for execution.
Informix SDS adopts shared storage, and within the core only the master node provides read-write service externally. When the master node performs a data update, it first writes the WAL (write ahead log, i.e. the pre-written log, hereinafter the same); it then transmits the log sequence number LSN to all slave nodes. After receiving an LSN (log sequence number, hereinafter the same) sent by the master node, a slave node reads the WAL log from shared storage and performs redo on the data blocks in its local memory according to the log content. Unlike the master node, a slave node can forgo all disk write operations.
In Informix SDS, the synchronization mechanism of the slave node is therefore to read the WAL from shared storage, based on the WAL log position received from the master node, and perform redo locally. Two problems arise here. 1) If only the LSN of the WAL is obtained from the master node, the slave node must read the WAL on shared storage, which involves disk I/O and can be time consuming. Some similar schemes have the master node send the WAL log content directly, but then the amount of transferred data is not controllable. Meanwhile, the WAL replay on each slave node is itself fairly time consuming, and under some availability/synchronization-level requirements the master node must wait for the replay work of certain slave nodes. 2) The more prominent problem: the slave node's memory is limited, and data blocks it has replayed from WAL information can no longer be written back to shared storage, because the data on shared storage is maintained by the master node. The slave node's memory buffer therefore comes under pressure and is eventually exhausted. If redone memory blocks then need to be evicted, the slave node has to maintain local, unshared disk space to temporarily store the data blocks evicted from the cache. The volume of such locally stored blocks can keep growing and, without cleanup, affects slave node operation. To handle this, after each flush-to-disk operation the master node sends a certain amount of information to each slave node so that the slave node can release part of its local disk data blocks accordingly. This process increases system complexity and traffic, degrades performance, and affects system stability.
3. Internet company cloud database scheme
Around 2015, the leading internet companies launched cloud database services on their cloud computing platforms, such as Aurora and PolarDB. These databases emphasize 'separation of compute and storage' and being 'cloud native'. Separation of compute and storage means the database manages its data on the general, scalable storage of the cloud computing platform, and the compute servers are not bound to specific storage, allowing freer capacity scaling. Cloud native means developing the database toward a serverless mode, providing database service directly without binding to a specific virtual machine.
Under such a development framework, these cloud database schemes are architecturally closer to Informix SDS. After all, transferring memory data blocks in a complex and frequent manner among the numerous servers of a cloud computing platform is impractical, and the one-master-multi-slave structure of Informix SDS is the better fit. In Informix SDS, however, each slave node continuously performs redo (replaying operations according to log content, hereinafter the same) from the WAL, so the computation on each node is considerable and it is inconvenient to add and remove slave nodes frequently.
The internet cloud databases therefore rely more heavily on the distributed shared storage in the cloud, making full use of the cloud platform's distributed storage to support the database cluster. In a one-master-multi-slave database cluster, the only data that is closest to being synchronized is the master node's write-ahead log WAL, because any update on the master node is written to storage first, as WAL. This leads to the concept of 'the log is the data': each slave node treats the result of accessing the WAL as its own data standard. To achieve this goal, these databases introduce the LSM-Tree (Log-Structured Merge-Tree) data structure to manage the data. The LSM manages data in a log structure; updating data is highly efficient because only appending is required. Queries, however, are more complex under LSM and lose efficiency, and support for the diverse access methods (indexing techniques) of traditional databases is still thin. In addition, implementing such a database scheme requires excessive modification of an existing traditional database; the data management layer has to be almost entirely reworked.
The cloud database schemes of the internet enterprises can thus be regarded as an architectural step beyond Informix SDS. They inherit the one-write-multi-read cluster working mode of Informix SDS, but emphasize the separation of storage and compute and, following the idea of 'the log is the data', introduce the LSM-tree concept into the database's underlying data management, avoiding the traditional data synchronization process.
However, the LSM data management mode sacrifices a good deal of query efficiency, and in particular places relatively large limits on the use of the various advanced, complex indexing techniques of databases. In addition, the data portion of a traditional database is dominated by random access to data blocks; reworking it for LSM is very laborious and amounts to rewriting almost the entire underlying data management.
Disclosure of Invention
The embodiments of the present application provide a shared storage database cluster information synchronization method, which achieves strong independence of the nodes with little mutual influence and improves data synchronization efficiency.
The embodiments of the present application provide a shared storage database cluster information synchronization method, wherein the shared storage database cluster has a one-write-multi-read master-slave architecture comprising a master node and slave nodes, the master node providing a data read-write function and the slave nodes providing a read-only function, and the information synchronization method comprises:
predefining two variables, a start log sequence number (LSN, log sequence number) and a latest LSN; the start LSN serves as the starting position used by a slave node when performing write-ahead log (WAL) replay under specific conditions, and the latest LSN is the LSN of the WAL most recently written by the master node;
after the master node is started and conventional data consistency detection is completed, before service is provided, executing the following steps:
acquiring the latest LSN in the database WAL;
storing that latest LSN into the 'start LSN' and the 'latest LSN', and transmitting the 'start LSN' and the 'latest LSN' to each slave node;
whenever the master node triggers any operation involving a data update, executing the following steps:
writing the WAL, and obtaining the corresponding latest LSN after one WAL log record is written;
storing that latest LSN into the 'latest LSN' variable, and collecting all data page numbers wal_page_no involved in the WAL log record, wal_page_no being the number of a data page operated on by the log record;
sending the 'latest LSN' and all wal_page_no involved in the log record written this time to each slave node.
Optionally, after the master node is started, before service is provided, the method further comprises:
performing consistency detection on the data according to the WAL log, rolling back uncommitted transactions, and completing the writes of committed data that has not yet been flushed to disk;
after the 'start LSN' and the 'latest LSN' are transmitted to each slave node, only counting the follow-up state of each slave node, without entering a waiting state.
Optionally, the method further comprises:
after the master node is started, sending an initial start LSN to each slave node;
during operation, querying all data pages in the local cache at a set time interval, finding the dirty pages among them, and determining the minimum LSN among all the dirty pages;
if the obtained minimum LSN is greater than the previous start LSN, assigning the obtained minimum LSN to the locally stored start LSN, and sending the start LSN to each slave node, so that each slave node updates its start LSN based on this minimum value.
Optionally, a hash table keyed by the data page number page_no is maintained locally at any slave node, the data items of the hash table recording the latest LSN of each data page;
the slave node updates in the following manner:
assigning the received start LSN to its own start LSN;
assigning the received latest LSN to its own latest LSN, and
recording the earliest received 'latest LSN';
looking up the local hash table according to each received wal_page_no, and assigning the latest LSN to the corresponding data page entry in the hash table;
during the assignment, other processes are prohibited from modifying the hash table.
Optionally, after any slave node is started, before it provides service, the method further comprises: if the received start LSN is greater than or equal to the earliest received latest LSN, providing service directly.
Optionally, any slave node executes a query in the following manner:
before a target data page needs to be accessed, obtaining the number of the target data page;
if the local cache does not contain the target data page, reading the target data page from the data page read-write server (Page Server) based on the obtained target data page number, writing the target data page into the local cache, and updating the data page state, the Page Server being a server that provides multi-node reading and writing of data pages, with the program processes in each node writing or reading whole data pages by data page number;
if the local cache contains the target data page, searching the local data table according to the target data page number;
if the latest LSN in the local data table is newer than the latest LSN of the target data page, recovering the target data page in memory;
if the local data table contains the target data page number but the required data page is absent, obtaining the latest LSN of the target data page and allocating an empty data page based on the target data page number, so that the slave node reconstructs the data page on this empty page according to the WAL content, starting from the 'start LSN' position up to the 'latest LSN' position;
updating the data page state.
Optionally, during execution of the query, the latest LSN of the data page in the data table is read into a local variable.
Optionally, the method further comprises the slave node performing data page recovery by the following steps:
reading the WAL log;
configuring start and stop positions with WAL pointers based on the LSNs of the WAL log;
and recovering the data page from the log data according to the content of the WAL log records.
The embodiments of the present application also provide a shared storage cluster apparatus, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above shared storage database cluster information synchronization method.
The embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the steps of the shared storage database cluster information synchronization method described above.
The solution of the present application achieves strong independence of the nodes (master node and slave nodes) with little mutual influence, and the synchronization method of the embodiments improves data synchronization efficiency.
The foregoing is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented according to the content of the specification, and in order that the above and other objects, features and advantages of the present application may become more readily apparent, the detailed description of the present application is given below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is an example database cluster architecture according to an embodiment of the present application;
FIG. 2 is a basic flow example of a database cluster information synchronization method according to an embodiment of the present application;
FIG. 3 is an example of a query flow executed by a slave node according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the present application provide a shared storage database cluster information synchronization method, wherein the shared storage database cluster has a one-write-multi-read master-slave architecture comprising a master node and slave nodes, the master node providing a data read-write function and the slave nodes providing a read-only function. As shown in FIG. 1, the database cluster in the embodiments of the present application is a one-write-multi-read master-slave system: only the master node provides read-write functions, while all slave nodes provide read-only functions. There are several ways for this underlying system to provide peer-to-peer readable and writable services externally. As in FIG. 1, there are three categories of users, A, B and C.
Class A users employ an extended client interface module, which issues read-only commands to the slave nodes and write commands to the master node. Such interfaces are typically enhancements of existing ODBC or JDBC.
Class B users employ database routing middleware B, which provides user service like an ordinary database but internally sends write commands to the master node and read-only commands to the slave nodes automatically. Existing products such as pg_pool or hg_proxy belong to this class of solution.
Class C users directly access each node in the cluster, but in this case a write-command forwarding module C is attached to every slave node; it forwards write commands received by the database server to the master node for execution and then returns the result to the client. For example, Informix SDS and the Cluster version of HGDB adopt this technique.
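As an illustration only (not part of the claimed method), the following sketch shows class-B style routing middleware in simplified form; the prefix-based statement classification and the round-robin choice of slave node are assumptions made for the example.

```python
# Illustrative sketch of class-B routing middleware: read-only statements go
# to a slave node, everything else goes to the master node. The prefix-based
# SQL classification and round-robin slave selection are simplifications.

import itertools

READ_ONLY_PREFIXES = ("select", "show", "explain")


class RoutingMiddleware:
    def __init__(self, master, slaves):
        self.master = master                         # connection-like objects
        self._slave_cycle = itertools.cycle(slaves)

    def execute(self, sql: str):
        stmt = sql.lstrip().lower()
        if stmt.startswith(READ_ONLY_PREFIXES):
            return next(self._slave_cycle).execute(sql)  # read-only -> slave
        return self.master.execute(sql)                  # write -> master
```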
In this architecture, data synchronization from the master node to the slave nodes is an important technical point.
Preparation of shared storage
In the embodiments of the present application, the database cluster adopts shared storage; in the strict sense, multiple database nodes in the cluster jointly access the same data on the shared storage.
On this shared storage, embodiments of the present application provide several data storage services:
Page Server
This is a server that provides reading and writing of data pages for multiple nodes. The program process in each node writes or reads whole data pages by data page number. The size of each data page defaults to 8KB and can be adjusted according to a preset parameter. The Page Server guarantees the integrity of access to data pages by every process in every node: any data page that is read must be in a state either fully written or not yet written by other processes, so no data corruption occurs.
Off-the-shelf Page Server products exist, and a Page Server can also be developed in-house. The access entry of the Page Server can likewise be designed with multiple nodes. The Page Server only needs to provide the throughput of ordinary storage, and SSDs can be considered to improve performance.
Multi-point mountable file system
This is a file system, equivalent to a single Linux file system, that can be mounted on multiple nodes at the same time. The file system is visible on every node on which it is mounted: one node writes a file and the other nodes can read the data.
A multi-node mounted file system is not required to provide integrity of data reads and writes between nodes; that is, a process on one node may read half of the content that a process on another node is writing. This is a common solution currently offered by most storage vendors.
In the embodiments of the present application, the requirements on the file system used for multi-node mounting are: it may be ordinary magnetic disk, with a high sequential read-write speed and high reliability.
Regarding the database's use of storage, the data managed by the database is divided into three categories: data, logs, and parameters.
Data portion:
The data of the database is organized in data pages, and random access is the dominant pattern during operation. In a multi-server environment, if shared storage is required, the storage must provide integrity support for reading and writing data pages between nodes; that is, any read of a data page must return the content as it was either before or after a write, and the content must be complete.
Since a generally available multi-node mounted file system provides no protection for reads and writes across nodes (within a single Linux node, file read and write operations are system calls, so read-write integrity is provided), a database solution in which multiple nodes share storage requires a dedicated page server that maintains read-write integrity across the nodes, with the data page as the unit of reading and writing.
Taking the PostgreSQL database as an example, PostgreSQL internally hands the task of data management to the smgr series of functions. The smgr functions access storage in units of data pages, with a default page size of 8K. In the PostgreSQL database, not only user data but also the data dictionary and the like are treated as data, uniformly managed and operated by the smgr series of functions.
In the embodiments of the present application, the smgr series of functions is modified so that its bottom layer accesses the interface of the Page Server, thereby realizing shared storage management of the database data.
In this way, the data management and storage part of the whole database is prevented from having to become an LSM-tree architecture: the mature and efficient parts of the traditional database are retained while the multiple nodes in the cluster share storage. In addition, thanks to the memory cache, the I/O performance requirement on the Page Server is moderate.
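For illustration only, the sketch below shows the kind of delegation such a modified data-access layer performs; it is not the actual PostgreSQL smgr code, and PageServerClient together with its read_page/write_page methods is an assumed, simplified stand-in for a real Page Server interface.

```python
# Illustrative sketch: a data-access layer that reads and writes whole
# fixed-size pages through a shared Page Server, in the spirit of routing the
# smgr layer to the Page Server. PageServerClient is an assumed toy stand-in.

PAGE_SIZE = 8 * 1024  # 8KB default page size, adjustable by parameter


class PageServerClient:
    """Toy Page Server: stores whole pages keyed by (relation, page_no) and
    always returns complete pages, never partially written ones."""

    def __init__(self):
        self._pages = {}  # (rel, page_no) -> bytes of length PAGE_SIZE

    def write_page(self, rel: str, page_no: int, data: bytes) -> None:
        assert len(data) == PAGE_SIZE, "whole pages only"
        self._pages[(rel, page_no)] = data

    def read_page(self, rel: str, page_no: int):
        # Returns a complete page, or None if the page was never flushed.
        return self._pages.get((rel, page_no))


class SharedStorageDataAccess:
    """smgr-like facade: callers ask for pages by number, and the I/O is
    routed to the Page Server instead of local data files."""

    def __init__(self, page_server: PageServerClient):
        self.page_server = page_server

    def read(self, rel: str, page_no: int):
        return self.page_server.read_page(rel, page_no)

    def write(self, rel: str, page_no: int, page: bytes) -> None:
        self.page_server.write_page(rel, page_no, page)


if __name__ == "__main__":
    access = SharedStorageDataAccess(PageServerClient())
    access.write("demo_rel", 0, bytes(PAGE_SIZE))
    assert access.read("demo_rel", 0) is not None
```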
Log (WAL) section:
The database log, referred to in the embodiments of the present application as the WAL (write ahead log), also called the 'pre-written log', is written to disk before the actual data update. Thus, an operation is confirmed to have occurred in the database as soon as its WAL log record has been flushed to disk.
In the PostgreSQL database, the WAL is a sequentially written binary file. The position of a log record in the WAL is identified by a so-called LSN (log sequence number). The LSN can be viewed as the offset used when writing the file, which grows continuously as the log is written (until the log storage file is switched).
In a multi-node shared storage environment, the WAL portion can be carried by a multi-node mounted (mount) file system that provides no read-write protection.
In actual operation, only the master node writes the WAL, sequentially, and transmits the LSNs to the other nodes after each write completes. When other nodes read the WAL, the offsets they use have already been written by the master node, so no read-write conflict occurs. It is only necessary to ensure that the master node's WAL writes are not delayed by the local cache; depending on the nature of the WAL, the WAL write cache can be turned off in order to guarantee data security (an RPO of zero).
Thus, log management under the shared storage database cluster can be satisfied with relatively modest shared storage requirements.
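To make the 'LSN as file offset' view concrete, the following is a minimal sketch under assumed simplifications: the WAL file path is invented, records are raw bytes, and the master appends while a slave only reads byte ranges whose end LSN it has already been told about.

```python
# Minimal sketch of the LSN-as-offset view of a sequentially written WAL file
# on a multi-node mounted file system. The path and raw-byte framing are
# illustrative assumptions, not the real PostgreSQL WAL format.

import os

WAL_PATH = "/shared/wal/segment_0001"  # hypothetical shared path


def append_wal(record: bytes) -> int:
    """Master side: append one record and return the LSN (offset) after it."""
    with open(WAL_PATH, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())  # keep the write cache out of the picture (RPO zero)
        return f.tell()


def read_wal_range(start_lsn: int, end_lsn: int) -> bytes:
    """Slave side: read only bytes the master has already reported as written,
    so reader and writer never conflict."""
    with open(WAL_PATH, "rb") as f:
        f.seek(start_lsn)
        return f.read(end_lsn - start_lsn)
```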
Parameter file:
The parameter file of the database is a text file. In a multi-node environment, each node may manage its own parameter file separately, or the parameter files may be placed on a shareable file system such as NFS or NAS, as desired. The method imposes no special requirement here: the parameter files may be stored on the multi-node mounted file system or managed locally and independently by each node.
As shown in fig. 2, the information synchronization method includes:
main node enhancement scheme
The main node of the database provides normal read-write service to the outside. The data management part of the master node accesses the Page Server provided by the shared storage to manage the data of the database through the transformation of the smgr function. A portion of the log WAL accesses the multi-node mount file system on the shared storage and stores the WAL on this multi-node mount file system.
In step S201, two variables, namely, a start log sequence number (LSN, log serial number) (start_ LSN) and a latest LSN (last_ LSN) are defined in advance. Wherein the start LSN is used as a start position for the slave node to employ in replaying operations from a pre-written log (WAL) under certain conditions; the latest LSN is the LSN of the latest write WAL of the master node.
Master node startup operations
After the database server is started as the master node (the master node and the slave nodes are the same software product; master or slave is simply a mode chosen when the database starts and can be specified by parameters), in some embodiments the following is also performed after the master node starts and before service is provided: according to the WAL log, consistency detection is performed on the data, uncommitted transactions are rolled back, and committed data not yet flushed to disk is completed, thereby reaching a consistent state (this is the conventional startup process of a database).
After the master node is started and conventional data consistency detection is completed, before service is provided, the following steps are executed:
in step S202, the latest LSN in the database WAL is acquired.
In step S203, the latest LSN is stored into the variables start_LSN and last_LSN, and the start LSN and the latest LSN are sent to each slave node. Whether the master node waits for the slave nodes to respond at this point depends on the requirement set for the cluster coupling level.
The master node writes the WAL log: whenever the database master node encounters any operation involving a data update (including creation and modification of data objects), it first writes the WAL log. Each time a log record is written, a latest LSN is obtained. Whenever the master node triggers any operation involving a data update, the following steps are performed:
In step S204, the WAL is written, and after one WAL log record has been written, the corresponding latest LSN is acquired. In this embodiment, the latest LSN and all wal_page_no must be sent to each slave node before the master node writes the next WAL log record. In a preferred mode, the master node acquires the corresponding latest LSN immediately upon finishing a WAL log record and stores it into the 'latest LSN' variable, so that the latest LSN cannot be modified by other processes before the master node writes the next WAL log record, ensuring the reliability of the data content.
In step S205, the latest LSN is stored into the variable 'latest LSN', and all data page numbers wal_page_no involved in this WAL record are collected, wal_page_no being the number of a data page on which the log record operates.
In step S206, the latest LSN and all wal_page_no are sent to each slave node. After the start LSN and the latest LSN have been sent to each slave node, only the follow-up state of each slave node is counted, and the slave nodes are not waited for. With this design of the master node, the working state of the database cluster can be observed, and whether or not a slave node responds, the master node continues its own flow and never needs to fall into a waiting state.
The embodiments of the invention apply to a shared storage cluster, in which the data synchronization level between the master node and the slave nodes has little influence on the result of master-slave switchover when the master node fails; what it affects is how effectively each slave node responds to its own local query requests. Thus, after sending last_lsn and the corresponding wal_page_no list to the slave nodes, the master node in principle need not wait for their responses; if it does wait, it simply counts which slave nodes have followed, without falling into a waiting state.
Of course, the master node may retain the ability to return the last_lsn content to its local client. A client of the master node that has obtained the corresponding last_lsn can later use it to control the data synchronization level when a query is executed on a slave node.
One of the key technical points of the present application is that the master node sends last_lsn together with the wal_page_no list to each slave node: the master node thereby sends as little content as possible and does not have to wait for the slave nodes to respond, minimizing the influence on performance.
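For illustration only, the following sketch strings the master-side steps together under stated assumptions: broadcast stands in for the actual node-to-node messaging, and the WAL writer is abstracted to a callback returning the new LSN and the touched page numbers. On startup, the latest WAL LSN seeds both start_lsn and last_lsn and is broadcast; on every update, the WAL record is written first, last_lsn is refreshed, and only last_lsn plus the wal_page_no list is sent, without waiting for replies.

```python
# Illustrative sketch of the master-node side of the synchronization protocol.
# "broadcast" is an assumed fire-and-forget send to every slave node; the WAL
# writer is abstracted to a callback returning (lsn, touched_page_nos).

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class SyncMessage:
    start_lsn: int | None = None        # sent at startup / start-LSN refresh
    last_lsn: int | None = None         # LSN of the WAL record just written
    wal_page_nos: list[int] = field(default_factory=list)


class MasterNode:
    def __init__(self, current_wal_lsn: int,
                 broadcast: Callable[[SyncMessage], None]):
        self.broadcast = broadcast
        # Startup (after the usual consistency check): seed both variables
        # with the latest LSN found in the WAL and tell every slave node.
        self.start_lsn = current_wal_lsn
        self.last_lsn = current_wal_lsn
        self.broadcast(SyncMessage(start_lsn=self.start_lsn,
                                   last_lsn=self.last_lsn))

    def on_update(self,
                  write_wal_record: Callable[[], tuple[int, Iterable[int]]]) -> int:
        # 1. Write the WAL record first (write-ahead logging).
        lsn, page_nos = write_wal_record()
        # 2. Remember the newest LSN before the next record can be written.
        self.last_lsn = lsn
        # 3. Send only the LSN and the touched page numbers; do not block
        #    waiting for slave acknowledgements.
        self.broadcast(SyncMessage(last_lsn=lsn, wal_page_nos=list(page_nos)))
        return lsn
```

Nothing in this sketch blocks on the slave nodes; a caller would pass a thin wrapper around its WAL append routine as write_wal_record.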
In some embodiments, the method further comprises:
after the master node is started, sending an initial start LSN to each slave node;
during operation, querying all data pages in the local cache at a set time interval, finding the dirty pages among them, and determining the minimum LSN among all the dirty pages.
In step S208, if the obtained minimum LSN is greater than the previous start LSN, the obtained minimum LSN is assigned to the locally stored start LSN, and this start LSN is sent to each slave node, so that each slave node updates its own start LSN accordingly. In this way, the 'start LSN' stored by each slave node gradually increases as the working state of the master node changes, no response is forced from the slave nodes, and the start_LSN updates make their subsequent processing more efficient.
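The periodic start_lsn advancement can be sketched as follows, for illustration only; the buffer page representation and the broadcast callback are assumptions for the example, and only the minimum-over-dirty-pages comparison mirrors the step above.

```python
# Illustrative sketch: periodically advance start_lsn from the dirty pages in
# the master's local buffer pool. BufferPage and broadcast are assumptions.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BufferPage:
    page_no: int
    lsn: int          # LSN recorded for this buffered page
    is_dirty: bool


def advance_start_lsn(buffer_pool: Iterable[BufferPage],
                      current_start_lsn: int,
                      broadcast: Callable[[int], None]) -> int:
    dirty_lsns = [p.lsn for p in buffer_pool if p.is_dirty]
    if not dirty_lsns:
        return current_start_lsn
    candidate = min(dirty_lsns)
    # Only move forward: broadcast a new start_lsn when the minimum dirty-page
    # LSN has advanced past the previously published value.
    if candidate > current_start_lsn:
        broadcast(candidate)   # each slave node updates its own start_lsn
        return candidate
    return current_start_lsn
```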
In the embodiments of the present application, the slave node provides only read-only database service, i.e. pure queries. This restriction can be imposed at the level of the database instance; a concrete implementation may refer to the read-only restriction on standby nodes in a PostgreSQL streaming replication cluster. If the whole database cluster needs to provide transparent readable and writable access externally, the cluster system can be enhanced using the three approaches described in the system architecture introduction above (client-enhanced interface, database routing middleware, and server-side forwarding module).
A slave node is a startup mode of the database: it runs the same software version as the master node but operates differently because of different startup parameters.
Similar to the master node, data access on a slave node goes through the smgr series of functions to the Page Server, while the log part accesses the WAL on the multi-node mounted file system. The slave node enhancement scheme is further described below. The slave node keeps only the start_lsn received from the master node, the last_lsn received from the master node, and, as described later, the earliest last_lsn it received, denoted early_lsn.
In some embodiments, a data table is maintained locally at any slave node, keyed by the data page number page_no, whose data items describe the latest LSN of each data page. In a specific implementation, the slave node may locally maintain a hash table page_hash using the data page number page_no as the hash key, the content of whose data items is described as page_last_lsn; the hash table itself handles hash value collisions. Maintaining a data page hash table at the slave node, through which the corresponding page_last_lsn can be found quickly from an input page_no, is another key technical point of the present application.
The slave node updates in the following manner:
assigning the received start LSN to its own start LSN (start_lsn);
assigning the received latest LSN to its own latest LSN (last_lsn), and
recording the earliest received 'latest LSN', i.e. the first received last_lsn is assigned to early_lsn: early_lsn = last_lsn;
looking up the latest LSN of the corresponding data page in the local data table according to the received wal_page_no, i.e. for each wal_page_no in the received list, using it as the hash key to search the local data page hash table page_hash and assigning last_lsn to the corresponding page_hash entry;
during the assignment, page_hash is protected: other processes are prohibited from modifying the data table, providing read-write consistency protection for concurrent access.
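For illustration only, the slave-side bookkeeping described above can be sketched as follows; a plain dict keyed by page number stands in for page_hash, a lock models the protection against concurrent modification, and the message fields mirror the master-side sketch.

```python
# Illustrative sketch of the slave node's page_hash maintenance. The message
# shape mirrors the master-side sketch; the dict and lock are simplifications.

import threading
from dataclasses import dataclass, field


@dataclass
class SyncMessage:
    start_lsn: int | None = None
    last_lsn: int | None = None
    wal_page_nos: list[int] = field(default_factory=list)


class SlaveNode:
    def __init__(self):
        self.start_lsn = None
        self.last_lsn = None
        self.early_lsn = None          # earliest "latest LSN" ever received
        self.page_hash = {}            # page_no -> page_last_lsn
        self._lock = threading.Lock()  # protects page_hash during assignment

    def on_sync_message(self, msg: SyncMessage) -> None:
        if msg.start_lsn is not None:
            self.start_lsn = msg.start_lsn
        if msg.last_lsn is not None:
            if self.early_lsn is None:        # record the first last_lsn seen
                self.early_lsn = msg.last_lsn
            self.last_lsn = msg.last_lsn
            with self._lock:                  # no concurrent modification
                for page_no in msg.wal_page_nos:
                    self.page_hash[page_no] = msg.last_lsn

    def ready_to_serve(self) -> bool:
        # Service may start once start_lsn >= the earliest last_lsn received,
        # i.e. page_hash covers every update made since this node started.
        return (self.start_lsn is not None and self.early_lsn is not None
                and self.start_lsn >= self.early_lsn)
```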
After a slave node starts, it must wait for the information transmitted by the master node before providing service externally. This information includes: the start LSN (start_lsn), the latest LSN (last_lsn), and the wal_page_no list. In some embodiments, after any slave node starts and before it provides service, the method further includes: if the received start LSN is greater than or equal to the earliest received latest LSN, i.e. the data page hash table page_hash maintained by the slave node itself already contains enough data synchronization information, which is the precondition for the slave node to provide a data-consistent service, query service can be provided externally.
In the cluster, a database slave node may be started or restarted at any time. Under normal conditions, starting all slave nodes first and then starting the master node is the smoothest flow, since each slave node then receives the information sent by the master node from the very beginning.
A slave node is a read-only node, and receiving and processing query requests from clients is its primary job. How queries are processed and run is an important point of the embodiments of the present application. First, when a slave node receives a query request sent by a client, it must confirm whether the current state of the database meets the information synchronization requirement. If no specific requirement is made for information synchronization, the query is treated as asynchronous. If the received query request carries additional LSN information, the slave node waits until its last_lsn reaches the LSN attached to the query, judges whether a timeout has occurred, and reports an error and exits after the timeout.
In some embodiments, any slave node executes a query in the following manner:
Before a target data page needs to be accessed, the number of the target data page is obtained; at the point where the data page number has been obtained, the following logic is executed, as shown in FIG. 3:
If the local cache does not contain the target data page, the target data page is read from the data page read-write server (Page Server) through the smgr functions, based on the obtained target data page number. If the Page Server has no corresponding data page, the data block exists only in the master node's memory and has not yet been flushed to disk, and the data page state is set to 'in-memory data page not obtained'. Otherwise, the data page is read from the Page Server into the memory buffer, written into the local cache, and the data page state is updated to 'in-memory data page obtained'.
In the case where the data page state is 'in-memory data page obtained', the page is locked according to the data page number. If the local cache contains the target data page, a lookup is made in the local data table according to the target data page number;
If the latest LSN in the local data table is newer than the latest LSN of the target data page, the target data page is recovered in memory; that is, the page_last_lsn found in page_hash satisfies page_last_lsn in page_hash > page_last_lsn in the data page, and data page recovery is invoked with (data page number, page_last_lsn in the data page, page_last_lsn in page_hash).
If the local data table contains the target data page number but the required data page could not be obtained, the latest LSN of the target data page is obtained and an empty data page is allocated based on the target data page number.
The data page number is looked up in the hash table page_hash and the corresponding page_last_lsn is obtained; if page_hash has no entry for the required data page, an error is reported. An empty data page is allocated using the page number.
Data page recovery is invoked with (data page number, start_lsn, page_last_lsn in page_hash). At this point, the slave node must reconstruct the data page on this empty page according to the WAL content, starting from the 'start LSN' position up to the 'latest LSN' position.
Updating the data page state and releasing the data page lock.
In some embodiments, during execution of the query the latest LSN of the data page in the data table is read into a local variable, so that modification of the hash table by other processes during this period does not affect the query.
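For illustration only, the read path above can be sketched as follows; the cache structures, the read_from_page_server and lsn_from_page helpers, and the recover_data_page call (see the recovery sketch further below) are assumptions rather than actual product interfaces.

```python
# Illustrative sketch of the slave node's read path for one target data page.
# The attributes and helpers on "node" (local_cache, read_from_page_server,
# lsn_from_page, page_size, start_lsn, page_hash, recover_data_page) are
# assumed; recover_data_page replays WAL for one page between two LSNs.

from dataclasses import dataclass


@dataclass
class CachedPage:
    page_no: int
    data: bytearray
    page_last_lsn: int   # LSN up to which this in-memory copy is current


def get_page_for_query(node, page_no: int) -> CachedPage:
    page = node.local_cache.get(page_no)

    if page is None:
        raw = node.read_from_page_server(page_no)        # whole-page read
        if raw is not None:
            # The page itself carries the LSN of the last change applied to it.
            page = CachedPage(page_no, bytearray(raw), node.lsn_from_page(raw))
        else:
            # Never flushed by the master: allocate an empty page and rebuild
            # it from WAL starting at start_lsn.
            page = CachedPage(page_no, bytearray(node.page_size), node.start_lsn)
        node.local_cache[page_no] = page

    # Copy the hash-table LSN into a local variable so that concurrent updates
    # by the message-receiving process do not affect this query.
    target_lsn = node.page_hash.get(page_no)

    if target_lsn is not None and target_lsn > page.page_last_lsn:
        # The cached copy is stale relative to page_hash: replay WAL for this
        # page only, from its own LSN up to the hash-table LSN.
        node.recover_data_page(page, start_lsn=page.page_last_lsn,
                               end_lsn=target_lsn)
        page.page_last_lsn = target_lsn
    return page
```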
In some embodiments, the slave node also works as follows:
The data page recovery procedure is invoked repeatedly, so a procedure is designed with three parameters: data page number (the number of the data page to be recovered), start LSN, and end LSN.
Its execution is as follows: the WAL log is read from the 'start LSN' position up to the 'end LSN' position; for each log record read, if the data page number in the log record equals the data page number in the procedure's input parameters, the data page is redone according to the content of the current WAL log record. At the end of the loop, the page_last_lsn attribute of the data page is assigned the 'end LSN'.
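For illustration only, this recovery loop can be sketched as follows; the WAL record iterator and the per-record apply step are assumed helpers, not actual product interfaces.

```python
# Illustrative sketch of per-page recovery: scan WAL records between two LSNs
# and apply only those that touch the requested page. iter_wal_records and
# apply_record are assumed stand-ins for the real WAL reader and redo code.

def recover_data_page(page, start_lsn: int, end_lsn: int,
                      iter_wal_records, apply_record) -> None:
    """Replay WAL for one data page.

    iter_wal_records(start_lsn, end_lsn) yields (lsn, page_no, payload)
    tuples in LSN order; apply_record(page, payload) redoes one record.
    """
    for lsn, page_no, payload in iter_wal_records(start_lsn, end_lsn):
        if page_no == page.page_no:
            apply_record(page, payload)
    # After the loop, the page is current up to end_lsn.
    page.page_last_lsn = end_lsn
```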
The data page recovery procedure is invoked frequently; in some embodiments, the slave node may therefore keep a local memory buffer for the WAL log and decide, according to the LSN, whether the required WAL record is already in memory.
When a data page is evicted from the slave node's memory buffer, it does not need to be written back to storage; it is simply discarded.
Since the dirty data pages in the buffer were all produced by WAL replay, the data page eviction policy is improved to preferentially evict clean data pages.
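For illustration only, the eviction preference can be sketched as follows; the tie-break by oldest page LSN is an assumption added for the example.

```python
# Illustrative sketch of the slave-side eviction preference: never write back
# to shared storage, and evict clean pages before WAL-replayed (dirty) ones.

def choose_eviction_victim(cache_pages):
    """cache_pages: iterable of objects with .is_dirty and .page_last_lsn."""
    pages = list(cache_pages)
    if not pages:
        return None
    clean = [p for p in pages if not p.is_dirty]
    candidates = clean if clean else pages
    # Among the candidates, drop the page whose content is oldest (assumed
    # tie-break; the evicted page is discarded, not written back).
    return min(candidates, key=lambda p: p.page_last_lsn)
```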
Database node start-stop and master-slave conversion
The starting and stopping of database nodes and the conversion between master and slave nodes can be composed from the following procedures:
1. The master node stops external service
The master node database simply stops serving and exits directly. At this point the slave nodes can no longer obtain the information sent by the master node, but this does not prevent them from continuing to provide read-only service externally.
2. Promoting a slave node to master node
A slave node is promoted to become the master node; this process corresponds to a restart of that slave node.
First, it is confirmed that the master node in the cluster has stopped working, and the slave node obtains write permission for the Page Server and the WAL. Physically, such write permission may already exist, but it needs to be confirmed logically at this point.
Next, the database instance on the slave node is started in master node mode;
At this point, only the data in the WAL is considered reliable, and the data in the Page Server may be stale (because it cannot be guaranteed that the previous master node was shut down normally). The process here is the same as the startup process of an ordinary database (regardless of whether the database was shut down normally before): the data must be checked and recovered.
In the cluster of the embodiments of the present application, shared storage is adopted and all slave nodes are in identical positions, so any of them can be promoted by such a restart.
After the new master node starts, it needs to access the cluster management information to obtain the communication address of each slave node in the cluster and send synchronization information to each slave node.
Other slave nodes follow the new master node: after the new master node is produced, each former slave node needs to establish contact with the new master node and follow it. At this point the slave node performs the following operations:
1) The slave node stops external service;
2) Its hash table page_hash, last_lsn and start_lsn are cleared;
3) The new master node's address information is configured;
4) The start LSN, the latest LSN, and updated data page numbers are received from the new master node; when the start LSN becomes greater than the earliest obtained latest LSN, external service is started;
This is in fact the ordinary slave node startup procedure.
The new node joins the cluster with the identity of the slave node.
In summary, the solution of the embodiments of the present application is suitable for enhancing and modifying a variety of database systems and can be implemented, for example, on the basis of PostgreSQL.
In this method the master node transmits only the log sequence number and the list of updated data page numbers. Among current schemes this is the mode with almost the smallest amount of transferred data, yet the information is highly effective. This patent protects the design of these two data items; information designs that differ from this, or are more cumbersome than this, are not within the protection scope.
At the same time, the master node in the method of the embodiments does not need to rely on responses from the slave nodes, minimizing the influence on performance (the operation can be asynchronous). Meanwhile, the transmitted log sequence number can be used to control the slave node synchronization level, providing the whole cluster with data access modes at levels ranging from asynchronous to strictly synchronous, without relying on a communication design in which the slave nodes respond.
In the method of the embodiments of the present application, a hash table of data page states is maintained on the slave nodes, and the data pages are refreshed 'on demand'. The data page state hash table is small, efficient to access, and easy to maintain. The patent protects the design of this hash table: the data page number is used as the hash key, and the content of a hash table entry has only one field, the latest LSN of the data page. The slave node refreshes data pages according to actual use, avoiding various redundant operations as well as the technical problems faced by other solutions.
The method of the embodiments of the present application offers high data synchronization efficiency; supports multiple data synchronization levels between nodes; gives each node strong independence with little mutual influence; provides good query performance scalability; makes it convenient to scale the number of nodes up and down; keeps node addition and removal and master-slave conversion simple and safe; and, with its low storage requirements and standard specification, is suitable for cloud-native database modes with separated compute and storage, such as serverless databases.
The embodiments of the present application also provide a shared storage cluster apparatus, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the above shared storage database cluster information synchronization method.
The embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the steps of the shared storage database cluster information synchronization method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may essentially be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disk) and comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Under the teaching of the present application, those of ordinary skill in the art may make many further forms without departing from the spirit of the present application and the scope of protection of the claims, all of which fall within the protection of the present application.

Claims (10)

1. A shared storage database cluster information synchronization method, characterized in that the shared storage database cluster has a one-write-multi-read master-slave architecture comprising a master node and slave nodes, the master node providing a data read-write function and the slave nodes providing a read-only function, the information synchronization method comprising:
predefining two variables, a start log sequence number (LSN) and a latest LSN; the start LSN serving as the starting position used by a slave node when performing write-ahead log (WAL) replay under specific conditions, and the latest LSN being the LSN of the WAL most recently written by the master node;
after the master node is started and conventional data consistency detection is completed, before service is provided, executing the following steps:
acquiring the latest LSN in the database WAL;
storing that latest LSN into the 'start LSN' and the 'latest LSN', and transmitting the 'start LSN' and the 'latest LSN' to each slave node;
whenever the master node triggers any operation involving a data update, executing the following steps:
writing the WAL, and obtaining the corresponding latest LSN after one WAL log record is written;
storing that latest LSN into the 'latest LSN' variable, and collecting all data page numbers wal_page_no involved in the WAL log record, wal_page_no being the number of a data page operated on by the log record;
sending the 'latest LSN' and all wal_page_no involved in the log record written this time to each slave node.
2. The shared storage database cluster information synchronization method according to claim 1, characterized in that after the master node is started, before service is provided, the method further comprises:
performing consistency detection on the data according to the WAL log, rolling back uncommitted transactions, and completing the writes of committed data that has not yet been flushed to disk;
after the 'start LSN' and the 'latest LSN' are transmitted to each slave node, only counting the follow-up state of each slave node, without entering a waiting state.
3. The shared storage database cluster information synchronization method according to claim 1, characterized by further comprising:
after the master node is started, sending an initial start LSN to each slave node;
during operation, querying all data pages in the local cache at a set time interval, finding the dirty pages among them, and determining the minimum LSN among all the dirty pages;
if the obtained minimum LSN is greater than the previous start LSN, assigning the obtained minimum LSN to the locally stored start LSN, and sending the start LSN to each slave node, so that each slave node updates its start LSN based on this minimum value.
4. The shared storage database cluster information synchronization method according to claim 1, characterized in that a hash table keyed by the data page number page_no is maintained locally at any slave node, the data items of the hash table recording the latest LSN of each data page;
the slave node updates in the following manner:
assigning the received start LSN to its own start LSN;
assigning the received latest LSN to its own latest LSN, and
recording the earliest received 'latest LSN';
looking up the local hash table according to each received wal_page_no, and assigning the latest LSN to the corresponding data page entry in the hash table;
during the assignment, other processes being prohibited from modifying the hash table.
5. The shared storage database cluster information synchronization method according to claim 4, characterized in that after any slave node is started, before it provides service, the method further comprises: if the received start LSN is greater than or equal to the earliest received latest LSN, providing service directly.
6. The shared storage database cluster information synchronization method according to claim 4, further comprising any slave node executing a query in the following manner:
before a target data page needs to be accessed, obtaining the number of the target data page;
if the local cache does not contain the target data page, reading the target data page from the data page read-write server (Page Server) based on the obtained target data page number, writing the target data page into the local cache, and updating the data page state, the Page Server being a server that provides multi-node reading and writing of data pages, with the program processes in each node writing or reading whole data pages by data page number;
if the local cache contains the target data page, searching the local data table according to the target data page number;
if the latest LSN in the local data table is newer than the latest LSN of the target data page, recovering the target data page in memory;
if the local data table contains the target data page number but the required data page is absent, obtaining the latest LSN of the target data page and allocating an empty data page based on the target data page number, so that the slave node reconstructs the data page on this empty page according to the WAL content, starting from the 'start LSN' position up to the 'latest LSN' position;
updating the data page state.
7. The shared storage database cluster information synchronization method according to claim 6, characterized in that, during execution of the query, the latest LSN of the data page in the data table is read into a local variable.
8. The shared storage database cluster information synchronization method according to claim 6, further comprising the slave node performing data page recovery by:
reading the WAL log;
configuring start and stop positions with WAL pointers based on the LSNs of the WAL log;
and recovering the data page from the log data according to the content of the WAL log records.
9. A shared storage cluster apparatus, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the steps of the shared storage database cluster information synchronization method of any one of claims 1 to 8.
10. A computer readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, implements the steps of the shared storage database cluster information synchronization method of any one of claims 1 to 8.
CN202310335286.8A 2023-03-31 2023-03-31 Shared storage database cluster information synchronization method Pending CN116401313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310335286.8A CN116401313A (en) 2023-03-31 2023-03-31 Shared storage database cluster information synchronization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310335286.8A CN116401313A (en) 2023-03-31 2023-03-31 Shared storage database cluster information synchronization method

Publications (1)

Publication Number Publication Date
CN116401313A true CN116401313A (en) 2023-07-07

Family

ID=87019361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310335286.8A Pending CN116401313A (en) 2023-03-31 2023-03-31 Shared storage database cluster information synchronization method

Country Status (1)

Country Link
CN (1) CN116401313A (en)

Similar Documents

Publication Publication Date Title
US11755415B2 (en) Variable data replication for storage implementing data backup
US10437721B2 (en) Efficient garbage collection for a log-structured data store
JP4568115B2 (en) Apparatus and method for hardware-based file system
KR101827239B1 (en) System-wide checkpoint avoidance for distributed database systems
KR101833114B1 (en) Fast crash recovery for distributed database systems
CN1746893B (en) Transactional file system
US8074035B1 (en) System and method for using multivolume snapshots for online data backup
US11841844B2 (en) Index update pipeline
CN101567805B (en) Method for recovering failed parallel file system
US8112607B2 (en) Method and system for managing large write-once tables in shadow page databases
KR100450400B1 (en) A High Avaliability Structure of MMDBMS for Diskless Environment and data synchronization control method thereof
CN101689129A (en) File system mounting in a clustered file system
US5504857A (en) Highly available fault tolerant relocation of storage with atomicity
EP1131715A1 (en) Distributed transactional processing system and method
CN108701048A (en) Data load method and device
US10909091B1 (en) On-demand data schema modifications
US10803006B1 (en) Persistent memory key-value store in a distributed memory architecture
US6658541B2 (en) Computer system and a database access method thereof
EP4307137A1 (en) Transaction processing method, distributed database system, cluster, and medium
CN116401313A (en) Shared storage database cluster information synchronization method
JP4286857B2 (en) Internode shared file control method
JP2013161398A (en) Database system, method for database management, and database management program
JP3866448B2 (en) Internode shared file control method
US11914571B1 (en) Optimistic concurrency for a multi-writer database
WO2020024590A1 (en) Persistent memory key-value store in a distributed memory architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination