CN113779087A - Database high-availability method and system based on remote direct memory access - Google Patents

Database high-availability method and system based on remote direct memory access

Info

Publication number
CN113779087A
CN113779087A (Application CN202111057997.0A)
Authority
CN
China
Prior art keywords
data
database
cache
library
standby
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111057997.0A
Other languages
Chinese (zh)
Inventor
褚立强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111057997.0A priority Critical patent/CN113779087A/en
Publication of CN113779087A publication Critical patent/CN113779087A/en
Withdrawn legal-status Critical Current

Classifications

    • G06F16/24552 Database cache management
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F11/1464 Management of the backup or restore process for networked environments
    • G06F16/1815 Journaling file systems
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system
    • G06F9/546 Message passing systems or structures, e.g. queues
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • G06F2209/548 Queue (indexing scheme relating to G06F9/54)

Abstract

The invention provides a database high-availability method and system based on remote direct memory access (RDMA). The method comprises the following steps: RDMA interfaces based on the RDMA protocol are arranged on the sending module and the receiving module of the database; generated pre-written log (WAL) data is inserted into the master library cache, and the master library cache is managed with a cache eviction algorithm; the RDMA interface of the master library sending module is called, and the newly inserted pre-written log data in the master library cache is sent to a standby library; the standby library receives the pre-written log data, stores it into its own cache, persists it, calls the RDMA interface of its own sending module, and transmits the pre-written log data in its cache through a RoCE network card to the next connected standby library for cascade replication; when the master library fails, a standby library is selected as the new master library, and the service system of the failed master library is switched to the new master library. The high-availability reliability of the database is thereby improved.

Description

Database high-availability method and system based on remote direct memory access
Technical Field
The invention relates to the technical field of database access, in particular to a database high-availability method and a database high-availability system based on remote direct memory access.
Background
In order to ensure the continuous, stable operation of a business system and the robustness of its software, the metadata database system often needs to be designed as a high-availability scheme within a high-availability architecture based on stream replication.
For example, PostgreSQL is a powerful open-source object-relational database system that uses and extends the SQL language and combines many features to safely store and scale the most complicated data workloads. It is currently among the most powerful, feature-rich and architecturally sophisticated open-source databases, with some capabilities that even commercial databases lack. PostgreSQL can therefore be used to store data securely and retrieve it when processing requests; it is also cross-platform, runs on most common operating systems, and supports large business systems in the enterprise. At the same time, thanks to its open-source nature and the native programming interfaces it provides for many high-level development languages, it serves as an important type of metadata database in much open-source and commercial software.
PostgreSQL maintains WAL log files in the subdirectory pg_xlog of the data directory; the WAL logs can be backed up to another backup server, and data can be recovered on that server by replaying them. The WAL log is transmitted to the backup server mainly through the stream replication mode PostgreSQL provides. For large business systems, the amount of data generated and processed may be enormous, and the resulting WAL logs may grow rapidly, which challenges database high availability. To realize high availability in PostgreSQL, the master library must also cache transaction changes and record them in WAL log files, which occupy a large amount of the master library's storage space; when data is replicated to the standby library, the WAL log file must be read from disk back into the cache before transmission, so the data is cached twice in the process, adding extra memory overhead.
In the traditional TCP/IP communication mode, the master library node host must send a WAL log file through user space and kernel space to the network card, and the standby library node host then passes the WAL log file through the reverse path into user space and finally persists it to disk. In this process the data is copied multiple times across different address spaces. Because messages are sent through the system kernel, there is high data-movement and data-copying overhead, resulting in low performance and flexibility: the efficiency of the stream replication process drops, a large amount of memory bandwidth and CPU time is consumed, database response slows, and the business system may even be affected.
Disclosure of Invention
The invention provides a database high-availability method and system based on remote direct memory access, aiming at the problems that data must be copied many times across different address spaces and messages are sent through the system kernel, causing high data-movement and data-copying overhead, low performance and flexibility, reduced stream replication efficiency, heavy memory-bandwidth and CPU consumption, slow database response, and even impact on the business system.
The technical scheme of the invention is as follows:
In one aspect, the technical scheme of the invention provides a database high-availability method based on remote direct memory access, wherein the database comprises a master library and a plurality of standby libraries, and the method comprises the following steps:
RDMA interfaces based on RDMA protocol are arranged on a sending module and a receiving module of the database;
when receiving the operation submitted by the user, inserting the generated pre-written log data into the main library cache, and managing the main library cache by adopting a cache elimination algorithm;
after receiving the operation submitted by the user, calling an RDMA interface of a main library sending module, and sending the newly inserted pre-written log data in the cache of the main library to a standby library communicated with the main library through a RoCE network card;
the receiving module of the standby database receives the pre-written log data, stores the pre-written log data into the cache of the standby database, carries out data persistence on the pre-written log data, calls an RDMA (remote direct memory access) interface of the transmitting module of the standby database, and transmits the pre-written log data in the cache of the standby database to the connected standby database through the RoCE network card for cascade copy;
when the main library fails, selecting a standby library as a new main library, and switching the service system of the failed main library to the new main library;
the RoCE network card is a network card supporting a RoCE protocol.
The original persistence of pre-written log data is cancelled in the master library, and the data in the master library cache is sent directly to the standby library cache. Network cards supporting RoCE are replaced or added on both the master library and the standby libraries of the database, and a network transmission channel supporting RDMA is created for sending and receiving data into the standby library cache. Data transmission bypasses the system kernel area and accesses the memory of the remote server without consuming CPU cycles on the remote server, so the available bandwidth can be fully utilized with higher scalability. Network performance is greatly improved by the lower latency and higher throughput of RoCE.
Preferably, the step of calling the RDMA interface of the sending module and sending the pre-written log data in the buffer to the standby library through the RoCE network card includes:
calling an RDMA interface of a sending module to acquire data in the cache of the database, communicating with a RoCE network card of a connected standby library node host through the RoCE network card of the database node host, and transmitting the data to a receiving module of the standby library;
and calling an RDMA interface of the receiving module to store the received pre-written log data into the cache of the database.
This eliminates the overhead of extra memory copies and context switches on the master library node, freeing memory bandwidth and CPU cycles to improve the performance of the application system and finally ensuring the continuous, efficient operation of the master library business system.
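As an illustrative sketch (not part of the patent's implementation): real RDMA transfers use the verbs API over RoCE hardware, but the direct cache-to-cache transfer described above can be modelled in Python. `MemoryRegion` and `rdma_write` are hypothetical stand-ins for registering a remote-writable buffer and performing a one-sided RDMA write.

```python
# Illustrative simulation only: MemoryRegion and rdma_write are hypothetical
# stand-ins for a registered buffer and a one-sided RDMA write over RoCE.

class MemoryRegion:
    """A registered buffer that a remote peer may write into directly."""
    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.used = 0

def rdma_write(local: bytes, remote: MemoryRegion) -> int:
    """Place bytes straight into the remote region, modelling a one-sided
    write that bypasses the remote host's kernel and CPU."""
    end = remote.used + len(local)
    remote.buf[remote.used:end] = local
    remote.used = end
    return len(local)

# The master obtains WAL records from its cache and "sends" them directly
# into the standby's registered cache region, with no intermediate copies.
master_cache = [b"wal-0001", b"wal-0002"]
standby_region = MemoryRegion(64)
sent = sum(rdma_write(rec, standby_region) for rec in master_cache)
received = bytes(standby_region.buf[:standby_region.used])
```

In a real deployment the region would be pinned and registered with the RNIC, and the write would complete without involving the standby host's CPU at all.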
Preferably, the step of the master library receiving the user-submitted operation comprises:
when the master library receives the operation submitted by the user, inserting the generated change content, namely the pre-written log data into the pre-written log cache of the master library;
and after receiving the operation submitted by the user, storing the data after the operation change into the data cache of the main library.
Preferably, the step of inserting the generated pre-written log data into the primary library cache and managing the primary library cache by using a cache elimination algorithm includes:
inserting newly generated, not-yet-accessed pre-written log data into the FIFO queue of the master library cache;
judging whether the pre-written log data is accessed in the FIFO queue within a set first time range;
if not, when the total amount in the FIFO queue reaches a first percentage of the set total capacity, eliminating the second percentage of entries with the earliest insertion time;
if yes, moving the pre-written log data to the head of the LRU queue;
after a set time interval, judging whether the pre-written log data is accessed in the LRU queue; if yes, moving it to the head of the LRU queue;
in the LRU queue, judging whether the cache capacity exceeds the first percentage of the total cache capacity; if so, eliminating the second percentage of least-accessed entries at the tail of the LRU queue. This keeps the cache running without interruption and improves the reliability and safety of the cached data.
Preferably, the method further comprises:
and when the fault of the database with the fault is repaired, taking the repaired database as a standby database access system. Maintaining the database highly available.
Preferably, the number of the standby libraries is two, and the standby libraries are respectively a first standby library and a second standby library;
the data stream replication process comprises:
when receiving the operation submitted by the user, storing the pre-written log record of the data change into a pre-written log cache of the main library;
after the user operation is finished, the changed data is stored in a data cache of the main library, an RDMA interface of a main library sending module is called to obtain the data in the pre-written log cache of the main library, the RoCE network card of the main library node host is communicated with the RoCE network card of the first standby library node host, and the obtained data is transmitted to a receiving module of the first standby library;
calling an RDMA interface of a first standby library receiving module to store the received data into a first standby library pre-written log cache;
writing the data in the first standby library pre-written log cache into a magnetic disk of a first standby library node host for data persistence, calling an RDMA (remote direct memory access) interface of a first standby library sending module to obtain the data in the first standby library pre-written log cache, communicating with a RoCE network card of a second standby library node host through the RoCE network card of the first standby library node host, and transmitting the obtained data to a receiving module of a second standby library;
calling an RDMA interface of a second standby library receiving module to store the received data into a second standby library pre-written log cache;
writing the data in the second standby database pre-written log cache into a magnetic disk of a second standby database node host for data persistence;
and periodically refreshing the data in the data cache of the main library to a magnetic disk of the host of the node of the main library.
In another aspect, the technical scheme of the invention provides a database high-availability system based on remote direct memory access, comprising a plurality of nodes, each node being provided with a database;
each node host is provided with a RoCE network card and a magnetic disk;
the nodes are sequentially connected in series communication through RoCE network cards;
a first node database on the serial link is provided with a sending module, a last node database on the serial link is provided with a receiving module, and other remaining node databases on the serial link are provided with a sending module and a receiving module;
the sending module and the receiving module are both provided with RDMA interfaces based on RDMA protocol;
each database is provided with a cache;
the system also comprises a calling module, a cache management module and a persistence module;
the calling module is used for calling the RDMA interface of the database receiving module to store the pre-written log data into a pre-written log cache of the database where the receiving module is located; the RDMA interface is also used for calling the database sending module, and sending the pre-written log data in the cache of the database where the sending module is located to the cache of the next database through the RoCE network card;
the cache management module is used for carrying out data elimination management on the database cache;
and the persistence module is used for writing the data in the database caches of other nodes except the first node into a magnetic disk of a node host where the database is located to perform data persistence.
Preferably, except for the first node and the last node, two RoCE network cards are installed on the host of each other node on the serial link; one RoCE network card forms an input transmission channel with the RoCE network card of the previous node's host on the serial link, and the other forms an output transmission channel with the RoCE network card of the next node's host.
The cache management module is specifically used for inserting newly generated unaccessed data into an FIFO queue of the database cache; judging whether the data is accessed in the FIFO queue within a set first time range; if not, eliminating the data with the second percentage with the earliest insertion time when the total amount of the queue reaches the first percentage of the set total capacity; if yes, moving the data to the head of the LRU queue; after the time interval is set, judging whether the data is accessed in the LRU queue; if yes, moving the data to the head of the LRU queue; in the LRU queue, judging whether the cache data capacity exceeds a first percentage of the total cache capacity; if so, the second percentage of data least accessed at the tail of the LRU queue is eliminated.
According to the technical scheme, the invention has the following advantages: high availability of the database system's stream replication mode is realized using remote direct memory access (RDMA) technology, reducing data-movement and data-copying overhead, avoiding copying data through the kernel, releasing the CPU to perform its own work, increasing bandwidth, and reducing latency, jitter and CPU consumption. The WAL log file replication speed between the master library and the standby libraries is increased, ensuring that the master library provides services stably while the high-availability reliability of the database is improved.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
Fig. 2 is an architecture diagram of a system according to one embodiment of the invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
PostgreSQL: a very full-featured, free and open-source object-relational database management system.
RDMA: Remote Direct Memory Access, a technology created to address the latency of server-side data processing in network transmission.
RoCE: RDMA over Converged Ethernet, a network protocol that allows the use of Remote Direct Memory Access (RDMA) over Ethernet.
FIFO: abbreviation of First In, First Out; a first-in-first-out queue.
LRU: abbreviation of Least Recently Used, a commonly used page replacement algorithm that selects the least recently used page for eviction.
2Q: Two Queues. The 2Q algorithm maintains two cache queues, one FIFO queue and one LRU queue. When data is accessed for the first time, 2Q buffers it in the FIFO queue; when it is accessed a second time, the data is moved from the FIFO queue to the LRU queue; each queue evicts data according to its own policy.
WAL: Write-Ahead Logging, a standard method of ensuring data integrity.
As shown in fig. 1, an embodiment of the present invention provides a database high-availability method based on remote direct memory access, where the database comprises a master library and several standby libraries; the method comprises the following steps:
step 1: RDMA interfaces based on RDMA protocol are arranged on a sending module and a receiving module of the database;
step 2: when receiving the operation submitted by the user, inserting the generated pre-written log data into the main library cache, and managing the main library cache by adopting a cache elimination algorithm;
step 3: calling the RDMA interface of the master library sending module, and sending the newly inserted pre-written log data in the master library cache through the RoCE network card to the standby library; the receiving module of the standby library receives the pre-written log data, stores it into the standby library cache, persists it, calls the RDMA interface of the standby library sending module, and transmits the pre-written log data in the standby library cache through the RoCE network card to the next connected standby library for cascade replication;
step 4: when the master library fails, selecting a standby library as the new master library, and switching the service system of the failed master library to the new master library;
the RoCE network card is a network card supporting a RoCE protocol.
It should be noted that the original persistence of pre-written log data is cancelled in the master library, and the data in the master library cache is sent directly to the standby library cache. Network cards supporting RoCE are replaced or added on both the master library and the standby libraries of the database, and a network transmission channel supporting RDMA is created for sending and receiving data into the standby library cache. Data transmission bypasses the system kernel area and accesses the memory of the remote server without consuming CPU cycles there, so the available bandwidth can be fully utilized with higher scalability. Network performance is greatly improved by the lower latency and higher throughput of RoCE.
In some embodiments, the step of calling the RDMA interface of the sending module and sending the pre-written log data in the buffer to the standby library through the RoCE network card includes:
calling an RDMA interface of a sending module to acquire data in the cache of the database, communicating with a RoCE network card of a connected standby library node host through the RoCE network card of the database node host, and transmitting the data to a receiving module of the standby library;
and calling an RDMA interface of the receiving module to store the received pre-written log data into the cache of the database.
This eliminates the overhead of extra memory copies and context switches on the master library node, freeing memory bandwidth and CPU cycles to improve application-system performance and finally ensuring the continuous, efficient operation of the master library business system.
It should be noted that the step of receiving the operation submitted by the user by the master library includes:
when the master library receives the operation submitted by the user, inserting the generated change content, namely the pre-written log data into the pre-written log cache of the master library;
and after receiving the operation submitted by the user, storing the data after the operation change into the data cache of the main library.
In addition, it should be further noted that the step of inserting the generated pre-write log data into the primary library cache and managing the primary library cache by using a cache elimination algorithm includes:
inserting newly generated, not-yet-accessed pre-written log data into the FIFO queue of the master library cache;
judging whether the pre-written log data is accessed in the FIFO queue within a set first time range;
if not, when the total amount in the FIFO queue reaches a first percentage of the set total capacity, eliminating the second percentage of entries with the earliest insertion time;
if yes, moving the pre-written log data to the head of the LRU queue;
after a set time interval, judging whether the pre-written log data is accessed in the LRU queue; if yes, moving it to the head of the LRU queue;
in the LRU queue, judging whether the cache capacity exceeds the first percentage of the total cache capacity; if so, eliminating the second percentage of least-accessed entries at the tail of the LRU queue. This keeps the cache running without interruption and improves the reliability and safety of the cached data.
In some embodiments, the method further comprises:
and periodically refreshing the data in the data cache of the main library to the disk.
The operation of pre-written log disk-dropping is cancelled in the main library, so that the process number, memory occupation and data copying cost of the host where the main library is located can be reduced, and the service data volume which can be accommodated by the main library is greatly increased by storing the pre-written log file in the standby library.
The number of the standby libraries is two, and the standby libraries are respectively a first standby library and a second standby library;
the data stream replication process comprises:
step 2-1: when receiving the operation submitted by the user, storing the pre-written log record of the data change into a pre-written log cache of the main library;
step 2-2: after the user operation is finished, the changed data is stored in a data cache of the main library, an RDMA interface of a main library sending module is called to obtain the data in the pre-written log cache of the main library, the RoCE network card of the main library node host is communicated with the RoCE network card of the first standby library node host, and the obtained data is transmitted to a receiving module of the first standby library;
step 2-3: calling an RDMA interface of a first standby library receiving module to store the received data into a first standby library pre-written log cache;
step 2-4: writing the data in the first standby library pre-written log cache into a magnetic disk of a first standby library node host for data persistence, calling an RDMA (remote direct memory access) interface of a first standby library sending module to obtain the data in the first standby library pre-written log cache, communicating with a RoCE network card of a second standby library node host through the RoCE network card of the first standby library node host, and transmitting the obtained data to a receiving module of a second standby library;
step 2-5: calling an RDMA interface of a second standby library receiving module to store the received data into a second standby library pre-written log cache;
step 2-6: writing the data in the second standby database pre-written log cache into a magnetic disk of a second standby database node host for data persistence;
step 2-7: and periodically refreshing the data in the data cache of the main library to a magnetic disk of the host of the node of the main library.
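The one-main, two-standby flow of steps 2-1 through 2-7 can be sketched as a small in-memory simulation. This is a hedged illustration only: the class and function names are invented here, and the real RDMA transfers over RoCE network cards are replaced by direct method calls.

```python
class Standby:
    """Standby node: buffers received WAL data, persists it, and
    optionally cascades it to a downstream standby (illustrative)."""

    def __init__(self, downstream=None):
        self.wal_buffer = []      # pre-written log cache of the standby
        self.disk = []            # stand-in for the node host's disk
        self.downstream = downstream

    def receive(self, records):
        # steps 2-3 / 2-5: "RDMA receive" into the standby WAL cache
        self.wal_buffer.extend(records)
        # steps 2-4 / 2-6: persist to disk, then forward along the chain
        self.disk.extend(records)
        if self.downstream is not None:
            self.downstream.receive(records)


def commit(user_records, first_standby):
    """steps 2-1 / 2-2: the main library buffers the WAL records of the
    committed operation and streams them out; note there is no local
    WAL persistence on the main library itself."""
    wal_buffer = list(user_records)
    first_standby.receive(wal_buffer)
```

A usage example: `commit(["rec1"], first_standby)` leaves the record persisted on both standby hosts, mirroring the cascade of steps 2-2 through 2-6.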
The entire WAL log file backup process includes two main stages:
1) WAL log generation
Under the WAL mechanism, after a transaction in the database commits, a corresponding WAL-write step is added to the process of flushing the data modified by the transaction to disk.
When a transaction executes: first, the change log records are stored in the WAL Buffer; then the updated data is stored in the Data Buffer.
At transaction commit time, an RDMA interface is called to transmit the contents of the WAL Buffer to a standby library.
When a checkpoint occurs, all Data Buffers are flushed to disk.
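The generation stage above — log record into the WAL Buffer first, data into the Data Buffer second, RDMA send at commit, flush at checkpoint — can be sketched in Python. Names such as `WalDatabase` and `send_wal` are illustrative stand-ins, not the patent's implementation; the RDMA send is modeled as a plain callback.

```python
class WalDatabase:
    """Minimal sketch of the WAL flow described above."""

    def __init__(self, send_wal):
        self.wal_buffer = []       # in-memory WAL records
        self.data_buffer = {}      # dirty pages
        self.disk = {}             # persisted pages
        self.send_wal = send_wal   # stand-in for the RDMA send at commit

    def update(self, key, value):
        # first: the change log record goes into the WAL Buffer
        self.wal_buffer.append(("update", key, value))
        # then: the updated data goes into the Data Buffer
        self.data_buffer[key] = value

    def commit(self):
        # at commit, ship the WAL Buffer to the standby instead of
        # persisting it locally (the main library's behaviour here)
        self.send_wal(list(self.wal_buffer))
        self.wal_buffer.clear()

    def checkpoint(self):
        # at a checkpoint, all Data Buffers are flushed to disk
        self.disk.update(self.data_buffer)
        self.data_buffer.clear()
```

For example, after `update` and `commit`, the standby-side callback has received the log record while the main library's disk is only touched at the next `checkpoint`.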
2) Transmission of WAL logs
Specifically, at transaction commit time, an RDMA interface is called to transmit the contents of the WAL Buffer to a standby library.
the method adopts a one-master-two-slave architecture design, wherein a master library does not reserve WAL log files any more, and also cancels the opening of extra cache for a sending module, and directly acquires modified service data from a WAL Buffer after a user submits operation, and the modified service data is sent to a WAL Buffer of a slave library by a data sending module.
After a transaction modifies data, the cached data generated in the WAL Buffer is no longer persisted locally, so its reliability and safety must be ensured. An improved Two-Queue (2Q) cache elimination algorithm is designed for the cached data: two cache queues are created, one FIFO queue and one LRU queue. The implementation details are as follows:
(a) newly generated, not-yet-accessed WAL log data is inserted into the FIFO queue;
(b) after a user commits a transaction, the newly inserted data in the FIFO queue is transmitted to the standby library; if data in the FIFO queue is not accessed again within a certain time range, then when the queue reaches 80% of its configured total capacity, the 20% of the data with the earliest insertion time is eliminated;
(c) if data in the FIFO queue is accessed again, it is moved to the head of the LRU queue;
(d) if, after a certain time, historical data in the LRU queue is accessed again, it is moved back to the head of the LRU queue;
(e) in the LRU queue, if the cached data exceeds 80% of the total cache capacity, the least-accessed 20% of the data at the tail of the queue is eliminated; the freed 20% of the cache space continues to receive data promoted from the FIFO queue, so the cache keeps running without interruption.
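A minimal Python sketch of this adjusted 2Q policy follows. The 80%/20% thresholds follow the text; the class name, the per-queue capacity handling, and treating the "head" of the LRU queue as the most-recent end of an `OrderedDict` are assumptions made for illustration.

```python
from collections import OrderedDict

class TwoQueueCache:
    """Simplified 2Q elimination: a FIFO queue for newly inserted,
    not-yet-accessed WAL records and an LRU queue for re-accessed
    records. When either queue reaches 80% of its capacity, the
    oldest / least-recently-used 20% is eliminated."""

    def __init__(self, capacity):
        self.capacity = capacity   # illustrative per-queue capacity
        self.fifo = OrderedDict()  # insertion order = age
        self.lru = OrderedDict()   # most recently used at the end

    def insert(self, key, record):
        # (a) newly generated, unaccessed data goes to the FIFO queue
        self.fifo[key] = record
        self._trim_fifo()

    def access(self, key):
        # (c)/(d) accessed data moves to the head of the LRU queue
        if key in self.fifo:
            self.lru[key] = self.fifo.pop(key)
        elif key in self.lru:
            self.lru.move_to_end(key)
        else:
            return None
        self._trim_lru()
        return self.lru.get(key)

    def _trim_fifo(self):
        # (b) at 80% of capacity, eliminate the oldest 20% of entries
        if len(self.fifo) >= 0.8 * self.capacity:
            for _ in range(max(1, int(0.2 * len(self.fifo)))):
                self.fifo.popitem(last=False)

    def _trim_lru(self):
        # (e) at 80% of capacity, eliminate the least-accessed 20%
        if len(self.lru) >= 0.8 * self.capacity:
            for _ in range(max(1, int(0.2 * len(self.lru)))):
                self.lru.popitem(last=False)
```

With a capacity of 10, the FIFO queue trims itself once the eighth record is inserted, dropping the single oldest entry; a record accessed while still in the FIFO queue is promoted into the LRU queue.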
The main library's log sending process calls the RDMA interface, the stream replication process is started, and the data in the WAL Buffer is sent directly to the WAL Buffer of the standby library over the RoCE network transmission channel. This cache uses the same elimination strategy as the main library's WAL Buffer; the data is then persisted into WAL files by the walwriter process and finally written into the data files for users to read.
While persisting the log files, the standby library's WAL Buffer starts a stream replication process and transmits the cached log data, in the same RDMA manner, to the WAL Buffer of the second standby library for cascade replication, further ensuring the high availability and reliability of the database.
If the main library fails and stops running, the service system switches to one of the standby libraries, which may be the first standby library or the second standby library. If the system switches to the first standby library while the first standby library's stream replication to the second standby library is kept running, the high availability of the database is maintained. After the first standby library becomes the main library, the processes that persist cached data to disk in it — walreceiver, walwriter, startup, and the like — are stopped, a Data Buffer is opened in memory, the changed service data is written into it and finally persisted to disk, and the transaction change log data is connected to the write-ahead log cache (WAL Buffer), completing the full switchover of the standby library.
After the original main library is repaired, it is attached as the second standby library, preserving the one-main, two-standby architecture.
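The switchover and rejoin logic can be sketched as a few lines of Python over a cluster-role mapping. The dictionary layout is a hypothetical simplification; a real switchover also involves stopping and starting the walreceiver/walwriter processes as described above.

```python
def fail_over(cluster):
    """Promote the first standby to main, keep its cascade stream to
    the second standby (which becomes the new first standby), and
    return the failed node so it can rejoin after repair."""
    failed_main = cluster["main"]
    cluster["main"] = cluster["standby1"]      # first standby promoted
    cluster["standby1"] = cluster["standby2"]  # cascade stream is kept
    cluster["standby2"] = None                 # slot for the repaired node
    return failed_main

def rejoin(cluster, repaired_node):
    # the repaired original main re-enters as the second standby,
    # restoring the one-main, two-standby architecture
    cluster["standby2"] = repaired_node
```

For example, with nodes A (main), B, and C, failing A promotes B to main with C as first standby, and the repaired A later rejoins as the second standby.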
The embodiment of the invention also provides a database high-availability system based on remote direct memory access, which comprises a plurality of nodes, wherein each node is provided with a database;
each node host is provided with a RoCE network card and a magnetic disk;
the nodes are sequentially connected in series communication through RoCE network cards;
a first node database on the serial link is provided with a sending module, a last node database on the serial link is provided with a receiving module, and other remaining node databases on the serial link are provided with a sending module and a receiving module;
the sending module and the receiving module are both provided with RDMA interfaces based on RDMA protocol;
each database is provided with a cache;
the system also comprises a calling module, a cache management module and a persistence module;
the calling module is used for calling the RDMA interface of the database receiving module to store the pre-written log data into a pre-written log cache of the database where the receiving module is located, and for calling the RDMA interface of the database sending module to send the pre-written log data in the cache of the database where the sending module is located to the cache of the next database through the RoCE network card;
the cache management module is used for carrying out data elimination management on the database cache;
and the persistence module is used for writing the data in the database caches of other nodes except the first node into a magnetic disk of a node host where the database is located to perform data persistence.
In some embodiments, except for the first node and the last node, two RoCE network cards are installed on the host of each node on the serial link; one RoCE network card forms an inbound transmission channel with the RoCE network card of the previous node's host on the serial link, and the other forms an outbound transmission channel with the RoCE network card of the next node's host.
The cache management module is specifically configured to: insert newly generated unaccessed data into a FIFO queue of the database cache; judge whether the data is accessed in the FIFO queue within a set first time range; if not, eliminate the data of the second percentage with the earliest insertion time when the total amount of the queue reaches the first percentage of the set total capacity; if so, move the data to the head of the LRU queue; after a set time interval, judge whether the data is accessed in the LRU queue, and if so, move the data to the head of the LRU queue; and, in the LRU queue, judge whether the cached data capacity exceeds the first percentage of the total cache capacity and, if so, eliminate the least-accessed data of the second percentage at the tail of the LRU queue.
The data is processed in the cache by the following method:
inserting newly generated unaccessed data into a FIFO queue of the database cache; judging whether the data is accessed in the FIFO queue within a set first time range; if not, eliminating the data of the second percentage with the earliest insertion time when the total amount of the queue reaches the first percentage of the set total capacity; if so, moving the data to the head of the LRU queue; after a set time interval, judging whether the data is accessed in the LRU queue, and if so, moving the data to the head of the LRU queue; and, in the LRU queue, judging whether the cached data capacity exceeds the first percentage of the total cache capacity and, if so, eliminating the least-accessed data of the second percentage at the tail of the LRU queue. This keeps the cache running without interruption and improves the reliability and safety of the cached data.
As shown in fig. 2, in some embodiments, the number of nodes is three, and the corresponding three databases are a main database, a first backup database and a second backup database;
if the main library fails and stops running, the service system switches to one of the standby libraries, which may be the first standby library or the second standby library; if the system switches to the first standby library while the first standby library's stream replication to the second standby library is kept running, the high availability of the database is maintained. After the first standby library becomes the main library, the processes persisting cached data to disk in it are stopped, a data cache is created in memory, the changed service data is written into the data cache and finally persisted to disk, and the transaction change log data is connected to the pre-written log cache, completing the full switchover of the standby library.
WAL log data can be read directly from the WAL Buffer and, instead of passing through the kernel-space Socket Buffer and transport-protocol driver buffer, goes directly to the RoCE network card driver buffer; it is then sent through the RoCE-capable network card to the RoCE network card driver buffer of the standby library, again bypassing the kernel-space buffers on the standby library host. The whole replication process avoids data-copy overhead in the operating system kernel, freeing memory bandwidth and reducing CPU usage.
An RDMA data transmission interface is added to PostgreSQL's stream-replication-based high availability framework; a one-main, two-standby architecture is adopted; WAL log persistence is removed from the main library, with the WAL log data sent directly to the WAL cache of the standby library; the adjusted 2Q cache elimination algorithm is used in the WAL Buffer of each node; and network cards supporting RoCE are added to, or replace those in, the main and standby libraries to create an RDMA-capable network transmission channel.
Although the present invention has been described in detail with reference to the drawings and preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the appended claims.

Claims (10)

1. A database high-availability method based on remote direct memory access is characterized in that a database comprises a main library and a plurality of standby libraries, and the method comprises the following steps:
RDMA interfaces based on RDMA protocol are arranged on a sending module and a receiving module of the database;
when receiving the operation submitted by the user, inserting the generated pre-written log data into the main library cache, and managing the main library cache by adopting a cache elimination algorithm;
after receiving the operation submitted by the user, calling an RDMA interface of a main library sending module, and sending the newly inserted pre-written log data in the cache of the main library to a standby library communicated with the main library through a RoCE network card;
the receiving module of the standby library receives the pre-written log data, stores the pre-written log data into the cache of the standby library, and performs data persistence on the pre-written log data; the RDMA interface of the standby library sending module is called, and the pre-written log data in the cache of the standby library is transmitted through the RoCE network card to the connected standby library for cascade replication;
when the main library fails, selecting a standby library as a new main library, and switching the service system of the failed main library to the new main library;
the RoCE network card is a network card supporting a RoCE protocol.
2. The remote direct memory access-based database high availability method according to claim 1, wherein the step of calling the RDMA interface of the sending module and sending the pre-written log data in the buffer to the standby database through the RoCE network card comprises:
calling an RDMA interface of a sending module to obtain the pre-written log data in the cache of the database, communicating with a RoCE network card of a connected standby library node host through the RoCE network card of the database node host, and transmitting the pre-written log data to a receiving module of the standby library;
and calling an RDMA interface of the receiving module to store the received pre-written log data into the cache of the database.
3. The method for high availability of a database based on remote direct memory access according to claim 2, wherein the step of receiving the operation submitted by the user by the master library comprises:
when the master library receives the operation submitted by the user, inserting the generated change content, namely the pre-written log data into the pre-written log cache of the master library;
and after receiving the operation submitted by the user, storing the data after the operation change into the data cache of the main library.
4. The database high availability method based on remote direct memory access according to claim 3, wherein the step of inserting the generated pre-written log data into the primary library cache and managing the primary library cache by using a cache elimination algorithm comprises:
inserting newly-generated unaccessed pre-written log data into an FIFO queue of a main library buffer;
judging whether the pre-written log data is accessed in the FIFO queue within a set first time range;
if not, eliminating the pre-written log data with the earliest insertion time in the second percentage when the total amount of the FIFO queue reaches the first percentage of the set total capacity;
if yes, moving the pre-written log data to the head of the LRU queue;
after the time interval is set, judging whether the pre-written log data is accessed in the LRU queue; if yes, moving the pre-written log data to the head of the LRU queue;
in the LRU queue, judging whether the cache capacity exceeds a first percentage of the total cache capacity; and if so, eliminating the second percentage of the pre-written log data which is least accessed at the tail part of the LRU queue.
5. The method for high availability of a database based on remote direct memory access according to claim 4, wherein the method further comprises:
and periodically refreshing the data in the data cache of the main library to the disk.
6. The method for high availability of a database based on remote direct memory access according to claim 1, further comprising:
and when the fault of the database with the fault is repaired, taking the repaired database as a standby database access system.
7. The database high availability method based on remote direct memory access according to claim 1, wherein the number of the spare banks is two, and the spare banks are respectively a first spare bank and a second spare bank;
the data stream replication process comprises:
when receiving the operation submitted by the user, storing the pre-written log record of the data change into a pre-written log cache of the main library;
after the user operation is finished, the changed data is stored in the data cache of the main library; the RDMA interface of the main library sending module is called to obtain the data in the pre-written log cache of the main library, the RoCE network card of the main library node host communicates with the RoCE network card of the first standby library node host, and the obtained data is transmitted to the receiving module of the first standby library;
calling an RDMA interface of a first standby library receiving module to store the received data into a first standby library pre-written log cache;
writing the data in the first standby library pre-written log cache into a magnetic disk of a first standby library node host for data persistence, calling an RDMA (remote direct memory access) interface of a first standby library sending module to obtain the data in the first standby library pre-written log cache, communicating with a RoCE network card of a second standby library node host through the RoCE network card of the first standby library node host, and transmitting the obtained data to a receiving module of a second standby library;
calling an RDMA interface of a second standby library receiving module to store the received data into a second standby library pre-written log cache;
writing the data in the second standby database pre-written log cache into a magnetic disk of a second standby database node host for data persistence;
and periodically refreshing the data in the data cache of the main library to a magnetic disk of the host of the node of the main library.
8. A database high-availability system based on remote direct memory access is characterized by comprising a plurality of nodes, wherein each node is provided with a database;
each node host is provided with a RoCE network card and a magnetic disk;
the nodes are sequentially connected in series communication through RoCE network cards;
a first node database on the serial link is provided with a sending module, a last node database on the serial link is provided with a receiving module, and other remaining node databases on the serial link are provided with a sending module and a receiving module;
the sending module and the receiving module are both provided with RDMA interfaces based on RDMA protocol;
each database is provided with a cache;
the system also comprises a calling module, a cache management module and a persistence module;
the calling module is used for calling the RDMA interface of the database receiving module to store the pre-written log data into a pre-written log cache of the database where the receiving module is located, and for calling the RDMA interface of the database sending module to send the pre-written log data in the cache of the database where the sending module is located to the cache of the next database through the RoCE network card;
the cache management module is used for carrying out data elimination management on the database cache;
and the persistence module is used for writing the data in the database caches of other nodes except the first node into a magnetic disk of a node host where the database is located to perform data persistence.
9. The database high-availability system based on remote direct memory access according to claim 8, wherein two RoCE network cards are installed on the hosts of the nodes on the serial link other than the first node and the last node; one RoCE network card forms an inbound transmission channel with the RoCE network card of the previous node's host on the serial link, and the other forms an outbound transmission channel with the RoCE network card of the next node's host.
10. The remote dma-based database high availability system according to claim 9, wherein the buffer management module is specifically configured to insert newly generated unaccessed data into a FIFO queue of the database buffer; judging whether the data is accessed in the FIFO queue within a set first time range; if not, eliminating the data with the second percentage with the earliest insertion time when the total amount of the queue reaches the first percentage of the set total capacity; if yes, moving the data to the head of the LRU queue; after the time interval is set, judging whether the data is accessed in the LRU queue; if yes, moving the data to the head of the LRU queue; in the LRU queue, judging whether the cache data capacity exceeds a first percentage of the total cache capacity; if so, the second percentage of data least accessed at the tail of the LRU queue is eliminated.
CN202111057997.0A 2021-09-09 2021-09-09 Database high-availability method and system based on remote direct memory access Withdrawn CN113779087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111057997.0A CN113779087A (en) 2021-09-09 2021-09-09 Database high-availability method and system based on remote direct memory access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111057997.0A CN113779087A (en) 2021-09-09 2021-09-09 Database high-availability method and system based on remote direct memory access

Publications (1)

Publication Number Publication Date
CN113779087A true CN113779087A (en) 2021-12-10

Family

ID=78842222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111057997.0A Withdrawn CN113779087A (en) 2021-09-09 2021-09-09 Database high-availability method and system based on remote direct memory access

Country Status (1)

Country Link
CN (1) CN113779087A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001079A1 (en) * 2022-06-29 2024-01-04 北京柏睿数据技术股份有限公司 Acceleration method and system for database master-slave synchronization operation
CN115202588A (en) * 2022-09-14 2022-10-18 云和恩墨(北京)信息技术有限公司 Data storage method and device and data recovery method and device
CN115202588B (en) * 2022-09-14 2022-12-27 本原数据(北京)信息技术有限公司 Data storage method and device and data recovery method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211210