CN113168371A - Write-write collision detection for multi-master shared storage databases - Google Patents

Info

Publication number
CN113168371A
CN113168371A (application CN201980078344.5A)
Authority
CN
China
Prior art keywords
log
write
sequence number
database
written
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980078344.5A
Other languages
Chinese (zh)
Inventor
陈军
蔡乐
裴春峰
马科·季米特里耶维奇
陈建军
陈宇
孙扬
杜小林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN113168371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/526 Mutual exclusion algorithms
    • G06F 9/528 Mutual exclusion algorithms by using speculative mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G06F 16/2308 Concurrency control
    • G06F 16/2315 Optimistic concurrency control

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Systems and methods are provided for an efficient architecture in which multiple database engines write to a shared data store, eliminating global locks during the write process. The systems and methods use a common logging layer between the shared data store and the compute database nodes, in which write conflict detection is implemented using pre-written (write-ahead) log records and log sequence numbers. A write-write conflict check on a pre-written log record received from a database engine of the plurality of database engines may be performed by comparing the log sequence number received from the database engine with the pre-written log record against a global log sequence number in a hash table in the common log. The pre-written log record may be sent to the shared data store after it passes the write-write conflict check.

Description

Write-write conflict detection for multi-master shared storage databases
Cross-Reference to Related Application
This application claims priority to and the benefit of U.S. provisional application No. 62/777,972, entitled "Write-Write Conflict Detection for Multi-Master Shared Storage Database," filed on December 11, 2018, which is incorporated herein by reference.
Technical Field
The present invention relates to data communications, and in particular to storing data in a shared storage database.
Background
Some enterprise-level multi-master database systems allow multiple database instances to access shared storage, to read from and write to that storage, and to generate transaction write-ahead log (WAL) records, also referred to herein as pre-written log records. Because multiple database instances generate transaction log records, the system retains read and write access through the remaining instances when any one instance fails. When multiple database instances perform read-write access to shared data, the resulting conflicts need to be detected and resolved to maintain data consistency. In database terminology, a write-write conflict is an anomaly that arises from the interleaved execution of transactions, in which write operations from different write sources target the same data. A write-write conflict may be described as overwriting uncommitted data, where the act of committing the data makes a set of temporary changes permanent.
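The anomaly described above can be shown with a toy schedule replayer (an illustrative Python sketch; the function name and schedule format are assumptions, not from the patent): two transactions interleave writes to the same tuple before either commits, so the second write overwrites uncommitted data.

```python
def interleaved_writes(schedule):
    """Replay a schedule of (txn, op, tuple_id) steps and report tuples
    whose uncommitted value was overwritten by another transaction."""
    uncommitted = {}   # tuple_id -> txn that last wrote it, still uncommitted
    conflicts = []
    for txn, op, tuple_id in schedule:
        if op == "write":
            # A write over another txn's uncommitted write is a write-write conflict.
            if tuple_id in uncommitted and uncommitted[tuple_id] != txn:
                conflicts.append((txn, uncommitted[tuple_id], tuple_id))
            uncommitted[tuple_id] = txn
        elif op == "commit":
            # Committing perpetuates this txn's temporary changes.
            uncommitted = {t: w for t, w in uncommitted.items() if w != txn}
    return conflicts

# T2 overwrites tuple 7 while T1's write to it is still uncommitted.
schedule = [("T1", "write", 7), ("T2", "write", 7),
            ("T1", "commit", None), ("T2", "commit", None)]
print(interleaved_writes(schedule))  # [('T2', 'T1', 7)]
```

A serial schedule (T1 commits before T2 writes) would report no conflicts.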
A centralized or distributed global lock is typically employed to avoid conflicting access to shared data. Acquiring and releasing such locks requires communication with a global lock manager. These lock acquire and release operations lie on the critical path of transaction execution, which can increase latency and result in low throughput.
Disclosure of Invention
It is an object of various embodiments to provide an efficient architecture and method for multiple database engines writing to a shared data store, to eliminate global locks while multiple master nodes write to the shared data store, and to bring the benefits of optimistic concurrency control to multi-master shared-storage database systems. Optimistic Concurrency Control (OCC) is a concurrency control method commonly applied to transactional systems, such as relational database management systems and software transactional memory, that ensures correct results for concurrent operations while completing them as quickly as possible. OCC assumes that multiple transactions can frequently complete without interfering with each other. Under OCC, transactions use data resources without acquiring locks on those resources while running their operations; before committing, each transaction is validated to confirm that no other transaction has modified the data it read. This object is achieved by the features of the independent claims. Other embodiments of the invention are apparent from the dependent claims, the description and the accompanying drawings.
Embodiments are based on using a common logging layer between the shared data store and the compute nodes, wherein write conflict detection is implemented using pre-written log records and log sequence numbers. Detection of a write conflict results from a write conflict check. A write conflict check (write conflict detection) may be referred to as a write-write conflict check (write-write conflict detection) because it checks for conflicts between different write operations. In an architecture that provides high availability, a common logging layer may be arranged, together with other common logging layers, between the storage nodes and the compute nodes. The conflict check may be implemented as a page-level or a tuple-level conflict check. The locks used by the conflict check are local to the common log; no global locks are used. The locks provided by the common log do not require any network communication for lock acquisition or release.
According to a first aspect, an embodiment is directed to a computer-implemented method of writing to a data store shared among a plurality of database engines, the computer-implemented method comprising: performing, using one or more processors, a write conflict check in a common log on a pre-written log record received from a database engine of the plurality of database engines, wherein the write conflict check comprises comparing a log sequence number received from the database engine with the pre-written log record against a global log sequence number in a hash table in the common log; and sending the pre-written log record to the data store shared among the plurality of database engines after the pre-written log record passes the write conflict check. In this manner, network communications associated with lock acquisition and release may be eliminated.
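The method of the first aspect can be sketched as follows (an illustrative Python sketch; the class and method names are assumptions, not part of the claimed method): the common log keeps a per-key global log sequence number (LSN), passes a record only when its LSN is newer, and forwards passing records to the shared data store.

```python
class CommonLog:
    """Toy model of the common log's write-write conflict check."""

    def __init__(self):
        self.global_lsn = {}   # key (tuple/page id) -> last accepted LSN
        self.forwarded = []    # records sent on to the shared data store

    def check_and_forward(self, key, lsn, record):
        # Pass iff the record's LSN is newer than the global LSN for this key.
        if lsn > self.global_lsn.get(key, -1):
            self.global_lsn[key] = lsn       # advance the global LSN
            self.forwarded.append(record)    # send to shared storage
            return True
        return False                         # conflict: reject the record

log = CommonLog()
print(log.check_and_forward("page:42", 10, "rec-a"))  # True
print(log.check_and_forward("page:42", 9, "rec-b"))   # False (stale LSN)
```

No global lock manager appears anywhere in this flow; the only state consulted is local to the common log.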
This approach eliminates the major bottleneck of global-lock-based conflict resolution. It essentially provides the benefits of Optimistic Concurrency Control (OCC) to multi-master shared-data systems. Under OCC, each master node runs transactions on the data in its local buffer pool without waiting for locks held by transactions on other master nodes. At group commit, the master node may flush log records to the common log, which may then perform validation. The term "flush to an entity" here means storing to that entity.
In a first implementation form of the computer-implemented method according to the first aspect, the comparing comprises using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
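The hash-table layout of this implementation form can be sketched as follows (an illustrative Python sketch; the tuple-based key encoding and field names are assumptions): the key is a tuple identification or a page identification, and the entry pairs a master node identification with a global log sequence number value.

```python
# key = ("tuple", id) or ("page", id); entry = (master node id, global LSN)
conflict_table = {
    ("tuple", 1001): ("master-1", 57),   # tuple-level entry
    ("page", 42):    ("master-3", 88),   # page-level entry
}

def lookup(table, key):
    """Return the (master node id, global LSN) entry for a key,
    or a sentinel entry if the key has never been written."""
    node_id, glsn = table.get(key, (None, -1))
    return node_id, glsn

print(lookup(conflict_table, ("page", 42)))   # ('master-3', 88)
```

Keying by page identification gives page-level checks; keying by tuple identification gives the finer-grained tuple-level checks.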
In a second implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, passing the write conflict check comprises determining that the log sequence number is greater than the global log sequence number.
In a third implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the method comprises: updating the global log sequence number to be equal to the log sequence number after passing the write conflict check.
In a fourth implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the method comprises, after passing the write conflict check: inserting the pre-written log record into a group-flush pre-written log buffer; and storing all the pre-written log records in the group-flush pre-written log buffer into a persistent log in the common log.
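The group-flush buffering of the third and fourth implementation forms can be sketched as follows (an illustrative Python sketch; class and method names are assumptions): passing records accumulate in a buffer, and a single batched flush stores them all into the persistent log.

```python
class GroupFlushBuffer:
    """Toy model of the group-flush pre-written log buffer."""

    def __init__(self):
        self.buffer = []
        self.persistent_log = []   # stands in for durable storage

    def insert(self, wal_record):
        # Records that passed the write conflict check are buffered here.
        self.buffer.append(wal_record)

    def flush(self):
        # One batched write amortizes the cost of durability (group commit).
        self.persistent_log.extend(self.buffer)
        flushed = len(self.buffer)
        self.buffer.clear()
        return flushed

gfb = GroupFlushBuffer()
gfb.insert("rec-1"); gfb.insert("rec-2")
print(gfb.flush())           # 2
print(gfb.persistent_log)    # ['rec-1', 'rec-2']
```

Batching the flush is what makes group commit cheap: many transactions share one persistence operation.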
In a fifth implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the method comprises: copying the pre-written log records to one or more follower common logs, the one or more follower common logs being constructed as backups of the common log.
In a sixth implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the pre-written log record received from a database engine is extracted, together with the log sequence number of one or more transactions between the database engine and the data store, from a batch of pre-written log records received from the database engine.
In a seventh implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the method comprises: extracting another pre-written log record, together with another log sequence number of one or more other transactions between another database engine of the plurality of database engines and the data store, from another batch of pre-written log records received from the other database engine.
In an eighth implementation form of the computer-implemented method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the method comprises: maintaining, in a command log, all operations and commands that modify the internal state of the common log.
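The command log of the eighth implementation form can be sketched as follows (an illustrative Python sketch; names are assumptions): every operation that mutates the common log's internal state is appended to a command log, which a backup can replay to reproduce that state.

```python
class CommandLoggedState:
    """Toy model of a common log whose state mutations are command-logged."""

    def __init__(self):
        self.global_lsn = {}
        self.command_log = []

    def apply(self, command):
        # Each command is a (key, lsn) state mutation; record it before use.
        key, lsn = command
        self.global_lsn[key] = lsn
        self.command_log.append(command)

    def replay_into(self, follower):
        # Replicate internal state by replaying the command log.
        for command in self.command_log:
            follower.apply(command)

leader, follower = CommandLoggedState(), CommandLoggedState()
leader.apply(("page:42", 10))
leader.replay_into(follower)
print(follower.global_lsn)   # {'page:42': 10}
```

Because replay is deterministic, a follower that consumes the same command log converges to the same internal state as the leader.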
According to a second aspect, an embodiment relates to a system comprising: a memory comprising instructions; and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: perform, in a common log, a write conflict check on a pre-written log record received from a database engine of a plurality of database engines, wherein the write conflict check comprises comparing a log sequence number received from the database engine with the pre-written log record against a global log sequence number in a hash table in the common log; and send the pre-written log record to a data store shared among the plurality of database engines after the pre-written log record passes the write conflict check. With this system, network communications associated with lock acquisition and release may be eliminated.
In a first implementation of the system according to the second aspect, the comparing comprises using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
In a second implementation form of the system according to the second aspect as such or any of the preceding implementation forms of the second aspect, passing the write conflict check comprises determining that the log sequence number is greater than the global log sequence number.
In a third implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors update the global log sequence number to be equal to the log sequence number after passing the write conflict check.
In a fourth implementation of the system according to the second aspect as such or any of the above implementations of the second aspect, after passing the write conflict check, the one or more processors: insert the pre-written log record into a group-flush pre-written log buffer; and store all the pre-written log records in the group-flush pre-written log buffer into a persistent log in the common log.
In a fifth implementation form of the system according to the second aspect as such or any of the preceding implementation forms of the second aspect, the one or more processors copy the pre-written log records to one or more follower common logs, the one or more follower common logs being constructed as backups of the common log.
In a sixth implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors extract the pre-written log record, together with the log sequence number of one or more transactions between the database engine and the data store, from a batch of pre-written log records received from the database engine.
In a seventh implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors extract another pre-written log record, together with another log sequence number of one or more other transactions between another database engine of the plurality of database engines and the data store, from another batch of pre-written log records received from the other database engine.
In an eighth implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors maintain in a command log all operations and commands that modify the internal state of the common log.
In a ninth implementation of the system according to the eighth implementation of the second aspect, the system comprises the plurality of database engines, the data store shared among the plurality of database engines, and one or more follower common logs in addition to the common log.
The computer-implemented method may be performed by the system. Other features of the computer-implemented method come directly from the functionality of the system.
The explanations provided for the first aspect and its implementations apply equally to the second aspect and the corresponding implementations.
According to a third aspect, embodiments relate to a non-transitory computer-readable medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of any one of the computer-implemented methods provided by the first aspect or any implementation thereof. Thus, the computer-implemented method may be performed automatically and repeatably.
The computer instructions stored by the non-transitory computer readable medium may be executable by the system. The system may be programmably arranged to execute the computer instructions.
According to a fourth aspect, an embodiment relates to a system comprising: means for performing a write conflict check on a pre-written log record received from a database engine of a plurality of database engines, wherein the write conflict check comprises comparing a log sequence number received from the database engine with the pre-written log record against a global log sequence number in a hash table in a common log, wherein the means for performing the write conflict check comprises the common log and is operably disposed between the plurality of database engines and a shared data store shared among the plurality of database engines; and means for sending the pre-written log record to the shared data store after the pre-written log record passes the write conflict check.
In a first implementation form of the system according to the fourth aspect, the comparing comprises using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
In a second implementation form of the system according to the fourth aspect as such or any of the preceding implementation forms of the fourth aspect, passing the write conflict check comprises determining that the log sequence number is greater than the global log sequence number.
In a third implementation form of the system according to the fourth aspect as such or any of the preceding implementation forms of the fourth aspect, the means for performing the write conflict check updates the global log sequence number to be equal to the log sequence number after passing the write conflict check.
In a fourth implementation form of the system according to the fourth aspect as such or any of the preceding implementation forms of the fourth aspect, after passing the write conflict check, the means for performing the write conflict check performs: inserting the pre-written log record into a group-flush pre-written log buffer; and storing all the pre-written log records in the group-flush pre-written log buffer into a persistent log in the common log.
In a fifth implementation form of the system according to the fourth aspect as such or any of the preceding implementation forms of the fourth aspect, the means for performing the write conflict check copies the pre-written log records to one or more follower common logs, the one or more follower common logs being constructed as backups of the common log.
In a sixth implementation of the system according to the fourth aspect as such or any of the preceding implementations of the fourth aspect, the means for performing the write conflict check extracts the pre-written log record, together with the log sequence number of one or more transactions between the database engine and the data store, from a batch of pre-written log records received from the database engine.
In a seventh implementation of the system according to the fourth aspect as such or any of the preceding implementations of the fourth aspect, the means for performing the write conflict check extracts another pre-written log record, together with another log sequence number of one or more other transactions between another database engine of the plurality of database engines and the data store, from another batch of pre-written log records received from the other database engine.
In an eighth implementation form of the system according to the fourth aspect as such or any of the preceding implementation forms of the fourth aspect, the means for performing the write conflict check maintains in the command log all operations and commands modifying the internal state of the common log.
In a ninth implementation form of the system according to the eighth implementation form of the fourth aspect, the system comprises the plurality of database engines, the data store shared among the plurality of database engines, and one or more follower common logs, in addition to the means for performing the write conflict check and the means for sending the pre-written log record to the shared data store.
The explanations provided for the first aspect and its implementations apply equally to the fourth aspect and the corresponding implementations.
Embodiments of the invention may be implemented in hardware, software, or any combination thereof. Any of the foregoing examples may be combined with any one or more of the other foregoing examples to create new embodiments within the scope of the invention.
Drawings
FIG. 1 is a block diagram of an exemplary system having an architecture with a compute-store separated multi-master database provided by exemplary embodiments.
FIG. 2 depicts example operational components within the common log of FIG. 1 that may facilitate write-write conflict detection as provided by example embodiments.
FIG. 3 is a flowchart of an exemplary cycle of a write-write conflict detection thread as provided by an exemplary embodiment.
FIGS. 4-10 illustrate the challenges of write conflicts and their specific resolution for a PostgreSQL system, as provided by an exemplary embodiment; PostgreSQL is an object-relational database management system.
FIG. 11 is a flowchart illustrating features of an exemplary method of writing to a data store shared among multiple database engines, as provided by an exemplary embodiment.
FIG. 12 is a block diagram of circuitry for implementing algorithms and performing methods of providing write-write conflict detection for a multi-master shared storage database, as provided by exemplary embodiments.
Detailed Description
The following detailed description is to be read in connection with the accompanying drawings, which are a part of the description and which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized and that structural, logical and electrical changes may be made. The following description of the exemplary embodiments is, therefore, not to be taken in a limiting sense.
In one embodiment, the functions or algorithms described herein may be implemented in software. The software may include computer-executable instructions stored in a computer-readable medium or computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based local or network storage devices. Further, these functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the described embodiments are merely examples. The software may be executed in a digital signal processor, ASIC, microprocessor, or other type of processor running in a computer system, such as a personal computer, server, or other computer system, to transform such a computer system into a specially programmed machine.
Non-transitory computer readable media include all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media, particularly excluding signals. It should be understood that software may be installed in and sold with devices that process event streams as taught herein. Alternatively, the software may be acquired and loaded into such devices, including acquiring the software through optical disk media or from any form of network or distribution system, including, for example, acquiring the software from a server owned by the software author or from a server not owned but used by the software author. For example, the software may be stored in a server for distribution over the internet.
In various embodiments, the system may be implemented with an architecture that operates a common log between the master nodes and the data store layer. Each master node may be implemented as a database node. The common log may be arranged as a write-write detection layer. Write-write detection may also be referred to as write-write conflict checking, in which it is determined whether a write transaction from one source conflicts with a related write transaction from another source. The system may be a multi-master database system designed with an architecture in which database instances write log records to storage. A storage node may replay the log records, i.e., apply the log records to construct data pages.
In such an architecture, the master databases flush log records into a common log, where the common log may perform page-level or tuple-level conflict checks. One database table may be divided into multiple pages, each holding a certain number of records. A tuple is a single record of a database table; a table may thus be divided into smaller units, such as pages, each of which may hold multiple tuples. All locks used by the conflict check may be local locks in the common log, so that no global lock exists. The common log persists log records, and only log records that pass the conflict check are forwarded from the common log to the storage nodes. The architecture allows a high-availability (HA) common log layer between the storage nodes and the compute nodes, which may be databases. HA may be provided by placing a plurality of common logs at the conflict detection layer between the storage nodes and the compute nodes, wherein one of the plurality of common logs is a primary common log and the others are copies of the primary common log. This replication defines the HA of the common log layer.
An architecture with a common log layer may be implemented using a novel write-write conflict detection algorithm based on the WAL and the Log Sequence Number (LSN), which may be executed for each log record. A log sequence number is the identification number of a given WAL record, indicating the position of the WAL record in the sequence of records for the transaction. Write-write detection may be page-level or tuple-level conflict detection. Handling conflicts at the tuple level or the page level provides fine-grained locks that achieve good parallelism, since conflict checks in the common log run in parallel on multiple worker threads. A thread is a sequence of instructions that can be executed in parallel with other such sequences. The associated locks do not require any network communication. Furthermore, read operations do not participate in write-write conflict detection, which improves overall system throughput. These write-write conflict detection techniques may eliminate global-lock-based conflict prevention and may provide the benefits of OCC for multi-master shared-data/storage database systems.
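The fine-grained, local locking described above can be sketched as follows (an illustrative Python sketch; the bucket count and hash-partitioning scheme are assumptions): conflict checks run on several worker threads, each key hashing to one of N local locks, so no global lock and no network round trip is needed.

```python
import threading

N_LOCKS = 8
locks = [threading.Lock() for _ in range(N_LOCKS)]   # local locks only
global_lsn = {}

def check(key, lsn):
    """LSN-based write-write conflict check guarded by one local lock."""
    # Only the lock for this key's hash bucket is taken; other keys proceed.
    with locks[hash(key) % N_LOCKS]:
        if lsn > global_lsn.get(key, -1):
            global_lsn[key] = lsn
            return True
        return False

# Checks on distinct pages run on parallel worker threads.
results = []
threads = [threading.Thread(target=lambda k=k: results.append(check(k, 1)))
           for k in ("page:1", "page:2", "page:3")]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))   # [True, True, True]
```

Because the lock table lives entirely inside the common log process, acquisition and release are memory operations, not messages to a global lock manager.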
FIG. 1 is a block diagram of an embodiment of an exemplary system 100 having an architecture with a compute-storage-separated multi-master database. The system 100 may include multiple databases that communicate with the common log 110 to write data to the shared storage layer 120. In this figure, there are three databases, with database engines 105-1, 105-2, and 105-3. Although three database engines of three databases are shown, system 100 may have fewer or more than three databases and thus fewer or more than three database engines. Each database engine may include one or more processors and storage devices storing instructions executable by the one or more processors to perform the operations of the database as a component. Such operations for each database may include storing data, generated by communicating with clients of the respective database, in the shared storage layer 120 through the operations of the common log 110. Each database represented by a database engine may be arranged as a database node that is a master node, receiving Structured Query Language (SQL) queries or update/insert/delete/create requests from users, such as client devices, of the database. Requests from users to the database nodes managed by database engines 105-1, 105-2, and 105-3 may be serviced from buffer pools 109-1, 109-2, and 109-3 of database engines 105-1, 105-2, and 105-3, respectively, or from persistent storage nodes if a buffer pool in a database is unavailable or lost. When a transaction modifies a tuple, it creates a log record and flushes the log record for the modified tuple to the common log 110. The common log 110 performs a conflict check on the log records and, if the log records have no conflict, distributes them to the shared storage layer 120, where a new version of the data is created by performing a log-apply operation.
Each database instance is a master node in the architecture of system 100. Each of database engines 105-1, 105-2, and 105-3 may receive an SQL query from a client and may initiate a transaction. Each of the database engines 105-1, 105-2, and 105-3 may flush WAL records to the common log 110. After the conflict checks are successfully completed, common log 110 controls the loading of data pages from transactions initiated in database engines 105-1, 105-2, and 105-3 to shared storage layer 120. However, the data pages from shared storage layer 120 may be loaded by each of database engines 105-1, 105-2, and 105-3. For example, a data page may be loaded from a shared storage layer by database engine 105-1 along path 132. Appropriate data may be loaded into each of database engines 105-2 and 105-3 in a similar manner.
The common log 110 may include one or more processors and storage devices storing instructions executable by the one or more processors to perform the operations of the common log 110. The common log 110 may perform a number of functions. The common log may receive different WAL records from database engines 105-1, 105-2, and 105-3, operating as master nodes, via paths 106-1, 106-2, and 106-3, respectively. It may perform write-write conflict detection and may send WAL records that pass its conflict check to the shared storage layer 120. In the architecture of the system 100, the common log 110 operates as the primary common log node in the HA arrangement. As the primary common log node, the common log 110 is a leader common log node that copies WAL records that pass the conflict check to follower common logs 115-1 and 115-2. The WAL records may be sent to the follower common log 115-1 along path 113-1 and to the follower common log 115-2 along path 113-2. The common log 110, as the leader common log node of the system 100, may have its internal state copied to the follower common log 115-1 and the follower common log 115-2 by transferring command logs to these follower common logs. The command log may be sent to the follower common log 115-1 along path 119-1 and to the follower common log 115-2 along path 119-2. The command log may retain all operations that modify the internal state of the common log 110, which may be identified as commands.
Although FIG. 1 shows two follower common logs, system 100 may have fewer or more than two follower common logs in a common log layer between the master node and the shared storage layer. The implementation of the common log 110 with follower common logs provides an architecture of the HA common log layer between the storage nodes and the compute nodes. The HA characteristics of this architecture of the system increase as the number of follower common logs provided in the common log layer of the system increases. When the leader common log fails, the follower common log in the HA common log layer becomes the leader common log. The order in which the follower common log becomes the leader common log can be predefined or implemented using conventional techniques to select a leader node among a set of peers when the leader common log fails.
The common log 110 may include a Global Transaction Manager (GTM) 112 residing in the common log 110. GTM 112 may generate transaction identifications (IDs). These transaction IDs may be generated in ascending order. GTM 112 may maintain a snapshot of active transactions and provide it to a database engine, e.g., over path 133 to database engine 105-1. The snapshot may be implemented as a list of active transactions. To enable follower common logs 115-1 and 115-2 to replace common log 110, follower common logs 115-1 and 115-2 include GTM 117-1 and GTM 117-2, respectively, which may perform the functions of GTM 112.
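The GTM behavior described above, ascending transaction IDs and a snapshot implemented as a list of active transactions, may be sketched as follows. This is a minimal illustration; the class and method names are invented for this sketch and are not part of the disclosed implementation.

```python
import itertools
import threading

class GlobalTransactionManager:
    """Sketch of a GTM: hands out transaction IDs in ascending order and
    maintains a snapshot (list) of currently active transactions."""

    def __init__(self):
        self._next_id = itertools.count(1)   # IDs generated in ascending order
        self._active = set()                 # currently active transaction IDs
        self._lock = threading.Lock()

    def begin(self):
        with self._lock:
            tid = next(self._next_id)
            self._active.add(tid)
            return tid

    def end(self, tid):
        # Commit and abort both remove the transaction from the snapshot.
        with self._lock:
            self._active.discard(tid)

    def snapshot(self):
        """Snapshot of active transactions, as a sorted list."""
        with self._lock:
            return sorted(self._active)

gtm = GlobalTransactionManager()
t1, t2 = gtm.begin(), gtm.begin()
gtm.end(t1)
print(gtm.snapshot())  # -> [2]
```

A follower common log taking over as leader would need an equivalent GTM state, which is why GTM 117-1 and 117-2 mirror GTM 112.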
Shared storage tier 120 may be implemented as a shared storage server having storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5. The shared storage server may include one or more processors that control the operation of the shared storage server by executing instructions stored in the shared storage layer 120. Although five storage nodes are shown, shared storage tier 120 may include fewer or more than five storage nodes. Each storage node may include one or more storage devices. Storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 may be structured as distributed storage nodes. The shared storage layer 120 may receive different WAL records from the common log 110 along multiple paths such as paths 123-1, 123-2, and 123-3.
FIG. 2 depicts an embodiment of exemplary operational components within the common log 110 of FIG. 1 that may facilitate write-write conflict detection. Similar or identical operational components are provided in the follower common logs 115-1 and 115-2 to perform operations as common log leaders when a selected follower common log is changed from a follower common log to a leader common log. The operational components may be implemented using storage devices controlled by one or more processors of the common journal 110. In the example shown in FIG. 2, these operational components residing in the common log 110 may be implemented as buffers 252 and 254 for write-write conflict detection threads, a hash table 255, a Group Flush WAL Buffer (GFWB) 256, and a persistence log 258. For ease of presentation, the common log 110 is shown receiving only a batch of write transaction log records from the two master nodes 105-1 and 105-2. The illustrated components may be extended to handle communication of a common log 110 with more than two master nodes.
The write-write conflict detection threads associated with buffers 252 and 254 may be dedicated worker threads that perform write-write conflict detection in parallel on buffers of transaction log records. The dedicated threads for buffers 252 and 254 may run conflict checks concurrently on batches of WAL records. In FIG. 2, for threads 1 and 2 associated with buffers 252 and 254, respectively, Wij(x) represents a write transaction log record, where Wij denotes the jth write operation (write transaction log) record from the ith transaction (Ti). The parameter x is either a tuple ID or a page ID, depending on whether the conflict detection is tuple-level or page-level. Since each write transaction log record that passes the write-write conflict check is assigned a global LSN in the common log 110, the term "reader LSN" denotes the latest global LSN known to the transaction. In the example of FIG. 2, the data for thread 1 includes a batch of write transaction records for transaction 1 and transaction 2, with commits C1 and C2 and a reader LSN equal to 7. A committed transaction, from the perspective of the master node, is one for which the master node does not perceive any inconsistency in its execution. Data is received in chronological order, from left to right. In this example, the commit of transaction 2 occurs before the commit of transaction 1, although the write transaction log record for transaction 1 was received before the record for transaction 2. The data for thread 2 includes a batch of write transaction records for transaction 3 and transaction 4, with commits C3 and C4 and a reader LSN equal to 7. Unlike the data of thread 1, the commit of transaction 3 occurs before the commit of transaction 4, and the write transaction log record for transaction 3 is received before the record for transaction 4.
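The shape of a Wij(x) record and of a batch such as thread 1's may be sketched as a small data structure. The field names and example page IDs below are illustrative only; the patent does not specify a record layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WalRecord:
    """Wij(x): the jth write-operation record of transaction Ti.
    `target` is a tuple ID or page ID, depending on whether detection is
    tuple-level or page-level; it is None for commit/abort records."""
    txn_id: int            # i: originating transaction
    seq: int               # j: position of the write within the transaction
    kind: str              # "write", "commit", or "abort"
    target: Optional[str]  # x: tuple ID or page ID
    reader_lsn: int        # latest global LSN known to the transaction

# A batch like thread 1's in FIG. 2: records for transactions 1 and 2,
# both with reader LSN 7, where transaction 2 commits before transaction 1.
thread1_batch = [
    WalRecord(1, 1, "write", "pageA", 7),
    WalRecord(2, 1, "write", "pageB", 7),
    WalRecord(2, 2, "commit", None, 7),
    WalRecord(1, 2, "commit", None, 7),
]
```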
The hash table 255 is a write-write conflict detection hash table 255 having entries for a tuple ID or page ID 261, a master node ID and LSN 262, and a bucket (also referred to as a slot) number 263. For tuple-level conflict checking, the key of the write-write conflict detection hash table 255 is the tuple ID; for page-level conflict checking, the key is the page ID. Each key has a value of the form {master node ID, LSN}, containing the latest global LSN of a WAL record that modified the tuple or page, where the master node ID is the ID of the master node that sent that WAL record. Whenever a log record passes the conflict checks in the common log 110, a new global LSN is assigned to the log record. The LSN in {master node ID, LSN} is that global LSN, and the master node ID identifies the master node that sent the log record to the common log 110. A global LSN is generated only for WAL records that pass the conflict check. For N master nodes, the master node ID may be an integer in 1…N, each integer assigned to one of the N master nodes and different from those assigned to the other master nodes. The write-write conflict detection hash table 255 has a fixed number of buckets, where each bucket has its own lock. A bucket lock serves as a tuple lock or a page lock, depending on whether the conflict check is at the tuple level or the page level.
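The hash table just described, keyed on tuple or page ID, valued with {master node ID, LSN}, with a fixed number of buckets each carrying its own lock, may be sketched as follows. The class name, bucket count, and key format are assumptions made for illustration.

```python
import threading

NUM_BUCKETS = 16  # fixed number of buckets, pre-allocated

class ConflictHashTable:
    """Write-write conflict detection hash table sketch: the key is a
    tuple ID or page ID; the value is (master node ID, latest global LSN).
    Each bucket has its own lock, which serves as the tuple/page lock."""

    def __init__(self):
        self._buckets = [dict() for _ in range(NUM_BUCKETS)]
        self._locks = [threading.Lock() for _ in range(NUM_BUCKETS)]

    def _bucket_no(self, key):
        return hash(key) % NUM_BUCKETS  # the lock is derived from the bucket index

    def lookup(self, key):
        b = self._bucket_no(key)
        with self._locks[b]:
            return self._buckets[b].get(key)  # (master_id, lsn) or None

    def update(self, key, master_id, lsn):
        b = self._bucket_no(key)
        with self._locks[b]:
            self._buckets[b][key] = (master_id, lsn)

table = ConflictHashTable()
table.update("tuple:42", 1, 9)
print(table.lookup("tuple:42"))  # -> (1, 9)
```

Because each bucket is locked independently, detection threads contend only when their keys hash to the same bucket, which is what gives the design its parallelism.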
In the common log 110, a WAL record is inserted into the GFWB 256 when it passes the conflict checks. Once a WAL record enters the GFWB 256, a new global LSN is assigned to it. When the GFWB 256 is full or a timer expires, all log records in the GFWB 256 are flushed to the persistent log 258 of the common log 110. The persistent log 258 may be implemented on disk. All log records that pass the conflict check are eventually flushed to the persistent log 258 and also sent to the storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 of the shared storage tier 120 of FIG. 1. The use of these components of the common log 110 is shown in FIG. 3.
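The GFWB's role, assigning the next global LSN on entry and flushing to the persistent log when full or when the timer fires, may be sketched as below. The capacity, timeout, and list-backed "disk" are illustrative assumptions, not values from the disclosure.

```python
import time

class GroupFlushWalBuffer:
    """GFWB sketch: records that pass conflict checks are appended here,
    each assigned the next global LSN; the buffer is flushed to the
    persistent log when it fills or a timer expires."""

    def __init__(self, capacity=4, timeout=0.05):
        self.capacity = capacity
        self.timeout = timeout            # seconds between forced flushes
        self._buf = []
        self._next_lsn = 1
        self._last_flush = time.monotonic()
        self.persistent_log = []          # stands in for the disk-backed log

    def append(self, record):
        lsn = self._next_lsn              # new global LSN assigned on entry
        self._next_lsn += 1
        self._buf.append((lsn, record))
        if len(self._buf) >= self.capacity or \
           time.monotonic() - self._last_flush >= self.timeout:
            self.flush()
        return lsn

    def flush(self):
        self.persistent_log.extend(self._buf)  # all buffered records go to disk
        self._buf.clear()
        self._last_flush = time.monotonic()

gfwb = GroupFlushWalBuffer(capacity=2)
gfwb.append("rec-a")
gfwb.append("rec-b")   # buffer reaches capacity -> flushed as a group
print([lsn for lsn, _ in gfwb.persistent_log])  # -> [1, 2]
```

Grouping records per flush amortizes the cost of persisting and of forwarding to the storage nodes.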
FIG. 3 is a flow diagram 300 of an embodiment of an exemplary cycle of a write-write conflict detection thread. The cycle may be implemented in a common log (such as common log 110 of FIGS. 1 and 2), using one or more processors to execute instructions stored in a memory, e.g., implemented at the common log layer of the system. At 305, a batch of WAL records is received at the common log. For example, the WAL records for write-write conflict detection thread 1 may be received in a buffer (e.g., buffer 252). Buffer 252 may include many log records and is not limited to four log records. At 310, a WAL record Wij(x) is extracted from the buffer. In the processing sequence of WAL records, the extracted Wij(x) is the next WAL record in the buffer to be checked for conflicts. At 315, the transaction ID is extracted from Wij(x). At 320, it is determined whether Wij(x) is a write operation record, an abort record, or a commit record. At 325, if Wij(x) is an abort record, the transaction ID is aborted and execution branches to 310 to fetch the next WAL record.
At 330, if Wij(x) is a commit record, it is determined whether the transaction ID is marked as "conflicted" (or by another means of identifying that a conflict occurred). At 335, if the transaction ID is marked as having a conflict, the transaction ID is aborted and execution proceeds to 310 to extract the next WAL record. At 340, if the transaction ID is not marked as conflicted, the transaction ID is committed and the process passes to 310 to extract the next WAL record.
At 345, if it is determined at 320 that Wij(x) is a write operation record, the page ID or tuple ID is extracted from Wij(x), and the reader LSN is extracted from Wij(x). At 350, a lookup is performed in a hash table (e.g., write-write conflict detection hash table 255 of FIG. 2) with the extracted page ID or tuple ID, and the entry for that page ID or tuple ID is obtained. At 355, it is determined whether the master node ID of the fetched entry is not equal to the master node ID of Wij(x) and whether the LSN of the entry extracted from the hash table is greater than the reader LSN extracted from Wij(x). At 360, if the condition in 355 is satisfied, the transaction ID is marked as "conflicted" (or with another equivalent identifier) and the process passes to 310 to extract the next WAL record. This condition arises because the reader LSN of the received Wij(x), which identifies the latest global LSN known to the transaction, is less than the current global LSN, indicating that other write operations have occurred and the data is not in the expected state.
At 365, if no conflict is found at 355, several actions are taken. Wij(x) is inserted into the GFWB with the next LSN, immediately following that of the current last WAL record in the GFWB. The LSN of the hash table entry is updated to the LSN assigned to Wij(x) on insertion into the GFWB. The master node ID of the hash table entry is updated to equal the master node ID of Wij(x). Then, if the GFWB is full or the timer reaches a set time, the GFWB is flushed. The internal state changes are retained in the command log, which is also transferred to the common log replicas, such as follower common logs 115-1 and 115-2 of FIG. 1. Each write-write conflict detection thread may run the conflict detection algorithm described above continuously for each batch of WAL records.
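The cycle of FIG. 3 (steps 305 through 365) may be sketched for a single record as follows. Field names, the dict-backed hash table, and the list standing in for the GFWB are assumptions made for illustration.

```python
def process_record(rec, table, committed, aborted, conflicted, gfwb):
    """One pass through the write-write conflict detection cycle of FIG. 3
    for a single WAL record. `table` maps tuple/page IDs to
    (master_id, global LSN); `gfwb` is a list standing in for the group
    flush WAL buffer."""
    tid = rec["txn_id"]
    if rec["kind"] == "abort":                 # 325: abort record
        aborted.add(tid)
    elif rec["kind"] == "commit":              # 330-340: commit record
        (aborted if tid in conflicted else committed).add(tid)
    else:                                      # 345-365: write operation record
        entry = table.get(rec["target"])       # 350: hash table lookup
        if entry is not None and entry[0] != rec["master_id"] \
                and entry[1] > rec["reader_lsn"]:
            conflicted.add(tid)                # 360: mark transaction conflicted
        else:                                  # 365: assign next global LSN
            lsn = len(gfwb) + 1
            gfwb.append((lsn, rec))
            table[rec["target"]] = (rec["master_id"], lsn)

table, committed, aborted, conflicted, gfwb = {}, set(), set(), set(), []
process_record({"txn_id": 1, "kind": "write", "target": "pg1",
                "master_id": 1, "reader_lsn": 0},
               table, committed, aborted, conflicted, gfwb)
# A write to pg1 from another master whose reader LSN predates LSN 1 conflicts:
process_record({"txn_id": 2, "kind": "write", "target": "pg1",
                "master_id": 2, "reader_lsn": 0},
               table, committed, aborted, conflicted, gfwb)
process_record({"txn_id": 2, "kind": "commit", "target": None,
                "master_id": 2, "reader_lsn": 0},
               table, committed, aborted, conflicted, gfwb)
print(aborted)  # -> {2}
```

Note that a write from the same master node never conflicts with its own entry, matching the master-node-ID inequality in step 355.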
Multiple write-write conflict detection threads work concurrently, providing parallelism for the system. Each thread may access the write-write conflict detection hash table of the common log using a bucket-level lock (which serves as a page-level lock or tuple-level lock, depending on whether write-write conflict detection is performed at the page level or tuple level). A fixed number of buckets may be pre-allocated for the write-write conflict detection hash table. Each bucket may have its own lock, where the lock may be derived from the index of the bucket.
To provide common log HA, all log records that pass the conflict check can be propagated to the follower common logs. A follower common log may be referred to as a standby common log. Further, the state of the write-write conflict detection hash table and the GFWB may be copied to the standby common logs by transferring a command log to each standby common log. The command log stores all operations/commands that modify the write-write conflict detection hash table and the GFWB.
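Command-log replication to the standbys may be sketched as follows: every state-modifying operation is recorded as a command and replayed by each follower, so a follower can take over with identical internal state. The command encoding and class names here are invented for the sketch.

```python
class CommonLogNode:
    """Sketch of leader-state replication: each operation that modifies
    the conflict-detection hash table is recorded as a command, retained
    in the command log, and shipped to follower (standby) common logs."""

    def __init__(self):
        self.hash_table = {}
        self.command_log = []
        self.followers = []

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.hash_table[key] = value   # modify internal state

    def execute(self, command):
        self.apply(command)
        self.command_log.append(command)   # retain the state-modifying command
        for f in self.followers:
            f.apply(command)               # transfer to each standby

leader = CommonLogNode()
standby = CommonLogNode()
leader.followers.append(standby)
leader.execute(("set", "tuple:7", (1, 12)))
print(standby.hash_table)  # -> {'tuple:7': (1, 12)}
```

Because the standby replays the same commands in the same order, its hash table and GFWB state track the leader's, which is what permits the leader election described with respect to FIG. 1.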
Tuple-level write-write conflict checking may reduce the number of conflicts and may result in better throughput than page-level write-write conflict checking. At the same time, tuple-level write-write conflict detection presents more implementation challenges. In the next section, with respect to FIGS. 4-10, the challenges and specific solutions are described using the PostgreSQL page and tuple layout as an example. PostgreSQL is an object-relational database management system (ORDBMS), that is, a database management system (DBMS) similar to a relational database but with an object-oriented database model in which database schemas and the query language directly support objects, classes, and inheritance. PostgreSQL is open source and conforms to the atomicity, consistency, isolation, and durability (ACID) principles, a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, and the like. PostgreSQL manages concurrency through a system called multi-version concurrency control (MVCC), which provides each transaction with a snapshot of the database, allowing changes to be made that are not visible to other transactions until the changes are committed.
In PostgreSQL, a data file may be referred to as a heap, where the heap may be associated with a Heap Tuple (HTUP), a page with a page number (PAGE_NO), and a Line Pointer (LP) with a line pointer number (LP_NO). Also associated with the data file (heap) is an index. An index is a specific structure that organizes references to data to make lookups easier. In PostgreSQL, an index may be a copy of the item to be indexed combined with a reference to the actual data location. Associated with the index are Index Tuples (ITUP) and LP_NO. In FIGS. 4-10, reference labels with H, I, and HI refer to heap, index, and heap or index, respectively; TUP refers to tuple.
Insertions should not conflict with each other. However, PostgreSQL's heap and index insertion implementation may result in conflicts between insertions in a multi-master system. FIG. 4 is an example of log records of insertions that result in a conflict. If a common log receives log records from both master node 405-1 and master node 405-2, only the log records from one master node can win. This may result in a large number of insertion conflicts. Master node 405-1 commits transaction log (XLOG) record 1 to insert the HTUP labeled HTUP1. XLOG record 1 includes the contents of HTUP1 with the information that PAGE_NO equals HX and LP_NO equals 1. Master node 405-1 also commits XLOG record 2 to insert the ITUP labeled ITUP1. XLOG record 2 includes the contents of ITUP1 with the information that the PAGE_NO of XLOG record 2 equals IY and LP_NO equals 1. The contents of ITUP1 are PAGE_NO = HX and LP_NO = 1. Master node 405-2 commits transaction log (XLOG) record 1 to insert the HTUP labeled HTUP2. XLOG record 1 includes the contents of HTUP2 with the information that PAGE_NO equals HX and LP_NO equals 1. Master node 405-2 also commits XLOG record 2 to insert the ITUP labeled ITUP2. XLOG record 2 includes the contents of ITUP2 with the information that the PAGE_NO of XLOG record 2 equals IY and LP_NO equals 1. The contents of ITUP2 are PAGE_NO = HX and LP_NO = 1, which conflicts with XLOG record 2 from master node 405-1.
The following is a method of eliminating such conflicts. First, when inserting a heap tuple, a master node checks whether LP_NO can equal its own master node ID, rather than simply selecting an unused LP. In the example of FIG. 4, master node 405-1 (ID = 1) may select LP1 on page HX, and master node 405-2 (ID = 2) may select LP2 on page HX. Second, each master node may only select the next unused LP when inserting an index tuple, but the LP_NO should not appear in the log record. In this way, when the log record is applied at the storage nodes, the next LP can be computed there, where the LPs of the index tuples are ordered according to the order of the index keys.
FIG. 5 illustrates these operating principles, using a common log to provide modified log records for insertions that eliminate the conflicts associated with FIG. 4. As shown in FIG. 5, XLOG record 1 from master node 405-1, with master node ID 1, has LP_NO = 1 for the insertion of HTUP1; XLOG record 2 from master node 405-1 has LP_NO = Φ (next unused LP); XLOG record 1 from master node 405-2, with master node ID 2, has LP_NO = 2; and XLOG record 2 from master node 405-2 has LP_NO = Φ (next unused LP). The associated common log for the two master nodes may allow both master nodes to commit, since the different LP_NOs do not conflict.
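The two LP-selection rules above may be sketched as follows: heap inserts take LP_NO equal to the inserting master's ID, while index inserts omit LP_NO (Φ) and let the applier compute the position that keeps LPs ordered by index key. The function names and key values are illustrative.

```python
import bisect

def choose_heap_lp(master_id):
    """Heap insert: master node i uses LP_NO equal to its own master node
    ID, so inserts from different masters never pick the same LP."""
    return master_id

def choose_index_lp(existing_keys, new_key):
    """Index insert: the log record carries LP_NO = Phi; when applied at
    the storage node, the LP is computed so that LPs remain ordered by
    index key. Illustrative stand-in for the storage-side computation."""
    return bisect.bisect_left(sorted(existing_keys), new_key) + 1

print(choose_heap_lp(1), choose_heap_lp(2))   # masters 1 and 2 -> 1 2
print(choose_index_lp([10, 30], 20))          # key 20 slots between -> 2
```

Keeping LP_NO out of index-insert log records is what lets two masters' index inserts into the same page coexist: the ordering is resolved once, deterministically, at apply time.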
Insertions from different writers may also conflict with each other when a heap or index (hereinafter HI) page is close to full. FIG. 6 shows an example of insertions into a nearly full page. For example, master node 605-1 finds that heap or index page X (HIX) has space for one tuple, so it inserts heap or index tuple 6 (HITUP6) at unused LP_NO = 5 or at the next unused LP_NO = Φ. The associated common log receives an XLOG record for HITUP6 from master node 605-1, the XLOG record having the PAGE_NO of the heap or index page, PAGE_NO = X, LP_NO = 5 (heap) or the next unused LP_NO = Φ (index), and the contents of tuple 6 (heap or index). Master node 605-2 also finds that page HIX has space for one tuple, so master node 605-2 inserts HITUP7 at LP_NO = 6 or at the next unused LP_NO = Φ. The associated common log receives an XLOG record for heap or index tuple 7 (HITUP7) from master node 605-2, the XLOG record having the PAGE_NO of the heap or index page, PAGE_NO = X, LP_NO = 6 (heap) or the next unused LP_NO = Φ (index), and the contents of tuple 7 (heap or index). When the common log receives the two log records for HITUP6 and HITUP7, it finds no conflict and commits both log records. However, when both log records are applied, the storage may find that there is no space on page HIX even though the transactions have already committed. To detect a page-full conflict, the common log maintains Free Space Map (FSM) pages. There may be an FSM for each heap and index relation to track the available space in the relation. When the relevant inserting log records have no conflict and are ready to commit, the common log also checks whether the increase would cause the page size to overflow. If so, the common log aborts the transaction. Updates that add new versions of tuples are handled similarly.
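The FSM-based page-full check may be sketched as a simple reservation against tracked free space; the common log aborts whichever transaction would overflow the page. The page size and byte counts below are illustrative (8192 is PostgreSQL's default page size, but the disclosure does not fix these values).

```python
PAGE_SIZE = 8192  # bytes; PostgreSQL's default page size (assumed here)

class FreeSpaceMap:
    """Per-relation FSM sketch: the common log tracks free bytes per page
    and rejects (aborts) a transaction whose insert would overflow."""

    def __init__(self):
        self.free = {}  # page_no -> free bytes remaining

    def try_reserve(self, page_no, nbytes):
        avail = self.free.get(page_no, PAGE_SIZE)
        if nbytes > avail:
            return False          # would overflow the page: abort
        self.free[page_no] = avail - nbytes
        return True

fsm = FreeSpaceMap()
fsm.free["HIX"] = 100              # page HIX has room for only one tuple
print(fsm.try_reserve("HIX", 80))  # first master's insert fits -> True
print(fsm.try_reserve("HIX", 80))  # second master's insert overflows -> False
```

This is the piece the per-tuple/per-page LSN check cannot see: neither insert touched the same key, yet only one can physically fit.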
An insertion may result in an index page split. When one writer splits an index page and another writer inserts an index tuple into the same page, the split or the insertion should be aborted. FIG. 7 is an example of an index page split. In FIG. 7, master node 705-1 inserts ITUP5, filling index page IX. Master node 705-1 then attempts to insert ITUP6 into page IX. Page IX is split, with the first half of the ITUPs staying on page IX and the second half going to a new page INEW. ITUP5 and ITUP6 both go to INEW because their index keys are larger. Master node 705-2 inserts ITUP7 on page IX, assuming ITUP7 has the largest key.
FIG. 8 generally illustrates the log records generated from the index page split by the two master nodes of FIG. 7. Master node 705-1 generates an XLOG record 1 for index tuple ITUP5, including PAGE_NO = IX (index page X), LP_NO = Φ, and the ITUP5 contents (e.g., PAGE_NO = HX and LP_NO = 1). Master node 705-1 generates an XLOG record 2 for index tuple ITUP6 and the split of page IX, including the LP_NO of the split point, the new PAGE_NO of the new page, and the contents of the new page including ITUP5 and ITUP6. Master node 705-2 generates an XLOG record 1 for index tuple ITUP7, including PAGE_NO = IX (index page X), LP_NO = Φ, and the ITUP7 contents (e.g., PAGE_NO = HX and LP_NO = 2).
The common log maintains the PAGE_NOs of recently split index pages and their split-commit LSNs in its write-write conflict detection hash table. Assuming the common log first commits the changes of master node 705-1, the common log then receives the log record from master node 705-2 regarding the insertion of ITUP7 on page IX. The common log finds that page IX was split and that the commit LSN of the split is later than the reader LSN of master node 705-2's log record, so it aborts master node 705-2's changes. Similarly, assuming the common log first commits the changes of master node 705-2, the common log then receives master node 705-1's split log record for page IX. The split's reader LSN is earlier than the commit LSN of master node 705-2's insertion on page IX, and the common log aborts the split of page IX.
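The split-versus-insert check above reduces to one comparison: if a recently split page's commit LSN is later than everything the incoming record's transaction has seen (its reader LSN), the incoming change is aborted. A minimal sketch, with invented names:

```python
def check_against_split(split_lsns, page_no, reader_lsn):
    """The common log keeps the commit LSN of recently split index pages;
    a record touching such a page is aborted when the split committed
    after everything the record's transaction has seen (its reader LSN)."""
    split_lsn = split_lsns.get(page_no)
    if split_lsn is not None and split_lsn > reader_lsn:
        return "abort"   # the insert (or split) raced with a committed change
    return "ok"

split_lsns = {"IX": 20}                           # page IX split committed at LSN 20
print(check_against_split(split_lsns, "IX", 15))  # stale reader -> abort
print(check_against_split(split_lsns, "IX", 25))  # reader saw the split -> ok
```

The symmetric case, aborting the split when the insert committed first, applies the same rule with the roles of the two records exchanged.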
FIG. 9 illustrates handling of update-update conflicts. Handling updates may be relatively simple. Assume that master node 905-1 and master node 905-2 both attempt to update HTUP0 at {PAGE_NO = HX, LP_NO = 0}. The newer version on master node 905-1 is HTUP0' at {PAGE_NO = HX, LP_NO = 1}, with contents HTUP0' and xmax equal to the transaction ID (TID1) of master node 905-1. The newer version on master node 905-2 is HTUP0'' at {PAGE_NO = HX, LP_NO = 2}, with contents HTUP0'' and xmax equal to the transaction ID (TID2) of master node 905-2. The LP_NO selected on the heap page satisfies LP_NO = master node ID. If the common log first commits the log record of master node 905-1, it will reject the change of master node 905-2 when it finds that the most recently committed LSN of HTUP0 at {PAGE_NO = HX, LP_NO = 0} is greater than the reader LSN of master node 905-2's log record, which also references HTUP0 at {PAGE_NO = HX, LP_NO = 0}. The same reasoning applies to the case where the common log first commits the log record of master node 905-2.
FIG. 10 illustrates handling of update-delete conflicts. Assume master node 1005-1 deletes HTUP0 at {PAGE_NO = HX, LP_NO = 0}, while master node 1005-2 updates HTUP0 to HTUP0'. Master node 1005-2 places HTUP0' at {PAGE_NO = HX, LP_NO = 2}. If the associated common log first commits the log record of master node 1005-1, it will reject the change of master node 1005-2 when it finds that the most recently committed LSN of HTUP0 at {PAGE_NO = HX, LP_NO = 0} is greater than the reader LSN of master node 1005-2's log record, which also references HTUP0 at {PAGE_NO = HX, LP_NO = 0}. Similar reasoning applies to the case where the common log first commits the log record of master node 1005-2.
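Both the update-update case of FIG. 9 and the update-delete case of FIG. 10 apply the same first-writer-wins rule against the tuple's latest committed LSN. A minimal sketch of that shared rule, with invented names and LSN values:

```python
def first_writer_wins(table, key, master_id, reader_lsn, commit_lsn):
    """Update-update and update-delete conflicts reduce to the same rule:
    the record that commits first raises the tuple's latest LSN; a later
    record from another master whose reader LSN predates it is rejected."""
    entry = table.get(key)
    if entry is not None and entry[0] != master_id and entry[1] > reader_lsn:
        return "reject"
    table[key] = (master_id, commit_lsn)
    return "commit"

table = {}
htup0 = "{PAGE_NO=HX, LP_NO=0}"
# Master 1's update (or delete) of HTUP0 commits first at LSN 30 ...
print(first_writer_wins(table, htup0, 1, reader_lsn=10, commit_lsn=30))  # -> commit
# ... so master 2's change with stale reader LSN 10 is rejected.
print(first_writer_wins(table, htup0, 2, reader_lsn=10, commit_lsn=31))  # -> reject
```

The symmetry noted in the text, that the outcome is the same whichever master's record the common log commits first, follows directly: whichever record arrives second against a newer committed LSN loses.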
FIG. 11 is a flow diagram of features of an embodiment of an exemplary method 1100 of writing to a data store shared among multiple database engines. Method 1100 may be implemented as a computer-implemented method. At 1110, write-write conflict checking is performed, using one or more processors, on a pre-write log record received in a common log from a database engine of the plurality of database engines, wherein the write-write conflict checking comprises: comparing a log sequence number, received from the database engine with the pre-write log record, to a global log sequence number in a hash table in the common log. Making the comparison may include using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry of the form {master node identification, global log sequence number value}, wherein the master node identification may be an identification of a database engine of the plurality of database engines.
At 1120, the pre-write log record is sent to the data store shared among the plurality of database engines after the pre-write log record passes the write-write conflict check. Passing the write-write conflict check may include the log sequence number being greater than the global log sequence number. After passing the write-write conflict check, method 1100, or a method similar to method 1100, may include updating the global log sequence number to be equal to the log sequence number. After passing the write-write conflict check, method 1100 or a method similar to method 1100 may include: inserting the pre-write log record into a group flush pre-write log buffer; and storing all the pre-write log records in the group flush pre-write log buffer into a persistent log in the common log.
Variations of method 1100 or methods similar to method 1100 may include many different embodiments that may be combined depending on the application of such methods and/or the architecture of the system in which such methods are implemented. Such methods may include copying the pre-write log records to one or more follower common logs that are structured as backups of the common log. Such methods may include maintaining, in a command log, all operations and commands that modify the internal state of the common log.
In method 1100 or a method similar to method 1100, the pre-write log records received from a database engine may be extracted from a batch of pre-write log records received from the database engine having the log sequence number of one or more transactions between the database engine and the data store. Another pre-written log record may be extracted from another batch of pre-written log records received from another database engine of the plurality of database engines, having another log sequence number for one or more other transactions between the another database engine and the data store.
In various embodiments, a non-transitory machine-readable storage device (e.g., a computer-readable non-transitory medium) may include instructions stored thereon that, when executed by components of a machine, cause the machine to perform operations, wherein the operations include one or more features similar or identical to those of the methods and techniques described with respect to method 1100, flowchart 300, variations thereof, and/or other methods taught herein (e.g., as associated with fig. 1-11). The physical structure of such instructions may be operated on by one or more processors. For example, execution of these physical structures may cause a machine to perform operations comprising: performing, using one or more processors, write-write conflict checking on pre-written log records received from a database engine of a plurality of database engines in a common log, wherein the write-write conflict checking comprises: comparing a log sequence number received from the database engine with the pre-written log record to a global log sequence number in a hash table in the common log; sending the pre-write log record to the data store shared between multiple database engines after the pre-write log record passes the write-write conflict check. Making the comparison may include using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
Executing, using one or more processors, instructions stored in a machine-readable storage device may include the following: passing the write-write conflict check may include the log sequence number being greater than the global log sequence number. After passing the write-write conflict check, the executable operations may include updating the global log sequence number to equal the log sequence number. After passing the write-write conflict check, the executable operations may include: inserting the pre-written log record into a group refresh pre-written log buffer; and storing all the pre-written log records in the group of refreshing pre-written log buffers into a persistent log in the common log.
The operations may include: extracting the pre-write log records received from a database engine from a batch of pre-write log records received from the database engine having the log sequence number of one or more transactions between the database engine and the data store. The operations may include: extracting another pre-written log record from another batch of pre-written log records received from another database engine of the plurality of database engines, having another log sequence number for one or more other transactions between the another database engine and the data store.
The operations may include copying the pre-write log records to one or more follower common logs, wherein the one or more follower common logs are structured as backups of the common log. The operations may include maintaining, in a command log, all operations and commands that modify an internal state of the common log.
FIG. 12 is a block diagram of circuitry for implementing an algorithm and an apparatus that performs a method of providing write-write collision detection for a multi-master shared memory database provided in accordance with the teachings herein. FIG. 12 depicts a device 1200 having a non-transitory memory 1201 storing instructions, a cache 1207, and a processing unit 1202 coupled to a bus 1220. The processing unit 1202 may include one or more processors in operable communication with the non-transitory memory 1201 and the cache 1207. The one or more processors may be configured to execute instructions to operate the apparatus 1200 as a database engine, common log, or shared data store, according to any of the methods taught herein.
Device 1200 may include a communications interface 1216 usable to communicate between devices and systems associated with an architecture, such as the architecture of FIG. 1. One or more of the multiple databases, common logs, and shared data stores may be implemented in a cloud that may be associated with the device 1200. In general, the term "cloud" refers to data processing in many virtual servers, rather than directly in physical machines. The cloud may span a Wide Area Network (WAN). A WAN also typically refers to the public internet and sometimes to a network of leased fiber links interconnecting multiple branch offices of an enterprise. Alternatively, the cloud may reside entirely within a private data center within the internal local area network. Cloud datacenters, i.e., datacenters hosting virtual computing or services, may also provide services for network traffic management from one location on a network to another location on the network, or across a network at a remote location via a WAN (or the internet). Furthermore, the term "cloud computing" refers to software and services that these servers execute in a virtual manner (through a virtual machine hypervisor) for users, typically without the users knowing the physical location of the servers or the data center. Further, the data center may be a distributed entity. Cloud computing may provide shared computer processing resources and data to computers and other devices on demand over an associated network. The communication interface 1216 may be part of a data bus that may be used to receive data traffic for processing.
The non-transitory memory 1201 may be implemented as a machine-readable medium, such as a computer-readable medium, and may include volatile memory 1214 or non-volatile memory 1208. The device 1200 may include or have access to a computing environment that includes a variety of machine-readable media, such as computer-readable media including volatile memory 1214, non-volatile memory 1208, removable memory 1211, and non-removable memory 1222. Such machine-readable media may be used with instructions in one or more programs 1218 that are executed by device 1200. The cache 1207 may be implemented as a separate memory component or portion of one or more of the volatile memory 1214, nonvolatile memory 1208, removable memory 1211, or non-removable memory 1222. The memory may include Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
Device 1200 may include or have access to a computing environment that includes input interface 1226 and output interface 1224. Output interface 1224 may include a display device (e.g., a touch screen) that may also serve as an input device. The input interface 1226 may include one or more of the following: a touch screen, touch pad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within device 1200 or coupled to device 1200 through a wired or wireless data connection, and other input devices.
The device 1200 may operate in a networked environment using communication connections to connect to one or more remote devices. Such remote devices may be the same as or similar to device 1200, or may be different types of devices having features similar or identical to those of device 1200 or other features taught herein, to handle processes associated with providing write-write collision detection for multi-master shared storage databases in accordance with the techniques herein. The remote devices may include computers, such as database servers. Such remote computers may include personal computers (PCs), servers, routers, network PCs, peer devices, other common network nodes, and the like. The communication connection may include a local area network (LAN), a wide area network (WAN), a cellular network, Wi-Fi, Bluetooth, or other networks.
Machine-readable instructions, such as computer-readable instructions stored on a computer-readable medium, may be executed by the processing unit 1202 of the device 1200. Hard drives, CD-ROMs, and RAM are some examples of articles of manufacture that include a non-transitory computer-readable medium, such as a storage device. The terms "machine-readable medium," "computer-readable medium," and "storage device" do not include a carrier wave, because a carrier wave is too transitory. The memory may also include networked memory, such as a storage area network (SAN).
Device 1200 may be implemented as a computing device that may take different forms in different embodiments as part of a network, such as an SDN/IoT network. For example, device 1200 may be a smartphone, tablet, smartwatch, other computing device, or other type of device having wireless communication capabilities, where such a device includes components for participating in the distribution and storage of content items, as taught herein. Devices such as smartphones, tablets, smartwatches, and other types of devices with wireless communication capabilities are commonly referred to collectively as mobile devices or user devices. Further, some of these devices may be considered systems implementing their functionality and/or applications. Further, while various data storage elements are illustrated as part of device 1200, the memory may also, or alternatively, comprise cloud-based memory accessible via a network such as the Internet, or server-based memory.
In one exemplary embodiment, the apparatus 1200 includes: a conflict check module to perform, using one or more processors, a write-write conflict check in a common log on pre-written log records received from a database engine of a plurality of database engines, wherein the write-write conflict check comprises comparing a log sequence number, received from the database engine with the pre-written log record, to a global log sequence number in a hash table in the common log; and a sending module to send the pre-written log record to the data store shared among the multiple database engines after the pre-written log record passes the write-write conflict check. In some embodiments, the apparatus 1200 may include other or additional modules for performing any one or combination of the steps described in the embodiments. Moreover, any additional or alternative embodiments or aspects of the method, as shown in any of the figures or recited in any claim, are also contemplated to include similar apparatus.
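The conflict check performed by such a module can be sketched roughly as follows. This is an illustrative assumption, not the patent's actual implementation: the class and method names, the string keys, and the in-memory `dict` standing in for the hash table are all invented for the example.

```python
# Hypothetical sketch of the write-write conflict check: a hash table maps a
# page or tuple identification to an entry of (master node id, global LSN).
# A pre-written log record passes the check only if its log sequence number
# is greater than the stored global log sequence number for that key.

class ConflictChecker:
    """Tracks, per page/tuple key, the (node id, global LSN) of the last
    write accepted into the common log."""

    def __init__(self):
        self.table = {}  # key: page or tuple id -> (node_id, global_lsn)

    def check_and_update(self, key, node_id, lsn):
        """Return True and advance the global LSN if `lsn` is newer than
        the stored global LSN for `key`; otherwise report a conflict."""
        entry = self.table.get(key)
        if entry is None or lsn > entry[1]:
            self.table[key] = (node_id, lsn)  # passed: update global LSN
            return True
        return False  # conflict: a newer write from some master already exists


checker = ConflictChecker()
assert checker.check_and_update("page:42", node_id=1, lsn=100)      # first write passes
assert not checker.check_and_update("page:42", node_id=2, lsn=90)   # stale LSN: conflict
assert checker.check_and_update("page:42", node_id=2, lsn=101)      # newer LSN passes
```

Only records that pass this check would then be forwarded by the sending module to the shared data store.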
Further, a machine-readable storage device, such as a computer-readable non-transitory medium, is herein a physical device that stores data represented by a physical structure within the device. Such a physical device is a non-transitory device. Examples of a machine-readable storage device include, but are not limited to, read-only memory (ROM), random access memory (RAM), magnetic disk storage, optical storage, flash memory, and other electrical, magnetic, or optical storage. The machine-readable device may be a machine-readable medium such as the memory 1201 of FIG. 12. Terms such as "memory," "memory module," "machine-readable medium," and "machine-readable device" should be taken to include all forms of storage media, whether in the form of a single medium (or device) or multiple media (or devices). For example, such structures may be implemented as one or more centralized databases, one or more distributed databases, associated caches and servers; one or more storage devices, such as storage drives (including but not limited to electronic, magnetic, and optical drives and storage mechanisms); and one or more instances of a storage device or module (whether main memory, cache memory internal or external to a processor, or buffers). Terms such as "memory," "memory module," "machine-readable medium," and "machine-readable device" should also be taken to include any tangible, non-transitory medium that is capable of storing or encoding a sequence of instructions for execution by a machine such that the machine performs any one of the methodologies taught herein. The term "non-transitory," used in connection with "machine-readable device," "medium," "storage medium," "device," or "storage device," expressly includes all forms of storage drives (optical, magnetic, electrical, etc.) and all forms of storage devices (e.g., DRAM, flash (in all storage designs), SRAM, MRAM, phase-change memory devices, etc., as well as all other structures designed to store any type of data for later retrieval).
In various embodiments, a system may be implemented to enable write-write collision detection for a multi-master shared storage database. Such a system may include a memory having instructions and one or more processors in communication with the memory. The one or more processors may execute the instructions to: perform, in a common log, a write conflict check on a pre-written log record received from a database engine of a plurality of database engines, wherein the write conflict check comprises comparing a log sequence number, received from the database engine with the pre-written log record, to a global log sequence number in a hash table in the common log; and send the pre-written log record to a data store shared among the plurality of database engines after the pre-written log record passes the write conflict check. The comparison may include using a tuple identification or a page identification as a key in the hash table, wherein the key is associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
Variations of such systems or similar systems may include many different embodiments, which may be combined depending on the application of such systems and/or the architecture in which such systems are implemented. Passing the write conflict check may include the log sequence number being greater than the global log sequence number. After passing the write conflict check, the one or more processors may update the global log sequence number to be equal to the log sequence number. After passing the write conflict check, the one or more processors may insert the pre-write log records into the group flush pre-write log buffer and save all pre-write log records in the group flush pre-write log buffer into a persistent log in the common log.
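The group-flush step described above (buffer records that pass the check, then persist them together) might look roughly like the following sketch. `GroupFlushBuffer`, its method names, and the record shape are illustrative assumptions, not names from the patent.

```python
# Hedged sketch of the group-flush pre-write log buffer: records that pass
# the write conflict check accumulate in a buffer; a flush saves all buffered
# records into the persistent log in the common log in one group.

class GroupFlushBuffer:
    def __init__(self):
        self.pending = []         # pre-write log records awaiting persistence
        self.persistent_log = []  # stand-in for the durable log in the common log

    def insert(self, record):
        """Buffer a record that already passed the write conflict check."""
        self.pending.append(record)

    def flush(self):
        """Persist all buffered records as one group, then clear the buffer."""
        self.persistent_log.extend(self.pending)
        count = len(self.pending)
        self.pending.clear()
        return count


buf = GroupFlushBuffer()
buf.insert({"key": "page:42", "lsn": 101})
buf.insert({"key": "tuple:7", "lsn": 102})
assert buf.flush() == 2
assert len(buf.persistent_log) == 2 and buf.pending == []
```

Grouping the writes this way amortizes the cost of making the log durable across many records, which is the usual motivation for group commit in write-ahead logging systems.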
Such systems may include one or more processors that execute instructions to copy the pre-write log records to one or more follower common logs, wherein the one or more follower common logs are constructed as backups of the common log. The one or more processors can extract the pre-write log record from a batch of pre-write log records received from a database engine, the batch carrying the log sequence numbers of one or more transactions between the database engine and the data store. The one or more processors can extract another pre-write log record from another batch of pre-write log records received from another database engine of the plurality of database engines, the other batch carrying another log sequence number of one or more other transactions between the other database engine and the data store. The one or more processors may maintain, in a command log, all operations and commands that modify the internal state of the common log.
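The batch handling and follower replication just described could be sketched as follows. `process_batch`, the tuple-shaped records, and the plain lists standing in for follower common logs are hypothetical names invented for illustration.

```python
# Illustrative sketch: a master sends a batch of pre-write log records; the
# leader common log extracts each record and replicates it to the follower
# common logs that serve as backups.

def process_batch(batch, followers):
    """batch: list of (key, lsn) records from one database engine.
    followers: list-like follower logs that mirror extracted records."""
    extracted = []
    for record in batch:            # extract each pre-write log record
        extracted.append(record)
        for follower in followers:  # replicate to backup common logs
            follower.append(record)
    return extracted


follower_a, follower_b = [], []
batch = [("page:1", 10), ("page:2", 11)]
result = process_batch(batch, [follower_a, follower_b])
assert result == batch
assert follower_a == batch and follower_b == batch
```

In a real system the conflict check and durability steps would sit between extraction and replication; this sketch isolates only the batching and fan-out to followers.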
Such a system, or similar system, may include the multiple database engines, a data store shared among the multiple database engines, and one or more follower common logs in addition to the common log. The multiple database engines may be independent structural units, separate from the common log and the shared data store. The multiple database engines may communicate with the leader common log, the follower common logs, and the shared data store to transmit data using conventional communication techniques such as, but not limited to, the Transmission Control Protocol (TCP) and the Internet Protocol (IP). Such a system, or similar system, may be constructed in accordance with any of the permutations of features taught herein for write-write collision detection of a multi-master shared storage database.
In various embodiments, a system may be implemented to enable write-write collision detection for a multi-master shared storage database. Such a system may include: means for performing a write conflict check on a pre-write log record received from a database engine of a plurality of database engines, wherein the write conflict check comprises comparing a log sequence number, received from the database engine with the pre-write log record, to a global log sequence number in a hash table in a common log, and wherein the means for performing the write conflict check comprises the common log and is operably disposed between the plurality of database engines and a shared data store shared among the plurality of database engines; and means for sending the pre-write log record to the shared data store after the pre-write log record passes the write conflict check. The comparison may include using a tuple identification or a page identification as a key in the hash table, the key being associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
Variations of such systems, or similar systems having means for performing write conflict checks on pre-write log records received from database engines of the plurality of database engines, may include many different embodiments, which may be combined depending on the application of such systems and/or the architecture in which such systems are implemented. Passing the write conflict check may include the log sequence number being greater than the global log sequence number. After passing the write conflict check, the means for performing the write conflict check may update the global log sequence number to be equal to the log sequence number. After passing the write conflict check, the means for performing the write conflict check may insert the pre-write log record into the group flush pre-write log buffer and save all pre-write log records in the group flush pre-write log buffer into a persistent log in the common log. Such a system, or similar system, may include means for performing write conflict checks that is structured to copy the pre-write log records to one or more follower common logs structured as backups of the common log.
The means for performing the write conflict check may extract the pre-write log record from a batch of pre-write log records received from a database engine, the batch carrying the log sequence numbers of one or more transactions between the database engine and the data store. The means for performing the write conflict check may extract another pre-write log record from another batch of pre-write log records received from another database engine of the plurality of database engines, the other batch carrying another log sequence number of one or more other transactions between the other database engine and the data store. The means for performing the write conflict check may maintain, in a command log, all operations and commands that modify the internal state of the common log.
In addition to the means for performing the write conflict check and the means for sending the pre-write log record to the shared data store, a system, or similar system having means for performing write conflict checks on pre-write log records, may also include the plurality of database engines, the data store shared among the plurality of database engines, and one or more follower common logs. Such a system, or similar system, may be constructed in accordance with any of the permutations of features taught herein for write-write collision detection of a multi-master shared storage database.
Conflict prevention based on global locks can generate a large amount of network traffic and congestion, resulting in low performance and low throughput. The methods and structures taught herein instead use write-ahead log (WAL) records to determine whether a conflict exists. This approach eliminates global locks and brings the benefits of optimistic concurrency control to multi-master shared-storage database systems, and can serve as a key technique for building large-scale cloud multi-master systems.
While the invention has been described with reference to specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the scope of the invention. Accordingly, the specification and figures are to be regarded simply as illustrations of the invention as defined by the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the invention.

Claims (30)

1. A computer-implemented method of writing to a data store shared among a plurality of database engines, the computer-implemented method comprising:
performing, using one or more processors, a write-write conflict check in a common log on a pre-written log record received from a database engine of the plurality of database engines, wherein the write-write conflict check comprises: comparing a log sequence number, received from the database engine with the pre-written log record, to a global log sequence number in a hash table in the common log;
and sending the pre-written log record to the data store shared among the plurality of database engines after the pre-written log record passes the write-write conflict check.
2. The computer-implemented method of claim 1, wherein performing the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
3. The computer-implemented method of claim 1 or claim 2, wherein passing the write-write conflict check comprises the log sequence number being greater than the global log sequence number.
4. The computer-implemented method of any of claims 1 to 3, comprising: updating the global log sequence number to be equal to the log sequence number after passing the write-write conflict check.
5. The computer-implemented method of any of claims 1 to 3, comprising, after passing the write-write conflict check:
inserting the pre-written log record into a group flush pre-written log buffer;
and storing all the pre-written log records in the group flush pre-written log buffer into a persistent log in the common log.
6. The computer-implemented method of any of claims 1 to 5, comprising copying the pre-written log records to one or more follower common logs, the one or more follower common logs constructed as backups of the common log.
7. The computer-implemented method of any of claims 1 to 6, wherein the pre-written log record received from the database engine is extracted from a batch of pre-written log records received from the database engine, the batch carrying the log sequence number of one or more transactions between the database engine and the data store.
8. The computer-implemented method of any of claims 1 to 7, comprising extracting another pre-written log record from another batch of pre-written log records received from another database engine of the plurality of database engines, the other batch carrying another log sequence number of one or more other transactions between the other database engine and the data store.
9. The computer-implemented method of any of claims 1 to 8, comprising maintaining in a command log all operations and commands that modify the internal state of the common log.
10. A non-transitory computer readable medium storing computer instructions, wherein the computer instructions, when executed by one or more processors, cause the one or more processors to perform the steps of any one of claims 1 to 9.
11. A system, comprising:
a memory comprising instructions;
one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:
performing, in a common log, a write-write conflict check on a pre-written log record received from a database engine of a plurality of database engines, wherein the write-write conflict check comprises: comparing a log sequence number, received from the database engine with the pre-written log record, to a global log sequence number in a hash table in the common log;
sending the pre-written log record to a data store shared among the plurality of database engines after the pre-written log record passes the write-write conflict check.
12. The system of claim 11, wherein the comparing comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
13. The system of claim 11 or claim 12, wherein passing the write-write conflict check comprises the log sequence number being greater than the global log sequence number.
14. The system according to any one of claims 11 to 13, wherein said one or more processors update said global log sequence number to equal said log sequence number after passing said write-write conflict check.
15. The system according to any one of claims 11 to 14, wherein after passing the write-write conflict check, the one or more processors perform:
inserting the pre-written log record into a group flush pre-written log buffer;
and storing all the pre-written log records in the group flush pre-written log buffer into a persistent log in the common log.
16. The system according to any one of claims 11 to 15, wherein said one or more processors copy said pre-written log records to one or more follower common logs, said one or more follower common logs constructed as backups of said common log.
17. The system of any one of claims 11 to 16, wherein the one or more processors extract the pre-written log record from a batch of pre-written log records received from the database engine, the batch carrying the log sequence number of one or more transactions between the database engine and the data store.
18. The system of any of claims 11 to 17, wherein the one or more processors extract another pre-written log record from another batch of pre-written log records received from another database engine of the plurality of database engines, the other batch carrying another log sequence number of one or more other transactions between the other database engine and the data store.
19. The system according to any one of claims 11 to 18, wherein said one or more processors maintain in a command log all operations and commands that modify the internal state of said common log.
20. The system of any one of claims 11 to 19, comprising the plurality of database engines, the data store shared among the plurality of database engines, and one or more follower common logs in addition to the common log.
21. A system, comprising:
means for performing a write conflict check on a pre-write log record received from a database engine of a plurality of database engines, wherein the write conflict check comprises comparing a log sequence number, received from the database engine with the pre-write log record, to a global log sequence number in a hash table in a common log, and wherein the means for performing the write conflict check comprises the common log and is operably disposed between the plurality of database engines and a shared data store shared among the plurality of database engines;
means for sending the pre-write log record to the shared data store after the pre-write log record passes the write conflict check.
22. The system of claim 21, wherein the comparing comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry represented in the form of a master node identification and a global log sequence number value, the master node identification being an identification of a database engine of the plurality of database engines.
23. The system of claim 21 or claim 22, wherein passing the write conflict check comprises the log sequence number being greater than the global log sequence number.
24. The system according to any of claims 21 to 23, wherein said means for performing said write conflict check updates said global log sequence number to equal said log sequence number after passing said write conflict check.
25. The system according to any of claims 21 to 24, wherein after passing said write conflict check, said means for performing said write conflict check performs:
inserting the pre-write log record into a group flush pre-write log buffer;
and storing all the pre-write log records in the group flush pre-write log buffer into a persistent log in the common log.
26. The system of any of claims 21 to 25, wherein the means for performing the write conflict check copies the pre-write log records to one or more follower common logs, the one or more follower common logs constructed as backups of the common log.
27. The system of any of claims 21 to 26, wherein the means for performing the write conflict check extracts the pre-write log record from a batch of pre-write log records received from the database engine, the batch carrying the log sequence number of one or more transactions between the database engine and the data store.
28. The system of any of claims 21 to 27, wherein the means for performing the write conflict check extracts another pre-write log record from another batch of pre-write log records received from another database engine of the plurality of database engines, the other batch carrying another log sequence number of one or more other transactions between the other database engine and the data store.
29. The system according to any of claims 21 to 28, wherein said means for performing said write conflict check maintains in a command log all operations and commands modifying the internal state of said common log.
30. The system of any of claims 21 to 29, wherein the system further comprises the plurality of database engines, the data store shared among the plurality of database engines, and one or more follower common logs, in addition to the means for performing the write conflict check and the means for sending the pre-write log record to the shared data store.
CN201980078344.5A 2018-12-11 2019-06-14 Write-write collision detection for multi-master shared storage databases Pending CN113168371A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862777972P 2018-12-11 2018-12-11
US62/777,972 2018-12-11
PCT/CN2019/091397 WO2020119050A1 (en) 2018-12-11 2019-06-14 Write-write conflict detection for multi-master shared storage database

Publications (1)

Publication Number Publication Date
CN113168371A true CN113168371A (en) 2021-07-23

Family

ID=71075569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980078344.5A Pending CN113168371A (en) 2018-12-11 2019-06-14 Write-write collision detection for multi-master shared storage databases

Country Status (3)

Country Link
EP (1) EP3877859A4 (en)
CN (1) CN113168371A (en)
WO (1) WO2020119050A1 (en)


Also Published As

Publication number Publication date
EP3877859A1 (en) 2021-09-15
WO2020119050A1 (en) 2020-06-18
EP3877859A4 (en) 2022-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination