WO2020119050A1 - Write-write conflict detection for multi-master shared storage database - Google Patents

Write-write conflict detection for multi-master shared storage database

Info

Publication number
WO2020119050A1
WO2020119050A1 (international application PCT/CN2019/091397)
Authority
WO
WIPO (PCT)
Prior art keywords
write
log
common
sequence number
record
Prior art date
Application number
PCT/CN2019/091397
Other languages
French (fr)
Inventor
Jun Chen
Le CAI
Chunfeng Pei
Marko DIMITRIJEVIC
Jianjun Chen
Yu Chen
Yang Sun
Xiaolin DU
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to EP19896966.9A priority Critical patent/EP3877859A4/en
Priority to CN201980078344.5A priority patent/CN113168371A/en
Publication of WO2020119050A1 publication Critical patent/WO2020119050A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/52 - Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/526 - Mutual exclusion algorithms
    • G06F 9/528 - Mutual exclusion algorithms by using speculative mechanisms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 - Updating
    • G06F 16/2308 - Concurrency control
    • G06F 16/2315 - Optimistic concurrency control

Definitions

  • The present disclosure is related to data communications and, in particular, to storage of data in shared storage databases.
  • Some enterprise multi-master database systems can allow multiple database instances to access a shared storage to read from and write to the shared storage and to generate transaction write ahead log (WAL) records.
  • the multiple database instances can generate transaction write ahead log records so that when any of the database instances fails, the system is still available for read-write access through other instances.
  • WAL transaction write ahead log
  • A write-write conflict is a computational occurrence associated with the interleaved execution of transactions, for example when write operations from different write sources target the same data.
  • A write-write conflict can also be referred to as overwriting uncommitted data, where committing data is the act of making a set of tentative changes permanent.
  • OCC Optimistic concurrency control
  • OCC is a concurrency control method typically applied to transactional systems, such as relational database management systems and software transactional memory. Concurrency control methods ensure that correct results are generated for concurrent operations while attaining those results as quickly as possible. OCC assumes that multiple transactions can frequently complete without interfering with each other.
  • Embodiments are based on using a common log layer between a shared data storage and computing nodes in which write conflict detection is implemented using a write ahead log record and a log sequence number.
  • the detection of a write conflict results from a write conflict check.
  • The write conflict check (write conflict detection) may be referred to as a write-write conflict check (write-write conflict detection), since it is a check for conflicts between different write operations.
  • the common log layer can be arranged with other common log layers between storage and computing nodes in an architecture providing high availability.
  • the conflict check can be implemented as a page-level check or tuple-level conflict check.
  • the locks used by the conflict check are local in the common log, without the use of global locking.
  • The locking facility provided by the common log does not require any network communications related to lock acquisition or release.
  • an embodiment relates to a computer-implemented method of writing to data storage shared among multiple database engines, the computer-implemented method comprising: performing, in a common log using one or more processors, a write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write conflict check.
  • the write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log.
  • OCC optimistic concurrency control
  • conducting the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  • passing the write conflict check includes the log sequence number being greater than the global log sequence number.
  • the method comprises, in response to passing the write conflict check, updating the global log sequence number to equal the log sequence number.
  • the method comprises, in response to passing the write conflict check: inserting the write ahead log record into a group flush write ahead log buffer; and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
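  • As an illustration of the check described in the preceding items, the following minimal Python sketch models the pass/fail decision against a hash table keyed by tuple ID or page ID; the function name, argument layout, and entry format are assumptions made for illustration, not the claimed implementation.

```python
# Hedged sketch: pass the check when the LSN received with the WAL record is
# greater than the global LSN stored for the same tuple ID or page ID.
def write_conflict_check(hash_table, key, master_id, record_lsn):
    entry = hash_table.get(key)                   # entry form: (master ID, global LSN)
    if entry is not None and record_lsn <= entry[1]:
        return False                              # write-write conflict detected
    hash_table[key] = (master_id, record_lsn)     # update the global LSN to equal the record's LSN
    return True                                   # record may be sent to the shared data storage


table = {}
assert write_conflict_check(table, ("tuple", 42), master_id=1, record_lsn=7)
assert not write_conflict_check(table, ("tuple", 42), master_id=2, record_lsn=5)  # stale reader LSN
```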
  • the method comprises replicating the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  • the write ahead log record received from a database engine is extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the method comprises extracting another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the method comprises maintaining in a command log all operations and commands that modify an internal state of the common log.
  • an embodiment relates to a system comprising a memory storage comprising instructions; and one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to: perform, in a common log, a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and send the write ahead log record to a data storage shared among the multiple database engines, in response to the write ahead log record passing the write conflict check.
  • the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  • passing the write conflict check includes the log sequence number being greater than the global log sequence number.
  • the one or more processors in response to passing the write conflict check, update the global log sequence number to equal the log sequence number.
  • the one or more processors in response to passing the write conflict check, execute: an insertion of the write ahead log record into a group flush write ahead log buffer; and a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • the one or more processors execute replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  • the one or more processors extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the one or more processors extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the one or more processors maintain in a command log all operations and commands that modify an internal state of the common log.
  • system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the common log.
  • the computer-implemented method can be performed by the system. Further features of the computer-implemented method directly result from the functionality of the system.
  • Embodiments relate to a non-transitory computer-readable medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of any of the computer-implemented methods according to the first aspect or any of its implementation forms.
  • the computer-implemented methods can be performed in an automatic and repeatable manner.
  • the computer instructions stored by the non-transitory computer-readable medium can be performed by the system.
  • the system can be programmably-arranged to perform the computer instructions.
  • an embodiment relates to a system comprising a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in a common log, the means for performing the write conflict check including the common log and operationally arranged between the multiple database engines and a shared data storage, the shared data storage being shared among the multiple database engines; and a means for sending the write ahead log record to the shared data storage, in response to the write ahead log record passing the write conflict check.
  • the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  • passing the write conflict check includes the log sequence number being greater than the global log sequence number.
  • the means for performing the write conflict check in response to passing the write conflict check, updates the global log sequence number to equal the log sequence number.
  • the means for performing the write conflict check executes: an insertion of the write ahead log record into a group flush write ahead log buffer; and a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • the means for performing the write conflict check executes replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  • the means for performing the write conflict check extracts the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the means for performing the write conflict check extracts another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the means for performing the write conflict check maintains in the command log all operations and commands that modify an internal state of the common log.
  • system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the means for performing the write conflict check and the means for sending the write ahead log record to the shared data storage.
  • Embodiments of the invention can be implemented in hardware, software or in any combination thereof. Any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment within the scope of the present disclosure.
  • Figure 1 is a block diagram of an example system having an architecture with multiple master databases with computation-storage separation, according to an example embodiment.
  • Figure 2 depicts example operational components inside the common log of Figure 1 that can facilitate write-write conflict detection, according to an example embodiment.
  • Figure 3 is a flow diagram of an example cycle of a write-write conflict detection thread, according to an example embodiment.
  • Figures 4-10 illustrate challenges and specific solutions for write conflicts in a PostgreSQL system, which is an object-relational database management system, used as an example, according to an example embodiment.
  • Figure 11 is a flow diagram of features of an example method of writing to data storage shared among multiple database engines, according to an example embodiment.
  • Figure 12 is a block diagram illustrating circuitry for devices for implementing algorithms and performing methods of providing write-write conflict detection for multi-master shared storage database, according to an example embodiment.
  • the functions or algorithms described herein may be implemented in software in an embodiment.
  • the software may comprise computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked.
  • Such functions can correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
  • the software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
  • Computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media and specifically excludes signals.
  • the software can be installed in and sold with the devices that handle event streams as taught herein.
  • the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator.
  • the software can be stored on a server for distribution over the Internet, for example.
  • a system can be implemented with an architecture in which a common log is operated between master nodes and a data storage layer.
  • the master nodes can be realized as database nodes.
  • the common log can be arranged as a write-write detection layer.
  • a write-write detection can also be referred to as a write-write conflict check in which a determination is made as to whether a write transaction from one source conflicts with a related write transaction from another source.
  • the system can be a multi-master database system designed with an architecture in which database instances write log records to storage. Storage nodes can replay, that is, apply the log records to construct data pages.
  • master databases can flush log records to a common log, where the common log can conduct a page-level or tuple-level conflict check.
  • a database table can be divided into multiple pages. Each page can have a certain number of records.
  • a tuple is an individual record of a database table.
  • A table can be divided into smaller units such as pages, and each page can have multiple tuples. All locks used by the conflict check can be local in the common log, such that there is no global locking.
  • The common log persists the log records, and only log records that pass the conflict check are forwarded from the common log to the storage nodes.
  • This architecture can allow for a high availability (HA) common log layer between storage nodes and compute nodes, where the compute nodes can be databases.
  • HA high availability
  • This HA can be provided by having multiple common logs at a conflict detection layer between the storage nodes and compute nodes, with one of the multiple common logs acting as the primary common log and the other common logs being duplicates of the primary common log. This duplication defines the HA of the common log layer.
  • An architecture with the common log layer can be implemented with novel write-write conflict detection algorithms based on a WAL and log sequence numbers (LSNs) , which can be performed on each log record.
  • A log sequence number is an identification number given to a WAL record indicating its position in a sequence of records for a transaction.
  • the write-write detection can be a page-level conflict detection or a tuple-level conflict detection.
  • Such approaches to conflicts at a tuple-level or page-level can provide fine grained locks for good parallelism in that, in the common log, conflict checks run on multiple worker threads in parallel.
  • A thread is a sequence of instructions that may execute in parallel with other sequences of instructions.
  • the associated locking facility does not require any network communications. Additionally, read operations do not participate in write-write conflict detection, which boosts overall system throughput.
  • FIG. 1 is a block diagram of an embodiment of an example system 100 having an architecture with multiple master databases with computation-storage separation.
  • the system 100 can include a number of databases that communicate with a common log 110 to write data to a shared storage layer 120.
  • Each database engine can include one or more processors and storage devices having instructions stored thereon, which instructions are executable by the one or more processors to perform operations of the database of which it is a component.
  • Operation of each database can include storing data from a communication with a client of the respective database in the shared storage layer 120 through operation of the common log 110.
  • Each database represented by a database engine can be arranged as a database node that is a master, which receives structured query language (SQL) queries or update/insert/delete/create requests from users such as client devices to the databases.
  • Data from users to database nodes managed by database engines 105-1, 105-2, and 105-3 can be served through buffer pools 109-1, 109-2, 109-3 of database engines 105-1, 105-2, and 105-3, respectively, or served through persistent storage nodes if a buffer pool is unavailable or missing from a database.
  • SQL structured query language
  • When a transaction modifies a tuple, it creates a log record, and the log record of the modified tuple is flushed to the common log 110.
  • the common log 110 performs a conflict check on the log record and distributes, if the log record is without conflict, the log record to the shared storage layer 120, where a new version of data is created by performing a log apply operation.
  • Each database instance is a master in the architecture of the system 100.
  • Each of database engines 105-1, 105-2, and 105-3 can receive SQL queries from clients and can start transactions.
  • Each of database engines 105-1, 105-2, and 105-3 can flush WAL records to common log 110.
  • the control of loading data pages to the shared storage layer 120 from started transactions in the database engines 105-1, 105-2, and 105-3 is provided by the common log 110 in response to successful completion of conflict checks.
  • data pages from the shared storage layer 120 can be loaded by each of database engines 105-1, 105-2, and 105-3.
  • data pages can be loaded along path 132 from the shared storage layer by the database engine 105-1.
  • Appropriate data can be loaded to each of database engines 105-2 and 105-3 in a similar manner.
  • the common log 110 can include one or more processors and storage devices having instructions stored thereon, which instructions are executable by the one or more processors to perform operations of the common log 110.
  • the common log 110 can perform a number of functions. It can receive different WAL records from database engines 105-1, 105-2, and 105-3, operating as masters, on paths 106-1, 106-2, and 106-3, respectively. It can conduct write-write conflict detection and can send WAL records that pass their conflict check to the shared storage layer 120.
  • the common log 110 operates as a primary common log node in a HA arrangement. As the primary common log node, the common log 110 is a leader common log node that replicates WAL records that pass conflict check to follower common logs 115-1 and 115-2.
  • WAL records can be sent along a path 113-1 to the follower common log 115-1 and WAL records can be sent along a path 113-2 to the follower common log 115-2.
  • the common log 110 as the leader common log node of the system 100, can replicate its internal state to follower common log 115-1 and follower common log 115-2 by shipping a command log to these follower common logs.
  • the command log can be sent along a path 119-1 to the follower common log 115-1 and the command log can be sent along a path 119-2 to the follower common log 115-2.
  • the command log can keep all the operations, which can be identified as commands, that modify the internal state of the common log 110.
  • system 100 can have less than or more than two follower common logs in a common log layer between masters and shared storage layer.
  • the implementation of the common log 110 with follower common logs provides an architecture of a HA common log layer between storage and compute nodes.
  • the HA characteristic of such an architecture of a system increases with the number of follower common logs disposed in the common log layer of the system.
  • Upon failure of the leader common log, a follower common log in the HA common log layer becomes the leader common log.
  • the sequencing of which follower common log becomes the leader common log can be predefined or implemented at the time of failure of the leader common log using conventional techniques to elect a leader node among a set of peer nodes.
  • the common log 110 can include a global transaction manager (GTM) 112 residing in the common log 110.
  • the GTM 112 can generate transaction identifications (IDs) . These transaction IDs can be generated in ascending order.
  • The GTM 112 can maintain a snapshot of active transactions and provide it to a database engine, for example along path 133 to database engine 105-1. The snapshot can be implemented as a list of active transactions.
  • the follower common logs 115-1 and 115-2 include a GTM 117-1 and GTM 117-2, respectively, that can perform the functions of GTM 112.
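  • As a minimal sketch of such a GTM, assuming only that it issues transaction IDs in ascending order and can return a snapshot implemented as a list of active transactions, the following Python class (with hypothetical names) illustrates the idea.

```python
import itertools
import threading

class GlobalTransactionManager:
    """Toy GTM sketch: ascending transaction IDs plus a snapshot of active transactions."""

    def __init__(self):
        self._ids = itertools.count(1)      # transaction IDs generated in ascending order
        self._active = set()
        self._lock = threading.Lock()

    def begin(self):
        with self._lock:
            txid = next(self._ids)
            self._active.add(txid)
            return txid

    def end(self, txid):
        with self._lock:
            self._active.discard(txid)      # commit or abort removes the transaction

    def snapshot(self):
        with self._lock:
            return sorted(self._active)     # the snapshot as a list of active transactions
```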
  • the shared storage layer 120 can be implemented as a shared storage server having storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5.
  • The shared storage server can include one or more processors that control operation of the shared storage server by executing instructions stored in the shared storage layer 120. Though five storage nodes are shown, the shared storage layer 120 can include less than or more than five storage nodes. Each storage node can include one or more storage devices.
  • the storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 can be structured as distributed storage nodes.
  • the shared storage layer 120 can receive different WAL records from the common log 110 along a number of paths such as paths 123-1, 123-2, and 123-3.
  • Figure 2 depicts an embodiment of example operational components inside the common log 110 of Figure 1 that can facilitate write-write conflict detection. Similar or identical operational components are disposed in follower common logs 115-1 and 115-2 to perform operations as a common log leader when a selected follower common log is changed from a follower common log to a leader common log.
  • the operational components can be implemented with storage devices controlled by the one or more processors of the common log 110.
  • these operational components residing in the common log 110 can be implemented as buffers 252 and 254 for write-write conflict detection threads, a hash table 255, a group flush WAL buffer (GFWB) 256, and a persistent log 258.
  • the common log 110 is shown receiving a batch of write transaction log records from only two masters 105-1 and 105-2.
  • the shown components can be expanded to handle communication of common log 110 with more than two masters.
  • the write-write conflict detection threads in buffers 252 and 254 can be dedicated worker threads that execute write-write conflict detection on buffers of transaction log records in parallel.
  • the dedicated threads in buffers 252 and 254 can run conflict checks on batches of WAL records concurrently.
  • W_ij(x) represents a write transaction log record, where W_ij stands for the j-th write operation (write transaction log) record from the i-th transaction (T_i).
  • the parameter x is either a tuple ID or a page ID depending on whether the conflict detection is a tuple level or a page level conflict detection.
  • reader LSN represents the latest global LSN known by a transaction.
  • Data for thread 1 includes a reader LSN equal to seven for transactions 1 and 2, for a batch of write transaction records with commit transactions C_2 and C_1.
  • the commit transactions are from the perspective of the master, which does not perceive any inconsistencies with performing the given commit transaction.
  • The data is received in time order, starting on the left and proceeding to the right.
  • the commit for transaction 2 occurs before the commit for transaction 1, though the write transaction log records for transaction 1 are received before the records for transaction 2.
  • Data for thread 2 includes a reader LSN equal to seven for transactions 3 and 4 for a batch of write transaction records with commit transactions C_3 and C_4. Different from the data for thread 1, the commit for transaction 3 occurs before the commit for transaction 4, and the write transaction log records for transaction 3 are received before the records for transaction 4.
  • the hash table 255 is a write-write conflict detection hash table 255 having entries for a tuple ID or a page ID 261, a master ID and LSN 262, and a bucket (also referred to as a slot) number 263.
  • A key for the write-write conflict detection hash table 255 is a tuple ID for tuple-level conflict detection; for page-level conflict detection, the key is a page ID.
  • The value for the key has the form {master ID, LSN}, which includes the latest global LSN of the WAL record that modifies the tuple or page; the master ID is the ID of the master that sent the WAL record.
  • Once a log record has passed the conflict check in the common log 110, it is assigned a new global LSN.
  • The LSN in {master ID, LSN} represents this global LSN, and the master ID identifies the master that sent the log record to the common log 110.
  • the global LSN is only generated for a WAL record that passes the conflict check.
  • The master IDs can be the integers 1 ... N, with each integer assigned to a different one of the N masters (for example, the integers 1 through 10 for ten masters).
  • The write-write conflict detection hash table 255 has a fixed number of buckets, where each bucket has its own lock. Bucket locks are either tuple locks or page locks, depending on whether the conflict check is tuple-level or page-level.
  • In the common log 110, when a WAL record passes the conflict check, the WAL record is inserted into the GFWB 256. Once a WAL record enters the GFWB 256, it is assigned a new global LSN. When the GFWB 256 is full or a timer goes off, all the log records in the GFWB 256 are flushed to the persistent log 258 of the common log 110.
  • the persistent log 258 can be implemented as a disk. All log records that pass conflict check are eventually flushed to the persistent log 258 and are also sent to the storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 of the shared storage layer 120 of Figure 1. Use of the components of the common log 110 is illustrated in Figure 3.
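  • The in-memory structures just described (a conflict detection hash table with a fixed number of buckets, each with its own lock, a group flush WAL buffer, and a persistent log) can be sketched in Python as follows; the bucket count, flush threshold, and all names are assumptions for illustration rather than the disclosed implementation.

```python
import threading

class CommonLogState:
    """Sketch of the common log's internal structures, with illustrative defaults."""

    def __init__(self, num_buckets=1024, flush_limit=256):
        self.num_buckets = num_buckets
        self.bucket_locks = [threading.Lock() for _ in range(num_buckets)]
        self.buckets = [dict() for _ in range(num_buckets)]  # key -> (master ID, global LSN)
        self.gfwb = []                                       # group flush WAL buffer
        self.flush_limit = flush_limit
        self.persistent_log = []                             # stands in for the on-disk log

    def bucket_for(self, key):
        # The bucket lock is derived from the bucket index of the tuple ID or page ID.
        index = hash(key) % self.num_buckets
        return self.bucket_locks[index], self.buckets[index]

    def flush_if_needed(self):
        # Flush all records in the GFWB to the persistent log when the buffer is
        # full; a timer can trigger the same path.
        if len(self.gfwb) >= self.flush_limit:
            self.persistent_log.extend(self.gfwb)
            self.gfwb.clear()
```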
  • FIG 3 is a flow diagram 300 of an embodiment of an example cycle of a write-write conflict detection thread.
  • the cycle can be implemented in a common log, such as the common log 110 of Figures 1 and 2, using one or more processors to execute instructions stored in a memory storage, for example at a common log layer of a system.
  • A batch of WAL records is received at the common log.
  • the WAL records of the write-write conflict detection thread 1 can be received in a buffer, such as buffer 252.
  • Buffer 252 can include many log records and is not limited to four log records.
  • A WAL record W_ij(x) is extracted from the buffer.
  • The extracted W_ij(x) is the next WAL record in the buffer to be conflict checked.
  • A transaction ID is extracted from W_ij(x).
  • A determination is made as to whether W_ij(x) is a write operation record, an abort record, or a commit record.
  • If W_ij(x) is an abort record, the transaction ID is aborted and a goto 310 operation is performed to extract the next WAL record.
  • If W_ij(x) is a commit record and the transaction ID is marked with "has conflict" or another identifier indicating the occurrence of a conflict, the transaction ID is aborted and a goto 310 operation is performed to extract the next WAL record.
  • If W_ij(x) is a commit record without such a conflict mark, the transaction ID is committed and the procedure goes to 310 to extract the next WAL record.
  • If W_ij(x) is a write operation record, a page ID or a tuple ID is extracted from W_ij(x), and the reader LSN is extracted from W_ij(x).
  • A lookup in a hash table, such as the write-write conflict detection hash table 255 of Figure 2, is performed using the extracted page ID or tuple ID, and the entry having this page ID or tuple ID is obtained.
  • If the reader LSN is not greater than the LSN of the obtained entry, the transaction ID is marked with "has conflict" or an equivalent identifier, and the procedure goes to 310 to extract the next WAL record.
  • This condition arises because the received W_ij(x) identifies the latest global LSN known by the transaction, represented by the reader LSN, as being less than the current global LSN, indicating that other write operations have occurred such that the data is not in the expected state.
  • Otherwise, the W_ij(x) is inserted into the GFWB and assigned the first LSN following the WAL records currently in the GFWB.
  • The LSN of the hash table entry is updated to the LSN assigned on insertion of the W_ij(x) into the GFWB.
  • The master ID of the hash table entry is updated to equal the master ID of the W_ij(x).
  • The GFWB is then flushed when full or when a timer reaches a set time. Internal state changes are persisted in a command log that is also shipped to the common log replicas, such as follower common logs 115-1 and 115-2 of Figure 1.
  • Each write-write conflict detection thread can run the above conflict detection algorithm continuously on batches of WAL records.
  • Each thread can access the write-write conflict detection hash table of a common log using bucket level locks (close to tuple level locks or page level locks depending on whether the write-write conflict detection is performed on a page level or a tuple level) .
  • the write-write conflict detection hash table can be pre-allocated with a fixed number of buckets.
  • Each bucket can have its own lock, where the lock can be derived from the index of the bucket.
  • all log records that pass conflict check can be propagated to follower common logs.
  • the follower common logs can be referred to as standby common logs.
  • the state of the write-write conflict detection hash table and the GFWB can be replicated to the standby common logs by shipping a command log to each standby common log.
  • The command log stores all the operations/commands that modify the write-write conflict detection hash table and the GFWB.
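  • A toy sketch of this replication path, assuming each state-modifying operation is recorded as a command entry and replayed by the standby common logs, is shown below; the entry format and class names are purely illustrative.

```python
class StandbyCommonLog:
    """Replica that replays shipped command-log entries to mirror the leader's state."""

    def __init__(self):
        self.replayed = []

    def apply(self, entry):
        self.replayed.append(entry)       # a real standby would re-execute the command

class CommandLog:
    """Records every operation that modifies the hash table or the GFWB and ships it."""

    def __init__(self, followers):
        self.entries = []
        self.followers = followers        # standby (follower) common logs

    def record(self, command, **args):
        entry = {"command": command, "args": args}
        self.entries.append(entry)        # persist the internal state change
        for follower in self.followers:
            follower.apply(entry)         # replicate to each standby common log
```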
  • PostgreSQL is an object-relational database management system (ORDBMS) , where an ORDBMS is a database management system (DBMS) similar to a relational database, but with an object-oriented database model in which objects, classes and inheritance are directly supported in database schemas and in the query language.
  • ORDBMS object-relational database management system
  • DBMS database management system
  • PostgreSQL is open-source and compliant with atomicity, consistency, isolation, durability (ACID) principles, which are a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc.
  • PostgreSQL manages concurrency through a system known as multi-version concurrency control (MVCC) , which gives each transaction a snapshot of the database, allowing changes to be made without being visible to other transactions until the changes are committed.
  • MVCC multi-version concurrency control
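  • As a toy illustration of MVCC-style visibility, assuming each row version records the transaction that created it and a snapshot lists the transactions that were in progress when it was taken, the following greatly simplified Python function (with hypothetical names) conveys the idea; it is not PostgreSQL's actual visibility logic.

```python
def version_visible(creator_txid, committed_txids, snapshot_active, my_txid):
    """A row version is visible to a transaction if that transaction created it,
    or it was created by a committed transaction that was not still active when
    the snapshot was taken."""
    if creator_txid == my_txid:
        return True
    return creator_txid in committed_txids and creator_txid not in snapshot_active
```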
  • A data file can be referred to as a heap, where the heap can have associated with it a heap tuple (HTUP), a page having a page number (PAGE_NO), and a line pointer (LP) with a line pointer number (LP_NO).
  • HTUP heap tuple
  • LP line pointer
  • An index is a specific structure that organizes a reference to the data that makes it easier to look up.
  • an index can be a copy of the item to be indexed combined with a reference to the actual data location.
  • ITUP index tuple
  • FIG. 4 is an example of log records for inserts that lead to conflict. If a common log receives the log records from master 405-1 and master 405-2, only those from one master can win. This can lead to a large number of insert conflicts.
  • Master 405-1 submits a transaction log (XLOG) record 1 to insert a HTUP, which is labelled HTUP1.
  • the XLOG record 1 includes the content of HTUP1 with the information that the PAGE_NO is equal to HX and the LP_NO is equal to one.
  • Master 405-1 also submits a XLOG record 2 to insert an ITUP, which is labelled ITUP1.
  • The XLOG record 2 includes the content of ITUP1 with the information that the PAGE_NO for XLOG record 2 is equal to IY and the LP_NO is equal to one.
  • Master 405-2 submits a transaction log (XLOG) record 1 to insert a HTUP, which is labelled HTUP2.
  • the XLOG record 1 includes the content of HTUP2 with the information that the PAGE_NO is equal to HX and the LP_NO is equal to one.
  • Master 405-2 also submits a XLOG record 2 to insert an ITUP, which is labelled ITUP2.
  • The XLOG record 2 includes the content of ITUP2 with the information that the PAGE_NO for XLOG record 2 is equal to IY and the LP_NO is equal to one.
  • A master checks to determine whether the LP_NO can be set equal to its own master ID.
  • Each master can simply pick the next unused LP, but the LP_NO should not appear in the log record. This way, when the log record is applied on the storage node, the overall next LP can be determined, where the LPs of index tuples are ordered according to the order of index keys.
  • Figure 5 illustrates the use of these principles of operation of a common log to provide modified log records for insertions that eliminate conflict associated with Figure 4.
  • The associated common log for these two masters can let both masters commit, where the different LP_NOs are not conflicting.
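  • A minimal sketch of this apply-time line pointer assignment, assuming the storage node holds a page as a mapping from LP_NO to tuple and assigns the overall next LP when it applies an insert log record carrying no LP_NO, might look like the following; the page layout and function name are assumptions.

```python
def apply_insert(page, new_tuple, index_key=None):
    """page: dict with 'tuples' mapping LP_NO -> tuple and, for index pages,
    'key_order' holding (index key, LP_NO) pairs kept in index-key order."""
    next_lp = max(page["tuples"], default=0) + 1   # overall next unused line pointer
    page["tuples"][next_lp] = new_tuple
    if index_key is not None:
        page.setdefault("key_order", []).append((index_key, next_lp))
        page["key_order"].sort()                   # index tuples ordered by index keys
    return next_lp


page = {"tuples": {}}
lp1 = apply_insert(page, "HTUP1")   # from one master's log record (no LP_NO in the record)
lp2 = apply_insert(page, "HTUP2")   # from another master's log record
assert (lp1, lp2) == (1, 2)         # distinct line pointers, so the inserts do not collide
```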
  • Inserts from different writers can also conflict with each other when a heap or an index (referred to below as HI) page is close to being full.
  • Figure 6 illustrates an example of inserts with respect to a full page.
  • HITUP heap or index tuple
  • In the example of Figure 6, a master submits a log record to insert a heap or index tuple, HITUP7, to a page with PAGE_NO equal to X, the log record including the content of the tuple.
  • Each heap and index relation can have a free space map (FSM) to keep track of available space in the relation.
  • The common log also checks whether the increase will cause page size overflow. If so, the common log aborts the transaction. An update adding a new version of a tuple is handled similarly.
  • Inserts can cause index pages to split. When one writer splits an index page while another writer inserts index tuple to the same page, either the split or the insert should abort.
  • Figure 7 is an example of index page splitting.
  • master 705-1 inserts ITUP5 making index page IX full. Master 705-1 then tries to insert ITUP6 to page IX. Page IX splits, where the first half of ITUPs stay at page IX, and the second half go to a new page INEW. Both ITUP5 and ITUP6 go to INEW because their index keys are larger.
  • Master 705-2 inserts ITUP7 on page IX supposing ITUP7 has the largest key.
  • Figure 8 roughly illustrates the log records generated by the two masters of Figure 7 from the index page splitting.
  • The master 705-1 generates an XLOG record 1 about the index tuple ITUP6 and the page IX split, which includes the LP_NO of the splitting point, the new PAGE_NO for the new page, and the content of the new page containing ITUP5 and ITUP6.
  • the common log maintains PAGE_NO (s) of recently split index pages and their latest commit LSNs in a write-write conflict detection hash table of the common log.
  • If the common log finds that PAGE_NO IX was just split and the split commit LSN is later than the reader LSN of the log record of master 705-2, it aborts the change of master 705-2.
  • If instead the common log commits the change of master 705-2 first and then receives the log record of master 705-1 about the split of page IX, the reader LSN of the split is earlier than the commit LSN of the insertion by master 705-2 on page IX, and the common log aborts the split of page IX.
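  • The two split-versus-insert outcomes above reduce to LSN comparisons against the entries the common log keeps for recently split pages; the following hedged sketch illustrates both directions, with function names, arguments, and return values chosen only for illustration.

```python
def check_insert_after_split(recent_split_lsns, page_no, insert_reader_lsn):
    """Abort an index-tuple insert whose reader LSN predates the page's split commit LSN."""
    split_commit_lsn = recent_split_lsns.get(page_no)
    if split_commit_lsn is not None and split_commit_lsn > insert_reader_lsn:
        return "abort insert"
    return "ok"

def check_split_after_insert(page_commit_lsns, page_no, split_reader_lsn):
    """Abort a page split whose reader LSN predates a committed insert on the same page."""
    insert_commit_lsn = page_commit_lsns.get(page_no)
    if insert_commit_lsn is not None and split_reader_lsn < insert_commit_lsn:
        return "abort split"
    return "ok"
```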
  • Figure 9 illustrates handling an update-update conflict. Handling updates can be relatively straightforward.
  • Figure 10 illustrates handling an update-delete conflict.
  • Figure 11 is a flow diagram of features of an embodiment of an example method 1100 of writing to data storage shared among multiple database engines.
  • the method 1100 can be implemented as a computer-implemented method.
  • a write-write conflict check is performed on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log.
  • Conducting the comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, where the master identification can be an identification of a database engine of the multiple database engines.
  • the write ahead log record is sent to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check.
  • Passing the write conflict check can include the log sequence number being greater than the global log sequence number.
  • method 1100 or a method similar to method 1100 can include updating the global log sequence number to equal the log sequence number.
  • method 1100 or a method similar to method 1100 can include inserting the write ahead log record into a group flush write ahead log buffer and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • Variations of the method 1100 or methods similar to the method 1100 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of systems in which such methods are implemented.
  • Such methods can comprise replicating the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  • Such methods can comprise maintaining in a command log all operations and commands that modify an internal state of the common log.
  • the write ahead log record received from a database engine can be extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • Another write ahead log record can be extracted from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • In various embodiments, instructions can be stored on a non-transitory machine-readable storage device, such as computer-readable non-transitory media.
  • the physical structures of such instructions may be operated on by one or more processors.
  • executing these physical structures can cause the machine to perform operations comprising: performing, in a common log using the one or more processors, a write-write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check.
  • Conducting the comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  • Using the one or more processors to execute the instructions stored on the machine-readable storage device can comprise operations in which passing the write-write conflict check can include the log sequence number being greater than the global log sequence number.
  • the executable operations can include updating the global log sequence number to equal the log sequence number.
  • the executable operations can include inserting the write ahead log record into a group flush write ahead log buffer and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • the operations can include the write ahead log record received from a database engine being extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the operations can include extracting another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the operations can include replicating the write ahead log record to one or more follower common logs, where the one or more follower common logs is structured as back-ups to the common log.
  • the operations can include maintaining in a command log all operations and commands that modify an internal state of the common log.
  • Figure 12 is a block diagram illustrating circuitry for devices for implementing algorithms and performing methods of providing write-write conflict detection for multi-master shared storage database, according to the teachings herein.
  • Figure 12 depicts a device 1200 having a non-transitory memory storage 1201 storing instructions, a cache 1207, and a processing unit 1202, coupled to a bus 1220.
  • Processing unit 1202 can include one or more processors operatively in communication with non-transitory memory storage 1201 and cache 1207.
  • the one or more processors can be structured to execute the instructions to operate device 1200 as a database engine, a common log, or a shared data storage according to any of the methods taught herein.
  • Device 1200 can include a communication interface 1216 operable to communicate among devices and systems associated with an architecture such as the architecture of Figure 1.
  • One or more of multiple databases, common logs, and shared data storage can be implemented in a cloud to which device 1200 can be associated.
  • the term “cloud” refers to data processing on a number of virtual servers as opposed to directly on physical machines.
  • a cloud could span across a wide area network (WAN) .
  • WAN wide area network
  • A WAN is also commonly referred to as the public Internet, or sometimes as a network of leased optical fiber links that interconnect multiple branch offices of an enterprise.
  • a cloud could be entirely resident within a private datacenter within an internal local area network.
  • Datacenters of clouds can also provide services for network traffic management from one location on a network to another location on the network or across networks spanning far flung locations over WAN (or Internet) .
  • the term "cloud computing" refers to the software and services executed for users by these servers virtually (over a hypervisor) , and typically the user is unaware of the physical location of the servers or datacenter.
  • the datacenter may be a distributed entity. Cloud computing can provide shared computer processing resources and data to computers and other devices on demand over the associated networks.
  • the communication interface 1216 may be part of a data bus that can be used to receive the data traffic for processing.
  • Non-transitory memory storage 1201 may be realized as machine-readable media, such as computer-readable media, and may include volatile memory 1214 or non-volatile memory 1208.
  • Device 1200 may include or have access to a computing environment that includes a variety of machine-readable media including as computer-readable media, such as volatile memory 1214, non-volatile memory 1208, removable storage 1211, or non-removable storage 1222.
  • Such machine-readable media may be used with instructions in one or more programs 1218 executed by device 1200.
  • Cache 1207 may be realized as a separate memory component or part of one or more of volatile memory 1214, non-volatile memory 1208, removable storage 1211, or non-removable storage 1222.
  • Memory storage can include random access memory (RAM) , read only memory (ROM) , erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) , flash memory or other memory technologies, compact disc read-only memory (CD ROM) , Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
  • RAM random access memory
  • ROM read only memory
  • EPROM erasable programmable read-only memory
  • EEPROM electrically erasable programmable read-only memory
  • CD ROM compact disc read-only memory
  • DVD Digital Versatile Disk
  • Device 1200 may include or have access to a computing environment that includes input interface 1226 and output interface 1224.
  • Output interface 1224 may include a display device, such as a touchscreen, that also may serve as an input device.
  • Input interface 1226 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to device 1200, and other input devices.
  • Device 1200 may operate in a networked environment using a communication connection to connect to one or more other devices that are remote.
  • Such remote devices may be identical or similar to device 1200 or may be different types of devices having features similar or identical to features of device 1200 or other features, as taught herein, to handle procedures associated with providing write-write conflict detection for multi-master shared storage database, according to the teachings herein.
  • the remote devices may include computers, such as database servers. Such remote computers may include a personal computer (PC) , server, router, network PC, a peer device or other common network node, or the like.
  • the communication connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , cellular, WiFi, Bluetooth, or other networks.
  • LAN Local Area Network
  • WAN Wide Area Network
  • WiFi Wireless Fidelity
  • Machine-readable instructions such as computer-readable instructions stored on a computer-readable medium, are executable by the processing unit 1202 of the device 1200.
  • a hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device.
  • the terms machine-readable medium, computer-readable medium, and storage device do not include carrier waves to the extent carrier waves are deemed transitory.
  • Storage can also include networked storage such as a storage area network (SAN) .
  • SAN storage area network
  • Device 1200 can be realized as a computing device that may be in different forms in different embodiments, as part of a network such as an SDN/IoT network.
  • device 1200 may be a smartphone, a tablet, smartwatch, other computing device, or other types of devices having wireless communication capabilities, where such devices include components to engage in the distribution and storage of items of content, as taught herein.
  • Devices such as smartphones, tablets, smartwatches, and other types of device having wireless communication capabilities, are generally collectively referred to as mobile devices or user equipment.
  • some of these devices may be considered as systems for the functions and/or applications for which they are implemented.
  • Although the various data storage elements are illustrated as part of the device 1200, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet, or server-based storage.
  • the device 1200 includes a conflict check module performing, in a common log using one or more processors, a write-write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log, and a transmission module sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check.
  • the device 1200 may include other or additional modules for performing any one of or combination of steps described in the embodiments. Further, any of the additional or alternative embodiments or aspects of the method, as shown in any of the figures or recited in any of the claims, are also contemplated to include similar modules.
  • Machine-readable storage devices, such as computer-readable non-transitory media, herein, are physical devices that store data represented by physical structure within the respective device. Such a physical device is a non-transitory device. Examples of machine-readable storage devices can include, but are not limited to, read only memory (ROM), random access memory (RAM), a magnetic disk storage device, an optical storage device, a flash memory, or other electronic, magnetic, and/or optical memory devices.
  • the machine-readable device may be a machine-readable medium such as memory 1201 of Figure 12.
  • Such structures can be realized as centralized database (s) , distributed database (s) , associated caches, and servers; one or more storage devices, such as storage drives (including but not limited to electronic, magnetic, and optical drives and storage mechanisms) , and one or more instances of memory devices or modules (whether main memory; cache storage, either internal or external to a processor; or buffers) .
  • The term "non-transitory" used in reference to a "machine-readable device," "medium," "storage medium," "device," or "storage device" expressly includes all forms of storage drives (optical, magnetic, electrical, etc.) and all forms of memory devices (e.g., DRAM, flash (of all storage designs), SRAM, MRAM, phase change, etc.), as well as all other structures designed to store data of any type for later retrieval.
  • a system can be implemented to enable write-write conflict detection for multi-master shared storage database.
  • a system can comprise a memory storage comprising instructions and one or more processors in communication with the memory storage.
  • the one or more processors can execute the instructions to: perform, in a common log, a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and send the write ahead log record to a data storage shared among the multiple database engines, in response to the write ahead log record passing the write conflict check.
  • the comparison can comprise using a tuple identification or a page identification as a key in the hash table, where the key is associated with an entry in a form of a master identification and a global log sequence number value, with the master identification being an identification of a database engine of the multiple database engines.
  • Passing the write conflict check can include the log sequence number being greater than the global log sequence number.
  • the one or more processors, in response to passing the write conflict check, can update the global log sequence number to equal the log sequence number.
  • the one or more processors, in response to passing the write conflict check, can execute an insertion of the write ahead log record into a group flush write ahead log buffer, and execute a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • Such systems can include the one or more processors configured to execute instructions to execute replication of the write ahead log record to one or more follower common logs, with the one or more follower common logs structured as back-ups to the common log.
  • the one or more processors can extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the one or more processors can extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the one or more processors can maintain in a command log all operations and commands that modify an internal state of the common log.
  • Such a system or similar systems can include the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the common log.
  • the multiple database engines can be structural units separate from the common logs and the shared data storage.
  • the multiple database engines can communicate with a leader common log of the common logs and with the shared data storage to transfer data using conventional communication techniques such as, but not limited to, transmission control protocol (TCP) and internet protocol (IP).
  • Such a system or similar systems can be structured according to any permutation of the features taught herein for write-write conflict detection for multi-master shared storage database.
  • a system can be implemented to enable write-write conflict detection for multi-master shared storage database.
  • Such a system can comprise a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in a common log, the means for performing the write conflict check including the common log and operationally arranged between the multiple database engines and a shared data storage, the shared data storage being shared among the multiple database engines; and a means for sending the write ahead log record to the shared data storage, in response to the write ahead log record passing the write conflict check.
  • the comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  • Variations of such a system or similar systems having a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, can include a number of different embodiments that may be combined depending on the application of such systems and/or the architecture in which such systems are implemented.
  • Passing the write conflict check can include the log sequence number being greater than the global log sequence number.
  • the means for performing the write conflict check, in response to passing the write conflict check, can update the global log sequence number to equal the log sequence number.
  • the means for performing the write conflict check, in response to passing the write conflict check, can execute an insertion of the write ahead log record into a group flush write ahead log buffer, and execute a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  • Such a system or similar systems can include the means for performing the write conflict check structured to execute replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  • the means for performing the write conflict check can extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  • the means for performing the write conflict check can extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  • the means for performing the write conflict check can maintain in the command log all operations and commands that modify an internal state of the common log.
  • a system or similar systems having a means for performing a write conflict check on a write ahead log record, can include the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the means for performing the write conflict check and the means for sending the write ahead log record to the shared data storage.
  • Such a system or similar systems can be structured according to any permutation of the features taught herein for write-write conflict detection for multi-master shared storage database.

Abstract

Systems and methods are provided to produce an efficient architecture and methodology for multiple database engines to write to a shared data storage to eliminate global locking in the write process. Such systems and methods can use a common log layer between a shared data storage and computing database nodes in which write conflict detection is implemented using a write ahead log record and a log sequence number. A write-write conflict check on a write ahead log record received from a database engine of multiple database engines can be performed by conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log. The write ahead log record can be sent to the shared data storage in response to the write ahead log record passing the write-write conflict check.

Description

WRITE-WRITE CONFLICT DETECTION FOR MULTI-MASTER SHARED STORAGE DATABASE
This application claims priority to and benefit of U.S. Provisional Application No. 62/777,972, filed on December 11, 2018, entitled “Write-Write Conflict Detection for Multi-Master Shared Storage Database, ” which application is hereby incorporated by reference.
Field of the Invention
The present disclosure is related to data communications and, in particular, storage of data in shared stored databases.
Background
Some enterprise multi-master database systems can allow multiple database instances to access a shared storage to read from and write to the shared storage and to generate transaction write ahead log (WAL) records. The multiple database instances can generate transaction write ahead log records so that when any of the database instances fails, the system is still available for read-write access through other instances. When multiple database instances conduct read-write access to shared data, conflicts that arise need to be detected and resolved to maintain data consistency. For example, with respect to database technology, a write–write conflict is a computational occurrence associated with interleaved execution of transactions. Such an interleaved execution can be from a write operation to the same data from different write sources. A write–write conflict, which can be referred to as overwriting uncommitted data, where the act of committing data is making a set of tentative changes permanent.
Centralized or distributed global locking facilities are often adopted to avoid conflicting access to the shared data. Acquire lock and release lock operations require communication with a global lock manager. These lock acquire and release operations are on a critical path of transaction execution, increasing the latency and producing low throughput.
Summary
It is an object of various embodiments to provide an efficient architecture and methodology for multiple database engines to write to shared data storage to eliminate global locking in a write process of multiple masters to the shared data storage and to bring the benefit of optimistic concurrency control to multi-master shared storage database systems. Optimistic concurrency control (OCC) is a concurrency control method typically applied to transactional systems, such as relational database management systems and software transactional memory, where concurrency methods are methods to make certain that correct results are generated for concurrent operations, while attaining the results as fast as possible. OCC assumes that multiple transactions can frequently complete without interfering with each other. In OCC, while running operations, data resources are used in transactions without acquiring locks on the resources, and before committing, each transaction can be verified such that no other transaction has modified the data it has read. This object is achieved by the features of the independent claims. Further embodiments of the invention are apparent from the dependent claims, the description and the figures.
Embodiments are based on using a common log layer between a shared data storage and computing nodes in which write conflict detection is implemented using a write ahead log record and a log sequence number. The detection of a write conflict results from a write conflict check. The write conflict check (write conflict detection) may be referred to as a write-write conflict check (write-write conflict detection) since it is a check of conflict from different write operations. The common log layer can be arranged with other common log layers between storage and computing nodes in an architecture providing high availability. The conflict check can be implemented as a page-level conflict check or a tuple-level conflict check. The locks used by the conflict check are local in the common log, without the use of global locking. The locking facility provided by the common log does not require any network communications related to lock acquire or release.
According to a first aspect, an embodiment relates to a computer-implemented method of writing to data storage shared among multiple database engines, the computer-implemented method comprising: performing, in a  common log using one or more processors, a write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write conflict check. In this manner, elimination of network communication related to lock acquire and release can be realized.
This approach removes main bottlenecks from global locking-based conflict resolution. It in essence brings the benefit of optimistic concurrency control (OCC) to a multi-master shared data system. In the manner of OCC, each master runs transactions on data in its local buffer pool without waiting for locks being held by transactions in other masters. At a group commit time, the master can flush log records to the common log, which can perform validation. The term flush to an entity refers to storing to the entity.
In a first implementation form of the computer-implemented method according to the first aspect as such, conducting the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
In a second implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, passing the write conflict check includes the log sequence number being greater than the global log sequence number.
In a third implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the method comprises, in response to passing the write conflict check, updating the global log sequence number to equal the log sequence number.
In a fourth implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the method comprises, in response to passing the write conflict  check: inserting the write ahead log record into a group flush write ahead log buffer; and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
In a fifth implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the method comprises replicating the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
In a sixth implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the write ahead log record received from a database engine is extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
In a seventh implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the method comprises extracting another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
In an eighth implementation form of the computer-implemented method according to the first aspect as such or any preceding implementation form of the first aspect, the method comprises maintaining in a command log all operations and commands that modify an internal state of the common log.
According to a second aspect, an embodiment relates to a system comprising a memory storage comprising instructions; and one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to: perform, in a common log, a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and send the write ahead log record to a data storage shared among the multiple  database engines, in response to the write ahead log record passing the write conflict check. In this system, elimination of network communication related to lock acquire/release can be realized.
In a first implementation form of the system according to the second aspect as such, the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines
In a second implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, passing the write conflict check includes the log sequence number being greater than the global log sequence number.
In a third implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors, in response to passing the write conflict check, update the global log sequence number to equal the log sequence number.
In a fourth implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors, in response to passing the write conflict check, execute: an insertion of the write ahead log record into a group flush write ahead log buffer; and a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
In a fifth implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors execute replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
In a sixth implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
In a seventh implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
In an eighth implementation form of the system according to the second aspect as such or any preceding implementation form of the second aspect, the one or more processors maintain in a command log all operations and commands that modify an internal state of the common log.
In a ninth implementation form of the system according to the eighth implementation form of the second aspect, the system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the common log.
The computer-implemented method can be performed by the system. Further features of the computer-implemented method directly result from the functionality of the system.
The explanations provided for the first aspect and its implementation forms apply equally to the second aspect and the corresponding implementation forms.
According to a third aspect, embodiments relate to a non-transitory computer-readable medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of any of the computer-implemented methods according to the first aspect or any of its implementation forms. Thus, the computer-implemented methods can be performed in an automatic and repeatable manner.
The computer instructions stored by the non-transitory computer-readable medium can be performed by the system. The system can be programmably-arranged to perform the computer instructions.
According to a fourth aspect, an embodiment relates to a system comprising a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global  log sequence number in a hash table in a common log, the means for performing the write conflict check including the common log and operationally arranged between the multiple database engines and a shared data storage, the shared data storage being shared among the multiple database engines; and a means for sending the write ahead log record to the shared data storage, in response to the write ahead log record passing the write conflict check.
In a first implementation form of the system according to the fourth aspect as such, the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines
In a second implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, passing the write conflict check includes the log sequence number being greater than the global log sequence number.
In a third implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the means for performing the write conflict check, in response to passing the write conflict check, updates the global log sequence number to equal the log sequence number.
In a fourth implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the means for performing the write conflict check, in response to passing the write conflict check, executes: an insertion of the write ahead log record into a group flush write ahead log buffer; and a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
In a fifth implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the means for performing the write conflict check executes replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
In a sixth implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the  means for performing the write conflict check extracts the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
In a seventh implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the means for performing the write conflict check extracts another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
In an eighth implementation form of the system according to the fourth aspect as such or any preceding implementation form of the fourth aspect, the means for performing the write conflict check maintains in the command log all operations and commands that modify an internal state of the common log.
In a ninth implementation form of the system according to the eighth implementation form of the fourth aspect, the system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the means for performing the write conflict check and the means for sending the write ahead log record to the shared data storage.
The explanations provided for the first aspect and its implementation forms apply equally to the fourth aspect and the corresponding implementation forms.
Embodiments of the invention can be implemented in hardware, software or in any combination thereof. Any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment within the scope of the present disclosure.
Brief Description of the Drawings
Figure 1 is a block diagram of an example system having an architecture with multiple master databases with computation-storage separation, according to an example embodiment.
Figure 2 depicts example operational components inside the common log of Figure 1 that can facilitate write-write conflict detection, according to an example embodiment.
Figure 3 is a flow diagram of an example cycle of a write-write conflict detection thread, according to an example embodiment.
Figures 4-10 illustrate challenges and specific solutions for write conflicts in a PostgreSQL system, which is an object-relational database management system, as an example, according to an example embodiment.
Figure 11 is a flow diagram of features of an example method of writing to data storage shared among multiple database engines, according to an example embodiment.
Figure 12 is a block diagram illustrating circuitry for devices for implementing algorithms and performing methods of providing write-write conflict detection for multi-master shared storage database, according to an example embodiment.
Detailed Description
In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized, and that structural, logical and electrical changes may be made. The following description of example embodiments is, therefore, not to be taken in a limited sense.
The functions or algorithms described herein may be implemented in software in an embodiment. The software may comprise computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor,  or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
Computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media and specifically excludes signals. It should be understood that the software can be installed in and sold with the devices that handle event streams as taught herein. Alternatively, the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software can be stored on a server for distribution over the Internet, for example.
In various embodiments, a system can be implemented with an architecture in which a common log is operated between master nodes and a data storage layer. The master nodes can be realized as database nodes. The common log can be arranged as a write-write detection layer. A write-write detection can also be referred to as a write-write conflict check in which a determination is made as to whether a write transaction from one source conflicts with a related write transaction from another source. The system can be a multi-master database system designed with an architecture in which database instances write log records to storage. Storage nodes can replay, that is, apply the log records to construct data pages.
In such architectures, master databases can flush log records to a common log, where the common log can conduct a page-level or tuple-level conflict check. A database table can be divided into multiple pages. Each page can have a certain number of records. A tuple is an individual record of a database table. A table can be divided into smaller units such as pages, and each page can have multiple tuples. All locks used by the conflict check can be local in the common log such that there is no global locking. The common log persists the log records, and only log records that pass the conflict check are forwarded to the storage nodes from the common log. This architecture can allow for a high availability (HA) common log layer between storage nodes and compute nodes, where the compute nodes can be databases. This HA can be provided by having multiple common logs at a conflict detection layer between the storage nodes and compute nodes, with one of the multiple common logs being the primary common log and the other common logs being duplicates of the primary common log. This duplication defines the HA of the common log layer.
An architecture with the common log layer can be implemented with novel write-write conflict detection algorithms based on a WAL and log sequence numbers (LSNs), which can be performed on each log record. A log sequence number is an identification number given to a WAL record indicating its position in a sequence of records for a transaction. The write-write detection can be a page-level conflict detection or a tuple-level conflict detection. Such approaches to conflicts at a tuple level or page level can provide fine-grained locks for good parallelism in that, in the common log, conflict checks run on multiple worker threads in parallel. A thread is a sequence of instructions that may execute in parallel with other sequences of instructions. The associated locking facility does not require any network communications. Additionally, read operations do not participate in write-write conflict detection, which boosts overall system throughput. These write-write conflict detection techniques can eliminate global locking-based conflict prevention and can bring the benefit of OCC to multi-master shared data/storage database systems.
Figure 1 is a block diagram of an embodiment of an example system 100 having an architecture with multiple master databases with computation-storage separation. The system 100 can include a number of databases that communicate with a common log 110 to write data to a shared storage layer 120. In this illustration, there are three databases having database engines 105-1, 105-2, and 105-3. Though three database engines, for three databases, are shown, the system 100 can have less than or more than three databases and, hence, can have less than or more than three database engines. Each database engine can include one or more processors and storage devices having instructions stored thereon, which instructions are executable by the one or more processors to perform operations of the database for which it is a component. Such operations of each database can include storing data from a communication with a client of the respective database in the shared storage layer 120 through operation of the common log 110. Each database represented by a database engine can be  arranged as a database node that is a master, which receives structured query language (SQL) queries or update/insert/delete/create requests from users such as client devices to the databases. Data from users to database nodes managed by database engines 105-1, 105-2, and 105-3 can be served through buffer pools 109-1, 109-2, 109-3 of database engines 105-1, 105-2, and 105-3, respectively, or served through persistent storage nodes, if buffer pool is unavailable or missing from a database. When a transaction modifies a tuple, it creates a log record and the log record of the modified tuple is flushed to the common log 110. The common log 110 performs a conflict check on the log record and distributes, if the log record is without conflict, the log record to the shared storage layer 120, where a new version of data is created by performing a log apply operation.
Each database instance is a master in the architecture of the system 100. Each of database engines 105-1, 105-2, and 105-3 can receive SQL queries from clients and can start transactions. Each of database engines 105-1, 105-2, and 105-3 can flush WAL records to common log 110. The control of loading data pages to the shared storage layer 120 from started transactions in the database engines 105-1, 105-2, and 105-3 is provided by the common log 110 in response to successful completion of conflict checks. However, data pages from the shared storage layer 120 can be loaded by each of database engines 105-1, 105-2, and 105-3. For example, data pages can be loaded along path 132 from the shared storage layer by the database engine 105-1. Appropriate data can be loaded to each of database engines 105-2 and 105-3 in a similar manner.
The common log 110 can include one or more processors and storage devices having instructions stored thereon, which instructions are executable by the one or more processors to perform operations of the common log 110. The common log 110 can perform a number of functions. It can receive different WAL records from database engines 105-1, 105-2, and 105-3, operating as masters, on paths 106-1, 106-2, and 106-3, respectively. It can conduct write-write conflict detection and can send WAL records that pass their conflict check to the shared storage layer 120. In the architecture of the system 100, the common log 110 operates as a primary common log node in a HA arrangement. As the primary common log node, the common log 110 is a leader common log node that replicates WAL records that pass conflict check to follower common logs 115-1 and 115-2. WAL records can be sent along a path 113-1 to the  follower common log 115-1 and WAL records can be sent along a path 113-2 to the follower common log 115-2. The common log 110, as the leader common log node of the system 100, can replicate its internal state to follower common log 115-1 and follower common log 115-2 by shipping a command log to these follower common logs. The command log can be sent along a path 119-1 to the follower common log 115-1 and the command log can be sent along a path 119-2 to the follower common log 115-2. The command log can keep all the operations, which can be identified as commands, that modify the internal state of the common log 110.
Though Figure 1 shows two follower common logs, system 100 can have less than or more than two follower common logs in a common log layer between masters and shared storage layer. The implementation of the common log 110 with follower common logs provides an architecture of a HA common log layer between storage and compute nodes. The HA characteristic of such an architecture of a system increases with the number of follower common logs disposed in the common log layer of the system. When the leader common log fails, a follower common log in the HA common log layer becomes the leader common log. The sequencing of which follower common log becomes the leader common log can be predefined or implemented at the time of failure of the leader common log using conventional techniques to elect a leader node among a set of peer nodes.
The common log 110 can include a global transaction manager (GTM) 112 residing in the common log 110. The GTM 112 can generate transaction identifications (IDs). These transaction IDs can be generated in ascending order. The GTM 112 can maintain a snapshot of active transactions for a database engine, for example along path 133 to database engine 105-1. The snapshot can be implemented as a list of active transactions. For the follower common logs 115-1 and 115-2 to be able to replace the common log 110, the follower common logs 115-1 and 115-2 include a GTM 117-1 and a GTM 117-2, respectively, that can perform the functions of the GTM 112.
The shared storage layer 120 can be implemented as a shared storage server having storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5. The shared storage server can include one or more processors that control operation of the shared storage server by execution of instructions stored in the shared storage layer 120. Though five storage nodes are shown, the shared storage layer 120 can include less than or more than five storage nodes. Each storage node can include one or more storage devices. The storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 can be structured as distributed storage nodes. The shared storage layer 120 can receive different WAL records from the common log 110 along a number of paths such as paths 123-1, 123-2, and 123-3.
Figure 2 depicts an embodiment of example operational components inside the common log 110 of Figure 1 that can facilitate write-write conflict detection. Similar or identical operational components are disposed in follower common logs 115-1 and 115-2 to perform operations as a common log leader when a selected follower common log is changed from a follower common log to a leader common log. The operational components can be implemented with storage devices controlled by the one or more processors of the common log 110. In the example shown in Figure 2, these operational components residing in the common log 110 can be implemented as  buffers  252 and 254 for write-write conflict detection threads, a hash table 255, a group flush WAL buffer (GFWB) 256, and a persistent log 258. For ease of presentation, the common log 110 is shown receiving a batch of write transaction log records from only two masters 105-1 and 105-2. The shown components can be expanded to handle communication of common log 110 with more than two masters.
The write-write conflict detection threads in buffers 252 and 254 can be dedicated worker threads that execute write-write conflict detection on buffers of transaction log records in parallel. The dedicated threads in buffers 252 and 254 can run conflict checks on batches of WAL records concurrently. In Figure 2, with respect to threads 1 and 2 associated with buffers 252 and 254, respectively, Wij(x) represents a write transaction log record, where Wij stands for the j-th write operation (write transaction log) record from the i-th transaction (Ti). The parameter x is either a tuple ID or a page ID depending on whether the conflict detection is a tuple-level or a page-level conflict detection. With each write transaction log record that passes the write-write conflict check being assigned a global LSN in the common log 110, the term reader LSN represents the latest global LSN known by a transaction. In the example of Figure 2, data for thread 1 includes a reader LSN equal to seven for transactions 1 and 2 for a batch of write transaction records with commit transactions C2 and C1. The commit transactions are from the perspective of the master, which does not perceive any inconsistencies with performing the given commit transaction. The data is received in time order starting on the left and proceeding to the right. In this example, the commit for transaction 2 occurs before the commit for transaction 1, though the write transaction log records for transaction 1 are received before the records for transaction 2. Data for thread 2 includes a reader LSN equal to seven for transactions 3 and 4 for a batch of write transaction records with commit transactions C3 and C4. Different from the data for thread 1, the commit for transaction 3 occurs before the commit for transaction 4, along with the write transaction log records for transaction 3 being received before the records for transaction 4.
The hash table 255 is a write-write conflict detection hash table having entries for a tuple ID or a page ID 261, a master ID and LSN 262, and a bucket (also referred to as a slot) number 263. For a tuple-level conflict check, the key for the write-write conflict detection hash table 255 is a tuple ID, and for a page-level conflict check, the key is a page ID. The value of the key is in the form {master ID, LSN}, which includes the latest global LSN of the WAL record that modifies the tuple or page, where the master ID is the ID of the master that sends the WAL record. Whenever a log record passes the conflict check in the common log 110, it is assigned a new global LSN. The LSN in {master ID, LSN} represents this global LSN, and the master ID is the master that sends the log record to the common log 110. The global LSN is only generated for a WAL record that passes the conflict check. For N masters, the master IDs can be the integers 1 to N, with each integer assigned to one of the N masters different from the other masters of the N masters. The write-write conflict detection hash table 255 has a fixed number of buckets, where each bucket has its own lock. Bucket locks are either tuple locks or page locks depending on whether the conflict check is tuple-level or page-level.
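As an illustrative sketch in Python, the write-write conflict detection hash table 255 described above can be modeled as follows; the class name, bucket count, and method names are assumptions made only for the sketch, while the key/value layout (a tuple ID or page ID mapped to {master ID, LSN}) and the per-bucket locks follow the description.

```python
import threading

NUM_BUCKETS = 1024  # assumed fixed bucket count; the table is pre-allocated

class ConflictHashTable:
    """Sketch of the write-write conflict detection hash table 255."""

    def __init__(self, num_buckets=NUM_BUCKETS):
        # One lock per bucket, acting as a tuple-level or page-level lock
        # depending on the granularity of the conflict check.
        self.num_buckets = num_buckets
        self.locks = [threading.Lock() for _ in range(num_buckets)]
        self.buckets = [dict() for _ in range(num_buckets)]

    def _bucket_of(self, key):
        # The bucket (and therefore its lock) is derived from the key:
        # a tuple ID for tuple-level checks, a page ID for page-level checks.
        return hash(key) % self.num_buckets

    def get(self, key):
        """Return the (master ID, global LSN) entry for a key, or None."""
        b = self._bucket_of(key)
        with self.locks[b]:
            return self.buckets[b].get(key)

    def put(self, key, master_id, global_lsn):
        """Record the latest global LSN of the WAL record that modified the key."""
        b = self._bucket_of(key)
        with self.locks[b]:
            self.buckets[b][key] = (master_id, global_lsn)
```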
In common log 110, when a WAL record passes the conflict check, the WAL record is inserted to GFWB 256. Once a WAL record enters GFWB 256, it is assigned a new global LSN. When the GFWB 256 is full or a timer goes off, all the log records in the GFWB 256 are flushed to the persistent log 258 of the common log 110. The persistent log 258 can be implemented as a disk. All log records that pass conflict check are eventually flushed to the persistent log 258  and are also sent to the storage nodes 125-1, 125-2, 125-3, 125-4, and 125-5 of the shared storage layer 120 of Figure 1. Use of the components of the common log 110 is illustrated in Figure 3.
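A minimal sketch of the group flush behavior follows, under the assumption that the global LSN is a simple counter and that the persistent log can be modeled as a list; the class and parameter names are illustrative rather than taken from the disclosure.

```python
import time

class GroupFlushWALBuffer:
    """Sketch of the GFWB 256: buffer records, assign global LSNs, group flush."""

    def __init__(self, capacity=64, flush_interval_s=0.01):
        self.capacity = capacity
        self.flush_interval_s = flush_interval_s
        self.persistent_log = []          # stands in for persistent log 258
        self.pending = []
        self.next_global_lsn = 1
        self.last_flush = time.monotonic()

    def insert(self, wal_record):
        # Only records that passed the conflict check reach the GFWB;
        # entering the buffer assigns the record a new global LSN.
        global_lsn = self.next_global_lsn
        self.next_global_lsn += 1
        self.pending.append((global_lsn, wal_record))
        if len(self.pending) >= self.capacity or \
                time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()
        return global_lsn

    def flush(self):
        # Save all buffered records to the persistent log; in the full system
        # they are also sent on to the shared storage nodes from here.
        self.persistent_log.extend(self.pending)
        self.pending.clear()
        self.last_flush = time.monotonic()
```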
Figure 3 is a flow diagram 300 of an embodiment of an example cycle of a write-write conflict detection thread. The cycle can be implemented in a common log, such as the common log 110 of Figures 1 and 2, using one or more processors to execute instructions stored in a memory storage, for example at a common log layer of a system. At 305, a batch of WAL records is received at the common log. For example, the WAL records of the write-write conflict detection thread 1 can be received in a buffer, such as buffer 252. Buffer 252 can include many log records and is not limited to four log records. At 310, a WAL record Wij(x) is extracted from the buffer. In a sequence of processing WAL records, the extracted Wij(x) is the next WAL record in the buffer to be conflict checked. At 315, a transaction ID is extracted from Wij(x). At 320, a determination is made as to whether Wij(x) is a write operation record, an abort record, or a commit record. At 325, if Wij(x) is an abort record, the transaction is aborted and a goto 310 operation is performed to extract the next WAL record.
At 330, if Wij(x) is a commit record, a determination is made as to whether the transaction ID is marked with “has conflict” or another manner of identifying the occurrence of a conflict. At 335, with the transaction ID marked as having a conflict, the transaction is aborted and a goto 310 is performed to extract the next WAL record. At 340, with the transaction ID not marked as having a conflict, the transaction is committed and the procedure goes to 310 to extract the next WAL record.
At 345, with Wij(x) from the determination at 320 being a write operation record, a page ID or a tuple ID is extracted from Wij(x) and the reader LSN is extracted from Wij(x). At 350, a lookup in a hash table, such as the write-write conflict detection hash table 255 of Figure 2, by the extracted page ID or the extracted tuple ID is performed, and the entry having this extracted page ID or extracted tuple ID is obtained. At 355, a determination is made as to whether the master ID of the extracted entry is not equal to the master ID of the sender of Wij(x) and whether the LSN of the entry extracted from the hash table is larger than the reader LSN extracted from Wij(x). At 360, with the condition at 355 met, the transaction ID is marked with “has conflict” or another equivalent identifier and the procedure goes to 310 to extract the next WAL record. This condition arises since the received Wij(x) identifies the latest global LSN known by the transaction, represented by the reader LSN, as being less than the current global LSN, indicating that other write operations have occurred such that the data is not in the expected state.
At 365, with a finding of no conflict at 355, several actions are taken. Wij(x) is inserted into the GFWB with the first LSN following the WAL records currently in the GFWB. The LSN of the hash table entry is updated to the LSN from the insertion of Wij(x) into the GFWB. The master ID of the hash table entry is updated to equal the master ID of Wij(x). The GFWB is then flushed if full or when a timer reaches a set time. Internal state changes are persisted in a command log that is also shipped to the common log replicas, such as follower common logs 115-1 and 115-2 of Figure 1. Each write-write conflict detection thread can run the above conflict detection algorithm continuously on batches of WAL records.
Multiple write-write conflict detection threads work concurrently, providing parallelism for a system. Each thread can access the write-write conflict detection hash table of a common log using bucket level locks (close to tuple level locks or page level locks depending on whether the write-write conflict detection is performed on a page level or a tuple level) . The write-write conflict detection hash table can be pre-allocated with a fixed number of buckets. Each bucket can have its own lock, where the lock can be derived from the index of the bucket.
To provide common log HA, all log records that pass the conflict check can be propagated to follower common logs. The follower common logs can be referred to as standby common logs. Also, the state of the write-write conflict detection hash table and the GFWB can be replicated to the standby common logs by shipping a command log to each standby common log. The command log stores all the operations/commands that modify the write-write conflict detection hash table and the GFWB.
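The command-log shipping can be pictured with the small sketch below; the apply interface on the standby side is an assumption, intended only to show that replaying the recorded operations reproduces the leader's hash table and GFWB state.

```python
class CommandLog:
    """Sketch: records every operation that modifies the common log's internal state."""

    def __init__(self):
        self.entries = []

    def record(self, command, *args):
        # e.g. ("hash_put", key, master_id, lsn) or ("gfwb_insert", wal_record)
        self.entries.append((command, args))

    def ship_to(self, standby_common_log):
        # Replaying the command log on a standby (follower) common log keeps its
        # hash table and GFWB in step, so it can take over if the leader fails.
        for command, args in self.entries:
            standby_common_log.apply(command, *args)
```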
Compared to a page-level write-write conflict check, a tuple-level write-write conflict check can reduce the number of conflicts and can produce better throughput. At the same time, a tuple-level write-write conflict check poses more challenges for implementation. In the next sections, with respect to Figures 4-10, the challenges and the specific solutions are described for the PostgreSQL page and tuple layout, as an example. PostgreSQL is an object-relational database management system (ORDBMS), where an ORDBMS is a database management system (DBMS) similar to a relational database, but with an object-oriented database model in which objects, classes and inheritance are directly supported in database schemas and in the query language. PostgreSQL is open-source and compliant with atomicity, consistency, isolation, durability (ACID) principles, which are a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc. PostgreSQL manages concurrency through a system known as multi-version concurrency control (MVCC), which gives each transaction a snapshot of the database, allowing changes to be made without being visible to other transactions until the changes are committed.
In PostgreSQL, a data file can be referred to as a heap, where the heap can have associated with it a heap tuple (HTUP), a page having a page number (PAGE_NO), and a line pointer (LP) with a line pointer number (LP_NO). Also associated with the data file (heap) is an index. An index is a specific structure that organizes a reference to the data that makes it easier to look up. In PostgreSQL, an index can be a copy of the item to be indexed combined with a reference to the actual data location. Associated with an index is an index tuple (ITUP) and an LP_NO. In Figures 4-10, the reference labels that begin with H, I, and HI refer to heap, index, and heap or index, respectively, and TUP refers to a tuple.
Inserts should not conflict with each other. But PostgreSQL's heap insert and index insert implementation can lead to conflicts between inserts in a multi-master system. Figure 4 is an example of log records for inserts that lead to conflict. If a common log receives the log records from master 405-1 and master 405-2, only those from one master can win. This can lead to a large number of insert conflicts. Master 405-1 submits a transaction log (XLOG) record 1 to insert a HTUP, which is labelled HTUP1. The XLOG record 1 includes the content of HTUP1 with the information that the PAGE_NO is equal to HX and the LP_NO is equal to one. Master 405-1 also submits an XLOG record 2 to insert an ITUP, which is labelled ITUP1. The XLOG record 2 includes the content of ITUP1 with the information that the PAGE_NO for XLOG record 2 is equal to IY and the LP_NO is equal to one. The content of ITUP1 is the PAGE_NO = HX and LP_NO = 1. Master 405-2 submits a transaction log (XLOG) record 1 to insert a HTUP, which is labelled HTUP2. The XLOG record 1 includes the content of HTUP2 with the information that the PAGE_NO is equal to HX and the LP_NO is equal to one. Master 405-2 also submits an XLOG record 2 to insert an ITUP, which is labelled ITUP2. The XLOG record 2 includes the content of ITUP2 with the information that the PAGE_NO for XLOG record 2 is equal to IY and the LP_NO is equal to one. The content of ITUP2 is the PAGE_NO = HX and LP_NO = 1, which leads to conflict with XLOG record 2 from master 405-1.
The following is a way to eliminate this kind of conflict. First, when inserting a heap tuple, instead of just picking an unused LP, a master checks to determine if the LP_NO can be equal to its own master ID. In the example of Figure 4, master 405-1 (ID=1) can select LP1 on page HX, and master 405-2 (ID=2) can select LP2 on page HX. Second, when inserting an index tuple, each master can simply pick the next unused LP. But the LP_NO should not appear in the log record. This way, when the log record is applied on the storage node, an overall next LP can be figured out, where the LPs of index tuples are ordered according to the order of index keys.
Figure 5 illustrates the use of these principles of operation of a common log to provide modified log records for insertions that eliminate the conflict associated with Figure 4. As shown in Figure 5, XLOG record 1 from master 405-1, having master ID = 1, has an LP_NO = 1 for inserting HTUP1 and XLOG record 2 from master 405-1 has an LP_NO = φ (next unused LP), while XLOG record 1 from master 405-2, having master ID = 2, has an LP_NO = 2 and XLOG record 2 from master 405-2 has an LP_NO = φ (next unused LP). The associated common log for these two masters can let both masters commit, where the different LP_NOs are not conflicting.
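A sketch of the line-pointer selection rule just described (the function names and page bookkeeping are assumptions): a heap insert from master i reserves LP_NO = i, while an index insert leaves the LP_NO out of the log record so that the storage node can pick the next unused LP, keeping index tuples ordered by key, when the record is applied.

```python
NEXT_UNUSED_LP = None   # stands in for the "φ" (next unused LP) placeholder

def heap_insert_lp(master_id, used_lps_on_page):
    """Heap insert: a master tries to use the LP_NO equal to its own master ID."""
    if master_id in used_lps_on_page:
        raise RuntimeError("LP reserved for this master is already in use on the page")
    used_lps_on_page.add(master_id)
    return master_id

def index_insert_lp():
    """Index insert: no LP_NO is placed in the log record; the storage node
    resolves the next unused LP when the record is applied."""
    return NEXT_UNUSED_LP

# Example mirroring Figure 5: masters 1 and 2 both insert a heap tuple on page HX.
used_lps = set()
lp_for_master_1 = heap_insert_lp(1, used_lps)   # -> 1
lp_for_master_2 = heap_insert_lp(2, used_lps)   # -> 2, no conflict with master 1
```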
Inserts from different writers can also conflict with each other when a heap or an index (referred to below as HI) page is close to being full. Figure 6 illustrates an example of inserts with respect to a full page. For example, a master 605-1 sees that page heap or index X (HIX) has room for one tuple, and it inserts heap or index tuple (HITUP) 6 (HITUP6) with an unused LP_NO = 5 or next unused LP_NO = φ. The associated common log receives an XLOG record from master 605-1 for the heap or index tuple 6 (HITUP6) with the XLOG record having the PAGE_NO for the heap or the index, PAGE_NO = X, LP_NO = 5 (heap) or next unused LP_NO = φ (index), and the content of tuple 6 (heap or index). A master 605-2 also sees that page HIX has room for one tuple, and master 605-2 inserts HITUP7 with LP_NO = 6 or next unused LP_NO = φ. The associated common log receives an XLOG record from master 605-2 for the heap or index tuple 7 (HITUP7) with the XLOG record having the PAGE_NO for the heap or the index, PAGE_NO = X, LP_NO = 6 (heap) or next unused LP_NO = φ (index), and the content of tuple 7 (heap or index). When the common log receives the two log records for HITUP6 and HITUP7, it sees no conflicts and commits both of them. However, when these two log records are applied, the storage sees there is no room on page HIX, but the transactions have committed. To detect page full conflicts, the common log maintains free space map (FSM) pages. Each heap and index relation can have an FSM to keep track of available space in the relation. When an insert-related log record has no conflict and is ready to commit, the common log also checks whether the increase will cause page size overflow. If so, the common log aborts the transaction. An update adding a new version of a tuple is handled similarly.
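The page-full check can be sketched as follows, assuming an 8 KB page size and a simple per-page byte counter for the free space map; the real FSM layout in the common log may differ.

```python
PAGE_SIZE_BYTES = 8192   # assumed page size, used only for illustration

class FreeSpaceMap:
    """Sketch of the FSM the common log consults before committing inserts."""

    def __init__(self):
        self.free_bytes = {}   # PAGE_NO -> remaining free bytes

    def try_reserve(self, page_no, tuple_size_bytes):
        """Reserve room for an insert-related record that passed the conflict check.

        Returns False when the insert would overflow the page, in which case
        the common log aborts the transaction.
        """
        free = self.free_bytes.get(page_no, PAGE_SIZE_BYTES)
        if tuple_size_bytes > free:
            return False
        self.free_bytes[page_no] = free - tuple_size_bytes
        return True
```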
Inserts can cause index pages to split. When one writer splits an index page while another writer inserts an index tuple to the same page, either the split or the insert should abort. Figure 7 is an example of index page splitting. In Figure 7, master 705-1 inserts ITUP5, making index page IX full. Master 705-1 then tries to insert ITUP6 to page IX. Page IX splits, where the first half of the ITUPs stay at page IX, and the second half go to a new page INEW. Both ITUP5 and ITUP6 go to INEW because their index keys are larger. Master 705-2 inserts ITUP7 on page IX, supposing ITUP7 has the largest key.
Figure 8 roughly illustrates the log records generated by the two masters of Figure 7 from index page splitting. The master 705-1 generates an XLOG 1 about the index tuple ITUP5, which includes PAGE_NO = IX (index page X), LP_NO = φ, and the ITUP5 content (for example, PAGE_NO = HX and LP_NO = 1). The master 705-1 generates an XLOG 2 about the index tuple ITUP6 and the page IX split, which includes the LP_NO of the splitting point, the new PAGE_NO for the new page, and the content of the new page containing ITUP5 and ITUP6. The master 705-2 generates an XLOG 1 about the index tuple ITUP7, which includes PAGE_NO = IX (index page X), LP_NO = φ, and the ITUP7 content (for example, PAGE_NO = HX and LP_NO = 2).
The common log maintains the PAGE_NOs of recently split index pages and their latest commit LSNs in a write-write conflict detection hash table of the common log. Suppose the common log commits the changes of master 705-1 first, and the common log receives from master 705-2 the log record about inserting ITUP7 on page IX. The common log sees that PAGE_NO = IX just split and that the split commit LSN is later than the reader LSN of the log record of master 705-2, and it aborts the change of master 705-2. Similarly, suppose the common log commits the change of master 705-2 first, and the common log receives the log record of master 705-1 about the split of page IX. The reader LSN of the split is earlier than the commit LSN of the insertion by master 705-2 on page IX, and the common log aborts the split of page IX.
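A sketch of the split-versus-insert check (the function name and LSN values are assumptions): the common log keeps the latest commit LSN for recently split index pages and aborts a change whose reader LSN predates that commit.

```python
def conflicts_with_recent_split(recent_split_commit_lsns, page_no, reader_lsn):
    """Return True if a change on page_no was produced before a later committed
    split (or insert) on the same index page and therefore must abort."""
    commit_lsn = recent_split_commit_lsns.get(page_no)
    return commit_lsn is not None and commit_lsn > reader_lsn

# Example following Figures 7 and 8 (LSN values assumed): the split of page IX
# by master 705-1 commits at global LSN 12; master 705-2's insert of ITUP7 on
# page IX carries reader LSN 10, so the insert is aborted.
recent_split_commit_lsns = {"IX": 12}
assert conflicts_with_recent_split(recent_split_commit_lsns, "IX", reader_lsn=10)
```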
Figure 9 illustrates handling an update-update conflict. Handling updates can be relatively straightforward. Suppose master 905-1 and master 905-2 try to update HTUP0 at {PAGE_NO = HX, LP_NO = 0}. The newer version on master 905-1 is HTUP0’ at {PAGE_NO = HX, LP_NO = 1} with the content of HTUP0’ and xmax equal to the transaction ID (TID1) for master 905-1. The newer version on master 905-2 is HTUP0” at {PAGE_NO = HX, LP_NO = 2} with the content of HTUP0” and xmax equal to the transaction ID (TID2) for master 905-2. The LP_NO selected on the heap page satisfies LP_NO = master ID. If the common log commits the log record of master 905-1 first, the common log will reject the change by master 905-2 when it sees that the latest commit LSN of HTUP0 at {PAGE_NO = HX, LP_NO = 0} is larger than the reader LSN of the log record of master 905-2 that also contains HTUP0 at {PAGE_NO = HX, LP_NO = 0}. The same reasoning holds for the situation where the common log commits the log record of master 905-2 first.
Figure 10 illustrates handling an update-delete conflict. Suppose master 1005-1 deletes HTUP0 at {PAGE_NO = HX, LP_NO = 0}, and master 1005-2 updates HTUP0 to HTUP0’. The master 1005-2 puts HTUP0’ at {PAGE_NO = HX, LP_NO = 2}. If the associated common log commits the log record of master 1005-1 first, it will reject the change of master 1005-2 when it sees that the latest commit LSN of HTUP0 at {PAGE_NO = HX, LP_NO = 0} is larger than the reader LSN of the log record of master 1005-2 that also contains HTUP0 at {PAGE_NO = HX, LP_NO = 0}. Similar reasoning applies to the situation where the common log commits the log record of master 1005-2 first.
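The update-update case of Figure 9 and the update-delete case of Figure 10 reduce to the same comparison: the latest commit LSN recorded for the old tuple location versus the reader LSN of the incoming log record. A minimal, non-limiting sketch of that shared check, with hypothetical names and dictionary layout, is:

```python
# Non-limiting sketch of the tuple-level check shared by the update-update and
# update-delete cases; names and the dictionary layout are hypothetical.

latest_commit = {}  # (PAGE_NO, LP_NO) of the old tuple -> (master id, commit LSN)

def check_tuple_change(page_no, lp_no, master_id, reader_lsn, commit_lsn):
    """Accept an update or delete of the tuple at (page_no, lp_no) only if no other
    master has committed a change to that tuple after reader_lsn."""
    entry = latest_commit.get((page_no, lp_no))
    if entry is not None and entry[1] > reader_lsn:
        return "ABORT"   # e.g., the later of masters 905-1/905-2 or 1005-1/1005-2
    latest_commit[(page_no, lp_no)] = (master_id, commit_lsn)
    return "COMMIT"
```

With the Figure 9 ordering, the entry for {PAGE_NO = HX, LP_NO = 0} carries the commit LSN of master 905-1, so the later record from master 905-2 with an older reader LSN is aborted; Figure 10 plays out the same way.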
Figure 11 is a flow diagram of features of an embodiment of an example method 1100 of writing to data storage shared among multiple database engines. The method 1100 can be implemented as a computer-implemented method. At 1110, in a common log using one or more processors, a write-write conflict check is performed on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log. Conducting the comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, where the master identification can be an identification of a database engine of the multiple database engines.
At 1120, the write ahead log record is sent to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check. Passing the write-write conflict check can include the log sequence number being greater than the global log sequence number. In response to passing the write-write conflict check, method 1100 or a method similar to method 1100 can include updating the global log sequence number to equal the log sequence number. In response to passing the write-write conflict check, method 1100 or a method similar to method 1100 can include inserting the write ahead log record into a group flush write ahead log buffer and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
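By way of a non-limiting illustration, the commit path of method 1100 can be sketched as follows; the class and method names are hypothetical, while the hash-table layout (a tuple or page identification keyed to a master identification and a global log sequence number) and the pass condition follow the description above:

```python
# Non-limiting sketch of the method 1100 commit path; the class and method names
# are hypothetical, while the hash-table layout and pass condition follow the
# description above.

class CommonLog:
    def __init__(self):
        self.hash_table = {}          # tuple or page id -> (master id, global LSN)
        self.group_flush_buffer = []  # records that passed the check but are not yet persisted
        self.persistent_log = []      # the persistent log in the common log

    def write_write_conflict_check(self, key, lsn):
        """Pass when the record's log sequence number is greater than the global LSN for the key."""
        entry = self.hash_table.get(key)
        return entry is None or lsn > entry[1]

    def process_wal_record(self, key, master_id, lsn, record):
        if not self.write_write_conflict_check(key, lsn):
            return "ABORT"
        # Passed: update the global LSN to equal the record's LSN and stage for a group flush.
        self.hash_table[key] = (master_id, lsn)
        self.group_flush_buffer.append(record)
        return "ACCEPTED"

    def group_flush(self):
        """Save all write ahead log records in the group flush buffer to the persistent log."""
        self.persistent_log.extend(self.group_flush_buffer)
        self.group_flush_buffer.clear()
```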
Variations of the method 1100 or methods similar to the method 1100 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of systems in which such methods are implemented. Such methods can comprise replicating the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log. Such methods  can comprise maintaining in a command log all operations and commands that modify an internal state of the common log.
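A minimal, non-limiting sketch of the follower replication and command-log bookkeeping, under an assumed follower interface and command format, is:

```python
# Non-limiting sketch of follower replication and command logging; the follower
# append() interface and the command tuple format are assumed for illustration.

class LeaderCommonLog:
    def __init__(self, followers):
        self.followers = followers  # follower common logs structured as back-ups
        self.command_log = []       # operations and commands that modify internal state

    def accept(self, wal_record):
        """Record the state-changing operation in the command log, then replicate the record."""
        self.command_log.append(("accept_wal_record", wal_record))
        for follower in self.followers:
            follower.append(wal_record)  # assumed follower interface
```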
In method 1100 or methods similar to the method 1100, the write ahead log record received from a database engine can be extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage. Another write ahead log record can be extracted from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
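By way of a non-limiting illustration, this batch handling can be sketched as unpacking each received batch into individual records that carry the batch's log sequence number before the per-record conflict check runs; all field names here are hypothetical:

```python
# Non-limiting sketch of extracting write ahead log records from per-engine
# batches; all field names are hypothetical.

def extract_records(batch):
    """Yield individual WAL records from a batch, each tagged with the sending
    engine's identification and the batch's log sequence number."""
    for record in batch["records"]:
        yield {
            "master_id": batch["engine_id"],
            "lsn": batch["lsn"],
            "key": record["key"],      # tuple or page identification
            "payload": record,
        }

# Each record yielded here can then be fed, one at a time, to the per-record
# write-write conflict check of the common log.
```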
In various embodiments, a non-transitory machine-readable storage device, such as computer-readable non-transitory media, can comprise instructions stored thereon, which, when executed by components of a machine, cause the machine to perform operations, where the operations comprise one or more features similar to or identical to features of methods and techniques described with respect to method 1100, flow diagram 300, variations thereof, and/or features of other methods taught herein such as associated with Figures 1-11. The physical structures of such instructions may be operated on by one or more processors. For example, executing these physical structures can cause the machine to perform operations comprising: performing, in a common log using the one or more processors, a write-write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check. Conducting the comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
Using the one or more processors to execute the instructions stored on the machine-readable storage device can comprise operations in which passing  the write-write conflict check can include the log sequence number being greater than the global log sequence number. In response to passing the write-write conflict check, the executable operations can include updating the global log sequence number to equal the log sequence number. In response to passing the write-write conflict check, the executable operations can include inserting the write ahead log record into a group flush write ahead log buffer and saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
The operations can include the write ahead log record received from a database engine being extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage. The operations can include extracting another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
The operations can include replicating the write ahead log record to one or more follower common logs, where the one or more follower common logs is structured as back-ups to the common log. The operations can include maintaining in a command log all operations and commands that modify an internal state of the common log.
Figure 12 is a block diagram illustrating circuitry for devices for implementing algorithms and performing methods of providing write-write conflict detection for multi-master shared storage database, according to the teachings herein. Figure 12 depicts a device 1200 having a non-transitory memory storage 1201 storing instructions, a cache 1207, and a processing unit 1202, coupled to a bus 1220. Processing unit 1202 can include one or more processors operatively in communication with non-transitory memory storage 1201 and cache 1207. The one or more processors can be structured to execute the instructions to operate device 1200 as a database engine, a common log, or a shared data storage according to any of the methods taught herein.
Device 1200 can include a communication interface 1216 operable to communicate among devices and systems associated with an architecture such as the architecture of Figure 1. One or more of multiple databases, common logs, and shared data storage can be implemented in a cloud with which device 1200 can be associated. Typically, the term “cloud” refers to data processing on a number of virtual servers as opposed to directly on physical machines. A cloud could span a wide area network (WAN). A WAN is also commonly referred to as the public Internet, or sometimes a network of leased optic fiber links that interconnect multiple branch offices of an enterprise. Alternatively, a cloud could reside entirely within a private datacenter on an internal local area network. Datacenters of clouds, meaning datacenters that host virtual compute or services, can also provide services for network traffic management from one location on a network to another location on the network, or across networks spanning far-flung locations over a WAN (or the Internet). In addition, the term "cloud computing" refers to the software and services executed for users by these servers virtually (over a hypervisor), and typically the user is unaware of the physical location of the servers or datacenter. Further, the datacenter may be a distributed entity. Cloud computing can provide shared computer processing resources and data to computers and other devices on demand over the associated networks. The communication interface 1216 may be part of a data bus that can be used to receive the data traffic for processing.
Non-transitory memory storage 1201 may be realized as machine-readable media, such as computer-readable media, and may include volatile memory 1214 or non-volatile memory 1208. Device 1200 may include or have access to a computing environment that includes a variety of machine-readable media, including computer-readable media, such as volatile memory 1214, non-volatile memory 1208, removable storage 1211, or non-removable storage 1222. Such machine-readable media may be used with instructions in one or more programs 1218 executed by device 1200. Cache 1207 may be realized as a separate memory component or as part of one or more of volatile memory 1214, non-volatile memory 1208, removable storage 1211, or non-removable storage 1222. Memory storage can include random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
Device 1200 may include or have access to a computing environment that includes input interface 1226 and output interface 1224. Output interface 1224 may include a display device, such as a touchscreen, that also may serve as an input device. Input interface 1226 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to device 1200, and other input devices.
Device 1200 may operate in a networked environment using a communication connection to connect to one or more other devices that are remote. Such remote devices may be identical or similar to device 1200 or may be different types of devices having features similar or identical to features of device 1200 or other features, as taught herein, to handle procedures associated with providing write-write conflict detection for multi-master shared storage database, according to the teachings herein. The remote devices may include computers, such as database servers. Such remote computers may include a personal computer (PC) , server, router, network PC, a peer device or other common network node, or the like. The communication connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , cellular, WiFi, Bluetooth, or other networks.
Machine-readable instructions, such as computer-readable instructions stored on a computer-readable medium, are executable by the processing unit 1202 of the device 1200. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device. The terms machine-readable medium, computer-readable medium, and storage device do not include carrier waves to the extent carrier waves are deemed transitory. Storage can also include networked storage such as a storage area network (SAN) .
Device 1200 can be realized as a computing device that may take different forms in different embodiments, as part of a network such as an SDN/IoT network. For example, device 1200 may be a smartphone, a tablet, a smartwatch, another computing device, or another type of device having wireless communication capabilities, where such devices include components to engage in the distribution and storage of items of content, as taught herein. Devices such as smartphones, tablets, smartwatches, and other types of devices having wireless communication capabilities are generally collectively referred to as mobile devices or user equipment. In addition, some of these devices may be considered as systems for the functions and/or applications for which they are implemented. Further, although the various data storage elements are illustrated as part of the device 1200, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet, or server-based storage.
In an example embodiment, the device 1200 includes a conflict check module performing, in a common log using one or more processors, a write-write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log, and a transmission module sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check. In some embodiments, the device 1200 may include other or additional modules for performing any one of or combination of steps described in the embodiments. Further, any of the additional or alternative embodiments or aspects of the method, as shown in any of the figures or recited in any of the claims, are also contemplated to include similar modules.
Further, machine-readable storage devices, such as computer-readable non-transitory media, herein, are physical devices that store data represented by physical structure within the respective device. Such a physical device is a non-transitory device. Examples of machine-readable storage devices can include, but are not limited to, read only memory (ROM), random access memory (RAM), a magnetic disk storage device, an optical storage device, a flash memory, or other electronic, magnetic, and/or optical memory devices. The machine-readable device may be a machine-readable medium such as memory 1201 of Figure 12. Terms such as "memory," "memory module," "machine-readable medium," "machine-readable device," and similar terms should be taken to include all forms of storage media, either in the form of a single medium (or device) or multiple media (or devices), in all forms. For example, such structures can be realized as centralized database(s), distributed database(s), associated caches, and servers; one or more storage devices, such as storage drives (including but not limited to electronic, magnetic, and optical drives and storage mechanisms); and one or more instances of memory devices or modules (whether main memory; cache storage, either internal or external to a processor; or buffers). Terms such as "memory," "memory module," "machine-readable medium," and "machine-readable device" shall be taken to include any tangible non-transitory medium which is capable of storing or encoding a sequence of instructions for execution by the machine and that causes the machine to perform any one of the methodologies taught herein. The term "non-transitory" used in reference to a "machine-readable device," "medium," "storage medium," "device," or "storage device" expressly includes all forms of storage drives (optical, magnetic, electrical, etc.) and all forms of memory devices (e.g., DRAM, Flash (of all storage designs), SRAM, MRAM, phase change, etc.), as well as all other structures designed to store data of any type for later retrieval.
In various embodiments, a system can be implemented to enable write-write conflict detection for multi-master shared storage database. Such a system can comprise a memory storage comprising instructions and one or more processors in communication with the memory storage. The one or more processors can execute the instructions to: perform, in a common log, a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and send the write ahead log record to a data storage shared among the multiple database engines, in response to the write ahead log record passing the write conflict check. The comparison can comprise using a tuple identification or a page identification as a key in the hash table, where the key is associated with an entry in a form of a master identification and a global log sequence number value, with the master identification being an identification of a database engine of the multiple database engines.
Variations of such a system or similar systems can include a number of different embodiments that may be combined depending on the application of  such systems and/or the architecture in which such systems are implemented. Passing the write conflict check can include the log sequence number being greater than the global log sequence number. The one or more processors, in response to passing the write conflict check, can update the global log sequence number to equal the log sequence number. The one or more processors, in response to passing the write conflict check, can execute an insertion of the write ahead log record into a group flush write ahead log buffer, and execute a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
Such systems can include the one or more processors configured to execute instructions to execute replication of the write ahead log record to one or more follower common logs, with the one or more follower common logs structured as back-ups to the common log. The one or more processors can extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage. The one or more processors can extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage. The one or more processors can maintain in a command log all operations and commands that modify an internal state of the common log.
Such a system or similar systems can include the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the common log. The multiple database engines can be structural units separate from the common logs and the shared data storage. The multiple database engines can communicate with a leader common log of the common logs and with the shared data storage to transfer data using conventional communication techniques such as, but not limited to, transmission control protocol (TCP) and internet protocol (IP). Such a system or similar systems can be structured according to any permutation of the features taught herein for write-write conflict detection for multi-master shared storage database.
In various embodiments, a system can be implemented to enable write-write conflict detection for multi-master shared storage database. Such a system can comprise a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in a common log, the means for performing the write conflict check including the common log and operationally arranged between the multiple database engines and a shared data storage, the shared data storage being shared among the multiple database engines; and a means for sending the write ahead log record to the shared data storage, in response to the write ahead log record passing the write conflict check. The comparison can comprise using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
Variations of such a system or similar systems, having a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, can include a number of different embodiments that may be combined depending on the application of such systems and/or the architecture in which such systems are implemented. Passing the write conflict check can include the log sequence number being greater than the global log sequence number. The means for performing the write conflict check, in response to passing the write conflict check, can update the global log sequence number to equal the log sequence number. The system means for performing the write conflict check, in response to passing the write conflict check, can execute an insertion of the write ahead log record into a group flush write ahead log buffer, and execute a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log. Such a system or similar systems can include the means for performing the write conflict check structured to execute replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
The means for performing the write conflict check can extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage. The means for performing the write conflict check can extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage. The means for performing the write conflict check can maintain in the command log all operations and commands that modify an internal state of the common log.
A system or similar systems, having a means for performing a write conflict check on a write ahead log record, can include the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the means for performing the write conflict check and the means for sending the write ahead log record to the shared data storage. Such a system or similar systems can be structured according to any permutation of the features taught herein for write-write conflict detection for multi-master shared storage database.
Conflict prevention based on global locking incurs significant network communication and blocking, which yields low performance and throughput. The methods and structures taught herein use WAL records to determine the existence of conflicts. This approach eliminates global locking and brings the benefits of optimistic concurrency control to multi-master shared storage database systems. Such an approach can be a key technology for building large-scale on-cloud multi-master systems.
Although the present disclosure has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from scope of the disclosure. The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present disclosure.

Claims (30)

  1. A computer-implemented method of writing to data storage shared among multiple database engines, the computer-implemented method comprising:
    performing, in a common log using one or more processors, a write-write conflict check on a write ahead log record received from a database engine of the multiple database engines, in which the write-write conflict check includes conducting a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and
    sending the write ahead log record to the data storage shared among multiple database engines, in response to the write ahead log record passing the write-write conflict check.
  2. The computer-implemented method of claim 1, wherein conducting the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  3. The computer-implemented method of claim 1 or claim 2, wherein passing the write-write conflict check includes the log sequence number being greater than the global log sequence number.
  4. The computer-implemented method of any of claims 1-3, comprising, in response to passing the write-write conflict check, updating the global log sequence number to equal the log sequence number.
  5. The computer-implemented method of any of claims 1-4, comprising, in response to passing the write-write conflict check:
    inserting the write ahead log record into a group flush write ahead log buffer; and
    saving all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  6. The computer-implemented method of any of claims 1-5, comprising replicating the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  7. The computer-implemented method of any of claims 1-6, wherein the write ahead log record received from a database engine is extracted from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  8. The computer-implemented method of any of claims 1-7, comprising extracting another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  9. The computer-implemented method of any of claims 1-8, comprising maintaining in a command log all operations and commands that modify an internal state of the common log.
  10. A non-transitory computer-readable medium storing computer instructions, that when executed by one or more processors, cause the one or more processors to perform the steps of any of claims 1-9.
  11. A system comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to:
    perform, in a common log, a write-write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write-write conflict check includes a comparison of  a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in the common log; and
    send the write ahead log record to a data storage shared among the multiple database engines, in response to the write ahead log record passing the write-write conflict check.
  12. The system of claim 11, wherein the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a database engine of the multiple database engines.
  13. The system of claim 11 or claim 12, wherein passing the write-write conflict check includes the log sequence number being greater than the global log sequence number.
  14. The system of any of claims 11-13, wherein the one or more processors, in response to passing the write-write conflict check, update the global log sequence number to equal the log sequence number.
  15. The system of any of claims 11-14, wherein the one or more processors, in response to passing the write-write conflict check, execute:
    an insertion of the write ahead log record into a group flush write ahead log buffer; and
    a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  16. The system of any of claims 11-15, wherein the one or more processors execute replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  17. The system of any of claims 11-16, wherein the one or more processors  extract the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  18. The system of any of claims 11-17, wherein the one or more processors extract another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  19. The system of any of claims 11-18, wherein the one or more processors maintain in a command log all operations and commands that modify an internal state of the common log.
  20. The system of any of claims 11-19, wherein the system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the common log.
  21. A system comprising:
    a means for performing a write conflict check on a write ahead log record received from a database engine of multiple database engines, in which the write conflict check includes a comparison of a log sequence number, received with the write ahead log record from the database engine, with a global log sequence number in a hash table in a common log, the means for performing the write conflict check including the common log and operationally arranged between the multiple database engines and a shared data storage, the shared data storage being shared among the multiple database engines; and
    a means for sending the write ahead log record to the shared data storage, in response to the write ahead log record passing the write conflict check.
  22. The system of claim 21, wherein the comparison comprises using a tuple identification or a page identification as a key in the hash table, the key associated with an entry in a form of a master identification and a global log sequence number value, the master identification being an identification of a  database engine of the multiple database engines.
  23. The system of claim 21 or claim 22, wherein passing the write conflict check includes the log sequence number being greater than the global log sequence number.
  24. The system of any of claims 21-23, wherein the means for performing the write conflict check, in response to passing the write conflict check, updates the global log sequence number to equal the log sequence number.
  25. The system of any of claims 21-24, wherein the means for performing the write conflict check, in response to passing the write conflict check, executes:
    an insertion of the write ahead log record into a group flush write ahead log buffer; and
    a save operation of all write ahead log records in the group flush write ahead log buffer to a persistent log in the common log.
  26. The system of any of claims 21-25, wherein the means for performing the write conflict check executes replication of the write ahead log record to one or more follower common logs, the one or more follower common logs structured as back-ups to the common log.
  27. The system of any of claims 21-26, wherein the means for performing the write conflict check extracts the write ahead log record from a batch of write ahead log records received from the database engine having the log sequence number for one or more transactions between the database engine and the data storage.
  28. The system of any of claims 21-27, wherein the means for performing the write conflict check extracts another write ahead log record from another batch of write ahead log records received from another database engine of the multiple database engines having another log sequence number for one or more other transactions between the other database engine and the data storage.
  29. The system of any of claims 21-28, wherein the means for performing the write conflict check maintains in a command log all operations and commands that modify an internal state of the common log.
  30. The system of any of claims 21-29, wherein the system includes the multiple database engines, the data storage shared among the multiple database engines, and one or more follower common logs in addition to the means for performing the write conflict check and the means for sending the write ahead log record to the shared data storage.
PCT/CN2019/091397 2018-12-11 2019-06-14 Write-write conflict detection for multi-master shared storage database WO2020119050A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19896966.9A EP3877859A4 (en) 2018-12-11 2019-06-14 Write-write conflict detection for multi-master shared storage database
CN201980078344.5A CN113168371A (en) 2018-12-11 2019-06-14 Write-write collision detection for multi-master shared storage databases

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862777972P 2018-12-11 2018-12-11
US62/777,972 2018-12-11

Publications (1)

Publication Number Publication Date
WO2020119050A1 true WO2020119050A1 (en) 2020-06-18

Family

ID=71075569

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091397 WO2020119050A1 (en) 2018-12-11 2019-06-14 Write-write conflict detection for multi-master shared storage database

Country Status (3)

Country Link
EP (1) EP3877859A4 (en)
CN (1) CN113168371A (en)
WO (1) WO2020119050A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220335A (en) * 2021-05-26 2021-08-06 西安热工研究院有限公司 Method for avoiding disorder of multithreading concurrent writing snapshot data
US11366802B1 (en) 2019-09-27 2022-06-21 Amazon Technologies, Inc. Batch undo processing for transaction rollbacks in a multi-writer database
US11379463B1 (en) * 2019-09-27 2022-07-05 Amazon Technologies, Inc. Atomic enforcement of cross-page data constraints in decoupled multi-writer databases
US11874796B1 (en) 2019-09-27 2024-01-16 Amazon Technologies, Inc. Efficient garbage collection in optimistic multi-writer database systems

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0716382A1 (en) * 1994-12-07 1996-06-12 Xerox Corporation Methodology for managing weakly consistent replicated databases
US6981110B1 (en) 2001-10-23 2005-12-27 Stephen Waller Melvin Hardware enforced virtual sequentiality
US20110179230A1 (en) * 2010-01-15 2011-07-21 Sun Microsystems, Inc. Method of read-set and write-set management by distinguishing between shared and non-shared memory regions
US20140040554A1 (en) * 2012-07-31 2014-02-06 Martin T. Pohlack Protecting Large Regions without Operating-System Support
CN105045563A (en) * 2015-06-19 2015-11-11 陕西科技大学 Conflict management method for speculating nested software transactional memory
US20150378777A1 (en) * 2014-06-26 2015-12-31 International Business Machines Corporation Transactional memory operations with read-only atomicity
WO2016086342A1 (en) * 2014-12-01 2016-06-09 华为技术有限公司 Data write-in method, apparatus and device, and storage system
WO2016105786A1 (en) 2014-12-24 2016-06-30 Intel Corporation Systems, apparatuses, and methods for data speculation execution
US20160267009A1 (en) * 2015-03-10 2016-09-15 Oleg Margulis Method and apparatus for memory aliasing detection in an out-of-order instruction execution platform
CN107148617A (en) * 2014-09-19 2017-09-08 亚马逊科技公司 Automatically configuring for storage group is coordinated in daily record
US20180322158A1 (en) * 2017-05-02 2018-11-08 Hewlett Packard Enterprise Development Lp Changing concurrency control modes

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5857182A (en) * 1997-01-21 1999-01-05 International Business Machines Corporation Database management system, method and program for supporting the mutation of a composite object without read/write and write/write conflicts
US6754657B2 (en) * 2001-08-24 2004-06-22 Microsoft Corporation Time stamping of database records
US9619278B2 (en) * 2014-06-26 2017-04-11 Amazon Technologies, Inc. Log-based concurrency control using signatures
US11080271B2 (en) * 2016-09-09 2021-08-03 Sap Se Global database transaction management service

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0716382A1 (en) * 1994-12-07 1996-06-12 Xerox Corporation Methodology for managing weakly consistent replicated databases
US6981110B1 (en) 2001-10-23 2005-12-27 Stephen Waller Melvin Hardware enforced virtual sequentiality
US7475201B1 (en) * 2001-10-23 2009-01-06 Teplin Application Limited Liability Co. Packet processor memory interface with conditional delayed restart
US20110179230A1 (en) * 2010-01-15 2011-07-21 Sun Microsystems, Inc. Method of read-set and write-set management by distinguishing between shared and non-shared memory regions
US20140040554A1 (en) * 2012-07-31 2014-02-06 Martin T. Pohlack Protecting Large Regions without Operating-System Support
US20150378777A1 (en) * 2014-06-26 2015-12-31 International Business Machines Corporation Transactional memory operations with read-only atomicity
CN107148617A (en) * 2014-09-19 2017-09-08 亚马逊科技公司 Automatically configuring for storage group is coordinated in daily record
WO2016086342A1 (en) * 2014-12-01 2016-06-09 华为技术有限公司 Data write-in method, apparatus and device, and storage system
WO2016105786A1 (en) 2014-12-24 2016-06-30 Intel Corporation Systems, apparatuses, and methods for data speculation execution
US20160267009A1 (en) * 2015-03-10 2016-09-15 Oleg Margulis Method and apparatus for memory aliasing detection in an out-of-order instruction execution platform
CN105045563A (en) * 2015-06-19 2015-11-11 陕西科技大学 Conflict management method for speculating nested software transactional memory
US20180322158A1 (en) * 2017-05-02 2018-11-08 Hewlett Packard Enterprise Development Lp Changing concurrency control modes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3877859A4

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11366802B1 (en) 2019-09-27 2022-06-21 Amazon Technologies, Inc. Batch undo processing for transaction rollbacks in a multi-writer database
US11379463B1 (en) * 2019-09-27 2022-07-05 Amazon Technologies, Inc. Atomic enforcement of cross-page data constraints in decoupled multi-writer databases
US11874796B1 (en) 2019-09-27 2024-01-16 Amazon Technologies, Inc. Efficient garbage collection in optimistic multi-writer database systems
CN113220335A (en) * 2021-05-26 2021-08-06 西安热工研究院有限公司 Method for avoiding disorder of multithreading concurrent writing snapshot data
CN113220335B (en) * 2021-05-26 2023-03-14 西安热工研究院有限公司 Method for avoiding disorder of multithreading concurrent writing snapshot data

Also Published As

Publication number Publication date
EP3877859A4 (en) 2022-01-05
EP3877859A1 (en) 2021-09-15
CN113168371A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
US10180946B2 (en) Consistent execution of partial queries in hybrid DBMS
US11874746B2 (en) Transaction commit protocol with recoverable commit identifier
WO2020119050A1 (en) Write-write conflict detection for multi-master shared storage database
US10552372B2 (en) Systems, methods, and computer-readable media for a fast snapshot of application data in storage
CN109923534B (en) Multi-version concurrency control for database records with uncommitted transactions
EP3117348B1 (en) Systems and methods to optimize multi-version support in indexes
US10191932B2 (en) Dependency-aware transaction batching for data replication
US9747356B2 (en) Eager replication of uncommitted transactions
US9336262B2 (en) Accelerated transactions with precommit-time early lock release
US10157108B2 (en) Multi-way, zero-copy, passive transaction log collection in distributed transaction systems
US11263236B2 (en) Real-time cross-system database replication for hybrid-cloud elastic scaling and high-performance data virtualization
US10983981B1 (en) Acid transaction for distributed, versioned key-value databases
US11599514B1 (en) Transactional version sets
US11048669B2 (en) Replicated state management using journal-based registers
US9563522B2 (en) Data recovery for a relational database management system instance in a heterogeneous database system
US9703634B2 (en) Data recovery for a compute node in a heterogeneous database system
US11003550B2 (en) Methods and systems of operating a database management system DBMS in a strong consistency mode
CN112384906A (en) MVCC-based database system asynchronous cache consistency
WO2022242372A1 (en) Object processing method and apparatus, computer device, and storage medium
Dey et al. Scalable distributed transactions across heterogeneous stores
US11709809B1 (en) Tree-based approach for transactionally consistent version sets
US20200249876A1 (en) System and method for data storage management
US11886422B1 (en) Transactional protocol for snapshot isolation without synchronized clocks
US20240126785A1 (en) Failover of database sessions to a logical replica database
US11940972B2 (en) Execution of operations on partitioned tables

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19896966

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019896966

Country of ref document: EP

Effective date: 20210608

NENP Non-entry into the national phase

Ref country code: DE