WO2022120313A1 - Methods for distributed key-value store

Info

Publication number: WO2022120313A1
Authority: WIPO (PCT)
Prior art keywords: transaction, queue, key, transactions, coming
Application number: PCT/US2021/072279
Other languages: French (fr)
Inventors: Hei Tao Fung; Chun Liu
Original Assignee: Futurewei Technologies, Inc.
Application filed by Futurewei Technologies, Inc.
Publication of WO2022120313A1

Classifications

    • G06F 16/2379 Updates performed during online database operations; commit processing
    • G06F 16/2336 Pessimistic concurrency control approaches, e.g. locking or multiple versions without time stamps
    • G06F 16/2358 Change logging, detection, and notification
    • G06F 16/2365 Ensuring data consistency and integrity

Definitions

  • the present disclosure is related to storing data in a distributed database, and in particular to systems and methods for distributed key-value stores.
  • concurrency control (CC) schemes interleave read/write requests from multiple clients simultaneously, giving the illusion that each read/write transaction has exclusive access to the data.
  • Distributed concurrency control refers to the concurrency control of a database distributed over a communication network.
  • Serializability ensures that a schedule for executing concurrent transactions is equivalent to one that executes the transactions serially in some order. It is considered to be the highest level of isolation between concurrent transactions. It assumes that all accesses to the database are done using read and write operations.
  • a desirable goal of a distributed database is distributed serializability, which is the serializability of a schedule of concurrent transactions over a distributed database.
  • Serial Safety Net (SSN) is a serialization certifier that can make read committed (RC) and snapshot isolation (SI) CC schemes achieve a serializable isolation level while allowing more concurrency than the serializable snapshot isolation (SSI) scheme. While SSN provides a fully parallel multi-threading, latch-free, and shared-memory implementation for a multi-version database management system on a single multi-processor server, it falls short of addressing a fully distributed multi-version database management system.
  • a computer implemented method for serializing multi-shard transactions of a storage node of a distributed database system includes tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry corresponding to the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.
  • another implementation of the aspect provides determining a hash value for at least one key of a multi-shard transaction, updating at least one bit of an element of an integer array of the BF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value, determining a hash value for the at least one key of the coming transaction, and checking the value of the element of the integer array indexed according to the determined hash value.
  • another implementation of the aspects provides tracking active transactions using a counting BF (CBF), including determining a hash value for at least one read key of a multi-shard transaction and determining a hash value for at least one write key of the multi-shard transaction, setting a counter value for each of the hash values in an integer array of the CBF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using the determined hash values.
  • the implementation also includes checking the CBF for at least one key of the coming transaction, including determining a hash value for at least one read key of the coming transaction and determining a hash value for at least one write key of the coming transaction, and checking the counter value of elements of the integer array indexed using the determined hash values.
  • another implementation of the aspects provides decrementing the counter value indexed by a hash value of a key of an active transaction when the active transaction is completed.
  • another implementation of the aspects provides rechecking the CBF for the key of the coming transaction after a predetermined duration of time.
  • another implementation of the aspects provides storing the BF in a memory with faster access relative to a memory used to store key-value tuples of the storage node.
  • another implementation of the aspects provides maintaining an independent queue for independent multi-shard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation, maintaining an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively, and enqueueing the coming transaction in the independent queue and adding keys of the coming transaction to the independent queue BF when a check for keys of the coming transaction misses the independent queue BF and the interdependent queue BF.
  • another implementation of the aspects provides an interdependent queue that includes a cold queue and a hot queue, wherein the interdependent queue BF includes a cold queue BF and a hot queue BF, and further provides testing each key of the coming transaction against the hot queue BF when enqueueing the coming transaction, and enqueueing the coming transaction in the hot queue and adding keys of the coming transaction to the hot queue BF when any of the keys hits the hot queue BF.
  • another implementation of the aspects provides receiving a pre-commit request of a key of a coming transaction at a validator instance of the storage node, testing the key of the coming transaction against the hot queue BF, and sending, by the validator instance, an early abort signal for the pre-commit request when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
  • another implementation of the aspects provides identifying a single-shard transaction and validating the single-shard transaction without checking the BF.
  • a distributed computer system that serializes transactions from at least one transaction client in a distributed database system having multiple database shards.
  • the system includes at least one sequencer instance configured to receive a multi-shard transaction from the at least one transaction client and transmit a request for the transaction to multiple storage nodes of the system, and a validator instance included in a storage node of the multiple storage nodes.
  • the validator instance is configured to implement a bloom filter (BF) to track active transactions in the distributed database system, wherein an active transaction is a multi-shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; receive the requested transaction and check the BF for at least one key of the requested transaction; add an entry for the at least one key of the requested transaction to the BF when there is a miss in the check for the at least one key in the BF; queue the requested transaction when there is a hit for the at least one key in the BF; and send a validating message for transactions that are indicated by the BF to be active transactions.
  • a validator instance is configured to: determine a hash value for at least one key of a multi-shard transaction; update at least one bit of an element of an integer array of the BF to indicate the multi-shard transaction is an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; determine a hash value for the at least one key of the requested transaction; and identify the hit for the at least one key of the requested transaction based on a value of the element of the integer array indexed according to the determined hash value for the at least one key of the requested transaction.
  • a validator instance is configured to: determine a hash value for at least one read key of a multi-shard transaction and determine a hash value for at least one write key of the multi-shard transaction; set a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using the determined hash values; determine a hash value for at least one read key of the requested transaction and determine a hash value for at least one write key of the requested transaction; and queue the requested transaction when a counter value of the integer array of the CBF indexed according to either the read key hash value or the write key hash value of the requested transaction indicates a hit for either of the at least one read key or the at least one write key of the requested transaction.
  • another implementation of the aspects provides a storage node that includes a first memory to store the BF and a second memory to store the key-value tuples of the storage node, wherein an access operation to the first memory is faster relative to an access operation of the second memory.
  • a validator instance is configured to: maintain an independent queue for independent multi-shard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation; maintain an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively; and store the requested transaction in the independent queue and store keys of the requested transaction in the independent queue BF when a check for keys of the requested transaction misses the independent queue BF and the interdependent queue BF.
  • a validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF, respectively, for the cold queue and hot queue; test each key of the requested transaction against the hot queue BF when enqueueing the requested transaction; store the requested transaction in the hot queue and include keys of the requested transaction in the hot queue BF when any of the keys hits the hot queue BF; store the requested transaction in the hot queue and include the keys of the requested transaction in the hot queue BF when any of the keys hits the cold queue BF and a minimum value of counter values of the cold queue BF for the keys exceeds a specified threshold counter value; store the requested transaction in the cold queue and include the keys of the requested transaction in the cold queue BF when any of the keys hits the cold queue BF; and store the requested transaction in the cold queue and include the keys of the requested transaction in the cold queue BF when any of the keys hits the independent queue BF.
  • a validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF, respectively, for the cold queue and hot queue; receive a pre-commit operation on a key of the requested transaction; test the key of the requested transaction against the hot queue BF; and send an early abort signal for the pre-commit operation when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
  • a storage server of a distributed database system includes at least one hardware processor and memory storing instructions that cause the at least one hardware processor to perform operations including: tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry for the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.
  • another implementation of the aspect provides instructions to cause the at least one hardware processor to perform operations including: determining a hash value for at least one key of a multi-shard transaction; updating at least one bit of an element of an integer array of the BF to indicate the shard transaction is an active transaction when the shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; and enqueueing the coming transaction when an element of the integer array of the BF indexed according to the hash value of the coming transaction indicates a hit for the at least one key of the coming transaction.
  • another implementation of the aspect provides instructions to cause the at least one hardware processor to perform operations including: determining a read key hash value for at least one read key of a multi-shard transaction and determining a write key hash value for at least one write key of the multi-shard transaction; updating a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using the determined hash values; and enqueueing the coming transaction when a counter value of the integer array of the CBF indexed according to either the read key hash value or the write key hash value of the coming transaction indicates a hit for either of the at least one read key or the at least one write key of the coming transaction.
  • the examples can be implemented in hardware, software or in any combination thereof.
  • the explanations provided for each of the first through third aspects and their implementation forms apply equally to other ones of the first through third aspects and the corresponding implementation forms. These aspects and implementation forms may be used in combination with one another.
  • FIG. 1A illustrates one implementation of a fully distributed database system in an example embodiment.
  • FIG. 1B illustrates another implementation of a fully distributed database system having one (centralized) instance of the sequencer in an example embodiment.
  • FIG. 2A illustrates a routine in the validator instance handling a commit request in an example embodiment.
  • FIG. 2B illustrates a routine in the validator instance handling a read operation initiated by a coordinator in an example embodiment.
  • FIG. 2C illustrates a routine in the validator instance handling a write operation initiated by a coordinator in an example embodiment.
  • FIG. 3 is a flow chart illustrating a method of an overall commit protocol among the coordinator, the sequencer, and the validator instance(s) for determining whether to abort or commit a transaction in an example embodiment.
  • FIG. 4A illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1A for a distributed sequencer in an example embodiment.
  • FIG. 4B illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1B for a centralized sequencer in an example embodiment.
  • FIG. 5 is a flow diagram of an example of a method of using a bloom filter in a distributed database system in an example embodiment.
  • FIG. 6 is an illustration of an example of using a mutex approach to track transactions of a distributed database system in an example embodiment.
  • FIG. 7 is an illustration of an example of using a bloom filter approach to track transactions of a distributed database system in an example embodiment.
  • FIG. 8 is a flow diagram of an example of a method of using a counting bloom filter in a distributed database system in an example embodiment.
  • FIGS. 9A-9F are example distributed serial safety net (DSSN) routines for implementing a counting bloom filter in an example embodiment.
  • FIG. 10 is an illustration of an example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
  • FIG. 11 is an illustration of another example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
  • FIGS. 12A-12H are example DSSN routines for enqueuing a multi-shard transaction in an example embodiment.
  • FIG. 13 is an illustration of another example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
  • FIGS. 14A-14B show an example DSSN routine for identifying and validating a single shard transaction in an example embodiment.
  • FIG. 15 is a flow diagram of an example of a method of recovery of a distributed database system in an example embodiment.
  • FIG. 16 is an illustration of an example of logging transactions of a distributed database system in an example embodiment.
  • FIG. 17 is a block schematic diagram of portions of a computer system to implement one or more example embodiments.
  • the functions or algorithms described herein may be implemented in software in one embodiment.
  • the software may consist of computer executable instructions stored on computer readable media or a computer readable storage device, such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked.
  • the described functions may correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
  • the software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, turning such computer system into a specifically programmed machine.
  • the functionality can be configured to perform an operation using, for instance, software, hardware, firmware, or the like.
  • the phrase “configured to” can refer to a logic circuit structure of a hardware element that is to implement the associated functionality.
  • the phrase “configured to” can also refer to a logic circuit structure of a hardware element that is to implement the coding design of associated functionality of firmware or software.
  • the term “module” refers to a structural element that can be implemented using any suitable hardware (e.g., a processor, among others), software (e.g., an application, among others), firmware, or any combination of hardware, software, and firmware.
  • logic encompasses any functionality for performing a task.
  • each operation illustrated in the flowcharts corresponds to logic for performing that operation.
  • An operation can be performed using software, hardware, firmware, or the like.
  • the terms “component,” “system,” and the like may refer to computer-related entities: hardware, software in execution, firmware, or a combination thereof.
  • a component may be a process running on a processor, an object, an executable, a program, a function, a subroutine, a computer, or a combination of software and hardware.
  • processor may refer to a hardware component, such as a processing unit of a computer system.
  • the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computing device to implement the disclosed subject matter.
  • article of manufacture is intended to encompass a computer program accessible from any computer-readable storage device or media.
  • Computer-readable storage media can include, but are not limited to, magnetic storage devices, e.g., hard disk, floppy disk, magnetic strips, optical disk, compact disk (CD), digital versatile disk (DVD), smart cards, flash memory devices, among others.
  • computer-readable media, i.e., not storage media, may additionally include communication media such as transmission media for wireless signals and the like.
  • a concurrency control scheme making use of serial safety net (SSN) ensures serializable isolation level while offering very high concurrency.
  • prior use of SSN does not make a good distributed concurrency control scheme as the SSN validation would be a central point of access, limiting the scalability of a fully distributed database system.
  • Various examples of the inventive subject matter herein include a fully distributed concurrency control scheme that makes SSN validation distributed, and methods to efficiently manage pending contentious transactions prior to the distributed SSN (DSSN) validation at each of the shard managers in the distributed database.
  • DSSN can greatly improve the performance of an ACID transactional database system, including multi-core databases, clustered databases, and distributed databases, especially geo-distributed databases.
  • DSSN is optimized for distributed transactions that involve many shards, but DSSN also needs to handle single-shard transactions, which will account for the majority of the transactions handled by the validator instance.
  • DSSN should be able to recover from failure scenarios that include storage failure, node failure (including power failure) or network failure. DSSN has its own method of recovering from node failure and network failure.
  • the first method uses a Bloom Filter (BF) in lieu of mutexes for serialization of accesses to the database entries.
  • a counting Bloom Filter is also described.
  • the CBF allows concurrent multi-threaded increments and single-threaded decrements. The benefit is higher concurrency with a reasonable memory footprint.
  • the second method uses multiple queues, each with its own BF, to track dependencies among the pending transaction commit requests. It would be optimal to track the complete dependency graph for them, but a complete dependency graph algorithm runs on the order of N². That makes it impractical in terms of computation power and memory requirements. Multiple queues, including independent, cold, and hot queues, are used to track independent transactions, transactions with short dependency chains, and transactions with long dependency chains, respectively (see the sketch below). The benefit is higher concurrency with a reasonable memory and computation load.
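  • For illustration, the following is a minimal Python sketch of this multi-queue policy. It is an illustration only, not the patent's DSSN enqueue routines (FIGS. 12A-12H): TinyCBF is a degenerate counting Bloom filter with a single hash function, hot_threshold is an assumed promotion threshold, and txn is assumed to expose read_keys and write_keys as sets.

```python
class TinyCBF:
    """Degenerate counting Bloom filter: one hash function, integer counters.

    A production CBF would use k hash functions and take the minimum over
    k counters; a single function keeps the sketch short.
    """

    def __init__(self, slots=1024):
        self.slots = slots
        self.counts = [0] * slots

    def _idx(self, key):
        return hash(key) % self.slots

    def add(self, key):
        self.counts[self._idx(key)] += 1

    def remove(self, key):
        self.counts[self._idx(key)] -= 1

    def count(self, key):
        return self.counts[self._idx(key)]

    def contains(self, key):
        return self.counts[self._idx(key)] > 0


def enqueue(txn, ind_q, cold_q, hot_q, ind_bf, cold_bf, hot_bf, hot_threshold=3):
    """Route a coming transaction to the independent, cold, or hot queue."""
    keys = txn.read_keys | txn.write_keys
    if any(hot_bf.contains(k) for k in keys):
        dest_q, dest_bf = hot_q, hot_bf          # long dependency chain
    elif any(cold_bf.contains(k) for k in keys):
        # Short chain; promote to hot when the contended key is very busy.
        busiest = min(cold_bf.count(k) for k in keys if cold_bf.contains(k))
        dest_q, dest_bf = (hot_q, hot_bf) if busiest > hot_threshold else (cold_q, cold_bf)
    elif any(ind_bf.contains(k) for k in keys):
        dest_q, dest_bf = cold_q, cold_bf        # depends on an independent txn
    else:
        dest_q, dest_bf = ind_q, ind_bf          # no dependency observed
    dest_q.append(txn)
    for k in keys:
        dest_bf.add(k)
```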
  • the third method leverages the hot queue to assess the access frequency of specific database entries so as to throttle associated highly contentious transactions.
  • the hot queue contains the pending transactions that have a long chain of dependencies. The fact that the transactions are queued up indicates frequent access. The indicator enables the transaction clients to abort the transactions earlier, hence improving the overall database performance.
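  • A sketch of the corresponding pre-commit throttle, reusing the TinyCBF stand-in above; the threshold value and return signals are illustrative assumptions.

```python
def maybe_early_abort(key, hot_bf, threshold=5):
    """Early-abort check at pre-commit (sketch of the third method).

    If the key hits the hot queue filter and its counter exceeds the
    threshold, the key is highly contended, so an early abort is signaled
    rather than letting the client wait out a time-out. With TinyCBF's
    single hash function, count() already is the minimum counter value.
    """
    if hot_bf.contains(key) and hot_bf.count(key) > threshold:
        return "EARLY_ABORT"
    return "PROCEED"
```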
  • the fourth method uses a single thread to probe the Bloom filter (BF) without updating the BF when validating single-shard transactions. Handling a single-shard transaction will not modify the BF, as updating the BF is reserved for multi-shard transactions. Avoiding the costly BF update enables fast single-shard transaction validation and conclusion (see the sketch below).
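  • A sketch of this fast path, assuming a txn with read_keys/write_keys sets and any filter exposing a contains() probe (such as the TinyCBF above); the essential point is that the filter is only read, never written.

```python
def validate_single_shard(txn, active_bf):
    """Probe the active transaction set filter without updating it (sketch).

    Only multi-shard transactions insert keys into the filter. A hit means
    this transaction overlaps an in-flight multi-shard validation and must
    wait; a miss means it can be validated and concluded immediately.
    """
    keys = txn.read_keys | txn.write_keys
    if any(active_bf.contains(k) for k in keys):
        return "WAIT"
    return "VALIDATE_NOW"
```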
  • the fifth method provides for recovery of the system in the case of node failure and network failure. Because multi-shard transactions require only a single message exchange between validator modules, the rest of the protocol is computationally deterministic. The same deterministic outcome should be reproduced (recalculated) in the case of node restart or network restart. Because single-shard transactions do not participate in a validation message exchange, a single-shard transaction outcome is stored in a log so that the dependency chain can be reproduced (a log-record sketch follows).
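  • One possible shape for such a log record, sketched as a JSON-lines append-only log; the field names are illustrative assumptions, not the patent's format.

```python
import json
import time


def log_single_shard_outcome(log, txn_id, cts, outcome):
    """Append one single-shard outcome so the dependency chain can be
    replayed after a node or network restart (illustrative record layout)."""
    record = {"txn_id": txn_id, "cts": cts, "outcome": outcome,
              "logged_at": time.time()}
    log.write(json.dumps(record) + "\n")
    log.flush()


# Example: log a committed single-shard transaction.
# with open("dssn_outcomes.log", "a") as log:
#     log_single_shard_outcome(log, txn_id=42, cts=1001, outcome="COMMIT")
```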
  • a database may be considered a data store in general, that can store structured or unstructured data. Each data item is accessible through a key, and the schema of the data items is irrelevant.
  • a transaction refers to operating on one or more data items as one logical unit.
  • the data items of a distributed database are stored and distributed in one or more shards. Each shard is considered as a separate computing and storage unit.
  • a shard manager manages a shard. At a minimum, it services transaction operation requests from other components, provides values for read operation requests according to keys, and stores write values, provided through write operation requests, after validating end operation (i.e., commit operation) requests.
  • a distributed transaction may therefore operate on one or more data items that are single-sharded, i.e., residing on the same shard, or multi-sharded, i.e., residing on multiple shards. Because a distributed multiple shard transaction is a transaction across multiple shards, it can be referred to as a cross-shard transaction. The objective is supporting concurrent, distributed transactions maintaining ACID properties.
  • FIG. 1 A illustrates one implementation of the overall database system 100 with the instances of the functional modules highlighted.
  • the database system 100 in FIG. 1A is fully distributed and includes a distributed sequencer configuration.
  • One instance of the coordinator 110 resides in each transaction client 120, which requests the transaction service.
  • One instance of the sequencer 130 resides along with the instance of the coordinator 110, which has the advantage of minimizing communication latency between the two functional modules.
  • the sequencer 130 implements management rules to dynamically map transactions to one or more of validator instances 140.
  • One validator instance 140 resides inside each storage node 150.
  • the transaction clients 120 and storage nodes 150 are distributed over a network 160.
  • FIG. 1B illustrates another implementation of the overall database system 100’ having a centralized sequencer configuration.
  • One instance of the coordinator 110 resides in each transaction client 125; one instance of the sequencer 135 resides in a separate computing unit; and one instance of the validator instance 140 resides inside each storage node 150.
  • Having one (centralized) instance of the sequencer 135 on the network 160 reduces the complexity of ensuring the transactions’ execution order and logging, though at the expense of higher communication latency and lower system scalability.
  • achieving distributed SSN (DSSN) validation mandates the presence of the sequencer 130 and the co-operation of the validator instances 140 that implement a modified version of the SSN algorithm.
  • the coordinator module 110 is responsible for initiating transactional operations such as read, write, and end operation requests and handling responses to the requests.
  • Each transaction is identified by a unique transaction identifier (ID).
  • a read operation should contain the transaction ID and at least one key.
  • a write operation should contain the transaction ID and at least one key and corresponding value.
  • An end operation requests a commit of the transaction.
  • Each response, from other components of the database system 100, should indicate an acceptance or a rejection of the operation requested. A rejection should cause an abort of the transaction.
  • the acceptance of a read or write operation indicates that the coordinator 110 may move on to the next operation.
  • the acceptance of an end operation indicates that the transaction has been validated and serialized and all written data items are stored.
  • the coordinator 110 knows or finds out how the data items in a transaction are sharded and can send the read and write operation requests to the appropriate shard, bypassing the sequencer 130/135 as an optimization.
  • the coordinator 110 sends the end operation request through the sequencer 130/135 so that the sequencer 130/135 may ensure the ordering of the concurrent transactions.
  • the sequencer 130/135 puts the concurrent transactions that are requested for validation and commit into a sequenced order in order to facilitate the function of the validator instance 140.
  • the implementation of the sequencer 130/135 varies slightly in different system architectures, such as the architectures shown in FIG. 1A and FIG. 1B.
  • the sequencer 130/135 knows or finds out how the data items in a transaction are sharded and sends the commit requests to the validator instance 140 in the appropriate shard manager.
  • Distributed sequencer instances 130 fit well with the distributed database system architecture exemplified in FIG. 1A.
  • a sequencer instance 130 services one or more coordinator instances 110.
  • the sequencer instances 130 exchange clock synchronization messages among themselves so that their local clocks are synchronized within a specified precision.
  • a centralized sequencer 135 fits well with the distributed database system architecture exemplified in FIG. 1B.
  • one sequencer instance 135 services all coordinator instances 110.
  • The modified version of the SSN algorithm to implement DSSN is illustrated in FIGS. 2A, 2B, and 2C.
  • FIG. 2A illustrates a DSSN routine 200 in the validator instance 140 handling a commit request for a portion of a transaction T executing on shard I, denoted as T[I], the transaction having a timestamp denoted as ‘cts.’
  • the DSSN routine 200 includes blocks 210, 220, 230, 235, 240, and 250.
  • the validator instances 140 of all shards involved in the transaction will either abort the transaction or commit the portion being validated.
  • the DSSN routine 200 allows for distributed validation of transactions.
  • the validator instance 140 first checks, in block 210, whether the current transaction T[I] should be serviced right now or delayed until a preceding transaction T’ has a commit or abort result. As the sequencer 130/135 has determined the order of the transactions, the validator instance 140 can differentiate preceding transactions T’, current transaction T, and succeeding transactions T”. The validator instance 140 completes the processing of all preceding transactions T’ first, determining their commit and abort results. The validator instance 140 also delays processing all succeeding transactions T”. If the preceding and current transactions are not completed within a specified time, they are aborted.
  • the validator instance 140 updates the transaction validation values in block 220, namely pi and eta, using its shard’s local metadata.
  • the validator instance 140 multicasts its transaction validation values to all other validator instances 140 of the shard managers involved in the current transaction and waits for the reception of the transaction validation values from the other validator instances 140 of the shards involved in the current transaction. For a single-shard transaction, this step is moot as there is no other shard manager involved. If the validator instance 140 does not receive the transaction validation values from all expected validator instances 140, the current transaction will be timed out and aborted.
  • the received data is stored in a data structure denoted T[J], with a different value J for each other shard in the multi-shard transaction.
  • the validator instance 140 waits for the reception of the transaction validation values of the other validator instances of the shard managers involved in the current transaction and updates its local transaction validation values (e.g., pi(T[I]) and eta(T[I])) in block 235 with the received transaction validation values. If the validator does not receive the transaction validation values from all expected validator instances, the current transaction will be timed out and aborted. Due to the associative and commutative properties of the min() and max() operations for determining the smallest and largest of the given values, respectively, in the SSN validation, all validator instances 140 associated with the current transaction will come to the same final transaction validation values. For a single-shard transaction, this step is moot as there is no other shard manager involved.
  • the validator instance 140 updates its local transaction validation values with the received transaction validation values. Due to the associative and commutative properties of the min() and max() operations, all validator instances associated with the current transaction will come to the same final transaction validation values. The validator instance 140 reaches an abort or commit result and updates local data appropriately.
  • the associated validator instances 140 need to wait for one another to proceed together. If a first validator instance 140 executes transaction T while a second validator instance 140 is executing its preceding transaction T’, then the first validator instance 140 does not execute its succeeding transaction T” and would wait for the second validator instance 140 to execute transaction T for fear that transaction T” would alter the local validation values of transaction T, which are supposed to have been frozen and multi-casted to the second validator instance 140. However, if transaction T” is a single-shard transaction, then first validator instance 140 can confidently execute transaction T” when it determines that transaction T” would not alter the local validation values of transaction T. Therefore, it is possible that a validator instance 140 interleaves some of its single-shard transactions between its multi-shard transactions to achieve higher concurrency.
  • the sequencer 130/135 can help the validator instances 140 to schedule the DSSN validation of the transaction at the same time by providing a timestamp in the commit request messages sent to the validator instances 140.
  • a commit timestamp can serve the purpose of that timestamp.
  • the validator instance 140 receives transaction requests from the sequencer 130/135.
  • the validator instance 140 tracks the transaction requests in an input queue.
  • upon the reception of the commit request for a transaction, the validator instance 140 looks up that transaction in the input queue. If the transaction is a single-shard one, as indicated by the lack of associated validator instances 140 in the metadata of the transaction, the validator instance 140 moves the transaction to a fast-lane queue; otherwise, the validator instance 140 moves the transaction to a slow-lane queue. The validator instance 140 can concurrently execute one transaction from the slow-lane queue and one transaction from the fast-lane queue. If the current fast-lane transaction has an overlapping read-write key set with the current slow-lane transaction, then the current fast-lane transaction is requeued into the fast-lane queue in favor of the next transaction in the fast-lane queue.
  • the executions of the single-shard transactions may be reordered on the fly. This is acceptable because the modified SSN validation will guarantee serializability and abort pending transactions that use values that are outdated by validated and committed transactions. Furthermore, the validator instance 140 may consult the CTS of the next transaction in the slow-lane queue to determine when to execute that transaction.
  • the validator instance 140 maintains the input queue, the fast-lane queue, and the slow-lane queue in the same way as described above.
  • the validator instance 140 may process a batch of sequential transactions in the slow-lane queue. Such batch processing of sequential transactions increases concurrency because the SSN validation needs to wait for communication message exchange to complete one transaction.
  • the validator instance 140 may opt to use one communication message for the transaction validation values of the batch, as opposed to one communication message per transaction. Using one communication message improves the system throughput as communication messages may suffer from relatively high latency.
  • the batch of sequential transactions must have non-overlapping read-write key sets among one another. If the next transaction in the slow-lane queue has an overlapping read-write key set with any of the transactions in the batch, the batch should be terminated and demarcated to exclude the next transaction.
  • an approximate membership query (AMQ) data structure may be used.
  • a Bloom filter may be used to hold the keys of a reference transaction or batch of transactions. Then, the keys of the candidate transaction are tested against the Bloom filter. A hit indicates a possibility of an overlapping on the read-write key sets of the candidate transaction and the reference transaction or batch of transactions. Any overlapping read-write key sets are frozen until all previous multi-shard transactions have been processed. It will be appreciated that a Bloom filter may generate false positives but no false negatives.
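  • A sketch of this batch demarcation, assuming a fresh Bloom-filter-like object per batch (e.g., the TinyCBF stand-in above); a false positive merely terminates a batch early, which is safe.

```python
def demarcate_batch(slow_lane, bf):
    """Pop sequential slow-lane transactions into a batch while their
    read-write key sets stay non-overlapping per the Bloom filter; a hit
    ends the batch and excludes the next transaction. `bf` must be empty."""
    batch = []
    while slow_lane:
        keys = slow_lane[0].read_keys | slow_lane[0].write_keys
        if batch and any(bf.contains(k) for k in keys):
            break
        for k in keys:
            bf.add(k)
        batch.append(slow_lane.pop(0))
    return batch
```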
  • FIG. 2B illustrates an SSN read routine 260 in the validator instance 140 handling a read operation initiated by a coordinator 110 for transaction portion T[I] and version V of the key-value tuple that has been read or written.
  • the version V is locked, and the system provides the latest version V for a read request.
  • the validator instance 140 first checks in block 270 whether the relevant keys are hot. If so, the validator further checks whether the hot keys are owned by other pending transactions. If none of them is owned, then the validator instance 140 owns them and proceeds by verifying no invalidation and then responding to the read operation. If one of the hot keys is owned by other pending transactions, the validator instance 140 delays responding to the read operation until a time-out or until all those hot keys are released.
  • the SSN read routine 260 in FIG. 2B also receives a reference to the appropriate version V returned by the underlying concurrency control algorithm as a parameter.
  • Transaction portion T[I] may record in T[I].pstamp the largest v.cstamp it has seen to reflect the dependency of transaction portion T[I] on the version’s creator at operation 272.
  • T[I] records the smallest v.sstamp in t.sstamp at operation 274 in order to record the read anti-dependency from the transaction that overwrote V (if any).
  • the version is added to transaction portion T[I]’s read set and checked for late-arriving overwrites during pre-commit.
  • the transaction portion T[I] then verifies the exclusion window at operation 276 and aborts if a violation is detected.
  • the transaction portion T[I] may then transition to the aborted status.
  • FIG. 2C illustrates an SSN write routine 280 in the validator instance 140 handling a write operation initiated by a coordinator 110 for transaction portion T[I] and version V, where V refers to a new version generated by the transaction portion T[I].
  • the validator instance 140 first checks whether the relevant keys are hot. If so, the validator instance 140 further checks whether the hot keys are owned by other pending transactions. If none of them is owned, then the validator instance 140 owns them and proceeds by verifying no invalidation and then responding to the write operation. If one of the hot keys is owned by other pending transactions, the validator instance 140 delays responding to the write operation until a time-out or all those hot keys are released.
  • the transaction portion T[I] when updating a version V, updates its predecessor timestamp t.pstamp at operation 292 with v.prev.pstamp, which is then used instead of v.prev.cstamp.
  • the transaction portion T[I] may then record V in its write set for the final validation at pre-commit at operation 294. If more reads are received later, transaction portion T[I] may update t.pstamp with v.prev.pstamp, which was updated by read operations that came after T[I] but installed the new version V before transaction portion T[I] entered pre-commit.
  • Version V is also removed from transaction portion T[I]’s read key set, if present, as updating pi(T[I]) using the edge would violate transaction portion T[I]’s exclusion window and trigger an unnecessary abort. Version V may be removed from a transaction’s read key set by skipping processing of V when examining the read key set, without making the read key set searchable.
  • the keys may be sorted and divided into ranges, and a subset of the ranges may be mapped into a shard deterministically.
  • the coordinator 110, the sequencer 130/135, and the validator instance 140 can evaluate the mapping from keys to shards without coordination.
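  • For illustration, a minimal deterministic range-to-shard mapping of the kind described; the range boundaries and shard assignments below are invented examples.

```python
import bisect

# Sorted upper bounds splitting the key space into len(RANGE_BOUNDS) + 1
# ranges, plus a deterministic range-index -> shard-id table (assumptions).
RANGE_BOUNDS = ["g", "n", "t"]
RANGE_TO_SHARD = [0, 1, 2, 1]


def shard_for_key(key: str) -> int:
    """Every component computes the same mapping with no coordination."""
    return RANGE_TO_SHARD[bisect.bisect_right(RANGE_BOUNDS, key)]


assert shard_for_key("apple") == 0   # "apple" < "g"  -> range 0 -> shard 0
assert shard_for_key("zebra") == 1   # "zebra" > "t"  -> range 3 -> shard 1
```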
  • a commit log is a collection of records of committed transactions.
  • a commit log may be used by the sequencer 130/135 to track the history and help with failure discovery. In general, all commits are written to the commit log before being assigned so that transactions in flight when a shard storage node went down can be recovered and re-assigned by checking the commit log.
  • the commit log can be centralized at one node, e.g., at the sequencer 135 when a centralized sequencer 135 is used. Alternatively, the commit log can be composed of fragments scattered over multiple nodes, e.g., at the coordinator instances 110 or validator instances 140. Each record in the commit log should contain the time or sequence information about the committed transaction so that the re-assignment of the transaction is properly sequenced.
  • the commit log may also contain the transactions that have their read and write operations approved and that are awaiting validation. Logging a pending transaction having passed all of its read and write operations can help failure recovery of a validator instance 140 to resume the validation of the pending transaction quickly.
  • the log messages in the commit log are referred to as commit-intent messages.
  • FIG. 3 is a flow chart illustrating a method 300 of an overall commit protocol among the coordinator 110, the sequencer 130, and the validator instances 140 for determining whether to abort or commit a transaction.
  • the method 300 includes operations 310, 320, 330, 340, 350, and 360.
  • each of the read and write operations of a transaction initiated by a coordinator instance is to be approved individually and independently at each of the validator instances relevant to the transaction. That is, each validator instance 140 uses its local metadata, without dependency on the other validator instances 140, to determine whether to abort the transaction or to approve the operation.
  • the coordinator 110 collects the results and can abort the transaction when one of the results is an abort.
  • the log message can be a commit-intent message stored on the commit log of the sequencer instance(s) 130 when the coordinator 110 is positive to go ahead with an end operation request to the sequencer instance(s) 130.
  • the sequencer instance 130/135 may request those validator instances 140 to validate interdependently the transaction by exchanging local validation parameters about the transaction.
  • the validator instances 140 are supposed to reach the same validation result.
  • the validation result is logged.
  • the sequencer instance 130/135 logs the commit message, nullifying the commit-intent message.
  • the write data of a pending transaction is stored in the storage node 150 associated with a validator instance 140.
  • the storage node 150 makes the write data invisible to other concurrent transactions.
  • the storage node 150 makes the write data visible to other concurrent transactions.
  • the invisible write data is garbage-collected at operation 360 if its transaction is aborted.
  • FIG. 4A illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1A.
  • a coordinator instance 110 can generate read and write operation requests 400 to the validator instances 140 associated with the keys in a transaction, bypassing the sequencer 130.
  • the coordinator instance 110 aborts the transaction if any of the read and write operation requests is not satisfied. Otherwise, the coordinator 110 generates an end operation request 410 to a sequencer instance 130.
  • the sequencer instance 130 receives end operation requests from one or more coordinator instances 110 concurrently.
  • the sequencer instance 130 assigns a CTS at 420, based on its local clock, to each end operation request signifying the order of execution to be expected on the relevant validator instances 140 when the sequencer instance 130 appends the end operation request with the CTS and forwards it, as a commit request, to the relevant validator instances 140.
  • the CTS also helps the validator instance 140 to maintain multiple versions of data items.
  • a validator instance 140 receives out-of-order commit requests from one or more sequencer instances 130. This may occur because the clocks of the sequencer instances 130 may not be perfectly in sync and also because the communication messages from the sequencer instances 130 may arrive at the validator instance 140 asynchronously. As a result, the validator instance 140 does not execute the SSN validation immediately upon receiving a commit request. The validator instance 140 instead may delay for a specified interval at 430 anticipating the possible late arrival of commit requests of lower CTSs and execute them in the proper order. When two commit requests have the same CTS, they are supposed to be from two different sequencer instances 130, and they are ordered using identifiers of the sequencer instances 130 as the tie breaker.
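  • A sketch of this delay-and-reorder step; the delay constant is an assumption, and a real validator would tie it to the sequencers' clock synchronization precision.

```python
import heapq
import time


class CommitReorderBuffer:
    """Hold commit requests briefly, then release them in (CTS, sequencer id)
    order; the sequencer id breaks ties between equal commit timestamps."""

    DELAY = 0.005  # seconds to wait for late, lower-CTS requests (assumed)

    def __init__(self):
        self._heap = []
        self._n = 0  # insertion counter so entries never compare requests

    def push(self, cts, seq_id, request):
        heapq.heappush(self._heap, (cts, seq_id, self._n, time.time(), request))
        self._n += 1

    def pop_ready(self):
        """Yield requests whose delay window has elapsed, lowest CTS first."""
        now = time.time()
        while self._heap and now - self._heap[0][3] >= self.DELAY:
            cts, seq_id, _, _, request = heapq.heappop(self._heap)
            yield cts, seq_id, request
```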
  • the validator instance 140 may abort a multi-shard transaction with a yet-to-be-validated commit request whose CTS is lower than the CTS of the current multi-shard transaction going through the SSN validation. Aborting the transaction at this validator instance 140 will cause aborting the transaction at the other validator instances 140 associated with the transaction because the latter ones will not receive transaction validation values from this validator instance 140 and will time out the transaction.
  • the sequencer instance 130 receives one or more responses from the one or more relevant validator instances 140 of the transaction. All of the responses are supposed to be consistent, indicating either commit or abort, across the board. Therefore, one positive response is enough to trigger the sequencer instance 130 to append its commit log with the transaction at 440.
  • the advantage of having a distributed sequencer 130 is database scalability. As the number of coordinator instances 110 grows as the number of transaction clients 120 grows, more sequencer instances 130 may be added. The disadvantage is clock synchronization, and its precision affects the amount of the delay to account for out-of-order commit requests.
  • FIG. 4B illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1B given a centralized sequencer 135.
  • a coordinator instance 110 can generate read and write operation requests 450 to the validator instances 140 associated with the keys in a transaction, bypassing the sequencer 135.
  • the coordinator instance 110 aborts the transaction if any of the read and write operation requests are not satisfied. Otherwise, the coordinator 110 generates an end operation request 460 to the sequencer 135.
  • the sequencer 135 receives end operation requests 460 from one or more coordinator instances 110 concurrently.
  • the sequencer 135 assigns a sequence number and a CTS 470 to each request signifying the order of execution to be expected on the relevant validator instances 140 when the sequencer 135 appends the end operation request 460 with the sequence number and the CTS and forwards the request, as a commit request, to the relevant validator instances 140.
  • the sequence number helps the validator instance detect any missing communication messages in the case of an unreliable communication channel.
  • the CTS also helps the validator instance to maintain multiple versions of data items.
  • since the centralized sequencer 135 is the single source of the sequence numbers and the CTSs, either using sequence numbers or using CTSs is sufficient to identify the order of execution of commit requests and to support the SSN validation.
  • the sequencer 135 receives one or more responses from the one or more relevant validator instances 140 of the transaction. All of the responses should be consistent, indicating either commit or abort, across the board. Therefore, one positive response is enough to trigger the sequencer 135 to append its commit log with the transaction at 480.
  • An advantage of this embodiment is that having only one sequencer instance makes it easier to ensure that all validator instances see the same order of the concurrent transactions and each validator instance will not receive out-of-order commit requests assuming a reliable communication channel. Also, the sequencer 135 may easily re-order the concurrent transactions, as it knows all of them in the distributed database system 100, for optimizing throughput for the validator instances 140. Furthermore, the sequencer 135 may implement a centralized hot key throttle mechanism as it knows the metadata of all transactions.
  • a disadvantage of having a centralized sequencer 135 is limited database scalability. As the database grows larger in the number of coordinators 110, which is associated with the number of transaction clients 125, the load on the centralized sequencer 135 could be stressed.
  • stressing of the centralized sequencer 135 may be mitigated considering the fact that only multi-shard or cross-shard transactions need to go through the sequencer 135. Single-shard transactions can be offloaded by having the coordinator instances 110 send commit requests directly to the associated validator instances 140. In that case, the validator instance 140 may assign a CTS to a single-shard transaction locally based on interpolation of the CTSs of the immediately succeeding and preceding multi-shard transactions.
  • Validation of multi-shard transactions involves validation message exchanges about the transaction metadata among the multiple shards. During the validation, other succeeding transactions should not affect the read set and write set data and metadata of the transactions going through the validation. The period of serialization can be called a serialization window. Only independent transactions are allowed into the serialization window. Any coming transaction that has a dependency on any of the transactions in the serialization window should wait for later processing and is queued up. Independent transactions that are undergoing multi-shard validation message exchanges are included in an active transaction set.
  • a transaction's access to a data item is protected using the key of that data item.
  • Multi-shard transactions can use a per-key locking approach to control correct concurrency of the transactions.
  • a mutex is a synchronization mechanism like a lock that limits access to a database entry or item.
  • a mutex can be associated with each key of a database system. The finest-grained mutex would be one for each database entry, but this would result in using a lot of memory for the mutexes.
  • a bloom filter (BF) is a probabilistic data structure that is used to test whether an element is a member of a set.
  • a BF can be used by the validator instances to implement the serialization window.
  • a validator instance 140 receives a transaction request from a sequencer instance 130/135. The coming shard transaction is tested for independence by testing the shard transaction against the BF that represents the set of active transactions in the serialization window. This BF can be referred to as the active transaction set BF.
  • a shard transaction has a read data set and a write data set. Each of the read set and write set has zero or more tuples. At least one of the read and write sets should contain a tuple.
  • a validator instance 140 can use the BF to track active shard transactions in the serialization window.
  • An active transaction set can be defined as a set containing all independent multi-shard transactions that are undergoing multi-shard validation message exchanges.
  • An active transaction has an associated entry in the BF and is independent of other active transactions with respect to the BF.
  • a multi-shard transaction is a member of the active transaction set if any tuple in its read set and write set hits the BF. Otherwise, the multi-shard transaction is not a member of the active transaction set.
  • when a validator instance 140 receives a request for a transaction, the validator instance goes through each of the tuples in the read set and write set of the transaction and checks if a tuple of the transaction hits the BF. If there is a hit for a tuple, then the coming transaction is considered to be dependent on a transaction already in the active transaction set. The coming transaction cannot be added to the active transaction set until some transactions in the active transaction set are removed from the set for a re-test. Otherwise, the coming transaction is deemed to be independent and can be added to the active transaction set.
  • the transactions in the active transaction set will go through validation message exchanges and reach either a commit or abort conclusion. The active transactions do so concurrently and simultaneously. Because the transactions are independent of each other, the timing of their conclusions does not affect the serializability of the conclusions.
  • Any concluded transaction can be removed from the active transaction set.
  • the validator instance 140 goes through each tuple of the read set and write set of the transaction and clears the corresponding BF bit for each tuple key of the transaction. Subsequently, a new transaction that depends on the concluded transaction will no longer hit the BF and will be allowed into the active transaction set.
  • FIG. 5 is a flow diagram of an example of a method 500 of using a BF to identify whether a coming transaction is a member of the set of active transactions.
  • the BF is an array of bit elements. Each array element corresponds to one slot, indexed by a hash value of the tuple key.
  • a hash value is derived for each tuple key using a hash function.
  • the test for whether a coming transaction is not an independent transaction is to check the array element (indexed by the determined hash value) for a one, indicating a hit. If the test results in a hit, the tuple is likely already a member of the set of active transactions. If the test results in a miss, the tuple is not a member of the set.
  • the array elements of the BF are checked using a compare and swap (CAS) operation. If the CAS succeeds (meaning no hit), at block 530 the key is added to the granted list. If there is a hit, at block 540, the key is already in the granted list and depends on another transaction, and at block 550, the coming transaction is added to the set of pending transactions. The set of pending transactions may be added to a queue of waiting transactions. If the checks of all the keys of the transaction are misses, the coming transaction is independent. At block 560 the independent transaction is added to the set of active transactions, and multi-shard validation message exchanges for the added transaction can proceed.
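  • A minimal C++ sketch of this method 500 flow follows, assuming a single hash function, a bit-per-slot filter, and illustrative names (ActiveSetBF, TryAdmit); the CAS either claims a clear slot (block 530) or detects a hit, in which case the partially granted keys are rolled back and the transaction is left for the pending set (block 550):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class ActiveSetBF {
 public:
  explicit ActiveSetBF(size_t slots) : bits_(slots) {
    for (auto& b : bits_) b.store(0);
  }

  // Returns true if every key was absent and has now been granted (block 560);
  // on any hit, the previously granted keys are rolled back and false is
  // returned so the caller can queue the transaction (block 550).
  bool TryAdmit(const std::vector<std::string>& keys) {
    std::vector<size_t> granted;
    for (const auto& key : keys) {
      size_t slot = std::hash<std::string>{}(key) % bits_.size();
      uint8_t expected = 0;
      // CAS: succeeds only if the slot was clear (no hit).
      if (!bits_[slot].compare_exchange_strong(expected, 1)) {
        for (size_t s : granted) bits_[s].store(0);  // undo partial grant
        return false;  // dependent on a transaction in the active set
      }
      granted.push_back(slot);
    }
    return true;  // independent: admitted to the active transaction set
  }

  // Called when a transaction concludes: clear its slots so waiters can pass.
  void Remove(const std::vector<std::string>& keys) {
    for (const auto& key : keys)
      bits_[std::hash<std::string>{}(key) % bits_.size()].store(0);
  }

 private:
  std::vector<std::atomic<uint8_t>> bits_;
};
```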
  • FIG. 6 is an illustration of the mutex approach.
  • Each record 602 of the database 650 is provided a mutex (Key_a, Key_b, ... Key_x).
  • Checking for transaction interdependence involves checking each record.
  • FIG. 7 is an illustration of the BF approach.
  • a mutex is provided for the hash values of the keys 604 of the database system. To check dependency of a transaction, only the BF 706 needs to be checked instead of the database 650.
  • the BF 706 can be included in fast memory (e.g., dynamic random access memory (DRAM) of the storage node, or in on-die static random access memory (SRAM)) of the storage node.
  • the memory that stores the BF is faster than the memory that stores the key-value tuples of the storage node. For example, if the key-value tuples are stored in a solid state drive (SSD), then DRAM is the faster memory and the BF may be stored in the DRAM. If the key-value tuples are stored in DRAM, SRAM is the faster memory and the BF may be stored in SRAM.
  • the compactness of the BF allows the BF to reside in faster memory.
  • a counting BF (CBF) allows elements to be removed from the set.
  • a CBF can be used to track active transactions to implement the serialization window. A coming transaction is tested against an active transaction set CBF that represents the set of active transactions in the serialization window.
  • the active transaction set CBF is an array of integers. As in the BF example, each array element corresponds to one slot, indexed by a hash value of the tuple key. However, for a CBF integer array, an integer array element represents a counter value. Using two hash functions, a hash value is derived for the read key of a tuple and a hash value is derived for the write key of the tuple. A counter value is tracked for the tuple read key and a counter value is tracked for the tuple write key.
  • a CBF hit occurs if the tuple key’s two corresponding counter values are both non-zero; otherwise, it is a CBF miss.
  • a tuple is likely a member of the set in the case of a CBF hit. In the case of a miss, the tuple is definitely not a member of the set.
  • the operation for a CBF implementation is similar to that of the BF operation.
  • the validator instance 140 goes through each of the tuples in the read set and write set and checks if the tuple hits the CBF. If one tuple does hit, then the current transaction is considered to be dependent on a transaction already in the active transaction set. It cannot be added to the active transaction set until some transactions in the set are removed from the set and a re-test is performed. Otherwise, the current transaction is deemed to be independent and can be added to the active transaction set.
  • Any concluded transaction is to be removed from the active transaction set.
  • the validator instance goes through each tuple of its read set and write set and decrements the two corresponding counter values of each tuple key. Consequently, a new transaction that depends on the concluded transaction will not result in a hit on the CBF and be allowed into the active transaction set.
  • each tuple key is a string of bytes of arbitrary length and values
  • a hash value for the tuple key is calculated by running through the string of bytes. Testing a transaction against the active transaction membership can be a frequent operation. It would be desirable to expedite the active transaction set BF or CBF testing operation and reduce memory accesses used by the testing operation.
  • One approach to expedite the testing operation is to calculate the hash values for each tuple key only once and store them in an array instead of rereading the tuple key values and re-calculating the hash values again and again. Also, when one of the tuple keys hits the BF/CBF and therefore fails the test, the position of the tuple key is cached. The next membership test for the same transaction will start from the cached position. This may reduce the test time because the same tuple key is likely to fail the test again.
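  • The caching described above might be sketched as follows; the WaitingTxn type, the single hash function, and the Hit callback are assumptions for illustration:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct WaitingTxn {
  std::vector<std::string> keys;
  std::vector<size_t> cached_hashes;  // computed once, reused on every re-test
  size_t resume_at = 0;               // position of the key that last hit
};

void PrecomputeHashes(WaitingTxn& txn) {
  txn.cached_hashes.clear();
  for (const auto& k : txn.keys)
    txn.cached_hashes.push_back(std::hash<std::string>{}(k));
}

// `Hit` stands in for the BF/CBF membership probe on one hash value.
bool PassesMembershipTest(WaitingTxn& txn,
                          const std::function<bool(size_t)>& Hit) {
  size_t n = txn.cached_hashes.size();
  for (size_t i = 0; i < n; ++i) {
    // Start from the previously failing key: it is the one most likely to
    // fail again, so a still-dependent transaction is rejected quickly.
    size_t idx = (txn.resume_at + i) % n;
    if (Hit(txn.cached_hashes[idx])) {
      txn.resume_at = idx;  // cache the failing position for the next re-test
      return false;
    }
  }
  return true;
}
```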
  • Another approach is to add the tuple key to the BF/CBF right after it passes the BF/CBF test.
  • This approach takes advantage of the fresh cache lines still holding the relevant memory contents.
  • If the BF/CBF test fails, the tuple keys of the new transaction that have already been added to the BF/CBF are removed, undoing the effect of having partially added the transaction to the BF/CBF.
  • this approach is more efficient for two reasons. First, the CPU cache lines still contain the parts being used in the test and add operations, so there would be more cache line hits. Second, it is more likely that a transaction is independent of the active transaction set than not when the transaction passes the BF/CBF test.
  • Still another approach is to use a CBF and enable concurrently adding tuple keys to the CBF and removing tuple keys from the CBF.
  • FIG. 8 is a flow diagram of an example of a method 800 of using a CBF to identify whether a coming transaction is a member of the set of active transactions.
  • the CBF is an array of counter values of N bits each, where N is an integer greater than one. Each array element of N bits corresponds to one slot, indexed by a hash value of the tuple key.
  • a hash value is derived for a read tuple key and a write tuple key using a hash function.
  • the test for whether a coming transaction is not an independent transaction is to check the array elements (indexed by the determined hash values) for a nonzero counter value, indicating a hit. If the test results in a hit, the tuple is likely already a member of the set of active transactions. If the test results in a miss, the tuple is not a member of the set.
  • the array elements of the CBF are checked using an atomic increment operation. If the test results in a miss, at block 830 the key is added to the granted list. If the test results in a hit, at block 840, the key is already in the granted list and depends on another transaction, and at block 850, the coming transaction is added to the set of pending transactions. The set of pending transactions may be added to a queue of waiting transactions. If the checks of all the read tuple keys and write tuple keys of the transaction are misses, the coming transaction is independent and at block 860 the coming transaction is added to the set of active transactions, and multi-shard validation message exchanges for the added transaction can proceed.
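  • A hedged C++ sketch of this method 800 flow, assuming two counter slots per tuple key and the CBF hit rule (both counters non-zero); the probe is an atomic increment, and a hit undoes the increments made so far, in the spirit of the add-then-undo approach described above. The class name and second hash function are illustrative:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class ActiveSetCBF {
 public:
  explicit ActiveSetCBF(size_t slots) : counts_(slots) {
    for (auto& c : counts_) c.store(0);
  }

  // Probe-by-increment: each key's two counters are bumped atomically; a key
  // hits when both previous values were non-zero. On a hit, all increments
  // made so far are undone and false is returned (blocks 840/850).
  bool TryAdmit(const std::vector<std::string>& tuple_keys) {
    std::vector<size_t> bumped;
    for (const auto& key : tuple_keys) {
      std::array<size_t, 2> s = Slots(key);
      size_t prev0 = counts_[s[0]].fetch_add(1);
      size_t prev1 = counts_[s[1]].fetch_add(1);
      bumped.push_back(s[0]);
      bumped.push_back(s[1]);
      if (prev0 != 0 && prev1 != 0) {            // CBF hit: dependent
        for (size_t slot : bumped) counts_[slot].fetch_sub(1);
        return false;
      }
    }
    return true;                                 // independent (block 860)
  }

  // On conclusion, decrement the two counters of every tuple key.
  void Remove(const std::vector<std::string>& tuple_keys) {
    for (const auto& key : tuple_keys)
      for (size_t slot : Slots(key)) counts_[slot].fetch_sub(1);
  }

 private:
  std::array<size_t, 2> Slots(const std::string& key) const {
    size_t h1 = std::hash<std::string>{}(key);
    size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;  // illustrative second hash
    return {h1 % counts_.size(), h2 % counts_.size()};
  }
  std::vector<std::atomic<size_t>> counts_;
};
```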
  • FIGS. 9A-9F are example DSSN routines for implementing a CBF for testing shard transactions for independence.
  • FIGS. 9A-9B illustrate a DSSN routine in the validator instance 140 for handling additions of entries to the active transaction set CBF.
  • FIG. 9C illustrates a DSSN routine in the validator instance 140 for handling removal of entries from the active transaction set CBF.
  • FIGS. 9D-9F illustrate a DSSN routine for searching the CBF according to hash values to determine the count value of the CBF array.
  • the validation involves validation message exchanges with other storage nodes that participate in the multi-shard transactions.
  • multi-shard transactions that are independent transactions can go through the validation without waiting. All the independent transactions are put into a serialization window, and only independent transactions are allowed into the serialization window. Any coming transaction that has dependency on any of the transactions in the serialization window is queued up and waits for validation.
  • a single First-In First-Out (FIFO) queue may be used to store and queue waiting transactions. All waiting transactions sequenced by the commit timestamp (CTS) will enter the FIFO queue.
  • the waiting enqueued transactions are recurrently rechecked to test whether any of them is allowed into the active transaction set. Because different waiting transactions may depend on different transactions in the active transaction set, each waiting transaction is tested to maximize processing concurrency. However, it can be computationally expensive to scan through all the waiting enqueued transactions. To make the scanning process efficient, the dependencies among the waiting transactions can be tracked.
  • FIG. 10 is an illustration of an example of tracking dependencies of waiting transactions. Multiple queues are used to store waiting transactions that are awaiting validation by validator instances. Full dependency of the waiting transactions is not tracked to keep the memory footprint manageable.
  • the multi-shard transactions 1002 are stored in multiple queues according to dependency.
  • the queues may be FIFO queues.
  • FIG. 10 shows an active transaction set 1012 of a serialization window, an independent queue 1014, and a dependent queue 1016 of interdependent transactions.
  • the active transaction set 1012, the independent queue 1014, and the dependent queue 1016 are backed by an Active BF 1006, an Independent BF 1008, and an Interdependent BF 1010, respectively. If a coming transaction passes the check for a conflict with the Active BF 1006, it is added to the active transaction set 1012. If there is a conflict with the Active BF 1006 (e.g., there is a hit in the Active BF 1006), the transaction is checked for a conflict with the Independent BF 1008.
  • If there is no conflict with the Independent BF 1008, the transaction is added to the independent queue 1014 and an entry for the transaction is added to the Independent BF 1008.
  • the entry can include a bit or N bits of an integer array (depending on whether the Independent BF 1008 is a BF or a CBF) indexed by one or more hash values determined for keys of the transaction.
  • the independent queue 1014 can be used to queue up the transactions that have been tested to be independent of the waiting transactions sequentially ahead of them. These transactions can be tested against the active transaction set when there is a removal from the active transaction set.
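  • The FIG. 10 triage might be sketched as below; the Filter type is a single-hash stand-in for a BF or CBF, and the queue and function names are assumptions:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Single-hash bit filter standing in for the Active/Independent/Interdependent
// BFs of FIG. 10; a CBF could be substituted without changing the triage.
struct Filter {
  std::vector<bool> bits = std::vector<bool>(1 << 16);
  bool Hit(const std::vector<std::string>& keys) const {
    for (const auto& k : keys)
      if (bits[std::hash<std::string>{}(k) % bits.size()]) return true;
    return false;
  }
  void Add(const std::vector<std::string>& keys) {
    for (const auto& k : keys)
      bits[std::hash<std::string>{}(k) % bits.size()] = true;
  }
};

struct Txn { std::vector<std::string> keys; };

enum class Placement { kActiveSet, kIndependentQueue, kDependentQueue };

Placement Triage(const Txn& txn, Filter& active_bf, Filter& independent_bf,
                 Filter& interdependent_bf, std::deque<Txn>& independent_q,
                 std::deque<Txn>& dependent_q) {
  if (!active_bf.Hit(txn.keys)) {       // no conflict with the active set 1012
    active_bf.Add(txn.keys);
    return Placement::kActiveSet;       // proceed to validation exchanges
  }
  if (!independent_bf.Hit(txn.keys) && !interdependent_bf.Hit(txn.keys)) {
    independent_bf.Add(txn.keys);       // independent of all waiters
    independent_q.push_back(txn);
    return Placement::kIndependentQueue;
  }
  interdependent_bf.Add(txn.keys);      // depends on some waiting transaction
  dependent_q.push_back(txn);
  return Placement::kDependentQueue;
}
```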
  • FIG. 11 is an illustration of another example of tracking dependencies of waiting transactions. In the example of FIG. 11, the dependencies of the dependent multi-shard transactions are tracked further than in the example of FIG. 10.
  • In the example of FIG. 11, three transaction queues are used: an independent queue 1014 and two interdependent queues - a cold queue 1120 and a hot queue 1122.
  • Each queue is backed by a BF, so there is an independent queue BF, a cold queue BF, and a hot queue BF.
  • the BFs can be CBFs.
  • the cold queue 1120 can be used to queue up transactions that have been tested (using the BFs) to be dependent on the waiting transactions already in the independent queue 1014 or cold queue 1120.
  • the hot queue 1122 is used to queue up transactions that have been tested to be dependent on the waiting transactions already in the independent queue 1014, the cold queue 1120 or the hot queue 1122.
  • Transactions in the hot queue 1122 are considered to have long chains of transaction dependency. To test whether a coming transaction should be placed in the hot queue 1122, each key of the coming transaction is tested against the hot queue BF. The coming transaction is enqueued in the hot queue 1122 and entries for the keys of the coming transaction are added to the hot queue BF when one of the tested keys hits the hot queue BF.
  • the coming transaction is enqueued in the cold queue 1120 and the keys of the coming transaction are added to the cold queue BF when any of the tested keys hits the cold queue BF.
  • the size of the cold queue 1120 can be much smaller than the size of the hot queue 1122.
  • a hot queue threshold can be used to determine when to add transactions to the hot queue 1122.
  • the hot queue threshold is a count of transactions, and the threshold count is smaller than the size of the cold queue 1120.
  • the testing for dependency on transactions in the cold queue 1120 is a way to differentiate a transaction that has a long chain of dependency. It is a simplified way to determine dependency to some extent without tracking the full dependency graphs.
  • the coming transaction is enqueued in the independent queue 1014 and entries for the keys of the transaction are added to the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF.
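  • The enqueue rules of FIG. 11 might be sketched as follows, reusing the Filter and Txn types of the previous sketch; the hot queue threshold and the threshold-based promotion from cold to hot are omitted for brevity:

```cpp
enum class Queue { kHot, kCold, kIndependent };

Queue Classify(const Txn& txn, Filter& independent_bf, Filter& cold_bf,
               Filter& hot_bf) {
  if (hot_bf.Hit(txn.keys)) {          // long dependency chain: stay hot
    hot_bf.Add(txn.keys);
    return Queue::kHot;
  }
  if (cold_bf.Hit(txn.keys) || independent_bf.Hit(txn.keys)) {
    cold_bf.Add(txn.keys);             // depends on a waiter ahead of it
    return Queue::kCold;
  }
  independent_bf.Add(txn.keys);        // independent of every waiter
  return Queue::kIndependent;
}
```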
  • Each of the independent queue BF, cold queue BF, and hot queue BF keeps the membership of the transactions enqueued in the corresponding queue.
  • the BF provides a quick way to test dependency.
  • entries for the transaction’s tuple keys are added to the queue’s BF.
  • the transactions enqueued in the independent queue 1014, the cold queue 1120, and the hot queue 1122 are tested against the active transaction set BF for entry into the active transaction set and the serialization window.
  • the commit timestamp (CTS) of the transaction assigned by the sequencer instance 130/135 can be used to select a transaction for testing from the independent queue 1014, the cold queue 1120, or the hot queue 1122.
  • the transactions stored in the independent, cold, and hot queues have CTSs, and the CTS of the transactions results from testing of dependencies sequentially during the insertion procedure. The transactions should be dequeued sequentially.
  • a transaction in the hot queue with a lower CTS should be dequeued before a transaction in the cold queue or the independent queue with a higher CTS.
  • a transaction in the cold queue with a lower CTS should be dequeued before a transaction in the hot queue or the independent queue with a higher CTS.
  • a transaction in the independent queue can be selected any time because it has been tested to have no dependency on any transaction with a lower CTS.
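  • One possible dequeue policy consistent with these rules is sketched below; preferring an available independent transaction first is an assumed policy choice, and the QueuedTxn type is illustrative:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct QueuedTxn { uint64_t cts; /* keys, payload, ... */ };

std::optional<QueuedTxn> PickNext(std::deque<QueuedTxn>& hot,
                                  std::deque<QueuedTxn>& cold,
                                  std::deque<QueuedTxn>& independent) {
  // An independent transaction has no dependency on any lower-CTS waiter,
  // so it can be chosen whenever one is available.
  if (!independent.empty()) {
    QueuedTxn t = independent.front();
    independent.pop_front();
    return t;
  }
  // Otherwise dequeue the lowest CTS across the hot and cold FIFO heads,
  // preserving the sequential ordering between the two queues.
  if (!hot.empty() && (cold.empty() || hot.front().cts < cold.front().cts)) {
    QueuedTxn t = hot.front();
    hot.pop_front();
    return t;
  }
  if (!cold.empty()) {
    QueuedTxn t = cold.front();
    cold.pop_front();
    return t;
  }
  return std::nullopt;
}
```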
  • FIGS. 12A-12D are example DSSN routines for enqueuing a multi-shard transaction in the hot, cold, and independent queues.
  • FIG. 12E is an example of a DSSN routine for removing a multi-shard transaction from a queue.
  • Using multiple queues with BFs to track dependencies can provide a memory efficient and computation efficient method to track dependencies among pending transactions to be serialized.
  • the hot queue 1122 can be used to predict and detect potential contentions in transactions.
  • the hot queue 1122 is used to track dependencies of transaction commit requests about to go through the serialization window. Transaction commit requests that get stored in the hot queue would typically have long dependency chains and therefore are probably highly contentious.
  • a pre-commit request can be, e.g., a pre-commit read request or a pre-commit write request.
  • the validator instance 140 of the node may test the corresponding tuple of the pre-commit request against the BF of the hot queue 1122.
  • the value of the BF bit associated with the tuple is known.
  • the tuple probably depends on the current contentious commit transaction requests of the hot queue 1122.
  • the hot queue BF is a hot queue CBF
  • the counter values of the tuple in the CBF of the hot queue are known. If the minimum value of the two counter values in the hot queue CBF exceeds a specified threshold counter value, it indicates a long dependency chain for the tuple key.
  • the validator instance 140 of the storage node then sends an early abort signal for the pre-commit request to the transaction client.
  • the early abort 1124 reduces the waste of system resources and bandwidth when generating a transaction commit request.
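  • A sketch of the early-abort check follows; the two-counter layout mirrors the hot queue CBF described above, and kAbortThreshold is an assumed tuning knob, not a value stated here:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class HotQueueCBF {
 public:
  explicit HotQueueCBF(size_t slots) : counts_(slots, 0) {}
  void Add(const std::string& key) {
    ++counts_[Slot1(key)];
    ++counts_[Slot2(key)];
  }
  // The smaller of a key's two counters bounds its true count from above.
  uint32_t MinCount(const std::string& key) const {
    return std::min(counts_[Slot1(key)], counts_[Slot2(key)]);
  }

 private:
  size_t Slot1(const std::string& k) const {
    return std::hash<std::string>{}(k) % counts_.size();
  }
  size_t Slot2(const std::string& k) const {
    return (std::hash<std::string>{}(k) * 0x9e3779b97f4a7c15ULL + 1) %
           counts_.size();
  }
  std::vector<uint32_t> counts_;
};

constexpr uint32_t kAbortThreshold = 8;  // assumed threshold counter value

// A long dependency chain for the key suggests high contention, so the
// validator signals an early abort for the pre-commit request.
bool ShouldEarlyAbort(const HotQueueCBF& hot_cbf, const std::string& key) {
  return hot_cbf.MinCount(key) > kAbortThreshold;
}
```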
  • FIG. 13 is an illustration of another example of tracking dependencies of waiting transactions.
  • a single shard transaction only involves one storage node and one validator instance 140. Exchanges of validation messages among validator instances are not needed to validate a single shard transaction. Because other validator instances are not involved, a single shard transaction can be quickly validated as long as the transaction does not have read keys and write keys that collide with the keys being processed in the current active multi-shard transactions.
  • Multi-shard transactions are handled using the Bloom Filters 1006, 1008, and 1010 as described previously herein.
  • single shard transactions are validated as follows.
  • the serialization thread that chooses the transactions to be committed can choose a single shard transaction that does not conflict with the current Active Transaction BF 1006, and can assign the current timestamp as the CTS of the chosen single shard transaction.
  • the Independent BF 1008 and the Interdependent BF 1010 do not need to be checked for conflicts.
  • the same serialization thread can then use the SSN protocol to validate whether the single shard transaction can be committed.
  • the timestamps of keys in its read set and write set, as well as the transaction outcome, can be logged in a transaction log with the completed active transactions of the serialization window, making the outcome persistent. Consequently, the actual timestamps in the system can be updated with the logged timestamps.
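  • The single shard fast path might be sketched as follows; the BF probe, timestamp source, and local SSN validation are passed in as callbacks because their details are not reproduced here, and all names are illustrative:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct SingleShardTxn {
  std::vector<std::string> keys;
  uint64_t cts = 0;
};

// `bf_hit` probes the active transaction BF without updating it; `now`
// supplies the current timestamp; `validate` stands in for the local SSN
// check. Returns true when the transaction commits.
bool FastPathCommit(
    SingleShardTxn& txn,
    const std::function<bool(const std::string&)>& bf_hit,
    const std::function<uint64_t()>& now,
    const std::function<bool(const SingleShardTxn&)>& validate) {
  for (const auto& k : txn.keys)
    if (bf_hit(k)) return false;  // collides with an active multi-shard
                                  // transaction: not "ready", retry later
  txn.cts = now();                // assign the CTS on the fly
  if (!validate(txn)) return false;  // local SSN validation failed: abort
  // Commit: log the keys' timestamps and the outcome, then conclude. The BF
  // is never modified on this path, keeping the fast path cheap.
  return true;
}
```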
  • FIG. 14 is an example of a DSSN routine 1410 for identifying and validating a single shard transaction.
  • the DSSN routine 1410 may be performed by a validator instance 140.
  • Single Shard Transactions are separated from multi-shard transactions by a serialization thread, and the same serialization thread handles validation of the single shard transactions.
  • Multi-shard transactions require communications among multiple validator instances to determine the transaction outcome.
  • the outcome of a single shard transaction can be determined solely by the one validator instance 140 of the single shard and thus will have less latency than multi-shard transactions. Separating the short-latency operations from the long-latency operations and quickly completing the short-latency operations will improve system throughput.
  • any “ready” single shard transaction can be chosen, assigned a CTS on the fly, validated, and concluded.
  • a single shard transaction is defined as being “ready” when the tuple key touched by the single shard transaction does not overlap or collide with any outstanding multi-shard transactions.
  • In a distributed computer system, it is normal to have failures in the system nodes and in the network.
  • the outcome of a transaction can be logged in persistent storage media, and the outcome of the transaction can be retrieved in the event of a failure.
  • with that approach, however, the peer information determined for the transaction needs to be regenerated after a failure.
  • a better approach is for a distributed computer system to move the timing of the transaction logging to a point before the outcome of the transaction is computed.
  • FIG. 15 is a flow diagram of an example of a method 1500 of recovery of a distributed database system in the event of a failure.
  • the method 1500 may be implemented by the distributed database system of FIG. 1A or FIG. IB.
  • the method 1500 is implemented using DSSN.
  • the logging for the transaction occurs before validation.
  • a multi-shard transaction is generated by a transaction client of the distributed computer system.
  • a multi-shard transaction includes peer information and at least one key for each shard of the multi-shard transaction.
  • the peer information may include information (e.g., identifiers) about the peer storage nodes that store the multiple shards and participate in the transaction.
  • the keys of the multi-shard transaction are divided into subsets of keys.
  • the subsets of keys are sent to the validator instances of the peer storage nodes.
  • a validator instance 140 of a storage node of the distributed database system receives the multi-shard transaction and its subset of keys and calculates subset metadata for the subset.
  • the validator instance 140 may receive the multi-shard transaction from a sequencer instance 130/135.
  • the subset metadata condenses the metadata of the keys of the subset into a single summary.
  • the peer information is logged on shared persistent storage of the distributed database system prior to the transaction validation message exchange among the validator instances of the participating storage nodes.
  • the transaction validation messages include the calculated subset metadata.
  • the summarized metadata is exchanged in the messaging and not the full metadata of the individual keys of the subset.
  • the multi-shard transaction is a three-shard transaction that involves three shards of the distributed database system.
  • the multi-shard transaction includes a set of keys (A, B, C, D, E, F).
  • the keys are divided into three subsets of keys, e.g., (A, B), (C, E, D), (F), and each subset is sent to a participating storage node.
  • Each of the validator instances of the participating storage nodes receives its subset of keys and peer information.
  • Each of the validator instances stores the peer information and processes its subset of keys to calculate its subset metadata.
  • the first validator instance will calculate a summary of metadata for keys (A, B), the second validator instance will calculate a summary for keys (C, E, D), and the third validator instance will hold the metadata for key F.
  • Each of the three validator instances sends its calculated metadata to the other validator instances during the transaction validation message exchange.
  • Each shard eventually has the summarized data for the entire set of keys (A, B, C, D, E, F), although only subset summaries were sent. Moving the logging for the transaction earlier, to when the peer information is received, allows the peer information to be retrieved in the event of a failure without having to regenerate the peer information.
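  • A sketch of this early-logging order follows; the contents of the subset metadata summary (min/max timestamps here), the log API, and all type names are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Per-key metadata held by a validator for its subset of the transaction.
struct KeyMeta { std::string key; uint64_t read_ts; uint64_t write_ts; };

// One summary per subset; min/max timestamps are an assumed example of what
// the "single" subset metadata could carry.
struct SubsetMetadata {
  uint64_t min_read_ts = UINT64_MAX;
  uint64_t max_write_ts = 0;
};

struct PeerInfo { std::vector<std::string> peer_node_ids; };

struct PersistentLog {
  std::vector<std::string> records;  // stand-in for durable storage media
  void Append(std::string record) { records.push_back(std::move(record)); }
};

SubsetMetadata Summarize(const std::vector<KeyMeta>& subset) {
  SubsetMetadata m;
  for (const auto& k : subset) {  // fold per-key metadata into one summary
    m.min_read_ts = std::min(m.min_read_ts, k.read_ts);
    m.max_write_ts = std::max(m.max_write_ts, k.write_ts);
  }
  return m;
}

void HandleMultiShardTxn(const PeerInfo& peers,
                         const std::vector<KeyMeta>& my_subset,
                         PersistentLog& log) {
  // Log the peer information FIRST, before any validation message exchange:
  // after a crash the (idempotent) exchange can be re-driven from the log
  // instead of regenerating the peer information.
  for (const auto& id : peers.peer_node_ids) log.Append("peer:" + id);
  SubsetMetadata summary = Summarize(my_subset);
  // ... exchange `summary` (not per-key metadata) with each peer validator ...
  (void)summary;
}
```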
  • FIG. 16 is an illustration of the logging technique. If a waiting transaction 1602 is a multi-shard transaction, when it becomes an independent transaction it eventually will become an active transaction in the active transaction set 1012.
  • An Active BF 1006 may be used to determine when the multi-shard transaction becomes an independent transaction.
  • the peer information is logged at 1632 in the shared persistent storage before the peer information is exchanged at 1634.
  • the peer information may be logged when the transaction becomes an active transaction. If a BF is used to determine independence of the multi-shard transaction, the peer information may be logged when there is a miss in the check for the keys of the transaction in the BF. In certain examples, the peer information is logged after the transaction is evaluated locally by the validator instance.
  • the transaction is validated at 1636 and the outcome of the transaction is determined. If any of the participating validator instances failed to log the peer information (e.g., because of a node failure, a network failure, the transaction missing an out-of-order window of one or more validators, etc.), the multi-shard transaction will not reach an outcome, resulting in a timeout. The failure can be tolerated because all the validator instances 140 of the participating storage nodes can reproduce the peer exchange information of the multi-shard transaction. Because all of the operations involved in validating multi-shard transactions are idempotent, the validation process can be repeated over and over without ill effects.
  • If the validator instance 140 identifies the shard transaction as a single shard transaction 1618, the transaction can be validated 1620 and the outcome can be logged 1622 in the transaction log at the conclusion of the single shard transaction.
  • the outcome of a single shard transaction can be reproduced easily upon a node restart after a failure, and the peer information for a single shard transaction is not stored until after it validates 1620.
  • the active transaction set 1012 includes single shard transactions and multi-shard transactions. Hashed entries for the single shard transactions can be entered into the Active BF 1006 as well as for the multi-shard transactions. The single shard transactions can be quickly validated and quickly removed from the active transaction set 1012 and the Active BF 1006 by the validator instance 140.
  • FIG. 17 is a block schematic diagram of a computer system 1700 for performing methods and algorithms described herein. All components need not be used in various embodiments or examples.
  • One example computing device in the form of a computer 1700 may include a processing unit 1702, memory 1703, removable storage 1710, and nonremovable storage 1712.
  • Although the example computing device is illustrated and described as computer 1700, the computing device may be in different forms in different embodiments.
  • the computing device may instead be a smartphone, a tablet, smartwatch, smart storage device (SSD), or other computing device including the same or similar elements as illustrated and described with regard to FIG. 17.
  • Devices, such as smartphones, tablets, and smartwatches, are generally collectively referred to as mobile devices or user equipment.
  • the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet or server-based storage.
  • an SSD may include a processor on which the parser may be run, allowing transfer of parsed, filtered data through I/O channels between the SSD and main memory.
  • Memory 1703 may include volatile memory 1714 and non-volatile memory 1708.
  • Computer 1700 may include - or have access to a computing environment that includes - a variety of computer-readable media, such as volatile memory 1714 and non-volatile memory 1708, removable storage 1710 and non-removable storage 1712.
  • Computer storage includes random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
  • Computer 1700 may include or have access to a computing environment that includes input interface 1706, output interface 1704, and a communication interface 1716.
  • Output interface 1704 may include a display device, such as a touchscreen, that also may serve as an input device.
  • the input interface 1706 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 1700, and other input devices.
  • the computer may operate in a networked environment using a communication connection to connect to one or more remote computers, such as database servers.
  • the remote computer may include a personal computer (PC), server, router, network PC, a peer device or other common data flow network switch, or the like.
  • the communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other networks.
  • the various components of computer 1700 are connected with a system bus 1720.
  • Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 1702 of the computer 1700, such as a program 1718.
  • the program 1718 in some embodiments comprises software to implement one or more methods described herein.
  • a hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device.
  • the terms computer-readable medium, machine-readable medium, and storage device do not include carrier waves to the extent carrier waves are deemed too transitory.
  • Storage can also include networked storage, such as a storage area network (SAN).
  • Computer program 1718 along with the workspace manager 1722 may be used to cause processing unit 1702 to perform one or more methods or algorithms described herein.

Abstract

A computer-implemented method for serializing multi-shard transactions of a storage node of a distributed database system. The method comprises tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a multi-shard transaction independent of other active transactions with respect to the BF and the multi-shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry for the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.

Description

METHODS FOR DISTRIBUTED KEY-VALUE STORE
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to United States Provisional Application Serial Number 63/121,699, filed December 4, 2020, which is incorporated herein by reference.
TECHNICAL FIELD
[0002] The present disclosure is related to storing data in a distributed database, and in particular to systems and methods for distributed key-value stores.
BACKGROUND
[0003] In database systems and transaction processing, concurrency control (CC) schemes interleave read/write requests from multiple clients simultaneously, giving the illusion that each read/write transaction has exclusive access to the data. Distributed concurrency control refers to the concurrency control of a database distributed over a communication network. Serializability ensures that a schedule for executing concurrent transactions is equivalent to one that executes the transactions serially in some order. It is considered to be the highest level of isolation between concurrent transactions. It assumes that all accesses to the database are done using read and write operations. A desirable goal of a distributed database is distributed serializability, which is the serializability of a schedule of concurrent transactions over a distributed database.
[0004] The most common distributed concurrency control schemes are two-phase locking (2PL), snapshot isolation (SI), and Read Committed (RC). They are also common centralized concurrency control schemes. Each allows more concurrency, i.e., more schedules of concurrent transactions to be permitted and hence higher transaction throughput. 2PL can achieve serializable isolation level, but RC and SI cannot. To make non-serializable concurrency control schemes provide a serializable isolation level, a serialization certifier can be used. One example is serializable snapshot isolation (SSI). A Serial Safety Net (SSN) is a serialization certifier that can make RC and SI CC schemes achieve a serializable isolation level, while allowing more concurrency than the SSI scheme. While SSN provides a fully parallel multi-threading, latch-free, and shared-memory implementation for a multi-version database management system on a single multi-processor server, it is short of addressing a fully distributed multi-version database management system.
SUMMARY
[0005] Various aspects are now described to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. The Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0006] According to a first aspect, there is provided a computer-implemented method for serializing multi-shard transactions of a storage node of a distributed database system. The method includes tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions, checking the BF for at least one key of a coming transaction, adding an entry corresponding to the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF, enqueueing the coming transaction when there is a hit for the at least one key in the BF, and validating the transactions that are indicated by the BF to be active transactions.
[0007] Optionally in the preceding aspect, another implementation of the aspect provides determining a hash value for at least one key of a multi-shard transaction, updating at least one bit of an element of an integer array of the BF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value, determining a hash value for the at least one key of the coming transaction, and checking the value of the element of the integer array indexed according to the determined hash value. [0008] Optionally in any of the preceding aspects, another implementation of the aspects provides tracking active transactions using a counting BF (CBF), including determining a hash value for at least one read key of a multi-shard transaction and determining a hash value for at least one write key of the multi-shard transaction, setting a counter value for each of the hash values in an integer array of the CBF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values. The implementation also includes checking the CBF for at least one key of the coming transaction, including determining a hash value for at least one read key of the coming transaction and determining a hash value for at least one write key of the coming transaction, and checking the counter value of elements of the integer array indexed using the determined hash values.
[0009] Optionally in any of the preceding aspects, another implementation of the aspects provides decrementing the counter value indexed by a hash value of a key of an active transaction when the active transaction is completed.
[0010] Optionally in any of the preceding aspects, another implementation of the aspects provides rechecking the CBF for the key of the coming transaction after a predetermined duration of time.
[0011] Optionally in any of the preceding aspects, another implementation of the aspects provides storing the BF in a memory with faster access relative to a memory used to store key-value tuples of the storage node.
[0012] Optionally in any of the preceding aspects, another implementation of the aspects provides maintaining an independent queue for independent multishard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation, maintaining an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively, and enqueueing the coming transaction in the independent queue and adding keys of the coming transaction to the independent queue BF when a check for keys of the coming transaction misses the independent queue BF and the interdependent queue BF.
[0013] Optionally in any of the preceding aspects, another implementation of the aspects provides an interdependent queue that includes a cold queue and a hot queue, and the interdependent queue BF includes a cold queue BF and a hot queue BF, and testing each key of the coming transaction against the hot queue BF when enqueueing the coming transaction, enqueueing the coming transaction in the hot queue and adding keys of the coming transaction to the hot queue BF when any of the keys hit the hot queue BF, enqueueing the coming transaction in the hot queue and adding the keys of the coming transaction to the hot queue BF when any of the keys hit the cold queue BF and a minimum value of counter values of the cold queue BF for the keys exceeds a specified threshold counter value, enqueueing the coming transaction in the cold queue and adding the keys of the coming transaction to the cold queue BF when any of the keys hits the cold queue BF, enqueueing the coming transaction in the cold queue and adding the keys of the coming transaction to the cold queue BF when any of the keys hits the independent queue BF, and enqueueing the coming transaction in the independent queue and adding the keys to the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF.
[0014] Optionally in any of the preceding aspects, another implementation of the aspects provides receiving a pre-commit request of a key of a coming transaction at a validator instance of the storage node, testing the key of the coming transaction against the hot queue BF, and sending, by the validator instance, an early abort signal for the pre-commit request when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
[0015] Optionally in any of the preceding aspects, another implementation of the aspects provides identifying a single-shard transaction and validating the single-shard transaction without checking the BF.
[0016] According to a second aspect of the present disclosure there is provided a distributed computer system that serializes transactions from at least one transaction client in a distributed database system having multiple database shards. The system includes at least one sequencer instance configured to receive a multi-shard transaction from the at least one transaction client and transmit a request for the transaction to multiple storage nodes of the system, and a validator instance included in a storage node of the multiple storage nodes. The validator instance is configured to implement a bloom filter (BF) to track active transactions in the distributed database system, wherein an active transaction is a multi-shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions, receive the requested transaction and check the BF for at least one key of the requested transaction, add an entry for the at least one key of the requested transaction to the BF when there is a miss in the check for the at least one key in the BF, queue the requested transaction when there is a hit for the at least one key in the BF, and send a validating message for transactions that are indicated by the BF to be active transactions.
[0017] Optionally in the preceding aspect, another implementation of the aspect provides a validator instance is configured to: determine a hash value for at least one key of a multi-shard transaction; update at least one bit of an element of an integer array of the BF to indicate the multi-shard transaction is an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; determine a hash value for the at least one key of the requested transaction; and identify the hit for the at least one key of the requested transaction based on a value of the element of the integer array indexed according to the determined hash value for the at least one key of the requested transaction.
[0018] Optionally in any of the preceding aspects, another implementation of the aspects provides a validator instance is configured to: determine a hash value for at least one read key of a multi-shard transaction and determine a hash value for at least one write key of the multi-shard transaction; set a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values; determine a hash value for at least one read key of the requested transaction and determine a hash value for at least one write key of the requested transaction; and queue the requested transaction when a counter value of the integer array of the CBF indexed according to either of the read key hash value or the write key of the requested transaction indicates a hit for either of the at least one read key or the at least one write key of the requested transaction. [0019] Optionally in any of the preceding aspects, another implementation of the aspects provides a storage node that includes a first memory to store the BF and a second memory to store the key-value tuples of the storage node, wherein an access operation to the first memory is faster relative to an access operation of the second memory.
[0020] Optionally in any of the preceding aspects, another implementation of the aspects provides a validator instance is configured to: maintain an independent queue for independent multi-shard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation; maintain an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively; and store the requested transaction in the independent queue and store keys of the requested transaction in the independent queue BF when a check for keys of the requested transaction misses the independent queue BF and the interdependent queue BF.
[0021] Optionally in any of the preceding aspects, another implementation of the aspects provides a validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF respectively, for the cold queue and hot queue, test each key of the requested transaction against the hot queue BF when enqueueing the requested transaction, store the requested transaction in the hot queue and include keys of the requested transaction in the hot queue BF when any of the keys hit the hot queue BF, store the requested transaction in the hot queue and include the keys of the requested transaction in the hot queue BF when any of the keys hit the cold queue BF and a minimum value of counter values of the cold queue BF for the keys exceeds a specified threshold counter value, store the requested transaction in the cold queue and include the keys of the coming transaction in the cold queue BF when any of the keys hits the cold queue BF, store the requested transaction in the cold queue and include the keys of the requested transaction in the cold queue BF when any of the keys hits the independent queue BF, and store the requested transaction in the independent queue and include the keys to the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF. [0022] Optionally in any of the preceding aspects, another implementation of the aspects provides a validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF respectively, for the cold queue and hot queue; receive a precommit operation on a key of the requested transaction; test the key of the requested transaction against the hot queue BF; and send an early abort signal for the pre-commit operation when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
[0023] According to a third aspect of the present disclosure there is provided a storage server of a distributed database system. The server includes at least one hardware processor and memory storing instructions that cause the at least one hardware processor to perform operations including tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry for the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.
[0024] Optionally in the preceding aspect, another implementation of the aspect provides instructions to cause the at least one hardware processor to perform operations including: determining a hash value for at least one key of a multi-shard transaction; updating at least one bit of an element of an integer array of the BF to indicate the shard transaction is an active transaction when the shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; and enqueueing the coming transaction when an element of the integer array of the BF indexed according to the hash value of the coming transaction indicates a hit for the at least one key of the coming transaction. [0025] Optionally in any of the preceding aspects, another implementation of the aspect provides instructions to cause the at least one hardware processor to perform operations including: determining a read key hash value for at least one read key of a multi-shard transaction and determining a write key hash value for at least one write key of the multi-shard transaction; updating a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values; and enqueueing the coming transaction when a counter value of the integer array of the CBF indexed according to either of the read key hash value or the write key of the coming transaction indicates a hit for either of the at least one read key or the at least one write key of the coming transaction.
[0026] The examples can be implemented in hardware, software or in any combination thereof. The explanations provided for each of the first through third aspects and their implementation forms apply equally to other ones of the first through third aspects and the corresponding implementation forms. These aspects and implementation forms may be used in combination with one another.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] Some figures illustrating example embodiments are included with the text in the detailed description.
[0028] FIG. 1A illustrates one implementation of a fully distributed database system in an example embodiment.
[0029] FIG. IB illustrates another implementation of a fully distributed database system having one (centralized) instance of the sequencer in an example embodiment.
[0030] FIG. 2A illustrates a routine in the validator instance handling a commit request in an example embodiment.
[0031] FIG. 2B illustrates a routine in the validator instance handling a read operation initiated by a coordinator in an example embodiment.
[0032] FIG. 2C illustrates a routine in the validator instance handling a write operation initiated by a coordinator in an example embodiment. [0033] FIG. 3 is a flow chart illustrating a method of an overall commit protocol among the coordinator, the sequencer, and the validator instance(s) for determining whether to abort or commit a transaction in an example embodiment.
[0034] FIG. 4A illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1A for a distributed sequencer in an example embodiment.
[0035] FIG. 4B illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. IB for a centralized sequencer in an example embodiment.
[0036] FIG. 5 is a flow diagram of an example of a method of using a bloom filter in a distributed database system in an example embodiment.
[0037] FIG. 6 is an illustration of an example of using a mutex approach to track transactions of a distributed database system in an example embodiment.
[0038] FIG. 7 is an illustration of an example of using a bloom filter approach to track transactions of a distributed database system in an example embodiment.
[0039] FIG. 8 is a flow diagram of an example of a method of using a counting bloom filter in a distributed database system in an example embodiment.
[0040] FIGS. 9A-9F are example distributed serial safety net (DSSN) routines for implementing a counting bloom filter in an example embodiment.
[0041] FIG. 10 is an illustration of an example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
[0042] FIG. 11 is an illustration of another example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
[0043] FIGS. 12A-12H are example DSSN routines for enqueuing a multishard transaction in an example embodiment.
[0044] FIG. 13 is an illustration of another example of tracking dependencies of waiting transactions using multiple queues in an example embodiment.
[0045] FIGS. 14A-14B show an example DSSN routine for identifying and validating a single shard transaction in an example embodiment. [0046] FIG. 15 is a flow diagram of an example of a method of recovery of a distributed database system in an example embodiment.
[0047] FIG. 16 is an illustration of an example of logging transactions of a distributed database system in an example embodiment.
[0048] FIG. 17 is a block schematic diagram of portions of a computer system to implement one or more example embodiments.
DETAILED DESCRIPTION
[0049] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the present invention. The following description of example embodiments is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.
[0050] The functions or algorithms described herein may be implemented in software in one embodiment. The software may consist of computer executable instructions stored on computer-readable media or a computer-readable storage device such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, turning such computer system into a specifically programmed machine.
[0051] The functionality can be configured to perform an operation using, for instance, software, hardware, firmware, or the like. For example, the phrase “configured to” can refer to a logic circuit structure of a hardware element that is to implement the associated functionality. The phrase “configured to” can also refer to a logic circuit structure of a hardware element that is to implement the coding design of associated functionality of firmware or software. The term “module” refers to a structural element that can be implemented using any suitable hardware (e.g., a processor, among others), software (e.g., an application, among others), firmware, or any combination of hardware, software, and firmware. The term “logic” encompasses any functionality for performing a task. For instance, each operation illustrated in the flowcharts corresponds to logic for performing that operation. An operation can be performed using, software, hardware, firmware, or the like. The terms “component,” “system,” and the like may refer to computer-related entities, hardware, and software in execution, firmware, or combination thereof. A component may be a process running on a processor, an object, an executable, a program, a function, a subroutine, a computer, or a combination of software and hardware. The term “processor,” may refer to a hardware component, such as a processing unit of a computer system.
[0052] Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computing device to implement the disclosed subject matter. The term “article of manufacture,” as used herein is intended to encompass a computer program accessible from any computer-readable storage device or media. Computer-readable storage media can include, but are not limited to, magnetic storage devices, e.g., hard disk, floppy disk, magnetic strips, optical disk, compact disk (CD), digital versatile disk (DVD), smart cards, flash memory devices, among others. In contrast, computer-readable media, i.e., not storage media, may additionally include communication media such as transmission media for wireless signals and the like.
[0053] It is desirable to have a fully distributed database which supports high throughput of concurrent transactions while maintaining Atomicity, Consistency, Isolation, and Durability (ACID) properties. A concurrency control scheme making use of serial safety net (SSN) ensures a serializable isolation level while offering very high concurrency. However, prior use of SSN does not make a good distributed concurrency control scheme, as the SSN validation would be a central point of access, limiting the scalability of a fully distributed database system. Various examples of the inventive subject matter herein include a fully distributed concurrency control scheme that makes SSN validation distributed, and methods to efficiently manage pending contentious transactions prior to the distributed SSN (DSSN) validation at each of the shard managers in the distributed database. DSSN can greatly improve the performance of ACID transactional database systems, including multi-core databases, clustered databases, and distributed databases, especially geo-distributed databases.
[0054] The implementation of DSSN, and of any distributed concurrency control protocol, requires serialization of accesses to database entries in memory during validation of the transaction commit requests. Using a per-entry mutex would maximize the concurrency allowed, but maintaining the mutexes would require substantial memory. Using a table-based mutex would reduce the memory requirement but would also reduce the amount of concurrency.
[0055] DSSN is optimized for distributed transactions that involve many shards, but DSSN also needs to handle single-shard transactions, which account for the majority of the transactions handled by a validator instance.
[0056] DSSN should be able to recover from failure scenarios that include storage failure, node failure (including power failure), and network failure. DSSN has its own method of recovering from node failure and network failure. Because nodes exchange information only once for each transaction, the method of recovery treats node failure and network failure as a whole, which differs from prior approaches.
[0057] Five methods with different emphases are described here. They relate to solutions for serializing accesses to in-memory database entries during validation of the transaction commit requests.
[0058] The first method uses a Bloom Filter (BF) in lieu of mutexes for serialization of accesses to the database entries. A counting Bloom Filter (CBF) is also described. The CBF allows concurrent multi-threaded increments and single-threaded decrements. The benefit is higher concurrency with a reasonable memory footprint.
[0059] The second method uses multiple queues, each with its own BF, to track dependencies among the pending transaction commit requests. It would be optimal to track the complete dependency graph for them, but a complete dependency graph algorithm runs on the order of N², which makes it impractical in terms of computation power and memory requirements. Multiple queues, including independent, cold, and hot queues, are used to track independent transactions, transactions with short dependency chains, and transactions with long dependency chains, respectively. The benefit is higher concurrency with a reasonable memory and computation load.
[0060] The third method leverages the hot queue to assess the access frequency of specific database entries so as to throttle associated highly contentious transactions. The hot queue contains the pending transactions that have a long chain of dependencies. The fact that the transactions are queued up indicates frequent access. This indicator enables the transaction clients to abort the transactions earlier, thereby improving the overall database performance.
[0061] The fourth method uses a single thread to probe the Bloom Filter (BF) without updating the BF when validating single-shard transactions. Handling a single-shard transaction does not modify the BF, as updating the BF is reserved for multi-shard transactions. Skipping the costly BF update enables fast single-shard transaction validation and conclusion.
[0062] The fifth method provides for recovery of the system in the case of node failure and network failure. Because multi-shard transactions require only a single message exchange between validator modules, the rest of the protocol is computationally deterministic. The same deterministic outcome can be reproduced (recalculated) in the case of a node restart or network restart. Because single-shard transactions do not participate in a validation message exchange, a single-shard transaction outcome is stored in a log so that the dependency chain can be reproduced.
[0063] The inventive subject matter focuses on the components of a distributed database. A database may be considered a data store in general that can store structured or unstructured data. Each data item is accessible through a key, and the schema of the data items is irrelevant. A transaction refers to operating on one or more data items as one logical unit. The data items of a distributed database are stored and distributed in one or more shards. Each shard is considered a separate computing and storage unit. A shard manager manages a shard and at least provides the functions of servicing transaction operation requests from other components, providing values in response to read operation requests according to keys, and storing write values, provided through write operation requests, after validating end operation (i.e., commit operation) requests. A distributed transaction may therefore operate on one or more data items that are single-sharded, i.e., residing on the same shard, or multi-sharded, i.e., residing on multiple shards. Because a distributed multiple-shard transaction is a transaction across multiple shards, it can be referred to as a cross-shard transaction. The objective is to support concurrent, distributed transactions maintaining the ACID properties.
[0064] In one example, there are three relevant functional modules: a coordinator, a sequencer, and a validator. Depending on the overall system architecture, one or more instances of the functional modules may reside in one or more components of the system. FIG. 1A illustrates one implementation of the overall database system 100 with the instances of the functional modules highlighted. The database system 100 in FIG. 1A is fully distributed and includes a distributed sequencer configuration. One instance of the coordinator 110 resides in each transaction client 120, which requests the transaction service. One instance of the sequencer 130 resides along with the instance of the coordinator 110, which has the advantage of minimizing communication latency between the two functional modules. The sequencer 130 implements management rules to dynamically map transactions to one or more of the validator instances 140. One validator instance 140 resides inside each storage node 150. The transaction clients 120 and storage nodes 150 are distributed over a network 160.
[0065] FIG. 1B illustrates another implementation of the overall database system 100’ having a centralized sequencer configuration. One instance of the coordinator 110 resides in each transaction client 125; one instance of the sequencer 135 resides in a separate computing unit; and one validator instance 140 resides inside each storage node 150. Having one (centralized) instance of the sequencer 135 on the network 160 reduces the complexity of ensuring the transactions’ execution order and logging, though at the expense of higher communication latency and lower system scalability.

[0066] In sample embodiments, achieving distributed SSN (DSSN) validation mandates the presence of the sequencer 130 and the co-operation of the validator instances 140 that implement a modified version of the SSN algorithm. The presence of the coordinator 110 completes the concurrency control scheme.

[0067] In sample embodiments, the coordinator module 110 is responsible for initiating transactional operations, such as read, write, and end operation requests, and for handling responses to the requests. Each transaction is identified by a unique transaction identifier (ID). A read operation should contain the transaction ID and at least one key, while a write operation should contain the transaction ID and at least one key and a corresponding value. An end operation requests a commit of the transaction. Each response, from other components of the database system 100, should indicate an acceptance or a rejection of the requested operation. A rejection should cause an abort of the transaction. The acceptance of a read or write operation indicates that the coordinator 110 may move on to the next operation. The acceptance of an end operation indicates that the transaction has been validated and serialized and that all written data items are stored. In an embodiment, the coordinator 110 knows or finds out how the data items in a transaction are sharded and can send the read and write operation requests to the appropriate shard, bypassing the sequencer 130/135 as an optimization. For the end operation request, the coordinator 110 sends the request through the sequencer 130/135 so that the sequencer 130/135 may ensure the ordering of the concurrent transactions.
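As an illustration of the request/response protocol of paragraph [0067], the following C++ sketch models the three operation request messages. All struct and field names here are assumptions for illustration, not the patent's wire format.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using TxId = std::uint64_t;  // unique transaction identifier

struct ReadRequest {   // read: transaction ID plus at least one key
    TxId txId;
    std::vector<std::string> keys;
};

struct WriteRequest {  // write: transaction ID plus key/value pairs
    TxId txId;
    std::vector<std::pair<std::string, std::string>> keyValues;
};

struct EndRequest {    // end: requests a commit; routed via the sequencer
    TxId txId;
};

enum class Response { Accepted, Rejected };  // a rejection aborts the transaction
```

In this model, a coordinator would issue ReadRequest and WriteRequest messages directly to the shards and route the EndRequest through the sequencer, matching the optimization described above.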
[0068] The sequencer 130/135 puts the concurrent transactions that are requested for validation and commit into a sequenced order to facilitate the function of the validator instances 140. The implementation of the sequencer 130/135 varies slightly in different system architectures, such as the architectures shown in FIG. 1A and FIG. 1B. The sequencer 130/135 knows or finds out how the data items in a transaction are sharded and sends the commit requests to the validator instance 140 in the appropriate shard manager.
[0069] Distributed sequencer instances 130 fit well with the distributed database system architecture exemplified in FIG. 1A. In this embodiment, a sequencer instance 130 services one or more coordinator instances 110. The sequencer instances 130 exchange clock synchronization messages among themselves so that their local clocks are synchronized within a specified precision. A centralized sequencer 135 fits well with the distributed database system architecture exemplified in FIG. 1B. In this embodiment, one sequencer instance 135 services all coordinator instances 110.

[0070] The modified version of the SSN algorithm to implement DSSN is illustrated in FIGS. 2A, 2B, and 2C.
[0071] FIG. 2A illustrates a DSSN routine 200 in the validator instance 140 handling a commit request for a portion of a transaction T executing on shard I, denoted as T[I], the transaction having a timestamp denoted as ‘cts.’ The DSSN routine 200 includes blocks 210, 220, 230, 235, 240, and 250. The validator instances 140 of all shards involved in the transaction will either abort the transaction or commit the portion being validated. Thus, the DSSN routine 200 allows for distributed validation of transactions.
[0072] To handle a commit request, the validator instance 140 first checks, in block 210, whether the current transaction T[I] should be serviced right now or delayed until a preceding transaction T’ has a commit or abort result. As the sequencer 130/135 has determined the order of the transactions, the validator instance 140 can differentiate preceding transactions T’, current transaction T, and succeeding transactions T”. The validator instance 140 completes the processing of all preceding transactions T’ first, determining their commit and abort results. The validator instance 140 also delays processing all succeeding transactions T”. If the preceding and current transactions are not completed within a specified time, they are aborted.
[0073] Second, the validator instance 140 updates the transaction validation values, namely pi and eta, in block 220 using its shard’s local metadata. Third, in block 230, the validator instance 140 multicasts its transaction validation values to all other validator instances 140 of the shard managers involved in the current transaction and waits for the reception of the transaction validation values from the other validator instances 140 of the shards involved in the current transaction. For a single-shard transaction, this step is moot, as there is no other shard manager involved. If the validator instance 140 does not receive the values from all expected validator instances 140, the current transaction is timed out and aborted. The received data is stored in a data structure denoted T[J], with a different value J for each other shard in the multi-shard transaction.
[0074] Fourth, the validator instance 140 waits for the reception of the transaction validation values of the other validator instances of the shard managers involved in the current transaction and updates its local transaction validation values (e.g., pi(T[I]) and eta(T[I])) in block 235 with the received transaction validation values. If the validator does not receive the values from all expected validator instances, the current transaction is timed out and aborted. Due to the associative and commutative properties of the min() and max() operations for determining the smallest and largest of the given values, respectively, in the SSN validation, all validator instances 140 associated with the current transaction will come to the same final transaction validation values. For a single-shard transaction, this step is moot, as there is no other shard manager involved.
[0075] Finally, the validator instance 140 updates its local transaction validation values with the received transaction validation values. Due to the associative and commutative properties of the min() and max() operations, all validator instances associated with the current transaction will come to the same final transaction validation values. The validator instance 140 reaches an abort or commit result and updates local data appropriately.
[0076] It will be appreciated that single-shard transactions make the exchange of local transaction validation values moot. Multi-shard transactions require the exchange of relevant local transaction validation values of validator instances 140.
[0077] During the exchange of local transaction validation values, the associated validator instances 140 need to wait for one another to proceed together. If a first validator instance 140 executes transaction T while a second validator instance 140 is executing its preceding transaction T’, then the first validator instance 140 does not execute its succeeding transaction T”; it waits for the second validator instance 140 to execute transaction T, lest transaction T” alter the local validation values of transaction T, which are supposed to have been frozen and multicast to the second validator instance 140. However, if transaction T” is a single-shard transaction, then the first validator instance 140 can confidently execute transaction T” when it determines that transaction T” would not alter the local validation values of transaction T. Therefore, it is possible for a validator instance 140 to interleave some of its single-shard transactions between its multi-shard transactions to achieve higher concurrency.
[0078] As the DSSN validation of a multi-shard transaction needs to wait for communication message exchange and could stall the next multi-shard transaction in sequence, it is desirable that all relevant validator instances of a transaction exchange communication messages at the same time in order to minimize the wait time. The sequencer 130/135 can help the validator instances 140 to schedule the DSSN validation of the transaction at the same time by providing a timestamp in the commit request messages sent to the validator instances 140. A commit timestamp (CTS) can serve the purpose of that timestamp.
[0079] In one embodiment of the validator instance module 140, the validator instance 140 receives transaction requests from the sequencer 130/135. The validator instance 140 tracks the transaction requests in an input queue. Upon the reception of the commit request for a transaction, the validator instance 140 looks at that transaction in the input queue. If the transaction is a single-shard one, as indicated by the lack of associated validator instances 140 in the metadata of the transaction, the validator instance 140 moves the transaction to a fast-lane queue; otherwise, the validator instance 140 moves the transaction to a slow-lane queue. The validator instance 140 can concurrently execute one transaction from the slow-lane queue and one transaction from the fast-lane queue. If the current fast-lane transaction has an overlapping read-write key set with the current slow-lane transaction, then the current fast-lane transaction is re-queued into the fast-lane queue in favor of the next transaction in the fast-lane queue. In that regard, the executions of the single-shard transactions may be reordered on the fly. This is acceptable because the modified SSN validation will guarantee serializability and abort pending transactions that use values that are outdated by validated and committed transactions. Furthermore, the validator instance 140 may consult the CTS of the next transaction in the slow-lane queue to determine when to execute that transaction.
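A minimal C++ sketch of this fast-lane/slow-lane dispatch follows, assuming that a transaction's metadata lists its peer validator instances and that read-write key sets are held as string sets; the type and member names are illustrative only, not the patent's data structures.

```cpp
#include <deque>
#include <set>
#include <string>
#include <vector>

struct Txn {
    std::vector<int> peerValidators;      // empty => single-shard transaction
    std::set<std::string> readWriteKeys;  // union of read-set and write-set keys
    bool isSingleShard() const { return peerValidators.empty(); }
    bool overlaps(const Txn& other) const {
        for (const auto& k : readWriteKeys)
            if (other.readWriteKeys.count(k)) return true;
        return false;
    }
};

struct LaneDispatcher {
    std::deque<Txn> input, fastLane, slowLane;

    // Move each transaction with a commit request to the proper lane.
    void classify() {
        while (!input.empty()) {
            Txn t = std::move(input.front());
            input.pop_front();
            (t.isSingleShard() ? fastLane : slowLane).push_back(std::move(t));
        }
    }

    // Requeue the head fast-lane transaction when it conflicts with the
    // slow-lane transaction currently being validated; single-shard
    // executions may thus be reordered on the fly, as the text notes.
    bool nextFastLane(const Txn& currentSlow, Txn& out) {
        if (fastLane.empty()) return false;
        out = std::move(fastLane.front());
        fastLane.pop_front();
        if (out.overlaps(currentSlow)) {
            fastLane.push_back(std::move(out));  // try again later
            return false;
        }
        return true;  // safe to validate concurrently with the slow lane
    }
};
```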
[0080] In another embodiment of the validator instance 140, the validator instance 140 maintains the input queue, the fast-lane queue, and the slow-lane queue in the same way as described above. However, in this embodiment, the validator instance 140 may process a batch of sequential transactions in the slow-lane queue. Such batch processing of sequential transactions increases concurrency because the SSN validation needs to wait for communication message exchange to complete one transaction. In addition, the validator instance 140 may opt to use one communication message for the transaction validation values of the batch, as opposed to one communication message per transaction. Using one communication message improves the system throughput as communication messages may suffer from relatively high latency. The batch of sequential transactions must have non-overlapping read-write key sets among one another. If the next transaction in the slow-lane queue has an overlapping read-write key set with any of the transactions in the batch, the batch should be terminated and demarcated to exclude the next transaction.
[0081] To identify whether a transaction has an overlapping read-write key set with another transaction or batch of transactions efficiently, an approximate membership query (AMQ) data structure may be used. For example, a Bloom filter may be used to hold the keys of a reference transaction or batch of transactions. Then, the keys of the candidate transaction are tested against the Bloom filter. A hit indicates a possibility of an overlapping on the read-write key sets of the candidate transaction and the reference transaction or batch of transactions. Any overlapping read-write key sets are frozen until all previous multi-shard transactions have been processed. It will be appreciated that a Bloom filter may generate false positives but no false negatives.
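The following sketch illustrates this AMQ-based overlap test with a simple Bloom filter, assuming two salted std::hash probes per key; the filter size and hashing scheme are arbitrary choices for illustration. Because a Bloom filter produces false positives but no false negatives, a hit can at worst end a batch early; it can never admit an overlapping transaction.

```cpp
#include <bitset>
#include <functional>
#include <string>

class BloomFilter {
    static constexpr std::size_t kBits = 1 << 16;  // example size
    std::bitset<kBits> bits_;
    static std::size_t h1(const std::string& k) { return std::hash<std::string>{}(k) % kBits; }
    static std::size_t h2(const std::string& k) { return std::hash<std::string>{}(k + "#") % kBits; }
public:
    void add(const std::string& k) { bits_.set(h1(k)); bits_.set(h2(k)); }
    bool mightContain(const std::string& k) const {
        return bits_.test(h1(k)) && bits_.test(h2(k));  // false positives possible
    }
};

// Extend the batch only while every key of the next transaction misses the
// BF; a hit (possibly a false positive) safely demarcates the batch there.
template <typename TxnIt>
TxnIt demarcateBatch(TxnIt first, TxnIt last, BloomFilter& bf) {
    for (; first != last; ++first) {
        for (const auto& k : first->readWriteKeys)
            if (bf.mightContain(k)) return first;  // batch excludes this txn
        for (const auto& k : first->readWriteKeys) bf.add(k);
    }
    return last;
}
```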
[0082] FIG. 2B illustrates an SSN read routine 260 in the validator instance 140 handling a read operation initiated by a coordinator 110 for transaction portion T[I] and version V of the key-value tuple that has been read or written. The version V is locked, and the system provides the latest version V for a read request. The validator instance 140 first checks in block 270 whether the relevant keys are hot. If so, the validator further checks whether the hot keys are owned by other pending transactions. If none of them is owned, then the validator instance 140 owns them and proceeds by verifying no invalidation and then responding to the read operation. If one of the hot keys is owned by other pending transactions, the validator instance 140 delays responding to the read operation until a time-out or until all those hot keys are released.
[0083] Besides reading transaction portion T[I], the SSN read routine 260 in FIG. 2B also receives a reference to the appropriate version V returned by the underlying concurrency control algorithm as a parameter. Transaction portion T[I] may record in T[I].pstamp the largest v.cstamp it has seen to reflect the dependency of transaction portion T[I] on the version’s creator at operation 272. T[I] records the smallest v.sstamp in T[I].sstamp at operation 274 in order to record the read anti-dependency from the transaction that overwrote V (if any). As shown at operation 274, if the version has not yet been overwritten, the version is added to transaction portion T[I]’s read set and checked for late-arriving overwrites during pre-commit. The transaction portion T[I] then verifies the exclusion window at operation 276 and aborts if a violation is detected. The transaction portion T[I] may then transition to the aborted status.
[0084] FIG. 2C illustrates an SSN write routine 280 in the validator instance 140 handling a write operation initiated by a coordinator 110 for transaction portion T[I] and version V, where V refers to a new version generated by the transaction portion T[I]. The validator instance 140 first checks whether the relevant keys are hot. If so, the validator instance 140 further checks whether the hot keys are owned by other pending transactions. If none of them is owned, then the validator instance 140 owns them and proceeds by verifying no invalidation and then responding to the write operation. If one of the hot keys is owned by other pending transactions, the validator instance 140 delays responding to the write operation until a time-out or until all those hot keys are released.
[0085] In FIG. 2C, when updating a version V, the transaction portion T[I] updates its predecessor timestamp T[I].pstamp at operation 292 with v.prev.pstamp, which is then used instead of v.prev.cstamp. The transaction portion T[I] may then record V in its write set for the final validation at pre-commit at operation 294. If more reads are received later, transaction portion T[I] may update T[I].pstamp with v.prev.pstamp, which was updated by read operations that came after T[I] but installed the new version V before transaction portion T[I] entered pre-commit. Version V is also removed from transaction portion T[I]’s read key set, if present, as updating pi(T[I]) using the edge would violate transaction portion T[I]’s exclusion window and trigger an unnecessary abort. Version V may be removed from a transaction’s read key set by skipping processing of V when examining the read key set, without making the read key set searchable.
[0086] How data items are sharded can be implemented in various ways. For example, the keys may be sorted and divided into ranges, and a subset of the ranges may be mapped into a shard deterministically. The coordinator 110, the sequencer 130/135, and the validator instance 140 can evaluate the mapping from keys to shards without coordination. Alternatively, there can be a shard manager that centrally determines the mappings, and the coordinator 110, the sequencer 130/135, and the validator instance 140 may query the mappings of the shard manager and be informed of changes, which are expected to be infrequent.
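As a sketch of the first (deterministic) option, the following shows a sorted-range mapping that the coordinator, sequencer, or validator could each evaluate without coordination; the range boundaries here are arbitrary examples, not values from the patent.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct RangeShardMap {
    // Sorted exclusive upper bounds of key ranges; keys at or beyond the
    // last bound map to the final shard (index upperBounds.size()).
    std::vector<std::string> upperBounds{"g", "n", "t"};

    int shardOf(const std::string& key) const {
        // upper_bound yields the index of the first range whose bound
        // exceeds the key, which is the shard number by construction.
        auto it = std::upper_bound(upperBounds.begin(), upperBounds.end(), key);
        return static_cast<int>(it - upperBounds.begin());
    }
};
```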
[0087] A commit log is a collection of records of committed transactions. A commit log may be used by the sequencer 130/135 to track the history and help with failure discovery. In general, all commits are written to the commit log before being assigned so that transactions in flight when a shard storage node went down can be recovered and re-assigned by checking the commit log. The commit log can be centralized at one node, e.g., at the sequencer 135 when a centralized sequencer 135 is used. Alternatively, the commit log can be composed of fragments scattered over multiple nodes, e.g., at the coordinator instances 110 or validator instances 140. Each record in the commit log should contain the time or sequence information about the committed transaction so that the re-assignment of the transaction is properly sequenced.
[0088] The commit log may also contain the transactions that have their read and write operations approved and that are awaiting validation. Logging a pending transaction having passed all of its read and write operations can help failure recovery of a validator instance 140 to resume the validation of the pending transaction quickly. The log messages in the commit log are referred to as commit-intent messages.
[0089] FIG. 3 is a flow chart illustrating a method 300 of an overall commit protocol among the coordinator 110, the sequencer 130, and the validator instances 140 for determining whether to abort or commit a transaction. The method 300 includes operations 310, 320, 330, 340, 350, and 360.
[0090] First, at operation 310, each of the read and write operations of a transaction initiated by a coordinator instance is to be approved individually and independently at each of the validator instances relevant to the transaction. That is, each validator instance 140 uses its local metadata, without dependency on the other validator instances 140, to determine whether to abort the transaction or to approve the operation. The coordinator 110 collects the results and can abort the transaction when one of the results is an abort.
[0091] Second, at operation 320, the overall result of the previous step is logged. In a sample embodiment, the log message can be a commit-intent message stored on the commit log of the sequencer instance(s) 130 when the coordinator 110 decides to go ahead with an end operation request to the sequencer instance(s) 130.
[0092] Third, at operation 330, the sequencer instance 130/135 may request those validator instances 140 to interdependently validate the transaction by exchanging local validation parameters about the transaction. The validator instances 140 are supposed to reach the same validation result.
[0093] Fourth, at operation 340, the validation result is logged. In the sample embodiment, the sequencer instance 130/135 logs the commit message, nullifying the commit-intent message.
[0094] At operation 350, the write data of a pending transaction is stored in the storage node 150 associated with a validator instance 140. The storage node 150 makes the write data invisible to other concurrent transactions. As soon as the validator instance 140 changes the transaction status to commit, the storage node 150 makes the write data visible to other concurrent transactions. The invisible write data is garbage-collected at operation 360 if its transaction is aborted.
[0095] FIG. 4A illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1A. A coordinator instance 110 can generate read and write operation requests 400 to the validator instances 140 associated with the keys in a transaction, bypassing the sequencer 130. The coordinator instance 110 aborts the transaction if any of the read and write operation requests is not satisfied. Otherwise, the coordinator 110 generates an end operation request 410 to a sequencer instance 130.
[0096] The sequencer instance 130 receives end operation requests from one or more coordinator instances 110 concurrently. The sequencer instance 130 assigns a CTS at 420, based on its local clock, to each end operation request signifying the order of execution to be expected on the relevant validator instances 140 when the sequencer instance 130 appends the end operation request with the CTS and forwards it, as a commit request, to the relevant validator instances 140. The CTS also helps the validator instance 140 to maintain multiple versions of data items.
[0097] It is possible that a validator instance 140 receives out-of-order commit requests from one or more sequencer instances 130. This may occur because the clocks of the sequencer instances 130 may not be perfectly in sync and also because the communication messages from the sequencer instances 130 may arrive at the validator instance 140 asynchronously. As a result, the validator instance 140 does not execute the SSN validation immediately upon receiving a commit request. The validator instance 140 instead may delay for a specified interval at 430 anticipating the possible late arrival of commit requests of lower CTSs and execute them in the proper order. When two commit requests have the same CTS, they are supposed to be from two different sequencer instances 130, and they are ordered using identifiers of the sequencer instances 130 as the tie breaker.
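A sketch of such a reordering buffer follows, draining commit requests in (CTS, sequencer ID) order with the sequencer identifier as the tie breaker; the field names and the priority-queue choice are assumptions of this sketch.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct CommitRequest {
    std::uint64_t cts;          // commit timestamp assigned by a sequencer
    std::uint32_t sequencerId;  // tie breaker when two CTSs are equal
    // ... transaction identifier and payload
};

struct LaterFirst {  // keeps the smallest (cts, sequencerId) on top
    bool operator()(const CommitRequest& a, const CommitRequest& b) const {
        if (a.cts != b.cts) return a.cts > b.cts;
        return a.sequencerId > b.sequencerId;
    }
};

using ReorderBuffer =
    std::priority_queue<CommitRequest, std::vector<CommitRequest>, LaterFirst>;
```

In this model, the validator instance would pop from the buffer only after the specified delay interval has elapsed for the request on top, allowing late-arriving commit requests with lower CTSs to slot in first.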
[0098] The validator instance 140 may abort a multi-shard transaction with a yet-to-be-validated commit request whose CTS is lower than the CTS of the current multi-shard transaction going through the SSN validation. Aborting the transaction at this validator instance 140 will cause aborting the transaction at the other validator instances 140 associated with the transaction because the latter ones will not receive transaction validation values from this validator instance 140 and will time out the transaction.
[0099] The sequencer instance 130 receives one or more responses from the one or more relevant validator instances 140 of the transaction. All of the responses are supposed to be consistent, indicating either commit or abort, across the board. Therefore, one positive response is enough to trigger the sequencer instance 130 to append its commit log with the transaction at 440.
[00100] The advantage of having a distributed sequencer 130 is database scalability. As the number of coordinator instances 110 grows with the number of transaction clients 120, more sequencer instances 130 may be added. The disadvantage is the need for clock synchronization, whose precision affects the amount of delay needed to account for out-of-order commit requests.
[00101] FIG. 4B illustrates the communication message flow of the coordinator, sequencer, and validator instance modules in the distributed database system architecture of FIG. 1B given a centralized sequencer 135. A coordinator instance 110 can generate read and write operation requests 450 to the validator instances 140 associated with the keys in a transaction, bypassing the sequencer 135. The coordinator instance 110 aborts the transaction if any of the read and write operation requests are not satisfied. Otherwise, the coordinator 110 generates an end operation request 460 to the sequencer 135.
[00102] The sequencer 135 receives end operation requests 460 from one or more coordinator instances 110 concurrently. The sequencer 135 assigns a sequence number and a CTS 470 to each request signifying the order of execution to be expected on the relevant validator instances 140 when the sequencer 135 appends the end operation request 460 with the sequence number and the CTS and forwards the request, as a commit request, to the relevant validator instances 140.
[00103] The sequence number helps the validator instance detect any missing communication messages in the case of an unreliable communication channel. The CTS also helps the validator instance to maintain multiple versions of data items. As the centralized sequencer 135 is the single source of the sequence numbers and the CTSs, either using sequence numbers or using CTSs is sufficient to identify the order of execution of commit requests and to support the SSN validation.
[00104] The sequencer 135 receives one or more responses from the one or more relevant validator instances 140 of the transaction. All of the responses should be consistent, indicating either commit or abort, across the board. Therefore, one positive response is enough to trigger the sequencer 135 to append its commit log with the transaction at 480.
[00105] An advantage of this embodiment is that having only one sequencer instance makes it easier to ensure that all validator instances see the same order of the concurrent transactions and that each validator instance will not receive out-of-order commit requests, assuming a reliable communication channel. Also, the sequencer 135 may easily re-order the concurrent transactions, as it knows all of them in the distributed database system 100, for optimizing throughput of the validator instances 140. Furthermore, the sequencer 135 may implement a centralized hot key throttle mechanism, as it knows the metadata of all transactions.
[00106] A disadvantage of having a centralized sequencer 135 is limited database scalability. As the number of coordinators 110 grows with the number of transaction clients 125, the load on the centralized sequencer 135 could become a stress point.

[00107] Stressing of the centralized sequencer 135 may be mitigated considering the fact that only multi-shard or cross-shard transactions need to go through the sequencer 135. Single-shard transactions can be offloaded by having the coordinator instances 110 send commit requests directly to the associated validator instances 140. In that case, the validator instance 140 may assign a CTS to a single-shard transaction locally based on interpolation of the CTSs of the immediately succeeding and preceding multi-shard transactions.
[00108] Overall system scalability can be obtained by having one sequencer instance per database. There can still be many sequencer instances when the system hosts multiple databases or multiple tenants. Also, it will be appreciated that the systems and methods described herein can greatly improve the performance of a database transaction system, including multi-core databases, clustered databases, and distributed databases, especially geo-distributed databases. Databases as used herein include all database systems that require ACID properties, such as storage systems, data stores, and the like.
[00109] Validation of multi-shard transactions involves validation message exchanges about the transaction metadata among the multiple shards. During the validation, other succeeding transactions should not affect the read set and write set data and metadata of the transactions going through the validation. The period of serialization can be called a serialization window. Only independent transactions are allowed into the serialization window. Any coming transaction that has a dependency on any of the transactions in the serialization window is queued up and waits for later processing. Independent transactions that are undergoing multi-shard validation message exchanges are included in an active transaction set.
[00110] The transactions in the serialization window can be safeguarded using mutexes. Access to a data item of a transaction is through a key, and a mutex is a synchronization mechanism, like a lock, that limits access to a database entry or item. Multi-shard transactions can use a per-key locking approach to control correct concurrency of the transactions. A mutex can be associated with each key of a database system. The finest-grained mutex would be one for each database entry, but this would result in using a lot of memory for the mutexes.

[00111] A Bloom filter (BF) is a probabilistic data structure that is used to test whether an element is a member of a set. In a BF, false positives are possible, but false negatives are not. A BF can be used by the validator instances to implement the serialization window. A validator instance 140 receives a transaction request from a sequencer instance 130/135. The coming shard transaction is tested for independence against the BF that represents the set of active transactions in the serialization window. This BF can be referred to as the active transaction set BF.
[00112] A shard transaction has a read data set and a write data set. Each of the read set and the write set has zero or more tuples, and at least one of the read and write sets should contain a tuple. A validator instance 140 can use the BF to track active shard transactions in the serialization window. An active transaction set can be defined as a set containing all independent multi-shard transactions that are undergoing multi-shard validation message exchanges. An active transaction has an associated entry in the BF and is independent of other active transactions with respect to the BF. A multi-shard transaction is a member of the active transaction set if any tuple in its read set and write set hits the BF. Otherwise, the multi-shard transaction is not a member of the active transaction set.
[00113] Because the goal of the processing is to maximize processing concurrency, it is desirable to add as many transactions to the active transaction set as possible. Because active transactions should be independent, a coming transaction should be checked for a hit in the BF before adding the coming transaction to the active transaction set. A BF hit would indicate interdependence of the coming transaction.
[00114] When a validator instance 140 receives a request for a transaction, the validator instance goes through each of the tuples in the read set and write set of the transaction and checks if a tuple of the transaction hits the BF. If there is a hit for a tuple, then the coming transaction is considered to be dependent on a transaction already in the active transaction set. The coming transaction cannot be added to the active transaction set until some transactions in the active transaction set are removed from the set, at which point a re-test is performed. Otherwise, the coming transaction is deemed to be independent and can be added to the active transaction set.

[00115] The transactions in the active transaction set will go through validation message exchanges and reach either a commit or abort conclusion. The active transactions do so concurrently and simultaneously. Because the transactions are independent of each other, the timing of their conclusions does not affect the serializability of the conclusions.
[00116] Any concluded transaction can be removed from the active transaction set. To remove a transaction, the validator instance 140 goes through each tuple of the read set and write set of the transaction and the corresponding bit in the BF for each tuple key for the transaction is cleared. Subsequently, a new transaction that depends on the concluded transaction will later not hit the BF and will be allowed into the active transaction set.
[00117] FIG. 5 is a flow diagram of an example of a method 500 of using a BF to identify whether a coming transaction is a member of the set of active transactions. The BF is an array of bit elements. Each array element corresponds to one slot, indexed by a hash value of the tuple key. At block 510, a hash value is derived for each tuple key using a hash function. The test for whether a coming transaction is not an independent transaction is to check the array element (indexed by the derived hash value) for a one, indicating a hit. If the test results in a hit, the tuple is likely already a member of the set of active transactions. If the test results in a miss, the tuple is not a member of the set.
[00118] At block 520, the array elements of the BF are checked using a compare-and-swap (CAS) operation. If the CAS succeeds (meaning no hit), at block 530 the key is added to the granted list. If there is a hit, then at block 540 the key is already in the granted list and depends on another transaction, and at block 550 the coming transaction is added to the set of pending transactions. The set of pending transactions may be added to a queue of waiting transactions. If the checks of all the keys of the transaction are misses, the coming transaction is independent. At block 560 the independent transaction is added to the set of active transactions, and multi-shard validation message exchanges for the added transaction can proceed.
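A C++ sketch of blocks 510-560 follows, using one std::hash probe per key and a compare-and-swap on a 64-bit word so that concurrent threads can atomically test and claim a slot; the sizes and single-hash scheme are illustrative assumptions, not the patent's parameters.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CasBloomFilter {
    static constexpr std::size_t kBits = 1 << 16;  // example size
    std::vector<std::atomic<std::uint64_t>> words_;
public:
    CasBloomFilter() : words_(kBits / 64) {}

    // Returns true on a miss (bit was clear and is now set: key granted);
    // false on a hit (bit already set: the coming transaction is dependent
    // and is moved to the pending set, per blocks 540-550).
    bool testAndSet(const std::string& key) {
        std::size_t h = std::hash<std::string>{}(key) % kBits;
        auto& w = words_[h / 64];
        std::uint64_t mask = 1ULL << (h % 64);
        std::uint64_t old = w.load();
        do {
            if (old & mask) return false;  // hit: slot already one
        } while (!w.compare_exchange_weak(old, old | mask));
        return true;                       // miss: slot claimed for this key
    }

    // Clears a key's slot when its transaction concludes (paragraph [00116]).
    void clear(const std::string& key) {
        std::size_t h = std::hash<std::string>{}(key) % kBits;
        words_[h / 64].fetch_and(~(1ULL << (h % 64)));
    }
};
```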
[00119] A BF is more efficient in its use of memory space than a mutex per-record or per-tuple approach and uses a smaller memory footprint. FIG. 6 is an illustration of the mutex approach. Each record 602 of the database 650 is provided a mutex (Key_a, Key_b, ... Key_x). Checking for transaction interdependence involves checking each record. FIG. 7 is an illustration of the BF approach. A mutex is provided for the hash values of the keys 604 of the database system. To check the dependency of a transaction, only the BF 706 needs to be checked instead of the database 650. The BF 706 can be included in fast memory (e.g., dynamic random access memory (DRAM) of the storage node, or on-die static random access memory (SRAM)) of the storage node. The memory that stores the BF is faster than the memory that stores the key-value tuples of the storage node. For example, if the key-value tuples are stored in a solid state drive (SSD), then DRAM is the faster memory and the BF may be stored in the DRAM. If the key-value tuples are stored in DRAM, SRAM is the faster memory and the BF may be stored in SRAM. The compactness of the BF allows the BF to reside in faster memory.
[00120] In a BF, elements can be added to the BF set but not removed from the BF set. A counting BF (CBF) allows elements to be removed from the set. A CBF can be used to track active transactions to implement the serialization window. A coming transaction is tested against an active transaction set CBF that represents the set of active transactions in the serialization window.
[00121] The active transaction set CBF is an array of integers. As in the BF example, each array element corresponds to one slot, indexed by a hash value of the tuple key. However, for a CBF integer array, an integer array element represents a counter value. Using two hash functions, a hash value is derived for the read key of a tuple and a hash value is derived for the write key of the tuple. A counter value is tracked for the tuple read key and a counter value is tracked for the tuple write key. During a test of a transaction for independence, a CBF hit occurs if the tuple key’s two corresponding counter values are both non-zero; otherwise, it is a CBF miss. A tuple is likely a member of the set in the case of a CBF hit. Otherwise, in the case of a miss, the tuple is definitely not a member of the set.
[00122] The operation for a CBF implementation is similar to that of the BF operation. When a request for a transaction is received, the validator instance 140 goes through each of the tuples in the read set and write set and checks if the tuple hits the CBF. If one tuple does hit, then the current transaction is considered to be dependent on a transaction already in the active transaction set. It cannot be added to the active transaction set until some transactions in the set are removed from the set and a re-test is performed. Otherwise, the current transaction is deemed to be independent and can be added to the active transaction set.
[00123] The transactions in the active transaction set will go through validation message exchanges and reach commit or abort conclusions. They do so concurrently and simultaneously. Because they are independent of each other, the timing of their conclusions does not affect the serializability of the conclusions.
[00124] Any concluded transaction is to be removed from the active transaction set. To remove a transaction from the active transaction set CBF, the validator instance goes through each tuple of its read set and write set and decrements the two corresponding counter values of each tuple key. Consequently, a new transaction that depends on the concluded transaction will not result in a hit on the CBF and will be allowed into the active transaction set.
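The following is a minimal counting-Bloom-filter sketch of this add/test/remove behavior. Interpreting the two counter values as two salted hash probes per tuple key is this sketch's assumption; the slot count is arbitrary.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CountingBF {
    static constexpr std::size_t kSlots = 1 << 16;  // example size
    std::vector<std::uint32_t> counters_ = std::vector<std::uint32_t>(kSlots, 0);
    static std::size_t h1(const std::string& k) { return std::hash<std::string>{}(k) % kSlots; }
    static std::size_t h2(const std::string& k) { return std::hash<std::string>{}("#" + k) % kSlots; }
public:
    // A hit requires both counters to be non-zero; a miss is definitive.
    bool hit(const std::string& k) const {
        return counters_[h1(k)] != 0 && counters_[h2(k)] != 0;
    }
    void add(const std::string& k)    { ++counters_[h1(k)]; ++counters_[h2(k)]; }
    void remove(const std::string& k) { --counters_[h1(k)]; --counters_[h2(k)]; }
};
```

Unlike the plain BF, remove() here safely decrements shared slots, so concluded transactions can be retired without corrupting the membership of other active transactions.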
[00125] Because each tuple key is a string of bytes of arbitrary length and values, a hash value for the tuple key is calculated by running through the string of bytes. Testing a transaction against the active transaction membership can be a frequent operation. It would be desirable to expedite the active transaction set BF or CBF testing operation and reduce memory accesses used by the testing operation.
[00126] One approach to expedite the testing operation is to calculate the hash values for each tuple key only once and store them in an array instead of re-reading the tuple key values and re-calculating the hash values again and again. Also, when one of the tuple keys hits the BF/CBF and therefore fails the test, the position of the tuple key is cached. The next membership test for the same transaction will start from the cached position. This may reduce the test time because the same tuple key is likely to fail the test again. In other words, instead of starting from the first tuple key in the read set or the write set, the position of the tuple key that fails the test is cached, and the re-test resumes from the cached position until all tuple keys have passed the test.
[00127] Another approach is to add the tuple key to the BF/CBF right after it passes the BF/CBF test. This approach takes advantage of the fresh cache lines still holding the relevant memory. When the BF/CBF test fails, the tuple keys of the new transaction that have already been added to the CBF are removed, undoing the effect of having partially added the transaction to the CBF. Compared to the alternative of finishing testing all tuple keys first and then adding them to the CBF, this approach is more efficient for two reasons. First, the CPU cache lines still contain the parts being used in the test and add operations, so there would be more cache line hits. Second, when a transaction passes the BF/CBF test, it is more likely than not that the transaction is independent of the active transaction set.

[00128] Still another approach is to use a CBF and enable concurrently adding tuple keys to the CBF and removing tuple keys from the CBF. An atomic integer is used for each counter, taking advantage of the CPU support for atomic instructions on the counter. Multi-threaded operations on the atomic integers are naturally supported by the CPU. Therefore, the operations on the CBF can be multi-threaded and concurrent.
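A sketch combining the last two approaches follows: atomic counters permit concurrent increments, and each key is added immediately after passing the test, with a rollback of the partial additions when a later key fails. Names, sizes, and the two-probe scheme are assumptions of this sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class AtomicCBF {
    static constexpr std::size_t kSlots = 1 << 16;  // example size
    std::vector<std::atomic<std::uint32_t>> counters_;
    static std::size_t h1(const std::string& k) { return std::hash<std::string>{}(k) % kSlots; }
    static std::size_t h2(const std::string& k) { return std::hash<std::string>{}("#" + k) % kSlots; }
public:
    AtomicCBF() : counters_(kSlots) {}
    bool hit(const std::string& k) const {
        return counters_[h1(k)].load() != 0 && counters_[h2(k)].load() != 0;
    }
    void add(const std::string& k)    { counters_[h1(k)].fetch_add(1); counters_[h2(k)].fetch_add(1); }
    void remove(const std::string& k) { counters_[h1(k)].fetch_sub(1); counters_[h2(k)].fetch_sub(1); }
};

// Each tuple key is added to the CBF immediately after it passes the test,
// keeping the relevant cache lines warm; on a failing key, the keys already
// added are removed, undoing the partial addition of the transaction.
bool tryAdmit(AtomicCBF& cbf, const std::vector<std::string>& keys) {
    std::size_t added = 0;
    for (const auto& k : keys) {
        if (cbf.hit(k)) {
            for (std::size_t i = 0; i < added; ++i) cbf.remove(keys[i]);
            return false;  // dependent: retry after the active set shrinks
        }
        cbf.add(k);
        ++added;
    }
    return true;  // independent: admitted to the active transaction set
}
```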
[00129] FIG. 8 is a flow diagram of an example of a method 800 of using a CBF to identify whether a coming transaction is a member of the set of active transactions. The CBF is an array of counter values of N bits each, where N is an integer greater than one. Each array element of N bits corresponds to one slot, indexed by a hash value of the tuple key. At block 810, a hash value is derived for a read tuple key and a write tuple key using a hash function. The test for whether a coming transaction is not an independent transaction is to check the array elements (indexed by the derived hash values) for a non-zero counter value, indicating a hit. If the test results in a hit, the tuple is likely already a member of the set of active transactions. If the test results in a miss, the tuple is not a member of the set.
[00130] At block 820, the array elements of the CBF are checked using an atomic increment operation. If the test results in a miss, at block 830 the key is added to the granted list. If the test results in a hit, then at block 840 the key is already in the granted list and depends on another transaction, and at block 850 the coming transaction is added to the set of pending transactions. The set of pending transactions may be added to a queue of waiting transactions. If the checks of all the read tuple keys and write tuple keys of the transaction are misses, the coming transaction is independent, and at block 860 the coming transaction is added to the set of active transactions, after which multi-shard validation message exchanges for the added transaction can proceed.

[00131] FIGS. 9A-9F are example DSSN routines for implementing a CBF for testing shard transactions for independence. FIGS. 9A-9B illustrate a DSSN routine in the validator instance 140 for handling additions of entries to the active transaction set CBF. FIG. 9C illustrates a DSSN routine in the validator instance 140 for handling removal of entries from the active transaction set CBF. FIGS. 9D-9F illustrate a DSSN routine for searching the CBF according to hash values to determine the count value of the CBF array.
[00132] When the transaction clients simultaneously submit transaction commit requests to a storage node to validate the requests, the validation of the requested transactions needs to be serialized. The validation involves validation message exchanges with other storage nodes that participate in the multi-shard transactions. To enhance performance, multi-shard transactions that are independent transactions can go through the validation without waiting. All the independent transactions are put into a serialization window, and only independent transactions are allowed into the serialization window. Any coming transaction that has dependency on any of the transactions in the serialization window is queued up and waits for validation. A single First-In First-Out (FIFO) queue may be used to store and queue waiting transactions. All waiting transactions sequenced by the commit timestamp (CTS) will enter the FIFO queue.
[00133] The waiting enqueued transactions are recurrently rechecked to test whether any of them is allowed into the active transaction set. Because different waiting transactions may depend on different transactions in the active transaction set, each waiting transaction is tested to maximize processing concurrency. However, it can be computationally expensive to scan through all the waiting enqueued transactions. To make the scanning process efficient, the dependencies among the waiting transactions can be tracked.
[00134] For example, assume there is a chain of dependent multi-shard transactions. If the head transaction of the transaction chain is blocked from admission into the active transaction set because of interdependency, then the rest of the transactions in the chain do not need to be tested. In other words, all transactions do not need to be re-tested every time a concluded transaction is removed from the active transaction set because the dependency graph would indicate which specific ones to be tested. However, it is a challenge to track the full dependencies of the waiting transactions as the amount of memory and operations increases exponentially with the number of transactions.
[00135] FIG. 10 is an illustration of an example of tracking dependencies of waiting transactions. Multiple queues are used to store waiting transactions that are awaiting validation by validator instances. Full dependency of the waiting transactions is not tracked to keep the memory footprint manageable.
[00136] The multi-shard transactions 1002 are stored in multiple queues according to dependency. The queues may be FIFO queues. FIG. 10 shows an active transaction set 1012 of a serialization window, an independent queue 1014, and a dependent queue 1016 of interdependent transactions. The active transaction set 1012, the independent queue 1014, and the dependent queue 1016 are backed by an Active BF 1006, an Independent BF 1008, and an Interdependent BF 1010, respectively. If a coming transaction passes the check for a conflict with the Active BF 1006, it is added to the active transaction set 1012. If there is a conflict with the Active BF 1006 (e.g., there is a hit in the Active BF 1006), the transaction is checked for a conflict with the Independent BF 1008.
[00137] If there is no conflict (e.g., there is a miss in the check of the Independent BF 1008), the transaction is added to the independent queue 1014 and an entry for the transaction is added to the Independent BF 1008. Like the entry for the Active BF 1006, the entry can include a bit or N bits of an integer array (depending on whether the Independent BF 1008 is a BF or a CBF) indexed by one or more hash values determined for the keys of the transactions. The independent queue 1014 can be used to queue up the transactions that have been tested, in sequence, to be independent of the waiting transactions ahead of them. These transactions can be tested against the active transaction set when there is a removal from the active transaction set.
[00138] If there is a conflict in the check of the Independent BF 1008, the transaction is added to the interdependent queue 1016 and an entry for the transaction is added to the Interdependent BF 1010 for the interdependent queue 1016. The entry can include a bit or N bits of an integer array (depending on whether the Interdependent BF 1010 is a BF or a CBF) indexed by one or more hash values determined for the keys of the transactions.

[00139] FIG. 11 is an illustration of another example of tracking dependencies of waiting transactions. In the example of FIG. 11, the dependencies of the dependent multi-shard transactions are tracked further than in the example of FIG. 10. In the example of FIG. 11, three transaction queues are used: an independent queue 1014 and two interdependent queues, a cold queue 1120 and a hot queue 1122. Each queue is backed by a BF, and there is an independent queue BF, a cold queue BF, and a hot queue BF. The BFs can be CBFs.
[00140] The cold queue 1120 can be used to queue up transactions that have been tested (using the BFs) to be dependent on the waiting transactions already in the independent queue 1014 or cold queue 1120. The hot queue 1122 is used to queue up transactions that have been tested to be dependent on the waiting transactions already in the independent queue 1014, the cold queue 1120 or the hot queue 1122.
[00141] Transactions in the hot queue 1122 are considered to have long chains of transaction dependency. To test whether a coming transaction should be placed in the hot queue 1122, each key of the coming transaction is tested against the hot queue BF. The coming transaction is enqueued in the hot queue 1122 and entries for the keys of the coming transaction are added to the hot queue BF when one of the tested keys hits the hot queue BF.
[00142] The coming transaction is enqueued in the cold queue 1120 and the keys of the coming transaction are added to the cold queue BF when any of the tested keys hits the cold queue BF. The size of the cold queue 1120 can be much smaller than the size of the hot queue 1122. A hot queue threshold can be used to determine when to add transactions to the hot queue 1122. The hot queue threshold is a count of transactions, and the threshold count is smaller than the size of the cold queue 1120. When a coming transaction is tested against the BF and determined to be dependent on transactions already in the cold queue 1120, and the current number of transactions enqueued in the cold queue 1120 exceeds the specified hot queue threshold, the coming transaction is enqueued in the hot queue 1122. The testing for dependency on transactions in the cold queue 1120 is a way to differentiate a transaction that has a long chain of dependency. It is a simplified way to determine dependency to some extent without tracking the full dependency graphs.

[00143] The coming transaction is enqueued in the independent queue 1014 and entries for the keys of the transaction are added to the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF. Each of the independent queue BF, cold queue BF, and hot queue BF keeps the membership of the transactions enqueued in the corresponding queue. The BF provides a quick way to test dependency. When a transaction is added to a queue, entries for the transaction’s tuple keys are added to the queue’s BF. When a transaction is removed from the queue, those entries are removed from the queue’s BF.
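A sketch of this three-way triage follows, reusing the CountingBF sketch shown earlier; the transaction type, queue containers, and threshold value are illustrative assumptions.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// CountingBF is the counting-Bloom-filter sketch shown earlier.

struct WaitingTxn {
    std::vector<std::string> readWriteKeys;
    std::uint64_t cts = 0;  // commit timestamp from the sequencer
};

struct DependencyQueues {
    CountingBF indepBF, coldBF, hotBF;  // one BF (or CBF) per queue
    std::deque<WaitingTxn> indep, cold, hot;
    std::size_t hotThreshold = 16;      // example; smaller than cold queue size

    static bool anyHit(const CountingBF& bf, const WaitingTxn& t) {
        for (const auto& k : t.readWriteKeys)
            if (bf.hit(k)) return true;
        return false;
    }
    static void addKeys(CountingBF& bf, const WaitingTxn& t) {
        for (const auto& k : t.readWriteKeys) bf.add(k);
    }

    void enqueue(WaitingTxn t) {
        bool hitsHot = anyHit(hotBF, t);
        bool hitsCold = anyHit(coldBF, t);
        if (hitsHot || (hitsCold && cold.size() > hotThreshold)) {
            addKeys(hotBF, t);            // long dependency chain
            hot.push_back(std::move(t));
        } else if (hitsCold || anyHit(indepBF, t)) {
            addKeys(coldBF, t);           // short dependency chain
            cold.push_back(std::move(t));
        } else {
            addKeys(indepBF, t);          // independent of all waiting txns
            indep.push_back(std::move(t));
        }
    }
};
```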
[00144] The transactions enqueued in the independent queue 1014, the cold queue 1120, and the hot queue 1122 are tested against the active transaction set BF for entry into the active transaction set and the serialization window. The commit timestamp (CTS) of the transaction assigned by the sequencer instance 130/135 can be used to select a transaction for testing from the independent queue 1014, the cold queue 1120, or the hot queue 1122. The transactions stored in the independent, cold, and hot queues have CTSs, and their dependencies were tested sequentially, in CTS order, during the insertion procedure. The transactions should therefore be dequeued sequentially. For example, a transaction in the hot queue with a lower CTS should be dequeued before a transaction in the cold queue or the independent queue with a higher CTS. Similarly, a transaction in the cold queue with a lower CTS should be dequeued before a transaction in the hot queue or the independent queue with a higher CTS. On the other hand, a transaction in the independent queue can be selected at any time because it has been tested to have no dependency on any transaction with a lower CTS.
[00145] FIGS. 12A-12D are example DSSN routines for enqueuing a multi-shard transaction in the hot, cold, and independent queues. FIG. 12E is an example of a DSSN routine for removing a multi-shard transaction from a queue.

[00146] Using multiple queues with BFs to track dependencies can provide a memory-efficient and computation-efficient method to track dependencies among pending transactions to be serialized.
[00147] When transaction clients 125 concurrently make transaction commit requests that depend on the same tuples, the concurrent requests contend to get into the serialization window. It is likely that the first transaction commit request that gets into the serialization window will modify the tuple values and cause a validation abort for a contentious commit request that enters the serialization window later. When an abort happens, the corresponding transaction clients 125 need to retry, wasting computing resources.
[00148] Therefore, it is desirable to abort the contentious transaction commit requests early, before the transaction clients 125 even send a contentious transaction commit request. The abort of a contentious transaction commit request can be triggered when the transaction client 125 sends a pre-commit request that tries to read or write some tuples in preparation for a commit transaction request. The trigger of the abort is based on the prediction and detection of potentially contentious transactions using the pre-commit requests.

[00149] Returning to FIG. 11, according to some examples the hot queue 1122 can be used to predict and detect potential contentions in transactions. The hot queue 1122 is used to track dependencies of transaction commit requests about to go through the serialization window. Transaction commit requests that get stored in the hot queue would typically have long dependency chains and are therefore probably highly contentious.
[00150] When a pre-commit request (e.g., a pre-commit read request or pre-commit write request) reaches a storage node, the validator instance 140 of the node may test the corresponding tuple of the pre-commit request against the BF of the hot queue 1122. The BF test reveals the values of the BF bits associated with the tuple. When there is a hit in the hot queue BF, the tuple probably depends on the currently contentious transaction commit requests of the hot queue 1122.
[00151] If the hot queue BF is a hot queue CBF, the CBF test reveals the counter values for the tuple in the hot queue CBF. If the minimum of the two counter values in the hot queue CBF exceeds a specified threshold counter value, it indicates a long dependency chain for the tuple key. The validator instance 140 of the storage node then sends an early abort signal for the pre-commit request to the transaction client. The early abort 1124 reduces the waste of system resources and bandwidth that would otherwise be spent generating a transaction commit request.
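A minimal sketch of the early-abort test, reusing the QueueCBF sketch above for the hot queue CBF, is shown below. The threshold value and the function name are illustrative assumptions; the disclosure specifies only that the minimum counter value is compared against a specified threshold.

ABORT_THRESHOLD = 8  # hypothetical dependency-chain threshold

def should_early_abort(hot_queue_cbf, tuple_key: bytes) -> bool:
    # Read the counter values for the tuple key's hash indexes; a high
    # minimum suggests a long dependency chain, so the validator sends
    # an early abort before the client builds a full commit request.
    counts = [hot_queue_cbf.counters[i]
              for i in hot_queue_cbf._indexes(tuple_key)]
    return min(counts) > ABORT_THRESHOLD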
[00152] FIG. 13 is an illustration of another example of tracking dependencies of waiting transactions. In contrast to a multi-shard transaction, a single shard transaction involves only one storage node and one validator instance 140. Exchanges of validation messages among validator instances are not needed to validate a single shard transaction. Because other validator instances are not involved, a single shard transaction can be quickly validated as long as the transaction does not have read keys and write keys that would collide with the keys being processed in the current active multi-shard transactions.
[00153] At 1310, single shard transactions are identified. Multi-shard transactions are handled using Bloom Filters 1012, 1008, and 1010 as described previously herein. At 1320, single shard transactions are validated.
[00154] Because a single shard transaction can be reordered inside the system before its conclusion, the system is not obligated to use the original commit timestamp (CTS) for single shard transactions.
[00155] Instead, the serialization thread that chooses the transactions to be committed can choose a single shard transaction that does not conflict with the current Active Transaction BF 1006, and can assign the current timestamp as the CTS of the chosen single shard transaction. The Independent BF 1008 and the Interdependent BF 1010 do not need to be checked for conflicts. The same serialization thread can then use the SSN protocol to validate whether the single shard transaction can be committed. When the transaction outcome is determined, the timestamps of the keys in its read set and write set, as well as the transaction outcome, can be logged in a transaction log with completed active transactions of the serial window, making the outcome persistent. Consequently, the actual timestamps in the system can be updated with the logged timestamps.

[00156] FIG. 14 is an example of a DSSN routine 1410 for identifying and validating a single shard transaction. The DSSN routine 1410 may be performed by a validator instance 140.
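The following sketch illustrates the single shard fast path described in paragraphs [00155]-[00156], assuming the Active BF exposes a per-key hit() test. The ssn_validate stub, the use of a monotonic clock for the on-the-fly CTS, and the log record layout are assumptions made for the example rather than the routine of FIG. 14.

import time

def ssn_validate(txn) -> bool:
    # Placeholder for the SSN exclusion-window test; the real check
    # compares predecessor/successor timestamps of the read/write sets.
    return True

def validate_single_shard(txn, active_bf, txn_log):
    # The transaction is "ready" only if none of its keys collide with
    # keys of outstanding multi-shard transactions in the Active BF.
    if any(active_bf.hit(k) for k in txn.read_keys | txn.write_keys):
        return None  # not ready; try again later
    # Assign the current timestamp as the CTS on the fly.
    txn.cts = time.monotonic_ns()
    outcome = ssn_validate(txn)
    # Log key timestamps and the outcome so the result is persistent.
    txn_log.append((txn.cts, sorted(txn.read_keys),
                    sorted(txn.write_keys), outcome))
    return outcome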
[00157] Single shard transactions are separated from multi-shard transactions by a serialization thread, and the same serialization thread handles validation of the single shard transactions. Multi-shard transactions require communication among multiple validator instances to determine the transaction outcome. The outcome of a single shard transaction can be determined solely by the one validator instance 140 of the single shard and thus will have less latency than multi-shard transactions. Separating the short-latency operations from the long-latency operations and quickly completing the short-latency operations will improve system throughput.
[00158] Also, unlike multi-shard transactions, where a global serial order between multi-shard transactions is used to produce deterministic transaction outcomes, all single shard transactions are independent of each other. Thus, any "ready" single shard transaction can be chosen, assigned a CTS on the fly, validated, and concluded. A single shard transaction is defined as being "ready" when the tuple keys touched by the single shard transaction do not overlap or collide with the keys of any outstanding multi-shard transactions.
[00159] If multiple serialization threads were used to validate single shard transactions, those threads would need to coordinate among themselves to make sure there is no key collision among outstanding single shard transactions. This could be done by the multiple serialization threads updating the bloom filter or filters. Using the same serialization thread to select and validate the single shard transaction avoids the need to update any bloom filter. Also, using the same serialization thread to select the single shard transaction avoids the need for coordinated updates to the metadata (timestamps and values) of the keys by multiple serialization threads.
[00160] In a distributed system, it is normal for a computer system to have failures in the system nodes and in the network. In a typical computer system using SSN, the outcome of a transaction can be logged in persistent storage media, and the outcome of the transaction can be retrieved in the event of a failure. For a distributed system, if the failure occurs before the outcome of a transaction is computed, the peer information determined for the transaction needs to be regenerated. A better approach is for a distributed computer system to move the timing of the transaction logging to a point before the outcome of the transaction is computed.
[00161] FIG. 15 is a flow diagram of an example of a method 1500 of recovery of a distributed database system in the event of a failure. The method 1500 may be implemented by the distributed database system of FIG. 1A or FIG. 1B. In certain examples, the method 1500 is implemented using DSSN. In the approach of FIG. 15, the logging for the transaction occurs before validation.
[00162] At operation 1510, a multi-shard transaction is generated by a transaction client of the distributed computer system. A multi-shard transaction includes peer information and at least one key for each shard of the multi-shard transaction. The peer information may include information (e.g., identifiers) about the peer storage nodes that store the multiple shards and participate in the transaction.
[00163] At operation 1520, the keys of the multi-shard transaction are divided into subsets of keys. The subsets of keys are sent to the validator instances of the peer storage nodes.
[00164] At operation 1530, a validator instance 140 of a storage node of the distributed database system receives the multi-shard transaction and its subset of keys and calculates subset metadata for the subset. The validator instance 140 may receive the multi-shard transaction from a sequencer instance 130/135. The subset metadata summarizes the metadata of the keys of the subset into a single metadata summary.
[00165] At operation 1540, the peer information is logged on shared persistent storage of the distributed database system prior to the transaction validation message exchange among the validator instances of the participating storage nodes. The transaction validation messages include the calculated subset metadata. Thus, the summarized metadata is exchanged in the messaging, not the full metadata of the individual keys of the subset.
[00166] For example, assume the multi-shard transaction is a three-shard transaction that involves three shards of the distributed database system. The multi-shard transaction includes a set of keys (A, B, C, D, E, F). The keys are divided into three subsets of keys, e.g., (A, B), (C, E, D), (F), and a subset is sent to each participating storage node. Each of the validator instances of the participating storage nodes receives its subset of keys and the peer information. Each of the validator instances stores the peer information and processes its subset of keys to calculate its subset metadata. The first validator instance will calculate a summary of metadata for keys (A, B), the second validator instance will calculate a summary for keys (C, E, D), and the third validator instance will calculate a summary for key (F). Each of the three validator instances sends its calculated metadata to the other validator instances during the transaction validation message exchange. Each shard eventually has the summarized data for the entire set of keys (A, B, C, D, E, F), although only subset summaries were sent.

[00167] Moving the logging for the transaction earlier, to when the peer information is received, allows the peer information to be retrieved in the event of a failure without having to regenerate the peer information.
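The arithmetic of the three-shard example can be made concrete with the short sketch below. Treating each key's metadata as a single timestamp and using max() as the summary reduction are assumptions for illustration; the disclosure says only that per-key metadata is summarized into a single metadata value per subset.

def summarize(metadata_by_key, subset):
    # Collapse the subset's per-key metadata into one summary value.
    return max(metadata_by_key[k] for k in subset)

metadata = {"A": 3, "B": 7, "C": 2, "D": 9, "E": 4, "F": 5}
subsets = [("A", "B"), ("C", "E", "D"), ("F",)]
# Each validator sends only its subset summary during the exchange...
summaries = [summarize(metadata, s) for s in subsets]
# ...yet reducing the three summaries yields the same value as a
# reduction over the full key set (A, B, C, D, E, F).
assert max(summaries) == max(metadata.values())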
[00168] FIG. 16 is an illustration of the logging technique. If a waiting transaction 1602 is a multi-shard transaction, when it becomes an independent transaction it will eventually become an active transaction in the active transaction set 1012. An Active BF 1006 may be used to determine when the multi-shard transaction becomes an independent transaction. The peer information is logged at 1632 in the shared persistent storage before the peer information is exchanged at 1634. The peer information may be logged when the transaction becomes an active transaction. If a BF is used to determine independence of the multi-shard transaction, the peer information may be logged when there is a miss in the check for the keys of the transaction in the BF. In certain examples, the peer information is logged after the transaction is evaluated locally by the validator instance.
[00169] Once all participating validator instances 140 log the peer information, the transaction is validated at 1636 and the outcome of the transaction is determined. If any of the participating validator instances fails to log the peer information (e.g., because of a node failure, a network failure, the transaction missing an out-of-order window of one or more validators, etc.), the multi-shard transaction will not reach an outcome, resulting in a timeout. The failure can be tolerated because all the validator instances 140 of the participating storage nodes can reproduce the peer exchange information of the multi-shard transaction. Because all of the operations involved in validating multi-shard transactions are idempotent, the validation process can be repeated over and over without ill effects.
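As a rough illustration of why the early logging makes recovery simple, the sketch below replays logged peer information after a restart; because the exchange and validation steps are idempotent, re-running them for an already-concluded transaction cannot change its outcome. The ValidatorStub class and its method names are hypothetical stand-ins, not interfaces from the disclosure.

class ValidatorStub:
    # Minimal stand-in for a validator instance 140; the real exchange
    # and validation logic are outside the scope of this sketch.
    def exchange_subset_metadata(self, txn_id, peers):
        print(f"re-exchanging subset metadata for {txn_id} with {peers}")

    def validate(self, txn_id):
        print(f"re-validating {txn_id}")

def recover(peer_log, validator):
    # Replay every transaction whose peer information was logged before
    # the failure; idempotence makes repeated validation harmless.
    for txn_id, peers in peer_log.items():
        validator.exchange_subset_metadata(txn_id, peers)
        validator.validate(txn_id)

recover({"txn-1": ("node-2", "node-3")}, ValidatorStub())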
[00170] If the validator instance 140 identifies the shard transaction as a single shard transaction 1618, the transaction can be validated 1620 and the outcome can be logged 1622 in the transaction log at the conclusion of the single shard transaction. The outcome of a single shard transaction can be reproduced easily upon a node restart after a failure, and the peer information for a single shard transaction is not stored until after it validates 1620.
[00171] By logging the peer exchange information of a multi-shard transaction prior to the validation message exchange, the recovery is made idempotent, and multiple validations of the same transaction will not cause the system to change the outcome of the multi-shard transaction. The method is also deterministic for single shard transactions because the outcome of a single shard transaction is logged at the time of its conclusion.
[00172] In some examples, the active transaction set 1012 includes single shard transactions and multi-shard transactions. Hashed entries for the single shard transactions, as well as for the multi-shard transactions, can be entered into the Active BF 1006. The single shard transactions can be quickly validated and quickly removed from the active transaction set 1012 and the Active BF 1006 by the validator instance 140.
[00173] FIG. 17 is a block schematic diagram of a computer system 1700 for performing the methods and algorithms described herein. Not all components need to be used in the various embodiments or examples.
[00174] One example computing device in the form of a computer 1700 may include a processing unit 1702, memory 1703, removable storage 1710, and nonremovable storage 1712. Although the example computing device is illustrated and described as computer 1700, the computing device may be in different forms in different embodiments. For example, the computing device may instead be a smartphone, a tablet, smartwatch, smart storage device (SSD), or other computing device including the same or similar elements as illustrated and described with regard to FIG. 17. Devices, such as smartphones, tablets, and smartwatches, are generally collectively referred to as mobile devices or user equipment.
[00175] Although the various data storage elements are illustrated as part of the computer 1700, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet, or server-based storage. Note also that an SSD may include a processor on which the parser may be run, allowing transfer of parsed, filtered data through I/O channels between the SSD and main memory.
[00176] Memory 1703 may include volatile memory 1714 and non-volatile memory 1708. Computer 1700 may include - or have access to a computing environment that includes - a variety of computer-readable media, such as volatile memory 1714 and non-volatile memory 1708, removable storage 1710 and non-removable storage 1712. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM) or electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
[00177] Computer 1700 may include or have access to a computing environment that includes input interface 1706, output interface 1704, and a communication interface 1716. Output interface 1704 may include a display device, such as a touchscreen, that also may serve as an input device. The input interface 1706 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 1700, and other input devices. The computer may operate in a networked environment using a communication connection to connect to one or more remote computers, such as database servers. The remote computer may include a personal computer (PC), server, router, network PC, a peer device or other common data flow network switch, or the like. The communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other networks. According to one embodiment, the various components of computer 1700 are connected with a system bus 1720.
[00178] Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 1702 of the computer 1700, such as a program 1718. The program 1718 in some embodiments comprises software to implement one or more methods described herein. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device. The terms computer-readable medium, machine-readable medium, and storage device do not include carrier waves to the extent carrier waves are deemed too transitory. Storage can also include networked storage, such as a storage area network (SAN). Computer program 1718 along with the workspace manager 1722 may be used to cause processing unit 1702 to perform one or more methods or algorithms described herein.
[00179] Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.

Claims

CLAIMS

What is claimed is:
1. A computer-implemented method for serializing multi-shard transactions of a storage node of a distributed database system, the method comprising: tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry corresponding to the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.
2. The method of claim 1, wherein the tracking active transactions using the BF includes: determining a hash value for at least one key of a multi-shard transaction; and updating at least one bit of an element of an integer array of the BF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; wherein the checking the BF for at least one key of a coming transaction includes: determining a hash value for the at least one key of the coming transaction; and checking the value of the element of the integer array indexed according to the determined hash value.
3. The method of claim 1 or claim 2, wherein the tracking active transactions using the BF includes tracking active transactions using a counting BF (CBF), including: determining a hash value for at least one read key of a multi-shard transaction and determining a hash value for at least one write key of the multi-shard transaction; and setting a counter value for each of the hash values in an integer array of the CBF to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values; wherein the checking the BF includes checking the CBF for at least one key of the coming transaction, including: determining a hash value for at least one read key of the coming transaction and determining a hash value for at least one write key of the coming transaction; and checking the counter value of elements of the integer array indexed using the determined hash values.
4. The method of claim 3, including decrementing the counter value indexed by a hash value of a key of an active transaction when the active transaction is completed.
5. The method of claim 3 or claim 4, including rechecking the CBF for the key of the coming transaction after a predetermined duration of time.
6. The method of any one of claims 1-5, including storing the BF in a memory with faster access relative to a memory used to store key-value tuples of the storage node.
7. The method of any one of claims 1-6, wherein enqueueing the coming transaction includes: maintaining an independent queue for independent multi-shard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation; maintaining an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively; and enqueueing the coming transaction in the independent queue and adding keys of the coming transaction to the independent queue BF when a check for keys of the coming transaction misses the independent queue BF and the interdependent queue BF.
8. The method of claim 7, wherein the interdependent queue includes a cold queue and a hot queue, and the interdependent queue BF includes a cold queue BF and a hot queue BF; wherein the enqueuing the coming transaction further includes: testing each key of the coming transaction against the hot queue BF when enqueueing the coming transaction; enqueueing the coming transaction in the hot queue and adding keys of the coming transaction to the hot queue BF when any of the keys hit the hot queue BF; enqueueing the coming transaction in the hot queue and adding the keys of the coming transaction to the hot queue BF when any of the keys hit the cold queue BF and a minimum value of counter values of the cold queue BF for the keys exceeds a specified threshold counter value; enqueueing the coming transaction in the cold queue and adding the keys of the coming transaction to the cold queue BF when any of the keys hits the cold queue BF; enqueueing the coming transaction in the cold queue and adding the keys of the coming transaction to the cold queue BF when any of the keys hits the independent queue BF; and enqueueing the coming transaction in the independent queue and adding the keys to the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF.
9. The method of claim 8, further including: receiving a pre-commit request of a key of a coming transaction at a validator instance of the storage node; testing the key of the coming transaction against the hot queue BF; and sending, by the validator instance, an early abort signal for the pre-commit request when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
10. The method of any one of claims 1-9, including: identifying a single-shard transaction; and validating the single-shard transaction without checking the BF.
11. A distributed computer system that serializes transactions from at least one transaction client in a distributed database system having multiple database shards, the system comprising: at least one sequencer instance configured to receive a multi-shard transaction from the at least one transaction client and transmit a request for the transaction to multiple storage nodes of the system; and a validator instance included in a storage node of the multiple storage nodes and configured to: implement a bloom filter (BF) to track active transactions in the distributed database system, wherein an active transaction is a multi-shard transaction independent of other transactions with respect to the BF and the multi-shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; receive the requested transaction and check the BF for at least one key of the requested transaction; add an entry for the at least one key of the requested transaction to the BF when there is a miss in the check for the at least one key in the BF; queue the requested transaction when there is a hit for the at least one key in the BF; and send a validating message for transactions that are indicated by the BF to be active transactions.
12. The system of claim 11, wherein the validator instance is configured to: determine a hash value for at least one key of a multi-shard transaction; update at least one bit of an element of an integer array of the BF to indicate the multi-shard transaction is an active transaction when the multi-shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; determine a hash value for the at least one key of the requested transaction; and identify the hit for the at least one key of the requested transaction based on a value of the element of the integer array indexed according to the determined hash value for the at least one key of the requested transaction.
13. The system of claim 11 or claim 12, wherein the validator instance is configured to: determine a hash value for at least one read key of a multi-shard transaction and determine a hash value for at least one write key of the multi-shard transaction; set a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values; determine a hash value for at least one read key of the requested transaction and determine a hash value for at least one write key of the requested transaction; and queue the requested transaction when a counter value of the integer array of the CBF indexed according to either of the read key hash value or the write key hash value of the requested transaction indicates a hit for either of the at least one read key or the at least one write key of the requested transaction.
14. The system of any one of claims 11-13, wherein the storage node includes a first memory to store the BF and a second memory to store the key-value tuples of the storage node, wherein an access operation to the first memory is faster relative to an access operation of the second memory.
15. The system of any one of claims 11-14, wherein the validator instance is configured to: maintain an independent queue for independent multi-shard transactions waiting for validation, and an interdependent queue for interdependent multi-shard transactions waiting for validation; maintain an independent queue BF and an interdependent queue BF for the independent queue and interdependent queue, respectively; and store the requested transaction in the independent queue and store keys of the requested transaction in the independent queue BF when a check for keys of the requested transaction misses the independent queue BF and the interdependent queue BF.
16. The system of claim 15, wherein the validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF, respectively, for the cold queue and hot queue; test each key of the requested transaction against the hot queue BF when enqueueing the requested transaction; store the requested transaction in the hot queue and include keys of the requested transaction in the hot queue BF when any of the keys hit the hot queue BF; store the requested transaction in the hot queue and include the keys of the requested transaction in the hot queue BF when any of the keys hit the cold queue BF and a minimum value of counter values of the cold queue BF for the keys exceeds a specified threshold counter value; store the requested transaction in the cold queue and include the keys of the requested transaction in the cold queue BF when any of the keys hits the cold queue BF; store the requested transaction in the cold queue and include the keys of the requested transaction in the cold queue BF when any of the keys hits the independent queue BF; and store the requested transaction in the independent queue and include the keys in the independent queue BF when all of the keys miss the independent queue BF, the cold queue BF, and the hot queue BF.
17. The system of claim 15, wherein the validator instance is configured to: include a cold queue and a hot queue in the interdependent queue, and maintain a cold queue BF and a hot queue BF, respectively, for the cold queue and hot queue; receive a pre-commit operation on a key of the requested transaction; test the key of the requested transaction against the hot queue BF; and send an early abort signal for the pre-commit operation when the key hits the hot queue BF and a minimum value of counter values of the hot queue BF for the key exceeds a specified threshold counter value.
18. A storage server of a distributed database system, the server comprising: at least one hardware processor; and a memory storing instructions that cause the at least one hardware processor to perform operations comprising: tracking active transactions in the distributed database system using a bloom filter (BF), wherein an active transaction is a shard transaction independent of other transactions with respect to the BF and the shard transaction has at least one key for a data item, and the BF includes entries corresponding to keys of active transactions; checking the BF for at least one key of a coming transaction; adding an entry for the at least one key of the coming transaction to the BF when there is a miss in the check for the at least one key in the BF; enqueueing the coming transaction when there is a hit for the at least one key in the BF; and validating the transactions that are indicated by the BF to be active transactions.
19. The server of claim 18, wherein the instructions cause the at least one hardware processor to perform operations including: determining a hash value for at least one key of a multi-shard transaction; updating at least one bit of an element of an integer array of the BF to indicate the shard transaction is an active transaction when the shard transaction is an independent transaction, wherein the element of the integer array is indexed using the determined hash value; and enqueueing the coming transaction when an element of the integer array of the BF indexed according to the hash value of the coming transaction indicates a hit for the at least one key of the coming transaction.
20. The server of claim 18 or claim 19, wherein the instructions cause the at least one hardware processor to perform operations including: determining a read key hash value for at least one read key of a multi-shard transaction and determining a write key hash value for at least one write key of the multi-shard transaction; updating a counter value for each of the read key hash value and the write key hash value in an integer array of a counting BF (CBF) to indicate an active transaction when the multi-shard transaction is an independent transaction, wherein elements of the integer array are indexed using determined hash values; and enqueueing the coming transaction when a counter value of the integer array of the CBF indexed according to either of the read key hash value or the write key hash value of the coming transaction indicates a hit for either of the at least one read key or the at least one write key of the coming transaction.
PCT/US2021/072279 2020-12-04 2021-11-08 Methods for distributed key-value store WO2022120313A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063121699P 2020-12-04 2020-12-04
US63/121,699 2020-12-04

Publications (1)

Publication Number Publication Date
WO2022120313A1 true WO2022120313A1 (en) 2022-06-09

Family

ID=78709581

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2021/072281 WO2022120314A1 (en) 2020-12-04 2021-11-08 Methods for distributed key-value store
PCT/US2021/072279 WO2022120313A1 (en) 2020-12-04 2021-11-08 Methods for distributed key-value store

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/US2021/072281 WO2022120314A1 (en) 2020-12-04 2021-11-08 Methods for distributed key-value store

Country Status (1)

Country Link
WO (2) WO2022120314A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090183159A1 (en) * 2008-01-11 2009-07-16 Michael Maged M Managing concurrent transactions using bloom filters
US20170220617A1 (en) * 2016-02-01 2017-08-03 Yahoo! Inc. Scalable conflict detection in transaction management
US20190171763A1 (en) * 2017-12-06 2019-06-06 Futurewei Technologies, Inc. High-throughput distributed transaction management for globally consistent sharded oltp system and method of implementing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9613122B2 (en) * 2014-05-02 2017-04-04 Facebook, Inc. Providing eventual consistency for multi-shard transactions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090183159A1 (en) * 2008-01-11 2009-07-16 Michael Maged M Managing concurrent transactions using bloom filters
US20170220617A1 (en) * 2016-02-01 2017-08-03 Yahoo! Inc. Scalable conflict detection in transaction management
US20190171763A1 (en) * 2017-12-06 2019-06-06 Futurewei Technologies, Inc. High-throughput distributed transaction management for globally consistent sharded oltp system and method of implementing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXANDER TOMIC ET AL: "MoSQL", APPLIED COMPUTING, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 18 March 2013 (2013-03-18), pages 455 - 462, XP058016607, ISBN: 978-1-4503-1656-9, DOI: 10.1145/2480362.2480452 *
ANONYMOUS: "Concurrency control - Wikipedia", 19 October 2020 (2020-10-19), pages 1 - 10, XP055888869, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Concurrency_control&oldid=984269322> [retrieved on 20220208] *
BLAKE GEOFFREY ET AL: "Bloom Filter Guided Transaction Scheduling", INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE. PROCEEDINGS, 1 February 2011 (2011-02-01), pages 75 - 86, XP055887728, ISSN: 1530-0897, ISBN: 978-1-4244-9432-3, Retrieved from the Internet <URL:http://tnm.engin.umich.edu/wp-content/uploads/sites/353/2017/12/2011.02.Bloom-Filter-Guided-Transaction-Scheduling.pdf> DOI: 10.1109/HPCA.2011.5749718 *

Also Published As

Publication number Publication date
WO2022120314A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
US11372890B2 (en) Distributed database transaction protocol
EP3185143B1 (en) Decentralized transaction commit protocol
US11003689B2 (en) Distributed database transaction protocol
JP6677759B2 (en) Scalable log-based transaction management
US10296606B2 (en) Stateless datastore—independent transactions
US10296371B2 (en) Passive two-phase commit system for high-performance distributed transaction execution
US10157108B2 (en) Multi-way, zero-copy, passive transaction log collection in distributed transaction systems
US20180322149A1 (en) Automated configuration of log-coordinated storage groups
US9323569B2 (en) Scalable log-based transaction management
US20180047002A1 (en) Cross-data-store operations in log-coordinated storage systems
US10303795B2 (en) Read descriptors at heterogeneous storage systems
Wester et al. Tolerating Latency in Replicated State Machines Through Client Speculation.
US9990392B2 (en) Distributed transaction processing in MPP databases
Ding et al. Centiman: elastic, high performance optimistic concurrency control by watermarking
Chairunnanda et al. ConfluxDB: Multi-master replication for partitioned snapshot isolation databases
JP2023541298A (en) Transaction processing methods, systems, devices, equipment, and programs
WO2021087499A1 (en) Distributed serializable concurrency control scheme
WO2022120313A1 (en) Methods for distributed key-value store
Yang et al. Natto: Providing distributed transaction prioritization for high-contention workloads

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21811224

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21811224

Country of ref document: EP

Kind code of ref document: A1