WO2024059352A1 - Reference-managed concurrency control - Google Patents


Info

Publication number
WO2024059352A1
Authority
WO
WIPO (PCT)
Prior art keywords
transaction
precedence
given
transactions
distributed system
Prior art date
Application number
PCT/US2023/061288
Other languages
French (fr)
Inventor
Justin FUNSTON
Ivan Avramov
Vaishali SURIANARAYANAN
Original Assignee
Futurewei Technologies, Inc.
Priority date
Filing date
Publication date
Application filed by Futurewei Technologies, Inc. filed Critical Futurewei Technologies, Inc.
Priority to PCT/US2023/061288 priority Critical patent/WO2024059352A1/en
Publication of WO2024059352A1 publication Critical patent/WO2024059352A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2308Concurrency control

Definitions

  • the present disclosure is related to systems that process data for applications and, in particular, to methods and apparatus associated with an in-memory transaction key-value storage system.
  • Storage systems store records, where users of the storage system can create, update, retrieve, and delete records by identifying them via a key.
  • a key is a unique entity that can identify a particular record in a system. With a record in the system being a set of fields, a key is a piece of information from which a record can be found in a search of the database, including finding all related fields. The key can be user defined. An example of a key can include, but is not limited to, an e-mail address. In non-distributed system design, all records reside on the same machine. In contrast, distributed systems split (partition) a set of possible keys and assign each subrange of keys to separate machines, typically called partition nodes or shards.
  • a transaction reads and/or writes according to a set of keys in a single atomic step so that the changes appear simultaneous.
  • a transaction is a set of one or more user-initiated operations.
  • Multi-Version Concurrency Control (MVCC) is the standard solution employed by most current state-of-the-art databases.
  • the approach in MVCC requires that all writes within a time window, called the retention window, are kept. This retention removes conflicts between read and write transactions resulting in a dramatic improvement of throughput for the system.
  • In MVCC, multiple versions of records are kept by the system for days or weeks, which incurs significant space overhead.
  • Delta-encoding, which is a technique of storing or transmitting data in the form of differences between sequential data rather than complete files, does help to ensure that the cost is not linear, at the cost of a runtime penalty for reconstruction.
  • In MVCC, continuous garbage collection is required. All current databases provide a background mechanism that continuously sweeps all data and removes versions older than the configured retention window. Transactions running for a time longer than this retention window cannot benefit from MVCC and are either automatically aborted or revert to locking, which restricts access to data.
  • the architecture can include a distributed system, implemented as an in-memory transaction key-value storage system that can allow for high concurrency and efficient memory usage while providing strict serializability and reasonable latency for transactions.
  • transaction precedence graphs can be used to identify and clear out versions of transactions that are no longer in use.
  • the distributed system can include use of partially constructed transaction precedence graphs that can be constructed while executing the transactions and can be maintained across multiple nodes.
  • the constructed transaction precedence graphs can be updated and combined when transactions attempt to commit.
  • Procedures for a commit can include combining the partial precedence graphs and performing cycle checking in the transaction precedence graph in order to achieve the consistency objectives of the distributed storage of the distributed system.
  • a distributed system comprising storage nodes arranged individually in a distributed arrangement; a memory storing instructions; and at least one processor in communication with the memory, the at least one processor configured, upon execution of the instructions, to perform the following steps: model the dependencies among transactions in the distributed system using transaction precedence graphs partially constructed while executing the transactions, the transactions correlated to keys stored in the storage nodes; and commit a transaction, correlated to the keys, of the transactions in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
  • the at least one processor is configured to dynamically determine data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
  • the storage nodes include data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
  • the at least one processor is configured to: track the transaction in the distributed system as being in-progress, committed, or aborted; and maintain and update the transaction precedence graph for the transaction and combine the transaction precedence graph with other transaction precedence graphs, in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • the at least one processor is configured to remove the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • the distributed system includes client nodes configured to issue read and write requests to the storage nodes, the client nodes arranged with interfaces to endusers, the end-users external to the distributed system.
  • the at least one processor is configured to: locate partial transaction precedence graphs containing a neighbor transaction to the transaction; add transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; check commit times between committed transactions of the partial transaction precedence graphs and add edges based on the check of the commit times; check for a cycle in the combined transaction precedence graph for the transaction; and determine to commit or to abort from the checking for a cycle.
  • each storage node includes data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
  • the at least one processor is configured, upon execution of the instructions, to perform operations as multiple transaction coordinators and multiple directed acyclic graph (DAG) coordinators, such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted and each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • the transaction coordinator for the given transaction is assigned as the DAG coordinator for the given transaction.
  • the transaction coordinator for the given transaction determines current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicates with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applies a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • a method of operating a distributed data storage system comprises modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
  • a second implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect includes storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
  • tracking the transaction in the distributed system as being in-progress, committed, or aborted; and maintaining and updating the transaction precedence graph for the transaction and combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • the method includes removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • the method includes issuing read or write requests to the storage nodes from client nodes of the distributed system, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
  • the method includes: locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
  • the method includes maintaining, in each storage node, data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
  • the method includes: operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • the method includes, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
  • the method includes, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • a non-transitory computer-readable storage medium storing instructions for processing data, which, when executed by at least one processor, cause the at least one processor to perform operations comprising modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
  • the operations include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
  • the operations include: tracking the transaction in the distributed system as being in-progress, committed, or aborted; and maintaining and updating the transaction precedence graph for the transaction and combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • the operations include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • the operations include issuing read or write requests to the storage nodes from client nodes of the distributed system, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
  • the operations include: locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
  • the operations include maintaining, in each storage node, data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
  • the operations include: operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • the operations include, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
  • the operations include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • Figure 1 is a representation of two transactions in a directed graph, according to an example embodiment.
  • Figure 2 is a representation in which a transaction has been included with the transactions of Figure 1, according to an example embodiment.
  • Figure 3 is a representation of a transaction precedence graph for transactions having a cycle, according to an example embodiment.
  • Figure 4 is a representation of a directed acyclic graph, according to an example embodiment.
  • Figures 5-7 illustrate topological sorts of a topological ordering, according to an example embodiment.
  • Figure 8 illustrates an arrangement between a given transaction and other transactions for the given transaction trying to commit, according to an example embodiment.
  • Figure 9 is a representation of a distributed system for services and nodes that can be structured for reference managed concurrency control, according to an example embodiment.
  • Figure 10 is a representation of interactions of various components for reference managed concurrency control in a datacenter, according to an example embodiment.
  • Figure 11 illustrates a mechanism for the datacenter of Figure 10 to annotate transaction operations and allow other transactions to discover a given directed acyclic graph for concurrent access, according to an example embodiment.
  • Figures 12A-B are a flow diagram of an example of starting a transaction and performing operations in a reference managed concurrency control, according to an example embodiment.
  • Figures 13A-B are a flow diagram of an example commit request in a reference managed concurrency control, according to an example embodiment.
  • Figures 14A-B are a flow diagram of an example cleanup communication in a reference managed concurrency control, according to an example embodiment.
  • Figure 15 is a flow diagram of features of an example method of operating a distributed data storage system, according to an example embodiment.
  • Figure 16 is a block diagram illustrating a computing system that implements algorithms and performs methods structured to process data for an application, according to an example embodiment.
  • the functions or algorithms described herein may be implemented in software, in an embodiment.
  • the software may comprise computer-executable instructions stored on computer-readable media or a computer-readable storage device, such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked.
  • the functions may correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
  • the software may be executed on a digital signal processor, application-specific integrated circuit (ASIC), a microprocessor, or other type of processor operating on a computer system, such as a personal computer (PC), a server, or other computer system, turning such computer system into a specifically programmed machine.
  • Computer-readable non-transitory media includes all types of computer-readable media, including magnetic storage media, optical storage media, and/or solid-state storage media, and specifically excludes signals.
  • the software can be installed in and sold with the devices that operate in association with reference managed concurrency control for data processing as taught herein.
  • the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator.
  • the software can be stored on a server for distribution over the Internet, for example.
  • a system referred to as reference managed concurrency control (RMCC) is described herein.
  • Serializable means that a transaction can be serialized.
  • these transactions are said to be serializable if some ordering can be created where these transactions appear to execute one after the other.
  • This total ordering on the set is a serial schedule.
  • serializability is indifferent to time order.
  • If a serial schedule can be constructed, then each one of these transactions is serializable with respect to the set.
  • a new transaction which wants to commit, is serializable with respect to the system if some placement for the new transaction can be found in the existing ordering such that the whole set is still a serial schedule.
  • a transaction is serializable if the transaction appears to have occurred in some serial schedule.
  • a system is not required to physically create such a schedule, but provides a basis that allows analysis about the fact that transactions are isolated from each other and appear to be isolated to any observer of the system. If an attempt is made to add a new transaction to a given set of committed/historical transactions, this new transaction is said to be serializable if there exists some possible ordering (aka serial order) called a schedule, which makes it appear as if all these transactions executed sequentially, one after the other.
  • a serial schedule can include a total ordering of transactions. For any two non-concurrent transactions T1 and T2, if the end of transaction T1 occurs before the start of transaction T2 (end(T1) < start(T2)) in real time, T1 occurs before T2 and transactions T1 and T2 are linearizable.
  • a system has strict serializability when the transactions are serializable plus linearizable. In other words, the system makes it appear to all observers that there is a total ordering of the transactions which is consistent with their real-time ordering.
  • Linearizability can identify a valid ordering, but does not say anything specific about transactions that occur concurrently.
  • Strict serializability, which is serializable plus linearizable, makes it appear to any observer that all historical transactions have occurred one after the other, and if a second transaction actually occurs after a first transaction, then any observer will see that the results of the second transaction supersede the results of the first transaction.
  • Design of the RMCC can be directed to efficiently remove unneeded record versions on-demand, as opposed to traditional MVCC designs that require periodic garbage collection.
  • the design includes a RMCC transaction protocol design along with implementation considerations.
  • the RMCC can address the problems associated with MVCC while retaining and improving the benefits. Multiple versions of records can still be used in the RMCC, but this retention in the RMCC is performed on demand and only to satisfy ongoing transactions. As soon as transactions commit, the extra versions can be discarded.
  • a commit is the applying of the changes made to the subject records of a transaction at an accepted completion of the transaction.
  • a transaction to perform one or more operations on a record of the distributed system can be started by reception of a begin statement from a user device.
  • the user device can either commit the transaction or abort the transaction, which means the user device either issues a commit command to the system to apply all the changes that were performed on the record or revert the changes that were made in executing the operations of the transaction.
  • With respect to memory usage, it is estimated that nearly a 95% reduction in memory overhead for supporting concurrent transactions can be achieved with RMCC as compared with traditional MVCC. With respect to background overhead, the use of RMCC can eliminate the need for background garbage collection in favor of graph-based transaction dependency tracking, with immediate cleanup upon commit. With respect to latency improvement, the RMCC can reduce reliance on timestamp usage, allowing concurrent timestamp allocation, which can achieve up to 33% better network latency per transaction. With respect to abort rate reduction, the RMCC can provide an improved conflict rate over traditional MVCC, due to usage of graph cycle detection, by up to 50% in certain workloads.
  • a RMCC can be implemented to address inefficient memory usage by clearing out versions of transactions as soon as they are not needed, while providing serializability and external causality guarantees.
  • the RMCC can use transaction precedence graphs.
  • the use of transaction precedence graphs allows the RMCC to determine if a commit will form a dependency cycle, which violates strict serializability.
  • the same graphs can be used to clean multiple record versions, where older versions of records in a committed transaction are cleared out when there is no path from any uncommitted transaction to the committed transaction in the transaction precedence graph.
  • the clearing procedure can be triggered by commit requests or abort requests by transactions, which means that multiple record versions are retained long enough to satisfy any open transactions, and not any longer.
  • Figure 1 is an example of a representation 100 of two transactions T1 and T2 in a directed graph.
  • a directed graph G(V, E) of transactions can include a set of all transactions V with associated edges E.
  • When T1 occurs before T2 in the schedule, an edge extends from transaction T1 to T2, for which T2 is said to be dependent on T1, where T1 and T2 are based on the same key.
  • Figure 2 is an example of a representation 200 in which a transaction T3 has been included with transactions T1 and T2 of Figure 1. Edges can be added among T1, T2, and T3 by detecting conflicts, which are key-based, or linearizability, which is based on time order.
  • Figure 3 is an example of a representation 300 of a transaction precedence graph for transactions having a cycle.
  • the transactions are represented by vertices 0, 1, 2, 3 and 4.
  • a transaction precedence graph is serializable if and only if no cycles are in the transaction precedence graph. If a transaction precedence graph has a cycle, there is no total ordering of transactions and no serial schedule.
  • In representation 300, there is a cycle between transaction vertices 2 and 4 and among transaction vertices 0, 1, 2, and 3.
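  • As an illustrative, non-limiting sketch of this check (Python; the function name and the exact edges of representation 300 are assumptions made here for illustration), a depth-first search over an adjacency-list precedence graph reports whether any cycle exists:

```python
# Illustrative sketch: detect a cycle in a transaction precedence graph.
# The graph is an adjacency list {vertex: [vertices it points to]}.

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color.get(w, WHITE) == GRAY:      # back edge found -> cycle
                return True
            if color.get(w, WHITE) == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

# Edges assumed for illustration, consistent with the cycles described above:
# 0 -> 1 -> 2 -> 3 -> 0 and 2 <-> 4, so the set is not serializable.
graph_300 = {0: [1], 1: [2], 2: [3, 4], 3: [0], 4: [2]}
print(has_cycle(graph_300))   # True: no serial schedule exists
```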
  • Figure 4 is an example of a representation 400 of a directed acyclic graph (DAG).
  • a directed graph with no cycles is a DAG. Since a serializable transaction precedence graph has no cycles, a RMCC that deals with serializable transaction precedence graphs operates with respect to DAGs.
  • a DAG has a topological ordering if a total ordering of the vertices can be provided such that there is no backward edge.
  • Representation 400 has edges among vertices 0, 1, 2, 3, 4, 5, 6, and 7 with no backward edges.
  • a topological ordering is a serial schedule.
  • Figures 5-7 illustrate an example of a topological sort of a topological ordering.
  • the topological sort is conducted to arrange the vertices beginning with vertices having no incoming edges.
  • Figure 5 shows a beginning arrangement 500 of vertices A, B, C, D, E, F, and G in which there are no cycles.
  • Vertex A has edges to vertices B and C.
  • Vertex B has edges to vertices C and D.
  • Vertex C has an edge to vertex E.
  • Vertex D has edges to vertices F and E.
  • Vertex E has no edge to any vertex in the arrangement.
  • Vertex G has edges to E and F and no edge incoming from any vertex in the arrangement to vertex G.
  • Figure 6 illustrates an arrangement 600 of the vertices of the arrangement 500 of Figure 5 with the vertices arranged in a linear fashion, maintaining the directed edges of the arrangement 500.
  • Figure 7 illustrates an arrangement 700 of the vertices of the arrangement 600 of Figure 6 with the vertices arranged in a linear fashion, where the procedure keeps picking vertices with no incoming edges.
  • Arrangement 700 results in the edges among vertices A, B, C, D, E, F, and G having no backward edges, with the vertices having no incoming edges arranged at the beginning of arrangement 700.
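  • A minimal sketch of this topological sort (Python; illustrative only) uses the edges described for arrangement 500 and repeatedly picks vertices with no incoming edges; outgoing edges for vertex F are not described above and are assumed empty here:

```python
from collections import deque

# Directed edges of arrangement 500 as described for Figure 5.
# Vertex F's outgoing edges are not listed above; assumed empty for illustration.
edges = {"A": ["B", "C"], "B": ["C", "D"], "C": ["E"],
         "D": ["F", "E"], "E": [], "F": [], "G": ["E", "F"]}

def topological_sort(edges):
    indegree = {v: 0 for v in edges}
    for targets in edges.values():
        for w in targets:
            indegree[w] += 1
    # Start with vertices that have no incoming edges (A and G here).
    ready = deque(sorted(v for v, d in indegree.items() if d == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in edges[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(edges):
        raise ValueError("graph has a cycle; no topological ordering exists")
    return order

print(topological_sort(edges))   # ['A', 'G', 'B', 'C', 'D', 'F', 'E']
```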
  • DAGs, topological ordering, and topological sorting can be used in the operation of a RMCC.
  • a RMCC as taught herein, can be implemented to use a transaction precedence graph to identify and clear out transactions.
  • the functions of the RMCC can include using transaction precedence graphs to achieve serializability in a distributed system, while also providing external causality. Due to the overhead of maintaining and using a transaction precedence graph for cycle checking, network overhead for RMCC may not be minimized, but for transactions having no concurrent transactions, RMCC performance can achieve low latency on the order of microseconds.
  • RMCC can provide strict serializability. In RMCC, two transactions are concurrent if their execution times overlap and they access a common resource during their execution. RMCC can provide client devices with the strictest consistency guarantee for transactions, which is called external consistency or linearizability. Under external consistency, the distributed system using RMCC behaves in an order that is consistent with real-time. If one transaction T1 commits before another transaction T2 that is non-concurrent with T1 commits, the system guarantees that client devices do not see a state that includes the effect of the second transaction T2 but not the first T1. Intuitively, RMCC is semantically indistinguishable from a single-machine database.
  • Architectures for a RMCC can be constructed with several guiding design elements.
  • An RMCC can be structured to achieve memory efficiency by identifying transactions to clear out in the transaction precedence graph once the transaction is committed or aborted.
  • the RMCC can be constructed to use transaction precedence graphs along with cycle checking over distributed storage to achieve serializability.
  • the RMCC can be constructed to achieve linearizability and serializability without using begin timestamps.
  • the RMCC can be constructed to use end timestamps that can be limited to ensure linearizability and not for record version selection.
  • the RMCC can be constructed to use partially constructed transaction precedence graphs maintained across multiple nodes along with updating and combining transaction precedence graphs when transactions try to commit.
  • Figure 8 illustrates an arrangement 800 between a transaction T and other transactions C1, C2, and C3 for transaction T trying to commit.
  • a commit procedure can include cycle detection, which is based on an invariant that there are no cycles among committed transactions.
  • In arrangement 800, there are three committed transactions C1, C2, and C3, where the three committed transactions do not include a cycle among themselves.
  • Committed transactions C1, C2, C3 have been committed and are fixed in the system. Reversal of the decision in which C1, C2, C3 have been committed does not occur without breaking consistency.
  • Committed transactions C1, C2, C3 are still in the DAG including T, which the system maintains, since T has not yet committed. Consider transaction T trying to commit.
  • Cycle determination can be conducted by cycle detection starting from the last transaction trying to commit, which can be used to speed up cycle finding.
  • a number of different approaches can be implemented to detect a cycle.
  • With respect to Figure 8, which shows that transaction C1 depends on transaction T, where transaction T has not been committed yet, it is noted that the DAG is not used to enforce the order in which things commit. Rather, the DAG represents the dependencies among concurrent transactions and is used to determine if any particular transaction is allowed to commit. As a result of detecting the cycle, the RMCC can disallow transaction T to commit because it would form a cycle.
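  • The commit decision can be illustrated with a hedged sketch (Python; the edge set is hypothetical and only loosely follows Figure 8): transaction T is allowed to commit only if no cycle through T exists once T's dependencies are considered:

```python
# Illustrative commit check: transaction T may commit only if no cycle through
# T would exist in the precedence graph. Edge direction follows the text:
# an edge u -> v means v depends on u.

def reachable(edges, src, dst):
    """Iterative depth-first reachability check from src to dst."""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(edges.get(v, []))
    return False

def can_commit(edges, txn):
    # A cycle through txn exists iff some successor of txn can reach txn back.
    return not any(reachable(edges, succ, txn) for succ in edges.get(txn, []))

# Hypothetical edges loosely following Figure 8: C1 depends on T, the committed
# transactions C1 -> C2 -> C3 are acyclic among themselves, and T depends on
# C3, which would close a cycle if T committed.
dag = {"T": ["C1"], "C1": ["C2"], "C2": ["C3"], "C3": ["T"]}
print(can_commit(dag, "T"))   # False: committing T would form a cycle
```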
  • Figure 9 is a representation of a distributed system 900 for services and nodes that can be structured for a RMCC.
  • a system can be implemented as a set of services that work together to achieve the system design goals.
  • a service, or service cluster can be realized as a set of nodes for a particular component, which nodes have been configured to work together.
  • a node can be structured as a running instance of software for a given component.
  • a node may include one or more processors to execute the instance of software.
  • a node can have a unique network address, which provides a mechanism for other nodes or other portions of software to be able to send messages to the given node.
  • Other nodes can include, but are not limited to, client devices or nodes from other services.
  • a single machine for example a host, can run multiple nodes, as decided by a system operator.
  • the components of distributed system 900 can include a time stamp oracle (TSO) service 916, a control plane oracle (CPO) service 926, a storage service 906, and a persistence service 936.
  • An oracle is an authority or mechanism that is configured to make decisions for the entity to which the oracle is directed.
  • TSO service 916 can provide end timestamps for transactions to ensure linearizability.
  • CPO service 926 can provide central controller for clusters of nodes.
  • Storage service 906 can provide distributed storage of transaction data.
  • Persistence service 936 can provide functionality to store data so that the data remains available, to the same or other users or services, after the process using or generating the data is no longer running. The functionality of these components can include implementation by software.
  • the nodes of a service can all run instances of the same component software.
  • the TSO nodes can run instances of TSO software, with all TSO nodes configured to be part of the same TSO Service.
  • When a software module communicates with the TSO service, this software communicates with a particular node, which is part of a particular TSO service.
  • a service-specific mechanism can be implemented to decide with which particular node to communicate.
  • the storage service 906 can include storage node 905-1, storage node 905-2 . . . storage node 905-N.
  • TSO service 916 can include TSO node 915-1, TSO node 915-2 . . . TSO node 915-N.
  • CPO service 926 can include CPO node 925-1, CPO node 925-2 . . . CPO node 925-N.
  • Persistence service 936 can include persistence node 935-1, persistence node 935-2 . . . persistence node 935-N. Though each of the services of distributed system 900 is shown having the same number of nodes, a similar distributed system can include services having different numbers of nodes.
  • Each node of a service can run independently of each other and can be implemented with one or more processors executing stored instructions for the independent node. Alternatively, nodes can share one or more processors configured to support the functionality of the nodes of a service.
  • Figure 9 reflects that during the running of one or more applications, the nodes of the active services for the one or more applications can interact with each other.
  • Figure 10 is a representation of interactions of various components for RMCC in a datacenter 1000.
  • the components in datacenter 1000 include, but are not limited to, client nodes 1002-1, 1002-2, and 1002-3, storage nodes 1005-1, 1005-2, and 1005-3, a CPO 1026, a TSO 1016, transaction coordinators 1010-1 and 1010-2, and DAG coordinators 1020-1, 1020-2, and 1020-3.
  • Although datacenter 1000 shows a number of these components, a datacenter, such as datacenter 1000, can include more or fewer than the number of each of the components shown in Figure 10.
  • TSO 1016 can be responsible for issuing real-time based timestamps with error bounds for transactions.
  • TSO 1016 can be configured to be used only to obtain commit timestamps without being implemented for record version selection.
  • Multiple TSO instances can be used at the same time. In such multiple TSO instances, the TSO instances can be implemented to agree on the maximum error bound of timestamps provided to the TSO instances.
  • the ordering of transactions is normally determined based on the data dependency. In some cases, there are transactions that do not access any common data but have a real-world dependency, e.g., issuance of an order after an order is issued from a device separate from the device that issued the earlier order. This is an external causal dependency since the dependency is external to the system.
  • These transactions can be captured using timestamps issued by the TSO. Since there is no data dependency but there is a causal relationship, the associated DAG records the relationship of these transactions. This is recorded by inserting an edge based on the timestamp ordering of the earlier transaction and the later transaction.
  • CPO 1026 can be configured as the central controller for a cluster of nodes to datacenter 1000.
  • CPO 1026 can be configured to be responsible for managing cluster partitions, scaling activities, or other managing activities for the cluster of components.
  • CPO 1026 can also serve as a versioned discovery system so that nodes and clients can discover where the cluster components are located.
  • Client nodes 1002-1, 1002-2, and 1002-3 can be structured as coordination-free client nodes.
  • Client nodes 1002-1, 1002-2, and 1002-3 can be configured to be the only components of the RMCC arrangement of datacenter 1000 visible to end-users.
  • Each of client nodes 1002-1, 1002-2, and 1002-3 can communicate with transaction coordinator nodes or storage nodes to start a transaction, perform operations (read, write), and commit or abort a transaction. For these operations, client nodes 1002-1, 1002-2, and 1002-3 can individually issue requests to storage nodes and transaction coordinators for transactions that are the subject of the requests.
  • multiple client nodes can be present in the distributed system of datacenter 1000.
  • Client nodes 1002-1, 1002-2, and 1002-3 are independent such that each node need not coordinate with each other node.
  • Storage nodes 1005-1, 1005-2, and 1005-3 can be configured as key-partitioned storage nodes. Each of storage nodes 1005-1, 1005-2, and 1005-3 can contain a subset of preassigned keys, where the keys are assigned by the CPO 1026, which is responsible for managing the keys and storage nodes of datacenter 1000.
  • Datacenter 1000 can be implemented as a storage system whose responsibility is to provide a unified view of a dataset that cannot physically fit on a single machine. This implementation can be achieved by partitioning (splitting) the data into smaller chunks called partitions, where a partition is a subset of the entire data. In an embodiment of datacenter 1000, each partition can be guaranteed to be limited in size and be able to fit on a single machine.
  • In a storage service for datacenter 1000, including storage nodes 1005-1, 1005-2, and 1005-3, exactly one partition can be assigned to each of storage nodes 1005-1, 1005-2, and 1005-3 in a one-to-one relationship.
  • the assignment which can be realized as a mapping, can be stored in a structure called a partition map.
  • When any record-level operation is performed according to a given key (for example, a read of a record with a key equal to a specific e-mail address), a determination can be made as to which partition the given key falls in. Using the partition map, a determination can be made of exactly which storage node should own the record that is the subject of the record-level operation.
  • the entire cluster of storage nodes can be configured to perform an agreement procedure on the partition map.
  • Each of storage nodes 1005-1, 1005-2, and 1005-3 can receive read or write requests from client nodes 1002-1, 1002-2, and 1002-3 and can perform the requested operations by communicating with transaction coordinators as appropriate and respond to the requesting client.
  • Each of storage nodes 1005-1, 1005-2, and 1005-3 can handle read, write, and clear version requests on keys and maintain a local transaction precedence graph.
  • multiple storage nodes can be present in the distributed system of datacenter 1000.
  • the key domain can be partitioned and each storage node can be assigned keys from distinct parts of the partition.
  • a partition map can be implemented as, but is not limited to, a hash of the key that allows the storage node for a particular key to be identified with just the key.
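  • As one hedged illustration of such a partition map (Python; the hashing scheme and node names are assumptions, not the disclosed design), a key can be hashed so that the owning storage node is identified from the key alone:

```python
import hashlib

# Hypothetical hash-based partition map: each key maps to exactly one storage
# node without consulting any central lookup table.
STORAGE_NODES = ["storage-node-1005-1", "storage-node-1005-2", "storage-node-1005-3"]

def owning_node(key: str) -> str:
    """Return the storage node assumed to own the record for this key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(STORAGE_NODES)
    return STORAGE_NODES[index]

print(owning_node("user@example.com"))   # deterministic choice among the nodes
```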
  • Transaction coordinators 1010-1 and 1010-2 can be implemented as transaction coordinator nodes.
  • a transaction coordinator can be configured to be responsible for keeping track of a transaction and committing or aborting the transaction. Every active transaction in the RMCC of datacenter 1000 has a transaction coordinator that keeps track of whether it is in-progress, committed, or aborted.
  • When a client, such as one of client nodes 1002-1, 1002-2, and 1002-3, issues a commit request, the transaction coordinator communicates with the DAG coordinator of the transaction to check if the transaction can be successfully committed.
  • multiple transaction coordinator nodes can be present in the distributed system of datacenter 1000.
  • a transaction can be mapped to a transaction coordinator using a transaction identification (ID), which can be a universally unique identifier (UUID).
  • storage nodes such as storage nodes 1005-1, 1005-2, and 1005-3, can also double up to be transaction coordinators, that is, storage nodes can also operate as transaction coordinators.
  • DAG coordinators 1020-1, 1020-2, and 1020-3 can be implemented as DAG coordinator nodes. Each transaction in datacenter 1000 has a DAG coordinator.
  • a DAG coordinator can be configured to be responsible for keeping track of the parts of partial transaction precedence graphs.
  • a DAG coordinator can be configured to also maintain, update, and combine partial transaction precedence graphs between DAG coordinators.
  • the transaction coordinator for the transaction can be assigned as the DAG coordinator.
  • the DAG coordinator can become different from the transaction coordinator, for example, by merging with other DAG coordinators.
  • a DAG coordinator can be responsible for checking if a transaction can commit by performing cycle checking.
  • a DAG coordinator can identify transactions that can be cleared out by finding committed transactions with no ongoing transactions dependent on them, can issue cleanup requests to transaction coordinators, and can signal storage nodes when record versions can be cleaned up. When a DAG coordinator determines a transaction can be cleared, the transaction can also be removed from the transaction coordinator as well.
  • a random storage node can be selected from the storage nodes, for example storage nodes 1005-1, 1005-2, and 1005-3, of datacenter 1000 to serve as the transaction coordinator as well as the DAG coordinator for this transaction.
  • a message can be sent to the selected storage node so that it can start tracking the transaction. All operations performed in the transaction can be annotated with the transaction coordinator location, which can allow any other transaction to discover the DAG for any concurrent access.
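  • A hedged sketch of this start-of-transaction step (Python; the message format and field names are assumptions) selects a random storage node as both coordinators and annotates each operation with the coordinator location:

```python
import random
import uuid

STORAGE_NODES = ["storage-node-1005-1", "storage-node-1005-2", "storage-node-1005-3"]

def begin_transaction():
    """Pick a random storage node to act as transaction and DAG coordinator."""
    txn_id = str(uuid.uuid4())                  # transaction ID as a UUID
    coordinator = random.choice(STORAGE_NODES)  # also the initial DAG coordinator
    # In the real system a message would be sent to `coordinator` so that it
    # starts tracking the transaction; here only the annotation is returned.
    return {"txn_id": txn_id, "coordinator": coordinator}

def annotate(operation, txn):
    """Attach the coordinator location to an operation (hypothetical format)."""
    return {**operation, "txn_id": txn["txn_id"], "coordinator": txn["coordinator"]}

txn = begin_transaction()
op = annotate({"op": "read", "key": "user@example.com"}, txn)
print(op)
```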
  • Figure 11 illustrates a mechanism for datacenter 1000 of Figure 10 to annotate transaction operations and allow other transactions to discover the DAG for any concurrent access.
  • Client nodes 1002-1, 1002-2, and 1002-3 can include client applications 1103-1, 1103-2, and 1103-3 and client libraries 1104-1, 1104- 2, and 1104-3, respectively.
  • a client library can be implemented as software that provides a programmatic interface, which allows a user application, such as client applications 1103-1, 1103-2, and 1103-3, to execute transactions.
  • Each of client libraries 1104-1, 1104-2, and 1104-3 can be responsible for discovering service nodes as well as assisting in some transaction coordination activities.
  • Each of client libraries 1104-1, 1104-2, and 1104-3 can interact with one or more storage nodes 1005-1, 1005-2, and 1005-3.
  • client library 1104-1 can interact with storage node 1005-1 and 1005-2
  • client library 1104-2 can interact with storage node 1005-2 and 1005-3
  • client library 1104-3 can interact with storage node 1005-1 and 1005-3.
  • the client library can select a coordinator for the transaction as well as the associated DAG.
  • the client library can select one storage partition as the coordinator of both the transaction and its DAG.
  • the DAG coordinator may change later as the transaction crosses paths with other transactions in the RMCC system of datacenter 1000, where such path interaction can result in merging of DAGs.
  • Storage node 1005-1 can include data records 1106-1, transaction records 1107-1, and DAG records 1108-1.
  • Storage node 1005-2 can include data records 1106-2, transaction records 1107-2, and DAG records 1108-2.
  • Storage node 1005-3 can include data records 1106-3, transaction records 1107-3, and DAG records 1108-3.
  • Each of transaction records 1107-1, 1107-2, and 1107-3 can keep track of the DAG to which a given transaction currently belongs.
  • transaction records 1107-3, which is representative of transaction records 1107-1 and 1107-2, can include a transactions file 1109 that includes a number of records such as Transaction Record 1 and Transaction Record 2, where each of Transaction Record 1 and Transaction Record 2 is correlated to transaction identifications TxnId1 and TxnId2, respectively.
  • Transaction Record 2 can include the identification TxnId2 that belongs to a DAG identified using a DAG identification (DAGId) set as DAGId2 in Transaction Record 2.
  • a DAG file 1111 can be searched to find DAGId2, which is tied to DAG Record 2, such that the transaction having TxnId2 keeps track of the DAG identified by DAGId2.
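  • A minimal sketch of how these records can relate (Python; field names assumed) follows the TxnId-to-DAGId-to-DAG-record lookup described above:

```python
# Hypothetical in-memory layout mirroring the transactions file and DAG file.
transaction_records = {
    "TxnId1": {"status": "in-progress", "dag_id": "DAGId1"},
    "TxnId2": {"status": "in-progress", "dag_id": "DAGId2"},
}
dag_records = {
    "DAGId1": {"vertices": ["TxnId1"], "edges": []},
    "DAGId2": {"vertices": ["TxnId2"], "edges": []},
}

def dag_for_transaction(txn_id):
    """Follow TxnId -> DAGId -> DAG record, as in the lookup described above."""
    dag_id = transaction_records[txn_id]["dag_id"]
    return dag_id, dag_records[dag_id]

print(dag_for_transaction("TxnId2"))   # ('DAGId2', <DAG record for DAGId2>)
```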
  • For T1 to now place an intent that overlaps with an intent that is currently held by T2, a decision has to be made on how to order the two transactions. The ordering can be made to also apply to all other intents.
  • With T1 and T2 each taken as having a trivial graph of size one, the precedence can be registered by merging these two graphs into a single graph, with edges denoting the established order.
  • As execution continues, it is possible that either T1 or T2 encounters another set of concurrent transactions, for example transactions T3, T4, T5. This occurrence can be detected via the placed intents, and, as a result, the two concurrent graphs of {T1, T2} and {T3, T4, T5} can be merged and a single DAG owner can be designated.
  • the single DAG owner for the graph can be consulted.
  • the single DAG owner is now in a position to perform a cycle check on the entire set of overlapping transactions {T1, T2, T3, T4, T5}.
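  • A hedged sketch of such a merge (Python; the data layout and the triggering conflict are assumptions) combines the two partial graphs {T1, T2} and {T3, T4, T5}, adds the ordering edge that triggered the merge, and designates a single owner, after which a cycle check can run over the whole set:

```python
def merge_dags(dag_a, dag_b, new_edges, owner):
    """Union two partial precedence graphs and record new ordering edges."""
    merged = {}
    for dag in (dag_a, dag_b):
        for vertex, succs in dag["edges"].items():
            merged.setdefault(vertex, set()).update(succs)
    for src, dst in new_edges:
        merged.setdefault(src, set()).add(dst)
        merged.setdefault(dst, set())
    return {"owner": owner, "edges": {v: sorted(s) for v, s in merged.items()}}

# Hypothetical partial graphs: T1 precedes T2, and T3 precedes T4 and T5.
dag_12 = {"owner": "storage-node-1005-1", "edges": {"T1": {"T2"}, "T2": set()}}
dag_345 = {"owner": "storage-node-1005-2",
           "edges": {"T3": {"T4", "T5"}, "T4": set(), "T5": set()}}

# Suppose T2 placed an intent that conflicts with T3, ordering T2 before T3.
combined = merge_dags(dag_12, dag_345, new_edges=[("T2", "T3")],
                      owner="storage-node-1005-1")
print(combined["edges"])
```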
  • the transaction precedence graph is a DAG, which is constructed in a distributed manner as running transactions access data.
  • the edges of the DAG represent transaction dependencies, and the transactions themselves are represented as nodes.
  • the edges can be viewed to be of two types. One type is an ordering between concurrent transactions and the second type is an ordering between non-concurrent transactions using linearizability. By the second type, dependencies between non-concurrent transactions can be captured. If a transaction t1 ends before t2 and the two are non-concurrent, then if there is an edge, it is to be from t1 to t2. At any given time, there can be many independent DAGs constructed in parallel in the cluster of the RMCC, where each DAG represents a set of transactions that created some dependency on each other while executing read or write operations.
  • edges in the transaction precedence graph can be classified into edges identified during operations on storage nodes and edges identified during commit time. After a commit or an abort, committed transactions that are not reachable from non-committed transactions are cleared from the transaction precedence graph. Data that is no longer needed is cleared from the storage nodes; this data consists of committed key-value records from these cleared transactions for which a newer committed record is available.
  • a clear time(t) of a committed transaction t is the maximum among the commit times of any cleared transaction that could reach t and the commit time of t.
  • a maximum clear time of an uncommitted transaction tu (mct(tu)) can be the maximum commit time of any cleared transaction from which tu has an in-edge.
  • the clear time(t) and mct(tu) can be used for ensuring linearizability guarantees with cleared out transactions.
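  • Using notation assumed here (commit(t) for the commit time of transaction t, "t' reaches t" for the existence of a path from t' to t, and E for the edge set), the two quantities above can be written compactly as (LaTeX, amsmath assumed):

```latex
\[
\mathrm{clear\_time}(t) \;=\; \max\Bigl(\mathrm{commit}(t),\;
  \max_{\substack{t'\ \mathrm{cleared},\\ t'\ \mathrm{reaches}\ t}} \mathrm{commit}(t')\Bigr),
\qquad
\mathrm{mct}(t_u) \;=\; \max_{\substack{t'\ \mathrm{cleared},\\ (t',\,t_u)\in E}} \mathrm{commit}(t').
\]
```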
  • Different portions of constructing and maintaining the transaction precedence graph in a distributed way can include a number of segments.
  • the number of segments can include adding and maintaining dependencies during operations on storage nodes.
  • Merging of transaction precedence graphs can be performed during commits and adding new edges and performing cycle checks during commits.
  • Committed transactions can be cleared from maintaining records of the completed transactions.
  • Each storage node in a RMCC can have a storage node manager (SNM) component that can keep track of information that facilitates handling operations in a straightforward manner.
  • the information can include a list of all transactions that have accessed any key on the storage node, and, for each transaction t and key k accessed by t, a set of transaction records, which can be reads and writes on k by t, ordered by operation IDs.
  • the information can include a local transaction precedence graph among all transactions that have accessed keys on the storage node based on local dependency knowledge.
  • the information can include, for each key k, a DAG ordering Ok of transactions that have accessed k.
  • the information can include, for each key, the value and clear time of the last cleared transaction in the ordering of transactions.
  • the storage node manager can be updated with the new dependencies and records for the transaction after every successful operation.
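  • A hedged sketch of the storage node manager state enumerated above (Python dataclasses; the names are assumptions): per-transaction, per-key operation records, a local precedence graph, the per-key ordering Ok, and the per-key value and clear time of the last cleared transaction:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class StorageNodeManager:
    # For each (transaction id, key): ordered (operation_id, 'read'/'write') records.
    txn_key_ops: Dict[Tuple[str, str], List[Tuple[int, str]]] = field(default_factory=dict)
    # Local transaction precedence graph: txn id -> set of dependent txn ids.
    local_graph: Dict[str, Set[str]] = field(default_factory=dict)
    # For each key: DAG ordering O_k of transactions that have accessed the key.
    key_ordering: Dict[str, List[str]] = field(default_factory=dict)
    # For each key: value and clear time of the last cleared transaction.
    last_cleared: Dict[str, Tuple[Optional[bytes], float]] = field(default_factory=dict)

    def record_op(self, txn_id: str, key: str, op_id: int, kind: str) -> None:
        """Append an operation record after a successful read or write."""
        self.txn_key_ops.setdefault((txn_id, key), []).append((op_id, kind))
        self.key_ordering.setdefault(key, [])
        if txn_id not in self.key_ordering[key]:
            self.key_ordering[key].append(txn_id)

snm = StorageNodeManager()
snm.record_op("TxnId1", "user@example.com", op_id=1, kind="read")
```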
  • Operations can be handled differently based on whether it is the first time a transaction t is accessing a key or not. If it is the first time and the first operation of t is a read, a committed transaction can be chosen to read the value from, such that all the dependencies of the transaction and linearizability are respected. Similarly, if the first operation is a write, a transaction with a write, after which to place the new write, can be identified. For this purpose, for each key on the storage node, an ordering of all transactions that have accessed the key can be maintained and, among the cleared transactions, the value and commit time of the last cleared committed transaction in the order with a write on the key can be stored. These actions can be maintained by the SNM of the storage node.
  • the task of identifying a transaction to read from for a first read or to write after for a first write becomes the task of correctly inserting the transaction t in this ordering. Inserting t in the ordering introduces new dependencies that are kept track of and are used for doing checks.
  • If a transaction t is accessing a key for the second time or beyond, then handling of the transaction becomes easier. If the operation is the first write on the key by the transaction, additional dependencies are to be added for the transaction. For all other operations, a local version lookup for the transaction can be conducted and the value from the lookup can be returned. With respect to ensuring linearizability, consider three transactions t, t1c, and t2c, where t1c and t2c are committed and t is uncommitted. Let there be an edge from t to t1c and an edge from t1c to t2c.
  • Transitivity cannot be used to infer an edge from t to t2c as that can violate linearizability, since it is possible that t1c was concurrent with both t and t2c but t2c committed before t started.
  • the new transaction should have an edge to all committed transactions in front of the new transaction in the ordering.
  • All the transactions with only reads can be ordered after the transaction from which they read the value for the key. Let t be a transaction (Txn) trying to operate for the first time on some key.
  • the SNM can be updated with the new dependencies and records for the transaction after every operation.
  • a dependency from a cleared record with clear time ct to a transaction can be captured by updating the mct of the transaction as max(mct, ct).
  • the first committed transaction tc with a write in the ordering to which there is no edge from t can be identified. Insertion of t after tc in the ordering and before the next transaction with a write can be conducted. For this, a determination can be made to check if all current dependencies are respected, that is, to check if all out-edges of t are to transactions after tc and if all in-edges to t are from transactions before tc or concurrent to t. Then the new dependencies are (i) edges from all transactions before tc to t and (ii) edges to all transactions after tc from t, except the transactions that have only reads from tc. For new type (ii) edges, messages are sent to these transactions to ensure they have not committed. If something has committed, the process can be repeated to find a new tc. The read value is from tc.
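  • A greatly simplified, hedged sketch of the first-read case just described (Python; names assumed): the per-key ordering is scanned for the first committed transaction tc with a write to which t has no out-edge, and tc supplies the read value; the dependency updates and the confirmation messages to uncommitted readers are omitted:

```python
def choose_read_source(ordering, txn, committed_writes, out_edges):
    """Pick the committed transaction tc to read from, heavily simplified.

    ordering         -- per-key list of txn ids (oldest first), i.e. O_k
    committed_writes -- txn ids in `ordering` that committed a write on the key
    out_edges        -- out_edges[t] = txn ids that t already has edges to
    """
    for tc in ordering:
        if tc in committed_writes and tc not in out_edges.get(txn, set()):
            # In the full protocol, t is inserted after tc (and before the next
            # writer), new edges are added, and uncommitted readers of tc are
            # messaged to confirm they have not committed.
            return tc
    return None   # fall back to the last cleared value for the key

ordering = ["TxnA", "TxnB", "TxnC"]          # hypothetical ordering O_k
tc = choose_read_source(ordering, "TxnT", committed_writes={"TxnB"},
                        out_edges={"TxnT": {"TxnC"}})
print(tc)   # 'TxnB': first committed writer with no existing edge from TxnT
```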
  • operations can be divided based on whether the operation is the first write operation on the key by the transaction or not. If it is, then a determination can be made to check whether tc, the transaction in the ordering after which t was placed, has a transaction (uncommitted or committed) that reads from it but also has a write. If so, a failure for the operation is returned. If not, edges from all the uncommitted operations that read from tc to t can be added. Then, the write can be recorded.
  • If the operation is a read, the local record versions on the key by the transaction are looked up and read from a past write by the transaction on the key or from tc, based on the operation ID. If the operation is a write, but not the first write, the write is inserted according to the operation ID along with ensuring there have been no reads from a past write after which this write is placed.
  • In RMCC, when a transaction t issues a commit request, various parts of the transaction precedence graphs involving the transaction can be merged. For this procedure, the dependencies that were accumulated during operation on storage nodes that are locally known to t can be added and used. After merging relevant parts of the transaction precedence graphs, a check can be made as to whether it is safe to commit t or not. The check can be conducted by checking for cycles in the transaction precedence graph to ensure serializability. During this check, new edges to the transaction precedence graph are added to ensure linearizability.
  • the new edges are created such that if the timestamp of a first transaction (Txn1.timestamp) is less than the timestamp of a second transaction (Txn2.timestamp), then a directed edge is inserted to indicate that Txn1 precedes Txn2 in the precedence graph.
  • the DAG coordinators can be the components in the RMCC responsible for handling commits and managing the transaction precedence graph.
  • the DAG coordinator can maintain a transaction precedence graph for a subset of dependent transactions. It can be responsible for checking if a transaction t can commit. On receiving a commit request from a transaction t, the transaction coordinator of t can issue a commit request to its DAG coordinator. On receipt of this request, the DAG coordinator of t can obtain partial transaction precedence graphs from DAG coordinators of dependent transactions. The DAG coordinator of t then can build a bigger graph combining all the partial ones and can check for cycles consisting of already committed transactions and t. The DAG coordinator can also be responsible for identifying and clearing out transactions. For this clearing procedure, the DAG coordinator can issue a cleanup request to the transaction coordinator of the transactions that the DAG coordinator identifies as being no longer useful.
  • the invariant can be maintained to speed up the cycle-finding process and thus reduce the time to commit, instead of performing a depth-first search (DFS) on the whole graph each time.
  • a check can be made only for cycles consisting of already committed transactions and t.
  • the cycle checking can include ensuring that no cycles are formed including committed cleared-out transactions. The clear times of committed transactions can be used to ensure that no cycles are formed.
  • the DAG coordinator can store information to facilitate commit or abort operations.
  • the information can include part of a transaction precedence graph.
  • the DAG coordinator can track all committed nodes and uncommitted nodes. For each committed transaction, the DAG coordinator can store commit time and clear time of the committed transaction. The DAG coordinator can store the transaction coordinator for each transaction in the graph.
  • a cleanup can be triggered during every commit and every abort on a DAG coordinator.
  • a DAG coordinator at the end of a commit request, has a part of the transaction precedence graph stored on it.
  • the DAG coordinator identifies and aborts certain uncommitted transactions and cleans up certain committed transactions.
  • the transactions that are aborted include any uncommitted transaction tu that has a committed out-neighbor t that in turn has a committed out-neighbor t' to which tu is not adjacent. That is, there is an edge from tu to t and from t to t', but not from tu to t'.
  • the transactions that are cleaned up include, after the uncommitted transactions to abort have been identified, certain committed transactions to which cleanup operations can be issued, where these committed transactions have no uncommitted in-neighbors.
  • the clear times of all committed transactions can be updated with respect to the set of transactions that have been identified to be cleaned up. Recall that the clear time of t is the maximum among commit(t) and the commit times of all cleared committed transactions that can reach t. This update of clear times can be performed by a process similar to finding a topological ordering. Initially, all edges can be marked as not traversed. Then, iteratively in each round, a node can be found with all incoming edges from committed transactions traversed, and the clear time of the found node can be shared with its out-neighbors. The neighbors can update their clear times accordingly. A simplified sketch of this cleanup procedure is given after this list.
  • the clear time can be stored along with the cleared record in the storage nodes. That is, during cleanup of versions, on each key, the value of the last cleared transaction along with its clear time can be stored. This action helps to ensure that no cycles are formed due to cleared out transactions during cycle checking.
  • FIG. 15 is a flow diagram of features of an embodiment of an example method 1500 of operating a distributed data storage system.
  • dependencies among transactions in a distributed system are modeled.
  • the distributed system has storage nodes arranged individually in a distributed arrangement.
  • the dependencies among transactions are modeled using transaction precedence graphs partially constructed while executing the transactions.
  • the transactions are marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes.
  • a transaction is committed in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
  • Variations of method 1500 or methods similar to method 1500 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of devices or systems in which such methods are implemented. Variations of such methods can include dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Dynamically determining data can include determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
  • Variations can include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records. Variations can include maintaining, in each storage node, data records, transaction records, and transaction precedence graph records. Each transaction record can have a transaction identification and each transaction precedence graph record can have a transaction precedence graph identification. Variations can include issuing read or write requests to the storage nodes from client nodes of the distributed system. The client nodes can be arranged with interfaces to end-users, where the end-users are external to the distributed system.
  • Variations of method 1500 or methods similar to method 1500 can include tracking the transaction in the distributed system as being in-progress, committed, or aborted and maintaining and updating the transaction precedence graph for the transaction.
  • the transaction precedence graph can be combined with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • Variations can include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • Variations of method 1500 or methods similar to method 1500 can include locating partial transaction precedence graphs containing a neighbor transaction to the transaction and adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction. Commit times between committed transactions of the partial transaction precedence graphs can be checked and edges can be added based on the check of the commit times. A check can be performed for a cycle in the combined transaction precedence graph for the transaction and a determination to commit or to abort can be performed from the checking for a cycle.
  • Variations of method 1500 or methods similar to method 1500 can include operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple DAG coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • the transaction coordinator for the given transaction can be assigned as the DAG coordinator for the given transaction.
  • Variations can include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction.
  • the transaction coordinator for the given transaction can communicate with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph.
  • the transaction coordinator can apply a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • a non-transitory machine-readable storage device such as computer-readable non-transitory medium, can comprise instructions stored thereon, which, when performed by a machine, cause the machine to perform operations, where the operations comprise one or more features similar to or identical to features of methods and techniques described with respect to method 1500, variations thereof, and/or features of other methods taught herein.
  • the physical structures of such instructions can be operated on by at least one processor.
  • executing these physical structures can cause the machine to perform operations comprising modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
  • Operations can include dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Dynamically determining data can include determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph. Operations can include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
  • the operations can include maintaining, in each storage node, data records, transaction records, and transaction precedence graph records, where each transaction record has a transaction identification and each transaction precedence graph record has a transaction precedence graph identification.
  • the operations can include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • the operations can include issuing read or write requests to the storage nodes from client nodes of the distributed system, where the client nodes can be arranged with interfaces to end-users, with the end-users external to the distributed system.
  • Operations can include tracking the transaction in the distributed system as being in-progress, committed, or aborted and maintaining and updating the transaction precedence graph for the transaction.
  • the operations can include combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • Operations can include locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
  • Operations can include operating multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • Operations can include, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
  • Operations can include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • a distributed system can comprise storage nodes arranged individually in a distributed arrangement, a memory storing instructions, and at least one processor in communication with the memory.
  • the at least one processor can be configured, upon execution of the instructions, to perform a number of steps.
  • Dependencies among transactions in the distributed system can be modeled using transaction precedence graphs partially constructed while executing the transactions.
  • the transactions can be correlated to a key stored in the storage nodes.
  • a transaction of the transactions in the distributed system, where the transaction is correlated to the key, can be committed in response to checking for cycles in a transaction precedence graph for the transaction.
  • the at least one processor can be configured to dynamically determine data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Determination of data to remove can be conducted by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
  • the storage nodes can include data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
  • Each storage node can include data records, transaction records, and transaction precedence graph records, where each transaction record can have a transaction identification and each transaction precedence graph record can have a transaction precedence graph identification.
  • the at least one processor can be configured to remove the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
  • the distributed systems can include client nodes configured to issue read and write requests to the storage nodes, where the client nodes are arranged with interfaces to end-users, with the end-users being external to the distributed system.
  • Variations of such a distributed system or similar distributed systems can include the at least one processor being configured to track the transaction in the distributed system as being in-progress, committed, or aborted, and maintain and update the transaction precedence graph for the transaction.
  • the at least one processor can be configured to combine the transaction precedence graph with other transaction precedence graphs, in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
  • Variations of such a distributed system or similar distributed systems can include the at least one processor being configured to locate partial transaction precedence graphs containing a neighbor transaction to the transaction and to add transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction.
  • the at least one processor can be configured to check commit times between committed transactions of the partial transaction precedence graphs and add edges based on the check of the commit times and to check for a cycle in the combined transaction precedence graph for the transaction.
  • the at least one processor can be configured to determine to commit or to abort from the checking for a cycle.
  • Variations of such a distributed system or similar distributed systems can include the at least one processor configured, upon execution of instructions, to perform operations as multiple transaction coordinators and multiple directed acyclic graph (DAG) coordinators, such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted, and each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
  • the transaction coordinator for the given transaction can be assigned as the DAG coordinator for the given transaction.
  • the transaction coordinator for the given transaction in response to a commit request for the given transaction from the client node, can perform a number of functions.
  • the transaction coordinator can determine a current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction.
  • the transaction coordinator for the given transaction can communicate with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph.
  • the transaction coordinator for the given transaction can apply a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
  • FIG. 16 is a block diagram illustrating components of a computing system 1600 that can implement algorithms and perform methods structured to process data for an application in conjunction with using RMCC for data processing. All components need not be used in various embodiments.
  • the computing system 1600 can include a processor 1601, a memory 1612, a removable storage 1623, a non-removable storage 1622, and a cache 1628.
  • the processor 1601 can be implemented as multiple processors.
  • the computing system 1600 can be structured in different forms in different embodiments.
  • the computing system 1600 can be implemented in conjunction with various components associated with the distributed system 900 of Figure 9 and the datacenter 1000 of Figure 10. Although the various data storage elements are illustrated as part of the computing system 1600, the storage can also or alternatively include cloud-based storage accessible via a network, such as the Internet or remote server-based storage.
  • the memory 1612 can include a volatile memory 1614 and/or a nonvolatile memory 1617.
  • the computing system 1600 can include or have access to a computing environment that includes a variety of computer-readable media, such as the volatile memory 1614, the non-volatile memory 1617, the removable storage 1623 and/or the non-removable storage 1622.
  • Computer storage can include data storage server, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
  • the computing system 1600 can include or have access to a computing environment that includes an input interface 1627, an output interface 1624, and a communication interface 1631.
  • the output interface 1624 can include a display device, such as a touchscreen, that also can serve as an input device.
  • the input interface 1627 can include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computing system 1600, and other input devices.
  • the communication interface 1631 can exchange communications with external devices and networks.
  • the computing system 1600 can operate in a networked environment using a communication connection to connect to one or more remote computers, such as one or more remote compute nodes.
  • the remote computer can include a PC, a server, a router, a network PC, a peer device or other common network node, or the like.
  • the communication connection can include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other networks.
  • the components of the computing system 1600 can be connected with a system bus 1621.
  • Computer-readable instructions stored on a computer-readable medium are executable by the processor 1601 of the computing system 1600.
  • a hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium, such as a storage device.
  • the terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory.
  • Storage can also include networked storage, such as a storage area network (SAN).
  • the program 1613 of the computing system 1600 can be used to cause the processor 1601 to perform one or more methods or algorithms described herein.
  • the components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments can be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components can be implemented, for example, as a computer program product such as a computer program, program code or computer instructions tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.
  • a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any processors of any kind of digital computer.
  • a processor will receive instructions and data from a read-only memory or a random-access memory or both.
  • the elements of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • Devices suitable for embodying computer program instructions and data include all forms of memory, including by way of example, semiconductor memory devices, e.g., electrically programmable read-only memory or ROM (EPROM), EEPROM, flash memory devices, and/or data storage disks (e.g., magnetic disks, internal hard disks, or removable disks, magneto-optical disks, CD-ROM, or DVD-ROM disks).
  • a RMCC framework can address inefficient memory usage of conventional distributed storage systems by clearing out versions of transactions as soon as they are not part of active processing, while providing serializability and external causality.
  • RMCC uses transaction precedence graphs, which allows determination of whether a commit will form a dependency cycle, which would violate serializability.
  • the same transaction precedence graphs can be used to clean multiple record versions such that older versions of records in a given committed transaction are cleared out of storage when there is no path from any uncommitted transaction to the given committed transaction in the transaction precedence graph.
  • the clearing can be triggered by commit or abort requests generated by transactions, which can result in multiple record versions being kept long enough to satisfy any open transactions, and not any longer.
  • RMCC can support serializable isolation level, a level of external consistency (linearizability), increased concurrency, and global transactions, which are transactions spanning multiple geographical regions, along with achieving efficient memory usage by cleaning up versions on the go.
  • RMCC provides a mechanism to split up the bookkeeping of transactions to make that process distributed and scalable.
  • the transaction relationships for all ongoing transactions are maintained. This maintenance is performed by keeping pieces (partial DAGs) of the entire picture in different nodes of the system. These DAGs represent the order in which transactions are to be recorded in the system to preserve the serializable + linearizable properties.
  • When a transaction attempts to commit, it is evaluated against a combined DAG made up of all the current partial DAGs. A transaction is allowed to commit only if it will not form a cycle with the already committed transactions in the combined DAG.
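
The commit-time procedure outlined above (merging partial precedence graphs, adding timestamp-ordering edges between committed transactions, and allowing a commit only if no cycle is formed with already committed transactions) can be pictured with the following simplified Python sketch. This is a minimal, single-process illustration under assumed data structures; the names (PrecedenceGraph, merge, can_commit) are hypothetical and are not taken from the disclosure, and the real procedure runs across distributed DAG coordinators as described above.

# Illustrative sketch only; class, method, and field names are hypothetical.
from collections import defaultdict

class PrecedenceGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # transaction id -> set of transaction ids it precedes
        self.committed = {}             # transaction id -> commit timestamp

    def add_edge(self, a, b):
        if a != b:
            self.edges[a].add(b)

    def merge(self, other):
        # Combine a partial precedence graph obtained from another coordinator.
        for a, succs in other.edges.items():
            self.edges[a] |= succs
        self.committed.update(other.committed)

    def add_linearizability_edges(self):
        # If Txn1's commit timestamp is less than Txn2's, Txn1 must precede Txn2.
        # Edges between consecutive committed transactions give the same reachability
        # as an edge for every such pair.
        ordered = sorted(self.committed, key=self.committed.get)
        for earlier, later in zip(ordered, ordered[1:]):
            self.add_edge(earlier, later)

    def _committed_path(self, src, dst):
        # Depth-first search from src to dst that passes only through committed
        # transactions (dst itself may be uncommitted).
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen or node not in self.committed:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def can_commit(self, t):
        # t may commit only if committing it closes no cycle made of t and already
        # committed transactions, i.e., no committed successor of t reaches back to t.
        return not any(succ in self.committed and self._committed_path(succ, t)
                       for succ in self.edges[t])

For example, a coordinator could merge the partial graphs received from the coordinators of t's dependent transactions, call add_linearizability_edges(), and then commit t only when can_commit(t) returns True.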

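Similarly, the abort and cleanup rules described in the list above (aborting an uncommitted transaction that bypasses a committed successor of its committed successor, clearing committed transactions that have no uncommitted in-neighbors, and propagating clear times in topological order) can be sketched as plain functions over an in-memory graph. This is a simplified sketch under assumed representations (a dict mapping each transaction to its set of successors, plus per-transaction commit times); the function names are illustrative and not part of the disclosure.

# Illustrative sketch only; function names and graph representation are hypothetical.
from collections import defaultdict

def find_aborts(edges, committed):
    # An uncommitted transaction tu is aborted if it has a committed out-neighbor t
    # that has a committed out-neighbor t2 to which tu is not adjacent.
    aborts = set()
    for tu, succs in edges.items():
        if tu in committed:
            continue
        for t in succs:
            if t in committed and any(t2 in committed and t2 not in succs
                                      for t2 in edges.get(t, ())):
                aborts.add(tu)
                break
    return aborts

def find_clearable(edges, committed, aborts):
    # Committed transactions with no remaining uncommitted in-neighbors can be cleared.
    in_neighbors = defaultdict(set)
    for a, succs in edges.items():
        for b in succs:
            in_neighbors[b].add(a)
    return {c for c in committed
            if all(n in committed or n in aborts for n in in_neighbors[c])}

def update_clear_times(edges, committed, commit_time, cleared):
    # clear_time(t) = max(commit_time(t), commit times of all cleared committed
    # transactions that can reach t), computed over the committed subgraph in
    # topological order (Kahn-style), mirroring the round-based traversal above.
    indegree = {c: 0 for c in committed}
    for a in committed:
        for b in edges.get(a, ()):
            if b in committed:
                indegree[b] += 1
    inherited = {c: 0 for c in committed}   # best cleared commit time reaching c so far
    ready = [c for c, d in indegree.items() if d == 0]
    while ready:
        t = ready.pop()
        outgoing = max(commit_time[t], inherited[t]) if t in cleared else inherited[t]
        for succ in edges.get(t, ()):
            if succ not in committed:
                continue
            inherited[succ] = max(inherited[succ], outgoing)
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return {c: max(commit_time[c], inherited[c]) for c in committed}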

Abstract

A distributed system is provided, comprising storage nodes arranged individually in a distributed arrangement, a memory storing instructions, and at least one processor in communication with the memory. The at least one processor is configured, upon execution of the instructions, to perform the following steps: modeling dependencies among transactions in the distributed system using transaction precedence graphs partially constructed while executing the transactions, the transactions correlated to a key stored in the storage nodes; and committing a transaction, correlated to the key, of the transactions in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.

Description

REFERENCE-MANAGED CONCURRENCY CONTROL
TECHNICAL FIELD
[0001] The present disclosure is related to systems that process data for applications and, in particular, to methods and apparatus associated with an inmemory transaction key-value storage system.
BACKGROUND
[0002] Storage systems store records, where users of the storage system can create, update, retrieve, and delete records by identifying them via a key. A key is a unique entity that can identify a particular record in a system. With a record in the system being a set of fields, a key is a piece of information from which a record can be found in a search of the database, including finding all related fields. The key can be user defined. An example of a key can include, but is not limited to, an e-mail address. In non-distributed system design, all records reside on the same machine. In contrast, distributed systems split (partition) a set of possible keys and assign each subrange of keys to separate machines, typically called partition nodes or shards. A transaction reads and/or writes according to a set of keys in a single atomic step so that the changes appear simultaneous. A transaction is a set of one or more user-initiated operations.
[0003] Traditional databases keep the latest version of their records. In order to satisfy isolation and consistency requirements, the database locks records when they are accessed, which is normally done using a two-phase commit protocol (2PC). Under concurrent execution, this approach generates quite a high rate of aborted transactions, which can be a measure of poor performance, since all keys shared by running transactions become choke points.
[0004] Multi-Version Concurrency Control (MVCC) is the standard solution employed by most current state-of-the-art databases. The approach in MVCC requires that all writes within a time window, called the retention window, are kept. This retention removes conflicts between read and write transactions resulting in a dramatic improvement of throughput for the system.
[0005] There are downsides associated with MVCC. With MVCC, multiple versions of records are kept by the system for days or weeks which incurs significant space overhead. Delta-encoding, which is a technique of storing or transmitting data in the form of differences between sequential data rather than complete fdes, does help to ensure that the cost is not linear, at the cost of a runtime penalty for reconstruction. With MVCC, continuous garbage collection is required. All current databases provide a background mechanism that continuously sweeps all data and removes versions older than the configured retention window. Transactions running for a time longer than this retention window cannot benefit from MVCC and are either automatically aborted or revert to locking, which restricts access to data.
SUMMARY
[0006] It is an object of various embodiments to provide an efficient architecture and methodology for processing data for an application in a distributed manner. The architecture can include a distributed system, implemented as an in-memory transaction key- value storage system that can allow for high concurrency and efficient memory usage while providing strict serializability and reasonable latency for transactions. In the architecture, transaction precedence graphs can be used to identify and clear out versions of transactions that are no longer in use. The distributed system can include use of partially constructed transaction precedence graphs that can be constructed while executing the transactions and can be maintained across multiple nodes. The constructed transaction precedence graphs can be updated and combined when transactions attempt to commit. Procedures for a commit can include combining the partial precedence graphs and performing cycle checking in the transaction precedence graph in order to achieve the consistency objectives of the distributed storage of the distributed system.
[0007] According to a first aspect of the present disclosure, there is provided a distributed system comprising storage nodes arranged individually in a distributed arrangement; a memory storing instructions; and at least one processor in communication with the memory, the at least one processor configured, upon execution of the instructions, to perform the following steps: model dependencies among transactions in the distributed system using transaction precedence graphs partially constructed while executing the transactions, the transactions correlated to keys stored in the storage nodes; and commit a transaction, correlated to the keys, of the transactions in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
[0008] In a first implementation form of the distributed system according to the first aspect as such, the at least one processor is configured to dynamically determine data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
[0009] In a second implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the storage nodes include data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
[0010] In a third implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the at least one processor is configured to: track the transaction in the distributed system as being in-progress, committed, or aborted; and maintain and update the transaction precedence graph for the transaction and combine the transaction precedence graph with other transaction precedence graphs, in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
[0011] In a fourth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the at least one processor is configured to remove the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
[0012] In a fifth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the distributed system includes client nodes configured to issue read and write requests to the storage nodes, the client nodes arranged with interfaces to endusers, the end-users external to the distributed system.
[0013] In a sixth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the at least one processor is configured to: locate partial transaction precedence graphs containing a neighbor transaction to the transaction; add transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; check commit times between committed transactions of the partial transaction precedence graphs and add edges based on the check of the commit times; check for a cycle in the combined transaction precedence graph for the transaction; and determine to commit or to abort from the checking for a cycle.
[0014] In a seventh implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, each storage node includes data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
[0015] In an eighth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, the at least one processor is configured, upon execution of the instructions, to perform operations as multiple transaction coordinators and multiple directed acyclic graph (DAG) coordinators, such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted and each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
[0016] In a ninth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, at start of a given transaction, the transaction coordinator for the given transaction is assigned as the DAG coordinator for the given transaction.
[0017] In a tenth implementation form of the distributed system according to the first aspect as such or any preceding implementation form of the first aspect, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determines current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicates with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applies a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
[0018] According to a second aspect of the present disclosure, there is provided a method of operating a distributed data storage system. The method comprises modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
[0019] In a first implementation form of the method of operating a distributed data storage system according to the second aspect as such, the method includes dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
[0020] In a second implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
[0021] In a third implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes tracking the transaction in the distributed system as being in-progress, committed, or aborted; and maintaining and updating the transaction precedence graph for the transaction and combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
[0022] In a fourth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
[0023] In a fifth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes issuing read or write requests to the storage nodes from client nodes of the distributed system, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
[0024] In a sixth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes: locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
[0025] In a seventh implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes maintaining, in each storage node, data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
[0026] In an eighth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes: operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
[0027] In a ninth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
[0028] In a tenth implementation form of the method of operating a distributed data storage system according to the second aspect as such or any preceding implementation form of the second aspect, the method includes, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
[0029] According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing instructions for processing data, which, when executed by at least one processor, cause the at least one processor to perform operations comprising modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
[0030] In a first implementation form of the non-transitory computer-readable medium according to the third aspect as such, the operations include dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
[0031] In a second implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
[0032] In a third implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include: tracking the transaction in the distributed system as being in-progress, committed, or aborted; and maintaining and updating the transaction precedence graph for the transaction and combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
[0033] In a fourth implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
[0034] In a fifth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include issuing read or write requests to the storage nodes from client nodes of the distributed system, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
[0035] In a sixth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include: locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
[0036] In a seventh implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include maintaining, in each storage node, data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
[0037] In an eighth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include: operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
[0038] In a ninth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
[0039] In a tenth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
[0040] Any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment in accordance with the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
[0042] Figure 1 is a representation of two transactions in a directed graph, according to an example embodiment.
[0043] Figure 2 is a representation in which a transaction has been included with the transactions of Figure 1, according to an example embodiment.
[0044] Figure 3 is a representation of a transaction precedence graph for transactions having a cycle, according to an example embodiment.
[0045] Figure 4 is a representation of a directed acyclic graph, according to an example embodiment.
[0046] Figures 5-7 illustrate topological sorts of a topological ordering, according to an example embodiment.
[0047] Figure 8 illustrates an arrangement between a given transaction and other transactions for the given transaction trying to commit, according to an example embodiment.
[0048] Figure 9 is a representation of a distributed system for services and nodes that can be structured for reference managed concurrency control, according to an example embodiment.
[0049] Figure 10 is a representation of interactions of various components for reference managed concurrency control in a datacenter, according to an example embodiment.
[0050] Figure 11 illustrates a mechanism for the datacenter of Figure 10 to annotate transaction operations and allow other transactions to discover a given directed acyclic graph for concurrent access, according to an example embodiment.
[0051] Figures 12A-B are a flow diagram of an example of starting a transaction and performing operations in a reference managed concurrency control, according to an example embodiment.
[0052] Figures 13A-B are a flow diagram of an example commit request in a reference managed concurrency control, according to an example embodiment.
[0053] Figures 14A-B are a flow diagram of an example cleanup communication in a reference managed concurrency control, according to an example embodiment.
[0054] Figure 15 is a flow diagram of features of an example method of operating a distributed data storage system, according to an example embodiment.
[0055] Figure 16 is a block diagram illustrating a computing system that implements algorithms and performs methods structured to process data for an application, according to an example embodiment.
DETAILED DESCRIPTION
[0056] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized, and that structural, logical, mechanical, and electrical changes may be made.
The following description of example embodiments is, therefore, not to be taken in a limited sense.
[0057] The functions or algorithms described herein may be implemented in software, in an embodiment. The software may comprise computer-executable instructions stored on computer-readable media or a computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, application-specific integrated circuit (ASIC), a microprocessor, or other type of processor operating on a computer system, such as a personal computer (PC), a server, or other computer system, turning such computer system into a specifically programmed machine.
[0058] Computer-readable non-transitory media includes all types of computer-readable media, including magnetic storage media, optical storage media, and/or solid-state storage media, and specifically excludes signals. It should be understood that the software can be installed in and sold with the devices that operate in association with reference managed concurrency control for data processing as taught herein. Alternatively, the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software can be stored on a server for distribution over the Internet, for example.
[0059] In various embodiments, a system, referred to as reference managed concurrency control (RMCC), can be implemented as an in-memory transaction key-value storage system that allows for high concurrency and efficient memory usage while guaranteeing strict serializability and reasonable latency for transactions. Serializable means that a transaction can be serialized. In this approach, given a set of transactions, these transactions are said to be serializable if some ordering can be created where these transactions appear to execute one after the other. This total ordering on the set is a serial schedule. There could be more than one total ordering, e.g., [T1 -> T2 -> T3] and [T3 -> T1 -> T2] are both serial schedules. This may not be the order in which these transactions happened in real time, but serializability is indifferent to time order. Given a set of transactions, if a serial schedule can be constructed, then each one of these transactions is serializable with respect to the set. In a running system, where all committed transactions thus far are in a serial schedule, a new transaction that wants to commit is serializable with respect to the system if some placement for the new transaction can be found in the existing ordering such that the whole set is still a serial schedule. Thus, a transaction is serializable if the transaction appears to have occurred in some serial schedule. A system is not required to physically create such a schedule, but provides a basis for reasoning that transactions are isolated from each other and appear to be isolated to any observer of the system. If an attempt is made to add a new transaction to a given set of committed/historical transactions, this new transaction is said to be serializable if there exists some possible ordering (aka serial order), called a schedule, which makes it appear as if all these transactions executed sequentially, one after the other.
[0060] In a system which provides a serial schedule, transactions are made to appear as if they executed non-interleaved even if in reality they did overlap, that is, no transaction can be seen to start until a running transaction has ended. A serial schedule can include a total ordering of transactions. For any two non-concurrent transactions T1 and T2, if the end of transaction T1 occurs before the start of transaction T2 (end(T1) < start(T2)) in real time, T1 occurs before T2 and transactions T1 and T2 are linearizable. A system has strict serializability when the transactions are serializable plus linearizable. In other words, the system makes it appear to all observers that there is a total ordering of the transactions which is consistent with their real-time ordering. Linearizability identifies a valid ordering, but does not specifically address transactions that occur concurrently. Strict serializability, which is serializable plus linearizable, makes it appear to any observer that all historical transactions have occurred one after the other, and if a second transaction actually occurs after a first transaction, then any observer will see that the results of the second transaction supersede the results of the first transaction.
[0061] Design of the RMCC can be directed to efficiently remove unneeded record versions on demand, as opposed to traditional MVCC designs that require periodic garbage collection. The design includes a RMCC transaction protocol design along with implementation considerations. The RMCC can address the problems associated with MVCC while retaining and improving the benefits. Multiple versions of records can still be used in the RMCC, but this retention in the RMCC is performed on demand and only to satisfy ongoing transactions. As soon as transactions commit, the extra versions can be discarded. A commit is the application of the changes made to the subject records of a transaction at an accepted completion of the transaction. A transaction to perform one or more operations on a record of the distributed system can be started by reception of a begin statement from a user device. When the operations of the transaction are at a completion point, the user device can either commit the transaction or abort the transaction, which means the user device either issues a commit command to the system to apply all the changes that were performed on the record or reverts the changes that were made in executing the operations of the transaction.
[0062] With respect to memory usage, it is estimated that nearly a 95% reduction in memory overhead for supporting concurrent transactions can be achieved with RMCC as compared with traditional MVCC. With respect to background overhead, the use of RMCC can eliminate the need for background garbage collection in favor of graph-based transaction dependency tracking, with immediate cleanup upon commit. With respect to latency improvement, the RMCC can reduce reliance on timestamp usage, allowing concurrent timestamp allocation, which can achieve up to 33% better network latency per transaction. With respect to abort rate reduction, the RMCC can improve the conflict rate over traditional MVCC by up to 50% in certain workloads due to usage of graph cycle detection.
[0063] In various embodiments, a RMCC can be implemented to address inefficient memory usage by clearing out versions of transactions as soon as they are not needed, while providing serializability and external causality guarantees. In order to achieve these goals, the RMCC can use transaction precedence graphs. The use of transaction precedence graphs allows the RMCC to determine if a commit will form a dependency cycle, which violates strict serializability. The same graphs can be used to clean multiple record versions, where older versions of records in a committed transaction are cleared out when there is no path from any uncommitted transaction to the committed transaction in the transaction precedence graph. The clearing procedure can be triggered by commit requests or abort requests by transactions, which means that multiple record versions are retained long enough to satisfy any open transactions, and not any longer.
[0064] Figure 1 is an example of a representation 100 of two transactions T1 and T2 in a directed graph. A directed graph G(V, E) of transactions can include a set of all transactions V with associated edges E. In the example of Figure 1, if T1 occurs before T2 in the schedule, an edge extends from transaction T1 to T2, for which T2 is said to depend on T1, where T1 and T2 are based on the same key. Figure 2 is an example of a representation 200 in which a transaction T3 has been included with transactions T1 and T2 of Figure 1. Edges can be added among T1, T2, and T3 by detecting conflicts, which are key-based, or linearizability, which is based on time order.
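By way of non-limiting illustration, the following simplified sketch, written in Python, shows one way such a precedence graph could be recorded, with an edge added whenever two transactions conflict on the same key. The class and method names are hypothetical and are not part of any embodiment described above.

    from collections import defaultdict

    class PrecedenceGraph:
        def __init__(self):
            # Map a transaction identification to the set of transactions
            # that depend on it (directed edges of the precedence graph).
            self.edges = defaultdict(set)

        def add_edge(self, earlier, later):
            # "later" depends on "earlier"; ignore self-edges.
            if earlier != later:
                self.edges[earlier].add(later)

        def add_key_conflict(self, key_accesses):
            # key_accesses: transaction ids in the order they accessed one key;
            # each access depends on the access that preceded it.
            for earlier, later in zip(key_accesses, key_accesses[1:]):
                self.add_edge(earlier, later)

    graph = PrecedenceGraph()
    graph.add_key_conflict(["T1", "T2"])        # T1 and T2 conflict on one key
    graph.add_key_conflict(["T1", "T2", "T3"])  # T3 joins, as in Figure 2
    print(dict(graph.edges))                    # {'T1': {'T2'}, 'T2': {'T3'}}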
[0065] Figure 3 is an example of a representation 300 of a transaction precedence graph for transactions having a cycle. The transactions are represented by vertices 0, 1, 2, 3, and 4. A transaction precedence graph is serializable if and only if no cycles are in the transaction precedence graph. If a transaction precedence graph has a cycle, there is no total ordering of transactions and no serial schedule. In representation 300, there is a cycle between transaction vertices 2 and 4 and among transaction vertices 0, 1, 2, and 3.
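For illustration only, a minimal sketch of detecting a cycle in such a transaction precedence graph follows; the edge set is a hypothetical graph consistent with the cycles described for representation 300, not the exact edges of Figure 3.

    def has_cycle(vertices, edges):
        # Depth-first search with three colors: unvisited, on the current search
        # path, and finished; a back edge to an on-path vertex closes a cycle.
        adjacency = {v: [] for v in vertices}
        for src, dst in edges:
            adjacency[src].append(dst)
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {v: WHITE for v in vertices}

        def visit(v):
            color[v] = GRAY
            for w in adjacency[v]:
                if color[w] == GRAY:
                    return True
                if color[w] == WHITE and visit(w):
                    return True
            color[v] = BLACK
            return False

        return any(color[v] == WHITE and visit(v) for v in vertices)

    edges = [(0, 1), (1, 2), (2, 3), (3, 0),   # a cycle among vertices 0, 1, 2, 3
             (2, 4), (4, 2)]                   # a cycle between vertices 2 and 4
    print(has_cycle([0, 1, 2, 3, 4], edges))   # True: not serializable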
[0066] Figure 4 is an example of a representation 400 of a directed acyclic graph (DAG). A directed graph with no cycles is a DAG. Since a serializable transaction precedence graph has no cycles, a RMCC that deals with serializable transaction precedence graphs operates with respect to DAGs. A DAG has a topological ordering if a total ordering of the vertices is provided such that there is no backward edge. The edges among vertices 0, 1, 2, 3, 4, 5, 6, and 7 of representation 400 have no backward edges. A topological ordering is a serial schedule.
[0067] Figures 5-7 illustrate an example of a topological sort of a topological ordering. The topological sort is conducted to arrange the vertices beginning with vertices having no incoming edges. Figure 5 shows a beginning arrangement 500 of vertices A, B, C, D, E, F, and G in which there are no cycles. Vertex A has edges to vertices B and C. Vertex B has edges to vertices C and D. Vertex C has an edge to vertex E. Vertex D has edges to vertices F and E. Vertex E has no outgoing edge to any vertex in the arrangement. Vertex G has edges to E and F and no edge incoming from any vertex in the arrangement to vertex G. Figure 6 illustrates an arrangement 600 of the vertices of the arrangement 500 of Figure 5 with the vertices arranged in a linear fashion, maintaining the directed edges of the arrangement 500. Figure 7 illustrates an arrangement 700 of the vertices of the arrangement 600 of Figure 6 with the vertices arranged in a linear fashion, where the procedure keeps picking vertices with no incoming edges. Arrangement 700 results with the edges among vertices A, B, C, D, E, F, and G of arrangement 700 having no backward edges and the vertices having no input edges arranged at the beginning of arrangement 700. DAGs, topological ordering, and topological sorting can be used in operation of a RMCC.
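For illustration only, the following sketch performs a Kahn-style topological sort that repeatedly picks vertices with no incoming edges, using an edge set matching the arrangement of Figure 5; it is a simplified example and not a limiting implementation.

    from collections import defaultdict, deque

    def topological_sort(vertices, edges):
        # Kahn's algorithm: repeatedly pick a vertex with no incoming edges.
        indegree = {v: 0 for v in vertices}
        adjacency = defaultdict(list)
        for src, dst in edges:
            adjacency[src].append(dst)
            indegree[dst] += 1
        ready = deque(sorted(v for v in vertices if indegree[v] == 0))
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for w in adjacency[v]:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
        if len(order) != len(vertices):
            raise ValueError("graph has a cycle; no topological ordering exists")
        return order

    edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
             ("C", "E"), ("D", "F"), ("D", "E"), ("G", "E"), ("G", "F")]
    print(topological_sort(list("ABCDEFG"), edges))   # e.g., A, G, B, C, D, F, E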
[0068] A RMCC, as taught herein, can be implemented to use a transaction precedence graph to identify and clear out transactions. The functions of the RMCC can include using transaction precedence graphs to achieve serializability in a distributed system, while also providing external causality. Due to the overhead of maintaining and using a transaction precedence graph for cycle checking, network overhead for RMCC may not be minimized, but for transactions having no concurrent transactions, RMCC performance can achieve low latency on the order of microseconds.
[0069] RMCC can provide strict serializability. In RMCC, two transactions are concurrent if their execution times overlap and they access a common resource during their execution. RMCC can provide client devices with the strictest consistency guarantee for transactions, which is called external consistency or linearizability. Under external consistency, the distributed system using RMCC behaves in an order that is consistent with real-time. If one transaction T1 commits before another transaction T2 that is non-concurrent with T1 commits, the system guarantees that client devices do not see a state that includes the effect of the second transaction T2 but not the first T1. Intuitively, RMCC is semantically indistinguishable from a single-machine database.
[0070] Architectures for a RMCC can be constructed with several guiding design elements. An RMCC can be structured to achieve memory efficiency by identifying transactions to clear out in the transaction precedence graph once the transaction is committed or aborted. The RMCC can be constructed to use transaction precedence graphs along with cycle checking over distributed storage to achieve serializability. The RMCC can be constructed to achieve linearizability and serializability without using begin timestamps. The RMCC can be constructed to use end timestamps that can be limited to ensure linearizability and not for record version selection. The RMCC can be constructed to use partially constructed transaction precedence graphs maintained across multiple nodes along with updating and combining transaction precedence graphs when transactions try to commit.
[0071] Figure 8 illustrates an arrangement 800 between a transaction T and other transactions C1, C2, and C3 for transaction T trying to commit. In RMCC, a commit procedure can include cycle detection, which is based on an invariant that there are no cycles among committed transactions. In arrangement 800, there are three committed transactions C1, C2, and C3 in which the three committed transactions do not include a cycle among themselves. Committed transactions C1, C2, C3 have been committed and are fixed in the system. Reversal of the decision in which C1, C2, C3 have been committed does not occur without breaking consistency. Committed transactions C1, C2, C3 are still in the DAG including T, which the system maintains, since T has not yet committed. Consider transaction T trying to commit. Cycle determination can be conducted by cycle detection relative to the last transaction trying to commit, which can be used to speed up cycle finding. A number of different approaches can be implemented to detect a cycle. For the transaction precedence graph of arrangement 800, only a check for cycles consisting of transaction T and committed transactions C is conducted. As shown in Figure 8, adding an edge from committed transaction C3 to transaction T forms a cycle. With respect to Figure 8 showing transaction C1 depends on transaction T, where transaction T has not been committed yet, it is noted that the DAG is not used to enforce the order in which things commit. Rather, the DAG represents the dependencies among concurrent transactions and is used to determine if any particular transaction is allowed to commit. As a result of detecting the cycle, the RMCC can disallow transaction T to commit because it would form a cycle. Not allowing transaction T to commit is the same as forcing transaction T to abort. The reason for aborting transaction T when it comes to commit is that otherwise transaction T will break serializability, precisely because transactions C1, C2, and C3 are already committed. In DAG terms, committing transaction T would solidify a cycle in the system, which is equivalent to a non-serializable schedule. [0072] Figure 9 is a representation of a distributed system 900 for services and nodes that can be structured for a RMCC. A system can be implemented as a set of services that work together to achieve the system design goals. A service, or service cluster, can be realized as a set of nodes for a particular component, which nodes have been configured to work together. A node can be structured as a running instance of software for a given component. A node may include one or more processors to execute the instance of software. A node can have a unique network address, which provides a mechanism for other nodes or other portions of software to be able to send messages to the given node. Other nodes can include, but are not limited to, client devices or nodes from other services. A single machine, for example a host, can run multiple nodes, as decided by a system operator. The components of distributed system 900 can include a time stamp oracle (TSO) service 916, a control plane oracle (CPO) service 926, a storage service 906, and a persistence service 936. An oracle, as used herein, is an authority or mechanism that is configured to make decisions for the entity to which the oracle is directed. TSO service 916 can provide end timestamps for transactions to ensure linearizability. CPO service 926 can provide a central controller for clusters of nodes.
Storage service 906 can provide distributed storage of transaction data. Persistence service 936 can provide functionality to store data that can be available again or to other users or services after the process using or generating the data is no longer running. The functionality of these components can include implementation by software. The nodes of a service can all run instances of the same component software. For example, the TSO nodes can run instances of TSO software, with all TSO nodes configured to be part of the same TSO service. When a software module communicates with the TSO service, this software communicates with a particular node, which is part of a particular TSO service. A service-specific mechanism can be implemented to decide with which particular node to communicate.
[0073] In the distributed system 900, the storage service 906 can include storage node 905-1, storage node 905-2 . . . storage node 905-N. TSO service 916 can include TSO node 915-1, TSO node 915-2 . . . TSO node 915-N. CPO service 926 can include CPO node 925-1, CPO node 925-2 . . . CPO node 925-N. Persistence service 936 can include persistence node 935-1, persistence node 935-2 . . . persistence node 935-N. Though each of the services of distributed system 900 is shown having the same number of nodes, a similar distributed system can include services having different numbers of nodes. Each node of a service can run independently of the others and can be implemented with one or more processors executing stored instructions for the independent node. Alternatively, nodes can share one or more processors configured to support the functionality of the nodes of a service. Figure 9 reflects that during the running of one or more applications, the nodes of the active services for the one or more applications can interact with each other.
[0074] Figure 10 is a representation of interactions of various components for RMCC in a datacenter 1000. The components in datacenter 1000 include, but are not limited to, client nodes 1002-1, 1002-2, and 1002-3, storage nodes 1005-1, 1005-2, and 1005-3, a CPO 1026, a TSO 1016, transaction coordinators 1010-1 and 1010-2, and DAG coordinators 1020-1, 1020-2, and 1020-3. Though datacenter 1000 shows a number of these components, a datacenter, such as datacenter 1000, can include more or fewer than the number of each of the components shown in Figure 10.
[0075] TSO 1016 can be responsible for issuing real-time based timestamps with error bounds for transactions. In RMCC, TSO 1016 can be configured to be used only to obtain commit timestamps without being implemented for record version selection. Multiple TSO instances can be used at the same time. In such multiple TSO instances, the TSO instances can be implemented to agree on the maximum error bound of timestamps provided to the TSO instances. The ordering of transactions is normally determined based on the data dependency. In some cases, there are transactions that do not access any common data but have a real-world dependency, e.g., issuance of an order from one device after an earlier order is issued from a separate device. This is an external causal dependency since the dependency is external to the system. These transactions can be captured using timestamps issued by the TSO. Since there is no data dependency but there is a causal relationship, the associated DAG records the relationship of these transactions. This is recorded by inserting an edge based on the timestamp ordering of the earlier transaction and the later transaction.
[0076] CPO 1026 can be configured as the central controller for a cluster of nodes of datacenter 1000. CPO 1026 can be configured to be responsible for managing cluster partitions, scaling activities, or other managing activities for the cluster of components. CPO 1026 can also serve as a versioned discovery system so that nodes and clients can discover where the cluster components are located.
[0077] Client nodes 1002-1, 1002-2, and 1002-3 can be structured as coordination-free client nodes. Client nodes 1002-1, 1002-2, and 1002-3 can be configured to be the only components of the RMCC arrangement of datacenter 1000 visible to end-users. Each of client nodes 1002-1, 1002-2, and 1002-3 can communicate with transaction coordinator nodes or storage nodes to start a transaction, perform operations (read, write), and commit or abort a transaction. For these operations, client nodes 1002-1, 1002-2, and 1002-3 can individually issue requests to storage nodes and transaction coordinators for transactions that are the subject of the requests. As shown in Figure 10, multiple client nodes can be present in the distributed system of datacenter 1000. Client nodes 1002-1, 1002-2, and 1002-3 are independent such that each node need not coordinate with any other node.
[0078] Storage nodes 1005-1, 1005-2, and 1005-3 can be configured as key-partitioned storage nodes. Each of storage nodes 1005-1, 1005-2, and 1005-3 can contain a subset of preassigned keys, where the keys are assigned by the CPO 1026, which is responsible for managing the keys and storage nodes of datacenter 1000. Datacenter 1000 can be implemented as a storage system whose responsibility is to provide a unified view of a dataset that cannot physically fit on a single machine. This implementation can be achieved by partitioning (splitting) the data into smaller chunks called partitions, where a partition is a subset of the entire data. In an embodiment of datacenter 1000, each partition can be guaranteed to be limited in size and be able to fit on a single machine. In a storage service for datacenter 1000, including storage nodes 1005-1, 1005-2, and 1005-3, exactly one partition can be assigned to each of storage nodes 1005-1, 1005-2, and 1005-3 in a one-to-one relationship. The assignment, which can be realized as a mapping, can be stored in a structure called a partition map. When any record-level operation is performed according to a given key (for example, a read of a record with a key equal to a specific e-mail address), a determination can be made as to which partition the given key falls in. Using the partition map, a determination can be made of exactly which storage node should own the record that is the subject of the record-level operation. In an embodiment, the entire cluster of storage nodes can be configured to perform an agreement procedure on the partition map.
[0079] Each of storage nodes 1005-1, 1005-2, and 1005-3 can receive read or write requests from client nodes 1002-1, 1002-2, and 1002-3 and can perform the requested operations by communicating with transaction coordinators as appropriate and respond to the requesting client. Each of storage nodes 1005-1, 1005-2, and 1005-3 can handle read, write, and clear version requests on keys and maintain a local transaction precedence graph. As shown in Figure 10, multiple storage nodes can be present in the distributed system of datacenter 1000. The key domain can be partitioned and each storage node can be assigned keys from distinct parts of the partition. A partition map can be implemented as, but is not limited to, a hash of the key that allows the storage node for a particular key to be identified with just the key.
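By way of non-limiting illustration, the following sketch shows one possible hash-based partition map lookup that resolves the storage node owning a given key; the node names and hashing scheme are hypothetical and not limiting.

    import hashlib

    STORAGE_NODES = ["storage-node-1", "storage-node-2", "storage-node-3"]

    def owning_storage_node(key: str) -> str:
        # Hash the key onto one of the preassigned partitions; exactly one
        # partition, and thus one storage node, owns any given key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        partition = int.from_bytes(digest[:8], "big") % len(STORAGE_NODES)
        return STORAGE_NODES[partition]

    print(owning_storage_node("user@example.com"))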
[0080] Transaction coordinators 1010-1 and 1010-2 can be implemented as transaction coordinator nodes. A transaction coordinator can be configured to be responsible for keeping track of a transaction and committing or aborting the transaction. Every active transaction in the RMCC of datacenter 1000 has a transaction coordinator that keeps track of whether it is in-progress, committed, or aborted. When a client, such as one of client nodes 1002-1, 1002-2, and 1002-3, issues a commit request, the transaction coordinator communicates with the DAG coordinator of the transaction to check if the transaction can be successfully committed. As shown in Figure 10, multiple transaction coordinator nodes can be present in the distributed system of datacenter 1000. A transaction can be mapped to a transaction coordinator using a transaction identification (ID), which can be a universally unique identifier (UUID). In various embodiments, storage nodes, such as storage nodes 1005-1, 1005-2, and 1005-3, can also double up to be transaction coordinators, that is, storage nodes can also operate as transaction coordinators.
[0081] DAG coordinators 1020-1, 1020-2, and 1020-3 can be implemented as DAG coordinator nodes. Each transaction in datacenter 1000 has a DAG coordinator. A DAG coordinator can be configured to be responsible for keeping track of the parts of partial transaction precedence graphs. A DAG coordinator can be configured to also maintain, update, and combine partial transaction precedence graphs between DAG coordinators. At the start of a transaction, the transaction coordinator for the transaction can be assigned as the DAG coordinator. As transaction precedence graphs are updated and combined with other transactions, the DAG coordinator can become different from the transaction coordinator, for example, by merging with other DAG coordinators. A DAG coordinator can be responsible for checking if a transaction can commit by performing cycle checking. A DAG coordinator can identify transactions that can be cleared out by finding committed transactions with no ongoing transactions dependent on them, can issue cleanup requests to transaction coordinators, and can signal storage nodes when record versions can be cleaned up. When a DAG coordinator determines a transaction can be cleared, the transaction can be removed from the transaction coordinator as well.
[0082] With respect to the example of the datacenter 1000, when a transaction is started, an asynchronous request is issued to TSO 1016 to obtain a new timestamp for the transaction. A random storage node can be selected from the storage nodes, for example storage nodes 1005-1, 1005-2, and 1005-3, of datacenter 1000 to serve as the transaction coordinator as well as the DAG coordinator for this transaction. A message can be sent to the selected storage node so that it can start tracking the transaction. All operations performed in the transaction can be annotated with the transaction coordinator location, which can allow any other transaction to discover the DAG for any concurrent access.
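For illustration only, the following simplified sketch of starting a transaction issues an asynchronous timestamp request and selects a random storage node as both transaction coordinator and DAG coordinator; the function names and returned fields are hypothetical and the network calls are replaced by local stand-ins.

    import asyncio
    import random
    import uuid

    async def request_timestamp():
        # Stand-in for an asynchronous request to the TSO; a real system would
        # make a network call returning a timestamp with an error bound.
        await asyncio.sleep(0)
        return {"timestamp": 1_000_000, "error_bound_us": 10}

    async def start_transaction(storage_nodes):
        txn_id = str(uuid.uuid4())
        # Issue the timestamp request asynchronously and keep working while it runs.
        timestamp_task = asyncio.create_task(request_timestamp())
        # Select a random storage node as transaction coordinator and DAG coordinator;
        # a real system would now tell that node to start tracking txn_id, and every
        # later operation would be annotated with the coordinator location.
        coordinator = random.choice(storage_nodes)
        timestamp = await timestamp_task
        return {"txn_id": txn_id, "coordinator": coordinator, "timestamp": timestamp}

    txn = asyncio.run(start_transaction(["node-1", "node-2", "node-3"]))
    print(txn["txn_id"], txn["coordinator"], txn["timestamp"])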
[0083] Figure 11 illustrates a mechanism for datacenter 1000 of Figure 10 to annotate transaction operations and allow other transactions to discover the DAG for any concurrent access. Client nodes 1002-1, 1002-2, and 1002-3 can include client applications 1103-1, 1103-2, and 1103-3 and client libraries 1104-1, 1104-2, and 1104-3, respectively. A client library can be implemented as software that provides a programmatic interface, which allows a user application, such as client applications 1103-1, 1103-2, and 1103-3, to execute transactions. Each of client libraries 1104-1, 1104-2, and 1104-3 can be responsible for discovering service nodes as well as assisting in some transaction coordination activities. Each of client libraries 1104-1, 1104-2, and 1104-3 can interact with one or more storage nodes 1005-1, 1005-2, and 1005-3. In a non-limiting example, client library 1104-1 can interact with storage nodes 1005-1 and 1005-2, client library 1104-2 can interact with storage nodes 1005-2 and 1005-3, and client library 1104-3 can interact with storage nodes 1005-1 and 1005-3. When a transaction is started, the client library can select a coordinator for the transaction as well as the associated DAG. To start, the client library can select one storage partition as the coordinator of both the transaction and its DAG. The DAG coordinator may change later as the transaction crosses paths with other transactions in the RMCC system of datacenter 1000, where such path interaction can result in merging of DAGs.
[0084] Storage node 1005-1 can include data records 1106-1, transaction records 1107-1, and DAG records 1108-1. Storage node 1005-2 can include data records 1106-2, transaction records 1107-2, and DAG records 1108-2. Storage node 1005-3 can include data records 1106-3, transaction records 1107-3, and DAG records 1108-3. Each of transaction records 1107-1, 1107-2, and 1107-3 can keep track of the DAG to which a given transaction currently belongs. For example, transaction records 1107-3, which is representative of transaction records 1107-1 and 1107-2, can include a transactions file 1109 that includes a number of records such as Transaction Record 1 and Transaction Record 2, where Transaction Record 1 and Transaction Record 2 are correlated to transaction identifications TxnId1 and TxnId2, respectively. In this example, Transaction Record 2 can include the identification TxnId2 that belongs to a DAG identified using a DAG identification (DAGId) set as DAGId2 in Transaction Record 2. A DAG file 1111 can be searched to find DAGId2, which is tied to DAG Record 2, such that the transaction having TxnId2 keeps track of the DAG identified by DAGId2.
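By way of non-limiting illustration, a minimal sketch of the lookup from a transaction record to its DAG record follows; the record layouts are hypothetical.

    transaction_records = {
        "TxnId1": {"dag_id": "DAGId1", "status": "in-progress"},
        "TxnId2": {"dag_id": "DAGId2", "status": "in-progress"},
    }
    dag_records = {
        "DAGId1": {"transactions": {"TxnId1"}, "edges": set()},
        "DAGId2": {"transactions": {"TxnId2"}, "edges": set()},
    }

    def dag_for_transaction(txn_id):
        # Follow the DAG identification in the transaction record into the
        # DAG file to retrieve the DAG record itself.
        dag_id = transaction_records[txn_id]["dag_id"]
        return dag_id, dag_records[dag_id]

    print(dag_for_transaction("TxnId2"))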
[0085] In transaction execution, all operations performed during a transaction can be recorded by the corresponding storage nodes as read intents or write intents. Each intent can be annotated with the transaction coordinator for the issuing transaction. A transaction can commit trivially if there is not any concurrent access by other transactions. Once the commit decision has been recorded, the system can clean up the registered intents. In a concurrent execution model, a goal is to ensure that strict serializability is maintained among the involved transactions. As explained earlier, this translates to maintaining an invariant that a commit does not form a cycle in the precedence graph. [0086] In a RMCC, intents can be used to detect concurrent access. Consider a scenario of two concurrent transactions T1 and T2, where each transaction has placed some intents in the storage cluster without conflict. For T1 to now place an intent that overlaps with an intent that is currently held by T2, a decision has to be made on how to order the two transactions. The ordering can be made to also apply to all other intents. With T1 and T2 each taken as having a trivial graph of size one, the precedence can be registered by merging these two graphs into a single graph, with edges denoting the established order. As execution continues, it is possible that either T1 or T2 encounters another set of concurrent transactions, for example transactions T3, T4, T5. This occurrence can be detected via the placed intents, and, as a result, the two concurrent graphs of {T1, T2} and {T3, T4, T5} can be merged and a single DAG owner can be designated. Should any transaction from this merged set attempt to commit, the single DAG owner for the graph can be consulted. The single DAG owner is now in a position to perform a cycle check on the entire set of overlapping transactions {T1, T2, T3, T4, T5}.
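For illustration only, the following simplified sketch shows how placing an intent could detect a conflict and merge the two transactions' graphs under a single DAG owner; the data structures and helper names are hypothetical simplifications.

    def merge_dags(dags, dag_of, txn_a, txn_b):
        # Fold the DAG owning txn_b into the DAG owning txn_a and repoint members.
        owner_a, owner_b = dag_of[txn_a], dag_of[txn_b]
        if owner_a == owner_b:
            return
        dags[owner_a]["members"] |= dags[owner_b]["members"]
        dags[owner_a]["edges"] |= dags[owner_b]["edges"]
        for member in dags[owner_b]["members"]:
            dag_of[member] = owner_a
        del dags[owner_b]

    def place_intent(intents, dags, dag_of, key, txn, kind):
        # intents: key -> list of (txn, kind) already placed on that key.
        existing = intents.setdefault(key, [])
        for other_txn, other_kind in existing:
            if other_txn != txn and (kind == "write" or other_kind == "write"):
                # Conflict: order the earlier intent holder before this transaction
                # and merge their graphs so one DAG owner can later check for cycles.
                merge_dags(dags, dag_of, other_txn, txn)
                dags[dag_of[txn]]["edges"].add((other_txn, txn))
        existing.append((txn, kind))

    dags = {"D1": {"members": {"T1"}, "edges": set()},
            "D2": {"members": {"T2"}, "edges": set()}}
    dag_of = {"T1": "D1", "T2": "D2"}
    intents = {}
    place_intent(intents, dags, dag_of, "key-a", "T1", "write")
    place_intent(intents, dags, dag_of, "key-a", "T2", "write")
    print(dag_of, dags)   # both transactions now share DAG owner D1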
[0087] A main tool to ensure strict serializability in RMCC is the transaction precedence graph. The transaction precedence graph is a DAG, which is constructed in a distributed manner as running transactions access data. The edges of the DAG represent transaction dependencies, and the transactions themselves are represented as nodes. The edges can be viewed to be of two types. One type is an ordering between concurrent transactions and the second type is an ordering between non-concurrent transactions using linearizability. By the second type, dependencies between non-concurrent transactions can be captured. If a transaction t1 ends before a transaction t2 and the two are non-concurrent, then if there is an edge, it is to be from t1 to t2. At any given time, there can be many independent DAGs constructed in parallel in the cluster of the RMCC, where each DAG represents a set of transactions that created some dependency on each other while executing read or write operations.
[0088] Initially, each transaction can start out as a single node with no dependencies on its own DAG coordinator. While transactions perform operations on storage nodes, new dependencies can be added between them. These dependencies can be kept track of locally by the client and the transaction coordinators. Only when a transaction tries to commit are the dependency edges, accumulated while performing operations on different storage nodes, introduced into a single transaction precedence graph. At this stage, various partially constructed transaction precedence graphs, which can be stored on different DAG coordinators, are merged together based on these new dependency edges. Next, cycle checking can be performed to see if a transaction can commit. During cycle checking, linearizability checks can also be performed and new dependency edges can be added. If cycle checking fails, the transaction can be aborted and all edges incident to the transaction can be removed.
[0089] The edges in the transaction precedence graph can be classified into edges identified during operations on storage nodes and edges identified during commit time. After a commit or an abort, committed transactions that are not reachable from non-committed transactions are cleared from the transaction precedence graph. Data that is no longer needed is cleared from the storage nodes, namely committed key-value records from these cleared transactions for which a newer committed record is available.
[0090] In various embodiments, to ensure linearizability guarantees with cleared out transactions, a clear time can be introduced. The clear time(t) of a committed transaction t is the maximum among the commit times of any cleared transaction that could reach t and the commit time of t. The maximum clear time of an uncommitted transaction tu (mct(tu)) can be the maximum commit time of any cleared transaction from which tu has an in-edge. The clear time(t) and mct(tu) can be used for ensuring linearizability guarantees with cleared out transactions.
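By way of non-limiting illustration, the definitions above can be expressed as a short sketch; the dictionary layouts are hypothetical.

    def clear_time(commit_times, cleared_predecessor_commits, txn):
        # clear time(t): maximum of the commit time of t and the commit times
        # of cleared transactions that could reach t.
        return max([commit_times[txn]] + cleared_predecessor_commits.get(txn, []))

    def update_mct(mct, txn, cleared_in_neighbor_commit_time):
        # mct(tu): maximum commit time of any cleared transaction from which tu
        # has an in-edge; updated as cleared in-neighbors are encountered.
        mct[txn] = max(mct.get(txn, 0), cleared_in_neighbor_commit_time)
        return mct[txn]

    commit_times = {"T5": 120}
    cleared_predecessor_commits = {"T5": [100, 115]}   # cleared txns reaching T5
    print(clear_time(commit_times, cleared_predecessor_commits, "T5"))   # 120
    print(update_mct({}, "T9", 115))                                     # 115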
[0091] Different portions of constructing and maintaining the transaction precedence graph in a distributed way can include a number of segments. The number of segments can include adding and maintaining dependencies during operations on storage nodes. Merging of transaction precedence graphs can be performed during commits, along with adding new edges and performing cycle checks during commits. Committed transactions can be cleared from the maintained records of completed transactions.
[0092] While handling operations on storage nodes, new dependencies between transactions can be identified and introduced. In operation, local knowledge can be used to ensure that no cycles are formed and that linearizability can be maintained. If there is no way to perform an operation without introducing new cycles or violating linearizability, failure for that operation can be returned. Handling operations on storage nodes can be viewed as having a two-fold goal. One goal is performance of operations and the second goal is the addition of new dependencies with local feasibility checks.
[0093] Each storage node in a RMCC can have a storage node manager (SNM) component that can keep track of information that facilitates handling operations in a straightforward manner. The information can include a list of all transactions that have accessed any key on the storage node, and, for each transaction t and key k accessed by t, a set of transaction records, which can be reads and writes on k by t, ordered by operation IDs. The information can include a local transaction precedence graph among all transactions that have accessed keys on the storage node based on local dependency knowledge. The information can include, for each key k, a DAG ordering Ok of transactions that have accessed k. The information can include, for each key, the value and clear time of the last cleared transaction in the ordering of transactions. In addition, the storage node manager can be updated with the new dependencies and records for the transaction after every successful operation.
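For illustration only, a simplified sketch of the bookkeeping such a storage node manager could keep follows; field names are hypothetical.

    from collections import defaultdict

    class StorageNodeManager:
        def __init__(self):
            # Transactions that have accessed any key on this storage node.
            self.transactions = set()
            # (txn, key) -> list of read/write records ordered by operation ID.
            self.records = defaultdict(list)
            # Local transaction precedence graph: txn -> set of dependent txns.
            self.local_graph = defaultdict(set)
            # key -> DAG ordering of transactions that have accessed the key.
            self.key_ordering = defaultdict(list)
            # key -> (value, clear time) of the last cleared transaction.
            self.last_cleared = {}

        def record_operation(self, txn, key, op_id, kind, value=None):
            # Update the bookkeeping after every successful operation.
            self.transactions.add(txn)
            self.records[(txn, key)].append((op_id, kind, value))
            if txn not in self.key_ordering[key]:
                self.key_ordering[key].append(txn)

    snm = StorageNodeManager()
    snm.record_operation("T1", "key-a", op_id=1, kind="write", value="v1")
    snm.record_operation("T2", "key-a", op_id=2, kind="read")
    print(snm.key_ordering["key-a"])   # ['T1', 'T2']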
[0094] Operations can be handled differently based on whether it is the first time a transaction t is accessing a key or not. If it is the first time and the first operation of t is a read, a committed transaction can be chosen to read the value from, such that all the dependencies of the transaction and linearizability are respected. Similarly, if the first operation is a write, a transaction with a write, after which to place the new write, can be identified. For this purpose, for each key on the storage node, an ordering of all transactions that have accessed the key can be maintained and, among the cleared transactions, the value and commit time of the last cleared committed transaction in the order with a write on the key can be stored. This information can be maintained by the SNM of the storage node. The task of identifying a transaction to read from for a first read or to write after for a first write becomes the task of correctly inserting the transaction t in this ordering. Inserting t in the ordering introduces new dependencies that are kept track of and are used for doing checks.
[0095] If a transaction t is accessing a key for the second time or beyond, then handling of the transaction becomes easier. If the operation is the first write on the key by the transaction, additional dependencies are to be added for the transaction. For all other operations, a local version lookup for the transaction can be conducted and the value from the lookup can be returned. With respect to ensuring linearizability, consider three transactions t, t1c, and t2c, where t1c and t2c are committed and t is uncommitted. Let there be an edge from t to t1c and an edge from t1c to t2c. Transitivity cannot be used to infer an edge from t to t2c as that can violate linearizability, since it is possible that t1c was concurrent with both t and t2c but t2c committed before t started. As a result, when a new transaction is inserted in an existing ordering, the new transaction should have an edge to all committed transactions in front of the new transaction in the ordering. [0096] As discussed, for each key on the storage node, an ordering of all transactions that have accessed the key can be maintained and, among the cleared transactions, the value of the last cleared committed transaction in the order with a write on the key, along with its clear time, can be stored. All the transactions, both committed and uncommitted, with a write can be totally ordered. All the transactions with only reads can be ordered after the transaction from which they read the value for the key. Let t be a transaction (Txn) trying to operate for the first time on some key. The SNM can be updated with the new dependencies and records for the transaction after every operation. In addition, a dependency from a cleared record with clear time ct to a transaction can be captured by updating the mct of the transaction as max(mct, ct).
[0097] When the first operation on a key by a transaction t is a read, the first committed transaction tc with a write in the ordering to which there is no edge from t can be identified. Insertion of t after tc in the ordering and before the next transaction with a write can be conducted. For this, a determination can be made to check if all current dependencies are respected, that is, to check if all out-edges of t are to transactions after tc and if all in-edges to t are from transactions before tc or concurrent to t. Then the new dependencies are (i) edges from all transactions before tc to t and (ii) edges to all transactions after tc from t, except the transactions that have only reads from tc. For new type (ii) edges, messages are sent to these transactions to ensure they have not committed. If something has committed, the process can be repeated to find a new tc. The read value is from tc.
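By way of non-limiting illustration, the selection rule stated above for a first read can be rendered literally as a short sketch; the data layout and names are hypothetical simplifications, and the subsequent dependency and commit-status checks are omitted.

    def first_committed_writer_without_edge(ordering, committed, writers, out_edges_of_t):
        # Scan the per-key ordering and return the first committed transaction
        # with a write to which the reading transaction t has no out-edge.
        for txn in ordering:
            if txn in committed and txn in writers and txn not in out_edges_of_t:
                return txn
        return None

    ordering = ["C1", "C2", "U1"]      # per-key ordering; C1 and C2 are committed writers
    print(first_committed_writer_without_edge(
        ordering, committed={"C1", "C2"}, writers={"C1", "C2"},
        out_edges_of_t={"C1"}))        # C2, since t already has an edge to C1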
[0098] When the first operation on a key by a transaction t is a write, the first committed transaction tc with a write in the ordering to which there is no edge from t is identified. If the next committed transaction, with a write, reads from tc, then a failure for the operation is returned. Else, let t’ be the first uncommitted transaction with only writes after tc that has an edge from t. If such an uncommitted transaction does not exist, consider t’ as the next committed transaction. Then, t is to be inserted in the ordering before t’. For this, a determination is made to check if all current dependencies are respected, that is, a check is made to determine if all out-edges of t are to transactions after t’ and if all in-edges to t are from transactions before t’. Then, the new dependencies are (i) edges from all transactions before t’ to t and (ii) edges to all transactions after t’ from t. For new type (ii) edges, messages are sent to these transactions to ensure they have not committed. If a transaction has committed, the process can be repeated to find a new tc. The value can be written in the record and the SNM can be updated.
[0099] When handling operations on a key by a transaction other than a first operation, operations can be divided based on whether the operation is the first write operation on the key by the transaction or not. If it is, then a determination can be made to check whether tc, the transaction in the ordering after which t was placed, has a transaction (uncommitted or committed) that reads from it but also has a write. If so, a failure for the operation is returned. If not, edges from all the uncommitted transactions that read from tc to t can be added. Then, the write can be recorded.
[0100] If the operation is a read, the local record versions on the key by the transaction are looked up and read from a past write by the transaction on the key or tc based on the operation ID. If the operation is a write, but not the first write, the write is inserted according to the operation ID along with ensuring there have been no reads from a past write after which this write is placed.
[0101] For all these operations, it is important to keep track of which previous transaction on the key a given transaction reads from, to prevent the situation of two transactions, each having a read followed by a write, reading from the same transaction. For this, the procedure can include keeping track of those transactions that have only reads from a committed transaction and also ensuring that there is exactly one transaction that reads from a committed transaction and also writes to the same key. A flow diagram 1200 of an embodiment of an example of starting a transaction and performing operations in a RMCC, similar to the above discussion, is illustrated in Figures 12A-B.
[0102] In RMCC, when a transaction t issues a commit request, various parts of the transaction precedence graphs involving the transaction can be merged. For this procedure, the dependencies that were accumulated during operations on storage nodes that are locally known to t can be added and used. After merging relevant parts of the transaction precedence graphs, a check can be made as to whether it is safe to commit t or not. The check can be conducted by checking for cycles in the transaction precedence graph to ensure serializability. During this check, new edges are added to the transaction precedence graph to ensure linearizability. The new edges are created such that if a timestamp of a first transaction (Txn1.timestamp) is less than a timestamp of a second transaction (Txn2.timestamp), then a directed edge is inserted to indicate that Txn1 precedes Txn2 in the precedence graph. The DAG coordinators can be the components in the RMCC responsible for this process of taking care of commits and managing the transaction precedence graph.
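For illustration only, the timestamp-based edge insertion can be sketched as follows; the names are hypothetical.

    def add_linearizability_edge(edges, txn1, txn2, timestamps):
        # Insert a directed edge so the transaction with the earlier commit
        # timestamp precedes the later one in the precedence graph.
        if timestamps[txn1] < timestamps[txn2]:
            edges.add((txn1, txn2))
        elif timestamps[txn2] < timestamps[txn1]:
            edges.add((txn2, txn1))
        return edges

    print(add_linearizability_edge(set(), "Txn1", "Txn2",
                                   {"Txn1": 100, "Txn2": 250}))   # {('Txn1', 'Txn2')}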
[0103] The DAG coordinator can maintain a transaction precedence graph for a subset of dependent transactions. It can be responsible for checking if a transaction t can commit. On receiving a commit request from a transaction t, the transaction coordinator of t can issue a commit request to its DAG coordinator. On receipt of this request, the DAG coordinator of t can obtain partial transaction precedence graphs from DAG coordinators of dependent transactions. The DAG coordinator of t then can build a bigger graph combining all the partial ones and can check for cycles consisting of already committed transactions and t. The DAG coordinator can also be responsible for identifying and clearing out transactions. For this clearing procedure, the DAG coordinator can issue a cleanup request to the transaction coordinator of the transactions that the DAG coordinator identifies as being no longer useful.
[0104] To consider cycle checking by a DAG coordinator, let t be a transaction trying to commit. The DAG coordinator of t obtains transaction precedence graphs from DAG coordinators of all dependent transactions of t. After obtaining transaction precedence graphs, the DAG coordinator of t checks for cycles and adds, as appropriate, additional edges based on timestamp ordering to ensure external causality. [0105] For cycle checking, the invariant that there are no cycles among committed transactions in any partial DAGs is maintained. The reasoning behind such an invariant is that if a cycle contains an in-progress transaction, then the cycle will be detected by the last committing transaction in the cycle. The invariant can be maintained to speed up the cycle-finding process and thus speed up the time to commit instead of performing a depth-first search (DFS) on the whole graph each time. DFS is a recursive algorithm for searching all the vertices of a graph or tree data structure. For this cycle finding procedure, when a transaction t tries to commit, a check can be made only for cycles consisting of already committed transactions and t. In addition, the cycle checking can include ensuring that no cycles are formed including committed cleared-out transactions. The clear times of committed transactions can be used to ensure that no cycles are formed. When a transaction t tries to commit, the following procedure can be implemented. All in-edges and out-edges of t reachable through known neighbors of t, that is dependencies known to t, can be added to t. Then, for each in-neighbor vin and out-neighbor vout of t, a check can be made as to whether there is a path from vout to vin using only committed transactions, since such paths will lead to cycles. This process can be simplified based on the proposition that if there is an existing path from one committed transaction to another committed transaction using committed transactions, then there is an edge between them. If there is an edge from vout to vin, t can be aborted since transaction t will cause a cycle. If there is an edge from vin to vout, t is good (no cycle) since there is no path from vout to vin using committed transactions. The final case occurs if there is no edge between vin and vout, that is, there is no path from vin to vout nor from vout to vin. In this case, the clear times and commit times can be checked to ensure linearizability and an edge can be added from vin to vout if the commit time of vin is not more than that of vout, else t is aborted. If for all pairs of vin and vout no cycles are formed, then t is committed. [0106] The DAG coordinator can store information to facilitate commit or abort operations. The information can include part of a transaction precedence graph. The DAG coordinator can track all committed nodes and uncommitted nodes. For each committed transaction, the DAG coordinator can store the commit time and clear time of the committed transaction. The DAG coordinator can store the transaction coordinator for each transaction in the graph. A flow diagram 1300 of an embodiment of an example commit request in a RMCC, similar to the above discussion, is illustrated in Figures 13A-B.
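By way of non-limiting illustration, the pairwise check described above can be sketched as follows, relying on the stated invariant that a committed-to-committed path implies a direct edge; for brevity the sketch uses commit times only, and all names are hypothetical.

    def check_commit(t, in_neighbors, out_neighbors, edges, commit_time):
        # edges: set of (a, b) meaning a precedes b; in_neighbors and
        # out_neighbors are the committed neighbors of the committing transaction t.
        for v_in in in_neighbors:
            for v_out in out_neighbors:
                if (v_out, v_in) in edges:
                    return "abort"        # committing t would close a cycle
                if (v_in, v_out) in edges:
                    continue              # already consistent for this pair
                # No edge either way: order the pair by commit time to keep
                # linearizability, or abort if that is not possible.
                if commit_time[v_in] <= commit_time[v_out]:
                    edges.add((v_in, v_out))
                else:
                    return "abort"
        return "commit"

    edges = {("C1", "C2")}
    print(check_commit("T", in_neighbors={"C1"}, out_neighbors={"C2"},
                       edges=edges, commit_time={"C1": 10, "C2": 20}))   # commit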
[0107] A cleanup can be triggered during every commit and every abort on a DAG coordinator. A DAG coordinator, at the end of a commit request, has a part of the transaction precedence graph stored on it. For the cleanup, the DAG coordinator identifies and aborts certain uncommitted transactions and cleans up certain committed transactions. The transactions that are aborted include any uncommitted transaction tu that has a committed out-neighbor t that has a committed out-neighbor t' to which tu is not adjacent. That is, there is an edge from tu to t and from t to t' but not from tu to t'. Trying to transitively add a tu to t' edge will violate linearizability, and thus tu is aborted. The transactions that are cleaned up include, after identifying uncommitted transactions to abort, certain committed transactions, to which cleanup operations can be issued, where these cleanup committed transactions have no uncommitted in-neighbors. [0108] Before issuing a cleanup request to the transaction coordinators, the clear times of all committed transactions can be updated with respect to the set of transactions that have been identified to be cleaned up. Recall that the clear time(t) is the maximum among commit times of all cleared committed transactions that can reach t and commit(t). This update of clear times can be performed by a process similar to finding the topological ordering. Initially, all edges can be set as not traversed. Then, iteratively in each round, a node can be found with all incoming edges from committed transactions traversed and the clear time of the found node can be shared with its out-neighbors. The neighbors can update their clear times accordingly.
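For illustration only, a simplified sketch of identifying transactions to abort and committed transactions eligible for cleanup follows; it treats transactions selected for abort as no longer blocking cleanup, omits the clear-time propagation, and uses hypothetical names throughout.

    def cleanup_candidates(edges, committed):
        # edges: set of (a, b) meaning a precedes b.
        out_of, in_of = {}, {}
        for a, b in edges:
            out_of.setdefault(a, set()).add(b)
            in_of.setdefault(b, set()).add(a)
        nodes = {a for a, _ in edges} | {b for _, b in edges}
        to_abort = set()
        for tu in nodes - committed:
            for t in out_of.get(tu, set()) & committed:
                for t_prime in out_of.get(t, set()) & committed:
                    if (tu, t_prime) not in edges:
                        # Adding the transitive edge would violate linearizability.
                        to_abort.add(tu)
        # Committed transactions with no remaining uncommitted in-neighbors can be
        # cleaned up (transactions selected for abort are treated as removed).
        to_clean = {t for t in committed
                    if not (in_of.get(t, set()) - committed - to_abort)}
        return to_abort, to_clean

    edges = {("U1", "C1"), ("C1", "C2")}
    print(cleanup_candidates(edges, committed={"C1", "C2"}))   # ({'U1'}, {'C1', 'C2'})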
[0109] While clearing the versions of a transaction, the clear time can be stored along with the cleared record in the storage nodes. That is, during cleanup of versions, on each key, the value of the last cleared transaction along with its clear time can be stored. This action helps to ensure that no cycles are formed due to cleared out transactions during cycle checking.
[0110] All uncommitted neighbors of the transaction being cleared can be notified of the clear time of the transaction before the transaction is cleared completely from the transaction coordinator. All transaction coordinators can store the state of the transaction as under-clearing, along with its clear time, when the cleaning of versions is taking place. [0111] If a situation arises such that a transaction tries to commit and cannot locate a dependent transaction with an in-edge to it on its transaction coordinator, the transaction trying to commit can be aborted. However, this should not occur because all the uncommitted out-neighbors are notified of the clear time when clearing out a transaction. Until then, the transaction can be maintained on the transaction coordinator in the under-clearing state. A flow diagram 1400 of an embodiment of an example cleanup communication in a RMCC, similar to the above discussion, is illustrated in Figures 14A-B.
[0112] Figure 15 is a flow diagram of features of an embodiment of an example method 1500 of operating a distributed data storage system. At operation 1510, dependencies among transactions in a distributed system are modeled. The distributed system has storage nodes arranged individually in a distributed arrangement. The dependencies among transactions are modeled using transaction precedence graphs partially constructed while executing the transactions. The transactions are marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes. At operation 1520, a transaction is committed in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
[0113] Variations of method 1500 or methods similar to method 1500 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of devices or systems in which such methods are implemented. Variations of such methods can include dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Dynamically determining data can include determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph. Variations can include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records. Variations can include maintaining, in each storage node, data records, transaction records, and transaction precedence graph records. Each transaction record can have a transaction identification and each transaction precedence graph record can have a transaction precedence graph identification. Variations can include issuing read or write requests to the storage nodes from client nodes of the distributed system. The client nodes can be arranged with interfaces to end-users, where the end-users are external to the distributed system.
[0114] Variations of method 1500 or methods similar to method 1500 can include tracking the transaction in the distributed system as being in-progress, committed, or aborted and maintaining and updating the transaction precedence graph for the transaction. The transaction precedence graph can be combined with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph. Variations can include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
[0115] Variations of method 1500 or methods similar to method 1500 can include locating partial transaction precedence graphs containing a neighbor transaction to the transaction and adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction. Commit times between committed transactions of the partial transaction precedence graphs can be checked and edges can be added based on the check of the commit times. A check can be performed for a cycle in the combined transaction precedence graph for the transaction and a determination to commit or to abort can be performed from the checking for a cycle.
[0116] Variations of method 1500 or methods similar to method 1500 can include operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple DAG coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators. At the start of a given transaction, the transaction coordinator for the given transaction can be assigned as the DAG coordinator for the given transaction. Variations can include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction. The transaction coordinator for the given transaction can communicate with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph. The transaction coordinator can apply a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
[0117] In various embodiments, a non-transitory machine-readable storage device, such as computer-readable non-transitory medium, can comprise instructions stored thereon, which, when performed by a machine, cause the machine to perform operations, where the operations comprise one or more features similar to or identical to features of methods and techniques described with respect to method 1500, variations thereof, and/or features of other methods taught herein. The physical structures of such instructions can be operated on by at least one processor. For example, executing these physical structures can cause the machine to perform operations comprising modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
[0118] Operations can include dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Dynamically determining data can include determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph. Operations can include storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records. The operations can include maintaining, in each storage node, data records, transaction records, and transaction precedence graph records, where each transaction record has a transaction identification and each transaction precedence graph record has a transaction precedence graph identification. The operations can include removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system. The operations can include issuing read or write requests to the storage nodes from client nodes of the distributed system, where the client nodes can be arranged with interfaces to end-users, with the end-users external to the distributed system.
[0119] Operations can include tracking the transaction in the distributed system as being in-progress, committed, or aborted and maintaining and updating the transaction precedence graph for the transaction. The operations can include combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
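One possible way, for illustration only, that a node could detect that another transaction precedence graph has affected a key in common and therefore needs to be combined with the current one is sketched below; graph_of_key and the merge callable are invented names, and the merge itself is left to the caller.

```python
# Illustrative sketch of the merge trigger; all names are assumptions.
def note_key_affected(graph_of_key, key, graph_id, merge):
    owner = graph_of_key.get(key)
    if owner is None:
        graph_of_key[key] = graph_id          # first graph to affect this key
    elif owner != graph_id:
        merged_id = merge(owner, graph_id)    # common key: combine the two graphs
        for k, g in list(graph_of_key.items()):
            if g in (owner, graph_id):
                graph_of_key[k] = merged_id   # re-point keys at the combined graph
```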
[0120] Operations can include locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
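A rough sketch of this combine-and-check step, under stated assumptions, follows: the partial graphs containing neighbors of the committing transaction are unioned (so edges reached through neighbors, including transitive dependents, are carried over), edges between committed transactions are added in commit-time order, and the combined graph is checked for a cycle through the committing transaction. The function name combine_and_check and the dict-of-sets encoding are assumptions, and ordering committed transactions strictly by commit time is one possible reading of the commit-time check.

```python
# Hedged sketch only; not the disclosed implementation.
def combine_and_check(partial_graphs, txn_id, commit_times):
    # 1. Union the partial precedence graphs that contain neighbors of txn_id.
    combined = {}
    for partial in partial_graphs:
        for u, succs in partial.items():
            combined.setdefault(u, set()).update(succs)
    # 2. Add precedence edges between already-committed transactions,
    #    ordered by their commit times.
    by_time = sorted(commit_times, key=commit_times.get)
    for earlier, later in zip(by_time, by_time[1:]):
        combined.setdefault(earlier, set()).add(later)
    # 3. Commit only if txn_id does not now lie on a cycle.
    stack, seen = list(combined.get(txn_id, ())), set()
    while stack:
        node = stack.pop()
        if node == txn_id:
            return "abort"
        if node not in seen:
            seen.add(node)
            stack.extend(combined.get(node, ()))
    return "commit"
```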
[0121] Operations can include operating multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators. Operations can include, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction. Operations can include, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
[0122] In various embodiments, a distributed system can comprise storage nodes arranged individually in a distributed arrangement, a memory storing instructions, and at least one processor in communication with the memory. The at least one processor can be configured, upon execution of the instructions, to perform a number of steps. Dependencies among transactions in the distributed system can be modeled using transaction precedence graphs partially constructed while executing the transactions. The transactions can be correlated to a key stored in the storage nodes. A transaction of the transactions in the distributed system, where the transaction is correlated to the key, can be committed in response to checking for cycles in a transaction precedence graph for the transaction.
[0123] Variations of such a distributed system or similar distributed systems can include a number of different embodiments that may or may not be combined depending on the application of such distributed systems and/or the architecture of distributed systems in which methods, as taught herein, are implemented. In such distributed systems, the at least one processor can be configured to dynamically determine data to remove from the distributed system with respect to a given transaction precedence graph, where the given transaction precedence graph models dependencies based on correlated keys and transaction commit times. Determination of data to remove can be conducted by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph. In such distributed systems, the storage nodes can include data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records. Each storage node can include data records, transaction records, and transaction precedence graph records, where each transaction record can have a transaction identification and each transaction precedence graph record can have a transaction precedence graph identification. The at least one processor can be configured to remove the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system. In variations of such distributed systems, the distributed systems can include client nodes configured to issue read and write requests to the storage nodes, where the client nodes are arranged with interfaces to end-users, with the end-users being external to the distributed system.
[0124] Variations of such a distributed system or similar distributed systems can include the at least one processor being configured to track the transaction in the distributed system as being in-progress, committed, or aborted, and maintain and update the transaction precedence graph for the transaction. The at least one processor can be configured to combine the transaction precedence graph with other transaction precedence graphs, in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
[0125] Variations of such a distributed system or similar distributed systems can include the at least one processor being configured to locate partial transaction precedence graphs containing a neighbor transaction to the transaction and to add transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction. The at least one processor can be configured to check commit times between committed transactions of the partial transaction precedence graphs and add edges based on the check of the commit times and to check for a cycle in the combined transaction precedence graph for the transaction. The at least one processor can be configured to determine to commit or to abort from the checking for a cycle.
[0126] Variations of such a distributed system or similar distributed systems can include the at least one processor configured, upon execution of instructions, to perform operations as multiple transaction coordinators and multiple directed acyclic graph (DAG) coordinators, such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted, and each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators. In variations of such distributed systems, at the start of a given transaction, the transaction coordinator for the given transaction can be assigned as the DAG coordinator for the given transaction. In variations of such distributed systems, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction can perform a number of functions. The transaction coordinator can determine a current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction. The transaction coordinator for the given transaction can communicate with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in the given transaction precedence graph. The transaction coordinator for the given transaction can apply a commit of the given transaction if no cycle is formed in the given transaction precedence graph.

[0127] Figure 16 is a block diagram illustrating components of a computing system 1600 that can implement algorithms and perform methods structured to process data for an application in conjunction with using RMCC for data processing. All components need not be used in various embodiments. The computing system 1600 can include a processor 1601, a memory 1612, a removable storage 1623, a non-removable storage 1622, and a cache 1628. The processor 1601 can be implemented as multiple processors. The computing system 1600 can be structured in different forms in different embodiments. The computing system 1600 can be implemented in conjunction with various components associated with the distributed system 900 of Figure 9 and the datacenter 1000 of Figure 10. Although the various data storage elements are illustrated as part of the computing system 1600, the storage can also or alternatively include cloud-based storage accessible via a network, such as the Internet, or remote server-based storage.
[0128] The memory 1612 can include a volatile memory 1614 and/or a non-volatile memory 1617. The computing system 1600 can include or have access to a computing environment that includes a variety of computer-readable media, such as the volatile memory 1614, the non-volatile memory 1617, the removable storage 1623 and/or the non-removable storage 1622. Computer storage can include data storage servers, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
[0129] The computing system 1600 can include or have access to a computing environment that includes an input interface 1627, an output interface 1624, and a communication interface 1631. The output interface 1624 can include a display device, such as a touchscreen, that also can serve as an input device. The input interface 1627 can include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computing system 1600, and other input devices. The communication interface 1631 can exchange communications with external devices and networks. The computing system 1600 can operate in a networked environment using a communication connection to connect to one or more remote computers, such as one or more remote compute nodes. The remote computer can include a PC, a server, a router, a network PC, a peer device or other common data flow network switch, or the like. The communication connection can include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other networks. In various embodiments, the components of the computing system 1600 can be connected with a system bus 1621.
[0130] Computer-readable instructions stored on a computer-readable medium, such as a program 1613, are executable by the processor 1601 of the computing system 1600. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium, such as a storage device. The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory. Storage can also include networked storage, such as a storage area network (SAN). The program 1613 of the computing system 1600 can be used to cause the processor 1601 to perform one or more methods or algorithms described herein.
[0131] The components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments can be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components can be implemented, for example, as a computer program product such as a computer program, program code or computer instructions tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.
[0132] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with one or more general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
[0133] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The elements of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Devices suitable for embodying computer program instructions and data include all forms of memory, including by way of example, semiconductor memory devices, e.g., electrically programmable read-only memory or ROM (EPROM), EEPROM, flash memory devices, and/or data storage disks (e.g., magnetic disks, internal hard disks, or removable disks, magneto-optical disks, CD-ROM, or DVD-ROM disks). The processor and the memory can be supplemented by or incorporated in special purpose logic circuitry.
[0134] An RMCC framework, as taught herein, can address inefficient memory usage of conventional distributed storage systems by clearing out versions of transactions as soon as they are not part of active processing, while providing serializability and external causality. RMCC uses transaction precedence graphs, which allow determination of whether a commit would form a dependency cycle, which would violate serializability. The same transaction precedence graphs can be used to clean multiple record versions such that older versions of records in a given committed transaction are cleared out of storage when there is no path from any uncommitted transaction to the given committed transaction in the transaction precedence graph. The clearing can be triggered by commit or abort requests generated by transactions, which can result in multiple record versions being kept long enough to satisfy any open transactions, and not any longer. RMCC can support a serializable isolation level, a level of external consistency (linearizability), increased concurrency, and global transactions, which are transactions spanning multiple geographical regions, along with achieving efficient memory usage by cleaning up versions on the go.
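A hedged sketch of clearing versions "on the go" follows: once a committed transaction is found clearable (for example, with a reachability test such as the is_clearable sketch above), its superseded record versions are dropped from per-key version chains, while the newest version of each key is retained. The data layout used here (key -> list of (txn_id, value) pairs, oldest first) is an assumption for illustration only.

```python
# Illustrative pruning triggered by a commit or abort; layout is assumed.
def prune_versions(version_chains, clearable_txns):
    for key, chain in version_chains.items():
        if not chain:
            continue
        older, newest = chain[:-1], chain[-1]
        # Drop superseded versions written by clearable transactions;
        # the newest version of the key is always kept.
        kept = [v for v in older if v[0] not in clearable_txns]
        version_chains[key] = kept + [newest]
```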
[0135] Current state-of-the-art systems cannot achieve the combination of serializability and linearizability without significant performance issues. RMCC provides a mechanism to split up the bookkeeping of transactions to make this process distributed and scalable. The transaction relationships for all ongoing transactions are maintained. This maintenance is performed by keeping pieces (partial DAGs) of the entire picture in different nodes of the system. These DAGs represent the order in which transactions are to be recorded in the system to preserve the serializable and linearizable properties. When a transaction attempts to commit, it is evaluated against a combined DAG, made up of all the current partial DAGs. Any transaction is allowed to commit only if it will not form a cycle with the already committed transactions in the combined DAG. If a transaction commits, it remains in the DAG in the committed state so that the remaining ongoing transactions that are part of the DAG can be evaluated for cycle forming if and when they attempt to commit.

[0136] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement that is calculated to achieve the same purpose may be substituted for the specific embodiments shown. Various embodiments use permutations and/or combinations of embodiments described herein. It is to be understood that the above description is intended to be illustrative, and not restrictive, and that the phraseology or terminology employed herein is for the purpose of description. Combinations of the above embodiments and other embodiments will be apparent to those of skill in the art upon studying the above description.

Claims

What is claimed is:
1. A distributed system comprising: storage nodes arranged individually in a distributed arrangement; a memory storing instructions; and at least one processor in communication with the memory, the at least one processor configured, upon execution of the instructions, to perform the following steps: modeling dependencies among transactions in the distributed system using transaction precedence graphs partially constructed while executing the transactions, the transactions correlated to a key stored in the storage nodes; and committing a transaction, correlated to the key, of the transactions in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
2. The distributed system of claim 1, wherein the at least one processor is configured to dynamically determine data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
3. The distributed system of any one of the preceding claims, wherein the storage nodes include data records and unique keys to the data records partitioned among the storage nodes with each storage node containing a subset of the data records.
4. The distributed system of any one of the preceding claims, wherein the at least one processor is configured to: track the transaction in the distributed system as being in-progress, committed, or aborted; and maintain and update the transaction precedence graph for the transaction and combine the transaction precedence graph with other transaction precedence graphs, in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
5. The distributed system of any one of the preceding claims, wherein the at least one processor is configured to remove the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
6. The distributed system of any one of the preceding claims, wherein the distributed system includes client nodes configured to issue read and write requests to the storage nodes, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
7. The distributed system of any one of the preceding claims, wherein the at least one processor is configured to: locate partial transaction precedence graphs containing a neighbor transaction to the transaction; add transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; check commit times between committed transactions of the partial transaction precedence graphs and add edges based on the check of the commit times; check for a cycle in the combined transaction precedence graph for the transaction; and determine to commit or to abort from the checking for a cycle.
8. The distributed system of any one of the preceding claims, wherein each storage node includes: data records; transaction records, each transaction record having a transaction identification; and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
9. The distributed system of any one of the preceding claims, wherein the at least one processor is configured, upon execution of the instructions, to perform operations as multiple transaction coordinators and multiple directed acyclic graph (DAG) coordinators, such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted and each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
10. The distributed system of claim 9, wherein, at start of a given transaction, the transaction coordinator for the given transaction is assigned as the DAG coordinator for the given transaction.
11. The distributed system of claim 9, wherein, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determines current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicates with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applies a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
12. A method of operating a distributed data storage system, the method comprising: modeling dependencies among transactions in a distributed system having storage nodes arranged individually in a distributed arrangement, using transaction precedence graphs partially constructed while executing the transactions, the transactions marked as dependent in the transaction precedence graphs when the transactions affect common keys stored in the storage nodes; and committing a transaction in the distributed system in response to checking for cycles in a transaction precedence graph for the transaction.
13. The method of claim 12, wherein the method includes dynamically determining data to remove from the distributed system with respect to a given transaction precedence graph, the given transaction precedence graph modeling dependencies based on correlated keys and transaction commit times, by determining an absence of a path in the given transaction precedence graph from an uncommitted transaction in the transaction precedence graph to a committed transaction in the transaction precedence graph.
14. The method of any one of the preceding claims 12-13, wherein the method includes storing, in the storage nodes, data records and unique keys to the data records partitioned among the storage nodes, with each storage node containing a subset of the data records.
15. The method of any one of the preceding claims 12-14, wherein the method includes: tracking the transaction in the distributed system as being in-progress, committed, or aborted; and maintaining and updating the transaction precedence graph for the transaction and combining the transaction precedence graph with other transaction precedence graphs in response to detection that the other transaction precedence graphs have affected keys in common to the transaction precedence graph.
16. The method of any one of the preceding claims 12-15, wherein the method includes removing the transaction and associated information from the storage nodes in response to a determination of the transaction being clearable in the distributed system.
17. The method of any one of the preceding claims 12-16, wherein the method includes: issuing read or write requests to the storage nodes from client nodes of the distributed system, the client nodes arranged with interfaces to end-users, the end-users external to the distributed system.
18. The method of any one of the preceding claims 12-17, wherein the method includes: locating partial transaction precedence graphs containing a neighbor transaction to the transaction; adding transitive dependent edges to the transaction precedence graph for the transaction to generate a combined transaction precedence graph for the transaction; checking commit times between committed transactions of the partial transaction precedence graphs and adding edges based on the check of the commit times; checking for a cycle in the combined transaction precedence graph for the transaction; and determining to commit or to abort from the checking for a cycle.
19. The method of any one of the preceding claims 12-18, wherein the method includes maintaining, in each storage node, data records, transaction records, each transaction record having a transaction identification, and transaction precedence graph records, each transaction precedence graph record having a transaction precedence graph identification.
20. The method of any one of the preceding claims 12-19, wherein the method includes: operating, via execution of stored instructions by one or more first processors, multiple transaction coordinators such that each active transaction in the distributed system has a transaction coordinator that tracks the active transaction as in-progress, committed, or aborted; and operating, via execution of stored instructions by one or more second processors, multiple directed acyclic graph (DAG) coordinators, such that each DAG coordinator tracks transaction precedence graphs and updates and combines transaction precedence graphs among other DAG coordinators.
21. The method of claim 20, wherein the method includes, at start of a given transaction, assigning the transaction coordinator for the given transaction as the DAG coordinator for the given transaction.
22. The method of claim 20, wherein the method includes, for a given transaction requested by a client node of the distributed system, in response to a commit request for the given transaction from the client node, the transaction coordinator for the given transaction: determining current status of the given transaction by checking status of the given transaction in a transaction record of the given transaction; communicating with the DAG coordinator for the given transaction to evaluate if a commit of the given transaction forms a cycle in a given transaction precedence graph of the given transaction; and applying a commit of the given transaction if no cycle is formed in the given transaction precedence graph.
23. A non-transitory computer-readable storage medium storing instructions, wherein the instructions, when executed by one or more processors, cause the one or more processors to perform operations comprising any one of the methods of claims 12-22.
PCT/US2023/061288 2023-01-25 2023-01-25 Reference-managed concurrency control WO2024059352A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2023/061288 WO2024059352A1 (en) 2023-01-25 2023-01-25 Reference-managed concurrency control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2023/061288 WO2024059352A1 (en) 2023-01-25 2023-01-25 Reference-managed concurrency control

Publications (1)

Publication Number Publication Date
WO2024059352A1 true WO2024059352A1 (en) 2024-03-21

Family

ID=85381110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/061288 WO2024059352A1 (en) 2023-01-25 2023-01-25 Reference-managed concurrency control

Country Status (1)

Country Link
WO (1) WO2024059352A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090287703A1 (en) * 2008-05-02 2009-11-19 Toru Furuya Transaction processing system of database using multi-operation processing providing concurrency control of transactions
US20220100733A1 (en) * 2020-09-29 2022-03-31 International Business Machines Corporation Transaction reordering in blockchain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Serializability - Wikipedia", 28 February 2016 (2016-02-28), XP055831670, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Serializability&oldid=707292573> [retrieved on 20210811] *

Similar Documents

Publication Publication Date Title
AU2016271618B2 (en) Disconnected operation within distributed database systems
US8261020B2 (en) Cache enumeration and indexing
US11132350B2 (en) Replicable differential store data structure
US6714949B1 (en) Dynamic file system configurations
US9858322B2 (en) Data stream ingestion and persistence techniques
CN113874852A (en) Indexing for evolving large-scale datasets in a multi-master hybrid transaction and analytics processing system
US20130275550A1 (en) Update protocol for client-side routing information
US9576038B1 (en) Consistent query of local indexes
US20070118572A1 (en) Detecting changes in data
US20180004777A1 (en) Data distribution across nodes of a distributed database base system
JP7389793B2 (en) Methods, devices, and systems for real-time checking of data consistency in distributed heterogeneous storage systems
US20130332435A1 (en) Partitioning optimistic concurrency control and logging
US8527559B2 (en) Garbage collector with concurrent flipping without read barrier and without verifying copying
CN112789606A (en) Data redistribution method, device and system
CN114207601A (en) Managing objects in a shared cache using multiple chains
CN113760847A (en) Log data processing method, device, equipment and storage medium
Tomsic et al. Distributed transactional reads: the strong, the quick, the fresh & the impossible
WO2024059352A1 (en) Reference-managed concurrency control
WO2023124242A1 (en) Transaction execution method and apparatus, device, and storage medium
CN114205354B (en) Event management system, event management method, server, and storage medium
KR20140031260A (en) Cache memory structure and method
JP2003271436A (en) Data processing method, data processing device and data processing program
Shacham et al. Taking omid to the clouds: Fast, scalable transactions for real-time cloud analytics
CN109542631A (en) A kind of recurrence method, apparatus, server and the storage medium of standby host
Lev-Ari et al. Quick: a queuing system in cloudkit

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23707237

Country of ref document: EP

Kind code of ref document: A1