WO2024081140A1 - Configuration and management of replication units for asynchronous database transaction replication - Google Patents

Configuration and management of replication units for asynchronous database transaction replication

Info

Publication number
WO2024081140A1
WO2024081140A1 (PCT/US2023/034465)
Authority
WO
WIPO (PCT)
Prior art keywords
replication
database
leader
new
servers
Prior art date
Application number
PCT/US2023/034465
Other languages
English (en)
Inventor
Lik Wong
Leonid NOVAK
Sampanna Salunke
Mark Dilman
Wei-Ming Hu
Original Assignee
Oracle International Corporation
Priority date
Filing date
Publication date
Priority claimed from US18/372,005 (published as US20240126782A1)
Application filed by Oracle International Corporation
Publication of WO2024081140A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278Data partitioning, e.g. horizontal or vertical partitioning

Definitions

  • the present invention relates to asynchronous database transaction replication with fast, automatic failover, zero data loss, strong consistency, full SQL support, and horizontal scalability using a consensus protocol.
  • Consensus protocols allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, variations of consensus protocols play a key role in large-scale software systems, such as replicated database systems.
  • Raft is a consensus protocol that is designed to be understandable and straightforward to implement. Raft offers a generic way to distribute a state machine across a cluster of computing nodes, referred to herein as simply “nodes” or “participant nodes,” ensuring that each node in the cluster agrees upon the same series of state transitions.
  • the replicated state machines are typically implemented using a replicated log. Each node stores a log replica containing a series of commands, which its state machine executes in order; thus, each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
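  • As a minimal, hypothetical illustration of the replicated-state-machine idea above (not code from the patent), the following Python sketch feeds the same command log to two deterministic state machines, which therefore compute identical states:

```python
# Minimal sketch of a replicated state machine driven by a shared log.
# Hypothetical illustration only; names and structure are not from the patent.

class KeyValueStateMachine:
    """A deterministic state machine: applying the same commands in the
    same order always yields the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)


# A replicated log: the same sequence of commands stored on every node.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

node_a, node_b = KeyValueStateMachine(), KeyValueStateMachine()
for entry in log:          # each node executes the log in order
    node_a.apply(entry)
    node_b.apply(entry)

assert node_a.state == node_b.state == {"y": 2}
```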
  • Sharding is a database scaling technique based on horizontal partitioning of data across multiple independent physical databases. Each physical database in such a configuration is called a “shard.”
  • Logical Standby can support 1-2 second failover; however, multiple databases must be configured per physical server to balance the load, and more shards must be used to keep up with the primary shard’s workload.
  • Active Data Guard and Logical Standby are active/passive replication strategies at shard level that have idle hardware.
  • GoldenGate™ does not support automatic fast failover in a sharded database.
  • Typical NoSQL databases such as Cassandra™ and Amazon® DynamoDB™ meet many of the above requirements, such as horizontal scalability, simplicity, and symmetric shards; however, they lack SQL support, ACID (Atomicity, Consistency, Isolation, and Durability) transactions, and strong consistency.
  • Some NewSQL databases such as Cloud Spanner™, CockroachDB™, YugabyteDB™, and TiDB™ provide SQL support and implement consensus-based replication (Paxos or Raft), which supports strong consistency. They typically implement synchronous database replication, which increases user transaction response time. YugabyteDB™ claims that it applies changes to a follower asynchronously for certain cases, e.g., single-key DMLs. However, YugabyteDB™ may still need synchronization to present a global time for transactions.
  • Kafka™ is a well-known messaging system and meets many of the above requirements, but Kafka™ is non-relational.
  • Raft-based Replication (RR) does not require persistent memory; RR adds logical logging on top of physical redo logs, supports full SQL, and can more readily tolerate geographically remote replicas.
  • FIG. 1 is a block diagram illustrating a distributed computing system with a state machine and log replicated across a plurality of computing nodes in accordance with a consensus protocol in which aspects of the illustrative embodiments may be implemented.
  • FIG. 2 is a block diagram depicting grouping of chunks for replication in accordance with an illustrative embodiment.
  • FIG. 3 illustrates a replication unit in a sharded database management system in accordance with an illustrative embodiment.
  • FIG. 4 is a diagram illustrating a consensus protocol-based replication user request flow in accordance with an illustrative embodiment.
  • FIG. 5 is a flowchart illustrating operation of consensus protocol-based replication for a sharded database management system in accordance with an illustrative embodiment.
  • FIG. 6 is a block diagram illustrating an architecture for consensus protocol-based replication in a sharded database in accordance with an illustrative embodiment.
  • FIG. 7 is a block diagram depicting log persistence optimizations in a leader shard server in accordance with an illustrative embodiment.
  • FIG. 8 is a block diagram depicting log persistence optimizations in a follower shard server in accordance with an illustrative embodiment.
  • FIG. 9 depicts a replication log with interleaved transactions in accordance with an illustrative embodiment.
  • FIG. 10 depicts apply progress tracking in a replication log in accordance with an illustrative embodiment.
  • FIG. 11 depicts an example of multiple ring placement of replication units in accordance with an illustrative embodiment.
  • FIG. 12 is a data flow diagram illustrating an example replication unit split in accordance with an illustrative embodiment.
  • FIG. 13 is a flowchart illustrating operation of a new leader shard server taking over a replication unit when there is a commit initiated in the replication log in accordance with an illustrative embodiment.
  • FIG. 14 is a flowchart illustrating operation of a shard server performing replication log recovery in accordance with an illustrative embodiment.
  • FIG. 15 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
  • FIG. 16 is a block diagram of a basic software system that may be employed for controlling the operation of a computer system.
  • the illustrative embodiments provide asynchronous database transaction replication based on log replication.
  • a leader server receives a command to perform a change operation on a row of a table of a database.
  • the table is replicated on a replication group of servers such that each server within the replication group of servers stores a respective copy of the row of the table.
  • the replication group of servers includes the leader server and one or more follower servers.
  • the leader server is configured to perform data manipulation language (DML) operations on the row of the table and replicate the DML operations to the one or more follower servers.
  • the leader server performs the change operation on the copy of the row of the table stored at the leader server.
  • the leader server then creates a replication log record for the change operation in a replication pipeline to be replicated to the one or more follower servers and returns a result of the change operation to the client.
  • the replication pipeline includes the components and mechanisms for storing a log record in the leader and propagating the log record to followers, from generation of the log record on the leader to persisting the log record to disk on the followers.
  • the leader server does not wait for a consensus from the one or more follower servers replicating the replication log record to perform the change operation or to return the result of the change operation to the client.
  • the results of change operations (DML operations) are returned to the client immediately after the change is made and the log record is created in the leader without waiting for consensus from the followers.
  • the propagation of log records to followers occurs asynchronously after returning to the client.
  • the leader server receives a database transaction commit command to perform a database transaction commit operation on a particular transaction, creates a replication log record for the database transaction commit operation in the replication pipeline, and in response to receiving acknowledgement that the replication log record for the database transaction commit operation has been appended to a replication log of a consensus number of the one or more follower servers, performs the database transaction commit operation on the particular transaction on the copy of the row of the table at the leader server. Then, the leader server returns the result of the database transaction commit operation to the client. For database transaction commit operations, the leader server does wait for consensus from the follower servers to perform the database transaction commit operation and return the result to the client.
  • database transaction commit operation log records are replicated synchronously.
  • the leader server performs DML operations and database transaction commit operations asynchronously, performs replication of the DML log records asynchronously, and performs replication of the commit log records synchronously, thus enabling fast, automatic shard failover with zero data loss and strong consistency.
  • the leader server prepares the local transaction for commit and marks the transaction as in-doubt and then sends the commit log record to the follower servers. Upon receiving the consensus of this commit log record, the leader server can commit the local transaction. Upon failure of the leader server, if the old leader becomes a follower and there is a consensus for the commit log record based on the replication log, the in-doubt state of the local transaction in the old leader can be committed; otherwise, the transaction is rolled back in the old leader.
  • Before committing a local transaction, the leader server sends a pre-commit log record to the follower servers. Upon receiving a consensus of the pre-commit record, the leader server can commit the local transaction and send a post-commit log record to the follower servers. Upon failure of the leader shard server and the leader becoming a follower, if there is a consensus for the pre-commit log record based on the replication log, the local transaction in the old leader can be committed. If the local transaction in the old leader was rolled back by the database, the transaction will be replayed to the old leader based on the replication log.
  • chunks are grouped into replication units (RUs) to optimize replication efficiency
  • a chunk is a unit of distribution in sharded database systems.
  • a sharded database system comprises multiple shard servers.
  • sharded tables are each divided across chunks; a chunk may contain parts of multiple tables.
  • chunks are assigned to RUs based on load and replication throughput. Splitting RUs and merging RUs do not interrupt concurrent user workload or require routing changes as the relevant chunks remain in the same set of shard servers. In addition, transactions spanning chunks within an RU do not require distributed transaction processing.
  • each replication unit has a replication factor (RF), which refers to the number of copies/replicas of the replication unit, including the primary copy at the leader, and an associated distribution factor (DF), which refers to the number of shard servers taking over the workload from a failed leader shard server.
  • the replication factor must be an odd number to determine a majority. The higher the DF, the more balanced workload distribution is after a failover.
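  • A small, hypothetical helper illustrating the majority rule implied above (an odd RF so that a majority of replicas, including the leader, can constitute consensus); the function name is illustrative only:

```python
def consensus_quorum(replication_factor: int) -> int:
    """Number of replicas (including the leader's own copy) whose
    acknowledgement constitutes consensus for a replication unit."""
    if replication_factor % 2 == 0:
        raise ValueError("replication factor should be odd to determine a majority")
    return replication_factor // 2 + 1

print(consensus_quorum(3))  # -> 2: the leader plus one follower form a majority
print(consensus_quorum(5))  # -> 3
```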
  • RUs are placed in rings of shard servers, where the number of shard servers in a ring is equal to the replication factor. This placement helps with schema upgrades.
  • quiescing the workload can be restricted to a ring of shard servers instead of the entire sharded database.
  • placement can be done in the form of a single ring.
  • a leader RU on shard 1 is replicated on shards 2 and 3
  • leader RU on shard 2 is replicated on shards 3 and 4, and so on.
  • barrier DDLs result in quiescing the entire sharded database; however, when adding a new shard, it is possible to restrict the number of RU splits, moves, and merges to a fixed number, reducing data movement during incremental deployment and making incremental deployment a more deterministic operation.
  • Removing a shard server (e.g., due to failure or scaling down) can be done similarly in multiple ring placement or single ring placement.
  • a lead-sync log record is used to synchronize the replication logs of follower shards to the leader shard.
  • the new leader shard performs a sync operation using the lead-sync log record to synchronize replication logs of the follower shards to the replication log of the new leader.
  • the lead-sync log record requires consensus.
  • the new leader can determine whether there is consensus for the database transaction commit operation based on the result of the sync operation.
  • the new leader performs the sync operation by sending the lead-sync log record to the follower shard servers.
  • a shard server, when recovering from a database failure, identifies a first transaction having a first log record but not a post-commit log record in the replication log, defines a recovery window in the replication log starting at the first log record of the identified first transaction and ending at the lead-sync log record, identifies a set of transactions to be recovered, and performs a recovery action on the set of transactions to be recovered. Any transaction having a log record outside the recovery window is not included in the set of transactions to be recovered, any transaction having a post-commit log record within the recovery window is not included in the set of transactions to be recovered, and any transaction having a rollback log record in the recovery window is not included in the set of transactions to be recovered.
  • This allows the shard server to match the state of the replication log with the state of the database transactions by recreating missing log entries and replaying or rolling back transactions.
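  • The following Python sketch is one possible, simplified reading of the recovery-window rules above, assuming the replication log is available as an in-memory list of (log index, transaction id, record kind) tuples; it is not the patented implementation:

```python
def transactions_to_recover(log_records, lead_sync_index):
    """log_records: list of (log_index, txn_id, kind) ordered by log_index,
    where kind is one of 'dml', 'commit', 'post_commit', 'rollback'."""
    # The first transaction that has a log record but no post-commit record.
    post_committed = {txn for _, txn, kind in log_records if kind == "post_commit"}
    start_index = next((idx for idx, txn, _ in log_records
                        if txn not in post_committed), None)
    if start_index is None:
        return set()

    # Recovery window: from that first record up to the lead-sync record.
    def in_window(idx):
        return start_index <= idx <= lead_sync_index

    window = [(i, t, k) for i, t, k in log_records if in_window(i)]
    candidates = {t for _, t, _ in window}

    excluded = set()
    for idx, txn, _ in log_records:
        if txn in candidates and not in_window(idx):
            excluded.add(txn)                  # has a record outside the window
    for _, txn, kind in window:
        if kind in ("post_commit", "rollback"):
            excluded.add(txn)                  # already committed or rolled back

    return candidates - excluded

log = [(10, "T1", "dml"), (11, "T2", "dml"), (12, "T1", "commit"),
       (13, "T2", "rollback"), (14, "T3", "dml")]
print(transactions_to_recover(log, lead_sync_index=15))  # -> {'T1', 'T3'} (order may vary)
```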
  • a non-sharded database may be represented in some embodiments as a sharded database with only one shard where all the chunks are within that one database, which can be a single server instance or a Real Application Cluster (RAC).
  • a non-sharded database may be a mirrored database implementation, where the non-sharded database may be represented as a sharded database with one replication unit that includes all chunks, the replication group includes all servers, and the replication factor is equal to the number of servers. That is, a replication group can include all or a subset of the servers, and a replication unit can include all or a subset of the chunks of the database.
  • Raft is a consensus protocol for managing a replicated log. To enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and enforces a stronger degree of coherency to reduce the number of states that must be considered.
  • FIG. 1 is a block diagram illustrating a distributed computing system with a state machine and log replicated across a plurality of computing nodes in accordance with a consensus protocol in which aspects of the illustrative embodiments may be implemented. In the example shown in FIG. 1, there is a leader node 110 and two follower nodes 120, 130; however, the distributed computing system can include other numbers of nodes depending on the configuration or workload.
  • the number of nodes in the group of participant nodes can be scaled up or down depending on the workload or other factors that affect resource usage.
  • Consensus protocols typically arise in the context of replicated state machines. As shown in FIG. 1, state machines 112, 122, 132 are replicated across a group of computing nodes 110, 120, 130, respectively. State machines 112, 122, 132 operate to compute the same state and continue to operate even if one or more of the computing nodes 110, 120, 130 are down.
  • Replicated state machines are implemented using replicated logs.
  • Each node 110, 120, 130 stores a log 115, 125, 135, respectively, containing a series of commands that are executed in order by its state machine 112, 122, 132.
  • Each log should contain the same commands in the same order, so each state machine will process the same sequence of commands. Because the state machines 112, 122, 132 are deterministic, each computes the same state and the same sequence of outputs.
  • the consensus module 111 on a leader node 110 receives commands from clients, such as client 105, and adds them to its log 115.
  • the consensus module 111 of leader node 110 communicates with the consensus modules 121, 131 of the follower nodes 120, 130 to ensure that their logs 125, 135 eventually contain the same requests or commands in the same order, even if one or more nodes fail.
  • each node’s state machine processes them in log order, and the outputs are returned to client 105.
  • the nodes 110, 120, 130 appear to form a single, highly reliable state machine.
  • a Raft cluster or group, also referred to herein as a replication group, contains several nodes, such as servers.
  • a typical Raft group may include five nodes, which allows the system to tolerate two failures.
  • each server is in one of three states: leader, follower, or candidate.
  • In normal operation, there is exactly one leader, and all other participant nodes are followers.
  • followers are passive and issue no requests on their own; followers simply respond to requests from leaders and candidates.
  • the leader handles all client requests. If a client contacts a follower, the follower redirects it to the leader.
  • the third state, candidate, is used to elect a new leader.
  • Once elected, a leader begins servicing client requests.
  • Each client request contains a command to be executed by the replicated state machines.
  • the leader node 110 appends the command to its log 115 as a new entry, then issues AppendEntries RPCs in parallel to each of the other nodes 120, 130 to replicate the entry.
  • Each log entry stores a state machine command along with the term number when the entry was received by the leader.
  • the term numbers in log entries are used to detect inconsistencies between logs and to ensure certain properties of the protocol.
  • Each log entry also has an integer log index identifying its position in the log.
  • Raft guarantees that “committed” entries are durable and will eventually be executed by all of the available state machines.
  • a log entry is committed once the leader that created the entry has replicated it on a majority of the servers (consensus). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out.
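  • The commit-index rule described above can be sketched as follows (generic Raft behaviour under simplifying assumptions; the current-term restriction that Raft also imposes is omitted, and names are illustrative only):

```python
def advance_commit_index(leader_last_index, follower_match_index, cluster_size):
    """follower_match_index: dict follower_id -> highest log index known to be
    replicated on that follower. Returns the highest index stored on a majority."""
    indices = sorted([leader_last_index] + list(follower_match_index.values()),
                     reverse=True)
    majority = cluster_size // 2 + 1
    # The (majority)-th highest index is replicated on at least a majority of servers.
    return indices[majority - 1]

# Example: 5 servers, leader at index 10, followers at 10, 9, 4, 3 -> commit index 9.
print(advance_commit_index(10, {"f1": 10, "f2": 9, "f3": 4, "f4": 3}, 5))
```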
  • the Raft consensus protocol is described herein with respect to a cluster or group of computing nodes, such as servers.
  • the Raft consensus protocol is applied to replicate a log of commands that is to be executed by the state machines of database servers to apply changes to a database.
  • Changes to be applied to a database by a leader database server (e.g., a leader shard in a sharded database system) are captured as commands in the leader’s log and replicated to the follower database servers.
  • each follower database server receives the commands in its log and applies the changes, in order, using its state machine, to a respective replica of the database.
  • the leader node intercepts changes (e.g., data manipulation language (DML) commands, piecewise large object (LOB) updates, JavaScript™ Object Notation (JSON) inserts and updates) as logical change records (LCRs).
  • the leader node constructs Raft log records based on the LCRs, which are replicated to follower database servers.
  • sharding distributes segments of a data set across many database servers on different computers (nodes).
  • Sharding is a data tier architecture in which data is horizontally partitioned across independent database servers.
  • Each database server is hosted on a dedicated computing node with its own local resources.
  • Each database server in such a configuration is referred to as a “shard server” or “shard.” All of the shards together make up a single logical database system, which is referred to as a sharded database management system (SDBMS).
  • horizontal partitioning involves splitting a database table across shards so that each shard contains the table with the same columns but a different subset of rows.
  • a table split up in this manner is also known as a sharded table.
  • each participant node can be a leader for one subset of data and a follower for other subsets of data.
  • the Raft consensus protocol handles leader election, log replication, and replication group membership changes with modifications to be described below.
  • the modifications of the illustrative embodiments help to ensure asynchronous database transaction replication for fast, automatic shard failover with zero data loss, strong consistency, full SQL support, and horizontal scalability.
  • a replication unit consists of a set of chunks.
  • FIG. 2 is a block diagram depicting grouping of chunks for replication in accordance with an illustrative embodiment.
  • a shard 210 can have multiple chunk sets (RUs) 220. Each RU 220 has a set of chunks 230.
  • a smaller replication unit has lower overhead of instantiation after a shard failure. Too many RUs may have higher run-time overhead (e.g., processes). Thus, there is a tradeoff between lower overhead with smaller RUs (i.e., smaller chunk sets) and lower run-time overhead with larger RUs (i.e., fewer RUs).
  • Each RU consists of a set of chunks. All transactions in an RU are replicated in the same replication pipeline, which consists of a set of processes, in-memory data structures, and the related replication log. To minimize replication overhead, the illustrative embodiments configure a size of the replication unit to maximize throughput. A large replication unit may increase data movement time during resharding.
  • FIG. 3 illustrates a replication unit in a sharded database management system in accordance with an illustrative embodiment.
  • Each RU has a leader shard server 310 and a set of follower shard servers 320, 330, and the leader shard and all follower shards have the same chunk set. All DML operations for a particular row are performed in the leader and replicated to its followers. This is primary copy replication.
  • a shard can be the leader for one replication unit and a follower for other replication units. This leads to better utilization of hardware. All reads are routed to the leader, unless an application explicitly requests a read from a specified follower shard and tolerates stale data, which may be beneficial if the application is geographically closer to a follower.
  • each replication unit has a replication factor (RF), which refers to the number of copies/replicas of the replication unit, including the primary copy at the leader.
  • the illustrative embodiments maintain the replication factor as shards fail assuming there is capacity in other available shards.
  • Each replication unit also has an associated distribution factor (DF), which refers to the number of shard servers taking over the workload from a failed leader shard server. The higher the DF, the more balanced workload distribution is after a failover.
  • FIG. 4 is a diagram illustrating a consensus protocol-based replication user request flow for a change operation in accordance with an illustrative embodiment. For a given replication unit, there is a leader 410 and a plurality of followers 420, 430.
  • Each node or server 410, 420, 430 has a respective shard catalog 412, 422, 432, which is a specialized database that supports automated shard deployment, centralized management of a sharded database, and multi-shard queries.
  • the shard catalogs 412, 422, 432 maintain data describing which server is the leader and which servers are followers.
  • each node or server can share a shard catalog.
  • a logical change record encapsulates a row change (e.g., insert, update, LOB operation, JSON operation, old/new values) and transaction directives (e.g., commits, rollbacks, partial rollbacks).
  • a replication log record is an LCR with a valid log index. Log records in a replication unit have strictly increasing log indices. Replication logs contain log records for interleaved, uncommitted transactions, like redo logs, but no undos or indices. This contrasts with other solutions that contain only committed transactions.
  • the terms logical change record (LCR) and log record (LR) are used interchangeably herein.
  • a read-only (R/O) routing state is maintained for a chunk in a follower RU in a routing map (not shown), and a read-write (R/W) routing state is maintained for a chunk in the leader RU in the routing map.
  • the R/O and R/W states are changed accordingly when there is a leadership change of an RU.
  • the routing map, which is cached outside the database (e.g., in client-side drivers), is invalidated and reloaded via notifications when there is a leadership change.
  • the client sends a command to the leader of a Raft group, and the leader appends the command to its log and sends an AppendEntries RPC to all followers; once a new entry is committed (stored in a consensus number of follower replication logs), the leader executes the command and returns a result to the client, the leader notifies followers of committed entries in subsequent AppendEntries RPCs, and the followers execute committed commands in their state machines.
  • the sharded database replication approach of the illustrative embodiments executes DMLs first in the database before appending them to the replication logs.
  • a logical change record encapsulates a row change (e.g., insert, update, delete, LOB operation, JSON operation, old/new values) and database transaction directives (e.g., database transaction commits, database transaction rollbacks, partial rollbacks).
  • a log record is an LCR with a valid log index and a term.
  • Log records in a replication group have strictly increasing log indices. Log records contain interleaved, uncommitted transactions, as will be described in further detail below.
  • a user/client 401 sends a DML to the leader 410, which performs the DML on its copy (replica) of the replication unit before appending a log record (LR) for the DML to its replication log 415.
  • the leader also returns a result (e.g., an acknowledgement) to the user/client 401 in response to the LR being created in the leader’s replication log 415.
  • the result is returned without confirming the followers have stored the LR to their replication logs, and therefore without the followers acknowledging storing the LR.
  • the leader can return to the client before writing this log record to its replication log.
  • the persistence of log records is done asynchronously.
  • the replication log 415 is propagated to followers 420, 430, which store the LR for the DML in replication logs 425, 435, respectively.
  • the follower 420 returns an acknowledgement (ACK)
  • The leader 410 considers a DML LR as committed if a consensus number of followers return an ACK to acknowledge that the LR has been stored in the replication log.
  • a follower 420, 430 eagerly executes DMLs while it appends to its replication log 425, 435 in parallel. This minimizes the impact to user transaction response time and improves replication efficiency.
  • To maintain ACID properties among replicas, the leader only commits a relevant database transaction when the commit LCR is a committed log record (appended to a consensus number of follower replication logs), which means any LCR previously generated for the database transaction is also a committed log record.
  • a user/client 401 sends a database transaction commit command to the leader 410, which creates a commit LR in its replication log 415 and propagates the replication log 415 to followers 420, 430.
  • the leader only commits the database transaction in its copy of the replication unit when it has received an acknowledgement for the commit LR from a consensus number of followers.
  • the leader generates the post-commit LR and then returns a result of the database transaction commit (e.g., an acknowledgement) to the user/client 401.
  • FIG. 5 is a flowchart illustrating operation of consensus protocol-based replication for a sharded database management system in accordance with an illustrative embodiment. Operation begins (block 500), and the leader shard server receives a command from a client (block 501). The leader shard server determines whether the command is a database transaction commit command (block 502). If the command is not a database transaction commit command (block 502:NO), then the command is a change operation (e.g., an insert, update, delete, etc.), and the leader shard server performs the change operation (block 503). The leader shard server then creates a replication log record in its replication log (block 504) and returns a result of the change operation to the client (block 505). Thereafter, operation returns to block 501 to receive the next command from the client.
  • If the command is a database transaction commit command (block 502:YES), the leader shard server creates a replication log record for the database transaction commit in its replication log (block 506).
  • the leader shard server determines whether it has received a consensus for the commit log record (block 507). If consensus is received (block 507:YES), then the leader shard server performs the database transaction commit operation (block 508), advances the commit index (block 509), and writes the log records to disk (block 510). Thereafter, operation returns to block 501 to receive the next command from the client.
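  • A schematic, runnable rendering of the FIG. 5 flow under simplifying assumptions (in-memory structures, followers simulated as lists, hypothetical names); it shows that change operations return without waiting for consensus while commits wait for it:

```python
class Leader:
    def __init__(self, num_followers=2):
        self.table = {}                      # local copy of the replicated rows
        self.log = []                        # leader's replication log
        self.followers = [[] for _ in range(num_followers)]
        self.quorum = (num_followers + 1) // 2 + 1   # majority of all servers

    def handle(self, command):
        kind = command[0]
        if kind != "commit":
            # Change operation: execute, create log record, return at once
            # (blocks 503-505); replication to followers happens asynchronously.
            _, key, value = command
            self.table[key] = value
            self.log.append(command)
            return "ok"

        # Commit: create commit log record, wait for consensus, then commit
        # (blocks 506-510).
        self.log.append(command)
        acks = self._replicate()             # stands in for asynchronous propagation
        if acks + 1 >= self.quorum:          # +1 counts the leader's own copy
            return "committed"
        return "waiting for consensus"

    def _replicate(self):
        for follower_log in self.followers:
            follower_log.extend(self.log[len(follower_log):])
        return len(self.followers)

leader = Leader()
print(leader.handle(("set", "k1", "v1")))     # -> ok (no consensus wait)
print(leader.handle(("commit", None, None)))  # -> committed
```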
  • FIG. 6 is a block diagram illustrating an architecture for consensus protocol-based replication in a sharded database in accordance with an illustrative embodiment.
  • Users 601 send DML and transaction directives to leader 610 for a given replication unit.
  • leader 610 includes capture components 611, System Global Area (SGA) 612, in-memory replication log queue 613, network senders 614A... 614B, consensus module 615, and persistent replication log in disk 616.
  • the capture components 611 intercept DML executions and capture DMLs, piecewise LOB updates, and JSON inserts and updates (JSON Transform) as LCRs.
  • the capture components 611 also intercept transaction execution to capture database transaction commits, database transaction rollbacks, and rollbacks-to-savepoint.
  • the SGA is a group of shared memory structures, known as SGA components, that contain data and control information for one database instance.
  • the SGA is shared by all server and background processes.
  • the capture components 611 store the LCRs to be inserted into the in-memory replication log queue 613.
  • There is a commit queue (not shown) between SGA 612 and consensus module 615.
  • the commit queue contains the commit record for each transaction.
  • When consensus module 615 receives an acknowledgement from a follower, it reads the commit queue to find a matching transaction and checks if this transaction obtains consensus. If so, the consensus module posts the user session, which allows the transaction to commit, generates a post-commit LCR, and returns the control to the user.
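  • One way to picture the acknowledgement handling just described, assuming a follower ACK carries its highest persisted log index and the commit queue holds the commit log index and user session per transaction (names are illustrative, not the patent's):

```python
from collections import defaultdict

class Session:
    def notify_commit(self, txn_id):
        print(f"{txn_id} may commit; a post-commit LCR can be generated")

class AckReceiver:
    def __init__(self, follower_quorum):
        self.follower_quorum = follower_quorum   # follower ACKs needed (leader counts itself)
        self.persisted = defaultdict(int)        # follower_id -> highest persisted log index
        self.commit_queue = []                   # [(txn_id, commit_log_index, session), ...]

    def on_ack(self, follower_id, persisted_index):
        self.persisted[follower_id] = max(self.persisted[follower_id], persisted_index)
        still_waiting = []
        for txn_id, commit_index, session in self.commit_queue:
            acks = sum(1 for idx in self.persisted.values() if idx >= commit_index)
            if acks >= self.follower_quorum:
                session.notify_commit(txn_id)    # post the user session
            else:
                still_waiting.append((txn_id, commit_index, session))
        self.commit_queue = still_waiting

recv = AckReceiver(follower_quorum=1)            # e.g. RF=3: leader plus one follower
recv.commit_queue.append(("T1", 42, Session()))
recv.on_ack("follower-1", 41)                    # not enough yet
recv.on_ack("follower-1", 45)                    # covers index 42 -> consensus reached
```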
  • Network senders 614A, 614B distribute replication logs to followers 620, 630 over network 605.
  • the leader may have a network sender for each follower in the replication group.
  • Consensus module 615 communicates with the consensus modules on other servers, such as consensus modules 625, 635 on followers 620, 630 to ensure that every log eventually contains the same log records in the same order, even if some servers fail.
  • Follower 620 includes network receiver 621, in-memory replication log queue 622, SQL Apply servers 623, consensus module 625, and persistent replication log in disk 626.
  • follower 630 includes network receiver 631, in-memory replication log queue 632, SQL Apply servers 633, consensus module 635, and persistent replication log in disk 636.
  • Network receivers 621, 631 receive replication logs from the leader 610 and hand LCRs to the SQL Apply servers 623, 633.
  • Consensus modules 615, 625, 635 have an LCR persister process at each replica (including the leader) to persist LCRs durably in persistent replication logs in disk 616, 626, 636.
  • the consensus module 625, 635 sends back the highest persisted log index to its leader 610 to acknowledge that log records up to the highest persisted log index have been persisted.
  • Only the leader can process user DML requests. A follower can automatically redirect DMLs to the leader.
  • the leader constructs a log record to encapsulate the DML change, enqueues the log record into an SGA circular buffer, and immediately returns to the user.
  • the illustrative embodiments decouple replication from the original DMLs and pipeline the DML logging, propagation, and SQL Apply at followers asynchronously with minimal latency. For multi-DML transactions, the replication largely overlaps with user transactions, and the latency overhead from the commit consensus is much less significant.
  • a capture component 611 constructs an LCR in image format or object format and enqueues it into an in-memory queue 613.
  • the in-memory queue is a contention-free multi-writer, single-reader queue.
  • the image format LCR in the queue minimizes multiple downstream pickling (e.g., persisting and propagating the LCRs).
  • the leader 610 returns to user 601 immediately after constructing the required log record in the DML execution path.
  • In-memory queue 613 is a single-writer and multiple-reader queue.
  • LCRs in SGA 612 are stored in a different queue, which is a contention-free multiple-writer and single- reader queue. Each capture process representing the user is a writer to this queue.
  • the LCR producer process is the single reader of this queue and dequeues LCRs from SGA 612, assigns a unique and strictly increasing log index to the LCR, and enqueues it into in-memory queue 613.
  • the strictly increasing log index is an important property of the Raft log.
  • the consensus module 615 scans through queue 613 and constructs replication log records based on the LCRs it dequeues.
  • the consensus module 615 (LCR persister) persists the in-memory log records to disk 616.
  • the consensus module 615 (ACK receiver) counts acknowledgements from the followers 620, 630 and advances the log commitindex appropriately.
  • the network sender 614A, 614B calls the AppendEntries RPC to propagate log records over the network 605 for all followers. For every log record propagation, the network sender 614A, 614B also includes the current commit log index.
  • the network receiver 621, 631 is automatically spawned due to the connection from the network sender 614A, 614B at the leader.
  • the network receiver 621, 631 enqueues log records into a queue 622, 632 from the wire via the consensus protocol Application Programming Interface (API) if the AppendEntries RPC passes its validation.
  • the consensus module 625, 635 at the follower reads the in-memory queue 622, 632. containing the log records from the leader, persists the log records to disk 626, 636, and sends an acknowledgement back to the leader via a separate network connection.
  • the SQL Apply server 623, 633 reads LCRs from the in-memory queue 622, 632, assembles transactions from interleaved LCRs from different transactions, and applies transactions to the database. This is referred to herein as “eager apply.” If the SQL Apply server 623, 633 is slow or catching up, it may need to retrieve the relevant log records from disk 626, 636.
  • the consensus protocol-based replication of the illustrative embodiments does not require explicit consensus on DMLs. As soon as an LCR for a row change is pushed to the replication pipeline, the leader immediately returns the control to the user or allows subsequent processing of the transaction. This minimizes the impact on the user’s response time for DMLs.
  • the replication (propagation, persistence) of a DML is done asynchronously in a streaming fashion. If there is consensus on a commit, the replication approach maintains transaction ACID properties among replicas.
  • the network senders 614A, 614B send the log record containing the transaction commit to all followers 620, 630.
  • When the leader 610 receives the consensus from a consensus number of followers, the leader advances its log commitindex if possible. The commit of the user transaction and advancing the commitindex does not need to be atomic. The leader 610 does not need to verify that the log record has been persisted locally to send the next set of log records to followers 620, 630. The leader needs to ensure that a commit log record is persisted locally before committing the transaction locally. In practice, this log record would have been persisted anyway when the leader receives the consensus. One way is to verify that the persisted log index is equal to or greater than the log index for this transaction commit log record. The leader then commits the transaction and returns success to the user.
  • If leader 610 crashes before communicating the consensus of the log record to its followers, a new leader may complete the replication of any last set of committed log records regardless of whether the prior leader is up or not. Because the apply of the transaction commit command and the return to the user are not atomic, the user might resubmit the transaction, possibly to a different leader. Such resubmission of transactions may encounter errors and is outside the scope of the sharded database.
  • all log records are sent, including commits, in the same stream continuously without any interruption.
  • the acknowledgement is sent back via a separate network connection.
  • the leader can continue streaming log records with higher log index, including other commits from other user transactions.
  • all the log records cannot have inter-transaction dependencies. This maximizes concurrency for independent user transactions.
  • one acknowledgement from the follower may include multiple transactions. As a result, this improves replication efficiency. Only one round trip is needed to obtain consensus for a commit log record.
  • One set of replication logs is kept for each replica (copy of the RU on a server) in each replication group of servers. There will be no interference among replication units when reading and writing replication logs. There are numerous possible embodiments to persist replication logs in all replicas, including the leader, including the following examples:
  • custom files are implemented so that replication logs can be read faster during recovery (role transition) and when catching up a new replica.
  • custom files allow transporting replication logs among replicas (e.g., for catchup), moving a follower from one shard server to another, re-instantiating a new follower.
  • the illustrative embodiment employs asynchronous IOs when writing and asynchronous IOs when prefetching and reading.
  • FIG. 7 is a block diagram depicting log persistence optimizations in a leader shard server in accordance with an illustrative embodiment.
  • the log persistence process group 710 in the leader shard receives LCRs 702 from SGA.
  • LCR producer process 711 dequeues LCRs 702 and enqueues the LCRs into circular queue 713.
  • the queue in LCRs 702 is a multiple-writer and single-reader queue.
  • Network senders 714A, 714B and LCR persister process 715 are subscribers of the circular queue 713.
  • Network senders 714A, 714B browse the circular queue 713 and distribute replication logs to followers.
  • LCR persister process 715 maintains a set of IO buffers 716 in the SGA.
  • a log record is dequeued from the circular queue 713 and placed in the IO buffers 716 as it would appear in the on-disk log file 717.
  • Because the LCR persister process 715 is also a subscriber of the circular queue 713, it must ensure that it dequeues records quickly so that the queue 713 does not fill up. Therefore, IO is asynchronous; as soon as an IO buffer is full, asynchronous IO is issued. This IO is reaped when the buffer is used again or if a commit record is encountered (when all pending IOs are reaped).
  • the LCR persister process 715 performs the following:
  • a log record is a wrapper containing an LCR with a term and a log index
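  • The buffering policy described for the LCR persister (fill IO buffers, issue IO when a buffer fills, reap all pending IOs on a commit record) might be sketched as follows; the asynchronous file IO is only simulated so the control flow stays visible, and all names are hypothetical:

```python
class IoBufferedPersister:
    def __init__(self, num_buffers=4, buffer_size=3):
        self.buffers = [[] for _ in range(num_buffers)]
        self.current = 0
        self.buffer_size = buffer_size
        self.pending_ios = []        # buffers handed to (simulated) asynchronous IO
        self.on_disk = []            # stands in for the on-disk log file

    def persist(self, log_record, is_commit=False):
        buf = self.buffers[self.current]
        buf.append(log_record)
        if len(buf) == self.buffer_size:        # buffer full: issue asynchronous IO
            self._issue_io()
        if is_commit:                           # commit record: reap all pending IOs
            self._issue_io()
            self._reap_all()

    def _issue_io(self):
        buf = self.buffers[self.current]
        if buf:
            self.pending_ios.append(list(buf))
            buf.clear()
        self.current = (self.current + 1) % len(self.buffers)
        # Reusing a buffer would first reap its outstanding IO (not shown).

    def _reap_all(self):
        for completed in self.pending_ios:
            self.on_disk.extend(completed)
        self.pending_ios.clear()

p = IoBufferedPersister()
for i in range(1, 8):
    p.persist(f"LCR-{i}")
p.persist("T1-commit", is_commit=True)
print(p.on_disk)    # every record, including the commit, is now durable
```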
  • FIG. 8 is a block diagram depicting log persistence optimizations in a follower shard server in accordance with an illustrative embodiment.
  • the log persistence process group 820 in a follower shard receives LCRs, commit Index, minimum persisted log index, and minimum oldest log index from the leader.
  • the network receiver 821 receives the log records from the wire, drops the LCR into the circular queue 823 and the IO buffers 825 in the SGA, and notifies the LCR persister process 826 if it does not have a free IO buffer.
  • the LCR persister process 826 persists LCRs from the IO buffers 825 to on-disk log file 828.
  • the LCR persister process 826 monitors the IO buffers 825 and issues a block of buffers as one synchronous IO when a predetermined threshold (e.g., 60%) of buffers are utilized or a commit is received.
  • the LCR persister process 826 also notifies the acknowledgement (ACK) sender 827 when IO is completed.
  • the ACK sender 827 maintains a record of the last log index 829 that was acknowledged. Whenever the LCR persister process 826 persists a commit index higher than the last acknowledged index, the ACK sender 827 sends this information to the leader. The network receiver 821 sends the commitindex 822 to the SQL Apply processes 824.
  • the SQL Apply processes 824 read LCRs from the circular queue 823, assemble transactions from interleaved LCRs from different transactions, and execute DML operations on the database.
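  • A condensed, hypothetical sketch of the follower-side flow above (threshold-triggered persistence plus acknowledgement of the highest persisted log index), with the buffers and network simulated:

```python
class FollowerPersistence:
    def __init__(self, capacity=10, threshold=0.6, send_ack=print):
        self.buffers = []                # (log_index, record) waiting for IO
        self.capacity = capacity
        self.threshold = threshold       # e.g. 60% buffer utilization
        self.on_disk = []                # stands in for the on-disk log file
        self.last_acked_index = 0
        self.send_ack = send_ack         # callback back to the leader

    def receive(self, log_index, record, is_commit=False):
        self.buffers.append((log_index, record))
        utilization = len(self.buffers) / self.capacity
        if is_commit or utilization >= self.threshold:
            self._flush()

    def _flush(self):
        # Issue the block of buffers as one IO, then acknowledge progress.
        self.on_disk.extend(self.buffers)
        highest_persisted = self.on_disk[-1][0]
        self.buffers.clear()
        if highest_persisted > self.last_acked_index:
            self.last_acked_index = highest_persisted
            self.send_ack(f"ACK up to log index {highest_persisted}")

f = FollowerPersistence(capacity=5)
for i in range(101, 104):
    f.receive(i, f"LCR-{i}")                      # 3/5 = 60% -> flush and ACK 103
f.receive(104, "T1-commit", is_commit=True)       # commit forces flush and ACK 104
```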
  • FIG. 9 depicts a replication log with interleaved transactions in accordance with an illustrative embodiment.
  • the replication log shown in FIG. 9 includes Raft log records, each of which has a log index.
  • T1L1 means the first LCR in transaction T1.
  • T1C means a commit LCR for transaction T1.
  • T2R means a rollback LCR for transaction T2.
  • T1L1 has a log index of 100
  • T2L1 has a log index of 101
  • T1C has a log index of 102
  • T2R has a log index of 103. Therefore, transaction T1 and transaction T2 are interleaved because transaction T2 has log records between the first LCR and the commit LCR of transaction T1.
  • a SQL Apply server consists of an LCR reader process and an apply reader process, a coordinator process, and multiple applier processes.
  • the apply LCR reader process dequeues LCRs, possibly reads LCRs from a persistence layer, and computes hash values for relevant key columns.
  • the apply reader process assembles LCRs into complete transactions, computes dependencies among transactions, and passes the transactions to the coordinator.
  • the coordinator process assigns transactions to available appliers based on the transaction dependencies and commit ordering.
  • Each applier process executes an entire transaction at the replica database before requesting another one.
  • the appliers process independent transactions concurrently for better throughput.
  • an applier starts applying DMLs for a transaction before receiving the commit or rollback. This is referred to as “eager apply.” However, an applier cannot commit a transaction even if it sees the commit LCR for this transaction unless the commit LCR has achieved consensus. Each transaction is applied exactly once.
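  • The following simplified sketch shows how interleaved LCRs could be assembled per transaction and committed only once the commit record is covered by the leader's commit index; it is an illustration with hypothetical names, not the SQL Apply implementation, and it reuses the FIG. 9 example:

```python
from collections import defaultdict

def eager_apply(log_records, commit_index):
    """log_records: list of (log_index, txn_id, kind, payload) in log order,
    where kind is 'dml', 'commit' or 'rollback'."""
    open_txns = defaultdict(list)
    applied, rolled_back = [], []

    for log_index, txn, kind, payload in log_records:
        if kind == "dml":
            # Eager apply: DMLs can be executed as they arrive (not shown here),
            # but the transaction stays open until its outcome is known.
            open_txns[txn].append(payload)
        elif kind == "rollback":
            rolled_back.append(txn)
            open_txns.pop(txn, None)
        elif kind == "commit" and log_index <= commit_index:
            # Consensus reached for the commit record: safe to commit.
            applied.append((txn, open_txns.pop(txn, [])))
        # A commit record beyond commit_index stays pending.

    return applied, rolled_back, dict(open_txns)

log = [(100, "T1", "dml", "T1L1"), (101, "T2", "dml", "T2L1"),
       (102, "T1", "commit", "T1C"), (103, "T2", "rollback", "T2R")]
print(eager_apply(log, commit_index=102))
```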
  • FIG. 10 depicts apply progress tracking in a replication log in accordance with an illustrative embodiment.
  • SQL Apply maintains two key values about its progress: Low Watermark Log Index (LWMLI) and Oldest Log Index. All transactions with a commit index less than or equal to the LWMLI have been applied and committed. The oldest log index is the log index of the earliest log record the apply may need.
  • SQL Apply inserts a row in a system table (e.g., appliedTxns) containing the source transaction ID, the log index of the first DML, and the log index of the txCommit, which refers to the commit for a user transaction.
  • the recovery process or the apply process can start reading from the Oldest Log Index, skip already applied transactions based on the appliedTxns table, and complete the replication of any open transactions. As an optimization, multiple transactions can be batched and applied as one transaction in the follower for better performance.
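  • A hypothetical sketch of the restart logic above, using an in-memory stand-in for the appliedTxns system table and a replication log keyed by log index:

```python
def transactions_to_replay(log_records, oldest_log_index, applied_txns):
    """applied_txns: set of source transaction ids already recorded in the
    appliedTxns table (i.e., applied and committed on this replica)."""
    replay = {}
    for log_index, txn_id, payload in log_records:
        if log_index < oldest_log_index:
            continue                     # before the earliest record the apply may need
        if txn_id in applied_txns:
            continue                     # already applied; skip so it is applied exactly once
        replay.setdefault(txn_id, []).append(payload)
    return replay

log = [(200, "T7", "dml"), (201, "T8", "dml"), (202, "T7", "commit"),
       (203, "T8", "commit")]
print(transactions_to_replay(log, oldest_log_index=200, applied_txns={"T7"}))
# -> {'T8': ['dml', 'commit']}: T7 is skipped because appliedTxns records it.
```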
  • FIG. 11 depicts an example of multiple ring placement of replication units in accordance with an illustrative embodiment.
  • RU1 and RU2 are replicated from Shardl to Shard2 and Shard3, RU3 and RU4 are replicated from Shard2 to Shardl and Shard3, and RU5 and RU6 are replicated from Shard3 to Shardl and Shard2.
  • RU7 and RU8 are replicated from Shard4 to Shard5 and Shard6,
  • RU9 and RU10 are replicated from Shard5 to Shard4 and Shard6, and
  • RU11 and RU12 are replicated from Shard6 to Shard4 and Shard5.
  • the ideal case is to add RF number of shard servers at the same time.
  • In the example of FIG. 11, if three shard servers are added, an equal number of chunks would be pulled from all RUs to create six new RUs, which would be deployed on the three new shard servers, creating a third ring of shard servers.
  • Step 1 With one ring and three shard servers, the topology is shown in Table 1 as follows:
  • Step 2 Add Shard4. Even though a shard server is being added, the one ring is kept because there are not enough shard servers to make two smaller rings. Chunks for the new RU are pulled equally from the other RUs. There are now 45 chunks per RU. After adding the fourth shard server, synchronization is required across all four shard servers for any barrier DDL propagation.
  • the topology' is shown in Table 2 as follows:
  • Step 3 Add Shard5. Again, no new smaller ring is created for the same reason. The new shard server pulls chunks from all shard servers. There are now 36 chunks per RU. After adding the fifth shard server, synchronization is required across all five shard servers for any barrier DDL propagation.
  • the topology' is shown in Table 3 as follows:
  • Step 4 Add Shard6. This time, followers are moved around to form two smaller rings. There are now 30 chunks per RU. After adding the sixth shard server, synchronization is required across only three shard servers for any barrier DDL. The topology is shown in Table 4 as follows:
  • Step 5 Beyond this point, it is no longer necessary to modify the first ring (Shardl, Shard2, Shard3) as new shards are added. Any new shard servers will pull chunks from all shards, but the RU composition of the first ring will not change. As Shard7 and Shard8 are added, the second ring (Shard4, Shard5, Shard6) will expand to five shard servers, and at Shard9, Shard4 through Shard9 will split to form two smaller rings to form a total of three smaller rings. This layout allows barrier DDLs to be applied to a single ring of shard servers without impacting the entire sharded database management system.
  • Replication unit placement can also be done in the form of a single ring.
  • the leader chunk set on Shard1 is replicated on Shard2 and Shard3, the leader chunk set on Shard2 is replicated on Shard3 and Shard4, and so on.
  • the leader chunk set is marked with “L”,
  • the follower chunk set is marked with “F”.
  • the number indicates the number of chunks in the chunk set.
  • The chunk distribution is shown in Table 5 as follows:
  • adding a shard server into the ring involves pulling chunks from the four shard servers nearest to it (two on either side). This is done in the following manner: a. Deciding the insertion point: Find four adjacent shard servers that together have the highest number of chunks when compared to all other groups of four adjacent shard servers. The new shard server should be inserted in the middle of these four donor shard servers.
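  • The insertion-point rule quoted above could be sketched as follows (a single-ring layout is assumed; chunk counts and names are illustrative):

```python
def insertion_point(chunk_counts):
    """chunk_counts: chunks held per shard server, in ring order.
    Returns the ring position at which to insert the new shard server."""
    n = len(chunk_counts)
    best_start, best_total = 0, -1
    for start in range(n):
        window = [chunk_counts[(start + i) % n] for i in range(4)]  # 4 adjacent shards
        if sum(window) > best_total:
            best_start, best_total = start, sum(window)
    # "Middle" of the four donors: between the second and third shard of the window.
    return (best_start + 2) % n

ring = [45, 45, 48, 50, 47, 45]      # chunks per shard server around the ring
print(insertion_point(ring))          # insert the new shard server at this ring position
```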
  • a single ring structure has one disadvantage: barrier DDLs result in quiescing the entire sharded database management system.
  • Removing a shard server (e.g., due to failure or scaling down) can be done similarly in both options.
  • FIG. 12 is a data flow diagram illustrating an example replication unit split in accordance with an illustrative embodiment. For simplicity, assume there are no leadership changes during an RU split. Transactions T1, T2, T3, T4 are executed on a given shard server (leader or follower) within the sharded database management system.
  • the shard coordinator 1250 creates a new RU (RU2 in FIG. 12), which has the same set of shard servers for its replicas.
  • the leader for the old RU is the leader for the new RU.
  • the shard coordinator 1250 creates the new RU and enqueues a split RU begin marker into both the old RU and the new RU.
  • the shard coordinator 1250 associates a set of chunks (C2 in FIG. 12) with the new RU.
  • the shard coordinator 1250 sets up the replication processes for the new RU and suspends the SQL Apply processes for the new RU.
  • the shard coordinator 1250 waits for transactions running concurrently with the RU split to finish.
  • transactions T1 and T3 are in-flight when the split is initiated.
  • Transaction T3 depends on transaction T2, and transaction T2 completes during the split process.
  • the shard coordinator 1250 enqueues the old values of all scalar columns for updates and deletes during the RU split process.
  • Shard coordinator 1250 does the same for the RU merge process.
  • the shard coordinator After the RU split process, the shard coordinator enables regular consensus protocol-based leadership change.
  • the RU split process described above with reference to FIG. 12 can be generalized to split one RU into multiple RUs instead of just two.
  • Moving an RU from one follower to another follower may be performed as follows:
  • the reasons for reinstating an RU include creation of an additional replica on a new shard server to increase the replication factor, replacement of a replication unit on a “fallen” shard server with a replication unit on another shard server, rebuilding a replication unit on a shard server to recover from data/log divergence, and rebuilding an outdated replication unit after it was down for a long time to allow it to catch up.
  • When a shard server fails and there is a new spare shard, target shards are selected, typically hosting a follower, and all the chunks and replication logs for the relevant replication units are copied to the new shard.
  • all the chunks for the relevant replication units are redistributed to those shards, maintaining the replication factor. For simplicity, new transactions are not allowed to span a term; however, commit and rollback records can be generated in a different term in the new leader.
  • the leader prepares the local transaction for commit and marks the transaction as in-doubt first, and then sends the commit LCR to its followers. Upon receiving consensus of the commit LCR, the leader commits the local transaction. Upon failure of the leader shard server, if the failed leader becomes a follower and there is a consensus for this commit LCR based on the replication log, the in-doubt state of the local transaction in the failed leader can be committed. Otherwise, the transaction must be rolled back in the failed leader.
  • Each follower receives the commit LCR but does not perform the commit on its copy of the replication unit until the follower receives a commit log index from the leader that is equal to or greater than the log index of the commit LCR, thus indicating that the commit LCR has been persisted by the other followers.
  • Before committing a local transaction, the leader sends a pre-commit LCR to its followers. Upon receiving consensus for the pre-commit LCR, the leader commits the local transaction and sends a post-commit LCR to its followers. Upon failure of the leader shard server and the failed leader becoming a follower, if there is a consensus for the pre-commit LCR based on the replication log, then the local transaction can be committed in the failed leader. If the local transaction in the failed leader is rolled back, the transaction can be replayed to the failed leader based on the replication log. If the transaction fails (e.g., the user session crashes) after sending the pre-commit LCR and the shard server remains the leader, the process monitor in the database generates and sends rollback records to its followers.
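  • A hypothetical sketch of the failover decision described above for a transaction left in doubt on a failed leader that rejoins as a follower; the replication-log lookup and database actions are stubbed out, and names are illustrative only:

```python
def resolve_in_doubt(txn_id, pre_commit_index, consensus_index,
                     locally_rolled_back, replay, commit):
    """consensus_index: highest log index known to have consensus, derived from
    the replication log. replay and commit are callbacks into the database."""
    if pre_commit_index is not None and pre_commit_index <= consensus_index:
        # The pre-commit LCR reached consensus: the transaction must survive.
        if locally_rolled_back:
            replay(txn_id)     # rebuild the transaction from the replication log
        else:
            commit(txn_id)     # commit the in-doubt local transaction
    else:
        # No consensus evidence for the pre-commit record: roll back locally
        # (or defer to the new leader's lead-sync outcome).
        print(f"{txn_id}: roll back / defer to new leader")

resolve_in_doubt("T1", pre_commit_index=205, consensus_index=210,
                 locally_rolled_back=False,
                 replay=lambda t: print(f"{t}: replay from replication log"),
                 commit=lambda t: print(f"{t}: commit in-doubt transaction"))
```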
  • a sequence of LCRs for a transaction T1 is as follows:
  • TD is a DML LCR
  • TF is the first LCR of the transaction, which could be a DML LCR
  • Tpre is a pre-commit LCR (waiting for consensus)
  • Tpost is a post-commit LCR.
  • TRBK is a rollback LCR.
  • the purpose of the pre-commit LCR Tpre is to allow the leader to receive a consensus prior to committing the transaction locally.
  • the leader does not wait for consensus on the post-commit LCR Tpost before returning to the client.
  • the commit of the transaction at the leader occurs between the pre-commit LCR and the post-commit LCR.
  • the commit occurs after receiving the post-commit LCR.
  • a follower does not commit the transaction to its copy of the replication unit after receiving the pre-commit LCR, because the follower does not know if there was consensus for the pre-commit LCR. In fact, the follower could receive a rollback LCR after a pre-commit LCR. If a follower receives a post-commit LCR, then the follower knows that there was consensus for the pre-commit LCR and the leader committed the transaction.
  • the new leader will be the follower that has the longest replication log.
  • One scenario is that the new leader has LCRs for the transaction but neither the pre-commit LCR nor the post-commit LCR. In this case, the new leader does not know if there were other DML LCRs for changes made by the old leader and acknowledged to the user/client. Thus, the new leader would roll back the transaction and issue a rollback LCR.
  • Another scenario is that the new leader has LCRs for the transaction including the pre-commit LCR and the post-commit LCR. This transaction must be committed. Therefore, the new leader commits the transaction.
  • the new leader has LCRs for the transaction including the pre-commit LCR but not the post-commit LCR.
  • the new leader cannot assume that the pre-commit LCR received consensus.
  • the new leader issues a lead-sync LCR, which is a special LCR sent at the beginning of a term to synchronize the replication logs of the followers.
  • the new leader waits for consensus on the lead-sync LCR, which indicates that a consensus number of followers have persisted all the LCRs leading up to the lead-sync LCR.
  • a follower acknowledges the lead-sync LCR only if that follower has persisted the LCRs leading up to the lead-sync LCR.
  • Each LCR has a log index; therefore, each follower will know if all LCRs are persisted up to and including the lead-sync LCR.
  • a follower cannot acknowledge that the lead-sync LCR is persisted without also acknowledging that each LCR having a log index less than the lead-sync LCR has been persisted. This will ensure that for transaction T1, all followers will have the pre-commit LCR and all LCRs leading up to the pre-commit LCR.
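  • Before continuing with the new leader's logic, here is a minimal sketch of the acknowledgement rule just described, assuming log indexes start at 1: a follower acknowledges the lead-sync LCR only when every LCR with a smaller log index has been persisted. The function name is hypothetical.
```python
# Minimal sketch (hypothetical name): a follower acknowledges a lead-sync LCR
# only if every LCR with a smaller log index has already been persisted.
def ack_lead_sync(persisted_indexes: set[int], lead_sync_index: int) -> bool:
    """Return True only when all log indexes 1..lead_sync_index are persisted,
    i.e., acknowledging the lead-sync LCR implicitly acknowledges every
    earlier LCR, including any pre-commit LCRs."""
    return all(i in persisted_indexes for i in range(1, lead_sync_index + 1))

print(ack_lead_sync({1, 2, 3, 4, 5}, 5))   # True: everything up to the lead-sync is persisted
print(ack_lead_sync({1, 2, 4, 5}, 5))      # False: index 3 is missing, so no acknowledgement
```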
  • the new leader can then commit the transaction T1 to its copy of the replication unit because the new leader now knows there is consensus for the pre-commit LCR.
  • the new leader would then send a post-commit LCR to its followers.
  • if the new leader receives consensus on the lead-sync LCR with the log index of 501, then the new leader knows that a consensus number of followers have persisted the LCRs having log index up to 500. The new leader can commit both transactions T1 and T2. The new leader will generate the following LCRs:
  • FIG. 13 is a flowchart illustrating operation of a new leader shard server taking over a replication unit when there is a commit initiated in the replication log in accordance with an illustrative embodiment. Operation begins when there is a commit initiated for a transaction in the replication log (block 1301). The leader shard server determines whether there is a consensus for the transaction commit (block 1301). The leader shard server determines that there is consensus for the transaction if there is a post-commit LCR in the replication log for the transaction.
  • the leader shard server performs a lead-sync operation by generating a lead-sync LCR and propagating the lead-sync LCR to its followers (block 1302).
  • in response to a follower receiving the lead-sync LCR, the follower will request every LCR in the replication log up to the log index of the lead-sync LCR. If the follower successfully receives every LCR in the replication log up to the log index of the lead-sync LCR and persists the LCRs in its replication log, then the follower returns an acknowledgement specifying the log index of the lead-sync LCR. The leader shard server then knows that the follower has acknowledged every LCR in the replication log up to and including the lead-sync LCR.
  • the leader shard server determines whether consensus is received for the lead-sync LCR (block 1303).
  • the leader shard server determines that there is consensus for the transaction if a consensus number of followers acknowledge LCRs with log index up to and including the lead-sync LCR, thus including the pre-commit LCR for the transaction. If there is consensus for the lead-sync LCR (block 1301:YES or block 1303:YES), then the leader shard server completes the commit operation (block 1304), and operation ends (block 1305).
  • completing the commit operation includes generating a post-commit LCR and propagating the post-commit LCR to its followers.
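  • The decision logic of FIG. 13 can be summarized in a short sketch: a post-commit LCR in the log proves consensus; a pre-commit LCR without a post-commit LCR requires a lead-sync round before the commit is completed; otherwise the transaction is rolled back. The names and the quorum-counting convention below are hypothetical.
```python
# Minimal sketch (hypothetical names): a new leader deciding the fate of a
# transaction whose commit was initiated in the replication log (cf. FIG. 13).
def resolve_initiated_commit(replication_log: list[dict], txn_id: str,
                             lead_sync_acks: int, quorum: int) -> str:
    kinds = {lcr["kind"] for lcr in replication_log if lcr["txn"] == txn_id}
    if "POST_COMMIT" in kinds:
        # Consensus already proven: the commit can simply be completed.
        return "complete commit (send post-commit LCR)"
    if "PRE_COMMIT" in kinds:
        # Consensus for the pre-commit LCR is unknown; the lead-sync LCR forces
        # a consensus number of followers to persist everything up to and
        # including the pre-commit LCR before the commit is completed.
        if lead_sync_acks >= quorum:
            return "commit locally and send post-commit LCR"
        return "wait for consensus on the lead-sync LCR"
    # No commit was initiated for this transaction in the log.
    return "roll back and send rollback LCR"

log = [
    {"txn": "T1", "kind": "DML"},
    {"txn": "T1", "kind": "PRE_COMMIT"},
]
print(resolve_initiated_commit(log, "T1", lead_sync_acks=2, quorum=2))
```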
  • Replication log recovery is the procedure of matching the state of the replication log with the state of the database transactions by recreating missing entries in the replication log and replaying or rolling back transactions.
  • a shard server determines a recovery window defining transactions that must be recovered.
  • the recovery window starts with the earliest transaction that has a first LCR in the replication log but no post-commit LCR.
  • Transaction T4 reached post-commit in the replication log; therefore, T4 does not need to be recovered, because a post-commit LCR for T4 indicates that the commit of T4 received consensus.
  • the earliest transaction that has a first LCR but no post-commit LCR is transaction T1 at log index 100. Therefore, the follower must recover the replication log starting with log index 100 up to log index 1001 (the lead-sync LCR). The recovery window is shown in bold. If the shard failed after receiving consensus for the pre-commit LCR but before generating the post-commit LCR, then there is not enough information in the replication log to conclude the fate of the transaction in the failed shard.
  • the approach of the illustrative embodiment is to have the database resolve this. If the transaction was rolled back, then the transaction is replayed; otherwise, the transaction will not be replayed against the database.
  • Transaction T1 requires post/replay/commit.
  • Transaction T2 does not need to be processed, because transaction T2 was committed locally.
  • Transaction T3 requires rollback and a rollback LCR, because T3 did not reach the pre-commit LCR in the replication log.
  • Transaction T4 does not need to be processed, because the first LCR of T4 is outside the recovery window.
  • Transaction T5 does not need to be processed, because T5 has been rolled back. Therefore, the follower can ignore transactions T2, T4, and T5 during recovery.
  • FIG. 14 is a flowchart illustrating operation of a shard server performing replication log recovery in accordance with an illustrative embodiment. Operation begins on a shard server when recovering from a failure (block 1400).
  • the shard server identifies a first transaction in the replication log having a first LCR but no post-commit LCR (block 1401).
  • the shard server defines the recovery window as described above (block 1402).
  • the shard server then identifies a set of transactions to be recovered (block 1403).
  • the shard server identifies transactions having LCRs within the recovery window that can be ignored during recovery.
  • the shard server then performs recovery on the identified set of transactions to be recovered (block 1404). Thereafter, operation ends (block 1405).
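  • As an illustration of blocks 1401-1403, the sketch below scans a replication log, finds the earliest transaction with a first LCR but no post-commit LCR, and classifies which transactions within the window still need recovery. It deliberately simplifies (for example, it ignores the lead-sync boundary), and all names and the record layout are hypothetical.
```python
# Minimal sketch (hypothetical names): computing the recovery window and the
# set of transactions that need recovery after a failure (cf. FIG. 14).
def recovery_window(replication_log: list[dict]) -> tuple[int, set[str]]:
    """Return (start_index, txns_to_recover). The window starts at the earliest
    transaction that has a first LCR in the log but no post-commit LCR."""
    first_index = {}
    has_post_commit = set()
    rolled_back = set()
    for lcr in replication_log:
        txn = lcr["txn"]
        first_index.setdefault(txn, lcr["index"])
        if lcr["kind"] == "POST_COMMIT":
            has_post_commit.add(txn)
        elif lcr["kind"] == "ROLLBACK":
            rolled_back.add(txn)

    candidates = [idx for txn, idx in first_index.items() if txn not in has_post_commit]
    if not candidates:
        return (len(replication_log), set())
    start = min(candidates)
    # Transactions that already reached post-commit or rollback, or whose first
    # LCR lies before the window, can be ignored during recovery.
    to_recover = {txn for txn, idx in first_index.items()
                  if idx >= start and txn not in has_post_commit and txn not in rolled_back}
    return (start, to_recover)

log = [
    {"index": 90,  "txn": "T4", "kind": "DML"}, {"index": 95,  "txn": "T4", "kind": "POST_COMMIT"},
    {"index": 100, "txn": "T1", "kind": "DML"}, {"index": 101, "txn": "T1", "kind": "PRE_COMMIT"},
    {"index": 102, "txn": "T3", "kind": "DML"},
    {"index": 103, "txn": "T5", "kind": "DML"}, {"index": 104, "txn": "T5", "kind": "ROLLBACK"},
]
print(recovery_window(log))   # window starts at index 100; T1 and T3 need recovery
```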
  • DDL stands for Data Definition Language.
  • each shard contains the same schema definitions.
  • DDL operations in an SDBMS are issued on the Global Data Service (GDS) catalog, which propagates the DDLs to each shard.
  • the consensus protocol-based replication approach of the illustrative embodiments only replicates DMLs asynchronously.
  • there may be in-flight transactions in the replication pipeline that may: a) lag the schema definition in a follower when the DDL is executed in the follower first for the replication unit, or b) be ahead of the schema definition in a follower when the DDL is executed in the leader first for the replication unit.
  • a DDL is classified as a barrier DDL or a non-barrier DDL.
  • Barrier DDLs do not allow in-flight DMLs, e.g., ALTER TABLE RENAME COLUMN.
  • the SDB is quiesced, the DDL is executed synchronously against all available shards, and DMLs are allowed after DDL execution.
  • Non-barrier DDLs allow in-flight DMLs, e.g., CREATE SYNONYM.
  • DDLs that do not require any synchronization with in-flight DMLs, e.g., CREATE SYNONYM
  • DDLs that require synchronization of delayed user DMLs in the replication streams and the DDL executions.
  • Certain in-flight DMLs are still allowed, e.g., ALTER TABLE ADD column with default values.
  • the leader filters out this column.
  • SQL Apply stops when it hits an error.
  • the SQL Apply can be restarted.
  • the leader can raise an error if a DML may stop the SQL Apply, e.g., an insert with a non-default value for a column added by ALTER TABLE ADD column with a default value.
  • the SQL Apply continues to apply in other cases; e.g., for a dropped column, the SQL Apply can ignore this column in the LCR. For an added column with a default value, because the leader does not have this new column yet, the SQL Apply will not include this column in the replicated DMLs.
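  • A rough sketch of the barrier/non-barrier classification described above is shown below; the statement prefixes used to classify DDLs are illustrative examples only, not an authoritative or exhaustive rule set.
```python
# Minimal sketch (illustrative classification): deciding how a DDL interacts
# with in-flight DMLs in the replication stream.
def classify_ddl(ddl_text: str) -> str:
    text = " ".join(ddl_text.upper().split())
    if text.startswith("ALTER TABLE") and "RENAME COLUMN" in text:
        # Barrier DDL: quiesce the SDB, execute on all shards, then resume DMLs.
        return "barrier"
    if text.startswith("CREATE SYNONYM"):
        # Non-barrier DDL needing no synchronization with in-flight DMLs.
        return "non-barrier/no-sync"
    # Non-barrier DDL that requires synchronizing delayed DMLs with execution.
    return "non-barrier/needs-sync"

print(classify_ddl("ALTER TABLE t RENAME COLUMN a TO b"))    # barrier
print(classify_ddl("CREATE SYNONYM s FOR t"))                # non-barrier/no-sync
print(classify_ddl("ALTER TABLE t ADD c NUMBER DEFAULT 0"))  # non-barrier/needs-sync
```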
  • when a shard is down during DDL execution, it is not necessary to re-execute the DDLs when the shard is available. DDL execution must be coordinated with the replication of in-flight transactions in the replication log.
  • Any transaction spanning chunks will be initiated from the shard coordinator, which acts as the distributed transaction coordinator (DTC).
  • the DTC executes a typical two-phase commit protocol among the leaders of all relevant replication units.
  • the leader for a relevant replication unit must propagate sufficient information about this distributed transaction (e.g., global transaction ID, local transaction ID) to its followers and obtain consensus in the prepare commit phase from its followers before informing its readiness to commit to the DTC.
  • the DTC can only commit after receiving ready-to-commit from all relevant RU leaders.
  • the DTC is not aware of the presence of followers.
  • the new leader of the replication unit must contact the DTC to find out the fate of such a distributed transaction.
  • distributed transactions incur a larger delay in transaction response time: one more network roundtrip. Note that the replication of a distributed transaction branch is asynchronous and is not part of the original distributed transaction. In a typical sharded deployment, it is assumed that most transactions are within the same chunk, hence avoiding the two-phase commit protocol.
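  • The interaction between the DTC and the RU leaders can be sketched as a standard two-phase commit in which each leader's prepare step stands in for obtaining consensus from its followers. The classes, quorum arithmetic, and simulated acknowledgements below are hypothetical simplifications, not the embodiment's protocol code.
```python
# Minimal sketch (hypothetical names): two-phase commit driven by the shard
# coordinator acting as distributed transaction coordinator (DTC) over the
# leaders of the relevant replication units.
class RULeader:
    def __init__(self, name: str, follower_count: int):
        self.name = name
        self.quorum = follower_count // 2 + 1

    def prepare(self, global_txn_id: str) -> bool:
        # Propagate global/local transaction identifiers to followers and wait
        # for a consensus number of acknowledgements (simulated as success here).
        acks = self.quorum
        return acks >= self.quorum

    def commit(self, global_txn_id: str) -> None:
        print(f"{self.name}: committing branch of {global_txn_id}")

def dtc_two_phase_commit(global_txn_id: str, leaders: list[RULeader]) -> bool:
    # Phase 1: prepare. The DTC is unaware of followers; it only talks to leaders.
    if not all(leader.prepare(global_txn_id) for leader in leaders):
        return False                 # any failure -> roll back all branches
    # Phase 2: commit only after every relevant RU leader reported ready-to-commit.
    for leader in leaders:
        leader.commit(global_txn_id)
    return True

dtc_two_phase_commit("GTX-42", [RULeader("RU1-leader", 2), RULeader("RU2-leader", 2)])
```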
  • a database management system manages a database.
  • a DBMS may comprise one or more database servers.
  • a database comprises database data and a database dictionary that are stored on a persistent memory mechanism, such as a set of hard disks.
  • Database data may be stored in one or more collections of records. The data within each record is organized into one or more attributes.
  • the collections are referred to as tables (or data frames), the records are referred to as rows, and the attributes are referred to as columns.
  • a collection of records is a collection of documents, each of which may be a data object marked up in a hierarchical markup language, such as a JSON object or XML document.
  • the attributes are referred to as JSON fields or XML elements.
  • a relational DBMS may also store hierarchically marked data objects; however, the hierarchically marked data objects are contained in an attribute of a record, such as a JSON-typed attribute.
  • a database command may be in the form of a database statement that conforms to a database language.
  • a database language for expressing the database commands is the Structured Query Language (SQL).
  • Data definition language (“DDL”) commands are issued to a database server to create or configure data objects referred to herein as database objects, such as tables, views, or complex data types.
  • SQL/XML is a common extension of SQL used when manipulating XML data in an object-relational database.
  • Another database language for expressing database commands is Spark™ SQL, which uses a syntax based on function or method invocations.
  • a database command may be in the form of functions or object method calls that invoke CRUD (Create Read Update Delete) operations.
  • An example of an API for such functions and method calls is MQL (MongoDB™ Query Language).
  • database objects include a collection of documents, a document, a view, or fields defined by a JSON schema for a collection.
  • a view may be created by invoking a function provided by the DBMS for creating views in a database.
  • a database transaction is a set of operations that change database data.
  • a database transaction is initiated in response to a database command requesting a change, such as a DML command requesting an update, insert of a record, or a delete of a record or a CRUD object method invocation requesting to create, update or delete a document.
  • DML commands, such as INSERT and UPDATE statements, specify changes to data.
  • a DML statement or command does not refer to a statement or command that merely queries database data. Committing a transaction refers to making the changes for a transaction permanent.
  • transactional metadata includes change records that record changes made by transactions to database data.
  • transactional metadata is embedded transactional metadata stored within the database data, the embedded transactional metadata describing transactions that changed the database data.
  • Undo records are used to provide transactional consistency by performing operations referred to herein as consistency operations.
  • Each undo record is associated with a logical time.
  • An example of logical time is a system change number (SCN).
  • An SCN may be maintained using a Lamporting mechanism, for example.
  • a DBMS applies the needed undo records to copies of the data blocks to bring the copies to a state consistent with the snapshot time of the query.
  • the DBMS determines which undo records to apply to a data block based on the respective logical times associated with the undo records.
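  • As a toy illustration of consistency operations, the sketch below applies undo records with SCNs newer than a query's snapshot SCN to a copy of a data block, rolling the copy back to the snapshot-consistent state. The record layout and names are invented for the example.
```python
# Minimal sketch (hypothetical structures): applying undo records to a copy of
# a data block to produce a version consistent with a query's snapshot SCN.
def consistent_read(block: dict, undo_records: list[dict], snapshot_scn: int) -> dict:
    """Return a copy of `block` with every change made after `snapshot_scn`
    undone. Each undo record carries the SCN of the change it reverses."""
    copy = dict(block)
    # Apply undo records for changes newer than the snapshot, newest first.
    for undo in sorted(undo_records, key=lambda u: u["scn"], reverse=True):
        if undo["scn"] > snapshot_scn:
            copy[undo["key"]] = undo["old_value"]
    return copy

block = {"balance": 300}                                  # current block contents
undo = [
    {"scn": 120, "key": "balance", "old_value": 100},     # change at SCN 120
    {"scn": 150, "key": "balance", "old_value": 200},     # change at SCN 150: 200 -> 300
]
print(consistent_read(block, undo, snapshot_scn=130))     # {'balance': 200}
```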
  • In a distributed transaction, multiple DBMSs commit a distributed transaction using a two-phase commit approach. Each DBMS executes a local transaction in a branch transaction of the distributed transaction.
  • One DBMS, the coordinating DBMS, is responsible for coordinating the commitment of the transaction on one or more other database systems.
  • the other DBMSs are referred to herein as participating DBMSs.
  • a two-phase commit involves two phases: the prepare-to-commit phase and the commit phase.
  • In the prepare-to-commit phase, a branch transaction is prepared in each of the participating database systems.
  • the database is in a "prepared state" such that it can guarantee that modifications executed as part of a branch transaction to the database data can be committed. This guarantee may entail storing change records for the branch transaction persistently.
  • a participating DBMS acknowledges when it has completed the prepare-to-commit phase and has entered a prepared state for the respective branch transaction of the participating DBMS.
  • the coordinating database system commits the transaction on the coordinating database system and on the participating database systems. Specifically, the coordinating database system sends messages to the participants requesting that the participants commit the modifications specified by the transaction to data on the participating database systems. The participating database systems and the coordinating database system then commit the transaction.
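  • The participant side of the two-phase commit described above can be sketched as follows; entering the prepared state is modeled simply as persisting the branch's change records before acknowledging the coordinator. The class and method names are hypothetical.
```python
# Minimal sketch (hypothetical names): a participating DBMS handling the two
# phases of a distributed commit for its branch transaction.
class ParticipatingDBMS:
    def __init__(self, name: str):
        self.name = name
        self.persisted_change_records: list[str] = []
        self.prepared = False

    def prepare_to_commit(self, branch_changes: list[str]) -> str:
        # Entering the prepared state entails persisting the branch's change
        # records so the modifications are guaranteed to be committable.
        self.persisted_change_records.extend(branch_changes)
        self.prepared = True
        return "prepared"            # acknowledgement sent to the coordinator

    def commit(self) -> str:
        if not self.prepared:
            raise RuntimeError("commit requested before prepared state")
        return "committed"

p = ParticipatingDBMS("shard-west")
print(p.prepare_to_commit(["update account 7"]))   # "prepared"
print(p.commit())                                  # "committed"
```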
  • a client may issue a series of requests, such as requests for execution of queries, to a DBMS by establishing a database session.
  • a database session comprises a particular connection established for a client to a database server through which the client may issue a series of requests.
  • a database session process executes within a database session and processes requests issued by the client through the database session.
  • the database session may generate an execution plan for a query issued by the database session client and marshal slave processes for execution of the execution plan.
  • the database server may maintain session state data about a database session.
  • the session state data reflects the current state of the session and may contain the identity of the user for which the session is established, services used by the user, instances of object types, language and character set data, statistics about resource usage for the session, temporary variable values generated by processes executing software within the session, storage for cursors, variables, and other information.
  • a database server includes multiple database processes.
  • Database processes run under the control of the database server (i.e., can be created or terminated by the database server) and perform various database server functions.
  • Database processes include processes running within a database session established for a client.
  • a database process is a unit of execution.
  • a database process can be a computer system process or thread or a user-defined execution context such as a user thread or fiber.
  • Database processes may also include “database server system” processes that provide services and/or perform functions on behalf of the entire database server. Such database server system processes include listeners, garbage collectors, log writers, and recovery processes.
  • a multi-node database management system is made up of interconnected computing nodes (“nodes”), each running a database server that shares access to the same database. Typically, the nodes are interconnected via a network and share access, in varying degrees, to shared storage, e.g., shared access to a set of disk drives and data blocks stored thereon.
  • the nodes in a multi-node database system may be in the form of a group of computers (e.g., work stations, personal computers) that are interconnected via a network. Alternately, the nodes may be the nodes of a grid, which is composed of nodes in the form of server blades interconnected with other server blades on a rack.
  • Each node in a multi-node database system hosts a database server.
  • a server, such as a database server, is a combination of integrated software components and an allocation of computational resources, such as memory, a node, and processes on the node for executing the integrated software components on a processor, the combination of the software and computational resources being dedicated to performing a particular function on behalf of one or more clients.
  • Resources from multiple nodes in a multi-node database system can be allocated to running a particular database server's software.
  • Each combination of the software and allocation of resources from a node is a server that is referred to herein as a “server instance” or “instance.”
  • a database server may comprise multiple database instances, some or all of which are running on separate computers, including separate server blades.
  • a database dictionary may comprise multiple data structures that store database metadata.
  • a database dictionary may, for example, comprise multiple files and tables. Portions of the data structures may be cached in main memory of a database server.
  • the database dictionary contains metadata that defines properties of the database object.
  • metadata in a database dictionary defining a database table may specify the attribute names and data types of the attributes, and one or more files or portions thereof that store data for the table.
  • Metadata in the database dictionary defining a procedure may specify a name of the procedure, the procedure's arguments and the return data type, and the data types of the arguments, and may include source code and a compiled version thereof.
  • a database object may be defined by the database dictionary, but the metadata in the database dictionary itself may only partly specify the properties of the database object. Other properties may be defined by data structures that may not be considered part of the database dictionary.
  • a user-defined function implemented in a Java class may be defined in part by the database dictionary by specifying the name of the user-defined function and by specifying a reference to a file containing the source code of the Java class (i.e., a .java file) and the compiled version of the class (i.e., a .class file).
  • a database object may have an attribute that is a primary key.
  • a primary key contains primary key values.
  • a primary key value uniquely identifies a record among the records in the database object.
  • a database table may include a column that is a primary key. Each row in the database table holds a primary key value that uniquely identifies the row among the rows in the database table.
  • a database object may have an attribute that is a foreign key of a primary key of another database object.
  • a foreign key of a primary key contains primary key values of the primary key.
  • a foreign key value in the foreign key uniquely identifies a record in the respective database object of the primary key.
  • a foreign key constraint based on a primary key may be defined for a foreign key.
  • a DBMS ensures that any value in the foreign key exists in the primary key.
  • a foreign key constraint need not be defined for a foreign key. Instead, a foreign key relationship may be defined for the foreign key.
  • Applications that populate the foreign key are configured to ensure that foreign key values in the foreign key exist in the respective primary key. An application may maintain a foreign key in this way even when no foreign key relationship is defined for the foreign key.
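  • As a small illustration of an application-maintained foreign key relationship (with no constraint defined in the DBMS), the sketch below checks that a foreign key value exists among the primary key values before inserting a row. The tables and names are invented for the example.
```python
# Minimal sketch (hypothetical schema): an application enforcing a foreign key
# relationship itself, with no foreign key constraint defined in the DBMS.
customers = {101: "Alice", 102: "Bob"}           # primary key value -> row
orders = []                                      # rows with a customer_id foreign key

def insert_order(order_id: int, customer_id: int) -> None:
    if customer_id not in customers:             # application-enforced check
        raise ValueError(f"customer_id {customer_id} not found in customers")
    orders.append({"order_id": order_id, "customer_id": customer_id})

insert_order(1, 101)                             # succeeds
try:
    insert_order(2, 999)                         # fails: no such primary key value
except ValueError as e:
    print(e)
```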
  • the techniques described herein are implemented by one or more special-purpose computing devices.
  • the special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination.
  • Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques.
  • the special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
  • FIG. 15 is a block diagram that illustrates a computer system 1500 upon which an embodiment of the invention may be implemented.
  • Computer system 1500 includes a bus 1502 or other communication mechanism for communicating information, and a hardware processor 1504 coupled with bus 1502 for processing information.
  • Hardware processor 1504 may be, for example, a general-purpose microprocessor.
  • Computer system 1500 also includes a main memory 1506, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 1502 for storing information and instructions to be executed by processor 1504.
  • Main memory 1506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1504. Such instructions, when stored in non-transitory storage media accessible to processor 1504, render computer system 1500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • Computer system 1500 further includes a read only memory (ROM) 1508 or other static storage device coupled to bus 1502 for storing static information and instructions for processor 1504.
  • A storage device 1510, such as a magnetic disk, optical disk, or solid-state drive, is provided and coupled to bus 1502 for storing information and instructions.
  • Computer system 1500 may be coupled via bus 1502 to a display 1512, such as a cathode ray tube (CRT), for displaying information to a computer user.
  • An input device 1514 is coupled to bus 1502 for communicating information and command selections to processor 1504.
  • Another type of user input device is cursor control 1516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1504 and for controlling cursor movement on display 1512.
  • This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
  • Computer system 1500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1500 in response to processor 1504 executing one or more sequences of one or more instructions contained in main memory 1506. Such instructions may be read into main memory 1506 from another storage medium, such as storage device 1510. Execution of the sequences of instructions contained in main memory 1506 causes processor 1504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 1510.
  • Volatile media includes dynamic memory, such as main memory 1506.
  • Storage media includes, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
  • Storage media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between storage media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1502.
  • Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 1504 for execution.
  • the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to computer system 1500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1502.
  • Bus 1502 carries the data to main memory 1506, from which processor 1504 retrieves and executes the instructions.
  • the instructions received by main memory 1506 may optionally be stored on storage device 1510 either before or after execution by processor 1504.
  • Computer system 1500 also includes a communication interface 1518 coupled to bus 1502.
  • Communication interface 1518 provides a two-way data communication coupling to a network link 1520 that is connected to a local network 1522.
  • communication interface 1518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 1518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links may also be implemented.
  • communication interface 1518 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
  • Network link 1520 typically provides data communication through one or more networks to other data devices.
  • network link 1520 may provide a connection through local network 1522 to a host computer 1524 or to data equipment operated by an Internet Service Provider (ISP) 1526.
  • ISP 1526 in turn provides data communication services through the world-wide packet data communication network now commonly referred to as the “Internet” 1528.
  • Internet 1528 uses electrical, electromagnetic, or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link 1520 and through communication interface 1518, which carry the digital data to and from computer system 1500, are example forms of transmission media.
  • Computer system 1500 can send messages and receive data, including program code, through the network(s), network link 1520 and communication interface 1518.
  • a server 1530 might transmit a requested code for an application program through Internet 1528, ISP 1526, local network 1522 and communication interface 1518.
  • the received code may be executed by processor 1504 as it is received, and/or stored in storage device 1510, or other non-volatile storage for later execution.
  • FIG. 16 is a block diagram of a basic software system 1600 that may be employed for controlling the operation of computer system 1500.
  • Software system 1600 and its components, including their connections, relationships, and functions, is meant to be exemplary only, and not meant to limit implementations of the example embodiment(s).
  • Other software systems suitable for implementing the example embodiment(s) may have different components, including components with different connections, relationships, and functions.
  • Software system 1600 is provided for directing the operation of computer system 1500.
  • Software system 1600, which may be stored in system memory (RAM) 1506 and on fixed storage (e.g., hard disk or flash memory) 1510, includes a kernel or operating system (OS) 1610.
  • the OS 1610 manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O.
  • One or more application programs, represented as 1602A, 1602B, 1602C... 1602N, may be “loaded” (e.g., transferred from fixed storage 1510 into memory 1506) for execution by the system 1600.
  • the applications or other software intended for use on computer system 1500 may also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installation from an Internet location (e.g., a Web server, an app store, or other online service).
  • Software system 1600 includes a graphical user interface (GUI) 1615, for receiving user commands and data in a graphical (e.g., “point-and-click” or “touch gesture”) fashion. These inputs, in turn, may be acted upon by the system 1600 in accordance with instructions from operating system 1610 and/or application(s) 1602.
  • the GUI 1615 also serves to display the results of operation from the OS 1610 and application(s) 1602, whereupon the user may supply additional inputs or terminate the session (e.g., log off).
  • OS 1610 can execute directly on the bare hardware 1620 (e.g., processor(s) 1504) of computer system 1500.
  • a hypervisor or virtual machine monitor (VMM) 1630 may be interposed between the bare hardware 1620 and the OS 1610.
  • VMM 1630 acts as a software “cushion” or virtualization layer between the OS 1610 and the bare hardware 1620 of the computer system 1500.
  • VMM 1630 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine comprises a “guest” operating system, such as OS 1610, and one or more applications, such as application(s) 1602, designed to execute on the guest operating system.
  • the VMM 1630 presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.
  • the VMM 1630 may allow a guest operating system to run as if it is running on the bare hardware 1620 of computer system 1500 directly. In these instances, the same version of the guest operating system configured to execute on the bare hardware 1620 directly may also execute on VMM 1630 without modification or reconfiguration. In other words, VMM 1630 may provide full hardware and CPU virtualization to a guest operating system in some instances.
  • a guest operating system may be specially designed or configured to execute on VMM 1630 for efficiency.
  • the guest operating system is “aware” that it executes on a virtual machine monitor.
  • VMM 1630 may provide para-virtualization to a guest operating system in some instances.
  • a computer system process comprises an allotment of hardware processor time, and an allotment of memory (physical and/or virtual), the allotment of memory being for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the hardware processor state (e.g., content of registers) between allotments of the hardware processor time when the computer system process is not running.
  • Computer system processes run under the control of an operating system and may run under the control of other programs being executed on the computer system.
  • cloud computing is generally used herein to describe a computing model which enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and which allows for rapid provisioning and release of resources with minimal management effort or service provider interaction.
  • a cloud computing environment (sometimes referred to as a cloud environment, or a cloud) can be implemented in a variety of different ways to best suit different requirements.
  • in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or to the general public.
  • a private cloud environment is generally intended solely for use by, or within, a single organization.
  • a community cloud is intended to be shared by several organizations within a community; while a hybrid cloud comprises two or more types of cloud (e.g., private, community, or public) that are bound together by data and application portability.
  • a cloud computing model enables some of those responsibilities which previously may have been provided by an organization's own information technology department, to instead be delivered as service layers within a cloud environment, for use by consumers (either within or external to the organization, according to the cloud's public/private nature).
  • the precise definition of components or features provided by or within each cloud service layer can vary, but common examples include: Software as a Service (SaaS), in which consumers use software applications that are running upon a cloud infrastructure, while a SaaS provider manages or controls the underlying cloud infrastructure and applications.
  • Platform as a Service (PaaS), in which the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the run-time execution environment).
  • Infrastructure as a Service (IaaS), in which the IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer).
  • Database as a Service (DBaaS), in which consumers use a Database Management System that is running upon a cloud infrastructure, while the DBaaS provider manages or controls the underlying cloud infrastructure and applications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A consensus protocol-based replication approach is provided. Chunks are grouped into replication units (RUs) to optimize replication efficiency. Chunks can be assigned to operate based on load and replication throughput. Splitting and merging do not interrupt the concurrent user workload or require routing changes. Transactions spanning chunks within an RU do not require distributed transaction processing. Each replication unit has a replication factor (RF), which refers to the number of copies/replicas of the replication unit, and an associated distribution factor (DF), which refers to the number of servers taking over the workload from a failed leader server. RUs can be placed in circles of servers, where the number of servers in a circle equals the replication factor, and workload quiescing can be limited to a circle of servers instead of the entire database.
PCT/US2023/034465 2022-10-12 2023-10-04 Configuration et gestion d'unités de réplication pour réplication de transaction de base de données asynchrone WO2024081140A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263415466P 2022-10-12 2022-10-12
US63/415,466 2022-10-12
US18/372,005 2023-09-22
US18/372,005 US20240126782A1 (en) 2022-10-12 2023-09-22 Configuration and management of replication units for asynchronous database transaction replication

Publications (1)

Publication Number Publication Date
WO2024081140A1 true WO2024081140A1 (fr) 2024-04-18

Family

ID=88695576

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/034465 WO2024081140A1 (fr) 2022-10-12 2023-10-04 Configuration et gestion d'unités de réplication pour réplication de transaction de base de données asynchrone

Country Status (1)

Country Link
WO (1) WO2024081140A1 (fr)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050165858A1 (en) * 2004-01-09 2005-07-28 Tom Richard W. Well-known transactions in data replication
EP1876788A1 (fr) * 2006-07-07 2008-01-09 Alcatel Lucent Mécanisme distribué de hashing pour réseau auto -organisant
US20130290249A1 (en) * 2010-12-23 2013-10-31 Dwight Merriman Large distributed database clustering systems and methods
US20170103092A1 (en) * 2015-10-07 2017-04-13 Oracle International Corporation Ddl processing in sharded databases
US20190163545A1 (en) * 2017-11-30 2019-05-30 Oracle International Corporation Messages with delayed delivery in an in-database sharded queue
US20200034257A1 (en) * 2016-08-05 2020-01-30 Nutanix, Inc. Implementing availability domain aware replication policies
US20220100710A1 (en) * 2020-09-22 2022-03-31 Commvault Systems, Inc. Decommissioning, re-commissioning, and commissioning new metadata nodes in a working distributed data storage system
US20220114164A1 (en) * 2020-10-14 2022-04-14 Oracle International Corporation System and method for high performance multi-statement interactive transactions with snapshot isolation in a scale-out database


Similar Documents

Publication Publication Date Title
US11372890B2 (en) Distributed database transaction protocol
US11874746B2 (en) Transaction commit protocol with recoverable commit identifier
US11704290B2 (en) Methods, devices and systems for maintaining consistency of metadata and data across data centers
CN106991113B (zh) 数据库环境中的表格复制
US10936578B2 (en) Client-driven commit of distributed write transactions in a database environment
US10235440B2 (en) Decentralized transaction commit protocol
US7299378B2 (en) Geographically distributed clusters
US10216588B2 (en) Database system recovery using preliminary and final slave node replay positions
EP1704480B1 (fr) Base de donnees en grappe a miroitage de donnees a distance
US9563522B2 (en) Data recovery for a relational database management system instance in a heterogeneous database system
JP2016524750A (ja) 索引更新パイプライン
Waqas et al. Transaction management techniques and practices in current cloud computing environments: A survey
US20190196918A1 (en) Methods and systems of operating a database management system dmbs in a strong consistency mode
WO2024081139A1 (fr) Protocole de consensus pour réplication de transaction de base de données asynchrone avec basculement rapide, automatique, perte de données nulle, forte cohérence, prise en charge de langage sql complète et extensibilité horizontale
US11176115B2 (en) Dependency locking
US20240126782A1 (en) Configuration and management of replication units for asynchronous database transaction replication
WO2024081140A1 (fr) Configuration et gestion d'unités de réplication pour réplication de transaction de base de données asynchrone
Sapate et al. Survey on comparative analysis of database replication techniques
Agrawal et al. Transactions on Co-located Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23801096

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)