CN114207600A - Distributed cross-regional database transaction processing - Google Patents


Info

Publication number: CN114207600A
Application number: CN201980099051.5A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: HLC, database, transaction, timestamp, commit
Legal status: Pending
Inventors: 蔡乐, 贾鑫, 杜贵彬, 赵殿奎, 智雅楠
Assignee: Alibaba Group Holding Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries


Abstract

The application discloses cross-regional database transaction processing, comprising: determining that a transaction containing one or more sets of statements is to be executed on a plurality of database servers spanning at least two regions, wherein each region is associated with a respective hybrid logical clock (HLC)-based centralized time service; causing the one or more sets of statements to execute on the plurality of database servers across the at least two regions; obtaining a plurality of HLC-based preparation timestamps from the plurality of database servers across the at least two regions; and selecting the largest HLC-based preparation timestamp as the commit timestamp associated with the transaction.

Description

Distributed cross-regional database transaction processing
Background
A distributed database addresses the scalability limitations of a standalone database, allowing the computation and storage capacity of the database system to grow flexibly without being limited by a single server. In a distributed approach, the databases may be distributed to different physical locations to ensure high availability. If there is only one global clock, its physical location may be close to some database entities but far from others. For the more distant database entities, the network delay associated with acquiring the clock would be very high and would adversely affect their throughput. If multiple clocks are located at different physical locations, they may need to be synchronized periodically to suppress the negative effects of the clocks naturally drifting away from each other. Despite the existence of clock synchronization protocols, there is still a non-zero clock skew, which may cause the timestamps issued by the database for transaction commits to be inconsistent with the order in which the transactions actually committed in absolute time. A discrepancy between the transaction commit order indicated by the timestamps issued by the database and the transaction commit order in absolute time may cause inconsistent reads of database data, which may ultimately cause problems for applications performing such reads.
Drawings
Various embodiments of the invention are disclosed in the following detailed description and drawings.
FIG. 1 is a schematic diagram of a distributed database scheme.
FIG. 2 is a diagram of an embodiment of a system for conducting transactions in a database distributed over different regions.
FIG. 3 shows a diagram of an embodiment of an HLC.
FIG. 4 shows a diagram illustrating an example of a database server.
FIG. 5 illustrates a flow diagram for an embodiment of a process for processing transactions in a database distributed across regions.
FIG. 6 shows a flow diagram of an example of a process for determining whether a transaction is to be executed across multiple regions.
FIG. 7 illustrates a flow diagram of an example of a process for executing a transaction on one or more databases within a region.
FIG. 8 illustrates a flow diagram of an example of a process for executing transactions in one or more databases across multiple regions.
FIG. 9 illustrates a flow diagram of an example of a process for updating a local HLC based on updates received from non-local sources.
FIG. 10 illustrates a sequence diagram of an example process for performing transactions in a distributed database that spans multiple regions.
Detailed Description
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this detailed description, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term "processor" refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
The following provides a detailed description of one or more embodiments of the invention, along with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with these embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
FIG. 1 is a schematic diagram of a distributed database scheme. As shown in the example of FIG. 1, database entities (e.g., DB1, DB2, DB3, and DB4), which are implemented on respective database servers, are distributed over two regions, region 1 (DC1) and region 2 (DC2). In various embodiments, a "region" refers to a geographic area. In some embodiments, a region in which database servers are deployed is referred to as a "data center." The network delay between regions is typically in excess of 100 milliseconds (ms). As shown in the example of FIG. 1, in a distributed database system, each region may have a logical subgroup of one or more database servers. Such a logical subgroup of database servers is referred to as a "sub-cluster." In FIG. 1, DC1 includes sub-cluster 1 (SC1), which includes two database servers, DB1 and DB2. DC2 includes sub-cluster 2 (SC2), which includes two database servers, DB3 and DB4. The database servers (including at least DB1, DB2, DB3, and DB4) are deployed on different servers but are presented to the user with the characteristics of a single database. For example, transactions that read and/or update data on the distributed database may be performed on one or more database servers, which may be located in one or more regions. A transaction is a mechanism used by the database to ensure the atomicity and isolation of a group of operations.
From a high availability perspective, the distributed database may be extended to multiple sub-clusters, thereby ensuring high availability of the database in the event of a sub-cluster failure, providing local data access for applications, and enhancing the user experience. Extending to multiple sub-clusters also presents challenges to distributed databases, such as providing the global event ordering required for transaction and concurrency management. In a single database server scenario, the ordering is provided simply by the local physical clock of the server; when a sub-cluster is deployed, one database server may be selected to provide a global physical clock service for all database entities in the sub-cluster. When database servers are deployed across multiple regions, however, the cross-region network latency is very high, so the cost of acquiring time from a centralized global clock is very high, which is especially damaging to high-throughput, low-latency transaction workloads.
As an alternative to using a single global physical clock for database entities distributed across multiple regions, a separate physical clock may be implemented in each region. Natural deviations may exist between physical clock systems, and these deviations may increase over time. To keep the physical clock deviation (also sometimes referred to as "clock skew") between different clocks to a minimum, GPS, atomic clocks, and other protocols (such as the Network Time Protocol, "NTP") are used to synchronize the time of the different physical clock entities across all regions. However, because the physical clock skew between the different clocks remains non-zero, the skew problem still needs to be addressed when different transactions executed on the distributed database are ordered in time.
Embodiments of processing transactions in a database distributed across regions are described herein. A transaction comprising one or more sets of statements is determined to be executed on a plurality of database servers spanning at least two regions. In various embodiments, each region includes at least one corresponding centralized time service (sometimes referred to as a "CTS") that implements a hybrid logical clock (sometimes referred to as an "HLC") for the database servers within the region. In various embodiments, the transaction is received at a database server configured to act as a coordinator to process the transaction. The one or more sets of statements are executed on a plurality of database servers spanning at least two regions. The coordinator database server is located in one region and therefore relies on the HLC local to that region to determine the time, as do the other database servers located in the same region. A two-phase commit includes obtaining a plurality of HLC-based preparation timestamps from the plurality of database servers spanning the plurality of regions. The two-phase commit also includes selecting the largest HLC-based preparation timestamp as the commit timestamp associated with the transaction. The two-phase commit further includes causing the plurality of database servers in the plurality of regions to commit the execution results associated with the one or more statement sets using the commit timestamp. A commit result corresponding to the transaction is then returned.
According to various embodiments described in further detail below, a centralized time service clock is used for transactions that can be processed within a single region's sub-cluster, while the hybrid logical clock protocol is used to serialize across the centralized time service clocks for transactions processed across sub-clusters in different regions. Using these techniques, transactions processed within a single region support external consistency based on the use of a single, centralized time service (hybrid logical clock) within each region. For transactions that execute in the same session, external consistency is also satisfied because transactions in the same session are executed serially, so the commit timestamps recorded by the database follow the order in which those transactions were issued. Furthermore, these techniques ensure that the commit timestamps and the observable order of results for transactions performed across multiple regions are consistent. In other words, while the transaction commit timestamps recorded by the database may not reflect the transactions' commit times in absolute time, they reflect a consistent order that is always observed, so the commit times in absolute time are no longer critical. Furthermore, transaction performance within a single region is not affected by the cross-region network latency and clock skew that apply to cross-region transactions (because database servers located in a single region rely on the same centralized time service). In various embodiments, a "cross-region" transaction comprises a transaction having one or more statements that execute on at least a database server located within a first region and a database server located within a second region. In some embodiments, a "cross-region" transaction includes a transaction in which at least one statement executes on a database server located in a region different from the region in which the coordinator database server is located. Thus, embodiments described herein allow a database to be distributed across regions while still supporting transactions with high throughput and low latency within a single region.
FIG. 2 is a diagram of an embodiment of a system for conducting transactions in a database distributed over different regions.
System 200 includes a plurality of regions, including region 1 (DC1) and region 2 (DC2). Region 1 and region 2 are located in two different geographical areas. Region 1 includes database servers DB1 and DB2 in logical grouping sub-cluster 1 (SC1), and region 2 includes database servers DB3 and DB4 in logical grouping sub-cluster 2 (SC2). From the user's perspective, at least database servers DB1, DB2, DB3, and DB4 implement a distributed database. Each of sub-cluster 1 and sub-cluster 2 includes a corresponding centralized time service (CTS) that generates hybrid logical clock 1 (HLC1) and hybrid logical clock 2 (HLC2), respectively. In various embodiments, the HLC generated by a CTS is a combination of a physical clock that is incremented over time and a logical clock that is incremented when an event occurs. When the physical clock portion of the HLC increases, the logical clock portion of the HLC resets to zero. FIG. 3, below, illustrates one embodiment of an HLC.
Returning to FIG. 2, database servers DB1 and DB2 of sub-cluster 1 obtain time from HLC1, which serves as the CTS for sub-cluster 1, while database servers DB3 and DB4 obtain time from HLC2, which serves as the CTS for sub-cluster 2. Given that HLC1 is located in the same region as DB1 and DB2, DB1 and DB2 can obtain the current time from HLC1 with minimal delay. Similarly, DB3 and DB4 can obtain the current time from HLC2 with minimal delay, given that HLC2 is located in the same region as DB3 and DB4. At least the times held by HLC1 and HLC2, as well as the individual HLCs of other sub-clusters in region 1, region 2, or other regions, are synchronized by a clock synchronization protocol (e.g., NTP) to reduce the clock skew between these clocks. However, a clock skew up to some maximum (e.g., 10 milliseconds) may still exist between individual HLCs. HLC1 provides an increasing sequence of timestamps for DB1 and DB2, while HLC2 provides an increasing sequence of timestamps for DB3 and DB4.
In some embodiments, the network delay between different regions far exceeds the maximum clock skew between individual HLCs located in different regions. For example, the maximum clock skew between individual HLCs is 10 milliseconds, while the maximum network delay between different regions is 100 milliseconds due to geographical distance. In various embodiments, "network delay between different regions" refers to the length of time required for data to be transmitted from one region to another. For example, a message transmitted from one database server in region 1 to another database server in region 2 over network 206 would incur the greatest network latency (e.g., 100 milliseconds). The network 206 may be implemented using, for example, a high-speed data and/or telecommunications network.
Transactions for databases distributed across at least zone 1 and zone 2 may originate from an application 204 executing on the client device 202. In embodiments, a "transaction" comprises an atomic operation that includes one or more statements to be executed across one or more database servers in a distributed database. In some embodiments, each database server on which a transaction is to be performed (at least in part) is sometimes referred to as a "participating database server". For example, a "statement" that is part of a transaction contains an operation or command to be applied to a database. For example, a statement may include a read operation, a write operation for new data, a delete operation for existing data stored in a database, and/or an update operation for existing data in a database. These statements may be, for example, Structured Query Language (SQL) statements/commands.
In various embodiments, application 204 is configured to create a session and send one or more transactions in that session to the database distributed across region 1 and region 2. Transactions issued by a requestor (e.g., application 204) to the distributed database are received and processed at a transaction coordinator. In various embodiments, the transaction coordinator is a participating database server in the distributed database. The set of statements contained in the transaction is determined (e.g., by the coordinator) to be executed by database servers located in a single region or across multiple regions. Whether the transaction contains a set of statements to be executed by database servers located in one region or across multiple regions may be determined, for example, from hint information sent by the (e.g., SQL-based) application 204 or by the two-phase commit protocol used by the coordinator. In some embodiments, the hint information provided by the application may specify whether a transaction or session operates on data in only one sub-cluster or one region. In some embodiments, the coordinator is selected according to the two-phase commit protocol used by the distributed database.
If the statement set of a transaction is to be executed on database servers located in a single region, the statement set is executed on one or more database servers within that region. In various embodiments, the first phase of a two-phase locking protocol (sometimes referred to as "2PL") is used to lock the data affected by executing the statement set. After a statement is executed, any database data changes caused by the statement's execution are temporary until a commit operation makes the changes permanent. The second phase of the two-phase locking protocol releases the locks upon determining that the transaction has been committed to the database. When a transaction's statement set affects only one region, the commit timestamp of the entire transaction can be obtained, once the statement(s) have completed, from the HLC of the sub-cluster in which the coordinator database server resides. For example, if the coordinator database server is DB1 of system 200, then the commit timestamp of the transaction is obtained from HLC1 in sub-cluster 1 of region 1. If the statement set is executed on only a single database server within a single region, then the commit need not be performed in two phases. The commit timestamp is sent by the coordinator database server to the affected database server and is recorded by the database server that has executed the statement(s) of the transaction. A log is written and the locks are released according to the second phase of the two-phase locking protocol to complete the commit of the transaction. After the transaction commits, any changes to the database data due to the executed statements become permanent and can be seen by other transactions/observers. However, if the statement set is executed on multiple database servers within the region, then a two-phase commit protocol (sometimes referred to as "2PC") is used to perform the transaction commit. When the statement set is executed by a plurality of database servers within a region, the coordinator database server performs the first phase of the two-phase commit protocol by sending a prepare command to each participating database server; then, upon receiving a prepare response from each participating database server, the coordinator database server performs the second phase of the two-phase commit by sending a commit command and a commit timestamp to each participating database server. Executing the prepare command means that the database flushes the modification log records of the transaction to a persistent location, such as a disk. This ensures that if the database crashes, the transaction's modifications are not lost. Executing the commit command means that the database completes the transaction after having executed the prepare command, including setting the state of the transaction to committed and releasing the locks. Each participating database server then records the commit timestamp, writes its log, and releases the locks according to the second phase of two-phase locking. Subsequently, the commit timestamp can be used when data is read from the database. For example, a read operation includes a read timestamp that is compared against the commit timestamps of the requested data to determine which version of the requested data should be read (e.g., the version of the requested data whose commit timestamp is the latest one that is still earlier than the read timestamp).
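The version-visibility rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Version structure and pick_visible_version function are hypothetical names, and HLC-based timestamps are assumed to be encoded as totally ordered integers.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Version:
    value: bytes
    commit_ts: int  # HLC-based commit timestamp recorded when this version committed

def pick_visible_version(versions: List[Version], read_ts: int) -> Optional[Version]:
    # Return the latest committed version whose commit timestamp is earlier than the read timestamp.
    visible = [v for v in versions if v.commit_ts < read_ts]
    return max(visible, key=lambda v: v.commit_ts) if visible else None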
After the database server(s) record the commit timestamp and release their locks, the transaction is considered successful, and a successful commit result (including the commit timestamp) is sent back to the initiator of the transaction, application 204. Two-phase locking achieves serializability and external consistency for transactions that are processed on database servers within only one region.
If the statement set of a transaction is to be executed on database servers across multiple regions, the statement set is executed on one or more database servers spanning those regions. In various embodiments, the first phase of the two-phase locking protocol is used to lock the data affected by executing the statement set. The first phase of the two-phase commit protocol is then performed by the coordinator database server, which sends a prepare command to each participating database server across the multiple regions. Each participating database server completes the prepare command and sends back to the coordinator database server a prepare timestamp obtained from the local HLC of the sub-cluster to which that participating database server belongs. The coordinator logic of the two-phase commit protocol executed by the coordinator selects the maximum/latest preparation timestamp received from the participating database servers and uses the selected preparation timestamp as the commit timestamp for the transaction. For example, if the coordinator database server is DB1 of system 200 and the participating database servers of the transaction are DB2 and DB4, then DB2 returns a preparation timestamp obtained from HLC1 to DB1, and DB4 returns a preparation timestamp obtained from HLC2 to DB1. The coordinator database server then sends a commit command with the commit timestamp to each participating database server. In response, each participating database server records the commit timestamp for the affected data, writes its log, releases its locks, completes the commit, and sends a commit return message to the coordinator database server. Once the coordinator database server receives commit return messages from all participating database servers, the coordinator database server returns the commit result of the transaction to application 204. By using the maximum/latest prepare timestamp as the commit timestamp of the transaction, the two-phase commit provides consistency across multiple regions for a cross-region transaction: the commit timestamp recorded by each participating database server is the same, even though the participating database servers are located in different regions. If the maximum prepare timestamp across all participating database servers were not selected, then a transaction occurring after this transaction completes the two-phase commit protocol might obtain a larger commit timestamp and yet be unable to read the data of the completed transaction.
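The coordinator-side flow just described can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the participant interface with prepare/commit/rollback methods is a hypothetical stand-in for the messages exchanged with participating database servers, and HLC-based timestamps are assumed to be totally ordered integers.

class TransactionAborted(Exception):
    pass

def coordinate_cross_region_commit(participants):
    # Two-phase commit across regions: the commit timestamp is the maximum
    # HLC-based prepare timestamp returned by any participant.
    prepare_timestamps = []
    for p in participants:                      # phase 1: prepare
        ok, prepare_ts = p.prepare()            # participant flushes its log, returns local HLC time
        if not ok:
            for q in participants:              # any failed prepare rolls the whole transaction back
                q.rollback()
            raise TransactionAborted("a participant failed to prepare")
        prepare_timestamps.append(prepare_ts)

    commit_ts = max(prepare_timestamps)         # phase 2: commit with the maximum prepare timestamp
    for p in participants:
        p.commit(commit_ts)                     # every participant records the same commit timestamp
    return commit_ts

Selecting the maximum prepare timestamp is what keeps a later transaction from obtaining a commit or read timestamp larger than this transaction's commit timestamp while still failing to see its effects on some participant.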
The HLC ensures that the commit times of transactions that execute within the same region and have causal relationships (e.g., one transaction causes another transaction to occur) are ordered. However, for transactions that occur in different regions and have no causal relationship, their order in absolute time may differ from the order of their commit timestamps; but the true commit order (in absolute time) of these transactions is not observable due to the network delays across the regions. In this case, the commit timestamps of these transactions can be treated as their commit order, as long as the maximum clock skew between the HLCs of different regions is much smaller than the network delay between the different regions.
The external consistency of cross-region transactions without causal relationships is further explained here. The causal tracking of the HLC ensures that transactions in the same database session are externally consistent, because the logical clock portion of the HLC is continually incremented in response to transactions. Transactions are executed serially in the same session, so maintaining the HLC ensures that a larger commit timestamp is necessarily used for a subsequently occurring transaction. The HLC further ensures that if the result of one transaction is read by another transaction, then the commit timestamp of the latter transaction will be greater than the commit timestamp of the former transaction.
However, there are no such relationships between transactions that occur between multiple/different database sessions. In the general case, when two sessions are executed simultaneously, the application initiating the session does not care about (does not depend on) the order of occurrence of the transactions that the two sessions execute separately. In a more specific case, when an application initiates and manages two database sessions and uses the results returned by one session as a basis for deciding to execute another session, the application actually maintains one application-level session. In this case, there is a dependency between the transaction execution results of the two sessions, and if the results of these transactions seen by the two sessions in absolute time do not coincide with the sequence of commit timestamps of these transactions, the application may generate erroneous decisions and results.
In various embodiments, only transactions that occur in different regions at virtually the same time (within the maximum clock skew) and without causal relationships may cause two sessions of an application to see inconsistent results. One example of this scenario is described as follows:
Assume three regions, DC1, DC2, and DC3. Transaction Q1 occurs in DC1, transaction Q2 occurs in DC2, and application APP3 initiates two sessions, R1 and R2, in DC3. The commit timestamps (recorded by the database) for Q1 and Q2 are t1 and t2, respectively, and the absolute commit times of Q1 and Q2 are tabs1 and tabs2, respectively. Assume that they satisfy the following relationships:
t1 > t2 (1)
tabs1 < tabs2 (2)
In other words, the commit order of Q1 and Q2 in absolute time and the relative values of their commit timestamps recorded by the database are inconsistent. This inconsistency is due to clock skew between the respective time services of DC1 and DC2, where Q1 and Q2 were executed. From the commit time relationships described above, the following relationship holds:
tabs1 + maximum clock skew > tabs2 (3)
Assume that transaction R1 and transaction R2 access DC1 and DC2 at access times tabs11 and tabs22, respectively, and that the access times satisfy the following relationship:
tabs2 > tabs22 > tabs11 > tabs1 (4)
In other words, R1 accesses DC1 after the result of Q1 has been committed to the database, while R2 accesses DC2 before the result of Q2 has been committed to the database. Thus, R1 can see the result of Q1, while R2 cannot see the result of Q2. If the application relies on the commit result read by R1 to determine the operation of R2, the application sees inconsistent results: according to the commit timestamps in the database, the result of Q2 (which has the smaller commit timestamp) should be visible whenever the result of Q1 is visible. In other words, given the causal relationship between R1 and R2, and given that R2 is issued after R1 completes, the operational result of R2 may differ depending on whether R2 is issued, in absolute time, before or after the result of Q2 is committed to the database. If the result of Q2 commits before R2 is issued, then executing R2 produces one operational result; however, if the result of Q2 has not been committed before R2 is issued (as described in the scenario above), then executing R2 may produce a different operational result.
In fact, relation (4), tabs2 > tabs22 > tabs11 > tabs1, cannot actually hold, because the network delay across regions (e.g., 100 milliseconds) is much greater than the clock skew between HLCs of different regions (e.g., 10 milliseconds). In other words:
tabs22 > tabs11 + maximum clock skew > tabs1 + maximum clock skew (5)
Since relation (3) indicates tabs1 + maximum clock skew > tabs2, relation (5) can be extended as:
tabs22 > tabs11 + maximum clock skew > tabs1 + maximum clock skew > tabs2 (6)
Since tabs22 > tabs1 + maximum clock skew > tabs2, and in accordance with the above analysis, as long as the network delay across regions is much greater than the clock skew between HLCs located in different regions, the external consistency of cross-region transactions can be guaranteed. In other words, given that the network delay across regions is much larger than the clock skew between HLCs located in different regions, the commit timestamps recorded by the database reflect the externally visible/observable sequence of results (the actual commit sequence in absolute time is no longer important, as no observer can see it).
In various embodiments, for transactions that execute in only one region, the logical clock portion of the region's HLC is incremented only when a transaction commits, in order to distinguish the commit order of different transactions. For transactions that execute across multiple regions, the logical clock portion of the HLC increases not only when a transaction commits; in some embodiments, when a database server receives operations from other regions (e.g., read operations), its local HLC may also need to be updated. If the HLC-based timestamp of an operation (e.g., associated with a statement) received from another region is greater than the local HLC of the region in which the operation or its result is received, the local HLC is advanced accordingly. For example, an operation from another region may include a read operation or a commit command. In other words, when the coordinator database server receives an operation from a non-local region, the operation carries a timestamp based on the non-local HLC, and if that HLC-based timestamp is greater than the local HLC, the local HLC needs to be updated. For example, updating the HLC may mean advancing the logical clock portion of the HLC to match the accepted timestamp that originates from another HLC corresponding to a different region. In another example, updating the HLC may mean increasing the logical clock portion of the HLC by a predetermined amount (e.g., 1). In various embodiments, a maximum HLC is maintained for each session to ensure consistency of the transactions within the session. For example, if a session-level HLC is received at a coordinator database server and the session-level HLC is greater than the HLC local to the coordinator database server, then the HLC local to the coordinator database server is updated accordingly.
As indicated above, various embodiments enable a database to be extended across multiple regions to ensure high availability and access to recent data while also reducing latency for (e.g., most) transactions that are performed only within a single region, such that the database is not affected by high latency between regions while ensuring strong global consistency.
FIG. 3 is a diagram showing an embodiment of an HLC. Both HLC1 and HLC2 of FIG. 2 may be implemented using the example of FIG. 3. HLC 300 includes two parts, a physical clock 302 and a logical clock 304. As described above, physical clock 302 increases as time passes, and logical clock 304 increases with events that occur within the same minimum time unit (e.g., microsecond) of physical clock 302. In some embodiments, HLC 300 is implemented using 64 bits, with 48 bits allocated to the physical clock 302 and the remaining 16 bits allocated to the logical clock 304. In some embodiments, physical clock 302 represents a physical clock at microsecond granularity. In some embodiments, the logical clock 304 may be considered a counter that increments in response to events occurring at the database and resets to 0 each time the physical clock 302 advances by its minimum unit of time (e.g., one microsecond). In some embodiments, a clock synchronization protocol synchronizes the physical clock 302 of HLC 300 with the physical clocks of one or more other HLCs. The logical clock 304 helps capture the ordering of events and causal relationships in the distributed database.
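As a rough illustration of the 48/16-bit layout described above, the following sketch packs a physical time and a logical counter into a single 64-bit integer and advances the value on a local event; the function names and the microsecond granularity of the physical part are illustrative assumptions rather than details mandated by the description.

import time

PHYSICAL_BITS = 48
LOGICAL_BITS = 16
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1

def pack_hlc(physical_us: int, logical: int) -> int:
    # 48-bit physical part (microseconds) in the high bits, 16-bit logical counter in the low bits
    return (physical_us << LOGICAL_BITS) | (logical & LOGICAL_MASK)

def unpack_hlc(hlc: int):
    return hlc >> LOGICAL_BITS, hlc & LOGICAL_MASK

def tick(current_hlc: int) -> int:
    # Advance the HLC for a local event (e.g., a transaction commit).
    physical, logical = unpack_hlc(current_hlc)
    now_us = int(time.time() * 1_000_000) & ((1 << PHYSICAL_BITS) - 1)
    if now_us > physical:
        return pack_hlc(now_us, 0)               # physical part advanced: logical counter resets to zero
    return pack_hlc(physical, logical + 1)       # same microsecond: increment the logical counter

Because the physical part occupies the high-order bits, comparing two packed HLC values as plain integers orders them first by physical time and then by logical counter.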
FIG. 4 shows a diagram of an example of a database server. In certain embodiments, database servers DB1, DB2, DB3, and DB4 are each implemented using the example of FIG. 4. Database server 400 includes a coordinator engine 402, a statement execution engine 404, a local centralized time service update engine 406, and a database entry store 408. The coordinator engine 402, statement execution engine 404, and local centralized time service update engine 406 may each be implemented using hardware and/or software. Database entry store 408 may be implemented using one or more types of storage media.
When database server 400 is selected (e.g., via a two-phase commit protocol) as the coordinator for a particular transaction, coordinator engine 402 is configured to perform functions related to processing the transaction received from the application. The transaction is a collection of one or more statements (e.g., SQL commands). The coordinator engine 402 is configured to determine whether the transaction (all of its statements) is to be executed within one region (specifically, on database servers in the region's sub-cluster) or across multiple regions (specifically, on database servers in the sub-clusters of the multiple regions). For example, whether a transaction is to be executed within one region or across multiple regions may be determined from hint information provided by the (e.g., SQL-based) application.
In the case where the transaction is to be executed within a single region, coordinator engine 402 is configured to cause all statements of the transaction to execute on one or more participating database servers in that region and to obtain the transaction's timestamps from the centralized time service (HLC) associated with that region. If the statements execute on multiple database servers, the coordinator engine 402 uses a two-phase commit protocol to verify that all participating database servers are ready to commit their portions of the execution in their respective databases. Because all of the participating database servers executing the statements are located within the same region and therefore share the same HLC, there is no need to account for clock skew or the network delays of sending information across regions. Thus, after the participating database servers record the commit timestamp for the portions of the database affected by the transaction's statements, coordinator engine 402 may be configured to return a successful commit result associated with the transaction to the application.
Where the transaction is to be executed across multiple regions, the coordinator engine 402 is configured to cause all statements of the transaction to execute on the participating database servers across the multiple regions. The coordinator engine 402 is also configured to receive a timestamp from each participating database server at the end of each statement's execution. Each such completion timestamp is determined based on the HLC local to the sub-cluster of the region in which the participating database server resides. Because multiple participating database servers are involved in executing the statements of the transaction, coordinator engine 402 is configured to use a two-phase commit protocol and to send a prepare command to each participating database server across the multiple regions. The coordinator engine 402 is then configured to receive from each participating database server a prepare response indicating whether it is ready to commit, together with a prepare timestamp. Each preparation timestamp received from a database server is obtained from the local HLC of the sub-cluster of the region in which that database server is located. If the coordinator engine 402 receives successful prepare responses from all participating database servers, the coordinator engine 402 is configured to select the maximum preparation timestamp associated with any of the prepare responses as the commit timestamp for the transaction. Thereafter, coordinator engine 402 sends the transaction commit timestamp to each participating database server, which records the commit timestamp for its locally affected data. Once the coordinator engine 402 receives a commit acknowledgement from each participating database server, the coordinator engine 402 is configured to return a successful commit result to the application.
Statement execution engine 404 is configured to execute statements that affect data stored in database entry store 408. Statement execution engine 404 is configured to receive statements containing commands to manage database data. Statement execution engine 404 is configured to receive statements from the database server that has been selected as the coordinator for processing a transaction. Statement execution engine 404 is configured to execute each received statement on the relevant portion (e.g., data entries) of database entry store 408. In some embodiments, based on the local HLC of the sub-cluster of the region to which database server 400 belongs, statement execution engine 404 records timestamps corresponding to the beginning and/or end of statement execution. After a statement is successfully executed, statement execution engine 404 sends the recorded execution start and end timestamps to the coordinator database server, which may or may not be located in the same region as database server 400.
For a transaction whose statements are not executed solely on database server 400, statement execution engine 404 is configured to receive a prepare command from the coordinator database server as the first phase of the two-phase commit protocol. In response to the prepare command, statement execution engine 404 is configured to perform one or more operations associated with the prepare command. In response to the prepare command, the statement execution engine 404 is further configured to send back to the coordinator database server a prepare response with a prepare timestamp obtained from the local HLC of the sub-cluster of the region to which database server 400 belongs. In the second phase of the two-phase commit of a transaction whose statements are not executed solely on database server 400, statement execution engine 404 is configured to receive a commit command from the coordinator database server, where the commit command includes a commit timestamp for the transaction. In response to the commit command, statement execution engine 404 is configured to perform one or more operations associated with the commit command and to record the transaction commit timestamp. In response to the commit command, the statement execution engine 404 is further configured to send a commit response back to the coordinator database server.
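The participant-side handling of the prepare and commit commands described above can be sketched roughly as follows; the transaction object's log and lock methods are hypothetical names, and the local HLC is assumed to expose a now() method returning an integer timestamp as in the packing sketch above.

class ParticipantServer:
    # Illustrative participant-side handling of two-phase commit messages.
    def __init__(self, local_hlc):
        self.local_hlc = local_hlc

    def on_prepare(self, txn):
        txn.flush_modification_log()             # persist the transaction's modification log so a crash loses nothing
        prepare_ts = self.local_hlc.now()        # prepare timestamp taken from the region's local HLC
        return True, prepare_ts

    def on_commit(self, txn, commit_ts):
        txn.record_commit_timestamp(commit_ts)   # record the coordinator-chosen commit timestamp for the affected data
        txn.write_commit_log()
        txn.release_locks()                      # second phase of two-phase locking
        return "committed"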
Local centralized time service update engine 406 is configured to update, in response to certain events, the local HLC of the sub-cluster of the region to which database server 400 belongs. In various embodiments, local centralized time service update engine 406 is configured to determine, for certain events, whether to update the local HLC of the sub-cluster of the region to which database server 400 belongs. In some embodiments, when database server 400 receives the result of executing a statement that includes a read operation performed in another region, where the result includes a read timestamp obtained from the HLC of that other region, local centralized time service update engine 406 is configured to compare the read timestamp from the non-local region with the current local HLC (the local HLC of the sub-cluster of the region to which database server 400 belongs), and if the read timestamp from the non-local region is greater than the current local HLC, local centralized time service update engine 406 is configured to update the local HLC (e.g., to match the read timestamp obtained from the HLC of the non-local region). In some embodiments, when database server 400 receives a commit command with a commit timestamp from a coordinator database server located in a non-local region, the local centralized time service update engine 406 is configured to compare the commit timestamp from the non-local region with the current local HLC, and if the commit timestamp from the non-local region is greater than the current local HLC, the local centralized time service update engine 406 is configured to update the local HLC (e.g., to match the commit timestamp obtained from the HLC of the non-local region). In some embodiments, when database server 400 is selected as the coordinator database server and receives a transaction to be processed for a particular session, the transaction is received together with a session-level HLC timestamp. In some embodiments, the session-level HLC timestamp is equal to the commit timestamp or rollback timestamp of the most recently executed transaction in the session. The local centralized time service update engine 406 is configured to compare the session-level timestamp from the non-local region with the current local HLC, and if the session-level timestamp from the non-local region is greater than the current local HLC, the local centralized time service update engine 406 is configured to update the local HLC (e.g., to match the session-level HLC timestamp).
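All three of these cases apply the same rule, sketched below under the assumption that HLC values are totally ordered integers; the function name is hypothetical.

def observe_remote_timestamp(local_hlc: int, remote_ts: int) -> int:
    # Update rule for (1) read results from another region, (2) commit commands from a
    # non-local coordinator, and (3) session-level HLC timestamps: if the incoming
    # HLC-based timestamp is ahead of the local HLC, advance the local HLC (here, by
    # matching it); otherwise leave the local HLC unchanged.
    return remote_ts if remote_ts > local_hlc else local_hlc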
FIG. 5 illustrates a flow diagram of an embodiment of a process for processing transactions in a database distributed across regions. In some embodiments, process 500 may be implemented on any of database servers DB1, DB2, DB3, and DB4 of system 200 in FIG. 2 that is selected as the coordinator database server for a transaction.
At 502, it is determined that a transaction containing one or more sets of statements is to be executed on a plurality of database servers spanning at least two regions, wherein each region is associated with a respective hybrid logical clock (HLC)-based centralized time service. For example, hint information from the (e.g., SQL-based) application that generated the transaction is used to determine whether the transaction is a cross-region transaction. Each region includes a sub-cluster whose centralized time service uses the HLC protocol.
At 504, the one or more sets of statements are caused to execute on the plurality of database servers across the at least two regions. At least one statement is executed on a database server located in a first region and at least one statement is executed on a database server located in a second region.
At 506, a plurality of HLC-based preparation timestamps are obtained from a plurality of database servers across at least two regions.
At 508, the largest HLC-based preparation timestamp is selected as the commit timestamp associated with the transaction. A version of the two-phase commit protocol is used to commit the transaction. Because the statements are executed by at least two database servers located in at least two different regions, the preparation timestamps returned by the participating database servers are obtained from the local HLCs of the sub-clusters of their respective regions. The maximum prepare timestamp returned in response to the prepare command is selected as the commit timestamp for the transaction. Each participating database server commits the transaction using the commit timestamp received from the coordinator. After the database servers perform the commit, a commit result that includes the commit timestamp is returned (e.g., to the source of the transaction).
FIG. 6 shows a flow diagram of an example of a process for determining whether a transaction is to be executed across multiple regions. In certain embodiments, process 600 may be implemented on any of database servers DB1, DB2, DB3, and DB4 of system 200 in FIG. 2 that is selected as the coordinator database server for a transaction.
At 602, a transaction is received that includes a set of statements. For example, the transaction is received from an application as part of a particular session.
At 604, it is determined whether the statement set is to be executed across multiple regions. If the statement set is to be executed within a single region, control is transferred to 606. Otherwise, where the statement set is to be executed across multiple regions, control is transferred to 608. In some embodiments, whether a statement set is to be executed within a single region or across multiple regions is determined from application-provided (e.g., SQL-based) hint information. Whether a transaction is cross-region may also be determined according to any other suitable technique. A rough sketch of this decision appears after step 608 below.
At 606, the transaction is executed on one or more database servers within a single region.
At 608, the transaction is executed on one or more database servers across multiple regions.
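A rough sketch of the decision at 604 follows; it is illustrative only. The hint format is hypothetical (the description only states that application-provided hint information, among other techniques, can indicate whether a transaction spans regions), and absent a hint the sketch falls back to inspecting the regions of the participating database servers.

def choose_commit_path(participant_regions, hint=None):
    # Decide whether to take the single-region path (606) or the cross-region path (608).
    if hint is not None and "cross_region" in hint:
        return "cross_region" if hint["cross_region"] else "single_region"
    return "cross_region" if len(set(participant_regions)) > 1 else "single_region"

For example, choose_commit_path(["DC1", "DC2"]) returns "cross_region", while choose_commit_path(["DC1", "DC1"]) returns "single_region".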
FIG. 7 shows a flow diagram of an example of a process for executing a transaction on one or more databases within a single region. In certain embodiments, process 700 may be implemented on any of database servers DB1, DB2, DB3, and DB4 of system 200 in FIG. 2 that is selected as the coordinator database server for the transaction. In some embodiments, step 606 of process 600 of FIG. 6 may be implemented, at least in part, using process 700.
At 702, it is determined that a transaction containing a set of statements is to be executed on one or more database servers within a single region.
At 704, the set of statements executes on one or more database servers. After these statements execute on their respective participating database servers, any changes to the database data are temporary, and these changes are not visible to other transactions until the transaction in process is committed.
At 706, data locks affected by the statement set are obtained by one or more database servers. Using a two-phase locking protocol, data that is temporarily changed by the statement(s) will be locked (so that another concurrent transaction does not change the data again before committing the transaction in process).
At 708, a commit timestamp is obtained from the centralized time service local to the region's sub-cluster. Because the database server(s) executing the transaction's statement(s) belong to the sub-cluster of the same region, all of the database server(s) share one centralized time service (the local HLC), and after the statements have executed, the commit timestamp for the transaction in the database is obtained from the local HLC.
At 710, it is determined whether the statement set executes on multiple database servers. In the case where the statement set executes on a single database server in a sub-cluster of the region, then control will be transferred to 712. Otherwise, in the case where the statement set is executed on multiple database servers in a sub-cluster of the region, control will transfer to 720. If these statements (sets) are executed on only one database server, then the commit operation can be performed without using the two-phase commit protocol and control is transferred directly to 712. Otherwise, if the statement (set) is executed on multiple database servers, then the commit operation will be performed using the two-phase commit protocol and control will be transferred to 720 first.
At 712, the commit command and commit timestamp are sent to the database server(s). The commit command and commit timestamp are sent to each participating database server executing the statement.
At 714, an unlock operation is performed. Once the commit operation has been performed on each participating database server, the temporary changes made by the executed statements are made permanent, and the affected data may be unlocked in the second phase of the two-phase locking protocol.
At 716, it is determined that the transaction is complete. After completion of the commit operation and execution of the unlock on each participating database server, the transaction is deemed complete.
At 718, a successful commit result is returned. In various embodiments, once a transaction is completed, a successful commit result message is returned to the application that initiated the transaction.
At 720, a prepare command is sent to the participating database server. As a first phase of the two-phase commit protocol, a prepare command is sent to each participating database server.
At 722, it is determined whether all participating database servers successfully executed the prepare command. Upon determining that all participating database servers have successfully returned a prepare response, control is transferred to 712. Otherwise, if fewer than all of the participating database servers successfully return a prepare response, control is transferred to 724. The two-phase commit protocol does not allow the transaction to proceed to commit unless all participating database servers successfully return a prepare response.
At 724, a transaction rollback is performed. If at least one participating database server does not return a successful prepare response, a transaction rollback is performed, in which the temporary changes made by the transaction's statements are discarded and the database data is returned to its state prior to execution of the statement(s).
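A rough sketch of this single-region path follows. It is illustrative only: the participant and HLC interfaces are the hypothetical ones used in the earlier sketches (including the TransactionAborted exception), and it highlights the key difference from the cross-region path, namely that the commit timestamp comes directly from the region's local HLC rather than from the maximum prepare timestamp.

def commit_single_region(participants, local_hlc):
    # Single-region commit (process 700): all participants share the same local HLC.
    commit_ts = local_hlc.now()                 # step 708: commit timestamp from the local centralized time service
    if len(participants) == 1:                  # step 710: a single server needs no two-phase commit
        participants[0].commit(commit_ts)       # step 712
        return commit_ts
    for p in participants:                      # steps 720/722: two-phase commit within the region
        ok, _ = p.prepare()
        if not ok:
            for q in participants:
                q.rollback()                    # step 724
            raise TransactionAborted("a participant failed to prepare")
    for p in participants:
        p.commit(commit_ts)                     # step 712
    return commit_ts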
FIG. 8 illustrates a flow diagram of an example of a process for executing a transaction in one or more databases across multiple regions. In certain embodiments, process 800 may be implemented on any of database servers DB1, DB2, DB3, and DB4 of system 200 in FIG. 2, which are selected as coordinator database servers for the transaction. In some embodiments, step 608 of process 600 of FIG. 6 may be implemented, at least in part, using process 800.
At 802, a set of statements is determined to be executed on one or more database servers across multiple regions.
At 804, the set of statements executes on one or more database servers. After these statements execute on their respective participating database servers, any changes to the database data are temporary and are not visible to other transactions until the transaction in process is committed.
At 806, data locks affected by the statement set are obtained on one or more database servers. Using a two-phase locking protocol, data that a statement temporarily changes will be locked (so that another concurrent transaction does not change data again before committing the relevant transaction).
At 808, a prepare command is sent to a participating database server that spans multiple regions. As a first phase of the two-phase commit protocol, a prepare command is sent to each participating database server.
At 810, a determination is made as to whether all of the participating database servers successfully executed the prepare command. In the event that it is determined that all participating database servers have successfully sent back a prepare response, control transfers to 814. Otherwise, if fewer than all of the participating database servers successfully send back a prepare response, control transfers to 812. The two-phase commit protocol does not allow the transaction to proceed to commit unless all participating database servers successfully return a prepare response.
At 812, a transaction rollback is performed. If at least one participating database server does not return a successful prepare response, a transaction rollback is performed, where the temporary changes made by the statements performing the transaction are ignored or discarded and the database data is returned to the state prior to the execution of the statement (set).
At 814, a preparation timestamp is received from the participating database server. The preparation timestamp sent by each participating database server with the successful preparation response originates from the local HLC of the sub-cluster in which the database server is located.
At 816, the maximum preparation timestamp is selected as the commit timestamp. The maximum preparation timestamp returned by the participating database servers is selected as the commit timestamp of the transaction in the database.
At 818, the commit command and the commit timestamp are sent to the participating database server. A commit command and the commit timestamp of the transaction are sent to each participating database server executing the statement.
At 820, unlocking is performed. Once the commit operation has been performed on each participating database server, the temporary changes made by the executed statements are made permanent, and the affected data may be unlocked in the second phase of the two-phase locking protocol.
At 822, the transaction is confirmed to be completed. After completion of the commit operation and unlocking on each participating database server, the transaction is considered complete.
At 824, a successful commit result is returned. The successful commit result is returned to the application that is the source of the transaction.
FIG. 9 is a flow chart illustrating an example of a process for updating a local HLC based on updates received from sources outside the local region. In some embodiments, process 900 may be implemented on any of database servers DB1, DB2, DB3, and DB4 of system 200 in FIG. 2, regardless of whether the database server has been selected as the coordinator database server for a transaction.
In various embodiments, for transactions that are executed only within a single region, the local HLC for that region acts as a centralized time service whose value only increases along with the physical clock. Process 900, however, describes example events originating from sources outside the region of the database server executing process 900, in response to which the database server is configured to update its local HLC.
At 902, it is determined whether an update has been received from a database server located in another region. In the event such an update is received, control transfers to 904. Otherwise, in the event that no such update is received, control transfers to 908. In some embodiments, an update from a database server located in a non-local region (relative to the region in which the database server executing process 900 is located) includes the result of an operation (e.g., a value read by executing a read statement) together with a timestamp obtained from the HLC corresponding to the non-local region. For example, the database server executing process 900 acts as the coordinator database server for a transaction that includes a statement reading a value stored on another database server in the non-local region. In some embodiments, an update from a database server located in a non-local region is a commit command with a commit timestamp obtained from the HLC corresponding to the non-local region. For example, the database server executing process 900 executes a statement contained in a transaction for which the coordinator database server initiates the two-phase commit.
At 904, it is determined whether the HLC timestamp associated with the update is greater than the current local HLC time. In the event that the HLC timestamp associated with the update is greater than the current local HLC time, control transfers to 906. Otherwise, in the event that the HLC timestamp associated with the update is equal to or less than the current local HLC time, control transfers to 908. The HLC-based timestamp associated with the update from the database server of the non-local region is compared with the current time provided by the local HLC of the sub-cluster of the region associated with the database server executing process 900.
At 906, the local HLC time is updated. If the HLC-based timestamp of the update is greater than the time of the local HLC, the local HLC is updated. In some embodiments, the local HLC is updated to match the update's HLC-based timestamp. In certain embodiments, the local HLC is incremented by a predetermined physical and/or logical time interval. For example, the HLC is incremented by a value of 1.
At 908, a determination is made as to whether a new transaction is received with a session level HLC timestamp. In the event a new transaction is received, control is transferred to 910. Otherwise, in the event that no new transaction is received, control is transferred to 914. In the case where the database server executing process 900 acts as a coordinator for a transaction, in some cases, the session level HLC timestamp is maintained by the application that initiated the transaction. In various embodiments, the session-level HLC timestamp is set to the commit timestamp or rollback timestamp of the last transaction executed in the session.
At 910, it is determined whether the session level HLC timestamp is greater than the current local HLC time. In the event that the session level HLC timestamp is greater than the current local HLC time, control is transferred to 912. Otherwise, in the event that the session-level HLC timestamp is equal to or less than the current local HLC time, control is transferred to 914.
At 912, the local HLC time is updated. In some embodiments, the local HLC is updated to match the session-level HLC timestamp. In certain embodiments, the local HLC is incremented by a predetermined physical and/or logical time interval. For example, the HLC is incremented by a value of 1.
At 914, a determination is made whether to stop updating the local HLC. In the case where the local HLC will stop updating, process 900 ends. Otherwise, in the event that the local HLC does not stop updating, control returns to 902. For example, if the database server executing process 900 is powered down, the local HLC will stop updating.
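For illustration only, the following Python sketch shows one way a database server might maintain and advance its local HLC in the spirit of process 900; the (physical, logical) timestamp representation and the specific advance rule are simplifying assumptions, not the claimed implementation.

```python
# Illustrative sketch of process 900: advancing a local HLC on events that
# carry HLC timestamps from outside the local region. The (physical, logical)
# pair and the advance rule below are simplifying assumptions.

import time
from dataclasses import dataclass

@dataclass(order=True)
class HLCTimestamp:
    physical: int   # e.g., physical clock reading in milliseconds
    logical: int    # logical counter used to break ties

class LocalHLC:
    def __init__(self):
        self.current = HLCTimestamp(int(time.time() * 1000), 0)

    def now(self) -> HLCTimestamp:
        # The HLC never moves backwards; it tracks the physical clock when it can.
        wall = int(time.time() * 1000)
        if wall > self.current.physical:
            self.current = HLCTimestamp(wall, 0)
        return self.current

    def observe(self, incoming: HLCTimestamp) -> HLCTimestamp:
        # 902-912: if a timestamp attached to an update from another region, or a
        # session-level HLC timestamp received with a new transaction, is greater
        # than the current local HLC time, advance the local HLC past it.
        local = self.now()
        if incoming > local:
            self.current = HLCTimestamp(incoming.physical, incoming.logical + 1)
        return self.current
```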
FIG. 10 illustrates a sequence diagram of an example process for performing a transaction in a distributed database that spans multiple regions. In example process 1000, a transaction is received at system 200 of FIG. 2, and database server DB1 of sub-cluster 1 of region 1 is selected as the coordinator database server. The transaction, which contains statements S1 and S2, is received from an application. S1 is to be performed on database server DB2 of region 1, sub-cluster 1 (DC1/SC1), and S2 is to be performed on database server DB3 of region 2, sub-cluster 2 (DC2/SC2), meaning that the transaction is a cross-region transaction. Since the coordinator database server is DB1 located in DC1/SC1, DC1 may be referred to as the local DC/region, while HLC1 local to DC1/SC1 may also be referred to as the local HLC.
At 1002, initialization information associated with the transaction is sent from the application to DB1.
At 1004, DB1 performs initialization operation(s).
At 1006, DB1 sends an acknowledgement to the application to indicate that initialization has been performed.
At 1008, statement S1 is sent from the application to DB1.
At 1010, DB1 forwards S1 to DB2.
At 1012, DB2 executes S1 on its local database data.
At 1014, the completion timestamp of statement S1 is sent from DB2 to DB1. DB1 may record this completion timestamp.
At 1016, results associated with the execution of S1 are sent from DB1 to the application.
At 1018, statement S2 is sent from the application to DB1.
At 1020, DB1 forwards S2 to DB3.
At 1022, DB3 performs S2 on its local database data.
At 1024, the completion timestamp of statement S2 is sent from DB3 to DB1. DB1 may record this completion timestamp because it is the completion timestamp of the last statement (S2) executed in the non-local region (DC2).
At 1026, results associated with the execution of S2 are sent from DB1 to the application.
At 1028, the application sends a commit command to DB1.
At 1030, a prepare command is sent from DB1 to DB2. This is part of the first phase of the two-phase commit protocol used to commit a transaction that is executed on multiple database servers.
At 1032, a prepare command is sent from DB1 to DB3. This is also part of the first phase of the two-phase commit protocol.
At 1034, a prepare timestamp is sent from DB2 to DB1. This prepare timestamp is obtained from HLC1 and is sent with a successful prepare response.
At 1036, a prepare timestamp is sent from DB3 to DB1. This prepare timestamp is obtained from HLC2 and is sent with a successful prepare response.
At 1038, DB1 determines the commit timestamp of the transaction by selecting the larger of the prepare timestamps received from DB2 and DB3.
At 1040, DB1 sends a commit command and the commit timestamp to DB2.
At 1042, DB1 sends the commit command and the commit timestamp to DB3.
At 1044, a commit response is sent from DB2 to DB1.
At 1046, a commit response is sent from DB3 to DB1. Once DB1 receives successful commit responses from participating database servers DB2 and DB3, the transaction is determined to be complete.
At 1048, a successful commit result is sent from DB1 to the application.
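In terms of the hypothetical coordinator sketch given after process 800, the FIG. 10 sequence might be driven as follows; db2, db3, s1, and s2 are assumed handles for the participating servers and statements, not names used by the described system.

```python
# Hypothetical usage corresponding to the FIG. 10 sequence: DB1 coordinates,
# S1 runs on DB2 (DC1/SC1) and S2 runs on DB3 (DC2/SC2).
commit_ts = coordinate_cross_region_commit(
    participants=[db2, db3],
    statements=[(s1, db2), (s2, db3)],
)
# commit_ts is the larger of the prepare timestamps obtained from HLC1 (via DB2)
# and HLC2 (via DB3), and is recorded for the results of both S1 and S2.
```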
The following example describes a scenario in which accessing the database may lead to inconsistent results due to clock skew between the HLCs of sub-clusters in different regions:
Assume that a distributed database is established as shown in FIG. 2. Assume that the record of table tb1 with name "Foo" is stored in DB2 (DC1/SC1) and the record of table tb1 with name "Bar" is stored in DB3 (DC2/SC2).
Assume transaction Q contains statements S1 and S2, which are executed sequentially in the database, wherein:
Statement S1 updates table tb1, setting the balance tb1.balance of the record with name "Foo" to 100; thus, S1 is executed on DB2 in DC1/SC1.
Statement S2 updates table tb1, setting the balance tb1.balance of the record with name "Bar" to 200; thus, S2 is executed on DB3 in DC2/SC2.
Although S1 is executed before S2 in absolute time, because of clock skew between HLC1 of DC1/SC1 and HLC2 of DC2/SC2, the commit timestamp t1 recorded for S1 is greater (later) than the commit timestamp t2 recorded for S2. Assume the following:
t1 (commit timestamp recorded in the database for S1) = 103
t2 (commit timestamp recorded in the database for S2) = 101
tabs1 (absolute time at which S1 committed) = 1000
tabs2 (absolute time at which S2 committed) = 1005
In summary, because the interval between the executions of statements S1 and S2 in the database is short and there is clock skew, the order of their commit timestamps (t1 and t2, respectively) recorded in the database does not coincide with their commit order in absolute time.
From the perspective of the database, or more precisely, the commit order of S1 and S2 as confirmed by the database, S2 committed before S1, since t1 > t2. However, this database-confirmed commit order contradicts the commit order of S1 and S2 in absolute time, since tabs2 > tabs1.
Reading the balance (tb1.balance) of the records named "Foo" and "Bar" in tb1 consistently (relative to the commit order confirmed by the database) returns one of three results: 1) Foo's balance is 100 and Bar's balance is 200; 2) Foo's balance is its original value (not 100) and Bar's balance is 200; and 3) Foo's balance is its original value (not 100) and Bar's balance is its original value (not 200).
However, due to clock skew between the HLCs of sub-clusters in different regions, a read of the balances of the records named "Foo" and "Bar" in tb1 may typically return a result that is inconsistent relative to the commit order confirmed by the database: Foo's balance is 100 while Bar's balance is still its original value (not 200).
The application creates two observers (sessions) and uses the result of the first observer to determine the operation of the second observer (i.e., there is a dependency between the two observers). Without the techniques described herein, the two observers' reads may typically produce the following results:
Requirement: if observer 1 sees Foo's balance as 100, then observer 2 will update Bar's balance to 220 if Bar's current balance is 200.
Observer 1 reads from DB2 in DC1/SC1 at absolute time 1001 and sees Foo's balance as 100.
Observer 2 reads from DC2 at absolute time 1003 and sees Bar's balance at its original value (not 200), because S2 does not commit until absolute time 1005 (tabs2); the application therefore does not set Bar's balance to 220. This is inconsistent with the commit order recorded in the database (t2 < t1), under which S2's result should be visible whenever S1's result is visible.
However, when the techniques described herein are applied to each cross-region transaction executed in the database, the maximum prepare timestamp sent back by the participating database servers is selected as the commit timestamp for the entire cross-region transaction. Given that the cross-region network delay (e.g., 100 milliseconds) far exceeds the maximum clock skew (e.g., 10 milliseconds) between the HLCs located in different regions, transaction Q will have committed by the time observer 1's successful commit result is returned to the application, and the execution results of both statements S1 and S2 will be recorded with the same commit timestamp. Once both S1 and S2 have committed, observer 2 can see the correct result of S2. With these techniques applied, the two sessions proceed as follows:
Observer 1 initiates transaction A1. A1 executes a statement to select/read the balance of the record with name "Foo" from tb1, with the completion timestamp of the statement's execution coming from HLC1.
When the application receives the commit result returned by observer 1's A1, it knows that the balance of Foo is 100. The application then sends a transaction A2 through observer 2 (a different session than observer 1). A2 is executed as follows: the balance of the record with name "Bar" is selected from tb1, and if the balance is 200, it is updated to 220. Observer 2 therefore receives a read of Bar's balance and checks whether the balance is 200. Since the maximum prepare timestamp sent back by the participating database servers is used as the commit timestamp for each entire cross-region transaction, and since the cross-region network delay (e.g., 100 ms) far exceeds the maximum clock skew (e.g., 10 ms) between HLCs located in different regions, A2 will see the commit result of S2 when it executes, which means that the balance of Bar is 200; A2 therefore correctly updates the balance of Bar to 220.
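To make the reasoning above concrete, the following sketch reuses the timestamp values from the earlier example as assumed prepare timestamps for transaction Q; the numbers are illustrative assumptions only, not values stated for this scenario.

```python
# Illustrative numbers only: the assumed prepare timestamps reuse t1 = 103 and
# t2 = 101 from the earlier clock-skew example.
prepare_ts_db2 = 103   # from HLC1 (DC1/SC1), where S1 executed
prepare_ts_db3 = 101   # from HLC2 (DC2/SC2), where S2 executed

# The entire cross-region transaction Q (S1 and S2) commits with a single
# timestamp: the maximum of the prepare timestamps returned by the participants.
q_commit_ts = max(prepare_ts_db2, prepare_ts_db3)
assert q_commit_ts == 103

# Both S1's and S2's results are recorded with commit timestamp 103, so a read
# that observes Foo == 100 (S1's effect) cannot miss S2's update of Bar to 200.
```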
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims (20)

1. A distributed database, comprising:
one or more processors configured to:
determining that a transaction containing one or more sets of statements is to be executed on a plurality of database servers across at least two regions, wherein each region is associated with a respective centralized time service based on a hybrid logic clock (HLC);
causing the one or more statement sets to execute on the plurality of database servers across at least two regions;
obtaining a plurality of HLC-based preparation timestamps from the plurality of database servers across at least two regions; and
selecting a largest HLC-based preparation timestamp as the commit timestamp associated with the transaction; and
one or more memories coupled to the one or more processors and configured to provide instructions to the one or more processors.
2. The distributed database of claim 1, wherein the one or more processors are further configured to cause the plurality of database servers across at least two regions to commit execution results associated with the one or more statement sets using the commit timestamp.
3. The distributed database of claim 1, wherein the one or more processors are further configured to return a commit result corresponding to the transaction.
4. The distributed database of claim 1, wherein the determining of the transaction containing one or more sets of statements to be executed on the plurality of database servers across at least two regions is based on hint information sent by an application.
5. The distributed database of claim 1, wherein a network delay between two regions is greater than a maximum clock skew between respective HLC-based centralized time services corresponding to respective ones of the at least two regions.
6. The distributed database of claim 1, wherein the one or more processors are further configured to: acquiring data locks affected by the one or more sets of statements on the plurality of database servers.
7. The distributed database of claim 1, wherein the one or more processors are further configured to send a prepare command to the plurality of database servers.
8. The distributed database of claim 1, wherein the one or more processors are configured to unlock data affected by the one or more statement sets on the plurality of database servers.
9. The distributed database of claim 1, wherein the one or more processors are further configured to:
receiving an update from a database server located in a first region, the first region being distinct from a second region associated with the distributed database;
comparing an HLC based timestamp associated with the update to a current local HLC time; and
updating the current local HLC time in response to a determination that the HLC based timestamp associated with the update is greater than the current local HLC time.
10. The distributed database of claim 9, wherein the update from the database server located in the first region includes a value read from the database server.
11. The distributed database of claim 1, wherein the one or more processors are further configured to:
receiving a session level HLC timestamp with the transaction;
comparing the session level HLC timestamp with a current local HLC time; and
updating the current local HLC time in response to a determination that the session level HLC timestamp is greater than the current local HLC time.
12. The distributed database of claim 1, wherein the transaction comprises a first transaction, wherein the one or more processors are further configured to:
determining that a second transaction is to be performed on a database server group within the region; and
executing the second transaction on the database server set.
13. A method, comprising:
determining that a transaction containing one or more sets of statements is to be executed on a plurality of database servers spanning at least two regions, wherein each region is associated with a respective centralized time service based on a hybrid logic clock (HLC);
causing the one or more sets of statements to execute on the plurality of database servers spanning at least two regions;
obtaining a plurality of HLC-based preparation timestamps from the plurality of database servers spanning at least two regions; and
a largest HLC-based preparation timestamp is selected as the commit timestamp associated with the transaction.
14. The method of claim 13, wherein determining the transaction containing one or more sets of statements to be executed on the plurality of database servers spanning the at least two regions is based on hint information sent by an application.
15. The method of claim 13, wherein a network delay between two regions is greater than a maximum clock skew between respective HLC-based centralized time services corresponding to respective ones of the at least two regions.
16. The method of claim 13, further comprising unlocking data affected by the one or more statement sets on the plurality of database servers.
17. The method of claim 13, further comprising:
receiving an update from a database server located in a first region, the first region being distinct from a second region;
comparing an HLC based timestamp associated with the update to a current local HLC time; and
updating the current local HLC time in response to a determination that the HLC based timestamp associated with the update is greater than the current local HLC time.
18. The method of claim 17, wherein the update from the database server located in the first region comprises a value read from the database server.
19. The method of claim 13, further comprising:
receiving a session level HLC timestamp with the transaction;
comparing the session level HLC timestamp to a current local HLC time; and
updating the current local HLC time in response to a determination that the session level HLC timestamp is greater than the current local HLC time.
20. A computer program product, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
determining that a transaction containing one or more sets of statements is to be executed on a plurality of database servers across at least two regions, wherein each region is associated with a respective centralized time service based on a hybrid logic clock (HLC);
causing the one or more statement sets to execute on the plurality of database servers across at least two regions;
obtaining a plurality of HLC-based preparation timestamps from the plurality of database servers across at least two regions; and
a largest HLC-based preparation timestamp is selected as the commit timestamp associated with the transaction.
CN201980099051.5A 2019-08-02 2019-08-02 Distributed cross-regional database transaction processing Pending CN114207600A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/099002 WO2021022396A1 (en) 2019-08-02 2019-08-02 Transaction processing for database distributed across regions

Publications (1)

Publication Number Publication Date
CN114207600A true CN114207600A (en) 2022-03-18

Family

ID=74502402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980099051.5A Pending CN114207600A (en) 2019-08-02 2019-08-02 Distributed cross-regional database transaction processing

Country Status (2)

Country Link
CN (1) CN114207600A (en)
WO (1) WO2021022396A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114362870B (en) * 2021-12-23 2022-11-29 天津南大通用数据技术股份有限公司 Partition logic clock method for distributed transaction type database

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101132270B (en) * 2007-08-02 2010-09-08 北京航空航天大学 Multi-node coordinated time consistency management method
US10474377B2 (en) * 2017-07-06 2019-11-12 Facebook, Inc. Optimizing data writes in a distributed computing system
CN110018884B (en) * 2019-03-19 2023-06-06 创新先进技术有限公司 Distributed transaction processing method, coordination device, database and electronic equipment

Also Published As

Publication number Publication date
WO2021022396A1 (en) 2021-02-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination