CN109977171B - Distributed system and method for ensuring transaction consistency and linear consistency - Google Patents


Info

Publication number
CN109977171B
Authority
CN
China
Prior art keywords
transaction
data
consistency
global
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910247559.7A
Other languages
Chinese (zh)
Other versions
CN109977171A (en)
Inventor
卢卫
张孝
杜小勇
陈跃国
赵欣
程一舰
张真苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Publication of CN109977171A
Application granted
Publication of CN109977171B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/466Transaction processing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed system and method for ensuring transaction consistency and linear consistency. The distributed system comprises a plurality of clients and a database server side composed of an access layer, a meta-information management cluster, a global Gts generation cluster, and a transaction processing and storage layer. The client provides an interface for the user to interact with the database server and sends user requests to the database server; the access layer receives the requests sent by the clients, parses them, and generates execution plans; the meta-information management cluster manages the distributed cluster; the global Gts generation cluster generates global timestamps that uniquely order the global transactions of the distributed system, realizing linear consistency; the transaction processing and storage layer comprises a plurality of resource management nodes that execute transaction logic according to the execution plans sent by the access layer, and the results obtained are returned to the client through the access layer. The invention can be widely applied in the field of data processing.

Description

Distributed system and method for ensuring transaction consistency and linear consistency
Technical Field
The present invention relates to the field of data processing technology, and in particular, to a distributed system and method for ensuring transaction consistency and linear consistency.
Background
First, the consistency of distributed transactions is described. Data processing technology requires transaction semantics and borrows the four ACID properties of relational databases to guarantee the system's transactional behavior and meet the needs of electronic commerce. Here A stands for atomicity, C for consistency, I for isolation, and D for durability. Electronic transaction operations rely on these four properties to make transactions safe and reliable.
Distributed transaction processing must likewise satisfy the four ACID properties, which in turn requires several supporting techniques. The most important of these are data consistency and isolation, since consistency determines the correctness of the data while isolation determines the performance of the concurrent system.
Transaction consistency is achieved mainly through concurrency control algorithms. Common concurrency control algorithms include lock-based protocols, timestamp-ordering protocols, MVCC-based protocols, and OCC-based protocols. These algorithms must first ensure that no data anomalies occur, i.e., that transaction schedules are serializable, so that correctness is guaranteed. Second, the choice of concurrency control algorithm determines the degree of transaction concurrency, and hence the transaction throughput of the system, which is a performance issue.
In a distributed database, transaction consistency is embodied by distributed transactions that span nodes. The paper "Distributed snapshot isolation: global transactions pay globally, local transactions pay locally" notes that two data anomalies can arise in a distributed system; if they occur, transaction consistency cannot be guaranteed.
Next, the external consistency involved in the CAP theorem is described. Under CAP theory (also known as Brewer's theorem), consistency in a distributed system is defined at multiple levels, which fall into two categories: strong consistency and weak consistency.
For strong consistency, the distributed system is required to guarantee linear consistency (linearizability). By definition, linear consistency requires a global total order over all operations, with a returns-before partial order holding between operations: from the viewpoint of an external observer, the effects that time-ordered events have on the data must be readable in the order in which the events occurred. Linear consistency is a property of distributed systems and is not directly tied to the database itself, but a distributed database system, as its name implies, needs to satisfy this external consistency.
For weak consistency, causal consistency is the most common form. By definition, causal consistency only constrains the order of causally related operations, i.e., its requirements are weaker than those of linear consistency. Causal consistency requires that if process A first reads the old value of a data item and then updates the data item to a new value, any read of that data item by another process B must preserve the partial order already established in A, i.e., B must not observe the new value before the old one. Meanwhile, data access by a process C unrelated to process A is unconstrained.
In a distributed system, adopting a globally unique transaction manager is one feasible way to ensure transaction consistency. However, relying on a global transaction manager leaves three major problems:
1. The global transaction manager is complex to implement and involves many technical difficulties that take time to overcome, posing great challenges for engineers.
2. The transaction mechanism of the underlying stand-alone database on each node cannot be used effectively, because once the upper layer implements transaction management, the lower layer no longer needs to. This means the single-node transaction engine is discarded and the wheel is reinvented at the upper layer, wasting time, manpower, and money.
3. The global transaction manager is a single-point architecture, which contradicts the idea of decentralization, and it becomes a performance bottleneck.
Currently, there are several schemes for ensuring transaction consistency and linear consistency in a distributed system:
First: as shown in FIG. 1, the global transaction management implementation of Postgres-XC (a Postgres database cluster solution) consists of a GTM (Global Transaction Manager), several Coordinators, and several Datanodes. The GTM is the core component of Postgres-XC, responsible for global transaction control and tuple visibility control. The GTM allocates global transaction numbers and manages the PGXC MVCC module, and there can be only one master GTM in a cluster.
However, as described above, the core technology of concurrent access control (the most difficult and complex part of a database system, and the part whose code pervades the entire storage engine) resides in the GTM module. The GTM is the only global transaction management node, the work it performs is complex, and it is architecturally a single point, so it easily becomes a bottleneck.
Second: the relevant literature describes a fully decentralized transaction manager algorithm, which maintains one "global transaction manager GTM" per node (independent database): a cluster with N nodes has N GTMs. Each GTM is responsible for maintaining the serializability of global transactions. Each global transaction is assigned a cluster-wide unique, monotonically increasing identifier; this identifier is a timestamp value that represents the order between global transactions, enabling serializable scheduling. A global transaction is then decomposed into sub-transactions executed on different nodes; the sub-transactions carry the global transaction's timestamp and are executed on each node (each node uses the S2PL algorithm). Because the (sub-)transaction identifiers are ordered by timestamp, the sub-transactions are globally ordered, and the global transaction may commit once its sub-transactions have executed successfully on every involved node. By spreading the execution of many global transactions across the nodes, the algorithm eliminates the centralized global transaction manager. However, the assignment of the global transaction timestamp (called GTS) depends on the local clock of each single node. The authors hold that although the clocks of the nodes should be synchronized, whether they actually are does not affect the correctness of the algorithm: the timestamp value of each globally committed transaction is stored at the involved child nodes and used as the basis for generating that node's next timestamp value, and a new transaction's timestamp is compared with the stored GTS as a rollback condition, a new global transaction whose timestamp is smaller than the GTS being rolled back.
However, the algorithm does not use conflict information between global transactions as the basis for conflict resolution; instead it relies on time-ordering the global transactions to achieve serializability, which introduces more rollbacks. In addition, the timestamp value of a child node depends on the timestamps of already-committed transactions, which can push transaction timestamps ever later. Moreover, for database systems whose concurrency control is not based on MVCC, transaction performance suffers when read transactions are numerous.
Third: Google's Spanner system adopts the TrueTime mechanism (which relies on physical devices such as GPS and atomic clocks). It realizes a decentralized transaction management mechanism that depends neither on a global timestamp as the mutual-exclusion basis for concurrent access (it uses SS2PL+MVCC as its concurrency control algorithm) nor on a global timestamp as the basis for linear consistency. However, because the mechanism relies on physical devices, namely GPS and atomic clocks, it is expensive and, from an economic point of view, not suitable for all users.
Disclosure of Invention
In view of the foregoing, it is an object of the present invention to provide a distributed system and method for ensuring transaction consistency and linear consistency, capable of guaranteeing, in a distributed architecture, both the transaction consistency of cross-node distributed transactions and the linear consistency involved in the CAP theorem, so that the data operated on by cross-node global transactions remains transaction-consistent and linearly consistent in data processing systems including distributed database systems (SQL, NoSQL, NewSQL; relational and non-relational), distributed big-data processing systems, and any transactional system with cross-node global write operations.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a distributed system for ensuring transaction consistency and linear consistency, comprising a plurality of clients and a database server, wherein the database server consists of an access layer, a meta-information management cluster, a global Gts generation cluster, and a transaction processing and storage layer; the client provides an interface for the user to interact with the database server and sends user requests to the database server; the access layer receives the requests sent by the clients, parses them, and generates execution plans; the meta-information management cluster uniformly manages the distributed clusters of the distributed system; the global Gts generation cluster generates global timestamps and uniquely orders the global transactions of the distributed system to realize linear consistency; the transaction processing and storage layer comprises a plurality of resource management nodes, including coordination nodes and data nodes, which execute transaction logic according to the execution plans sent by the access layer, the results obtained being returned to the client through the access layer.
Further, the data nodes store the data of the distributed system in partitions, and the coordination nodes coordinate the transactions of the distributed system. The resource management nodes can be allocated in either of two modes. Master-slave mode: some resource management nodes serve exclusively as coordination nodes for transaction processing, while the rest serve as data nodes. Peer-to-peer mode: all resource management nodes are peers, and each serves as both a data node and a coordination node.
Further, the global timestamp produced by the global Gts generation cluster consists of eight bytes, composed as a hybrid physical clock: a) the first 44 bits are the physical timestamp value; b) the last 20 bits are a count that increases monotonically within one millisecond.
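Under the 44/20 bit split described above, an 8-byte Gts can be packed and unpacked as sketched below. The helper names and the millisecond unit of the physical part are illustrative assumptions; the patent specifies only the bit layout. Because the physical clock occupies the high bits, plain integer comparison of two Gts values yields their global order.

```python
PHYS_BITS = 44   # high 44 bits: physical timestamp value (milliseconds assumed)
COUNT_BITS = 20  # low 20 bits: count increasing monotonically within one ms
COUNT_MASK = (1 << COUNT_BITS) - 1  # 2^20 distinct counts per millisecond

def pack_gts(physical_ms: int, counter: int) -> int:
    """Combine the millisecond clock and the intra-millisecond counter."""
    assert 0 <= counter <= COUNT_MASK and 0 <= physical_ms < (1 << PHYS_BITS)
    return (physical_ms << COUNT_BITS) | counter

def unpack_gts(gts: int) -> tuple[int, int]:
    """Split an 8-byte Gts back into (physical_ms, counter)."""
    return gts >> COUNT_BITS, gts & COUNT_MASK
```

A timestamp issued in a later millisecond always compares greater than any timestamp of an earlier millisecond, regardless of the counter values.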
Further, the basic data structures involved in the distributed system comprise the global transaction state table, the local transaction state table, the data item data structure, the transaction's global read set and write set, and the communication protocol and messages;
the global transaction state table maintains the state of a transaction as seen globally across the distributed system and is represented by the six-tuple {TID, Lowts, Uppts, Status, Gts, Nodes}, where TID is the unique identifier of the transaction, Lowts is the lower bound of the transaction's logical commit timestamp, Uppts is the upper bound of the transaction's logical commit timestamp, Status is the current global state of the transaction, Gts is the timestamp at which the transaction's global commit/rollback completed, and Nodes lists the data nodes involved in the transaction;
the local transaction state table maintains a transaction's local state on each resource management node and is represented by {TID, Lowts, Uppts, Status}, where TID is the unique identifier of the transaction, Lowts is the lower bound of the transaction's logical commit timestamp, Uppts is the upper bound, and Status is the local state of the transaction;
the data item data structure comprises a first group of elements serving as the basis for linear consistency and a second group serving as the basis for distributed transaction consistency; the first group is {gts, info_bit}, where gts records the transaction's globally unique order in the distributed system and info_bit identifies whether the gts field currently holds a Gts or a TID; the second group is {wts, rts}, where wts records the logical timestamp of the transaction that created this version of the data item and rts records the logical timestamp of the transaction that last read the data item;
the transaction's global read set records all data items read by the transaction and is represented by {BlockAddress, Offset, Size, Value}, where BlockAddress is the block address of the data item, Offset is the data item's offset within the block, Size is the size of the data item, and Value is the value of the data item;
the transaction's global write set records all data items the transaction needs to update and is represented by {BlockAddress, Offset, Size, NewValue, OperationType}, where BlockAddress is the block address of the data item, Offset is its offset within the block, Size is its size, NewValue is the new value of the data item, and OperationType indicates whether the operation is an update, an insert, or a delete;
the communication protocol and messages comprise messages from the coordination node to the data nodes, messages from the data nodes to the coordination node, messages from the coordination node to the global Gts generation cluster, and messages from the global Gts generation cluster to the coordination node; the messages from the coordination node to the data nodes include the read data request message, the validation request message, and the write-commit/rollback request; the messages from the data nodes to the coordination node include the read request feedback message and the local validation feedback message; the messages from the coordination node to the global Gts generation cluster include the global timestamp request message; and the messages from the global Gts generation cluster to the coordination node include the global timestamp request feedback message.
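The bookkeeping structures above can be sketched as plain records. The field names follow the patent text; the concrete Python types and default values are illustrative assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalTxnState:   # six-tuple {TID, Lowts, Uppts, Status, Gts, Nodes}
    tid: int            # unique transaction identifier
    lowts: int          # lower bound of the logical commit timestamp
    uppts: int          # upper bound of the logical commit timestamp
    status: str         # current global state of the transaction
    gts: int = 0        # timestamp of global commit/rollback completion
    nodes: list = field(default_factory=list)  # data nodes involved

@dataclass
class LocalTxnState:    # four-tuple {TID, Lowts, Uppts, Status}
    tid: int
    lowts: int
    uppts: int
    status: str

@dataclass
class ReadSetEntry:     # {BlockAddress, Offset, Size, Value}
    block_address: int
    offset: int
    size: int
    value: bytes

@dataclass
class WriteSetEntry:    # {BlockAddress, Offset, Size, NewValue, OperationType}
    block_address: int
    offset: int
    size: int
    new_value: bytes
    operation_type: str  # "update", "insert" or "delete"
```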
A multi-level consistency method for a distributed system that ensures transaction consistency and linear consistency, comprising the following steps: 1) establishing a unified consistency model capable of realizing multiple consistency levels; 2) determining the consistency level the distributed system needs to reach according to actual service requirements, determining, based on the established system consistency model, a consistency execution algorithm suited to that requirement, and executing the distributed transactions and single-node transactions of the distributed system to obtain the transaction execution results.
Further, in step 1), the method of establishing a unified consistency model capable of realizing multiple consistency levels comprises the following steps:
1.1) performing transaction concurrency access control with a DTA-based OCC strategy, and establishing the RUC-CC algorithm to ensure transaction consistency;
1.2) based on the global timestamps and global transaction states produced by the global Gts generation cluster, establishing a global-timestamp-based linear consistency guarantee algorithm to guarantee linear consistency among transactions;
1.3) adopting a method of reading the data twice, establishing a two-read linear consistency guarantee algorithm to guarantee consistency among transactions;
1.4) combining the transaction consistency and linear consistency of steps 1.1) to 1.3) with the MVCC algorithm to establish a unified model that can satisfy multiple consistency levels.
Further, in step 1.1), the method of establishing the RUC-CC algorithm, which performs transaction concurrency access control using the DTA-based OCC strategy, comprises the following steps:
1.1.1) for a transaction T sent by a client, completing the corresponding initialization work on the coordination node;
1.1.2) dividing the global execution of a transaction into 3 phases: a read phase, a validation phase, and a write-commit/rollback phase; under the coordination of the coordination node, each data node involved in the operation executes the transaction, and the table entries of committed or rolled-back transactions are cleared from the transaction state tables.
Further, in step 1.1.2), the global execution of a transaction is divided into 3 phases (the read phase, the validation phase, and the write-commit/rollback phase), and each data node involved executes the transaction under the coordination of the coordination node, comprising the following steps:
1.1.2.1) transaction T reads the required data according to its execution logic and writes updates to its local memory;
1.1.2.2) transaction T verifies whether it conflicts with other transactions, obtaining a validation result;
1.1.2.3) transaction T performs either write-commit or rollback depending on the result of the validation phase.
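The three phases above can be sketched as a coordinator-side loop. This is a minimal in-memory model with illustrative names (`Node`, `run_transaction` and the per-node methods are assumptions, not the patent's interfaces), and the message exchanges are collapsed into direct method calls.

```python
class Node:
    """Minimal stand-in for a data node, for illustration only."""
    def __init__(self, conflict=False):
        self.conflict = conflict   # simulate a validation-time conflict
        self.state = "idle"
    def read_phase(self, txn):     # read data, buffer updates locally
        self.state = "read"
    def validate(self, txn):       # local validation; False votes Abort
        self.state = "validated"
        return not self.conflict
    def commit(self, txn):
        self.state = "committed"
    def rollback(self, txn):
        self.state = "rolledback"

def run_transaction(txn, data_nodes):
    # read phase: every involved node executes reads and buffers writes
    for node in data_nodes:
        node.read_phase(txn)
    # validation phase: the transaction passes only if every node agrees
    ok = all(node.validate(txn) for node in data_nodes)
    # write-commit/rollback phase, chosen by the validation result
    for node in data_nodes:
        if ok:
            node.commit(txn)
        else:
            node.rollback(txn)
    return ok
```

In the real system these calls are messages between the coordination node and the data nodes, and validation requests would be sent in parallel rather than sequentially.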
Further, in step 1.1.2.1), the method of reading the required data according to the execution logic and writing updates to transaction T's local memory is as follows:
first, the coordination node of transaction T sends a read data request message for data item x to the data node where x resides;
then, after receiving the read data request message, the data node where data item x resides first creates or updates the local transaction state table entry for transaction T, then searches for a version of data item x visible within transaction T's logical lifetime, and sends a read request feedback message to the coordination node of transaction T;
finally, after receiving the read request feedback messages from all data nodes, the coordination node of transaction T judges whether rollback is needed; if so, it enters the global rollback phase, and if not, the transaction continues to execute.
Further, in step 1.1.2.2), the method by which transaction T verifies whether it conflicts with other transactions is as follows:
first, the coordination node of transaction T modifies the state of transaction T in the global transaction state table to Gvaliding, then sends a validation request message and the local write set to each data node involved in transaction T;
secondly, after each data node involved in transaction T receives the validation request message, it executes the local validation operation, which specifically comprises the following steps:
(1) updating T.Lowts = max(T.Lowts, vrm.Lowts) and T.Uppts = min(T.Uppts, vrm.Uppts) for transaction T in the local transaction state table;
(2) checking whether T.Lowts is greater than T.Uppts; if so, validation fails and an Abort message is returned to the coordination node of transaction T to enter rollback; otherwise, proceeding to step (3);
(3) finding each data item y in the transaction's write set and checking whether the WT of data item y is empty:
if not, sending an Abort message to the coordination node of transaction T to enter rollback;
otherwise, proceeding to step (4);
(4) updating the WT of each data item y in the write set to T.TID, and raising the lower bound of transaction T's timestamp in the local transaction state table above the rts of y;
(5) checking whether T.Lowts is greater than T.Uppts; if so, validation fails, the node rolls back locally and then returns an Abort message to the coordination node of transaction T; otherwise, proceeding to step (6);
(6) for each element y in the write set, adjusting the timestamp of transaction T or of the transactions in y's RTlist, eliminating read-write conflicts;
(7) creating a new version of data item y from its updated value, with a flag indicating that the new version has not been globally committed;
(8) returning a local validation feedback message lvm for transaction T to its coordination node, where lvm.Lowts and lvm.Uppts record the lower and upper bounds of transaction T's logical timestamp on this data node;
finally, after the coordination node of transaction T receives the local validation feedback messages from all resource management nodes, it determines from the received messages whether transaction T passes validation.
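The local validation steps (1) to (8) can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the patent's implementation: `vrm` stands for the validation request message carrying the coordinator's timestamp interval, WT and rts are the per-item fields described earlier, and the RTlist adjustment of step (6) is elided to a comment.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Item:
    wt: Optional[int] = None   # TID of a validating writer claiming the item
    rts: int = 0               # logical timestamp of the last read
    versions: list = field(default_factory=list)

@dataclass
class Interval:                # {TID, Lowts, Uppts} of a transaction
    tid: int
    lowts: int
    uppts: int

def local_validate(vrm, local, write_set, items):
    """Steps (1)-(8) on one data node; returns (ok, Lowts, Uppts)."""
    # (1) merge the interval carried by the validation request message
    local.lowts = max(local.lowts, vrm.lowts)
    local.uppts = min(local.uppts, vrm.uppts)
    # (2) an empty interval means no serializable commit point: abort
    if local.lowts > local.uppts:
        return False, local.lowts, local.uppts
    # (3) a non-empty WT means a concurrent writer has claimed y: abort
    if any(items[y].wt is not None for y in write_set):
        return False, local.lowts, local.uppts
    # (4) claim each y and raise the lower bound above its rts
    for y in write_set:
        items[y].wt = vrm.tid
        local.lowts = max(local.lowts, items[y].rts + 1)
    # (5) re-check the interval after the adjustment
    if local.lowts > local.uppts:
        return False, local.lowts, local.uppts
    # (6) adjustment of readers in y's RTlist is elided in this sketch
    # (7) create a provisional version flagged as not globally committed
    for y in write_set:
        items[y].versions.append((vrm.tid, "uncommitted"))
    # (8) the surviving bounds go back to the coordinator in the lvm message
    return True, local.lowts, local.uppts
```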
Further, in step 1.2), based on the global timestamps and global transaction states produced by the global Gts generation cluster, the established global-timestamp-based linear consistency guarantee algorithm comprises the following steps:
1.2.1 A client initiates a transaction T request, and an access layer establishes connection with the client to form a session;
1.2.2 The access layer analyzes the transaction T and selects a coordination node to be responsible for managing the execution process of the transaction;
1.2.3) at the start of a read transaction T, a global timestamp Gts is obtained from the global Gts generation cluster and recorded in the global transaction state table for the read transaction; the coordination node establishes connections with all data nodes involved in the read transaction, packs the parsed query execution plan together with the global timestamp Gts, and transmits the packet to all involved data nodes over the network;
1.2.4) each data node performs its read operation, determines the data items satisfying the selection conditions, and then traverses each multi-versioned data item from the newest version until the first visible version is found;
1.2.5 The coordination node gathers the data returned by all the data nodes and returns the data to the access layer, the access layer returns the data to the client for establishing the session relation, and the current reading transaction is completed.
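Step 1.2.4) amounts to a newest-first scan of each item's version chain: the first globally committed version whose commit timestamp does not exceed the read transaction's Gts is the visible one. A minimal sketch, assuming versions are kept newest first as (commit Gts, value) pairs:

```python
def read_visible(versions, read_gts):
    """versions: newest-first list of (commit_gts, value) pairs.
    Returns the first version visible at read_gts, or None if none is."""
    for commit_gts, value in versions:
        if commit_gts <= read_gts:   # committed no later than our Gts
            return value
    return None
```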
Further, in step 1.3), adopting the method of reading the data twice, the flow of the established two-read linear consistency guarantee algorithm is as follows:
1.3.1 A client initiates a transaction T request, and an access layer establishes connection with the client to form a session;
1.3.2 The access layer analyzes the transaction T and selects a coordination node to be responsible for managing the execution process of the transaction;
1.3.3 The coordination node establishes connection with all data nodes related to the read transaction, sends a data acquisition request to all related data nodes, all related data nodes execute a first data reading algorithm and return data to the coordination node, and the coordination node determines the global timestamp Gts of the current read transaction T based on all returned data items;
1.3.4 The coordination node sends the data acquisition request to all the data nodes again, and sends the determined global timestamp Gts of the current read transaction T to all the data nodes, a second data reading algorithm is executed on the data nodes, and a data version meeting the linear consistency with respect to the global timestamp Gts of the current read transaction T is returned;
1.3.5 The coordination node gathers the data returned by all the data nodes and returns the data to the access layer, the access layer returns the data to the client for establishing the session relation, and the current reading transaction is completed.
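A hedged sketch of the two-read flow in steps 1.3.3) to 1.3.5): in round one every data node reports its data, and the coordinator derives the read transaction's Gts from what was returned (here simplified to the maximum of the newest committed timestamps); in round two each node is re-read at that Gts so the returned versions form one mutually consistent snapshot. The data layout (newest-first version chains of (commit Gts, value) pairs) and the function names are assumptions.

```python
def visible(chain, gts):
    """Newest-first scan: first version whose commit timestamp <= gts."""
    return next((value for ts, value in chain if ts <= gts), None)

def two_read(nodes):
    """nodes: dict mapping node name -> newest-first [(commit_gts, value), ...]."""
    # round 1: probe each node; fix Gts as the newest committed timestamp seen
    read_gts = max(chain[0][0] for chain in nodes.values() if chain)
    # round 2: re-read every node at the agreed global timestamp Gts
    return read_gts, {name: visible(chain, read_gts)
                      for name, chain in nodes.items()}
```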
Further, in step 2), the consistency level that the distributed system needs to reach is determined according to actual service requirements, a consistency execution algorithm suited to that requirement is determined based on the established system consistency model, and the method of executing distributed transactions in the distributed system comprises the following steps:
2.1 According to whether the transaction needs to operate on the data on a plurality of resource management nodes, dividing the transaction related in the distributed system into two types of distributed transaction and single-machine transaction;
2.2 Adopting a consistency execution algorithm which is suitable for the consistency level requirement to execute the distributed transaction in the distributed system;
2.3 A consistency execution algorithm adapted to the consistency level requirements is used to execute the single transaction in the distributed system.
Further, in the step 2.2), a consistency execution algorithm adapted to the consistency level requirement is adopted, and a flow of executing the distributed transaction in the distributed system is as follows:
2.2.1 The client is responsible for sending out a request for executing the transaction T, and the access layer is responsible for receiving the request sent by the client and establishing a session relation with the client;
2.2.2 After receiving the request information, the access layer interacts with the metadata management cluster, analyzes the request after acquiring the related metadata, and distributes the request to different coordination nodes through routes;
2.2.3 The coordination node optimizes SQL and generates a physical execution plan, performs global transaction initialization work, records global transaction state information, then decomposes the execution plan into the execution plan on each data node, sends the execution plan to the corresponding data node, and records the global transaction state as an running state;
2.2.4) Each data node performs data operations according to the execution plan, using the algorithm suited to the consistency level requirement, and records the execution state of its local transaction; after the data node completes its local data reads and writes, it sends a "can verify" message to the coordination node. Specifically:
when transactional logical consistency or transactional causal consistency is required: the data node performs data operations and transaction scheduling according to the execution plan using the RUC-CC algorithm;
when linear consistency is required: the data node performs data operations according to the execution plan using the linear consistency assurance algorithm, and performs transaction scheduling based on the MVCC algorithm;
when complete consistency is required: the data node performs data operations and transaction scheduling according to the execution plan using the linear consistency assurance algorithm combined with the RUC-CC algorithm;
2.2.5) After receiving the "can verify" messages from all relevant data nodes, the coordination node records the global transaction state as validating and sends a "verify" instruction to all relevant data nodes;
2.2.6) After receiving the "verify" command, the data node enters the local verification process; if verification passes, it sends a "verification passed" instruction to the coordination node;
2.2.7) After receiving the "verification passed" instruction from all relevant data nodes, the coordination node determines, according to the consistency level required, whether it needs to interact with the global Gts generation cluster to acquire the transaction's global timestamp, and records the global transaction state as committed; it then starts two threads simultaneously: the first returns the result set to the access layer, which returns the execution result to the client; the second sends a commit command to all relevant data nodes;
when transactional logical consistency is required: after receiving the "verification passed" instruction from all relevant data nodes, the coordination node records the global transaction state as committed;
when transactional causal consistency, linear consistency, or complete consistency is required: after receiving the "verification passed" instruction from all relevant data nodes, the coordination node interacts with the global Gts generation cluster, acquires the transaction's global timestamp, and records the global transaction state as committed;
2.2.8 After each data node receives the commit command, the data node enters a local commit flow.
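A compact sketch of the coordinator-side flow in steps 2.2.3 to 2.2.8, using stub node and Gts-cluster classes. The class and method names are illustrative assumptions; the key point shown is that only the causal, linear, and complete consistency levels contact the global Gts generation cluster before commit, while the logical level does not.

```python
class GtsCluster:
    """Stub for the global Gts generation cluster."""
    def __init__(self):
        self._ts = 0
    def next_gts(self):
        self._ts += 1
        return self._ts

class StubDataNode:
    def __init__(self, name, validates=True):
        self.name, self.validates, self.committed_at = name, validates, None
    def execute(self, plan):      # local reads/writes per the plan
        pass
    def validate(self):           # local verification phase
        return self.validates
    def commit(self, gts):        # local commit phase
        self.committed_at = gts

NEEDS_GLOBAL_GTS = {"causal", "linear", "complete"}  # "logical" does not

def run_distributed_txn(nodes, plans, level, gts_cluster):
    states = ["Grunning"]                 # global transaction state log
    for node in nodes:
        node.execute(plans[node.name])    # 2.2.3-2.2.4: send plans, execute
    states.append("Gvalidating")          # 2.2.5-2.2.6: verification round
    if not all(node.validate() for node in nodes):
        states.append("Gaborted")
        return states, None
    # 2.2.7: only some consistency levels need a global timestamp
    gts = gts_cluster.next_gts() if level in NEEDS_GLOBAL_GTS else None
    states.append("Gcommitted")
    for node in nodes:                    # 2.2.8: local commit on each node
        node.commit(gts)
    return states, gts
```

With level "logical" the sketch commits without touching the Gts cluster; with "linear" it first acquires a global timestamp and propagates it to every node's local commit.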
Further, in the step 2.3), the consistency level the distributed system needs to reach is determined according to actual service requirements, and a consistency execution algorithm suited to that level is determined based on the established system consistency model. The method for executing a single-machine transaction in the distributed system comprises the following steps:
2.3.1 The client is responsible for sending out a request for executing the transaction T, and the access layer is responsible for receiving the request sent by the client and establishing a session relation with the client;
2.3.2 After receiving the request information, the access layer interacts with the metadata management cluster, analyzes the request after acquiring the related metadata, and distributes the request to different coordination nodes through routes;
2.3.3) The coordination node optimizes the SQL, generates a physical execution plan, performs transaction initialization work, records the transaction state as running, and sends the execution plan directly to the selected data node;
2.3.4 The data node performs data operation by adopting an algorithm which is adaptive to the requirement of the consistency level according to the execution plan, records the local transaction state, directly enters a verification process after the data node performs data read-write locally, and sends a verification passing instruction to the coordination node if the verification passes, and enters a local submission process;
when transactional logical consistency or transactional causal consistency is required: the data node performs data operations and transaction scheduling via the RUC-CC algorithm;
when linear consistency is required: the data node performs data operations via the linear consistency assurance algorithm and performs transaction scheduling based on the MVCC algorithm;
when complete consistency is required: the data node performs data operations and transaction scheduling via the linear consistency assurance algorithm combined with the RUC-CC algorithm;
2.3.5) After receiving the "verification passed" instruction from the data node, the coordination node determines, according to the consistency level required, whether it needs to interact with the global Gts generation cluster to acquire the transaction's global timestamp, records the transaction state as committed, and returns the result set to the access layer, which returns the execution result to the client;
when transactional logical consistency is required: after receiving the "verification passed" instruction from the data node, the coordination node records the global transaction state as committed;
when transactional causal consistency, linear consistency, or complete consistency is required: after receiving the "verification passed" instruction from the data node, the coordination node interacts with the global Gts generation cluster, acquires the transaction's global timestamp, and records the global transaction state as committed.
Due to the adoption of the above technical scheme, the invention has the following advantages:
1) The invention provides an optimistic concurrent access control algorithm based on dynamic timestamp adjustment (the RUC-CC algorithm): a decentralized, efficient transaction scheduling algorithm that guarantees the ACID properties of the system's global transactions and ensures that conflicting distributed transactions can be scheduled serializably.
2) Through the global Gts generation cluster, the invention ensures that all operations in the system are consistent with their order under a global clock; combining the global Gts generation cluster with the MVCC algorithm yields the linear consistency assurance algorithm, which can guarantee linear consistency, the strongest form of external consistency.
3) Through the design of the basic data item structure, the transaction consistency assurance algorithm and the linear consistency assurance algorithm are formally decoupled, so that they do not interfere with each other yet can be fused algorithmically and functionally. Multiple levels of consistency can thus be guaranteed, meeting the varied requirements of different application scenarios for consistency correctness and distributed database system efficiency.
Drawings
FIG. 1 is a diagram of a Postgres-XC architecture;
FIG. 2 is a diagram of a distributed database system architecture of the present invention;
FIG. 3 is a diagram of the global transaction state structure of the present invention;
FIG. 4 is a diagram of the local transaction state structure of the present invention;
FIG. 5 is a block diagram of a data item structure and its required maintenance information according to the present invention;
FIG. 6 is a timing diagram of global execution phase transaction execution under RUC-CC of the present invention;
FIG. 7 is a schematic diagram of the multi-level consistency of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
As shown in fig. 2, the distributed system for ensuring transaction consistency and linear consistency includes a plurality of clients (Client) and a database server composed of an access layer (Proxy), a meta information management cluster (Metadata Manager), a global Gts generation cluster (Gts Manager) and a transaction processing and storage layer. The client provides an interface for the user to interact with the database server and sends user requests to it; the access layer receives requests sent by clients and parses them to generate execution plans; the meta information management cluster uniformly manages the subsystems of the distributed system, for example by maintaining the routing information of each data node; the global Gts generation cluster generates global timestamps that uniquely order the global transactions in the distributed system, realizing linear consistency; the transaction processing and storage layer comprises a plurality of resource managers (Resource Manager, RM) that execute transaction logic according to the execution plan sent by the access layer, with the results returned to the client through the access layer.
Further, in the transaction processing and storage layer, the resource management nodes are divided into two types: first, data nodes of the distributed database, which store data in partitions; second, host nodes, which act as coordination nodes for transaction processing. Accordingly, there are two schemes for allocating the RMs:
master-slave mode: a part of RMs are exclusively used as coordination nodes of transactions, and the part of RMs is represented by host nodes; while the remaining RMs are used as data nodes.
Peer-to-peer mode: all RMs are peer-to-peer, all RMs can act as host nodes, and all RMs can act as data nodes, i.e., each RM has both data node and transaction coordination node functions.
Further, the global timestamp (Gts) generated by the global Gts generation cluster consists of eight bytes, composed using a hybrid physical clock scheme:
a) The first 44 bits are the physical timestamp value (i.e., the Unix timestamp, accurate to milliseconds). They can represent 2^44 unsigned integers, so in theory about 557 years of timestamps can be represented.
b) The last 20 bits are a monotonically increasing count within each millisecond (i.e., up to 2^20 − 1, about one million per millisecond).
c) Based on this data structure, if the transaction throughput of a single machine is 100,000 transactions per second, a distributed database cluster containing 10,000 data nodes can theoretically be supported. Meanwhile, the number of available Gts values represents the total number of transactions that the distributed system can theoretically support; i.e., based on this structure, the distributed system can theoretically support (2^44 − 1) × 2^20 transactions.
d) The number of bits of Gts can be extended as needed to meet the support for more node numbers, transaction numbers.
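Under the layout just described (high 44 bits of millisecond physical time, low 20 bits of per-millisecond counter), a Gts can be packed and unpacked as below; the helper names are illustrative, not from the patent:

```python
PHYS_BITS, SEQ_BITS = 44, 20  # physical-time bits, per-millisecond counter bits

def pack_gts(millis, seq):
    """Pack a Unix millisecond timestamp and a counter into one 64-bit Gts."""
    assert millis < (1 << PHYS_BITS) and seq < (1 << SEQ_BITS)
    return (millis << SEQ_BITS) | seq

def unpack_gts(gts):
    """Split a Gts back into (millis, seq)."""
    return gts >> SEQ_BITS, gts & ((1 << SEQ_BITS) - 1)

# 2^44 milliseconds is roughly 557 years of representable physical time.
representable_years = (1 << PHYS_BITS) / (1000 * 60 * 60 * 24 * 365)
```

Because the physical time occupies the high-order bits, Gts values compare correctly as plain integers: later milliseconds, or later counts within the same millisecond, always yield a larger Gts.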
The global Gts generation cluster provides services in a one-master, multiple-slave form (one master node, multiple slave nodes): multiple servers form a cluster to offer a highly available service, so that the cluster does not become a single-point performance bottleneck.
Further, the basic data structures involved in the distributed system of the invention comprise the global transaction state table, the local transaction state table, the data item structure, the transaction read and write sets, and the communication protocols and messages. Each data structure is described in detail below.
As shown in fig. 3, the global transaction state table is referred to as GlobalTS, i.e., maintains transaction states from a distributed system global perspective. The GlobalTS structure exists on each node in the host node set. For the global transaction T, its global transaction state is maintained on the coordinator node host node of the global transaction T. For each global transaction T, its global transaction state is represented by a six-tuple { TID, lowts, uppts, status, gts, nodes }, where the meaning of each field is:
a) TID: representing the unique identification of the transaction, which consists of 8 bytes, all the transactions in the system are uniquely identified through two-part combination, and the TIDs are allocated at the time of transaction initialization:
(1) The first 14 bits record the number of the transaction's host node, i.e., the coordination node that handles the transaction. The 14 bits can represent 2^14 − 1 = 16383 unsigned integers, which corresponds to the estimated number of nodes that the global timestamp Gts can support.
(2) The last 50 bits are filled by a monotonically increasing count within the host node, distinguishing different transactions on that node; with 2^50 − 1 values in total, this theoretically ensures that TIDs do not repeat within the total transaction count specified by Gts.
(3) If the last 50 bits of the TID on a host node have all been allocated (reaching 2^50 − 1), the TID needs to be recycled through a TID multiplexing mechanism, designed with reference to the freezing mechanism provided by PostgreSQL, which is well known to those skilled in the art and is not described in detail here. In practice, the TID as designed is theoretically sufficient for normal system operation.
b) Lowts: representing the lower bound of the logical commit timestamp of the transaction, namely the earliest logical time that the transaction can commit, taking the value as a non-negative integer and taking 8 bytes;
c) Uppts: representing the upper bound of the logical commit timestamp of the transaction, namely the latest logical time that the transaction can commit, taking the value as a non-negative integer and taking 8 bytes;
The lower and upper bounds of the transaction's logical commit timestamp constitute the logical lifecycle of the transaction: [Lowts, Uppts]. The initial lifecycle of a transaction is [0, +∞), and the final logical commit timestamp T.cts of the global transaction T is taken from the interval [Lowts, Uppts]. The logical lifecycle of a transaction is relative; its adjustment depends on the dynamic timestamp allocation (DTA) algorithm, described below.
d) Status: represents the current global state of the transaction. In the present invention, Grunning indicates the transaction is executing; Gvalidating indicates the transaction is in the verification stage; Gcommitting indicates the transaction has completed verification and is in the commit phase; Gaborting indicates the transaction is in the rollback phase; Gcommitted indicates the transaction has been globally committed; Gaborted indicates the transaction has been globally rolled back.
e) Gts: the timestamp representing the global commit/rollback completion of the transaction is generated by "global Gts generate cluster" to ensure global ordering.
The TID serves as the unique number of a global transaction within the system, assigned when the global transaction begins; TIDs of transactions on the same host node are ordered, so the TID can be considered a manifestation of a partial order relation. Gts is a global timestamp representing the order of transactions under the system's global view and is therefore considered a total order relation. Both TIDs and Gts uniquely identify a transaction, but they differ in the kind of order they convey.
f) Nodes: represents the set of data nodes involved in the current transaction.
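The six-tuple global transaction state and the 14+50-bit TID layout described above can be modeled as follows. This is a schematic sketch: the field defaults (initial lifecycle [0, +∞), initial state Grunning) follow the description, while the Python types themselves are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

NODE_BITS, COUNTER_BITS = 14, 50

def make_tid(host_node_id, counter):
    """High 14 bits: host-node number; low 50 bits: per-node counter."""
    assert host_node_id < (1 << NODE_BITS) and counter < (1 << COUNTER_BITS)
    return (host_node_id << COUNTER_BITS) | counter

@dataclass
class GlobalTS:
    tid: int                         # unique transaction identifier
    lowts: int = 0                   # lower bound of logical commit timestamp
    uppts: float = float("inf")      # upper bound; lifecycle starts as [0, +inf)
    status: str = "Grunning"         # global transaction state
    gts: Optional[int] = None        # set by the global Gts cluster at commit
    nodes: set = field(default_factory=set)  # data nodes the transaction touches
```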
Further, as shown in FIG. 4, to maintain the state of transactions on each resource management node, the present invention maintains a local transaction state table, called LocalTS, for each transaction. LocalTS structure exists on each RM. For the global transaction T, there are multiple copies of its corresponding local transaction state, which are maintained on each RM involved in the global transaction T. The local transaction state table contains { TID, lowts, uppts, status }4 fields, which means:
a) TID: the unique identification of the transaction is distributed when the transaction starts, 8 bytes are taken, and the unique identification has the same meaning as the TID in the global transaction state;
b) Lowts: the meaning of Lowts in the global transaction state is the same, 8 bytes are taken;
c) Uppts: the meaning of the Uppts in the global transaction state is the same, and 8 bytes are taken;
d) Status: describes the local state of the transaction, with a size of 4 bytes; a transaction has 4 local states: running (Running), local verification passed (Validated), local commit completed (Committed), and local rollback completed (Aborted).
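A matching sketch of the four-field local transaction state, with the four local states as string constants; the transition helper is an illustrative convenience, not part of the patent's specification:

```python
from dataclasses import dataclass

LOCAL_STATES = ("Running", "Validated", "Committed", "Aborted")

@dataclass
class LocalTS:
    tid: int                      # same TID as in the global transaction state
    lowts: int = 0                # per-node logical-timestamp lower bound
    uppts: float = float("inf")   # per-node logical-timestamp upper bound
    status: str = "Running"

    def advance(self, new_status):
        # Toy guard: only move into one of the known local states.
        assert new_status in LOCAL_STATES
        self.status = new_status
```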
Further, as shown in fig. 5, in the basic data item structure of the present invention, each data item includes two parts: data item header information and several data versions. Across these two parts, two groups of elements related to distributed transactions are included, so that linear consistency and distributed transaction consistency are decoupled, in the following sense:
A first group: the linear consistency basis.
a) gts: the global timestamp, alternatively referred to as a global linear identification, is used to represent a globally unique order of transactions in the distributed system, with one gts for each data version. The assignment process is as follows:
(1) when the transaction generating the data item is not submitted globally, the gts field is multiplexed to record the global transaction number TID of the transaction, and is used for uniquely identifying the transaction generating the version currently, so that the global transaction state of the transaction is conveniently positioned.
(2) When the transaction that generated the present data item has been globally committed, gts is assigned a global timestamp Gts recorded by the present transaction in the global transaction state, marking the global order of the present transaction within the entire distributed system, thereby achieving linear consistency.
b) info_bit: a 1-bit flag identifying whether the gts field currently records a Gts or a TID. If info_bit is 1, the field records a Gts; if 0, a TID.
Second group: distributed transaction consistency basis.
a) wts: each version of the data item maintains a wts, which records the logical timestamp of the transaction that created that version;
b) rts: recording the logical timestamp of the latest transaction reading the data item, wherein each data item maintains one rts, and therefore, the data item is maintained in the data item header information;
to ensure distributed transaction consistency, each data item also records which transactions are reading and writing data items, in the sense that:
a) RTlist: recording an active transaction list which has accessed the latest version of the data item, wherein each element of the list records the transaction TID of a specific transaction;
b) WT: active transactions intended to modify (write) the data item are recorded, in the form of a List, the particular record content of the element being the TID of the active transaction;
c) The logic lifecycle of the transaction can be adjusted by RTlist, WT.
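The two groups of per-version and per-item fields above can be sketched as follows. The commit helper shows the gts/info_bit multiplexing: while the writing transaction is uncommitted, the version's gts field carries its TID (info_bit = 0); at global commit it is overwritten with the real Gts (info_bit = 1). The Python representation is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DataVersion:
    value: bytes
    gts: int        # TID while uncommitted, real Gts after global commit
    info_bit: int   # 1 -> gts field holds a Gts; 0 -> it holds a TID
    wts: int        # logical timestamp of the creating transaction

    def mark_committed(self, global_ts):
        # Replace the multiplexed TID with the assigned global timestamp.
        self.gts, self.info_bit = global_ts, 1

@dataclass
class DataItem:
    rts: int = 0                                 # newest reader's logical ts
    rtlist: list = field(default_factory=list)   # TIDs that read the latest version
    wt: list = field(default_factory=list)       # TIDs intending to write the item
    versions: list = field(default_factory=list)
```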
Further, a read-set write-set data structure for a transaction:
the global read set of the transaction T records all data items read by transaction T; the set of data items read by T on a certain resource management node RM constitutes T's local read set on that RM, which is a subset of the global read set; the union of T's local read sets over all relevant RMs equals its global read set. In the present invention, a local read set of transaction T is maintained on each RM involved in the transaction. Recording the read set of transaction T serves 2 purposes:
a) Updating the rts field of the data item in the read set when the transaction T is submitted;
b) At transaction T commit, T is deleted from RTlist for each read-set element x;
in the present invention, a linked list structure is used to maintain a read set of transaction T, and each linked list node represents a read set element x, which is composed of 4 fields:
a) BlockAddress: taking 8 bytes and a block address, and indicating that the data item x corresponds to the block address;
b) Offset: taking 4 bytes, and indicating the offset of the data item x in the block;
c) Size: taking 4 bytes; records the size of the tuple corresponding to data item x, i.e., the byte count of the Value field;
d) Value: variable length; records the actual value of data item x;
the global write set of the transaction T records all data items the transaction needs to update; the local write set of T on a certain RM records which data items on that RM the transaction will update. The local write set is a subset of the global write set, and the union of T's local write sets over all RMs equals its global write set. Recording the write set of transaction T serves 2 purposes:
a) In the verification stage, the host node of T divides the global write set into a plurality of local write sets according to the difference of RMs where write elements are located, and sends each local write set to the relevant RM in the form of a message, so that the RM is required to create a new data version according to the values of the elements in the local write sets;
b) During the commit phase, each child RM of transaction T cleans up WTs of each write set element;
in the invention, a linked list structure is used for maintaining a write set of a transaction T, and each linked list node corresponds to a write set data item y and consists of 5 fields:
a) BlockAddress: taking 8 bytes and a block address, and indicating that the data item y corresponds to the block address;
b) Offset: taking 4 bytes, and indicating the offset of the data item y in the block by the offset in the block;
c) Size: taking 4 bytes; records the size of the tuple corresponding to data item y, i.e., the byte count of the NewValue field;
d) NewValue: variable length; records the updated value of data item y;
e) OperationType: taking 1 byte; indicates whether the operation is an update, insert or delete: a value of 0 indicates update, 1 indicates insert, and 2 indicates delete.
During the read phase of transaction T, the global write set of T will be maintained on the host node of T; in the verification stage of T, the T divides the global write set into a plurality of local write sets according to the difference of RMs where write elements are located, and each local write set is sent to a corresponding RM for maintenance.
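Using the field widths listed above (8-byte BlockAddress, 4-byte Offset, 4-byte Size, variable-length NewValue, 1-byte OperationType), a write-set element could be serialized as below; the byte order and the encoding itself are assumptions, since the patent specifies field sizes but not a wire format.

```python
import struct

OP_UPDATE, OP_INSERT, OP_DELETE = 0, 1, 2  # OperationType values from the text

def pack_write_element(block_address, offset, new_value, op_type):
    # Fixed 16-byte header (>QII), then the value bytes, then 1 type byte.
    header = struct.pack(">QII", block_address, offset, len(new_value))
    return header + new_value + struct.pack(">B", op_type)

def unpack_write_element(buf):
    block_address, offset, size = struct.unpack_from(">QII", buf, 0)
    new_value = buf[16:16 + size]
    (op_type,) = struct.unpack_from(">B", buf, 16 + size)
    return block_address, offset, new_value, op_type
```

An encoding like this is what a host node could use when shipping each local write set to its RM during the verification stage.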
Further, in the present invention, the coordination node (host node) and the data nodes, as well as the host node and the global Gts generation cluster, need to communicate in the form of messages. According to the sender and receiver, the communication protocols and messages are classified into the following 4 major classes:
1. The host node sends a message to the data node, which mainly includes 3 kinds:
a) Read data request message, readRequestMessage: in the read phase of transaction T, host node sends the message to RM requesting to read the relevant data on RM. The message includes the following 4 fields:
(1) TID: transaction identification, taking 8 bytes, indicating which transaction requests to read data;
(2) lowts: taking 8 bytes of the lower bound of the logical timestamp of the transaction, and indicating the lower bound of the logical timestamp of the transaction T on a host node;
(3) uppts: the upper bound of the logical timestamp of the transaction is taken to be 8 bytes, indicating the upper bound of the logical timestamp of the transaction T on the host node.
(4) ReadPlan: reading a query plan of the data item x;
b) Validating the request message, validateRequestMessage: in the verification stage of the transaction T, the host node sends the message to the data node requesting the data node to perform local verification of the transaction T. The message includes the following 4 fields:
(1) type: the message type, take 1 byte, point out this message as verifying the request message;
(2) TID: transaction identification, taking 8 bytes, indicating which transaction needs to be locally verified by the data node;
(3) lowts: taking 8 bytes of the lower bound of the logical timestamp of the transaction, and indicating the lower bound of the logical timestamp of the transaction T on a host node;
(4) Uppts: the upper bound of the logical timestamp of the transaction is taken to be 8 bytes, indicating the upper bound of the logical timestamp of the transaction T on the host node.
c) Write commit/rollback request, commitOrAbortRequestMessage: the write commit/rollback phase of transaction T, the host node sends the message to the data node requesting the data node to complete the local commit or rollback. The message includes the following 5 fields:
(1) type: the message type, take 1 byte, indicate the message is a write commit/rollback request message;
(2) TID: transaction identification, taking 8 bytes, indicating which transaction the data node is required to perform local commit;
(3) IsAbort: if the value is 1, it indicates that the transaction T needs to be rolled back, and other values do not need to be rolled back.
(4) Cts: the transaction logic commit timestamp, taking 8 bytes, indicating that the host node is the logical commit timestamp selected by T;
(5) gts: transaction global timestamp, taken in 8 bytes, indicates that global Gts generates a global commit timestamp assigned to transaction T by the cluster.
2. The message sent by the data node to the host node mainly comprises 2 kinds of messages;
a) Read request feedback message, readReplyMessage: the read phase of transaction T, the data node returns the value of the read data item to the host node, the message containing the following 6 fields:
(1) TID: transaction identification, taking 8 bytes, indicating which transaction read request feedback message;
(2) IsAbort: if the value is 1, it indicates that the transaction T needs to be rolled back, and other values do not need to be rolled back.
(3) Lowts: taking 8 bytes of a logical timestamp lower bound of a transaction T on a local data node;
(4) uppts: the upper bound of the transaction logical timestamp is taken to be 8 bytes, indicating the upper bound of the logical timestamp of the transaction T on the local data node.
(5) Size: taking 4 bytes, indicating the size of the Value field;
(6) value: a data item value, the value of the read data item being recorded;
b) Local verification feedback message, LocalValidateMessage: after local verification passes, the data node sends this message to the host node; it includes the following 5 fields:
(1) type: the message type, take 1 byte, point out this message as the feedback message of local verification;
(2) TID: transaction identification, taking 8 bytes, indicating which transaction locally verifies the feedback message;
(3) IsAbort: if the value is 1, the transaction needs to be rolled back, and other values do not need to be rolled back.
(4) Lowts: taking 8 bytes of a logical timestamp lower bound of a transaction T on a data node;
(5) Uppts: the upper bound of the logical timestamp of the transaction is taken to be 8 bytes, indicating the upper bound of the logical timestamp of the transaction T on the data node.
3. The host node sends messages to the global Gts generation cluster, mainly comprising 1 kind, namely: the global timestamp request message, GtsRequestMessage. After transaction T passes verification, the host node sends this message to the global Gts generation cluster to request a global timestamp for transaction T; the message mainly contains the following 2 fields:
a) Type: the message type, take 1 byte, point out this message as the global time stamp request message;
b) TID: transaction identification, taking 8 bytes, indicates for which transaction a global timestamp is requested.
4. The global Gts generates a message that the cluster sends to host node, mainly including 1 kind, namely: the global timestamp requests a feedback message, gtsReplyMessage. Global Gts generates a global timestamp assigned to transaction T and sends it to the host node in the form of the message. The message mainly contains the following 3 fields:
a) Type: the message type, 1 byte is taken, and the message is indicated as a global timestamp request feedback message;
b) TID: transaction identification, taking 8 bytes, indicating which transaction is assigned a global timestamp;
c) Gts: the transaction global timestamp, taking 8 bytes, indicates the global timestamp value of the transaction.
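The Gts request/reply exchange (message classes 3 and 4 above) can be encoded as a 1-byte Type plus the 8-byte fields listed; the concrete type codes and the big-endian layout are assumptions for illustration.

```python
import struct

MSG_GTS_REQUEST, MSG_GTS_REPLY = 0x03, 0x04  # assumed type codes

def encode_gts_request(tid):
    # GtsRequestMessage: Type (1 B) + TID (8 B).
    return struct.pack(">BQ", MSG_GTS_REQUEST, tid)

def encode_gts_reply(tid, gts):
    # GtsReplyMessage: Type (1 B) + TID (8 B) + Gts (8 B).
    return struct.pack(">BQQ", MSG_GTS_REPLY, tid, gts)

def decode_message(buf):
    msg_type = buf[0]
    if msg_type == MSG_GTS_REQUEST:
        (tid,) = struct.unpack_from(">Q", buf, 1)
        return {"type": msg_type, "tid": tid}
    tid, gts = struct.unpack_from(">QQ", buf, 1)
    return {"type": msg_type, "tid": tid, "gts": gts}
```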
Based on the above description of the framework of the distributed system for ensuring the transaction consistency and the linear consistency, the invention also provides a multi-level consistency method for ensuring the transaction consistency and the linear consistency, which comprises the following steps:
1) A unified consistency model (United Consistency Model, abbreviated UCM) that enables multi-level consistency is established.
2) And determining a consistency level which needs to be achieved by the distributed system according to actual service requirements, and determining a consistency execution algorithm suitable for the consistency level requirements based on an established system consistency model to execute distributed transactions and single-machine transactions in the distributed system.
In step 1), the method for establishing the system consistency model comprises the following steps:
1.1) Adopt an OCC (Optimistic Concurrency Control, hereinafter OCC) strategy based on DTA (Dynamic Timestamp Allocation, hereinafter DTA) for transaction concurrency access control, and establish the RUC-CC algorithm for ensuring transaction consistency in the distributed system.
1.2) Based on the global timestamp generated by the global Gts generation cluster and the global transaction state, establish a global-timestamp-based linear consistency guarantee algorithm for guaranteeing linear consistency among transactions.
1.3) Adopt a method of performing two data reads, establishing a twice-read linear consistency guarantee algorithm, likewise used for guaranteeing linear consistency among transactions.
1.4) Combine the transaction consistency and linear consistency of steps 1.1) to 1.3) with the MVCC (Multi-Version Concurrency Control, hereinafter MVCC) algorithm to establish a unified model capable of satisfying multiple consistency levels.
In step 1.1), in order to realize decentralized distributed transaction scheduling and ensure transaction consistency, the invention adopts a DTA-based OCC strategy for transaction concurrency access control. The strategy uses the OCC algorithm framework combined with DTA to reduce the transaction rollback rate, thereby improving transaction concurrency; for convenience of description it is named the RUC-CC algorithm.
As shown in fig. 6, under RUC-CC scheduling, the life cycle of a transaction is divided into 2 phases: the global initialization phase and the global execution phase. Both phases are completed, under the coordination of the host node, on the host node and on the data nodes involved in the transaction's operations. The workflow of each phase is therefore refined according to the different nodes. The method specifically comprises the following steps:
1.1.1) Global initialization phase: for the transaction T sent by the client, the corresponding initialization work is completed on the host node.
1.1.2) Global execution phase: the global execution phase of a transaction is divided into 3 stages: the read stage, the verification stage, and the write-commit/rollback stage. Under the coordination of the host node, each RM involved in the operation executes the transaction, and the corresponding entries of the committed or rolled-back transaction in the transaction state tables are then cleared.
In step 1.1.1) above, the global initialization phase completes the corresponding initialization work on the host node for the transaction T sent by the client, and specifically comprises the following steps:
1.1.1.1) Assign a globally unique transaction number TID to transaction T;
1.1.1.2) Record the state of transaction T in GlobalTS on the host node, with the state set to Grunning and Lowts and Uppts initialized to 0 and +∞, respectively.
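A minimal sketch of this initialization, using a plain dict as a stand-in for GlobalTS and a counter as a stand-in for globally unique TID allocation (both data-structure choices are assumptions for illustration):

```python
import itertools
import math

GLOBAL_TS = {}                      # hypothetical in-memory stand-in for GlobalTS
_tid_counter = itertools.count(1)   # stand-in for globally unique TID allocation

def init_transaction() -> int:
    """Steps 1.1.1.1)-1.1.1.2): assign a TID and register T as Grunning
    with the logical timestamp interval [0, +inf)."""
    tid = next(_tid_counter)
    GLOBAL_TS[tid] = {"Status": "Grunning", "Lowts": 0, "Uppts": math.inf}
    return tid
```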
In the step 1.1.2), the specific flow of the global execution stage includes the following steps:
1.1.2.1) Read stage: transaction T reads the required data according to its execution logic and writes updates into its local memory (i.e., the local memory of the host node).
1.1.2.2) Verification stage: transaction T verifies whether it conflicts with other transactions and obtains a verification result.
1.1.2.3) Write-commit or rollback stage: transaction T performs either write-commit or rollback depending on the result of the verification stage.
In step 1.1.2.1) above, when transaction T needs to read a data item x, the read stage comprises the following steps:
I. The host node of transaction T sends a read data request message ReadRequestMessage (rrqm) for reading x to the data node where data item x resides.
The four fields of the read data request message rrqm take the following values:
a) TID: the TID of transaction T;
b) Lowts: the Lowts of transaction T on the host node;
c) Uppts: the Uppts of transaction T on the host node;
d) ReadPlan: the query plan for transaction T's read of x.
II. After receiving the message rrqm, the data node where data item x resides first establishes or updates the entry of transaction T in its local transaction state table, then searches for the visible version of data item x within the logical life cycle of transaction T, and sends a read request feedback message rrpm to the host node of transaction T.
III. After the host node of transaction T receives the read request feedback message rrpm sent by the data node, it judges whether rollback is needed; if so, it enters the global rollback stage, otherwise it continues to execute the transaction.
In step II of step 1.1.2.1) above, the specific flow by which the data node of data item x sends the read request feedback message to the host node of transaction T is as follows:
(1) Check whether the local transaction state table LocalTS of the data node contains information on transaction T:
a) If not, initialize the information on transaction T, i.e., insert a record into LocalTS whose values are rrqm.TID, rrqm.Lowts, rrqm.Uppts, and the state Running, respectively;
b) If so, i.e., transaction T has already accessed other data items on this data node before reading data item x, update the information on transaction T so that T.Lowts = max(T.Lowts, rrqm.Lowts) and T.Uppts = min(T.Uppts, rrqm.Uppts).
(2) Check whether the lower bound of the logical commit timestamp of transaction T is still no greater than its upper bound, i.e., check whether T.Lowts <= T.Uppts:
If yes, continue reading data item x;
Otherwise, update the state of transaction T in LocalTS to Aborted, i.e., T.Status = Aborted, and return an Abort message to the host node of transaction T, i.e., send a read request feedback message rrpm with rrpm.IsAbort = 1 to the host node of transaction T.
(3) The data node finds the appropriate visible version of data item x according to the logical life cycle [Lowts, Uppts] of transaction T.
To find the appropriate visible version of x, checking starts from the latest committed version: if T.Uppts is greater than the wts of the latest version, the latest version is the appropriate visible version. Otherwise, the next older version is examined, and so on, until the first data version x.v satisfying T.Uppts > wts is found, where wts is the creation timestamp of x.v.
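The newest-to-oldest search described above can be sketched as follows; version records are simplified to dicts with a wts field, which is an illustrative layout rather than the patent's storage format:

```python
def find_visible_version(versions, t_uppts):
    """Scan the committed versions from newest to oldest and return the first
    version whose creation timestamp wts satisfies t_uppts > wts (step (3))."""
    for v in versions:          # versions assumed ordered newest-first
        if t_uppts > v["wts"]:
            return v
    return None                 # no version is visible within T's life cycle
```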
(4) After the appropriate version x.v of data item x is found, modify the Lowts of transaction T so that T.Lowts > x.v.wts (eliminating write-read anomalies). In addition, if the found version is the latest version of data item x, the following operations need to be performed:
a) Check whether the WT of x.v is empty (the WT records the transaction identifier of a transaction that is modifying x and has passed verification); if it is not empty (suppose its value is TID1, and the transaction corresponding to TID1 is T1), adjust the Uppts of transaction T to satisfy T.Uppts < T1.Lowts (eliminating read-write conflicts);
b) Add T.TID to the RTlist of x.v;
c) Adding data item x to the local read set of transaction T;
(5) Return a read request feedback message ReadReplyMessage (rrpm) to the host node of transaction T.
The Lowts and Uppts fields of rrpm record the lower and upper bounds, respectively, of the logical timestamp of transaction T on the current data node, and the Value field records the value of the data item read.
In step III of step 1.1.2.1) above, after the host node of transaction T receives the read request feedback message rrpm from the data node, the specific flow for judging whether rollback is needed is as follows:
(1) Check whether the received message is an Abort message, i.e., check whether rrpm.IsAbort equals 1; if so, enter the global rollback stage, otherwise continue;
(2) Update the state of transaction T in GlobalTS: T.Lowts = max(T.Lowts, rrpm.Lowts), T.Uppts = min(T.Uppts, rrpm.Uppts);
(3) Check whether T.Lowts in GlobalTS is greater than T.Uppts; if so, enter the global rollback stage, otherwise continue executing the transaction.
In rule (3), if the host node decides to roll back transaction T, it must modify the state of T in GlobalTS to Gaborting and notify the relevant child nodes to perform local rollback.
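Steps (1) to (3) amount to intersecting the host-side interval with the interval returned by the data node; a hedged sketch, with GlobalTS entries and rrpm modelled as plain dicts:

```python
def on_read_reply(t, rrpm):
    """Host-node handling of a ReadReplyMessage: abort on an Abort message,
    otherwise intersect the timestamp intervals and re-check legality."""
    if rrpm.get("IsAbort") == 1:
        t["Status"] = "Gaborting"          # rule (1): data node already aborted
        return "abort"
    t["Lowts"] = max(t["Lowts"], rrpm["Lowts"])   # rule (2): intersect intervals
    t["Uppts"] = min(t["Uppts"], rrpm["Uppts"])
    if t["Lowts"] > t["Uppts"]:            # rule (3): empty interval => rollback
        t["Status"] = "Gaborting"
        return "abort"
    return "continue"
```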
as can be seen from the above rules, during the read phase of the transaction T, communication mainly takes place between the host node of the transaction T and the relevant child RM. Two communications are required for transaction T to successfully read data each time:
a) The host node of the transaction T sends read data request information to the related child RM;
b) The related sub-data node sends the read request feedback information to the host node;
thus, the read phase is performed at most 2n communications, with a maximum traffic of n× (request message size + corresponding message size), where n is the number of remote reads. An optimization mode for saving communication times is as follows: when a transaction T requires multiple data of a related child data node, the request may be sent in packets and the data may be read in batches.
In step 1.1.2.2) above, the specific procedure by which the verification stage checks whether transaction T conflicts with other transactions is as follows:
I. The host node of transaction T first modifies the state of transaction T in GlobalTS to Gvalidating, then sends a verification request message vrm, together with the corresponding local write set, to each RM involved in T.
The Lowts and Uppts fields of the verification request message ValidateRequestMessage (vrm) record the lower and upper bounds, respectively, of the logical timestamp of transaction T in GlobalTS; the verification request message is sent to each data node together with that node's local write set.
II. After each data node involved in transaction T receives the verification request message vrm, it performs the local verification operation.
III. After the host node of transaction T receives the local verification feedback messages lvm from all the data nodes, it determines from the received messages whether transaction T can pass verification.
In step II of step 1.1.2.2), the local verification operation performs the following steps in order:
(1) Update T in LocalTS: T.Lowts = max(T.Lowts, vrm.Lowts) and T.Uppts = min(T.Uppts, vrm.Uppts). Note that updating the logical timestamp information of the transaction in the local transaction state table here serves transaction concurrency access control, i.e., it ensures transaction consistency;
(2) Check whether T.Lowts is greater than T.Uppts; if so, verification fails, and an Abort message is returned to the host node of transaction T (thereby causing global rollback), i.e., a local verification feedback message lvm with lvm.IsAbort = 1 is sent; otherwise, proceed to the next verification step;
(3) For each data item y in the write set, check whether the WT of y is empty:
If it is not empty, another transaction is modifying data item y and has already entered the verification stage, so transaction T must be rolled back to eliminate the write-write conflict, i.e., an Abort message is sent to the host node of transaction T;
Otherwise, continue with the next operation: lock the WT of data item y to prevent other concurrent transactions from modifying y concurrently (the lock applied to data item y is exclusive with respect to modification operations on the WT of data item y).
(4) Update the WT of each data item y in the write set to T.TID (indicating that transaction T, which has entered the verification stage, is about to modify y), and adjust the timestamp lower bound of transaction T in the local transaction state table to be greater than the rts of y, i.e., T.Lowts = max(T.Lowts, y.rts + 1) (eliminating read-write conflicts). In practice, the WT of y is assigned using the lock-free CAS (Compare-and-Swap, a well-known lock-free technique) operation to improve performance (the ordinary manner of assigning the WT of y after locking is not excluded either).
(5) Check whether T.Lowts is greater than T.Uppts; if so, verification fails: roll back locally and then return an Abort message to the host node of T; otherwise, proceed to the next verification step;
(6) For each element y in the write set, adjust the timestamp of transaction T or of the transactions in the RTlist of y to eliminate read-write conflicts.
The adjustment rules are as follows:
a) Read-write conflict resolution against completed readers: this transaction writes a data item that other, already-completed transactions have read, so the logical occurrence of this transaction's write operation is deferred until after the read operations of those completed transactions.
First, find all transactions T1 in the RTlist of y that are already committed or have passed local verification, and adjust the lower bound of T's own timestamp interval to be greater than the Uppts of T1, i.e., T.Lowts = max(T.Lowts, T1.Uppts + 1).
Then, check whether the timestamp interval of transaction T is still legal; if not, return an Abort message; otherwise, update the local transaction state of transaction T to Validated, i.e., T.Status = Validated, and proceed to the next adjustment.
b) Read-write conflict resolution against ongoing readers: this transaction writes while other transactions are still reading, so those transactions must not read the data written by this transaction:
Find all transactions T2 in the RTlist of y that are in the Running state, and adjust the timestamp interval of T2 so that its upper bound is smaller than the Lowts of T, i.e., T2.Uppts = min(T2.Uppts, T.Lowts - 1). If Lowts > Uppts then holds for transaction T2, transaction T2 is notified that it should roll back.
(7) Reaching this step indicates that transaction T has passed local verification; a new version of y is created from the updated value of y, but a flag must be set to indicate that this new version is not yet globally committed and is not externally visible under the RUC-CC protocol;
(8) Return the local verification feedback message lvm of transaction T to the host node of transaction T, where the Lowts and Uppts fields of lvm record the lower and upper bounds, respectively, of the logical timestamp of transaction T on the local data node. Note that if transaction T fails local verification, the state of transaction T in LocalTS must be updated to Aborted, i.e., T.Status = Aborted.
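The local verification steps (2) to (6) can be condensed into the following sketch. The data structures are simplified stand-ins, the CAS/locking detail of step (4) is omitted, and version creation and messaging (steps (7) and (8)) are left out; this is an illustration of the timestamp arithmetic, not the patent's implementation:

```python
def local_validate(t, write_set, data, local_ts):
    """Condensed local verification. `data` maps item keys to
    {"WT", "rts", "RTlist"}; `local_ts` is the LocalTS stand-in."""
    if t["Lowts"] > t["Uppts"]:
        return "abort"                                   # step (2)
    for key in write_set:
        y = data[key]
        if y["WT"] is not None:                          # step (3): write-write conflict
            return "abort"
        y["WT"] = t["TID"]                               # step (4): claim y
        t["Lowts"] = max(t["Lowts"], y["rts"] + 1)       # move T past y's readers
    if t["Lowts"] > t["Uppts"]:
        return "abort"                                   # step (5)
    for key in write_set:                                # step (6): adjust readers
        for tid in data[key]["RTlist"]:
            r = local_ts[tid]
            if r["Status"] in ("Committed", "Validated"):
                t["Lowts"] = max(t["Lowts"], r["Uppts"] + 1)   # rule a)
            elif r["Status"] == "Running":
                r["Uppts"] = min(r["Uppts"], t["Lowts"] - 1)   # rule b)
    if t["Lowts"] > t["Uppts"]:
        return "abort"
    t["Status"] = "Validated"
    return "ok"
```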
In step III of step 1.1.2.2), after the host node of transaction T receives the local verification feedback messages lvm from all the data nodes, it determines from the received messages whether transaction T can pass verification. The main cases are as follows:
(1) If any lvm contains an IsAbort field equal to 1, indicating that the transaction failed at least one local verification, decide to globally roll back transaction T; at the same time, update the state of the transaction in GlobalTS to Gaborting, and notify all child nodes to complete rollback, i.e., send a write-commit/rollback message coarm with coarm.IsAbort = 1 to the relevant data nodes.
(2) Otherwise, intersect all the received timestamp intervals of transaction T to obtain a new timestamp interval [T.Lowts, T.Uppts]. If T.Lowts > T.Uppts, decide to globally roll back the transaction, update the state of T in GlobalTS to Gaborting, and notify all child nodes to complete rollback; otherwise, proceed to the next step;
(3) Transaction T is determined to have passed verification, and a point is selected from the interval [T.Lowts, T.Uppts] as the logical commit timestamp cts of transaction T; for example, T.Lowts may be chosen as the logical commit timestamp of T.
(4) Update T.Lowts = T.Uppts = T.cts in GlobalTS, and update the state of the transaction in GlobalTS to Gcommitting; at this point, the global Gts generation cluster is requested to allocate a global timestamp, which is recorded in the Gts field of the global transaction state.
(5) Notify the relevant data nodes to complete the commit, i.e., send a write-commit/rollback message coarm to the relevant data nodes, where coarm.IsAbort = 0, and coarm.cts and coarm.Gts record the logical timestamp and global timestamp of the transaction, respectively.
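A sketch of the host-node decision in cases (1) to (4); the child-node notifications of cases (1), (2), and (5) are omitted and message sending is replaced by a return value, so this is an illustration of the decision logic only:

```python
def global_validate(t, replies):
    """Decide commit or rollback from all local verification feedback messages.
    Returns the logical commit timestamp cts on success, None on rollback."""
    if any(r.get("IsAbort") == 1 for r in replies):      # case (1)
        t["Status"] = "Gaborting"
        return None
    for r in replies:                                    # case (2): intersect
        t["Lowts"] = max(t["Lowts"], r["Lowts"])
        t["Uppts"] = min(t["Uppts"], r["Uppts"])
    if t["Lowts"] > t["Uppts"]:
        t["Status"] = "Gaborting"
        return None
    t["cts"] = t["Lowts"]          # case (3): any point in the interval works
    t["Lowts"] = t["Uppts"] = t["cts"]                   # case (4)
    t["Status"] = "Gcommitting"
    return t["cts"]
```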
As can be seen from the above rules, during the verification stage of transaction T, communication takes place mainly between the host node of T and the relevant child data nodes, and mainly comprises the following two steps:
a) The host node of T sends the verification request message, together with the corresponding local write set, to each relevant child data node;
b) Each relevant child data node sends a local verification feedback message to the host node of T.
Thus, the verification stage requires at most 2m communications, with a traffic of m × (verification request message size + verification feedback message size) + global write set size, where m is the number of child data nodes involved in transaction T.
An optimization: if the transaction is a local transaction, then in the verification stage, after modifying the state of the transaction in the global transaction state table to Gvalidating, steps (2) to (7) of the local verification flow (step II of 1.1.2.2)) are executed directly at the local data node:
1) If it is detected that the transaction needs to roll back, first modify the local transaction state table to Aborted and complete the relevant rollback operations, then directly modify the global transaction state table to Gaborted;
2) If the transaction passes verification smoothly, the commit operation of the transaction is then performed at the local data node.
In step 1.1.2.3) above, if the transaction T passes the verification, the write commit phase is entered, i.e., the update to the data by the transaction T is persisted to the database and some subsequent cleaning is done. The local data node needs to perform the following operations in the write commit phase:
I. for each read-set element x:
a) Modify the rts of x to be no smaller than T.cts, i.e., x.rts = max(x.rts, T.cts);
b) Delete T from RTlist(x);
II. For each write-set element y:
c) Update the wts and rts of the new version of y, where wts = T.cts;
d) Update the rts of y: y.rts = max(y.rts, T.cts);
e) Persist y into the database and modify its flag, marking the version as externally visible under the RUC-CC protocol;
f) Clear the contents of the RTlist of y;
g) Clear the WT of y.
III, clearing a local read set and a local write set of the transaction T;
IV. Update Lowts = T.cts for T in LocalTS and set its state to Committed (the local transaction state table here serves transaction consistency only; no synchronization of the global transaction state is involved);
V. Return an ACK indicating successful commit to the host node of transaction T.
After the host node of transaction T receives all the commit-completion ACKs, the global transaction state is modified to Gcommitted, and each data node is informed that the state of transaction T can be cleaned from its local transaction state table.
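The per-node write-commit cleanup (steps I to IV above, minus persistence and the ACK messaging) might look like the following sketch; the item and version layouts are simplified assumptions:

```python
def local_commit(t, read_set, write_set, data, local_ts):
    """Sketch of the write-commit stage on one data node: bump read
    timestamps, publish new versions, and clear RTlist/WT bookkeeping."""
    for key in read_set:                                  # step I
        x = data[key]
        x["rts"] = max(x["rts"], t["cts"])
        if t["TID"] in x["RTlist"]:
            x["RTlist"].remove(t["TID"])
    for key in write_set:                                 # step II
        item = data[key]
        v = item["new_version"]
        v["wts"] = t["cts"]
        v["rts"] = max(v["rts"], t["cts"])
        v["visible"] = True        # flag flip: now visible under RUC-CC
        item["RTlist"].clear()
        item["WT"] = None
    entry = local_ts[t["TID"]]                            # step IV
    entry["Lowts"], entry["Status"] = t["cts"], "Committed"
```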
If transaction T fails verification, it enters the global rollback stage, i.e., transaction T is rolled back and the corresponding cleanup work is performed, comprising the following steps:
I. for each read set element x, delete T from RTlist (x);
II. For each write set element y, cleaning the newly created version y, and emptying the WT content of y;
III, clearing a local read set and a local write set of the transaction T;
IV, updating the local transaction state of the transaction T to be Aborted;
v, returning the ACK for finishing rollback to the host node of the transaction T;
as can be seen from the above rules, during the commit/rollback phase of the transaction T, the communication mainly takes place between the host node of the transaction T and the relevant child RM, which mainly comprises the following two steps:
a) The host node of the transaction T sends a commit/rollback request message to each relevant child RM;
b) Each correlator RM sends a commit/rollback complete corresponding message to the host node.
Thus, the commit/rollback phase is at most 2m communications, with a traffic size of m× (commit/rollback request message size + commit/rollback request message size), where m is the number of transaction T-related child RM.
After the host node of the transaction T receives all the completion rollback ACKs, the global transaction state is modified to Gaborted. And informs each RM that the status of transaction T can be cleaned from LocalTS. An optimization mode is as follows: the system may send a clear message to the RM in bulk to reduce the number of communications.
In step 1.2) above, in the global-timestamp-based linear consistency guarantee algorithm, when a read transaction starts, the global timestamp obtained from the global Gts generation cluster determines the order of the read transaction under the system's global clock, and thereby determines which data conform to linear consistency for the current read transaction. The algorithm treats the operations of a transaction over a time period as operations at a single time point, and guarantees linear consistency among transactions based on the global transaction state and the global timestamp. The main algorithm flow is as follows:
1.2.1 A client initiates a transaction T-request with which a connection is established by Proxy, forming a session.
1.2.2 Proxy parses the transaction T and selects host node to be responsible for managing the execution of the transaction.
1.2.3) When the read transaction T starts, a global timestamp marking the start of the transaction is obtained from the global Gts generation cluster and recorded in the Gts field of the global transaction state of the read transaction. The host node establishes connections with all data nodes involved in the read transaction, forms a data packet from the parsed query execution plan and Gts, and transmits it to all relevant data nodes through network communication.
1.2.4) The data nodes each perform the data read operation, determine the data items meeting the selection conditions, and then traverse the multiple versions of each data item, starting from the latest version, until the first visible version is found.
In the step 1.2.4), for each logical data item, the method for finding the first visible version is as follows:
1.2.4.1 A transaction state extraction algorithm is performed to obtain a snapshot of the transaction state of the write transaction that generated the version relative to the current read transaction.
1.2.4.2 Using Gts visibility determination algorithm to determine whether the version is visible based on the read transaction global timestamp and the transaction state snapshot that generated the version write transaction.
In step 1.2.4.1) above, the purpose of the transaction state extraction algorithm is to find the state, as of the instant T.Gts of the current read transaction, of the write transaction that produced the version. If the write transaction is not yet globally committed, its state may be updated while the read transaction is executing, so a consistency-guaranteed transaction snapshot for the read transaction T must be found. The flow of the algorithm is as follows:
I. Based on the gts field on version v of the data, read the global state record of the transaction that generated the version.
a) If a global timestamp Gts has been recorded on the data item, the transaction that generated the data item is already in the globally committed Gcommitted state, and its global timestamp is Gts.
b) If a TID is recorded on the data item, a request is sent through network communication to the remote host node identified by the RM.ID contained in the TID, and the transaction state record corresponding to the TID is looked up in the global transaction state list on that remote host node.
II. From the global state record obtained in the previous step and the read transaction's Gts, restore a consistency-guaranteed transaction snapshot for the read transaction T.
Because the data version has been read into memory at this point, the snapshot information is recorded directly in the gts field of version v for use by the visibility determination algorithm. The snapshot is obtained as follows:
a) If the Status of the global transaction state record corresponding to data version v is Gcommitted, further judge v.gts:
If v.gts > T.Gts, the transaction that generated version v had not committed at time T.Gts, so v.gts is set to T.Gts + 1.
If v.gts <= T.Gts, the transaction was in the Gcommitted state at time T.Gts. Because in the present invention the global commit confirmation is performed upon completing verification and entering the commit stage, the transaction that generated version v is considered to have been globally committed at T.Gts, and v.gts need not be modified.
b) If the Status of the global transaction state record corresponding to v is Grunning or Gvalidating, then the transaction that generated version v was certainly not globally committed at T.Gts.
c) If the Status of the global transaction state record corresponding to v is Gaborted or Gaborting, further judge v.gts:
If v.gts > T.Gts, the transaction that generated version v was, at time T.Gts, in the Grunning, Gvalidating, or Gaborting state (i.e., not globally committed);
If v.gts <= T.Gts, the transaction that generated version v was in the Gaborted state at time T.Gts.
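Rules a) to c) reduce to a state-as-of-T.Gts computation; the following is a hedged simplification that returns a symbolic state instead of rewriting v.gts in place, as the algorithm above does:

```python
def effective_state(status, v_gts, t_gts):
    """Return the state of the writing transaction as observed at the read
    transaction's instant t_gts, per snapshot rules a)-c)."""
    if status == "Gcommitted":
        # Committed now, but it only counts if it committed at or before t_gts.
        return "Gcommitted" if v_gts <= t_gts else "not-committed"
    if status in ("Grunning", "Gvalidating"):
        return "not-committed"             # rule b): certainly not committed yet
    # rule c): Gaborted or Gaborting
    return "Gaborted" if v_gts <= t_gts else "not-committed"
```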
In step 1.2.4.2) above, the Gts visibility judgment algorithm determines, from the read transaction's global timestamp and the transaction state snapshot of the write transaction that generated each version, whether the version is visible; the flow then continues as follows:
I. Each RM forms a data packet from the visible data and sends it to the host node via network communication.
II. The host node gathers the data returned by all the data nodes and returns it to the Proxy, and the Proxy returns the data to the client with which the session was established; the current read transaction is then complete.
The algorithm can thus return data in one round of communication, plus one communication with the global Gts generation cluster.
The Gts visibility judgment algorithm runs on each data node to determine the data that conform to global linear consistency. Based on the read-only transaction's Gts and the gts field on the tuple, it judges whether a data item version is visible. A data version v is guaranteed to be visible to the read transaction T if, at the given time point, the transaction that generated the version had already been globally committed and the transaction modifying the version had not yet globally completed. The visibility judgment conditions are therefore as follows:
IV. If v.info_bit != 0 && v.gts < T.Gts is satisfied, i.e., the current gts field records a global timestamp Gts and v.gts is smaller than T.Gts, then the version is visible, i.e., it conforms to linear consistency.
V. For each logical data item, the first physical version satisfying the judgment condition (the latest visible version) is the visible version of that data item for the current read transaction. The reason is that the versions are traversed from newer to older, so when the condition is first satisfied, the version read must be the most recently modified visible version of the data item.
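Conditions IV and V together give a one-pass, newest-to-oldest read per logical data item; a sketch, taking info_bit != 0 to mean "the gts field holds a real global timestamp" as condition IV states:

```python
def gts_visible(version, t_gts):
    """Condition IV: the gts field holds a real global timestamp
    (info_bit != 0) that is strictly below the reader's T.Gts."""
    return version["info_bit"] != 0 and version["gts"] < t_gts

def read_item(versions, t_gts):
    """Condition V: scan newest-to-oldest; the first match is the
    latest visible version of the data item."""
    for v in versions:          # versions assumed ordered newest-first
        if gts_visible(v, t_gts):
            return v
    return None
```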
In step 1.3), the twice-read linear consistency guarantee algorithm omits, at the start of a read transaction, the step of applying to the global Gts generation cluster for a global timestamp; instead it adopts the idea of performing two reads and computes the global timestamp of the current read operation in order to determine its global order, thereby obtaining data that conform to global linear consistency. The main flow of the algorithm is as follows:
1.3.1 A client initiates a transaction T-request with which a connection is established by Proxy, forming a session.
1.3.2 Proxy parses the transaction T and selects host node to be responsible for managing the execution of the transaction.
1.3.3 For the first round of communication, the host node establishes connection with all data nodes related to the read transaction, and sends a data acquisition request to all related data nodes. All relevant data nodes execute the first data read algorithm and return data to the host node, which determines the global timestamp Gts of the current read transaction T based on all returned data items.
1.3.4 In the second round of communication, the host node sends a data acquisition request to all relevant data nodes, and sends T.Gts to all relevant data nodes, a second data reading algorithm is executed on the data nodes, and a data version meeting the linear consistency for the T.Gts is returned.
1.3.5 The host node gathers the data returned by the data node and returns the data to the Proxy, and the Proxy returns the data to the client for establishing the session relation, so that the current reading transaction is completed.
In the step 1.3.3), the specific flow of the first round of communication is as follows:
1.3.3.1) On each data node, for each data item meeting the selection condition, traverse its multiple versions (starting from the current latest version), judging each version v until a qualifying visible version is found, and send that version to the host node.
1.3.3.2) The host node determines the global timestamp of the current read transaction from the received visible versions, thereby determining the global order of the current read transaction T. The specific method is: traverse all the data versions read and obtain by comparison the maximum gts value recorded on the data items, denoted gts_max. The global timestamp of the current read transaction is then gts_max, i.e., T.Gts = gts_max. At the same time, the last bit of T.Gts is set to 1, marking the current transaction as a read transaction and placing it in the global order after gts_max and before gts_max + 1.
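Step 1.3.3.2) can be sketched as below; the bitwise OR assumes, as the text implies, that write-transaction timestamps leave the last bit 0, so that setting it orders the read between gts_max and gts_max + 1:

```python
def first_round_timestamp(read_versions):
    """Derive the read transaction's global timestamp from the gts values
    of the versions returned in the first round of communication."""
    gts_max = max(v["gts"] for v in read_versions)
    # Set the last bit to 1 to mark this as a read transaction; assumes
    # write timestamps are even, so gts_max < gts_max | 1 < gts_max + 2.
    return gts_max | 1
```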
In the above step 1.3.3.1), each version v is judged as follows:
I. If a global timestamp gts is recorded on the data item, the transaction that generated this version has been committed globally with global timestamp gts. The current version is the visible version.
II. If a TID is recorded in the gts field of the data item and a global transaction state record corresponding to that TID is found in the global transaction state list on the data node, the transaction corresponding to the TID has been globally validated. The current version is then the visible version, and the gts value for the data item is taken from that record (TID.Gts).
III. If a TID is recorded in the gts field of the data item but no corresponding global transaction state record is found in the global transaction state list on the data node, visibility cannot be judged from local information. The version is treated as not visible in this round; the loop continues and the visibility of its previous version is judged.
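The first-round visibility rules I-III and the gts_max aggregation of step 1.3.3.2) can be sketched as follows (a minimal illustration in Python; the dict-based version records, field names, and state-list layout are assumptions, not the patent's concrete structures):

```python
READ_FLAG = 1  # last bit set to 1 marks the read transaction's place in the global order

def find_visible_version(versions, global_tx_states):
    """Scan versions newest-first, applying rules I/II/III of step 1.3.3.1)."""
    for v in versions:  # versions ordered newest -> oldest
        if v.get("gts") is not None:
            return v                                   # rule I: globally committed, visible
        tid = v.get("tid")
        if tid in global_tx_states:                    # rule II: globally validated
            v["gts"] = global_tx_states[tid]["gts"]    # fill in gts from the state record
            return v
        # rule III: cannot decide locally in this round -> try the previous version
    return None

def read_transaction_gts(visible_versions):
    """Step 1.3.3.2): T.Gts = max gts among the read versions, last bit set to 1."""
    gts_max = max(v["gts"] for v in visible_versions)
    return gts_max | READ_FLAG
```

For example, a read that saw versions stamped 4 and 10 would be ordered at 10 | 1 = 11, i.e. between gts_max and gts_max + 1.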
In the step 1.3.4), the second round of communication means that on each data node, for every data item meeting the selection condition, its multiple versions are traversed starting from the current latest one, and each version v is judged as follows in order to find a data version satisfying linear consistency. The specific flow is:
1.3.4.1) If a global timestamp gts has been recorded on the data item, the transaction that generated this version has been committed globally with global timestamp gts. The current version is visible if T.Gts >= gts.
1.3.4.2) If a TID is recorded in the gts field of the data item and the global transaction state record corresponding to that TID is found in the global transaction state list on the data node, the transaction corresponding to the TID has been globally validated. The current version is visible if T.Gts >= TID.Gts.
1.3.4.3) If a TID is recorded in the gts field of the data item but no corresponding global transaction state record is found in the global transaction state list on the data node, visibility cannot be judged from local information. According to the global TID recorded on the data item and the RM.ID contained in it, a request is sent over the network to the remote host node, and the transaction state record for the TID is looked up in the global transaction state list there. If TID.Status = Gcommitted or Gcommitting and T.Gts >= TID.Gts, the version is visible; otherwise it is not, the loop continues, and the visibility of its previous version is judged.
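A sketch of the second-round judgment, assuming the same hypothetical version/record layout as before; the `remote_lookup` callback stands in for the network request to the remote host node in step 1.3.4.3):

```python
def visible_in_second_round(t_gts, version, local_states, remote_lookup):
    """Second-round check (step 1.3.4): a version is visible only if its
    global timestamp does not exceed the reading transaction's T.Gts."""
    if version.get("gts") is not None:                  # 1.3.4.1): globally committed
        return t_gts >= version["gts"]
    tid = version.get("tid")
    rec = local_states.get(tid)                         # 1.3.4.2): globally validated locally
    if rec is None:
        rec = remote_lookup(tid)                        # 1.3.4.3): ask the remote host node
        if rec is None or rec["status"] not in ("Gcommitted", "Gcommitting"):
            return False                                # not visible; caller tries older version
    return t_gts >= rec["gts"]
```

A caller would loop this predicate over the version chain, newest first, exactly as in the first round.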
It should be noted that with the two-read strategy adopted by this algorithm to determine the global timestamp of a read transaction, the global clock order between read and write operations can be determined, but the order between two read transactions cannot. Therefore, the order of read operations must be fixed at the session layer to prevent out-of-order results under read-read concurrency.
In the step 1.4), as shown in fig. 7, combining the different technologies yields multiple consistency levels with different system efficiencies, meeting different service requirements. In the present invention there are the following four levels (ordered by external consistency strength):
Complete consistency: the system satisfies both linear consistency and transaction consistency; this is the highest consistency level.
Linear consistency: the system fully satisfies linear consistency, but transaction consistency is not required.
Transactional causal consistency: the system fully satisfies transaction consistency and also satisfies causal consistency.
Transactional logical consistency: the system fully satisfies transaction consistency, but no level of external consistency is required.
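The ordering of the four levels can be captured with a simple enumeration (illustrative only; the patent does not define such a type, and the names are stand-ins):

```python
from enum import IntEnum

class ConsistencyLevel(IntEnum):
    """The four levels of step 1.4), ordered by external consistency strength."""
    TRANSACTION_LOGICAL = 1  # transaction consistency only, no external consistency
    TRANSACTION_CAUSAL = 2   # transaction consistency + causal consistency
    LINEAR = 3               # full linear consistency, transaction consistency not required
    FULL = 4                 # transaction consistency + linear consistency (highest)
```

Ordering the levels numerically lets a system check, for instance, whether a requested level requires interaction with the global Gts generation cluster (every level above TRANSACTION_LOGICAL does).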
In the step 1.4), the method for combining the transaction consistency and the linear consistency according to the divided four levels is as follows:
1.4.1) The DTA (Dynamic Timestamp Allocation, hereinafter DTA) + OCC (Optimistic Concurrency Control, hereinafter OCC) algorithm is combined with MVCC (Multi-Version Concurrency Control, hereinafter MVCC) to form the RUC-CC algorithm, which makes the distributed system satisfy transactional logical consistency.
1.4.2) The MVCC algorithm is combined with the global Gts so that global transaction operations in the distributed system satisfy linear consistency.
1.4.3) The global Gts is introduced into the RUC-CC algorithm so that the distributed system satisfies transactional causal consistency.
1.4.4) The RUC-CC algorithm is combined with the linear consistency algorithm so that the distributed system satisfies complete consistency.
1.4.5) The system execution efficiency of the consistency levels in steps 1.4.1) to 1.4.4) is compared.
In the step 1.4.1), the RUC-CC algorithm achieves a comparatively efficient serializable isolation level by incorporating the idea of MVCC into the DTA+OCC algorithm. The main line of argument is as follows:
a) In the read-phase rules introduced in step 1.1.2.1), the invention uses MVCC. Suppose there are concurrent transactions T1 and T2, where T2 modifies data item x from version x0 to version x1 and T1 needs to read x. With the multi-version mechanism, T1 can read version x0 as it was before T2's modification, eliminating the write-read dependency between T2 and T1; that is, no write-read conflict exists between any concurrent transactions in the system.
b) Since T1 reads the version prior to T2's modification, logically T1 reads first and T2 writes afterwards, so a read-write dependency exists. This guarantees the important MVCC property that reads and writes do not block each other. The dependency is maintained by the RTlist structure in the data structures to guarantee conflict serializability.
c) Because write-read conflicts are eliminated and read-write conflicts are maintained in a lighter-weight manner, the transaction rollback rate is reduced and scheduling efficiency is improved while serializability is preserved, so the method achieves more efficient transaction scheduling.
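The version-chain read of point a) can be illustrated as follows (a hedged sketch; the `wts`/`committed` fields are assumed bookkeeping standing in for the patent's data-item structure):

```python
def mvcc_read(chain, reader_ts):
    """Return the value of the newest committed version with wts <= reader_ts.
    The reader never blocks on, or observes, an uncommitted concurrent write."""
    for version in chain:  # chain ordered newest -> oldest
        if version["committed"] and version["wts"] <= reader_ts:
            return version["value"]
    return None  # no version existed before reader_ts
```

In the T1/T2 example, while T2's version x1 is still uncommitted, T1's read simply falls through to the committed x0 instead of conflicting with T2.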
In the step 1.4.2), the linear consistency guarantee algorithm provided by the invention incorporates the idea of MVCC: based on the multiple versions maintained in the system, data conforming to linear consistency is read from among those versions. The three algorithms provided by the invention differ in how they judge the visibility of data conforming to linear consistency, but all build on MVCC. For example, in the algorithm based on the global timestamp Gts, version visibility is judged by comparing the gts field on the data item with the Gts of the read transaction, so the multi-version mechanism makes it convenient to locate the specific data version conforming to linear consistency.
In the step 1.4.3), combining the DTA+OCC algorithm with the global Gts generation technology guarantees causal consistency (a form of external consistency); the argument is as follows. The requirement of causal consistency is that operations with causal links must be ordered according to causal logic.
a) Assume that there is a partial order relationship between transaction T1 and transaction T2 as follows:
i. Transaction T1 first updates data item R, generating version R1, and then commits, obtaining a global timestamp. Transaction T2 then updates data item R, generating version R2, and then commits, obtaining a global timestamp.
b) Suppose transactions T3 and T4 both read data item R, both are concurrent with transaction T2, and there is a partial-order relationship in which T3 precedes T4.
At this time, we ensure causal consistency by:
a) Each transaction belongs to a session (maintained when the client establishes a connection with the system service) and interacts with the "global Gts generation cluster" to obtain its order. For example, for T3 and T4, the "global Gts generation cluster" guarantees that T3 is ordered before T4.
b) The T3 transaction commits before T4. Under the DTA+OCC algorithm, if the logical timestamp of T3 is adjusted to after T2, version R2 is read; if it is adjusted to before T2, version R1 is read. Thus T3 may read either of versions R1 and R2.
c) T4 commits after T3, so if T3 read R1, T4 can only read R1 or R2 and no schedule allows it to read R0; if T3 read R2, the logical timestamp of T4 must likewise be greater than that of T2, so only R2 can be read.
The DTA+OCC algorithm by itself guarantees transaction consistency, but without MVCC it suffers a higher rollback rate, so at the transaction-scheduling level the algorithm alone is inefficient.
In the step 1.4.4), combining the three technologies realizes "complete consistency", the unification of transaction consistency and linear consistency, which serves as the highest consistency level in the distributed database system. Guaranteeing "complete consistency" incurs a large transaction verification overhead and therefore affects system performance to some extent. Since there is a trade-off between consistency and system efficiency, ensuring complete consistency requires sacrificing some system performance.
At the "complete consistency" level, read operations must use the RUC-CC algorithm combined with the timestamp-based linear consistency guarantee algorithm, so the execution of a read transaction is extended as follows:
a) First, when a transaction starts, it communicates with the "global Gts generation cluster" to obtain the global timestamp t.gts of the current read transaction.
b) Then, in the reading phase, the required data is acquired through the DTA+OCC algorithm.
c) Then a linear consistency verification is performed on the read data: the linear consistency guarantee algorithm based on the global timestamp is invoked to judge whether the data just read conforms to linear consistency.
i. If all the data pass the linear consistency judgment, the data are summarized and returned to the user.
if there are data that cannot be judged by linear consistency, there are two processing methods: (1) this read transaction rollback; (2) The read transaction is retried and if the data that is eligible is still not readable by the retry three times (parameters, settable), the transaction rolls back.
At the "transactional causal consistency" level, the three technologies can also be combined for performance reasons to achieve efficient transactional causal consistency. Since the DTA+OCC algorithm combined with the global Gts generation technology already guarantees transactional causal consistency, fusing in MVCC reduces the rollback rate of DTA+OCC and improves system performance. At this consistency level, read operations skip the linear consistency verification described above, which improves the execution efficiency of reads.
In step 1.4.5) above, different consistency levels are ensured by different combinations of the techniques proposed in the present invention, so the efficiency of the system also varies. We summarize the factors affecting efficiency at each consistency level and show that weaker consistency levels yield better system efficiency; since the relationship between consistency requirements and system efficiency is inherently a trade-off, weaker performance must be tolerated at higher consistency levels:
(1) The complete consistency level is guaranteed by the RUC-CC algorithm combined with the linear consistency guarantee algorithm. Their combination requires the following two additional operations, which bring a certain performance loss:
a) Additional verification of read operations. A read operation must perform consistency verification on the data it reads, so that the data simultaneously satisfies transaction consistency and linear consistency. The joint use of the two algorithms therefore increases the data-verification overhead, causing performance loss.
b) Additional transaction rollbacks. There are cases where a transaction schedule generated by the RUC-CC algorithm satisfies transaction consistency but not linear consistency. At the complete consistency level such transactions must be rolled back, causing additional rollback overhead.
(2) The linear consistency is ensured by a linear consistency ensuring algorithm. The additional verification of read operations and the additional rollback operations of transactions required in the full consistency level guarantee are omitted, thus having a certain performance improvement over the full consistency level.
(3) The transaction causal consistency level is guaranteed by the DTA+OCC+global clock, so that the overhead caused by executing a linear consistency guarantee algorithm is further saved, and the transaction causal consistency level has certain performance improvement compared with the linear consistency level.
(4) The transaction logic consistency is guaranteed by the RUC-CC algorithm. The algorithm does not take into account the guarantee of external consistency, and therefore the overhead of communicating with the global clock is further saved. Thus having a certain performance improvement over the transactional causal consistency level.
In the step 2), the consistency level the distributed system needs to achieve is determined according to the actual service requirements, a consistency execution algorithm suited to that level is selected based on the established system consistency model, and the distributed transactions and single-machine transactions in the distributed system are executed. This comprises the following steps:
2.1) According to whether a transaction needs to operate on data on multiple resource management nodes, the transactions involved in the distributed system are divided into distributed transactions and single-machine transactions.
2.2 Adopting a consistency execution algorithm which is suitable for the consistency level requirement to execute the distributed transaction in the distributed system;
2.3 A consistency execution algorithm adapted to the consistency level requirements is used to execute the single transaction in the distributed system.
In step 2.1) above, the transactions involved in the distributed system are classified because the smallest unit of operation execution in a distributed database system is the transaction. Depending on whether a transaction must operate on data residing on multiple data nodes, transactions are categorized into distributed transactions and single-machine transactions. For these two kinds of transactions the invention adopts different execution flows, so as to reduce inter-node communication overhead as much as possible and improve transaction processing efficiency.
A distributed transaction is one that performs read-write operations across multiple resource management nodes, i.e., it operates on data on multiple RMs. For example, if transaction T needs to operate on nodes RM1, RM2 and RM3, it is a distributed transaction. In this case a coordination node (host node) must be introduced to store the global transaction state during execution and to manage the execution of the transaction. The coordination node (host node) can be selected in the following ways:
a) A random selection mechanism, i.e. randomly selecting a node from the host node set as a host node.
b) Deterministic selection mechanism, i.e., the host node is selected by a fixed rule; here the rule is defined as round-robin (polling) selection from the host node set.
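The two selection mechanisms a) and b) might be sketched as follows (illustrative only; `make_selectors` and its return shape are not from the patent):

```python
import itertools
import random

def make_selectors(host_nodes):
    """Return (random_pick, round_robin_pick) over the host node set:
    a) random selection; b) deterministic round-robin ('polling') selection."""
    rr = itertools.cycle(host_nodes)          # b) fixed, repeatable cycling rule
    return (lambda: random.choice(host_nodes),  # a) uniform random choice
            lambda: next(rr))
```

Round-robin spreads coordination load evenly and predictably; random selection avoids any shared counter state between callers.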
A single-machine transaction is one that only needs to operate on data on a single data node; for example, if transaction T only needs to operate on node RM1, it is a single-machine transaction. During the execution of a single-machine transaction, only one round of communication with the coordination node is needed.
In the step 2.2), the process of executing the distributed transaction in the distributed system according to the requirements of different consistency levels is as follows:
2.2.1 The Client is responsible for issuing a request to execute the transaction T, and the Proxy is responsible for receiving the request from the Client and establishing a session relationship with the Client.
2.2.2) After receiving the request, the Proxy interacts with the metadata management cluster, obtains the relevant meta information, parses the request, and routes the SQL statements to the appropriate host nodes.
2.2.3) The host node optimizes the SQL, generates a physical execution plan, performs global transaction initialization (recording global transaction state information, etc.), decomposes the plan into per-node execution plans, sends each to the corresponding data node, and records the global transaction state as running.
2.2.4) According to the execution plan, each data node performs its data operations using the algorithm suited to the required consistency level and records the local transaction execution state; after its local data reads and writes complete, the data node sends a "can verify" command to the host node. Specifically:
When transactional logical consistency or transactional causal consistency is required: the data node performs data operations and transaction scheduling with the RUC-CC algorithm of step 1.1) according to the execution plan.
When linear consistency is required: the data node performs data operations with the linear consistency guarantee algorithm of step 1.2) according to the execution plan, and schedules transactions based on the MVCC algorithm.
When complete consistency is required: the data node performs data operations and transaction scheduling with the linear consistency guarantee algorithm of step 1.2) combined with the RUC-CC algorithm of step 1.1) according to the execution plan.
2.2.5) After receiving the "can verify" command from all relevant data nodes, the host node records the global transaction state as verifying and sends a "verify" command to all relevant data nodes.
2.2.6 After receiving the "verify" instruction, the data node enters a local verification process by adopting the verification method in step 1.1.2.2), and if the verification is passed, the data node sends a "verify pass" instruction to the host node.
2.2.7) After receiving the "verify pass" instruction from all relevant data nodes, the host node decides, according to the consistency level required, whether to interact with the "global Gts generation cluster" to obtain the transaction's global timestamp, and records the global transaction state as committed. Two threads are then started simultaneously: the first returns the result set to the Proxy, which is responsible for returning the execution result to the client; the second sends a "commit" instruction to all relevant data nodes.
When required for transactional logical consistency: after the host node receives the verification pass instruction sent by all relevant data nodes, the global transaction state is recorded as committed.
When required for transactional causal consistency level, linear consistency, and full consistency: after the host node receives the "verify pass" instruction sent by all relevant data nodes, interaction with the global Gts generation cluster is required, the global timestamp of the transaction is obtained, and the global transaction state is recorded as committed.
2.2.8) After receiving the "commit" command, the data node enters the local commit flow using the commit method of step 1.1.2.3).
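The host-node side of steps 2.2.4) to 2.2.8) can be condensed into the following sketch (the `execute`/`verify`/`commit` node interface is a hypothetical stand-in for the message exchanges with the data nodes):

```python
def run_distributed_commit(data_nodes, needs_global_ts, get_global_ts):
    """Condensed host-node flow for a distributed transaction."""
    # 2.2.4): each data node executes its local reads/writes, then reports "can verify"
    if not all(n.execute() for n in data_nodes):
        return "aborted", None
    # 2.2.5)-2.2.6): host records state 'verifying', broadcasts "verify"; nodes validate locally
    if not all(n.verify() for n in data_nodes):
        return "aborted", None
    # 2.2.7): fetch a global timestamp only at levels that require the Gts cluster
    gts = get_global_ts() if needs_global_ts else None
    # 2.2.8): broadcast "commit"; every node enters its local commit flow
    for n in data_nodes:
        n.commit()
    return "committed", gts
```

At the transactional logical consistency level `needs_global_ts` would be False, skipping the round trip to the global Gts generation cluster; the other three levels set it to True.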
In the step 2.3), the process of executing the single transaction in the distributed system according to the requirements of different consistency levels is as follows:
2.3.1 The Client is responsible for issuing a request to execute the transaction T, and the Proxy is responsible for receiving the request from the Client and establishing a session relationship with the Client.
2.3.2 After receiving the request information, proxy interacts with the metadata management cluster, obtains the related metadata, analyzes the request, and distributes the request to different host nodes through routes.
2.3.3) The host node optimizes the SQL and generates a physical execution plan, which is sent to the selected transaction coordination node (host node). The host node performs transaction initialization, records the transaction state as running, and sends the execution plan directly to the corresponding data node.
2.3.4) According to the execution plan, the data node performs data operations using the algorithm suited to the required consistency level and records the local transaction state; after finishing its local data reads and writes it enters the verification process directly, and if verification passes, it sends a "verify pass" instruction to the host node and enters the local commit process.
When the business logic consistency and the business cause and effect consistency level are required, the data node performs data operation and business scheduling through the RUC-CC algorithm;
when the requirement is the linear consistency level, the data node performs data operation through the linear consistency guarantee algorithm in the step 1.2), and performs transaction scheduling based on the MVCC algorithm;
When the requirement is the complete consistency level, the data node performs data operation and transaction scheduling by combining the linear consistency guarantee algorithm in the step 1.2) with the RUC-CC algorithm in the step 1.1).
2.3.5) After receiving the "verify pass" instruction from the RM, the host node decides, according to the consistency level required, whether to interact with the global Gts generation cluster to obtain the transaction's global timestamp, records the transaction state as committed, and then returns the result set to the Proxy, which returns the execution result to the client.
When required for transactional logical consistency: after the host node receives the verification pass instruction sent by all relevant data nodes, the global transaction state is recorded as committed.
When required for transactional causal consistency level, linear consistency, and full consistency: after the host node receives the "verify pass" instruction sent by all relevant data nodes, interaction with the global Gts generation cluster is required, the global timestamp of the transaction is obtained, and the global transaction state is recorded as committed.
In the above method, in particular, if the read-write set of a transaction grows so large during processing that it exceeds the system's memory capacity, the transaction cannot execute because memory overflows. To solve this problem, the following three methods can be adopted:
1. Threshold method: when the size of transaction T's read set or write set at a certain RM exceeds a threshold, the execution of T is terminated, preventing T from exhausting system memory and blocking other transactions. The threshold is a settable parameter; for example, it may equal 60% of the RM's available memory. This method is simple and easy to use, but transactions with larger read-write sets can never execute.
2. Dumping method: when the size of transaction T's read set or write set at a certain data node exceeds a threshold, the read or write set is flushed to disk and read back into memory when T needs to access it; the threshold is again a settable parameter, for example 60% of the RM's available memory. This method does not terminate T, but incurs extra I/O cost for the disk reads and writes.
3. Optimization method: when the read set of a read-only transaction T is too large, it is not scheduled by the RUC-CC concurrency control algorithm; instead, its data is read using the linear consistency guarantee algorithm based on the global timestamp.
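Methods 1 and 2 share the same threshold test and differ only in the action taken; a sketch (the function and policy names are illustrative, and the 60% ratio is the example parameter from the text):

```python
def check_read_write_set(set_bytes, available_memory, policy, threshold_ratio=0.60):
    """Decide how to handle a transaction's read/write set at one RM.
    threshold_ratio defaults to the 60%-of-available-memory example."""
    if set_bytes <= available_memory * threshold_ratio:
        return "in_memory"          # below threshold: keep the set in memory
    if policy == "threshold":
        return "abort"              # method 1: terminate transaction T
    if policy == "dump":
        return "spill_to_disk"      # method 2: flush the set to disk, reload on access
    raise ValueError(f"unknown policy: {policy}")
```

Method 3 sits above this check: a read-only transaction with an oversized read set bypasses RUC-CC scheduling entirely rather than choosing between abort and spill.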
It should be noted that "external consistency" in the present invention is a generalized concept: all distributed-system consistency notions such as linear consistency, causal consistency, monotonic reads and monotonic writes fall within the scope of "external consistency". Likewise, "global clock" refers to a logical total order and does not specifically mean a physical timestamp.
The foregoing embodiments are only for illustrating the present invention, wherein the structures, connection modes, manufacturing processes, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solutions of the present invention should not be excluded from the protection scope of the present invention.

Claims (15)

1. A distributed system for ensuring transaction consistency and linear consistency, comprising: a plurality of clients and a database server, wherein the database server consists of an access layer, a meta information management cluster, a global Gts generation cluster, and a transaction processing and storage layer;
the client is used for providing an interface for a user to interact with the database server and sending a user request to the database server;
the access layer is used for receiving the request sent by the client and analyzing and generating an execution plan;
the meta information management cluster is used for uniformly managing the distributed clusters of the distributed system;
the global Gts generation cluster is used for generating global timestamps and uniquely ordering the global transactions in the distributed system to realize linear consistency;
the transaction processing and storage layer comprises a plurality of resource management nodes, wherein the resource management nodes comprise coordination nodes and data nodes, the coordination nodes and the data nodes are used for executing transaction logic according to an execution plan sent by an access layer, and an obtained result is returned to the client through the access layer.
2. A distributed system for ensuring transaction consistency and linear consistency as recited in claim 1, wherein: the data nodes are used for storing the data in the distributed system in partitions, and the coordination nodes are used for coordinating the transactions in the distributed system; for the roles of the resource management nodes, there are two allocation modes:
master-slave mode: a part of the resource management nodes are exclusively used as coordination nodes for transaction processing, and the rest of the resource management nodes are used as data nodes;
peer-to-peer mode: all the resource management nodes are peer-to-peer, and each resource management node has two functions of a data node and a coordination node.
3. A distributed system for ensuring transaction consistency and linear consistency as recited in claim 1, wherein: the global timestamp generated by the global Gts generation cluster consists of eight bytes, composed in a hybrid physical clock manner:
a) The first 44 bits are the physical timestamp value;
b) The last 20 bits are monotonically increasing counts within one millisecond.
4. A distributed system for ensuring transaction consistency and linear consistency as recited in claim 1, wherein: the basic data structure related to the distributed system comprises a global transaction state table, a local transaction state table, a data item data structure, a global read-set write set of the transaction, a communication protocol and a message data structure;
The global transaction state table is used for maintaining the transaction state seen globally in the distributed system and is represented by the six-tuple { TID, Lowts, Uppts, Status, Gts, Nodes }, wherein TID represents the unique identification of a transaction, Lowts represents the lower limit of the transaction's logical commit timestamp, Uppts represents the upper limit of the transaction's logical commit timestamp, Status represents the current global state of the transaction, Gts represents the timestamp at which the transaction's global commit/rollback completed, and Nodes represents the data nodes involved in the current transaction;
the local transaction state table is used for maintaining the local transaction state of the transaction on each resource management node and is represented by { TID, lowts, uppts and Status }, wherein the TID represents a unique identification of the transaction, the Lowts represents a lower limit of a transaction logic commit time stamp, the Uppts represents an upper limit of the transaction logic commit time stamp, and the Status represents the local state of the transaction;
the data item data structure comprises a first group of data elements serving as a linear consistency basis and a second group of data elements serving as a distributed transaction consistency, wherein the first group of data elements comprises { gts, info_bit }, wherein gts represents a globally unique sequence of a transaction in a distributed system, and the info_bit is used for identifying whether Gts or TID is currently recorded in a gts field; the second set of data elements includes { wts, rts }, where wts is for recording a logical timestamp of a transaction that created the version of the data item, and rts is for recording a logical timestamp of a transaction that last read the data item;
The global read set of the transaction is used for recording all data items read by the transaction and is represented by { BlockAddress, offset, size and Value }, wherein BlockAddress represents a corresponding block address of the data item, offset represents an Offset of the data item in a block, size represents the Size of the data item and Value represents the Value of the data item;
the global write set of the transaction is used for recording all data items which need to be updated in the transaction and is represented by { BlockAddress, offset, size and NewValue, operationType }, wherein BlockAddress represents a corresponding block address of the data item, offset represents an Offset of the data item in a block, size represents the Size of the data item, newValue represents a value of the data item, and operation type represents whether the operation is an updating operation, an inserting operation or a deleting operation;
the communication protocol and the message comprise a message sent by the coordination node to the data node, a message sent by the data node to the coordination node, a message sent by the coordination node to the global Gts generation cluster, and a message sent by the global Gts generation cluster to the coordination node; the message sent by the coordination node to the data node comprises a read data request message, a verification request message and a write submitting/rollback request; the message sent by the data node to the coordination node comprises a read request feedback message and a local verification feedback message; the message sent by the coordinating node to the global Gts generation cluster includes a global timestamp request message; the global Gts generates a message for the cluster to send to the coordinator node including a global timestamp request feedback message.
5. A multi-level consistency method employing a distributed system for ensuring transaction consistency and linear consistency as claimed in any one of claims 1 to 4, comprising the steps of:
1) Establishing a unified consistency model capable of realizing multi-level consistency;
2) Determining the consistency level that the distributed system needs to achieve according to the actual service requirements, determining a consistency execution algorithm suited to that consistency requirement based on the established system consistency model, and executing the distributed transactions and single-machine transactions in the distributed system to obtain transaction execution results.
6. A multi-level consistency method as claimed in claim 5, wherein: in the step 1), a method for establishing a unified consistency model capable of realizing multi-level consistency comprises the following steps:
1.1) Performing transaction concurrent access control by adopting a DTA-based OCC strategy, and establishing an RUC-CC algorithm for ensuring transaction consistency;
1.2) Based on the global timestamp generated by the global Gts generation cluster and the global transaction state, establishing a global-timestamp-based linear consistency guarantee algorithm for guaranteeing linear consistency among transactions;
1.3) Adopting a method of performing two data reads, establishing a two-read linear consistency guarantee algorithm for guaranteeing linear consistency among transactions;
1.4) Combining the transaction consistency and linear consistency of steps 1.1) to 1.3) with the MVCC algorithm to establish a unified model capable of meeting the various consistency levels.
7. A multi-level consistency method as claimed in claim 6, wherein: in the step 1.1), the method for establishing the RUC-CC algorithm by performing transaction concurrent access control with the DTA-based OCC strategy comprises the following steps:
1.1.1) For a transaction T sent by the client, completing the corresponding initialization work on the coordination node;
1.1.2) Dividing the global execution of the transaction into 3 phases: a read phase, a verification phase and a write-commit/rollback phase; under the coordination of the coordination node, each data node involved in the operation executes the transaction and clears the corresponding entry of the committed or rolled-back transaction from the transaction state table.
8. A multi-level consistency method as claimed in claim 7, wherein: in the step 1.1.2), the global execution of the transaction is divided into 3 phases: a read phase, a verification phase and a write-commit/rollback phase, and each data node involved in the operation executes the transaction under the coordination of the coordination node, comprising the following steps:
1.1.2.1) The transaction T reads the required data according to its execution logic and writes its updates into the local memory of the transaction T;
1.1.2.2) The transaction T verifies whether it conflicts with other transactions to obtain a verification result;
1.1.2.3) The transaction T selects whether to perform write-commit or rollback depending on the verification result of the verification phase.
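The three-phase split above can be sketched as a single driver function; the callback names are hypothetical, since the claim fixes only the phase order:

```python
def run_global_transaction(txn, read_phase, validate_phase, write_commit, rollback):
    """Drive a transaction through the three global phases of the claim:
    read, verification, then write-commit or rollback."""
    read_phase(txn)              # 1.1.2.1) read data, buffer updates locally
    if validate_phase(txn):      # 1.1.2.2) check for conflicts with other transactions
        write_commit(txn)        # 1.1.2.3) verification passed: make updates durable
        return "COMMITTED"
    rollback(txn)                #          verification failed: discard local updates
    return "ABORTED"
```

The coordinator only ever commits after a successful verification phase, which is what makes the scheme optimistic: conflicts are detected at the end, not prevented with locks during the read phase.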
9. A multi-level consistency method as claimed in claim 8, wherein: in the step 1.1.2.1), the method by which the transaction T reads the required data according to its execution logic and writes its updates into the local memory of the transaction T is as follows:
firstly, a coordination node of a transaction T needs to send a read data request message for reading a data item x to a data node where the data item x is located;
then, after receiving the read data request message, the data node where the data item x is located firstly establishes or updates a local transaction state table of the transaction T, then searches a visible version of the data item x in a logic life cycle of the transaction T, and sends a read request feedback message to a coordination node of the transaction T;
and finally, after receiving the read request feedback messages of all the data nodes, the coordination node of the transaction T judges whether rollback is needed; if so, it enters the global rollback phase, otherwise the transaction continues to execute.
10. A multi-level consistency method as claimed in claim 8, wherein: in the step 1.1.2.2), the method by which the transaction T verifies whether it conflicts with other transactions comprises:
first, the coordination node of the transaction T modifies the state of the transaction T in the global transaction state table to Gvaliding, and then sends a verification request message together with the local write set to each data node involved in the transaction T;
secondly, after each data node involved in the transaction T receives the verification request message, it executes the local verification operation, which specifically comprises the following steps:
(1) updating the bounds of the transaction T in the local transaction state table: T.Lowts = max(T.Lowts, vrm.Lowts), T.Uppts = min(T.Uppts, vrm.Uppts);
(2) checking whether T.Lowts is larger than T.Uppts; if so, verification fails and an Abort message is returned to the coordination node of the transaction T to enter rollback, otherwise entering step (3);
(3) finding each data item y in the transaction's write set and checking whether the WT of data item y is empty:
if not empty, sending an Abort message to the coordination node of the transaction T to enter rollback;
otherwise, entering step (4);
(4) updating the WT of each data item y in the write set to T.TID, and adjusting the lower bound of the timestamp of the transaction T in the local transaction state table to be larger than the rts of y;
(5) checking whether T.Lowts is larger than T.Uppts; if so, verification fails, the transaction rolls back locally, and an Abort message is returned to the coordination node of the transaction T; otherwise, entering step (6);
(6) for each element y in the write set, adjusting the timestamps of the transaction T or of the transactions in its RTlist to eliminate read-write conflicts;
(7) creating a new version of the data item y according to its updated value, and setting a flag indicating that the new version has not been globally committed;
(8) returning a local verification feedback message lvm of the transaction T to the coordination node of the transaction T, wherein lvm.Lowts and lvm.Uppts record the lower and upper bounds of the logical timestamp of the transaction T on the local data node, respectively;
finally, after the coordination node of the transaction T receives the local verification feedback messages of all the resource management nodes, it determines whether the transaction T passes verification according to the received messages.
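Steps (1)–(8) of the local verification can be sketched as follows. The class and field names are assumptions, and the RTlist adjustment of step (6) is left out of this simplified sketch:

```python
INF = float("inf")

class TxnEntry:
    """Entry of a transaction in the local transaction state table (names assumed)."""
    def __init__(self, tid, lowts=0, uppts=INF):
        self.tid, self.lowts, self.uppts = tid, lowts, uppts

class Item:
    """A data item with its read timestamp rts and write-claim field WT."""
    def __init__(self, rts=0):
        self.rts, self.wt, self.versions = rts, None, []

def local_validate(t, write_set, vrm_lowts, vrm_uppts):
    """write_set is a list of (Item, new_value) pairs; returns True on success."""
    # (1) merge the bounds carried by the verification request message vrm
    t.lowts = max(t.lowts, vrm_lowts)
    t.uppts = min(t.uppts, vrm_uppts)
    # (2) an empty logical-timestamp interval means T cannot be serialized here
    if t.lowts > t.uppts:
        return False
    # (3) a non-empty WT on any written item signals a concurrent writer -> abort
    if any(item.wt is not None for item, _ in write_set):
        return False
    # (4) claim every written item and lift the lower bound above its rts
    for item, _ in write_set:
        item.wt = t.tid
        t.lowts = max(t.lowts, item.rts + 1)
    # (5) re-check the interval after the dynamic timestamp adjustment
    if t.lowts > t.uppts:
        for item, _ in write_set:
            item.wt = None          # local rollback: release the claims
        return False
    # (6) adjusting timestamps of readers in the RTlist is omitted in this sketch
    # (7) install a new version flagged as not yet globally committed
    for item, new_value in write_set:
        item.versions.append({"value": new_value, "global_commit": False})
    return True                     # (8) bounds in t are what lvm would report
```

The interval [T.Lowts, T.Uppts] is the dynamic timestamp adjustment at the heart of the DTA-based OCC strategy: each conflict shrinks the interval, and verification fails exactly when it becomes empty.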
11. A multi-level consistency method as claimed in claim 6, wherein: in the step 1.2), based on the global timestamp generated by the global Gts generation cluster and the global transaction state, the established global-timestamp-based linear consistency guarantee algorithm comprises the following steps:
1.2.1 A client initiates a transaction T request, and an access layer establishes connection with the client to form a session;
1.2.2 The access layer analyzes the transaction T and selects a coordination node to be responsible for managing the execution process of the transaction;
1.2.3) At the beginning of a read transaction T, obtaining a global timestamp Gts from the global Gts generation cluster and recording the Gts for the read transaction in the global transaction state table; the coordination node establishes connections with all data nodes involved in the read transaction, forms a data packet from the parsed query execution plan and the global timestamp Gts, and transmits the data packet to all involved data nodes through network communication;
1.2.4) All data nodes respectively perform data reading operations, determining the data items that meet the selection conditions, and then traversing each multi-version data item from the latest version until the first visible version is found;
1.2.5) The coordination node gathers the data returned by all the data nodes and returns it to the access layer; the access layer returns the data to the client that established the session, completing the current read transaction.
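The version traversal of step 1.2.4) can be sketched as a scan from the newest version down; the version-record layout is an assumption:

```python
def first_visible_version(versions, read_gts):
    """Traverse a multi-version data item from newest to oldest and return the
    value of the first version that was globally committed at or before the
    reader's global timestamp Gts (None if no version is visible)."""
    for v in sorted(versions, key=lambda v: v["gts"], reverse=True):
        if v["global_commit"] and v["gts"] <= read_gts:
            return v["value"]
    return None
```

Versions whose global-commit flag is still unset are skipped, so a reader never observes a write that has not yet passed global commit.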
12. A multi-level consistency method as claimed in claim 6, wherein: in the step 1.3), adopting the method of performing two data reads, the flow of the established two-read linear consistency guarantee algorithm is as follows:
1.3.1 A client initiates a transaction T request, and an access layer establishes connection with the client to form a session;
1.3.2 The access layer analyzes the transaction T and selects a coordination node to be responsible for managing the execution process of the transaction;
1.3.3) The coordination node establishes connections with all data nodes involved in the read transaction and sends a data acquisition request to all of them; the involved data nodes execute the first data reading algorithm and return data to the coordination node, and the coordination node determines the global timestamp Gts of the current read transaction T based on all the returned data items;
1.3.4) The coordination node sends the data acquisition request to all the data nodes again, together with the determined global timestamp Gts of the current read transaction T; the data nodes execute the second data reading algorithm and return the data versions that satisfy linear consistency with respect to the global timestamp Gts of the current read transaction T;
1.3.5) The coordination node gathers the data returned by all the data nodes and returns it to the access layer; the access layer returns the data to the client that established the session, completing the current read transaction.
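The two-read flow of steps 1.3.3)–1.3.5) can be sketched with a toy data node; the method names and storage layout are assumptions:

```python
class DataNode:
    """A data node holding committed multi-version items keyed by name."""
    def __init__(self, items):
        # items: {key: [{"gts": ..., "value": ...}, ...]}
        self.items = items

    def latest_gts(self, keys):
        # first read: report the newest committed gts this node holds for the keys
        return max((v["gts"] for k in keys if k in self.items
                    for v in self.items[k]), default=0)

    def read_at(self, key, read_gts):
        # second read: return the version visible at the agreed global timestamp
        visible = [v for v in self.items.get(key, []) if v["gts"] <= read_gts]
        return max(visible, key=lambda v: v["gts"])["value"] if visible else None

def two_round_read(nodes, keys):
    # round 1: determine the read transaction's global timestamp Gts from all replies
    read_gts = max(node.latest_gts(keys) for node in nodes)
    # round 2: every node returns the version consistent with that Gts
    result = {}
    for k in keys:
        for n in nodes:
            val = n.read_at(k, read_gts)
            if val is not None:
                result[k] = val
                break
    return result
```

Because the Gts is fixed after the first round and reused on every node in the second, all keys are read at one common point in time even though no timestamp was known when the transaction started.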
13. A multi-level consistency method as claimed in claim 6, wherein: in the step 2), the consistency level to be achieved by the distributed system is determined according to the actual service requirements, a consistency execution algorithm suited to that consistency level requirement is determined based on the established system consistency model, and the method for executing the transactions in the distributed system comprises the following steps:
2.1) According to whether a transaction needs to operate on data on a plurality of resource management nodes, dividing the transactions involved in the distributed system into two types: distributed transactions and single-machine transactions;
2.2 Adopting a consistency execution algorithm which is suitable for the consistency level requirement to execute the distributed transaction in the distributed system;
2.3 A consistency execution algorithm adapted to the consistency level requirements is used to execute the single transaction in the distributed system.
14. A multi-level consistency method as claimed in claim 13, wherein: in the step 2.2), a consistency execution algorithm which is adapted to the consistency level requirement is adopted, and the execution flow of the distributed transaction in the distributed system is as follows:
2.2.1 The client is responsible for sending out a request for executing the transaction T, and the access layer is responsible for receiving the request sent by the client and establishing a session relation with the client;
2.2.2 After receiving the request information, the access layer interacts with the metadata management cluster, analyzes the request after acquiring the related metadata, and distributes the request to different coordination nodes through routes;
2.2.3) The coordination node optimizes the SQL and generates a physical execution plan, performs the global transaction initialization work, records the global transaction state information, then decomposes the execution plan into per-data-node execution plans, sends them to the corresponding data nodes, and records the global transaction state as running;
2.2.4) Each data node performs data operations according to the execution plan by adopting the algorithm adapted to the consistency level requirement, records the execution state of the local transaction, and sends a 'can verify' command to the coordination node after the local data read-write is completed; specifically:
when the transaction logical consistency or transaction causal consistency level is required: the data node adopts the RUC-CC algorithm to perform data operations and transaction scheduling according to the execution plan;
when the linear consistency level is required: the data node performs data operations according to the execution plan by adopting the linear consistency guarantee algorithm, and performs transaction scheduling based on the MVCC algorithm;
when the full consistency level is required: the data node performs data operations and transaction scheduling according to the execution plan by adopting the linear consistency guarantee algorithm combined with the RUC-CC algorithm;
2.2.5) After receiving the 'can verify' commands sent by all relevant data nodes, the coordination node records the global transaction state as validating and sends a 'verify' command to all relevant data nodes;
2.2.6) After receiving the 'verify' command, each data node enters the local verification process; if verification passes, it sends a 'verification passed' instruction to the coordination node;
2.2.7) After receiving the 'verification passed' instructions sent by all relevant data nodes, the coordination node determines, according to the different consistency level requirements, whether it needs to interact with the global Gts generation cluster to acquire a global timestamp for the transaction, and records the global transaction state as committed; it then starts two threads simultaneously: the first returns the result set to the access layer, which is responsible for returning the execution result to the client; the second sends a commit command to all relevant data nodes;
when the transaction logical consistency level is required: after the coordination node receives the 'verification passed' instructions sent by all relevant data nodes, it records the global transaction state as committed;
when the transaction causal consistency, linear consistency or full consistency level is required: after receiving the 'verification passed' instructions sent by all relevant data nodes, the coordination node needs to interact with the global Gts generation cluster to acquire the global timestamp of the transaction, and records the global transaction state as committed;
2.2.8 After each data node receives the commit command, the data node enters a local commit flow.
15. A multi-level consistency method as claimed in claim 13, wherein: in the step 2.3), the consistency level to be achieved by the distributed system is determined according to the actual service requirements, a consistency execution algorithm suited to that consistency level requirement is determined based on the established system consistency model, and the method for executing a single-machine transaction in the distributed system comprises the following steps:
2.3.1 The client is responsible for sending out a request for executing the transaction T, and the access layer is responsible for receiving the request sent by the client and establishing a session relation with the client;
2.3.2 After receiving the request information, the access layer interacts with the metadata management cluster, analyzes the request after acquiring the related metadata, and distributes the request to different coordination nodes through routes;
2.3.3) The coordination node optimizes the SQL, generates a physical execution plan, performs the transaction initialization work, records the transaction state as running, and sends the execution plan directly to the selected data node;
2.3.4) The data node performs data operations according to the execution plan by adopting the algorithm adapted to the consistency level requirement and records the local transaction state; after completing the local data read-write, the data node directly enters the verification process, and if verification passes, it sends a 'verification passed' instruction to the coordination node and enters the local commit process;
when the transaction logical consistency or transaction causal consistency level is required, the data node performs data operations and transaction scheduling through the RUC-CC algorithm;
when the linear consistency level is required, the data node performs data operations through the linear consistency guarantee algorithm and performs transaction scheduling based on the MVCC algorithm;
when the full consistency level is required, the data node performs data operations and transaction scheduling by combining the linear consistency guarantee algorithm with the RUC-CC algorithm;
2.3.5) After receiving the 'verification passed' instruction sent by the data node, the coordination node determines, according to the different consistency level requirements, whether to interact with the global Gts generation cluster to acquire a global timestamp for the transaction, records the state of the transaction as committed, and returns the result set to the access layer, which is responsible for returning the execution result to the client;
when the transaction logical consistency level is required: after receiving the 'verification passed' instruction sent by the data node, the coordination node records the global transaction state as committed;
when the transaction causal consistency, linear consistency or full consistency level is required: after receiving the 'verification passed' instruction sent by the data node, the coordination node needs to interact with the global Gts generation cluster to acquire the global timestamp of the transaction, and records the global transaction state as committed.
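The per-level branching of steps 2.3.4)–2.3.5) (and likewise 2.2.7)) can be summarized in a small table; the level keys are shorthand for the levels named in the claims:

```python
# Mapping from consistency level to its scheduling algorithm and to whether the
# coordination node must fetch a global timestamp from the global Gts generation
# cluster before recording the transaction as committed (level names assumed).
CONSISTENCY_LEVELS = {
    "transaction_logical": {"algorithm": "RUC-CC",                    "needs_global_gts": False},
    "transaction_causal":  {"algorithm": "RUC-CC",                    "needs_global_gts": True},
    "linear":              {"algorithm": "linear guarantee + MVCC",   "needs_global_gts": True},
    "full":                {"algorithm": "linear guarantee + RUC-CC", "needs_global_gts": True},
}

def commit_transaction(level, fetch_global_gts):
    """Finish the commit step: fetch a Gts only when the chosen level needs it."""
    gts = fetch_global_gts() if CONSISTENCY_LEVELS[level]["needs_global_gts"] else None
    return {"state": "committed", "gts": gts}
```

Only the transaction logical consistency level commits without touching the global Gts generation cluster, which is why it is the cheapest level to run.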
CN201910247559.7A 2019-02-02 2019-03-29 Distributed system and method for ensuring transaction consistency and linear consistency Active CN109977171B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019101072809 2019-02-02
CN201910107280 2019-02-02

Publications (2)

Publication Number Publication Date
CN109977171A CN109977171A (en) 2019-07-05
CN109977171B true CN109977171B (en) 2023-04-28

Family

ID=67081468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910247559.7A Active CN109977171B (en) 2019-02-02 2019-03-29 Distributed system and method for ensuring transaction consistency and linear consistency

Country Status (1)

Country Link
CN (1) CN109977171B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427427B (en) * 2019-08-02 2022-05-27 北京快立方科技有限公司 Method for realizing global transaction distributed processing through pin bridging
CN110457157B (en) * 2019-08-05 2021-05-11 腾讯科技(深圳)有限公司 Distributed transaction exception handling method and device, computer equipment and storage medium
CN111190935B (en) 2019-08-27 2022-10-14 中国人民大学 Data reading method and device, computer equipment and storage medium
CN112650561B (en) * 2019-10-11 2023-04-11 金篆信科有限责任公司 Transaction management method, system, network device and readable storage medium
CN110807046B (en) * 2019-10-31 2022-06-07 浪潮云信息技术股份公司 Novel distributed NEWSQL database intelligent transaction optimization method
CN111399447A (en) * 2019-12-26 2020-07-10 德华兔宝宝装饰新材股份有限公司 Board-like customization furniture quality control system based on MES
CN111159252B (en) * 2019-12-27 2022-10-21 腾讯科技(深圳)有限公司 Transaction execution method and device, computer equipment and storage medium
CN111240810B (en) * 2020-01-20 2024-02-06 上海达梦数据库有限公司 Transaction management method, device, equipment and storage medium
CN111338766B (en) * 2020-03-12 2022-10-25 腾讯科技(深圳)有限公司 Transaction processing method and device, computer equipment and storage medium
CN111597015B (en) * 2020-04-27 2023-01-06 腾讯科技(深圳)有限公司 Transaction processing method and device, computer equipment and storage medium
CN111475585B (en) * 2020-06-22 2021-06-01 阿里云计算有限公司 Data processing method, device and system
CN113934792B (en) * 2020-06-29 2023-03-24 金篆信科有限责任公司 Processing method and device of distributed database, network equipment and storage medium
CN111651244B (en) * 2020-07-01 2023-08-18 中国银行股份有限公司 Distributed transaction processing system
CN112286992B (en) * 2020-10-29 2021-08-24 星环信息科技(上海)股份有限公司 Query method, distributed system, device and storage medium
CN112732414B (en) * 2020-12-29 2023-12-08 北京浪潮数据技术有限公司 Distributed transaction processing method and system in OLTP mode and related components
CN112463311B (en) * 2021-01-28 2021-06-08 腾讯科技(深圳)有限公司 Transaction processing method and device, computer equipment and storage medium
CN112948064B (en) * 2021-02-23 2023-11-03 北京金山云网络技术有限公司 Data reading method, device and system
EP4307137A1 (en) * 2021-04-06 2024-01-17 Huawei Cloud Computing Technologies Co., Ltd. Transaction processing method, distributed database system, cluster, and medium
CN113238892B (en) * 2021-05-10 2022-01-04 深圳巨杉数据库软件有限公司 Time point recovery method and device for global consistency of distributed system
CN113391885A (en) * 2021-06-18 2021-09-14 电子科技大学 Distributed transaction processing system
CN113778632B (en) * 2021-09-14 2024-06-18 杭州沃趣科技股份有限公司 Cassandra database-based distributed transaction management method
CN114003657A (en) * 2021-10-11 2022-02-01 阿里云计算有限公司 Data processing method, system, device and storage medium for distributed database
CN114328613B (en) * 2022-03-03 2022-07-05 阿里云计算有限公司 Method, device and system for processing distributed transactions in Structured Query Language (SQL) database
CN114510539B (en) * 2022-04-18 2022-06-24 北京易鲸捷信息技术有限公司 Method for generating and applying consistency check point of distributed database
CN115145942B (en) * 2022-09-05 2023-01-17 北京奥星贝斯科技有限公司 Distributed database system and method and device for realizing monotonous reading of distributed database system

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101286123A (en) * 2006-12-28 2008-10-15 英特尔公司 Efficient and consistent software transactional memory
CN102831156A (en) * 2012-06-29 2012-12-19 浙江大学 Distributed transaction processing method on cloud computing platform
CN103198159A (en) * 2013-04-27 2013-07-10 国家计算机网络与信息安全管理中心 Transaction-redo-based multi-copy consistency maintaining method for heterogeneous clusters

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20110161281A1 (en) * 2009-12-30 2011-06-30 Sybase, Inc. Distributed Transaction Management in a Distributed Shared Disk Cluster Environment
WO2013019892A1 (en) * 2011-08-01 2013-02-07 Tagged, Inc. Generalized reconciliation in a distributed database




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant