CN113391885A - Distributed transaction processing system - Google Patents


Info

Publication number
CN113391885A
Authority
CN
China
Prior art keywords
transaction
module
layer
read
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110676531.2A
Other languages
Chinese (zh)
Inventor
李建平
肖飞
高源�
周越
俞腾秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110676531.2A
Publication of CN113391885A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/466 Transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/465 Distributed object oriented systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/524 Deadlock detection or avoidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed transaction processing system comprising a computing layer, a storage layer and a scheduling layer. The computing layer receives the SQL request of the client, converts it into a KV read-write request, transmits the KV read-write request to the storage layer, and initiates a two-phase commit of the transaction based on the SQL request. The storage layer provides a read-write interface for the two-phase commit initiated by the computing layer and provides a storage service. The scheduling layer maintains the meta-information of the whole cluster, including KV data distribution information, provides a global time service for the computing layer and the storage layer, and stores the transaction distribution information of the two-phase commit. Transactions are committed through a two-phase commit protocol, and a multi-version concurrency control mechanism ensures that reads and writes do not block each other.

Description

Distributed transaction processing system
Technical Field
The invention belongs to the technical field of distributed systems, and particularly relates to a distributed transaction processing system.
Background
In the era of big data, the development of the mobile internet, smart devices and Internet of Things technology has caused the global data volume to grow explosively. The traditional single-machine database is limited in its ability to scale and can hardly bear massive business demands, so people began to explore distributing the database, and distributed databases represented by Google Spanner appeared. A distributed database offers strong consistency, high availability, scalability, easy operation and maintenance, and fault and disaster tolerance; it provides highly concurrent transaction processing that satisfies the ACID properties, and it can meet the requirements of low latency and massive concurrency. Distributed databases generally use multiple partitions and multiple replicas to achieve scalability and high availability, and distributed transactions often have to compromise between strong consistency and transaction performance, which makes distributed transactions one of the most challenging research topics in the field of distributed databases.
In recent years, distributed databases have been studied and practiced in academia and industry with considerable results. A common practice is to adapt the traditional two-phase locking scheme of transaction concurrency control into a two-phase commit scheme for distributed scenarios. On the one hand, two-phase commit suffers from problems such as synchronous blocking, single points of failure and split brain; improved schemes such as three-phase commit have been proposed, but they usually carry higher execution overhead or complexity and are therefore not widely used in industry. On the other hand, two-phase commit itself requires multiple rounds of network communication, which increases transaction response time and limits the scalability and high availability of the database. Optimizing and improving the two-phase commit algorithm, so as to raise transaction performance as far as possible while preserving the isolation level, is therefore a worthwhile research direction in the field of distributed databases.
Disclosure of Invention
The invention aims to solve the problem of distributed transaction processing and provides a distributed transaction processing system.
The technical scheme of the invention is as follows: a distributed transaction processing system comprises a computing layer, a storage layer and a scheduling layer;
the computing layer is used for receiving the SQL request of the client, converting the SQL request into a KV read-write request, transmitting the KV read-write request to the storage layer, and initiating a two-phase commit of the transaction based on the SQL request;
the storage layer is used for providing a read-write interface for the two-phase commit initiated by the computing layer and providing a storage service;
the scheduling layer is used for maintaining the meta-information of the whole cluster, including KV data distribution information, providing a global time service for the computing layer and the storage layer, and storing the transaction distribution information of the two-phase commit.
Further, the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module and an RPC client module;
the MySQL protocol module is used for parsing the MySQL data packets sent by the client, passing the SQL request obtained by parsing to the SQL core module for processing, and packaging the processing result returned by the SQL core module and returning it to the client;
the SQL core module is used for processing the SQL request and passing the resulting KV read-write request to the KV storage module;
the KV storage module is used for parsing the KV read-write request and initiating a two-phase commit of the transaction based on the parsed KV read-write request;
and the RPC client module is used for sending the two-phase commit initiated by the KV storage module to the storage layer and the scheduling layer as RPC requests.
Further, the SQL core module comprises a parser, an optimizer, an executor and a session manager;
the parser is used for parsing the SQL request and generating an abstract syntax tree;
the optimizer is used for analyzing and optimizing the abstract syntax tree to generate a logical execution plan and a physical execution plan;
the executor is used for executing the logical execution plan and the physical execution plan;
and the session manager is used for passing the KV read-write request obtained by executing the logical execution plan and the physical execution plan to the KV storage module.
Furthermore, the two-phase commit function initiated by the KV storage module specifically comprises transaction creation, transaction local locking, lock conflict handling and long-transaction management;
transaction creation is used for providing the executor of the SQL core module with a KV read-write interface that satisfies transaction isolation, providing a write cache in the computing layer, providing snapshot isolation for transaction reads and providing two-phase commit for transaction commit;
the transaction local lock is used for adding a local lock to the Keys modified by the transaction;
lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the two-phase commit;
long-transaction management is used for updating the time-to-live of the transaction local lock.
Further, the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module and a storage engine module;
the RPC server module is used for receiving the two-phase commit requests sent by the RPC client module, parsing them and passing them to the MVCC module;
the MVCC module is used for providing a read-write interface for the two-phase commit, packaging the transactional read-write operations of the two-phase commit into pure KV read-write operation requests, and sending the pure KV read-write operation requests to the local or remote consensus protocol module;
the consensus protocol module is used for synchronizing the pure KV read-write operation requests to the storage engine module in the form of a log;
and the storage engine module is used for writing the pure KV read-write operation requests to disk for storage.
Further, the scheduling layer comprises a Region manager module, a time service module and an etcd module;
the Region manager module is used for querying and updating the Region distribution information held by the scheduling layer;
the time service module is used for providing timestamps for the two-phase commit, serving as the start time and identifier of the transaction and as the marker of its commit;
the etcd module is used for storing the transaction distribution information of the two-phase commit.
The invention has the beneficial effects that:
(1) Transactions are committed through a two-phase commit protocol, and a multi-version concurrency control mechanism ensures that reads and writes do not block each other.
(2) To address the high latency of distributed transactions, an asynchronous commit protocol is designed: once the first-phase write is complete, the response is returned to the client immediately while the second-phase commit is completed asynchronously, which removes the transaction latency of one round of network IO. In addition, the transaction locking step is skipped: when no transaction conflict exists, the record is written directly and the result returned, which reduces the number of steps at commit time and greatly improves transaction performance.
Drawings
FIG. 1 is a block diagram of a distributed transaction processing system.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
MVCC: multi-version concurrency control, a concurrency control method commonly used in database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.
KV read-write: data is stored as KV pairs, namely key length, key and value, on which read and write operations are performed.
RPC: remote procedure call, a protocol that a program can use to request a service from a program on another computer in the network.
Region: region information.
etcd: a highly available distributed key-value database that can be used for service discovery.
As shown in FIG. 1, the invention provides a distributed transaction processing system comprising a computing layer, a storage layer and a scheduling layer; each layer can be deployed as an independent component on different server nodes;
the computing layer is used for receiving the SQL request of the client, converting the SQL request into a KV read-write request, transmitting the KV read-write request to the storage layer, and initiating a two-phase commit of the transaction based on the SQL request;
the computing layer processes user requests, parses SQL statements, optimizes the SQL and generates an efficient execution plan. It maintains the Session and the transaction context and acts as the client that initiates the two-phase commit. It converts SQL statements into KV requests and sends them to the storage layer;
the storage layer is used for providing a read-write interface for the two-phase commit initiated by the computing layer, managing on disk the data committed by the two-phase commit, and providing a highly available KV storage service;
the scheduling layer is used for maintaining the meta-information of the whole cluster, including KV data distribution information, providing a global time service for the computing layer and the storage layer, and storing the transaction distribution information of the two-phase commit.
In the embodiment of the invention, the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module and an RPC client module;
the MySQL protocol module is responsible for maintaining the connection state and is used for parsing the MySQL data packets sent by the client, passing the SQL request obtained by parsing to the SQL core module for processing, and packaging the processing result returned by the SQL core module and returning it to the client;
the SQL core module is used for processing the SQL request and passing the resulting KV read-write request to the KV storage module;
the SQLCore module is the core module of the computing layer and is responsible for processing and executing SQL statements; it comprises submodules such as a syntax parser, an optimizer, an executor and session management;
the KV storage module is used for parsing the KV read-write request and initiating a two-phase commit of the transaction based on the parsed KV read-write request;
the KVStore module is responsible for organizing KV requests, providing transaction management functions and maintaining transaction context objects;
and the RPC client module is used for sending the two-phase commit initiated by the KV storage module to the storage layer and the scheduling layer as RPC requests.
The RPC client module is responsible for packaging the RPC requests of the upper modules, sending them to the storage and scheduling nodes, receiving and returning the RPC responses, and maintaining the life cycle of each RPC request. Through it, the computing node reads and writes KV data on the storage nodes, performs the two-phase commit of transactions, and so on. The computing node also obtains globally unique, strictly increasing timestamps from the scheduling node, together with meta-information such as KV partition distribution and routing information.
In the embodiment of the invention, the SQL core module comprises a parser, an optimizer, an executor and a session manager;
the parser is used for parsing the SQL request and generating an abstract syntax tree;
the optimizer is used for analyzing and optimizing the abstract syntax tree to generate a logical execution plan and a physical execution plan;
the executor is used for executing the logical execution plan and the physical execution plan;
and the session manager is used for passing the KV read-write request obtained by executing the logical execution plan and the physical execution plan to the KV storage module.
In the database, the logical relational SQL model is converted into a physical key-value mapping and stored on the storage nodes as ordered KV pairs. Each SQL tuple is converted into a key-value pair, i.e. one SQL row corresponds to one KV record:
(tableID, rowID) → (col1, col2, col3, col4)
where Key consists of the table identifier (TableID) and the row identifier (RowID), and Value consists of the values of all columns of the row.
Each index entry is also converted into a key-value pair:
(tableID, indexID, ColumnValue) → rowID
where Key consists of the table identifier, the index identifier (IndexID) and the value of the indexed column, and Value is the identifier of the row (RowID). When the index is used for a query, the corresponding row identifier is first found by scanning the index KV pairs, and all data of the row is then fetched through the row identifier.
This KV encoding preserves the ordering of SQL tuples. All rows of a table are arranged in the Key space in RowID order, and the entries of an index are arranged in the Key space in order of the indexed column values.
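The mapping above can be illustrated with a minimal Python sketch. The one-byte prefixes, the fixed-width big-endian integer encoding and all function names are assumptions made for illustration; the patent does not specify a byte layout. The index is assumed to hold unique integer values.

```python
import struct

ROW_PREFIX = b"r"    # hypothetical tag separating row keys from index keys
INDEX_PREFIX = b"i"  # hypothetical tag for index keys


def encode_int(value: int) -> bytes:
    """Encode a signed 64-bit integer so byte order equals numeric order."""
    # Adding 2**63 shifts the signed range onto 0..2**64-1 without reordering.
    return struct.pack(">Q", value + (1 << 63))


def row_key(table_id: int, row_id: int) -> bytes:
    """(tableID, rowID) -> key of the row record."""
    return ROW_PREFIX + encode_int(table_id) + encode_int(row_id)


def index_key(table_id: int, index_id: int, column_value: int) -> bytes:
    """(tableID, indexID, ColumnValue) -> key of the index entry."""
    return (INDEX_PREFIX + encode_int(table_id)
            + encode_int(index_id) + encode_int(column_value))


# Table 1 with three rows (rowID -> (name, age)) and index 1 on the age column.
rows = {10: ("alice", 30), 2: ("bob", 25), -5: ("carol", 41)}
kv = {}
for rid, (name, age) in rows.items():
    kv[row_key(1, rid)] = (name, age)
    kv[index_key(1, 1, age)] = rid

# Scanning keys in byte order visits rows by RowID and index entries by value.
row_scan = [kv[k] for k in sorted(kv) if k.startswith(ROW_PREFIX)]
index_scan = [kv[k] for k in sorted(kv) if k.startswith(INDEX_PREFIX)]
```

An index lookup then takes two steps: read the RowID from the index entry, then read the row through `row_key`.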
In the embodiment of the invention, the two-phase commit function initiated by the KV storage module specifically comprises transaction creation, transaction local locking, lock conflict handling and long-transaction management;
transaction creation is used for providing the executor of the SQL core module with a KV read-write interface that satisfies transaction isolation, providing a write cache in the computing layer, providing snapshot isolation for transaction reads and providing two-phase commit for transaction commit;
KV read-write interface: provides the executor of the SQL module with a KV read-write interface that satisfies transaction isolation;
write cache: KV modification operations are not executed immediately; they are first cached in the computing layer and submitted together when the transaction commits;
snapshot isolation: the current transaction can read only the latest data committed before its start time and the data modified within the transaction itself;
two-phase commit: when a transaction commit request is received, a new two-phase commit client is created to commit the transaction, and the transaction is rolled back if a conflict or error occurs.
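The interplay of the write cache and snapshot isolation can be sketched as follows. This is a minimal single-process illustration; the `Transaction` class, its method names and the in-memory version list are hypothetical and not taken from the patent.

```python
from typing import Dict, List, Optional, Tuple

# Committed versions per key as (commit_ts, value); a hypothetical layout.
Versions = Dict[str, List[Tuple[int, Optional[str]]]]


class Transaction:
    """Buffers writes locally and reads from a snapshot taken at start_ts."""

    def __init__(self, store: Versions, start_ts: int) -> None:
        self.store = store
        self.start_ts = start_ts
        # Write cache: nothing reaches the store until commit. None = delete.
        self.write_buffer: Dict[str, Optional[str]] = {}

    def put(self, key: str, value: str) -> None:
        self.write_buffer[key] = value

    def delete(self, key: str) -> None:
        self.write_buffer[key] = None

    def get(self, key: str) -> Optional[str]:
        # The transaction's own modifications take precedence.
        if key in self.write_buffer:
            return self.write_buffer[key]
        # Otherwise return the newest version committed before start_ts.
        visible = [(ts, v) for ts, v in self.store.get(key, [])
                   if ts < self.start_ts]
        if not visible:
            return None
        return max(visible, key=lambda pair: pair[0])[1]
```

A transaction started at timestamp 10 therefore ignores a version committed at timestamp 12, and its own buffered writes stay invisible to other transactions until commit.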
The transaction local lock is used for adding a local lock to the Keys modified by the transaction; this prevents several transactions from reading and writing the same Key at the same time and causing transaction conflicts and rollback;
lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the two-phase commit, so that new transaction commits are not blocked.
Long-transaction management periodically sends heartbeat packets to the storage node to update the time-to-live of the transaction's lock.
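Lock expiry and the heartbeat refresh might look like the sketch below. The `Lock` fields, the millisecond clock passed in by the caller and the `resolve_conflict` helper are assumptions for illustration; a production system would typically also check the state of the owning transaction before clearing a lock.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lock:
    start_ts: int           # identifies the owning transaction
    ttl_ms: int             # time-to-live granted to the lock
    last_heartbeat_ms: int  # time of the most recent refresh

    def is_expired(self, now_ms: int) -> bool:
        return now_ms - self.last_heartbeat_ms > self.ttl_ms

    def heartbeat(self, now_ms: int) -> None:
        # A long-running transaction calls this periodically to stay alive.
        self.last_heartbeat_ms = now_ms


def resolve_conflict(locks: Dict[str, Lock], key: str, now_ms: int) -> bool:
    """Return True if the caller may proceed with `key`."""
    lock = locks.get(key)
    if lock is None:
        return True
    if lock.is_expired(now_ms):
        # The owner stopped sending heartbeats: clear the stale lock.
        del locks[key]
        return True
    return False
```

Without the heartbeat, any transaction running longer than the TTL would have its locks cleared by competitors; with it, only transactions whose owner has gone silent are cleaned up.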
In the embodiment of the invention, the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module and a storage engine module;
the RPC server module is used for receiving the two-phase commit requests sent by the RPC client module, parsing them and passing them to the MVCC module;
the MVCC module is used for providing a read-write interface for the two-phase commit, packaging the transactional read-write operations of the two-phase commit into pure KV read-write operation requests, and sending the pure KV read-write operation requests to the local or remote consensus protocol module;
the functions of the multi-version concurrency control (MVCC) module are as follows:
multi-version concurrency control: to improve concurrency and keep reads and writes from blocking each other, a transactional read-write interface with version numbers is provided, which controls the visibility of data;
local lock: a lock is taken before the storage layer accesses a given Key, avoiding write contention between transactions;
lock wait and wake-up: for a pessimistic transaction that fails to acquire a lock, the transaction thread waiting for the lock is put to sleep and is woken up after the lock is released;
deadlock detection: a transaction wait-for graph is maintained, and deadlocks are detected by checking for cycles before a new transaction begins waiting on a lock.
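The deadlock check described in the last item can be sketched as a wait-for graph with a reachability test. The class and method names are hypothetical, and transactions are identified by integers (for example their StartTS) for simplicity.

```python
from typing import Dict, Set


class WaitForGraph:
    """Edge waiter -> holder means `waiter` is blocked on a lock of `holder`."""

    def __init__(self) -> None:
        self.edges: Dict[int, Set[int]] = {}

    def _reaches(self, src: int, target: int) -> bool:
        # Iterative depth-first search from src looking for target.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def try_wait(self, waiter: int, holder: int) -> bool:
        """Record the wait unless it would close a cycle (a deadlock)."""
        if waiter == holder or self._reaches(holder, waiter):
            return False
        self.edges.setdefault(waiter, set()).add(holder)
        return True

    def stop_waiting(self, waiter: int) -> None:
        # Called when the waiter obtains the lock or gives up.
        self.edges.pop(waiter, None)
```

Checking before the edge is added means a deadlock is refused at the moment it would form, so the requesting transaction can be aborted instead of sleeping forever.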
The consensus protocol module is used for synchronizing the pure KV read-write operation requests to the storage engine module in the form of a log;
and the storage engine module is used for writing the pure KV read-write operation requests to disk for storage.
The consensus protocol module is responsible for providing a highly available data service based on the Raft protocol. The Raft protocol synchronizes KV operations to all nodes in the form of a log, and each node calls the interface of its local storage module to persist the log data. After a majority of nodes have synchronized the log, the KV operations are written to disk through the storage engine. To provide horizontal scalability, a range partition strategy is used to divide the KV set into many intervals, each of which is called a Region. The Raft consensus algorithm maintains consistency among the replicas of each Region, and all replicas of a Region form a Raft group.
This extended Raft algorithm is also known as Multi-Raft. The consensus protocol module provides the following specific functions:
(1) meta-information management: maintains the Key interval stored in each Region, node states and other information, and reports this information to the scheduling layer periodically through heartbeat packets;
(2) election and voting: implements Raft leader election and role changes of Raft members;
(3) log replication: each data change is converted into a Raft log entry, and the data is synchronized safely and reliably to a majority of the nodes in the group through Raft log replication;
(4) crash recovery: after a node goes down, it is restored to a correct state by synchronizing and replaying the Raft log, so that no data is lost;
(5) Region splitting and merging: Regions are distributed as evenly as possible across all nodes in the cluster, which facilitates horizontal scaling and load balancing.
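Range partitioning into Regions, and the split in item (5), can be illustrated with a small routing table. The `Region` fields, the convention that an empty end key means "unbounded" and the `RegionTable` class are assumptions for this sketch only.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    region_id: int
    start_key: bytes  # inclusive
    end_key: bytes    # exclusive; b"" is used here to mean "no upper bound"


class RegionTable:
    """Routes a key to the Region whose [start_key, end_key) contains it."""

    def __init__(self, regions: List[Region]) -> None:
        self.regions = sorted(regions, key=lambda r: r.start_key)
        self.starts = [r.start_key for r in self.regions]

    def locate(self, key: bytes) -> Region:
        # The owning Region is the last one whose start_key <= key.
        idx = bisect.bisect_right(self.starts, key) - 1
        return self.regions[idx]

    def split(self, region_id: int, split_key: bytes, new_region_id: int) -> None:
        idx = next(i for i, r in enumerate(self.regions)
                   if r.region_id == region_id)
        old = self.regions[idx]
        inside = old.start_key < split_key and (
            old.end_key == b"" or split_key < old.end_key)
        if not inside:
            raise ValueError("split_key must lie strictly inside the Region")
        # The new Region takes over the upper half of the old interval.
        right = Region(new_region_id, split_key, old.end_key)
        old.end_key = split_key
        self.regions.insert(idx + 1, right)
        self.starts.insert(idx + 1, split_key)


table = RegionTable([
    Region(1, b"", b"g"),
    Region(2, b"g", b"p"),
    Region(3, b"p", b""),
])
```

Because keys are ordered, a lookup is a binary search over the Region start keys, and a split only touches the one Region being divided.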
In the embodiment of the invention, the scheduling layer comprises a Region manager module, a time service module and an etcd module;
the Region manager module is used for querying and updating the Region distribution information held by the scheduling layer;
the Region management module mainly maintains Region information: the storage nodes periodically report Region distribution information to the scheduling node through heartbeats. It also provides Region routing information: the computing nodes periodically query the scheduling node for Region distribution information and cache it in the computing layer.
The time service module is used for providing timestamps for the two-phase commit, serving as the start time and identifier of the transaction and as the marker of its commit;
the time service module is the only time service in the cluster and provides monotonically increasing timestamps to the outside. As required by the snapshot isolation level, every transaction must obtain a globally unique timestamp when it begins, as its start time (StartTS) and identifier. The transaction can read only the latest data committed before StartTS. When a transaction commits, it may also need to obtain a timestamp as the marker (CommitTS) of the commit.
The etcd module is used for storing the transaction distribution information of the two-phase commit.
Because downtime of the time service server would make the whole database cluster unavailable, the scheduling layer stores its meta-information in the etcd module, which makes it possible to provide a highly available time service.
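One way a time service can stay monotonic across restarts is to persist an upper bound before handing out timestamps below it. The sketch below illustrates that idea only; the window-reservation scheme, the `TimestampOracle` name and the in-memory `persisted_max` attribute (standing in for a value that would be saved in etcd) are assumptions, not details given in the patent.

```python
import threading


class TimestampOracle:
    """Hands out strictly increasing timestamps, safe across restarts."""

    def __init__(self, persisted_max: int = 0, window: int = 1000) -> None:
        self._mutex = threading.Lock()
        self._window = window
        self._last = persisted_max          # never return anything <= this
        self.persisted_max = persisted_max  # stand-in for a value kept in etcd

    def get_ts(self) -> int:
        with self._mutex:
            if self._last >= self.persisted_max:
                # Reserve the next window before using any timestamp from it.
                self.persisted_max = self._last + self._window
            self._last += 1
            return self._last
```

After a crash, a replacement instance starts from the persisted bound, so its first timestamp is larger than anything the previous instance could have issued.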
In the embodiment of the invention, a user starts a transaction by sending a BEGIN statement to the database. After the SQL module of the computing layer parses the statement, it creates a transaction context object (TransactionContext) and binds it to the current Session; this object maintains the state of the current transaction and provides the KV data read-write interface. Meanwhile, the computing layer sends a request to the time service server and obtains a globally unique timestamp as the start time (StartTS) of the transaction. After the transaction context object has been created successfully, the computing layer returns a success response to the client.
After a transaction has been opened successfully, all create, read, update and delete (CRUD) operations performed by the user are confined to the context of that transaction. For a read operation (SELECT statement), the isolation requirement of the transaction means that only data modified within the current transaction and the latest data of transactions committed before StartTS can be read.
The two-phase commit client first obtains all data modification operations from the write cache object and initializes a modification set (Mutation). The first modified Key in the write cache object becomes the PrimaryKey of the set, and the other Keys are called SecondaryKeys. The PrimaryKey identifies the state of the current transaction: by checking the lock on the PrimaryKey, other transactions can determine whether the current transaction should go on to commit or be rolled back. In both the Prewrite phase and the Commit phase, the PrimaryKey of the set must be sent first, and the other Keys are sent only after the PrimaryKey has been locked or written successfully. The two-phase commit client then tries to add a computing-layer local lock (Latch) to all Keys to be modified, preventing other transactions from modifying the same Key concurrently and causing transaction conflicts and rollback. If some Keys are found to be locked by other transactions, the client retries for a period of time; if the locks have still not been released, it aborts the commit and returns an error to the user. If locking succeeds, the two-phase commit steps begin.
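The choice of PrimaryKey and the all-or-nothing acquisition of computing-layer latches can be sketched as follows. The function and class names are hypothetical, the write cache is modelled as an insertion-ordered dict, and the retry loop uses a fixed number of attempts where a real client would wait between attempts.

```python
from typing import Dict, List, Optional, Tuple


def build_mutations(
    write_buffer: Dict[str, Optional[str]]
) -> Tuple[str, List[str]]:
    """Return (primary_key, secondary_keys) from the buffered writes."""
    keys = list(write_buffer)  # dicts keep insertion order
    if not keys:
        raise ValueError("nothing to commit")
    return keys[0], keys[1:]


class LatchTable:
    """Computing-layer local locks, keyed by Key, owned by a StartTS."""

    def __init__(self) -> None:
        self._owners: Dict[str, int] = {}

    def try_acquire(self, start_ts: int, keys: List[str]) -> bool:
        # Acquire every key or none, so a failed attempt holds nothing.
        if any(self._owners.get(k, start_ts) != start_ts for k in keys):
            return False
        for k in keys:
            self._owners[k] = start_ts
        return True

    def release(self, start_ts: int) -> None:
        for k in [k for k, o in self._owners.items() if o == start_ts]:
            del self._owners[k]


def acquire_with_retry(latches: LatchTable, start_ts: int,
                       keys: List[str], attempts: int = 3) -> bool:
    """Retry a bounded number of times, then give up (commit is aborted)."""
    for _ in range(attempts):
        if latches.try_acquire(start_ts, keys):
            return True
    return False
```

Taking all latches atomically avoids the situation where two committing transactions each hold part of the other's key set.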
The RPCServer of the storage node parses the RPC request and passes it to the multi-version concurrency control (MVCCStore) module. The MVCCStore module first checks whether each Key is locked by another transaction. If it is, the module retries and waits for the lock to be released; if the maximum wait time is exceeded, it returns a transaction conflict error to the computing layer. It then checks the latest record of the current Key in local storage. If the CommitTS in that record is greater than the StartTS of the current transaction, the data read by the current transaction is no longer the latest, and a transaction conflict error is returned to the computing layer.
Then, the MVCCStore module creates a multi-version concurrency control lock (hereinafter MVCCLock) for each Key; the lock contains information such as the StartTS (serving as the globally unique identifier of the transaction) and the KV record. The MVCCStore module synchronizes all MVCCLock objects to the other nodes of the Raft group through the consensus protocol layer to ensure high availability of the storage layer.
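The Prewrite checks described above can be condensed into a single-node sketch. The data layout, the `TransactionConflict` exception and the method signature are assumptions; waiting for a lock to be released and replication through the consensus layer are left out, so a conflict is reported immediately.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


class TransactionConflict(Exception):
    pass


@dataclass
class MVCCLock:
    start_ts: int           # globally unique identifier of the transaction
    primary_key: str
    value: Optional[str]    # the KV record to be written; None = delete


@dataclass
class MVCCStore:
    locks: Dict[str, MVCCLock] = field(default_factory=dict)
    # key -> committed records as (commit_ts, start_ts, value)
    writes: Dict[str, List[Tuple[int, int, Optional[str]]]] = field(
        default_factory=dict)

    def prewrite(self, mutations: Dict[str, Optional[str]],
                 primary_key: str, start_ts: int) -> None:
        # Validate every key first so a failure leaves no partial locks.
        for key in mutations:
            lock = self.locks.get(key)
            if lock is not None and lock.start_ts != start_ts:
                raise TransactionConflict(f"{key} is locked by {lock.start_ts}")
            latest = max((c for c, _, _ in self.writes.get(key, [])), default=0)
            if latest > start_ts:
                # Someone committed after our snapshot was taken.
                raise TransactionConflict(f"{key} was committed at {latest}")
        for key, value in mutations.items():
            self.locks[key] = MVCCLock(start_ts, primary_key, value)
```

The second check is what enforces snapshot isolation on writes: a transaction may not overwrite a version it could not have seen.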
In the consensus protocol layer, after the Raft leader accepts a write command from the upper layer, it converts the command into a Raft log entry, appends the entry to its own log, and then sends it to the followers in the same Raft group through AppendEntries RPC requests. After receiving responses showing that a majority of nodes have replicated the log, the leader applies the log itself and notifies the followers through heartbeat packets. After each node has applied the Raft log, the MVCCLock is written to local storage and persisted.
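The majority rule used by the leader can be expressed as a short calculation over the replication progress of each node. This sketch covers only the counting step; the function names are hypothetical, and Raft's additional requirement that the entry belong to the leader's current term is omitted.

```python
from typing import List


def quorum_size(node_count: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return node_count // 2 + 1


def committed_index(replicated: List[int]) -> int:
    """Highest log index stored on a majority of nodes.

    replicated[i] is the last log index known to be stored on node i,
    with the leader itself included in the list.
    """
    ordered = sorted(replicated, reverse=True)
    # The quorum-th largest value is present on at least a quorum of nodes.
    return ordered[quorum_size(len(ordered)) - 1]
```

With three replicas at indexes 7, 7 and 5, entry 7 is already on two nodes and can be applied even though the third replica is lagging.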
After the leader succeeds in applying the journal, writes to MVCCLOCK, it returns a response that the Prewrite succeeded to the compute layer. After the computing layer receives the response, the two-phase Commit client begins executing the Commit phase.
For the Commit phase, the two-phase Commit client obtains a timestamp from the global time service server as the Commit ts for the current transaction. And then, the two-stage submission client sends a CommitRPC request containing information such as all Key and CommitTS to the corresponding storage node.
After receiving the request, the storage node transmits the request to a multi-version concurrency control (MVCCStore) module. The mvcsctore first attempts to add a storage tier local lock (Latch) to the set of keys to avoid concurrent modification of the same Key by transactions from different compute nodes, resulting in transaction conflicts and rollback. Then, the mvcsctore acquires the set of mvcclocks, constructs KV records containing Key and Value, and meta information of StartTS and commit ts of the transaction, synchronizes them to other storage nodes through the consensus protocol layer, and finally writes them into the local storage layer.
Finally, the MVCCStore deletes the corresponding MVCCLocks and returns a commit-success response to the computing layer. After receiving the response, the computing layer returns a response to the user indicating that the transaction was committed successfully. At this point the life cycle of the whole transaction ends; temporary objects such as the transaction context are destroyed and the corresponding memory is reclaimed.
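The Commit-phase handling on a storage node can be sketched as below. This is a minimal single-node illustration under stated assumptions: the class `StorageNodeSketch` and its fields are hypothetical, one `threading.Lock` stands in for the per-Key Latch, the lock and record stores are plain dictionaries, and synchronization through the consensus protocol layer is omitted.

```python
import threading
from typing import Dict, List, Tuple


class StorageNodeSketch:
    def __init__(self) -> None:
        # key -> (start_ts, value): MVCCLocks left behind by Prewrite
        self.locks: Dict[bytes, Tuple[int, bytes]] = {}
        # key -> list of (commit_ts, start_ts, value): committed versions
        self.records: Dict[bytes, List[Tuple[int, int, bytes]]] = {}
        # Simplification: one latch for the whole node instead of one per Key.
        self._latch = threading.Lock()

    def prewrite(self, key: bytes, value: bytes, start_ts: int) -> None:
        self.locks[key] = (start_ts, value)

    def commit(self, keys: List[bytes], start_ts: int, commit_ts: int) -> bool:
        with self._latch:
            # Every key must still carry this transaction's lock.
            for key in keys:
                lock = self.locks.get(key)
                if lock is None or lock[0] != start_ts:
                    return False
            # Write the versioned record, then delete the lock.
            for key in keys:
                _, value = self.locks.pop(key)
                self.records.setdefault(key, []).append((commit_ts, start_ts, value))
        return True
```

Committing the same Keys a second time returns `False` in this sketch, because the locks have already been removed.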
The working principle and process of the invention are as follows. The functional requirements for general transactions in the present invention cover three aspects. First, an optimistic transaction model is supported: the user's data modification operations are cached in memory and are submitted together in the transaction commit phase, and conflict detection is performed at commit time. Second, a pessimistic transaction model is supported: a pessimistic lock is written to the storage node at the same time as the user modifies the data, which moves conflict detection earlier and avoids the performance degradation that optimistic transactions suffer in conflict scenarios. Third, the snapshot isolation level is supported: when the transaction starts, a global timestamp is obtained as the transaction identifier, and when the transaction commits, a global commit timestamp is obtained, so that the execution order of transactions is determined.
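The snapshot isolation requirement can be illustrated with a short sketch of a timestamp-based read. The function `snapshot_read` and the representation of versions as `(commit_ts, value)` pairs are assumptions for illustration; the patent does not specify the read path at this level of detail.

```python
from typing import List, Optional, Tuple


def snapshot_read(versions: List[Tuple[int, bytes]], start_ts: int) -> Optional[bytes]:
    """Return the newest value whose commit timestamp is not later than start_ts.

    versions is a list of (commit_ts, value) pairs for one key, in any order.
    """
    visible = [(commit_ts, value) for commit_ts, value in versions if commit_ts <= start_ts]
    if not visible:
        return None  # the key did not exist in this transaction's snapshot
    return max(visible, key=lambda item: item[0])[1]
```

A transaction that started at timestamp 10 keeps reading the version committed at 5, even after another transaction commits a newer version at 12.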
The invention has the beneficial effects that:
(1) The invention commits transactions through a two-phase commit protocol and adopts a multi-version concurrency control mechanism, so that reads and writes do not block each other.
(2) To address the high latency of distributed transactions, an asynchronous commit protocol is designed: after the first-phase write is completed, a response is returned to the client immediately, while the second-phase transaction commit is completed asynchronously, which removes the transaction latency caused by one round of network IO. In addition, the transaction locking step is skipped, so that once no transaction conflict is found the record is written directly and the call returns, which reduces the number of steps at transaction commit and greatly improves transaction performance.
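The asynchronous commit idea in item (2) can be sketched as follows. This is only an outline of the control flow, assuming that the two phases are available as callables; the function name `commit_async` and the use of a thread pool are illustrative choices, not the mechanism described in the patent, and failure recovery for the second phase is not shown.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional, Tuple


def commit_async(
    executor: ThreadPoolExecutor,
    prewrite: Callable[[], bool],
    commit: Callable[[], None],
) -> Tuple[bool, Optional[Future]]:
    """Run the first phase inline and hand the second phase to the executor.

    Returns (True, future) as soon as the first-phase write succeeds, so the
    caller can answer the client without waiting for the second phase.
    Returns (False, None) if the first phase fails; commit is then never run.
    """
    if not prewrite():
        return False, None
    return True, executor.submit(commit)
```

The caller can respond to the client as soon as the first element is `True`; the returned future is only needed by code that wants to observe the completion of the second phase.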
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations remain within the scope of the invention.

Claims (6)

1. A distributed transaction processing system, characterized by comprising a computing layer, a storage layer and a scheduling layer;
the computing layer is used for receiving an SQL request from a client, converting the SQL request into a KV read-write request, transmitting the KV read-write request to the storage layer, and initiating a transaction two-phase commit based on the SQL request;
the storage layer is used for providing a read-write interface for the transaction two-phase commit initiated by the computing layer and providing a storage service;
the scheduling layer is used for providing a global time service for the computing layer and the storage layer and storing the transaction distribution information of the transaction two-phase commit.
2. The distributed transaction processing system of claim 1, wherein the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module, and an RPC client module;
the MySQL protocol module is used for parsing a MySQL data packet sent by the client, passing the SQL request obtained by parsing to the SQL core module for processing, and packaging the processing result returned by the SQL core module and returning it to the client;
the SQL core module is used for processing the SQL request and transmitting the KV read-write request obtained by processing to the KV storage module;
the KV storage module is used for parsing the KV read-write request and initiating a transaction two-phase commit based on the parsed KV read-write request;
and the RPC client module is used for transmitting the transaction two-phase commit initiated by the KV storage module to the storage layer and the scheduling layer by means of RPC requests.
3. The distributed transaction processing system of claim 2, wherein the SQL core module comprises a parser, an optimizer, an executor, and a session manager;
the parser is used for parsing the SQL request and generating an abstract syntax tree;
the optimizer is used for analyzing and optimizing the abstract syntax tree to generate a logical execution plan and a physical execution plan;
the executor is used for executing the logical execution plan and the physical execution plan;
and the session manager is used for transmitting the KV read-write request obtained by executing the logical execution plan and the physical execution plan to the KV storage module.
4. The distributed transaction processing system according to claim 3, wherein the functions of the transaction two-phase commit initiated by the KV storage module specifically include transaction creation, transaction local locking, lock conflict handling, and long transaction management;
the transaction creation is used for providing the executor of the SQL core module with a KV read-write interface that satisfies transaction isolation, providing a cache function for the computing layer, providing a snapshot isolation function for transaction reads, and providing a two-phase commit function for transaction commit;
the transaction local locking is used for adding a local lock to the modifications made by the transaction;
the lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the transaction two-phase commit;
the long transaction management is used for updating the time-to-live of the transaction local lock.
5. The distributed transaction processing system of claim 2, wherein the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module, and a storage engine module;
the RPC server module is used for receiving the transaction two-phase commit sent by the RPC client module, parsing it, and transmitting it to the MVCC module;
the MVCC module is used for providing a read-write interface for the transaction two-phase commit, encapsulating the transaction read-write operations in the transaction two-phase commit into pure KV read-write operation requests, and sending the pure KV read-write operation requests to the consensus protocol module;
the consensus protocol module is used for synchronizing the pure KV read-write operation requests to the storage engine module in the form of logs;
and the storage engine module is used for writing the pure KV read-write operation requests to disk for storage.
6. The distributed transaction processing system of claim 1, wherein the scheduling layer comprises a Region manager module, a time service module, and an etcd module;
the Region manager module is used for querying Region distribution information in the scheduling layer and updating the scheduling layer;
the time service module is used for providing a timestamp for the transaction two-phase commit, which serves as the start time and the identifier of the transaction commit;
the etcd module is used for storing the transaction distribution information of the transaction two-phase commit.
CN202110676531.2A 2021-06-18 2021-06-18 Distributed transaction processing system Pending CN113391885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110676531.2A CN113391885A (en) 2021-06-18 2021-06-18 Distributed transaction processing system

Publications (1)

Publication Number Publication Date
CN113391885A true CN113391885A (en) 2021-09-14

Family

ID=77621811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110676531.2A Pending CN113391885A (en) 2021-06-18 2021-06-18 Distributed transaction processing system

Country Status (1)

Country Link
CN (1) CN113391885A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473318A (en) * 2013-09-12 2013-12-25 中国科学院软件研究所 Distributed transaction security method for memory data grid
US20150172412A1 (en) * 2012-07-06 2015-06-18 Cornell University Managing dependencies between operations in a distributed system
US20150193264A1 (en) * 2012-07-18 2015-07-09 OpenCloud NZ Ltd. Combining scalability across multiple resources in a transaction processing system having global serializability
US20150347243A1 (en) * 2014-05-27 2015-12-03 International Business Machines Corporation Multi-way, zero-copy, passive transaction log collection in distributed transaction systems
CN106033437A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Method and system for processing distributed transaction
CN109977171A (en) * 2019-02-02 2019-07-05 中国人民大学 A kind of distributed system and method guaranteeing transaction consistency and linear consistency
CN112214649A (en) * 2020-10-21 2021-01-12 北京航空航天大学 Distributed transaction solution system of temporal graph database
CN112231070A (en) * 2020-10-15 2021-01-15 北京金山云网络技术有限公司 Data writing and reading method and device and server

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HATEM A. MAHMOUD 等: "MaaT: effective and scalable coordination of distributed transactions in the cloud" *
HENGFENG WEI 等: "Parameterized and Runtime-Tunable Snapshot Isolation in Distributed Transactional Key-Value Stores" *
马鹏玮 等: "互联网环境下分布式事务处理系统现状与趋势" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238353A (en) * 2021-12-21 2022-03-25 山东浪潮科学研究院有限公司 Method and system for realizing distributed transaction
CN115103011A (en) * 2022-06-24 2022-09-23 北京奥星贝斯科技有限公司 Cross-data-center service processing method, device and equipment
CN115103011B (en) * 2022-06-24 2024-02-09 北京奥星贝斯科技有限公司 Cross-data center service processing method, device and equipment
CN115421698A (en) * 2022-08-30 2022-12-02 敏于行(北京)科技有限公司 Data processing method and device based on declarative and distributed accounts book and electronic device
CN115840631A (en) * 2023-01-04 2023-03-24 中科金瑞(北京)大数据科技有限公司 RAFT-based high-availability distributed task scheduling method and equipment
CN115840631B (en) * 2023-01-04 2023-05-16 中科金瑞(北京)大数据科技有限公司 RAFT-based high-availability distributed task scheduling method and equipment
CN116383227A (en) * 2023-06-05 2023-07-04 北京成章数据科技发展有限公司 Distributed cache and data storage consistency processing system and method
CN116383227B (en) * 2023-06-05 2023-08-15 北京成章数据科技发展有限公司 Distributed cache and data storage consistency processing system and method
CN116737744A (en) * 2023-08-14 2023-09-12 金篆信科有限责任公司 Database control system, method, computer device and storage medium
CN116737744B (en) * 2023-08-14 2023-11-24 金篆信科有限责任公司 Database control system, method, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination