CN113391885A

CN113391885A - Distributed transaction processing system

Info

Publication number: CN113391885A
Application number: CN202110676531.2A
Authority: CN
Inventors: 李建平; 肖飞; 高源�; 周越; 俞腾秋
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-09-14

Abstract

The invention discloses a distributed transaction processing system comprising a computing layer, a storage layer and a scheduling layer. The computing layer receives the SQL request of the client, converts it into a KV read-write request, transmits the KV read-write request to the storage layer, and initiates a two-phase commit of the transaction based on the SQL request. The storage layer provides a read-write interface for the two-phase commit initiated by the computing layer and provides the storage service. The scheduling layer maintains the meta-information of the whole cluster, including KV data distribution information, provides a global time service for the computing layer and the storage layer, and stores the transaction distribution information of the two-phase commit. The invention commits transactions through a two-phase commit protocol and adopts a multi-version concurrency control (MVCC) mechanism so that reads and writes do not block each other.

Description

Distributed transaction processing system

Technical Field

The invention belongs to the technical field of distributed systems, and particularly relates to a distributed transaction processing system.

Background

In the big data era, the development of the mobile internet, smart devices and Internet of Things technology has caused the global data volume to grow explosively. The traditional single-machine database is limited in its ability to scale and can hardly bear massive business demands, so people began to explore distributed databases, and distributed databases represented by Google Spanner appeared. A distributed database is characterized by strong consistency, high availability, scalability, ease of operation and maintenance, fault tolerance and disaster tolerance; it offers highly concurrent transaction processing that satisfies the ACID properties, and can meet the requirements of low latency and massive concurrency. Distributed databases generally use multiple partitions and multiple replicas to achieve scalability and high availability, and distributed transactions often have to compromise between strong consistency and transaction performance, which makes distributed transactions one of the most challenging research topics in the field of distributed databases.

In recent years, distributed databases have been widely studied and practiced in academia and industry, with considerable results. A common practice is to convert the traditional two-phase locking scheme for transaction concurrency control into a two-phase commit scheme applied in the distributed scenario. On the one hand, two-phase commit suffers from problems such as synchronous blocking, single points of failure and split brain. Improvements such as three-phase commit have been proposed, but they usually carry higher execution overhead or complexity and are therefore not widely used in industry. On the other hand, two-phase commit itself requires multiple rounds of network communication, which increases the response time of transactions and limits the scalability and high availability of the database. Optimizing and improving the two-phase commit algorithm, and improving transaction performance as much as possible while guaranteeing the isolation level, is therefore a problem worth studying in the field of distributed databases.

Disclosure of Invention

The invention aims to solve the problem of distributed transaction processing and provides a distributed transaction processing system.

The technical scheme of the invention is as follows: a distributed transaction processing system comprises a computing layer, a storage layer and a scheduling layer;

the computing layer is used for receiving the SQL request of the client, converting the SQL request into a KV read-write request, transmitting the KV read-write request to the storage layer, and initiating a two-phase commit of the transaction based on the SQL request;

the storage layer is used for providing a read-write interface for the two-phase commit of the transaction initiated by the computing layer and providing the storage service;

the scheduling layer is used for maintaining the meta-information of the whole cluster, including KV data distribution information, providing a global time service for the computing layer and the storage layer, and storing the transaction distribution information of the two-phase commit.

Further, the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module and an RPC client module;

the MySQL protocol module is used for parsing a MySQL data packet sent by the client, transmitting the SQL request obtained by parsing to the SQL core module for processing, packaging the processing result returned by the SQL core module and returning it to the client;

the SQL core module is used for processing the SQL request and transmitting the KV read-write request obtained by processing to the KV storage module;

the KV storage module is used for parsing the KV read-write request and initiating a two-phase commit of the transaction based on the parsed KV read-write request;

and the RPC client module is used for transmitting the two-phase commit of the transaction initiated by the KV storage module to the storage layer and the scheduling layer by means of RPC requests.

Further, the SQL core module comprises a parser, an optimizer, an executor and a session manager;

the parser is used for parsing the SQL request and generating an abstract syntax tree;

the optimizer is used for analyzing and optimizing the abstract syntax tree to generate a logical execution plan and a physical execution plan;

the executor is used for optimizing the logical execution plan and the physical execution plan;

and the session manager is used for transmitting the KV read-write request obtained by executing the logical execution plan and the physical execution plan to the KV storage module.

Furthermore, the functions of the two-phase commit of the transaction initiated by the KV storage module specifically comprise transaction creation, transaction local locking, lock conflict handling and long transaction management;

transaction creation is used for providing a KV read-write interface meeting the transaction isolation requirement for the executor of the SQL core module, providing a cache function for the computing layer, providing a snapshot isolation function for transaction reads and providing a two-phase commit function for transaction commits;

the transaction local lock is used for adding a local lock to the keys modified by the transaction;

lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the two-phase commit of a transaction;

long transaction management is used for updating the time to live of the transaction local lock.

Further, the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module and a storage engine module;

the RPC server module is used for receiving the two-phase commit request of the transaction sent by the RPC client module, parsing it and transmitting it to the MVCC module;

the MVCC module is used for providing a read-write interface for the two-phase commit of the transaction, packaging the transactional read-write operations in the two-phase commit into pure KV read-write operation requests, and sending the pure KV read-write operation requests to the local or remote consensus protocol module;

the consensus protocol module is used for synchronizing the pure KV read-write operation requests to the storage engine module in the form of logs;

and the storage engine module is used for writing the pure KV read-write operation requests to disk for storage.

Further, the scheduling layer comprises a Region manager module, a time service module and an etcd module;

the Region manager module is used for querying the Region distribution information in the scheduling layer and updating it;

the time service module is used for providing timestamps for the two-phase commit of the transaction, which serve as the start time of the transaction and as the identifier of the transaction commit;

the etcd module is used for storing the transaction distribution information of the two-phase commit.

The invention has the beneficial effects that:

(1) The invention commits transactions through a two-phase commit protocol and adopts a multi-version concurrency control mechanism so that reads and writes do not block each other.

(2) To address the high latency of distributed transactions, an asynchronous commit protocol is designed: once the first-phase write is complete, the response is returned to the client immediately, while the second-phase commit of the transaction is completed asynchronously, which removes the transaction latency of one round of network IO. In addition, the transaction locking step is skipped, so that when there is no transaction conflict the record is written directly and the result returned, which reduces the cumbersome steps of transaction commit and greatly improves transaction performance.

Drawings

FIG. 1 is a block diagram of a distributed transaction processing system.

Detailed Description

The embodiments of the present invention will be further described with reference to the accompanying drawings.

MVCC: multi-version concurrency control, a concurrency control method commonly used in database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

KV read-write: the data are stored as KV (key-value) pairs, namely key length, key and value, on which read and write operations are performed.

RPC: remote procedure call, a protocol that a program can use to request a service from a program on another computer in the network.

Region: region information.

etcd: etcd is a highly available distributed key value database, which can be used for service discovery.

As shown in FIG. 1, the present invention provides a distributed transaction processing system comprising a computing layer, a storage layer and a scheduling layer; each layer can be deployed as an independent component on different server nodes;

the computing layer processes user requests, parses SQL statements, optimizes the SQL and generates an efficient execution plan; it maintains the Session and the transaction context, and acts as the client that initiates the two-phase commit of a transaction; it converts SQL statements into KV requests and sends them to the storage layer;

the storage layer is used for providing a read-write interface for the two-phase commit of the transaction initiated by the computing layer, managing on disk the data committed by the two-phase commit, and providing a highly available KV storage service;

In the embodiment of the invention, the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module and an RPC client module;

the MySQL protocol module is responsible for maintaining the connection state, and is used for parsing a MySQL data packet sent by the client, transmitting the SQL request obtained by parsing to the SQL core module for processing, packaging the processing result returned by the SQL core module and returning it to the client;

the SQL core (SQLCore) module is the core module of the computing layer and is responsible for processing and executing SQL statements; it comprises submodules such as a syntax parser, an optimizer, an executor and session management;

the KV storage (KVStore) module is responsible for organizing KV requests, providing transaction management functions and maintaining transaction context objects;

and the RPC client module is responsible for packaging the RPC requests of the upper modules, sending them to the storage and scheduling nodes, receiving and returning the RPC responses, and maintaining the life cycle of each RPC request. Through it, the computing node reads and writes KV data on the storage nodes, performs the two-phase commit of transactions, and so on; the computing node also acquires globally unique, strictly increasing timestamps from the scheduling node, together with meta-information such as KV partition distribution and routing information.

In the embodiment of the invention, the SQL core module comprises a parser, an optimizer, an executor and a session manager;

In the database, the logical relational SQL model is converted into a physical key-value mapping and stored in the storage nodes as ordered KV pairs. Each SQL tuple is converted into a key-value pair, i.e. one row of SQL records corresponds to one KV record:

(tableID, rowID) → (col1, col2, col3, col4)

where the Key consists of a table identifier (TableID) and a row identifier (RowID), and the Value consists of the values of all columns of the row.

Each index entry is also converted into a key-value pair:

(tableID, indexID, ColumnValue) → rowID

where the Key consists of a table identifier, an index identifier (IndexID) and the value of the indexed column, and the Value is the identifier of the row (RowID). When an index is used for a data query, the corresponding row identifier is first found by scanning the index KV pairs, and then all data of the row are fetched through the row identifier.

Such KV encoding preserves the ordering of SQL tuples: all the row data of one table are arranged in the Key space in RowID order, and an index is arranged in the Key space in the order of the values of the indexed column.
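The mapping above can be illustrated with a minimal Python sketch. It is not the encoding used by the invention: the one-byte prefixes `r` and `i`, the fixed-width big-endian integers, the restriction to a unique index on an unsigned integer column, and the plain dictionary standing in for the storage node are all assumptions made for illustration.

```python
import struct

ROW_PREFIX = b"r"    # hypothetical tag that separates row keys from index keys
INDEX_PREFIX = b"i"


def encode_row_key(table_id: int, row_id: int) -> bytes:
    """(tableID, rowID) as a key whose byte order matches numeric order."""
    # Big-endian unsigned integers compare bytewise in numeric order.
    return ROW_PREFIX + struct.pack(">QQ", table_id, row_id)


def encode_index_key(table_id: int, index_id: int, column_value: int) -> bytes:
    """(tableID, indexID, ColumnValue) for an unsigned integer column."""
    return INDEX_PREFIX + struct.pack(">QQQ", table_id, index_id, column_value)


def lookup_by_index(store: dict, table_id: int, index_id: int, value: int):
    """Index query: find the rowID via the index pair, then fetch the row."""
    row_id = store.get(encode_index_key(table_id, index_id, value))
    if row_id is None:
        return None
    return store.get(encode_row_key(table_id, row_id))


store = {}
rows = {3: ("carol", 41), 1: ("alice", 30), 2: ("bob", 25)}
for row_id, (name, age) in rows.items():
    store[encode_row_key(7, row_id)] = (name, age)
    store[encode_index_key(7, 1, age)] = row_id  # index 1 on the age column

# Sorting the encoded keys yields the rows of table 7 in RowID order.
row_keys = sorted(k for k in store if k.startswith(ROW_PREFIX))
```

Because the integers are packed big-endian, a range scan over the sorted keys visits rows in RowID order, which is the ordering property the text relies on.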

In the embodiment of the invention, the functions of the two-phase commit of the transaction initiated by the KV storage module specifically comprise transaction creation, transaction local locking, lock conflict handling and long transaction management;

KV read-write interface: provides the executor of the SQL module with a KV read-write interface that meets the transaction isolation requirement;

write cache: KV modification operations are not executed immediately, but are first cached in the computing layer and committed together when the transaction commits;

snapshot isolation: the current transaction can read only the latest data committed before its start time, plus the data modified within the transaction itself;

two-phase commit of the transaction: when a transaction commit request is received, a new two-phase commit client is created to commit the transaction, and the transaction is rolled back when a conflict or error occurs.

The transaction local lock is used for adding a local lock to the keys modified by the transaction; this prevents multiple transactions from reading and writing the same Key at the same time, which would cause transaction conflicts and rollbacks;

lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the two-phase commit of a transaction, so that new transaction commits are not blocked.

Long transaction management periodically sends heartbeat packets to the storage node to update the time to live of the transaction local lock.
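The interplay between expired-lock cleanup and the heartbeat of long transactions can be sketched as follows. This is only an illustration under stated assumptions: the names `LockTable`, `acquire` and `heartbeat`, the in-memory dictionary, and passing the current time as an explicit `now` argument are hypothetical and do not come from the original text.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    owner_start_ts: int   # StartTS of the transaction holding the lock
    expires_at: float     # time after which the lock counts as expired


class LockTable:
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.locks = {}   # key -> Lock

    def acquire(self, key: bytes, start_ts: int, now: float) -> bool:
        held = self.locks.get(key)
        if held is not None and held.owner_start_ts != start_ts:
            if now < held.expires_at:
                return False      # live lock of another transaction: conflict
            del self.locks[key]   # expired lock: clear it so commits are not blocked
        self.locks[key] = Lock(start_ts, now + self.ttl)
        return True

    def heartbeat(self, start_ts: int, now: float) -> int:
        """Extend the time to live of every lock owned by start_ts."""
        refreshed = 0
        for lock in self.locks.values():
            if lock.owner_start_ts == start_ts:
                lock.expires_at = now + self.ttl
                refreshed += 1
        return refreshed
```

A long-running transaction that keeps calling `heartbeat` retains its locks, while the locks of a crashed transaction stop being refreshed and are cleared by the next conflicting `acquire`.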

In the embodiment of the invention, the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module and a storage engine module;

the functions of the multi-version concurrency control (MVCC) module are as follows:

multi-version concurrency control function: to improve concurrency performance and keep reads and writes from blocking each other, a transactional read-write interface with version numbers is provided, which controls the visibility of data;

local lock function: a lock is taken before the storage layer accesses the same Key, to avoid write contention between transactions;

lock wait and wake-up function: for a pessimistic transaction that fails to acquire a lock, the transaction thread waiting for the lock is put to sleep, and is woken up after the lock is released;

deadlock detection function: a transaction wait-for graph is maintained, and deadlocks are detected by checking for cycles before a new transaction begins waiting on a lock.
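The deadlock detection function described above amounts to a cycle check on the wait-for graph before a new edge is added. A minimal sketch, assuming transactions are identified by their StartTS and using the hypothetical names `DeadlockDetector`, `try_wait` and `done`:

```python
class DeadlockDetector:
    def __init__(self):
        self.waits_for = {}   # waiter -> set of transactions it is waiting on

    def _reaches(self, src, dst) -> bool:
        """Depth-first search: is dst reachable from src along wait edges?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.waits_for.get(node, ()))
        return False

    def try_wait(self, waiter, holder) -> bool:
        """Record waiter -> holder unless that edge would close a cycle."""
        if self._reaches(holder, waiter):
            return False   # holder already waits, directly or not, on waiter
        self.waits_for.setdefault(waiter, set()).add(holder)
        return True

    def done(self, txn):
        """Remove a finished transaction from the graph."""
        self.waits_for.pop(txn, None)
        for holders in self.waits_for.values():
            holders.discard(txn)
```

When `try_wait` returns `False`, the caller would abort or roll back the waiting transaction instead of putting its thread to sleep.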

The consensus protocol module is responsible for providing a highly available data service based on the Raft protocol. The Raft protocol synchronizes KV operations to all nodes in the form of logs, and each node persists the log data by calling the interface of the local storage module. After a majority of nodes have synchronized the log, the KV operations are written to disk through the storage engine. To provide horizontal scalability, a range partition strategy is employed to divide the KV set into many intervals, each of which is called a Region. The Raft consensus algorithm maintains consistency among the replicas of each Region, and all the replicas of a Region form a Raft group.

This extended Raft algorithm is also known as Multi-Raft. The specific functions of the consensus protocol module are as follows:

(1) meta-information management: maintaining the Key interval stored in each Region, the node states and other information, and regularly reporting them to the scheduling layer through heartbeat packets;

(2) election and voting: electing the Raft leader and handling role changes of Raft members;

(3) log replication: each data change is converted into a Raft log entry, and the data are safely and reliably synchronized to a majority of the nodes of the Raft group through Raft log replication;

(4) crash recovery: after a node goes down, it is restored to a correct state by synchronizing and replaying the Raft log, so that no data is lost;

(5) Region splitting and merging: Regions are distributed as evenly as possible over all nodes in the cluster, which facilitates horizontal scaling and load balancing.
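Range partitioning and Region splitting can be pictured with a sorted list of Region start keys. The sketch below is an assumption-laden illustration: `RegionMap`, `locate` and `split` are hypothetical names, Region identifiers are simple counters, and replica placement, merging and Raft membership are left out.

```python
import bisect


class RegionMap:
    """Each Region covers [start key, next start key); the first starts at b''."""

    def __init__(self):
        self.starts = [b""]     # sorted Region start keys
        self.region_ids = [1]   # region_ids[i] owns the range beginning at starts[i]
        self._next_id = 2

    def locate(self, key: bytes) -> int:
        """Return the id of the Region whose key range contains key."""
        i = bisect.bisect_right(self.starts, key) - 1
        return self.region_ids[i]

    def split(self, split_key: bytes) -> int:
        """Split the Region containing split_key; the new Region takes the right half."""
        i = bisect.bisect_right(self.starts, split_key)
        if self.starts[i - 1] == split_key:
            raise ValueError("split key is already a Region boundary")
        new_id = self._next_id
        self._next_id += 1
        self.starts.insert(i, split_key)
        self.region_ids.insert(i, new_id)
        return new_id
```

Because the keys are ordered, locating the Region for a key is a binary search, and a split only inserts one new boundary without touching the other Regions.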

In the embodiment of the invention, the scheduling layer comprises a Region manager module, a time service module and an etcd module;

the Region manager module mainly maintains Region information; the storage nodes periodically report Region distribution information to the scheduling node through heartbeats. It also provides Region routing information: the computing nodes periodically query the Region distribution information from the scheduling node and cache it in the computing layer.

the time service module is the only time service in the cluster and provides monotonically increasing timestamps to the outside. Under the requirements of the snapshot isolation level, every transaction must obtain a globally unique timestamp when it begins, which serves as the start time (StartTS) and identifier of the transaction. The transaction can read only the latest data committed before StartTS. When a transaction commits, it also needs to acquire a timestamp as the marker of the commit (CommitTS).

Because downtime of the time service server would make the whole database cluster unavailable, the scheduling layer stores its meta-information in the etcd module, thereby providing a highly available time service.
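The timestamp contract of the time service module (globally unique, strictly increasing) can be shown with a single-process sketch. The class name `TimestampOracle`, the method `get_ts`, and the layout of physical milliseconds shifted left by 18 bits are assumptions for illustration; persisting the last issued timestamp in etcd for high availability is deliberately omitted.

```python
import threading
import time


class TimestampOracle:
    """Hands out unique, strictly increasing timestamps (single-process sketch)."""

    LOGICAL_BITS = 18   # assumed layout: physical milliseconds, then a logical counter

    def __init__(self, clock=time.time):
        self._clock = clock
        self._lock = threading.Lock()
        self._last = 0

    def get_ts(self) -> int:
        with self._lock:
            candidate = int(self._clock() * 1000) << self.LOGICAL_BITS
            # Never go backwards, even if the wall clock stalls or jumps back.
            self._last = max(candidate, self._last + 1)
            return self._last
```

A transaction would call `get_ts` once at BEGIN to obtain its StartTS and once more at commit to obtain its CommitTS, so the CommitTS of any transaction is always larger than its own StartTS.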

In the embodiment of the invention, a user starts a transaction by sending a BEGIN statement to the database. After the SQL module of the computing layer parses the statement, it creates a transaction context object (TransactionContext) and binds it to the current Session; this object maintains the state of the current transaction and provides the KV data read-write interface, among other things. At the same time, the computing layer sends a request to the time service server and obtains a globally unique timestamp as the start time (StartTS) of the transaction. After the transaction context object has been created successfully, the computing layer returns a success response to the client.

After a transaction has been opened successfully, all create, read, update and delete (CRUD) operations performed by the user are confined to the context of that transaction. For a read operation (SELECT statement), the isolation requirement of the transaction means that only the data modified within the current transaction and the latest data of committed transactions can be read.
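The read rule can be made concrete with a small sketch in which a transaction first consults its own write cache and otherwise reads the newest version committed before its StartTS. `VersionedStore` and `Transaction` are hypothetical names, and the in-memory version lists are an assumption, not the storage format of the invention.

```python
class VersionedStore:
    """key -> list of (commit_ts, value) versions, kept sorted by commit_ts."""

    def __init__(self):
        self.versions = {}

    def put_committed(self, key, commit_ts, value):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def get(self, key, start_ts):
        """Newest version committed before start_ts, or None if there is none."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts < start_ts:
                return value
        return None


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.write_buffer = {}   # uncommitted modifications cached in the computing layer

    def put(self, key, value):
        self.write_buffer[key] = value   # not sent to storage until commit

    def get(self, key):
        if key in self.write_buffer:     # the transaction sees its own writes first
            return self.write_buffer[key]
        return self.store.get(key, self.start_ts)
```

A transaction with StartTS 20 therefore ignores a version committed at timestamp 30, which is why readers are never blocked by later writers.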

The two-phase commit client first acquires all data modification operations from the write cache object, initializes a modification set (Mutation), and takes the first modified Key in the write cache object as the PrimaryKey of the set; the other Keys are called SecondaryKeys. The PrimaryKey is used to identify the state of the current transaction: by checking the lock on the PrimaryKey, other transactions can determine whether the current transaction should go on to complete its commit or be rolled back. In both the Prewrite phase and the Commit phase, the PrimaryKey of the set must be sent first, and only after the PrimaryKey has been locked or written successfully are the other Keys sent. The two-phase commit client then attempts to add a computing-layer local lock (Latch) to all Keys to be modified, which prevents other transactions from concurrently modifying the same Key and causing transaction conflicts and rollbacks. If some of the Keys are found to be locked by other transactions, the client retries for a period of time; if the locks have still not been released, it aborts the commit and returns an error to the user. If locking succeeds, the two-phase commit steps begin.
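The preparation steps of the two-phase commit client (choosing the PrimaryKey and taking the computing-layer latches with a bounded retry) might look like the sketch below. The names `LatchTable`, `build_mutations` and `latch_with_retry`, the retry count and the delay are illustrative assumptions rather than details given in the text.

```python
import threading
import time


class LatchTable:
    """Per-key local latches of one computing node (illustrative)."""

    def __init__(self):
        self._held = set()
        self._mutex = threading.Lock()

    def try_acquire_all(self, keys) -> bool:
        """Latch every key atomically, or none of them if any is already held."""
        with self._mutex:
            if any(k in self._held for k in keys):
                return False
            self._held.update(keys)
            return True

    def release_all(self, keys):
        with self._mutex:
            self._held.difference_update(keys)


def build_mutations(write_buffer: dict):
    """The first modified key becomes the primary; the rest are secondaries."""
    keys = list(write_buffer)   # dicts preserve insertion order
    if not keys:
        raise ValueError("nothing to commit")
    return keys[0], keys[1:]


def latch_with_retry(latches, keys, attempts=3, delay=0.01) -> bool:
    """Retry for a while; on False the caller aborts the commit."""
    for _ in range(attempts):
        if latches.try_acquire_all(keys):
            return True
        time.sleep(delay)
    return False
```

Acquiring all latches in one atomic step avoids the situation where two transactions each hold part of the other's key set.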

The RPCServer of the storage node parses the RPC request and passes it to the multi-version concurrency control (MVCCStore) module. The MVCCStore module first checks whether each Key is locked by another transaction. If it is, the module retries and waits for the lock to be released; if the maximum wait time is exceeded, a transaction conflict error is returned to the computing layer. Next, the latest record of the current Key in local storage is checked. If the CommitTS in that record is greater than the StartTS of the current transaction, the record read by the current transaction is not the latest, and a transaction conflict error is returned to the computing layer.

Then the MVCCStore module creates a multi-version concurrency control lock (hereinafter MVCCLock) for each Key; the lock contains information such as the StartTS (the globally unique identifier of the transaction) and the KV record. The MVCCStore module synchronizes all MVCCLock objects to the other nodes of the Raft group through the consensus protocol layer to ensure high availability of the storage layer.
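The Prewrite checks on the storage node can be summarized in a short sketch. It is a simplification under explicit assumptions: `PrewriteHandler` and `TxnConflict` are hypothetical names, waiting and retrying on a held lock is replaced by an immediate error, and replication of the locks through the consensus layer is left out.

```python
class TxnConflict(Exception):
    """Raised when a Prewrite cannot proceed because of another transaction."""


class PrewriteHandler:
    def __init__(self):
        self.locks = {}    # key -> {"start_ts", "primary", "value"}
        self.latest = {}   # key -> (commit_ts, value) of the newest committed record

    def prewrite(self, mutations: dict, primary, start_ts: int):
        # Check every key before writing anything, so a failed Prewrite leaves no locks.
        for key in mutations:
            lock = self.locks.get(key)
            if lock is not None and lock["start_ts"] != start_ts:
                raise TxnConflict(f"{key!r} is locked by transaction {lock['start_ts']}")
            record = self.latest.get(key)
            if record is not None and record[0] > start_ts:
                raise TxnConflict(f"{key!r} was committed at {record[0]} > {start_ts}")
        # All checks passed: create one lock per key, carrying StartTS and the KV record.
        for key, value in mutations.items():
            self.locks[key] = {"start_ts": start_ts, "primary": primary, "value": value}
```

Each lock stores the PrimaryKey so that a transaction that later runs into the lock can look up the state of its owner.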

In the consensus protocol layer, after the Raft leader accepts a write command from the upper layer, it converts the command into a Raft log entry, writes the entry into its Raft log, and then sends it to the followers in the same Raft group through an append-log RPC request. After receiving responses showing that a majority of nodes have synchronized the log, the leader applies the log itself and notifies the followers to apply it through heartbeat packets. After each node finishes applying the Raft log, the MVCCLock is written to local storage and persisted.

After the leader has successfully applied the log and written the MVCCLock, it returns a Prewrite success response to the computing layer. After the computing layer receives the response, the two-phase commit client begins executing the Commit phase.
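The majority rule that lets the leader apply a log entry can be expressed as a small calculation. The helper names `majority` and `commit_index` are hypothetical, and the sketch ignores the additional Raft condition that an entry must belong to the leader's current term before it is counted.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_index: list) -> int:
    """Highest log index stored on a majority of nodes.

    match_index holds, for every node including the leader, the index of the
    last log entry known to be replicated on that node.
    """
    ordered = sorted(match_index, reverse=True)
    return ordered[majority(len(ordered)) - 1]
```

With three replicas whose last replicated indexes are 5, 5 and 3, the leader may apply everything up to index 5 even though one follower is still behind.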

In the Commit phase, the two-phase commit client obtains a timestamp from the global time service server as the CommitTS of the current transaction. The two-phase commit client then sends a Commit RPC request containing all Keys, the CommitTS and other information to the corresponding storage nodes.

After receiving the request, the storage node passes it to the multi-version concurrency control (MVCCStore) module. The MVCCStore first attempts to add a storage-layer local lock (Latch) to the set of Keys, to prevent transactions from different computing nodes from concurrently modifying the same Key and causing transaction conflicts and rollbacks. Then the MVCCStore fetches the set of MVCCLocks, constructs KV records containing the Key, the Value and the meta-information of the transaction's StartTS and CommitTS, synchronizes them to the other storage nodes through the consensus protocol layer, and finally writes them into the local storage layer.

Finally, the MVCCStore deletes the corresponding MVCCLocks and returns a commit success response to the computing layer. After receiving the response, the computing layer returns a response to the user stating that the transaction was committed successfully. When the life cycle of the whole transaction ends, temporary objects such as the transaction context are destroyed and the corresponding memory is reclaimed.
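The Commit phase can be sketched in the same spirit: the lock left by Prewrite is turned into a committed record that carries StartTS and CommitTS, and is then deleted. `CommitHandler` and `commit_transaction` are hypothetical names, a single in-memory node stands in for the storage nodes, and latching and Raft replication are omitted.

```python
class CommitHandler:
    def __init__(self):
        self.locks = {}     # key -> {"start_ts": int, "value": object}, left by Prewrite
        self.records = {}   # key -> list of (start_ts, commit_ts, value)

    def commit(self, keys, start_ts: int, commit_ts: int):
        if commit_ts <= start_ts:
            raise ValueError("CommitTS must be later than StartTS")
        # Verify that every key still carries this transaction's lock.
        for key in keys:
            lock = self.locks.get(key)
            if lock is None or lock["start_ts"] != start_ts:
                raise KeyError(f"no lock of transaction {start_ts} on {key!r}")
        for key in keys:
            value = self.locks[key]["value"]
            # Write the committed record first, then delete the lock.
            self.records.setdefault(key, []).append((start_ts, commit_ts, value))
            del self.locks[key]


def commit_transaction(node, primary, secondaries, start_ts, get_ts):
    """Fetch a CommitTS, commit the PrimaryKey first, then the other keys."""
    commit_ts = get_ts()
    node.commit([primary], start_ts, commit_ts)
    node.commit(secondaries, start_ts, commit_ts)
    return commit_ts
```

Committing the PrimaryKey before the others means that the state of the PrimaryKey alone tells other transactions whether this transaction has committed.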

The working principle and process of the invention are as follows. The functional requirements of general transactions in the invention cover three aspects. First, an optimistic transaction model is supported: the user's data modification operations are cached in memory and submitted together in the transaction commit phase, and conflict detection is performed at commit time. Second, a pessimistic transaction model is supported: a pessimistic lock is written to the storage node while the user modifies the data, which moves conflict detection earlier and avoids the performance degradation that optimistic transactions suffer in conflict-heavy scenarios. Third, the snapshot isolation level is supported: when a transaction starts, a global timestamp is obtained as the transaction identifier, and when it commits, a global commit timestamp is obtained, which determines the execution order of transactions.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. A distributed transaction processing system, characterized by comprising a computing layer, a storage layer and a scheduling layer;

the computing layer is used for receiving an SQL request of a client, converting the SQL request into a KV read-write request, transmitting the KV read-write request to the storage layer, and initiating a two-phase commit of the transaction based on the SQL request;

the scheduling layer is used for providing a global time service for the computing layer and the storage layer and storing the transaction distribution information of the two-phase commit.

2. The distributed transaction processing system of claim 1, wherein the computing layer comprises a MySQL protocol module, an SQL core module, a KV storage module, and an RPC client module;

3. The distributed transaction processing system of claim 2, wherein the SQL core module comprises a parser, an optimizer, an executor, and a session manager;

the executor is used for optimizing a logical execution plan and a physical execution plan;

4. The distributed transaction processing system according to claim 3, wherein the functions of the two-phase commit of the transaction initiated by the KV storage module specifically include transaction creation, transaction local locking, lock conflict handling, and long transaction management;

the transaction creation is used for providing a KV read-write interface meeting the transaction isolation requirement for the executor of the SQL core module, providing a cache function for the computing layer, providing a snapshot isolation function for transaction reads and providing a two-phase commit function for transaction commits;

the lock conflict handling is used for detecting and clearing expired lock objects when a lock conflict occurs during the two-phase commit of a transaction;

the long transaction management is used for updating the time to live of the transaction local lock.

5. The distributed transaction processing system of claim 2, wherein the storage layer comprises an RPC server module, an MVCC module, a consensus protocol module, and a storage engine module;

the MVCC module is used for providing a read-write interface for the two-phase commit of the transaction, packaging the transactional read-write operations in the two-phase commit into pure KV read-write operation requests, and sending the pure KV read-write operation requests to the consensus protocol module;

6. The distributed transaction processing system of claim 1, wherein the scheduling layer comprises a Region manager module, a time service module, and an etcd module;