WO2022213526A1 - Transaction processing method, distributed database system, cluster, and medium - Google Patents

Transaction processing method, distributed database system, cluster, and medium

Info

Publication number
WO2022213526A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
transaction
coordinating
distributed database
database system
Prior art date
Application number
PCT/CN2021/112643
Other languages
English (en)
French (fr)
Inventor
谢晓芹
张宗全
马文龙
Original Assignee
华为云计算技术有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202110679707.XA (CN115495495A)
Application filed by 华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Priority to EP21935746.4A (EP4307137A1)
Priority to CN202180004526.5A (CN115443457A)
Publication of WO2022213526A1
Priority to US18/477,848 (US20240028598A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2308Concurrency control

Definitions

  • the present application relates to the field of database technology, and in particular, to a transaction processing method, a distributed database system, a transaction processing system, a cluster, a computer-readable storage medium, and a computer program product.
  • the database is a collection of data that is stored together in a certain way, can be shared with multiple users, has as little redundancy as possible, and is independent of applications. Users can access the database through a client application (hereinafter referred to as a client) to realize data reading or data writing.
  • A database system includes a database management system (DBMS), through which the database system creates, queries, updates and deletes data.
  • the user triggers an operation on the data in the database through the client, and the database system executes a corresponding transaction in response to the operation.
  • Taking data writing as an example, the database system first writes the data to its nodes and then writes the data to the database, such as a shared storage database, to achieve data persistence.
  • the distributed database system can be deployed in a real application cluster (RAC).
  • the cluster is specifically a disk-oriented distributed database storage engine cluster based on a shared-everything architecture.
  • RAC includes two types of nodes, specifically a hub node and a leaf node.
  • The hub node is the master node in the cluster.
  • The master nodes are interconnected through a point-to-point network to process distributed transactions. There is no network connection between the leaf nodes, which are used to process concurrent queries and online reporting services.
  • Leaf nodes usually obtain data only through hub nodes acting as proxies. Because the leaf nodes and hub nodes interact in a bilateral manner, a leaf node has to wait for operating-system scheduling and fetch the data on the hub node through a long access path. As a result, the leaf nodes can usually only read historical data, which makes it difficult to satisfy the data-consistency requirements of real-time services.
  • the present application provides a transaction processing method.
  • The method utilizes a shared global memory so that nodes in the distributed database system, such as the coordinating node and the participating nodes, can access the global memory unilaterally across nodes, without synchronizing data through bilateral interaction. Because no processing by the remote processor or operating system is required, the access path is greatly shortened, and no waiting for operating-system scheduling is needed, which greatly shortens the synchronization time, realizes real-time consistency between the coordinating node and the participating nodes, and meets the real-time consistency requirements of real-time services.
  • the present application also provides a distributed database system, a transaction processing system, a cluster, a computer-readable storage medium, and a computer program product corresponding to the above method.
  • the present application provides a transaction processing method.
  • the method may be performed by a distributed database system.
  • the distributed database system can be deployed in a cluster.
  • The cluster may be, for example, an in-memory engine cluster.
  • When the distributed database system runs in such a cluster, it can achieve real-time consistency of data between nodes, so as to meet the needs of real-time services.
  • the distributed database system includes coordinating nodes and participating nodes.
  • the coordinating node assumes the coordination responsibility during the transaction execution process, and the participating nodes assume the execution responsibility during the transaction execution process.
  • a transaction refers to a program execution unit that accesses and possibly updates data in a database, usually including a limited sequence of database operations.
  • Part of the memory of multiple nodes of a distributed database system is used to form global memory. This global memory is visible to all coordinating nodes and participating nodes in the distributed database system. The remaining part of the memory in the coordinating node or participating node is local memory, which is visible to the coordinating node or participating node itself. For any coordinating node or participating node, part of the memory located in other nodes in the global memory can be accessed through remote direct memory access or memory bus.
  • The coordinating node is used to receive multiple query statements sent by the client, create a transaction according to a first query statement among the multiple query statements, execute the transaction in the global memory according to a second query statement among the multiple query statements, and then commit the transaction according to a third query statement among the multiple query statements, so as to realize consistency between the coordinating node and the participating nodes.
  • Because the global memory is shared by the coordinating node and the participating nodes, when the coordinating node executes the transaction in the global memory and the data changes, the participating nodes can quickly perceive the change.
  • The parts of the global memory located on other nodes can be accessed unilaterally across nodes for data synchronization, instead of synchronizing data through bilateral interaction. Since this requires no processing by the remote processor or operating system, the access path is greatly shortened, and there is no waiting for operating-system scheduling, which greatly shortens the synchronization time, realizes real-time consistency between the coordinating node and the participating nodes, and meets the real-time consistency requirements of real-time services.
  • the capacity of the global memory can be expanded with the number of nodes, and is no longer limited by the capacity of the memory of a single node, which improves the concurrency control capability.
  • the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
  • the cluster refers to a computing network formed by at least one group of computers, and is used to provide computing capabilities for the distributed database system, so that the distributed database system can provide external services based on the aforementioned computing capabilities.
  • The method enables the nodes of the distributed database system to access the global memory unilaterally across nodes, without processing by the remote processor or operating system and without waiting for operating-system scheduling, so real-time consistency between nodes (for example, between the coordinating node and the participating nodes) can be achieved.
  • the global memory includes part of the memory of the coordinating node and/or the participating nodes.
  • Multiple nodes (for example, each node) in the distributed database system may provide part of their memory to form the global memory, and the remaining memory is used as local memory of the corresponding node.
  • the global memory is shared among the nodes in the distributed database system. These nodes can directly achieve cross-node unilateral access through remote direct memory access or memory bus, without going through the operating system and processor, and without waiting for the scheduling of the operating system, so real-time consistency between nodes can be achieved.
  • the node type of the coordinator node is the master node.
  • the coordinating node may create a read-write transaction according to the first query statement in the plurality of query statements. This can meet the needs of real-time read and write services.
  • the node type of the coordinating node is the first slave node.
  • the first slave node is used to maintain real-time consistency with the node whose node type is the master node. Based on this, the first slave node can also be called a real-time slave node.
  • the coordinating node may create a read-only transaction according to the first query statement in the plurality of query statements. This can meet the needs of real-time read-only services.
  • the distributed database system may further include a node whose node type is the second slave node.
  • the second slave node is used to maintain quasi-real-time consistency with the node whose node type is the master node. Therefore, the second slave node can also be called a quasi-real-time slave node.
  • the quasi-real-time slave node is used to process services that do not require high real-time performance, such as non-real-time analysis services.
  • the quasi-real-time slave node is used to receive query statements related to non-real-time analysis services, and then return corresponding query results. This can meet the needs of non-real-time analysis services.
  • The distributed database system may receive the number of replicas of table records in the global memory sent by the cluster management node, and then save this replica count.
  • When the distributed database system writes data, it can write a corresponding number of replicas into the global memory according to this replica count, so as to ensure data security.
  • Moreover, the distributed database system can set the replica count at table granularity, which meets the individual needs of different services.
  • the table record is stored in the global memory of the distributed database system, and the index tree and management header of the table record are stored in the local memory of the distributed database system.
  • This method uses the limited global memory to store table records, and uses local memory to store the index tree and the management headers of the table records for version management. On the one hand this guarantees consistency, and on the other hand it avoids consuming global memory for structures such as index trees, which improves resource utilization.
  • The coordinating node may use a transaction commit protocol when committing a transaction. Specifically, the coordinating node commits the transaction according to the third query statement among the plurality of query statements through a transaction commit protocol running on the coordinating node and the participating nodes, so as to realize real-time consistency between the coordinating node and the participating nodes.
  • The coordinating node and the participating nodes are constrained by the transaction commit protocol, so that the transaction operations that write data (including inserts or updates) on these nodes are either all completed or all rolled back. This avoids the situation where some replica nodes complete the write while other replica nodes fail, which would break real-time consistency between nodes, and thus further ensures real-time consistency between nodes.
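  • For illustration only, the following is a minimal two-phase-commit-style sketch in Go of such an all-or-nothing commit across nodes; the Participant interface and the in-memory memNode participant are assumptions made for the example, and the actual transaction commit protocol of the present application is not limited to this form.

```go
// A toy coordinator drives the commit: every participant must vote yes in a
// prepare round, otherwise all participants roll back.
package main

import "fmt"

// Participant is any node that can tentatively apply a write and later make
// it durable or undo it.
type Participant interface {
	Prepare(gtxid string) bool // validate and lock the write; vote yes/no
	Commit(gtxid string)       // make the write visible
	Rollback(gtxid string)     // undo the tentative write
}

// commit returns true only if every participant completed the transaction.
func commit(gtxid string, parts []Participant) bool {
	for _, p := range parts {
		if !p.Prepare(gtxid) {
			for _, q := range parts {
				q.Rollback(gtxid)
			}
			return false
		}
	}
	for _, p := range parts {
		p.Commit(gtxid)
	}
	return true
}

// memNode is a toy participant that always votes yes.
type memNode struct{ name string }

func (n *memNode) Prepare(g string) bool { fmt.Println(n.name, "prepared", g); return true }
func (n *memNode) Commit(g string)       { fmt.Println(n.name, "committed", g) }
func (n *memNode) Rollback(g string)     { fmt.Println(n.name, "rolled back", g) }

func main() {
	nodes := []Participant{&memNode{"master1"}, &memNode{"rt-slave1"}}
	fmt.Println("commit ok:", commit("node1-42", nodes))
}
```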
  • When a write conflict occurs in the transaction, for example when the transaction has a read-write conflict or a write-write conflict with other transactions, the coordinating node triggers pessimistic concurrency control and the participating nodes trigger optimistic concurrency control.
  • The principle of pessimistic concurrency control (also known as pessimistic locking) is to assume that concurrent transactions of multiple users will affect each other during processing, so a transaction locks the data before modifying it, and other transactions can perform conflicting operations only after the transaction releases the lock.
  • The principle of optimistic concurrency control is that transactions process data without acquiring locks; before committing, each transaction checks whether other transactions have modified the data it has read, and if so, the committing transaction is rolled back.
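  • As a minimal sketch of the optimistic concurrency control described above (the record store, the version counter and the read/write sets are illustrative assumptions, not the concrete structures of the present application), a transaction can remember the version of every record it reads and, at commit time, roll back if any of those versions has changed:

```go
package main

import "fmt"

type record struct {
	value   string
	version uint64 // bumped on every committed update
}

type txn struct {
	readSet map[string]uint64 // key -> version observed at read time
	writes  map[string]string // key -> new value
}

func begin() *txn { return &txn{readSet: map[string]uint64{}, writes: map[string]string{}} }

func (t *txn) read(db map[string]*record, key string) string {
	r := db[key]
	t.readSet[key] = r.version
	return r.value
}

func (t *txn) write(key, val string) { t.writes[key] = val }

// commit validates the read set; if another transaction changed any record
// that was read, the committing transaction is rolled back (returns false).
func (t *txn) commit(db map[string]*record) bool {
	for key, seen := range t.readSet {
		if db[key].version != seen {
			return false // conflict detected: roll back
		}
	}
	for key, val := range t.writes {
		db[key].value = val
		db[key].version++
	}
	return true
}

func main() {
	db := map[string]*record{"a": {value: "1"}}
	t := begin()
	_ = t.read(db, "a")
	db["a"].version++ // simulate a concurrent update by another transaction
	t.write("a", "2")
	fmt.Println("committed:", t.commit(db)) // false: rolled back
}
```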
  • the present application provides a distributed database system.
  • the distributed database system includes a coordinating node and a participating node, and the coordinating node and the participating nodes share a global memory.
  • the coordination node is used to receive multiple query statements sent by the client;
  • The coordinating node is further configured to create a transaction according to the first query statement among the plurality of query statements, execute the transaction in the global memory according to the second query statement among the plurality of query statements, and commit the transaction according to the third query statement among the plurality of query statements.
  • the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
  • the global memory includes part of the memory of the coordinating node and/or the participating nodes.
  • the node type of the coordination node is a master node, and the coordination node is specifically used for:
  • a read-write transaction is created according to the first query statement in the plurality of query statements.
  • the node type of the coordinating node is a first slave node
  • the first slave node is used to maintain real-time consistency with the node whose node type is the master node
  • the coordinating node is specifically used for:
  • a read-only transaction is created according to the first query statement in the plurality of query statements.
  • the coordinating node is further configured to receive and save the number of copies of the table record sent by the cluster management node in the global memory;
  • the participating node is further configured to receive and save the number of copies of the table record sent by the cluster management node in the global memory.
  • the table record is stored in the global memory of the distributed database system, and the index tree and management header of the table record are stored in the local memory of the distributed database system.
  • the coordinating node is specifically used for:
  • the transaction is committed through the transaction commit protocol running on the coordinating node and the participating node, so as to realize real-time consistency between the coordinating node and the participating node.
  • the coordinating node is specifically used for triggering pessimistic concurrency control when a write conflict occurs in the transaction; the participating node is specifically used for triggering optimistic concurrency control when a write conflict occurs in the transaction.
  • the present application provides a transaction processing system.
  • The transaction processing system includes a client and the distributed database system according to any implementation manner of the second aspect of the present application, where the distributed database system is configured to execute the corresponding transaction processing method according to the query statements sent by the client.
  • the present application provides a cluster.
  • the cluster includes multiple computers.
  • the computer includes a processor and memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute the instructions stored in the memory, so that the cluster executes the transaction processing method according to the first aspect or any one of the implementations of the first aspect.
  • The present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and the instructions instruct a computer to execute the transaction processing method described in the first aspect or any implementation manner of the first aspect.
  • the present application provides a computer program product containing instructions, which, when run on a computer, cause the computer to execute the transaction processing method described in the first aspect or any implementation manner of the first aspect.
  • On the basis of the implementations provided by the above aspects, the present application may further combine them to provide more implementations.
  • FIG. 1 is a system architecture diagram of a transaction processing system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a node configuration method according to an embodiment of the present application
  • FIG. 3 is a flowchart of a method for configuring the number of copies provided by an embodiment of the present application
  • FIG. 4 is an interactive flowchart of a transaction processing method provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a transaction initiation and execution phase provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a transaction commit phase provided by an embodiment of the present application.
  • FIG. 7 is a flowchart of a transaction completion stage provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a transaction rollback phase provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a cluster provided by an embodiment of the present application.
  • first and second in the embodiments of the present application are only used for the purpose of description, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
  • Database application refers to the application that provides data management services to users based on the underlying database.
  • the data management service includes at least one of data creation, data query, data update, and data deletion.
  • Typical database applications include information management systems such as attendance management systems, salary management systems, production reporting systems, and securities trading systems.
  • Database applications typically include database systems and user-facing clients. The user can trigger data creation, query, update or deletion operations through the client, and the client can process the data in the database through the database system in response to the above operations.
  • the database system can be divided into centralized database system and distributed database system according to the deployment method.
  • the distributed database system is a database system deployed in a cluster including multiple computers.
  • a computer in a cluster may also be referred to as a node.
  • the nodes can communicate through the network to complete data processing cooperatively.
  • the database system can determine the storage mode of data in memory, disk and other storage media and the data read mode through the storage engine.
  • the storage engine is the core component of the database system. Different types of database systems can use different storage engines to provide different storage mechanisms, indexing methods, and locking mechanisms.
  • the cluster can also be divided into a disk engine cluster and an in-memory engine cluster according to the different types of storage media used by nodes in the cluster to store data.
  • RAC widely used in the industry is a disk-oriented distributed database storage engine cluster (for example, a disk engine cluster) based on a data sharing architecture.
  • RAC includes two types of nodes, specifically a hub node and a leaf node.
  • The hub node is the master node in the cluster; the master nodes are interconnected through a point-to-point network for processing distributed transactions, and there is no network connection between the leaf nodes, which are used for processing concurrent queries and online reporting services.
  • Leaf nodes usually obtain data only through hub nodes acting as proxies, and a leaf node obtains the data on a hub node through bilateral interaction.
  • Bilateral interaction requires the central processing units (CPUs) of both parties to participate in processing, resulting in an excessively long access path.
  • For example, the access path can be from the CPU of the leaf node to the network card of the leaf node, then to the network card of the hub node, and then to the hub node.
  • In addition, the above interaction needs to wait for operating-system scheduling, which leads to a large delay, and usually only historical data can be read on the leaf nodes, making it difficult to meet the data-consistency requirements of real-time services.
  • the distributed database system can be deployed in a cluster.
  • The cluster may be, for example, an in-memory engine cluster.
  • When the distributed database system runs in such a cluster, it can achieve real-time consistency of data between nodes, so as to meet the needs of real-time services.
  • the distributed database system includes coordinating nodes and participating nodes.
  • the coordinating node assumes the coordination responsibility during the transaction execution process, and the participating nodes assume the execution responsibility during the transaction execution process.
  • a transaction refers to a program execution unit that accesses and possibly updates data in a database, usually including a limited sequence of database operations.
  • Part of the memory of multiple nodes of the distributed database system is used to form a global memory (GM).
  • This global memory is visible to all coordinating nodes and participating nodes in the distributed database system.
  • The remaining part of the memory in a coordinating node or participating node is local memory, which is visible only to the coordinating node or participating node itself.
  • For any coordinating node or participating node, the parts of the global memory located on other nodes can be accessed through remote direct memory access (RDMA) or a memory fabric.
  • The coordinating node is used to receive multiple query statements sent by the client, create a transaction according to a first query statement among the multiple query statements, execute the transaction in the global memory according to a second query statement among the multiple query statements, and then commit the transaction according to a third query statement among the multiple query statements, so as to realize consistency between the coordinating node and the participating nodes.
  • Because the global memory is shared by the coordinating node and the participating nodes, when the coordinating node executes the transaction in the global memory and the data changes, the participating nodes can quickly perceive the change.
  • The parts of the global memory located on other nodes can be accessed unilaterally across nodes through RDMA or the memory fabric for data synchronization, instead of synchronizing data through bilateral interaction. Since this requires no processing by the remote processor or operating system, the access path is greatly shortened, and there is no waiting for operating-system scheduling, which greatly shortens the synchronization time, realizes real-time consistency between the coordinating node and the participating nodes, and meets the real-time consistency requirements of real-time services.
  • the capacity of the global memory can be expanded with the number of nodes, and is no longer limited by the capacity of the memory of a single node, which improves the concurrency control capability.
  • an embodiment of the present application further provides a transaction processing system.
  • the transaction processing system provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
  • the transaction processing system 10 includes a distributed database system 100 , a client 200 and a database 300 .
  • the client 200 is connected to the distributed database system 100
  • the distributed database system 100 is connected to the database 300.
  • the distributed database system 100 includes coordinating nodes and participating nodes. Coordinating nodes and participating nodes can be connected through high-speed network RDMA or memory fabric.
  • the coordinating node and participating nodes run a transaction commit protocol.
  • the coordinator node is defined as the access node of the transaction, and the participating nodes are the master node and the first slave node other than the coordinator node in the distributed database system 100 .
  • the node type of the coordinating node may be a master node or a first slave node, and the first slave node is used to maintain real-time consistency with a node whose node type is the master node. Therefore, the first slave node is also called a real-time slave node.
  • For a read-write transaction, the node type of the coordinator node may be a master node.
  • For a read-only transaction, the node type of the coordinator node may be a real-time slave node.
  • The master node also has the ability to process read-only transactions, so for a read-only transaction the node type of the coordinator node can also be the master node.
  • the distributed database system 100 further includes a non-transactional node, such as a second slave node.
  • the second slave node is used to maintain quasi-real-time consistency with the node whose node type is the master node. Therefore, the second slave node is also called a quasi-real-time slave node.
  • For services that do not require high real-time performance, such as non-real-time analysis services, the distributed database system 100 can process them through the second slave node.
  • the second slave node in the distributed database system 100 may receive a query statement related to the non-real-time analysis service, and the second slave node returns a query result according to the query statement.
  • part of the memory of multiple nodes (for example, each node) in the distributed database system 100 may be used to form a global memory.
  • the global memory can implement memory addressing, memory application and release management through a software module such as a global memory management module.
  • the global memory management module may be a software module of the distributed database system 100 .
  • the global memory management module can manage the memory of the nodes in the distributed database system 100 .
  • the global memory management module can support single-copy or multiple-copy memory application and release.
  • memory application and release are byte-level memory application and release.
  • Data using single-copy global memory is usually cached on a node. When the node fails, the data is inaccessible in the cache. During the takeover phase, the data can be accessed only after it is loaded from the storage system.
  • Data using multi-copy global memory is usually cached on multiple nodes, and when a node fails, it can still be addressed and accessed by other nodes.
  • the global memory management module can provide application and release of small blocks of memory for the global memory formed by partial memory on multiple nodes.
  • the global memory management module may provide a global memory interface. The global memory interface can be used to request a small block of memory of a specified length.
  • the single-copy memory or multi-copy memory returned by the global memory interface is uniformly addressed by the global memory management module, and the address of the single-copy memory or the multi-copy memory is called the global memory address GmAddr.
  • any node in the distributed database system 100 can access the data at the address.
  • the global memory management module may also determine the node identifier and offset position of the corresponding node according to the global memory address. Local or remote read and write access can be achieved based on node identification and offset location.
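  • As a minimal sketch of this uniform addressing (the 16-bit node identifier / 48-bit offset split and the function names are assumptions made for illustration; the actual encoding of GmAddr is not specified here), a global memory address can be resolved into a node identifier and an offset, from which a local or remote (RDMA / memory fabric) access can be chosen:

```go
package main

import "fmt"

type GmAddr uint64

const offsetBits = 48 // assumed split: high 16 bits = node ID, low 48 bits = offset

func makeGmAddr(nodeID uint16, offset uint64) GmAddr {
	return GmAddr(uint64(nodeID)<<offsetBits | offset)
}

// resolve returns the node identifier and byte offset, from which the caller
// decides between a local memory access and a remote unilateral access.
func (a GmAddr) resolve() (nodeID uint16, offset uint64) {
	return uint16(a >> offsetBits), uint64(a) & (1<<offsetBits - 1)
}

func main() {
	addr := makeGmAddr(3, 0x1000)
	node, off := addr.resolve()
	fmt.Printf("GmAddr=%#x -> node %d, offset %#x\n", uint64(addr), node, off)
}
```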
  • the client 200 may be a general client such as a browser, or a dedicated client such as a client of various information management systems.
  • a user can write a query statement through the client 200 according to a query language, such as structured query language (SQL), and the distributed database system 100 (for example, the coordination node in the distributed database system 100 ) receives multiple query statements.
  • the global memory visible to both the coordinating node and the participating nodes can guarantee the real-time consistency of the coordinating node and the participating nodes.
  • the transaction commit protocol running on the coordinating node and the participating nodes enables the operations of the transaction to be executed or rolled back at the same time, which further ensures the real-time consistency of the coordinating node and the participating nodes.
  • the database 300 is used for persistent storage of data.
  • the database 300 may persistently store log data.
  • the distributed database system 100 can load data from the database 300 into the memory.
  • the database 300 may be a database in a shared storage system (shared storage system).
  • the shared storage system includes any one or more of a raw device (raw device), an automatic storage management (automatic storage management, ASM) device, or a network attached storage (network attached storage, NAS) device.
  • A shared storage system has shared access capability: the nodes in the distributed database system 100 can all access the shared storage system.
  • Shared storage systems can use cross-node replicas or cross-node erasure codes to ensure data reliability and the atomicity of data writes. Atomicity means that either all operations in a transaction are completed or none of them are; a transaction does not stop partway through. If an error occurs during execution of the transaction, it is rolled back to the state before the transaction started.
  • the distributed database system 100 may be deployed in a cluster, for example, the distributed database system 100 may be deployed in an in-memory engine cluster. Accordingly, the transaction processing system 10 may further include a cluster management node 400 .
  • the cluster management node 400 is connected to the distributed database system 100 .
  • the cluster management node 400 is used to operate and maintain the cluster (for example, the distributed database system 100 in the cluster). Specifically, the cluster management node 400 may be used to discover nodes in the distributed database system 100, manage the states of the nodes in the distributed database system 100, or manage metadata.
  • the metadata may include at least one of node attribute information and a data table schema.
  • the node attribute information includes any one or more of node identification (identity, ID), node network address (Internet protocol address, IP), node type, and node state.
  • the data table schema includes any one or more of table name, table ID, table type, field number, field type description, and the like.
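  • As a minimal sketch of this metadata (the field names and types are assumptions made for illustration), the node attribute information and the data table schema managed by the cluster management node can be represented as follows:

```go
package main

import "fmt"

// NodeAttr mirrors the node attribute information listed above.
type NodeAttr struct {
	ID    uint32 // node identifier
	IP    string // node network address
	Type  string // "master", "realtime-slave" or "quasi-realtime-slave"
	State string // e.g. "online", "offline"
}

// Column is one field of a data table.
type Column struct {
	Name string
	Type string // field type description
}

// TableSchema mirrors the data table schema listed above.
type TableSchema struct {
	Name    string
	TableID uint32
	Type    string
	Columns []Column // the field number is len(Columns)
}

type Metadata struct {
	Nodes  []NodeAttr
	Tables []TableSchema
}

func main() {
	md := Metadata{
		Nodes:  []NodeAttr{{ID: 1, IP: "10.0.0.1", Type: "master", State: "online"}},
		Tables: []TableSchema{{Name: "t1", TableID: 100, Columns: []Column{{"id", "int64"}}}},
	}
	fmt.Printf("%+v\n", md)
}
```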
  • the transaction processing system 10 further includes a timing server 500 .
  • the timing server 500 is connected to the distributed database system 100 .
  • the timing server 500 is used to provide a monotonically increasing clock service.
  • The clock may be a logical clock, a true-time clock, or a hybrid logical clock.
  • A hybrid logical clock refers to a logical clock combined with a physical clock (that is, real time).
  • the timing server 500 may provide the distributed database system 100 with a timestamp of the current moment.
  • the timestamp may be a numerical value representing time with a length of 8 bytes or 16 bytes.
  • Distributed database system 100 may obtain timestamps to determine the visibility of transactions to data.
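  • As a minimal sketch of such a monotonically increasing 8-byte timestamp (the split into 48 bits of physical milliseconds and 16 bits of logical counter is an assumption made for illustration, not the concrete format used by the timing server 500), a hybrid logical clock could be implemented as follows:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HLC combines a physical clock with a logical counter so that the returned
// timestamps never decrease, even if the physical clock stalls.
type HLC struct {
	mu      sync.Mutex
	lastPhy int64
	logical int64
}

// Now returns an 8-byte timestamp that is strictly monotonically increasing.
func (c *HLC) Now() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	phy := time.Now().UnixMilli()
	if phy <= c.lastPhy {
		c.logical++ // physical clock did not advance: bump the logical part
	} else {
		c.lastPhy, c.logical = phy, 0
	}
	return uint64(c.lastPhy)<<16 | uint64(c.logical&0xffff)
}

func main() {
	var clk HLC
	a, b := clk.Now(), clk.Now()
	fmt.Printf("ts1=%d ts2=%d monotonic=%v\n", a, b, b > a)
}
```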
  • Distributed database system 100 may be installed prior to transaction processing. During the process of installing the distributed database system 100 , the user may be prompted to configure the nodes in the distributed database system 100 .
  • the following describes the node configuration method provided by the embodiments of the present application with reference to the accompanying drawings.
  • the method includes:
  • S202 The nodes in the distributed database system 100 configure the node IP, install the log file system, and set the node type.
  • Distributed database system 100 may include multiple nodes.
  • the node IP can be automatically configured based on the IP address pool.
  • the node may also receive the node IP manually configured by the administrator, thereby implementing the node IP configuration.
  • the node can automatically configure the node type, or receive the node type manually configured by the administrator, so as to realize the node type configuration.
  • the node type may include a master node and a real-time slave node, and further, the node type may also include a quasi-real-time slave node.
  • the master node may be used to process read-write transactions
  • the real-time slave node may be used to process read-only transactions
  • the quasi-real-time slave node may be used to process non-transactional requests, such as query requests associated with analytical services. This can meet the needs of different businesses.
  • In order to ensure the durability of transaction-related data, a transaction commit success notification message is returned to the client 200 only after the transaction log file, such as the redo log, has been successfully persisted.
  • Directly writing log files to a shared storage system (for example, a database in the shared storage system) increases the transaction commit delay: transaction processing on the distributed database system 100 side is fast, but log file persistence is slow, which affects the overall latency and performance.
  • some nodes in the distributed database system 100 may also be configured with high-speed persistent media for the persistence of log files.
  • The high-speed persistent medium includes, but is not limited to, power-failure-protected memory, non-volatile random access memory (NVRAM), or other non-volatile storage media such as 3D XPoint.
  • Nodes can install a log file system (LogFs) to manage local high-speed persistent media through the log file system.
  • the log file system can also provide an access interface for file semantics for persistent log files, such as redolog files.
  • the real-time slave node can also be configured with a high-speed persistent medium, so that when the master node fails, the node type of the real-time slave node can be changed to the master node.
  • the quasi-real-time slave node is mainly used to process non-transactional requests, and there is no need for local log file persistence, so there is no need to configure high-speed persistent media.
  • The redo log file can be moved from the log file system to the shared storage system by a background task (for example, a task triggered by a node in the distributed database system 100).
  • the moving process is completed by the background task and does not affect the transaction execution delay.
  • the capacity requirements of high-speed persistent media are small.
  • For example, if the transaction processing capacity of a single node is 1 million transactions per second, the data volume of log files such as redo log files is 1 gigabyte (GB) per second.
  • the distributed database system 100 can be logically divided into an index layer, a table record layer, and a near-end persistence layer.
  • the index layer is used to store the index tree of the table records in the data table and the management header rhead of the table records.
  • The rhead is used to record the global memory address and the log file address (for example, the redo log file address) of each version of a table record.
  • the index layer is usually implemented using local memory and accessed through local access.
  • The index tree and rhead exist on all nodes in the distributed database system 100, and the index tree and rhead on the master node and the real-time slave nodes are kept consistent in real time. Specifically, the consistent data is formed when the transaction commit is completed.
  • The table record layer is used to store the table records of the data table.
  • the table record layer is usually implemented using global memory, which can be accessed through remote access methods such as RDMA or memory fabric.
  • the near-end persistence layer is used for persistent storage of log files such as redolog files. Near-end persistence layers are typically implemented using high-speed persistence media.
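  • As a minimal sketch of this layering (the field names are assumptions made for illustration), the rhead kept in the local-memory index layer records where each version's record data lives in global memory and where its redo log entry was persisted in the near-end persistence layer:

```go
package main

import "fmt"

type GmAddr uint64  // address in global memory (table record layer)
type LogAddr uint64 // address in the redo log file (near-end persistence layer)

// rhead is the per-version management header kept in local memory (index layer).
type rhead struct {
	recordAddr GmAddr  // where this version's record data lives in global memory
	redoAddr   LogAddr // where its redo log entry was persisted
	lotxid     uint64  // local transaction that produced the version
	tmin, tmax int64   // lifetime window used for visibility checks
	prev       *rhead  // older version in the version linked list
}

// indexLayer: a primary-key index mapping keys to the newest rhead.
type indexLayer map[string]*rhead

func main() {
	idx := indexLayer{}
	idx["k1"] = &rhead{recordAddr: 0x3000_0000_1000, redoAddr: 0x200, lotxid: 7, tmin: 100, tmax: -1}
	fmt.Printf("%+v\n", *idx["k1"])
}
```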
  • S204 The nodes in the distributed database system 100 report the node configuration information to the cluster management node 400.
  • the node configuration information may include node IP and node type. Further, the node configuration information may also include any one or more of memory capacity and log file system capacity.
  • the cluster management node 400 checks the number of nodes corresponding to each node type, and when the number of nodes satisfies the set condition, saves the node configuration information to the system node table sysNodeTbl.
  • the distributed database system 100 includes at least a master node, and in some possible implementations, the distributed database system 100 further includes at least one of a real-time slave node and a quasi-real-time slave node. Based on this, the cluster management node 400 can check whether the number Nm of master nodes satisfies the following set condition: Nm>0. Optionally, the cluster management node 400 may check whether the number Nr of real-time slave nodes and the number Nq of quasi-real-time slave nodes satisfy the following set conditions: Nr ⁇ 0, Nq ⁇ 0.
  • the cluster management node 400 can check whether the sum of the number of master nodes Nm and the number of real-time slave nodes Nr satisfies the following set conditions: Nm+Nr ⁇ Q1, and check whether the number of quasi-real-time slave nodes Nq satisfies the following set conditions: Nq ⁇ Q2.
  • Q1 and Q2 can be set according to empirical values, for example, Q1 can be set to 8, and Q2 can be set to 64.
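  • As a minimal sketch of this check (the function name and error handling are assumptions made for illustration), the cluster management node 400 could validate the reported node counts as follows, using the example values Q1 = 8 and Q2 = 64:

```go
package main

import "fmt"

const (
	q1 = 8  // maximum of master nodes + real-time slave nodes (example value)
	q2 = 64 // maximum of quasi-real-time slave nodes (example value)
)

// checkNodeCounts returns nil when the configuration can be saved to sysNodeTbl.
func checkNodeCounts(nm, nr, nq int) error {
	switch {
	case nm <= 0:
		return fmt.Errorf("at least one master node is required, got %d", nm)
	case nr < 0 || nq < 0:
		return fmt.Errorf("slave node counts must be non-negative")
	case nm+nr > q1:
		return fmt.Errorf("master + real-time slave nodes %d exceeds Q1=%d", nm+nr, q1)
	case nq > q2:
		return fmt.Errorf("quasi-real-time slave nodes %d exceeds Q2=%d", nq, q2)
	}
	return nil
}

func main() {
	fmt.Println(checkNodeCounts(2, 3, 10)) // <nil>: configuration accepted
	fmt.Println(checkNodeCounts(0, 3, 10)) // error: no master node
}
```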
  • the cluster management node 400 can save the node configuration information to the system node table sysNodeTbl.
  • the cluster management node 400 returns a configuration success prompt to the nodes in the distributed database system 100.
  • the cluster management node 400 returns a configuration success prompt to each node in the distributed database system 100 , and then each node returns a configuration success prompt to the client 200 .
  • the cluster management node 400 may also return a configuration failure prompt, so as to reconfigure the node.
  • S212 The client 200 sets the valid flag of the system node table sysNodeTbl.
  • the client 200 sends the valid flag of the system node table sysNodeTbl to the cluster management node 400 .
  • The valid flag is used to bring the system node table sysNodeTbl into effect; the client 200 sets the valid flag and sends it to the cluster management node 400 to make the system node table sysNodeTbl valid.
  • the cluster management node 400 returns the system node table sysNodeTbl to the nodes in the distributed database system 100.
  • S220 The cluster management node 400 returns the node configuration information to the client 200.
  • the cluster management node 400 returns the node configuration information of each node to the client 200, such as the node IP of each node, the node type, and the like.
  • the client 200 can not only obtain the node configuration information, but also obtain the number of nodes, such as the number of nodes of each node type, according to the node configuration information of each node.
  • The minimum value of RecordMemRepNum can be set to 1. Considering that the data on the master node and the real-time slave nodes in the distributed database system 100 is consistent in real time, if the number of replicas in the global memory exceeds the sum of the number of master nodes Nm and the number of real-time slave nodes Nr, that is, Nm+Nr, memory consumption increases without any improvement in availability; based on this, the maximum value can be set to Nm+Nr.
  • the cluster management node 400 can also save RecordMemRepNum as a table attribute in the system metadata table sysMetaTbl, and the nodes in the distributed database system 100 can also update the system metadata table sysMetaTbl in the local memory.
  • the method includes:
  • The client 200 sends a create table command to the master node in the distributed database system 100, where the create table command includes RecordMemRepNum.
  • a table parameter is included in the create table command, and the table parameter may include the number of copies of the table records in the global memory, that is, RecordMemRepNum.
  • the table parameter may also include one or more of table name, column name, and column type.
  • S304 The master node forwards the create table command to the cluster management node 400.
  • the master node can execute the create table command to create the data table.
  • the master node also forwards the table creation command, for example, forwards the RecordMemRepNum parameter in the table creation command to the cluster management node 400 to set the RecordMemRepNum.
  • the number of replicas RecordMemRepNum recorded in the global memory can be configured with the data table as the granularity, which satisfies the availability requirements of different data tables, and can control the memory consumption according to the requirements of the data table.
  • S306 The cluster management node 400 checks whether RecordMemRepNum is within a preset range. If yes, execute S308; if not, execute S320.
  • the preset range is the value range of RecordMemRepNum.
  • the range may be greater than or equal to 1 and less than or equal to the sum of the number Nm of master nodes and the number Nr of real-time slave nodes, that is, Nm+Nr.
  • the cluster management node 400 checks whether RecordMemRepNum is greater than or equal to 1 and less than or equal to Nm+Nr. If so, the RecordMemRepNum representing the configuration is valid, and S308 can be performed; if not, the RecordMemRepNum representing the configuration is invalid, and S320 can be performed.
  • the cluster management node 400 stores the RecordMemRepNum in the system metadata table sysMetaTbl.
  • the cluster management node 400 adds a table record in the system metadata table sysMetaTbl, where the table record is specifically used to record RecordMemRepNum, and the cluster management node 400 can persistently store the above data in the table record.
  • the cluster management node 400 sends the system metadata table sysMetaTbl to the master node, the real-time slave node, and the quasi-real-time slave node.
  • the cluster management node 400 may send the above-mentioned system metadata table sysMetaTbl to the nodes in the distributed database system 100 .
  • the distributed database system 100 does not include a real-time slave node or a quasi-real-time slave node, the step of sending the system metadata table sysMetaTbl to the real-time slave node or the quasi-real-time slave node may not be performed.
  • S312 The master node, the real-time slave node, and the quasi-real-time slave node in the distributed database system 100 update the system metadata table sysMetaTbl in the local memory.
  • S314 The master node, the real-time slave node, and the quasi-real-time slave node in the distributed database system 100 send an update completion notification to the cluster management node 400.
  • the update completion notification is used to notify the cluster management node 400 that the update of the system metadata table sysMetaTbl has been completed in the local memory of each node of the distributed database system 100 .
  • S316 The cluster management node 400 sends a configuration success response to the master node in the distributed database system 100.
  • S318 The master node sends a configuration success response to the client 200.
  • the cluster management node 400 may also directly send a configuration success response to the client 200 to notify the client 200 that the configuration of RecordMemRepNum has been completed.
  • S320 The cluster management node 400 sends a configuration failure response to the master node in the distributed database system 100.
  • S322 The master node sends a configuration failure response to the client 200.
  • the cluster management node 400 may also directly send a configuration failure response to the client 200 to notify the client 200 that the configuration of RecordMemRepNum fails. Based on this, the client 200 can also adjust the table parameters, and then resend the create table command.
  • transaction processing can be performed based on the above-mentioned transaction processing system 10 .
  • the transaction processing method provided by the embodiments of the present application will be described in detail with reference to the accompanying drawings.
  • the method includes:
  • the client 200 sends a plurality of query statements to the coordination node in the distributed database system 100.
  • a query statement refers to a statement written in a query language and used to process data in the database 300 .
  • the processing of the data in the database 300 includes any one or more of data creation, data query, data update, and data deletion.
  • the client 200 may receive multiple query statements written by the user through a query language, and then send the aforementioned multiple query statements to the coordination node in the distributed database system 100 .
  • the query language may be determined by the user from the list of query languages supported by the database 300 .
  • the query language may be SQL, and correspondingly, the query statement written by the user may be an SQL statement.
  • When the client 200 sends multiple query statements, it can send them all at once, which improves throughput.
  • the client 200 may also send query statements one by one. Specifically, the client 200 may first send a query statement, and then send the next query statement after the query statement is executed.
  • multiple query statements can be used to form a transaction.
  • the client 200 determines the transaction type according to the query statement. Among them, the types of transactions include read-write transactions and read-only transactions. Read-only transactions do not support insert, delete, and update operations within a transaction.
  • the client 200 can determine whether the transaction type is a read-write transaction or a read-only transaction according to whether the query statement indicates to insert, delete or update the table record. For example, if at least one query statement in the multiple query statements indicates to insert, delete or update table records, the client 200 may determine that the transaction type is a read-write transaction, otherwise, the transaction type is determined to be a read-only transaction.
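  • As a minimal sketch of this client-side classification (reducing the check to a leading SQL keyword is a simplification made for illustration), the transaction type can be derived from the query statements as follows:

```go
package main

import (
	"fmt"
	"strings"
)

// isWriteStatement reports whether a statement inserts, deletes or updates
// table records.
func isWriteStatement(sql string) bool {
	s := strings.ToUpper(strings.TrimSpace(sql))
	return strings.HasPrefix(s, "INSERT") ||
		strings.HasPrefix(s, "DELETE") ||
		strings.HasPrefix(s, "UPDATE")
}

// transactionType is read-write if any statement writes, otherwise read-only.
func transactionType(stmts []string) string {
	for _, s := range stmts {
		if isWriteStatement(s) {
			return "read-write"
		}
	}
	return "read-only"
}

func main() {
	fmt.Println(transactionType([]string{"BEGIN", "SELECT * FROM t1", "COMMIT"}))
	fmt.Println(transactionType([]string{"BEGIN", "UPDATE t1 SET c = 1", "COMMIT"}))
}
```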
  • the client 200 may determine a coordinator node from the master node of the distributed database system 100, and send multiple query statements to the coordinator node.
  • the client 200 may determine a coordinator node from the real-time slave nodes of the distributed database system 100, and send multiple query statements to the coordinator node.
  • the client 200 may also determine the coordination node from the master node, which is not limited in this embodiment of the present application.
  • S404 The coordination node in the distributed database system 100 creates a transaction according to the first query statement in the multiple query statements.
  • the first query statement may be a query statement indicating the start of a transaction.
  • the first query statement may include a begin command.
  • the coordinating node may execute the first query statement, thereby creating a transaction.
  • When the node type of the coordinating node is the master node, the coordinating node may create a read-write transaction according to the first query statement.
  • When the node type of the coordinating node is the real-time slave node, the coordinating node may create a read-only transaction according to the first query statement.
  • Specifically, the coordinating node may create a transaction according to the first query statement characterizing the start of the transaction (begin, as shown in FIG. 5).
  • the coordinating node can create a global transaction, apply for the unique identifier gtxid of the global transaction, and apply for a local transaction control block from the local memory to obtain the unique identifier lotxid of the local transaction control block.
  • the coordinating node may also obtain a start time stamp (begin time stamp, beginTs) from the timing server 500 .
  • the gtxid can be determined according to the node identifier and the serial number in the node, for example, it can be a string obtained by concatenating the node identifier and the serial number in the node.
  • a global transaction includes multiple sub-transactions (for example, local transactions on the coordinating node and participating nodes), and multiple sub-transactions can be associated with each other through gtxid.
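  • As a minimal sketch of this identifier allocation (the exact string format of gtxid and the use of atomic counters are assumptions made for illustration), gtxid can concatenate the node identifier with an in-node serial number, and lotxid can be a strictly monotonically increasing 8-byte value:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type txnIDGen struct {
	nodeID string
	serial uint64 // per-node serial number used for gtxid
	lotxid uint64 // local transaction control block identifier
}

// newGtxid concatenates the node identifier and the in-node serial number.
func (g *txnIDGen) newGtxid() string {
	n := atomic.AddUint64(&g.serial, 1)
	return fmt.Sprintf("%s-%d", g.nodeID, n)
}

// newLotxid returns a strictly monotonically increasing 8-byte value.
func (g *txnIDGen) newLotxid() uint64 {
	return atomic.AddUint64(&g.lotxid, 1)
}

func main() {
	gen := &txnIDGen{nodeID: "node1"}
	fmt.Println(gen.newGtxid(), gen.newLotxid())
	fmt.Println(gen.newGtxid(), gen.newLotxid())
}
```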
  • other participating nodes can query the cluster management node 400 for the node identification of the takeover node through the node identification in gtxid, thereby initiating the global transaction state re-confirmation process to the takeover node.
  • the local transaction control block is specifically a section of memory space in the local memory for process state control.
  • the unique identifier of the local transaction control block that is, the lotxid, can be a strictly monotonically increasing value, and the value can be an 8-byte value.
  • The lotxid can be recorded in the rhead of the index layer. When other transactions on this node need to wait for this transaction to complete, this transaction can be found through the lotxid, and the waiting transactions can be added to its waiting queue.
  • The transaction start timestamp, namely beginTs, can be recorded in the local transaction control block to provide a basis for transaction visibility judgment.
  • The coordinating node obtains the current timestamp from the timing server 500 as beginTs, and can judge the visibility of table records to the transaction according to visibility rules based on this start timestamp. The visibility judgment process is described in detail below.
  • The rhead of a version of a table record includes the lifetime window of that version.
  • The lifetime window can be characterized by a minimum timestamp tmin, which characterizes the start time, and a maximum timestamp tmax, which characterizes the end time.
  • The coordinating node can judge the visibility of a version of a table record to the transaction based on the relationship between the transaction's beginTs and the tmin and tmax in the rhead of that version.
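  • As a minimal sketch of this visibility judgment (the exact comparison, with an inclusive lower bound and an exclusive upper bound, and the explicit wait result for uncommitted versions are assumptions consistent with the surrounding description), a version whose lifetime window contains beginTs is visible:

```go
package main

import "fmt"

type version struct {
	tmin        int64 // commit timestamp of the creating transaction
	tmax        int64 // commit timestamp of the deleting transaction, -1 = infinity
	uncommitted bool  // tmin/tmax still holds a gtxid rather than a timestamp
}

type visibility int

const (
	invisible visibility = iota
	visible
	mustWait // creator not yet committed: wait on its local transaction control block
)

// checkVisibility applies the lifetime-window rule described above.
func checkVisibility(v version, beginTs int64) visibility {
	if v.uncommitted {
		return mustWait
	}
	if v.tmin <= beginTs && (v.tmax == -1 || beginTs < v.tmax) {
		return visible
	}
	return invisible
}

func main() {
	fmt.Println(checkVisibility(version{tmin: 100, tmax: -1}, 150))                    // 1: visible
	fmt.Println(checkVisibility(version{tmin: 100, tmax: 120}, 150))                   // 0: invisible
	fmt.Println(checkVisibility(version{tmin: 100, tmax: -1, uncommitted: true}, 150)) // 2: must wait
}
```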
  • the coordinating node in the distributed database system 100 executes the transaction in the global memory according to the second query statement in the plurality of query statements.
  • the second query statement may be a query statement indicating a transaction operation.
  • the transaction operation includes a data manipulation language (DML) operation.
  • the DML operation may include an insert, delete or update operation.
  • Transactional operations may also include query operations.
  • When the second query statement is a statement instructing to execute an insert operation, the second query statement may also carry the record data of the table record to be inserted.
  • When the second query statement is a statement instructing to perform an update operation, the second query statement may also carry the primary key of the table record to be updated and the updated record data.
  • When the second query statement is a statement instructing to perform a delete operation, the second query statement may also carry the primary key of the table record.
  • When the second query statement is a statement instructing to perform a query operation, the second query statement may also carry query conditions.
  • the query condition may include the primary key or predicate condition of the table record. Predicates are used to express comparison operations, and predicate conditions include query conditions expressed through comparison operations. Predicate conditions are used to narrow down the result set returned by a query.
  • the coordinating node may execute the second query statement, thereby executing the transaction in the global memory.
  • the coordinating node may perform at least one of data insertion, deletion, update, or read operations in the global memory.
  • There may be one or more second query statements. The following describes the process of executing a transaction by taking second query statements instructing an insert operation, a delete operation, an update operation, and a query operation as examples.
  • When the second query statement instructs an insert operation, the coordinating node (for example, the master node 1 in the distributed database system 100) can query the system metadata table sysMetaTbl to obtain the table attributes, where the table attributes include the number of copies of the table record in the global memory, which may be denoted as RecordMemRepNum.
  • the coordinating node can call the global memory interface provided by the global memory management module to apply for a global memory space with a specified number of copies (for example, it can be denoted as gm1), and then fill in the record data in gm1.
  • the specified number of replicas is equal to RecordMemRepNum.
  • the coordinating node can call the global memory interface to obtain the list of nodes where the replica of gm1 is located, and then fill in the new record data in the global memory of the node of the first replica (for example, the coordinating node) according to the list of nodes where the replica is located.
  • tmin can be set to gtxid, and tmax can be set to -1 (representing infinity). It should be noted that tmin is set to gtxid before commit and to the commit timestamp after commit, while tmax remains unchanged before and after commit.
  • the coordinating node can query the system node table sysNodeTbl. If there is a quasi-real-time slave node in the node list where the replica is located, continue to fill in the newly added record data in the global memory of the quasi-real-time slave node.
  • tmin is set to gtxid and tmax is set to -1.
  • The coordinating node can also apply for local memory for storing the rhead and the indirect index. The global memory address of the inserted record data, the physical address, and the lotxid are filled in the rhead. Since the transaction has not yet committed, the physical address can be 0, and indirect points to the rhead. The coordinating node then inserts the newly added record data into the local index tree. If there is a key conflict, the insert fails, the previously allocated global memory and local memory are released, and the coordinating node can return an error message. The error message may indicate that the insert failed and, further, the reason for the failure.
  • If the insert succeeds, indirect is modified to point to the newly added record data, the coordinating node records the operation type as insert in the local transaction write set (write set, wset), and returns an insert success notification to the client 200.
  • the local transaction write set is hereinafter referred to as the write set for short.
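As a rough illustration of the insert path just described, the sketch below fills the new record into (simulated) global memory with tmin set to the gtxid and tmax set to the infinite marker, allocates an rhead in local memory, installs it into the index, and records the operation in the write set. All names (Record, RHead, insertRecord) and the map-based stand-ins for the index tree and the global memory are assumptions for illustration, not the patent's implementation.

```go
package sketch

import "fmt"

const tmaxInfinite = int64(-1) // -1 marks an open-ended lifetime window

// Record models one version of a table record in global memory.
type Record struct {
	Tmin, Tmax int64 // gtxid before commit, commit timestamp after commit
	Data       []byte
}

// RHead is the management header kept in local memory for one version.
type RHead struct {
	GlobalAddr uint64 // address of the Record in global memory
	PhysAddr   uint64 // redolog file id + offset; 0 before commit
	Lotxid     uint64
	Prev       *RHead
}

// insertRecord is a minimal sketch of the coordinator-side insert path.
func insertRecord(index map[string]*RHead, globalMem map[uint64]*Record,
	key string, data []byte, gtxid int64, lotxid uint64, newAddr uint64,
	writeSet *[]*RHead) error {

	// A key conflict makes the insert fail (the allocated memory would be released).
	if _, exists := index[key]; exists {
		return fmt.Errorf("insert failed: key %q already exists", key)
	}
	// Fill the new record in (simulated) global memory: tmin = gtxid, tmax = infinite.
	globalMem[newAddr] = &Record{Tmin: gtxid, Tmax: tmaxInfinite, Data: data}

	// Allocate the management header in local memory and install it in the index.
	rhead := &RHead{GlobalAddr: newAddr, PhysAddr: 0, Lotxid: lotxid}
	index[key] = rhead

	// Record the operation in the local transaction write set.
	*writeSet = append(*writeSet, rhead)
	return nil
}
```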
  • When the second query statement instructs an update operation, the coordinating node (for example, the master node 1 in the distributed database system 100) can query the system metadata table sysMetaTbl to obtain the table attributes, where the table attributes include RecordMemRepNum.
  • The coordinating node looks up the versions of the table record targeted by the update operation, determines the visibility of the table record to the transaction according to beginTs and the tmin and tmax in rhead, and returns the correct version of the table record.
  • Specifically, based on the primary key of the table record targeted by the update operation, the coordinating node searches the index data, such as the index tree, for the version linked list of the table record, that is, the rhead linked list. The coordinating node then reads the tmin and tmax of the table record according to the global memory address recorded in rhead and judges visibility based on them. If the tmin or tmax of a version is a gtxid instead of a timestamp, this transaction is added to the waiting queue of the local transaction control block identified by the lotxid in rhead; after this transaction is woken up, the traversal of the rhead linked list can be re-executed.
  • A gtxid and a timestamp can be distinguished by the high-order bit: if the high-order bit of tmin or tmax is 1, the value is a gtxid; if the high-order bit is not 1, the value is a timestamp.
  • If both tmin and tmax of a version are timestamps, then when beginTs is within [tmin, tmax) the table record of this version is visible to the transaction and the coordinating node can return this version of the table record together with its rhead; when beginTs is not within [tmin, tmax), the table record of this version is not visible to the transaction, and the coordinating node continues traversing to the previous version according to the address of the previous version recorded in rhead.
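The visibility rule just described can be sketched as a small helper. The gtxidFlag constant models the high-order-bit encoding, and treating the all-ones value as the "infinite" tmax marker (-1) is an assumption of this sketch; the function and constant names are illustrative.

```go
package sketch

const (
	gtxidFlag    = uint64(1) << 63 // high-order bit set => value is a gtxid, not a timestamp
	tmaxInfinite = ^uint64(0)      // all-ones (-1) marker meaning the version has no end of life yet
)

// isGtxid reports whether a tmin/tmax word still holds a transaction ID.
func isGtxid(v uint64) bool { return v&gtxidFlag != 0 }

// visibility is the outcome of checking one version of a table record.
type visibility int

const (
	visible   visibility = iota // beginTs falls inside [tmin, tmax)
	invisible                   // outside the window; caller walks to the previous version
	mustWait                    // tmin or tmax is a gtxid; caller joins that transaction's wait queue
)

// checkVisibility applies the rule: wait while the window still holds a gtxid,
// otherwise the version is visible exactly when beginTs lies in [tmin, tmax).
func checkVisibility(beginTs, tmin, tmax uint64) visibility {
	if isGtxid(tmin) {
		return mustWait
	}
	if tmax != tmaxInfinite && isGtxid(tmax) {
		return mustWait
	}
	if beginTs >= tmin && (tmax == tmaxInfinite || beginTs < tmax) {
		return visible
	}
	return invisible
}
```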
  • The coordinating node can obtain the global memory address of the table record from the rhead of the returned version, and then try to mark the update in the table record according to that address. Specifically, if the tmax of the returned version is not -1, the version is not the current version and has already been updated by another transaction, that is, a write-write conflict occurs, and the coordinating node can return a mark failure notification to the client 200.
  • If the tmax of the returned version is -1, it is the latest version. The coordinating node can call the global memory interface to obtain the node list of the memory replicas and initiate a compare-and-swap (CAS) atomic operation on the tmax of the corresponding table record in the global memory of the first replica node, marking tmax as gtxid. If the CAS atomic operation returns failure, a write-write conflict has occurred and a mark failure notification is returned to the client 200. If the CAS atomic operation returns success, the mark succeeds and the coordinating node can update the table record.
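This marking step can be illustrated with a compare-and-swap on the version's tmax word. The sketch below stands in for the CAS issued against the first replica node's global memory; the function name tryMarkUpdate and the local int64 representation are assumptions.

```go
package sketch

import "sync/atomic"

const tmaxInfinite = int64(-1) // -1 means the version is still the current version

// tryMarkUpdate attempts to claim the current version of a table record for
// this transaction by swinging its tmax from "infinite" to gtxid with a CAS,
// standing in for the CAS issued against the first replica's global memory.
// A false return means another transaction got there first, i.e. a write-write conflict.
func tryMarkUpdate(tmax *int64, gtxid int64) bool {
	if atomic.LoadInt64(tmax) != tmaxInfinite {
		return false // already updated by another transaction
	}
	return atomic.CompareAndSwapInt64(tmax, tmaxInfinite, gtxid)
}
```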
  • When the coordinating node updates the table record, it can first call the global memory interface to apply for global memory with the specified number of copies, and then fill in the updated record data.
  • the specific implementation of the coordinating node applying for the global memory and filling in the updated record data in the global memory may refer to the description of the insert operation, which will not be repeated here.
  • The coordinating node then applies for local memory to store the rhead, and fills in the global memory address of the updated table record, the lotxid, and the physical address in the rhead. At this time, the physical address can be 0.
  • Next, the coordinating node installs the new version chain; specifically, it points the management header newrhead of the updated table record to the address of the rhead of the previous version, and points indirect to newrhead.
  • the coordinating node records the operation type as update in the write set wset in the local transaction control block and records the address of rhead (ie oldrhead) and the address of newrhead.
  • the coordinating node may return a notification of successful update to the client 200 .
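Installing the new version chain amounts to two pointer assignments, sketched below. The Indirect and RHead types and the installNewVersion name are illustrative assumptions, not the patent's implementation.

```go
package sketch

// RHead is the per-version management header kept in local memory.
type RHead struct {
	GlobalAddr uint64
	PhysAddr   uint64
	Lotxid     uint64
	Prev       *RHead // previous version in the chain
}

// Indirect is the stable entry point the index keeps for one primary key;
// readers always start from the version it currently points to.
type Indirect struct {
	Head *RHead
}

// installNewVersion links newrhead in front of the current version and
// redirects the indirect pointer, as the update path describes. It returns
// the old header so the write set can record both oldrhead and newrhead.
func installNewVersion(ind *Indirect, newrhead *RHead) (oldrhead *RHead) {
	oldrhead = ind.Head
	newrhead.Prev = oldrhead
	ind.Head = newrhead
	return oldrhead
}
```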
  • When the second query statement instructs a delete operation, the coordinating node (for example, the master node 1 in the distributed database system 100) can look up the versions of the table record to be deleted, determine the visibility of the table record to the transaction according to beginTs and the tmin and tmax in rhead, and return the correct version of the table record. The coordinating node then obtains the global memory address of the table record from the rhead of the returned version and marks the update in the table record.
  • the specific implementation of the coordinating node determining the visibility, returning the table record of the correct version, and marking the update can refer to the description of the update operation, which will not be repeated here.
  • the coordinating node records the operation type as delete in the write set in the local transaction control block, and records the address of rhead. After completing the above operations, the coordinating node may return a deletion success notification to the client 200 .
  • When the second query statement instructs a query operation, the coordinating node (for example, the master node 1 in the distributed database system 100) can look up the versions of the queried table record according to the query conditions, determine the visibility of the table record to the transaction according to beginTs and the tmin and tmax in rhead, and return the correct version of the table record.
  • the specific implementation of the coordinating node determining the visibility and returning the correct version of the table record can be found in the description of the update operation.
  • the coordinating node can also traverse the read records in the local transaction read set (read set, rset) and check the phantom to verify the read-write conflict. When the validation passes, the correct version of the above table record can be returned to respond to the query operation.
  • the local transaction read set may be referred to as a read set for short.
  • After the above processing, for an insert/update operation, the coordinating node has applied for the specified number of copies of global memory for the new version of the table record, filled in the record data on the first replica node (for example, the coordinating node), and set tmin and tmax. If the replica nodes include a quasi-real-time slave node, the record data is also filled in on that quasi-real-time slave node and its tmin and tmax are set. The coordinating node has also applied for an rhead for the new version of the table record, recorded the global memory address of the new or updated record data in the rhead, and installed it into the local index tree and indirect.
  • For an update/delete operation, the tmax of the original version on the first replica node has been set to gtxid via CAS, thereby handling concurrent conflicts with other transactions.
  • The gtxid, beginTs, and the address of newrhead have been recorded in the write set in the local transaction control block.
  • the read records and query conditions (for example, predicate conditions) have been recorded in the read set in the local transaction control block. Among them, the read record can be used for read-write conflict verification.
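The state accumulated by the end of the execution phase could be organized roughly as follows; the field and type names (writeEntry, readEntry, localTxnState) are assumptions for illustration only.

```go
package sketch

type opType int

const (
	opInsert opType = iota
	opUpdate
	opDelete
)

// writeEntry records one modification made by the transaction; OldRheadAddr
// is zero for inserts and NewRheadAddr is zero for deletes.
type writeEntry struct {
	Op           opType
	NewRheadAddr uintptr
	OldRheadAddr uintptr
}

// readEntry keeps what was read so that read-write conflicts (including
// phantoms) can be re-validated at commit time.
type readEntry struct {
	RheadAddr uintptr
	Predicate string // predicate condition used by the query, if any
}

// localTxnState is the per-transaction state referenced by the commit protocol.
type localTxnState struct {
	Gtxid    uint64
	BeginTs  uint64
	WriteSet []writeEntry
	ReadSet  []readEntry
}
```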
  • the third query statement may be a query statement indicating transaction commit.
  • the third query statement may include a commit command.
  • The coordinating node may execute the third query statement to commit the transaction, so that newly added record data, updated record data, deleted record data, or queried record data is consistent in real time between the coordinating node and the participating nodes.
  • the coordinating node (for example, the master node 1 in the distributed database system 100) can query the locally cached system node table sysNodeTbl to obtain the list of other master nodes and real-time slave nodes.
  • These nodes are the participating nodes.
  • According to the operation type in the write set (for example, one or more of insert, update, and delete) and the addresses of newrhead and oldrhead, the coordinating node packages the operation type, the global memory addresses of the old and new version table records, and the record data of the new version table record into a pre-synchronization (also called Preinstall) request message.
  • The pre-synchronization request message includes gtxid and beginTs.
  • The coordinating node sends the pre-synchronization request message to the participating nodes (for example, the master node 2, the master node 3, the real-time slave node 1, the real-time slave node 2, etc. in the distributed database system 100). The participating nodes receive the pre-synchronization request message, each create a local transaction on their own node, and obtain a lotxid.
  • Participating nodes traverse the write set in the pre-sync request message, and perform the following processing according to the operation type:
  • For an insert operation, the participating node applies for local memory for the new version, which is used to store newrhead.
  • The participating node records in newrhead the global memory address, lotxid, and physical address carried in the pre-synchronization request message, where the physical address is 0, and then assigns indirect so that it points to newrhead.
  • The participating node then inserts the record data into the index tree according to the primary key of the newly added record data.
  • For an update operation, the participating node applies for local memory for the new version, which is used to store newrhead.
  • The participating node records in newrhead the global memory address, lotxid, and physical address carried in the pre-synchronization request message. At this time, the physical address is 0.
  • The participating node looks up the indirect address in the local index tree according to the primary key of the old version table record, obtains the rhead it points to, and points newrhead to the current rhead. Then, according to the global memory address of the old version record data, the participating node checks whether this node holds a replica of that global memory; if it does and it is not the first replica, it can modify tmax to gtxid in this node's replica.
  • the participating nodes send a pre-synchronization response message to the coordinating node.
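The pre-synchronization exchange might be modelled as below: the coordinating node packs gtxid, beginTs, and the write-set entries into one message, and the participant creates a local transaction and installs each entry, reporting failure (for example, on a key conflict) back to the coordinating node. The message layout and all function names here are assumptions for illustration.

```go
package sketch

// preSyncRequest is what the coordinating node packs from its write set.
type preSyncRequest struct {
	Gtxid   uint64
	BeginTs uint64
	Entries []preSyncEntry
}

// preSyncEntry carries one write-set item.
type preSyncEntry struct {
	Op         string // "insert", "update" or "delete"
	PrimaryKey string
	GlobalAddr uint64 // global memory address of the new/old version record
	RecordData []byte // new version record data, if any
}

// handlePreSync sketches the participant side: create a local transaction,
// walk the entries and install newrhead / mark replicas as described, then
// answer with success or failure.
func handlePreSync(req preSyncRequest,
	newLocalTxn func(gtxid uint64) (lotxid uint64),
	install func(e preSyncEntry, lotxid uint64) error) (ok bool) {

	lotxid := newLocalTxn(req.Gtxid)
	for _, e := range req.Entries {
		if err := install(e, lotxid); err != nil {
			return false // e.g. key conflict on insert => pre-synchronization failure
		}
	}
	return true
}
```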
  • After the coordinating node has collected the pre-synchronization response messages sent by all participating nodes and they all indicate that the pre-synchronization succeeded, the coordinating node obtains the current timestamp as the end timestamp endTs.
  • The coordinating node can then determine the isolation level of the transaction.
  • When the isolation level of the transaction is the serializable snapshot isolation (SSI) level, the coordinating node checks for read-write conflicts. Specifically, the coordinating node can traverse the read set rset and use endTs to re-check the visibility of the table records corresponding to the rheads in the rset, so as to determine whether a read-write conflict has occurred.
  • The coordinating node can also re-execute the query of table records according to the predicate conditions and check whether the table records visible based on endTs are the same as those visible based on beginTs. If a table record is no longer visible, another transaction has modified it, that is, a read-write conflict has occurred.
  • In that case, the coordinating node can terminate the transaction, perform a rollback operation, notify the other participating nodes to terminate the transaction, and return an error response to the client 200. It should be noted that when the isolation level of the transaction is another isolation level such as read committed (RC) or snapshot isolation (SI), the coordinating node does not need to perform this read-write conflict check.
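A minimal sketch of this SSI re-validation: every read recorded in the read set is re-checked with endTs, and any read whose visibility differs from what was seen at beginTs signals a read-write conflict. The callback-based structure and the names are assumptions for illustration.

```go
package sketch

// revalidateReadSet re-evaluates every recorded read using endTs. visibleAt
// is a callback that answers whether the version recorded for a read is
// visible at a given timestamp (it would reuse the visibility rule sketched
// earlier). Any read whose visibility changed between beginTs and endTs means
// another transaction modified the record, i.e. a read-write conflict, and
// the transaction must be terminated and rolled back.
func revalidateReadSet(readVersions []uint64, beginTs, endTs uint64,
	visibleAt func(version uint64, ts uint64) bool) (conflict bool) {

	for _, v := range readVersions {
		if visibleAt(v, beginTs) != visibleAt(v, endTs) {
			return true
		}
	}
	return false
}
```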
  • the coordinating node obtains a list of nodes configured with a log file system such as LogFs according to the node configuration information in the system node table.
  • The coordinating node can select a corresponding number of nodes from the list to write log files according to the preset number of replicas. For example, if the preset number of replicas is N, the coordinating node can write the log file locally and send synchronization requests (also called prepare requests) to N-1 other nodes configured with LogFs to notify them to write the log file (for example, the redolog file).
  • The log file records the gtxid, endTs, the newly added records and their global memory addresses, and the deleted records.
  • the coordinating node then waits for a synchronous response from the aforementioned nodes.
  • If the number of nodes that can hold the redolog does not meet the previously set number of replicas, for example the preset number of replicas is 3 but only 2 nodes are configured with high-speed persistent media, the coordinating node can write the redolog directly to the shared storage system.
  • If the coordinating node determines that the isolation level of the transaction is RC or SI, the log file, such as the redolog file, can be generated directly.
  • The participating nodes receive the synchronization request, write the log file under the same file name in their local LogFs, for example the redolog file, and then return a synchronization response to the coordinating node.
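The redolog replication step might look roughly like the sketch below, including the fallback to the shared storage system when fewer nodes carry high-speed persistent media than the preset number of replicas. The function names and the synchronous send-and-wait stand-in are assumptions for illustration.

```go
package sketch

import "errors"

// writeRedolog persists the redolog on the coordinating node and asks N-1
// other LogFs-equipped nodes to do the same; if there are not enough such
// nodes it falls back to writing the shared storage system directly.
// otherLogFsNodes lists nodes other than the coordinator that have LogFs,
// and sendPrepare stands in for sending a prepare request and waiting for
// the synchronization response.
func writeRedolog(otherLogFsNodes []string, replicas int, payload []byte,
	writeLocal func([]byte) error,
	sendPrepare func(node string, payload []byte) error,
	writeSharedStorage func([]byte) error) error {

	if len(otherLogFsNodes)+1 < replicas {
		// Not enough high-speed persistent media in the cluster.
		return writeSharedStorage(payload)
	}
	if err := writeLocal(payload); err != nil {
		return err
	}
	for _, node := range otherLogFsNodes[:replicas-1] {
		if err := sendPrepare(node, payload); err != nil {
			return errors.New("synchronization to " + node + " failed: " + err.Error())
		}
	}
	return nil
}
```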
  • Optionally, the participating nodes can also verify read-write conflicts and write-write conflicts before writing the log file.
  • The way a participating node verifies a read-write conflict can refer to the description of the coordinating node verifying the read-write conflict.
  • Participating nodes can verify the write-write conflict in the following manner: The participating nodes determine whether a write-write conflict occurs according to the index of at least one table record in the write set of the transaction.
  • For example, when a participating node inserts the index entry of a written record into the index and a uniqueness conflict occurs, a write-write conflict has occurred.
  • If the participating node determines that no write-write conflict occurs, the redolog file is generated; if the participating node determines that a write-write conflict occurs, it can return an error response to the coordinating node.
  • After the coordinating node receives the synchronization responses (also called commit responses, commit reply), it can enter the transaction completion (complete) process.
  • The coordinating node (for example, the master node 1 in the distributed database system 100) sends a transaction completion request to each participating node (for example, the master node 2, the master node 3, the real-time slave node 1, the real-time slave node 2, etc. in the distributed database system 100), and the transaction completion request carries the lotxid of each participating node.
  • the coordinating node traverses the write set in the local transaction and applies the modification of this transaction.
  • Applying the modification of this transaction may include setting the tmin of the new version table record to endTs in the first replica in the global memory and in the quasi-real-time slave node replica (if there is a quasi-real-time slave node replica), and setting the physical address in rhead to the redolog file identifier and offset position.
  • The coordinating node also sets the tmax of the old version table record to endTs in the first replica in the global memory and in the quasi-real-time slave node replica (if there is a quasi-real-time slave node replica), and updates the physical address in rhead to the redolog file identifier and offset position.
  • The coordinating node sets the lotxid to 0 in all rheads in the write set. It then takes out the list of local transactions waiting on this local transaction to determine visibility and wakes them all up; the woken transactions re-check the visibility of the table records. The coordinating node adds the local transaction to the recycling list, and after all active transactions have completed, the old version chain is recycled and its index entries are deleted.
  • the participating node After the participating node receives the transaction completion request, it adopts a processing method similar to that of the coordinating node. Specifically, the participating nodes traverse the write set and apply the modification of this transaction. Among them, if the new version table record in the write set has a copy in this node, the participating nodes set the tmin of the new version table record to endTs, and the physical address in rhead to the redolog file identifier and offset position. Participating nodes also set the tmax of the old version table records in the write set to endTs, and update the physical address in rhead to the redolog file identifier and offset position. Participating nodes set the lotxid of all rhead records in the write set to 0.
  • The participating nodes take out the list of local transactions waiting to determine visibility and wake them all up. Then the participating nodes add the local transaction to the recycling list, wait for all active transactions to end, and then recycle the old version chain and delete its index entries. The participating nodes may then respond to the coordinating node that the transaction is complete.
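The completion step on any node holding a replica can be sketched as follows: the transient gtxid markers become the commit timestamp endTs, the rhead records where the redolog persisted the record, and the lotxid is cleared so that waiting transactions re-check visibility. The type and function names are assumptions for illustration.

```go
package sketch

// Record models one version of a table record in global memory.
type Record struct {
	Tmin, Tmax int64
}

// RHead is its management header in local memory.
type RHead struct {
	PhysAddr uint64 // redolog file id + offset after commit
	Lotxid   uint64
}

// completeEntry applies the commit for one write-set entry on a node that
// holds a replica: the new version's tmin (and the old version's tmax)
// become endTs, rhead records where the redolog landed, and lotxid is
// cleared so that waiting transactions can re-check visibility.
func completeEntry(newRec, oldRec *Record, newHead, oldHead *RHead,
	endTs int64, redologPos uint64) {

	if newRec != nil { // insert/update produced a new version
		newRec.Tmin = endTs
		newHead.PhysAddr = redologPos
		newHead.Lotxid = 0
	}
	if oldRec != nil { // update/delete closed the old version
		oldRec.Tmax = endTs
		oldHead.PhysAddr = redologPos
		oldHead.Lotxid = 0
	}
}
```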
  • When the coordinating node verifies the read-write conflict based on the isolation level, if a table record is no longer visible, another transaction has modified the table record, resulting in a read-write conflict.
  • In these cases, the coordinating node can terminate the transaction and roll it back. The rollback process is described in detail below.
  • The coordinating node (for example, the master node 1 in the distributed database system 100) sends a transaction rollback request (final-abort as shown in Figure 8) to each participating node (for example, the master node 2, the master node 3, the real-time slave node 1, the real-time slave node 2, etc. in the distributed database system 100), and the transaction rollback request carries the lotxid of each participating node.
  • the coordinating node traverses the write set in the local transaction and rolls back the modification of this transaction.
  • Specifically, the coordinating node deletes the index entry of the new version table record in the write set from the index tree, sets the tmax of the old version table record in the write set back to -1 in the first replica in the global memory and in the quasi-real-time slave node replica (if there is a quasi-real-time slave node replica), restores the indirect of the old version table record in the write set to point to the old version, and sets the lotxid in all rheads in the write set to 0.
  • The coordinating node then takes out the list of local transactions waiting to determine visibility and wakes them all up. It should be noted that the woken transactions re-check the visibility of the records. The coordinating node adds the local transaction to the recycling list, and after all active transactions have ended, the global memory and local memory of the new version table records are recycled.
  • After a participating node receives the transaction rollback request, it can adopt processing similar to that of the coordinating node. Specifically, the participating node traverses the write set and rolls back the modifications of this transaction. If this node holds a replica of the global memory of an old version table record in the write set, its tmax is set back to -1; the indirect of the old version table record in the write set is then restored to point to the old version, and the lotxid in all rheads in the write set is set to 0. At this point, the participating node takes out the list of local transactions waiting to determine visibility and wakes them all up. The participating node adds the local transaction to the recycling list, waits for all active transactions to end, and then recycles the global memory and local memory of the new version table records.
  • The coordinating node can determine the visible version and directly return the record data of the table record to the client. Specifically, when the transaction isolation level is SSI, the table record has not been committed (tmin or tmax holding a gtxid indicates uncommitted), and the beginTs of the transaction is greater than the beginTs of the local transaction corresponding to the uncommitted record, the coordinating node can add the transaction to the waiting queue of that other transaction; otherwise, the record data of the table record is returned directly.
  • the distributed database system 100 also includes near real-time slave nodes.
  • the quasi-real-time slave node can receive the query request related to the analysis service sent by the client, and generate the index tree and data copy locally by replaying the redolog.
  • The quasi-real-time slave node can replay all redologs at regular intervals (for example, every 0.5 seconds), using the table record content and the global memory addresses recorded in the redolog to replay the record data into the local index tree and the new version table records.
  • The quasi-real-time slave node can take the minimum endTs of all active transactions in the cluster as the replay deadline; transaction logs in the redolog with timestamps smaller than this endTs are replayed.
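The replay loop on the quasi-real-time slave node might be sketched as below, bounded by the minimum endTs of the cluster's active transactions. The entry layout and names are assumptions, and the sketch assumes the redolog entries are ordered by commit timestamp.

```go
package sketch

// redologEntry is one committed change as recorded in the redolog.
type redologEntry struct {
	EndTs      uint64
	GlobalAddr uint64
	Data       []byte
}

// replayUpTo applies, in order, every redolog entry whose commit timestamp is
// below the minimum endTs of the cluster's active transactions; apply stands
// in for installing the record data into the local index tree and the new
// version table records. The log is assumed to be sorted by EndTs.
func replayUpTo(log []redologEntry, minActiveEndTs uint64,
	apply func(e redologEntry)) (replayed int) {

	for _, e := range log {
		if e.EndTs >= minActiveEndTs {
			break // newer entries may still race with active transactions
		}
		apply(e)
		replayed++
	}
	return replayed
}
```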
  • the transaction commit protocol defines a write conflict (write-write conflict or read-write conflict) control method.
  • the coordinating node adopts pessimistic concurrency control (also known as pessimistic locking), and the participating nodes adopt optimistic concurrency control (also known as optimistic locking).
  • This concurrency control can, on the one hand, avoid write-write or read-write conflicts and guarantee consistency between the coordinating node and the participating nodes; on the other hand, it can reduce the interaction between the coordinating node and the participating nodes, shorten the synchronization time, and achieve real-time consistency.
  • The principle of pessimistic concurrency control is to assume that concurrent transactions of multiple users affect each other during processing, so data is modified by blocking transactions. Specifically, if pessimistic concurrency control (a pessimistic lock) is applied to an operation performed by a transaction, such as reading a row of data, other transactions can perform conflicting operations only after that transaction releases its lock.
  • the principle of optimistic concurrency control is that, assuming that multi-user concurrent transactions will not affect each other during processing, each transaction can process the part of the data affected by each other without generating locks. Before committing a data update, each transaction checks whether other transactions have modified the data after the transaction has read the data. If other transactions have updates, the committing transaction will be rolled back.
  • an embodiment of the present application provides a transaction processing method.
  • the partial memory of multiple nodes of the distributed database system 100 is used to form the global memory.
  • the global memory is visible to coordinating nodes and participating nodes in the distributed database system 100 . That is to say, the global memory in the coordinating node or the participating nodes is shared.
  • The participating nodes can quickly perceive the change and, based on the transaction commit protocol, use RDMA or memory fabric to access part of the global memory across nodes to synchronize data, without having to synchronize data through message interaction. This greatly shortens the synchronization time, realizes real-time consistency between the coordinating node and the participating nodes, and satisfies the real-time consistency requirements of real-time services.
  • the capacity of the global memory can be expanded with the number of nodes, and is no longer limited by the capacity of the memory of a single node, which improves the concurrency control capability.
  • the memory engine cluster oriented to the memory medium has no pages and no rollback log, and has better performance.
  • The embodiments of the present application provide different types of nodes, such as master nodes, real-time slave nodes, and quasi-real-time slave nodes, which can meet real-time read-write business requirements (such as transaction scenarios), real-time read-only business requirements (such as real-time analysis scenarios), or non-real-time read-only business requirements (such as non-real-time analysis scenarios).
  • the number of copies of the table records in the global memory can also be set according to the table granularity, on the one hand, the memory occupation can be controlled, and on the other hand, the high availability requirements of different data tables can be met.
  • the distributed database system 100 includes:
  • the coordination node is used to receive multiple query statements sent by the client;
  • The coordination node is further configured to create a transaction according to the first query statement in the plurality of query statements, execute the transaction in the global memory according to the second query statement in the plurality of query statements, and commit the transaction according to the third query statement in the plurality of query statements.
  • the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
  • the global memory includes part of the memory of the coordinating node and/or the participating nodes.
  • the node type of the coordination node is a master node, and the coordination node is specifically used for:
  • a read-write transaction is created according to the first query statement in the plurality of query statements.
  • the node type of the coordinating node is a first slave node
  • the first slave node is used to maintain real-time consistency with the node whose node type is the master node
  • the coordinating node is specifically used for:
  • a read-only transaction is created according to the first query statement in the plurality of query statements.
  • the coordinating node is further configured to receive and save the number of copies of the table record sent by the cluster management node in the global memory;
  • the participating node is further configured to receive and save the number of copies of the table record sent by the cluster management node in the global memory.
  • the table record is stored in the global memory of the distributed database system, and the index tree and management header of the table record are stored in the local memory of the distributed database system.
  • the coordinating node is specifically used for:
  • The transaction is committed through the transaction commit protocol running on the coordinating node and the participating nodes, so as to realize real-time consistency between the coordinating node and the participating nodes.
  • the coordination node is specifically used to trigger pessimistic concurrency control when a write conflict occurs in the transaction
  • the participating node is specifically used to trigger optimistic concurrency control when a write conflict occurs in the transaction.
  • The distributed database system 100 may correspondingly execute the methods described in the embodiments of the present application, and the above-mentioned and other operations and/or functions of the modules/units of the distributed database system 100 are respectively intended to implement the corresponding flows of the methods in the embodiment shown in FIG. 4, which are not repeated here for brevity.
  • the embodiment of the present application further provides a transaction processing system 10 .
  • the transaction processing system 10 includes: a distributed database system 100 and a client 200 .
  • the distributed database system 100 is configured to execute the corresponding transaction processing method according to the query statement sent by the client 200 , for example, the transaction processing method shown in FIG. 4 .
  • the client 200 is used to send multiple query statements to the distributed database system 100
  • the coordination node of the distributed database system 100 is used to receive multiple query statements, and create a transaction according to the first query statement in the multiple query statements , executing the transaction in the global memory according to the second query statement in the plurality of query statements, and submitting the transaction according to the third query statement in the plurality of query statements.
  • The transaction processing system 10 further includes a database 300, and the distributed database system 100 executes the transaction processing method to manage the data in the database 300, for example inserting new record data, updating record data, or deleting record data.
  • transaction processing system 10 also includes cluster management node 400 .
  • the cluster management node 400 is used to configure the nodes of the distributed database system deployed in the cluster, for example, to configure the node IP, node type, and the like.
  • the transaction processing system 10 may further include a timing server 500, and the timing server 500 is configured to provide a time stamp for the distributed database system 100, so as to determine the visibility of data to a transaction according to the time stamp.
  • the embodiment of the present application further provides a cluster 90 .
  • the cluster 90 includes a plurality of computers.
  • the computer may be a server, such as a local server in a private data center, or a cloud server provided by a cloud service provider.
  • the computer can also be a terminal. Terminals include but are not limited to desktop computers, notebook computers, smart phones, etc.
  • the cluster 90 is specifically used to implement the functions of the distributed database system 100 .
  • FIG. 9 provides a schematic structural diagram of a cluster 90 .
  • the cluster 90 includes multiple computers 900 .
  • Each computer 900 includes a bus 901, a processor 902, a communication interface 903, and a memory 904.
  • the processor 902 , the memory 904 and the communication interface 903 communicate through the bus 901 .
  • the bus 901 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • The processor 902 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and other such devices.
  • the communication interface 903 is used for external communication.
  • the communication interface 903 can be used to receive multiple query statements sent by the client 200 , obtain a start timestamp and an end timestamp from the timing server 500 , or return a submission response to the client 200 , and so on.
  • The memory 904 may include volatile memory, such as random access memory (RAM).
  • The memory 904 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Executable code is stored in the memory 904, and the processor 902 executes the executable code to perform the aforementioned transaction processing method.
  • When the components of the distributed database system 100 described in the embodiment of FIG. 1 are implemented by software, the software or program code required to perform the functions of the components in FIG. 1 is stored in the memory 904.
  • The processor 902 executes the program code corresponding to each component stored in the memory 904 to perform the aforementioned transaction processing method.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium includes instructions, and the instructions instruct the computer 900 to execute the above-mentioned transaction processing method applied to the distributed database system 100.
  • each computer 900 can also execute a part of the above-mentioned transaction processing method applied to the distributed database system 100.
  • some computers may perform the steps performed by the coordinating node in the transaction processing method described above, and other computers may perform the steps performed by the participating nodes in the transaction processing method described above.
  • An embodiment of the present application further provides a computer program product, when the computer program product is executed by a computer, the computer executes any one of the foregoing transaction processing methods.
  • the computer program product can be a software installation package, which can be downloaded and executed on a computer if any one of the aforementioned transaction processing methods needs to be used.

Abstract

The present application provides a transaction processing method executed by a distributed database system. The system includes a coordinating node and participating nodes, and the coordinating node and the participating nodes share global memory. The method includes: the coordinating node receives multiple query statements sent by a client, creates a transaction according to a first query statement among the multiple query statements, executes the transaction in the global memory according to a second query statement among the multiple query statements, and commits the transaction according to a third query statement among the multiple query statements. Since the global memory can be accessed across nodes without processing by the processor and the operating system, the access path is shortened; and since no operating system scheduling is required, the synchronization time is further shortened, real-time consistency between the coordinating node and the participating nodes is achieved, and the business requirements are met.

Description

事务处理方法、分布式数据库系统、集群及介质 技术领域
本申请涉及数据库技术领域,尤其涉及一种事务处理方法、分布式数据库系统、事务处理系统、集群、计算机可读存储介质以及计算机程序产品。
背景技术
随着数据库技术的不断发展,对数据(例如是员工考勤数据、员工薪资数据、生产数据等)通过数据库进行管理逐渐成为主流趋势。其中,数据库是以一定方式储存在一起、能与多个用户共享、具有尽可能小的冗余度、与应用程序彼此独立的数据集合。用户可以通过客户端应用程序(以下简称为客户端)访问数据库,以实现数据读取或数据写入。
数据读取或数据写入通常是由数据库系统实现的。数据库系统包括数据库管理系统(data base management system,DBMS)。数据库系统通过上述DBMS实现创建、查询、更新和删除数据。具体地,用户通过客户端触发对数据库中数据的操作,数据库系统响应于该操作执行相应的事务。以数据写入为例,数据库系统执行数据写入操作,将数据写入数据库系统的节点,然后将数据写入至数据库,如共享存储的数据库,从而实现数据持久化。
考虑到分布式数据库系统的高可靠性和高可用性,越来越多的用户(例如企业用户)采用分布式数据库系统对数据进行管理。其中,分布式数据库系统可以部署在实时应用集群(real application cluster,RAC)。该集群具体是面向磁盘设计的基于数据全共享(shared-everything)架构的分布式数据库存储引擎集群。RAC包括两种类型的节点,具体为汇聚节点(hub node)和叶子节点(leaf node)。汇聚节点是集群中的主节点,主节点之间通过点对点网络互连,处理分布式事务,叶子节点间无网络连接,用于处理并发查询和在线上报业务。
然而,叶子节点通常只能通过汇聚节点作为代理获取数据。由于叶子节点和汇聚节点通过双边方式进行交互,导致叶子节点需要等待操作系统的调度,并通过较长的访问路径获取汇聚节点上的数据,如此叶子节点通常只能读取到历史数据,难以满足实时性业务对于数据一致性的要求。
发明内容
本申请提供了一种事务处理方法。该方法利用共享的全局内存使得分布式数据库系统中的节点如协调节点和参与节点可以跨节点单边访问全局内存,而不必通过双边交互方式进行数据同步,无需经过处理器和操作系统处理,访问路径大幅缩短,而且不需要通过操作系统调度,可以大幅缩短同步时间,实现了协调节点和参与节点之间的实时一致性,满足了实时性业务对实时一致性的需求。本申请还提供了上述方法对应的分布式数据库系统、事务处理系统、集群、计算机可读存储介质以及计算机程序产品。
第一方面,本申请提供了一种事务处理方法。该方法可以由分布式数据库系统执行。该分布式数据库系统可以部署在集群中。该集群例如可以是内存引擎集群。分布式数据库系统在集群中运行时可以实现节点间数据的实时一致性,从而满足实时性业务的需求。具体地,分布式数据库系统包括协调节点和参与节点。协调节点在事务执行过程中承担协调责任,参与节点在事务执行过程中承担执行责任。其中,事务是指访问并可能更新数据库中数据的一个程序执行单元,通常包括有限的数据库操作序列。
分布式数据库系统的多个节点的部分内存用于形成全局内存。该全局内存对于分布式数据库系统中的所有协调节点、参与节点可见。协调节点或参与节点中剩余的部分内存为本地内存,该本地内存对于协调节点或参与节点自身可见。对于任意一个协调节点或参与节点,可以通过远端直接内存访问或内存总线等方式访问全局内存中位于其他节点的部分内存。
协调节点用于接收客户端发送的多条查询语句,根据多条查询语句中的第一查询语句创建事务,然后根据多条查询语句中的第二查询语句在全局内存中执行上述事务,接着协调节点根据多条查询语句中的第三查询语句提交上述事务,以实现所述协调节点和所述参与节点之间的一致性。
在该分布式数据库系统中,协调节点或参与节点中全局内存是共享的,当协调节点执行事务导致全局内存中的部分内存所存储的数据发生变化时,参与节点可以快速感知该变化,参与节点可以跨节点单边访问全局内存中的部分内存,以进行数据同步,而不必通过双边交互方式进行数据同步,由于无需经过处理器和操作系统处理,访问路径大幅缩短,而且不需要通过操作系统调度,可以大幅缩短同步时间,实现了协调节点和参与节点之间的实时一致性,满足了实时性业务对实时一致性的需求。此外,全局内存的容量可以随着节点数扩展,不再受限于单节点内存的容量,提高了并发控制能力。
在一些可能的实现方式中,所述分布式数据库系统部署在集群,所述全局内存来自于所述集群。该集群是指至少一组计算机形成的计算网络,用于为分布式数据库系统提供计算能力,以使分布式数据库系统基于上述计算能力对外提供服务。
该方法通过利用来自于集群的全局内存,使得分布式数据库系统的节点之间可以跨节点单边访问全局内存,无需经过处理器和操作系统处理,也无需等待操作系统的调度,实现了节点间(例如是协调节点、参与节点之间)的实时一致性。
在一些可能的实现方式中,所述全局内存包括所述协调节点和/或所述参与节点的部分内存。具体地,分布式数据库系统中的多个节点(例如是每一个节点)可以提供部分内存,用于形成全局内存,剩余的内存作为对应节点的本地内存。其中,全局内存对于分布式数据库系统中的节点是共享的。这些节点可以通过远端直接内存访问或者是内存总线直接实现跨节点单边访问,无需经过操作系统和处理器,也无需等待操作系统的调度,因而能够实现节点间的实时一致性。
在一些可能的实现方式中,协调节点的节点类型为主节点。相应地,该协调节点可以根据所述多条查询语句中的第一查询语句创建读写事务。如此可以满足实时读写业务的需求。
在一些可能的实现方式中,所述协调节点的节点类型为第一从节点。所述第一从 节点用于与节点类型为主节点的节点保持实时一致。基于此,该第一从节点也可以称作实时从节点。相应地,该协调节点可以根据所述多条查询语句中的第一查询语句创建只读事务。如此可以满足实时只读业务的需求。
在一些可能的实现方式中,分布式数据库系统还可以包括节点类型为第二从节点的节点。第二从节点用于与节点类型为主节点的节点保持准实时一致。因此,第二从节点也可以称作准实时从节点。该准实时从节点用于处理对实时性要求不高的业务,例如非实时分析业务。例如,准实时从节点用于接收与非实时分析业务关联的查询语句,然后返回相应的查询结果。如此可以满足非实时分析业务的需求。
在一些可能的实现方式中,在所述协调节点接收客户端发送的多条查询语句之前,所述分布式数据库系统(例如是分布式数据库系统中的协调节点和参与节点)可以接收集群管理节点发送的表记录在所述全局内存中的副本数量。然后分布式数据库系统保存所述表记录在所述全局内存中的副本数量。
如此,当分布式数据库系统在写数据时,可以根据上述表记录在全局内存中的副本数量写入相应数量的副本,保障数据安全性。其中,分布式数据库系统可以基于表粒度设置副本数量,满足了不同业务的个性化需求。
在一些可能的实现方式中,所述表记录存储在所述分布式数据库系统的全局内存中,所述表记录的索引树和管理头存储在所述分布式数据库系统的本地内存中。该方法将有限的全局内存用于存储表记录,采用本地内存存储表记录的索引树和管理头,以对表记录进行版本管理,一方面实现跨节点单边访问全局内存,保证节点间的实时一致性,另一方面避免了索引树等占用全局内存,提高了资源利用率。
在一些可能的实现方式中,协调节点在提交事务时可以基于事务提交协议实现。具体地,协调节点根据所述多条查询语句中的第三查询语句,通过运行于所述协调节点和所述参与节点的事务提交协议,提交所述事务,以实现所述协调节点和所述参与节点的实时一致性。
通过事务提交协议对协议节点、参与节点进行约束,使得需要进行数据写入(包括数据插入或更新)的节点(如协调节点、参与节点)执行的事务操作要么同时完成,要么同时回滚,如此避免了一些副本节点写入完成,另一些副本节点写入失败,导致节点实时不一致的情况发生,进一步保障了节点间的实时一致性。
在一些可能的实现方式中,所述事务发生写冲突时,例如事务与其他事务发生读写冲突或者是写写冲突时,所述协调节点触发悲观并发控制,所述参与节点触发乐观并发控制。其中,悲观并发控制的原理为,假设多用户并发的事务在处理时彼此互相影响,因此,可以通过阻止一个事务来修改数据。具体地,如果一个事务执行的操作如读某行数据应用了悲观并发控制(悲观锁),那么只有当这个事务释放权限后,其他事务才能够执行冲突的操作。乐观并发控制的原理为,假设多用户并发的事务在处理时不会彼此互相影响,各事务能够在不产生锁的情况下处理各自影响的那部分数据。在提交数据更新之前,每个事务会先检查在该事务读取数据后,有没有其他事务又修改了该数据。如果其他事务有更新的话,正在提交的事务会进行回滚。
通过上述并发控制,一方面可以避免写写冲突或者读写冲突,保证协调节点和参与节点之间的一致性,另一方面可以减少协调节点与参与节点的交互,缩短同步时间, 实现实时一致。
第二方面,本申请提供一种分布式数据库系统。所述分布式数据库系统包括协调节点和参与节点,所述协调节点和所述参与节点共享全局内存。
所述协调节点,用于接收客户端发送的多条查询语句;
所述协调节点,还用于根据所述多条查询语句中的第一查询语句创建事务,根据所述多条查询语句中的第二查询语句在所述全局内存中执行所述事务,以及根据所述多条查询语句中的第三查询语句提交所述事务。
在一些可能的实现方式中,所述分布式数据库系统部署在集群,所述全局内存来自于所述集群。
在一些可能的实现方式中,所述全局内存包括所述协调节点和/或所述参与节点的部分内存。
在一些可能的实现方式中,所述协调节点的节点类型为主节点,所述协调节点具体用于:
根据所述多条查询语句中的第一查询语句创建读写事务。
在一些可能的实现方式中,所述协调节点的节点类型为第一从节点,所述第一从节点用于与节点类型为主节点的节点保持实时一致,所述协调节点具体用于:
根据所述多条查询语句中的第一查询语句创建只读事务。
在一些可能的实现方式中,所述协调节点,还用于接收并保存集群管理节点发送的表记录在所述全局内存中的副本数量;
所述参与节点,还用于接收并保存所述集群管理节点发送的表记录在所述全局内存中的副本数量。
在一些可能的实现方式中,所述表记录存储在所述分布式数据库系统的全局内存中,所述表记录的索引树和管理头存储在所述分布式数据库系统的本地内存中。
在一些可能的实现方式中,所述协调节点具体用于:
根据所述多条查询语句中的第三查询语句,通过运行于所述协调节点和所述参与节点的事务提交协议,提交所述事务,以实现所述协调节点和所述参与节点的实时一致性。
在一些可能的实现方式中,所述协调节点具体用于所述事务发生写冲突时,触发悲观并发控制;所述参与节点具体用于所述事务发生写冲突时,触发乐观并发控制。
第三方面,本申请提供一种事务处理系统。所述事务处理系统包括客户端和如本申请第二方面任意一种实现方式所述的分布式数据库系统,所述分布式数据库系统用于根据所述客户端发送的查询语句,执行对应的事务处理方法。
第四方面,本申请提供一种集群。该集群包括多台计算机。所述计算机包括处理器和存储器。所述处理器、所述存储器进行相互的通信。所述处理器用于执行所述存储器中存储的指令,以使得集群执行如第一方面或第一方面的任一种实现方式中的事务处理方法。
第五方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,所述指令指示计算机执行上述第一方面或第一方面的任一种实现方式所述的事务处理方法。
第六方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面的任一种实现方式所述的事务处理方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
为了更清楚地说明本申请实施例的技术方法,下面将对实施例中所需使用的附图作以简单地介绍。
图1为本申请实施例提供的一种事务处理系统的系统架构图;
图2为本申请实施例提供的一种节点配置方法的流程图;
图3为本申请实施例提供的一种配置副本数量方法的流程图;
图4为本申请实施例提供的一种事务处理方法的交互流程图;
图5为本申请实施例提供的一种事务开始及执行阶段的流程图;
图6为本申请实施例提供的一种事务提交阶段的流程图;
图7为本申请实施例提供的一种事务完成阶段的流程图;
图8为本申请实施例提供的一种事务回滚阶段的流程图;
图9为本申请实施例提供的一种集群的结构示意图。
具体实施方式
本申请实施例中的术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。
为了便于理解本申请实施例,首先,对本申请涉及的部分术语进行解释说明。
数据库应用是指基于底层的数据库向用户提供数据管理服务的应用。其中,数据管理服务包括数据创建、数据查询、数据更新和数据删除等中的至少一种。典型的数据库应用包括考勤管理系统、薪资管理系统、生产报表系统、证券交易系统等信息管理系统。数据库应用通常包括数据库系统和面向用户的客户端。用户可以通过客户端触发数据创建、查询、更新或删除操作,客户端可以响应于上述操作,通过数据库系统对数据库中的数据进行相应的处理。
数据库系统可以根据部署方式分为集中式数据库系统和分布式数据库系统。其中,分布式数据库系统是部署在包括多台计算机的集群中的数据库系统。在本申请中,集群中的计算机也可以称作节点(node)。节点之间可以通过网络进行通信,从而协同完成数据处理。
数据库系统可以通过存储引擎决定数据在内存、磁盘等存储介质中的存储方式以及数据读取方式。存储引擎具体是数据库系统的核心组件。不同类型的数据库系统可以采用不同的存储引擎,从而提供不同的存储机制、索引方式、锁定机制。当数据库系统分布式地部署在集群的不同节点时,还可以根据集群中节点存储数据的存储介质类型不同,将集群分为磁盘引擎集群和内存引擎集群。
目前,业界广泛应用的RAC是面向磁盘设计的、基于数据全共享架构的分布式数 据库存储引擎集群(例如是磁盘引擎集群)。RAC包括两种类型的节点,具体为汇聚节点(hub node)和叶子节点(leaf node)。其中,汇聚节点是集群中的主节点,主节点之间通过点对点网络互连,用于处理分布式事务,叶子节点之间无网络连接,用于处理并发查询和在线上报业务。然而,叶子节点通常只能通过汇聚节点作为代理获取数据。其中,叶子节点和汇聚节点通过双边交互获取汇聚节点中的数据。双边交互需要双方的中央处理器(central processing unit,CPU)参与处理,导致访问路径过长,例如访问路径可以为叶子节点的CPU到叶子节点的网卡,然后到汇聚节点的网卡,接着到汇聚节点的CPU,最后到汇聚节点的缓存。此外,上述交互需要等待操作系统的调度,由此导致较大的时延,叶子节点上通常只能读取到历史数据,难以满足实时性业务对于数据一致性的要求。
有鉴于此,本申请实施例提供了一种分布式数据库系统。该分布式数据库系统可以部署在集群中。该集群例如可以是内存引擎集群。分布式数据库系统在集群中运行时可以实现节点间数据的实时一致性,从而满足实时性业务的需求。具体地,分布式数据库系统包括协调节点和参与节点。协调节点在事务(transaction)执行过程中承担协调责任,参与节点在事务执行过程中承担执行责任。其中,事务是指访问并可能更新数据库中数据的一个程序执行单元,通常包括有限的数据库操作序列。
分布式数据库系统的多个节点的部分内存用于形成全局内存(global memory,GM)。该全局内存对于分布式数据库系统中的所有协调节点、参与节点可见。协调节点或参与节点中剩余的部分内存为本地内存,该本地内存对于协调节点或参与节点自身可见。对于任意一个协调节点或参与节点,可以通过远端直接内存访问(remote direct memory access,RDMA)或内存总线(memory fabric)等方式访问全局内存中位于其他节点的部分内存。
协调节点用于接收客户端发送的多条查询语句,根据多条查询语句中的第一查询语句创建事务,然后根据多条查询语句中的第二查询语句在全局内存中执行上述事务,接着协调节点根据多条查询语句中的第三查询语句提交上述事务,以实现所述协调节点和所述参与节点之间的一致性。
在该分布式数据库系统中,协调节点或参与节点中全局内存是共享的,当协调节点执行事务导致全局内存中的部分内存所存储的数据发生变化时,参与节点可以快速感知该变化,参与节点可以通过RDMA或memory fabric跨节点单边访问全局内存中的部分内存,以进行数据同步,而不必通过双边交互方式进行数据同步,由于无需经过处理器和操作系统处理,访问路径大幅缩短,而且不需要通过操作系统调度,可以大幅缩短同步时间,实现了协调节点和参与节点之间的实时一致性,满足了实时性业务对实时一致性的需求。此外,全局内存的容量可以随着节点数扩展,不再受限于单节点内存的容量,提高了并发控制能力。
基于上述分布式数据库系统,本申请实施例还提供一种事务处理系统。下面结合附图对本申请实施例提供的事务处理系统的进行详细说明。
参见图1所示的事务处理系统的系统架构图,事务处理系统10包括分布式数据库系统100、客户端200和数据库300。客户端200与分布式数据库系统100连接,分布 式数据库系统100与数据库300连接。
分布式数据库系统100包括协调节点和参与节点。协调节点和参与节点可以通过高速网络的RDMA或者memory fabric连接。协调节点和参与节点运行有事务提交协议。该事务提交协议中定义协调节点为事务的接入节点,参与节点为分布式数据库系统100中除协调节点之外的主节点和第一从节点。
其中,协调节点的节点类型可以是主节点或者是第一从节点,该第一从节点用于与节点类型为主节点的节点保持实时一致。因此,第一从节点也称作实时从节点。例如针对读写事务,协调节点的节点类型可以是主节点,针对只读事务,协调节点的节点类型可以是实时从节点。需要说明的是,主节点具有处理只读事务的能力,因此针对只读事务,协调节点的节点类型也可以是主节点。
在一些可能的实现方式中,分布式数据库系统100还包括非事务节点,例如是第二从节点。该第二从节点用于与节点类型为主节点的节点保持准实时一致。因此,第二从节点也称作准实时从节点。考虑到一些业务(如非实时分析业务)对于实时性要求不高,分布式数据库系统100可以通过第二从节点对这些业务进行处理。例如,分布式数据库系统100中的第二从节点可以接收与非实时分析业务关联的查询语句,第二从节点根据该查询语句返回查询结果。
其中,分布式数据库系统100中的多个节点(例如是每个节点)的部分内存可以用于形成全局内存。该全局内存可以通过软件模块例如是全局内存管理模块实现内存编址、内存申请和释放管理。其中,全局内存管理模块可以是分布式数据库系统100的软件模块。
全局内存管理模块可以对分布式数据库系统100中的节点的内存进行管理。具体地,全局内存管理模块可以支持单副本或多副本的内存申请和释放。其中,内存申请和释放为字节级内存申请和释放。使用单副本全局内存的数据通常缓存在一个节点,当该节点故障时,该数据在缓存中不可访问,在接管阶段,该数据从存储系统中载入后才能继续访问。使用多副本全局内存的数据通常缓存在多个节点,当某个节点故障时,仍可以通过其他节点寻址并进行访问。其中,全局内存管理模块可以针对由多个节点上的部分内存形成的全局内存提供小块内存的申请和释放。具体地,全局内存管理模块可以提供全局内存接口。该全局内存接口可以用于申请指定长度的小块内存。
全局内存接口返回的单副本内存或多副本内存由全局内存管理模块统一编址,上述单副本内存或多副本内存的地址称作全局内存地址GmAddr。对于指定全局内存地址,分布式数据库系统100中的任意节点均可以访问该地址的数据。进一步地,全局内存管理模块还可以根据全局内存地址确定对应节点的节点标识以及偏移位置。基于节点标识和偏移位置可以实现本地或远端的读写访问。
客户端200可以是通用客户端如浏览器,或者是专用客户端如各种信息管理系统的客户端。用户可以根据查询语言,例如是结构化查询语言(structured query language,SQL),通过客户端200编写查询语句,分布式数据库系统100(例如是分布式数据库系统100中的协调节点)接收到多条查询语句,可以根据多条查询语句中的第一查询语句创建事务,然后根据多条查询语句中的第二查询语句在全局内存中执行事务,接着根据多条查询语句中的第三查询语句提交事务,从而实现创建、查询、 更新和/或删除数据。其中,对于协调节点、参与节点均可见的全局内存可以保障协调节点和参与节点的实时一致性。并且,协调节点、参与节点上运行的事务提交协议使得事务的操作要么同时被执行,要么同时回滚,进一步保障了协调节点和参与节点的实时一致性。
数据库300用于对数据进行持久化存储。例如,数据库300可以对日志数据进行持久化存储。当分布式数据库系统100中的节点从故障中恢复,或者分布式数据库系统100整体重新上电恢复时,分布式数据库系统100可以从数据库300载入数据至内存。
需要说明的是,数据库300可以是共享存储系统(shared storage system)中的数据库。该共享存储系统包括裸设备(raw device)、自动存储管理(automatic storage management,ASM)设备或者网络附属存储(network attached storage,NAS)设备中的任意一种或多种。共享存储系统具有共享访问能力。分布式数据库系统100中的节点可以接入共享存储系统,并访问该共享存储系统。共享存储系统可以使用跨节点副本或者跨节点纠删码(erasure code)保障数据的可靠性,以及保障数据写入的原子性(atomicity)。原子性是指一个事务中的所有操作,或者全部完成,或者全部不完成,不会结束在中间某个环节,事务在执行过程中发生错误,会被回滚到事务开始前的状态。
在本实施例中,分布式数据库系统100可以部署在集群中,例如分布式数据库系统100可以部署在内存引擎集群中。相应地,事务处理系统10还可以包括集群管理节点400。该集群管理节点400与分布式数据库系统100连接。
集群管理节点400用于对集群(例如是集群中的分布式数据库系统100)进行运维。具体地,集群管理节点400可以用于发现分布式数据库系统100中的节点,对分布式数据库系统100中节点的状态进行管理,或者是对元数据进行管理。该元数据可以包括节点属性信息和数据表模式(schema)中的至少一种。节点属性信息包括节点标识(identity,ID)、节点网络地址(Internet protocol address,IP)、节点类型和节点状态中的任意一种或多种。数据表schema包括表名、表ID、表类型、字段个数、字段类型描述等中的任意一种或多种。
在一些可能的实现方式中,事务处理系统10还包括授时服务器500。授时服务器500与分布式数据库系统100连接。授时服务器500用于提供单调递增的时钟服务。该时钟具体可以是逻辑时钟、真实时间(true time)时钟或者混合逻辑时钟。其中,混合逻辑时钟是指混合有物理时钟(如真实时间)的逻辑时钟。授时服务器500可以向分布式数据库系统100提供当前时刻的时间戳。该时间戳具体可以是表征时间的、长度为8字节或16字节的数值。分布式数据库系统100可以获取时间戳以确定事务对数据的可见性。
在进行事务处理之前,可以先安装分布式数据库系统100。安装分布式数据库系统100过程中,可以提示用户配置分布式数据库系统100中的节点。为了便于理解,下面结合附图,对本申请实施例提供的节点配置方法进行介绍。
参见图2所示的节点配置方法的交互流程图,该方法包括:
S202:分布式数据库系统100中的节点配置节点IP,安装日志文件系统,以及设置节点类型。
分布式数据库系统100可以包括多个节点。针对任意一个节点,可以基于IP地址池自动配置节点IP。在一些实施例中,节点也可以接收管理员人工配置的节点IP,从而实现节点IP配置。
类似地,节点可以自动配置节点类型,或者接收管理员人工配置的节点类型,从而实现节点类型配置。其中,节点类型可以包括主节点和实时从节点,进一步地,节点类型还可以包括准实时从节点。在一些实施例中,主节点可以用于处理读写事务,实时从节点可以用于处理只读事务,准实时从节点可以用于处理非事务请求,例如是与分析业务关联的查询请求。如此可以满足不同业务的需求。
考虑到分布式数据库系统100中的内存为易失性存储介质,为了保证事务相关数据的可持久性,可以在事务的日志文件如再执行(redolog)日志持久化成功后,再向客户端200返回事务提交成功通知消息。日志文件直接写入共享存储系统(例如是共享存储系统中的数据库)可以增加事务提交时延,导致事务在分布式数据库系统100侧处理较快,但是日志文件持久化较慢,从而影响整体时延和性能。
为此,分布式数据库系统100中的一些节点(例如是主节点)还可以配置高速持久化介质,以用于日志文件的持久化。该高速持久化介质包括但不限于保电内存、非易失性随机访问存储器(non-volatile random access memory,NVRAM)或者其他非易失性的3D-point介质。节点可以安装日志文件系统(log file system,LogFs),以便通过日志文件系统对本地的高速持久化介质进行管理。日志文件系统还可以提供文件语义的访问接口,以用于持久化日志文件,如redolog文件。
进一步地,实时从节点也可以配置高速持久化介质,以便于主节点故障时,可以将实时从节点的节点类型修改为主节点。准实时从节点主要用于处理非事务请求,无需在本地进行日志文件持久化,因而无需配置高速持久化介质。
由于redolog文件的数据量随着事务提交数量的增加而膨胀,因此redolog文件可以由后台任务(例如是分布式数据库系统100中的节点触发的任务)从日志文件系统中搬移写入到共享存储系统中。该搬移过程由后台任务完成,不影响事务执行时延。
需要说明的是,高速持久化介质的容量需求较小。例如,单节点的事务处理能力为100万事务/秒,日志文件如redolog文件的数据量为1吉字节(gigabyte,GB)/秒,当后台任务每0.5秒搬移一次,那么高速持久化介质的容量配置为1GB即可满足1个节点的redolog文件写入。而redolog文件被搬移到共享存储系统之前,为保证其可靠性,可以在多个节点写入。假设在3个节点写入,那么高速持久化介质的容量可以配置为1GB*3=3GB。
基于分布式数据库系统100的上述结构可以从逻辑上将分布式数据库系统100分为索引层、表记录层、近端持久层。索引层用于存储数据表中表记录的索引树和表记录的管理头rhead。其中,rhead用于记录该版本表记录的全局内存地址和日志文件地址如redolog文件地址。索引层通常使用本地内存实现,通过本地访问方式进行访问。该索引树和rhead在分布式数据库系统100中所有节点上均存在,其中,主节点和实时从节点上的索引树和rhead实时一致,具体是事务提交完成时形成一致数据。表记 录层用于存储数据表的表记录record。表记录层通常使用全局内存实现,可以通过远程访问方式如RDMA或memory fabric进行访问。近端持久层用于对日志文件如redolog文件进行持久化存储。近端持久层通常使用高速持久化介质实现。
S204:分布式数据库系统100中的节点向集群管理节点400上报节点配置信息。
其中,节点配置信息可以包括节点IP和节点类型。进一步地,节点配置信息还可以包括内存容量、日志文件系统容量中的任意一种或多种。
S206:集群管理节点400检查各节点类型对应的节点数量,当节点数量满足设定条件时,保存节点配置信息到系统节点表sysNodeTbl。
分布式数据库系统100中至少包括主节点,在一些可能的实现方式中,分布式数据库系统100还包括实时从节点和准实时从节点中的至少一种。基于此,集群管理节点400可以检查主节点数量Nm是否满足如下设定条件:Nm>0。可选地,集群管理节点400可以检查实时从节点数量Nr和准实时从节点数量Nq是否满足如下设定条件:Nr≥0,Nq≥0。
进一步地,考虑到分布式数据库系统100扩展的成本,主节点数量Nm和实时从节点数量Nr之和通常设定有上限值,该上限值可以为第一预设值Q1。类似地,准实时从节点数量Nq也设定有上限值,该上限值可以为第二预设值Q2。基于此,集群管理节点400可以检查主节点数量Nm和实时从节点数量Nr之和是否满足如下设定条件:Nm+Nr≤Q1,以及检查准实时从节点数量Nq是否满足如下设定条件:Nq≤Q2。其中,Q1和Q2可以根据经验值设置,例如Q1可以设置为8,Q2可以设置为64。
当节点数量(例如单一类型节点数量和/或不同类型节点数量之和)满足设定条件时,表明节点配置合法,集群管理节点400可以保存节点配置信息到系统节点表sysNodeTbl。
S208:集群管理节点400向分布式数据库系统100中的节点返回配置成功提示。
S210:分布式数据库系统100中的节点向客户端200返回配置成功提示。
具体地,集群管理节点400向分布式数据库系统100中的各个节点返回配置成功提示,然后各个节点向客户端200返回配置成功提示。
需要说明的是,当节点数量不满足设定条件时,如节点数量超过上限值,或者节点数量低于下限值,表明节点配置不成功。相应地,集群管理节点400还可以返回配置失败提示,以便重新进行节点配置。
S212:客户端200设置系统节点表sysNodeTbl生效标记。
S214:客户端200发送系统节点表sysNodeTbl生效标记至集群管理节点400。
具体地,生效标记用于标识系统节点表sysNodeTbl,客户端200设置该生效标记,并将生效标记发送至集群管理节点400可以实现将系统节点表sysNodeTbl生效。
S216:集群管理节点400向分布式数据库系统100中的节点返回系统节点表sysNodeTbl。
S218:分布式数据库系统100中的节点保存系统节点表sysNodeTbl。
S220:集群管理节点400向客户端200返回节点配置信息。
具体地,集群管理节点400向客户端200返回各个节点的节点配置信息,如各个节点的节点IP、节点类型等。如此,客户端200不仅可以获得节点配置信息,还可以 根据各节点的节点配置信息获得节点数量,如各节点类型的节点数量。
需要说明的是,上述S210至S220为本申请实施例提供的节点配置方法的可选步骤,在本申请其他可能的实现方式中,可以不执行上述S210至S220。
在节点配置完成后,还可以配置表记录在全局内存中的副本数量RecordMemRepNum。其中,RecordMemRepNum的最小值可以设置为1。考虑到分布式数据库系统100中主节点和实时从节点中的数据保持实时一致,如果全局内存中的副本数量超过主节点数量Nm和实时从节点数量Nr之和即Nm+Nr,会增加内存的消耗且对可用性没有提升,基于此,最大值可以设置为Nm+Nr。
当业务对于恢复时间要求较高时,可以配置RecordMemRepNum大于1。如此,当某个节点故障时,缓存在该节点上的表记录无法访问,但处于正常状态的协调节点或者参与节点依然可以直接从该数据表的内存副本节点的内存中访问表记录。此时,恢复时间目标(recovery time objective,RTO)=0。
当业务对于恢复时间要求较低时,可以配置RecordMemRepNum等于1。如此,当某个节点故障时,其他节点(如处于正常状态的协调节点或参与节点)无法访问缓存在该故障节点上的表记录,那么访问该表记录的事务可以等待分布式数据库系统100中某节点接管了故障节点,并从共享存储系统中恢复,或者从日志文件如redolog文件中回放后,再继续执行。此时,RTO>0。
在配置RecordMemRepNum时,还可以向对用户进行信息提示,例如提示RecordMemRepNum的最小值和最大值,以便于用户参考该最小值和最大值进行配置。在配置完成后,集群管理节点400还可以将RecordMemRepNum作为表属性,保存在系统元数据表sysMetaTbl中,分布式数据库系统100中的节点也可以在本地内存中更新系统元数据表sysMetaTbl。
接下来,结合附图对本申请实施例提供的配置RecordMemRepNum的方法进行介绍。
参见图3所示的配置RecordMemRepNum的方法的交互流程图,该方法包括:
S302:客户端200向分布式数据库系统100中的主节点发送创建表命令,表命令中包括RecordMemRepNum。
创建表命令中包括表参数,该表参数可以包括表记录在全局内存中的副本数量即RecordMemRepNum。在一些可能的实现方式中,表参数还可以包括表名、列名、列类型中的一种或多种。
S304:主节点转发创建表命令至集群管理节点400。
主节点可以执行创建表命令,以创建数据表。并且,主节点还转发创建表命令,例如是转发创建表命令中的RecordMemRepNum参数至集群管理节点400,以设置RecordMemRepNum。
如此,可以实现以数据表为粒度配置表记录在全局内存中的副本数量RecordMemRepNum,满足了不同数据表的可用性需求,并且可以根据数据表的需求控制内存消耗。
S306:集群管理节点400检查RecordMemRepNum是否在预设范围内。若是,则执行S308;若否,则执行S320。
预设范围是RecordMemRepNum的取值范围。该范围可以是大于等于1且小于等于主节点的数量Nm和实时从节点的数量Nr之和即Nm+Nr。集群管理节点400检查RecordMemRepNum是否大于等于1且小于等于Nm+Nr。若是,则表征配置的RecordMemRepNum是合法的,可以执行S308;若否,则表征配置的RecordMemRepNum是不合法的,可以执行S320。
S308:集群管理节点400在系统元数据表sysMetaTbl中保存RecordMemRepNum。
具体地,集群管理节点400在系统元数据表sysMetaTbl中增加表记录,该表记录具体用于记录RecordMemRepNum,集群管理节点400可以对表记录中的上述数据进行持久化存储。
S310:集群管理节点400向主节点、实时从节点、准实时从节点发送系统元数据表sysMetaTbl。
其中,集群管理节点400可以向分布式数据库系统100中的节点发送上述系统元数据表sysMetaTbl。当分布式数据库系统100不包括实时从节点或者准实时从节点时,也可以不执行向实时从节点、准实时从节点发送系统元数据表sysMetaTbl的步骤。
S312:分布式数据库系统100中的主节点、实时从节点和准实时从节点在本地内存中更新系统元数据表sysMetaTbl。
S314:分布式数据库系统100中的主节点、实时从节点和准实时从节点向集群管理节点400发送更新完成通知。
其中,更新完成通知用于通知集群管理节点400已在分布式数据库系统100各节点的本地内存完成对系统元数据表sysMetaTbl的更新。
S316:集群管理节点400向分布式数据库系统100中的主节点发送配置成功响应。
S318:主节点向客户端200发送配置成功响应。
在一些可能的实现方式中,集群管理节点400也可以直接向客户端200发送配置成功响应,以通知客户端200对于RecordMemRepNum的配置已完成。
S320:集群管理节点400向分布式数据库系统100中的主节点发送配置失败响应。
S322:主节点向客户端200发送配置失败响应。
在一些可能的实现方式中,集群管理节点400也可以直接向客户端200发送配置失败响应,以通知客户端200对于RecordMemRepNum的配置失败。基于此,客户端200还可以调整表参数,然后重新发送创建表命令。
需要说明的是,上述S314至S322为本申请实施例提供的配置RecordMemRepNum的方法的可选步骤,在本申请其他可能的实现方式中,可以不执行上述S314至S322。
在完成节点配置和表记录在全局内存中的副本数量的配置后,可以基于上述事务处理系统10进行事务处理。接下来,结合附图对本申请实施例提供的事务处理方法进行详细说明。
参见图4所示的事务处理方法的流程图,该方法包括:
S402:客户端200向分布式数据库系统100中的协调节点发送多条查询语句。
查询语句是指通过查询语言编写的、用于对数据库300中数据进行处理的语句。其中,对数据库300中数据进行处理包括数据创建、数据查询、数据更新和数据删除 等中的任意一种或多种。
客户端200可以接收用户通过查询语言编写的多条查询语句,然后向分布式数据库系统100中的协调节点发送上述多条查询语句。其中,查询语言可以由用户从数据库300支持的查询语言列表中确定。例如,查询语言可以是SQL,相应地,用户编写的查询语句可以是SQL语句。
其中,客户端200发送多条查询语句时,可以一次性地发送多条查询语句,如此可以提高吞吐率。在一些可能的实现方式中,客户端200也可以逐条发送查询语句。具体地,客户端200可以先发送一条查询语句,在该查询语句被执行后,再发送下一条查询语句。
在一些可能的实现方式中,多条查询语句可以用于组成一个事务。客户端200根据查询语句确定事务的类型。其中,事务的类型包括读写事务和只读事务。只读事务不支持事务内进行插入(insert)、删除(delete)、更新(update)等操作。客户端200可以根据查询语句是否指示插入、删除或更新表记录,从而确定事务的类型为读写事务或只读事务。例如,多条查询语句中至少有一条查询语句指示插入、删除或更新表记录,则客户端200可以确定事务的类型为读写事务,否则确定事务的类型为只读事务。
当事务的类型为读写事务时,客户端200可以从分布式数据库系统100的主节点中确定协调节点,向该协调节点发送多条查询语句。当事务的类型为只读事务时,客户端200可以从分布式数据库系统100的实时从节点中确定协调节点,向该协调节点发送多条查询语句。在一些可能的实现方式中,事务的类型为只读事务时,客户端200也可以从主节点中确定协调节点,本申请实施例对此不作限定。
S404:分布式数据库系统100中的协调节点根据多条查询语句中的第一查询语句创建事务。
第一查询语句可以是指示事务开始的查询语句。例如,第一查询语句为SQL查询语句时,该第一查询语句可以包括开始(begin)命令。协调节点可以执行第一查询语句,从而创建事务。其中,协调节点的节点类型为主节点时,协调节点可以根据第一查询语句创建读写事务。协调节点的节点类型为实时从节点时,协调节点可以根据第一查询语句创建只读事务。
参见图5所示的事务开始阶段的流程图,协调节点(例如是分布式数据库系统100中的主节点1)可以根据表征事务开始的第一查询语句(如图5中所示的begin),创建事务。具体地,协调节点可以创建全局事务,申请全局事务唯一标识gtxid,以及从本地内存申请本地事务控制块,获得本地事务控制块唯一标识lotxid。此外,协调节点还可以从授时服务器500获取开始时间戳(begin time stamp,beginTs)。
其中,gtxid可以根据节点标识和节点内的序列号确定,例如可以是节点标识和节点内的序列号拼接所得的字符串。全局事务包括多个子事务(例如是协调节点和参与节点上的本地事务),多个子事务之间可以通过gtxid进行关联。当协调节点发生异常,被其他节点接管时,其他参与节点可以通过gtxid中的节点标识,向集群管理节点400查询接管节点的节点标识,从而向接管节点发起全局事务状态重新确认流程。
本地事务控制块具体是本地内存中用于过程状态控制的一段内存空间。本地事务 控制块唯一标识即lotxid可以是严格单调递增的数值,该数值可以是8字节的数值。lotxid可以记录到索引层的rhead中。当本节点上其他事务需要等待本事务提交完成时,可以通过lotxid找到本事务,并将其他事务加入本事务的等待队列。
事务开始时间戳即beginTs可以记录在本地事务控制块中,用于为事务可见性判断提供依据。具体地,事务开始时,协调节点从授时服务器500获取当前时间戳作为beginTs,协调节点可以基于开始时间戳,通过可见性规则判断表记录对于事务的可见性。下面对基于可见性规则判断可见性过程进行详细说明。
具体地,一个版本的表记录的rhead中包括该版本的表记录的生存时间窗。生存时间窗可以通过表征起始时间的最小时间戳tmin和表征终止时间的最大时间戳tmax表征。协调节点可以基于事务的beginTs和一个版本的表记录的rhead中tmin、tmax的大小关系,判断该版本的表记录对于事务的可见性。
在tmin和tmax都记录时间戳(而不是gtxid)的情况下,如果beginTs>=tmin且beginTs<tmax,则该版本的表记录对于该事务可见,否则不可见。在tmin或者tmax中的至少一个记录gtxid(而不是时间戳)的情况下,则协调节点可以根据rhead中记录的lotxid找到本地事务,将该事务加入本地事务的等待队列,当事务从等待队列中唤醒后,再读取tmin和tmax进行可见性的判断。
S406:分布式数据库系统100中的协调节点根据多条查询语句中的第二查询语句在全局内存中执行事务。
第二查询语句可以是指示事务操作的查询语句。其中,事务操作包括数据操作语言(data manipulation language,DML)操作。其中,DML操作可以包括insert、delete或者update操作。事务操作也可以包括查询(query)操作。
其中,第二查询语句为指示执行insert操作的语句时,第二查询语句还可以携带待插入的表记录的记录数据。第二查询语句为指示执行update操作的语句时,第二查询语句还可以携带待更新的表记录的主键以及更新的记录数据。第二查询语句为指示执行delete操作的语句时,第二查询语句还可以携带表记录的主键。第二查询语句为指示执行query操作的语句时,第二查询语句还可以携带查询条件。其中,查询条件可以包括表记录的主键或谓词条件。谓词用于表示比较运算,谓词条件包括通过比较运算表达的查询条件。谓词条件用于缩小查询所返回的结果集的范围。
协调节点可以执行该第二查询语句,从而在全局内存中执行事务,例如协调节点可以在全局内存中执行数据插入、删除、更新或者读取等操作中的至少一种。其中,第二查询语句可以包括一条或多条。下面以第二查询语句分别为指示执行insert操作、执行delete操作、执行update操作和执行query操作的语句,对执行事务的过程进行说明。
参见图5所示的事务执行阶段的流程图,第二查询语句为指示执行insert操作的语句时,协调节点(例如是分布式数据库系统100中的主节点1)可以查询系统元数据表sysMetaTbl得到表属性,该表属性包括表记录在全局内存中的副本数量,其中,表记录在全局内存中的副本数量可以记作RecordMemRepNum。协调节点可以调用全局内存管理模块的提供的全局内存接口申请指定副本数量的全局内存空间(例如可以记作gm1),然后在gm1中填写记录数据。其中,指定副本数量等于RecordMemRepNum。
具体地,协调节点可以调用全局内存接口获得gm1的副本所在节点列表,然后根据副本所在节点列表,在第一个副本的节点(例如是协调节点)的全局内存中填写新增的记录数据。其中,tmin可以设置为gtxid,tmax可以设置为-1(用于表示无穷infinite)。需要说明,tmin在提交之前设置为gtxid,在提交之后设置为时间戳,tmax在提交前后保持不变。
进一步地,协调节点可以查询系统节点表sysNodeTbl。如果副本所在节点列表中存在准实时从节点,则继续在该准实时从节点的全局内存中填写新增的记录数据。其中,tmin设置为gtxid,tmax设置为-1。
协调节点还可以申请本地内存,用于存储rhead和间接索引indirect。其中,rhead中填写insert的记录数据的全局内存地址、物理地址和lotxid。其中,由于事务尚未提交,物理地址可以为0。indirect指向rhead。接着协调节点在本地索引树中插入新增的记录数据。如果存在键冲突,则insert失败,释放之前申请的全局内存和本地内存。协调节点还可以返回错误信息。该错误信息可以指示insert失败,进一步地,错误信息可以指示insert失败的原因。如果insert成功,则将indirect修改为指向新增的记录数据,然后协调节点可以在本地事务写集合(write set,wset)中记录操作类型为insert,并向客户端200返回插入成功。其中,本地事务写集合下文简称为写集合。
When the second query statement indicates an update operation, the coordinating node (for example, primary node 1 in the distributed database system 100) may query the system metadata table sysMetaTbl to obtain the table attributes, which include RecordMemRepNum. The coordinating node looks up the version of the table record targeted by the update operation, determines the visibility of the table record to the transaction based on beginTs and the tmin and tmax in the rhead, and returns the correct version of the table record.
Specifically, the coordinating node looks up the version chain of the table record, that is, the rhead chain, in the index data such as the index tree based on the primary key of the table record targeted by the update operation. The coordinating node then reads the tmin and tmax corresponding to the table record based on the global memory address recorded in the rhead and makes the visibility decision based on tmin and tmax. If tmin or tmax of that version is a gtxid rather than a time stamp, this transaction is added to the wait queue of the local transaction control block identified by the lotxid in the rhead; after this transaction is woken up, it can re-execute the traversal of the rhead chain. A gtxid and a time stamp can be distinguished by the high-order bit: if the high-order bit of tmin or tmax is 1, the value is a gtxid; if the high-order bit is not 1, the value is a time stamp. If both tmin and tmax of the version are time stamps, then when beginTs falls within [tmin, tmax), that version of the table record is visible to the transaction and the coordinating node may return that version of the table record together with its rhead; when beginTs does not fall within [tmin, tmax), the version is not visible to the transaction, and the coordinating node may continue traversing to the previous version based on the address of the previous version recorded in the rhead.
The coordinating node may obtain the global memory address of the table record from the rhead of the returned version and then attempt to mark the update in the table record based on that global memory address. Specifically, if the tmax of the returned version is not -1, the version is not the current version and has already been updated by another transaction, that is, a write-write conflict occurs, and the coordinating node may return a marking failure notification to the client 200. If the tmax of the returned version is -1, it is the latest version, and the coordinating node may call the global memory interface to obtain the node list of the memory replicas and initiate a compare and swap (CAS) atomic operation on the tmax of the corresponding table record in the global memory of the first replica node to mark tmax as gtxid. If the CAS atomic operation returns failure, a write-write conflict has occurred and a marking failure notification is returned to the client 200; if the CAS atomic operation returns success, the marking has succeeded and the coordinating node can perform the update of the table record.
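The following minimal sketch illustrates the compare-and-swap marking step described above. It is a simplification: a Python lock stands in for the hardware or RDMA-level atomic CAS of the embodiment, and the class and method names are assumptions for illustration.

```python
# Minimal sketch of marking an update with compare-and-swap: the old
# version's tmax is swapped from "infinite" to the updater's gtxid, so a
# concurrent updater that lost the race observes a write-write conflict.
import threading

INFINITE = -1

class VersionHeader:
    def __init__(self):
        self.tmax = INFINITE
        self._lock = threading.Lock()

    def cas_tmax(self, expected, new) -> bool:
        """Atomically set tmax to `new` only if it still equals `expected`."""
        with self._lock:
            if self.tmax == expected:
                self.tmax = new
                return True
            return False

hdr = VersionHeader()
print(hdr.cas_tmax(INFINITE, "3-42"))   # True: this transaction owns the update
print(hdr.cas_tmax(INFINITE, "7-9"))    # False: write-write conflict, report failure
```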
When updating the table record, the coordinating node may first call the global memory interface to apply for global memory with the specified number of replicas and then write the updated record data. For the specific implementation of the coordinating node applying for global memory and writing the updated record data into the global memory, reference may be made to the description of the insert operation; details are not repeated here. The coordinating node then applies for local memory to store an rhead, which is filled with the global memory address of the updated table record, the lotxid, and the physical address; at this point, the physical address may be 0.
Next, the coordinating node installs the new version chain; specifically, the management header newrhead of the updated table record is pointed at the address of the rhead of the previous version, and the indirect is pointed at the newrhead. The coordinating node records, in the write set wset of the local transaction control block, the operation type update together with the address of the rhead (that is, the oldrhead) and the address of the newrhead. Upon completing these operations, the coordinating node may return an update success notification to the client 200.
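A minimal sketch of installing the new version chain follows; the structures are simplified stand-ins for the rhead, newrhead, and indirect described above, and the names are assumptions for illustration.

```python
# Minimal sketch of installing a new version chain on update: the new
# management header points back at the previous version's header, and the
# indirect slot in the index is swung to the new header, so readers always
# reach the newest version first and can walk back to older ones.
from dataclasses import dataclass

@dataclass
class RHead:
    global_addr: int
    prev: "RHead | None" = None   # previous version of the same record

class Indirect:
    """One slot per primary key in the index tree; points at the newest rhead."""
    def __init__(self, head: RHead):
        self.head = head

def install_new_version(indirect: Indirect, new_global_addr: int) -> tuple[RHead, RHead]:
    old_rhead = indirect.head
    new_rhead = RHead(global_addr=new_global_addr, prev=old_rhead)
    indirect.head = new_rhead          # readers now see the new version first
    return old_rhead, new_rhead        # both addresses go into the write set

slot = Indirect(RHead(global_addr=0x1000))
old, new = install_new_version(slot, new_global_addr=0x2000)
print(slot.head is new, new.prev is old)   # True True
```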
When the second query statement indicates a delete operation, the coordinating node (for example, primary node 1 in the distributed database system 100) may look up the version of the table record to be deleted, determine the visibility of the table record to the transaction based on beginTs and the tmin and tmax in the rhead, and return the correct version of the table record. The coordinating node then obtains the global memory address of the table record from the rhead of the returned version and marks the update in the table record. For the specific implementation of determining visibility, returning the correct version of the table record, and marking the update, reference may be made to the description of the update operation; details are not repeated here.
Next, the coordinating node records the operation type delete and the address of the rhead in the write set of the local transaction control block. Upon completing these operations, the coordinating node may return a deletion success notification to the client 200.
When the second query statement indicates a query operation, the coordinating node (for example, primary node 1 in the distributed database system 100) may look up the version of the queried table record based on the query condition, determine the visibility of the table record to the transaction based on beginTs and the tmin and tmax in the rhead, and return the correct version of the table record. For the specific implementation of determining visibility and returning the correct version of the table record, reference may be made to the description of the update operation. The coordinating node may also traverse the read records in the local transaction read set (rset) and check for phantoms, so as to validate read-write conflicts. When the validation passes, the correct version of the table record may be returned in response to the query operation. The local transaction read set may be referred to as the read set below.
After the above processing, for an insert/update operation, the coordinating node has applied for global memory with the specified number of replicas for the new version of the table record, has written the record data of the table record for the first replica node (for example, the coordinating node), and has set tmin and tmax. If the replica nodes include a near-real-time secondary node, the record data of the table record has also been written for that near-real-time secondary node and tmin and tmax have been set there as well. In addition, the coordinating node has applied for an rhead for the new version of the table record; the rhead records the global memory address of the record data or the updated record data, and it has been installed into the local index tree and the indirect. For an update/delete operation, the tmax of the first replica node of the original version has already been set to gtxid by CAS, thereby handling concurrency conflicts with other transactions. The write set in the local transaction control block has recorded the gtxid, beginTs, and the address of the newrhead. The read set in the local transaction control block has recorded the read records and the query condition (for example, the predicate condition); the read records can be used for read-write conflict validation.
S408: The coordinating node in the distributed database system 100 commits the transaction based on a third query statement in the plurality of query statements.
The third query statement may be a query statement indicating transaction commit. For example, when the third query statement is an SQL query statement, it may include a commit command. The coordinating node may execute the third query statement to commit the transaction, so that the newly added record data, the updated record data, the deleted record data, or the queried record data are consistent in real time between the coordinating node and the participating nodes.
Referring to the flowchart of the transaction commit phase shown in FIG. 6, the coordinating node (for example, primary node 1 in the distributed database system 100) may query the locally cached system node table sysNodeTbl to obtain the list of the other primary nodes and the real-time secondary nodes; these nodes are the participating nodes. Based on the operation types in the write set (for example, one or more of insert, update, and delete) and the addresses of the newrhead and the oldrhead, the coordinating node packs the operation types, the global memory addresses of the old and new version table records, and the record data of the new version table records into a pre-synchronization (also called Preinstall) request message. The pre-synchronization request message includes the gtxid and beginTs. The coordinating node sends the pre-synchronization request message to the participating nodes (for example, primary node 2, primary node 3, real-time secondary node 1, and real-time secondary node 2 in the distributed database system 100). Upon receiving the pre-synchronization request message, each participating node creates a local transaction on its own node and obtains an lotxid.
The participating node traverses the write set in the pre-synchronization request message and performs the following processing according to the operation type:
For an insert operation, the participating node applies for local memory for the new version, and this local memory is used to store the newrhead. The participating node records, in the newrhead, the global memory address, the lotxid, and the physical address carried in the pre-synchronization request message, where the physical address is 0, and then assigns the indirect so that it points to the newrhead. The participating node then checks whether the global memory address of the new version has a replica on this node; if there is a replica and it is not the first replica, the record data is written into this node's replica and tmin = gtxid and tmax = -1 are set. Finally, the participating node inserts the record data into the index tree based on the primary key of the newly added record data; if a key conflict exists, a pre-synchronization failure notification is sent to the coordinating node; otherwise, the insert succeeds. It should be noted that, if another transaction discovers this record data at this point, that transaction is added to the wait queue of the local transaction control block corresponding to the lotxid in the rhead.
For an update operation, the participating node applies for local memory for the new version, and this local memory is used to store the newrhead. The participating node records, in the newrhead, the global memory address, the lotxid, and the physical address carried in the pre-synchronization request message; at this point, the physical address is 0. The participating node checks whether the global memory address of the new version has a replica on this node; if there is a replica and it is not the first replica, the record data is written into this node's replica and tmin = gtxid and tmax = -1 are set. The participating node then looks up the indirect address in the local index tree based on the primary key of the old version table record, obtains the rhead that the indirect points to, and points the newrhead at the current rhead. Based on the global memory address of the old version record data, if the participating node finds that the global memory has a replica on this node and it is not the first replica, the participating node may modify tmax to gtxid in this node's replica.
For a delete operation, the participating node looks up the address of the indirect in the local index tree based on the primary key of the old version table record, obtains the rhead that the indirect points to, and points the newrhead at the current rhead. Then, based on the global memory address of the old version record data, if the participating node finds that the global memory has a replica on this node and it is not the first replica, the participating node may modify tmax to gtxid in this node's replica.
Next, the participating nodes send pre-synchronization response messages to the coordinating node. When the coordinating node has collected the pre-synchronization response messages from all participating nodes and all of them indicate that pre-synchronization succeeded, the coordinating node obtains the current time stamp as the end time stamp endTs.
The coordinating node may determine the isolation level of the transaction. When the isolation level of the transaction is serializable snapshot isolation (SSI), the coordinating node may check for read-write conflicts. Specifically, the coordinating node may traverse the read set rset and use endTs to check the visibility of the table records corresponding to the rheads in the rset, so as to determine whether a read-write conflict has occurred. The coordinating node may re-execute the query of the table records according to the predicate condition and check whether the table records visible at endTs are the same as the records visible at beginTs; if so, the table records covered by the predicate condition during transaction execution have no read-write conflict. If a table record is not visible, another transaction has modified that table record, that is, a read-write conflict has occurred; the coordinating node may abort the transaction, perform a rollback operation, notify the other participating nodes to abort the transaction, and return an error response to the client 200. It should be noted that, when the isolation level of the transaction is another isolation level such as read committed (RC) or snapshot isolation (SI), the coordinating node does not need to perform this step of checking for read-write conflicts.
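As an illustration of the SSI read-set validation described above, the following minimal sketch re-runs a predicate at beginTs and at endTs and compares the visible records; `run_predicate` is a hypothetical helper standing in for re-executing the predicate query over the version store, and all names are assumptions for illustration.

```python
# Minimal sketch of SSI read-set validation at commit: the records visible at
# beginTs under a predicate must still be exactly the records visible at endTs,
# otherwise a read-write conflict is reported and the transaction rolls back.

def validate_read_set(run_predicate, predicate, begin_ts: int, end_ts: int) -> bool:
    """Return True if no read-write conflict was detected for this predicate."""
    return run_predicate(predicate, begin_ts) == run_predicate(predicate, end_ts)

# Example with an in-memory version store keyed by primary key.
versions = {
    1: [{"tmin": 10, "tmax": 50, "balance": 5}, {"tmin": 50, "tmax": 2**63, "balance": 7}],
    2: [{"tmin": 10, "tmax": 2**63, "balance": 9}],
}

def run_predicate(predicate, ts):
    visible = {}
    for key, chain in versions.items():
        for v in chain:
            if v["tmin"] <= ts < v["tmax"] and predicate(v):
                visible[key] = v["balance"]
    return visible

# A concurrent update of key 1 committed at ts=50, between beginTs=40 and endTs=60,
# so validation fails and the transaction must be rolled back.
print(validate_read_set(run_predicate, lambda v: v["balance"] > 0, begin_ts=40, end_ts=60))  # False
```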
Based on the node configuration information in the system node table, the coordinating node obtains the list of nodes configured with a log file system such as LogFs. The coordinating node may select, from this list, a number of nodes equal to the preconfigured replica count to write the log file. For example, if the preconfigured replica count is N, the coordinating node may write the log file and send synchronization requests (also called prepare requests) to the other N-1 nodes configured with LogFs to notify those nodes to write the log file (for example, the redolog file). The log file records the gtxid, endTs, the newly added record data and its global memory address, and the deleted record data. The coordinating node then waits for the synchronization responses from these nodes. If the redolog cannot satisfy the preconfigured replica count, for example the preconfigured replica count is 3 while only 2 nodes are configured with high-speed persistent media, the coordinating node may write the redolog directly to the shared storage system. When the coordinating node determines that the isolation level of the transaction is RC or SI, it may directly generate the log file, such as the redolog file.
Upon receiving the synchronization request, the participating node writes the log file under the same file name in its local LogFs, for example writes the redolog file, and then returns a synchronization response to the coordinating node. Before writing the log file, the participating node may also first validate read-write conflicts and write-write conflicts. For the process of the participating node validating read-write conflicts, reference may be made to the specific implementation of the coordinating node validating read-write conflicts. The participating node may validate write-write conflicts in the following way: the participating node determines, based on the index of at least one table record in the transaction's write set, whether a write-write conflict has occurred; for example, if a uniqueness conflict occurs when the participating node inserts a write-record index entry into the index, a write-write conflict has occurred. When the participating node determines that no write-write conflict has occurred, it generates the redolog file; when the participating node determines that a write-write conflict has occurred, it may return an error response to the coordinating node. After receiving the synchronization responses (also called commit replies), the coordinating node can enter the transaction complete procedure.
Specifically, referring to the flowchart of the transaction completion phase shown in FIG. 7, the coordinating node (for example, primary node 1 in the distributed database system 100) sends a transaction completion request to each participating node (for example, primary node 2, primary node 3, real-time secondary node 1, and real-time secondary node 2 in the distributed database system 100), where the transaction completion request carries the lotxid of each participating node. The coordinating node traverses the write set of the local transaction and applies the modifications of this transaction.
Applying the modifications of this transaction by the coordinating node may include: setting tmin to endTs in the first replica of the new version table record in the global memory and in the near-real-time secondary node replica (if such a replica exists), and setting the physical address in the rhead to the redolog file identifier and offset; then setting tmax to endTs in the first replica of the old version table record in the global memory and in the near-real-time secondary node replica (if such a replica exists), and updating the physical address in the rhead to the redolog file identifier and offset. The coordinating node sets the lotxid recorded in all rheads in the write set to 0. At this point, the coordinating node takes out the list of local transactions that are waiting for visibility determination on this local transaction and wakes them all up; the woken transactions re-check the visibility of the table records. The coordinating node adds the local transaction to the recycle list and, after all active transactions have ended, recycles the old version chains and deletes the indexes.
Upon receiving the transaction completion request, the participating node processes it in a similar way to the coordinating node. Specifically, the participating node traverses the write set and applies the modifications of this transaction. If a new version table record in the write set has a replica on this node, the participating node sets the tmin of the new version table record to endTs and sets the physical address in the rhead to the redolog file identifier and offset. The participating node also sets the tmax of the old version table records in the write set to endTs and updates the physical address in the rhead to the redolog file identifier and offset. The participating node sets the lotxid recorded in all rheads in the write set to 0. At this point, the participating node takes out the list of local transactions that are waiting for visibility determination on this local transaction and wakes them all up. The participating node then adds the local transaction to the recycle list and, after all active transactions have ended, recycles the old version chains and deletes the indexes. The participating node can then return a transaction completion response to the coordinating node.
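The completion step described above can be sketched as follows; the dictionaries are simplified stand-ins for the global-memory versions and rheads, and all names are assumptions for illustration rather than the structures of the embodiment.

```python
# Minimal sketch of the completion step: once the redo log is durable, the
# provisional lifetime bounds (gtxid markers) are replaced by the commit time
# stamp endTs, the rhead is pointed at the log location, and waiters blocked
# on this transaction are woken so they can re-check visibility.

def apply_commit(write_set, end_ts, log_location, wake):
    for op, new_version, old_version, rhead in write_set:
        if new_version is not None:          # insert / update: new version becomes visible
            new_version["tmin"] = end_ts
        if old_version is not None:          # update / delete: old version is closed
            old_version["tmax"] = end_ts
        rhead["physical_addr"] = log_location
        rhead["lotxid"] = 0                  # nobody needs to wait on this transaction any more
    wake()                                   # waiters re-run their visibility checks

waiters = []
new_v = {"tmin": "3-42", "tmax": -1}
old_v = {"tmin": 100, "tmax": "3-42"}
rhead = {"physical_addr": 0, "lotxid": 7}
apply_commit([("update", new_v, old_v, rhead)], end_ts=180,
             log_location=("redolog-5", 4096), wake=lambda: waiters.clear())
print(new_v, old_v, rhead)
```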
It should be noted that, when the coordinating node validates read-write conflicts based on the isolation level, if a table record is not visible, another transaction has modified that table record and a read-write conflict has occurred; the coordinating node may abort the transaction and roll it back. In some embodiments, when the coordinating node receives the pre-synchronization responses from the participating nodes, if an error response is included, the coordinating node may also abort the transaction and roll it back. The rollback process is described in detail below.
Specifically, referring to the flowchart of the transaction rollback phase shown in FIG. 8, the coordinating node (for example, primary node 1 in the distributed database system 100) sends a transaction rollback request (final-abort, as shown in FIG. 8) to each participating node (for example, primary node 2, primary node 3, real-time secondary node 1, and real-time secondary node 2 in the distributed database system 100), where the transaction rollback request carries the lotxid of each participating node. The coordinating node traverses the write set of the local transaction and rolls back the modifications of this transaction. Specifically, the coordinating node deletes the indexes of the new version table records in the write set from the index tree, sets the tmax of the old version table records in the write set to -1 in the first replica in the global memory and in the near-real-time secondary node replica (if such a replica exists), restores the indirect of the old version table records in the write set to point to the old version, and sets the lotxid recorded in all rheads in the write set to 0. At this point, the coordinating node takes out the list of local transactions that are waiting for visibility determination on this local transaction and wakes them all up; it should be noted that the woken transactions re-check the visibility of the records. The coordinating node then adds the local transaction to the recycle list and, after all active transactions have ended, recycles the global memory and local memory of the new version table records.
Upon receiving the transaction rollback request, the participating node may process it in a similar way to the coordinating node. Specifically, the participating node traverses the write set and rolls back the modifications of this transaction. If the global memory of an old version table record in the write set has a replica on this node, the participating node sets tmax to -1; it then restores the indirect of the old version table records in the write set to point to the old version and sets the lotxid recorded in all rheads in the write set to 0. At this point, the participating node takes out the list of local transactions that are waiting for visibility determination on this local transaction and wakes them all up. The participating node adds the local transaction to the recycle list and, after all active transactions have ended, recycles the global memory and local memory of the new version table records.
For a query operation, the coordinating node (whose node type is a primary node or a real-time secondary node) may determine the visible version and directly return the record data of the table record to the client. Specifically, when the transaction isolation level is SSI, the table record has not yet been committed (tmin or tmax representing a gtxid indicates an uncommitted record), and the beginTs of this transaction is greater than the beginTs of the local transaction corresponding to the uncommitted record, the coordinating node may add this transaction to the wait queue of the other transaction; otherwise, it directly returns the record data of the table record.
In some possible implementations, the distributed database system 100 further includes a near-real-time secondary node. The near-real-time secondary node can receive query requests associated with analytical services sent by clients and, by replaying the redolog, generate the index tree and data replicas locally. Specifically, the near-real-time secondary node may replay all redologs periodically (for example, at an interval of 0.5 seconds) and, using the table record content and global memory addresses recorded in the redolog, replay and generate the local index tree and the record data of the new version table records.
To ensure the consistency of the replayed redolog content, the near-real-time secondary node may take the minimum endTs of all active transactions in the cluster as the cutoff time of the replay, and the transaction logs in the redolog whose endTs is smaller than this cutoff are replayed.
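The cutoff selection can be illustrated with the following minimal sketch; the log-entry layout and helper names are assumptions for illustration only.

```python
# Minimal sketch of choosing the replay cutoff on a near-real-time secondary:
# only redo-log entries whose commit time stamp (endTs) lies strictly below
# the smallest endTs among still-active transactions are applied, so the
# replayed state never exposes a half-finished transaction.

def replay_cutoff(active_end_ts: list[int]) -> int:
    """Smallest endTs of the still-active transactions (no limit if none are active)."""
    return min(active_end_ts) if active_end_ts else 2**63 - 1

def select_replayable(redolog: list[dict], active_end_ts: list[int]) -> list[dict]:
    cutoff = replay_cutoff(active_end_ts)
    return [entry for entry in redolog if entry["endTs"] < cutoff]

redolog = [{"gtxid": "1-7", "endTs": 120}, {"gtxid": "2-3", "endTs": 150}, {"gtxid": "1-8", "endTs": 190}]
print(select_replayable(redolog, active_end_ts=[160, 200]))   # entries with endTs 120 and 150 are replayed
```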
In this embodiment, the transaction commit protocol defines a write conflict (write-write conflict or read-write conflict) control method. Specifically, the coordinating node uses pessimistic concurrency control (also called pessimistic locking) and the participating nodes use optimistic concurrency control (also called optimistic locking). On the one hand, this avoids write-write conflicts or read-write conflicts and guarantees consistency between the coordinating node and the participating nodes; on the other hand, it reduces the interaction between the coordinating node and the participating nodes, shortens the synchronization time, and achieves real-time consistency.
The principle of pessimistic concurrency control is to assume that concurrent transactions from multiple users affect one another during processing, and therefore a transaction can be blocked from modifying data. Specifically, if an operation performed by a transaction, such as reading a row of data, applies pessimistic concurrency control (a pessimistic lock), other transactions can perform conflicting operations only after this transaction releases its permission. The principle of optimistic concurrency control is to assume that concurrent transactions from multiple users do not affect one another during processing, and each transaction can process the part of the data it affects without acquiring locks. Before committing a data update, each transaction first checks whether another transaction has modified the data after this transaction read it; if another transaction has made an update, the committing transaction is rolled back.
Based on the foregoing description, the embodiments of this application provide a transaction processing method. In this method, part of the memory of multiple nodes of the distributed database system 100 is used to form the global memory. The global memory is visible to the coordinating node and the participating nodes in the distributed database system 100; in other words, the global memory is shared by the coordinating node and the participating nodes. When a transaction executed by the coordinating node causes the data stored in part of the global memory to change, the participating nodes can quickly perceive the change and, based on the transaction commit protocol, access that part of the global memory across nodes through RDMA or memory fabric for data synchronization, instead of synchronizing data through message interaction. This greatly shortens the synchronization time, achieves real-time consistency between the coordinating node and the participating nodes, and meets the real-time consistency requirements of real-time services. In addition, the capacity of the global memory can scale with the number of nodes and is no longer limited by the memory capacity of a single node, which improves concurrency control capability.
Compared with a traditional disk engine cluster oriented toward disk media, the memory engine cluster oriented toward memory media provided in the embodiments of this application has no pages and no rollback logs and therefore offers better performance. Moreover, the embodiments of this application provide different types of nodes, such as primary nodes, real-time secondary nodes, and near-real-time secondary nodes, which can meet the requirements of real-time read-write services (such as trading scenarios), real-time read-only services (such as real-time analysis scenarios), or non-real-time read-only services (such as non-real-time analysis scenarios). The embodiments of this application can also set the number of replicas of table records in the global memory at table granularity, which on the one hand controls memory usage and on the other hand meets the high-availability requirements of different data tables.
The transaction processing method provided in the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 8. The distributed database system 100 and the transaction processing system 10 provided in the embodiments of this application are described below with reference to the accompanying drawings.
Referring to the schematic structural diagram of the distributed database system 100 shown in FIG. 1, the distributed database system 100 includes:
the coordinating node, configured to receive a plurality of query statements sent by a client; and
the coordinating node, further configured to create a transaction based on a first query statement in the plurality of query statements, execute the transaction in the global memory based on a second query statement in the plurality of query statements, and commit the transaction based on a third query statement in the plurality of query statements.
In some possible implementations, the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
In some possible implementations, the global memory includes part of the memory of the coordinating node and/or the participating node.
In some possible implementations, the node type of the coordinating node is a primary node, and the coordinating node is specifically configured to:
create a read-write transaction based on the first query statement in the plurality of query statements.
In some possible implementations, the node type of the coordinating node is a first secondary node, the first secondary node is configured to remain consistent in real time with a node whose node type is a primary node, and the coordinating node is specifically configured to:
create a read-only transaction based on the first query statement in the plurality of query statements.
In some possible implementations, the coordinating node is further configured to receive and save the number of replicas of table records in the global memory sent by a cluster management node;
and the participating node is further configured to receive and save the number of replicas of table records in the global memory sent by the cluster management node.
In some possible implementations, the table records are stored in the global memory of the distributed database system, and the index trees and management headers of the table records are stored in the local memory of the distributed database system.
In some possible implementations, the coordinating node is specifically configured to:
commit the transaction based on the third query statement in the plurality of query statements through a transaction commit protocol running on the coordinating node and the participating node, so as to achieve real-time consistency between the coordinating node and the participating node.
In some possible implementations, the coordinating node is specifically configured to trigger pessimistic concurrency control when a write conflict occurs in the transaction;
and the participating node is specifically configured to trigger optimistic concurrency control when a write conflict occurs in the transaction.
The distributed database system 100 according to the embodiments of this application may correspondingly perform the methods described in the embodiments of this application, and the above and other operations and/or functions of the modules/units of the distributed database system 100 are respectively intended to implement the corresponding procedures of the methods in the embodiment shown in FIG. 4. For brevity, details are not repeated here.
Based on the distributed database system 100 provided in the embodiments of this application, an embodiment of this application further provides a transaction processing system 10. Referring to the schematic structural diagram of the transaction processing system 10 shown in FIG. 1, the transaction processing system 10 includes the distributed database system 100 and the client 200.
The distributed database system 100 is configured to perform the corresponding transaction processing method, for example the transaction processing method shown in FIG. 4, based on the query statements sent by the client 200. Specifically, the client 200 is configured to send a plurality of query statements to the distributed database system 100, and the coordinating node of the distributed database system 100 is configured to receive the plurality of query statements, create a transaction based on a first query statement in the plurality of query statements, execute the transaction in the global memory based on a second query statement in the plurality of query statements, and commit the transaction based on a third query statement in the plurality of query statements.
In some possible implementations, the transaction processing system 10 further includes the database 300, and the distributed database system 100 performs the transaction processing method to manage the data in the database 300, for example inserting new record data, updating record data, or deleting record data.
Similarly, the transaction processing system 10 further includes the cluster management node 400. The cluster management node 400 is configured to configure the nodes of the distributed database system deployed in the cluster, for example configuring node IPs and node types. The transaction processing system 10 may further include the timing server 500, which is configured to provide time stamps for the distributed database system 100 so that the visibility of data to transactions can be determined based on the time stamps.
An embodiment of this application further provides a cluster 90. The cluster 90 includes a plurality of computers. The computers may be servers, for example local servers in a private data center or cloud servers provided by a cloud service provider. The computers may also be terminals, including but not limited to desktop computers, notebook computers, and smartphones. The cluster 90 is specifically configured to implement the functions of the distributed database system 100.
FIG. 9 provides a schematic structural diagram of the cluster 90. As shown in FIG. 9, the cluster 90 includes a plurality of computers 900. The computer 900 includes a bus 901, a processor 902, a communication interface 903, and a memory 904. The processor 902, the memory 904, and the communication interface 903 communicate with one another through the bus 901.
The bus 901 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and the like. For ease of representation, only one bold line is used in FIG. 9, but this does not mean that there is only one bus or one type of bus.
The processor 902 may be any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The communication interface 903 is used for external communication. For example, the communication interface 903 may be used to receive the plurality of query statements sent by the client 200, obtain the begin time stamp and the end time stamp from the timing server 500, or return a commit response to the client 200.
The memory 904 may include volatile memory, for example random access memory (RAM). The memory 904 may also include non-volatile memory, for example read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 904 stores executable code, and the processor 902 executes the executable code to perform the foregoing transaction processing method.
Specifically, when the embodiment shown in FIG. 1 is implemented and the components of the distributed database system 100 described in the embodiment of FIG. 1 are implemented through software, the software or program code required to perform the functions of the components in FIG. 1 is stored in the memory 904. The processor 902 executes the program code corresponding to the components stored in the memory 904 to perform the foregoing transaction processing method.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium includes instructions that instruct the computer 900 to perform the foregoing transaction processing method applied to the distributed database system 100.
It should be noted that the instructions in the computer-readable storage medium may be executed by multiple computers 900 in the cluster 90, and therefore each computer 900 may also perform part of the foregoing transaction processing method applied to the distributed database system 100. For example, some computers may perform the steps performed by the coordinating node in the foregoing transaction processing method, and other computers may perform the steps performed by the participating node in the foregoing transaction processing method.
An embodiment of this application further provides a computer program product. When the computer program product is executed by a computer, the computer performs any one of the foregoing transaction processing methods. The computer program product may be a software installation package; when any one of the foregoing transaction processing methods needs to be used, the computer program product can be downloaded and executed on the computer.

Claims (21)

  1. A transaction processing method, applied to a distributed database system, wherein the distributed database system comprises a coordinating node and a participating node, and the coordinating node and the participating node share global memory, the method comprising:
    receiving, by the coordinating node, a plurality of query statements sent by a client;
    creating, by the coordinating node, a transaction based on a first query statement in the plurality of query statements;
    executing, by the coordinating node, the transaction in the global memory based on a second query statement in the plurality of query statements; and
    committing, by the coordinating node, the transaction based on a third query statement in the plurality of query statements.
  2. The method according to claim 1, wherein the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
  3. The method according to claim 1 or 2, wherein the global memory comprises part of the memory of the coordinating node and/or the participating node.
  4. The method according to any one of claims 1 to 3, wherein the node type of the coordinating node is a primary node, and the creating, by the coordinating node, a transaction based on a first query statement in the plurality of query statements comprises:
    creating, by the coordinating node, a read-write transaction based on the first query statement in the plurality of query statements.
  5. The method according to any one of claims 1 to 3, wherein the node type of the coordinating node is a first secondary node, the first secondary node is configured to remain consistent in real time with a node whose node type is a primary node, and the creating, by the coordinating node, a transaction based on a first query statement in the plurality of query statements comprises:
    creating, by the coordinating node, a read-only transaction based on the first query statement in the plurality of query statements.
  6. The method according to any one of claims 1 to 5, wherein before the coordinating node receives the plurality of query statements sent by the client, the method further comprises:
    receiving, by the distributed database system, the number of replicas of table records in the global memory sent by a cluster management node; and
    saving, by the distributed database system, the number of replicas of the table records in the global memory.
  7. The method according to claim 6, wherein the table records are stored in the global memory of the distributed database system, and the index trees and management headers of the table records are stored in the local memory of the distributed database system.
  8. The method according to any one of claims 1 to 7, wherein the committing, by the coordinating node, the transaction based on a third query statement in the plurality of query statements comprises:
    committing, by the coordinating node, the transaction based on the third query statement in the plurality of query statements through a transaction commit protocol running on the coordinating node and the participating node, so as to achieve real-time consistency between the coordinating node and the participating node.
  9. The method according to claim 8, wherein when a write conflict occurs in the transaction, the coordinating node triggers pessimistic concurrency control and the participating node triggers optimistic concurrency control.
  10. A distributed database system, wherein the system comprises a coordinating node and a participating node, and the coordinating node and the participating node share global memory;
    the coordinating node is configured to receive a plurality of query statements sent by a client; and
    the coordinating node is further configured to create a transaction based on a first query statement in the plurality of query statements, execute the transaction in the global memory based on a second query statement in the plurality of query statements, and commit the transaction based on a third query statement in the plurality of query statements.
  11. The system according to claim 10, wherein the distributed database system is deployed in a cluster, and the global memory comes from the cluster.
  12. The system according to claim 10 or 11, wherein the global memory comprises part of the memory of the coordinating node and/or the participating node.
  13. The system according to any one of claims 10 to 12, wherein the node type of the coordinating node is a primary node, and the coordinating node is specifically configured to:
    create a read-write transaction based on the first query statement in the plurality of query statements.
  14. The system according to any one of claims 10 to 12, wherein the node type of the coordinating node is a first secondary node, the first secondary node is configured to remain consistent in real time with a node whose node type is a primary node, and the coordinating node is specifically configured to:
    create a read-only transaction based on the first query statement in the plurality of query statements.
  15. The system according to any one of claims 10 to 14, wherein
    the coordinating node is further configured to receive and save the number of replicas of table records in the global memory sent by a cluster management node; and
    the participating node is further configured to receive and save the number of replicas of the table records in the global memory sent by the cluster management node.
  16. The system according to claim 15, wherein the table records are stored in the global memory of the distributed database system, and the index trees and management headers of the table records are stored in the local memory of the distributed database system.
  17. The system according to any one of claims 10 to 16, wherein the coordinating node is specifically configured to:
    commit the transaction based on the third query statement in the plurality of query statements through a transaction commit protocol running on the coordinating node and the participating node, so as to achieve real-time consistency between the coordinating node and the participating node.
  18. The system according to claim 17, wherein
    the coordinating node is specifically configured to trigger pessimistic concurrency control when a write conflict occurs in the transaction; and
    the participating node is specifically configured to trigger optimistic concurrency control when a write conflict occurs in the transaction.
  19. A transaction processing system, wherein the transaction processing system comprises a client and the distributed database system according to any one of claims 10 to 18, and the distributed database system is configured to perform the corresponding transaction processing method based on query statements sent by the client.
  20. A cluster, comprising a plurality of computers, wherein each computer comprises a processor and a memory, the memory stores computer-readable instructions, and the processor executes the computer-readable instructions to cause the cluster to perform the transaction processing method according to any one of claims 1 to 9.
  21. A computer-readable storage medium, comprising computer-readable instructions, wherein when the computer-readable instructions are run on a computer, the computer is caused to perform the transaction processing method according to any one of claims 1 to 9.
PCT/CN2021/112643 2021-04-06 2021-08-14 事务处理方法、分布式数据库系统、集群及介质 WO2022213526A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21935746.4A EP4307137A1 (en) 2021-04-06 2021-08-14 Transaction processing method, distributed database system, cluster, and medium
CN202180004526.5A CN115443457A (zh) 2021-04-06 2021-08-14 事务处理方法、分布式数据库系统、集群及介质
US18/477,848 US20240028598A1 (en) 2021-04-06 2023-09-29 Transaction Processing Method, Distributed Database System, Cluster, and Medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110369369.X 2021-04-06
CN202110369369 2021-04-06
CN202110679707.X 2021-06-18
CN202110679707.XA CN115495495A (zh) 2021-06-18 2021-06-18 事务处理方法、分布式数据库系统、集群及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/477,848 Continuation US20240028598A1 (en) 2021-04-06 2023-09-29 Transaction Processing Method, Distributed Database System, Cluster, and Medium

Publications (1)

Publication Number Publication Date
WO2022213526A1 true WO2022213526A1 (zh) 2022-10-13

Family

ID=83545980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112643 WO2022213526A1 (zh) 2021-04-06 2021-08-14 事务处理方法、分布式数据库系统、集群及介质

Country Status (4)

Country Link
US (1) US20240028598A1 (zh)
EP (1) EP4307137A1 (zh)
CN (1) CN115443457A (zh)
WO (1) WO2022213526A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796874B (zh) * 2023-01-09 2023-05-09 杭州安节科技有限公司 Operation-level concurrent execution method for blockchain transactions
CN116302076B (zh) * 2023-05-18 2023-08-15 云账户技术(天津)有限公司 Method and apparatus for configuring configuration items based on parsing a configuration item table structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521028A (zh) * 2011-12-02 2012-06-27 华中科技大学 Transactional memory system in a distributed environment
CN106647412A (zh) * 2017-01-17 2017-05-10 爱普(福建)科技有限公司 Data sharing method between distributed controllers based on configuration components
US20180349418A1 (en) * 2017-06-06 2018-12-06 Sap Se Dynamic snapshot isolation protocol selection
CN109977171A (zh) * 2019-02-02 2019-07-05 中国人民大学 Distributed system and method for guaranteeing transaction consistency and linear consistency
CN111159252A (zh) * 2019-12-27 2020-05-15 腾讯科技(深圳)有限公司 Transaction execution method and apparatus, computer device, and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994037A (zh) * 2023-03-23 2023-04-21 天津南大通用数据技术股份有限公司 Cluster database load balancing method and apparatus

Also Published As

Publication number Publication date
US20240028598A1 (en) 2024-01-25
CN115443457A (zh) 2022-12-06
EP4307137A1 (en) 2024-01-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21935746; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2021935746; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2021935746; Country of ref document: EP; Effective date: 20231013)
NENP Non-entry into the national phase (Ref country code: DE)