CN113535656A - Data access method, device, equipment and storage medium

Data access method, device, equipment and storage medium

Info

Publication number
CN113535656A
Authority
CN
China
Prior art keywords
data
target
storage
node
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110709977.0A
Other languages
Chinese (zh)
Other versions
CN113535656B (en)
Inventor
柴云鹏
李海翔
周芳
吴坤尧
王子恺
杜小勇
潘安群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Renmin University of China
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China, Shenzhen Tencent Computer Systems Co Ltd filed Critical Renmin University of China
Priority to CN202110709977.0A priority Critical patent/CN113535656B/en
Publication of CN113535656A publication Critical patent/CN113535656A/en
Application granted granted Critical
Publication of CN113535656B publication Critical patent/CN113535656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data access method, apparatus, device, and storage medium, which belong to the technical field of databases. In a distributed database system, when a computing node receives a data read request, it determines, according to the data fragment to which the target data of the request belongs, the plurality of storage nodes that store the multiple copies of that target data. It then selects, from those storage nodes, a target storage node whose data access cost for accessing the target data meets a target condition, and the target storage node reads the target data. Because the target storage node is chosen according to data access cost, either the master storage node or a slave storage node can serve as the target storage node, and the master storage node no longer has to process every data read request. This preserves the high availability brought by multiple copies, speeds up data reading, and effectively improves the data access performance of the distributed database system.

Description

Data access method, device, equipment and storage medium
Technical Field
The present application relates to the field of database technologies, and in particular, to a data access method, apparatus, device, and storage medium.
Background
With the development of database technology, storing data as multiple copies has become a common way to improve the data access performance and availability of a distributed database system. Multi-copy technology replicates one piece of data into several copies and stores those copies on different nodes of the distributed database system. For example, copies of data of different value may be stored on different storage media according to the value of the data; as another example, copies of different data types may be stored in different data storage areas according to the data type. However, when data access is performed based on multi-copy technology, all data access requests have to be concentrated on one master node for processing, which makes load balancing harder to achieve, easily causes high time overhead and high data access latency, and leaves the distributed database system with poor data access performance.
Therefore, a data access method capable of improving the data access performance of the distributed database system is needed.
Disclosure of Invention
The embodiments of this application provide a data access method, apparatus, device, and storage medium, which can effectively improve the data access performance of a distributed database system. The technical solution is as follows:
in one aspect, a data access method is provided, which is applied to a distributed database system including a computing node and a plurality of storage nodes, and includes:
the method comprises the steps that a computing node responds to a first data reading request, determines a first data fragment to which first target data of the first data reading request belongs, and determines a plurality of first storage nodes from a plurality of storage nodes based on the first data fragment, wherein the plurality of first storage nodes are used for storing a plurality of copies of the first target data;
the computing node determines a first target storage node from the plurality of first storage nodes based on the first data reading request, and sends the first data reading request to the first target storage node, wherein the data access cost of the first target storage node meets a first target condition;
the first target storage node reads the first target data based on the first data reading request, and sends a first data reading result to the computing node.
In another aspect, a data access apparatus is provided, which is applied to a distributed database system, and includes:
a first determining module, configured to determine, in response to a first data read request, a first data fragment to which first target data of the first data read request belongs, and determine, based on the first data fragment, a plurality of first storage nodes from the plurality of storage nodes, where the plurality of first storage nodes are configured to store multiple copies of the first target data;
a second determining module, configured to determine a first target storage node from the plurality of first storage nodes based on the first data read request, and send the first data read request to the first target storage node, where a data access cost of the first target storage node meets a first target condition;
the first reading module is used for reading the first target data based on the first data reading request and sending a first data reading result to the computing module.
In an optional implementation, the first reading module includes:
a first reading unit, configured to read the first target data based on the first data reading request and the node type of the first target storage node if the first target data is current-state data;
and the second reading unit is used for reading the first target data based on the first data reading request and the transaction completion time of the first target data if the first target data is historical data.
In an alternative implementation, the first reading unit is configured to:
if the first target storage node is a main storage node, based on the first data reading request, determining a first reading index of the first target data, and reading the first target data from a state machine corresponding to the first target data by taking the first reading index as a starting point, wherein the first reading index is used for indicating a minimum reading index for reading the first target data based on the first data reading request;
if the first target storage node is a slave storage node, the first read index is acquired from the master storage node based on the first data read request, and the first target data is read from a state machine corresponding to the first target data by taking the first read index as a starting point.
In an optional implementation, the first reading unit is configured to:
updating the commit index at the current moment, wherein the commit index is used for indicating the maximum index of committed logs in a log list;
scanning the logs stored in the log list in a first order, wherein the first order means scanning from the commit index of the log list towards the execution index of the log list, and the execution index is used for indicating the maximum index of executed logs in the log list;
if a first target log exists, determining the first read index based on the log index of the first target log, wherein the data operated on by the first target log is the first target data; if no first target log exists, determining the first read index based on the execution index.
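To make the read-index determination above concrete, the following Go sketch shows one possible implementation. The log-entry structure, field names, and in-memory log map are illustrative assumptions, not the actual implementation: the log list is scanned from the commit index down towards the execution index, and the newest committed log that operates on the first target data supplies the first read index; if no such log exists, the execution index is used.

    package sketch

    // LogEntry is a hypothetical log record; Keys lists the data items it operates on.
    type LogEntry struct {
        Index int64
        Keys  map[string]bool
    }

    // FirstReadIndex scans the log list in the "first order", from the commit index
    // down to the execution index. If a committed-but-not-yet-executed log touches
    // the target data, its log index becomes the minimum read index; otherwise the
    // execution index is already sufficient for reading the target data.
    func FirstReadIndex(logs map[int64]LogEntry, commitIndex, executeIndex int64, targetKey string) int64 {
        for i := commitIndex; i > executeIndex; i-- {
            if entry, ok := logs[i]; ok && entry.Keys[targetKey] {
                return i // first target log found: the read must wait for this index
            }
        }
        return executeIndex // no pending log for the target data
    }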
In an optional implementation, the apparatus further comprises:
a first storage module, configured to store the first read index into a first list, where the first list includes the first target data, the first read index, and a first check index, where the first check index is used to indicate a commit index corresponding to the first target storage node when determining the first read index, and the commit index is used to indicate a maximum index of committed logs in a log list;
the first query module is configured to query the first list to read the first target data when the distributed database system processes a second data read request and data of the second data read request is the first target data.
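The "first list" above is essentially a small cache keyed by the target data. The Go sketch below is only an assumed shape of such a cache (the reuse rule in Lookup is a simplification of this idea, not a prescribed condition): each entry pairs the read index with the check index, i.e. the commit index observed when the read index was determined, so that a later read request for the same data can reuse the read index instead of recomputing it.

    package sketch

    // readIndexEntry is a hypothetical entry of the "first list".
    type readIndexEntry struct {
        readIndex  int64
        checkIndex int64 // commit index at the moment the read index was determined
    }

    // ReadIndexCache maps target data items to their cached read indexes.
    type ReadIndexCache struct {
        entries map[string]readIndexEntry
    }

    func NewReadIndexCache() *ReadIndexCache {
        return &ReadIndexCache{entries: make(map[string]readIndexEntry)}
    }

    // Put records the read index and check index determined for a data item.
    func (c *ReadIndexCache) Put(key string, readIndex, checkIndex int64) {
        c.entries[key] = readIndexEntry{readIndex: readIndex, checkIndex: checkIndex}
    }

    // Lookup returns a cached read index for key. The reuse rule here is an assumption
    // of this sketch: the entry is reused only while the current commit index has not
    // advanced past the recorded check index (no newer commits could affect the data).
    func (c *ReadIndexCache) Lookup(key string, currentCommitIndex int64) (int64, bool) {
        e, ok := c.entries[key]
        if !ok || currentCommitIndex > e.checkIndex {
            return 0, false
        }
        return e.readIndex, true
    }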
In an optional implementation, the first reading unit is configured to:
if the log corresponding to the first read index exists in the first target storage node, performing persistent storage on the log corresponding to the first read index, and reading the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point;
if the log corresponding to the first read index does not exist in the first target storage node, the log corresponding to the first read index is acquired from the main storage node, and the first target data is read from the state machine corresponding to the first target data by taking the first read index as a starting point.
In an alternative implementation, the second reading unit is configured to:
if the data submission time of the first target data is before the transaction completion time, based on the first data reading request and the transaction completion time, scanning logs stored in a log list according to a second sequence, determining a first reading index, and reading the first target data from a state machine corresponding to the first target data by taking the first reading index as a starting point;
if the data commit time of the first target data is after the transaction completion time, based on the first data read request and the transaction completion time, scanning the logs stored in the log list according to a third sequence, determining the first read index, and reading the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point;
the second order refers to an execution index indexed from the commit of the log list to the log list, the third order refers to a commit index indexed from the execution of the log list to the log list, the commit index is used for indicating a maximum index of committed logs in the log list, the execution index is used for indicating a maximum index of executed logs in the log list, and the first read index is used for indicating a minimum read index for reading the first target data based on the first data read request.
In an alternative implementation, the second reading unit is configured to:
if a second target log exists, and the transaction completion time of the first target data is the same as the transaction completion time of the second target log, or the transaction completion time of the first target data is after the transaction completion time of the second target log, determining the first read index based on the log index of the second target log, wherein the data operated by the second target log is the first target data;
if the second target log does not exist, the first read index is determined based on the execution index of the log list.
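For historical data, the read index is additionally bounded by the transaction completion time carried with the request. The Go sketch below is a simplified illustration covering only the "second order" case (scanning from the commit index towards the execution index); the field names and the time comparison are assumptions, not the actual implementation.

    package sketch

    import "time"

    // HistLogEntry is a hypothetical log record that carries the completion time of
    // the transaction which produced it and the data items it operates on.
    type HistLogEntry struct {
        Index       int64
        CompletedAt time.Time
        Keys        map[string]bool
    }

    // HistoricalReadIndex scans from the commit index down to the execution index and
    // returns the index of the newest log that operates on the target data and whose
    // transaction completed at or before the requested completion time ("second target
    // log"); if no such log exists, it falls back to the execution index of the log list.
    func HistoricalReadIndex(logs map[int64]HistLogEntry, commitIndex, executeIndex int64,
        targetKey string, completionTime time.Time) int64 {
        for i := commitIndex; i > executeIndex; i-- {
            e, ok := logs[i]
            if !ok || !e.Keys[targetKey] {
                continue
            }
            if !e.CompletedAt.After(completionTime) {
                return i
            }
        }
        return executeIndex
    }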
In an alternative implementation, the data access cost is used to indicate the execution time, the waiting time and the transmission time of the storage node;
the execution time comprises the time for the storage node to query the first target data, the data volume processing time, and the tuple construction time;
the waiting time comprises the request queue time, the equipment load delay time and the data synchronization time of the storage node;
the transmission time comprises a network transmission time.
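The cost model above decomposes naturally into three sums. The short Go sketch below only illustrates that decomposition; the field names and the simple additive combination are assumptions, since no particular weighting of the components is prescribed here.

    package sketch

    import "time"

    // AccessCost is a hypothetical breakdown of a storage node's data access cost.
    type AccessCost struct {
        QueryTime    time.Duration // time to query the first target data
        VolumeTime   time.Duration // time to process the data volume
        TupleTime    time.Duration // tuple construction time
        QueueTime    time.Duration // request queuing time
        LoadDelay    time.Duration // device load delay time
        SyncTime     time.Duration // data synchronization time
        TransmitTime time.Duration // network transmission time
    }

    // Total adds the execution time, the waiting time, and the transmission time.
    func (c AccessCost) Total() time.Duration {
        execution := c.QueryTime + c.VolumeTime + c.TupleTime
        waiting := c.QueueTime + c.LoadDelay + c.SyncTime
        return execution + waiting + c.TransmitTime
    }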
In an alternative implementation, the data access cost of the first target storage node meets a first target condition, which includes any one of:
the storage mode of the first target data in the first target storage node is a column storage mode, and the ratio of the number of columns to be accessed by the data reading request to the total number of columns is smaller than a first threshold value, wherein the storage mode is used for indicating the storage format of data in the storage node;
the node load of the first target storage node is less than the node load of storage nodes other than the first target storage node in the plurality of storage nodes;
the physical distance between the first target storage node and the computing node is smaller than the physical distance between the storage nodes except the first target storage node in the plurality of storage nodes and the computing node;
the data synchronization state of the first target storage node is subsequent to the data synchronization state of storage nodes other than the first target storage node in the plurality of storage nodes.
In an optional implementation, the apparatus further comprises:
and the adjusting module is used for dynamically adjusting the storage mode of the multiple copies of the first target data, and the storage mode is used for indicating the storage format of the data in the storage node.
In an alternative implementation, the adjustment module is configured to any one of:
switching the storage modes of the plurality of copies based on the load conditions of the plurality of first storage nodes;
if at least one copy exists in the plurality of copies, establishing at least one new copy based on the at least one copy;
if the first data fragment is subjected to data splitting, generating at least one second data fragment, and establishing a plurality of copies corresponding to the at least one second data fragment based on the at least one second data fragment;
the storage mode of the plurality of copies is adjusted based on the node type of the plurality of first storage nodes.
In an optional implementation manner, the switching the storage modes of the multiple copies based on the load conditions of the multiple first storage nodes includes any one of:
switching the storage mode of the plurality of copies based on the node load size and the available space of the plurality of first storage nodes;
and switching the storage modes of the plurality of copies based on the node load sizes of the plurality of first storage nodes and the number of copies in each storage mode.
In an optional implementation, the apparatus further comprises:
a third determining module, configured to, in response to a data write request, if a third data fragment to which second target data of the data write request belongs exists, determine a plurality of second storage nodes from the plurality of storage nodes based on the third data fragment, where the plurality of second storage nodes are used to store a plurality of copies of the second target data;
a sending module, configured to send the data write request to a main storage node in the plurality of second storage nodes;
and the first writing module is used for writing the second target data based on the data writing request and sending a first data writing result to the computing node.
In an optional implementation, the first writing module is configured to:
writing the second target data based on the data writing request, generating a data operation log, and sending a log synchronization request to a slave storage node in the plurality of storage nodes, wherein the log synchronization request is used for instructing the slave storage node to send a data synchronization message to the master storage node after the slave storage node synchronizes the data operation log;
and if the number of the data synchronization messages received by the master storage node is greater than or equal to half of the number of the slave storage nodes, confirming that the data write request has been executed successfully.
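The write acknowledgement rule above is a majority-style condition over the slave storage nodes. The following Go fragment is a minimal sketch of just that rule (the message type and names are assumptions): the master storage node counts the data synchronization messages returned by slaves and confirms the write once at least half of the slaves have acknowledged.

    package sketch

    // SyncAck is a hypothetical data synchronization message sent back by a slave
    // storage node after it has synchronized the data operation log.
    type SyncAck struct {
        FromNode string
        LogIndex int64
    }

    // writeConfirmed reports whether the master storage node may confirm the data
    // write request: the number of acknowledgements received must be greater than
    // or equal to half of the number of slave storage nodes.
    func writeConfirmed(acks []SyncAck, slaveCount int) bool {
        return len(acks)*2 >= slaveCount
    }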
In an optional implementation, the apparatus further comprises:
the first persistent storage module is used for performing persistent storage on the second target data;
and the second persistent storage module is used for performing format conversion on the second target data based on the data operation log and a storage mode of the second target data in the slave storage node, and performing persistent storage on the converted second target data, wherein the storage mode is used for indicating the storage format of data in the storage node.
In an optional implementation, the apparatus further comprises:
a fourth determining module, configured to determine a second read index of the second target data based on the log index of the data operation log, where the second read index is used to indicate a minimum read index for reading the second target data based on a third data read request;
a second storage module, configured to store the second read index into a second list, where the second list includes the second target data, the second read index, and a second check index, and the second check index is a log index of the data operation log;
and the second query module is used for querying the second list to read the second target data if the data of the third data read request is the second target data when the distributed database system processes the third data read request.
In an optional implementation, the apparatus further comprises:
a first establishing module, configured to, if a third data fragment to which the second target data of the data write request belongs does not exist, establish the third data fragment for the second target data and send a copy creation request to the plurality of storage nodes;
and the second establishing module is used for establishing a plurality of copies corresponding to the third data fragment based on the copy establishing request.
In an optional implementation manner, the second establishing module is configured to:
and establishing a plurality of copies corresponding to the third data fragment based on the copy creation request and the storage mode of the second target data in the plurality of storage nodes, wherein the storage mode is used for indicating the storage format of the data in the storage nodes.
In an optional implementation, the apparatus further comprises:
a fifth determining module, configured to, in response to a data read-write request, determine, based on a fourth data fragment to which third target data of the data read-write request belongs, a plurality of third storage nodes from the plurality of storage nodes, where the plurality of third storage nodes are configured to store a plurality of copies of the third target data;
a second reading module, configured to determine, for a read operation in the data read-write request, a second target storage node from the multiple third storage nodes based on the data read-write request, send the data read-write request to the second target storage node, where the second target storage node reads the third target data based on the data read-write request, and sends a second data reading result to the computing node, where a data access cost of the second target storage node meets a second target condition;
and the second writing module is used for sending the data reading and writing request to a main storage node in the plurality of third storage nodes for the writing operation in the data reading and writing request, and the main storage node writes the third target data based on the data reading and writing request and sends a second data writing result to the computing node.
In an optional implementation manner, a slave storage node in the plurality of third storage nodes is configured with a memory lock, and the memory lock is used for locking the third target data when the write operation is not completed yet.
In an optional implementation, the apparatus further comprises:
a sixth determining module, configured to, when a fourth storage node in the plurality of storage nodes is to become the master storage node through election, have each slave storage node in the plurality of storage nodes determine a timeout time based on its current storage mode and write performance parameter, where the storage mode is used to indicate the storage format of data in the storage node;
and the state switching module is used for switching a first slave storage node to the candidate state to participate in the next election if the first slave storage node does not receive a message from the master storage node within its corresponding timeout time.
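The idea behind the storage-mode-dependent timeout above can be sketched briefly: a slave whose storage mode makes writes (and therefore log replay) slower uses a longer election timeout, so it is less likely to become the next master. The base timeout, the column-store penalty, and the jitter in the Go sketch below are all assumptions, not values taken from the description above.

    package sketch

    import (
        "math/rand"
        "time"
    )

    // electionTimeout derives a slave storage node's timeout from its current storage
    // mode and a write performance parameter (here, its average log apply latency).
    func electionTimeout(columnStore bool, avgApplyLatency time.Duration) time.Duration {
        base := 150 * time.Millisecond
        if columnStore {
            base += 100 * time.Millisecond // column-store writes are assumed slower
        }
        base += 2 * avgApplyLatency
        // Randomized jitter, as in typical Raft-style elections, to avoid split votes.
        return base + time.Duration(rand.Intn(150))*time.Millisecond
    }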
In another aspect, a computer device is provided, which includes a processor and a memory, where the memory is used to store at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations performed in the data access method in the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to implement the operations performed in the data access method in the embodiments of the present application.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, causing the computer device to perform the data access method provided in the various alternative implementations described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
In a distributed database system, when a computing node receives a data read request, it determines, according to the data fragment to which the target data of the request belongs, the plurality of storage nodes that store the multiple copies of that target data. It then selects, from those storage nodes, a target storage node whose data access cost for accessing the target data meets a target condition, and the target storage node reads the target data. Because the target storage node is chosen according to data access cost, either the master storage node or a slave storage node can serve as the target storage node, and the master storage node no longer has to process every data read request. This preserves the high availability brought by multiple copies, speeds up data reading, and effectively improves the data access performance of the distributed database system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a data access method provided according to an embodiment of the present application;
fig. 2 is a schematic architecture diagram of an HTAP database system according to an embodiment of the present application;
FIG. 3 is a flow chart of a data access method provided according to an embodiment of the present application;
FIG. 4 is a flow chart of a data access method provided according to an embodiment of the present application;
FIG. 5 is a schematic diagram of determining a first read index according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a first read index being stored according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a processing flow of a read request at a master node according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a processing flow of a read request at a slave node according to an embodiment of the present application;
FIG. 9 is a schematic diagram of determining a first read index according to an embodiment of the present application;
FIG. 10 is a diagram illustrating a process flow of an old data read request according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a processing flow of a non-latest data read request according to an embodiment of the present application;
FIG. 12 is a flow chart of a data access method provided according to an embodiment of the present application;
FIG. 13 is a schematic diagram illustrating a storage of a second read index according to an embodiment of the present application;
FIG. 14 is a flow chart of a data access method provided according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a read semi-committed problem provided in accordance with an embodiment of the present application;
fig. 16 is a schematic structural diagram of a data access device according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a server provided according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms.
These terms are only used to distinguish one element from another. For example, a first storage node can be referred to as a second storage node, and similarly, a second storage node can also be referred to as a first storage node, without departing from the scope of the various examples. The first storage node and the second storage node may both be storage nodes, and in some cases, may be separate and distinct storage nodes.
For example, at least one storage node may be an integer number of storage nodes greater than or equal to one, such as one storage node, two storage nodes, three storage nodes, and the like. The plurality of storage nodes means two or more, and for example, the plurality of storage nodes may be two storage nodes, three storage nodes, or any integer number of storage nodes equal to or greater than two.
Before introducing the embodiments of the present application, some basic concepts in the cloud technology field need to be introduced:
cloud Technology (Cloud Technology): the cloud computing business mode management system is a management technology for unifying series resources such as hardware, software, networks and the like in a wide area network or a local area network to realize data calculation, storage, processing and sharing, namely is a general name of a network technology, an information technology, an integration technology, a management platform technology, an application technology and the like applied based on a cloud computing business mode, can form a resource pool, is used as required, and is flexible and convenient. Cloud computing technology will become an important support in the field of cloud technology. Background services of the technical network system require a large amount of computing and storage resources, such as video websites, picture-like websites and more web portals. With the high development and application of the internet industry, each article may have its own identification mark and needs to be transmitted to a background system for logic processing, data in different levels are processed separately, and various industrial data need strong system background support and can be realized through cloud computing.
Cloud Storage (Cloud Storage): the distributed cloud storage system (hereinafter referred to as a storage system) refers to a storage system which integrates a large number of storage devices (storage devices are also referred to as storage nodes) of different types in a network through application software or application interfaces to cooperatively work through functions of cluster application, grid technology, distributed storage file systems and the like, and provides data storage and service access functions to the outside.
Database (Database): in short, it can be regarded as an electronic file cabinet, that is, a place for storing electronic files, and a user can add, query, update, delete, etc. to data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of the application.
The distributed database system according to the embodiment of the present application may be any type of database system based on Multi-Version Concurrency Control (MVCC). In the embodiment of the present application, the type of the distributed database system is not particularly limited.
At least one node device may be included in the distributed database system, and a database of each node device may have a plurality of data tables stored therein, each data table being operable to store one or more data items (also referred to as variable versions). The database of the node device may be any type of distributed database, and may include at least one of a relational database and a non-relational database, such as a Structured Query Language (SQL) database, NoSQL, NewSQL (broadly, various new extensible/high performance databases), and the like, where in this embodiment, the type of the database is not specifically limited.
In some embodiments, the embodiments of the present application may also be applied to a distributed database system based on a blockchain technology (hereinafter referred to as "blockchain system"), where the blockchain system essentially belongs to a decentralized distributed database system, a consensus algorithm is used to keep ledger data recorded by different node devices on a blockchain consistent, an encryption algorithm is used to ensure encrypted transmission and non-falsification of the ledger data between different node devices, a script system is used to extend the ledger function, and network routing is used to interconnect different node devices.
One or more blockchains may be included in the blockchain system, where a blockchain is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions for verifying the validity (anti-counterfeiting) of the information and generating a next blockchain.
Node devices in the blockchain system may form a Peer-To-Peer (P2P) network, and the P2P Protocol is an application layer Protocol operating on a Transmission Control Protocol (TCP). In the blockchain system, any node device may have the following functions: 1) routing, a basic function that the node device has for supporting communication between the node devices; 2) the application is used for being deployed in a block chain, realizing specific business according to actual business requirements, recording data related to the realization function to form account book data, carrying a digital signature in the account book data to represent a data source, sending the account book data to other node equipment in the block chain system, and adding the account book data to a temporary block when the other node equipment successfully verifies the data source and integrity of the account book, wherein the business realized by the application can comprise a wallet, a shared account book, an intelligent contract and the like; 3) and the block chain comprises a series of blocks which are mutually connected according to the sequential time sequence, the new blocks cannot be removed once being added into the block chain, and the blocks record the account book data submitted by the node equipment in the block chain system.
In some embodiments, each block may include a hash value of the block storing the transaction record (hash value of the block) and a hash value of a previous block, and the blocks are connected by the hash value to form a block chain.
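As a small illustration of the chaining just described (not part of the claimed method), the Go sketch below links blocks by storing, in each block, a hash computed over its transaction records together with the previous block's hash; tampering with an earlier block therefore changes every later hash.

    package sketch

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // Block is a minimal illustration of a hash-chained data block.
    type Block struct {
        Transactions []byte // the batch of network transaction records
        PrevHash     string // hash of the previous block
        Hash         string // hash over this block's transactions and PrevHash
    }

    func NewBlock(transactions []byte, prevHash string) Block {
        sum := sha256.Sum256(append(append([]byte{}, transactions...), []byte(prevHash)...))
        return Block{
            Transactions: transactions,
            PrevHash:     prevHash,
            Hash:         hex.EncodeToString(sum[:]),
        }
    }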
Some basic terms to which embodiments of the present application relate are described below:
multiple copies are heterogeneous: multiple copies of the same piece of data take on different storage structures on the disk. For example, one copy stores data in a line memory format, and each storage record is the content of an entire line of data of the table. One copy stores data according to a column storage format, and each storage record is the content of one or some column values of one row of data in the table.
Data fragmentation: the data management method is characterized in that a data fragment has a plurality of copies in the smallest logical unit of data management in a distributed database system, and the plurality of copies control the copy consistency among data through a multi-copy consistency protocol. In some embodiments, when a data fragment is newly created in the distributed database system, the fragment information of the data fragment is stored. For example, the fragment information includes a data range of the data fragment, node information of each of the multiple copies, storage mode information of each copy, and the like, which is not limited in this embodiment of the application.
Transaction: the transaction is a logic unit of the database system in the process of executing the operation, is formed by a limited database operation sequence and is the minimum execution unit of the database system operation.
Data item: a transaction is a unit of data in a database system, and a data item is an actor (or data object being manipulated) of a database operation, and in some embodiments, is also referred to as a variable. One data item may be a tuple (tuple) or record (record), or a page (page) or a table (table) object, etc. A data item may contain several versions of the data item (hereinafter also referred to as "versions") and each time a transaction updates the data item, a new version of the data item is added, each version of the data item may be identified by a natural number as a version number, and the larger the version number, the newer the version number of the data item. In some embodiments, the distributed database system determines the data fragment to which the data access request needs to access according to the data item to which the data access request needs to access when receiving the data access request, for example, the distributed database system may calculate the data range to which the data item belongs according to the data item identifier, so as to determine the data fragment to which the data item belongs, which is not limited in this embodiment of the present application.
The operation is as follows: a database operation is composed of three parts of operation type, transaction and data item version, wherein the operation type can comprise Read (Read, R) and Write (Write, W).
Distributed consensus: consensus is one of the most important abstractions of a distributed system, specifically that all nodes in a distributed system agree on a proposal. That is, after one or more processes propose a value, a global approval method is used to make all processes in the distributed system agree on the value.
And (3) Raft: a distributed consensus protocol is used for managing log consistency, is a consistency algorithm which is widely used in engineering, and has the characteristics of strong consistency, decentralization, easiness in understanding, development and implementation and the like. Raft divides roles in the distributed system into a Leader (Leader), a Follower (Follower), and a Candidate (Candidate). In a Raft cluster, there is one and only one Leader node (also referred to as a master node) that is responsible for processing requests and synchronizing data to a Follower node (also referred to as a slave node). Meanwhile, the Raft provides a perfect error handling mechanism, thereby ensuring high availability of data.
Leader node: the Raft protocol selects a Leader node through an election mechanism, and the Leader node is responsible for receiving client requests and for log replication. After receiving a request, the Leader node writes the request into its own log and synchronizes the log to the Follower nodes; after the log has been synchronized to a majority of the nodes, the Leader node notifies the Follower nodes to commit the log. During its term, the Leader node uses a heartbeat mechanism to inform the Follower nodes that it is still working normally.
Follower node: a Follower node accepts and persists the logs synchronized by the Leader, and commits a log after the Leader tells it that the log can be committed. If a Follower node does not receive a heartbeat from the Leader node within the election timeout, it converts its state to Candidate and sends messages to the other nodes to initiate a new round of Leader election.
MVCC: is a common concurrency control for database systems. It is intended to solve the problem of multiple or long read operations blocking write operations due to read-write locks. The data item read by each transaction is a snapshot. A write operation does not overwrite an existing data item, but creates a new version of the data item that does not become visible until the operation commits.
Full state data: data in a distributed database system may include three states based on state attributes: the data processing method comprises a current state, a transition state and a history state, wherein the three states are collectively called a 'full state of data', the 'full state of data' is short for full state data, and different state attributes in the full state data can be used for identifying the state of the data in a life cycle track of the data. Wherein, Current State (Current State): the latest version of the tuple is the data in the current phase, in other words, the state of the data in the current phase, which is called the current state. Transition State (Transitional State): the data in the transition state, which is not the latest version or the history state version of the tuple, is called half-decay data in the process of converting from the current state to the history state. Historical State (Historical State): the tuple is in a state of history whose value is the old value and not the current value. The state of the data in the history phase is referred to as the history state. The historical state of a tuple can be multiple, and the process of state transition of data is reflected. Data in a history state can only be read and cannot be modified or deleted.
Metadata: data about the organization, data domains, and relationships of data, mainly describing data properties. It is used to support functions such as indicating the fragment information of data fragments, storage locations, historical data, resource searching, file recording, and the like.
State Machine (State Machine): short for finite state automaton, a mathematical model abstracted from the operating rules of real things.
The following describes an implementation environment of the data access method provided by the embodiment of the present application.
Fig. 1 is a schematic diagram of an implementation environment of a data access method according to an embodiment of the present application. As shown in fig. 1, the embodiment of the present application may be applied to a distributed database system, which may include a gateway server 101, a global timestamp generation cluster 102, a distributed storage cluster 103, and a distributed coordination system 104 (e.g., ZooKeeper), where the distributed storage cluster 103 may include a data node device and a coordination node device.
The gateway server 101 is configured to receive an external read-write request, and distribute a read-write transaction corresponding to the read-write request to the distributed storage cluster 103, for example, after a user logs in an Application terminal on a terminal, the Application terminal is triggered to generate the read-write request, and an Application Programming Interface (API) provided by a distributed database system is called to send the read-write request to the gateway server 101, where the API may be MySQL API (API provided by a relational database system), for example.
In some embodiments, the gateway server 101 may be merged with any data node device or any coordinating node device in the distributed storage cluster 103 on the same physical machine, that is, a certain data node device or coordinating node device is allowed to act as the gateway server 101.
Global Timestamp generation cluster 102 is configured to generate Global commit timestamps (Global Timestamp, Gts) for Global transactions, which are also referred to as distributed transactions, and refer to transactions involving multiple data node devices, for example, a Global read transaction may involve reading data stored on multiple data node devices, and a Global write transaction may involve writing data on multiple data node devices. The global timestamp generation cluster 102 may be logically regarded as a single point, but in some embodiments, a service with higher availability may be provided through a one-master-three-slave architecture, and the generation of the global commit timestamp is implemented in a cluster form, so that a single point failure may be prevented, and a single point bottleneck problem is avoided.
Optionally, the global commit timestamp is a globally unique and monotonically increasing timestamp identifier in the distributed database system, and can be used to mark a global commit order of each transaction, so as to reflect a true temporal precedence relationship between the transactions (a full order relationship of the transactions), where the global commit timestamp may use at least one of a physical Clock, a Logical Clock, a Hybrid physical Clock, or a Hybrid Logical Clock (HLC), and the embodiment of the present application does not specifically limit the type of the global commit timestamp.
In one exemplary scenario, the global commit timestamp may be generated as a hybrid physical clock and may consist of eight bytes. The first 44 bits can be the value of the physical timestamp (i.e., a Unix timestamp, accurate to milliseconds), which can represent 2^44 unsigned integers in total and therefore can theoretically represent physical timestamps covering about 557 years. The last 20 bits can be a monotonically increasing count within a given millisecond, so that there are 2^20 (about 1 million) global commit timestamps available per millisecond. Based on this data structure, if the transaction throughput of a single machine (any data node device) is 100,000 transactions per second, a distributed storage cluster 103 containing 10,000 node devices can theoretically be supported; meanwhile, the number of global commit timestamps represents the total number of transactions that the system can theoretically support, namely (2^44 - 1) * 2^20 transactions. This definition of the global commit timestamp is merely an exemplary description; according to different business requirements, the number of bits of the global commit timestamp may be expanded to support more nodes and transactions.
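A minimal sketch of packing and unpacking such an 8-byte timestamp follows; the names are illustrative, and the bit layout matches the 44-bit millisecond timestamp plus 20-bit in-millisecond counter described above.

    package sketch

    const counterBits = 20 // low 20 bits: monotonically increasing count within one millisecond

    // PackGts combines a millisecond-precision physical timestamp (high 44 bits) and an
    // in-millisecond counter (low 20 bits) into one 64-bit global commit timestamp.
    func PackGts(physicalMillis int64, counter uint32) uint64 {
        return uint64(physicalMillis)<<counterBits | uint64(counter)&((1<<counterBits)-1)
    }

    // UnpackGts splits a global commit timestamp back into its two components.
    func UnpackGts(gts uint64) (physicalMillis int64, counter uint32) {
        return int64(gts >> counterBits), uint32(gts & ((1 << counterBits) - 1))
    }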
In some embodiments, the global timestamp generation cluster 102 may be physically separate or may be incorporated with the distributed coordination system 104 (e.g., ZooKeeper).
The distributed storage cluster 103 may include data node devices and coordination node devices, each coordination node device may correspond to at least one data node device, the division between the data node devices and the coordination node devices is for different transactions, taking a certain global transaction as an example, an initiating node of the global transaction may be referred to as a coordination node device, other node devices involved in the global transaction are referred to as data node devices, the number of the data node devices or the coordination node devices may be one or more, and the number of the data node devices or the coordination node devices in the distributed storage cluster 103 is not specifically limited in the embodiments of the present application. Because the distributed database system provided by this embodiment lacks a global transaction manager, an XA (eXtended Architecture, X/Open organization distributed transaction specification)/2 PC (Two-Phase Commit) technology may be adopted in the system to support transactions (global transactions) across nodes, so as to ensure atomicity and consistency of data during write operation across nodes, at this time, the coordinator node device is configured to serve as a coordinator in a 2PC algorithm, and each data node device corresponding to the coordinator node device is configured to serve as a participant in the 2PC algorithm.
Optionally, each data node device or coordination node device may be a stand-alone device, or may also adopt a master/backup structure (that is, a master/backup cluster), as shown in fig. 1, which is exemplified by taking a node device (data node device or coordination node device) as a master/backup cluster, each node device includes a host and two backup devices, optionally, each host or backup device is configured with a proxy (agent) device, the proxy device may be physically independent from the host or backup device, of course, the proxy device may also be used as a proxy module on the host or backup device, taking the node device 1 as an example, the node device 1 includes a master database and a proxy device (master database + agent, abbreviated as master + agent), and in addition, includes two backup databases and a proxy device (backup database + agent, abbreviated as backup DB + agent).
In an exemplary scenario, a SET of database instances of a host or a backup corresponding to each node device is referred to as a SET (SET), for example, if a certain node device is a stand-alone device, the SET of the node device is only a database instance of the stand-alone device, and if a certain node device is a master-backup cluster, the SET of the node device is a SET of a host database instance and two backup database instances, at this time, consistency between data of the host and duplicate data of the backup may be ensured based on a strong synchronization technique of a cloud database, optionally, each SET may perform linear expansion to cope with business processing requirements in a large data scenario, and in some financial business scenarios, a global transaction generally refers to transfer across SETs.
The distributed coordination system 104 may be configured to manage at least one of the gateway server 101, the global timestamp generation cluster 102, or the distributed storage cluster 103, and optionally, a technician may access the distributed coordination system 104 through a scheduler (scheduler) on the terminal, so as to control the distributed coordination system 104 on the back end based on the scheduler on the front end, thereby implementing management on each cluster or server. For example, a technician may control the ZooKeeper to delete a node device from the distributed storage cluster 103 through the scheduler, that is, to disable a node device.
Fig. 1 is an architecture diagram providing a lightweight global transaction, and is a kind of distributed database system. The whole distributed database system can be regarded as a large logical table which is commonly maintained, data stored in the large table is scattered to each node device in the distributed storage cluster 103 through a main key, and the data stored on each node device is independent of other node devices, so that the node devices can horizontally divide the large logical table. In the system, each data table in each database can be stored in a distributed manner after being horizontally divided, so that the system can also be visually referred to as an architecture with "database division table".
In some embodiments, the distributed database system formed by the gateway server 101, the global timestamp generation cluster 102, the distributed storage cluster 103, and the distributed coordination system 104 may be regarded as a server providing data services to a user terminal, where the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, Network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. Optionally, the user terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Based on the implementation environment, in some embodiments, the distributed database system may be a Hybrid Transaction and Analytical Processing (HTAP) database system. The HTAP database system is a database system supporting simultaneous Processing of On-Line Transaction Processing (OLTP) service and On-Line Analytical Processing (OLAP) service. In other words, an HTAP database system is a database system that supports both online update tasks and online analytics query requests. It should be noted that, the HTAP database avoids a large amount of data interaction between online and offline data, and the innovative computing storage framework thereof can also support flexible capacity expansion, thereby better coping with the challenges brought by high concurrency.
The architecture of an HTAP database system according to an embodiment of the present application is described below with reference to fig. 2. Fig. 2 is a schematic architecture diagram of an HTAP database system according to an embodiment of the present application. As shown in fig. 2, the HTAP database system may include four parts: a compute layer 201, a distributed coherency protocol layer 202, a storage layer 203, and a metadata management layer 204.
the computation layer 201 has several functions:
1) and the connection processing function is used for establishing reliable connection with the terminal according to the connection request sent by the terminal.
2) A query plan making function, configured to receive a data access request sent by a terminal, and perform processing such as syntax analysis on the data access request, so as to generate a query plan (this part will be described in detail in subsequent embodiments, and therefore will not be described herein again).
3) And an executing function (also referred to as an encapsulating function) for encapsulating the operation related to the data access request with a transaction and transmitting the data access request to the corresponding node according to the query plan.
In some embodiments, the computing layer 201 is configured with at least one computing node for implementing the above functions, which is not limited in this application.
The distributed consistency protocol layer 202 provides several functions:
1) A log synchronization function, used to control data synchronization and ensure data consistency among the copies. Optionally, the log synchronization function is used to ensure that a slave node has data synchronized with the master node when the slave node performs a read operation (this part is described in detail in the following embodiments and is not repeated here).
2) A master node election function, used to reselect a master node when the current master node fails (this part is described in detail in the following embodiments and is not repeated here).
3) A state machine security function, used to receive and process data access requests from the computing layer 201 and to control access to the data in the storage layer 203 (this part is described in detail in subsequent embodiments and is not repeated here).
In some embodiments, the functions of the distributed consistency protocol layer 202 are implemented by at least one computing node configured in the computing layer 201, which is not limited in this application.
The storage layer 203 provides several functions:
1) A data access request processing function, used to receive and process data access requests from the distributed consistency protocol layer 202.
2) A copy management function, used to manage multiple copies of the data and ensure data consistency among the copies.
In some embodiments, the storage layer 203 includes a state machine and a format converter. The state machine holds data fragments in different storage modes; one data fragment has multiple copies stored on different storage nodes, and the copies are synchronized through a log (Log) according to a distributed consistency protocol. The data in the log can be persisted into the state machine through a log replay step, which facilitates data access. The format converter is used to convert data into the corresponding storage format according to a specified storage mode before the state machine persists it. For example, the storage modes of the multiple copies of the data are a row storage format and a column storage format, which is not limited in this embodiment of the present application. It should be noted that the combination of storage modes shown in fig. 2 is only exemplary, and the number of copies in each storage mode may be configured according to the actual situation of the system and its load, which is not limited in this embodiment of the present application.
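Illustratively, the following Go sketch shows one possible shape of such a format converter, transposing row-format records into a column-format layout before persistence; the type and function names (RowRecord, ColumnStore, ConvertRowsToColumns) are assumptions made for illustration and are not part of the embodiments above.

```go
package main

import "fmt"

// RowRecord is a hypothetical row-format record: one value per column.
type RowRecord struct {
	Values []string
}

// ColumnStore is a hypothetical column-format layout: one slice per column,
// so a query touching few columns only needs to read those slices.
type ColumnStore struct {
	Columns [][]string
}

// ConvertRowsToColumns sketches a format converter: it transposes
// row-format records into a column-format layout before persistence.
func ConvertRowsToColumns(rows []RowRecord, numCols int) ColumnStore {
	cs := ColumnStore{Columns: make([][]string, numCols)}
	for _, r := range rows {
		for c := 0; c < numCols && c < len(r.Values); c++ {
			cs.Columns[c] = append(cs.Columns[c], r.Values[c])
		}
	}
	return cs
}

func main() {
	rows := []RowRecord{
		{Values: []string{"1001", "Alice", "92"}},
		{Values: []string{"1002", "Bob", "85"}},
	}
	cs := ConvertRowsToColumns(rows, 3)
	fmt.Println(cs.Columns[2]) // only the third column: [92 85]
}
```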
In some embodiments, the storage layer 203 is configured with at least one storage node for implementing the above functions, which is not limited in this application.
The metadata management layer 204 provides several functions:
1) A recording function, used to record the fragment information of the data fragments and the log synchronization status of each copy of the data.
2) A routing function, used to provide routing services to the computation layer 201 and the storage layer 203.
It should be noted that, in the architecture shown in fig. 2, the numbers of computing nodes and storage nodes are only illustrative examples, and in some embodiments, the HTAP database system may include more or fewer computing nodes and storage nodes, which is not limited in this embodiment of the present application. By adopting an architecture that separates computation from storage, the distribution and configuration of the storage nodes can be changed easily, which improves the flexibility of the distributed database system, makes it suitable for many different scenarios, and gives it wide applicability.
In addition, the computing layer 201 may be configured with the gateway server 101 shown in fig. 1, and the distributed coordination system 104 shown in fig. 1 may be configured to manage the entire HTAP system, that is, the distributed coordination system 104 schedules at least one of the node devices involved in the HTAP system, which is not limited in this embodiment of the present application.
The distributed database system shown in fig. 2 is illustratively provided with both storage and computation functionality. In the storage layer, the system splits the data into fragments, and each data fragment corresponds to several copies; in principle, every copy can serve data access. The storage modes of the different copies of a data fragment are not necessarily the same. In the computing layer, the system can judge the type of a data access request from its analyzed statistics, determine the copy with the better expected performance, and, taking load balancing into account, send the data access request to the selected copy. After the node where that copy is located executes the data access request, the data is returned to the computing layer and then returned to the terminal through the computing layer. The upper-layer user does not perceive the heterogeneity of the underlying storage modes; the distributed database system itself manages the multiple heterogeneous copies. Based on this distributed database system, the method of storing and accessing data with a multi-copy heterogeneous data model guarantees the high availability brought by multiple copies and the data consistency among them, increases the concurrency of the system, and improves the performance of the system under mixed loads.
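Illustratively, the following Go sketch, using assumed type names (ComputeNode, StorageNode, ReadRequest), outlines the flow just described, in which the computing layer forwards a read request to a chosen copy and relays the result back; how the copy is chosen is deliberately left as a placeholder here, since the cost-based selection is detailed in later embodiments.

```go
package main

import "fmt"

// ReadRequest and ReadResult are hypothetical messages between the layers.
type ReadRequest struct{ Key string }
type ReadResult struct{ Value string }

// StorageNode stands in for a storage-layer node holding one copy.
type StorageNode struct {
	ID   string
	data map[string]string
}

// Execute reads the requested item from this node's copy (its state machine).
func (n *StorageNode) Execute(req ReadRequest) ReadResult {
	return ReadResult{Value: n.data[req.Key]}
}

// ComputeNode stands in for a computing-layer node.
type ComputeNode struct {
	replicas []*StorageNode
}

// HandleRead picks a copy (here simply the first one; the cost-based choice
// is described in later embodiments), forwards the request to it, and
// returns the result to the caller, which stands in for the terminal.
func (c *ComputeNode) HandleRead(req ReadRequest) ReadResult {
	target := c.replicas[0]
	return target.Execute(req)
}

func main() {
	rowCopy := &StorageNode{ID: "node-A", data: map[string]string{"x": "42"}}
	colCopy := &StorageNode{ID: "node-B", data: map[string]string{"x": "42"}}
	cn := &ComputeNode{replicas: []*StorageNode{rowCopy, colCopy}}
	fmt.Println(cn.HandleRead(ReadRequest{Key: "x"}).Value) // 42
}
```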
Next, a data access method provided by an embodiment of the present application will be described based on the distributed database system shown in fig. 1 and fig. 2.
Firstly, multi-copy data storage is a commonly used technology in distributed database systems, and a multi-copy mechanism can ensure high availability of data. At present, however, distributed database systems generally adopt a homogeneous multi-copy mechanism, that is, the copies of the data use the same organization model and storage structure. This storage mechanism makes a user request behave the same on every copy (without regard to node load and hardware differences). Secondly, managing multiple copies of data with a distributed consistency protocol can ensure data consistency, but the strict master-slave mechanism in related distributed consistency protocols requires all requests to be concentrated on one copy for processing, which increases the difficulty of load balancing and is more likely to cause performance bottlenecks. Thirdly, in a distributed consistency protocol (e.g., Raft), before a read request is executed, the data must be executed (applied) from the log to the state machine, and when a write request is executed, the master node must wait for the relevant logs to be executed (applied) to the state machine before the write transaction is considered successful.
Therefore, the traditional multi-copy technology does not fully exploit the performance advantages of multiple copies and only improves data availability, while the traditional distributed consistency protocol suffers from high read and write latency overhead.
In view of this, embodiments of the present application provide a data access method, which selects the most appropriate copy for each data access request while comprehensively considering factors such as load balancing, data synchronization state, and the type of the user request, so as to improve the overall access performance of the distributed database system. For data read requests, the distributed consistency protocol used to manage the data copies is modified: on the premise of guaranteeing correct data consistency, a flow in which slave nodes respond to data read requests is added, and a data reading method based on a Relaxed Read Index (RRI) is provided, which increases system concurrency, accelerates the processing of data read requests, and improves the overall read performance of the distributed database system. For data write requests, the write flow of the distributed consistency protocol is optimized and a data writing method based on Commit Return (CR) is provided, which increases the data write speed, accelerates the return of data write requests, and improves the overall write performance of the distributed database system. For data read-write requests, the RRI-based data reading method and the CR-based data writing method are combined, which improves the processing speed of read-write transactions, accelerates the return of data read-write requests, improves the overall read-write performance of the distributed database system, increases system concurrency, reduces the occupation of disk and network bandwidth, and effectively improves system throughput.
The data access method provided by the embodiment of the present application will be described below by taking different types of data access requests as examples through several embodiments. In the following embodiments, corresponding data access methods are provided for different data access requests, so that the data access performance of the distributed database system is effectively improved.
Fig. 3 is a flowchart of a data access method provided in an embodiment of the present application, and as shown in fig. 3, the embodiment is applied to a distributed database system, where the distributed database system includes a computing node and a plurality of storage nodes. In the embodiment shown in fig. 3, the data access method is applied to the HTAP database system shown in fig. 2, and the interaction between the compute node and the storage node is taken as an example. This embodiment includes the following steps.
301. The computing node responds to a first data reading request, determines a first data fragment to which first target data of the first data reading request belongs, and determines a plurality of first storage nodes from a plurality of storage nodes based on the first data fragment, wherein the plurality of first storage nodes are used for storing a plurality of copies of the first target data.
In an embodiment of the present application, the first data read request is for requesting to read the first target data. Optionally, the first target data is current state data or historical state data, which is not limited in this embodiment of the application.
Optionally, the computing node parses the first data reading request to obtain a data item identifier of the first target data, determines a first data fragment of the first target data in the current distributed database system according to the data item identifier, and determines a plurality of first storage nodes from the plurality of storage nodes according to fragment information of the first data fragment.
Optionally, the distributed database system includes a metadata management layer; after determining the first data fragment to which the first target data belongs, the computing node sends an information acquisition request for the first data fragment to the metadata management layer and receives the fragment information of the first data fragment returned by the metadata management layer. For example, the fragment information includes: node information of the storage node corresponding to each of the multiple copies of the first target data (including the node type, the node load, the physical distance from the computing node, and the like), storage information of the storage mode corresponding to each of the multiple copies, the data range of the multiple copies, node information of the main storage node, and the like. The embodiment of the present application does not limit the specific contents of the fragment information of the first data fragment.
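Illustratively, the fragment information listed above could be represented by a structure such as the following Go sketch; the field names (FragmentInfo, NodeInfo, and so on) are assumed for illustration and do not correspond to an actual interface of the system.

```go
package main

import "fmt"

// NodeInfo describes one storage node holding a copy of the fragment
// (node type, node load, physical distance from the computing node, ...).
type NodeInfo struct {
	NodeID     string
	NodeType   string  // e.g. "master" or "slave"
	Load       float64 // current node load
	DistanceKM float64 // physical distance from the computing node
}

// FragmentInfo is a hypothetical shape for the fragment information of the
// first data fragment returned by the metadata management layer.
type FragmentInfo struct {
	FragmentID   string
	Copies       []NodeInfo // node info for each of the multiple copies
	StorageModes []string   // storage mode corresponding to each copy
	DataRange    [2]string  // data range covered by the copies
	MasterNode   NodeInfo   // node info of the main storage node
}

func main() {
	info := FragmentInfo{
		FragmentID:   "fragment-1",
		Copies:       []NodeInfo{{NodeID: "node-A", NodeType: "master", Load: 0.3, DistanceKM: 1}},
		StorageModes: []string{"row"},
		DataRange:    [2]string{"a", "m"},
	}
	info.MasterNode = info.Copies[0]
	fmt.Println(info.FragmentID, len(info.Copies))
}
```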
302. The computing node determines a first target storage node from the plurality of first storage nodes based on the first data reading request, and sends the first data reading request to the first target storage node, wherein the data access cost of the first target storage node meets a first target condition.
In the embodiment of the application, the plurality of first storage nodes realize data consistency based on a distributed consistency protocol. The plurality of first storage nodes comprise a master storage node and at least one slave storage node, and each first storage node can respond to the first data reading request to read the copy of the first target data stored in the corresponding node. Optionally, the plurality of first storage nodes implement data consistency based on a distributed consistency protocol, for example, the distributed consistency protocol is a Raft protocol, which is not limited in this embodiment of the present application.
Optionally, the distributed database system includes a metadata management layer, and after determining a first data fragment to which first target data belongs, the computing node acquires, based on the metadata management layer, related information of the first data fragment, and selects, according to the related information and a first data reading request, a storage node whose data access cost meets a first target condition from a plurality of first storage nodes, determines the storage node as the first target storage node (a specific implementation of determining the first target storage node will be described later, which is not described herein again), and sends the first data reading request to the first target storage node.
303. The first target storage node reads the first target data based on the first data reading request, and sends a first data reading result to the computing node.
In this embodiment of the application, after receiving the first data reading request, the first target storage node reads the first target data from the state machine corresponding to the first target data based on the first data reading request, and sends a reading result to the computing node. Optionally, the first target storage node determines a corresponding reading process according to the node type of the first target storage node and the data type of the first target data (a specific implementation of determining the reading process will be described later, and is not described herein again), reads the first target data from the state machine corresponding to the first target data according to the reading process, and sends the reading result to the computing node, which is not limited in this embodiment of the present application.
In the data access method provided in the embodiment of the present application, in a distributed database system, when a computing node receives a data read request, it determines, according to the data fragment to which the target data of the request belongs, the plurality of storage nodes storing the multiple copies of that target data; it then selects, according to the data access cost of each storage node for accessing the target data, a target storage node meeting a target condition, and the target data is read by that target storage node. Because the target storage node is determined according to the data access cost, both the master storage node and the slave storage nodes can become the target storage node, which prevents the master storage node from having to process all data read requests. This preserves the high availability brought by multiple copies, increases the data reading speed, and effectively improves the data access performance of the distributed database system.
The embodiment shown in fig. 3 is a brief description of the data access method provided in the present application, and the data access method provided in the embodiment of the present application is described in detail below with reference to fig. 4.
Fig. 4 is a flowchart of a data access method provided in an embodiment of the present application. As shown in fig. 4, the embodiment is applied to a distributed database system, where the distributed database system includes a computing node and a plurality of storage nodes. The embodiment shown in fig. 4 is described by taking as an example the case in which the data access method is applied to the HTAP database system shown in fig. 2, the request type of the data access request is a data read request, and the compute node interacts with the storage nodes. This embodiment includes the following steps.
401. The computing node establishes a connection with the terminal based on the connection request sent by the terminal.
In the embodiment of the application, the computing node is connected with the terminal through a wireless network or a wired network, the terminal responds to the operation of a user and sends a connection request to the computing node, and the computing node receives the connection request and establishes reliable connection with the terminal based on the connection request.
402. The computing node responds to a first data reading request, determines a first data fragment to which first target data of the first data reading request belongs, and determines a plurality of first storage nodes from a plurality of storage nodes based on the first data fragment, wherein the plurality of first storage nodes are used for storing a plurality of copies of the first target data.
In the embodiment of the present application, the manner of determining the plurality of first storage nodes by the computing node is the same as that in step 301, and therefore, the description thereof is omitted here.
In some embodiments, the storage modes of the multiple copies include a row storage mode, a column storage mode, a cross mode, and the like, which is not limited in this application. Wherein the storage mode is used for indicating the storage format of the data in the storage node. Several of the storage modes referred to above are described below.
First, the row storage mode.
Illustratively, referring to table 1 (where m and n are positive integers), in the state machine of the storage node, each piece of data corresponds to a complete row of records in table 1, as shown by Row 2 in table 1. It should be noted that, in some embodiments, in a table in the row storage mode, the row sequence (which may also be called the main row sequence, and is not limited in this application) may be adjusted as required. For example, when a piece of data is a student score expressed as student (student ID, name, score), the column (Col) corresponding to the student ID, the column corresponding to the name, and the column corresponding to the score may be taken as the main row sequence, which is not limited in the embodiment of the present application.
TABLE 1

|       | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | …… | Col n |
| Row 1 |       |       |       |       |       |    |       |
| Row 2 | xxx   | xxx   | xxx   | xxx   | xxx   | xxx | xxx  |
| Row 3 |       |       |       |       |       |    |       |
| Row 4 |       |       |       |       |       |    |       |
| Row 5 |       |       |       |       |       |    |       |
| ……    |       |       |       |       |       |    |       |
| Row m |       |       |       |       |       |    |       |
Second, the column storage mode.
Illustratively, referring to table 2 (m and n in the table are positive integers), in the state machine of the storage node, each piece of data corresponds to the value of one column in one row of table 2, as shown by Row 3, Col 3 in table 2.
TABLE 2

|       | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | …… | Col n |
| Row 1 |       |       |       |       |       |    |       |
| Row 2 |       |       |       |       |       |    |       |
| Row 3 |       |       | xxx   |       |       |    |       |
| Row 4 |       |       |       |       |       |    |       |
| Row 5 |       |       |       |       |       |    |       |
| ……    |       |       |       |       |       |    |       |
| Row m |       |       |       |       |       |    |       |
Third, the cross mode.
Illustratively, referring to table 3 (m and n in the table are positive integers), in the state machine of the storage node, each piece of data corresponds to a data range where several rows and several columns intersect in table 3, as shown in table 3. In some embodiments, the cross mode is also referred to as a Tile mode, which is not limited by the embodiments of the present application.
TABLE 3

|       | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | …… | Col n |
| Row 1 | xxx   | xxx   | xxx   | xxx   | xxx   |    |       |
| Row 2 | xxx   | xxx   | xxx   | xxx   | xxx   |    |       |
| Row 3 | xxx   | xxx   | xxx   | xxx   | xxx   |    |       |
| Row 4 | xxx   | xxx   |       |       |       |    |       |
| Row 5 | xxx   | xxx   |       |       |       |    |       |
| ……    |       |       |       |       |       |    |       |
| Row m |       |       |       |       |       |    |       |
It should be noted that the above storage modes are only illustrative, and in some embodiments, the multiple copies may also be stored in the storage node in other storage modes, and the embodiment of the present application does not limit the specific types of the storage modes.
In some embodiments, the first storage nodes dynamically adjust the storage patterns of the copies (which can also be understood as dynamically managing the copies of the data) during operation of the distributed database system. Several cases of dynamic adjustment are described below:
Case one: the plurality of first storage nodes switch the storage modes of the multiple copies based on the load conditions of the plurality of first storage nodes.
Wherein, the first case includes the following two scenarios:
the first scenario and the first storage nodes switch the storage modes of the copies based on the node load size and the available space of the first storage nodes.
Scenario one may include two switching schemes, depending on the node load and the available space of the storage node. The two schemes are described below, taking as an example a copy of the first data fragment stored on storage node A, whose original storage mode is the S1 mode and whose storage mode after switching is the S2 mode.
Scheme one: when the node load of storage node A is small and the available space of the node is large, the storage mode of the copy is switched on storage node A itself.
Illustratively, taking the Raft protocol as the distributed consistency protocol, storage node A establishes a new temporary copy of the first data fragment; the temporary copy is not counted as a member of the Raft group (that is, the temporary copy does not participate in the Raft processes and only synchronizes data). Storage node A reads all the data of the copy in the S1 mode, converts it into new data in the S2 mode, persists the new S2-mode data into the state machine through the temporary copy, synchronizes to the temporary copy the logs that the S1-mode copy has not yet persisted, destroys the S1-mode copy after synchronization is complete, and adds the S2-mode temporary copy to the Raft group as a new copy (which can also be understood as a new member). A simplified sketch of this procedure is given after these cases.
Scheme two: when the node load of storage node A is large or the available space of the node is small, the storage mode of the copy is switched on a storage node B whose node load is small and whose available space is large.
Illustratively, taking the Raft protocol as the distributed consistency protocol, storage node B establishes a new temporary copy of the first data fragment; the temporary copy is not counted as a member of the Raft group (that is, the temporary copy does not participate in the Raft processes and only synchronizes data). The data of the S1-mode copy is transferred to storage node B through a snapshot, converted into new data in the S2 mode, and persisted into the state machine through the temporary copy; the logs that the S1-mode copy has not yet persisted are synchronized to the temporary copy, the S1-mode copy is destroyed after synchronization is complete, and the S2-mode temporary copy is added to the Raft group as a new copy (which can also be understood as a new member).
It should be noted that, in the two switching schemes, the change of the members of the Raft group, the transfer of data between the storage nodes, and the persistent storage of the snapshot to the state machine, etc. can be implemented through the original process of the Raft protocol, so that the switching scheme ensures the correctness of the distributed database system. In addition, the switching of the storage mode relates to reading and writing of the state machine, so that in the switching process, a corresponding switching scheme is selected according to the node load size and the available space, the running condition of the state machine is fully considered, and the availability of the distributed database system is ensured.
Scenario two: the plurality of first storage nodes switch the storage modes of the multiple copies based on the node loads of the plurality of first storage nodes and the number of copies in each storage mode.
In the distributed database system, the number of copies in each storage mode is not constant; in other words, the configuration policy of the multiple copies is not constant. According to the node loads of the first storage nodes, and on the premise that each storage mode retains at least one available copy, the plurality of first storage nodes switch a copy on a lightly loaded storage node into the storage mode used by the copies on the heavily loaded storage nodes.
Illustratively, the storage modes of the multiple copies include the row storage mode, the column storage mode, and the cross mode, with L copies in the row storage mode, M copies in the column storage mode, and N copies in the cross mode (L, M, and N are positive integers). If, during a certain period of time, the storage nodes holding the column-storage copies process a large number of data access requests and therefore carry a large node load, then a copy in a storage mode that processed fewer data access requests during that period (that is, whose node load is small) and whose number of copies is greater than 1 (that is, each storage mode is guaranteed to keep an available copy) is switched to the column storage mode.
It should be noted that, during system operation, switching the storage modes of the multiple copies in time according to the node loads and the number of copies in each storage mode, on the premise that each storage mode keeps an available copy, allows the system to better cope with load changes during operation and effectively improves the data access performance of the distributed database system.
Case two: when at least one of the multiple copies becomes abnormal, the plurality of first storage nodes establish at least one new copy based on the at least one abnormal copy.
A copy exception means that the copy is unavailable; it can also be understood as the copy failing to work normally because an error has occurred in it. When at least one of the multiple copies is abnormal, the plurality of first storage nodes establish at least one new copy according to the number of abnormal copies, so that the total number of copies matches the preset total number.
For any one of the at least one new copy, if other copies in the same storage mode already exist in the system and the other copies are already synchronized with the main storage node, the storage node where the new copy is located synchronizes data from the other copies, and if other copies in the same storage mode do not exist in the system, the data is synchronized from the main storage node in a snapshot transfer mode and is converted into the corresponding storage mode for persistent storage.
In some embodiments, the storage pattern of the at least one new copy is the same as the storage pattern of the at least one copy that experienced the exception. In some embodiments, the storage mode of the at least one new copy is different from the storage mode of the at least one copy in which the exception occurs, and is determined by the current copy configuration policy of the distributed database system, that is, the storage mode of the at least one new copy does not necessarily coincide with the state before the exception occurs, which is not limited in this embodiment of the present application.
It should be noted that, in this second case, the copy exception, the availability of the data fragment during and after the new copy is established, and the correctness of the copy data are guaranteed by a distributed consistency protocol, for example, the distributed consistency protocol is a Raft protocol. In addition, in the running process of the system, the availability of each copy in the system is ensured by timely establishing a new copy under the condition that the copy is abnormal, so that the data access performance of the distributed database system is effectively improved.
Case three: if the first data fragment undergoes data splitting and at least one second data fragment is generated, the plurality of first storage nodes establish multiple copies corresponding to the at least one second data fragment based on the at least one second data fragment.
Data splitting means that when a data fragment is too large or too hot, the distributed database system performs a re-fragmentation operation on it in order to balance the load, so that the data fragment is split into two data fragments.
In some embodiments, the storage mode of the plurality of copies corresponding to the at least one second data slice is the same as the storage mode of the plurality of copies of the first data slice. In some embodiments, a storage pattern of the multiple copies corresponding to the at least one second data fragment is different from a storage pattern of the multiple copies of the first data fragment, and is determined by a current copy configuration policy of the distributed database system, that is, the storage pattern of the multiple copies corresponding to the at least one second data fragment is not necessarily consistent with a state before the data split occurs, which is not limited in this embodiment of the present application.
In some embodiments, when a copy in the row storage mode undergoes data splitting, only part of the data needs to be migrated; when a copy in the column storage mode undergoes data splitting, reorganizing all the rows in the column storage mode is costly, so after the row-storage copy has been split, the split data is converted into the column storage mode, and the column-storage copy that was to be split is destroyed after the conversion is complete. In this way, the amount of data the distributed database system needs to process during data splitting can be reduced, and the data splitting efficiency is improved.
Case four: the plurality of first storage nodes adjust the storage modes of the multiple copies based on the node types of the plurality of first storage nodes.
The plurality of first storage nodes include a master storage node and slave storage nodes, and storage nodes of different node types use different storage modes. For example, the storage mode of the master storage node is the column storage mode and the storage modes of the slave storage nodes include the row storage mode and the cross mode; when a storage node is elected as the master storage node, that storage node adjusts the storage mode of its copy to the column storage mode, which is not limited in this embodiment of the present application.
It should be noted that, the above dynamic adjustment cases are only illustrative, and in some embodiments, the multiple first storage nodes may also dynamically adjust the multiple copies in other manners, and the embodiment of the present application is not limited to the specific manner of dynamic adjustment. In addition, in some embodiments, the plurality of first storage nodes implement switching of the storage mode through the format converter, which may specifically refer to the storage layer 203 in the HTAP database system shown in fig. 2, and details of this application are not repeated herein.
By dynamically adjusting the multiple copies of the data in the operation process of the distributed database system, the situation that the load changes in the operation process of the system can be better dealt with, and the data access performance of the distributed database system is effectively improved.
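Illustratively, the following Go sketch outlines, under assumed names (Copy, SwitchMode), the temporary-copy procedure of scheme one above for switching a copy from the S1 mode to the S2 mode; it is a simplification and omits the actual Raft membership changes, snapshot transfer, and persistence machinery.

```go
package main

import "fmt"

// StorageMode is the storage format of a copy (hypothetical).
type StorageMode string

// Copy is a simplified stand-in for one copy of a data fragment.
type Copy struct {
	Mode        StorageMode
	Data        []string // data persisted in the state machine
	PendingLogs []string // logs not yet persisted
	RaftMember  bool     // whether this copy counts as a Raft group member
}

// SwitchMode sketches scheme one: build a temporary copy in the target mode,
// convert and persist the existing data, catch up on the logs that were not
// yet persisted, then replace the old copy in the Raft group with the
// temporary copy.
func SwitchMode(old *Copy, target StorageMode, convert func(string) string) *Copy {
	tmp := &Copy{Mode: target, RaftMember: false} // temporary copy, only synchronizes data
	for _, d := range old.Data {                  // read the S1-mode data and convert it to S2
		tmp.Data = append(tmp.Data, convert(d))
	}
	for _, l := range old.PendingLogs { // synchronize the logs the S1 copy has not persisted
		tmp.Data = append(tmp.Data, convert(l))
	}
	old.RaftMember = false // destroy the S1-mode copy after synchronization finishes
	tmp.RaftMember = true  // add the S2-mode copy to the Raft group as a new member
	return tmp
}

func main() {
	rowCopy := &Copy{Mode: "row", Data: []string{"r1", "r2"}, PendingLogs: []string{"r3"}, RaftMember: true}
	colCopy := SwitchMode(rowCopy, "column", func(s string) string { return "col(" + s + ")" })
	fmt.Println(colCopy.Mode, colCopy.Data, colCopy.RaftMember)
}
```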
403. The computing node determines a first target storage node from the plurality of first storage nodes based on the first data reading request, and sends the first data reading request to the first target storage node, wherein the data access cost of the first target storage node meets a first target condition.
In the embodiment of the application, the computing node determines the data access cost of the plurality of first storage nodes based on the first data reading request and the node information of the plurality of first storage nodes, takes the first storage nodes meeting a first target condition as first target storage nodes, and sends the first data reading request to the first target storage nodes. In some embodiments, the first target condition is that the data access cost corresponding to the first storage node is the lowest.
In some embodiments, the data access cost is used to indicate the execution time, waiting time, and transmission time of the storage node. The execution time includes the time for the storage node to query the first target data, the time for processing the data amount (that is, the Input/Output (IO) data amount), and the tuple construction time; the waiting time includes the request queue time, the device load delay time, and the data synchronization time of the storage node; the transmission time includes the network transmission time.
Illustratively, for convenience of description, the notation for the times involved in the data access cost is shown in table 4:

TABLE 4

| Symbol   | Meaning                    |
| T        | data access cost           |
| T_proc   | execution time             |
| T_search | query time                 |
| T_io     | processing (IO) time       |
| T_cons   | tuple construction time    |
| T_wait   | waiting time               |
| T_queue  | request queue time         |
| T_load   | device load delay time     |
| T_sync   | data synchronization time  |
| T_trans  | transmission time          |

As shown in table 4, the data access cost T = execution time T_proc + waiting time T_wait + transmission time T_trans, where the execution time T_proc = query time T_search + processing time T_io + tuple construction time T_cons, and the waiting time T_wait = request queue time T_queue + device load delay time T_load + data synchronization time T_sync.
In some embodiments, the data access cost of the first target storage node meets the first target condition, which includes the following cases:
Case one: the storage mode of the first target data in the first target storage node is the column storage mode, and the ratio between the number of columns to be accessed by the data read request and the total number of columns is smaller than a first threshold.
The first threshold is a preset threshold; in some embodiments, the first threshold may be adjusted as required, which is not limited in the embodiment of the present application. The above case can also be summarized as the data read request being a wide-table, few-column query, that is, a query request involving only a small number of columns (i.e., "few columns") on a table with a relatively large number of columns (i.e., a "wide table").
It should be noted that, in the distributed database system, the storage mode of a copy is related to the execution time of the storage node. Illustratively, referring to table 4, the query time T_search and processing time T_io involved in querying a column-storage copy are less than the query time T_search and processing time T_io involved in querying a row-storage copy, while the more columns a query on a column-storage copy involves, the longer the tuple construction time T_cons. Therefore, for a data read request in which the ratio of the number of columns to be accessed to the total number of columns in the table is smaller than the first threshold, the execution time T_proc of the storage node storing the column-storage copy is the smallest, and the data access cost of that storage node is correspondingly the lowest.
Case two: the node load of the first target storage node is smaller than the node loads of the storage nodes other than the first target storage node among the plurality of first storage nodes.
The node load of the storage node where a copy is located is related to the waiting time of the storage node. Illustratively, referring to table 4, the smaller the node load of a storage node, the smaller the request queue time T_queue and device load delay time T_load of that storage node. Therefore, if the node load of a storage node is the smallest, its waiting time T_wait is the smallest, and the data access cost of that storage node is correspondingly the lowest.
Case three: the physical distance between the first target storage node and the computing node is smaller than the physical distance between the computing node and any storage node other than the first target storage node among the plurality of first storage nodes.
The physical distance between the storage node where a copy is located and the computing node is related to the transmission time. Illustratively, referring to table 4, the smaller the physical distance between a storage node and the computing node, the smaller the network transmission time T_trans of that storage node. Therefore, if the physical distance between a storage node and the computing node is the smallest, its transmission time T_trans is the smallest, and the data access cost of that storage node is correspondingly the lowest. In addition, because the multiple copies may be dispersed across different machine rooms or even different data centers, taking the first storage node closest to the computing node as the first target storage node can effectively reduce the network transmission overhead.
Case four: the data synchronization state of the first target storage node is more up to date than the data synchronization states of the storage nodes other than the first target storage node among the plurality of first storage nodes.
The data synchronization state of the storage node where a copy is located is related to the waiting time of the storage node. Illustratively, referring to table 4, the more up to date the data synchronization state of a storage node, the shorter the data synchronization time T_sync of that storage node. For a data read request, the request needs to wait until the logs of the storage node have been synchronized to the latest state; therefore, if the data synchronization state of a storage node is the most up to date, its waiting time T_wait is the smallest, and the data access cost of that storage node is correspondingly the lowest.
In some embodiments, the four cases shown above can also be understood as four policies for determining, by the computing node, the first target storage node through the data access cost, that is, by the policies, a storage node with the lowest data access cost is selected from the plurality of first storage nodes, and the storage node is taken as the first target storage node.
In some embodiments, the computing node determines the first target storage node according to any one of the policies described above. In other embodiments, the computing node combines the above policies, for example by weighting, to determine the first target storage node. For example, referring to table 4, for any first storage node, the data access cost is expressed as: T_1 = a_1(T_search + T_io) + (1/a_1)T_cons + a_2(T_queue + T_load) + a_3 T_trans + a_4 T_sync, where a_1, a_2, a_3, and a_4 are the weights corresponding to the four policies. It should be understood that the weights and the selection of policies may be adjusted according to the actual situation, for example, by selecting only two of the policies and combining them by weight, which is not limited in the embodiment of the present application. Additionally, in some embodiments, because of the effects of node performance, network speed, and the like, the execution time T_proc, the waiting time T_wait, and the transmission time T_trans of a storage node differ in magnitude, so the weights should be set so that the portion with the longer total time is made as fast as possible.
In some embodiments, the data access cost is used to indicate the execution time, waiting time, and transmission time of the storage node per unit time. That is, on the basis of the data access cost T shown in table 4 above, the ratio of the data access cost T to the time T_0 for the storage node to randomly read one data page is taken as the data access cost T'. In this way, the data access costs of the storage nodes can be put on a common scale, which improves the accuracy of the computing node in determining the first target storage node.
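Illustratively, the weighted cost model and the normalization above can be sketched in Go as follows; the type and function names (CostInputs, Weights, Cost, PickTarget) and the sample values are assumptions for illustration only.

```go
package main

import "fmt"

// CostInputs holds the estimated time components of table 4 for one storage node.
type CostInputs struct {
	NodeID               string
	Tsearch, Tio, Tcons  float64 // execution-time components
	Tqueue, Tload, Tsync float64 // waiting-time components
	Ttrans               float64 // transmission time
	T0                   float64 // time to randomly read one data page (for normalization)
}

// Weights are the policy weights a1..a4 used when combining the policies.
type Weights struct{ A1, A2, A3, A4 float64 }

// Cost computes T1 = a1(Tsearch+Tio) + (1/a1)Tcons + a2(Tqueue+Tload) + a3*Ttrans + a4*Tsync,
// then divides by T0 so the costs of different storage nodes are comparable.
func Cost(c CostInputs, w Weights) float64 {
	t1 := w.A1*(c.Tsearch+c.Tio) + (1/w.A1)*c.Tcons +
		w.A2*(c.Tqueue+c.Tload) + w.A3*c.Ttrans + w.A4*c.Tsync
	return t1 / c.T0
}

// PickTarget returns the candidate whose data access cost is the lowest.
func PickTarget(cands []CostInputs, w Weights) string {
	best, bestCost := "", 0.0
	for i, c := range cands {
		if cost := Cost(c, w); i == 0 || cost < bestCost {
			best, bestCost = c.NodeID, cost
		}
	}
	return best
}

func main() {
	w := Weights{A1: 1, A2: 1, A3: 1, A4: 1}
	cands := []CostInputs{
		{NodeID: "node-A", Tsearch: 2, Tio: 3, Tcons: 1, Tqueue: 1, Tload: 1, Ttrans: 2, T0: 1},
		{NodeID: "node-B", Tsearch: 1, Tio: 1, Tcons: 2, Tsync: 1, Ttrans: 1, T0: 1},
	}
	fmt.Println(PickTarget(cands, w)) // node-B
}
```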
It should be noted that, through the above step 403, the computing node takes a storage node whose data access cost meets the target condition among the plurality of first storage nodes as the first target storage node; this may also be referred to as the process in which the computing node formulates a query plan. In this process, factors such as the storage mode, load balancing, and the delay of the request are all considered, so that the concurrency and storage advantages of multi-copy storage are exploited to the greatest extent and the overall performance of the system is improved. The optimal copy is selected as the access target according to the data access cost of each storage node, and the data read request is sent to the storage node corresponding to that copy, which reduces the possibility of bottlenecks and improves the overall throughput of the system. In addition, because the computing node determines the first target storage node according to the data access cost, both the master storage node and the slave storage nodes may become the first target storage node; in other words, in the distributed database system of the embodiment of the present application, data is allowed to be read from storage nodes of different types, which improves the read concurrency of multi-copy storage, preserves the high availability and the data consistency among copies brought by multiple copies, increases the concurrency of the system, and effectively improves the data access performance of the system.
Through the above steps 401 to 403, after receiving the data read request sent by the terminal, the computing node determines the storage nodes where the multiple copies of the first target data of the data read request are located, and determines the first target storage node according to the data access cost of each storage node. An embodiment in which the first target storage node reads the first target data is described below through steps 404 to 411.
404. The first target storage node determines the first target data to be current state data based on the data type of the first target data.
In the embodiment of the application, the first target storage node reads the first target data by adopting different data reading modes according to different data types of the first target data. When the first target data is current-state data, the first target storage node reads the first target data based on the first data read request and the node type of the first target storage node (i.e., steps 405 to 408 described below).
405. If the first target storage node is a master storage node, the first target storage node performs the following steps 406 and 408, and if the first target storage node is a slave storage node, the first target storage node performs the following steps 407 and 408.
In the embodiment of the application, when the first target storage node receives the data read request sent by the computing node, it reads the first target data in different ways according to the node type of the current node. When the first target storage node is the master storage node, the first target storage node reads the first target data according to its Relaxed Read Index (referred to as RRI for convenience of description) (that is, the following step 406; the meaning of the RRI is described in detail in step 406 and is not repeated here). When the first target storage node is a slave storage node, the first target storage node reads the first target data after acquiring the RRI from the master storage node (that is, step 407).
406. The first target storage node determines a first read index of the first target data based on the first data read request, and reads the first target data from a state machine corresponding to the first target data with the first read index as a starting point, where the first read index is used to indicate a minimum read index for reading the first target data based on the first data read request.
In this embodiment, the first target storage node is a main storage node, the first Read Index of the first target data is an RRI of the first target data, and a value of the first Read Index is smaller than or equal to a value of a Read Index (Read Index). In general, RRI indicates the minimum Read Index (Read Index) that the current storage node can accept based on the data Read request, and it can be ensured that the current storage node reads the latest data as long as the execution Index (Apply Index) is ensured to be greater than RRI. In some embodiments, the manner in which the first target storage node reads the first target data according to the RRI may also be referred to as a Relaxed Read (RRR).
It should be understood that in a related distributed consistency protocol (such as the Raft protocol), in order to guarantee the consistency of data, a storage node needs to execute (Apply) all the committed (Commit) logs into a state machine before reading the data so as to guarantee the consistency of the data, and this process causes high time overhead, resulting in poor data access performance. In the embodiment of the application, by introducing the index RRI, the number of logs to be executed by the storage node is reduced on the premise of ensuring the correctness of the read data, so that the data is read in advance, the processing efficiency of the data reading request is improved, and the data access performance of the distributed database system is effectively improved.
The following describes a specific implementation of determining, by the first target storage node in step 406, the first read index of the first target data, including the following steps 4061 to 4063:
4061. The first target storage node updates the commit index at the current time, where the commit index is used to indicate the maximum index of the committed logs in the log list.
Updating the commit index at the current time by the first target storage node means that the Read Index required by the data read request is set to the Commit Index of the current node.
In some embodiments, the first target storage node updates the Commit Index at the current time after confirming that its node type is the master storage node. Because a network partition may cause the first target storage node to still consider itself the master storage node even though a newer master storage node already exists in the cluster, confirming the identity first and then updating the commit index avoids this situation and ensures the accuracy of data reading.
4062. The first target storage node scans the logs stored in the log list in a first order, where the first order is from the commit index of the log list to the execution index, and the execution index is used to indicate the maximum index of the executed (applied) logs in the log list.
In the log list, the value of the execution index (Apply Index) is smaller than the value of the Commit Index. The first target storage node scanning the logs stored in the log list in the first order can also be understood as the first target storage node scanning the log list backward from the Commit Index, toward smaller indexes, until the Apply Index is reached.
4063. If a first target log exists, the first target storage node determines the first read index based on the log index of the first target log, where the first target log is a log whose operated data is the first target data; if no first target log exists, the first target storage node determines the first read index based on the execution index.
During step 4062, if a scanned log operates on the first target data, that log is the first target log, and the first read index is set to the Log Index of the first target log; if no first target log exists in the log list after scanning is finished, the first read index is set to the Apply Index of the log list.
The manner of determining the first read index shown in steps 4061 to 4063 described above is exemplified below with reference to fig. 5. Fig. 5 is a schematic diagram of determining a first read index according to an embodiment of the present disclosure. As shown in fig. 5, the Log indexes (Log Index) in the Log list have values of 1 to 8, which represent Log 1(Log1) to Log 8(Log 8), respectively, the execution Index (Apply Index) is 2, and the Commit Index (Commit Index) is 7. Illustratively, the first target storage node scans from Log 7(Log 7) one by one, and if there is one Log satisfying that "the data operated by the Log is the first target data", sets the first read Index (i.e. RRI) as the Log Index (Log Index) of the Log, and ends the scanning in advance. For example, taking the first target data as x, the RRI of the first target data x is 7. Similarly, taking the first target data as y as an example, the RRI of the first target data y is 6; taking the first target data as z as an example, the RRI of the first target data z is 3; taking the first target data as w as an example, the RRI of the first target data w is 2.
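Illustratively, steps 4061 to 4063 and the example of fig. 5 can be sketched in Go as follows; the log representation and contents are assumed here only so that the results match the RRI values given above for fig. 5.

```go
package main

import "fmt"

// LogEntry is a simplified log record: Index is the log index and Key is the
// data item the log operates on.
type LogEntry struct {
	Index int
	Key   string
}

// RelaxedReadIndex scans the log list from commitIndex down to applyIndex.
// If a log operating on key is found, its log index is the RRI; otherwise the
// RRI is the apply index (steps 4061-4063).
func RelaxedReadIndex(logs []LogEntry, applyIndex, commitIndex int, key string) int {
	for idx := commitIndex; idx > applyIndex; idx-- {
		for _, e := range logs {
			if e.Index == idx && e.Key == key {
				return idx // first target log found: RRI = its log index
			}
		}
	}
	return applyIndex // no first target log: RRI = apply index
}

func main() {
	// Apply index 2 and commit index 7, as in the example of fig. 5; the log
	// contents are chosen so the results match the RRI values given above.
	logs := []LogEntry{
		{1, "a"}, {2, "w"}, {3, "z"}, {4, "b"},
		{5, "c"}, {6, "y"}, {7, "x"}, {8, "x"},
	}
	fmt.Println(RelaxedReadIndex(logs, 2, 7, "x")) // 7
	fmt.Println(RelaxedReadIndex(logs, 2, 7, "y")) // 6
	fmt.Println(RelaxedReadIndex(logs, 2, 7, "z")) // 3
	fmt.Println(RelaxedReadIndex(logs, 2, 7, "w")) // 2
}
```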
Through steps 4061 to 4063, the first target storage node determines a first read index of the first target data, and when the execution index in the state machine of the first target storage node is greater than the first read index, the first target data is read from the state machine corresponding to the first target data with the first read index as a starting point. By the method, on the premise of ensuring the correctness of the read first target data, the processing time of the data reading request is reduced, so that the data access performance of the distributed database system is effectively improved.
In some embodiments, after the first target storage node determines the first read index of the first target data, the first read index is stored so that when the same data read request is received again, the corresponding data is read by looking up a table. This alternative embodiment is described below in two steps:
Step one: the first target storage node stores the first read index in a first list, where the first list includes the first target data, the first read index, and a first check index, the first check index is used to indicate the commit index corresponding to the first target storage node when the first read index was determined, and the commit index is used to indicate the maximum index of the committed logs in the log list.
Step two: when the distributed database system processes a second data read request, if the data of the second data read request is the first target data, the first list is queried to read the first target data.
When the distributed database system queries the first list to read the first target data, the following two cases arise:
Case one: the current Commit Index is equal to the first Check Index corresponding to the first target data in the first list, and the RRI corresponding to the first target data in the first list is used as the RRI corresponding to the second data read request to read the first target data.
Case two: the current Commit Index is greater than the first Check Index corresponding to the first target data in the first list, which indicates that the recorded RRI is not necessarily up to date, that is, data read according to that RRI is not necessarily the latest data. In this case, the distributed database system needs to scan the logs stored in the log list according to the method shown in steps 4062 to 4063 to determine the RRI corresponding to the second data read request. The scan starts from the log corresponding to the current Commit Index in the log list and proceeds until the first Check Index is reached.
In some embodiments, if the number of data items in the first list is greater than or equal to a preset threshold, at least one data item whose RRI is smaller than the execution index (Apply Index), together with its RRI, is deleted from the first list. For example, the preset threshold is 100, which is not limited in the embodiments of the present application.
Referring to fig. 6, an alternative embodiment of storing the first read index and processing the second data read request is illustrated. Fig. 6 is a schematic diagram illustrating a first read index storage according to an embodiment of the present application. As shown in fig. 6, the first target storage node stores the result of scanning the log list in a list, where the list includes the data item, the RRI corresponding to the data item, and a Check Index (Check Index), where a value of the Check Index (Check Index) is equal to a value of the Commit Index (Commit Index). When the distributed database system processes the second data read request, the RRI and Check Index (Check Index) are first looked up in the table. And if the Commit Index (Commit Index) is equal to the Check Index (Check Index), taking the value of the RRI in the table as the RRI corresponding to the second data read request. Illustratively, as shown in (a) and (b) of fig. 6, when the second data read request reads x, no additional operation is required since the Commit Index (Commit Index) in the table is equal to the Check Index (Check Index). If the Commit Index (Commit Index) is greater than the Check Index (Check Index), it means that the RRI at this time is not necessarily the latest, and the log list needs to be scanned, and the scanning starts from the log corresponding to the Commit Index (Commit Index) in the log list until the Check Index (Check Index) is reached. In FIG. 6, (c) and (d) are the same as (a) and (b), and therefore, the description thereof is omitted.
It should be noted that storing the first read index in the first list avoids repeatedly scanning the log list and reduces the amount of data to be processed, thereby saving computing resources. In addition, since the information in the first list is updated each time a data read request is processed, this alternative implementation may also be referred to as a read-update method. It should be appreciated that experiments show that the gap between the execution index (Apply Index) and the Commit Index typically does not exceed 10, so the first list is small enough to be kept in memory. If the first list is full, it indicates that the state machine of the current storage node lags far behind the log. Maintenance of this table may therefore be suspended and data read requests blocked, and service is resumed once the Apply Index catches up with the Commit Index. Furthermore, because the information in the first list can be derived entirely from the state of the distributed consistency protocol at any time, it does not need to be stored persistently. When failure recovery is needed, the table is rebuilt with the method above from the current state of the distributed consistency protocol.
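Illustratively, a minimal Go sketch of such a first list (read-update table) is given below; the names (readCache, cacheEntry, Lookup) and the log representation are assumptions for illustration.

```go
package main

import "fmt"

// cacheEntry is one row of the first list: the RRI of a data item plus the
// check index (the commit index at the moment the RRI was determined).
type cacheEntry struct {
	RRI        int
	CheckIndex int
}

// readCache is a hypothetical in-memory first list keyed by data item.
type readCache struct {
	entries map[string]cacheEntry
}

// scanRange returns the highest log index in (low, high] whose log operates
// on key, or -1 if there is none (a stand-in for rescanning the log list).
func scanRange(logs map[int]string, low, high int, key string) int {
	for idx := high; idx > low; idx-- {
		if logs[idx] == key {
			return idx
		}
	}
	return -1
}

// Lookup returns the RRI to use for a read of key and updates the cache entry.
func (c *readCache) Lookup(logs map[int]string, applyIndex, commitIndex int, key string) int {
	e, ok := c.entries[key]
	if ok && e.CheckIndex == commitIndex {
		return e.RRI // case one: commit index unchanged, reuse the cached RRI
	}
	low, rri := applyIndex, applyIndex
	if ok {
		low, rri = e.CheckIndex, e.RRI // case two: only rescan (check index, commit index]
	}
	if idx := scanRange(logs, low, commitIndex, key); idx >= 0 {
		rri = idx
	}
	c.entries[key] = cacheEntry{RRI: rri, CheckIndex: commitIndex}
	return rri
}

func main() {
	logs := map[int]string{3: "z", 6: "y", 7: "x"} // log index -> data item operated on
	c := &readCache{entries: map[string]cacheEntry{}}
	fmt.Println(c.Lookup(logs, 2, 7, "x")) // 7, determined by scanning
	fmt.Println(c.Lookup(logs, 2, 7, "x")) // 7, reused: commit index unchanged
}
```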
In step 406, in the case that the first target storage node is the master storage node, the first target data is read by determining the RRI. This data reading process can also be summarized as a master node read request (Leader Read) processing flow, which is described below with reference to fig. 7. Fig. 7 is a schematic diagram of a master node read request processing flow according to an embodiment of the present application. As shown in fig. 7, the master node read request processing flow is executed by the first target storage node and includes the following steps (1) to (7):
(1) Receive a data read request.
(2) Judge whether the current storage node is the master storage node; if so, perform the following step (4), and if not, perform the following step (3).
(3) Forward the data read request to the master storage node.
(4) Update the Commit Index at the current time.
(5) Determine the first read index (that is, the RRI).
(6) Apply the logs to the state machine according to the first read index.
(7) Read the data from the state machine.
It should be noted that the specific implementation of the steps (1) to (7) is the same as the step 406, and therefore, the detailed description thereof is omitted here.
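A minimal sketch of the Leader Read flow above, assuming dictionary-based log and state machine structures; steps (1) to (4) (receipt, forwarding, and Commit Index refresh) are taken as given, the RRI determination and log application are folded into one function, and all names are illustrative assumptions rather than the patent's implementation.

def leader_read(key, log, state_machine, apply_index, commit_index):
    # (5) determine the relaxed read index: the newest committed-but-unapplied
    #     log entry touching the requested key, else the current Apply Index
    rri = apply_index
    for idx in range(commit_index, apply_index, -1):
        if idx in log and log[idx]["key"] == key:
            rri = idx
            break
    # (6) execute logs into the state machine up to the RRI
    for idx in range(apply_index + 1, rri + 1):
        if idx in log:
            state_machine[log[idx]["key"]] = log[idx]["value"]
    # (7) read the value from the state machine
    return state_machine.get(key), rri

# Example: Apply Index = 2, Commit Index = 7, only log 5 writes key "x"
log = {i: {"key": "x" if i == 5 else "k%d" % i, "value": i} for i in range(1, 8)}
print(leader_read("x", log, {"x": 0}, apply_index=2, commit_index=7))  # -> (5, 5)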
It should be understood that in the related art, in order to ensure strict consistency and ensure that the upper layer application does not cause service logic errors due to reading old data, it must be guaranteed before each read that the execution Index (Apply Index) has reached the Read Index (Read Index). Since the Raft protocol processes data access requests with only log writes being synchronous and execution into the state machine being asynchronous, it is often the case that the execution Index (Apply Index) is smaller than the Read Index (Read Index) when Raft is running normally. While waiting for the execution Index (Apply Index) to catch up with the Read Index (Read Index), the user cannot read the data, even if the data to be read is not related to these pending operations. This tight restriction increases the latency of data read requests and reduces the overall system performance. By introducing the relaxed Read Index RRI, the storage node is no longer required to wait until the execution Index (Apply Index) reaches the Read Index (Read Index) before reading the result; instead, the storage node can read the data from the state machine once the instruction log related to the data read request has been executed into the state machine (that is, the execution Index (Apply Index) reaches the RRI), so the waiting time of the data read request is reduced, the read data is still guaranteed to be the latest, and the data access performance of the distributed database system is effectively improved.
407. The first target storage node acquires the first read index from the main storage node based on the first data read request, and reads the first target data from the state machine corresponding to the first target data by taking the first read index as a starting point.
In this embodiment, when the first target storage node is a slave storage node, the first target storage node sends a read index obtaining request to the master storage node based on the first data read request, so as to obtain a first read index of the first target data. In some embodiments, the first target storage node communicates with the master storage node through a Remote Procedure Call (RPC), which is not limited in this embodiment.
In some embodiments, the manner in which the first target storage node reads the first target data from the state machine corresponding to the first target data using the first read index as a starting point includes any of the following cases:
in case one, if a log corresponding to the first read index exists in the first target storage node, the first target storage node performs persistent storage on the log corresponding to the first read index, and reads the first target data from a state machine corresponding to the first target data with the first read index as a starting point.
After the first target storage node acquires the first read index, determining that a log record with an index as the first read index exists by querying a local log list, then performing log playback, performing persistent storage on the log (namely, persisting data to a state machine), and completing data synchronization with the main storage node, wherein the first target storage node reads the first target data from the state machine corresponding to the first target storage node by taking the first read index as a starting point.
It should be noted that, in some embodiments, due to a network, machine hardware, IO scheduling, and the like, the time for persistent storage of the slave storage node and the master storage node does not necessarily keep consistent (it may also be understood that the time for log and data to be landed does not necessarily keep consistent), and therefore, data consistency can be ensured by the way of persistent storage through log playback as described above.
In case that the first target storage node does not have the log corresponding to the first read index, the first target storage node obtains the log corresponding to the first read index from the main storage node, and reads the first target data from the state machine corresponding to the first target data with the first read index as a starting point.
After acquiring the first read index, the first target storage node does not inquire a log record with an index being the first read index by inquiring a local log list, then sends a log acquisition request to the main storage node to acquire a log corresponding to the first read index, and then reads the first target data from a state machine corresponding to the first target storage node with the first read index as a starting point.
In step 407, if the first target storage node is the slave storage node, the first target data is read by obtaining the first read index from the master storage node. This data reading process can also be summarized as a slave node read request processing flow, which is schematically described below with reference to fig. 8. Fig. 8 is a schematic diagram of a processing flow of a slave node read request according to an embodiment of the present application. As shown in fig. 8, the slave node read request processing flow includes the following steps (1) to (5):
(1) a data read request is received from a storage node.
(2) The slave storage node requests a first read index (i.e., RRI) from the master storage node.
(3) The master storage node determines a first read index (this process is referred to as step 406 above and will not be described herein), and sends the first read index to the slave storage node.
(4) And executing the log into the state machine according to the first read index from the storage node.
(5) Data is read from the state machine from the storage node.
It should be noted that the specific implementation of the steps (1) to (5) is the same as the step 407, and therefore, the description thereof is omitted here.
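A minimal sketch of the slave node read flow above, under the same assumed dictionary-based structures; request_rri_from_leader and fetch_log_from_leader stand in for the RPCs mentioned in the text and are illustrative assumptions.

def follower_read(key, local_log, state_machine, apply_index,
                  request_rri_from_leader, fetch_log_from_leader):
    # (2)/(3) obtain the first read index (RRI) for this key from the master node
    rri = request_rri_from_leader(key)
    # case two of step 407: logs missing locally are fetched from the master node
    for idx in range(apply_index + 1, rri + 1):
        if idx not in local_log:
            local_log[idx] = fetch_log_from_leader(idx)
    # (4) replay and persist the logs into the state machine up to the RRI
    for idx in range(apply_index + 1, rri + 1):
        entry = local_log[idx]
        state_machine[entry["key"]] = entry["value"]
    # (5) read the requested data from the state machine
    return state_machine.get(key)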
Through the above steps 405 to 407, when the first target data is current state data, the first target storage node reads the first target data by adopting different data reading modes based on the first data reading request and the node type of the first target storage node. The following describes a manner in which the first target storage node reads the first target data when the first target data is history data through steps 408 to 410.
408. The first target storage node sends a first data read result to the compute node.
In the embodiment of the application, the first target storage node generates a first data reading result based on first target data read from a state machine corresponding to the first target data, sends the first data reading result to the computing node, and the computing node feeds the first data reading result back to the terminal. In some embodiments, the first target storage node sends the first target data read from the state machine corresponding to the first target data to the computing node as a first data reading result, and the computing node feeds back the first data reading result to the terminal.
It should be noted that, in the above steps 404 to 408, the first target storage node reads the first target data when the first target data is the current state data, and in some embodiments, the above steps 404 to 408 can be replaced with the following steps 404 'to 408'.
404', the first target storage node determines the first target data to be historical data based on the data type of the first target data.
When the first target data is history data, the first target storage node reads the first target data based on the first data read request and the transaction completion time of the first target data (i.e., steps 405 'to 408' described below).
405 ', if the data commit time of the first target data is before the transaction completion time, the first target storage node performs steps 406 ' and 408 ', and if the data commit time of the first target data is after the transaction completion time, the first target storage node performs steps 407 ' and 408 '.
In this embodiment of the present application, when a first target storage node receives a data reading request sent by a computing node, a Key (Key) of first target data and a transaction completion time corresponding to the first target data are obtained, and according to whether a data commit time of the first target data is before the transaction completion time, the first target data is read in different data reading manners. When the data commit time of the first target data is before the transaction completion time, the first target storage node scans the log list in a second order to determine a first read index (i.e., RRI) to read the first target data (i.e., step 406' described below). When the data commit time of the first target data is after the transaction completion time, the first target storage node scans the log list in a third order to determine a first read index to read the first target data (i.e., step 407' described below).
406' and the first target storage node scans the logs stored in the log list according to the second order based on the first data read request and the transaction completion time, determines a first read index, and reads the first target data from the state machine corresponding to the first target data with the first read index as a starting point.
In the embodiment of the present application, the second order refers to the order from the Commit Index (Commit Index) of the log list to the execution Index (Apply Index) of the log list. That the first target storage node scans the logs stored in the log list according to the second order can also be understood as the first target storage node scanning the log list from the Commit Index (Commit Index) toward earlier entries until the execution Index (Apply Index) is reached.
Wherein the first target storage node determines the first read index by scanning the log list, including any of:
in case one, if a second target log exists and the transaction completion time of the first target data is the same as the transaction completion time of the second target log, or the transaction completion time of the first target data is after the transaction completion time of the second target log, the first read index is determined based on the log index of the second target log, and the data operated by the second target log is the first target data.
The log list comprises log time information, and the log time information is used for indicating the transaction completion time of each log. In the process of scanning the Log list, the first target storage node scans that the data operated by one Log is the first target data, and the transaction completion time of the first target data is the same as the transaction completion time of the Log, or the transaction completion time of the first target data is after the transaction completion time of the second target Log, then the Log is the second target Log, and the first read Index is set as the Log Index (Log Index) of the second target Log.
And in case two, if the second target log does not exist, determining the first read index based on the execution index of the log list.
After the first target storage node scans the log list and does not scan the second target log, the first read Index is set as an execution Index (Apply Index) of the log list.
The manner in which the first read index is determined as shown in step 406' is illustrated below with reference to FIG. 9. Fig. 9 is a schematic diagram of determining a first read index according to an embodiment of the present application. As shown in fig. 9, the log list includes log time information for indicating the transaction completion time of each log, which may also be referred to as a transaction completion timestamp (Txn Commit TS). The Log indexes (Log Index) in the Log list have values of 1 to 8, which represent logs 1(Log1) to 8(Log 8), respectively, the execution Index (Apply Index) is 2, and the Commit Index (Commit Index) is 7. Illustratively, the first target storage node scans from Log 7(Log 7) one by one, and if there is a Log satisfying that "the data operated by the Log is the first target data" and "the transaction completion timestamp of the first target data is greater than or equal to the transaction completion timestamp of the Log" (this process may also be understood as aiming at finding old data which is not executed in the Log list and is to be read next), the first read Index (i.e. RRI) is set as the Log Index (Log Index) of the Log, and the scanning is ended in advance. If the log shown above does not exist after the scanning is finished, the first read Index is set as the execution Index (Apply Index) of the log.
It should be noted that, for the distributed database system applied in the embodiment of the present application, the distributed database system is a database supporting MVCC, and it is a requirement to read data before a certain version. For example, OLAP needs to analyze the statistics of the previous day, and the latest data is not needed in the state machine, and it is not necessary for the underlying storage to provide the latest data. Therefore, the manner of reading the first target data shown in step 406' may also be referred to as reading old data, and this process is simply referred to as an old data read request processing flow. This old data read request processing flow is schematically illustrated below with reference to fig. 10. Fig. 10 is a schematic diagram of a processing flow of an old data read request according to an embodiment of the present application. As shown in fig. 10, the old data read request processing flow is executed by the first target storage node, and includes the following steps (1) to (7):
(1) a Key (Key) of the first target data and a transaction completion time of the first target data are obtained.
(2) The log list is scanned from the Commit Index (Commit Index) backward toward earlier entries until the execution Index (Apply Index) is reached.
(3) And (4) judging whether a second target log exists in the log list, and if so, executing the following step (4), and if not, executing the following step (5), wherein the transaction completion time of the first target data is the same as the transaction completion time of the second target log, or the transaction completion time of the first target data is after the transaction completion time of the second target log.
(4) A first read index is determined based on a log index of a second target log.
(5) A first read index is determined based on an execution index of the log list.
(6) And executing the log into a state machine according to the first read index.
(7) Data is read from the state machine.
It should be noted that the specific implementation of the steps (1) to (7) is the same as the step 406', and therefore, the detailed description thereof is omitted here.
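A minimal sketch of the old data read scan above (steps (2) to (5)), assuming the log list is a dictionary keyed by log index in which each entry carries the operated key and its transaction commit timestamp; the names are illustrative.

def old_data_rri(key, txn_completion_ts, log, apply_index, commit_index):
    # (2) scan backward from the Commit Index toward the Apply Index
    for idx in range(commit_index, apply_index, -1):
        entry = log.get(idx)
        # (3)/(4) second target log: same key, committed no later than the requested time
        if entry and entry["key"] == key and txn_completion_ts >= entry["txn_commit_ts"]:
            return idx
    # (5) no such log: fall back to the execution Index (Apply Index)
    return apply_index

# Example mirroring fig. 9: Apply Index = 2, Commit Index = 7
log = {i: {"key": "x", "txn_commit_ts": i * 10} for i in range(1, 8)}
print(old_data_rri("x", 55, log, apply_index=2, commit_index=7))  # -> 5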
Through step 406', when the first target data is old data, the first target storage node introduces a relaxation reading method by determining the first read index, so that the processing time of a data reading request is reduced, and the data access efficiency of the distributed database system is improved.
407' and the first target storage node scans the logs stored in the log list according to a third order based on the first data read request and the transaction completion time, determines the first read index, and reads the first target data from the state machine corresponding to the first target data by taking the first read index as a starting point.
In the embodiment of the present application, the third order refers to the order from the execution Index (Apply Index) of the log list to the Commit Index (Commit Index) of the log list. That the first target storage node scans the logs stored in the log list according to the third order can also be understood as the first target storage node scanning the log list from the execution Index (Apply Index) toward later entries until the Commit Index (Commit Index) is reached.
Wherein the first target storage node determines the first read index by scanning the log list, including any of:
in case one, if a third target log exists, and the transaction completion time of the first target data is the same as the transaction completion time of the third target log, or the transaction completion time of the first target data is after the transaction completion time of the third target log, the first read index is determined based on the log index of the third target log, and the data operated by the third target log is the first target data.
The log list comprises log time information, and the log time information is used for indicating the transaction completion time of each log. In the process of scanning the Log list, the first target storage node scans that the data operated by one Log is the first target data, and the transaction completion time of the first target data is the same as the transaction completion time of the Log, or the transaction completion time of the first target data is after the transaction completion time of the third target Log, then the Log is the third target Log, and the first read Index is set as the Log Index (Log Index) of the third target Log.
And in case two, if the third target log does not exist, determining the first read index based on the execution index of the log list.
After the first target storage node scans the log list, and does not scan the third target log, the first read Index is set as an execution Index (Apply Index) of the log list.
The manner in which the first read index is determined as shown in step 407' is illustrated below with continued reference to FIG. 9. As shown in fig. 9, the log list includes log time information for indicating the transaction completion time of each log, which may also be referred to as a transaction completion timestamp (Txn Commit TS). The Log indexes (Log Index) in the Log list have values of 1 to 8, which represent logs 1(Log1) to 8(Log 8), respectively, the execution Index (Apply Index) is 2, and the Commit Index (Commit Index) is 7. Illustratively, the first target storage node scans from Log 2(Log 2) one by one toward later entries, and if there is a Log satisfying that "the data operated by the Log is the first target data" and "the transaction completion timestamp of the first target data is greater than or equal to the transaction completion timestamp of the Log" (this process may also be understood as aiming at finding the non-latest data which is not executed in the Log list and is read next), the first read Index (i.e. RRI) is set as the Log Index (Log Index) of the Log, and the scanning is ended in advance. If the log shown above does not exist after the scanning is finished, the first read Index is set as the execution Index (Apply Index) of the log list.
It should be noted that, for the distributed database system applied in the embodiment of the present application, the distributed database system is a database supporting MVCC, and it is also a requirement to read data after a certain version. For example, a transaction may need to read data after a certain timestamp, and does not need the data to be up-to-date. Therefore, the manner of reading the first target data shown in the above step 407' may also be referred to as reading non-latest data, and this process is simply referred to as a non-latest data read request processing flow. This non-current data read request processing flow is schematically illustrated below with reference to fig. 11. Fig. 11 is a schematic diagram of a processing flow of a non-latest data read request according to an embodiment of the present application. As shown in fig. 11, the non-latest data read request processing flow is executed by the first target storage node, and includes the following steps (1) to (7):
(1) a Key (Key) of the first target data and a transaction completion time of the first target data are obtained.
(2) The log list is scanned from the execution Index (Apply Index) forward toward later entries until the Commit Index (Commit Index) is reached.
(3) And (4) judging whether a third target log exists in the log list, and if so, executing the following step (4), and if not, executing the following step (5), wherein the transaction completion time of the first target data is the same as the transaction completion time of the third target log, or the transaction completion time of the first target data is after the transaction completion time of the third target log.
(4) A first read index is determined based on a log index of a third target log.
(5) A first read index is determined based on an execution index of the log list.
(6) And executing the log into a state machine according to the first read index.
(7) Data is read from the state machine.
It should be noted that the specific implementation of the steps (1) to (7) is the same as that of the step 407', and therefore, the description thereof is omitted.
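A minimal sketch of the non-latest data read scan above (steps (2) to (5)), using the same assumed log structure as before, with the scan direction reversed: from the execution Index (Apply Index) toward the Commit Index (Commit Index).

def non_latest_data_rri(key, txn_completion_ts, log, apply_index, commit_index):
    # (2) scan forward from the Apply Index toward the Commit Index
    for idx in range(apply_index + 1, commit_index + 1):
        entry = log.get(idx)
        # (3)/(4) third target log: same key, committed no later than the requested time
        if entry and entry["key"] == key and txn_completion_ts >= entry["txn_commit_ts"]:
            return idx
    # (5) no such log: fall back to the execution Index (Apply Index)
    return apply_index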
Through step 407', when the first target data is non-latest data, the first target storage node introduces a relaxation reading method by determining the first read index, thereby reducing the processing time of the data reading request and improving the data access efficiency of the distributed database system.
408', the first target storage node sends the first data read to the compute node.
In the embodiment of the present application, step 408' is the same as step 408, and therefore, is not described herein again.
Through the above steps 404 'to 408', when the first target data is the historical data, the first target storage node reads the first target data by adopting different data reading modes based on the transaction completion time of the first data reading request and the first target data. This process may also be referred to as a read process of the special time version data. By determining the mode of relaxing the read index, the processing time of the data reading request is reduced, and the data access performance of the distributed database system is effectively improved.
In the data access method provided in the embodiment of the present application, when a computing node receives a data read request, a plurality of storage nodes storing multiple copies of target data of the data read request are determined according to a data fragment to which the target data belong, and then a target storage node meeting a target condition is selected from the storage nodes according to a data access cost of the storage nodes for accessing the target data, and the target storage node reads the target data. In the method, the target storage node is determined according to the data access cost, so that the main storage node and the auxiliary storage node can be the target storage node, and the main storage node is prevented from processing all data reading requests, so that high availability caused by multiple copies is ensured, the data reading speed is increased, and the data access performance of the distributed database system is effectively improved.
Through the embodiment shown in fig. 4, the data access method provided by the embodiment of the present application is described by taking the request type of the data access request as the data read request as an example. Another data access method provided in the embodiment of the present application is described in detail below with reference to fig. 12.
Fig. 12 is a flowchart of a data access method provided in an embodiment of the present application, and as shown in fig. 12, the embodiment is applied to a distributed database system that includes a computing node and a plurality of storage nodes. In the embodiment shown in fig. 12, the data access method is applied to the HTAP database system shown in fig. 2, and the interaction between the compute node and the storage node is taken as an example. This embodiment includes the following steps.
1201. And the computing node establishes connection with the terminal based on the connection request sent by the terminal.
In the embodiment of the present application, step 1201 is similar to step 401, and therefore is not described herein again.
1202. The computing node responds to the data writing request, determines whether a third data fragment to which second target data of the data writing request belongs exists, if not, the computing node and the plurality of storage nodes respectively execute the following steps 1203 and 1204, and if so, the computing node executes the following step 1205.
In this embodiment of the present application, a computing node receives a data write request sent by a terminal, computes a third data segment to which second target data belongs according to data content of the second target data of the data write request, and determines whether the third data segment exists in the current multiple storage nodes. If not, a third data fragment to which the second target data belongs and a plurality of corresponding copies need to be established (i.e., step 1203 and step 1204 below), and if so, the computing node determines, according to the data content of the second target data, a third data fragment to which the second target data belongs in the current distributed database system, and determines, according to fragment information of the third data fragment, a plurality of second storage nodes from a plurality of storage nodes (i.e., step 1205 below).
1203. And the computing node establishes a third data fragment to which the second target data belongs and sends a copy creation request to the plurality of storage nodes.
In the embodiment of the application, the computing node establishes a third data fragment to which the second target data belongs based on the data content of the second target data, and sends a copy creation request to the plurality of storage nodes based on the fragment information of the third data fragment. Optionally, the copy creation request carries data content of the second target data, fragmentation information of the third data fragment, a storage mode of the copy, and the like, which is not limited in this embodiment of the application. In some embodiments, the computing node stores the shard information of the third data shard to the metadata management layer.
1204. And the plurality of storage nodes establish a plurality of copies corresponding to the third data fragment based on the copy creation request.
In the embodiment of the present application, after receiving a copy creation request sent by a computing node, a plurality of storage nodes establish a copy corresponding to the third data fragment on their respective storage nodes based on the copy creation request.
In some embodiments, the plurality of storage nodes establish a plurality of copies corresponding to the third data fragment based on the copy creation request and a storage pattern of the second target data in the plurality of storage nodes, the storage pattern indicating a storage format of the data in the storage nodes.
In some embodiments, the plurality of storage nodes establish a plurality of copies corresponding to the third data segment according to a preset copy configuration policy. For example, the preset copy configuration policy is as follows: n copies are established in each data fragment, wherein the storage mode of K copies is a row storage mode, the storage mode of N-K copies is a column storage mode, N and K are positive integers, the N copies are scattered on N physical nodes for storage, and main storage nodes are selected through a distributed consistency protocol management and an election mechanism. It should be understood that the configuration policy of the copy may be adjusted according to actual situations, and this is not limited in the embodiments of the present application.
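A minimal sketch of the example copy configuration policy above: N copies per data fragment, K stored in row mode and N-K in column mode, spread over N distinct physical nodes. The round-robin placement and all names are assumptions for illustration; leader election itself is not modeled here.

def plan_replicas(shard_id, physical_nodes, n=3, k=1):
    assert n <= len(physical_nodes)
    chosen = [physical_nodes[(shard_id + i) % len(physical_nodes)] for i in range(n)]
    return [{"shard": shard_id,
             "node": node,
             "storage_mode": "row" if i < k else "column"}
            for i, node in enumerate(chosen)]

# Example: shard 0 gets 1 row-store copy and 2 column-store copies on 3 of 5 nodes
print(plan_replicas(0, ["node-1", "node-2", "node-3", "node-4", "node-5"]))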
It should be noted that, the specific content of the storage mode of the multiple copies corresponding to the third data segment may refer to the step 402, and therefore, the detailed description is omitted here. In addition, after the establishment of the multiple copies corresponding to the third data segment is completed, the multiple storage nodes dynamically adjust the storage modes of the multiple copies. This process is similar to the dynamic adjustment of the storage modes of the multiple copies by the multiple first storage nodes shown in step 402, and therefore, the detailed description thereof is omitted here.
1205. The computing node determines, based on the third data slice, a plurality of second storage nodes from the plurality of storage nodes, the plurality of second storage nodes to store a plurality of copies of second target data.
In the embodiment of the present application, the manner in which the computing node determines the plurality of second storage nodes is the same as that in step 301 and step 402, and therefore, the description thereof is omitted here.
1206. The compute node sends the data write request to a primary storage node of the plurality of second storage nodes.
1207. The master storage node writes the second target data based on the data writing request, generates a data operation log, and sends a log synchronization request to a slave storage node in the plurality of storage nodes, wherein the log synchronization request is used for indicating the slave storage node to send a data synchronization message to the master storage node after synchronizing the data operation log.
In the embodiment of the application, after receiving a data write request, a master storage node first performs validity judgment, writes the second target data on the premise that the data write request is valid, generates a data operation log, sends a log synchronization request to a slave storage node, and waits for a data synchronization message to be returned from the slave storage node. In some embodiments, the data operation log includes an operation type, an operated data item, an operation time, and the like, which is not limited in this application.
In some embodiments, after the master storage node writes the second target data based on the data write request and generates the data operation log, the master storage node determines a second read index of the second target data and stores the second read index, so that when the distributed database system receives a data read request that needs to read the second target data, the corresponding data is read by looking up a table. This alternative embodiment is described below in three steps:
step one, the main storage node determines a second read index of the second target data based on the log index of the data operation log, wherein the second read index is used for indicating a minimum read index for reading the second target data based on a third data reading request.
It should be noted that the second read index of the second target data is also the RRI of the second target data, and the specific meaning of the RRI is the same as that in the embodiment shown in fig. 4, and therefore, the description thereof is omitted here.
Step two, storing the second read index to a second list, where the second list includes the second target data, the second read index, and a second check index, and the second check index is a log index of the data operation log.
For each piece of data operation Log, the RRI of the data operated by the data operation Log is the Log Index (Log Index) of the piece of Log, and the second Check Index (Check Index) is also the Log Index (Log Index) of the piece of Log.
In some embodiments, if the number of data items in the second list is greater than or equal to a preset threshold, at least one data item and the corresponding RRI in the second list are deleted, where the RRI of the at least one data item is less than the execution Index (Apply Index). For example, the preset threshold is 100, which is not limited in the embodiments of the present application.
And step three, when the distributed database system processes a third data reading request, if the data required to be read by the third data reading request is the second target data, querying the second list to read the second target data.
The manner in which the distributed database system processes the query for the second list to read the second target data is the same as that in step 406, and therefore, the description thereof is omitted here.
Referring to fig. 13, an alternative embodiment of storing the second read index and processing the third data read request will be described. Fig. 13 is a schematic diagram illustrating a storage of a second read index according to an embodiment of the present application. As shown in fig. 13, the primary storage node stores the results of the data operation Log in a list including the data items and the corresponding RRIs and a Check Index (Check Index), wherein the value of the Check Index (Check Index) is equal to the value of the Log Index (Log Index); the RRI of the data operated by the data operation Log is a Log Index (Log Index) of the Log. Illustratively, as shown in (a) and (b) of fig. 13, when the second target data is x, its corresponding RRI has a value of 7, when the second target data is y, its corresponding RRI has a value of 6, and so on. In FIG. 13, the diagrams (c) and (d) are the same as those (a) and (b), and therefore, the description thereof is omitted.
It should be noted that, by storing the second read index in the second list, repeated scanning of the log list can be avoided, and the data processing amount is reduced, thereby saving the computing resources. In addition, since the information of the second list is updated each time a data write request is processed, this alternative implementation may also be referred to as a write-time update method. It should be appreciated that since only the parameters of the data associated with the data write request need to be updated each time, the single update amount of this method is less than the read update method described in step 406 above. Like the read updating method, the size of the second list is generally not more than 10 items, which is enough to be stored in the memory, and the second list does not need to be synchronized to the persistent storage, and the table is reconstructed according to the same strategy when the failure is recovered.
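A minimal sketch of the write-time update described above: whenever the master writes a data operation log, the RRI and the Check Index of the operated key are both set to that entry's Log Index, and items whose RRI is already below the Apply Index may be evicted once the list grows past the threshold. The threshold value and all names are illustrative assumptions.

def update_on_write(second_list, key, log_index, apply_index, threshold=100):
    second_list[key] = {"rri": log_index, "check_index": log_index}
    if len(second_list) >= threshold:
        # drop items whose RRI is already below the Apply Index
        for k in [k for k, v in second_list.items() if v["rri"] < apply_index]:
            del second_list[k]

# Example mirroring fig. 13: y written at log index 6, then x at log index 7
table = {}
update_on_write(table, "y", 6, apply_index=2)
update_on_write(table, "x", 7, apply_index=2)
print(table)  # {'y': {'rri': 6, 'check_index': 6}, 'x': {'rri': 7, 'check_index': 7}}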
1208. If the number of the data synchronization messages received by the main storage node is larger than or equal to half of the number of the slave storage nodes, the main storage node confirms that the data write request is operated successfully.
In the embodiment of the application, after receiving a log synchronization request sent by a master storage node from a storage node, the master storage node synchronizes data operation logs into respective log records, and sends a data synchronization message to notify the master storage node that the logs are successfully synchronized. And when the number of the data synchronization messages received by the master storage node is greater than or equal to half of the number of the slave storage nodes, confirming that the data write request is operated successfully.
It should be noted that, in a related distributed consistency protocol (for example, the Raft protocol), after receiving a data write request, a master storage node adds the data write request as a log entry to its own log, and then copies the log to other slave storage nodes. When the log is copied to most of the slave storage nodes, the master storage node stores the log into the state machine of the master storage node in a persistent mode, and then the execution result can be returned to the terminal to indicate that the writing is successful. In this process, the persistence of the slave storage nodes is completed asynchronously without waiting, but the persistence of the master storage nodes is synchronized and the execution results are returned after the persistence is successful. In the embodiment of the application, the persistence of the main storage node is changed into asynchronous, namely, the execution result can be returned after half of the secondary storage nodes copy the logs, and the main storage nodes do not need to wait for the successful persistence of the main storage nodes, so that the waiting time of the data writing request is reduced, the processing efficiency of the data writing request is effectively improved, and the data access performance of the distributed database system is improved.
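A minimal sketch of the Commit Return idea above: the master returns the execution result once the log has been replicated to at least half of the slave storage nodes, and persists the log to its own state machine asynchronously. replicate_to and apply_async are illustrative stand-ins for the real replication and persistence mechanisms, not APIs from the patent.

def commit_return_write(entry, slave_nodes, replicate_to, apply_async):
    acks = sum(1 for node in slave_nodes if replicate_to(node, entry))
    if acks >= len(slave_nodes) / 2:  # enough slave nodes acknowledged the log
        apply_async(entry)            # persistence to the state machine is deferred
        return "success"              # execution result returned without waiting
    return "failure"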
1209. The primary storage node sends a first data write result to the compute node.
In the embodiment of the application, the main storage node generates a first data writing result based on the successful operation of the data writing request, sends the first data writing result to the computing node, and the computing node feeds the first data writing result back to the terminal.
In some embodiments, after the primary storage node sends the first data write result to the compute node, the data access method further comprises: the main storage node performs persistent storage on the second target data; and the slave storage node performs format conversion on the second target data based on the data operation log and the storage mode of the second target data in the slave storage node, and performs persistent storage on the converted second target data. It should be noted that this process is the process in which the master and slave nodes perform an asynchronous persistence operation on the written data. Illustratively, the master storage node persists the data to disk according to the configuration of the node; the slave storage nodes gradually perform their data persistence work: a log record is taken out to obtain the operation type and the data item, the slave storage node accesses the format converter, specifies the data storage mode of the node and the data item to be recorded, obtains the converted data, and persists the data into the state machine.
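A minimal sketch of the slave-side asynchronous persistence described above: each log record is taken out, converted by a format converter into the storage mode of this node (row or column), and then persisted to the state machine. The convert and persist callbacks are illustrative assumptions.

def destage_from_log(log_records, storage_mode, convert, persist):
    for record in log_records:
        op, item = record["op"], record["item"]  # operation type and data item
        converted = convert(item, storage_mode)  # convert to this node's storage format
        persist(op, converted)                   # write converted data to the state machine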
In the data access method shown in the above steps 1201 to 1209, when a computing node receives a data write request, first, a plurality of storage nodes storing multiple copies of target data of the data write request are determined according to a data slice to which the target data belong, then the data write request is sent to a main storage node of the plurality of storage nodes, and when log synchronization of more than half of the number of the auxiliary storage nodes is completed, the main storage node is allowed to Return an execution result without persisting log information into a state machine, which may also be referred to as a Commit Return (Commit Return) -based data write method. The method accelerates the return speed of the data write-in request, improves the write performance of the whole system, and effectively improves the data access performance of the distributed database system.
With the embodiments shown in fig. 4 and fig. 12, the data access method provided by the embodiment of the present application is described by taking the data access request as a data read request and a data write request, respectively. Another data access method provided in the embodiment of the present application is described in detail below with reference to fig. 14.
Fig. 14 is a flowchart of a data access method provided in an embodiment of the present application, and as shown in fig. 14, the embodiment is applied to a distributed database system, where the distributed database system includes a computing node and a plurality of storage nodes. In the embodiment shown in fig. 14, the data access method is applied to the HTAP database system shown in fig. 2, and the interaction between the compute node and the storage node is taken as an example to illustrate, where the data access request is a data read-write request. This embodiment includes the following steps.
1401. And the computing node establishes connection with the terminal based on the connection request sent by the terminal.
In the embodiment of the present application, step 1401 is similar to step 401, and therefore is not described herein again.
1402. The computing node responds to the data read-write request, determines whether a fourth data fragment to which third target data of the data read-write request belongs exists, if not, the computing node and the plurality of storage nodes respectively execute the following steps 1403 and 1404, and if so, the computing node executes the following steps 1405 to 1407.
In the embodiment of the present application, step 1402 is similar to step 1202, and therefore will not be described herein again.
1403. And the computing node establishes a fourth data fragment to which the third target data belongs and sends a copy creation request to the plurality of storage nodes.
In the embodiment of the present application, step 1403 is similar to step 1203, and therefore is not described herein again.
1404. And the plurality of storage nodes establish a plurality of copies corresponding to the fourth data fragment based on the copy creation request.
In the embodiment of the present application, step 1404 is similar to step 1204, and therefore will not be described herein again.
1405. The computing node determines, based on the fourth data slice, a plurality of third storage nodes from the plurality of storage nodes, the plurality of third storage nodes to store a plurality of copies of a third target data.
In the embodiment of the present application, step 1405 is similar to step 1205, and therefore will not be described herein again.
1406. For the read operation in the data read-write request, the computing node determines a second target storage node from the plurality of third storage nodes based on the data read-write request, and sends the data read-write request to the second target storage node, wherein the data access cost of the second target storage node meets a second target condition.
1407. And the second target storage node reads the third target data based on the data read-write request and sends a second data read result to the computing node.
In the embodiment of the present application, step 1406 and step 1407 are similar to the above-mentioned steps 403 to 411, and therefore are not described herein again.
1408. For the write operation in the data read-write request, the computing node sends the data read-write request to a main storage node in the plurality of third storage nodes.
1409. And the main storage node writes the third target data based on the data read-write request and sends a second data write result to the computing node.
In the embodiment of the present application, step 1408 and step 1409 are the same as the above-mentioned steps 1206 to 1209, and therefore are not described herein again.
It should be noted that, in the embodiment of the present application, the distributed database system is executed according to the above steps 1406 to 1409, and in some embodiments, the distributed database system executes step 1408 and step 1409 first, and then executes step 1406 and step 1407. In other embodiments, the distributed database system executes steps 1406 to 1409 synchronously, and the execution order of steps 1406 to 1409 is not limited in the present embodiment.
In some embodiments, a slave storage node of the plurality of third storage nodes is configured with a memory lock for locking the third target data when the write operation has not completed. In this way, serializable scheduling of concurrent transactions can be ensured.
In the data access method shown in steps 1401 to 1409, when a computing node receives a data read-write request, a plurality of storage nodes storing a plurality of copies of target data of the data read-write request are determined according to a data slice to which the target data belong, and then data reading and data writing are performed respectively for a read operation and a write operation in the data read-write request.
For reading operation, according to the data access cost of target data, selecting a target storage node meeting target conditions from the storage nodes, and reading the target data by the target storage node, wherein in the process, the target storage node is determined according to the data access cost, so that both the main storage node and the slave storage node can be target storage nodes, and the main storage node is prevented from processing all data reading requests, thereby ensuring high availability caused by multiple copies, improving the data reading speed, and effectively improving the data access performance of the distributed database system.
For write operation, the data write request is sent to the main storage nodes of the plurality of storage nodes, when more than half of the number of the logs of the auxiliary storage nodes are synchronously completed, the main storage nodes are allowed to return execution results without persisting the log information into a state machine, and the method accelerates the return speed of the data write request, improves the write performance of the whole system, and effectively improves the data access performance of the distributed database system.
Through the embodiments shown in fig. 3, fig. 4, fig. 12, and fig. 14, the data access method provided by the present application is described, and an election mechanism involved in the data access method is described below, where a processing flow of participating in election from a storage node includes the following steps 1501 and 1502:
1501. when a third storage node exists in a plurality of storage nodes and becomes a main storage node through election, a slave storage node in the plurality of storage nodes determines a timeout time based on a current storage mode and a write performance parameter of the slave storage node, wherein the storage mode is used for indicating a storage format of data in the storage nodes.
The election mechanism of the main storage node is also called a Leader election mechanism, and is used for ensuring order consistency. When a third storage node is elected as the master storage node, the third storage node sends a notification message to the slave storage nodes in the plurality of storage nodes to establish its Leader identity. After receiving the notification message sent by the main storage node, each slave storage node sets its timeout time based on its current storage mode and write performance parameter. In some implementations, storage nodes of different storage modes set different random time ranges. For example, if the random timeout range of a storage node is (t1, t2) and the write performance parameter of the storage node is p, the storage node determines its timeout time from the range (t1 - p/f, t2 - p/f). The larger the write performance parameter p, the shorter the timeout time; that is, a storage node with better write performance has a shorter timeout, where f is a scaling factor that can be set reasonably according to the cluster performance, and this is not limited in the embodiments of the present application. It should be noted that the above setting of the timeout time of the storage node is only illustrative, and in some embodiments, any scheme that includes the storage mode and the write performance parameter of the storage node in the determination process of the timeout time falls within the protection scope of the present application.
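A minimal sketch of the timeout rule in the example above: a node whose random timeout range is (t1, t2) and whose write performance parameter is p draws its election timeout from (t1 - p/f, t2 - p/f), so better write performance tends to give a shorter timeout. The concrete values below are illustrative only.

import random

def election_timeout(t1, t2, p, f):
    return random.uniform(t1 - p / f, t2 - p / f)

# Two followers with the same base range but different write performance
print(election_timeout(150.0, 300.0, p=80.0, f=1.0))  # faster writer: shorter on average
print(election_timeout(150.0, 300.0, p=20.0, f=1.0))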
By including the storage mode and the write performance parameters of the storage nodes in the determination of the timeout time, the storage nodes with good write performance are more likely to become main storage nodes, and the overall data access performance of the distributed database system is improved.
1502. And if the first slave storage node does not receive the message of the master storage node within the corresponding timeout time, the first slave storage node is switched to a candidate state to participate in next election.
And if the first slave storage node does not receive the message of the master storage node within the corresponding timeout time and indicates that the master storage node is failed, the first slave storage node is switched to a candidate state to participate in next election. It should be noted that, since the storage node with good write performance is set to have a shorter timeout time, there is a greater probability that the storage node will switch to the candidate state first and participate in the next election, so that the storage node is more likely to become the master storage node.
In addition, when the main storage node fails and a new main storage node is not selected yet, the distributed database system does not support any data reading and writing request and also does not support data reading request of the slave storage node. The reason is that if the selected slave storage node does not have the latest version of data, and the master storage node does not work at this time, the slave storage node cannot obtain a data consistency point, and if a data read request sent by an upper layer is served, a data inconsistency condition may be caused. Therefore, the mode can avoid the situation of data inconsistency and ensure the correctness of the system.
Since the distributed coherency protocol is modified in the data access method provided by the present application, the following description will be made of the influence of the modification on the distributed transaction, including the following two parts, namely "proof of linear coherency" and "memory lock is set for the distributed transaction".
First, linear consistency proof.
In the embodiment of the present application, the data access method ensures the linear consistency of the data, that is, once a new value v_new is written or read, all subsequent data read requests can read this new value v_new until v_new is overwritten.
Taking the Raft protocol as an example, although the data access method relaxes the limitations of the read flow and the write flow in the Raft protocol, the linear consistency of the Raft protocol is not changed. The data access method shown in fig. 12 modifies the data writing process from returning success after the data is persisted to disk to returning success once consensus on the update log is reached (i.e., the CR-based data writing method); the data access method shown in fig. 4 relaxes the restriction that the execution Index (Apply Index) must catch up with the Read Index (Read Index) in the data reading process, and records the update index of each data item, so that the data can be read as soon as the Apply Index reaches the update index of the data item involved in the read request (i.e., the RRI-based data reading method). In the slave node read request processing flow, the slave node forcibly performs data synchronization with the master node before processing the data read request, so the slave node is no different from the master node in terms of consistency.
Taking the master node processing the data read request as an example, it is formally proved below that this set of changes (CR and RRI) does not break the linear consistency of the Raft protocol, covering the following two cases:
case one, read after write.
In the embodiment of the application, a new value v_new being written means that consensus on the data write request for v_new has been reached in the Raft cluster (i.e., more than half of the nodes have saved the update log). Suppose the log index of the update to v_new is n1 and the current Commit Index is c1; then c1 >= n1, and the relaxed read index of v_new is i1 = n1. When the master node receives a new data read request after the data write request returns, suppose the Apply Index at this time is a2 and the current Commit Index is c2. The Read Index of the data read request (r2) will be set to c2, i.e., r2 = c2. Thus r2 = c2 >= c1 >= n1 and c2 >= a2, i.e., r2 >= n1 and r2 >= a2.
Suppose first that n1 > a2; then the update to v_new has not yet been applied to the state machine, and the relaxed read index is i2 = i1. In this case, the data read request is blocked until the Apply Index is updated to a3 = n1 = i2, after which the data read request reads the data from the state machine and returns the result. Since a3 = n1 at this time, the update to v_new has been applied to the state machine, so reading data from the state machine obtains the latest value of v_new.
If instead n1 <= a2, the relaxed read index of v_new is i2 = a2, and the data read request can read the data directly from the state machine and return the result. Since n1 <= a2, the update to v_new has already been applied to the state machine, so the data read request can read the latest value of v_new.
Suppose that in either of the two cases (n1 > a2 and n1 <= a2) the value of v_new that is read is not the latest; then a data write request for v_new was committed before this data read request arrived, which contradicts the design of the RRI. Therefore, the data access method provided by the application ensures that after the data write request for the new value v_new is committed, a subsequent data read request can read v_new before v_new is overwritten.
Case two, read after read.
Suppose the new value v_new has been read by a data read request whose Read Index is r1, this being the first read of v_new. Let the Commit Index of the node when that data read request returns be c1, the log index of the data write request that wrote v_new be n1, and the relaxed read index of v_new be i1. Since the data read request has read v_new, the Apply Index at that time (a1) is not less than i1, so r1 = c1 >= a1 >= i1 >= n1, and the new relaxed read index is i2 = a1. When a new data read request arrives, its Read Index is set to r2, where r2 > r1. Let the Commit Index of the master node when it receives the new data read request be c2 and the Apply Index be a2; then r2 = c2 >= c1 >= n1, so r2 >= n1 and r2 >= a2. Similarly to the previous case, it can be proved that the new data read request can still read v_new before v_new is overwritten.
In summary, with the data access method provided by the embodiment of the present application, once a new value v_new is written or read, all subsequent data read requests can read this new value v_new until v_new is overwritten. The data access method provided by the application therefore maintains the linear consistency of the Raft algorithm.
And in the second part, the distributed transaction sets a memory lock.
In the embodiment of the application, the function of reading data from the slave node is added, and when the slave node reads data, the problem of read-half-committed may occur. To address the read-half-committed problem, a memory lock is added in the slave node, and serializable scheduling of concurrent transactions is realized based on a blocking concurrent access mechanism.
This is demonstrated below by way of an example.
Referring first to FIG. 15, the read semi-committed problem is described. FIG. 15 is a schematic diagram of a read semi-committed problem provided by an embodiment of the present application. As shown in fig. 15, the graph includes two physical nodes, which are node a and node b, where node a corresponds to account X, node b corresponds to account Y, and the initial values of X and Y are both 1. Both physical nodes can read data from the slave node.
For example, suppose a write transaction now transfers 1 from account X to account Y. This write transaction has completed its commit at node a, but node b has not yet committed, and its slave node has not yet synchronized the log information. At this time, another distributed read transaction needs to perform a reconciliation operation and reads the values of X and Y on the two physical nodes respectively. Since the write transaction at node a has committed, the value of X read by the read transaction is 0. When the value of Y is read, the computing layer specifies that the slave node responds to the data read request, but since the operation modifying the value of Y has not been committed, the value of Y read by the read transaction from the slave node is 1. The total read by the reconciliation is therefore X + Y = 1 instead of 2, and a data inconsistency occurs, which is called read-half-committed.
To solve the read-half-committed problem, the embodiment of the present application adds a memory lock to the slave node, so that for the above read and write transactions, before the modification to Y is committed (i.e., while the state in fig. 15 holds), the reconciliation read transaction is not allowed to read the value of the Y data item from any node and can only wait. The read transaction is not allowed to read the values of X and Y until the modifications to both X and Y have committed.
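A minimal sketch of the per-item memory lock on the slave node described above: a write transaction holds the lock on a data item until its distributed commit, so a concurrent reconciliation read blocks instead of observing the half-committed state. This blocking scheme and all names are illustrative assumptions, not the patent's concrete implementation.

import threading

class ItemLocks:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def begin_write(self, key):
        self._lock_for(key).acquire()  # taken before the uncommitted modification

    def commit_write(self, key):
        self._locks[key].release()     # released only after the transaction commits

    def read(self, key, state_machine):
        with self._lock_for(key):      # a reader waits while a write is uncommitted
            return state_machine.get(key)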
With reference to fig. 2 to fig. 15, the data access method provided by the present application has been introduced from multiple aspects, such as the architecture of the distributed database system, the data access methods corresponding to different types of data access requests, and the influence on distributed transactions. Based on the above contents, the beneficial effects brought by the data access method provided by the present application are summarized below, mainly covering the following 9 points.
1. The data access method of the present application is based on multi-copy support for heterogeneous storage models, so that under the mixed load of the system different types of data access requests can be served by copies with different storage modes, thereby better exploiting the disk read advantage and reducing the occupation of network bandwidth (see the embodiment shown in fig. 4 for details).
2. The data access method of the present application allows data to be read from different types of nodes, thereby improving the read concurrency of multi-copy storage while preserving the high availability brought by multiple copies and the data consistency among the copies, and increasing the system concurrency (see step 403 in the embodiment shown in fig. 4 for details).
3. The data access method of the present application provides a multi-copy dynamic management strategy, which better copes with load changes during system operation (see step 402 in the embodiment shown in fig. 4 for details).
4. The data access method of the present application modifies how the query plan is formulated: when formulating the query plan, the computing node takes multiple additional factors into account, comprehensively evaluates the request type and the system state, and selects the optimal data copy as the access target, which helps to reduce the likelihood of bottlenecks and to improve the overall throughput of the system (see step 403 in the embodiment shown in fig. 4 for details).
5. Through the data reading method based on the Relaxed Read Index (RRI), the data access method of the present application increases the system concurrency, speeds up the processing of data read requests, and improves the overall read performance of the distributed database system. When reading data, it is allowed not to persist all log information into the state machine, so as to shorten the processing time of the underlying state machine, speed up the return of read requests, and improve the read performance of the whole system (see step 406 and step 407 in the embodiment shown in fig. 4).
6. According to the data access method of the present application, when reading the specific version of data before or after a certain timestamp, the log time information likewise makes it possible not to persist all log information into the state machine, so that such special read requests return faster (see steps 409 and 410 in the embodiment shown in fig. 4).
7. According to the data access method of the present application, through the Commit Return based data writing method, when a data write request is processed the master node is allowed to return the execution result without persisting the log information into the state machine, so that write requests return faster and the write performance of the whole system is improved (see step 1208 in the embodiment shown in fig. 12).
8. The data access method of the present application adopts an architecture that separates computation from storage, so the distribution and configuration of storage nodes are easy to change (see the HTAP database system shown in FIG. 2 for details).
9. Overall, the data access method of the present application solves the problem that, under multi-copy storage, the system concurrency and the storage advantages are not fully exploited; it improves the system concurrency, reduces the occupation of disk and network bandwidth, and ultimately improves the system throughput.
Fig. 16 is a schematic structural diagram of a data access device according to an embodiment of the present application. The data access apparatus is applied to a distributed database system, and referring to fig. 16, the data access apparatus includes: a first determining module 1601, a second determining module 1602, and a first reading module 1603.
A first determining module 1601, configured to determine, in response to a first data read request, a first data slice to which first target data of the first data read request belongs, and determine, based on the first data slice, a plurality of first storage nodes from the plurality of storage nodes, where the plurality of first storage nodes are configured to store multiple copies of the first target data;
a second determining module 1602, configured to determine a first target storage node from the plurality of first storage nodes based on the first data read request, send the first data read request to the first target storage node, where a data access cost of the first target storage node meets a first target condition;
the first reading module 1603 is configured to read the first target data based on the first data reading request and send a first data reading result to the computing node.
In an optional implementation, the first reading module 1603 includes:
a first reading unit, configured to read the first target data based on the first data reading request and the node type of the first target storage node if the first target data is current-state data;
and the second reading unit is used for reading the first target data based on the first data reading request and the transaction completion time of the first target data if the first target data is historical data.
In an alternative implementation, the first reading unit is configured to:
if the first target storage node is a main storage node, based on the first data reading request, determining a first reading index of the first target data, and reading the first target data from a state machine corresponding to the first target data by taking the first reading index as a starting point, wherein the first reading index is used for indicating a minimum reading index for reading the first target data based on the first data reading request;
if the first target storage node is a slave storage node, the first read index is acquired from the master storage node based on the first data read request, and the first target data is read from a state machine corresponding to the first target data by taking the first read index as a starting point.
In an optional implementation, the first reading unit is configured to:
updating a submission index at the current moment, wherein the submission index is used for indicating the maximum index of submitted logs in a log list;
scanning the logs stored in the log list according to a first order, wherein the first order refers to scanning from the submission index of the log list to the execution index of the log list, and the execution index is used for indicating the maximum index of executed logs in the log list;
if a first target log exists, determining the first read index based on the log index of the first target log, wherein the data operated by the first target log is the first target data; if the first target log does not exist, the first target storage node determines the first read index based on the execution index.
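As an illustration of the scan just described, the following minimal Go sketch determines the first read index by scanning the log list from the submission (commit) index towards the execution index; the logEntry type and the 1-indexed log layout are assumptions made for this example.

```go
package main

import "fmt"

// logEntry stands in for one entry of the replicated log; Key identifies the
// data item the entry operates on.
type logEntry struct {
	Index int
	Key   string
}

// firstReadIndex scans in the "first order", i.e. from the submission (commit)
// index down towards the execution index: if a log operating on the target data
// is found, its log index becomes the first read index; otherwise the execution
// index is used.
func firstReadIndex(logs []logEntry, commitIndex, executeIndex int, targetKey string) int {
	for idx := commitIndex; idx > executeIndex; idx-- {
		e := logs[idx-1] // entry with Index == idx
		if e.Key == targetKey {
			return e.Index // first target log found
		}
	}
	return executeIndex // no committed-but-unexecuted log touches the target data
}

func main() {
	logs := []logEntry{
		{Index: 1, Key: "a"},
		{Index: 2, Key: "x"}, // committed but not yet executed write to the target data
		{Index: 3, Key: "b"},
	}
	// Submission index = 3, execution index = 1, target data "x":
	fmt.Println(firstReadIndex(logs, 3, 1, "x")) // prints: 2
}
```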
In an optional implementation, the apparatus further comprises:
a first storage module, configured to store the first read index into a first list, where the first list includes the first target data, the first read index, and a first check index, where the first check index is used to indicate a commit index corresponding to the first target storage node when determining the first read index, and the commit index is used to indicate a maximum index of committed logs in a log list;
the first query module is configured to query the first list to read the first target data when the distributed database system processes a second data read request and data of the second data read request is the first target data.
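The first list can be pictured as a small in-memory cache keyed by the data item, as in the following Go sketch; the cacheEntry type and the decision of when a cached read index may be reused are assumptions for illustration, since the embodiment only specifies what the list records.

```go
package main

import "fmt"

// cacheEntry mirrors one row of the "first list": the read index determined for
// a data item and the check index (the commit index at the moment the read
// index was determined).
type cacheEntry struct {
	ReadIndex  int
	CheckIndex int
}

// readIndexCache is an illustrative in-memory form of the first list, keyed by
// the data item.
type readIndexCache map[string]cacheEntry

// lookup returns a previously determined read index for the data item so that a
// later read request for the same data can reuse it; when entries are
// invalidated is not modelled here.
func (c readIndexCache) lookup(key string) (cacheEntry, bool) {
	e, ok := c[key]
	return e, ok
}

func main() {
	cache := readIndexCache{}
	cache["x"] = cacheEntry{ReadIndex: 2, CheckIndex: 3} // stored by the first read

	if e, ok := cache.lookup("x"); ok { // second data read request for the same data
		fmt.Println("reuse read index", e.ReadIndex) // prints: reuse read index 2
	}
}
```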
In an optional implementation, the first reading unit is configured to:
if the log corresponding to the first read index exists in the first target storage node, performing persistent storage on the log corresponding to the first read index, and reading the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point;
if the log corresponding to the first read index does not exist in the first target storage node, the log corresponding to the first read index is acquired from the main storage node, and the first target data is read from the state machine corresponding to the first target data by taking the first read index as a starting point.
In an alternative implementation, the second reading unit is configured to:
if the data submission time of the first target data is before the transaction completion time, based on the first data reading request and the transaction completion time, scanning logs stored in a log list according to a second sequence, determining a first reading index, and reading the first target data from a state machine corresponding to the first target data by taking the first reading index as a starting point;
if the data commit time of the first target data is after the transaction completion time, based on the first data read request and the transaction completion time, scanning the logs stored in the log list according to a third sequence, determining the first read index, and reading the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point;
the second sequence refers to scanning from the commit index of the log list to the execution index of the log list, the third sequence refers to scanning from the execution index of the log list to the commit index of the log list, the commit index is used for indicating a maximum index of committed logs in the log list, the execution index is used for indicating a maximum index of executed logs in the log list, and the first read index is used for indicating a minimum read index for reading the first target data based on the first data read request.
In an alternative implementation, the second reading unit is configured to:
if a second target log exists, and the transaction completion time of the first target data is the same as the transaction completion time of the second target log, or the transaction completion time of the first target data is after the transaction completion time of the second target log, determining the first read index based on the log index of the second target log, wherein the data operated by the second target log is the first target data;
if the second target log does not exist, the first read index is determined based on the execution index of the log list.
In an alternative implementation, the data access cost is used to indicate the execution time, the waiting time and the transmission time of the storage node;
the execution time comprises the time for the storage node to inquire the first target data, the time for processing the data volume and the tuple construction time;
the waiting time comprises the request queue time, the equipment load delay time and the data synchronization time of the storage node;
the transmission time comprises a network transmission time.
In an alternative implementation, the data access cost of the first target storage node meets a first target condition, which includes any one of:
the storage mode of the first target data in the first target storage node is a column storage mode, and the ratio of the number of columns to be accessed by the data reading request to the total number of columns is smaller than a first threshold value, wherein the storage mode is used for indicating the storage format of data in the storage node;
the node load of the first target storage node is less than the node load of storage nodes other than the first target storage node in the plurality of storage nodes;
the physical distance between the first target storage node and the computing node is smaller than the physical distance between the storage nodes except the first target storage node in the plurality of storage nodes and the computing node;
the data synchronization state of the first target storage node is subsequent to the data synchronization state of storage nodes other than the first target storage node in the plurality of storage nodes.
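The following Go sketch illustrates, with invented cost values, how a computing node might compare the data access cost of candidate copies and select the first target storage node; the accessCost fields mirror the three components listed above, while the concrete numbers and the simple summation are assumptions made for this example.

```go
package main

import "fmt"

// accessCost collects the three cost components of one candidate storage node.
type accessCost struct {
	Execution    float64 // query time + data volume processing + tuple construction
	Waiting      float64 // request queuing + device load delay + data synchronization
	Transmission float64 // network transmission time
}

func (c accessCost) total() float64 {
	return c.Execution + c.Waiting + c.Transmission
}

// candidate describes one storage node holding a copy of the target data.
type candidate struct {
	Name string
	Cost accessCost
}

// pickTargetNode selects the copy whose total data access cost is lowest; that
// node is treated here as the one meeting the first target condition.
func pickTargetNode(cands []candidate) candidate {
	best := cands[0]
	for _, c := range cands[1:] {
		if c.Cost.total() < best.Cost.total() {
			best = c
		}
	}
	return best
}

func main() {
	cands := []candidate{
		{"master (row store)", accessCost{Execution: 4, Waiting: 6, Transmission: 1}},
		{"slave 1 (column store)", accessCost{Execution: 2, Waiting: 1, Transmission: 2}},
		{"slave 2 (row store)", accessCost{Execution: 4, Waiting: 2, Transmission: 3}},
	}
	fmt.Println("target storage node:", pickTargetNode(cands).Name)
	// prints: target storage node: slave 1 (column store)
}
```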
In an optional implementation, the apparatus further comprises:
and the adjusting module is used for dynamically adjusting the storage mode of the multiple copies of the first target data, and the storage mode is used for indicating the storage format of the data in the storage node.
In an alternative implementation, the adjustment module is configured to any one of:
switching the storage modes of the plurality of copies based on the load conditions of the plurality of first storage nodes;
if at least one copy abnormality exists in the plurality of copies, establishing at least one new copy based on the at least one copy;
if the first data fragment is subjected to data splitting, generating at least one second data fragment, and establishing a plurality of copies corresponding to the at least one second data fragment based on the at least one second data fragment;
the storage mode of the plurality of copies is adjusted based on the node type of the plurality of first storage nodes.
In an optional implementation manner, the switching the storage modes of the multiple copies based on the load conditions of the multiple first storage nodes includes any one of:
switching the storage mode of the plurality of copies based on the node load size and the available space of the plurality of first storage nodes;
and switching the storage modes of the plurality of copies based on the node load sizes of the plurality of first storage nodes and the number of copies in each storage mode.
In an optional implementation, the apparatus further comprises:
a third determining module, configured to, in response to a data write request, if a third data fragment to which second target data of the data write request belongs exists, determine a plurality of second storage nodes from the plurality of storage nodes based on the third data fragment, where the plurality of second storage nodes are used to store a plurality of copies of the second target data;
a sending module, configured to send the data write request to a main storage node in the plurality of second storage nodes;
and the first writing module is used for writing the second target data based on the data writing request and sending a first data writing result to the computing node.
In an optional implementation, the first writing module is configured to:
writing the second target data based on the data writing request, generating a data operation log, and sending a log synchronization request to a slave storage node in the plurality of storage nodes, wherein the log synchronization request is used for instructing the slave storage node to send a data synchronization message to the master storage node after the slave storage node synchronizes the data operation log;
and if the number of the data synchronization messages received by the main storage node is greater than or equal to half of the number of the slave storage nodes, confirming that the data write request has been operated successfully.
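The write path just described can be sketched as follows in Go; the acknowledgement channel standing in for the data synchronization messages and the simple majority test are illustrative simplifications, not the concrete protocol of the present application.

```go
package main

import "fmt"

// commitReturnWrite models the described write path: the master node writes the
// data, generates a data operation log, asks the slave nodes to synchronize the
// log, and confirms success as soon as at least half of the slave nodes have
// sent back a data synchronization message, without waiting for the log to be
// applied to the state machine.
func commitReturnWrite(slaveCount int, acks <-chan bool) bool {
	received := 0
	for range acks { // each message is one slave node's data synchronization message
		received++
		if received*2 >= slaveCount { // at least half of the slave nodes have synchronized
			return true // the data write request is confirmed as successful
		}
	}
	return false
}

func main() {
	acks := make(chan bool, 3)
	acks <- true // slave node 1 synchronized the data operation log
	acks <- true // slave node 2 synchronized the data operation log
	close(acks)

	fmt.Println("write confirmed:", commitReturnWrite(4, acks)) // prints: write confirmed: true
}
```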
In an optional implementation, the apparatus further comprises:
the first persistent storage module is used for performing persistent storage on the second target data;
and the second persistent storage module is used for performing format conversion on the second target data based on the data operation log and a storage mode of the second target data in the slave storage node, and performing persistent storage on the converted second target data, wherein the storage mode is used for indicating the storage format of data in the storage node.
In an optional implementation, the apparatus further comprises:
a fourth determining module, configured to determine a second read index of the second target data based on the log index of the data operation log, where the second read index is used to indicate a minimum read index for reading the second target data based on a third data read request;
a second storage module, configured to store the second read index into a second list, where the second list includes the second target data, the second read index, and a second check index, and the second check index is a log index of the data operation log;
and the second query module is used for querying the second list to read the second target data if the data of the third data read request is the second target data when the distributed database system processes the third data read request.
In an optional implementation, the apparatus further comprises:
a first establishing module, configured to, if the second target data of the data write request does not have a third data fragment to which the second target data belongs, establish the third data fragment of the second target data and send a copy creation request to the plurality of storage nodes;
and the second establishing module is used for establishing a plurality of copies corresponding to the third data fragment based on the copy establishing request.
In an optional implementation manner, the second establishing module is configured to:
and establishing a plurality of copies corresponding to the third data fragment based on the copy creation request and the storage mode of the second target data in the plurality of storage nodes, wherein the storage mode is used for indicating the storage format of the data in the storage nodes.
In an optional implementation, the apparatus further comprises:
a fifth determining module, configured to, in response to a data read-write request, determine, based on a fourth data fragment to which third target data of the data read-write request belongs, a plurality of third storage nodes from the plurality of storage nodes, where the plurality of third storage nodes are configured to store a plurality of copies of the third target data;
a second reading module, configured to determine, for a read operation in the data read-write request, a second target storage node from the multiple third storage nodes based on the data read-write request, send the data read-write request to the second target storage node, where the second target storage node reads the third target data based on the data read-write request, and sends a second data reading result to the computing node, where a data access cost of the second target storage node meets a second target condition;
and the second writing module is used for sending the data reading and writing request to a main storage node in the plurality of third storage nodes for the writing operation in the data reading and writing request, and the main storage node writes the third target data based on the data reading and writing request and sends a second data writing result to the computing node.
In an optional implementation manner, a slave storage node in the plurality of third storage nodes is configured with a memory lock, and the memory lock is used for locking the third target data when the write operation is not completed yet.
In an optional implementation, the apparatus further comprises:
a sixth determining module, configured to, when a fourth storage node exists in the plurality of storage nodes and becomes a master storage node through election, determine a timeout time by a slave storage node in the plurality of storage nodes based on a current storage mode and a write performance parameter of the slave storage node, where the storage mode is used to indicate a storage format of data in the storage nodes;
and the state switching module is used for switching the first slave storage node to a candidate state to participate in next election if the first slave storage node does not receive the message of the master storage node within the corresponding timeout time.
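As a rough illustration of this idea, the following Go sketch derives an election timeout from the current storage mode and a write performance parameter; the base timeout, the scaling factor, and the assumption that column-store copies persist logs more slowly are all invented for the example and are not values taken from the present application.

```go
package main

import (
	"fmt"
	"time"
)

// electionTimeout derives the timeout a slave storage node waits before
// switching to the candidate state from its storage mode and an (assumed)
// write latency figure, so that copies that are slower to persist logs do not
// trigger elections prematurely.
func electionTimeout(storageMode string, writeLatency time.Duration) time.Duration {
	base := 150 * time.Millisecond
	factor := 1.0
	if storageMode == "column" { // column-store copies assumed slower to write
		factor = 2.0
	}
	return base + time.Duration(factor*float64(writeLatency))
}

func main() {
	fmt.Println("row-store timeout:   ", electionTimeout("row", 20*time.Millisecond))
	fmt.Println("column-store timeout:", electionTimeout("column", 20*time.Millisecond))
}
```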
In the embodiment of the present application, when a data read request is received, a plurality of storage nodes storing multiple copies of the target data of the data read request are determined according to the data fragment to which the target data belong; a target storage node meeting a target condition is then selected from these storage nodes according to the data access cost of accessing the target data on each of them, and the target storage node reads the target data. Because the target storage node is determined according to the data access cost, either the master storage node or a slave storage node can become the target storage node, which prevents the master storage node from having to process all data read requests. This preserves the high availability brought by multiple copies, increases the data reading speed, and effectively improves the data access performance of the distributed database system.
It should be noted that: in the data access device provided in the above embodiment, only the division of the functional modules is illustrated when data access is performed, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data access device and the data access method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory is used to store at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations performed by the compute node or the storage node in the data access method in the embodiment of the present application.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a wired or wireless network; the multiple computer devices distributed at multiple sites and interconnected by a wired or wireless network may constitute a blockchain system.
Taking the computer device as a server as an example, fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1700 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1701 and one or more memories 1702, where the memory 1702 stores at least one computer program that is loaded and executed by the processor 1701 to implement the data access method provided by each of the above method embodiments. Of course, the server can also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server can also include other components for realizing device functions, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, which is applied to a computer device, and the computer-readable storage medium stores at least one computer program, which is loaded and executed by a processor to implement the operations performed by the computer device in the data access method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product or a computer program comprising computer program code stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, causing the computer device to perform the data access method provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (25)

1. A data access method for use in a distributed database system, the distributed database system including a compute node and a plurality of storage nodes, the method comprising:
the computing node responds to a first data reading request, determines a first data fragment to which first target data of the first data reading request belongs, and determines a plurality of first storage nodes from the plurality of storage nodes based on the first data fragment, wherein the plurality of first storage nodes are used for storing a plurality of copies of the first target data;
the computing node determines a first target storage node from the plurality of first storage nodes based on the first data reading request, and sends the first data reading request to the first target storage node, wherein the data access cost of the first target storage node meets a first target condition;
and the first target storage node reads the first target data based on the first data reading request and sends a first data reading result to the computing node.
2. The method of claim 1, wherein the first target storage node reads the first target data based on the first data read request, comprising any one of:
if the first target data is current-state data, the first target storage node reads the first target data based on the first data reading request and the node type of the first target storage node;
and if the first target data is historical data, the first target storage node reads the first target data based on the first data reading request and the transaction completion time of the first target data.
3. The method of claim 2, wherein if the first target data is current state data, the first target storage node reads the first target data based on the first data read request and a node type of the target storage node, and comprises any one of:
if the first target storage node is a main storage node, the first target storage node determines a first read index of the first target data based on the first data read request, and reads the first target data from a state machine corresponding to the first target data with the first read index as a starting point, wherein the first read index is used for indicating a minimum read index for reading the first target data based on the first data read request;
if the first target storage node is a slave storage node, the first target storage node acquires the first read index from the master storage node based on the first data read request, and reads the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point.
4. The method of claim 3, wherein if the first target storage node is a master storage node, the first target storage node determining a first read index of the first target data based on the first data read request, comprising:
the first target storage node updates a submission index at the current moment, wherein the submission index is used for indicating the maximum index of submitted logs in a log list;
the first target storage node scans the logs stored in the log list according to a first sequence, wherein the first sequence refers to scanning from the submission index of the log list to the execution index of the log list, and the execution index is used for indicating the maximum index of the executed logs in the log list;
if a first target log exists, the first target storage node determines the first read index based on a log index of the first target log, and data operated by the first target log is the first target data; and if the first target log does not exist, the first target storage node determines the first read index based on the execution index.
5. The method of claim 3, wherein after the first target storage node determines the first read index of the first target data based on the first data read request, the method further comprises:
the first target storage node stores the first read index to a first list, the first list comprises the first target data, the first read index and a first check index, the first check index is used for indicating a commit index corresponding to the first target storage node when determining the first read index, and the commit index is used for indicating a maximum index of committed logs in a log list;
when the distributed database system processes a second data reading request, if the data of the second data reading request is the first target data, the first list is inquired to read the first target data.
6. The method according to claim 3, wherein if the first target storage node is a slave storage node, the first target storage node obtains the first read index from the master storage node based on the first data read request, and reads the first target data from a state machine corresponding to the first target data with the first read index as a starting point, including any one of:
if the log corresponding to the first read index exists in the first target storage node, the first target storage node performs persistent storage on the log corresponding to the first read index, and reads the first target data from a state machine corresponding to the first target data by taking the first read index as a starting point;
if the log corresponding to the first read index does not exist in the first target storage node, the first target storage node acquires the log corresponding to the first read index from the main storage node, and reads the first target data from the state machine corresponding to the first target data by taking the first read index as a starting point.
7. The method of claim 2, wherein if the first target data is historical data, the first target storage node reads the first target data based on the first data read request and a transaction completion time of the first target data, comprising any one of:
if the data submission time of the first target data is before the transaction completion time, the first target storage node scans logs stored in a log list according to a second sequence based on the first data reading request and the transaction completion time, determines a first reading index, and reads the first target data from a state machine corresponding to the first target data by taking the first reading index as a starting point;
if the data submission time of the first target data is after the transaction completion time, the first target storage node scans the logs stored in the log list according to a third sequence based on the first data reading request and the transaction completion time, determines the first reading index, and reads the first target data from the state machine corresponding to the first target data by taking the first reading index as a starting point;
wherein the second sequence refers to scanning from the commit index of the log list to the execution index of the log list, the third sequence refers to scanning from the execution index of the log list to the commit index of the log list, the commit index is used to indicate a maximum index of committed logs in the log list, the execution index is used to indicate a maximum index of executed logs in the log list, and the first read index is used to indicate a minimum read index for reading the first target data based on the first data read request.
8. The method of claim 7, wherein scanning the logs stored in the log list to determine the first read index comprises any one of:
if a second target log exists, and the transaction completion time of the first target data is the same as the transaction completion time of the second target log, or the transaction completion time of the first target data is after the transaction completion time of the second target log, determining the first read index based on a log index of the second target log, wherein the data operated by the second target log is the first target data;
and if the second target log does not exist, determining the first read index based on the execution index of the log list.
9. The method of claim 1, wherein the data access cost is indicative of an execution time, a latency time, and a transmission time of the storage node;
the execution time comprises the time for the storage node to inquire the first target data, the time for processing the data volume and the tuple construction time;
the waiting time comprises the request queue time, the equipment load delay time and the data synchronization time of the storage node;
the transmission time comprises a network transmission time.
10. The method of claim 1, wherein the data access cost of the first target storage node meets a first target condition, comprising any one of:
the storage mode of the first target data in the first target storage node is a column storage mode, the ratio of the number of columns to be accessed by the data reading request to the total number of columns is smaller than a first threshold value, and the storage mode is used for indicating the storage format of the data in the storage node;
the node load of the first target storage node is less than the node load of storage nodes other than the first target storage node in the plurality of storage nodes;
a physical distance between the first target storage node and the compute node is less than a physical distance between storage nodes of the plurality of storage nodes other than the first target storage node and the compute node;
the data synchronization state of the first target storage node is subsequent to the data synchronization state of storage nodes other than the first target storage node in the plurality of storage nodes.
11. The method of claim 1, further comprising:
the first storage nodes dynamically adjust storage modes of the copies of the first target data, wherein the storage modes are used for indicating storage formats of the data in the storage nodes.
12. The method of claim 11, wherein the plurality of first storage nodes dynamically adjust the storage pattern of the plurality of copies of the first target data, comprising any of:
the plurality of first storage nodes switch the storage modes of the plurality of copies based on the load conditions of the plurality of first storage nodes;
if at least one copy abnormality exists in the plurality of copies, the plurality of first storage nodes establish at least one new copy based on the at least one copy;
if the first data fragment is subjected to data splitting, generating at least one second data fragment, and establishing a plurality of copies corresponding to the at least one second data fragment by the plurality of first storage nodes based on the at least one second data fragment;
the plurality of first storage nodes adjust the storage mode of the plurality of copies based on the node type of the plurality of first storage nodes.
13. The method of claim 12, wherein the plurality of first storage nodes switch the storage mode of the plurality of copies based on a load condition of the plurality of first storage nodes, comprising any one of:
the plurality of first storage nodes switch the storage modes of the plurality of copies based on the node load size and the available space of the plurality of first storage nodes;
the plurality of first storage nodes switch the storage modes of the plurality of copies based on the node load size of the plurality of first storage nodes and the number of copies in each storage mode.
14. The method of claim 1, further comprising:
the computing node responds to a data writing request, if second target data of the data writing request has a third data fragment, a plurality of second storage nodes are determined from the plurality of storage nodes based on the third data fragment, and the plurality of second storage nodes are used for storing a plurality of copies of the second target data;
the computing node sending the data write request to a primary storage node of the plurality of second storage nodes;
and the main storage node writes the second target data based on the data writing request and sends a first data writing result to the computing node.
15. The method of claim 14, wherein the primary storage node writes the second target data based on the data write request, comprising:
the main storage node writes the second target data based on the data writing request, generates a data operation log, and sends a log synchronization request to a slave storage node in the plurality of storage nodes, wherein the log synchronization request is used for indicating the slave storage node to send a data synchronization message to the main storage node after synchronizing the data operation log;
if the number of the data synchronization messages received by the main storage node is larger than or equal to half of the number of the secondary storage nodes, the main storage node confirms that the data writing request is operated successfully.
16. The method of claim 15, wherein after the primary storage node acknowledges that the data write request has been successfully operated, the method further comprises:
the primary storage node persistently stores the second target data;
and the slave storage node performs format conversion on the second target data based on the data operation log and a storage mode of the second target data in the slave storage node, and performs persistent storage on the converted second target data, wherein the storage mode is used for indicating a storage format of data in the storage node.
17. The method of claim 15, wherein after the primary storage node writes the second target data based on the data write request and generates a data operation log, the method further comprises:
the main storage node determines a second read index of the second target data based on the log index of the data operation log, wherein the second read index is used for indicating a minimum read index for reading the second target data based on a third data reading request;
storing the second read index to a second list, the second list including the second target data, the second read index, and a second check index, the second check index being a log index of the data oplog;
when the distributed database system processes the third data reading request, if the data of the third data reading request is the second target data, querying the second list to read the second target data.
18. The method of claim 14, further comprising:
if the second target data of the data writing request does not have the third data fragment, the computing node establishes the third data fragment of the second target data and sends a copy creating request to the plurality of storage nodes;
and the plurality of storage nodes establish a plurality of copies corresponding to the third data fragment based on the copy creation request.
19. The method of claim 18, wherein the plurality of storage nodes establish a plurality of copies corresponding to the third data slice based on the copy creation request, comprising:
the plurality of storage nodes establish a plurality of copies corresponding to the third data fragment based on the copy creation request and a storage mode of the second target data in the plurality of storage nodes, wherein the storage mode is used for indicating a storage format of data in the storage nodes.
20. The method of claim 1, further comprising:
the computing node responds to a data reading and writing request, if a fourth data fragment to which third target data of the data reading and writing request belongs exists, a plurality of third storage nodes are determined from the plurality of storage nodes based on the fourth data fragment, and the plurality of third storage nodes are used for storing a plurality of copies of the third target data;
for a read operation in the data read-write request, the computing node determines a second target storage node from the plurality of third storage nodes based on the data read-write request, sends the data read-write request to the second target storage node, the second target storage node reads the third target data based on the data read-write request, and sends a second data read result to the computing node, wherein the data access cost of the second target storage node meets a second target condition;
for the write operation in the data read-write request, the computing node sends the data read-write request to a main storage node in the plurality of third storage nodes, and the main storage node writes the third target data based on the data read-write request and sends a second data write result to the computing node.
21. The method of claim 20,
a slave storage node of the plurality of third storage nodes is configured with a memory lock, and the memory lock is used for locking the third target data when the write operation is not completed yet.
22. The method of claim 1, further comprising:
when a fourth storage node exists in the plurality of storage nodes and becomes a main storage node through election, a slave storage node in the plurality of storage nodes determines a timeout time based on a current storage mode and a write performance parameter of the slave storage node, wherein the storage mode is used for indicating a storage format of data in the storage nodes;
and if the first slave storage node does not receive the message of the master storage node within the corresponding timeout time, the first slave storage node is switched to a candidate state to participate in next election.
23. A data access apparatus, applied to a distributed database system, the apparatus comprising:
a first determining module, configured to determine, in response to a first data read request, a first data fragment to which first target data of the first data read request belongs, and determine, based on the first data fragment, a plurality of first storage nodes from the plurality of storage nodes, where the plurality of first storage nodes are configured to store multiple copies of the first target data;
a second determining module, configured to determine a first target storage node from the plurality of first storage nodes based on the first data read request, and send the first data read request to the first target storage node, where a data access cost of the first target storage node meets a first target condition;
and the first reading module is used for reading the first target data based on the first data reading request and sending a first data reading result to the computing node.
24. A computer device, characterized in that the computer device comprises a processor and a memory for storing at least one computer program, which is loaded by the processor and which performs the data access method according to any one of claims 1 to 22.
25. A computer-readable storage medium, having stored therein at least one computer program, which is loaded and executed by a processor, to implement the data access method of any one of claims 1 to 22.
CN202110709977.0A 2021-06-25 2021-06-25 Data access method, device, equipment and storage medium Active CN113535656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709977.0A CN113535656B (en) 2021-06-25 2021-06-25 Data access method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110709977.0A CN113535656B (en) 2021-06-25 2021-06-25 Data access method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113535656A true CN113535656A (en) 2021-10-22
CN113535656B CN113535656B (en) 2022-08-09

Family

ID=78125940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709977.0A Active CN113535656B (en) 2021-06-25 2021-06-25 Data access method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113535656B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324785A1 (en) * 2013-04-30 2014-10-30 Amazon Technologies, Inc. Efficient read replicas
CN105516263A (en) * 2015-11-28 2016-04-20 华为技术有限公司 Data distribution method, device in storage system, calculation nodes and storage system
CN106844399A (en) * 2015-12-07 2017-06-13 中兴通讯股份有限公司 Distributed data base system and its adaptive approach
CN106406758A (en) * 2016-09-05 2017-02-15 华为技术有限公司 Data processing method based on distributed storage system, and storage equipment
CN112148798A (en) * 2020-10-10 2020-12-29 腾讯科技(深圳)有限公司 Data processing method and device applied to distributed system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114344A (en) * 2021-11-05 2022-09-27 腾讯科技(深圳)有限公司 Transaction processing method and device, computing equipment and storage medium
CN115114344B (en) * 2021-11-05 2023-06-23 腾讯科技(深圳)有限公司 Transaction processing method, device, computing equipment and storage medium
CN114244859A (en) * 2022-02-23 2022-03-25 阿里云计算有限公司 Data processing method and device and electronic equipment
WO2023193495A1 (en) * 2022-04-07 2023-10-12 华为技术有限公司 Method for processing read request, distributed database and server
CN114969072A (en) * 2022-06-06 2022-08-30 北京友友天宇系统技术有限公司 Data transmission method, device and equipment based on state machine and data persistence
WO2023236629A1 (en) * 2022-06-07 2023-12-14 华为技术有限公司 Data access method and apparatus, and storage system and storage medium
CN115103011A (en) * 2022-06-24 2022-09-23 北京奥星贝斯科技有限公司 Cross-data-center service processing method, device and equipment
CN115103011B (en) * 2022-06-24 2024-02-09 北京奥星贝斯科技有限公司 Cross-data center service processing method, device and equipment
CN115114374A (en) * 2022-06-27 2022-09-27 腾讯科技(深圳)有限公司 Transaction execution method and device, computing equipment and storage medium
WO2024040902A1 (en) * 2022-08-22 2024-02-29 华为云计算技术有限公司 Data access method, distributed database system and computing device cluster

Also Published As

Publication number Publication date
CN113535656B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN113535656B (en) Data access method, device, equipment and storage medium
US11388043B2 (en) System and method for data replication using a single master failover protocol
US11120044B2 (en) System and method for maintaining a master replica for reads and writes in a data store
US11243945B2 (en) Distributed database having blockchain attributes
US10929240B2 (en) System and method for adjusting membership of a data replication group
CN111338766B (en) Transaction processing method and device, computer equipment and storage medium
Akkoorath et al. Cure: Strong semantics meets high availability and low latency
US9411873B2 (en) System and method for splitting a replicated data partition
JP2023546249A (en) Transaction processing methods, devices, computer equipment and computer programs
US20140244581A1 (en) System and method for log conflict detection and resolution in a data store
US20130110873A1 (en) Method and system for data storage and management
CN111597015A (en) Transaction processing method and device, computer equipment and storage medium
US10712964B2 (en) Pre-forking replicas for efficient scaling of a distributed data storage system
WO2022170979A1 (en) Log execution method and apparatus, and computer device and storage medium
CN112199427A (en) Data processing method and system
CN115114374B (en) Transaction execution method and device, computing equipment and storage medium
JP2023541298A (en) Transaction processing methods, systems, devices, equipment, and programs
Waqas et al. Transaction management techniques and practices in current cloud computing environments: A survey
US11461201B2 (en) Cloud architecture for replicated data services
Sarr et al. Transpeer: Adaptive distributed transaction monitoring for web2. 0 applications
US11789971B1 (en) Adding replicas to a multi-leader replica group for a data set
US11360866B2 (en) Updating stateful system in server cluster
Sapate et al. Survey on comparative analysis of database replication techniques
MADHAVI et al. DESIGN AND PERFORMANCE EVALUATION OF HYBRID BLOCKCHAIN DATABASE SYSTEMS
CN116975147A (en) Data storage method, system, node, calculation engine and coordinator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant