WO2022170979A1 - Log execution method, apparatus, computer device and storage medium - Google Patents

Log execution method, apparatus, computer device and storage medium

Info

Publication number
WO2022170979A1
Authority
WO
WIPO (PCT)
Prior art keywords
log
logs
node device
executed
conflict
Prior art date
Application number
PCT/CN2022/074080
Other languages
English (en)
French (fr)
Inventor
李海翔
李昊华
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22752140.8A (EP4276651A4)
Priority to JP2023537944A (JP2024501245A)
Publication of WO2022170979A1
Priority to US18/079,238 (US20230110826A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1471Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1469Backup restoration techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2219Large Object storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2365Ensuring data consistency and integrity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2379Updates performed during online database operations; commit processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80Database-specific techniques

Definitions

  • the present application relates to the technical field of databases, and in particular, to a log execution method, apparatus, computer equipment and storage medium.
  • Raft algorithm is a consensus algorithm widely used in engineering, which has the characteristics of strong consistency and decentralization and is easy to understand, develop, and implement.
  • Embodiments of the present application provide a log execution method, apparatus, computer device, and storage medium.
  • a log execution method which is applied to a first node device in a distributed storage system, and the method includes:
  • the log execution active window includes multiple logs that have not been executed, and the logs before the log execution active window have been executed;
  • a log execution device applied to a distributed storage system, and the device includes:
  • a scanning module configured to cyclically scan the log execution active window, the log execution active window includes multiple logs that have not been executed, and the logs before the log execution active window have been executed;
  • the first acquisition module is configured to, for any log in the log execution active window, obtain the conflict verification result of the log based on the storage range information of the log, where the storage range information is used to indicate the storage range of the log and the storage range of the target number of logs before the log, and the target number is equal to the window size of the log execution active window;
  • An execution module configured to execute any one of the logs in response to the conflict verification result being that there is no conflict.
  • the first obtaining module is used for:
  • the execution module includes:
  • An adding unit is used to add any of the logs to the list of logs to be executed
  • the processing unit is configured to call the log execution thread to process the logs stored in the log list to be executed.
  • the processing unit is used for:
  • the device further includes:
  • a second obtaining module configured to, in response to a crash event of the distributed storage system, obtain a plurality of logs to be restored from the executed log list based on the state parameter when the distributed storage system restarts, where the plurality of business data corresponding to the plurality of logs to be restored have been written to the volatile storage medium but not written to the non-volatile storage medium;
  • a recovery module configured to sequentially restore the plurality of business data corresponding to the plurality of to-be-restored logs to the volatile storage medium based on the storage order of the plurality of to-be-restored logs in the executed log list.
  • the device further includes:
  • a third acquisition module configured to acquire the number of logs that conflict with any of the logs in response to the conflict verification result being a conflict
  • a determining module configured to determine the scanning frequency of any of the logs based on the number of logs
  • the execution module is further configured to scan the log according to the scanning frequency and refresh the conflict verification result, and to execute the log once the conflict verification result becomes no conflict.
  • the determining module is used to:
  • in response to the number of logs being greater than a conflict threshold, determining the scan frequency to be a first frequency; or,
  • in response to the number of logs being less than or equal to the conflict threshold, determining the scan frequency to be a second frequency, the second frequency being greater than the first frequency.
  • the execution module is used to:
  • the any log is added to a log list corresponding to the scanning frequency, and the logs stored in the log list are scanned according to the scanning frequency.
  • the scanning module is further configured to: cyclically scan a log matching index table, where the log matching index table is used to record the number of copies of multiple to-be-committed logs stored in the distributed storage system;
  • the apparatus further includes a submission module, configured to submit any log to be submitted in response to the number of copies of any log to be submitted in the log matching index table meeting the target condition.
  • the apparatus is elected by a plurality of second node devices in the distributed storage system after the end of the previous term, and the maximum index among the consecutive submitted logs in the apparatus is greater than or equal to the maximum index among the consecutive submitted logs in each of the plurality of second node devices.
  • the device further includes:
  • the fourth acquisition module is configured to, in response to the absence of at least one log in the log submission active window, acquire at least one index of the missing at least one log, where the log submission active window includes a plurality of logs that have not been submitted, and the logs before the log submission active window have been submitted;
  • a requesting module configured to request the at least one log from the plurality of second node devices based on the at least one index.
  • the device further includes:
  • a receiving module configured to receive at least one target log returned by the plurality of second node devices and the submission status information of the at least one target log
  • a completion module configured to fill the target logs whose submission status information is submitted into the log submission active window
  • the request module is further configured to request the terminal for the target log whose submission status information is unsubmitted.
  • a computer device, comprising one or more processors and one or more memories, where the one or more memories store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors to implement the log execution method described above.
  • a storage medium is provided, and at least one computer program is stored in the storage medium, and the at least one computer program is loaded and executed by a processor to implement the above log execution method.
  • a computer program product or computer program comprising one or more pieces of program codes stored in a computer-readable storage medium.
  • One or more processors of the computer device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code to enable the computer device to execute the above log execution method.
  • FIG. 1 is a schematic diagram of the architecture of a distributed storage system provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a principle of a log out-of-order replication mechanism provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a log execution method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a LHA data structure provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the principle of a data persistence mechanism provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of log out-of-order execution scheduling provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a data structure of a log submission active window provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a log out-of-order submission mechanism provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a log inconsistency situation provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a leader node election mechanism provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a term of office provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of interaction between a terminal and a cluster provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a log execution device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the term "at least one” refers to one or more, and the meaning of "plurality” refers to two or more, for example, a plurality of first positions refers to two or more first positions.
  • Cloud Technology: refers to a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or a local area network to realize the computing, storage, processing, and sharing of data, that is, a business model based on cloud computing.
  • Cloud computing technology will become an important support in the field of cloud technology. Background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources. With the rapid development and application of the Internet industry, in the future each item may have its own identification mark, which needs to be transmitted to a back-end system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing support, which can be realized through cloud computing.
  • Cloud Storage: a new concept extended and developed from the concept of cloud computing.
  • Distributed cloud storage system (hereinafter referred to as storage system): a storage system that, through functions such as cluster applications, grid technology, and distributed file systems, integrates a large number of different types of storage devices (also called storage nodes) in the network via application software or application interfaces to work together and jointly provide external data storage and business access functions.
  • Database: in short, it can be regarded as an electronic filing cabinet, a place to store electronic files, which supports users to add, query, update, and delete data in files.
  • the so-called “database” is a collection of data that is stored together in a certain way, can be shared with multiple users, has as little redundancy as possible, and is independent of applications.
  • Full state of data: for data items in the database system, based on different state attributes, the data is divided into three states: current state, transition state, and historical state. These three states are collectively referred to as the "full state of data", or full-state data for short; the different state attributes in the full-state data are used to identify the state of the data in its life cycle trajectory.
  • the historical state (Historical State): a state of the data item in history, its value is the old value, not the current value.
  • multiple historical data items correspond to the same primary key identifier, reflecting the state transition process of each data item with the primary key identifier. Data items in the historical state can only be read and cannot be modified or deleted.
  • Transitional state: neither the current-state data item nor the historical-state data item, but data that is in the process of transitioning from the current state to the historical state; data in the transitional state is also called half-dead data.
  • all data items with the same primary key identifier constitute a full-state data set, and each data item in the set is essentially used to represent full-state data; that is, in the process of repeatedly modifying (or deleting) the initial data item with that primary key identifier, multiple data items of different versions are generated, which together constitute a full-state data set.
  • in a full-state data set, some data items are in the current state, some are in the transition state, and some are in the historical state.
  • the global data set here refers to an abstract and virtual collection concept, and each data item in the same global data set can be distributed and stored on different physical machines.
  • the database system uses a pointer to link each data item corresponding to the same primary key identifier according to the time sequence, so as to facilitate the query of the life cycle trajectory of the global data.
  • a transaction is a logical unit of the database management system in the process of performing operations. It consists of a limited sequence of database operations and is the smallest execution unit of database system operations. Inside a system, the unit of each operation series is called a transaction, and a single operation is also called a transaction. Transactions must obey the ACID principle, where ACID is an abbreviation for Atomicity, Consistency, Isolation, and Durability.
  • Logs The logs involved in the embodiments of the present application, also referred to as log items and log records, all refer to transaction logs in a distributed storage system.
  • data changes are recorded in a series of consecutive log records, each log record is stored in a virtual log file.
  • the transaction log has any number of virtual log files, the number of which depends on the database engine, and the size of each virtual log file is not fixed.
  • the database in SQL Server is composed of one or more data files and one or more transaction log files.
  • data files mainly store database data (also known as data items, data records), including database content structure, data pages, index pages, etc.
  • transaction logs are mainly used to save database modification records.
  • transaction logs are used to record all transactions and the modifications made to the database by each transaction, and they are an important component of backup and recovery of database systems.
  • in order to ensure ACID integrity, a transaction must rely on the transaction log for traceability, that is, each operation of the transaction must be written to the log before it is placed on the disk.
  • the log includes a redo log (Redo Log), a rollback log (Undo Log), and a binary log (Bin Log, also known as an archive log). The redo log is used to record the changes in data caused by transaction operations and records the physical modification of data pages, so the redo log is a physical log, which records how a transaction has modified certain data. The binary log is mainly used to record changes of the database, including all update operations of the database; all operations involving data changes must be recorded in the binary log.
  • the binary log belongs to the logical log and records the original logic of the SQL statement. The function of the rollback log is to roll back the data: when a transaction modifies the database, the database engine not only records the redo log but also generates the corresponding rollback log; if the transaction execution fails or the Rollback interface is called, causing the transaction to be rolled back, the information in the rollback log can be used to roll the data back to what it was before the modification. The rollback log belongs to the logical log and records information related to the execution of the SQL statement.
  • the distributed storage system involved in the embodiments of the present application includes a distributed database system and a distributed database management system using a similar architecture.
  • the distributed storage system is a distributed transactional database system, which requires distributed transaction processing capability and a consistency model on shared data.
  • the distributed storage system includes at least one node device, and the database of each node device stores one or more data tables, and each data table is used to store one or more data items (also called variable versions).
  • the database of the node device is any type of distributed database, including at least one of a relational database or a non-relational database, such as an SQL (Structured Query Language) database, NoSQL, NewSQL (generally referring to various new types of scalable/high-performance databases), etc.; the type of the database is not specifically limited in this embodiment of the present application.
  • the embodiments of the present application are applied to a database system based on blockchain technology (hereinafter referred to as "blockchain system"), and the above-mentioned blockchain system is essentially a decentralized database system.
  • blockchain system uses a consensus algorithm to keep the ledger data recorded by different node devices on the blockchain consistent, and uses a cryptographic algorithm to ensure the encrypted transmission of ledger data between different node devices and cannot be tampered with.
  • a script system is used to extend the ledger function, and network routing is used for interconnection between different node devices.
  • a blockchain system includes one or more blockchains.
  • a blockchain is a series of data blocks associated with each other using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • a peer-to-peer (P2P) network is formed between node devices in the blockchain system.
  • the P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP) protocol.
  • any node device has the following functions: A) routing, a basic function of node devices, used to support communication between node devices; B) application, used to be deployed in the blockchain to realize specific business according to actual business requirements, record data related to the realization of functions to form ledger data, carry digital signatures in the ledger data to indicate the source of the data, and send the ledger data to other node devices in the blockchain system, so that the other node devices add the ledger data to a temporary block when the source and integrity of the ledger data are successfully verified; the business implemented by the application includes wallets, shared ledgers, smart contracts, and the like;
  • C) Blockchain, including a series of blocks that are consecutive in chronological order; once a new block is added to the blockchain, it will not be removed, and the blocks record the ledger data submitted by the node devices in the blockchain system.
  • each block includes the hash value of the transaction records stored in the current block (the hash value of the current block) and the hash value of the previous block, and the blocks are connected by hash values to form a blockchain; optionally, the block also includes information such as the timestamp when the block was generated.
  • the distributed storage system is an HTAC (Hybrid Transaction/Analytical Cluster, hybrid transaction/analysis cluster) system as an example for description.
  • the HTAC system includes a TP (Transaction Processing, transaction processing) cluster 101 and an AP (Analytical Processing, analytical processing) cluster 102 .
  • the TP cluster 101 is used to provide transaction processing services.
  • the TP cluster 101 includes a plurality of TP node devices.
  • each TP node device is used to process current-state data, where each TP node device is a stand-alone device or a cluster with one master and multiple backups; the type of the TP node device is not specifically limited in this embodiment of the present application.
  • the AP cluster 102 is used to store historical data and provide historical data query and analysis services.
  • the AP cluster 102 includes a global timestamp generation cluster and a distributed storage cluster, and the distributed storage cluster includes multiple AP node devices.
  • each AP node device is a stand-alone device or a cluster with one master and multiple backups; the type of the AP node device is not specifically limited in this embodiment of the present application.
  • the set of database instances of the host or standby machine corresponding to each TP node device is called a SET (set).
  • if a TP node device is a stand-alone device, the SET of the TP node device is only the database instance of that stand-alone device; if the TP node device is a cluster with one master and two backups, the SET of the TP node device is the set of the master database instance and the two backup database instances.
  • Synchronization technology is used to ensure the consistency between the data of the host and the copy data of the standby machine.
  • each SET is linearly expanded to meet the business processing requirements in big data scenarios.
  • each AP node device stores the historical state data generated by the TP cluster 101 in the local database, and accesses the distributed file system 103 through the storage interface, so as to use the distributed file system 103 to store the historical state data generated by the TP cluster 101.
  • the distributed file system 103 includes but is not limited to: HDFS (Hadoop Distributed File System, Hadoop Distributed File System), Ceph (a distributed file system under a Linux system), Alluxio (a memory-based distributed file system), etc.
  • since the TP node device provides transaction processing services, when any transaction is committed and new current-state data is generated, historical-state data corresponding to the current-state data is also generated. Historical-state data occupies a lot of storage space but has preservation value, so the TP node device atomically migrates the generated historical-state data to the AP cluster 102 through a predefined historical-data migration strategy. In the AP cluster 102, each AP node device dumps the historical-state data based on the Local Executor (LE), and registers the metadata of each data migration with the Metadata (MD) manager, so that the AP cluster 102 can count the meta-information of the stored data based on the MD manager.
  • the client routes a query to the data stored in the TP cluster 101 or the AP cluster 102 based on the query statement, the semantics of the query operation, and the metadata provided in the SQL Router (SR) layer;
  • the TP cluster 101 mainly provides query services for current state data
  • the AP cluster 102 mainly provides query services for historical state data.
  • the above-mentioned TP cluster 101 or AP cluster 102 is a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited in the embodiments of this application.
  • the user terminal that is, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal and the TP cluster 101 can be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • the above-mentioned HTAC system is an exemplary illustration of a distributed database system, and also an exemplary illustration of a distributed transactional database system, which requires the capability of distributed transaction processing and a consistency model on shared data.
  • the above-mentioned HTAC system can realize more and more efficient data anomaly identification and realize an efficient serializable isolation level, so that the HTAC system can be adapted to various business scenarios.
  • when the strict serializability level is used, it can be well applied to the financial field to ensure the reliability of data, and the current mainstream distributed database systems cannot efficiently provide this consistency level.
  • when a weak consistency level is used, it can be well applied to Internet scenarios, thereby providing high-concurrency, real-time database services and giving Internet users a good product experience.
  • the storage layer needs a consensus protocol to ensure data consistency and high availability, which is the specific technical background on which the embodiments of the present application are based.
  • the log execution method and related theories relate to the concurrent Raft (Parallel Raft) algorithm.
  • the above log execution method can effectively ensure the correctness of user data and also performs well.
  • the Raft algorithm proposed by Diego Ongaro of Stanford University has gradually become one of the mainstream consensus algorithms in the field of distributed systems.
  • the Raft algorithm is a consistency algorithm based on a master-slave model that guarantees strong consistency, and it is very easy to understand, develop, and implement.
  • the replicated log (Replicated Log) is managed based on the master-slave mode: in each round of term, one node is elected as the leader (Leader, the master node in the Raft algorithm, that is, the first node device), and the remaining nodes serve as followers (Follower, the slave node in the Raft algorithm, that is, the second node device); only the leader can respond to the request of the terminal and forward the request to the other followers.
  • since the Raft algorithm is divided into leader election, log replication, safety, and other parts, for simplicity and comprehensibility the Raft algorithm adopts a highly serialized design, and logs are not allowed to be missing on either the leader or the followers; that is, all logs are confirmed by the followers in order, then committed (Commit) by the leader, and then applied (Apply) to all replicas.
  • the leader and the follower usually use multiple connections to transmit logs concurrently in the engineering scenario.
  • the order in which the logs arrive at the follower will be chaotic; that is, the logs stored on the leader are sent to the follower out of order.
  • based on the mechanism of the Raft algorithm, the follower must receive the logs in order. Since the logs arrive at the follower out of order, the follower's log list will have missing logs, and logs with larger indexes have to wait until the missing earlier logs reach the follower before the blocking is released, which reduces the throughput of the system. Additionally, the entire system stutters when multiple followers are blocked by individual missing logs.
  • the Raft algorithm is not suitable for high concurrency scenarios due to its strict serialization characteristics, which severely limits the throughput of distributed systems.
  • Distributed storage systems need high-concurrency I/O (Input/Output) capability to improve system throughput, so the embodiments of this application provide a log execution method that can release the strict ordering constraints of the traditional Raft algorithm; that is, an improved consensus algorithm based on the Raft algorithm is designed, called the Parallel Raft (concurrent Raft) algorithm.
  • the concurrent Raft algorithm is applied to a distributed storage system, and the distributed storage system includes a first node device (Leader, leader node) and a plurality of second node devices (Follower, follower nodes).
  • the concurrent Raft algorithm involves a log out-of-order replication mechanism based on a distributed storage system.
  • the log out-of-order replication mechanism includes the first node device receiving the service request of the terminal, a log out-of-order acknowledgment (ACK) stage, a log out-of-order submission (Commit) stage, and a log out-of-order execution (Apply) stage.
  • the traditional Raft algorithm ensures serialization in the following ways: when the first node device sends a log to the second node device, the second node device needs to return an ACK (Acknowledge Character) message to confirm that the log has been received and recorded, which also implicitly indicates that all previous logs have been received and stored persistently; and when the first node device commits a log and broadcasts it to all second node devices, it also confirms that all logs before the committed log have been committed.
  • in the embodiments of the present application, the above-mentioned serialization limitation is broken, so that logs can be copied from the first node device to the second node device concurrently and out of order, the second node device can acknowledge each received log concurrently and out of order, and the first node device can commit logs concurrently and out of order according to the log receiving statistics.
  • FIG. 2 is a schematic diagram of a principle of a log out-of-order replication mechanism provided by an embodiment of the present application.
  • any log under the concurrent Raft algorithm can return an ACK message immediately after it is successfully persisted on the second node device, without waiting for the logs before the log in the log list to be confirmed in turn.
  • after most copies of a log have been confirmed, the first node device can commit the log without waiting for the logs before it in the log list to be committed.
  • the out-of-order submission is carried out within the limit of the log submission (Commit) active window; that is, the submission of a log in the log submission active window is not restricted by the other logs in the window. The log submission active window is used to support out-of-order submission of logs. The log out-of-order submission mechanism will be introduced in subsequent embodiments and will not be described here.
  • the log out-of-order confirmation mechanism and the log out-of-order submission mechanism will cause missing logs in the log list of the second node device, and the log out-of-order execution mechanism is designed so that a log can still be safely applied to the state machine (that is, executed).
  • the execution of a subsequent log needs to consider whether it conflicts with the logs before it that have not yet been executed.
  • each log may have missing copies on different second node devices (that is, the phenomenon of "holes").
  • the concurrent Raft algorithm designs a log hole array (Log Hole Array, LHA) to judge whether there is a conflict between this log and the N (N ≥ 1) logs before it, so as to ensure the correctness of storage semantics; only then can this log be executed safely.
  • LHA is used to determine whether there is a conflict between this log and the previous target number of logs.
  • the log list of the first node device is not allowed to have holes, and the first node device needs to receive the service requests (or commands) sent by the terminal (Client) in sequence and record the log corresponding to each service request in the log list.
  • the above strategy ensures that the LHA carried by a log sent from the first node device to the second node device can completely cover the storage range of this log and of the first N logs before it.
  • the data structure of the LHA will be described in detail in the following embodiments, and will not be repeated here.
  • LHA is added to the log structure of the concurrent Raft algorithm to assist in determining whether this log conflicts with the storage range of the first N logs.
  • when the log index LogIndex (abbreviated as Index) of a log is -1, it indicates that the log is missing.
  • the correctness condition of the storage semantics can be obtained, that is, in the storage scenario, the precondition for a log to be executed safely is that there is no storage-range conflict between this log and each log before it that has not yet been executed.
  • through the LHA, it can be judged whether a storage-range conflict occurs between this log and the first N logs, so that the distributed storage system can execute each log in the log list with high concurrency and out of order while the correctness condition of the storage semantics is guaranteed.
  • FIG. 3 is a flowchart of a log execution method provided by an embodiment of the present application. Referring to FIG. 3 , this embodiment is executed by the first node device in the above-mentioned distributed storage system.
  • the embodiment of the present application introduces a log out-of-order execution mechanism of the concurrent Raft algorithm, which will be described in detail below.
  • the first node device cyclically scans the log execution active window, where the log execution active window includes multiple logs that have not been executed, and the logs before the log execution active window have all been executed.
  • the meaning of the log execution (Apply) active window is: only the logs located in the log execution active window can be executed, but a log being executable also needs to meet the correctness condition of the storage semantics: this log does not conflict with the logs in the log execution active window that are located before it and have not been executed.
  • the log execution active window involves the following two state variables.
  • the first one, the toApplyWindowBeginIndex variable, is used to record the index (Index) of the first log in the active window of log execution.
  • the second, the isApplied[] list, is used to record whether each log in the log execution active window has been executed, that is, to record the execution status information of each log in the log execution active window.
  • the execution status information of each log is represented by a Boolean data.
  • the value of the Boolean data is true (True), which means that the log has been executed, and the value of the Boolean data is false (False), which means that the log has not been executed.
  • the execution information of each log is stored in the order of log index from small to large.
  • the storage overhead of state variables in the concurrent Raft algorithm can be saved.
  • the traditional Raft algorithm needs to record whether each log has been submitted and executed, while the concurrent Raft algorithm ensures that all logs before the log execution active window have been executed, so it only needs to record the execution status of the logs within the window.
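  • As a minimal illustrative sketch (not part of the patent text), the two state variables above can be represented as follows in Go; the field names mirror toApplyWindowBeginIndex and isApplied[], while the package name, Go types, and the MarkApplied/AdvanceWindow helpers are assumptions added for clarity:

```go
package parallelraft

// ApplyWindow sketches the log execution (Apply) active window state:
// ToApplyWindowBeginIndex records the index of the first log in the window,
// and IsApplied records the execution status of each log inside the window,
// ordered by log index from small to large.
type ApplyWindow struct {
	ToApplyWindowBeginIndex int64
	IsApplied               []bool // length equals the window size N
}

// MarkApplied records that the log at the given index has been executed.
func (w *ApplyWindow) MarkApplied(index int64) {
	pos := index - w.ToApplyWindowBeginIndex
	if pos >= 0 && pos < int64(len(w.IsApplied)) {
		w.IsApplied[pos] = true
	}
}

// AdvanceWindow (illustrative helper) slides the window forward past every
// leading log that has already been executed, so that all logs before the
// window are guaranteed to have been executed.
func (w *ApplyWindow) AdvanceWindow() {
	for len(w.IsApplied) > 0 && w.IsApplied[0] {
		w.IsApplied = append(w.IsApplied[1:], false)
		w.ToApplyWindowBeginIndex++
	}
}
```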
  • for any log in the log execution active window, the first node device obtains the conflict verification result of the log based on the storage range information of the log; the storage range information is used to indicate the storage range of the log and the storage range of the target number of logs before the log, where the target number is equal to the window size of the log execution active window.
  • when the first node device cyclically scans the log execution active window, for any log in the window that has not been executed, it determines whether the log conflicts with the logs in the window that are located before it and have not been executed, and thus obtains the conflict verification result of the log.
  • the first node device maintains one storage range information for each log, and the storage range information of each log is used to record the storage range of each log and the storage range of the target number of logs before each log .
  • the first node device reads the storage range information of the log to obtain the storage range of the log and the storage ranges of the target number of logs before it; if the intersection between the storage range of the log and the storage ranges of the target number of logs is an empty set, it determines that the conflict verification result is no conflict; if the intersection is not an empty set, it determines that the conflict verification result is a conflict.
  • the first node device obtains the intersection between the storage range of the log and the storage range of each log before it, traverses the target number of logs, and obtains the target number of intersections; if any intersection among the target number of intersections is not an empty set, it determines that the conflict verification result is a conflict.
  • in this case, the number of intersections that are not an empty set is obtained, and this number is determined as the number of logs that conflict with the log, so as to facilitate the subsequent delayed execution of the log; otherwise, if all intersections in the target number of intersections are empty sets, it determines that the conflict verification result is no conflict.
  • the first node device obtains the union of the storage ranges of the previous target number of logs, and then obtains the target intersection between the union and the storage range of the log; if the target intersection is an empty set, it determines that the conflict verification result is no conflict; otherwise, if the target intersection is not an empty set, it determines that the conflict verification result is a conflict.
  • the storage range information maintained by the first node device for each log is a log hole array (Log Hole Array, LHA), and the LHA is used to determine whether a conflict has occurred between the current log and the previous target number of logs. That is to say, when the first node device stores each log, it attaches the corresponding LHA to the stored log.
  • the LHA includes the storage range of this log and the storage ranges of the previous target number of logs; for example, in FIG. 4 the LHA includes the storage ranges of this log and of the 5 logs before it.
  • FIG. 4 is a schematic diagram of the principle of an LHA data structure provided by an embodiment of the present application.
  • the storage range of the 5 logs is arranged in the order of index from small to large: [6,7], [0,1], [8,9], [4,5], [1,2]. It can be seen that the storage range of this log conflicts with the two logs whose storage ranges are [6,7] and [4,5] respectively.
  • the first node device can determine whether this log conflicts with N logs before this log based on the LHA stored with each log.
  • the first node device cannot judge, based on the LHA, whether this log conflicts with other logs before the above N logs. Therefore, the concurrent Raft algorithm needs to ensure that those other logs before the above N logs have already been executed, and the log execution active window described above is designed to ensure that: a) only logs within the log execution active window are likely to be executed; b) the window size of the log execution active window is exactly equal to N.
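  • The conflict check can be sketched as follows, continuing the illustrative package above and assuming each storage range is a closed interval [begin, end] as in the FIG. 4 example, with the LHA carrying the ranges of this log and of the N previous logs (type and function names are assumptions; handling of missing entries with LogIndex -1 is omitted):

```go
// Range is one storage range, e.g. [4,5]; Begin <= End.
type Range struct{ Begin, End int64 }

// overlaps reports whether two storage ranges have a non-empty intersection.
func overlaps(a, b Range) bool {
	return a.Begin <= b.End && b.Begin <= a.End
}

// LHA (Log Hole Array) sketch: the storage range of this log plus the
// ranges of the N logs immediately before it, oldest first.
type LHA struct {
	Self Range
	Prev []Range // exactly N entries
}

// ConflictCount counts how many of the previous N logs that have not yet
// been executed overlap with this log's storage range; isApplied[i] reports
// whether Prev[i] has already been executed. A result of 0 means the
// conflict verification result is "no conflict", so the log may be applied
// out of order.
func (l *LHA) ConflictCount(isApplied []bool) int {
	conflicts := 0
	for i, r := range l.Prev {
		executed := i < len(isApplied) && isApplied[i]
		if !executed && overlaps(l.Self, r) {
			conflicts++
		}
	}
	return conflicts
}
```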
  • the first node device executes the log when the conflict verification result is no conflict.
  • the first node device adds the log to the to-be-executed log list ApplyingList[], invokes the log execution thread, and processes the logs stored in the to-be-executed log list ApplyingList[]. At this time, since it has been confirmed that the log does not conflict with the previous target number of logs, there is no need to wait for the target number of logs before it to be executed; the log can be executed out of order while the correctness of storage semantics is still guaranteed.
  • when processing the logs stored in the to-be-executed log list, the first node device performs the following operations: write the business data corresponding to the logs stored in the to-be-executed log list ApplyingList[] into the volatile storage medium; add the logs whose business data has been written into the volatile storage medium to the executed log list AppliedList[]; and, when the data storage capacity of the volatile storage medium is greater than or equal to a storage threshold, write the business data stored in the volatile storage medium into the non-volatile storage medium. In the executed log list AppliedList[], the logs whose business data has not been written into the non-volatile storage medium are set to a state parameter different from that of the logs whose business data has been written into the non-volatile storage medium, and the storage threshold is any value greater than or equal to 0.
  • the executed log list AppliedList[] is used to record the logs that have been executed.
  • the executed log list AppliedList[] stores the index LogIndex of each log that has been executed, and uses the state parameter State to represent whether the business data corresponding to the log has been persisted (that is, whether the business data has been written into the non-volatile storage medium).
  • this embodiment of the present application does not specifically limit this.
  • the volatile storage medium is a data buffer pool constructed by the storage system using memory
  • the non-volatile storage medium is a magnetic disk, a hard disk, or the like.
  • the business data corresponding to the log is first written into the data buffer pool, and then the system periodically flushes the business data in the data buffer pool to the disk in batches.
  • the first node device adopts the Checkpoint technology when transferring the business data stored in the data buffer pool to the disk; that is, when the data storage capacity of the data buffer pool is greater than or equal to the storage threshold, the system triggers a Checkpoint event, flushes all the dirty data (unpersisted business data) in the data buffer pool to disk, and modifies the state parameters of the corresponding logs in the executed log list AppliedList[] to mark them as flushed back to disk.
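  • A hedged sketch of this persistence path, continuing the same illustrative package: applied business data first enters the in-memory data buffer pool, the log index is recorded in AppliedList[] with a state parameter, and a Checkpoint flush marks the corresponding entries as persisted once the buffered data reaches the storage threshold. BufferPool, flushToDisk, and the other names are assumptions, not the patent's API:

```go
// AppliedEntry mirrors one record of the executed log list AppliedList[]:
// only the log index and a state parameter, not the full log record.
type AppliedEntry struct {
	LogIndex  int64
	Persisted bool // State: whether the business data has reached the non-volatile medium
}

// BufferPool is a stand-in for the in-memory (volatile) data buffer pool.
type BufferPool struct {
	dirty            map[int64][]byte // unpersisted business data keyed by log index
	storageThreshold int
	appliedList      []AppliedEntry
}

// Apply writes the business data of an executed log into the buffer pool,
// records it in AppliedList[], and triggers a Checkpoint once the amount of
// buffered dirty data reaches the storage threshold.
func (p *BufferPool) Apply(logIndex int64, data []byte, flushToDisk func(int64, []byte) error) error {
	p.dirty[logIndex] = data
	p.appliedList = append(p.appliedList, AppliedEntry{LogIndex: logIndex})
	if len(p.dirty) >= p.storageThreshold {
		return p.Checkpoint(flushToDisk)
	}
	return nil
}

// Checkpoint flushes all dirty business data to the non-volatile medium and
// marks the corresponding AppliedList[] entries as persisted.
func (p *BufferPool) Checkpoint(flushToDisk func(int64, []byte) error) error {
	for idx, data := range p.dirty {
		if err := flushToDisk(idx, data); err != nil {
			return err
		}
		delete(p.dirty, idx)
	}
	for i := range p.appliedList {
		p.appliedList[i].Persisted = true
	}
	return nil
}
```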
  • FIG. 5 is a schematic diagram of the principle of a data persistence mechanism provided by an embodiment of the present application.
  • the volatile storage medium is the data buffer pool and the non-volatile storage medium is the disk as an example for description.
  • the to-be-executed log list ApplyingList[] records each log that can currently be executed.
  • the first node device calls the log execution thread to cyclically scan the to-be-executed log list ApplyingList[] and writes the business data corresponding to each log into the data buffer pool.
  • each log corresponding to each business data that has been written to the data buffer pool is added to the executed log list AppliedList[], and the status parameters of the above logs are set.
  • since the business data in the data buffer pool may be lost before it is flushed to disk, the storage system uses the log mechanism to solve this problem; that is, the business data is written to the log before being written to the volatile storage medium.
  • in this way, the business data can be recovered through the redo log, and the Checkpoint technology is used to record the logs corresponding to the dirty data in the volatile storage medium that has not been persisted, so that the remaining logs can be recycled.
  • the concurrent Raft algorithm designs a system crash recovery mechanism to ensure that the written business data will not be completely lost when the system crashes: the Checkpoint technology is used to restore the business data, and the executed log list AppliedList[] data structure is then used to order the out-of-order write requests so as to ensure the orderliness of the data recovery process, as described below.
  • the business data that has been written to the non-volatile storage medium is stored persistently and does not need to be recovered, but all business data that has been written to the volatile storage medium and has not been written to the non-volatile storage medium (that is, dirty data) will be lost.
  • the first node device restores data by performing the following operations: in the case of a crash event of the distributed storage system, when the distributed storage system restarts, a plurality of logs to be restored are obtained from the executed log list AppliedList[] based on the state parameter State; then, based on the storage order of the logs to be restored in the executed log list AppliedList[], the business data corresponding to each log to be restored is restored to the volatile storage medium in sequence.
  • executing each log to be restored in this order ensures that, before any log to be restored is executed, the logs whose storage ranges conflict with it (whether logs to be restored or already-persisted logs) have already been applied, thereby guaranteeing the correctness of storage semantics.
  • the above-mentioned executed log list AppliedList[] only saves the index Index and state parameter State of each executed log, and does not save the specific log records; the corresponding log records can be queried in the log list through the index Index, which greatly saves the storage overhead of the executed log list AppliedList[].
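  • The recovery procedure can be sketched as follows (reusing the AppliedEntry type from the sketch above): on restart, the entries of AppliedList[] whose state parameter shows that the business data never reached the non-volatile medium are replayed into the buffer pool in their original storage order. readRedoLog is an assumed helper that looks up the business data recorded in the redo log for a given log index:

```go
// Recover replays, in AppliedList[] order, the business data of every log
// whose data had been written to the volatile medium but not yet persisted
// when the crash occurred.
func Recover(appliedList []AppliedEntry, buffer map[int64][]byte,
	readRedoLog func(int64) ([]byte, error)) error {
	for _, e := range appliedList {
		if e.Persisted {
			continue // already on the non-volatile medium, nothing to restore
		}
		data, err := readRedoLog(e.LogIndex)
		if err != nil {
			return err
		}
		// Restoring in AppliedList[] order preserves the original execution
		// order, so conflicting storage ranges are re-applied correctly.
		buffer[e.LogIndex] = data
	}
	return nil
}
```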
  • the above step 303 shows how to execute a log and persist the corresponding business data when the log does not conflict with the previous target number of logs, and also provides a crash recovery mechanism for the system. In other embodiments, when the log conflicts with the previous target number of logs, the first node device performs the following operations: if the conflict verification result is a conflict, obtain the number of logs that conflict with the log; based on the number of logs, determine the scan frequency of the log; based on the scan frequency, scan the log and refresh the conflict verification result, and execute the log once the conflict verification result becomes no conflict.
  • the first node device determines the scan frequency as the first frequency when the number of logs is greater than the conflict threshold; or, when the number of logs is less than or equal to the conflict threshold, the scan frequency The frequency is determined to be a second frequency that is greater than the first frequency.
  • the first node device determines the frequency of scanning this log based on the number of logs conflicting with it. If the number of conflicting logs is large, waiting for the conflicts to be resolved will take a long time, so scanning is performed at the lower first frequency; if the number of conflicting logs is small, waiting for the conflicts to be resolved will take a shorter time, so scanning is performed at the higher second frequency.
  • the first node device adds the log to a log list corresponding to the scan frequency, and scans the logs stored in the log list based on the scan frequency. That is to say, the first node device maintains different log lists for different scanning frequencies; by cyclically scanning each log list according to its corresponding scanning frequency, different scanning frequencies can be flexibly selected based on the number of conflicting logs, thereby saving system computing resources.
  • the above log list is divided into two categories, the first log list SlowScanList[] and the second log list FastScanList[], the scan frequency of the first log list SlowScanList[] is the first frequency, the second log list The scanning frequency of FastScanList[] is the second frequency, because the second frequency is greater than the first frequency, that is, the scanning frequency of the second log list FastScanList[] by the first node device is greater than the scanning frequency of the first log list SlowScanList[] .
  • after the administrator sets the conflict threshold ScanListThreshold, the number of logs in the log execution active window that conflict with this log is obtained; when the number of logs is greater than the conflict threshold ScanListThreshold, this log is added to the first log list SlowScanList[]; otherwise, when the number of logs is less than or equal to the conflict threshold ScanListThreshold, this log is added to the second log list FastScanList[].
  • the logs in the second log list FastScanList[] have a greater probability of resolving conflicts due to the small number of conflicting logs.
  • the second log list FastScanList[] is scanned frequently to find and execute the logs whose conflicts have been resolved, so as to avoid the blocking of such a log delaying, for a long time, the execution of other logs that conflict with it.
  • the first log list SlowScanList[] and the second log list FastScanList[] store not only the index of this log but also the index of each conflicting log that conflicts with it, so that, by cyclically scanning the first log list SlowScanList[] and the second log list FastScanList[], the logs that no longer conflict can be found in time, removed from the first log list SlowScanList[] or the second log list FastScanList[], and added to the to-be-executed log list ApplyingList[].
  • in other embodiments, the first node device adds the log to a single log list ScanList[] and waits for the conflict to be resolved before executing it, so as to ensure that no data inconsistency is caused when the log is executed.
  • in other embodiments, the first node device adds the log to the log list corresponding to the value interval in which the number of conflicting logs falls, where the number of value intervals is greater than or equal to 2, each value interval corresponds to a log list, and each log list corresponds to a scan frequency.
  • the log lists corresponding to different value intervals have different scan frequencies, and as the value interval increases, the scan frequency of the corresponding log list decreases, so that more levels of scan frequency can be refined to build a more complete scan process.
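  • An illustrative scheduler for the conflict-delayed logs described above, in the same assumed package: depending on how many unexecuted logs conflict with a log, it is parked in SlowScanList[] or FastScanList[], and each list is rescanned at its own frequency so that conflict-free logs move into ApplyingList[]; the helper names are assumptions:

```go
// ScanLists sketches the two conflict-waiting lists and the administrator-set threshold.
type ScanLists struct {
	ScanListThreshold int     // conflict threshold set by the administrator
	SlowScanList      []int64 // many conflicts: rescanned at the lower first frequency
	FastScanList      []int64 // few conflicts: rescanned at the higher second frequency
}

// Park places a conflicting log into the list that matches its number of
// conflicting logs, so that it will be rescanned at the corresponding frequency.
func (s *ScanLists) Park(logIndex int64, conflictCount int) {
	if conflictCount > s.ScanListThreshold {
		s.SlowScanList = append(s.SlowScanList, logIndex)
	} else {
		s.FastScanList = append(s.FastScanList, logIndex)
	}
}

// Rescan re-checks every parked log with the supplied conflict test and
// moves the ones whose conflicts are resolved into ApplyingList[]; it returns
// the logs that must keep waiting.
func Rescan(list []int64, stillConflicts func(int64) bool, applyingList *[]int64) []int64 {
	remaining := list[:0]
	for _, idx := range list {
		if stillConflicts(idx) {
			remaining = append(remaining, idx)
		} else {
			*applyingList = append(*applyingList, idx)
		}
	}
	return remaining
}
```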
  • in summary, the conditions for any log in the concurrent Raft algorithm to be executed are as follows: (a) the log is located in the log execution active window, which ensures that all logs before the first N logs of this log have been executed; (b) by checking the LHA of this log, it is confirmed that there is no conflict between the storage range of this log and that of the first N logs.
  • FIG. 6 is a schematic diagram of a schematic diagram of a log out-of-order execution scheduling provided by an embodiment of the present application.
  • the index range of the log execution active window of the first node device at a certain moment is [4, 8]
  • by setting the log execution active window and ensuring that the logs before the window have been executed, it is only necessary to verify whether any log in the window conflicts in storage range with the logs in the window that are located before it and have not been executed in order to know whether executing the log will cause data inconsistency in the entire distributed storage system. For a log without conflicts, out-of-order execution is supported without blocking: the log execution process does not need to wait for the unexecuted logs before it in the window to be executed first, which can greatly improve the throughput of the distributed storage system and makes the method suitable for high-concurrency scenarios.
  • the log out-of-order execution mechanism of the concurrent Raft algorithm is introduced in detail.
• in the traditional Raft algorithm, since all logs are executed in strict order, all replicas can ensure data consistency.
• in the concurrent Raft algorithm, because logs are acknowledged and submitted out of order, copies of a given log may be missing from the log lists of different second node devices.
• even so, data consistency of the entire distributed storage system can still be guaranteed during concurrent out-of-order execution.
• the condition under which any log can be executed out of order is that the correctness condition of the storage semantics is met (the log has no conflict with the previous N not-yet-executed logs), and the LHA data structure is introduced to quickly determine whether the log conflicts with the previous N logs.
  • the log execution active window is introduced to save the execution status of each log and manage the execution process.
• the log out-of-order execution mechanism thus determines the execution order and execution timing of a series of logs that may conflict with each other.
• the above-mentioned control can be achieved through the different log lists described above.
  • the log out-of-order submission mechanism of the concurrent Raft algorithm will be introduced.
• a log submission (Commit) active window is set. For the first node device, only logs within the log submission active window can be sent to the second node devices; for a second node device, only logs within the log submission active window can be received from the first node device. In addition, all logs before the log submission active window must already be submitted.
  • the related data structure of the log commit active window includes the following contents.
• CommitIndex, which represents the largest index among the consecutive submitted logs on this node device.
• CommitIndex is used to indicate that the log whose index is CommitIndex on this node device, and all logs before it, have been submitted.
• the log submission active windows of the first node device and any second node device are the same or different, and the log submission active windows of different second node devices are the same or different; this is not limited in this embodiment of the present application.
  • FIG. 7 is a schematic diagram of the data structure of a log submission active window provided by an embodiment of the present application.
• the log submission active window of the first node device 701 is [5, 9], indicating that all logs with Index < 5 have been successfully submitted.
• Logs with 5 ≤ Index ≤ 9 can be sent to the second node devices, but logs with Index > 9 cannot be sent.
• the log submission active window of the second node device 702 is [3, 7], indicating that logs with Index < 3 have been successfully submitted, logs with 3 ≤ Index ≤ 7 can be received and an ACK message can be returned, and logs with Index > 7 cannot be received (that is, they are refused); in the log submission active window [3, 7], the logs with Index 3, 5, 6 and 7 have not been received yet.
• the log submission active window of the second node device 703 is [5, 9], indicating that logs with Index < 5 have been successfully submitted, logs with 5 ≤ Index ≤ 9 can be received and an ACK message can be returned, and logs with Index > 9 cannot be received; in the log submission active window [5, 9], the logs with Index 5, 7 and 9 have not been received yet.
  • log flow control can be performed by setting the log submission active window, which facilitates the management of the rate at which the first node device sends logs to the second node device, and ensures that the second node device receiving logs is not overloaded.
  • Logs located in the log submission active window support concurrent out-of-order reception (reception here refers to returning ACK messages), thereby improving log delivery efficiency.
• setting the log submission active window can also save the storage overhead of the state variables of the concurrent Raft algorithm (such as the overhead of recording whether each log has been submitted, executed, and so on): because all logs before the log submission active window are guaranteed to have been submitted, it is only necessary to record the submission status information of each log within the window, thereby greatly saving storage overhead. A small sketch of the window checks follows.
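• The window bookkeeping above can be condensed into a short, hedged sketch; the class and field names (CommitWindow, to_commit_window_size) are illustrative assumptions. Only indexes inside (commitIndex, commitIndex + toCommitWindowSize] are treated as sendable or receivable.

```python
# Minimal sketch of the log submission active window checks.
class CommitWindow:
    def __init__(self, commit_index: int, to_commit_window_size: int):
        self.commit_index = commit_index                    # largest index of the consecutive submitted logs
        self.to_commit_window_size = to_commit_window_size  # window size

    def in_window(self, index: int) -> bool:
        """A log may be sent (leader side) or received and ACKed (follower side) only inside the window."""
        return self.commit_index < index <= self.commit_index + self.to_commit_window_size

# Example mirroring FIG. 7: a second node device whose window is [3, 7]
# acknowledges the log with Index 4 but refuses the log with Index 8.
follower = CommitWindow(commit_index=2, to_commit_window_size=5)
assert follower.in_window(4) and not follower.in_window(8)
```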
• the first node device can receive a service request from the terminal (Client).
• after receiving the service request, the first node device executes the database transaction requested by the service request, obtains the requested service data, and appends the log corresponding to the service data to its log list.
  • the term parameter Term of the log is set to the Term of the first node device itself.
• after the first node device writes the new log to its log list, it can concurrently send the new log to each second node device through the AppendEntriesRPC message.
• RPC is short for Remote Procedure Call (remote procedure call protocol).
• after receiving the AppendEntriesRPC message, each second node device immediately adds the new log to the corresponding position in its own log list, and can immediately return an ACK message for the new log to the first node device without waiting to acknowledge the previous logs.
  • the first node device sends logs in batches to each second node device, which can save the communication overhead of the system.
  • the range of log indexes that can be sent in the log list of the first node device is limited, so that the rate at which the first node device sends logs to each second node device can be conveniently managed, and the overload of each second node device can be avoided.
  • the maximum index of logs that can be sent is commitIndex+toCommitWindowSize.
  • FIG. 8 is a flowchart of a log out-of-order submission mechanism provided by an embodiment of the present application. Please refer to FIG. 8 .
  • the following describes a log out-of-order submission mechanism based on the log submission active window.
• This log out-of-order submission mechanism is applied to a distributed storage system that includes a first node device (Leader, leader node) and a plurality of second node devices (Follower, follower nodes), as described in detail below.
  • the first node device cyclically scans a log matching index table, where the log matching index table is used to record the number of copies of multiple to-be-committed logs stored in the distributed storage system.
  • the log matching index table matchIndex[][] is a two-dimensional sequence table structure used for statistics on log reception.
• in addition to limiting the range of log indexes that can be sent, the first node device also needs to know which logs have and have not been received by each second node device, so as to schedule the log replication work of the entire distributed storage system at a macroscopic level.
  • the first node device records which log items have been received by each second node device through the log matching index table matchIndex[][] and beginIndex[].
  • the node identifier of any second node device is i
• beginIndex[i] indicates that every log whose index Index is less than beginIndex[i] has already been received by the i-th second node device.
• matchIndex[i][j] indicates whether the log entry whose index Index is beginIndex[i]+j in the log list of the first node device has been received by the i-th second node device, where i ≥ 1 and j ≥ 1.
  • matchIndex[][] is a two-dimensional sequence table structure
• each element in the two-dimensional sequence table is Boolean data occupying only one bit; that is, for the i-th second node device, it is only necessary to record whether each log whose Index is greater than or equal to beginIndex[i] has been received, so the storage cost of matchIndex[][] is small.
• the first node device cyclically scans the log matching index table matchIndex[][] and, based on matchIndex[i][j], determines whether the log whose Index is beginIndex[i]+j has been received by the i-th second node device; if not (the Boolean value is False), the first node device sends the log whose Index is beginIndex[i]+j to the i-th second node device through an AppendEntriesRPC message, as the sketch below illustrates.
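• A hedged sketch of this resend scan is given below; the helper send_append_entries is a stand-in for the AppendEntriesRPC call and, like the other names, is an illustrative assumption rather than an identifier from the original.

```python
# Illustrative sketch of the leader's cyclic scan over matchIndex[][] and beginIndex[].
from typing import Callable, Dict, List

def rescan_and_resend(match_index: Dict[int, List[bool]],
                      begin_index: Dict[int, int],
                      log_list: Dict[int, object],
                      send_append_entries: Callable[[int, int, object], None]) -> None:
    """For each second node device i, resend every log whose receipt bit is still False."""
    for i, received_bits in match_index.items():
        base = begin_index[i]                     # logs with Index < beginIndex[i] are already received
        for j, received in enumerate(received_bits, start=1):
            index = base + j                      # matchIndex[i][j] tracks the log with Index = beginIndex[i] + j
            if not received and index in log_list:
                send_append_entries(i, index, log_list[index])
```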
• two cases arise when the i-th second node device receives a log.
• if a log is already stored at that index but its Term is not equal to the Term of the received log, the log originally stored by the i-th second node device is wrong (or invalid); in this case, it is only necessary to overwrite the wrong log and store the latest log.
• otherwise, a failure message (e.g., a string of error codes) is returned.
• the first node device submits any log to be submitted when the number of copies of that log recorded in the log matching index table meets the target condition.
• the target condition is that the ratio of the number of copies of the log to be submitted to the number of nodes in the distributed storage system exceeds a proportional threshold; for example, if the proportional threshold is 1/2, the target condition means that copies of the log to be submitted exist on more than half of the second node devices.
  • the proportional threshold is any value greater than or equal to 0 and less than or equal to 1, which is not limited in this embodiment of the present application, for example, the proportional threshold is 2/3.
• the target condition may alternatively be that the number of replicas of the log to be submitted is greater than a replica threshold, where the replica threshold is any integer greater than or equal to 0 and less than or equal to the number of nodes of the distributed storage system; for example, in a distributed storage system containing 99 node devices, the replica threshold is set to 50. This is not specifically limited in this embodiment of the present application.
• taking as an example the target condition that the ratio of the number of copies of the log to be submitted to the number of nodes in the distributed storage system exceeds the proportional threshold: by cyclically scanning matchIndex[][], the first node device can learn whether copies of any log to be submitted are stored on more than half of the second node devices, and if more than half of the second node devices store such copies, the first node device chooses to submit that log. A hedged sketch of this check follows.
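• The majority check can be illustrated with the sketch below, reusing the matchIndex[][]/beginIndex[] layout sketched earlier. The function names are assumptions, the 1/2 threshold is the example value from the text, and how the first node device's own copy is counted is deliberately left out as a simplification.

```python
# Hedged sketch: submit a pending log once its replica ratio exceeds the proportional threshold.
def replica_count(index, match_index, begin_index) -> int:
    """Number of second node devices on which a copy of the log at `index` is stored."""
    count = 0
    for i, bits in match_index.items():
        if index < begin_index[i]:
            count += 1                          # logs below beginIndex[i] are known to have been received
        else:
            j = index - begin_index[i]          # matchIndex[i][j] covers Index = beginIndex[i] + j, j >= 1
            if 1 <= j <= len(bits) and bits[j - 1]:
                count += 1
    return count

def should_submit(index, match_index, begin_index, node_count: int,
                  ratio_threshold: float = 0.5) -> bool:
    """Return True when the replica ratio exceeds the proportional threshold (1/2 in the example above)."""
    return replica_count(index, match_index, begin_index) / node_count > ratio_threshold
```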
  • the first node device sends a submission instruction of the log to be submitted to a plurality of second node devices.
• the first node device informs each second node device, by means of an AppendEntriesRPC message, of the logs to be submitted that it has itself submitted.
  • multiple second node devices submit the log to be submitted.
• after any second node device receives the AppendEntriesRPC message, if the log to be submitted indicated by the message is already stored on that second node device and has been submitted by the first node device, the second node device also submits the log.
  • the logs of the first node device and each second node device can be consistent.
• under normal circumstances, the consistency check of the AppendEntriesRPC message never fails; however, a crash of the first node device can leave the logs inconsistent.
• the old first node device may not have had time to copy all the logs in its log list to the other second node devices.
• this can happen when the crash event occurs before replication completes and the device is restarted after the crash.
• such inconsistency may accumulate over a series of crashes of first node devices and second node devices; therefore, the above log inconsistency problem needs to be solved.
  • FIG. 9 is a schematic diagram of a log inconsistency provided by an embodiment of the present application.
• the second node device may lack logs stored on the first node device, may have additional logs that do not exist on the first node device, or may both lack logs stored on the first node device and have additional logs that do not exist on the first node device.
  • the missing or irrelevant logs in the log list may span multiple terms.
• the first node device resolves the inconsistency by forcing the second node devices to replicate its own log, which means that conflicting (inconsistent) logs in the log list of a second node device will be overwritten by the corresponding logs in the log list of the first node device.
• to make the log of a second node device consistent with the log of the first node device, all logs with the same Index and Term in the log lists of the first node device and the second node device need to be identical; any log in the log list of the second node device that is inconsistent with the log list of the first node device will be deleted, and the first node device will resend the corresponding log to overwrite the entry with the same Index in the log list of the second node device, as sketched below.
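• A hedged sketch of this repair step on a second node device is shown below; the in-memory log_list dictionary and the Entry type are illustrative assumptions. An existing entry is kept only if both its Index and Term match the first node device's copy; otherwise it is deleted and overwritten.

```python
# Illustrative sketch of overwriting an inconsistent follower entry with the leader's log.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Entry:
    index: int
    term: int
    payload: bytes

def install_leader_entry(log_list: Dict[int, Entry], leader_entry: Entry) -> None:
    """Keep an existing entry only if Index and Term both match; otherwise delete and overwrite it."""
    existing: Optional[Entry] = log_list.get(leader_entry.index)
    if existing is not None and existing.term != leader_entry.term:
        del log_list[leader_entry.index]            # the inconsistent log is deleted...
    log_list[leader_entry.index] = leader_entry     # ...and replaced by the first node device's copy
```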
• when each first node device takes office for the first time (i.e., is elected leader by vote), it asks each second node device for its commitIndex and for whether every log to be submitted in the log submission active window has already been submitted by that second node device; this process is completed in the election merge (Merge) recovery phase, which is introduced in the next embodiment and is not repeated here.
• the first node device maintains the data structures matchIndex[][] and beginIndex[], through which it can record whether each log is consistent with the log of the corresponding index in the log list of each second node device, so as to ensure that the first node device is consistent with all the submitted logs on the second node devices.
• in particular, this bookkeeping covers each log to be submitted in the log submission active window of the i-th second node device.
• according to matchIndex[][] and beginIndex[], the first node device knows whether the logs in its own log list and in the log list of each second node device are consistent, and then sends to each second node device the logs that are inconsistent in that device's list; this process is done via the AppendEntriesRPC message. If the log of a second node device is inconsistent with the log of the first node device, the log of the first node device overwrites the inconsistent log on the second node device.
• the log list of the first node device is not allowed to have gaps; however, owing to the decentralized design, the tenure of a first node device is limited, that is, after the term of one first node device ends, the entire distributed storage system votes to elect the next first node device, and, because of the log out-of-order replication mechanism, the next first node device may have logs missing from its own log submission active window.
• through the election merge recovery mechanism provided by the embodiments of the present application, the newly elected first node device can supplement the missing logs in its own log submission active window and then provide services to the cluster after the completion.
  • the election mechanism of the first node device is also called the leader election mechanism, which is used to ensure sequence consistency. Only the first node device can respond to the service request of the terminal and send the service request to each second node device.
• the distributed storage system includes a first node device and a plurality of second node devices. Based on the majority principle, the number of node devices is an odd number; for example, 5 is a typical choice, which allows the system to tolerate the failure of two node devices.
  • each node device is in one of the following three states: leader node Leader, follower node Follower or candidate node Candidate. The status of each state is shown below.
  • the leader node Leader (ie, the first node device): responsible for processing all service requests of the terminal.
  • Candidate node Candidate (referred to as candidate node device): Candidates, during the election period, conduct elections among candidate nodes, and elect a new leader node Leader.
  • FIG. 10 is a schematic diagram of the principle of a leader node election mechanism provided by an embodiment of the present application.
• time is divided into terms (Terms) of arbitrary length.
• Terms are numbered with consecutive integers.
  • Each Term starts from the election phase, and one or more candidate nodes in the system try to become the leader node Leader.
• in some cases an election results in a split vote, where no candidate node receives more than half of the votes; in this case, the term ends without a leader, and the election phase of the next new term begins.
• FIG. 11 is a schematic diagram of a term (Term) provided by an embodiment of the present application. As shown in 1100, a split vote occurs during the election period of Term 3, so Term 3 has no normal operation period and the system directly enters the election period of Term 4.
  • the follower node only responds to requests from other nodes (including the leader node Leader or candidate node Candidate). If the follower node follower does not receive any communication, the follower node follower becomes the candidate node candidate and initiates an election. The candidate that gets votes from the majority of the cluster will become the new leader. A single leader will manage the cluster until the end of the term.
  • each node device stores a current Term number, which increases monotonically over time.
• when node devices communicate with each other, their respective current Terms are exchanged; if the current Term of one node device is smaller than that of another node device, the former updates its current Term to the larger of the two values.
• if a candidate node Candidate or the leader node Leader finds, when exchanging current Terms in a communication, that its own current Term has expired, it immediately becomes a follower node Follower; if any node device receives a request carrying an expired Term, it rejects the request. These rules are condensed into the sketch below.
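• A minimal sketch of the Term exchange rules, assuming a simple Role enum and field names that are illustrative rather than taken from the original: on every exchange a node adopts the larger Term, a stale Leader or Candidate steps down, and requests carrying an expired Term are rejected.

```python
# Hedged sketch of the current-Term exchange rules.
from enum import Enum

class Role(Enum):
    LEADER = 1
    FOLLOWER = 2
    CANDIDATE = 3

class Node:
    def __init__(self, term: int, role: Role):
        self.current_term = term
        self.role = role

    def on_message(self, message_term: int) -> bool:
        """Return True if the request should be processed, False if it is rejected as expired."""
        if message_term > self.current_term:
            self.current_term = message_term          # adopt the larger of the two Terms
            if self.role in (Role.LEADER, Role.CANDIDATE):
                self.role = Role.FOLLOWER             # a stale Leader/Candidate steps down
            return True
        if message_term < self.current_term:
            return False                              # a request with an expired Term is rejected
        return True
```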
• the remote procedure call (RPC) protocol is used for communication between node devices. If a node device does not receive a response from other node devices in time, it re-sends the corresponding RPC messages, and these RPC messages are issued in parallel to obtain the best performance of the system.
• the first node device of the current term is elected by a plurality of second node devices in the distributed storage system after the end of the previous term, and the maximum index commitIndex among the consecutive submitted logs of the first node device is greater than or equal to the maximum index commitIndex among the consecutive submitted logs of the multiple second node devices.
  • the election conditions allow the presence of missing logs within the log commit active window, and these missing logs will be recovered during the election merge recovery phase.
• the above-mentioned election condition for the first node device is enforced through the RequestVoteRPC message: if the commitIndex of a candidate node device (representing the log with the largest index among its consecutive, submitted logs) is smaller than the current commitIndex of a second node device, that second node device refuses to vote for the candidate node device.
• the next first node device therefore has all the consecutive submitted logs of the node devices, but it cannot be guaranteed that no logs are missing from the log submission active window of the next first node device.
• the missing logs mainly fall into the following two categories: (A) the missing log has been submitted; (B) the missing log has not been submitted, but it may exist on other second node devices.
  • the above-mentioned (B) missing logs have not been submitted, so they may be expired and unsafe logs.
• if the first node device cannot obtain a class (B) missing log from the second node devices, it chooses to re-request that log from the terminal so as to complete the class (B) missing logs.
• in response to at least one log being missing from the log submission active window, the first node device obtains the index of each missing log; the log submission active window includes multiple logs that have not yet been submitted, and all logs before the window have been submitted. Based on the at least one index, the at least one missing log is requested from the plurality of second node devices.
• the first node device obtains the index of each missing log and then, based on each obtained index, requests from each second node device the missing log indicated by that index.
• the first node device receives at least one target log returned by the plurality of second node devices, together with the submission status information of the at least one target log; target logs whose submission status is submitted are filled back into the log submission active window, and target logs whose submission status is unsubmitted are requested from the terminal, as sketched below.
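• A minimal sketch of this merge step, assuming the replies from the second node devices are (log, submitted-flag) pairs keyed by index; the parameter names and the request_from_terminal callback are illustrative assumptions.

```python
# Illustrative sketch of the election merge recovery fill-in described above.
def merge_recover(missing_indexes, follower_replies, commit_window, request_from_terminal):
    """follower_replies: dict mapping index -> (log, submitted) for logs the second node devices returned."""
    for index in missing_indexes:
        reply = follower_replies.get(index)
        if reply is not None and reply[1]:
            commit_window[index] = reply[0]      # submitted logs are filled back into the active window
        else:
            request_from_terminal(index)         # unsubmitted (or absent) logs are re-requested from the terminal
```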
• the first node device sends its own commitIndex and isCommited[] to each second node device, and requests from each second node device the logs that are missing from its own log submission active window.
  • the submitted logs are used to supplement the missing logs in the active log submission window.
• the first node device requests the log submission status information from each second node device and initializes the data structures matchIndex[][] and nextIndex[]; the above process is completed by the RequestMergeRPC function.
  • the first node device will request the terminal for the missing log.
  • the RequestVoteRPC message, AppendEntriesRPC message, RequestMergeRPC message and other RPC messages sent to the crashed node device will all fail.
  • the concurrent Raft algorithm handles these failures by retrying RPC messages indefinitely. If the crashed node is restarted, the RPC message will complete successfully. If a node device has completed the command indicated by the RPC message, but crashes before responding to that RPC message, then the node device will receive the same RPC message again after restarting.
  • the RPC message is idempotent, that is, the node device executes the same message multiple times, and the final result is the same, so repeating the execution of the same RPC message will not cause any bad effects. For example, if the second node device receives an AppendEntries RPC request containing logs already in its own log list, it will ignore those logs in the new request.
  • storage scenarios require that system security not be time-dependent: the system cannot produce incorrect results simply because certain events occur faster or slower than expected.
  • system availability ie, the ability of the system to respond to terminals in a timely manner
• otherwise, a candidate node device will not be able to remain a candidate long enough to win the election; without a stable first node device being elected, the concurrent Raft algorithm cannot make progress.
• as long as the timing requirement broadcastTime ≪ electionTimeout ≪ MTBF holds, the distributed storage system will be able to elect and maintain a stable first node device.
• broadcastTime is the average time for the first node device to send RPC messages in parallel to each second node device in the distributed storage system and receive a reply.
  • electionTimeout is the election timeout duration
• MTBF is the mean time between failures of a single second node device.
  • broadcastTime should be an order of magnitude less than electionTimeout, so that the first node device can reliably and timely send heartbeat messages to each second node device, so as to prevent the second node device from exceeding its own electionTimeout and becoming a candidate node device to start the election.
  • electionTimeout of different second node devices is different, which also makes split voting unlikely to occur.
• the electionTimeout should be several orders of magnitude less than the MTBF so that the system runs steadily; otherwise, when the first node device crashes, the second node devices cannot finish their electionTimeout in time to start a new election.
  • broadcastTime and MTBF are properties of the system, while electionTimeout is manually set and modified by technicians.
  • the RPC message of the concurrent Raft algorithm usually requires the receiving second node device to receive the information and store the information persistently, so the broadcast duration may be between 0.5ms and 20ms, depending on the storage technology.
  • the electionTimeout is set between 10ms and 500ms.
  • the MTBF of a typical server node can last several months or even longer, so it is easy to satisfy the above timing requirement inequality.
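• Using the example figures above, the timing inequality can be checked with a trivial sketch; the concrete numbers are merely representative values taken from the illustrative ranges in the text, and reading "much less than" as one order of magnitude is an assumption.

```python
# Hedged sketch: verify broadcastTime << electionTimeout << MTBF with example values.
broadcast_time_ms = 1.0           # within the 0.5 ms - 20 ms range mentioned above
election_timeout_ms = 100.0       # within the 10 ms - 500 ms range mentioned above
mtbf_ms = 90 * 24 * 3600 * 1000   # "several months", expressed in milliseconds

def much_less(a: float, b: float, factor: float = 10.0) -> bool:
    """Read 'a << b' as 'a is at least one order of magnitude smaller than b' (illustrative)."""
    return a * factor <= b

assert much_less(broadcast_time_ms, election_timeout_ms)
assert much_less(election_timeout_ms, mtbf_ms)
```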
• a conflict between two logs refers to an overlap between the storage range of the old log and that of the new log.
• the order in which the terminal receives ACK messages may be inconsistent with the order in which the terminal sent the service requests: that is, after the terminal sends multiple service requests to the first node device, the ACK message for an older service request may not yet have been returned while the ACK message for a newer service request has already been returned.
• the concurrent Raft algorithm needs to ensure that if a log has been submitted but a log on which it depends has not been submitted, the system will not execute the submitted log. That is, the concurrent Raft algorithm guarantees the following to the terminal: if the storage ranges of the i-th service request and the j-th service request overlap and i < j, then receiving the ACK message of the j-th service request without having received the ACK message of the i-th service request means that neither the i-th nor the j-th service request has been executed.
• Case 2: if the old first node device crashes and the distributed storage system elects a new first node device that does not have the log corresponding to the i-th service request above (that is, the corresponding log is missing from its log submission active window), the new first node device will request the terminal to resend the i-th service request during the election merge recovery phase so that it can restore the corresponding log; if the terminal does not resend the i-th service request to the new first node device (that is, the new first node device's request to resend the i-th service request times out 3 consecutive times), then the i-th service request and all subsequent service requests that depend on it will fail.
• in this case, the related log entries are marked as empty logs in the log list (that is, the log index Index is assigned -2), and the elements corresponding to these empty logs in the second log list FastScanList[] and the first log list SlowScanList[] are cleared.
  • a concurrent Raft algorithm (a consensus algorithm) that can be applied to a distributed storage system.
  • the upper-layer application acts as a client (ie, a terminal) to send service requests to the distributed storage system.
• the system interacts with the terminal in the manner shown in FIG. 12, provides a system crash recovery mechanism to ensure that business data is not lost after a system crash, and can order out-of-order I/O requests to ensure the orderliness of the data recovery process.
  • the above concurrent Raft algorithm can be applied to high concurrent I/O scenarios under distributed storage systems, thereby effectively improving system throughput and reducing I/O request latency.
  • the receiver (second node device) implements:
• the RequestMergeRPC function used for RPC communication between different node devices in a distributed storage system is shown and described below:
  • the receiver implements:
• server behavior of the distributed storage system, that is, the behavior of the entire server cluster that provides distributed storage services.
• the first node device behavior, that is, the behavior of the leader node.
  • FIG. 13 is a schematic structural diagram of a log execution device provided by an embodiment of the present application. Please refer to FIG. 13 .
  • the device is located in a distributed storage system, and the device includes the following modules:
  • the scanning module 1301 is used to cyclically scan the log execution active window, the log execution active window includes multiple logs that have not been executed, and the logs before the log execution active window have been executed;
• the first acquisition module 1302 is configured to, for any log in the log execution active window, obtain the conflict verification result of the log based on the storage range information of the log, where the storage range information is used to indicate the storage range of the log and the storage ranges of a target number of logs before the log, the target number being equal to the window size of the log execution active window;
  • the execution module 1303 is configured to execute the log when the conflict verification result is no conflict.
• By setting the log execution active window and ensuring that all logs before the window have been executed, it is only necessary to verify whether any log in the window conflicts, in storage range, with the earlier not-yet-executed logs in the window in order to know whether executing that log would cause data inconsistency in the distributed storage system. A log with no conflict can be executed out of order without being blocked: its execution does not need to wait for the earlier unexecuted logs in the window, which can greatly improve the throughput of the distributed storage system and makes the method suitable for high-concurrency scenarios.
• the first acquisition module 1302 is configured to: read the storage range information of the log to obtain the storage ranges of the log and of the target number of logs before the log; determine that the conflict verification result is no conflict when the intersection between the storage range of the log and the storage ranges of the target number of logs is an empty set; and determine that the conflict verification result is a conflict when that intersection is not an empty set. A brief sketch of this intersection test follows.
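• The intersection test performed by the first acquisition module can be illustrated with a short sketch; representing a storage range as an inclusive integer interval is an assumption, since the original does not fix a concrete encoding.

```python
# Hedged sketch of the conflict verification performed by the first acquisition module.
def verify_conflict(own_range, prev_ranges):
    """Return 'conflict' if the log's storage range intersects any of the previous N ranges, else 'no conflict'."""
    lo, hi = own_range
    for p_lo, p_hi in prev_ranges:
        if lo <= p_hi and p_lo <= hi:        # non-empty intersection of the two ranges
            return "conflict"
    return "no conflict"

# Example consistent with the FIG. 4 ranges: a log with range [4, 7]
# conflicts with the earlier logs whose ranges are [6, 7] and [4, 5].
assert verify_conflict((4, 7), [(6, 7), (0, 1), (8, 9), (4, 5), (1, 2)]) == "conflict"
```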
• the execution module 1303 includes: an adding unit, configured to add the log to the to-be-executed log list; and a processing unit, configured to call a log execution thread to process the logs stored in the to-be-executed log list.
• the processing unit is configured to: write the service data corresponding to the logs stored in the to-be-executed log list into a volatile storage medium; add the logs corresponding to the service data written into the volatile storage medium to the executed log list; and, when the data storage amount of the volatile storage medium is greater than or equal to a storage threshold, write the service data stored in the volatile storage medium into a non-volatile storage medium (a sketch of this buffering behavior follows the next item).
  • different state parameters are respectively set for the log whose business data has not been written to the non-volatile storage medium and the log whose business data has been written to the non-volatile storage medium.
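• A hedged sketch of the processing unit's buffering behavior follows. The BufferPool class, the flush_to_disk callback and the 0/1 state encoding are illustrative assumptions modeled on the description above; clearing the in-memory buffer after a flush is a simplification.

```python
# Illustrative sketch of the write-buffer / checkpoint behavior of the processing unit.
class BufferPool:
    def __init__(self, storage_threshold: int, flush_to_disk):
        self.storage_threshold = storage_threshold
        self.flush_to_disk = flush_to_disk         # callback that persists data to the non-volatile medium
        self.buffer = {}                           # volatile medium: log index -> service data
        self.applied_list = []                     # AppliedList[]: [log index, state]; 0 = buffered only, 1 = persisted

    def apply(self, log_index: int, service_data) -> None:
        self.buffer[log_index] = service_data      # write the service data into the volatile medium
        self.applied_list.append([log_index, 0])   # record the log as executed but not yet persisted
        if len(self.buffer) >= self.storage_threshold:
            self.checkpoint()

    def checkpoint(self) -> None:
        self.flush_to_disk(dict(self.buffer))      # flush all dirty service data to the non-volatile medium
        for record in self.applied_list:
            record[1] = 1                          # mark the corresponding logs as persisted
        self.buffer.clear()
```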
• the apparatus further includes: a second obtaining module, configured to, when a crash event occurs in the distributed storage system, obtain, based on the state parameter, a plurality of to-be-recovered logs from the executed log list, where the service data corresponding to the to-be-recovered logs has been written to the volatile storage medium but not to the non-volatile storage medium; and a recovery module, configured to sequentially restore the plurality of pieces of service data corresponding to the to-be-recovered logs to the volatile storage medium based on the storage order of the to-be-recovered logs in the executed log list, as sketched below.
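• The crash recovery path of the second obtaining module and the recovery module can be sketched as follows; looking up the lost service data by index in a redo log is an assumption for illustration.

```python
# Hedged sketch of crash recovery: replay, in AppliedList order, the logs whose data was not persisted.
def recover_after_crash(applied_list, redo_log, buffer):
    """applied_list: list of (log_index, state); state 0 means the data was lost with the volatile medium."""
    to_recover = [idx for idx, state in applied_list if state == 0]
    for idx in to_recover:                     # AppliedList order preserves the original execution order
        buffer[idx] = redo_log[idx]            # restore the service data into the volatile medium
```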
• the device further includes: a third acquisition module, configured to acquire the number of logs that conflict with the log when the conflict verification result is a conflict; a determination module, configured to determine the scanning frequency of the log based on that number of logs; and the execution module 1303 is further configured to scan the log based on the scanning frequency and refresh the conflict verification result, and to execute the log once the conflict verification result is no conflict.
  • the determining module is configured to: in the case that the number of logs is greater than the conflict threshold, determine the scanning frequency as the first frequency; or, in the case that the number of logs is less than or equal to the conflict threshold , the scanning frequency is determined as a second frequency, and the second frequency is greater than the first frequency.
  • the executing module 1303 is configured to: add the log to a log list corresponding to the scan frequency, and scan the logs stored in the log list based on the scan frequency.
  • the scanning module 1301 is further configured to: cyclically scan a log matching index table, where the log matching index table is used to record the number of copies of multiple to-be-committed logs stored in the distributed storage system; the device It also includes a submission module, configured to submit the log to be submitted under the condition that the number of copies of any log to be submitted in the log matches the index table and meets the target condition.
• the apparatus is elected by a plurality of second node devices in the distributed storage system by voting after the end of the previous term, and the maximum index among the consecutive submitted logs in the apparatus is greater than or equal to the largest index among the consecutive submitted logs of the second node devices.
  • the device further includes: a fourth obtaining module, configured to obtain the index of each missing log in the case that at least one log is missing in the log submission active window,
  • the log submission active window includes multiple logs that have not been submitted, and the logs before the log submission active window have been submitted;
  • the request module is configured to request the missing logs from the plurality of second node devices based on the index.
• the apparatus further includes: a receiving module, configured to receive a plurality of target logs returned by the second node devices and the submission status information of the target logs; a completion module, configured to fill the target logs whose submission status information is submitted into the log submission active window; and the request module is further configured to request from the terminal the target logs whose submission status information is unsubmitted.
• when the log execution device provided in the above embodiments executes logs, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions can be allocated to different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above.
  • the log execution device and the log execution method provided by the above embodiments belong to the same concept. For the specific implementation process, please refer to the log execution method embodiment, which will not be repeated here.
  • the computer device 1400 may vary greatly due to different configurations or performances.
  • the computer device 1400 includes one or more processors (Central Processing Units, CPU) 1401 and one or more memories 1402, wherein, at least one computer program is stored in the memory 1402, and the at least one computer program is loaded and executed by the one or more processors 1401 to realize the above-mentioned various embodiments.
• the computer device 1400 also has components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the computer device 1400 also includes other components for implementing device functions, which are not described again here.
  • a computer-readable storage medium such as a memory including at least one computer program, and the at least one computer program can be executed by a processor in the terminal to complete the log execution methods in the foregoing embodiments.
  • the computer-readable storage medium includes ROM (Read-Only Memory, read-only memory), RAM (Random-Access Memory, random access memory), CD-ROM (Compact Disc Read-Only Memory, read-only optical disk), Tape, floppy disk, and optical data storage devices, etc.
  • a computer program product or computer program comprising one or more pieces of program code stored in a computer-readable storage medium.
• One or more processors of a computer device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code, enabling the computer device to perform the log execution method in the above embodiments.
  • the program is stored in a computer-readable storage medium,
  • the above-mentioned storage medium is a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A log execution method and apparatus, a computer device, and a storage medium, relating to the field of database technology. The method includes: cyclically scanning (301) a log execution active window; for any log in the log execution active window, obtaining (302) a conflict verification result of the log on the basis of storage range information of the log, the storage range information being used to indicate the storage ranges of the log and of a target number of logs preceding the log, the target number being equal to the window size of the log execution active window; and executing (303) the log when the conflict verification result is no conflict.

Description

日志执行方法、装置、计算机设备及存储介质
本申请要求于2021年02月09日提交的申请号为202110178645.4、发明名称为“日志执行方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据库技术领域,特别涉及一种日志执行方法、装置、计算机设备及存储介质。
背景技术
随着数据库技术的发展,对于工业级别的分布式存储系统,需要确保所有提交的修改不丢失,因此需要引入一致性算法(即共识算法)来保证文件系统中数据的可靠性与一致性。其中,Raft算法是一种在工程上使用较为广泛的一致性算法,具有强一致性、去中心化、易于理解和开发实现等特点。
发明内容
本申请实施例提供了一种日志执行方法、装置、计算机设备及存储介质。
一方面,提供了一种日志执行方法,应用于分布式存储系统中的第一节点设备,该方法包括:
循环扫描日志执行活跃窗口,所述日志执行活跃窗口包括尚未执行的多个日志,且所述日志执行活跃窗口之前的日志均已执行;
对于所述日志执行活跃窗口中的任一日志,基于所述任一日志的存储范围信息,获取所述任一日志的冲突验证结果,所述存储范围信息用于指示所述任一日志的存储范围以及所述任一日志之前的目标数量个日志的存储范围,所述目标数量等于所述日志执行活跃窗口的窗口尺寸;
响应于所述冲突验证结果为无冲突,执行所述任一日志。
一方面,提供了一种日志执行装置,应用于分布式存储系统,该装置包括:
扫描模块,用于循环扫描日志执行活跃窗口,所述日志执行活跃窗口包括尚未执行的多个日志,且所述日志执行活跃窗口之前的日志均已执行;
第一获取模块,用于对于所述日志执行活跃窗口中的任一日志,基于所述任一日志的存储范围信息,获取所述任一日志的冲突验证结果,所述存储范围信息用于指示所述任一日志的存储范围以及所述任一日志之前的目标数量个日志的存储范围,所述目标数量等于所述日志执行活跃窗口的窗口尺寸;
执行模块,用于响应于所述冲突验证结果为无冲突,执行所述任一日志。
在一种可能实施方式中,所述第一获取模块用于:
读取所述任一日志的存储范围信息,得到所述任一日志以及所述任一日志之前的目标数量个日志的存储范围;
响应于所述任一日志的存储范围与所述目标数量个日志的存储范围之间的交集为空集,确定所述冲突验证结果为无冲突;
响应于所述任一日志的存储范围与所述目标数量个日志的存储范围之间的交集不为空集,确定所述冲突验证结果为发生冲突。
在一种可能实施方式中,所述执行模块包括:
添加单元,用于将所述任一日志添加至待执行日志列表;
处理单元,用于调用日志执行线程,处理所述待执行日志列表中存储的日志。
在一种可能实施方式中,所述处理单元用于:
将所述待执行日志列表中存储的日志所对应的业务数据写入易失性存储介质;
将写入所述易失性存储介质中的所述业务数据所对应的日志添加至已执行日志列表;
响应于所述易失性存储介质的数据存储量大于或等于存储阈值,将所述易失性存储介质中存储的业务数据写入至非易失性存储介质中;
其中,在所述已执行日志列表中分别为业务数据未写入非易失性存储介质的日志与业务数据已写入非易失性存储介质的日志设置不同的状态参数。
在一种可能实施方式中,所述装置还包括:
第二获取模块,用于响应于所述分布式存储系统发生崩溃事件,在所述分布式存储系统重启时基于所述状态参数,从所述已执行日志列表中获取多个待恢复日志,所述多个待恢复日志所对应的多个业务数据已写入所述易失性存储介质而未写入至所述非易失性存储介质;
恢复模块,用于基于所述多个待恢复日志在所述已执行日志列表中的存储顺序,依次将所述多个待恢复日志所对应的多个业务数据恢复至所述易失性存储介质中。
在一种可能实施方式中,所述装置还包括:
第三获取模块,用于响应于所述冲突验证结果为发生冲突,获取与所述任一日志发生冲突的日志数量;
确定模块,用于基于所述日志数量,确定对所述任一日志的扫描频率;
所述执行模块,还用于按照所述扫描频率扫描所述任一日志并刷新所述冲突验证结果,直到所述冲突验证结果为无冲突,执行所述任一日志。
在一种可能实施方式中,所述确定模块用于:
响应于所述日志数量大于冲突阈值,将所述扫描频率确定为第一频率;或者,
响应于所述日志数量小于或等于所述冲突阈值,将所述扫描频率确定为第二频率,所述第二频率大于所述第一频率。
在一种可能实施方式中,所述执行模块用于:
将所述任一日志添加至与所述扫描频率对应的日志列表,按照所述扫描频率扫描所述日志列表中存储的日志。
在一种可能实施方式中,所述扫描模块还用于:循环扫描日志匹配索引表,所述日志匹配索引表用于记录多个待提交日志在所述分布式存储系统中存储的副本数量;
所述装置还包括提交模块,用于响应于所述日志匹配索引表中的任一待提交日志的副本数量符合目标条件,提交所述任一待提交日志。
在一种可能实施方式中,所述装置由所述分布式存储系统中的多个第二节点设备在上一任期结束后投票选举产生,所述装置中连续的已提交的日志中的最大索引大于或等于所述多个第二节点设备中连续的已提交的日志中的最大索引。
在一种可能实施方式中,所述装置还包括:
第四获取模块,用于响应于日志提交活跃窗口中缺失至少一个日志,获取缺失的所述至少一个日志的至少一个索引,所述日志提交活跃窗口包括尚未提交的多个日志,且所述日志提交活跃窗口之前的日志均已提交;
请求模块,用于基于所述至少一个索引,向所述多个第二节点设备请求所述至少一个日志。
在一种可能实施方式中,所述装置还包括:
接收模块,用于接收所述多个第二节点设备所返回的至少一个目标日志以及所述至少一个目标日志的提交情况信息;
补全模块,用于将提交情况信息为已提交的目标日志补全在所述日志提交活跃窗口中;
所述请求模块,还用于向终端请求提交情况信息为未提交的目标日志。
一方面,提供了一种计算机设备,该计算机设备包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条计算机程序,该至少一条计算机程序由该一个 或多个处理器加载并执行以实现如上述日志执行方法。
一方面,提供了一种存储介质,该存储介质中存储有至少一条计算机程序,该至少一条计算机程序由处理器加载并执行以实现如上述日志执行方法。
一方面,提供一种计算机程序产品或计算机程序,所述计算机程序产品或所述计算机程序包括一条或多条程序代码,所述一条或多条程序代码存储在计算机可读存储介质中。计算机设备的一个或多个处理器能够从计算机可读存储介质中读取所述一条或多条程序代码,所述一个或多个处理器执行所述一条或多条程序代码,使得计算机设备能够执行上述日志执行方法。
附图说明
图1是本申请实施例提供的一种分布式存储系统的架构示意图;
图2是本申请实施例提供的一种日志乱序复制机制的原理性示意图;
图3是本申请实施例提供的一种日志执行方法的流程图;
图4是本申请实施例提供的一种LHA数据结构的原理性示意图;
图5是本申请实施例提供的一种数据持久化机制的原理性示意图;
图6是本申请实施例提供的一种日志乱序执行调度的原理性示意图;
图7是本申请实施例提供的一种日志提交活跃窗口的数据结构示意图;
图8是本申请实施例提供的一种日志乱序提交机制的流程图;
图9是本申请实施例提供的一种日志不一致情况的原理性示意图;
图10是本申请实施例提供的一种领导节点的选举机制的原理性示意图;
图11是本申请实施例提供的一种任期Term的原理性示意图;
图12是本申请实施例提供的一种终端与集群交互的原理性流程图;
图13是本申请实施例提供的一种日志执行装置的结构示意图;
图14是本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
本申请中术语“至少一个”是指一个或多个,“多个”的含义是指两个或两个以上,例如,多个第一位置是指两个或两个以上的第一位置。
在介绍本申请实施例之前,需要引入一些云技术领域内的基本概念,下面进行介绍。
云技术(Cloud Technology):是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术,也即是基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,通过组成资源池,按需所用,灵活便利。云计算技术将变成云技术领域的重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站或更多的门户网站。伴随着互联网行业的高度发展和应用,将来每个物品都有可能存在自己的识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,均能通过云计算来实现。
云存储(Cloud Storage):是在云计算概念上延伸和发展出来的一个新的概念,分布式云存储系统(以下简称存储系统)是指通过集群应用、网格技术以及分布存储文件系统等功能,将网络中大量各种不同类型的存储设备(存储设备也称之为存储节点)通过应用软件或应用接口集合起来协同工作,共同对外提供数据存储和业务访问功能的一个存储系统。
数据库(Database):简而言之能够视为一种电子化的文件柜——存储电子文件的处所,支持用户对文件中的数据进行新增、查询、更新、删除等操作。所谓“数据库”是以一定方式储存在一起、能与多个用户共享、具有尽可能小的冗余度、与应用程序彼此独立的数据集合。
数据的全态(Full State):对于数据库系统中的数据项,基于状态属性的不同,划分为三种状态:当前态、过渡态和历史态,该三种状态合称为“数据的全态”,简称全态数据,全态数据中的各个不同状态属性,用于标识数据在其生命周期轨迹中所处的状态。
其一,当前态(Current State):最新版本的数据项,是处于当前阶段的数据项。
其二,历史态(Historical State):数据项在历史上的一个状态,其值是旧值,不是当前值。可选地,多个历史态数据项对应于同一主键标识,反映了具有该主键标识的各个数据项的状态变迁的过程。处于历史态的数据项,只能被读取而不能被修改或删除。
其三,过渡态(Transitional State):不是当前态数据项也不是历史态数据项,处于从当前态向历史态转变的过程中,这种处于过渡态的数据也称为半衰数据。
可选地,不同的数据项具有相同的主键标识(Primary Key,PK),此时,具有相同主键标识的各个数据项构成一个全态数据集,该全态数据集内的各个数据项在本质上用于表示全态数据,也即是说,在对具有该主键标识的初始数据项进行多次修改(或删除)的过程中,由于修改(或删除)时刻不同而产生的多个不同的版本,即构成一个全态数据集。在一个全态数据集中,有的数据项处于当前态,有的数据项处于过渡态,有的数据项处于历史态。这里的全态数据集是指一个抽象的、虚拟的集合概念,同一个全态数据集内的各个数据项能够分布式地存储在不同的物理机上。可选地,数据库系统在存储各个数据项时,采用指针将对应于同一主键标识的各个数据项按照时序链接起来,便于查询全态数据的生命周期轨迹。
事务:事务是数据库管理系统在执行操作的过程中的一个逻辑单位,由一个有限的数据库操作序列构成,是数据库系统操作的最小执行单位。在一个系统的内部,每个操作系列的单位,称为一个事务,单个操作也称为一个事务。事务必须服从ACID原则,其中,ACID是原子性(Atomicity)、一致性(Consistency)、隔离性(Isolation)和持久性(Durability)的缩写。
日志:本申请实施例涉及的日志,也称为日志项、日志记录,均是指分布式存储系统中的事务日志。在事务日志中,数据变化被记录在一系列连续的日志记录中,每一条日志记录都被存储在一个虚拟日志文件中。可选地,事务日志有任意多个虚拟日志文件,数量的多少取决于数据库引擎,而且每个虚拟日志文件的大小也不是固定的。
以关系型数据管理系统SQL(Structured Query Language,结构化查询语言)Server中的数据库为例,SQL Server中的数据库都是由一或多个数据文件以及一或多个事务日志文件组成的,其中,数据文件主要存储数据库的数据(也称为数据项、数据记录),包括数据库内容结构、数据页、索引页等,事务日志则主要是用来保存数据库修改记录的,换言之,事务日志用于记录所有事务以及每个事务对数据库所做的修改,是数据库系统的备份和恢复的重要组件。在数据库中,事务要保证ACID的完整性,则必须依靠事务日志进行追溯,即,事务的每一个操作在落盘之前,必须先写入到日志中,例如,事务要删除数据表中的某一行数据记录,则先在日志中将这一行数据记录标记为删除,但此时数据文件中并没有发生变化,只有在事务提交后,才会把事务中的SQL语句进行落盘,即在该数据表中删除这一行数据记录。
可选地,日志包括重做日志(Redo Log)、回滚日志(Undo Log)和二进制日志(Bin Log,也称为归档日志),其中,重做日志用来记录事务操作引起数据的变化,记录的是数据页的物理修改,因此重做日志属于物理日志,记录的是事务对某个数据进行了怎么样的修改;二进制日志则主要用于记录数据库的变化情况,内容包括数据库所有的更新操作,所有涉及数据变动的操作,都要记录进二进制日志中,二进制日志属于逻辑日志,记录的是SQL语句的原始逻辑;回滚日志的作用就是对数据进行回滚,当事务对数据库进行修改,数据库引擎不仅会记录重做日志,还会生成对应的回滚日志,如果事务执行失败或调用了Rollback接口,导 致事务需要回滚,就能够利用回滚日志中的信息将数据回滚到修改之前的样子,回滚日志属于逻辑日志,记录的是SQL语句执行相关的信息。
本申请实施例所涉及的分布式存储系统包括分布式数据库系统以及采用类似架构的分布式数据库管理系统,例如,该分布式存储系统为分布式事务型数据库系统,需要分布式事务处理能力、需要共享数据上的一致性模型。
在分布式存储系统中包括至少一个节点设备,每个节点设备的数据库中存储有一个或多个数据表,每个数据表用于存储一个或多个数据项(也称为变量版本)。其中,节点设备的数据库为任一类型的分布式数据库,包括关系型数据库或者非关系型数据库中至少一项,例如SQL(Structured Query Language,结构化查询语言)数据库、NoSQL、NewSQL(泛指各种新式的可拓展/高性能数据库)等,本申请实施例对数据库的类型不作具体限定。
在一些实施例中,本申请实施例应用于一种基于区块链技术的数据库系统(以下简称为“区块链系统”),上述区块链系统在本质上属于一种去中心化式的分布式数据库系统,采用共识算法保持区块链上不同节点设备所记载的账本数据一致,通过密码算法保证不同节点设备之间账本数据的加密传送以及不可篡改,通过脚本系统来拓展账本功能,通过网络路由来进行不同节点设备之间的相互连接。
在区块链系统中包括一条或多条区块链,区块链是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。
区块链系统中节点设备之间组成点对点(Peer To Peer,P2P)网络,P2P协议是一个运行在传输控制协议(Transmission Control Protocol,TCP)协议之上的应用层协议。在区块链系统中,任一节点设备具备如下功能:A)路由,节点设备具有的基本功能,用于支持节点设备之间的通信;B)应用,用于部署在区块链中,根据实际业务需求而实现特定业务,记录实现功能相关的数据形成账本数据,在账本数据中携带数字签名以表示数据来源,将账本数据发送至区块链系统中的其他节点设备,供其他节点设备在验证账本数据来源以及完整性成功时,将账本数据添加至临时区块中,其中,应用实现的业务包括钱包、共享账本、智能合约等;C)区块链,包括一系列按照先后的时间顺序相互接续的区块,新区块一旦加入到区块链中就不会再被移除,区块中记录了区块链系统中节点设备提交的账本数据。
在一些实施例中,每个区块中包括本区块存储交易记录的哈希值(本区块的哈希值)以及前一区块的哈希值,各区块通过哈希值连接形成区块链,可选地,区块中还包括有区块生成时的时间戳等信息。
图1是本申请实施例提供的一种分布式存储系统的架构示意图,以分布式存储系统为HTAC(Hybrid Transaction/Analytical Cluster,混合事务/分析集群)系统为例进行说明。参见图1,在HTAC系统内包括TP(Transaction Processing,事务处理)集群101和AP(Analytical Processing,分析处理)集群102。
TP集群101用于提供事务处理服务。在TP集群101中包括多个TP节点设备,在事务处理过程中,各个TP节点设备用于处理当前态数据,其中,每个TP节点设备是单机设备或一主多备集群,本申请实施例不对TP节点设备的类型进行具体限定。
AP集群102用于存储历史态数据并提供历史态数据的查询及分析服务。可选地,AP集群102包括全局时间戳生成集群和分布式存储集群,分布式存储集群中包括多个AP节点设备,当然,各个AP节点设备是单机设备或一主多备集群,本申请实施例不对AP节点设备的类型进行具体限定。
在上述架构中,每个TP节点设备所对应主机或备机的数据库实例集合称为一个SET(集合),例如,如果某一TP节点设备为单机设备,那么该TP节点设备的SET仅为该单机设备的数据库实例,如果该TP节点设备为一主两备集群,那么该TP节点设备的SET为主机数据 库实例以及两个备机数据库实例的集合,此时基于云数据库(Cloud Database)的强同步技术来保证主机的数据与备机的副本数据之间的一致性,可选地,每个SET进行线性扩容,以应付大数据场景下的业务处理需求。
可选地,各个AP节点设备在本地数据库中存储TP集群101产生的历史态数据,通过存储接口接入分布式文件系统103,从而通过该分布式文件系统103对TP集群101产生的历史态数据提供无限存储功能,例如,该分布式文件系统103包括但不限于:HDFS(Hadoop Distributed File System,Hadoop分布式文件系统)、Ceph(一种Linux系统下的分布式文件系统)、Alluxio(一种基于内存的分布式文件系统)等。
在一些实施例中,由于TP节点设备提供事务处理服务,当任一事务提交完成时,在生成新的当前态数据的同时,也会生成与该当前态数据所对应的历史态数据,而由于历史态数据会占用较多存储空间,而历史态数据又具有保存价值,因此TP节点设备通过预定义的历史态数据迁移策略,将产生的历史态数据原子化地迁移至AP集群102,AP集群102中各个AP节点设备基于本地执行器(Local Executor,LE)实现历史态数据的转储,并将每次数据迁移的元信息注册到元数据(Metadata,MD)管理器中,从而便于AP集群102基于MD管理器来统计已储备数据的元信息。
在一些实施例中,用户端基于SQL路由(SQL Router,简称SR)层中提供的查询语句、查询操作的语义和元数据,路由查询到TP集群101或AP集群102内存储的任一数据,当然,TP集群101主要提供对当前态数据的查询服务,AP集群102则主要提供对历史态数据的查询服务。
可选地,上述TP集群101或者AP集群102是多个物理服务器构成的服务器集群或者分布式系统,或者,是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)以及大数据和人工智能平台等基础云计算服务的云服务器,本申请实施例对此不作限定。
可选地,用户端也即终端是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端以及TP集群101能够通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
上述HTAC系统是一种分布式数据库系统的示例性说明,也是一种分布式事务型数据库系统的示例性说明,需要分布式事务处理的能力,且需要共享数据上的一致性模型。上述HTAC系统能够实现更多、更高效的数据异常识别,实现高效的可串行化隔离级别,从而使得HTAC系统能够适应于多种业务场景。
例如,使用严格可串行化级别时,能够很好地适用于金融领域,保证数据的可靠性,而目前主流的分布式数据库系统均无法高效地提供此一致性级别。又例如,使用较弱的一致性级别时,能够很好地适用于互联网场景,从而提供高并发、实时地数据库服务,为互联网用户提供良好的产品体验。
此外,上述HTAC系统在数据多副本的情况下,存储层需要共识协议保证数据的一致性和高可用性,这为本申请实施例提供的技术依赖的具体背景。综上,本申请实施例涉及的日志执行方法及相关理论(指并发Raft算法,Parallel Raft算法)能够提升数据库产品的技术含量、技术门槛、竞争力和技术影响力,具有很强的现实意义。并且,上述日志执行方法能够有效保障用户数据的正确性,在性能方面也会具有良好的表现。
在相关技术中,分布式存储技术的发展平衡了日益膨胀的应用存储需求与单点存储能力瓶颈之间的矛盾,一致性问题是保证分布式系统正确性与可靠性的核心问题,而对于一个工业级别的分布式存储系统,需要确保所有提交的修改不丢失,因此需要引入一致性算法(即共识算法)来保证文件系统中数据的可靠性与一致性。
从工程实现的成熟角度来说,斯坦福大学的Diego Ongaro等提出的Raft算法逐渐在分布式系统领域中成为主流的一致性算法之一,Raft算法是一种基于主从模型、保证强一致性的 一致性算法,且非常易于理解和开发实现。
在Raft算法中,基于主从模式来管理复制日志(Replicated Log),在每一轮任期中会选举一个节点作为领导者(Leader,Raft算法中的主节点,即第一节点设备),其余节点则作为跟随者(Follower,Raft算法中的从节点,即第二节点设备),只有领导者能够响应终端的请求,并将该请求发送至其他的跟随者。
由于Raft算法分为了领导者选举、日志复制、安全性等部分,为了简单性和算法的可理解性,Raft算法采用高度串行化的设计,日志在领导者和跟随者上都不允许有缺失,即意味着所有的日志会按照顺序被跟随者确认后,再被领导者提交(Commit),然后执行(Apply)到所有副本上。换言之,基于Raft算法的分布式存储系统,对于数据库应用(即终端侧)的大量并发写请求,只能顺序依次提交,而不能并发地处理请求以提高系统性能。
针对Raft算法,在多连接的高并发场景下,领导者和跟随者在工程场景中通常使用多条连接并发地传送日志,当一个连接阻塞或者变慢,那么日志到达跟随者的顺序就会变乱,也即会发生领导者中存储的日志被乱序地发送到跟随者的情况,此时次序(也即索引)靠后的日志会比次序靠前的日志先到达跟随者。
基于Raft算法的机制,跟随者必须按照次序接收日志,由于日志乱序到达跟随者,会产生跟随者日志缺失的问题,跟随者只有阻塞所有乱序日志的提交流程,直到缺失的日志也到达跟随者之后,跟随者才会取消阻塞,导致系统的吞吐量降低。此外,当多个跟随者因个别缺失的日志而阻塞时,整个系统会出现卡顿。
如上可知,Raft算法因其严格串行化的特性不适用于高并发场景,严重限制了分布式系统的吞吐量,分布式存储系统需要利用高并发I(Input,输入)/O(Output,输出)能力提升系统吞吐量,因此本申请实施例提供一种日志执行方法,能够放开传统Raft算法严格有序性的约束,也即设计了一种基于Raft算法的改进型的共识算法,称为Parallel Raft(并发Raft)算法。
并发Raft算法应用于分布式存储系统,分布式存储系统包括第一节点设备(Leader,领导节点)和多个第二节点设备(Follower,跟随节点)。并发Raft算法涉及到一种基于分布式存储系统的日志乱序复制机制,日志乱序复制机制包括第一节点设备接收终端的业务请求、日志乱序确认(ACK)阶段、日志乱序提交(Commit)阶段和日志乱序执行(Apply)阶段。
传统Raft算法为了容易理解和开发实现,采用了高度串行化的设计,在第一节点设备和第二节点设备的日志列表中的日志均不允许有缺失(也称为“空洞”),意味着所有日志会按照顺序被第二节点设备确认,按照顺序被第一节点设备提交,最后按照顺序被执行到所有副本上。
传统Raft算法通过如下方式来保证串行化:当第一节点设备发送一个日志给第二节点设备,第二节点设备需要返回ACK(Acknowledge Character,确认字符)消息来确认该日志已经被接收到并被记录,同时也隐式表明所有之前的日志均已收到且持久化存储;当第一节点设备已经提交一个日志并广播至所有的第二节点设备,那么也将同时确认了提交的日志之前的所有日志均已经被提交了。
而在并发Raft算法中打破了上述串行化的限制,使得日志能够并发乱序地从第一节点设备复制到第二节点设备,同理,第二节点设备能够并发乱序地确认接收到的日志,使得第一节点设备能够并发乱序地根据日志接收统计情况来提交日志。
图2是本申请实施例提供的一种日志乱序复制机制的原理性示意图,如200所示,第一节点设备并行向所有的第二节点设备发送Index=3和Index=6的日志,而某个第二节点设备在接收到Index=3的日志之后,即使前面Index=2的日志缺失(由于乱序复制机制尚未到达该第二节点设备),第二节点设备仍然能够立刻向第一节点设备返回ACK消息,同理,第二节点设备在接收到Index=6的日志之后,即使前面Index=2和Index=4的日志均缺失,第二节点设备依然能够立刻向第一节点设备返回ACK消息。
在上述过程中,并发Raft算法下任何日志在第二节点设备上被成功持久化之后均能够立即返回ACK消息,不用等待日志列表中本日志之前的日志依次确认,此外,第一节点设备在一个日志的多数副本均被确认后即可提交,不必等待日志列表中本日志之前的日志提交。需要说明的是,乱序提交是在日志提交(Commit)活跃窗口的限制内进行的,也即,日志提交活跃窗口内的一个日志不受限制于日志提交活跃窗口内除了本日志之外的其他日志,日志提交活跃窗口用于支持日志的乱序提交。日志乱序提交机制将在后续实施例中进行介绍,这里不做展开说明。
上述日志乱序确认机制、日志乱序提交机制以及日志乱序执行机制,会导致第二节点设备上日志列表的日志存在缺失现象,从而给安全地将日志应用于状态机(也即执行)带来困难,后续日志的执行需要考虑是否与前面尚未执行的日志相冲突。
为使得分布式存储系统在进行日志乱序复制时,仍能保证整个分布式存储系统的数据一致性,需要引入存储语义的正确性条件,只有满足存储语义的正确性条件的日志才能够被允许乱序复制。
并发Raft算法由于日志能够被乱序确认和乱序提交,导致各个日志都可能在不同的第二节点设备上出现缺失副本的情况(即“空洞”现象),并发Raft算法设计了日志空洞数组(Log Hole Array,LHA)来判断本日志与本日志之前的N(N≥1)个日志之间是否冲突,来保证存储语义的正确性,然后才能够安全地执行本日志。
LHA用于判断本日志与之前的目标数量个日志之间是否发生冲突,为了保证每个日志的LHA的完整性,第一节点设备的日志列表不允许存在空洞,第一节点设备需要顺序接收终端(Client)发来的业务请求(或命令)并将业务请求对应的日志记录在日志列表中,上述策略保证了第一节点设备发送给第二节点设备的日志的LHA能够完整地包涵了本日志及前N个日志的存储范围。LHA的数据结构将在下述实施例中进行详述,这里不做赘述。
相较于传统Raft算法来说,并发Raft算法的日志结构中增加了LHA,用于辅助判断本日志是否和前N个日志的存储范围发生冲突,此外,当日志的日志索引LogIndex=-1(日志索引LogIndex简称为索引Index)时,说明该日志是缺失的,当日志的日志索引LogIndex=-2时,说明该日志是空日志项,空日志项即写入成功的无效日志项。
在一个示例中,第一节点设备向任一第二节点设备顺序发送Index=1的日志x=1和Index=2的日志x=2,此时第二节点设备仅接收到Index=2的日志,而Index=1的日志发生了缺失,由于Index=1的日志和Index=2的日志均修改x的值,因此二者是冲突的,此时不能立即执行Index=2的日志,而需要等待接收到Index=1的日志且成功执行Index=1的日志之后,才能执行Index=2的日志。
将上述示例中的x修改为日志的存储范围,能够得到存储语义的正确性条件,即:在存储场景下,本日志能够安全地执行的前提条件是本日志之前尚未执行的各个日志与本日志的存储范围无冲突。可选地,通过LHA能够判断本日志和前N个日志是否发生存储范围的冲突,从而能够在保证满足存储语义的正确性条件的情况下,分布式存储系统能够高并发乱序地执行日志列表中的各个日志。
图3是本申请实施例提供的一种日志执行方法的流程图。参见图3,该实施例由上述分布式存储系统中的第一节点设备执行,本申请实施例介绍并发Raft算法的日志乱序执行机制,下面进行详述。
301、第一节点设备循环扫描日志执行活跃窗口,该日志执行活跃窗口包括尚未执行的多个日志,且该日志执行活跃窗口之前的日志均已执行。
其中,日志执行(Apply)活跃窗口的含义为:只有位于日志执行活跃窗口内的日志才能够被执行,但日志能够被执行还需满足存储语义的正确性条件:本日志与日志执行活跃窗口内的、本日志前面的尚未执行的日志均不冲突。
可选地,日志执行活跃窗口前的所有日志均已执行,而只有位于日志执行活跃窗口内的 各个日志才能够被执行,此外,日志执行活跃窗口之后的所有日志均不能被执行,需要等待日志执行活跃窗口向后移动之后,当某一日志满足处于日志执行活跃窗口之内时,这一日志才拥有被执行的权限。
在一些实施例中,日志执行活跃窗口涉及以下2种状态变量。
第一种、toApplyWindowBeginIndex变量,用于记录日志执行活跃窗口内的首个日志的索引(Index)。
第二种、isApplied[]列表,用于记录日志执行活跃窗口内的各个日志是否已经执行,也即用于记录日志执行活跃窗口内的各个日志的执行情况信息。
例如,每个日志的执行情况信息采用一个布尔型数据来表示,布尔型数据取值为真(True)代表本日志已经执行,布尔型数据取值为假(False)代表本日志尚未执行。在isApplied[]数组中,按照日志索引从小到大的顺序依次存放各个日志的执行情况信息。
在上述过程中,通过设置日志执行活跃窗口,能够节省并发Raft算法中状态变量的存储开销,比如传统Raft算法需要记录每个日志是否已提交、是否已执行,而并发Raft算法在保证日志执行活跃窗口之前的所有日志均已执行的前提下,仅仅需要记录位于日志执行活跃窗口内的各个日志的执行状态,从而能够节约第一节点设备的存储开销。
302、对于该日志执行活跃窗口中的任一日志,第一节点设备基于该日志的存储范围信息,获取该日志的冲突验证结果,该存储范围信息用于指示该日志的存储范围以及该日志之前的目标数量个日志的存储范围,该目标数量等于该日志执行活跃窗口的窗口尺寸。
在一些实施例中,第一节点设备循环扫描日志执行活跃窗口之后,对于日志执行活跃窗口内尚未执行的任一日志,判断该日志是否与日志执行活跃窗口内的、位于该日志之前且尚未执行的日志发生冲突,得到该日志的冲突验证结果。
在一些实施例中,第一节点设备为每个日志维护一个存储范围信息,每个日志的存储范围信息用于记录每个日志的存储范围以及在每个日志之前的目标数量个日志的存储范围。
在一些实施例中,第一节点设备读取该日志的存储范围信息,得到该日志以及该日志之前的目标数量个日志的存储范围;在该日志的存储范围与该目标数量个日志的存储范围之间的交集为空集的情况下,确定该冲突验证结果为无冲突;在该日志的存储范围与该目标数量个日志的存储范围之间的交集不为空集的情况下,确定该冲突验证结果为发生冲突。
可选地,第一节点设备获取该日志的存储范围与该日志之前的每个日志的存储范围之间的交集,遍历该目标数量个日志,得到目标数量个交集,如果该目标数量个交集中任一交集不是空集,确定该冲突验证结果为发生冲突,可选地,获取不是空集的交集数量,将这一交易数量确定为与该日志发生冲突的日志数量,便于后续对该日志进行延后执行;否则,如果该目标数量个交集中所有交集都是空集,确定该冲突验证结果为无冲突。
可选地,第一节点设备获取该之前的该目标数量个日志的存储范围之间的并集,再获取该并集与该日志的存储范围之间的目标交集,如果该目标交集为空集,确定该冲突验证结果为无冲突;否则,如果该目标交集不是空集,确定该冲突验证结果为发生冲突。
在一些实施例中,第一节点设备为每个日志维护的该存储范围信息是一个日志空洞数组(Log Hole Array,LHA),LHA用于判断本日志与之前的目标数量个日志之间是否发生冲突。也即是说,第一节点设备在存储每个日志时,会附带存储每个日志的LHA,LHA中包括本日志以及之前的目标数量个日志的存储范围,例如,LHA包括本日志及本日志之前的N个日志的存储范围,N为大于或等于1的整数。
图4是本申请实施例提供的一种LHA数据结构的原理性示意图,如400所示,某一日志的LHA包括本日志自身的存储范围[4,7],以及本日志之前的N=5个日志的存储范围,这5个日志按照索引从小到大的顺序,对其存储范围进行排列为:[6,7]、[0,1]、[8,9]、[4,5]、[1,2]。能够看出,本日志的存储范围分别与存储范围为[6,7]和[4,5]的2个日志相冲突。
在上述基础上,第一节点设备基于每个日志附带存储的LHA即可判断本日志是否和本日志之前的N个日志发生冲突。然而,第一节点设备无法基于LHA判断本日志是否与上述N 个日志之前的其他日志发生冲突,因此并发Raft算法需要保证上述N个日志之前的其他日志均已执行,并通过设计上述日志执行活跃窗口来保证:a)只有日志执行活跃窗口内的日志有可能被执行;b)日志执行活跃窗口的窗口尺寸恰好等于N。
通过上述设计,能够保证对日志执行活跃窗口内的任一日志基于LHA判断是否产生冲突时,由于日志执行活跃窗口的窗口尺寸也等于N,那么即使是日志执行活跃窗口的最后一个日志(即日志执行活跃窗口内索引最大的日志),也能够保证上述最后一个日志之前的N个日志再之前的其他日志均已执行。
303、第一节点设备在该冲突验证结果为无冲突的情况下,执行该日志。
在一些实施例中,第一节点设备将该日志添加至待执行日志列表ApplyingList[],调用日志执行线程,处理该待执行日志列表ApplyingList[]中存储的日志,此时由于已经确认了该日志与之前的目标数量个日志均无冲突,因此无需等待该日志之前的目标数量个日志执行,而能够乱序执行保证了存储语义正确性的该日志。
以下,对上述存储语义的正确性条件进行详细说明,在分布式存储系统中为了保证互相冲突的I/O请求能够按照时间先后顺序串行执行,且互不冲突的I/O请求能够并行执行,那么就需要判断本日志是否发生冲突:1)通过LHA记录日志执行活跃窗口内的每个日志是否与前N个日志的存储范围相冲突;2)确保日志执行活跃窗口之前的所有日志都已经执行,设日志执行活跃窗口的窗口尺寸为toApplyWindowSize;3)通过上述步骤302来判断日志执行活跃窗口内的每个日志是否和前N个日志之间发生冲突。
在一些实施例中,第一节点设备在处理该待执行日志列表中存储的日志时,执行下述操作:将该待执行日志列表ApplyingList[]中存储的日志所对应的业务数据写入易失性存储介质;将写入该易失性存储介质中的该业务数据所对应的日志添加至已执行日志列表AppliedList[];在该易失性存储介质的数据存储量大于或等于存储阈值的情况下,将该易失性存储介质中存储的业务数据写入至非易失性存储介质中;其中,在该已执行日志列表AppliedList[]中分别为业务数据未写入非易失性存储介质的日志与业务数据已写入非易失性存储介质的日志设置不同的状态参数,该存储阈值为任一大于或等于0的数值。
可选地,已执行日志列表AppliedList[]用于记录已经执行了的日志,例如,在已执行日志列表AppliedList[]中存储已经执行了的各个日志的索引LogIndex,并采用状态参数State来表示日志所对应的业务数据是否持久化(也即业务数据是否写入非易失性存储介质中)。
在一个示例中,State=0表示日志所对应的业务数据仅写入易失性存储介质且尚未写入非易失性存储介质,State=1表示日志所对应的业务数据已写入非易失性存储介质中。在另一示例中,采用State=1表示日志所对应的业务数据仅写入易失性存储介质且尚未写入非易失性存储介质,State=0表示日志所对应的业务数据已写入非易失性存储介质中,本申请实施例对此不作具体限定。
在一个示例性场景中,该易失性存储介质是存储系统利用内存构建的数据缓冲池,该非易失性存储介质是磁盘、硬盘等。在上述情况中,日志所对应的业务数据先写入数据缓冲池内,系统再定期将数据缓冲池内的业务数据批量的刷到磁盘中。可选地,第一节点设备在将数据缓冲池中存储的业务数据转存至磁盘时,采取Checkpoint(检查点)技术,也即,当数据缓冲池的数据存储量大于或等于存储阈值时,系统会触发Checkpoint事件,把数据缓冲池中所有的脏数据(未持久化的业务数据)刷回磁盘,并修改已执行日志列表AppliedList[]中对应日志的状态参数,从而将对应日志的状态参数标记为已刷回磁盘。
图5是本申请实施例提供的一种数据持久化机制的原理性示意图,如500所示,以易失性存储介质为数据缓冲池、非易失性存储介质为磁盘为例进行说明,待执行日志列表ApplyingList[]中记录了当前能够执行的各个日志,第一节点设备调用日志执行线程循环扫描待执行日志列表ApplyingList[],在每次扫描时将待执行日志列表ApplyingList[]中的各个日志所对应的各个业务数据写入数据缓冲池,同时,将已写入数据缓冲池的各个业务数据所对应的各个日志追加到已执行日志列表AppliedList[]中,并设置上述各个日志的状态参数State=0 来标识各个日志所对应的各个业务数据已写入数据缓冲池(但未写入磁盘)。当数据缓冲池的数据存储量大于或等于存储阈值(相当于一个缓冲池空间阈值)时,则将数据缓冲池中的各个业务数据转存至磁盘中,也即将数据缓冲池中的脏数据批量刷回磁盘,同时,将已执行日志列表AppliedList[]中各个业务数据所对应的各个日志的状态参数State赋值为1。
在上述过程中,通过对已执行日志列表AppliedList[]中的各个日志维护一个状态参数,用于表示各个日志所对应的各个业务数据是否已写入数据缓冲池或者已写入磁盘,能够有效地保障系统在发生崩溃时及时恢复丢失的业务数据。
由于易失性存储介质内存储的业务数据具有易失性,若系统崩溃,易失性存储介质内未刷出到非易失性存储介质的脏数据则会丢失,存储系统利用日志机制来解决上述问题,也即,业务数据在写入易失性存储介质之前先写入日志中,当系统意外崩溃时能够通过重做日志恢复业务数据,且能够利用Checkpoint技术记录易失性存储介质内尚未持久化的脏数据对应的日志,其余日志均能够循环利用。
有鉴于此,并发Raft算法设计了系统崩溃恢复机制,确保写入的业务数据在系统崩溃时不会彻底丢失,并通过Checkpoint技术来恢复业务数据,再通过已执行日志列表AppliedList[]这一数据结构将乱序的写请求有序化,保证数据恢复过程的有序性,下面进行说明。
在一些实施例中,假设某个时刻分布式存储系统发生崩溃,已写入非易失性存储介质的业务数据能够被持久化存储,因此无需进行恢复,但仅写入易失性存储介质但未写入非易失性存储介质的业务数据(属于脏数据)会全部丢失。此时,第一节点设备通过如下操作来恢复数据:在该分布式存储系统发生崩溃事件的情况下,在该分布式存储系统重启时基于该状态参数State,从该已执行日志列表AppliedList[]中获取多个待恢复日志,每个该待恢复日志所对应的业务数据已写入该易失性存储介质而未写入至该非易失性存储介质;基于多个该待恢复日志在该已执行日志列表中的存储顺序,依次将多个该待恢复日志所对应的多个业务数据恢复至该易失性存储介质中,换言之,基于待恢复日志被存储到已执行日志列表AppliedList[]中的先后顺序,依次将各个待恢复日志所对应的业务数据恢复到易失性存储介质中。
在上述过程中，由于在该已执行日志列表AppliedList[]中分别为业务数据未写入非易失性存储介质的日志与业务数据已写入非易失性存储介质的日志设置不同的状态参数State，假设State=0表示日志所对应的业务数据仅写入易失性存储介质且尚未写入非易失性存储介质，State=1表示日志所对应的业务数据已写入非易失性存储介质中，在系统崩溃后重启时，第一节点设备利用Checkpoint技术检查已执行日志列表AppliedList[]，获取状态参数State=0的多个日志，将上述State=0的每个日志均确定为待恢复日志，这些State=0的日志（即待恢复日志）所对应的业务数据均仅仅刷入易失性存储介质而未刷回非易失性存储介质，在崩溃时业务数据会从易失性存储介质中丢失，因此需要通过重做日志（Redo Log）对这些丢失的业务数据（脏数据）进行恢复。
在一些实施例中，第一节点设备顺序扫描已执行日志列表AppliedList[]，顺序执行上述已执行日志列表AppliedList[]中业务数据尚未持久化的待恢复日志（State=0的日志），从而能够通过重做这些待恢复日志，依次将待恢复日志所对应的业务数据恢复到易失性存储介质内。
在上述过程中,由于已执行日志列表AppliedList[]保存了日志的执行顺序,那么按照此顺序依次执行各个待恢复日志,能够保证将要执行的待恢复日志与前面存储范围冲突的日志(包括待恢复日志或者持久化日志)均已执行,从而能够确保存储语义的正确性。
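下面给出崩溃恢复流程的一个示意性草图（Python），按已执行日志列表的存储顺序重做State=0的待恢复日志；其中重做日志以字典redo_log表示、函数命名等均为本文为举例所作的假设：
# 示意性草图：系统重启后按已执行日志列表的顺序重做尚未持久化的日志
def recover(applied_list, redo_log):
    """applied_list为[(log_index, state), ...]，state=0表示业务数据仅写入易失性介质；
    redo_log为log_index到业务数据的映射（重做日志），返回恢复后的数据缓冲池内容。"""
    buffer_pool = {}
    for log_index, state in applied_list:
        if state == 0:                        # 待恢复日志：崩溃时其业务数据已从易失性介质丢失
            buffer_pool[log_index] = redo_log[log_index]
    return buffer_pool

applied_list = [(4, 1), (6, 0), (5, 0)]       # 索引6先于索引5执行（乱序），且二者尚未刷盘
redo_log = {4: "data-4", 5: "data-5", 6: "data-6"}
print(recover(applied_list, redo_log))        # 按列表顺序依次恢复索引6、5所对应的业务数据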
在一些实施例中,上述已执行日志列表AppliedList[]仅仅保存各个已执行日志的索引Index和状态参数State,而不保存具体的日志记录,通过索引Index能够在日志列表中查询到对应的日志记录,这样能够大大节约该已执行日志列表AppliedList[]的存储开销。
上述步骤303示出了当本日志与前目标数量个日志无冲突时，如何执行本日志并将对应业务数据进行持久化的操作，还提供了系统的崩溃恢复机制，而在另一些实施例中，当本日志与前目标数量个日志发生冲突时，第一节点设备执行如下操作：在该冲突验证结果为发生冲突的情况下，获取与该日志发生冲突的日志数量；基于该日志数量，确定对该日志的扫描频率；基于该扫描频率扫描该日志并刷新该冲突验证结果，直到该冲突验证结果为无冲突，执行该日志。
在一些实施例中,第一节点设备在该日志数量大于冲突阈值的情况下,将该扫描频率确定为第一频率;或者,在该日志数量小于或等于该冲突阈值的情况下,将该扫描频率确定为第二频率,该第二频率大于该第一频率。
在上述过程中,第一节点设备基于与本日志发生冲突的日志数量,来确定对本日志的扫描频率,如果发生冲突的日志数量较多,那么等待冲突解决需要较长的时间,因此以较低的第一频率进行扫描,如果发生冲突的日志数量较少,那么等待冲突解决需要较短的时间,因此以较高的第二频率进行扫描。
在一些实施例中,第一节点设备将该日志添加至与该扫描频率对应的日志列表,基于该扫描频率扫描该日志列表中存储的日志。也即是说,第一节点设备为不同的扫描频率维护不同的日志列表,通过按照对应扫描频率来循环扫描对应日志列表,能够基于发生冲突的日志数量的情况,灵活选择不同的扫描频率,从而节约系统的计算资源。
在一些实施例中,将上述日志列表分为两类,第一日志列表SlowScanList[]和第二日志列表FastScanList[],第一日志列表SlowScanList[]的扫描频率为第一频率,第二日志列表FastScanList[]的扫描频率为第二频率,由于第二频率大于第一频率,也即,第一节点设备对第二日志列表FastScanList[]的扫描频率大于对第一日志列表SlowScanList[]的扫描频率。
在一些实施例中,管理人员设置冲突阈值ScanListThreshold后,获取在日志执行活跃窗口内与本日志发生冲突的日志数量,当该日志数量大于冲突阈值ScanListThreshold时,将本日志添加至第一日志列表SlowScanList[],否则,当该日志数量小于或等于冲突阈值ScanListThreshold时,将本日志添加至第二日志列表FastScanList[]。
在上述基础上,处于第二日志列表FastScanList[]中的日志由于发生冲突的日志数量较少,因此存在更大概率优先解决冲突,此时频繁扫描第二日志列表FastScanList[],能够及时找到并执行解决了冲突的日志,这样避免由于本日志的阻塞而长时间耽误了后续与本日志冲突的其他日志的执行流程。
在上述基础上,处于第一日志列表SlowScanList[]中的日志由于发生冲突的日志数量较多,因此所有的冲突大概率需要等待较长时间才能够全部解决,此时低频率地扫描第一日志列表SlowScanList[],能够在极大节约计算资源的前提下保证第一日志列表SlowScanList[]中的日志能够顺利提交。
在一些实施例中,该第一日志列表SlowScanList[]和该第二日志列表FastScanList[]中,不但存储本日志的索引,还存储与本日志发生冲突的各个冲突日志的索引,这样通过循环扫描第一日志列表SlowScanList[]和第二日志列表FastScanList[],能够及时发现无冲突的日志,并将这一无冲突的日志从第一日志列表SlowScanList[]或第二日志列表FastScanList[]中移除,并添加到待执行日志列表ApplyingList[]中。
在一些实施例中,对于冲突验证结果为发生冲突的日志,第一节点设备将本日志添加到同一个日志列表ScanList[]中等待冲突解决后再执行,从而保证本日志执行时不会引发数据不一致的问题,且无需维护两个不同的日志列表,能够简化日志执行的处理流程。
在一些实施例中,第一节点设备按照发生冲突的日志数量所处的取值区间,将本日志添加至与取值区间对应的日志列表中,其中,取值区间的数量大于或等于2,每个取值区间对应于一个日志列表,且每个日志列表对应于一个扫描频率,不同的取值区间所对应的日志列表具有不同的扫描频率,且随着取值区间的增大,对应日志列表的扫描频率也随之降低,这样能够细化出更多层级的扫描频率,构建更加完善的扫描流程。
在本申请实施例中,并发Raft算法中任一日志能够被执行的条件如下:(a)本日志位于日志执行活跃窗口内,保证了本日志的前N个日志均已执行;(b)通过检查本日志的LHA, 确认本日志与前N个日志的存储范围无冲突。
图6是本申请实施例提供的一种日志乱序执行调度的原理性示意图,如600所示,假设某一时刻第一节点设备的日志执行活跃窗口的索引范围为[4,8],在日志执行活跃窗口内,与索引分别为5、6、7的日志发生冲突(即两日志的存储范围的交集不为空集)的日志数量分别为0、2、3,假设冲突阈值ScanListThreshold为2,则对于Index=5的日志,由于发生冲突的日志数量为0,说明无冲突,直接添加至待执行日志列表ApplyingList[],对于Index=6的日志,由于发生冲突的日志数量2恰好等于冲突阈值ScanListThreshold,说明发生冲突的日志较少,则添加到第二日志列表FastScanList[]中,对于Index=7的日志,由于发生冲突的日志数量3大于冲突阈值ScanListThreshold,说明发生冲突的日志较多,则添加到第一日志列表SlowScanList[]中。需要说明的是,在循环扫描第一日志列表SlowScanList[]或第二日志列表FastScanList[]时,一旦发现任一日志满足存储语义的正确性条件,则将上述日志从第一日志列表SlowScanList[]或第二日志列表FastScanList[]中移除,并将上述日志添加至待执行日志列表ApplyingList[]中,等待日志执行线程进行相应地处理。
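结合图6的例子，下面给出按发生冲突的日志数量把日志分流到不同列表的示意性草图（Python），阈值含义与列表名称沿用正文，函数命名与数据表示为本文假设：
# 示意性草图：依据冲突日志数量，将日志分流到待执行列表/快扫描列表/慢扫描列表
def dispatch(conflict_counts, scan_list_threshold):
    applying_list, fast_scan_list, slow_scan_list = [], [], []
    for log_index, conflicts in conflict_counts.items():
        if conflicts == 0:
            applying_list.append(log_index)       # 无冲突，可直接乱序执行
        elif conflicts <= scan_list_threshold:
            fast_scan_list.append(log_index)      # 冲突较少，加入高频扫描列表
        else:
            slow_scan_list.append(log_index)      # 冲突较多，加入低频扫描列表
    return applying_list, fast_scan_list, slow_scan_list

# 图6的例子：窗口[4,8]内索引5、6、7的日志的冲突日志数量分别为0、2、3，冲突阈值为2
print(dispatch({5: 0, 6: 2, 7: 3}, scan_list_threshold=2))
# 输出([5], [6], [7])：索引5进入ApplyingList，6进入FastScanList，7进入SlowScanList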
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,通过设置日志执行活跃窗口并保证日志执行活跃窗口之前的日志均已执行,仅需要验证日志执行活跃窗口内的任一日志是否与日志执行活跃窗口内的、该日志之前的、尚未执行的日志发生存储范围冲突,即可得知该日志是否会在整个分布式存储系统中引发数据不一致问题,对于无冲突的该日志,支持乱序执行该日志,而无需阻塞该日志的执行进程,并无需等待日志执行活跃窗口内的、该日志之前的、尚未执行的日志执行完毕,能够大大提升分布式存储系统的吞吐量,且能够适用于高并发场景。
以下,示出了并发Raft算法中,分布式存储系统内所有节点设备上存储的与执行流程相关的状态变量,以及仅在第一节点设备上存储的状态变量,下面进行说明:
/**
所有节点设备上存储的与apply相关的状态变量
*/
//日志apply活跃窗口尺寸
int toApplyWindowSize
//apply窗口首个日志的index
int toApplyWindowBeginindex
//记录日志apply活跃窗口内的日志是否已经applied
bool isApplied[toApplyWindowSize]
//其LHA中和其发生冲突的日志数量低于用户设定的冲突阈值时,加入此Pending List
//发生冲突的日志数量少的日志更大概率地先满足apply条件,因此频繁扫描此列表以找到能够apply的日志
PendingList fastScanPList
//其LHA中和其发生冲突的日志数量高于用户设定的冲突阈值时,加入此Pending List
//发生冲突的日志数量多的日志先满足apply条件的概率很小,因此低频率地扫描此列表以找到能够apply的日志
PendingList slowScanPList
//LHA冲突日志数量阈值(即冲突阈值)
int scanListThreshold
//记录了能够apply的日志的index
int applyingList[]
//记录了已经apply的日志的index
Log appliedList[]
/**
只有第一节点设备上存储的状态变量
*/
//id为i的第二节点设备上index为beginindex[i]之前的日志均和第一节点设备上index相同的日志匹配
int beginindex[]
//记录每个节点中的日志是否和第一节点设备上index相同的日志匹配
//matchindex[i][j]记录了id为i的第二节点设备上index为beginindex[i]+j的日志是否和第一节点设备上index相同的日志匹配
bool matchindex[][]
在上述实施例中，详细介绍了并发Raft算法的日志乱序执行机制，传统Raft算法由于所有日志按照严格的顺序执行，所有的副本也能够保证数据一致性，但并发Raft算法由于乱序确认和乱序提交，各个日志的日志副本会在不同的第二节点设备的日志列表中出现缺失，通过仅对满足存储语义正确性条件的日志容许乱序执行，能够保证整个分布式存储系统在并发乱序执行时的数据一致性。
任一日志能够乱序执行的前提是需要满足存储语义的正确性条件(本日志与前N个尚未执行的日志无冲突),并引入LHA数据结构来快速判断本日志与前N个日志是否发生冲突,此外,引入日志执行活跃窗口来保存各个日志的执行状态并管理执行流程,日志乱序执行机制能够解决彼此可能存在冲突的一系列日志的执行顺序以及执行时机,通过设置对应于不同扫描频率的不同日志列表可实现上述管控。
而在本申请实施例中,将介绍并发Raft算法的日志乱序提交机制,同理,设置日志提交(Commit)活跃窗口,对于第一节点设备,只有日志提交活跃窗口内的日志才能够被发送到第二节点设备,对于第二节点设备,只有日志提交活跃窗口内的日志才能够从第一节点设备接收,此外,还需日志提交活跃窗口之前的所有日志均已提交。
在一些实施例中,设日志提交活跃窗口的窗口尺寸为toCommitWindowSize,则日志提交活跃窗口的相关数据结构包括如下内容。
(1)CommitIndex,代表本节点设备上连续的已提交的日志中的最大索引,CommitIndex用于表示本节点设备上索引为CommitIndex的日志以及之前的所有日志均已提交。
(2)isCommited[],用于记录日志提交活跃窗口内的各个日志的提交情况信息,也即用于记录日志提交活跃窗口内的各个日志是否已经提交,其中,Index位于区间[CommitIndex+1,CommitIndex+toCommitWindowSize+1]内的各个日志均属于日志提交活跃窗口。
(3)令x=CommitIndex+toCommitWindowSize+1,对于第一节点设备,只有Index≤x的日志能够发送给第二节点设备并提交,对于第二节点设备,只有Index≤x的日志能够被接收并返回ACK消息。
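下面给出上述收发限制的一个极简示意性草图（Python），仅体现“只有索引不超过x=CommitIndex+toCommitWindowSize+1的日志才允许发送或接收”这一判断，函数命名与示例数值为本文假设：
# 示意性草图：日志提交活跃窗口对日志收发的流控判断
def max_transfer_index(commit_index, to_commit_window_size):
    """返回允许发送（对第一节点设备）或接收（对第二节点设备）的日志索引上界x。"""
    return commit_index + to_commit_window_size + 1

def can_transfer(log_index, commit_index, to_commit_window_size):
    """索引不超过上界x的日志才能被发送，或被接收并返回ACK消息。"""
    return log_index <= max_transfer_index(commit_index, to_commit_window_size)

print(can_transfer(9, commit_index=4, to_commit_window_size=4))   # True：9不超过4+4+1
print(can_transfer(10, commit_index=4, to_commit_window_size=4))  # False：超出窗口，暂不发送/接收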
需要说明的是,第一节点设备和任一第二节点设备的日志提交活跃窗口相同或者不同,且不同的第二节点设备的日志提交活跃窗口相同或者不同,本申请实施例对此不做具体限定。
图7是本申请实施例提供的一种日志提交活跃窗口的数据结构示意图，如图7所示，第一节点设备701的日志提交活跃窗口为[5,9]，说明Index<5的日志均已成功提交，5≤Index≤9的日志能够发送给第二节点设备，而Index>9的日志不能发送。第二节点设备702的日志提交活跃窗口为[3,7]，说明Index<3的日志均已成功提交，3≤Index≤7的日志能够被接收并返回ACK消息，而Index>7的日志不能接收（也即拒绝接收），在日志提交活跃窗口[3,7]中，Index为3、5、6、7的日志均是尚未接收的。同理，第二节点设备703的日志提交活跃窗口为[5,9]，说明Index<5的日志均已成功提交，5≤Index≤9的日志能够被接收并返回ACK消息，而Index>9的日志不能接收（也即拒绝接收），在日志提交活跃窗口[5,9]中，Index为5、7、9的日志均是尚未接收的。
一方面,通过设置日志提交活跃窗口能够进行日志流控,方便管理第一节点设备向第二节点设备发送日志的速率,保证接收日志的第二节点设备不至于过载,同时,第二节点设备上位于日志提交活跃窗口内的日志支持并发乱序地接收(这里的接收是指返回ACK消息),从而提升日志传送效率。
另一方面,通过设置日志提交活跃窗口能够节约并发Raft算法的状态变量的存储开销(比如记录日志是否已提交、执行等造成的开销),由于保证了日志提交活跃窗口之前的所有日志均已提交,则仅仅需要记录日志提交活跃窗口内各个日志的提交情况信息即可,从而大大节约了存储开销。
在一些实施例中,在分布式存储集群选举出第一节点设备(Leader,领导节点)之后,第一节点设备即可接收终端(Client)的业务请求,第一节点设备收到终端的业务请求之后,执行该业务请求所请求的数据库事务,得到该业务请求所请求的业务数据,并将该业务数据所对应的日志追加到日志列表中。然后,将该日志的任期参数Term设置为第一节点设备自身的Term,假设日志列表中原本最后一个日志的编号为i,则新追加的日志(简称为新日志)编号置为i+1,令第i+1个日志的索引log[i+1].LogIndex=log[i].LogIndex+1。然后,复制(Copy,拷贝)一份第i个日志的LHA,移除LHA的第一个元素,并将新的业务请求(也称新命令)的存储范围作为LHA的最后一个元素追加到LHA中,得到第i+1个日志的LHA,其中,包含了第i+1个日志及其前N个日志的存储范围。
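下面给出第一节点设备追加新日志并构造其LHA的示意性草图（Python），其中日志用字典表示、字段命名为本文假设，流程与上文一致：拷贝上一条日志的LHA、移除其首个元素、再把新命令的存储范围追加到末尾：
# 示意性草图：Leader收到新命令后追加日志并构造新日志的LHA
def append_new_log(log_list, current_term, new_range):
    """log_list中每条日志为{'term', 'log_index', 'lha'}，lha的最后一个元素为该日志自身的存储范围。"""
    last = log_list[-1]
    new_lha = last['lha'][1:] + [new_range]   # 去掉上一条日志LHA的首个元素，追加新命令的存储范围
    new_log = {'term': current_term, 'log_index': last['log_index'] + 1, 'lha': new_lha}
    log_list.append(new_log)
    return new_log

log_list = [{'term': 2, 'log_index': 9,
             'lha': [(0, 1), (8, 9), (4, 5), (1, 2), (6, 7), (4, 7)]}]  # N=5
print(append_new_log(log_list, current_term=3, new_range=(2, 3)))
# 新日志的log_index为10，其LHA为[(8,9),(4,5),(1,2),(6,7),(4,7),(2,3)]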
第一节点设备将新日志写入到日志列表后,即可通过AppendEntriesRPC消息并发地将新日志发送给各个第二节点设备,其中RPC的英文全称为Remote Procedure Call Protocol,中文全称为远程过程调用协议。各个第二节点设备在接收到该AppendEntriesRPC消息后,立即将新日志添加到自身日志列表的对应位置,无需等待前面的日志ACK就能够立即向第一节点设备返回新日志的ACK消息。可选地,第一节点设备向各个第二节点设备批量地发送日志,能够节约系统的通信开销。
在一些实施例中,限定第一节点设备的日志列表中可发送的日志索引范围,这样能够方便地管理第一节点设备向各个第二节点设备发送日志的速率,避免各个第二节点设备发生过载,可选地,可发送的日志的最大Index为commitIndex+toCommitWindowSize。
图8是本申请实施例提供的一种日志乱序提交机制的流程图,请参考图8,下面介绍基于日志提交活跃窗口的日志乱序提交机制,这一日志乱序提交机制适用于分布式存储系统,分布式存储系统中包括第一节点设备(Leader,领导节点)和多个第二节点设备(Follower,跟随节点),下面进行详述。
801、第一节点设备循环扫描日志匹配索引表,该日志匹配索引表用于记录多个待提交日志在该分布式存储系统中存储的副本数量。
其中,日志匹配索引表matchIndex[][]是用于进行日志接收情况统计的二维顺序表结构。
第一节点设备除了需要明确可发送的日志索引范围之外，还需要知道各个第二节点设备已接收哪些日志和未接收哪些日志，才能够方便地对整个分布式存储系统的日志复制工作进行宏观调度。
可选地，第一节点设备通过日志匹配索引表matchIndex[][]和beginIndex[]来记录各个第二节点设备已接收哪些日志项。设任一第二节点设备的节点标识为i，则beginIndex[i]表示第i个第二节点设备的日志列表中日志的索引Index小于beginIndex[i]的各个日志均已被第i个第二节点设备接收，matchIndex[i][j]表示第一节点设备的日志列表中日志的索引Index为beginIndex[i]+j的日志项是否已被第i个第二节点设备接收，其中，i≥1，j≥1。
在上述过程中,虽然matchIndex[][]为二维顺序表结构,但二维顺序表中的每个元素都是仅占一个比特位的布尔型数据,即对第i个第二节点设备,仅需要保存Index大于或等于beginIndex[i]的日志是否被接收,使得matchIndex[][]的存储开销很小。
在一些实施例中,第一节点设备循环扫描日志匹配索引表matchIndex[][],并基于日志匹配索引表matchIndex[i][j]来判断第i个第二节点设备的Index为beginIndex[i]+j的日志是否被接收,若未接收(布尔型数据取值为False),则第一节点设备调用AppendEntriesRPC消息将Index为beginIndex[i]+j的日志发送给第i个第二节点设备。
在一些实施例中,对于第i个第二节点设备,接收该AppendEntriesRPC消息,并解析得到Index=beginIndex[i]+j的日志,并读取该第一节点设备的Term和上述Index=beginIndex[i]+j的日志的Term,在该第一节点设备的Term大于或等于第i个第二节点设备自身的Term,且上述Index=beginIndex[i]+j的日志的Term与第i个第二节点设备上同一索引的对应日志的Term不相等(或者同一索引的对应日志在第i个第二节点设备上缺失)的情况下,第i个第二节点设备接收上述Index=beginIndex[i]+j的日志,将该Index=beginIndex[i]+j的日志写入自身的日志列表中,并向第一节点设备返回ACK消息。
可选地,第i个第二节点设备在接收日志时分为两种情况,第一种情况是接收到的Index=beginIndex[i]+j的日志的Term与第i个第二节点设备上同一索引的对应日志的Term不相等,说明第i个第二节点设备原本存储的日志是错误的(或者已失效的),这时只需要将错误的日志覆盖掉,存储最新的日志即可,能够保证Index=beginIndex[i]+j的日志与第一节点设备是一致的;第二种情况是同一索引的对应日志在第i个第二节点设备上缺失,这时直接在日志列表中补全缺失的Index=beginIndex[i]+j的日志即可。
在一些实施例中,对于第i个第二节点设备,在读取该第一节点设备的Term和上述Index=beginIndex[i]+j的日志的Term之后,在该第一节点设备的Term小于第i个第二节点设备自身的Term的情况下,拒绝上述Index=beginIndex[i]+j的日志,并向第一节点设备返回失败消息(例如,返回一串错误码)。
在上述过程中,对于第i个第二节点设备,响应于第一节点设备所发送的AppendEntriesRPC消息,假设第一节点设备发送的Index=beginIndex[i]+j的日志的Term为k,若第一节点设备的Term大于或等于第i个第二节点设备自身的Term,且第i个第二节点设备自身的日志列表中Index=beginIndex[i]+j的日志的Term不等于k(或者Index=beginIndex[i]+j的日志缺失),则将接收到的Index=beginIndex[i]+j的日志插入到自身的日志列表中Index为beginIndex[i]+j的位置,并返回ACK消息;否则,若第一节点设备的Term小于第i个第二节点设备自身的Term,则拒绝Index=beginIndex[i]+j的日志并返回失败消息。
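下面给出第i个第二节点设备处理单条日志的AppendEntriesRPC消息的示意性草图（Python），仅保留Term比较与按索引覆盖/补全日志两个要点，消息与日志的具体表示方式为本文假设：
# 示意性草图：Follower对单条日志的乱序接收逻辑（无需等待之前的日志）
def handle_append_entry(self_term, log_list, leader_term, entry):
    """log_list为index到日志{'term', ...}的映射；entry含'index'与'term'。
    返回(currentTerm, success)：success为True表示接收并返回ACK，为False表示拒绝。"""
    if leader_term < self_term:
        return self_term, False                  # 第一节点设备的Term已过期，拒绝该日志
    local = log_list.get(entry['index'])
    if local is None or local['term'] != entry['term']:
        log_list[entry['index']] = entry         # 缺失则补全，不一致则覆盖
    return self_term, True                       # 立即返回ACK，无需等待之前的日志

log_list = {3: {'index': 3, 'term': 1}}
print(handle_append_entry(2, log_list, leader_term=2,
                          entry={'index': 5, 'term': 2, 'cmd': 'write'}))  # 输出(2, True)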
在一些实施例中,第一节点设备在接收到第i个第二节点设备相对于Index=beginIndex[i]的日志返回的ACK消息之后,代表第i个第二节点设备已经接收到了Index=beginIndex[i]的日志,则第一节点设备移除matchIndex[i][]的首个元素,并将beginIndex[i]加一(执行自增一运算)。
802、第一节点设备响应于该日志匹配索引表中的任一待提交日志的副本数量符合目标条件,提交该任一待提交日志。
换言之，第一节点设备在该日志匹配索引表中的任一待提交日志的副本数量符合目标条件的情况下，提交该待提交日志。
在一些实施例中，该目标条件为该待提交日志的副本数量在分布式存储系统的节点数量的占比超过比例阈值，例如，该比例阈值为1/2，也即该目标条件是指该待提交日志在超过半数的第二节点设备中存在日志副本。当然，该比例阈值是任一大于或等于0且小于或等于1的数值，本申请实施例对此不作限定，例如，该比例阈值为2/3。
在一些实施例中,该目标条件为该待提交日志的副本数量大于副本阈值,该副本阈值是任一大于或等于0且小于或等于分布式存储系统的节点数量的整数,例如,在包含99个节点设备的分布式存储系统中,将该副本阈值设置为50,本申请实施例对此不进行具体限定。
在一个示例性场景中,以目标条件为该待提交日志的副本数量在分布式存储系统的节点数量的占比超过比例阈值为例进行说明,第一节点设备通过循环扫描matchIndex[][],即可得知任一待提交日志是否在半数以上的第二节点设备中存储有日志副本,若半数以上的第二节点设备存储有日志副本,则第一节点设备将选择提交该待提交日志。
在一些实施例中,第一节点设备循环扫描日志匹配索引表matchIndex[][],来判断任一待提交日志是否在超过半数的第二节点设备中存在日志副本,假设该待提交日志的Index=i,如果在超过半数的第二节点设备中存在Index=i的日志副本,则第一节点设备修改isCommited[i-commitIndex]为True,用于标记该待提交日志已经提交。进一步地,如果isCommited[]的首个元素为True,则第一节点设备移除isCommited[]的首个元素,并将commitIndex加一(执行自增一运算)。
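下面用一个示意性草图（Python）串联步骤801与802中第一节点设备侧的两类簿记：依据matchIndex统计某一待提交日志的副本数量，达到多数派后在isCommitted中标记提交并推进commitIndex。其中节点总数、多数派的统计口径（是否计入第一节点设备自身）以及索引映射方式均为本文为举例所作的假设：
# 示意性草图：Leader侧的日志接收统计与乱序提交
class LeaderBookkeeping:
    def __init__(self, node_count, commit_index, window_size):
        self.node_count = node_count                 # 集群节点总数（含Leader）
        self.commit_index = commit_index             # 连续已提交日志的最大索引
        self.is_committed = [False] * window_size    # 提交活跃窗口内各日志是否已提交

    def replica_count(self, match_index, begin_index, log_index):
        """统计索引为log_index的日志已被多少个第二节点设备接收。"""
        total = 0
        for i, received in match_index.items():
            j = log_index - begin_index[i]
            if j < 0 or (j < len(received) and received[j]):
                total += 1
        return total

    def try_commit(self, match_index, begin_index, log_index):
        """副本数量（含Leader自身）超过半数时标记提交，并推进连续提交点commitIndex。"""
        if self.replica_count(match_index, begin_index, log_index) + 1 > self.node_count // 2:
            self.is_committed[log_index - self.commit_index - 1] = True
        while self.is_committed[0]:
            self.is_committed.pop(0)
            self.is_committed.append(False)
            self.commit_index += 1

leader = LeaderBookkeeping(node_count=3, commit_index=4, window_size=4)
match_index = {1: [True, False], 2: [True, True]}    # 两个第二节点设备的接收情况
begin_index = {1: 5, 2: 5}
leader.try_commit(match_index, begin_index, log_index=5)
print(leader.commit_index)                           # 输出5：索引5的日志提交，连续提交点前移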
803、第一节点设备向多个第二节点设备发送该待提交日志的提交指令。
在一些实施例中,第一节点设备通过AppendEntriesRPC消息告知各个第二节点设备自己已经提交的该待提交日志。
804、多个第二节点设备响应于该提交指令,提交该待提交日志。
在一些实施例中,对于任一第二节点设备,在接收到AppendEntriesRPC消息之后,如果该第二节点设备上已经存储了AppendEntriesRPC消息指示的该待提交日志且该待提交日志已经被第一节点设备提交,则第二节点设备也会提交该待提交日志。
在一些实施例中,该第二节点设备根据该AppendEntriesRPC消息,能够得知第一节点设备已经提交的该待提交日志的Term和Index,假设该待提交日志的Term=T、Index=i,则该第二节点设备检查日志列表,如果该日志列表中存在Term=T、Index=i的日志(说明本地存储的日志与第一节点设备提交的日志是同一个日志),则该第二节点设备也提交该待提交日志,也即修改自身isCommited[]中Index=i的元素为True。
在通常情况下,第一节点设备和各个第二节点设备的日志能够保持一致,理论上AppendEntriesRPC消息的一致性检查永远不会失败,但是第一节点设备如果发生崩溃事件,会导致第一节点设备和各个第二节点设备的日志不一致,也即,旧的第一节点设备可能没有来得及将日志列表中的所有日志复制到其他的各个第二节点设备上就发生了崩溃事件,并在崩溃后重启时系统已经选举出了下一任第一节点设备(Leader),这种不一致情况可能会导致一系列的第一节点设备和第二节点设备均崩溃,因此需要解决上述日志不一致问题。
图9是本申请实施例提供的一种日志不一致情况的原理性示意图,如900所示,第二节点设备中有可能缺少第一节点设备中已存储的日志,有可能具有了第一节点设备上不存在的额外的日志,或者,第二节点设备中既缺少第一节点设备中已存储的日志同时又具有了第一节点设备上不存在的额外的日志。而日志列表中缺失或者无关的日志可能跨越了多个Term。在传统Raft算法中,第一节点设备通过强制第二节点设备复制自身的日志来解决不一致问题,这意味着第二节点设备的日志列表中的冲突日志(不一致的日志)将被第一节点设备的日志列表中的对应日志所覆盖。
在一些实施例中,要使得第二节点设备的日志与第一节点设备的日志保持一致,那么第一节点设备和第二节点设备的日志列表中Index和Term均相同的所有日志都需要保持一致,第二节点设备的日志列表中与第一节点设备的日志列表中凡是不一致的日志都会被删除,并且由第一节点设备重新发送对应的日志覆盖到第二节点设备的日志列表中Index相同的日志位置,上述操作都是响应于AppendEntriesRPC消息执行一致性检查时发生的。
在一些实施例中，每个第一节点设备在首次上台（即投票当选Leader）时，第一节点设备会向各个第二节点设备询问自身的commitIndex和日志提交活跃窗口中的所有待提交日志是否已被各个第二节点设备提交，这一过程在选举合并（Merge）恢复阶段完成，选举合并恢复阶段将在下一实施例中介绍，这里不做赘述。
此外,第一节点设备维持数据结构matchIndex[][]和beginIndex[],通过上述数据结构能够记录各个日志与第二节点设备的日志列表中对应索引的日志是否一致,从而来保证第一节点设备和第二节点设备中所有已提交的日志均已一致。
在一个示例性场景中,仍然以第i个第二节点设备为例,假设第i个第二节点设备的commitIndex为cIndex,第i个第二节点设备的日志提交活跃窗口内的各个待提交日志的提交情况信息保存在isCommitted[]中,那么,可知第i个第二节点设备的beginIndex[i]=cIndex+1。此外,matchIndex[i][j]代表第一节点设备和第i个第二节点设备的日志列表中Index=beginIndex[i]+j的日志是否一致。此外,isCommitted[k]表示第i个第二节点设备的日志列表中Index为cIndex+k+1的日志是否提交,则令matchIndex[i][j]=isCommitted[beginIndex[i]+j-cIndex-1]。
进一步地,第一节点设备将根据matchIndex[][]和beginIndex[],得知第一节点设备和每个第二节点设备的日志列表中的日志是否一致,进而给各个第二节点设备发送日志列表中不一致的日志,此过程通过AppendEntriesRPC消息完成。如果第二节点设备的日志与第一节点设备的日志不一致,那么第一节点设备的日志会覆盖第二节点设备中与其不一致的日志。
在上述实施例中,为了保证每个日志项中LHA的完整性,第一节点设备的日志列表中不能存在缺失,而由于去中心化的设计,第一节点设备具有时限性,也即,每一任第一节点设备的任期结束后,会由整个分布式存储系统投票选举出下一任第一节点设备,那么下一任第一节点设备可能会由于日志乱序复制机制,导致下一任第一节点设备的日志列表中存在缺失,此时,通过本申请实施例提供的选举合并恢复机制,就能够补全自身日志提交活跃窗口内缺失的日志,并在补全后向集群提供服务。
以下,将介绍第一节点设备的选举机制。
第一节点设备的选举机制也称为Leader选举机制，用于保证顺序一致性，只有第一节点设备能够响应终端的业务请求，并将该业务请求发送至各个第二节点设备。分布式存储系统中包含第一节点设备和多个第二节点设备，基于多数派原则，所有第二节点设备的节点数量为一个奇数，例如，5是一个典型可选的节点数量，它允许系统容忍两个节点设备发生故障。
在任何时候,每个节点设备处于以下三种状态之一:领导节点Leader,跟随节点Follower或候选节点Candidate。各种状态的情况如下所示。
1、领导节点Leader(即第一节点设备):负责处理终端的所有业务请求。
2、跟随节点Follower(即第二节点设备):不会发送任何请求,只是简单的响应来自领导节点Leader或者候选节点Candidate的请求。
3、候选节点Candidate(称为候选节点设备):候选者,在选举时期,候选节点Candidate之间进行竞选,选举出新的领导节点Leader。
图10是本申请实施例提供的一种领导节点的选举机制的原理性示意图,如1000所示,将时间划分为具有任意时间长度的任期(Term),Term用连续的整数来表示,每个Term从选举阶段开始,系统内的一个或多个候选节点Candidate试图成为领导节点Leader。在某些情况下,选举将导致分裂投票,即没有任何一个候选节点Candidate获得半数以上的选票。在这种情况下,该Term将以没有Leader的情况结束,同时开始下一个新Term的选举阶段。
图11是本申请实施例提供的一种任期Term的原理性示意图,如1100所示,Term3在选举时期产生了分裂投票,因此Term3不存在日常时期,直接进入到Term4的选举时期。
可选地，跟随节点Follower只响应来自其他节点（包括领导节点Leader或者候选节点Candidate）的请求。如果跟随节点Follower没有收到任何通信，则跟随节点Follower就会成为候选节点Candidate并发起选举。从集群中的大多数节点设备获得选票的候选节点Candidate将成为新的领导节点Leader。单个领导节点Leader将管理集群，直到任期结束。
在一些实施例中,每个节点设备都存储一个当前的Term编号,该Term编号随着时间的推移单调增加。当不同节点设备之间发生通信时,会同时交换各自的当前Term;如果一个节点设备的当前Term小于另一个节点设备的当前Term,则该节点设备将自身的当前Term更新为两者间的最大值。
如果候选节点Candidate或领导节点Leader在某次通信交换当前Term时,发现自身的当前Term已过期,则会立即变成跟随节点Follower。如果任一节点设备收到带有过期Term的请求,该节点设备将拒绝该请求。
分布式存储系统中节点设备之间使用远程过程调用（RPC）协议进行通信，如果某个节点设备未及时收到其他节点设备的响应，则会重新发起相应的RPC消息，并且这些RPC消息是并行发出的，以获得系统的最佳性能。
在一些实施例中,当前任期的第一节点设备由该分布式存储系统中的多个第二节点设备在上一任期结束后投票选举产生,该第一节点设备中连续的已提交的日志中的最大索引commitIndex大于或等于该多个第二节点设备中连续的已提交的日志中的最大索引commitIndex。
也即是说,通过对第一节点设备的选举条件增加限制,强制只有拥有所有节点设备连续的、已提交的日志的候选节点设备才能够当选为第一节点设备,这样能够大大节约选举合并恢复阶段的耗时。需要说明的是,选举条件允许日志提交活跃窗口内存在缺失日志,这些缺失的日志将在选举合并恢复阶段中得到恢复。
在一些实施例中,通过RequestVoteRPC消息限制上述对第一节点设备的选举条件,如果候选节点设备的commitIndex(代表连续的、已提交的日志中Index最大的日志)小于当前的第二节点设备的commitIndex,则当前的第二节点设备拒绝为该候选节点设备投票。
基于上述选举条件的限制,保证了下一任第一节点设备拥有所有节点设备连续的、已提交的日志,但无法保证下一任第一节点设备日志提交活跃窗口内不存在缺失。针对第一节点设备仅仅在自身日志提交活跃窗口内缺失日志的情况,缺失的日志主要分为如下两类:(A)该缺失的日志已经提交;(B)该缺失的日志尚未提交,但可能在其他的第二节点设备上存在。
上述(A)类缺失的日志已经提交,说明缺失的日志在多数第二节点设备上已达成一致,因此第一节点设备能够安全地向其他拥有该缺失的日志的第二节点设备请求该缺失的日志,然后接收并存储(A)类缺失的日志。
上述(B)类缺失的日志尚未提交,因此可能是过期的、不安全的日志,第一节点设备不能接收(B)类缺失的日志,此时第一节点设备选择向终端重新请求(B)类缺失的日志,以补全(B)类缺失的日志。
可选地,在选举合并恢复阶段,响应于日志提交活跃窗口中缺失至少一个日志,第一节点设备获取缺失的该至少一个日志的至少一个索引,该日志提交活跃窗口包括尚未提交的多个日志,且该日志提交活跃窗口之前的日志均已提交;基于该至少一个索引,向该多个第二节点设备请求该至少一个日志。
换言之,在日志提交活跃窗口中缺失至少一个日志的情况下,第一节点设备获取每个缺失的日志的索引,接着,基于获取到的每个索引,向各个第二节点设备请求每个索引所指示的缺失的日志。
可选地,第一节点设备接收该多个第二节点设备所返回的至少一个目标日志以及该至少一个目标日志的提交情况信息;将提交情况信息为已提交的目标日志补全在该日志提交活跃窗口中;向终端请求提交情况信息为未提交的目标日志。
在上述过程中,第一节点设备向各个第二节点设备发送自身的commitIndex和isCommited[],并向各个第二节点设备请求自身日志提交活跃窗口内缺失的日志,第一节点设备根据收到的已经提交的日志来补全自身日志提交活跃窗口内的缺失的日志,同时第一节点设备会请求得到各个第二节点设备发送的日志的提交情况信息,并初始化数据结构 matchIndex[][]和nextIndex[],上述过程由RequestMergeRPC函数完成。在上述基础上,如果第一节点设备仍然尚未补全自身日志提交活跃窗口内缺失的日志,则第一节点设备会向终端请求缺失的日志。
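下面给出选举合并恢复阶段补全日志的一个示意性草图（Python）：对提交活跃窗口内缺失的日志，若某个第二节点设备返回的对应日志已提交则直接补全（对应上文(A)类），否则记入需向终端重新请求的列表（对应上文(B)类）。数据表示与函数命名均为本文假设：
# 示意性草图：新Leader在选举合并恢复阶段补全自身提交活跃窗口内缺失的日志
def merge_recover(local_logs, window_indexes, follower_replies):
    """local_logs为index到日志的映射；window_indexes为提交活跃窗口内的索引；
    follower_replies为各第二节点设备返回的{index: (日志, 是否已提交)}，
    返回需要向终端重新请求的日志索引列表。"""
    need_from_client = []
    for index in window_indexes:
        if index in local_logs:
            continue                                  # 本地不缺失，无需处理
        committed_copy = None
        for reply in follower_replies:
            if index in reply and reply[index][1]:    # (A)类：缺失的日志已经提交
                committed_copy = reply[index][0]
                break
        if committed_copy is not None:
            local_logs[index] = committed_copy        # 能够安全地接收并补全
        else:
            need_from_client.append(index)            # (B)类：向终端重新请求对应的业务请求
    return need_from_client

local_logs = {5: "log-5", 7: "log-7"}
replies = [{6: ("log-6", True)}, {8: ("log-8", False)}]
print(merge_recover(local_logs, [5, 6, 7, 8], replies))  # 输出[8]，索引6的日志已补全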
在一些实施例中,如果第二节点设备或者候选节点设备崩溃,则发送至崩溃节点设备的RequestVoteRPC消息、AppendEntriesRPC消息、RequestMergeRPC消息等RPC消息都将失败。
并发Raft算法通过无限期重试RPC消息来处理这些失败。如果崩溃的节点重新启动,则RPC消息将成功完成。如果节点设备已完成RPC消息所指示的命令,但在响应该RPC消息之前崩溃,那么节点设备将在重新启动后再次收到相同的RPC消息。
并发Raft算法中，RPC消息是幂等（Idempotent）的，即节点设备执行多次相同的消息，最终产生的结果是一样的，所以重复执行同一RPC消息不会造成任何不良影响。例如，如果第二节点设备收到包含自身日志列表中已存在的日志的AppendEntriesRPC请求，则会忽略新请求中的这些日志。
以下,示出了并发Raft算法的投票选举机制实现,以及分布式存储系统内所有节点设备上存储的与提交流程相关的状态变量,下面进行说明:
//节点已知最新的term(初次启动时初始化为0,之后单调递增)
int currentTerm
//最近term时投票给的candidate的ID
int voteFor
//任何时候,每个节点的状态为三个角色之一:leader、candidate、follower
State state
//日志列表:其中每一条日志包含用于在复制状态机上执行的命令、从leader收到该日志时leader的term
LogEntry log[]
/**
所有节点上存储的与commit相关的状态变量
*/
//已知该节点上索引为commitIndex的日志以及之前的日志都已commit
int commitIndex
//记录日志commit活跃窗口中的日志是否commit
//索引为[commitIndex+1,commitIndex+toCommitWindowSize+1]的日志范围为日志commit活跃窗口
bool isCommitted[toCommitWindowSize]
//日志commit活跃窗口的大小
int toCommitWindowSize
//记录日志commit活跃窗口内的日志是否接收
bool isReceived[toCommitWindowSize]
在一些实施例中，存储场景下要求系统安全性不能依赖于时间：系统不能仅仅因为某些事件比预期更快或更慢地发生而产生不正确的结果。但是，系统可用性（即系统及时响应终端的能力）必然取决于时间。例如，如果信息交换所需的时间超过服务器集群（即分布式存储系统）崩溃之前的典型时间，则候选节点设备将无法保持足够长的时间来赢得选举；没有选举出稳定的第一节点设备，并发Raft算法则无法继续进行。可选地，只要系统满足以下时序要求：broadcastTime<<electionTimeout<<MTBF，则分布式存储系统将能够选择并保持稳定的第一节点设备。
在上述时序要求的不等式中，broadcastTime是第一节点设备并行向分布式存储系统中每个第二节点设备发送RPC消息并收到回复的平均时长；electionTimeout是选举超时时长；MTBF是单个第二节点设备的平均故障间隔时长。
一方面,broadcastTime应比electionTimeout少一个数量级,以便第一节点设备能够可靠且及时地给各个第二节点设备发送心跳消息,避免该第二节点设备超过自身的electionTimeout并成为候选节点设备开始选举。鉴于用于electionTimeout的随机产生方法,不同第二节点设备的electionTimeout不同,也使分裂投票不太可能会发生。
另一方面，electionTimeout比MTBF少几个数量级，以便系统稳步运行，否则当第一节点设备崩溃时，第二节点设备无法及时结束electionTimeout以便开始新的选举。broadcastTime和MTBF是系统的属性，而electionTimeout则由技术人员进行人为设置和修改。
在一个示例性场景中,并发Raft算法的RPC消息通常要求接收的第二节点设备收到信息并将信息持久化存储,因此广播时长可能在0.5ms到20ms之间,具体时长取决于存储技术。可选地,electionTimeout设置在介于10毫秒到500毫秒之间。而典型的服务器节点的MTBF能够持续几个月甚至更长时间,因此很容易满足上述时序要求不等式。
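下面给出时序要求校验与随机选举超时的一个示意性草图（Python），其中以固定倍数factor近似“少一个数量级”的要求，具体数值仅为落在正文所述区间内的举例假设：
# 示意性草图：校验broadcastTime<<electionTimeout<<MTBF，并随机化各节点的选举超时
import random

def timing_ok(broadcast_ms, election_timeout_ms, mtbf_ms, factor=10):
    """以factor倍近似“远小于”的时序要求。"""
    return broadcast_ms * factor <= election_timeout_ms and \
           election_timeout_ms * factor <= mtbf_ms

def random_election_timeout(low_ms=150, high_ms=300):
    """不同第二节点设备取不同的随机超时，降低分裂投票发生的概率。"""
    return random.uniform(low_ms, high_ms)

print(timing_ok(broadcast_ms=10, election_timeout_ms=200,
                mtbf_ms=30 * 24 * 3600 * 1000))      # True：10ms << 200ms << 约一个月
print(random_election_timeout())                     # 每个节点得到一个不同的随机选举超时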
在一些实施例中，由于并发Raft算法中各个待提交日志的提交操作是乱序进行的，因此会导致旧日志还没有提交，依赖该旧日志的新日志已经提交，其中，上述新日志依赖旧日志是指新日志与旧日志的存储范围重叠。
从终端的角度看,终端接收到的ACK消息与终端发送业务请求的顺序不一致:即终端发送给第一节点设备多条业务请求之后,旧的业务请求的ACK消息尚未返回,而新的业务请求已经率先返回ACK消息。
图12是本申请实施例提供的一种终端与集群交互的原理性流程图,如1200所示,并发Raft算法需要保证如果存在一个已提交日志,而该已提交日志所依赖的日志尚未提交,则系统不会执行此类已提交日志。也即,并发Raft算法向终端保证如下事情:如果第i条业务请求与第j条业务请求的存储范围重叠,且i<j,在收到了第j条业务请求的ACK消息而没有收到第i条业务请求的ACK消息时,则说明第i和j条业务请求均尚未执行。
对于上述事情,划分为如下两种情况。
情况一、在第一节点设备稳定运行的情况下,上述第i条业务请求对应的日志会最终提交,并返回ACK消息给终端,以使终端得知第i条业务请求和第j条业务请求已经成功提交并且安全地执行。
情况二、如果旧的第一节点设备崩溃、分布式存储系统选举了新的第一节点设备，且新的第一节点设备上没有上述第i条业务请求对应的日志（即第一节点设备的日志提交活跃窗口中缺失相应的日志），新的第一节点设备会在选举合并恢复阶段，向终端要求重发第i条业务请求，以便新的第一节点设备恢复相应的日志；如果终端没有给新的第一节点设备重发第i条业务请求（即新的第一节点设备要求重发第i条业务请求连续3次超时），则第i条业务请求和在该第i条业务请求之后且依赖该第i条业务请求的所有业务请求均会失败。并且，在日志列表中将相关的日志项标记为空日志（也即将日志索引Index赋值为-2），且清空空日志对应的第二日志列表FastScanPList和第一日志列表SlowScanPList中的元素。
在上述各个实施例中，介绍了能够适用于分布式存储系统的并发Raft算法（属于一致性算法），上层应用作为客户端（即终端），向分布式存储系统发送业务请求，系统通过图12所示的方式与终端进行交互，并提供了系统崩溃恢复机制，保证系统崩溃后业务数据不会丢失，并能够将乱序的I/O请求有序化，以保证数据恢复过程的有序性。上述并发Raft算法能够适用于分布式存储系统下的高并发I/O场景，从而有效提升系统吞吐量，降低I/O请求延迟。
以下,示出了并发Raft算法中,上述各个实施例中所涉及的各项相关的数据结构,下面进行说明:
（原文此处以附图形式给出上述各项数据结构的汇总，图中具体内容未以文本形式收录。）
以下,示出了分布式存储系统中不同节点设备之间进行RPC通信的一种AppendEntries RPC函数:
//由第一节点设备调用去发起日志复制请求,也可用作心跳
参数:
//第一节点设备的term
int term
//第一节点设备的id
int leaderid
//第一节点设备上已经commit的日志信息
CommitedLog leaderCommits[]
//需要存储的日志;(当AppendEntries RPC函数用于发送心跳时,此变量为空)
LogEntry entries[]
返回:
//返回值term为当前第二节点设备的term
int currentTerm
//返回值success记录是否执行成功
bool success
接收者(第二节点设备)实现:
Ⅰ)如果term<currentTerm则返回false
Ⅱ)遍历自身日志commit活跃窗口,若该范围内第一节点设备的日志和自身日志发生冲突(index相同但是term不同),则用该第一节点设备的日志覆盖自身的日志;若该范围内第一节点设备发来了自身缺失的日志,则将该日志覆盖到对应index的位置
Ⅲ)遍历自身日志commit活跃窗口,若该范围内有和第一节点设备的日志一致的(index和term均相同)、第一节点设备已经commit的日志,则将该日志标记为commit
以下,示出了分布式存储系统中不同节点设备之间进行RPC通信的一种RequestVoteRPC函数,下面进行说明:
//由候选节点设备调用获取选票
参数:
//候选节点设备的term
int term
//候选节点设备的id
int candidateId
//候选节点设备上index小于或等于commitIndex的日志均已commit
int commitIndex
返回:
//返回值term为当前节点的term
int term
//返回值voteGranted表明该节点是否给该candidate投票
bool voteGranted
接收者(第二节点设备)实现:
ⅰ)如果term<currentTerm则返回false
ⅱ)如果voteFor为空且候选节点设备的commitIndex大于或等于自身的commitIndex,则给该候选节点设备投票
以下,示出了分布式存储系统中不同节点设备之间进行RPC通信的一种RequestMergeRPC函数,下面进行说明:
//新的第一节点设备选举成功,为保证第一节点设备日志列表中的日志连续性,请求第二节点设备发送给自己缺失的日志
参数:
//第一节点设备的term
int term
//第一节点设备的id
int leaderid
//记录第一节点设备的日志commit活跃窗口内的日志是否commit
int isCommitted[]
返回:
//返回值term为当前的第二节点设备的term
int currentTerm
//返回第二节点设备上已经commit且index处于第一节点设备日志commit活跃窗口范围的日志
Log logs[]
//返回第二节点设备日志列表中的日志commit情况
bool logcommits[]
接收者实现:
(Ⅰ)如果term<currentTerm则返回currentTerm
(Ⅱ)返回自己已经commit且index处于第一节点设备日志commit活跃窗口范围内的日志以及日志列表中的日志commit情况
以下,示出了分布式存储系统的服务器行为(也即用于提供分布式存储服务的整个服务器集群的行为):
所有服务器:
(ⅰ)进行日志的乱序ack和乱序commit流程
(ⅱ)进行日志的乱序apply流程
第二节点设备行为:
a、响应候选节点设备和第一节点设备的RPC函数
b、如果选举时间超时且没有从第一节点设备或者承诺投票的候选节点设备收到AppendEntries RPC消息,则转换为候选节点设备
候选节点设备行为:
A、自增currentTerm
B、给自己投票
C、选举时间计数器清零
D、给所有其他节点设备发送RequestVote RPC
E、如果收到多数选票,则成为第一节点设备
F、如果从新的第一节点设备收到AppendEntries RPC消息,则转换为第二节点设备
G、如果选举时间超时,则开启新的选举周期
第一节点设备行为:
Ⅰ、向其他每个节点设备发送心跳(通过日志为空的AppendEntries RPC),将其他每个节点设备转化为第二节点设备,重复发送心跳,阻止选举时间超时
Ⅱ、在向集群提供服务之前,先进行选举Merge恢复阶段,恢复自己缺失的日志,保证自己的日志连续性,并同时获取其他节点设备上日志的commit情况
Ⅲ、从终端收到命令，则将日志追加到自己的日志列表后面，给其他节点发送AppendEntries RPC，当日志commit时回复给终端ack信息
图13是本申请实施例提供的一种日志执行装置的结构示意图,请参考图13,该装置位于分布式存储系统中,该装置包括如下模块:
扫描模块1301,用于循环扫描日志执行活跃窗口,该日志执行活跃窗口包括尚未执行的多个日志,且该日志执行活跃窗口之前的日志均已执行;
第一获取模块1302,用于对于该日志执行活跃窗口中的任一日志,基于该日志的存储范围信息,获取该日志的冲突验证结果,该存储范围信息用于指示该日志的存储范围以及该日志之前的目标数量个日志的存储范围,该目标数量等于该日志执行活跃窗口的窗口尺寸;
执行模块1303,用于在该冲突验证结果为无冲突的情况下,执行该日志。
本申请实施例提供的装置,通过设置日志执行活跃窗口并保证日志执行活跃窗口之前的日志均已执行,仅需要验证日志执行活跃窗口内的任一日志是否与日志执行活跃窗口内的、该日志之前的、尚未执行的日志发生存储范围冲突,即可得知该日志是否会在整个分布式存储系统中引发数据不一致问题,对于无冲突的该日志,支持乱序执行该日志,而无需阻塞该日志的执行进程,并无需等待日志执行活跃窗口内的、该日志之前的、尚未执行的日志执行完毕,能够大大提升分布式存储系统的吞吐量,且能够适用于高并发场景。
在一种可能实施方式中,该第一获取模块1302用于:读取该日志的存储范围信息,得到该日志以及该日志之前的目标数量个日志的存储范围;在该日志的存储范围与该目标数量个日志的存储范围之间的交集为空集的情况下,确定该冲突验证结果为无冲突;在该日志的存储范围与该目标数量个日志的存储范围之间的交集不为空集的情况下,确定该冲突验证结果为发生冲突。
在一种可能实施方式中,基于图13的装置组成,该执行模块1303包括:添加单元,用于将该日志添加至待执行日志列表;处理单元,用于调用日志执行线程,处理该待执行日志列表中存储的日志。
在一种可能实施方式中，该处理单元用于：将该待执行日志列表中存储的日志所对应的业务数据写入易失性存储介质；将写入该易失性存储介质中的该业务数据所对应的日志添加至已执行日志列表；在该易失性存储介质的数据存储量大于或等于存储阈值的情况下，将该易失性存储介质中存储的业务数据写入至非易失性存储介质中；其中，在该已执行日志列表中分别为业务数据未写入非易失性存储介质的日志与业务数据已写入非易失性存储介质的日志设置不同的状态参数。
在一种可能实施方式中,基于图13的装置组成,该装置还包括:第二获取模块,用于在该分布式存储系统发生崩溃事件的情况下,在该分布式存储系统重启时基于该状态参数,从该已执行日志列表中获取多个待恢复日志,该待恢复日志所对应的业务数据已写入该易失性存储介质而未写入至该非易失性存储介质;恢复模块,用于基于多个该待恢复日志在该已执行日志列表中的存储顺序,依次将多个该待恢复日志所对应的多个业务数据恢复至该易失性存储介质中。
在一种可能实施方式中,基于图13的装置组成,该装置还包括:第三获取模块,用于在该冲突验证结果为发生冲突的情况下,获取与该日志发生冲突的日志数量;确定模块,用于基于该日志数量,确定对该日志的扫描频率;该执行模块1303,还用于基于该扫描频率扫描该日志并刷新该冲突验证结果,直到该冲突验证结果为无冲突,执行该日志。
在一种可能实施方式中,该确定模块用于:在该日志数量大于冲突阈值的情况下,将该扫描频率确定为第一频率;或者,在该日志数量小于或等于该冲突阈值的情况下,将该扫描频率确定为第二频率,该第二频率大于该第一频率。
在一种可能实施方式中,该执行模块1303用于:将该日志添加至与该扫描频率对应的日志列表,基于该扫描频率扫描该日志列表中存储的日志。
在一种可能实施方式中,该扫描模块1301还用于:循环扫描日志匹配索引表,该日志匹配索引表用于记录多个待提交日志在该分布式存储系统中存储的副本数量;该装置还包括提交模块,用于在该日志匹配索引表中的任一待提交日志的副本数量符合目标条件的情况下,提交该待提交日志。
在一种可能实施方式中,该装置由该分布式存储系统中的多个第二节点设备在上一任期结束后投票选举产生,该装置中连续的已提交的日志中的最大索引大于或等于多个该第二节点设备中连续的已提交的日志中的最大索引。
在一种可能实施方式中,基于图13的装置组成,该装置还包括:第四获取模块,用于在日志提交活跃窗口中缺失至少一个日志的情况下,获取每个缺失的日志的索引,该日志提交活跃窗口包括尚未提交的多个日志,且该日志提交活跃窗口之前的日志均已提交;请求模块,用于基于该索引,向多个该第二节点设备请求该缺失的日志。
在一种可能实施方式中,基于图13的装置组成,该装置还包括:接收模块,用于接收多个该第二节点设备所返回的目标日志以及该目标日志的提交情况信息;补全模块,用于将提交情况信息为已提交的目标日志补全在该日志提交活跃窗口中;该请求模块,还用于向终端请求提交情况信息为未提交的目标日志。
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
需要说明的是:上述实施例提供的日志执行装置在执行日志时,仅以上述各功能模块的划分进行举例说明,实际应用中,能够根据需要而将上述功能分配由不同的功能模块完成,即将计算机设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的日志执行装置与日志执行方法实施例属于同一构思,其具体实现过程详见日志执行方法实施例,这里不再赘述。
图14是本申请实施例提供的一种计算机设备的结构示意图，该计算机设备1400可因配置或性能不同而产生比较大的差异，该计算机设备1400包括一个或一个以上处理器（Central Processing Units，CPU）1401和一个或一个以上的存储器1402，其中，该存储器1402中存储有至少一条计算机程序，该至少一条计算机程序由该一个或一个以上处理器1401加载并执行以实现上述各个实施例提供的日志执行方法。可选地，该计算机设备1400还具有有线或无线网络接口、键盘以及输入输出接口等部件，以便进行输入输出，该计算机设备1400还包括其他用于实现设备功能的部件，在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括至少一条计算机程序的存储器,上述至少一条计算机程序可由终端中的处理器执行以完成上述各个实施例中的日志执行方法。例如,该计算机可读存储介质包括ROM(Read-Only Memory,只读存储器)、RAM(Random-Access Memory,随机存取存储器)、CD-ROM(Compact Disc Read-Only Memory,只读光盘)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品或计算机程序,包括一条或多条程序代码,该一条或多条程序代码存储在计算机可读存储介质中。计算机设备的一个或多个处理器能够从计算机可读存储介质中读取该一条或多条程序代码,该一个或多个处理器执行该一条或多条程序代码,使得计算机设备能够执行以完成上述实施例中的日志执行方法。
本领域普通技术人员能够理解实现上述实施例的全部或部分步骤能够通过硬件来完成,也能够通过程序来指令相关的硬件完成,可选地,该程序存储于一种计算机可读存储介质中,可选地,上述提到的存储介质是只读存储器、磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (15)

  1. 一种日志执行方法,由分布式存储系统中的第一节点设备执行,所述方法包括:
    循环扫描日志执行活跃窗口,所述日志执行活跃窗口包括尚未执行的多个日志,且所述日志执行活跃窗口之前的日志均已执行;
    对于所述日志执行活跃窗口中的任一日志,基于所述日志的存储范围信息,获取所述日志的冲突验证结果,所述存储范围信息用于指示所述日志的存储范围以及所述日志之前的目标数量个日志的存储范围,所述目标数量等于所述日志执行活跃窗口的窗口尺寸;
    在所述冲突验证结果为无冲突的情况下,执行所述日志。
  2. 根据权利要求1所述的方法,所述基于所述日志的存储范围信息,获取所述日志的冲突验证结果包括:
    读取所述日志的存储范围信息,得到所述日志以及所述日志之前的目标数量个日志的存储范围;
    在所述日志的存储范围与所述目标数量个日志的存储范围之间的交集为空集的情况下,确定所述冲突验证结果为无冲突;
    在所述日志的存储范围与所述目标数量个日志的存储范围之间的交集不为空集的情况下,确定所述冲突验证结果为发生冲突。
  3. 根据权利要求1所述的方法,所述执行所述日志包括:
    将所述日志添加至待执行日志列表;
    调用日志执行线程,处理所述待执行日志列表中存储的日志。
  4. 根据权利要求3所述的方法,所述处理所述待执行日志列表中存储的日志包括:
    将所述待执行日志列表中存储的日志所对应的业务数据写入易失性存储介质;
    将写入所述易失性存储介质中的所述业务数据所对应的日志添加至已执行日志列表;
    在所述易失性存储介质的数据存储量大于或等于存储阈值的情况下,将所述易失性存储介质中存储的业务数据写入至非易失性存储介质中;
    其中,在所述已执行日志列表中分别为业务数据未写入非易失性存储介质的日志与业务数据已写入非易失性存储介质的日志设置不同的状态参数。
  5. 根据权利要求4所述的方法,所述方法还包括:
    在所述分布式存储系统发生崩溃事件的情况下,在所述分布式存储系统重启时基于所述状态参数,从所述已执行日志列表中获取多个待恢复日志,所述待恢复日志所对应的业务数据已写入所述易失性存储介质而未写入至所述非易失性存储介质;
    基于多个所述待恢复日志在所述已执行日志列表中的存储顺序,依次将多个所述待恢复日志所对应的多个业务数据恢复至所述易失性存储介质中。
  6. 根据权利要求1所述的方法,所述方法还包括:
    在所述冲突验证结果为发生冲突的情况下,获取与所述日志发生冲突的日志数量;
    基于所述日志数量,确定对所述日志的扫描频率;
    基于所述扫描频率扫描所述日志并刷新所述冲突验证结果,直到所述冲突验证结果为无冲突,执行所述日志。
  7. 根据权利要求6所述的方法，所述基于所述日志数量，确定对所述日志的扫描频率包括：
    在所述日志数量大于冲突阈值的情况下,将所述扫描频率确定为第一频率;或者,
    在所述日志数量小于或等于所述冲突阈值的情况下,将所述扫描频率确定为第二频率,所述第二频率大于所述第一频率。
  8. 根据权利要求6所述的方法,所述基于所述扫描频率扫描所述日志包括:
    将所述日志添加至与所述扫描频率对应的日志列表,基于所述扫描频率扫描所述日志列表中存储的日志。
  9. 根据权利要求1所述的方法,所述方法还包括:
    循环扫描日志匹配索引表,所述日志匹配索引表用于记录多个待提交日志在所述分布式存储系统中存储的副本数量;
    在所述日志匹配索引表中的任一待提交日志的副本数量符合目标条件的情况下,提交所述待提交日志。
  10. 根据权利要求1所述的方法,所述第一节点设备由所述分布式存储系统中的多个第二节点设备在上一任期结束后投票选举产生,所述第一节点设备中连续的已提交的日志中的最大索引大于或等于多个所述第二节点设备中连续的已提交的日志中的最大索引。
  11. 根据权利要求10所述的方法,所述方法还包括:
    在日志提交活跃窗口中缺失至少一个日志的情况下,获取每个缺失的日志的索引,所述日志提交活跃窗口包括尚未提交的多个日志,且所述日志提交活跃窗口之前的日志均已提交;
    基于所述索引,向多个所述第二节点设备请求所述缺失的日志。
  12. 根据权利要求11所述的方法,所述方法还包括:
    接收多个所述第二节点设备所返回的目标日志以及所述目标日志的提交情况信息;
    将提交情况信息为已提交的目标日志补全在所述日志提交活跃窗口中;
    向终端请求提交情况信息为未提交的目标日志。
  13. 一种日志执行装置,所述装置位于分布式存储系统中,所述装置包括:
    扫描模块,用于循环扫描日志执行活跃窗口,所述日志执行活跃窗口包括尚未执行的多个日志,且所述日志执行活跃窗口之前的日志均已执行;
    第一获取模块,用于对于所述日志执行活跃窗口中的任一日志,基于所述日志的存储范围信息,获取所述日志的冲突验证结果,所述存储范围信息用于指示所述日志的存储范围以及所述日志之前的目标数量个日志的存储范围,所述目标数量等于所述日志执行活跃窗口的窗口尺寸;
    执行模块,用于在所述冲突验证结果为无冲突的情况下,执行所述日志。
  14. 一种计算机设备,所述计算机设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求12任一项所述的日志执行方法。
  15. 一种存储介质,所述存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行以实现如权利要求1至权利要求12任一项所述的日志执行方法。