CN112035410A - Log storage method and device, node equipment and storage medium - Google Patents

Log storage method and device, node equipment and storage medium

Info

Publication number
CN112035410A
CN112035410A (application CN202010833472.0A)
Authority
CN
China
Prior art keywords
log
storage medium
data
block
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010833472.0A
Other languages
Chinese (zh)
Other versions
CN112035410B (en)
Inventor
毛东方
李海翔
王建民
黄向东
潘安群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Tencent Technology Shenzhen Co Ltd
Priority to CN202010833472.0A
Publication of CN112035410A
Application granted
Publication of CN112035410B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/13 File access structures, e.g. distributed indices
    • G06F 16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F 16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a log storage method and apparatus, a node device, and a storage medium, belonging to the field of database technologies. The method comprises: in response to a commit event of a target transaction, determining the remaining capacity of a first storage medium, the first storage medium being a non-volatile storage medium for storing logs; in response to the remaining capacity being smaller than the data amount of the target transaction's uncached log, creating a log checkpoint and storing the business data generated by modification operations in a second storage medium to a third storage medium; and writing the uncached log of the target transaction to the first storage medium. Because logs are persisted directly in the first storage medium, no complex two-layer log caching process needs to be executed. This greatly reduces the space occupied by log storage, improves the system performance of the database, avoids capping the throughput of the database system, and facilitates data expansion.

Description

Log storage method and device, node equipment and storage medium
Technical Field
The present application relates to the field of database technologies, and in particular, to a log storage method and apparatus, a node device, and a storage medium.
Background
In mainstream database systems, a logging module is typically employed to optimize system performance. When data is written, it is first cached in memory and then asynchronously persisted to disk. If the system crashes, any data in memory that has not yet been persisted is lost; the system then recovers that data through the logging module, ensuring data reliability. However, to guarantee that the logging module can recover the data in memory, the log corresponding to the written data must itself be persisted to disk, which greatly limits the upper bound of the database system's throughput.
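The logging mechanism described in the background (buffer writes in memory, persist data pages asynchronously, and replay the log after a crash) can be illustrated with a minimal sketch. This is a generic write-ahead-logging illustration, not the patented method; all names are hypothetical:

```python
class MiniWAL:
    """Minimal write-ahead-logging sketch: the log is persisted on every
    write, while data pages are flushed lazily and rebuilt after a crash."""

    def __init__(self):
        self.memory = {}   # volatile cache (second storage medium)
        self.disk = {}     # persisted data pages (third storage medium)
        self.log = []      # persisted redo log (survives a crash)

    def write(self, key, value):
        self.log.append((key, value))   # persist the log entry first
        self.memory[key] = value        # then update the in-memory copy

    def flush(self):
        self.disk.update(self.memory)   # asynchronous persistence of data

    def crash_and_recover(self):
        self.memory = dict(self.disk)   # the volatile cache is lost
        for key, value in self.log:     # redo the log to restore data
            self.memory[key] = value
```

The sketch shows why the log persist on every write caps throughput: every `write` pays the cost of durably appending a log record before the in-memory update becomes recoverable.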
Disclosure of Invention
The embodiments of the present application provide a log storage method and apparatus, a node device, and a storage medium, which avoid capping the throughput of the database system and optimize its performance. The technical solution is as follows:
in one aspect, a log storage method is provided, and the method includes:
determining the remaining capacity of a first storage medium in the database system in response to a commit event of a target transaction, wherein the first storage medium is a nonvolatile storage medium for storing a log;
in response to the remaining capacity being smaller than the data amount of the uncached log of the target transaction, creating a log checkpoint, and storing business data generated by modification operations in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium and the third storage medium is a non-volatile storage medium;
writing the uncached log of the target transaction to the first storage medium.
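The three steps of the method can be sketched in miniature. The classes and thresholds below are illustrative stand-ins, not the patented implementation; checkpointing is reduced to clearing the already-covered log region:

```python
class Nvm:
    """Toy stand-in for the first (NVM) storage medium holding the log."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.log = []          # persisted log records
        self.checkpoints = 0   # number of checkpoints created

    def remaining_capacity(self):
        return self.capacity - len(self.log)

    def checkpoint_and_reclaim(self):
        self.checkpoints += 1
        self.log.clear()       # checkpointed log records can be discarded


def commit(uncached_log, nvm, dirty_pages, disk):
    """Sketch of the commit path: if the uncached log does not fit in the
    NVM's remaining capacity, flush dirty pages to disk and checkpoint
    (making the old log redundant), then write the log directly to NVM."""
    if nvm.remaining_capacity() < len(uncached_log):
        disk.update(dirty_pages)       # persist business data first
        nvm.checkpoint_and_reclaim()   # old log space becomes reusable
    nvm.log.extend(uncached_log)       # persist the log at commit time
```

The key design choice the claims describe is visible here: the log goes straight to the durable NVM region at commit, so no separate in-memory log buffer and log file are needed.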
In one aspect, a log storage apparatus is provided, the apparatus including:
a determining module, configured to determine, in response to a commit event of a target transaction, the remaining capacity of a first storage medium in the database system, wherein the first storage medium is a non-volatile storage medium for storing logs;
a storage module, configured to, in response to the remaining capacity being smaller than the data amount of the uncached log of the target transaction, create a log checkpoint and store the business data generated by modification operations in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium and the third storage medium is a non-volatile storage medium;
a writing module, configured to write the uncached log of the target transaction to the first storage medium.
In one possible implementation, the writing module is triggered in response to the remaining capacity being greater than or equal to the data amount of the uncached log of the target transaction.
In one possible embodiment, the apparatus further comprises:
an obtaining module, configured to obtain the storage capacity of the first storage medium;
a configuration module, configured to set the log space capacity parameter of the database system to the storage capacity of the first storage medium.
In one possible implementation, the storage module is further configured to:
creating a log checkpoint at intervals of a first target duration in response to a target condition being met;
storing the service data generated based on the modification operation in the second storage medium to the third storage medium;
copying the log in the last log block stored in the first storage medium to the first log block in the first storage medium;
moving a starting write location pointer of the first storage medium to the first log block.
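The last two steps above recycle the NVM log area after a checkpoint: the logs in the last written (possibly partially filled) log block are copied to the first log block, and the starting write-location pointer is moved back to that first block. A small sketch, with an illustrative block layout that is not the patented format:

```python
def recycle_log_area(log_blocks, write_ptr):
    """After a checkpoint, copy the logs of the last written block into the
    first block and move the write pointer back to the first block, so new
    logs continue in block 0 and the rest of the area is free again."""
    # write_ptr indexes the last block currently holding logs
    log_blocks[0] = list(log_blocks[write_ptr])
    write_ptr = 0          # new logs resume at the first log block
    return write_ptr
```

Copying the tail block forward preserves any log records written after the checkpoint, so they remain recoverable even though the rest of the area is reused.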
In one possible embodiment, the target condition comprises at least one of:
the remaining capacity of the first storage medium is less than a capacity threshold;
the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the business data with the smallest timestamp in the second storage medium is greater than a first target threshold;
the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.
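Expressed as a predicate, the three target conditions might look like the following. The threshold values and parameter names are illustrative assumptions, not values from the patent:

```python
def should_checkpoint(nvm_remaining, max_lsn, oldest_dirty_lsn,
                      last_checkpoint_lsn, capacity_threshold=64,
                      first_threshold=1024, second_threshold=4096):
    """True if any target condition holds: low remaining NVM capacity,
    a dirty page whose log lags too far behind the newest log sequence
    number, or a checkpoint that is too stale."""
    return (nvm_remaining < capacity_threshold
            or max_lsn - oldest_dirty_lsn > first_threshold
            or max_lsn - last_checkpoint_lsn > second_threshold)
```

Any one condition suffices ("at least one of"), so the checks are combined with `or`.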
In one possible embodiment, the apparatus further comprises:
a recovery module, configured to, in response to the database system restarting after a shutdown, obtain the logs to be recovered from the first storage medium and recover data based on the logs to be recovered.
In one possible embodiment, the recovery module comprises:
a verification unit, configured to verify log blocks starting from the first log block of the first storage medium, and to determine the logs stored in the log blocks that pass verification as the logs to be recovered.
In one possible embodiment, the recovery module comprises:
a redo unit, configured to store the logs to be recovered in a hash table, traverse the hash table, and redo the logs to be recovered stored therein to obtain the recovered business data.
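The recovery flow just described, verifying log blocks from the head of the NVM medium, collecting the surviving logs into a hash table, and redoing them, can be sketched as follows. The checksum scheme and record format here are assumptions for illustration, not the patent's on-media layout:

```python
import zlib


def make_block(records):
    """Helper: build a verifiable (payload, checksum) block for the sketch."""
    return (records, zlib.crc32(repr(records).encode()))


def recover(log_blocks):
    """Verify blocks in order, stopping at the first corrupt block; group
    surviving (key, value) redo records in a hash table, then redo them
    (last write to a key wins) to rebuild the business data."""
    table = {}                          # hash table of logs to redo
    for payload, checksum in log_blocks:
        if zlib.crc32(repr(payload).encode()) != checksum:
            break                       # block fails verification
        for key, value in payload:
            table.setdefault(key, []).append(value)
    data = {}
    for key, values in table.items():   # redo pass over the hash table
        data[key] = values[-1]
    return data
```

Stopping at the first corrupt block mirrors the verification unit: blocks after a failed check cannot be trusted, since the log is written sequentially.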
In one aspect, a node device is provided, comprising one or more processors and one or more memories in which at least one program code is stored, the at least one program code being loaded and executed by the one or more processors to implement the log storage method of any one of the possible implementations described above.
In one aspect, a storage medium is provided in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the log storage method of any one of the possible implementations described above.
In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer readable storage medium. The one or more processors of the node apparatus are capable of reading the one or more program codes from the computer-readable storage medium, and the one or more processors execute the one or more program codes to enable the node apparatus to perform the log storage method of any one of the above-described possible embodiments.
The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:
When a target transaction is committed, its uncached log is persisted directly to the NVM storage medium (the first storage medium), so the log is durably maintained at commit time. Because the log no longer needs to be stored twice, once in memory (the second storage medium) and once on disk (the third storage medium), the space occupied by log storage is greatly reduced; there is no need to build a two-layer log buffer / log file storage system as traditional InnoDB does, and the traditional log file is eliminated. Moreover, if the remaining capacity of the NVM medium is insufficient, free storage space can be reclaimed directly by creating a log checkpoint, so the tedious, slow-IO cache flow of flushing the log from memory to disk does not need to be executed. This improves the system performance of the database and avoids capping the throughput of the database system.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a log storage method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an NVM-MySQL log architecture according to an embodiment of the present application;
FIG. 3 is a flowchart of a log storage method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an NVM-MySQL log storage system according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a log block provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of a log storage method according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating an initialization process of an NVM medium according to an embodiment of the present application;
FIG. 8 is a flowchart for periodically inspecting NVM media according to an embodiment of the present application;
FIG. 9 is a schematic flow chart of the periodic inspection of NVM media according to an embodiment of the present application;
FIG. 10 is a flowchart of a disaster recovery method according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of a disaster recovery method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a log storage apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a node device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first", "second", and the like in this application are used to distinguish identical or similar items that have substantially the same function; it should be understood that "first", "second", and "nth" imply no logical or temporal dependency, nor any limitation on number or order of execution.
The term "at least one" in this application means one or more, and "a plurality of" means two or more; for example, a plurality of first positions means two or more first positions.
Before introducing the embodiments of the present application, some basic concepts in the cloud technology field need to be introduced:
cloud Technology (Cloud Technology): the cloud computing business mode management system is a management technology for unifying series resources such as hardware, software, networks and the like in a wide area network or a local area network to realize data calculation, storage, processing and sharing, namely is a general name of a network technology, an information technology, an integration technology, a management platform technology, an application technology and the like applied based on a cloud computing business mode, can form a resource pool, is used as required, and is flexible and convenient. Cloud computing technology will become an important support in the field of cloud technology. Background services of the technical network system require a large amount of computing and storage resources, such as video websites, picture-like websites and more web portals. With the high development and application of the internet industry, each article may have its own identification mark and needs to be transmitted to a background system for logic processing, data in different levels are processed separately, and various industrial data need strong system background support and can be realized through cloud computing.
Cloud Storage (Cloud Storage): a distributed cloud storage system (hereinafter referred to as a storage system) that, through functions such as cluster application, grid technology, and distributed storage file systems, integrates a large number of storage devices of different types (also referred to as storage nodes) in a network via application software or application interfaces so that they work cooperatively, providing data storage and business access functions externally.
Database (Database): in short, it can be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which a user can add, query, update, and delete data. A database is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of applications.
The database system in the embodiments of the present application may be a stand-alone database system, a stand-alone transaction-oriented database system, or a stand-alone database system that is mainly analytical but requires transaction processing capability; it may be a NoSQL (Non-relational SQL) system, and it may also be a distributed database system or a distributed big-data processing system.
The database system may include at least one node device, and the database of each node device may store a plurality of data tables, each of which may store one or more data items. The database of a node device may be any type of distributed database and may include at least one of a relational database and a non-relational database, such as an SQL (Structured Query Language) database, a NoSQL database, or a NewSQL database (broadly, various new extensible/high-performance databases); the type of the database is not specifically limited in this embodiment.
In some embodiments, the embodiments of the present application may also be applied to a database system based on blockchain technology (hereinafter, the "blockchain system"). A blockchain system is essentially a decentralized distributed database system: a consensus algorithm keeps the ledger data recorded by different node devices on the blockchain consistent, cryptographic algorithms guarantee the encrypted transmission and tamper-resistance of ledger data between node devices, the ledger function is extended by a script system, and different node devices are interconnected through network routing.
A blockchain system may include one or more blockchains. A blockchain is a chain of data blocks associated by cryptographic methods; each data block contains information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of its information and to generate the next block.
Node devices in the blockchain system may form a peer-to-peer (P2P) network; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a blockchain system, any node device may have the following functions: 1) routing, a basic function of a node device, used to support communication between node devices; 2) application, deployed in the blockchain to implement specific business according to actual requirements, recording the data related to that function to form ledger data, carrying a digital signature in the ledger data to indicate the data source, and sending the ledger data to other node devices in the system, which add it to a temporary block after successfully verifying its source and integrity; the business implemented by the application may include a wallet, a shared ledger, a smart contract, and the like; 3) blockchain, a series of blocks connected in chronological order; once added to the blockchain, a new block cannot be removed, and the blocks record the ledger data submitted by the node devices in the system.
In some embodiments, each block may include both its own hash value (the hash of the transaction records stored in the block) and the hash value of the previous block; the blocks are connected by these hash values to form a blockchain.
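The hash chaining of blocks described here can be illustrated with a short sketch; field names and the hash construction are illustrative, not the patent's format:

```python
import hashlib
import json


def block_hash(block):
    """Hash of a block's contents, including its link to the previous block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain, transactions):
    """Link a new block to the chain via the previous block's hash."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev_hash, "transactions": transactions})
    return chain


def verify_chain(chain):
    """A chain is valid only if every block records the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))
```

Because each block's hash covers the previous block's hash, tampering with any block invalidates every later link, which is the anti-counterfeiting property the text refers to.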
Fig. 1 is a schematic diagram of an implementation environment of a log storage method according to an embodiment of the present application. Referring to fig. 1, the present embodiment may be applied to a distributed database system, where the system may include a gateway server 101, a global timestamp generation cluster 102, a distributed storage cluster 103, and a distributed coordination system 104 (e.g., ZooKeeper), and the distributed storage cluster 103 may include a data node device and a coordination node device.
The gateway server 101 is configured to receive external read/write requests and distribute the read/write transactions corresponding to those requests to the distributed storage cluster 103. For example, after a user logs in to an application client on a terminal, the client generates a read/write request and calls an Application Programming Interface (API) provided by the distributed database system, for example the MySQL API (an API provided by a relational database system), to send the request to the gateway server 101.
In some embodiments, the gateway server 101 may be merged with any data node device or any coordinating node device in the distributed storage cluster 103 on the same physical machine, that is, a certain data node device or coordinating node device is allowed to act as the gateway server 101.
The global timestamp generation cluster 102 is configured to generate global commit timestamps (Global Timestamp, Gts) for global transactions. A global transaction, also called a distributed transaction, is a transaction involving multiple data node devices; for example, a global read transaction may read data stored on multiple data node devices, and a global write transaction may write data on multiple data node devices. The global timestamp generation cluster 102 can be logically regarded as a single point, but in some embodiments a more highly available service may be provided through a one-master-three-slave architecture; generating the global commit timestamp in cluster form prevents single-point failures and avoids the single-point bottleneck problem.
Optionally, the global commit timestamp is a globally unique, monotonically increasing timestamp identifier in the distributed database system. It marks the global commit order of transactions, reflecting their true temporal precedence (the total order of transactions). The global commit timestamp may use at least one of a physical clock, a logical clock, a hybrid physical clock, or a Hybrid Logical Clock (HLC); its type is not specifically limited in the embodiments of the present application.
In an exemplary scenario, the global commit timestamp may be generated by a hybrid physical clock and consist of eight bytes. The first 44 bits hold the value of the physical timestamp (a Unix timestamp, accurate to the millisecond), which can represent 2^44 unsigned integers, together covering physical timestamps spanning roughly 557 years. The last 20 bits are a monotonically increasing count within a given millisecond, so that up to 2^20 (about one million) timestamps are available per millisecond. Based on this data structure, if the transaction throughput of a single machine (any data node device) is 100,000 transactions per second, a distributed storage cluster 103 containing 10,000 node devices can theoretically be supported; meanwhile, the number of distinct global commit timestamps represents the total number of transactions that the system can theoretically support, namely (2^44 - 1) * 2^20 transactions. This definition of the global commit timestamp is merely exemplary; according to different business requirements, the bit width of the global commit timestamp may be expanded to support more nodes and more transactions.
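The eight-byte timestamp layout described above, with a 44-bit millisecond physical timestamp in the high bits and a 20-bit per-millisecond counter in the low bits, can be sketched as a hypothetical packing (function names are illustrative, not from the patent):

```python
PHYS_BITS = 44   # high bits: physical timestamp in milliseconds
SEQ_BITS = 20    # low bits: per-millisecond monotonic counter


def make_gts(unix_ms: int, seq: int) -> int:
    """Pack a 64-bit global commit timestamp: 44-bit ms timestamp + 20-bit counter."""
    assert 0 <= unix_ms < (1 << PHYS_BITS)
    assert 0 <= seq < (1 << SEQ_BITS)
    return (unix_ms << SEQ_BITS) | seq


def split_gts(gts: int):
    """Recover (physical_ms, counter) from a packed timestamp."""
    return gts >> SEQ_BITS, gts & ((1 << SEQ_BITS) - 1)
```

Placing the physical timestamp in the high bits makes integer comparison of two packed values agree with their temporal order, which is what makes the timestamps usable as a global commit order.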
In some embodiments, the global timestamp generation cluster 102 may be physically separate or may be incorporated with the distributed coordination system 104 (e.g., ZooKeeper).
The distributed storage cluster 103 may include data node devices and coordinator node devices; each coordinator node device may correspond to at least one data node device. The division between data node devices and coordinator node devices is relative to a given transaction: for a given global transaction, its initiating node may be called the coordinator node device, while the other node devices involved are called data node devices. There may be one or more data node devices and one or more coordinator node devices; their numbers in the distributed storage cluster 103 are not specifically limited in the embodiments of the present application. Because the distributed database system provided by this embodiment lacks a global transaction manager, XA (eXtended Architecture, the X/Open distributed transaction specification) / 2PC (Two-Phase Commit) technology may be adopted in the system to support cross-node (global) transactions and to guarantee the atomicity and consistency of data during cross-node write operations. In this case, the coordinator node device serves as the coordinator in the 2PC algorithm, and each data node device corresponding to it serves as a participant.
Optionally, each data node device or coordinator node device may be a stand-alone device or may adopt a master/backup structure (that is, a master/backup cluster). As shown in fig. 1, each node device (data node device or coordinator node device) is exemplified as a master/backup cluster comprising one host and two backup machines. Optionally, each host or backup machine is configured with an agent device, which may be physically independent of the host or backup machine or may run as an agent module on it. Taking node device 1 as an example, it includes a master database with an agent device (master DB + agent) and, in addition, two backup databases each with an agent device (backup DB + agent).
In an exemplary scenario, the set of database instances of the host and backups corresponding to each node device is referred to as a SET. For example, if a node device is a stand-alone device, its SET is only the database instance of that device; if a node device is a master/backup cluster, its SET is the set of the host database instance and the two backup database instances. In that case, consistency between the data of the host and the replica data of the backups may be guaranteed by the strong synchronization technique of the cloud database. Optionally, each SET can be scaled linearly to cope with business processing requirements in big-data scenarios; in some financial business scenarios, a global transaction usually refers to a transfer across SETs.
The distributed coordination system 104 may be configured to manage at least one of the gateway server 101, the global timestamp generation cluster 102, or the distributed storage cluster 103. Optionally, a technician may access the distributed coordination system 104 through a scheduler on the terminal, controlling the back-end distributed coordination system 104 via the front-end scheduler to manage each cluster or server. For example, a technician may instruct ZooKeeper through the scheduler to delete a node device from the distributed storage cluster 103, that is, to take a node device out of service.
Fig. 1 shows an architecture providing lightweight global transactions, a kind of distributed database system. The whole distributed database system can be regarded as jointly maintaining one large logical table; the data stored in this large table are scattered by primary key across the node devices in the distributed storage cluster 103, and the data stored on each node device are independent of those on other node devices, so the node devices horizontally partition the large logical table. Since each data table in each database can be stored in a distributed fashion after horizontal partitioning, this system may also be described visually as an architecture of "database and table sharding".
In this distributed database system, the atomicity and consistency of cross-node write operations are achieved based on the XA/2PC algorithm. From a technical point of view, the sharded architecture lacks a global transaction manager and thus lacks distributed transaction processing capability. By constructing a lightweight, decentralized distributed transaction processing mechanism, the distributed database system gains capabilities such as horizontal scaling while remaining simple, easy to popularize, and more efficient in transaction processing, which will certainly have a great impact on distributed database architectures designed with traditional concurrency control.
The log storage method provided by the embodiments of the present application can be applied to distributed systems adopting the database-and-table sharding architecture above, for example a distributed transactional database system or a distributed relational database system. In addition, the method can also be applied to some single-machine database systems. The method gives the storage engine of a single node in a database system the ability to use Non-Volatile Memory (NVM), so as to meet the application requirements of different customers, improve transaction processing efficiency, and enhance the product competitiveness and technical influence of the database, which is of strong practical significance.
In some embodiments, the distributed database system formed by the gateway server 101, the global timestamp generation cluster 102, the distributed storage cluster 103, and the distributed coordination system 104 may be regarded as a server providing data services to a user terminal. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big-data and artificial-intelligence platforms. Optionally, the user terminal may be, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, or smart watch. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
Before describing the embodiments of the present application, several different types of storage media referred to in the present application will be described separately:
A first storage medium: an NVM (Non-Volatile Memory) medium, a new type of durable storage medium, such as PCM (Phase Change Memory). The data write speed of PCM lies between that of the second storage medium and that of the third storage medium, while its data read speed is substantially equal to that of the second storage medium and much greater than that of the third storage medium. Hereinafter the first storage medium is also referred to as the NVM medium.
A second storage medium: a volatile storage medium, generally referred to as memory, used as a system acceleration cache, such as Dynamic Random Access Memory (DRAM); it has the characteristics of small capacity, fast Input/Output (IO) speed, high price, volatility, and byte addressing.
A third storage medium: a non-volatile storage medium, generally referred to as a hard disk, used for persistent storage of service data, such as a Hard Disk Drive (HDD) or a Solid State Disk (SSD); it has the characteristics of large capacity, slow IO speed, low price, non-volatility, and block addressing. Hereinafter the third storage medium is also referred to as the hard disk or disk.
On the basis of the system architecture described above, the underlying storage architecture of most existing databases is designed disk-oriented, and the whole storage architecture can be divided into two storage tiers. On one hand, the main storage medium is an HDD or SSD, which has the characteristics of large capacity, slow IO speed, low price, non-volatility, and block addressing, and can guarantee the durability of data when the database suffers a major fault (such as a system crash or power failure), i.e., it realizes persistent storage of data. On the other hand, DRAM, with its characteristics of small capacity, fast IO speed, high price, volatility, and byte addressing, serves as a cache for system acceleration. Because there is a large gap between the IO speeds of the two storage tiers, the database uses a series of modules, such as the log module described below, to bridge that gap and optimize system performance.
In a database system, the log module is indispensable for ensuring data reliability, and is also a module that significantly affects system performance. To improve throughput, the database system first quickly caches data in memory during a write, and then asynchronously persists the data to the hard disk. However, since data in memory is volatile, if the system goes down due to an unrecoverable fault, the data in memory that has not yet been persisted is lost; to allow the system to recover that data through the log module, the corresponding Redo Log must be persisted to disk after the data is written. Although this step can be completed with relatively fast sequential disk IO, it still limits the upper bound on the throughput of the database system.
In the traditional InnoDB system, the log storage system is divided into two layers: the log buffer and the log files. The log buffer is located in memory (i.e., the second storage medium, a volatile storage medium); it is a contiguous virtual memory area that can be logically divided into a plurality of log blocks, and internally it is split into two parts of equal size: a write area and a persistence area. The write area caches log data newly generated by transactions, and transaction log data is written directly into the write area after being generated; the persistence area is used to persist log data, the log data in it being asynchronously persisted into a log file on disk. When the amount of log written in the write area reaches a certain threshold and the log in the persistence area has been completely persisted, InnoDB swaps the two areas: newly generated log data is written starting at the beginning of the former persistence area, overwriting old data that has already been persisted, while the log data previously in the write area begins asynchronous persistence.
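As an illustrative model (not actual InnoDB code), the write-area/persistence-area swap described above can be sketched as follows; all names are invented for the example:

```c
#include <assert.h>

/* Toy model of the two-halves log buffer: one half receives new log
   data, the other is being persisted asynchronously. */
typedef struct {
    int      write_area; /* 0 or 1: which half currently receives new logs */
    unsigned written;    /* bytes written into the write area              */
    unsigned threshold;  /* swap once this many bytes have been written    */
} log_buffer_t;

/* Swap the halves when the write area has passed its threshold AND the
   persistence area has nothing left to persist; returns 1 on swap. */
int maybe_swap(log_buffer_t *b, unsigned pending_persist) {
    if (b->written >= b->threshold && pending_persist == 0) {
        b->write_area ^= 1; /* new logs now overwrite the persisted half */
        b->written = 0;
        return 1;
    }
    return 0;
}
```

The key invariant modeled here is that a swap is legal only once the previously persisted half is safe to overwrite.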
In addition, in the traditional InnoDB system, the log files are located on disk (i.e., the third storage medium, a non-volatile storage medium) and are managed as redo log file groups, each containing one or more log files; in the general case InnoDB configures only one redo log file group. Within the same redo log file group, log data is written cyclically: after the current log file is full, the next log file is written, and if all log files are full, writing wraps around and overwrites from the beginning of the first log file. Each log file can be divided into two parts, a file header and log data. The file header consists of 4 log blocks: the first log block contains basic metadata, such as the log group ID (Identification) to which the file belongs and the Log Sequence Number (LSN) of the file's log records; the second and fourth log blocks record information related to log checkpoints, and only the header of the first log file in each log group contains this information. The log data part stores the actual log records and has a structure similar to the log buffer.
In the embodiments of the present application, a novel log architecture for the storage engine is provided, called NVM-MySQL. In this architecture, redo logs (referred to as logs for short) are all stored in the NVM medium (i.e., the first storage medium, a novel non-volatile storage medium); the traditional two-layer log-buffer/log-file architecture is simplified into a single-layer NVM architecture, avoiding interaction between the log module and the disk. When a redo log is generated, it is persisted directly into the NVM medium; likewise, during disaster recovery the redo log is read directly through the high-speed IO of the NVM medium, so the time cost is low.
Fig. 2 is a schematic diagram of the NVM-MySQL log architecture provided in an embodiment of the present application. Referring to fig. 2, a single-layer log storage system is provided in the NVM-MySQL log architecture, as shown at 200: when a transaction operation occurs, data is written into the data buffer pool 201 located in memory, and at the same time the redo log corresponding to the transaction operation is written into the NVM medium 202; the data cached in the data buffer pool 201 is then persisted to disk in the form of data files. During system disaster recovery, only the redo log needs to be read from the NVM medium 202, and the data is rewritten into the data buffer pool 201 according to the redo log.
Compared with the traditional InnoDB storage engine (a MySQL database engine), where redo logs must exist in both memory and on disk at the same time, in the NVM-MySQL log architecture all redo logs need only be stored in the NVM medium, which alone realizes their persistent storage; that is, the two-layer log storage system is restructured into a single-layer one. Accordingly, the log-related operation flows also need to be redesigned: at the code level, the core log data structures and the log-related operation flows mainly need to be modified so that the NVM-MySQL log architecture can be adapted to, and made compatible with, various database systems.
Fig. 3 is a flowchart of a log storage method according to an embodiment of the present application. Referring to fig. 3, the embodiment is applied to any node device of a database system, and includes:
301. The node device, in response to a commit event of a target transaction, determines the remaining capacity of a first storage medium in the database system, where the first storage medium is an NVM medium for storing logs.
The target transaction is any Mini-Transaction (MTR) into which a logical transaction is divided. Optionally, the logical transaction is a global transaction or a local transaction; the embodiments of the present application do not specifically limit the type of the logical transaction. A global transaction (also referred to as a distributed transaction) is a transaction involving cross-node operations, while a local transaction is a transaction involving only single-node operations.
Optionally, the node device is any node device in the database system. For example, in a stand-alone database system, the node device is the stand-alone device corresponding to that system; in a distributed database system, since a distributed transaction may involve cross-node operations, the node device may be a coordinating node device or a data node device. The initiating node of a distributed transaction is called the coordinating node device, and the other nodes involved in the distributed transaction are called data node devices.
An MTR is an important mechanism for ensuring the integrity and persistence of physical page write operations, commonly called a "physical transaction"; in InnoDB, any operation that modifies or reads data cannot do without an MTR. One logical transaction may include one or more MTRs, each of which writes the logs it generates into its own dynamic log buffer for temporary storage; once the current MTR commits, the temporarily stored logs are written into the NVM medium (i.e., the first storage medium) holding the Redo Log.
In the NVM-MySQL architecture, the NVM medium is used entirely for storing log records and is divided into a plurality of Log Blocks for management; each log block corresponds to a segment of contiguous storage addresses in the NVM medium. The structure is similar to the log buffer, but it does not need to be divided into a write area and a persistence area as in the traditional InnoDB two-layer storage system; instead, log data is written directly into the NVM medium in a cyclic, overwriting manner.
Optionally, each log block in the NVM medium may store one or more logs (also referred to as log records or log data), or may store part of the data of one log. That is, when a log's data is small, it may be merged with the data of other logs and recorded in the same log block; when a log's data is large, it may be spread across multiple consecutive log blocks.
Fig. 4 is a schematic structural diagram of an NVM-MySQL log storage system according to an embodiment of the present application. As shown at 400, the NVM medium includes multiple log blocks: if the log of a single transaction is small, it may be merged with the logs of other transactions and recorded in the same log block; if the log of a single transaction is large, it may be spread across multiple log blocks. Optionally, each log block is in fact a 512-byte data block.
In some embodiments, each log block includes a metadata area at its head and tail, and a data area between them. The head of the log block occupies 12 bytes and belongs to the metadata area, specifically including: 1) Number (log block number): indicates which block the current log block is, occupying 4 bytes, where Number = LSN/512 + 1; 2) Data Len (data length): the number of bytes already used in the current log block, occupying 2 bytes; if the entire log block is full, Data Len is 512; 3) First Rec offset (offset of the first record): the starting position of the first log record of the first new MTR in the current log block, occupying 2 bytes; this starting position is not necessarily at the head of the data part, because the current log block may hold data that did not fit into the previous log block; 4) Checkpoint No (log checkpoint number): the log checkpoint sequence number corresponding to the current log block, occupying 4 bytes, set only when the current log block is full, and used during disaster recovery to determine whether the log records in the current log block need to be re-applied (for redo). The tail of the log block occupies 4 bytes, belongs to the metadata area, and records the Checksum of the current log block. The storage area between the head and the tail is called the data area and occupies 496 bytes for storing log records.
Fig. 5 is a schematic structural diagram of a log block according to an embodiment of the present application, and as shown in fig. 5, a log block 500 is divided into a metadata area 501 and a data area 502, a header and a trailer of the log block both belong to the metadata area 501, and a storage area between the header and the trailer belongs to the data area 502. Optionally, the header of the log block occupies 12 bytes, including a log block sequence number (4 bytes), a data length (2 bytes), a start position where the first record is located (2 bytes, hereinafter also referred to as "start position"), and a log checkpoint sequence number (4 bytes). The tail of the log block occupies 4 bytes, which is the last remaining 4 bytes of the log block, including the checksum of the log block. The data area 502 is located after the 12-byte head and before the 4-byte tail and occupies 496 bytes for storing log records.
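The 512-byte log block layout described above can be sketched as a C structure; the field names follow this description rather than the actual InnoDB definitions:

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCK_SIZE      512
#define LOG_BLOCK_HDR_SIZE  12  /* head: metadata area                 */
#define LOG_BLOCK_TRL_SIZE  4   /* tail: checksum                      */
#define LOG_BLOCK_DATA_SIZE (LOG_BLOCK_SIZE - LOG_BLOCK_HDR_SIZE - LOG_BLOCK_TRL_SIZE)

typedef struct {
    uint32_t number;           /* 4 bytes: block number = LSN/512 + 1        */
    uint16_t data_len;         /* 2 bytes: bytes used; 512 when full         */
    uint16_t first_rec_offset; /* 2 bytes: start of first record of new MTR  */
    uint32_t checkpoint_no;    /* 4 bytes: set only when the block is full   */
    uint8_t  data[LOG_BLOCK_DATA_SIZE]; /* 496-byte data area                */
    uint32_t checksum;         /* 4-byte tail                                */
} log_block_t;

/* Block number for a given LSN, per the formula Number = LSN/512 + 1. */
uint32_t log_block_number(uint64_t lsn) {
    return (uint32_t)(lsn / LOG_BLOCK_SIZE) + 1;
}
```

For example, LSN 1024 falls at the start of block number 3, consistent with two full 512-byte blocks preceding it.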
Compared with the traditional InnoDB system, the NVM medium dispenses with log files, so no log Checkpoint information is needed to help locate a recovery position within a log file, and no log metadata needs to be saved; the redundant, complex two-layer log storage system is thus simplified into a single-layer log storage system based on the NVM medium.
In some embodiments, since the NVM-MySQL architecture modifies the log storage architecture, the core log data structures in InnoDB, mainly log_t and log_group_t, need to be modified accordingly. log_t represents the whole log system, of which only one instance exists, and holds the pointers related to the log buffer; log_group_t represents a log file group, with as many instances as there are file groups, and holds the pointers related to the file header buffer and the log checkpoint buffer.
In the NVM-MySQL system, all log data is stored in the NVM medium, so log_group_t no longer needs to be kept; instead, attributes related to the NVM medium need to be added to log_t to manage the NVM log space. Optionally, the attributes to be added include at least: 1) ulint buf_free: the starting offset of the free space; 2) byte* buf: the starting write-location pointer of the log space; 3) ulint buf_size: the log space capacity parameter, controlled by the separate parameter innodb_log_buffer_size; 4) ulint max_buf_free: when buf_free exceeds this value, the log space is insufficient.
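A minimal C sketch of these added log_t attributes, assuming plausible typedefs for ulint and byte (the actual InnoDB definitions may differ):

```c
#include <assert.h>

typedef unsigned long ulint;
typedef unsigned char byte;

/* Only the NVM-related attributes listed above; the real log_t
   carries many more fields. */
typedef struct {
    ulint buf_free;     /* 1) starting offset of the free space            */
    byte *buf;          /* 2) starting write-location pointer of log space */
    ulint buf_size;     /* 3) capacity, from innodb_log_buffer_size        */
    ulint max_buf_free; /* 4) beyond this, the log space is insufficient   */
} log_t;

/* Returns nonzero when the log space is considered insufficient,
   per attribute 4) above. */
int log_space_is_short(const log_t *log) {
    return log->buf_free > log->max_buf_free;
}
```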
In step 301, the node device may query the number of occupied log blocks in the first storage medium, multiply that number by the storage capacity of a single log block to obtain the occupied capacity of the first storage medium, and subtract the occupied capacity from the total storage capacity of the first storage medium to obtain its remaining capacity.
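The remaining-capacity computation of step 301 can be sketched as follows; the function name is illustrative, and the 512-byte block size is taken from this description rather than from any real API:

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCK_SIZE 512u

/* remaining = total - occupied_blocks * block_size, clamped at zero. */
uint64_t nvm_remaining_capacity(uint64_t total_capacity,
                                uint64_t occupied_blocks) {
    uint64_t occupied = occupied_blocks * (uint64_t)LOG_BLOCK_SIZE;
    return total_capacity > occupied ? total_capacity - occupied : 0;
}
```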
302. The node device writes the uncached log of the target transaction to the first storage medium in response to the remaining capacity being greater than or equal to the amount of data of the uncached log of the target transaction.
In the above process, after obtaining the remaining capacity, the node device needs to determine whether it is sufficient (i.e., whether it satisfies the amount of data required by this log write). If the remaining capacity is greater than or equal to the data amount of the target transaction's uncached logs, the remaining capacity is sufficient and all the uncached logs of the target transaction are persisted to the first storage medium; otherwise, the following step 303 is executed.
Optionally, when writing the target transaction's uncached logs to the first storage medium, the node device may write them in units of log blocks starting from the last log block of the first storage medium, so that the uncached logs are written one log block at a time and the logs recorded in the log blocks preserve the old-to-new commit order.
Optionally, if the storage capacity of the last log block is smaller than the data amount of the uncached logs — meaning that uncached logs still remain when the last log block becomes full — then after the last log block is completely written, the node device creates another log block after it. Since this newly created log block follows the original last log block, it serves as the last log block in the next iteration, and the operation of writing uncached logs into the last log block is repeated until all the uncached logs have been written into the first storage medium.
In the above process, by writing in units of log blocks starting from the last log block — cyclically filling the last log block and then creating another — the uncached logs are written into the first storage medium one log block at a time. Therefore, even if the system goes down at any moment, the amount of data that must be recovered from the redo log is minimized.
In some embodiments, the node device may instead create in advance a number of log blocks matching the data amount of the uncached logs, ensuring that their combined capacity exceeds that data amount; it then writes the uncached logs starting from the last log block, and after the last log block is full continues writing the remaining uncached logs into the created log blocks. This avoids frequent log block creation operations and simplifies an otherwise tedious log-writing process.
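A hedged sketch of the block-by-block write loop described in the preceding paragraphs, counting how many log blocks a write touches; all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_DATA_SIZE 496u /* per-block data area, per Fig. 5 */

/* Returns the number of log blocks touched while writing `len` bytes of
   uncached log, given `free_in_last` free data bytes in the current
   last block: fill the last block first, then "create" new blocks. */
uint32_t write_uncached_log(uint32_t free_in_last, uint32_t len) {
    uint32_t blocks = 0;
    uint32_t take = len < free_in_last ? len : free_in_last;
    if (take > 0) { len -= take; blocks++; } /* fill the last block    */
    while (len > 0) {                        /* create blocks as needed */
        take = len < BLOCK_DATA_SIZE ? len : BLOCK_DATA_SIZE;
        len -= take;
        blocks++;
    }
    return blocks;
}
```

For example, a 596-byte log with 100 free bytes in the last block touches that block plus one newly created block.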
303. The node device, in response to the remaining capacity being smaller than the data amount of the target transaction's uncached logs, creates a log checkpoint, stores the service data generated by modification operations in the second storage medium to the third storage medium, and writes the uncached logs of the target transaction to the first storage medium.
The second storage medium is a volatile storage medium, and the third storage medium is a non-volatile storage medium. In the embodiments of the present application, the second storage medium serves as the memory and the third storage medium serves as the disk.
In the above process, after obtaining the remaining capacity, the node device needs to determine whether it is sufficient (i.e., whether it satisfies the amount of data required by this log write). If the remaining capacity is smaller than the data amount of the target transaction's uncached logs, the remaining capacity is insufficient: the node device must first, by creating a log checkpoint, flush the service data (i.e., dirty data) held in the memory's data buffer pool to disk and delete the log records that are no longer needed, freeing storage space, before writing all the target transaction's uncached logs into the first storage medium.
In some embodiments, the data buffer pool is located in the memory and is configured to store the service data generated by each transaction, and since the final execution result of each transaction may be commit completion or rollback completion, when the service data in the memory is persisted, only the service data generated based on the modification operation needs to be persisted, and for some service data generated by the rollback operation, the service data is directly discarded, so that invalid data is prevented from occupying a storage space of the disk.
The storage process of the uncached log is similar to the step 302, and is not described herein again.
Fig. 6 is a schematic flowchart of a log storage method provided in an embodiment of the present application. As shown at 600, taking the target transaction being an MTR as an example: in step 601, the MTR commits. In step 602, the node device detects whether the free portion of the log space (i.e., the remaining capacity of the NVM) is sufficient. If the free portion is sufficient, step 603 is executed: the uncached logs of the MTR are written into the last Block (log block) of the log space, stopping when that Block is full. Next, step 604: detect whether the MTR still has uncached logs. If so, step 605 is executed: create a new Block after the last Block in the log space, initialize it, and return to step 603; the flow ends when the MTR has no uncached logs left. Conversely, if the free portion is not sufficient, step 606 is executed: create a log checkpoint, flush the dirty data in the data buffer pool to disk, delete the logs corresponding to that dirty data from the log space to free sufficient space, and return to step 602.
In one example, the algorithm for clearing log free space in the NVM-MySQL architecture is shown in Table 1 below:
TABLE 1
(Table 1 appears only as an image, Figure BDA0002638802560000171, in the original publication; its text is not reproduced here.)
At the code level, the node device caches log records using the finish_write() method. In the traditional InnoDB system, if the log buffer space is detected to be insufficient before caching a log, InnoDB calls the log_write_up_to() method to persist the log data cached in the log buffer to disk; in the NVM-MySQL system, the checkpoint_for_NVM_MySQL() method is called instead to clear free space, i.e., free space is reclaimed by creating a log checkpoint. This method directly persists to disk the service data generated by modification operations (including updates, inserts, deletes, and the like) in the data buffer pool. Since that service data has been persisted, the corresponding logs no longer need to be kept: they can be deleted, or, based on an overwriting mechanism, the last Block of the log space can be moved directly to the starting position (i.e., the logs recorded in the last Block are copied into the first Block) and the starting write-location pointer of the first storage medium reset to point at the copied first Block, so that the next uncached log write directly overwrites from the starting position.
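The overwrite-based space reclamation described above — copying the last Block to the first and resetting the write pointer — can be modeled as follows. This is a simplified sketch with illustrative names, not the actual checkpoint_for_NVM_MySQL() implementation, and it assumes dirty pages have already been flushed:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512u

typedef struct {
    uint8_t *space;    /* the NVM log space        */
    uint32_t n_blocks; /* blocks currently in use  */
} nvm_log_t;

/* Reclaim the log space: copy the last (partially filled) block to
   block 0, then reset so new logs overwrite from the start. */
void checkpoint_reclaim(nvm_log_t *log) {
    if (log->n_blocks > 1) {
        memcpy(log->space,
               log->space + (size_t)(log->n_blocks - 1) * BLOCK_SIZE,
               BLOCK_SIZE);
    }
    log->n_blocks = 1; /* write pointer now points at block 0 */
}
```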
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
According to the method provided by the embodiments of the present application, when a target transaction commits, its uncached logs are persisted into the NVM storage medium (i.e., the first storage medium), so the logs are durably maintained at the moment the transaction commits. Because a log no longer needs to be stored twice, once in memory (the second storage medium) and once on disk (the third storage medium), the space occupied by log storage is greatly reduced: there is no need to build a two-layer log-buffer/log-file storage system as in traditional InnoDB, and the traditional log files are eliminated. Moreover, if the remaining capacity of the NVM storage medium is insufficient, free storage space can be reclaimed from it directly by creating a log checkpoint, without executing the tedious low-speed IO process of flushing logs from memory to disk; this improves the system performance of the database and avoids capping the throughput of the database system.
The above embodiment provides a policy for writing the uncached logs of the target transaction to the first storage medium: if the remaining capacity is sufficient, the uncached logs are written directly starting from the last log block; if it is insufficient, a log checkpoint must first be created to free space before the uncached logs are written. The following embodiment provides an initialization policy for the first storage medium, described in detail by taking the case where the target transaction is an MTR and the first storage medium is an NVM medium.
Fig. 7 is a flowchart of initialization of an NVM medium according to an embodiment of the present application, please refer to fig. 7, where the initialization process is applied to a node device, and the initialization process includes:
701. The node device acquires the storage capacity of the NVM medium.
In the traditional InnoDB system, the log module is initialized when the database starts; the main work is to create the related data structures and allocate memory space for the related buffers. InnoDB caches information such as log records and log meta-information in corresponding memory structures, so during initialization memory space must be allocated for the log records and the metadata buffers, and the first Block of the log buffer must be initialized.
In the NVM-MySQL architecture provided in the embodiments of the present application, because only the logs (i.e., log records) are stored in the NVM medium, initialization only needs to configure the log space capacity parameter of the database system so that it matches the storage capacity of the NVM medium, and to initialize the first Block. Here, "match" optionally means: the log space capacity parameter is less than or equal to the storage capacity of the NVM medium.
In step 701, the node device calculates the storage capacity (hereinafter also called the "log space size") of the NVM medium; this is computed when the database system is first started. Optionally, after the storage capacity of the NVM medium is calculated, it may be recorded in the ulint buf_size field of the core log data structure log_t for subsequent access.
702. The node device configures the log space capacity parameter of the database system as the storage capacity of the NVM medium.
In the above process, the node device allocates the NVM medium as the log space of the database system. Optionally, after initializing the log space in log_t, the NVM medium is mapped into the log space of the database system by mmap (a method for mapping a file into memory), thereby completing the space allocation of the NVM medium. The mmap call is located in the log_init() method.
703. The node device initializes the first Block in the NVM medium.
In the above process, the node device initializes the metadata area of the first Block in the NVM medium, sets its data area to empty, and sets the start position to the initialized first Block.
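Steps 701-703 can be sketched with POSIX mmap, using an ordinary file as a stand-in for the NVM device; all names here are illustrative, and the real log_init() is certainly more involved:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK_SIZE 512u

/* Map `path` (stand-in for the NVM device) at the given capacity and
   initialize the first Block: zeroed metadata/data, block number 1.
   Returns the mapped log space, or NULL on failure. */
uint8_t *nvm_log_init(const char *path, size_t capacity) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)capacity) != 0) { close(fd); return NULL; }
    uint8_t *space = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    close(fd);                       /* mapping stays valid after close */
    if (space == MAP_FAILED) return NULL;
    memset(space, 0, BLOCK_SIZE);    /* initialize the first Block      */
    uint32_t number = 1;             /* Number = LSN/512 + 1 at LSN 0   */
    memcpy(space, &number, sizeof number);
    return space;
}
```

On a real NVM device, a persistent-memory-aware mapping (e.g. via a DAX mount) would replace the plain file used here.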
In one example, the log module initialization algorithm of the NVM-MySQL architecture is shown in Table 2 below:
TABLE 2
(Table 2 appears only as an image, Figure BDA0002638802560000191, in the original publication; its text is not reproduced here.)
The embodiments of the present application provide the log initialization method of the NVM-MySQL system. Compared with the traditional InnoDB system, it does not need to allocate memory space separately for the log buffer and the file header buffer: the redundant two-layer log storage system is eliminated, and only the NVM medium needs to be allocated as the first storage medium, which greatly simplifies the log initialization process and improves the system performance of the database.
The above embodiment provides the log initialization policy of NVM-MySQL. On this basis, note that because logs in traditional InnoDB memory are not durable, they must be persisted to disk at an appropriate time; in the NVM-MySQL system, since the NVM medium itself is non-volatile — that is, the logs are already persistent in the NVM medium — no additional persistence of the logs is required, and no log metadata needs to be maintained. The log module of NVM-MySQL involves no low-speed disk IO at all and no extra persistence of log records, which saves the time overhead of maintaining a large volume of logs and avoids capping the throughput of the database system.
Fig. 8 is a flowchart of a periodic inspection process for NVM media according to an embodiment of the present application, please refer to fig. 8, which is described in detail by taking an example that a target transaction is MTR and a first storage medium is NVM media, where the periodic inspection process is applied to a node device, and includes the following steps:
801. At intervals of a first target duration, the node device creates a log checkpoint in response to the target condition being met.
Optionally, the target condition includes at least one of the following: the remaining capacity of the first storage medium is less than a capacity threshold; or the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the service data with the smallest timestamp in the second storage medium is greater than a first target threshold; or the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.
The first target threshold and the second target threshold are each any value greater than or equal to 0.
In the above process, because NVM-MySQL is a single-layer log storage architecture, when the free log space in the NVM medium (i.e., the remaining capacity) is insufficient — for example, the remaining capacity is smaller than the capacity threshold, or its proportion of the NVM medium's total storage capacity is smaller than a proportion threshold — a log checkpoint is created directly. After the log checkpoint is created, the system persists to disk the service data generated by modification operations (including updates, inserts, deletes, and the like) in the data buffer pool; it does not need to additionally persist the redo log of the target transaction (it is already persistent in the NVM medium and need not be persisted again). The starting write-location pointer and the last Block of the log space are then moved to the first Block. Since the service data generated by modification operations in the data buffer pool has been persisted, the earlier log records can be directly overwritten.
In some embodiments, for the NVM-MySQL architecture, the node device may periodically (every first target duration) call the log_check_flags() method to check whether the target condition is met. Specifically, the following three entries may be checked: whether the log space is insufficient, for example whether the remaining capacity of the first storage medium is less than the capacity threshold; whether service data generated by modification operations in the data buffer pool has gone unpersisted for too long, for example whether the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the service data with the minimum timestamp in the second storage medium is greater than the first target threshold; and whether a log checkpoint has not been created for too long, for example whether the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than the second target threshold. During the check, if any entry is satisfied, the node device determines that the target condition is met and executes the step of creating the log checkpoint; otherwise, if none of the three entries is satisfied, it determines that the target condition is not met.
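The three-entry check above can be sketched as follows. This is a hedged illustration only: the names CheckState, should_create_checkpoint, and the threshold fields are assumptions made for the sketch, not actual NVM-MySQL identifiers.

```python
from dataclasses import dataclass

@dataclass
class CheckState:
    remaining_capacity: int       # free bytes left in the NVM log space
    capacity_threshold: int
    max_lsn: int                  # largest log sequence number in the system
    oldest_dirty_lsn: int         # LSN of the oldest unpersisted modification
    last_checkpoint_lsn: int      # LSN recorded by the last log checkpoint
    first_target_threshold: int
    second_target_threshold: int

def should_create_checkpoint(s: CheckState) -> bool:
    """Return True when any of the three target-condition entries holds."""
    if s.remaining_capacity < s.capacity_threshold:                     # entry 1: log space short
        return True
    if s.max_lsn - s.oldest_dirty_lsn > s.first_target_threshold:      # entry 2: data unpersisted too long
        return True
    if s.max_lsn - s.last_checkpoint_lsn > s.second_target_threshold:  # entry 3: checkpoint too old
        return True
    return False
```

If any entry returns True, the node device proceeds to create the log checkpoint; otherwise the periodic check ends for this cycle.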
Optionally, when creating the log checkpoint, the checkpoint_for_NVM_MySQL() method is called to clean up the NVM medium and free idle space. The algorithm for creating the log checkpoint is similar to that in step 303 and is not described again here.
In one example, the periodic checking algorithm for the NVM-MySQL architecture is shown in Table 3 below:
TABLE 3
(The algorithm in Table 3 is provided as an image in the original publication and is not reproduced here.)
802. The node device stores the service data generated by modification operations in the memory to the disk.
In the above process, the node device stores the service data generated by modification operations in the data buffer area of the memory to the disk, which is a possible implementation of storing the service data generated by modification operations in the second storage medium to the third storage medium. That is, the data objects operated on by each transaction are persistently stored, and the logs corresponding to the read and write operations of each transaction do not need to be additionally persisted again, so the time overhead of the periodic check is greatly reduced.
803. The node device copies the log in the last log block stored in the NVM medium to the first log block in the first storage medium.
Step 803 is one of a series of related operations performed on the NVM medium after the log checkpoint is created. Optionally, after determining the last log block stored in the NVM medium, the node device reads the log stored in that block and copies it to the first log block, so that older logs can be overwritten directly; the following step 804 is then performed.
804. The node device moves the start write location pointer of the NVM medium to the first log block.
Optionally, the node device moves the start write location pointer of the NVM medium to the end of the last log record in the first log block, which is the start position of the next log storage process.
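Steps 803 and 804 can be modeled together as a reset of the circular log space: the tail block's log is copied to the first block, and the start-write pointer is moved to the end of the copied log. LogSpace and its fields below are illustrative assumptions, a minimal sketch rather than the actual on-medium layout.

```python
class LogSpace:
    """Toy model of the NVM log space as a sequence of fixed-size blocks."""

    def __init__(self, block_size: int, num_blocks: int):
        self.block_size = block_size
        self.blocks = [b"" for _ in range(num_blocks)]
        self.last_used = 0          # index of the last block holding logs
        self.write_pos = (0, 0)     # (block index, offset) of the next write

    def reset_after_checkpoint(self):
        tail_log = self.blocks[self.last_used]   # step 803: read the last log block
        self.blocks[0] = tail_log                # ...and copy it to the first block
        self.last_used = 0
        self.write_pos = (0, len(tail_log))      # step 804: pointer to end of copied log
```

After the reset, subsequent log writes start in the first block, overwriting the now-persisted records that follow it.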
Fig. 9 is a schematic flowchart of periodically checking the NVM medium according to an embodiment of the present application. Referring to fig. 9, in step 901, the node device determines whether the log space is sufficient. If the log space is insufficient, steps 904 to 906 are executed; if the log space is sufficient, step 902 is executed: determining whether there is a data modification in the data buffer pool that has gone unpersisted for too long (i.e., whether there is service data generated by a modification operation whose timestamp is too old). If there is such a modification, steps 904 to 906 are executed; otherwise, step 903 is executed: determining whether a log Checkpoint (Checkpoint) has not been created for too long. If no Checkpoint has been created for too long, steps 904 to 906 are executed; otherwise, the flow ends. In step 904, the service data generated by modification operations in the data buffer pool is persisted to the disk; in step 905, the log in the last Block of the log space is copied to the first Block; and in step 906, the start write location pointer of the log space is moved to the first Block.
In the embodiment of the application, compared with traditional InnoDB, in which both the service data and the logs need to be persisted during the periodic check, the NVM-MySQL system only needs to persist the service data: the logs were already persisted in the NVM medium when the transactions were committed and do not need to be persisted again during the periodic check, so the time overhead of the periodic check process is greatly reduced.
In the above embodiment, a periodic check policy for the NVM-MySQL architecture is provided: when any of the three entries is satisfied, it is determined that the target condition is met, and the related operations of creating a log checkpoint and persisting modified data are executed.
Fig. 10 is a flowchart of a disaster recovery method according to an embodiment of the present application, please refer to fig. 10, where the method includes the following steps:
1001. After the database system crashes and is restarted, the node device checks the log blocks starting from the first log block of the NVM medium, and determines the logs stored in the log blocks that pass the check as the logs to be recovered.
In the above process, because all log data to be recovered in the NVM-MySQL system is stored in the NVM medium, during disaster recovery only the log records in the NVM medium need to be checked and parsed; the log records are then saved in a hash table through the following step 1002, and finally the logs in the hash table are traversed and reapplied to complete the disaster recovery process.
Optionally, the node device modifies the recv_group_scan_log_recs() method so that its implementation parses and checks log records in batches directly from the NVM medium log_sys->buf and caches them in a hash table.
In some embodiments, the node device obtains the first Block of the NVM medium and determines whether the Checksum of the first Block is correct. If the Checksum is incorrect, this is the Block that was being written when the system crashed, all log records have been recovered, and the method exits. Otherwise, if the Checksum is correct, the log data in the Block is parsed, checked, and added to the hash table, and it is then determined whether the Block is full. If the Block is not full, this is the last Block of the NVM medium at the time of the crash, all log records have been recovered, and the method exits; otherwise, if the Block is full, the next Block is read and checked in the same way, in a loop. After all log blocks of the NVM medium have been traversed, it can be determined that all logs to be recovered are stored in the hash table, and the following step 1002 is performed.
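The block-by-block recovery scan above can be sketched as follows: walk the NVM log blocks from the first one, stop at the first bad checksum or the first non-full block, and collect parsed records into the hash table. The block layout, CRC32 checksum, and the toy parse_records record format below are simplifying assumptions, not InnoDB's actual on-medium format.

```python
import zlib

BLOCK_BODY = 496   # assumed payload bytes per full block (illustrative)

def scan_nvm_log(blocks, recv_hash):
    """blocks: list of (body_bytes, stored_checksum) in NVM order."""
    for body, stored_checksum in blocks:
        if zlib.crc32(body) != stored_checksum:
            break                       # half-written block at crash time: scan done
        for lsn, record in parse_records(body):
            recv_hash.setdefault(lsn, []).append(record)
        if len(body) < BLOCK_BODY:
            break                       # last, not-yet-full block: scan done

def parse_records(body):
    """Toy parser: newline-separated 'lsn:payload' records."""
    for line in body.split(b"\n"):
        if line:
            lsn, _, payload = line.partition(b":")
            yield int(lsn), payload
```

After the scan, step 1002 traverses recv_hash and redoes each cached record.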
1002. The node device stores the logs to be recovered in a hash table, traverses the hash table, and redoes the logs to be recovered stored in the hash table to obtain the recovered service data.
Step 1002 is a possible implementation of the node device performing data recovery based on the logs to be recovered. In traditional InnoDB, the logs to be recovered need to be read from the disk during disaster recovery, whereas in the embodiment of the present application the logs to be recovered can be read directly from the NVM medium. Since the IO speed of the NVM is much higher than that of the disk, the time consumed by disaster recovery is greatly reduced, and storing the logs in the hash table further accelerates the recovery.
Fig. 11 is a schematic flowchart of a disaster recovery method according to an embodiment of the present application. Referring to fig. 11, during disaster recovery, first, in step 1101, the logs in the NVM medium are parsed, checked, and placed in the hash table; second, in step 1102, the logs in the hash table are redone. This avoids the time-consuming reading of the logs to be recovered from the disk, so the time consumed by disaster recovery can be greatly reduced.
In one example, the disaster recovery strategy of NVM-MySQL is shown in Table 4 below:
TABLE 4
(The algorithm in Table 4 is provided as an image in the original publication and is not reproduced here.)
In the embodiment of the application, the provided disaster recovery method does not need to locate and then read a log file on the disk; it only needs to read the logs to be recovered directly from the NVM medium and redo them. Because the IO speed of the NVM is far higher than that of the disk, the time consumed by disaster recovery is greatly reduced.
In the above embodiments, optimization schemes of the NVM-MySQL system for each process of the log module in the traditional InnoDB system, such as log caching, log persistence, periodic inspection, and disaster recovery, were introduced. The performance optimization effects of the NVM-MySQL system on the above operations are analyzed next at a theoretical level.
In the present embodiment, a PCM (Phase Change Memory) storage medium is used as the representative NVM device for analysis. Assuming that 1 unit of data is read or written each time, let the write latency and read latency of DRAM be DRAM_W and DRAM_R respectively, the write latency and read latency of the PCM be PCM_W and PCM_R respectively, and the write latency and read latency of the HDD be HDD_W and HDD_R respectively. The relationship between the write latencies of the three can be expressed as: DRAM_W × 10^5 = PCM_W × 5000 = HDD_W. The relationship between the read latencies of the three can be expressed as: DRAM_R × 10^5 = PCM_R × 10^5 = HDD_R.
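The latency relations above can be encoded directly. Taking the DRAM write and read latencies as 1 unit each (an arbitrary normalization assumed for this sketch), the stated relations determine the other latencies, and the ratios used in the following subsections (PCM_W = 20 × DRAM_W, PCM_R = DRAM_R, HDD_R = 10^5 × PCM_R) fall out:

```python
# Normalize DRAM latencies to 1 unit each (assumption for illustration).
DRAM_W = 1.0
HDD_W = DRAM_W * 1e5        # DRAM_W x 10^5 = HDD_W
PCM_W = HDD_W / 5000        # PCM_W x 5000 = HDD_W  =>  PCM_W = 20 x DRAM_W

DRAM_R = 1.0
HDD_R = DRAM_R * 1e5        # DRAM_R x 10^5 = HDD_R
PCM_R = HDD_R / 1e5         # PCM_R x 10^5 = HDD_R  =>  PCM_R = DRAM_R
```

These constants reproduce the factors quoted later: log caching is 20× slower on NVM than in DRAM, while disaster-recovery reads are 10^5 times faster on NVM than on disk.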
1. Log caching phase
For ease of analysis, the case of insufficient log space is not considered during log caching; that scenario is analyzed uniformly in the section "3. Periodic inspection".
Assuming that the log data size written at one time is log, for one write to the log buffer, InnoDB only needs to write the log into memory, which takes: log × DRAM_W. NVM-MySQL needs to write the log into the NVM, which takes: log × PCM_W = 20 × log × DRAM_W.
In summary, NVM-MySQL takes more time when writing the log into the buffer, 20 times that of InnoDB.
2. Log persistence phase
Suppose the log buffer size that needs to be persisted is buf and the metadata size is metadata, where the log metadata needs to be updated once per persistence and buf is much larger than metadata.
When InnoDB persists the log, the cached log metadata and log records need to be written to the disk, which takes: metadata × HDD_W + buf × HDD_W ≈ buf × HDD_W. NVM-MySQL does nothing in this phase, so the actual time consumed is 0.
In summary, because NVM-MySQL does not need to persist log data, it has no time overhead in this phase compared with InnoDB.
3. Periodic inspection
Assume that at each periodic check, the probability that the space in the log buffer is insufficient is rate_Buf, and when the space is insufficient, the size of the buffer that needs to be persisted is buf; according to the conclusion of the previous subsection "2. Log persistence phase", the time consumed to persist the log buffer is about buf × HDD_W. The probability that the data buffer pool has not been flushed for too long is rate_Data, and the size of the data to be flushed is data. The probability that a log checkpoint needs to be created is rate_CP, and the size of the checkpoint information is checkpoint.
For ease of analysis, it is assumed that a full log buffer triggers one log and data persistence, namely rate_Buf = rate_Data = rate_CP = rate. In addition, buf and data are in practice much larger than checkpoint.
When InnoDB performs a periodic check, the log buffer, the data buffer pool, and the log checkpoint information may all be written to the disk, which takes:
rate × buf × HDD_W + rate × data × HDD_W + rate × checkpoint × HDD_W ≈ (buf + data) × rate × HDD_W
When NVM-MySQL detects that the space is insufficient, only the data buffer pool needs to be written to the disk and the log space write position reset, which takes: rate × data × HDD_W.
From the above analysis, the time overhead saved by NVM-MySQL is mainly the time consumed to persist the log buffer, and the specific optimization effect depends on the size ratio of buf to data.
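The two periodic-check cost formulas above can be written out as a small worked comparison, under the stated simplification rate_Buf = rate_Data = rate_CP = rate. The function names and example sizes are assumptions for illustration; sizes are in write units and latencies in the normalized units of the latency model.

```python
def innodb_check_cost(rate, buf, data, checkpoint, hdd_w):
    # InnoDB flushes the log buffer, the data pool, and checkpoint info to disk.
    return rate * (buf + data + checkpoint) * hdd_w

def nvm_mysql_check_cost(rate, data, hdd_w):
    # NVM-MySQL only flushes the data pool; the log is already persistent in NVM.
    return rate * data * hdd_w
```

The difference between the two is rate × (buf + checkpoint) × HDD_W ≈ rate × buf × HDD_W, i.e., essentially the cost of persisting the log buffer, matching the conclusion above.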
4. Disaster recovery
Assuming that there is only one log file group, let the size of the log checkpoint information be checkpoint and the size of the logs to be redone be redo, where redo is generally much larger than checkpoint.
During disaster recovery, InnoDB needs to read the log checkpoint information and log data from the disk, which takes: checkpoint × HDD_R + redo × HDD_R ≈ redo × HDD_R. NVM-MySQL only needs to read the log data from the NVM, which takes: redo × PCM_R.
In summary, NVM-MySQL moves the reading of log data during the disaster recovery phase from the disk to the NVM, and the NVM has a fast read speed similar to DRAM, thereby greatly reducing the time overhead.
5. System integrated optimization
When the InnoDB system runs, the main overhead of the log falls into two stages: 1) the normal operation stage; 2) the disaster recovery stage. Theoretical optimization analysis is therefore done for these two stages separately.
(I) Normal operation stage
When the system operates normally, log data persistence is mainly completed during the periodic checks, so the analysis only needs to combine the two aspects of log caching and periodic inspection. From the foregoing analysis, the results in Table 5 can be derived, where the difference column indicates how much less time NVM-MySQL consumes than InnoDB.
TABLE 5
(Table 5 is provided as an image in the original publication and is not reproduced here.)
Under normal conditions, the log buffer fills once every buf ÷ log caching operations, triggering one log and data persistence during the periodic check. Therefore, considered comprehensively, the theoretical time saved by NVM-MySQL during normal operation is:
−(buf ÷ log) × log × PCM_W + buf × HDD_W = buf × (HDD_W − PCM_W) ≈ buf × HDD_W
This optimization saves roughly a fraction buf ÷ (buf + data) of the original InnoDB time overhead.
In summary, the time actually saved by NVM-MySQL is the overhead of persisting logs through low-speed disk IO, and the specific optimization amplitude depends on the ratio between the size of the log buffer and the amount of modified data that needs to be persisted.
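The saving fraction stated above follows directly from the two cost expressions: roughly buf × HDD_W is saved out of an original (buf + data) × HDD_W per check. A minimal sketch, with a hypothetical helper name:

```python
def normal_run_saving_fraction(buf: float, data: float) -> float:
    """Fraction of InnoDB's normal-operation log overhead saved by NVM-MySQL.

    Derived from saved time buf x HDD_W over original (buf + data) x HDD_W;
    the HDD_W factor cancels.
    """
    return buf / (buf + data)
```

For example, if the log buffer and the data to be flushed are the same size, about half the original overhead is saved; the larger the log buffer relative to the modified data, the larger the saving.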
(II) disaster recovery stage
From the foregoing, a comparison of the time overhead of InnoDB and NVM-MySQL in the disaster recovery stage can be obtained, as shown in Table 6, where the difference column indicates how much less time NVM-MySQL consumes than InnoDB.
TABLE 6
Operation | InnoDB | NVM-MySQL | Difference
Disaster recovery | redo × HDD_R | redo × PCM_R | redo × HDD_R
From Table 6 it can be seen that when NVM-MySQL reads the logs during disaster recovery, no low-speed disk IO is needed, and the read speed of the NVM is much faster than that of the disk, so the time for reading logs during disaster recovery can be shortened by a factor of 10^5.
In the embodiment of the application, on one hand, NVM-MySQL saves the time overhead of persisting logs through low-speed disk IO in the transaction execution phase, improving the overall throughput of the system; the optimization amplitude depends on the ratio between the size of the log buffer and the amount of modified data that needs to be persisted each time. On the other hand, in the disaster recovery stage, NVM-MySQL does not need low-speed disk IO when reading the logs, and the read speed of the NVM is far faster than that of the disk, so the time for reading logs during disaster recovery is greatly shortened and is almost negligible compared with traditional InnoDB.
In some embodiments, in addition to storing all logs in the NVM, the logs can be directly merged into the data records, and the non-volatility of the NVM can be used to directly guarantee data persistence and remove the log module, which saves storage space and simplifies the disaster recovery process. Table 7 shows the differences and relationships between the scheme of storing only the Log in the NVM (corresponding to NVM-Data) and the scheme of storing the Data records and Log together in the NVM (corresponding to NVM-Log).
TABLE 7
(Table 7 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 7, in the NVM-Log scheme, since the data records and the Log are merged and then stored persistently in the NVM medium, the whole log module is effectively eliminated, the step of persisting the data records when committing a transaction is omitted, the transaction commit process is simplified, and storage space of the system can be saved.
Fig. 12 is a schematic structural diagram of a log storage device according to an embodiment of the present application, please refer to fig. 12, where the log storage device includes:
a determining module 1201, configured to determine, in response to a commit event of a target transaction, a remaining capacity of a first storage medium in a database system, where the first storage medium is a nonvolatile storage medium for storing a log;
a storage module 1202, configured to create a log checkpoint in response to the remaining capacity being smaller than the data amount of the uncached log of the target transaction, and store service data generated by modification operations in a second storage medium to a third storage medium, where the second storage medium is a volatile storage medium and the third storage medium is a non-volatile storage medium;
a writing module 1203, configured to write the uncached log of the target transaction to the first storage medium.
With the device provided by the embodiment of the application, by persisting the uncached log of the target transaction in the NVM storage medium (i.e., the first storage medium) when the target transaction is committed, the log can be persistently maintained at the same time as the transaction is committed. Because the log does not need to be stored twice, once in the memory (i.e., the second storage medium) and once on the disk (i.e., the third storage medium), the space occupied by log storage is greatly reduced, and there is no need to build a two-layer log buffer / log file storage system as in traditional InnoDB; the traditional log file is eliminated. For this reason, if the remaining capacity of the NVM storage medium is insufficient, free storage space can be cleared from the NVM storage medium by directly creating a log checkpoint, without executing the tedious low-speed IO flushing flow of writing logs from memory to disk, which improves the system performance of the database and avoids limiting the upper bound of the throughput of the database system.
In a possible implementation, based on the apparatus components of fig. 12, the writing module 1203 includes:
a writing unit, configured to write the uncached log in units of log blocks, starting from the last log block stored in the first storage medium.
In one possible implementation, the writing unit is configured to:
writing the uncached log into the last log block;
and if the storage capacity of the last log block is smaller than the data amount of the uncached log, after the last log block is fully written, creating another log block after the last log block.
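The writing unit's behavior above can be sketched as a simple append: fill the tail block first, then create further blocks for the remainder. The function name and list-of-bytearray representation are assumptions for this sketch.

```python
def write_uncached_log(blocks, log: bytes, block_size: int):
    """blocks: list of bytearray log blocks; the last entry is the tail block."""
    free = block_size - len(blocks[-1])
    blocks[-1] += log[:free]                   # fill the last log block first
    rest = log[free:]
    while rest:                                # create new blocks for the overflow
        blocks.append(bytearray(rest[:block_size]))
        rest = rest[block_size:]
```

Writing starts at the tail of the last stored block, so no space in the already-written blocks is wasted, and any overflow spills into freshly created blocks.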
In one possible implementation, the writing module 1203 performs the writing in response to the remaining capacity being greater than or equal to the data amount of the uncached log of the target transaction.
In a possible embodiment, based on the apparatus composition of fig. 12, the apparatus further comprises:
an obtaining module, configured to obtain a storage capacity of the first storage medium;
and the configuration module is used for configuring the log space capacity parameter of the database system as the storage capacity of the first storage medium.
In one possible implementation, the storage module 1202 is further configured to:
creating a log checkpoint every first target duration in response to the target condition being met;
storing the service data generated based on the modification operation in the second storage medium to the third storage medium;
copying the log in the last log block stored in the first storage medium to the first log block in the first storage medium;
and moving a starting write location pointer of the first storage medium to the first log block.
In one possible embodiment, the target condition comprises at least one of:
the remaining capacity of the first storage medium is less than a capacity threshold;
the difference value between the maximum log serial number in the database system and the log serial number corresponding to the service data with the minimum timestamp in the second storage medium is greater than a first target threshold value;
the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.
In a possible embodiment, based on the apparatus composition of fig. 12, the apparatus further comprises:
a recovery module, configured to, in response to the database system being restarted after a shutdown, acquire the logs to be recovered from the first storage medium and perform data recovery based on the logs to be recovered.
In a possible implementation, based on the apparatus composition of fig. 12, the recovery module includes:
and the checking unit is used for checking the log blocks from the first log block of the first storage medium and determining the log stored in the log block which passes the checking as the log to be recovered.
In a possible implementation, based on the apparatus composition of fig. 12, the recovery module includes:
and the redoing unit is used for storing the log to be recovered in the hash table, traversing the hash table, and redoing the log to be recovered stored in the hash table to obtain the recovered service data.
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the log storage device provided in the above embodiment, when storing the log, only the division of the functional modules is described as an example, and in practical applications, the function distribution can be completed by different functional modules according to needs, that is, the internal structure of the node device is divided into different functional modules to complete all or part of the functions described above. In addition, the log storage device and the log storage method provided by the above embodiment belong to the same concept, and specific implementation processes thereof are detailed in the log storage method embodiment and are not described herein again.
Fig. 13 is a schematic structural diagram of a node device according to an embodiment of the present application. Optionally, the device types of the node device 1300 include: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Node device 1300 may also be referred to by other names such as user equipment, portable node device, laptop node device, desktop node device, and so on.
In general, node apparatus 1300 includes: a processor 1301 and a memory 1302.
Optionally, processor 1301 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. Optionally, the processor 1301 is implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). In some embodiments, processor 1301 includes a main processor and a coprocessor, the main processor being a processor for Processing data in the wake state, also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1301 is integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 1301 further includes an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
In some embodiments, memory 1302 includes one or more computer-readable storage media, which are optionally non-transitory. Optionally, memory 1302 also includes high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1302 is used to store at least one program code for execution by the processor 1301 to implement the log storage method provided by various embodiments herein.
In some embodiments, the node apparatus 1300 may further include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral can be connected to the peripheral interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, touch display 1305, camera assembly 1306, audio circuitry 1307, positioning assembly 1308, and power supply 1309.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 are implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. Optionally, the radio frequency circuitry 1304 communicates with other node devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 also includes NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1305 is used to display a UI (User Interface). Optionally, the UI includes graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to capture touch signals on or over the surface of the display screen 1305. The touch signal can be input to the processor 1301 as a control signal for processing. Optionally, the display 1305 is also used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1305 is one, providing the front panel of node device 1300; in other embodiments, the display 1305 is at least two, and is disposed on different surfaces of the node apparatus 1300 or in a foldable design; in still other embodiments, the display 1305 is a flexible display disposed on a curved surface or on a folded surface of the node device 1300. Even more optionally, the display 1305 is arranged in a non-rectangular irregular figure, i.e. a shaped screen. Alternatively, the Display 1305 is made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the node device, and the rear camera is arranged on the back of the node device. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1306 also includes a flash. Optionally, the flash is a monochrome temperature flash, or a bi-color temperature flash. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and is used for light compensation under different color temperatures.
In some embodiments, audio circuitry 1307 includes a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, a plurality of microphones are provided at different locations of the node apparatus 1300, respectively. Optionally, the microphone is an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1301 or the radio frequency circuitry 1304 into sound waves. Alternatively, the speaker is a conventional membrane speaker, or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to human, but also the electric signal can be converted into a sound wave inaudible to human for use in distance measurement or the like. In some embodiments, audio circuit 1307 also includes a headphone jack.
The positioning component 1308 is used to locate the current geographic Location of the node apparatus 1300 for navigation or LBS (Location Based Service). Alternatively, the Positioning component 1308 is a Positioning component based on a GPS (Global Positioning System) of the united states, a beidou System of china, a graves System of russia, or a galileo System of the european union.
The power supply 1309 is used to supply power to the various components in the node device 1300. Optionally, the power supply 1309 is an alternating-current supply, a direct-current supply, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery supports wired or wireless charging. The rechargeable battery can also support fast-charge technology.
In some embodiments, node device 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyro sensor 1312, pressure sensor 1313, fingerprint sensor 1314, optical sensor 1315, and proximity sensor 1316.
In some embodiments, the acceleration sensor 1311 detects acceleration magnitudes on three coordinate axes of a coordinate system established with the node apparatus 1300. For example, the acceleration sensor 1311 is used to detect components of gravitational acceleration in three coordinate axes. Optionally, the processor 1301 controls the touch display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 is also used for acquisition of motion data of a game or a user.
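The landscape/portrait decision described above reduces to comparing the gravity components collected by the accelerometer. A hypothetical sketch (the axis convention and the threshold-free comparison are assumptions made for illustration, not taken from the patent):

```python
def choose_orientation(gx, gy):
    """Pick a display orientation from the gravity components on the
    device's x (short edge) and y (long edge) axes, an assumed convention:
    gravity mostly along y means the device is held upright."""
    return "portrait" if abs(gy) >= abs(gx) else "landscape"
```

For example, a device held upright reports gravity almost entirely on the y axis, so `choose_orientation(0.2, 9.8)` yields `"portrait"`.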
In some embodiments, the gyro sensor 1312 detects the body direction and the rotation angle of the node apparatus 1300, and the gyro sensor 1312 and the acceleration sensor 1311 cooperate to acquire the 3D motion of the user on the node apparatus 1300. Processor 1301 performs the following functions based on the data collected by gyroscope sensor 1312: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Optionally, the pressure sensors 1313 are disposed on the side bezel of the node device 1300 and/or underneath the touch screen display 1305. When the pressure sensor 1313 is disposed on the side frame of the node apparatus 1300, a holding signal of the user to the node apparatus 1300 can be detected, and the processor 1301 performs left-right hand recognition or shortcut operation according to the holding signal acquired by the pressure sensor 1313. When the pressure sensor 1313 is disposed at a lower layer of the touch display screen 1305, the processor 1301 controls an operability control on the UI interface according to a pressure operation of the user on the touch display screen 1305. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1314 is used for collecting the fingerprint of the user, and the processor 1301 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the identity of the user according to the collected fingerprint. When the identity of the user is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. Optionally, the fingerprint sensor 1314 is disposed on the front, back, or side of the node device 1300. When a physical key or vendor Logo is provided on node device 1300, fingerprint sensor 1314 can be integrated with the physical key or vendor Logo.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 controls the display brightness of the touch display screen 1305 according to the ambient light intensity collected by the optical sensor 1315: when the ambient light intensity is high, the display brightness of the touch display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 1305 is decreased. In another embodiment, the processor 1301 also dynamically adjusts the shooting parameters of the camera assembly 1306 based on the ambient light intensity collected by the optical sensor 1315.
The proximity sensor 1316, also known as a distance sensor, is typically disposed on the front panel of the node device 1300. The proximity sensor 1316 is used to collect the distance between the user and the front face of the node device 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front face of the node device 1300 gradually decreases, the processor 1301 controls the touch display screen 1305 to switch from the bright screen state to the dark screen state; when the proximity sensor 1316 detects that the distance gradually increases, the processor 1301 controls the touch display screen 1305 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 13 does not constitute a limitation of node apparatus 1300, and can include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory including at least one program code, which is executable by a processor in a terminal to perform the log storage method in the above embodiments. For example, the computer-readable storage medium includes a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, comprising one or more program codes stored in a computer-readable storage medium. One or more processors of the node device can read the one or more program codes from the computer-readable storage medium and execute them, so that the node device performs the log storage method in the above embodiments.
Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments can be implemented by hardware, or by a program instructing relevant hardware; optionally, the program is stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of log storage, the method comprising:
determining the remaining capacity of a first storage medium in the database system in response to a commit event of a target transaction, wherein the first storage medium is a nonvolatile storage medium for storing a log;
in response to the remaining capacity being smaller than the data amount of the uncached log of the target transaction, creating a log checkpoint, and storing service data generated based on a modification operation in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium, and the third storage medium is a nonvolatile storage medium;
writing the uncached log of the target transaction to the first storage medium.
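The commit path of claim 1 can be sketched as follows. This is a minimal, hypothetical illustration: the names (`LogStore`, `commit`, and so on) and the byte-counting capacity model are invented for the example, not taken from the patent.

```python
class LogStore:
    """Hypothetical stand-in for the first storage medium (nonvolatile log store)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total log space in bytes
        self.logs = []            # persisted log records
        self.used = 0

    def remaining_capacity(self):
        return self.capacity - self.used

    def write(self, record):
        self.logs.append(record)
        self.used += len(record)


def commit(txn_log, log_store, buffer_pool, data_store):
    """On a commit event: if the uncached log does not fit in the remaining
    log capacity, create a log checkpoint (flush dirty service data from the
    volatile second medium to the nonvolatile third medium and reclaim the
    log space), then persist the uncached log."""
    if log_store.remaining_capacity() < len(txn_log):
        data_store.update(buffer_pool)  # persist service data
        buffer_pool.clear()
        log_store.logs.clear()          # pre-checkpoint logs are no longer needed
        log_store.used = 0
    log_store.write(txn_log)
```

With a 10-byte log store, an 8-byte commit fits directly; a subsequent 5-byte commit exceeds the 2 remaining bytes, so it first checkpoints and then writes.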
2. The method of claim 1, wherein writing the uncached log of the target transaction to the first storage medium comprises:
writing the uncached log in units of log blocks starting from the last log block already stored in the first storage medium.
3. The method of claim 2, wherein writing the uncached log in log block units starting from a last log block already stored in the first storage medium comprises:
writing the uncached log into the last log block;
and if the storage capacity of the last log block is smaller than the data volume of the uncached log, creating another log block behind the last log block after the last log block is fully written.
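Claims 2 and 3 describe appending in units of log blocks. A minimal sketch, in which the fixed `BLOCK_SIZE` and the list-of-bytearray representation are assumptions made for illustration:

```python
BLOCK_SIZE = 8  # illustrative block capacity in bytes


def write_log_blocks(blocks, payload):
    """Append `payload` starting from the last log block; when the last
    block fills up, create another log block behind it and continue."""
    if not blocks:
        blocks.append(bytearray())
    data = memoryview(payload)
    while data:
        last = blocks[-1]
        free = BLOCK_SIZE - len(last)
        if free == 0:
            blocks.append(bytearray())  # last block is full: create a new one
            continue
        last.extend(data[:free])        # fill the last block first
        data = data[free:]
    return blocks
```

Starting from a 6-byte last block, a 5-byte write fills the remaining 2 bytes and spills the other 3 bytes into a newly created block.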
4. The method of claim 1, wherein after determining the remaining capacity of the first storage medium in the database system in response to the commit event of the target transaction, the method further comprises:
in response to the remaining capacity being greater than or equal to the amount of data of the uncached log of the target transaction, performing an operation of writing the uncached log of the target transaction to the first storage medium.
5. The method of claim 1, wherein prior to determining the remaining capacity of the first storage medium in the database system in response to the commit event of the target transaction, the method further comprises:
acquiring the storage capacity of the first storage medium;
configuring a log space capacity parameter of the database system to a storage capacity of the first storage medium.
6. The method of claim 1, further comprising:
at every interval of a first target duration, creating a log checkpoint in response to a target condition being met;
storing the service data generated based on the modification operation in the second storage medium to the third storage medium;
copying the log in the last log block stored in the first storage medium to the first log block in the first storage medium;
moving a starting write location pointer of the first storage medium to the first log block.
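The checkpoint procedure of claim 6 can be sketched as follows: copying the last log block to the first block and resetting the start-write pointer effectively recycles the log space. The in-memory structures and names below are hypothetical.

```python
def create_checkpoint(log_blocks, buffer_pool, data_store):
    """Persist service data from the volatile second medium to the
    nonvolatile third medium, copy the log in the last log block to the
    first log block, and move the start-write pointer there."""
    data_store.update(buffer_pool)   # flush dirty service data
    buffer_pool.clear()
    tail = log_blocks[-1]            # the last log block still holds live log
    log_blocks.clear()
    log_blocks.append(tail)          # it becomes the first log block
    start_write_ptr = 0              # writing resumes at the first block
    return start_write_ptr
```

After the call, only the former tail block remains at position 0 and new log writes overwrite the reclaimed space behind it.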
7. The method of claim 6, wherein the target condition comprises at least one of:
the remaining capacity of the first storage medium is less than a capacity threshold;
the difference value between the maximum log serial number in the database system and the log serial number corresponding to the service data with the minimum timestamp in the second storage medium is greater than a first target threshold value;
and the difference value between the maximum log serial number in the database system and the log serial number of the last log checkpoint in the first storage medium is greater than a second target threshold value.
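The three alternative triggers in claim 7 amount to a simple disjunction; a sketch with parameter names invented for illustration:

```python
def should_checkpoint(remaining, capacity_threshold,
                      max_lsn, oldest_dirty_lsn, last_ckpt_lsn,
                      first_threshold, second_threshold):
    """A checkpoint is triggered when any one of claim 7's conditions holds:
    log space is nearly full, the oldest dirty service data lags too far
    behind the newest log, or the last checkpoint is too old."""
    return (remaining < capacity_threshold
            or max_lsn - oldest_dirty_lsn > first_threshold
            or max_lsn - last_ckpt_lsn > second_threshold)
```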
8. The method of claim 1, further comprising:
in response to the database system restarting after a crash, acquiring the log to be recovered from the first storage medium, and performing data recovery based on the log to be recovered.
9. The method of claim 8, wherein the retrieving the log to be recovered from the first storage medium comprises:
verifying log blocks starting from the first log block of the first storage medium, and determining the logs stored in the log blocks that pass verification as the logs to be recovered.
10. The method of claim 8, wherein the performing data recovery based on the log to be recovered comprises:
storing the logs to be recovered in a hash table, traversing the hash table, and redoing the logs stored in the hash table to obtain the recovered service data.
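Claims 8 to 10 together describe crash recovery. The sketch below assumes a per-block CRC32 as the verification scheme and a simple `key=value` log record format; both are invented for illustration and are not specified by the patent.

```python
import zlib


def recover(log_blocks):
    """Verify log blocks starting from the first block, keep only logs from
    blocks that pass verification, store them in a hash table, then
    traverse the table and redo each logged write to rebuild service data."""
    hash_table = {}
    for payload, checksum in log_blocks:          # blocks in storage order
        if zlib.crc32(payload) != checksum:       # block fails verification
            continue                              # its logs are not recovered
        for record in payload.decode().split(";"):
            key, _, value = record.partition("=")
            hash_table[key] = value               # later writes supersede earlier ones
    recovered = {}
    for key, value in hash_table.items():         # redo phase
        recovered[key] = value
    return recovered
```

A block whose checksum does not match is skipped, so a torn write at the log tail cannot corrupt the rebuilt service data.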
11. A log storage apparatus, the apparatus comprising:
the determining module is used for responding to a commit event of a target transaction and determining the residual capacity of a first storage medium in the database system, wherein the first storage medium is a nonvolatile storage medium used for storing a log;
the storage module is used for creating a log checkpoint in response to the remaining capacity being smaller than the data volume of the uncached log of the target transaction, and storing service data generated based on a modification operation in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium, and the third storage medium is a nonvolatile storage medium;
a write module to write the uncached log of the target transaction to the first storage medium.
12. The apparatus of claim 11, wherein the write module comprises:
and the writing unit is used for writing the uncached log by taking the log block as a unit from the last log block stored in the first storage medium.
13. The apparatus of claim 12, wherein the write unit is configured to:
writing the uncached log into the last log block;
and if the storage capacity of the last log block is smaller than the data volume of the uncached log, creating another log block behind the last log block after the last log block is fully written.
14. A node device, characterized in that the node device comprises one or more processors and one or more memories, the one or more memories having at least one program code stored therein, the at least one program code being loaded and executed by the one or more processors to implement the log storage method according to any one of claims 1 to 10.
15. A storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to implement the log storage method according to any one of claims 1 to 10.
CN202010833472.0A 2020-08-18 2020-08-18 Log storage method, device, node equipment and storage medium Active CN112035410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010833472.0A CN112035410B (en) 2020-08-18 2020-08-18 Log storage method, device, node equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112035410A true CN112035410A (en) 2020-12-04
CN112035410B CN112035410B (en) 2023-08-18

Family

ID=73578006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010833472.0A Active CN112035410B (en) 2020-08-18 2020-08-18 Log storage method, device, node equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112035410B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101903866A (en) * 2007-11-21 2010-12-01 提琴存储器公司 Method and system for storage of data in non-volatile media
US10387399B1 (en) * 2013-11-01 2019-08-20 Amazon Technologies, Inc. Efficient database journaling using non-volatile system memory
CN104516959A (en) * 2014-12-18 2015-04-15 杭州华为数字技术有限公司 Method and device for managing database logs
CN104778126A (en) * 2015-04-20 2015-07-15 清华大学 Method and system for optimizing transaction data storage in non-volatile memory
US20160344834A1 (en) * 2015-05-20 2016-11-24 SanDisk Technologies, Inc. Transaction log acceleration
US20190188091A1 (en) * 2017-12-15 2019-06-20 Microsoft Technology Licensing, Llc Write-ahead style logging in a persistent memory device
WO2019118154A1 (en) * 2017-12-15 2019-06-20 Microsoft Technology Licensing, Llc Write-ahead style logging in a persistent memory device
CN109976664A (en) * 2017-12-28 2019-07-05 北京忆恒创源科技有限公司 The daily record data tissue of solid storage device
CN111414320A (en) * 2020-02-20 2020-07-14 上海交通大学 Method and system for constructing disk cache based on nonvolatile memory of log file system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157518A (en) * 2021-02-24 2021-07-23 中国建设银行股份有限公司 Equipment access method and device
CN115150348A (en) * 2021-03-30 2022-10-04 奇安信科技集团股份有限公司 Mail attachment restoring method and system
CN113220729A (en) * 2021-05-28 2021-08-06 网易(杭州)网络有限公司 Data storage method and device, electronic equipment and computer readable storage medium
CN113268546A (en) * 2021-06-15 2021-08-17 中国电子科技网络信息安全有限公司 Block chain account book data capture analysis method
CN113806322A (en) * 2021-09-30 2021-12-17 北京蓝海医信科技有限公司 Multi-type log recording system
CN113821382A (en) * 2021-11-24 2021-12-21 西安热工研究院有限公司 Real-time database data processing method, system and equipment
CN113821382B (en) * 2021-11-24 2022-03-01 西安热工研究院有限公司 Real-time database data processing method, system and equipment
CN114647624A (en) * 2022-05-11 2022-06-21 成都云祺科技有限公司 Method, system and storage medium for capturing database consistent point in block-level CDP
CN116010430A (en) * 2023-03-24 2023-04-25 杭州趣链科技有限公司 Data recovery method, database system, computer device, and storage medium
CN116010430B (en) * 2023-03-24 2023-06-20 杭州趣链科技有限公司 Data recovery method, database system, computer device, and storage medium
CN116501744A (en) * 2023-06-30 2023-07-28 中国人民解放军国防科技大学 Automatic form building and warehousing method and device for simulation data and computer equipment
CN116501744B (en) * 2023-06-30 2023-09-19 中国人民解放军国防科技大学 Automatic form building and warehousing method and device for simulation data and computer equipment

Also Published As

Publication number Publication date
CN112035410B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN112035410B (en) Log storage method, device, node equipment and storage medium
CN112463311B (en) Transaction processing method and device, computer equipment and storage medium
CN109086388B (en) Block chain data storage method, device, equipment and medium
WO2018144255A1 (en) Systems, methods, and computer-readable media for a fast snapshot of application data in storage
CN116457760A (en) Asynchronous cross-region block volume replication
EP3451193B1 (en) Electronic device and file data journaling method of electronic device
CN114244595B (en) Authority information acquisition method and device, computer equipment and storage medium
CN114925084B (en) Distributed transaction processing method, system, equipment and readable storage medium
CN111177143A (en) Key value data storage method and device, storage medium and electronic equipment
CN112162843A (en) Workflow execution method, device, equipment and storage medium
CN109783504A (en) Method of data synchronization, device, computer equipment and storage medium
CN115114344B (en) Transaction processing method, device, computing equipment and storage medium
CN112000426A (en) Data processing method and device
CN113704361B (en) Transaction execution method and device, computing equipment and storage medium
CN116561137A (en) Transaction processing method, device, computer equipment and storage medium
US10127270B1 (en) Transaction processing using a key-value store
JP2021515304A (en) Methods, computer programs, and equipment for post-failure recovery using checkpoints in key-value stores in a time-series log structure in the system.
CN112084157A (en) File recovery method and device, computer equipment and storage medium
US10496493B1 (en) Method and system for restoring applications of particular point in time
CN115098537B (en) Transaction execution method and device, computing equipment and storage medium
US20210064576A1 (en) Indexing splitter for any pit replication
US10896201B2 (en) Synchronization of block based volumes
CN115113989B (en) Transaction execution method, device, computing equipment and storage medium
US10372683B1 (en) Method to determine a base file relationship between a current generation of files and a last replicated generation of files
US11940878B2 (en) Uninterrupted block-based restore operation using a read-ahead buffer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant