CN113704261A - Key value storage system based on cloud storage


Info

Publication number: CN113704261A
Authority: CN (China)
Prior art keywords: SSTable, data, storage system, cloud, key
Legal status: Pending
Application number: CN202110989569.5A
Other languages: Chinese (zh)
Inventors: 崔秋 (Cui Qiu), 唐刘 (Tang Liu), 徐鹏 (Xu Peng)
Current Assignee: Pingkai Star Beijing Technology Co Ltd
Original Assignee: Pingkai Star Beijing Technology Co Ltd
Application CN202110989569.5A filed by Pingkai Star Beijing Technology Co Ltd
Priority to CN202110989569.5A
Publication of CN113704261A

Classifications

    • G06F 16/2246: Information retrieval of structured data; indexing structures; trees, e.g. B+-trees
    • G06F 11/1464: Error detection or correction by redundancy; management of the backup or restore process for networked environments
    • G06F 16/2237: Indexing structures; vectors, bitmaps or matrices
    • G06F 16/24552: Query execution; database cache management
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system

Abstract

The application discloses a key-value storage system built on a cloud storage system, together with methods, based on that system, for processing data read requests, processing data write requests, and recovering from failures. The key-value storage system runs on a hybrid cloud storage system composed of cloud local storage and cloud remote storage devices, and comprises an LSM-tree storage architecture with fast indexing capability, a metadata block cache, a data block cache, and a memory cache. The system makes full use of both the cloud local storage (low operation latency, but small capacity and high usage cost) and the cloud remote storage (high operation latency, but large capacity and low usage cost) in the cloud storage system, realizing a fast and efficient LSM-tree storage architecture that controls usage cost, ensures efficient reading and writing of data, and greatly accelerates failure recovery.

Description

Key value storage system based on cloud storage
Technical Field
The application belongs to the technical field of information storage, and in particular relates to a key-value storage system built on cloud storage, together with methods, based on that system, for processing data read and write requests and for failure recovery.
Background
The explosive growth of networks and applications has led to an exponential increase in data volume, making the cost-effectiveness of data storage one of the main design goals of underlying databases.
A Log-Structured Merge-Tree (LSM-tree) is a database storage engine designed specifically for key-value storage and is widely used in key-value databases. It trades a loss of some read performance for better write performance: inserted data is first written into a memory buffer, and once the buffer is full its contents are written out to disk. By merging random writes into sequential writes, the LSM-tree avoids random write operations on the disk, but when looking up data it must search the memory first and then the disk files. The LSM-tree structure is widely used in non-relational (NoSQL) databases; typical key-value databases built on it as a storage engine include RocksDB, LevelDB, BigTable, and Dynamo.
With the advent of cloud technology, cloud storage is becoming more popular among enterprises seeking efficiency, disaster recovery, and agility, and leasing cloud storage resources can effectively reduce data storage costs. Some studies have found that a hybrid structure combining cloud-local high-performance SSD storage with cloud remote storage devices is a more efficient solution, because cloud local storage provides faster data read and write access than cloud remote storage devices. More and more storage systems therefore integrate cloud local storage and cloud remote storage devices (or services) to obtain the benefits of both. Building a fast and efficient LSM-tree storage architecture on such a hybrid structure is challenging, however, because performance and cost are not balanced between cloud local storage and cloud remote storage. For example, on Amazon's AWS EC2, the cloud-local high-performance SSD storage and the cloud remote storage device gp2 show very different read-write performance and usage cost: gp2 can cut usage cost by about 80%, but its read-write throughput is low and it is subject to bandwidth and IOPS (I/O operations per second) limits. When an LSM-tree storage architecture of the same size is built on gp2 alone, read performance drops by 98% and write performance by 40%. It is therefore very important to provide an LSM-tree storage architecture that can use cloud local storage and cloud remote storage together while giving priority to data read-write performance.
Summary of the Application
In view of the defects and improvement needs of the prior art, this application provides a key-value storage system built on cloud storage, together with data read, data write, and failure recovery methods based on that system. The aim is to build a fast and efficient LSM-tree storage architecture on cloud storage that exploits both the high read-write performance and low operation latency of cloud local storage and the large capacity and low price of cloud remote storage devices, while also drawing on the safety and agility of cloud storage to achieve fast failure recovery.
To achieve the above object, a first aspect of the present application provides a key-value storage system, including: a memory storing computer-readable instructions; one or more processors that execute the computer-readable instructions, which control the one or more processors to perform operations; an LSM-tree storage architecture; a memory cache; a metadata block cache; and a data block cache. The key-value storage system is based on a hybrid cloud storage system composed of a cloud local storage device or service and a cloud remote storage device or service.
Further, the LSM-tree storage architecture includes layers L0 through Li deployed on the cloud local storage, with layer L(i+1) and all layers below it deployed on the cloud remote storage. Typically, i is 1 or 2.
Further, the LSM-tree storage architecture includes multiple layers, where each layer stores data in units of SSTables (Sorted Strings Tables), and all SSTables are organized as a tree.
Further, the metadata block cache is deployed on the cloud local storage to store metadata blocks corresponding to each SSTable on the cloud remote storage device, and the metadata blocks are used for indexing data stored by each SSTable on the cloud remote storage device.
Further, the key-value storage system encodes the metadata block using MASHtree encoding.
Further, the data block cache is deployed on the cloud local storage for storing hot data blocks corresponding to SSTables on the cloud remote storage device.
Further, the space of the data block cache is divided into a number of bucket containers (buckets), where each bucket container holds data blocks from SSTables on the cloud remote storage device, and the blocks are managed and accessed through a lookup method keyed on the basic information of the SSTable each block belongs to.
Further, each bucket container is divided into blocks of a specific size, and the blocks are managed by a Bitmap (Bitmap).
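As a non-limiting illustration of this cache layout, the following C++ sketch shows one possible representation of bucket containers whose fixed-size blocks are tracked by a bitmap; the names, the 4 KB block size, and the bucket capacity are assumptions made for the sketch, not taken from the disclosure.

    // Non-limiting sketch: the data block cache as an array of bucket
    // containers, each holding fixed-size block slots tracked by a bitmap.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t kBlockSize = 4096;        // assumed 4 KB slot size
    constexpr std::size_t kBlocksPerBucket = 1024;  // assumed bucket capacity

    struct Bucket {
        std::vector<std::uint8_t> space;  // raw cache space of this bucket
        std::vector<bool> bitmap;         // true = slot in use

        Bucket()
            : space(kBlocksPerBucket * kBlockSize),
              bitmap(kBlocksPerBucket, false) {}

        // Find a free slot via the bitmap; returns -1 when the bucket is full.
        int allocateBlock() {
            for (std::size_t i = 0; i < bitmap.size(); ++i)
                if (!bitmap[i]) { bitmap[i] = true; return static_cast<int>(i); }
            return -1;
        }

        // Invalidation only flips the bitmap; no data needs to be moved.
        void freeBlock(int slot) { bitmap[static_cast<std::size_t>(slot)] = false; }
    };

    struct DataBlockCache {
        std::vector<Bucket> buckets;
        explicit DataBlockCache(std::size_t nBuckets) : buckets(nBuckets) {}

        // Blocks are routed to a bucket by the number of the SSTable file they
        // belong to, which is what later permits per-bucket locking.
        Bucket& bucketFor(std::uint64_t sstableFileNumber) {
            return buckets[sstableFileNumber % buckets.size()];
        }
    };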
To achieve the above object, according to a second aspect of the present application, there is provided a data read request processing method based on the key-value storage system of the first aspect, the method including the steps of:
the key-value storage system receiving a request to read a piece of data;
starting from layer L0 of the LSM-tree storage architecture of the key-value storage system, searching layer by layer downward through the SSTables of each layer (or the metadata blocks corresponding to them) for the data, where the SSTables in layer L0 are searched one by one to find and read the data;
if the data is not found in layer L0, searching and reading the SSTables of layers L1 through Li of the LSM-tree storage architecture by key, where each SSTable in layers L1 through Li has a minimum key and a maximum key;
if the data is not found in layers L0 through Li, searching by key for the SSTable containing the data among the metadata blocks of the metadata block cache in the cloud local storage used by the key-value storage system, where the metadata blocks corresponding to all SSTables in the cloud remote storage used by the key-value storage system are cached in the metadata block cache;
if a target SSTable containing the data is found among the metadata blocks in the metadata block cache, and the data block of the target SSTable that contains the target data is cached in the data block cache of the key-value storage system, reading the data from that data block in the data block cache, where the metadata block records the index information of the target SSTable;
if the target SSTable is found in the metadata block cache but the data block of the target SSTable that contains the target data is not cached in the data block cache, locating the target SSTable in the cloud remote storage, caching the data block containing the target data from the cloud remote storage into the cloud local storage, and then reading the data.
In order to achieve the above object, according to a third aspect of the present application, there is provided a data write request processing method based on the key-value storage system of the first aspect, including the following steps:
the key-value storage system receives the data write request and determines its purpose;
if the data write request is a flush operation that writes an SSTable, generated from target data in the memory cache of the key-value storage system, into layer L0 of its LSM-tree storage architecture, executing the flush operation on the cloud local storage used by the key-value storage system to write the SSTable generated from the target data into layer L0;
if the data write request is an SSTable compaction operation between adjacent layers among layers L0 through Li of the LSM-tree storage architecture, executing the compaction on an SSTable in layer Lk (0 ≤ k < i) and the SSTables in layer L(k+1) whose key ranges overlap it, generating a new SSTable, deleting the compacted SSTables in layers Lk and L(k+1), and writing the new SSTable into layer L(k+1);
if the data write request is a compaction operation between an SSTable of layer Li and an SSTable of layer L(i+1) whose key range overlaps it, writing the new SSTable generated by the compaction into the cloud remote storage used by the key-value storage system, copying the metadata block corresponding to the new SSTable into the metadata block cache, and deleting the compacted SSTable of layer Li and the overlapping SSTable of layer L(i+1), together with that SSTable's metadata block in the metadata block cache and its hot data blocks in the data block cache;
if the data write request is a compaction operation between two adjacent layers of the LSM-tree storage architecture that both reside in the cloud remote storage used by the key-value storage system, writing the new SSTable generated by the compaction into the cloud remote storage, copying the metadata block corresponding to the new SSTable into the metadata block cache, deleting the metadata blocks of the SSTables consumed by the compaction from the metadata block cache, and deleting their data blocks from the data block cache;
if the data write request targets a data block evicted from the memory cache and the evicted block is a hot data block located in the cloud remote storage, writing the evicted block into the data block cache of the key-value storage system, where the evicted block is accessed less often than the other blocks in the memory cache of the key-value storage system but more often than the other blocks in the cloud remote storage.
To achieve the above object, according to a fourth aspect of the present application, there is provided a failure recovery method for the key-value storage system of the first aspect, the method including the steps of:
after the key-value storage system fails, connecting the cloud remote storage used by the failed key-value storage system to a standby cloud storage system, where the standby cloud storage system is provided with cloud local storage;
the standby cloud storage system reads, from the cloud remote storage used by the failed key-value storage system, the manifest file and the log files related to each SSTable in layers L0 through Li of the failed system's LSM-tree storage architecture, obtaining the basic information and the corresponding key list of each such SSTable, where the failed key-value storage system, when generating each SSTable, recorded the SSTable's basic information in the manifest file and its key list in the log files;
based on the basic information and key lists of the SSTables of layers L0 through Li, the standby cloud storage system scans all the log files with multiple threads, starting from the most recently generated log file, to obtain the key-value pairs belonging to each SSTable in layers L0 through Li, where the failed key-value storage system, when writing data into each SSTable, recorded the data's key-value pairs in the log files;
once one thread of the multi-threaded scan has obtained all the key-value pairs of one SSTable, the standby cloud storage system rebuilds the backup SSTable corresponding to that SSTable;
after each SSTable in layers L0 and L1 of the failed system's LSM-tree storage architecture has had its corresponding backup SSTable rebuilt on the standby cloud storage system, the standby cloud storage system replays the log files.
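As a non-limiting illustration of this recovery flow, the following C++ sketch rebuilds one backup SSTable per worker thread from the manifest and log files; since the on-disk formats are not specified in the disclosure, the parsing helpers below are stubs with hypothetical names.

    #include <map>
    #include <string>
    #include <thread>
    #include <vector>

    struct SSTableInfo {
        std::string name;               // basic info recorded in the manifest file
        std::vector<std::string> keys;  // key list recorded in the log files
    };

    // Stubs standing in for the unspecified on-disk formats: a real version
    // would parse the manifest and log files read from cloud remote storage,
    // scanning logs from the most recently generated one first.
    std::vector<SSTableInfo> readManifest(const std::string& /*remotePath*/) { return {}; }
    std::map<std::string, std::string> scanLogsForTable(
            const std::string& /*remotePath*/, const SSTableInfo& /*info*/) { return {}; }
    void writeBackupSSTable(const std::string& /*name*/,
                            const std::map<std::string, std::string>& /*kvs*/) {}

    void recover(const std::string& remotePath) {
        // 1. Basic info and key lists of every SSTable in layers L0..Li.
        const std::vector<SSTableInfo> tables = readManifest(remotePath);

        // 2. Scan the log files with multiple threads; each thread rebuilds one
        //    backup SSTable once it has collected all of its key-value pairs.
        std::vector<std::thread> workers;
        workers.reserve(tables.size());
        for (const SSTableInfo& t : tables) {
            workers.emplace_back([remotePath, t] {  // capture by value for thread safety
                writeBackupSSTable(t.name, scanLogsForTable(remotePath, t));
            });
        }
        for (std::thread& w : workers) w.join();

        // 3. With every backup SSTable rebuilt, the log files are replayed last.
    }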
The technical solutions provided by this application can achieve the following beneficial effects:
1. The application provides a key-value storage system built on cloud storage that achieves both high performance and cost effectiveness. The bottom layers of the LSM-tree storage architecture (layers L0 through Li, with i usually 1 or 2) are placed on the system's cloud local storage to exploit its low latency, while the remaining layers are placed on the system's cheaper cloud remote storage device or service, reducing data storage cost. The system also caches the metadata blocks and frequently operated data blocks of the SSTables held on the cloud remote storage device or service, reducing remote accesses and thereby cutting tail latency and improving query efficiency. At the same time, keeping most of the data on the cloud remote storage device stores it more safely and stably, and data recovery can be completed quickly after a data server fails.
2. The application provides a data reading request and data writing request processing method for a key value storage system, which is constructed based on cloud storage and realizes high performance and cost efficiency, and the method converts the integral access operation of an LSM-tree storage architecture deployed on cloud local storage and cloud remote storage used by the key value storage system into the access of the LSM-tree storage architecture (namely, an L0 layer to a Li layer, a metadata block cache and a data block cache of the LSM-tree) deployed on the cloud local storage. The data block where the searched data is located is not cached, and the data is required to be accessed to the cloud remote storage device or the service to obtain the data.
3. The application provides a fault recovery method of a key value storage system constructed based on cloud storage. When the key value storage system fails, the cloud remote storage device or service of the key value storage system can be quickly connected to the standby data storage system in a normal state without waiting for slow recovery of the key value storage system, and data recovery is quickly completed in a multi-thread parallel recovery mode based on log files and manifest files.
Drawings
FIG. 1 illustrates an example of a network environment suitable for use with embodiments of the present application;
fig. 2 is a schematic diagram illustrating a key-value storage system built based on a cloud storage system according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a MASHtree compression method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating a data block cache structure provided in an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a data read request processing method according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a data write request processing method according to an embodiment of the present application;
fig. 7 shows a schematic diagram of a failure recovery method provided by an embodiment of the present application.
Detailed Description
It will be readily understood that the components of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the present application, as represented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of certain examples of presently contemplated embodiments in accordance with the present application. The presently described embodiments will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout.
The present application may be embodied as systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to execute aspects of the present application.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage system, a magnetic storage system, an optical storage system, an electromagnetic storage system, a semiconductor storage system, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, a computer-readable storage medium should not be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a fiber optic cable carrying light pulses), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded to a corresponding computing/processing device from a computer-readable storage medium, or to an external computer or external storage system via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer-readable program instructions for carrying out operations of the present application may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit comprising, for example, a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may personalize the electronic circuit by executing computer-readable program instructions with state information of the computer-readable program instructions to perform various aspects of the present application.
Aspects of the present application may be described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
First, terms related to the present application are explained as follows:
Cloud local storage: storage devices attached locally to servers of the cloud storage system; expensive, but with low operation latency and small capacity, for example NVMe SSDs and other high-performance SSDs.
Cloud remote storage device: storage devices or services provided by the cloud service provider, such as cloud EBS and S3; charged according to storage capacity and IOPS, suitable for storing large volumes of data, and offering high data security.
Log-structured Merge Tree (LSM-tree): a data file management model.
Compaction (compact) operation: to maintain the tree structure of the LSM-tree, when layer Li accumulates too many SSTables, the SSTables of layer Li and the SSTables of layer Li+1 with overlapping key ranges are merge-sorted while obsolete keys are discarded; this is called a compaction operation.
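For illustration, the following C++ sketch shows the merge at the heart of a compaction under the assumption that each SSTable is a key-sorted run: two overlapping runs are merge-sorted, and when a key appears in both, the upper (newer) layer's value is kept and the obsolete one discarded.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, std::string>;  // (key, value), sorted by key

    // Merge-sort the SSTable of layer Li ("upper") with an overlapping SSTable
    // of layer Li+1 ("lower"); on a duplicate key the newer upper value wins.
    std::vector<KV> compact(const std::vector<KV>& upper, const std::vector<KV>& lower) {
        std::vector<KV> merged;
        std::size_t i = 0, j = 0;
        while (i < upper.size() && j < lower.size()) {
            if (upper[i].first < lower[j].first)      merged.push_back(upper[i++]);
            else if (lower[j].first < upper[i].first) merged.push_back(lower[j++]);
            else { merged.push_back(upper[i++]); ++j; }  // duplicate: obsolete key dropped
        }
        while (i < upper.size()) merged.push_back(upper[i++]);
        while (j < lower.size()) merged.push_back(lower[j++]);
        return merged;  // contents of the new SSTable written to layer Li+1
    }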
Referring to fig. 1, an example of a network environment 100 is shown. Network environment 100 is presented to illustrate one example of an environment in which systems and methods in accordance with the present application may be implemented. Network environment 100 is presented by way of example and not limitation. Indeed, the systems and methods disclosed herein may be applicable to a variety of different network environments or to separate computers in addition to the network environment 100 shown.
As shown, the network environment 100 includes one or more computers 102, 106 interconnected by a network 104. The network 104 may include, for example, a local area network (LAN) 104, a wide area network (WAN) 104, the Internet 104, an intranet 104, and so forth. In certain embodiments, the computers 102, 106 may include a client computer 102 and a server computer 106 (also referred to herein as a "host" 106 or "host system" 106), and the server computer 106 may be a local server or a cloud server. Typically, the client computer 102 initiates a communication session, while the server computer 106 waits for and responds to requests from the client computer 102. In certain embodiments, the computers 102 and/or servers 106 may be connected to one or more internal, external, directly connected, or cloud-connected additional storage systems 109 (e.g., arrays of hard disk drives, solid state drives, tape drives, cloud storage drives, etc.). These computers 102, 106 and the directly connected or cloud-connected storage systems 109 may communicate using protocols such as ATA, SATA, SCSI, SAS, Fibre Channel, local area network connections, wide area network (WAN) connections, Internet connections, and the like.
In some embodiments, the network environment 100 may include a storage network 108 behind the servers 106, such as a storage area network (SAN) 108 or a LAN 108 (e.g., when network-attached storage is used). The network 108 may connect the servers 106 to one or more storage systems, such as an array of hard disk drives or solid state drives 110, a tape library 112, a single hard disk drive 114 or solid state drive 114, a tape drive 116, a CD-ROM library, or the like. To access the storage systems 110, 112, 114, 116, the host system 106 may communicate through a physical connection from one or more ports on the host 106 to one or more ports on the storage systems 110, 112, 114, and 116. The connection may be through a switch, fabric, direct connection, etc. In certain embodiments, the server 106 and the storage systems 110, 112, 114, 116 may communicate using networking standards or protocols such as Fibre Channel (FC) or iSCSI.
As shown in fig. 2, the present application provides a key-value storage system, which is based on a hybrid cloud storage composed of a cloud local storage and a cloud remote storage, and includes an LSM-tree storage architecture, a memory cache, a metadata block cache, and a data block cache. The key value storage system can utilize the advantages of cloud local storage equipment or service and cloud remote storage equipment or service, and high data access performance and high cost benefit are achieved while high fault tolerance is maintained.
Specifically, the LSM-tree storage architecture stores data in units of Sorted Strings Tables (SSTables), and the SSTables are organized into a multi-layer tree structure. The LSM-tree storage architecture shown in FIG. 2 includes a number of layers; layers L0 through L3 are shown in FIG. 2, while the layers below L3 are omitted to simplify the drawing. Each layer includes multiple SSTables; for example, layer L2 includes SST1, SST2, and other SSTables omitted from FIG. 2 to simplify the drawing. Each SSTable includes at least one metadata block, used to locate data, and multiple data blocks, a data block being a fixed-size block that stores keys and data; for example, SST4, shown at layer L3 in FIG. 2, includes one SST4 metadata block and multiple SST4 data blocks. In the LSM-tree storage architecture, layer L0 is the topmost layer, and the layers grow successively from L1, L2, and L3 down to Ln, where the value of n depends on the total storage space of the cloud storage system. During a data write, the operation is first recorded in a log file, so that the data can still be persisted if the key-value storage system fails, and the data is then written into a memtable structure in the memory cache. When a memtable fills up it becomes an immutable memtable, from which an SSTable is eventually generated and written into the topmost layer L0 (sketched after this paragraph). To maintain the tree structure of the LSM-tree, when the SSTables of some layer meet a preset compaction trigger condition (including, but not limited to, the layer holding too much data), an SSTable compaction is performed: two or more SSTables in that layer are compacted, merged, and written into the next layer down, and the compaction proceeds layer by layer until no layer meets the trigger condition. The newest data in the LSM-tree is therefore always in the uppermost layers; on a read, the data in memory is accessed first, and then the SSTables of the LSM-tree are searched from top to bottom until a result is found. Like reads, writes also concentrate on the top few layers, so those layers hold less data but see more frequent accesses. The key-value storage system uses a small-capacity, low-latency, but typically expensive cloud local storage device or service to hold the frequently accessed top layers of the LSM-tree, such as layers L0 and L1, together with the hot data they contain, guaranteeing the system's data access performance, while the rest of the LSM-tree is kept on a large-capacity, higher-latency, and typically cheaper cloud remote storage device or service to reduce usage cost. Because the SSTables in the upper layers of the LSM-tree are few in number but accessed frequently, the low operation latency of cloud local storage improves access performance, shifting a large number of time-consuming reads and writes from the slower cloud remote storage device or service to the faster cloud local storage device or service.
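The write flow just described (log file first, then memtable, then a flush of the full memtable to layer L0) can be sketched as follows in C++; this is a minimal illustration with hypothetical names and thresholds, not the disclosed implementation.

    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>

    class KVStore {
        std::ofstream wal_{"wal.log", std::ios::app};        // write-ahead log file
        std::map<std::string, std::string> memtable_;        // sorted in-memory buffer
        static constexpr std::size_t kMemtableLimit = 4096;  // assumed entry limit

    public:
        void put(const std::string& key, const std::string& value) {
            wal_ << key << '\t' << value << '\n';  // 1. log first, so a crash loses nothing
            wal_.flush();
            memtable_[key] = value;                // 2. then insert into the memtable
            if (memtable_.size() >= kMemtableLimit)
                flushToL0();                       // 3. full memtable becomes an SSTable in L0
        }

    private:
        void flushToL0() {
            // The sorted memtable contents would be written as a new SSTable at
            // the top layer L0 on cloud local storage; sketched here as a no-op.
            memtable_.clear();
        }
    };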
Further, the SSTables in the top few layers of the LSM-tree storage architecture, which are frequently accessed but account for little of the total data volume, can be stored in cloud local storage with low read-write latency. The number of layers Li kept in cloud local storage can be chosen according to the data volume and the size of the cloud local storage, generally 2 to 3 layers (for example, layers L0 and L1 as shown in FIG. 2, i.e., i = 1), while the remaining layers stay in the large-capacity, higher-latency, low-cost cloud remote storage. Storing cold and hot data separately makes full use of the high read-write performance of the cloud platform's local storage to improve data operation efficiency, while usage cost is controlled through the remote storage.
Specifically, the key-value storage system divides data into cold and hot regions according to access frequency: frequently accessed data is hot, rarely accessed data is cold. Based on this division, the LSM-tree storage architecture is deployed over a hybrid structure of cloud local storage and cloud remote storage with very different read-write performance. In one embodiment, an AWS EC2 instance, which comes with its own high-performance cloud local storage device, can be selected as the data storage server. That local storage device has small capacity but read-write performance comparable to current NVMe SSDs, far faster than most storage devices. Since modern databases hold large volumes of data, a cloud remote storage device or service is chosen to store most of the data; such devices or services are billed by provisioned capacity, making them cheap to use and helpful for cost control, at the price of limited read-write performance. The LSM-tree model, with its clear separation of cold and hot data, therefore ties the two kinds of devices or services together into one overall storage architecture, with the metadata block cache and the data block cache further optimizing data access, so that the high read-write performance of the cloud local storage device or service and the low storage cost of the cloud remote storage device or service are obtained at the same time.
Further, after the data volume of layer L1 of the LSM-tree storage architecture reaches a preset threshold, a compaction is performed, and the SSTables in layer L1 are merged, compacted, and stored into layer L2. Cloud remote storage is selected as the storage device when the SSTables of layer L2 are generated, and the lower layers are stored in the same way as layer L2.
The metadata block cache stores the metadata of every SSTable in the cloud remote storage device. When searching for data in the layers below L0, a traditional LSM-tree storage architecture determines the SSTable holding the data by binary search, accessing the metadata block inside each candidate SSTable; this may require multiple accesses to SSTables in the cloud remote storage and brings a large query tail latency. With the metadata block cache deployed on the low-latency cloud local storage, a lookup only needs to consult the cache on the high-read-write-performance local storage, reducing the number of accesses to the cloud remote storage and improving query efficiency.
Furthermore, the key-value storage system indexes the cloud data blocks in the metadata block cache with MASHtree, a tree-structured encoding with embedded data block offsets, in place of traditional index encoding. This improves indexing efficiency while compressing the index space, so the system overhead of copying metadata blocks is significantly lower than with traditional index encoding.
Further, the data block cache of the key-value storage system caches data blocks from the cloud remote storage. The cache space of the data block cache is configurable and divided into bucket containers, each of which is split into fixed-size slots managed with a bitmap. If a read hits a block in the data block cache, the data can be read directly from the cache without touching the cloud remote storage again, reducing remote I/O, cutting query tail latency, and improving data lookup efficiency.
The application provides a data storage method for the key-value storage system of the application, comprising the following steps:
writing data into the memory cache of the key-value storage system and generating at least one SSTable, where the key-value storage system stores data in SSTable units and an SSTable comprises metadata blocks and data blocks;
writing the at least one SSTable from the memory cache into the topmost layer of the LSM-tree storage architecture of the key-value storage system, the LSM-tree storage architecture comprising multiple layers, where each layer stores data in SSTable units and all SSTables across the layers are organized into a tree;
caching the metadata blocks corresponding to SSTables that include hot data into the metadata block cache of the key-value storage system; and
caching the data blocks corresponding to SSTables that include hot data into the data block cache of the key-value storage system;
where the cloud storage system is a hybrid cloud storage system consisting of cloud local storage and cloud remote storage.
Further, the LSM-tree storage architecture includes layers L0 through Li deployed on the cloud local storage, with layer L(i+1) and all layers below it deployed on the cloud remote storage. Typically, i is 1 or 2.
Further, if any layer of the key-value storage system meets a preset compaction trigger condition, a compaction is executed starting from that layer: at least two SSTables in the layer are compacted and merged into a new SSTable written to the layer below it, and the compaction proceeds layer by layer downward until no layer triggers it.
Further, the key-value storage system encodes the metadata block using MASHtree encoding.
Further, the space of the data block cache is divided into a number of bucket containers, where each bucket container holds data blocks from SSTables on the cloud remote storage device, and the blocks are managed and accessed through a lookup method keyed on the basic information of the SSTable each block belongs to.
Further, each bucket container is divided into blocks of a specific size, and the blocks are managed by a bitmap.
As shown in FIG. 3, the present application provides a metadata storage method that can significantly reduce the space overhead of storing metadata. Take as an example one SSTable in the cloud remote storage containing the keys "hello", "hunda", "hundred", "hung", "hunte", and "hunter"; these six key-value pairs are stored in three data blocks. The keys in an SSTable are sorted alphabetically and share common prefixes, which facilitates compressed storage and indexing. The MASHtree compression method takes advantage of this property: first, the set of keys is represented as a prefix tree, as shown in FIG. 3.
Each path of the prefix tree from the root node to a leaf node represents one key string, with a leaf node holding the character "$" marking the end of a key; e.g., nodes 0, 2, 3, 4 represent "hunda". Two steps are then performed on the prefix tree:
first, the prefix tree is compressed. As shown in FIG. 3, the leftmost path is first selected, assuming that the path departs from the root nodeStarting with the numbers 0, 1, 2, …, i, compresses the path into a node, denoted α0d0c0,…,αi-1di-1ci-1αiIn which α isiIs the content of node i, di is the node degree minus 1, omitted if there are only 1 child node, ci is the weight of the edge. For example, the leftmost path is from h to llo, the edge is e, node h has 2 subtrees, and thus "h 1 ello" is encoded. The other leaves or subtrees of the path act as children of the node. Recursively performing the same operation on each subtree results in a prefix compression tree as shown in fig. 3.
Second, the prefix compression tree is dictionary-encoded. The tree is encoded in depth-first preorder, generating three arrays: sequence, node information, and path information. The encoding proceeds as follows:
(1) traverse each node of the prefix compression tree depth-first to generate the sequence array, which begins with a left bracket "(" followed by the information of each node;
(2) each time a node is visited, write its information, such as "h1ello" or "n2d1a", into the node information array;
(3) each time a node is visited, generate its corresponding sequence. The sequence of each node consists of i left brackets "(" and 1 right bracket ")", where i is the number of children of the node; for example, node h1ello, which has a single child, contributes "()", while node n2d1a, which has three children, contributes three left brackets and one right bracket. To speed up locating a data block, the right bracket ")" of the last key in each data block is replaced with "]", so the number of "]" markers encountered during a lookup indicates which data block holds the key (e.g., two "]" markers mean the key's data is in the second data block).
(4) each time a node is visited, the edge information of its attached subtrees is recorded in the path information array, ordered from upper layers to lower layers and, within the same layer, from right to left; for example, the path information of the prefix compression tree shown in FIG. 3 is "utgrr".
Following these rules, all keys can be encoded into a few arrays, such as the three arrays shown in FIG. 3, which reduces the space overhead of storing the prefix tree's pointers and also makes a separate filter unnecessary.
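For illustration only, the following C++ sketch holds the three arrays produced by the encoding above and applies the "]" counting rule used during lookup; the field semantics follow the description of FIG. 3, and all names are hypothetical.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct MashTreeIndex {
        std::string sequence;               // "(" per child, ")" per node, "]" closes a data block
        std::vector<std::string> nodeInfo;  // compressed nodes, e.g. "h1ello", "n2d1a"
        std::string pathInfo;               // subtree edge labels, e.g. "utgrr"
    };

    // Given the position in the sequence array where a key's match ends, count
    // the "]" markers seen so far: per the rule above, that count identifies
    // the data block holding the key (two "]" seen = the second data block).
    int dataBlockIndex(const std::string& sequence, std::size_t matchPos) {
        int blocks = 0;
        for (std::size_t i = 0; i <= matchPos && i < sequence.size(); ++i)
            if (sequence[i] == ']') ++blocks;
        return blocks;
    }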
As shown in FIG. 4, the present application provides a data block caching method. Key-value pairs stored in the data block cache are organized in an LRU list, with stale entries marked invalid. The space of the data block cache is divided into a number of bucket containers (buckets), each of which is divided into fixed-size blocks matching the data block size used by SSTables (e.g., 4 KB) and managed with a bitmap. Each data block maps to one 4 KB slot, and the total number of slots is allocated according to actual demand.
Further, the data block cache holds the frequently accessed data blocks in the cloud remote storage, i.e., the hot data blocks, exploiting the locality of data blocks: when the data for some key is accessed, data near that key is likely to be accessed soon, so the data block containing the data is cached into the data block cache on the cloud local storage at the first access, and the cached blocks are managed with an LRU policy. When the data is needed a second time, the key-value storage system can read the target data directly from the data block cache instead of again touching the cloud remote storage with its slow read-write performance, reducing tail latency.
Furthermore, the key-value storage system provided by the application can allocate one large region of storage up front as the data block cache space, avoiding the overhead of file system calls. The system then divides this large space into fixed-size bucket containers; when a new data block is inserted into the cache or data is read from it, the system locates a specific bucket container by the number of the SSTable file the data belongs to, and locks (Lock) only that bucket container for the duration of the write or read. By narrowing the locking scope from the whole cache space to a single bucket container, data lookups can run in multiple threads, improving search efficiency.
Further, each bucket container is divided into fixed-size blocks, each storing specific data, and a bitmap in each bucket container indicates which of its blocks are in use. The key-value storage system also manages the mapping from inserted data to the blocks it occupies with a doubly linked list: each list node stores the size of the data and the offset of the occupied blocks in the cache space, and the nodes are kept in LRU order, with the head of the list holding the most recently accessed data and the tail the data untouched for the longest time. Each item written into a bucket container of the data block cache may occupy one or more blocks, so the cache accommodates writes of different sizes. For example, the RocksDB database usually reads data from an SSTable file in a preset fixed size, so the block size of the data block cache can be set close to that read size, typically 4 KB, to use the cache space efficiently.
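A minimal C++ sketch of this bookkeeping, assuming one mutex, one bitmap, and one recency-ordered doubly linked list per bucket container; the names and the bucket capacity are hypothetical.

    #include <cstddef>
    #include <list>
    #include <mutex>
    #include <vector>

    struct LruEntry {
        std::size_t dataSize;  // size of the cached data
        std::size_t offset;    // offset of its block(s) in the bucket's cache space
    };

    struct LockedBucket {
        std::mutex lock;           // guards this bucket only, not the whole cache
        std::vector<bool> bitmap;  // slot usage, as above
        std::list<LruEntry> lru;   // head = most recently used, tail = coldest

        LockedBucket() : bitmap(1024, false) {}  // assumed bucket capacity

        // On every access, move the entry to the head of the LRU list.
        void touch(std::list<LruEntry>::iterator it) {
            std::lock_guard<std::mutex> g(lock);
            lru.splice(lru.begin(), lru, it);
        }

        // Evict the entry untouched for the longest time; the caller then
        // clears the victim's slots in the bitmap. Precondition: lru not empty.
        LruEntry evictColdest() {
            std::lock_guard<std::mutex> g(lock);
            LruEntry victim = lru.back();
            lru.pop_back();
            return victim;
        }
    };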
Further, when a compaction occurs or a new SSTable is generated, the corresponding new metadata block is cached in the metadata block cache and the old metadata block is deleted; correspondingly, the old data blocks in the data block cache are evicted and replaced by the corresponding new ones. The key-value storage system simply marks the corresponding blocks invalid in the bitmap and deletes the indexes pointing to the invalidated blocks.
As shown in fig. 5, the present application provides a data read request processing method based on a key-value storage system of the present application, including:
(1) The key-value storage system receives a request to read a piece of data. Starting from layer L0 of its LSM-tree storage architecture, it searches layer by layer downward through the SSTables of each layer (or the metadata blocks corresponding to them); in layer L0 the SSTables are searched one by one to find and read the data directly.
(2) If the data is not found in layer L0, the SSTables of layer L1 are searched by binary search on keys to find and read the data, where each SSTable in layer L1 has a minimum key and a maximum key and key ranges do not overlap between SSTables. In every layer other than L0, data is stored in key order, so each SSTable has a minimum and a maximum key, and binary search over the SSTables' metadata can locate the SSTable that covers the key. Layer L1 resides in the high-performance cloud local storage, so the key-value storage system can search it directly; if the target data is found the result is returned, otherwise the search continues to the next layer.
(3) If the data is not found in layer L0 or layer L1, the SSTable holding the data is looked up by key in the metadata block cache in the cloud local storage, where the metadata corresponding to all SSTables in the cloud remote storage used by the key-value storage system is cached. Because every SSTable metadata block from the cloud remote storage is cached in the metadata block cache on the cloud local storage, the system can find the SSTable holding the data directly in the cache.
(4) If a target SSTable containing the data is found among the metadata blocks of the metadata block cache, and the data block of the target SSTable containing the target data is cached in the data block cache of the key-value storage system, the data is read directly from that data block in the data block cache. For example, the key-value storage system first determines, in the metadata block cache, the metadata block covering the key, and then traverses the compression coding tree of that metadata block starting from the root node. If the key is found in the metadata block, the number (or other index information) of the data block of the target SSTable holding the data is found along with it. If that data block is already in the data block cache, the data is read directly from it based on the number or other index information; otherwise the key-value storage system must locate the target data block in the cloud remote storage and read the data from there. If the key is not found in any metadata block of the metadata block cache, no data exists for the key, and "not found" can be returned directly.
(5) If the metadata for the data block of the target SSTable is found in the metadata block cache but the block itself is not in the data block cache, the SSTable containing the data is looked up among the SSTables stored in the cloud remote storage. If no metadata for the target SSTable's data block is found in the metadata block cache, "not found" is returned directly.
(6) After the target SSTable containing the data is found in the cloud remote storage, the key-value storage system caches the data block of that SSTable containing the data from the cloud remote storage into the cloud local storage and then reads the data.
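Putting steps (1) through (6) together, the read path can be sketched as follows in C++ (with i = 1, matching the example above); the helper functions are hypothetical stubs standing in for the structures described in this application.

    #include <optional>
    #include <string>

    // Hypothetical stubs; real versions would query the respective structures.
    std::optional<std::string> lookupMemtable(const std::string&) { return std::nullopt; }
    std::optional<std::string> lookupL0(const std::string&) { return std::nullopt; }  // scan tables one by one
    std::optional<std::string> lookupL1(const std::string&) { return std::nullopt; }  // binary search on [min, max] keys
    std::optional<int> findTargetSSTable(const std::string&) { return std::nullopt; } // metadata block cache
    std::optional<std::string> readFromBlockCache(int, const std::string&) { return std::nullopt; }
    std::string fetchBlockFromRemoteAndCache(int, const std::string&) { return ""; }  // one remote I/O, then cache

    std::optional<std::string> get(const std::string& key) {
        if (auto v = lookupMemtable(key)) return v;            // memory first
        if (auto v = lookupL0(key)) return v;                  // cloud local storage, L0
        if (auto v = lookupL1(key)) return v;                  // cloud local storage, L1..Li
        const auto sst = findTargetSSTable(key);               // cached metadata blocks only
        if (!sst) return std::nullopt;                         // no SSTable covers the key
        if (auto v = readFromBlockCache(*sst, key)) return v;  // hot block already local
        return fetchBlockFromRemoteAndCache(*sst, key);        // single remote access, then cached
    }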
As shown in fig. 6, the present application provides a data write request processing method based on a key-value storage system of the present application, including:
the key value storage system receives a write request for writing data and judges the destination of the write request;
(1) and if the request is to generate an SSTable for the data in the memory cache and write the SSTable into a flush-down (flush) operation at the L0 layer of the LSM-tree storage architecture of the key-value storage system, executing the write operation on the cloud local storage used by the key-value storage system. For example, for a flush-down operation in which SSTable generated from immutable memtable in a memory cache is written to the L0 layer, the flush-down operation may be executed directly on the cloud local high-performance SSD. Because the cloud local high performance SSD has low latency characteristics and does not require caching, performing the flush operation does not impact metadata block caching and data block caching.
(2) If the data write request is for an SSTable compression operation between adjacent layers among L0 to Li in the LSM-tree storage architecture of the key-value storage system, the compression operation is executed on the SSTables in the Lk layer (k is greater than or equal to 0 and k is less than i) and the SSTables in the L(k+1) layer whose key value ranges overlap with those of the Lk layer, a new SSTable is generated, the compacted SSTables in the Lk and L(k+1) layers are deleted, and the new SSTable is written into the L(k+1) layer. Specifically, if the write request is a compression (compact) operation on SSTables between the L0 and L1 layers, the key-value storage system performs the compression operation directly on the cloud local storage to generate a new SSTable, writes the new SSTable into the L1 layer, and deletes the compacted SSTables in the L0 and L1 layers. Because the cloud local high-performance SSD has low-latency characteristics and requires no caching, this does not affect the metadata block cache or the data block cache.
(3) If the data write request is for a compression operation between an SSTable of the Li layer and an SSTable of the L(i+1) layer of the LSM-tree storage architecture whose key value range overlaps that of the Li-layer SSTable, the new SSTable generated by the compression operation is written into the cloud remote storage used by the key-value storage system, and the metadata block corresponding to the new SSTable is copied into the metadata block cache of the key-value storage system; the Li-layer SSTable and the overlapping L(i+1)-layer SSTable are then deleted, together with the metadata block corresponding to the overlapping L(i+1)-layer SSTable in the metadata block cache and its hot data blocks in the data block cache. Specifically, if the request is a write operation for a new SSTable generated by compacting one SSTable at the L1 layer in the cloud local storage with another SSTable at the L2 layer in the cloud remote storage, the new SSTable is written into the cloud remote storage, and the metadata block corresponding to the new SSTable is copied into the metadata block cache of the key-value storage system. The compacted old SSTable at the L1 layer in the cloud local storage and the compacted old SSTable at the L2 layer in the cloud remote storage are then deleted, and the metadata blocks of the compacted old SSTables in the metadata block cache and the corresponding data blocks in the data block cache are deleted at the same time.
(4) If the request is a write request for an SSTable that needs to be written into the cloud remote storage of the key-value storage system, the metadata block of the newly generated SSTable is first copied into the metadata block cache, the metadata block of the old SSTable is then deleted from the metadata block cache and its data blocks from the data block cache of the key-value storage system, and the write ends. Specifically, if the target data block of the write request is located in an SSTable in the cloud remote storage, then although the target data block is relatively cold (i.e., accessed less frequently) compared with the data blocks already cached in the cloud local storage, it is relatively hot compared with the other data blocks still in the cloud remote storage; the key-value storage system therefore adds these relatively hot target data blocks to the data block cache in the cloud local storage.
(5) If the write request is for a data block evicted from the memory cache of the key-value storage system, and the evicted data block is a hot data block located in the cloud remote storage, the evicted data block is added to the data block cache of the cloud local storage. Although the evicted data block is accessed less frequently than the other data blocks in the memory cache, it is accessed more frequently than the other data blocks in the cloud remote storage. A sketch of this write dispatch follows.
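The five write cases above amount to a dispatch on the purpose of the request. The following sketch condenses them, with plain dictionaries standing in for the local layers, the remote storage, and the two caches; the enum values and request fields are hypothetical names, not the patent's interfaces.

from enum import Enum, auto

class WriteKind(Enum):
    FLUSH_TO_L0 = auto()       # case (1): flush from memory cache to L0
    LOCAL_COMPACTION = auto()  # case (2): compaction among local layers
    CROSS_COMPACTION = auto()  # cases (3)-(4): result goes to remote storage
    CACHE_ADMISSION = auto()   # cases (4)-(5): still-hot block cached locally

def handle_write(kind, req, local_levels, remote_store, meta_cache, block_cache):
    # local_levels / remote_store: sstable_id -> SSTable dicts;
    # meta_cache: sstable_id -> metadata block;
    # block_cache: (sstable_id, block_no) -> data block.
    if kind is WriteKind.FLUSH_TO_L0:
        # Executed directly on the low-latency local SSD; no cache changes.
        local_levels[req["new_id"]] = req["new_sstable"]
    elif kind is WriteKind.LOCAL_COMPACTION:
        # Merge overlapping SSTables of Lk and L(k+1), both on local SSD.
        local_levels[req["new_id"]] = req["new_sstable"]
        for old_id in req["old_ids"]:
            local_levels.pop(old_id, None)
    elif kind is WriteKind.CROSS_COMPACTION:
        # The new SSTable lands in remote storage; its metadata block is
        # copied into the local metadata block cache, and stale cache
        # entries of the compacted SSTables are dropped.
        remote_store[req["new_id"]] = req["new_sstable"]
        meta_cache[req["new_id"]] = req["new_meta_block"]
        for old_id in req["old_local_ids"]:
            local_levels.pop(old_id, None)
        for old_id in req["old_remote_ids"]:
            remote_store.pop(old_id, None)
            meta_cache.pop(old_id, None)
            for blk in [b for b in block_cache if b[0] == old_id]:
                del block_cache[blk]          # evict that SSTable's hot blocks
    elif kind is WriteKind.CACHE_ADMISSION:
        # Cold relative to memory, hot relative to other remote blocks.
        block_cache[req["block_id"]] = req["block"]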
As shown in fig. 7, the present application provides a failure recovery method based on the key-value storage system of the present application for use when the key-value storage system fails. In cloud storage, when a network, server, or machine-room failure occurs, the cloud storage system is liable to go down and data on the cloud local storage may be lost, so the data storage system needs to complete failure recovery as quickly as possible to ensure that normal data storage service can continue to be provided. In the failure recovery method provided by the present application for a key-value storage system built on cloud storage, for each data write request the LSM-tree storage architecture of the key-value storage system generates a write log and saves it in a log file in the cloud remote storage, so that all data change/write actions are stored in the log file in the chronological order of the actions. For example, after data is flushed from memory to the cloud local storage and written into an SSTable generated in the L0 layer of the key-value storage system, the key-value storage system saves the corresponding write log to a log file in the cloud remote storage.
However, if the data storage system simply saved the log file corresponding to each SSTable and then replayed it for data failure recovery, the failure recovery process would take a long time. Therefore, to accelerate recovery, the key-value storage system provided by the present application records the key value list of each SSTable in the log file and records the basic information of each generated SSTable in a manifest file. For example, when the key-value storage system executes a compression operation that generates a new SSTable and deletes an old SSTable, the operation action log and the key value list of the new SSTable, together with the deletion of the old SSTable, are recorded in the log file, and the basic information of the new SSTable is recorded in the manifest file. This scheme ensures that the key-value storage system can safely roll back to a state consistent with the latest logs and key value lists stored in the manifest file and the log file, avoiding data loss during the rollback.
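Under these assumptions, the logging scheme can be sketched as follows: each write is appended to the log, and creating an SSTable records its key value list in the log file and its basic information in the manifest file. The JSON-lines file format and field names are illustrative assumptions only, not the patent's on-disk format.

import json
import time

def append_record(path, record):
    # Append one JSON record per line; both the log file and the manifest
    # file are modeled this way here.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def log_put(log_path, key, value):
    # Every data write action is appended to the log in time order.
    append_record(log_path, {"op": "put", "key": key, "value": value,
                             "ts": time.time()})

def log_new_sstable(log_path, manifest_path, sst_name, key_list, size):
    # The key value list of the new SSTable goes into the log file...
    append_record(log_path, {"op": "new_sstable", "sst": sst_name,
                             "keys": key_list, "ts": time.time()})
    # ...while its basic information (file name, generation time, size)
    # is recorded in the manifest file.
    append_record(manifest_path, {"sst": sst_name, "created": time.time(),
                                  "size": size})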
As shown in fig. 7, the failure recovery method based on the key-value storage system of the present application includes the following steps:
(1) The cloud remote storage used by the failed key-value storage system is connected to a standby data cloud storage system. For example, the standby data cloud storage system is likewise a cloud storage system with a hybrid structure, composed of cloud local storage with high read-write performance and cloud remote storage with high capacity and low cost. By connecting the cloud remote storage used by the failed key-value storage system to the standby data cloud storage system, the failure recovery method can quickly restore the L0 and L1 layers of the LSM-tree storage architecture of the failed key-value storage system without waiting for the failed key-value storage system to complete its own recovery process.
(2) The standby data cloud storage system reads the manifest file from the cloud remote storage used by the failed key-value storage system to obtain the basic information of all SSTables in the L0 and L1 layers of the LSM-tree storage architecture of the failed key-value storage system. Specifically, whenever the key-value storage system provided by the present application generates a new SSTable during normal operation, it records the basic information of that operation in the manifest file. The basic information may include the file name, generation time, and size of the new SSTable; for example, an SSTable with file name sst5 generated at time t5 with size 2. Therefore, the basic information of all SSTables in the L0 and L1 layers of the LSM-tree storage architecture can be obtained by reading the manifest file.
(3) Based on the basic information of all SSTables obtained from the manifest file, the standby data cloud storage system acquires the key value list of each SSTable from the log files in the cloud remote storage used by the failed key-value storage system; this acquisition can be performed in parallel. Specifically, for each data write operation, the key-value storage system provided by the present application first writes the data into a log file, so all data of the L0 and L1 layers is saved in the log files, which can therefore be used for data recovery. To keep the log files safe, the key-value storage system stores them in the cloud remote storage it uses. Further, under heavy write loads, the log files may be stored on a storage device with higher performance and stability to ensure that data can be recovered after a failure. Further, because the data of all SSTables in the L0 and L1 layers of the LSM-tree storage architecture is stored in the log files, the log files grow considerably; SSTables can therefore be compacted below the L1 layer through periodic flushing, and the log file corresponding to a data block can be deleted once the data block reaches the cloud remote storage.
(4) Based on the acquired basic information and key value list of each SSTable, the standby data cloud storage system starts a multithreaded scan of the log files, beginning with the log file with the latest generation time, to obtain the key-value pairs included in each SSTable. When the key-value storage system writes data into an SSTable, the key-value pairs of the data are also recorded in the log file; all data is therefore recorded in the log files and can be used for data recovery after the key-value storage system fails. Specifically, after the key value list corresponding to each SSTable is obtained, the failure recovery method scans the log files with parallel multithreading so that the key-value pairs included in each SSTable can be found independently. Starting the scan from the log file with the latest generation time ensures that the data recovered is the most recently written data.
(5) After one thread of the multithreaded scan obtains all the key-value pairs corresponding to one SSTable, the standby data cloud storage system reconstructs a backup SSTable on its own cloud local storage. The metadata blocks and data blocks of the backup SSTable are identical to those of the corresponding SSTable in the failed key-value storage system.
(6) When all SSTables of the L0 and L1 layers of the LSM-tree storage architecture of the failed key-value storage system have corresponding backup SSTables reconstructed on the standby data cloud storage system, the standby data cloud storage system replays the latest logs. Specifically, the standby data cloud storage system takes the creation time of the latest SSTable in the L0 layer of the failed key-value storage system as the starting point for replaying the log files, and sequentially applies all request records in the log files in ascending time order; when all requests in the log files have been replayed, the recovery of the standby cloud storage system is complete. At this point, the standby data cloud storage system can serve data externally. Further, the metadata block cache and data block cache of the standby data cloud storage system are still empty at this time, so metadata blocks and data blocks are cached gradually while service is being provided. A sketch of this recovery flow follows.
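Steps (2) through (6) can be sketched as below, again with hypothetical file formats matching the logging sketch above: the manifest supplies the SSTable list, each SSTable's key value list selects its records, and a thread pool scans the logs newest-first to rebuild one backup SSTable per thread. This is an illustration, not the patent's implementation.

import json
from concurrent.futures import ThreadPoolExecutor

def load_manifest(manifest_path):
    # Step (2): basic information of every SSTable, one JSON record per line.
    with open(manifest_path) as f:
        return [json.loads(line) for line in f]

def recover_sstable(sst_info, log_paths, key_lists):
    # Steps (3)-(5): one thread collects the key-value pairs of one SSTable.
    wanted = set(key_lists[sst_info["sst"]])
    kv = {}
    # Scan newest log first (assuming file names sort by creation time) and
    # newest record first within a file, so the latest written value wins.
    for path in sorted(log_paths, reverse=True):
        with open(path) as f:
            for line in reversed(f.readlines()):
                rec = json.loads(line)
                if rec.get("op") == "put" and rec["key"] in wanted:
                    kv.setdefault(rec["key"], rec["value"])
    return sst_info["sst"], kv  # kv is materialized as the backup SSTable

def recover_all(manifest_path, log_paths, key_lists):
    infos = load_manifest(manifest_path)
    # One thread per SSTable mirrors the multithreaded log scan of step (4).
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(
            lambda info: recover_sstable(info, log_paths, key_lists), infos))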
The flowchart and/or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer-usable media according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood by those skilled in the art that the foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A cloud storage system-based key-value storage system, the key-value storage system comprising:
a memory having computer readable instructions;
one or more processors for executing the computer-readable instructions, the computer-readable instructions controlling the one or more processors to perform operations;
an LSM-tree storage architecture comprising a plurality of layers, wherein each of said layers stores data in units of SSTable, and all of said SSTables are organized as a tree;
a memory cache;
a metadata block cache; and
a data block cache;
wherein the cloud storage system is a hybrid cloud storage system consisting of cloud local storage and cloud remote storage.
2. The key-value storage system of claim 1, wherein: the LSM-tree storage architecture includes a layer L0 through a layer Li deployed on the cloud local storage, and a layer L (i +1) and the remaining layers below L (i +1) deployed on the cloud remote storage.
3. The key-value storage system according to claim 1 or 2, wherein: the metadata block cache is deployed on the cloud local storage and used for storing metadata blocks corresponding to each SSTable on the cloud remote storage, wherein the metadata blocks are used for indexing data stored by each SSTable on the cloud remote storage.
4. The key-value storage system according to claim 1 or 2, wherein: the key-value storage system encodes the metadata block of each SSTable using MASHtree coding.
5. The key-value storage system according to claim 1 or 2, wherein: the data block cache is deployed on the cloud local storage and used for storing the hot data blocks corresponding to the SSTable on the cloud remote storage.
6. The key-value storage system according to claim 1 or 2, wherein: the space of the data block cache is divided into a plurality of bucket containers, wherein each bucket container comprises hot data blocks from the SSTables on the cloud remote storage and is managed and accessed, by a data management and lookup method, according to the file names of the SSTables to which the data blocks correspond.
7. The key-value storage system of claim 6, wherein: each of the bucket containers is divided into blocks of a specific size and the blocks are managed by a bitmap.
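As a rough illustration of claims 6 and 7, the sketch below divides the data block cache into bucket containers keyed by SSTable file name, splits each bucket into fixed-size slots, and tracks slot occupancy with a bitmap. The slot size, bucket capacity, and all names are arbitrary assumptions, not the claimed implementation.

BLOCK_SIZE = 4096        # assumed slot size; the claims do not fix one
SLOTS_PER_BUCKET = 64    # assumed bucket capacity

class Bucket:
    # One bucket container; a bitmap marks which fixed-size slots
    # currently hold a cached data block.
    def __init__(self):
        self.bitmap = 0                          # bit i set => slot i in use
        self.slots = [None] * SLOTS_PER_BUCKET

    def put(self, block_bytes):
        for i in range(SLOTS_PER_BUCKET):
            if not (self.bitmap >> i) & 1:       # first free slot
                self.bitmap |= 1 << i
                self.slots[i] = block_bytes[:BLOCK_SIZE]
                return i
        raise MemoryError("bucket full; evict before inserting")

    def free(self, slot):
        self.bitmap &= ~(1 << slot)
        self.slots[slot] = None

class DataBlockCache:
    # Buckets are keyed by the file name of the SSTable the blocks
    # belong to, so lookup and eviction work per SSTable.
    def __init__(self):
        self.buckets = {}

    def put(self, sst_file_name, block_bytes):
        bucket = self.buckets.setdefault(sst_file_name, Bucket())
        return bucket.put(block_bytes)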
8. A data read request processing method based on the key-value storage system of any one of claims 1 to 7, the method comprising the steps of:
the key-value storage system receiving a request to read a target datum;
searching, layer by layer from the L0 layer of the LSM-tree storage architecture of the key-value storage system, the SSTables or the metadata blocks corresponding to the SSTables in each layer to find the target data, wherein all the SSTables in the L0 layer are searched one by one to find and read the target data;
if the target data is not found in the L0 layer, searching the SSTables in the L1 to Li layers of the LSM-tree storage architecture based on keys to find and read the target data, wherein each SSTable in the L1 to Li layers has one minimum key and one maximum key;
if the target data is not found in the L0 layer to the Li layer, searching for the SSTable including the target data, based on the key, in the metadata blocks cached in the metadata block cache in the cloud local storage used by the key-value storage system, wherein the metadata blocks corresponding to all the SSTables in the cloud remote storage used by the key-value storage system are cached in the metadata block cache;
if a target SSTable containing the target data is found in the metadata blocks cached in the metadata block cache, and the data block of the target SSTable containing the target data is cached in the data block cache of the key-value storage system, reading the target data from that data block in the data block cache, wherein the metadata block records the index information of the target SSTable;
if the target SSTable is found in the metadata blocks but the data block of the target SSTable is not cached in the data block cache, finding the target SSTable in the cloud remote storage, and reading the target data after caching the data block of the target SSTable containing the target data from the cloud remote storage into the data block cache of the cloud local storage.
9. A data write request processing method based on the key-value storage system of any one of claims 1 to 7, the method comprising the steps of:
the key-value storage system receiving the data write request and determining the purpose of the data write request;
if the data write request is a flush-down operation that generates an SSTable from target data in a memory cache of the key-value storage system and writes the SSTable into the L0 layer of the LSM-tree storage architecture of the key-value storage system, executing the flush-down operation on the cloud local storage used by the key-value storage system to write the SSTable generated from the target data into the L0 layer;
if the data write request is for an SSTable compression operation between adjacent layers among the L0 to Li layers in the LSM-tree storage architecture of the key-value storage system, performing the compression operation on the SSTables of the Lk layer (k is greater than or equal to 0 and k is less than i) and the SSTables of the L(k+1) layer whose key value ranges overlap with those of the Lk layer to generate a new SSTable, deleting the compacted SSTables of the Lk and L(k+1) layers, and writing the new SSTable into the L(k+1) layer;
if the data write request is for a compression operation between an SSTable of the Li layer and an SSTable of the L(i+1) layer of the LSM-tree storage architecture of the key-value storage system whose key value range overlaps that of the Li-layer SSTable, writing the new SSTable generated by the compression operation into the cloud remote storage used by the key-value storage system, copying the metadata block corresponding to the new SSTable into the metadata block cache of the key-value storage system, deleting the SSTable of the Li layer and the overlapping SSTable of the L(i+1) layer, and deleting the metadata block corresponding to the overlapping L(i+1)-layer SSTable from the metadata block cache and the corresponding hot data blocks from the data block cache;
if the data write request is directed to a compression operation between adjacent layers of the LSM-tree storage architecture of the key-value storage system within the cloud remote storage, writing the new SSTable generated by the compression operation into the cloud remote storage, copying the metadata block corresponding to the new SSTable into the metadata block cache, deleting the metadata blocks corresponding to the SSTables used in the compression operation from the metadata block cache, and deleting the corresponding data blocks from the data block cache;
if the data write request is for a data block evicted from the memory cache and the evicted data block is a data block located in the cloud remote storage, writing the evicted data block into the data block cache, wherein the evicted data block is accessed less frequently than the other data blocks in the memory cache but more frequently than the other data blocks stored in the cloud remote storage.
10. A failure recovery method based on the key-value storage system of any one of claims 1 to 7, the method comprising the steps of:
connecting the cloud remote storage used by a failed key-value storage system to a standby cloud storage system, wherein the standby cloud storage system is provided with a cloud local storage;
the standby cloud storage system reading, from the cloud remote storage used by the failed key-value storage system, a manifest file and the log files related to each SSTable from the L0 layer to the Li layer of the LSM-tree storage architecture of the failed key-value storage system to obtain the basic information and the corresponding key value list of each SSTable from the L0 layer to the Li layer, wherein when the failed key-value storage system generated each SSTable, the basic information of that SSTable was recorded in the manifest file and its corresponding key value list was recorded in the log file;
the standby cloud storage system starting a multithreaded scan of all the log files from the log file with the latest generation time, based on the basic information and the corresponding key value list of each SSTable from the L0 layer to the Li layer, to obtain the key-value pairs included in each SSTable from the L0 layer to the Li layer, wherein when the failed key-value storage system wrote data to each SSTable in the cloud local storage, the key-value pairs of the data were recorded in the log files;
after one thread of the multithreaded scan obtains all the key-value pairs included in one SSTable, the standby cloud storage system reconstructing a backup SSTable corresponding to that SSTable;
after each SSTable from the L0 layer to the Li layer has a corresponding backup SSTable reconstructed on the cloud local storage of the standby cloud storage system, the standby cloud storage system replaying the log files.