CN113297136B

CN113297136B - LSM tree-oriented key value storage method and storage system

Info

Publication number: CN113297136B
Application number: CN202110573140.8A
Authority: CN
Inventors: 王宏超; 叶保留; 唐斌; 陆桑璐
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2023-11-03
Anticipated expiration: 2041-05-25
Also published as: WO2022246953A1; CN113297136A

Abstract

The invention provides a key-value storage method and a key-value storage system for an LSM tree. The method comprises: dividing each disk level into fine-grained sub-levels and setting the compaction policy as follows: in a compaction task, all sub-levels of the upper level participate, while only one sub-level of the lower level participates, so that the ratio of lower-level data to the total data participating in the task is reduced; and dividing the compaction task when it is executed, so that fewer files participate in the task and the parallelism of compaction is improved. The invention also reduces the impact on read performance through a parallel read algorithm and, by modeling the write amplification of the LSM tree, provides a method of selecting parameters that minimize write amplification.

Description

LSM tree-oriented key value storage method and storage system

Technical Field

The invention relates to a computer storage technology, in particular to a key value storage method and a key value storage system for an LSM tree.

Background

A key-value store stores data as a set of <key, value> pairs, with the key serving as the unique identifier of the value. Unlike a relational database, it does not support complex relational schemas; instead it processes data through simple interfaces such as Put(k, v), Get(k), Update(k, v), and Delete(k). Because of its high performance and high scalability, the key-value store plays an important role in current network applications and distributed systems, and is widely used in graph databases, task queues, stream processing engines, application data caches, event tracking systems, and similar fields.

The LSM tree (Log-Structured Merge tree) is a storage engine that is widely used in key-value storage systems. It maintains a buffer in memory: when a user writes a key-value pair, the data is written into the buffer and kept sorted there, and the write operation then completes. When the buffer exceeds a preset size, the data in the buffer is written to disk in a single pass. This effectively converts a large number of random writes into a small number of sequential writes. Since the sequential write performance of a hard disk is far higher than its random write performance, the LSM tree writes quickly and is therefore well suited to write-heavy workloads. To avoid losing in-memory data during a system crash, each write is first recorded in an on-disk WAL (Write-Ahead Log) before the data is written to the buffer. Since this operation is an append-only write, it does not significantly affect the write performance of the system.
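The write path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the class name `TinyWriteBuffer`, the in-memory list standing in for the on-disk WAL, and the choice to count capacity in distinct keys are all hypothetical.

```python
class TinyWriteBuffer:
    """Toy LSM write path: log first, then buffer, then flush when full."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed: max number of distinct keys in memory
        self.wal = []              # stand-in for the append-only on-disk WAL
        self.buffer = {}           # in-memory buffer, key -> latest value
        self.flushed_runs = []     # each flush yields one sorted run

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append-only log write
        self.buffer[key] = value        # 2. update the in-memory buffer
        if len(self.buffer) >= self.capacity:
            self._flush()

    def _flush(self):
        # One sequential write of the whole buffer, sorted by key.
        self.flushed_runs.append(sorted(self.buffer.items()))
        self.buffer.clear()
        self.wal.clear()   # logged entries are no longer needed once flushed


store = TinyWriteBuffer(capacity=3)
for k, v in [(5, "a"), (2, "b"), (5, "c"), (9, "d")]:
    store.put(k, v)
print(store.flushed_runs)
```

Note that the second write to key 5 overwrites the first inside the buffer, so four requests produce only three flushed pairs.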

Data on disk is stored in multiple levels (L_1, L_2, …, L_n), where L_n denotes the bottom level and L_i denotes the i-th level (1 ≤ i ≤ n). The data of each level is ordered by key and distributed over a number of SSTables (Sorted String Tables), each of which stores the data of a certain key range in order. For two adjacent levels, the ratio of the amount of data the lower level can hold to the amount the upper level can hold is called the growth factor T, which is typically 10. For example, if the first level can hold at most 10 MB of data, the second level can hold at most 100 MB, and so on; only 7 levels are then needed to hold more than 10 TB of data in total. Data flushed from memory is written into the first level. To keep the level structure stable, a background compaction process continuously reorganizes the data on disk, writing part of the data of a level into the next level.
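The capacity arithmetic in this paragraph can be checked with a short sketch. The function name `level_capacities_mb` is hypothetical; the numbers (10 MB first level, growth factor 10, 7 levels) come from the text.

```python
def level_capacities_mb(first_level_mb, growth_factor, levels):
    """Capacity of each level when every level is growth_factor times the previous one."""
    return [first_level_mb * growth_factor ** i for i in range(levels)]


caps = level_capacities_mb(first_level_mb=10, growth_factor=10, levels=7)
total_mb = sum(caps)
print(caps)              # per-level capacities in MB
print(total_mb / 1e6)    # total in TB, taking 1 TB = 10**6 MB
```

With these parameters the bottom level alone holds 10,000,000 MB, and the seven levels together hold 11,111,110 MB, which is consistent with the "more than 10 TB" figure.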

Specifically, when the data volume of a level exceeds the maximum that the level can hold, the compaction process selects one SSTable file of that level, then selects all SSTable files in the next level whose key ranges overlap with it, merges and sorts these files, generates new files, and writes them into the next level; the old selected files are then deleted.

Taking L_1 as an example, assume that the keys contained in the SSTable file selected in this level span the range [2, 8]. If an SSTable file in L_2 has a key range that overlaps [2, 8], that file must be selected as an input of the compaction task. This ensures that L_2 remains ordered after the data is written to it. Because the amount of data each level can hold grows exponentially, writing one SSTable file of a level into the next level often requires several files of the next level to participate, and a compaction of one level increases the data stored in the next level, which may in turn trigger a compaction there. This accumulation across levels causes data on disk to be rewritten frequently. The ratio of the amount of data actually written to disk to the amount of data the user requested to write is called write amplification. Taking LevelDB, a key-value storage system built on the LSM tree structure, as an example, experimental results show that when a user requests to write 50 GB of data the write amplification is close to 20, that is, the actual amount written to disk is close to 1 TB. Excessive write amplification severely degrades the write performance of the LSM tree. Moreover, LSM trees often run on machines using SSDs, and frequent disk reads and writes shorten the life of an SSD. In summary, write amplification is a serious problem for the LSM tree structure. In addition, when the data volumes of both the memory buffer and disk level L_1 exceed their thresholds, the in-memory data cannot be flushed; new write requests can be served only after L_1 completes a compaction and frees space in that level. This causes write stalls, that is, periodic sharp increases in write latency.

Disclosure of Invention

In view of the problems described in the background, the invention aims to provide an LSM tree-oriented key-value storage method that reduces write amplification by lowering the ratio of lower-level to upper-level data participating in a compaction task, models the write amplification of the system in order to optimize system parameters, and adopts a parallel read algorithm to reduce the impact on read performance.

Another object of the present invention is to provide a key value storing system and apparatus employing the above key value storing method.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

according to a first aspect of the present invention, there is provided a key value storing method for an LSM tree, comprising the steps of:

dividing one level of the LSM tree into a plurality of sub-levels, the j-th sub-level of the i-th level being denoted L_{i,j}, wherein the SSTable files in a sub-level are arranged from left to right in increasing order of key range;

maintaining, for each level, a compaction pointer used to select the first input file of a compaction task;

when the total data volume of the i-th level L_i exceeds its rated size, triggering a compaction at that level in which part of the data of L_i is written to L_{i+1} so as to reorganize the data on disk, wherein, when the compaction task is executed, all sub-levels of L_i participate in the task while only one sub-level of L_{i+1} participates.

A single compaction task for level L_i comprises the following steps:

according to the compaction pointer of L_i, selecting in the first sub-level L_{i,1} of L_i the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of the file as the left boundary of the task and the maximum key of the file as the right boundary of the task;

for the other sub-levels L_{i,2} to L_{i,S_i} of L_i, selecting in turn the files that lie partly or entirely within the left and right boundaries and adding them to the input file set, where S_i denotes the number of sub-levels into which L_i is divided;

expanding the boundary of the current task according to the minimum and maximum keys of the files in the input file set, so that more files lie entirely within the task boundary;

selecting in L_{i+1} the sub-level L_{i+1,j} that currently holds the least data, selecting from L_{i+1,j}, according to the task boundary, the files that lie within or overlap the boundary and adding them to a candidate file set, dividing the compaction task using the files in the candidate file set, and adding to the input file set only those files that still need to participate after the task has been divided;

for the input file set, performing a multi-way merge sort on the data located in L_i within the task boundary together with the data located in L_{i+1}, generating new files, and writing them to L_{i+1,j};

for the input file set, performing a multi-way merge sort on the data located in L_i outside the task boundary, writing the data in the newly generated files that is smaller than the left boundary of the task back to L_i, writing the data that is larger than the right boundary of the task into the compaction cache of L_i, recording in a log the minimum and maximum keys of each cached file together with the files in the input file set that overlap the cached file, and deleting the files in the input file set that are not recorded in the log;

replacing the compaction pointer of L_i with the right boundary of the current compaction task.

The task is divided as follows:

for each file in the candidate file set, obtaining from in-memory metadata the minimum key k_min and the maximum key k_max contained in the file;

querying the files in the input file set according to k_min and k_max: if, for every file in the input file set, either [k_min, k_max] does not overlap the file, or the file contains no key-value pair between its largest key smaller than k_min and its smallest key larger than k_max, then removing the candidate file from the candidate file set and cutting the files in the input file set into two parts according to k_min and k_max, one part containing the keys smaller than k_min and the other containing the keys larger than k_max; otherwise, moving the candidate file from the candidate file set into the input file set.

In some embodiments of the first aspect of the invention, the level L_i lies between L_{D+2} and L_n, where n is the number of levels of the LSM tree and D is a configured level boundary parameter with 1 ≤ D ≤ n; the method further comprises the following steps:

for each of the levels L_1 to L_D, sorting all data of the level in one pass using a tiered compaction algorithm and writing the newly generated files into the next level, where they form a new sub-level, with no lower-level data participating in the sort;

for L_{D+1}, sorting all data of the level together with the data of one sub-level of the lower level, and writing the newly generated data into the selected sub-level of the lower level.

Under this level organization, the write operation comprises:

obtaining the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;

writing the data to the WAL by appending;

writing the data into the memory buffer and returning;

the lookup operation comprises:

querying the memory buffer and the cache, returning the data if it is found there, and otherwise proceeding to the next step;

searching each disk level L_b in turn from L_1 to L_n, where 1 ≤ b ≤ n, wherein a thread pool is maintained whose number of threads is max(S_1, S_2, …, S_n), S_b read tasks are submitted to the thread pool for L_b, and thread_j performs a binary search on L_{b,j}, 1 ≤ j ≤ S_b;

collecting the read results of the S_b threads: if any thread has read the data, selecting the data with the largest version number, returning it, and ending the read; if no thread has read the data, continuing with L_{b+1};

if all levels have been read and the data has still not been found, returning that the data does not exist.

The range query operation comprises:

using the Seek(k) interface to find the key-value pair corresponding to the smallest key greater than or equal to k: a plurality of query tasks are submitted to the thread pool, each thread being responsible for querying one sub-level or the memory buffer and searching by bisection for the smallest key greater than or equal to k; if no thread reads any data, returning that the data does not exist; otherwise, for each thread that read data, constructing an iterator over the data it read, ordering the data read by version number, and returning the data of the latest version;

using the Next() interface to find the key-value pair corresponding to the smallest key in the system that is larger than the key currently found: if Seek(k) found data, then when the user submits a Next() request, the iterator that returned the previous result executes Next(), the data currently pointed to by each iterator is compared again, and the latest data is returned; older versions of the data are skipped in the process.
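The Seek(k)/Next() behaviour above can be sketched with a merging iterator. This is a single-threaded illustration under stated assumptions, not the patented implementation: the class name `MergingIterator` and the in-memory representation (each sub-level or buffer as a list of `(key, version, value)` tuples sorted by key, with each key appearing at most once per list) are hypothetical, and the thread pool that performs the per-sub-level searches in parallel is replaced by a plain loop.

```python
import bisect
import heapq


class MergingIterator:
    """Merge several sorted sources, returning only the newest version of each key."""

    def __init__(self, sources):
        self.sources = sources   # list of sorted lists of (key, version, value)
        self.heap = []           # entries: (key, -version, source index, position)
        self.last_key = None

    def seek(self, k):
        """Position every source at its smallest key >= k, then return the first result."""
        self.heap, self.last_key = [], None
        for idx, src in enumerate(self.sources):
            pos = bisect.bisect_left(src, (k,))   # bisection within one source
            if pos < len(src):
                key, version, _ = src[pos]
                heapq.heappush(self.heap, (key, -version, idx, pos))
        return self.next()

    def next(self):
        """Return the next (key, value) in key order, skipping older versions."""
        while self.heap:
            key, _, idx, pos = heapq.heappop(self.heap)
            src = self.sources[idx]
            if pos + 1 < len(src):                # advance the source that produced this entry
                nkey, nver, _ = src[pos + 1]
                heapq.heappush(self.heap, (nkey, -nver, idx, pos + 1))
            if key == self.last_key:
                continue                          # older version of a key already returned
            self.last_key = key
            return key, src[pos][2]
        return None


sources = [
    [(1, 7, "a7"), (5, 9, "e9")],
    [(3, 2, "c2"), (5, 4, "e4"), (8, 1, "h1")],
]
it = MergingIterator(sources)
print(it.seek(2), it.next(), it.next(), it.next())
```

Ordering the heap by key and then by descending version means the newest version of a key is always popped first, so later entries with the same key can simply be dropped.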

In some embodiments of the first aspect of the present invention, the method further comprises modeling the write amplification and selecting optimal parameters by minimizing the write amplification, comprising the following steps:

letting the number of levels of the LSM tree be n, the number of sub-levels of each level be S_b, and the growth factor of each level be T_b, where 1 ≤ b ≤ n, and letting D be the boundary between the levels that use different compaction algorithms, calculating the write amplification of each level as follows:

for writing the WAL, the write amplification is 1;

for flushing the memory buffer to disk, the write amplification is buf / Unique^{-1}(buf), where buf is the maximum number of key-value pairs the buffer can hold, Unique^{-1}(k) is the inverse function of Unique(p), Unique(p) = Σ_{k∈K} (1 − (1 − f_X(k))^p), N is the total number of distinct keys in the workload, K is the set of integers in the key space [0, N−1], and f_X(k) is the probability that key k appears in a single write request;

when 1 ≤ b ≤ D, the write amplification of L_b is given by [formula not reproduced], where Interval_b = Interval_{b-1} * S_b, Interval_0 = Unique^{-1}(buf), Size_1 = buf * S_1, [formula not reproduced], and Size_{b+1} = Size_{b+1,j} * S_{b+1};

for L_{D+1}, the write amplification is given by [formula not reproduced], where Interval_{D+1} = Interval_D * S_{D+1}, Size_{D+2} = Size_{D+1} * T_{D+2}, and Size_{D+2,j} = Size_{D+2} / S_{D+2};

when D+2 ≤ b < n, the write amplification of L_b is given by [formula not reproduced], where Interval_b = Interval_{b-1} + DInterval_b, DInterval_b is obtained by solving the equation [formula not reproduced], Size_{b+1} = Size_b * T_{b+1}, and Size_{b+1,j} = Size_{b+1} / S_{b+1};

the write amplification of each disk level, together with the write amplification of writing the WAL and of flushing the memory buffer to disk, constitutes the write amplification of the whole LSM tree;

with the total number of sub-levels of the LSM tree fixed, iteratively computing the write amplification under different parameters to obtain the values of S_b, T_b, and D that minimize the write amplification.

According to a second aspect of the present invention, there is provided an LSM tree oriented key-value store system comprising:

a first storage section that stores the first D levels of an LSM tree comprising n levels and performs compaction tasks using a tiered compaction algorithm that minimizes write amplification, where D denotes a configured level boundary parameter;

a second storage section that stores the (D+1)-th level of the LSM tree and performs compaction tasks using a compaction method comprising the following steps:

selecting all files of all sub-levels of the level and adding them to an input file set;

selecting in L_{D+2} the sub-level L_{D+2,j} that currently holds the least data and, according to the range of data contained in L_{D+1}, selecting from L_{D+2,j} all overlapping files and adding them to the input file set;

performing a multi-way merge sort on the data in the input file set and placing the newly generated files into the selected sub-level of the lower level;

a third storage section that stores levels L_{D+2} to L_n of the LSM tree and performs compaction tasks using a compaction method comprising the following steps:

dividing one level L_i of the LSM tree into a plurality of sub-levels, the j-th sub-level of the i-th level being denoted L_{i,j}, wherein the SSTable files in a sub-level are arranged from left to right in increasing order of key range;

wherein the steps of a single compaction task performed by the third storage section are identical to the steps of a single compaction task for level L_i in the LSM tree-oriented key-value storage method of the first aspect of the invention.

According to a third aspect of the present invention there is provided a key-value pair storage device, the device comprising:

one or more processors;

a memory; and

one or more computer programs stored in the memory and configured to be executed by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising a LSM tree oriented key value storing method according to the first aspect of the present invention.

The invention can obtain the following beneficial effects:

1. Each level of the LSM tree is divided at fine granularity. Several sub-levels of the upper level participate in a compaction, and because every sub-level covers the same data range, the amount of data selected from each sub-level is similar, while only one sub-level of the lower level participates. When data is selected in the lower level, the compaction task is cut so that as few lower-level files as possible participate. This lowers the ratio of lower-level data to upper-level data participating in the compaction; in other words, less lower-level data has to take part in sorting in order to move a given amount of data into the lower level, and write amplification is reduced accordingly.

2. Different compaction algorithms are adopted for different levels. Using a tiered compaction algorithm that minimizes write amplification in the upper levels speeds up the movement of data into the lower levels and reduces the occurrence of write stalls.

3. Multi-threaded parallel reads reduce the impact on read performance. By modeling the write amplification, a method for selecting optimal parameters is provided, which maximizes the write performance of the system for a fixed read performance.

Drawings

FIG. 1 is a schematic diagram of an LSM tree according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a compaction algorithm according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of compaction task division according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a parallel read algorithm according to an embodiment of the invention.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.

Fig. 1 is a schematic diagram of an LSM tree according to an embodiment of the invention. As shown in the figure, there is a buffer in memory, and the WAL is an on-disk write-ahead log that prevents buffer data from being lost when the program crashes. Data on disk is divided into three levels (L_1, L_2, L_3), and each level is divided into three sub-levels. Each sub-level contains a plurality of SSTable files. The data within a sub-level is ordered, while no ordering is maintained between different sub-levels. This relaxes the ordering of the original LSM tree: the data within each level of the original LSM tree is strictly ordered, whereas each level of the LSM tree of the present invention is split into a plurality of smaller ordered groups.

Fig. 2 is a schematic diagram of one run of the compaction algorithm according to an embodiment of the present invention. Each block in the figure represents an SSTable file. For ease of description, it is assumed that a file can hold at most two key-value pairs (in practice each file holds far more than two). The numbers in a block are the keys of the key-value pairs the file contains; the corresponding values are not shown. The mark at the upper right of a number indicates the version of the value for that key: for key 5, for example, the value corresponding to 5'' is newer than the value corresponding to 5', and the value corresponding to 5' is newer than the value corresponding to 5. In a concrete implementation, the order in which key-value pairs are written can be recorded by maintaining a global version number (e.g., a 64-bit integer) that holds the current latest version. Every time a new key-value pair is written, the current version number is encoded into the key-value pair and the global version number is incremented by 1. For example, if the current system version number is 1 and a key-value pair is inserted, 1 is assigned to this pair, which is stored as <key1, value1>, and the system version number becomes 2. When another key-value pair is then inserted, 2 is assigned to it, it is stored as <key2, value2>, and the system version number becomes 3. Thus, when several key-value pairs are read, their relative age can be determined from their version numbers.
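The version-number scheme can be sketched as follows. The names `VersionedWriter` and `newest` are hypothetical, and the sketch ignores thread safety and persistence of the counter, both of which a real system would need to address.

```python
import itertools


class VersionedWriter:
    """Assign a monotonically increasing global version number to every write."""

    def __init__(self, start=1):
        self._versions = itertools.count(start)   # global version counter
        self.entries = []                         # ((key, version), value) pairs

    def put(self, key, value):
        version = next(self._versions)            # use current version, then advance
        self.entries.append(((key, version), value))
        return version


def newest(entries, key):
    """Return the value with the largest version number for key, or None."""
    matches = [(k[1], value) for k, value in entries if k[0] == key]
    return max(matches)[1] if matches else None


w = VersionedWriter()
w.put("a", "x")   # stored with version 1
w.put("b", "y")   # stored with version 2
w.put("a", "z")   # stored with version 3, newer than the first write to "a"
print(newest(w.entries, "a"))
```

Because version numbers are unique, comparing them is enough to decide which of several values read for the same key is the most recent.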

When the amount of data in L_2 exceeds the maximum it can hold, a compaction task is triggered. First, an input file is selected from L_{2,1}. Since the compaction pointer of this level is 6, the file whose minimum key is greater than or equal to 6 and closest to 6, namely SSTable(6', 12'), is selected as the initial file and added to the input file set. The minimum key 6 of this file is recorded as the left boundary of the compaction task, and its maximum key 12 as the right boundary. Then, in each sub-level from L_{2,2} to L_{2,3}, the files that lie within the boundary or overlap it are selected according to the left and right boundaries and added to the input file set; a total of 4 files are selected: SSTable(5', 8), SSTable(12, 13'), SSTable(5', 7'), and SSTable(10', 14'). This completes file selection in L_2.
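The upper-level file selection just described can be sketched as below. Files are reduced to hypothetical `(min_key, max_key)` tuples with version marks dropped; the assignment of the four files to L_{2,2} and L_{2,3}, and the extra files (1, 4), (1, 3), and (15, 20), are assumptions made only so the example is complete. Wrapping to the first file when no minimum key is at or beyond the pointer is likewise an assumption.

```python
import bisect


def pick_initial_file(sub_level, pointer):
    """Pick the file whose minimum key is >= pointer and closest to it."""
    min_keys = [f[0] for f in sub_level]
    pos = bisect.bisect_left(min_keys, pointer)
    if pos == len(sub_level):
        pos = 0   # assumption: wrap around to the first file
    return sub_level[pos]


def overlapping(sub_level, left, right):
    """Files whose key range lies within or overlaps [left, right]."""
    return [f for f in sub_level if f[0] <= right and f[1] >= left]


# Each sub-level is a list of (min_key, max_key), sorted by min_key.
l2_1 = [(1, 4), (6, 12)]
l2_2 = [(1, 3), (5, 8), (12, 13)]
l2_3 = [(5, 7), (10, 14), (15, 20)]

initial = pick_initial_file(l2_1, pointer=6)
left, right = initial
inputs = [initial] + overlapping(l2_2, left, right) + overlapping(l2_3, left, right)
print(left, right, inputs)
```

With the pointer at 6, the initial file is (6, 12), and the four files named in the text are the ones picked up from the other two sub-levels.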

For L_3, the keys contained in the files of L_{3,1} and L_{3,3} are not shown in the figure for simplicity. Since L_{3,2} contains the least data, this sub-level is selected to participate in the compaction. Again according to the left and right boundaries, the files SSTable(5, 6), SSTable(7, 9), and SSTable(10, 11) are selected in this sub-level and added to the input file set. Selecting the sub-level with the least data keeps the data volumes of the sub-levels as close to each other as possible after the compaction task completes, but it means that no old-new relationship can be guaranteed between the data of the sub-levels. To maintain the version relationship between levels (for the same key, the upper level is newer than the lower level), all data of L_2 within a certain range must be written to the next level. Therefore, the files selected in L_2 must be cut according to the compaction boundary, and the data outside the boundary must be written back to L_2. Otherwise, L_3 could end up holding newer data than L_2.

Because data outside the boundary is eventually written back to the level, which increases write amplification, the boundary is expanded according to the minimum and maximum keys of the files selected in each sub-level. If expansion reduces the number of files that must be cut, the boundary is updated, provided that no new file is introduced. That is, the boundary is extended only on the basis of the files already selected, which guarantees that no additional file is added. Otherwise the boundary could keep expanding until all files are added to the input, making the compaction task excessively large and affecting the stability of the system. As shown in the figure, the initial boundary is [6, 12] and the expanded boundary is [5, 12], so the files SSTable(5', 8) and SSTable(5', 7') do not have to be cut.

After the files have been selected, their data is divided into two parts: (1) the portions of the files selected from L_2 that lie within the boundary, together with all files selected from L_3; and (2) the portions of the files selected from L_2 that lie outside the boundary. A multi-way merge sort is applied to the data of the first part, generating 4 new files, SSTable(5', 6'), SSTable(7', 8), SSTable(9, 10'), and SSTable(11, 12), which are written to L_{3,2}. An intra-level compaction is performed on the data of the second part: a multi-way merge sort generates a new file SSTable(13', 14'), which is put back into L_{2,3}. Finally, the compaction pointer is replaced with the right boundary 12 of the task, and the files in the input file set are deleted.
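The multi-way merge sort of the first part can be sketched as follows. This is an illustration under stated assumptions: entries are `(key, version, value)` tuples, the integer versions are invented stand-ins for the prime marks in the figure, each input run is sorted by key with unique keys, and the function names `merge_latest` and `split_into_files` are hypothetical.

```python
import heapq


def merge_latest(runs):
    """Multi-way merge of sorted runs, keeping only the newest version of each key."""
    merged = []
    # Negating the version makes the newest entry of a key come first in the merge.
    streams = [[(k, -ver, val) for k, ver, val in run] for run in runs]
    for key, neg_version, value in heapq.merge(*streams):
        if merged and merged[-1][0] == key:
            continue   # an older version of a key that was already emitted
        merged.append((key, -neg_version, value))
    return merged


def split_into_files(entries, pairs_per_file):
    """Cut the merged output into files holding at most pairs_per_file entries."""
    return [entries[i:i + pairs_per_file] for i in range(0, len(entries), pairs_per_file)]


runs = [
    [(6, 21, "6@21"), (12, 22, "12@22")],      # upper-level data inside the boundary
    [(5, 15, "5@15"), (8, 10, "8@10")],
    [(7, 17, "7@17"), (10, 18, "10@18")],
    [(5, 1, "5@1"), (6, 2, "6@2"), (7, 3, "7@3"),
     (9, 4, "9@4"), (10, 5, "10@5"), (11, 6, "11@6")],   # lower-level sub-level
]
files = split_into_files(merge_latest(runs), pairs_per_file=2)
print([[key for key, _, _ in f] for f in files])
```

With two pairs per file, as in the figure, the eight surviving keys 5 through 12 form four output files.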

To further reduce the overhead of writing files back to L_2, a compaction cache is set up for each level to hold the files generated by intra-level compaction. Specifically, an intra-level compaction in L_2 may generate files in two parts, the first part lying to the left of the left boundary and the second part lying to the right of the right boundary. The first part is written to disk, while the second part is kept in the compaction cache and is not written to disk. The input files of compaction tasks are chosen by a circular selection strategy, that is, when the level performs its next compaction, the right boundary of the current compaction task becomes the left boundary of the next one. The cached files can therefore be read directly from memory, saving one read and one write of the disk files. Because the boundary is expanded and the compaction task has a definite boundary, no files to the left of the left boundary are generated if compactions of L_1 are left out of consideration, so the cached files occupy very little memory. The cache is used only once: if a compaction task triggered by L_1 includes a file held in the cache, that file is moved out of the cache, and in this case the cache has still saved one read and one write of the disk file.

A computer crash may cause the compaction cache to be lost. To avoid data loss, the disk log records, together with the other compaction metadata (such as the new files generated by the compaction, the compaction pointer, and statistics of the compaction task), the sub-level to which each compaction cache file belongs, the minimum and maximum keys of the cache file, and the source SSTable files of the cache file; in the last step of the task, the input files related to the cache are not deleted. Thus, after a crash, the data in the compaction cache can be restored from the input files using this information.

Fig. 3 shows how the amount of lower-level data participating in a task is reduced by dividing the compaction task. The figure shows a single compaction task of L_2. After file selection in L_2 is finished, SSTable(1, 2), SSTable(3, 6), and SSTable(7, 10) are first selected in L_{3,1} as candidate files according to the compaction boundary. Then, for each of the three candidate files, the files of each sub-level in the current input file set are queried according to the minimum and maximum keys of the candidate file, to determine whether the candidate file can be left out of the current compaction. The candidate file is not added to the task if every file f_i in the input file set satisfies one of the following two conditions: (1) the minimum and maximum keys of the candidate file lie outside the range of f_i; (2) the minimum and maximum keys of the candidate file lie within the range of f_i, but f_i can be divided into two parts neither of which overlaps the candidate file.

In the figure, the key range [3, 6] of candidate file SSTable(3, 6) does not overlap the key range [7, 9] of the file SSTable(7', 9) in the input file set, nor the key range [1, 2] of SSTable(1', 2'). The candidate file does overlap the key range [2, 7] of the file SSTable(2', 7'), but if SSTable(2', 7') is split into SSTable(2') and SSTable(7'), neither of these two files overlaps the range [3, 6]. The candidate file SSTable(3, 6) therefore does not participate in the current compaction. The compaction is divided into two subtasks, one responsible for sorting the data in the range [1, 2] and the other for sorting the data in the range [7, 10], and the two subtasks can be executed in parallel. This reduces the number of L_3 files participating in the task, which lowers write amplification, and it also increases the parallelism of the compaction, which speeds it up.
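The two exclusion conditions and the resulting split can be sketched as below. This is an illustration only: input files are represented as hypothetical sorted lists of keys with version marks dropped, a candidate file is a `(k_min, k_max)` tuple, and the function names `can_skip` and `split_around` are assumptions.

```python
def can_skip(candidate, input_files):
    """True if the candidate lower-level file can be left out of the compaction."""
    k_min, k_max = candidate
    for keys in input_files:
        # Condition 1: the key ranges do not overlap at all.
        no_range_overlap = keys[-1] < k_min or keys[0] > k_max
        # Condition 2: the ranges overlap, but no key falls inside [k_min, k_max],
        # so the input file can be cut into a left part and a right part.
        no_key_inside = not any(k_min <= k <= k_max for k in keys)
        if not (no_range_overlap or no_key_inside):
            return False
    return True


def split_around(keys, candidate):
    """Cut an input file into the keys below k_min and the keys above k_max."""
    k_min, k_max = candidate
    return [k for k in keys if k < k_min], [k for k in keys if k > k_max]


input_files = [[7, 9], [1, 2], [2, 7]]          # keys of the selected upper-level files
for candidate in [(1, 2), (3, 6), (7, 10)]:
    print(candidate, can_skip(candidate, input_files))
print(split_around([2, 7], (3, 6)))
```

For the files in the figure, only the candidate (3, 6) can be skipped, and cutting the input file with keys 2 and 7 around it yields the two halves that feed the [1, 2] and [7, 10] subtasks.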

Fig. 4 shows the read algorithm of the LSM tree in the present invention. Because each level is further divided, the number of sub-levels that must be read increases, which affects read performance. To improve read performance, the invention adopts a parallel read algorithm. It maintains a thread pool whose number of threads equals the maximum number of sub-levels in any level of the LSM tree. When a read request arrives and the corresponding data is not found in memory, the data on disk must be queried. L_1 is queried first, with thread_j responsible for querying L_{1,j}. When the queries of all sub-levels have completed, the results of the threads are collected. If any thread found a result, the corresponding data is compared by version number and the latest result is returned. If no thread found a result, the query proceeds to L_2, and so on, until the corresponding result is found and returned, or every level has been searched without finding the data, in which case the data is reported as not existing.
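The level-by-level parallel lookup can be sketched with a thread pool. This is an illustration under stated assumptions: each sub-level is flattened into one sorted list of `(key, version, value)` tuples rather than a set of SSTable files, the in-memory buffer lookup that precedes the disk search is omitted, and the function names are hypothetical.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def search_sub_level(sub_level, key):
    """Binary search for key in one sorted sub-level; return the entry or None."""
    pos = bisect.bisect_left(sub_level, (key,))
    if pos < len(sub_level) and sub_level[pos][0] == key:
        return sub_level[pos]
    return None


def parallel_get(levels, key):
    """Search levels top-down; within a level, search all sub-levels in parallel."""
    pool_size = max(len(level) for level in levels)   # max number of sub-levels
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for level in levels:
            results = pool.map(lambda sub: search_sub_level(sub, key), level)
            hits = [entry for entry in results if entry is not None]
            if hits:
                # Several sub-levels may hold the key: the largest version wins.
                return max(hits, key=lambda entry: entry[1])[2]
    return None   # every level was searched without finding the key


levels = [
    [[(1, 10, "a10")], [(4, 12, "d12")], []],          # L_1 with three sub-levels
    [[(2, 3, "b3"), (7, 5, "g5")], [(7, 8, "g8")]],    # L_2 with two sub-levels
]
print(parallel_get(levels, 7), parallel_get(levels, 9))
```

The search stops at the first level that yields a hit, relying on the invariant stated earlier that for the same key an upper level is newer than a lower one.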

The number of sub-levels S_i of each level, the growth factor T_i of each level, and the boundary D between the levels that use the tiered compaction algorithm and the levels that use the fine-grained compaction algorithm described in Fig. 2 all have a great influence on system performance. The write amplification of the system under different parameters can therefore be modeled, and by minimizing write amplification the parameters that optimize the write performance of the system are obtained.

Assume that the key space K of the workload is the range [0, N-1], where N is the total number of distinct keys in the workload. The keys follow some distribution X, such as a uniform distribution or a Zipf distribution. The probability that key k occurs in a single write request is f_X(k). For example, when the keys follow a uniform distribution, f_X(k) = 1/N; when the keys follow a Zipf distribution, f_X(k) is the Zipf probability, in which s denotes the degree of data skew and h maps each key to an integer in the key space K. For p requests, the expected number of distinct keys that occur is Unique(p) = Σ_{k∈K} (1 - (1 - f_X(k))^p). The inverse function of Unique(p) is Unique^{-1}(k). Since Unique(p) is a monotonic function, Unique^{-1}(k) can be solved by extending the domain of Unique(p) to the real numbers. Let the size of one file be u; then the total size of the new file generated after compacting k files u_1, u_2, …, u_k is
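As a numerical illustration of Unique(p) and its inverse, the sketch below evaluates the sum directly and inverts it by bisection over real-valued p. The function names and the choice of bisection are assumptions made for illustration; the text above only states that the inverse can be solved once the domain is extended to the real numbers.

```python
def unique(p, probs):
    """Expected number of distinct keys seen in p requests.

    `probs` lists f_X(k) for every key k in the key space.
    """
    return sum(1.0 - (1.0 - f) ** p for f in probs)

def unique_inverse(k, probs, tol=1e-9):
    """Number of requests p (real-valued) such that unique(p, probs) == k."""
    if not 0 <= k < len(probs):
        raise ValueError("k must satisfy 0 <= k < number of keys")
    lo, hi = 0.0, 1.0
    while unique(hi, probs) < k:         # grow the bracket until it contains k
        lo, hi = hi, hi * 2.0
    while hi - lo > tol * max(1.0, hi):  # bisection on a monotonic function
        mid = (lo + hi) / 2.0
        if unique(mid, probs) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

uniform = [1.0 / 100] * 100              # uniform distribution over N = 100 keys
p = unique_inverse(30, uniform)
print(p, unique(p, uniform))
```

Because repeated keys are counted once, more than 30 requests are needed on average to observe 30 distinct keys, so the returned p is larger than 30.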

The write overhead is modeled by characterizing the write amplification of each level. Because data is written to the WAL before it is written to the memory buffer, the write amplification of this step is WA_buf = 1.

Let buf be the number of key-value pairs that the system buffer can hold. The in-memory buffer is regarded as L_0, i.e., Size_0 = buf. When the buffer reaches its capacity threshold, the data in the buffer is written to L_1 in a batch. Since the buffer contains no duplicate key-value pairs, the number of write requests needed to fill the buffer from empty is Unique^{-1}(buf), which is also the interval Interval_0 at which the data of this level is written in full to the next level. The amount of data written to disk is buf, so the write amplification of flushing memory to disk is WA_{0→1} = buf / Unique^{-1}(buf). The size of a sub-level L_{1.j} of L_1 is Size_{1.j} = buf, and the total size of L_1 is Size_1 = buf * S_1.
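For uniformly distributed keys, Unique(p) = N * (1 - (1 - 1/N)^p) can be inverted in closed form, which gives a quick worked example of the quantities above. The sketch assumes a uniform distribution, and the function names are hypothetical.

```python
import math

def interval_0_uniform(buf, n_keys):
    """Unique^{-1}(buf) for uniform keys: solve N * (1 - (1 - 1/N)^p) = buf."""
    if n_keys < 2 or not 0 <= buf < n_keys:
        raise ValueError("need n_keys >= 2 and 0 <= buf < n_keys")
    return math.log(1.0 - buf / n_keys) / math.log(1.0 - 1.0 / n_keys)

def first_level_model(buf, n_keys, s1):
    """Quantities of the buffer (L_0) and of L_1 under uniform keys."""
    interval_0 = interval_0_uniform(buf, n_keys)
    return {
        "WA_buf": 1.0,                   # every request is written to the WAL once
        "Interval_0": interval_0,        # requests needed to fill the buffer
        "WA_0_1": buf / interval_0,      # buf / Unique^{-1}(buf)
        "Size_1": buf * s1,              # S_1 sub-levels of size buf each
    }

model = first_level_model(buf=100, n_keys=1000, s1=4)
print(model)
```

Because some requests overwrite keys already in the buffer, Interval_0 exceeds buf and WA_{0→1} comes out slightly below 1 in this setting.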

The write amplification caused by on-disk compaction is calculated from the amount of data written to the lower level within a given interval.

For L_i (1 ≤ i ≤ D), the tiered compaction algorithm is adopted, and a compaction is triggered when the number of sub-levels of the level reaches S_i. Each time a sub-level is added to the level, the time required is the interval between two compactions of L_{i-1}. Therefore, the interval at which compaction occurs in this level is Interval_i = Interval_{i-1} * S_i. Within this interval, the amount of data written to L_{i+1} is Write_{i+1}. Thus, the write amplification of L_i is WA_{i→i+1} = Write_{i+1} / Interval_i. The size of one sub-level of L_{i+1} is Size_{(i+1).j} = Write_{i+1}, and the total size of L_{i+1} is Size_{i+1} = Size_{(i+1).j} * S_{i+1}.

For L_{D+1}, a compaction is triggered when the number of sub-levels of the level reaches S_{D+1}. Each time a sub-level is added to the level, the time required is the interval between two compactions of L_D. Therefore, the interval at which compaction occurs in this level is Interval_{D+1} = Interval_D * S_{D+1}. Within this interval, the amount of data written to L_{D+2} is Write_{D+2}, computed with Size_{D+2} = Size_{D+1} * T_{D+2} and with the size of the j-th sub-level Size_{(D+2).j} = Size_{D+2} / S_{D+2}. Thus, the write amplification of L_{D+1} is WA_{D+1→D+2} = Write_{D+2} / Interval_{D+1}.

For L_i (D+2 ≤ i < n), because each sub-level covers the same data range, the data ranges selected from each sub-level of the level in a compaction task are substantially the same, so the analysis can be carried out on the first sub-level L_{i.1} of the level. Let DInterval_i be the interval between two compactions of the same key in L_{i.1}, and let d be the one-way distance in L_{i.1} from a key to LastKey, the key that was most recently compacted to the next level, where 0 ≤ d ≤ N-1. For a fixed d, if the level has a key k_1 whose distance from LastKey is d, then since that key was last compacted the sub-level has received DInterval_i * d / (N * S_i) new requests. If key k_1 appears in these new requests, then k_1 is present in L_{i.1}, which determines the probability that k_1 is present. Let P(LastKey = k) = 1/N for any k ∈ K; considering all k gives the probability that L_{i.1} contains a key at distance d from LastKey. Considering all d then yields the DInterval_i of the level. The interval of the level is Interval_i = Interval_{i-1} + DInterval_i. Within this interval, the amount of data written to the lower level is Write_{i+1}, computed with Size_{i+1} = Size_i * T_{i+1} and with the size of the j-th sub-level Size_{(i+1).j} = Size_{i+1} / S_{i+1}. Therefore, the write amplification of this level is WA_{i→i+1} = Write_{i+1} / Interval_i.

All the WA values are added to obtain the total WA of the LSM tree. Read performance is affected by the total number of sub-levels of the LSM tree: in general, the more sub-levels there are, the more I/O a read operation requires. With the total number of sub-levels of the LSM tree fixed, that is, with the sum of S_i over all levels held at a fixed value, the total WA under different parameters is obtained by iteration, and the S_i, T_i and D at which WA is minimal are recorded.
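The iteration over parameters can be sketched as an exhaustive search. This is only an outline under stated assumptions: `total_wa` stands in for the write amplification model described above and is passed in as a hypothetical callable, the candidate growth factors are supplied by the caller, and the cost function in the usage example is a toy, not the model of the invention.

```python
from itertools import product

def compositions(total, parts, minimum=1):
    """Yield every tuple of `parts` integers >= minimum that sums to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest

def search_parameters(n_levels, total_sublevels, growth_choices, total_wa):
    """Return (wa, S, T, D) with minimal wa for a fixed total number of sub-levels."""
    best = None
    for S in compositions(total_sublevels, n_levels):        # sub-levels per level
        for T in product(growth_choices, repeat=n_levels):   # growth factor per level
            for D in range(1, n_levels + 1):                 # tiered/fine-grained boundary
                wa = total_wa(S, T, D)
                if best is None or wa < best[0]:
                    best = (wa, S, T, D)
    return best

# Toy cost function used only to exercise the search loop.
toy_wa = lambda S, T, D: abs(S[0] - 2) + sum(T) + D
print(search_parameters(2, 4, (2, 4), toy_wa))
```

Exhaustive enumeration is practical here only because the number of levels and the set of candidate growth factors are small; a larger search space would call for pruning.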

The parameter optimization algorithm according to the embodiment of the invention is as follows:

According to another embodiment of the present invention, there is provided an LSM tree oriented key value storing system including:

The steps that the third storage part performs for one compaction task are identical to the steps of one compaction task of level L_i in the foregoing method embodiment, and are not described again here.

The key-value storage system maintains a global version number (e.g., a 64-bit integer), encodes the current version number into each key-value pair, and increments the global version number by 1 for each new key-value pair written. For example, if the current system version number is 1 and a key-value pair is inserted, version number 1 is assigned to this pair, which is stored as <key1, value1>, and the system version number becomes 2. When another key-value pair is inserted, version number 2 is assigned to it, it is stored as <key2, value2>, and the system version number becomes 3. Thus, when several key-value pairs are read, their relative age can be determined by comparing their version numbers.
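The versioning scheme can be illustrated with a small in-memory sketch. The class name `VersionedBuffer` and the encoding (an 8-byte big-endian version appended to the key) are assumptions chosen for illustration; the text above specifies only that the version number is encoded into the key-value pair.

```python
import struct

class VersionedBuffer:
    """Toy buffer that stamps every written pair with a global version number."""

    def __init__(self, start_version=1):
        self.version = start_version     # global version number (64-bit in practice)
        self.entries = {}                # encoded key -> value

    def put(self, key: bytes, value: bytes) -> int:
        assigned = self.version
        # Assumed layout: user key followed by the version as 8 big-endian bytes.
        self.entries[key + struct.pack(">Q", assigned)] = value
        self.version += 1                # the next write gets the next number
        return assigned

    def get_latest(self, key: bytes):
        """Return (version, value) of the newest entry for `key`, or None."""
        found = [(struct.unpack(">Q", enc[-8:])[0], val)
                 for enc, val in self.entries.items() if enc[:-8] == key]
        return max(found, key=lambda item: item[0]) if found else None

buf = VersionedBuffer()
print(buf.put(b"key1", b"value1"), buf.put(b"key2", b"value2"), buf.version)
```

When the same key is written again, both entries are kept and the one with the larger version number is treated as the current value.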

In a hierarchical manner as described above, the write operation of the key-value store system includes:

acquiring the global version number, incrementing it, and encoding the version number into the key of the key-value pair;

writing the data into the WAL in append mode;

writing the data into a memory buffer and returning;

the lookup operation of the key value storage system includes:

gathering the read results of the S_b threads; if any thread has read data, selecting the data with the largest version number and returning it, which ends the read; if no thread has read data, continuing by reading L_{b+1};

The range query operation includes:

Write amplification of the system under different parameters can also be described by modeling. By minimizing write amplification, parameters are obtained that optimize the write performance of the system. The specific model building steps are the same as those in the foregoing method embodiments, and are not described herein.

According to another embodiment of the present invention, there is also provided a key-value pair storage device including:

one or more processors;

a memory; and

one or more computer programs stored in the memory and configured to be executed by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising an LSM tree oriented key value storing method as described in the previous method embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, it is intended that such modifications and variations be included in the embodiments of the present invention.

Claims

1. A method for storing key values for an LSM tree, the method comprising:

when the total data amount of the i-th level L_i exceeds its rated size, triggering a compaction at that level and writing part of the data of level L_i to level L_{i+1} so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of level L_i participate in the task while only one sub-level of level L_{i+1} participates in the task;

one compaction task of level L_i comprises the following steps:

for the other sub-levels of level L_i, sequentially selecting some or all of the files within the left and right boundaries and adding them to the input file set, wherein S_i represents the number of sub-levels into which level L_i is divided;

selecting, in level L_{i+1}, the sub-level L_{i+1,j} that currently holds the least data; selecting from L_{i+1,j}, according to the task boundaries, the files located within the boundaries or overlapping them and adding them to a candidate file set; dividing the compaction task by means of the files in the candidate file set; and adding to the input file set only those files that still need to participate after the task has been divided;

for the input file set, performing a multi-way merge sort on the data located in level L_i within the task boundaries and the data located in L_{i+1,j}, generating new files, and writing them to L_{i+1,j};

2. The LSM tree oriented key value storing method according to claim 1, wherein the specific step of dividing the task comprises:

querying the files in the input file set according to k_min and k_max; if, for every file in the input file set, either [k_min, k_max] does not overlap the file, or no other key-value pair exists in the file between its largest key smaller than k_min and its smallest key larger than k_max, removing the candidate file from the candidate file set and, based on k_min and k_max, splitting the files in the input file set into two parts, one containing the keys smaller than k_min and the other containing the keys larger than k_max; otherwise, removing the candidate file from the candidate file set and adding it to the input file set.

3. The LSM tree oriented key value storage method of claim 1, wherein the level L_i lies between levels L_{D+2} and L_n, where n is the number of levels of the LSM tree and D is a set level-boundary parameter with 1 ≤ D ≤ n; the method further comprises:

for levels L_1 to L_D, adopting the tiered compaction algorithm, in which all data of a level is sorted at once, the newly generated files are written to the next level where they form a new sub-level, and no lower-level data participates in the sort;

4. A LSM tree oriented key-value store method according to claim 3, characterized in that in this hierarchical manner the write operation comprises:

writing the data into the WAL in append mode;

writing the data into a memory buffer and returning;

the lookup operation includes:

inquiring the memory buffer, if the memory buffer exists, returning data, otherwise, performing the next step;

from L_1 to L_n, searching each level L_b on the disk in turn, where 1 ≤ b ≤ n; maintaining a thread pool in which the number of threads is max(S_1, S_2, …, S_n); for L_b, submitting S_b read tasks to the thread pool, wherein thread_j performs a binary search on L_{b,j}, 1 ≤ j ≤ S_b;

5. The LSM tree oriented key-value store method of claim 4, wherein the range query operation comprises:

6. A LSM tree oriented key-value store system comprising:

A first storage section that stores the first D levels of an LSM tree including n levels, and performs compaction tasks by adopting a tiered compaction algorithm that minimizes write amplification, where D represents a set level-boundary parameter;

selecting, in level L_{D+2}, the sub-level L_{D+2,j} that currently holds the least data, and, according to the data range contained in L_{D+1}, selecting all overlapping files from L_{D+2,j} and adding them to the input file set;

the third storage part performs the following steps for one compaction task:

according to the compaction pointer of L_i, selecting from the first sub-level L_{i.1} of L_i the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of the file as the left boundary of the task and the maximum key of the file as the right boundary of the task;

for the other sub-levels of the level, sequentially selecting some or all of the files within the left and right boundaries and adding them to the input file set, wherein S_i represents the number of sub-levels into which level L_i is divided;

for the input file set, performing a multi-way merge sort on the data located in L_i within the task boundaries and the data located in L_{i+1,j}, generating new files, and writing them to L_{i+1,j};

for the input file set, performing a multi-way merge sort on the data located in L_i outside the task boundaries; writing the data in the generated new file that is smaller than the left boundary of the task to L_i; writing the data in the new file that is larger than the right boundary of the task to the compaction cache of the level; recording in the log the minimum key and maximum key of the cache file and the files in the input file set that overlap the cache file; and deleting the files in the input file set that are not recorded in the log;

and replacing the compaction pointer of the level with the right boundary of the current compaction task.

7. A key-value pair storage device, the device comprising:

one or more processors;

a memory; and

one or more computer programs stored in the memory and configured to be executed by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising the LSM tree oriented key value storing method of any of claims 1-5.