WO2022033099A1

WO2022033099A1 - Index construction method and apparatus

Info

Publication number: WO2022033099A1
Application number: PCT/CN2021/094068
Authority: WO
Inventors: 任仁; 王晨
Original assignee: 华为技术有限公司
Priority date: 2020-08-13
Filing date: 2021-05-17
Publication date: 2022-02-17
Also published as: CN114077378A

Abstract

An index construction method and apparatus. The index construction method comprises: a storage device associating a first physical block with a first logical segment, wherein a first keyword and a first value corresponding to the first keyword are stored in the first physical block, and the first keyword and a first storage position, in the first physical block, of the first keyword satisfy a first position prediction model corresponding to the first logical segment (S101); the storage device determining whether a second keyword stored in a second physical block and a second storage position, in the second physical block, of the second keyword satisfy the first position prediction model, wherein a second value corresponding to the second keyword is also stored in the second physical block, and the second physical block is adjacent to the first physical block (S102); and when the second keyword and the second storage position, in the second physical block, of the second keyword do not satisfy the first position prediction model, the storage device associating the second physical block with a second logical segment, wherein at this time, the second keyword and the second storage position, in the second physical block, of the second keyword satisfy a second position prediction model corresponding to the second logical segment (S103). In this way, an index for quickly looking up data can be constructed, and the number of times that physical blocks need to be accessed is relatively small.

Description

A method and device for building an index

This application claims priority to Chinese patent application No. 202010812177.7, entitled "A Method and Device for Constructing Indexes", filed with the State Intellectual Property Office of China on August 13, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of storage technologies, and in particular, to a method and apparatus for constructing an index.

Background technique

In scenarios such as Internet big data applications and cloud computing big data applications, fast access to large-scale data is usually required. To this end, while storing data, the storage device creates a corresponding index for the data, and the index provides a pointer to the data. In this way, during data retrieval, the storage device can quickly find corresponding data based on the index, thereby realizing fast data access.

Therefore, how to construct an index for quickly searching data is an important problem that needs to be solved urgently at present.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a method and apparatus for constructing an index, so as to construct an index for quickly searching data.

In a first aspect, an embodiment of the present application provides a method for constructing an index, and the method can be applied to a storage device. Specifically, the storage device may associate a first physical block with a first logical segment, where the first physical block stores a first keyword and a first value corresponding to the first keyword, and the first keyword and the first storage position of the first keyword in the first physical block satisfy the first position prediction model corresponding to the first logical segment. In this way, when a keyword is known, the first position prediction model can be used to predict the position of that keyword in the first physical block. Then, for the second physical block adjacent to the first physical block, the storage device may determine whether the second keyword stored in the second physical block and the second storage position of the second keyword in the second physical block satisfy the first position prediction model, where the second physical block also stores a second value corresponding to the second keyword. When the second keyword and the second storage position do not satisfy the first position prediction model, the first position prediction model cannot predict the positions of keywords in the second physical block with sufficient accuracy. In this case, the storage device can associate the second physical block with the second logical segment, where the second keyword and the second storage position satisfy the second position prediction model corresponding to the second logical segment.

In the process of building the index, the keywords stored in each physical block and the storage positions of those keywords in the physical block satisfy the corresponding position prediction model. Therefore, when data is retrieved based on the index, the physical block where a keyword is located can be determined from the keyword to be retrieved and the corresponding position prediction model, and the value corresponding to the keyword can then be found in that physical block. In addition, because the storage device associates physical blocks with logical segments in units of whole physical blocks, the different keywords in each physical block and their storage positions in the physical block satisfy the same position prediction model. As a result, during data retrieval the position prediction model can usually predict the physical block where the keyword is located, and the value corresponding to the keyword can be found in that physical block rather than searched for across multiple physical blocks, which reduces the number of accesses to physical blocks.

In addition, if the storage device is to minimize the number of physical blocks accessed in each retrieval, that is, if the position prediction model of each physical block in the constructed index must accurately predict the position of every keyword in the physical block, or must at least predict the physical block corresponding to any known keyword, then a larger number of position prediction models needs to be recorded in the constructed index (for example, one position prediction model per physical block), and the storage device needs more storage space to store the index. Conversely, if the number of position prediction models is reduced, the positions that the storage device predicts for keywords may be less accurate. Therefore, in practical applications, a balance between storage space and retrieval performance can be achieved by controlling the maximum number of physical blocks that are allowed to be accessed each time the storage device performs index retrieval.

In a possible implementation manner, when the second keyword and the second storage position satisfy the first position prediction model, the first position prediction model can predict the position of each keyword in the second physical block with sufficient accuracy. In this case, the storage device can also associate the second physical block with the first logical segment. In this way, the position of each keyword in both physical blocks can be predicted by using a single position prediction model.

In a possible implementation manner, when determining whether the second keyword and the second storage position of the second keyword in the second physical block satisfy the first position prediction model, the storage device may specifically calculate, according to the second keyword, the predicted position of the second keyword in the second physical block by using the first position prediction model, and check whether the error between the predicted position and the actual second storage position of the second keyword in the second physical block is within a preset error range. If the error does not exceed the preset error range, it can be determined that the second keyword and the second storage position satisfy the first position prediction model; if the error exceeds the preset error range, it can be determined that they do not satisfy the first position prediction model.

In a possible implementation, when the number of physical blocks storing key-value pairs is large, the number of logical segments associated with the physical blocks is usually also large, and correspondingly the number of position prediction models that the storage device needs is also large. In order to quickly determine, during index retrieval, the position prediction model corresponding to the physical block that stores the keyword to be retrieved, the storage device can also build an upper-level index for the first logical segment and the second logical segment. Specifically, the storage device may write a third keyword and a third value corresponding to the third keyword into an index block, where the third keyword may include a keyword in a physical block associated with the first logical segment, and the third value includes the model description value of the first position prediction model and the first address of the first physical block. In this way, when performing retrieval, the storage device can determine, according to the position prediction model of the upper-level index, the position prediction model of the lower-level index corresponding to the keyword to be retrieved, and then use the position prediction model of the lower-level index to predict the physical block (or the position in the physical block) of the keyword to be retrieved.

In a possible implementation, the first position prediction model includes a linear fitting function, and the model description value of the first position prediction model includes the inverse slope of the linear fitting function. Of course, in other possible implementations, the position prediction model may also be a nonlinear fitting function, such as a neural network, etc., which can predict the physical block corresponding to the keyword or the position on the physical block.

In a possible implementation manner, the reciprocal slope corresponding to the first position prediction model may be determined by rounding the average value of the differences between every two adjacent keywords in the first physical block. In this way, the storage device does not need to perform floating-point operations when predicting the position of a keyword, which reduces the computational overhead of predicting keyword positions and, at the same time, flexibly supports keywords of different lengths.

In a possible implementation manner, the association between the first logical segment and the first physical block, and the association between the second logical segment and the second physical block, may be located in a subtree index structure of an LSM tree. The subtrees in the LSM tree can include learned indexes as well as index structures that support data modification, insertion, and update. The index structures that support data modification, insertion, and update hold a small amount of data, much smaller than the amount of data held by the learned indexes. This allows the LSM tree to benefit from the performance and cost advantages of the learned index while still supporting data modification, insertion, and update.

In a second aspect, an embodiment of the present application further provides a device for constructing an index, configured to execute the method described in any one of the implementation manners of the first aspect.

In a third aspect, an embodiment of the present application further provides a device for constructing an index. The device includes a memory and a processor, where the processor is configured to execute instructions stored in the memory to perform the method described in any one of the implementations of the first aspect.

A fourth aspect of the present application provides a computer-readable medium, where instructions are stored in the computer-readable medium, which, when executed on a computer, cause the computer to execute the methods described in the above aspects.

A fifth aspect of the present application provides a computer program product, which, when run on a computer, causes the computer to execute the methods described in the above aspects.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description show only some of the embodiments described in the present application, and those skilled in the art can also obtain other drawings from these drawings.

FIG. 1 is a schematic flowchart of a method for constructing an index in an embodiment of the present application;

FIG. 2 is a schematic diagram of writing a key-value pair to a physical block;

FIG. 3 is a schematic diagram of a physical block associated with a first logical segment or a second logical segment;

FIG. 4 is a schematic diagram of a four-layer index structure constructed in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an LSM tree;

FIG. 6 is a schematic diagram of the data scale of each layer of subtrees of the LSM tree;

FIG. 7 is a schematic structural diagram of an indexing device provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an apparatus for constructing an index provided by an embodiment of the present application.

Detailed description

The embodiments of the present application propose a method and apparatus for constructing an index, so as to construct an index for quickly searching data.

Referring to FIG. 1 , a schematic flowchart of a method for constructing an index is shown, and the method can be applied to a storage device. The storage device can be a server or a controller in a storage array. The storage device can not only store data, but also store an index corresponding to the data at the same time, so that the data can be quickly accessed based on the index. Specifically, the method may include:

S101: The storage device associates a first physical block with a first logical segment, where the first physical block stores a first keyword and a first value corresponding to the first keyword, and the first keyword and the first storage position of the first keyword in the first physical block satisfy the first position prediction model corresponding to the first logical segment.

Typically, a storage device can index data in a database based on key-value pairs. A key-value pair, which can also be called a name-value pair or an attribute-value pair, includes a keyword (key) and a value corresponding to the keyword, and the value can usually be metadata. The keyword is used to identify the metadata (value). Metadata is data that describes data, and is used to describe attribute (feature) information of the data stored in the database. For example, the metadata can be the file name of the data stored in the database, or a storage address pointer of the data.

In this embodiment, when establishing an index, multiple key-value pairs may be sorted according to the value of the keyword (key value), and the multiple key-value pairs may be written into the first physical block in sorted order. The first physical block may have a size of 4K or 8K bytes, a granularity that is friendly for writing persistence-layer data to disk (of course, the first physical block may also be of another size, which is not limited in this embodiment). For convenience of description, a keyword written in the first physical block is referred to as the first keyword, and a value written in the first physical block is referred to as the first value. It should be understood that the first keyword may be one or more keywords, and correspondingly, the first value may also be one or more values.

As an example, the key-value pairs may be written into the first physical block in the manner shown in FIG. 2. As shown in FIG. 2, key1 of the first key-value pair can be written starting from the position where the block header of the first physical block ends, and value1 of the first key-value pair can be written starting from the end of the first physical block. When value1 is written, the type of value1 (value type1) can be written at the same time. Then, key2 of the second key-value pair is written starting from the position where key1 ends, and value2 of the second key-value pair and the type of value2 (value type2) are written at the position preceding value1. By analogy, multiple keywords such as key1, key2, key3, and key4 (that is, the first keyword above) can be written into the first physical block in sequence, together with multiple values such as value1, value2, value3, and value4 (that is, the first value above), until the remaining storage space in the first physical block cannot accommodate a new key-value pair, at which point the first physical block can be considered full.
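The layout of FIG. 2 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made only for illustration: the tiny `BLOCK_SIZE`, the `HEADER_SIZE`, fixed 4-byte integer keys and values, and the function names `fill_block` and `read_pair`. The value-type field described above is omitted for brevity.

```python
BLOCK_SIZE = 64    # hypothetical; a real block would be e.g. 4K or 8K bytes
HEADER_SIZE = 8    # hypothetical space reserved for the block header


def fill_block(pairs, key_size=4, value_size=4):
    """Write keys forward from the header and values backward from the tail.

    Returns (block, count): the filled bytearray and the number of pairs that fit.
    """
    block = bytearray(BLOCK_SIZE)
    key_end = HEADER_SIZE       # the next key starts here and grows forward
    value_start = BLOCK_SIZE    # the next value ends here and grows backward
    count = 0
    for key, value in pairs:
        # Stop when the key region and the value region would overlap.
        if key_end + key_size > value_start - value_size:
            break
        block[key_end:key_end + key_size] = key.to_bytes(key_size, "big")
        block[value_start - value_size:value_start] = value.to_bytes(value_size, "big")
        key_end += key_size
        value_start -= value_size
        count += 1
    return block, count


def read_pair(block, slot, key_size=4, value_size=4):
    """Read the key in 0-based slot `slot` and its value, counted from the tail."""
    k_off = HEADER_SIZE + slot * key_size
    v_off = BLOCK_SIZE - (slot + 1) * value_size
    key = int.from_bytes(block[k_off:k_off + key_size], "big")
    value = int.from_bytes(block[v_off:v_off + value_size], "big")
    return key, value
```

With this layout, the value for the key in slot i sits at the (i+1)-th position counted from the end of the block, so locating a key's slot is enough to locate its value.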

The block header in the first physical block can record information about the first physical block, for example, the cyclic redundancy check (CRC) code of the first physical block, the number of key-value pairs, the type of the first physical block (for example, a fixed-length type or a variable-length type), and the position of the first physical block in the logical segment.

It should be understood that the key-value pair writing method shown in FIG. 2 is only an exemplary description, and is not used to limit the writing method of the key-value pairs. For example, in other possible writing methods, after key1 of the first key-value pair is written, value1 can be written immediately after the end of key1, and then key2 and value2 of the second key-value pair can be written after the end of value1, and so on.

After the first physical block is filled, the first physical block can be associated with a first logical segment. The first logical segment corresponds to a first position prediction model, and the first position prediction model can predict, according to an input keyword, the position of the keyword in the physical block associated with the first logical segment. In this embodiment, the first position prediction model may be determined based on the first keyword stored in the first physical block and the first storage position of the first keyword in the first physical block.

In a possible implementation manner, the first position prediction model may be, for example, a monotonic fitting function. In this case, the storage device may perform function fitting based on the value of each keyword in the first physical block and the storage position of each keyword in the first physical block, to obtain a fitting function in which the storage position monotonically increases or monotonically decreases as the keyword value increases, that is, the above-mentioned first position prediction model.

Further, the monotonic fitting function may be a linear fitting function as shown in the following formula (1):

The slot of the keyword to be retrieved = (the keyword to be retrieved – the starting keyword) / the reciprocal slope + the starting slot position (1)

In formula (1), the keyword to be retrieved is the keyword being searched for; the starting keyword is the first keyword stored in the first physical block, such as key1 shown in FIG. 2; the starting slot position is the slot in which the first keyword is stored in the first physical block, and its value can be 0 (of course, it can also be another value such as 1); and the reciprocal slope can be obtained by calculating the average of the differences between the values of every two adjacent keywords in the first physical block and rounding it. Because the reciprocal slope is rounded to an integer, floating-point operations can be avoided when calculating the slot of a keyword based on formula (1), which reduces the computational overhead of predicting keyword positions. At the same time, keywords of different lengths can be flexibly supported.
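Formula (1) and the integer reciprocal slope can be illustrated with a short Python sketch. The function names are hypothetical, and two details are assumptions not fixed by the text above: the average gap is rounded half up, and the division in formula (1) is carried out as integer floor division.

```python
def fit_reciprocal_slope(keys):
    """Integer reciprocal slope: the rounded average gap between adjacent sorted keys."""
    if len(keys) < 2:
        return 1
    # The sum of adjacent differences telescopes to last key minus first key.
    total_gap = keys[-1] - keys[0]
    n_gaps = len(keys) - 1
    # Integer round-half-up (assumed rounding mode); at least 1 to avoid dividing by zero.
    return max(1, (2 * total_gap + n_gaps) // (2 * n_gaps))


def predict_slot(key, start_key, reciprocal_slope, start_slot=0):
    """Evaluate formula (1) using integer arithmetic only."""
    return (key - start_key) // reciprocal_slope + start_slot
```

For the keys 100, 110, 121, 129, 140 the average gap is 10, so the reciprocal slope is 10. The key 121 is predicted in slot 2, which is its actual slot, while the key 129 is also predicted in slot 2 although it is stored in slot 3, an error of one slot of the kind the error-range check is meant to bound.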

Of course, the monotonic fitting function can also be another linear fitting function, for example, a suitable transformation of formula (1), or a function that follows a monotonically decreasing rule. In this embodiment, the specific form of the monotonic fitting function is not limited.

It can be understood that the values of the keywords stored in the first physical block may be unevenly distributed locally, so the monotonic fitting function obtained by fitting may not be able to accurately predict the slot corresponding to each keyword. Therefore, there may be a certain error when the slot of the keyword to be retrieved in the first physical block is predicted based on the monotonic fitting function. In this embodiment, when performing function fitting, the storage device can verify whether the deviation between the position of each keyword in the first physical block predicted by the fitted function and the actual position of that keyword in the first physical block is within the preset error range (or whether the prediction error rate is less than the preset value). If so, the function can be used as the above-mentioned monotonic fitting function; if not, the function can be re-fitted until a monotonic fitting function that meets the condition is obtained.

Further, when the keyword stored in the slot predicted by the monotonic fitting function is not the keyword to be retrieved, the storage device may compare the keyword to be retrieved with the keywords stored in one or more slots near the predicted slot, to determine the actual position where the keyword to be retrieved is stored in the first physical block. Then, the storage device may determine the position of the value corresponding to the keyword to be retrieved in the first physical block based on the manner in which keywords and their corresponding values are written in the first physical block. Taking the writing method shown in FIG. 2 as an example, if it is determined that the keyword to be retrieved is located in the third slot of the first physical block, value3 corresponding to the keyword to be retrieved can be read from the third-to-last value position in the first physical block.
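The retrieval step just described, predicting a slot and then comparing against nearby slots, can be sketched as follows in Python. The function name `lookup`, the `max_error` parameter, and the representation of a block as two parallel lists (keys in slot order, values in the same order) are assumptions made only for illustration.

```python
def lookup(keys, values, target, start_key, reciprocal_slope, max_error=2):
    """Predict the slot of `target`, then probe slots within max_error of the prediction.

    Returns (slot, value) if the key is found, otherwise None.
    """
    predicted = (target - start_key) // reciprocal_slope
    lo = max(0, predicted - max_error)
    hi = min(len(keys) - 1, predicted + max_error)
    for slot in range(lo, hi + 1):
        if keys[slot] == target:
            return slot, values[slot]
    return None
```

Because the model was verified at build time to stay within the preset error range, the probe window can be kept small and bounded; a key that is not found inside the window is not stored in the block.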

It is worth noting that the above description takes a monotonic fitting function as an example of the first position prediction model. In other possible embodiments, the first position prediction model may also take other forms, such as a non-monotonic function or a machine learning model. In short, the first position prediction model may be any model that can be used to predict the position of a keyword in a physical block, and is not limited to the above monotonic fitting function.

After associating the first physical block with the first logical segment, the storage device may continue to write other keywords and the values corresponding to those keywords into the second physical block, and continue to perform step S102 to determine whether to associate the second physical block with the first logical segment as well. Alternatively, the storage device may write the key-value pairs into the first physical block and the second physical block in advance, and then associate each physical block with the corresponding logical segment, which is not limited in this embodiment. For convenience of description, a keyword written in the second physical block is referred to as the second keyword, and a value written in the second physical block is referred to as the second value.

S102: The storage device determines whether the second keyword stored in the second physical block and the second storage position of the second keyword in the second physical block satisfy the first position prediction model, where the second physical block also stores a second value corresponding to the second keyword, and the second physical block is adjacent to the first physical block.

S103: When the second key and the second storage position of the second key in the second physical block do not satisfy the first position prediction model, the storage device associates the second physical block with the second logical segment. The second key and the second storage position of the second key in the second physical block satisfy the second position prediction model corresponding to the second logical segment.

In practical applications, the storage device may include a plurality of consecutive physical blocks for storing key-value pairs, and the first physical block and the second physical block are two adjacent physical blocks in the plurality of physical blocks. Normally, the tail address of the first physical block is continuous with the head address of the second physical block.

The storage device may use the first position prediction model to predict the position of each keyword in the first physical block with a small prediction error or a high prediction accuracy. Further, the storage device may also check whether the first position prediction model can be used to predict the position of each keyword in the second physical block. If the deviation between the position of each keyword in the second physical block predicted by the first position prediction model and the actual position of that keyword in the second physical block is within the preset error range, or if the prediction error rate for keyword positions in the second physical block is less than the preset value, the storage device may associate the second physical block with the first logical segment, as shown in the upper part of FIG. 3. In this way, the position of each keyword in the first physical block and the second physical block can be predicted based on the first position prediction model corresponding to the first logical segment. However, if the position of a keyword in the second physical block predicted by the first position prediction model deviates from the actual position of that keyword in the second physical block by more than the preset error range, or if the prediction error rate for keyword positions in the second physical block is not less than the preset value, the storage device can associate the second physical block with another logical segment, hereinafter referred to as the second logical segment, as shown in the lower part of FIG. 3. The second logical segment may correspond to the second position prediction model.

Similar to the first position prediction model, the second position prediction model may be determined based on the second keyword stored in the second physical block and the second storage position of the second keyword in the second physical block. For example, it may be a monotonic fitting function or a machine learning model for predicting the position of a keyword. Correspondingly, when the storage device uses the second position prediction model, the deviation between the predicted position of each keyword in the second physical block and its actual position may be within the preset error range, or the prediction error rate may be less than the preset value.

The above process is described by taking as an example two physical blocks that are associated with the same logical segment or with different logical segments. In practical applications, the next physical block adjacent to the second physical block, and further physical blocks in the storage device, can be associated with the corresponding logical segments in a manner similar to that described above. It is worth noting that a physical block can be associated with only one logical segment, whereas a logical segment can be associated with one physical block or with multiple consecutive physical blocks.
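Steps S101 to S103, applied repeatedly over adjacent physical blocks, can be sketched in Python as follows. This is only an illustrative sketch under several assumptions: each block is represented as a non-empty list of sorted integer keys, slots are numbered continuously across the blocks of one logical segment, the model is the linear function of formula (1) fitted on the first block of the segment, and the names `block_fits` and `build_segments` are hypothetical. The re-fitting of a model that fails on its own first block is omitted.

```python
def fit_reciprocal_slope(keys):
    """Rounded average gap between adjacent keys (integer, at least 1)."""
    if len(keys) < 2:
        return 1
    n_gaps = len(keys) - 1
    return max(1, (2 * (keys[-1] - keys[0]) + n_gaps) // (2 * n_gaps))


def block_fits(keys, first_slot, start_key, slope, max_error):
    """True if every key's predicted slot is within max_error of its actual slot."""
    for offset, key in enumerate(keys):
        predicted = (key - start_key) // slope
        if abs(predicted - (first_slot + offset)) > max_error:
            return False
    return True


def build_segments(blocks, max_error=1):
    """Greedily associate adjacent blocks with logical segments."""
    segments = []
    for index, keys in enumerate(blocks):
        if segments:
            seg = segments[-1]
            # S102: does the current segment's model also predict this block?
            if block_fits(keys, seg["slots"], seg["start_key"], seg["slope"], max_error):
                seg["blocks"].append(index)
                seg["slots"] += len(keys)
                continue
        # S101 / S103: start a new segment with a model fitted on this block.
        segments.append({
            "start_key": keys[0],
            "slope": fit_reciprocal_slope(keys),
            "blocks": [index],
            "slots": len(keys),
        })
    return segments
```

In this sketch a block whose keys continue the spacing of the previous block joins the same segment, while a block with a very different key spacing starts a new segment with its own model, so each block ends up in exactly one segment and each segment covers one or more consecutive blocks.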

In this way, by associating each physical block with a corresponding logical segment, the index can be constructed. In the process of constructing the index, the keywords stored in each physical block and the storage positions of those keywords in the physical block satisfy the corresponding position prediction model. Therefore, when data is retrieved based on the index, the physical block where a keyword is located can be determined from the keyword to be retrieved and the corresponding position prediction model, and the value corresponding to the keyword can then be found in that physical block. In addition, because the storage device associates physical blocks with logical segments in units of whole physical blocks, the different keywords in each physical block and their storage positions in the physical block satisfy the same position prediction model. As a result, during data retrieval the position prediction model can usually predict the physical block where the keyword is located, and the value corresponding to the keyword can be found in that physical block rather than searched for across multiple physical blocks, which reduces the number of accesses to physical blocks.

Because the storage device contains many physical blocks in which key-value pairs are written, after the physical blocks are associated with logical segments there are many logical segments, and the storage device needs a correspondingly large number of position prediction models. In order to make it easier to find the position prediction model applicable to the keyword to be retrieved, in a further possible implementation manner, the storage device may build a further layer of index over the logical segments.

In an exemplary specific implementation, the first keyword in the first physical block associated with a logical segment and the first address of that physical block can be extracted, together with the model description value corresponding to the logical segment; a new key-value pair is constructed from this information and stored in a new physical block. Assuming that the first logical segment is associated only with the first physical block, the storage device may extract, for the first logical segment, the first keyword in the first physical block, the first address of the first physical block, and the model description value of the first position prediction model corresponding to the first logical segment. Exemplarily, when the first position prediction model is a linear fitting function, the model description value may specifically be the reciprocal of the slope of the linear fitting function. Then, the storage device can use the extracted first keyword as the key of the new key-value pair (hereinafter referred to as the third keyword), and splice the extracted first address and model description value into the value of the new key-value pair (hereinafter referred to as the third value), thereby forming a new key-value pair. Finally, the storage device can write the new key-value pair into an index block. The index block may be a physical block. However, unlike the first physical block and the second physical block, which store keywords and metadata, the index block stores data such as the model description values and physical block addresses of the lower-level index. Similarly, for the second logical segment, a new key-value pair can be formed from the corresponding information and also written into the index block. Of course, this embodiment uses the first keyword in the first physical block only as an example. In other embodiments, any keyword in the first physical block, such as the last keyword in the first physical block, may be used as the third keyword, which is not limited in this embodiment.
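
The construction of the third keyword and third value described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the names `LogicalSegment`, `make_index_entry`, and `build_index_block` are hypothetical, the "spliced" third value is represented as a tuple rather than a packed byte string, and integer keys and addresses are assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical in-memory view of a logical segment; field names are assumptions.
@dataclass
class LogicalSegment:
    first_key: int         # first keyword in the segment's first physical block
    first_block_addr: int  # first address of that physical block
    inv_slope: int         # model description value (reciprocal of the slope)

IndexEntry = Tuple[int, Tuple[int, int]]  # (third keyword, third value)

def make_index_entry(seg: LogicalSegment) -> IndexEntry:
    """Build the new key-value pair for one logical segment."""
    # The third value combines the block address and the model description value.
    return seg.first_key, (seg.first_block_addr, seg.inv_slope)

def build_index_block(segments: List[LogicalSegment]) -> List[IndexEntry]:
    """Sort the new key-value pairs by key before writing them to an index block."""
    return sorted(make_index_entry(s) for s in segments)

segments = [LogicalSegment(500, 0x2000, 7), LogicalSegment(10, 0x1000, 3)]
index_block = build_index_block(segments)
```

Here `index_block` holds one entry per logical segment, ordered by third keyword, which is the layout the upper-level position prediction model is later fitted to.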

In practical applications, a corresponding key-value pair can be formed for each logical segment, and the key-value pairs are sorted by key and then written into the index block in sequence. It can be understood that when the remaining storage space in the index block is insufficient to hold the key-value pairs of further logical segments, the next index block adjacent to the index block can be used to store the key-value pairs that have not yet been written. For the method of writing these key-value pairs, reference may be made to the aforementioned method of writing key-value pairs into the first physical block, which will not be repeated here.

Similar to the manner in which the first physical block and the second physical block are associated with logical segments, the storage device may associate the index block with a third logical segment, and fit a third position prediction model based on the third keyword stored in the index block and the third value corresponding to the third keyword. For the specific process of fitting the third position prediction model, reference may be made to the above description of fitting the first position prediction model, which will not be repeated here. In this way, based on the third position prediction model corresponding to the third logical segment, a third storage position of the keyword to be retrieved on the index block can be predicted, so that the keyword to be retrieved can be compared with the keyword at the third storage position. For ease of description, the following assumes that the third keywords are written into the index block in ascending order and that the third position prediction model is a monotonic fitting function; an implementation in which the third keywords are written in descending order can be understood by analogy.

If the two keywords have the same value, the storage device may determine the value corresponding to the keyword at the predicted position. That value includes the first address of the first physical block associated with a logical segment of the next layer and the function description value. The storage device can therefore determine the monotonic fitting function corresponding to that logical segment according to the function description value, and use the monotonic fitting function to predict the storage position of the keyword to be retrieved on the physical block. Based on the predicted storage position and the first address of the first physical block, the storage device can then determine the value corresponding to the keyword to be retrieved, which completes the retrieval process.

If the value of the keyword at the third storage position on the index block is smaller than the value of the keyword to be retrieved, the storage device may search backward in the physical block associated with the third logical segment to determine the first third keyword whose value is greater than or equal to the value of the keyword to be retrieved, and determine the value corresponding to that third keyword according to its position on the index block. The value includes the first address and the function description value of the first physical block associated with a logical segment of the next layer. Following the retrieval process described above, the value corresponding to the keyword to be retrieved can then be retrieved based on the first address and the function description value, which completes the retrieval process.

If the value of the keyword at the third storage position on the index block is greater than the value of the keyword to be retrieved, the storage device may search forward in the physical block associated with the third logical segment to determine the first third keyword whose value is less than or equal to the value of the keyword to be retrieved, and determine the value corresponding to that third keyword according to its position on the index block. The value includes the first address and the function description value of the first physical block associated with a logical segment of the next layer. Following the retrieval process described above, the value corresponding to the keyword to be retrieved can then be retrieved based on the first address and the function description value, which completes the retrieval process.
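
The three cases above (equal, smaller, larger) amount to a predict-then-scan lookup inside an index block. The Python sketch below is one possible reading and rests on assumptions the text does not fix: keys are integers stored in ascending order, each third keyword is the first keyword of its logical segment (so the entry of interest is the last one whose key does not exceed the target), and the linear model predicts a slot as the key offset divided by the reciprocal slope. The names `predict_slot` and `find_entry` are hypothetical.

```python
from typing import List, Tuple

# (third keyword, (first address, model description value)); layout is an assumption.
Entry = Tuple[int, Tuple[int, int]]

def predict_slot(key: int, first_key: int, inv_slope: int, n: int) -> int:
    """Linear model: slot ~ (key - first_key) / inv_slope, clamped to the block."""
    slot = (key - first_key) // max(inv_slope, 1)
    return min(max(slot, 0), n - 1)

def find_entry(block: List[Entry], key: int, inv_slope: int) -> Entry:
    """Return the last entry whose key is <= key (the first entry if none is)."""
    i = predict_slot(key, block[0][0], inv_slope, len(block))
    # Key at the predicted slot is too small: move toward later entries.
    while i + 1 < len(block) and block[i + 1][0] <= key:
        i += 1
    # Key at the predicted slot is too large: move toward earlier entries.
    while i > 0 and block[i][0] > key:
        i -= 1
    return block[i]

block = [(10, (0x1000, 3)), (100, (0x2000, 5)), (200, (0x3000, 2)), (400, (0x4000, 9))]
entry = find_entry(block, 250, inv_slope=130)  # selects the entry with key 200
```

Because the scan stays within one index block, a poor prediction costs extra key comparisons rather than extra physical block accesses.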

Of course, if the keywords in the index block are not stored in ascending or descending order, and the value of the keyword to be retrieved differs from the value of the keyword at the third storage position, the storage device can traverse the third keywords on the index block to find the third keyword that has the same value as the keyword to be retrieved. If the traversal fails, the storage device may continue to traverse the keywords of other index blocks adjacent to the index block to find a stored keyword that has the same value as the keyword to be retrieved.

As described above, a two-level index can be formed. In practical applications, the storage device may also extract the first keyword in the first physical block (that is, the above-mentioned index block) associated with the third logical segment and the first address of that physical block, together with the model description value corresponding to the third logical segment, construct a new key-value pair based on this information, and store the new key-value pair in a new physical block. Following a process similar to the one above, a further layer of index is thereby built over the third logical segment. In this way, the storage device can form a multi-layer index structure through multiple iterations, for example, the index structure shown in FIG. 4.

Figure 4 shows a 4-layer index structure comprising layers 1 to 4. The physical block in layer 1 may be the physical block associated with the logical segment serving as the root node, the physical blocks in layers 2 and 3 may be the physical blocks associated with the logical segments serving as intermediate nodes (the physical blocks in layers 1 to 3 may also be referred to as index blocks in this embodiment), and the physical blocks in layer 4 may be the physical blocks associated with the logical segments serving as leaf nodes. Each logical segment in each layer has a corresponding position prediction model. The value corresponding to a keyword stored in a physical block of layer 4 is metadata. A keyword stored in a physical block of layers 1 to 3 is the first keyword stored in the first physical block associated with a logical segment of the next layer, and the value corresponding to that keyword is the first address of that first physical block together with the model description value corresponding to the logical segment. Of course, the index structure shown in FIG. 4 is only an example. In practical applications, a logical segment serving as a root node may be associated with multiple physical blocks, or the index structure may have three layers or five layers.

When performing data retrieval, taking the index structure shown in Figure 4 as an example, the position prediction model corresponding to the logical segment of layer 1 can be used to predict the storage position of the keyword to be retrieved in the physical block associated with that logical segment. By comparing the keyword at that storage position with the keyword to be retrieved, the value corresponding to the keyword to be retrieved in layer 1 is determined. Based on the model description value in that value, the corresponding logical segment is determined from the multiple logical segments in layer 2, together with the position prediction model corresponding to that logical segment. Then, the storage device can predict, based on this position prediction model, the storage position of the keyword to be retrieved in the physical block associated with the layer-2 logical segment, compare the keyword read from that storage position with the keyword to be retrieved, and determine the value corresponding to the keyword to be retrieved in layer 2. Based on the model description value in that value, the corresponding logical segment and its position prediction model are determined in layer 3. By analogy, a corresponding logical segment and its position prediction model can be determined in layer 4.

The storage device can then predict, based on that position prediction model, the storage position of the keyword to be retrieved on the physical block in layer 4, and compare the keyword at that storage position with the keyword to be retrieved. If the two keywords are the same, the value corresponding to the keyword to be retrieved is determined according to the storage position of the keyword on the physical block, which completes the data retrieval. If the two keywords are not the same, keywords can be read forward or backward from the storage position on the physical block associated with the logical segment, and each keyword read is compared with the keyword to be retrieved until a keyword identical to the keyword to be retrieved is found on the physical block; the value corresponding to the keyword to be retrieved is then determined, which completes the data retrieval. Of course, if traversing the keywords on all physical blocks associated with the logical segment shows that no keyword is identical to the keyword to be retrieved, the storage device may prompt the user that the query has failed, or may perform the retrieval again.
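
The layer-by-layer retrieval described above can be sketched as a short descent loop. The Python code below is a simplified illustration under stated assumptions, not the patented procedure: every logical segment is associated with exactly one physical block, blocks are held in a dictionary keyed by a hypothetical block address, keys are integers in ascending order, and the child segment is identified by the address stored in the value (the text identifies it via the model description value). The names `locate` and `lookup` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Block = List[Tuple[int, object]]  # sorted (keyword, value) pairs of one physical block

def locate(block: Block, key: int, inv_slope: int) -> int:
    """Predict a slot with the linear model, then scan to the last keyword <= key."""
    i = (key - block[0][0]) // max(inv_slope, 1)
    i = min(max(i, 0), len(block) - 1)
    while i + 1 < len(block) and block[i + 1][0] <= key:
        i += 1
    while i > 0 and block[i][0] > key:
        i -= 1
    return i

def lookup(blocks: Dict[int, Block], root_addr: int, root_inv_slope: int,
           depth: int, key: int) -> Optional[object]:
    """Walk from the root index block down to a leaf block and return the value."""
    addr, inv_slope = root_addr, root_inv_slope
    for _ in range(depth - 1):  # index layers: value = (child address, child model)
        block = blocks[addr]
        _, (addr, inv_slope) = block[locate(block, key, inv_slope)]
    leaf = blocks[addr]         # leaf layer: value = metadata
    found_key, value = leaf[locate(leaf, key, inv_slope)]
    return value if found_key == key else None

# A toy 3-layer structure; addresses and model values are made up for illustration.
blocks = {
    0: [(10, (1, 45)), (100, (2, 50))],
    1: [(10, (10, 10)), (55, (11, 5))],
    2: [(100, (12, 20)), (200, (13, 10))],
    10: [(10, "a"), (20, "b"), (30, "c")],
    11: [(55, "d"), (60, "e"), (65, "f")],
    12: [(100, "g"), (120, "h"), (140, "i")],
    13: [(200, "j"), (210, "k")],
}
result = lookup(blocks, root_addr=0, root_inv_slope=90, depth=3, key=60)
```

In this toy structure each retrieval reads one block per layer, which is the behaviour the index aims for when the predictions are accurate.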

It is worth noting that, when constructing an index, the more physical blocks are associated with the logical segments of the intermediate nodes, the more position prediction models the constructed index contains, and the more accurate the prediction for the keyword to be retrieved will usually be. Ideally, the position prediction models can accurately predict the position of every keyword on its physical block, so that the storage device can accurately locate the value corresponding to the keyword to be retrieved on the lowest-layer physical block, which reduces the number of physical blocks the storage device needs to access. At the same time, however, because a large number of position prediction models must additionally be stored, the intermediate nodes occupy more storage space (the number of physical blocks in the intermediate nodes is larger), so the index requires more storage space and the storage cost is higher. Conversely, if the number of position prediction models in the index is reduced, for example by fitting two original position prediction models into one, the accuracy with which the storage device predicts the position of a keyword on a physical block may be low, and the storage device will often need to access several additional physical blocks to find the value corresponding to the keyword to be retrieved. In current storage devices, the overhead and latency of accessing a physical block are often high; by comparison, the overhead and latency of comparing the keywords on the same physical block with the keyword to be retrieved are usually small or even negligible.
Therefore, in practical applications, the number of position prediction models can be determined during index construction by controlling the deviation with which the storage device accesses physical blocks based on the position prediction models, so as to balance the storage space occupied by the intermediate nodes against the retrieval performance of the storage device. The deviation of accessing physical blocks based on a position prediction model can be, for example, the deviation rate between the physical blocks accessed according to the model's prediction and the physical blocks that actually need to be accessed (or a correct rate, an error rate, or the like). Of course, the storage device may also determine the number of position prediction models by limiting the maximum number of physical blocks that are allowed to be accessed in one retrieval process.

For example, assume that the intermediate nodes occupy 20 MB (megabytes) of memory when the deviation of the storage device in accessing physical blocks is 0. If the storage device reduces the number of position prediction models, for example by fitting multiple position prediction models into one, so that the memory occupied by the intermediate nodes falls to 15 MB, the deviation rate of the storage device in accessing physical blocks becomes 5%. If the reduction continues until the memory occupied by the intermediate nodes falls to 10 MB, the deviation rate rises further to 20%. The memory space occupied by the intermediate nodes (that is, the number of position prediction models) can then be determined from the deviation that is acceptable. For example, if the maximum deviation rate allowed in an actual application is 5%, it can be determined that the intermediate nodes in the constructed index occupy 15 MB of memory. Of course, this is only an example, and this embodiment does not limit the specific implementation of how to balance the memory space occupied by the intermediate nodes against the deviation of the storage device in accessing physical blocks.
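
The trade-off in this example can be expressed as a small selection rule: among the candidate index configurations, pick the one with the smallest memory footprint whose deviation rate does not exceed the allowed maximum. The Python sketch below uses the figures from the example; the function name `pick_index_size` and the tuple representation are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Option = Tuple[int, float]  # (memory of intermediate nodes in MB, deviation rate)

def pick_index_size(options: List[Option], max_deviation: float) -> Optional[Option]:
    """Return the smallest-memory option whose deviation rate is within the limit."""
    allowed = [o for o in options if o[1] <= max_deviation]
    # Tuples compare by memory first, so min() picks the smallest footprint.
    return min(allowed, default=None)

# Figures taken from the example: 20 MB at 0%, 15 MB at 5%, 10 MB at 20%.
options = [(20, 0.00), (15, 0.05), (10, 0.20)]
choice = pick_index_size(options, max_deviation=0.05)
```

With a 5% limit the rule selects the 15 MB configuration, matching the example; a stricter limit would fall back to 20 MB.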

In a further possible implementation manner, the above-mentioned implementation process of constructing an index may be applied to a merging process of an index structure of a hierarchical mechanism.

Take the log-structured merge (LSM) tree, a widely used data index structure, as an example. As shown in Fig. 5, the LSM tree can be divided into N+1 layers (N is a positive integer), namely the L0 layer, the L1 layer, ..., and the LN layer. The data scale of the L0 layer is usually small, while the L1 to LN layers may have a large data scale. Each of the L0 to LN layers may have one or more subtrees of the LSM tree. In FIG. 5, the L0 layer includes 4 subtrees as an example. Of course, in practical applications, the L0 layer may include any other number of subtrees according to the requirements of the application.

When writing new data into the LSM tree, the new metadata and its corresponding keyword can be written into a subtree of the L0 layer in the form of a key-value pair, for example into subtree 1 of the L0 layer as shown in Figure 5. While data is being written to subtree 1, the other subtrees of the L0 layer can be configured so that data cannot be written to them, that is, data can be written to only one subtree at a time. In some examples, subtree 1 may adopt an index structure that supports data modification, insertion, and update, such as a B+ tree or ARTree. The other subtrees in the L0 layer and the subtrees in the L1 to LN layers can be learned indexes that support the merge operation. Generally, a learned index occupies less space, has fewer intermediate nodes, and operates faster, but does not support data modification, insertion, and update. In this way, the index structure of the LSM tree includes both an index structure that supports data modification, insertion, and update and learned indexes, so the LSM tree as a whole can support data modification, insertion, and update operations. At the same time, the index structure that supports data modification, insertion, and update holds only a small amount of data, usually less than 1% of the LSM tree, while more than 99% of the data in the LSM tree can be stored in learned indexes, which allows the LSM tree to benefit from the performance and cost advantages of learned indexes.

As new data is continuously written, the data scale of subtree 1 keeps growing until subtree 1 can no longer support the writing of more new data. At the same time, the data scale of the L0 layer also keeps growing. For this reason, when the amount of data in subtree 1 exceeds a preset threshold (such as 256 MB), the data in subtree 1 can be merged, or, when the amount of data in the L0 layer exceeds a certain threshold (such as 1 GB), the data in the L0 layer can be merged, and the merged data can be written to a subtree in the L1 layer. In general, the data scale of a subtree in the L1 layer can be larger than that of a subtree in the L0 layer. Similarly, when the data scale of the L1 layer exceeds a preset threshold (such as 10 GB), the data in the L1 layer can be merged and the merged data written to a subtree in the L2 layer, and so on. As merging and writing continue, the data scale of the subtrees usually grows from one layer to the next, as shown in Figure 6.
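
The write path of the L0 layer can be illustrated with a toy model. The Python sketch below is only an assumption-laden illustration: a dictionary stands in for the writable subtree (which the text says may be a B+ tree or similar), sorted lists stand in for the read-only subtrees, the threshold is counted in entries rather than bytes, and turning the full writable subtree into a read-only one is one possible reading of the merge step. The class name `L0Layer` and its methods are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

class L0Layer:
    """Toy L0 layer: one writable subtree plus read-only, sorted subtrees."""

    def __init__(self, subtree_limit: int) -> None:
        self.active: Dict[int, str] = {}              # the only writable subtree
        self.frozen: List[List[Tuple[int, str]]] = [] # read-only subtrees, oldest first
        self.subtree_limit = subtree_limit            # threshold in entries (assumption)

    def put(self, key: int, value: str) -> None:
        self.active[key] = value
        if len(self.active) >= self.subtree_limit:
            # Threshold reached: emit the data in key order and start a new subtree.
            self.frozen.append(sorted(self.active.items()))
            self.active = {}

    def get(self, key: int) -> Optional[str]:
        if key in self.active:
            return self.active[key]
        for tree in reversed(self.frozen):            # newest read-only subtree first
            for k, v in tree:
                if k == key:
                    return v
        return None

layer = L0Layer(subtree_limit=2)
layer.put(1, "a")
layer.put(2, "b")   # reaches the limit, so the writable subtree is frozen
layer.put(1, "c")   # newer value for key 1 lands in the new writable subtree
```

Reads consult the writable subtree first, so the most recent value for a key is returned even when older copies remain in read-only subtrees.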

In the process of merging the data in each layer, the values corresponding to the same key can be merged into a new value, and a new key-value pair is constructed from the new value and the original key: the key of the new key-value pair remains unchanged, and its value is the new value obtained by merging. Redundant and invalid information in the LSM tree can thereby be eliminated, which reduces storage overhead. Then, the storage device can serialize the new key-value pairs, that is, write them into a subtree of the current layer or of the next layer in ascending or descending order of key. When constructing an index over the serialized key-value pairs, the index can be constructed with the index construction method shown in FIG. 1 to form the subtrees of each layer in the LSM tree, so that the association between the first logical segment and the first physical block and the association between the second logical segment and the second physical block (or the associations between the first logical segment and each of the first physical block and the second physical block) are located in the index structure of the subtree.
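
The merge and serialization step can be sketched as follows. The text does not specify how the values for the same key are combined, so this Python sketch takes a hypothetical `combine` function and, by default, assumes that the newer value simply replaces the older one; the name `merge_subtrees` is likewise an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[int, str]

def merge_subtrees(subtrees: Iterable[Iterable[Pair]],
                   combine: Callable[[str, str], str] = lambda old, new: new
                   ) -> List[Pair]:
    """Merge key-value pairs from several subtrees (oldest first) into sorted order."""
    merged: Dict[int, str] = {}
    for tree in subtrees:
        for key, value in tree:
            # Values for the same key are combined into one new value.
            merged[key] = combine(merged[key], value) if key in merged else value
    # Serialize: emit the pairs in ascending key order, ready for index construction.
    return sorted(merged.items())

older = [(1, "a"), (3, "c")]
newer = [(2, "b"), (3, "C")]
pairs = merge_subtrees([older, newer])
```

The sorted output contains each key once, so the position prediction models of the new subtree can be fitted over a sequence without duplicates.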

The method for storing metadata in the present application is described above with reference to FIGS. 1 to 6. Next, the device and apparatus of the present application are described with reference to the accompanying drawings.

Referring to the schematic diagram of the hardware structure of the storage device 700 shown in FIG. 7, the device 700 may include at least one memory 701 and at least one processor 702.

The processor 702 may be a central processing unit (CPU). The memory 701 may include volatile memory, such as random access memory (RAM). For example, the RAM can be dynamic random access memory (DRAM) or storage class memory (SCM). DRAM is a semiconductor memory and, like most RAM, is a volatile memory device. SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory. SCM provides faster read and write speeds than hard disks, but is slower than DRAM in operation speed and cheaper than DRAM in cost. However, DRAM and SCM are only illustrative in this embodiment, and the memory may also include other random access memories, such as static random access memory (SRAM) and synchronous dynamic random access memory (SDRAM). The memory 701 may also include non-volatile memory, such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), flash memory, a hard disk drive (HDD), or a solid state disk (SSD).

The memory 701 can be used to store instructions, the processor 702 is connected to the memory 701, for example, it can be connected through various interfaces, transmission lines or buses, and the processor 702 can read and execute the instructions stored in the memory 701 based on the connection, to perform the following steps:

Associating a first physical block with a first logical segment, where a first key and a first value corresponding to the first key are stored in the first physical block, and the first key and the first storage position of the first key in the first physical block satisfy the first position prediction model corresponding to the first logical segment;

Judging whether the second key stored in the second physical block and the second storage position of the second key in the second physical block satisfy the first position prediction model, where a second value corresponding to the second key is also stored in the second physical block, and the second physical block is adjacent to the first physical block;

When the second key and the second storage location do not satisfy the first location prediction model, associating the second physical block with a second logical segment, where the second key and the second storage location satisfy the second location prediction model corresponding to the second logical segment.

In a possible implementation manner, the processor 702 executes the instructions stored in the memory 701, and may also execute the following steps:

The second physical block is associated with the first logical segment when the second key and the second storage location satisfy the first location prediction model.

In a possible implementation manner, the processor 702 executes the instructions stored in the memory 701, and may specifically perform the following steps:

According to the second key, using the first position prediction model to calculate the predicted position of the second key in the second physical block;

comparing whether the error between the predicted location and the second storage location is within a preset error range;

Then, the second keyword and the second storage location do not satisfy the first location prediction model, specifically, the error between the predicted location and the second storage location exceeds the preset error range.
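
The association and judging steps above can be combined into a greedy pass over adjacent physical blocks. The Python sketch below is a simplified illustration under assumptions that go beyond the text: each physical block is a non-empty list of ascending integer keys, the position of a key is its slot counted from the start of its logical segment, the model is a line through the segment's first key whose reciprocal slope is the rounded average gap between adjacent keys of the segment's first block, and every key of a candidate block is checked (the text speaks of a second key). The names `fit_inv_slope`, `satisfies`, and `build_segments` are hypothetical.

```python
from typing import Dict, List

def fit_inv_slope(keys: List[int]) -> int:
    """Reciprocal slope: rounded average difference between adjacent keys."""
    gaps = [b - a for a, b in zip(keys, keys[1:])]
    return max(1, round(sum(gaps) / len(gaps))) if gaps else 1

def satisfies(key: int, slot: int, first_key: int, inv_slope: int,
              max_error: float) -> bool:
    """True if the predicted position is within the preset error of the stored one."""
    predicted = (key - first_key) / inv_slope
    return abs(predicted - slot) <= max_error

def build_segments(blocks: List[List[int]], max_error: float) -> List[Dict]:
    """Greedily associate adjacent physical blocks with logical segments."""
    segments: List[Dict] = []
    for idx, keys in enumerate(blocks):
        if segments:
            seg = segments[-1]
            offset = seg["size"]  # slots already used by earlier blocks of the segment
            if all(satisfies(k, offset + i, seg["first_key"], seg["inv_slope"], max_error)
                   for i, k in enumerate(keys)):
                seg["blocks"].append(idx)   # block fits the current model
                seg["size"] += len(keys)
                continue
        # Block does not fit (or no segment exists yet): start a new segment and model.
        segments.append({"first_key": keys[0], "inv_slope": fit_inv_slope(keys),
                         "blocks": [idx], "size": len(keys)})
    return segments

blocks = [[0, 10, 20, 30], [40, 50, 60, 70], [1000, 1001, 1002, 1003]]
segments = build_segments(blocks, max_error=1.0)
```

In this example the first two blocks follow the same line and share one logical segment, while the third block breaks the pattern and receives its own segment and model.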

In a possible implementation manner, the processor 702 executes the instructions stored in the memory 701, and may also execute the following step:

Writing a third key and a third value corresponding to the third key into the index block, where the third key includes a key in the physical block associated with the first logical segment, and the third value includes the model description value of the first location prediction model and the first address of the first physical block.

In a possible implementation, the first position prediction model includes a linear fitting function, and the model description value of the first position prediction model includes the reciprocal slope of the linear fitting function.

In a possible implementation manner, the reciprocal slope of the linear fitting function is determined by rounding the average value of the difference between two adjacent keywords in the first physical block.

In a possible implementation manner, the association relationship between the first logical segment and the first physical block, and the association relationship between the second logical segment and the second physical block are located in a subtree index structure of a log-structured merge tree.

In addition, the embodiments of the present application also provide an apparatus for constructing an index. Referring to the schematic structural diagram of the index building apparatus shown in FIG. 8 , the apparatus 800 may be located in a storage device. The apparatus 800 may specifically include:

The association module 801 is configured to associate a first physical block with a first logical segment, where the first physical block stores a first keyword and a first value corresponding to the first keyword, and the first keyword and the first storage location of the first keyword in the first physical block satisfy the first location prediction model corresponding to the first logical segment;

The judgment module 802 is configured to judge whether the second keyword stored in the second physical block and the second storage location of the second keyword in the second physical block satisfy the first position prediction model, where a second value corresponding to the second keyword is also stored in the second physical block, and the second physical block is adjacent to the first physical block;

The association module 801 is further configured to associate the second physical block with the second logical segment when the second keyword and the second storage location do not satisfy the first location prediction model, where the second keyword and the second storage location satisfy the second location prediction model corresponding to the second logical segment.

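In a possible implementation manner, the association module 801 is further configured to associate the second physical block with the first logical segment when the second keyword and the second storage location satisfy the first location prediction model.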

In a possible implementation manner, the judging module 802 is specifically configured to:

using the first position prediction model to calculate the predicted position of the second keyword in the second physical block;

In a possible implementation manner, the apparatus 800 further includes:

The writing module 803 is configured to write a third keyword and a third value corresponding to the third keyword into the index block, where the third keyword includes a keyword in the physical block associated with the first logical segment, and the third value includes a model description value of the first position prediction model and the first address of the first physical block.

In a possible implementation manner, the association between the first logical segment and the first physical block, and the association between the second logical segment and the second physical block are located in a subtree index structure of the LSM tree.

The apparatuses of the embodiments of the present application may correspondingly execute the methods described in the embodiments of the present application. Moreover, the above-mentioned and other operations and/or functions of each module in the apparatus 800 are respectively used to implement the corresponding flow of each method in FIG. 1. For the functions of the above modules, reference may be made to the description in the method embodiment shown in FIG. 1. The function of each module in the apparatus 800 can be executed by the processor 702 in the above-mentioned storage device 700 by calling the program in the memory 701.

Embodiments of the present application further provide a computer-readable medium in which instructions are stored; when the instructions are run on a computer, the computer is caused to execute the methods described in the above aspects.

Embodiments of the present application also provide a computer program product, which, when running on a computer, enables the computer to execute the methods described in the above aspects.

In addition, it should be noted that the above-described embodiments are only illustrative. The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they can be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

Those of ordinary skill in the art can understand that the aforementioned storage medium includes various non-transitory machine-readable media that can store program code, such as a USB flash drive (U disk), a removable hard disk, a magnetic disk, an optical disk, random-access memory (RAM), a solid state disk (SSD), or non-volatile memory.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them.

Claims

A method for building an index, wherein the method is applied to a storage device, and the method includes:

Associating a first physical block with a first logical segment, where a first key and a first value corresponding to the first key are stored in the first physical block, and the first key and the first storage position of the first key in the first physical block satisfy the first position prediction model corresponding to the first logical segment;

Judging whether the second key stored in the second physical block and the second storage position of the second key in the second physical block satisfy the first position prediction model, where a second value corresponding to the second key is also stored in the second physical block, and the second physical block is adjacent to the first physical block;

When the second key and the second storage location do not satisfy the first location prediction model, associating the second physical block with a second logical segment, where the second key and the second storage location satisfy the second location prediction model corresponding to the second logical segment.
The method according to claim 1, wherein the method further comprises:

The second physical block is associated with the first logical segment when the second key and the second storage location satisfy the first location prediction model.
The method according to claim 1 or 2, wherein the judging whether the second key stored in the second physical block and the second storage location of the second key in the second physical block satisfy the first position prediction model includes:

using the first position prediction model to calculate the predicted position of the second keyword in the second physical block;

comparing whether the error between the predicted location and the second storage location is within a preset error range;

Then, the second keyword and the second storage location do not satisfy the first location prediction model, specifically, the error between the predicted location and the second storage location exceeds the preset error range.
The method according to any one of claims 1 to 3, wherein the method further comprises:

writing a third keyword and a third value corresponding to the third keyword into an index block, where the third keyword includes a keyword in a physical block associated with the first logical segment, and the third value includes a model description value of the first position prediction model and a first address of the first physical block.
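One possible in-memory shape for such an index-block entry is sketched below. It assumes, beyond what the claim states, that the third keyword is the smallest keyword of its logical segment, that the model description value is a single integer, and that entries are kept sorted by keyword. `IndexEntry` and `find_segment` are hypothetical names.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexEntry:
    keyword: int              # third keyword: a keyword from the segment
    model_value: int          # model description value of the segment's model
    first_block_address: int  # first address of the segment's first block


def find_segment(entries: List[IndexEntry], keyword: int) -> Optional[IndexEntry]:
    """Return the entry of the segment that may hold `keyword`.

    Assumes `entries` is sorted by `keyword` and that each entry's keyword
    is the smallest keyword stored in its logical segment.
    """
    keys = [entry.keyword for entry in entries]
    i = bisect.bisect_right(keys, keyword) - 1
    return entries[i] if i >= 0 else None
```

A lookup then needs only the matching entry: the model description value yields a predicted position, and the first address locates the physical block to read.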
The method according to claim 4, wherein the first position prediction model comprises a linear fitting function, and the model description value of the first position prediction model comprises the reciprocal of the slope of the linear fitting function.
The method according to claim 5, wherein the reciprocal of the slope of the linear fitting function is determined by rounding the average of the differences between each two adjacent keywords in the first physical block.
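The rounding step can be illustrated with a short sketch. The function names are hypothetical, the claim does not specify a rounding mode (Python's built-in `round` rounds exact halves to the nearest even integer), and the use of integer division to predict a position from the reciprocal is an assumption of this example rather than a statement of the claimed model.

```python
from typing import Sequence


def reciprocal_slope(keywords: Sequence[int]) -> int:
    """Round the mean difference between adjacent keywords of one block."""
    if len(keywords) < 2:
        raise ValueError("at least two keywords are needed")
    gaps = [b - a for a, b in zip(keywords, keywords[1:])]
    return round(sum(gaps) / len(gaps))


def predict_position(keyword: int, first_keyword: int, reciprocal: int) -> int:
    # Assumed use of the reciprocal: each slot covers `reciprocal` keyword
    # values, so the slot number is an integer division.
    return (keyword - first_keyword) // reciprocal
```

For the keywords 1000, 1003, 1007 and 1009 the adjacent differences are 3, 4 and 2, whose average is 3, so the stored model description value would be the integer 3.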
The method according to any one of claims 1 to 6, wherein the association between the first logical segment and the first physical block, and the association between the second logical segment and the second physical block, are located in a subtree index structure of a log-structured merge tree.
An apparatus for building an index, wherein the apparatus is located in a storage device, and the apparatus includes:

an association module, configured to associate a first physical block with a first logical segment, where a first keyword and a first value corresponding to the first keyword are stored in the first physical block, and the first keyword and a first storage position of the first keyword in the first physical block satisfy a first position prediction model corresponding to the first logical segment;

a judging module, configured to judge whether a second keyword stored in a second physical block and a second storage position of the second keyword in the second physical block satisfy the first position prediction model, where a second value corresponding to the second keyword is also stored in the second physical block, and the second physical block is adjacent to the first physical block;

where the association module is further configured to associate the second physical block with a second logical segment when the second keyword and the second storage position do not satisfy the first position prediction model, where the second keyword and the second storage position satisfy a second position prediction model corresponding to the second logical segment.
The apparatus according to claim 8, wherein the association module is further configured to:

associate the second physical block with the first logical segment when the second keyword and the second storage position satisfy the first position prediction model.
The apparatus according to claim 8 or 9, wherein the judging module is specifically configured to:

use the first position prediction model to calculate a predicted position of the second keyword in the second physical block;

determine whether an error between the predicted position and the second storage position is within a preset error range;

wherein the second keyword and the second storage position not satisfying the first position prediction model specifically means that the error between the predicted position and the second storage position exceeds the preset error range.
The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises:

a writing module, configured to write a third keyword and a third value corresponding to the third keyword into an index block, where the third keyword includes a keyword in a physical block associated with the first logical segment, and the third value includes a model description value of the first position prediction model and a first address related to the first physical block.
The apparatus according to claim 11, wherein the first position prediction model comprises a linear fitting function, and the model description value of the first position prediction model comprises the reciprocal of the slope of the linear fitting function.
The apparatus according to claim 12, wherein the reciprocal of the slope of the linear fitting function is determined by rounding the average of the differences between each two adjacent keywords in the first physical block.
The apparatus according to any one of claims 8 to 13, wherein the association between the first logical segment and the first physical block, and the association between the second logical segment and the second physical block, are located in a subtree index structure of a log-structured merge tree.