CN115718819A - Index construction method, data reading method and index construction device

Info

Publication number: CN115718819A
Application number: CN202211502941.6A
Authority: CN
Inventors: 柴云鹏; 骆远辉; 王元桢
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-02-28

Abstract

The application provides an index construction method, a data reading method and an index construction device. The index construction method determines a target prefix length according to initial data and a preset spatial magnification threshold of a first target node; divides the initial data into a plurality of sub data sets and data to be stored according to the target prefix length; determines an index model according to the node prefixes and the data to be stored, and calculates the storage positions of the node prefixes and of the data to be stored through the index model; stores each node prefix, a pointer pointing to a child node, and the data to be stored in the corresponding storage positions to construct the first target node; and repeats the above process to construct the child nodes until the whole index is constructed. The constructed index structure is flatter, performs better across different data sets and write-intensive workloads, remains stable when the data distribution changes, and achieves better tail latency.

Description

Index construction method, data reading method and index construction device

Technical Field

The present application relates to the field of data storage, and in particular, to an index construction method, a data reading method, and an index construction apparatus.

Background

Prefix tree: also known as a trie or dictionary tree, a multi-way tree structure for fast retrieval. A prefix tree may be used to index numeric values, strings, and other data types (it is also referred to as a radix tree when indexing numeric types). Each node of the prefix tree has a plurality of child nodes, and all the child nodes of a given node correspond to different characters. In the prefix tree, each node represents a character string (prefix): the characters along the path from the root node to a given node are concatenated to form the character string corresponding to that node. The core idea of the prefix tree is to trade space for time: the common prefixes of the character strings are used to reduce query time and thereby improve efficiency.

However, the time and space efficiency of the prefix tree model degrades as the data volume grows, and the big data era places higher capacity and performance requirements on storage systems.

Disclosure of Invention

In view of the above, an object of the present application is to provide an index construction method, a data reading method, and an index construction apparatus, such that the constructed index structure is flatter, performs better across different data sets and write-intensive workloads, keeps its performance stable when the data distribution changes, and achieves better tail latency.

The index construction method provided by the embodiment of the application comprises the following steps:

inputting initial data into a first target node of a prefix tree, and determining a target prefix length of the first target node according to the initial data and a preset spatial amplification rate threshold of the first target node;

dividing the initial data into a plurality of sub data sets according to the target prefix length of the first target node, and/or screening out, from the initial data, the data to be stored in the first target node; wherein the initial data in each sub data set share the same node prefix;

determining an index model of the first target node according to the node prefixes and/or the data to be stored, and calculating, through the index model, the storage position of the node prefix of each sub data set together with its pointer to the corresponding child node, and/or the storage position of each piece of data to be stored;

storing each node prefix and its pointer to a child node, and/or each piece of data to be stored, in the corresponding storage position to construct an index structure of the first target node;

and taking each child node of the constructed first target node as a new first target node, taking the sub data set corresponding to that child node as new initial data, and constructing an index structure of the new first target node, until a construction completion condition is met and the whole index is constructed.

In some embodiments, the determining the target prefix length of the first target node according to the initial data and the preset spatial magnification threshold of the first target node in the index building method includes:

determining at least one prefix length to be verified according to the data distribution characteristics of the initial data;

respectively calculating the space magnification corresponding to each prefix length to be verified;

screening out prefix lengths to be verified, of which the corresponding space amplification rates are not larger than a preset space amplification rate threshold value of the first target node;

and determining the screened maximum prefix length to be verified as the target prefix length.

In some embodiments, the determining an index model of a first target node according to a node prefix and/or data to be stored in the index building method includes:

calculating a target global slope meeting a preset spatial amplification rate threshold according to the node prefix and/or the data to be stored;

and determining a linear model as an index model according to the calculated target global slope.

In some embodiments, in the index construction method, after the whole index is constructed, the method further includes:

when new data is inserted into the constructed index, judging whether a target slot position of a second target node mapped by the new data meets a preset adjusting condition;

and if so, adjusting the index structure of a second target node through a pre-configured adjustment strategy according to the data of the second target node including the new data.

In some embodiments, in the index building method, the adjusting the index structure of the second target node according to the data of the second target node including the new data includes:

adjusting the spatial magnification of the second target node to enlarge the storage space of the second target node;

or, according to the data of the second target node including the new data, re-determining the index model of the second target node;

or, taking as the initial data of the second target node the key-value pairs in the data slots of the second target node together with the key-value pairs formed by the node prefix and the pointer of each pointer slot;

and according to the initial data of the second target node and a preset spatial magnification threshold, re-determining the target prefix length and the index model of the second target node so as to reconstruct the index structure of the second target node.

In some embodiments, the preset adjustment condition in the index construction method is at least one of the following: the fill ratio of the second target node reaches a preset fill ratio threshold; the target slot of the second target node to which the new data maps is a data slot; the target slot of the second target node to which the new data maps is a pointer slot, and the new data does not match the longest common prefix of the child node pointed to by the pointer slot.
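The three adjustment conditions listed above can be sketched as a single predicate. This is a minimal illustration only: the function name `needs_adjustment`, the string slot tags, and the choice to combine the conditions with a logical OR are assumptions made for the example, not details stated in the application.

```python
# Hypothetical slot tags; the application only says that slots may be
# empty, pointer slots, or data slots.
EMPTY, POINTER, DATA = "empty", "pointer", "data"


def needs_adjustment(slot_kind, fill_ratio, fill_threshold,
                     key="", child_prefix=""):
    """Return True if inserting `key` should trigger a node adjustment.

    Assumption: the conditions are combined with OR.
    """
    # Condition 1: the node is at or above its fill ratio threshold.
    if fill_ratio >= fill_threshold:
        return True
    # Condition 2: the target slot already holds data (a collision).
    if slot_kind == DATA:
        return True
    # Condition 3: the target slot points to a child whose longest
    # common prefix does not match the new key.
    if slot_kind == POINTER and not key.startswith(child_prefix):
        return True
    return False
```

For example, a key "top" that maps to a pointer slot whose child has the longest common prefix "te" would trigger an adjustment under this sketch, while "tea" would simply descend into the child.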

In some embodiments, in the index construction method, when a difference between a child node of the first target node and a node prefix of the first target node is smaller than a preset length threshold, the child node of the first target node adopts an ART node.

In some embodiments, a data reading method is further provided, which is applied to the index constructed by the index construction method; the reading method comprises the following steps:

starting from the root node of the index, judging whether the node prefix of the current node is matched with the data to be read;

if they match, calculating the target slot position of the data to be read through the index model in the current node;

if the target slot position is a data slot position, judging whether the data in the data slot position is matched with the data to be read, and if so, returning the data in the data slot position;

and if the target slot position is a pointer slot position, inquiring the data to be read in the child node pointed by the pointer slot position until a reading result is returned.
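The reading steps above can be sketched as a loop that descends from the root. In this minimal sketch each node is a plain dictionary, and the per-node index model is stood in for by a small function that maps a key to a slot identifier; the field names (`prefix`, `model`, `slots`) and the hand-built two-node index are hypothetical and chosen only for illustration.

```python
def read(node, key):
    """Point read: return the value stored for `key`, or None."""
    while node is not None:
        # Step 1: the node prefix must match the key being read.
        if not key.startswith(node["prefix"]):
            return None
        # Step 2: the node's model computes the target slot.
        slot = node["slots"].get(node["model"](key))
        if slot is None:            # empty slot: key is absent
            return None
        kind, payload = slot
        if kind == "data":
            # Step 3: data slot, so compare the stored key.
            stored_key, value = payload
            return value if stored_key == key else None
        # Step 4: pointer slot, so continue in the child node.
        node = payload
    return None


# Hand-built example index (assumption: the "model" is just the next
# character after the node prefix, a stand-in for a learned model).
leaf = {"prefix": "te", "model": lambda k: k[2:3],
        "slots": {"a": ("data", ("tea", 1)), "n": ("data", ("ten", 2))}}
root = {"prefix": "t", "model": lambda k: k[1:2],
        "slots": {"e": ("ptr", leaf), "o": ("data", ("to", 3))}}
```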

In some embodiments, the data reading method further comprises:

determining boundary data of a data range to be read according to the data range; wherein the boundary data is the largest data and/or the smallest data in the data range;

determining the storage position of the boundary data in the index;

and returning a reading result matched with the data range to be read according to the storage sequence of the data in the index and the storage position of the boundary data.
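A range read of this kind relies on the slots of every node being kept in key order. The sketch below is a simplification under stated assumptions: nodes are dictionaries with a `slots` list (a hypothetical layout), and instead of jumping directly to the storage position of the lower boundary it walks the index in storage order and stops once the upper boundary is passed.

```python
def iter_in_order(node):
    """Yield (key, value) pairs in storage order, descending into children."""
    for slot in node["slots"]:
        if slot is None:            # empty slot
            continue
        kind, payload = slot
        if kind == "data":
            yield payload
        else:                       # pointer slot: recurse into the child
            yield from iter_in_order(payload)


def range_read(root, lo, hi):
    """Return all (key, value) pairs with lo <= key <= hi."""
    out = []
    for key, value in iter_in_order(root):
        if key > hi:                # past the upper boundary: stop early
            break
        if key >= lo:
            out.append((key, value))
    return out


# Hypothetical two-level index with integer keys.
child = {"slots": [("data", (21, "b")), None, ("data", (25, "c"))]}
root = {"slots": [("data", (10, "a")), None, ("ptr", child),
                  ("data", (40, "d"))]}
```

A fuller implementation would first locate the slot of the smallest boundary key with the node models and start the scan from there; the early `break` above only covers the upper boundary.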

In some embodiments, there is also provided an index building apparatus, including:

the determining module is used for inputting initial data into a first target node of a prefix tree and determining a target prefix length of the first target node according to the initial data and a preset spatial magnification threshold value of the first target node;

the dividing module is used for dividing the initial data into a plurality of sub data sets according to the target prefix length of the first target node, and/or screening out, from the initial data, the data to be stored in the first target node; wherein the initial data in each sub data set share the same node prefix;

the calculation module is used for determining an index model of the first target node according to the node prefixes and/or the data to be stored, and calculating, through the index model, the storage position of the node prefix of each sub data set together with its pointer to the corresponding child node, and/or the storage position of each piece of data to be stored;

the first construction module is used for storing each node prefix and its pointer to a child node, and/or each piece of data to be stored, in the corresponding storage position to construct an index structure of the first target node;

and the second construction module is used for taking each child node of the constructed first target node as a new first target node, taking the sub data set corresponding to that child node as new initial data, and constructing an index structure of the new first target node, until the construction completion condition is met and the whole index is constructed.

The embodiments of the application provide an index construction method, a data reading method and an index construction device. In the index construction method, initial data are input into a first target node of a prefix tree, and a target prefix length and an index model of the first target node are determined according to the initial data and a preset spatial magnification threshold of the first target node; the initial data are divided into a plurality of sub data sets according to the target prefix length of the first target node, and/or the data to be stored in the first target node are screened out from the initial data; the storage position of the node prefix of each sub data set, together with its pointer to the corresponding child node, and/or the storage position of each piece of data to be stored, is calculated through the index model; each node prefix and its pointer to a child node, and/or each piece of data to be stored, is stored in the corresponding storage position to construct the index structure of the first target node; and each child node of the constructed first target node is taken as a new first target node, with its corresponding sub data set as new initial data, and the index structure of the new first target node is constructed, until a construction completion condition is met and the whole index is constructed. Compared with a traditional index structure, every node of the index constructed in the embodiments of the application carries a learned index model fitted to the data of that node, so the index has the data-distribution-fitting capability of a learned index and an overall advantage over traditional indexes in basic query performance. Compared with other learned indexes, the constructed index is competitive in query performance while offering overall robustness comparable to that of a traditional prefix tree index (such as ART): it performs better across different data sets and write-intensive workloads, keeps its performance stable when the data distribution changes, and achieves better tail latency. The index construction algorithm disclosed in the embodiments of the application has low time complexity, keeps the whole tree flat, and maps the different prefixes required by the insertion adjustment strategy to different positions.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

FIG. 1 is a flowchart of an index construction method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a prefix tree index structure constructed according to the present application;

FIG. 3 is a flowchart of a method for determining the target prefix length of the first target node according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for determining the index model of the first target node according to the node prefix and/or the data to be stored according to an embodiment of the present application;

FIG. 5 is a flowchart of a method for adjusting the index structure of the second target node according to an embodiment of the present application;

FIG. 6 is a flowchart of a data reading method according to an embodiment of the present application;

FIG. 7 is a flowchart of a data reading method according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an index construction device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It should be understood that the drawings in the present application are for illustration and description only and are not used to limit the protection scope of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flowcharts may be performed out of order, and steps with no logical dependency on one another may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowcharts.

In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.

Database index: the Database Index is a data structure used to speed up data access operations. When data is stored, a data identifier (the Key of the data) and the actual storage location of the data (the Value) are additionally stored in the index structure. Although creating and maintaining an index structure incurs additional space cost, the index speeds up subsequent access to the data elements and avoids the high cost of querying by directly scanning the data elements. An operation that accesses data may first query the index by the data identifier (Key) and then obtain the actual storage location (Value) of the data. In addition, index data structures of this kind are also widely used in other fields, such as key-value stores, file systems, search engines, and so on.

Sorted index: the Sorted Index is one category of index structure. Its most prominent characteristic is that, in addition to point queries, it supports efficient range queries, prefix queries and the like. This is mainly achieved by explicitly organizing the index structure in the order of the data elements. A point query checks whether a data element exists according to its data identifier (Key) and, if it exists, returns the actual storage location (Value) of the data. A range query, given a range of data identifiers (Key), returns the actual storage locations (Value) of all data elements within the range. A prefix query, given a data identifier (Key), returns the largest key among all data elements that is less than the given data identifier (Key), together with its corresponding data storage location (Value). Index structures of other categories, such as hash indexes, support only point queries and cannot support efficient range queries or prefix queries.
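The three query types described above can be illustrated on the simplest possible sorted index, a sorted array of keys searched with Python's standard `bisect` module. The key and location values below are made up for the example, and the prefix query follows the definition given in the text (the largest stored key that is less than the given key).

```python
import bisect

# A toy sorted index: keys in ascending order, with the storage
# location (Value) of each key at the same position in `locs`.
keys = [10, 20, 30, 40, 50]
locs = ["a", "b", "c", "d", "e"]


def point_query(key):
    """Return the location of `key`, or None if it is not present."""
    i = bisect.bisect_left(keys, key)
    return locs[i] if i < len(keys) and keys[i] == key else None


def range_query(lo, hi):
    """Return the locations of all keys with lo <= key <= hi."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return locs[i:j]


def prefix_query(key):
    """Return (largest stored key < key, its location), or None."""
    i = bisect.bisect_left(keys, key)
    return (keys[i - 1], locs[i - 1]) if i > 0 else None
```

A hash index could answer `point_query` but, because it does not keep keys in order, it has no efficient equivalent of the other two functions.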

Prefix tree: also known as a trie or dictionary tree, a multi-way tree structure for fast retrieval. A prefix tree can be used to index numeric values, strings, and other data types (it is also called a Radix Tree when indexing numeric types). Each node of the prefix tree has a plurality of child nodes, and all the child nodes of a given node correspond to different characters. In the prefix tree, each node represents a character string (prefix): the characters along the path from the root node to a given node are concatenated to form the character string corresponding to that node. The core idea of the prefix tree is to trade space for time: the common prefixes of the character strings are used to reduce query time and thereby improve efficiency.

However, the time and space efficiency of the prefix tree model degrades as the data volume grows, and the big data era places higher capacity and performance requirements on storage systems.
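As a concrete illustration of the prefix tree described above, the following is a minimal character-level trie. It is a textbook sketch, not the index structure of this application; the class and method names are chosen only for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one child per distinct next character
        self.is_key = False   # True if a stored key ends at this node
        self.value = None     # payload for that key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            # Keys sharing a prefix reuse the same path of nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_key else None

    def count_nodes(self):
        """Count all nodes, including the root."""
        stack, n = [self.root], 0
        while stack:
            node = stack.pop()
            n += 1
            stack.extend(node.children.values())
        return n
```

Inserting "tea", "ten" and "to" creates six nodes in total (the root plus t, e, a, n, o) rather than one node per character of every key, which shows how shared prefixes are stored once.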

Various improvements to the conventional prefix tree model have been made in the prior art to improve its capacity and performance. For example, the Adaptive Radix Tree (ART) was proposed to solve the problem of excessive space consumption of radix trees in the worst case. ART adaptively selects a compact and efficient data structure for internal nodes and reduces the height of the tree by means of lazy expansion and path compression, so that it has higher space efficiency. Each level of an ART index uses 8 bits to record and distinguish a partial prefix of the key, and the node type is determined according to the amount of data in each node. There are four node types, Node4, Node16, Node48 and Node256, and each type ensures efficient access by means such as SIMD parallel instructions and indirect indexing. The lookup performance of ART exceeds that of highly tuned read-only search trees, while it also supports very efficient insertion and deletion. While its performance is comparable to that of hash tables, ART also maintains data in sorted order, so it can support operations that hash tables cannot, such as range scans and prefix lookups. Furthermore, because an ART insert operation needs to modify at most the parent node and the current node, ART is friendlier to multithreaded parallelism and blocks other concurrent operations less during insertion.

Patent publication CN112732725A describes a method, system and medium for constructing an adaptive prefix tree based on NVM hybrid memory. In order to reduce the space that a prefix tree index structure consumes in DRAM (dynamic random access memory) while maintaining high performance, that scheme provides an adaptive prefix tree construction method based on NVM (non-volatile memory) hybrid memory, in which a global index data structure is constructed to build and maintain all data and nodes in the NVM address space, and a fast index data structure for newly added data and nodes is maintained in the DRAM address space, so that the impact of introducing the NVM address space on index performance is reduced. When the occupied DRAM address space reaches a set proportion, a migration thread is triggered to migrate the newly added data and nodes into the NVM address space, so that the DRAM space occupied by the index structure is effectively reduced, the storage cost is lowered, and various database operation requests are handled efficiently.
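The adaptive choice among the four ART node types can be sketched as picking the smallest type whose capacity covers the current number of children. The helper name `art_node_type` is hypothetical; only the four capacities come from the text above.

```python
def art_node_type(num_children):
    """Return the smallest ART node type that can hold `num_children`."""
    if not 0 <= num_children <= 256:
        raise ValueError("an 8-bit partial key allows at most 256 children")
    for capacity in (4, 16, 48, 256):
        if num_children <= capacity:
            return f"Node{capacity}"
```

With an 8-bit partial key per level, a node can never have more than 256 children, which is why Node256 is the largest type.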

Learned index: the Learned Index is a novel index structure constructed using the methods and ideas of machine learning. A learned index treats the index structure itself as a model that takes a data identifier (Key) as input and outputs the corresponding location information stored in the index structure. From this perspective, various machine learning models can replace or accelerate index retrieval: by directly fitting the functional relation between input and output, and applying some correction, data can be located through model computation. The search operations originally used in traditional indexes, such as sequential search and binary search, are replaced by faster and simpler model computation, so that overall query efficiency is improved. When a query is actually executed, the key is input to the learned index and its position within the current index node is obtained through model prediction; if the model has an error, an additional search and correction is needed, and the output is finally obtained after searching through multiple layers of models.

While learned indexes can outperform traditional indexes on most data sets and workloads, they do not perform as well as traditional indexes, such as the adaptive radix tree ART, on hard-to-learn data sets and write-intensive workloads. Moreover, when the data distribution varies greatly, their performance fluctuates, and their tail latency under different workloads is not as stable as that of traditional indexes. Overall, learned indexes are less robust than traditional indexes. In addition, learned indexes are mainly optimized for fixed-length numeric data and cannot process long, variable-length string data, so their functionality is less complete than that of traditional indexes.

The invention provides a novel index structure aimed at the robustness problem of learned indexes and the performance problem of traditional indexes. A learned model is introduced on the basis of the traditional prefix tree, so that the index combines the robustness of a traditional index with the efficient query performance of a learned index, is able to process variable-length string data, and can stably outperform traditional indexes in all respects.
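The predict-then-correct lookup described above can be sketched for a single sorted array. The model here is a straight line through the first and last key, which is a deliberately simple assumption rather than the model used in this application; the function names and the error-bounded local search are likewise illustrative.

```python
import bisect


def fit_line(keys):
    """Fit position ~ slope * key + intercept through the two end points."""
    n = len(keys)
    if n < 2 or keys[-1] == keys[0]:
        return 0.0, 0.0
    slope = (n - 1) / (keys[-1] - keys[0])
    return slope, -slope * keys[0]


def predict(model, key, n):
    """Predicted position of `key`, clamped to the valid index range."""
    slope, intercept = model
    return min(max(int(round(slope * key + intercept)), 0), n - 1)


def max_error(keys, model):
    """Largest distance between predicted and true position."""
    return max(abs(predict(model, k, len(keys)) - i)
               for i, k in enumerate(keys))


def lookup(keys, model, err, key):
    """Predict a position, then correct it by searching within +/- err."""
    guess = predict(model, key, len(keys))
    lo = max(guess - err, 0)
    hi = min(guess + err + 1, len(keys))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < hi and keys[i] == key else None
```

The better the model fits the key distribution, the smaller `max_error` becomes and the shorter the corrective search, which is the source of the speed-up over a plain binary search.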

Based on this, in order to improve the performance and capacity of an index, the embodiments of the present application provide an index construction method, a data reading method, and an index construction apparatus. In the index construction method, initial data are input into a first target node of a prefix tree, and a target prefix length and an index model of the first target node are determined according to the initial data and a preset spatial magnification threshold of the first target node; the initial data are divided into a plurality of sub data sets according to the target prefix length of the first target node, and/or the data to be stored in the first target node are screened out from the initial data; the storage position of the node prefix of each sub data set, together with its pointer to the corresponding child node, and/or the storage position of each piece of data to be stored, is calculated through the index model; each node prefix and its pointer to a child node, and/or each piece of data to be stored, is stored in the corresponding storage position to construct the index structure of the first target node; and each child node of the constructed first target node is taken as a new first target node, with its corresponding sub data set as new initial data, and the index structure of the new first target node is constructed, until a construction completion condition is met and the whole index is constructed. Compared with a traditional index structure, every node of the index constructed in the embodiments of the application carries a learned index model fitted to the data of that node, so the index has the data-distribution-fitting capability of a learned index and an overall advantage over traditional indexes in basic query performance. Compared with other learned indexes, the constructed index is competitive in query performance while offering overall robustness equal to that of a traditional prefix tree index (such as ART): it performs better across different data sets and write-intensive workloads, keeps its performance stable when the data distribution changes, and achieves better tail latency. The index construction algorithm disclosed in the embodiments of the application has low time complexity, keeps the whole tree flat, and maps the different prefixes required by the insertion adjustment strategy to different positions.

Referring to FIG. 1, FIG. 1 is a flowchart of an index construction method according to an embodiment of the present application. Specifically, the index construction method includes the following steps S101 to S105:

S101, inputting initial data into a first target node of a prefix tree, and determining a target prefix length of the first target node according to the initial data and a preset spatial magnification threshold of the first target node;

S102, dividing the initial data into a plurality of sub data sets according to the target prefix length of the first target node, and/or screening out, from the initial data, the data to be stored in the first target node; wherein the initial data in each sub data set share the same node prefix;

S103, determining an index model of the first target node according to the node prefixes and/or the data to be stored, and calculating, through the index model, the storage position of the node prefix of each sub data set together with its pointer to the corresponding child node, and/or the storage position of each piece of data to be stored;

S104, storing each node prefix and its pointer to a child node, and/or each piece of data to be stored, in the corresponding storage position to construct an index structure of the first target node;

S105, taking each child node of the constructed first target node as a new first target node, taking the sub data set corresponding to that child node as new initial data, and constructing an index structure of the new first target node, until the construction completion condition is met and the whole index is constructed.
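The recursive shape of steps S101 to S105 can be sketched as follows. This is a heavily simplified illustration under stated assumptions: keys are distinct strings, the prefix length chosen in S101 is fixed at one character beyond the longest common prefix instead of being derived from a spatial magnification threshold, and a dictionary keyed by that character stands in for the positions that the index model of S103 would compute. The names `build` and `common_prefix_len` are hypothetical.

```python
def common_prefix_len(keys):
    """Length of the longest common prefix of a non-empty list of strings."""
    first, last = min(keys), max(keys)
    n = 0
    while n < len(first) and n < len(last) and first[n] == last[n]:
        n += 1
    return n


def build(items, span=1):
    """Recursively build a node from (key, value) pairs."""
    items = list(dict(items).items())        # drop duplicate keys
    keys = [k for k, _ in items]
    lcp = common_prefix_len(keys)
    # S102: split the data into sub data sets sharing the same node prefix.
    groups = {}
    for k, v in items:
        groups.setdefault(k[lcp:lcp + span], []).append((k, v))
    node = {"prefix": keys[0][:lcp], "slots": {}}
    # S103/S104: place each group (a dict key stands in for the model).
    for part, group in sorted(groups.items()):
        if len(group) == 1:
            node["slots"][part] = ("data", group[0])         # store directly
        else:
            # S105: the sub data set becomes the input of a child node.
            node["slots"][part] = ("ptr", build(group, span))
    return node
```

Because every child receives keys that share a strictly longer common prefix than its parent, the recursion terminates once each group contains a single key.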

Compared with a traditional index structure, every node of the index constructed by the index construction method in the embodiments of the application carries a learned index model fitted to the data of that node, so the index has the data-distribution-fitting capability of a learned index and an overall advantage over traditional indexes in basic query performance. Compared with other learned indexes, the constructed index is competitive in query performance while offering overall robustness comparable to that of a traditional prefix tree index (such as ART): it performs better across different data sets and write-intensive workloads, keeps its performance stable when the data distribution changes, and achieves better tail latency. The index construction algorithm disclosed in the embodiments of the application has low time complexity, keeps the whole tree flat, and maps the different prefixes required by the insertion adjustment strategy to different positions.

In step S101, initial data is input into a first target node of a prefix tree, and a target prefix length and an index model of the first target node are determined according to the initial data and a preset spatial magnification threshold of the first target node.

Referring to fig. 2, fig. 2 is a schematic diagram illustrating a prefix tree index structure constructed by the present application; the prefix tree index comprises a root node, the root node is used as a father node, a plurality of child nodes are arranged, and the like.

The first target node may be a root node or other nodes; that is, the first target node is the node that is being constructed at present, and does not refer to the root node specifically. When the first target node is a root node, the initial data is all data used for constructing an index; and when the first target node is other nodes, the initial data input to the first target node is a child data set corresponding to a pointer of the parent node pointing to the first target node.

Each node records the longest common prefix (prefix) of the data inserted into the subtree rooted at that node. Each node also has an index model (model) and a node prefix length (span); a key that is fetched must be preprocessed according to the span before it is input into the index model. Each node has an array that stores pointers to child nodes and data, and two bits are used for each slot to distinguish empty slots, pointer slots containing a pointer to a child node, and data slots containing data.
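The node layout described above can be pictured with a short Python sketch. It is illustrative only: the names `SlotKind`, `SlotTags`, `Node` and `make_node`, the concrete bit values of the tags, and the choice to pack four two-bit tags into each byte are assumptions and are not specified in the application.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, List, Optional


class SlotKind(IntEnum):
    """Two-bit tag for one slot (the concrete bit values are assumed)."""
    EMPTY = 0b00
    POINTER = 0b01
    DATA = 0b10


class SlotTags:
    """Stores one two-bit tag per slot, packed four tags to a byte."""

    def __init__(self, n_slots: int) -> None:
        self.bits = bytearray((n_slots + 3) // 4)

    def get(self, i: int) -> SlotKind:
        return SlotKind((self.bits[i // 4] >> (2 * (i % 4))) & 0b11)

    def set(self, i: int, kind: SlotKind) -> None:
        shift = 2 * (i % 4)
        cleared = self.bits[i // 4] & ~(0b11 << shift) & 0xFF
        self.bits[i // 4] = cleared | (int(kind) << shift)


@dataclass
class Node:
    prefix: str        # longest common prefix of the keys in this subtree
    span: int          # node prefix length used to preprocess a key
    slope: float       # linear index model: slot = slope * x + intercept
    intercept: float
    tags: SlotTags     # two-bit kind of every slot
    slots: List[Optional[Any]] = field(default_factory=list)


def make_node(prefix: str, span: int, n_slots: int) -> Node:
    """Create a node whose slots are all empty (hypothetical helper)."""
    return Node(prefix, span, 1.0, 0.0, SlotTags(n_slots), [None] * n_slots)
```

With this packing, a node with six slots needs only two bytes of tag storage, which is one plausible reason for using two bits rather than a full byte per slot.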

Referring to fig. 3, determining the target prefix length of the first target node according to the initial data and the preset spatial magnification threshold of the first target node includes the following steps S301 to S304:

S301, determining at least one prefix length to be verified according to the data distribution characteristics of the initial data;

S302, respectively calculating the space magnification corresponding to each prefix length to be verified;

S303, screening out the prefix lengths to be verified whose corresponding space amplification rate is not more than the preset space amplification rate threshold of the first target node;

S304, determining the largest of the screened prefix lengths to be verified as the target prefix length.

After the target prefix length of the first target node is determined, the index model of the first target node is determined according to the determined target prefix length.

Referring to fig. 4, in the index construction method, determining the index model of the first target node according to the node prefix and/or the data to be stored includes the following steps S401 to S402:

S401, calculating a target global slope meeting a preset spatial amplification rate threshold according to the node prefix and/or the data to be stored;

S402, determining a linear model as the index model according to the calculated target global slope.

For a learned index, the construction algorithm that determines the index model is the core component. Existing algorithms differ mainly in their optimization target; examples include a linear regression model based on the least squares method, and the FMCD method, which minimizes the maximum collision rate.

The embodiment of the application provides a novel index model construction algorithm. It is based on the observation that, for a given prefix length (denoted span), data assigned to the same child node share the same prefix, while data stored at different positions of the node have different prefixes. The design objective of the index model construction algorithm in the embodiment of the application is to use as large a prefix length span as possible, so as to reduce the height of the prefix tree and make the constructed index flatter, while ensuring that different prefixes do not conflict and fall to different positions of the node.

The method of determining the prefix length as large as possible is described above in steps S301-S304.

In some embodiments, when the target prefix length of the first target node is determined in steps S301 to S304, the prefix lengths to be verified may be derived from the character lengths of the initial data and checked in order from largest to smallest. For example, if the shortest item in the initial data is 5 characters long, the maximum prefix length to be verified is 5, and the space amplification rate corresponding to prefix length 5 is calculated first. If that rate is less than or equal to the preset space amplification rate threshold, 5 is determined as the target prefix length. If it is greater than the threshold, prefix length 4 is verified in the same way, and so on, until the largest prefix length to be verified that satisfies the threshold is found and taken as the target prefix length.
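The descending search described above can be sketched in Python as follows. The application does not give a formula for the space amplification rate, so `toy_amplification` is purely an assumed stand-in (slots spanned by the prefixes, read as big-endian integers, divided by the number of items); `choose_prefix_length` is likewise a hypothetical name.

```python
from typing import Callable, Sequence


def toy_amplification(keys: Sequence[str], length: int) -> float:
    """Assumed stand-in: slots spanned by the prefixes / number of items."""
    codes = [int.from_bytes(k[:length].encode("ascii"), "big") for k in keys]
    return (max(codes) - min(codes) + 1) / len(keys)


def choose_prefix_length(
    keys: Sequence[str],
    alpha: float,
    amplification: Callable[[Sequence[str], int], float] = toy_amplification,
) -> int:
    """Return the largest prefix length whose amplification is <= alpha."""
    longest = min(len(k) for k in keys)      # cannot exceed the shortest key
    for length in range(longest, 0, -1):     # verify from large to small
        if amplification(keys, length) <= alpha:
            return length
    return 0                                 # no candidate met the threshold


keys = ["absent", "abnormal", "append", "apposition", "bicycle", "bigamy"]
print(choose_prefix_length(keys, alpha=50.0))
```

Under this assumed amplification measure, a two-character prefix is the longest one that stays within a threshold of 50 for the six sample keys; a real implementation would substitute the amplification definition actually used by the index.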

Here, the spatial magnification determines the storage space of the first target node, that is, the number of slots of the first target node. If the spatial magnification corresponding to a prefix length is greater than the preset spatial magnification threshold, storing the data and the node prefixes of the initial data would require more slots than the threshold allows, and the prefix length needs to be reduced. Reducing the prefix length reduces the number of sub data sets or data segments into which the initial data is divided, which ensures that different node prefixes do not collide and fall to different positions of the node.

After the target prefix length of the first target node is determined, a target global slope meeting a preset spatial amplification rate threshold is calculated according to the target prefix length and the initial data.

Specifically, calculating a target global slope meeting a preset spatial amplification threshold according to the target prefix length and the initial data includes:

according to the target prefix length and the initial data, determining data to be stored which are directly stored in the first target node in the initial data and a node prefix of the first target node;

and calculating a target global slope meeting a preset spatial amplification rate threshold according to the data to be stored and the node prefix.

Here, the node prefix is also used as data, so that a target global slope satisfying a preset spatial amplification threshold is calculated from the data to be stored and the node prefix, and the target global slope is determined by, for example, a least square method.

In the embodiment of the application, another method for determining an index model in a first target node is also provided.

Here, determining the index model of the first target node requires two passes over the initial data, and the time complexity is O(N + log₂ K), where N is the data amount of the initial data and K represents any data element in the initial data.

The first pass determines the target prefix length span, and the second pass determines the index model. The specific process is as follows. In the first pass, span is initialized to 0. For each pair of adjacent data X_k and X_{k+1} in the initial data, the minimum non-collision slope A_k is calculated, that is, A_k must satisfy A_k · (X_{k+1} - X_k) = 1, and the spatial magnification at slope A_k is then calculated. If the spatial magnification is less than or equal to the preset spatial magnification threshold α, the traversal continues with the remaining data. If not, the two data X_k and X_{k+1} are too close to each other; they can take a common prefix and be attributed to the same sub data set, and span is increased until X_k >> span = X_{k+1} >> span, after which the traversal continues with the remaining data. Once all data have been traversed, the value of the target prefix length span is determined. In the second pass, A_k is recalculated according to span, and a global slope A = max(A, A_k) satisfying the preset spatial magnification threshold α is obtained; the global slope A must be the maximum of the A_k to ensure that no two data conflict. The linear model is then determined from the global slope A.
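The two passes can be sketched in Python under several assumptions that the application leaves open: keys are distinct non-negative integers sorted in ascending order, span is treated as the right-shift count exactly as in the notation X_k >> span, and the spatial magnification at a given slope is taken to be (number of slots needed) / (number of keys). The function names are hypothetical. The slope is kept as the integer `min_gap` (A = 1 / min_gap) so that slot positions are computed without floating-point error.

```python
from typing import List, Tuple


def choose_span(keys: List[int], alpha: float) -> int:
    """First pass: grow span until every too-close neighbour pair is merged."""
    n = len(keys)
    span = 0
    for lo, hi in zip(keys, keys[1:]):
        a, b = lo >> span, hi >> span
        if a == b:                       # already in the same sub data set
            continue
        # A_k = 1 / (b - a); slots needed if A_k were the global slope.
        width = (keys[-1] >> span) - (keys[0] >> span)
        slots = width // (b - a) + 1
        if slots > alpha * n:            # amplification above the threshold
            while (lo >> span) != (hi >> span):
                span += 1                # merge the pair under one prefix
    return span


def fit_model(keys: List[int], span: int) -> Tuple[int, int]:
    """Second pass: global slope A = max(A_k) = 1 / (smallest gap)."""
    reduced = sorted({k >> span for k in keys})
    gaps = (b - a for a, b in zip(reduced, reduced[1:]))
    return reduced[0], min(gaps, default=1)


def predict(key: int, span: int, base: int, min_gap: int) -> int:
    """Linear model: slot = floor(A * ((key >> span) - base))."""
    return ((key >> span) - base) // min_gap


keys = [0, 1, 2, 3, 64, 128, 192]
span = choose_span(keys, alpha=2.0)
base, min_gap = fit_model(keys, span)
print(span, base, min_gap, [predict(k, span, base, min_gap) for k in keys])
```

In this example the four small keys end up sharing slot 0, which corresponds to one sub data set handed to a child node, while 64, 128 and 192 each receive their own slot. Because the slope is the maximum of the per-pair slopes, any two distinct reduced keys are at least one slot apart.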

In the step S102, according to a target prefix length of a first target node, dividing the initial data into a plurality of sub data sets, and/or screening out data to be stored, which is stored in the first target node, from the initial data; wherein the initial data in the sub data sets have the same node prefix.

Here, the initial data is divided into a plurality of sub data sets, the data in the sub data sets are relatively similar and need to be stored in the sub nodes of the first target node, and only the common node prefix of the data in the sub data sets is stored in the first target node.

Also, there may be some of the initial data stored directly in the first target node.

Therefore, dividing the initial data into a plurality of sub data sets and/or screening out the data to be stored in the first target node, according to the target prefix length of the first target node, covers three cases. In the first case, the initial data is only divided into a plurality of sub data sets; for example, the root node usually does not store data directly and stores only the common prefixes of the sub data sets. In the second case, the initial data is divided into a plurality of sub data sets and the data outside the sub data sets is taken as the data to be stored; some lower-level child nodes thus store both data and the common prefixes of sub data sets. In the third case, only data to be stored is screened out; a child node at the lowest level stores only data, has no child nodes of its own, and therefore stores no common prefix.

And dividing the initial data into a plurality of sub data sets, wherein the initial data in the sub data sets have the same node prefix, namely the same longest common prefix.

Illustratively, the initial data includes absent, abnormal, append, apposition, bicycle and bigamy, which can be divided into three sub data sets: absent, abnormal; append, apposition; bicycle, bigamy.

The node prefix of absent and abnormal is ab; the node prefix of append and apposition is ap; the node prefix of bicycle and bigamy is bi.

If the initial data includes absent, abnormal, append, apposition, bicycle, bigamy and name, then in addition to the three sub data sets (absent, abnormal; append, apposition; bicycle, bigamy), the initial data also includes one item, name, to be stored directly in the first target node.

It should be noted that, dividing the initial data into a plurality of sub data sets according to the target prefix length of the first target node, and/or screening out the data to be stored, which is stored in the first target node, from the initial data includes:

determining a node prefix according to the target prefix length of the first target node;

and dividing the initial data into a plurality of sub data sets according to the determined node prefix, and/or screening out data to be stored in the first target node from the initial data.

For example, if the target prefix length of the first target node is determined to be 2, the node prefixes of the initial data can be determined to be ab, ap and bi, and the initial data is then divided into the three sub data sets: absent, abnormal; append, apposition; bicycle, bigamy.
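The division in this example can be reproduced with a few lines of Python. The sketch assumes string keys and treats any prefix shared by only one item as data to be stored directly in the node, which matches the handling of "name" in the example; the function name `partition` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def partition(
    keys: Sequence[str], prefix_len: int
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split keys into sub data sets by node prefix, plus directly stored data."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in sorted(keys):
        groups[key[:prefix_len]].append(key)
    # A prefix shared by several keys becomes a sub data set (one child node).
    subsets = {p: g for p, g in groups.items() if len(g) > 1}
    # A key that shares its prefix with no other key is stored in the node itself.
    to_store = [g[0] for g in groups.values() if len(g) == 1]
    return subsets, to_store


keys = ["absent", "abnormal", "append", "apposition", "bicycle", "bigamy", "name"]
subsets, to_store = partition(keys, prefix_len=2)
print(subsets)
print(to_store)
```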

In step S103, an index model of the first target node is determined according to the node prefix and/or the data to be stored, and a pointer and a storage location pointing to a child node corresponding to the node prefix of each child data set are respectively calculated through the index model, and/or a storage location of each data to be stored is respectively calculated.

Here, the index model of the first target node is determined according to the node prefix and/or the data to be stored, and the slope of the linear model may be calculated according to the node prefix and/or the data to be stored, so as to obtain the index model.

The storage position of the node prefix of each sub data set, together with the pointer to the corresponding child node, and/or the storage position of each piece of data to be stored, is calculated through the index model. That is, for each sub data set, the index model determines the slot in the first target node that holds its node prefix and pointer; for each piece of data to be stored, the index model determines the slot in the first target node that holds the data.

In the step S104, the index structure of the first target node is constructed by storing each node prefix and the pointer pointing to the child node, and/or each data to be stored in the corresponding storage location. That is, each node prefix and a pointer to a child node are stored in a slot in the first target node, and each data to be stored is stored in the slot in the first target node, and the first target node completes construction.

In step S105, the child node of the constructed first target node is used as a new first target node, the child data set corresponding to the child node is used as new initial data, and an index structure of the new first target node is constructed until a construction completion condition is met, so as to construct the whole index.

That is, after the first target node is constructed, the data in each sub data set of the first target node needs to be stored in a child node of the first target node, where one sub data set corresponds to one child node.

Each child node is constructed through steps S101 to S104; after a child node is constructed, it is in turn treated as a parent node, and the process repeats until the construction completion condition is satisfied and the whole index is constructed.

The construction completion condition is that all of the data used to build the index has been stored in nodes, or that the initial data of the first target node contains no sub data set.
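The recursion of steps S101 to S105 can be illustrated with a deliberately simplified Python sketch. It is not the construction algorithm of the application: it assumes string keys, uses a fixed prefix step instead of a target prefix length derived from the amplification threshold, and uses a dictionary in place of the learned linear model and slot array. The function name `build` is hypothetical.

```python
from typing import Dict, List, Sequence, Union

Tree = Dict[str, Union[str, "Tree"]]


def build(keys: Sequence[str], depth: int = 0, step: int = 2) -> Tree:
    """Recursively build nodes until no sub data set remains."""
    groups: Dict[str, List[str]] = {}
    for key in sorted(set(keys)):            # duplicates would never separate
        groups.setdefault(key[depth:depth + step], []).append(key)

    node: Tree = {}
    for prefix, group in groups.items():
        if len(group) == 1:
            node[prefix] = group[0]          # data stored directly in the node
        else:
            # The sub data set becomes the initial data of a new child node.
            node[prefix] = build(group, depth + step, step)
    return node


keys = ["absent", "abnormal", "append", "apposition", "bicycle", "bigamy", "name"]
print(build(keys))
```

The recursion stops on its own once a node receives initial data that contains no sub data set, which corresponds to the completion condition stated above.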

In some embodiments, when the child nodes are constructed, a child node of the first target node adopts an ART node when the difference between the node prefix of the child node and the node prefix of the first target node is less than a preset length threshold.

Specifically, if the difference between the SPAN of the child node and the SPAN of its parent node (that is, the number of bits the node actually uses to distinguish keys) is less than or equal to 8, the node may be replaced with an ART node. After the replacement, only the query policy for that node differs slightly from the query policy for an ordinary node, and all other operations are the same; that is, the index structure is compatible with the adjustment designs of other prefix tree index structures.

To ensure that the model directs each input value to its corresponding location without error (accurate mapping), a series of adjustment operations needs to be performed when new data is inserted. An insertion adjustment strategy should not only preserve accurate mapping but also keep the whole tree as flat as possible, reducing the height of the tree and improving overall query performance by optimizing the tree structure. However, adjustment operations that optimize the tree structure usually modify a large number of nodes and therefore have high latency. In addition, deciding when to adjust requires recording meta information during insertion, so as to control the frequency of insertion adjustment and reduce the performance loss caused by frequent adjustment; but the update overhead of modifying multiple pieces of meta information (such as the number of elements inserted into a whole subtree) for a single insertion may cause contention under multithreaded concurrency and hurt scalability. In summary, an insertion adjustment strategy that has low latency (especially low tail latency), is robust and concurrency friendly, and keeps the index structure efficient is crucial for updatable learned indexes.

Based on this, after the whole index is constructed, the index construction method further includes:

when new data are inserted into the constructed index, judging whether a target slot position of a second target node mapped by the new data meets a preset adjusting condition;

if so, adjusting the index structure of the second target node through a pre-configured adjustment strategy according to the data of the second target node including the new data.

Here, the preset adjustment condition is at least one of: the full load rate of the second target node reaches a preset full load rate threshold value; the target slot position of the second target node mapped by the new data is a data slot position; the target slot position of the second target node mapped by the new data is a pointer slot position, and the new data is not matched with the longest common prefix of the child node pointed by the pointer slot position.

Adjusting the index structure of the second target node according to the data of the second target node including the new data includes, as one option:

taking the key-value pairs in the data slots of the second target node, together with the key-value pairs formed by the node prefix and the pointer of each pointer slot, as the initial data of the second target node.

Here, the first target node is any node in the index construction, and is not particularly referred to as a root node. The second target node is a node when the node structure of the index is updated, and may be any node, generally a child node. The first target node and the second target node are only differences in naming and do not represent that there is a parent-child relationship between the first target node and the second target node.

Referring to fig. 5, fig. 5 is a flowchart illustrating a method for adjusting an index structure of the second target node according to an embodiment of the present application.

In the embodiment of the present application, the index model may also be referred to as a prefix tree, and therefore, the position of the data in the index model may also be referred to as positioning in the tree for short.

When new data is inserted, the target slot position of the new data needs to be determined step by step through the index model in each level of node, and three situations can be met:

In the first case, the index model maps the new data to an empty slot. The new data directly occupies the empty slot to complete the insertion; at this point, if a preset adjustment condition is met, the vertical expansion adjustment strategy is triggered.

In the second case, the index model maps the new data to a data slot, which satisfies a preset adjustment condition. A new child node needs to be created to hold the data already in the slot together with the newly inserted data, and the new child node is then placed in the original slot. If the length of the longest common prefix of the new child node is smaller than the SPAN of the parent node and smaller than a threshold T, the adjustment strategy of the parent node is triggered.

In the third case, the index model maps the new data to a pointer slot pointing to a child node, but the newly inserted value does not match the longest common prefix of the child node, which satisfies a preset adjustment condition. The new longest common prefix between the newly inserted value and the longest common prefix recorded by the child node is calculated first, and its length is then compared with the SPAN of the parent node. If the length of the new longest common prefix is smaller than the SPAN of the parent node and smaller than the threshold T, the adjustment strategy of the parent node is triggered. Otherwise, the value can still be inserted into the child node (the insertion procedure is executed recursively), and the longest common prefix recorded by the child node is updated.
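The three insertion cases can be summarized as a small decision function. This is a sketch under assumptions: keys and prefixes are strings, SPAN and the threshold T are measured in characters rather than bits, slots are modelled as tuples, and the returned action names are invented labels. In the second case the label "adjust_parent" stands for creating the child node and then triggering the adjustment strategy of the parent node.

```python
from typing import Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def decide(slot: Tuple, new_key: str, parent_span: int, threshold_t: int) -> str:
    """Classify an insertion by the kind of slot the model mapped it to.

    slot is ("empty",), ("data", stored_key) or ("ptr", child_lcp).
    """
    kind = slot[0]
    if kind == "empty":
        return "occupy"                       # case 1: take the empty slot
    if kind == "data":                        # case 2: collide with stored data
        lcp = common_prefix_len(slot[1], new_key)
        if lcp < parent_span and lcp < threshold_t:
            return "adjust_parent"
        return "create_child"
    child_lcp = slot[1]                       # case 3: pointer slot
    if new_key.startswith(child_lcp):
        return "descend"                      # prefix matches, recurse
    lcp = common_prefix_len(child_lcp, new_key)
    if lcp < parent_span and lcp < threshold_t:
        return "adjust_parent"
    return "descend_and_update_lcp"


print(decide(("data", "absent"), "abnormal", parent_span=2, threshold_t=4))
print(decide(("ptr", "abs"), "apple", parent_span=2, threshold_t=4))
```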

Here, the core idea of the adjustment strategy is to regard the longest common prefix of a child node as a new data key and the pointer to that child node as a new data value, so that the current node can be adjusted while the child nodes are modified as little as possible. When newly inserted data conflicts with already stored data, a conflict adjustment strategy is triggered. The conflict adjustment strategy consists of three steps, which are tried in order so that the model is modified as little as possible:

The first step is to adjust the spatial magnification of the second target node to enlarge its storage space, that is, to first try to accommodate the newly inserted conflict value by adjusting the size of the space. This situation can only occur when data is inserted beyond either end of the node. If the newly inserted conflict value can be accommodated at a given spatial magnification, the adjustment is complete; if not, the second step is tried.

The second step is to try to accommodate the newly inserted conflict value by adjusting the linear model. The calculation flow is similar to the construction algorithm: a new slope is calculated from the newly inserted conflict value and either the longest common prefix recorded in the original slot or the key of the data in the original slot. Based on the linear model given by the new slope, if all entries of the node can still be accommodated at the given amplification rate after their mapped positions change, the adjustment is complete; otherwise, the third step is tried.

The third step is a local reconstruction of the second target node, whose SPAN needs to be adjusted. In this step, the key-value pairs of all data slots of the original node are scanned, and every pointer slot is treated as a key-value pair whose key is the longest common prefix of the child node and whose value is the pointer to the child node. If the conflicting slot is a pointer slot to a child node, the slots of that child node also need to be scanned out as key-value pairs in the same form. Because the slots are scanned sequentially, the resulting key-value pairs are already ordered by key and need no further sorting. A new subtree is then constructed from the new key-value pairs using the index construction algorithm and inserted at the position of the original node.

In addition to the conflict adjustment strategy triggered by a conflict on a newly inserted value, a vertical expansion adjustment strategy is applied according to the full load rate of the node. That is, when most empty slots of a node are occupied and the total number of slots of the node is close to the maximum size allowed by the SPAN of the node (2^SPAN), the node is expanded vertically to increase its SPAN, which merges levels in the vertical direction, reduces the height of the tree and keeps the tree structure flat. The process of vertical expansion is similar to the third step of conflict adjustment: all slots of the node and all slots of its child nodes are scanned to form key-value pairs, and the node is then rebuilt from these key-value pairs according to the index construction algorithm. To reduce tail latency, the trigger condition of this vertical expansion adjustment strategy is typically checked when data is inserted into an empty slot.
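The scan that prepares the input for local reconstruction can be sketched as follows. The slot representation is an assumption made for illustration: a slot is `None`, `("data", key, value)`, or `("ptr", child_lcp, child)`, where `child` is a dictionary holding its own `"slots"` list. The function name `collect_pairs` is hypothetical.

```python
from typing import Any, List, Optional, Tuple

Pair = Tuple[str, Any]


def collect_pairs(slots: List[Optional[tuple]],
                  conflict_pos: Optional[int] = None) -> List[Pair]:
    """Scan slots in order and return the key-value pairs used for rebuilding."""
    pairs: List[Pair] = []
    for pos, slot in enumerate(slots):
        if slot is None:
            continue                           # empty slot contributes nothing
        kind, key, payload = slot
        if kind == "ptr" and pos == conflict_pos:
            # The conflicting child is opened up: its own slots are scanned
            # out in the same form instead of keeping it as one pair.
            pairs.extend(collect_pairs(payload["slots"]))
        else:
            # Data slot: (key, value). Pointer slot: (child LCP, child pointer).
            pairs.append((key, payload))
    return pairs


child = {"slots": [("data", "abnormal", 1), None, ("data", "absent", 2)]}
slots = [("ptr", "ab", child), None, ("data", "name", 3)]
print(collect_pairs(slots, conflict_pos=0))
```

Since slots are visited in position order and the model places keys in ascending order, the returned list is already sorted by key, so it can be handed to the construction algorithm without a separate sorting pass.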

Compared with a traditional index structure, the index constructed in the embodiment of the application can fit the data distribution in the way a learned index does, so it has an overall advantage over traditional indexes in basic query performance and is competitive with other learned indexes. Compared with other learned indexes, it has the overall robustness of a traditional prefix tree index (such as ART): it performs well across different data sets and under write-intensive workloads, its performance remains stable when the data distribution changes, and its tail latency is better.

The index construction algorithm provided by the embodiment of the application has low time complexity, guarantees a flat structure for the whole tree, and satisfies the properties required by the insertion adjustment strategy. To ensure that an updatable learned index keeps its high performance under continuous insertion, the embodiment of the application also provides a novel insertion adjustment strategy, which features low tail latency, high robustness and concurrency friendliness, and maintains an efficient index structure.

In addition, the technology of the invention can support variable-length data types and character type data efficiently.

Referring to fig. 6, in some embodiments, a data reading method is further provided, which is applied to an index constructed by the index construction method according to the embodiment of the present application; the reading method includes the following steps S601 to S604:

S601, starting from a root node of the index, judging whether a node prefix of a current node is matched with data to be read;

S602, if the data are matched, calculating a target slot position of the data to be read through an index model in the current node;

S603, if the target slot position is a data slot position, judging whether the data in the data slot position is matched with the data to be read, and if so, returning the data in the data slot position;

S604, if the target slot position is the pointer slot position, inquiring the data to be read in the child node pointed by the pointer slot position until a reading result is returned.

Specifically, referring to fig. 7, after reading starts, a search function is called to query the value corresponding to the key to be read. Starting from the root node, the prefix corresponding to the current node is compared with the prefix of the key; if they are not equal, NULL is returned directly. If they are equal, the position of the data to be read in the node (namely the target slot position) is predicted from the key and the model in the node, and the position is checked. There are three possibilities. First, if there is no data at the position, NULL is returned. Second, if there is data at the position, its key is compared with the key to be queried; if the keys are equal, the corresponding value is returned, and if not, NULL is returned. Third, if the position stores a pointer to a child node, the child node is queried recursively, and the above process is repeated until a reading result is returned.
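The read flow can be sketched in Python as below. The node representation is an assumption chosen to keep the example short: keys are non-negative integers, the node prefix is checked as `key >> prefix_shift == prefix_value`, and the linear model is stored as an integer `base` and `gap` (slope 1/gap) applied after dropping `span_shift` low bits. `None` plays the role of NULL.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Node:
    prefix_shift: int   # key >> prefix_shift must equal prefix_value
    prefix_value: int
    span_shift: int     # low bits dropped before the model is applied
    base: int           # smallest reduced key handled by this node
    gap: int            # inverse slope of the linear model
    slots: List[Any]    # None | (key, value) | Node

    def predict(self, key: int) -> int:
        return ((key >> self.span_shift) - self.base) // self.gap


def search(root: Node, key: int) -> Optional[Any]:
    node = root
    while True:
        if key >> node.prefix_shift != node.prefix_value:
            return None                     # node prefix does not match
        pos = node.predict(key)
        if not 0 <= pos < len(node.slots):
            return None                     # predicted outside the node
        slot = node.slots[pos]
        if slot is None:
            return None                     # empty slot
        if isinstance(slot, Node):
            node = slot                     # pointer slot: continue in child
            continue
        stored_key, value = slot            # data slot: compare full keys
        return value if stored_key == key else None


child = Node(2, 0, 0, 0, 1, [(0, "a"), (1, "b"), (2, "c"), (3, "d")])
root = Node(8, 0, 2, 0, 16, [child, (64, "e"), (128, "f"), (192, "g")])
print(search(root, 2), search(root, 64), search(root, 65))
```

The loop form is equivalent to the recursive description in the text; it simply replaces the recursive call on the child node with an update of the current node.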

Here, the returning of the read result includes returning NULL, or returning a value corresponding to the data to be read.

Because the index constructed by the index construction method provided by the embodiment of the application maintains the storage sequence of data, the range query can be carried out.

The data reading method provided by the embodiment of the application further comprises the following steps:

determining the storage position of the boundary data in the index;

Specifically, take querying all KV pairs whose key is not less than lower_key as an example. Since the index constructed by the index construction method in the embodiment of the present application maintains the order of the stored data, the position of the object corresponding to lower_key can be determined first. This process is similar to the reading process: after the position of the value in the node is predicted according to lower_key, the position is checked, and there are three possibilities. First, if there is no data at the position, the search moves on to the following positions, and the process is repeated until the first key not smaller than lower_key is found. Second, if there is data at the position, its key is compared with lower_key; if the key is smaller than lower_key, the search moves on to the following positions, and the process is repeated until the first key not smaller than lower_key is found. Third, if the position stores a pointer to a child node, the child node is queried recursively, and the above process is repeated until the first key not smaller than lower_key is found. Then, starting from the position of that key, all subsequent positions are traversed: if a position has no data, it is skipped; if a position has data, the data is added to the result set; if a position stores a pointer to a child node, the same process is carried out on the child node. After all positions have been traversed, the result of the range query is returned.
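A compact Python sketch of this range query is given below. It assumes non-negative integer keys and a monotonic linear model stored as integer `base` and `gap` values; clamping the predicted position to the slot range is an added assumption so that a lower bound outside a node's key range is handled. The names `Node` and `range_from` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Node:
    span_shift: int     # low bits dropped before the model is applied
    base: int
    gap: int            # inverse slope of the linear model
    slots: List[Any]    # None | (key, value) | Node

    def predict(self, key: int) -> int:
        pos = ((key >> self.span_shift) - self.base) // self.gap
        return min(max(pos, 0), len(self.slots))   # clamp to the slot range


def range_from(node: Node, lower_key: int) -> List[Tuple[int, Any]]:
    """Return all (key, value) pairs with key >= lower_key, in key order."""
    result: List[Tuple[int, Any]] = []
    # The model is monotonic, so no qualifying key lies before this position.
    for slot in node.slots[node.predict(lower_key):]:
        if slot is None:
            continue                                # skip empty positions
        if isinstance(slot, Node):
            result.extend(range_from(slot, lower_key))
        elif slot[0] >= lower_key:
            result.append(slot)
    return result


child = Node(0, 0, 1, [(0, "a"), (1, "b"), (2, "c"), (3, "d")])
root = Node(2, 0, 16, [child, (64, "e"), (128, "f"), (192, "g")])
print([k for k, _ in range_from(root, 2)])
```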

Here, it should be noted that, when new data is inserted, the insertion process is similar to the data reading flow.

When data is deleted from the index constructed by the index construction method provided by the embodiment of the application, the deletion process is similar to the reading process: the position of the corresponding object (namely the target slot position) is found according to the key, and the deletion is then performed. After the position of the value in the node is predicted according to the key, the position is checked, and there are three possibilities. First, if there is no data at the position, false is returned. Second, if there is data at the position, its key is compared with the key of the object to be deleted; if the keys are equal, the data is deleted and it is judged whether the current node is empty; if the node is empty, the current node and the related metadata are deleted; true is then returned. Third, if the position stores a pointer to a child node, the child node is queried recursively, and the above process is repeated until a result is returned.

Here, although the delete operation affects the actual longest common prefix of the node, the longest common prefix is not recalculated for performance considerations, since this has no effect on correctness, and the original longest common prefix is only slightly shorter than the actual new longest common prefix, but is still a common prefix.

When data is updated in the index constructed by the index construction method provided by the embodiment of the application, the updating process is similar to the reading process: the position of the corresponding object (namely the target slot position) is found according to the key, and the update is then performed. After the position of the value in the node is predicted according to the key, the position is checked, and there are three possibilities. First, if there is no data at the position, false is returned. Second, if there is data at the position, its key is compared with the key of the object to be updated; if the keys are equal, the value of the key is modified and true is returned. Third, if the position stores a pointer to a child node, the child node is queried recursively, and the above process is repeated until a result is returned.
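The deletion and update flows can be sketched together, since both locate the target slot in the same way as a read. The sketch assumes non-negative integer keys and a simplified node whose linear model is stored as integer `base` and `gap` values. Returning false when the stored key differs from the requested key is an assumption, as the text does not state the result for that branch. Consistent with the note above, the longest common prefix is not recalculated on deletion.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Node:
    span_shift: int     # low bits dropped before the model is applied
    base: int
    gap: int            # inverse slope of the linear model
    slots: List[Any]    # None | (key, value) | Node

    def predict(self, key: int) -> int:
        return ((key >> self.span_shift) - self.base) // self.gap


def update(node: Node, key: int, value: Any) -> bool:
    pos = node.predict(key)
    if not 0 <= pos < len(node.slots) or node.slots[pos] is None:
        return False                        # no data at the position
    slot = node.slots[pos]
    if isinstance(slot, Node):
        return update(slot, key, value)     # pointer slot: recurse
    if slot[0] != key:
        return False                        # assumed: different key stored
    node.slots[pos] = (key, value)
    return True


def delete(node: Node, key: int) -> bool:
    pos = node.predict(key)
    if not 0 <= pos < len(node.slots) or node.slots[pos] is None:
        return False                        # no data at the position
    slot = node.slots[pos]
    if isinstance(slot, Node):
        deleted = delete(slot, key)
        if deleted and all(s is None for s in slot.slots):
            node.slots[pos] = None          # drop the now-empty child node
        return deleted
    if slot[0] != key:
        return False                        # assumed: different key stored
    node.slots[pos] = None
    return True


child = Node(0, 0, 1, [(0, "a"), (1, "b")])
root = Node(2, 0, 16, [child, (64, "e")])
print(update(root, 1, "B"), delete(root, 0), delete(root, 1), root.slots)
```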

Based on the same inventive concept, an index construction device corresponding to the index construction method is also provided in the embodiments of the present application. Since the principle by which the device in the embodiments of the present application solves the problem is similar to that of the index construction method in the embodiments of the present application, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.

Referring to fig. 8, which is a schematic structural diagram of an index construction device according to an embodiment of the present application, the index construction device specifically includes:

a determining module 801, configured to input initial data into a first target node of a prefix tree, and determine a target prefix length of the first target node according to the initial data and a preset spatial magnification threshold of the first target node;

a dividing module 802, configured to divide the initial data into multiple sub data sets according to the target prefix length of the first target node, and/or screen out, from the initial data, data to be stored in the first target node; wherein the initial data in each sub data set share the same node prefix;

a calculating module 803, configured to determine an index model of the first target node according to the node prefix and/or the data to be stored, and to calculate, through the index model, the storage location of the node prefix of each sub data set together with the pointer pointing to the corresponding child node, and/or the storage location of each data to be stored;

a first constructing module 804, configured to store each node prefix and a pointer pointing to a child node, and/or each data to be stored in a corresponding storage location, and construct an index structure of the first target node;

a second constructing module 805, configured to use a child node of the constructed first target node as a new first target node, use the sub data set corresponding to the child node as new initial data, and construct an index structure of the new first target node, until a construction completion condition is met, so that the entire index is constructed.
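The cooperation of modules 801 to 805 can be pictured with the following simplified Python sketch. It is not the implementation of the embodiment: keys are assumed to be distinct unsigned integers of a fixed bit width, the "index model" is replaced by a trivial stand-in that uses the leading bits of a key as its slot number, space amplification is assumed to be the number of slots divided by the number of distinct node prefixes, and the names `build_node`, `contains`, `max_amplification` and `max_prefix_bits` are hypothetical.

```python
def build_node(keys, bits, max_amplification=4.0, max_prefix_bits=8):
    """Build one node over distinct unsigned integer keys that are `bits` wide."""
    # Determining module (801): choose the longest prefix length whose slot
    # array stays within the assumed space-amplification budget.
    prefix_bits = 1
    for p in range(1, min(bits, max_prefix_bits) + 1):
        distinct = len({k >> (bits - p) for k in keys})
        if (1 << p) / distinct <= max_amplification:
            prefix_bits = p
    rest = bits - prefix_bits

    # Dividing module (802): group the keys by their node prefix.
    groups = {}
    for k in keys:
        groups.setdefault(k >> rest, []).append(k)

    # Calculating module (803) and first constructing module (804): the
    # stand-in model maps a node prefix directly to its slot number.
    slots = [None] * (1 << prefix_bits)
    mask = (1 << rest) - 1
    for prefix, group in groups.items():
        if len(group) == 1 or rest == 0:
            slots[prefix] = ("data", group[0])  # data stored in this node
        else:
            # Second constructing module (805): the sub data set becomes the
            # initial data of a child node, built the same way.
            suffixes = [k & mask for k in group]
            child = build_node(suffixes, rest, max_amplification, max_prefix_bits)
            slots[prefix] = ("child", child)
    return {"prefix_bits": prefix_bits, "rest_bits": rest, "slots": slots}


def contains(node, key):
    """Look up `key` by following the same slot computation used when building."""
    while True:
        rest = node["rest_bits"]
        slot = node["slots"][key >> rest]
        if slot is None:
            return False
        kind, payload = slot
        if kind == "data":
            return payload == key
        node, key = payload, key & ((1 << rest) - 1)
```

Under these assumptions, a larger `max_amplification` permits a longer prefix per node and therefore a flatter tree, at the cost of more empty slots.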

Based on the same inventive concept, the embodiment of the present application further provides an electronic device corresponding to the index construction method. Since the principle by which the electronic device solves the problem is similar to that of the index construction method in the embodiment of the present application, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not described again.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device 900 includes: a processor 901, a memory 902 and a bus, wherein the memory 902 stores machine-readable instructions executable by the processor 901, the processor 901 and the memory 902 communicate via the bus when the electronic device 900 is running, and the machine-readable instructions, when executed by the processor 901, perform the steps of the index construction method.

Based on the same inventive concept, a computer-readable storage medium corresponding to the index construction method is also provided in the embodiments of the present application, and since the principle of solving the problem of the computer-readable storage medium in the embodiments of the present application is similar to that of the index construction method in the embodiments of the present application, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repeated details are not repeated.

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the index building method.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system and the apparatus described above may refer to the corresponding process in the method embodiment, and is not described in detail in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a platform server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. An index construction method, comprising:

determining an index model of a first target node according to the node prefix and/or the data to be stored, and respectively calculating a pointer and a storage position of a sub-node corresponding to the node prefix of each sub-data set and/or respectively calculating the storage position of each data to be stored through the index model;

2. The index building method according to claim 1, wherein determining the target prefix length of the first target node according to the initial data and a preset spatial magnification threshold of the first target node comprises:

3. The index building method according to claim 1, wherein determining the index model of the first target node according to the node prefix and/or the data to be stored comprises:

4. The index building method of claim 1, wherein after building the entire index, the method further comprises:

5. The index building method of claim 4, wherein adjusting an index structure of a second target node including the new data according to data of the second target node comprises:

6. The index building method according to claim 4, wherein the preset adjustment condition is at least one of: the full load rate of the second target node reaches a preset full load rate threshold value; the target slot position of the second target node of the new data mapping is a data slot position; the target slot position of the second target node of the new data mapping is a pointer slot position, and the new data is not matched with the longest common prefix of the child node pointed by the pointer slot position.

7. The index construction method according to claim 1, wherein when the difference between the node prefixes of the child node of the first target node and the first target node is smaller than a preset length threshold, the child node of the first target node adopts an ART node.

8. A data reading method applied to an index constructed by the index construction method according to any one of claims 1 to 7; the reading method comprises the following steps:

if so, calculating a target slot position of the data to be read through an index model in the current node;

9. A method for reading data according to claim 8, further comprising:

determining the storage position of the boundary data in the index;

10. An index building apparatus, the building apparatus comprising:

the determining module is used for inputting initial data into a first target node of a prefix tree and determining a target prefix length of the first target node according to the initial data and a preset spatial magnification threshold of the first target node;

the first construction module is used for storing each node prefix and a pointer pointing to a child node and/or each data to be stored in a corresponding storage position, and constructing an index structure of the first target node;

and the second construction module is used for taking the child nodes of the constructed first target node as new first target nodes, taking the child node corresponding subdata sets as new initial data, constructing an index structure of the new first target nodes until the construction completion condition is met, and constructing the whole index.