CN117312239A

CN117312239A - Method for storing data index based on combination mode

Info

Publication number: CN117312239A
Application number: CN202311288969.9A
Authority: CN
Inventors: 高彦军; 庞景秋; 齐井春; 李绍俊; 李波; 李彬; 张金辉
Original assignee: Changchun Jiacheng Information Technology Co ltd
Current assignee: Changchun Jiacheng Information Technology Co ltd
Priority date: 2023-10-08
Filing date: 2023-10-08
Publication date: 2023-12-29

Abstract

The invention discloses a method for storing data indexes based on a combination mode, comprising the following steps: step S1: acquiring the data stored in a database and the number of data records; step S2: evaluating the number of data records; step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in a HashMap; step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file. By quickly determining the storage mode, the invention reduces the amount of data scanned and the access time, and improves query efficiency; storing index data with a combination of SlimTrie and HashMap reduces storage space, provides a better user experience, and saves cost.

Description

Method for storing data index based on combination mode

Technical Field

The present invention relates to a method for storing data indexes, and more particularly, to a method for storing data indexes based on a combination manner.

Background

In data processing, quickly locating data is a key technology. An index is a common means of locating data: the position of a given piece of data can be found quickly through the index. Indexing techniques are widely used in fields such as databases, geographic information systems, and logistics management. With the growing demand for real-time data, files also tend to be fragmented; in short-video and live-streaming services, for example, a video is often only hundreds of KB, or even tens of KB, in size. In the prior art, a tree index, for example, uses the intermediate nodes and branches of the tree to divide the full set of keys into smaller parts; the intermediate nodes store only keys and the data is stored entirely in the leaf nodes, so the complete keys must be held in memory. Existing methods for storing data indexes have the following defects:

1. If 100 TB of data consists entirely of 10 KB files, the index data required just to manage those files can reach 10,000 GB of memory. When huge amounts of metadata must be managed, the index data required to manage the files occupies a huge amount of memory;

2. Because the complete key must be stored in memory, traversal is time-consuming and search performance is poor.
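
By way of illustration only, the figures in the first defect can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes); the per-file index size it prints is derived from the quoted totals and is not stated in the text.

```python
# Back-of-the-envelope check of the figures quoted above (decimal units assumed).
TB = 10**12
GB = 10**9
KB = 10**3

total_data_bytes = 100 * TB      # 100 TB of stored data (from the text)
file_size_bytes = 10 * KB        # every file is 10 KB (from the text)
index_total_bytes = 10_000 * GB  # 10,000 GB of index data (from the text)

# Number of files that must be indexed.
file_count = total_data_bytes // file_size_bytes
# Index bytes per file implied by the quoted totals (derived, not stated).
index_bytes_per_file = index_total_bytes // file_count

print(file_count)            # 10000000000
print(index_bytes_per_file)  # 1000
```

In other words, the quoted totals correspond to ten billion files and roughly 1 KB of index data per file.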

Disclosure of Invention

To overcome the above deficiencies of the prior art, the invention provides a method for storing data indexes based on a combination mode.

To solve the above technical problems, the invention adopts the following technical solution: a method for storing data indexes based on a combination mode, comprising:

step S1: acquiring the data stored in a database and the number of data records;

step S2: evaluating the number of data records;

step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in the form of a HashMap;

step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file.
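
The decision made in steps S2 to S4 can be sketched as follows. This is a minimal illustration, not the claimed implementation; the function name and the returned labels are hypothetical, and only the threshold of 20 million records is taken from the text.

```python
HASHMAP_THRESHOLD = 20_000_000  # 20 million records, the threshold of steps S3/S4


def choose_index_backend(record_count: int) -> str:
    """Step S2: pick the index structure from the record count.

    The returned labels are placeholders chosen for this sketch.
    """
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count <= HASHMAP_THRESHOLD:
        return "hashmap"   # step S3: in-memory HashMap
    return "slimtrie"      # step S4: SlimTrie plus periodic index files
```

For example, `choose_index_backend(20_000_000)` selects the HashMap branch, while one more record selects the SlimTrie branch.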

Further, in step S3 of the method for storing data indexes based on a combination mode, the complete key is stored in the HashMap in memory, the key to be stored is mapped to a new hash value by a hash function, and the index is built.
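
A toy hash index along these lines is sketched below for illustration. The class name, the bucket count, and the use of `zlib.crc32` as the hash function are assumptions made for the sketch; the text does not specify a particular hash function or collision strategy.

```python
import zlib
from typing import List, Optional, Tuple


class HashIndex:
    """Toy chained hash table mapping a complete key to a data position."""

    def __init__(self, bucket_count: int = 1024) -> None:
        self.buckets: List[List[Tuple[str, int]]] = [[] for _ in range(bucket_count)]

    def _bucket(self, key: str) -> List[Tuple[str, int]]:
        # The hash function maps the key to a hash value; crc32 is only a stand-in.
        hash_value = zlib.crc32(key.encode("utf-8"))
        return self.buckets[hash_value % len(self.buckets)]

    def put(self, key: str, position: int) -> None:
        bucket = self._bucket(key)
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:  # the complete key is kept so collisions can be resolved
                bucket[i] = (key, position)
                return
        bucket.append((key, position))

    def get(self, key: str) -> Optional[int]:
        for stored_key, position in self._bucket(key):
            if stored_key == key:
                return position
        return None
```

Because the complete key is stored alongside each position, memory use grows with key length, which is the cost the SlimTrie branch of step S4 is intended to avoid for larger data sets.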

Further, in step S4 of the method for storing data indexes based on a combination mode of the present invention:

the SlimTrie is generated on the basis of a standard Trie: invalid nodes in the standard Trie are pruned, and the memory overhead of the Trie is compressed to obtain the SlimTrie;

pruning the invalid nodes in the standard Trie means pruning the single-branch nodes in the standard Trie;

compressing the memory overhead of the Trie to obtain the SlimTrie means storing the data structure of the whole Trie in a compact array, thereby compressing the Trie;

the SlimTrie uses one index entry to identify a plurality of adjacent small files, thereby optimizing the handling of small files;

the SlimTrie realizes distributed storage through key-value storage, where the key in each key-value pair is an auto-increment id and the value stores an id generated from the zxy tile coordinates; the id is generated from the zxy tile coordinates by the following bit operation:

z << 58 | x << 29 | y;

where z represents the zoom level, x represents the abscissa and y represents the ordinate.
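
The bit operation above can be illustrated with the following sketch. The field widths (29 bits each for x and y, and 6 bits for z so that the id fits in an unsigned 64-bit integer) are inferred from the shift amounts and are assumptions; the decoding helper is added only for illustration and is not described in the text.

```python
from typing import Tuple

COORD_BITS = 29                     # width of the x and y fields implied by the shifts
COORD_MASK = (1 << COORD_BITS) - 1
ZOOM_SHIFT = 58                     # z occupies the bits from position 58 upward


def encode_tile_id(z: int, x: int, y: int) -> int:
    """Pack (z, x, y) as z << 58 | x << 29 | y."""
    if not (0 <= x <= COORD_MASK and 0 <= y <= COORD_MASK):
        raise ValueError("x and y must each fit in 29 bits")
    if not 0 <= z < 64:
        raise ValueError("z must fit in 6 bits to keep the id within 64 bits")
    return (z << ZOOM_SHIFT) | (x << COORD_BITS) | y


def decode_tile_id(tile_id: int) -> Tuple[int, int, int]:
    """Inverse of encode_tile_id (a helper added for illustration)."""
    z = tile_id >> ZOOM_SHIFT
    x = (tile_id >> COORD_BITS) & COORD_MASK
    y = tile_id & COORD_MASK
    return z, x, y
```

Since the three fields occupy disjoint bit ranges, the id can be decoded back to its coordinates without loss, and ids of the same zoom level and column sort next to each other.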

Further, in step S4 of the method for storing data indexes based on a combination mode, storing the SlimTrie index data in the index file at the specified time interval includes:

the time interval is set in practice to 5 to 10 minutes.

Further, in step S4 of the method for storing data indexes based on a combination mode, limiting the size of the index file includes:

the size of the index file is limited to 128 MB;

if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.
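
The periodic write and the size limit can be sketched together as follows. This is an illustrative sketch under stated assumptions: the class and parameter names are hypothetical, index files are modelled as in-memory byte arrays, the limit is read as 128 * 1024 * 1024 bytes, and the writer rolls over just before the limit would be exceeded.

```python
import time
from typing import Callable, List

MAX_INDEX_FILE_BYTES = 128 * 1024 * 1024  # size limit, read here as 128 * 2**20 bytes
FLUSH_INTERVAL_SECONDS = 5 * 60           # lower end of the 5 to 10 minute range


class IndexFileWriter:
    """Buffers serialized index entries and rolls to a new file at the size limit."""

    def __init__(self, max_bytes: int = MAX_INDEX_FILE_BYTES,
                 interval: float = FLUSH_INTERVAL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_bytes = max_bytes
        self.interval = interval
        self.clock = clock
        self.files: List[bytearray] = [bytearray()]  # in-memory stand-ins for index files
        self.pending: List[bytes] = []
        self.last_flush = clock()

    def add(self, entry: bytes) -> None:
        """Queue an entry; write everything out once the interval has elapsed."""
        self.pending.append(entry)
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self) -> None:
        for entry in self.pending:
            # Start the next index file rather than exceed the size limit.
            if self.files[-1] and len(self.files[-1]) + len(entry) > self.max_bytes:
                self.files.append(bytearray())
            self.files[-1] += entry
        self.pending.clear()
        self.last_flush = self.clock()
```

The clock is injected so that the interval can be exercised in tests without waiting; a real writer would replace the byte arrays with files on disk.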

The invention discloses a method for storing data indexes based on a combination mode, in which data index storage is realized by combining HashMap and SlimTrie. The method quickly determines the storage mode from the volume of data to be indexed, and has the following beneficial effects:

1. By quickly determining the storage mode, the amount of data scanned and the access time are reduced, and because the SlimTrie supports distributed storage, query efficiency is improved;

2. By storing the data index with a combination of SlimTrie and HashMap, the memory footprint is reduced, a better user experience is provided to the client, and cost is saved.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention in which the storage mode is determined from the data volume.

FIG. 2 is a diagram of distributed SlimTrie storage according to the present invention.

FIG. 3 is a performance comparison chart of the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and the detailed description.

The technical principle of the invention is as follows:

FIG. 1 shows the flow for determining the storage mode from the data volume in the method for storing data indexes based on a combination mode, in which the data index is stored using a combination of HashMap and SlimTrie. The flow includes:

step S1: acquiring the data stored in a database and the number of data records;

step S2: evaluating the number of data records;

step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in the form of a HashMap: the key to be stored is first mapped to a new hash value by a hash function, the index is then built, and data is located with a single lookup through the hash index; query efficiency is optimal when the number of records stored in the HashMap is within 20 million;

step S4: if the number of data records is greater than 20 million, storing the index data in a SlimTrie in memory; writing the SlimTrie index data to an index file at a specified time interval, where the interval is set in practice to 5 to 10 minutes and experiments show that performance is optimal at 5 minutes; and limiting the size of the index file to a maximum of 128 MB; if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.

Fig. 2 is a distributed SlimTrie storage diagram of the method for storing data indexes based on a combination mode according to a first exemplary embodiment of the present invention, which is a preferred embodiment of the method shown in fig. 1. In this embodiment, the index data of step S4 is stored in a SlimTrie in memory, as shown in fig. 2, and the method of this embodiment includes:

the SlimTrie supports distributed storage by adopting a key-value structure, where the key is an auto-increment id and the value stores an id generated from the zxy tile coordinates; the id is generated from the zxy tile coordinates by the following bit operation:

z << 58 | x << 29 | y;

where z represents the zoom level, x represents the abscissa and y represents the ordinate. A key-value database is a database that stores data as key-value pairs, each key corresponding to a unique value. For example, the index data is stored on 3 nodes, each node storing 20 million index entries; if one node stops running, the other two nodes can still locate the index position quickly. This achieves high availability of index queries and ensures normal operation of the service.
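
The three-node example can be sketched as follows, purely for illustration. The sketch assumes that each node holds a replica of the same index entries, which is one reading of the example above; the class and function names are hypothetical and no particular key-value database is implied.

```python
from typing import Dict, List, Optional


class IndexNode:
    """One key-value node; `store` maps an auto-increment id to a tile id."""

    def __init__(self, name: str, store: Dict[int, int]) -> None:
        self.name = name
        self.store = dict(store)
        self.alive = True

    def get(self, key: int) -> Optional[int]:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.store.get(key)


def lookup(nodes: List[IndexNode], key: int) -> Optional[int]:
    """Try each replica in turn and return the first answer from a live node."""
    for node in nodes:
        try:
            return node.get(key)
        except ConnectionError:
            continue  # this node has stopped; fall through to the next replica
    raise RuntimeError("no index node is available")
```

With three replicas, a lookup still succeeds after any single node stops, which is the availability property described in the example.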

FIG. 3 is a performance comparison chart of the method for storing data indexes based on a combination mode according to a second embodiment of the present invention, wherein:

suppose there are n keys, each of length k. "O()" denotes complexity, that is, how fast the running time grows as the input size increases.

SlimTrie is built on the basis of a standard Trie. The SlimTrie generation process is divided into three parts. First, an index of all keys is constructed with a Trie, and the Trie is then pruned, reducing the size of the index data from O(n × k) to O(n). Second, a tree structure is normally implemented in memory with pointers, but a pointer occupies 8 bytes on a 64-bit system, so the memory overhead is at least 8 × n bytes; the data structure of the whole Trie (internal nodes, user data, and step information) is therefore stored in a compact array to reduce memory overhead. Third, small files are optimized: a plurality of adjacent small files share one index entry, balancing IO overhead against space overhead.
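
The pruning step can be illustrated with a pointer-based sketch. It covers only the removal of single-branch nodes and the recording of step information; the compact-array encoding and the small-file optimization are not shown. All names are hypothetical, and the `step` field is a simplified stand-in for the step information mentioned above.

```python
from typing import Dict, Iterable, Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.value: Optional[int] = None  # set on nodes that end a key
        self.step: int = 1                # characters consumed when entering this node


def build_trie(items: Iterable[Tuple[str, int]]) -> Node:
    root = Node()
    for key, value in items:
        node = root
        for ch in key:
            node = node.children.setdefault(ch, Node())
        node.value = value
    return root


def prune(node: Node) -> Node:
    """Remove single-branch nodes, keeping only how many characters were skipped."""
    for ch, child in list(node.children.items()):
        skipped = 0
        # Collapse a chain of nodes that have exactly one child and end no key.
        while len(child.children) == 1 and child.value is None:
            child = next(iter(child.children.values()))
            skipped += 1
        child.step = 1 + skipped
        node.children[ch] = prune(child)
    return node


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())


def lookup(root: Node, key: str) -> Optional[int]:
    """Follow branching characters only; skipped characters are not compared."""
    node, i = root, 0
    while i < len(key):
        child = node.children.get(key[i])
        if child is None:
            return None
        i += child.step
        node = child
    return node.value if i == len(key) else None
```

For the keys "abcd", "abxy" and "mnop", pruning reduces the tree from 11 nodes to 5. Because skipped characters are not compared in this sketch, a key that was never inserted (for example "mzzz") can still match, so a caller would need to confirm the result against the stored data.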

To index n keys, at least log2(n) bits are needed to distinguish n different keys. From this the following conclusion can be drawn: the memory overhead of the index of each file is independent of the length of the key, so the size of n is limited by splitting the entire set of keys into multiple subsets of a specified size. Each index entry occupies 10 bytes in total, of which the key information occupies 6 bytes and the value occupies 4 bytes. Since the space overhead of SlimTrie depends only on the number n of keys and not on the length of the keys, the space overhead of SlimTrie in fig. 3 is O(n) and the query time is O(log(n)). Here log() is the base-2 logarithm, which means that each time the input size doubles, the number of operations increases by only one. In terms of complexity, SlimTrie has the lowest space overhead and query time, and outperforms the other data structures in both space and performance.
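
As a worked illustration of these figures, the sketch below computes the minimum number of distinguishing bits and the total index size for a given n. The value n = 20 million is chosen here only as an example (it is the threshold used in steps S3 and S4), and decimal megabytes are assumed.

```python
import math

ENTRY_BYTES = 10   # 6 bytes of key information + 4 bytes of value (from the text)
n = 20_000_000     # example number of keys (an assumption for this sketch)

# Bits needed to tell n different keys apart.
min_bits_per_key = math.ceil(math.log2(n))
# O(n) space: the total depends on n only, not on key length.
index_bytes = n * ENTRY_BYTES

print(min_bits_per_key)     # 25
print(index_bytes / 10**6)  # 200.0
```

Doubling n to 40 million raises the bit count by one, from 25 to 26, which matches the statement that doubling the input adds a single operation.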

The above embodiments do not limit the present invention, and the present invention is not restricted to the above examples; its scope is defined by the following claims.

Claims

1. A method for storing a data index based on a combination, comprising the steps of:

step S1: acquiring the data stored in a database and the number of data records;

step S2: evaluating the number of data records;

step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in the form of a HashMap;

step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file.

2. The method for storing data indexes based on a combination mode of claim 1, wherein: in step S3, the complete key is stored in the HashMap in memory, the key to be stored is mapped to a new hash value by a hash function, and the index is built.

3. The method for storing data indexes based on a combination mode of claim 1, wherein: the SlimTrie in step S4 is generated on the basis of a standard Trie, and after the invalid nodes in the standard Trie are pruned, the memory overhead of the Trie is compressed to obtain the SlimTrie.

4. The method for storing data indexes based on a combination mode of claim 3, wherein: pruning the invalid nodes in the standard Trie means pruning the single-branch nodes in the standard Trie.

5. The method for storing data indexes based on a combination mode of claim 3, wherein: compressing the memory overhead of the Trie to obtain the SlimTrie means storing the data structure of the whole Trie in a compact array, thereby compressing the Trie.

6. The method for storing data indexes based on a combination mode of claim 3, wherein: the SlimTrie uses one index entry to identify a plurality of adjacent small files, thereby optimizing the handling of small files.

7. The method for storing data indexes based on a combination mode of claim 1, wherein: the SlimTrie in step S4 realizes distributed storage through key-value storage.

8. The method for storing data indexes based on a combination mode of claim 7, wherein: the key in each key-value pair is an auto-increment id, the value stores an id generated from the zxy tile coordinates, and the id is generated from the zxy tile coordinates by the following bit operation:

z << 58 | x << 29 | y;

9. The method for storing data indexes based on a combination mode of claim 1, wherein: the time interval in step S4 is set in practice to 5 to 10 minutes.

10. The method for storing data indexes based on a combination mode of claim 1, wherein: in step S4, the size of the index file is limited to a maximum of 128 MB; if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.