CN117312239A - Method for storing data index based on combination mode - Google Patents

Publication number
CN117312239A
Authority
CN
China
Prior art keywords
data
index
storing
combination
indexes based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311288969.9A
Other languages
Chinese (zh)
Inventor
高彦军
庞景秋
齐井春
李绍俊
李波
李彬
张金辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun Jiacheng Information Technology Co ltd
Original Assignee
Changchun Jiacheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-10-08
Filing date
2023-10-08
Publication date
2023-12-29
Application filed by Changchun Jiacheng Information Technology Co ltd
Priority to CN202311288969.9A
Publication of CN117312239A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/11: File system administration, e.g. details of archiving or snapshots
    • G06F16/122: File system administration, e.g. details of archiving or snapshots using management policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1744: Redundancy elimination performed by the file system using compression, e.g. sparse files
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for storing data indexes based on a combination mode, comprising the following steps: step S1: acquiring the data stored in a database and the number of data records; step S2: evaluating the number of data records; step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in a HashMap; step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file. By quickly determining the storage mode, the invention reduces the amount of data scanned and the access time, improving query efficiency; combining SlimTrie and HashMap to store index data reduces storage space, provides a better user experience, and saves cost.

Description

Method for storing data index based on combination mode
Technical Field
The present invention relates to a method for storing data indexes, and more particularly, to a method for storing data indexes based on a combination manner.
Background
In data processing, quickly locating data is a key technology. An index is a common way of locating data: through an index, the position of a given piece of data can be found quickly. Indexing techniques are widely used in fields such as databases, geographic information systems, and logistics management. With the growing demand for real-time data, files also tend to become fragmented; in short-video and live-streaming services, for example, a video file is often only hundreds of KB, or even tens of KB, in size. In the prior art, a tree index, for example, uses the intermediate nodes and branches of the tree to divide the full set of keys into smaller parts; the intermediate nodes store only keys, and the data is stored entirely in leaf nodes, so the complete keys must be held in memory. The existing methods for storing data indexes have the following defects:
1. if 100 TB of data is stored entirely as 10 KB files, the index data needed just to manage those files can reach 10000 GB of memory. When a huge amount of metadata must be managed, the index data for these files occupies a huge amount of memory;
2. because the complete keys must be stored in memory, traversal is time-consuming and lookup performance is poor.
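The figures in item 1 can be checked with simple arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes) and that every file is exactly 10 KB; the file count and the per-file index size are derived here and are not stated in the source.

```python
# Back-of-the-envelope check of the background example (decimal units assumed).
KB, GB, TB = 10**3, 10**9, 10**12

total_data = 100 * TB        # total stored data
file_size = 10 * KB          # every file is assumed to be 10 KB
index_memory = 10_000 * GB   # index memory quoted in the text

file_count = total_data // file_size
index_bytes_per_file = index_memory // file_count

print(file_count)            # number of files that need an index entry
print(index_bytes_per_file)  # implied index bytes per file
```

Under these assumptions the quoted memory figure corresponds to roughly 1 KB of index data per file, which is the overhead the method aims to reduce.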
Disclosure of Invention
To overcome the above defects, the invention provides a method for storing data indexes based on a combination mode.
To solve the technical problems, the invention adopts the following technical scheme: a method for storing a data index based on a combination mode, comprising:
step S1: acquiring the data stored in a database and the number of data records;
step S2: evaluating the number of data records;
step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in a HashMap;
step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file.
Further, in step S3 of the method, the complete key is stored in the in-memory HashMap; a hash function maps each stored key to a hash value, and the index is built on that value.
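A minimal sketch of the in-memory hash index of step S3, using Python's built-in dict as a stand-in for a HashMap. The `HashIndex` class, the `Location` record (file name plus byte offset) and the sample key are hypothetical illustrations, not details taken from the source.

```python
from typing import Dict, Optional, Tuple

# Hypothetical location record: (file name, byte offset). Not from the source.
Location = Tuple[str, int]


class HashIndex:
    """In-memory index keyed by the complete key, as step S3 describes."""

    def __init__(self) -> None:
        # A dict hashes each key internally and keeps the full key next to
        # its value, which mirrors the behaviour of a HashMap.
        self._table: Dict[str, Location] = {}

    def put(self, key: str, location: Location) -> None:
        self._table[key] = location

    def get(self, key: str) -> Optional[Location]:
        # A single hash lookup locates the data, or returns None if absent.
        return self._table.get(key)

    def __len__(self) -> int:
        return len(self._table)


index = HashIndex()
index.put("video/0001.ts", ("data-000.bin", 4096))
```

Because the complete key is kept in the table, memory use grows with key length, which is why this structure is reserved for the smaller record counts.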
Further, step S4 of the method includes the following:
the SlimTrie is generated from a standard Trie: invalid nodes in the standard Trie are trimmed, and the memory overhead of the Trie is then compressed to obtain the SlimTrie;
trimming invalid nodes in the standard Trie means removing its single-branch nodes;
to compress the memory overhead of the Trie, the data structure of the whole Trie is stored in a compressed (compacted) array, which achieves the compression of the Trie;
the SlimTrie uses a single index entry to identify several adjacent small files, thereby optimizing for small files;
the SlimTrie achieves distributed storage through key-value storage, in which the key is an auto-increment id and the value stores an id generated from zxy tile coordinates by the following bit operation:
z<<58|x<<29|y;
where z represents the zoom level, x represents the abscissa and y represents the ordinate.
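The bit operation above can be sketched as follows. The range checks and the inverse function `tile_coords` are assumptions added for illustration: the source gives only the packing rule, and the limits here assume the id is an unsigned 64-bit integer with 29 bits each for x and y.

```python
from typing import Tuple

X_SHIFT = 29
Z_SHIFT = 58
COORD_MASK = (1 << 29) - 1  # x and y are assumed to fit in 29 bits each


def tile_id(z: int, x: int, y: int) -> int:
    """Pack zxy tile coordinates as z<<58 | x<<29 | y."""
    if not (0 <= x <= COORD_MASK and 0 <= y <= COORD_MASK):
        raise ValueError("x and y must fit in 29 bits")
    if not 0 <= z < 64:
        raise ValueError("z must fit in 6 bits for an unsigned 64-bit id")
    return (z << Z_SHIFT) | (x << X_SHIFT) | y


def tile_coords(tile: int) -> Tuple[int, int, int]:
    """Inverse of tile_id (a hypothetical helper, not described in the source)."""
    return tile >> Z_SHIFT, (tile >> X_SHIFT) & COORD_MASK, tile & COORD_MASK
```

Keeping the three fields in disjoint bit ranges means the id can be unpacked without any lookup table, and ids at the same zoom level sort by x and then by y.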
Further, in step S4 of the method, storing the SlimTrie index data in the index file at the specified time interval includes:
in practice, the time interval is set to 5 to 10 minutes.
Further, in step S4 of the method, limiting the size of the index file includes:
the size of an index file is limited to 128 MB;
if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.
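One way to read the size limit is as a rollover rule: when a periodic write would push the current index file past the limit, a new file is started. The sketch below tracks file sizes in memory only; the `IndexFileSet` class, reading "128M" as 128 MiB, and rolling over before the write rather than after it are all assumptions.

```python
from typing import List

MAX_INDEX_FILE_BYTES = 128 * 1024 * 1024  # "128M" read as 128 MiB (assumption)


class IndexFileSet:
    """Tracks index file sizes and rolls over to the next file when full."""

    def __init__(self, limit: int = MAX_INDEX_FILE_BYTES) -> None:
        self.limit = limit
        self.sizes: List[int] = [0]  # sizes[i] is the byte size of file i

    def append(self, snapshot: bytes) -> int:
        """Record one periodic snapshot; return the file number it went to."""
        if self.sizes[-1] > 0 and self.sizes[-1] + len(snapshot) > self.limit:
            self.sizes.append(0)  # current file would overflow: start the next
        self.sizes[-1] += len(snapshot)
        return len(self.sizes) - 1


# Small limit so the rollover is visible without writing 128 MiB.
files = IndexFileSet(limit=100)
first = files.append(b"a" * 60)
second = files.append(b"b" * 60)
```

In a real deployment a scheduler would call `append` once per time interval (5 to 10 minutes in the text) with the serialized SlimTrie index data.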
The invention discloses a method for storing data indexes based on a combination mode, in which HashMap and SlimTrie are combined to store the data index and the storage mode is quickly determined from the size of the data index. The method has the following beneficial effects:
1. by quickly determining the storage mode, the amount of data scanned and the access time are reduced, and because the SlimTrie supports distributed storage, query efficiency is improved;
2. storing the data index with a combination of SlimTrie and HashMap reduces the memory footprint, provides a better user experience for clients, and saves cost.
Drawings
FIG. 1 is a flow chart of an embodiment in which the storage mode is determined from the data volume.
FIG. 2 is a diagram of distributed SlimTrie storage according to the present invention.
FIG. 3 is a performance comparison chart of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description.
The technical principle of the invention is as follows:
Fig. 1 shows the flow by which the method determines the storage mode from the data volume; the data index is stored using a combination of HashMap and SlimTrie:
step S1: acquiring the data stored in a database and the number of data records;
step S2: evaluating the number of data records;
step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in a HashMap: a hash function first maps each key to be stored to a hash value, the index is then built, and the data can be located with a single lookup through the hash index; query efficiency is optimal when the number of records stored in the HashMap is within 20 million;
step S4: if the number of data records is greater than 20 million, storing the index data in a SlimTrie; writing the SlimTrie index data to an index file at a specified time interval, which in practice is set to 5 to 10 minutes, with experiments showing the best performance at 5 minutes; limiting the size of the index file to at most 128 MB; if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.
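The decision in steps S2 to S4 reduces to a single threshold comparison. The function name and the string return values below are hypothetical; only the 20 million threshold comes from the text.

```python
HASHMAP_MAX_RECORDS = 20_000_000  # threshold from steps S3 and S4


def choose_index_backend(record_count: int) -> str:
    """Return which structure holds the index for a given record count."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    # S3: at or below the threshold the index stays in an in-memory HashMap.
    # S4: above it the index moves to a SlimTrie backed by index files.
    return "HashMap" if record_count <= HASHMAP_MAX_RECORDS else "SlimTrie"
```

Note that exactly 20 million records still selects the HashMap, matching the "less than or equal to" wording of step S3.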
Fig. 2 is a distributed SlimTrie storage diagram of the method according to a first exemplary embodiment of the present invention, which is a preferred embodiment of the method shown in Fig. 1. In this embodiment, for step S4, index data is stored in the SlimTrie, as shown in Fig. 2, and the method includes:
the SlimTrie supports distributed storage by adopting a key-value structure, in which the key is an auto-increment id and the value stores an id generated from zxy tile coordinates by the following bit operation:
z<<58|x<<29|y;
where z represents the zoom level, x represents the abscissa and y represents the ordinate. A key-value database stores data as key-value pairs, each key corresponding to a unique value. For example, the index data is stored on 3 nodes, each node storing 20 million index entries; if one node stops running, the other two nodes can still locate the index position quickly. This achieves high availability for index queries and ensures normal operation of the service.
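The three-node example can be sketched as a replicated key-value store. The source does not say whether the nodes hold full copies or partitions; the sketch assumes full replication, and the `Node` and `ReplicatedIndex` classes are hypothetical.

```python
from typing import Dict, List, Optional


class Node:
    """One key-value node; `up` simulates whether it is running."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.store: Dict[int, int] = {}  # auto-increment id -> tile id


class ReplicatedIndex:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self._next_key = 1

    def put(self, tile: int) -> int:
        key = self._next_key
        self._next_key += 1
        for node in self.nodes:  # write the pair to every replica
            node.store[key] = tile
        return key

    def get(self, key: int) -> Optional[int]:
        for node in self.nodes:  # read from the first running replica
            if node.up:
                return node.store.get(key)
        raise RuntimeError("no node available")


cluster = ReplicatedIndex([Node("a"), Node("b"), Node("c")])
key = cluster.put((12 << 58) | (3370 << 29) | 1552)
cluster.nodes[0].up = False  # one node stops running
```

With one node down, `cluster.get(key)` is still answered by the next running replica, which is the high-availability behaviour the embodiment describes.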
Fig. 3 is a performance comparison chart of the method according to a second embodiment of the present invention, in which:
let there be n keys, each of length k. "O()" denotes complexity, that is, how quickly the running time grows as the input size increases.
SlimTrie is built on a standard Trie. Its generation is divided into three parts: first, a Trie is used to build the index of all keys, and the Trie is then trimmed, reducing the order of magnitude of the index data from O(n*k) to O(n); second, a tree is normally implemented in memory with pointers, but a pointer occupies 8 bytes on a 64-bit system, so the memory overhead is at least 8*n bytes, and the data structure of the whole Trie (internal nodes, user data and Step information) is therefore stored in a compacted array to reduce memory overhead; third, small files are optimized by identifying several adjacent small files with a single index entry, balancing IO overhead against space overhead.
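The first part of the generation process, building a Trie and removing its single-branch nodes, can be sketched as below. This is a simplified illustration and not the SlimTrie implementation: the `TrieNode` class, the `step` field (how many characters were consumed to reach a node) and the sample keys are assumptions, and the compacted-array encoding is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # first character of the edge -> TrieNode
        self.value = None   # user data stored at the end of a key
        self.step = 1       # characters consumed to reach this node


def build_trie(items):
    """Build a standard Trie with one node per character."""
    root = TrieNode()
    for key, value in items:
        node = root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
    return root


def trim(node):
    """Remove single-branch nodes in place, recording skipped length in step."""
    for ch, child in list(node.children.items()):
        # Skip over nodes that have exactly one child and carry no value.
        while len(child.children) == 1 and child.value is None:
            (grandchild,) = child.children.values()
            grandchild.step += child.step
            child = grandchild
        node.children[ch] = child
        trim(child)
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


root = build_trie([("abcd", 1), ("abce", 2), ("xyz", 3)])
nodes_before = count_nodes(root)
trim(root)
nodes_after = count_nodes(root)
```

After trimming, every remaining node below the root either branches or holds a value, so the node count grows with the number of keys rather than with their total length.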
To index n keys, at least log2(n) bits are needed to distinguish the n different keys. From this it follows that the memory overhead of the index for each file is independent of the key length; the size of n is therefore limited by splitting the entire key set into multiple subsets of a specified size. Each index entry takes 10 bytes in total, of which the key information occupies 6 bytes and the value occupies 4 bytes. Because the space overhead of SlimTrie depends only on the number of keys n and not on their length, its space overhead in Fig. 3 is O(n) and its query time is O(log(n)). Here log() denotes the base-2 logarithm: each time the input size doubles, the number of operations increases by only one. In the complexity comparison, SlimTrie has the smallest space overhead and query time, and is superior to the other data structures in both space and performance.
The above embodiments do not limit the present invention, and the invention is not restricted to these examples; its scope is defined by the following claims.
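The two quantities in this paragraph, the minimum number of bits needed to tell n keys apart and the total size at 10 bytes per entry, can be computed directly. The function names are hypothetical, and the 20 million figure is used only because it is the threshold from step S3.

```python
ENTRY_BYTES = 10  # 6 bytes of key information + 4 bytes of value, per the text


def min_bits_to_distinguish(n: int) -> int:
    """Smallest b with 2**b >= n, i.e. ceil(log2(n)) for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Integer arithmetic avoids floating-point rounding in math.log2.
    return (n - 1).bit_length()


def index_bytes(n: int) -> int:
    """Total index size when every key costs a fixed ENTRY_BYTES."""
    return n * ENTRY_BYTES


bits = min_bits_to_distinguish(20_000_000)
size = index_bytes(20_000_000)
```

Neither function takes the key length as an input, which is the point of the paragraph: the cost depends on n alone.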

Claims (10)

1. A method for storing a data index based on a combination mode, comprising the steps of:
step S1: acquiring the data stored in a database and the number of data records;
step S2: evaluating the number of data records;
step S3: if the number of data records is less than or equal to 20 million, storing the data index in memory in a HashMap;
step S4: if the number of data records is greater than 20 million, storing the data index in a SlimTrie, writing the SlimTrie index data to an index file at a specified time interval, and limiting the size of the index file.
2. The method for storing a data index based on a combination mode according to claim 1, wherein: in step S3, the complete key is stored in the in-memory HashMap, a hash function maps each stored key to a hash value, and the index is built.
3. The method for storing a data index based on a combination mode according to claim 1, wherein: the SlimTrie in step S4 is generated from a standard Trie, and after invalid nodes in the standard Trie are trimmed, the memory overhead of the Trie is compressed to obtain the SlimTrie.
4. The method for storing a data index based on a combination mode according to claim 3, wherein: trimming invalid nodes in the standard Trie means removing the single-branch nodes in the standard Trie.
5. The method for storing a data index based on a combination mode according to claim 3, wherein: compressing the memory overhead of the Trie to obtain the SlimTrie comprises storing the data structure of the whole Trie in a compressed array, thereby compressing the Trie.
6. The method for storing a data index based on a combination mode according to claim 3, wherein: the SlimTrie identifies a plurality of adjacent small files with a single index entry, thereby optimizing for small files.
7. The method for storing a data index based on a combination mode according to claim 1, wherein: the SlimTrie in step S4 achieves distributed storage through key-value storage.
8. The method for storing a data index based on a combination mode according to claim 7, wherein: the key in the key-value pair is an auto-increment id, the value stores an id generated from zxy tile coordinates, and the id is generated by the following bit operation:
z<<58|x<<29|y;
where z represents the zoom level, x represents the abscissa and y represents the ordinate.
9. The method for storing a data index based on a combination mode according to claim 1, wherein: the time interval in step S4 is set in practice to 5 to 10 minutes.
10. The method for storing a data index based on a combination mode according to claim 1, wherein: in step S4, the size of the index file is limited to at most 128 MB; if the index file size exceeds 128 MB, the SlimTrie data index is stored in the next index file.
CN202311288969.9A 2023-10-08 2023-10-08 Method for storing data index based on combination mode Pending CN117312239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311288969.9A CN117312239A (en) 2023-10-08 2023-10-08 Method for storing data index based on combination mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311288969.9A CN117312239A (en) 2023-10-08 2023-10-08 Method for storing data index based on combination mode

Publications (1)

Publication Number Publication Date
CN117312239A true CN117312239A (en) 2023-12-29

Family

ID=89280834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311288969.9A Pending CN117312239A (en) 2023-10-08 2023-10-08 Method for storing data index based on combination mode

Country Status (1)

Country Link
CN (1) CN117312239A (en)

Similar Documents

Publication Publication Date Title
CN110413611B (en) Data storage and query method and device
CN111125089B (en) Time sequence data storage method, device, server and storage medium
US11899641B2 (en) Trie-based indices for databases
KR101792168B1 (en) Managing storage of individually accessible data units
KR100856245B1 (en) File system device and method for saving and seeking file thereof
CN111177302B (en) Service bill processing method, device, computer equipment and storage medium
US7895171B2 (en) Compressibility estimation of non-unique indexes in a database management system
CN1838124A (en) Method for rapidly positioning grid + T tree index in mass data memory database
CN112148928B (en) Cuckoo filter based on fingerprint family
CN111046034A (en) Method and system for managing memory data and maintaining data in memory
CN103077197A (en) Data storing method and device
CN108009265B (en) Spatial data indexing method in cloud computing environment
US11868328B2 (en) Multi-record index structure for key-value stores
CN112380004B (en) Memory management method, memory management device, computer readable storage medium and electronic equipment
CN111143373A (en) Data processing method and device, electronic equipment and storage medium
CN117312239A (en) Method for storing data index based on combination mode
CN110221778A (en) Processing method, system, storage medium and the electronic equipment of hotel's data
CN114416741A (en) KV data writing and reading method and device based on multi-level index and storage medium
US20060230054A1 (en) On-line organization of data sets
US20130218851A1 (en) Storage system, data management device, method and program
CN111382120B (en) Small file management method, system and computer equipment
CN114398373A (en) File data storage and reading method and device applied to database storage
CN117540056B (en) Method, device, computer equipment and storage medium for data query
CN113946580B (en) Massive heterogeneous log data retrieval middleware
CN118132513A (en) Fingerprint data determining method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination