CN106909623B - Data device and data storage method supporting efficient mass data analysis and retrieval

Publication number
CN106909623B
Authority
CN
China
Prior art keywords
data
index
column
record
horizontal block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710043645.7A
Other languages
Chinese (zh)
Other versions
CN106909623A (en)
Inventor
王卓
李波
古晓艳
王伟平
孟丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201710043645.7A
Publication of CN106909623A
Application granted
Publication of CN106909623B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data device and a data storage method that support efficient analysis and retrieval of massive data. The device comprises several folders, each containing a plurality of index segments. Each index segment comprises a full-text index component, a data positioning module and a data storage module. The full-text index component stores the inverted index information of the records in the index segment. The data storage module contains a plurality of horizontal blocks; each horizontal block contains a plurality of column fragments, and each column fragment contains a plurality of data pages that store data records. The data positioning module provides a nested index structure for the data storage module: each horizontal block index stores the horizontal block's record start Id, the horizontal block position, the position of each column fragment and a column fragment index set; each column fragment index records the data page start position in the column fragment and a data page index set; each data page index records the file position of the data page and the page's record start Id.

Description

Data device supporting efficient mass data analysis and retrieval and data storage method
Technical Field
The invention belongs to the field of data storage organization and relates to a data device and a data storage method that respond efficiently to analysis and retrieval workloads over massive data.
Background
Existing mass data processing technology provides powerful support for big data applications, but it also faces technical difficulties. On the one hand, although data analysis systems excel at sequential reads, their performance is clearly insufficient for queries with filter conditions, and the problem is especially pronounced when the filter condition is a full-text retrieval condition. On the other hand, application scenarios that combine data retrieval and data analysis are increasingly important in practice. Most existing solutions run two separate systems, one for retrieval and one for analysis, to serve such mixed scenarios. Because each system uses a different data storage strategy, this approach not only consumes a large amount of storage and computing resources but also requires a complex mechanism to keep the data in the two systems consistent.
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a data storage device and a data storage method for massive data. The invention covers three main aspects: (1) a data device that combines full-text indexing with columnar storage; (2) a merge optimization technique for the data device; (3) a random access optimization technique for the data device.
The invention comprises the following contents:
1) The organizational framework of the data device.
2) The data loading flow based on the data device.
3) The data merging optimization technique.
4) The data reading flow based on the data device.
5) The random access optimization technique for the reading flow.
The technical scheme of the invention is as follows:
A data device supporting efficient mass data analysis and retrieval, characterized by comprising a plurality of folders, wherein each folder contains a plurality of index segments, and each index segment comprises a full-text index component, a data positioning module and a data storage module; wherein
the full-text index component is used for storing the inverted index information of the records in the index segment;
the data storage module comprises a plurality of horizontal blocks, each horizontal block comprises a plurality of column fragments, and each column fragment comprises a plurality of data pages for storing data records;
the data positioning module provides a nested index structure for the data storage module; the nested index structure comprises the number of columns in a record, a column descriptor set, the compression mode of the data storage module and a horizontal block index set; each horizontal block index stores the horizontal block's record start Id, the horizontal block position, the position of each column fragment and a column fragment index set; each column fragment index records the data page start position in the column fragment and a data page index set; each data page index records the file position of the data page and the page's record start Id.
Further, the ordered Id fragment is divided among the horizontal block indexes according to the start and end Id numbers of each horizontal block recorded in the data positioning module.
Further, the ordered set of record Ids is mapped onto the index segments according to the start and end Id numbers of the records kept in the full-text index component, so that each index segment receives one ordered Id fragment.
Further, the data content stored in the data page is dictionary-encoded data content.
Further, the data content stored in the data page is data content encoded with a type-aware data encoding algorithm.
A data storage method, comprising the following steps:
1) reading an unstored record from the record set to be stored, obtaining the record's Id number and field set, building an inverted index for the specified fields, and writing the resulting inverted index information into the full-text index component;
2) writing each field in the field set into the corresponding column fragment of the data storage module, and, if the data written so far fills one data page, recording the record's Id number and the record's offset within the data storage module in the data positioning module;
3) repeating steps 1) and 2), and, if the records written so far fill one horizontal block, recording the record's Id number, the horizontal block position and the positions of the column fragments in the horizontal block in the data positioning module, and updating the column fragment index set.
Further, after step 3), horizontal blocks whose data volume is smaller than a set threshold are selected as horizontal blocks to be merged; if there is only one horizontal block to be merged, it is appended directly to the end of a new data storage module and the data positioning module is updated; otherwise, the column data of each horizontal block to be merged is appended to a new data storage module and the data positioning module is updated.
Further, a dictionary caching mechanism is used when storing the records in the record set to be stored.
Compared with the prior art, the invention has the following positive effects:
The device combines the characteristics of the columnar storage format with full-text indexing technology. It ensures high throughput in data analysis scenarios and real-time response in data retrieval scenarios, which improves the performance of data analysis tasks with filter conditions and serves combined application scenarios efficiently. The data device is suitable for data analysis over massive data and for scenarios that combine data analysis and data retrieval.
Drawings
Fig. 1 is an organization block diagram of the data device.
Detailed Description
Data device organization framework
The organizational framework of the data device is shown in FIG. 1. The data device is organized in folders, and each folder contains a plurality of independent index segments; each segment comprises a full-text index component, a data positioning module and a data storage module. The full-text index component holds the inverted index information of all records in its segment and is used to evaluate query conditions quickly: it takes a query condition as input and outputs the set of Ids of the matching records. The data storage module uses a hybrid row-column storage layout: each data storage module contains a plurality of horizontal blocks, each horizontal block contains a plurality of column fragments, and each column fragment is a storage unit holding the data of one specific column within the horizontal block. Each column fragment consists of a plurality of data pages, and each data page may hold data encoded with dictionary encoding or with one of several type-aware data encoding algorithms. If a data page uses dictionary encoding, a dictionary page is placed at the head of the column fragment to which the page belongs, and that dictionary page is used to decode the dictionary-encoded data pages. The data storage module format inherits the high compression ratio and high throughput of columnar storage formats, and the horizontal block organization avoids the cost of record reassembly, so the format can be applied efficiently in data analysis scenarios.
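The hybrid row-column layout described above can be pictured with a small in-memory sketch. It is only an illustration: the function name `layout`, the block and page sizes, and the use of plain Python lists in place of encoded pages are all assumptions, not details taken from the patent.

```python
from typing import Dict, List


def layout(records: List[Dict[str, object]], columns: List[str],
           rows_per_block: int, values_per_page: int):
    """Split records into horizontal blocks -> column fragments -> data pages.

    All sizes are hypothetical; a real device would size pages in bytes.
    """
    blocks = []
    for start in range(0, len(records), rows_per_block):
        rows = records[start:start + rows_per_block]
        fragments = {}
        for col in columns:
            # A column fragment holds one column of this horizontal block.
            values = [row[col] for row in rows]
            # The fragment is cut into data pages of a fixed number of values.
            fragments[col] = [values[i:i + values_per_page]
                              for i in range(0, len(values), values_per_page)]
        blocks.append(fragments)
    return blocks


records = [{"id": i, "city": f"c{i % 2}"} for i in range(6)]
blocks = layout(records, ["id", "city"], rows_per_block=4, values_per_page=2)
print(blocks[0]["id"])    # [[0, 1], [2, 3]]
print(blocks[1]["city"])  # [['c0', 'c1']]
```

Because every horizontal block keeps all columns of the same rows together, a full record can be rebuilt from one block without touching the others.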
The data positioning module provides a nested index structure for the data storage module. The module stores the number of data columns (that is, the number of columns that make up a record), a column descriptor set (that is, information such as the name and data type of each column), the compression mode of the data storage module and a horizontal block index set. At the horizontal block level, each horizontal block index stores the horizontal block's record start Id, the horizontal block position, the dictionary page position of each column fragment and a column fragment index set; if none of the data pages of a column fragment uses dictionary encoding, its dictionary page position is null. At the column fragment level, each column fragment index records the data page start position in the column fragment and a data page index set. Each data page index records the file position of the data page and the page's record start Id.
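One possible way to model this nested index is with three levels of dataclasses. The class and field names below are hypothetical labels for the items listed in the paragraph above; the patent does not prescribe an in-memory representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataPageIndex:
    file_position: int   # position of the data page in the file
    start_id: int        # Id of the first record stored in the page


@dataclass
class ColumnFragmentIndex:
    first_page_position: int                      # data page start position
    pages: List[DataPageIndex] = field(default_factory=list)


@dataclass
class HorizontalBlockIndex:
    start_id: int                                 # record start Id of the block
    block_position: int
    # None models the "null" dictionary page position of a column fragment
    # whose data pages do not use dictionary encoding.
    dictionary_page_positions: Dict[str, Optional[int]]
    fragments: Dict[str, ColumnFragmentIndex] = field(default_factory=dict)


@dataclass
class DataPositioningModule:
    column_count: int
    column_descriptors: Dict[str, str]            # column name -> data type
    compression: str
    blocks: List[HorizontalBlockIndex] = field(default_factory=list)


module = DataPositioningModule(
    column_count=2,
    column_descriptors={"id": "int64", "city": "string"},
    compression="none",
    blocks=[HorizontalBlockIndex(
        start_id=0,
        block_position=0,
        dictionary_page_positions={"id": None, "city": 64},
        fragments={"city": ColumnFragmentIndex(
            first_page_position=128,
            pages=[DataPageIndex(file_position=128, start_id=0)])},
    )],
)
```

The nesting mirrors the storage layout: a lookup narrows from block to column fragment to page, so only the index entries on that path need to be inspected.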
This organization effectively supports both data retrieval and data analysis. Given a query condition, the set of document Ids satisfying it is obtained through the full-text index component, the data pages containing those Ids are located through the data positioning module by random access, and the corresponding data is obtained by scanning the records in those data pages. For a file scan, the number of data storage modules to scan is determined from the number of segments, the data storage modules are traversed in order, and all records are returned.
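The retrieval path starts from an inverted index that turns a query condition into an ordered Id set. The sketch below is a deliberately minimal stand-in for the full-text index component: whitespace tokenization, conjunctive matching and the function names are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each term to the ascending list of record Ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def search(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return the ordered Ids of records that contain every query term."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {1: "columnar storage", 2: "full text index", 3: "columnar index"}
index = build_inverted_index(docs)
print(search(index, ["columnar", "index"]))  # [3]
```

The ordered Id list returned by `search` is the input that the data positioning module would then resolve to data pages.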
Data loading flow
Given a record set, the device reads the records in order, builds the inverted index information and writes it into the full-text index component, then writes the records into the data storage module and updates the data positioning information. The process can be described as the following steps:
1. If there are records that have not been written into the data storage module, take an unprocessed record and go to step 2; otherwise, go to step 6.
2. Obtain the record's Id number and the field set contained in the record, and build an inverted index for the specified fields.
3. If there are fields that have not been written into the data storage module, take an unprocessed field and go to step 4; otherwise, go to step 5.
4. Write the field into the corresponding column fragment of the data storage module according to the user-defined field mapping. If the column data written so far fills one data page, record the record's Id and the record's offset within the data storage module in the data positioning module. Go to step 3.
5. If the records written so far fill one horizontal block, record the record's Id, the horizontal block position and the column dictionary positions in the data positioning module, and update each column fragment index set. Go to step 1.
6. Write the meta information (that is, statistics obtained after all data is loaded, such as the maximum and minimum values of a field and the number of records, together with the position of each horizontal block in the data storage module) into the data storage module, write the data positioning information into the data positioning module, and finish.
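The loading steps above can be sketched as a single loop. This is a simplified illustration rather than the patented procedure: pages and blocks are counted in records instead of bytes, index entries are written when a page or block is opened rather than when it fills, the final metadata step is omitted, and all names (`load`, `page_index`, `block_index`) are hypothetical.

```python
from typing import Dict, List, Tuple


def load(records: List[Tuple[int, Dict[str, object]]], columns: List[str],
         indexed_field: str, rows_per_block: int = 4, values_per_page: int = 2):
    inverted: Dict[str, List[int]] = {}         # full-text index component
    pages = {c: [] for c in columns}            # data pages per column
    page_index = {c: [] for c in columns}       # (page start Id, page number)
    block_index = []                            # (block start Id, first page per column)
    rows_in_block = 0
    for rec_id, fields in records:
        # Step 2: build inverted index entries for the designated field.
        for term in str(fields[indexed_field]).split():
            inverted.setdefault(term, []).append(rec_id)
        if rows_in_block == 0:
            # A new horizontal block starts: note where its fragments begin.
            block_index.append((rec_id, {c: len(pages[c]) for c in columns}))
        # Steps 3 and 4: write every field into its column fragment.
        for c in columns:
            if rows_in_block % values_per_page == 0:
                pages[c].append([])
                page_index[c].append((rec_id, len(pages[c]) - 1))
            pages[c][-1].append(fields[c])
        # Step 5: close the block once it holds rows_per_block records.
        rows_in_block = (rows_in_block + 1) % rows_per_block
    return inverted, pages, page_index, block_index


records = [(10 + i, {"title": f"doc t{i % 2}", "n": i}) for i in range(5)]
inverted, pages, page_index, block_index = load(records, ["title", "n"], "title")
print(pages["n"])       # [[0, 1], [2, 3], [4]]
print(page_index["n"])  # [(10, 0), (12, 1), (14, 2)]
```

Recording the start Id of every page and block is what later allows an Id to be routed to its page without scanning.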
Merging optimization techniques
To make a loaded record set searchable within a short time, the device generates many segments with small data volumes during loading. To preserve index performance, several small segments must be merged into one segment at intervals; the data positioning modules and data storage modules that serve as input and output of the merge all follow the data device organization shown in FIG. 1. To ensure merge performance, the device merges in units of data pages, and during merging horizontal blocks with small data volumes are merged into horizontal blocks with large data volumes, which ensures both the efficiency of the merge and the query performance after merging.
The page-level merging process can be described as the following steps:
1. Read the metadata (statistics, horizontal block positions and so on) of all horizontal blocks that need to be merged.
2. If horizontal blocks remain to be merged, obtain a set of them whose combined data volume is close to the default horizontal block data volume, and go to step 3; otherwise, go to step 5.
3. If the set contains only one horizontal block, append that block directly to the end of the new data storage module, update the data positioning information and go to step 2; otherwise, go to step 4.
4. For each data column to be generated, read the corresponding column data from each horizontal block in the set, append it to the newly generated data storage module, update the data positioning information and go to step 2.
5. Write the updated metadata and data positioning information into their modules, and finish.
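A rough sketch of the grouping and merging steps follows. It makes several assumptions that are not in the source: a horizontal block is modeled as a dictionary from column name to a list of values, block size is measured in rows, groups are formed greedily in input order, and positioning updates are left out.

```python
from typing import Dict, List

Block = Dict[str, List[object]]  # column name -> values of that column


def row_count(block: Block) -> int:
    return len(next(iter(block.values())))


def combine(group: List[Block]) -> Block:
    if len(group) == 1:
        # Step 3: a lone block is appended unchanged.
        return group[0]
    # Step 4: concatenate the column data of every block, column by column.
    return {col: [v for block in group for v in block[col]] for col in group[0]}


def merge_small_blocks(blocks: List[Block], default_rows: int) -> List[Block]:
    merged: List[Block] = []
    group: List[Block] = []
    group_rows = 0
    for block in blocks:
        n = row_count(block)
        # Step 2: close the current group before it exceeds the default size.
        if group and group_rows + n > default_rows:
            merged.append(combine(group))
            group, group_rows = [], 0
        group.append(block)
        group_rows += n
    if group:
        merged.append(combine(group))
    return merged


small = [{"a": [1, 2]}, {"a": [3]}, {"a": [4, 5, 6, 7]}, {"a": [8]}]
print(merge_small_blocks(small, default_rows=4))
# [{'a': [1, 2, 3]}, {'a': [4, 5, 6, 7]}, {'a': [8]}]
```

Working column by column keeps each merged column contiguous, which is what lets the merge copy whole pages instead of reassembling records.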
Data reading flow
Data reading supports two modes, random access and sequential access. In random access, the full-text index component matches the set of record Ids that satisfy the query condition, and the data storage module is queried with that set to obtain the result data. In sequential access, all data in the data storage module is read out in order by scanning. The overall process is described in the section on the organization framework: once a query condition is given, the full-text index component of each segment yields an ordered set of hit Ids. This section describes in detail how, during random access, the record set is obtained from that ordered Id set through the data positioning information. The process can be divided into six steps:
1. The full-text index component of each segment stores the start and end Id numbers of the record set in that segment's data storage module. Using this information, the ordered Id set is mapped onto the index segments, so that each segment receives one ordered Id fragment.
2. The ordered Id fragment is divided among the horizontal block indexes according to the start and end Id numbers of each horizontal block recorded in the data positioning module.
3. Each horizontal block index maps out the column fragment indexes of the selected columns and outputs the Id fragment together with the corresponding column fragment positions and dictionary positions.
4. Each column fragment index maps the Id fragment onto its data page indexes and supplies the column fragment position to those data page indexes.
5. Each hit data page index computes the data page position from its recorded page position and the column fragment position, and outputs the data page position, the dictionary page position and the record Id set together.
6. The data device seeks to the data page in the data storage module, obtains the dictionary page, and scans the records in the data page in order until all selected records have been collected.
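Steps 2 to 5 amount to routing a sorted Id list through two levels of start Ids. The sketch below uses binary search for that routing; this choice, the function names and the flat lists of start Ids are assumptions, and the sketch does not check the end Id of the last block or page.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def route(sorted_ids: List[int], start_ids: List[int]) -> Dict[int, List[int]]:
    """Group Ids by the entry whose start Id is the greatest one <= the Id."""
    groups: Dict[int, List[int]] = {}
    for rec_id in sorted_ids:
        slot = bisect_right(start_ids, rec_id) - 1
        if slot >= 0:  # Ids below the first start Id belong to no entry
            groups.setdefault(slot, []).append(rec_id)
    return groups


def locate(sorted_ids: List[int], block_start_ids: List[int],
           page_start_ids: List[List[int]]) -> List[Tuple[int, int, List[int]]]:
    """Return (block number, page number, Id fragment) for every hit page."""
    hits = []
    for block, ids in route(sorted_ids, block_start_ids).items():
        for page, page_ids in route(ids, page_start_ids[block]).items():
            hits.append((block, page, page_ids))
    return hits


hits = locate([3, 60, 61, 170],
              block_start_ids=[0, 100],
              page_start_ids=[[0, 50], [100, 150]])
print(hits)  # [(0, 0, [3]), (0, 1, [60, 61]), (1, 1, [170])]
```

Pages that receive no Ids never appear in the result, which is how random access skips irrelevant data pages.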
Random access optimization techniques
To further speed up random access, the device applies two optimizations: a dictionary caching mechanism and optimized data acquisition at the data page level.
Dictionary caching mechanism: dictionary encoding, used as a storage strategy of the data storage module, effectively improves the compression ratio when the data takes values from a small range, and provides a fast decoding path that greatly improves scan performance. To support both dictionary encoding and fast random reads, the device keeps decoded dictionary pages in memory; when a randomly accessed data page is dictionary-encoded, it can be decoded directly with the cached dictionary, which saves the cost of loading and decoding the dictionary page. The cache is particularly effective when the number of random accesses is large.
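A minimal sketch of such a cache is shown below, assuming dictionaries are keyed by their dictionary page position and that a caller-supplied `loader` function (hypothetical) reads and decodes a dictionary page. Eviction is not modeled.

```python
from typing import Callable, Dict, List


class DictionaryCache:
    """Keeps decoded dictionary pages in memory, keyed by page position."""

    def __init__(self, loader: Callable[[int], List[str]]):
        self._loader = loader            # hypothetical: position -> decoded dictionary
        self._cache: Dict[int, List[str]] = {}
        self.loads = 0                   # number of times a page was actually loaded

    def get(self, position: int) -> List[str]:
        if position not in self._cache:
            self._cache[position] = self._loader(position)
            self.loads += 1
        return self._cache[position]


def decode_page(codes: List[int], dictionary: List[str]) -> List[str]:
    """Replace each dictionary code in a data page with its value."""
    return [dictionary[code] for code in codes]


cache = DictionaryCache(lambda position: ["red", "green", "blue"])
first = decode_page([2, 0, 0, 1], cache.get(64))
second = decode_page([1, 1], cache.get(64))  # served from memory
print(first, second, cache.loads)  # ['blue', 'red', 'red', 'green'] ['green', 'green'] 1
```

The second lookup reuses the decoded dictionary, so repeated random reads into the same column fragment pay the loading cost only once.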
Optimized data acquisition at the data page level: random access already filters out irrelevant data pages, which speeds up data access. This optimization targets the decoding step of fetching records from the relevant data pages and thereby speeds up random access further. Numeric columns are stored in the data storage module with fixed-length encoding. Once the Id number is known, the offset of the value is computed from the Id and the field length; after locating and decompressing the relevant data page, the data device seeks directly to the start of the value, reads it and returns it to the user. Compared with scanning, this saves redundant computation and pointer movement and speeds up fetching the field. String columns are stored in the data storage module with prefix-suffix encoding, so obtaining a string normally requires first obtaining the string before it; in random access this produces unnecessary decoding and memory copying. Therefore, after reading the prefix and suffix lengths of the strings in the data page, the device first seeks to the start of the suffix of the target string (the string to be fetched) and reads the suffix, then iteratively traces back through the relevant preceding strings in order and copies the needed content directly into the target string.
In addition, the optimization keeps the content of the string with the largest Id number in the current data page for use in subsequent record fetches. This effectively reduces unnecessary overhead and speeds up random access to the target string.
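Both page-level optimizations can be illustrated in a few lines. The sketch assumes little-endian 64-bit integers for the fixed-length case and models a prefix-suffix encoded page as a list of `(prefix_length, suffix)` pairs whose first entry has prefix length 0; these representations and the function names are illustrative assumptions, and the caching of the last fetched string is not modeled.

```python
import struct
from typing import List, Tuple


def read_fixed(page: bytes, page_start_id: int, rec_id: int, fmt: str = "<q"):
    """Read one fixed-length value by computing its offset from the Id."""
    width = struct.calcsize(fmt)
    offset = (rec_id - page_start_id) * width
    return struct.unpack_from(fmt, page, offset)[0]


def read_prefix_coded(entries: List[Tuple[int, str]], target: int) -> str:
    """Rebuild one string without decoding every string before it.

    entries[i] = (number of leading characters shared with string i - 1,
                  remaining suffix of string i).
    """
    need, suffix = entries[target]
    pieces = [suffix]
    j = target - 1
    while need > 0:
        prefix_len, earlier_suffix = entries[j]
        if need > prefix_len:
            # Characters prefix_len .. need-1 of the shared prefix are stored
            # in this earlier entry's suffix; copy only that slice.
            pieces.append(earlier_suffix[:need - prefix_len])
            need = prefix_len
        j -= 1
    return "".join(reversed(pieces))


page = struct.pack("<4q", 10, 20, 30, 40)
print(read_fixed(page, page_start_id=100, rec_id=102))  # 30

# Encodes "car", "cart", "carton", "cab".
entries = [(0, "car"), (3, "t"), (4, "on"), (2, "b")]
print(read_prefix_coded(entries, 2))  # carton
print(read_prefix_coded(entries, 3))  # cab
```

Fetching "cab" skips the suffixes of "cart" and "carton" entirely, because neither contributes to its two-character prefix; only the slices that are actually needed are copied.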

Claims (8)

1. A data device supporting efficient mass data analysis and retrieval, characterized by comprising several folders, each folder containing a plurality of index segments; each index segment comprises a full-text index component, a data positioning module and a data storage module; wherein
the full-text index component is used for storing the inverted index information of the records in the index segment;
the data storage module comprises a plurality of horizontal blocks, each horizontal block comprises a plurality of column fragments, and each column fragment comprises a plurality of data pages for storing data records;
the data positioning module provides a nested index structure for the data storage module, which comprises the number of columns in a record, a column descriptor set, the compression mode of the data storage module and a horizontal block index set; each horizontal block index stores the horizontal block's record start Id, the horizontal block position, the position of each column fragment and a column fragment index set; each column fragment index records the data page start position in the column fragment and a data page index set; each data page index records the file position of the data page and the page's record start Id.
2. The data device according to claim 1, characterized in that the ordered Id fragment is divided among the horizontal block indexes according to the record start Id number of each horizontal block in the data positioning module.
3. The data device according to claim 1 or 2, characterized in that the ordered set of record Ids is mapped onto the index segments according to the horizontal block record start Id numbers recorded in the full-text index component, each index segment containing one ordered Id fragment.
4. The data device according to claim 3, characterized in that the data content stored in the data page is dictionary-encoded data content.
5. The data device according to claim 3, characterized in that the data content stored in the data page is data content encoded with a type-aware data encoding algorithm.
6. A data storage method based on the data device according to claim 1, comprising the steps of:
1) reading an unstored record from the record set to be stored, obtaining the record's Id number and field set, then building an inverted index for the specified fields, and writing the resulting inverted index information into the full-text index component;
2) writing each field in the field set into the corresponding column fragment of the data storage module, and, if the data written so far fills one data page, recording the record's Id number and the record's offset within the data storage module in the data positioning module;
3) repeating steps 1) and 2), and, if the records written so far fill one horizontal block, recording the record's Id number, the horizontal block position and the positions of the column fragments in the horizontal block in the data positioning module, and updating the column fragment index set.
7. The method according to claim 6, characterized in that, after step 3), horizontal blocks whose data volume is smaller than a set threshold are obtained as horizontal blocks to be merged; if the number of horizontal blocks to be merged is 1, the horizontal block is appended directly to the end of a new data storage module and the data positioning module is updated; otherwise, the column data corresponding to each horizontal block to be merged is appended to a new data storage module and the data positioning module is updated.
8. The method according to claim 6, characterized in that a dictionary caching mechanism is used to store the records in the record set to be stored.
CN201710043645.7A 2017-01-19 2017-01-19 Data device and data storage method supporting efficient mass data analysis and retrieval Expired - Fee Related CN106909623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710043645.7A CN106909623B (en) 2017-01-19 2017-01-19 Data device and data storage method supporting efficient mass data analysis and retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710043645.7A CN106909623B (en) 2017-01-19 2017-01-19 Data device and data storage method supporting efficient mass data analysis and retrieval

Publications (2)

Publication Number Publication Date
CN106909623A CN106909623A (en) 2017-06-30
CN106909623B true CN106909623B (en) 2019-11-26

Family

ID=59206911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710043645.7A Expired - Fee Related CN106909623B (en) 2017-01-19 2017-01-19 Data device and data storage method supporting efficient mass data analysis and retrieval

Country Status (1)

Country Link
CN (1) CN106909623B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609081A (en) * 2017-08-29 2018-01-19 河南职业技术学院 A kind of Financial Information audit management system
CN110413624A (en) * 2019-08-07 2019-11-05 南京录信软件技术有限公司 A method of the multiple row stored in association deposited based on column
CN111767289A (en) * 2020-09-02 2020-10-13 成都四方伟业软件股份有限公司 Data storage method and device based on memory database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639848A (en) * 2009-06-01 2010-02-03 北京四维图新科技股份有限公司 Spatial data engine and method applying management spatial data thereof
CN103853772A (en) * 2012-12-04 2014-06-11 北京拓尔思信息技术股份有限公司 High-efficiency reverse index structure and organizing method
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN105550236A (en) * 2015-11-27 2016-05-04 广州华多网络科技有限公司 Distributed data deduplication processing method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9703858B2 (en) * 2014-07-14 2017-07-11 International Business Machines Corporation Inverted table for storing and querying conceptual indices
US9779104B2 (en) * 2014-11-25 2017-10-03 Sap Se Efficient database undo / redo logging

Also Published As

Publication number Publication date
CN106909623A (en) 2017-06-30

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20191126