CN106909623B

CN106909623B - Data device and data storage method supporting efficient mass data analysis and retrieval

Info

Publication number: CN106909623B
Application number: CN201710043645.7A
Authority: CN
Inventors: 王卓; 李波; 古晓艳; 王伟平; 孟丹
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-01-19
Filing date: 2017-01-19
Publication date: 2019-11-26
Anticipated expiration: 2037-01-19
Also published as: CN106909623A

Abstract

The invention discloses a data device and a data storage method supporting efficient mass data analysis and retrieval. The device comprises several folders, each containing a plurality of index segments; each index segment comprises a full-text index component, a data positioning module and a data storage module. The full-text index component stores the inverted index information of the records in the index segment. The data storage module comprises multiple horizontal blocks; each horizontal block contains multiple column fragments, and each column fragment contains multiple data pages for storing data records. The data positioning module provides a nested index structure for the data storage module: each horizontal block index stores the start Id of the records in the horizontal block, the position of the horizontal block, the position of each column fragment, and a set of column fragment indexes; each column fragment index records the start position of the data pages in the column fragment and a set of data page indexes; each data page index records the file position of the data page and the start Id of the records in the page.

Description

Data device supporting efficient mass data analysis and retrieval and data storage method

Technical Field

The invention belongs to the field of data storage organization, and relates to a data device and a data storage method that respond efficiently to analysis and retrieval application scenarios over mass data.

Background

Existing mass data processing technology provides powerful support for big data applications but also faces technical difficulties. On the one hand, although data analysis systems excel at sequential reads, their performance is clearly insufficient for queries with filter conditions, and the problem is particularly pronounced when the filter is a full-text retrieval condition. On the other hand, application scenarios that combine data retrieval and data analysis are increasingly important in practice. Most existing solutions handle such mixed scenarios by running two separate systems, one for retrieval and one for analysis; because each system adopts a different data storage strategy, this approach not only consumes large amounts of storage and computing resources but also requires a complex mechanism to keep the data of the two systems consistent.

Disclosure of Invention

In view of the technical problems in the prior art, the invention aims to provide a data storage device and a data storage method for mass data. The invention mainly covers three aspects: (1) a data device that combines full-text indexing with columnar storage; (2) a merge optimization technique for the data device; (3) random access optimization techniques for the data device.

The invention comprises the following contents:

1) An organizational framework for the data device.

2) A data loading process based on the data device.

3) A data merging optimization technique.

4) A data reading process based on the data device.

5) Random access optimization techniques for the reading process.

The technical scheme of the invention is as follows:

A data device supporting efficient mass data analysis and retrieval, characterized by comprising a plurality of folders, wherein each folder comprises a plurality of index segments; each index segment comprises a full-text index component, a data positioning module and a data storage module; wherein,

the full-text index component is used for storing the inverted index information of the records in the index segment;

the data storage module comprises a plurality of horizontal blocks, each horizontal block comprises a plurality of column fragments, and each column fragment comprises a plurality of data pages for storing data records;

the data positioning module provides a nested index structure for the data storage module, and the nested index structure comprises the number of columns in a record, a column descriptor set, the compression mode of the data storage module and a horizontal block index set; each horizontal block index stores the start Id of the records in the horizontal block, the horizontal block position, the position of each column fragment and a column fragment index set; each column fragment index records the start position of the data pages in the column fragment and a data page index set; each data page index records the file position of the data page and the start Id of the records in the page.

Further, the ordered Id segment is divided among the horizontal block indexes according to the start and end Id numbers contained in each horizontal block in the data positioning module.

Further, the ordered set of record Ids is mapped to the index segments according to the start and end Id numbers of the records held in the full-text index component, each index segment receiving one ordered Id fragment.

Further, the data content stored in the data page is data content encoded by a dictionary.

Further, the data content stored in the data page is the data content encoded by adopting a type-aware data encoding algorithm.

A data storage method comprises the following steps:

1) reading an unstored record from the record set to be stored, acquiring the Id number and the field set of the record, establishing an inverted index for the specified fields, and writing the constructed inverted index information into the full-text index component;

2) writing each field in the field set into the corresponding column fragment of the data storage module, and, if the data written so far fills a data page, recording the Id number of the record and its offset within the data storage module in the data positioning module;

3) repeating steps 1) and 2), and, if the records written so far reach the size of one horizontal block, recording the Id number of the record, the position of the horizontal block and the positions of all column fragments in the horizontal block in the data positioning module, and updating the column fragment index set.

Further, after step 3), horizontal blocks whose data volume is smaller than a set threshold are acquired as horizontal blocks to be merged; if the number of horizontal blocks to be merged is 1, the horizontal block is appended directly to the end of a new data storage module and the data positioning module is updated; otherwise, the column data corresponding to each horizontal block to be merged is appended to a new data storage module, and the data positioning module is updated.

Further, a dictionary caching mechanism is adopted to store the records in the record set to be stored.

Compared with the prior art, the invention has the following positive effects:

The device combines the characteristics of a columnar storage format with full-text indexing technology. It ensures high throughput in data analysis scenarios and real-time response in data retrieval scenarios, thereby improving the performance of data analysis tasks with filter conditions and responding efficiently to the needs of combined application scenarios. The data device is suitable for data analysis applications over mass data and for applications that combine data analysis with data retrieval.

Drawings

Fig. 1 is an organization block diagram of the data device.

Detailed Description

Data device organization framework

The organizational framework of the data device is shown in FIG. 1. The data device is organized in folders, and each folder comprises a plurality of independent index segments; each segment includes a full-text index component, a data positioning module and a data storage module. The full-text index component holds the inverted index information for all records in its segment and is used to evaluate query conditions quickly: it takes a query condition as input and outputs the set of Ids of the matching records. The data storage module adopts a hybrid row-column storage mode: each data storage module comprises a plurality of horizontal blocks, each horizontal block comprises a plurality of column fragments, and each column fragment is a storage unit holding the data of one specific column within the horizontal block. Each column fragment is composed of a plurality of data pages, and the content of each data page may be encoded with dictionary encoding or with one of several type-aware data encoding algorithms. If the data in a data page is dictionary-encoded, a dictionary page is placed at the head of the column fragment to which the data page belongs, and that dictionary page is used to decode the dictionary-encoded data pages. The data storage module format inherits the high compression ratio and high throughput of columnar storage formats, and organizing the data into horizontal blocks avoids the cost of record reassembly, so the format can be applied efficiently in data analysis scenarios.

The data positioning module provides a nested index structure for the data storage module. The module stores the number of data columns (that is, the number of columns that make up a record), a column descriptor set (that is, information such as the name and data type of each column), the compression mode of the data storage module and a horizontal block index set. At the horizontal block level, each horizontal block index stores the start Id of the records in the horizontal block, the horizontal block position, the dictionary page position of each column fragment and a column fragment index set; if none of the data pages of a column fragment is dictionary-encoded, its dictionary page position is null. At the column fragment level, each column fragment index records the start position of the data pages in the column fragment and a data page index set. Each data page index records the file position of the data page and the start Id of the records in the page.
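The nested index structure described above can be pictured as a small hierarchy of record types. The following is a minimal sketch only: the class and field names (DataPositioningModule, HorizontalBlockIndex and so on) are illustrative assumptions and do not appear in the patent, and positions are modelled as plain integers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataPageIndex:
    file_position: int  # file position of the data page
    start_id: int       # Id of the first record stored in the page


@dataclass
class ColumnFragmentIndex:
    page_start_position: int  # start position of the data pages in the fragment
    pages: List[DataPageIndex] = field(default_factory=list)


@dataclass
class HorizontalBlockIndex:
    start_id: int        # Id of the first record in the horizontal block
    block_position: int  # position of the horizontal block
    # One entry per column fragment; None when the fragment has no
    # dictionary-encoded pages (the "null" case in the description).
    dictionary_page_positions: List[Optional[int]]
    column_fragments: List[ColumnFragmentIndex]


@dataclass
class ColumnDescriptor:
    name: str
    data_type: str


@dataclass
class DataPositioningModule:
    column_count: int
    column_descriptors: List[ColumnDescriptor]
    compression: str  # compression mode of the data storage module
    blocks: List[HorizontalBlockIndex] = field(default_factory=list)
```

Each level only stores what is needed to reach the next one, which is what allows a lookup to descend from a horizontal block to a column fragment to a single data page.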

This organization of the data device effectively supports both data retrieval and data analysis. Given a query condition, the set of document Ids satisfying the condition is obtained through the full-text index component; the data positioning module then locates, by random access, the data pages containing those Ids, and the corresponding data is obtained by scanning the records in those pages. For a full scan, the number of data storage modules to be scanned is determined from the number of segments, the data storage modules are traversed in sequence, and all records are returned.

Data loading flow

Given a record set, the device reads the records in sequence, constructs the inverted index information and writes it into the full-text index component, then writes the records into the data storage module and updates the data positioning information. The process can be described as the following steps:

1. If there are records that have not yet been written into the data storage module, acquire an unprocessed record and execute step 2; otherwise, execute step 6.

2. Acquire the Id number of the record and the field set it contains, and establish an inverted index for the specified fields.

3. If there are fields that have not yet been written into the data storage module, acquire an unprocessed field and execute step 4; otherwise, execute step 5.

4. Write the field into the corresponding column fragment of the data storage module according to the user-defined field mapping. If the column data written so far fills a data page, record the Id of the record and its offset within the data storage module in the data positioning module. Execute step 3.

5. If the records written so far reach the size of one horizontal block, record the Id, the horizontal block position and the column dictionary positions in the data positioning module, and update each column fragment index set. Execute step 1.

6. Write the meta information (that is, statistics obtained after all data has been loaded, such as the maximum and minimum values of a field and the number of records, together with the position of each horizontal block in the data storage module) into the data storage module, write the data positioning information into the data positioning module, and end.
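The loading steps above can be sketched as a single in-memory loop. This is an illustration under stated assumptions, not the patented implementation: the function name load_records, the whitespace tokenizer, and the use of row counts (page_rows, block_rows) in place of byte sizes are all hypothetical, and the start Id of a page is noted when the page is opened rather than when it fills.

```python
from collections import defaultdict


def load_records(records, indexed_fields, page_rows=2, block_rows=4):
    """Load (record_id, {column: value}) pairs given in ascending Id order.

    Returns (inverted index, storage, positioning); all three are plain
    in-memory stand-ins for the three components of an index segment.
    """
    inverted = defaultdict(list)  # term -> ordered record Ids
    storage = []      # one dict per horizontal block: column -> list of pages
    positioning = []  # one dict per horizontal block: start Id and page start Ids
    rows_in_block = 0

    for record_id, fields in records:
        # Step 2: build the inverted index for the specified fields.
        for name in indexed_fields:
            for term in str(fields[name]).lower().split():
                if not inverted[term] or inverted[term][-1] != record_id:
                    inverted[term].append(record_id)

        # Open a new horizontal block when the previous one is full.
        if rows_in_block == 0:
            storage.append({name: [] for name in fields})
            positioning.append({"start_id": record_id,
                                "page_start_ids": {name: [] for name in fields}})
        block, index = storage[-1], positioning[-1]

        # Steps 3 and 4: write every field into its column fragment,
        # opening a new data page whenever the current one is full.
        for name, value in fields.items():
            pages = block[name]
            if not pages or len(pages[-1]) == page_rows:
                pages.append([])
                index["page_start_ids"][name].append(record_id)
            pages[-1].append(value)

        # Step 5: close the block once it holds block_rows records.
        rows_in_block = (rows_in_block + 1) % block_rows

    return dict(inverted), storage, positioning
```

A real loader would track byte offsets and write statistics at the end (step 6); the sketch keeps only the control flow of records, fields, pages and blocks.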

Merging optimization techniques

So that loaded records become searchable within a short time, the device may generate many small segments during loading. To maintain index performance, several small segments must periodically be merged into one segment; the data positioning modules and data storage modules that serve as input and output of the merge all follow the data device organization shown in FIG. 1. To keep merging fast, the device merges in units of data pages, and during the merge horizontal blocks holding little data are combined into larger horizontal blocks, which preserves both the efficiency of the merge and the query performance afterwards.

The merging process in units of pages can be described as the following steps:

1. Read the metadata (statistics, horizontal block positions and the like) of all horizontal blocks that need to be merged.

2. If horizontal blocks remain to be merged, acquire a set of horizontal blocks to merge whose total data volume is close to the default horizontal block data volume, and execute step 3; otherwise, execute step 5.

3. If the set contains only one horizontal block, append it directly to the end of the new data storage module, update the data positioning information and execute step 2; otherwise, execute step 4.

4. For each data column to be generated, read the corresponding column data from each horizontal block in the set, append it to the newly generated data storage module, update the data positioning information, and execute step 2.

5. Write the updated metadata and data positioning information into the module, and end.
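The grouping logic of the merge steps can be sketched as follows. This is a simplified illustration: the function names plan_groups and merge_blocks are hypothetical, block size is measured in rows rather than bytes, the greedy grouping is one possible way to get "close to the default horizontal block data volume", and values are concatenated in memory instead of being copied page by page.

```python
def plan_groups(row_counts, target_rows):
    """Greedily group consecutive blocks so each group stays within target_rows."""
    groups, current, total = [], [], 0
    for i, n in enumerate(row_counts):
        if current and total + n > target_rows:
            groups.append(current)
            current, total = [], 0
        current.append(i)
        total += n
    if current:
        groups.append(current)
    return groups


def merge_blocks(blocks, target_rows):
    """blocks: list of {column: [values]} with at least one column each."""
    row_counts = [len(next(iter(b.values()))) for b in blocks]
    merged = []
    for group in plan_groups(row_counts, target_rows):
        if len(group) == 1:
            # Step 3: a lone block is appended unchanged.
            merged.append(blocks[group[0]])
        else:
            # Step 4: for each column, append the column data of every block.
            columns = blocks[group[0]].keys()
            merged.append({c: [v for i in group for v in blocks[i][c]]
                           for c in columns})
    return merged
```

Handling the single-block case separately avoids rewriting a block that is already large enough, which is the point of step 3.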

Data reading flow

Data reading supports two modes, random access and sequential access. In random access, the full-text index component matches the set of record Ids satisfying the query condition, and the data storage module is queried with this set to obtain the matching result data. In sequential access, all data in the data storage module is read out in order by scanning. The overall process is described in the section on the organization framework: after a query condition is obtained, the full-text index component of each segment yields an ordered set of hit Ids. This section details how, during random access, the record set is obtained from the ordered Id set by means of the data positioning information. The process can be divided into six steps:

1. The full-text index component in each segment stores the start and end Id numbers of the record set held in the data storage module of that segment. Using this information, the ordered Id set is mapped to the index segments, each segment receiving one ordered Id fragment.

2. The ordered Id fragment is divided among the horizontal block indexes according to the start and end Id numbers contained in each horizontal block in the data positioning module.

3. Each horizontal block index maps the selected columns to their column fragment indexes and outputs the Id fragment together with the corresponding column fragment position and dictionary position.

4. Each column fragment index maps the Id fragment to the data page indexes it covers and passes the column fragment position on to those data page indexes.

5. Each hit data page index computes the data page position from its recorded page position and the column fragment position, and outputs the data page position, the dictionary page position and the record Id set together.

6. The data device seeks to the data page in the data storage module, acquires the dictionary page, and scans the records in the data page in sequence until all selected records have been collected, at which point the operation ends.
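Steps 2 to 4 amount to routing each hit Id to the block and page whose start Id is the largest one not exceeding it. The sketch below shows this with binary search; the function name locate and the two input lists are assumptions made for illustration, standing in for the start Ids stored in the horizontal block indexes and data page indexes.

```python
from bisect import bisect_right


def locate(record_ids, block_start_ids, page_start_ids_per_block):
    """Route sorted record Ids to (block number, page number) buckets.

    block_start_ids: ascending start Id of each horizontal block.
    page_start_ids_per_block: for each block, ascending start Ids of its
    data pages (the first page starts at the block's start Id).
    """
    hits = {}
    for rid in record_ids:
        b = bisect_right(block_start_ids, rid) - 1
        if b < 0:
            continue  # Id precedes the first block of this segment
        p = bisect_right(page_start_ids_per_block[b], rid) - 1
        hits.setdefault((b, p), []).append(rid)
    return hits
```

Because the input Ids are ordered, the Ids collected for each page stay ordered as well, so every hit page needs to be scanned only once.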

Random access optimization techniques

To further accelerate random access, the device adopts two optimization measures: a dictionary caching mechanism and optimized page-level data acquisition.

Dictionary caching mechanism: dictionary encoding is used as a storage strategy of the data storage module. When the data takes values from a small range, it effectively improves the compression ratio and allows fast decoding, which greatly improves scan performance. To support both dictionary encoding and fast random reads, the device keeps decoded dictionary pages in memory; when a randomly accessed data page is dictionary-encoded, it can be decoded directly with the cached dictionary, which saves the cost of loading and decoding the dictionary page. The cache effectively improves access efficiency when many random accesses are performed.
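The caching idea can be illustrated with a small memoizing wrapper. This is a sketch under assumptions: DictionaryCache, decode_page and the load_page callable are hypothetical names, dictionary pages are keyed by their position, and no eviction policy is shown.

```python
class DictionaryCache:
    """Keeps decoded dictionary pages in memory, keyed by page position."""

    def __init__(self, load_page):
        # load_page is a hypothetical callable: position -> decoded list of values.
        self._load_page = load_page
        self._cache = {}
        self.loads = 0  # number of times a dictionary page was actually loaded

    def get(self, position):
        if position not in self._cache:
            self._cache[position] = self._load_page(position)
            self.loads += 1
        return self._cache[position]


def decode_page(codes, dictionary):
    """Turn the integer codes of a dictionary-encoded data page into values."""
    return [dictionary[code] for code in codes]
```

With the cache in place, repeated random reads that hit pages of the same column fragment pay the dictionary loading cost once rather than once per access.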

Optimized page-level data acquisition: random access already filters out irrelevant data pages and thereby speeds up data access. This optimization additionally tunes the decoding performed when records are fetched from the relevant data pages, which accelerates random access further.

Numeric fields stored by column are kept in a fixed-length layout in the data storage module, so once the Id number is known the offset of a value can be computed from the Id number and the field length. After the data device has located and decompressed the relevant data page, it seeks directly to the start of the value, reads it and returns it to the user. Compared with scanning, this saves redundant computation and pointer movement and so speeds up field retrieval.

String fields stored by column use prefix-suffix encoding in the data storage module: to obtain a string, the content of the preceding string must be obtained first. In random access, however, this mechanism causes unnecessary decoding and memory copying. Therefore, after the prefix and suffix lengths of the strings in a data page have been read, the device first seeks to the start of the suffix of the target string (the string to be obtained) and reads that suffix, then iteratively traces back through the related strings preceding the target string and copies the required content directly into the target string.

In addition, the optimization retains the content of the string with the largest Id number read so far in the current data page for use when subsequent records are fetched. The optimization effectively reduces unnecessary overhead and accelerates random access to the target string.
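Both page-level techniques can be sketched briefly. The code below rests on assumptions not stated in the patent: fixed-length values are read with a struct format string, the offset is taken relative to the start Id of the page, and each prefix-encoded entry is a (shared prefix length, suffix) pair in which the first entry of a page has prefix length 0. All function names are hypothetical.

```python
import struct


def read_fixed(page_bytes, page_start_id, record_id, fmt="<q"):
    """Read one fixed-length value by computing its offset from the Id."""
    width = struct.calcsize(fmt)
    offset = (record_id - page_start_id) * width
    return struct.unpack_from(fmt, page_bytes, offset)[0]


def encode_prefix(strings):
    """Encode each string as (length of prefix shared with the previous one, suffix)."""
    entries, prev = [], ""
    for s in strings:
        n = 0
        while n < min(len(s), len(prev)) and s[n] == prev[n]:
            n += 1
        entries.append((n, s[n:]))
        prev = s
    return entries


def get_string(entries, i):
    """Rebuild string i by tracing back only as far as its prefix requires."""
    prefix_len, suffix = entries[i]
    parts = [suffix]   # collected right to left
    need = prefix_len  # leading characters still missing
    j = i - 1
    while need > 0:
        p, s = entries[j]
        if need > p:
            # Characters p..need-1 of the target come from this entry's suffix.
            parts.append(s[:need - p])
            need = p
        j -= 1
    return "".join(reversed(parts))
```

The backtracking loop copies only the suffix slices that contribute to the target, so earlier strings are never fully reconstructed; caching the most recently decoded string, as the description suggests, would shorten the walk further for later lookups in the same page.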

Claims

1. A data device supporting efficient mass data analysis and retrieval, characterized in that it comprises several folders, each folder comprising a plurality of index segments; each index segment comprises a full-text index component, a data positioning module and a data storage module; wherein,

the full-text index component is used to store the inverted index information of the records in the index segment;

the data storage module comprises multiple horizontal blocks, each horizontal block contains multiple column fragments, and each column fragment contains multiple data pages for storing data records;

the data positioning module provides a nested index structure for the data storage module, which comprises the number of columns in a record, a column descriptor set, the compression mode of the data storage module, and a horizontal block index set; each horizontal block index stores the start Id of the records in the horizontal block, the horizontal block position, the position of each column fragment, and a column fragment index set; each column fragment index records the start position of the data pages in the column fragment and a data page index set; each data page index records the file position of the data page and the start Id of the records in the page.

2. The data device according to claim 1, wherein the ordered Id segment is divided among the horizontal block indexes according to the start Id number of the records of each horizontal block in the data positioning module.

3. The data device according to claim 1 or 2, wherein, according to the horizontal block record start Id numbers recorded in the full-text index component, the ordered record Id set is mapped to the index segments, each index segment containing one ordered Id segment.

4. The data device according to claim 3, wherein the data content stored in the data page is the data content encoded by a dictionary.

5. The data device according to claim 3, wherein the data content stored in the data page is data content encoded using a type-aware data encoding algorithm.

6. A data storage method based on the data device according to claim 1, the steps of which are:

1) reading an unstored record from the record set to be stored, obtaining the Id number of the record and its field set, building an inverted index for the specified fields, and writing the constructed inverted index information into the full-text index component;

2) writing each field in the field set into the corresponding column fragment of the data storage module, and, if the data written so far fills a data page, recording the Id number of the record and the offset of the record within the data storage module in the data positioning module;

3) repeating steps 1) and 2), and, if the records written so far reach the size of a horizontal block, recording the Id number of the record, the horizontal block position and the column fragment positions in the horizontal block in the data positioning module, and updating the column fragment index set.

7. The method according to claim 6, characterized in that, after step 3), horizontal blocks whose amount of data is less than a set threshold are obtained as horizontal blocks to be merged; if the number of horizontal blocks to be merged is 1, the horizontal block is appended directly to the end of a new data storage module and the data positioning module is updated; otherwise, the column data corresponding to each horizontal block to be merged is appended to a new data storage module, and the data positioning module is updated.

8. The method according to claim 6, wherein a dictionary cache mechanism is used to store the records in the record set to be stored.