Data device supporting efficient analysis and retrieval of massive data, and data storage method
Technical Field
The invention belongs to the field of data storage organization and relates to a data device and a data storage method that respond efficiently to analysis and retrieval workloads over massive data.
Background
Existing massive data processing technologies provide strong support for big data applications, but they also face technical difficulties. On the one hand, although data analysis systems excel at sequential reads, their performance is clearly insufficient for queries that carry filter conditions, and the problem is most pronounced when the filter is a full-text retrieval condition. On the other hand, scenarios that combine data retrieval and data analysis are increasingly important in practice. Most existing solutions run two separate systems, one for retrieval and one for analysis, to serve such mixed scenarios. Because each system adopts a different data storage strategy, this approach consumes a large amount of storage and computing resources and also requires a complex mechanism to keep the data in the two systems consistent.
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a data device and a data storage method oriented to massive data. The invention mainly covers three aspects: (1) a data device that combines full-text indexing with columnar storage; (2) a merging optimization technique for the data device; and (3) a random access optimization technique for the data device.
The invention comprises the following:
1) An organizational framework for the data device.
2) A data loading flow based on the data device.
3) A data merging optimization technique.
4) A data reading flow based on the data device.
5) A random access optimization technique for the reading flow.
The technical scheme of the invention is as follows:
A data device supporting efficient analysis and retrieval of massive data, characterized by comprising a plurality of folders, wherein each folder comprises a plurality of index segments, and each index segment comprises a full-text index component, a data positioning module and a data storage module; wherein,
the full-text index component is used for storing the inverted index information of the records in the index segment;
the data storage module comprises a plurality of horizontal blocks, each horizontal block comprises a plurality of column slices, and each column slice comprises a plurality of data pages for storing data records;
the data positioning module provides a nested index structure for the data storage module, the nested index structure comprising the number of columns in a record, a column descriptor set, the compression mode of the data storage module and a horizontal block index set; each horizontal block index stores the start Id of the records in the horizontal block, the position of the horizontal block, the position of each column slice and a column slice index set; each column slice index records the start position of the data pages in the column slice and a data page index set; and each data page index records the file position of the data page and the start Id of the records in the page.
Further, according to the start and end Id numbers of each horizontal block recorded in the data positioning module, each ordered Id fragment is divided among the horizontal block indexes.
Further, according to the start and end Id numbers of the records stored in the full-text index component, an ordered set of record Ids is mapped onto the index segments, so that each index segment receives one ordered Id fragment.
Further, the data content stored in the data page is dictionary-encoded.
Further, the data content stored in the data page is encoded with a type-aware data encoding algorithm.
A data storage method comprises the following steps:
1) reading an unstored record from a record set to be stored, acquiring the Id number and the field set of the record, building an inverted index for the specified fields, and writing the resulting inverted index information into the full-text index component;
2) writing each field in the field set into the corresponding column slice of the data storage module, and, if the data written so far fills a data page, recording the Id number of the record and its offset in the data storage module into the data positioning module;
3) repeating steps 1) and 2), and, if the records written so far fill one horizontal block, recording the Id number of the record, the position of the horizontal block and the positions of all column slices in the horizontal block into the data positioning module, and updating the column slice index set.
Further, after step 3), horizontal blocks whose data volume is smaller than a set threshold are acquired as horizontal blocks to be merged; if the number of horizontal blocks to be merged is 1, that horizontal block is appended directly to the end of a new data storage module and the data positioning module is updated; otherwise, the row data corresponding to each horizontal block to be merged is appended to a new data storage module and the data positioning module is updated.
Further, a dictionary caching mechanism is adopted to store the records in the record set to be stored.
Compared with the prior art, the invention has the following positive effects:
The device combines the characteristics of a columnar storage format with full-text indexing. It provides high throughput in data analysis scenarios and real-time response in data retrieval scenarios, which improves the performance of data analysis tasks that carry filter conditions and serves combined scenarios efficiently. The data device is suitable for data analysis over massive data and for scenarios that combine data analysis with data retrieval.
Drawings
Fig. 1 is an organization block diagram of the data device.
Detailed Description
Data device organization framework
The organizational framework of the data device is shown in Fig. 1. The data device is organized in folders, and each folder contains several independent index segments; each segment includes a full-text index component, a data positioning module and a data storage module. The full-text index component holds the inverted index information of all records in its segment and is used to evaluate query conditions quickly: it takes a query condition as input and outputs the set of hit record Ids. The data storage module uses a hybrid row-column layout: each data storage module contains several horizontal blocks, each horizontal block contains several column slices, and each column slice is a storage unit that holds the data of one column within the horizontal block. Each column slice consists of several data pages, and each data page may store content encoded with dictionary encoding or with one of several type-aware data encoding algorithms. If a data page uses dictionary encoding, a dictionary page is placed at the head of the column slice to which the data page belongs and is used to decode the dictionary-encoded data pages. The data storage module format inherits the high compression ratio and high throughput of columnar storage formats, and the horizontal block organization avoids the overhead of record reassembly, so the format works efficiently in data analysis scenarios.
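The relationship between a dictionary page and the data pages of one column slice can be illustrated with a minimal sketch. It assumes an in-memory representation; the function names, the dictionary layout and the `page_size` parameter (counted in records) are hypothetical and are not taken from the device itself.

```python
def encode_column_slice(values, page_size=3):
    """Dictionary-encode one column slice.

    page_size is a hypothetical number of records per data page.
    """
    dictionary = sorted(set(values))  # dictionary page: the distinct values
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in values]
    # Cut the code stream into data pages of at most page_size records.
    data_pages = [codes[i:i + page_size] for i in range(0, len(codes), page_size)]
    return {"dictionary_page": dictionary, "data_pages": data_pages}


def decode_column_slice(column_slice):
    """Decode every data page against the dictionary page at the slice head."""
    dictionary = column_slice["dictionary_page"]
    return [dictionary[code] for page in column_slice["data_pages"] for code in page]


cities = ["Paris", "Rome", "Paris", "Oslo", "Rome", "Paris", "Oslo"]
encoded = encode_column_slice(cities)
```

With few distinct values, each data page holds only small integer codes, which is where the compression benefit of dictionary encoding comes from.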
The data positioning module provides a nested index structure for the data storage module. At the top level it stores the number of data columns (that is, how many columns make up a record), a column descriptor set (information such as the name and data type of each column), the compression mode of the data storage module and a horizontal block index set. At the horizontal block level, each horizontal block index stores the start Id of the records in the block, the position of the block, the dictionary page position of each column slice and a column slice index set; if none of the data pages of a column slice uses dictionary encoding, its dictionary page position is null. At the column slice level, each column slice index records the start position of the data pages in the column slice and a data page index set. Each data page index records the file position of the data page and the start Id of the records in the page.
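As an illustration only, the nested index can be modeled with a few Python dataclasses. The class and field names below are hypothetical; the text specifies which pieces of information each level holds, not how they are laid out in memory or on disk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataPageIndex:
    file_position: int  # file position of the data page
    start_id: int       # Id of the first record in the page


@dataclass
class ColumnSliceIndex:
    data_page_start: int  # start position of the data pages in the slice
    pages: List[DataPageIndex] = field(default_factory=list)


@dataclass
class HorizontalBlockIndex:
    start_id: int   # Id of the first record in the block
    position: int   # position of the block in the data storage module
    # One entry per column slice; None when no page is dictionary-encoded.
    dictionary_page_positions: List[Optional[int]] = field(default_factory=list)
    column_slices: List[ColumnSliceIndex] = field(default_factory=list)


@dataclass
class ColumnDescriptor:
    name: str
    data_type: str


@dataclass
class DataPositioningModule:
    column_count: int
    columns: List[ColumnDescriptor]
    compression: str
    blocks: List[HorizontalBlockIndex] = field(default_factory=list)


# A tiny example: one block, two columns, only the first is dictionary-encoded.
module = DataPositioningModule(
    column_count=2,
    columns=[ColumnDescriptor("city", "string"), ColumnDescriptor("year", "int64")],
    compression="none",
    blocks=[
        HorizontalBlockIndex(
            start_id=0,
            position=0,
            dictionary_page_positions=[0, None],
            column_slices=[
                ColumnSliceIndex(64, [DataPageIndex(64, 0), DataPageIndex(128, 100)]),
                ColumnSliceIndex(192, [DataPageIndex(192, 0)]),
            ],
        )
    ],
)
```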
This organization effectively supports both data retrieval and data analysis. Given a query condition, the full-text index component returns the set of document Ids that satisfy it, the data positioning module locates the data pages that contain those Ids by random access, and the corresponding data is obtained by scanning the records in those pages. For a full scan, the number of data storage modules to scan is determined by the number of segments, the data storage modules are traversed in order, and all records are returned.
Data loading flow
Given a record set, the device reads the records in sequence, builds inverted index information and writes it into the full-text index component, then writes each record into the data storage module and updates the data positioning information. The process can be described as the following steps:
1. If there are records that have not been written into the data storage module, take one unprocessed record and go to step 2; otherwise go to step 6.
2. Obtain the Id number of the record and the field set it contains, and build an inverted index for the specified fields.
3. If there are fields that have not been written into the data storage module, take one unprocessed field and go to step 4; otherwise go to step 5.
4. Write the field into the corresponding column slice of the data storage module according to the user-defined field mapping. If the column data written so far fills a data page, record the Id of the record and its offset in the data storage module into the data positioning module. Go to step 3.
5. If the records written so far fill one horizontal block, record the Id, the horizontal block position and the column dictionary positions into the data positioning module, update each column slice index set, and go to step 1.
6. Write the meta information (that is, statistics obtained after all data has been loaded, such as the maximum and minimum values of a field and the number of records, together with the position of each horizontal block in the data storage module) into the data storage module, write the data positioning information into the data positioning module, and end.
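The loading steps above can be sketched in a few lines of Python. This is a simplified in-memory model, not the device's implementation: page and block sizes are counted in records (`PAGE_RECORDS` and `BLOCK_RECORDS` are hypothetical), the inverted index uses whitespace tokenization, the start Id of a page is noted when the page is opened rather than when it fills, and meta information such as min/max statistics is omitted.

```python
from collections import defaultdict

PAGE_RECORDS = 2   # hypothetical: records per data page
BLOCK_RECORDS = 4  # hypothetical: records per horizontal block


def load(records, columns, indexed_field):
    inverted = defaultdict(list)  # full-text index: term -> ordered record Ids
    blocks = []                   # data storage module: horizontal blocks
    block_index = []              # data positioning module (simplified)
    current = None
    for rec_id, fields in records:                      # steps 1 and 2
        for term in fields[indexed_field].lower().split():
            if not inverted[term] or inverted[term][-1] != rec_id:
                inverted[term].append(rec_id)
        if current is None:                             # open a new block
            current = {c: [[]] for c in columns}        # column -> data pages
            entry = {"start_id": rec_id, "block": len(blocks),
                     "page_start_ids": {c: [rec_id] for c in columns}}
            count = 0
        for c in columns:                               # steps 3 and 4
            pages = current[c]
            if len(pages[-1]) == PAGE_RECORDS:          # page full: open next
                pages.append([])
                entry["page_start_ids"][c].append(rec_id)
            pages[-1].append(fields[c])
        count += 1
        if count == BLOCK_RECORDS:                      # step 5
            blocks.append(current)
            block_index.append(entry)
            current = None
    if current is not None:                             # step 6: flush the tail
        blocks.append(current)
        block_index.append(entry)
    return dict(inverted), blocks, block_index


records = [
    (0, {"title": "big data", "year": 2010}),
    (1, {"title": "data page", "year": 2011}),
    (2, {"title": "column store", "year": 2012}),
    (3, {"title": "full text index", "year": 2013}),
    (4, {"title": "data index", "year": 2014}),
]
inverted, blocks, block_index = load(records, ["title", "year"], "title")
```

In this run the five records produce one full horizontal block of two pages per column and a second block holding the last record.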
Merging optimization techniques
To make a loaded record set retrievable within a short time, the device generates many segments with small data volumes during loading. To preserve indexing performance, several small segments must periodically be merged into one segment; both the input and the output of the merge (data positioning module and data storage module) follow the data device organization shown in Fig. 1. To preserve merging performance, the merge operates in units of data pages, and horizontal blocks with small data volumes are merged into horizontal blocks with large data volumes, which keeps both the merge itself and subsequent queries efficient.
The merging process in units of pages can be described as the following steps:
1. Read the metadata (statistics, horizontal block positions and so on) of all horizontal blocks that need to be merged.
2. If horizontal blocks remain to be merged, take a set of them whose total data volume is close to the default horizontal block data volume and go to step 3; otherwise go to step 5.
3. If the set contains exactly one horizontal block, append it directly to the end of the new data storage module, update the data positioning information and go to step 2; otherwise go to step 4.
4. For each data column to be generated, read the corresponding column data from each horizontal block in the set, append it to the newly generated data storage module, update the data positioning information and go to step 2.
5. Write the updated metadata and data positioning information into their modules, and end.
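A minimal sketch of the merge steps above follows. It assumes each horizontal block is a dict with a `start_id`, a record `count` and per-column lists of data pages; the greedy grouping rule and `DEFAULT_BLOCK_RECORDS` are hypothetical stand-ins for "a set whose data volume is close to the default horizontal block data volume". Existing data pages are moved into the new block as whole units and are not re-encoded, which is one way to read "merging in units of data pages".

```python
DEFAULT_BLOCK_RECORDS = 4  # hypothetical default horizontal block size


def group_blocks(blocks, target=DEFAULT_BLOCK_RECORDS):
    """Step 2: greedily group consecutive blocks up to the target size."""
    groups, current, size = [], [], 0
    for block in blocks:
        if current and size + block["count"] > target:
            groups.append(current)
            current, size = [], 0
        current.append(block)
        size += block["count"]
    if current:
        groups.append(current)
    return groups


def merge(blocks, columns):
    new_storage, new_index = [], []
    for group in group_blocks(blocks):
        if len(group) == 1:
            merged = group[0]  # step 3: append the block unchanged
        else:
            merged = {         # step 4: concatenate pages column by column
                "start_id": group[0]["start_id"],
                "count": sum(b["count"] for b in group),
                "columns": {c: [page for b in group for page in b["columns"][c]]
                            for c in columns},
            }
        # Update the data positioning information for the new block.
        new_index.append({"start_id": merged["start_id"],
                          "position": len(new_storage)})
        new_storage.append(merged)
    return new_storage, new_index


small_blocks = [
    {"start_id": 0, "count": 1, "columns": {"year": [[2010]]}},
    {"start_id": 1, "count": 2, "columns": {"year": [[2011, 2012]]}},
    {"start_id": 3, "count": 4, "columns": {"year": [[2013, 2014], [2015, 2016]]}},
    {"start_id": 7, "count": 1, "columns": {"year": [[2017]]}},
]
storage, index = merge(small_blocks, ["year"])
```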
Data reading flow
Data reading has two modes: random access and sequential access. In random access, the full-text index component matches the set of record Ids that satisfy the query conditions, and the data storage module is then queried with that set to obtain the matching result data. In sequential access, all data in the data storage module is read out in order by scanning. The overall process is described in the section on the data device organization framework: once a query condition is received, the full-text index component of each segment yields an ordered set of hit Ids. This section describes in detail how, during random access, the record set is obtained from that ordered Id set through the data positioning information. The process can be divided into six steps:
1. The full-text index component in each segment stores the start and end Id numbers of the record set held in that segment's data storage module. Using this information, the ordered Id set is mapped onto the index segments, so that each segment receives one ordered Id fragment.
2. According to the start and end Id numbers of each horizontal block recorded in the data positioning module, the ordered Id fragment is divided among the horizontal block indexes.
3. Each horizontal block index maps out the set of selected column slice indexes and outputs the Id fragment together with the corresponding column slice positions and dictionary positions.
4. Each column slice index maps the Id fragment onto its data page indexes and passes the column slice position on to them.
5. Each hit data page index computes the data page position from its recorded page position and the column slice position, and outputs the data page position, the dictionary page position and the record Id set together.
6. The data device seeks to the data page in the data storage module, obtains the dictionary page, and scans the records in the data page in order until all selected records have been collected.
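Steps 2 to 5 amount to repeatedly splitting a sorted Id list by sorted start Ids, which can be sketched with a binary search. The sketch below covers only the mapping from Ids to (horizontal block, data page) pairs within one segment; the function names are hypothetical, and it assumes every hit Id is at least as large as the first start Id.

```python
from bisect import bisect_right
from itertools import groupby


def split_by_start_ids(sorted_ids, start_ids):
    """Split sorted Ids into fragments, keyed by the range each Id falls in.

    Range k covers Ids from start_ids[k] up to (not including) start_ids[k + 1].
    """
    def range_of(record_id):
        return bisect_right(start_ids, record_id) - 1

    return {k: list(group) for k, group in groupby(sorted_ids, key=range_of)}


def locate(sorted_ids, block_start_ids, page_start_ids_per_block):
    """Return (block number, page number, Ids in that page) for every hit page."""
    hits = []
    by_block = split_by_start_ids(sorted_ids, block_start_ids)
    for block_no, fragment in by_block.items():
        by_page = split_by_start_ids(fragment, page_start_ids_per_block[block_no])
        for page_no, ids in by_page.items():
            hits.append((block_no, page_no, ids))
    return hits


# Two horizontal blocks starting at Ids 0 and 4; block 0 has pages starting
# at Ids 0 and 2, block 1 has a single page starting at Id 4.
hits = locate([1, 2, 3, 4], [0, 4], [[0, 2], [4]])
```

Because the hit Ids are ordered, each level needs only one pass over its fragment, and pages that contain no hit Id are never touched.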
Random access optimization techniques
To further accelerate random access, the device adopts two optimizations: a dictionary caching mechanism and optimized page-level data acquisition.
Dictionary caching mechanism: dictionary encoding is used as a storage strategy of the data storage module. When the values in a column vary within a small range, it effectively improves the compression ratio and allows fast decoding, which greatly improves scan performance. To support both dictionary encoding and fast random reads, the device keeps decoded dictionary pages in memory. When a randomly accessed data page is dictionary-encoded, its data can be decoded directly against the cached dictionary, which saves the overhead of loading and decoding the dictionary page. The cache is particularly effective when the number of random accesses is large.
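A minimal sketch of such a cache is given below. It assumes dictionary pages are identified by their position and that a caller supplies a `load_page` function that reads and decodes one dictionary page; both the class and that callback are hypothetical, and eviction is left out.

```python
class DictionaryCache:
    """Keeps decoded dictionary pages in memory, keyed by page position."""

    def __init__(self, load_page):
        self._load_page = load_page  # hypothetical: position -> list of values
        self._cache = {}
        self.loads = 0               # how many times a page was actually loaded

    def get(self, position):
        if position not in self._cache:
            self._cache[position] = self._load_page(position)
            self.loads += 1
        return self._cache[position]


def decode_page(codes, dictionary):
    """Decode one dictionary-encoded data page."""
    return [dictionary[code] for code in codes]


# Stand-in for reading and decoding a dictionary page from storage.
fake_storage = {0: ["Oslo", "Paris", "Rome"]}
cache = DictionaryCache(lambda position: fake_storage[position])

first = decode_page([1, 2, 1], cache.get(0))  # loads the dictionary page
second = decode_page([0, 2], cache.get(0))    # served from memory
```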
Optimized page-level data acquisition: random access already filters out irrelevant data pages, which speeds up data access. This optimization targets decoding while records are fetched from the relevant data pages, and thereby accelerates random access further.
Numeric fields stored by column are kept in a fixed-length layout in the data storage module, so once the Id number is known, the exact offset of the value can be computed from the Id and the field length. After locating and decompressing the relevant data page, the data device jumps straight to the start of the value, reads it and returns it to the user. Compared with scanning, this saves redundant computation and pointer movement and speeds up field retrieval.
String fields stored by column are kept in prefix-suffix encoding, so obtaining a string ordinarily requires first obtaining the string that precedes it. In random access this causes unnecessary decoding and memory copying. To avoid it, once the prefix and suffix lengths of the strings in the data page are known, the device first seeks to the start of the suffix of the target string (the string to be retrieved) and copies that suffix, then iterates backward through the preceding related strings and copies the required content directly into the target string.
In addition, the optimization retains the string content with the largest Id number in the current data page for use when later records are fetched. This effectively reduces unnecessary overhead and accelerates random access to the target string.
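Both page-level optimizations can be sketched as follows. The sketch assumes that a decompressed numeric page is a packed array of little-endian 64-bit integers, that the offset is `(Id - page start Id) * field length`, and that a string page has already been parsed into `(prefix length, suffix bytes)` entries; these layouts and all function names are hypothetical. Retaining the last decoded string for later lookups is not shown.

```python
import struct


def read_fixed(page, page_start_id, record_id, fmt="<q"):
    """Read one fixed-length value by computing its offset, without scanning."""
    width = struct.calcsize(fmt)
    return struct.unpack_from(fmt, page, (record_id - page_start_id) * width)[0]


def front_encode(strings):
    """Prefix-suffix encode a list of byte strings (helper for the example)."""
    entries, previous = [], b""
    for s in strings:
        p = 0
        while p < min(len(previous), len(s)) and previous[p] == s[p]:
            p += 1
        entries.append((p, s[p:]))  # shared prefix length, remaining suffix
        previous = s
    return entries


def decode_at(entries, i):
    """Decode string i by copying only the bytes it needs from earlier entries."""
    prefix_len, suffix = entries[i]
    result = bytearray(prefix_len + len(suffix))
    result[prefix_len:] = suffix  # copy the target suffix first
    need = prefix_len             # bytes [0, need) are still missing
    j = i - 1
    while need > 0:
        p, s = entries[j]
        if p < need:              # entry j's suffix supplies bytes [p, need)
            result[p:need] = s[:need - p]
            need = p
        j -= 1                    # otherwise entry j contributes nothing
    return bytes(result)


numeric_page = struct.pack("<4q", 10, 20, 30, 40)  # Ids 100..103
value = read_fixed(numeric_page, 100, 102)

words = [b"data", b"database", b"datum", b"index"]
entries = front_encode(words)
target = decode_at(entries, 2)
```

In `decode_at`, no intermediate string is ever materialized: each earlier suffix is consulted at most once and only the bytes that end up in the target are copied.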