WO2018119976A1 - Efficient data layout optimization method for data warehouse system

Info

Publication number: WO2018119976A1
Application number: PCT/CN2016/113364
Authority: WO
Inventors: 李挥; 李鑫; 危奕; 黄志浩; 朱兵
Original assignee: 日彩电子科技(深圳)有限公司
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2018-07-05
Also published as: CN110268397A; CN110268397B

Abstract

An efficient data layout optimization method for a data warehouse system, the method comprising the following steps: (A) arranging a data layout on the basis of block files; (B) performing column classification; and (C) storing table files. When an upper-layer data warehouse system processes large-scale structured data, the present invention provides higher query efficiency and occupies less storage space than conventional solutions.

Description

Efficient optimized data layout method applied to data warehouse system

[Technical Field]

The present invention relates to the field of data processing, and in particular, to an efficient and optimized data layout method applied to a data warehouse system.

[Background Art]

In today's big data era, processing large-scale archived data is one of the most important and complex challenges for data warehouse systems. Structured data is the most common type of data stored in database management systems. In a distributed system, the way the table structure of structured data is partitioned has a great impact on query efficiency and space efficiency. This impact arises from differences in the data processing efficiency of a single node and in the network data transmission between different nodes.

To make data processing in a distributed system as convenient as in a database management system, a number of higher-level data management techniques have been built on top of distributed systems such as Hadoop. In this context, the data layout scheme at the storage level greatly affects the processing efficiency of the system.

Row storage is a commonly used data layout structure that divides the data in a table by rows and then stores the partitioned data blocks on different data nodes, where each node stores the rows one after another on disk. Its shortcoming is that during a query the entire row must be loaded into memory even if some of its columns are not used, so unnecessary operations are performed and the query time is extended. Another common data layout structure is column storage, which divides the data in the table by columns and then stores different columns on different data nodes, each of which stores its columns one after another on disk. Its disadvantage is that the partial results obtained from different columns during a query must be transmitted between nodes to produce the final result. This increases data transmission overhead and reduces query efficiency.

On the other hand, in terms of storage space efficiency, distributed systems usually keep multiple copies to ensure data reliability, so that when a node fails the data can be obtained from other nodes. This approach requires the storage system to provide several times the original data size, which results in higher storage costs. Error correcting codes protect against data loss by generating redundant data. RDP codes are erasure codes commonly used in redundant arrays of disks, and they generate additional redundant data from the data on different disks. The redundant data is generated by row and diagonal operations and tolerates the failure of up to two disks in the system. This technique can also be applied to block file fault tolerance in a distributed system, but how to build and store the redundant data effectively remains a problem to be solved.

The RCFile proposed in the paper [Y. He, R. Lee, S. Zheng, N. Jain, Z. Xu, and X. Zhang, "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems," in ICDE, 2011] is a common data layout scheme applied to distributed storage systems. It combines row storage and column storage to organize the data within a file block. When table data needs to be stored, RCFile first divides the table file by rows into row groups of equal size, then stores the row groups in different regions of the file block, and within each row group stores the data contiguously in column order. This way of storing data avoids the drawbacks of pure row storage and pure column storage. However, RCFile's data compression method is relatively simple: every column in every row group is compressed and stored separately. Compressing all the data in this way is not suitable for reading commonly used data. For example, the primary key of a table is used in almost every query, and each such query has to decompress the column data once, which results in higher time and computational overhead. In addition, RCFile relies on the multi-copy mode of the underlying storage system for fault tolerance, which takes up more storage space than an error correcting code.

Zebra is a column-oriented data layout structure. To avoid the multi-node recombination of query results that is inherent in a column layout, it divides the columns of a data table into multiple column groups and stores each column group separately; within each column group, the data is stored in row format. Each column group consists of multiple columns, and one column can belong to several column groups. This storage method largely avoids having to reassemble query results from multiple nodes. However, the Zebra layout needs to group the columns of the table before the data table is stored, and for a given query there is no guarantee that all the columns it uses are in the same column group. In that case, the query result still requires reorganization of data rows between multiple nodes. Furthermore, because a column group resides on a single node and one column can be in multiple column groups at the same time, duplicate data is added to the original data, which increases storage overhead.

[Summary of the Invention]

In order to solve the problems in the prior art, the present invention provides an efficient and optimized data layout method applied to a data warehouse system, which solves the problem of increasing query overhead and occupying more storage space in the prior art.

The invention is realized by the following technical solution: an efficient and optimized data layout method applied to a data warehouse system is provided, comprising the following steps: (A) performing basic data layout of block files; (B) performing column classification processing; and (C) performing table file storage.

As a further improvement of the present invention, in step (A), the table file is horizontally divided into equal-sized row groups, and the row groups are then stored sequentially in the block file in a column storage manner. Each row group is composed of three parts: a synchronization part, a metadata part, and an actual data part. The synchronization part is used by the system to distinguish two adjacent row groups when reading data. The metadata part includes size information with which the system can distinguish different columns, and different fields in each column, within the row group, together with column classification information with which the system distinguishes different kinds of columns. The actual data part is used for storing actual data.

As a further improvement of the present invention, in step (B), a column classification strategy based on frequency of use is applied to reduce the decompression cost of commonly used columns, and the columns are divided into query columns and code columns.

As a further improvement of the present invention, in step (C), data is stored using both copies and RDP code check blocks.

As a further improvement of the present invention, the matrix generated by an RDP code is a file group. When stored, each data block of a file group is kept as two copies, and two RDP code check blocks generated from these storage blocks are stored as well. The two copies are stored on different nodes, and the two check blocks are stored on nodes that do not contain any data block of the file group.

As a further improvement of the present invention, an RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is an arbitrary prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. The RDP check data is divided into a row check block and a diagonal check block: the row check block is obtained by adding (exclusive OR) the information data along rows, and the diagonal check block is obtained by adding the information data along diagonals. The RDP code organizes the information block files to generate check files.

As a further improvement of the present invention, in step (B), the coding threshold is used to divide the data columns: columns whose frequency of use is greater than or equal to the coding threshold are classified as query columns, and columns whose frequency of use is less than the coding threshold are classified as code columns.

The beneficial effects of the invention are as follows: when processing large-scale structured data, an upper-layer data warehouse system obtains a higher query rate and occupies less storage space than with conventional solutions. In terms of query rate, different coding thresholds can be set to meet data management needs. Generally speaking, the smaller the coding threshold, the higher the query rate and the larger the storage space occupied by the data. Data warehouse managers can set a reasonable coding threshold according to actual business requirements, so that the data warehouse system achieves a good compromise between query rate and space occupation. In terms of storage space, the construction parameter determines the sizes of the row group and the file group. Constructing row groups avoids additional data reading and result reorganization during queries. Constructing file groups lets the data table occupy less fault-tolerance space than the three-copy method while providing fault tolerance no weaker than that of three copies. This storage method allows the system to take up less storage space for data fault tolerance, saving physical storage costs.

[Description of the Drawings]

Figure 1 is a schematic diagram of the space occupation of the present invention;

Figure 2 is a schematic diagram of the query rate of the present invention.

[Detailed Description]

The invention will now be further described with reference to the drawings and specific embodiments.

Abbreviations and definitions of key terms

RCFile: Record Columnar File (row-column storage file layout)

EStore: Effective Store (efficient data layout storage system)

RDP: Row-Diagonal Parity (row-diagonal check)

An efficient and optimized data layout method applied to a data warehouse system includes the following steps: (A) performing block file basic data layout; (B) performing column classification processing; and (C) performing table file storage.

In step (A), the table file is horizontally divided into equal-sized row groups, and the row groups are then stored sequentially in the block file in a column storage manner. Each row group is composed of three parts: a synchronization part, a metadata part, and an actual data part. The synchronization part is used by the system to distinguish two adjacent row groups when reading data. The metadata part includes size information with which the system can distinguish different columns, and different fields in each column, within the row group, together with column classification information with which the system distinguishes different kinds of columns. The actual data part is used to store the actual data.

In step (B), a column classification strategy based on frequency of use is applied to reduce the decompression cost of commonly used columns, and the columns are divided into query columns and code columns.

In step (C), the data is stored using both copies and RDP code check blocks.

The matrix generated by an RDP code is a file group. When stored, each data block of a file group is kept as two copies, and two RDP code check blocks generated from these storage blocks are stored as well. The two copies are stored on different nodes, and the two check blocks are stored on nodes that do not contain any data block of the file group.

An RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is an arbitrary prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. The RDP check data is divided into a row check block and a diagonal check block: the row check block is obtained by adding (exclusive OR) the information data along rows, and the diagonal check block is obtained by adding the information data along diagonals. The RDP code organizes the information block files to generate check files.

In step (B), the coding threshold is used to divide the data columns: columns whose frequency of use is greater than or equal to the coding threshold are classified as query columns, and columns whose frequency of use is less than the coding threshold are classified as code columns.

The EStore of the present invention is used for the data layout of the underlying storage file blocks of a distributed storage system. The optimized layout improves the query execution rate of the system while reducing the storage space that the data occupies for error correction.

In a specific embodiment, EStore first divides the table file into equal-sized row groups for the block file data layout, and then stores the row groups in the block file in a column storage manner. The size of the row group is determined by the construction parameter of the system, which also determines the size of the file group, as described in the next section.

In a block-based distributed storage system, files are partitioned into blocks and stored on different nodes. The EStore system stores the row groups in such blocks. A row group has three parts. The first part is the synchronization part, which the system uses to distinguish two adjacent row groups when reading data. The second part is the metadata part, which contains the size information with which the system can distinguish different columns, and different fields in each column, within the row group. It also contains column classification information with which the system distinguishes different types of columns. The third part is the actual data part, which stores the actual data, organized column by column in column storage.
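The three-part row group described above can be pictured with a minimal Python sketch. The sync marker value, the JSON metadata encoding, the 4-byte length prefix, and the names `SYNC_MARKER`, `build_row_group` and `read_metadata` are all illustrative assumptions; the source does not specify a byte-level format.

```python
import json
import struct

# Hypothetical 16-byte sync marker; the source does not give its value.
SYNC_MARKER = b"\xff" * 16


def build_row_group(columns, kinds):
    """Serialize one row group as: sync part | metadata part | actual data part.

    columns: dict mapping column name -> list of field values (strings)
    kinds:   dict mapping column name -> "query" or "code"
    """
    data_parts = []
    meta = []
    for name, values in columns.items():
        encoded = [v.encode("utf-8") for v in values]
        # Column storage: all fields of one column are laid out contiguously.
        data_parts.append(b"".join(encoded))
        meta.append({
            "column": name,
            "kind": kinds[name],                        # column classification info
            "field_sizes": [len(e) for e in encoded],   # size of every field
        })
    meta_bytes = json.dumps(meta).encode("utf-8")
    # Assumed 4-byte length prefix so a reader knows where the metadata ends.
    header = SYNC_MARKER + struct.pack(">I", len(meta_bytes))
    return header + meta_bytes + b"".join(data_parts)


def read_metadata(row_group):
    """Parse the metadata part back out of a serialized row group."""
    if not row_group.startswith(SYNC_MARKER):
        raise ValueError("row group does not start with the sync marker")
    start = len(SYNC_MARKER)
    (meta_len,) = struct.unpack(">I", row_group[start:start + 4])
    return json.loads(row_group[start + 4:start + 4 + meta_len])
```

With the field sizes and the classification recorded in the metadata part, a reader could locate a single column inside the data part without scanning the other columns.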

Column classification processing:

A column classification strategy based on frequency of use is used to reduce the decompression cost of common columns. The system divides the columns in a data table into two types, one for the query column and the other for the code column. Each column is divided into one of these types.

The system uses the coding threshold to divide the data columns. In a typical data warehouse system, some queries are usually executed periodically for decision making or data mining. Such a query is considered to have a frequency of use, and each query tends to use the same columns of a data table every time it runs. Before the data table is stored, it is preprocessed: for each column in the table, the queries that use the column are identified, and the sum of the frequencies of those queries is the frequency of use of the column. In this way, the frequency of use of every column is obtained. With a reasonable coding threshold set, columns whose frequency of use is greater than or equal to the coding threshold are classified as query columns, and columns whose frequency of use is less than the coding threshold are classified as code columns.
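The preprocessing step can be sketched in Python as follows. The function names and the representation of a query as a (frequency, set of columns) pair are assumptions made for illustration; the source only describes summing query frequencies per column and comparing the sum with the coding threshold.

```python
from fractions import Fraction


def column_frequencies(queries):
    """Sum, for every column, the frequencies of the queries that use it.

    queries: iterable of (frequency, set_of_column_names) pairs.
    """
    freq = {}
    for query_freq, cols in queries:
        for col in cols:
            freq[col] = freq.get(col, 0) + query_freq
    return freq


def classify_columns(all_columns, queries, threshold):
    """Label each column "query" (frequency >= threshold) or "code"."""
    freq = column_frequencies(queries)
    return {
        col: "query" if freq.get(col, 0) >= threshold else "code"
        for col in all_columns
    }


# Hypothetical workload: 30 equally frequent queries over 7 made-up columns.
# Every query uses item_id; each other column is used by 5 of the 30 queries.
others = ["title", "price", "brand", "stock", "category", "seller"]
workload = [(Fraction(1, 30), {"item_id", col}) for col in others for _ in range(5)]
labels = classify_columns(["item_id"] + others, workload, Fraction(1, 5))
```

In this made-up workload `item_id` has frequency 1 and every other column 1/6, so with a threshold of 0.2 only `item_id` becomes a query column. `Fraction` is used only to keep the comparison at the threshold exact.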

Here is an example to illustrate the column classification method, as shown in the following table. The data table holds the product information of a shopping site and has 7 columns. In the daily management of this mall, 30 queries need to be executed periodically to support the mall's decision making. Assume that these 30 queries are used with the same frequency, i.e. 1/30. For each column, the queries that use the column are counted, and the frequencies of use of these queries are added to obtain the frequency of use of the column. If the coding threshold is set to 0.2, the first column is a query column and the remaining columns are code columns.

TABLE I. COLUMNS OF THE TABLE ITEM

When the columns of a row group are actually stored, they are stored in different forms depending on their kind. Query columns must be read quickly, so their data is stored in its original format. Code columns are stored with storage space in mind, so they are compressed with a common data compression algorithm before being stored. The classification information for the columns is saved in the second part of the row group. This reduces how often data must be decompressed during a query and improves the query efficiency of the system. The data system manager can set different coding thresholds so that the system achieves a good balance between query rate and storage space.
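A minimal sketch of storing a column according to its kind is given below. The source only says "a common data compression algorithm"; the choice of `zlib`, the newline-joined field encoding, and the function names are assumptions for illustration.

```python
import zlib


def encode_column(values, kind):
    """Return the stored bytes of one column of a row group.

    Query columns keep their original bytes; code columns are compressed.
    Assumes field values are strings that contain no newline characters.
    """
    raw = "\n".join(values).encode("utf-8")
    return raw if kind == "query" else zlib.compress(raw)


def decode_column(stored, kind):
    """Inverse of encode_column; only code columns pay a decompression cost."""
    raw = stored if kind == "query" else zlib.decompress(stored)
    return raw.decode("utf-8").split("\n")
```

Reading a query column skips `zlib.decompress` entirely, which is the saving the classification is meant to provide; the price is that query columns occupy their full uncompressed size.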

Table file storage:

EStore uses RDP codes to make data fault tolerant. An RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is an arbitrary prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. Of the two check columns, the first is obtained by adding (exclusive OR) the information data along rows and is called the row check block. The second is obtained by adding the information data along diagonals and is called the diagonal check block. The main problem in applying RDP codes to block-level checking in a distributed storage system is how to organize the information block files to generate the check files.
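The construction of the two check columns can be sketched as below. This follows the usual RDP convention, which is an assumption here because the source does not spell it out: cell (i, j) lies on diagonal (i + j) mod p, the diagonals run over the data columns and the row check column, and diagonal p-1 is not stored. The function names are illustrative.

```python
def xor_bytes(a, b):
    """Bytewise exclusive OR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rdp_encode(data, p):
    """Extend a (p-1) x (p-1) grid of symbols to a (p-1) x (p+1) RDP matrix.

    data[i][j] is an equal-length bytes symbol; column p-1 of the result is
    the row check column and column p is the diagonal check column.
    """
    rows = p - 1
    if len(data) != rows or any(len(r) != p - 1 for r in data):
        raise ValueError("data must be a (p-1) x (p-1) grid")
    zero = bytes(len(data[0][0]))

    matrix = []
    for r in data:
        row_check = zero
        for sym in r:
            row_check = xor_bytes(row_check, sym)
        matrix.append(list(r) + [row_check])

    # Diagonal d collects cells (i, j) with (i + j) % p == d over columns
    # 0..p-1 (data plus row check). Row index p-1 does not exist, so it is skipped.
    for d in range(rows):
        diag_check = zero
        for j in range(p):
            i = (d - j) % p
            if i < rows:
                diag_check = xor_bytes(diag_check, matrix[i][j])
        matrix[d].append(diag_check)
    return matrix
```

For p = 5 this yields a 4×6 matrix: four information columns, one row check column and one diagonal check column.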

EStore uses the construction parameter to determine the size of this check matrix. EStore defines the construction parameter as an arbitrary prime number greater than 2. If the construction parameter is k, the RDP generation matrix has size (k-1)×(k+1), that is, there are k+1 files in total and each file is internally divided into k-1 blocks. As described in the previous section, file blocks are composed of equal-sized row groups, so the row group is treated as the basic symbol of the RDP generation matrix and each file block contains k-1 row groups. The size of a row group in a file block is therefore determined by the block size and the construction parameter.

The matrix generated by each RDP code in EStore is referred to as a file group. Generally speaking, a large-scale table file contains a large number of storage blocks. EStore divides these storage blocks into file groups according to the construction parameter, so that each file group contains k-1 storage blocks. It then uses these blocks to generate 2 check blocks for each file group, so that each file group eventually contains k+1 file blocks.
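The sizing rules above reduce to simple arithmetic, sketched below. The function names are illustrative, and the handling of a final file group with fewer than k-1 blocks is not described in the source, so the sketch simply leaves it short.

```python
def row_group_size(block_size, k):
    """Size of one row group when every file block holds k-1 row groups."""
    return block_size // (k - 1)


def split_into_file_groups(block_ids, k):
    """Group consecutive data blocks into file groups of k-1 data blocks.

    The last group may hold fewer blocks; how such a group would be padded
    is an open detail that the source does not address.
    """
    n = k - 1
    return [block_ids[i:i + n] for i in range(0, len(block_ids), n)]


def blocks_per_file_group(k):
    """k-1 data blocks plus 2 generated check blocks."""
    return (k - 1) + 2
```

For example, with a 64 MiB block and k = 5, each block would hold four 16 MiB row groups and each full file group would span six blocks.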

The following figure shows the construction of a file group of a data table, where the construction parameter is 5, blocks 0 to 3 are data blocks, and blocks 4 and 5 are check blocks. The row groups in block 4 hold all the row check symbols. For example, r_{0,4} is the exclusive OR of the row groups {r_{0,0}, r_{0,1}, r_{0,2}, r_{0,3}}. Block 5 holds all the diagonal check symbols. For example, r_{0,5} is the exclusive OR of the row groups {r_{0,0}, r_{3,2}, r_{2,3}, r_{1,4}}.
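The membership of the diagonal in this example can be checked with a few lines of Python. The sketch assumes the usual RDP indexing, in which the symbol in row i of block j lies on diagonal (i + j) mod p and the diagonals run over blocks 0 to p-1; the helper name is illustrative.

```python
def diagonal_members(d, p):
    """Return the (row, block) positions XORed into diagonal check symbol d.

    Only rows 0..p-2 exist, so the position that would fall in row p-1
    is skipped.
    """
    members = []
    for j in range(p):          # blocks 0..p-1: data blocks plus the row check block
        i = (d - j) % p
        if i < p - 1:
            members.append((i, j))
    return members


# Diagonal 0 for construction parameter 5.
diag0 = diagonal_members(0, 5)
```

Under that indexing, `diag0` is `[(0, 0), (3, 2), (2, 3), (1, 4)]`, which matches the row groups listed for the first diagonal check symbol.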

Data is stored in EStore using both copies and check blocks. Each data block of a file group is kept as two copies in the storage system, and the two RDP code check blocks generated from these storage blocks are stored as well. The system still uses copies because RDP code recovery requires a large transmission bandwidth when a file is restored. Therefore, for each data block, the system stores one more copy on another node, so that when a single node fails the data block can still be obtained by transferring the copy. Only when the two nodes storing the same data block fail at the same time does the data block have to be restored by RDP code data recovery. Since this situation does not occur frequently in a distributed storage system, the transmission bandwidth of RDP code repair remains acceptable as long as the construction parameter is not very large.
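The bandwidth cost of RDP repair can be seen in a small sketch of the simplest case, where both copies of one data block are lost and each of its row groups is rebuilt from the row check symbol. The function names are illustrative, and full RDP recovery from two lost blocks, which also needs the diagonal check symbols, is not shown.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive OR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_from_row_parity(row, missing):
    """Rebuild the symbol at index `missing` of one matrix row.

    row: list holding the data symbols of the row followed by the row check
    symbol; the entry at `missing` is ignored (it may be None).
    """
    survivors = [sym for idx, sym in enumerate(row) if idx != missing]
    return reduce(xor_bytes, survivors)
```

Rebuilding one row group this way reads every surviving symbol of the row, that is, k-1 symbols from k-1 different blocks, whereas restoring from a copy transfers only one block.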

In EStore's file group storage, the two copies of each data block are stored on different nodes, and the two check blocks are stored on nodes that do not contain any data block of the file group, so that when both copies of an arbitrary data block are corrupted at the same time, the original data block can be restored using the RDP repair method.
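One possible placement that satisfies these constraints is sketched below. The round-robin policy, the reservation of the first nodes for check blocks, and the function name are assumptions; the source states only the constraints, not how nodes are chosen.

```python
def place_file_group(data_blocks, check_blocks, nodes):
    """Map every block of one file group to the nodes that store it.

    Each data block gets two copies on different nodes; each check block gets
    one node that holds no data block of this file group.
    """
    n_check = len(check_blocks)
    if len(nodes) < n_check + 2:
        raise ValueError("need at least two data nodes plus one node per check block")
    check_nodes = nodes[:n_check]     # reserved: never receive data blocks
    data_nodes = nodes[n_check:]

    placement = {}
    for blk, node in zip(check_blocks, check_nodes):
        placement[blk] = [node]
    for i, blk in enumerate(data_blocks):
        first = data_nodes[i % len(data_nodes)]
        second = data_nodes[(i + 1) % len(data_nodes)]   # differs from first
        placement[blk] = [first, second]
    return placement
```

Keeping the check blocks away from the data blocks means that the loss of a node holding check data never removes a copy of a data block at the same time.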

The present invention uses three optimization strategies to improve the data processing performance of the data warehouse system, which is manifested in the improvement of the query rate and the reduction of the fault-tolerant space occupation.

In terms of the basic data layout of block files, the data is organized within HDFS file blocks. The table file of the relational data is first divided horizontally into equal-sized row groups; each HDFS file block stores one or more row groups, and within each row group the data is stored in a column storage manner.

In the column classification strategy, the data table is preprocessed before it is stored: the frequency of use of each column is counted, and a coding threshold is set to classify the columns. Columns below the coding threshold are code columns and are stored in compressed form; columns at or above the coding threshold are query columns and are stored in their native format.

For data table file storage, the RDP code is used to construct the error correction strategy, and the data table file is divided into file groups whose size is determined by the construction parameter. Within each file group, the system generates two additional check file blocks for data repair. The system performs data fault tolerance with two copies plus check blocks: each data block in a file group is saved on two different nodes, and the check blocks are saved on other nodes.
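The space cost of this strategy can be estimated with a short calculation. The formula below is derived from the description (two copies of each of the k-1 data blocks plus two check blocks per file group) rather than stated in the source, it counts blocks only and ignores column compression, and the function name is illustrative.

```python
def estore_blocks_per_data_block(k):
    """Stored blocks per original data block: (2*(k-1) + 2) / (k-1)."""
    return (2 * (k - 1) + 2) / (k - 1)


# Compare against plain three-copy storage, which stores 3 blocks per data block.
for k in (3, 5, 7):
    print(k, estore_blocks_per_data_block(k))
```

Under this reading the ratio is 2 + 2/(k-1), so it equals 3 for k = 3 and falls below the three-copy cost for any larger prime, for example 2.5 for k = 5.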

The system stores table files using two important parameters: the construction parameter and the coding threshold. By setting these parameters differently, the system can meet specific data management requirements for query efficiency and space occupation.

Performance evaluation is performed by specific examples below:

The EStore system was built on an actual distributed system, and its performance was compared with that of the existing common data layout structure RCFile in terms of storage space occupation and query rate. In this section, the parameter t denotes the coding threshold.

Figure 1 shows the storage space occupied in a distributed file system, including fault-tolerance measures, under different data layouts. It can be observed that EStore significantly reduces the space occupation of data in the case of t=0.4. Different coding thresholds lead to different storage space sizes, depending on the number of query columns and the type of data.

Figure 2 shows the difference in query rates for different data layouts. It can be seen that the EStore query rate is higher than RCFile under all coding thresholds, because EStore reduces the time consumption of data decompression during the query process.

The performance comparison between RCFile and EStore with different coding thresholds covers space occupation and query execution time. The above experimental results reflect the performance advantages of EStore's column classification and fault tolerance strategies.

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific embodiments of the present invention are not limited to this description. It will be apparent to those skilled in the art that modifications may be made to the present invention without departing from its spirit and scope.

Claims

1. An efficient and optimized data layout method applied to a data warehouse system, comprising: (A) performing basic data layout of block files; (B) performing column classification processing; and (C) performing table file storage.
2. The efficient and optimized data layout method applied to a data warehouse system according to claim 1, wherein in step (A), the table file is horizontally divided into equal-sized row groups, and the row groups are then stored sequentially in the block file in a column storage manner; each row group is composed of three parts, namely a synchronization part, a metadata part, and an actual data part; the synchronization part is used by the system to distinguish two adjacent row groups when reading data; the metadata part includes size information with which the system can distinguish different columns, and different fields in each column, within the row group, and column classification information with which the system distinguishes different kinds of columns; and the actual data part is used for storing actual data.
3. The efficient and optimized data layout method applied to a data warehouse system according to claim 1, wherein in step (B), a column classification strategy based on frequency of use is applied to reduce the decompression cost of commonly used columns, and the columns are divided into query columns and code columns.
4. The efficient and optimized data layout method applied to a data warehouse system according to claim 1, wherein in step (C), the data is stored using both copies and RDP code check blocks.
5. The efficient and optimized data layout method applied to a data warehouse system according to claim 4, wherein the matrix generated by an RDP code is a file group; when stored, each data block of a file group is kept as two copies, and two RDP code check blocks generated from these storage blocks are stored as well; the two copies are stored on different nodes, and the two check blocks are stored on nodes that do not contain any data block of the file group.
6. The efficient and optimized data layout method applied to a data warehouse system according to claim 4, wherein an RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is an arbitrary prime number greater than 2; the last two columns of each matrix are generated check data, and the other columns store information data; the RDP check data is divided into a row check block and a diagonal check block; the row check block is obtained by XORing the information data along rows, and the diagonal check block is obtained by XORing the information data along diagonals; and the RDP code organizes the information block files to generate check files.
7. The efficient and optimized data layout method applied to a data warehouse system according to claim 3, wherein in step (B), the coding threshold is used to divide the data columns: columns whose frequency of use is greater than or equal to the coding threshold are classified as query columns, and columns whose frequency of use is less than the coding threshold are classified as code columns.