WO2018119976A1 - Efficient data layout optimization method for data warehouse system - Google Patents

Publication number
WO2018119976A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
column
file
columns
block
Prior art date
Application number
PCT/CN2016/113364
Other languages
French (fr)
Chinese (zh)
Inventor
李挥
李鑫
危奕
黄志浩
朱兵
Original Assignee
日彩电子科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日彩电子科技(深圳)有限公司
Priority to PCT/CN2016/113364 (WO2018119976A1)
Priority to CN201680090379.7A (CN110268397B)
Publication of WO2018119976A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the field of data processing, and in particular to an efficient, optimized data layout method for data warehouse systems.
  • Structured data is the most common type of data stored in database management systems.
  • In a distributed system, the way the tables of structured data are partitioned has a large effect on query and space efficiency, because processing data on a single node and transferring data over the network between nodes differ greatly in cost.
  • Row storage is a commonly used data layout that partitions the rows of a table into blocks and stores the blocks on different data nodes, where each node writes the rows to disk one after another. Its shortcoming is that a query must load entire rows into memory, including columns it never uses, and process them unnecessarily, which lengthens query time.
  • Another common data layout is column storage, which partitions a table by column and stores different columns on different data nodes, where each node writes its columns to disk one after another. Its disadvantage is that the partial results obtained from different columns must be transferred between nodes to produce the final result, which adds data transmission overhead and reduces query efficiency.
  • RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
  • RCFile, proposed in that paper, is a common data layout scheme for distributed storage systems. It combines row storage and column storage to organize the data within a file block. When table data is stored, the table file is first divided by rows into row groups of equal size, and the row groups are stored in different regions of the file block; within each row group, the data is stored contiguously in column order. This avoids the drawbacks of both pure row storage and pure column storage.
  • However, RCFile's data compression is simplistic: every column in every row group is compressed and stored separately. Compressing all of the data in this way is ill-suited to reading frequently used data.
  • For example, the primary key of a table is used in almost every query, and each such query must decompress that column's data once, which incurs extra time and computation overhead.
  • In addition, RCFile relies on the multi-replica mechanism of the underlying storage system for fault tolerance, which takes up more storage space than an error correction code would.
  • Zebra is a column-oriented data layout. To avoid the multi-node result reassembly inherent in a pure column layout, it divides the columns of a table into multiple column groups and stores each column group separately; within each stored column group, the data is kept in row format. Each column group consists of multiple columns, and one column can belong to several column groups. This largely avoids query results being spread across multiple nodes.
  • However, Zebra must group the columns of a table before the table is stored, and there is no guarantee that all columns used by a given query fall in the same column group. When they do not, rows still have to be reassembled across multiple nodes to produce the query result. Moreover, so that the columns of a group sit on the same node, one column may be placed in several column groups at once, which duplicates data and increases storage overhead.
  • The present invention provides an efficient, optimized data layout method for data warehouse systems, which addresses the increased query overhead and larger storage footprint of the prior art.
  • The invention is realized by the following technical solution: an efficient, optimized data layout method for data warehouse systems is designed and implemented, comprising the following steps: (A) laying out the basic data of block files; (B) performing column classification; (C) storing table files.
  • The table file is first divided horizontally into row groups of equal size, and the row groups are then stored one after another in the block file in column-storage format. Each row group consists of three parts: a synchronization part, a metadata part, and an actual data part. The synchronization part lets the system distinguish two adjacent row groups when reading data. The metadata part holds the size information the system needs to tell apart the columns in the row group and the fields within each column, together with column classification information that lets the system distinguish the different kinds of columns. The actual data part stores the actual data.
  • A column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns; columns are divided into query columns and code columns.
  • In step (C), data is stored using both replicas and RDP code check blocks.
  • The matrix generated by an RDP code is a file group. Each data block of a file group is stored as two replicas, along with two RDP code check blocks generated from the group's data blocks. The two replicas are stored on different nodes, and the two check blocks are stored on nodes that hold no data block of that file group.
  • An RDP code generation group is a (p-1) × (p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. The check data comprises row check blocks, obtained by adding the information data along each row, and diagonal check blocks, obtained by adding the information data along each diagonal. The RDP code organizes the information block files to generate the check files.
  • An encoding threshold is used to classify the data columns: columns whose frequency of use is greater than or equal to the encoding threshold are query columns, and columns whose frequency of use is below it are code columns.
  • The beneficial effects of the invention are that the upper-layer data warehouse system obtains a faster query rate and occupies less storage space than conventional solutions when processing large-scale structured data. In terms of query rate, different encoding thresholds can be set to meet data management needs: generally, the smaller the encoding threshold, the higher the query rate and the larger the storage space occupied by the data. By setting a reasonable encoding threshold according to actual business requirements, data warehouse administrators can strike a good compromise between query rate and space occupation. In terms of storage space, the construction parameter determines the sizes of the row group and the file group. Building row groups avoids additional data reads and result reassembly during queries; building file groups makes the fault-tolerance space of a data table smaller than that of three-replica fault tolerance while providing fault tolerance no lower than three replicas. The system therefore occupies less storage space for data fault tolerance, which saves physical storage cost.
  • FIG. 1 is a schematic view of space occupation of the present invention
  • FIG. 2 is a schematic diagram of a query rate according to the present invention.
  • An efficient, optimized data layout method for data warehouse systems includes the following steps: (A) laying out the basic data of block files; (B) performing column classification; and (C) storing table files.
  • The table file is first divided horizontally into row groups of equal size, and the row groups are then stored one after another in the block file in column-storage format. Each row group consists of three parts: a synchronization part, which lets the system distinguish two adjacent row groups when reading data; a metadata part, which holds the size information the system needs to tell apart the columns in the row group and the fields within each column, together with column classification information for distinguishing the different kinds of columns; and an actual data part, which stores the actual data.
  • A column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns; columns are divided into query columns and code columns.
  • The data is stored using both replicas and RDP code check blocks.
  • The matrix generated by an RDP code is a file group. Each data block of a file group is stored as two replicas, along with two RDP code check blocks generated from the group's data blocks. The two replicas are stored on different nodes, and the two check blocks are stored on nodes that hold no data block of that file group.
  • An RDP code generation group is a (p-1) × (p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. The check data comprises row check blocks, obtained by adding the information data along each row, and diagonal check blocks, obtained by adding the information data along each diagonal. The RDP code organizes the information block files to generate the check files.
  • An encoding threshold is used to classify the data columns: columns whose frequency of use is greater than or equal to the encoding threshold are query columns, and columns whose frequency of use is below it are code columns.
  • The EStore of the present invention defines the data layout of the underlying storage file blocks of a distributed storage system. Its optimized layout raises the system's query execution rate while reducing the storage space that data occupies for error correction.
  • In the block file data layout, EStore first divides the table file into row groups of equal size and then stores the row groups in the block file in column-storage format.
  • The size of a row group is determined by the system's construction parameter, which also determines the size of a file group, as described in the next section.
  • The EStore system stores the row groups in such blocks. Each row group consists of three parts.
  • The first part is the synchronization part, which the system uses to distinguish two adjacent row groups when reading data.
  • The second part is the metadata part, which holds the size information the system needs to tell apart the columns in the row group and the fields within each column. It also includes column classification information that lets the system distinguish the different types of columns.
  • The third part is the actual data part, which stores the actual data, organized column by column in column-storage format.
  • A column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns.
  • The system divides the columns of a data table into two types, query columns and code columns. Each column belongs to exactly one of these types.
  • The system uses an encoding threshold to classify the data columns.
  • Some queries are executed periodically, for example for decision support or data mining. Such a query is considered to have a frequency of use, and each execution typically uses the same columns of a data table.
  • The data table is preprocessed: for each column, the queries that use it are identified, and the sum of those queries' frequencies is the column's frequency of use. In this way the frequency of use of every column is obtained.
  • Columns whose frequency of use is greater than or equal to the encoding threshold are classified as query columns; columns whose frequency of use is below the threshold are code columns.
  • When the columns of a row group are actually stored, each is stored in a form that depends on its type.
  • Query columns must be fast to read, so their data is stored in its original format; code columns are stored compressed with a common data compression algorithm to save storage space.
  • The classification information for these columns is saved in the second (metadata) part of the row group. This reduces how often data must be decompressed during queries and improves the system's query efficiency.
  • The data system administrator can set different encoding thresholds so that the system strikes a good balance between query rate and storage space.
  • EStore uses RDP codes for data fault tolerance.
  • An RDP code generation group is a (p-1) × (p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data.
  • The first check column is obtained by adding the information data along each row and is called the row check block.
  • The second check column is obtained by adding the information data along each diagonal and is called the diagonal check block.
  • The main problem in applying RDP codes to block verification in distributed storage systems is how to organize the information block files to generate the check files.
  • EStore uses the construction parameter to determine the size of this check matrix.
  • EStore defines the construction parameter as any prime number greater than 2. If the construction parameter is k, the RDP generation matrix has size (k-1) × (k+1); that is, it spans k+1 files in total, each divided internally into k-1 blocks. As described in the previous section, file blocks are composed of row groups of equal size, so the row group is taken as the basic symbol of the RDP generation matrix, and each file block contains k-1 row groups. The size of a row group is therefore determined by the block size and the construction parameter.
  • The matrix generated by each RDP code in EStore is called a file group.
  • A table file often contains a large number of storage blocks.
  • EStore partitions these storage blocks into file groups according to the construction parameter, with each file group containing k-1 storage blocks, and then generates 2 check blocks from them, so that each file group finally contains k+1 file blocks.
  • The row groups in block 4 hold all of the row check symbols.
  • For example, r_{0,4} is the exclusive OR of the row groups {r_{0,0}, r_{0,1}, r_{0,2}, r_{0,3}}.
  • Block 5 contains all of the diagonal check symbols; for example, r_{0,5} is the exclusive OR of the row groups {r_{0,0}, r_{3,2}, r_{2,3}, r_{1,4}}.
  • EStore stores data using both replicas and check blocks.
  • Each data block of a file group is stored as two replicas in the storage system, together with two RDP code check blocks generated from the group's data blocks.
  • The system still uses replication because RDP repair requires a large transmission bandwidth when restoring a file. For each data block, the system therefore stores one additional replica on another node, so that when a single node fails the block can still be obtained by copying the replica. Only when the two nodes holding the same data block fail at the same time must the block be restored through RDP data recovery. Because this rarely happens in a distributed storage system, the transmission bandwidth of RDP repair is acceptable as long as the construction parameter is not very large.
  • In EStore's file group storage, the two replicas of each data block are stored on different nodes, and the two check blocks are stored on nodes that hold no data block of the file group, so that when both replicas of any data block are damaged at the same time, the original block can be restored with RDP repair.
  • The present invention uses three optimization strategies to improve the data processing performance of the data warehouse system, which shows as a higher query rate and a smaller fault-tolerance footprint.
  • The data is organized within HDFS file blocks.
  • The table file of the relational data is divided horizontally into row groups of equal size; each HDFS file block stores one or more row groups, and within each row group the data is stored in column-storage format.
  • Before a data table is stored, it is preprocessed: the frequency of use of each column is counted, and an encoding threshold is set to classify the columns. Columns below the encoding threshold are code columns and are stored compressed; columns at or above the threshold are query columns and keep their data in its original format.
  • RDP codes are used to build the error correction strategy, and the data table file is divided into file groups.
  • The size of a file group is determined by the system's construction parameter.
  • Within each file group, the system generates two additional check file blocks for data repair.
  • The system performs data fault tolerance with two replicas plus check blocks: each data block in a file group is saved on two different nodes, and the check blocks are saved on other nodes.
  • The system stores table files according to two important parameters, the construction parameter and the encoding threshold. By setting these parameters differently, the system can meet specific data management requirements for query efficiency and space occupation.
  • The EStore system was built on a real distributed system, and its performance was compared with that of the existing common data layout RCFile in terms of storage space occupation and query rate.
  • The parameter t denotes the encoding threshold.
  • Figure 2 shows the query rates of the different data layouts. The EStore query rate is higher than that of RCFile at every encoding threshold, because EStore spends less time decompressing data during queries.
  • The comparison between RCFile and EStore at different encoding thresholds covers space occupancy and query execution time.
  • These experimental results reflect the performance advantages of EStore's column classification and fault tolerance strategies.
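
The sizing rule described for EStore above (a prime construction parameter k, k-1 row groups per file block, and file groups of k-1 data blocks plus 2 check blocks) can be illustrated with a small sketch. The function names, the integer division, and the handling of a short final group are assumptions made for illustration; the text does not say how a partially filled file group is treated.

```python
def is_prime(n):
    """Trial-division primality test, sufficient for small construction parameters."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def row_group_size(block_size, k):
    """Size of one row group when a file block holds k-1 equal row groups."""
    if k <= 2 or not is_prime(k):
        raise ValueError("construction parameter must be a prime greater than 2")
    return block_size // (k - 1)

def split_into_file_groups(blocks, k):
    """Partition data blocks into file groups of k-1 blocks each.

    Assumption: the last group may be shorter; padding it is left to the caller.
    """
    n = k - 1
    return [blocks[i:i + n] for i in range(0, len(blocks), n)]
```

With k = 5 and a 64 MiB block, for instance, each row group would be 16 MiB, and every complete file group would grow from 4 data blocks to 6 file blocks once the two check blocks are added.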

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An efficient data layout optimization method for a data warehouse system, the method comprising the following steps: (A) arranging a data layout on the basis of block files; (B) performing column classification; and (C) storing table files. When an upper-layer data warehouse system processes large-scale structured data, the present invention provides better query efficiency and occupies less storage space than conventional solutions.

Description

Efficient optimized data layout method applied to data warehouse systems
[Technical Field]
The present invention relates to the field of data processing, and in particular to an efficient, optimized data layout method for data warehouse systems.
[Background Art]
In today's big data era, processing large-scale archived data is one of the most important and complex challenges in data warehouse systems. Structured data is the most common type of data stored in database management systems. In a distributed system, the way the tables of structured data are partitioned has a large effect on query and space efficiency, because processing data on a single node and transferring data over the network between nodes differ greatly in cost.
To make it more convenient to process data in a distributed system in the same way as in a database management system, higher-level data management techniques have been built on top of distributed systems such as Hadoop. In this context, the data layout scheme at the storage level largely determines the processing efficiency of the system.
Row storage is a commonly used data layout that partitions the rows of a table into blocks and stores the blocks on different data nodes, where each node writes the rows to disk one after another. Its shortcoming is that a query must load entire rows into memory, including columns it never uses, and process them unnecessarily, which lengthens query time. Another common data layout is column storage, which partitions a table by column and stores different columns on different data nodes, where each node writes its columns to disk one after another. Its disadvantage is that the partial results obtained from different columns must be transferred between nodes to produce the final result, which adds data transmission overhead and reduces query efficiency.
On the other hand, with regard to storage space efficiency, distributed systems usually rely on multiple replicas to ensure data reliability, so that when one node fails the data can be obtained from other nodes. This requires the storage system to provide several times the space of the original data, which raises storage cost. Error correction codes prevent data loss by generating redundant data. The RDP code is an erasure code commonly used in redundant arrays of disks: it generates additional redundant data from the data on different disks by means of row and diagonal operations, and it protects against the failure of up to two disks in the system. The same technique can be applied to block file fault tolerance in distributed systems, but how to construct and store it effectively is a problem that remains to be solved.
The paper [Y. He, R. Lee, S. Zheng, N. Jain, Z. Xu, and X. Zhang, "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems," in ICDE, 2011] proposes RCFile, a common data layout scheme for distributed storage systems. It combines row storage and column storage to organize the data within a file block. When table data is stored, the table file is first divided by rows into row groups of equal size, and the row groups are stored in different regions of the file block; within each row group, the data is stored contiguously in column order. This avoids the drawbacks of both pure row storage and pure column storage. However, RCFile's data compression is simplistic: every column in every row group is compressed and stored separately. Compressing all of the data in this way is ill-suited to reading frequently used data. For example, the primary key of a table is used in almost every query, and each such query must decompress that column's data once, which incurs extra time and computation overhead. In addition, RCFile relies on the multi-replica mechanism of the underlying storage system for fault tolerance, which takes up more storage space than an error correction code would.
Zebra is a column-oriented data layout. To avoid the multi-node result reassembly inherent in a pure column layout, it divides the columns of a table into multiple column groups and stores each column group separately; within each stored column group, the data is kept in row format. Each column group consists of multiple columns, and one column can belong to several column groups. This largely avoids query results being spread across multiple nodes. However, Zebra must group the columns of a table before the table is stored, and there is no guarantee that all columns used by a given query fall in the same column group. When they do not, rows still have to be reassembled across multiple nodes to produce the query result. Moreover, so that the columns of a group sit on the same node, one column may be placed in several column groups at once, which duplicates data and increases storage overhead.
[Summary of the Invention]
To solve the problems in the prior art, the present invention provides an efficient, optimized data layout method for data warehouse systems, which addresses the increased query overhead and larger storage footprint of the prior art.
The invention is realized by the following technical solution: an efficient, optimized data layout method for data warehouse systems is designed and implemented, comprising the following steps: (A) laying out the basic data of block files; (B) performing column classification; (C) storing table files.
As a further improvement of the present invention, in step (A) the table file is first divided horizontally into row groups of equal size, and the row groups are then stored one after another in the block file in column-storage format. Each row group consists of three parts: a synchronization part, a metadata part, and an actual data part. The synchronization part lets the system distinguish two adjacent row groups when reading data. The metadata part holds the size information the system needs to tell apart the columns in the row group and the fields within each column, together with column classification information that lets the system distinguish the different kinds of columns. The actual data part stores the actual data.
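
The three-part row group can be pictured with a minimal serialization sketch. The patent does not define a byte format, so the sync marker value, the little-endian `struct` layout, the use of `zlib` for compressed columns, and the function names below are all assumptions chosen only to make the sync / metadata / data split concrete.

```python
import struct
import zlib

SYNC = b"\xffESTORE-SYNC\xff"  # hypothetical marker; the patent does not define its bytes

def encode_row_group(columns, compressed_flags):
    """Serialize one row group as: sync part, metadata part, actual data part.

    columns: list of columns, each a list of bytes fields (one field per row).
    compressed_flags: one bool per column; True marks a column stored compressed.
    """
    n_rows = len(columns[0]) if columns else 0
    meta = [struct.pack("<II", len(columns), n_rows)]
    data = []
    for fields, compressed in zip(columns, compressed_flags):
        raw = b"".join(fields)
        payload = zlib.compress(raw) if compressed else raw
        # Per column: classification flag, stored size, then the size of every field.
        meta.append(struct.pack("<BI", int(compressed), len(payload)))
        meta.append(struct.pack(f"<{len(fields)}I", *(len(f) for f in fields)))
        data.append(payload)
    meta_bytes = b"".join(meta)
    return SYNC + struct.pack("<I", len(meta_bytes)) + meta_bytes + b"".join(data)

def decode_row_group(buf):
    """Inverse of encode_row_group; returns the list of columns."""
    if not buf.startswith(SYNC):
        raise ValueError("missing sync marker")
    pos = len(SYNC)
    (meta_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    data_start = pos + meta_len
    n_cols, n_rows = struct.unpack_from("<II", buf, pos)
    pos += 8
    headers = []
    for _ in range(n_cols):
        flag, size = struct.unpack_from("<BI", buf, pos)
        pos += 5
        field_sizes = struct.unpack_from(f"<{n_rows}I", buf, pos)
        pos += 4 * n_rows
        headers.append((bool(flag), size, field_sizes))
    pos = data_start  # the metadata part ends here; the actual data part follows
    columns = []
    for compressed, size, field_sizes in headers:
        payload = buf[pos:pos + size]
        pos += size
        raw = zlib.decompress(payload) if compressed else payload
        fields, off = [], 0
        for n in field_sizes:
            fields.append(raw[off:off + n])
            off += n
        columns.append(fields)
    return columns
```

Because the classification flag and the sizes live in the metadata part, a reader can locate an uncompressed column and slice its fields without touching the compressed ones.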
As a further improvement of the present invention, in step (B) a column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns; columns are divided into query columns and code columns.
As a further improvement of the present invention, in step (C) data is stored using both replicas and RDP code check blocks.
As a further improvement of the present invention, the matrix generated by an RDP code is a file group. Each data block of a file group is stored as two replicas, along with two RDP code check blocks generated from the group's data blocks. The two replicas are stored on different nodes, and the two check blocks are stored on nodes that hold no data block of that file group.
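
The placement constraint can be sketched as follows. The round-robin policy, the choice to reserve the last two nodes for check blocks, and the names `place_file_group` and `k` (standing for the prime construction parameter, so that a file group has k-1 data blocks) are assumptions for illustration; the patent states only the constraints, not a placement algorithm.

```python
def place_file_group(nodes, k):
    """Assign one file group's blocks to nodes.

    Constraints taken from the text: the two replicas of each data block go to
    different nodes, and the two check blocks go to nodes that hold no data
    block of this file group. Putting the two check blocks on two distinct
    nodes is an extra assumption.
    """
    if len(nodes) < 4:
        raise ValueError("need at least 2 data nodes and 2 check nodes")
    data_nodes, check_nodes = nodes[:-2], nodes[-2:]
    placement = {}
    for b in range(k - 1):
        first = data_nodes[b % len(data_nodes)]
        second = data_nodes[(b + 1) % len(data_nodes)]  # differs from first since len >= 2
        placement[f"data{b}"] = (first, second)
    placement["row_check"] = (check_nodes[0],)
    placement["diag_check"] = (check_nodes[1],)
    return placement
```

Under this policy a single node failure always leaves one replica of every affected data block reachable, which matches the stated reason for keeping replicas alongside the check blocks.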
As a further improvement of the present invention, an RDP code generation group is a (p-1) × (p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are generated check data, and the other columns store information data. The check data comprises row check blocks, obtained by adding the information data along each row, and diagonal check blocks, obtained by adding the information data along each diagonal. The RDP code organizes the information block files to generate the check files.
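
A minimal encoding sketch of the (p-1) × (p+1) matrix follows. It assumes that "adding" means bytewise exclusive OR, which is consistent with the exclusive-OR example given in the detailed description, and it follows the standard RDP construction in which the diagonals run over the data columns plus the row check column and one diagonal is left unstored. The function names and the representation of symbols as `bytes` are illustrative choices.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))

def rdp_encode(data, p):
    """Compute the two check columns of one RDP matrix.

    data[i][j] is the symbol at row i (0..p-2) and data column j (0..p-2).
    Returns (row_check, diag_check), each a list of p-1 symbols, which would
    occupy columns p-1 and p of the matrix.
    """
    rows = p - 1
    if len(data) != rows or any(len(r) != p - 1 for r in data):
        raise ValueError("data must be a (p-1) x (p-1) grid of symbols")
    zero = bytes(len(data[0][0]))
    row_check = []
    for i in range(rows):
        acc = zero
        for sym in data[i]:
            acc = xor_bytes(acc, sym)
        row_check.append(acc)
    diag_check = [zero] * rows
    for i in range(rows):
        full_row = list(data[i]) + [row_check[i]]  # columns 0..p-1
        for j, sym in enumerate(full_row):
            d = (i + j) % p
            if d != p - 1:  # diagonal p-1 is the one RDP does not store
                diag_check[d] = xor_bytes(diag_check[d], sym)
    return row_check, diag_check
```

For p = 5 this yields diagonal check symbol 0 as the XOR of the cells (0,0), (3,2), (2,3) and (1,4), the last of which is a row check symbol.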
作为本发明的进一步改进:所述步骤(B)中,使用编码阀值来划分数据列,使用频率大于或等于编码阀值的列划分为查询列,使用频率小于编码阀值的则为编码列。As a further improvement of the present invention, in the step (B), the code column is used to divide the data column, the column whose frequency is greater than or equal to the coded threshold is divided into the query column, and the frequency of use is less than the coded threshold is the code column. .
The beneficial effects of the present invention are as follows. When processing large-scale structured data, the upper-layer data warehouse system achieves a higher query rate and occupies less storage space than conventional schemes. With respect to query rate, data management requirements are met by setting different encoding thresholds: in general, the smaller the encoding threshold, the higher the query rate and the larger the storage space occupied by the data. By setting a reasonable encoding threshold according to actual business requirements, the data warehouse administrator can obtain a good trade-off between query rate and space occupation. With respect to storage space, the construction parameter determines the sizes of the row groups and the file groups. Building row groups avoids additional data reads and result reassembly at query time. Building file groups makes the space a data table occupies for fault tolerance smaller than that of three-replica fault tolerance, while the fault-tolerance capability is no lower than that of three replicas. This storage scheme lets the system use less storage space for data fault tolerance and saves physical storage cost.
[Description of the Drawings]
FIG. 1 is a schematic diagram of the space occupation of the present invention;
FIG. 2 is a schematic diagram of the query rate of the present invention.
[Detailed Description]
The present invention is further described below with reference to the drawings and specific embodiments.
Abbreviations and definitions of key terms
RCFile    Record Columnar File    row-column storage file layout
EStore    Effective Store    efficient data layout storage system
RDP    Row-Diagonal Parity    row-diagonal parity check
An efficient optimized data layout method applied to a data warehouse system comprises the following steps: (A) laying out the basic data of the block files; (B) performing column classification; (C) storing the table files.
In step (A), the table file is first partitioned horizontally into row groups of equal size, and the row groups are then stored one after another in the block file in column-store form. Each row group consists of three parts, namely a synchronization part, a metadata part, and an actual data part. The synchronization part is used by the system to distinguish two adjacent row groups when reading data. The metadata part contains the size information with which the system distinguishes the different columns in the row group and the different fields in each column, as well as the column classification information with which the system distinguishes different kinds of columns. The actual data part stores the actual data.
In step (B), a column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns; the columns are divided into query columns and encoded columns.
In step (C), data is stored using both replicas and RDP code parity blocks.
The matrix generated by the RDP code is a file group. Each data block of a file group is stored as two replicas, together with two RDP code parity blocks generated from these data blocks. The two replicas are stored on different nodes, and the two parity blocks are stored on nodes that hold no data block of that file group.
An RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are the generated parity data, and the other columns store information data. The RDP code comprises a row parity block and a diagonal parity block: the row parity block is obtained by summing the information data along rows, and the diagonal parity block by summing the information data along diagonals. The RDP code organizes the information block files to generate the parity files.
In step (B), an encoding threshold is used to divide the data columns: columns whose frequency of use is greater than or equal to the encoding threshold are classified as query columns, and columns whose frequency of use is less than the encoding threshold are classified as encoded columns.
The EStore of the present invention is used for the data layout of the underlying storage file blocks of a distributed storage system. By optimizing the structural layout of the data, it raises the query execution rate of the system while reducing the storage space that the data occupies for error correction.
In a specific embodiment, for the block file data layout, EStore first partitions the table file horizontally into row groups of equal size and then stores these row groups one after another in the block file in column-store form. The size of a row group is determined by the construction parameter of the system, which also affects the size of a file group, as described later.
In a block-based distributed storage system, files are split into blocks that are stored on different nodes. The EStore system stores row groups in such blocks. A row group has three parts. The first is the synchronization part, which the system uses to distinguish two adjacent row groups when reading data. The second is the metadata part, which contains the size information with which the system distinguishes the different columns in the row group and the different fields in each column; it also contains the column classification information with which the system distinguishes different kinds of columns. The third is the actual data part, which stores the actual data, organized within the row group in column-store form.
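The three-part row group described above can be sketched as follows. This is a minimal illustration only: the 16-byte sync marker, the little-endian field widths, and the function names `pack_row_group` and `unpack_row_group` are assumptions made for the example and are not specified by the application.

```python
import struct

# Hypothetical 16-byte marker separating adjacent row groups (assumption).
SYNC_MARKER = b"\xde\xad\xbe\xef" * 4

def pack_row_group(columns, classes):
    """Serialize one row group as sync part + metadata part + data part.

    columns: list of columns, each a list of bytes fields (column-store order).
    classes: one flag per column, 0 = query column, 1 = encoded column.
    """
    meta = struct.pack("<I", len(columns))
    data = b""
    for col, cls in zip(columns, classes):
        # Per column: classification flag, field count, then each field size.
        meta += struct.pack("<BI", cls, len(col))
        for field in col:
            meta += struct.pack("<I", len(field))
        data += b"".join(col)
    return SYNC_MARKER + struct.pack("<I", len(meta)) + meta + data

def unpack_row_group(buf):
    """Inverse of pack_row_group; returns (columns, classes)."""
    if buf[:len(SYNC_MARKER)] != SYNC_MARKER:
        raise ValueError("row group does not start with the sync marker")
    pos = len(SYNC_MARKER)
    (meta_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    data_pos = pos + meta_len          # the data part follows the metadata
    (ncols,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    columns, classes = [], []
    for _ in range(ncols):
        cls, nfields = struct.unpack_from("<BI", buf, pos)
        pos += 5
        sizes = struct.unpack_from(f"<{nfields}I", buf, pos)
        pos += 4 * nfields
        col = []
        for size in sizes:
            col.append(buf[data_pos:data_pos + size])
            data_pos += size
        columns.append(col)
        classes.append(cls)
    return columns, classes
```

Keeping all sizes in the metadata part means a reader can locate any column in the data part without scanning the columns before it field by field.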
Column classification:
A column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns. The system divides the columns of a data table into two types, query columns and encoded columns. Every column is assigned to exactly one of the two types.
The system uses a parameter called the encoding threshold to divide the data columns. In a typical data warehouse system, users periodically run certain queries for decision making or data mining. Such queries are regarded as having a frequency of use, and the columns of a given data table that each query touches are usually not the same. Before a data table is stored, it is preprocessed: for each column of the table, the queries that use the column are identified, and the frequencies of use of those queries are summed to give the frequency of use of the column. In this way the frequency of use of every column is obtained. With a reasonable encoding threshold set, columns whose frequency of use is greater than or equal to the encoding threshold are classified as query columns, and columns whose frequency of use is less than the encoding threshold are classified as encoded columns.
An example illustrates the column classification method, as shown in the table below. A data table holds the item information of a shopping website and has 7 different columns. In the daily management of the site, 30 queries are executed periodically to support business decisions. Assume that these 30 queries all have the same frequency of use, namely 1/30. For each column, the queries that use the column are counted, and their frequencies of use are summed to give the frequency of use of the column. With the encoding threshold set to 0.2, the first column is a query column and the remaining columns are encoded columns.
TABLE I. COLUMNS OF THE TABLE ITEM
Figure PCTCN2016113364-appb-000001
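The classification procedure in this example can be sketched as below. Because Table I is only available as an image, the column names and the number of queries touching each column are hypothetical values chosen so that the outcome matches the text (first column a query column, the rest encoded columns); `column_frequencies` and `classify` are illustrative names.

```python
from fractions import Fraction

def column_frequencies(queries):
    """Sum, per column, the frequency of use of every query that uses it.

    queries: list of (frequency, set_of_column_names) pairs.
    """
    freq = {}
    for query_freq, used_columns in queries:
        for col in used_columns:
            freq[col] = freq.get(col, 0) + query_freq
    return freq

def classify(columns, freq, threshold):
    """Columns at or above the threshold are query columns, the rest encoded."""
    return {
        col: "query" if freq.get(col, 0) >= threshold else "encoded"
        for col in columns
    }

# Hypothetical workload: 30 periodic queries, each with frequency 1/30.
columns = ["item_id", "name", "price", "brand", "category", "size", "color"]
f = Fraction(1, 30)
queries = [(f, {"item_id"})] * 12          # 12 queries touch item_id
for col in columns[1:]:
    queries += [(f, {col})] * 3            # 3 queries touch each other column

freq = column_frequencies(queries)
classes = classify(columns, freq, Fraction(1, 5))   # threshold 0.2
```

Exact fractions are used so that a column sitting exactly on the threshold is classified deterministically rather than being subject to floating-point rounding.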
When each column of a row group is actually stored, the storage form depends on the type of the column. Query columns must be fast to read, so their data is stored in its native format. Encoded columns are stored compressed with a common data compression algorithm to save storage space. The classification information of the columns is kept in the second part of the row group (the metadata part). This approach reduces the likelihood that data must be decompressed at query time and so improves the query efficiency of the system. By setting different encoding thresholds, the administrator of the data system can strike a good balance between query rate and storage space.
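A minimal sketch of class-dependent column storage follows. The application only says "a common data compression algorithm"; zlib is used here purely as a stand-in, and `store_column` and `load_column` are hypothetical helper names.

```python
import zlib

def store_column(raw: bytes, column_class: str) -> bytes:
    """Query columns stay in native form; encoded columns are compressed."""
    if column_class == "query":
        return raw
    return zlib.compress(raw)      # stand-in for the unspecified algorithm

def load_column(stored: bytes, column_class: str) -> bytes:
    """Only encoded columns pay a decompression cost on read."""
    if column_class == "query":
        return stored
    return zlib.decompress(stored)

# Example payload: a repetitive, rarely queried column compresses well.
colors = b"red,blue,green," * 1000
stored_encoded = store_column(colors, "encoded")
stored_query = store_column(colors, "query")
```

A query that touches only query columns never calls `zlib.decompress`, which is the saving the classification is meant to deliver.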
Table file storage:
EStore uses the RDP code for data fault tolerance. An RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is any prime number greater than 2. The last two columns of each matrix are the generated parity data, and the other columns store information data. Of the two parity columns, the first is obtained by summing the information data along rows and is called the row parity block; the second is obtained by summing the information data along diagonals and is called the diagonal parity block. The main problem in applying the RDP code to block-level parity in a distributed storage system is how to organize the information block files to generate the parity files.
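For reference, the two parity columns can be written out as follows. The notation is an assumption for illustration: $r_{i,j}$ is the symbol in row $i$ and column $j$ of the $(p-1)\times(p+1)$ matrix, "summing" is taken to be bitwise XOR (as the claims and the worked example state), and row $p-1$ is an imaginary all-zero row, as in the standard RDP construction.

```latex
% Row parity (column p-1), for i = 0, ..., p-2:
r_{i,\,p-1} \;=\; \bigoplus_{j=0}^{p-2} r_{i,\,j}

% Diagonal parity (column p), for i = 0, ..., p-2,
% taken over the data columns and the row-parity column:
r_{i,\,p} \;=\; \bigoplus_{j=0}^{p-1} r_{(i-j) \bmod p,\; j},
\qquad r_{p-1,\,j} := 0
```

With $p = 5$ and $i = 0$ the second formula gives $r_{0,5} = r_{0,0} \oplus r_{3,2} \oplus r_{2,3} \oplus r_{1,4}$, since the $j = 1$ term falls on the imaginary row.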
EStore uses the construction parameter to determine the size of this parity matrix. EStore defines the construction parameter as any prime number greater than 2. If the construction parameter is k, the RDP generation matrix is a (k-1)×(k+1) matrix; that is, it comprises k+1 file blocks in total, each divided internally into k-1 pieces. As described above, a file block is made up of row groups of equal size, so the row group is treated as the basic symbol of the RDP generation matrix and each file block contains k-1 row groups. The size of a row group within a file block is therefore determined jointly by the block size and the construction parameter.
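The relationship between block size, construction parameter, and row group size can be sketched as below. The 64 MiB block size is an assumed example value, `layout_for` is a hypothetical helper, and the division assumes a block holds exactly k-1 row groups with no other per-block overhead.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for small parameters."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def layout_for(block_size: int, k: int) -> dict:
    """Derive the RDP matrix shape and row group size from block size and k."""
    if k <= 2 or not is_prime(k):
        raise ValueError("construction parameter must be a prime greater than 2")
    return {
        "matrix_rows": k - 1,                   # row groups per file block
        "matrix_cols": k + 1,                   # k-1 data blocks + 2 parity blocks
        "row_group_size": block_size // (k - 1),
    }

layout = layout_for(64 * 2**20, 5)              # assumed 64 MiB block, k = 5
```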
In EStore, the matrix generated by each RDP code is called a file group. A large table file usually comprises many storage blocks. EStore partitions these storage blocks according to the construction parameter, assigning them to different file groups so that each file group contains k-1 storage blocks. Within each file group, 2 parity blocks are then generated from these storage blocks, so that each file group finally contains k+1 file blocks.
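Partitioning the storage blocks of a table into file groups can be sketched as below. The text does not say how a final, partially filled group is handled, so this sketch simply leaves it short; `make_file_groups` is a hypothetical name and the blocks are represented by their indices.

```python
def make_file_groups(blocks, k):
    """Split a table's storage blocks into file groups of k-1 data blocks each.

    The last group may hold fewer than k-1 blocks; how such a group is
    padded or encoded is not specified in the source and is left open here.
    """
    size = k - 1
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]

# Ten storage blocks with construction parameter k = 5.
groups = make_file_groups(list(range(10)), 5)
```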
The figure below shows how one file group of a data table is constructed. The construction parameter is 5; blocks 0 to 3 are data blocks, and blocks 4 and 5 are parity blocks. The row groups of block 4 hold all the row parity symbols; for example, r_{0,4} is the XOR sum of the row groups {r_{0,0}, r_{0,1}, r_{0,2}, r_{0,3}}. Block 5 holds all the diagonal parity symbols; for example, r_{0,5} is the XOR sum of the row groups {r_{0,0}, r_{3,2}, r_{2,3}, r_{1,4}}.
Figure PCTCN2016113364-appb-000002
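The construction of one file group can be sketched as follows, following the standard RDP encoding rule. For brevity each row group is modelled as a single integer symbol rather than a byte string, and `rdp_encode` is a hypothetical name; the index convention is the one used in the example above (row i, block j).

```python
def rdp_encode(data, p):
    """Build a (p-1) x (p+1) RDP matrix from a (p-1) x (p-1) data matrix.

    Column p-1 receives row parity and column p receives diagonal parity.
    Symbols are ints here; real row groups would be XORed byte by byte.
    """
    rows = p - 1
    matrix = [list(row) + [0, 0] for row in data]
    # Row parity: XOR of the data symbols in each row.
    for i in range(rows):
        for j in range(p - 1):
            matrix[i][p - 1] ^= matrix[i][j]
    # Diagonal parity: diagonal i collects the cells with (row + col) % p == i
    # over the data columns and the row-parity column.
    for i in range(rows):
        for j in range(p):
            r = (i - j) % p
            if r != p - 1:             # row p-1 is an imaginary all-zero row
                matrix[i][p] ^= matrix[r][j]
    return matrix

# Construction parameter 5: four data blocks of four row groups each.
data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
m = rdp_encode(data, 5)
```

Because each row's parity is a plain XOR, a single lost symbol in a row can be rebuilt by XORing the remaining symbols of that row with its row parity; recovering two lost blocks additionally needs the diagonal parity and is not shown here.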
EStore stores data using both replicas and parity blocks. Each data block of a file group has two replicas in the storage system, and two RDP code parity blocks generated from these data blocks are stored as well. The system retains replication because RDP code repair requires considerable transmission bandwidth when a file is recovered. For each data block the system therefore keeps one additional replica on another node, so that when a single node fails the data block can still be obtained by transferring the replica. Only when the two nodes holding the same data block fail at the same time must the data block be restored through RDP code recovery. Because this situation is infrequent in a distributed storage system, the transmission bandwidth of RDP code repair remains acceptable as long as the construction parameter is not very large.
In EStore's file group storage, the two replicas of each data block are stored on different nodes, and the two parity blocks are stored on nodes that hold no data block of that file group. In this way, when both replicas of any data block are damaged at the same time, the original data block can be recovered by RDP repair.
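The placement constraints just described can be sketched as below. The round-robin assignment, the node names, and the choice to put the two parity blocks on two distinct nodes are assumptions of this sketch; the text only requires that the two replicas of a data block sit on different nodes and that parity blocks sit on nodes holding no data block of the group.

```python
def place_file_group(num_data_blocks, nodes):
    """Assign replicas and parity blocks of one file group to nodes.

    The last two nodes are reserved for parity so that they hold no data
    block of this file group; data replicas go round-robin over the rest.
    """
    if len(nodes) < 4:
        raise ValueError("need at least 2 data nodes and 2 parity nodes")
    data_nodes, parity_nodes = nodes[:-2], nodes[-2:]
    n = len(data_nodes)
    placement = {}
    for i in range(num_data_blocks):
        # Adjacent positions differ because n >= 2.
        placement[f"data{i}"] = [data_nodes[i % n], data_nodes[(i + 1) % n]]
    placement["row_parity"] = [parity_nodes[0]]
    placement["diag_parity"] = [parity_nodes[1]]
    return placement

# One file group with k = 5 (four data blocks) on six hypothetical nodes.
plan = place_file_group(4, ["n0", "n1", "n2", "n3", "n4", "n5"])
```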
The present invention uses three optimization strategies to improve the data processing performance of the data warehouse system, reflected in a higher query rate and a smaller fault-tolerance space footprint.
For the basic data layout of block files, the data is organized sensibly within HDFS file blocks. The table file of the relational data is first partitioned horizontally into row groups of equal size; each HDFS file block stores one or more row groups, and within each row group the data is kept in column-store form.
In the column classification strategy, the data is preprocessed before the data table is stored: the frequency of use of each column is computed, and an encoding threshold is set to classify the columns. Columns below the encoding threshold are encoded columns and are stored compressed. Columns at or above the encoding threshold are query columns and are stored in the native data format.
For the storage of data table files, an error-correction strategy is built on the RDP code. The data table file is divided into file groups whose size is determined by the construction parameter of the system. Within each file group the system generates two additional parity file blocks for data repair. The system achieves data fault tolerance with two replicas plus parity blocks: each data block of a file group is kept on two different nodes, and the parity blocks are kept on other nodes.
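The space cost of two replicas plus two parity blocks can be estimated with a short calculation. This is a back-of-the-envelope sketch that counts blocks only (it ignores compression and metadata); `estore_overhead` is a hypothetical name and k is the construction parameter.

```python
from fractions import Fraction

def estore_overhead(k: int) -> Fraction:
    """Stored blocks per data block for one file group.

    A file group has k-1 data blocks, each stored twice, plus 2 parity blocks.
    """
    data_blocks = k - 1
    stored_blocks = 2 * data_blocks + 2
    return Fraction(stored_blocks, data_blocks)

THREE_REPLICA_OVERHEAD = Fraction(3)
overheads = {k: estore_overhead(k) for k in (3, 5, 7, 11)}
```

Under this count the overhead is 2 + 2/(k-1): equal to three replicas at k = 3 and strictly below it for every larger prime, for example 5/2 at k = 5. Larger k saves more space at the price of more repair traffic when RDP recovery is needed.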
The system stores table files using two important parameters, the construction parameter and the encoding threshold. By setting these parameters differently, the system can meet specific data management requirements with respect to query efficiency and space occupation.
The performance is evaluated below by means of a specific embodiment:
An EStore system was built on a real distributed system, and EStore was compared with RCFile, a commonly used existing data layout structure, in terms of storage space occupation and query rate. In this part, the parameter t denotes the encoding threshold.
FIG. 1 shows the space that data layouts with fault-tolerance measures occupy in the distributed file system under different data layouts. It can be observed that with t=0.4 EStore markedly reduces the space occupied by the data. Different encoding thresholds affect the amount of storage space used, depending on the number of query columns and their data types.
FIG. 2 shows the difference in query rate under different data layouts. EStore's query rate is higher than RCFile's at all encoding thresholds, because EStore reduces the time spent on data decompression during queries.
EStore with different encoding thresholds was compared with RCFile in terms of space occupation and query execution time. The experimental results above reflect the performance advantages that EStore's column classification and fault-tolerance strategies bring to the system.
The foregoing is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be regarded as limited to these descriptions. For a person of ordinary skill in the art to which the invention belongs, a number of simple deductions or substitutions may be made without departing from the concept of the invention, and all of these shall be regarded as falling within the scope of protection of the invention.

Claims (7)

  1. An efficient optimized data layout method applied to a data warehouse system, characterized by comprising the following steps: (A) laying out the basic data of the block files; (B) performing column classification; (C) storing the table files.
  2. The efficient optimized data layout method applied to a data warehouse system according to claim 1, characterized in that: in step (A), the table file is first partitioned horizontally into row groups of equal size, and the row groups are then stored one after another in the block file in column-store form; each row group consists of three parts, namely a synchronization part, a metadata part, and an actual data part; the synchronization part is used by the system to distinguish two adjacent row groups when reading data; the metadata part contains the size information with which the system distinguishes the different columns in the row group and the different fields in each column, as well as the column classification information with which the system distinguishes different kinds of columns; and the actual data part stores the actual data.
  3. The efficient optimized data layout method applied to a data warehouse system according to claim 1, characterized in that: in step (B), a column classification strategy based on frequency of use is adopted to reduce the decompression cost of frequently used columns, the columns being divided into query columns and encoded columns.
  4. The efficient optimized data layout method applied to a data warehouse system according to claim 1, characterized in that: in step (C), data is stored using both replicas and RDP code parity blocks.
  5. The efficient optimized data layout method applied to a data warehouse system according to claim 4, characterized in that: the matrix generated by the RDP code is a file group; each data block of a file group is stored as two replicas, together with two RDP code parity blocks generated from these data blocks; the two replicas are stored on different nodes, and the two parity blocks are stored on nodes that hold no data block of that file group.
  6. The efficient optimized data layout method applied to a data warehouse system according to claim 4, characterized in that: an RDP code generation group is a (p-1)×(p+1) matrix, where the parameter p is any prime number greater than 2; the last two columns of each matrix are the generated parity data, and the other columns store information data; the RDP code comprises a row parity block and a diagonal parity block, the row parity block being obtained by XOR-summing the information data along rows and the diagonal parity block by XOR-summing the information data along diagonals; the RDP code organizes the information block files to generate the parity files.
  7. The efficient optimized data layout method applied to a data warehouse system according to claim 3, characterized in that: in step (B), an encoding threshold is used to divide the data columns; columns whose frequency of use is greater than or equal to the encoding threshold are classified as query columns, and columns whose frequency of use is less than the encoding threshold are classified as encoded columns.
PCT/CN2016/113364 2016-12-30 2016-12-30 Efficient data layout optimization method for data warehouse system WO2018119976A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/113364 WO2018119976A1 (en) 2016-12-30 2016-12-30 Efficient data layout optimization method for data warehouse system
CN201680090379.7A CN110268397B (en) 2016-12-30 2016-12-30 Efficient optimized data layout method applied to data warehouse system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/113364 WO2018119976A1 (en) 2016-12-30 2016-12-30 Efficient data layout optimization method for data warehouse system

Publications (1)

Publication Number Publication Date
WO2018119976A1 true WO2018119976A1 (en) 2018-07-05

Family

ID=62706678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113364 WO2018119976A1 (en) 2016-12-30 2016-12-30 Efficient data layout optimization method for data warehouse system

Country Status (2)

Country Link
CN (1) CN110268397B (en)
WO (1) WO2018119976A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579597A (en) * 2020-12-15 2021-03-30 西安邮电大学 Compression-sensitive database file storage method and system
CN116931845A (en) * 2023-09-18 2023-10-24 新华三信息技术有限公司 Data layout method and device and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719822B (en) * 2023-08-10 2023-12-22 深圳市连用科技有限公司 Method and system for storing massive structured data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339524A (en) * 2008-05-22 2009-01-07 清华大学 Magnetic disc fault tolerance method of large scale magnetic disc array storage system
US20100235677A1 (en) * 2007-09-21 2010-09-16 Wylie Jay J Generating A Parallel Recovery Plan For A Data Storage System
CN103186566A (en) * 2011-12-28 2013-07-03 中国移动通信集团河北有限公司 Data classification storage method, device and system
CN103699676A (en) * 2013-12-30 2014-04-02 厦门市美亚柏科信息股份有限公司 MSSQL SERVER based table partition and automatic maintenance method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3694813A (en) * 1970-10-30 1972-09-26 Ibm Method of achieving data compaction utilizing variable-length dependent coding techniques
CN102521363A (en) * 2011-12-15 2012-06-27 武汉达梦数据库有限公司 Column partition based numerical data compression method for column storage database
CN102737132A (en) * 2012-06-25 2012-10-17 天津神舟通用数据技术有限公司 Multi-rule combined compression method based on database row and column mixed storage
CN103118133B (en) * 2013-02-28 2015-09-02 浙江大学 Based on the mixed cloud storage means of the file access frequency
CN103688515B (en) * 2013-03-26 2016-10-05 北京大学深圳研究生院 The coding of a kind of minimum bandwidth regeneration code and memory node restorative procedure
US9244935B2 (en) * 2013-06-14 2016-01-26 International Business Machines Corporation Data encoding and processing columnar data
CN103440244A (en) * 2013-07-12 2013-12-11 广东电子工业研究院有限公司 Large-data storage and optimization method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, Z.: "Performance Optimization of a Massive Data Query and Analysis System on Hadoop", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER'S THESES, 15 August 2015 (2015-08-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579597A (en) * 2020-12-15 2021-03-30 西安邮电大学 Compression-sensitive database file storage method and system
CN112579597B (en) * 2020-12-15 2023-03-21 西安邮电大学 Compression-sensitive database file storage method and system
CN116931845A (en) * 2023-09-18 2023-10-24 新华三信息技术有限公司 Data layout method and device and electronic equipment
CN116931845B (en) * 2023-09-18 2023-12-12 新华三信息技术有限公司 Data layout method and device and electronic equipment

Also Published As

Publication number Publication date
CN110268397A (en) 2019-09-20
CN110268397B (en) 2023-06-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16925712

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16925712

Country of ref document: EP

Kind code of ref document: A1