CN103425772A - Method for searching massive data with multi-dimensional information

Info

Publication number
CN103425772A
Authority
CN
China
Prior art keywords
dimension
data
query
cube
level
Prior art date
Legal status
Granted
Application number
CN2013103501267A
Other languages
Chinese (zh)
Other versions
CN103425772B (en)
Inventor
宋杰
郭朝鹏
王智
徐澍
张一川
朱志良
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201310350126.7A
Publication of CN103425772A
Application granted
Publication of CN103425772B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for querying massive data with multi-dimensional information, relating to the field of data mining. The dimension information of the massive multi-dimensional data is loaded; the massive data themselves are loaded; and the massive data are queried using on-line analytical processing (OLAP). The method organizes massive data with multi-dimensional information through dimension coding, simplifies the addressing of data blocks by storing the data in blocks, quickly converts between dimension levels through intermediate variables (i.e. analysis paths), and screens the data by selecting data blocks, so that only the data actually involved in a query are computed and processed.

Description

Method for querying massive data with multi-dimensional information

Technical Field

The invention relates to the field of data mining, and in particular to a method for querying massive data with multi-dimensional information.

Background Art

The arrival of the big-data era poses great challenges to traditional data analysis fields such as data management and data querying. To cope with the challenges brought by massive data, the MapReduce programming model and distributed file systems are widely adopted in both academia and industry. OLAP (On-Line Analytical Processing) is a very important analysis method in traditional data analysis, and the big-data field places new requirements on OLAP analysis.

OLAP can be divided into three types according to its implementation: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP) and HOLAP (Hybrid OLAP). ROLAP stores dimension information and fact data in relational tables, MOLAP stores them in multi-dimensional data structures, and HOLAP, called hybrid OLAP, combines ROLAP and MOLAP techniques. Whatever the form of OLAP, it needs the support of storage and computing platforms, especially in a big-data environment. To address the many challenges of big data, academia and industry have produced many new technologies, such as distributed file systems, NoSQL (Not Only Structured Query Language) database systems, the MapReduce programming model and related optimization methods, all of which are widely applied to big-data analysis.

Two OLAP optimization approaches are commonly used in big-data environments: optimizing OLAP performance with precomputed and condensed data cubes, and optimizing OLAP performance through better storage structures and algorithms. However, the former generates a large amount of data and is unsuitable for massive-data environments, while most of the latter's optimizations are based on ROLAP and bring no qualitative improvement to OLAP performance. Some research has proposed the SPAJG-OLAP subset of OLAP queries and studied optimization strategies and implementation techniques for massively parallel processing of massive data in terms of storage, querying, data distribution, network transmission and distributed caching. That research optimizes ROLAP performance on top of parallel database technology and accelerates OLAP by optimizing OLAP queries and storage; however, because ROLAP is built on relational database technology, it produces a large number of join operations, so the optimization effect is not obvious when the data volume is very large.

As far as distributed OLAP systems are concerned, some Hadoop-based cloud database systems, such as Hive, HadoopDB and HBase, support OLAP. In the current field of distributed OLAP over massive data, methods such as data indexing and partitioning are widely used to optimize ROLAP. However, ROLAP requires a relational model and resource-consuming join operations, and as the data volume grows, the benefit of indexing and partitioning drops sharply. There are also methods that optimize ROLAP through the query conditions, but because the join operations remain unavoidable, their effect is likewise limited. MOLAP stores data as a data cube, but it requires the dimensions to be managed and optimized, and there are currently no authoritative reports on MOLAP research or systems.

Summary of the Invention

In view of the shortcomings of the prior art, the purpose of the present invention is to provide a method for querying massive data with multi-dimensional information, so as to achieve data querying and aggregation computation in a massive-data environment.

The technical solution of the present invention is realized as follows. A method for querying massive data with multi-dimensional information comprises the following steps:

Step 1: load the dimension information of the massive data with multi-dimensional information, which specifically includes the following steps:

Step 1.1: examine the dimension information of the massive data and judge whether every dimension simultaneously satisfies the following three constraints:

Constraint 1: a dimension consists of one and only one dimension hierarchy, i.e. the dimension is a total order over all of its dimension levels;

Constraint 2: any dimension level of a dimension contains exactly one dimension attribute, and that dimension attribute contains a number of dimension values;

Constraint 3: in the dimension-value tree composed of all dimension values, sibling nodes contain the same number of child nodes;

If all constraints are satisfied, go to step 1.3; otherwise, go to step 1.2;

Step 1.2: process the dimension information so that each dimension forms a dimension-value tree that satisfies the constraints, as follows:

For constraint 1: if there are multiple dimension hierarchies, discard hierarchies as needed and keep only one;

For constraint 2: if a dimension level contains multiple dimension attributes, discard attributes as needed and keep only one;

For constraint 3: if sibling nodes contain different numbers of child nodes, add null values so that all sibling nodes have the same number of children;

Step 1.3: encode the dimension information;

For the dimension values of each dimension level in the dimension-value tree, assign decimal codes in order from left to right; the encoding is finished when every dimension value has a corresponding code;

Step 1.4: store the encoding of the dimension information;

For each dimension of the massive data, store the name of every dimension level and the number of sibling nodes at that level, finally forming a file containing all dimension information of the massive data, which is stored in the distributed file system;

Step 2: load the massive data;

Step 2.1: the user establishes, as needed, the correspondence between the actual meaning of the dimension information of the massive data and its codes, i.e. any record of the massive data is represented by the codes of its dimension information;

Step 2.2: all of the finest-grained multi-dimensional massive data form a data cube, and each record of the massive data becomes one cell of that data cube; the information of a cell includes the coordinates of the cell within the cube and the fact value represented by the cell, where the coordinates of a cell are expressed as:

<code of the first dimension, code of the second dimension, ..., code of the last dimension>;

The finest granularity refers to the data addressed by the lowest dimension level;

Step 2.3: cut the data cube:

According to the user's query requirements, and under the condition that the query time is minimized, cut the data cube into data blocks and determine the edge lengths of the blocks;

Step 2.4: encode the data blocks obtained in step 2.3, as follows: divide the coordinates of any cell inside a block by the edge lengths of the block and round the results down; the resulting values are the code of the block;

Step 2.5: store the data blocks cut in step 2.3 in the distributed file system, using the codes established in step 2.4 as the names of the block files;

Step 3: query the massive data using on-line analytical processing (OLAP);

Step 3.1: the user sets the query conditions, including:

Query target: the data cube against which the query is to be executed, i.e. the target cube;

Query range: which part of the data within the established query target is queried;

Dimension information of the result: the dimension information of the result data cube;

Aggregation method: the operation used to aggregate the data within the query range;

Step 3.2: judge whether the query conditions set in step 3.1 satisfy the following constraints:

Constraint 1: the query target already exists, and the query range is less than or equal to the data range of the query target;

Constraint 2: the number of dimensions of the result data cube is less than or equal to the number of dimensions of the query target;

Constraint 3: the lowest dimension level of any dimension of the result data cube is higher than the lowest dimension level of the corresponding dimension of the query target;

Constraint 4: the aggregation method must be distributive or algebraic;

If constraints 1, 2 and 4 are satisfied but constraint 3 is not, go to step 3.3;

If constraints 1 to 4 are all satisfied, go to step 3.4;

If neither of the above cases applies, the query fails and the process ends;

Step 3.3: convert the query target: find the parent cube of the current query target and judge whether it satisfies constraint 3; if not, continue with the parent cube of that parent cube; if constraint 3 can never be satisfied, the query fails and the query process ends; if a parent cube satisfying constraint 3 is found, replace the target cube with that cube;

Step 3.4: coarse screening of the data: determine the minimal range of data blocks required by the query according to the query range (steps 3.4 through 3.7 are illustrated by the sketch following step 3.8);

Step 3.5: fine screening of the data: scan the block files selected in step 3.4 and check every cell inside those blocks against the query range; if a cell lies within the query range, go to step 3.6, otherwise discard the cell;

Step 3.6: change the dimension level of each cell using the dimension codes: compare the dimension information of the result data cube with that of the target cube, determine which dimensions change, and modify the corresponding coordinates of the cell;

Step 3.7: for cells with identical coordinates, aggregate the fact values in those cells according to the chosen aggregation method;

Step 3.8: the data aggregated in step 3.7 form the result data cube; return the information of the result data cube to the user and store the result cube as a new data cube, so that it can serve as the query target of the next round of querying.
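
The following is a minimal sketch, in Python, of how steps 3.4 through 3.7 could be combined, under the assumptions that cell coordinates are 0-based, that blocks are addressed by integer (floor) division as in the worked example of the embodiment below, and that a code at a lower level maps to its parent-level code by integer division by the fan-out, which follows from the fixed fan-out required by constraint 3. All function names and the toy data are illustrative and not taken from the patent.

    # Sketch of steps 3.4-3.7: coarse screening, fine screening, level conversion, aggregation.
    from collections import defaultdict
    from itertools import product

    def coarse_screen(lo, hi, block_edges):
        """Step 3.4: coordinates of every data block touched by the query range [lo, hi]."""
        ranges = [range(l // e, h // e + 1) for l, h, e in zip(lo, hi, block_edges)]
        return list(product(*ranges))

    def fine_screen(cells, lo, hi):
        """Step 3.5: keep only the cells whose coordinates lie inside the query range."""
        return [(coord, value) for coord, value in cells
                if all(l <= x <= h for l, x, h in zip(lo, coord, hi))]

    def lift(coord, level_factors):
        """Step 3.6: raise each dimension by the given factor (the product of the fan-outs
        between the old and the new level); a factor of 1 leaves a dimension unchanged."""
        return tuple(x // f for x, f in zip(coord, level_factors))

    def aggregate_mean(cells, level_factors):
        """Step 3.7: group cells by their lifted coordinates and average the fact values."""
        groups = defaultdict(list)
        for coord, value in cells:
            groups[lift(coord, level_factors)].append(value)
        return {coord: sum(vs) / len(vs) for coord, vs in groups.items()}

    # Toy usage: the query range of the embodiment hits a single block, and two slot-level
    # records of the same day are averaged per day (slot fan-out 3, other dimensions kept).
    print(coarse_screen((2570, 4, 0), (2769, 31, 9), (558, 32, 10)))   # [(4, 0, 0)]
    cells = [((2571, 4, 0), 18.0), ((2572, 4, 0), 20.0), ((2800, 4, 0), 16.0)]
    kept = fine_screen(cells, (2570, 4, 0), (2769, 31, 9))             # drops the third cell
    print(aggregate_mean(kept, level_factors=(3, 1, 1)))               # {(857, 4, 0): 19.0}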

Beneficial effects of the present invention: the method for querying massive data with multi-dimensional information of the present invention has the following advantages:

1. It organizes massive data with multi-dimensional information through dimension coding.

2. Storing the data in blocks simplifies the addressing of the data blocks.

3. Using the dimension codes, it converts quickly between dimension levels.

4. It screens the data with a block-selection-based method, so that only the data actually involved in a query are computed and processed.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the dimension structure of the marine science data set in an embodiment of the present invention;

Figure 2 is a schematic diagram of the encoding of the time dimension in an embodiment of the present invention;

Figure 3 is a schematic diagram of the encoding of the area dimension in an embodiment of the present invention;

Figure 4 is a schematic diagram of the encoding of the depth dimension in an embodiment of the present invention;

Figure 5 is a schematic diagram of a time dimension that does not satisfy constraint 1, in an embodiment of the present invention;

Figure 6 is a schematic diagram of the cube structure of the multi-dimensional massive data in an embodiment of the present invention;

Figure 7 is a schematic diagram of the blocking of the multi-dimensional massive data cube in an embodiment of the present invention;

Figure 8 is a schematic diagram of the coarse screening of data in an embodiment of the present invention;

Figure 9 is a schematic diagram of the MapReduce implementation of the massive data query method in an embodiment of the present invention;

Figure 10 is a schematic diagram of changing dimension codes in an embodiment of the present invention;

Figure 11 is an example of changing dimension codes in MapReduce in an embodiment of the present invention;

Figure 12 is a flowchart of the method for querying massive data with multi-dimensional information in an embodiment of the present invention.

Detailed Description of the Embodiments

The present invention is further described in detail below, taking a marine science data set as an example, in conjunction with the accompanying drawings and specific embodiments.

This embodiment takes a marine science data set as its research object; the marine data are massive data with three dimensions of information: a time dimension, an area dimension and a depth dimension, denoted Time, Area and Depth respectively, whose structure is shown in Figure 1. Time has four levels: Year, Month, Day and Slot, where Slot means that one measurement is taken in each of three periods of the day, with dimension values "morning", "afternoon" and "evening"; its dimension-value tree is shown in Figure 2. In Figure 2, if there are three years of data (January 1, 2010 to December 31, 2012), then the Year level has 3 dimension values (2010, 2011, 2012); the Month level has 36 dimension values (January 2010, February 2010, ..., December 2012); at the Day level, assuming every month has 31 days, there are 1116 dimension values (January 1, 2010, January 2, 2010, ..., December 31, 2012); and the Slot level has 3348 dimension values (the morning of January 1, 2010, the afternoon of January 1, 2010, ..., the evening of December 31, 2012). Area has seven levels: 1°, 1/2°, 1/4°, 1/8°, 1/16°, 1/32° and 1/64°, where 1° denotes the square region spanned by 1° of longitude and 1° of latitude, so that the Earth's surface can be divided into 360×180 such 1° squares; its dimension-value tree is shown in Figure 3. Figure 3 covers the data of a 2° region: the 1° level contains 2 dimension values (1°, 2°), the 1/2° level contains 4 dimension values (1/2°, 1°, 3/2°, 2°), and so on, down to the 1/64° level with 128 dimension values (1/64°, 1/32°, ..., 2°). Depth has three levels: 100m, 50m and 10m, where 100m means that the ocean depth is divided at intervals of 100 meters; its dimension-value tree is shown in Figure 4. In Figure 4, for a water depth of 500m, the 100m level has 5 dimension values (100m, 200m, ..., 500m), the 50m level has 10 dimension values (50m, 100m, ..., 500m), and the 10m level has 50 dimension values (10m, 20m, ..., 500m). In the cube of the marine data set, each cell holds a single fact value, the sea-water temperature, stored as a double-precision floating-point number.
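
As a quick arithmetic check of the dimension-value counts quoted above, the short sketch below (illustrative Python, not part of the patent) derives the number of values at each level from the fan-out below each level, assuming the padded calendar (12 months of 31 days, 3 slots per day), a fan-out of 2 between adjacent area levels, and fan-outs of 5, 2 and 5 for the depth levels.

    # Number of dimension values at each level, given the fan-out at each level
    # from the top of the dimension downwards. Illustrative only.
    def level_counts(fanouts):
        counts, n = [], 1
        for f in fanouts:
            n *= f
            counts.append(n)
        return counts

    print(level_counts([3, 12, 31, 3]))          # Time:  [3, 36, 1116, 3348]
    print(level_counts([2, 2, 2, 2, 2, 2, 2]))   # Area:  [2, 4, 8, 16, 32, 64, 128]
    print(level_counts([5, 2, 5]))               # Depth: [5, 10, 50]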

In this embodiment, the method for querying the above marine data with three dimensions of information proceeds as shown in the flowchart of Figure 12 and includes the following steps:

Step 1: load the dimension information of the massive data with three dimensions of information, which specifically includes the following steps:

Step 1.1: examine the dimension information of the massive data and judge whether every dimension simultaneously satisfies the following three constraints:

Constraint 1: a dimension consists of one and only one dimension hierarchy, i.e. the dimension is a total order over all of its dimension levels.

In this embodiment, for the marine data, if the time dimension included a Week level (e.g., the morning of Friday, January 1, 2010), then, since a month cannot contain only complete weeks (January 1, 2010 is the start of January, but that day is a Friday and not the start of a complete week), two dimension hierarchies would arise: year-week-day-slot and year-month-day-slot. Such a dimension does not satisfy the constraint; it is illustrated in Figure 5. For the total order required in this embodiment, the time dimension of the marine data should satisfy: a year contains seasons, a season contains months, a month contains days and a day contains slots; if such a relationship does not hold, the constraint is not satisfied.

Constraint 2: any dimension level of a dimension contains exactly one dimension attribute, and that dimension attribute contains a number of dimension values.

Consider a hypothetical city dimension whose hierarchy is province-city-district (for example, Liaoning Province - Shenyang City - Heping District). If the city level contains two dimension attributes, the city name and the city area code (for example, Shenyang / 024), the dimension does not satisfy the constraint.

Constraint 3: in the dimension-value tree composed of all dimension values, sibling nodes contain the same number of child nodes.

In this embodiment, for the marine data, the months of the time dimension really do contain different numbers of days (e.g., January 2010 contains 31 days while February 2010 contains 28 days). Reflected in the dimension-value tree, the months of the same year are siblings; for example, January 2010 and February 2010 are sibling nodes whose parent is 2010, yet January 2010 has 31 child nodes (31 days) while February 2010 has 28 child nodes (28 days). The time dimension therefore does not satisfy the constraint.

If all constraints are satisfied, go to step 1.3; otherwise, go to step 1.2.

Step 1.2: process the dimension information so that each dimension forms a dimension-value tree that satisfies the constraints, as follows:

For constraint 1: if there are multiple dimension hierarchies, discard hierarchies as needed and keep only one.

For example, for the timestamp Friday, January 1, 2010, there are two dimension hierarchies, so the Week level must be discarded, which finally yields the time dimension shown in Figure 1.

For constraint 2: if a dimension level contains multiple dimension attributes, discard attributes as needed and keep only one.

In the hypothetical city dimension, the city level contains two dimension attributes, the city name and the city number. To satisfy constraint 2, the city-name attribute is deleted and only the city number is kept, so that the city dimension satisfies constraint 2.

For constraint 3: if sibling nodes contain different numbers of child nodes, add null values so that all sibling nodes have the same number of children.

For example, this embodiment assumes that the time dimension of the marine data has 12 months per year and a fixed 31 days per month. This assumption does not always match reality; for instance, Figure 2 contains the date February 31, 2010, which does not exist. This embodiment therefore revises the time dimension by inserting null values: in a leap year the 30th and 31st of February are set to null, in a non-leap year the 29th to 31st are set to null, and for months with only 30 days the 31st is set to null. The resulting dimension-value tree of the time dimension is shown in Figure 2, where every month is fixed at 31 days and any missing days are padded with null values.

Step 1.3: encode the dimension information.

For the dimension values of each level in the dimension-value tree, assign decimal codes in order from left to right; the encoding is finished when every dimension value has a corresponding code.

In this embodiment, the encoded time dimension is shown in Figure 2. As it stands, the time dimension does not satisfy the dimension assumptions of this patent, so every month is fixed at 31 days and months with fewer days are padded with null values; February 2010 actually has only 28 days, so 3 null values are used to pad it to 31 (Figure 2 shows only one of them, February 31, 2010). The marine science data set contains three years of sea-water temperature data, so the Year level has 3 dimension values coded from left to right as 0, 1, 2; the Month level has 36 dimension values (January 2010 to December 2012) coded 0, 1, ..., 35; the Day level has 1116 dimension values coded 0, 1, ..., 1115; and the Slot level has 3348 dimension values coded 0, 1, ..., 3347. The encoding of the area dimension is shown in Figure 3. The data set covers a 2° square region, so the 1° level contains 2 dimension values coded 0 and 1, and so on down to the lowest level (1/64°) with 128 dimension values coded 0, 1, ..., 127. The encoding of the depth dimension is shown in Figure 4. The data set covers depths down to 500 meters, so the 100m level has 5 dimension values coded 0, 1, ..., 4, the 50m level has 10 dimension values coded 0, 1, ..., 9, and the 10m level has 50 dimension values coded 0, 1, ..., 49.
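
A small sketch of how the left-to-right, per-level encoding of step 1.3 plays out on the padded time dimension; the function name and the 0-based indices are assumptions for illustration only, but the printed codes agree with the ranges quoted above.

    def time_codes(year_idx, month, day, slot):
        """Codes of one timestamp at the Year, Month, Day and Slot levels
        (year_idx: 0 = 2010; month, day: 1-based; slot: 0 = morning, 1 = afternoon, 2 = evening)."""
        year_code = year_idx
        month_code = year_code * 12 + (month - 1)   # every year padded to 12 months
        day_code = month_code * 31 + (day - 1)      # every month padded to 31 days
        slot_code = day_code * 3 + slot             # 3 slots per day
        return year_code, month_code, day_code, slot_code

    print(time_codes(0, 1, 1, 0))    # (0, 0, 0, 0)        -> morning of January 1, 2010
    print(time_codes(2, 12, 31, 2))  # (2, 35, 1115, 3347) -> evening of December 31, 2012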

Step 1.4: store the encoding of the dimension information.

For each dimension of the massive data, store the name of every dimension level and the number of sibling nodes at that level, finally forming a file containing all dimension information of the massive data, which is stored in the distributed file system.

In this embodiment, the encoded time dimension is shown in Figure 2. In the dimension-value tree of the time dimension, the Year level contains 3 dimension values which are siblings of one another; their parent node is ALL (i.e. all data), so the sibling count is 3. The Month level contains 36 nodes, where the nodes under the same year are siblings whose parent is that year's node; for example, 2010 has the 12 child nodes January 2010 to December 2010, so the sibling count at this level is 12. The Day level contains 1116 nodes, where the nodes under the same month are siblings whose parent is that month's node; for example, January 2010 has the 31 child nodes January 1, 2010 to January 31, 2010, so the sibling count at this level is 31. Likewise, at the Slot level the sibling count is 3. In this embodiment the time dimension is stored as an XML file, shown in Table 1, which contains the name of the time dimension, Time, as well as the name of each level and its sibling count; for the Year level, for instance, the name is Year and the sibling count is 3.

Table 1: the time dimension stored in XML form
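
Table 1 itself is reproduced only as an image in the original publication. The sketch below is merely a guess at how such a dimension file could be generated, assuming a simple <dimension>/<level> layout with the level name and sibling count as attributes; it is not the patent's actual schema.

    import xml.etree.ElementTree as ET

    def dimension_xml(dim_name, levels):
        """levels: list of (level_name, sibling_count) pairs, top level first."""
        root = ET.Element("dimension", name=dim_name)
        for name, siblings in levels:
            ET.SubElement(root, "level", name=name, siblings=str(siblings))
        return ET.tostring(root, encoding="unicode")

    # Time dimension of the embodiment: 3 years, 12 months per year, 31 days per month,
    # 3 slots per day.
    print(dimension_xml("Time", [("Year", 3), ("Month", 12), ("Day", 31), ("Slot", 3)]))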

Step 2: load the massive data.

Step 2.1: the user establishes, as needed, the correspondence between the actual meaning of the dimension information of the massive data and its codes, i.e. any record of the massive data is represented by the codes of its dimension information.

The translator between dimension codes and the meanings of dimension values is implemented by the user as needed. In this embodiment, for example, the translator maps the morning of January 1, 2010 on the time dimension to code 0 at the Slot level, and maps 1/64° on the area dimension to code 0 at the 1/64° level. Translators are commonly implemented as follows. When the amount of data is large, a database is used for storage and lookup: for example, a table with only two columns is created, storing the meaning of each dimension value and its corresponding code. When the number of dimension values is small, distributed memory is used for storage and lookup: for example, a hash table mapping the meaning of each dimension value to its code is created in distributed memory. When there is a mathematical correspondence between the codes and the dimension values, the code is computed directly: for the depth dimension, for example, the code is obtained by dividing the current depth by the division step of the current level and truncating, so a depth of 45m has code 45/10 truncated, i.e. 4, at the 10m level.
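
A tiny sketch of the "direct computation" style of translator described above; the function name is illustrative, and the dictionary merely stands in for the database or hash-table variants.

    def depth_code(depth_m, level_step_m):
        """Code of a layer depth at a given depth level, by truncated division."""
        return int(depth_m // level_step_m)

    slot_lookup = {("2010-01-01", "morning"): 0}      # hypothetical lookup-table translator

    print(depth_code(45, 10))                          # 4, as in the example above
    print(slot_lookup[("2010-01-01", "morning")])      # 0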

Step 2.2: all of the finest-grained multi-dimensional massive data form a data cube, and each record of the massive data becomes one cell of that data cube; the information of a cell includes the coordinates of the cell within the cube and the fact value represented by the cell, where the coordinates of a cell are expressed as:

<code of the first dimension, code of the second dimension, ..., code of the last dimension>;

The finest granularity refers to the data addressed by the lowest dimension level.

In this embodiment the multi-dimensional massive data form a data cube whose structure is shown in Figure 6. Each record of the massive data becomes one cell of the cube, and the information of a cell includes its coordinates within the cube and the fact value it represents. Take the marine record "the temperature at 10m depth in a 1/64° square on the morning of January 1, 2010 is 18°C": this record is a cell of the data cube, and the information stored for it in the distributed file system includes the coordinates <0, 0, 0>, where the first 0 is the code on the time dimension (the morning of January 1, 2010 has code 0 at the Slot level), the second 0 is the code on the area dimension (1/64° has code 0 at the 1/64° level) and the third 0 is the code on the depth dimension (10m has code 0 at the 10m level), together with the fact value 18°C.

Step 2.3: cut the data cube:

According to the user's query requirements, and under the condition that the query time is minimized, cut the data cube into data blocks and determine the edge lengths of the blocks.

The user's query requirements are the query conditions the user commonly issues and the probability of each occurring. Taking the user's query requirements as the object has the following benefits: query efficiency is guaranteed, so the system can respond quickly to the user's queries, and a balance is kept between the parallelism of the system and its scheduling cost.

In this embodiment the massive-data query method is implemented with the MapReduce programming model. The input of MapReduce in the query method is the data block files, and the performance of the whole query method is closely related to the block size. The smaller the block files, the better the parallelism and the less data actually participates in the computation, but the scheduling cost rises; how to choose the block file size as a compromise therefore becomes crucial. Because it is difficult to enumerate all query conditions and their probabilities of occurrence, this embodiment uses random sampling to draw a set of query conditions and their probabilities. In addition, the choice of block file size also takes the runtime environment of the query method into account: since the method is implemented with MapReduce, characteristics of MapReduce such as file addressing time and data processing time must be considered. Table 2 lists the definitions of the relevant symbols, where λi is a variable, T and Na are computed results, and the rest are known constants.

Table 2: definitions of the relevant symbols


The average number of blocks hit by one query, Na, can be obtained from the probability of occurrence of each query condition, as shown in formula (1).

[Formula (1)]

After some MapReduce-related factors are taken into account, the average time consumed by one OLAP operation can be obtained, as shown in formula (2).

[Formula (2)]

The value of λi at which T is minimized can be calculated from formula (2), which gives the block size.

In this embodiment, suppose the computed numbers of blocks along the dimensions are 6, 4 and 5, i.e. the block edge lengths are 558, 32 and 10: the time dimension is divided with a step of 558, the area dimension with a step of 32 and the depth dimension with a step of 10. The final division is shown in Figure 7, where the smaller squares are cells and the larger squares are data blocks (e.g., blocks 5, 11, 18, 19, 20, 21, 22, 23, 42, 43 and 44). For ease of drawing, each block in the figure contains only 9 cells, but in practice a block contains 558 × 32 × 10 = 178560 cells. This embodiment contains 6 × 4 × 5 = 120 data blocks in total.

When the computed number of blocks along a dimension does not divide that dimension evenly, for example if the computed block counts were 7, 4 and 5, the time dimension cannot be divided exactly (3348/7 is not an integer); the division step along that dimension is then obtained by rounding up, so the step on the time dimension becomes 479, and the affected data blocks are padded with null values.
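
A minimal sketch of deriving the block edge lengths (division steps) from the chosen number of blocks per dimension, rounding up when the cube length is not evenly divisible, as described above; the function name is illustrative.

    from math import ceil

    def block_edges(cube_lengths, blocks_per_dim):
        """Division step along each dimension, rounded up when not evenly divisible."""
        return [ceil(n / k) for n, k in zip(cube_lengths, blocks_per_dim)]

    print(block_edges([3348, 128, 50], [6, 4, 5]))   # [558, 32, 10]
    print(block_edges([3348, 128, 50], [7, 4, 5]))   # [479, 32, 10]  (3348/7 rounded up)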

Step 2.4: encode the data blocks obtained in step 2.3, as follows: divide the coordinates of any cell inside a block by the edge lengths of the block and round the results down; the resulting values are the code of the block.

Logically, a data block can be viewed as a small cube containing a subset of the cells of the data cube, and the data cube can likewise be viewed as being composed of data blocks. In this embodiment, suppose one record says that the sea-water temperature at 40m depth in a 1/32° square on the morning of June 1, 2012 is 15°C; its coordinates are <2697, 1, 4>. With block edge lengths of 558, 32 and 10, the block it belongs to has coordinates <4, 0, 0>: 4 is obtained by dividing the record's time code 2697 by the block's length 558 on the time dimension and rounding down; the first 0 by dividing the area code 1 by the block's length 32 on the area dimension; and the second 0 by dividing the depth code 4 by the block's length 10 on the depth dimension.
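
A one-function sketch of the block addressing of step 2.4, reproducing the worked example above; the worked example corresponds to integer (floor) division, which is what this sketch uses, and the function name is illustrative.

    def block_of(cell, block_edges):
        """Block coordinates of the data block containing the given cell."""
        return tuple(x // e for x, e in zip(cell, block_edges))

    print(block_of((2697, 1, 4), (558, 32, 10)))   # (4, 0, 0)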

Step 2.5: store the data blocks cut in step 2.3 in the distributed file system, using the codes established in step 2.4 as the names of the block files.

To store the data blocks and the cells inside them in the distributed file system, this embodiment first serializes their codes. Logically, the data structures "a cube and its cells" and "a cube and its blocks" are both analogous to "a multi-dimensional array and its elements". Physically, a block is the storage unit of the cube: after the cells inside a block are linearized, the block can be stored as an independent file. To simplify addressing, both blocks and cells must support linearization and de-linearization operations, which are identical to the linearization and de-linearization of a multi-dimensional array. Let there be an n-dimensional array whose dimension scale is written <A1, A2, ..., An>; the coordinates of an element X of the array in the multi-dimensional space are written (X1, X2, ..., Xn) and its linearized coordinate is written index(X). The linearization is given by formula (3) and the de-linearization by formula (4).

index(X) = (...((Xn×An-1 + Xn-1)×An-2 + ... + X3)×A2 + X2)×A1 + X1     (3)

temp1 = index
X1 = temp1 % A1
temp2 = ⌊temp1 / A1⌋
X2 = temp2 % A2
......
tempn = ⌊tempn-1 / An-1⌋
Xn = tempn % An     (4)

The linearization and de-linearization of cell codes and block codes are the same as the linearization and de-linearization of a multi-dimensional array described above. When linearizing and de-linearizing a data block, the dimension scale is the number of blocks along each dimension; when linearizing and de-linearizing a cell, the dimension scale is the length of each dimension of the data cube, i.e. the number of dimension values at the lowest level of each dimension making up the cube.

Suppose a record says that the sea-water temperature at 40m depth in a 1/32° square on the morning of June 1, 2012 is 15°C; its coordinates are <2697, 1, 4>, and the edge lengths of the data cube are 3348, 128 and 50, where 3348 is the number of dimension values at the lowest level of the time dimension (the Slot level), 128 is the number at the lowest level of the area dimension (the 1/64° level) and 50 is the number at the lowest level of the depth dimension (the 10m level). Applying formula (3) directly, the linearized coordinate of this record is (4×128+1)×3348+2697 = 1720221. De-linearization uses formula (4): initially temp1 = 1720221; X1 = 1720221 % 3348 = 2697; X2 = 513 % 128 = 1, where 513 = ⌊1720221/3348⌋; X3 = 4 % 50 = 4, where 4 = ⌊513/128⌋; so the de-linearized coordinates are <2697, 1, 4>.

Suppose a data block has coordinates <4, 0, 0>, where the numbers of blocks along the dimensions are 6, 4 and 5. The linearized coordinate of this block is ((0×4+0)×6)+4 = 4. De-linearization uses formula (4): initially temp1 = 4; X1 = 4 % 6 = 4; X2 = 0 % 4 = 0, where 0 = ⌊4/6⌋; X3 = 0 % 5 = 0; so the de-linearized coordinates of the block are <4, 0, 0>.
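
A short sketch of the linearization of formula (3) and the de-linearization of formula (4), checked against the two worked examples above; the function names are illustrative.

    def linearize(coord, scale):
        """Formula (3): index(X) = (...((Xn*A(n-1)+X(n-1))*A(n-2)+...+X3)*A2+X2)*A1 + X1."""
        index = 0
        for x, a in zip(reversed(coord), reversed(scale)):
            index = index * a + x
        return index

    def delinearize(index, scale):
        """Formula (4): X1 = index % A1, then repeatedly divide by Ai and take the remainder."""
        coord, temp = [], index
        for a in scale:
            coord.append(temp % a)
            temp //= a
        return tuple(coord)

    print(linearize((2697, 1, 4), (3348, 128, 50)))   # 1720221
    print(delinearize(1720221, (3348, 128, 50)))      # (2697, 1, 4)
    print(linearize((4, 0, 0), (6, 4, 5)))            # 4
    print(delinearize(4, (6, 4, 5)))                  # (4, 0, 0)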

If the coordinates of the currently loaded data block are <4, 0, 0>, the coordinate range of the cells inside the block is <2232, 0, 0> to <2769, 31, 9>. The starting value 2232 is the start of the block's code range on the time dimension, computed as 4×558, and the end value 2769 on the time dimension is computed as 2232+558-1. If the currently loaded record has coordinates <2769, 31, 9>, it corresponds to the sea-water temperature at 90m depth in the 32/64° square on the morning of June 24, 2012; assuming its value is 20°C, the newly generated record is <3961241, 20>, where 3961241 = ((9×128)+31)×3348+557. The record <3961241, 20> is appended to the data block file; once the whole block file has been imported, its file name is the serialized coordinate of the block, namely 4.

All the block files are loaded in the same way. The data cube loaded at this point is named the original cube, and the data loading is complete.

Step 3: query the massive data using on-line analytical processing (OLAP).

Step 3.1: the user sets the query conditions, including:

Query target: the data cube against which the query is to be executed, i.e. the target cube.

For an initial query, the data cube formed by the massive data is called the original data cube. The initial query produces a lower-level cube of the original cube, and further queries produce lower-level cubes of that cube, or other lower-level cubes of the original cube; any of these newly produced data cubes can be chosen as the basis of the next query.

Query range: which part of the data within the established query target is queried.

The client specifies the range of data to be analyzed. A user-defined translation method translates the actual meanings of the dimension values in the data range into their codes. If the user wants to query, in the original cube, the temperature data from 1m in the 4/64° square on the afternoon of April 20, 2012 to 90m in the 32/64° square on the morning of June 25, 2012, the corresponding coded range after translation is <2570, 4, 0> to <2769, 31, 9>.

Dimension information of the result: the dimension information of the result data cube.

If we want the daily average sea-water temperature over all data within the selected range, the dimension information of the result data is as follows: the time dimension is year-month-day, the area dimension is the same as the target cube, and the depth dimension is the same as the target cube.

Aggregation method: the operation used to aggregate the data within the query range.

For example, if the user wants the daily average over the range from 1m in the 4/64° square on the afternoon of April 20, 2012 to 90m water depth in the 32/64° square on the morning of June 25, 2012, the aggregation method is the average. Other commonly used aggregation methods include sum, maximum and minimum.

步骤3.2:判断步骤3.1设置好的查询条件是否满足如下约束条件:Step 3.2: Determine whether the query conditions set in step 3.1 meet the following constraints:

约束1:查询目标已存在,且查询范围应小于或等于查询目标的数据范围;Constraint 1: The query target already exists, and the query range should be less than or equal to the data range of the query target;

用户所指定的查询目标必须是在系统中已经存在的。例如在系统初始化阶段,用户装载了所有的海量数据并命名为原始立方。若用户在初次查询时候指定了其他立方,则不符合约束1。若用户载入的数据为2010年至2012年、2°方区、500m层深的数据且在该原始立方中查询,但查询内容涉及到了2013年的数据,则查询范围大于了查询目标的范围,不符合约束1。实际上,对于多维数据而言,任何一维不在查询目标的数据范围内都不符合约束1。The query target specified by the user must already exist in the system. For example, in the system initialization phase, the user loaded all the massive data and named it the original cube. If the user specifies other cubes during the initial query, constraint 1 is not met. If the data loaded by the user is data from 2010 to 2012, 2° square area, 500m layer depth and is queried in the original cube, but the query content involves the data of 2013, the query range is greater than the range of the query target , which does not meet constraint 1. In fact, for multidimensional data, any one dimension that is not within the data range of the query target does not meet constraint 1.

约束2:结果数据立方的维数量应小于或等于查询目标的维数量;Constraint 2: The number of dimensions of the result data cube should be less than or equal to the number of dimensions of the query target;

若用户给出的查询目标是原始立方,原始立方的维是时间维、地区维、层深维,但是用户设定的结果数据立方中涉及到了其他维,例如记录人员维,此时是不符合约束2的。当用户仅仅指定了时间维而没有指定地区维和层深维时,是符合约束2的。此时意味着,用户不改变地区维和层深维的粒度或者说在地区维和层深维上结果数据立方和查询目标保持一致。If the query target given by the user is the original cube, the dimensions of the original cube are time dimension, region dimension, and layer depth dimension, but the result data cube set by the user involves other dimensions, such as the recording personnel dimension, which is not suitable at this time. Constraint 2. Constraint 2 is met when the user only specifies the time dimension but does not specify the area dimension and layer depth dimension. At this time, it means that the user does not change the granularity of the region dimension and layer depth dimension or that the result data cube is consistent with the query target on the region dimension and layer depth dimension.

约束3:结果数据立方的任意维的最低维级别应高于查询目标对应维的最低维级别;Constraint 3: The lowest dimension level of any dimension of the result data cube should be higher than the lowest dimension level of the corresponding dimension of the query target;

在本实施方案中,结果数据立方与查询目标在地区维和层深维上没有发生改变,仅仅在时间为上,结果数据立方的维层次为年-月-日,原始数据立方的维层次是年-月-日-次。意味着结果数据立方较查询目标在时间维上上升了一个级别。这是符合约束3的,若我们假设查询立方和结果数据立方的维信息对调,则意味着结果数据立方较查询目标在时间维上下降了一个级别,这是不符合约束3的。In this implementation, the result data cube and the query target do not change in the region dimension and layer depth dimension, only in time, the dimension level of the result data cube is year-month-day, and the dimension level of the original data cube is year -month-day-times. It means that the resulting data cube is one level higher than the query target in the time dimension. This is in compliance with constraint 3. If we assume that the dimension information of the query cube and the result data cube are swapped, it means that the result data cube is one level lower than the query target in the time dimension, which does not meet constraint 3.

Constraint 4: The aggregation method must be distributive or algebraic;

An aggregate function is distributive if it can be computed in the following distributed manner: the data is partitioned into n sets and the function is computed on each partition, giving one aggregate value per partition. If applying the function to the n partial aggregates yields the same result as applying it to all of the data at once, the function can be computed distributively. Common distributive aggregation methods include summation, maximum, minimum, and count.

An aggregate function is algebraic if it can be computed by an algebraic function with several arguments, each of which can itself be obtained by a distributive aggregate function. A common algebraic aggregation method is the average.

Using the average as the aggregation method, as in this embodiment, satisfies Constraint 4. Using the median as the aggregation method would not.
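To make the distinction concrete, the short Python sketch below (not part of the patent; the partitioning is arbitrary) checks that sum, max and count merge correctly from per-partition results, while the average has to be rebuilt from the distributive pair (sum, count) rather than from partial averages.

```python
# Minimal sketch: distributive vs. algebraic aggregation over partitioned data.
# The data values and the partitioning are illustrative assumptions.

data = [3.0, 5.0, 7.0, 9.0, 11.0]
partitions = [data[:2], data[2:4], data[4:]]          # data divided into n sets

# Distributive: applying the function to the partial results gives the same
# answer as applying it to all of the data at once.
assert sum(sum(p) for p in partitions) == sum(data)
assert max(max(p) for p in partitions) == max(data)
assert sum(len(p) for p in partitions) == len(data)

# Algebraic: the average is computed from two distributive aggregates,
# (sum, count); averaging the partial averages would in general be wrong.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
assert total / count == sum(data) / len(data)
```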

If Constraints 1, 2, and 4 are satisfied simultaneously (but Constraint 3 is not), go to Step 3.3;

If Constraints 1 through 4 are all satisfied simultaneously, go to Step 3.4;

If neither of the above cases applies, the query fails and the process ends;
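A compact sketch of this dispatch, with hypothetical field names for the query and cube records and with Constraint 3 reduced to a per-dimension level comparison, could look as follows.

```python
# Illustrative sketch of the Step 3.2 dispatch; all record fields are assumptions.
DISTRIBUTIVE_OR_ALGEBRAIC = {"sum", "max", "min", "count", "avg"}

def dispatch(query, cubes):
    """Return the next step for a query, mirroring Step 3.2."""
    target = cubes.get(query["target"])
    # Constraint 1: the target exists and the query range lies inside its range.
    c1 = target is not None and all(lo >= t_lo and hi <= t_hi
                                    for (lo, hi), (t_lo, t_hi)
                                    in zip(query["range"], target["range"]))
    # Constraint 2: no more dimensions in the result cube than in the target.
    c2 = target is not None and len(query["result_dims"]) <= len(target["dims"])
    # Constraint 3: levels are numbered from the top of the hierarchy, so a
    # smaller number means a higher (coarser) level.
    c3 = target is not None and all(query["result_level"][d] < target["lowest_level"][d]
                                    for d in query["result_dims"])
    # Constraint 4: the aggregation method is distributive or algebraic.
    c4 = query["aggregate"] in DISTRIBUTIVE_OR_ALGEBRAIC
    if c1 and c2 and c4:
        return "step 3.4" if c3 else "step 3.3"
    return "query fails"
```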

Step 3.3: Convert the query target: find the parent cube of the current query target and check whether it satisfies Constraint 3. If it does not, continue with the parent cube of that parent cube, and so on. If no cube along this chain can satisfy Constraint 3, the query fails and the query process ends; if a parent cube satisfying Constraint 3 is found, take that cube as the new target cube;

When Constraint 3 is not satisfied, the query target can be converted so that Constraint 3 becomes satisfied. If a query cannot be made to satisfy Constraint 3 by this conversion, the query fails.

Suppose cube 1 has the dimension hierarchies year-month-day-occurrence, 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°, and 100m-50m-10m; cube 2 has year-month-day, 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°, and 100m-50m-10m; and cube 3 has year-month-day-occurrence, 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°, and 100m-50m-10m. In terms of data range, cube 1 contains cube 2 and cube 2 contains cube 3. Assume that cube 2 was obtained by a query on cube 1, and that we now plan to obtain cube 3 through a query whose target is cube 2. This query violates Constraint 3, so the target cube is converted. Since cube 2 was produced by a query built on cube 1, cube 1 is the parent cube of cube 2. The target cube is therefore changed to cube 1: the query now has cube 3 as the result data cube and cube 1 as the target cube, Constraint 3 is checked again and is satisfied, and the conversion succeeds. If Constraint 3 still cannot be satisfied even after converting all the way up to the original cube, the query fails.
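The conversion of Step 3.3 amounts to walking up the chain of parent cubes. A minimal sketch, assuming each cube records the cube it was derived from (the helper names are illustrative), is shown below.

```python
# Sketch of Step 3.3: climb the parent chain until Constraint 3 holds.
# parent_of and satisfies_constraint_3 are illustrative assumptions.

def convert_target(target, result_cube, parent_of, satisfies_constraint_3):
    cube = target
    while not satisfies_constraint_3(result_cube, cube):
        if cube not in parent_of:          # already at the original cube
            return None                    # query fails
        cube = parent_of[cube]             # e.g. parent_of["cube2"] == "cube1"
    return cube                            # new query target

# With the cubes of this example, parent_of = {"cube2": "cube1"} and a
# Constraint-3 check that fails for "cube2" but holds for "cube1", the call
# convert_target("cube2", "cube3", ...) returns "cube1".
```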

Step 3.4: Coarse screening of the data: determine, from the query range, the minimal range of data blocks required by the query;

Figure 8 illustrates the coarse screening process. In this embodiment the query is issued against the original cube, and the data range is the temperature data from the afternoon of April 20, 2012, 4/64° square region, 1 m depth, to the morning of June 25, 2012, 32/64° square region, 90 m depth; the corresponding coding range is <2570, 4, 0> to <2769, 31, 9>. In Figure 8, point B has coordinates <2570, 4, 0> and point H has coordinates <2769, 31, 9>, so the cuboid ABCD-EFGH is the data range to be queried. To determine the minimal range of data blocks, the coordinates of points B and H are converted into the coordinates of the blocks that contain them, using the same conversion as in Step 2.4. In this embodiment the block side lengths are 558, 32, and 10, and after conversion the block range is <4, 0, 0> to <4, 0, 0>, so only one block file needs to be scanned. In Figure 8 the corresponding cuboid A'B'C'D'-E'F'G'H' is the data contained in block <4, 0, 0>. This embodiment contains 6*4*5=120 data blocks in total; coarse screening therefore restricts processing to the data the query actually needs and excludes the other 119 blocks from the computation.
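The block range can be computed directly from the two corner coordinates by integer division with the block side lengths. The sketch below assumes 0-based block coordinates obtained by flooring, which is what reproduces the block <4, 0, 0> given above.

```python
# Coarse screening (Step 3.4): map the query corners to block coordinates.
# Block coordinates are assumed to be 0-based, obtained by integer division.

SIDE = (558, 32, 10)                       # block side lengths in this embodiment

def block_of(cell):
    return tuple(c // s for c, s in zip(cell, SIDE))

low, high = (2570, 4, 0), (2769, 31, 9)    # query corners B and H
b_low, b_high = block_of(low), block_of(high)
print(b_low, b_high)                       # (4, 0, 0) (4, 0, 0): one block to scan

# Number of blocks actually touched by the query:
touched = 1
for lo, hi in zip(b_low, b_high):
    touched *= hi - lo + 1
print(touched)                             # 1 (out of 6*4*5 = 120 blocks)
```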

Step 3.5: Fine screening of the data: scan the block files selected in Step 3.4 and filter every cell within those blocks against the query range. If a cell lies within the query range, go to Step 3.6; otherwise discard the cell;

In this embodiment the query method for massive data with multi-dimensional information is implemented with the MapReduce programming model. The implementation consists of four parts, InputFormatter, Mapper, Reducer, and OutputFormatter, corresponding to the four steps of fine screening of the data, changing the dimension levels, aggregation, and output of the result set. The execution flow is shown in Figure 9.

In this embodiment, the block that passes coarse screening has coordinates <4, 0, 0>, and the corresponding block name is 4, where 4 is obtained by linearizing the block coordinates. The InputFormatter reads data at the file level, so it reads the block file that passed coarse screening, namely block file 4. The InputFormatter reads every record of block 4 in turn; each record has the form <key, value>, where key is the linearized coordinate of the cell just read and value is the data value stored in that cell. The InputFormatter then delinearizes the key to obtain the coordinates of the data within the data cube and compares them with the query range: if the coordinates fall inside the query range the record is kept, otherwise it is discarded. For example, if the InputFormatter reads the record <3961241, 20>, delinearizing the coordinate 3961241 yields <2570, 4, 0>; since the query range in this embodiment is <2570, 4, 0> to <2769, 31, 9>, this record falls inside the query range and proceeds to the next step, which changes the cell's dimension levels. Otherwise the record is discarded.
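A sketch of this filtering step is given below. The dimension extents are hypothetical placeholders, since the patent does not state the linearization constants, but the mechanism, row-major delinearization followed by a per-dimension bounds check, is the one described above.

```python
# Sketch of the InputFormatter filtering step (Step 3.5).
# EXTENTS are hypothetical per-dimension sizes used only to illustrate
# row-major (de)linearization; they are not taken from the patent.

EXTENTS = (3348, 128, 50)                 # time, region, depth (illustrative)

def linearize(coords, extents=EXTENTS):
    key = 0
    for c, size in zip(coords, extents):
        key = key * size + c
    return key

def delinearize(key, extents=EXTENTS):
    coords = []
    for size in reversed(extents):
        key, c = divmod(key, size)
        coords.append(c)
    return tuple(reversed(coords))

def in_range(coords, low, high):
    return all(l <= c <= h for c, l, h in zip(coords, low, high))

low, high = (2570, 4, 0), (2769, 31, 9)
cell = (2570, 4, 0)
assert in_range(delinearize(linearize(cell)), low, high)   # record is kept
```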

Step 3.6: Change the dimension levels of the cell: compare the dimension information of the result data cube with that of the target cube, determine which dimensions change, and modify the corresponding components of the cell coordinates;

Figure 9 shows the complete flow of the MapReduce-based massive-data query method in this embodiment. A record in a block file is abstracted as <key, value>, where key is the linearized coordinate of a cell and value is the data value stored in that cell. The InputFormatter reads the required block files and delinearizes the key, producing data of the form <(a1, a2, ..., an), value>, where (a1, a2, ..., an) is the actual coordinate of the cell. After being checked against the query conditions, all qualifying records enter the Mapper, where the dimension levels are changed: the data is transformed from <(a1, a2, ..., an), value> into <(b1, b2, ..., bn), value>, where (b1, b2, ..., bn) is the cell coordinate after the level change. The new coordinate is then linearized and passed to the Reducer, so every Mapper output has the form <index, value>; the value never changes during this processing. Records with the same index are routed to the same Reducer, which aggregates all of their values with the specified aggregation method and emits the result to the OutputFormatter, which finally generates the output block files. Changing the dimension levels of the cell coordinates inside the Mapper is the key operation of the whole computation.
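The chain of Figure 9 can be imitated outside Hadoop with ordinary functions. The sketch below is a plain-Python stand-in for the InputFormatter/Mapper/Reducer pipeline rather than actual Hadoop code; all helpers, including the dimension-level change described in the following paragraphs, are passed in as parameters.

```python
from collections import defaultdict

# Plain-Python stand-in for the MapReduce chain of Figure 9 (not Hadoop code).
# All helpers are assumptions passed in as parameters; change_level is assumed
# to raise every dimension of a coordinate to the requested level.

def run_query(records, low, high, delinearize, linearize_new, change_level, aggregate):
    # InputFormatter: delinearize each key and keep only cells inside the range.
    kept = []
    for key, value in records:
        coords = delinearize(key)
        if all(l <= c <= h for c, l, h in zip(coords, low, high)):
            kept.append((coords, value))
    # Mapper: change dimension levels, then linearize with the result cube's extents.
    mapped = [(linearize_new(change_level(coords)), value) for coords, value in kept]
    # Shuffle: records with the same index are routed to the same Reducer.
    groups = defaultdict(list)
    for index, value in mapped:
        groups[index].append(value)
    # Reducer + OutputFormatter: one aggregated record per index.
    return {index: aggregate(values) for index, values in groups.items()}
```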

To change the dimension coding of a cell, this embodiment introduces the concept of an analysis path. Figure 10 illustrates a change of dimension level for a dimension whose hierarchy is year-month-day. The year level has 4 dimension values: 2010, 2011, 2012, and 2013; the month level has 48 values: January 2010, February 2010, ..., December 2013; at the day level every month is fixed at 31 days, giving 1488 values. Take the dimension value February 2, 2010 as an example: its dimension code is known to be 32, and the value has to be raised to the month level, i.e., the dimension code of February 2010 must be computed. To simplify this computation, the concept of the analysis path is introduced. For brevity, the dimension value February 2, 2010 is written <2010_1, 2_2, 2_3>, where 2010_1 means that at the first level of the time dimension (the year level) the value is 2010, and similarly for the other components. The position of a dimension value among its sibling nodes, counted from left to right starting at 0, is called the analysis path of that value. For <2010_1, 2_2, 2_3> the analysis path is <0, 1, 1>: at the year level 2010 is a sibling of all other year values and is the first of them, so its component is 0; at the month level February 2010 is a sibling of the other months of 2010 (their common parent being 2010) and is the second value, so, counting from 0, its component is 1; the component for February 2, 2010 at the day level is obtained in the same way. Given that the code of <2010_1, 2_2, 2_3> is 32, the code of <2010_1, 2_2> is required. The conversion between codes and analysis paths is given by formulas (5) and (6) below, where code() denotes the code, e.g. code(2010_1) = 0 is the code of 2010, and order() denotes the analysis path, e.g. order(2010_1) = 0 is the analysis path of 2010. |l_i| denotes the number of sibling nodes of any dimension value at the i-th dimension level: at the year level |l_1| = 4 and all years are siblings of one another; at the month level |l_2| = 12 and the months of the same year are siblings; at the day level |l_3| = 31 and the days of the same month are siblings.

code(v_i) = ( ... ((0 + order(v_1)) × |l_2| + order(v_2)) × |l_3| + order(v_3) ... ) × |l_i| + order(v_i)    (5)

temp_i = code(v_i)
order(v_i) = temp_i % |l_i|
temp_(i-1) = floor(temp_i / |l_i|)
order(v_(i-1)) = temp_(i-1) % |l_(i-1)|
...
order(v_1) = temp_1 % |l_1|    (6)

It is known that code(<2010_1, 2_2, 2_3>) = 32. By formula (6), temp_3 = 32, so the order at the day level is 32 % 31 = 1; temp_2 = floor(32 / 31) = 1, so the order at the month level is 1 % 12 = 1; temp_1 = floor(1 / 12) = 0, so the order at the year level is 0 % 4 = 0. The analysis path of <2010_1, 2_2, 2_3> is therefore <0, 1, 1>, and since the relative positions of the nodes do not change, the analysis path of <2010_1, 2_2> is <0, 1>. By formula (5), code(<2010_1, 2_2>) = (0 + 0) × 12 + 1 = 1, so the code of February 2010 is 1. This completes the conversion of the code of February 2, 2010 into the code of February 2010.
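Formulas (5) and (6) translate directly into a pair of small functions. The sketch below is illustrative rather than the patented implementation; with sibling counts 4, 12, and 31 for year, month, and day it reproduces the February example.

```python
# Formulas (5) and (6): conversion between a dimension code and its analysis path.
# siblings[i] is |l_(i+1)|, the number of sibling nodes at level i+1.

def path_to_code(path, siblings):
    code = 0
    for order, count in zip(path, siblings):
        code = code * count + order           # formula (5)
    return code

def code_to_path(code, siblings):
    path = []
    for count in reversed(siblings):          # formula (6): repeated mod and div
        code, order = divmod(code, count)
        path.append(order)
    return list(reversed(path))

siblings = [4, 12, 31]                        # |l_1|, |l_2|, |l_3| for year-month-day
path = code_to_path(32, siblings)             # code of February 2, 2010
print(path)                                   # [0, 1, 1]
print(path_to_code(path[:2], siblings[:2]))   # 1, the code of February 2010
```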

In this embodiment the query method for massive data is implemented with MapReduce, and the key points of the implementation are the update of cell coordinates and the aggregation of the measures inside the cells. For ease of description the process is illustrated with a 2-dimensional cube. Figure 11 shows how cell coordinates and their values change while the query method executes, where d1 and d2 are the two dimensions, each divided into 4 parts. The dimension hierarchy of d1 is l_1^1-l_2^1-l_3^1 and that of d2 is l_1^2-l_2^2-l_3^2-l_4^2; the corresponding parameters are given in Table 3. |l| denotes the number of sibling nodes of any node at dimension level l.

Table 3: Parameters related to the execution of the OLAP algorithm (reproduced as an image in the original publication).

Suppose the cell being processed has the linearized coordinate 126, its fact value is 67, and the coordinate of its block is 6. If the cell satisfies the query conditions it is handed to the Mapper, whose input is <126, 67>. In the Mapper, 126 is first delinearized to (10, 6). If the new dimension levels required by the query are the second levels of the two dimensions, l_2^1 and l_2^2, the two dimension codes are updated by formulas (5) and (6), giving (1, 0); linearizing again yields the index 1, and the Mapper outputs <1, 67>. This step ends once the codes of all input records have been changed.

Step 3.7: For cells with the same coordinates, aggregate the fact data values in those cells according to the specified aggregation method;

Continuing with the example of Figure 11: if other Mappers output the records <1, 57> and <1, 43>, the Reducer's input is <1, {67, 57, 43}>. If summation is used as the aggregation method, the Reducer's output is <1, 167>. The Reducer output is stored as records in the block file generated by the OutputFormatter.
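A minimal sketch of this grouping and aggregation step, using the numbers of the example, is:

```python
from collections import defaultdict

# Reducer sketch for the example above: group Mapper outputs by index and sum.
mapper_output = [(1, 67), (1, 57), (1, 43)]

groups = defaultdict(list)
for index, value in mapper_output:
    groups[index].append(value)

reduced = {index: sum(values) for index, values in groups.items()}
print(reduced)        # {1: 167}
```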

Step 3.8: The data aggregated in Step 3.7 forms the result data cube; the information of the result data cube is returned to the user, and the result cube is stored as a new data cube so that it can serve as the query target of a subsequent query.

After all MapReduce tasks have completed, a success message is returned to the client, and the result cube is stored as a new data cube so that it can serve as the query target of the next round of queries.

Although specific embodiments of the present invention have been described above, those skilled in the art will understand that these are merely illustrative, and that various changes or modifications may be made to these embodiments without departing from the principle and essence of the invention. The scope of the invention is defined solely by the appended claims.

Claims (2)

1. A method for querying massive data with multi-dimensional information, characterized in that it comprises the following steps:

Step 1: Load the dimension information of the massive data with multi-dimensional information, specifically comprising the following steps:

Step 1.1: Examine the dimension information of the massive data and determine whether every dimension simultaneously satisfies the following three constraints:

Constraint 1: A dimension consists of one and only one dimension hierarchy, i.e., the dimension is a total order over all of its dimension levels;

Constraint 2: Any dimension level of a dimension contains exactly one dimension attribute, and that attribute contains several dimension values;

Constraint 3: In the dimension-value tree formed by all dimension values, sibling nodes have the same number of child nodes;

If the constraints are satisfied, go to Step 1.3; otherwise go to Step 1.2;

Step 1.2: Process the dimension information so that every dimension forms a dimension-value tree that satisfies the constraints, as follows:

For Constraint 1: if there are multiple dimension hierarchies, discard hierarchies as needed and keep only one;

For Constraint 2: if a dimension level contains multiple dimension attributes, discard attributes as needed and keep only one;

For Constraint 3: if sibling nodes have different numbers of child nodes, add null values so that all siblings have the same number of child nodes;

Step 1.3: Encode the dimension information;

For the dimension values of each level of the dimension-value tree, assign decimal codes sequentially from left to right; the encoding is complete when every dimension value has a corresponding code;

Step 1.4: Store the encoded dimension information;

For each dimension of the massive data, store the name of every dimension level together with the number of sibling nodes at that level, finally forming a file containing all dimension information of the massive data, which is stored in the distributed file system;

Step 2: Load the massive data;

Step 2.1: The user establishes, as required, the correspondence between the actual meaning of the dimension information of the massive data and its codes, i.e., any piece of the massive data is represented by the codes of its dimension information;

Step 2.2: All of the finest-grained multi-dimensional massive data forms a data cube structure; any piece of the massive data becomes one cell of the data cube, and the information of the cell comprises the coordinates of the cell within the cube and the fact data value represented by the cell, the coordinates of a cell being expressed as:

<code of one dimension, code of another dimension, ..., code of the last dimension>;

Step 2.3: Cut the data cube:

According to the user's query requirements, and under the condition that the query time is minimized, cut the data cube into data blocks and determine the side lengths of the blocks;

Step 2.4: Encode the data blocks obtained in Step 2.3, as follows: divide the coordinates of any cell within a block by the side lengths of the block and round the result up; the value so obtained is the code of the block;

Step 2.5: Store the data blocks cut in Step 2.3 in the distributed file system, using the codes established in Step 2.4 as the names of the block files;

Step 3: Query the massive data with the on-line analytical processing (OLAP) method;

Step 3.1: The user sets the query conditions, including:

Query target: the data cube against which the query is issued, i.e., the target cube;

Query range: the part of the data, within the established query target, that is queried;

Dimension information of the result: the dimension information of the result data cube;

Aggregation method: the operation used to aggregate the data within the query range;

Step 3.2: Determine whether the query conditions set in Step 3.1 satisfy the following constraints:

Constraint 1: The query target already exists, and the query range is less than or equal to the data range of the query target;

Constraint 2: The number of dimensions of the result data cube is less than or equal to the number of dimensions of the query target;

Constraint 3: The lowest dimension level of any dimension of the result data cube is higher than the lowest dimension level of the corresponding dimension of the query target;

Constraint 4: The aggregation method is distributive or algebraic;

If Constraints 1, 2, and 4 are satisfied simultaneously, go to Step 3.3;

If Constraints 1 through 4 are all satisfied simultaneously, go to Step 3.4;

If neither of the above cases applies, the query fails and the process ends;

Step 3.3: Convert the query target: find the parent cube of the current query target and determine whether it satisfies Constraint 3; if not, continue with the parent cube of that parent cube; if no such cube can satisfy Constraint 3, the query fails and the query process ends; if a parent cube satisfying Constraint 3 is found, take that cube as the target cube;

Step 3.4: Coarse screening of the data: determine, from the query range, the minimal range of data blocks required by the query;

Step 3.5: Fine screening of the data: scan the block files selected in Step 3.4 and filter every cell within those blocks against the query range; if a cell lies within the query range, go to Step 3.6; otherwise discard the cell;

Step 3.6: Change the dimension levels of the cell: compare the dimension information of the result data cube with that of the target cube, determine which dimensions change, and modify the corresponding components of the cell coordinates;

Step 3.7: For cells with the same coordinates, aggregate the fact data values in the cells according to the specified aggregation method;

Step 3.8: The data aggregated in Step 3.7 forms the result data cube; return the information of the result data cube to the user and store the result cube as a new data cube so that it can serve as the query target of a subsequent query.

2. The method for querying massive data with multi-dimensional information according to claim 1, characterized in that the finest granularity in Step 2.2 refers to the data pointed to by the lowest dimension level.
CN201310350126.7A 2013-08-13 2013-08-13 A kind of mass data inquiry method with multidimensional information Expired - Fee Related CN103425772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310350126.7A CN103425772B (en) 2013-08-13 2013-08-13 A kind of mass data inquiry method with multidimensional information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310350126.7A CN103425772B (en) 2013-08-13 2013-08-13 A kind of mass data inquiry method with multidimensional information

Publications (2)

Publication Number Publication Date
CN103425772A true CN103425772A (en) 2013-12-04
CN103425772B CN103425772B (en) 2016-08-10

Family

ID=49650511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310350126.7A Expired - Fee Related CN103425772B (en) 2013-08-13 2013-08-13 A kind of mass data inquiry method with multidimensional information

Country Status (1)

Country Link
CN (1) CN103425772B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120317137A1 (en) * 2010-01-28 2012-12-13 Guangzhou Ccm Information Science & Technology Co., Ltd. Method for multi-dimensional database storage and inquiry
CN102663117A (en) * 2012-04-18 2012-09-12 中国人民大学 OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942266A (en) * 2014-03-27 2014-07-23 上海巨数信息科技有限公司 Data analysis method capable of achieving self-defining of complex service computational logic on basis of OLAP
CN105302838A (en) * 2014-07-31 2016-02-03 华为技术有限公司 Classification method as well as search method and device
CN105302838B (en) * 2014-07-31 2019-01-15 华为技术有限公司 Classification method, lookup method and equipment
CN104462446A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Data cube based two-dimensional visual data display method and device
CN104408184A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN104408200A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Visual data display method and device based on data cubes
CN104408196A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN104462450A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Data cube based two-dimensional visual data display method and device
CN104391997A (en) * 2014-12-15 2015-03-04 北京国双科技有限公司 Data cube based visual data display method and device
CN104408186A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN104408181A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Visual data display method and device based on data cubes
CN105117497A (en) * 2015-09-28 2015-12-02 上海海洋大学 Ocean big data master-slave index system and method based on Spark cloud network
CN105117497B (en) * 2015-09-28 2018-12-07 上海海洋大学 Ocean big data principal and subordinate directory system and method based on Spark cloud network
CN106897293A (en) * 2015-12-17 2017-06-27 中国移动通信集团公司 A kind of data processing method and device
CN106897293B (en) * 2015-12-17 2020-09-11 中国移动通信集团公司 Data processing method and device
CN105677840A (en) * 2016-01-06 2016-06-15 东北大学 Data query method based on multi-dimensional increasing data model
CN105677840B (en) * 2016-01-06 2019-02-05 东北大学 A Data Query Method Based on Multidimensional Incremental Data Model
CN105843842A (en) * 2016-03-08 2016-08-10 东北大学 Multi-dimensional gathering querying and displaying system and method in big data environment
CN105956071A (en) * 2016-04-28 2016-09-21 乐视控股(北京)有限公司 Memory optimization method and memory optimization device for OLAP aggregation operation
CN106528847A (en) * 2016-11-24 2017-03-22 北京集奥聚合科技有限公司 Multi-dimensional processing method and system for massive data
CN110019541A (en) * 2017-07-21 2019-07-16 杭州海康威视数字技术股份有限公司 Data query method, apparatus and computer readable storage medium
CN110019541B (en) * 2017-07-21 2022-04-05 杭州海康威视数字技术股份有限公司 Data query method and device and computer readable storage medium
CN108388640A (en) * 2018-02-26 2018-08-10 北京环境特性研究所 A kind of data transfer device, device and data processing system
CN109408459A (en) * 2018-08-29 2019-03-01 先临三维科技股份有限公司 3D data processing method, device, computer equipment and storage medium
CN109241236A (en) * 2018-10-16 2019-01-18 中国海洋大学 Distributed organization and query processing method of marine geospatial multidimensional time-varying field data
CN109471893A (en) * 2018-10-24 2019-03-15 上海连尚网络科技有限公司 Querying method, equipment and the computer readable storage medium of network data
CN110543940A (en) * 2019-08-29 2019-12-06 中国人民解放军国防科技大学 Data processing method, system and medium for neural circuit volume based on hierarchical storage
CN116821174A (en) * 2023-07-17 2023-09-29 深圳计算科学研究院 Data query method and device based on logic data block

Also Published As

Publication number Publication date
CN103425772B (en) 2016-08-10

Similar Documents

Publication Publication Date Title
CN103425772B (en) A kind of mass data inquiry method with multidimensional information
US11537635B2 (en) Hadoop OLAP engine
Song et al. HaoLap: A Hadoop based OLAP system for big data
US20230084389A1 (en) System and method for providing bottom-up aggregation in a multidimensional database environment
CN112269792B (en) Data query method, device, equipment and computer readable storage medium
Moniruzzaman et al. Nosql database: New era of databases for big data analytics-classification, characteristics and comparison
CN103366015B (en) A kind of OLAP data based on Hadoop stores and querying method
Rusu et al. A survey on array storage, query languages, and systems
US11093473B2 (en) Hierarchical tree data structures and uses thereof
CN103678550B (en) Mass data real-time query method based on dynamic index structure
CN106372190A (en) Method and device for querying OLAP (on-line analytical processing) in real time
CN114048204B (en) Beidou grid spatial indexing method and device based on database inverted index
CN108009265B (en) A spatial data indexing method in cloud computing environment
CN105608135A (en) Data mining method and system based on Apriori algorithm
US11947596B2 (en) Index machine
Hu et al. A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data
Singh et al. Spatial data analysis with ArcGIS and MapReduce
Hashem et al. An Integrative Modeling of BigData Processing.
Ptiček et al. Big data and new data warehousing approaches
Tang et al. A hybrid index for multi-dimensional query in HBase
Villarroya et al. Enabling efficient distributed spatial join on large scale vector-raster data lakes
CN113485638B (en) Access optimization system for massive astronomical data
CN116755627A (en) Spatial data storage method, spatial data storage device, computer equipment and storage medium
Purdilă et al. Single‐scan: a fast star‐join query processing algorithm
Lian et al. Sql or nosql? which is the best choice for storing big spatio-temporal climate data?

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160810

CF01 Termination of patent right due to non-payment of annual fee