WO2018176623A1 - OLAP precomputation model, automatic modeling method, and automatic modeling system - Google Patents

OLAP precomputation model, automatic modeling method, and automatic modeling system

Info

Publication number
WO2018176623A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
module
model
query
dimensions
Prior art date
Application number
PCT/CN2017/086133
Other languages
English (en)
French (fr)
Inventor
韩卿
李栋
Original Assignee
上海跬智信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海跬智信息技术有限公司
Priority to EP17903560.5A (EP3605358A4)
Priority to US15/659,664 (US10902022B2)
Publication of WO2018176623A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24539 Query rewriting; Transformation using cached or materialised query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2393 Updating materialised views
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • the invention relates to the technical field of OLAP multidimensional data analysis, in particular to an OLAP precomputation model, an automatic modeling method and an automatic modeling system.
  • OLAP especially MOLAP
  • the data warehouse has a large amount of data, and multi-dimensional aggregation operations directly on a large amount of data require a large amount of computational resources and an excessively long query time.
  • OLAP provides a precomputation-based solution for improving the efficiency of multidimensional analysis: a "data cube" pre-aggregates the data in the data warehouse and saves the results. When an analyst runs an actual business query, the data need not be re-aggregated; the precomputed results are read directly, which makes it possible to analyze data at the scale of millions or even hundreds of millions of records.
  • OLAP Cube (data cube) is an abstraction of the multidimensional analysis data model in the data warehouse. It contains the different combinations of dimensions in the multidimensional analysis. For example, FIG. 6 contains four dimensions, namely time, item, location, and supplier; the different combinations of these four dimensions constitute different nodes in the OLAP Cube, and each node represents the result of the metric aggregation under that combination of dimensions.
  • the selected combination of dimensions corresponds to a point in the Cube, and the value considered is the result of the measurement aggregation behind the node.
  • In a common OLAP solution, in order to analyze selected dimensions more quickly, the OLAP Cube is materialized, that is, the metrics of each node of the OLAP Cube are aggregated in advance by precomputation and the results are saved. When a business analyst performs a query, the system can return the precomputed result directly. The O(N) aggregation operation is converted into an O(1) result lookup, which improves query efficiency.
  • the OLAP Cube also defines hierarchies over dimensions. For example, there is a hierarchical relationship between the three dimensions year, month, and day: year > month > day. These hierarchical relationships can often be mapped to concepts in existing applications, making it easier for analysts to apply them flexibly in data mining systems.
  • the data size is often on the order of hundreds of billions or even trillions of records, the number of dimensions is too large, and the dimension cardinality is very high, so there is a risk of dimension explosion. If all combinations of dimensions are still precomputed, precomputation takes too long and produces too much result data. This increases precomputation and storage costs on the one hand, and on the other makes scanning the large volume of precomputed results a challenge.
  • the technical problem to be solved by the present invention is that, for data analysis workloads with too many dimensions and very high dimension cardinality, current techniques risk dimension explosion, and precomputation takes too long and produces too much result data. This increases precomputation and storage costs on the one hand, and on the other makes scanning the large volume of precomputed results a challenge.
  • the present invention provides an OLAP pre-computation model, which includes: a dimension module, an aggregation group module, and a metric module;
  • the dimension module includes a normal dimension unit and a derived dimension unit; a normal dimension unit, configured to pre-calculate a field on the fact table;
  • the derived dimension unit is configured to pre-calculate the primary key on the dimension table, and record a mapping relationship between the column and the primary key on the dimension table;
  • the dimension-table primary keys of the derived dimensions in the derived dimension unit and the normal dimensions in the normal dimension unit are used as precomputed dimensions; the aggregation group module is configured to divide the precomputed dimensions in the dimension module into multiple aggregation groups;
  • the metric module is configured to generate precomputed results by aggregating over the combinations of all precomputed dimensions in the dimension module.
  • the aggregation group module includes: a mandatory dimension unit, a combined dimension unit, a hierarchical dimension unit, and a dimension-count range unit. The mandatory dimension unit is configured to record all dimension combinations that include a specific dimension A; the combined dimension unit records all dimension combinations that include a specific combined dimension AB; the hierarchical dimension unit records all dimension combinations that include a specific combined dimension ABC having a hierarchical relationship; the dimension-count range unit records all dimension combinations whose number of dimensions falls within a given range. While dividing all precomputed dimensions in the dimension module into multiple aggregation groups, the aggregation group module also saves all precomputed dimensions in the dimension module, for multidimensional queries across different aggregation groups.
  • the present invention proposes an aggregation group concept, that is, all pre-computed dimensions are divided into several aggregation groups, and different combinations are generated only within each aggregation group, and different aggregation groups are not cross-combined.
  • a full-scale combination is preserved for multidimensional queries across aggregate groups.
  • the derived dimension unit includes: a derived dimension, wherein the derived dimension is a dimension whose cardinality is approximately equal to the primary key cardinality.
  • the derived dimension itself does not participate in the precomputation; the primary key of the dimension table is precomputed instead.
  • a mirror copy of the dimension table is saved and used to record the mapping between the derived dimension columns and the primary key. Because the records of the dimension table are relatively stable and the amount of data is usually small, a query can quickly look up the mirror copy to convert between derived dimension values and primary key values, and then query the mirror copy by foreign key value to find the corresponding dimension value. When multiple derived dimensions are set on a dimension table, only one precomputed dimension is actually added, which achieves dimension reduction.
  • the invention also relates to an automatic modeling method based on an OLAP pre-calculation model, the method comprising the following steps:
  • the invention has the beneficial effects of solving the data explosion problem in the big data scenario by the method of the invention, improving the query efficiency, and improving the production efficiency through intelligent means.
  • the automatic modeling method is applied to the big data analysis platform based on Apache Kylin.
  • the Cube (precomputation model) created by automatic modeling still guarantees second-level query response times at the ten-billion-record scale and keeps the expansion rate within 10x, which effectively reduces the learning difficulty and trial-and-error cost for Apache Kylin users and improves the user experience.
  • the physical modeling in the S4 includes: dimension setting, metric setting, and aggregation group setting.
  • the derived dimension primary keys and normal dimensions are used as precomputed dimensions, arranged in descending order of cardinality.
  • the rules for the aggregation group setting are: set the minimum and maximum of the dimension-count range to specific default values; when CD(i) is equal to 1, set the dimension as a mandatory dimension; when CD(i)*CD(j) is greater than or equal to CD(i,j), set the i-th dimension and the j-th dimension as a group of combined dimensions; when CD(j) is equal to CD(i,j), set the i-th dimension and the j-th dimension as a group of hierarchical dimensions, the i-th and j-th dimensions being in a hierarchical relationship; wherein the function CD(i) is the cardinality of the i-th dimension.
  • the adjustment in S5 is performed, and the adjustment includes: pre-calculating the adjustment of the dimension order and the adjustment of the aggregation group.
  • all precomputed dimensions are sorted in descending order of Score(i); where F(i) is the number of times the i-th dimension is used as a filter condition in the queries, F is the total number of filter conditions in the query statistics, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension.
  • Wp as the physical modeling weight
  • Wb the business modeling weight
  • CD(i) as the base of the i-th dimension
  • ScoreJoint(i,j) = Score(i)*Score(j); if ScoreJoint(i,j) is less than a specific threshold, the i-th group and the j-th group of combined dimensions are merged into one group of combined dimensions.
  • the balance between physical modeling and business modeling is very important. If the physical modeling weight is too high, the model generalizes well, but the query performance for the target SQL may not be optimal; if the business modeling weight is too high, the model overfits: it achieves optimal query performance only for the target SQL and cannot guarantee optimal performance for queries outside the target SQL. The method of the present invention effectively avoids these problems and improves the accuracy of the model.
  • the invention also relates to an automatic modeling system based on the OLAP precomputation model, the system comprising: a data statistics module, a business model module, a query statistics module, and a model building module. The data statistics module collects data statistics according to the data model and data sources given by the user and obtains the data statistics results; the business model module runs a query rehearsal according to the data model and target queries given by the user and determines the business model; the query statistics module runs a query rehearsal on the samples and collects query statistics; the model building module obtains the business modeling result from the data statistics module, the business model module, and the query statistics module.
  • FIG. 1 is a schematic structural diagram of an OLAP precomputation model according to the present invention.
  • FIG. 2 is a flow chart of an automatic modeling method based on OLAP pre-calculation model according to the present invention
  • FIG. 3 is a schematic diagram of a principle of a derivative dimension according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an aggregation group rule according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an automatic modeling system based on an OLAP pre-calculation model according to the present invention.
  • FIG. 6 is a schematic diagram of dimension combination in the prior art of the present invention.
  • an OLAP pre-computation model of the present invention includes: a dimension module, an aggregation group module, and a metric module;
  • the dimension module includes: a normal dimension unit and a derived dimension unit; a common dimension unit for pre-calculating a field on the fact table;
  • the derived dimension unit is configured to pre-calculate the primary key on the dimension table, and record the mapping relationship between the column and the primary key on the dimension table;
  • the dimension table primary key and the common dimension are used as a pre-computed dimension
  • the aggregation group module is configured to divide the precomputed dimensions obtained in the dimension module into multiple aggregation groups; the metric module is configured to generate precomputed results by aggregating over the combinations of all precomputed dimensions in the dimension module.
  • the dimension table contains the details of a dimension. That is to say, many columns on the dimension table often have a close mapping relationship with the primary key of the dimension table.
  • the dimension table primary key cal_dt represents the date in days
  • the dimension table column week_beg_dt represents the start date of the week of the primary key date
  • the dimension table column month_beg_dt represents the start date of the month of the primary key date. If all of these dimensions are precomputed, these 3 dimensions will result in 8 combinations of dimensions.
  • the present invention proposes the concept of a derivative dimension.
  • the derived dimension is derived from the dimension table, and the derived dimension itself does not participate in the precomputation, but precalculates the primary key of the dimension table.
  • when multiple derived dimensions are set on a dimension table, only one precomputed dimension is actually added, which achieves dimension reduction.
  • a dimension at the front of the combination, when used as a query filter, gives higher query performance than one at the end. Placing the dimensions of the business model that are often used as filter conditions at the front of the combination therefore improves query efficiency. In addition, placing a high-cardinality dimension at the front of the combination also helps to significantly reduce data scanning and computation costs and shortens query time.
  • the present invention proposes the concept of aggregation groups, which divides all precomputed dimensions into several aggregation groups; different combinations are generated only within each aggregation group, and different aggregation groups are not cross-combined. In addition, the combination of all dimensions is kept for multidimensional queries across aggregation groups. By placing related dimensions in the same aggregation group according to the dependencies between business areas, meaningless combinations of dimensions are effectively removed and the precomputation cost is reduced.
  • the present invention defines four rules for an aggregation group: a mandatory dimension rule, a hierarchical dimension rule, a combined dimension rule, and a dimension range rule.
  • the mandatory dimension rule designates a dimension that appears in every combination of the current aggregation group, that is, dimension combinations that do not contain the mandatory dimension are removed; the combined dimension rule specifies several groups of dimensions, and the dimensions of each group appear together in a dimension combination, that is, combinations in which they appear separately are removed; the hierarchical dimension rule also specifies several groups of dimensions, and the dimensions of each group must have a hierarchical relationship, such as region (country, province, city) or time (year, month, day); all orphaned sub-trees are discarded when the final dimension combinations are generated.
  • the aggregation group rule, A, B, and C are three dimensions, where A is a required dimension, A and B are combined as a combined dimension, and A>B>C is set as a hierarchical dimension.
  • the dimension-count range rule specifies a maximum and a minimum, that is, combinations whose number of dimensions lies outside the range from the minimum to the maximum are removed.
  • an automatic modeling method based on OLAP pre-calculation model of the present invention includes the following steps:
  • the above method is used to solve the data explosion problem in the big data scenario, improve the query efficiency, and improve the production efficiency through intelligent means.
  • the automatic modeling method was applied to an Apache Kylin-based big data analysis platform for verification.
  • the Cube (precomputation model) created by automatic modeling still guarantees second-level query response times at the ten-billion-record scale and keeps the expansion rate within 10x, which effectively reduces the learning difficulty and trial-and-error cost for Apache Kylin users and improves the user experience.
  • in the overall architecture of the automatic modeling solution of the present invention, the output of the entire architecture is a precomputation model, including the dimensions of the model (normal dimensions and derived dimensions), the metrics (precomputed columns and their operators), the aggregation groups, and so on. The job of automatic modeling is therefore to take the data model, data statistics, samples, and so on as input, select appropriate dimension columns and dimension types, add the required metrics and precomputation operators, and set reasonable aggregation groups.
  • the data model defines the relationships between tables and dimensions, providing a template for precomputed models.
  • the Data Model is an abstraction of data features and a formal framework for database management, used in database systems to provide means of representing and operating on information. A data model includes the structural part of the database data, the operational part of the database data, and the constraints on the database data. Data is a symbolic record that describes things; a model is an abstraction of the real world. In practice, many columns of a data model are strongly correlated, such as the delivery time and order time of an order, whose values tend to follow similar distributions. In the precomputation modeling phase, this correlation can be exploited to set the aggregation group rules appropriately and address the dimension explosion problem. The correlations between columns of the data model therefore need to be measured before automatic modeling.
  • the present invention requires an SQL query preview engine that executes the sample SQL based on the data model given by the user.
  • the example in the present invention refers to some data templates obtained in the previous calculation of SQL.
  • the engine does not return any meaningful query results, but organizes the analysis results of the query plan during the query process.
  • Statistics mainly include:
  • the data model defines the largest category of business analysis, but the actual analysis scenario tends to focus only on the parts of the data model, which is referred to as a business model.
  • the business model contains the dimensions and metric calculation factors that actually participate in the query and is the basis for initializing the precomputed model.
  • Statistical Dimension Use In SQL, the main purpose of a dimension column is Filter and Group By. This step counts the number of times each dimension is used as a Filter and Group By. In addition, the number of simultaneous occurrences of the two dimensions is also counted, which is used to infer the correlation between columns and columns from a business perspective.
  • This step is the core step of the present invention.
  • the business model contains all the dimensions and metrics required for business analysis and defines the basis for precomputed modeling. This step is based on various data statistics and is modeled on the basis of the business model. According to the execution sequence, the following processes are mainly included:
  • Dimension settings: there are two types of dimension columns on a dimension table, normal and derived. A dimension whose cardinality is comparable to the primary key cardinality is suitable as a derived dimension.
  • the derived dimension primary keys and normal dimensions are used as precomputed dimensions, arranged in descending order of cardinality.
  • Metric settings: after the query rehearsal, all precomputation requirements have already been collected in the business model, so it is sufficient to generate the precomputed metrics directly from the metric set of the business model.
  • Aggregation group settings: to simplify operations, all columns are added to one aggregation group by default; the aggregation group rules are then set based on the characteristics of the dimensions in the group.
  • the definition function CD(i) is the cardinality of the ith dimension, and all dimensions are checked as follows:
  • the i-th dimension and the j-th dimension are set as a set of combined dimensions.
  • the i-th dimension and the j-th dimension are set as a set of hierarchical dimensions, and the hierarchical relationship is: the i-th dimension and the j-th dimension.
  • the i-th dimension and the j-th dimension are both join conditions between tables on the data model, and do not participate in join conditions between other tables, the i-th dimension and the j-th dimension are set as a set of combined dimensions.
  • when CD(i) is less than the defined threshold, the dimension is added to the combined dimension candidate group. Finally, the candidate group is divided into multiple combined dimension groups such that the product of the cardinalities in each group stays within the set range.
  • This step is based on the sample SQL, and the model given by the physical modeling is adjusted to optimize the query performance of the target SQL and shrink the storage usage.
  • the present invention sets weights to balance the results of physical modeling and business modeling.
  • Wp is defined as the physical modeling weight
  • Wb is the business modeling weight
  • CD(i) is the base of the i-th dimension.
  • Adjusting the aggregation group rules: the number of times a dimension is used in the target SQL and the co-occurrence relationships between dimensions can be used to adjust the rules of the aggregation group. Define P(i) as the number of times the i-th dimension appears in the queries and P as the number of samples, and apply the following rules:
  • Score(i) = Wp*CD(i) + Wb*P(i)/P. If Score(i) ⁇ 1, consider setting the i-th dimension as a mandatory dimension;
  • Score(i,j) = Wp*CD(i)*CD(j)/CD(i,j) + Wb*P(i,j)/P. If the Score value of the i-th dimension and the j-th dimension is greater than a set threshold, the i-th and j-th dimensions are set as one combined dimension.
  • Max as the maximum number of dimensions for a single query in the query statistics.
  • W is a specific value. Set the maximum value of the dimension range to Max*W.
  • Score(i) = Wp*CD(i) + Wb*F(i)/F; all precomputed dimensions are then sorted in descending order of Score(i), where F(i) is the number of times the i-th dimension is used as a filter condition in the queries, F is the total number of filter conditions in the query statistics, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension.
  • the balance between physical modeling and business modeling is important. If the physical modeling weight is too high, the model generalizes well, but the query performance for the target SQL may not be optimal; if the business modeling weight is too high, the model overfits: it achieves optimal query performance only for the target SQL and cannot guarantee optimal performance for queries outside the target SQL. In practical applications, the modeling weights therefore need to be set according to actual needs.
  • an automatic modeling system based on the OLAP precomputation model of the present invention includes: a data statistics module, a business model module, a query statistics module, and a model building module; the data statistics module collects data statistics according to the data model and data sources given by the user to obtain the data statistics results;
  • the business model module is configured to run a query rehearsal according to the data model and the target queries given by the user, and determine the business model;
  • the query statistics module is used to run a query rehearsal on the samples and collect query statistics;
  • the model building module is used to obtain the business modeling result from the data statistics module, the business model module, and the query statistics module.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

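An OLAP precomputation model, an automatic modeling method, and an automatic modeling system. The model comprises a dimension module, an aggregation group module, and a metric module. The method comprises: collecting data statistics over all data sources to obtain data statistics results; running a query rehearsal against the user-supplied data model and the target queries to determine a business model; running a query rehearsal on the samples and collecting query statistics; performing physical modeling and defining the dimensions, metrics, and aggregation groups of the precomputation model; obtaining the business modeling result; and obtaining the precomputation model. The system comprises a data statistics module, a business model module, a query statistics module, and a model building module. By adding derived dimensions and aggregation groups, the scheme combines precomputed dimensions more effectively and reduces redundant computation and data storage, giving higher computational efficiency and a smaller storage footprint, and therefore better results in big-data multidimensional analysis applications.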

Description

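OLAP Precomputation Model, Automatic Modeling Method, and Automatic Modeling System
Technical Field
The present invention relates to the technical field of OLAP multidimensional data analysis, and in particular to an OLAP precomputation model, an automatic modeling method, and an automatic modeling system.
Background
In the age of information and data, how to analyze data along multiple dimensions to support decision making is an important topic in business intelligence and data mining. OLAP, and MOLAP in particular, arose to address this problem.
In general, a data warehouse holds a large volume of data, and running multidimensional aggregations directly over that volume consumes substantial computing resources and results in excessively long query times. OLAP offers a precomputation-based way to make multidimensional analysis more efficient: a "data cube" pre-aggregates the warehouse data by different combinations of dimensions and stores the results. When an analyst runs an actual business query, the data need not be aggregated again; the precomputed results are read directly, which makes analysis over millions or even hundreds of millions of records feasible.
An OLAP Cube (data cube) is an abstraction of the multidimensional analysis data model in a data warehouse and contains the different dimension combinations used in multidimensional analysis. For example, FIG. 6 contains four dimensions: time, item, location, and supplier. The different combinations of these four dimensions form the different nodes of the OLAP Cube, and each node represents the aggregated metric result under that combination of dimensions. When a user performs multidimensional analysis, the selected dimension combination corresponds to one node in the Cube, and the value under consideration is the aggregated metric result behind that node.
In common OLAP solutions, the OLAP Cube is materialized so that selected dimensions can be analyzed faster: the metrics of every node of the OLAP Cube are aggregated ahead of time by precomputation and the results are stored. When a business analyst runs a query, the system can return the precomputed result directly. An O(N) aggregation is turned into an O(1) result lookup, so the gain in query efficiency is easy to see.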
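The materialization idea can be illustrated with a minimal Python sketch. The function name `materialize_cube`, the sample rows, and the choice of SUM as the aggregate are illustrative assumptions, not part of the disclosure. Every dimension combination is aggregated once up front, after which answering a query is a dictionary lookup.

```python
from collections import defaultdict
from itertools import combinations


def materialize_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every combination of the given dimensions.

    Hypothetical helper: returns {dimension_combo: {key_tuple: total}}.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            node = defaultdict(float)
            for row in rows:
                node[tuple(row[d] for d in combo)] += row[measure]
            cube[combo] = dict(node)
    return cube


# Made-up fact rows for illustration only.
rows = [
    {"year": 2016, "item": "pen", "city": "Shanghai", "sales": 10.0},
    {"year": 2016, "item": "ink", "city": "Beijing", "sales": 5.0},
    {"year": 2017, "item": "pen", "city": "Shanghai", "sales": 7.0},
]
cube = materialize_cube(rows, ("year", "item", "city"), "sales")

# Query time: no re-aggregation, just a lookup in the precomputed node.
sales_2016 = cube[("year",)][(2016,)]
```

With three dimensions the sketch stores 2^3 = 8 nodes, which is exactly the growth that the later sections try to contain.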
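The OLAP Cube also defines hierarchies over dimensions. For example, the three dimensions year, month, and day form a hierarchy: year > month > day. Such hierarchies can often be mapped to concepts in existing applications, which lets analysts apply them more flexibly in data mining systems.
In big-data multidimensional analysis scenarios, however, data volumes often reach hundreds of billions or even trillions of records, with too many dimensions and very high dimension cardinalities, which creates a risk of dimension explosion. If every dimension combination is still precomputed, precomputation will inevitably take too long and produce too much result data. This raises precomputation and storage costs on the one hand, and on the other makes scanning the large volume of precomputed results a challenge.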
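Summary of the Invention
The technical problem to be solved by the present invention is that, for data analysis workloads with too many dimensions and very high dimension cardinalities, current techniques risk dimension explosion, and precomputation takes too long and produces too much result data. This raises precomputation and storage costs on the one hand, and on the other makes scanning the large volume of precomputed results a challenge.
To solve the above technical problem, the present invention provides an OLAP precomputation model comprising a dimension module, an aggregation group module, and a metric module. The dimension module comprises a normal dimension unit and a derived dimension unit. The normal dimension unit is configured to precompute fields of the fact table. The derived dimension unit is configured to precompute the primary key of a dimension table and to record the mapping between the columns of the dimension table and the primary key. The dimension-table primary keys of the derived dimensions in the derived dimension unit and the normal dimensions in the normal dimension unit serve as precomputed dimensions and follow a specific order. The aggregation group module is configured to divide the precomputed dimensions of the dimension module into multiple aggregation groups. The metric module is configured to generate precomputed results by aggregating over the combinations of all precomputed dimensions in the dimension module.
Beneficial effects of the present invention: by introducing concepts such as derived dimensions and aggregation groups, precomputed dimensions are combined more effectively and redundant computation and data storage are reduced, giving higher computational efficiency and a smaller storage footprint, and therefore better results in big-data multidimensional analysis applications.
Further, the aggregation group module comprises a mandatory dimension unit, a combined dimension unit, a hierarchical dimension unit, and a dimension-count range unit. The mandatory dimension unit is configured to record all dimension combinations that contain a particular dimension A. The combined dimension unit is configured to record all dimension combinations that contain a particular combined dimension AB. The hierarchical dimension unit is configured to record all dimension combinations that contain a particular combined dimension ABC having a hierarchical relationship. The dimension-count range unit is configured to record all dimension combinations whose number of dimensions falls within a given range. While dividing all precomputed dimensions of the dimension module into multiple aggregation groups, the aggregation group module also keeps all precomputed dimensions of the dimension module, for multidimensional queries that span different aggregation groups.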
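One way to read the four units is as filters over the candidate dimension combinations of a group. The Python sketch below works under that assumption. The function names, the treatment of the empty combination as a valid candidate, and the reading of the hierarchy rule as "only prefixes of the hierarchy may appear" are illustrative choices, not statements of the disclosed implementation.

```python
from itertools import combinations


def respects_hierarchy(selected, hierarchy):
    """True if the selected dimensions of a hierarchy form a prefix (A, AB, ABC)."""
    missing_seen = False
    for dim in hierarchy:
        if dim not in selected:
            missing_seen = True
        elif missing_seen:
            return False  # a lower level appears without its parent
    return True


def allowed_combinations(dims, mandatory=(), combined=(), hierarchies=(),
                         count_range=None):
    """Enumerate the dimension combinations that survive all four rule types."""
    low, high = count_range if count_range else (0, len(dims))
    allowed = []
    for size in range(low, high + 1):
        for combo in combinations(dims, size):
            selected = set(combo)
            # Mandatory rule: every mandatory dimension must be present.
            if not set(mandatory) <= selected:
                continue
            # Combined rule: members of a group appear together or not at all.
            if any(0 < len(selected & set(g)) < len(g) for g in combined):
                continue
            # Hierarchy rule: no level without the levels above it.
            if not all(respects_hierarchy(selected, h) for h in hierarchies):
                continue
            allowed.append(combo)
    return allowed


dims = ("A", "B", "C")
with_mandatory = allowed_combinations(dims, mandatory=("A",))
with_combined = allowed_combinations(dims, combined=[("A", "B")])
with_hierarchy = allowed_combinations(dims, hierarchies=[("A", "B", "C")])
```

In this toy case each single rule cuts the eight unrestricted combinations down to four, which shows how the rules prune the precomputation space.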
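Beneficial effects of the above further scheme: the present invention proposes the concept of aggregation groups, whereby all precomputed dimensions are divided into several aggregation groups, different combinations are generated only inside each aggregation group, and no combinations are formed across aggregation groups. In addition, the combination of all dimensions is kept to serve multidimensional queries that span aggregation groups. Placing related dimensions in the same aggregation group according to the dependencies between business areas effectively removes meaningless dimension combinations and lowers the precomputation cost.
Further, the derived dimension unit comprises derived dimensions, a derived dimension being a dimension whose cardinality is approximately equal to the cardinality of the primary key.
Beneficial effects of the above further scheme: a derived dimension does not itself take part in precomputation; the primary key of the dimension table is precomputed instead. In addition, a mirror copy of the dimension table is stored to record the mapping between the derived dimension columns and the primary key. Because the records of a dimension table are fairly stable and the data volume is usually small, a query can quickly look up the mirror copy to convert between derived dimension values and primary key values, and then query the mirror copy by foreign key value to find the corresponding dimension value. When multiple derived dimensions are defined on a dimension table, only one precomputed dimension is actually added, which achieves dimension reduction.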
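The present invention also relates to an automatic modeling method based on the OLAP precomputation model, the method comprising the following steps:
S1: collect data statistics according to the user-supplied data model and data sources to obtain data statistics results;
S2: run a query rehearsal according to the user-supplied data model and the target queries to determine the business model;
S3: run a query rehearsal on the samples and collect query statistics;
S4: based on the business model from S2 and the data statistics results from S1, perform physical modeling and define the dimensions, metrics, and aggregation groups of the precomputation model;
S5: based on the query statistics from S3, adjust the model produced by the physical modeling in S4 to obtain the business modeling result;
S6: optimize and adjust the business modeling result from S5 to obtain the precomputation model.
Beneficial effects of the present invention: the method addresses the data explosion problem in big-data scenarios, improves query efficiency, and raises productivity through intelligent automation. In addition, the automatic modeling method was verified on a big-data analysis platform based on Apache Kylin: a Cube (precomputation model) created by automatic modeling still achieves query response times on the order of seconds at the ten-billion-record scale, with an expansion rate kept within 10x, which effectively lowers the learning difficulty and trial-and-error cost for Apache Kylin users and improves the user experience.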
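Further, the physical modeling in S4 comprises dimension setting, metric setting, and aggregation group setting.
Further, the dimension setting comprises normal dimension setting and derived dimension setting. F(i) is computed for each dimension; if F(i) is below a specified threshold, the i-th dimension is set as a derived dimension, and otherwise as a normal dimension. Here the function is defined as F(i) = CD(col_i)/CD(PK), where CD(col_i) is the cardinality of the i-th dimension and CD(PK) is the cardinality of the primary key. The primary keys of the derived dimensions and the normal dimensions serve as precomputed dimensions and are arranged in descending order of cardinality.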
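The dimension-setting rule can be sketched in a few lines of Python that follow the rule exactly as worded in this paragraph. The function names, the example cardinalities, and the threshold value 0.5 are hypothetical; the text does not state a threshold.

```python
def classify_dimension_columns(column_cardinality, pk_cardinality, threshold):
    """Label each dimension-table column using F(i) = CD(col_i) / CD(PK)."""
    kinds = {}
    for column, cd in column_cardinality.items():
        f = cd / pk_cardinality
        # Rule as worded in the text: below the threshold means derived.
        kinds[column] = "derived" if f < threshold else "normal"
    return kinds


def order_by_cardinality(cardinality):
    """Arrange precomputed dimensions in descending order of cardinality."""
    return sorted(cardinality, key=cardinality.get, reverse=True)


# Hypothetical statistics for a one-year time dimension table keyed by cal_dt.
kinds = classify_dimension_columns(
    {"week_beg_dt": 53, "month_beg_dt": 12},
    pk_cardinality=365,
    threshold=0.5,  # assumed value for illustration
)

# Hypothetical precomputed dimensions: one derived-dimension primary key
# (cal_dt) plus two normal dimensions from the fact table.
order = order_by_cardinality({"cal_dt": 365, "seller_id": 10000, "region": 30})
```

In this example both time columns are classified as derived, so only `cal_dt` joins the precomputed dimensions, and the ordering step puts the highest-cardinality dimension first.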
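Further, the rules for the aggregation group setting are as follows. The minimum and maximum of the dimension-count range are set to specific default values. When CD(i) equals 1, the dimension is set as a mandatory dimension. When CD(i)*CD(j) is greater than or equal to CD(i,j), the i-th and j-th dimensions are set as a group of combined dimensions. When CD(j) equals CD(i,j), the i-th and j-th dimensions are set as a group of hierarchical dimensions, the i-th and j-th dimensions being in a hierarchical relationship. Here the function CD(i) is the cardinality of the i-th dimension.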
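The mandatory and hierarchical checks can be sketched in Python over a small table, assuming CD(i,j) denotes the number of distinct value pairs of columns i and j (the text does not define it explicitly). The helper names and sample rows are hypothetical. The combined-dimension condition is left out of the sketch because the inequality CD(i)*CD(j) >= CD(i,j), as printed, holds for any pair of columns under this reading of CD(i,j), so it would not discriminate between pairs. Skipping mandatory columns in the hierarchy check is also an added assumption, since a constant column satisfies that check trivially.

```python
from itertools import permutations


def cardinality(rows, columns):
    """CD over one or more columns: the number of distinct value tuples."""
    return len({tuple(row[c] for c in columns) for row in rows})


def suggest_rules(rows, dims):
    """Return (mandatory dimensions, hierarchical pairs (i, j) meaning i > j)."""
    mandatory = [d for d in dims if cardinality(rows, [d]) == 1]
    hierarchies = [
        (i, j)
        for i, j in permutations(dims, 2)
        # Assumption: constant columns are handled by the mandatory rule only.
        if i not in mandatory and j not in mandatory
        # Rule from the text: CD(j) == CD(i, j), so j determines i.
        and cardinality(rows, [j]) == cardinality(rows, [i, j])
    ]
    return mandatory, hierarchies


# Made-up sample rows for illustration only.
rows = [
    {"tenant": "t1", "year": 2016, "month": "2016-01"},
    {"tenant": "t1", "year": 2016, "month": "2016-02"},
    {"tenant": "t1", "year": 2017, "month": "2017-01"},
]
mandatory, hierarchies = suggest_rules(rows, ["tenant", "year", "month"])
```

Here `tenant` has a single value and becomes mandatory, while `month` determines `year`, so the pair is reported as the hierarchy year > month.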
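Further, the adjustment in S5 comprises adjusting the order of the precomputed dimensions and adjusting the aggregation groups.
Further, the rule for adjusting the order of the precomputed dimensions is: define Score(i) = Wp*CD(i) + Wb*F(i)/F, and sort all precomputed dimensions in descending order of Score(i). Here F(i) is the number of times the i-th dimension is used as a filter condition in the queries, F is the total number of filter conditions in the query statistics, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension.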
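The ordering rule translates directly into a short Python sketch. The function name, the example statistics, and the weight values are hypothetical. The formula is applied as written, with no normalization of CD(i), so in practice the weights would have to absorb the difference in scale between a raw cardinality and a frequency ratio.

```python
def order_dimensions(cardinality, filter_counts, total_filters, wp, wb):
    """Sort dimensions by Score(i) = Wp*CD(i) + Wb*F(i)/F, highest first."""
    def score(dim):
        return wp * cardinality[dim] + wb * filter_counts.get(dim, 0) / total_filters

    return sorted(cardinality, key=score, reverse=True)


# Hypothetical statistics: CD(i) per dimension and filter usage counts F(i).
cardinality = {"a": 10, "b": 1000, "c": 100}
filter_counts = {"a": 90, "b": 3, "c": 7}

# Business modeling only: the most frequently filtered dimension leads.
business_order = order_dimensions(cardinality, filter_counts, 100, wp=0.0, wb=1.0)
# Physical modeling only: the highest-cardinality dimension leads.
physical_order = order_dimensions(cardinality, filter_counts, 100, wp=1.0, wb=0.0)
```

The two extreme weightings produce different orders for the same statistics, which is the trade-off the weights Wp and Wb are meant to balance.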
进一步地,所述聚合组的调整,该聚合组的调整的规则为:定义Score(i)=Wp*CD(i)+Wb*P(i)/P,如果Score(i)等于1,则设置第i维度为必须维度;定义Score(i,j)=Wp*CD(i)*CD(j)/CD(i,j)+Wb*P(i,j)/P,如果第i维度和第j维度的Score值大于设定阈值,则设置第i维度和第j维度为一个组合维度,其中,定义P(i)为第i维度在查询中出现的次数,P为样例个数,定义Wp为物理建模权重,Wb为业务建模权重,CD(i)为第i维度的基数;定义ScoreJoint(i,j)=Score(i)*Score(j),如果ScoreJoint(i,j)小于特定阈值,则将第i组组合维度和第j组组合维度合并为一组组合维度。定义Max为查询统计结果中单次查询最大维度数,定义W为特定值的维数膨胀系数,则设置维数范围的最大值为Max*W。
上述进一步的有益效果:物理建模和业务建模的平衡十分重要。如果物理建模权重过高,会导致模型泛化能力较强,但对于目标SQL的查询性能可能无法做到最优;如果业务建模权重过高,会导致模型过拟合,仅对于目标SQL能取得最优查询性能,但对目标SQL之外的查询无法保证最优性能。因此,本发明的方法能够有效地避免这些问题的出现,提高模型的准确性。
本发明还涉及一种基于OLAP预计算模型的自动建模系统,该系统包括:数据统计模块、业务模型模块、查询统计模块、建模建立模块;数据统计模 块,用于根据用户给定的数据模型和数据源进行数据统计,得到数据统计结果;业务模型模块,用于根据用户所给定的数据模型以及目标查询进行查询预演,确定业务模型;查询统计模块,用于对样例进行查询预演,并收集查询统计;建模建立模块,用于根据数据统计模块、业务模型模块、查询统计模块得到业务建模的结果。
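Brief Description of the Drawings
Figure 1 is a schematic structural diagram of an OLAP pre-calculation model of the present invention;
Figure 2 is a flowchart of an automatic modeling method based on the OLAP pre-calculation model of the present invention;
Figure 3 is a schematic diagram of the derived dimension principle in an embodiment of the present invention;
Figure 4 is a schematic diagram of the aggregation group rules in an embodiment of the present invention;
Figure 5 is a schematic diagram of an automatic modeling system based on the OLAP pre-calculation model of the present invention;
Figure 6 is a schematic diagram of dimension combinations in the prior art.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
Embodiment 1
As shown in Figure 1, an OLAP pre-calculation model of the present invention comprises a dimension module, an aggregation group module, and a measure module. The dimension module comprises a normal dimension unit and a derived dimension unit. The normal dimension unit is used to pre-calculate fields of the fact table. The derived dimension unit is used to pre-calculate the primary key of a dimension table and to record the mapping between the columns of the dimension table and its primary key. The dimension-table primary keys of the derived dimensions, together with the normal dimensions, serve as the pre-calculated dimensions and follow a specific order. The aggregation group module is used to divide the pre-calculated dimensions obtained in the dimension module into multiple aggregation groups. The measure module is used to aggregate over the combinations of all pre-calculated dimensions in the dimension module to generate the pre-calculation results.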
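Derived dimensions
A dimension table contains the detailed information of one dimension. In other words, many columns of a dimension table tend to map closely to its primary key. As shown in Figure 3, take a time dimension table as an example: the primary key cal_dt is a date at day granularity, the column week_beg_dt is the first day of the week containing that date, and the column month_beg_dt is the first day of the month containing that date. If all of these dimensions were pre-calculated, the 3 dimensions would produce 8 dimension combinations. If instead only cal_dt is pre-calculated, and a query that needs week_beg_dt or month_beg_dt has its dimension values converted to cal_dt values through the mapping, then only cal_dt takes part in the pre-calculated combinations. This produces only 2 dimension combinations and ultimately saves 75% of the pre-calculation cost.
To implement this scheme, the present invention proposes the concept of derived dimensions. A derived dimension comes from a dimension table; it does not itself take part in pre-calculation, and the primary key of the dimension table is pre-calculated instead. In addition, a snapshot of the dimension table must be stored to record the mapping between the derived dimension columns and the primary key. Because the records of a dimension table are relatively stable and the data volume is usually small, the snapshot can be searched quickly at query time to convert between derived dimension values and primary key values, and then looked up by foreign key value to find the corresponding dimension value. When several derived dimensions are set on one dimension table, only one pre-calculated dimension is actually added, which achieves dimensionality reduction.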
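The arithmetic and the lookup described above can be illustrated with a minimal Python sketch. It assumes a small in-memory snapshot of the calendar dimension table and a Monday week start; the helper names `cuboid_count`, `build_snapshot`, and `to_primary_keys` are hypothetical and do not come from the patent.

```python
from datetime import date, timedelta

def cuboid_count(n_dims: int) -> int:
    """Number of dimension combinations (including the empty one) for n dimensions."""
    return 2 ** n_dims

def build_snapshot(start: date, days: int) -> dict:
    """Hypothetical dimension-table snapshot: cal_dt -> derived column values."""
    snapshot = {}
    for offset in range(days):
        cal_dt = start + timedelta(days=offset)
        snapshot[cal_dt] = {
            # Assumption: weeks start on Monday.
            "week_beg_dt": cal_dt - timedelta(days=cal_dt.weekday()),
            "month_beg_dt": cal_dt.replace(day=1),
        }
    return snapshot

def to_primary_keys(snapshot: dict, column: str, value: date) -> list:
    """Rewrite a filter on a derived column as the matching cal_dt primary keys."""
    return sorted(pk for pk, row in snapshot.items() if row[column] == value)

snapshot = build_snapshot(date(2017, 3, 1), 31)
keys = to_primary_keys(snapshot, "week_beg_dt", date(2017, 3, 6))
saving = 1 - cuboid_count(1) / cuboid_count(3)  # fraction of combinations avoided
```

A filter on week_beg_dt is thus answered from data pre-calculated on cal_dt alone, at the cost of one scan of the (small) snapshot per query.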
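Order of pre-calculated dimensions
In a typical query engine, for a given dimension combination, a dimension near the front of the combination gives better query performance when used as a filter condition than a dimension near the end. Placing the dimensions that the business model frequently uses as filter conditions at the front of the combination therefore improves query efficiency. In addition, placing high-cardinality dimensions at the front helps to substantially reduce the cost of data scanning and computation and shortens query time.
Aggregation groups
In real multidimensional analysis scenarios, not every dimension combination is meaningful. For example, a supplier dimension and a website traffic dimension rarely appear together in multidimensional analysis, so pre-calculating such rarely used dimension combinations is wasteful.
The present invention proposes the concept of aggregation groups, that is, all pre-calculated dimensions are divided into several aggregation groups, different combinations are generated only within each aggregation group, and no cross-combinations are generated between different aggregation groups. In addition, the combination of all dimensions is retained to handle multidimensional queries that span aggregation groups. By placing related dimensions in the same aggregation group according to the dependencies between business areas, meaningless dimension combinations can be eliminated effectively, which lowers the pre-calculation cost.
Within a single aggregation group, the dimensions often have further relationships. For example, some dimensions appear in almost every analytical query, so combinations that do not contain them are used very rarely; some dimensions always appear together in queries, so combinations in which they appear separately are used very rarely; some dimensions have a hierarchical relationship, so combinations in which a child dimension appears alone are used very rarely; and the number of dimensions used by a single query is limited, so combinations with many dimensions are used very rarely. To address these cases, the present invention defines four kinds of rules for an aggregation group: the mandatory dimension rule, the hierarchy dimension rule, the joint dimension rule, and the dimension-count range rule. The mandatory dimension rule states that a dimension appears in every combination of the current aggregation group, that is, combinations that do not contain the mandatory dimension are removed. The joint dimension rule specifies several groups of dimensions; the dimensions of each group appear together in a dimension combination, that is, combinations in which they appear separately are removed. The hierarchy dimension rule also specifies several groups of dimensions; the dimensions within each group must have a hierarchical relationship, such as region (country, province, city) or time (year, month, day), and all orphan subtrees (combinations that contain a lower level without its ancestors) are discarded when the final dimension combinations are generated. Figure 4 illustrates the aggregation group rules with three dimensions A, B, and C, where A is a mandatory dimension, A and B together form a joint dimension, and A>B>C is set as a hierarchy dimension. The dimension-count range rule specifies a minimum and a maximum, that is, combinations whose number of dimensions falls outside this range are removed.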
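The pruning effect of the four rules can be sketched by brute-force enumeration in Python. This is an illustrative sketch only: the function name `valid_combinations` and its parameters are hypothetical, it handles a single aggregation group, and it does not model the retained all-dimension combination.

```python
from itertools import combinations

def valid_combinations(dims, mandatory=(), joints=(), hierarchies=(), dim_range=None):
    """Enumerate the non-empty dimension combinations that survive the four rules."""
    lo, hi = dim_range or (1, len(dims))
    kept = []
    for size in range(1, len(dims) + 1):
        for combo in combinations(dims, size):
            chosen = set(combo)
            # Mandatory rule: every mandatory dimension must be present.
            if not set(mandatory) <= chosen:
                continue
            # Joint rule: members of a joint group appear all together or not at all.
            if any(0 < len(chosen & set(group)) < len(group) for group in joints):
                continue
            # Hierarchy rule: a level may appear only together with all its ancestors.
            if any(level in chosen and not set(chain[:pos]) <= chosen
                   for chain in hierarchies for pos, level in enumerate(chain)):
                continue
            # Dimension-count range rule.
            if not lo <= len(chosen) <= hi:
                continue
            kept.append(combo)
    return kept

dims = ("A", "B", "C")
with_mandatory = valid_combinations(dims, mandatory=("A",))
with_joint = valid_combinations(dims, joints=[("A", "B")])
with_hierarchy = valid_combinations(dims, hierarchies=[("A", "B", "C")])
```

For three dimensions the unrestricted count of 7 non-empty combinations drops to 4 with A mandatory and to 3 with either the joint rule on (A, B) or the hierarchy A>B>C. Enumerating all subsets is exponential, so this is only suitable for illustrating small groups.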
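Embodiment 2
As shown in Figure 2, an automatic modeling method based on the OLAP pre-calculation model of the present invention comprises the following steps:
S1: collect data statistics based on the user-supplied data model and data source, to obtain the data statistics results;
S2: perform query rehearsal based on the user-supplied data model and the target queries, to determine the business model;
S3: perform query rehearsal on the sample queries and collect query statistics;
S4: perform physical modeling based on the business model from S2 and the data statistics results from S1, and define the dimensions, measures, and aggregation groups of the pre-calculation model;
S5: adjust the model produced by the physical modeling in S4 based on the query statistics from S3, to obtain the business modeling result;
S6: optimize and adjust the business modeling result from S5, to obtain the pre-calculation model.
The above method addresses the data explosion problem in big-data scenarios and improves query efficiency, while raising productivity through automation. In addition, the automatic modeling method was validated on a big-data analysis platform based on Apache Kylin: a Cube (pre-calculation model) created through automatic modeling still delivers query response times on the order of seconds at the ten-billion-record scale and keeps the expansion ratio within 10x, which effectively lowers the learning curve and trial-and-error cost for Apache Kylin users and improves the user experience.
Figure 5 shows the overall architecture of the automatic modeling solution of the present invention. The whole architecture takes the pre-calculation model as its target output, which comprises the model's dimensions (normal and derived), measures (pre-calculated columns and their operators), aggregation groups, and so on. The task of automatic modeling is therefore to take the data model, the data statistics, the sample queries, and similar inputs, select suitable dimension columns and dimension types, add the required measures and pre-calculation operators, and set up reasonable aggregation groups.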
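Data statistics
Collecting statistics on all source tables involved in the model provides precise data support for modeling. Dimension cardinality is especially important: it makes it possible to estimate how much a dimension inflates the pre-calculation results. In addition, many OLAP systems encode dimensions to save storage space, and knowing the maximum length and data type of the column values helps in choosing a suitable encoding. Therefore, before automatic modeling, the cardinality, extreme values, and length extremes of the columns of the source tables must be collected, together with data samples.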
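As an illustration of the per-column statistics listed above, the following Python sketch computes them exactly over in-memory rows. The function name `column_stats` and the row format are assumptions made for this example; on large source tables a real system would more plausibly rely on sampling or approximate distinct counting.

```python
def column_stats(rows, column):
    """Exact cardinality, value extremes, and maximum text length of one column."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "cardinality": len(set(values)),
        "min": min(values),
        "max": max(values),
        "max_length": max(len(str(v)) for v in values),
    }

# Hypothetical source rows.
rows = [
    {"city": "Shanghai", "amount": 10},
    {"city": "Beijing", "amount": 25},
    {"city": "Shanghai", "amount": 7},
]
city_stats = column_stats(rows, "city")
amount_stats = column_stats(rows, "amount")
```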
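In a data warehouse, the data model defines the relationships between tables and the scope of the dimensions, and so provides a template for the pre-calculation model. A data model is an abstraction of the characteristics of data: it is the mathematical formal framework of database management, used in a database system to provide the means of representing and operating on information. A data model comprises the structural part of the database data, the operational part of the database data, and the constraints on the database data. Data are symbolic records that describe things; a model is an abstraction of the real world. In practice, many columns of a data model are strongly correlated. The delivery time and the order time of an order are one example: the values of these two columns often follow similar distributions. During pre-calculation modeling, this correlation can be used to set the aggregation group rules appropriately and so mitigate the curse of dimensionality. Therefore, before automatic modeling, the correlation between the columns of the data model must also be measured.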
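Query rehearsal
First, the present invention requires an SQL query rehearsal engine that executes the sample SQL against the user-supplied data model. In the present invention, samples are data templates obtained from earlier SQL computations. The engine does not return any meaningful query results; instead, during the query it organizes and tallies the results of query plan analysis, which mainly comprise the following:
Generating the business model: the data model defines the maximum scope of business analysis, but actual analysis scenarios usually concern only part of the data model, which the present invention calls the business model. The business model contains the dimensions and measure calculation factors that actually take part in queries, and is the basis for initializing the pre-calculation model.
Counting dimension usage: in SQL, a dimension column is mainly used in Filter and Group By. This step counts how many times each dimension is used as a Filter and as a Group By. It also counts how many times each pair of dimensions appears in the same query, which is used to infer the correlation between columns from a business perspective.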
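The usage counts described above can be sketched as follows in Python. The sketch assumes that each sample query has already been reduced to its filter columns and Group By columns (the patent obtains these from query plan analysis, which is not reproduced here); the dictionary layout and the name `collect_query_stats` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collect_query_stats(queries):
    """Count Filter usage, Group By usage, and pairwise co-occurrence of dimensions."""
    filter_count, group_count, pair_count = Counter(), Counter(), Counter()
    for query in queries:
        filters, groups = set(query["filter"]), set(query["group_by"])
        filter_count.update(filters)
        group_count.update(groups)
        # Each unordered pair of dimensions used by the same query counts once.
        pair_count.update(combinations(sorted(filters | groups), 2))
    return filter_count, group_count, pair_count

# Hypothetical pre-parsed sample queries.
queries = [
    {"filter": ["cal_dt"], "group_by": ["seller_id"]},
    {"filter": ["cal_dt", "site_id"], "group_by": ["seller_id"]},
    {"filter": [], "group_by": ["site_id"]},
]
filter_count, group_count, pair_count = collect_query_stats(queries)
```

The pair keys are stored in sorted order, so `("cal_dt", "seller_id")` and `("seller_id", "cal_dt")` are never counted separately.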
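Physical modeling
This is the core step of the present invention. The business model contains all the dimensions and measures required for business analysis and defines the basis for pre-calculation modeling. This step builds the model on top of the business model using the data statistics and, in execution order, mainly comprises the following procedures:
Dimension setting: a dimension column of a dimension table can be one of two types, a normal dimension or a derived dimension. A dimension whose cardinality is comparable to that of the primary key is suitable as a derived dimension. Define the function F(i)=CD(col_i)/CD(PK), where CD(col_i) is the cardinality of the i-th dimension and CD(PK) is the cardinality of the primary key. F(i) is calculated for each dimension; if F(i) is less than a specified threshold, the i-th dimension is set as a derived dimension, otherwise it is set as a normal dimension. The primary keys of derived dimensions and the normal dimensions serve as the pre-calculated dimensions and are arranged in descending order of cardinality.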
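The dimension-setting rule can be written down directly. The Python sketch below follows the rule as stated (F(i) below the threshold means derived); the function names, the cardinalities, and the threshold value 0.5 are assumptions chosen only for illustration.

```python
def derived_ratio(col_cardinality, pk_cardinality):
    """F(i) = CD(col_i) / CD(PK)."""
    return col_cardinality / pk_cardinality

def classify_dimensions(column_cardinality, pk_cardinality, threshold):
    """Label each dimension-table column as 'derived' or 'normal' per the stated rule."""
    return {
        col: "derived" if derived_ratio(cd, pk_cardinality) < threshold else "normal"
        for col, cd in column_cardinality.items()
    }

def order_by_cardinality(cardinality):
    """Arrange pre-calculated dimensions in descending order of cardinality."""
    return sorted(cardinality, key=cardinality.get, reverse=True)

# Hypothetical cardinalities for a one-year calendar dimension table.
kinds = classify_dimensions(
    {"week_beg_dt": 53, "month_beg_dt": 12, "cal_dt_str": 365},
    pk_cardinality=365,
    threshold=0.5,
)
order = order_by_cardinality({"cal_dt": 365, "seller_id": 10000, "site_id": 20})
```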
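Measure setting: after query rehearsal, all pre-calculation requirements have been collected in the business model. The pre-calculated measures are simply generated from the set of measures of the business model.
Aggregation group setting: to simplify the procedure, all columns are added to a single aggregation group by default. The aggregation group rules are then set according to the characteristics of the dimensions in the group. Define the function CD(i) as the cardinality of the i-th dimension, and check all dimensions as follows:
When CD(i)≈1, set the dimension as a mandatory dimension.
When CD(i)*CD(j)>>CD(i,j), set the i-th and j-th dimensions as a group of joint dimensions.
When CD(j)≈CD(i,j), set the i-th and j-th dimensions as a group of hierarchy dimensions, with the hierarchy order: i-th dimension, j-th dimension.
If the i-th and j-th dimensions both appear in a join condition between tables of the data model and take part in no other join condition between tables, set the i-th and j-th dimensions as a group of joint dimensions.
When CD(i) is less than a defined threshold, add the dimension to the joint dimension candidate group. Finally, divide the candidate group into several joint dimension groups such that the product of the cardinalities in each group stays within a set range.
For the dimension-count range rule, set the minimum and maximum of the dimension-count range to specific default values.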
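The first three cardinality checks above can be sketched in Python as follows. The sketch assumes that CD(i,j) denotes the cardinality of the combined pair of columns, and it introduces two tuning parameters that the source leaves open: `tol` for "≈" and `much_greater` for ">>". All names are hypothetical, and conflicts between the suggested rules are not resolved.

```python
def approx_equal(a, b, tol=0.05):
    """Relative-tolerance reading of '≈' (the tolerance value is an assumption)."""
    return abs(a - b) <= tol * max(a, b)

def suggest_rules(cd, cd_pair, much_greater=10.0, tol=0.05):
    """Suggest mandatory, joint, and hierarchy rules from cardinality statistics.

    cd maps a dimension to CD(i); cd_pair maps (i, j) to CD(i,j).
    """
    mandatory = [d for d, c in cd.items() if approx_equal(c, 1, tol)]
    joints, hierarchies = [], []
    for (i, j), c_ij in cd_pair.items():
        # CD(i)*CD(j) >> CD(i,j): the two columns are strongly correlated.
        if cd[i] * cd[j] >= much_greater * c_ij:
            joints.append((i, j))
        # CD(j) ≈ CD(i,j): j determines i, so i is the parent level of j.
        if approx_equal(cd[j], c_ij, tol):
            hierarchies.append((i, j))
        elif approx_equal(cd[i], c_ij, tol):
            hierarchies.append((j, i))
    return mandatory, joints, hierarchies

# Hypothetical statistics.
cd = {"country": 5, "city": 100, "flag": 1, "user": 1000}
cd_pair = {("country", "city"): 100, ("city", "user"): 1200}
mandatory, joints, hierarchies = suggest_rules(cd, cd_pair)
```

With these numbers, `flag` becomes mandatory, (`city`, `user`) becomes a joint group, and `country` > `city` becomes a hierarchy.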
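Business modeling
This step takes the sample SQL as its main basis and adjusts the model produced by physical modeling, with the aim of optimizing query performance for the target SQL and reducing storage usage. However, relying too heavily on business modeling causes the model to overfit. The present invention therefore sets weights to balance the results of physical modeling and business modeling. In the formulas of this subsection, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension. The main steps of the model adjustment are as follows:
Adjusting the aggregation group rules: the number of times each dimension is used in the target SQL and the co-occurrence of dimensions can be used to adjust the aggregation group rules. Define P(i) as the number of times the i-th dimension appears in the queries and P as the number of sample queries, and apply the following rules:
Define Score(i)=Wp*CD(i)+Wb*P(i)/P; if Score(i)≈1, consider setting the i-th dimension as a mandatory dimension;
Define Score(i,j)=Wp*CD(i)*CD(j)/CD(i,j)+Wb*P(i,j)/P; if the Score value of the i-th and j-th dimensions is greater than a set threshold, consider setting the i-th and j-th dimensions as one joint dimension. Define ScoreJoint(i,j)=Score(i)*Score(j); if ScoreJoint(i,j) is less than a specific threshold, merge the i-th and j-th groups of joint dimensions into one group of joint dimensions.
Define Max as the largest number of dimensions used by a single query in the query statistics and W as a dimension-count expansion coefficient with a specific value; set the maximum of the dimension-count range to Max*W.
Adjusting the order of the pre-calculated dimensions: define Score(i)=Wp*CD(i)+Wb*F(i)/F, and sort all pre-calculated dimensions in descending order of Score(i). Here F(i) is the number of times the i-th dimension is used as a filter condition in the queries, F is the total number of filter conditions in the query statistics, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension.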
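The weighted scores above can be evaluated with a few lines of Python. The sketch below is a hedged illustration: the function names are hypothetical, the source does not say whether CD(i) is normalised before weighting (the example passes values already scaled to [0, 1] as an assumption), and rounding Max*W up to an integer is likewise an assumption.

```python
import math

def order_score(cd_i, filter_i, filter_total, wp, wb):
    """Score(i) = Wp*CD(i) + Wb*F(i)/F."""
    return wp * cd_i + wb * filter_i / filter_total

def joint_score(cd_i, cd_j, cd_ij, p_ij, p_total, wp, wb):
    """Score(i,j) = Wp*CD(i)*CD(j)/CD(i,j) + Wb*P(i,j)/P."""
    return wp * cd_i * cd_j / cd_ij + wb * p_ij / p_total

def order_dimensions(cd, filter_count, wp, wb):
    """Sort the pre-calculated dimensions in descending order of Score(i)."""
    total = sum(filter_count.values()) or 1  # avoid division by zero without filters
    scores = {d: order_score(cd[d], filter_count.get(d, 0), total, wp, wb) for d in cd}
    return sorted(cd, key=lambda d: scores[d], reverse=True)

def max_dim_range(max_dims_per_query, w):
    """Upper bound of the dimension-count range, Max*W, rounded up (assumption)."""
    return math.ceil(max_dims_per_query * w)

cd = {"a": 0.9, "b": 0.5, "c": 0.1}   # assumed pre-normalised cardinalities
filter_count = {"c": 9, "b": 1}        # hypothetical filter usage counts
balanced = order_dimensions(cd, filter_count, wp=0.5, wb=0.5)
physical_only = order_dimensions(cd, filter_count, wp=1.0, wb=0.0)
```

With equal weights the heavily filtered dimension `c` moves to the front despite its low cardinality; with Wb set to zero the order falls back to pure cardinality, which illustrates the trade-off between the two weights.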
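The balance between physical modeling and business modeling is very important. If the physical modeling weight is too high, the model generalizes well but may not achieve the best query performance for the target SQL. If the business modeling weight is too high, the model overfits: it achieves the best query performance only for the target SQL and cannot guarantee the best performance for other queries. In practice, therefore, the weights of the modeling methods must be set according to actual requirements.
Embodiment 3
As shown in Figure 5, an automatic modeling system for the OLAP pre-calculation model of the present invention comprises a data statistics module, a business model module, a query statistics module, and a model building module. The data statistics module is used to collect data statistics based on the user-supplied data model and data source, to obtain the data statistics results. The business model module is used to perform query rehearsal based on the user-supplied data model and the sample queries, to determine the business model. The query statistics module is used to perform query rehearsal on the sample queries and collect query statistics. The model building module is used to obtain the business modeling result from the data statistics module, the business model module, and the query statistics module.
In this specification, illustrative uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.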

Claims (10)

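  1. An OLAP pre-calculation model, characterized in that the pre-calculation model comprises a dimension module, an aggregation group module, and a measure module; the dimension module comprises a normal dimension unit and a derived dimension unit; the normal dimension unit is used to pre-calculate fields of the fact table; the derived dimension unit is used to pre-calculate the primary key of a dimension table and to record the mapping between the columns of the dimension table and its primary key; the dimension-table primary keys of the derived dimensions in the derived dimension unit and the normal dimensions in the normal dimension unit serve as pre-calculated dimensions and follow a specific order; the aggregation group module is used to divide the pre-calculated dimensions of the dimension module into multiple aggregation groups; the measure module is used to aggregate over the combinations of all pre-calculated dimensions in the dimension module to generate pre-calculation results.
  2. The OLAP pre-calculation model according to claim 1, characterized in that the aggregation group module comprises a mandatory dimension unit, a joint dimension unit, a hierarchy dimension unit, and a dimension-count range unit; the mandatory dimension unit is used to record all dimension combinations that contain a specific dimension A; the joint dimension unit is used to record all dimension combinations that contain a specific joint dimension AB; the hierarchy dimension unit is used to record all dimension combinations that contain a specific joint dimension ABC whose members have a hierarchical relationship; the dimension-count range unit is used to record all dimension combinations whose number of dimensions falls within a given range; while dividing all pre-calculated dimensions of the dimension module into multiple aggregation groups, the aggregation group module also retains all pre-calculated dimensions of the dimension module, to serve multidimensional queries that span different aggregation groups.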
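  3. An automatic modeling method based on the OLAP pre-calculation model according to claims 1-2, characterized in that the method comprises the following steps:
    S1: collect data statistics based on the user-supplied data model and data source, to obtain the data statistics results;
    S2: perform query rehearsal based on the user-supplied data model and the target queries, to determine the business model;
    S3: perform query rehearsal on the sample queries and collect query statistics;
    S4: perform physical modeling based on the business model from S2 and the data statistics results from S1, and define the dimensions, measures, and aggregation groups of the pre-calculation model;
    S5: adjust the model produced by the physical modeling in S4 based on the query statistics from S3, to obtain the business modeling result;
    S6: optimize and adjust the business modeling result from S5, to obtain the pre-calculation model.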
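  4. The automatic modeling method based on the OLAP pre-calculation model according to claim 3, characterized in that the physical modeling in S4 comprises dimension setting, measure setting, and aggregation group setting.
  5. The automatic modeling method based on the OLAP pre-calculation model according to claim 4, characterized in that the dimension setting comprises normal dimension setting and derived dimension setting: the value F(i) is calculated for each dimension; if F(i) is less than a specified threshold, the i-th dimension is set as a derived dimension, otherwise it is set as a normal dimension; wherein the function F(i)=CD(col_i)/CD(PK) is defined, CD(col_i) being the cardinality of the i-th dimension and CD(PK) being the cardinality of the primary key. The primary keys of derived dimensions and the normal dimensions serve as the pre-calculated dimensions and are arranged in descending order of cardinality.
  6. The automatic modeling method based on the OLAP pre-calculation model according to claim 4, characterized in that the rules of the aggregation group setting are: the minimum and maximum of the dimension-count range are set to specific default values; when CD(i) equals 1, the dimension is set as a mandatory dimension; when CD(i)*CD(j) is greater than or equal to CD(i,j), the i-th and j-th dimensions are set as a group of joint dimensions; when CD(j) equals CD(i,j), the i-th and j-th dimensions are set as a group of hierarchy dimensions; wherein the function CD(i) is defined as the cardinality of the i-th dimension, and the i-th and j-th dimensions are in a hierarchical relationship.
  7. The automatic modeling method based on the OLAP pre-calculation model according to claim 3, characterized in that the adjustment in S5 comprises adjusting the order of the pre-calculated dimensions and adjusting the aggregation groups.
  8. The automatic modeling method based on the OLAP pre-calculation model according to claim 7, characterized in that the rule for adjusting the order of the pre-calculated dimensions is: define Score(i)=Wp*CD(i)+Wb*F(i)/F, and sort all pre-calculated dimensions in descending order of Score(i); wherein F(i) is the number of times the i-th dimension is used as a filter condition in the queries, F is the total number of filter conditions in the query statistics, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension.
  9. The automatic modeling method based on the OLAP pre-calculation model according to claim 7, characterized in that the rules for adjusting the aggregation groups are: define Score(i)=Wp*CD(i)+Wb*P(i)/P; if Score(i) equals 1, the i-th dimension is set as a mandatory dimension; define Score(i,j)=Wp*CD(i)*CD(j)/CD(i,j)+Wb*P(i,j)/P; if the Score value of the i-th and j-th dimensions is greater than a set threshold, the i-th and j-th dimensions are set as one joint dimension, wherein P(i) is the number of times the i-th dimension appears in the queries, P is the number of sample queries, Wp is the physical modeling weight, Wb is the business modeling weight, and CD(i) is the cardinality of the i-th dimension; define ScoreJoint(i,j)=Score(i)*Score(j); if ScoreJoint(i,j) is less than a specific threshold, the i-th and j-th groups of joint dimensions are merged into one group of joint dimensions. Define Max as the largest number of dimensions used by a single query in the query statistics and W as a dimension-count expansion coefficient with a specific value; the maximum of the dimension-count range is then set to Max*W.
  10. An automatic modeling system using the automatic modeling method based on the OLAP pre-calculation model according to any one of claims 3 to 9, the system comprising a data statistics module, a business model module, a query statistics module, and a model building module; the data statistics module is used to collect data statistics based on the user-supplied data model and data source, to obtain the data statistics results; the business model module is used to perform query rehearsal based on the user-supplied data model and the target queries, to determine the business model; the query statistics module is used to perform query rehearsal on the sample queries and collect query statistics; the model building module is used to obtain the business modeling result from the data statistics module, the business model module, and the query statistics module.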
PCT/CN2017/086133 2017-03-28 2017-05-26 OLAP pre-calculation model, automatic modeling method, and automatic modeling system WO2018176623A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17903560.5A EP3605358A4 (en) 2017-03-28 2017-05-26 PRE-CALCULATED OLAP MODEL, AUTOMATIC MODELING METHOD AND AUTOMATIC MODELING SYSTEM
US15/659,664 US10902022B2 (en) 2017-03-28 2017-07-26 OLAP pre-calculation model, automatic modeling method, and automatic modeling system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710192889.1 2017-03-28
CN201710192889.1A CN106997386B (zh) 2017-03-28 2017-03-28 OLAP pre-calculation model, automatic modeling method, and automatic modeling system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/659,664 Continuation US10902022B2 (en) 2017-03-28 2017-07-26 OLAP pre-calculation model, automatic modeling method, and automatic modeling system

Publications (1)

Publication Number Publication Date
WO2018176623A1 true WO2018176623A1 (zh) 2018-10-04

Family

ID=59431733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/086133 WO2018176623A1 (zh) 2017-03-28 2017-05-26 OLAP pre-calculation model, automatic modeling method, and automatic modeling system

Country Status (3)

Country Link
EP (1) EP3605358A4 (zh)
CN (1) CN106997386B (zh)
WO (1) WO2018176623A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309206A (zh) * 2019-07-10 2019-10-08 中国联合网络通信集团有限公司 订单信息采集方法及系统
CN111666280A (zh) * 2020-04-27 2020-09-15 百度在线网络技术(北京)有限公司 评论的排序方法、装置、设备和计算机存储介质
CN111930857A (zh) * 2020-07-08 2020-11-13 成都双链科技有限责任公司 一种基于图计算的实时联机数据分析处理方法
CN112540972A (zh) * 2020-12-16 2021-03-23 中盈优创资讯科技有限公司 一种基于RoaringBitmap海量用户高效圈选方法及装置
CN113094409A (zh) * 2021-04-08 2021-07-09 国网电子商务有限公司 业务数据的处理方法及装置、计算机存储介质
CN113268514A (zh) * 2021-05-26 2021-08-17 深圳壹账通智能科技有限公司 多维数据统计方法、装置、电子设备及存储介质

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590269A (zh) * 2017-09-25 2018-01-16 中国地质大学(武汉) 一种数据仓库中创建立方模型的方法、设备及存储设备
CN108052522B (zh) * 2017-11-02 2020-08-25 上海跬智信息技术有限公司 一种对olap预计算模型进行动态优化的方法及系统
CN108121780B (zh) * 2017-12-15 2021-10-08 中盈优创资讯科技有限公司 数据分析模型确定方法及装置
CN108182520A (zh) * 2017-12-22 2018-06-19 深圳市华云中盛科技有限公司 一种快速建模的方法及其系统
CN108334554B (zh) * 2017-12-29 2021-10-01 上海跬智信息技术有限公司 一种新型的olap预计算模型及构建方法
CN108268612B (zh) * 2017-12-29 2021-05-25 上海跬智信息技术有限公司 一种基于olap预计算模型的预校验方法及预校验系统
CN108153894B (zh) * 2017-12-29 2020-08-14 上海跬智信息技术有限公司 一种olap数据模型自动建模的方法、分类器装置
CN108229976A (zh) * 2018-01-10 2018-06-29 北京掌阔移动传媒科技有限公司 一种数据可视化反作弊系统数据模型维度调整处理方法
CN108376143B (zh) * 2018-01-11 2019-12-27 上海跬智信息技术有限公司 一种新型的olap预计算系统及生成预计算结果的方法
CN108415981B (zh) * 2018-02-09 2020-10-09 平安科技(深圳)有限公司 数据维度生成方法、装置、设备以及计算机可读存储介质
CN108829707A (zh) * 2018-05-02 2018-11-16 国网浙江省电力有限公司信息通信分公司 跨业务域的大数据智能分析系统及方法
CN110457344B (zh) * 2018-05-08 2021-06-04 北京三快在线科技有限公司 预计算模型生成、预计算方法、装置、设备及存储介质
CN113935434A (zh) * 2018-06-19 2022-01-14 北京九章云极科技有限公司 一种数据分析处理系统及自动建模方法
CN108984698B (zh) * 2018-07-05 2023-06-27 福建星瑞格软件有限公司 一种数据库业务行为的建模方法
CN109285024B (zh) * 2018-07-23 2021-05-11 北京三快在线科技有限公司 在线特征确定方法、装置、电子设备及存储介质
CN110110165B (zh) * 2019-04-01 2021-04-02 跬云(上海)信息科技有限公司 用于预计算系统中查询引擎的动态路由方法及装置
CN110232055B (zh) * 2019-05-08 2021-07-02 跬云(上海)信息科技有限公司 Olap数据分析迁移方法及系统
CN110427434B (zh) * 2019-06-28 2022-06-07 苏宁云计算有限公司 一种多维数据查询方法及装置
CN110688416A (zh) * 2019-09-05 2020-01-14 深圳市中电数通智慧安全科技股份有限公司 一种数据查询方法、装置及电子设备
CN111858542B (zh) * 2020-06-22 2023-10-27 中国平安财产保险股份有限公司 数据处理方法、装置、设备及计算机可读存储介质
CN112286953B (zh) * 2020-09-25 2023-02-24 北京邮电大学 多维数据查询方法、装置和电子设备
CN112597114B (zh) * 2020-12-23 2023-09-15 跬云(上海)信息科技有限公司 一种基于对象存储的olap预计算引擎优化方法及应用
CN113486006A (zh) * 2021-06-18 2021-10-08 深圳市迈安信科技有限公司 数据模型的构建方法及数据查询方法和计算机存储介质
CN113505276A (zh) * 2021-06-21 2021-10-15 跬云(上海)信息科技有限公司 预计算模型的评分方法、装置、设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060161517A1 (en) * 2005-01-18 2006-07-20 International Business Machines Corporation Method, system and article of manufacture for improving execution efficiency of a database workload
CN104794221A (zh) * 2015-04-29 2015-07-22 苏州国云数据科技有限公司 一种基于业务对象的多维数据分析系统
CN104965886A (zh) * 2015-06-16 2015-10-07 广州市勤思网络科技有限公司 数据维度处理方法
CN106372190A (zh) * 2016-08-31 2017-02-01 华北电力大学(保定) 实时olap查询方法和装置
CN106484875A (zh) * 2016-10-13 2017-03-08 广州视源电子科技股份有限公司 基于molap的数据处理方法及装置

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7895191B2 (en) * 2003-04-09 2011-02-22 International Business Machines Corporation Improving performance of database queries
US20090287666A1 (en) * 2008-05-13 2009-11-19 International Business Machines Corporation Partitioning of measures of an olap cube using static and dynamic criteria
CN102135994A (zh) * 2011-03-17 2011-07-27 新太科技股份有限公司 一种基于olap的智能分析方法
US8719295B2 (en) * 2011-06-27 2014-05-06 International Business Machines Corporation Multi-granularity hierarchical aggregate selection based on update, storage and response constraints
CN102663114B (zh) * 2012-04-17 2013-09-11 中国人民大学 面向并发olap的数据库查询处理方法
CN103793422B (zh) * 2012-10-31 2017-05-17 国际商业机器公司 基于增强星型模型的立方体元数据及查询语句生成
CN103853818B (zh) * 2014-02-12 2017-04-12 博易智软(北京)技术股份有限公司 多维数据的处理方法和装置
US10353923B2 (en) * 2014-04-24 2019-07-16 Ebay Inc. Hadoop OLAP engine
CN104408179B (zh) * 2014-12-15 2018-11-06 北京国双科技有限公司 数据表中数据处理方法和装置
US20170031980A1 (en) * 2015-07-28 2017-02-02 InfoKarta, Inc. Visual Aggregation Modeler System and Method for Performance Analysis and Optimization of Databases

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3605358A4 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309206A (zh) * 2019-07-10 2019-10-08 中国联合网络通信集团有限公司 订单信息采集方法及系统
CN110309206B (zh) * 2019-07-10 2022-06-10 中国联合网络通信集团有限公司 订单信息采集方法及系统
CN111666280A (zh) * 2020-04-27 2020-09-15 百度在线网络技术(北京)有限公司 评论的排序方法、装置、设备和计算机存储介质
CN111666280B (zh) * 2020-04-27 2023-11-21 百度在线网络技术(北京)有限公司 评论的排序方法、装置、设备和计算机存储介质
CN111930857A (zh) * 2020-07-08 2020-11-13 成都双链科技有限责任公司 一种基于图计算的实时联机数据分析处理方法
CN112540972A (zh) * 2020-12-16 2021-03-23 中盈优创资讯科技有限公司 一种基于RoaringBitmap海量用户高效圈选方法及装置
CN113094409A (zh) * 2021-04-08 2021-07-09 国网电子商务有限公司 业务数据的处理方法及装置、计算机存储介质
CN113268514A (zh) * 2021-05-26 2021-08-17 深圳壹账通智能科技有限公司 多维数据统计方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
EP3605358A4 (en) 2020-03-25
CN106997386A (zh) 2017-08-01
EP3605358A1 (en) 2020-02-05
CN106997386B (zh) 2019-12-27

Similar Documents

Publication Publication Date Title
WO2018176623A1 (zh) OLAP pre-calculation model, automatic modeling method, and automatic modeling system
US10902022B2 (en) OLAP pre-calculation model, automatic modeling method, and automatic modeling system
US9946780B2 (en) Interpreting relational database statements using a virtual multidimensional data model
Phipps et al. Automating data warehouse conceptual schema design and evaluation.
Poess et al. Why You Should Run TPC-DS: A Workload Analysis.
US8108367B2 (en) Constraints with hidden rows in a database
US8234295B2 (en) Managing uncertain data using Monte Carlo techniques
US7593931B2 (en) Apparatus, system, and method for performing fast approximate computation of statistics on query expressions
CN108153894B (zh) 一种olap数据模型自动建模的方法、分类器装置
Xu et al. PET: reducing database energy cost via query optimization
CN105183917A (zh) 一种用于多级存储数据的多维分析方法
Ibragimov et al. Optimizing aggregate SPARQL queries using materialized RDF views
CN112434024A (zh) 面向关系型数据库的数据字典生成方法、装置、设备及介质
CN108804594A (zh) 一种新闻内容全文检索引擎的构建方法及装置
Toumi et al. EMeD-part: an efficient methodology for horizontal partitioning in data warehouses
CN111061767B (zh) 一种基于内存计算与sql计算的数据处理方法
CN112241363B (zh) 面向分析型数据库的大规模随机负载生成及验证方法及系统
Burdakov et al. Predicting SQL Query Execution Time with a Cost Model for Spark Platform.
CN108052522B (zh) 一种对olap预计算模型进行动态优化的方法及系统
CN112905639A (zh) 一种基于规则的新能源数据分发方法
US8943058B1 (en) Calculating aggregates of multiple combinations of a given set of columns
Huang et al. Lightpro: Lightweight probabilistic workload prediction framework for database-as-a-service
Que et al. Towards in-circuit tuning of deep learning designs
CN116594795B (zh) 面向数据中台的错误检测和修复方法
US20230297570A1 (en) Data Statement Chunking

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17903560

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017903560

Country of ref document: EP

Effective date: 20191028