WO2021139427A1 - Big data index construction method, apparatus and device, and storage medium

Info

Publication number
WO2021139427A1
Authority
WO
WIPO (PCT)
Prior art keywords
indicator
dimensional
type
index
data
Prior art date
Application number
PCT/CN2020/131753
Other languages
French (fr)
Chinese (zh)
Inventor
陈志兴
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139427A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • This application relates to the field of big data technology, and in particular to a method, device, equipment and storage medium for constructing big data indicators.
  • With the progress of society and the development of big data, the development of fixed-dimensional indicator I is facing challenges.
  • The basis of the fixed-dimensional indicator I is to extract the corresponding production indicators from production data. This process requires a large amount of calculation on the production data, and the division into different indicator levels requires very flexible calculation. With the rapid growth of production data and increasingly flexible application scenarios, fixed-dimensional indicator I services can no longer provide effective service.
  • The previous solution was to provide more computing resources or a higher-performance computing engine.
  • This also consumes a lot of resources.
  • The inventor realizes that, in terms of the calculation model, although flexible indicator calculation is achieved, the time consumed by indicator calculation increases.
  • The calculation of the fixed-dimensional indicator I of big data is often limited to a single calculation engine.
  • The main purpose of this application is to solve the technical problem of using only a single data engine and dimensional modeling.
  • The first aspect of this application provides a method for constructing big data indicators, including: obtaining data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed by the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods; the routing decision engine is used to call the storage calculation engine to execute the
  • The second aspect of the present application provides a device for constructing big data indicators, including a memory, a processor, and computer-readable instructions stored on the memory and runnable on the processor, and the processor executes the computer
  • the following steps are implemented: obtaining the data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; based on the indicator type, according to preset
  • the dimensional modeling method determines the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator
  • A third aspect of the present application provides a computer-readable storage medium that stores computer instructions, and when the computer instructions are executed on a computer, the computer executes the following steps: obtaining data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determine the predictive value associated with the indicator according to the dimensional modeling method
  • the preset dimension table, wherein the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or
  • The fourth aspect of the present application provides a big data indicator construction device, which includes: a first acquisition module for acquiring data to be predicted; a first construction module for analyzing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions; a judging module for calculating the access frequency of the indicator according to a linear regression algorithm, and judging whether the indicator is associated with a preset dimension table; a first determining module for determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; a second determining module for determining, based on the indicator type and according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; a third determining module for determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein
  • FIG. 1 is a schematic diagram of a first embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a second embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a third embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a first embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a second embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of an embodiment of a device for constructing a big data indicator in an embodiment of the present invention.
  • The embodiments of the present application provide a method, device, equipment, and storage medium for constructing a big data indicator, which resolve the conflict between the time consumed by and the timeliness of calculating the fixed-dimensional indicator I of big data, and at the same time solve the technical problem of using only a single data engine and dimensional modeling.
  • the first embodiment of the method for constructing a big data indicator in the embodiment of the present application includes:
  • all data to be predicted are acquired, and the data contains many indicator labels.
  • data such as the premium of a certain type of insurance under a certain activity, and the premiums of all types of insurance under a certain activity.
  • the data to be predicted refers to the data containing the indicators to be calculated.
  • the data is analyzed to determine the indicator information contained in the data.
  • Common attributes are added to construct labels under multiple different basic dimensions (attributes). For example, taking the indicator "premium" and adding public attributes to it, we can construct several different indicators: "premium of auto insurance", "premium under the Double 11 event" or "premium of auto insurance under the Double 11 event".
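The way a base indicator multiplies into several derived indicators can be sketched as follows. This is only an illustration: the function name `build_indicators` and the label format are hypothetical and are not taken from the application.

```python
from itertools import combinations


def build_indicators(base, attributes):
    """Combine a base indicator with every non-empty subset of attributes.

    Hypothetical helper: the label format "base | attr & attr" is an
    assumption made for this sketch only.
    """
    labels = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            labels.append(f"{base} | " + " & ".join(combo))
    return labels


indicators = build_indicators("premium", ["auto insurance", "Double 11 event"])
for label in indicators:
    print(label)
```

With two attributes the sketch yields three derived indicators, matching the three "premium" variants in the example above.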
  • the indicator is an indicator label that the enterprise has obtained based on data analysis.
  • the basic dimension is to add the calculated value of the dimension index under the company's basic attributes, and the addition of the public attribute is to increase the company's common attributes.
  • The access frequency of the indicator is calculated, that is, it is determined whether the indicator is one that frequently needs to be counted (visited) or one that is calculated according to a certain usage rule.
  • Based on the access frequency of the indicator and whether other data needs to be associated when calculating it, the calculation requirements of the indicator are determined, and a suitable storage calculation engine is then selected.
  • the dimension table mentioned in this embodiment can be understood to a certain extent as a data table containing many index (label) information.
  • For example, the dimension table that records xx insurance company's "total premiums in 2019" contains labels such as "time: 2019.01, 2019.02, ..., 2019.12" and "insurance types: auto insurance, life insurance, critical illness insurance, children's insurance", and includes indicators such as "2019.01 auto insurance premium", "2019.01 critical illness insurance premium", "2019.03 life insurance premium", "2019.03 children's insurance premium", etc.
  • Whether a certain indicator needs to be associated with other dimension tables during calculation is determined by whether indicators from other dimension tables (data tables) must be introduced when calculating it. For example, when calculating the indicator "total auto insurance premiums from 2017 to 2019", the local dimension table only has the indicator "total auto insurance premiums in 2018". In this case, the indicator "total auto insurance premiums from 2017 to 2019" must be calculated by associating the indicator information in the dimension table "2017 total auto insurance premiums" and the dimension table "2019 total auto insurance premiums". For another example, when calculating the indicator "total auto insurance premiums from April to June 2019", since the local dimension table "total premiums for 2019" includes the monthly auto insurance premiums from January to December 2019, there is no need to associate indicator information in other dimension tables.
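The association check described above can be sketched as a simple coverage test. The helper names and the string representation of periods are assumptions made for illustration only.

```python
def missing_periods(required, local):
    """Periods the indicator needs that the local dimension table lacks."""
    return sorted(set(required) - set(local))


def needs_association(required, local):
    """True when other dimension tables must be associated for the calculation."""
    return bool(missing_periods(required, local))


# "Total auto insurance premiums from 2017 to 2019" against a local table
# that only holds 2018: the 2017 and 2019 tables must be associated.
print(missing_periods({"2017", "2018", "2019"}, {"2018"}))

# "Total auto insurance premiums from April to June 2019" against a local
# table holding every month of 2019: no association is needed.
months_2019 = {f"2019.{m:02d}" for m in range(1, 13)}
print(needs_association({"2019.04", "2019.05", "2019.06"}, months_2019))
```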
  • Linear regression in this embodiment refers to a regression analysis that uses a least square function called a linear regression equation to model the relationship between one or more independent variables and dependent variables.
  • A linear regression algorithm is used to predict the access frequency of the indicator. For example, in the process of using indicators, you will find that some indicators are frequently used, or that the access frequency of an indicator is affected by some other data. Such indicators share the same characteristics and behave linearly. From these characteristics, we can infer which indicators have a relatively high access frequency or are calculated on a regular statistical schedule. For example, the indicators that need to be checked for the Double 11 event also need the same statistics on Double 12, so the Double 12 indicators can be calculated and aggregated in advance according to the characteristics of Double 11. In this embodiment, regression means predicting new data based on existing data, such as predicting stock trends. Linear regression uses a straight line to describe the relationship between the data as accurately as possible, so that a simple value can be predicted when new data appears.
  • the linear regression model looks like:
  • the model obtained by linear regression is not necessarily a straight line:
  • the model is a plane in space
  • The residual sum of squares is usually used in linear regression, that is, the distance from each point to the straight line is measured parallel to the y axis rather than as the perpendicular distance.
  • the residual sum of squares divided by the sample size n is the mean square error.
  • the mean square error is used as the cost function of the linear regression model. Minimizing the sum of the distances from all points to the straight line is to minimize the mean square error. This method is called the least squares method.
  • the i-th sample is expressed as:
  • the loss function is the mean square error, that is
  • The least squares method is used to solve for the parameters, and the loss function J(θ) is differentiated with respect to θ:
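For reference, the quantities named above can be written in a standard textbook form. The notation is an assumption and is not taken from the application: θ is the parameter vector, n the sample size, d the number of independent variables, and X the matrix whose i-th row is the i-th sample with a leading 1.

```latex
% Linear model with d independent variables
% (d = 1 gives a straight line, d = 2 a plane in space)
\hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \theta^{\top} x,
\qquad x = (1, x_1, \dots, x_d)^{\top}

% The i-th of n samples
\bigl(x^{(i)}, y^{(i)}\bigr), \qquad i = 1, \dots, n

% Mean square error as the loss function
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\theta^{\top} x^{(i)} - y^{(i)}\bigr)^{2}
          = \frac{1}{n} (X\theta - y)^{\top} (X\theta - y)

% Setting the derivative to zero gives the least squares solution
\frac{\partial J}{\partial \theta} = \frac{2}{n} X^{\top} (X\theta - y) = 0
\;\Longrightarrow\;
\theta = (X^{\top} X)^{-1} X^{\top} y
```

The closed form in the last line holds only when the matrix X^T X is invertible, that is, when the independent variables are not collinear.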
  • the linear regression algorithm is used to determine the important indicators in the sample data, and the mapping relationship equations between the indicators and the indicator factors that affect the access frequency of the indicators are established respectively.
  • According to the weight W with which each index factor affects the access frequency of indicator M, the main independent variables (i.e., the main index factors) $a_1, a_2, a_3, \dots, a_n$ are determined, and the mapping relationship equation between the main index factors and the indicator is established:
  • $y = \theta_0 + \theta_1 a_1 + \theta_2 a_2 + \dots + \theta_n a_n$, where $y$ is the access frequency of indicator M (in a specific time period), and $a_1, a_2, a_3, \dots, a_n$ are the index factors that affect the access frequency of indicator M (in a specific time period).
  • Taking indicator M as an example, collect the access frequency of indicator M during a special promotion from 2017 to 2019, together with the values of the index factors $a_1, a_2, a_3, \dots, a_n$ that affect the access frequency of indicator M.
  • By comparing the predicted value and the actual value of indicator M, it can be judged whether the model is reasonable and can be used to predict the access frequency of indicator M (in a specific time period).
  • The regression model infers the change in the access frequency of indicator M in a certain period of time; this is fed as input data into the prediction model, which finally yields the predicted access frequency of indicator M (in a certain period of time).
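A minimal sketch of fitting such a mapping equation by ordinary least squares follows, using only the Python standard library. The factor values and access frequencies are hypothetical sample data (constructed so that they lie exactly on y = 5 + 2·a1 + 3·a2), and the function names are not taken from the application.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: index factors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(factors, frequencies):
    """Return [theta_0, theta_1, ..., theta_n] from the normal equations."""
    rows = [[1.0] + [float(v) for v in f] for f in factors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, frequencies)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(theta, factor_values):
    """Evaluate y = theta_0 + theta_1 * a_1 + ... + theta_n * a_n."""
    return theta[0] + sum(t * a for t, a in zip(theta[1:], factor_values))


# Hypothetical history: (a1, a2) factor values and observed access frequency.
history_factors = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
history_frequency = [7.0, 8.0, 10.0, 12.0, 16.0]

theta = fit_least_squares(history_factors, history_frequency)
forecast = predict(theta, (3, 2))
print([round(t, 6) for t in theta], round(forecast, 6))
```

Solving the normal equations directly is adequate for a handful of factors; with many or nearly collinear factors, a numerically more robust solver would be the safer choice.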
  • In this way, it can be predicted which data will be accessed with high frequency. Data accessed with high frequency needs to be pre-aggregated, while data that does not require high-frequency access can use other storage engines.
  • the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators
  • The indicator type is determined according to the access frequency of the indicator and whether other dimension tables need to be associated when calculating it. For example, some indicators can only be calculated after multiple dimension tables are associated, while others can be calculated without associating any other dimension table. There are thus two types of indicators: indicators that require multi-dimensional aggregation, that is, indicators that must be associated with other dimension tables during calculation; and fixed-dimensional indicators, that is, indicators whose value can be calculated solely from the data in the wide table to which they belong, without associating data in other dimension tables.
  • the storage calculation engine corresponding to the indicator is queried, and the preset dimension table associated with the indicator is calculated.
  • Different types of indicators are stored in different storage calculation engines. For example, some of them are stored in random reports or semi-aggregated reports.
  • For such indicators, the table where the indicator is located must be associated with other dimension tables before the value of the indicator can be calculated.
  • The aggregate report built from these indicators can be stored in the aggregation engine and calculated in advance, so that when the user queries the indicator, the corresponding value can be returned quickly without waiting for the calculation, which improves the efficiency of data processing.
  • According to the type of the indicator, it is determined whether multiple dimension tables need to be associated when querying (calculating) the indicator value, and if so, the corresponding dimension tables are queried. For example, to calculate the value of the fixed indicator "2018 Double 11 event auto insurance premiums", the data of three tables of different dimensions, "2018 insurance premiums", "2018 auto insurance premiums" and "2018 Double 11 event premiums", only needs to be stored in one table, a wide table, so no other data reports need to be associated during calculation. By contrast, calculating the indicator "2018 premiums" requires the tables "2018 auto insurance premiums", "2018 property insurance premiums", "2018 life insurance premiums", ..., "2018 XX insurance premiums"; all insurance premium tables are associated to obtain the value of the indicator "2018 premiums".
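The difference between the two calculations can be sketched with in-memory tables. All amounts, table layouts and function names below are hypothetical and serve only to contrast a wide-table lookup with an association across several tables.

```python
# Hypothetical 2018 premium tables, one small table per insurance type.
premium_tables_2018 = {
    "auto insurance": {"Double 11 event": 120.0, "other": 880.0},
    "property insurance": {"Double 11 event": 60.0, "other": 540.0},
    "life insurance": {"Double 11 event": 90.0, "other": 710.0},
}


def total_premium(tables):
    """Associate every per-type table and sum all of their rows."""
    return sum(sum(rows.values()) for rows in tables.values())


# Hypothetical wide table: every dimension is already part of the row key.
wide_table = {
    ("2018", event, insurance): amount
    for insurance, rows in premium_tables_2018.items()
    for event, amount in rows.items()
}


def fixed_indicator(table, year, event, insurance):
    """Read one fixed-dimensional indicator from the wide table alone."""
    return table[(year, event, insurance)]


print(total_premium(premium_tables_2018))
print(fixed_indicator(wide_table, "2018", "Double 11 event", "auto insurance"))
```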
  • the preset dimension table includes a dimensional table constructed based on the dimensional modeling method corresponding to the indicator type or a dimensional table constructed based on all dimensional modeling methods:
  • different types of indicators correspond to different modeling models to generate different types of reports, and the generated reports are also stored in different data storage calculation engines according to different report types.
  • The routing decision engine requests the corresponding storage calculation engine according to the correspondence between the report to which the indicator belongs and the calculation engine in which that report is stored. That is, depending on the queried indicator, the routing decision engine selects the storage calculation engine corresponding to the current calculation request and distributes the request to it to calculate the value of the corresponding indicator. For example, if the indicator to be viewed is a basic (fixed) indicator, the query (calculation) request is forwarded to a basic database such as hive (a non-aggregating database that supports multi-table association calculation); if a pre-calculated indicator is to be viewed, the request is forwarded to a database such as druid.io (an aggregated data engine). In this embodiment, the calculation requirement of an indicator can be simply understood as whether dimension tables need to be associated and calculated.
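The routing step can be sketched as a lookup followed by a dispatch. The engine names follow the example in the text, but the enum, the mapping and the stub engines below are hypothetical; no real hive or druid client API is used.

```python
from enum import Enum


class IndicatorKind(Enum):
    BASIC = "basic (fixed) indicator"
    PRE_CALCULATED = "pre-calculated indicator"


# Assumed correspondence table between indicator kind and engine name.
ENGINE_BY_KIND = {
    IndicatorKind.BASIC: "hive",
    IndicatorKind.PRE_CALCULATED: "druid",
}


def route(kind, engines):
    """Return the name and callable of the engine registered for this kind."""
    name = ENGINE_BY_KIND[kind]
    return name, engines[name]


# Stub engines standing in for real storage calculation engines.
engines = {
    "hive": lambda request: f"hive handled: {request}",
    "druid": lambda request: f"druid handled: {request}",
}

name, run = route(IndicatorKind.PRE_CALCULATED, engines)
print(name, "->", run("2019 Double 11 auto insurance premium"))
```

Keeping the correspondence in a table rather than in branching code means a new engine can be added by registering one more entry.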
  • the data to be predicted is mainly obtained and analyzed to construct indicators of multiple dimensional attributes, and the access frequency of the indicators is predicted by the linear regression algorithm to determine the calculation requirements of the indicators.
  • According to the calculation requirements of the indicators, an appropriate method is selected to store the indicators in the corresponding storage calculation engine and to calculate their values. This resolves the conflict between the time consumed by and the timeliness of calculating the fixed-dimensional indicator I of big data, and at the same time solves the problem.
  • the second embodiment of the method for constructing a big data indicator in the embodiment of the present application includes:
  • an indicator refers to a unit or method used to measure the degree of development of a thing, and it also has a commonly used name in IT, that is, measurement. For example: population, GDP, income, number of users, profit rate, retention rate, coverage rate, etc.
  • KPI indicator system which uses several key indicators to measure the performance of the company’s business operations. The indicators need to be obtained through summary calculation methods such as summation and average, and summary calculations need to be performed under certain preconditions, such as time, location, and cost, which is what we often call statistical caliber and scope.
  • a preset model is used to classify the extracted indicators, and the dimensional attribute information of each indicator is added.
  • For example, by gradually adding dimensional attribute information, the indicator "premium" can become "enterprise plan premiums" and then "enterprise plan premiums of secondary institutions". This adds basic dimension attribute information to the indicator; public attribute dimension information can be added at the same time, such as "whether it is the enterprise plan premiums of secondary institutions participating in insurance activities".
  • The dimension attribute of an indicator refers to a certain characteristic of a thing or phenomenon; gender, region and time, for example, are all dimensions. Among them, time is a commonly used and special dimension. By comparing values before and after a point in time, you can tell whether things are developing well or badly. For example, "the premium of auto insurance under the Double 11 event in 2019 increased by 10% compared with that under the Double 11 event in 2018" or "the premium of life insurance under the Double 12 event in 2019 increased by 20% compared with the premium of life insurance under the Double 11 event in 2019". This is a comparison over time, also known as a vertical comparison. Another kind of comparison is the horizontal comparison.
  • For example, the comparison between the "premium of auto insurance under the Double 11 event in 2018" and the "premium of life insurance under the Double 11 event in 2018" is a comparison between units of the same level, referred to as a horizontal comparison.
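The two kinds of comparison reduce to simple arithmetic, sketched below. The premium figures and function names are hypothetical; they are chosen so that the vertical comparison reproduces the 10% increase mentioned in the example.

```python
def vertical_change(current, previous):
    """Relative change of one indicator between two periods."""
    return (current - previous) / previous


def horizontal_ratio(value_a, value_b):
    """Ratio of two same-level indicators within the same period."""
    return value_a / value_b


# Hypothetical premiums under the Double 11 event.
auto_2018, auto_2019 = 200.0, 220.0
life_2018 = 100.0

print(f"auto insurance, 2019 vs 2018: {vertical_change(auto_2019, auto_2018):+.0%}")
print(f"auto vs life insurance, 2018: {horizontal_ratio(auto_2018, life_2018):.1f}x")
```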
  • According to the data type, dimensions can be divided into qualitative dimensions and quantitative dimensions.
  • Dimensions whose data type is character (text) data are qualitative dimensions.
  • For example, region and gender are both qualitative dimensions;
  • Dimensions whose data type is numerical data are quantitative dimensions, such as income, age, consumption, etc.
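The split by data type can be sketched as a type check over sample values. The function name `dimension_kind` is hypothetical, and treating mixed or empty samples as an error is an assumption of this sketch.

```python
from numbers import Number


def dimension_kind(values):
    """Classify a dimension as qualitative (text) or quantitative (numeric)."""
    values = list(values)
    if not values:
        raise ValueError("no sample values to classify")
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    # bool is a Number subclass in Python, so it is excluded explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "quantitative"
    raise ValueError("mixed or unsupported value types")


print(dimension_kind(["Shenzhen", "Shanghai"]))  # region
print(dimension_kind([23, 41, 35.5]))            # age
```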
  • The indicator and the dimensional attribute are combined to obtain multiple indicators carrying different dimensional attribute information. For example, "Premium for auto insurance under Double 11 in 2019", "Premium for auto insurance under Double 12 in 2019", "Premium for property insurance under Double 11 in 2019", "Premium for property insurance under Double 12 in 2019".
  • the indicators of different dimensional attributes in the data to be predicted are determined, and at the same time, the indicator factors that affect the access frequency of the indicators are determined.
  • a mapping relationship equation between the index obtained in the data to be predicted and the index factor corresponding to the index is established.
  • the elastic coefficient method is used to predict the parameter value of each index factor under a certain activity of the data to be predicted. For example, predict the number of people who will purchase auto insurance during the Double 11 event in 2019.
  • the elastic coefficient ET is calculated using the data of the most recent year and the farthest year (from the collected historical data), and then the access frequency of the corresponding indicator under a certain activity can be calculated.
  • the access frequency in this embodiment can also be said to be a probability value.
  • The mapping relationship equation between the indicator obtained from the data to be predicted and the index factors corresponding to that indicator is established, and the parameter values of the index factors are substituted into the equation to calculate (predict) the access frequency (probability value) of the indicator.
  • the indicator is an indicator type that requires multi-dimensional aggregation
  • The indicator is of the type that requires multi-dimensional aggregation, that is, it needs to be aggregated across multiple dimensions. For example, calculating the indicator "2018 premiums" requires the tables "2018 auto insurance premiums", "2018 property insurance premiums", "2018 life insurance premiums", ..., "2018 XX insurance premiums"; since the premium tables of all insurance types must be associated, the indicator "2018 premiums" is of the type that requires multi-dimensional aggregation.
  • the indicator type is a fixed-dimensional indicator type
  • If the access probability of the indicator is greater than the preset threshold and, when the indicator is queried (calculated), there is no need to associate other dimension tables and only the data in the table to which the indicator belongs is used, then the indicator can be determined to be of the fixed-dimensional indicator type, that is, a fixed indicator.
  • For example, for the indicator "2018 Double 11 event auto insurance premium", the dimensions are a fixed set of three: "2018 + Double 11 event + auto insurance".
  • Because the data is held in a wide table, the calculation only queries the data in this (wide) table and does not need to associate data in other tables, so the indicator "2018 Double 11 event auto insurance premium" is of the fixed-dimensional indicator type, that is, a fixed indicator. In this embodiment, a wide table is a table that contains all the required fields, so no other tables need to be associated when compiling statistics (calculating indicator values).
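The type decision described in this embodiment can be sketched as a small rule. The threshold value of 0.5 and the function name are hypothetical, and the text does not say how an indicator with a low access probability that needs no association should be typed, so the sketch returns `None` for that case.

```python
def indicator_type(access_probability, needs_association, threshold=0.5):
    """Assign an indicator type from the two criteria in the text.

    threshold=0.5 is a placeholder for the preset threshold, whose value
    the text does not give.
    """
    if needs_association:
        return "multi-dimensional aggregated indicator"
    if access_probability > threshold:
        return "fixed-dimensional indicator"
    return None  # combination not described in the text


# "2018 premiums": every per-type premium table must be associated.
print(indicator_type(0.9, needs_association=True))
# "2018 Double 11 event auto insurance premium": wide table only.
print(indicator_type(0.9, needs_association=False))
```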
  • the preset dimension table includes a dimensional table constructed based on the dimensional modeling method corresponding to the indicator type or a dimensional table constructed based on all dimensional modeling methods;
  • the third embodiment of the method for constructing a big data indicator in the embodiment of the present application includes:
  • the historical data containing the indicators to be predicted is obtained.
  • the historical data includes the indicators in a specific period, the number of visits of each indicator in that period, and the index factors that affect that number of visits.
  • the index factor is related to the number of visits of the index in a specific period. Therefore, a mapping relationship between the index factor and the index access frequency is established, and the index access frequency is calculated (or "predicted") based on historical data.
  • historical data is used as sample data, for example, the data information of “auto insurance premium under the Double 11 event in 2018” is used as sample data.
  • the t test is a type of significance test in the multiple linear regression algorithm.
  • the F test can be equivalent to the t test.
  • the partial correlation analysis method is used to further analyze the mapping relationship equation of each index and its index factors and to determine the main independent variables in that mapping relationship, that is, the main index factors. Many index factors affect the number of times an indicator is visited in a specific period, and the main index factors are the main influencing factors. All main index factors are then retained in the mapping relationship equation between the indicator and the index factors.
  • the index factor whose partial correlation coefficient is within the preset value interval and the regression coefficient is greater than the F test parameter or the t test parameter in the mapping relationship equation is the main index factor.
  • determine the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators;
  • the model construction method corresponding to the indicator type is queried from the preset correspondence table between the indicator type and the model construction method.
  • if the indicator to be calculated is of the indicator type that requires multi-dimensional aggregation, that is, an indicator that can only be calculated by associating multiple dimension tables, use dimensional modeling to build random reports and/or semi-aggregated reports, and store the random reports and/or semi-aggregated reports in the non-aggregated engine and/or semi-aggregated engine.
  • if the indicator is of the fixed-dimensional indicator type, use wide-table modeling to build an aggregate report, and store the aggregate report in the aggregation engine
  • if the indicator to be calculated is of the fixed-dimensional indicator type, that is, an indicator that can be calculated without associating multiple dimension tables, use wide-table modeling, build an aggregate report, and store the aggregate report in the aggregation engine.
  • wide table modeling means that the indicators and dimensions are stored in one large table in which all fields are built, so there is no need to associate other tables with the data.
  • dimensional modeling, by contrast, divides the data into a fact table and dimension tables: the fact table is a record of specific events.
  • the dimension represents some description of the event; separating the fact table from the dimension tables improves flexibility.
  • the preset dimension table includes a dimensional table constructed based on the dimensional modeling method corresponding to the indicator type or a dimensional table constructed based on all dimensional modeling methods;
  • the indicator is an indicator that requires multi-dimensional aggregation
  • the indicator is downgraded and stored in a random report or semi-aggregated report; in this embodiment, if the indicator is an indicator that requires multi-dimensional aggregation, it can be understood that the indicator does not need to be calculated in advance, so the indicator is downgraded, that is, the indicator and its data are stored on a common calculation engine to save computing resources.
  • Common computing engines include non-aggregation engines and semi-aggregation engines.
  • if the indicator is a fixed-dimensional indicator, use wide-table modeling to store all the fields of the dimensions in the aggregate report; in this embodiment, a fixed-dimensional indicator is one whose aggregate calculation does not require other dimension tables. All indicators of this type can be stored in an aggregate report and calculated in advance, which saves indicator query (calculation) time and improves the efficiency of data processing.
  • query the storage calculation engine corresponding to the fixed-dimensional indicator type, and store the aggregate report in the aggregation engine; in this embodiment, if the indicator is of the fixed-dimensional indicator type, that is, no other dimension tables need to be associated when calculating it, this type of indicator is stored in the aggregate report through wide-table modeling and stored in the aggregation engine, so that it can be calculated in advance.
  • An embodiment of the device for constructing a big data indicator in the embodiment of the application includes: an acquisition module 401, used to obtain the data to be predicted; a first construction module 402, used to analyze the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions; a judgment module 403, used to calculate the access frequency of the indicator according to the linear regression algorithm and determine whether the indicator is associated with a preset dimension table; a first determining module 404, used to determine the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; a second determining module 405, configured to determine, based on the indicator type and according to the preset correspondence table between the indicator type and the storage calculation engine and the correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the
  • the second embodiment of the device for constructing big data indicators in the embodiment of the present application includes:
  • the first acquisition module 501 is used to obtain the data to be predicted; the first construction module 502 is used to analyze the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions; the judgment module 503 is used to calculate the access frequency of the indicator according to the linear regression algorithm and determine whether the indicator is associated with a preset dimension table; the first determining module 504 is used to determine the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; the second determining module 505 is used to determine, based on the indicator type and according to the preset correspondence table between the indicator type and the storage calculation engine and the correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator.
  • the third determining module 506 is used to determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods; the calculation module 507 is used to call the storage calculation engine through the routing decision engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator; the second acquisition module 508 is used to obtain historical data containing the indicators, where the historical data includes the indicators in a specific period, the number of visits of each indicator in that period, and the index factors that affect that number of visits; the analysis module 509 is used to take the historical data as sample data, perform partial correlation analysis on the sample data, extract the indicators, and establish the mapping relationship equations between the indicators and the corresponding index factors respectively; the test module 510 is used to perform t-tests on the mapping relationship equations respectively to determine the impact index access
  • the first construction module 502 is specifically used to: analyze the data to be predicted and define multiple indicators; use a preset model to classify the indicators and add dimensional attributes; and combine the indicators and dimensional attributes to obtain multiple indicators with different dimension attributes.
  • the judgment module 503 is specifically used to: determine the main index factors that affect the access frequency of the index based on the linear regression algorithm; establish the mapping relationship equation between the index and the main index factor, and use the elastic coefficient method to predict the parameter value of the main index factor; The parameter value of the factor is substituted into the mapping relationship equation to calculate the access frequency of the indicator
  • the first determining module 504 is specifically configured to: if the access frequency of the indicator is greater than a preset threshold and other dimension tables need to be associated when calculating the indicator, determine that the indicator is of the indicator type that requires multi-dimensional aggregation; if the access frequency of the indicator is greater than the preset threshold and no other dimension tables need to be associated when calculating the indicator, determine that the indicator is of the fixed-dimensional indicator type.
  • FIG. 6 is a schematic structural diagram of a big data indicator construction device provided by an embodiment of the present application.
  • the big data indicator construction device 600 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 610 (for example, one or more processors), a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632.
  • the memory 620 and the storage medium 630 may be short-term storage or persistent storage.
  • the program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the big data indicator construction device 600.
  • the processor 610 may be configured to communicate with the storage medium 630, and execute a series of instruction operations in the storage medium 630 on the big data indicator construction device 600, so as to implement the steps of the big data indicator construction method in the foregoing embodiments.
  • the big data indicator construction device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input and output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • the present application also provides a device for constructing a big data indicator.
  • the device for constructing a big data indicator includes: a memory and at least one processor, where instructions are stored in the memory, and the memory and the at least one processor are interconnected by wires; the at least one processor calls the instructions in the memory so that the big data indicator construction device executes the steps of the big data indicator construction method.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions. When the computer instructions are run on a computer, the computer executes the following steps: obtain the data to be predicted; analyze the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculate the access frequency of the indicator according to the linear regression algorithm, and determine whether the indicator is associated with a preset dimension table; determine the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determine, based on the indicator type and according to the preset correspondence table between the indicator type and the storage calculation engine and the correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determine the preset dimension table associated with the indicator according to the dimensional modeling method.
  • the preset dimension table includes the dimension table constructed with the dimensional modeling method corresponding to the indicator type or the dimension table constructed with all the dimensional modeling methods; the routing decision engine is used to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
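The type determination and the two correspondence tables described in the items above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `IndicatorType`, `STORAGE_ENGINE`, `MODELING_METHOD`, `classify` and `route` are hypothetical, and returning `None` for an indicator at or below the threshold is an assumption, since the text only covers frequencies above it.

```python
from enum import Enum
from typing import Optional, Tuple


class IndicatorType(Enum):
    MULTI_DIMENSIONAL_AGGREGATION = "multi-dimensional aggregation"
    FIXED_DIMENSION = "fixed dimension"


# Hypothetical correspondence tables; the string values paraphrase the text.
MODELING_METHOD = {
    IndicatorType.MULTI_DIMENSIONAL_AGGREGATION: "dimensional modeling",
    IndicatorType.FIXED_DIMENSION: "wide-table modeling",
}
STORAGE_ENGINE = {
    IndicatorType.MULTI_DIMENSIONAL_AGGREGATION: "non-aggregated or semi-aggregated engine",
    IndicatorType.FIXED_DIMENSION: "aggregation engine",
}


def classify(access_frequency: float, needs_other_tables: bool,
             threshold: float) -> Optional[IndicatorType]:
    """Return the indicator type for a frequency above the threshold.

    Assumption: frequencies at or below the threshold are left unclassified.
    """
    if access_frequency <= threshold:
        return None
    if needs_other_tables:
        return IndicatorType.MULTI_DIMENSIONAL_AGGREGATION
    return IndicatorType.FIXED_DIMENSION


def route(indicator_type: IndicatorType) -> Tuple[str, str]:
    """Look up (storage engine, modeling method) for an indicator type."""
    return STORAGE_ENGINE[indicator_type], MODELING_METHOD[indicator_type]


if __name__ == "__main__":
    kind = classify(access_frequency=0.9, needs_other_tables=False, threshold=0.5)
    print(kind, route(kind))
```

Keeping the two lookups as plain tables mirrors the "preset correspondence table" wording, so a new indicator type would only require new table entries.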

Abstract

Disclosed are a big data index construction method, apparatus and device, and a storage medium, relating to the field of big data. The method comprises: acquiring data to be predicted, and parsing said data to construct multiple indexes carrying different dimension attribute information; calculating the access frequency of each index according to a linear regression algorithm, and determining whether other dimension tables need to be associated with same during index calculation in order to determine the type of the index; according to a correlation table between the index type and a storage calculation engine and a correlation table between the index type and a dimension modeling mode for the index, making a query for a storage calculation engine corresponding to the index and calculating a preset dimension table with which the index needs to be associated; and using a routing decision engine to call the storage calculation engine in order to execute the preset dimension table, and calculating a value corresponding to the index. The method solves the problem of the timeliness of big data index calculation, and solves the technical problem of only a single data engine and dimension being used for modeling.

Description

Big data indicator construction method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on July 23, 2020, with application number 202010714909.9 and the invention title "Big Data Indicator Construction Method, Device, Equipment, and Storage Medium", the entire contents of which are incorporated in this application by reference.
Technical field
This application relates to the field of big data technology, and in particular to a method, device, equipment and storage medium for constructing big data indicators.
Background
With the progress of society and the development of big data, the development of fixed-dimensional indicator I faces challenges. Fixed-dimensional indicator I is based on using production data to extract the corresponding production indicators, a process that requires a large amount of calculation on the production data. At the same time, the division into different indicator levels demands very flexible calculation. As production data surges and application scenarios become more flexible, fixed-dimensional indicator I services can no longer provide effective service.
Previous solutions addressed the problem by providing more computing resources or a more powerful computing engine, but with the surge in data volume this also consumes a large amount of resources. The inventor realized that, in terms of calculation models, although flexible indicator calculation has been achieved, the time consumed by indicator calculation has increased; to keep the calculation model and computing resources consistent, the fixed-dimensional indicator I calculation of big data is often limited to a single calculation engine.
Summary of the invention
The main purpose of this application is to solve the technical problem that only a single data engine and dimensional modeling can be used.
To achieve the above purpose, a first aspect of this application provides a method for constructing big data indicators, including: obtaining data to be predicted; parsing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between the indicator type and the storage calculation engine and a correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods; and using a routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
A second aspect of this application provides equipment for constructing big data indicators, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining data to be predicted; parsing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between the indicator type and the storage calculation engine and a correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods; and using a routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
A third aspect of this application provides a computer-readable storage medium storing computer instructions. When the computer instructions are run on a computer, the computer executes the following steps: obtaining data to be predicted; parsing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between the indicator type and the storage calculation engine and a correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods; and using a routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
A fourth aspect of this application provides a device for constructing big data indicators, including: a first acquisition module, used to obtain data to be predicted; a first construction module, used to parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions; a judgment module, used to calculate the access frequency of the indicator according to a linear regression algorithm and determine whether the indicator is associated with a preset dimension table; a first determining module, used to determine the indicator type of the indicator based on the access frequency, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; a second determining module, used to determine, based on the indicator type and according to a preset correspondence table between the indicator type and the storage calculation engine and a correspondence table between the indicator type and the dimensional modeling method of the indicator, the storage calculation engine and the dimensional modeling method corresponding to the indicator; a third determining module, used to determine, according to the dimensional modeling method, the preset dimension table associated with the indicator, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods; and a calculation module, used to call the storage calculation engine through a routing decision engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
Description of the drawings
Fig. 1 is a schematic diagram of a first embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
Fig. 2 is a schematic diagram of a second embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
Fig. 3 is a schematic diagram of a third embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
Fig. 4 is a schematic diagram of a first embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
Fig. 5 is a schematic diagram of a second embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
Fig. 6 is a schematic diagram of an embodiment of equipment for constructing a big data indicator in an embodiment of the present invention.
Detailed description
The embodiments of this application provide a method, device, equipment and storage medium for constructing big data indicators, which resolve the conflict between calculation time and timeliness for the fixed-dimensional indicator I of big data, and also solve the technical problem that only a single data engine and dimensional modeling can be used.
The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of this application and in the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments described here can be implemented in an order other than the one illustrated or described. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, and may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
For ease of understanding, the specific flow of the embodiments of this application is described below. Referring to Fig. 1, the first embodiment of the method for constructing a big data indicator in the embodiments of this application includes:
101. Obtain the data to be predicted;
In this embodiment, all data to be predicted are acquired. The data contain many indicator labels, for example the premium of a certain insurance type under a certain event, or the premiums of all insurance types under a certain event. The data to be predicted are data containing the indicators to be calculated. These data are parsed to determine the indicator information they contain, and public attributes are then added to construct labels under multiple (different) basic dimensions (attributes). Taking the indicator "premium" as an example, adding public attributes to it yields several different indicators: "auto insurance premium", "premium under the Double 11 event", or "auto insurance premium under the Double 11 event".
102. Parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
In this embodiment, an indicator is an indicative label that the enterprise has already derived from data analysis. A basic dimension is the calculated value of a dimension indicator under the company's basic attributes, and adding public attributes means adding attributes common to the company.
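The "premium" example in step 101 can be sketched as attaching dimension attributes to a base indicator. This is a minimal illustration only: the function name `build_indicators` is hypothetical, and enumerating every non-empty combination of attributes is an assumption, since the text says only that indicators and dimension attributes are combined.

```python
from itertools import combinations


def build_indicators(base, attributes):
    """Attach every non-empty combination of attributes to a base indicator."""
    indicators = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            indicators.append({"indicator": base, "dimensions": combo})
    return indicators


# Attribute names taken from the example in the text.
result = build_indicators("premium", ["auto insurance", "Double 11 event"])
for item in result:
    print(item)
```

With two attributes this yields three indicators, matching the three variants named in the text.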
103. Calculate the access frequency of the indicator according to the linear regression algorithm, and determine whether the indicator is associated with a preset dimension table;
In this embodiment, to determine whether an indicator needs to be associated with other dimension data when it is calculated, the access frequency of the indicator must first be predicted; that is, it must be judged whether the indicator is one that frequently needs to be counted (accessed) or one derived from some usage pattern. The calculation requirements of the indicator are determined from its access frequency and from whether other data must be associated during its calculation, and a suitable storage calculation engine is then selected.
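Predicting an access frequency from historical data can be sketched with ordinary least squares. The sketch below is a deliberate simplification: it fits a single index factor, whereas the application describes multiple linear regression with significance tests. The function names and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_visits(slope, intercept, factor_value):
    """Substitute a factor value into the fitted mapping equation."""
    return slope * factor_value + intercept


# Made-up history: one index factor and the visits observed per period.
factor = [1.0, 2.0, 3.0, 4.0]
visits = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(factor, visits)
predicted = predict_visits(slope, intercept, 5.0)
print(slope, intercept, predicted)
```

The predicted value would then be compared with the preset threshold to decide whether the indicator counts as frequently accessed.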
The dimension table referred to in this embodiment can, to some extent, be understood as a data table containing information on many indicators (labels). For example, a dimension table recording the "total premiums in 2019" of insurance company XX contains the labels "time: 2019.01, 2019.02, ..., 2019.12" and "insurance type: auto insurance, life insurance, critical illness insurance, children's insurance", and contains indicators such as "2019.01 auto insurance premium", "2019.01 critical illness insurance premium", "2019.03 life insurance premium", and "2019.03 children's insurance premium".
In this embodiment, whether an indicator needs to be associated with other dimension tables during calculation is judged by whether indicators from other dimension tables (data tables) must be brought in to calculate it. For example, suppose the indicator "total auto insurance premiums for 2017 to 2019" is to be calculated, but the local dimension table holds only the indicator "total auto insurance premiums in 2018". Calculating the indicator then requires associating the indicator information in the dimension tables "total auto insurance premiums in 2017" and "total auto insurance premiums in 2019". As another example, when calculating the indicator "total auto insurance premiums for April to June 2019", the local dimension table "total premiums in 2019" already contains the monthly auto insurance premiums from January to December 2019, so no indicator information from other dimension tables needs to be associated.
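The check described above can be sketched as a set-coverage test. This is only an illustration under stated assumptions: the function name `tables_needed`, the table names, and the representation of a dimension table as a set of (year, insurance type) cells are all hypothetical and do not come from the application.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, str]  # hypothetical cell key: (year, insurance type)


def tables_needed(required: Set[Cell], local: Set[Cell],
                  others: Dict[str, Set[Cell]]) -> List[str]:
    """Return the names of the other dimension tables that must be
    associated so that every required cell is covered."""
    missing = required - local
    needed = []
    for name, cells in sorted(others.items()):
        if missing & cells:
            needed.append(name)
            missing = missing - cells
    if missing:
        raise LookupError(f"no dimension table covers {sorted(missing)}")
    return needed


# Local table holds only 2018; the 2017-2019 total needs two more tables.
local = {(2018, "auto")}
others = {
    "auto_premium_2017": {(2017, "auto")},
    "auto_premium_2019": {(2019, "auto")},
}
required = {(2017, "auto"), (2018, "auto"), (2019, "auto")}
joins = tables_needed(required, local, others)
needs_association = bool(joins)
```

An empty result means the local table alone is sufficient, which corresponds to the April to June 2019 case in the text.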
In this embodiment, linear regression refers to a regression analysis that models the relationship between one or more independent variables and a dependent variable using a least squares function called the linear regression equation. Here the linear regression algorithm is used to predict the access frequency of an indicator. In practice, some indicators are used frequently, or their access frequency is influenced by other data; when such indicators share the same characteristics and the relationship is linear, those characteristics can be used to infer which indicators will have a high access frequency or can be derived from a regular pattern. For example, the indicators that need to be viewed for the Double 11 campaign also need to be counted in the same way for Double 12, so the Double 12 indicators can be calculated and aggregated in advance based on the characteristics of Double 11. Regression predicts new data from existing data, for example predicting stock trends. Linear regression describes the relationship between the data fairly accurately with a straight line, so that when new data appears a value can be predicted simply.
The linear regression model has the form:
$$h(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b$$
The model obtained by linear regression is not necessarily a straight line:
(1) with only one variable, the model is a straight line in the plane;
(2) with two variables, the model is a plane in space;
(3) with more variables, the model is higher-dimensional.
In practice, linear regression usually uses the residual sum of squares, where each residual is the distance from a point to the line measured parallel to the y axis rather than the perpendicular distance. The residual sum of squares divided by the sample size n is the mean squared error, which serves as the loss function (cost function) of the linear regression model. Minimizing the sum of the squared vertical distances from all points to the line is the same as minimizing the mean squared error; this method is called the least squares method.
Loss function formula:
Figure PCTCN2020131753-appb-000001
Because $h(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b$, solving finally yields the following formulas for w and b:
Figure PCTCN2020131753-appb-000002
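As a minimal sketch of the least squares idea for the single-variable case, the following uses the standard textbook closed form (slope as covariance over variance, intercept from the means) and the mean squared error as defined in the text. The application's own formulas for w and b are contained in the figure and may be written differently; the function names and the sample data here are assumptions for illustration.

```python
def fit_simple(xs, ys):
    """Least squares fit of y = w * x + b for one variable (textbook closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def mean_squared_error(xs, ys, w, b):
    """Residual sum of squares divided by the sample size n."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Illustrative data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_simple(xs, ys)
loss = mean_squared_error(xs, ys, w, b)
```

For data that lies exactly on a line, the fitted w and b recover that line and the loss is zero; for noisy data the same functions return the line with the smallest mean squared error.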
In this embodiment, when predicting the access frequency of an indicator A in some sample data, suppose the input data set D has n samples and d features:
$$D = \{(x^{(1)}, y_1), (x^{(2)}, y_2), \dots, (x^{(n)}, y_n)\}$$
where the i-th sample is:
$$(x^{(i)}, y_i) = (x_1^{(i)}, x_2^{(i)}, \dots, x_d^{(i)}, y_i)$$
A linear model makes predictions by forming a linear combination of the features. The hypothesis function (1) is:
$$H_\theta(x_1, x_2, \dots, x_d) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d \quad (1)$$
where $\theta_0, \theta_1, \dots, \theta_d$ are the model parameters. Let $x_0 = 1$, let $X^{(i)} = (x_1^{(i)}, x_2^{(i)}, \dots, x_d^{(i)})$ be a row vector, let X be the n×d matrix whose rows are the samples, and let θ be a d×1 vector (X has d+1 columns and θ has d+1 entries once the constant column $x_0 = 1$ for the intercept $\theta_0$ is included). The hypothesis function (1) can then be written as $H_\theta(X) = X\theta$:
Figure PCTCN2020131753-appb-000003
The loss function is the mean squared error, that is:
Figure PCTCN2020131753-appb-000004
The parameters are solved by the least squares method. Differentiating the loss function J(θ) with respect to θ gives:
Figure PCTCN2020131753-appb-000005
Setting
Figure PCTCN2020131753-appb-000006
gives
$$\theta = (X^T X)^{-1} X^T Y$$
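For reference, the standard derivation that leads to this result can be written out as follows. The scaling of the loss (here 1/n) is an assumption, since the application's own expressions are contained in the figures; the constant factor does not affect the minimizer. The derivation also assumes that $X^T X$ is invertible.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{n}\,(X\theta - Y)^{T}(X\theta - Y) \\
\nabla_{\theta} J(\theta) &= \frac{2}{n}\,X^{T}(X\theta - Y) \\
\nabla_{\theta} J(\theta) = 0
  \;&\Longrightarrow\; X^{T}X\,\theta = X^{T}Y
  \;\Longrightarrow\; \theta = (X^{T}X)^{-1}X^{T}Y
\end{aligned}
```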
In this embodiment, the linear regression algorithm is used to identify the important indicators in the sample data, and for each indicator a mapping relationship equation is established between the indicator and the indicator factors that affect its access frequency. Based on the weights W of all indicator factors that affect the access frequency of an indicator M, the main independent variables (that is, the main indicator factors) $a_1, a_2, a_3, \dots, a_n$ are determined, and the mapping relationship equation between the main indicator factors and the indicator is established: $y = \beta_0 + \beta_1 a_1 + \beta_2 a_2 + \dots + \beta_n a_n$, where y is the access frequency of indicator M (within a specific time period) and $a_1, a_2, a_3, \dots, a_n$ are the indicator factors that affect that access frequency. Taking indicator M as an example, the access frequency of M during a promotional campaign in 2017 to 2019 is collected, together with the values of the indicator factors $a_1, a_2, a_3, \dots, a_n$ that affect it. These data are entered into the SPSS tool, giving the equation $y = \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \dots + \beta_n a_n$. Because the correlation coefficients of the indicator factors and the adjusted coefficient of multiple determination are very close to 1, the goodness of fit of the model is good, which indicates that the linear relationship of the model is significant. The F test shows that $a_1, a_2, a_3, \dots, a_n$ are the main indicator factors. Finally, plotting in Python compares the predicted values of indicator M with the actual values; the comparison shows that the model is reasonable and can be used to predict the access frequency of indicator M (within a specific time period). The above operations are repeated to establish a univariate linear regression model between each main indicator factor and the access frequency of indicator M; the change in the access frequency of indicator M within a specific time period is inferred and fed into the prediction model as input data, which finally yields the predicted access frequency of indicator M (within that time period). Furthermore, with the linear regression prediction method it is possible to predict which data will be accessed at high frequency. Data accessed at high frequency needs to be pre-aggregated, while data that does not require high-frequency access can use other storage engines.
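The goodness-of-fit check mentioned above can be illustrated with the usual definitions of the coefficient of determination and its adjusted form. The text obtains these values from SPSS; the functions and the sample numbers below are assumptions used only to show how a comparison of predicted and actual access frequencies could be scored.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_factors):
    """Adjusted coefficient of determination for a model with n_factors predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_factors - 1)


# Hypothetical access counts of an indicator and the model's predictions.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 31.0, 39.0, 50.0]
r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_factors=2)
```

Values close to 1 correspond to the "good fit" case described in the text; the adjusted form penalizes models that use many indicator factors relative to the number of samples.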
104. Determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregated indicators and fixed-dimensional indicators;
In this embodiment, the type of the indicator is determined from its access frequency and from whether other dimension tables need to be associated when it is calculated. Some indicators can only be calculated after being associated with multiple dimension tables, whereas others can be calculated without associating any other dimension table. There are accordingly two indicator types. An indicator of the type requiring multi-dimensional aggregation is one whose calculation requires associating other dimension tables. An indicator of the fixed-dimensional type is one whose value can be calculated from the data in the wide table to which it belongs, without associating data from other dimension tables.
105. Based on the indicator type, determine the storage calculation engine and the dimensional modeling method corresponding to the indicator according to the preset correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods;
In this embodiment, according to the type of the indicator, the storage calculation engine corresponding to the indicator, and the information of the preset dimension tables that must be associated to calculate the indicator, are looked up in the preset correspondence table between indicator types and storage calculation engines. Indicators of different types are stored in different storage calculation engines. For example, some indicators are stored in random reports or semi-aggregated reports and need to be associated with other dimension tables when calculated; when such an indicator is queried, the table containing it must first be associated with the other dimension tables before its value can be calculated. Fixed-dimensional indicators do not need to be associated with other dimension tables, so the aggregated reports built from them can be stored in the aggregation engine and calculated in advance. When a user queries such an indicator, the corresponding value can be returned quickly without waiting for calculation, which improves data processing efficiency.
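The two correspondence tables can be pictured as plain lookup tables keyed by indicator type. The sketch below is an assumption about one possible shape: the engine labels and modeling-method labels are hypothetical placeholders (the text only states that fixed-dimensional indicators go to an aggregation engine and use a wide table), and `resolve` is an illustrative helper, not something named in the application.

```python
from enum import Enum
from typing import NamedTuple


class IndicatorType(Enum):
    MULTI_DIMENSIONAL_AGGREGATION = "multi_dimensional_aggregation"
    FIXED_DIMENSION = "fixed_dimension"


class Plan(NamedTuple):
    engine: str
    modeling: str


# Hypothetical correspondence table: indicator type -> storage calculation engine.
ENGINE_BY_TYPE = {
    IndicatorType.MULTI_DIMENSIONAL_AGGREGATION: "join_capable_engine",
    IndicatorType.FIXED_DIMENSION: "aggregation_engine",
}

# Hypothetical correspondence table: indicator type -> dimensional modeling method.
MODELING_BY_TYPE = {
    IndicatorType.MULTI_DIMENSIONAL_AGGREGATION: "multi_table_model",
    IndicatorType.FIXED_DIMENSION: "wide_table_model",
}


def resolve(indicator_type: IndicatorType) -> Plan:
    """Look up both correspondence tables for one indicator type."""
    return Plan(ENGINE_BY_TYPE[indicator_type], MODELING_BY_TYPE[indicator_type])


plan = resolve(IndicatorType.FIXED_DIMENSION)
```

Keeping the two tables separate mirrors the wording of step 105, which refers to one table for engines and another for modeling methods.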
In this embodiment, the type of the indicator determines whether multiple dimension tables must be associated to calculate the indicator value when it is queried; if so, the corresponding dimension tables are looked up. For example, to calculate the value of the fixed indicator "auto insurance premium under the 2018 Double 11 campaign", the data of the three tables of different dimensions, "premium in 2018", "auto insurance premium in 2018", and "premium under the 2018 Double 11 campaign", only needs to be stored in a single table, that is, a wide table, and no other data report needs to be associated during calculation. By contrast, to calculate the indicator "premium in 2018", the tables "auto insurance premium in 2018", "property insurance premium in 2018", "life insurance premium in 2018", ..., "XX insurance premium in 2018", that is, the premium tables of all insurance types, must be associated together to obtain the value of the indicator "premium in 2018".
106. Determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods;
In this embodiment, indicators of different types correspond to different modeling models, which generate different types of reports; the generated reports are in turn stored in different data storage calculation engines according to report type.
107. Use the routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
In this embodiment, the routing decision engine directs each request to the corresponding storage calculation engine according to the correspondence between the report to which the indicator belongs and the engine that stores it. That is, depending on the indicator being queried, the routing decision engine selects the storage calculation engine that matches the current calculation request and dispatches the request to it to calculate the value of the indicator. For example, if the indicator being viewed is a basic (fixed) indicator, the query (calculation) request is forwarded to a basic database such as hive (a non-aggregating database that supports multi-table association calculation); if a pre-calculated indicator is to be viewed, the request is forwarded to a database such as druid.io (an aggregation data engine). In this embodiment, the calculation requirement of an indicator can be understood simply as whether dimension tables need to be associated and calculated.
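The dispatch behaviour of the routing decision engine can be sketched as a lookup followed by a call. Everything below is a stand-in: the engines are plain Python callables rather than real hive or druid clients, and the names `make_router`, `engine_of_indicator`, and the two indicator keys are hypothetical.

```python
from typing import Callable, Dict

Engine = Callable[[str], str]  # stub signature: indicator name -> result text


def make_router(engine_of_indicator: Dict[str, str],
                engines: Dict[str, Engine]) -> Engine:
    """Build a router that forwards each request to the engine that
    stores the report of the requested indicator."""
    def route(indicator: str) -> str:
        engine_name = engine_of_indicator[indicator]  # KeyError if unregistered
        return engines[engine_name](indicator)
    return route


# Stub engines standing in for a join-capable store and an aggregation store.
engines = {
    "hive": lambda indicator: f"hive: associate tables and compute {indicator}",
    "druid": lambda indicator: f"druid: read pre-aggregated {indicator}",
}
engine_of_indicator = {
    "indicator_needing_joins": "hive",
    "precomputed_indicator": "druid",
}
route = make_router(engine_of_indicator, engines)
result = route("precomputed_indicator")
```

Separating the indicator-to-engine table from the engine implementations means a new storage calculation engine can be registered without touching the routing logic.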
In the technical solution provided by this application, the data to be predicted is acquired and parsed to construct indicators with multiple dimensional attributes, and the access frequency of each indicator is predicted by the linear regression algorithm to determine its calculation requirement. According to that requirement, a suitable way is chosen to store the indicator in the corresponding storage calculation engine and calculate its value. This resolves the conflict between calculation time and timeliness for fixed-dimensional big data indicators, and also solves the technical problem of being limited to a single data engine and a single dimensional modeling method.
Referring to Fig. 2, a second embodiment of the big data indicator construction method in the embodiments of this application includes:
201. Acquire the data to be predicted;
202. Parse the data to be predicted and define multiple indicators;
In this embodiment, the acquired data, which contains many indicator labels, is parsed to obtain multiple definable labels, for example "premium", "life insurance premium", "property insurance premium under the Double 12 campaign", and "auto insurance premium under the Double 11 campaign". An indicator is a unit or method used to measure the degree of development of something; in IT it is also commonly called a measure. Examples include population, GDP, income, number of users, profit rate, retention rate, and coverage rate. Many companies have their own KPI system, which uses several key indicators to gauge how well the business is operating. An indicator is obtained through summary calculations such as summation and averaging, and the summary must be performed under certain preconditions, such as time, location, and cost, which is what is commonly called the statistical caliber and scope.
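The idea that an indicator is a summary calculation performed under preconditions can be sketched as a filter followed by an aggregation. The record layout, field names, and the `indicator` helper are assumptions made for illustration only.

```python
def indicator(records, measure, agg, **conditions):
    """Aggregate one measure over the records that satisfy every condition
    (the conditions play the role of the statistical caliber and scope)."""
    values = [r[measure] for r in records
              if all(r.get(key) == value for key, value in conditions.items())]
    if agg == "sum":
        return sum(values)
    if agg == "avg":
        if not values:
            raise ValueError("no records match the given conditions")
        return sum(values) / len(values)
    raise ValueError(f"unsupported aggregation: {agg}")


# Hypothetical policy records.
records = [
    {"year": 2019, "campaign": "Double 11", "line": "auto", "premium": 1200.0},
    {"year": 2019, "campaign": "Double 11", "line": "auto", "premium": 800.0},
    {"year": 2019, "campaign": "Double 11", "line": "life", "premium": 500.0},
    {"year": 2019, "campaign": "Double 12", "line": "auto", "premium": 700.0},
]
total = indicator(records, "premium", "sum",
                  year=2019, campaign="Double 11", line="auto")
average = indicator(records, "premium", "avg",
                    year=2019, campaign="Double 11", line="auto")
```

Changing the keyword conditions changes the scope of the same measure, which is how "premium" becomes "auto insurance premium under the Double 11 campaign".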
203. Use a preset model to grade the indicators and add dimensional attributes;
In this embodiment, a preset model is used to grade the extracted indicators and to add dimensional attribute information to each one. Taking "premium" as an example, progressively adding dimensional attribute information turns it into "enterprise plan premium" and then "enterprise plan premium of secondary institutions". Basic dimensional attribute information is added further, together with common attribute dimension information, for example "whether the enterprise plan premium belongs to a secondary institution participating in the insurance campaign".
In this embodiment, a dimensional attribute of an indicator is some characteristic of a thing or phenomenon; gender, region, and time are all dimensions. Time is a common and special dimension: comparing values before and after shows whether things have improved or worsened. For example, "the auto insurance premium under the 2019 Double 11 campaign is 10% higher than under the 2018 Double 11 campaign" and "the life insurance premium under the 2019 Double 12 campaign is 20% higher than under the 2019 Double 11 campaign" are comparisons over time, also called vertical comparisons. The other kind is the horizontal comparison, for example comparing "auto insurance premium under the 2018 Double 11 campaign" with "life insurance premium under the 2018 Double 11 campaign"; this compares different insurance types, that is, units at the same level. Dimensions can also be divided by data type into qualitative and quantitative dimensions. A dimension whose data is of character (text) type is qualitative, such as region or gender; a dimension whose data is numeric is quantitative, such as income, age, or consumption.
204. Combine the indicators and the dimensional attributes to obtain multiple indicators with different dimensional attributes;
In this embodiment, the indicators are combined with the dimensional attribute information to obtain multiple indicators carrying different dimensional attribute information, for example "auto insurance premium under the 2019 Double 11 campaign", "auto insurance premium under the 2019 Double 12 campaign", "property insurance premium under the 2019 Double 11 campaign", and "property insurance premium under the 2019 Double 12 campaign".
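One simple way to picture this combination step is a Cartesian product of the dimensional attribute values attached to a base indicator. The helper name `build_indicators` and the dictionary representation are assumptions; the application does not prescribe a data structure.

```python
from itertools import product


def build_indicators(base, dimensions):
    """Combine one base indicator with every combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo), measure=base)
            for combo in product(*(dimensions[name] for name in names))]


dimensions = {
    "year": [2019],
    "campaign": ["Double 11", "Double 12"],
    "line": ["auto", "property"],
}
indicators = build_indicators("premium", dimensions)
```

With one year, two campaigns, and two insurance types this yields the four indicators listed in the text.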
205. Based on the linear regression algorithm, determine the main indicator factors that affect the access frequency of the indicator;
In this embodiment, the linear regression algorithm is used to determine the indicators with different dimensional attributes in the data to be predicted, and at the same time to determine the indicator factors that affect the access frequency of those indicators.
206. Establish the mapping relationship equation between the indicator and the main indicator factors, and use the elasticity coefficient method to predict the parameter values of the main indicator factors;
In this embodiment, a mapping relationship equation is established between each indicator obtained from the data to be predicted and its corresponding indicator factors. The elasticity coefficient method is used to predict the parameter value of each indicator factor for the data to be predicted under a specific campaign, for example the number of people purchasing auto insurance in the month of the 2019 Double 11 campaign. The elasticity coefficient ET is calculated from the data of the most recent and the earliest years in the collected historical data, from which the access frequency of the corresponding indicator under the specific campaign can be calculated. The access frequency in this embodiment can also be regarded as a probability value.
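The application names the elasticity coefficient method but does not give a formula for ET. The sketch below assumes one common textbook form, the ratio of the relative change of the factor to the relative change of a driving quantity between the earliest and the most recent year, and then applies it to an expected change of the driver. The function names, the driver, and all numbers are hypothetical.

```python
def elasticity_coefficient(y_first, y_last, x_first, x_last):
    """Assumed form: relative change of y divided by relative change of x,
    measured between the earliest and the most recent year."""
    growth_y = (y_last - y_first) / y_first
    growth_x = (x_last - x_first) / x_first
    return growth_y / growth_x


def forecast_factor(y_last, et, expected_growth_x):
    """Project the factor forward given the expected relative growth of the driver."""
    return y_last * (1.0 + et * expected_growth_x)


# Hypothetical history: auto insurance buyers (y) and campaign reach (x).
et = elasticity_coefficient(y_first=1000.0, y_last=1500.0,
                            x_first=100.0, x_last=125.0)
predicted_buyers = forecast_factor(y_last=1500.0, et=et, expected_growth_x=0.1)
```

The predicted factor value would then be substituted into the mapping relationship equation to obtain the access frequency.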
207. Substitute the parameter values of the indicator factors into the mapping relationship equation to calculate the access frequency of the indicator;
In this embodiment, once the mapping relationship equation between an indicator obtained from the data to be predicted and its corresponding indicator factors has been established, the parameter values of the indicator factors are substituted into the equation to calculate (predict) the access frequency (probability value) of the indicator.
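Substituting factor values into a linear mapping relationship equation amounts to evaluating an intercept plus a weighted sum. The coefficients and factor values below are made-up numbers chosen only to show the arithmetic; `predict_frequency` is an illustrative name.

```python
def predict_frequency(intercept, coefficients, factor_values):
    """Evaluate y = b0 + b1*a1 + ... + bn*an for one set of factor values."""
    if len(coefficients) != len(factor_values):
        raise ValueError("one coefficient is required per indicator factor")
    return intercept + sum(b * a for b, a in zip(coefficients, factor_values))


# Hypothetical fitted equation with two indicator factors.
frequency = predict_frequency(intercept=0.1,
                              coefficients=[0.002, 0.05],
                              factor_values=[100.0, 4.0])
```

Here the result is 0.1 + 0.2 + 0.2, that is, approximately 0.5, which can be read as the probability-like access frequency mentioned in the text.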
208. If the access frequency of the indicator is greater than a preset threshold and other dimension tables need to be associated when calculating the access frequency of the indicator, the indicator is of the type requiring multi-dimensional aggregation;
In this embodiment, if the access probability of the indicator is greater than the preset threshold and other dimension tables must be associated when the indicator is queried (calculated), the indicator is determined to be of the type requiring multi-dimensional aggregation. For example, calculating the indicator "premium in 2018" requires associating the tables "auto insurance premium in 2018", "property insurance premium in 2018", "life insurance premium in 2018", ..., "XX insurance premium in 2018", that is, the premium tables of all insurance types, so the indicator "premium in 2018" is an indicator requiring multi-dimensional aggregation.
209. If the access frequency of the indicator is greater than the preset threshold and no other dimension table needs to be associated when calculating the access frequency of the indicator, the indicator type is the fixed-dimensional indicator type;
In this embodiment, if the access probability of the indicator is greater than the preset threshold and the indicator can be queried (calculated) using only the data in the table to which it belongs, without associating other dimension tables, the indicator is determined to be of the fixed-dimensional type, that is, a fixed indicator. For example, the dimensions of "auto insurance premium under the 2018 Double 11 campaign" are fixed as the three dimensions "2018 + Double 11 campaign + auto insurance". To calculate it, the three tables of different dimensions, "premium in 2018", "auto insurance premium in 2018", and "premium under the 2018 Double 11 campaign", are modeled as a wide table and stored in one table. The calculation then queries only the data in this wide table and needs no data from other tables, so "auto insurance premium under the 2018 Double 11 campaign" is a fixed-dimensional indicator. In this embodiment, a wide table is a table in which all fields are built, so that no other table needs to be associated when compiling statistics (calculating indicator values).
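Steps 208 and 209 together amount to a two-condition decision rule, sketched below. The function name, the type labels, and the threshold value are assumptions; the steps do not assign a type to indicators at or below the threshold, so the sketch returns None in that case rather than inventing one.

```python
from typing import Optional


def classify_indicator(access_frequency: float, needs_association: bool,
                       threshold: float) -> Optional[str]:
    """Apply the two rules of steps 208 and 209."""
    if access_frequency <= threshold:
        return None  # not covered by steps 208/209
    if needs_association:
        return "multi_dimensional_aggregation"
    return "fixed_dimension"


# Hypothetical threshold and frequencies.
THRESHOLD = 0.6
type_a = classify_indicator(0.8, needs_association=True, threshold=THRESHOLD)
type_b = classify_indicator(0.8, needs_association=False, threshold=THRESHOLD)
```

The returned label can then be used as the key into the correspondence tables of the next step.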
210. Based on the indicator type, determine the storage calculation engine and the dimensional modeling method corresponding to the indicator according to the preset correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods;
211. Determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods;
212. Use the routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
Referring to Fig. 3, a third embodiment of the big data indicator construction method in the embodiments of this application includes:
301. Acquire the data to be predicted;
302. Parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
303. Acquire historical data containing the indicators, where the historical data includes the indicators within a specific period, the number of accesses of each indicator within the specific period, and the indicator factors that affect the number of accesses of the indicator within the specific period;
In this embodiment, historical data containing the indicator to be predicted is acquired. For example, to get a rough understanding of the pattern of the indicator "auto insurance premium under the 2019 Double 11 campaign", the data containing the indicator "auto insurance premium under the 2018 Double 11 campaign" is acquired and analyzed in order to predict the 2019 indicator. The historical data in this embodiment therefore includes the indicators within a specific period, the number of accesses (access frequency) of each indicator within that period, and the indicator factors that may affect that number. Because the indicator factors are related to the number of accesses of the indicator within the specific period, a mapping relationship between the indicator factors and the access frequency is established, and the access frequency of the indicator is calculated (that is, predicted) from the historical data.
304. Use the historical data as sample data, perform partial correlation analysis on the sample data, extract the indicators, and establish the mapping relationship equation between each indicator and its corresponding indicator factors;
In this embodiment, the historical data is used as the sample data, for example the data for "auto insurance premium under the 2018 Double 11 campaign".
305、分别对映射关系方程式进行T检验,确定影响指标访问频率的主要指标因子;305. Perform a T test on the mapping relationship equations to determine the main index factors that affect the frequency of index visits;
In this embodiment, the t-test is one of the significance tests used in multiple linear regression; under ordinary least squares the F-test can be equivalent to the t-test (when a single regression coefficient is tested, the F statistic is the square of the t statistic). Partial correlation analysis is used to further analyze the mapping relationship equation of each indicator and its indicator factors and to determine the main independent variables in that mapping, that is, the main indicator factors. Many indicator factors may affect the number of times an indicator is accessed within a specific period; the main indicator factors are those with the dominant influence. All main indicator factors are then retained in the mapping relationship equation between the indicator and its indicator factors. An indicator factor is a main indicator factor if its partial correlation coefficient falls within a preset value interval and its regression coefficient in the mapping relationship equation is greater than the F-test parameter or the t-test parameter.
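The t-test screening in step 305 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`invert`, `ols_t_statistics`, `main_factors`), the sample data, and the critical value 2.571 (the two-sided 5% t value for 5 degrees of freedom) are all chosen here for illustration, and the sketch covers only the t-test, not the partial correlation interval check that the embodiment also applies.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: indicator factors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_t_statistics(factors, visits):
    """Fit visits = b0 + b1*x1 + ... by least squares; return (coefficients, t statistics)."""
    X = [[1.0] + [float(v) for v in row] for row in factors]  # prepend intercept column
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)] for i in range(p)]
    xty = [sum(X[k][i] * visits[k] for k in range(n)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [visits[k] - sum(X[k][j] * beta[j] for j in range(p)) for k in range(n)]
    sigma2 = sum(r * r for r in residuals) / (n - p)  # unbiased residual variance
    t_stats = [beta[i] / math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, t_stats

def main_factors(names, factors, visits, t_critical):
    """Keep the factors whose |t| exceeds the critical value (intercept excluded)."""
    _, t_stats = ols_t_statistics(factors, visits)
    return [name for name, t in zip(names, t_stats[1:]) if abs(t) > t_critical]

# Hypothetical sample: the access count is driven by the first factor only.
FACTOR_NAMES = ["promotion_budget", "unrelated_factor"]
FACTORS = [[1, 3], [2, 1], [3, 4], [4, 1], [5, 5], [6, 9], [7, 2], [8, 6]]
NOISE = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4]
VISITS = [10 + 5 * row[0] + e for row, e in zip(FACTORS, NOISE)]

if __name__ == "__main__":
    print(main_factors(FACTOR_NAMES, FACTORS, VISITS, t_critical=2.571))
```

In practice a statistics library would normally supply the coefficient table and critical values; the hand-written solver is used here only to keep the sketch free of dependencies.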
306. Calculate the access frequency of the indicator according to a linear regression algorithm, and determine whether the indicator is associated with a preset dimension table;
307. Determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
308. Based on the indicator type, query the model construction method corresponding to the indicator from a preset correspondence table between indicator types and model construction methods;
In this embodiment, the model construction method corresponding to the indicator type is queried from the preset correspondence table between indicator types and model construction methods. Indicators that must be joined with other dimension tables for calculation use dimensional modeling. Fixed-dimension requirements use wide-table modeling: all fields are built into a single table, so no other tables need to be joined when computing statistics.
If the indicator to be calculated is of the type requiring multi-dimensional aggregation, that is, it can be calculated only by joining multiple dimension tables, dimensional modeling is used to construct random reports and/or semi-aggregated reports, and the random reports and/or semi-aggregated reports are stored in the non-aggregation engine and/or the semi-aggregation engine.
If the indicator to be calculated is of the fixed-dimension type, that is, it can be calculated without joining multiple dimension tables, wide-table modeling is used to construct an aggregated report, and the aggregated report is stored in the aggregation engine.
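The two rules above amount to a lookup from indicator type to a modeling method, report kinds, and engines. A minimal sketch of such a correspondence table follows; the names `IndicatorType`, `BuildPlan`, `CORRESPONDENCE_TABLE`, and `lookup_build_plan` are hypothetical and do not appear in the source, which describes the table only in prose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class IndicatorType(Enum):
    MULTI_DIMENSIONAL = "multi-dimensional aggregation"
    FIXED_DIMENSION = "fixed dimension"

@dataclass(frozen=True)
class BuildPlan:
    modeling_method: str
    report_kinds: Tuple[str, ...]
    engines: Tuple[str, ...]

# Preset correspondence table, filled in from the two rules in the text.
CORRESPONDENCE_TABLE = {
    IndicatorType.MULTI_DIMENSIONAL: BuildPlan(
        modeling_method="dimensional modeling",
        report_kinds=("random report", "semi-aggregated report"),
        engines=("non-aggregation engine", "semi-aggregation engine"),
    ),
    IndicatorType.FIXED_DIMENSION: BuildPlan(
        modeling_method="wide-table modeling",
        report_kinds=("aggregated report",),
        engines=("aggregation engine",),
    ),
}

def lookup_build_plan(indicator_type: IndicatorType) -> BuildPlan:
    """Return the modeling method, report kinds, and engines for an indicator type."""
    return CORRESPONDENCE_TABLE[indicator_type]

if __name__ == "__main__":
    print(lookup_build_plan(IndicatorType.FIXED_DIMENSION))
```

Keeping the mapping as data rather than as branching code means a new indicator type could be supported by adding one table entry.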
In this embodiment, wide-table modeling stores the indicators and the dimensions in a single large table: all fields are built into that table, so no other tables need to be joined when computing statistics. Dimensional modeling instead divides the data into fact tables and dimension tables. A fact table records specific events, and a dimension holds descriptions of those events; separating facts from dimension tables improves flexibility and addresses the corresponding problems.
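The difference between the two modeling styles can be illustrated with a small in-memory database. This is only a sketch: the table and column names (`fact_premium`, `dim_campaign`, `wide_premium`) and the sample rows are invented for illustration and are not taken from the source.

```python
import sqlite3

def build_demo_database():
    """Create an in-memory database holding the same data in both layouts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimensional modeling: events in a fact table, descriptions in a dimension table.
        CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY, campaign_name TEXT);
        CREATE TABLE fact_premium (policy_id INTEGER, campaign_id INTEGER, premium REAL);
        INSERT INTO dim_campaign VALUES (1, 'Double 11');
        INSERT INTO dim_campaign VALUES (2, 'New Year');
        INSERT INTO fact_premium VALUES (101, 1, 1200.0);
        INSERT INTO fact_premium VALUES (102, 1, 800.0);
        INSERT INTO fact_premium VALUES (103, 2, 500.0);

        -- Wide-table modeling: every field lives in one table.
        CREATE TABLE wide_premium (policy_id INTEGER, campaign_name TEXT, premium REAL);
        INSERT INTO wide_premium VALUES (101, 'Double 11', 1200.0);
        INSERT INTO wide_premium VALUES (102, 'Double 11', 800.0);
        INSERT INTO wide_premium VALUES (103, 'New Year', 500.0);
    """)
    return conn

def premium_by_campaign_dimensional(conn):
    """The indicator needs a join between the fact table and the dimension table."""
    rows = conn.execute("""
        SELECT d.campaign_name, SUM(f.premium)
        FROM fact_premium AS f
        JOIN dim_campaign AS d ON d.campaign_id = f.campaign_id
        GROUP BY d.campaign_name
    """).fetchall()
    return dict(rows)

def premium_by_campaign_wide(conn):
    """The same indicator read from the wide table, with no join."""
    rows = conn.execute("""
        SELECT campaign_name, SUM(premium)
        FROM wide_premium
        GROUP BY campaign_name
    """).fetchall()
    return dict(rows)

if __name__ == "__main__":
    connection = build_demo_database()
    print(premium_by_campaign_dimensional(connection))
    print(premium_by_campaign_wide(connection))
```

Both queries yield the same totals; the wide table trades storage redundancy for a join-free query, which is why the text reserves it for indicators whose dimensions are fixed.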
309. Based on the indicator type, determine the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods;
310. Determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
If the indicator requires multi-dimensional aggregation, the indicator is downgraded and stored in a random report and/or a semi-aggregated report. In this embodiment, an indicator of the multi-dimensional aggregation type does not need to be calculated in advance; downgrading it means storing the indicator and its data on an ordinary computing engine, which saves computing resources. Ordinary computing engines include the non-aggregation engine and the semi-aggregation engine.
If the indicator is a fixed-dimension indicator, wide-table modeling is used to store all fields of its dimensions in an aggregated report. In this embodiment, a fixed-dimension indicator is one whose calculation does not require aggregation over other dimension tables; indicators of this type can all be stored in aggregated reports and calculated in advance, which saves indicator query (calculation) time and improves data processing efficiency. The storage computing engine corresponding to the fixed-dimension indicator type is then queried, and the aggregated report, built through wide-table modeling, is stored in the aggregation engine so that the indicator is calculated ahead of time.
311. Use the routing decision engine to call the storage computing engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
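Step 311 can be pictured as a dispatcher that selects an engine by indicator type and delegates the calculation to it. The sketch below is an assumption about how such routing might look; the class names, the `compute` method, and the `premium` field are hypothetical, and the source does not describe the engines' interfaces.

```python
class AggregationEngine:
    """Serves indicator values that were calculated ahead of time."""

    def __init__(self, precomputed):
        self.precomputed = precomputed  # indicator name -> stored value

    def compute(self, indicator_name, rows):
        return self.precomputed[indicator_name]

class NonAggregationEngine:
    """Calculates the indicator on demand from detail rows."""

    def compute(self, indicator_name, rows):
        return sum(row["premium"] for row in rows)

class RoutingDecisionEngine:
    """Picks the storage computing engine for an indicator type and calls it."""

    def __init__(self, engines_by_type):
        self.engines_by_type = engines_by_type

    def compute(self, indicator_name, indicator_type, rows):
        engine = self.engines_by_type[indicator_type]
        return engine.compute(indicator_name, rows)

if __name__ == "__main__":
    router = RoutingDecisionEngine({
        "fixed dimension": AggregationEngine({"total_premium": 2500.0}),
        "multi-dimensional aggregation": NonAggregationEngine(),
    })
    detail_rows = [{"premium": 1200.0}, {"premium": 800.0}]
    print(router.compute("total_premium", "fixed dimension", detail_rows))
    print(router.compute("campaign_premium", "multi-dimensional aggregation", detail_rows))
```

The fixed-dimension path returns a stored value without touching the detail rows, which mirrors the "calculated in advance" behaviour described for the aggregation engine.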
The big data indicator construction method in the embodiments of the present application has been described above; the big data indicator construction apparatus in the embodiments of the present application is described below. Referring to FIG. 4, one embodiment of the big data indicator construction apparatus includes:
a first acquisition module 401, configured to obtain the data to be predicted;
a first construction module 402, configured to parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
a judgment module 403, configured to calculate the access frequency of the indicator according to a linear regression algorithm and to determine whether the indicator is associated with a preset dimension table;
a first determining module 404, configured to determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
a second determining module 405, configured to determine, based on the indicator type, the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods;
a third determining module 406, configured to determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
a calculation module 407, configured to use the routing decision engine to call the storage computing engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
Referring to FIG. 5, a second embodiment of the big data indicator construction apparatus in the embodiments of the present application includes:
a first acquisition module 501, configured to obtain the data to be predicted;
a first construction module 502, configured to parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
a judgment module 503, configured to calculate the access frequency of the indicator according to a linear regression algorithm and to determine whether the indicator is associated with a preset dimension table;
a first determining module 504, configured to determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
a second determining module 505, configured to determine, based on the indicator type, the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods;
a third determining module 506, configured to determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
a calculation module 507, configured to use the routing decision engine to call the storage computing engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator;
a second acquisition module 508, configured to obtain historical data containing the indicator, where the historical data includes the indicator within a specific period, the number of times the indicator was accessed within that period, and the indicator factors that affect that access count;
an analysis module 509, configured to use the historical data as sample data, perform partial correlation analysis on the sample data, extract the indicators, and establish a mapping relationship equation between each indicator and its corresponding indicator factors;
a test module 510, configured to perform a t-test on each mapping relationship equation to determine the main indicator factors that affect the access frequency of the indicator;
a second query module 511, configured to query, based on the indicator type, the model construction method corresponding to the indicator from a preset correspondence table between indicator types and model construction methods;
a second construction module 512, configured to, when the indicator is of the type requiring multi-dimensional aggregation, use dimensional modeling to construct random reports and/or semi-aggregated reports and store them in the non-aggregation engine and/or the semi-aggregation engine;
a first storage module 513, configured to, when the indicator is of the fixed-dimension type, use wide-table modeling to construct an aggregated report and store the aggregated report in the aggregation engine;
an indicator downgrade module 514, configured to, when the indicator requires multi-dimensional aggregation, downgrade the indicator and store it in a random report and/or a semi-aggregated report;
a fourth determining module 515, configured to query the storage computing engine corresponding to the indicator requiring multi-dimensional aggregation and the preset dimension tables to be associated, and to determine the dimension tables with which an indicator of the multi-dimensional aggregation type must be associated during calculation;
a second storage module 516, configured to, when the indicator is a fixed-dimension indicator, use wide-table modeling to store all fields of its dimensions in an aggregated report;
a third storage module 517, configured to query the storage computing engine corresponding to the fixed-dimension indicator type and store the aggregated report in the aggregation engine.
The first construction module 502 is specifically configured to: parse the data to be predicted and define multiple indicators; grade the indicators with a preset model and add dimension attributes; and combine the indicators with the dimension attributes to obtain multiple indicators with different dimension attributes.
The judgment module 503 is specifically configured to: determine, based on a linear regression algorithm, the main indicator factors that affect the access frequency of the indicator; establish a mapping relationship equation between the indicator and the main indicator factors, and predict the parameter values of the main indicator factors with the elasticity coefficient method; and substitute the parameter values of the indicator factors into the mapping relationship equation to calculate the access frequency of the indicator.
The first determining module 504 is specifically configured to: if the access frequency of the indicator is greater than a preset threshold and other dimension tables must be joined when calculating the access frequency of the indicator, classify the indicator as the type requiring multi-dimensional aggregation; and if the access frequency of the indicator is greater than the preset threshold and no other dimension tables need to be joined when calculating the access frequency of the indicator, classify the indicator as the fixed-dimension type.
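The work of modules 503 and 504 can be sketched end to end: forecast a factor value, substitute it into the fitted equation, then classify the indicator. The source does not spell out the elasticity coefficient method, so the formula used here (next value = current value × (1 + elasticity × growth rate of a reference variable)) is one common formulation and is an assumption, as are all function names and numbers. The source also does not say how an indicator at or below the threshold is typed, so the sketch returns `None` in that case.

```python
def forecast_factor(current_value, elasticity, reference_growth_rate):
    """Elasticity-based forecast (assumed form): the factor grows by
    elasticity * growth rate of the reference variable."""
    return current_value * (1.0 + elasticity * reference_growth_rate)

def predict_access_frequency(intercept, coefficients, factor_values):
    """Substitute factor values into the linear mapping relationship equation."""
    return intercept + sum(c * x for c, x in zip(coefficients, factor_values))

def classify_indicator(access_frequency, threshold, needs_other_dimension_tables):
    """Apply the two rules of the first determining module."""
    if access_frequency <= threshold:
        return None  # not covered by the rules stated in the source
    if needs_other_dimension_tables:
        return "multi-dimensional aggregation"
    return "fixed dimension"

if __name__ == "__main__":
    # Hypothetical numbers: one main factor, elasticity 0.5, reference growth 20%.
    next_budget = forecast_factor(100.0, elasticity=0.5, reference_growth_rate=0.2)
    frequency = predict_access_frequency(10.0, [2.0], [next_budget])
    print(frequency, classify_indicator(frequency, 200.0, needs_other_dimension_tables=True))
```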
FIG. 4 and FIG. 5 above describe the big data indicator construction apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the big data indicator construction device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 6 is a schematic structural diagram of a big data indicator construction device provided by an embodiment of the present application. The big data indicator construction device 600 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 610 (for example, one or more processors), a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may provide transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the big data indicator construction device 600. Furthermore, the processor 610 may be configured to communicate with the storage medium 630 and to execute, on the big data indicator construction device 600, the series of instruction operations in the storage medium 630, so as to implement the steps of the big data indicator construction method in the foregoing embodiments.
The big data indicator construction device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the device structure shown in FIG. 6 does not limit the big data indicator construction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a big data indicator construction device, including a memory and at least one processor, where instructions are stored in the memory and the memory and the at least one processor are interconnected by a line; the at least one processor calls the instructions in the memory so that the big data indicator construction device executes the steps of the big data indicator construction method.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the following steps:
obtain the data to be predicted;
parse the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
calculate the access frequency of the indicator according to a linear regression algorithm, and determine whether the indicator is associated with a preset dimension table;
determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
based on the indicator type, determine the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods;
determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
use the routing decision engine to call the storage computing engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A big data indicator construction method, the big data indicator construction method comprising:
    obtaining data to be predicted;
    parsing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions;
    calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table;
    determining the indicator type of the indicator based on the access frequency, wherein the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
    based on the indicator type, determining the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods;
    determining the preset dimension table associated with the indicator according to the dimensional modeling method, wherein the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
    using a routing decision engine to call the storage computing engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.
  2. The big data indicator construction method according to claim 1, wherein the parsing the data to be predicted to construct multiple indicators carrying attribute information of different dimensions comprises:
    parsing the data to be predicted and defining multiple indicators;
    grading the indicators with a preset model and adding dimension attributes;
    combining the indicators with the dimension attributes to obtain multiple indicators with different dimension attributes.
  3. The big data indicator construction method according to claim 1, wherein, before the calculating the access frequency of the indicator according to a linear regression algorithm and determining whether the indicator is associated with a preset dimension table, the big data indicator construction method further comprises:
    obtaining historical data containing the indicator, wherein the historical data includes the indicator within a specific period, the number of times the indicator was accessed within the specific period, and the indicator factors that affect the number of times the indicator was accessed within the specific period;
    using the historical data as sample data, performing partial correlation analysis on the sample data, extracting the indicators, and establishing a mapping relationship equation between each indicator and its corresponding indicator factors;
    performing a t-test on each mapping relationship equation to determine the main indicator factors that affect the access frequency of the indicator.
  4. The big data indicator construction method according to claim 1, wherein the calculating the access frequency of the indicator according to a linear regression algorithm and determining whether the indicator is associated with a preset dimension table comprises:
    determining, based on a linear regression algorithm, the main indicator factors that affect the access frequency of the indicator;
    establishing a mapping relationship equation between the indicator and the main indicator factors, and predicting the parameter values of the main indicator factors with the elasticity coefficient method;
    substituting the parameter values of the indicator factors into the mapping relationship equation to calculate the access frequency of the indicator.
  5. The big data indicator construction method according to claim 1, wherein the determining the indicator type of the indicator based on the access frequency comprises:
    if the access frequency of the indicator is greater than a preset threshold and other dimension tables must be joined to calculate the access frequency of the indicator, the indicator is of the type requiring multi-dimensional aggregation;
    if the access frequency of the indicator is greater than the preset threshold and no other dimension tables need to be joined to calculate the access frequency of the indicator, the indicator is of the fixed-dimension type.
  6. The big data indicator construction method according to claim 1, wherein, after the determining the indicator type of the indicator based on the access frequency, the big data indicator construction method further comprises:
    based on the indicator type, querying the model construction method corresponding to the indicator from a preset correspondence table between indicator types and model construction methods;
    if the indicator is of the type requiring multi-dimensional aggregation, using dimensional modeling to construct random reports and/or semi-aggregated reports, and storing the random reports and/or semi-aggregated reports in a non-aggregation engine and/or a semi-aggregation engine;
    if the indicator is of the fixed-dimension type, using wide-table modeling to construct an aggregated report, and storing the aggregated report in an aggregation engine.
  7. The big data indicator construction method according to claim 1, wherein, after the determining, based on the indicator type, the storage computing engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage computing engines and a correspondence table between indicator types and dimensional modeling methods, the big data indicator construction method further comprises:
    if the indicator requires multi-dimensional aggregation, downgrading the indicator and storing it in a random report and/or a semi-aggregated report;
    querying the storage computing engine corresponding to the indicator requiring multi-dimensional aggregation and the preset dimension tables to be associated, and determining the dimension tables with which the indicator of the multi-dimensional aggregation type must be associated during calculation;
    if the indicator is a fixed-dimension indicator, using wide-table modeling to store all fields of its dimensions in an aggregated report;
    querying the storage computing engine corresponding to the fixed-dimension indicator type, and storing the aggregated report in the aggregation engine.
  8. A big data indicator construction device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring data to be predicted;
    parsing the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions;
    calculating the access frequency of each indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table;
    determining the indicator type of the indicator based on the access frequency, wherein the indicator types include a multi-dimensional aggregation indicator and a fixed-dimension indicator;
    based on the indicator type, determining the storage calculation engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods;
    determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
    using a routing decision engine to call the storage calculation engine to execute the preset dimension table, and calculating the indicator value corresponding to the indicator.
  9. The big data indicator construction device according to claim 8, wherein, when the processor executes the computer-readable instructions to parse the data to be predicted so as to construct a plurality of indicators carrying attribute information of different dimensions, the following steps are included:
    parsing the data to be predicted and defining a plurality of indicators;
    grading the indicators with a preset model and adding dimension attributes;
    combining the indicators with the dimension attributes to obtain a plurality of indicators with different dimension attributes.
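The combination step in the claim above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that "combining" means pairing every defined indicator with every dimension attribute (a Cartesian product); the claim does not fix the combination rule, and the function name, indicator names and dimension names below are all hypothetical.

```python
from itertools import product


def combine_indicators(indicators, dimension_attributes):
    """Pair each base indicator with each dimension attribute.

    Assumption: a plain Cartesian product. The claim only says the two
    are "combined", so a real system may restrict the valid pairs.
    """
    return [
        {"indicator": indicator, "dimension": dimension}
        for indicator, dimension in product(indicators, dimension_attributes)
    ]


# Hypothetical indicator and dimension names, for illustration only.
combined = combine_indicators(
    ["order_count", "premium_amount"],
    ["region", "channel"],
)
```

Each resulting entry is one indicator carrying one dimension attribute, which is the shape the later classification steps operate on.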
  10. The big data indicator construction device according to claim 8, wherein, before calculating the access frequency of the indicator according to the linear regression algorithm and determining whether the indicator is associated with a preset dimension table, the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring historical data containing the indicator, wherein the historical data includes the indicator within a specific period, the number of accesses to the indicator within the specific period, and the indicator factors affecting the number of accesses to the indicator within the specific period;
    using the historical data as sample data, performing partial correlation analysis on the sample data, extracting indicators, and establishing, for each indicator, a mapping relationship equation between the indicator and its corresponding indicator factors;
    performing a t-test on each mapping relationship equation to determine the main indicator factors affecting the access frequency of the indicator.
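The t-test step in the claim above can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: it assumes each mapping relationship equation is a single-factor least-squares line, skips the partial correlation analysis, and takes the critical t value from the caller. The function names, factor names and sample numbers are hypothetical.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by least squares; return (b, t) with t = b / SE(b)."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("factor values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(sse / (n - 2) / sxx)
    if std_error == 0:
        # Perfect fit: the statistic is unbounded unless the slope is zero.
        return slope, 0.0 if slope == 0 else math.copysign(math.inf, slope)
    return slope, slope / std_error


def select_main_factors(factor_samples, access_counts, t_critical):
    """Keep the factors whose slope is significant at the given critical value."""
    return [
        name
        for name, values in factor_samples.items()
        if abs(slope_t_statistic(values, access_counts)[1]) > t_critical
    ]


# Made-up history: five periods of access counts and two candidate factors.
access_counts = [2.1, 3.9, 6.2, 7.8, 10.1]
factor_samples = {
    "report_subscribers": [1, 2, 3, 4, 5],
    "unrelated_factor": [3, 1, 4, 1, 5],
}
# 3.182 is the two-sided 5% critical value for n - 2 = 3 degrees of freedom.
main_factors = select_main_factors(factor_samples, access_counts, t_critical=3.182)
```

With several factors in one equation, a multiple regression with per-coefficient t statistics would be the closer match to the claim; the single-factor form is used here only to keep the sketch dependency-free.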
  11. The big data indicator construction device according to claim 8, wherein, when the processor executes the computer-readable instructions to calculate the access frequency of the indicator according to the linear regression algorithm and determine whether the indicator is associated with a preset dimension table, the following steps are included:
    determining, based on the linear regression algorithm, the main indicator factors affecting the access frequency of the indicator;
    establishing a mapping relationship equation between the indicator and the main indicator factors, and predicting the parameter values of the main indicator factors using the elasticity coefficient method;
    substituting the parameter values of the indicator factors into the mapping relationship equation to calculate the access frequency of the indicator.
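The two calculations in the claim above can be sketched in Python under stated assumptions. The claim does not give the elasticity formula; the sketch uses the common forecasting form in which a factor's growth rate equals its elasticity multiplied by the growth rate of a reference quantity. The mapping relationship equation is assumed to be linear. All names and numbers are hypothetical.

```python
def forecast_by_elasticity(current_value, elasticity, reference_growth_rate):
    """Forecast a factor's next value from an elasticity coefficient.

    Assumption: factor growth = elasticity * reference growth, so
    next = current * (1 + elasticity * reference_growth_rate).
    """
    return current_value * (1.0 + elasticity * reference_growth_rate)


def predict_access_frequency(intercept, coefficients, factor_values):
    """Evaluate an assumed linear mapping equation: a + sum(b_i * x_i)."""
    return intercept + sum(
        coefficients[name] * factor_values[name] for name in coefficients
    )


# Hypothetical inputs: one main factor with elasticity 0.8 against a
# reference quantity expected to grow by 10%.
predicted_users = forecast_by_elasticity(
    current_value=1000.0, elasticity=0.8, reference_growth_rate=0.10
)
predicted_frequency = predict_access_frequency(
    intercept=5.0,
    coefficients={"active_users": 0.02},
    factor_values={"active_users": predicted_users},
)
```

The regression coefficients would come from the fitted mapping relationship equation; here they are hard-coded only to show how the forecast factor value is substituted in.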
  12. The big data indicator construction device according to claim 8, wherein, when the processor executes the computer-readable instructions to determine the indicator type of the indicator based on the access frequency, the following steps are included:
    if the access frequency of the indicator is greater than a preset threshold and calculating the access frequency of the indicator requires associating other dimension tables, the indicator is of the type requiring multi-dimensional aggregation;
    if the access frequency of the indicator is greater than the preset threshold and calculating the access frequency of the indicator does not require associating other dimension tables, the indicator is of the fixed-dimension type.
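The two-branch rule in the claim above can be written as a small Python function. The enum and function names are hypothetical. The claim does not say how an indicator at or below the threshold is typed, so this sketch returns `None` in that case as an explicit assumption.

```python
from enum import Enum
from typing import Optional


class IndicatorType(Enum):
    MULTI_DIMENSIONAL_AGGREGATION = "multi_dimensional_aggregation"
    FIXED_DIMENSION = "fixed_dimension"


def classify_indicator(
    access_frequency: float,
    threshold: float,
    needs_other_dimension_tables: bool,
) -> Optional[IndicatorType]:
    """Apply the claimed rule: both types require frequency > threshold."""
    if access_frequency <= threshold:
        # Assumption: not covered by the claim, so no type is assigned.
        return None
    if needs_other_dimension_tables:
        return IndicatorType.MULTI_DIMENSIONAL_AGGREGATION
    return IndicatorType.FIXED_DIMENSION
```

Note that the threshold condition is the same in both branches; what separates the two types is only whether other dimension tables must be associated.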
  13. The big data indicator construction device according to claim 8, wherein, after determining the indicator type of the indicator based on the access frequency, the processor, when executing the computer-readable instructions, further implements the following steps:
    based on the indicator type, querying the model construction method corresponding to the indicator from a preset correspondence table between indicator types and model construction methods;
    if the indicator is of the type requiring multi-dimensional aggregation, using dimensional modeling to construct a random report and/or a semi-aggregated report, and storing the random report and/or the semi-aggregated report in a non-aggregation engine and/or a semi-aggregation engine;
    if the indicator is of the fixed-dimension type, using wide table modeling to construct an aggregated report, and storing the aggregated report in an aggregation engine.
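The preset correspondence table in the claim above can be pictured as a plain lookup structure. The Python sketch below shows one possible encoding; the key strings, field names and function name are hypothetical, and only the pairings themselves (modeling method, report kinds, target engines) are taken from the claim.

```python
# Assumed encoding of the correspondence table; the patent does not
# specify a storage format for it.
MODEL_CONSTRUCTION_TABLE = {
    "multi_dimensional_aggregation": {
        "modeling": "dimensional modeling",
        "reports": ("random report", "semi-aggregated report"),
        "engines": ("non-aggregation engine", "semi-aggregation engine"),
    },
    "fixed_dimension": {
        "modeling": "wide table modeling",
        "reports": ("aggregated report",),
        "engines": ("aggregation engine",),
    },
}


def lookup_model_construction(indicator_type):
    """Return the modeling method, report kinds and engines for a type."""
    try:
        return MODEL_CONSTRUCTION_TABLE[indicator_type]
    except KeyError:
        raise ValueError(
            f"no model construction method registered for {indicator_type!r}"
        ) from None


plan = lookup_model_construction("fixed_dimension")
```

Keeping the mapping as data rather than as branching code means a new indicator type can be supported by adding one table entry.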
  14. The big data indicator construction device according to claim 8, wherein, after determining the storage calculation engine and the dimensional modeling method corresponding to the indicator based on the indicator type, according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the processor, when executing the computer-readable instructions, further implements the following steps:
    if the indicator is an indicator requiring multi-dimensional aggregation, downgrading the indicator and storing it in a random report and/or a semi-aggregated report;
    querying the storage calculation engine corresponding to the indicator requiring multi-dimensional aggregation and the preset dimension tables to be associated, and determining the dimension tables that the indicator requiring multi-dimensional aggregation needs to be associated with during calculation;
    if the indicator is a fixed-dimension indicator, using wide table modeling to store all fields of the dimension in an aggregated report;
    querying the storage calculation engine corresponding to the fixed-dimension indicator, and storing the aggregated report in an aggregation engine.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring data to be predicted;
    parsing the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions;
    calculating the access frequency of each indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table;
    determining the indicator type of the indicator based on the access frequency, wherein the indicator types include a multi-dimensional aggregation indicator and a fixed-dimension indicator;
    based on the indicator type, determining the storage calculation engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods;
    determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
    using a routing decision engine to call the storage calculation engine to execute the preset dimension table, and calculating the indicator value corresponding to the indicator.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    parsing the data to be predicted and defining a plurality of indicators;
    grading the indicators with a preset model and adding dimension attributes;
    combining the indicators with the dimension attributes to obtain a plurality of indicators with different dimension attributes.
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    acquiring historical data containing the indicator, wherein the historical data includes the indicator within a specific period, the number of accesses to the indicator within the specific period, and the indicator factors affecting the number of accesses to the indicator within the specific period;
    using the historical data as sample data, performing partial correlation analysis on the sample data, extracting indicators, and establishing, for each indicator, a mapping relationship equation between the indicator and its corresponding indicator factors;
    performing a t-test on each mapping relationship equation to determine the main indicator factors affecting the access frequency of the indicator.
  18. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    determining, based on the linear regression algorithm, the main indicator factors affecting the access frequency of the indicator;
    establishing a mapping relationship equation between the indicator and the main indicator factors, and predicting the parameter values of the main indicator factors using the elasticity coefficient method;
    substituting the parameter values of the indicator factors into the mapping relationship equation to calculate the access frequency of the indicator.
  19. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    if the access frequency of the indicator is greater than a preset threshold and calculating the access frequency of the indicator requires associating other dimension tables, the indicator is of the type requiring multi-dimensional aggregation;
    if the access frequency of the indicator is greater than the preset threshold and calculating the access frequency of the indicator does not require associating other dimension tables, the indicator is of the fixed-dimension type.
  20. A big data indicator construction apparatus, comprising:
    a first acquisition module, configured to acquire data to be predicted;
    a first construction module, configured to parse the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions;
    a judgment module, configured to calculate the access frequency of each indicator according to a linear regression algorithm, and to determine whether the indicator is associated with a preset dimension table;
    a first determining module, configured to determine the indicator type of the indicator based on the access frequency, wherein the indicator types include a multi-dimensional aggregation indicator and a fixed-dimension indicator;
    a second determining module, configured to determine, based on the indicator type, the storage calculation engine and the dimensional modeling method corresponding to the indicator according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods;
    a third determining module, configured to determine, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed with all dimensional modeling methods;
    a calculation module, configured to use a routing decision engine to call the storage calculation engine to execute the preset dimension table, and to calculate the indicator value corresponding to the indicator.
PCT/CN2020/131753 2020-07-23 2020-11-26 Big data index construction method, apparatus and device, and storage medium WO2021139427A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010714909.9A CN111859299A (en) 2020-07-23 2020-07-23 Big data index construction method, device, equipment and storage medium
CN202010714909.9 2020-07-23

Publications (1)

Publication Number Publication Date
WO2021139427A1 true WO2021139427A1 (en) 2021-07-15

Family

ID=72950832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131753 WO2021139427A1 (en) 2020-07-23 2020-11-26 Big data index construction method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN111859299A (en)
WO (1) WO2021139427A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859299A (en) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 Big data index construction method, device, equipment and storage medium
CN112990669A (en) * 2021-02-24 2021-06-18 平安健康保险股份有限公司 Product data analysis method and device, computer equipment and storage medium
CN117520624B (en) * 2024-01-05 2024-04-12 青岛海信信息科技股份有限公司 Configuration and calculation method and device for big data index

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408179A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing data from data table
US20180025035A1 (en) * 2016-07-21 2018-01-25 Ayasdi, Inc. Topological data analysis of data from a fact table and related dimension tables
CN107918600A (en) * 2017-11-15 2018-04-17 泰康保险集团股份有限公司 report development system and method, storage medium and electronic equipment
CN109325648A (en) * 2018-06-29 2019-02-12 深圳市彬讯科技有限公司 Multi-dimensional data stream statistics method, server and storage medium based on index
CN111859299A (en) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 Big data index construction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111859299A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2021139427A1 (en) Big data index construction method, apparatus and device, and storage medium
US11068789B2 (en) Dynamic model data facility and automated operational model building and usage
US10019442B2 (en) Method and system for peer detection
US8788501B2 (en) Parallelization of large scale data clustering analytics
US20080288524A1 (en) Filtering of multi attribute data via on-demand indexing
US10824614B2 (en) Custom query parameters in a database system
US20070083513A1 (en) Determining a recurrent problem of a computer resource using signatures
CN104700190B (en) One kind is for project and the matched method and apparatus of professional
CN110825769A (en) Data index abnormity query method and system
WO2022252782A1 (en) Cloud computing index recommendation method and system
WO2007053940A1 (en) Automatic generation of sales and marketing information
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
US20130036122A1 (en) Assessing application performance with an operational index
Wu et al. Financial fraud risk analysis based on audit information knowledge graph
US11550762B2 (en) Implementation of data access metrics for automated physical database design
Wu et al. Research on evaluation model of hospital informatization level based on decision tree algorithm
Sun et al. Model averaging for interval-valued data
Siddiqui et al. Isum: Efficiently compressing large and complex workloads for scalable index tuning
WO2020119533A1 (en) Public sentiment warning method and apparatus based on recurrent neural network algorithm, terminal and medium
Pesantez-Narvaez et al. Penalized logistic regression to improve predictive capacity of rare events in surveys
CN110990777A (en) Data relevance analysis method and system and readable storage medium
Onile et al. A comparative study on graph-based ranking algorithms for consumer-oriented demand side management
Bachoc et al. Gaussian field on the symmetric group: Prediction and learning
CN114781717A (en) Network point equipment recommendation method, device, equipment and storage medium
CN113283688A (en) Power data asset value assessment method based on entropy weight method and multi-target attribute decision

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20912290

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20912290

Country of ref document: EP

Kind code of ref document: A1