WO2021139427A1 - Method, apparatus and device for constructing big data indicators, and storage medium - Google Patents

Method, apparatus and device for constructing big data indicators, and storage medium

Info

Publication number
WO2021139427A1
Authority
WO
WIPO (PCT)
Prior art keywords
indicator
dimensional
type
index
data
Prior art date
Application number
PCT/CN2020/131753
Other languages
English (en)
Chinese (zh)
Inventor
陈志兴
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139427A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • This application relates to the field of big data technology, and in particular to a method, device, equipment and storage medium for constructing big data indicators.
  • With the progress of society and the development of big data, the development of fixed-dimensional indicator I is facing challenges.
  • The basis of the fixed-dimensional indicator I is to use production data to extract the corresponding production indicators. This process requires a large amount of calculation on the production data, and the division into different indicator levels requires very flexible calculation. With rapidly growing production data and increasingly flexible application scenarios, fixed-dimensional indicator I services can no longer provide effective services.
  • The earlier solution was to provide more computing resources or a more powerful computing engine.
  • However, this also consumes a lot of resources.
  • The inventor realized that, in terms of the calculation model, although flexible indicator calculation is achieved, the time consumed by indicator calculation increases.
  • The calculation of the fixed-dimensional indicator I of big data is often limited to a single calculation engine.
  • The main purpose of this application is to solve the technical problem that only a single data engine and dimensional modeling method can be used.
  • The first aspect of this application provides a method for constructing big data indicators, including: obtaining data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicator according to a linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed with the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods; and using the routing decision engine to call the storage calculation engine to execute the
  • The second aspect of the present application provides a device for constructing big data indicators, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor executes the computer
  • the following steps are implemented: obtaining the data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicators according to the linear regression algorithm, and determining whether the indicator is associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; based on the indicator type, according to preset
  • the dimensional modeling method determines the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator
  • A third aspect of the present application provides a computer-readable storage medium that stores computer instructions; when the computer instructions are executed on a computer, the computer performs the following steps: obtaining data to be predicted; analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculating the access frequency of the indicators according to the linear regression algorithm, and determining whether the indicators are associated with a preset dimension table; determining, based on the access frequency, the indicator type of the indicator, where the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; determining, based on the indicator type and according to a preset correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or
  • The fourth aspect of the present application provides a big data indicator construction device, which includes: a first acquisition module for acquiring data to be predicted; a first construction module for analyzing the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; a judging module for calculating the access frequency of the indicator according to a linear regression algorithm and judging whether the indicator is associated with a preset dimension table; a first determining module for determining, based on the access frequency, the indicator type of the indicator, wherein the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators; a second determining module for determining, based on the indicator type and according to the preset correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; and a third determining module for determining, according to the dimensional modeling method, the preset dimension table associated with the indicator, wherein
  • FIG. 1 is a schematic diagram of a first embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a second embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a third embodiment of a method for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a first embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a second embodiment of a device for constructing a big data indicator in an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of an embodiment of a device for constructing a big data indicator in an embodiment of the present invention.
  • The embodiments of the present application provide a method, device, equipment, and storage medium for constructing a big data indicator, which resolve the tension between calculation time and timeliness for the fixed-dimensional indicator I of big data, and at the same time solve the technical problem that only a single data engine and dimensional modeling method can be used.
  • The first embodiment of the method for constructing a big data indicator in the embodiment of the present application includes:
  • All data to be predicted is acquired; the data contains many indicator labels.
  • Examples include the premium of a certain type of insurance under a certain activity, and the premiums of all types of insurance under a certain activity.
  • The data to be predicted refers to the data containing the indicators to be calculated.
  • The data is analyzed to determine the indicator information it contains.
  • Common attributes are added to construct labels of multiple different basic dimensions. For example, taking the indicator “premium” and adding public attributes to it, multiple different indicators can be constructed: “premium of auto insurance”, “premium under the Double 11 event”, or “premium of auto insurance under the Double 11 event”.
  • the indicator is an indicator label that the enterprise has obtained based on data analysis.
  • A basic dimension adds the calculated value of the dimensional indicator under the company's basic attributes, and adding a public attribute adds the company's common attributes.
  • The access frequency of the indicator is calculated, that is, it is determined whether the indicator is one that frequently needs to be counted (visited) or one that is calculated according to a certain usage rule.
  • The calculation requirements of the indicator are determined from its access frequency and from whether other data must be associated when the indicator is calculated, and a suitable storage calculation engine is then selected.
  • The dimension table mentioned in this embodiment can be understood, to a certain extent, as a data table containing information on many indicators (labels).
  • For example, a dimension table that counts xx insurance company's “total premiums in 2019” contains labels such as time (2019.01, 2019.02, ..., 2019.12) and insurance type (auto insurance, life insurance, critical illness insurance, children's insurance), and includes indicators such as “2019.01 Auto Insurance Premium”, “2019.01 Critical Illness Insurance Premium”, “2019.03 Life Insurance Premium”, “2019.03 Children's Insurance Premium”, etc.
  • Whether an indicator needs to be associated with other dimension tables during calculation is determined by whether indicators from other dimension tables (data tables) must be introduced when calculating it. For example, when calculating the indicator “Total auto insurance premiums from 2017 to 2019”, the local dimension table only has the indicator “Total auto insurance premiums in 2018”. In that case, calculating “Total auto insurance premiums from 2017 to 2019” requires associating the indicator information in the dimension table “2017 total auto insurance premiums” and the dimension table “2019 total auto insurance premiums”. As another example, when calculating the indicator “total auto insurance premiums from April to June 2019”, the local dimension table “total premiums for 2019” already includes the monthly auto insurance premiums from January to December 2019, so there is no need to associate indicator information in other dimension tables.
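  • The association check described above can be sketched as follows. This is a minimal illustration only: the function names, the dictionary representation of a dimension table, and the premium values are hypothetical and do not come from the application.

```python
def missing_periods(required, local_table):
    """Return the periods that the local dimension table cannot supply."""
    return [period for period in required if period not in local_table]


def needs_association(required, local_table):
    """An indicator needs other dimension tables if any period is missing locally."""
    return bool(missing_periods(required, local_table))


# Hypothetical local table that only holds the 2018 total (value is made up).
local_yearly = {"2018": 120.0}
required_years = ["2017", "2018", "2019"]

# Hypothetical "total premiums for 2019" table with monthly auto insurance premiums.
local_monthly = {f"2019.{month:02d}": 10.0 for month in range(1, 13)}
required_months = ["2019.04", "2019.05", "2019.06"]

print(needs_association(required_years, local_yearly))    # 2017 and 2019 are missing
print(needs_association(required_months, local_monthly))  # all months are local
```

    The first call reports that other tables must be associated, the second that the local table is sufficient, mirroring the two examples in the text.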
  • Linear regression in this embodiment refers to a regression analysis that uses a least-squares function, called the linear regression equation, to model the relationship between one or more independent variables and a dependent variable.
  • A linear regression algorithm is used to predict the access frequency of the indicator. For example, in the course of using indicators, it becomes apparent that some indicators are used frequently, or that the access frequency of an indicator is affected by other data in a way that is approximately linear. From these characteristics it can be inferred which indicators have a relatively high access frequency or are calculated through regular statistics. For example, the indicators that need to be checked for the Double 11 event also need to be checked for Double 12, so the Double 12 indicators can be calculated and aggregated in advance according to the characteristics of Double 11. In this embodiment, regression means predicting new data based on existing data, such as predicting stock trends. Linear regression uses a straight line to describe the relationship between the data as accurately as possible, so that a value can be predicted when new data appears.
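  • As a minimal sketch of this idea, the snippet below fits a straight line to past access counts with ordinary least squares and extrapolates one step ahead. The single-feature setup, the function names, and the access counts are assumptions made for illustration; the application does not specify them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical access counts of one indicator during the Double 11 event.
years = [2017, 2018, 2019]
visits = [100.0, 150.0, 200.0]

intercept, slope = fit_line(years, visits)
predicted_2020 = predict(intercept, slope, 2020)
print(slope, predicted_2020)
```

    With more than one influencing factor, the same least-squares principle applies to a multiple regression; the single-feature form is used here only to keep the sketch short.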
  • The linear regression model looks like:
  • The model obtained by linear regression is not necessarily a straight line:
  • With two independent variables, the model is a plane in space.
  • Linear regression usually uses the residual sum of squares, that is, the distance from each point to the line measured parallel to the y axis rather than the perpendicular distance.
  • The residual sum of squares divided by the sample size n is the mean square error.
  • The mean square error is used as the cost function of the linear regression model. Minimizing the sum of the squared distances from all points to the line amounts to minimizing the mean square error. This method is called the least squares method.
  • The i-th sample is expressed as:
  • The loss function is the mean square error, that is:
  • The least squares method is used to solve for the parameters, and the loss function J(θ) is differentiated with respect to θ:
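  • In standard notation, the least-squares formulation described in the preceding paragraphs can be sketched as follows. The notation is assumed here rather than taken from the application: θ is the parameter vector, each sample x^(i) has d features plus a leading 1 for the intercept, y^(i) is the observed value, n is the sample size, and X is the matrix whose rows are the samples.

```latex
% Model (a line for d = 1, a plane for d = 2, a hyperplane in general)
h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \theta^{\top} x

% Mean square error used as the loss function
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
          = \frac{1}{n} (X\theta - y)^{\top} (X\theta - y)

% Setting the gradient to zero gives the least-squares solution
\nabla_\theta J(\theta) = \frac{2}{n} X^{\top} (X\theta - y) = 0
\quad\Longrightarrow\quad
\theta = \left( X^{\top} X \right)^{-1} X^{\top} y
```

    The closed-form solution assumes that X^T X is invertible; otherwise a pseudo-inverse or an iterative method would be needed.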
  • The linear regression algorithm is used to determine the important indicators in the sample data, and a mapping relationship equation is established between each indicator and the indicator factors that affect its access frequency.
  • According to the weight W with which each indicator factor affects the access frequency of indicator M, the main independent variables (that is, the main indicator factors) $a_1, a_2, a_3, \dots, a_n$ are determined, and the mapping relationship equation between the main indicator factors and the indicator is established: $y = \theta_0 + \theta_1 a_1 + \theta_2 a_2 + \dots + \theta_n a_n$, where $y$ is the access frequency of indicator M (in a certain time period) and $a_1, a_2, a_3, \dots, a_n$ are the indicator factors that affect the access frequency of indicator M (in that time period).
  • Taking indicator M as an example, the access frequency of indicator M during a particular promotion from 2017 to 2019 is collected, together with the values of the indicator factors $a_1, a_2, a_3, \dots, a_n$ that affect that access frequency.
  • By comparing the predicted value of indicator M with the actual value, it can be concluded that the model is reasonable and can be used to predict the access frequency of indicator M (in a specific time period).
  • The regression model infers the change in the access frequency of indicator M over a certain period of time; this is fed into the prediction model as input data, and the predicted access frequency of indicator M (in a certain period of time) is finally obtained.
  • In this way, it can be predicted which data will be accessed with high frequency. Data accessed with high frequency needs to be pre-aggregated, while data that does not require high-frequency access can use other storage engines.
  • the indicator type includes multi-dimensional aggregated indicators and fixed-dimensional indicators
  • The indicator type is determined according to the access frequency of the indicator and whether other dimension tables need to be associated when the indicator is calculated. For example, some indicators can only be calculated by associating multiple dimension tables, while others can be calculated without associating any other dimension table. There are accordingly two types of indicators: indicators that require multi-dimensional aggregation, which need to be associated with other dimension tables during calculation, and fixed-dimensional indicators, whose values can be calculated from the data in the wide table to which they belong without associating data in other dimension tables.
  • The storage calculation engine corresponding to the indicator is queried, and the preset dimension table associated with the indicator is calculated.
  • Different types of indicators are stored in different storage calculation engines. For example, some are stored in random reports or semi-aggregated reports.
  • For these, the table where the indicator is located must be associated with other dimension tables before the value of the indicator can be calculated.
  • The aggregate reports built from these indicators can be stored in the aggregation engine and calculated in advance, so that when a user queries the indicator, the corresponding value is returned quickly without waiting for calculation, which improves the efficiency of data processing.
  • According to the type of the indicator, it is determined whether multiple dimension tables need to be associated when the indicator value is queried, and if so, the corresponding dimension tables are queried. For example, to calculate the value of the fixed indicator “2018 Double 11 event auto insurance premiums”, the data of three tables of different dimensions, “2018 insurance premiums”, “2018 auto insurance premiums” and “2018 Double 11 event premiums”, is stored in a single table, the wide table, so no other data reports need to be associated during calculation. To calculate the indicator “2018 premiums”, the tables “2018 auto insurance premiums”, “2018 property insurance premiums”, “2018 life insurance premiums”, ..., “2018 XX insurance premiums” are needed: the premium tables of all insurance types are associated together to obtain the value of the indicator “2018 premiums”.
  • the preset dimension table includes a dimensional table constructed based on the dimensional modeling method corresponding to the indicator type or a dimensional table constructed based on all dimensional modeling methods:
  • Different types of indicators correspond to different modeling methods that generate different types of reports, and the generated reports are stored in different storage calculation engines according to their report type.
  • The routing decision engine requests the corresponding storage calculation engine according to the correspondence between the report to which the indicator belongs and the calculation engine that stores it. That is, depending on the indicator queried, the routing decision engine selects the storage calculation engine corresponding to the current calculation request and distributes the request to it to calculate the value of the indicator. For example, if the indicator to be viewed is a basic (fixed) indicator, the query (calculation) request is forwarded to a basic database such as Hive (a non-aggregating database that supports multi-table association calculation); if a pre-calculated indicator is to be viewed, the request is forwarded to a database such as druid.io (an aggregated data engine). In this embodiment, the calculation requirement of the indicator can be understood simply as whether association and calculation across dimension tables are required.
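  • The routing step can be pictured as a lookup from indicator category to engine. The sketch below is an assumption-laden illustration: the category keys, the `route` function, and the returned dictionary are hypothetical, and the engine names are plain strings taken from the example in the text rather than calls to any real Hive or Druid client.

```python
# Hypothetical correspondence table between indicator category and engine,
# following the example in the text: basic (fixed) indicators go to Hive,
# pre-calculated (pre-aggregated) indicators go to Druid.
ENGINE_BY_CATEGORY = {
    "fixed": "hive",
    "pre_aggregated": "druid",
}


def route(indicator_name, category):
    """Pick the storage calculation engine for a query on one indicator."""
    try:
        engine = ENGINE_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"no engine configured for category {category!r}") from None
    # A real router would now forward the query; here we only report the choice.
    return {"indicator": indicator_name, "engine": engine}


request_a = route("2018 Double 11 event auto insurance premiums", "fixed")
request_b = route("2018 premiums", "pre_aggregated")
print(request_a["engine"], request_b["engine"])
```

    Keeping the correspondence in a table rather than in branching code means a new engine can be added by configuration, which matches the preset correspondence table the method describes.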
  • The data to be predicted is obtained and analyzed to construct indicators with attributes of multiple dimensions, and the access frequency of the indicators is predicted by the linear regression algorithm to determine their calculation requirements.
  • According to the calculation requirements of the indicators, an appropriate method is selected to store the indicators in the corresponding storage calculation engine and to calculate their values. This resolves the tension between calculation time and timeliness for the fixed-dimensional indicator I of big data, and at the same time solves the technical problem that only a single data engine and dimensional modeling method can be used.
  • the second embodiment of the method for constructing a big data indicator in the embodiment of the present application includes:
  • An indicator is a unit or method used to measure the degree of development of a thing; in IT it is also commonly called a measure. Examples: population, GDP, income, number of users, profit rate, retention rate, coverage rate, etc.
  • A KPI indicator system uses several key indicators to measure the performance of the company's business operations. Indicators are obtained through summary calculations such as summation and averaging, and these calculations are performed under certain preconditions, such as time, location, and cost, which is what is often called the statistical caliber and scope.
  • A preset model is used to classify the extracted indicators, and the dimensional attribute information of each indicator is added.
  • For example, by gradually adding dimensional attribute information, the indicator “premium” can become “enterprise plan premiums” and then “enterprise plan premiums of secondary institutions”, which adds basic dimension attribute information; public attribute dimension information can be added at the same time, such as “whether it is the enterprise plan premiums of secondary institutions participating in insurance activities”.
  • A dimension attribute of an indicator refers to a certain characteristic of a thing or phenomenon; gender, region, and time are all dimensions. Time is a commonly used and special dimension: by comparing values before and after, one can tell whether things are developing well or badly. For example, “the premium of auto insurance under the Double 11 event in 2019 increased by 10% compared with the premium of auto insurance under the Double 11 event in 2018”, or “the premium of life insurance under the Double 12 event in 2019 increased by 20% compared with the premium of life insurance under the Double 11 event in 2019”. This is a comparison over time, also known as a vertical comparison. The other kind is the horizontal comparison.
  • For example, comparing the “premium of auto insurance under the Double 11 event in 2018” with the “premium of life insurance under the Double 11 event in 2018” is a comparison between units of the same level, referred to as a horizontal comparison.
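  • Both kinds of comparison reduce to the same arithmetic on two indicator values. The following sketch uses a hypothetical `percent_change` helper and made-up premium figures purely to illustrate the difference between the two.

```python
def percent_change(current, baseline):
    """Relative difference of `current` against `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100.0 / baseline


# Made-up premium values, for illustration only.
auto_2018_double11 = 100.0
auto_2019_double11 = 110.0
life_2018_double11 = 80.0

# Vertical comparison: same indicator, different points in time.
vertical = percent_change(auto_2019_double11, auto_2018_double11)

# Horizontal comparison: different units of the same level, same point in time.
horizontal = percent_change(life_2018_double11, auto_2018_double11)

print(vertical, horizontal)
```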
  • Dimensions can be divided into qualitative and quantitative dimensions according to data type.
  • When the data type is character (text) data, the dimension is qualitative;
  • for example, region and gender are qualitative dimensions.
  • When the data type is numerical, the dimension is quantitative, for example income, age, and consumption.
  • The indicator and the dimensional attributes are combined to obtain multiple indicators carrying different dimensional attribute information. For example, “Premium for auto insurance under Double 11 in 2019”, “Premium for auto insurance under Double 12 in 2019”, “Premium for property insurance under Double 11 in 2019”, “Premium for property insurance under Double 12 in 2019”.
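  • Combining one measure with several dimension attributes amounts to taking the Cartesian product of the dimension values. The sketch below illustrates this under assumed names (`build_indicators`, `label`, and the dimension keys are hypothetical); it reproduces the four example indicators listed above.

```python
from itertools import product


def build_indicators(measure, dimensions):
    """Combine one measure with every combination of dimension values."""
    names = list(dimensions)
    combos = product(*(dimensions[name] for name in names))
    return [{"measure": measure, **dict(zip(names, combo))} for combo in combos]


def label(indicator):
    """Render an indicator as a readable label."""
    return (
        f'{indicator["measure"].capitalize()} for {indicator["insurance_type"]} '
        f'under {indicator["event"]} in {indicator["year"]}'
    )


# Hypothetical dimension attributes taken from the example in the text.
dimensions = {
    "year": ["2019"],
    "event": ["Double 11", "Double 12"],
    "insurance_type": ["auto insurance", "property insurance"],
}

indicators = build_indicators("premium", dimensions)
labels = [label(indicator) for indicator in indicators]
for text in labels:
    print(text)
```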
  • The indicators of different dimensional attributes in the data to be predicted are determined, and at the same time, the indicator factors that affect the access frequency of the indicators are determined.
  • A mapping relationship equation is established between each indicator obtained from the data to be predicted and the indicator factors corresponding to it.
  • The elasticity coefficient method is used to predict the parameter value of each indicator factor under a certain activity in the data to be predicted. For example, the number of people who will purchase auto insurance during the Double 11 event in 2019 is predicted.
  • The elasticity coefficient ET is calculated using the data of the most recent year and the earliest year (from the collected historical data), and the access frequency of the corresponding indicator under a certain activity can then be calculated.
  • The access frequency in this embodiment can also be regarded as a probability value.
  • The parameter values of the indicator factors are substituted into the mapping relationship equation between the indicator and its indicator factors to calculate (predict) the access frequency (probability value) of the indicator.
  • the indicator is an indicator type that requires multi-dimensional aggregation
  • The indicator is of the type that requires multi-dimensional aggregation, that is, an indicator that must be aggregated across multiple dimensions. For example, to calculate the indicator “2018 premiums”, the tables “2018 auto insurance premiums”, “2018 property insurance premiums”, “2018 life insurance premiums”, ..., “2018 XX insurance premiums” are needed, and the premium tables of all insurance types must be associated together; the indicator “2018 premiums” is therefore an indicator that requires multi-dimensional aggregation.
  • the indicator type is a fixed-dimensional indicator type
  • If the access probability of the indicator is greater than the preset threshold and, when the indicator is queried (calculated), no other dimension tables need to be associated and only the data in the table to which the indicator belongs is used, the indicator can be determined to be a fixed-dimensional indicator, that is, a fixed indicator.
  • For example, the dimensions of the indicator “2018 Double 11 event auto insurance premium” are the three fixed dimensions “2018 + Double 11 event + auto insurance”.
  • When this indicator is calculated, only the data in its (wide) table is queried and no data in other tables needs to be associated, so the indicator “2018 Double 11 event auto insurance premiums” is a fixed-dimensional indicator, that is, a fixed indicator. In this embodiment, a wide table contains all the required fields, so no other tables need to be associated when computing statistics (calculating indicator values).
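  • The two determinations above can be sketched as a small decision function. This is one possible reading only: the function name, the default threshold of 0.5, and the "undetermined" fallback for cases the text does not cover are all assumptions made for illustration.

```python
def classify_indicator(access_frequency, needs_association, threshold=0.5):
    """Classify an indicator from its predicted access frequency and table needs.

    `threshold` is a hypothetical preset value; the text does not give one.
    """
    if needs_association:
        # Other dimension tables must be joined to compute the value.
        return "multi-dimensional aggregated"
    if access_frequency > threshold:
        # Frequently accessed and computable from its own wide table.
        return "fixed-dimensional"
    # Case not covered by the description; left open in this sketch.
    return "undetermined"


# "2018 premiums" needs every insurance-type table to be associated.
type_total = classify_indicator(access_frequency=0.9, needs_association=True)

# "2018 Double 11 event auto insurance premiums" lives in one wide table.
type_fixed = classify_indicator(access_frequency=0.8, needs_association=False)

print(type_total, type_fixed)
```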
  • the preset dimension table includes a dimensional table constructed based on the dimensional modeling method corresponding to the indicator type or a dimensional table constructed based on all dimensional modeling methods;
  • The third embodiment of the big data indicator construction method in the embodiments of the present application includes:
  • Historical data containing the indicators to be predicted is obtained.
  • The historical data includes the indicators in a specific period, the number of visits to each indicator in that period, and the indicator factors that affect that number of visits.
  • The indicator factors are related to the number of visits to the indicator in the specific period. A mapping relationship between the indicator factors and the indicator access frequency is therefore established, and the access frequency is calculated (or "predicted") from the historical data.
  • The historical data is used as sample data; for example, the data for "auto insurance premium under the Double 11 event in 2018" is used as sample data.
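  • The t-test is one of the significance tests used in the multiple linear regression algorithm.
  • The F-test can be treated as equivalent to the t-test here (when a single regression coefficient is tested, the F statistic equals the square of the t statistic).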
  • Partial correlation analysis is used to further analyze the mapping relationship equation between each indicator and its indicator factors, and to determine the main independent variables in that mapping relationship, that is, the main indicator factors. Many indicator factors affect the number of times an indicator is visited in a specific period; the main indicator factors are the main influences. All main indicator factors are then retained in the mapping relationship equation between the indicator and the indicator factors.
  • An indicator factor is a main indicator factor if its partial correlation coefficient lies within the preset value interval and its regression coefficient in the mapping relationship equation is greater than the F-test parameter or the t-test parameter.
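The relationship between the t-test and the F-test mentioned above can be checked numerically in the simplest case. The sketch below is an assumption-laden illustration rather than the patent's procedure: it fits a regression with a single indicator factor (the patent describes multiple linear regression), uses made-up sample values, and the function name `simple_regression_tests` is hypothetical. With one factor, the F statistic of the regression equals the square of the slope's t statistic.

```python
import math


def simple_regression_tests(x, y):
    """Fit y = intercept + slope * x by least squares and return
    (slope, intercept, t statistic of the slope, F statistic)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and residual variance with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)
    # t statistic: slope divided by its standard error.
    t_stat = slope / math.sqrt(residual_variance / sxx)
    # F statistic: regression mean square over residual mean square.
    f_stat = (slope ** 2 * sxx) / residual_variance
    return slope, intercept, t_stat, f_stat


# Hypothetical sample: one indicator factor and the observed visit counts.
factor_values = [1.0, 2.0, 3.0, 4.0, 5.0]
visit_counts = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, t_stat, f_stat = simple_regression_tests(factor_values, visit_counts)
```

In a multiple regression the same identity holds for the partial F-test of one coefficient, which is one reading of why the two tests are described as interchangeable.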
  • Determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators;
  • The model construction method corresponding to the indicator type is queried from the preset correspondence table between indicator types and model construction methods.
  • If the indicator is of the type that requires multi-dimensional aggregation, that is, an indicator that can only be calculated by associating multiple dimension tables, use dimensional modeling to build random reports and/or semi-aggregated reports, and store them in the non-aggregation engine and/or the semi-aggregation engine.
  • If the indicator is of the fixed-dimension type, that is, an indicator that can be calculated without associating multiple dimension tables, use wide-table modeling to build an aggregate report, and store the aggregate report in the aggregation engine.
  • Wide-table modeling stores the indicators and dimensions together in one large table in which all fields are built, so no other tables need to be associated when computing statistics.
  • Dimensional modeling divides the data into a fact table and dimension tables. The fact table records specific events.
  • A dimension describes some aspect of an event. Separating the fact table from the dimension tables improves flexibility.
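The difference between the two modeling styles can be illustrated with plain Python data structures. This is a minimal sketch with hypothetical table contents and function names: in the dimensional model the insurance type lives in a separate dimension table and must be looked up (joined) for every fact row, whereas the wide table already carries that field.

```python
# Dimensional model: a fact table plus a dimension table (hypothetical data).
fact_premiums = [
    {"policy_id": 1, "product_id": "P1", "premium": 100.0},
    {"policy_id": 2, "product_id": "P2", "premium": 250.0},
    {"policy_id": 3, "product_id": "P1", "premium": 80.0},
]
dim_product = {
    "P1": {"insurance_type": "auto"},
    "P2": {"insurance_type": "life"},
}

# Wide table: every field is already built into each row.
wide_premiums = [
    {"policy_id": 1, "insurance_type": "auto", "premium": 100.0},
    {"policy_id": 2, "insurance_type": "life", "premium": 250.0},
    {"policy_id": 3, "insurance_type": "auto", "premium": 80.0},
]


def premium_by_type_dimensional(facts, products):
    """Total premium per insurance type; requires a lookup into the dimension table."""
    totals = {}
    for row in facts:
        insurance_type = products[row["product_id"]]["insurance_type"]  # the join
        totals[insurance_type] = totals.get(insurance_type, 0.0) + row["premium"]
    return totals


def premium_by_type_wide(rows):
    """Total premium per insurance type; reads only the wide table."""
    totals = {}
    for row in rows:
        totals[row["insurance_type"]] = totals.get(row["insurance_type"], 0.0) + row["premium"]
    return totals


dimensional_totals = premium_by_type_dimensional(fact_premiums, dim_product)
wide_totals = premium_by_type_wide(wide_premiums)
```

Both routes give the same totals; the wide table trades storage and flexibility for a query that needs no association step.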
  • The preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type, or a dimension table constructed based on all dimensional modeling methods.
  • If the indicator requires multi-dimensional aggregation, the indicator is downgraded and stored in a random report or semi-aggregated report. In this embodiment, such an indicator does not need to be calculated in advance; downgrading means that the indicator and its data are stored on a common calculation engine to save computing resources.
  • Common calculation engines include non-aggregation engines and semi-aggregation engines.
  • If the indicator is a fixed-dimension indicator, use wide-table modeling to store all the fields of its dimensions in the aggregate report. In this embodiment, a fixed-dimension indicator can be calculated without aggregating over other dimension tables, so all indicators of this type can be stored in an aggregate report and calculated in advance. This saves indicator query (calculation) time and improves data processing efficiency.
  • Query the storage calculation engine corresponding to the fixed-dimension indicator type, and store the aggregate report in the aggregation engine. In this embodiment, because no other dimension tables need to be associated when calculating a fixed-dimension indicator, this type of indicator is stored in the aggregate report through wide-table modeling and kept in the aggregation engine so that it can be calculated in advance.
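The lookups described above, from indicator type to modeling method, report kind, and storage calculation engine, amount to consulting preset correspondence tables. A minimal sketch follows; the dictionary names, the string keys, and the `plan_for` function are assumptions made for illustration, while the table contents follow the pairings stated in the text.

```python
# Hypothetical correspondence tables keyed by indicator type.
MODELING_BY_TYPE = {
    "multi_dimensional_aggregation": "dimensional modeling",
    "fixed_dimension": "wide-table modeling",
}
REPORTS_BY_TYPE = {
    "multi_dimensional_aggregation": ["random report", "semi-aggregated report"],
    "fixed_dimension": ["aggregate report"],
}
ENGINES_BY_TYPE = {
    "multi_dimensional_aggregation": ["non-aggregation engine", "semi-aggregation engine"],
    "fixed_dimension": ["aggregation engine"],
}


def plan_for(indicator_type: str) -> dict:
    """Look up the modeling method, report kinds, and engines for a type."""
    if indicator_type not in MODELING_BY_TYPE:
        raise ValueError(f"unknown indicator type: {indicator_type!r}")
    return {
        "modeling": MODELING_BY_TYPE[indicator_type],
        "reports": REPORTS_BY_TYPE[indicator_type],
        "engines": ENGINES_BY_TYPE[indicator_type],
    }


fixed_plan = plan_for("fixed_dimension")
multi_plan = plan_for("multi_dimensional_aggregation")
```

Keeping the pairings in tables rather than in branching code means a new indicator type or engine can be added by extending the tables only.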
  • An embodiment of the big data indicator construction device in the embodiments of the present application includes: an acquisition module 401, used to obtain the data to be predicted; a first construction module 402, used to analyze the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions; a judgment module 403, used to calculate the access frequency of the indicator according to the linear regression algorithm and determine whether the indicator is associated with a preset dimension table; a first determining module 404, used to determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators; a second determining module 405, used to determine, based on the indicator type and according to the preset correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator.
  • The second embodiment of the big data indicator construction device in the embodiments of the present application includes:
  • a first acquisition module 501, used to obtain the data to be predicted; a first construction module 502, used to analyze the data to be predicted to construct a plurality of indicators carrying attribute information of different dimensions; a judgment module 503, used to calculate the access frequency of the indicator according to the linear regression algorithm and determine whether the indicator is associated with a preset dimension table; a first determining module 504, used to determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators; a second determining module 505, used to determine, based on the indicator type and according to the correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator;
  • a third determining module 506, used to determine the preset dimension table associated with the indicator according to the dimensional modeling method, where the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods; a calculation module 507, used to call the storage calculation engine through the routing decision engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator; a second acquisition module 508, used to obtain historical data including the indicators, where the historical data includes the indicators in a specific period, the number of visits to each indicator in that period, and the indicator factors that affect that number of visits; an analysis module 509, used to take the historical data as sample data, perform partial correlation analysis on the sample data, extract the indicators, and establish the mapping relationship equations between the indicators and their corresponding indicator factors; a test module 510, used to perform t-tests on the mapping relationship equations respectively to determine the indicator factors that affect indicator access.
  • The first construction module 502 is specifically used to: analyze the data to be predicted and define multiple indicators; use a preset model to classify the indicators and add dimension attributes; and combine the indicators with the dimension attributes to obtain multiple indicators with different dimension attributes.
  • The judgment module 503 is specifically used to: determine, based on the linear regression algorithm, the main indicator factors that affect the access frequency of the indicator; establish the mapping relationship equation between the indicator and the main indicator factors, and use the elasticity coefficient method to predict the parameter values of the main indicator factors; and substitute those parameter values into the mapping relationship equation to calculate the access frequency of the indicator.
  • The first determining module 504 is specifically configured to: determine that the indicator is of the type that requires multi-dimensional aggregation if its access frequency is greater than a preset threshold and other dimension tables need to be associated when calculating the indicator; and determine that the indicator is of the fixed-dimension type if its access frequency is greater than the preset threshold and no other dimension tables need to be associated when calculating the indicator.
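The step performed by the first construction module, combining indicators with dimension attributes to obtain indicators with different dimension attributes, can be sketched as a Cartesian product. The function name `build_indicators`, the record layout, and the sample dimension values are assumptions for illustration; the patent does not specify how the combination is carried out.

```python
from itertools import product


def build_indicators(base_indicators, dimension_attributes):
    """Combine each base indicator with every combination of dimension values."""
    names = list(dimension_attributes)
    indicators = []
    for base in base_indicators:
        for values in product(*(dimension_attributes[name] for name in names)):
            indicators.append({"indicator": base, "dimensions": dict(zip(names, values))})
    return indicators


# Hypothetical dimension attributes echoing the "2018 + Double 11 event + auto insurance" example.
indicators = build_indicators(
    ["premium"],
    {"year": ["2018"], "event": ["Double 11"], "insurance_type": ["auto", "life"]},
)
```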
  • FIG. 6 is a schematic structural diagram of a big data indicator construction device provided by an embodiment of the present application.
  • The big data indicator construction device 600 may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 610 (for example, one or more processors), a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632.
  • The memory 620 and the storage medium 630 may be transient storage or persistent storage.
  • The program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the big data indicator construction device 600.
  • The processor 610 may be configured to communicate with the storage medium 630 and to execute, on the big data indicator construction device 600, the series of instruction operations in the storage medium 630, so as to implement the steps of the big data indicator construction method in the foregoing embodiments.
  • The big data indicator construction device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input and output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • The present application also provides a big data indicator construction device.
  • The device includes a memory and at least one processor, where instructions are stored in the memory and the memory and the at least one processor are interconnected by wires; the at least one processor calls the instructions in the memory so that the device executes the steps of the big data indicator construction method.
  • The present application also provides a computer-readable storage medium.
  • The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • The computer-readable storage medium stores computer instructions. When the computer instructions are run on a computer, the computer executes the following steps: obtain the data to be predicted; analyze the data to be predicted to construct multiple indicators that carry attribute information of different dimensions; calculate the access frequency of the indicator according to the linear regression algorithm, and determine whether the indicator is associated with a preset dimension table; determine the indicator type of the indicator based on the access frequency, where the indicator types include multi-dimensional aggregation indicators and fixed-dimension indicators; determine, based on the indicator type and according to the preset correspondence table between indicator types and storage calculation engines and the correspondence table between indicator types and dimensional modeling methods, the storage calculation engine and the dimensional modeling method corresponding to the indicator; determine, according to the dimensional modeling method, the preset dimension table associated with the indicator, where the preset dimension table includes a dimension table constructed based on the dimensional modeling method corresponding to the indicator type or a dimension table constructed based on all dimensional modeling methods; and use the routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the indicator value corresponding to the indicator.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Mathematical Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Technology Law (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method, an apparatus, and a device for constructing big data indicators, as well as a storage medium, relating to the field of big data. The method comprises the steps of: acquiring data to be predicted, and analyzing the data to construct multiple indicators carrying different dimension attribute information; calculating the access frequency of each indicator according to a linear regression algorithm, and determining whether other dimension tables need to be associated during indicator calculation, so as to determine the type of the indicator; according to a correspondence table between indicator types and storage calculation engines and a correspondence table between indicator types and dimensional modeling methods for the indicator, querying for the storage calculation engine corresponding to the indicator and calculating a preset dimension table with which the indicator is to be associated; and using a routing decision engine to call the storage calculation engine to execute the preset dimension table and calculate the value corresponding to the indicator. The method addresses the timeliness of big data indicator calculation, and solves the technical problem of a single data engine and a single dimension being used for modeling.
PCT/CN2020/131753 2020-07-23 2020-11-26 Procédé, appareil et dispositif de construction d'index de mégadonnées, et support de stockage WO2021139427A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010714909.9A CN111859299A (zh) 2020-07-23 2020-07-23 大数据指标构建方法、装置、设备及存储介质
CN202010714909.9 2020-07-23

Publications (1)

Publication Number Publication Date
WO2021139427A1 true WO2021139427A1 (fr) 2021-07-15

Family

ID=72950832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131753 WO2021139427A1 (fr) 2020-07-23 2020-11-26 Procédé, appareil et dispositif de construction d'index de mégadonnées, et support de stockage

Country Status (2)

Country Link
CN (1) CN111859299A (fr)
WO (1) WO2021139427A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859299A (zh) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 大数据指标构建方法、装置、设备及存储介质
CN112990669A (zh) * 2021-02-24 2021-06-18 平安健康保险股份有限公司 产品数据分析方法、装置、计算机设备及存储介质
CN113420096B (zh) * 2021-06-22 2024-05-10 平安科技(深圳)有限公司 指标体系的构建方法、装置、设备及存储介质
CN117520624B (zh) * 2024-01-05 2024-04-12 青岛海信信息科技股份有限公司 一种大数据指标的配置与计算的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408179A (zh) * 2014-12-15 2015-03-11 北京国双科技有限公司 数据表中数据处理方法和装置
US20180025035A1 (en) * 2016-07-21 2018-01-25 Ayasdi, Inc. Topological data analysis of data from a fact table and related dimension tables
CN107918600A (zh) * 2017-11-15 2018-04-17 泰康保险集团股份有限公司 报表开发系统及方法、存储介质和电子设备
CN109325648A (zh) * 2018-06-29 2019-02-12 深圳市彬讯科技有限公司 基于指标的多维度数据流统计方法、服务器及存储介质
CN111859299A (zh) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 大数据指标构建方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN111859299A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2021139427A1 (fr) Procédé, appareil et dispositif de construction d'index de mégadonnées, et support de stockage
US11068789B2 (en) Dynamic model data facility and automated operational model building and usage
US8108399B2 (en) Filtering of multi attribute data via on-demand indexing
US7502971B2 (en) Determining a recurrent problem of a computer resource using signatures
US8788501B2 (en) Parallelization of large scale data clustering analytics
US10824614B2 (en) Custom query parameters in a database system
CN104700190B (zh) 一种用于项目与专业人员匹配的方法和装置
CN110825769A (zh) 一种数据指标异常的查询方法和系统
WO2022252782A1 (fr) Procédé et système de recommandation d'indice de calcul en nuage
WO2007053940A1 (fr) Systemes et procedes de generation automatique d'informations de vente et de marketing
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
US11550762B2 (en) Implementation of data access metrics for automated physical database design
Sun et al. Model averaging for interval-valued data
Siddiqui et al. Isum: Efficiently compressing large and complex workloads for scalable index tuning
CN114781717A (zh) 网点设备推荐方法、装置、设备和存储介质
CN117217933A (zh) 用于保险行业的数据多维分析方法及装置
Onile et al. A comparative study on graph-based ranking algorithms for consumer-oriented demand side management
CN116450757A (zh) 数据资产的评价指标的确定方法及装置、设备及存储介质
Pesantez-Narvaez et al. Penalized logistic regression to improve predictive capacity of rare events in surveys
CN110990777A (zh) 数据关联性分析方法及系统、可读存储介质
Deng et al. A novel method for elimination of inconsistencies in ordinal classification with monotonicity constraints
CN113283688A (zh) 一种基于熵权法与多目标属性决策的电力数据资产价值评估方法
US20240020515A1 (en) Systems and methods for a neural network database framework for answering database query types
Li et al. Financial Intelligence Decision Analysis System Based on Decision Tree Algorithm
CN117273953A (zh) 数据资产价值评估方法、装置和计算机设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912290

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912290

Country of ref document: EP

Kind code of ref document: A1