WO2018068360A1 - Method for obtaining regression relationships between dependent variables and independent variables during data analysis - Google Patents
Method for obtaining regression relationships between dependent variables and independent variables during data analysis
- Publication number
- WO2018068360A1 (PCT/CN2016/106004)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- regression
- data
- independent
- variable
- analysis
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23211—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
Definitions
- The invention relates to the technical field of data analysis and processing, and in particular to a method for obtaining the regression relationship between a dependent variable and independent variables in data analysis.
- Regression analysis is a frequently used method in data analysis.
- In the traditional regression process, the user selects the independent and dependent variables according to some assumed model, enters the data manually, analyzes the results one by one, and then checks the regression coefficients of each result and how accurately the independent variables reproduce the actual dependent variable.
- When the relationship between multiple independent variables and the dependent variable cannot be seen clearly, the user has to work through each of these steps by hand.
- The whole process is time-consuming, labor-intensive, and inefficient. Moreover, within the full input data set the dependent variable may relate to the independent variables differently in different regions of the data; traditional methods cannot analyze data regions separately, so accurate and efficient analysis is hard to achieve.
- The technical problem solved by the invention is to provide a method for obtaining the regression relationship between a dependent variable and independent variables in data analysis. The method efficiently obtains the optimal correspondence between the input dependent variable and independent variables, for use in subsequent data prediction.
- The method includes the following steps:
- Step 1: Standardize the dependent variable and independent variables input by the user, and save the result for later use;
- Step 2: Perform regression analysis on the data: identify groups of data with similar features, perform a vertical selection of independent variables within each group, and obtain the causal relationship by calling a suitable linear analysis algorithm;
- Step 3: Compare the computed results with the actual results to obtain the optimal relationship between the independent variables and the dependent variable, and present the final optimal results to the user for the final selection.
- Step 1: Obtain the dependent variable and each independent variable, and compute the mean of each as its reference data β;
- Step 2: Compute the standard deviation α of each variable as its expansion coefficient.
- The formula is: $\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
- In the formula, x_1, x_2, x_3, ..., x_N are the values of the variable, and μ is their arithmetic mean;
- Step 1: Run cluster analysis on the input independent-variable data several times, once for each cluster count, to obtain one analysis result per cluster count;
- Step 2: For the result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then compute the accuracy by back-testing and keep the regression relationship with the highest accuracy; apply the same method to every data category to obtain its highest-accuracy regression relationship;
- Step 3: Compare the regression relationships of the different categories, and merge categories that have the same independent variables and similar regression coefficients into a unified regression relationship; where the independent variables differ or the regression coefficients differ too much, form an independent regression relationship for each data region;
- Step 4: Repeat steps 2 and 3 for each cluster count to obtain the optimal regression relationship and regression coefficients under each clustering.
- Cluster analysis may use the K-Means clustering algorithm, with cluster distances computed as Euclidean distances.
- The formula is: $d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}$
- The Euclidean distance d_ij is the distance between two n-dimensional vectors; for a (x11, x12, ..., x1n) and b (x21, x22, ..., x2n), i = 1 and j = 2.
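- The regression relationship may be fitted with a least-squares polynomial curve. The fit can be computed with a self-implemented routine or obtained directly by calling a general-purpose fitting tool. The fitting formula is: find the polynomial $p_n(x) = \sum_{k=0}^{n} a_k x^k$ that minimizes $\sum_{i=0}^{m}[p_n(x_i) - y_i]^2$ over the given data points (x_i, y_i), i = 0, 1, ..., m.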
- Step 1: For each cluster count, take the optimal regression relationship and regression coefficients found, and determine the best accuracy, or the top few accuracies; present the analysis results to the user as a data basis for the user's final choice;
- Step 2: For the optimal result selected by the user, provide the standardization conversion formulas for the independent and dependent variables, the center of each cluster, and the selected regression independent variables and regression coefficients, for use in the final data prediction;
- Step 3: The user applies the provided standardization formulas, cluster centers, regression independent variables, and regression coefficients to new data: the independent variables of a new input are first standardized and then compared with each cluster center to select the nearest data region; that region's independent variables and regression coefficients are applied to obtain the standardized predicted value, and the predicted value in original units is then recovered by inverting the standardization formula.
- The invention exploits the computer's ability to compute continuously and to back-test prediction results. Standardizing the data improves accuracy; clustering partitions the data horizontally into regions for regression; automatic selection of independent variables then performs the vertical computation. Together these yield the optimal regression result of the data analysis and form the final result used for data prediction.
- The method quickly and directly finds the optimal causal relationship for the user, greatly improves the efficiency of obtaining the regression relationship between the dependent variable and the independent variables, and provides an efficient way of obtaining the relationship between multiple independent variables and a dependent variable. It thereby improves the analysis of the principal components of the dependent variable and multiple independent variables during regression analysis, simplifies the regression analysis process, and improves the efficiency with which the dependent and independent variables are obtained.
- Figure 1 is a flow chart of obtaining the optimal relationship between the dependent variable and the independent variables in the present invention.
- The invention analyzes the dependent variable and multiple independent variables input by the user and standardizes the data, saving the standardization results of each variable for subsequent data prediction. The data is first classified horizontally to identify groups with similar features; independent variables are then selected vertically within each group; a suitable linear analysis algorithm is called to obtain the causal relationship; the computed results are compared with the actual results to find the optimal relationship between certain independent variables and the dependent variable; and the final optimal results are presented to the user for the final selection.
- With this method the user can efficiently obtain, from multiple independent variables, the optimal causal relationship with the dependent variable, which greatly improves the efficiency of obtaining the regression relationship between the dependent variable and the independent variables. The method serves as a way of obtaining the relationships among the principal causal components when optimizing the data analysis process.
- For the input dependent variable and multiple independent variables, every input series must be standardized: all input variables, including the dependent variable, are first converted to standard data, and linear regression is then performed, so that the regression coefficients obtained on the standardized data better reflect the importance of the corresponding independent variables;
- Data standardization may use the general conversion formula Z′ = αZ + β, where Z′ is the standard data, β is the reference data, generally equal to the mean X_bar of the original data, and α is the expansion coefficient, generally equal to the standard deviation S of the original data.
- Once both the dependent and independent variables have been standardized, multi-category cluster analysis is performed on the independent-variable data.
- The purpose of cluster analysis is to discover the features of the data in each category, so that a clear regression-coefficient relationship can be obtained from data with distinct features. If the regression coefficients obtained for the different categories differ little, the results can be regarded as consistent and treated as one unified regression relationship. If the regression coefficients differ substantially, the data categories follow different regression relationships in different regions; when the regression results are later used, a new data point is compared with the computed cluster centers and the regression relationship of the nearest cluster center is used for prediction.
- Within each category produced by clustering, subsets of the independent variables are selected in a loop and regressed against the dependent variable to obtain regression coefficients; the category's own data is then used to back-test the regression and compute its accuracy, so that the optimal relationship between independent variables and the dependent variable, together with its regression coefficients, is found from among the multiple independent variables. The same method is applied to every category, so that each category of data obtains a definite regression relationship.
- The independent variables selected for each category and their regression coefficients are then compared. If the selected independent variables are the same and their regression coefficients differ little, the categories can be merged into one unified regression relationship; this indicates that the data follows a single regression relationship, and the regression process has selected the optimal relationship between the optimal independent variables and the dependent variable.
- If the optimal independent variables selected for the categories differ, or their regression coefficients differ greatly, the relationship between the input independent variables and the dependent variable differs from region to region, and a separate regression relationship must be used for each. In that case the data center point of each category and the regression independent variables and coefficients of each category must be saved for later evaluation of each region's regression relationship.
- Clustering of the input independent-variable data and regression of the selected independent variables against the dependent variable can both be implemented programmatically, by calling the R language or a self-implemented program; calling such ready-made methods improves the efficiency of selecting independent variables and analyzing their relationship with the dependent variable.
- Most importantly, after the regression coefficients of each region's regression relationship have been obtained, the regression results must be summarized and the regression relationships unified, so as to optimize the computation of the final regression relationship.
- The optimal regression relationship and regression coefficients for each cluster count are obtained and compared across cluster counts. Finally, the user is shown the optimal cluster classification together with each region's center data, regression independent variables, and regression coefficients, which express the optimal relationship between the dependent variable and the independent variables.
- Given each region's center data, regression independent variables, and regression coefficients, together with the standardization parameters of each variable, a new data point to be predicted is first compared with the center data of each category to select the nearest region; that region's regression variables and regression coefficients are then applied to obtain the final prediction.
- the implementation of the present invention mainly includes three parts, data standardization, horizontal and vertical regression analysis of data, and obtaining an optimal correspondence.
- the specific steps of the three parts are as follows:
- Step 1: Obtain the dependent variable and each independent variable, and compute the mean X_bar of each as its reference data β;
- Step 2: Compute the standard deviation of each variable as its expansion coefficient α.
- The formula is: $\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
- Step 4: Save the reference data and expansion coefficient of the dependent variable and of each independent variable, for standardizing new data during later prediction;
- The dependent variable and the independent variables are recomputed in standardized form, so that the final regression coefficients better reflect the importance of the corresponding dependent and independent variables;
- Step 1: Run cluster analysis on the input independent-variable data several times, once for each cluster count, to obtain one analysis result per cluster count.
- Cluster analysis may use the K-Means clustering algorithm, with cluster distances computed as Euclidean distances.
- the Euclidean distance represents the distance between two n-dimensional vectors a (x11, x12, ..., x1n) and b (x21, x22, ..., x2n), such as two points a (x1, y1) on a two-dimensional plane.
- Step 2: For the result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then compute the accuracy by back-testing and keep the regression relationship with the highest accuracy; apply the same method to every data category to obtain its highest-accuracy regression relationship. The regression relationship may be fitted with a least-squares polynomial curve, and the fit can be computed with a self-implemented routine or obtained directly by calling a general-purpose fitting tool.
- The fitting formula is: find the polynomial $p_n(x) = \sum_{k=0}^{n} a_k x^k$ that minimizes $\sum_{i=0}^{m}[p_n(x_i) - y_i]^2$ over the given data points (x_i, y_i), i = 0, 1, ..., m.
- Step 3: Compare the regression relationships of the different categories, and merge categories that have the same independent variables and similar regression coefficients into a unified regression relationship; where the independent variables differ or the regression coefficients differ too much, form an independent regression relationship for each data region;
- Step 4 Repeat steps 2 and 3 to analyze the regression relationship of different data cluster numbers, and obtain the optimal regression relationship and regression coefficient under each cluster data;
- Step 1: For each cluster count, take the optimal regression relationship and regression coefficients found, and determine the best accuracy, or the top few accuracies; present the analysis results to the user as a data basis for the user's final choice;
- Step 2 For the optimal result selected by the user, a standardized conversion formula of the independent variable and the dependent variable is provided, and the center of each cluster and the analyzed regression independent variable and regression coefficient are used for the final data prediction;
- Step 3: The user applies the provided standardization formulas for the independent and dependent variables, the cluster centers, and the selected regression independent variables and regression coefficients. When new data is input for prediction, its independent variables are first standardized and then compared with each cluster center to select the nearest data region; that region's independent variables and regression coefficients are applied to obtain the standardized predicted value, and the predicted value in original units is then recovered by inverting the standardization formula.
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of data analysis and processing, and in particular to a method for obtaining regression relationships between dependent variables and independent variables during data analysis. The method comprises: analyzing the dependent variables and a plurality of independent variables input by a user and standardizing the data; classifying the data to identify similar data characteristics; selecting independent variables within the similar data characteristics; obtaining a causal relationship by calling a suitable linear analysis algorithm; comparing the computed results with the actual results to find the optimal relationships between certain independent variables and the dependent variables; and finally displaying the optimal results to the user for final selection. The invention addresses the inability of existing methods to analyze data regions separately and the resulting difficulty of achieving accurate, efficient analysis. The method can be used to obtain regression relationships between dependent variables and independent variables.
Description
The invention relates to the technical field of data analysis and processing, and in particular to a method for obtaining the regression relationship between a dependent variable and independent variables in data analysis.
Regression analysis is a frequently used method in data analysis. In the traditional regression process, the user selects the independent and dependent variables according to some assumed model, enters the data manually, analyzes the results one by one, and then checks the regression coefficients of each result and how accurately the independent variables reproduce the actual dependent variable. When the relationship between multiple independent variables and the dependent variable cannot be seen clearly, the user has to work through each of these steps by hand. The whole process is time-consuming, labor-intensive, and inefficient. Moreover, within the full input data set the dependent variable may relate to the independent variables differently in different regions of the data; traditional methods cannot analyze data regions separately, so accurate and efficient analysis is hard to achieve.
Summary of the invention
The technical problem solved by the invention is to provide a method for obtaining the regression relationship between a dependent variable and independent variables in data analysis. The method efficiently obtains the optimal correspondence between the input dependent variable and independent variables, for use in subsequent data prediction.
The technical solution by which the invention solves the above problem is as follows.
The method comprises the following steps:
Step 1: Standardize the dependent variable and independent variables input by the user, and save the result for later use;
Step 2: Perform regression analysis on the data: identify groups of data with similar features, perform a vertical selection of independent variables within each group, and obtain the causal relationship by calling a suitable linear analysis algorithm;
Step 3: Compare the computed results with the actual results to obtain the optimal relationship between the independent variables and the dependent variable, and present the final optimal results to the user for the final selection.
The specific steps of the data standardization are:
Step 1: Obtain the dependent variable and each independent variable, and compute the mean of each as its reference data β;
Step 2: Compute the standard deviation α of each variable as its expansion coefficient. The formula is:
$\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
where x_1, x_2, x_3, ..., x_N are the values of the variable and μ is their arithmetic mean;
Step 3: For the dependent variable and each independent variable, compute the standardized value with the formula Z′ = αZ + β, where Z′ is the standard data, β is the reference data, and α is the expansion coefficient.
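The standardization steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation. The text gives the general form Z′ = αZ + β; the sketch instead assumes the conventional z-score reading, in which the mean β is subtracted and the result is divided by the standard deviation α, so that α·z + β maps a standardized value back to original units. The function names `fit_standardizer`, `standardize`, and `destandardize` are hypothetical.

```python
import math

def fit_standardizer(values):
    """Return (beta, alpha): the mean and the population standard deviation."""
    n = len(values)
    beta = sum(values) / n
    alpha = math.sqrt(sum((v - beta) ** 2 for v in values) / n)
    return beta, alpha

def standardize(values, beta, alpha):
    """Assumed z-score reading: subtract the mean, divide by the std.
    A constant variable (alpha == 0) would need special handling."""
    return [(v - beta) / alpha for v in values]

def destandardize(z_values, beta, alpha):
    """Inverse mapping back to original units: alpha * z + beta."""
    return [alpha * z + beta for z in z_values]

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
beta, alpha = fit_standardizer(x)        # saved for later prediction
z = standardize(x, beta, alpha)
restored = destandardize(z, beta, alpha)
```

Saving `beta` and `alpha` per variable is what allows a later prediction to be standardized on input and converted back on output.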
The specific steps of the data regression analysis are:
Step 1: Run cluster analysis on the input independent-variable data several times, once for each cluster count, to obtain one analysis result per cluster count;
Step 2: For the result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then compute the accuracy by back-testing and keep the regression relationship with the highest accuracy; apply the same method to every data category to obtain its highest-accuracy regression relationship;
Step 3: Compare the regression relationships of the different categories, and merge categories that have the same independent variables and similar regression coefficients into a unified regression relationship; where the independent variables differ or the regression coefficients differ too much, form an independent regression relationship for each data region;
Step 4: Repeat steps 2 and 3 for each cluster count to obtain the optimal regression relationship and regression coefficients under each clustering.
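Step 2 above, applied to the rows of a single category, might look like the following sketch. Several choices here are assumptions rather than details from the source: the regression is ordinary linear least squares solved through the normal equations, "accuracy" is measured as mean absolute error on held-out rows, every non-empty subset of variables is tried, and ties go to the smaller subset. All function names are hypothetical.

```python
from itertools import combinations

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is non-singular (no perfectly collinear columns)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(rows, y, cols):
    """Least-squares coefficients [intercept, b_1, ...] for the chosen columns."""
    design = [[1.0] + [r[c] for c in cols] for r in rows]
    k = len(cols) + 1
    AtA = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    Aty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear(AtA, Aty)

def predict(coef, row, cols):
    return coef[0] + sum(b * row[c] for b, c in zip(coef[1:], cols))

def best_subset(train_X, train_y, test_X, test_y):
    """Try every non-empty subset of columns; keep the lowest back-test error."""
    best = None
    n_vars = len(train_X[0])
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            coef = fit_ols(train_X, train_y, cols)
            err = sum(abs(predict(coef, r, cols) - t)
                      for r, t in zip(test_X, test_y)) / len(test_y)
            if best is None or err < best[0] - 1e-9:
                best = (err, cols, coef)
    return best

# Toy category: y depends only on column 0 (y = 1 + 2 * x0).
train_X = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 1.0), (4.0, 2.0)]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
test_X = [(5.0, 0.0), (6.0, 4.0)]
test_y = [11.0, 13.0]
best = best_subset(train_X, train_y, test_X, test_y)
```

Exhaustive subset search grows exponentially with the number of independent variables, so a real implementation would likely cap the subset size or use a stepwise search.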
Cluster analysis may use the K-Means clustering algorithm, with cluster distances computed as Euclidean distances. The formula is:
$d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}$
The Euclidean distance d_ij is the distance between two n-dimensional vectors; for a (x11, x12, ..., x1n) and b (x21, x22, ..., x2n), i = 1 and j = 2.
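To illustrate clustering the same data under several cluster counts, here is a small pure-Python K-Means sketch using the Euclidean distance above. It is only a sketch: the farthest-first initialization, the iteration cap, and the function names `euclidean` and `kmeans` are assumptions, since the source does not specify them.

```python
import math

def euclidean(a, b):
    """d(a, b) = sqrt(sum_k (a_k - b_k)^2)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(euclidean(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# One clustering result per cluster count, as in step 1 of the regression analysis.
results = {k: kmeans(points, k) for k in (1, 2, 3)}
```

Each entry of `results` holds the cluster centers and per-point labels for one cluster count; the centers are what a later prediction compares new data against.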
The regression relationship may be fitted with a least-squares polynomial curve. The fit can be computed with a self-implemented routine or obtained directly by calling a general-purpose fitting tool. The fitting formula is as follows.
Suppose the data points (x_i, y_i), i = 0, 1, 2, …, m, are given, and let Φ be the class of all polynomials of degree at most n (n ≤ m). We seek $p_n(x) = \sum_{k=0}^{n} a_k x^k \in \Phi$ such that $I = \sum_{i=0}^{m}[p_n(x_i) - y_i]^2 = \min$. The polynomial p_n(x) satisfying this condition is called the least-squares fitting polynomial. Substituting the (x_i, y_i) values and setting the partial derivative of I with respect to each coefficient to zero yields n + 1 linear equations in a_0, a_1, a_2, …, a_n; solving this system gives the specific values of a_0, a_1, a_2, …, a_n.
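As a hedged illustration of how the coefficients follow from the minimization, the standard normal-equation derivation is sketched below. Here I denotes the sum of squared residuals, and the index j is a notational assumption not taken from the source.

```latex
\begin{align}
I(a_0,\dots,a_n) &= \sum_{i=0}^{m}\Bigl(\sum_{k=0}^{n} a_k x_i^{k} - y_i\Bigr)^{2} \\
\frac{\partial I}{\partial a_j} &= 2\sum_{i=0}^{m}\Bigl(\sum_{k=0}^{n} a_k x_i^{k} - y_i\Bigr)x_i^{j} = 0,
  \qquad j = 0, 1, \dots, n \\
\sum_{k=0}^{n}\Bigl(\sum_{i=0}^{m} x_i^{\,j+k}\Bigr) a_k &= \sum_{i=0}^{m} x_i^{\,j}\, y_i,
  \qquad j = 0, 1, \dots, n
\end{align}
```

For example, fitting a line (n = 1) to the points (0, 1), (1, 3), (2, 5) gives the two equations 3a_0 + 3a_1 = 9 and 3a_0 + 5a_1 = 13, whose solution is a_0 = 1 and a_1 = 2.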
The specific steps of obtaining the optimal relationship between the independent variables and the dependent variable are:
Step 1: For each cluster count, take the optimal regression relationship and regression coefficients found, and determine the best accuracy, or the top few accuracies; present the analysis results to the user as a data basis for the user's final choice;
Step 2: For the optimal result selected by the user, provide the standardization conversion formulas for the independent and dependent variables, the center of each cluster, and the selected regression independent variables and regression coefficients, for use in the final data prediction;
Step 3: The user applies the provided standardization formulas, cluster centers, regression independent variables, and regression coefficients to new data: the independent variables of a new input are first standardized and then compared with each cluster center to select the nearest data region; that region's independent variables and regression coefficients are applied to obtain the standardized predicted value, and the predicted value in original units is then recovered by inverting the standardization formula.
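The prediction procedure in step 3 can be sketched as below. The layout of the saved `model` dictionary, its key names, and the function `predict_new` are hypothetical; the sketch also assumes z-score standardization (subtract the saved mean, divide by the saved standard deviation) and a linear regression with an intercept in each region.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def predict_new(x_raw, model):
    """Standardize, pick the nearest cluster, apply its regression, de-standardize."""
    # 1. Standardize each independent variable with its saved (mean, std).
    z = [(v - m) / s for v, (m, s) in zip(x_raw, model["x_scalers"])]
    # 2. Select the region whose cluster center is nearest.
    region = min(model["regions"], key=lambda r: euclidean(z, r["center"]))
    # 3. Apply that region's selected variables and coefficients.
    y_std = region["intercept"] + sum(
        b * z[i] for i, b in zip(region["columns"], region["coefs"]))
    # 4. Map the standardized prediction back to original units.
    y_mean, y_sd = model["y_scaler"]
    return y_sd * y_std + y_mean

# Hypothetical saved result: two regions with different regression relationships.
model = {
    "x_scalers": [(10.0, 2.0), (0.0, 1.0)],   # (mean, std) per independent variable
    "y_scaler": (100.0, 10.0),                # (mean, std) of the dependent variable
    "regions": [
        {"center": (-1.0, 0.0), "columns": (0,), "coefs": (2.0,), "intercept": 0.5},
        {"center": (1.0, 0.0), "columns": (0, 1), "coefs": (1.0, -1.0), "intercept": 0.0},
    ],
}
y_a = predict_new([8.0, 0.0], model)    # falls in the first region
y_b = predict_new([14.0, 1.0], model)   # falls in the second region
```

Keeping the standardization parameters, cluster centers, and per-region coefficients together in one saved structure mirrors what step 2 says is handed to the user.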
The beneficial effects of the invention are:
The invention exploits the computer's ability to compute continuously and to back-test prediction results. Standardizing the data improves accuracy; clustering partitions the data horizontally into regions for regression; automatic selection of independent variables then performs the vertical computation. Together these yield the optimal regression result of the data analysis and form the final result used for data prediction. The method quickly and directly finds the optimal causal relationship for the user, greatly improves the efficiency of obtaining the regression relationship between the dependent variable and the independent variables, and provides an efficient way of obtaining the relationship between multiple independent variables and a dependent variable. It thereby improves the analysis of the principal components of the dependent variable and multiple independent variables during regression analysis, simplifies the regression analysis process, and improves the efficiency with which the dependent and independent variables are obtained.
The invention is further described below with reference to the accompanying drawing:
Figure 1 is a flow chart of obtaining the optimal relationship between the dependent variable and the independent variables in the present invention.
The invention analyzes the dependent variable and multiple independent variables input by the user and standardizes the data, saving the standardization results of each variable for subsequent data prediction. The data is first classified horizontally to identify groups with similar features; independent variables are then selected vertically within each group; a suitable linear analysis algorithm is called to obtain the causal relationship; the computed results are compared with the actual results to find the optimal relationship between certain independent variables and the dependent variable; and the final optimal results are presented to the user for the final selection. With this method the user can efficiently obtain, from multiple independent variables, the optimal causal relationship with the dependent variable, which greatly improves the efficiency of obtaining the regression relationship between the dependent variable and the independent variables. The method serves as a way of obtaining the relationships among the principal causal components when optimizing the data analysis process.
The input dependent variable and independent variables must each be standardized: every input variable, including the dependent variable, is first converted to standard data, and linear regression analysis is then performed, so that the regression coefficients obtained from the standardized data better reflect the importance of the corresponding independent variables. Standardization may use the general conversion formula Z′ = αZ + β, where Z′ is the standard data, β is the reference data, generally equal to the mean X_bar of the original data, and α is the expansion coefficient, generally equal to the standard deviation S of the original data.
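The standardization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the conventional z-score reading, in which a value is standardized as (x − β)/α and the mapping αZ + β converts a standardized value back to original units. The function names are hypothetical, and the population standard deviation (division by N) is an assumption.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (beta, alpha): the mean (reference data) and the
    population standard deviation (expansion coefficient)."""
    beta = fmean(values)
    alpha = pstdev(values)  # assumption: divide by N, not N - 1
    if alpha == 0:
        raise ValueError("a constant variable cannot be standardized")
    return beta, alpha

def standardize(values, beta, alpha):
    """Map original values to standard data (z-score convention)."""
    return [(v - beta) / alpha for v in values]

def destandardize(z_values, beta, alpha):
    """Apply alpha * Z + beta to return to original units."""
    return [alpha * z + beta for z in z_values]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
beta, alpha = fit_standardizer(raw)   # saved for later predictions
z = standardize(raw, beta, alpha)
restored = destandardize(z, beta, alpha)
```

Keeping `beta` and `alpha` per variable is what allows new data to be standardized in the same way at prediction time.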
Once the dependent and independent variables have all been standardized, a multi-category cluster analysis is performed on the data of the independent variables. The purpose of the cluster analysis is to discover the characteristics of the data in each category, so that clear regression coefficients can be obtained from data with distinct characteristics. If the regression coefficients obtained for the different categories differ only slightly, the results can be regarded as consistent and treated as a single unified regression relationship. If the regression coefficients differ considerably, the categories have different regression relationships in their respective regions; when the regression results are later used, the new data is compared with the computed cluster centers and the regression relationship of the nearest cluster center is used for prediction.
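The clustering and nearest-center lookup can be illustrated with a small K-Means sketch using Euclidean distance. It is a simplified version under stated assumptions: the initial centers are simply the first k points (real implementations usually randomize or use a smarter seeding), and all function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(point, centers):
    """Index of the cluster center closest to the point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def kmeans(points, k, max_iter=100):
    """Plain K-Means; returns (centers, labels)."""
    centers = [list(p) for p in points[:k]]  # assumption: naive seeding
    labels = [0] * len(points)
    for _ in range(max_iter):
        labels = [nearest_center(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
region = nearest_center((9.0, 9.0), centers)  # region used for a new point
```

Running the clustering for several values of `k` gives the different cluster counts that the method compares.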
After the independent variables have been clustered into a given number of categories, the method loops over each category: it selects several independent variables to form a regression relationship with the dependent variable, derives the regression coefficients, and then backtests on that category's data to compute the accuracy. In this way the optimal causal relationship between independent variables and the dependent variable, together with its regression coefficients, is selected from the multiple independent variables. The same procedure is applied to every category, so that the data of all categories form a regression relationship.
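One way to picture the per-category selection loop is an exhaustive search over variable subsets, each fitted by ordinary least squares and scored by a backtest. The sketch below makes several assumptions that the text does not specify: "accuracy" is taken to be the mean absolute error on held-out rows (lower is better), ties go to the smaller subset, and every function name is hypothetical.

```python
from itertools import combinations

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear(xtx, xty)

def predict(coeffs, row):
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

def best_subset(train, test, n_features):
    """Return (columns, coeffs, error) for the subset with the lowest backtest error."""
    best = None
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            rows = [[f[c] for c in cols] for f, _ in train]
            try:
                coeffs = fit_ols(rows, [y for _, y in train])
            except ValueError:
                continue  # skip collinear or constant subsets
            err = sum(abs(predict(coeffs, [f[c] for c in cols]) - y)
                      for f, y in test) / len(test)
            if best is None or err < best[2] - 1e-9:
                best = (cols, coeffs, err)
    return best

# Toy category: y depends only on the first variable (y = 1 + 2 * x0).
train = [((0.0, 1.0), 1.0), ((1.0, 0.0), 3.0), ((2.0, 2.0), 5.0),
         ((3.0, 1.0), 7.0), ((4.0, 3.0), 9.0)]
test = [((5.0, 0.0), 11.0), ((6.0, 2.0), 13.0)]
cols, coeffs, err = best_subset(train, test, n_features=2)
```

Exhaustive search grows exponentially with the number of variables, so for many variables a stepwise or heuristic search would be a more practical substitute.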
Once the data of all categories have formed their optimal regression relationships, the independent variables selected for each category and their regression coefficients are analyzed. If the selected independent variables are the same and their regression coefficients differ only slightly, the regression coefficients can be merged into a unified regression relationship; this also indicates that the data follow a single regression relationship and that the regression process has selected the optimal independent variables and their optimal relationship with the dependent variable. If the optimal independent variables selected for the categories differ, or their regression coefficients differ greatly, the regression relationship between the input independent variables and the dependent variable differs from region to region and different regression relationships must be used. In that case the center point of each category, together with its regression variables and coefficients, must be saved for the later computation of the regression relationship in each region.
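The merge decision can be sketched as a simple comparison of per-category models. The text does not say how "differ only slightly" is measured or how coefficients are merged, so the tolerance `tol`, the comparison against the first model, and the plain averaging below are all assumptions made for illustration, as are the dictionary layout and function names.

```python
def can_merge(models, tol=0.1):
    """True if all categories selected the same variables and every
    coefficient is within tol of the first category's coefficient."""
    first = models[0]
    for m in models[1:]:
        if m["variables"] != first["variables"]:
            return False
        if any(abs(a - b) > tol for a, b in zip(m["coeffs"], first["coeffs"])):
            return False
    return True

def merge(models):
    """Assumed merge rule: average each coefficient across categories."""
    n = len(models)
    width = len(models[0]["coeffs"])
    coeffs = [sum(m["coeffs"][i] for m in models) / n for i in range(width)]
    return {"variables": models[0]["variables"], "coeffs": coeffs}

similar = [
    {"variables": ("x1", "x3"), "coeffs": [0.50, 1.20]},
    {"variables": ("x1", "x3"), "coeffs": [0.54, 1.16]},
]
different = [
    {"variables": ("x1", "x3"), "coeffs": [0.50, 1.20]},
    {"variables": ("x2",), "coeffs": [2.00]},
]
unified = merge(similar) if can_merge(similar) else None
keep_separate = not can_merge(different)
```

When `can_merge` is false, each category keeps its own center, variables, and coefficients for region-specific prediction.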
The clustering of the input independent-variable data, and the regression analysis of the selected independent variables against the dependent variable, can be implemented programmatically by calling the R language or a self-implemented program. Calling existing implementations improves the efficiency of selecting and analyzing the relationship between the independent variables and the dependent variable.
When the amount of input data is large, the data should be divided into more categories so that the characteristics of each region can be distinguished and the optimal causal relationship between the independent variables and the dependent variable in each region can be regressed in more detail to obtain the regression coefficients. Most importantly, once the regression coefficients of each region have been obtained, the regression results must be summarized and consolidated into unified regression relationships wherever possible, which optimizes the computation of the final regression relationship.
By repeating the horizontal and vertical computations for different numbers of clusters, the method obtains the optimal regression relationship and regression coefficients for each cluster count. The optimal results of the different cluster counts are then compared, and the user is finally given the center data, regression variables, and regression coefficients of each region under the best clustering, showing the optimal relationship between the dependent variable and the independent variables.
Once the center data, regression variables, and regression coefficients of each region under the best clustering have been obtained, they are combined with the standardization parameters of each variable to predict new data: the new input is first compared with the center data of each category to find the nearest region, and the regression variables and regression coefficients of that nearest region are then applied to produce the final prediction.
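The prediction path described above can be sketched end to end. The data layout (one dictionary per region holding a center, the indices of the selected variables, coefficients, and an intercept) and all numeric values are hypothetical; standardization again assumes the z-score convention with saved (mean, standard deviation) pairs.

```python
import math

def predict_new(raw_x, x_params, y_params, regions):
    """Standardize the input, pick the nearest region, apply its
    regression, and convert the result back to original units."""
    z = [(v - mean) / std for v, (mean, std) in zip(raw_x, x_params)]
    region = min(regions, key=lambda r: math.dist(z, r["center"]))
    z_y = region["intercept"] + sum(
        coef * z[i] for i, coef in zip(region["variables"], region["coeffs"])
    )
    y_mean, y_std = y_params
    return y_std * z_y + y_mean  # invert the standardization of y

# Hypothetical saved results of an earlier analysis.
x_params = [(10.0, 2.0), (100.0, 20.0)]   # (mean, std) per independent variable
y_params = (50.0, 5.0)                    # (mean, std) of the dependent variable
regions = [
    {"center": [-1.0, -1.0], "variables": [0], "coeffs": [0.8], "intercept": 0.1},
    {"center": [1.0, 1.0], "variables": [0, 1], "coeffs": [0.5, 0.25], "intercept": -0.2},
]
y_hat = predict_new([13.0, 120.0], x_params, y_params, regions)  # about 54.0
```

Here the input standardizes to (1.5, 1.0), which is closest to the second center, so the second region's two-variable regression is applied.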
In terms of workflow, as shown in Figure 1, the implementation of the invention consists of three parts: data standardization, horizontal and vertical regression analysis of the data, and obtaining the optimal correspondence. The specific steps of the three parts are as follows:
I. Data standardization:
Step 1: Obtain the dependent variable and each independent variable, and compute the mean X_bar of each as its reference data β;
Step 2: Compute the standard deviation of each variable as its expansion coefficient α, using the formula:
$\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
Formula description:
In the formula, x_1, x_2, x_3, ..., x_N are the values of the variable in question, μ is their arithmetic mean, and α is the resulting standard deviation.
Step 3: For the dependent variable and each independent variable, compute the standardized value with the formula Z′ = αZ + β, where Z′ is the standard data, β is the reference data, and α is the expansion coefficient;
Step 4: Save the reference data and expansion coefficient of the dependent variable and of each independent variable, for use in standardizing new data at prediction time;
Recomputing the dependent and independent variables in this way makes the resulting regression coefficients better reflect the relative importance of the corresponding variables;
II. Horizontal and vertical regression analysis of the data
Step 1: Cluster the input independent-variable data several times, each time with a different number of clusters, to obtain one analysis result per cluster count. The cluster analysis may use the K-Means clustering algorithm, and the distance between points may be computed as the Euclidean distance, with the formula:
$d(a, b) = \sqrt{\sum_{k=1}^{n}(x_{1k} - x_{2k})^2}$
Formula description:
The Euclidean distance is the distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n). For example, the Euclidean distance between two points a(x1, y1) and b(x2, y2) in the two-dimensional plane is:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
The Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space is:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
Step 2: For the analysis result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable to obtain the regression coefficients, compute the accuracy by backtesting, and keep the regression relationship with the highest accuracy. The same procedure is applied to every data category to obtain its most accurate regression relationship. The regression relationship may be obtained by least-squares polynomial curve fitting; the fitting can be self-implemented, or the fitting result can be obtained directly by calling a general-purpose fitting tool. The fitting formula is:
$P_n(x) = \sum_{k=0}^{n} a_k x^k$
Formula description:
Given data points (x_i, y_i), i = 0, 1, 2, ..., m, let $\Phi$ be the class of all polynomials of degree at most n (n ≤ m). We seek $P_n(x) = \sum_{k=0}^{n} a_k x^k \in \Phi$ such that $I = \sum_{i=0}^{m} [P_n(x_i) - y_i]^2 = \min$. The polynomial P_n(x) that satisfies this condition is called the least-squares fitting polynomial. Substituting the (x_i, y_i) values and setting the partial derivative of I with respect to each coefficient to zero gives n+1 linear equations in a_0, a_1, a_2, ..., a_n; solving this system yields the specific values of a_0, a_1, a_2, ..., a_n.
Step 3: Analyze the regression relationships of the different categories. Categories with the same independent variables and only slightly different regression coefficients are merged into a unified regression relationship; categories whose independent variables differ, or whose regression coefficients differ too much, keep independent regression relationships for their respective data regions;
Step 4: Repeat Step 2 and Step 3 for each cluster count to obtain the optimal regression relationship and regression coefficients under each clustering;
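The least-squares polynomial fit named in Step 2 can be sketched by building and solving the normal equations directly. This is a minimal illustration with hypothetical function names; a production system would more likely call an established fitting routine, and high polynomial degrees make the normal equations ill-conditioned.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def polyfit_least_squares(xs, ys, degree):
    """Coefficients [a_0, ..., a_n] minimizing sum_i (P_n(x_i) - y_i)^2.

    Normal equations, for j = 0..n:
        sum_k a_k * sum_i x_i^(j+k) = sum_i y_i * x_i^j
    """
    size = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear(a, b)

def polyval(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_n x^n."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# Points generated from y = 1 + 2x + 3x^2, so a degree-2 fit should recover it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coeffs = polyfit_least_squares(xs, ys, degree=2)
```

The backtest in Step 2 then amounts to comparing `polyval(coeffs, x)` with the actual value for each row of the category.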
III. Obtaining the optimal correspondence:
Step 1: From the optimal regression relationships and regression coefficients found for the different cluster counts, determine the one with the best accuracy, or the few with the best accuracy, and present the analysis results to the user as the data basis for the user's final choice;
Step 2: For the optimal result selected by the user, provide the standardization formulas of the independent and dependent variables, the center of each cluster, and the regression variables and regression coefficients found by the analysis, for use in the final data prediction;
Step 3: Using the provided standardization formulas, cluster centers, regression variables, and regression coefficients, when new data to be predicted is entered, the independent variables are first standardized and then compared with each cluster center. The nearest data region is selected and its independent variables and regression coefficients are applied to predict the standardized value, after which the standardization formula is inverted to recover the predicted value in original units.
Claims (10)
- A method for obtaining a regression relationship between a dependent variable and independent variables in data analysis, characterized in that the method comprises the following steps: Step 1: standardize the dependent variable and independent variables entered by the user, and save the result for later use; Step 2: perform regression analysis on the data to identify similar data features, select independent variables vertically within the similar data features, and derive the causal relationship by calling a linear analysis algorithm; Step 3: compare the computed results with the actual results to obtain the optimal relationship between the independent variables and the dependent variable, and present the final optimal result to the user for selection.
- The method according to claim 1, characterized in that the specific steps of the data standardization are: Step 1: obtain the dependent variable and each independent variable, and compute the mean of each as its reference data β; Step 2: compute the standard deviation α of each variable as its expansion coefficient, using the formula $\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where x_1, x_2, x_3, ..., x_N are the values of the variable in question and μ is their arithmetic mean; Step 3: for the dependent variable and each independent variable, compute the standardized value with the formula Z′ = αZ + β, where Z′ is the standard data, β is the reference data, and α is the expansion coefficient.
- The method according to claim 1, characterized in that the specific steps of the data regression analysis are: Step 1: cluster the input independent-variable data several times, each time with a different number of clusters, to obtain one analysis result per cluster count; Step 2: for the analysis result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable to obtain the regression coefficients, compute the accuracy by backtesting, and keep the regression relationship with the highest accuracy, applying the same procedure to every data category to obtain its most accurate regression relationship; Step 3: analyze the regression relationships of the different categories, merging categories with the same independent variables and only slightly different regression coefficients into a unified regression relationship, and keeping independent regression relationships for data regions whose independent variables differ or whose regression coefficients differ too much; Step 4: repeat Step 2 and Step 3 for each cluster count to obtain the optimal regression relationship and regression coefficients under each clustering.
- The method according to claim 1, characterized in that the specific steps of the data regression analysis are: Step 1: cluster the input independent-variable data several times, each time with a different number of clusters, to obtain one analysis result per cluster count; Step 2: for the analysis result of a given cluster count, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable to obtain the regression coefficients, compute the accuracy by backtesting, and keep the regression relationship with the highest accuracy, applying the same procedure to every data category to obtain its most accurate regression relationship; Step 3: analyze the regression relationships of the different categories, merging categories with the same independent variables and only slightly different regression coefficients into a unified regression relationship, and keeping independent regression relationships for data regions whose independent variables differ or whose regression coefficients differ too much; Step 4: repeat Step 2 and Step 3 for each cluster count to obtain the optimal regression relationship and regression coefficients under each clustering.
- The method according to claim 3, characterized in that the cluster analysis may use the K-Means clustering algorithm and the clustering distance may be computed as the Euclidean distance, with the formula $d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}$, where the Euclidean distance d_ij is the distance between two n-dimensional vectors, for example a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) for i = 1 and j = 2.
- The method according to claim 4, characterized in that the regression relationship may be obtained by least-squares polynomial curve fitting; the fitting can be self-implemented, or the fitting result can be obtained directly by calling a general-purpose fitting tool; the fitting formula is $P_n(x) = \sum_{k=0}^{n} a_k x^k$, where, given data points (x_i, y_i), i = 0, 1, 2, ..., m, and $\Phi$ the class of all polynomials of degree at most n (n ≤ m), one seeks $P_n(x) = \sum_{k=0}^{n} a_k x^k \in \Phi$ such that $I = \sum_{i=0}^{m} [P_n(x_i) - y_i]^2 = \min$; the polynomial P_n(x) that satisfies this condition is called the least-squares fitting polynomial; substituting the (x_i, y_i) values and setting the partial derivative of I with respect to each coefficient to zero gives n+1 linear equations in a_0, a_1, a_2, ..., a_n, and solving this system yields the specific values of a_0, a_1, a_2, ..., a_n.
- The method according to any one of claims 1 to 4, characterized in that the specific steps of obtaining the optimal relationship between the independent variables and the dependent variable are: Step 1: from the optimal regression relationships and regression coefficients found for the different cluster counts, determine the one with the best accuracy, or the few with the best accuracy, and present the analysis results to the user as the data basis for the user's final choice; Step 2: for the optimal result selected by the user, provide the standardization formulas of the independent and dependent variables, the center of each cluster, and the regression variables and regression coefficients found by the analysis, for use in the final data prediction; Step 3: using the provided standardization formulas, cluster centers, regression variables, and regression coefficients, when new data to be predicted is entered, first standardize the independent variables, then compare them with each cluster center, select the nearest data region, and apply that region's independent variables and regression coefficients to predict the standardized value; then invert the standardization formula to recover the predicted value in original units.
- The method according to claim 5, characterized in that the specific steps of obtaining the optimal relationship between the independent variables and the dependent variable are: Step 1: from the optimal regression relationships and regression coefficients found for the different cluster counts, determine the one with the best accuracy, or the few with the best accuracy, and present the analysis results to the user as the data basis for the user's final choice; Step 2: for the optimal result selected by the user, provide the standardization formulas of the independent and dependent variables, the center of each cluster, and the regression variables and regression coefficients found by the analysis, for use in the final data prediction; Step 3: using the provided standardization formulas, cluster centers, regression variables, and regression coefficients, when new data to be predicted is entered, first standardize the independent variables, then compare them with each cluster center, select the nearest data region, and apply that region's independent variables and regression coefficients to predict the standardized value; then invert the standardization formula to recover the predicted value in original units.
- The method according to claim 6, characterized in that the specific steps of obtaining the optimal relationship between the independent variables and the dependent variable are: Step 1: from the optimal regression relationships and regression coefficients found for the different cluster counts, determine the one with the best accuracy, or the few with the best accuracy, and present the analysis results to the user as the data basis for the user's final choice; Step 2: for the optimal result selected by the user, provide the standardization formulas of the independent and dependent variables, the center of each cluster, and the regression variables and regression coefficients found by the analysis, for use in the final data prediction; Step 3: using the provided standardization formulas, cluster centers, regression variables, and regression coefficients, when new data to be predicted is entered, first standardize the independent variables, then compare them with each cluster center, select the nearest data region, and apply that region's independent variables and regression coefficients to predict the standardized value; then invert the standardization formula to recover the predicted value in original units.
- The method according to claim 7, characterized in that the specific steps of obtaining the optimal relationship between the independent variables and the dependent variable are: Step 1: from the optimal regression relationships and regression coefficients found for the different cluster counts, determine the one with the best accuracy, or the few with the best accuracy, and present the analysis results to the user as the data basis for the user's final choice; Step 2: for the optimal result selected by the user, provide the standardization formulas of the independent and dependent variables, the center of each cluster, and the regression variables and regression coefficients found by the analysis, for use in the final data prediction; Step 3: using the provided standardization formulas, cluster centers, regression variables, and regression coefficients, when new data to be predicted is entered, first standardize the independent variables, then compare them with each cluster center, select the nearest data region, and apply that region's independent variables and regression coefficients to predict the standardized value; then invert the standardization formula to recover the predicted value in original units.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610889029.9 | 2016-10-11 | ||
CN201610889029.9A CN106650774A (en) | 2016-10-11 | 2016-10-11 | Method for obtaining the regression relationship between the dependant variable and the independent variables during data analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018068360A1 true WO2018068360A1 (en) | 2018-04-19 |
Family
ID=58856396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/106004 WO2018068360A1 (en) | 2016-10-11 | 2016-11-16 | Method for obtaining regression relationships between dependent variables and independent variables during data analysis |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106650774A (en) |
WO (1) | WO2018068360A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111383768A (en) * | 2018-12-28 | 2020-07-07 | 医渡云(北京)技术有限公司 | Regression analysis method and device for medical data, electronic equipment and readable medium |
CN111859245A (en) * | 2020-07-06 | 2020-10-30 | 东南数字经济发展研究院 | Social e-commerce user group hierarchy dividing method |
CN112380273A (en) * | 2020-11-11 | 2021-02-19 | 北京达佳互联信息技术有限公司 | Data estimation method and device |
CN112420135A (en) * | 2020-11-20 | 2021-02-26 | 北京化工大学 | Virtual sample generation method based on sample method and quantile regression |
CN113673864A (en) * | 2021-08-19 | 2021-11-19 | 中国石油化工股份有限公司 | Automatic energy distribution and transmission method |
CN114117292A (en) * | 2021-11-04 | 2022-03-01 | 中通服咨询设计研究院有限公司 | Internet big data analysis and extraction method |
CN115270386A (en) * | 2022-04-22 | 2022-11-01 | 水利部交通运输部国家能源局南京水利科学研究院 | Quantitative evaluation method and system for beach tank evolution main control factor weight |
CN115474205A (en) * | 2021-06-11 | 2022-12-13 | 中国移动通信集团云南有限公司 | Dual-path power difference redundancy obtaining method and system and electronic equipment |
CN115795229A (en) * | 2023-02-07 | 2023-03-14 | 河海大学 | Quantitative research method suitable for water-related ecosystem service feedback loop |
CN118070050A (en) * | 2024-02-27 | 2024-05-24 | 宝艺新材料股份有限公司 | Detection data processing method and system for corrugated board cartons |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578105B (en) * | 2017-08-31 | 2019-03-26 | 江苏康缘药业股份有限公司 | System Parameter Design space optimization method and device |
CN108573111A (en) * | 2018-04-27 | 2018-09-25 | 福州大学 | The approximating method of new algorithm parameter is parsed in a kind of Designing Vessel Under External |
CN109242916B (en) * | 2018-10-12 | 2022-01-07 | 昆山博泽智能科技有限公司 | Method for automatically calibrating image based on regression algorithm |
CN110210000A (en) * | 2019-04-18 | 2019-09-06 | 贵州大学 | The identification of industrial process efficiency and diagnostic method based on Multiple Non Linear Regression |
CN110595944B (en) * | 2019-08-21 | 2021-12-03 | 山东中烟工业有限责任公司 | Method and system for correcting end density data of bead blasting filter stick |
CN110991974A (en) * | 2019-12-20 | 2020-04-10 | 贵州黔岸科技有限公司 | GPS-based transportation cost intelligent accounting system and method |
CN111709152B (en) * | 2020-06-29 | 2022-11-15 | 西南交通大学 | Method for determining structural parameters of SiC field limiting ring terminal |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104123451A (en) * | 2014-07-16 | 2014-10-29 | 河海大学常州校区 | Dredging operation yield prediction model building method based on partial least squares regression |
CN105260249A (en) * | 2015-09-19 | 2016-01-20 | 中国地质大学(武汉) | Method for extracting calculation intensity features of spatial calculation domain |
CN105825288A (en) * | 2015-12-07 | 2016-08-03 | 北京师范大学 | Optimization analysis method for eliminating regression data colinearity problem of complex system |
CN105844410A (en) * | 2016-03-22 | 2016-08-10 | 国网天津市电力公司 | Method for determining danger coefficient of electric power construction field |
2016
- 2016-10-11: CN application CN201610889029.9A, published as CN106650774A (status: Pending)
- 2016-11-16: WO application PCT/CN2016/106004, published as WO2018068360A1 (status: Application Filing)
Also Published As
Publication number | Publication date |
---|---|
CN106650774A (en) | 2017-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018068360A1 (en) | | Method for obtaining regression relationships between dependent variables and independent variables during data analysis |
Muthukrishnan et al. | | LASSO: A feature selection technique in predictive modeling for machine learning |
Lee et al. | | EMMIX-uskew: an R package for fitting mixtures of multivariate skew t-distributions via the EM algorithm |
Gutmann et al. | | Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. |
US20170330078A1 (en) | | Method and system for automated model building |
WO2017215346A1 (en) | | Service data classification method and apparatus |
CN108446741B (en) | | Method, system and storage medium for evaluating importance of machine learning hyper-parameter |
CN108764726B (en) | | Method and device for making decision on request according to rules |
CN107609588A (en) | | A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal |
US20210357772A1 (en) | | System and method for time series pattern recognition |
KR101016758B1 (en) | | Method for identifying image face and system thereof |
CN117094956A (en) | | Melanoma texture feature extraction system and method |
CN108537249B (en) | | Industrial process data clustering method for density peak clustering |
CN110414562A (en) | | Classification method, device, terminal and the storage medium of X-ray |
Russell et al. | | Bayesian model averaging in model-based clustering and density estimation |
JP6398991B2 (en) | | Model estimation apparatus, method and program |
Zhang et al. | | Functional additive quantile regression |
Hejazi et al. | | Fully PCA-based approach to optimization of multiresponse-multistage problems with stochastic considerations |
Ren et al. | | Multivariate functional data clustering using adaptive density peak detection |
TWI399661B (en) | | A system for analyzing and screening disease related genes using microarray database |
Weylandt et al. | | Automatic registration and clustering of time series |
JP5876397B2 (en) | | Character assigning program, character assigning method, and information processing apparatus |
CN113408665A (en) | | Object identification method, device, equipment and medium |
CN111108516B (en) | | Evaluating input data using a deep learning algorithm |
Bolourchi et al. | | A machine learning-based data-driven approach to Alzheimer’s disease diagnosis using statistical and harmony search methods |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16918492; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/09/2019) |
122 | Ep: pct application non-entry in european phase | Ref document number: 16918492; Country of ref document: EP; Kind code of ref document: A1 |