WO2018068360A1 - Method for obtaining regression relationships between dependent variables and independent variables during data analysis

Info

Publication number: WO2018068360A1
Application number: PCT/CN2016/106004
Authority: WO
Inventors: 郑锐韬; 李勇波; 孙傲冰; 季统凯
Original assignee: 国云科技股份有限公司
Priority date: 2016-10-11
Filing date: 2016-11-16
Publication date: 2018-04-19
Also published as: CN106650774A

Abstract

The present invention relates to the technical field of data analysis and processing, and in particular to a method for obtaining regression relationships between dependent variables and independent variables during data analysis. The method comprises: standardizing the dependent variables and the plurality of independent variables input by a user; classifying the data to identify similar data characteristics; selecting independent variables within the similar data characteristics; obtaining a causal relationship by calling a related linear analysis algorithm; comparing the calculated results with the actual results to determine the optimal relationships between subsets of the independent variables and the dependent variables; and finally displaying the optimal results to the user for final selection. The present invention resolves the problems that existing methods cannot analyze data regions separately and have difficulty achieving both accuracy and efficiency in analysis. The method of the present invention can be used for obtaining regression relationships between dependent variables and independent variables.

Description

A method for obtaining regression relationship between dependent variable and independent variable in data analysis

Technical field

The invention relates to the technical field of data analysis and processing, in particular to a method for obtaining a regression relationship between a dependent variable and an independent variable in data analysis.

Background technique

Regression analysis is a frequently used method in data analysis. In the traditional regression process, the user must select the independent variables and the dependent variable according to some assumed model, enter the data manually, analyze the results one by one, and then check the regression coefficients and how accurately the independent variables reproduce the actual dependent variable. When the relationship between multiple independent variables and the dependent variable cannot be seen clearly, the user has to try the combinations one at a time. The whole process is time-consuming, labor-intensive, and inefficient. Moreover, different portions of the input data may follow different causal relationships between the dependent variable and the independent variables, so applying the traditional method directly to all of the data makes it difficult to achieve accurate and efficient analysis.

Summary of the invention

The technical problem solved by the invention is to provide a method for obtaining the regression relationship between the dependent variable and the independent variable in the data analysis; the optimal correspondence between the input dependent variable and the independent variable can be efficiently obtained, and used for future data prediction.

The technical solution of the present invention to solve the above technical problem is:

The method includes the following steps:

Step 1: Perform standardization processing on the dependent variable and the independent variable input by the user, and save the result for use;

Step 2: Perform regression analysis on the data: classify the data to identify similar data features, select independent variables within each group of similar data (the vertical direction), and obtain causal relationships by calling relevant linear analysis algorithms;

Step 3: Compare the calculated results with the actual results, obtain the optimal relationship between the independent variables and the dependent variables, and present the final optimal results to the user for the final selection.

The specific steps of the data standardization are:

Step 1: Obtain the dependent variable and each independent variable, and compute the average value of each as its reference data β;

Step 2: Compute the standard deviation α of each variable as its expansion coefficient. The formula is:

$$\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$

The values x₁, x₂, x₃, ..., x_N in the formula are the values of the variable in question, and μ is their arithmetic mean;

Step 3: For the dependent variable and each independent variable, the standardized values are obtained by the formula Z'=αZ+β, where Z' is the standard data, β is the reference data, and α is the expansion coefficient.
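The standardization steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes the conventional z-score reading, in which a raw value x maps to Z = (x − β)/α and the formula Z'=αZ+β maps a standardized value back onto the raw scale. The function names `fit_scaler`, `standardize`, and `restore` are hypothetical, and the use of the population standard deviation (divisor N) is an assumption.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (beta, alpha): the mean as reference data and the
    population standard deviation as expansion coefficient."""
    beta = fmean(values)
    alpha = pstdev(values)
    if alpha == 0:
        raise ValueError("a constant variable cannot be standardized")
    return beta, alpha


def standardize(x, beta, alpha):
    # Assumed z-score direction: raw value -> standardized value.
    return (x - beta) / alpha


def restore(z, beta, alpha):
    # The formula Z' = alpha * Z + beta, read here as the inverse map.
    return alpha * z + beta


# Illustrative values only (not taken from the source).
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
beta, alpha = fit_scaler(raw)
z = [standardize(x, beta, alpha) for x in raw]
back = [restore(v, beta, alpha) for v in z]
```

Saving `beta` and `alpha` per variable is what allows new data to be standardized later and predictions to be converted back to original units.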

The specific steps of the data regression analysis are:

Step 1: Perform cluster analysis on the input independent variable data according to different cluster numbers, and obtain a plurality of analysis results according to different cluster numbers;

Step 2: For the analysis result of a given number of clusters, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then calculate the accuracy of the relationship by backtesting, and select the regression relationship between independent variables and dependent variable with the highest accuracy; apply the same method to each data category to obtain its highest-accuracy regression relationship;

Step 3: Analyze the regression relationships of the different categories, and merge the categories that have the same independent variables and little difference in regression coefficients into a unified regression relationship; if the independent variables are different or the regression coefficients differ too much, an independent regression relationship is formed for each data region;

Step 4: Repeat steps 2 and 3 for each of the different cluster numbers, and obtain the optimal regression relationship and regression coefficients under each cluster number.

The cluster analysis can adopt the K-Means clustering algorithm, and the distance used for clustering can be calculated as the Euclidean distance. The calculation formula is as follows:

$$d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{1k}-x_{2k})^2}$$

The Euclidean distance d_ij represents the distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n).
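As a small illustration of the distance calculation, the Python sketch below computes the Euclidean distance between two n-dimensional points and uses it to find the nearest cluster center. The function names `euclidean` and `nearest_center` and the sample points are hypothetical and chosen only for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two n-dimensional points of equal length."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_center(point, centers):
    """Index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


# Illustrative values only.
centers = [(0.0, 0.0), (3.0, 4.0)]
d = euclidean((0.0, 0.0), (3.0, 4.0))      # the 3-4-5 right triangle
idx = nearest_center((2.5, 3.0), centers)  # closer to the second center
```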

The regression relationship may be fitted by least squares polynomial curve fitting. The fitting may be implemented directly, or the result may be obtained by calling a general-purpose fitting tool. The fitting formula is:

Assuming given data points (x_i, y_i) (where i = 0, 1, 2, ..., m), and letting Φ be the function class composed of all polynomials whose degree does not exceed n (n ≤ m), a polynomial

$$P_n(x) = \sum_{k=0}^{n} a_k x^k \in \Phi$$

is sought such that

$$I = \sum_{i=0}^{m}\left[P_n(x_i) - y_i\right]^2 = \sum_{i=0}^{m}\left(\sum_{k=0}^{n} a_k x_i^k - y_i\right)^2 = \min$$

The P_n(x) satisfying the min formula is called the least squares fitting polynomial. By substituting the relevant (x_i, y_i) values and taking the minimum to be 0, equations in a_0, a_1, a_2, ..., a_n are obtained; solving this multivariate system yields the specific values of a_0, a_1, a_2, ..., a_n.
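One standard way to solve for the coefficients, shown here as a sketch rather than as the procedure the text prescribes, is to set each partial derivative of the sum of squared deviations to zero, which gives the normal equations. The symbol I for that sum and the index j are notation introduced for this illustration.

```latex
I(a_0,\dots,a_n) = \sum_{i=0}^{m}\Bigl(\sum_{k=0}^{n} a_k x_i^{k} - y_i\Bigr)^{2}

\frac{\partial I}{\partial a_j}
  = 2\sum_{i=0}^{m}\Bigl(\sum_{k=0}^{n} a_k x_i^{k} - y_i\Bigr)x_i^{j} = 0,
  \qquad j = 0, 1, \dots, n

\Longrightarrow\quad
\sum_{k=0}^{n}\Bigl(\sum_{i=0}^{m} x_i^{j+k}\Bigr)a_k
  = \sum_{i=0}^{m} x_i^{j}\,y_i,
  \qquad j = 0, 1, \dots, n
```

This is a linear system of n + 1 equations in the n + 1 unknowns a_0, ..., a_n. For n = 1 it reduces to the familiar straight-line fit.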

The specific steps of obtaining the optimal relationship between the independent variable and the dependent variable are as follows:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis. When new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.
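The prediction procedure in Step 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the `predict` function, the dictionary layout of `regions`, the intercept term, and all numeric values are hypothetical, and standardization is assumed to be Z = (x − β)/α with Z'=αZ+β used to return to original units.

```python
import math


def predict(raw_x, x_scalers, y_scaler, regions):
    """Predict the original-scale dependent variable for one new observation.

    x_scalers: list of (beta, alpha), one pair per independent variable
    y_scaler:  (beta, alpha) for the dependent variable
    regions:   list of dicts with keys 'center', 'variables' (indices of the
               selected independent variables), 'intercept', and 'coefs'
    """
    # 1. Standardize the new independent-variable values.
    z = [(x - b) / a for x, (b, a) in zip(raw_x, x_scalers)]
    # 2. Select the region whose cluster center is nearest.
    region = min(regions, key=lambda r: math.dist(z, r["center"]))
    # 3. Apply that region's selected variables and coefficients.
    z_pred = region["intercept"] + sum(
        c * z[i] for i, c in zip(region["variables"], region["coefs"])
    )
    # 4. Convert back to original units with Z' = alpha * Z + beta.
    y_beta, y_alpha = y_scaler
    return y_alpha * z_pred + y_beta


# Illustrative values only.
x_scalers = [(10.0, 2.0), (100.0, 50.0)]
y_scaler = (20.0, 4.0)
regions = [
    {"center": (-1.0, -1.0), "variables": [0], "intercept": 0.0, "coefs": [0.5]},
    {"center": (1.0, 1.0), "variables": [0, 1], "intercept": 0.0, "coefs": [1.0, -0.5]},
]
y_hat = predict([12.0, 150.0], x_scalers, y_scaler, regions)
```

Here the new observation standardizes to (1.0, 1.0), falls in the second region, and yields a standardized prediction of 0.5, which converts back to 22.0.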

The beneficial effects of the invention are:

The invention uses a computer to perform the calculations continuously and to backtest the prediction results. It improves accuracy by standardizing the data, partitions the data in the horizontal direction by clustering, and then automatically performs the vertical calculation over the independent variables to obtain the optimal regression result, which is used for the final data prediction. With this method the user can quickly and directly identify the optimal causal relationship, which greatly improves the efficiency of obtaining the regression relationship between the dependent variable and the independent variables and provides an efficient way of relating a dependent variable to multiple independent variables. It improves the analysis of the principal components relating the dependent variable to multiple independent variables, simplifies the process of data regression analysis, and improves the efficiency of obtaining the relationship between dependent and independent variables.

Drawings

The present invention is further described below in conjunction with the accompanying drawings:

FIG. 1 is a flow chart of obtaining the optimal relationship between the dependent variable and the independent variables in the present invention.

Detailed description

The invention analyzes the dependent variable and the plurality of independent variables input by the user and standardizes the data, saving the standardization parameters of each dependent and independent variable for subsequent data prediction. The data is then classified in the horizontal direction to identify similar data features, and within each group of similar data the independent variables are selected in the vertical direction. A causal relationship is obtained by calling the relevant linear analysis algorithm, the calculated results are compared with the actual results, the optimal relationship between a subset of the independent variables and the dependent variable is determined, and the final optimal results are presented to the user for final selection. This method can effectively obtain the optimal causal relationship between the dependent variable and the independent variables from among multiple independent variables, greatly improving the efficiency of obtaining the regression relationship, and serves as a method for optimizing the data analysis process and obtaining the relationships among the main causal components.

For the input dependent variable and multiple independent variables, every input variable must be standardized; that is, all input variables, including the dependent variable, are first converted into standard data, and linear regression analysis is then performed. The regression coefficients obtained from standardized data better reflect the relative importance of the corresponding independent variables. The data standardization can adopt the following conversion formula: Z'=αZ+β, where Z' is the standard data; β is the reference data, generally equal to the average value X_bar of the original data; and α is the expansion factor, generally equal to the standard deviation S of the original data.

After both the dependent and independent variables have been standardized, cluster analysis is carried out on the variable data for several different numbers of categories. The purpose of the cluster analysis is to discover the characteristics of the data in each category, so that a clear regression coefficient relationship can be obtained on data with distinct characteristics. If the regression coefficients obtained for the different categories do not differ much, the data can be regarded as consistent and a unified regression causal relationship can be used. If the regression coefficients differ considerably after classification, the different categories of data follow different regression causal relationships in their respective regions. When the regression results are subsequently used, new data can be compared with the calculated cluster centers, and the regression causal relationship of the nearest cluster center is selected for prediction.

After cluster analysis with a given number of categories, for each category the procedure cycles through subsets of the independent variables, forms a regression relationship between each subset and the dependent variable, and obtains the regression coefficients. The data of that category is then used to backtest the regression and calculate its accuracy, so that the optimal causal relationship between independent variables and the dependent variable, together with its regression coefficients, is selected from among the multiple independent variables. The same method is applied to every category, so that the data of each category forms a definite regression relationship.
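The subset search described above can be sketched in pure Python as follows. This is a minimal sketch, not the patented implementation: the source does not define how backtest accuracy is measured, so mean squared error on the category's own data is used here as an assumption, and the helper names `solve`, `fit`, `backtest_error`, and `best_subset` are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y, cols):
    """Least-squares coefficients [intercept, c1, ...] for the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def backtest_error(rows, y, cols, coefs):
    """Mean squared error when the fitted relationship is replayed on the data."""
    preds = [coefs[0] + sum(c * row[j] for c, j in zip(coefs[1:], cols)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def best_subset(rows, y, max_size=None):
    """Try every non-empty subset of columns and keep the lowest backtest error.
    Ties go to the subset found first, which is the smaller one."""
    n_vars = len(rows[0])
    best = None
    for size in range(1, (max_size or n_vars) + 1):
        for cols in combinations(range(n_vars), size):
            try:
                coefs = fit(rows, y, cols)
            except ValueError:
                continue  # skip collinear subsets
            err = backtest_error(rows, y, cols, coefs)
            if best is None or err < best[0] - 1e-12:
                best = (err, cols, coefs)
    return best


# Illustrative data: y depends only on the first column (y = 2 * x0 + 1).
rows = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0), (5.0, 4.0)]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
err, cols, coefs = best_subset(rows, y)
```

Scoring on the same data used for fitting can never penalize a larger subset, so a held-out split would be a stricter backtest; the tie-breaking rule above is one simple way to prefer fewer variables.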

After the data of all categories has formed its optimal regression relationship, the independent variables selected for each category and their regression coefficients are compared. If the selected independent variables are the same and the regression coefficients of each independent variable differ little, the regression coefficients can be combined to form a unified regression relationship; this indicates that the data conforms to a single regression relationship, and the regression process has selected the optimal relationship between the independent variables and the dependent variable. If the optimal independent variables selected for the categories are different, or their regression coefficients differ greatly, the regression relationship between the input independent variables and the dependent variable is different in each region. To use the different regression relationships, the data center point of each category and the regression independent variables and coefficients of each category must be saved for subsequent calculation of the regression relationship of each region.
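The merge decision described above can be sketched in Python as follows. The source does not say how large a difference still counts as small or how coefficients are combined, so the threshold `tol` and the averaging of coefficients are assumptions made for illustration, and the function name `merge_regions` is hypothetical.

```python
def merge_regions(fits, tol=0.1):
    """Decide whether per-category fits collapse into one unified relationship.

    fits: list of (variables, coefs) per category, where `variables` is a
          tuple of selected independent-variable indices
    tol:  assumed threshold for "little difference" between coefficients
    Returns ('unified', variables, averaged_coefs) or ('per_region', fits).
    """
    first_vars = fits[0][0]
    same_vars = all(v == first_vars for v, _ in fits)
    if same_vars:
        # Group the coefficients of each variable across categories.
        columns = list(zip(*(c for _, c in fits)))
        if all(max(col) - min(col) <= tol for col in columns):
            averaged = [sum(col) / len(col) for col in columns]
            return ("unified", first_vars, averaged)
    return ("per_region", fits)


# Illustrative values only.
close = merge_regions([((0, 2), [0.50, 1.20]), ((0, 2), [0.54, 1.16])])
different_vars = merge_regions([((0,), [0.5]), ((1,), [0.5])])
far_apart = merge_regions([((0,), [0.5]), ((0,), [1.5])])
```

Refitting on the pooled data of the merged categories would be an alternative to averaging the coefficients.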

The clustering of the input independent-variable data and the regression analysis of the selected independent variables against the dependent variable can be implemented programmatically, either by calling the R language or by a self-implemented program; calling such an implementation improves the efficiency of selecting and analyzing the relationship between the independent variables and the dependent variable.
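As an illustration of the self-implemented route, the sketch below is a plain K-Means in Python using the Euclidean distance. It is a minimal sketch under stated assumptions: the function name `kmeans`, the optional `init` parameter for supplying starting centers, the random fallback initialization, and the sample points are all hypothetical.

```python
import math
import random


def kmeans(points, k, init=None, iters=100, seed=0):
    """Plain K-Means with Euclidean distance; returns (centers, labels)."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [tuple(c) for c in start]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Illustrative values only: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = kmeans(points, k=2, init=[(0.0, 0.0), (5.0, 5.0)])
```

The returned centers are the values that would be saved for assigning new data to the nearest region.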

When the amount of input data is relatively large, the data needs to be divided into more categories to distinguish the characteristics of each data region, and the optimal causal relationship between the independent variables and the dependent variable in each region is analyzed in more detail to obtain the regression coefficients. Most importantly, after the regression coefficients of each region have been obtained, the regression results need to be summarized and all regions sharing a unified regression relationship merged, in order to optimize the calculation of the final regression relationship.

By repeating the horizontal and vertical calculations with different numbers of clusters, the optimal regression relationship and regression coefficients for each cluster number are obtained and compared. Finally, the center data, regression independent variables, and regression coefficients of each region under the optimal cluster classification are presented to the user as the optimal relationship between the dependent variable and the independent variables.
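The comparison across cluster numbers can be sketched in Python as a simple ranking. The function name `rank_cluster_numbers`, the accuracy scale, the tie-breaking rule, and the sample accuracies are assumptions made for illustration; the source leaves the accuracy measure open.

```python
def rank_cluster_numbers(results, top=3):
    """Rank candidate cluster numbers by backtest accuracy, best first.

    results: dict mapping cluster number k -> accuracy (higher is better)
    Ties are broken in favor of the smaller cluster number (an assumption).
    """
    ranked = sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]


# Made-up accuracies for illustration only.
candidates = {2: 0.81, 3: 0.90, 4: 0.90, 5: 0.76}
shortlist = rank_cluster_numbers(candidates, top=2)
```

The shortlist corresponds to the several best results that are presented to the user for the final choice.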

Under the optimal cluster classification, the center data, regression independent variables, and regression coefficients of each region are combined with the standardization parameters of each variable. When new data to be predicted is input, it is first compared with the center data of each category to select the nearest region, and the regression independent variables and regression coefficients of that region are applied to obtain the final predicted value.

According to the process, as shown in FIG. 1, the implementation of the present invention mainly includes three parts, data standardization, horizontal and vertical regression analysis of data, and obtaining an optimal correspondence. The specific steps of the three parts are as follows:

First, data standardization:

Step 1: Obtain the dependent variable and each independent variable, and compute the average value X_bar of each as its reference data β;

Step 2: Compute the standard deviation of each variable as its expansion coefficient α. The formula is:

$$\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$

Formula description:

The values x1, x2, x3, ..., xN in the formula are the values of the variable in question, μ is their mean (arithmetic mean), and the standard deviation is α.

Step 3: For the dependent variable and each independent variable, the standardized value is obtained by the formula Z′=αZ+β, where Z′ is the standard data, β is the reference data, and α is the expansion coefficient;

Step 4: Save the reference data and the expansion coefficient of the dependent variable and of each independent variable for standardizing new data during subsequent prediction;

Through the above method, the dependent variable and the independent variables are rescaled, so that the final regression coefficients better reflect the relative importance of the corresponding independent variables;

Second, horizontal and vertical regression analysis of data

Step 1: Perform cluster analysis on the input independent variable data several times, using a different number of clusters each time, to obtain one analysis result per cluster number. The cluster analysis can use the K-Means clustering algorithm, and the clustering distance can be calculated as the Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{1k}-x_{2k})^2}$$

Formula description:

The Euclidean distance represents the distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n). For example, the Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane is:

$$d_{12} = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$$

The Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space is:

$$d_{12} = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$$

Step 2: For the analysis result of a given number of clusters, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then calculate the accuracy by backtesting and select the regression relationship between independent variables and dependent variable with the highest accuracy; apply the same method to each data category to obtain its highest-accuracy regression relationship. The regression relationship can be fitted by least squares polynomial curve fitting, which may be implemented directly or obtained by calling a general-purpose fitting tool. The fitting formula is:

$$I = \sum_{i=0}^{m}\left[P_n(x_i)-y_i\right]^2 = \min, \qquad P_n(x) = \sum_{k=0}^{n} a_k x^k$$

Formula description:

Assuming given data points (x_i, y_i) (where i = 0, 1, 2, ..., m), P_n(x) is sought among the polynomials whose degree does not exceed n (n ≤ m) so that the sum of squared deviations I is minimized; the P_n(x) satisfying this condition is called the least squares fitting polynomial.

Step 3: Analyze the regression relationships of the different categories, and merge the categories that have the same independent variables and little difference in regression coefficients into a unified regression relationship; if the independent variables are different or the regression coefficients differ too much, an independent regression relationship is formed for each data region;

Step 4: Repeat steps 2 and 3 for each of the different cluster numbers, and obtain the optimal regression relationship and regression coefficients under each cluster number;

Third, to obtain the optimal correspondence:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis. When new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.

Claims

A method for obtaining a regression relationship between a dependent variable and an independent variable in data analysis, characterized in that the method comprises the following steps:

Step 1: Perform standardization processing on the dependent variable and the independent variable input by the user, and save the result for use;

Step 2: Perform regression analysis on the data: classify the data to identify similar data features, select independent variables within each group of similar data (the vertical direction), and obtain causal relationships by calling relevant linear analysis algorithms;

Step 3: Compare the calculated results with the actual results, obtain the optimal relationship between the independent variables and the dependent variables, and present the final optimal results to the user for the final selection.
The method of claim 1 wherein said data normalization steps are:

Step 1: Obtain the dependent variable and each independent variable, and compute the average value of each as its reference data β;

Step 2: Compute the standard deviation α of each variable as its expansion coefficient. The formula is:

$$\alpha = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$

The values x_1, x_2, x_3, ..., x_N in the formula are the values of the variable in question, and μ is their arithmetic mean;

Step 3: For the dependent variable and each independent variable, the standardized values are obtained by the formula Z'=αZ+β, where Z' is the standard data, β is the reference data, and α is the expansion coefficient.
The method according to claim 1, wherein the specific steps of the data regression analysis are:

Step 1: Perform cluster analysis on the input independent variable data according to different cluster numbers, and obtain a plurality of analysis results according to different cluster numbers;

Step 2: For the analysis result of a given number of clusters, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then calculate the accuracy of the relationship by backtesting, and select the regression relationship between independent variables and dependent variable with the highest accuracy; apply the same method to each data category to obtain its highest-accuracy regression relationship;

Step 3: Analyze the regression relationships of the different categories, and merge the categories that have the same independent variables and little difference in regression coefficients into a unified regression relationship; if the independent variables are different or the regression coefficients differ too much, an independent regression relationship is formed for each data region;

Step 4: Repeat steps 2 and 3 for each of the different cluster numbers, and obtain the optimal regression relationship and regression coefficients under each cluster number.
The method according to claim 1, wherein the specific steps of the data regression analysis are:

Step 1: Perform cluster analysis on the input independent variable data according to different cluster numbers, and obtain a plurality of analysis results according to different cluster numbers;

Step 2: For the analysis result of a given number of clusters, select independent variables within each category, analyze the relationship between the selected independent variables and the dependent variable, and obtain the regression coefficients; then calculate the accuracy of the relationship by backtesting, and select the regression relationship between independent variables and dependent variable with the highest accuracy; apply the same method to each data category to obtain its highest-accuracy regression relationship;

Step 3: Analyze the regression relationships of the different categories, and merge the categories that have the same independent variables and little difference in regression coefficients into a unified regression relationship; if the independent variables are different or the regression coefficients differ too much, an independent regression relationship is formed for each data region;

Step 4: Repeat steps 2 and 3 for each of the different cluster numbers, and obtain the optimal regression relationship and regression coefficients under each cluster number.
The method according to claim 3, wherein said cluster analysis can adopt a K-Means clustering algorithm, and the distance used for clustering can be calculated as the Euclidean distance, the calculation formula being as follows:

$$d_{ij} = \sqrt{\sum_{k=1}^{n}(x_{1k}-x_{2k})^2}$$

The Euclidean distance d_ij represents the distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n).
The method according to claim 4, wherein said regression relationship is fitted by least squares polynomial curve fitting, and the fitting can be implemented directly or the fitting result can be obtained directly by calling a general-purpose fitting tool; the fitting formula is:

Assuming given data points (x_i, y_i) (where i = 0, 1, 2, ..., m), and letting Φ be the function class composed of all polynomials whose degree does not exceed n (n ≤ m), a polynomial

$$P_n(x) = \sum_{k=0}^{n} a_k x^k \in \Phi$$

is sought such that

$$I = \sum_{i=0}^{m}\left[P_n(x_i) - y_i\right]^2 = \sum_{i=0}^{m}\left(\sum_{k=0}^{n} a_k x_i^k - y_i\right)^2 = \min$$

The P_n(x) satisfying the min formula is called the least squares fitting polynomial. By substituting the relevant (x_i, y_i) values and taking the minimum to be 0, equations in a_0, a_1, a_2, ..., a_n are obtained; solving this multivariate system yields the specific values of a_0, a_1, a_2, ..., a_n.
The method according to any one of claims 1 to 4, characterized in that the specific steps of obtaining the optimal relationship between the independent variable and the dependent variable are:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis; when new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.
The method according to claim 5, wherein the step of obtaining an optimal relationship between the independent variable and the dependent variable is as follows:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis; when new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.
The method according to claim 6, wherein the step of obtaining an optimal relationship between the independent variable and the dependent variable is as follows:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis; when new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.
The method according to claim 7, wherein the step of obtaining an optimal relationship between the independent variable and the dependent variable is as follows:

Step 1: Compare the optimal regression relationship and regression coefficients obtained for each cluster number, and identify the result with the best accuracy, or the several results with the best accuracy; present these results to the user to provide a data basis for the user's final choice;

Step 2: For the optimal result selected by the user, provide the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis, for use in the final data prediction;

Step 3: The user is provided with the standardized conversion formulas of the independent variables and the dependent variable, the center of each cluster, and the regression independent variables and regression coefficients obtained from the analysis; when new data to be predicted is input, the independent variables are first standardized and then compared with the cluster centers to select the nearest data region; the independent variables and regression coefficients of that region are applied to obtain the standardized predicted value, and the predicted original value is then derived through the standardization formula.