CN104239722A

CN104239722A - Forecasting method based on recognition of correlational relationship between factors

Info

Publication number: CN104239722A
Application number: CN201410479908.5A
Authority: CN
Inventors: 于大洋; 李亚锦
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2014-09-18
Filing date: 2014-09-18
Publication date: 2014-12-24

Abstract

The invention discloses a forecasting method based on recognition of correlational relationships between factors. The method comprises the following steps: acquiring factor sample data, storing the factor index values in a data table, and building sample sequences; when calculating the correlation between any two factor index value sequences, truncating the longer sequence to the length of the shorter one if the samples differ in length, and, if items are missing from a sample, deleting each missing item together with the corresponding sample value of the other factor index; calculating the correlational relationship between the factors by calculating the distance covariance and distance variance of the factor index values based on distance correlation and obtaining the distance correlation coefficients; using a correlation sorting algorithm to sort the distance correlation coefficients between the factors, finally giving the correlation between the factors and recognizing complex correlational relationships between them; and, according to the ranking of the correlational relationships, selecting a factor that is strongly correlated with other factors and forecasting the variation of the other, strongly correlated factor indices by monitoring the index value of that factor.

Description

A forecasting method based on recognition of correlational relationships between factors

Technical field

The present invention relates to the technical field of information processing, and in particular to a forecasting method based on recognition of correlational relationships between factors.

Background art

With the development of cloud computing and Internet of Things technology, many advanced technologies and sensors are widely used for data acquisition, which makes data sources rich and data types diverse. Against such a huge volume of information, how to capture value from large, heterogeneous, and uncertain data, and how to recognize the correlational relationships that exist between factors, is a problem currently faced by every industry.

However, in the face of today's big data, traditional factor-analysis models such as artificial neural networks, expert systems, fuzzy set theory, and intelligent algorithms are limited by the error introduced when modeling complex systems and by the error inherent in the models themselves, or the models lack generality. They therefore have difficulty identifying key factors in massive data, explaining complex mechanisms, and making accurate predictions, which may lead to misjudgments or missed judgments. Among existing correlation analysis methods, the Pearson correlation coefficient can only analyze linear correlation. Although the maximal correlation coefficient and the distance correlation method can analyze both linear and nonlinear correlation, under heavy noise the results of the maximal correlation coefficient are actually worse than those of the Pearson correlation coefficient method; by contrast, distance correlation measures the correlation between variables more accurately. The present invention therefore uses distance correlation to recognize linear and nonlinear relationships between factors, discover hidden factors, improve forecasting accuracy, and provide a more scientific basis for decision-making.

The technical problem addressed by the present application is to recognize, by means of correlation, the hidden relationships that exist between data samples, to analyze a phenomenon, and then to make predictions from the correlation. A strong correlation means that when one data value increases, the other data value is very likely to increase as well. An example is Google Flu Trends: the more people in a specific geographic area search for specific terms on Google, the more people in that area have influenza.

Summary of the invention

To overcome the deficiencies of the prior art, the invention discloses a forecasting method based on recognition of correlational relationships between factors. The method uses distance correlation to automatically recognize complex correlational relationships between factors; it can effectively analyze the implicit correlations that exist between factors in big data, improve forecasting accuracy, and give decision makers a basis for their decisions.

To achieve the above object, the specific scheme of the present invention is as follows:

A forecasting method based on recognition of correlational relationships between factors, comprising the following steps:

Step 1: acquire factor sample data, store the factor index values in a data table, and build sample sequences;

Step 2: when calculating the pairwise correlation between factor index value sequences, if the factor index value samples differ in length, truncate the longer one to the length of the shorter one; if items are missing from a sample, delete each missing item together with the corresponding sample value of the other factor index;

Step 3: calculate the correlational relationship between factors: based on distance correlation, calculate the distance covariance and distance variance of the factor index values, and obtain the distance correlation coefficient from the definition of the correlation coefficient;

Step 4: use a correlation sorting algorithm to sort the distance correlation coefficients between factors, finally giving the correlation between factors and recognizing the complex correlational relationships between factors;

Step 5: according to the ranking of the correlational relationships between factors, select a factor that is strongly correlated with other factors, and forecast the changes of the other, strongly correlated factor indices by monitoring the index value of this factor: when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers.
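As a non-limiting illustration, the pairwise preparation of Step 2 may be sketched as follows. The function name `align_pair`, the treatment of `None` and NaN as "missing", and the choice to keep the leading entries when truncating are assumptions of this sketch; the method does not specify which end of the longer sequence is discarded.

```python
import math

def align_pair(x, y):
    """Truncate two index-value sequences to the shorter length, then drop
    every position where either value is missing (None or NaN)."""
    n = min(len(x), len(y))  # keep only as many items as the shorter sequence has

    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Deleting a missing item also deletes the other factor's value at that position.
    pairs = [(a, b) for a, b in zip(x[:n], y[:n]) if not (missing(a) or missing(b))]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return xs, ys

# Made-up values, for illustration only.
x = [1.0, 2.0, None, 4.0, 5.0]
y = [10.0, float("nan"), 30.0, 40.0]
print(align_pair(x, y))  # ([1.0, 4.0], [10.0, 40.0])
```

Dropping whole positions rather than single values keeps the two sequences paired, which the distance-based calculation in Step 3 relies on.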

The sample sequences include time series and non-time series, and the data are quantified. For time series, the correlational relationship between any two factors can be calculated directly; for non-time series, a target factor must first be determined, and the correlational relationships between the target factor and the other factors are then calculated.
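The two cases differ only in which factor pairs are evaluated. A minimal sketch, in which the function name `factor_pairs` and the factor labels are hypothetical:

```python
from itertools import combinations

def factor_pairs(names, target=None):
    """List the factor pairs whose correlation is to be calculated.

    target=None  -> every unordered pair of factors (time-series case)
    target=name  -> the target factor against each other factor
    """
    if target is None:
        return list(combinations(names, 2))
    return [(target, name) for name in names if name != target]

names = ["f0", "f1", "f2"]  # hypothetical factor labels
print(factor_pairs(names))               # [('f0', 'f1'), ('f0', 'f2'), ('f1', 'f2')]
print(factor_pairs(names, target="f0"))  # [('f0', 'f1'), ('f0', 'f2')]
```

With m factors, the first case evaluates m(m-1)/2 pairs and the second only m-1.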

The correlation sorting algorithm arranges the calculated correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding a ranking of the strength of correlation between factors.
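Any ordinary descending sort satisfies this description. A minimal sketch, assuming the coefficients are held in a dictionary keyed by factor name; the function name `rank_by_correlation` and the values shown are made up for illustration:

```python
def rank_by_correlation(coeffs):
    """Return (factor, coefficient) pairs ordered from strongest to weakest."""
    return sorted(coeffs.items(), key=lambda item: item[1], reverse=True)

# Illustrative coefficients only, not results from the method.
coeffs = {"factor_a": 0.31, "factor_b": 0.82, "factor_c": 0.57}
print(rank_by_correlation(coeffs))
# [('factor_b', 0.82), ('factor_c', 0.57), ('factor_a', 0.31)]
```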

The distance correlation used to calculate the correlation coefficient specifically comprises:

S1: calculate the Euclidean distance between every pair of elements within a sample: a_{j,k} = \| X_j - X_k \|, where X_j and X_k are sample values of the factor, j, k = 1, 2, ..., n, and the values a_{j,k} form the distance matrix;

S2: calculate the row means and column means of the distance matrix, and use the Euclidean distances from S1 to calculate the doubly centered distances within a single factor sample: A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, where \bar{a}_{j\cdot} is the mean of row j, \bar{a}_{\cdot k} is the mean of column k, and \bar{a}_{\cdot\cdot} is the mean of the whole distance matrix;

S3: use the doubly centered distances from S2 to calculate the distance covariance between two factor samples and the distance variance of each single factor sample;

S4: use the distance covariance and distance variances from S3 to calculate the distance correlation coefficient between the two factor samples.
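Steps S1 to S4 can be sketched with NumPy as follows. This is an illustrative sketch rather than a definitive implementation: the function names are hypothetical, the distance covariance is taken to be the mean of the elementwise product of the two doubly centered matrices (the usual sample definition, consistent with the distance variance formula of this description), and returning 0 for a constant sample is a convention chosen here.

```python
import numpy as np

def distance_matrix(x):
    """S1: pairwise Euclidean distances a[j, k] = ||x_j - x_k||."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a 1-D sample as n points in R^1
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def double_center(a):
    """S2: A[j, k] = a[j, k] - row mean j - column mean k + grand mean."""
    row = a.mean(axis=1, keepdims=True)
    col = a.mean(axis=0, keepdims=True)
    return a - row - col + a.mean()

def distance_correlation(x, y):
    """S3 and S4: squared distance covariance and variances, then dCor."""
    A = double_center(distance_matrix(x))
    B = double_center(distance_matrix(y))
    dcov2 = (A * B).mean()                  # (1/n^2) * sum_jk A[j,k] * B[j,k]
    dvar2_x = (A * A).mean()
    dvar2_y = (B * B).mean()
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0                          # assumption: constant sample -> 0
    # dCor = dCov / sqrt(dVar(X) dVar(Y)), expressed through the squared quantities
    return float(np.sqrt(max(dcov2, 0.0) / denom))

x = np.linspace(-1.0, 1.0, 201)
print(distance_correlation(x, 2 * x + 1))   # close to 1 for a linear relation
print(distance_correlation(x, x ** 2))      # clearly above 0 for a nonlinear one
print(np.corrcoef(x, x ** 2)[0, 1])         # Pearson is close to 0 here
```

The quadratic case shows why distance correlation is used: the Pearson coefficient misses a purely nonlinear dependence that the distance correlation coefficient still detects. The distance matrices need memory proportional to n squared, so very large samples may have to be subsampled or processed in blocks.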

The distance correlation coefficient is:

dCor(X, Y) = \frac{dCov(X, Y)}{\sqrt{dVar(X)\, dVar(Y)}};

where X and Y are any pair of factor indices in the sample set, and dCov(X, Y) denotes the distance covariance between the factor indices, obtained from

dCov_{n}^{2}(X, Y) = \frac{1}{n^{2}} \sum_{j,k=1}^{n} A_{j,k} B_{j,k};

dVar(X) and dVar(Y) denote the distance variances of the factor indices,

dVar_{n}^{2}(X) = dCov_{n}^{2}(X, X) = \frac{1}{n^{2}} \sum_{j,k=1}^{n} A_{j,k} A_{j,k};

where

A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},

and B_{j,k} is calculated in the same way as A_{j,k}, from the distance matrix of Y.

In these formulas, \bar{a}_{j\cdot} is the mean of row j of the distance matrix, \bar{a}_{\cdot k} is the mean of column k, \bar{a}_{\cdot\cdot} is the mean of the whole distance matrix, a_{j,k} is the Euclidean distance between factor index values, and n is the number of sample data.

By calculating the correlation coefficient between factor indices, the degree of correlation between two factors can be judged. Distance correlation has the following property: 0 ≤ dCor(X, Y) ≤ 1. When dCor(X, Y) = 0, X and Y are independent of each other; when dCor(X, Y) = 1, X and Y are completely correlated.
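The upper end of this range can be checked by hand for an exactly linear relationship. The following worked derivation is an illustrative sketch: it assumes Y = \alpha + \beta X with \beta ≠ 0 and dVar_n(X) > 0, writes b_{j,k} for the distance matrix of Y, and takes the sample distance covariance as dCov_n^2(X, Y) = \frac{1}{n^2} \sum A_{j,k} B_{j,k}.

```latex
b_{j,k} = \| Y_j - Y_k \| = |\beta| \, a_{j,k}
\;\Rightarrow\;
B_{j,k} = |\beta| \, A_{j,k}

dCov_n^2(X, Y) = \frac{1}{n^2} \sum_{j,k=1}^{n} A_{j,k} B_{j,k} = |\beta| \, dVar_n^2(X),
\qquad
dVar_n^2(Y) = \beta^2 \, dVar_n^2(X)

dCor_n(X, Y)
  = \frac{\sqrt{|\beta|} \, dVar_n(X)}{\sqrt{dVar_n(X) \cdot |\beta| \, dVar_n(X)}}
  = 1
```

The scale factor |\beta| cancels, so the coefficient does not depend on the units in which either factor index is measured.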

Beneficial effects of the present invention:

1. By analyzing the complex correlational relationships between factors, the present invention studies whether an association exists between events: if event A has occurred, it can be predicted that event B has also occurred, without having to measure or observe the occurrence of event B.

2. The method provided by the invention for recognizing complex correlational relationships between factors uses a large amount of sample data to recognize the linear or nonlinear complex correlations that exist among multiple factors, and thereby identifies hidden correlations between the factors.

3. The present invention quantifies the correlation between multiple factors in a large number of samples and uses a sorting algorithm to rank the strength of correlation between them. When forecasting by correlation, the stronger the correlation, the higher the forecasting accuracy; this improves the accuracy of forecasts and provides a scientific basis for decision-making.

Brief description of the drawings

Fig. 1 is the overall flow chart of the present invention.

Detailed description of the embodiments:

The present invention is described in detail below with reference to the accompanying drawing:

The sample sequences include time series and non-time series, and the data are quantified. For time series, the correlational relationship between any two factors can be calculated directly; for non-time series, a target factor must first be determined, and the correlational relationships between the target factor and the other factors are then calculated.

The correlation sorting algorithm arranges the calculated correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding a ranking of the strength of correlation between factors.

The distance correlation used to calculate the correlation coefficient specifically comprises:

The distance correlation coefficient is:

dCor(X, Y) = \frac{dCov(X, Y)}{\sqrt{dVar(X)\, dVar(Y)}};

where X and Y are any pair of factor indices in the sample set, and dCov(X, Y) denotes the distance covariance between the factor indices, obtained from

dCov_{n}^{2}(X, Y) = \frac{1}{n^{2}} \sum_{j,k=1}^{n} A_{j,k} B_{j,k};

dVar(X) and dVar(Y) denote the distance variances of the factor indices,

dVar_{n}^{2}(X) = dCov_{n}^{2}(X, X) = \frac{1}{n^{2}} \sum_{j,k=1}^{n} A_{j,k} A_{j,k};

where

A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},

and B_{j,k} is calculated in the same way as A_{j,k}, from the distance matrix of Y.

Data are obtained directly from a database: data acquisition collects data from each system and then stores them in the database corresponding to that system.

Recognizing correlational relationships is a data-processing task, but its application is the prediction of events, as in Google Flu Trends or book recommendations on Amazon. Correlations can be used to capture the present and predict the future. If A and B often occur together, it is only necessary to note that B has occurred in order to predict that A has also occurred. This helps capture things that may occur together with A, even when A cannot be measured or observed directly.

In Step 5, when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers, x is the index of the factor characterizing event A, and y may be the index of the factor characterizing event B; the change of y is thus predicted by monitoring the change of x.
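The description does not state how the bounds c and d are obtained. One possible reading, offered purely as an assumption of this sketch, is to take them from the historical samples whose x value fell inside (a, b); the function name `predict_range` and the data are hypothetical.

```python
def predict_range(x_hist, y_hist, a, b):
    """Return (c, d), the smallest and largest historical y observed while
    a < x < b, or None if no historical sample falls in that interval."""
    ys = [y for x, y in zip(x_hist, y_hist) if a < x < b]
    if not ys:
        return None
    return min(ys), max(ys)

# Made-up history of a monitored index x and a strongly correlated index y.
x_hist = [1, 3, 5, 7, 9]
y_hist = [10, 30, 50, 70, 90]
print(predict_range(x_hist, y_hist, 2, 8))  # (30, 70)
```

Under this reading the returned bounds are attained by observed values, so they describe a closed range; a margin would have to be added to obtain the strict inequalities c < y < d.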

In a specific embodiment of the application, as shown in Fig. 1, the overall flow is: acquire factor sample data; determine whether the factor sample data contain non-numeric values, and if so, delete those data; otherwise, calculate the Euclidean distance between the elements within each sample; calculate the row means and column means of the distance matrix; calculate the doubly centered distances within each single factor sample; then calculate the distance covariance between the two factor samples and the distance variance of each single factor sample; calculate the distance correlation coefficient between the two factor samples; sort the correlation coefficients; determine the strongly correlated factors; and finally monitor the index value of a factor to give the changes of the other factor index values strongly correlated with it.

Specific embodiment: for data sets containing linear or nonlinear relationships, the present invention uses distance correlation to measure the correlation between factors, thereby giving decision makers a basis for their decisions. The specific steps are as follows:

Step 1: store the collected factor index values in a data table and build sample sequences.

The sample consists of the season statistics of players in U.S. Major League Baseball, including 131 other offensive statistics. Because the study concerns the relationship between salary and offensive statistics, the data of pitchers and of players earning less than 400,000 dollars are removed: pitchers do not play offense, and the salaries of players earning less than 400,000 dollars are not performance-based. The factor index values of the remaining players' statistics are stored in a data table to form the sample sequences.

Step 2: when calculating the pairwise correlation between player salary and each other factor, if the data differ in length, truncate the longer sequence to the length of the shorter one. If records are missing from the data set, delete each missing item together with the corresponding value of the other factor.

Step 3: calculate the correlational relationship between salary and the other factors: based on distance correlation, calculate the distance covariance and distance variance of the variables, and obtain the distance correlation coefficient from its definition given above.

Step 4: use the correlation sorting algorithm to sort the correlation coefficients in descending order, giving the ranking of the distance correlation coefficients between factors, and finally obtain the correlation between factors and recognize the complex correlational relationships between them. The factors found to be most strongly correlated with salary level are the number of walks (bases on balls) and the number of intentional walks. This completes the process of recognizing complex correlational relationships between the factors of a large data set.

Step 5: by observing changes in the two factor index values most strongly correlated with a player's salary level, namely the number of walks and the number of intentional walks, forecast the player's current salary level. When the number of walks is in the range 7-15, the player's current salary is between 800,000 and 2,000,000 dollars.

The above is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A forecasting method based on recognition of correlational relationships between factors, comprising the following steps:

Step 4: use a correlation sorting algorithm to sort the distance correlation coefficients between factors, finally giving the correlation between factors and recognizing the complex correlational relationships between factors;

Step 5: according to the ranking of the correlational relationships between factors, select a factor that is strongly correlated with other factors, and forecast the changes of the other, strongly correlated factor indices by monitoring the index value of this factor: when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers, x is the index of the factor characterizing event A, and y may be the index of the factor characterizing event B.

2. The forecasting method based on recognition of correlational relationships between factors according to claim 1, characterized in that the sample sequences include time series and non-time series, and the data are quantified; for time series, the correlational relationship between any two factors can be calculated directly; for non-time series, a target factor must first be determined, and the correlational relationships between the target factor and the other factors are then calculated.

3. The forecasting method based on recognition of correlational relationships between factors according to claim 1, characterized in that the correlation sorting algorithm arranges the calculated correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding a ranking of the strength of correlation between factors.

4. The forecasting method based on recognition of correlational relationships between factors according to claim 1, characterized in that the distance correlation used to calculate the correlation coefficient specifically comprises:

5. The forecasting method based on recognition of correlational relationships between factors according to claim 4, characterized in that the distance correlation coefficient is:

dCor(X, Y) = \frac{dCov(X, Y)}{\sqrt{dVar(X)\, dVar(Y)}};

where X and Y are any pair of factor indices in the sample set, dCov(X, Y) denotes the distance covariance between the factor indices, and dVar(X) and dVar(Y) denote the distance variances of the factor indices.