CN104239722A - Forecasting method based on recognition of correlational relationship between factors - Google Patents
- Publication number
- CN104239722A
- Authority
- CN
- China
- Prior art keywords
- factor
- distance
- sample
- correlationship
- factors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a forecasting method based on recognition of the correlational relationship between factors. The method comprises the following steps: acquiring factor sample data, storing the factor index values in a data table, and building sample sequences; when computing the correlation between any two factor index value sequences, truncating the longer sequence to the length of the shorter one if their lengths differ, and, if values are missing from a sample, deleting each missing item together with the corresponding value of the other factor index; computing the correlation between factors by calculating, based on distance correlation, the distance covariance and distance variance of the factor index values and obtaining the distance correlation coefficients; sorting the distance correlation coefficients between factors with a correlation sorting algorithm, which gives the correlation between factors and recognizes complex correlational relationships between them; and, according to this ranking, selecting a factor that is strongly correlated with other factors and forecasting the variation of those strongly correlated factor indices by monitoring the index value of the selected factor.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a forecasting method based on recognition of the correlational relationship between factors.
Background
With the development of cloud computing and Internet of Things technology, a variety of advanced technologies and sensors are widely used in data acquisition, which makes data sources rich and varied and data types diverse. Against the background of such a huge volume of information, how to capture value from large, heterogeneous, and uncertain data and identify the correlations that exist between factors is a problem currently faced by all industries.
However, in the face of today's big data, traditional factor-correlation analysis models such as artificial neural networks, expert systems, fuzzy set theory, and intelligent algorithms are limited by the difficulty of modeling complex systems and by the errors inherent in the models themselves, or they lack generality. They therefore struggle to identify key factors in massive data, explain complex mechanisms, and make accurate predictions, which may lead to misjudgments or missed judgments. Among existing correlation analysis methods, the Pearson correlation coefficient can only capture linear correlation. Although the maximum correlation coefficient and the distance correlation method can capture both linear and nonlinear correlation, under heavy noise the maximum correlation coefficient performs worse than the Pearson correlation coefficient; by contrast, distance correlation measures the correlation between variables more accurately. The present invention therefore uses distance correlation to identify the linear and nonlinear relationships between factors, discover hidden factors, improve prediction accuracy, and provide a more scientific basis for decision-making.
The technical problem addressed by the application is to identify, by means of correlation, the hidden relationships that exist between data samples, to analyze a phenomenon, and then to make predictions from the correlation. A strong correlation means that when one data value increases, the other is likely to increase as well. Google Flu Trends is an example: the more people in a specific geographic area search Google for specific terms, the more people in that area have influenza.
Summary of the invention
To overcome the deficiencies of the prior art, the invention discloses a forecasting method based on recognition of the correlational relationship between factors. Based on distance correlation, the method automatically recognizes complex correlations between factors, can effectively analyze the implicit correlations that exist between factors in big data, improves prediction accuracy, and provides decision makers with a basis for their decisions.
To achieve the above object, the specific scheme of the invention is as follows:
A forecasting method based on recognition of the correlational relationship between factors, comprising the following steps:
Step 1: acquire factor sample data, store the factor index values in a data table, and build the sample sequences;
Step 2: when computing the pairwise correlation between factor index value sequences, truncate the longer sequence to the length of the shorter one if the sample lengths differ; if values are missing from a sample, delete each missing item together with the corresponding sample value of the other factor index;
Step 3: compute the correlation between factors: based on distance correlation, calculate the distance covariance and distance variance of the factor index values, and obtain the distance correlation coefficient from the definition of the correlation coefficient
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}};$$
Step 4: sort the distance correlation coefficients between factors with the correlation sorting algorithm, which finally gives the correlation between factors and recognizes the complex correlations between them;
Step 5: according to the ranking of correlations between factors, select a factor that is strongly correlated with other factors, and predict the change of those strongly correlated factor indices by monitoring the index value of the selected factor; specifically, when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers.
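The preprocessing described in the second step can be sketched as follows. This is a minimal illustration rather than the patent's own implementation: the function names `is_missing` and `align_pair` are hypothetical, and treating `None`, NaN, and non-numeric entries as "missing" is an assumption.

```python
import math

def is_missing(value):
    """Assumption: None, NaN, and non-numeric entries all count as missing."""
    if value is None:
        return True
    try:
        return math.isnan(float(value))
    except (TypeError, ValueError):
        return True

def align_pair(x, y):
    """Truncate both sequences to the shorter length, then drop every
    position where either value is missing, keeping the pairs aligned."""
    n = min(len(x), len(y))
    xs, ys = [], []
    for xv, yv in zip(x[:n], y[:n]):
        if is_missing(xv) or is_missing(yv):
            continue  # delete the missing item and its counterpart
        xs.append(float(xv))
        ys.append(float(yv))
    return xs, ys

# Hypothetical data: the first sequence is longer and both contain gaps.
clean_x, clean_y = align_pair([1, 2, None, 4, 5], [10, float("nan"), 30, 40])
```

Dropping pairs rather than imputing values keeps the two sequences the same length, which the distance matrices in the later steps require.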
The sample sequences include time series and non-time series and consist of quantified (numerical) data. For time series, the correlation between any two factors can be computed directly; for non-time series, a target factor must first be determined, and the correlation between the target factor and each of the other factors is then computed.
The correlation sorting algorithm arranges the computed correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding the ranking of correlation strength between factors.
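The sorting step amounts to an ordinary descending sort. The sketch below assumes the coefficients have already been computed and are held in a dictionary keyed by factor pair; the function name `rank_correlations`, the factor names, and the coefficient values are all hypothetical.

```python
def rank_correlations(coefficients):
    """Return (factor_pair, coefficient) tuples sorted from the strongest
    to the weakest correlation."""
    return sorted(coefficients.items(), key=lambda item: item[1], reverse=True)

# Hypothetical coefficients for three factor pairs.
example = {
    ("factor_a", "factor_b"): 0.42,
    ("factor_a", "factor_c"): 0.87,
    ("factor_b", "factor_c"): 0.15,
}
ranking = rank_correlations(example)
strongest_pair, strongest_value = ranking[0]
```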
The distance correlation used to compute the correlation coefficient specifically comprises:
S1: compute the Euclidean distance between every pair of elements within a sample: $a_{j,k} = \|X_j - X_k\|$, where $X_j$ and $X_k$ are observations of the sample factor and $j, k = 1, 2, \ldots, n$; the values $a_{j,k}$ form the distance matrix;
S2: compute the row means and column means of the distance matrix, and use the Euclidean distances obtained in S1 to compute the doubly centered distances within a single factor sample: $A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}$, where $\bar{a}_{j\cdot}$ is the mean of row $j$, $\bar{a}_{\cdot k}$ is the mean of column $k$, and $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix;
S3: use the doubly centered distances obtained in S2 to compute the distance covariance between two factor samples and the distance variance of each single factor sample;
S4: use the distance covariance and distance variances obtained in S3 to compute the distance correlation coefficient between the two factor samples.
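The distance correlation coefficient is:
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}$$
where X and Y are any pair of factor indices in the sample set and dCov(X, Y) denotes the distance covariance between the factor indices, obtained from
$$\mathrm{dCov}^2(X, Y) = \frac{1}{n^2} \sum_{j,k=1}^{n} A_{j,k} B_{j,k};$$
dVar(X) and dVar(Y) denote the distance variances of the factor indices, with $\mathrm{dVar}^2(X) = \mathrm{dCov}^2(X, X)$,
where
$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},$$
and $B_{j,k}$ is computed from Y in the same way as $A_{j,k}$ is computed from X.
In these formulas, $\bar{a}_{j\cdot}$ is the mean of row $j$, $\bar{a}_{\cdot k}$ is the mean of column $k$, $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix, $a_{j,k}$ is the Euclidean distance between factor index values, and $n$ is the number of sample data points.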
By computing the correlation coefficient between factor indices, the degree of correlation between two factors can be judged. The distance correlation has the following property: 0 ≤ dCor(X, Y) ≤ 1. When dCor(X, Y) = 0, X and Y are independent of each other; when dCor(X, Y) = 1, X and Y are completely correlated.
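Steps S1 to S4 can be sketched in NumPy as follows. This is an illustrative implementation of the standard sample distance correlation, assuming the usual definitions of doubly centered distances, squared distance covariance as the mean of the elementwise product, and distance variance as the distance covariance of a sample with itself. The function names are hypothetical, and returning 0 when one sample is constant is a convention chosen here.

```python
import numpy as np

def double_centered(sample):
    """S1 + S2: pairwise Euclidean distances, then subtract the row means
    and column means and add back the grand mean."""
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D sequence as n observations of one value
    diff = x[:, None, :] - x[None, :, :]
    a = np.sqrt((diff ** 2).sum(axis=-1))  # a[j, k] = ||X_j - X_k||
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

def distance_correlation(x, y):
    """S3 + S4: squared distance covariance and variances, then dCor."""
    A = double_centered(x)
    B = double_centered(y)
    dcov2_xy = (A * B).mean()  # dCov^2(X, Y)
    dvar2_x = (A * A).mean()   # dVar^2(X)
    dvar2_y = (B * B).mean()   # dVar^2(Y)
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # convention: a constant sample has no measurable dependence
    # dCor = dCov / sqrt(dVar(X) dVar(Y)) = sqrt(dCov^2 / sqrt(dVar^2 dVar^2))
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))

# An exactly linear relationship yields a coefficient of 1 (up to rounding).
linear = distance_correlation([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
```

The distance matrices need memory quadratic in the number of observations, so very long sequences may have to be subsampled or processed with a more memory-efficient algorithm.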
Beneficial effects of the present invention:
1. By analyzing the complex correlations between factors, the invention studies whether an association exists between events; if event A has occurred, it can be predicted that event B has also occurred, without measuring or observing event B.
2. The method provided by the invention for recognizing complex correlations between factors identifies, from large volumes of sample data, the linear or nonlinear complex correlations that exist among multiple factors, and uncovers the hidden correlations between factors.
3. The invention quantifies the correlation between multiple factors in large volumes of sample data and uses a sorting algorithm to rank the correlation strength between the factors. Predictions are made from the correlations: the stronger the correlation, the higher the prediction accuracy, which improves the accuracy of prediction and provides a scientific basis for decision-making.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention.
Detailed description of the embodiments:
The present invention is described in detail below with reference to the accompanying drawing:
A forecasting method based on recognition of the correlational relationship between factors, comprising the following steps:
Step 1: acquire factor sample data, store the factor index values in a data table, and build the sample sequences;
Step 2: when computing the pairwise correlation between factor index value sequences, truncate the longer sequence to the length of the shorter one if the sample lengths differ; if values are missing from a sample, delete each missing item together with the corresponding sample value of the other factor index;
Step 3: compute the correlation between factors: based on distance correlation, calculate the distance covariance and distance variance of the factor index values, and obtain the distance correlation coefficient from the definition of the correlation coefficient
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}};$$
Step 4: sort the distance correlation coefficients between factors with the correlation sorting algorithm, which finally gives the correlation between factors and recognizes the complex correlations between them;
Step 5: according to the ranking of correlations between factors, select a factor that is strongly correlated with other factors, and predict the change of those strongly correlated factor indices by monitoring the index value of the selected factor; specifically, when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers.
The sample sequences include time series and non-time series and consist of quantified (numerical) data. For time series, the correlation between any two factors can be computed directly; for non-time series, a target factor must first be determined, and the correlation between the target factor and each of the other factors is then computed.
The correlation sorting algorithm arranges the computed correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding the ranking of correlation strength between factors.
The distance correlation used to compute the correlation coefficient specifically comprises:
S1: compute the Euclidean distance between every pair of elements within a sample: $a_{j,k} = \|X_j - X_k\|$, where $X_j$ and $X_k$ are observations of the sample factor and $j, k = 1, 2, \ldots, n$; the values $a_{j,k}$ form the distance matrix;
S2: compute the row means and column means of the distance matrix, and use the Euclidean distances obtained in S1 to compute the doubly centered distances within a single factor sample: $A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}$, where $\bar{a}_{j\cdot}$ is the mean of row $j$, $\bar{a}_{\cdot k}$ is the mean of column $k$, and $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix;
S3: use the doubly centered distances obtained in S2 to compute the distance covariance between two factor samples and the distance variance of each single factor sample;
S4: use the distance covariance and distance variances obtained in S3 to compute the distance correlation coefficient between the two factor samples.
The distance correlation coefficient is:
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}$$
where X and Y are any pair of factor indices in the sample set and dCov(X, Y) denotes the distance covariance between the factor indices, obtained from
$$\mathrm{dCov}^2(X, Y) = \frac{1}{n^2} \sum_{j,k=1}^{n} A_{j,k} B_{j,k};$$
dVar(X) and dVar(Y) denote the distance variances of the factor indices, with $\mathrm{dVar}^2(X) = \mathrm{dCov}^2(X, X)$,
where
$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot},$$
and $B_{j,k}$ is computed from Y in the same way as $A_{j,k}$ is computed from X.
In these formulas, $\bar{a}_{j\cdot}$ is the mean of row $j$, $\bar{a}_{\cdot k}$ is the mean of column $k$, $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix, $a_{j,k}$ is the Euclidean distance between factor index values, and $n$ is the number of sample data points.
By computing the correlation coefficient between factor indices, the degree of correlation between two factors can be judged. The distance correlation has the following property: 0 ≤ dCor(X, Y) ≤ 1. When dCor(X, Y) = 0, X and Y are independent of each other; when dCor(X, Y) = 1, X and Y are completely correlated.
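The background's claim that distance correlation captures nonlinear dependence that the Pearson coefficient misses can be checked on a small synthetic case. The sketch below uses made-up data (a symmetric grid of x values with y = x²) and a hypothetical helper `dcor_1d` restricted to one-dimensional samples; it is an illustration, not an experiment reported in the patent.

```python
import numpy as np

def dcor_1d(x, y):
    """Sample distance correlation for two 1-D sequences of equal length."""
    def centered(values):
        d = np.abs(values[:, None] - values[None, :])  # pairwise distances
        return (d - d.mean(axis=1, keepdims=True)
                  - d.mean(axis=0, keepdims=True) + d.mean())
    A = centered(np.asarray(x, dtype=float))
    B = centered(np.asarray(y, dtype=float))
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))

# Synthetic data: y depends on x deterministically, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])
dcor = dcor_1d(x, y)
print(f"Pearson: {pearson:.3f}, distance correlation: {dcor:.3f}")
```

On this data the Pearson coefficient is essentially zero because the grid is symmetric about zero, while the distance correlation is clearly positive, which is the behavior the method relies on.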
Data are retrieved directly from a database: data acquisition collects the data from each system and stores them in the database corresponding to that system.
The identification of correlations is a data-processing step, but its application is the prediction of events, as in Google Flu Trends or the book recommendations on Amazon. Correlations can be used to capture the present and predict the future. If A and B often occur together, then observing that B has occurred is enough to predict that A has also occurred. This helps capture what may be happening together with A even when A cannot be measured or observed directly.
In step 5, when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers, x is an index characterizing event A, and y is an index characterizing event B; the change in y is predicted by monitoring the change in x.
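One way to picture the interval rule is shown below. The patent does not say how the bounds c and d are obtained, so this sketch makes an assumption: it takes them as the smallest and largest historical y values observed while x lay strictly between a and b. The function name `predict_interval` and the data are hypothetical.

```python
def predict_interval(xs, ys, a, b):
    """Return (c, d), the range of y seen historically when a < x < b,
    or None if no historical observation falls in that interval."""
    matched = [y for x, y in zip(xs, ys) if a < x < b]
    if not matched:
        return None
    return min(matched), max(matched)

# Hypothetical paired history of a monitored factor x and a correlated factor y.
history_x = [1, 5, 8, 12, 20]
history_y = [10, 50, 80, 120, 200]
bounds = predict_interval(history_x, history_y, a=4, b=13)
```

The returned bounds are themselves observed values, so in the patent's notation c and d would sit just outside this range. A rule of this kind is only as reliable as the correlation between x and y is strong, which is why it is applied to the top-ranked factors.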
In a specific embodiment of the application, as shown in Fig. 1, the overall flow is as follows. Acquire the factor sample data and determine whether they contain non-numeric values; if so, delete those data. Otherwise, compute the Euclidean distance between every pair of elements within each sample, compute the row means and column means of the distance matrix, and compute the doubly centered distances within each single factor sample. Then compute the distance covariance between the two factor samples and the distance variance of each single factor sample, and from these compute the distance correlation coefficient between the two factor samples. Sort the correlation coefficients and determine the strongly correlated factors. Finally, monitor the index value of a factor to predict the changes in the index values of the other factors strongly correlated with it.
Specific embodiment: for data sets that contain linear or nonlinear relationships, the invention uses distance correlation to measure the correlation between factors and thereby provides decision makers with a basis for their decisions. The concrete steps are as follows:
Step 1: store the collected factor index values in a data table and build the sample sequences.
The season statistics of players in the U.S. professional baseball league are chosen as the sample, which includes 131 categories of offensive statistics. Because the subject of study is the relationship between salary and offensive statistics, the data of pitchers and of players earning less than 400,000 dollars are excluded: pitchers are not offensive players, and salaries below 400,000 dollars are not performance-based. The factor index values corresponding to the selected player statistics are stored in a data table to form the sample sequences.
Step 2: when computing the pairwise correlation between player salary and the other factors, truncate the longer sequence to the length of the shorter one if the data lengths differ. If records are missing from the data set, delete each missing item together with the corresponding value of the other factor.
Step 3: compute the correlation between salary and the other factors: based on distance correlation, calculate the distance covariance and distance variance of the variables, and obtain the distance correlation coefficient from
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}.$$
Step 4: using the correlation sorting algorithm, sort the correlation coefficients in descending order to obtain the ranking of the distance correlation coefficients between factors, which gives the correlation between factors and recognizes the complex correlations between them. The factors found to be most strongly correlated with salary level are the number of bases on balls (BB) and the number of intentional bases on balls. This completes the process of recognizing complex correlations between the factors of a large data set.
Step 5: by observing changes in the two factor index values most strongly correlated with a player's salary level, the number of bases on balls and the number of intentional bases on balls, predict the player's current salary level. When the number of bases on balls is in the range 7-15, the player's current salary is between 800,000 and 2,000,000 dollars.
The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be defined by the claims.
Claims (5)
1. A forecasting method based on recognition of the correlational relationship between factors, comprising the following steps:
Step 1: acquire factor sample data, store the factor index values in a data table, and build the sample sequences;
Step 2: when computing the pairwise correlation between factor index value sequences, truncate the longer sequence to the length of the shorter one if the sample lengths differ; if values are missing from a sample, delete each missing item together with the corresponding sample value of the other factor index;
Step 3: compute the correlation between factors: based on distance correlation, calculate the distance covariance and distance variance of the factor index values, and obtain the distance correlation coefficient from the definition of the correlation coefficient
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}};$$
Step 4: sort the distance correlation coefficients between factors with a correlation sorting algorithm, which finally gives the correlation between factors and recognizes the complex correlations between them;
Step 5: according to the ranking of correlations between factors, select a factor that is strongly correlated with other factors, and predict the change of those strongly correlated factor indices by monitoring the index value of the selected factor; specifically, when a < x < b, it can be predicted that c < y < d, where a, b, c, and d are real numbers, x is an index characterizing event A, and y is an index characterizing event B.
2. The forecasting method based on recognition of the correlational relationship between factors as claimed in claim 1, characterized in that the sample sequences include time series and non-time series and consist of quantified (numerical) data; for time series, the correlation between any two factors can be computed directly; for non-time series, a target factor must first be determined, and the correlation between the target factor and each of the other factors is then computed.
3. The forecasting method based on recognition of the correlational relationship between factors as claimed in claim 1, characterized in that the correlation sorting algorithm arranges the computed correlation coefficients between all variables in descending order, turning an unordered sequence into an ordered one and thereby yielding the ranking of correlation strength between factors.
4. The forecasting method based on recognition of the correlational relationship between factors as claimed in claim 1, characterized in that the distance correlation used to compute the correlation coefficient specifically comprises:
S1: compute the Euclidean distance between every pair of elements within a sample: $a_{j,k} = \|X_j - X_k\|$, where $X_j$ and $X_k$ are observations of the sample factor and $j, k = 1, 2, \ldots, n$; the values $a_{j,k}$ form the distance matrix;
S2: compute the row means and column means of the distance matrix, and use the Euclidean distances obtained in S1 to compute the doubly centered distances within a single factor sample: $A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}$, where $\bar{a}_{j\cdot}$ is the mean of row $j$, $\bar{a}_{\cdot k}$ is the mean of column $k$, and $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix;
S3: use the doubly centered distances obtained in S2 to compute the distance covariance between two factor samples and the distance variance of each single factor sample;
S4: use the distance covariance and distance variances obtained in S3 to compute the distance correlation coefficient between the two factor samples.
5. The forecasting method based on recognition of the correlational relationship between factors as claimed in claim 4, characterized in that the distance correlation coefficient is:
$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}$$
where X and Y are any pair of factor indices in the sample set, dCov(X, Y) denotes the distance covariance between the factor indices, and dVar(X) and dVar(Y) denote the distance variances of the factor indices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410479908.5A CN104239722A (en) | 2014-09-18 | 2014-09-18 | Forecasting method based on recognition of correlational relationship between factors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410479908.5A CN104239722A (en) | 2014-09-18 | 2014-09-18 | Forecasting method based on recognition of correlational relationship between factors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104239722A true CN104239722A (en) | 2014-12-24 |
Family ID=52227772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410479908.5A Pending CN104239722A (en) | 2014-09-18 | 2014-09-18 | Forecasting method based on recognition of correlational relationship between factors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104239722A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105303302A (en) * | 2015-10-12 | 2016-02-03 | 国家电网公司 | Power grid evaluating indicator correlation analysis method, apparatus and computing apparatus |
CN108615054A (en) * | 2018-04-18 | 2018-10-02 | 清华大学 | The overall target construction method that similitude is weighed between drainage pipeline networks node |
CN108873401A (en) * | 2018-06-22 | 2018-11-23 | 西安电子科技大学 | Liquid crystal display response time prediction technique based on big data |
CN110909216A (en) * | 2019-12-04 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting relevance between user attributes |
CN111753947A (en) * | 2020-06-08 | 2020-10-09 | 深圳大学 | Resting brain network construction method, device, equipment and computer storage medium |
CN113779754A (en) * | 2021-08-02 | 2021-12-10 | 张家港宏昌钢板有限公司 | Method and system for analyzing influence factors of blast furnace, electronic device and computer-readable storage medium |
CN113806356A (en) * | 2020-06-16 | 2021-12-17 | 中国移动通信集团重庆有限公司 | Data identification method and device and computing equipment |
CN111753947B (en) * | 2020-06-08 | 2024-05-03 | 深圳大学 | Resting brain network construction method, device, equipment and computer storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103020430A (en) * | 2012-11-28 | 2013-04-03 | 西南交通大学 | Evaluation method for pantograph-catenary matching performance of spectral cross-correlation coefficient |
CN103177121A (en) * | 2013-04-12 | 2013-06-26 | 天津大学 | Locality preserving projection method for adding pearson relevant coefficient |
Non-Patent Citations (3)
Title |
---|
GABOR J. SZEKELY et al.: "Brownian distance covariance", THE ANNALS OF APPLIED STATISTICS * |
GABOR J. SZEKELY et al.: "Measuring and Testing Dependence by Correlation of Distances", THE ANNALS OF STATISTICS * |
KARL PEARSON: "Mathematical Contributions to the Theory of Evolution.--On the Law of Reversion", PROC. R. SOC. LOND. * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105303302A (en) * | 2015-10-12 | 2016-02-03 | 国家电网公司 | Power grid evaluating indicator correlation analysis method, apparatus and computing apparatus |
CN108615054A (en) * | 2018-04-18 | 2018-10-02 | 清华大学 | The overall target construction method that similitude is weighed between drainage pipeline networks node |
CN108615054B (en) * | 2018-04-18 | 2020-06-05 | 清华大学 | Method for constructing comprehensive index for measuring similarity between drainage pipe network nodes |
CN108873401A (en) * | 2018-06-22 | 2018-11-23 | 西安电子科技大学 | Liquid crystal display response time prediction technique based on big data |
CN108873401B (en) * | 2018-06-22 | 2020-10-09 | 西安电子科技大学 | Liquid crystal display response time prediction method based on big data |
CN110909216A (en) * | 2019-12-04 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting relevance between user attributes |
CN110909216B (en) * | 2019-12-04 | 2023-06-20 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting relevance between user attributes |
CN111753947A (en) * | 2020-06-08 | 2020-10-09 | 深圳大学 | Resting brain network construction method, device, equipment and computer storage medium |
CN111753947B (en) * | 2020-06-08 | 2024-05-03 | 深圳大学 | Resting brain network construction method, device, equipment and computer storage medium |
CN113806356A (en) * | 2020-06-16 | 2021-12-17 | 中国移动通信集团重庆有限公司 | Data identification method and device and computing equipment |
CN113806356B (en) * | 2020-06-16 | 2024-03-19 | 中国移动通信集团重庆有限公司 | Data identification method and device and computing equipment |
CN113779754A (en) * | 2021-08-02 | 2021-12-10 | 张家港宏昌钢板有限公司 | Method and system for analyzing influence factors of blast furnace, electronic device and computer-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104239722A (en) | Forecasting method based on recognition of correlational relationship between factors | |
CN107577688B (en) | Original article influence analysis system based on media information acquisition | |
CN110634080B (en) | Abnormal electricity utilization detection method, device, equipment and computer readable storage medium | |
CN112132233A (en) | Criminal personnel dangerous behavior prediction method and system based on effective influence factors | |
CN103336906A (en) | Sampling GPR method of continuous anomaly detection in collecting data flow of environment sensor | |
CN111178675A (en) | LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment | |
CN116362570B (en) | Multi-dimensional pollution analysis method and system based on big data platform | |
CN107292744A (en) | Investment Trend analysis method and its system based on machine learning | |
CN111738843B (en) | Quantitative risk evaluation system and method using running water data | |
CN105469219A (en) | Method for processing power load data based on decision tree | |
CN102194134B (en) | Biological feature recognition performance index prediction method based on statistical learning | |
CN104850868A (en) | Customer segmentation method based on k-means and neural network cluster | |
CN103902798B (en) | Data preprocessing method | |
CN104835073A (en) | Unmanned aerial vehicle control system operation performance evaluating method based on intuitionistic fuzzy entropy weight | |
CN115794803A (en) | Engineering audit problem monitoring method and system based on big data AI technology | |
CN115470962A (en) | LightGBM-based enterprise confidence loss risk prediction model construction method | |
CN105488598A (en) | Medium-and-long time electric power load prediction method based on fuzzy clustering | |
CN104599062A (en) | Classification based value evaluation method and system for agricultural scientific and technological achievements | |
CN107704952A (en) | A kind of attack of terrorism Forecasting Methodology based on stochastic subspace | |
CN116629716A (en) | Intelligent interaction system work efficiency analysis method | |
CN111931992A (en) | Power load prediction index selection method and device | |
CN111401444A (en) | Method and device for predicting origin of red wine, computer equipment and storage medium | |
Dupuis et al. | Detecting change-points in extremes | |
Wang et al. | Temperature forecast based on SVM optimized by PSO algorithm | |
CN106250669B (en) | A kind of arid return period determines the method for arid threshold value in calculating |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20141224 |