CN109886409B - Quantitative causal relationship judging method of multidimensional time sequence - Google Patents
Quantitative causal relationship judging method of multidimensional time sequence Download PDFInfo
- Publication number
- CN109886409B CN109886409B CN201910115769.0A CN201910115769A CN109886409B CN 109886409 B CN109886409 B CN 109886409B CN 201910115769 A CN201910115769 A CN 201910115769A CN 109886409 B CN109886409 B CN 109886409B
- Authority
- CN
- China
- Prior art keywords
- time
- causal relationship
- sequence
- causal
- quantitative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000001364 causal effect Effects 0.000 title claims abstract description 94
- 238000000034 method Methods 0.000 title claims abstract description 41
- 238000004364 calculation method Methods 0.000 claims abstract description 15
- 238000010586 diagram Methods 0.000 claims abstract description 10
- 238000007781 pre-processing Methods 0.000 claims abstract description 8
- 239000011159 matrix material Substances 0.000 claims description 24
- 238000012360 testing method Methods 0.000 claims description 17
- 238000007476 Maximum Likelihood Methods 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 5
- 238000006243 chemical reaction Methods 0.000 claims description 3
- 239000000463 material Substances 0.000 claims description 3
- 230000007704 transition Effects 0.000 claims description 3
- 238000012544 monitoring process Methods 0.000 claims description 2
- 230000006641 stabilisation Effects 0.000 claims description 2
- 238000011105 stabilization Methods 0.000 claims description 2
- 238000013473 artificial intelligence Methods 0.000 abstract 1
- 238000004458 analytical method Methods 0.000 description 9
- 150000003839 salts Chemical class 0.000 description 9
- CURLTUGMZLYLDI-UHFFFAOYSA-N Carbon dioxide Chemical compound O=C=O CURLTUGMZLYLDI-UHFFFAOYSA-N 0.000 description 6
- 238000011160 research Methods 0.000 description 6
- ATNHDLDRLWWWCB-AENOIHSZSA-M chlorophyll a Chemical compound C1([C@@H](C(=O)OC)C(=O)C2=C3C)=C2N2C3=CC(C(CC)=C3C)=[N+]4C3=CC3=C(C=C)C(C)=C5N3[Mg-2]42[N+]2=C1[C@@H](CCC(=O)OC\C=C(/C)CCC[C@H](C)CCC[C@H](C)CCCC(C)C)[C@H](C)C2=C5 ATNHDLDRLWWWCB-AENOIHSZSA-M 0.000 description 5
- 239000000356 contaminant Substances 0.000 description 5
- 239000003344 environmental pollutant Substances 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 4
- 231100000719 pollutant Toxicity 0.000 description 4
- NHNBFGGVMKEFGY-UHFFFAOYSA-N Nitrate Chemical compound [O-][N+]([O-])=O NHNBFGGVMKEFGY-UHFFFAOYSA-N 0.000 description 3
- 229910002092 carbon dioxide Inorganic materials 0.000 description 3
- 229930002875 chlorophyll Natural products 0.000 description 3
- 235000019804 chlorophyll Nutrition 0.000 description 3
- 238000010792 warming Methods 0.000 description 3
- 229910002651 NO3 Inorganic materials 0.000 description 2
- BPQQTUXANYXVAA-UHFFFAOYSA-N Orthosilicate Chemical compound [O-][Si]([O-])([O-])[O-] BPQQTUXANYXVAA-UHFFFAOYSA-N 0.000 description 2
- 229910019142 PO4 Inorganic materials 0.000 description 2
- 238000005094 computer simulation Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 235000015097 nutrients Nutrition 0.000 description 2
- NBIIXXVUZAFLBC-UHFFFAOYSA-K phosphate Chemical compound [O-]P([O-])([O-])=O NBIIXXVUZAFLBC-UHFFFAOYSA-K 0.000 description 2
- 239000010452 phosphate Substances 0.000 description 2
- 206010013883 Dwarfism Diseases 0.000 description 1
- 241000287828 Gallus gallus Species 0.000 description 1
- 241000376353 Stips Species 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 239000005422 algal bloom Substances 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000001569 carbon dioxide Substances 0.000 description 1
- 229930002868 chlorophyll a Natural products 0.000 description 1
- NSMUHPMZFPKNMZ-VBYMZDBQSA-M chlorophyll b Chemical compound C1([C@@H](C(=O)OC)C(=O)C2=C3C)=C2N2C3=CC(C(CC)=C3C=O)=[N+]4C3=CC3=C(C=C)C(C)=C5N3[Mg-2]42[N+]2=C1[C@@H](CCC(=O)OC\C=C(/C)CCC[C@H](C)CCC[C@H](C)CCCC(C)C)[C@H](C)C2=C5 NSMUHPMZFPKNMZ-VBYMZDBQSA-M 0.000 description 1
- 238000013506 data mapping Methods 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 239000000428 dust Substances 0.000 description 1
- 230000001788 irregular Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005293 physical law Methods 0.000 description 1
- 230000001502 supplementing effect Effects 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 238000011179 visual inspection Methods 0.000 description 1
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a quantitative causal relation judging method of a multidimensional time sequence, in particular to a flow packetCollecting m time sequences; preprocessing each time sequence to obtain a plurality of stable time sequences with the same time interval; calculating quantitative and directional causal relationship between different time sequences, wherein any two time sequences X j And X i The causal relationship between the two information flows can be based on Liang-Kleeman information flow T j→i And T is i→j To make a judgment, T j→i Estimated by the following formula:and finally, integrating the causality between each two time sequences obtained through calculation to obtain a complete causality network diagram. The method can infer the causal relationship among a plurality of events through simple calculation, and the causal relationship is output in a quantitative mode, so that the problem of causal relationship inference is solved, and the method has wide and important application in a plurality of important fields such as finance, climate, environment, neuroscience, weather forecast, artificial intelligence, big data science and the like.
Description
Technical Field
The invention belongs to the technical field of big data technical processing, and particularly relates to a quantitative causal relationship judging method of a multidimensional time sequence.
Background
Causal analysis is a core problem of scientific research and is a direct goal of considerable scientific research. In one letter to j.s.switch, einstein has described: the development of western science is determined by two great achievements: the logic system is invented by Greek philosophy, and the causal relationship among events can be deduced through systematic experiments in the period of the rendition. Now suppose we have made experiments on two dynamic events to get two time series, then one of the most fundamental problems comes: we cannot exactly distinguish whether it is the cause or the effect exactly from the two sequences? Or, if there is a entanglement like "pre-chicken or pre-egg" between them, we can quantitatively characterize the circulating causality between them?
This is an ancient, well-known interdisciplinary problem, and scientists including the Nobel prize acquirer, greenworker (Clive Granger), have struggled for half a century. In fact, it is listed as one of the greatest challenges in today's emerging disciplines, big data science (O' Neil & Schutt,2013, p.274). Over the last decade, liang & Kleeman (2005), liang (2008), liang (2014), liang (2016), etc. have demonstrated through various efforts that this causal relationship can be strictly derived from the original physical laws, and does not have to appear in a semi-empirical form as in the traditional approach. Conventional causal analysis (such as Granger causal verification and transitive entropy analysis) in many cases fails to verify such an observed fact, known as the "zero cause criterion": if the evolution of one event does not depend on another event, the latter is not the cause of the former, but in this new architecture this fact appears as a proven theorem.
The causal analysis method can be generally described as follows for an m-dimensional random power system
Wherein x= (x 1 ,x 2 ,...,x m ) Is a state vector, F is an m-dimensional vector field,is an m-dimensional white noise vector, b= (B) ij ) m×m Is an amplitude matrix of noise disturbance, then is composed of the component x j To x i Liang-Kleeman information stream of (Liang, 2016)
Wherein the method comprises the steps ofρ j|i (x j |x i ) Is x j For x i Conditional probability density of>Theoretically, if T j→i Not equal to 0, x j Is x i Cause of (1), and T j→i Is the causal strength; and vice versa.
The above theory proves to be strictly true for any power system, in sharp contrast to the existing gladhand causal tests or empirical causal tests such as transfer entropy analysis. However, the formula (II) is very complex, which results in that it is precisely established in classical problems, but is hardly solved in practical problems, which greatly constrains its application value. To solve this problem, liang (2014) proposes an estimation method and gets a fairly compact formula for the two-dimensional problem. This formula has been validated against many existing semi-empirical causal analyses (e.g., granger causal tests, transitive entropy analysis) and has been successfully used in more and more realistic problems. For example, with only the stock price time series of two IBM and GE companies downloaded from Yahoo, it was found that there was a strong, nearly unidirectional causal relationship from IBM to GE over a period of time in the 70 s, and thus a dust seal of decades was explored about "seven dwarfs" and "giant" competing for the maxima market. In another example, stips et al (2016) use this formula to find a clear, nearly unidirectional causal relationship between carbon dioxide and global warming. For the last hundred years CO2 did contribute to global warming, but on the ancient climate scale of over a thousand years this causality could be reversed completely, as global warming resulted in an increase in CO2 concentration.
The estimation of Liang (2014) is very successful for two-dimensional Liang-Kleeman information flow, i.e. the (II) formula when m=2, but its success is limited to the two-dimensional case (i.e. the case of m=2), and is not sufficient for the multidimensional case.
Disclosure of Invention
The invention aims to provide a quantitative causal relation judging method of a multidimensional time sequence, which can quantitatively obtain causal relations among a plurality of events.
In order to solve the technical problems, the invention adopts the following technical scheme:
a quantitative causal relation judging method of a multidimensional time sequence is characterized in that: the magnitude and direction of the causal relation among the multidimensional time series can be quantitatively estimated, and the specific implementation scheme is as follows:
step 1, collecting m time sequences, wherein each sequence corresponds to an event, and the m time sequences are respectively X 1 ,X 2 ,…,X m ;
Step 2, preprocessing each time sequence to obtain a plurality of time sequences with the same time interval;
step 3, when some time sequences in the step 2 are not stable, further processing is needed to make the sequences become approximately stable; when the time sequence in the step 2 is stable, performing a step 4;
step 4, calculating quantitative and directional causal relations among the time sequences; the specific calculation method is as follows:
given m stationary, equidistant time series X 1 ,X 2 ,…,X m Let X be j To X i Information flow of (i, j=1, 2,..m, i+.j) is T j→i X is then j And X i Causal relationships between can be defined by j→i And T is i→j To make a judgment, T j→i Can be estimated by the following equation:
wherein,for information flow T j→i Maximum likelihood estimate of (a); />For m time series X 1 ,X 2 ,…,X m Is a covariance matrix of (a); detC is a determinant of C; delta jk Is C jk Algebraic remainder of (2); c (C) k,di Is X k And a derived sequence->Covariance between>The definition is as follows: />Δt is the time interval of the sequence;
theoretically whenNon-zero, then X j Is X i Cause of (1)/(2)>Is the size of the causal relationship and is about +.>Performing significance test;
and 5, integrating the causal relationship data between every two time sequences obtained by calculation in the step 4 to obtain a complete causal relationship network diagram.
Selecting a time point t from the time series of the data stabilization obtained in the step 3 n At a time point t n Time window [ t ] is taken for center n -L/2,t n +L/2]Executing step 4 and step 5 under the condition of a time window with the time length of L to obtain a time point t n A causal relationship network graph at the location;
and (5) sequentially taking time points of different positions according to the time sequence in the time sequence, and repeatedly executing the step (4) and the step (5) to obtain a causal relationship network graph changing along with time.
The step 4 is performedA saliency check must be performed to determine whether it differs significantly from zero, or not, when the saliency check result differs significantly from zero, the causal relationship between two events is proved, and when the saliency check result does not differ significantly from zero, the causal relationship between two events is proved not to be significant, the specific flow of the saliency check is as follows:
x is to be i And X j Arranged to the first and second positions at this timeConversion to->The method in the derivation step 4In the process of the formula, we use a linear model:
wherein x= (X 1 ,X 2 ,...,X m ) Is a state vector, here represented as m time series,is white noise, (f) 1 ,b 1 ,a 11 ,a 12 ,...,a 1m ) Is a parameter to be estimated, and it can be demonstrated that the maximum likelihood estimation of these parameters is:
wherein the method comprises the steps ofFor m time series X 1 ,X 2 ,…,X m Is a covariance matrix of (a); detC is a determinant of C; delta jk Is C jk Algebraic remainder of (2); c (C) k,di Is X k And a derived sequence->Covariance between>The definition is as follows:Δt is the time interval of the sequence; />Is sequence X k Arithmetic mean of (2); />Is a sequence ofArithmetic mean of (2); />N is the sequence length (total number of time steps), according to the central limit theorem, ++>Approximately obeys a normal distribution, where only +.>Let its variance beIt relates to Fisher information matrix I, and Fisher information matrix I is calculated as follows:
note the state vector x= (X) at time point n 1 ,X 2 ,...,X m ) For X (N), n=1, 2,..n, N, assuming that X (N) = (X 1,n ,X 2,n ,...,X m,n ) In the case of (2), the next state vector X (n+1) = (X) 1,n+1 ,X 2,n+1 ,...,X m,n+1 ) If the probability density of the transition probability density is ρ, the Fisher information matrix multiplied by the total number of steps of time N is
The above formula can be demonstrated as:
a in the matrix 1j ,f 1 ,b 1 Using its estimateInstead of the above-mentioned, the method,
inverting the obtained matrix (NI), the fourth diagonal component of the obtained inverse matrixI.e. variance->Is used for the estimation of the (c),
the variance of (c) is: />
Given the confidence level alpha, looking up a standard normal distribution table to obtain a confidence interval coefficient z α Further obtainStandard deviation of +.>In other words, for confidence level α, X 2 To X 1 The confidence interval of the true causality of (2) is:
when the confidence interval does not include zero, X 2 Is X 1 Cause of (1); otherwise, the causality is not obvious. For arbitrary sequence X i And X is j Significance test of causal relationship between the two, X is only needed i And X is j And (5) placing the materials at the first and second positions, and repeating the steps.
When the confidence level is set to 90%, the confidence interval coefficient z α =1.65, where the confidence interval is:
when the confidence level is set to 95%, the confidence interval coefficient z=1.96, at which time the confidence interval is:
when the confidence level is set to 99%, the confidence interval coefficient z=2.56, at which time the confidence interval is:
when X is i And X j When the causal relationship between the data is not obvious, the causal relationship between the data cannot be simply judged, and under the condition that the data and the data are allowed, the data sample is enlarged, and the steps 1 to 4 are repeatedly executed.
The quantitative causal relation judging method of the multidimensional time sequence has the beneficial effects that: the method can infer the causal relationship among a plurality of events through simple calculation, and the causal relationship is output in a quantitative form, so that the problem of causal relationship inference is solved. In general, causal inference in scientific research and engineering application depends on a large number of experiments such as physics, chemistry, biology and the like or computer simulation, and huge cost is consumed, but the result is not correct, compared with the method, the calculation cost is almost negligible. The method has wide application in various aspects such as financial analysis, neuroscience research, climate change research, weather forecast, earthquake prediction, haze tracing and the like.
Drawings
FIG. 1 is a workflow diagram of a quantitative causal relationship determination method of the multi-dimensional time series of the present invention.
FIG. 2 is a schematic diagram of the causal relationship between companies over time inferred from stock prices using the method of the present invention.
FIG. 3 is a geographical map of data acquisition used in inferring the source of a contaminant using the method of the present invention.
FIG. 4 is a time series of the various contaminant concentrations across the site when the source of the contaminant is inferred using the method of the present invention.
FIG. 5 is a flow chart of the operation of the method of the present invention to infer the source of contaminants.
FIG. 6 is a spatial distribution diagram of contaminant concentration when tracing a haze source using the method of the present invention.
FIG. 7 is a workflow diagram of a haze source tracing process using the method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments.
A quantitative causal relation judging method of a multidimensional time sequence is characterized in that: the magnitude and direction of the causal relation among the multidimensional time series can be quantitatively estimated, and the specific implementation scheme is as follows:
step 1, collecting m time sequences, wherein each sequence corresponds to an event, and the m time sequences are respectively X 1 ,X 2 ,…,X m ;
Step 2, preprocessing each time sequence to obtain a plurality of time sequences with the same time interval;
step 3, when some time sequences in the step 2 are not stable, further processing is needed to make the sequences become approximately stable; when the time sequence in the step 2 is stable, performing a step 4;
step 4, calculating quantitative and directional causal relations among the time sequences; the specific calculation method is as follows:
given m stationary, equidistant time series X 1 ,X 2 ,…,X m Let X be j To X i Information flow of (i, j=1, 2,..m, i+.j) is T j→i X is then j And X i Causal relationships between can be defined by j→i And T is i→j To make a judgment, T j→i Can be estimated by the following equation:
wherein,for information flow T j→i Maximum likelihood estimate of (a); />For m time series X 1 ,X 2 ,…,X m Is a covariance matrix of (a); detC is a determinant of C; delta jk Is C jk Algebraic remainder of (2); c (C) k,di Is X k And a derived sequence->Covariance between>The definition is as follows: />Δt is the time interval of the sequence;
theoretically whenNon-zero, then X j Is X i Cause of (1)/(2)>Is the size of the causal relationship and is about +.>Performing significance test;
and 5, integrating the causal relationship data between every two time sequences obtained by calculation in the step 4 to obtain a complete causal relationship network diagram.
In this embodiment, the number of m time sequences acquired in step 1 is greater than or equal to two, and the possible existence of a causal relationship between any two time sequences is: has no causality, one-way causality and two-way causality.
In this embodiment, the preprocessing method for each time sequence in step 2 includes, but is not limited to, supplementing the missing point data, mapping the irregular time distribution data to the regular time points, and so on, and finally obtaining a plurality of time sequences with accurate data mapping and the same time length.
In this embodiment, the preprocessing mode performed on the time series in step 3 when the data is not stable includes, but is not limited to, performing linear trend removal processing on the series, or transforming to another series, such as converting the stock price series into a yield series, etc. The pretreated time series can be judged by various classical stability criteria, and the time series can be approximately stable according to visual inspection in the embodiment to meet the requirements.
In this embodiment, the quantitative and directional causal relationship between each time series, X, is calculated in step 4 j And X i Causal relationships between can be defined by j→i And T is i→j To judge: if T j→i Not equal to 0, then X j Is X i Because, if zero, it is not; if T i→j Not equal to 0, then X i Is X j If zero, it is not. There are four possibilities here in total: (1) X is X j Is X i Cause of (1), X i Also X j Cause of (1); (2) X is X j Is X i Cause of (1), X i Other than X j Cause of (1); (3) X is X j Other than X i Cause of (1), X i Is X j Cause of (1); (4) X is X j And X is i There is no causal relationship between them. Again, the order i, j=1, 2,..m, is calculated, resulting in information flow and causal relationships between all time sequences.
In this embodiment, the data T is estimated for each of the information streams obtained in step 4 j→i A significance test is performed to determine whether it is significantly different from zero, or not significantly different from zero, when the significance test result is significantly different from zero, the causal relationship between the two events is proved, and when the significance test result is not significantly different from zero, the causal relationship between the two events is not proved, the specific flow of the significance test is as follows:
x is to be i And X j Arranged to the first and second positions at this timeConversion to->In deriving the formula of step 4, we use a linear model:
wherein x= (X 1 ,X 2 ,...,X m ) Is a state vector, here represented as m time series,is white noise, (f) 1 ,b 1 ,a 11 ,a 12 ,...,a 1m ) Is a parameter to be estimated, and it can be demonstrated that the maximum likelihood estimation of these parameters is:
wherein the method comprises the steps ofFor m time series X 1 ,X 2 ,…,X m Is a covariance matrix of (a); detC is a determinant of C; delta jk Is C jk Algebraic remainder of (2); c (C) k,di Is X k And a derived sequence->Covariance between>The definition is as follows:Δt is the time interval of the sequence; />Is sequence X k Arithmetic mean of (2); />Is a sequence ofArithmetic mean of (2); />N is the sequence length (total number of time steps), according to the central limit theorem, ++>Approximately obeys a normal distribution, where only +.>Let its variance beIt relates to Fisher information matrix I, and Fisher information matrix I is calculated as follows:
note the state vector x= (X) at time point n 1 ,X 2 ,...,X m ) For X (N), n=1, 2,..n, N, assuming that X (N) = (X 1,n ,X 2,n ,...,X m,n ) In the case of (2), the next state vector X (n+1) = (X) 1,n+1 ,X 2,n+1 ,...,X m,n+1 ) If the probability density of the transition probability density is ρ, the Fisher information matrix multiplied by the total number of steps of time N is
The above formula can be demonstrated as:
a in the matrix 1j ,f 1 ,b 1 Using its estimateInstead of
Inverting the obtained matrix (NI), the fourth diagonal component of the obtained inverse matrixI.e. variance->Estimated value of ∈10->The variance of (c) is: />
Given the confidence level alpha, looking up a standard normal distribution table to obtain a confidence interval coefficient z α Further calculateStandard deviation of (2)In other words, for confidence level α, X 2 To X 1 The confidence interval of the true causality of (2) is:
when the confidence interval does not include zero, X 2 Is X 1 Cause of (1); otherwise, the causality is not obvious. For arbitrary sequence X i And X is j Significance test of causal relationship between the two, X is only needed i And X is j And (5) placing the materials at the first and second positions, and repeating the steps.
Further, when the confidence level is set to 90%, the confidence interval coefficient z α =1.65, where the confidence interval is:
when the confidence level is set to 95%, the confidence interval coefficient z=1.96, at which time the confidence interval is:
when the confidence level is set to 99%, the confidence interval coefficient z=2.56, at which time the confidence interval is:
in this embodiment, when the result of the calculation in step 4 appears X i And X j When the causal relationship between the data is not obvious, the causal relationship between the data cannot be simply judged, and under the condition that the data and the data are allowed, the data sample is enlarged, and the steps 1 to 4 are repeatedly executed.
Further, expanding the data samples includes, but is not limited to, increasing the sequence length, increasing the data temporal resolution, or both.
In this embodiment, the step 5 may be applied to an optional time window with a length L, that is, a causal relationship network between multiple time sequences in the window may be obtained, and further a causal relationship network that varies with time may be obtained, which is implemented as follows:
selecting a time point t from the stationary time series obtained in the step 3 n At a time point t n Time window [ t ] is taken for center n -L/2,t n +L/2]Executing step 4 and step 5 under the condition of a time window with the time length of L to obtain a time point t n A causal relationship network graph at the location;
and (5) sequentially taking time points of different positions according to the time sequence in the time sequence, and repeatedly executing the step (4) and the step (5) to obtain a causal relationship network graph changing along with time. The specific implementation steps are as follows:
determining the length of a time window for estimating the causal relationship as L and the starting time t 0 =l/2+1, total time series length t n Calculate [ t ] 0 -L/2,t 0 +L/2]Causal relationships between variables in the model and performing significance test, so as to obtain t 0 Whether +L/2 is greater than or equal to t n As a criterion, successively calculate at t 0 =t 0 And (5) causality relation data at the time of +1, and finally obtaining a causality network graph changing along with time.
The technical scheme is further verified by taking a specific example:
first embodiment:
inferring from the company stock prices the relationship between the two companies:
acquiring a time sequence of Amazon, google and Walmart stock prices, wherein the time is from the beginning of the 19 th 8 th year of Google2004 to the end of the 26 th 12 th year of 2014, removing trade days and holidays, and totally counting 2607 data points, and preprocessing price data to obtain log daily gain due to unstable price sequence, so as to form a time sequence, wherein the preprocessing mode comprises the following specific steps: if P (n) represents the closing price on the n-th day, r (n) = lnP (n+1) -lnP (n) is the log-day gain, and then the causal relationship between the two at each time point is obtained by taking 1000 trade days as the sliding time window length, the result as shown in fig. 2 can be obtained, and it can be seen from the graph that the information flow from Amazon to Google is insignificant at 90% probability, that is, amazon is never the factor of Google, but Google is the factor of Amazon before the autumn of 2010, and the information flow is significantly not zero. The reason for this is because Amazon has heretofore relied on Google, a search engine, which is a search engine on the principle of online shopping. But after the fall of 2010, amazon is promptly abandoned in Google in order not to lose the huge market of china, so Google is no longer the cause of Amazon since then.
Specific embodiment II:
presuming the source of the pollutant through the pollutant concentration of different regional positions:
in order to understand the cause of pollution in the open sea of Jiangsu, the method of the invention is used for carrying out causal analysis on pollutant index sequences at certain selected points, and the calculation process is shown in figure 5. In the first step, as shown in fig. 3, three points a (120.5 e,34.8 n), B (121.5 e,33.5 n) and C (122.5 e,31.8 n) are selected, which correspond to the harbour, the salt city and the Shanghai open sea respectively, pollutant concentration data of nearly twenty years are taken, the time span is the first year of 1998 to the end of 2016, the output interval of each step is 15 days, and the total of 456 time points are shown in fig. 4, and the time sequence of the obtained nutrient salt (NO 3, siO3 and PO 4) and chlorophyll is shown in fig. 4. The following are the information flows calculated with steps 1 to 5 and their 90% confidence intervals for the causal relationship of various nutrient salts with chlorophyll between the three A, B, C points:
nitrate (NO 3)
T B→A =-0.038621±0.040181
T A→B =0.204099±0.046894
T C→A =-0.049355±0.039681
T A→C =0.335230±0.048638
T C→B =0.124917±0.051953
T B→C =0.019727±0.054564
Silicate (SiO 3)
T B→A =0.092092±0.030884
T A→B =-0.043766±0.019159
T C→A =0.026947±0.024741
T A→C =-0.030499±0.017957
T C→B =0.258933±0.042229
T B→C =0.059791±0.049407
Phosphate (PO 4)
T B→A =0.110934±0.055773
T A→B =0.208629±0.073974
T C→A =0.027468±0.053460
T A→C =0.322688±0.070432
T C→B =0.108395±0.066720
T B→C =0.077843±0.066274
Chlorophyll-a (Chl-a)
T B→A =-0.016813±0.044458
T A→B =0.120196±0.049673
T C→A =0.176085±0.051851
T A→C =0.207379±0.053923
T C→B =0.186485±0.054840
T B→C =0.041304±0.051044
For these several indexes, it can be seen that the north and south sea areas are both the factors of the salt city and the open sea, so that the pollution of the salt city and the open sea is not only the result of the salt city and the sea area, but also the neighbor from the north and the south. Also, in the case of silicate, phosphate, the salt city open sea is indeed the cause of the open sea in the cloud harbor, the open sea in the Shanghai, because of the information flow T B→A ,T B→C Significantly different from zero.
However, it is quite unexpected that T for nitrate, chlorophyll B→A, T B→C And not significantly different from zero, that is, from this data sample, the salt city open sea does not appear to be the cause of the Shanghai and the Liyun harbor open sea because the information flow is not significantly zero. If pollution from nitrate and chlorophyll (such as algal bloom) is to be discussed, although pollution from the outside sea in salt city is serious, although pollution from the north of su has long been thought to pollute the mouth of the Yangtze river by the coastal flow of su in Jiangsu, it is not the cause of the outside sea in Shanghai and Liyun harbor.
Third embodiment:
haze tracing is carried out through PM2.5 concentration in the atmosphere of different regional positions:
as shown in fig. 6 and 7, by acquiring PM2.5 concentration data in the atmosphere within a certain time period of the multi-region PM2.5 concentration monitoring station, a corresponding time sequence is formed, and a haze cause and effect network diagram among multiple sites can be obtained through calculation, so that one or more main sources can be found.
It will be further appreciated from the above description of various embodiments and computing principles that the present method can infer causal relationships between events by simple computation, and that such causal relationships are output in quantitative form, thereby solving the problem of causal relationship inference. In general, causal inference in scientific research and engineering application depends on a large number of experiments such as physics, chemistry, biology and the like or computer simulation, and huge cost is consumed, but the result is not correct, compared with the method, the calculation cost is almost negligible. The method has wide application in a plurality of scientific fields such as finance, climate, environment, neuroscience and the like.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.
Claims (5)
1. A quantitative causal relation judging method of a multidimensional time sequence is characterized in that: the magnitude and direction of the causal relation among the multidimensional time series can be quantitatively estimated, and the specific implementation scheme is as follows:
step 1, collecting m time sequences, wherein each sequence corresponds to an event, and the m time sequences are respectively X 1 ,X 2 ,…,X m ;
Step 2, preprocessing each time sequence to obtain a plurality of time sequences with the same time interval;
step 3, when some time sequences in the step 2 are not stable, further processing is needed to make the sequences become approximately stable; when the time sequence in the step 2 is stable, performing a step 4;
step 4, calculating quantitative and directional causal relations among the time sequences; the specific calculation method is as follows:
given m stationary, equidistant time series X 1 ,X 2 ,…,X m RecordingFrom X j To X i I, j=1, 2,..m, i+.j information stream is T j→i X is then j And X i Causal relationships between can be defined by j→i And T is i→j To make a judgment, T j→i Can be estimated by the following equation:
wherein,for information flow T j→i Maximum likelihood estimate of (a); />For m time series X 1 ,X 2 ,…,X m Is a covariance matrix of (a); detC is a determinant of C; and (V) jk Is C jk Algebraic remainder of (2); c (C) k,di Is X k And a derived sequence->Covariance between, derived sequence is defined as:
Δt is the time interval of the sequence;
when (when)Significantly different from zero, then X j Is X i Cause of (1)/(2)>The size of the causal relationship;
step 5, integrating the causal relationship data between every two time sequences obtained by calculation in the step 4 to obtain a complete causal relationship network diagram;
the method comprises the steps of obtaining PM2.5 concentration data in the atmosphere in a certain time period of a multi-region PM2.5 concentration monitoring station, forming a corresponding time sequence, obtaining a haze causal network diagram among multiple places through calculation, and further finding one or more main sources.
2. The quantitative causal relationship determination method of the multidimensional time series of claim 1, wherein: selecting a time point t from the time series of the data stabilization obtained in the step 3 n At a time point t n Time window [ t ] is taken for center n -L/2,t n +L/2]Executing step 4 and step 5 under the condition of a time window with the time length of L to obtain a time point t n A causal relationship network graph at the location;
and (5) sequentially taking time points of different positions according to the time sequence in the time sequence, and repeatedly executing the step (4) and the step (5) to obtain a causal relationship network graph changing along with time.
3. The quantitative causal relationship determination method of a multidimensional time series according to claim 1 or 2, wherein: the step 4 is performedA saliency test must be performed to determine if it is significantly different from zero, or not significantly different from zero, and when the saliency test result is significantly different from zero, it is proved that there is a causal relationship between the two events, and when the saliency test result is not significantly different from zero, it is proved that there is not significant causal relationship between the two events; the specific flow of the significance test is as follows:
x is to be i And X j Arranged to the first and second positions,conversion to->In deriving the formula of step 4, we use the following linear model:
wherein x= (X 1 ,X 2 ,...,X m ) Is a state vector, here represented as m time series,is white noise, (f) 1 ,b 1 ,a 11 ,a 12 ,...,a 1m ) Is a parameter to be estimated, and it can be demonstrated that the maximum likelihood estimation of these parameters is:
wherein the method comprises the steps ofIs sequence X k Arithmetic mean of (2); />Is the sequence->Arithmetic mean of (2);n is the total time step number of the sequence, according to the central limit theorem,approximately obeys a normal distribution, where only +.>Let its variance beIt relates to Fisher information matrix I, and Fisher information matrix I is calculated as follows:
note the state vector x= (X) at time point n 1 ,X 2 ,...,X m ) For X (N), n=1, 2,..n, N, assuming that X (N) = (X 1,n ,X 2,n ,...,X m,n ) In the case of (2), the next state vector X (n+1) = (X) 1,n+1 ,X 2,n+1 ,...,X m,n+1 ) If the probability density of the transition probability density is ρ, the Fisher information matrix multiplied by the total number of steps of time N is
The above formula can be demonstrated as:
a in the matrix 1j ,f 1 ,b 1 Using its estimateReplacement;
inverting the obtained matrix NI, the fourth diagonal component of the obtained inverse matrixI.e. variance->Is a function of the estimated value of (2);the variance of (c) is: />
Given the confidence level alpha, looking up a standard normal distribution table to obtain a confidence interval coefficient z α Further obtainStandard deviation of (2)In other words, for confidence level α, X 2 To X 1 The confidence interval of the true causality of (2) is:
when the confidence interval does not include zero, X 2 Is X 1 Cause of (1); otherwise, the causality is not obvious; for arbitrary sequence X i And X is j Significance test of causal relationship between the two, X is only needed i And X is j And (5) placing the materials at the first and second positions, and repeating the steps.
4. A method of quantitative causal relationship determination of a multidimensional time series according to claim 3, wherein: when the confidence level is set to 90%, the confidence interval coefficient z α =1.65, where the confidence interval is:
confidence interval coefficient z when confidence level is set to 95% α =1.96, where the confidence interval is:
when the confidence level is set to 99%, the confidence interval coefficient z α =2.56, where the confidence interval is:
5. a method of quantitative causal relationship determination of a multidimensional time series according to claim 3, wherein: when X is i And X j When the causal relationship between the data is not obvious, the causal relationship between the data cannot be simply judged, and under the condition that the data and the data are allowed, the data sample is enlarged, and the steps 1 to 4 are repeatedly executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910115769.0A CN109886409B (en) | 2019-02-15 | 2019-02-15 | Quantitative causal relationship judging method of multidimensional time sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910115769.0A CN109886409B (en) | 2019-02-15 | 2019-02-15 | Quantitative causal relationship judging method of multidimensional time sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109886409A CN109886409A (en) | 2019-06-14 |
CN109886409B true CN109886409B (en) | 2023-12-12 |
Family
ID=66928169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910115769.0A Active CN109886409B (en) | 2019-02-15 | 2019-02-15 | Quantitative causal relationship judging method of multidimensional time sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109886409B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026852B (en) * | 2019-11-28 | 2023-06-30 | 广东工业大学 | Financial event-oriented hybrid causal relationship discovery method |
CN112884151B (en) * | 2021-01-27 | 2022-07-05 | 武汉理工大学 | Method and system for controlling environment of glass melting furnace based on causal reasoning |
CN113627663B (en) * | 2021-08-04 | 2023-11-10 | 浙江大学 | Dynamic causal analysis method based on geographic time sequence in city |
CN116187443B (en) * | 2023-02-10 | 2024-05-24 | 中国科学院自动化研究所 | Causal strength detection method and detection device based on multidimensional symbol dynamics |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171142A (en) * | 2017-12-26 | 2018-06-15 | 中南大学 | A kind of causal method of key variables in determining complex industrial process |
-
2019
- 2019-02-15 CN CN201910115769.0A patent/CN109886409B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171142A (en) * | 2017-12-26 | 2018-06-15 | 中南大学 | A kind of causal method of key variables in determining complex industrial process |
Non-Patent Citations (1)
Title |
---|
Unraveling the cause-effect relation between time series;X. San Liang;《https://journals.aps.org/pre/pdf/10.1103/PhysRevE.90.052150》;第1-11页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109886409A (en) | 2019-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109886409B (en) | Quantitative causal relationship judging method of multidimensional time sequence | |
Loy-Benitez et al. | Soft sensor validation for monitoring and resilient control of sequential subway indoor air quality through memory-gated recurrent neural networks-based autoencoders | |
CN107529651A (en) | A kind of urban transportation passenger flow forecasting and equipment based on deep learning | |
CN111367959B (en) | Zero-time-lag nonlinear expansion Granger causal analysis method | |
CN114445634A (en) | Sea wave height prediction method and system based on deep learning model | |
CN114049545B (en) | Typhoon intensity determining method, system, equipment and medium based on point cloud voxels | |
Xu et al. | Evaluation of spatiotemporal dynamics of simulated land use/cover in China using a probabilistic cellular automata-Markov model | |
CN103616732A (en) | Quality control method and quality monitoring device of upper-air wind data | |
CN106295690A (en) | Time series data clustering method based on Non-negative Matrix Factorization and system | |
CN117273152A (en) | Pollutant tracing method by constructing quantitative causal graph | |
CN116467933A (en) | Storm surge water increasing prediction method and system based on deep learning | |
Che et al. | ED-DRAP: Encoder–decoder deep residual attention prediction network for radar echoes | |
CN116151459A (en) | Power grid flood prevention risk probability prediction method and system based on improved Transformer | |
CN110348469A (en) | A kind of user's method for measuring similarity based on DeepWalk internet startup disk model | |
Zhu et al. | Multi-resolution spatio-temporal prediction with application to wind power generation | |
CN110019167A (en) | Long-term new forms of energy resource data base construction method and system in one kind | |
CN114493035A (en) | Enterprise default probability prediction method and device | |
Diodato et al. | Long-term winter temperatures in central Mediterranean: forecast skill of an ensemble statistical model | |
Geng et al. | Study on index model of tropical cyclone intensity change based on projection pursuit and evolution strategy | |
Yang et al. | Data resolution improvement for ocean of things based on improved FCM | |
Koniaev-Gurchenko et al. | Automated evaluation of smart city data from cloud-system | |
Xiao et al. | Prediction of Monthly Rainfall in Plateau Area Based on Convolutional Neural Network | |
CN109697630A (en) | A kind of businessman's volume of the flow of passengers multiplicity and prediction technique based on sparse regression | |
Zhang et al. | Intelligent flood disaster forecasting based on improved neural network algorithm | |
Zhao et al. | Prediction method of dust pollutant diffusion range in building demolition based on Euclidean distance transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |