CN109542936A - A kind of recursion causal inference method based on cause and effect segmentation - Google Patents

Publication number
CN109542936A
Authority
CN
China
Prior art keywords
cause
effect
data set
segmentation
recursion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811265052.6A
Other languages
Chinese (zh)
Inventor
周水庚
张浩
关佶红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention belongs to the technical field of data mining and specifically concerns a recursive causal inference method based on causal partitioning. The method adopts a divide-and-conquer strategy: it recursively partitions the dataset, layer by layer, using low-order conditional independence tests, then reconstructs the causal structure of each sub-dataset, and finally merges the results to obtain the causal information of the whole dataset. The method performs causal inference well on high-dimensional datasets and mines the causal relationships they contain. In the big-data era, causal inference algorithms are widely used in economics, Internet social networks, medical big data and other fields, but high dimensionality is a problem encountered throughout industrial information processing, and solving it is urgent. The invention helps to cope with causal information mining on ever-growing volumes of data and plays an important role in extracting valuable causal information from massive data.

Description

A recursive causal inference method based on causal partitioning
Technical field
The invention belongs to the technical field of data mining, and specifically relates to a causal network construction method suitable for modeling biological information, banking networks and social networks.
Background art
With the arrival of the big-data era, causal inference algorithms are widely used in economics, Internet social networks, medical big data and other fields. As data volumes keep growing and data structures become higher-dimensional and more complex, causal inference on high-dimensional data has attracted great attention from researchers at home and abroad. High dimensionality is a problem encountered throughout industrial information processing; solving it is urgent, and it has become a research hotspot in machine learning.
Recently, some studies have modeled the causal structure of high-dimensional data by exploiting the property that the joint distribution of the data mirrors the network structure. These studies rest on three different kinds of network construction methods: (1) constraint-based causal inference algorithms; (2) score-based search algorithms; (3) causal inference algorithms based on functional causal models. These methods have shown some practical value in fields such as gene regulatory networks, stock index prediction and social networks. However, their accuracy is limited by the scale of the dataset. When the number of variables is large (about 100 to 200), high-order conditional independence (CI) tests become inaccurate and their time complexity becomes excessive, so these methods cannot judge the causal relationships in the data accurately, and the mining or regulation of the whole causal system fails.
Summary of the invention
The object of the invention is to propose a recursive causal inference method based on causal partitioning, with high accuracy and low computational complexity, for mining the causal information of high-dimensional datasets.
The recursive causal inference method based on causal partitioning provided by the invention is an efficient mining method that follows a recursive strategy. Its core is the causal partitioning step, which relies on low-order CI tests and can therefore greatly improve both the accuracy and the speed of mining, making feasible and effective causal inference possible on high-dimensional datasets.
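As a concrete instance of a low-order CI test, the order-0 case (no conditioning set) can be sketched as follows. The test used here, a Fisher z-test on the Pearson correlation, is an assumption chosen for illustration; the patent only states that an existing independence test may be used. The function names and the 1.96 threshold (two-sided 5% level) are likewise illustrative. The table follows the convention of the detailed steps later in the description, where an entry of 1 means "independent".

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def looks_independent(r, n_samples, order=0, z_crit=1.96):
    """Fisher z-test (assumed test): True if independence is not rejected."""
    r = max(min(r, 0.999999), -0.999999)        # keep the logarithm finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    return math.sqrt(n_samples - order - 3) * abs(z) < z_crit

def zero_order_table(columns):
    """Build M with M[i][j] = 1 if variables i and j look independent, else 0."""
    p, n = len(columns), len(columns[0])
    M = [[0] * p for _ in range(p)]             # the diagonal stays 0
    for i in range(p):
        for j in range(i + 1, p):
            r = pearson(columns[i], columns[j])
            M[i][j] = M[j][i] = int(looks_independent(r, n))
    return M

# Deterministic toy data: a and b are exactly uncorrelated, c is a multiple of a.
a = [(-1) ** i for i in range(400)]
b = [(-1) ** (i // 2) for i in range(400)]
c = [2 * v for v in a]
M = zero_order_table([a, b, c])
```

In this toy case only the pairs involving `b` are marked independent. A correlation test only detects linear dependence, so a different test would be needed for non-Gaussian or nonlinear data.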
The basic steps of the recursive causal inference method based on causal partitioning provided by the invention are as follows:
(1) Causal partitioning of the data: based mainly on the property that the joint distribution of the data mirrors the underlying causal network, the data are partitioned causally, with the following features:
(1a) the causal partitioning uses low-order CI tests (order ≤ 3);
(1b) the causal partitioning is performed recursively;
(2) Causal skeleton modeling of the sub-datasets: the causal skeleton of each sub-dataset is built mainly from CI tests, which requires the joint distribution of each sub-dataset to mirror the causal skeleton of the corresponding sub-network;
(3) Merging of the sub-datasets: all sub-datasets are merged into one, mainly by making corresponding (shared) nodes coincide;
(4) Direction inference for the dataset: both the causal partitioning stage and the sub-dataset skeleton modeling stage use CI tests; from the CI test results, causal directions are inferred using the correspondence between the V-structures of the network and conditional independence.
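The direction inference of item (4) can be illustrated with the collider rule used by constraint-based algorithms such as PC: for an unshielded triple i - k - j (i and j not adjacent), if k is absent from the set that separated i and j, the triple is oriented i → k ← j. The patent does not spell out its exact orientation rule, so the sketch below is an assumption; the names `orient_v_structures` and `sepset` are hypothetical.

```python
from itertools import combinations

def orient_v_structures(nodes, skeleton, sepset):
    """Return directed edges (tail, head) implied by V-structures.

    skeleton: set of frozenset({u, v}) undirected edges.
    sepset:   dict mapping frozenset({i, j}) to the conditioning set that made
              i and j independent (recorded during the CI tests). A missing
              entry is treated as an empty set, which is an assumption.
    """
    def adjacent(u, v):
        return frozenset((u, v)) in skeleton

    arrows = set()
    for k in nodes:
        neighbours = [u for u in nodes if u != k and adjacent(u, k)]
        for i, j in combinations(neighbours, 2):
            separator = sepset.get(frozenset((i, j)), set())
            if not adjacent(i, j) and k not in separator:
                arrows.add((i, k))          # i -> k
                arrows.add((j, k))          # j -> k
    return arrows

skeleton = {frozenset((0, 1)), frozenset((1, 2))}
# 0 and 2 were independent with an empty conditioning set: collider at 1.
collider = orient_v_structures([0, 1, 2], skeleton, {frozenset((0, 2)): set()})
# 0 and 2 were independent only given 1: no V-structure (chain or fork).
chain = orient_v_structures([0, 1, 2], skeleton, {frozenset((0, 2)): {1}})
```

Edges not covered by any V-structure stay undirected in this sketch; further propagation rules, where used, would run afterwards.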
A distinguishing feature of the invention is its divide-and-conquer strategy based on low-order conditional independence, which had not been explicitly proposed before. In addition, the invention combines causal partitioning of the data, causal skeleton construction for the sub-datasets, and merging of the sub-datasets to model the overall causal skeleton of the data, and then learns the global causal directions after merging. The method is generally applicable to mining the causal information contained in high-dimensional datasets, and experiments show that on low-, medium- and high-dimensional datasets from biomedicine, time series, intelligent control networks and related areas it performs more effective causal inference than existing classical methods, which helps to extract the information value of data in the big-data era.
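The merging of sub-skeletons into one overall skeleton can be sketched as below. The patent only says that sub-datasets are merged by making shared nodes coincide; the rule used here for an edge between two shared nodes (keep it only if every sub-skeleton containing both endpoints also contains the edge) is an assumption borrowed from common practice in recursive structure learning, and `merge_skeletons` is a hypothetical name.

```python
def merge_skeletons(subskeletons):
    """Merge sub-skeletons that share nodes.

    subskeletons: list of (nodes, edges) pairs, where nodes is a set and
    edges is a set of frozenset({u, v}).
    """
    merged = set()
    for _, edges in subskeletons:
        merged |= set(edges)                 # shared nodes coincide by name
    for edge in list(merged):
        for nodes, edges in subskeletons:
            # Assumed rule: a sub-dataset that saw both endpoints and found
            # them separable removes the edge from the merged skeleton.
            if edge <= nodes and edge not in edges:
                merged.discard(edge)
                break
    return merged

sub1 = ({0, 1, 2}, {frozenset((0, 1)), frozenset((1, 2))})
sub2 = ({1, 2, 3}, {frozenset((2, 3))})
skeleton = merge_skeletons([sub1, sub2])
```

Here the edge 1 - 2 is dropped because the second sub-dataset contains both nodes but did not keep the edge, while 0 - 1 and 2 - 3 survive.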
The recursive causal inference method based on causal partitioning provided by the invention assumes that each of the $n$ variables in the dataset corresponds to one node of an $n$-dimensional causal network, and constructs the network according to the divide-and-conquer strategy and the causal partitioning principle. The specific steps are:
Step 1: First construct the 0-order conditional independence table $M$, an $n \times n$ adjacency matrix in which $M_{ij} = 1$ means that nodes $v_i$ and $v_j$ are statistically independent and $M_{ij} = 0$ means that they are not. The entries of $M$ can be computed with any existing independence test;
Step 2: Partition the variable set $V$ into three non-overlapping subsets $A$, $B$ and $C = V \setminus (A \cup B)$, on the criterion that $M_{ij} = 1$ for all $v_i \in A$ and $v_j \in B$. Given how $M$ was computed in the previous step, this can be understood intuitively as: every possible path between $A$ and $B$ is blocked or interrupted by $C$ or a subset of $C$. Then construct two sub-datasets $V_1$ and $V_2$;
Step 3: If Step 2 cannot partition $V$, construct the 1-order conditional independence table to replace the 0-order table $M$, and repeat Step 2. If that fails as well, construct conditional independence tables of successively higher order until Step 2 succeeds. In the extreme case, for example a fully connected graph, the partition of Step 2 cannot be carried out; the method then returns and reports that the graph cannot be partitioned;
Step 4: After obtaining $V_1$ and $V_2$, treat each of them as a dataset in its own right and repeat Steps 1 to 3. This recursive partitioning finally yields $k$ sub-datasets;
Step 5: For each of the $k$ sub-datasets obtained in Step 4, learn the causal skeleton with an existing constraint-based causal inference algorithm (for example the PC algorithm), obtaining the causal skeleton of each substructure, and merge all the sub-skeletons. Because the sub-datasets are usually much smaller than the original dataset, the global causal skeleton obtained in Step 5 is computed more efficiently and is considerably more accurate than with other algorithms;
Step 6: Using the results of the conditional independence tests performed in the five steps above, and the one-to-one correspondence between CI results and V-structures, infer the causal direction between nodes in the causal skeleton, finally obtaining the causal network structure of the complete dataset.
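Steps 2 to 4 above can be sketched as follows, given independence tables that have already been computed for each order. Several points are assumptions rather than details from the patent: the sub-datasets are taken to be $V_1 = A \cup C$ and $V_2 = B \cup C$; the separator $C$ is found by brute force over small candidate sets (`max_sep` is a hypothetical parameter); and the same global tables are reused for every sub-dataset instead of being recomputed as Step 4 describes.

```python
from itertools import combinations

def components(nodes, dependent):
    """Connected components of the graph whose edges are dependent pairs."""
    unseen, comps = list(nodes), []
    while unseen:
        comp, stack = set(), [unseen.pop(0)]
        while stack:
            u = stack.pop()
            comp.add(u)
            linked = [v for v in unseen if dependent(u, v)]
            for v in linked:
                unseen.remove(v)
            stack.extend(linked)
        comps.append(comp)
    return comps

def split(nodes, M, max_sep=2):
    """Step 2: find (A, B, C) with M[i][j] == 1 for all i in A, j in B, or None."""
    nodes = sorted(nodes)
    for size in range(min(max_sep, len(nodes) - 2) + 1):
        for C in combinations(nodes, size):
            rest = [v for v in nodes if v not in C]
            comps = components(rest, lambda u, v: M[u][v] == 0)
            if len(comps) >= 2:              # C disconnects the rest
                A = comps[0]
                return A, set(rest) - A, set(C)
    return None

def recursive_partition(nodes, tables):
    """Steps 3-4: try tables of increasing order, then recurse on both parts."""
    for M in tables:                          # tables[r] is the order-r table
        found = split(nodes, M)
        if found:
            A, B, C = found
            return (recursive_partition(A | C, tables)
                    + recursive_partition(B | C, tables))
    return [set(nodes)]                       # cannot be partitioned further

def table(n, independent_pairs):
    """Helper for the toy example: symmetric 0/1 table with the given 1-entries."""
    M = [[0] * n for _ in range(n)]
    for i, j in independent_pairs:
        M[i][j] = M[j][i] = 1
    return M

# Toy chain 0 - 1 - 2 - 3: nothing is marginally independent (order 0),
# but non-adjacent pairs become independent given one variable (order 1).
M0 = table(4, [])
M1 = table(4, [(0, 2), (0, 3), (1, 3)])
parts = recursive_partition({0, 1, 2, 3}, [M0, M1])
```

For the chain, the order-0 table gives no partition, the order-1 table separates the nodes at 1 and then at 2, and the recursion ends with three two-node sub-datasets that overlap in the separator nodes.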
The invention has three notable innovations:
(1) The causal inference algorithm learns recursively, which greatly reduces computational complexity compared with other methods;
(2) The causal inference algorithm uses low-order CI tests (order ≤ 3), whereas other methods use high-order CI tests (order ≤ number of variables − 2), so the invention can greatly improve the accuracy of the algorithm;
(3) Owing to the two features above, causal analysis can be carried out on higher-dimensional datasets. The time cost is often only one hundredth to one thousandth of that of classical algorithms, with a clear gain in accuracy as well, which to some extent overcomes the difficulty existing methods have in performing feasible, effective causal inference on high-dimensional datasets.
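To make the notion of "order" in point (2) concrete: an order-k CI test conditions on k variables. The order-1 case can be sketched with the standard first-order partial correlation, $r_{ij \cdot k} = (r_{ij} - r_{ik} r_{jk}) / \sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}$, followed by a Fisher z-test. Both choices are assumptions for illustration, as is the rule that a pair is marked independent when some single conditioning variable separates it; the patent does not specify how the 1-order table is filled.

```python
import math

def partial_corr(r_ij, r_ik, r_jk):
    """First-order partial correlation of i and j given k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def looks_independent(r, n_samples, order, z_crit=1.96):
    """Fisher z-test (assumed test): True if independence is not rejected."""
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r))
    return math.sqrt(n_samples - order - 3) * abs(z) < z_crit

def first_order_table(R, n_samples):
    """M1[i][j] = 1 if some single variable k makes i and j look independent."""
    p = len(R)
    M1 = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            for k in range(p):
                if k in (i, j):
                    continue
                r = partial_corr(R[i][j], R[i][k], R[j][k])
                if looks_independent(r, n_samples, order=1):
                    M1[i][j] = M1[j][i] = 1
                    break
    return M1

# Correlation matrix of a chain 0 -> 1 -> 2 (illustrative numbers):
# the correlation of 0 and 2 is fully explained by 1.
R = [[1.00, 0.50, 0.25],
     [0.50, 1.00, 0.50],
     [0.25, 0.50, 1.00]]
M1 = first_order_table(R, n_samples=1000)
```

The effective sample size in the test shrinks with the order (`n_samples - order - 3`), which is one way to see why high-order tests lose reliability on a fixed amount of data.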
Description of the drawings
Fig. 1, subgraphs (a) to (h), shows the accuracy of PC_CP and PC_SADA on the eight datasets Asia, Sachs, Alarm, Barley, Hailfinder, Win95pts, Andes and Pigs, respectively. The performance of the two methods is evaluated by recall, precision and F1.
Fig. 2 shows the running time of PC_CP and PC_SADA on the eight datasets under different sample sizes. Because PC_CP is far faster than PC_SADA, the running time t is shown on a logarithmic scale for readability. Subgraph (a) shows the running time on the Asia and Sachs datasets, (b) on the Alarm and Barley datasets, (c) on the Hailfinder and Win95pts datasets, and (d) on the Andes and Pigs datasets.
Table 1 summarizes the statistical properties of the eight datasets used in the experiments: number of nodes, average degree and maximum in-degree.
Specific embodiment
The network structures used by the invention come from classical causal network datasets, which can be downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php) and from the classical method SADA (R. Cai, et al. SADA: A general framework to support robust causation discovery. ICML 2013). They comprise 8 causal networks covering a range of fields, including causal inference (Asia), protein signaling networks (Sachs), pharmacology (Alarm), crops (Barley), intelligent tutoring systems (Andes) and genetic maps (Pigs). Table 1 lists the statistical properties of the causal networks corresponding to these 8 datasets: number of nodes, average degree and maximum in-degree. These three features are generally considered to capture much of the complexity of a causal network, so they provide a good basis for evaluating a method. We refer to our causal inference method as CP (Causal Partition) and compare it in this experiment with the classical method SADA, which is arguably one of the most effective current methods for learning high-dimensional causal networks. For fairness, both methods use the PC algorithm as their sub-algorithm; the two resulting algorithms are denoted PC_CP and PC_SADA.
First, {250, 500, 1000, 1500, 2000} samples are drawn for each causal network to evaluate PC_CP and PC_SADA; the test data are generated from the true causal network in the same way as in the SADA experiments. The results (Fig. 1) show that PC_CP outperforms PC_SADA in most cases. On small datasets such as Asia and Sachs the F1 values of the two algorithms are very close, but the gap widens as the dataset grows. An important reason is that PC_CP uses low-order CI tests while PC_SADA uses high-order CI tests. When the dataset is small, the difference between "low order" and "high order" is slight; as the dataset grows, the difference grows with it, and the accuracy of the high-order conditional independence tests declines.
We also compared the time cost of PC_CP and PC_SADA. The results (Fig. 2) show that PC_CP is about 1 to 3 orders of magnitude faster than PC_SADA. Notably, on small datasets the running time of the method of the invention is only 3% to 10% of that required by PC_SADA. As the dataset grows, on larger datasets the time required is less than 1% of that required by PC_SADA. On large datasets such as Pigs and Andes, the running time of the method of the invention is only 0.03% (Pigs) and 0.9% (Andes) of that required by PC_SADA.
Table 1. Network topology properties of the 8 datasets used in the experiments: number of nodes, average degree and maximum in-degree
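The recall, precision and F1 measures used in this comparison can be computed as in the sketch below. Treating edges as undirected pairs and scoring the learned skeleton against the true one is an assumption; the patent does not state whether edge direction is scored. The function name `skeleton_scores` is illustrative.

```python
def skeleton_scores(true_edges, learned_edges):
    """Return (precision, recall, f1) for undirected edge sets."""
    truth = {frozenset(e) for e in true_edges}
    learned = {frozenset(e) for e in learned_edges}
    hits = len(truth & learned)                      # correctly recovered edges
    precision = hits / len(learned) if learned else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Two of three learned edges are correct, and two of three true edges are found.
precision, recall, f1 = skeleton_scores(
    true_edges=[(0, 1), (1, 2), (2, 3)],
    learned_edges=[(1, 0), (1, 2), (0, 3)],
)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.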

Claims (2)

1. A recursive causal inference method based on causal partitioning, characterized by comprising the following four basic steps:
(1) causal partitioning of the data: based on the property that the joint distribution of the data mirrors the underlying causal network, the data are partitioned causally, wherein:
(1a) the causal partitioning uses low-order CI tests, of order ≤ 3;
(1b) the causal partitioning is performed recursively;
(2) causal skeleton construction for the sub-datasets: the causal skeleton of each sub-dataset is modeled on the basis of CI tests, which requires the joint distribution of each sub-dataset to mirror the causal skeleton of the corresponding sub-network;
(3) merging of the sub-datasets: all sub-datasets are merged into one by making corresponding nodes coincide;
(4) direction inference for the dataset: both the causal partitioning stage and the causal skeleton modeling stage of the sub-datasets use CI tests; according to the CI test results, directions are inferred for the dataset using the correspondence between the V-structures of the network and conditional independence.
2. The recursive causal inference method based on causal partitioning according to claim 1, characterized in that each of the $n$ variables in the dataset is assumed to correspond to one node of an $n$-dimensional causal network, and the network is constructed according to the divide-and-conquer strategy and the causal partitioning principle; the specific steps are:
Step 1: first construct the 0-order conditional independence table $M$, an $n \times n$ adjacency matrix in which $M_{ij} = 1$ means that nodes $v_i$ and $v_j$ are statistically independent and $M_{ij} = 0$ means that they are not; the entries of $M$ can be computed with an existing independence test;
Step 2: partition the variable set $V$ into three non-overlapping subsets $A$, $B$ and $C = V \setminus (A \cup B)$, on the criterion that $M_{ij} = 1$ for all $v_i \in A$ and $v_j \in B$; given how $M$ was computed in the previous step, this can be understood as every possible path between $A$ and $B$ being blocked or interrupted by $C$ or a subset of $C$; then construct two sub-datasets $V_1$ and $V_2$;
Step 3: if Step 2 cannot partition $V$, construct the 1-order conditional independence table to replace the 0-order table $M$, and repeat Step 2; if that fails as well, construct conditional independence tables of successively higher order until Step 2 succeeds; if the partition of Step 2 cannot be carried out, as for a fully connected graph, return and report that the graph cannot be partitioned;
Step 4: after obtaining $V_1$ and $V_2$, treat each of them as a dataset in its own right and repeat Steps 1 to 3; this recursive partitioning finally yields $k$ sub-datasets;
Step 5: for each of the $k$ sub-datasets obtained in Step 4, learn the causal skeleton with an existing constraint-based causal inference algorithm, obtaining the causal skeleton of each substructure, and merge all the sub-skeletons;
Step 6: using the results of the conditional independence tests performed in the five steps above, and the one-to-one correspondence between CI results and V-structures, infer the causal direction between nodes in the causal skeleton, finally obtaining the causal network structure of the complete dataset.
CN201811265052.6A 2018-10-29 2018-10-29 A kind of recursion causal inference method based on cause and effect segmentation Pending CN109542936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811265052.6A CN109542936A (en) 2018-10-29 2018-10-29 A kind of recursion causal inference method based on cause and effect segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811265052.6A CN109542936A (en) 2018-10-29 2018-10-29 A kind of recursion causal inference method based on cause and effect segmentation

Publications (1)

Publication Number Publication Date
CN109542936A true CN109542936A (en) 2019-03-29

Family

ID=65845114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811265052.6A Pending CN109542936A (en) 2018-10-29 2018-10-29 A kind of recursion causal inference method based on cause and effect segmentation

Country Status (1)

Country Link
CN (1) CN109542936A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134729A (en) * 2020-09-02 2020-12-25 上海科技大学 Method for proving program high-order power consumption side channel safety based on divide-and-conquer
CN112134729B (en) * 2020-09-02 2022-11-04 上海科技大学 Method for proving program high-order power consumption side channel safety based on divide-and-conquer
CN113269336A (en) * 2021-07-19 2021-08-17 中国民用航空总局第二研究所 Flight event cause and effect detection method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN112365171B (en) Knowledge graph-based risk prediction method, device, equipment and storage medium
CN108399748B (en) Road travel time prediction method based on random forest and clustering algorithm
CN102768670B (en) Webpage clustering method based on node property label propagation
CN107103050A (en) A kind of big data Modeling Platform and method
CN108985380B (en) Point switch fault identification method based on cluster integration
Gabadinho et al. Analyzing state sequences with probabilistic suffix trees: The PST R package
Park et al. Software fault prediction model using clustering algorithms determining the number of clusters automatically
CN103034687B (en) A kind of relating module recognition methodss based on 2 class heterogeneous networks
CN109697456A (en) Business diagnosis method, apparatus, equipment and storage medium
Zanghi et al. Strategies for online inference of model-based clustering in large and growing networks
CN109542936A (en) A kind of recursion causal inference method based on cause and effect segmentation
Cao et al. Spatial data discretization methods for geocomputation
Kang et al. Dynamic hypergraph neural networks based on key hyperedges
Liu et al. A supervised community detection method for automatic machining region construction in structural parts NC machining
Chen et al. Application of a decision tree method with a spatiotemporal object database for pavement maintenance and management
CN102708285A (en) Coremedicine excavation method based on complex network model parallelizing PageRank algorithm
Souravlas et al. Probabilistic community detection in social networks
Jeong et al. Effective estimation of node-to-node correspondence between different graphs
Li et al. MultiLineStringNet: a deep neural network for linear feature set recognition
Goudarzi et al. A hybrid spatial data mining approach based on fuzzy topological relations and MOSES evolutionary algorithm
CN104915371A (en) Multi-entity-sparse-relation-oriented combined excavating method
Pérez et al. Extraction and reuse of design patterns from genetic algorithms using case-based reasoning
Munir et al. A Framework for Knowledge Representation Integrated with Dynamic Network Analysis
Li et al. Progresses in Link Prediction: A Survey
KR20190040864A (en) Apparatus and method for representation learning in signed directed networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190329
