CN116090522A - Causal relation discovery method and system for missing data set based on causal feedback - Google Patents
Causal relation discovery method and system for missing data set based on causal feedback
- Publication number: CN116090522A
- Application number: CN202310364531.8A
- Authority
- CN
- China
- Prior art keywords: causal, model, data set, missing data, sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The disclosure provides a causal relationship discovery method and system for a missing data set based on causal feedback. The method comprises the following steps: establishing a causal relationship discovery model, wherein the model comprises a missing data set complement sub-model and a causal discovery sub-model, the missing data set complement sub-model is used for complementing the missing data set, and the causal discovery sub-model is used for determining the optimal causal graph corresponding to the complemented data set; performing joint training on the missing data set complement sub-model and the causal discovery sub-model to obtain a trained causal relationship discovery model; and inputting the missing data set to be processed into the trained model, which outputs the optimal causal graph corresponding to that data set. In this way, the causal relationships of a missing data set can be discovered more accurately.
Description
Technical Field
The invention relates to the field of data processing, in particular to a causal relation discovery method and system for a missing data set based on causal feedback.
Background
For a long time, discussions about causality were confined to literature and philosophy, until the advent of causal inference. Causal inference is used to reveal the inherent generation mechanisms of things and discover their laws of operation, and has applications in statistics, medicine, economics, law, and many other fields. Causal relationship discovery is a very important branch of causal inference; it aims to derive a causal relationship model from data in order to reveal the inherent generation mechanism of the data.
The most directly effective method of causal relationship discovery is the randomized controlled trial (RCT), which is considered the "gold standard" for studying causal relationships in clinical research. However, RCTs are infeasible in most cases because they are costly, ethically inadvisable, or simply impractical. For example, when assessing the effect of a pregnant woman's smoking on fetal development, one cannot require one group of subjects to smoke while another group abstains. Researchers have therefore turned their attention to readily available observational data. Most related studies to date are based on complete data sets, while causal relationship discovery on missing data sets is less discussed. In practice, however, missing data sets are ubiquitous, so causal relationship discovery on observational data sets containing missing values is critical.
In the related art, when a missing data set is processed, records containing missing values are deleted directly, and only the complete data entries are used for causal discovery (list-wise deletion). To make maximum use of the data set, a test-wise deletion method has been proposed: when a conditional independence test is carried out, only the records unavailable for that test are deleted, which ensures maximum data utilization. Both deletion strategies are simple and intuitive, but they either perform poorly or require a great deal of prior knowledge.
Therefore, it is desirable to provide a causal relationship discovery method and system for a missing data set based on causal feedback, which are used for more accurately discovering the causal relationship of the missing data set.
Disclosure of Invention
One of the embodiments of the present specification provides a causal relationship discovery method for a missing data set based on causal feedback, the method comprising: establishing a causal relation discovery model, wherein the causal relation discovery model comprises a missing data set complement sub-model and a causal discovery sub-model, the missing data set complement sub-model is used for complementing a missing data set, and the causal discovery sub-model is used for determining an optimal causal graph corresponding to the complemented missing data set; performing joint training on the missing data set complement sub-model and the causal relationship discovery sub-model to obtain a trained causal relationship discovery model; inputting the missing data set to be processed into the trained causal relation discovery model, and outputting an optimal causal graph corresponding to the missing data set to be processed by the trained causal relation discovery model.
In some embodiments, the inputs of the missing data set complement sub-model include the missing data set to be processed, a mask matrix, and a random noise matrix, wherein the random noise matrix obeys a standard normal distribution; and the output of the missing data set complement sub-model includes the complement data set corresponding to the missing data set to be processed.
In some embodiments, the missing dataset complement sub-model includes a generator and a discriminator, and the inputs of the discriminator include a hint matrix.
In some embodiments, the causal discovery submodel includes a graph generation unit for capturing variable relationships from a complement dataset corresponding to the pending missing dataset and generating a causal graph adjacency matrix, and a graph search unit for searching a graph space for an optimal causal graph corresponding to the pending missing dataset.
In some embodiments, the graph generation unit includes an encoder consisting of a multi-layer self-attention convolutional network and a decoder comprising a single-layer neural network.
In some embodiments, the graph search unit includes a three-layer feed-forward multi-layer perceptron with ReLU as the activation function.
In some embodiments, the performing the joint training on the missing dataset complement sub-model and the causal discovery sub-model to obtain a trained causal relationship discovery model includes: and carrying out joint training on the missing data set complement sub-model and the causal relationship discovery sub-model based on a causal characterization extraction feedback mechanism, and obtaining the trained causal relationship discovery model.
In some embodiments, the feedback mechanism based on causal characterization extraction performs joint training on the missing dataset complement sub-model and the causal discovery sub-model, and obtains the trained causal relationship discovery model, including: in the combined training process, the graph searching unit fuses the classification errors for training.
In some embodiments, the method further comprises: pruning the best causal graph corresponding to the to-be-processed missing data set output by the trained causal relationship discovery model, and determining the target best causal graph corresponding to the to-be-processed missing data set.
One of the embodiments of the present specification provides a causal relationship discovery system for a missing data set based on causal feedback, comprising: a model building module, configured to establish a causal relationship discovery model, wherein the causal relationship discovery model comprises a missing data set complement sub-model and a causal discovery sub-model, the missing data set complement sub-model is used for complementing a missing data set, and the causal discovery sub-model is used for determining an optimal causal graph corresponding to the complemented missing data set; a model training module, configured to perform joint training on the missing data set complement sub-model and the causal discovery sub-model to obtain a trained causal relationship discovery model; and a causal relationship discovery module, configured to input the missing data set to be processed into the trained causal relationship discovery model, which outputs an optimal causal graph corresponding to the missing data set to be processed.
Compared with the prior art, the causal relation discovery method and system for the missing data set based on causal feedback provided by the specification have the following beneficial effects:
1. the missing data set is complemented, so that the causal relationship discovery model can discover the causal relationships of the missing data set more accurately based on the complemented data;
2. the discriminator of the missing data set complement sub-model does not judge the authenticity of the generator's output vector as a whole, but tries to judge which elements of the generated vector are real and which are generated; a large amount of consecutively missing data carries no hint information, which affects the accuracy of data completion and causes the generator to output different results each time, so a hint mechanism of missing-rate-difference perception is introduced, and the problem is alleviated through a hint matrix;
3. in order to exploit the mutual promotion of missing-data completion and causal discovery, prior experience is moderately utilized while the completion of data and the search for new graphs are ensured, so as to obtain better performance; a feedback mechanism based on causal characterization extraction is introduced to fuse the causal discovery result into the data completion stage;
4. in the joint training process, the graph search unit fuses classification errors into its training, which helps the model converge quickly and improves stability;
5. pruning the best causal graph output by the causal relation discovery model to remove false edges in the best causal graph.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are not limiting; in the drawings, like numerals represent like structures, wherein:
FIG. 1 is a flow chart of a causal relationship discovery method for a missing data set based on causal feedback, shown in accordance with some embodiments of the present description;
FIG. 2 is a schematic structural diagram of a causal relationship discovery model shown in accordance with some embodiments of the present description;
FIG. 3 is a diagram of all paths in an estimated causal graph according to some embodiments of the present description;
FIG. 4 is a block diagram of a missing dataset causal relationship discovery system based on causal feedback shown in some embodiments of the present description;
fig. 5 is a schematic structural diagram of an electronic device according to some embodiments of the present description.
Description of the embodiments
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
As used in this specification and the claims, the terms "a," "an," "the," and/or "the" are not specific to a singular, but may include a plurality, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
The existing causal discovery methods can be mainly divided into three categories: constraint-based methods, score-based methods, and hybrid methods. Score-based methods are widely adopted; their core idea is to define a scoring function $S(\mathcal{G})$ over causal graphs $\mathcal{G}$ and combine it with a search algorithm so as to find, in the search space $\mathbb{G}$, the graph with the best score, i.e., the smallest score. The objective function of the score-based method is:

$$\mathcal{G}^{*}=\underset{\mathcal{G}\in\mathbb{G}}{\arg\min}\; S(\mathcal{G}) \tag{1}$$

FIG. 1 is a flow chart of a causal relationship discovery method for a missing data set based on causal feedback, according to some embodiments of the present description. As shown in FIG. 1, the causal relationship discovery method for a missing data set based on causal feedback may include the following steps.
Step 110, establishing a causal relationship discovery model. FIG. 2 is a schematic structural diagram of a causal relationship discovery model according to some embodiments of the present description. As shown in FIG. 2, the causal relationship discovery model includes a missing data set complement sub-model for complementing a missing data set and a causal discovery sub-model for determining the optimal causal graph corresponding to the complemented missing data set.
In the case of missing data, the original dataset $X$ appears in the incomplete form $\tilde{X}$. Let $M$ be the mask matrix corresponding to $\tilde{X}$, which indicates the locations of the missing data in $\tilde{X}$. $\tilde{x}_i$ and $m_i$ are both $n$-dimensional vectors, where $\tilde{x}_i$ is the column vector of $\tilde{X}$ corresponding to the $i$-th feature (also referred to as "the $i$-th column vector of $\tilde{X}$"), $m_i$ is the column vector of $M$ corresponding to the $i$-th feature (also referred to as "the $i$-th column vector of $M$"), and $n$ is the number of samples in $X$, which is also the number of samples in $\tilde{X}$. The correspondence among $X$, $\tilde{X}$ and $M$ is:

$$\tilde{x}_{ij}=\begin{cases} x_{ij}, & m_{ij}=1\\ *, & m_{ij}=0 \end{cases}$$

where $*$ represents missing data, $\tilde{x}_{ij}$ represents the $j$-th datum of the $i$-th feature in the incomplete dataset $\tilde{X}$ corresponding to the original dataset $X$, $x_{ij}$ represents the $j$-th datum of the $i$-th feature in $X$, and $m_{ij}$ represents the $j$-th element of the $i$-th column of $M$.
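The mask construction above can be written in a few lines; the following is a minimal numpy sketch, assuming missing entries are encoded as NaN (the function name and encoding are illustrative, not part of the patent):

```python
import numpy as np

def build_mask(x_tilde):
    # m_ij = 1 where the value is observed, 0 where it is missing.
    # Encoding missing entries as NaN is an assumption for this sketch.
    return (~np.isnan(x_tilde)).astype(float)

# toy incomplete dataset: 3 samples, 2 features
x_tilde = np.array([[1.0, np.nan],
                    [np.nan, 2.0],
                    [3.0, 4.0]])
M = build_mask(x_tilde)
# per-feature missing rates lambda_i, used later by the hint mechanism
missing_rate = 1.0 - M.mean(axis=0)
```

Here each column of `M` plays the role of $m_i$, and `missing_rate[i]` corresponds to the missing rate $\lambda_i$ of the $i$-th feature.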
The causal relationship discovery model implements missing-data completion based on the distribution of the missing dataset estimated by a Generative Adversarial Network (GAN). Since a GAN cannot accept NaN (Not a Number) input — NaN describes a value that is not a valid number and belongs to no meaningful class — a random noise matrix $Z$ obeying the standard normal distribution is provided, and $M\odot\tilde{X}+(1-M)\odot Z$ is used to replace the original $\tilde{X}$, where $\odot$ denotes element-wise multiplication. The generator $G$ takes the incomplete dataset $\tilde{X}$, the mask matrix $M$ and the random noise matrix $Z$ as input:

$$\bar{X}=G\big(\tilde{X},\,M,\,(1-M)\odot Z\big)$$

$\bar{X}$ is the output dataset of the generator $G$: $G$ generates a corresponding estimate for every element of $\tilde{X}$. In order not to change the true values, only the missing elements of $\tilde{X}$ are replaced by the corresponding estimates, and the resulting new dataset is denoted $\hat{X}$:

$$\hat{X}=M\odot\tilde{X}+(1-M)\odot\bar{X}$$

In the data-completion scenario there is no true/fake label, so the discriminator does not distinguish the authenticity of the generator's output as a whole, but tries to judge which elements of $\hat{X}$ are real and which are generated. A large amount of consecutively missing data carries no hint information, which affects the accuracy of data completion and causes the GAN to output different results each time; a hint mechanism of missing-rate-difference perception is introduced to alleviate this problem, and its realization depends on a hint matrix $H$. The hint matrix $H$ is input to the discriminator network, so that the discriminator can partially observe the mask matrix $M$ and obtain hints from it. The hint matrix $H$ is generated as follows:

$$h_{ij}=b_{ij}\,m_{ij}+0.5\,(1-b_{ij}),\qquad b_{ij}\sim B(1-\lambda_i)$$

where $h_{ij}$ is the $j$-th element of the $i$-th column of the hint matrix $H$, $B(p)$ denotes the Bernoulli distribution with probability $p$, and $\lambda_i$ is the missing rate of the $i$-th feature of the dataset $\tilde{X}$.
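A numpy sketch of the hint-matrix generation, assuming the Bernoulli parameterization reconstructed above (`hint_matrix` and `rng` are illustrative names, not the patent's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def hint_matrix(M, missing_rate):
    # b_ij ~ Bernoulli(1 - lambda_i): where b_ij = 1 the discriminator sees
    # the true mask value; where b_ij = 0 it sees 0.5 and must decide itself.
    # This parameterization is an assumption reconstructed from the text.
    B = (rng.random(M.shape) < (1.0 - missing_rate)).astype(float)
    return B * M + 0.5 * (1.0 - B)

M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
H = hint_matrix(M, missing_rate=np.array([0.3, 0.6]))
```

Every entry of `H` is either the true mask value (0 or 1) or the "unknown" marker 0.5; a higher missing rate for a feature yields more 0.5 entries in its column, i.e., more work for the discriminator.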
The hint mechanism of missing-rate-difference perception makes the discriminator pay more attention to the features with higher missing rates. The discriminator receives $\hat{X}$ and $H$ as input and outputs a prediction $\hat{M}$ of $M$:

$$\hat{M}=D(\hat{X},H)$$

where $\hat{M}$ is the output dataset of the discriminator $D$. Passing $H$ to $D$ lets $D$ know the answers to most of the questions: 0 denotes missing data and 1 denotes real data; the discriminator learns from the known answers and focuses on the unknown ones (represented by 0.5), i.e., the entries represented by 0.5 are the ones the discriminator needs to distinguish. Through iteration, the discriminator eventually learns the distribution of the data.
The generator loss function is as follows:

$$\mathcal{L}_G=-\sum_{i=1}^{d}\sum_{j=1}^{n}(1-m_{ij})\log\hat{m}_{ij}+\sum_{i=1}^{d}\alpha_i\sum_{j=1}^{n} m_{ij}\,\ell(x_{ij},\bar{x}_{ij}),\qquad \ell(x,\bar{x})=\begin{cases}(\bar{x}-x)^2, & x\ \text{is continuous}\\ -x\log\bar{x}, & x\ \text{is discrete}\end{cases}$$

where $\hat{m}_i$ is the $i$-th column vector of the output dataset of the discriminator $D$, $\bar{x}_i$ is the $i$-th column vector of the output dataset of the generator $G$, $x_i$ is the $i$-th column vector of the original dataset, and $n$ is the number of samples in the dataset. $\alpha_i$ is a weight whose value changes the importance of the $i$-th feature in the overall loss function: the larger the value, the greater the impact on the overall loss function. In some embodiments, $\alpha_i$ can be the missing rate corresponding to the $i$-th feature.
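A minimal numpy sketch of the generator loss above, together with a standard mask cross-entropy discriminator loss; continuous features only, and the names and exact weighting are assumptions rather than the patent's implementation:

```python
import numpy as np

def generator_loss(M, M_hat, X, X_bar, alpha):
    # Adversarial term on missing entries plus weighted squared reconstruction
    # error on observed entries; alpha is the per-feature weight alpha_i.
    eps = 1e-8
    adv = -np.sum((1 - M) * np.log(M_hat + eps))
    rec = np.sum(alpha * M * (X_bar - X) ** 2)
    return adv + rec

def discriminator_loss(M, M_hat):
    # Cross-entropy between the true mask M and the prediction M_hat.
    eps = 1e-8
    return -np.mean(M * np.log(M_hat + eps) + (1 - M) * np.log(1 - M_hat + eps))
```

The generator is rewarded when the discriminator labels its imputed entries as real (`M_hat` close to 1 where `M` is 0), while the discriminator is rewarded for recovering the true mask.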
In some embodiments, the loss function of the discriminator is:

$$\mathcal{L}_D=-\sum_{i=1}^{d}\sum_{j=1}^{n}\big[m_{ij}\log\hat{m}_{ij}+(1-m_{ij})\log(1-\hat{m}_{ij})\big]$$

In some embodiments, the causal discovery sub-model may include a graph generation unit and a graph search unit. The graph generation unit, which may be based on an encoder-decoder model, is responsible for capturing variable relationships from the input data and generating a causal graph adjacency matrix. The graph search unit may utilize the exploration-exploitation capability of Actor-Critic, combined with a custom reward function, to search for the optimal causal graph in the graph space. The Actor-Critic algorithm, which can select actions over a continuous action space, is used to explore the optimal causal graph. The dataset and the candidate causal graph are regarded as the state and the action of the model, respectively: the Actor consists of the encoder-decoder of the graph generation unit, and the Critic is the graph search unit, a three-layer feed-forward multi-layer perceptron with ReLU (Rectified Linear Unit) as the activation function.
In some embodiments, the graph generation unit may include an encoder-decoder architecture. The encoder consists of a multi-layer self-attention convolutional network; combined with the codec structure, the self-attention convolutional network can find causal relationships between variables. Inspired by research on combinatorial optimization problems, in particular on Pointer networks, the input of the encoder is composed of $s$ samples randomly extracted from $\hat{X}$ and reshaped into a $d\times s$ matrix, i.e., the $i$-th row is the vector formed by concatenating the $s$ sampled values of the $i$-th feature. The $d$ nodes are thus distributed in an $s$-dimensional space, which facilitates capturing the causal relationships among the $d$ nodes. The output of the encoder is denoted $enc\in\mathbb{R}^{d\times d_e}$, where $d_e$ is the dimension of the encoder.
The decoder comprises only a single-layer neural network. By computing on every two different encoding results $enc_i$ and $enc_j$, it generates the adjacency matrix $A$:

$$a_{ij}=\mathbb{1}\Big(\sigma\big(u^{\top}\tanh(W_1\,enc_i+W_2\,enc_j)\big)-\tau>0\Big)$$

where $a_{ij}$ is the $j$-th element of the $i$-th row of the adjacency matrix $A$; $W_1\in\mathbb{R}^{d_h\times d_e}$, $W_2\in\mathbb{R}^{d_h\times d_e}$ and $u\in\mathbb{R}^{d_h}$ are all trainable parameters, and $d_h$ is the dimension of the hidden layer in the decoder; together they map the encoder output $enc$ into real space. $\sigma$ is the sigmoid function, which maps the result into $(0,1)$. $\mathbb{1}(\cdot)$ is the indicator function, which equals 1 when its argument is greater than 0. $\tau$ is a hyperparameter used to limit the number of edges of the directed acyclic graph (DAG): the larger $\tau$ is, the fewer the edges.
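The single-layer decoder can be sketched as follows; a numpy illustration under the shapes stated above, with random weights standing in for the trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_adjacency(enc, W1, W2, u, tau):
    # a_ij = 1( sigmoid(u^T tanh(W1 enc_i + W2 enc_j)) - tau > 0 ).
    # Shapes follow the text: enc is d x d_e, W1/W2 are d_h x d_e, u is d_h.
    # The diagonal is skipped since a causal graph has no self-loops.
    d = enc.shape[0]
    A = np.zeros((d, d), dtype=int)
    for i in range(d):
        for j in range(d):
            if i != j:
                score = sigmoid(u @ np.tanh(W1 @ enc[i] + W2 @ enc[j]))
                A[i, j] = int(score - tau > 0)
    return A

rng = np.random.default_rng(1)
d, d_e, d_h = 3, 4, 5
enc = rng.standard_normal((d, d_e))
W1, W2 = rng.standard_normal((d_h, d_e)), rng.standard_normal((d_h, d_e))
u = rng.standard_normal(d_h)
```

Since the sigmoid output lies strictly in (0, 1), `tau = 1` yields an empty graph and `tau = 0` a complete one, illustrating how `tau` limits the edge count.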
In some embodiments, the graph search unit uses eBIC as the scoring function. For a given causal graph $\mathcal{G}$, the general form of eBIC is defined as follows:

$$S_{\text{eBIC}}(\mathcal{G})=-2\log L(\hat{\theta};\mathcal{G})+|\mathcal{E}|\log n+2\gamma|\mathcal{E}|\log d$$

where $\mathcal{E}$ is the edge set of $\mathcal{G}$, $L(\hat{\theta};\mathcal{G})$ represents the maximum likelihood of the model, and $\hat{\theta}$ is the maximum-likelihood parameter. The model complexity is constrained by the parameter $\gamma$.
Let the regression model of the $i$-th variable on its parent set be:

$$\hat{x}_i=f_i\big(\text{pa}(x_i)\big)$$

where $\hat{x}_i$ is the estimate of $x_i$ given by the model, $\text{pa}(x_i)$ denotes the set of parents of $x_i$ in $\mathcal{G}$, and the $j$-th element of $\hat{x}_i$ corresponds to the $j$-th observation sample. The regression model $f_i$ can be flexibly selected according to needs. Let $RSS_i$ be the residual sum of squares of the $i$-th variable. Then eBIC can be expressed as follows:

$$S_{\text{eBIC}}(\mathcal{G})=\sum_{i=1}^{d}n\log\!\Big(\frac{RSS_i}{n}\Big)+|\mathcal{E}|\log n+2\gamma|\mathcal{E}|\log d$$

where the first term is equivalent to the maximum log likelihood.
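A numpy sketch of the least-squares eBIC above, assuming a linear regressor for each $f_i$ (the text allows any regressor) and an illustrative two-variable dataset:

```python
import numpy as np

def ebic_score(X, A, gamma=0.5):
    # Least-squares form of eBIC: each variable is regressed on its parents
    # (a linear f_i is an assumption here), plus the edge-count penalties.
    n, d = X.shape
    edges = int(A.sum())
    total = 0.0
    for i in range(d):
        parents = np.flatnonzero(A[:, i])
        if parents.size:
            P = X[:, parents]
            beta, *_ = np.linalg.lstsq(P, X[:, i], rcond=None)
            resid = X[:, i] - P @ beta
        else:
            resid = X[:, i] - X[:, i].mean()
        total += n * np.log(resid @ resid / n + 1e-12)
    return total + edges * np.log(n) + 2 * gamma * edges * np.log(d)

rng = np.random.default_rng(2)
x1 = rng.standard_normal(200)
x2 = 2.0 * x1 + 0.1 * rng.standard_normal(200)
X = np.column_stack([x1, x2])
A_true = np.array([[0, 1], [0, 0]])    # x1 -> x2
A_empty = np.zeros((2, 2), dtype=int)
```

On this toy dataset the true graph explains `x2` almost perfectly, so its eBIC (lower is better) beats the empty graph despite the extra edge penalty.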
There are also some limitations to the exploration-exploitation learning paradigm. Actor-Critic learns from continuous attempts in the solution space, from which it obtains returns. However, since the solution space is typically very large, a large amount of exploration is required to learn effectively, which increases training time. Furthermore, the exploration strategy during exploration-exploitation learning is often random, which easily leads to unstable results if exploration is inadequate. In some embodiments, classification errors are therefore fused into the causal graph search to help the Actor-Critic converge quickly and improve stability. Given that the dataset dimension may be large, a large number of causal relationships need to be predicted, which affects the performance of the classifier; part of the classifier output is therefore randomly dropped.
Since the causal graph has no self-loops, $a_{ii}$ is always 0, i.e., the diagonal elements are not predicted. The training label of the classifier is therefore

$$y=\text{vstack}(\tilde{a}_1,\dots,\tilde{a}_d)$$

where $\text{vstack}(\cdot)$ represents the vertical stacking of column vectors, and $\tilde{a}_i$ represents the new vector obtained by removing the $i$-th element from the $i$-th column vector $a_i$ of the adjacency matrix $A$. The output-layer calculation process of the classifier is:

$$\hat{y}=r\odot\sigma(W_o z+b_o)$$

where $W_o$ and $b_o$ are the trainable parameters of the last layer, $z$ is the penultimate-layer vector representation, $r$ is the loss-indication vector, and $\sigma$ is the sigmoid function. The classification loss function is as follows:

$$\mathcal{L}_{cls}=-\frac{1}{(1-\rho)\,d(d-1)}\sum_{k}r_k\big[y_k\log\hat{y}_k+(1-y_k)\log(1-\hat{y}_k)\big]+\beta\|\theta_{cls}\|_2^2$$

where $\beta$ is a regularization coefficient, $r_k\in\{0,1\}$ is the $k$-th element of $r$, $\rho$ is the drop rate, $y_k$ is the $k$-th element of the training label $y$, $\hat{y}_k$ is the $k$-th element of the classifier output, and $\theta_{cls}$ denotes the classifier parameters. Since part of the output nodes are dropped, the values of the dropped nodes should be ignored when calculating the loss; hence the multiplication by $r_k$ in the above formula.
In some embodiments, to enable the derivation of a directed acyclic graph in conjunction with continuous optimization methods, the causal-graph acyclicity constraint is used: a directed graph with adjacency matrix $A$ is acyclic if and only if

$$h(A)=\operatorname{tr}\big(e^{A\odot A}\big)-d=0$$

where $e^{A\odot A}$ is the matrix exponential of $A\odot A$ and $\operatorname{tr}(\cdot)$ is the trace of a matrix.
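The acyclicity measure can be checked numerically; a small numpy sketch using a truncated power series for the matrix exponential (an implementation convenience for small graphs, not part of the patent):

```python
import numpy as np

def matrix_exp(M, terms=30):
    # truncated power series exp(M) = sum_k M^k / k!, fine for small graphs
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        E = E + term
    return E

def h_acyclic(A):
    # h(A) = tr(exp(A ∘ A)) - d; zero iff the graph encoded by A is acyclic
    return float(np.trace(matrix_exp(A * A)) - A.shape[0])
```

For the DAG 0 → 1 the measure is exactly 0, while the two-cycle 0 ⇄ 1 yields a strictly positive value, matching the non-negativity claimed below.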
In some embodiments, the custom reward function is defined as follows:

$$\text{reward}(\mathcal{G})=-\Big[S_{\text{eBIC}}(\mathcal{G})+\lambda_1\big(1-\mathbb{1}_{\text{DAG}}(\mathcal{G})\big)+\lambda_2\,h(A)\Big]$$

where $\lambda_1$ and $\lambda_2$ are penalty factors, and $\mathbb{1}_{\text{DAG}}(\cdot)$ is an indicator function indicating whether $\mathcal{G}$ is a directed acyclic graph: if its value is 1, $\mathcal{G}$ is a directed acyclic graph; if its value is 0, $\mathcal{G}$ is not. It can be proved that $h(A)$ is always non-negative; when $h(A)$ is small, $\mathcal{G}$ may still contain cycles, hence the indicator-function constraint is added. At the same time, the acyclicity term $h(A)$ is also multiplied by the penalty factor $\lambda_2$ to ensure that it is as small as possible. It is desirable to maximize the reward function, whose objective is:

$$\max_{\mathcal{G}}\ -\Big[S_{\text{eBIC}}(\mathcal{G})+\lambda_1\big(1-\mathbb{1}_{\text{DAG}}(\mathcal{G})\big)+\lambda_2\,h(A)\Big] \tag{11}$$

When $\lambda_1$ and $\lambda_2$ take appropriate values, formula (11) is equivalent to formula (1). Combining the reward function, the final optimization objective is obtained:

$$J(\psi)=\mathbb{E}_{\mathcal{G}\sim\pi(\cdot\mid s)}\big[\text{reward}(\mathcal{G})\big]$$

where $\pi(\cdot\mid s)$ represents the action policy and $\psi$ denotes the trainable parameters in the graph generation model. The gradient is calculated using the Monte Carlo policy gradient algorithm: $s$ samples are randomly selected from the dataset and used as one episode to estimate the gradient and update the parameters.
Step 120, performing joint training on the missing data set complement sub-model and the causal discovery sub-model to obtain a trained causal relationship discovery model.
To take advantage of the mutual promotion of missing-data completion and causal discovery, prior experience is moderately exploited to obtain better performance while the completion of data and the search for new graphs are ensured. As shown in FIG. 1, a feedback mechanism based on causal representation extraction (CRE) is introduced here to fuse the result of causal relationship discovery into the data-completion stage. In an initial study, an attempt was made to use B as a supplemental feature input to the GAN, but experiments showed that this did not promote the performance of the model at all. The reason is that the simple relationship representation of an adjacency matrix can only express whether a causal relationship exists between variables; it cannot express the relevance between nodes at a deeper level — for example, it can express neither the indirect causal effects between nodes nor the causal strength.
In some embodiments, one-hot encoding is used to generate a unique initial representation for each node. When updating the node vector representations, CRE makes good use of the acyclic nature of the causal graph: the representation of a node is updated by adding the embedded vectors h_i of all its immediate parent nodes Pa(j), attenuated by the hyper-parameter γ, the information decay rate, so that h_j = e_j + γ·Σ_{i∈Pa(j)} h_i, where e_j is the initial one-hot representation of node j. To visit every causal path while avoiding duplicate computation, the node vector updates are carried out with a depth-first search. To prevent learned erroneous causal relationships from affecting subsequent training, after the node vectors are computed, the wrong directed edges are removed by independence testing. The node vectors of a feedback period form the causal characterization E^(t); it is fused through an attention-style computation with trainable parameters W_Q, W_K and W_V, in which Q, K and V are intermediate variables obtained by projecting E^(t), K^T is the transpose of K, d_k is its dimension, and positions corresponding to missing entries skip the operation. A complete pass from data completion to causal discovery is called a feedback period: E^(t) denotes the causal characterization of the t-th feedback period, and F^(t) the fused feature matrix of the t-th feedback period. The new features are then spliced onto the original features as input to the generator network; in the first feedback period, the causal characterization is a randomly initialized matrix.
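The node-update formula above is lost to image extraction in the source; the depth-first, decay-weighted parent aggregation it describes can be sketched as follows (the function name and the additive update form are assumptions consistent with the description, not the patent's exact equations):

```python
import numpy as np

def cre_embeddings(A, gamma=0.3):
    """Hypothetical sketch of the CRE node-representation update.
    A: d x d adjacency matrix of an (assumed acyclic) causal graph,
    with A[i, j] = 1 for a directed edge i -> j. Each node starts
    from its one-hot vector; a node's embedding accumulates its
    parents' embeddings attenuated by gamma, parents-first (DFS)."""
    d = A.shape[0]
    H = np.eye(d)                     # one-hot initial representations
    done = [False] * d

    def update(j):
        if done[j]:
            return                    # each node is computed exactly once
        for i in range(d):
            if A[i, j]:
                update(i)             # depth-first: finish parents first
        for i in range(d):
            if A[i, j]:
                H[j] += gamma * H[i]  # attenuated parent information
        done[j] = True

    for j in range(d):
        update(j)
    return H
```

On a chain 0 → 1 → 2 the grandparent's influence arrives at node 2 attenuated by γ², which is the "hierarchical" strength information an adjacency matrix alone cannot convey.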
And 130, inputting the missing data set to be processed into a trained causal relationship discovery model, and outputting an optimal causal graph corresponding to the missing data set to be processed by the trained causal relationship discovery model.
In some embodiments, because the causal discovery model aims to find the highest-scoring graph, the output is not the policy itself; instead, the best-scoring causal graph among all causal graphs generated during training is recorded as the output. This is not the final result, however: the best-scoring causal graph is likely to contain "false edges" and requires further pruning. Although enlarging the corresponding penalty factor in the reward function could achieve pruning, it easily leads to the loss of correct edges. Accordingly, pruning may proceed as follows: for each edge in the best-scoring causal graph, delete it while keeping the remaining edges unchanged; if the score of the resulting graph is not degraded, or degrades only within an acceptable degree, the deletion is kept, otherwise the edge is restored. The pruned best-scoring causal graph is taken as the output of the Actor-Critic.
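The edge-deletion pruning procedure described above can be sketched as follows; `score_fn` is a hypothetical stand-in for whatever graph-scoring function the Actor-Critic uses (higher is better), and `tolerance` models the "acceptable degree" of degradation:

```python
def prune(best_graph_edges, score_fn, tolerance=0.0):
    """Edge-deletion pruning sketch: for each edge, tentatively remove it
    and keep the deletion only if the graph's score does not degrade by
    more than `tolerance`; otherwise restore the edge."""
    kept = list(best_graph_edges)
    base = score_fn(kept)
    for e in list(best_graph_edges):
        trial = [x for x in kept if x != e]
        if score_fn(trial) >= base - tolerance:
            kept = trial              # deletion accepted: edge was spurious
            base = score_fn(kept)     # re-baseline on the pruned graph
    return kept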
The performance of the causal relationship discovery model (CF-ICD) is described below by experimental data.
The completion algorithms compared against the causal relationship discovery model include List-wise Deletion (LD), Matrix Factorization (MF), Multiple Imputation by Chained Equations (MICE), and Generative Adversarial Imputation Nets (GAIN); the compared causal discovery algorithms include the constraint-based method PC, the score-based method GES, and the hybrid Max-Min Parents-and-Children (MMPC) method. The evaluation index used is the Structural Hamming Distance (SHD), the most commonly used graph-model evaluation index; it reflects the minimum number of edge additions, deletions, and reversals required to convert a given DAG into the true causal graph, so the smaller the SHD, the more accurate the causal graph derived by the model.
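As a concrete illustration of the SHD evaluation index, a sketch over {0, 1} adjacency matrices (this matrix representation is assumed; the patent does not fix one):

```python
import numpy as np

def shd(B_est, B_true):
    """Structural Hamming Distance between two DAG adjacency matrices with
    entries in {0, 1}: each missing or extra edge counts as one error, and
    a reversed edge counts once (not as one deletion plus one addition)."""
    diff = np.abs(B_est - B_true)
    # diff[i, j] and diff[j, i] are both 1 exactly when the edge between
    # i and j is reversed; count each such pair once instead of twice.
    reversed_pairs = np.triu(diff * diff.T, 1).sum()
    return int(diff.sum() - reversed_pairs)
```

A graph identical to the truth scores 0; one reversed edge plus one extra edge scores 2.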
Evaluating a causal discovery model requires a dataset with a definitive, authoritative causal graph, so real-world datasets without a known causal graph cannot be used for evaluation. The causal relationship discovery model is therefore evaluated on simulated data and on real public data that provides a causal graph.
Given a generation model, simulation data can be generated according to the structural-equation paradigm x = f(x) + n. Given the node count d, a strictly upper-triangular adjacency matrix W is generated, with the sampled weights guaranteed to be non-zero. Three different types of simulated datasets are generated herein: linear Gaussian, non-linear Gaussian, and non-linear non-Gaussian. In each scheme the last term of the structural equation is the noise n; Gaussian noise is used as a base and raised to a power to obtain the non-Gaussian noise. The three schemes are similar to the generation processes used in the NOTEARS algorithm and DAG-GNN, and the causal graph is identifiable. Each class of data comprises 50 generated datasets of 10000 samples each, with 12 variables per sample.
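The three generation formulas themselves are lost to image extraction; a sketch of the linear-Gaussian scheme under NOTEARS-style assumptions (uniformly sampled non-zero weights and unit-variance Gaussian noise, both assumed rather than taken from the source) is:

```python
import numpy as np

def simulate_linear_gaussian(d=12, n=10000, w_low=0.5, w_high=2.0, seed=0):
    """Sample a strictly upper-triangular weight matrix W (so the graph is
    a DAG by construction) with non-zero signed weights, then generate
    x_j = sum_i W[i, j] * x_i + noise_j column-by-column in causal order."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1, 1], (d, d))
    W = np.triu(rng.uniform(w_low, w_high, (d, d)) * signs, k=1)
    X = np.zeros((n, d))
    for j in range(d):                # column j depends only on columns < j
        X[:, j] = X @ W[:, j] + rng.standard_normal(n)
    return X, W
```

Swapping the linear term for a non-linearity, or transforming the noise, yields the other two schemes in the same framework.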
Sachs is a protein signalling network based on protein and phospholipid expression levels and is a widely used real benchmark dataset in the graph-modeling field; it contains both interventional and observational data. Since the study here is based on observational data, only the observational samples are considered and the interventional samples are ignored, yielding a dataset of 853 samples, each with 11 attributes; the corresponding causal DAG contains 17 directed edges.
In the experiments, the feedback period is set to 10 with 1000 iterations per period; the batch size is 128; the GAN learning rate is 0.001; the Encoder embedding dimension is 64; the initial learning rates of the Actor and the Critic are both 0.001, each with a learning-rate decay of 0.96; the acyclicity penalty factors λ1 and λ2 are 1.2 and 0.01 respectively; the DAG limiting parameter ϵ is 0.02; the decay rate γ is 0.3; and the dropout rate of the classification module's output layer is 0.5. The optimizer used in the model is Adam. The missing rates are set to 10%, 20% and 30% for each class of simulated dataset, and the training and test sets are split 4:1. Tables 1, 2 and 3 show, in turn, the results of the causal relationship discovery model and the comparison models on the three data types: table 1 gives the SHD comparison on the linear Gaussian dataset, table 2 on the non-linear Gaussian dataset, and table 3 on the non-linear non-Gaussian dataset.
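For reference, the experimental settings listed above can be gathered into a single configuration mapping (the key names are illustrative, not taken from the source):

```python
# Hyperparameters reported in the experiments, collected in one place.
CONFIG = {
    "feedback_periods": 10,
    "iterations_per_period": 1000,
    "batch_size": 128,
    "gan_lr": 1e-3,
    "encoder_embed_dim": 64,
    "actor_lr": 1e-3,
    "critic_lr": 1e-3,
    "lr_decay": 0.96,
    "lambda1": 1.2,            # acyclicity penalty factors
    "lambda2": 0.01,
    "dag_epsilon": 0.02,       # DAG limiting parameter
    "gamma": 0.3,              # CRE information decay rate
    "classifier_dropout": 0.5, # output-layer dropout of the classifier
    "optimizer": "adam",
    "missing_rates": (0.10, 0.20, 0.30),
    "train_test_split": (4, 1),
}
```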
As can be seen from tables 1, 2 and 3, CF-ICD achieves the lowest SHD on all simulated datasets among the compared models, reducing the average SHD on the three classes of data by about 25.8%, 16.5% and 15.1% respectively. This shows that combining the strong generative capacity of the GAN with the exploration capacity of the Actor-Critic can effectively uncover the latent causal patterns in missing datasets. The CRE-based causal feedback mechanism effectively exploits the mutual promotion of completion and discovery, improving the overall performance of the model: CRE captures more complex causal relationships and thus provides richer hints for completion. CF-ICD also has the lowest standard deviation of all methods and the most stable performance, an advantage attributable to the exploration-and-discovery module fusing classification errors, which guides the Actor-Critic to find the optimal causal graph more accurately and quickly.
Experiments are carried out on the Sachs dataset; the CRE result of the last feedback period is recorded during the experiment, and the thermodynamic diagrams corresponding to the estimated causal graph and the CRE matrix are drawn. The CRE matrix better reflects the hierarchical causal relationships among the variables. For example, the CRE value of the causal relationship from PKC to P38 is larger than that from PKC to Mek, from which it can be seen that the effect of PKC on P38 is greater than the effect of PKC on Mek. FIG. 3 shows all paths from PKC to Mek and from PKC to P38 in the estimated causal graph.
As shown in fig. 3, the causal effect of PKC on Mek is indirect on all paths, and two of those paths contain false edges that are rejected with high probability in the independence test, whereas the two paths from PKC to P38 contain no false edges and PKC has a direct causal relationship to P38. This explains the result above. Such information can be embodied in the CRE matrix but cannot be conveyed by the adjacency matrix alone; inputting CRE into the completion model as a kind of hint information therefore helps the GAN generate data that better conforms to the real distribution.
To further verify the effectiveness of CRE and of fusing classification errors, ablation experiments were performed herein on CRE and the classification error. A denotes the model with both CRE and the classification error removed, B denotes the model with only CRE removed, and C denotes the model with only the classification error removed. For fairness, the values of all shared parameters are kept identical across the control groups. The experimental results are shown in table 4.
As table 4 shows, the CF-ICD model has almost the lowest SHD among the four compared models, indicating that fusing CRE and the classification error significantly improves performance. In terms of the degree of improvement relative to A, the SHD of B (retaining only the classification module) is reduced by 1.9%, the SHD of C (retaining only the CRE module) is reduced by 12.0%, and the SHD of CF-ICD is reduced by 12.9%. This is sufficient to demonstrate that CRE has a significant boosting effect on the accuracy of the estimated causal graph, while the classification error has a relatively small impact on SHD; the classification module's contribution lies mainly in accelerating model convergence and improving model stability.
FIG. 4 is a block diagram of a missing dataset causal relationship discovery system based on causal feedback, shown in accordance with some embodiments of the present description. As shown in FIG. 4, the causal relationship discovery system for a missing data set based on causal feedback may include a model building module, a model training module, and a causal discovery module.
The model building module may be configured to build a causal relationship discovery model, where the causal relationship discovery model includes a missing dataset complement sub-model and a causal discovery sub-model, the missing dataset complement sub-model is configured to complement a missing dataset, and the causal discovery sub-model is configured to determine a best causal map corresponding to the completed missing dataset.
The model training module can be used for carrying out joint training on the missing data set complement sub-model and the causal relationship discovery sub-model to obtain a trained causal relationship discovery model.
The causal relationship discovery module can be used for inputting the missing data set to be processed into a trained causal relationship discovery model, and the trained causal relationship discovery model outputs an optimal causal graph corresponding to the missing data set to be processed.
Fig. 5 is a schematic structural diagram of an electronic device according to some embodiments of the present specification. Referring to fig. 5, a structural block diagram of an electronic device that can serve as the server or the client of the present invention is now described; it is an example of a hardware device to which aspects of the present invention can be applied. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device includes a computing unit that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) or a computer program loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device may also be stored. The computing unit, ROM and RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
A plurality of components in an electronic device are connected to an I/O interface, comprising: an input unit, an output unit, a storage unit, and a communication unit. The input unit may be any type of device capable of inputting information to the electronic device, and may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage units may include, but are not limited to, magnetic disks, optical disks. The communication unit allows the electronic device to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, wiFi devices, wiMax devices, cellular communication devices, and/or the like.
The computing unit may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing units include, but are not limited to, central Processing Units (CPUs), graphics Processing Units (GPUs), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like. The computing unit performs the various methods and processes described above. For example, in some embodiments, the causal relationship discovery method for a missing data set based on causal feedback may be implemented as a computer software program, tangibly embodied on a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device via the ROM and/or the communication unit. In some embodiments, the computing unit may be configured to perform the causal relationship discovery method of the missing data set based on causal feedback by any other suitable means (e.g. by means of firmware).
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and fall within the spirit and scope of its exemplary embodiments.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, in order to simplify the presentation disclosed in this specification and thereby aid in the understanding of one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, the claimed subject matter may lie in less than all features of a single disclosed embodiment.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers used in the description of embodiments are modified in some examples by the terms "about", "approximately", or "substantially". Unless otherwise indicated, "about", "approximately", or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought by the individual embodiment. In some embodiments, the numerical parameters should take into account the specified significant digits and employ general digit-preserving rounding. Although the numerical ranges and parameters used to confirm the breadth of the ranges are approximations in some embodiments, in particular embodiments such numerical values are set as precisely as practicable.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.
Claims (10)
1. A causal relationship discovery method for a missing data set based on causal feedback, comprising:
establishing a causal relation discovery model, wherein the causal relation discovery model comprises a missing data set complement sub-model and a causal discovery sub-model, the missing data set complement sub-model is used for complementing a missing data set, and the causal discovery sub-model is used for determining an optimal causal graph corresponding to the complemented missing data set;
performing joint training on the missing data set complement sub-model and the causal relationship discovery sub-model to obtain a trained causal relationship discovery model;
inputting the missing data set to be processed into the trained causal relation discovery model, and outputting an optimal causal graph corresponding to the missing data set to be processed by the trained causal relation discovery model.
2. The causal relationship discovery method of a missing data set based on causal feedback of claim 1, wherein the input of the missing data set complement sub-model comprises the missing data set to be processed, a mask matrix, and a random noise matrix, wherein the random noise matrix obeys a standard normal distribution;
and the output of the missing data set complement sub-model comprises the complemented data set corresponding to the missing data set to be processed.
3. The causal relationship discovery method of the missing data set based on causal feedback of claim 2, wherein the missing data set complement sub-model comprises a generator and a discriminator, and wherein the input of the discriminator comprises a prompt matrix.
4. A causal relationship discovery method for a missing data set based on causal feedback according to claim 2 or 3, wherein the causal discovery sub-model comprises a graph generating unit and a graph searching unit, the graph generating unit is used for capturing variable relationships from a complement data set corresponding to the missing data set to be processed and generating a causal graph adjacency matrix, and the graph searching unit is used for searching the best causal graph corresponding to the missing data set to be processed in a graph space.
5. The causal relationship discovery method of missing data set based on causal feedback of claim 4, wherein the graph generation unit comprises an encoder and a decoder, wherein the encoder comprises a multi-layer self-attention convolutional network, and wherein the decoder comprises a single-layer neural network.
6. The causal relationship discovery method of the missing data set based on causal feedback of claim 4, wherein the graph search unit comprises a three-layer feedforward multi-layer perceptron with a ReLU as an activation function.
7. The causal relationship discovery method of claim 4, wherein the performing joint training on the missing dataset complement sub-model and the causal relationship discovery sub-model to obtain a trained causal relationship discovery model comprises:
and carrying out joint training on the missing data set complement sub-model and the causal relationship discovery sub-model based on a causal characterization extraction feedback mechanism, and obtaining the trained causal relationship discovery model.
8. The causal relationship discovery method of claim 7, wherein the joint training of the missing data set complement sub-model and the causal relationship discovery sub-model based on the causal characterization extraction feedback mechanism to obtain the trained causal relationship discovery model comprises:
in the combined training process, the graph searching unit fuses the classification errors for training.
9. The causal relationship discovery method of the missing data set based on causal feedback of claim 1, further comprising:
pruning the best causal graph corresponding to the to-be-processed missing data set output by the trained causal relationship discovery model, and determining the target best causal graph corresponding to the to-be-processed missing data set.
10. A causal relationship discovery system for a missing data set based on causal feedback, comprising:
the system comprises a model building module, a causal relation finding module and a causal relation finding module, wherein the causal relation finding module comprises a missing data set complement sub-model and a causal relation finding sub-model, the missing data set complement sub-model is used for complementing a missing data set, and the causal relation finding sub-model is used for determining an optimal causal diagram corresponding to the complemented missing data set;
the model training module is used for carrying out joint training on the missing data set complement sub-model and the causal relationship discovery sub-model to obtain a trained causal relationship discovery model;
and the causal relation discovery module is used for inputting the missing data set to be processed into the trained causal relation discovery model, and the trained causal relation discovery model outputs an optimal causal diagram corresponding to the missing data set to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310364531.8A CN116090522A (en) | 2023-04-07 | 2023-04-07 | Causal relation discovery method and system for missing data set based on causal feedback |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310364531.8A CN116090522A (en) | 2023-04-07 | 2023-04-07 | Causal relation discovery method and system for missing data set based on causal feedback |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116090522A true CN116090522A (en) | 2023-05-09 |
Family
ID=86204850
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310364531.8A Pending CN116090522A (en) | 2023-04-07 | 2023-04-07 | Causal relation discovery method and system for missing data set based on causal feedback |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116090522A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116977667A (en) * | 2023-08-01 | 2023-10-31 | 中交第二公路勘察设计研究院有限公司 | Tunnel deformation data filling method based on improved GAIN |
-
2023
- 2023-04-07 CN CN202310364531.8A patent/CN116090522A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116977667A (en) * | 2023-08-01 | 2023-10-31 | 中交第二公路勘察设计研究院有限公司 | Tunnel deformation data filling method based on improved GAIN |
CN116977667B (en) * | 2023-08-01 | 2024-01-26 | 中交第二公路勘察设计研究院有限公司 | Tunnel deformation data filling method based on improved GAIN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112529168B (en) | GCN-based attribute multilayer network representation learning method | |
CN114048331A (en) | Knowledge graph recommendation method and system based on improved KGAT model | |
US11334791B2 (en) | Learning to search deep network architectures | |
CN116601626A (en) | Personal knowledge graph construction method and device and related equipment | |
Zheng et al. | Feature grouping and selection: A graph-based approach | |
CN116992008B (en) | Knowledge graph multi-hop question-answer reasoning method, device and computer equipment | |
CN113360670A (en) | Knowledge graph completion method and system based on fact context | |
CN116090522A (en) | Causal relation discovery method and system for missing data set based on causal feedback | |
CN114461929A (en) | Recommendation method based on collaborative relationship graph and related device | |
CN112749737A (en) | Image classification method and device, electronic equipment and storage medium | |
CN116992151A (en) | Online course recommendation method based on double-tower graph convolution neural network | |
CN116308551A (en) | Content recommendation method and system based on digital financial AI platform | |
CN115204171A (en) | Document-level event extraction method and system based on hypergraph neural network | |
CN116467466A (en) | Knowledge graph-based code recommendation method, device, equipment and medium | |
CN116522232A (en) | Document classification method, device, equipment and storage medium | |
US20230018525A1 (en) | Artificial Intelligence (AI) Framework to Identify Object-Relational Mapping Issues in Real-Time | |
CN115238134A (en) | Method and apparatus for generating a graph vector representation of a graph data structure | |
CN110457543B (en) | Entity resolution method and system based on end-to-end multi-view matching | |
CN116868207A (en) | Decision tree of original graph database | |
CN113158088A (en) | Position recommendation method based on graph neural network | |
Baeta et al. | Exploring expression-based generative adversarial networks | |
US11829735B2 (en) | Artificial intelligence (AI) framework to identify object-relational mapping issues in real-time | |
CN115658926B (en) | Element estimation method and device of knowledge graph, electronic equipment and storage medium | |
CN112231592B (en) | Graph-based network community discovery method, device, equipment and storage medium | |
CN118468927A (en) | Multi-objective feature optimization-based graph injection attack method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20230509 |
RJ01 | Rejection of invention patent application after publication |