WO2015045091A1 - Method and program for extraction of super-structure in structural learning of bayesian network - Google Patents


Info

Publication number
WO2015045091A1
Authority
WO
WIPO (PCT)
Prior art keywords
edge
variable
graph
bold
storage unit
Prior art date
Application number
PCT/JP2013/076245
Other languages
French (fr)
Japanese (ja)
Inventor
民平 森下 (Mimpei Morishita)
真臣 植野 (Maomi Ueno)
Original Assignee
株式会社シーエーシー (CAC Corporation)
国立大学法人電気通信大学 (The University of Electro-Communications)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CAC Corporation (株式会社シーエーシー) and The University of Electro-Communications (国立大学法人電気通信大学)
Priority to PCT/JP2013/076245 priority Critical patent/WO2015045091A1/en
Publication of WO2015045091A1 publication Critical patent/WO2015045091A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present invention relates to a method and a program for superstructure extraction in Bayesian network structure learning.
  • Non-Patent Document 1 describes a hybrid technique that extracts a narrowed search space called a super-structure by fast CI-based learning, and then searches for a Bayesian network within the superstructure by score-based learning with high estimation accuracy. A highly accurate Bayesian network can thereby be learned efficiently.
  • The weak point of the hybrid method using the superstructure is that edges existing in the true network are often deleted at the time of superstructure extraction, that is, missing edges are generated. Once a missing edge occurs in the superstructure, it cannot be restored even when the superstructure is searched by score-based learning. For this reason, the missing edge also appears in the Bayesian network produced as output.
  • the embodiment of the present invention provides a CI-based learning method for extracting a superstructure with fewer vanishing edges than in the past.
  • the embodiment of the present invention is a program and method for extracting a superstructure from input data.
  • the program and method cause a computer to execute the following steps.
  • FIG. 6 shows a flowchart of a CI-based learning method using edge direction according to an embodiment.
  • A flow of main processing in which the results are combined and returned as a superstructure is shown.
  • the pseudo code of the conventional RAI is shown.
  • A conventional orientation routine is illustrated.
  • A flow of main processing in which the results are combined and returned as a superstructure is shown.
  • FIG. 7 shows a modified version of RAI (RAIEX) processing of the embodiment called in the processing of FIG.
  • RAIEX is a modified version of RAI.
  • Fig. 4 illustrates a new orientation routine of an embodiment of the present invention.
  • FIG. 11 shows a flowchart of the process of FIG. 10. FIG. 12 shows a routine for determining the orientation of the two edges of a v-structure of an embodiment of the present invention.
  • Fig. 13 shows a flowchart of the processing of Fig. 12.
  • The flow of processing of an example of the present invention is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • the graph structure of the Bayesian network used in the experiment is shown.
  • A variable set is written in bold; for example, a variable set is expressed as Z (bold).
  • That the variable sets X (bold) and Y (bold) are conditionally independent given Z (bold) is written as Ind (X (bold); Y (bold) | Z (bold)).
  • The set symbols are omitted as appropriate; for example, Ind (X; Y | Z) is written instead of Ind ({X}; {Y} | {Z}).
  • Z (bold) is called a separating set of X (bold) and Y (bold).
  • Testing, based on data, whether variables X and Y are conditionally independent given a variable set Z (bold) is called a conditional independence test, or simply a test, and is written Test (X; Y | Z (bold)).
  • Test (X; Y | Z (bold)) is true if Ind (X; Y | Z (bold)) holds, and Z (bold) is called the conditioning variable set.
  • The initial state is a complete undirected graph; if a separating set of any two variables X and Y is found by a conditional independence test, the edge between X and Y is deleted from the graph.
  • The undirected edge between the vertices X and Y is written as X−Y, and the directed edge from X to Y is written as X → Y or Y ← X. When the direction of an edge is not distinguished, it is written as X *-* Y. Vertices X and Y are said to be adjacent when the edge X *-* Y is present.
  • When the directed edge X → Y is present, X is called the parent of Y and Y is called the child of X.
  • Adj (bold)(X, g), Pa (bold)(X, g), and Ch (bold)(X, g) denote the adjacent variable set, parent variable set, and child variable set, respectively, of the variable (vertex) X on the graph g.
  • A path in which all edges are undirected edges is called an undirected path.
  • A path traced from the start point X(1) to the end point X(n) following the directions of the directed edges is called a directed path.
  • a path in which the start point X (1) and the end point X (n) are the same variable is called a closed path.
  • a closed undirected path is called a loop, and a closed directed path is called a cycle.
  • a directed graph that does not have a cycle is called a directed acyclic graph, or DAG for short.
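  • As a concrete illustration of the definitions above, the following sketch checks whether a directed graph is a DAG by depth-first search. This is a minimal, hypothetical helper; the function and variable names are not from the patent.

```python
def is_dag(vertices, edges):
    """Return True if the directed graph has no cycle (i.e., it is a DAG).

    vertices: iterable of hashable vertex names
    edges:    set of (parent, child) tuples, one per directed edge
    """
    children = {v: [] for v in vertices}
    for parent, child in edges:
        children[parent].append(child)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {v: WHITE for v in vertices}

    def visit(v):
        color[v] = GRAY
        for c in children[v]:
            if color[c] == GRAY:          # back edge -> directed cycle found
                return False
            if color[c] == WHITE and not visit(c):
                return False
        color[v] = BLACK
        return True

    return all(color[v] == BLACK or visit(v) for v in vertices)
```

For example, the v-structure X → Z ← Y is a DAG, while X → Y → Z → X contains a cycle and is not.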
  • Variable sets X (bold) and Y (bold) are said to be directionally separated (d-separated) on DAG g given the variable set Z (bold) when the following condition is satisfied: for all X ∈ X (bold) and Y ∈ Y (bold), every path π between X and Y satisfies one of the following two properties: (1) on the path π there is a variable that is not a collider and is an element of Z (bold); (2) there is a collider on the path π such that neither the collider nor any descendant of the collider is an element of Z (bold). That X (bold) and Y (bold) are d-separated on DAG g given Z (bold) is written as Dsep_g(X (bold); Y (bold) | Z (bold)).
  • A Bayesian network B (bold) = ⟨g, Θ⟩ consists of two elements. The first element g = ⟨V (bold), E (bold)⟩ is a directed acyclic graph (DAG) consisting of a vertex set V (bold) = {X_1, ..., X_N} corresponding to the random variables and a directed edge set E (bold) representing the dependencies between the variables.
  • The graph g represents the independence relationships between the variables.
  • The Bayesian network B (bold) defines a unique joint probability distribution on the variable set U (bold) as follows: P(U (bold)) = Π_{i=1}^{N} P(X_i | Pa (bold)(X_i, g)).
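  • As an illustration of this factorization, the following sketch computes the joint probability of a full assignment as the product of the local conditional probabilities. The two-variable network Rain → WetGrass and all CPT numbers are invented for illustration only.

```python
# Joint probability under the factorization P(U) = prod_i P(X_i | Pa(X_i)).
parents = {"Rain": (), "WetGrass": ("Rain",)}

# CPTs: variable -> {parent assignment tuple -> {state: probability}}
cpt = {
    "Rain":     {(): {"y": 0.2, "n": 0.8}},
    "WetGrass": {("y",): {"y": 0.9, "n": 0.1},
                 ("n",): {"y": 0.1, "n": 0.9}},
}

def joint_probability(assignment):
    """P(assignment) as the product of local conditional probabilities."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

# P(Rain=y, WetGrass=y) = 0.2 * 0.9 = 0.18
```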
  • Bayesian network learning is roughly divided into a score-based learning method and a CI (conditional independence) -based learning method.
  • the CI -based learning method is also called a constraint-based learning method.
  • The score-based learning method calculates a statistical score for each candidate model (directed graph) and takes the model with the maximum score as the solution; it has relatively high structure estimation accuracy.
  • score learning of a Bayesian network is NP-hard.
  • the CI-based learning method learns a Bayesian network by a conditional independent test between random variables, and relatively high-speed learning is possible.
  • Many CI-based methods first learn the skeleton of the Bayesian network, that is, the undirected graph obtained by removing the edge orientations from the Bayesian network's graph structure, by conditional independence testing, and then perform edge orientation using the obtained conditional independence relations and the DAG constraints.
  • hybrid methods have been proposed that take advantage of high-speed CI-based learning and high-precision score learning.
  • the hybrid method first, a narrowed search space called a super-structure is extracted by CI-based learning.
  • the superstructure means an undirected graph including a skeleton of a true Bayesian network as a subgraph.
  • Next, a Bayesian network is searched for within the superstructure by score-based learning. That is, the Bayesian network is learned under the restriction that any directed edge existing in the output Bayesian network must also exist in the superstructure as an undirected edge with its direction removed. Ordyniak et al. analyzed the complexity of this search in terms of the tree width of the superstructure.
  • the tree width is a value indicating the computational complexity of the graph.
  • The tree width increases as the number of parent variables increases and as the number of loops in the skeleton of the Bayesian network increases.
  • MMPC (Max-Min Parents and Children): Tsamardinos, I., Brown, L. E., and Aliferis, C. F., "The max-min hill-climbing Bayesian network structure learning algorithm," Machine Learning, Vol. 65, pp. 31-78 (2006).
  • HPC (Hybrid Parents and Children): Rodrigues de Morais, S. and Aussem, A., "An Efficient and Scalable Algorithm for Local Bayesian Network Structure Discovery," in ECML PKDD '10: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Part III, pp. 164-179, Springer, Berlin (2010).
  • Although the superstructure extracted by HPC has more surplus edges than that of MMPC, HPC can reduce the number of missing edges.
  • MMPC and HPC are CI-based methods that extract undirected graphs based on conditional independence tests, but they do not perform edge orientation.
  • FIG. 1 shows a schematic diagram of a computer 100 that executes the method of the embodiment of the present invention or executes the method of the present invention by the program of the embodiment of the present invention.
  • the method of the embodiment of the present invention may be executed by the computer 100 or the processor of FIG. 1 or may be executed by causing a computer-executable instruction to operate as each component shown in FIG.
  • An embodiment of the present invention may be a computer-readable storage medium storing such computer-executable instructions.
  • the input to the computer 100, the output from the computer 100, and each component of the computer 100 will be described below.
  • the computer 100 may read the following data and data specification description file as inputs.
  • the data store for storing data may be a file, a relational database, a two-dimensional array on a memory, or the like.
  • Each column corresponds to each random variable, and each row includes the state (realized value) of the corresponding random variable.
  • For example, suppose the types of coupons used by a customer are T1 and T2 (they cannot be used together; n denotes that no coupon is used), a purchased product is represented by y, and a product that has not been purchased is represented by n. Then the purchase data for six people is represented as shown in Table 1.
  • Data specification description file This is a file that describes what random variables and their states (realized values) are included in the above-mentioned “data”.
  • The data specification indicates that each row is described in CSV format as a random variable name followed by state 1, state 2, ..., state n.
  • For example, for the customer purchase behavior history data above (the types of coupons used by customers are T1 and T2, which cannot be used together, with n if no coupon is used; a purchased product is represented by y and a non-purchased product by n), the data specification description file is described as follows:
  Coupon, T1, T2, n
  A, y, n
  B, y, n
  C, y, n
  D, y, n
  • A superstructure description file is output by the method or program of the embodiment of the present invention. This is the file that describes the estimated superstructure. In this file, each variable pair between which an edge exists is written on its own line, separated by a comma (,). For example, in the above example, if it is estimated that edges exist between Coupon and A, between Coupon and D, between A and B, and between B and C, the superstructure is described as follows in the superstructure description file:
  Coupon, A
  Coupon, D
  A, B
  B, C
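  • A minimal sketch of reading and writing this file format might look as follows; the helper names are illustrative, not from the patent.

```python
import csv
from io import StringIO

def write_superstructure(edges, fileobj):
    """Write one 'X,Y' line per undirected edge of the superstructure."""
    writer = csv.writer(fileobj)
    for x, y in sorted(edges):
        writer.writerow([x, y])

def read_superstructure(fileobj):
    """Read the file back into a set of direction-free (frozenset) edges."""
    return {frozenset(row) for row in csv.reader(fileobj) if row}

# Round-trip the example edges through an in-memory file:
buf = StringIO()
write_superstructure({("Coupon", "A"), ("Coupon", "D"),
                      ("A", "B"), ("B", "C")}, buf)
buf.seek(0)
edges = read_superstructure(buf)
```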
  • the program control unit 102 is a component that controls the overall processing flow.
  • the program control unit 102 checks the arguments and parameters of the program as preprocessing, and if they are normal, causes the data specification analysis unit 104 to analyze the data specification.
  • the program control unit 102 further causes the CI base structure learning unit 106 to execute main processing.
  • the data specification analysis unit 104 reads the data specification description file and prepares to analyze the data that is the main input.
  • The data specification analysis unit 104 provides other components with information including the name of each random variable, the number of random variables, the state names of each random variable, the number of states of each random variable, and the total number of data items included in the data.
  • the CI-based structure learning unit 106 is a component that executes a CI-based learning algorithm and extracts a superstructure from data, and executes main processing.
  • The conditional independence test execution unit 108 is a component that executes conditional independence tests.
  • The conditional independence test execution unit 108 has a function of caching execution results; when the result of a conditional independence test that has already been executed is requested, the cached result is returned.
  • The separating set holding unit 110 holds, for each variable pair {X, Y} for which Ind (X; Y | Z (bold)) was determined to hold, the separating set Z (bold), and manages it in a hash keyed by the variable pair.
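  • A minimal sketch of such a hash keyed by an order-free variable pair; the class and method names are illustrative assumptions, not from the patent.

```python
class SeparatingSetStore:
    """Holds the separating set Z for each deleted variable pair {X, Y}.

    Keys are frozenset({X, Y}) so that (X, Y) and (Y, X) address the same
    entry, mirroring a hash keyed by an unordered variable pair.
    """
    def __init__(self):
        self._sep = {}

    def put(self, x, y, z):
        self._sep[frozenset((x, y))] = frozenset(z)

    def get(self, x, y):
        return self._sep.get(frozenset((x, y)))

store = SeparatingSetStore()
store.put("X", "Y", {"Z"})
# store.get("Y", "X") returns the same entry: the pair order is irrelevant.
```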
  • the graph structure construction unit 112 constructs a super structure graph structure estimated by the CI base structure learning unit 106.
  • The graph structure construction unit 112 constructs and manages, as a data structure shared with other components, 1) an array of nodes representing random variables and 2) an array of directed or undirected edges representing the dependencies between random variable pairs.
  • the calculation results obtained in each component may be appropriately stored in a storage device such as a memory and used for subsequent calculations.
  • the results obtained so far may be stored in a storage device and used for subsequent calculations by the same component.
  • The method of the present invention is an improved CI-based learning technique that uses edge directions (that is, parent-child relationships between variables) to select conditional independence tests.
  • Many CI-based learning methods perform edge orientation only once at the end.
  • In contrast, CI-based learning methods that use edge orientation perform edge orientation multiple times along the way, and based on the parent-child relationships between variables obtained as a result, determine which conditional independence tests are performed thereafter.
  • An example of a CI-based learning method that uses the direction of an edge is RAI described later.
  • the present invention can be applied not only to RAI but also to a CI-based learning method that uses the direction of an edge.
  • a CI-based learning method using edge directions according to an embodiment of the present invention will be described.
  • FIG. 2 shows a flowchart of a CI-based learning method using the edge direction according to the embodiment.
  • the method of FIG. 2 may be executed by the computer 100 or the processor of FIG. 1 or computer-executable instructions may be executed by causing the computer to operate as each component shown in FIG.
  • An embodiment of the present invention may be a computer-readable storage medium storing such computer-executable instructions.
  • In step 202, the output graph G is initialized as a complete undirected graph.
  • In step 204, Sep (bold), the collection of all separating sets, is initialized as an empty set.
  • In step 206, the order n of the conditional independence test is set to zero.
  • In step 208, the following steps (1) to (3) are repeated for all variable pairs {X, Y} having an edge on the graph G.
  • (1) The latent parent variable set Z (bold) = Pap (bold)(X, G) ∪ Pap (bold)(Y, G) of X and Y is specified (step 210).
  • (2) A conditional independence test Test (X; Y | S (bold)) is performed for every S (bold) such that |S (bold)| = n and S (bold) ⊆ Z (bold) (step 212).
  • (3) If the result of a test is Ind (X; Y | S (bold)) ("Yes" in step 214), the separating set collection Sep (bold) is updated to Sep (bold) ∪ {⟨{X, Y}, S (bold)⟩} (step 216), and the edge X *-* Y is deleted from the graph G (step 218).
  • In step 220, the orientation of the edges in the graph G is performed.
  • A new orientation approach is adopted here.
  • A contradiction in v-structure estimation is detected by the occurrence of a collision edge, and the direction of the collision edge is determined probabilistically to one of the two directions.
  • The direction determined for a collision edge is maintained and managed separately for each individual execution (thread).
  • For this purpose, the orientation of the present invention takes, in addition to the graph G to be oriented and the separating sets Sep (bold) of the deleted edges (variable pairs), an edge parent set E_p (bold) whose elements are pairs of a collision edge and the parent variable on that edge, so that the direction of each collision edge is determined and maintained separately for each thread.
  • The edge parent set may be implemented as a hash with the direction-ignored edge as a key and the parent variable on that edge as a value.
  • In the orientation of the embodiment of the present invention, when searching for a set of three variables X *-* Z *-* Y that is a candidate v-structure, the directions of the edges between these variables are ignored. As a result, it is possible to determine whether the three variables of interest form a v-structure without being affected by previously estimated v-structures.
  • Next, the orientation of the two edges of the v-structure is determined.
  • Assume that the v-structure estimate is correct, with parent variable S_p and child variable S_c. When orientation is executed for an edge of the v-structure for the first time, the edge is still an undirected edge, so the edge is oriented from the estimated parent variable S_p toward the child variable S_c. When it is not the first time a direction is determined for the edge, and the edge has already been oriented in the direction opposite to the estimated v-structure, the edge is detected as a collision edge. When this collision edge is detected for the first time, its direction in this thread is selected according to a predetermined probability, and the decision is added to the edge parent set E_p (bold) and retained.
  • Finally, the directions of the edges are determined as far as possible according to rules called orientation rules, which are derived so as not to contradict the DAG constraints.
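  • The collision handling described above can be sketched as follows. This is a simplified, hypothetical helper: the function names, return labels, and the tie-breaking probability p_keep are illustrative assumptions, not taken from the patent.

```python
import random

def orient_v_edge(direction, edge_parent, parent, child, p_keep=0.5, rng=random):
    """Orient one edge of an estimated v-structure as parent -> child.

    direction:   dict frozenset({u, v}) -> (u, v) for edges oriented so far
    edge_parent: thread-local edge parent set E_p, mapping a direction-free
                 edge key to the parent variable chosen for a collision edge
    """
    key = frozenset((parent, child))
    if key not in direction:                  # first time: edge still undirected
        direction[key] = (parent, child)      # orient as estimated
        return "oriented"
    if direction[key] == (parent, child):     # agrees with the estimate
        return "unchanged"
    if key in edge_parent:                    # collision already resolved: leave it
        return "collision-seen"
    # First detection of a collision edge: choose a direction probabilistically
    chosen_parent = parent if rng.random() < p_keep else direction[key][0]
    if chosen_parent == parent:
        direction[key] = (parent, child)
    edge_parent[key] = chosen_parent
    return "collision-resolved"
```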
  • In step 222, the order n of the conditional independence test is incremented by one.
  • In steps 224 and 226, latent parent variable sets are specified for all variable pairs {X, Y} of the remaining edges on the graph G. If there is a variable pair whose latent parent variable set size is at least n ("Yes" in step 226), the process returns to step 208; otherwise, the process ends.
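  • The control flow of steps 202 to 226 can be sketched as follows. Note that this is a simplified, PC-style sketch: the candidate set Z is approximated by the current neighbours of X and Y rather than by the latent parent sets obtained from edge orientation (step 220 is omitted), so it illustrates the loop structure rather than the full patented method.

```python
from itertools import combinations

def learn_skeleton(variables, ci_test):
    """Sketch of the edge-deletion loop; ci_test(x, y, s) -> True iff Ind(x; y | s)."""
    graph = {frozenset(p) for p in combinations(variables, 2)}   # step 202
    sep = {}                                                     # step 204
    n = 0                                                        # step 206

    def candidate_set(x, y):
        adjacent = {v for e in graph for v in e if x in e or y in e}
        return adjacent - {x, y}

    while True:
        for pair in list(graph):                                 # step 208
            x, y = sorted(pair)
            z = candidate_set(x, y)                              # step 210
            for s in combinations(sorted(z), n):                 # step 212
                if ci_test(x, y, set(s)):                        # steps 214-216
                    sep[pair] = set(s)
                    graph.discard(pair)                          # step 218
                    break
        # step 220 (edge orientation) is omitted in this sketch
        n += 1                                                   # step 222
        if not any(len(candidate_set(*sorted(p))) >= n for p in graph):
            return graph, sep                                    # steps 224-226
```

With an oracle for the chain X − Z − Y (only Ind(X; Y | {Z}) holds), the edge X−Y is deleted at order n = 1 and the true skeleton remains.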
  • The embodiment of the present invention performs a new edge orientation process in a CI-based learning method that uses edge directions to select the conditioning variable set S (bold) of the conditional independence tests, and thereby detects orientation errors.
  • The embodiment of the present invention further extracts a superstructure with few missing edges by combining the execution results of the above processing.
  • The present invention superimposes a plurality of the above new CI-based learning results that output different graphs, so that even if a missing edge occurs in one CI-based learning result, the edge can be supplemented by another CI-based learning result.
  • FIG. 3 shows a flow of main processing in which a plurality of CI-based learnings of the present invention having different operations are executed, and the results are combined and returned as a superstructure.
  • This process may be executed by the computer 100 or the processor of FIG. 1, or may be executed by causing a computer-executable instruction to operate the computer as each component shown in FIG.
  • step 302 initial processing is executed.
  • The program control unit 102 checks operating parameters including at least one of the database connection information, the data specification description file name, the significance level α of the conditional independence test, the number of threads t (the number of parallel executions of CI-based learning in the embodiment of the present invention), and the superstructure description file name. If there is an error, the program control unit 102 displays the error on a display device or the like and ends the program. If the parameters are normal, the program control unit 102 continues the processing and causes the data specification analysis unit 104 to analyze the data specification.
  • the data specification analysis unit 104 reads the data specification description file, and holds the names of the random variables, the number of random variables, the names of all the states that can be taken by the random variables, and the number of states. Next, the data specification analyzing unit 104 accesses the database using the database connection information, acquires the number of all data, and holds it.
  • the program control unit 102 transfers control to the CI base structure learning unit 106.
  • the CI base structure learning unit 106 performs CI base learning according to the above-described embodiment of the present invention.
  • step 304 CI-based learning execution threads having the specified number of parallel executions are generated.
  • step 306 the above-described CI-based learning thread of the present invention is executed in parallel.
  • step 308 the parent thread that executes CI-based learning waits until any execution thread ends.
  • step 310 a union of the edge set in the graph obtained from the terminated thread and the edge set obtained so far is generated and used as a superstructure. If there is an unprocessed thread (“Yes” in step 312), the process returns to step 308.
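  • The parallel execution and union of steps 304 to 312 can be sketched as follows. The thread count, helper names, and the two toy "learners" are illustrative assumptions, not from the patent.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_superstructure(learners, num_threads=4):
    """Run several CI-based learning threads and return the union of their
    edge sets as the superstructure.

    learners: callables, each returning a set of frozenset({X, Y}) edges.
    The union compensates for missing edges: an edge lost in one run
    survives as long as any other run retains it.
    """
    superstructure = set()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for edge_set in pool.map(lambda learn: learn(), learners):
            superstructure |= edge_set        # step 310: union of edge sets
    return superstructure

# Two toy "learners", each missing a different edge of the true skeleton
# A - B - C plus Coupon - A (names from the earlier example):
run1 = lambda: {frozenset({"A", "B"}), frozenset({"B", "C"})}
run2 = lambda: {frozenset({"A", "B"}), frozenset({"Coupon", "A"})}
combined = extract_superstructure([run1, run2])  # all three edges survive
```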
  • step 314 the graph structure construction unit 112 receives the super structure graph structure from the CI base structure learning unit 106, and generates an output in accordance with the specifications of the super structure description file. Processing ends at step 316.
  • RAI (Recursive Autonomy Identification) is a CI-based method that uses edge orientation results to improve the estimation accuracy and computational efficiency of CI-based methods: Yehezkel, R. and Lerner, B., "Bayesian Network Structure Learning by Recursive Autonomy Identification," Journal of Machine Learning Research, Vol. 10, pp. 1527-1570 (2009).
  • RAI uses the results of edge orientation during graph structure learning to improve estimation accuracy and calculation efficiency.
  • However, since edge orientation depends on the results of statistical tests (conditional independence tests), edges can be misdirected under the realistic condition of finite samples. Such misorientation induces conditional independence tests that are unnecessary in nature, thereby causing missing edges.
  • RAI can be used as basic CI-based learning.
  • The embodiment of the present invention realizes extraction of a superstructure with fewer missing edges compared with the conventional method that simply uses RAI.
  • edge orientation depends on the results of statistical tests (conditional independent tests). This can lead to incorrect orientation under realistic conditions of finite samples.
  • Misorientation causes missing edges by inducing conditional independence tests that are unnecessary in nature.
  • The present invention detects erroneous orientations based on orientation contradictions and synthesizes superstructures based on the possible orientations, thereby preventing the occurrence of missing edges in the superstructure.
  • RAI, which can be used as the CI-based learning method underlying one embodiment of the present invention, is described below, together with the disadvantages of using RAI as it is for superstructure extraction.
  • RAI is a technique that avoids high-order conditional independence tests, which have low reliability and high calculation cost, by using the parent-child relationships between variables obtained by edge orientation, thereby improving accuracy and reducing the amount of calculation.
  • Using the parent-child relationships of the vertices (variables) produced by orientation, RAI recursively decomposes the graph into a descendant substructure g_D and an ancestor substructure g_A consisting of the other variables.
  • a descendant substructure is more formally defined as an autonomous sub-structure (Definition 2).
  • the parent variable in the ancestor partial structure is defined as an exogenous cause (Definition 1).
  • Given a DAG g = ⟨V (bold), E (bold)⟩, V_A (bold) ⊆ V (bold), and E_A (bold) ⊆ E (bold), the structure g_A = ⟨V_A (bold), E_A (bold)⟩ is said to be autonomous in g given the exogenous causes V_ex (bold) ⊆ V (bold) of g_A if, for every X ∈ V_A (bold), Pa (bold)(X, g) ⊆ V_A (bold) ∪ V_ex (bold).
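  • The autonomy condition of this definition can be sketched as a simple check; the function and argument names are illustrative, not from the patent.

```python
def is_autonomous(v_a, parents_in_g, v_ex):
    """Check the definition: the substructure over V_A is autonomous in g
    given the exogenous causes V_ex if, for every X in V_A,
    Pa(X, g) is contained in the union of V_A and V_ex.

    parents_in_g: dict variable -> set of its parents in the full DAG g.
    """
    allowed = set(v_a) | set(v_ex)
    return all(parents_in_g.get(x, set()) <= allowed for x in v_a)

# In the DAG E -> X -> Z <- Y, the substructure over {X, Y, Z} is
# autonomous given the exogenous cause set {E}, but not given the empty set.
pa = {"E": set(), "X": {"E"}, "Y": set(), "Z": {"X", "Y"}}
```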
  • Reducing the number of conditional independence tests contributes to the suppression of missing edges.
  • The most extreme case is one in which no conditional independence test is performed; then no edge deletion is performed, and a complete undirected graph is output.
  • If this complete undirected graph is regarded as a superstructure, the search space is not narrowed at all; instead, no missing edges occur in the superstructure.
  • Although RAI is not a technique developed for the purpose of superstructure extraction, it reduces conditional independence tests and, as a result, suppresses missing edges when used as a superstructure estimation technique.
  • the first mechanism is control of the test order based on graph decomposition.
  • RAI recursively decomposes a graph into descendant substructures and ancestor substructures using parent-child relationships between variables estimated by edge orientation.
  • RAI selects the edges (variable pairs) subject to conditional independence tests in the following order: 1) edges inside an ancestor substructure, 2) edges connecting an ancestor substructure and the descendant substructure, and 3) edges inside the descendant substructure.
  • the second mechanism is to reduce the condition variable set size based on edge orientation, which is based on the following lemma.
  • Lemma 1: In a DAG, if X and Y are not adjacent and X is not a descendant of Y, then X and Y are d-separated given Pa (bold)(Y).
  • By Lemma 2, which is derived from Lemma 1, if the variables X and Y are both elements of g_D, the existence of the edge X−Y can similarly be determined by checking only the limited conditioning sets S (bold) ⊆ Pap (bold)(Y) ∖ {X}.
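  • The restriction of Lemma 2 can be illustrated by enumerating the candidate conditioning sets from the potential parent set only. This is a hypothetical helper, not from the patent.

```python
from itertools import combinations

def candidate_conditioning_sets(x, y, potential_parents):
    """Enumerate the sets S contained in Pa_p(Y) excluding X, to be tested
    for the edge X-Y per Lemma 2.

    potential_parents stands for Pa_p(Y), the potential parent set of Y
    estimated from the current edge orientations; restricting S to it keeps
    the conditional independence tests low-order.
    """
    base = sorted(set(potential_parents) - {x})
    return [set(s) for n in range(len(base) + 1)
                   for s in combinations(base, n)]

# With Pa_p(Y) = {X, A, B}, only four conditioning sets need testing,
# instead of all subsets of the remaining variables:
# candidate_conditioning_sets("X", "Y", {"X", "A", "B"})
#   -> [set(), {"A"}, {"B"}, {"A", "B"}]
```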
  • The embodiment of the present invention replaces the orientation routine called in lines 10 and 15, and runs a plurality of RAIs whose operation is changed in this way.
  • edge orientation in the existing CI-based method such as RAI will be described.
  • A conventional orientation routine, orientEdgeTrad, is shown in FIG. 5.
  • The edge orientation routine takes as input the graph to be oriented and the entire collection Sep (bold) of separating sets of the deleted edges (variable pairs).
  • The basic idea of edge orientation is to first determine the directions of particular edges from the separating sets obtained by the conditional independence tests, and then to determine the directions of as many edges as possible according to rules called orientation rules, which are derived so as not to conflict with the DAG constraints.
  • The target of determining directions from the separating sets is a set of three variables and two edges connected as X−Z−Y (where there is no edge between X and Y).
  • If Z is not contained in the separating set of X and Y, X−Z−Y can be oriented as X → Z ← Y.
  • This X → Z ← Y is called a v-structure.
  • The v-structures are estimated in lines 2 to 5, and thereafter orientation is performed according to the orientation rules.
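  • The v-structure estimation of lines 2 to 5 can be sketched as follows. This is a simplified, hypothetical helper that returns the directed edges of the estimated v-structures; names are not from the patent.

```python
def estimate_v_structures(adjacency, sep):
    """Conventional v-structure estimation over a learned skeleton.

    adjacency: dict variable -> set of adjacent variables (the skeleton)
    sep:       dict frozenset({X, Y}) -> separating set found for the pair
    For every unshielded triple X - Z - Y (no X-Y edge), orient X -> Z <- Y
    when Z is absent from the separating set of X and Y.
    """
    directed = set()
    for z, neighbours in adjacency.items():
        for x in neighbours:
            for y in neighbours:
                if x >= y or y in adjacency[x]:   # need X < Y and no X-Y edge
                    continue
                s = sep.get(frozenset((x, y)))
                if s is not None and z not in s:
                    directed.add((x, z))          # X -> Z
                    directed.add((y, z))          # Y -> Z
    return directed

# Skeleton X - Z - Y with Sep(X, Y) = {} yields the v-structure X -> Z <- Y:
adj = {"X": {"Z"}, "Y": {"Z"}, "Z": {"X", "Y"}}
separating = {frozenset({"X", "Y"}): set()}
# estimate_v_structures(adj, separating) == {("X", "Z"), ("Y", "Z")}
```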
  • the problem with conventional edge orientation is that orientation errors tend to occur at the time of v-structure estimation.
  • In the embodiment, a new orientation method is used instead of the conventional orientation in order to detect orientation errors during v-structure estimation.
  • The embodiment of the present invention detects the edge misorientations that cause missing edges in RAI, and extracts a superstructure with few missing edges by combining the execution results of a plurality of RAIs that try the possible orientations.
  • On the premise that learning errors exist, the embodiment of the present invention mainly 1) extracts a superstructure with few missing edges and 2) detects orientation errors.
  • multi-threading is used for parallel execution of CI-based learning.
  • parallel execution of CI-based learning may be implemented as a multi-process, or distributed parallel by a plurality of computers may be used for load distribution.
  • the modified RAI is executed in parallel, but another method may be executed.
  • the same learning technique does not have to be performed on individual threads.
  • FIG. 7 explains the modified version of RAI (RAIEX) processing of the embodiment of the present invention called in the processing of FIG.
  • The present embodiment differs from conventional RAI in that, in lines 10 and 15, it calls the extended edge orientation routine of the embodiment of the present invention (described later with FIG. 10) instead of the conventional orientation routine (FIG. 5).
  • If there is a variable set S (bold) of size n that is a subset of the union (excluding X) of the latent parent variable set of Y in G_start and the parent variable set of Y in G_ex, and X and Y are independent given S (bold), then ⟨{X, Y}, S (bold)⟩ is added to the separating set collection Sep (bold), and the edge X *-* Y is removed from G_all.
  • The separating set of the deleted edge X *-* Y is obtained. If Z is not included in this separating set, then for each of the two edges of X *-* Z *-* Y: if the direction of the edge is not yet determined, the edge is oriented as estimated; if the edge is already oriented in the opposite direction, it is determined to be a collision edge; if the collision edge is detected for the first time, its direction is chosen with a predetermined probability; if it is not the first detection of the collision, the edge is left alone. This corresponds to the orientation in lines 8 to 13 of FIG. 10.
  • The variable set having the lowest topological order is taken as the descendant subset G_D and temporarily deleted from G_start, and the remaining mutually unconnected variable sets become the ancestor subsets G_A1, G_A2, ..., G_Ak (once the ancestor subsets are identified, the G_D temporarily deleted from G_start is restored).
  • G_ex_D is set to {G_A1, G_A2, ..., G_Ak, G_ex}.
  • Edge Direction Collision Detection Here, the cause of the occurrence of an edge orientation error, the reason for the orientation error leading to the disappearance edge occurrence, and the method for detecting the edge orientation error will be described.
  • The embodiment of the present invention detects orientation errors at the time of v-structure estimation in order to suppress the occurrence of missing edges due to the orientation errors described with reference to the preceding figure.
  • In the embodiment of the present invention, when a plurality of v-structures sharing a single edge X *-* Y are inferred and they give different orientations to X *-* Y, it is determined that one of the v-structure estimates is incorrect.
  • A situation in which a plurality of v-structures give different directions to the edge X *-* Y is called an orientation collision, and this state is represented by the bidirectional edge X ⇄ Y.
  • X ⇄ Y is called a collision edge.
  • A contradiction in v-structure estimation is thus detected by the occurrence of a collision edge, and the direction of the collision edge is determined probabilistically to one of the two directions.
  • the direction determined for the collision edge is maintained and managed separately for each individual execution (thread).
  • The new orientation routine orientEdge of the embodiment of the present invention is shown as Algorithm 5 in FIG. 10.
  • FIG. 11 shows a flowchart of the processing of FIG. 10.
  • the arguments of the orientation routine orientEdge are all thread-local and are initialized for each thread to take a unique state.
  • In addition to the arguments of the conventional edge orientation routine, the orientEdge routine takes as an argument the edge parent set E p (bold), which determines the direction of each collision edge separately for each thread and holds, as its elements, pairs of a collision edge and the parent variable on that edge.
  • The edge parent set may be implemented as a hash whose key is an edge with its direction ignored and whose value is the parent variable on that edge.
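A minimal sketch of that hash, under the assumption (made for illustration only) that an undirected edge is keyed by a `frozenset` of its two endpoints, so that (X, Y) and (Y, X) map to the same entry:

```python
# Edge parent set E_p: maps a collision edge (direction ignored) to the
# parent variable chosen for it in this thread.
edge_parent = {}

def record_parent(u, v, parent):
    # frozenset makes the key independent of the order of the endpoints
    edge_parent[frozenset((u, v))] = parent

def lookup_parent(u, v):
    # returns None if no decision has been held for this edge yet
    return edge_parent.get(frozenset((u, v)))
```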
  • Note that the orientation routine orientEdge of the embodiment of the present invention ignores the direction of the edges between the variables when searching for triples of variables X *-* Z *-* Y that are candidates for a v-structure. As a result, it can determine whether the three variables of interest currently form a v-structure without being affected by previously estimated v-structures.
  • The orientVStructure routine that determines the orientation of the two edges of a v-structure is shown as Algorithm 6 in FIG. 12.
  • FIG. 13 shows a flowchart of the processing of FIG. 12. The orientVStructure routine is called for each edge estimated to belong to a v-structure, and is passed the parent variable S p and the child variable S c that would hold if the estimated v-structure were correct. The first time orientVStructure is called for an edge, the edge is still undirected at the time of the call, so it is oriented from the parent variable S p to the child variable S c as estimated (lines 8 and 9).
  • Otherwise, the edge is determined to be a collision edge. If this collision edge is detected for the first time (line 3), its direction in this thread is selected stochastically (line 4), and the decision is added to the edge parent set E p (bold) and retained.
  • the direction is determined with a probability of 1/2 (line 4 in FIG. 12).
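The first-call and collision branches described above can be sketched as follows. This is an illustrative reconstruction, not Algorithm 6 itself; the representation (`child` dict for arrowheads, `edge_parent` for E p) and function names are assumptions.

```python
import random

def orient_v_structure(child, edge_parent, s_p, s_c, rng=random):
    """Orient the edge S_p *-* S_c as part of an estimated v-structure.
    child: dict frozenset({U, V}) -> arrowhead vertex once the edge is directed.
    edge_parent: per-thread edge parent set E_p (collision edge -> parent)."""
    e = frozenset((s_p, s_c))
    if e not in child:                  # edge still undirected: follow estimate
        child[e] = s_c
        return
    if child[e] == s_c:                 # same orientation already: nothing to do
        return
    if e not in edge_parent:            # collision detected for the first time
        edge_parent[e] = rng.choice((s_p, s_c))  # probability 1/2 each
    # later collisions reuse the decision held in E_p for this thread
    child[e] = s_c if edge_parent[e] == s_p else s_p
```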
  • Alternatively, the direction followed by the larger set of v-structures may be selected with high probability.
  • Alternatively, for the set of v-structures giving one direction to the edge and the set giving the opposite direction, the average p-value of the corresponding tests may be computed for each set, and the direction may be assigned with a probability according to the ratio of these averages.
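One way the p-value-based variant above could be realized; a sketch under the assumption that each competing v-structure carries the p-value of its supporting test (how the ratio maps to a probability is an interpretation, not fixed by the text):

```python
import random

def choose_direction(p_forward, p_backward, rng=random):
    """Choose a direction for a collision edge with probability proportional
    to the average p-value of the tests supporting each direction.
    p_forward / p_backward: lists of p-values for the v-structures that
    orient the edge one way or the other."""
    avg_f = sum(p_forward) / len(p_forward)
    avg_b = sum(p_backward) / len(p_backward)
    return "forward" if rng.random() < avg_f / (avg_f + avg_b) else "backward"
```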
  • FIG. 14 shows a processing flow of the embodiment of the present invention.
  • The following processing may be executed by the computer 100 or the processor of FIG. 1, or may be executed by computer-executable instructions causing the computer to operate as each component shown in FIG. 1.
  • An embodiment of the present invention may be a computer-readable storage medium storing such computer-executable instructions.
  • In step 1402, initial processing is executed.
  • Preparations for learning, such as checking the program operation parameters, are performed.
  • The program control unit 102 checks the operating parameters passed from the command line arguments or the like, such as the database connection information, the data specification description file name, the significance level α of the conditional independence test, the number of threads t (the CI-based learning of the embodiment is executed in parallel), and the superstructure description file name. If an error is found in the initial processing, the error is displayed on the display device and the processing is terminated.
  • If the initial processing completes normally, the processing continues, and the data specification analysis unit 104 analyzes the data specification.
  • The data specification analysis unit 104 reads the data specification description file and holds the names of the random variables, the number of random variables, the names of all the states each random variable can take, and the number of states. Next, the data specification analysis unit 104 accesses the database using the database connection information, acquires the total number of data records, and holds it. Next, the program control unit 102 transfers control to the CI-based structure learning unit 106. In step 1404, the CI-based structure learning unit 106, having received control after the initial processing, executes the main processing of the present invention (for example, the processing corresponding to FIG. 6). RAIEX execution threads are generated for the specified number of parallel executions.
  • RAIEX using the new edge orientation technique is executed in parallel in each thread.
  • the parent thread that executes the main process waits until one of the RAIEX execution threads ends.
  • The superstructure is generated as the union of the edge set E (bold)(g_out) of the graph g_out, the execution result of RAIEX obtained from the terminated thread, and the edge sets obtained so far. If there is an unprocessed child thread ("Yes" in step 1412), the process returns to waiting for an execution thread in step 1408.
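The union step can be sketched as below. This is a minimal illustration using a thread pool; the real implementation manages child threads explicitly as described, and `run_raiex` is a hypothetical stand-in for one RAIEX execution returning the edge set of its g_out.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_superstructure(run_raiex, t):
    """Run t RAIEX threads and return the union of their edge sets.
    run_raiex(i) is assumed to return the edge set of g_out for thread i,
    with each edge a frozenset of its two endpoints (direction ignored)."""
    superstructure = set()
    with ThreadPoolExecutor(max_workers=t) as pool:
        for edges in pool.map(run_raiex, range(t)):
            superstructure |= edges    # union with edge sets obtained so far
    return superstructure
```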
  • the graph structure construction unit 112 receives the super structure graph structure from the CI base structure learning unit 106, and outputs it according to the specifications of the super structure description file.
  • the process ends. The more specific processing of FIG. 14 is as already described.
  • Table 1 shows the Bayesian networks used as experiment targets.
  • The network structures are shown in FIGS. These are based on widely published networks.
  • Win95pts was obtained from the GeNIe & SMILE network repository (http://genie.sis.pitt.edu/index.php/network-repository), and the remaining Alarm, Insurance, and Water were obtained from the Bayesian Network Repository (http://www.cs.huji.ac.il/~galel/Repository/).
  • the number of variables was reduced to 25 in the three networks of Alarm, Win95pts, and Water.
  • Each column from Missing Edge to SS Time in the table describes the superstructure extracted by the algorithm shown on the left side of the table. Missing Edge represents the number of missing edges, Extra Edge the number of extra edges, Degree the average degree, Max Degree the maximum degree, and SS Time the number of seconds required from the start to the superstructure output.
  • The two columns Score Time and BDeu in the table show the results of an exact solution search using the output superstructure. Score Time represents the number of seconds required for the exact solution search, and BDeu represents the BDeu score of the Bayesian network obtained as the solution of the exact solution search. All values are averages over 10 data sets.
  • the method of the embodiment of the present invention is expressed as Proposed-2 to Proposed-10. The number following the hyphen represents the parallel execution number t. For example, Proposed-3 represents the method of the embodiment of the present invention executed with the parallel execution number of 3.
  • Each vertex represents a random variable, and the number in parentheses at each vertex represents the number of states of the variable.
  • The experimental results show that the method of the present invention suppresses missing edges in the superstructure the most, and that the method of the present invention finds the highest-scoring Bayesian network in the exact solution search of score learning using the superstructure.

Abstract

An objective of this embodiment is to provide a CI-based learning method whereby a super-structure having fewer missing edges than conventional methods is extracted. A program control unit (102) causes a data specification analysis unit (104) to analyze a data specification. The data specification analysis unit (104) accesses a database and acquires and retains the total number of data entries. The program control unit (102) transfers control to a CI-based structure learning unit (106). The CI-based structure learning unit (106) executes CI-based learning according to this embodiment. A designated number of CI-based learning execution threads to be executed in parallel are generated and executed in parallel. A sum set is generated of the set of edges in a graph obtained from each completed thread and the set of edges obtained therebefore, said sum set being treated as a super-structure.

Description

ベイジアンネットワークの構造学習におけるスーパーストラクチャ抽出のための方法及びプログラムMethod and program for superstructure extraction in Bayesian network structure learning
 本発明は、ベイジアンネットワークの構造学習におけるスーパーストラクチャ抽出のための方法及びプログラムに関する。 The present invention relates to a method and a program for superstructure extraction in Bayesian network structure learning.
 ベイジアンネットワークの構造学習手法は、比較的推定精度の高いスコア学習と高速な学習が可能なCI(conditional independence) ベース学習(または制約ベース学習)とに大別される。近年、双方の利点を活かしたハイブリッド手法が提案されている(非特許文献1)。これは、高速なCIベース学習によってスーパーストラクチャ(super-structure)と呼ばれる絞り込まれた探索空間を抽出し、推定精度の高いスコア学習によってスーパーストラクチャの中からベイジアンネットワークを探索する手法である。これにより、精度の高いベイジアンネットワークを効率的に学習することができる。 The Bayesian network structure learning method is roughly classified into score learning with relatively high estimation accuracy and CI (conditional independence) -based learning (or constraint-based learning) capable of high-speed learning. In recent years, a hybrid method utilizing both advantages has been proposed (Non-Patent Document 1). This is a technique of extracting a narrowed search space called a super-structure by high-speed CI-based learning and searching a Bayesian network from the superstructure by score learning with high estimation accuracy. Thereby, a highly accurate Bayesian network can be learned efficiently.
 スーパーストラクチャを用いたハイブリッド手法の弱点は、スーパーストラクチャ抽出の時点でしばしば真のネットワークに存在する辺を削除してしまうこと、すなわち消失辺(missing edge)を生じてしまうことである。スーパーストラクチャにおいて消失辺が発生すると、スコア学習によってスーパーストラクチャ内を探索しても消失辺を復元することができない。このため、出力としてのベイジアンネットワークにおいても消失辺が発生する。 The weak point of the hybrid method using the superstructure is that the edge existing in the true network is often deleted at the time of superstructure extraction, that is, a missing edge is generated. If a lost side occurs in the superstructure, the lost side cannot be restored even if the superstructure is searched by score learning. For this reason, a vanishing edge occurs also in a Bayesian network as an output.
 本発明の実施例は、従来より消失辺の少ないスーパーストラクチャを抽出するCIベース学習手法を提供する。 The embodiment of the present invention provides a CI-based learning method for extracting a superstructure with fewer vanishing edges than in the past.
 本発明の実施例は、入力データからスーパーストラクチャを抽出するプログラム及び方法である。当該プログラム及び方法は、コンピュータに以下のステップを実行させる。 The embodiment of the present invention is a program and method for extracting a superstructure from input data. The program and method cause a computer to execute the following steps.
 (A)各確率変数の名前、確率変数の数、各確率変数の状態名、各確率変数がとり得る状態の名前及び状態数を記述するデータ仕様記述ファイルを入力としてデータベースから受け取るステップと、
 (B)前記データ仕様記述ファイルを解析し、各確率変数の名前、確率変数の数、各確率変数の状態名、各確率変数がとり得る状態の名前及び状態数を記憶部に格納するステップと、
 (C)出力となるべきグラフを完全無向グラフとして初期化し、分離集合を空集合として初期化し、条件付き独立テストの次数を0とし、記憶部に格納するステップと、
 (D)グラフ上に辺を有するすべての変数ペアX,Yについて、
  (D1)|S|=n、S⊆Zである変数集合Sについて条件付き独立テストTest(X;Y|S)を行うステップであって、ZはX及びYの潜在親変数集合であるステップ、
  (D2)条件付き独立テストの結果がInd(X;Y|S)である場合、分離集合をSep=Sep∪{<{X,Y},S>}によって更新して記憶部に格納するステップ、並びに
  (D3)グラフからXとYとの間の辺を削除して該グラフを記憶部に格納するステップ
 を実行するステップと、
 (E)グラフ内に存在する推定されるv-構造の辺について、方向が決定されていない場合、推定されるv-構造のとおりに辺の向きを決定するステップと、
 (F)該辺が推定されたv-構造とは反対の方向に方向付けられている場合、該辺が衝突辺であると決定し、該衝突辺の向きを所定の確率に従って選択し、該辺と該辺の親変数の組とを要素として有する辺親集合を記憶部に格納するステップと、
 (G)DAGの制約と矛盾しない方向付け規則に従って、辺の方向を決定するステップと、
 (H)nをインクリメントするステップと、
 (I)グラフ上に残っている辺のすべての変数ペアX,Yについて、潜在親変数集合のサイズがn以上である場合、ステップ(D)に戻り、潜在親変数集合のサイズがn未満である場合、得られたグラフを記憶部に格納するステップと、
 (J)ステップ(C)乃至(I)を所定の回数繰り返すステップと、
 (K)得られた所定の個数のグラフの和集合をスーパーストラクチャとして出力するステップ。
(A) receiving from the database as input a data specification description file describing the name of each random variable, the number of random variables, the state name of each random variable, the name of each state that can be taken by each random variable, and the number of states;
(B) Analyzing the data specification description file, and storing the name of each random variable, the number of random variables, the state name of each random variable, the names of states that each random variable can take, and the number of states in a storage unit; ,
(C) initializing a graph to be output as a complete undirected graph, initializing the separation set as an empty set, setting the order of the conditional independence test to 0, and storing them in a storage unit;
(D) For all variable pairs X, Y having edges on the graph,
(D1) A step of performing a conditional independent test Test (X; Y | S) for the variable set S in which | S | = n, S⊆Z, where Z is a latent parent variable set of X and Y ,
(D2) when the result of the conditional independence test is Ind (X; Y | S), updating the separation set by Sep = Sep ∪ {<{X, Y}, S>} and storing it in the storage unit, and (D3) deleting the edge between X and Y from the graph and storing the graph in the storage unit; and
(E) determining a direction of an edge according to the estimated v-structure if the direction is not determined for the estimated v-structure edge present in the graph;
(F) if the edge is oriented in the direction opposite to the estimated v-structure, determining that the edge is a collision edge, selecting the direction of the collision edge according to a predetermined probability, and storing in the storage unit an edge parent set having, as elements, the edge and the set of parent variables of the edge;
(G) determining an edge direction according to an orientation rule consistent with DAG constraints;
(H) incrementing n;
(I) for all variable pairs X, Y of the edges remaining on the graph, returning to step (D) if the size of the latent parent variable set is greater than or equal to n, and storing the obtained graph in the storage unit if the size of the latent parent variable set is less than n;
(J) repeating steps (C) to (I) a predetermined number of times;
(K) A step of outputting the union of the obtained predetermined number of graphs as a superstructure.
実施例による、ベイジアンネットワーク構造学習を実行するための情報処理装置のブロック図である。A block diagram of an information processing apparatus for performing Bayesian network structure learning according to an embodiment.
実施例による、辺の方向を利用するCIベース学習手法のフローチャートを示す。Shows a flowchart of a CI-based learning method using edge directions according to an embodiment.
動作の異なる複数の実施例のCIベース学習を実行し、その結果を合成してスーパーストラクチャとして返す主処理の流れを示す。Shows the flow of the main processing that executes several CI-based learning runs with different behavior, combines the results, and returns them as a superstructure.
従来のRAIの疑似コードを示す。Shows the pseudocode of the conventional RAI.
従来の方向付けルーチンを示す。Shows a conventional orientation routine.
本発明の実施例による、動作の異なる複数のCIベース学習を実行し、その結果を合成してスーパーストラクチャとして返す主処理を示す。Shows the main processing, according to an embodiment of the present invention, that executes several CI-based learning runs with different behavior, combines the results, and returns them as a superstructure.
図6の処理において呼び出される実施例の改造版のRAI(RAIEX)の処理を示す。Shows the processing of the modified RAI (RAIEX) of the embodiment called in the processing of FIG. 6.
従来のRAIにおいて方向付けの誤りにより消失辺が発生するメカニズムを示す。Shows the mechanism by which a missing edge occurs due to an orientation error in the conventional RAI.
衝突辺の発生のメカニズムを示す。Shows the mechanism by which a collision edge occurs.
本発明の実施例の新しい方向付けルーチンを示す。Shows the new orientation routine of an embodiment of the present invention.
図10の処理のフローチャートを示す。Shows a flowchart of the processing of FIG. 10.
本発明の実施例のv-構造の2つの辺の向きを決めるルーチンを示す。Shows a routine for determining the orientation of the two edges of a v-structure according to an embodiment of the present invention.
図12の処理のフローチャートを示す。Shows a flowchart of the processing of FIG. 12.
本発明の実施例の処理の流れを示す。Shows the processing flow of an embodiment of the present invention.
実験に使用したベイジアンネットワークのグラフ構造を示す。Shows the graph structure of a Bayesian network used in the experiment.
実験に使用したベイジアンネットワークのグラフ構造を示す。Shows the graph structure of a Bayesian network used in the experiment.
実験に使用したベイジアンネットワークのグラフ構造を示す。Shows the graph structure of a Bayesian network used in the experiment.
実験に使用したベイジアンネットワークのグラフ構造を示す。Shows the graph structure of a Bayesian network used in the experiment.
 以下において、ベイジアンネットワーク及びCI ベース学習手法について、本明細書で用いる概念と記法を定義する。 In the following, the concepts and notations used in this specification will be defined for Bayesian networks and CI-based learning methods.
 特に断らない限り、確率変数を単に変数と呼び、単一の変数はXのように太字でない大文字で表記される。集合は太字で表記される。たとえば、変数集合はZ(太字)のように表記される。変数集合X(太字)及びY(太字)がZ(太字)を所与として条件付き独立であることをInd(X(太字);Y(太字)|Z(太字))と表し、条件付き依存であることをDep(X(太字);Y(太字)|Z(太字))と表す。要素が単一であるとき、集合記号は適宜省略される。たとえば、Ind({X};{Y}|{Z})の代わりにInd(X;Y|Z)と書く。Ind(X(太字);Y(太字)|Z(太字))のとき、Z(太字)をX(太字)とY(太字)の分離集合と呼ぶ。変数XとYが変数集合Z(太字)を所与として条件付き独立であるか否かをデータに基づいてテストすることを、条件付き独立テストあるいは単にテストと呼び、Test(X;Y|Z(太字))と書く。条件付き独立テストTest(X;Y|Z(太字))は、Ind(X;Y|Z(太字))なら真、Dep(X;Y|Z(太字))なら偽である。条件付き独立テストTest(X;Y|Z(太字))のZ(太字)を条件変数集合と呼び、|Z(太字)|を条件付き独立テストの次数と呼ぶ。 Unless otherwise noted, random variables are simply called variables, and single variables are written in capital letters such as X, not bold. The set is shown in bold. For example, the variable set is expressed as Z (bold). The variable sets X (bold) and Y (bold) are conditionally independent given Z (bold) as Ind (X (bold); Y (bold) | Z (bold)), and conditionally dependent Is represented as Dep (X (bold); Y (bold) | Z (bold)). When there is a single element, the set symbol is omitted as appropriate. For example, Ind (X; Y | Z) is written instead of Ind ({X}; {Y} | {Z}). When Ind (X (bold); Y (bold) | Z (bold)), Z (bold) is called a separated set of X (bold) and Y (bold). Testing based on data whether variables X and Y are conditionally independent given a variable set Z (bold) is called a conditional independent test or simply a test, and Test (X; Y | Z (Bold)). The conditional independent test Test (X; Y | Z (bold)) is true if Ind (X; Y | Z (bold)), and false if Dep (X; Y | Z (bold)). Z (bold) of the conditional independent test Test (X; Y | Z (bold)) is called a conditional variable set, and | Z (bold) | is called the order of the conditional independent test.
 多くのCIベース学習手法では、初期状態は完全無向グラフであり、条件付き独立テストにより任意の2つの変数X、Yの分離集合が見つかれば、グラフからXY間の辺を削除する。 In many CI-based learning methods, the initial state is a complete undirected graph, and if a separation set of any two variables X and Y is found by a conditional independence test, the edge between X and Y is deleted from the graph.
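This deletion loop can be sketched as follows. A minimal illustration only: the predicate `ind(x, y, s)` standing for the conditional independence test Test(X; Y | S) is a hypothetical argument, and the conditioning sets are drawn here, for simplicity, from all other variables rather than from latent parent sets.

```python
from itertools import combinations

def prune_skeleton(variables, ind, max_order):
    """Start from the complete undirected graph and delete the edge X-Y
    whenever some conditioning set S with |S| <= max_order makes X and Y
    independent; record S as the separation set of X and Y."""
    edges = {frozenset(p) for p in combinations(variables, 2)}
    sepsets = {}
    for n in range(max_order + 1):           # increasing test order
        for edge in sorted(edges, key=sorted):
            x, y = sorted(edge)
            rest = [v for v in variables if v not in (x, y)]
            for s in combinations(rest, n):
                if ind(x, y, set(s)):        # Ind(X; Y | S) holds
                    sepsets[edge] = set(s)   # record the separation set
                    edges.discard(edge)      # delete the edge X *-* Y
                    break
    return edges, sepsets
```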
 本明細書では、無向グラフ(すべての辺が無向辺)、有向グラフ(すべての辺が有向辺)、部分有向グラフ(一部の辺が有向辺)を扱う。グラフg=<V(太字),E(太字)>は、頂点集合V(太字)(g)=V(太字)と辺集合E(太字)(g)=E(太字)の組である。グラフg´=<V´(太字),E´(太字)>は、もしV´(太字)⊆V(太字)(g) かつE´(太字)⊆E(太字)(g) ならば、グラフgの部分グラフであるという。頂点X、Y 間の無向辺をX-Yと書き、XからYへの有向辺をX→YもしくはY←Xと書く。辺の向きを区別しない場合はX*-*Yと書く。頂点XとYは、辺X*-*Yが存在するとき、互いに隣接しているという。有向辺X→Yがあるとき、XはYの親といい、YはXの子という。Adj(太字)(X,g)、Pa(太字)(X,g)、Ch(太字)(X,g)は、グラフg上の変数(頂点)Xの隣接変数集合、親変数集合、子変数集合をそれぞれ表す。部分有向グラフg上の潜在親変数集合Pa(太字)(X,g)を、Pa(太字)(X,g)=Adj(太字)(X,g)\Ch(太字)(X,g)と定義する。 In this specification, undirected graphs (all edges are undirected edges), directed graphs (all edges are directed edges), and partial directed graphs (some edges are directed edges) are handled. The graph g = <V (bold), E (bold)> is a set of a vertex set V (bold) (g) = V (bold) and an edge set E (bold) (g) = E (bold). If the graph g ′ = <V ′ (bold), E ′ (bold)> is V ′ (bold) ⊆ V (bold) (g) and E ′ (bold) ⊆ E (bold) (g) It is said to be a subgraph of graph g. The undirected side between the vertices X and Y is written as XY, and the directed side from X to Y is written as X → Y or Y ← X. If the direction of the side is not distinguished, write X *-* Y. Vertices X and Y are said to be adjacent to each other when side X *-* Y is present. When there is a directed side X → Y, X is called the parent of Y, and Y is called the child of X. Adj (bold) (X, g), Pa (bold) (X, g), Ch (bold) (X, g) are the adjacent variable set, parent variable set, child of variable (vertex) X on graph g Each variable set is represented. The latent parent variable set Pa p (bold) (X, g) on the partial directed graph g is expressed as Pa P (bold) (X, g) = Adj (bold) (X, g) \ Ch (bold) (X, g ).
 順序づけられた変数列<X(1),X(2),...,X(n)>で、すべての{X(i),X(i+1)}に辺X(i)*-*X(i+1)があるものを変数X(i)とX(n)の経路(path)と呼び、πと表す。経路の辺がすべて無向辺である経路を無向経路(undirected path)と呼ぶ。始点X(1)から終点X(n)まで有向辺の向きに従い辿れる経路を有向経路(directed path)と呼ぶ。始点X(1)と終点X(n)が同じ変数である経路を閉路(closed path)と呼ぶ。無向経路の閉路をループ(loop)と呼び、有向経路の閉路を循環(cycle)と呼ぶ。循環を持たない有向グラフを、有向非循環グラフ(Directed Acyclic Graph)または略してDAGと呼ぶ。 An ordered variable sequence <X(1), X(2), ..., X(n)> in which every pair {X(i), X(i+1)} has an edge X(i) *-* X(i+1) is called a path between the variables X(1) and X(n), and is denoted π. A path whose edges are all undirected is called an undirected path. A path that can be traced from the start point X(1) to the end point X(n) following the directions of the directed edges is called a directed path. A path whose start point X(1) and end point X(n) are the same variable is called a closed path. A closed undirected path is called a loop, and a closed directed path is called a cycle. A directed graph that has no cycle is called a directed acyclic graph, or DAG for short.
 経路πの{X(i-1),X(i),X(i+1)}に、X(i-1)→X(i)とX(i)←X(i+1)が存在すれば、X(i)を経路πの合流点(collider)と呼ぶ。 If X (i-1) → X (i) and X (i) ← X (i + 1) exist in {X (i−1) , X (i) , X (i + 1) } of the path π, X (I) is called a collider of the path π.
 変数集合X(太字)とY(太字)は、以下の条件を満たすとき、変数集合Z(太字)を所与としてDAG g上で有向分離される(d-separated)という:∀X∈X(太字)と∀Y∈Y(太字)について、XとYのすべての経路πが次の2つのうちいずれかの性質を満たす;(1)経路π上に、合流点でなく、かつZ(太字)の元である変数が存在する;(2)経路π上に、合流点またはその合流点の子孫がZ(太字)の元でない合流点が存在する。DAGg上で、X(太字)とY(太字)がZ(太字)を所与として有向分離されることを、Dsep(X(太字);Y(太字)|Z(太字))と表記する。 Variable sets X (bold) and Y (bold) are said to be directional-separated (d-separated) on DAG g given variable set Z (bold) when the following conditions are satisfied: ∀X∈X (Bold) and ∀Y∈Y (bold), all paths π of X and Y satisfy one of the following two properties: (1) On the path π, not a confluence, and Z ( There is a variable that is the origin of (bold); (2) there is a confluence on the path π where the confluence or the descendant of that confluence is not an element of Z (bold). On DAGg, X (bold) and Y (bold) are directed and separated with Z (bold) given as Dsep g (X (bold); Y (bold) | Z (bold)) To do.
 確率分布Pは、Ind(X(太字);Y(太字)|Z(太字))⇒Dsep(X(太字);Y(太字)|Z(太字))ならば、DAGgに忠実(faithful)であるという。多くのCIベース学習手法は、対象とする確率分布が何らかのDAGに忠実であることを仮定する。本発明においては、確率分布がDAGに忠実であることを仮定する。 If probability distribution P is Ind P (X (bold); Y (bold) | Z (bold)) ⇒ Dsep g (X (bold); Y (bold) | Z (bold)), it is faithful to DAGg (faithful). ). Many CI-based learning techniques assume that the targeted probability distribution is faithful to some DAG. In the present invention, it is assumed that the probability distribution is faithful to the DAG.
 N個の離散確率変数X,...,Xからなり、各変数X が状態k∈{1,...,r}をとる定義域U(太字)を考える。U(太字)上のベイジアンネットワークB(太字)は、B(太字)=<g、Θ(太字)>と定義される。ここで、最初の要素g=<V(太字),E(太字)>は、確率変数集合U(太字)={X,...,X}に対応する頂点集合V(太字)とその変数間の依存関係を表現した有向辺集合E(太字)からなる有向非循環グラフ(DAG, Directed Acyclic Graph)である。グラフgは変数間の独立関係を表現している。次の成分Θ(太字)はパラメータθijk=P(X=k|Π=j) の集合である。ここで、Πは変数Xのグラフg上の親変数集合を表し、Π=jはΠがj番目の状態値を取ることを表す。ベイジアンネットワークB(太字)は、変数集合U(太字)上のユニークな同時確率分布を以下のように定める。 N discrete random variables X 1 ,. . . , X N , and each variable X i is in a state kε {1,. . . , R i } Consider a domain U (bold). A Bayesian network B (bold) on U (bold) is defined as B (bold) = <g, Θ (bold)>. Here, the first element g = <V (bold), E (bold)> is a random variable set U (bold) = {X 1 ,. . . , X N } is a directed acyclic graph (DAG) consisting of a vertex set V (bold) and a directed edge set E (bold) representing the dependency between the variables. A graph g represents an independent relationship between variables. The next component Θ (bold) is a set of parameters θ ijk = P (X i = k | Π i = j). Here, [pi i represents the parent variable set on the graph g variables X i, the [pi i = j indicates that [pi i takes j-th state value. Bayesian network B (bold) defines a unique joint probability distribution on variable set U (bold) as follows.
Figure JPOXMLDOC01-appb-M000001
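The equation referenced above (rendered only as an image placeholder in this text) is the standard Bayesian network factorization; reconstructed here from the surrounding definitions:

```latex
P(X_1, \ldots, X_N) \;=\; \prod_{i=1}^{N} P(X_i \mid \Pi_i)
```

where, with the parameters Θ defined above, each factor evaluates as P(X_i = k | Π_i = j) = θ_ijk.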
 ベイジアンネットワークのグラフ構造(ネットワーク)をデータから推定することを構造学習あるいは単に学習と呼ぶ。ベイジアンネットワークの学習は、スコアベースの学習手法とCI(conditional independence) ベースの学習手法とに大別される。CI ベース学習手法は制約ベースの学習手法とも呼ばれる。スコアベースの学習手法は、候補となるモデル(有向グラフ) の統計的なスコアを算出し、スコアが最大となるモデルを解とするもので、構造推定精度が比較的高い。しかし、ベイジアンネットワークのスコア学習はNP困難であることが知られている。一方、CIベースの学習手法は、確率変数間の条件付き独立テストによりベイジアンネットワークを学習するものであり、比較的高速な学習が可能である。多くのCIベース手法は、まず、条件付き独立テストにより、ベイジアンネットワークのスケルトン、すなわちベイジアンネットワークのグラフ構造から辺の向きを取り除いた無向グラフを学習し、次に、得られた条件付き独立関係とDAGの制約を用いて辺の方向付けを行う。 Estimating the graph structure (network) of a Bayesian network from data is called structure learning or simply learning. Bayesian network learning is roughly divided into a score-based learning method and a CI (conditional independence) -based learning method. The CI -based learning method is also called a constraint-based learning method. The score-based learning method calculates a statistical score of a candidate model (directed graph) and uses a model having the maximum score as a solution, and has a relatively high structure estimation accuracy. However, it is known that score learning of a Bayesian network is NP-hard. On the other hand, the CI-based learning method learns a Bayesian network by a conditional independent test between random variables, and relatively high-speed learning is possible. Many CI-based methods first learn Bayesian network skeletons, that is, undirected graphs with edge orientations removed from the Bayesian network graph structure by conditional independence testing, and then the resulting conditional independence And edge orientation using the DAG constraints.
 近年、高速なCIベース学習の利点と推定精度の高いスコア学習の利点を活かしたハイブリッド手法が提案されている。ハイブリッド手法では、まず、CIベース学習により、スーパーストラクチャ(super-structure)と呼ばれる絞り込まれた探索空間を抽出する。ここで、スーパーストラクチャとは、真のベイジアンネットワークのスケルトンを部分グラフとして含む無向グラフを意味する。次に、スコア学習により、スーパーストラクチャからベイジアンネットワークが探索される。すなわち、出力となるベイジアンネットワークに存在するはずの有向辺については、辺の向きが取り除かれた無向辺としてスーパーストラクチャ中にも存在する、という制約のもとでベイジアンネットワークが学習される。Ordyniakらは、スーパーストラクチャを用いたハイブリッド手法の数理分析を行い、スーパーストラクチャの木幅が限られるとき、スコア最適なベイジアンネットワークは変数数の多項式時間で学習できることを示し、さらに、スーパーストラクチャの最大次数が限られるとき、線形時間で学習できることを示した。ここで、木幅(treewidth)はグラフの計算的な複雑さを示す値であり、ベイジアンネットワークにおいては、木幅は、親変数が多いほど、またベイジアンネットワークのスケルトンにおけるループが多いほど大きく複雑となる。 In recent years, hybrid methods have been proposed that take advantage of high-speed CI-based learning and high-precision score learning. In the hybrid method, first, a narrowed search space called a super-structure is extracted by CI-based learning. Here, the superstructure means an undirected graph including a skeleton of a true Bayesian network as a subgraph. Next, a Bayesian network is searched from the superstructure by score learning. That is, the Bayesian network is learned under the restriction that the directed edge that should exist in the output Bayesian network also exists in the superstructure as an undirected edge with the edge direction removed. Ordyniak et al. Performed a mathematical analysis of the hybrid method using superstructure, and showed that when the tree width of the superstructure is limited, a score-optimized Bayesian network can be learned in polynomial time with the number of variables. It was shown that learning can be performed in linear time when the order is limited. Here, the tree width is a value indicating the computational complexity of the graph. In a Bayesian network, the tree width increases as the number of parent variables increases and the number of loops in the skeleton of the Bayesian network increases. Become.
 スーパーストラクチャを用いたハイブリッド手法の問題点は、スーパーストラクチャ抽出の時点で、しばしば、真のネットワークに存在する辺を削除してしまうこと、すなわち、消失辺(missing edge)を生じてしまうことである。スーパーストラクチャ中に消失辺が発生すると、スコア学習によってスーパーストラクチャ内を探索しても消失辺を復元することができず、このため、出力となるベイジアンネットワークにも消失辺が発生してしまう。 The problem with hybrid methods using superstructures is that, at the time of superstructure extraction, often edges that are present in the true network are deleted, that is, missing edges are generated. . If a lost side occurs in the superstructure, the lost side cannot be restored even if the superstructure is searched by score learning. For this reason, a lost side also occurs in the output Bayesian network.
 従来のハイブリッド手法として、スーパーストラクチャ抽出のためにMMPC(Max-Min Parents and Children: Tsamardinos, I., Brown, L. E., and Aliferis, C. F., “The max-min hill-climbing Bayesian network structure learning algorithm,” Machine Learning, Vol. 65, pp. 31-78 (2006))を用いる手法とHPC(Hybrid Parents and Children: Rodrigues de Morais, S. and Aussem, A., “An Efficient and Scalable Algorithm for Local Bayesian Network Structure Discovery, in ECML PKDD '10,” Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part III, pp. 164-179, Berlin, Heidelberg (2010), Springer-Verlag)を用いる手法が知られている。HPCは、MMPCに比べて、スーパーストラクチャの余剰辺は多くなるが、消失辺を減らせることが知られている。MMPC及びHPCは条件付き独立テストに基づいて無向グラフを抽出するCIベース手法であるが、辺の方向付けを行わない。 As conventional hybrid methods, a method using MMPC (Max-Min Parents and Children: Tsamardinos, I., Brown, L. E., and Aliferis, C. F., “The max-min hill-climbing Bayesian network structure learning algorithm,” Machine Learning, Vol. 65, pp. 31-78 (2006)) and a method using HPC (Hybrid Parents and Children: Rodrigues de Morais, S. and Aussem, A., “An Efficient and Scalable Algorithm for Local Bayesian Network Structure Discovery,” in ECML PKDD '10, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Part III, pp. 164-179, Berlin, Heidelberg (2010), Springer-Verlag) for superstructure extraction are known. It is known that HPC produces more extra edges in the superstructure than MMPC but reduces missing edges. MMPC and HPC are CI-based methods that extract an undirected graph based on conditional independence tests, but they do not perform edge orientation.
 本発明は、従来より消失辺の少ないスーパーストラクチャを実現するCIベース学習手法に関する。図1は、本発明の実施例の方法を実行し、又は本発明の実施例のプログラムによって本発明の方法を実行する、コンピュータ100の概略図を示す。本発明の実施例の方法は、図1のコンピュータ100やプロセッサによって実行されてもよく、コンピュータ実行可能な命令がコンピュータを図1に示す各コンポーネントとして動作させることによって実行されてもよい。本発明の実施例は、そのようなコンピュータ実行可能な命令を格納するコンピュータ読み取り可能な記憶媒体であってもよい。 The present invention relates to a CI-based learning method that realizes a superstructure with fewer vanishing edges than in the past. FIG. 1 shows a schematic diagram of a computer 100 that executes the method of the embodiment of the present invention or executes the method of the present invention by the program of the embodiment of the present invention. The method of the embodiment of the present invention may be executed by the computer 100 or the processor of FIG. 1 or may be executed by causing a computer-executable instruction to operate as each component shown in FIG. An embodiment of the present invention may be a computer-readable storage medium storing such computer-executable instructions.
 コンピュータ100への入力、コンピュータ100からの出力、コンピュータ100の各コンポーネントについて以下に説明する。 The input to the computer 100, the output from the computer 100, and each component of the computer 100 will be described below.
 コンピュータ100は、以下のようなデータ及びデータ仕様記述ファイルを入力として読み込んでもよい。 The computer 100 may read the following data and data specification description file as inputs.
 (1) Data
 This is the main input for structure learning: tabular data expressed in CSV format, as relations of a relational database, or the like. The data store holding the data may be a file, a relational database, a two-dimensional array in memory, or the like. Each column corresponds to a random variable, and each row contains the states (realized values) of the corresponding variables. For example, consider a store handling four products A, B, C, and D, where the type of coupon used by a customer is either T1 or T2 (the two cannot be combined; n denotes that no coupon was used), a purchased product is denoted y, and a product not purchased is denoted n. The purchase data for six customers is then represented as in Table 1.
Figure JPOXMLDOC01-appb-T000002
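As a minimal sketch of the tabular data described above (the concrete rows of Table 1 appear only as an image in the original, so the sample values below are hypothetical), the data might be held in memory as follows:

```python
# Hypothetical in-memory representation of the tabular purchase data.
# Column names are the random variables; each row holds one customer's
# realized values (states). The concrete values are illustrative only.
variables = ["Coupon", "A", "B", "C", "D"]

rows = [
    {"Coupon": "T1", "A": "y", "B": "y", "C": "n", "D": "n"},
    {"Coupon": "T2", "A": "n", "B": "n", "C": "y", "D": "y"},
    {"Coupon": "n",  "A": "n", "B": "y", "C": "y", "D": "n"},
]

# Every row assigns a state to every random variable.
assert all(set(row) == set(variables) for row in rows)
```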
 (2) Data specification description file
 This is a file describing which random variables, and which states (realized values) of those variables, are contained in the "data" described above. When a random variable has n states, the data specification is written in CSV format so that each row contains the variable name, state 1, state 2, ..., state n. For example, as in the example above, in a store handling four products A, B, C, and D, where the coupon type used by a customer is T1 or T2 (not combinable; n if no coupon is used), a purchased product is denoted y, and a product not purchased is denoted n, the random variables representing the customers' purchase history and their realized values are described in the data specification description file as follows.
 Coupon,T1,T2,n
 A,y,n
 B,y,n
 C,y,n
 D,y,n
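A minimal sketch of reading such a specification, assuming Python and treating each CSV row as a variable name followed by its state names (`parse_data_spec` is an illustrative helper, not part of the patent):

```python
import csv
import io

# The CSV-format specification from the example above.
SPEC_TEXT = """Coupon,T1,T2,n
A,y,n
B,y,n
C,y,n
D,y,n
"""

def parse_data_spec(text):
    """Return {variable name: [state names]} from CSV-format spec text."""
    spec = {}
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            spec[row[0]] = row[1:]
    return spec

spec = parse_data_spec(SPEC_TEXT)
# Coupon has three states; each product variable has two.
```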
 (3) Output
 The method or program of an embodiment of the present invention outputs a superstructure description file, which describes the estimated superstructure. In this file, each variable pair connected by an edge is written on its own line, the two names separated by a comma (,). For example, if in the example above edges are estimated between Coupon and A, between Coupon and D, between A and B, and between B and C, the superstructure is described in the superstructure description file as follows.
 Coupon,A
 Coupon,D
 A,B
 B,C
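A minimal sketch of emitting this format, assuming Python (the edge list and helper name are illustrative):

```python
# One comma-separated variable pair per line, one line per estimated edge,
# matching the superstructure description file format shown above.
edges = [("Coupon", "A"), ("Coupon", "D"), ("A", "B"), ("B", "C")]

def superstructure_lines(edge_list):
    return "\n".join(f"{x},{y}" for x, y in edge_list)

text = superstructure_lines(edges)
# `text` could then be written to the superstructure description file.
```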
 (4) Processing components
 The components shown in FIG. 1 are described below. The program control unit 102 controls the overall flow of processing. As preprocessing, the program control unit 102 checks the arguments and parameters of the program and, if they are valid, causes the data specification analysis unit 104 to analyze the data specification. The program control unit 102 then causes the CI-based structure learning unit 106 to execute the main processing.
 The data specification analysis unit 104 reads the data specification description file and prepares to analyze the data that is the main input. It holds the name of each random variable, the number of random variables, the state names of each variable, the number of states of each variable, and the total number of data records, and provides this information to the other components.
 The CI-based structure learning unit 106 executes the CI-based learning algorithm and extracts the superstructure from the data; it carries out the main processing.
 The conditional independence test execution unit 108 executes conditional independence tests. It has a function of caching execution results, and when the result of an already executed conditional independence test is requested, it returns the cached result.
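A minimal sketch of this caching behavior, assuming Python; `raw_test` is a hypothetical placeholder for the actual statistical test, not the patent's implementation:

```python
# Count how often the (stubbed) statistical test is actually executed.
call_count = 0

def raw_test(x, y, z):
    """Hypothetical stand-in for the statistical independence test."""
    global call_count
    call_count += 1
    return False  # placeholder verdict: "not independent"

_cache = {}

def cached_test(x, y, z):
    # The cache key ignores the order of X and Y and of the conditioning set.
    key = (frozenset((x, y)), frozenset(z))
    if key not in _cache:
        _cache[key] = raw_test(x, y, z)
    return _cache[key]

cached_test("A", "B", {"Coupon"})
cached_test("B", "A", {"Coupon"})  # same test; served from the cache
```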
 When a conditional independence test Test(X;Y|Z (bold)) determines Ind(X;Y|Z (bold)), the separating set holding unit 110 stores and manages the separating set Z (bold) of the variable pair X, Y in a hash keyed by the variable pair.
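A minimal sketch of this pair-keyed storage, assuming Python dictionaries with unordered `frozenset` keys stand in for the hash (the helper names are illustrative):

```python
# Separating sets stored under a key formed from the unordered pair {X, Y}.
separating_sets = {}

def store_separating_set(x, y, z):
    separating_sets[frozenset((x, y))] = frozenset(z)

def get_separating_set(x, y):
    return separating_sets.get(frozenset((x, y)))

store_separating_set("A", "C", {"B"})
# Lookup is order-insensitive because the key is an unordered pair.
```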
 The graph structure construction unit 112 constructs the graph structure of the superstructure estimated by the CI-based structure learning unit 106. It builds, as data structures shared with the other components, 1) an array of nodes representing the random variables and 2) an array of directed or undirected edges representing dependencies between pairs of random variables, and manages them.
 The computation results obtained in each component may be stored as appropriate in a storage device such as a memory and used in subsequent computations. In the course of computations performed by one component, results obtained so far may be stored in the storage device and used in that component's later computations.
 The method of the present invention is an improved CI-based learning method that uses edge directions (i.e., parent-child relationships between variables) to select conditional independence tests. Most CI-based learning methods orient edges only once, at the end; a CI-based method that exploits edge directions instead performs orientation several times along the way and, based on the parent-child relationships obtained, decides which conditional independence tests to perform thereafter. One example of a CI-based learning method that uses edge directions is RAI, described later. The present invention is applicable not only to RAI but to any CI-based learning method that uses edge directions. A CI-based learning method using edge directions according to an embodiment of the present invention is described here.
 FIG. 2 shows a flowchart of the CI-based learning method using edge directions according to the embodiment. The method of FIG. 2 may be executed by the computer 100 or processor of FIG. 1, or by computer-executable instructions that cause a computer to operate as the components shown in FIG. 1. An embodiment of the present invention may also be a computer-readable storage medium storing such computer-executable instructions.
 In step 202, the output graph G is initialized as a complete undirected graph. In step 204, Sep (bold), the collection of all separating sets, is initialized as an empty set. In step 206, the order n of the conditional independence tests is set to 0. In step 208, the following steps (1) to (3) are repeated for every variable pair {X,Y} that has an edge in the graph G.
 (1) Identify the potential parent variable set Z (bold) = Pap (bold)(X,G) ∪ Pap (bold)(Y,G) of X and Y (step 210).
 (2) Perform Test(X;Y|S (bold)) for every S (bold) such that |S (bold)| = n and S (bold) ⊆ Z (bold) (step 212).
 (3) If the test result is Ind(X;Y|S (bold)) ("Yes" in step 214), update the separating-set collection by Sep (bold) = Sep (bold) ∪ {<{X,Y},S (bold)>} (step 216) and delete the edge X*-*Y from the graph G (step 218).
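The loop of steps 208 to 218 can be sketched as follows, assuming Python; `is_independent` is a hypothetical stand-in for the statistical test Test(X;Y|S), and the potential-parent sets are toy values:

```python
from itertools import combinations

def prune_edges(edges, potential_parents, n, is_independent):
    """For each adjacent pair {X, Y}, test every size-n subset S of the
    potential-parent union; on independence, record the separating set
    and delete the edge."""
    sep = {}
    remaining = set(edges)
    for (x, y) in list(remaining):
        z = (potential_parents.get(x, set())
             | potential_parents.get(y, set())) - {x, y}
        for s in combinations(sorted(z), n):
            if is_independent(x, y, set(s)):
                sep[frozenset((x, y))] = set(s)
                remaining.discard((x, y))
                break
    return remaining, sep

# Toy example at order n = 1: B separates A and C, so edge A-C is deleted.
edges = {("A", "B"), ("B", "C"), ("A", "C")}
parents = {"A": {"B"}, "C": {"B"}}
left, sep = prune_edges(edges, parents, 1,
                        lambda x, y, s: {x, y} == {"A", "C"} and s == {"B"})
```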
 In step 220, the edges in the graph G are oriented. In the embodiment of the present invention, a new orientation procedure is applied here. In the present invention, a contradiction in v-structure estimation is detected through the occurrence of a collision edge, and the direction of the collision edge is decided probabilistically to one side or the other. The direction decided for a collision edge is held and managed separately for each individual execution (thread).
 In addition to the graph G to be oriented and the collection Sep (bold) of separating sets of the deleted edges (variable pairs), the orientation of the present invention considers an edge-parent set Ep (bold), whose elements are collision edges paired with the parent variable of each edge, in order to decide and maintain the direction of collision edges separately per thread. As one example, so that element lookups in the edge-parent set run in constant time, the edge-parent set may be implemented as a hash whose key is the edge with its direction ignored and whose value is the parent variable of that edge. In the orientation of this embodiment, when searching for a triple of variables X*-*Z*-*Y that is a candidate v-structure, the directions of the edges among these variables are ignored. This makes it possible to judge whether the three variables currently under consideration form a v-structure without being affected by previously estimated v-structures.
 For each edge estimated to belong to a v-structure, the directions of the two edges of the v-structure are decided. First, consider the parent variable Sp and the child variable Sc for the case where the v-structure estimation is correct. If this is the first time v-structure orientation is performed for the edge, the edge is still undirected, so its direction is set as estimated, with Sp as the parent and Sc as the child. If the direction of the edge has been decided before and the edge is already oriented in the direction opposite to the estimated v-structure, the edge is detected as a collision edge. If this collision edge is detected for the first time, its direction in this thread is chosen according to a predetermined probability, and the decision is added to and held in the edge-parent set Ep (bold).
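A minimal sketch of this collision-edge handling, assuming Python; the data structures and helper are illustrative, with a dict keyed by the unordered edge playing the role of Ep (bold):

```python
import random

def orient(direction, edge_parents, parent, child, rng):
    """Orient an edge as estimated if still undirected; if it is already
    oriented the opposite way, detect a collision edge and, on first
    detection, fix a per-thread direction at random."""
    key = frozenset((parent, child))
    if key not in direction:           # still undirected: orient as estimated
        direction[key] = parent
    elif direction[key] != parent:     # already oriented the opposite way
        if key not in edge_parents:    # first detection of this collision edge
            edge_parents[key] = rng.choice([parent, child])
        direction[key] = edge_parents[key]

direction, edge_parents = {}, {}
rng = random.Random(0)                 # per-thread random source
orient(direction, edge_parents, "X", "Z", rng)  # first orientation: X -> Z
orient(direction, edge_parents, "Z", "X", rng)  # conflict: collision edge
```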
 Thereafter, edge directions are decided as far as possible according to rules called orientation rules, which are derived so as not to contradict the constraints of a DAG.
 In step 222, the order n of the conditional independence tests is incremented by 1. In steps 224 and 226, the potential parent variable set is identified for every variable pair {X,Y} of the edges remaining in the graph G. If there is a variable pair whose potential parent variable set size |Z (bold)| is at least n ("Yes" in step 228), the process returns to the conditional independence test loop; if no such pair exists ("No" in step 228), the graph G is output and the process ends (step 230).
 As described above, the embodiment of the present invention performs a novel edge orientation process in a CI-based learning method that uses edge directions to select the conditioning variable set S (bold) of the conditional independence tests, and detects orientation errors.
 The embodiment of the present invention further extracts a superstructure with few missing edges by combining the execution results of the above process. That is, by superimposing the results of several runs of the above new CI-based learning that output different graphs, the present invention can compensate for a missing edge in one run's result with the results of the other runs.
 FIG. 3 shows the flow of the main processing, which executes several runs of the CI-based learning of the present invention with different behaviors and combines their results into a superstructure. This processing may be executed by the computer 100 or processor of FIG. 1, or by computer-executable instructions that cause a computer to operate as the components shown in FIG. 1.
 In step 302, initial processing is executed. The program control unit 102 checks operating parameters including at least one of the database connection information, the data specification description file name, the significance level α of the conditional independence tests, the number of threads t (the number of parallel executions of the CI-based learning of the embodiment), and the superstructure description file name. If there is an error, the program control unit 102 displays it on a display device or the like and terminates the program. If the parameters are valid, the program control unit 102 continues and causes the data specification analysis unit 104 to analyze the data specification. The data specification analysis unit 104 reads the data specification description file and holds the name of each random variable, the number of random variables, the names of all states each variable can take, and the number of states. Next, the data specification analysis unit 104 accesses the database using the database connection information, obtains the total number of records, and holds it. The program control unit 102 then transfers control to the CI-based structure learning unit 106.
 The CI-based structure learning unit 106 executes the CI-based learning according to the embodiment of the present invention described above. In step 304, the specified number of parallel CI-based learning execution threads are created. In step 306, the CI-based learning threads of the present invention are executed in parallel. In step 308, the parent thread that runs the CI-based learning waits until one of the execution threads finishes. In step 310, the union of the edge set of the graph obtained from the finished thread and the edge sets obtained so far is formed, and this becomes the superstructure. If unfinished threads remain ("Yes" in step 312), the process returns to step 308. In step 314, the graph structure construction unit 112 receives the graph structure of the superstructure from the CI-based structure learning unit 106 and generates output in accordance with the specification of the superstructure description file. The processing ends at step 316.
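A minimal sketch of steps 304 to 312, assuming Python threads; `run_learner` is a hypothetical stand-in for one CI-based learning run, and here each run "loses" a different edge so the union restores them all:

```python
from concurrent.futures import ThreadPoolExecutor

# Edges of a toy true network, as unordered pairs.
true_edges = {frozenset(p) for p in [("Coupon", "A"), ("A", "B"), ("B", "C")]}

def run_learner(lost_edge):
    """Hypothetical learning run whose result is missing one edge."""
    return true_edges - {lost_edge}

superstructure = set()
with ThreadPoolExecutor(max_workers=3) as pool:
    for edge_set in pool.map(run_learner, sorted(true_edges, key=sorted)):
        superstructure |= edge_set  # union as each thread finishes
```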
 In recent years, RAI (Recursive Autonomy Identification) (Yehezkel, R. and Lerner, B., "Bayesian Network Structure Learning by Recursive Autonomy Identification," Journal of Machine Learning Research, Vol. 10, pp. 1527-1570 (2009)) has been proposed, which uses the results of edge orientation to improve the estimation accuracy and computational efficiency of CI-based methods. Unlike MMPC and HPC, which have conventionally been used for superstructure extraction, RAI exploits the results of edge orientation during graph structure learning to improve estimation accuracy and computational efficiency. However, because edge orientation depends on the results of statistical tests (conditional independence tests), edges can be oriented incorrectly under the realistic condition of finite samples. Such misorientation triggers conditional independence tests that are not actually necessary, and thereby produces missing edges.
 In the embodiment of the present invention, RAI can be used as the underlying CI-based learning. Compared with the conventional approach of simply using RAI, the embodiment achieves extraction of a superstructure with fewer missing edges. When RAI is used as-is, edge orientation depends on the results of statistical tests (conditional independence tests), so misorientation can occur under the realistic condition of finite samples. In the conventional approach, such misorientation therefore produces missing edges by triggering conditional independence tests that are not actually necessary. The present invention detects misorientation from contradictions in orientation and, by synthesizing superstructures based on the possible orientations, prevents the occurrence of missing edges in the superstructure.
 Here, RAI, which can be used as the underlying CI-based learning method of one embodiment of the present invention, is described, together with the drawbacks of using RAI as-is for superstructure extraction.
 In general, higher-order conditional independence tests are statistically less reliable and computationally more expensive than lower-order tests. By using the parent-child relationships between variables obtained through edge orientation, RAI avoids unreliable and expensive higher-order conditional independence tests, improving accuracy and reducing the amount of computation. Using the parent-child relationships of vertices (variables) produced by orientation, RAI recursively decomposes the graph into a descendant substructure gD and an ancestor substructure gA consisting of the remaining variables. A descendant substructure is defined more formally as an autonomous sub-structure (Definition 2). A parent variable in the ancestor substructure is defined as an exogenous cause (Definition 1).
Figure JPOXMLDOC01-appb-I000003
 Definition 2 (autonomous sub-structure): A substructure gA = <VA (bold), EA (bold)> of a DAG g = <V (bold), E (bold)> with VA (bold) ⊂ V (bold) and EA (bold) ⊂ E (bold) is said to be autonomous in g given the set of exogenous causes Vex (bold) ⊂ V (bold) of gA if, for all X ∈ VA (bold), Pa (bold)(X,g) ⊂ {VA (bold) ∪ Vex (bold)}.
 In general, reducing the number of conditional independence tests contributes to suppressing missing edges. The most extreme case is to perform no conditional independence tests at all, hence delete no edges, and output the complete undirected graph. If this complete undirected graph is regarded as the superstructure, the search space is not narrowed at all; in exchange, no missing edges whatsoever occur in the superstructure. RAI was not developed for the purpose of superstructure extraction, but it reduces conditional independence tests and consequently also suppresses missing edges when used as a superstructure estimation method.
 RAI's reduction of higher-order conditional independence tests by edge orientation operates through the following two mechanisms.
 The first mechanism is control of the test order based on graph decomposition. Using the parent-child relationships between variables estimated by edge orientation, RAI recursively decomposes the graph into descendant and ancestor substructures. RAI selects the edges (variable pairs) subjected to conditional independence tests in the following order: 1) edges inside the ancestor substructure, 2) edges connecting the ancestor substructure to the descendant substructure, and 3) edges inside the descendant substructure. By deleting as many of the edges connecting the ancestor and descendant substructures as possible before the edges inside the descendant substructure, higher-order conditional independence tests can be suppressed when testing the edges inside the descendant substructure.
 The second mechanism is reduction of the size of the conditioning variable set based on edge orientation, which rests on the following lemma.
 Lemma 1: In a DAG, if X and Y are not adjacent and X is not a descendant of Y, then X and Y are d-separated given Pa (bold)(Y).
 By Lemma 1, for an edge X→Y with X ∈ gA and Y ∈ gD, the conditioning variable set S (bold) can be restricted to S (bold) ⊆ Pap (bold)(Y)\{X}, so the number of conditional independence tests can be limited. In contrast, other conventional CI-based methods such as MMPC do not restrict the conditioning variable set using parent-child relationships, so the conditioning variable set is S' (bold) ⊆ Adj (bold)(Y)\{X}, which is larger than in RAI. By Lemma 2 below, which follows from Lemma 1, when the variables X and Y both belong to gD, the existence of the edge X-Y can likewise be judged by checking only the restricted conditioning variable sets S (bold) ⊆ Pap (bold)(Y)\{X}.
 Lemma 2: In a DAG g = <V (bold), E (bold)>, let gA = <VA (bold), EA (bold)> be an autonomous substructure given the set of exogenous causes Vex (bold) ⊂ V (bold) of gA. If Ind(X;Y|S (bold)) holds for X, Y ∈ VA (bold) and S (bold) ⊂ V (bold), then there exists an S' (bold) such that S' (bold) ⊂ {VA (bold) ∪ Vex (bold)} and Ind(X;Y|S' (bold)).
Figure JPOXMLDOC01-appb-I000004
 As will be described later, the embodiment of the present invention replaces the orientation routine called at lines 10 and 15 and runs multiple RAIs with modified behavior.
 Here, edge orientation in existing CI-based methods such as RAI is described. The conventional orientation routine orientEdgeTrad is shown in FIG. 5. The edge orientation routine takes as input the graph to be oriented and the collection Sep (bold) of separating sets of the deleted edges (variable pairs). The basic idea of edge orientation is first to decide the directions of particular edges from the separating sets obtained by the conditional independence tests, and then to decide edge directions as far as possible according to rules, called orientation rules, derived so as not to contradict the constraints of a DAG. The targets whose directions are decided from the separating sets are triples of variables connected as X-Z-Y (with no edge between X and Y) and their two edges. Here, if Z is not an element of the separating set of X and Y, X-Z-Y can be oriented as X→Z←Y. This X→Z←Y is called a v-structure. In FIG. 5, v-structures are estimated in lines 2 to 5, and orientation by the orientation rules is performed thereafter.
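A minimal sketch of this v-structure estimation, assuming Python (the function and its arguments are illustrative, not the patent's orientEdgeTrad itself):

```python
from itertools import combinations

def find_v_structures(nodes, undirected_edges, separating_sets):
    """For every connected triple X-Z-Y with no X-Y edge, report
    X -> Z <- Y when Z is not in the separating set of X and Y."""
    edges = {frozenset(e) for e in undirected_edges}
    v_structures = []
    for x, y in combinations(sorted(nodes), 2):
        if frozenset((x, y)) in edges:
            continue  # X and Y adjacent: not a candidate triple
        for z in sorted(nodes):
            if z in (x, y):
                continue
            if frozenset((x, z)) in edges and frozenset((z, y)) in edges:
                if z not in separating_sets.get(frozenset((x, y)), set()):
                    v_structures.append((x, z, y))  # X -> Z <- Y
    return v_structures

# Toy triple A-Z-B where Z does NOT separate A and B: a collider at Z.
vs = find_v_structures({"A", "B", "Z"}, [("A", "Z"), ("Z", "B")],
                       {frozenset(("A", "B")): set()})
```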
 The problem with conventional edge orientation is that orientation errors readily occur at the time of v-structure estimation. The embodiment of the present invention uses a new orientation procedure in place of the conventional one and detects orientation errors during v-structure estimation.
 The embodiment of the present invention extracts a superstructure with few missing edges by detecting the orientation errors that cause missing edges in RAI and combining the execution results of multiple RAIs that attempt the possible orientations.
 The embodiment of the present invention mainly 1) extracts a superstructure with few missing edges on the premise that learning errors exist, and 2) detects orientation errors.
 In the following, a method according to the embodiment for extracting a superstructure with few missing edges on the premise that learning errors exist is described first, followed by a method according to the embodiment for detecting the occurrence of the orientation errors that cause learning errors in RAI.
 Parallel execution of RAI
 In superstructure extraction that reduces missing edges at the cost of allowing surplus edges, it is desirable to keep every edge other than those that can safely be deleted. On the other hand, the structure estimation accuracy of CI-based learning depends on the accuracy of the conditional independence tests, and under the realistic condition of a limited sample size per parameter, the presence or absence of every edge cannot always be judged correctly. The present inventors arrived at the idea that if the results of several CI-based learning runs that output different graphs are superimposed, a missing edge in one run's result is compensated for by the other runs.
Figure JPOXMLDOC01-appb-I000005
 In the above example, multithreading is used for the parallel execution of the CI-based learning. In other examples, however, the parallel execution may be implemented as multiple processes to increase robustness, or as distributed parallelism across multiple computers for load balancing.
 Also, in the above example, the modified RAI is executed in parallel, but other methods may be executed instead; the same learning method need not be executed in every thread.
 FIG. 7 illustrates the processing of the modified version of RAI (RAIEX) according to an embodiment of the present invention, which is called from the processing of FIG. 6. This embodiment differs from conventional RAI in that lines 10 and 15 call the extended edge orientation routine of the embodiment (described later with reference to FIG. 10) instead of the conventional orientation routine (FIG. 5).
 The processing of FIG. 7 takes as arguments the order n of the conditional independence tests (initially zero), G_start (initially a complete undirected graph), the exogenous cause set G_ex (initially an empty set), and G_all (initially G_all = G_start), and executes processes A, B, C, and D in order. Before executing each process, however, the algorithm checks the following termination condition and returns from this invocation of the algorithm if the condition is met.
 Termination condition (lines 2 to 5): if every variable contained in G_start has fewer than n+1 latent parent variables, G_all is returned.
 The processing of FIG. 7 executes process A (lines 6 to 10).
 For every variable Y contained in G_start and every parent variable X of Y contained in G_ex, the following is repeated (initially nothing is done in process A and the processing moves on to process B, because G_ex is an empty set). If there exists a subset S (bold) of the union (excluding X) of the latent parent variable set of Y in G_start and the parent variable set of Y in G_ex such that S (bold) contains n variables and X and Y are independent given S (bold), then <{X,Y},S (bold)> is added as an element to the separation set Sep (bold) and the edge X*-*Y is deleted from G_all. Furthermore, the edges of G_start are oriented as described later with reference to FIG. 10. Specifically:
(1) For each X*-*Z*-*Y such that X*-*Z and Z*-*Y are contained in the edge set of G_start and X*-*Y is not:
(1-1) Obtain the separation variable set of the deleted edge X*-*Y.
(1-2) If Z is not contained in this separation variable set, then for each edge of X*-*Z*-*Y:
(1-2-1) If the edge has not yet been oriented, orient it as estimated.
(1-2-2) If the edge has already been oriented in the opposite direction, judge it to be a collision edge, and
(1-2-2-1) if the collision edge is detected for the first time, orient it with a predetermined probability;
(1-2-2-2) if the collision edge is not detected for the first time, leave its direction unchanged.
(2) Repeat the orientation of edges by the orientation rules (lines 8 to 13 of FIG. 10) until no more edges can be oriented.
 The processing of FIG. 7 executes process B (lines 11 to 17).
 For every random variable Y contained in G_start and its parent X:
  When line 13 of FIG. 7 is satisfied, that is, if there exists a subset S (bold) of the union (excluding X) of the parent variable set of Y in G_ex and the latent parent variable set of Y in G_start such that S (bold) contains n variables and X and Y are independent given S (bold), add <{X,Y},S (bold)> as an element to the separation set Sep (bold) and delete the edge X*-*Y from G_all.
  (Orientation of the edges of G_start) For each X*-*Z*-*Y such that X*-*Z and Z*-*Y are contained in the edge set of G_start and X*-*Y is not:
   Obtain the separation variable set of the deleted edge X*-*Y.
   If Z is not contained in this separation variable set, then for each edge of X*-*Z*-*Y:
    If the edge has not yet been oriented, orient it as estimated.
    If the edge has already been oriented in the opposite direction, judge it to be a collision edge, and
     if the collision edge is detected for the first time, orient it with a predetermined probability;
     if the collision edge is not detected for the first time, leave it as it is.
  Repeat the orientation of edges by the orientation rules (lines 8 to 13 of FIG. 10) until no more edges can be oriented.
  Set the variable set with the lowest topological order as the descendant subset G_D and temporarily delete it from G_start; let the remaining mutually disconnected variable sets be the ancestor subsets G_A1, G_A2, ..., G_Ak (once the ancestor subsets have been identified, restore the temporarily deleted G_D to G_start).
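The final partitioning step above can be sketched as follows. Identifying the lowest-topological-order variable set is assumed to be given (it is passed in as `gD`), and the remaining variables are split into mutually disconnected ancestor subsets by a plain connected-components search on the undirected skeleton; all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AncestorPartition {
    // Splits the variables outside the descendant subset gD into the connected
    // components gA1..gAk of the undirected skeleton; adjacency maps each
    // variable to its neighbours with edge direction ignored.
    static List<Set<String>> ancestorSubsets(Map<String, Set<String>> adjacency,
                                             Set<String> gD) {
        Set<String> seen = new HashSet<>(gD);   // gD is temporarily removed
        List<Set<String>> components = new ArrayList<>();
        for (String start : adjacency.keySet()) {
            if (seen.contains(start)) continue;
            Set<String> comp = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(start);
            seen.add(start);
            while (!stack.isEmpty()) {          // depth-first traversal
                String v = stack.pop();
                comp.add(v);
                for (String w : adjacency.getOrDefault(v, Set.of())) {
                    if (!seen.contains(w)) { seen.add(w); stack.push(w); }
                }
            }
            components.add(comp);
        }
        return components;
    }
}
```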
 The processing of FIG. 7 executes process C (lines 18 to 20).
 Set the variable i = 1 and repeat the following until i = k: recursively call RAIEX to process the ancestor subset G_Ai; that is, execute RAIEX(n = n+1, G_start = G_Ai, G_ex = G_ex, G_all = G_all).
 The processing of FIG. 7 executes process D (lines 21 to 25).
 Define G_ex_D as {G_A1, G_A2, ..., G_Ak, G_ex}. Recursively call RAIEX to process the descendant subset G_D; that is, execute RAIEX(n = n+1, G_start = G_D, G_ex = G_ex_D, G_all = G_all). Finally, G_all is returned.
Edge direction collision detection
 This section explains the cause of edge orientation errors, the reason why orientation errors lead to missing edges, and a method for detecting edge orientation errors.
Figure JPOXMLDOC01-appb-I000006
 To suppress the occurrence of missing edges due to the orientation errors described with reference to FIG. 8, embodiments of the present invention detect orientation errors at the time of v-structure estimation. Basically, when multiple v-structures sharing a single edge X*-*Y are estimated, if any of those v-structures assigns a different orientation to X*-*Y, an embodiment of the present invention judges that one of the v-structure estimates is incorrect. Here, a situation in which multiple v-structures assign different directions to the edge X*-*Y is called an orientation collision; this state is denoted by the bidirectional edge X⇔Y, and X⇔Y is called a collision edge.
Figure JPOXMLDOC01-appb-I000007
 An embodiment of the present invention detects a contradiction in v-structure estimation through the occurrence of a collision edge, and stochastically settles the direction of the collision edge to one side or the other. The direction determined for a collision edge is held and managed separately for each individual execution (thread).
 The new orientation routine orientEdge of the embodiment of the present invention is shown as Algorithm 5 in FIG. 10. FIG. 11 shows a flowchart of the processing of FIG. 10. All arguments of the orientation routine orientEdge are thread-local; they are initialized for each thread and take on thread-specific states. In addition to the arguments of the conventional edge orientation routine, orientEdge takes as an argument the edge-parent set E_p (bold), whose elements are pairs of a collision edge and the parent variable on that edge, in order to determine and maintain the direction of each collision edge separately for each thread. As one example, to allow element lookup in the edge-parent set in constant time, the edge-parent set may be implemented as a hash whose keys are edges with direction ignored and whose values are the parent variables of those edges. Note that in the orientation routine orientEdge of the embodiment of the present invention, when searching for triples of variables X*-*Z*-*Y that are candidate v-structures, the directions of the edges between these variables are ignored. This makes it possible to judge whether the three variables currently under consideration form a v-structure without being affected by previously estimated v-structures. The latter part of the routine, which orients edges according to the orientation rules (lines 7 to 14), is the same as the conventional orientation routine orientEdgeTrad. What differs from the conventional routine is the earlier part, which orients the two edges of each v-structure (the orientVStructure calls in lines 5 and 6).
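A minimal sketch of such an edge-parent set, keyed by the direction-ignored edge, might look as follows; the canonical key format and the class and method names are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class EdgeParentSet {
    // Maps a direction-ignored edge to the parent variable chosen for it,
    // so membership tests and lookups run in expected constant time.
    private final Map<String, String> parents = new HashMap<>();

    // Canonical key: endpoint names sorted, so X*-*Y and Y*-*X collide.
    private static String key(String x, String y) {
        return x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
    }

    boolean contains(String x, String y) { return parents.containsKey(key(x, y)); }

    // Records the chosen parent of the collision edge between x and y.
    void put(String x, String y, String parent) { parents.put(key(x, y), parent); }

    // Returns the stored parent, or null if the edge is not a known collision edge.
    String parentOf(String x, String y) { return parents.get(key(x, y)); }
}
```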
 The orientVStructure routine, which orients the two edges of a v-structure, is shown as Algorithm 6 in FIG. 12. FIG. 13 shows a flowchart of the processing of FIG. 12. The orientVStructure routine is called for each edge estimated to belong to a v-structure, and is passed the parent variable S_p and the child variable S_c that would hold if the v-structure estimate were correct. If this is the first orientVStructure call for the edge, the edge is still undirected at the time of the call, so the edge is oriented as estimated, with S_p as the parent variable and S_c as the child variable (lines 8 to 9). If this is not the first orientVStructure call for the edge and the edge has already been oriented in the direction opposite to the estimated v-structure (line 2), the edge is judged to be a collision edge. If this collision edge is detected for the first time (line 3), the direction in this thread is selected stochastically (line 4), and that decision is added to and held in the edge-parent set E_p (bold).
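The collision-handling logic just described can be sketched as a simplified standalone routine: the graph and the per-thread state are reduced to the two maps shown, and the `Random` source stands in for the stochastic choice; all names are illustrative, not those of the actual Algorithm 6.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class VStructureOrienter {
    // Current parent of each oriented edge, keyed by the direction-ignored edge.
    private final Map<String, String> orientedParent = new HashMap<>();
    // Parent chosen for each detected collision edge (thread-local in the full method).
    private final Map<String, String> collisionParent = new HashMap<>();
    private final Random random;

    VStructureOrienter(Random random) { this.random = random; }

    private static String key(String x, String y) {
        return x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
    }

    // Called for one edge of an estimated v-structure with the parent/child the
    // estimate implies; returns the parent variable actually kept for the edge.
    String orient(String parent, String child) {
        String k = key(parent, child);
        String current = orientedParent.get(k);
        if (current == null) {                  // first call: orient as estimated
            orientedParent.put(k, parent);
            return parent;
        }
        if (current.equals(parent)) return current;  // same direction: nothing to do
        // Opposite direction: the edge is a collision edge.
        if (!collisionParent.containsKey(k)) {
            // First detection: pick a direction with probability 1/2 and keep it.
            String chosen = random.nextBoolean() ? parent : current;
            collisionParent.put(k, chosen);
            orientedParent.put(k, chosen);
        }
        return orientedParent.get(k);           // later detections leave it unchanged
    }
}
```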
 In the above example, when an edge is first judged to be a collision edge, its direction is simply decided with probability 1/2 (line 4 of FIG. 12). In other examples, however, the direction supported by the larger number of v-structure estimates may be selected with correspondingly higher probability.
 In the above example, only the truth value of the conditional independence test (and the separation set) is used for v-structure estimation. In other examples, however, the p-value of the conditional independence test may be regarded as the probability that conditional independence holds; for the set of v-structures that agree with one orientation of the edge and the set that support the opposite orientation, the p-values of the corresponding tests may be averaged for each set, and a direction may be assigned with probability according to the ratio of the averages.
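A sketch of this p-value-weighted variant: the probability of the first orientation is the mean p-value of the tests supporting it divided by the sum of the two means; function and variable names are illustrative.

```java
public class PValueWeighting {
    // Probability of choosing the first orientation, given the p-values of the
    // conditional independence tests behind the v-structures supporting each
    // direction; each p-value is read as the probability that conditional
    // independence holds.
    static double firstDirectionProbability(double[] pSupportingFirst,
                                            double[] pSupportingSecond) {
        double m1 = mean(pSupportingFirst);
        double m2 = mean(pSupportingSecond);
        return m1 / (m1 + m2);   // probability proportional to the mean p-value
    }

    static double mean(double[] xs) {
        double s = 0.0;
        for (double x : xs) s += x;
        return s / xs.length;
    }
}
```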
 FIG. 14 shows the processing flow of an embodiment of the present invention. The following processing may be executed by the computer 100 or the processor of FIG. 1, or may be executed by computer-executable instructions that cause a computer to operate as the components shown in FIG. 1. An embodiment of the present invention may be a computer-readable storage medium storing such computer-executable instructions.
 In step 1402, initial processing is executed. In the initial processing, preparations for learning, such as checking the operating parameters of the program, are performed. The program control unit 102 checks the operating parameters passed via command-line arguments or the like, such as the database connection information, the data specification description file name, the significance level α of the conditional independence tests, the number of threads t (which specifies how many instances of the CI-based learning of the embodiment are executed in parallel), and the superstructure description file name. If there is an error in the initial processing, the error is shown on a display device or the like and the processing ends. If the initial processing completes normally, the processing continues and the data specification analysis unit 104 analyzes the data specification. The data specification analysis unit 104 reads the data specification description file and holds the name of each random variable, the number of random variables, the names of all the states each random variable can take, and the number of states. Next, the data specification analysis unit 104 accesses the database using the database connection information, obtains the total number of records, and holds it. Next, the program control unit 102 transfers control to the CI-based structure learning unit 106. In step 1404, the CI-based structure learning unit 106, which received control after the initial processing, executes the main processing of the present invention (for example, the processing corresponding to FIG. 6). RAIEX execution threads are created for the specified number of parallel executions. In step 1406, RAIEX using the new edge orientation method according to the embodiment of the present invention is executed in parallel in each thread. In step 1408, the parent thread executing the main processing waits until one of the RAIEX execution threads finishes. In step 1410, the superstructure is formed as the union of the edge set E (bold)(g_out) of the graph g_out, the RAIEX execution result obtained from the finished thread, and the edge sets obtained so far. If there is an unprocessed child thread ("Yes" in step 1412), the processing returns to waiting for an execution thread in step 1408. In step 1414, the graph structure construction unit 112 receives the graph structure of the superstructure from the CI-based structure learning unit 106 and outputs it in accordance with the specification of the superstructure description file. In step 1416, the processing ends. The more specific details of the processing of FIG. 14 are as already described.
 In the following, the results of experiments comparing the method of the embodiment of the present invention with other CI-based methods are described. The conventional CI-based methods used for comparison are the four methods HPC (Hybrid Parents and Children), MMPC (Max-Min Parents and Children), TPDA (Three Phase Dependency Analysis: Cheng, J., Greiner, R., Kelly, J., Bell, D., and Liu, W., "Learning Bayesian networks from data: an information-theory based approach," Artificial Intelligence, Vol. 137, No. 1-2, pp. 43-90 (2002)), and RAI (Recursive Autonomy Identification). For HPC, the R and C implementation published with "Gasse, M., Aussem, A., and Elghazel, H.: An Experimental Comparison of Hybrid Algorithms for Bayesian Network Structure Learning, in ECML PKDD '12: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, pp. 58-73, Springer-Verlag (2012)" was used. For MMPC and TPDA, the Matlab implementation published with "Tsamardinos, I., Brown, L. E., and Aliferis, C. F.: The max-min hill-climbing Bayesian network structure learning algorithm, Machine Learning, Vol. 65, pp. 31-78 (2006)" was used. For RAI and the embodiment of the present invention, implementations written independently in Java (registered trademark) were used. The experimental procedure is as follows.
 (1) From the true Bayesian network, generate 10 sets of input data, each containing 10,000 records.
 (2) Estimate a superstructure from the input data with each CI-based method. From the estimation result, measure the missing edges, extra edges, average degree, maximum degree, and the time required for the estimation.
 (3) With the superstructure as a constraint, perform an exact-solution search by score learning using the same data as in step (2). Measure the time required for the exact-solution search and the score of the Bayesian network obtained as the solution.
 For the conditional independence tests used in step (2), the G² test at a significance level of 5% was used in all experiments ("Spirtes, P., Glymour, C., and Scheines, R.: Causation, Prediction, and Search, MIT Press, New York, N.Y., 2nd edition (2000)"; "Neapolitan, R. E.: Learning Bayesian Networks, Prentice Hall (2003)"). For the exact-solution search by score learning with the superstructure of step (3) as a constraint, an original Java implementation was used. As the score, BDeu, widely used in Bayesian network learning, was used, and its hyperparameter α was set to α = 1.0, the value recommended by recent theoretical analyses of BDeu ("Ueno, M.: Learning networks determined by the ratio of prior and data, in Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pp. 598-605, AUAI Press (2010)").
 All experiments were performed on a 64-bit Windows 7 (registered trademark) machine with a 3.20 GHz Intel Core i7-3930K 6-core CPU and 64 GB of memory.
 Table 1 shows the Bayesian networks used as experimental subjects. Their network structures are shown in FIGS. 15 to 18. They are based on widely published networks: Win95pts was obtained from the GeNIe & SMILE network repository (http://genie.sis.pitt.edu/index.php/network-repository), and the remaining networks, Alarm, Insurance, and Water, were obtained from the Bayesian Network Repository (http://www.cs.huji.ac.il/~galel/Repository/). To make the exact-solution search by score learning after superstructure extraction feasible within the limited memory of the experimental computer, the number of variables was reduced to 25 for the three networks Alarm, Win95pts, and Water (hence they are denoted Alarm-25, Win95pts-25, and Water-25). The reduction was performed by removing childless variables from the original network as long as the maximum in-degree (the maximum number of parent variables) did not change. In Table 2, |V (bold)| is the number of variables, |E (bold)| is the number of edges, Max In/Out is the maximum in-degree (maximum number of parent variables) and maximum out-degree (maximum number of child variables), State is the lower and upper bounds of the number of states of each variable, Param is the number of parameters of the random variables, and V.Skew Param is the number of parameters whose probability value is 0.99 or more.
Figure JPOXMLDOC01-appb-T000008
 The experimental results for each Bayesian network are shown in Tables 3, 4, 5, and 6, respectively. The columns from Missing Edge to SS Time describe the superstructure extracted by the algorithm shown on the left side of the table: Missing Edge is the number of missing edges, Extra Edge the number of extra edges, Degree the average degree, Max Degree the maximum degree, and SS Time the number of seconds from the start to the superstructure output. The two columns Score Time and BDeu show the results of the exact-solution search using the output superstructure: Score Time is the number of seconds required for the exact-solution search, and BDeu is the BDeu score of the Bayesian network obtained as the solution. All values are averages over the 10 data sets. The methods of the embodiment of the present invention are denoted Proposed-2 through Proposed-10, where the number after the hyphen is the number of parallel executions t; for example, Proposed-3 denotes the method of the embodiment executed with 3 parallel executions.
Figure JPOXMLDOC01-appb-T000009
Figure JPOXMLDOC01-appb-T000010
Figure JPOXMLDOC01-appb-T000011
Figure JPOXMLDOC01-appb-T000012
 Looking at the execution results for Alarm-25 (Table 3), it can be seen that the embodiment of the present invention already achieved the minimum number of missing superstructure edges at parallel execution count t = 2. As the price for this, the embodiment produced several times more extra edges. However, in terms of superstructure learning time and the time for the exact-solution search by score, there is no noticeable difference between the embodiment of the present invention and the conventional methods. Looking at the exact-solution search score (BDeu), the method of the embodiment already achieved the maximum score at t = 2, giving the best result.
 Next, the results for Win95pts-25 (Table 4) show the same tendency as the execution results for Alarm-25 (Table 3).
 Next, the results for Water-25 (Table 5) are again similar. However, the degree of improvement in missing edges and the like is not very large, and increasing the number of parallel executions has almost no effect. This is thought to be because Water-25 is a network with many highly skewed probability parameters (see V.Skew Param in Table 2), so its causal relationships are nearly deterministic and it is a network that is difficult to learn in the first place.
 Finally, consider the Insurance results (Table 6). For Insurance, when the number of parallel executions of the method of the embodiment was 6 or more, the maximum degree sometimes exceeded 17, the upper limit of this experimental environment. Therefore, in the experiments using Insurance, after learning the superstructure, the adjacent variables of any variable whose degree exceeded 17 were pruned so that its degree was at most 17. The pruning of adjacent variables was performed by recording the conditional mutual information obtained during the conditional independence tests and keeping, as adjacent variables, at most 17 variables in descending order of conditional mutual information. The experimental results show that, as with Alarm-25 and Win95pts-25, the method of the embodiment improves in terms of missing edges and BDeu score. The large number of extra edges in the case of the method of the embodiment has almost no effect on the superstructure extraction time, but does affect the exact-solution search time through the increase in degree.
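The pruning step described above can be sketched as follows: the neighbours of an over-degree variable are ranked by the conditional mutual information recorded for them, and only the top maxDegree are kept. The method name and the map layout are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class NeighbourPruning {
    // Keeps at most maxDegree neighbours, in descending order of the conditional
    // mutual information recorded for them during the independence tests.
    static List<String> prune(Map<String, Double> cmiByNeighbour, int maxDegree) {
        List<String> ranked = new ArrayList<>(cmiByNeighbour.keySet());
        ranked.sort(Comparator.comparingDouble(cmiByNeighbour::get).reversed());
        return ranked.subList(0, Math.min(maxDegree, ranked.size()));
    }
}
```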
 The graph structures of the Bayesian networks used in the experiments are shown in FIGS. 15 to 18. Each vertex represents a random variable, and the number in parentheses at each vertex represents the number of states of that variable.
 The experimental results showed that the method of the present invention suppresses missing superstructure edges the most, and that the method of the present invention finds the highest-scoring Bayesian network in the exact-solution search of score learning using a superstructure.
 Although the present invention has been described herein with reference to specific embodiments, the embodiments described herein are not intended to limit the present invention but to illustrate it by way of example. It will be apparent to those skilled in the art that other alternative embodiments can be practiced without departing from the scope of the present invention.

Claims (3)

  1.  A program for extracting a superstructure from input data, the program causing a computer to execute:
     (A) a step of receiving, as input from a database, a data specification description file describing the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (B) a step of parsing the data specification description file and storing, in a storage unit, the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (C) a step of initializing the graph to be output as a complete undirected graph, initializing the separation set as an empty set, setting the order n of the conditional independence tests to 0, and storing these in the storage unit;
     (D) a step of executing, for every variable pair X, Y having an edge in the graph:
      (D1) a step of performing a conditional independence test Test(X; Y|S) for each variable set S with |S| = n and S ⊆ Z, where Z is the potential parent variable set of X and Y,
      (D2) a step of, when the result of the conditional independence test is Ind(X; Y|S), updating the separation set by Sep = Sep ∪ {<{X, Y}, S>} and storing it in the storage unit, and
      (D3) a step of deleting the edge between X and Y from the graph and storing the graph in the storage unit;
     (E) a step of, for each edge of an inferred v-structure present in the graph, determining the orientation of the edge according to the inferred v-structure when its orientation has not yet been determined;
     (F) a step of, when the edge is already oriented in the direction opposite to the inferred v-structure, determining that the edge is a collision edge, selecting the orientation of the collision edge according to a predetermined probability, and storing, in the storage unit, an edge-parent set whose elements are the edge and the set of parent variables of the edge;
     (G) a step of determining edge orientations according to orientation rules consistent with the DAG constraint;
     (H) a step of incrementing n;
     (I) a step of, for every variable pair X, Y of an edge remaining in the graph, returning to step (D) when the size of the potential parent variable set is n or more, and storing the obtained graph in the storage unit when the size of the potential parent variable set is less than n;
     (J) a step of repeating steps (C) to (I) a predetermined number of times; and
     (K) a step of outputting, as the superstructure, the union of the predetermined number of graphs so obtained.
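The control flow of steps (C) through (K) can be sketched in Python as follows. This is a minimal, non-authoritative illustration: the conditional independence test is abstracted behind a caller-supplied `ci_test` callable, the orientation steps (E) through (G) and the probabilistic collision-edge handling of step (F) are omitted, and all function and variable names are assumptions for illustration rather than part of the claimed program.

```python
import itertools
import random

def extract_superstructure(variables, ci_test, num_runs=10, seed=0):
    """Sketch of steps (C)-(K): repeat a randomized skeleton-learning pass a
    predetermined number of times (step (J)) and output the union of the
    resulting undirected graphs as the superstructure (step (K))."""
    rng = random.Random(seed)
    union_edges = set()
    for _ in range(num_runs):
        union_edges |= learn_skeleton(variables, ci_test, rng)
    return union_edges

def learn_skeleton(variables, ci_test, rng):
    # Step (C): start from the complete undirected graph, with an empty
    # separation set and conditioning-set size n = 0.
    edges = {frozenset(p) for p in itertools.combinations(variables, 2)}
    sep = {}
    n = 0
    while True:
        # Step (D): visit the remaining edges in a random order (the random
        # order is what makes repeated runs produce different graphs).
        pairs = sorted(edges, key=sorted)
        rng.shuffle(pairs)
        for pair in pairs:
            if pair not in edges:
                continue
            x, y = sorted(pair)
            z = potential_parents(x, y, edges)
            # Step (D1): test X and Y given every size-n subset S of Z.
            for s in itertools.combinations(sorted(z), n):
                if ci_test(x, y, set(s)):
                    sep[pair] = set(s)            # step (D2)
                    edges.discard(pair)           # step (D3)
                    break
        # Steps (E)-(G), i.e. v-structure orientation, the probabilistic
        # choice of collision-edge direction, and the DAG-consistent
        # orientation rules, are omitted from this sketch.
        n += 1                                    # step (H)
        # Step (I): continue only while some remaining pair still has a
        # potential parent set of size at least n.
        if not any(len(potential_parents(x2, y2, edges)) >= n
                   for x2, y2 in (sorted(p) for p in edges)):
            return edges

def potential_parents(x, y, edges):
    # Simplification of the claimed "potential parent variable set": every
    # variable currently adjacent to X or Y, excluding X and Y themselves.
    nbrs = set()
    for e in edges:
        if x in e or y in e:
            nbrs |= e
    return nbrs - {x, y}
```

In an actual implementation, `ci_test` would be a statistical test on the input data (for example, a G-squared or mutual-information-based test at a chosen significance level); the choice of test is not fixed by the claim.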
  2.  A computer-readable storage medium storing a program that, when executed by a computer, causes the computer to perform a method for extracting a superstructure from input data, the method comprising:
     (A) a step of receiving, as input from a database, a data specification description file describing the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (B) a step of parsing the data specification description file and storing, in a storage unit, the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (C) a step of initializing the graph to be output as a complete undirected graph, initializing the separation set as an empty set, setting the order n of the conditional independence tests to 0, and storing these in the storage unit;
     (D) a step of executing, for every variable pair X, Y having an edge in the graph:
      (D1) a step of performing a conditional independence test Test(X; Y|S) for each variable set S with |S| = n and S ⊆ Z, where Z is the potential parent variable set of X and Y,
      (D2) a step of, when the result of the conditional independence test is Ind(X; Y|S), updating the separation set by Sep = Sep ∪ {<{X, Y}, S>} and storing it in the storage unit, and
      (D3) a step of deleting the edge between X and Y from the graph and storing the graph in the storage unit;
     (E) a step of, for each edge of an inferred v-structure present in the graph, determining the orientation of the edge according to the inferred v-structure when its orientation has not yet been determined;
     (F) a step of, when the edge is already oriented in the direction opposite to the inferred v-structure, determining that the edge is a collision edge, selecting the orientation of the collision edge according to a predetermined probability, and storing, in the storage unit, an edge-parent set whose elements are the edge and the set of parent variables of the edge;
     (G) a step of determining edge orientations according to orientation rules consistent with the DAG constraint;
     (H) a step of incrementing n;
     (I) a step of, for every variable pair X, Y of an edge remaining in the graph, returning to step (D) when the size of the potential parent variable set is n or more, and storing the obtained graph in the storage unit when the size of the potential parent variable set is less than n;
     (J) a step of repeating steps (C) to (I) a predetermined number of times; and
     (K) a step of outputting, as the superstructure, the union of the predetermined number of graphs so obtained.
  3.  A computer-implemented method for extracting a superstructure from input data, the method comprising:
     (A) a step of receiving, as input from a database, a data specification description file describing the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (B) a step of parsing the data specification description file and storing, in a storage unit, the name of each random variable, the number of random variables, the state name of each random variable, and the names and number of the states each random variable can take;
     (C) a step of initializing the graph to be output as a complete undirected graph, initializing the separation set as an empty set, setting the order n of the conditional independence tests to 0, and storing these in the storage unit;
     (D) a step of executing, for every variable pair X, Y having an edge in the graph:
      (D1) a step of performing a conditional independence test Test(X; Y|S) for each variable set S with |S| = n and S ⊆ Z, where Z is the potential parent variable set of X and Y,
      (D2) a step of, when the result of the conditional independence test is Ind(X; Y|S), updating the separation set by Sep = Sep ∪ {<{X, Y}, S>} and storing it in the storage unit, and
      (D3) a step of deleting the edge between X and Y from the graph and storing the graph in the storage unit;
     (E) a step of, for each edge of an inferred v-structure present in the graph, determining the orientation of the edge according to the inferred v-structure when its orientation has not yet been determined;
     (F) a step of, when the edge is already oriented in the direction opposite to the inferred v-structure, determining that the edge is a collision edge, selecting the orientation of the collision edge according to a predetermined probability, and storing, in the storage unit, an edge-parent set whose elements are the edge and the set of parent variables of the edge;
     (G) a step of determining edge orientations according to orientation rules consistent with the DAG constraint;
     (H) a step of incrementing n;
     (I) a step of, for every variable pair X, Y of an edge remaining in the graph, returning to step (D) when the size of the potential parent variable set is n or more, and storing the obtained graph in the storage unit when the size of the potential parent variable set is less than n;
     (J) a step of repeating steps (C) to (I) a predetermined number of times; and
     (K) a step of outputting, as the superstructure, the union of the predetermined number of graphs so obtained.
PCT/JP2013/076245 2013-09-27 2013-09-27 Method and program for extraction of super-structure in structural learning of bayesian network WO2015045091A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/076245 WO2015045091A1 (en) 2013-09-27 2013-09-27 Method and program for extraction of super-structure in structural learning of bayesian network


Publications (1)

Publication Number Publication Date
WO2015045091A1 true WO2015045091A1 (en) 2015-04-02

Family

ID=52742295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/076245 WO2015045091A1 (en) 2013-09-27 2013-09-27 Method and program for extraction of super-structure in structural learning of bayesian network

Country Status (1)

Country Link
WO (1) WO2015045091A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011016281A2 (en) * 2009-08-06 2011-02-10 株式会社シーエーシー Information processing device and program for learning bayesian network structure
JP2013206016A (en) * 2012-03-28 2013-10-07 Sony Corp Information processor and information processing method and program


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019527413A (en) * 2016-07-07 2019-09-26 アスペン テクノロジー インコーポレイテッド Computer system and method for performing root cause analysis to build a predictive model of rare event occurrences in plant-wide operations
JP7461440B2 (en) 2016-07-07 2024-04-03 アスペンテック・コーポレーション COMPUTER SYSTEM AND METHOD FOR PERFORMING ROOT CAUSE ANALYSIS AND BUILDING PREDICTION MODELS FOR THE OCCURRENCE OF RARE EVENTS IN PLANT-WIDE OPERATIONS - Patent application
JP7422946B2 (en) 2020-07-02 2024-01-26 三菱電機株式会社 Automatic construction of neural network architecture using Bayesian graph search


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13894896

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13894896

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP