JP2021111141A

JP2021111141A - Feature network extraction device, computer program, feature network extraction method, and bayesian network analysis method

Info

Publication number: JP2021111141A
Application number: JP2020002923A
Authority: JP
Inventors: 恭史奥野; Yasushi Okuno; 嘉紀玉田; Yoshinori Tamada
Original assignee: Kyoto University NUC
Current assignee: Kyoto University NUC
Priority date: 2020-01-10
Filing date: 2020-01-10
Publication date: 2021-08-02
Anticipated expiration: 2040-01-10
Also published as: JP7478408B2

Abstract

PROBLEM TO BE SOLVED: To provide a feature network extraction device, a computer program, a feature network extraction method, and a Bayesian network analysis method capable of evaluating the relationships of a sample or a sample group in an estimated Bayesian network.
SOLUTION: In the feature network extraction device, a processor assigns data to required nodes of a Bayesian network that uses a directed acyclic graph to represent the dependency relationships among a plurality of nodes, each associated with a random variable. When the posterior probability of a node is calculated based on the assigned data, a calculation unit calculates a feature value of the link from a parent node to a child node based on a predetermined model constituting the conditional probability given the random variable of the parent node, and a feature network is extracted from the Bayesian network based on the calculated feature values.
SELECTED DRAWING: Figure 34

Description

The present invention relates to a feature network extraction device, a computer program, a feature network extraction method, and a Bayesian network analysis method.

A Bayesian network is a type of graphical model (a statistical model that uses a graph representation) and expresses the causal relationships among multiple variables as a network (a directed acyclic graph). Learning the structure of a Bayesian network from a large amount of data yields an estimated Bayesian network, from which the causal relationships among the variables can be inferred.

Patent Document 1 discloses a device with which a Bayesian network can be created easily. The user generates, on a GUI screen, label objects named after "expressions" that are candidates for node names or domain names, and then uses mouse operations to define the relationships between labels (causal relationships or propositional relationships) on the label objects arranged on the screen.

Patent Document 1: JP-A-2007-102717

However, even when a Bayesian network is estimated from a large amount of data, the estimated network only captures, among the relationships between variables latent in the data, those that are common across the data (for example, clusters of data). It cannot explain, for example, the relationships that hold for an individual sample or for a sample group.

The present invention has been made in view of these circumstances, and its object is to provide a feature network extraction device, a computer program, a feature network extraction method, and a Bayesian network analysis method capable of evaluating the relationships of a sample or a sample group in an estimated Bayesian network.

The present application includes a plurality of means for solving the above problem. As one example, a feature network extraction device comprises: a data assignment unit that assigns data to required nodes of a Bayesian network in which the dependency relationships among a plurality of nodes, each associated with a random variable, are represented by a directed acyclic graph; a calculation unit that, when the posterior probability of a node is calculated based on the data assigned by the data assignment unit, calculates a feature value of the link from a parent node to a child node based on a predetermined model constituting the conditional probability given the random variable of the parent node; and an extraction unit that extracts a feature network from the Bayesian network based on the feature value calculated by the calculation unit.

According to the present invention, a feature network that characterizes the relationships of a sample or a sample group in an estimated Bayesian network can be extracted, so that the sample or sample group can be evaluated in the estimated Bayesian network.

FIG. 1 is a block diagram showing an example of the configuration of the feature network extraction device of the present embodiment.
FIG. 2 is a schematic diagram showing an example of a Bayesian network.
FIG. 3 is a schematic diagram showing an example of a nonparametric regression model using B-splines.
FIG. 4 is a schematic diagram showing an example of a nonparametric Bayesian network.
FIG. 5 is a schematic diagram showing a first example of an edge feature value.
FIG. 6 is a schematic diagram showing a second example of an edge feature value.
FIG. 7 is a schematic diagram showing the concept of ΔECv for the edge from the parent node of variable X to the child node of variable Y.
FIG. 8 is a schematic diagram showing a third example of an edge feature value.
FIG. 9 is a schematic diagram showing a fourth example of an edge feature value.
FIG. 10 is a schematic diagram showing a first example of a feature network extraction method.
FIG. 11 is a schematic diagram showing a second example of a feature network extraction method.
FIG. 12 is a schematic diagram showing a third example of a feature network extraction method.
FIG. 13 is a schematic diagram showing a first example of an extracted feature network.
FIG. 14 is a schematic diagram showing a first example of characterizing an individual by a feature network.
FIG. 15 is a schematic diagram showing another configuration of the ECv matrix.
FIG. 16 is a schematic diagram showing a second example of characterizing an individual by a feature network.
FIG. 17 is a schematic diagram showing a second example of an extracted feature network.
FIG. 18 is a schematic diagram showing a third example of an extracted feature network.
FIG. 19 is a schematic diagram showing the mechanism by which an extracted feature network can capture immune system genes.
FIG. 20 is a schematic diagram showing a third example of characterizing an individual by a feature network.
FIG. 21 is a schematic diagram showing a fourth example of characterizing an individual by a feature network.
FIG. 22 is a schematic diagram showing a fifth example of characterizing an individual by a feature network.
FIG. 23 is a schematic diagram in which an extracted feature network is mapped onto the whole network.
FIG. 24 is a schematic diagram showing a first example of the relation between an extracted feature network and DEG genes.
FIG. 25 is a schematic diagram showing a second example of the relation between an extracted feature network and DEG genes.
FIG. 26 is a schematic diagram showing a fourth example of an extracted feature network.
FIG. 27 is a schematic diagram showing an example of extracted paths related to the onset of chronic kidney disease (CKD).
FIG. 28 is a schematic diagram showing an example of extracted paths related to the onset of hypertension.
FIG. 29 is a schematic diagram showing an example of a network related to the two diseases CKD and hypertension, with SNPs.
FIG. 30 is a schematic diagram showing an example of a network related to the two diseases CKD and hypertension, without SNPs.
FIG. 31 is a schematic diagram showing a first example of an individual network for the onset of chronic kidney disease (CKD).
FIG. 32 is a schematic diagram showing a second example of an individual network for the onset of chronic kidney disease (CKD).
FIG. 33 is a schematic diagram showing a third example of an individual network for the onset of chronic kidney disease (CKD).
FIG. 34 is a flowchart showing an example of the processing procedure of the feature network extraction device.
FIG. 35 is a flowchart showing an example of the feature network extraction process.

Hereinafter, the present invention will be described with reference to the drawings showing its embodiments. FIG. 1 is a block diagram showing an example of the configuration of the feature network extraction device 50 of the present embodiment. The feature network extraction device 50 includes a processor 51, an operation unit 52, an interface unit 53, a display panel 54, a recording medium reading unit 55, a ROM 56, a memory 57 (for example, RAM), and a storage unit 58. The storage unit 58 can store a previously estimated Bayesian network model 581 and sample data 582. The feature network extraction device 50 may be implemented as a single device or as a plurality of devices.

The processor 51 can be configured by combining hardware such as a CPU (for example, a multiprocessor with a plurality of processor cores), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array).

The display panel 54 can be a liquid crystal panel, an organic EL (electroluminescence) display, or the like.

The operation unit 52 consists of, for example, a hardware keyboard and a mouse, and is used to operate icons and other items displayed on the display panel 54 and to enter text. The operation unit 52 may instead be a touch panel.

The interface unit 53 can acquire sample data, an estimated Bayesian network model, and the like from an external device. The interface unit 53 has a wired communication function and a wireless communication function. Sample data and Bayesian network models acquired via the interface unit 53 can be stored in the storage unit 58.

The recording medium reading unit 55 can, for example, read a recording medium M on which a computer program defining the feature network extraction procedure is recorded, and store the program in the storage unit 58. The computer program defining the feature network extraction procedure may also be acquired from an external device via the interface unit 53.

The storage unit 58 can be a hard disk, a flash memory, or the like. The feature network is extracted by loading the computer program stored in the storage unit 58 into the memory 57 and executing it on the processor 51.

The processor 51 can function as a data assignment unit, a calculation unit, an extraction unit, and a setting unit.

Before describing how the feature network extraction device 50 extracts a feature network, an overview of Bayesian networks is given as background.

FIG. 2 is a schematic diagram showing an example of a Bayesian network. A Bayesian network is a type of graphical model (a statistical model that uses a graph representation) and expresses the causal relationships among multiple variables as a network (a directed acyclic graph). In the figure, each circle is a node associated with a random variable (also simply called a "variable"), and each arrow is an edge (also called a branch or link). Edges are directed as indicated by the arrows: the node at the tail of an arrow is called the parent node, and the node at the head is called the child node. The example of FIG. 2 shows six nodes corresponding to the variables $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$.

$\Pr(X_1, X_2, X_3, X_4, X_5, X_6)$ denotes the joint probability (distribution) of the variables $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$. By determining how this joint probability can be decomposed, that is, by searching for conditional independence, it can be written as a product of conditional probabilities of the form $\Pr(X_j \mid \mathrm{Pa}(X_j))$: $\Pr(X_1, \ldots, X_p) = \prod_{j=1}^{p} \Pr(X_j \mid \mathrm{Pa}(X_j))$. Here $j$ is the variable index, and $p = 6$ in the example of FIG. 2. $\mathrm{Pa}(X_j)$ is the set of variables corresponding to the parent nodes of $X_j$ in the network. For example, $\Pr(X_4 \mid X_1, X_2)$ is the conditional probability of $X_4$ given the values of $X_1$ and $X_2$.
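As a hedged illustration of this factorization, the following sketch evaluates the joint probability of a six-variable network as the product of the terms Pr(X_j | Pa(X_j)). The parent sets are an assumption read off the surrounding text (X_4 has parents X_1 and X_2, X_3 depends on X_1, X_5 and X_6 depend on X_3). The variables are assumed to be binary, and all probability values are invented for illustration only.

```python
from itertools import product

# Assumed parent sets, inferred from the description of FIG. 2.
PARENTS = {1: (), 2: (), 3: (1,), 4: (1, 2), 5: (3,), 6: (3,)}

# Hypothetical tables: Pr(X_j = 1 | parent values), keyed by the parent values.
P_ONE = {
    1: {(): 0.3},
    2: {(): 0.6},
    3: {(0,): 0.2, (1,): 0.7},
    4: {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},
    5: {(0,): 0.25, (1,): 0.8},
    6: {(0,): 0.35, (1,): 0.65},
}


def joint_probability(x):
    """Return Pr(X_1, ..., X_6) for an assignment x = {index: 0 or 1}."""
    prob = 1.0
    for j, parents in PARENTS.items():
        p_one = P_ONE[j][tuple(x[k] for k in parents)]
        prob *= p_one if x[j] == 1 else 1.0 - p_one
    return prob


# Sanity check: the factorized joint sums to 1 over all 2**6 assignments.
total = sum(
    joint_probability(dict(zip(range(1, 7), values)))
    for values in product((0, 1), repeat=6)
)
print(round(total, 12))
```

Each factor only needs the values of a node's parents, which is what makes the decomposition cheaper to store and evaluate than a full joint table.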

The example of FIG. 2 shows that the variables $X_3$ and $X_4$ are independent given $X_1$ (that is, they are independent on the condition that the value of $X_1$ is known). The variables $X_5$ and $X_6$ are likewise independent given $X_3$. Conditionally independent variables no longer show correlation once the conditioning variable is fixed to a specific value, and this can be regarded as a causal relationship. Because a Bayesian network is estimated from a large amount of data, it can estimate causal relationships between sets of data that share commonality.

FIG. 3 is a schematic diagram showing an example of a nonparametric regression model using B-splines. When the relationships between variables are nonlinear, the choice of model is important. Nonparametric regression is a technique for performing regression without assuming a specific functional form, for cases where the relationship between variables is unknown and does not follow a specific form such as a linear expression or a polynomial. The decomposition of the joint probability of the variables $X_1, X_2, \ldots, X_p$ is expressed as a decomposition into probability density functions $f(X_j \mid \mathrm{Pa}(X_j))$, and each $f(X_j \mid \mathrm{Pa}(X_j))$ can be constructed with a nonparametric regression model using B-splines. As shown in FIG. 3, if the parent nodes of the node of variable $X_4$ have the variables $X_1$ and $X_2$, the relationship $x_4 = m_1(x_1) + m_2(x_2) + \varepsilon$ holds between the data $x_4$ of variable $X_4$ and the data $x_1$, $x_2$ of variables $X_1$, $X_2$. Here $m_1$ and $m_2$ are smooth (nonlinear) functions, and $\varepsilon$ is the part that the model cannot express, also called the noise term. $N(0, \sigma^2)$ denotes the normal distribution with mean 0 and variance $\sigma^2$.

FIG. 4 is a schematic diagram showing an example of a nonparametric Bayesian network. A nonparametric Bayesian network uses the B-spline nonparametric regression model illustrated in FIG. 3 for the local probability distributions of the Bayesian network. Unlike a general Bayesian network such as the one illustrated in FIG. 2, a nonparametric Bayesian network can handle nonlinear continuous values.

As in FIG. 2, the example of FIG. 4 shows six nodes corresponding to the variables $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$. In the expressions shown in FIG. 4, $i$ is the sample index and $j$ is the variable index; in this example, $j = 1, 2, \ldots, 6$. $k$ is the index of a parent node. The function $m_{jk}$ is the function from parent node $k$ to node $j$, its child. In the expression for $m_{jk}$, $b_{lk}$ are $M_{jk}$ B-spline basis functions given in advance, and $\gamma_{lk}$ are the coefficient parameters for those basis functions, which are fixed once the nonparametric Bayesian network has been estimated. The basis functions are not limited to B-spline basis functions; other bases such as Fourier series, polynomial bases, regression spline bases, and wavelet bases may be used.
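The basis expansion described above can be sketched as follows. The exact expression appears only in FIG. 4, so the form m_jk(x) = Σ_l γ_l b_l(x) used here is an assumption based on the text. The knot vector, the degree, and the coefficients are hypothetical, and the basis functions are evaluated with the standard Cox-de Boor recursion rather than with any routine named in the source.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at scalar x.

    Uses the Cox-de Boor recursion and returns len(knots) - degree - 1 values.
    """
    t = knots
    # Degree 0: indicator of the half-open knot interval [t_i, t_{i+1}).
    basis = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(t) - 1 - d):
            value = 0.0
            left = t[i + d] - t[i]
            right = t[i + d + 1] - t[i + 1]
            if left > 0:  # skip terms with zero-width support (repeated knots)
                value += (x - t[i]) / left * basis[i]
            if right > 0:
                value += (t[i + d + 1] - x) / right * basis[i + 1]
            nxt.append(value)
        basis = nxt
    return basis


def m_jk(x, gamma, knots, degree=3):
    """Assumed form of the edge function: sum over l of gamma_l * b_l(x)."""
    return sum(g * b for g, b in zip(gamma, bspline_basis(x, knots, degree)))


# Hypothetical clamped cubic knot vector on [0, 1) giving 7 basis functions.
KNOTS = [0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]
# Hypothetical coefficients; in the source they are fixed by network estimation.
GAMMA = [0.0, 0.5, 1.5, 1.0, -0.5, 0.2, 0.0]

for x in (0.0, 0.3, 0.6, 0.9):
    print(x, m_jk(x, GAMMA, KNOTS))
```

Because the coefficients are fixed after estimation, evaluating an edge function for a new sample is only a weighted sum of basis values, which is what the feature values described later rely on.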

Next, the feature network extraction device 50 is described in detail. In the present embodiment, a nonparametric regression model is described as the predetermined model constituting the conditional probability given the random variable of the parent node, but the predetermined model is not limited to a nonparametric regression model. Similarly, a nonlinear function is described as the predetermined function representing the predetermined model that relates the random variable of the parent node to the random variable of the child node, but the predetermined function is not limited to a nonlinear function. In the present embodiment, the Bayesian network is assumed to be a nonparametric Bayesian network, and in the following a nonparametric Bayesian network is also referred to simply as a Bayesian network.

The feature network extraction device 50 (processor 51) can perform the following processes: assigning data to required nodes of a Bayesian network in which the dependency relationships among a plurality of nodes, each associated with a random variable, are represented by a directed acyclic graph; calculating, when the posterior probability of a node is computed from the assigned data, a feature value of the link from a parent node to a child node based on the nonparametric regression model constituting the conditional probability given the random variable of the parent node; and extracting a feature network from the Bayesian network based on the calculated feature values. Using the feature values, the feature network extraction device 50 of the present embodiment can extract a partial network of a previously estimated Bayesian network as a feature network. Each process is described below.

The processor 51 assigns, to the required nodes of the nonparametric Bayesian network, the data of the variable of each node. The required nodes can be chosen as appropriate depending on which data are used and which causal relationships between variables are to be obtained. Examples of variable data include, but are not limited to, measured values in electronic medical record data and health examination data (including data on medical procedures, test data, and data on pharmaceuticals) and gene-related data (gene expression data, epigenome data, proteome data, and genomic variation data such as SNPs (Single Nucleotide Polymorphisms) and CNVs (Copy Number Variations)). The data may be static data in which each sample is independent, such as samples from individuals, or dynamic time-series data, such as electronic medical record or health examination data recorded at periodic examinations, or time-series expression data under drug administration.

When calculating the posterior probability (joint probability) of other nodes based on the assigned data, the processor 51 calculates the feature value of the edge from a parent node to a child node based on the conditional probability given the random variable of the parent node. More specifically, the processor 51 calculates the feature value of the edge from the parent node to the child node based on the function value of the nonlinear function that represents the regression model relating the random variable of the parent node to the random variable of the child node.

Next, edge feature values (edge evaluation methods) are described. A feature value can be defined by an appropriate formula, as shown in FIGS. 5 to 9, and whichever of the defined feature values is most suitable can be used when extracting a feature network.

FIG. 5 is a schematic diagram showing a first example of an edge feature value. As shown in FIG. 5, the child node associated with variable $y$ has $q$ parent nodes, whose variables are $x_1, x_2, \ldots, x_q$. There are therefore $q$ edges (links), one between the child node and each parent node. Under the nonparametric regression model, the relationship $y = m_1(x_1) + m_2(x_2) + \cdots + m_q(x_q) + \varepsilon$ holds between $y$ and $x_1, x_2, \ldots, x_q$. The feature value of $x_j$ ($j = 1, \ldots, q$) with respect to $y$ is called the edge contribution value ECv and is defined as $\mathrm{ECv}(x_j \to y) = m_j(x_j)$. In other words, ECv is the function value of $m_j$, so the processor 51 can calculate the function value of the nonlinear function as the feature value of the edge. The ECv of a sample group consisting of a plurality of samples can be a statistic (for example, the mean or the median) of the ECv values of the individual samples.
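A minimal sketch of the ECv definition, assuming a child node with three parents. The functions in `M` stand in for fitted nonlinear functions m_j and are purely hypothetical, as are the parent names and sample values. The group-level value uses the mean by default, and the median can be passed instead.

```python
import math
from statistics import mean, median

# Hypothetical fitted functions m_j for a child y with q = 3 parents.
M = {
    "x1": lambda x: math.tanh(x),
    "x2": lambda x: 0.5 * x ** 2 - 1.0,
    "x3": lambda x: -0.8 * x,
}


def ecv(sample):
    """ECv(x_j -> y) = m_j(x_j) for every parent x_j, for one sample."""
    return {name: M[name](sample[name]) for name in M}


def group_ecv(samples, statistic=mean):
    """ECv of a sample group: a statistic of the per-sample ECv values."""
    per_sample = [ecv(s) for s in samples]
    return {name: statistic([e[name] for e in per_sample]) for name in M}


# Hypothetical parent values for two samples.
sample_a = {"x1": 0.0, "x2": 2.0, "x3": 1.0}
sample_b = {"x1": 0.0, "x2": 0.0, "x3": -1.0}

print(ecv(sample_a))
print(group_ecv([sample_a, sample_b]))
print(group_ecv([sample_a, sample_b], statistic=median))
```

Note that ECv keeps its sign, so two samples can contribute in opposite directions along the same edge and cancel in a group mean.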

FIG. 6 is a schematic diagram showing a second example of an edge feature value. As shown in FIG. 6, the child node associated with variable $y$ has $q$ parent nodes, whose variables are $x_1, x_2, \ldots, x_q$. As in FIG. 5, the relationship $y = m_1(x_1) + m_2(x_2) + \cdots + m_q(x_q) + \varepsilon$ holds between $y$ and $x_1, x_2, \ldots, x_q$. The feature value of $x_j$ ($j = 1, \ldots, q$) with respect to $y$ is here ΔECv. Writing the ECv values for the data of two samples A and B as $\mathrm{ECv}(x_j^A \to y^A)$ and $\mathrm{ECv}(x_j^B \to y^B)$, ΔECv is defined as $\Delta\mathrm{ECv}(x_j \to y, A, B) = \lvert \mathrm{ECv}(x_j^A \to y^A) - \mathrm{ECv}(x_j^B \to y^B) \rvert$. In other words, when data of different samples are assigned, the processor 51 can calculate, as the feature value of the edge, a comparison value between a first function value of the nonlinear function based on the data of the first sample and a second function value of the nonlinear function based on the data of the second sample.
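The ΔECv definition can be sketched for a single edge as below. The function `m_j` is a hypothetical stand-in for a fitted nonlinear function, and the sample values are invented. The example is chosen to show that, because the function is nonlinear, the same difference in the parent variable can yield very different ΔECv values.

```python
import math


def m_j(x):
    """Hypothetical fitted nonlinear function for the edge x_j -> y."""
    return math.tanh(2.0 * x)


def delta_ecv(x_a, x_b, m=m_j):
    """Delta-ECv(x_j -> y, A, B) = |ECv(x_j^A -> y^A) - ECv(x_j^B -> y^B)|."""
    return abs(m(x_a) - m(x_b))


# Two pairs of samples whose parent values differ by the same amount (0.5).
steep_region = delta_ecv(0.0, 0.5)  # where m_j changes quickly
flat_region = delta_ecv(2.0, 2.5)   # where m_j has saturated

print(steep_region, flat_region)
```

Either argument may also be a group-level value, for example the ECv of one individual compared against a value representing all samples, in line with the comparisons the text allows.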

FIG. 7 is a schematic diagram showing the concept of ΔECv for the edge from the parent node of variable X to the child node of variable Y. In the figure, the horizontal axis is the value of X and the vertical axis is the value of Y. The value of X can be continuous. The curve in the figure is the nonparametric regression model between X and Y and can be written as $Y = m_1^{(Y)}(X)$. In FIG. 7, the length of the arrow represents the ΔECv between two sample sets: a control sample group (for example, samples in which a specific symptom does not appear) and a target sample group (for example, samples in which the specific symptom appears). Although FIG. 7 illustrates ΔECv between two sample groups, ΔECv is not limited to comparisons between two sample groups. It may also be taken between one individual (a single sample) and another individual, or between an individual and the average over all samples.

FIG. 8 is a schematic diagram showing a third example of an edge feature value. As shown in FIG. 8, the child node associated with variable $y$ has $q$ parent nodes, whose variables are $x_1, x_2, \ldots, x_q$. As in FIG. 5, the relationship $y = m_1(x_1) + m_2(x_2) + \cdots + m_q(x_q) + \varepsilon$ holds between $y$ and $x_1, x_2, \ldots, x_q$. The feature value of $x_j$ ($j = 1, \ldots, q$) with respect to $y$ is here the relative contribution RC, defined as $\mathrm{RC}(x_j \to y) = \lvert m_j(x_j) \rvert / \max_{k} \lvert m_k(x_k) \rvert$, where the maximum is taken over $0 < k \le q$. RC takes a value from 0 to 1. In other words, the processor 51 can calculate, as the feature value of an edge, the ratio of the function value of the nonlinear function corresponding to that edge to the largest of the function values of the nonlinear functions that represent the regression models relating each parent node's random variable to the child node's random variable. The RC of a sample group consisting of a plurality of samples can be a statistic (for example, the mean or the median) of the RC values of the individual samples.
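A minimal sketch of the relative contribution RC for one sample, assuming the function values m_k(x_k) of all parents have already been evaluated. The parent names and values are hypothetical, and returning 0 when every function value is 0 is an assumption made here to avoid division by zero; the source does not address that case.

```python
def relative_contribution(m_values):
    """RC(x_j -> y) = |m_j(x_j)| / max_k |m_k(x_k)| for each parent x_j."""
    largest = max(abs(v) for v in m_values.values())
    if largest == 0.0:
        # Assumption: treat an all-zero set of contributions as RC = 0.
        return {name: 0.0 for name in m_values}
    return {name: abs(v) / largest for name, v in m_values.items()}


# Hypothetical function values m_k(x_k) for one sample with q = 3 parents.
m_values = {"x1": 0.4, "x2": -2.0, "x3": 1.0}
rc = relative_contribution(m_values)
print(rc)
```

The parent with the largest absolute function value always receives RC = 1, so RC ranks the parents of one child node against each other rather than measuring absolute effect size.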

FIG. 9 is a schematic diagram showing a fourth example of an edge feature value. As shown in FIG. 9, the child node associated with variable $y$ has $q$ parent nodes, whose variables are $x_1, x_2, \ldots, x_q$. As in FIG. 5, the relationship $y = m_1(x_1) + m_2(x_2) + \cdots + m_q(x_q) + \varepsilon$ holds between $y$ and $x_1, x_2, \ldots, x_q$. The feature value of $x_j$ ($j = 1, \ldots, q$) with respect to $y$ is here the relative contribution rate RCr, defined as $\mathrm{RCr}(x_j \to y) = \lvert m_j(x_j) \rvert / \sum_{k=1}^{q} \lvert m_k(x_k) \rvert$. RCr takes a value from 0 to 1. In other words, the processor 51 can calculate, as the feature value of an edge, the ratio of the function value of the nonlinear function corresponding to that edge to the sum of the function values of the nonlinear functions that represent the regression models relating each parent node's random variable to the child node's random variable. The RCr of a sample group consisting of a plurality of samples can be a statistic (for example, the mean or the median) of the RCr values of the individual samples.
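The relative contribution rate RCr, together with its group-level statistic, can be sketched as follows. The function values per sample are hypothetical, and returning 0 when the denominator is 0 is an assumption made here; the source does not address that case.

```python
from statistics import mean


def relative_contribution_rate(m_values):
    """RCr(x_j -> y) = |m_j(x_j)| / sum_k |m_k(x_k)| for each parent x_j."""
    total = sum(abs(v) for v in m_values.values())
    if total == 0.0:
        # Assumption: treat an all-zero set of contributions as RCr = 0.
        return {name: 0.0 for name in m_values}
    return {name: abs(v) / total for name, v in m_values.items()}


def group_rcr(samples, statistic=mean):
    """RCr of a sample group: a statistic of the per-sample RCr values."""
    per_sample = [relative_contribution_rate(s) for s in samples]
    return {name: statistic([r[name] for r in per_sample]) for name in samples[0]}


# Hypothetical function values m_k(x_k) for two samples with q = 3 parents.
sample_a = {"x1": 0.5, "x2": -1.5, "x3": 2.0}
sample_b = {"x1": 1.0, "x2": 1.0, "x3": -2.0}

rcr_a = relative_contribution_rate(sample_a)
rcr_group = group_rcr([sample_a, sample_b])
print(rcr_a)
print(rcr_group)
```

Unlike RC, the RCr values of all parents of one child sum to 1 for a sample with a nonzero denominator, so they can be read as shares of the total contribution.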

プロセッサ５１は、算出した特徴量に基づいてベイジアンネットワークから特徴ネットワークを抽出することができる。具体的には、プロセッサ５１は、枝の特徴量が所定の閾値以上である場合、当該枝を含む特徴ネットワークを抽出することができる。上述のように、特徴量としては、枝貢献量ＥＣｖ、ΔＥＣｖ、相対貢献度ＲＣ、相対貢献率ＲＣｒなどを用いることができる。また、閾値は、固定値である必要はなく、サンプルに応じて変更してもよく、データを付与する所要ノードを変更する際に変更してもよい。また、閾値は、上限値と下限値との組み合わせによって決定される所要範囲でもよい。特徴量によって、親ノード（例えば、変数ｘ₁、ｘ₂、…、ｘ_q）から子ノード（例えば、変数ｙ）への変数ｙを決めるモデル上の重要因子を定量化することができる。すなわち、特徴量を用いて特徴ネットワークを抽出することにより、予め推定されたベイジアンネットワーク（モデル）でのサンプル（例えば、個人や特定の疾患など）又はサンプル群についての関連性を示す複数の関連パスを抜き出すことができ、推定されたベイジアンネットワークでのサンプル又はサンプル群を評価することができる。 The processor 51 can extract a feature network from the Bayesian network based on the calculated feature amounts. Specifically, when the feature amount of a branch is equal to or greater than a predetermined threshold, the processor 51 can extract a feature network that includes that branch. As described above, the branch contribution amount ECv, ΔECv, the relative contribution RC, the relative contribution rate RCr, and the like can be used as the feature amount. The threshold need not be a fixed value: it may be changed according to the sample, or changed when the required nodes to which data are assigned are changed. The threshold may also be a required range determined by a combination of an upper limit and a lower limit. The feature amount quantifies which factors in the model are important in determining the variable y of the child node from the parent nodes (for example, variables x₁, x₂, ..., x_q). That is, by extracting a feature network using the feature amounts, a plurality of related paths showing the relevance of a sample (for example, an individual or a specific disease) or a sample group can be extracted from a pre-estimated Bayesian network (model), and the sample or sample group can be evaluated in the estimated Bayesian network.
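The threshold test described above can be sketched as follows. This is only an illustration: the function name, the edge labels, and the ECv values are hypothetical, and the optional upper limit stands in for the "required range" variant of the threshold.

```python
def extract_feature_edges(edge_features, lower, upper=None):
    """Keep branches whose feature amount is >= lower (and <= upper, if given)."""
    kept = {}
    for edge, value in edge_features.items():
        if value >= lower and (upper is None or value <= upper):
            kept[edge] = value
    return kept


# Hypothetical feature amounts (e.g. ECv) keyed by (parent, child).
ecv = {("x1", "y"): 0.9, ("x2", "y"): 0.2, ("y", "z"): 0.6}

feature_edges = extract_feature_edges(ecv, lower=0.5)            # simple threshold
ranged_edges = extract_feature_edges(ecv, lower=0.5, upper=0.8)  # required range
```

Because the threshold is an ordinary argument, it can differ per sample or per choice of observed nodes, as the description allows.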

次に、特徴ネットワークの抽出方法について説明する。図１０は特徴ネットワークの抽出方法の第１例を示す模式図である。左図のように、便宜上、推定されたベイジアンネットワークが、１５個のノードで構成されているとする。サンプルＡのデータを所要のノードの変数に付与して、ノードの変数の同時確率を算出する際に、枝の特徴量を算出する。図１０の例では、特徴量として枝貢献量ＥＣｖを用いたとする。各枝の枝貢献量ＥＣｖと閾値とを比較して、枝貢献量ＥＣｖが閾値以上である枝を太線で表す。この場合、インデックスが３、６、８、１１、１３の順で枝を特定することができ、特定した枝を繋ぐ特徴ネットワークを抽出することができる。変数のうち、インデックス８の変数が、注目したい因子の変数とすると、サンプルＡについて、注目したい因子との因果関係のある他の因子を特定することができる。なお、図１０の例では、特徴ネットワークが、１つのネットワークとして抽出されているが、独立の複数のネットワーク、すなわち、お互いに繋がりのない複数のネットワークとして抽出してもよい。 Next, a method of extracting the feature network will be described. FIG. 10 is a schematic diagram showing a first example of the feature network extraction method. As shown in the left figure, assume for convenience that the estimated Bayesian network consists of 15 nodes. The data of sample A are assigned to the variables of the required nodes, and the feature amount of each branch is calculated when the joint probability of the node variables is calculated. In the example of FIG. 10, the branch contribution amount ECv is used as the feature amount. The ECv of each branch is compared with the threshold, and the branches whose ECv is equal to or greater than the threshold are drawn as thick lines. In this case, the branches can be identified in the order of indexes 3, 6, 8, 11, and 13, and the feature network connecting the identified branches can be extracted. If the variable with index 8 is the variable of the factor of interest, other factors that have a causal relationship with the factor of interest can be identified for sample A. In the example of FIG. 10 the feature network is extracted as a single network, but it may also be extracted as a plurality of independent networks, that is, a plurality of networks that are not connected to each other.

図１１は特徴ネットワークの抽出方法の第２例を示す模式図である。図１０と同様に、便宜上、推定されたベイジアンネットワークが、１５個のノードで構成されているとする。サンプルＢのデータを所要のノードの変数に付与して、他のノードの変数の同時確率を算出する際に、枝の特徴量を算出する。図１１の例では、特徴量として枝貢献量ＥＣｖを用いたとする。各枝の枝貢献量ＥＣｖと閾値とを比較して、枝貢献量ＥＣｖが閾値以上である枝を太線で表す。この場合、インデックスが２、５、８、１０、１２、１５の順で枝を特定することができ、特定した枝を繋ぐ特徴ネットワークを抽出することができる。変数のうち、インデックス８の変数が、注目したい因子の変数とすると、サンプルＢについて、注目したい因子との因果関係のある他の因子を特定することができる。なお、図１０及び図１１については、注目したい因子を１つ図示しているが、注目したい因子は複数であってもよい。 FIG. 11 is a schematic diagram showing a second example of the feature network extraction method. As in FIG. 10, assume for convenience that the estimated Bayesian network consists of 15 nodes. The data of sample B are assigned to the variables of the required nodes, and the feature amount of each branch is calculated when the joint probability of the variables of the other nodes is calculated. In the example of FIG. 11, the branch contribution amount ECv is used as the feature amount. The ECv of each branch is compared with the threshold, and the branches whose ECv is equal to or greater than the threshold are drawn as thick lines. In this case, the branches can be identified in the order of indexes 2, 5, 8, 10, 12, and 15, and the feature network connecting the identified branches can be extracted. If the variable with index 8 is the variable of the factor of interest, other factors that have a causal relationship with the factor of interest can be identified for sample B. Although FIGS. 10 and 11 each show a single factor of interest, there may be a plurality of factors of interest.

図１１を図１０の場合と対比すると、サンプルＡとＢとでは、抽出される特徴ネットワークに相違がある。このように、サンプル（個人）毎の重要なパスウェイ（枝の繋がり）を抽出することができ、推定されたベイジアンネットワークでの重み付け個人ネットワークを抽出することができる。すなわち、推定されたベイジアンネットワークでのサンプル又はサンプル群を特徴付ける特徴ネットワークを抽出することができ、推定されたベイジアンネットワークでのサンプル又はサンプル群を評価することができ、個人の計測データの特徴づけ（説明）が可能となる。 Comparing FIG. 11 with FIG. 10 shows that the feature networks extracted for samples A and B differ. In this way, the important pathways (connections of branches) of each sample (individual) can be extracted, and a weighted individual network within the estimated Bayesian network can be obtained. That is, a feature network that characterizes a sample or sample group in the estimated Bayesian network can be extracted, the sample or sample group can be evaluated in the estimated Bayesian network, and the measurement data of an individual can be characterized (explained).

図１２は特徴ネットワークの抽出方法の第３例を示す模式図である。プロセッサ５１は、ベイジアンネットワークの所要の複数のノードを設定する。所要のノードの設定は、ユーザの指定に基づいて行うことができる。図１２の例では、設定ノードとして上流側のノードＳ１、下流側のノードＳ２が設定されている。 FIG. 12 is a schematic diagram showing a third example of the feature network extraction method. The processor 51 sets a plurality of required nodes of the Bayesian network. The required node settings can be made based on the user's specifications. In the example of FIG. 12, the upstream node S1 and the downstream node S2 are set as the setting nodes.

プロセッサ５１は、設定した一のノード（上流側のノードＳ１）から他のノード（下流側のノードＳ２）へ至る複数のパス（パスウェイ）それぞれを構成する１又は複数の枝全体の特徴量を算出する。プロセッサ５１は、複数のパスのうち、パスを構成する枝全体の特徴量が所定の閾値以上であるパスを含む特徴ネットワークを抽出する。 For each of a plurality of paths (pathways) from one set node (the upstream node S1) to another set node (the downstream node S2), the processor 51 calculates the feature amount of the whole of the one or more branches constituting that path. The processor 51 then extracts a feature network that includes, among the plurality of paths, those paths whose overall feature amount is equal to or greater than a predetermined threshold.

図１２の例では、設定ノードＳ１からノードＳ２までの６個の枝それぞれの相対貢献度ＲＣを、ＲＣ１、ＲＣ２、ＲＣ３、ＲＣ４、ＲＣ５、ＲＣ６とすると、６個の枝全体の特徴量Ｅは、（ＲＣ１・ＲＣ２・ＲＣ３・ＲＣ４・ＲＣ５・ＲＣ６）の６乗根で算出できる。他のパスについても同様に特徴量を算出することができる。仮に、（ＲＣ１・ＲＣ２・ＲＣ３・ＲＣ４・ＲＣ５・ＲＣ６）の６乗根が閾値以上であれば、設定ノードＳ１とノードＳ２とを繋ぐ特徴ネットワークとして、図１２の太線で示す枝群が抽出される。 In the example of FIG. 12, if the relative contributions RC of the six branches from the set node S1 to the node S2 are RC1, RC2, RC3, RC4, RC5, and RC6, the feature amount E of the six branches as a whole can be calculated as the sixth root of the product (RC1 · RC2 · RC3 · RC4 · RC5 · RC6), that is, their geometric mean. The feature amount can be calculated in the same way for the other paths. If the sixth root of (RC1 · RC2 · RC3 · RC4 · RC5 · RC6) is equal to or greater than the threshold, the group of branches drawn as thick lines in FIG. 12 is extracted as the feature network connecting the set node S1 and the node S2.
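The path-level feature amount of FIG. 12 can be sketched in Python as below. The graph, node names, and RC values are hypothetical. The exhaustive path enumeration is an assumption made for illustration (the description does not say how paths are enumerated), and it is only practical for small graphs.

```python
import math


def all_paths(children, source, target, prefix=None):
    """Yield every directed path from source to target in an acyclic graph."""
    prefix = (prefix or []) + [source]
    if source == target:
        yield prefix
        return
    for child in children.get(source, []):
        yield from all_paths(children, child, target, prefix)


def path_feature(path, rc):
    """n-th root of the product of the RC values of the n branches on the path."""
    edges = list(zip(path, path[1:]))  # assumes the path has at least one branch
    product = math.prod(rc[e] for e in edges)
    return product ** (1.0 / len(edges))


def extract_paths(children, rc, source, target, threshold):
    """Keep the S1 -> S2 paths whose overall feature amount is >= threshold."""
    return [p for p in all_paths(children, source, target)
            if path_feature(p, rc) >= threshold]


# Hypothetical toy network with two routes from S1 to S2.
children = {"S1": ["a", "b"], "a": ["S2"], "b": ["S2"]}
rc = {("S1", "a"): 0.9, ("a", "S2"): 0.4, ("S1", "b"): 0.2, ("b", "S2"): 0.2}
kept = extract_paths(children, rc, "S1", "S2", threshold=0.5)
```

Taking the n-th root makes paths of different lengths comparable, because a plain product would penalize longer paths regardless of how strong each branch is.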

次に、前述の抽出方法を用いることにより、抽出された特徴ネットワークの例について説明する。図１３は抽出された特徴ネットワークの第１例を示す模式図である。左側に示す、推定されたネットワークは、ＥＭＴ遺伝子ネットワークの例であり、例えば、ノード（変数）の数は約２万、枝数は約３０万程度である。ＥＭＴは上皮間葉転換（Epithelial to mesenchymal transition）であり、上皮細胞がＥＭＴ化すると、癌細胞から離れて移動能を持ち、血中に入って転移を起こす。ＥＭＴに関連するタンパク質は、癌のバイオマーカーとして注目されている。ＥＭＴ遺伝子ネットワークは、ＥＭＴ化した細胞とＥＭＴ化していない細胞を表すネットワークである。右側に示す、特徴ネットワークは、枝の特徴量としてΔＥＣｖを用いて、推定されたネットワークから抽出したものである。具体的には、ＥＭＴ化した細胞とＥＭＴ化していない細胞との間のΔＥＣｖを計算し、計算したΔＥＣｖが所定の閾値以上である枝を特定し、特定した枝で構成される特徴ネットワークを抽出している。特徴ネットワークのノード数は約１５０であり、枝数は約１２０である。 Next, examples of feature networks extracted by the above extraction method will be described. FIG. 13 is a schematic diagram showing a first example of an extracted feature network. The estimated network shown on the left is an example of an EMT gene network; for example, it has about 20,000 nodes (variables) and about 300,000 branches. EMT stands for epithelial to mesenchymal transition: when epithelial cells undergo EMT, they detach from the cancer cells, acquire motility, and enter the bloodstream to cause metastasis. EMT-related proteins are attracting attention as cancer biomarkers. The EMT gene network is a network representing cells that have undergone EMT and cells that have not. The feature network shown on the right is extracted from the estimated network using ΔECv as the feature amount of the branches. Specifically, ΔECv between cells that have undergone EMT and cells that have not is calculated, the branches whose ΔECv is equal to or greater than a predetermined threshold are identified, and the feature network composed of the identified branches is extracted. The feature network has about 150 nodes and about 120 branches.

図１４は特徴ネットワークによる個人の特徴付けの第１例を示す模式図である。図１４（Ａ）は、ＥＣｖ行列と称し、各行が特徴ネットワークの枝（枝のインデックス）を示し、各列がサンプル（個人）のＥＣｖを示す。行列の各要素が各サンプルの各枝でのＥＣｖを表す。ここでの各サンプルは、がん患者の公開データベースの肺がん患者サンプルデータを用いており、ベイジアンネットワークの推定および特徴ネットワークの抽出には用いていないものであっても良い。このＥＣｖ行列に対して値が近いサンプルを纏めていくクラスタリング手法によって、サンプル群をクラスタ（group1）、（group2）という２つのクラスタに分類することができる。 FIG. 14 is a schematic diagram showing a first example of characterization of an individual by a feature network. FIG. 14A is referred to as an ECv matrix, where each row shows a branch (branch index) of the feature network and each column shows a sample (individual) ECv. Each element of the matrix represents an ECv at each branch of each sample. Each sample here uses lung cancer patient sample data from a public database of cancer patients, and may not be used for estimating the Bayesian network and extracting the feature network. The sample group can be classified into two clusters, cluster (group1) and (group2), by a clustering method that collects samples whose values are close to the ECv matrix.

２つに分けられた各クラスタに対して、上記がん患者の公開データベースの肺がんデータに含まれる生存時間データを当てはめた生存時間曲線が図１４（Ｂ）である。図１４（Ｂ）が示すように、一方のクラスタに属する患者の生存時間は比較的長く、他方のクラスタに属する患者の生存時間は比較的短いという結果が得られた。すなわち、２つのクラスタで予後（生存時間）に大きな差が出ることが実証された。このように、特徴ネットワークにより、個人ごとのデータの特徴付け、分類が可能となる。 FIG. 14 (B) shows a survival time curve to which the survival time data included in the lung cancer data of the public database of cancer patients is applied to each of the two clusters. As shown in FIG. 14 (B), the results showed that the survival time of the patients belonging to one cluster was relatively long, and the survival time of the patients belonging to the other cluster was relatively short. That is, it was demonstrated that there is a large difference in prognosis (survival time) between the two clusters. In this way, the feature network enables the characterization and classification of individual data.

図１５はＥＣｖ行列の他の構成を示す模式図である。各行が特徴ネットワークの枝（枝のインデクス）を示し、各列がサブタイプ間毎のサンプル（個人）のＥＣｖを示す。サブタイプは、例えば、胃がんの分子サブタイプのような、ある特定の癌について、さらに細かく分類いたものとすることができる。図では、サブタイプＴ１、Ｔ２、Ｔ３のように図示しているが、例えば、ＣＩＮ（Chromosomal Instability）、ＭＳＩ（Microsatellite Instability）、ＥＢＶ（Epstein Barr Virus）、ＧＳ（Genomically Stable）などとすることができる。図中、模様を付した部分が２つのサブタイプの組み合わせでΔＥＣｖが閾値以上の枝を表す。 FIG. 15 is a schematic diagram showing another configuration of the ECv matrix. Each row represents a branch (branch index) of the feature network, and each column represents the ECv of the samples (individuals) for each subtype comparison. A subtype can be a finer classification of a particular cancer, such as the molecular subtypes of gastric cancer. The figure labels the subtypes T1, T2, and T3, but they can be, for example, CIN (Chromosomal Instability), MSI (Microsatellite Instability), EBV (Epstein Barr Virus), and GS (Genomically Stable). In the figure, the patterned cells mark the branches whose ΔECv is equal to or greater than the threshold for a given combination of two subtypes.

サブタイプ間ごとのΔＥＣｖは、例えば、以下のようにして求めることができる。すなわち、まず、公開されている胃がん患者の遺伝子発現データに基づいて遺伝子ネットワークを推定する。次に、サンプルごとに全ての枝のＥＣｖを算出する。そして文献により定義された胃がんの４つのサブタイプ（ＣＩＮ、ＭＳＩ、ＥＢＶ、ＧＳ）毎に、各サンプルのＥＣｖの平均値を算出し、そのサブタイプ毎の差を取ることにより、２つのサブタイプ間のΔＥＣｖを算出することができる。ここまでは上記のＥＭＴ化しているサンプルとＥＭＴ化していないサンプルとの比較、つまり二群の比較によるΔＥＣｖの算出方法と同様である。４つのサブタイプがある胃がんデータでは他群での特徴ネットワークが必要である。これは例えば１つのサブタイプに対して、他の３つのサブタイプそれぞれとの間でΔＥＣｖが閾値より大きな枝を求め、その枝の和集合または積集合を取ることによって可能である。これによりサブタイプ毎の特徴ネットワークを抽出することができる。また単純に１つのサブタイプに対して他の３つのサブタイプを１つの大きなサブタイプとみなして二群比較することで４つのサブタイプ毎の特徴ネットワークを抽出することもできる。 ΔECv between subtypes can be obtained, for example, as follows. First, a gene network is estimated from publicly available gene expression data of gastric cancer patients. Next, the ECv of every branch is calculated for each sample. Then, for each of the four gastric cancer subtypes defined in the literature (CIN, MSI, EBV, GS), the mean ECv of its samples is calculated, and the difference between the means of two subtypes gives the ΔECv between those subtypes. Up to this point the procedure is the same as the two-group ΔECv calculation described above for samples that have undergone EMT and samples that have not. Gastric cancer data with four subtypes, however, require a feature network that covers comparisons with more than one other group. This is possible, for example, by finding, for one subtype, the branches whose ΔECv against each of the other three subtypes is greater than the threshold, and then taking the union or the intersection of those branch sets. A feature network can thus be extracted for each subtype. Alternatively, a feature network for each of the four subtypes can be extracted more simply by treating the other three subtypes as one large subtype and making a two-group comparison.
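The multi-group procedure described above can be sketched as follows. All names and numbers are hypothetical toy values, not data from the study, and using the absolute difference of group means as ΔECv is an assumption, since the text only says that the difference is taken.

```python
from statistics import mean


def group_mean_ecv(ecv_by_sample, groups):
    """Mean ECv of every branch within each group (subtype)."""
    edges = list(next(iter(ecv_by_sample.values())))
    return {g: {e: mean(ecv_by_sample[s][e] for s in members) for e in edges}
            for g, members in groups.items()}


def pairwise_edges(means, target, other, threshold):
    """Branches whose delta-ECv between two groups exceeds the threshold."""
    return {e for e in means[target]
            if abs(means[target][e] - means[other][e]) > threshold}


def subtype_feature_edges(means, target, threshold, combine="intersection"):
    """Combine the pairwise branch sets of one subtype against all the others."""
    sets = [pairwise_edges(means, target, other, threshold)
            for other in means if other != target]
    if not sets:
        return set()
    return set.intersection(*sets) if combine == "intersection" else set.union(*sets)


# Hypothetical per-sample ECv values for two branches.
ecv_by_sample = {
    "s1": {"e1": 2.0, "e2": 0.1},
    "s2": {"e1": 2.2, "e2": 0.3},
    "s3": {"e1": 0.5, "e2": 0.2},
    "s4": {"e1": 0.4, "e2": 1.5},
}
groups = {"EBV": ["s1", "s2"], "CIN": ["s3"], "MSI": ["s4"]}

means = group_mean_ecv(ecv_by_sample, groups)
common = subtype_feature_edges(means, "EBV", 0.5, combine="intersection")
either = subtype_feature_edges(means, "EBV", 0.5, combine="union")
```

The intersection keeps only branches that separate the target subtype from every other subtype, while the union also keeps branches that separate it from at least one of them.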

図１６は特徴ネットワークによる個人の特徴付けの第２例を示す模式図である。図１４と同様、図１６（Ａ）は、ＥＣｖ行列と称し、各行が特徴ネットワークの枝（枝のインデックス）を示し、各列がサンプル（個人）のＥＣｖを示す。行列の各要素が各サンプルの各枝でのＥＣｖを表す。ここでの各サンプルは、がん患者の公開データベースの胃がん患者サンプルデータを用いており、４種類のサブタイプが含まれる。すなわち、ＥＢＶは、ＥＢウイルス陽性を示し、ＭＳＩはマイクロサテライト領域の高頻度変異を示し、ＣＩＮは体細胞コピー数異常を示し、ＧＳはそれら以外を示す。図１６（Ａ）は、公開データベースのデータで遺伝子ネットワーク推定をして、上位枝のＥＣｖ行列をクラスタリングすることにより、４つのカテゴリに分類することができ、大まかには既存研究の４種類のサブタイプと対応付けが可能であることを示す。また、図１６（Ｂ）に示すように、group1は、他のgroupとの間で生存時間に差があることを見出すことができる。 FIG. 16 is a schematic diagram showing a second example of characterizing individuals with a feature network. As in FIG. 14, FIG. 16(A) is referred to as an ECv matrix: each row represents a branch (branch index) of the feature network, and each column represents the ECv of a sample (individual). Each element of the matrix is the ECv of one sample at one branch. The samples here are gastric cancer patient data from a public database of cancer patients and include four subtypes: EBV indicates EB virus positivity, MSI indicates high-frequency mutation of microsatellite regions, CIN indicates somatic copy number aberrations, and GS indicates everything else. FIG. 16(A) shows that, by estimating a gene network from the public database data and clustering the ECv matrix of the top-ranked branches, the samples can be classified into four categories that roughly correspond to the four subtypes from existing research. Further, as shown in FIG. 16(B), group1 is found to differ in survival time from the other groups.

図１７は抽出された特徴ネットワークの第２例を示す模式図である。図１７では、４つのサブタイプのうち、ＥＢＶに対して、その他のサブタイプ（ＣＩＮ、ＧＳ、ＭＳＩ）それぞれとのΔＥＣｖで抽出した枝（ΔＥＣｖの抽出の閾値は、例えば、０．５とすることができる）のうち共通部分（二群差の共通枝）をとる、という方法で抽出した特徴ネットワークである。 FIG. 17 is a schematic diagram showing a second example of an extracted feature network. In FIG. 17, among the four subtypes, branches are first extracted by the ΔECv between EBV and each of the other subtypes (CIN, GS, MSI), with an extraction threshold of, for example, 0.5, and the feature network is then obtained by taking the intersection of those branch sets, that is, the branches common to all of the two-group differences.

図１８は抽出された特徴ネットワークの第３例を示す模式図である。図１８では、４つのサブタイプのうち、ＥＢＶに対して、他の３つのサブタイプ（ＣＩＮ、ＧＳ、ＭＳＩ）のＥＣｖを平均とのΔＥＣｖで抽出した枝（ΔＥＣｖの抽出の閾値は、例えば、０．５とすることができる）によって抽出した特徴ネットワークである。 FIG. 18 is a schematic diagram showing a third example of an extracted feature network. In FIG. 18, among the four subtypes, the feature network is built from the branches extracted by the ΔECv between EBV and the average ECv of the other three subtypes (CIN, GS, MSI), with an extraction threshold of, for example, 0.5.

このように、サブタイプ毎のネットワーク推定を行う必要がなく、サブタイプ毎のネットワークの構造を比較する必要がない。ＥＣｖによる比較により、ネットワークの構造比較なしで、１つの遺伝子ネットワークからサブタイプの特徴的な枝を抽出することができる。また、ネットワークの構造比較が不要であるので、特定のサブタイプのサンプル数が少なく、構造比較ができない場合でも、サブタイプの特徴的な枝を抽出することができる。上述のように、がんサブタイプ毎のメカニズムの違いを抽出することが可能となる。 In this way, it is not necessary to estimate the network for each subtype, and it is not necessary to compare the network structures for each subtype. ECv comparison allows the extraction of characteristic branches of subtypes from a single gene network without network structural comparison. Moreover, since it is not necessary to compare the structure of the network, it is possible to extract the characteristic branches of the subtype even when the number of samples of the specific subtype is small and the structure cannot be compared. As described above, it is possible to extract the difference in mechanism for each cancer subtype.

図１９は抽出された特徴ネットワークにより免疫系の遺伝子を捉えることができるメカニズムを示す模式図である。ピロリ菌などのＥＢウイルス感染により、サイトカイン（Cytokine）が受容体を介して働き、免疫系が動く。この場合、ＥＢウイルス感染によって動くと考えられる免疫系に関する遺伝子が、ＥＣｖに基づいて抽出された特徴ネットワークに含まれていることが判明した。すなわち、ウイルス感染により免疫系が動き、既知の遺伝子セット（molecular signature）との構造比較で得られたシグナル伝達系の構造変化を、特徴ネットワークにより推定することができる可能性を示唆している。 FIG. 19 is a schematic diagram showing a mechanism by which genes of the immune system can be captured by the extracted feature network. Due to EB virus infection such as Helicobacter pylori, cytokines (Cytokine) work via receptors to move the immune system. In this case, it was found that genes related to the immune system, which are thought to be driven by EB virus infection, are contained in the feature network extracted based on ECv. That is, it is suggested that the immune system is moved by viral infection, and the structural change of the signal transduction system obtained by structural comparison with a known molecular signature can be estimated by the feature network.

図２０は特徴ネットワークによる個人の特徴付けの第３例を示す模式図である。図１４と同様、図２０は、ＥＣｖ行列と称し、各行が特徴ネットワークの枝（枝のインデックス）を示し、各列がサンプル（個人）のＥＣｖを示す。行列の各要素が各サンプルの各枝でのＥＣｖを表す。ここでの各サンプルは、ＴＣＧＡ（The Cancer Genome Atlas）のすい臓がん患者のデータを用いている。すい臓がん１５３患者のサンプルから予め予後の確実に良い１４サンプルと、悪い１４サンプルを決定する。予後の良悪２群それぞれのＥＣｖの平均値の差が大きい枝を抽出し、そのＥＣｖの値で全２８サンプルのＥＣｖ行列のクラスタリングを行う。図において、各列のうち暗くマーキングしているサンプルは予後が良い１４サンプルであり、明るくマーキングしているサンプルは予後が悪い１４サンプルである。枝を抽出する際のΔＥＣｖの閾値は１．０である。 FIG. 20 is a schematic diagram showing a third example of characterizing individuals with a feature network. As in FIG. 14, FIG. 20 is referred to as an ECv matrix: each row represents a branch (branch index) of the feature network, and each column represents the ECv of a sample (individual). Each element of the matrix is the ECv of one sample at one branch. The samples here are data of pancreatic cancer patients from TCGA (The Cancer Genome Atlas). From the samples of 153 pancreatic cancer patients, 14 samples with a reliably good prognosis and 14 samples with a poor prognosis are selected in advance. Branches with a large difference between the mean ECv of the good-prognosis group and that of the poor-prognosis group are extracted, and the ECv matrix of all 28 samples is clustered using the ECv values of those branches. In the figure, the columns marked dark are the 14 samples with a good prognosis, and the columns marked light are the 14 samples with a poor prognosis. The ΔECv threshold used to extract the branches is 1.0.

図２１は特徴ネットワークによる個人の特徴付けの第４例を示す模式図である。図２１では、枝を抽出する際のΔＥＣｖの閾値は０．７５である。閾値以外は、図２０の場合と同様である。 FIG. 21 is a schematic diagram showing a fourth example of characterization of an individual by a feature network. In FIG. 21, the threshold value of ΔECv when extracting the branches is 0.75. Except for the threshold value, it is the same as in FIG. 20.

図２２は特徴ネットワークによる個人の特徴付けの第５例を示す模式図である。図２２では、２８サンプルから１５３サンプルに拡大してクラスタリングを行った結果を示す。枝を抽出する際のΔＥＣｖの閾値は０．７５である。図２０〜図２２に示すように、良群と悪群にほぼ分かれることが示されている。 FIG. 22 is a schematic diagram showing a fifth example of characterization of an individual by a feature network. FIG. 22 shows the result of clustering by expanding from 28 samples to 153 samples. The threshold value of ΔECv when extracting branches is 0.75. As shown in FIGS. 20 to 22, it is shown that the group is roughly divided into a good group and a bad group.

図２３は抽出された特徴ネットワークを全体のネットワークにマッピングした模式図である。図において、濃くマーキングしている部分は特徴ネットワークを示す。 FIG. 23 is a schematic diagram in which the extracted feature network is mapped to the entire network. In the figure, the darkly marked part indicates the feature network.

遺伝子ネットワーク解析には、ＤＥＧ（Differentially expressed genes）、すなわち発現差のある遺伝子を抽出する手法が用いられている。以下では、当該手法と本実施の形態による特徴ネットワークとの関連性について説明する。 For gene network analysis, DEG (Differentially expressed genes), that is, a method for extracting genes having different expression differences is used. Hereinafter, the relationship between the method and the feature network according to the present embodiment will be described.

図２４は抽出された特徴ネットワークとＤＥＧ遺伝子との関連の第１例を示す模式図である。図では、枝を抽出するΔＥＣｖの閾値を１．０として、抽出された特徴ネットワークを示す。Ｔｏｐ２０ＤＥＧ遺伝子は、良悪２群で発現差が大きいもの（例えば、foldchangeとして差が１以上）であり、２０個存在する。２０個のＤＥＧ遺伝子のうち、特徴ネットワークから距離が所定値（例えば、１）以内のものは、５個存在し（丸印付き）、当該５個のＤＥＧ遺伝子は、特徴ネットワークの下流方向にあることが分かる。 FIG. 24 is a schematic diagram showing a first example of the relationship between the extracted feature network and DEG genes. The figure shows the feature network extracted with a ΔECv threshold of 1.0. The Top20 DEG genes are the 20 genes with a large expression difference between the good-prognosis and poor-prognosis groups (for example, a fold change difference of 1 or more). Of these 20 DEG genes, five (circled) lie within a predetermined distance (for example, 1) of the feature network, and all five are located downstream of the feature network.

図２５は抽出された特徴ネットワークとＤＥＧ遺伝子との関連の第２例を示す模式図である。図では、枝を抽出するΔＥＣｖの閾値を０．７５として、抽出された特徴ネットワークを示す。２０個のＤＥＧ遺伝子のうち、特徴ネットワークから距離が所定値（例えば、１）以内のものは、１３個存在し、そのうちの１０個のＤＥＧ遺伝子は、特徴ネットワークの下流方向にあることが分かる。図２４及び図２５から、発現差のある遺伝子を抽出する遺伝子ネットワーク解析手法によって得られる遺伝子は、特徴ネットワークの下流に位置し、特徴ネットワークの違いから生み出された差が、個々の遺伝子の発現差として推定することができると考えられる。 FIG. 25 is a schematic diagram showing a second example of the relationship between the extracted feature network and the DEG gene. In the figure, the extracted feature network is shown with the threshold value of ΔECv for extracting branches set to 0.75. It can be seen that 13 of the 20 DEG genes are within a predetermined value (for example, 1) from the feature network, and 10 of these DEG genes are in the downstream direction of the feature network. From FIGS. 24 and 25, the genes obtained by the gene network analysis method for extracting genes with different expression are located downstream of the characteristic network, and the difference created from the difference in the characteristic network is the difference in the expression of each gene. It is considered that it can be estimated as.
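One way to check which DEG genes lie near or downstream of a feature network is a bounded breadth-first search, sketched below. The interpretation is an assumption: "distance" is taken as the number of branches ignoring direction, and "downstream" as reachable along the branch direction within the same distance. The function names and the toy graph are hypothetical.

```python
from collections import deque


def within_distance(adj, sources, max_dist):
    """Breadth-first search; return {node: hops} for nodes within max_dist."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the distance limit
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def classify_degs(edges, feature_nodes, degs, max_dist=1):
    """Flag each DEG as near (any direction) and/or downstream (along branches)."""
    directed, undirected = {}, {}
    for parent, child in edges:
        directed.setdefault(parent, set()).add(child)
        undirected.setdefault(parent, set()).add(child)
        undirected.setdefault(child, set()).add(parent)
    near = within_distance(undirected, feature_nodes, max_dist)
    downstream = within_distance(directed, feature_nodes, max_dist)
    return {g: {"near": g in near, "downstream": g in downstream} for g in degs}


# Hypothetical toy graph: f1 is a feature-network node, g1..g3 are DEGs.
edges = [("f1", "g1"), ("g2", "f1"), ("g1", "g3")]
result = classify_degs(edges, {"f1"}, ["g1", "g2", "g3"], max_dist=1)
```

In this toy graph g1 is both near and downstream, g2 is near but upstream, and g3 is two branches away and therefore outside the distance limit of 1.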

図２６は抽出された特徴ネットワークの第４例を示す模式図である。図示していないが、ある地域の住民を対象とした健康調査データを用い、複数の重要疾患を定義し、単一のベイジアンネットワークを推定する。図２６は、推定されたネットワークから、被験者Ａと被験者Ｂのデータを用いて枝の特徴量としてＥＣｖを算出し、算出したＥＣｖが所定の閾値以上の枝を抽出して特徴ネットワークを抽出したものである。カテゴリは、例えば、年齢、性別、社会背景、生活習慣、健康調査の検査値、遺伝子情報などを含む。図２６から、２人の被験者それぞれの疾患羅患が何であり、共通の疾患が何であるかが分かる。 FIG. 26 is a schematic diagram showing a fourth example of an extracted feature network. Although not shown, health survey data of the residents of a certain region are used to define a plurality of important diseases and to estimate a single Bayesian network. FIG. 26 shows the feature networks obtained from the estimated network by calculating ECv as the branch feature amount from the data of subject A and subject B and extracting the branches whose ECv is equal to or greater than a predetermined threshold. The categories include, for example, age, sex, social background, lifestyle habits, health survey test values, and genetic information. FIG. 26 shows which diseases each of the two subjects has and which diseases they have in common.

次に、本発明の利用形態について具体例を挙げて説明する。市などの自治体や、健康保険組合に属する企業では、住民や社員などの健康維持や疾患の早期発見などを目指して健康診断を実施している。このような健康診断の結果、多数の健康調査データを収集することができる。また、病院や診療所においても、患者を診察又は治療する際に、患者のデータを収集することができる。本発明の特徴ネットワーク抽出方法を用いることにより、住民、社員、患者などの多数のサンプル又はサンプル群の関係性を評価することができる。 Next, a specific example of the usage pattern of the present invention will be described. Municipalities such as cities and companies belonging to health insurance associations carry out health examinations with the aim of maintaining the health of residents and employees and early detection of illnesses. As a result of such a health examination, a large number of health survey data can be collected. In addition, hospitals and clinics can also collect patient data when examining or treating patients. By using the feature network extraction method of the present invention, it is possible to evaluate the relationship between a large number of samples or sample groups such as residents, employees, and patients.

以下では、弘前ＣＯＩ（センター・オブ・イノベーション）で計測された健診データ（２０１４年〜２０１７年の４年間、７２７名分のデータ）から推定されたベイジアンネットワークを解析しやすいように既存のノード縮約を行い、特徴ネットワークを抽出し、所望の疾患ごと及び個人ごとの因果関係（関連パス）を抜き出した例を示す。なお、推定されたベイジアンネットワークが一般的な離散モデルである場合、１−hot化という機械学習などで用いられている前処理を行って、連続型ベイジアンネットワークに適用することができる。 The following shows an example in which a Bayesian network estimated from health checkup data measured by the Hirosaki COI (Center of Innovation) (data for 727 people over the four years from 2014 to 2017) is simplified by an existing node contraction technique to make it easier to analyze, a feature network is extracted, and the causal relationships (related paths) for each desired disease and each individual are extracted. When the estimated Bayesian network is a general discrete model, the method can be applied to a continuous Bayesian network after one-hot encoding, a preprocessing step used in machine learning and similar fields.
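As a reminder of what one-hot encoding does to a discrete variable, here is a minimal sketch. The variable (a smoking-status answer) and its categories are hypothetical and are not taken from the checkup data.

```python
def one_hot(values, categories=None):
    """Encode each discrete value as a 0/1 vector with one position per category."""
    cats = categories if categories is not None else sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in cats] for v in values]
    return encoded, cats


# Hypothetical discrete answers for four individuals.
answers = ["never", "current", "former", "never"]
encoded, cats = one_hot(answers)
```

Each category becomes its own numeric column, so a discrete node can be represented by numeric variables of the kind a continuous model expects.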

図２７は慢性腎臓病（ＣＫＤ）発症関連パスを抜き出した例を示す模式図であり、図２８は高血圧発症関連パスを抜き出した例を示す模式図である。図２７及び図２８において、関連パスを抜き出すには、上述の相対貢献率ＲＣｒを利用して相乗平均上位パスを使用している。関連パスを抜き出す際に、生活習慣から特定の疾患（図の例では、慢性腎臓病及び高血圧）に至るパスだけを取り出している。 FIG. 27 is a schematic diagram showing an example of extracted paths related to the onset of chronic kidney disease (CKD), and FIG. 28 is a schematic diagram showing an example of extracted paths related to the onset of hypertension. In FIGS. 27 and 28, the related paths are the paths with the highest geometric means of the relative contribution rate RCr described above. When the related paths are extracted, only the paths leading from lifestyle habits to a specific disease (chronic kidney disease and hypertension in the examples shown) are taken.

図２９はＳＮＰありの場合のＣＫＤ及び高血圧の２疾患関連ネットワークの例を示す模式図であり、図３０はＳＮＰなしの場合のＣＫＤ及び高血圧の２疾患関連ネットワークの例を示す模式図である。図２９は、ＳＮＰ、すなわち、個人ゲノム（遺伝子）変異データがある場合の、慢性腎臓病（ＣＫＤ）と高血圧の両者の共通部分を示す。図３０は、ＳＮＰがない場合の、慢性腎臓病（ＣＫＤ）と高血圧の両者の共通部分を示す。図２９及び図３０に示すように、慢性腎臓病（ＣＫＤ）と高血圧の両方の疾患共通の関連パスが観察可能となる。 FIG. 29 is a schematic diagram showing an example of a network related to two diseases of CKD and hypertension with SNP, and FIG. 30 is a schematic diagram showing an example of a network related to two diseases of CKD and hypertension without SNP. FIG. 29 shows the intersection of both chronic kidney disease (CKD) and hypertension in the presence of SNPs, i.e. personal genome (gene) mutation data. FIG. 30 shows the intersection of both chronic kidney disease (CKD) and hypertension in the absence of SNPs. As shown in FIGS. 29 and 30, a common associated path for both chronic kidney disease (CKD) and hypertension becomes observable.

図２７において例示した慢性腎臓病（ＣＫＤ）発症関連パス上に、個人ごとの相対貢献率ＲＣｒに基づいて抽出した個人のパスの例について、以下説明する。 An example of an individual path extracted based on the relative contribution rate RCr for each individual on the chronic kidney disease (CKD) onset-related path illustrated in FIG. 27 will be described below.

図３１は慢性腎臓病（ＣＫＤ）発症の個人ネットワークの第１例を示す模式図であり、図３２は慢性腎臓病（ＣＫＤ）発症の個人ネットワークの第２例を示す模式図であり、図３３は慢性腎臓病（ＣＫＤ）発症の個人ネットワークの第３例を示す模式図である。図３１に示す第１例は、７０代女性のパスであり、慢性腎臓病の発症という観点において、飲酒関連及びストレス／睡眠関連のパスが効いていることが分かる。図３２に示す第２例は、５０代男性のパスであり、慢性腎臓病の発症という観点において、心疾患関連のパスが効いていることが分かる。図３３に示す第３例は、６０代男性のパスであり、慢性腎臓病の発症という観点において、糖尿病関連のパスが効いていることが分かる。図３１から図３３に示すように、個人ごとに効いているパスが異なることが明瞭に観察可能となる。 FIG. 31 is a schematic diagram showing a first example of an individual network for the onset of chronic kidney disease (CKD), FIG. 32 is a schematic diagram showing a second example, and FIG. 33 is a schematic diagram showing a third example. The first example, shown in FIG. 31, is the path of a woman in her 70s; with respect to the onset of chronic kidney disease, the drinking-related and stress/sleep-related paths are the influential ones. The second example, shown in FIG. 32, is the path of a man in his 50s, for whom the heart-disease-related path is influential. The third example, shown in FIG. 33, is the path of a man in his 60s, for whom the diabetes-related path is influential. As FIGS. 31 to 33 show, it can be clearly observed that the influential paths differ from individual to individual.

図３４は特徴ネットワーク抽出装置５０の処理手順の一例を示すフローチャートである。便宜上、以下では処理の主体をプロセッサ５１として説明する。プロセッサ５１は、サンプル（個人）のデータを取得し（Ｓ１１）、取得したデータをベイジアンネットワークの所要のノードに付与する（Ｓ１２）。 FIG. 34 is a flowchart showing an example of the processing procedure of the feature network extraction device 50. For convenience, the subject of processing will be described below as the processor 51. The processor 51 acquires sample (individual) data (S11) and assigns the acquired data to a required node of the Bayesian network (S12).

プロセッサ５１は、所要ノード以外のノードの事後確率の算出を開始し（Ｓ１３）、リンク（枝又はエッジ）の特徴量を算出する（Ｓ１４）。プロセッサ５１は、他のサンプルの有無を判定し（Ｓ１５）、他のサンプルがある場合（Ｓ１５でＹＥＳ）、ステップＳ１１以降の処理を続ける。 The processor 51 starts calculating posterior probabilities of nodes other than the required node (S13), and calculates the feature amount of the link (branch or edge) (S14). The processor 51 determines the presence / absence of another sample (S15), and if there is another sample (YES in S15), the processor 51 continues the processing after step S11.

他のサンプルがない場合（Ｓ１５でＮＯ）、プロセッサ５１は、算出した特徴量に基づいて特徴ネットワークを抽出し（Ｓ１６）、処理を終了する。 When there is no other sample (NO in S15), the processor 51 extracts the feature network based on the calculated feature amount (S16), and ends the process.
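The loop of FIG. 34 can be sketched as follows under strong simplifying assumptions. Every parent variable is taken to be observed, so the posterior calculation of step S13 is omitted. The feature amount of a link is taken to be the function value m_j(x_j) of an additive model. The model functions, sample values, and the per-sample thresholding used for step S16 are all hypothetical.

```python
def link_features(model, sample):
    """S12-S14: assign the sample data and compute one feature amount per link."""
    return {(parent, child): fn(sample[parent])
            for (parent, child), fn in model.items()}


def extract_feature_networks(model, samples, threshold):
    """S11-S16: loop over the samples, then keep links at or above the threshold."""
    per_sample = {}
    for name, data in samples.items():  # S11 and S15: next sample until none left
        per_sample[name] = link_features(model, data)
    return {name: {link for link, value in feats.items() if value >= threshold}
            for name, feats in per_sample.items()}  # S16


# Hypothetical additive model y = m1(x1) + m2(x2) + noise.
model = {("x1", "y"): lambda x: 2.0 * x, ("x2", "y"): lambda x: x * x}
samples = {"A": {"x1": 1.0, "x2": 0.5}, "B": {"x1": 0.1, "x2": 2.0}}
networks = extract_feature_networks(model, samples, threshold=1.0)
```

With these toy values, sample A keeps only the link from x1 and sample B keeps only the link from x2, mirroring the way different samples yield different feature networks in FIGS. 10 and 11.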

次に、上述のステップＳ１６の特徴ネットワークの抽出について説明する。図３５は特徴ネットワーク抽出処理の一例を示すフローチャートである。プロセッサ５１は、群ごとに各枝の特徴量（例えば、ＥＣｖ）の平均を算出し（Ｓ１６１）、群間のＥＣｖの差であるΔＥＣｖを各枝で算出する（Ｓ１６２）。 Next, the extraction of the feature network in step S16 described above will be described. FIG. 35 is a flowchart showing an example of the feature network extraction process. The processor 51 calculates the average of the feature amounts (for example, ECv) of each branch for each group (S161), and calculates ΔECv, which is the difference in ECv between the groups, for each branch (S162).

プロセッサ５１は、ΔＥＣｖが閾値より大きい枝を抽出する（Ｓ１６３）。プロセッサ５１は、他の群の有無を判定し（Ｓ１６４）、他の群がある場合（Ｓ１６４でＹＥＳ）、群毎に、他の全ての群との間で抽出した枝の和集合または積集合を抽出し（Ｓ１６５）、後述のステップＳ１６６の処理を行う。 The processor 51 extracts a branch whose ΔECv is larger than the threshold value (S163). The processor 51 determines the presence or absence of another group (S164), and if there is another group (YES in S164), the union or intersection of the branches extracted from all the other groups for each group. (S165), and the process of step S166 described later is performed.

他の群がない場合（Ｓ１６４でＮＯ）、プロセッサ５１は、抽出した枝により特徴ネットワークを構築し（Ｓ１６６）、処理を終了する。 When there is no other group (NO in S164), the processor 51 constructs a feature network from the extracted branches (S166) and ends the process.
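
The extraction procedure of FIG. 35 (S161 to S166) might be sketched as below. The function names, the dictionary layout of `groups`, and the `combine` parameter are hypothetical. The text only speaks of the "difference" in ECv between groups; taking the absolute value of that difference is an assumption made here so that the threshold test does not depend on the order of the groups.

```python
from itertools import combinations
from statistics import mean


def mean_features(samples):
    """S161: average each branch's feature value (e.g. ECv) within one group."""
    return {edge: mean(s[edge] for s in samples) for edge in samples[0]}


def extract_feature_network(groups, threshold, combine="union"):
    """Return, for each group, the set of branches forming its feature network.

    groups: {group_name: [ {edge: ECv, ...} per sample ]}
    """
    means = {name: mean_features(samples) for name, samples in groups.items()}
    per_group = {name: [] for name in groups}
    for a, b in combinations(groups, 2):
        # S162: delta ECv per branch (absolute difference is an assumption)
        delta = {e: abs(means[a][e] - means[b][e]) for e in means[a]}
        # S163: keep branches whose delta ECv exceeds the threshold
        picked = {e for e, d in delta.items() if d > threshold}
        per_group[a].append(picked)
        per_group[b].append(picked)
    # S165: combine the per-pair results by union or intersection
    op = set.union if combine == "union" else set.intersection
    # S166: the combined branch set defines the feature network of each group
    return {g: op(*sets) if sets else set() for g, sets in per_group.items()}
```

With only two groups, each group has a single pairwise result, so the union and the intersection coincide, which corresponds to the NO branch of S164.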

特徴ネットワーク抽出装置５０は、ＣＰＵ（プロセッサ）、ＲＡＭなどを備えたコンピュータを用いて実現することもできる。図３４及び図３５に示すような処理の手順を定めたコンピュータプログラム（記録媒体Ｍに記録可能）をコンピュータに備えられた記録媒体読取部５５で読み取り、読み取ったコンピュータプログラムをＲＡＭにロードし、コンピュータプログラムをＣＰＵ（プロセッサ）で実行することにより、コンピュータ上で特徴ネットワーク抽出装置５０を実現することができる。 The feature network extraction device 50 can also be realized by using a computer equipped with a CPU (processor), RAM, and the like. A computer program that defines the processing procedures shown in FIGS. 34 and 35 (and that can be recorded on the recording medium M) is read by the recording medium reading unit 55 provided in the computer, the read computer program is loaded into the RAM, and the computer program is executed by the CPU (processor), whereby the feature network extraction device 50 can be realized on the computer.

上述のように、本実施の形態によれば、データ全体の特徴（因果関係）までは説明できるというベイジアンネットワークの限界点を超えて、ベイジアンネットワークでは説明できなかった、個人又は個別サンプルの因果関係を、推定されたベイジアンネットワークと枝の特徴量という枝評価手法を用いることにより、説明可能とすることができる。 As described above, the present embodiment goes beyond the limitation of the Bayesian network, which can explain only the characteristics (causal relationships) of the data as a whole. By using the estimated Bayesian network together with a branch evaluation method based on the feature amounts of the branches, it becomes possible to explain the causal relationships of an individual or an individual sample, which the Bayesian network alone could not explain.

本実施の形態において、ベイジアンネットワークに用いる所定モデルは、ノンパラメトリック回帰モデルに限定されるものではない。例えば、所定モデルは、加法モデルでもよく、掛け算モデルでもよい。加法モデルの場合には、親変数ｘ１、ｘ２、…に対して何らかの関数ｍ１、ｍ２、…があり、子変数ｙ＝ｍ１（ｘ１）＋ｍ２（ｘ２）＋…のように「和」で表すことができる。関数ｍ１（ｘ１）、ｍ２（ｘ２）、…は、所要の関数でよく、関数ｍ１（ｘ１）、ｍ２（ｘ２）、…の値をＥＣｖとすることができる。また、ｍ１（ｘ）＝ｘとすれば、所定関数は線形関数となり、線形モデルとすることができる。また、掛け算モデルの場合には、子変数ｙ＝ｍ１（ｘ１）・ｍ２（ｘ２）・…のように「掛け算」で表すことができる。所定関数は、非線形関数に限定されるものではなく、線形関数でもよい。 In the present embodiment, the predetermined model used for the Bayesian network is not limited to a nonparametric regression model. For example, the predetermined model may be an additive model or a multiplicative model. In an additive model, there are functions m1, m2, ... for the parent variables x1, x2, ..., and the child variable can be expressed as a sum: y = m1(x1) + m2(x2) + .... The functions m1(x1), m2(x2), ... may be any desired functions, and their values can be used as the ECv. If m1(x) = x, the predetermined function is a linear function, and the model becomes a linear model. In a multiplicative model, the child variable can be expressed as a product: y = m1(x1) · m2(x2) · .... The predetermined function is not limited to a non-linear function and may be a linear function.
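
As a small illustration of the additive and multiplicative models described above, the following sketch evaluates each function m_j at its parent's value, treats the function values as the ECv of the corresponding links, and combines them into the child value. The function names and the particular choices m1(x) = x and m2(x) = x² are hypothetical examples, not taken from the embodiment.

```python
import math


def link_ecv(parent_values, funcs):
    """ECv of each parent-to-child link: the function value m_j(x_j)."""
    return {parent: funcs[parent](x) for parent, x in parent_values.items()}


def additive_child(parent_values, funcs):
    """Additive model: y = m1(x1) + m2(x2) + ..."""
    return sum(link_ecv(parent_values, funcs).values())


def multiplicative_child(parent_values, funcs):
    """Multiplicative model: y = m1(x1) * m2(x2) * ..."""
    return math.prod(link_ecv(parent_values, funcs).values())


# Hypothetical functions: m1 is linear (m1(x) = x), m2 is non-linear.
funcs = {"x1": lambda x: x, "x2": lambda x: x * x}
values = {"x1": 2.0, "x2": 3.0}

ecv = link_ecv(values, funcs)                   # per-link feature values
y_add = additive_child(values, funcs)           # sum of the function values
y_mul = multiplicative_child(values, funcs)     # product of the function values
```

If every m_j is the identity, `additive_child` reduces to a plain sum of the parent values, which is the linear special case mentioned in the text.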

本実施の形態において、ベイジアンネットワークは離散モデルでも適用することができる。ベイジアンネットワークが離散モデルの場合、１−hot化という機械学習で行われる一般的な前処理を行うことにより、連続モデルに適用可能となる。１−hot化は、例えば、Ｘという変数が、Ａ、Ｂ、Ｃをとる場合、「ＸがＡである」「ＸがＢである」「ＸがＣである」という３つの変数に分けて、該当する場合１を、そうでない場合は０をそれぞれの変数の値とすることにより、連続値に変換することができる。また、「ＸがＣである」というのは、「ＸがＡである」及び「ＸがＢである」の両方が０であれば表現できるので、Ｎ個のカテゴリの変数の１−hot化をＮ−１の変数で行ってもよい。 In the present embodiment, the Bayesian network can also be a discrete model. When the Bayesian network is a discrete model, the continuous model becomes applicable after one-hot encoding, a common preprocessing step in machine learning. In one-hot encoding, for example, when a variable X takes the values A, B, and C, it is split into three variables, "X is A", "X is B", and "X is C", and each variable is set to 1 if it applies and to 0 otherwise, which converts X into continuous values. Moreover, since "X is C" can be expressed by both "X is A" and "X is B" being 0, a variable with N categories may be one-hot encoded using N-1 variables.
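
The one-hot encoding described above, including the variant that uses N-1 variables, can be illustrated with a minimal sketch. The function name `one_hot` and the `drop_last` flag are hypothetical; the category that is dropped here is simply the last one in the list.

```python
def one_hot(value, categories, drop_last=False):
    """Encode a categorical value as a list of 0.0/1.0 indicator variables.

    With drop_last=True only N-1 indicators are produced; the last category
    is then represented by all indicators being 0.
    """
    used = categories[:-1] if drop_last else categories
    return [1.0 if value == c else 0.0 for c in used]


categories = ["A", "B", "C"]
full_b = one_hot("B", categories)                     # "X is A", "X is B", "X is C"
reduced_c = one_hot("C", categories, drop_last=True)  # "X is A", "X is B" only
```

Each resulting indicator can then be treated as a separate continuous-valued variable of the network.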

本実施の形態の特徴ネットワークは、医療関係のベイジアンネットワークへの適用に限定されるものではない。例えば、ベイジアンネットワークを用いた広告提供、マーケティングリサーチ、アンケート分析、及びシステムの障害診断への応用などにも、本実施の形態の特徴ネットワークは適用可能である。例えば、従来のベイジアンネットワークを用いた分析では、ユーザの年代や性別などの大まかな属性データの因果関係は説明できたとしても、個人又は個別サンプルの因果関係は説明することができない。本実施の形態を適用すれば、推定されたベイジアンネットワークと枝の特徴量という枝評価手法を用いることができ、個人又は個別サンプルの因果関係を説明することが可能となり、ユーザモデリングやヒューマンモデリングへ応用する際に、個人レベルまで詳細に分析することが可能となる。 The feature network of the present embodiment is not limited to application to medical Bayesian networks. For example, it is also applicable to advertisement delivery, marketing research, questionnaire analysis, and system failure diagnosis using Bayesian networks. Analysis using a conventional Bayesian network may be able to explain the causal relationships of coarse attribute data such as a user's age group and gender, but it cannot explain the causal relationships of an individual or an individual sample. Applying the present embodiment makes it possible to use the estimated Bayesian network together with a branch evaluation method based on the feature amounts of the branches, and thus to explain the causal relationships of an individual or an individual sample. In applications to user modeling and human modeling, this enables detailed analysis down to the individual level.

本実施の形態のベイジアンネットワーク分析方法は、前述の特徴ネットワーク抽出装置を用いて、所要のベイジアンネットワークから特徴ネットワークを抽出し、抽出した特徴ネットワークに基づいて、前記ベイジアンネットワークでのサンプル又はサンプル群を評価することができる。この場合、所要のベイジアンネットワークは、医療データ、広告データ、マーケティングデータ及びアンケートデータの少なくとも一つのデータに関する多変量の因果関係を表すものとすることができるが、他のデータに関する多変量の因果関係を表すものでもよい。 In the Bayesian network analysis method of the present embodiment, a feature network is extracted from a required Bayesian network using the above-described feature network extraction device, and a sample or a sample group in the Bayesian network can be evaluated based on the extracted feature network. In this case, the required Bayesian network may represent multivariate causal relationships in at least one of medical data, advertising data, marketing data, and questionnaire data, but it may also represent multivariate causal relationships in other data.

５０特徴ネットワーク抽出装置
５１プロセッサ
５２操作部
５３インタフェース部
５４表示パネル
５５記録媒体読取部
５６ＲＯＭ
５７メモリ
５８記憶部
５８１ベイジアンネットワークモデル
５８２サンプルデータ

50 Feature network extraction device
51 Processor
52 Operation unit
53 Interface unit
54 Display panel
55 Recording medium reading unit
56 ROM
57 Memory
58 Storage unit
581 Bayesian network model
582 Sample data

Claims

A feature network extraction device comprising:
a data assignment unit that assigns data to required nodes of a Bayesian network that represents, using a directed acyclic graph, the dependency relationships among a plurality of nodes each associated with a random variable;
a calculation unit that, when the posterior probabilities of nodes are calculated based on the data assigned by the data assignment unit, calculates a feature amount of a link from a parent node to a child node based on a predetermined model that constitutes the conditional probability given the random variable of the parent node; and
an extraction unit that extracts a feature network from the Bayesian network based on the feature amount calculated by the calculation unit.

The feature network extraction device according to claim 1, wherein the calculation unit calculates the feature amount of the link from the parent node to the child node based on a function value of a predetermined function that represents the predetermined model of the random variable of the parent node with respect to the random variable of the child node.

The feature network extraction device according to claim 2, wherein the calculation unit calculates the function value of the predetermined function as the feature amount of the link.

The feature network extraction device according to claim 2, wherein, when the data assignment unit assigns data of different samples, the calculation unit calculates, as the feature amount of the link, a comparison value between a first function value of the predetermined function based on the data of a first sample and a second function value of the predetermined function based on the data of a second sample.

The feature network extraction device according to claim 2, wherein the calculation unit calculates, as the feature amount of the link, the ratio of the function value of the predetermined function corresponding to the link to the maximum of the function values of the predetermined functions that represent the predetermined model of the random variable of each of a plurality of parent nodes with respect to the random variable of the child node.

The feature network extraction device according to claim 2, wherein the calculation unit calculates, as the feature amount of the link, the ratio of the function value of the predetermined function corresponding to the link to the total of the function values of the predetermined functions that represent the predetermined model of the random variable of each of a plurality of parent nodes with respect to the random variable of the child node.

The feature network extraction device according to any one of claims 2 to 6, wherein the extraction unit extracts a feature network including the link when the feature amount of the link calculated by the calculation unit is equal to or greater than a predetermined threshold value.

The feature network extraction device according to any one of claims 2 to 6, further comprising a setting unit for setting a plurality of required nodes of the Bayesian network, wherein
the calculation unit calculates, for each of a plurality of paths from one node to another node set by the setting unit, a feature amount of the whole of the one or more links constituting the path, and
the extraction unit extracts, from the plurality of paths, a feature network including a path for which the feature amount of the whole of the links constituting the path is equal to or greater than a predetermined threshold value.

The feature network extraction device according to any one of claims 2 to 8, wherein the predetermined model includes a nonparametric regression model, and the predetermined function includes a non-linear function.

A Bayesian network analysis method comprising:
extracting a feature network from a required Bayesian network by using the feature network extraction device according to any one of claims 1 to 9; and
evaluating a sample or a sample group in the Bayesian network based on the extracted feature network.

The Bayesian network analysis method according to claim 10, wherein the Bayesian network represents a multivariate causal relationship with respect to at least one of medical data, advertising data, marketing data, and questionnaire data.

A computer program that causes a computer to execute:
a process of assigning data to required nodes of a Bayesian network that represents, using a directed acyclic graph, the dependency relationships among a plurality of nodes each associated with a random variable;
a process of calculating, when the posterior probabilities of nodes are calculated based on the assigned data, a feature amount of a link from a parent node to a child node based on a predetermined model that constitutes the conditional probability given the random variable of the parent node; and
a process of extracting a feature network from the Bayesian network based on the calculated feature amount.

A feature network extraction method performed by a computer, wherein the computer:
assigns data to required nodes of a Bayesian network that represents, using a directed acyclic graph, the dependency relationships among a plurality of nodes each associated with a random variable;
calculates, when the posterior probabilities of nodes are calculated based on the assigned data, a feature amount of a link from a parent node to a child node based on a predetermined model that constitutes the conditional probability given the random variable of the parent node; and
extracts a feature network from the Bayesian network based on the calculated feature amount.