JP6318334B2

JP6318334B2 - Correlation network analysis program

Info

Publication number: JP6318334B2
Application number: JP2016207785A
Authority: JP
Inventors: 鈴木　秀幸; 一斗萬年; 柴田　大輔; 善之尾形
Original assignee: Kazusa DNA Research Institute Foundation; Osaka Prefecture University
Current assignee: Kazusa DNA Research Institute Foundation; Osaka Prefecture University
Priority date: 2015-11-19
Filing date: 2016-10-24
Publication date: 2018-05-09
Anticipated expiration: 2036-10-24
Also published as: JP2017102910A

Description

The present invention relates to, for example, a program for performing correlation network analysis, by a bottom-up method, on multivariate data obtained by omics analysis or the like.

Various multivariate analysis techniques are available for analyzing multivariate data. In metabolomics, for example, statistical methods such as principal component analysis (PCA) and hierarchical cluster analysis (HCA) have typically been used to analyze relationships between metabolites on the basis of quantitative values obtained by high-throughput analysis with mass spectrometers and the like. These methods are very useful for grasping the overall trend of multivariate data. However, they do not provide a statistical index for grouping the elements of the data. Multiple comparison methods such as discriminant analysis have therefore been devised to classify the elements of multivariate data into several groups.

However, when the number of elements is very large and many groups are obtained, as with recent big data, multiple comparison methods are insufficient. In such cases, community extraction in network analysis has conventionally been used.

Such network analysis plays an important role in omics analysis in post-genomic science. In transcriptomics, for example, network analysis using correlation coefficients is widely used to search for gene co-expression relationships, and many databases have been built, mainly for the model plant Arabidopsis thaliana. Although analysis based on such correlation networks has not been common so far, it is attracting attention as a bird's-eye view for visually grasping the overall picture of the metabolites contained in a sample.

In network analysis, there are top-down and bottom-up methods for extracting, from a matrix of positive correlations between elements, sub-community structures in which the elements are related to one another. A top-down method first draws the network structure based on the correlation matrix and then divides that network into sub-community structures. The DP-Clus tool (Non-Patent Document 1) and the ARACNE tool (Non-Patent Document 2) are representative. These tools use indices that apply to the network as a whole; DP-Clus, for example, uses the clustering coefficient, a typical network index. The advantage of the top-down method is speed: the entire network can be analyzed at once against a single criterion. It is generally useful for classifying an entire network into communities. Its drawback is that, because there is only one criterion, in an analysis that sets an element of interest the size of the sub-community structure containing that element cannot be set by an appropriate criterion, so the result is not necessarily of the size the user wants. For network analysis in which an element of interest may be set, the bottom-up method described below is therefore more suitable.

In the bottom-up method, elements are linked starting from each element of interest in the correlation matrix according to the strength of their relationship, so that the sub-community containing that element gradually grows. Unlike the top-down method, it is difficult to analyze the whole correlation matrix at once, but the size of the community containing the element of interest can be adjusted on the basis of statistical significance. The Newman method (Non-Patent Document 3) and the Louvain method (Non-Patent Document 4) are representative conventional methods. However, these methods also do not support analysis with a designated element of interest (they treat all elements equally), so the community containing the element of interest is not necessarily obtained at the size the user wants.

The present inventors previously developed the "Confeito algorithm" (金平糖アルゴリズム), which extracts the sub-community structure (also called a "module") of an element of interest in a network by a bottom-up method (Non-Patent Document 5). However, that algorithm cannot adjust the module size and lacks the functions needed to classify the elements in a network.

Non-Patent Document 1: Altaf-Ul-Amin M, Shibo Y, Mihara K, Kurokawa K, Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics, 7, 207 (2006). [PMID: 16613608]
Non-Patent Document 2: Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, Suppl 1: S7 (2006). [PMID: 16723010]
Non-Patent Document 3: Newman ME, Girvan M. Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys, 69: 026113 (2004). [PMID: 14995526]
Non-Patent Document 4: Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech, doi:10.1088/1742-5468/2008/10/P10008 (2008).
Non-Patent Document 5: Ogata Y, Sakurai N, Suzuki H, Aoki K, Saito K, Shibata D. The prediction of local modular structures in a co-expression network based on gene expression datasets. Genome Inform, 23: 117-127 (2009). [PMID: 20180267]

The problem to be solved by the present invention is to overcome the above shortcomings of the algorithm described in Non-Patent Document 5 and to provide a program, a method, and the like that can form and detect a larger number of modules of more appropriate size in correlation network analysis of multivariate data.

In correlation network analysis, the present inventors considered, in addition to the correlation coefficient (a commonly used statistical index), original indices that apply geometric theory; set the minimum and maximum community (module) sizes as initial settings; and added a module formation process (module integration), thereby changing the order of analysis. By these and other means they developed a new program and method that can carry out omics analyses, such as the analysis of gene co-expression relationships, more effectively and without excess or deficiency, thereby solving the above problem and completing the present invention.

That is, the present invention has the following aspects.
[Aspect 1]
A program for causing a computer, in order to perform correlation network analysis of multivariate data, to execute:
(1) a step of forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) a step of reconstructing the network on the basis of the element F value (VF) and integrating the modules formed in step (1); and
(3) a step of forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), elements not included in any module of the module group integrated in step (2) (peripheral elements) with that integrated module group.
[Aspect 2]
A program for causing a computer, in order to perform correlation network analysis of multivariate data, to function as:
(1) means for forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) means for reconstructing the network on the basis of the element F value (VF) and integrating the modules formed by means (1); and
(3) means for forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated by means (2).
[Aspect 3]
A method for correlation network analysis of multivariate data, comprising:
(1) a step in which a computer forms, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the input multivariate data;
(2) a step in which the computer reconstructs the network on the basis of the element F value (VF) and integrates the modules formed in step (1); and
(3) a step in which the computer forms a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated in step (2).
[Aspect 4]
A computer-readable recording medium on which the program of the present invention is recorded.
[Aspect 5]
A system implemented on a computer for correlation network analysis of multivariate data, comprising:
(1) means for forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) means for reconstructing the network on the basis of the element F value (VF) and integrating the modules formed by means (1); and
(3) means for forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated by means (2), and optionally (4) output means for drawing (display processing) a network (map) including the module group obtained by means (3).

In the present invention, when the sub-community structure (module) for a single element of interest is extracted by the bottom-up method, the various settings (such as the desired range of module sizes) are adjusted on the basis of statistical significance and the modules are then integrated. This makes it possible to control appropriately the size of the module containing the element of interest.

As a result, as shown herein, the accuracy of module formation was confirmed to be higher, that is, more modules of the most appropriate size could be formed and detected, than with conventionally known correlation network analyses: 1) the Louvain method implemented in the general-purpose network analysis tool Pajek, and 2) the simulated annealing and fast greedy methods provided as add-in tools for the R statistical analysis platform.

Thus, with the program of the present invention, any different kinds of measurement data can be freely combined in data processing, and analysis methods suited to the purpose can be selected (data mining) using an approach that, in addition to the correlation coefficient, takes relationships into account geometrically by graph theory. This can drastically shorten research and development periods.

FIG. 1 shows an example of a display of results obtained by running the program of the present invention.
FIG. 2 shows a flowchart of the algorithm of one example of the program of the present invention.
FIG. 3 gives a supplementary explanation of each step in the flowchart shown in FIG. 2.
FIG. 4 shows the result of performing gene co-expression analysis on public Arabidopsis DNA array data (22,746 genes x 9,942 experimental groups) and processing the resulting multivariate data by network analysis with the program of the present invention.
FIG. 5 shows a method for evaluating the similarity in member composition between modules (communities) and experimental groups.

That is, the present invention relates to a program for causing a computer, in order to perform correlation network analysis of multivariate data, to execute:
(1) a step of forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) a step of reconstructing the network on the basis of the element F value (VF) and integrating the modules formed in step (1); and
(3) a step of forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated in step (2).

Alternatively, the present invention relates to a program for causing a computer, in order to perform correlation network analysis of multivariate data, to function as:
(1) means for forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) means for reconstructing the network on the basis of the element F value (VF) and integrating the modules formed by means (1); and
(3) means for forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated by means (2).

The above program may further include (4) causing the computer to execute a step of drawing (display processing) a network (map) including the module group obtained in step (3), or (4) causing the computer to function as output means for drawing (display processing) a network (map) including the module group obtained by means (3).

The present invention further relates to a computer-readable recording medium on which the above program is recorded. The type of recording medium is not particularly limited and may take any form known to those skilled in the art, such as CDs, DVDs, tapes, various hard disks, and semiconductor memories. The program of the present invention may also be provided to the computer that executes it via a computer server of an externally connected computer and/or distributed computers, or via a network.

Similarly, the present invention relates to a system implemented on a computer for correlation network analysis of multivariate data, comprising:
(1) means for forming, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the multivariate data;
(2) means for reconstructing the network on the basis of the element F value (VF) and integrating the modules formed by means (1); and
(3) means for forming a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated by means (2), and optionally (4) output means for drawing (display processing) a network (map) including the module group obtained by means (3). Such a system includes one or more processors and memory for executing the program of the present invention.

The present invention also relates to a method for correlation network analysis of multivariate data, comprising:
(1) a step in which a computer forms, for a given element, a module having the maximum network F value (NF) within a certain range of sizes (the number of elements included in the module), using the network F value (NF) and the element F value (VF), on the basis of a correlation matrix between the individual data items (elements) of the input multivariate data;
(2) a step in which the computer reconstructs the network on the basis of the element F value (VF) and integrates the modules formed in step (1); and
(3) a step in which the computer forms a final module group by associating, under certain conditions and on the basis of the element specificity (VS), the peripheral elements of the modules with the module group integrated in step (2).

The method of the present invention may further include (4) a step in which the computer draws (display processing) a network (map) including the module group obtained in step (3). The method is carried out on a computer on which the above program is installed.

In the present invention, there is no particular limitation on how the multivariate data are acquired (route, method) or on their type, attributes, and so on. A typical example is multivariate data obtained by omics analysis. "Omics analysis" generally means the integrated analysis of individual sets of comprehensive molecular information. Representative examples of such information include transcriptome data, which are comprehensive information on gene transcripts, and metabolome data, which are comprehensive information on metabolites.

Such comprehensive data can be acquired by any method or means known to those skilled in the art, for example various kinds of genetic analysis, gene expression analysis, and mass spectrometry such as LC-MS, GC-MS, and CE-MS. The source of the data is likewise not particularly limited; examples include parts, organs, tissues, and cells derived from various animals, plants, microorganisms, and bacteria. The multivariate data may also be information acquired by any method from samples taken from a given environment, from manufactured products (for example, processed foods), and the like.

In the present invention, each of the above indices is defined by the formulas below. Network density (ND) and element density (VD) indicate how tightly the elements are connected to one another, whereas network specificity (NS) and element specificity (VS) indicate how exclusively the elements are connected (that is, how isolated they are from other modules). In the formulas, e(i) is the total number of edges of element i within the sub-module structure, d(i) is the degree of element i in the whole network, and n is the total number of elements in the module.

The network F value (NF), defined by formula (I), is an index that gives equal weight to the density and the specificity of each module; it is the harmonic mean of network density (ND) and network specificity (NS). Here, ND is "(the number of edges actually connected) / (the total number of edges if all elements were ideally connected to one another)", and NS is "(the sum, over the elements, of the edges connecting each element to other elements in the module) / (the sum, over the elements, of each element's degree in the whole network)".
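Formula (I) itself is not reproduced in this text. Reading the verbal definitions above literally, and assuming that each intra-module edge is counted once in ND (so that the sum of e(i) over the module counts it twice), one consistent way to write the three quantities is the following sketch:

```latex
\begin{aligned}
ND &= \frac{\tfrac{1}{2}\sum_{i=1}^{n} e(i)}{\tfrac{1}{2}\,n(n-1)}
    = \frac{\sum_{i=1}^{n} e(i)}{n(n-1)}, \\[4pt]
NS &= \frac{\sum_{i=1}^{n} e(i)}{\sum_{i=1}^{n} d(i)}, \\[4pt]
NF &= \frac{2 \cdot ND \cdot NS}{ND + NS}.
\end{aligned}
```

For a fully connected three-element module in which one member has a single edge leaving the module, this gives ND = 1, NS = 6/7 and NF = 12/13.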

The element F value (VF), defined by formula (II), is an index that gives equal weight to the density and the specificity of each element; it is the harmonic mean of element density (VD) and element specificity (VS). Here, VD is "(the number of edges connecting the element to other elements in the module) / (the number of edges if the element were ideally connected to all elements in the module)", and VS is "(the number of edges connecting the element to other elements in the module) / (the degree of the element in the whole network)".
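The verbal definitions of the six indices can be turned into a few lines of code. The following is a minimal sketch, not the patented implementation: it assumes an unweighted, undirected network stored as an adjacency dictionary, and the function names (`element_indices`, `network_indices`) and the toy network are hypothetical.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected adjacency: node -> set of neighbours


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative values (0 when both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def element_indices(graph: Graph, module: Set[str], v: str) -> Tuple[float, float, float]:
    """Return (VD, VS, VF) of element v with respect to `module`."""
    others = module - {v}
    e = len(graph[v] & others)   # e(i): edges from v to other module members
    d = len(graph[v])            # d(i): degree of v in the whole network
    vd = e / len(others) if others else 0.0
    vs = e / d if d else 0.0
    return vd, vs, harmonic_mean(vd, vs)


def network_indices(graph: Graph, module: Set[str]) -> Tuple[float, float, float]:
    """Return (ND, NS, NF) of `module`."""
    n = len(module)
    # Summing e(i) over members counts every intra-module edge twice.
    e_sum = sum(len(graph[v] & (module - {v})) for v in module)
    d_sum = sum(len(graph[v]) for v in module)
    nd = e_sum / (n * (n - 1)) if n > 1 else 0.0
    ns = e_sum / d_sum if d_sum else 0.0
    return nd, ns, harmonic_mean(nd, ns)


# Toy network: a triangle a-b-c in which c also has one edge to an outsider x.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"}, "x": {"c"}}
module = {"a", "b", "c"}
nd, ns, nf = network_indices(toy, module)
vd_c, vs_c, vf_c = element_indices(toy, module, "c")
```

In this toy case the module is fully connected (ND = 1) but not fully exclusive (NS = 6/7), and element c is penalized through its specificity (VS = 2/3) for its edge to x.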

The processing included in each step or means of the program of the present invention is described in detail below.

Step or means (1):
Based on the correlation matrix between the elements of the multivariate data, and using the network F value (NF) and the element F value (VF), a module (core part) having the maximum NF within a certain range of sizes (the number of elements included in the module) is formed for a given element. This step functions as a False-Positive-Out (FPO) analysis step, which excludes elements that have a high correlation coefficient but contribute little to the module of interest.

The correlation matrix can be created by obtaining correlation coefficients of any type, such as Pearson, Spearman, or cosine, between the elements by any method or means known to those skilled in the art, on the basis of various information about each element suited to the attributes of the multivariate data, for example experimental groups and data about the samples (tissue, treatment, treatment time, conditions, and so on). The correlation coefficient between elements is a real number from 0 to 1 inclusive; any negative correlation coefficient is replaced with 0.
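As an illustration of this preprocessing, the sketch below computes pairwise Pearson coefficients and replaces negative values with 0. Pearson is only one of the admissible choices, and the profile data and function names are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(profiles: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations clamped to [0, 1]; negative values become 0."""
    names = list(profiles)
    return {
        a: {
            b: min(1.0, max(0.0, pearson(profiles[a], profiles[b])))
            for b in names
            if b != a
        }
        for a in names
    }


# Hypothetical profiles: one list of measurements per element.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],  # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],  # falls as g1 rises
}
corr = correlation_matrix(profiles)
```

Here `corr["g1"]["g3"]` is 0 rather than -1, so anti-correlated elements are simply treated as unrelated in the later steps.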

In a preferred example of this step or means:
1. The minimum and maximum module (community) sizes are set as initial settings.
2. One element of interest, SV, is selected, and a module is set up containing an element group (HV) in which the other elements are arranged in descending order of their correlation coefficient with SV.
3. Based on the network F value (NF) and the element F value (VF), the element with the smallest VF is removed one at a time from the module set up in 2, and the module showing the maximum NF within the size range set in 1 is formed (selected).
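The three items above can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the method as claimed: the text here does not say how edges are derived from the correlation matrix, so the sketch assumes a fixed cutoff (`edge_cutoff`), caps the initial candidate at `candidate_size` elements, and never removes SV itself. All names and the toy data are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Corr = Dict[str, Dict[str, float]]
Graph = Dict[str, Set[str]]


def hmean(a: float, b: float) -> float:
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def build_graph(corr: Corr, edge_cutoff: float) -> Graph:
    # ASSUMPTION: an edge joins two elements whose correlation >= edge_cutoff.
    return {a: {b for b, r in row.items() if r >= edge_cutoff} for a, row in corr.items()}


def vf(graph: Graph, module: Set[str], v: str) -> float:
    e = len(graph[v] & (module - {v}))
    vd = e / (len(module) - 1) if len(module) > 1 else 0.0
    vs = e / len(graph[v]) if graph[v] else 0.0
    return hmean(vd, vs)


def nf(graph: Graph, module: Set[str]) -> float:
    n = len(module)
    e_sum = sum(len(graph[v] & (module - {v})) for v in module)
    d_sum = sum(len(graph[v]) for v in module)
    nd = e_sum / (n * (n - 1)) if n > 1 else 0.0
    ns = e_sum / d_sum if d_sum else 0.0
    return hmean(nd, ns)


def fpo_module(corr: Corr, sv: str, min_size: int, max_size: int,
               edge_cutoff: float, candidate_size: int) -> Tuple[Optional[Set[str]], float]:
    graph = build_graph(corr, edge_cutoff)
    # HV: the other elements in descending order of correlation with SV.
    hv = sorted(corr[sv], key=lambda v: corr[sv][v], reverse=True)
    module = {sv, *hv[:candidate_size - 1]}
    best, best_nf = None, -1.0
    while len(module) >= min_size:
        if len(module) <= max_size:
            score = nf(graph, module)
            if score > best_nf:
                best, best_nf = set(module), score
        removable = sorted(module - {sv})  # ASSUMPTION: SV is never removed
        if not removable:
            break
        # Drop the member with the smallest VF, then re-evaluate.
        module.remove(min(removable, key=lambda v: vf(graph, module, v)))
    return best, best_nf


# Toy data: a, b, c form a tight triangle; d correlates strongly with a only,
# and is otherwise tied to e and f outside the triangle.
names = "abcdef"
strong = {("a", "b"): 0.8, ("a", "c"): 0.8, ("b", "c"): 0.8,
          ("a", "d"): 0.9, ("d", "e"): 0.8, ("d", "f"): 0.8}
corr = {u: {v: strong.get((u, v), strong.get((v, u), 0.1)) for v in names if v != u}
        for u in names}
core, score = fpo_module(corr, "a", min_size=3, max_size=4,
                         edge_cutoff=0.5, candidate_size=4)
```

Element d has the highest correlation with SV = a, yet it is the first element removed because its VF within the candidate module is the lowest, which is the false-positive-out behaviour described for this step.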

Step or means (2):
The network is reconstructed on the basis of the element F value (VF), and the modules formed in step or means (1) are integrated. Duplication of module members across all elements is eliminated, and the modules are optimized within the set range of module sizes.

In a preferred example of this step or means:
1. For all possible SVs, a threshold VFt on the VF of each element of the modules formed in step or means (1) is set (selected) arbitrarily.
2. In each module containing any given element of the network, a network is constructed in which the elements showing a VF at or above the threshold VFt are connected by edges.
3. For the networks thus reconstructed for all elements in the network, all networks that share an element are linked to form an integrated module group.
4. The size of each module in the integrated module group formed for each selected threshold VFt is calculated, and the integrated module group formed for the threshold VFt that maximizes the number of integrated modules whose size lies within the range set in item 1 of step or means (1) is selected.
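One possible reading of items 1 to 4 is sketched below: for each candidate threshold, the members of every seed module whose VF reaches the threshold are linked, overlapping groups are merged with a union-find structure, and the threshold yielding the most modules inside the size range is kept. The input layout (`vf_by_seed`), the candidate thresholds, and the tie-breaking rule (first threshold wins) are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def integrate(vf_by_seed: Dict[str, Dict[str, float]], vft: float) -> List[Set[str]]:
    """Merge seed modules that share an element whose VF is at least vft."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in vf_by_seed.values():
        kept = [v for v, score in members.items() if score >= vft]
        for v in kept:
            find(v)  # register the element
        for v in kept[1:]:
            parent[find(v)] = find(kept[0])  # link members of the same module

    groups: Dict[str, Set[str]] = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def choose_threshold(vf_by_seed: Dict[str, Dict[str, float]], thresholds: Iterable[float],
                     min_size: int, max_size: int) -> Optional[Tuple[float, int, List[Set[str]]]]:
    """Return (VFt, hits, modules) maximizing the number of in-range modules."""
    best = None
    for vft in thresholds:
        modules = integrate(vf_by_seed, vft)
        hits = sum(min_size <= len(m) <= max_size for m in modules)
        if best is None or hits > best[1]:  # ties keep the earlier threshold
            best = (vft, hits, modules)
    return best


# Hypothetical step (1) output: seed element -> {member: VF in that seed's module}.
vf_by_seed = {
    "a": {"a": 0.9, "b": 0.8, "c": 0.8},
    "b": {"b": 0.9, "a": 0.8, "c": 0.7},
    "d": {"d": 0.9, "e": 0.8, "c": 0.4},
    "e": {"e": 0.9, "d": 0.8, "f": 0.7},
}
vft, hits, modules = choose_threshold(vf_by_seed, [0.3, 0.6], min_size=3, max_size=4)
```

With the low threshold 0.3, the weak member c of seed d bridges everything into a single six-element module that is out of range; with 0.6 the bridge is cut and two in-range modules remain, so 0.6 is selected.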

Step or means (3):
A step in which, based on the element specificity (VS), elements not included in the modules of the integrated module group selected in step or means (2), that is, the peripheral elements of each module, are associated with (added to) that module group under certain conditions such as those shown below, forming the final module group. Peripheral elements that have a low correlation coefficient but a high contribution to the module of interest are added (False-Negative-In (FNI) analysis step).

In a preferred example of this step or means:
1. A threshold s is set on the element specificity, with respect to each module of the integrated module group selected in step or means (2), of that module's peripheral elements.
2. All peripheral elements whose element specificity is equal to or greater than s are added to the module, forming the final module group.

Step or means (4):
An output step of drawing (displaying) the network composed of the module group finally formed in the False-Negative-In (FNI) analysis step or means (3). This step can be carried out by any method or means known to those skilled in the art, and the display format is also arbitrary. For example, as shown in FIG. 4, each element in a module can be displayed as a suitable shape such as "○", with the elements connected by suitable lines. For example, the elements of a module obtained by the False-Positive-Out (FPO) analysis of step or means (1) can be connected with solid lines, and the elements obtained by the False-Negative-In (FNI) analysis of step or means (3) with dotted (broken) lines, so that their nature can easily be distinguished visually. Output in any tabular format (for example, Excel format) is also possible.

At that time, information about each element (physical property values, etc.) may also be displayed on the screen. Multiple modules can also be displayed simultaneously on the same screen, in which case the modules can be arranged or grouped as appropriate based on any property such as module size. An example of such a display is shown in FIG. 1.
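As one possible way to realise the solid and dotted line convention described above, the following minimal sketch emits Graphviz DOT text. The function name `to_dot` and the two edge lists are hypothetical; the description leaves the output format open, so this is only an illustration.

```python
def to_dot(core_edges, marginal_edges):
    """Return Graphviz DOT text for one module.

    core_edges:     pairs of elements kept by the FPO analysis (solid lines)
    marginal_edges: pairs added by the FNI analysis (dashed lines)
    Both arguments are hypothetical inputs chosen for this sketch.
    """
    lines = ["graph modules {", "  node [shape=circle];"]
    for a, b in core_edges:
        lines.append(f'  "{a}" -- "{b}" [style=solid];')
    for a, b in marginal_edges:
        lines.append(f'  "{a}" -- "{b}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("A", "B")], [("B", "C")])
```

The resulting text can be rendered with any DOT-compatible viewer.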

When the program of the present invention is executed on a computer, for example in gene co-expression analysis (transcriptome analysis), FPO analysis outputs, as a set, the genes of the biosynthetic enzyme machinery (components) in the biosynthetic pathway of secondary metabolites (mainly bioactive compounds) derived from the primary metabolites produced by an organism. FPO analysis also outputs the genes of enzyme complexes that appear to be expressed cooperatively, such as histone complex enzymes, which are DNA-binding proteins. The genes of the factors that control transcription of those biosynthetic enzyme genes (transcription factors), on the other hand, can be isolated by FNI analysis.

In metabolome analysis, the comprehensive analysis of metabolites, FPO analysis outputs the variation patterns of the metabolites that change. In addition, when metabolome data are combined with heterogeneous data of a different kind, such as phenotype data (bioactivity values and sensory test data covering the five senses), and the correlation network analysis of the present invention is performed, FNI analysis outputs the relationships between the phenotype and the metabolites linked to it.

Terms related to the present invention are explained in Table 1 below.

The program of the present invention is described in more detail below on the basis of its algorithm and an example. The technical scope of the present invention is not limited to the following description, and changes and modifications made as appropriate by those skilled in the art on the basis of this description are also included in the present invention.

Step or means (1): False-Positive-Out (FPO) analysis
First, to prepare the data set, the correlation matrix data between elements are input. The correlation coefficient between any two elements must be a real number from 0 to 1 inclusive. Any negative correlation coefficient is replaced with 0.
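The data preparation described above can be sketched as follows. This is a minimal illustration assuming the correlation matrix is held as a list of lists; the function name `prepare_correlation_matrix` is hypothetical.

```python
def prepare_correlation_matrix(corr):
    """Return a copy of a square correlation matrix with negatives set to 0.

    Assumes `corr` is a list of equal-length rows (an assumption of this
    sketch, not a format required by the source).
    """
    n = len(corr)
    if any(len(row) != n for row in corr):
        raise ValueError("correlation matrix must be square")
    # Negative correlation coefficients are replaced with 0, as described.
    return [[float(v) if v > 0 else 0.0 for v in row] for row in corr]


cleaned = prepare_correlation_matrix([[1.0, -0.3], [-0.3, 1.0]])
```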

1. As initial settings, set the minimum community size (a natural number p) and the maximum community size (a natural number q). The user may set these values freely; general-purpose recommended values are p = 5 and q = 50. Neither p nor q counts the Seed Vertex defined in the next item.
2. Select one element of interest. This element is called the Seed Vertex (SV).
3. Arrange the other elements in descending order of their correlation coefficient with SV. The elements in this order are called highly-correlated vertices (HV). The i-th element of HV for SV is written HV(SV,i); that is, HV(SV,i) is the element with the i-th highest correlation coefficient with SV. The correlation coefficient between SV and HV(SV,i) is written TC(SV,i).
4. Set a module containing SV and the elements HV(1) through HV(p). This module is called HM(p). Calculate the NF of HM(p) and call it NF(p). Initialize NF(SV) to NF(p), and let KM(SV) denote the module that attains NF(SV). That is, KM(SV) is the module showing the maximum NF value, NF(SV), during the FPO process for SV.
5. Let KM(j) be HM(j), where j is a natural number in the range p < j ≤ q.
6. Calculate the NF of KM(j) and call it NF(j).
7. If NF(j) is greater than NF(SV), replace NF(SV) with NF(j) and KM(SV) with KM(j).
8. Let KV(i) denote the elements in KM(j), where KV(i) is the same element as HV(i).
9. Calculate VF(i), the VF of KV(i) with respect to KM(j).
10. Remove the KV(i) showing the minimum VF(i) from KM(j) to obtain KM(j-1).
11. Repeat steps 6 to 10 for KM(j-1). The repetition continues down to KM(p+1).
12. Perform steps 5 to 11 for every possible j to obtain KM(SV), the module showing NF(SV) for SV. Each element of KM(SV) is called a kernel vertex (KV) and is written KV(l). KV includes SV. If x is the number of elements in KM(SV), then 1 ≤ l ≤ x.
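The FPO loop above can be sketched in Python as follows. This is a rough sketch under several assumptions, not the patented implementation: the scoring functions `nf` and `vf` are passed in by the caller because formulas (I) and (II) are not reproduced in this text; `toy_nf` and `toy_vf` are stand-ins invented only so the sketch runs; SV is assumed never to be removed; and "down to KM(p+1)" is read as scoring modules with j down to p+1 non-seed members.

```python
from itertools import combinations


def fpo(seed, corr, p, q, nf, vf):
    """Return (KM(SV), NF(SV)) for one seed vertex.

    corr: square matrix of correlation coefficients in [0, 1]
    nf(members) -> float and vf(v, members) -> float are caller-supplied
    scoring functions (hypothetical signatures for this sketch).
    """
    others = [v for v in range(len(corr)) if v != seed]
    # Step 3: HV, the other elements in descending order of correlation.
    hv = sorted(others, key=lambda v: corr[seed][v], reverse=True)
    q = min(q, len(hv))
    # Step 4: start from HM(p) = SV plus HV(1)..HV(p).
    best_members = [seed] + hv[:p]
    best_nf = nf(best_members)
    # Steps 5 to 12: for each j, shrink KM(j) one element at a time.
    for j in range(p + 1, q + 1):
        km = [seed] + hv[:j]
        while len(km) - 1 > p:
            score = nf(km)                      # step 6
            if score > best_nf:                 # step 7
                best_nf, best_members = score, list(km)
            # Steps 8 to 10: drop the non-seed element with minimum VF.
            worst = min(km[1:], key=lambda v: vf(v, km))
            km.remove(worst)
    return best_members, best_nf


def toy_nf(members, corr):
    """Stand-in for NF: mean pairwise correlation (NOT formula (I))."""
    pairs = list(combinations(members, 2))
    return sum(corr[a][b] for a, b in pairs) / len(pairs)


def toy_vf(v, members, corr):
    """Stand-in for VF: mean correlation to other members (NOT formula (II))."""
    rest = [u for u in members if u != v]
    return sum(corr[v][u] for u in rest) / len(rest)


corr = [
    [1.00, 0.80, 0.79, 0.10],
    [0.80, 1.00, 0.95, 0.10],
    [0.79, 0.95, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
members, score = fpo(
    0, corr, p=1, q=3,
    nf=lambda ms: toy_nf(ms, corr),
    vf=lambda v, ms: toy_vf(v, ms, corr),
)
```

With the toy scores, the weakly correlated element 3 is never kept, while elements 1 and 2 join the seed because together they raise the module score.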

Step or means (2): Modularize (module integration)
1. Let the minimum and maximum desired module sizes be p and q from the FPO process, respectively.
2. Carry over the variables from the FPO process.
3. Set a threshold VFt on the VF of each element KV(l) of KM(SV) for all possible SVs. VFt takes values in the range 0.5 ≤ VFt ≤ 0.99 in steps of 0.01.
4. Let m be the number of elements in the whole network.
5. In KM(Vj) for any element Vj of the network (where j is a natural number satisfying 1 ≤ j ≤ m), construct a network by connecting, as edges, the KVs whose VF is equal to or greater than VFt, and call it Mod(Vj,VFt).
6. Obtain Mod(Vj,VFt) for every Vj in the network.
7. Among the Mod(Vj,VFt) obtained for all Vj, connect all those that share an element, and call the reconstructed network Mod(VFt).
8. Calculate the size Nmod of each individual (isolated) module in the resulting Mod(VFt).
9. Let Nmod(VFt) be the number of Nmod values satisfying p ≤ Nmod ≤ q.
10. Calculate Nmod(VFt) for every VFt.
11. Of the modules for the VFt that gives the maximum Nmod(VFt), those of size p or more constitute the final module group Mod.
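The threshold sweep in the Modularize steps can be sketched as below. The input layout `kernel_vf` (seed mapped to a dictionary of element to VF value) and the function names are assumptions of this sketch. It also assumes that elements passing the threshold within one kernel module are all joined together, and that ties between thresholds are resolved in favour of the lowest VFt; the source does not specify either point.

```python
def connected_groups(groups):
    """Merge groups that share an element (union-find) and return the sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for group in groups:
        for v in group:
            find(v)
        for v in group[1:]:
            parent[find(v)] = find(group[0])
    merged = {}
    for v in parent:
        merged.setdefault(find(v), set()).add(v)
    return list(merged.values())


def modularize(kernel_vf, p, q):
    """Return (VFt, modules) for the VFt maximizing the count of modules
    whose size lies in [p, q]; modules smaller than p are dropped."""
    best = None
    for k in range(50):                      # VFt = 0.50, 0.51, ..., 0.99
        vft = round(0.50 + 0.01 * k, 2)
        groups = [[v for v, f in vfs.items() if f >= vft]
                  for vfs in kernel_vf.values()]
        mods = connected_groups([g for g in groups if g])
        count = sum(1 for m in mods if p <= len(m) <= q)
        if best is None or count > best[0]:
            best = (count, vft, [m for m in mods if len(m) >= p])
    return best[1], best[2]


kernel_vf = {
    "a": {"a": 0.9, "b": 0.9, "c": 0.6},
    "b": {"b": 0.9, "a": 0.9},
    "c": {"c": 0.9, "d": 0.9, "a": 0.6},
    "d": {"d": 0.9, "c": 0.9},
}
vft, modules = modularize(kernel_vf, p=2, q=3)
```

In this toy input the weak a-c link (VF 0.6) merges everything into one oversized module until VFt exceeds 0.60, at which point two modules of the desired size appear.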

Step or means (3): False-Negative-In (FNI) analysis
1. Carry over the variables from the FPO analysis and Modularize processes.
2. As initial settings, set the lower limit c of the correlation coefficient (recommended value c = 0) and the lowest rank r considered in HV(V) for any element V (recommended value r = 1000).
3. Let Mod(i) denote each module in Mod (1 ≤ i ≤ M(Mod), where M(Mod) is the number of modules in Mod).
4. Let Mod(i,j) denote the elements of Mod(i) (1 ≤ j ≤ N(Mod(i)), where N(Mod(i)) is the number of elements in Mod(i)).
5. Let RV (residual vertices) be the set of elements not included in Mod, and let RV(l) denote its elements (1 ≤ l ≤ N − N(Mod), where N is the total number of elements).
6. For each element Mod(i,j) of Mod(i), let RV(i) be the set of elements among HV(Mod(i,j),k) (1 ≤ k ≤ r) that belong to RV, and let RV(i,x) denote the elements of RV(i) (1 ≤ x ≤ N(RV(i))).
7. Calculate the VS value of RV(i,x) with respect to Mod(i), going down HV(RV(i,x)) from the top as far as rank r and only while TC(RV(i,x),y) > c holds (the element of HV(RV(i,x)) at rank y is written HV(RV(i,x),y)). Let VS(RV(i,x)) be the maximum of VS(RV(i,x),y).
8. If VS(RV(i,x)) ≥ s (where s is the lower limit (threshold) on the VS(RV(i,x)) of RV(i,x) with respect to Mod(i); the recommended value is 0.5), RV(i,x) is called a marginal vertex (MV) of Mod(i) and becomes an element of the set MV(i).
9. Perform steps 7 and 8 for every RV(i,x).
10. The elements of Mod(i) together with the elements of MV(i) are called confeito vertices (CV). If N(CV) is the size of CV, then CV(z) ∈ CV for 1 ≤ z ≤ N(CV).
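The FNI steps can be sketched as follows. The definition of VS at a given rank is not reproduced in this text, so the sketch takes it as a caller-supplied function; `toy_vs` (the fraction of a candidate's top-ranked neighbours that are module members) is a stand-in invented for illustration and should not be read as the patent's VS. All function names are hypothetical.

```python
def ranked_neighbours(v, corr):
    """HV(v): the other elements in descending order of correlation with v."""
    others = [u for u in range(len(corr)) if u != v]
    return sorted(others, key=lambda u: corr[v][u], reverse=True)


def fni(modules, corr, vs_at_rank, c=0.0, r=1000, s=0.5):
    """Return a list of (module, marginal_vertices) pairs.

    vs_at_rank(candidate, module, top_neighbours) -> float is supplied by
    the caller (hypothetical signature for this sketch).
    """
    in_module = set().union(*modules)
    result = []
    for mod in modules:
        # Step 6: residual vertices among the top-r neighbours of members.
        candidates = set()
        for member in mod:
            for u in ranked_neighbours(member, corr)[:r]:
                if u not in in_module:
                    candidates.add(u)
        marginal = set()
        for cand in candidates:
            hv = ranked_neighbours(cand, corr)[:r]
            best = 0.0
            # Step 7: scan ranks while the correlation stays above c.
            for y in range(1, len(hv) + 1):
                if corr[cand][hv[y - 1]] <= c:
                    break
                best = max(best, vs_at_rank(cand, mod, hv[:y]))
            if best >= s:                    # step 8
                marginal.add(cand)
        result.append((set(mod), marginal))
    return result


def toy_vs(candidate, module, top_neighbours):
    """Stand-in VS: share of the top neighbours that belong to the module."""
    return sum(1 for u in top_neighbours if u in module) / len(top_neighbours)


corr = [
    [1.00, 0.90, 0.90, 0.40, 0.05],
    [0.90, 1.00, 0.90, 0.40, 0.05],
    [0.90, 0.90, 1.00, 0.40, 0.05],
    [0.40, 0.40, 0.40, 1.00, 0.05],
    [0.05, 0.05, 0.05, 0.05, 1.00],
]
extended = fni([{0, 1, 2}], corr, toy_vs, c=0.1)
```

Here element 3 is only moderately correlated with the module but points almost exclusively at it, so it is added as a marginal vertex, whereas element 4 falls below the correlation limit c.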

Through the FPO, Modularize and FNI processes above, a module group composed of confeito vertices (CV) for the element of interest (SV) is finally obtained. CVs are strongly related within the module containing SV and weakly related to other modules. Through the Modularize process, a module group is obtained that contains as many modules of the desired size as possible across all elements.

An example flowchart of the above algorithm is shown in FIG. 2, and a supplementary explanation of each step in the flowchart is shown in FIG. 3.

Gene co-expression analysis was performed using public Arabidopsis thaliana DNA array data (22,746 genes × 9,942 experiments), and the resulting multivariate data were subjected to network analysis with the program of the present invention implementing the above algorithm. The aim was to isolate a module containing the transcription factors that control the biosynthetic enzyme genes for aliphatic glucosinolates, compounds found in cruciferous plants such as broccoli and radish that enhance the activity of enzymes that detoxify carcinogens. The result was similar to Figure 1 of Hirai et al., Proc Natl Acad Sci U S A. (2007) 104(15):6478-6483. That is, one regulator of aliphatic glucosinolate biosynthesis (Myb28: AT5G61420) was obtained by FPO analysis, and another regulator (Myb29: AT5G07690) was obtained by FNI analysis. The results are shown in FIG. 4.

Next, to verify the accuracy of the modules generated by network analysis with the program of the present invention, mouse microarray data were obtained from the NCBI Gene Expression Omnibus (GEO), and a comparative analysis was performed against other community extraction tools (network analysis tools).

Basic knowledge about GEO
1. GPL: a platform for gene expression data, that is, the type of microarray, next-generation sequencer, etc. from each manufacturer.
2. GSE: an experiment group of gene expression data belonging to a GPL.
3. GSM: an individual gene expression data set belonging to a GSE.
4. Experiments within a GSE are often similar to one another to some degree.
5. Therefore, on a correlation network between experiments, experiments (GSMs) belonging to the same GSE are expected to fall into the same module in the output of a community extraction tool.

Data used, calculation of correlation coefficients, and drawing of correlation networks
1. 37,013 gene expression data sets from the Affymetrix mouse microarray GPL1261 were used.
2. Cosine correlation coefficients were calculated between gene expression experiments to create a correlation matrix.
3. Correlation networks were drawn using correlation-coefficient thresholds from 0.50 to 0.99 in steps of 0.01.
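The three steps above can be illustrated with a small standard-library sketch. The function names are hypothetical, and treating a pair as an edge when its coefficient is equal to or greater than the threshold is an assumption; the source does not say whether the comparison is strict.

```python
import math


def cosine(u, v):
    """Cosine similarity between two expression profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def correlation_matrix(profiles):
    """Pairwise cosine coefficients between experiments (rows of profiles)."""
    return [[cosine(u, v) for v in profiles] for u in profiles]


def thresholds():
    """Thresholds 0.50, 0.51, ..., 0.99."""
    return [round(0.50 + 0.01 * k, 2) for k in range(50)]


def edges_at(matrix, t):
    """Edges (i, j), i < j, whose coefficient is at least t (assumed >=)."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] >= t]


profiles = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]   # toy data, not GEO data
matrix = correlation_matrix(profiles)
networks = {t: edges_at(matrix, t) for t in thresholds()}
```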

Other community extraction tools used for comparison
1. Louvain (Blondel et al., J Stat Mech, 2008): run on Pajek
2. Simulated annealing (Newman and Girvan, Phys Rev E, 2004): run on R
3. Fast greedy (Clauset et al., Phys Rev E, 2004): run on R

Comparison method: evaluate the similarity between the membership of a module (community) and that of an experiment group (see FIG. 5)
1. The F value (F-measure) is used to evaluate the similarity of membership.
2. The F-measure is an index from information science and is the harmonic mean of precision and recall.
3. Precision here is the proportion of experiments (2), those belonging to a specific experiment group (GSE), among experiments (1) and (2) contained in a given module.
4. Recall here is the proportion of experiments (1), those belonging to a specific module, among experiments (1) and (3) contained in that experiment group (GSE).
5. In the example shown in the figure, precision is 6 / 8 = 0.75 and recall is 6 / 10 = 0.60.
6. The F-measure is therefore (6 + 6) / (8 + 10) = 0.67.
7. The larger the F-measure, the more closely the module resembles the GSE experiment group.
8. The F-measure was calculated for every module of the correlation network at each threshold.
9. For each correlation network, the mean F-measure is taken as the representative value for each tool.
10. That is, the larger this mean, the more accurate the module extraction.
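The worked example above (8 experiments in the module, 10 in the GSE, 6 shared) can be checked with a short sketch. The function name `f_measure` and the toy identifiers are hypothetical.

```python
def f_measure(module, gse):
    """Return (precision, recall, F) for a module against an experiment group."""
    module, gse = set(module), set(gse)
    shared = len(module & gse)
    if shared == 0:
        return 0.0, 0.0, 0.0
    precision = shared / len(module)   # share of the module that is in the GSE
    recall = shared / len(gse)         # share of the GSE that is in the module
    f = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f


# Toy identifiers: the module holds experiments 0..7, the GSE holds 2..11,
# so 6 experiments (2..7) are shared.
precision, recall, f = f_measure(range(8), range(2, 12))
```

The harmonic mean 2PR / (P + R) reduces to 2 × 6 / (8 + 10), which matches the (6 + 6) / (8 + 10) form used in the list.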

Results of the comparative analysis
The mean F-measure for each tool is shown in Table 2 below. The program of the present invention shows the highest mean F-measure, indicating that its modules are highly accurate.

Compared with conventional network analysis, which relied solely on a correlation coefficient threshold, the program of the present invention is designed to reduce false positives while limiting missed analysis candidates. Its most notable feature is its versatility: quantitative data from any omics analysis can be used as the input file. It can therefore be expected to be used in a wide range of situations, regardless of the type of analysis.

For example, when a computer executes the program of the present invention to perform correlation network analysis on transcriptome data, metabolic enzyme genes, enzyme complex genes and the like (FPO analysis) and transcription (regulatory) factors of the metabolic enzyme genes and the like (FNI analysis) can easily be isolated and identified together. Likewise, when correlation network analysis is performed on metabolome data, metabolites related to the element of interest (activity, evaluation score, etc.) (FPO analysis) and intermediate metabolites, novel substance markers and the like (FNI analysis) can easily be selected and screened.

Furthermore, the network analysis software that implements the program of the present invention keeps this versatility while simplifying the input file format, so that anyone, including light users, can easily create an input file. In addition, it adopts a graphical user interface (GUI), making it user-friendly software in which the analysis can be carried out with simple mouse operations; no programming knowledge or command-based operation is required. The result is unique software that reconciles two seemingly contradictory goals by making correlation network analysis, an advanced statistical technique, available to many light users.

From the above, by grouping elements and measurement samples into modules, the program of the present invention can be expected to reduce the size and weight of big data such as quantitative measurements of all kinds.

Claims

A program for causing a computer to execute the following steps in order to perform correlation network analysis of multivariate data, which is comprehensive molecular information obtained by omics analysis:

(1) a False-Positive-Out (FPO) analysis step of excluding elements that have a high correlation coefficient but a low contribution to the module of interest, including the following processes:
1. a process in which the individual data (elements) of the multivariate data, the minimum module (community) size and maximum community size set as initial values, the correlation matrix data between elements, and one selected element of interest SV are input to a memory;
2. a process in which a processor sets, for the element SV, a module containing the element group obtained by arranging the other elements in descending order of their correlation coefficient with SV, based on the correlation matrix data between elements; and
3. a process in which the processor forms (selects) the module showing the maximum network F value (NF) while sequentially removing the element showing the minimum element F value (VF) from the module set in process 2, based on NF and VF, within a specific size in the range between the minimum module (community) size and the maximum community size set as initial values;

(2) a step of optimizing modules within the set module size range by eliminating overlap of module members across all elements, including the following processes:
1. a process in which, for all possible elements SV, a threshold VFt of VF arbitrarily set (selected) for each element of the modules formed in step (1) is input to the memory;
2. a process in which the processor constructs, in the module containing any given element of the network, a network in which the elements whose VF is equal to or greater than VFt are connected as edges;
3. a process in which the processor forms an integrated module group by connecting all networks that share an element, for the networks thus reconstructed for all elements of the network; and
4. a process in which the processor calculates the size of each module of the integrated module group formed for each selected VFt, and selects the integrated module group formed for the VFt that maximizes the number of integrated modules whose size falls within the specific range set in process 1; and

(3) a False-Negative-In (FNI) analysis step of forming the final module group by adding peripheral elements that have a low correlation coefficient but a high contribution to the module of interest, including the following processes:
1. a process in which a threshold s on the element specificity, with respect to each module of the integrated module group selected in step (2), of that module's peripheral elements is input to the memory; and
2. a process in which the processor adds all peripheral elements whose element specificity is equal to or greater than s to the module;

wherein the network F value (NF) is the harmonic mean of network density (ND) and network specificity (NS) and is defined by formula (I) below:
and
the element F value (VF) is the harmonic mean of element density (VD) and element specificity (VS) and is defined by formula (II) below:
(In the above formulas, e(i) is the total number of edges in the partial module structure of element i, d(i) is the degree of element i in the whole network, and n is the total number of elements in the module.)

A program for causing a computer to function as the following means in order to perform correlation network analysis of multivariate data, which is comprehensive molecular information obtained by omics analysis:

(1) False-Positive-Out (FPO) analysis means for excluding elements that have a high correlation coefficient but a low contribution to the module of interest, including the following means:
1. means for inputting the individual data (elements) of the multivariate data, the minimum module (community) size and maximum community size set as initial values, the correlation matrix data between elements, and one selected element of interest SV;
2. means for setting, for the element SV, a module containing the element group obtained by arranging the other elements in descending order of their correlation coefficient with SV, based on the correlation matrix data between elements; and
3. means for forming (selecting) the module showing the maximum network F value (NF) while sequentially removing the element showing the minimum element F value (VF) from the module set by means 2, based on NF and VF, within a specific size in the range between the minimum module (community) size and the maximum community size set as initial values;

(2) means for optimizing modules within the set module size range by eliminating overlap of module members across all elements, including the following means:
1. means for inputting, for all possible elements SV, a threshold VFt of VF arbitrarily set (selected) for each element of the modules formed by means (1);
2. means for constructing, in the module containing any given element of the network, a network in which the elements whose VF is equal to or greater than VFt are connected as edges;
3. means for forming an integrated module group by connecting all networks that share an element, for the networks thus reconstructed for all elements of the network; and
4. means for calculating the size of each module of the integrated module group formed for each selected VFt, and selecting the integrated module group formed for the VFt that maximizes the number of integrated modules whose size falls within the specific range set by means 1; and

(3) False-Negative-In (FNI) analysis means for forming the final module group by adding peripheral elements that have a low correlation coefficient but a high contribution to the module of interest, including the following means:
1. means for inputting a threshold s on the element specificity, with respect to each module of the integrated module group selected by means (2), of that module's peripheral elements; and
2. means for adding all peripheral elements whose element specificity is equal to or greater than s to the module;

wherein the network F value (NF) is the harmonic mean of network density (ND) and network specificity (NS) and is defined by formula (I) below:
and
the element F value (VF) is the harmonic mean of element density (VD) and element specificity (VS) and is defined by formula (II) below:
(In the above formulas, e(i) is the total number of edges in the partial module structure of element i, d(i) is the degree of element i in the whole network, and n is the total number of elements in the module.)

The program according to claim 1, which further (4) causes the computer to execute an output step of drawing (displaying) a network (map) containing the module group obtained in step (3), or (4) causes the computer to function as output means for drawing (displaying) the network (map) containing the module group obtained by means (3).

A method for performing correlation network analysis of multivariate data, which is comprehensive molecular information obtained by omics analysis, the method comprising:

(1) a False-Positive-Out (FPO) analysis step of excluding elements that have a high correlation coefficient but a low contribution to the module of interest, including the following processes:
1. a process in which the individual data (elements) of the multivariate data, the minimum module (community) size and maximum community size set as initial values, the correlation matrix data between elements, and one selected element of interest SV are input to a memory;
2. a process in which a processor sets, for the element SV, a module containing the element group obtained by arranging the other elements in descending order of their correlation coefficient with SV, based on the correlation matrix data between elements; and
3. a process in which the processor forms (selects) the module showing the maximum network F value (NF) while sequentially removing the element showing the minimum element F value (VF) from the module set in process 2, based on NF and VF, within a specific size in the range between the minimum module (community) size and the maximum community size set as initial values;

(2) a step of optimizing modules within the set module size range by eliminating overlap of module members across all elements, including the following processes:
1. a process in which, for all possible elements SV, a threshold VFt of VF arbitrarily set (selected) for each element of the modules formed in step (1) is input to the memory;
2. a process in which the processor constructs, in the module containing any given element of the network, a network in which the elements whose VF is equal to or greater than VFt are connected as edges;
3. a process in which the processor forms an integrated module group by connecting all networks that share an element, for the networks thus reconstructed for all elements of the network; and
4. a process in which the processor calculates the size of each module of the integrated module group formed for each selected VFt, and selects the integrated module group formed for the VFt that maximizes the number of integrated modules whose size falls within the specific range set in process 1; and

(3) a False-Negative-In (FNI) analysis step of forming the final module group by adding peripheral elements that have a low correlation coefficient but a high contribution to the module of interest, including the following processes:
1. a process in which a threshold s on the element specificity, with respect to each module of the integrated module group selected in step (2), of that module's peripheral elements is input to the memory; and
2. a process in which the processor adds all peripheral elements whose element specificity is equal to or greater than s to the module;

wherein the network F value (NF) is the harmonic mean of network density (ND) and network specificity (NS) and is defined by formula (I) below:
and
the element F value (VF) is the harmonic mean of element density (VD) and element specificity (VS) and is defined by formula (II) below:
(In the above formulas, e(i) is the total number of edges in the partial module structure of element i, d(i) is the degree of element i in the whole network, and n is the total number of elements in the module.)

The method of claim 4, further comprising (4) an output step of drawing (displaying) a network containing the module group obtained in step (3).

A computer-readable recording medium on which the program according to any one of claims 1 to 3 is recorded.