JP4578201B2

JP4578201B2 - Gene estimation apparatus, gene estimation method and program thereof

Info

Publication number: JP4578201B2
Application number: JP2004296916A
Authority: JP
Inventors: 大亮西川; 公徳嶋本
Original assignee: NS Solutions Corp
Current assignee: NS Solutions Corp
Priority date: 2004-10-08
Filing date: 2004-10-08
Publication date: 2010-11-10
Anticipated expiration: 2024-10-08
Also published as: JP2006107394A

Description

The present invention relates to a gene estimation apparatus, a gene estimation method, and a program therefor, which estimate genes related to the pharmacological activity of a specific compound.

As a method for estimating which genes a given compound has pharmacological activity against, a technique has been disclosed that uses compound structure information and a search for a target protein to estimate the gene that produces that protein. In addition, the databases published by the NCI (National Cancer Institute) include the A-Matrix, which holds information on the pharmacological activity values of compounds against cancer cells, and the T-Matrix, which holds information on the gene expression patterns of cancer cells. Techniques have also been disclosed that use the clustering results of the A-Matrix and T-Matrix, or a correlation matrix (AT-Matrix) computed from the A-Matrix and T-Matrix, to find already known relationships between compounds and genes (see, for example, Non-Patent Document 1).

Uwe Scherf and 16 others, "A gene expression database for the molecular pharmacology of cancer", Nature Genetics, Vol. 24, No. 3, March 1, 2000, pp. 236-244.

However, the first method described above requires a target protein, so genes cannot be estimated when the target protein is unknown. Furthermore, the method of Non-Patent Document 1 cannot estimate the relationship between genes and a compound whose pharmacological activity is unknown.

The present invention has been made in view of these circumstances. Its object is to provide a gene estimation apparatus, a gene estimation method, and a program therefor that can estimate genes expected to be related to the pharmacological activity of a target compound even when the target protein or the pharmacological activity of the target compound is unknown.

The present invention has been made to solve the problems described above. A gene estimation apparatus according to the present invention comprises: compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds against each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds; expression information storage means for storing expression information including the expression level of each of a plurality of types of genes in each of the plurality of types of cells; partial path information storage means for storing information on partial paths, each obtained by extracting a connection of some of the elements in the structures of various compounds; calculation means for calculating, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether the first compounds and a second compound whose pharmacological activity is unknown include each partial path; classification processing means for classifying first compounds with similar partial path presence/absence information together into clusters, identifying as a similar cluster, from among the resulting clusters, the cluster to which the first compound whose chemical structure is most similar to the second compound belongs, based on the partial path presence/absence information, and calculating the similarity between the second compound and each of the first compounds belonging to the identified similar cluster; activity value estimation means for calculating, as the estimated activity value of the pharmacological activity of the second compound against a cell, the average of the activity values of the pharmacological activity against that cell over all first compounds belonging to the similar cluster, weighted by the similarity; and gene estimation means for calculating, for each gene, as an estimated point, the average over the cells of the product of the estimated activity value of the pharmacological activity against each cell and the expression level.

As a result, the gene estimation apparatus according to the present invention can estimate, for a target compound whose pharmacological activity is unknown, genes that can be expected to be related to the pharmacological activity of that compound.

A gene estimation method according to the present invention uses a gene estimation apparatus comprising: compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds against each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds; expression information storage means for storing expression information including the expression level of each of a plurality of types of genes in each of the plurality of types of cells; partial path information storage means for storing information on partial paths, each obtained by extracting a connection of some of the elements in the structures of various compounds; calculation means; classification processing means; activity value estimation means; and gene estimation means. The method comprises: a calculation step in which the calculation means calculates, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether the first compounds and a second compound whose pharmacological activity is unknown include each partial path; a classification processing step in which the classification processing means classifies first compounds with similar partial path presence/absence information together into clusters, identifies as a similar cluster, from among the resulting clusters, the cluster to which the first compound whose chemical structure is most similar to the second compound belongs, based on the partial path presence/absence information, and calculates the similarity between the second compound and each of the first compounds belonging to the identified similar cluster; an activity value estimation step in which the activity value estimation means calculates, as the estimated activity value of the pharmacological activity of the second compound against a cell, the average of the activity values of the pharmacological activity against that cell over all first compounds belonging to the similar cluster, weighted by the similarity; and a gene estimation step in which the gene estimation means calculates, for each gene, as an estimated point, the average over the cells of the product of the estimated activity value of the pharmacological activity against each cell and the expression level.

A program according to the present invention causes a computer to function as: compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds against each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds; expression information storage means for storing expression information including the expression level of each of a plurality of types of genes in each of the plurality of types of cells; partial path information storage means for storing information on partial paths, each obtained by extracting a connection of some of the elements in the structures of various compounds; calculation means for calculating, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether the first compounds and a second compound whose pharmacological activity is unknown include each partial path; classification processing means for classifying first compounds with similar partial path presence/absence information together into clusters, identifying as a similar cluster, from among the resulting clusters, the cluster to which the first compound whose chemical structure is most similar to the second compound belongs, based on the partial path presence/absence information, and calculating the similarity between the second compound and each of the first compounds belonging to the identified similar cluster; activity value estimation means for calculating, as the estimated activity value of the pharmacological activity of the second compound against a cell, the average of the activity values of the pharmacological activity against that cell over all first compounds belonging to the similar cluster, weighted by the similarity; and gene estimation means for calculating, for each gene, as an estimated point, the average over the cells of the product of the estimated activity value of the pharmacological activity against each cell and the expression level.

According to the gene estimation apparatus, gene estimation method, and program of the present invention, genes that can be expected to be related to the pharmacological activity of a target compound can be estimated even for a target compound whose pharmacological activity is unknown.

Embodiments of the present invention will be described below.
The gene estimation apparatus according to one embodiment of the present invention takes a compound whose pharmacological activity is undetermined (a compound for which it is not yet known which cells it is effective against) and estimates genes that can be expected to be related to the pharmacological activity of that compound. Its schematic configuration is described below. FIG. 1 is a diagram showing the schematic configuration of the gene estimation apparatus of this embodiment.

In FIG. 1, reference numeral 1 denotes the gene estimation apparatus. It takes as the target compound (second compound) a compound whose pharmacological activity against, for example, cancer cells is undetermined, and estimates genes that can be expected to be related to the pharmacological activity of the target compound (hereinafter, related genes). Reference numeral 2 denotes a network, for example a communication network such as the Internet. Reference numeral 3 denotes the NCI (National Cancer Institute) database, a database published by the NCI and used in this embodiment. Specifically, the NCI database 3 stores at least expression information, which is information on the gene expression patterns of cancer cells, and compound information, which is information on the pharmacological activity values of compounds against cancer cells.

The gene estimation apparatus 1 estimates the related genes of the target compound by acquiring the expression information and compound information described above from the NCI database 3 via the network 2. Although not shown, the gene estimation apparatus 1 also includes input devices such as a mouse and keyboard and a display device such as a CRT (Cathode Ray Tube) or liquid crystal display.

Here, the gene expression pattern of cancer cells in the expression information is information on the expression level (a quantity indicating whether or not a gene is functioning) of each of a plurality of types of genes for each of a plurality of types of cancer cells. That is, a specific combination of genes (gene pattern) is expressed in a specific cancer cell. In the following description, compound information is defined as information indicating the pharmacological activity value of each of a plurality of types of compounds for each of a plurality of types of cancer cells, and compound structure information is defined as information on the structure of a compound. Specific examples of the expression information and compound information are given later.

Next, the functional configuration of the gene estimation apparatus 1 will be described. Reference numeral 11 denotes a control unit, which controls the processing units and the flow of data within the gene estimation apparatus 1. Reference numeral 12 denotes a database, which consists of an expression information database 12a that stores the expression information described above, a compound information database 12b that stores the compound information and compound structure information described above, a partial path information database 12c that stores partial paths, each obtained by extracting part of a path (a connection of elements) of various compounds, together with an assigned ID (identifier), and a related information database 12d that stores related information, which is information on the relationship between gene expression patterns and compounds linked on the basis of the expression information and the compound information.

Reference numeral 13 denotes an information registration processing unit. Via the transmission/reception processing unit 20 described later and the network 2, it acquires expression information from the NCI database 3 and registers it in the expression information database 12a, and acquires compound information from the NCI database 3 and registers it in the compound information database 12b. In this embodiment, the information registration processing unit 13 acquires from the NCI database 3 the T-Matrix (expression information), which is information on the gene expression patterns of cancer cells, and registers the necessary information in the expression information database 12a. It likewise acquires from the NCI database 3 the A-Matrix (compound information), which is information on the pharmacological activity values of compounds against cancer cells, and registers the necessary information in the compound information database 12b.

Examples of the data structure of the expression information and compound information stored in the expression information database 12a and the compound information database 12b will now be described with reference to FIGS. 2 and 3. FIG. 2 is a diagram showing an example of the data structure of the expression information database 12a shown in FIG. 1. In FIG. 2, CLID is the numerical value obtained by removing the prefix "IMAGE:" from the Clone ID and is unique to each gene. NAME is the gene name associated with the cDNA (Type) of the Clone ID. "ME:MALME-3M" and "ME:SK-MEL-28" are names of cancer cells, and the expression level of each gene is shown under each cancer cell name. CLID and NAME are defined by the T-Matrix (expression information) referenced from the NCI database 3. The expression levels shown in FIG. 2 are values for seven representative cancer cells selected from the 60 types in the NCI database 3, normalized by their mean and variance.

FIG. 3 is a diagram showing an example of the data structure of the compound information stored in the compound information database 12b shown in FIG. 1.
In FIG. 3, "NSC No." is a numerical value that identifies a compound. As in FIG. 2, "ME-MALME-3M", "ME-SK-MEL-28", and so on are names of cancer cells, and the pharmacological activity value of each compound is shown under each cancer cell name. For example, the pharmacological activity value a(ω, δ) of compound δ against cell ω is calculated by Formula 1 below.

In Formula 1 above, GI₅₀ is the growth inhibitory concentration, which here means the concentration of compound δ at which the growth of cancer cell ω is inhibited with a probability of 50%. a_average and a_sd are, respectively, the mean and variance of the pharmacological activity values of the cancer cell group for the specified compound. The pharmacological activity value a(ω, δ) obtained from Formula 1 therefore represents the growth inhibition effect of compound δ on cancer cell ω and is normalized for each compound. The compound information database 12b also stores, as compound structure information associated with the "NSC No.", information such as the compound name, structure information (chemical symbols and their connectivity), and structure diagrams (two-dimensional or three-dimensional molecular structure diagrams).
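The exact form of Formula 1 is not given in the text above. As a rough illustration only, the sketch below assumes that the raw activity is -log10(GI₅₀) and that it is standardized per compound across the cell panel. The function name `normalized_activity`, the use of -log10, and the division by the square root of the variance are all assumptions made for this example, not the patent's stated formula.

```python
import math


def normalized_activity(gi50_by_cell):
    """Standardize -log10(GI50) across cells for one compound.

    gi50_by_cell maps a cell name to the GI50 concentration of the compound.
    The -log10 transform and the z-score form are assumptions for illustration.
    """
    raw = {cell: -math.log10(gi50) for cell, gi50 in gi50_by_cell.items()}
    mean = sum(raw.values()) / len(raw)
    variance = sum((v - mean) ** 2 for v in raw.values()) / len(raw)
    spread = math.sqrt(variance)
    if spread == 0.0:
        # All cells respond identically, so no cell stands out.
        return {cell: 0.0 for cell in raw}
    return {cell: (v - mean) / spread for cell, v in raw.items()}
```

Under these assumptions a cell that is inhibited at a lower concentration than the panel average receives a positive value, and a less sensitive cell receives a negative one.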

The information registration processing unit 13 also registers information in the partial path information database 12c. Specifically, for a population of existing compounds, it computes as partial paths those connected paths of length 9 or less whose frequency of occurrence lies between 0.002 and 1.0, and registers them in the partial path information database 12c. This yields a partial path information database 12c with the data structure shown in FIG. 4. FIG. 4 is a diagram showing an example of the data structure of the partial path information database 12c shown in FIG. 1. As shown in FIG. 4, an ID (identification number) is assigned to each partial path. In this embodiment, the compound information database 12b stores about 4,000 compounds, and the partial path information database 12c stores information on about 10,000 partial paths. Paths involving hydrogen atoms are excluded when the information registration processing unit 13 computes the partial paths.
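The path extraction described above can be sketched as a depth-first walk over a molecular graph. This is a minimal sketch under stated assumptions: the input layout (`atoms`, `bonds`), the function name `enumerate_paths`, the string form of a path, and the reading of the length limit as a maximum number of atoms are all hypothetical. The frequency filter over the compound population is not shown.

```python
def enumerate_paths(atoms, bonds, max_atoms=9):
    """Return the set of linear paths in one molecule as strings such as 'C-C=O'.

    atoms: {index: element symbol}; bonds: {(i, j): bond symbol such as '-' or '='}.
    Hydrogen atoms are skipped, as in the embodiment. A path and its reverse are
    treated as the same path by keeping the lexicographically smaller string.
    """
    neighbors = {i: [] for i, element in atoms.items() if element != "H"}
    for (i, j), symbol in bonds.items():
        if i in neighbors and j in neighbors:
            neighbors[i].append((j, symbol))
            neighbors[j].append((i, symbol))

    found = set()

    def walk(path_atoms, tokens):
        if len(path_atoms) >= 2:
            forward = "".join(tokens)
            backward = "".join(reversed(tokens))
            found.add(min(forward, backward))
        if len(path_atoms) == max_atoms:
            return
        for nxt, symbol in neighbors[path_atoms[-1]]:
            if nxt not in path_atoms:  # simple paths only, no revisits
                walk(path_atoms + [nxt], tokens + [symbol, atoms[nxt]])

    for start in neighbors:
        walk([start], [atoms[start]])
    return found
```

For a heavy-atom skeleton C-C=O with one attached hydrogen, this returns the three paths `C-C`, `C=O`, and `C-C=O`.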

Reference numeral 14 denotes an association analysis processing unit. It refers to the expression information in the expression information database 12a and the compound information in the compound information database 12b, generates related information, which is information on the relationship between gene expression patterns and pharmacologically active compounds in the same cancer cell, and registers it in the related information database 12d. Specifically, the association analysis processing unit 14 links gene expression patterns to pharmacological activity values using as the key the cancer cell names that match between the expression information and the compound information, and registers the result in the related information database 12d. In doing so, it applies a lower limit ε to the pharmacological activity value and performs neither the linking nor the registration for compounds whose pharmacological activity value is equal to or below ε. In this embodiment, the lower limit is ε = −10.0.
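The linking step can be pictured as a join on the cancer cell name followed by a threshold filter. The sketch below assumes simple nested dictionaries for both inputs; the function name `build_related_info` and the output layout are hypothetical.

```python
EPSILON = -10.0  # lower limit of the pharmacological activity value in this embodiment


def build_related_info(expression, activity, epsilon=EPSILON):
    """Link expression patterns and compound activities that share a cell name.

    expression: {cell: {gene: expression level}}
    activity:   {cell: {compound: pharmacological activity value}}
    Compounds with an activity value equal to or below epsilon are not linked.
    """
    related = {}
    for cell in expression.keys() & activity.keys():
        kept = {c: a for c, a in activity[cell].items() if a > epsilon}
        related[cell] = {"expression": expression[cell], "activity": kept}
    return related
```

Cells that appear in only one of the two inputs are simply left out, which is one possible reading of using the matching cell name as the key.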

Reference numeral 15 denotes an FP calculation processing unit. Referring to the compound structure information of all compounds stored in the compound information database 12b, it calculates for each compound an FP (FingerPrint), a sequence of the digits 1 and 0 indicating whether or not the compound includes each partial path referenced from the partial path information database 12c. Specifically, the FP calculation processing unit 15 calculates, as the FP (partial path presence/absence information) of compound δ, the vector variable f(δ) (= FP) representing its structural features by evaluating Formulas 2 and 3 below.

The symbols θ, Θ, and Π(δ) in Formulas 2 and 3 are explained here. In this embodiment, a compound is regarded as an undirected graph with atoms as vertices and bonds as edges, and each element of f(δ) is treated as a binary value indicating whether or not the compound includes a specific path θ (hereinafter, partial path θ). As shown in FIG. 4, a partial path θ can be written in a form such as "C-C=O". The set Θ of partial paths θ corresponding to the elements of the vector variable f(δ) is the set of all partial paths stored in the partial path information database 12c. If Π(δ) denotes the set of those paths of compound δ that are contained in Θ, Formulas 4 and 5 below hold. In this embodiment, the number of partial paths θ (about 10,000) equals the number of 1s and 0s contained in the vector variable f(δ), that is, the length of the FP.
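Read as code, the fingerprint is a bit vector with one position per stored partial path. The sketch below is a minimal illustration that assumes the paths of the compound have already been enumerated as strings in the same notation as the stored partial paths; the function name `fingerprint` and the sample paths are hypothetical.

```python
def fingerprint(compound_paths, partial_paths):
    """Return a list of 1/0 flags, one per entry of partial_paths.

    compound_paths: iterable of path strings found in the compound.
    partial_paths:  ordered list of all stored partial paths (the set Θ).
    """
    present = set(compound_paths)
    return [1 if theta in present else 0 for theta in partial_paths]


# Example with four stored partial paths; "C-O" is not in the stored set and is ignored.
stored = ["C-C", "C=O", "C-C=O", "C-N"]
fp = fingerprint({"C-C", "C=O", "C-C=O", "C-O"}, stored)
```

The vector has the same length as the stored path list, which matches the statement that the number of bits equals the number of partial paths.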

Reference numeral 16 denotes an FP classification processing unit. It clusters the compounds stored in the compound information database 12b, identifies from among the resulting clusters the cluster that contains the compound most similar to the target compound, and calculates the similarity between the target compound and each compound belonging to the identified cluster.

First, the FP classification processing unit 16 classifies the FPs of the compounds stored in the compound information database 12b by calculating the FP similarity between pairs of compounds δ₁ and δ₂ and clustering on it. Specifically, it calculates the similarity t(δ₁, δ₂) between compounds δ₁ and δ₂ based on Formula 6 below, the "Tanimoto measure".

The similarity t(δ₁, δ₂) that the FP classification processing unit 16 obtains from Formula 6 lies in the range 0 ≤ t(δ₁, δ₂) ≤ 1, and t equals 1 when the two compounds have identical structural features in terms of their FPs.
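For binary fingerprints, the Tanimoto measure is commonly written as the number of bits set in both vectors divided by the number of bits set in at least one of them. The sketch below uses that common form, which may differ in notation from Formula 6; the function name `tanimoto` and the treatment of two all-zero fingerprints as identical (similarity 1) are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    if either == 0:
        # Neither compound contains any stored partial path (assumed: identical).
        return 1.0
    return both / either
```

With this form the value stays between 0 and 1 and reaches 1 only when the two fingerprints are identical.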

Next, the FP classification processing unit 16 performs clustering based on the similarity t obtained from Formula 6. Specifically, it clusters a set of compounds (hereinafter, set Ψ) based on the similarity t(δ₁, δ₂). The clustering algorithm used by the FP classification processing unit 16 in this embodiment is, for example, "Hierarchical nearest neighbor logic". In this logic, a group of elements whose similarity is at most half of the maximum distance T, which is the maximum value of the similarity t between elements of the set Ψ, is treated as one cluster.

Next, denoting the target compound as δ_target, the FP classification processing unit 16 uses Formulas 7 to 9 below to determine, from among the clusters of compounds associated with cell ω, the cluster ψ to which the compound δ structurally most similar to the target compound δ_target belongs. As shown in Formula 7, Ψ denotes the population of all compounds regarded as active against cell ω.

Next, for every compound δ belonging to the cluster ψ it has determined, the FP classification processing unit 16 uses Formula 6 above to obtain the similarity t(δ, δ_target) to the target compound δ_target.
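The two steps just described, choosing the cluster that holds the nearest known compound and then scoring every member of that cluster against the target, can be sketched as follows. The function names, the data layout, and the tie-breaking rule (first cluster wins) are assumptions, and the Tanimoto helper uses the common binary form rather than a form confirmed by Formulas 6 to 9.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return both / either if either else 1.0


def most_similar_cluster(target_fp, clusters, fingerprints):
    """Pick the cluster containing the compound most similar to the target.

    clusters:     non-empty list of non-empty lists of compound IDs
    fingerprints: {compound ID: 0/1 fingerprint}
    Returns the chosen cluster and the similarity of each member to the target.
    """
    best_cluster, best_score = None, -1.0
    for cluster in clusters:
        score = max(tanimoto(fingerprints[c], target_fp) for c in cluster)
        if score > best_score:
            best_cluster, best_score = cluster, score
    similarities = {c: tanimoto(fingerprints[c], target_fp) for c in best_cluster}
    return best_cluster, similarities
```

Note that the whole cluster is returned, so members that are individually less similar to the target still contribute to the later weighted average, only with smaller weights.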

Reference numeral 17 denotes an activity value estimation processing unit, which estimates the pharmacological activity value of the target compound against cell ω from the pharmacological activity values of compounds similar to the target compound. Specifically, the activity value estimation processing unit 17 calculates the estimated activity value h(ω, δ_target) of the target compound δ_target against cell ω by evaluating Formula 10 below over all compounds belonging to the cluster ψ determined by the FP classification processing unit 16. As Formula 10 shows, the unit sums, over all compounds belonging to the determined cluster ψ, the product of the similarity t(δ, δ_target) obtained by the FP classification processing unit 16 and the pharmacological activity value a obtained from Formula 1, and divides that sum by the total of the similarities t(δ, δ_target) to obtain the estimated activity value h(ω, δ_target).

In Formula 10 above, λ is a parameter; the larger its value, the more strictly the similarity between compounds is evaluated. In this embodiment, λ = 4.0.
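As a rough illustration of the similarity-weighted average described for Formula 10, the sketch below assumes that λ enters as an exponent on the similarity (so that a larger λ suppresses weakly similar compounds). Formula 10 itself is not reproduced in this text, so that placement of λ is an assumption, as are the function name `estimate_activity` and the sample values.

```python
from typing import Dict


def estimate_activity(
    similarities: Dict[str, float],  # t(delta, delta_target) per compound in the cluster
    activities: Dict[str, float],    # a(omega, delta) per compound for one cell omega
    lam: float = 4.0,                # lambda; 4.0 is the value used in the embodiment
) -> float:
    """Similarity-weighted average of known activity values.

    ASSUMPTION: the weight of each compound is similarity ** lam.
    """
    weights = {d: similarities[d] ** lam for d in similarities if d in activities}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no compound with non-zero similarity and a known activity")
    return sum(weights[d] * activities[d] for d in weights) / total


# Hypothetical cluster with one close and one distant neighbour
sims = {"d1": 1.0, "d2": 0.5}
acts = {"d1": 2.0, "d2": -2.0}
h_loose = estimate_activity(sims, acts, lam=1.0)
h_strict = estimate_activity(sims, acts, lam=4.0)
print(h_loose, h_strict)
```

Under this assumption, raising λ moves the estimate toward the activity of the most similar compound, which matches the remark that a larger λ evaluates similarity more strictly.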

Reference numeral 18 denotes a point calculation processing unit, which uses the estimated activity values h calculated by the activity value estimation processing unit 17 to calculate, for each gene, an estimated point indicating how strongly the gene is related to the target compound δ_target. Specifically, using Formula 11 below, the point calculation processing unit 18 multiplies the estimated activity value h(ω, δ_target) calculated by the activity value estimation processing unit 17 by the expression level c of a gene γ (the value after normalization in the T-Matrix), averages these products, and uses the average as the estimated point p(δ_target, γ) of the gene γ for the target compound δ_target.
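The estimated point can be sketched as below. Following the wording of claim 4, the average is taken over the set of cells Ω. How cells with missing expression data are handled is not stated in the text, so skipping them is an assumption of this sketch, and the name `estimated_point` and the sample values are hypothetical.

```python
from typing import Dict


def estimated_point(h: Dict[str, float], c_gamma: Dict[str, float]) -> float:
    """Average over cells of h(omega, delta_target) * c_gamma(omega).

    ASSUMPTION: cells lacking an expression value for the gene are skipped.
    """
    cells = [w for w in h if w in c_gamma]
    if not cells:
        raise ValueError("no cell has both an estimated activity and an expression level")
    return sum(h[w] * c_gamma[w] for w in cells) / len(cells)


# Hypothetical values: estimated activity per cell, expression of one gene per cell
h = {"cell_A": 1.0, "cell_B": -2.0, "cell_C": 0.5}
c_gamma = {"cell_A": 2.0, "cell_B": -1.0}
p = estimated_point(h, c_gamma)
print(p)
```

A gene whose expression rises and falls together with the estimated activity gets a large positive point, and one that moves in the opposite direction gets a large negative point.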

Reference numeral 19 denotes a gene estimation processing unit, which takes the absolute value of the estimated point p calculated for each gene by the point calculation processing unit 18, sorts the genes in descending order of that value, and outputs the result as the gene estimation result. In other words, the higher a gene's estimated point p, the more the gene is estimated to be one that can be expected to be related to the pharmacological activity of the target compound (a related gene). Because the estimated point p is calculated from the product of the normalized estimated activity value h and the expression level c, even a small estimated activity value h (that is, one that is large and negative) strongly affects the final estimated point p when the absolute value of the expression level c is large.
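The ranking step amounts to a sort on the absolute value of the estimated point. A minimal sketch follows; the function name `rank_genes` and the gene names are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_genes(points: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort genes in descending order of |p|, keeping the signed point for reference."""
    return sorted(points.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_genes({"gene_1": 0.2, "gene_2": -0.9, "gene_3": 0.5})
for position, (gene, point) in enumerate(ranking, start=1):
    print(position, gene, point)
```

Keeping the signed value alongside the rank preserves the direction of the correlation, which the absolute value alone would discard.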

Reference numeral 20 denotes a transmission/reception processing unit, which communicates with the NCI database 3 via the network 2. The gene estimation apparatus 1 of this embodiment has a function for connecting to the network 2 because it uses data stored in the external NCI database 3, but this is not a requirement. Instead of using an external database, the expression information and compound information may, for example, be registered and stored in advance in the internal database 12 through an input means. In that case, the gene estimation apparatus 1 does not need a function for connecting to the network 2.

Next, the process by which the gene estimation apparatus 1 shown in FIG. 1 estimates the genes related to a target compound is described with a specific example. FIG. 5 is a flowchart of that process.

As shown in FIG. 5, in step S1 the information registration processing unit 13 acquires expression information and compound information from the NCI database 3 via the network 2 and registers them in the expression information database 12a and the compound information database 12b, respectively. Specifically, the information registration processing unit 13 acquires from the NCI database 3, as expression information, the A-Matrix, a data table containing the pharmacological activity values of 4463 compounds (data exist for 4444 of them) against 60 types of cancer cells, and registers it in the expression information database 12a.

The information registration processing unit 13 also acquires the T-Matrix, a data table containing the expression levels of 9704 genes (data exist for 9073 of them) in 60 types of cancer cells, and registers it in the compound information database 12b. Because the T-Matrix and the A-Matrix write the name of the same cancer cell differently, the information registration processing unit 13 converts the names to one of the two notations (for example, ME:MALME-3M → MEL-MALME-3M).
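The cell-name unification might look like the following sketch. Only the single pair ME:MALME-3M → MEL-MALME-3M comes from the text. The general rule used here (a tissue prefix before a colon is mapped to a prefix followed by a hyphen) and the names `PREFIX_MAP` and `unify_cell_name` are assumptions for illustration.

```python
# Prefix mapping from one notation to the other.
# Only "ME" -> "MEL" is taken from the text; further entries would be needed in practice.
PREFIX_MAP = {"ME": "MEL"}


def unify_cell_name(name: str) -> str:
    """Convert 'PREFIX:CELL' to 'MAPPED_PREFIX-CELL'; leave other names unchanged."""
    prefix, sep, rest = name.partition(":")
    if not sep:
        return name  # already in the target notation (ASSUMPTION)
    return f"{PREFIX_MAP.get(prefix, prefix)}-{rest}"


print(unify_cell_name("ME:MALME-3M"))   # MEL-MALME-3M
print(unify_cell_name("MEL-MALME-3M"))  # unchanged
```

Normalizing names before any join avoids silently losing cells whose records exist in both tables under different spellings.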

The information registration processing unit 13 also acquires the compound structure information of the 4463 compounds from the NCI database 3 and registers it in the compound information database 12b. In addition, the information registration processing unit 13 registers information on partial paths in the partial path information database 12c.

Next, in step S2, the association analysis processing unit 14 links gene expression patterns with pharmacological activity values, using as the key the cancer cell names that match between the expression information and the compound information, and registers the linked data in the related information database 12d. At this time, the association analysis processing unit 14 sets a lower limit ε = -10.0 for the pharmacological activity value and performs neither the linking nor the registration for compounds whose pharmacological activity value is at or below ε.
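Step S2 is essentially a join on the cell name with a threshold filter. The sketch below illustrates that idea. The in-memory data layout, the record fields, and the name `link_records` are hypothetical and do not reflect the schema of the related information database 12d.

```python
from typing import Dict, List, Tuple

EPSILON = -10.0  # lower limit on the pharmacological activity value, from the text


def link_records(
    activity: Dict[Tuple[str, str], float],   # (cell name, compound) -> activity value
    expression: Dict[str, Dict[str, float]],  # cell name -> {gene -> expression level}
    epsilon: float = EPSILON,
) -> List[dict]:
    """Join activity values and expression patterns on the cell name,
    skipping compounds whose activity is at or below epsilon."""
    linked = []
    for (cell, compound), value in activity.items():
        if value <= epsilon or cell not in expression:
            continue
        linked.append(
            {"cell": cell, "compound": compound, "activity": value,
             "expression": expression[cell]}
        )
    return linked


activity = {
    ("MEL-MALME-3M", "cmpd_1"): 0.7,
    ("MEL-MALME-3M", "cmpd_2"): -10.0,  # at the lower limit, so it is skipped
}
expression = {"MEL-MALME-3M": {"gene_1": 1.2}}
records = link_records(activity, expression)
print(records)
```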

Next, in step S3, the FP calculation processing unit 15 refers to the compound structure information of all compounds stored in the compound information database 12b and calculates, for each compound, an FP indicating whether the compound contains each of the partial paths referenced from the partial path information database 12c. Also in step S3, the FP classification processing unit 16 calculates the similarity between compounds from the FPs calculated by the FP calculation processing unit 15, performs clustering, and identifies the cluster that contains compounds similar to the target compound. The FP classification processing unit 16 also calculates the similarity t(δ, δ_target) between each compound δ belonging to the identified cluster (compound group) and the target compound δ_target.

Next, in step S4, the activity value estimation processing unit 17 calculates the estimated activity value h(ω, δ_target) of the target compound δ_target for the cell ω from the pharmacological activity values of the compounds belonging to the cluster identified in step S3 and their similarities t(δ, δ_target). These similarities are the ones calculated by the FP classification processing unit 16 in step S3.

Next, in step S5, the point calculation processing unit 18 uses the estimated activity values h calculated by the activity value estimation processing unit 17 to calculate, for each gene, the estimated point p indicating how strongly the gene is related to the target compound δ_target. Then, in step S6, the gene estimation processing unit 19 takes the absolute value of the estimated point p calculated for each gene by the point calculation processing unit 18, sorts the genes in descending order of that value, and outputs the sorted gene group as the gene estimation result (ranking).

As described above, the gene estimation apparatus 1 of this embodiment can take as the target compound a compound whose pharmacological activity against, for example, cancer cells has not been determined, and can estimate the genes that can be expected to be related to the pharmacological activity of that compound. Unlike conventional techniques, it requires no information on the target protein.

[Demonstration experiment]

The results of an experimental simulation of the gene estimation method described above, run on real data, are described below. First, the target data are described. To make the results easy to verify, the simulation uses target data for which the relationship between compound and gene has already been reported in the literature.

FIG. 6 lists the target compounds used as target data in the experimental simulation and their expected related genes. That is, if the related genes listed for each compound in FIG. 6 can be estimated when that compound is taken as the target compound, the effectiveness of the gene estimation method of this embodiment is demonstrated.

In FIG. 6, the sign at the right end (+ or -) indicates the direction of the correlation between the pharmacological activity of the compound and the expression level of the gene. A (+) means that the pharmacological activity of the compound is higher in cells with higher expression of the gene and lower in cells with lower expression. A (-) means that the pharmacological activity is higher in cells with lower expression of the gene and lower in cells with higher expression. A blank, shown as ( ), means that the expression level of the gene is unrelated to the pharmacological activity of the compound.

In general, most genes are considered to be unrelated to most compounds. For the compound 5-fluorouracil shown in FIG. 6, however, the target protein is thymidylate synthase, yet it has been pointed out that the expression level of TYMS, the gene that produces this protein, is unrelated to the activity. TYMS was therefore adopted as a target gene.

The target data used in this experiment are described in detail below. Of the pharmacological activity values of 4463 compounds (data exist for 4444 of them) against 60 types of cancer cells that NCI publishes as the A-Matrix, the experiment uses the data that remain after removing two of the target compounds shown in FIG. 6. These two compounds are "NSC NO. 656238" and "Doxorubicin (NSC NO. 123127)". In this experiment, the data for the other two target compounds in FIG. 6, "5-fluorouracil" and "CPT-11", were not removed from the A-Matrix, although removing them would be preferable.

This experiment also uses the compound structure information of the 4463 compounds published by NCI. Hereinafter, this compound structure information as a whole is called the compound population. In linking the expression information with the compound information, the lower limit was set to ε = -10.0 as described above, so that every compound with an activity value for each cell was included in the analysis. This is because, in estimating genes from a compound, even compounds with low activity affect the result.

The results of estimation by the gene estimation method described above, based on these target data, are described below. FIG. 7 shows the results of the experimental simulation. In FIG. 7, rank A is the rank within the entire gene set (the population; 9365 genes in total). Rank B is the rank within the gene set that excludes ESTs (Expressed Sequence Tags) (4845 genes in total). Here, an EST refers to a database entry for a gene fragment whose structure has been determined but whose function is unknown. Rank C is the rank within the gene set (4134 genes in total) that excludes records whose Description field in the data set reads EST and that has expression information for 10 or more types of cancer cells.

In FIG. 7, the values in the "Estimate" column are estimated point values, and the percentages in parentheses in the rank A to C columns are the rank divided by the number of genes. As the estimated point values in FIG. 7 show, for the gene group related to the compound "NSC NO. 656238" and for the DPYD-related gene of "5-fluorouracil", the absolute values of the estimated points are relatively high and the genes appear near the top of the ranking. These genes are reported as related in a paper published by NCI, which released the data set, so the estimated points and rankings produced by the gene estimation apparatus 1 of this embodiment can be said to be valid.

As shown in FIG. 7, the ratios overall satisfy rank A ≈ rank B > rank C. This means that the gene estimation apparatus 1 of this embodiment tends to place genes that have expression information for only 10 or fewer types of cancer cells near the top. For rank C, therefore, genes were estimated on the gene set that excludes genes with expression information for only 10 or fewer types of cancer cells. As the rank C results show, for the data reported as related in the paper published by NCI, which released the data set (the data for "NSC NO. 656238" and for DPYD of "5-fluorouracil"), the genes were estimated at quite high positions in the ranking (ratios below 3%). Thus, the gene estimation apparatus 1 of this embodiment can improve the accuracy of estimation by estimating genes on a gene set that excludes genes with expression information for only 10 or fewer types of cancer cells.
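The gene filter used for rank C can be sketched as follows. The text describes the boundary in two ways ("10 or more" cell types kept, "10 or fewer" excluded), so the threshold is left as a parameter. The default of keeping genes with at least 10 cell types follows the rank C definition and is an assumption, as are the name `genes_with_enough_cells` and the data layout, in which `None` marks a missing measurement.

```python
from typing import Dict, Optional, Set


def genes_with_enough_cells(
    expression: Dict[str, Dict[str, Optional[float]]],  # gene -> {cell -> level or None}
    min_cells: int = 10,
) -> Set[str]:
    """Keep genes that have an expression value in at least min_cells cell types."""
    return {
        gene
        for gene, by_cell in expression.items()
        if sum(level is not None for level in by_cell.values()) >= min_cells
    }


# Tiny hypothetical example with a lowered threshold
expression = {
    "gene_1": {"cell_A": 1.0, "cell_B": None, "cell_C": 0.3},
    "gene_2": {"cell_A": None, "cell_B": None, "cell_C": 0.1},
}
kept = genes_with_enough_cells(expression, min_cells=2)
print(kept)
```

An average over very few cells is easily dominated by one extreme product, which is one plausible reason sparsely measured genes drift to the top of the unfiltered ranking.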

By contrast, for compound-gene combinations reported as related in papers unrelated to the NCI data set, some could be estimated to a certain extent (ratios below 15%) and others could not (ratios of 15% or more), as shown in FIG. 7.

In addition, a search was made for papers describing a relationship between a compound and genes that the gene estimation apparatus 1 ranked highly in the above experiment but that are not shown in FIG. 6. One paper or document each was found describing a relationship between the compound "5-fluorouracil" and the gene ranked 35th (0.8%) in rank C, and between "5-fluorouracil" and the gene ranked 42nd (1.0%) in rank C. Similarly, three papers or documents were found describing a relationship between the compound "CPT-11" and the gene ranked 75th (1.8%) in rank C, and four were found for "CPT-11" and the gene ranked 82nd (2.0%) in rank C.
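The percentages quoted for rank C follow from dividing each rank by the 4134 genes in that set. The short check below reproduces them; the function name `rank_ratio_percent` is hypothetical.

```python
RANK_C_GENE_COUNT = 4134  # size of the rank C gene set, from the text


def rank_ratio_percent(rank: int, gene_count: int = RANK_C_GENE_COUNT) -> float:
    """Rank divided by the number of genes, as a percentage rounded to one decimal."""
    return round(100.0 * rank / gene_count, 1)


ratios = [rank_ratio_percent(rank) for rank in (35, 42, 75, 82)]
print(ratios)  # [0.8, 1.0, 1.8, 2.0]
```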

The discovery of papers and documents that support the relationship between the compounds and the related genes that the gene estimation apparatus 1 of this embodiment ranked highly also indicates that the gene estimation method of the apparatus is effective.

In the embodiment described above, each processing unit of the gene estimation apparatus 1 shown in FIG. 1 is implemented in hardware by a memory and a CPU (central processing unit): a program for realizing the function of each processing unit is loaded into the memory and executed by the CPU. The invention is not limited to this, and part or all of the processing of each processing unit may be implemented by dedicated hardware.

The memory is assumed to consist of a computer-readable and writable recording medium: a nonvolatile memory such as a hard disk device, a magneto-optical disk device, or a flash memory; a read-only recording medium such as a CD-ROM; a volatile memory such as a RAM (Random Access Memory); or a combination of these.

As described above, each processing unit of the gene estimation apparatus 1 shown in FIG. 1 is realized by a computer executing a program. Means for supplying that program to a computer, such as a computer-readable recording medium on which the program is recorded or a transmission medium that transmits the program, can also be applied as embodiments of the present invention. A program product such as a computer-readable recording medium on which the program is recorded can likewise be applied as an embodiment of the present invention. The program, recording medium, transmission medium, and program product described above fall within the scope of the present invention.

A "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The term also covers media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that acts as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line.

The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes those functions in combination with a program already recorded in the computer system.

An embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also includes designs and the like that do not depart from the gist of the invention.

FIG. 1 is a diagram showing the schematic configuration of the gene estimation apparatus in this embodiment.
FIG. 2 is a diagram showing an example of the data structure of the expression information database 12a shown in FIG. 1.
FIG. 3 is a diagram showing an example of the data structure of the compound information stored in the compound information database 12b shown in FIG. 1.
FIG. 4 is a diagram showing an example of the data structure of the partial path information database 12c shown in FIG. 1.
FIG. 5 is a flowchart showing the process of estimating the genes related to a target compound in the gene estimation apparatus 1 shown in FIG. 1.
FIG. 6 is a diagram listing the target compounds used as target data in the experimental simulation and their expected related genes.
FIG. 7 is a diagram showing the results of the experimental simulation.

Explanation of symbols

1 Gene estimation apparatus
2 Network
3 NCI database
11 Control unit
12 Database
12a Expression information database
12b Compound information database
12c Partial path information database
12d Related information database
13 Information registration processing unit
14 Association analysis processing unit
15 FP calculation processing unit
16 FP classification processing unit
17 Activity value estimation processing unit
18 Point calculation processing unit
19 Gene estimation processing unit
20 Transmission/reception processing unit

Claims

A gene estimation apparatus comprising:
compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds for each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds;
expression information storage means for storing expression information including the expression level of each of a plurality of types of genes for each of the plurality of types of cells;
partial path information storage means for storing information on partial paths, each extracted as a connection of some of the elements in the structures of various compounds;
calculation means for calculating, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether each of the first compounds and a second compound whose pharmacological activity is unknown contains the partial paths;
classification processing means for grouping the first compounds having similar partial path presence/absence information into clusters, identifying, from among the clusters and based on the partial path presence/absence information, the cluster containing the first compound whose chemical structure is most similar to that of the second compound as a similar cluster, and calculating the similarity between the second compound and each of the first compounds belonging to the identified similar cluster;
activity value estimation means for calculating, over all the first compounds belonging to the similar cluster, the average of the activity values of the pharmacological activity for a cell weighted by the similarity, as the estimated activity value of the pharmacological activity of the second compound for that cell; and
gene estimation means for calculating, for each gene, the average of the products of the estimated activity value of the pharmacological activity for each cell and the expression level, as an estimated point.

The gene estimation apparatus according to claim 1, wherein the similarity is calculated with the first compound denoted by δ₁, the second compound denoted by δ₂, and the partial path presence/absence information for a compound δ denoted by a vector variable f(δ).

The gene estimation apparatus according to claim 1, wherein the activity value estimation means calculates the estimated activity value by Equation 2, where the cell is ω, the pharmacological activity value of the first compound δ₁ for the cell ω is a(ω, δ₁), the similarity between the first compound δ₁ and the second compound δ₂ is t(δ₁, δ₂), the cluster is ψ, and λ is a parameter.

The gene estimation apparatus according to claim 3, wherein the gene estimation means calculates the estimated point by Equation 3, where the gene is γ, the estimated activity value of the second compound δ₂ for the cell ω is h(ω, δ₂), the expression level of the gene γ is c_γ, and the set of cells is Ω.

A gene estimation method using a gene estimation apparatus that comprises compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds for each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds; expression information storage means for storing expression information including the expression level of each of a plurality of types of genes for each of the plurality of types of cells; partial path information storage means for storing information on partial paths, each extracted as a connection of some of the elements in the structures of various compounds; calculation means; classification processing means; activity value estimation means; and gene estimation means, the method comprising:
a calculation step in which the calculation means calculates, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether each of the first compounds and a second compound whose pharmacological activity is unknown contains the partial paths;
a classification processing step in which the classification processing means groups the first compounds having similar partial path presence/absence information into clusters, identifies, from among the clusters and based on the partial path presence/absence information, the cluster containing the first compound whose chemical structure is most similar to that of the second compound as a similar cluster, and calculates the similarity between the second compound and each of the first compounds belonging to the identified similar cluster;
an activity value estimation step in which the activity value estimation means calculates, over all the first compounds belonging to the similar cluster, the average of the activity values of the pharmacological activity for a cell weighted by the similarity, as the estimated activity value of the pharmacological activity of the second compound for that cell; and
a gene estimation step in which the gene estimation means calculates, for each gene, the average of the products of the estimated activity value of the pharmacological activity for each cell and the expression level, as an estimated point.

A program for causing a computer to function as:
compound information storage means for storing compound information, which is information on the activity values of the pharmacological activity of a plurality of types of first compounds for each of a plurality of types of cells, and compound structure information, which is information on the chemical structures of the first compounds;
expression information storage means for storing expression information including the expression level of each of a plurality of types of genes for each of the plurality of types of cells;
partial path information storage means for storing information on partial paths, each extracted as a connection of some of the elements in the structures of various compounds;
calculation means for calculating, based on the compound information referenced from the compound information storage means and the information on the partial paths referenced from the partial path information storage means, partial path presence/absence information indicating whether each of the first compounds and a second compound whose pharmacological activity is unknown contains the partial paths;
classification processing means for grouping the first compounds having similar partial path presence/absence information into clusters, identifying, from among the clusters and based on the partial path presence/absence information, the cluster containing the first compound whose chemical structure is most similar to that of the second compound as a similar cluster, and calculating the similarity between the second compound and each of the first compounds belonging to the identified similar cluster;
activity value estimation means for calculating, over all the first compounds belonging to the similar cluster, the average of the activity values of the pharmacological activity for a cell weighted by the similarity, as the estimated activity value of the pharmacological activity of the second compound for that cell; and
gene estimation means for calculating, for each gene, the average of the products of the estimated activity value of the pharmacological activity for each cell and the expression level, as an estimated point.