JP2006323830A

JP2006323830A - System for searching candidate gene related to phenotype

Info

Publication number: JP2006323830A
Application number: JP2006081255A
Authority: JP
Inventors: Susumu Tanaka; 進田中; A Barrero Roberto; Ａバレロロベルト; Takayuki Taniya; 貴之谷家; Akitoshi Maekawa; 陽俊前川; Hideki Hanaoka; 秀樹花岡; Tadashi Imanishi; 規今西; Takashi Gojobori; 孝五條堀
Original assignee: C'S LAB Ltd; National Institute of Advanced Industrial Science and Technology AIST; Japan Biological Informatics Consortium; Cs Lab Co Ltd
Current assignee: C'S LAB Ltd; National Institute of Advanced Industrial Science and Technology AIST; Japan Biological Informatics Consortium; Cs Lab Co Ltd
Priority date: 2005-04-18
Filing date: 2006-03-23
Publication date: 2006-11-30

Abstract

PROBLEM TO BE SOLVED: To identify, from a group of genes not yet linked to a given kind of information (such as a disease), new genes or candidate genes that are related to that information.

SOLUTION: The system comprises a score computing means and an evaluation value computing means. The score computing means uses (i) a training set, which consists of a group of genes already known to be related to the information and includes gene-identifying data for each gene, and (ii) a database that associates gene-identifying data with annotation data on the identified gene. From these it computes, for the annotation data on the genes included in the database, a plurality of scores indicating their relation to the information. The evaluation value computing means then computes, from the plurality of scores, an evaluation value indicating the relation of each gene to the information.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a gene analysis apparatus that uses gene annotation information to identify novel genes, or candidate genes, related to specific information (for example, a disease).

Information on protein function has traditionally been described in natural language within each protein database, so the way a given function is expressed does not necessarily agree from one database to another. To solve this problem, a database called Gene Ontology (hereinafter, GO) has recently been constructed (http://www.geneontology.org/) that describes information on protein function in a consistent, unified manner. In GO, each function of a protein is expressed uniformly using a hierarchical structure.

The G2D method (Perez-Iratxeta C et al. Nat Genet 2002(31)316-319: Non-Patent Document 1) is known as a method that uses GO to search for relationships between gene function and specific diseases. The G2D method quantifies the relationship between "specific disease symptoms [category C information in MeSH terms (http://www.nlm.nih.gov/mesh/meshhome.html)]" and "GO information annotated to causative genes" by mediating it through "the relationship between each of these elements and chemical compounds (MeSH D)". The strength of the relationship to the disease is defined as the frequency with which the "specific relationship" occurs relative to "all relationships appearing in the literature".

The Phenotype Cluster method (Freudenberg J et al. Bioinformatics 2002(18)S110-S115: Non-Patent Document 2) is another known method that uses GO to search for relationships between gene function and specific diseases. The Phenotype Cluster method quantifies the relationship between "specific symptoms" and "GO information annotated to causative genes" by mediating it through "the relationship between each of these elements and the characteristics of the symptoms (classifications such as time of onset and site of onset)". The strength of the relationship is defined as the frequency with which the "specific relationship" occurs relative to "the relationships appearing across all disease-related genes".

The POCUS method (Turner FS et al. Genome Biol 2003(4)R75: Non-Patent Document 3) is a further known method that uses GO to search for relationships between gene function and specific diseases. The POCUS method takes the relationship between "specific symptoms" and "InterPro Identifier (http://www.ebi.ac.uk/interpro/) and GO information annotated to causative genes", and defines its strength as the frequency with which the "specific relationship" occurs relative to "relationships that arise by chance in a simulation".

However, these methods have two major problems. First, they depend heavily on GO, which is a limited source of information. Genes carry many annotations besides GO, but none of that other annotated information is used. Second, the reference against which the strength of a specific relationship is defined is very limited. The G2D method compares against "relationships appearing in the literature", the Phenotype Cluster method against "relationships appearing in disease-related genes", and the POCUS method against a simulation; none of them can compare against "all, or almost all, genes".

In addition, these methods cannot identify novel genes or candidate genes related to a specific disease from a group of genes for which no relationship to that disease has yet been shown.

Non-Patent Document 1: Perez-Iratxeta C et al. Nat Genet 2002(31)316-319
Non-Patent Document 2: Freudenberg J et al. Bioinformatics 2002(18)S110-S115
Non-Patent Document 3: Turner FS et al. Genome Biol 2003(4)R75

In view of the circumstances described above, an object of the present invention is to provide a gene analysis apparatus that can identify, from a group of genes not associated with a given kind of information such as a disease, novel genes or candidate genes that are related to that information.

The present invention, which achieves the above object, includes the following.
(1) A gene information analysis apparatus comprising: score calculation means that uses a training set, which consists of a group of genes known in advance to be related to gene-related information and includes gene identification data identifying those genes, together with a database that associates gene identification data with annotation data on the gene identified by that gene identification data, and that calculates, for the annotation data of the genes included in the database, a plurality of scores indicating relevance to the information; and evaluation value calculation means that calculates, from the plurality of scores calculated by the score calculation means, an evaluation value indicating the relevance of the gene to the information.
(2) The apparatus according to (1), wherein, for annotation data associated with the gene subject to score calculation, the score calculation means calculates the score from the frequency of occurrence of that annotation data in the training set and the frequency of occurrence of that annotation data among the genes included in the data set.
(3) The apparatus according to (1) or (2), wherein the score calculation means calculates the score for the annotation data of all genes included in the database.
(4) The apparatus according to (1), wherein the score calculation means calculates a score based on the homology between the base sequence data of the gene subject to score calculation and the base sequence data of the genes constituting the training set.
(5) The apparatus according to (1), wherein the score calculation means calculates, from the parameters contained in each set of expression pattern data, correlation coefficients between the expression pattern data of the gene subject to score calculation and the expression pattern data of the genes constituting the training set, and calculates a score based on the highest correlation coefficient.
(6) The apparatus according to (1), wherein the gene identification data included in the training set is associated with a basic value that quantifies the relevance between the gene identified by the gene identification data and the information.
(7) The apparatus according to (1), further comprising result output means that outputs, in association with each other, the gene identification data included in the data set obtained by excluding the training set from the data set in the database and the evaluation value calculated for the gene identified by that gene identification data.
(8) The apparatus according to (1), wherein the evaluation value calculation means calculates an evaluation value for determining whether the gene subject to score calculation is closer to the gene group constituting the training set or to the gene group constituting the data set excluding the training set.
(9) The apparatus according to (1), wherein, for the gene subject to score calculation, the evaluation value calculation means obtains the Mahalanobis generalized distance to the gene group constituting the training set and the Mahalanobis generalized distance to the gene group constituting the data set excluding the training set, and calculates the evaluation value as the difference between the two distances.
(10) The apparatus according to (1), wherein the gene-related information is disease type information.
(11) The apparatus according to (1), wherein the score calculation means calculates a score for one or more kinds of data selected from base sequence homology data, InterPro ID data, enzyme number data, metabolic pathway data, molecular function data, biological process data and subcellular component data in GO data, and expression pattern data.
(12) The apparatus according to (1), wherein the score calculation means obtains the score for the annotation data associated with the genes included in the training set, stores the scores in storage means as a table in which each score is associated with its annotation data, searches the table using the annotation data associated with the gene subject to score calculation as a key to read out a score, and uses the score read out as the score of the annotation data for that gene.
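The evaluation value of item (9) can be illustrated with a small sketch. The following pure-Python code is only an illustration under stated assumptions: it treats each gene as a vector of scores, uses the squared Mahalanobis distance with a sample covariance matrix, and takes the difference as "distance to the rest minus distance to the training set" so that a positive value means the gene lies closer to the training set. The function names, the sign convention, and the toy data are all hypothetical and are not taken from the publication.

```python
from statistics import mean


def covariance_matrix(rows):
    """Return (mean vector, sample covariance matrix) of equal-length score vectors."""
    n, d = len(rows), len(rows[0])
    mu = [mean(r[j] for r in rows) for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def invert(m):
    """Gauss-Jordan inverse of a small square matrix; raises if it is singular."""
    d = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(m)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis_sq(x, rows):
    """Squared Mahalanobis distance from score vector x to the group `rows`."""
    mu, cov = covariance_matrix(rows)
    inv = invert(cov)
    diff = [a - b for a, b in zip(x, mu)]
    d = len(x)
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(d) for j in range(d))


def evaluation_value(x, training, rest):
    """Assumed sign convention: positive means x is closer to the training group."""
    return mahalanobis_sq(x, rest) - mahalanobis_sq(x, training)


# Toy score vectors (two scores per gene); the values are made up for illustration.
training = [[2.0, 1.0], [3.0, 2.5], [2.5, 1.5], [3.5, 2.0]]
rest = [[0.0, 0.1], [0.5, 0.0], [0.2, 0.4], [0.1, 0.3]]
print(evaluation_value([3.0, 2.0], training, rest) > 0)  # candidate resembles training set
```

With real data, the covariance matrix of a group can be singular (for example when two scores are perfectly correlated); this sketch simply raises an error in that case.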

The present invention can also be applied to a gene information providing apparatus that provides the analysis result data set constructed by the gene analysis apparatus described above. That is, a gene information providing apparatus to which the present invention is applied comprises management means that manages, as input data, gene-related information entered through input means or a communication network, and result output means that outputs the analysis result data set constructed by the gene information analysis apparatus.

The gene analysis apparatus according to the present invention can identify, from a group of genes not associated with a given kind of information such as a disease, novel genes or candidate genes that are related to that information. In particular, because the apparatus calculates a plurality of scores from the annotation data associated with the genes in the training set and the annotation data associated with the genes in the data set obtained by removing the training set from the full data set, the analysis is comprehensive and objective.

The present invention is described in detail below with reference to the drawings.
As shown in FIG. 1, the gene analysis apparatus 1 according to the present invention is connected to a user terminal 3 via a communication network 2 such as the Internet, and can exchange data with the user terminal 3 over the communication network 2. The gene analysis apparatus 1 can thereby provide gene analysis results to the user terminal 3 via the communication network 2. The user operates the user terminal 3 to enter gene-related information into the gene analysis apparatus 1. On receiving this input, the gene analysis apparatus 1 provides the user terminal 3 with a group of novel genes, or a group of candidate genes, related to that information.

Here, examples of gene-related information for human genes include disease type information, ethnic and regional background, and immune response information. Examples for plant genes include pathological condition information, information on tolerance to various stresses, growth and breeding information, information on factors for acquiring pest resistance, and environmental adaptation information. For plant genes, and rice genes in particular, further examples are phenotypes caused by multiple factors (multifactorial phenotypes), such as pollination, fertilization, and fertility.

The following description uses human disease type information as the example of gene-related information. That is, in the description below, the gene analysis apparatus 1 provides novel genes, or candidate genes, related to a human disease. The gene analysis apparatus 1 according to the present invention is not, however, limited to disease-related genes, and can be applied to the various kinds of information mentioned above.

The disease information is not particularly limited. Examples include multifactorial diseases such as rheumatoid arthritis, prostate cancer, pancreatic cancer, hypertension, type 2 diabetes, schizophrenia, and bronchial asthma. The disease is preferably one whose cause lies mainly in genetic abnormalities such as protein mutations, chromosomal abnormalities, or polymorphisms.

As shown in FIG. 2, the gene analysis apparatus 1 comprises: a central processing unit 4 that controls the operation of the entire apparatus and performs data calculations; an HDD 5 having a hard disk that stores the program to which the present invention is applied and various data; input means 6, such as a keyboard and mouse, for entering information and program execution instructions; transmission/reception means 7 that exchanges information with external terminal devices and databases via the communication network 2; display means 8 such as a display; a memory 9 in which various data and programs are recorded; and storage means 10 that stores various data.

The gene analysis apparatus 1 stores, in the storage means 10, an analysis result data set containing gene-related information (disease type information) and the group of novel genes or candidate genes related to that information. Alternatively, it may have a separate database (not shown) that stores the analysis result data set. The analysis result data set may also include, for each gene in the novel gene group and/or the candidate gene group, an associated evaluation value indicating its relevance to the disease. The database may be stored in the HDD 5 or the storage means 10 of the gene analysis apparatus 1, or it may be located externally, provided that it can be accessed via the transmission/reception means 7.

A program is installed on the gene analysis apparatus 1 that accesses the database using the specific disease type information received by the transmission/reception means 7 as a key, reads from the database the analysis result data set associated with that disease type information, and outputs it to the user terminal 3.
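As a rough illustration of this keyed lookup, the sketch below models the analysis result database as an in-memory dictionary. The store layout, the gene identifiers, the evaluation values, and the function name are hypothetical; the publication does not specify how the database is implemented.

```python
# Hypothetical in-memory stand-in for the analysis result database:
# disease type -> list of (gene identifier, evaluation value).
ANALYSIS_RESULTS = {
    "prostate cancer": [("GENE_B", 1.1), ("GENE_A", 2.4)],
}


def lookup_results(disease_type, store=ANALYSIS_RESULTS):
    """Return the result set for a disease type, highest evaluation value first."""
    key = disease_type.strip().lower()
    return sorted(store.get(key, []), key=lambda rec: rec[1], reverse=True)


print(lookup_results("Prostate cancer"))
```

Sorting by evaluation value is a presentation choice made here for convenience, not something the text requires.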

The program that constructs the analysis result data set for specific disease type information (hereinafter, "this program") is described below. Prior to executing this program, the gene analysis apparatus 1 constructs a training set consisting of a group of genes known in advance to be related to the specific disease type information, and stores it in, for example, the storage means 10. The training set can be constructed using, for example, other databases that store disease type information in association with gene identification data. More specifically, the Online Mendelian Inheritance in Man (hereinafter, OMIM) database, which stores human genes in association with hereditary diseases, and the Entrez Gene database, the successor to LocusLink, can be used.

To construct a training set, these databases are first searched using the disease type information as a key, which identifies the group of known genes related to that disease type information. Next, a database of accumulated literature information is used to quantify the relevance to the disease of each gene in the group. Specifically, a basic value indicating relevance to the disease is calculated for each gene, as described below. The calculated basic value is associated with the individual gene in the training set.

The basic value of each gene is calculated by searching the literature database using gene identification data, such as the gene name or identification number, as a key, evaluating the relationship between the gene and the disease as described in each retrieved document, and converting the result of that evaluation into a number. The basic value calculated for each gene is thus an objective quantification of the relationship between a specific disease and that gene.

More specifically, a search is made to determine whether the literature information associated with the genes retrieved using the disease type information as a key contains search terms known as MeSH terms (Medical Subject Headings terms). For example, the OMIM database is searched using the MeSH terms "Rheumatoid Arthritis" and "juvenile rheumatoid arthritis" as keys for rheumatoid arthritis, and the MeSH terms "Prostatic Neoplasms" and "Intraepithelial Prostatic Neoplasia" as keys for prostate cancer.

Because MeSH terms are not attached to every abstract, words other than MeSH terms may also be used for the search. For prostate cancer, for example, the OMIM database may be searched using "prost*" together with cancer-related words (for example, carcinoma, tumour, cancer, adenocarcinoma, neoplasm) as search terms.
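A free-text filter of this kind can be approximated with regular expressions. The sketch below is illustrative only: it applies the "prost*" wildcard and the listed cancer words to a plain text string rather than to the real OMIM search interface, and the pattern also accepts the spelling "tumor" and plural forms, which is an assumption added here.

```python
import re

# "prost*" wildcard: any word starting with "prost" (case-insensitive).
STEM = re.compile(r"\bprost\w*", re.IGNORECASE)

# Cancer-related words from the text; "tumou?r" and the optional plural "s"
# are conveniences assumed for this sketch.
CANCER_WORDS = re.compile(
    r"\b(carcinoma|tumou?r|cancer|adenocarcinoma|neoplasm)s?\b", re.IGNORECASE
)


def matches_prostate_cancer(text):
    """True if the text contains both a 'prost*' word and a cancer-related word."""
    return bool(STEM.search(text)) and bool(CANCER_WORDS.search(text))


print(matches_prostate_cancer("Prostatic adenocarcinoma susceptibility locus"))
```

Note that a bare "prost*" stem also matches unrelated words such as "prostaglandin", which is why it is combined with a cancer-related word.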

The search yields gene identification data (gene names or gene identification numbers) related to the disease type information, together with the associated literature information. Next, the literature information obtained is used as a key to search a database of accumulated literature (the PubMed database), and the documents describing each gene are extracted. If the search returns multiple items of gene identification data, documents are extracted for all of the genes.

Each extracted document is then checked to see how strongly it describes the target gene as being related to the actual disease. Based on the result, a score evaluation value indicating the relationship between the target gene and the disease is set for each extracted document. For example, when extracting prostate cancer-related genes, 297 items of gene identification data (OMIM IDs) can be obtained from the OMIM database using "Prostatic Neoplasms" and "Intraepithelial Prostatic Neoplasia" as keys. The OMIM database also contains literature information that does not show a direct relationship to the disease, such as gene cloning information from disease cDNA libraries and information on genes located in disease susceptibility regions. Such literature information can therefore be regarded as false positives when evaluating genes as causative genes of prostate cancer. As another example, the KANGAI 1 (OMIM ID: 600623) gene has already been shown to affect prostate cancer, but the reports by Ichikawa et al. from 1991 and 1992 (Ichikawa T et al. Cancer Res 1991(51)3788-3792, Ichikawa T et al. Cancer Res 1992(52)3486-3490) do not show that relationship. As the score evaluation value, for example, 1 to 3 can be assigned to documents that affirm the relationship, 0 to documents that do not mention it, and a negative value to documents that clearly deny it.

Next, the mean of the score evaluation values set for the individual pieces of literature information is calculated and used as the basic value indicating the relationship between the disease and the specific gene. When n pieces of literature information are retrieved for a given gene and a score evaluation value Xi (i = 1 to n) is set for each of them, the basic value is expressed by formula (e1), where a denotes the number of genes retrieved as being related to the disease.
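Formula (e1) itself is not reproduced in this text. Going only by the description (the basic value is the mean of the per-document score evaluation values Xi), a minimal sketch, assuming a plain arithmetic mean (X1 + ... + Xn) / n computed separately for each gene, might look as follows. The function name and the example scores are hypothetical, and the value -1 for a denying document is an assumption, since the text only says "negative".

```python
def basic_value(scores):
    """Arithmetic mean of the per-document score evaluation values for one gene."""
    if not scores:
        raise ValueError("at least one scored document is required")
    return sum(scores) / len(scores)


# One gene with four documents: two affirm the relationship (3 and 2),
# one does not mention it (0), one denies it (assumed value -1).
print(basic_value([3, 2, 0, -1]))
```

Under this reading, a gene supported by many neutral or contradicting documents ends up with a low basic value even if a few documents affirm the relationship.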

The gene analysis apparatus 1 stores training sets, including basic values, for various diseases on, for example, a hard disk or in a database. That is, the gene analysis apparatus 1 holds a training set for each of the various kinds of gene-related information mentioned above. The functional configuration of the gene analysis apparatus 1 is shown in FIG. 3. This program causes the apparatus having the hardware resources shown in FIG. 2 to function as score calculation means 20, evaluation value calculation means 21, and analysis result output means 22. The gene analysis apparatus 1 is also connected to a human gene database 23 in which various annotation data are integrated. As shown in FIG. 4, the data in a training set consist of gene identification data, which identify individual genes in the human gene database 23 used by this program, and the basic value for the gene identified by each item of gene identification data.
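The training-set layout described here (gene identification data paired with a basic value, one set per disease) could be represented as in the sketch below. The class name, field names, and identifiers are hypothetical placeholders; FIG. 4 is not available in this text, so the real record layout may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingEntry:
    gene_id: str        # identifier of the gene in the human gene database
    basic_value: float  # literature-derived relevance to the disease


def build_training_sets(rows):
    """Group (disease, gene_id, basic_value) rows into one training set per disease."""
    sets = {}
    for disease, gene_id, value in rows:
        sets.setdefault(disease, []).append(TrainingEntry(gene_id, value))
    return sets


# Placeholder rows; the gene identifiers and values are made up.
sets = build_training_sets([
    ("prostate cancer", "GENE_0001", 2.5),
    ("prostate cancer", "GENE_0002", 1.0),
    ("rheumatoid arthritis", "GENE_0003", 3.0),
])
print(len(sets["prostate cancer"]))
```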

The gene analysis apparatus 1 either contains the human gene database 23 or can access it and read the data it holds. The human gene database 23 is not particularly limited; one example is the H-Invitational Database (hereinafter, H-InvDB). H-InvDB is described in detail in Imanishi T et al. PLoS Biol 2004(2)856-875 and is available over the Internet. For each registered gene, H-InvDB stores in integrated form, as annotation data, GO information, gene structure, function, alternative splicing isoforms, non-coding functional RNAs, functional domains, subcellular localization, SNP mapping and microsatellite motifs, gene expression profiles, and the results of comparisons with other species.

This program first causes a computer that holds the training set and can access the human gene database 23 to function as the score calculation means 20. The score calculation means 20 of the gene analysis apparatus 1 uses the training set described above and the full data set contained in the human gene database 23 (for example, H-InvDB; hereinafter, the reference data set) to calculate a score for each item of annotation data associated with the genes in the training set.

The annotation data used in the calculation depend on the annotation items stored in the database being used. H-InvDB is used as the example here, but a database with different annotation items can also be used. In that case, the score calculation methods described below will differ accordingly.

ここでスコアとは、トレーニングセットに含まれる遺伝子に関連付けられたアノテーションデータが、疾患種別情報毎に構築されたトレーニングセットの特異性を示しているかを示す値である。演算するスコアを以下に例示するが、本発明の技術的範囲は以下のスコアに限定されるものではない。 Here, the score is a value indicating whether the annotation data associated with the gene included in the training set indicates the specificity of the training set constructed for each disease type information. Although the score to calculate is illustrated below, the technical scope of this invention is not limited to the following scores.

遺伝子配列の相同性に基づくスコア Score based on gene sequence homology
ヒト遺伝子データベース２３には、所定の遺伝子について統合されたアノテーションデータとして相同性に関連するアノテーションデータが含まれる。具体的には、所定の遺伝子について、他の遺伝子との相同性に関する、いわゆるe-valueがアノテーションデータとして関連付けられている。他の遺伝子としては、例えばParalogous gene（傍系遺伝子）が含まれる。傍系遺伝子は同じ生物種内にある遺伝子で、共通な祖先から重複とその結果生じた変異によって生じたものである。その高い相同性から、重複によって生じたもう片方の遺伝子との高いタンパク機能の類似性を持つと考えられている。 The human gene database 23 includes homology-related annotation data as part of the integrated annotation data for a given gene. Specifically, for a given gene, so-called e-values relating to homology with other genes are associated as annotation data. Such other genes include, for example, paralogous genes. Paralogous genes are genes within the same species that arose from a common ancestor through duplication and subsequent mutation. Because of this high homology, a paralogous gene is thought to have a protein function highly similar to that of the other gene produced by the duplication.

本スコアを演算するにあたって、トレーニングセットに含まれる遺伝子に対する傍系遺伝子を、ヒト遺伝子データベース２３に含まれるe-valueが閾値（例えば100）を超えるものとして定義した。すなわち、e-valueが閾値未満である場合には、傍系遺伝子ではなくスコアを演算しないものとした。 In calculating this score, a paralogous gene of a gene included in the training set was defined as a gene whose e-value in the human gene database 23 exceeds a threshold (for example, 100). That is, when the e-value is below the threshold, the gene is not treated as a paralogous gene and no score is calculated for it.

具体的には、先ず、トレーニングセットに含まれる遺伝子識別データをキーとしてヒト遺伝子データベース２３を検索し、当該遺伝子識別データで特定される遺伝子に関連付けられた傍系遺伝子（所定の閾値以上のe-valueを示す遺伝子）の遺伝子識別データ及び当該傍系遺伝子について関連付けられたe-valueを読み出す。次に、読み出したe-valueと、読み出した遺伝子識別データで特定される遺伝子を傍系遺伝子とした遺伝子に関連付けられた基礎値を読み出し、下記式に従ってスコア（S_p,x）を演算する。 Specifically, the human gene database 23 is first searched using the gene identification data included in the training set as a key, and the gene identification data of each paralogous gene (a gene showing an e-value at or above the predetermined threshold) associated with the gene specified by that gene identification data is read out together with the e-value associated with that paralogous gene. Next, the score (S_{p,x}) is calculated according to the following equation from the read e-value and the basic value associated with the training-set gene for which the read gene is a paralog.

スコア演算手段２０は、トレーニングセットに含まれる全ての遺伝子に関連付けられた全ての傍系遺伝子についてこのスコア（S_p,x）を演算する。得られたスコア（S_p,x）は、傍系遺伝子を示す遺伝子識別データに関連付けられたテーブルとして記憶手段等に格納される。なお、参照データセットに含まれる遺伝子が上記テーブルに傍系遺伝子として格納されている場合、上記スコア（S_p,x）は、当該参照データセットに含まれる遺伝子がトレーニングセットに含まれる遺伝子に対してどれだけ類似性があるかを示している。 The score calculation means 20 calculates this score (S_{p,x}) for every paralogous gene associated with every gene included in the training set. The resulting scores (S_{p,x}) are stored in the storage means or the like as a table keyed by the gene identification data of the paralogous genes. When a gene included in the reference data set is stored in this table as a paralogous gene, the score (S_{p,x}) indicates how similar that gene is to a gene included in the training set.

スコア演算手段２０は、次に、参照データセットに含まれる遺伝子の遺伝子識別データをキーとして上記テーブルを検索し、当該遺伝子が傍系遺伝子として含まれている場合には上記テーブルに格納されたスコアを、当該遺伝子のスコアとしてセットする。なお、スコア演算手段２０は、検索した遺伝子が傍系遺伝子として含まれていない場合には、当該遺伝子のスコアを０にセットする。 Next, the score calculation means 20 searches the table using the gene identification data of each gene included in the reference data set as a key; when the gene is present in the table as a paralogous gene, the score stored in the table is set as the score of that gene. When the searched gene is not present as a paralogous gene, the score calculation means 20 sets the score of that gene to 0.
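The table construction and lookup described for the homology score can be sketched as follows. This is only an illustration under stated assumptions: the formula that combines the e-value with the basic value is not reproduced in this text, so it is passed in as a hypothetical `score_fn`; the record layout, the function names, and the rule of keeping the largest score when one paralog is reached from several training genes are likewise assumptions.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (training_gene_id, paralog_gene_id, e_value)
ParalogHit = Tuple[str, str, float]


def build_paralog_table(
    hits: Iterable[ParalogHit],
    basic_values: Dict[str, float],
    score_fn: Callable[[float, float], float],
    threshold: float = 100.0,
) -> Dict[str, float]:
    """Build a table mapping paralog gene IDs to their score.

    score_fn stands in for the score formula, which is not reproduced
    in the text; it receives (e_value, basic_value).
    """
    table: Dict[str, float] = {}
    for training_gene, paralog, e_value in hits:
        # Values below the threshold are not treated as paralogs.
        if e_value < threshold:
            continue
        score = score_fn(e_value, basic_values[training_gene])
        # Assumption: if several training genes share a paralog, keep the largest score.
        table[paralog] = max(score, table.get(paralog, float("-inf")))
    return table


def paralog_score(gene_id: str, table: Dict[str, float]) -> float:
    """Score of a reference gene: the table entry, or 0 if it is not a paralog."""
    return table.get(gene_id, 0.0)
```

Any gene of the reference data set that is absent from the table receives 0, as the text specifies.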

遺伝子の機能注釈情報に基づくスコア Score based on gene function annotation information
ヒト遺伝子データベース２３には、遺伝子の機能に関するアノテーションデータ（以下、機能注釈情報と称する）が関連付けられている。機能注釈情報としては、InterPro ID、酵素番号（EC）、代謝経路（KEGG pathway）を挙げることができる。InterPro IDは、European Bioinformatics Instituteが提供するデータベースに含まれる識別データであり、タンパク質の機能的呼称毎に割り振られている。酵素番号（EC）は、国際生化学連合によって定義された分類番号であり、機能に従って４桁の番号（たとえばEC 2.3.2.13）が割り振られている。代謝経路（KEGG pathway）は、Kyoto Encyclopedia of Genes and Genomesが提供するデータベースに含まれる識別データであり、代謝経路といった生物学的プロセスにおける分子相互ネットワーク毎に割り振られている。 In the human gene database 23, annotation data on gene function (hereinafter referred to as function annotation information) is associated with each gene. Function annotation information includes the InterPro ID, the enzyme number (EC), and the metabolic pathway (KEGG pathway). The InterPro ID is identification data included in a database provided by the European Bioinformatics Institute and is assigned to each functional designation of a protein. The enzyme number (EC) is a classification number defined by the International Union of Biochemistry; a four-part number (for example, EC 2.3.2.13) is assigned according to function. The metabolic pathway (KEGG pathway) is identification data included in a database provided by the Kyoto Encyclopedia of Genes and Genomes and is assigned to each molecular interaction network in a biological process such as a metabolic pathway.

本スコアは、トレーニングセットに含まれる遺伝子に関連付けられた機能注釈情報について、トレーニングセットにおける当該機能注釈情報の出現頻度と、参照データセットにおける当該機能注釈情報の出現頻度とから演算されるスコアである。なお、例えば、機能注釈情報としてInterPro ID、酵素番号（EC）及び代謝経路（KEGG pathway）の３種類が各遺伝子に関連付けられている場合、スコア演算手段２０は機能注釈情報の種類毎に本スコアを演算する。 This score is calculated, for each piece of function annotation information associated with a gene included in the training set, from the frequency with which that function annotation information appears in the training set and the frequency with which it appears in the reference data set. For example, when three types of function annotation information (InterPro ID, enzyme number (EC), and metabolic pathway (KEGG pathway)) are associated with each gene, the score calculation means 20 calculates this score separately for each type.

具体的には、先ず、トレーニングセットに含まれる遺伝子識別データをキーとしてヒト遺伝子データベース２３を検索し、当該遺伝子識別データで特定される遺伝子に関連付けられた機能注釈情報を読み出す。次に、スコア演算手段２０は、読み出した機能注釈情報のうち所定の機能注釈情報についてスコアを演算するに際して、トレーニングセット及び参照データセットに含まれる遺伝子識別データをキーとしてヒト遺伝子データベース２３を検索し、以下の数を演算する。 Specifically, first, the human gene database 23 is searched using the gene identification data included in the training set as a key, and the function annotation information associated with the gene specified by the gene identification data is read out. Next, the score calculation means 20 searches the human gene database 23 using the gene identification data included in the training set and the reference data set as a key when calculating the score for the predetermined function annotation information among the read function annotation information. Calculate the following numbers.
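As a rough illustration of the frequency counting that feeds this score, the sketch below counts, for one type of function annotation information, how many genes in a gene set carry each annotation. The exact quantities used by formula (e3) are not reproduced in this text, so this shows only the counting step; the function name and the dictionary layout are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def annotation_counts(
    gene_ids: Iterable[str],
    annotations: Dict[str, Iterable[str]],
) -> Counter:
    """Count how many genes in gene_ids carry each annotation ID.

    annotations maps a gene ID to its annotation IDs of one type
    (for example, InterPro IDs); this layout is hypothetical.
    """
    counts: Counter = Counter()
    for gene_id in gene_ids:
        # set() ensures each gene contributes at most once per annotation.
        counts.update(set(annotations.get(gene_id, ())))
    return counts
```

Calling the function once with the training-set gene IDs and once with the reference-data-set gene IDs gives the two occurrence counts that the score compares.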

また、スコア演算手段２０は、所定の機能注釈情報が関連付けられた遺伝子群における基礎値の平均値を演算する。この基礎値の平均値を [記号] と表すと、本スコアは下記式(e3)で与えられる。 Moreover, the score calculation means 20 calculates the average of the basic values in the group of genes associated with the given function annotation information. Denoting this average of the basic values by [symbol], this score is given by the following formula (e3).

スコア演算手段２０が演算するスコアS_{f, n}は、特定の疾患での各機能注釈情報の特異性を示している。スコアS_{f, n}が高い値を示し、特異性が高ければ、その疾患への重要な役割を持つことが明らかである。上記式（e3）によれば、特異性の低い高頻度で存在する機能注釈情報に対して高いスコアを与えてしまうことがなく、特異性の点を忠実に評価した値を算出することができる。 The score S _{f, n} calculated by the score calculation means 20 indicates the specificity of each function annotation information in a specific disease. It is clear that if the score S _{f, n} is high and the specificity is high, it has an important role in the disease. According to the above formula (e3), it is possible to calculate a value that faithfully evaluates the point of specificity without giving a high score to the function annotation information that exists frequently with low specificity. .

所定の種類の機能注釈情報に対して複数個のデータが関連付けられている場合、スコア演算手段２０は各データについてスコアS_{f, n}を演算する。例えば、トレーニングセットに含まれる所定の遺伝子について、３個のInterPro IDが関連付けられている場合、各InterPro IDそれぞれについてS_{f, n}を演算して合計することとなる。 When a plurality of pieces of data are associated with a predetermined type of function annotation information, the score calculator 20 calculates a score S _{f, n} for each data. For example, when three InterPro IDs are associated with a predetermined gene included in the training set, S _{f, n} is calculated and summed for each InterPro ID.

スコア演算手段２０は、トレーニングセットに含まれる全ての遺伝子に関連付けられた全ての機能注釈情報についてこのスコアS_{f, n}を演算する。得られたスコアS_{f, n}は、機能注釈情報に関連付けられたテーブルとして記憶手段等に格納される。 The score calculation means 20 calculates this score S _{f, n} for all function annotation information associated with all genes included in the training set. The obtained score S _{f, n} is stored in the storage means or the like as a table associated with the function annotation information.

スコア演算手段２０は、次に、参照データセットに含まれる遺伝子の機能注釈情報をキーとして上記テーブルを検索し、当該機能注釈情報が含まれている場合には上記テーブルに格納されたスコアを、当該遺伝子のスコアとしてセットする。なお、所定の種類の機能注釈情報（例えばInterPro）として複数個の機能注釈情報（例えば３個のInterPro ID）が関連付けられている場合には、当該遺伝子のスコアとして、所定の種類の機能注釈情報に対するスコアは、各スコアの合計として演算される(e4)。なお、スコア演算手段２０は、検索した機能注釈情報が上記テーブルに含まれていない場合には、当該遺伝子に対するスコアを０にセットする。 Next, the score calculation means 20 searches the table using the function annotation information of each gene included in the reference data set as a key; when the function annotation information is present in the table, the score stored in the table is set as the score of that gene. When a plurality of pieces of function annotation information (for example, three InterPro IDs) are associated as function annotation information of a given type (for example, InterPro), the gene's score for that type is calculated as the sum of the individual scores (e4). When the searched function annotation information is not included in the table, the score calculation means 20 sets the score for that gene to 0.

このようにして、スコア演算手段２０は、参照データセットに含まれる遺伝子について、InterPro ID、酵素番号（EC）、代謝経路（KEGG pathway）といった複数種類の機能注釈情報毎にスコアS_{f, n}を演算することができる。 In this way, the score calculation means 20 can calculate, for the genes included in the reference data set, a score S_{f,n} for each of the several types of function annotation information, such as InterPro ID, enzyme number (EC), and metabolic pathway (KEGG pathway).
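The per-gene lookup and summation described above (the sum referred to as (e4), with 0 for annotations absent from the table) might look like the following minimal sketch. The function names, the dictionary layout, and all annotation IDs and score values used with it are illustrative assumptions, not data from the patent.

```python
from typing import Dict, Iterable


def annotation_type_score(
    annotation_ids: Iterable[str],
    score_table: Dict[str, float],
) -> float:
    """Sum the tabulated scores of a gene's annotations of one type.

    Annotations missing from the table contribute 0.
    """
    return sum((score_table.get(a, 0.0) for a in annotation_ids), 0.0)


def gene_annotation_scores(
    gene_annotations: Dict[str, Iterable[str]],
    tables: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Return one score per annotation type (e.g. InterPro, EC, KEGG)."""
    return {
        annotation_type: annotation_type_score(
            gene_annotations.get(annotation_type, ()), table
        )
        for annotation_type, table in tables.items()
    }
```

Keeping one table per annotation type mirrors the text, which computes the score separately for each type rather than pooling all annotations.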

ところで、ヒト遺伝子データベース２３には、機能注釈情報としてGO（Gene Ontology）データが関連付けられている。GOデータとしては、(1) molecular function、(2) biological process及び(3) subcellular componentの３つのカテゴリーに大別され、遺伝子識別データに関連付けられている。ここで、GOデータは、各カテゴリーにおいて階層構造を持ち、下層に位置するほど細分化された情報を持つこととなる。 The human gene database 23 also associates GO (Gene Ontology) data with genes as function annotation information. GO data is broadly divided into three categories, (1) molecular function, (2) biological process, and (3) subcellular component, and is associated with the gene identification data. Within each category the GO data has a hierarchical structure, and the lower a term sits in the hierarchy, the more finely subdivided the information it carries.

スコア演算手段２０においては、GOデータに対してスコアをつけるに当たり、その階層構造を考慮したスコアを定義している。例えば、subcellular componentにおける一番上の階層に存在するcell (GO: 0005623)と第3階層のnucleus (GO: 0005634)という言葉の定義を比較すると、明らかに下層に位置する“nucleus”はより詳細な情報を有している。 In assigning scores to GO data, the score calculation means 20 uses a score definition that takes this hierarchical structure into account. For example, comparing the definitions of the terms cell (GO: 0005623), which is at the top level of subcellular component, and nucleus (GO: 0005634), which is at the third level, the lower-level term “nucleus” clearly carries more detailed information.

従って、スコア演算手段２０は、GOデータに関するスコアを上記（e3）式を基準としてGOの階層構造を考慮した下記式（e5）を用いる。 Accordingly, for scores on GO data, the score calculation means 20 uses the following formula (e5), which is based on formula (e3) above and additionally takes the GO hierarchy into account.

上記(e5)式においてP_h,nは階層構造内での位置、つまり階層が下層なほど数値が高くなる値である。 In formula (e5) above, P_{h,n} is the position within the hierarchical structure, that is, a value that becomes larger the lower the level in the hierarchy.

スコア演算手段２０は、トレーニングセットに含まれる全ての遺伝子に関連付けられた全てのGOデータについてこのスコアS_{go1-3, n}を演算する。得られたスコアS_{go1-3, n}は、GOデータに関連付けられたテーブルとして記憶手段等に格納される。 The score calculation means 20 calculates this score S_{go1-3,n} for all GO data associated with all genes included in the training set. The resulting scores S_{go1-3,n} are stored in the storage means or the like as a table keyed by GO data.

スコア演算手段２０は、次に、参照データセットに含まれる遺伝子のGOデータをキーとして上記テーブルを検索し、当該GOデータが含まれている場合には上記テーブルに格納されたスコアS_{go1-3, n}を、当該遺伝子のスコアとしてセットする。なお、所定の種類のGOデータ（例えば、molecular function）として複数個の機能注釈情報（例えば３個のmolecular function）が関連付けられている場合には、当該遺伝子のスコアとして、所定の種類のGOデータに対するスコアは、各スコアの合計として演算される(e6)。なお、スコア演算手段２０は、検索したGOデータが上記テーブルに含まれていない場合には、当該遺伝子に対するスコアを０にセットする。このようにして、スコア演算手段２０は、参照データセットに含まれる遺伝子について、GOデータ毎にスコアS_{f, n}を演算することができる。 Next, the score calculation means 20 searches the table using the GO data of each gene included in the reference data set as a key; when the GO data is present in the table, the score S_{go1-3,n} stored in the table is set as the score of that gene. When a plurality of pieces of function annotation information (for example, three molecular function terms) are associated as GO data of a given type (for example, molecular function), the gene's score for that type of GO data is calculated as the sum of the individual scores (e6). When the searched GO data is not included in the table, the score calculation means 20 sets the score for that gene to 0. In this way, the score calculation means 20 can calculate the score S_{f,n} for each GO data for the genes included in the reference data set.

遺伝子発現パターンに対するスコア Score for gene expression patterns
ヒト遺伝子データベース２３には、遺伝子発現パターンに関するアノテーションデータ（以下、発現パターンデータと称する）が関連付けられている。 In the human gene database 23, annotation data on gene expression patterns (hereinafter referred to as expression pattern data) is associated with each gene.

いくつかのクラスター分析によって、GOデータに基づく遺伝子間での機能の類似性と遺伝子発現パターンの類似性との相関が示されている（Eisen MB et al. Proc Natl Acad Sci 1998 (95)14863-14868, Lagreid A et al. Genome Res 2003(5)965-979）。これらの知見から、類似の機能を有する遺伝子は同様の遺伝子発現制御を受けていると考えられる。この仮説に基づき、遺伝子発現パターンの類似性を見出すということは、同一の遺伝子発現制御を受けていることが示唆され、その遺伝子間で類似の機能を共有する可能性がある。 Several cluster analyzes have shown correlation between functional similarity between genes based on GO data and similarity in gene expression patterns (Eisen MB et al. Proc Natl Acad Sci 1998 (95) 14863- 14868, Lagreid A et al. Genome Res 2003 (5) 965-979). From these findings, it is considered that genes having similar functions are subjected to similar gene expression control. Finding similarities in gene expression patterns based on this hypothesis suggests that they are under the same gene expression control and may share similar functions between the genes.

スコア演算手段２０は、先ず、トレーニングセットに含まれる遺伝子に関連付けられた発現パターンデータ及び参照データセットに含まれる遺伝子に関連付けられた発現パターンデータを読み出し、これら発現パターンデータ間の相関係数を演算する。具体的に、スコア演算手段２０は、トレーニングセットに含まれる遺伝子に関連付けられた発現パターンデータＸの標準偏差δ_ｘ、参照データセットに含まれる遺伝子に関連付けられた発現パターンデータＹの標準偏差δ_ｙ、発現パターンデータＸ及び発現パターンデータＹの共分散Cov(X, Y)を用いて、相関係数Rxyは下記式により演算することができる。なお、発現パターンデータはH-InvDBのサブデータベースであるH-ANGEL（Tanino M et al. Nucleic Acids Res 2005(33)D567-D572）に格納しているiAFLPデータ（Kawamoto S et al. Genome Res 1999(12)1305-12, Sese J et al. Nucleic Acids Res 2001(29)156-8）の40組織分類を用いることができる。 The score calculation means 20 first reads the expression pattern data associated with the genes included in the training set and the expression pattern data associated with the genes included in the reference data set, and calculates the correlation coefficient between these expression pattern data. To do. Specifically, the score calculation means 20 uses the standard deviation δ _x of the expression pattern data X associated with the genes included in the training set and the standard deviation δ _y of the expression pattern data Y associated with the genes included in the reference data set. Using the covariance Cov (X, Y) of the expression pattern data X and the expression pattern data Y, the correlation coefficient Rxy can be calculated by the following equation. The expression pattern data are iAFLP data (Kawamoto S et al. Genome Res 1999) stored in H-ANGEL (Tanino M et al. Nucleic Acids Res 2005 (33) D567-D572), a sub-database of H-InvDB. (12) 1305-12, Sese J et al. Nucleic Acids Res 2001 (29) 156-8) 40 tissue classification can be used.

なお、上記式において、ｎはデータ件数を意味し、Xi及びYiはそれぞれ発現パターンX,Yのi番目の組織における発現量を意味し、μx及びμyはそれぞれ各発現パターンにおけるi番目の組織における発現量の平均を意味する。 In the above formula, n means the number of data, Xi and Yi mean the expression levels in the i-th tissue of the expression patterns X and Y, respectively, and μx and μy respectively in the i-th tissue in each expression pattern. Means the average expression level.
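The formula image for the correlation coefficient is not reproduced in this text. Based on the symbols defined above, it is presumably the standard Pearson correlation coefficient, which can be written as below. This is a reconstruction under that assumption, not the original formula; in particular, μx and μy are taken here to be the means of each expression pattern over the n tissues, and the choice of 1/n (rather than 1/(n-1)) is immaterial because the factor cancels between numerator and denominator.

```latex
R_{xy} = \frac{\mathrm{Cov}(X, Y)}{\delta_x \, \delta_y}
       = \frac{\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \mu_x)(Y_i - \mu_y)}
              {\sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \mu_x)^2}\;
               \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (Y_i - \mu_y)^2}}
```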

次に、スコア演算手段２０は、発現パターンデータＸが関連付けられた遺伝子識別データと、発現パターンデータＹが関連付けられた遺伝子識別データと、得られた相関係数Rxyとを関連付けたテーブルを記憶手段等に格納する。次に、スコア演算手段２０は、参照データセットに含まれる遺伝子の遺伝子識別データをキーとして上記テーブルを参照して、当該遺伝子識別データと関連付けられた最大の相関係数Rxyを検索する。スコア演算手段２０は、最大値の相関係数Rxyに関連付けられた、トレーニングセットに含まれる遺伝子識別データをキーとして、当該遺伝子識別データに関連付けられた遺伝子の基礎値を読み出す。そして、スコア演算手段２０は、読み出した基礎値及び最大を示す相関係数Rxyを用いて、参照データセットに含まれる遺伝子についてスコアS_e,xを下記式(e7)に従って演算する。 Next, the score calculation means 20 stores, in the storage means or the like, a table that associates the gene identification data linked to expression pattern data X, the gene identification data linked to expression pattern data Y, and the resulting correlation coefficient Rxy. The score calculation means 20 then looks up this table using the gene identification data of each gene included in the reference data set as a key and finds the largest correlation coefficient Rxy associated with that gene identification data. Using as a key the training-set gene identification data associated with this largest correlation coefficient Rxy, the score calculation means 20 reads out the basic value of the corresponding gene. Finally, using the read basic value and the largest correlation coefficient Rxy, the score calculation means 20 calculates the score S_{e,x} for the gene included in the reference data set according to the following formula (e7).
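The search for the best-correlated training gene can be sketched as follows. The sketch assumes the correlation is the standard Pearson coefficient and that expression patterns are equal-length lists of per-tissue expression levels; the function names and the handling of flat profiles are assumptions. Formula (e7), which combines the maximum correlation with the basic value, is not reproduced in this text and is therefore left out.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression patterns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0.0 or sd_y == 0.0:
        # Assumption: a flat profile is treated as uncorrelated.
        return 0.0
    return cov / (sd_x * sd_y)


def best_training_match(
    profile: Sequence[float],
    training_profiles: Dict[str, Sequence[float]],
) -> Tuple[Optional[str], float]:
    """Return the training gene with the largest correlation and that value."""
    best_id: Optional[str] = None
    best_r = float("-inf")
    for gene_id, training_profile in training_profiles.items():
        r = pearson(training_profile, profile)
        if r > best_r:
            best_id, best_r = gene_id, r
    return best_id, best_r
```

The returned gene ID is what the text uses as the key for reading the basic value, which then enters (e7) together with the maximum correlation.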

上述の説明では、遺伝子発現パターンに対するスコアとしてトレーニングセット及び参照データセットに含まれる遺伝子に関連付けられた発現パターンデータを用いてスコアS_e,xを演算したが、スコア演算手段２０は、発現パターンデータが複数の遺伝子から構成されるクラスターに関連付けられたものであっても同様にスコアS_e,xを演算することができる。クラスターとは遺伝子の配列情報に基づいて分類された一つの群を意味する。この場合、スコア演算手段２０は、発現パターンデータＸが関連付けられたクラスターデータと、発現パターンデータＹが関連付けられたクラスターデータと、求められた相関係数Rxyとを関連付けたテーブルを記憶手段等に格納する。ここでクラスターデータとはクラスターを識別するデータを意味する。そして、スコア演算手段２０は、参照データセットに含まれる遺伝子が属するクラスターデータをキーとして上記テーブルを参照して、当該クラスターデータと関連付けられた最大の相関係数Rxyを検索する。以降は同様にしてスコアS_e,xを演算することができる。 In the description above, the score S_{e,x} for gene expression patterns was calculated from expression pattern data associated with the genes included in the training set and the reference data set, but the score calculation means 20 can calculate the score S_{e,x} in the same way when the expression pattern data is associated with clusters each composed of a plurality of genes. A cluster here means a group classified on the basis of gene sequence information. In this case, the score calculation means 20 stores, in the storage means or the like, a table that associates the cluster data linked to expression pattern data X, the cluster data linked to expression pattern data Y, and the calculated correlation coefficient Rxy. Cluster data here means data identifying a cluster. The score calculation means 20 then looks up this table using as a key the cluster data of the cluster to which each gene in the reference data set belongs, and finds the largest correlation coefficient Rxy associated with that cluster data. The score S_{e,x} can then be calculated in the same manner as above.

以上、説明したように、スコア演算手段２０は、トレーニングセット及び参照データセットに含まれる遺伝子識別データ及びアノテーションデータを用いて、参照データセットに含まれる遺伝子について複数のスコアを演算することができる。より具体的に、スコア演算手段２０は、上述した例では複数のスコアとして以下の８種類のスコアを演算できる。 As described above, the score calculation means 20 can calculate a plurality of scores for the genes included in the reference data set using the gene identification data and annotation data included in the training set and the reference data set. More specifically, in the example described above, the score calculation means 20 can calculate the following eight types of scores.
１．遺伝子配列の相同性に基づくスコア 1. Score based on gene sequence homology
２．遺伝子の機能注釈情報に基づくスコア（InterPro ID） 2. Score based on gene function annotation information (InterPro ID)
３．遺伝子の機能注釈情報に基づくスコア（酵素番号（EC）） 3. Score based on gene function annotation information (enzyme number (EC))
４．遺伝子の機能注釈情報に基づくスコア（代謝経路（KEGG pathway）） 4. Score based on gene function annotation information (metabolic pathway (KEGG pathway))
５．遺伝子の機能注釈情報に基づくスコア（GOデータ（molecular function）） 5. Score based on gene function annotation information (GO data (molecular function))
６．遺伝子の機能注釈情報に基づくスコア（GOデータ（biological process）） 6. Score based on gene function annotation information (GO data (biological process))
７．遺伝子の機能注釈情報に基づくスコア（GOデータ（subcellular component）） 7. Score based on gene function annotation information (GO data (subcellular component))
８．遺伝子の発現パターンに基づくスコア 8. Score based on gene expression pattern

なお、以上の例では、各スコアを演算するにあたり、スコア演算手段２０はトレーニングセットに含まれる遺伝子に関連付けられたアノテーションデータについて予め演算したスコアを含むテーブルを構築し、当該テーブルを用いてスコア演算対象の遺伝子について各スコアを演算していた。しかし、本発明においてスコア演算手段２０は、このようなテーブルを予め構築せず、参照データセットに含まれる遺伝子に関連付けられたアノテーションデータを個別に演算することもできる。 In the above example, in calculating each score, the score calculation means 20 constructs a table including scores calculated in advance for annotation data associated with genes included in the training set, and uses the table to calculate the score. Each score was calculated for the gene of interest. However, in the present invention, the score calculation means 20 can also individually calculate annotation data associated with the genes included in the reference data set without constructing such a table in advance.

例えば、上記２〜７のスコアを算出する場合、スコア演算手段２０は、先ず、参照データセットに含まれる遺伝子識別データをキーとしてヒト遺伝子データベース２３を検索し、当該遺伝子識別データで特定される遺伝子に関連付けられた機能注釈情報を読み出す。次に、スコア演算手段２０は、トレーニングセット及び参照データセットに含まれる遺伝子識別データをキーとしてヒト遺伝子データベース２３を検索し、 [数式] を演算する。そして、スコア演算手段２０は、上記の(e3)式に従ってスコアを演算することができる。この場合、スコア演算手段２０は、スコア演算対象の遺伝子について直接演算することとなる。 For example, when calculating scores 2 to 7 above, the score calculation means 20 first searches the human gene database 23 using the gene identification data included in the reference data set as a key and reads out the function annotation information associated with the gene specified by that gene identification data. Next, the score calculation means 20 searches the human gene database 23 using the gene identification data included in the training set and the reference data set as a key and calculates [formula]. The score calculation means 20 can then calculate the score according to formula (e3) above. In this case, the score calculation means 20 calculates the score directly for each gene subject to score calculation.

スコアから演算する評価値 Evaluation value calculated from the scores
次に、本プログラムでは、スコア演算手段２０で演算した複数のスコアを用いて、評価値演算手段２１により、参照データセットに含まれる遺伝子と疾患種別情報との関連性を示す評価値を演算する。ここで評価値は、参照データセットに含まれる遺伝子について演算され、トレーニングセットに含まれる遺伝子群への距離を意味する値である。より具体的に評価値演算手段２１は、上記８種類のスコアをそれぞれ違う属性を持つものとして扱うことが可能である判別関数を用いて評価値を演算する。ここで判別関数とは、いくつかの変数に基づいて、各データがどの群に所属するかを判定する関数を意味する。 Next, in this program, the evaluation value calculation means 21 uses the plurality of scores calculated by the score calculation means 20 to calculate an evaluation value indicating the relationship between each gene included in the reference data set and the disease type information. The evaluation value is calculated for each gene included in the reference data set and represents its distance to the group of genes included in the training set. More specifically, the evaluation value calculation means 21 calculates the evaluation value using a discriminant function that can treat the eight types of scores as attributes of different kinds. A discriminant function here means a function that determines, on the basis of several variables, which group each data item belongs to.

具体的に評価値は、トレーニングセットに含まれる遺伝子群を群1、参照データセットから当該トレーニングセットを除いたデータセットを群2として、群2に含まれる遺伝子の群１への距離（MD1）と群2への距離（MD2）を求め、（MD2/MD1）の式により演算される。なお、MD1及びMD2は、分散を考慮できるマハラノビスの汎距離（Mardia KV et al. Multivariate analysis 1979 Academic Press, New York）を用いて算出することができる。それぞれの群への汎距離は下記式(e8)に従って演算することができる。 Specifically, the evaluation value is the distance from the gene included in group 2 to group 1 (MD1), where the gene group included in the training set is group 1 and the data set obtained by removing the training set from the reference data set is group 2. And the distance (MD2) to group 2 is calculated and calculated by the formula (MD2 / MD1). MD1 and MD2 can be calculated using Mahalanobis's general distance (Mardia KV et al. Multivariate analysis 1979 Academic Press, New York) that can take dispersion into account. The general distance to each group can be calculated according to the following formula (e8).

上記式(e8)において、Ｘはスコア演算手段２０で演算された複数のスコアの値であり、 [記号] は、トレーニングセット或いは参照データセット（例えばk個の遺伝子からなる）の平均である。なお、j = 1，2， ... ，kである。また、 [記号] は、各群の分散共分散行列の逆行列である。 In formula (e8) above, X is the set of score values calculated by the score calculation means 20, and [symbol] is the mean over the training set or the reference data set (consisting of, for example, k genes), where j = 1, 2, ..., k. Further, [symbol] is the inverse of the variance-covariance matrix of each group.
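The image of formula (e8) is not reproduced in this text. Given the quantities it is said to contain (the score vector, a group mean over k genes, and the inverse of each group's variance-covariance matrix), it presumably has the standard Mahalanobis form shown below. The notation is an assumption introduced here: g indexes the group (1 or 2), X-bar is the group mean, and S is the group's variance-covariance matrix. Whether (e8) uses the squared form or its square root cannot be determined from the text.

```latex
MD_g^{2} = \left(X - \bar{X}_g\right)^{\mathsf{T}} S_g^{-1} \left(X - \bar{X}_g\right),
\qquad
\bar{X}_g = \frac{1}{k} \sum_{j=1}^{k} X_j ,
\qquad g \in \{1, 2\}
```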

最後に、評価値演算手段２１は、上述のように得られた評価値を、遺伝子識別データに関連付けて、特定の疾患種別情報に関する解析結果データセットとして記憶手段に格納する。以上のように、本プログラムによれば、特定の疾患種別情報に関する解析結果データセットを構築することができる。なお、遺伝子解析装置１は、本プログラムにより構築された解析結果データセットを出力する解析結果出力手段２２を有していても良い。 Finally, the evaluation value calculation unit 21 stores the evaluation value obtained as described above in the storage unit as an analysis result data set relating to specific disease type information in association with the gene identification data. As described above, according to this program, an analysis result data set relating to specific disease type information can be constructed. The gene analyzing apparatus 1 may include an analysis result output unit 22 that outputs an analysis result data set constructed by this program.
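A minimal pure-Python sketch of the evaluation value MD2/MD1 and of an analysis result list ordered by that value is given below. It assumes the standard Mahalanobis distance (square root of the quadratic form), a sample covariance with divisor k-1, and descending order for the result list; none of these details is confirmed by the text, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance_matrix(rows: Sequence[Sequence[float]], mean: Sequence[float]) -> Matrix:
    d = len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        dev = [v - m for v, m in zip(row, mean)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b]
    # Assumption: sample covariance (divisor k - 1).
    return [[c / (len(rows) - 1) for c in line] for line in cov]


def solve(matrix: Matrix, rhs: Sequence[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def mahalanobis(x: Sequence[float], group: Sequence[Sequence[float]]) -> float:
    """Mahalanobis distance from score vector x to a group of score vectors."""
    mean = mean_vector(group)
    dev = [v - m for v, m in zip(x, mean)]
    weighted = solve(covariance_matrix(group, mean), dev)
    return math.sqrt(sum(d * w for d, w in zip(dev, weighted)))


def evaluation_value(x, group1, group2) -> float:
    """MD2 / MD1: larger when x is closer to group 1 than to group 2."""
    return mahalanobis(x, group2) / mahalanobis(x, group1)


def rank_candidates(
    candidates: Dict[str, Sequence[float]], group1, group2
) -> List[Tuple[str, float]]:
    """Analysis result list: (gene ID, evaluation value), largest first (assumed order)."""
    scored = [(g, evaluation_value(v, group1, group2)) for g, v in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

If (e8) instead defines the distance as the squared quadratic form, each ratio is simply squared and the ordering of the genes is unchanged. A gene whose score vector coincides with the group 1 mean gives MD1 = 0, which this sketch does not handle.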

また、遺伝子解析装置１は、種々の遺伝子に関連する情報について同様に解析結果データセットを構築するとともに、得られた解析結果データセットを上記情報に関連付けて記憶手段に格納することができる。また、遺伝子解析装置１は、上記解析結果出力手段２２を備え、図１に示したようなネットワークに接続されることによって、ユーザ端末３に対して所望の解析結果データセットを提供することができる。ユーザ端末３に提供した解析結果データセットは、特定の疾患との関連性が未知である遺伝子群が当該疾患との関連性を示す評価値とともにリストとして表示される。ユーザは、表示されたリストに基づいて、疾患に関連性を有する新規遺伝子を同定することができ、或いは疾患に関連性を有する蓋然性の高い候補遺伝子を同定することができる。このように、遺伝子解析装置１は、疾患に関連性を有する新規遺伝子や候補遺伝子といった有用な情報をユーザに提供できる遺伝子情報提供装置としても実現することができる。 The gene analyzing apparatus 1 can similarly construct analysis result data sets for information related to various genes, and store the obtained analysis result data sets in the storage means in association with the information. Moreover, the gene analysis apparatus 1 includes the analysis result output unit 22 and is connected to a network as shown in FIG. 1, thereby providing a desired analysis result data set to the user terminal 3. . In the analysis result data set provided to the user terminal 3, a group of genes whose relevance to a specific disease is unknown is displayed as a list together with an evaluation value indicating the relevance to the disease. Based on the displayed list, the user can identify new genes that are related to the disease, or can identify candidate genes that are highly likely to be related to the disease. As described above, the gene analyzing apparatus 1 can also be realized as a gene information providing apparatus capable of providing useful information such as a novel gene or a candidate gene having a relation to a disease to a user.

実際の解析結果 Actual analysis results
上記の発明を使用し、多因子性疾患であり重篤な症状を伴う慢性関節リウマチと前立腺癌に関与する遺伝子候補（新規疾病感受性遺伝子）の探索を行った。具体的には、先ず、２００４年８月時点におけるOMIMデータベースを用いて、慢性関節リウマチに関するトレーニングセットを構築した。その結果として得られる、遺伝子識別データ（遺伝子名及び遺伝子シンボル）と基礎値とを関連づけたトレーニングセットを表１に示す。 Using the invention described above, a search was made for candidate genes (novel disease susceptibility genes) involved in rheumatoid arthritis and prostate cancer, which are multifactorial diseases accompanied by severe symptoms. Specifically, a training set for rheumatoid arthritis was first constructed using the OMIM database as of August 2004. Table 1 shows the resulting training set, which associates gene identification data (gene name and gene symbol) with basic values.

続いて、本プログラムを適用し、H-InvDB（ヒト遺伝子データベース２３)に含まれる全データセットについて上述した８種類のスコアを算出した。その後、本プログラムを適用し、トレーニングセットに含まれる遺伝子群を群1、H-InvDBから当該トレーニングセットを除いたデータセットを群2として、群2に含まれる遺伝子の群１への距離（MD1）と群2への距離（MD2）を求め、（MD2/MD1）の式により評価値を演算した。 Subsequently, this program was applied, and the above-mentioned eight types of scores were calculated for all data sets included in H-InvDB (human gene database 23). Then, applying this program, the group of genes included in the training set is group 1, the data set obtained by removing the training set from H-InvDB is group 2, and the distance of the genes included in group 2 to group 1 (MD1 ) And the distance to group 2 (MD2), and the evaluation value was calculated by the formula (MD2 / MD1).

その結果、慢性関節リウマチの新規疾病感受性遺伝子として469個の遺伝子が抽出された。これら抽出された469個の遺伝子のうち上位100個の遺伝子について、遺伝子識別データ（遺伝子名、遺伝子シンボル及びAccession number）と評価値（MD2/MD1）とMD1とMD2とを表２に示す。 As a result, 469 genes were extracted as novel disease susceptibility genes for rheumatoid arthritis. Table 2 shows gene identification data (gene name, gene symbol and accession number), evaluation values (MD2 / MD1), MD1 and MD2 for the top 100 genes among the extracted 469 genes.

すなわち、本プログラムを適用することによって、表２に示す慢性関節リウマチの新規疾病感受性遺伝子群が特定されたこととなる。表２を詳細に検討すると、インターロイキン１レセプターアンタゴニスト（interleukin 1 receptor antagonist）であるIL1RNが特定されていることに着目することができる。IL1RNは、トレーニングセットを構築した２００４年８月時点において慢性関節リウマチとの関連性がOMIMデータベースには登録されていなかった。その後、２００４年１２月にHoraiらによって、IL1RNと慢性関節リウマチとの関連性が報告されている(Horai R et al. J Clin Invest 2004 114(11)1603-11)。すなわち、本プログラムによって特定の疾患種別情報に関連性を有する新規遺伝子を実際に同定できるといった、本プログラムの実効性を示すことができた。 That is, applying this program identified the group of novel disease susceptibility genes for rheumatoid arthritis shown in Table 2. A close look at Table 2 shows that IL1RN, the interleukin 1 receptor antagonist, is among the genes identified. As of August 2004, when the training set was constructed, no association between IL1RN and rheumatoid arthritis was registered in the OMIM database. Subsequently, in December 2004, Horai et al. reported an association between IL1RN and rheumatoid arthritis (Horai R et al. J Clin Invest 2004 114(11)1603-11). This demonstrates the effectiveness of the program: it can actually identify novel genes related to specific disease type information.

同様に２００４年８月時点におけるOMIMデータベースを用いて、前立腺癌に関するトレーニングセットを構築した。その結果として得られる、遺伝子識別データ（遺伝子名及び遺伝子シンボル）と基礎値とを関連づけたトレーニングセットを表３に示す。 Similarly, a training set for prostate cancer was constructed using the OMIM database as of August 2004. Table 3 shows a training set obtained by associating the gene identification data (gene name and gene symbol) with the basic value.

続いて、上述の例と同様に本プログラムを適用してMD1、MD2及び評価値（MD2/MD1）を求めた。その結果、前立腺癌の新規疾病感受性遺伝子として357個の遺伝子が抽出された。これら抽出された357個の遺伝子のうち上位100個の遺伝子について、遺伝子識別データ（遺伝子名、遺伝子シンボル及びAccession number）と評価値（MD2/MD1）とMD1とMD2とを表４に示す。 Subsequently, this program was applied in the same manner as in the example above to obtain MD1, MD2, and the evaluation value (MD2/MD1). As a result, 357 genes were extracted as novel disease susceptibility genes for prostate cancer. For the top 100 of these 357 genes, Table 4 shows the gene identification data (gene name, gene symbol, and accession number), the evaluation value (MD2/MD1), MD1, and MD2.

表４を詳細に検討すると、チロシンキナーゼ受容体であるEPHB2が特定されていることに着目することができる。EPHB2は、トレーニングセットを構築した２００４年８月時点において前立腺癌との関連性がOMIMデータベースには登録されていなかった。その後、２００４年９月にHuuskoらによって、EPHB2と前立腺癌との関連性が報告されている(Huusko P et al. Nat Genet 2004(36)979-983)。すなわち、本プログラムによって特定の疾患種別情報に関連性を有する新規遺伝子を実際に同定できるといった、本プログラムの実効性を示すことができた。 A close look at Table 4 shows that EPHB2, a tyrosine kinase receptor, is among the genes identified. As of August 2004, when the training set was constructed, no association between EPHB2 and prostate cancer was registered in the OMIM database. Subsequently, in September 2004, Huusko et al. reported an association between EPHB2 and prostate cancer (Huusko P et al. Nat Genet 2004(36)979-983). This again demonstrates the effectiveness of the program: it can actually identify novel genes related to specific disease type information.

Other examples to which the present invention is applied
The description above used human disease type information as an example of information related to genes, but the scope of application of the present invention is not limited to that example. This program can also be applied to information related to plant-derived genes. As a specific example, the following describes a program that constructs an analysis result data set for multifactor phenotypes of rice.

In this example, a training set consisting of a group of genes known in advance to be related to specific multifactor phenotype information in rice is constructed before this program is executed and is stored in, for example, the storage means 10. The training set in this example can be constructed using a database that stores functional annotation information for rice. More specifically, the Rice Annotation Project Data Base (hereinafter RAP-DB), an annotation database of rice genes, can be used.

To construct a training set, the database is first searched using a rice multifactor phenotype as the key, which identifies the group of known genes related to that multifactor phenotype information. Specifically, the genes retrieved as follows are used as the training set. For example, to find genes related to pollination and fertilization, RAP-DB is searched using "Fertility" and "Sterility" as keys. In this example, unlike the case of the human disease information described above, every basic value given by equation (e1) is set to 1. When rice genes are the analysis target, even if several documents are retrieved as literature information, they do not differ enough to justify assigning a score evaluation value to each document. In this example, therefore, there is no need to search for literature information on the genes retrieved with the rice multifactor phenotype as the key. Alternatively, even if literature information on those genes is searched, there is no need to set score evaluation values and calculate basic values with equation (e1).
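The construction step described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the record layout, the gene IDs, and the function name `build_training_set` are hypothetical, and a real system would query RAP-DB instead of an in-memory list.

```python
from typing import Dict, Iterable, List

# Hypothetical annotation records; a real system would query RAP-DB instead.
RECORDS: List[Dict[str, str]] = [
    {"gene_id": "gene_A", "description": "Male sterility protein"},
    {"gene_id": "gene_B", "description": "Pollen fertility restorer"},
    {"gene_id": "gene_C", "description": "Starch synthase"},
]


def build_training_set(records: Iterable[Dict[str, str]],
                       keywords: Iterable[str]) -> Dict[str, float]:
    """Return {gene_id: basic value} for records matching any keyword.

    Every basic value is fixed at 1, as in the rice example, so no
    per-document score evaluation is needed.
    """
    lowered = [k.lower() for k in keywords]
    training_set: Dict[str, float] = {}
    for rec in records:
        text = rec["description"].lower()
        if any(k in text for k in lowered):
            training_set[rec["gene_id"]] = 1.0
    return training_set


training = build_training_set(RECORDS, ["Fertility", "Sterility"])
```

Here `training` holds the two genes whose descriptions match a search key, each with a basic value of 1.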

In this example, the gene analysis apparatus 1 stores training sets for various multifactor phenotypes of rice in, for example, a hard disk or a database. That is, the gene analysis apparatus 1 holds a training set for each of the various kinds of gene-related information described above. The gene analysis apparatus 1 also either contains the rice gene database 23 or can access it and read out its data. The rice gene database 23 that can be used is not particularly limited; one example is RAP-DB. RAP-DB is described in detail in Ohyanagi H et al. NAR 2006 (34) D741-744 and is available via the Internet. The example described earlier used H-InvDB, whereas this example uses RAP-DB. Some annotation items are common to both databases, while others are included in one database but not in the other. The score calculation means 20 described below calculates the various scores using annotation data that depend on the annotation items stored in the database. In this example, a score is a value indicating whether the annotation data associated with the genes included in the training set reflect the specificity of the training set constructed for each multifactor phenotype. The scores to be calculated are illustrated below, but the technical scope of the present invention is not limited to these scores.

Score based on gene functional annotation information
In the rice gene database 23, annotation data on gene function (hereinafter referred to as functional annotation information) are associated with each gene. As with human genes, examples of functional annotation information include InterPro IDs, enzyme numbers (EC), and GO data. Specifically, the rice gene database 23 is first searched using the gene identification data included in the training set as keys, and the functional annotation information associated with the genes specified by those gene identification data is read out. Next, to calculate a score for given functional annotation information among the information read out, the score calculation means 20 searches the rice gene database 23 using the gene identification data included in the training set and the reference data set as keys, and performs the calculation using equations (e3), (e4), and (e5) above.
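As a rough illustration of a frequency-based annotation score, the sketch below compares how often an annotation term appears among the training set genes with how often it appears among the reference data set genes. Equations (e3) to (e5) are not reproduced in this section, so the simple ratio used here is an assumed stand-in and not the formula of the invention. The gene IDs, the InterPro-style term names, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def term_frequencies(annotations: Dict[str, Set[str]],
                     genes: Iterable[str]) -> Dict[str, float]:
    """Fraction of the given genes that carry each annotation term."""
    genes = list(genes)
    counts = Counter(t for g in genes for t in annotations.get(g, set()))
    return {t: c / len(genes) for t, c in counts.items()}


def term_scores(annotations, training_genes, reference_genes):
    """Assumed score: training-set frequency divided by reference frequency."""
    f_train = term_frequencies(annotations, training_genes)
    f_ref = term_frequencies(annotations, reference_genes)
    return {t: f_train[t] / f_ref[t] for t in f_train if f_ref.get(t, 0) > 0}


def gene_score(annotations, gene, scores):
    """Score a gene by the best-scoring term it carries (0.0 if none)."""
    return max((scores.get(t, 0.0) for t in annotations.get(gene, set())),
               default=0.0)


# Hypothetical InterPro-style annotations for four genes.
ANNOTATIONS = {
    "g1": {"IPR_X"}, "g2": {"IPR_X", "IPR_Y"},
    "g3": {"IPR_Y"}, "g4": {"IPR_Z"},
}
scores = term_scores(ANNOTATIONS, ["g1", "g2"], ["g1", "g2", "g3", "g4"])
```

Terms that are over-represented in the training set receive a ratio above 1, and a gene carrying no scored term receives 0.0. Taking the maximum over a gene's terms is one possible aggregation, chosen here only for brevity.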

The score calculation means 20 can calculate a plurality of scores for the genes included in the reference data set, using the gene identification data and annotation data included in the training set and the reference data set. More specifically, in the example described above, the score calculation means 20 can calculate the following five types of scores:
1. Score based on gene functional annotation information (InterPro ID)
2. Score based on gene functional annotation information (enzyme number (EC))
3. Score based on gene functional annotation information (GO data (molecular function))
4. Score based on gene functional annotation information (GO data (biological process))
5. Score based on gene functional annotation information (GO data (subcellular component))

Evaluation value calculated from scores
Next, as in the example described earlier, the evaluation value calculation means 21 of this program uses the plurality of scores calculated by the score calculation means 20 to calculate an evaluation value indicating the relevance between each gene included in the reference data set and the multifactor phenotype information. In this example, the evaluation value calculation means 21 calculates the evaluation value using a discriminant function that can treat the five types of scores above as having different attributes.

In this example as well, the evaluation value is obtained in the same way as in the example described earlier. The group of genes included in the training set is taken as group 1, and the data set obtained by removing the training set from the reference data set is taken as group 2. For each gene included in group 2, the distance to group 1 (MD1) and the distance to group 2 (MD2) are obtained, and the evaluation value is calculated as MD2/MD1. MD1 and MD2 are calculated using the Mahalanobis generalized distance, which takes variance into account, and the generalized distance to each group is calculated according to equation (e8) above.
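The MD2/MD1 calculation can be illustrated with the standard definition of the Mahalanobis distance. The sketch below is written under stated assumptions: equation (e8) is not reproduced in this section, so the use of the squared distance and the sample covariance (divisor n - 1) is a choice made here, and the two-dimensional score vectors and function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows: Sequence[Vector]) -> List[List[float]]:
    """Sample covariance matrix (divisor n - 1)."""
    n, mu = len(rows), mean_vector(rows)
    d = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis_sq(x: Vector, group: Sequence[Vector]) -> float:
    """Squared Mahalanobis distance from x to the group's distribution."""
    mu = mean_vector(group)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(covariance(group), diff)  # z = S^-1 (x - mu)
    return sum(d * zi for d, zi in zip(diff, z))


def evaluation_value(x: Vector, group1, group2) -> float:
    """MD2 / MD1: larger values mean x lies closer to group 1."""
    return mahalanobis_sq(x, group2) / mahalanobis_sq(x, group1)


# Hypothetical two-score vectors: group 1 = training set, group 2 = the rest.
group1 = [(2.0, 1.0), (3.0, 2.0), (2.0, 2.0), (3.0, 1.0)]
group2 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near_training = evaluation_value((2.0, 1.5), group1, group2)
near_rest = evaluation_value((0.5, 1.0), group1, group2)
```

A gene whose scores resemble the training set has a small MD1 and a large MD2, so its evaluation value is well above 1; a gene that resembles the remaining genes has a value below 1.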

In this example as well, the evaluation value calculation means 21 associates the evaluation values obtained as described above with the gene identification data and stores them in the storage means as an analysis result data set for the specific multifactor phenotype information. As described above, this program makes it possible to construct analysis result data sets for various kinds of multifactor phenotype information related to rice. In this example too, the gene analysis apparatus 1 may have a function of outputting, through the analysis result output means 22, the analysis result data set constructed by this program.
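The association of evaluation values with gene identification data can be pictured as a small ranking step. This is a sketch only: the gene IDs and values are hypothetical, and the `top_n` parameter simply mirrors the way the earlier prostate cancer example lists the top 100 genes.

```python
from typing import Dict, List, Tuple


def build_result_dataset(evaluations: Dict[str, float],
                         top_n: int = 100) -> List[Tuple[str, float]]:
    """Pair each gene ID with its evaluation value, highest value first."""
    ranked = sorted(evaluations.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Hypothetical evaluation values (MD2/MD1) keyed by gene ID.
evaluations = {"gene_A": 0.8, "gene_B": 13.0, "gene_C": 2.5}
result = build_result_dataset(evaluations, top_n=2)
```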

A schematic configuration diagram of a gene analysis result providing system that uses a gene analysis apparatus to which the present invention is applied.
A configuration diagram schematically showing the hardware configuration of the gene analysis apparatus to which the present invention is applied.
A block diagram showing the functional configuration of the gene analysis apparatus to which the present invention is applied.
A configuration diagram schematically showing the data structure of the training set used by the gene analysis apparatus to which the present invention is applied.

Explanation of symbols

1 ... gene analysis apparatus, 2 ... communication network, 3 ... user terminal, 4 ... central processing unit, 5 ... HDD, 6 ... input means, 7 ... transmission/reception means, 8 ... display means, 9 ... memory, 10 ... storage means, 20 ... score calculation means, 21 ... evaluation value calculation means, 22 ... analysis result output means, 23 ... human gene database

Claims

A genetic information analysis apparatus comprising:
score calculation means that, using a training set consisting of a group of genes known in advance to be related to information related to genes and including gene identification data that specify those genes, and a database that associates gene identification data with annotation data of the genes specified by the gene identification data, calculates a plurality of scores indicating the relevance to the information of the annotation data of the genes included in the database; and
evaluation value calculation means for calculating, from the plurality of scores calculated by the score calculation means, an evaluation value indicating the relationship between the gene and the information.

The apparatus according to claim 1, wherein the score calculation means calculates a score for the annotation data associated with the score calculation target gene from the appearance frequency of the annotation data included in the training set and the appearance frequency of the annotation data of the genes included in the data set.

The apparatus according to claim 1 or 2, wherein the score calculation means calculates the score for annotation data of all genes included in the database.

The apparatus according to claim 1, wherein the score calculation means calculates a score based on the homology between the base sequence data of the score calculation target gene and the base sequence data of the genes constituting the training set.

The apparatus according to claim 1, wherein the score calculation means calculates correlation coefficients between the expression pattern data of the score calculation target gene and the expression pattern data of the genes constituting the training set from the parameters included in each expression pattern data, and calculates a score based on the highest correlation coefficient.

The apparatus according to claim 1, wherein the gene identification data included in the training set is associated with a base value obtained by quantifying the relationship between the gene specified by the gene identification data and the information.

The apparatus according to claim 1, further comprising result output means for outputting, in association with each other, the gene identification data included in the data set obtained by excluding the training set from the data set contained in the database and the evaluation value calculated for the gene specified by that gene identification data.

The apparatus according to claim 1, wherein the evaluation value calculation means calculates an evaluation value for determining whether the score calculation target gene is closer to the gene group constituting the training set or to the gene group constituting the data set excluding the training set.

The apparatus according to claim 1, wherein the evaluation value calculation means obtains, for the score calculation target gene, the Mahalanobis generalized distance to the gene group constituting the data set excluding the training set and the Mahalanobis generalized distance to the gene group constituting the training set, and calculates the evaluation value as a difference value of these Mahalanobis generalized distances.

The apparatus according to claim 1, wherein the information related to the gene is disease type information.

The apparatus according to claim 1, wherein the score calculation means calculates a score for one or more data selected from base sequence homology data, InterPro ID data, enzyme number data, metabolic pathway data, molecular function data in GO data, biological process data, subcellular component data, and expression pattern data.

The apparatus according to claim 1, wherein the score calculation means obtains the score for the annotation data associated with the genes included in the training set, stores the score in a storage means as a table associated with the annotation data, searches the table using the annotation data associated with the score calculation target gene as a key to read out the score, and uses the read score as the score of the annotation data of the score calculation target gene.

A genetic information providing apparatus comprising:
management means for managing, as input data, information related to genes that is input via input means or a communication network; and
result output means for outputting an analysis result data set constructed by a genetic information analysis apparatus, the genetic information analysis apparatus comprising: score calculation means that, using a training set consisting of a group of genes known in advance to be related to the input information and including gene identification data that specify those genes, and a database that associates gene identification data with annotation data of the genes specified by the gene identification data, calculates a plurality of scores indicating the relevance to the information of the annotation data of the genes included in the data set contained in the database; and evaluation value calculation means for calculating, from the plurality of scores calculated by the score calculation means, an evaluation value indicating the relationship between the gene and the information.

The apparatus according to claim 13, wherein the score calculation means calculates a score for the annotation data associated with the score calculation target gene from the appearance frequency of the annotation data included in the training set and the appearance frequency of the annotation data of the genes included in the data set.

The apparatus according to claim 13 or 14, wherein the score calculation means calculates the score for annotation data of all genes included in the database.

The apparatus according to claim 13, wherein the score calculation means calculates a score based on the homology between the base sequence data of the score calculation target gene and the base sequence data of the genes constituting the training set.

The apparatus according to claim 13, wherein the score calculation means calculates correlation coefficients between the expression pattern data of the score calculation target gene and the expression pattern data of the genes constituting the training set from the parameters included in each expression pattern data, and calculates a score based on the highest correlation coefficient.

The apparatus according to claim 13, wherein the gene identification data included in the training set is associated with a basic value obtained by quantifying the relationship between the gene specified by the gene identification data and the information.

The apparatus according to claim 13, wherein the evaluation value calculation means calculates an evaluation value for determining whether the score calculation target gene is closer to the gene group constituting the training set or to the gene group constituting the data set excluding the training set.

The apparatus according to claim 13, wherein the evaluation value calculation means obtains, for the score calculation target gene, the Mahalanobis generalized distance to the gene group constituting the data set excluding the training set and the Mahalanobis generalized distance to the gene group constituting the training set, and calculates the evaluation value as a difference value of these Mahalanobis generalized distances.

The apparatus according to claim 13, wherein the information related to the gene is disease type information.

The apparatus according to claim 13, wherein the score calculation means calculates a score for one or more data selected from base sequence homology data, InterPro ID data, enzyme number data, metabolic pathway data, molecular function data in GO data, biological process data, subcellular component data, and expression pattern data.

The apparatus according to claim 13, wherein the score calculation means obtains the score for the annotation data associated with the genes included in the training set, stores the score in a storage means as a table associated with the annotation data, searches the table using the annotation data associated with the score calculation target gene as a key to read out the score, and uses the read score as the score of the annotation data of the score calculation target gene.

The apparatus according to claim 13, wherein the information input via the input means or the communication network is a disease name.