JP2003028855A

JP2003028855A - Method for evaluation and display of clustered result

Info

Publication number: JP2003028855A
Application number: JP2001181928A
Authority: JP
Inventors: Yasuyuki Nozaki; 康行野崎; Akira Nakashige; 亮中重; Toshiko Matsumoto; 俊子松本; Shingo Ueno; 紳吾上野; Takuro Tamura; 卓郎田村
Original assignee: Hitachi Software Engineering Co Ltd
Current assignee: Hitachi Software Engineering Co Ltd
Priority date: 2001-06-15
Filing date: 2001-06-15
Publication date: 2003-01-29
Anticipated expiration: 2021-06-15
Also published as: JP3936851B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for evaluating, from a plurality of clustering results, how reliable each clustering method is and what characteristics it has, and a method for evaluating and displaying the degree to which a gene has a designated biological function when the function of a gene of unknown function is being inferred. SOLUTION: Whether each clustering method has classified the genes well is judged from how genes of known biological function are classified, for example from the rate at which such genes are correctly judged to have the corresponding biological function. In addition, the degree to which a gene of unknown function has the designated biological function is evaluated from the clusters to which that gene belongs in the plurality of clustering results.

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD The present invention relates to a display method for presenting gene expression data, obtained by hybridization with specific genes, in a format that is visually easy to understand and that makes the functions and roles of genes easy to infer.

[0002]

DESCRIPTION OF THE RELATED ART As the number of species with sequenced genomes has grown, so-called comparative genomics has been widely practiced. It seeks insight from gene differences between species, for example by finding genes that appear to correspond to evolution, searching for the set of genes thought to be common to all organisms, or conversely inferring the characteristics particular to each species.

[0003] In recent years, however, with the development of infrastructure such as DNA chips and DNA microarrays (hereinafter collectively called biochips), the interest of molecular biology has been shifting from interspecies information to intraspecies information, that is, to co-expression analysis. Together with the interspecies comparisons made so far, the scope of work from information extraction to association is beginning to broaden greatly.

[0004] For example, if an unknown gene is found that shows the same expression pattern as a known gene, it can be inferred to have a function similar to that of the known gene. The functional meaning of such genes and proteins is studied in terms of functional units and functional groups. The interactions among them are also analyzed, either by matching against known enzyme reaction data and metabolic data or, more directly, by disrupting or overexpressing a gene so that its expression is eliminated or greatly increased, and then examining the expression patterns of all genes to determine that gene's direct and indirect effects.

[0005] A successful example in this field is the expression analysis of yeast by the group of P. Brown et al. at Stanford University (Michael B. Eisen et al.: Cluster analysis and display of genome-wide expression patterns: Proc. Natl. Acad. Sci. (1998) Dec 8; 95(25):14863-8). Using biochips, they hybridized genes extracted from cells over a time series and quantified the degree of gene expression (the intensity of the hybridized fluorescent signal). By mapping the values to colors, they displayed the expression course of each gene in an easily understood way. In doing so, they clustered genes whose expression patterns follow similar courses over a cell cycle (that is, genes whose expression levels are close at any given time point).

[0006] FIG. 1 is an example in which gene expression states are displayed following a method called hierarchical clustering, with experimental cases arranged horizontally and genes arranged vertically. The dendrogram on the left shows how, during clustering, the two closest clusters were merged at each step; the length of each branch corresponds to the distance between the two clusters at the time of merging.
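The merging process that the dendrogram records can be sketched in a few lines. This is only an illustrative sketch: the source does not fix the distance measure or the merge rule, so Euclidean distance and average linkage are assumptions here, and the function names `euclidean` and `agglomerate` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors):
    """Average-linkage agglomerative clustering (assumed merge rule).

    Returns the merges in order as (members_a, members_b, distance),
    where members are indices into `vectors`. The distance of each merge
    is what a dendrogram would draw as the branch height.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Each gene is passed as one vector of expression values, one value per experimental case. A different metric or linkage rule would generally produce a different dendrogram.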

[0007] With the progress of biology, the functions of genes are gradually being elucidated, and researchers are trying to analyze genes by combining expression data with known information. When analyzing a dendrogram, a researcher looks for biologically meaningful clusters (sets of genes). That is, if the genes in a cluster have similar expression patterns and many of them share the same known function, the cluster is extracted as a meaningful one. Such a cluster is called a functional cluster here. For example, if the genes in a functional cluster include one whose function is unknown, that gene can be inferred to have the same function as the genes of known function in the same cluster. Examining the expression pattern of a functional cluster also makes it possible to find an expression course characteristic of the function.

[0008]

PROBLEM TO BE SOLVED BY THE INVENTION Besides the hierarchical clustering described with FIG. 1, various methods of clustering based on gene expression data obtained from biochips have been studied, including self-organizing maps, Fisher's linear discriminant, support vector machines, the k-means method, Parzen estimation, decision trees, and principal component analysis. Moreover, the classification these methods produce differs depending on the parameters given to the clustering algorithm. In hierarchical clustering, for example, different combinations of the distance expressing the similarity between expression patterns and the algorithm for merging clusters yield different classifications (dendrograms). Furthermore, in hierarchical clustering, as shown in FIG. 2, the size of the clusters differs depending on which level is taken to constitute one cluster.
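The dependence of cluster size on the chosen level can be illustrated with a small sketch. It assumes the dendrogram is available as an ordered list of merges `(members_a, members_b, distance)` and that merge distances never decrease up the tree; the function name `cut_dendrogram` and the example merges are hypothetical.

```python
def cut_dendrogram(n_items, merges, cut_distance):
    """Return flat clusters obtained by keeping only merges at or below the cut.

    n_items: number of leaves (genes), indexed 0..n_items-1
    merges: list of (members_a, members_b, distance) tuples
    """
    parent = list(range(n_items))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members_a, members_b, distance in merges:
        if distance <= cut_distance:
            parent[find(members_a[0])] = find(members_b[0])

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Illustrative merges for four genes (made-up distances).
example_merges = [([0], [1], 1.0), ([2], [3], 1.5), ([0, 1], [2, 3], 4.0)]
low_cut = cut_dendrogram(4, example_merges, 2.0)   # two clusters of two genes
high_cut = cut_dendrogram(4, example_merges, 5.0)  # one cluster of four genes
```

The same tree therefore yields different clusters, and different functional clusters, depending on where it is cut.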

[0009] However, which of the many clustering methods is best suited to classifying expression patterns also depends on the nature of the data itself and is not known. Researchers therefore apply various methods to the expression data in searching for functional clusters. Trying various clustering methods, however, yields multiple classification results, which must then be compared with one another. That is, to decide which functional cluster a gene of unknown function belongs to, the researcher must look at the various classification results and make an overall judgment. This is very tedious work for the researcher.

[0010] In practice, moreover, some clustering methods classify genes well by function and others do not, so comparing their results on an equal footing may not give correct results. Also, even when functional clusters have been found by clustering and the function of a gene of unknown function has been inferred, confirming the inference usually requires follow-up experiments other than biochip experiments, or a search of public databases to see whether the same thing has been investigated in other organisms. Unlike biochips, these follow-up experiments do not allow comprehensive analysis, so when there are many unknown genes, examining them one by one is laborious.

[0011] In view of these problems of the prior art, an object of the present invention is therefore to provide a method for evaluating, from a plurality of clustering results, how reliable each clustering method is and what characteristics it has, and a method and display for evaluating the degree to which a gene has a designated biological function when the function of a gene of unknown function is being inferred.

[0012]

MEANS FOR SOLVING THE PROBLEM In the present invention, which clustering method has classified the genes well is evaluated, from a plurality of clustering results, by examining how genes of known biological function are classified.

[0013] FIG. 3 shows an example of a display method according to the present invention. In this example, the genes are classified by various clustering methods and, with reference to the genes of known function, the rates at which genes are correctly judged to have the corresponding biological function are compared. This rate is called the existence hit rate. For example, the circled existence hit rate in the figure shows that, after classification by the decision tree (MOC1), 78.5% of the genes of known function placed in the functional cluster for the biological function "cytoplasmic ribosomes" actually have the function "cytoplasmic ribosomes".
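The rate described here can be written down directly. The following is a minimal sketch, assuming gene annotations are held in a dictionary that maps gene IDs to function names and that genes of unknown function are simply absent from it; the function name and the gene IDs in the usage note are hypothetical.

```python
def existence_hit_rate(cluster_gene_ids, known_function, target_function):
    """Share of function-known genes in a functional cluster that really
    have the cluster's function.

    cluster_gene_ids: IDs of the genes placed in the functional cluster
    known_function: dict of gene ID -> known function name
                    (genes of unknown function are not in the dict)
    target_function: the function the cluster is taken to represent
    Returns a value in [0, 1], or None if the cluster has no known genes.
    """
    known = [g for g in cluster_gene_ids if g in known_function]
    if not known:
        return None
    correct = sum(1 for g in known if known_function[g] == target_function)
    return correct / len(known)
```

For instance, a cluster containing four annotated genes, three of them annotated "TCA cycle", plus one unannotated gene gives 3/4 = 0.75; the unannotated gene does not enter the calculation.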

[0014] Looking at the existence hit rate of each functional cluster makes it possible to evaluate how reliable each clustering method is and what characteristics it has. In FIG. 3, for example, hierarchical cluster analysis correctly discriminates four of the five biological functions at a rate of 80% or more, whereas the other algorithms do so for only three or fewer of the five. At a rate of 90% or more, however, the SVM discriminates more biological functions than cluster analysis does. It can therefore be confirmed that, for this data, hierarchical cluster analysis tends to be suited to coarse classification rather than fine classification.
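The comparison made in this paragraph amounts to counting, per clustering method, how many functions reach a given hit rate. A small sketch follows; the table values are made up for illustration (they are not the values of FIG. 3, which is not reproduced here), and the function and label names are hypothetical.

```python
def count_functions_at_or_above(hit_rates, threshold):
    """For each method, count the functions whose hit rate (in percent)
    is at or above the threshold."""
    counts = {}
    for method, rates in hit_rates.items():
        counts[method] = sum(1 for rate in rates.values() if rate >= threshold)
    return counts


# Illustrative hit rates in percent for five functions f1..f5 (made-up data).
example_rates = {
    "hierarchical": {"f1": 85.0, "f2": 82.0, "f3": 88.0, "f4": 91.0, "f5": 60.0},
    "SVM": {"f1": 95.0, "f2": 93.0, "f3": 92.0, "f4": 70.0, "f5": 55.0},
}

at_80 = count_functions_at_or_above(example_rates, 80.0)
at_90 = count_functions_at_or_above(example_rates, 90.0)
```

With these illustrative numbers the hierarchical method leads at the 80% level and the SVM at the 90% level, which is the kind of pattern the paragraph reads as "coarse" versus "fine" classification.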

[0015] Researchers want to focus on the cases in which classification has worked well. To emphasize these, the user can specify in dialog box 301, for example, that existence hit rates exceeding 80% be highlighted, which makes it easy to assess visually the validity of each clustering method.
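As a rough text-only stand-in for such highlighting, the sketch below brackets every hit rate that exceeds a user-supplied threshold. The bracket notation, the function name, and the 85.0 value in the usage note are assumptions for illustration; the actual display of FIG. 3 is graphical.

```python
def render_hit_rate_rows(hit_rates, functions, threshold):
    """Render one text row per clustering method, marking hit rates
    (in percent) that exceed the threshold with square brackets."""
    rows = []
    for method, rates in hit_rates.items():
        cells = []
        for function in functions:
            rate = rates.get(function)
            if rate is None:
                cells.append("n/a")  # no functional cluster or no known genes
            elif rate > threshold:
                cells.append(f"[{rate:.1f}%]")  # highlighted
            else:
                cells.append(f"{rate:.1f}%")
        rows.append(f"{method}: " + "  ".join(cells))
    return rows
```

Called with `{"decision tree": {"cytoplasmic ribosomes": 78.5, "TCA cycle": 85.0}}` and a threshold of 80.0, only the second cell is bracketed. A rate exactly equal to the threshold is not highlighted, matching the wording "exceeding 80%".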

[0016] FIG. 4 shows another example of a display method according to the present invention. In this display, conversely to FIG. 3, the clustering methods are compared by the rate at which genes are correctly judged not to have the corresponding biological function. This rate is called the non-existence hit rate. From these results, too, it is possible to evaluate how reliable each clustering method is and what characteristics it has.
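The complementary rate can be sketched in the same style. The sketch assumes that the genes "judged not to have the function" are all clustered genes outside the functional cluster, and that annotations are a dictionary from gene ID to function name with unknown genes absent; names and IDs are hypothetical.

```python
def non_existence_hit_rate(all_gene_ids, cluster_gene_ids, known_function,
                           target_function):
    """Share of function-known genes outside a functional cluster that
    really lack the cluster's function.

    Returns a value in [0, 1], or None if no known gene lies outside
    the cluster.
    """
    in_cluster = set(cluster_gene_ids)
    known_outside = [
        g for g in all_gene_ids
        if g not in in_cluster and g in known_function
    ]
    if not known_outside:
        return None
    correct = sum(
        1 for g in known_outside if known_function[g] != target_function
    )
    return correct / len(known_outside)
```

A method that puts every gene into the functional cluster would score well on the existence side only if the cluster is pure, and would have nothing to score here; looking at both rates together guards against such degenerate cases.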

[0017] FIG. 5 shows yet another example of a display method according to the present invention. This display shows the number of known genes correctly judged to have the corresponding biological function and the number of known genes contained in the functional cluster. The existence hit rate is the number of correctly judged known genes divided by the number of known genes contained in the functional cluster. For example, even if the existence hit rate is 50%, in a situation like the circled part of FIG. 5, where one of only two known genes has the corresponding function, the existence hit rate is evidently not a very reliable figure.
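Showing the two counts next to the rate is straightforward to sketch. The cell format and the function names below are assumptions for illustration, and annotations are again assumed to be a dictionary from gene ID to function name.

```python
def hit_counts(cluster_gene_ids, known_function, target_function):
    """Return (correctly judged known genes, known genes in the cluster)."""
    known = [g for g in cluster_gene_ids if g in known_function]
    correct = [g for g in known if known_function[g] == target_function]
    return len(correct), len(known)


def format_cell(correct, known_total):
    """Format the counts together with the rate they imply."""
    if known_total == 0:
        return "0/0 (n/a)"
    return f"{correct}/{known_total} ({100 * correct / known_total:.1f}%)"
```

Both `format_cell(1, 2)` and `format_cell(50, 100)` report 50.0%, but the counts make clear that the first rests on only two known genes and deserves much less trust.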

[0018] FIG. 6 is a schematic diagram of the genes of unknown function that fall in the functional cluster for the biological function "TCA cycle" when the genes are classified on the basis of their expression patterns by three clustering methods: hierarchical cluster analysis, self-organizing map, and SVM (third-order). Each point in FIG. 6 corresponds to one gene. These genes of unknown function are the ones inferred to have the function "TCA cycle". The genes can be distinguished by how they were classified. For example, gene 601 was judged to belong to the "TCA cycle" functional cluster by all three methods, whereas gene 602 was judged to belong to it by SVM (third-order) but not by the other two clustering methods.

[0019] When following up genes of unknown function, many genes cannot be examined at once, as noted above. Once the situation in FIG. 6 is known, however, it can be seen that the genes judged to have the function "TCA cycle" by all three methods (hierarchical cluster analysis, self-organizing map, and SVM (third-order)) are the ones most likely to have that function. Follow-up experiments should therefore target the genes in this region first.
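The grouping behind FIG. 6 and the resulting follow-up order can be sketched as follows. The sketch assumes each clustering method is summarized by the set of gene IDs it placed in the functional cluster, and that genes of known function are those present in an annotation dictionary; all names and IDs are hypothetical.

```python
def methods_per_unknown_gene(assignments, known_function):
    """Map each gene of unknown function to the set of methods that placed
    it in the functional cluster.

    assignments: dict of method name -> set of gene IDs in the cluster
    known_function: dict of gene ID -> known function name
    """
    per_gene = {}
    for method, gene_ids in assignments.items():
        for gene_id in gene_ids:
            if gene_id not in known_function:
                per_gene.setdefault(gene_id, set()).add(method)
    return per_gene


def follow_up_order(per_gene):
    """Order genes so those supported by the most methods come first
    (ties broken by gene ID)."""
    return sorted(per_gene, key=lambda g: (-len(per_gene[g]), g))
```

Genes supported by all three methods land at the front of the list, corresponding to the central region of the diagram.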

[0020] Furthermore, by weighting each region on the basis of the results obtained in FIGS. 3 and 4, the plausibility of a functional inference can be evaluated more flexibly. For example, the existence hit rates in FIG. 3 for the functional clusters relating to the function "TCA cycle" show that genes with the function "TCA cycle" are gathered poorly by the self-organizing map and well by SVM (third-order). Accordingly, in FIG. 6, gene 602 can be seen to be more likely than gene 603 to have the function "TCA cycle".
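One way to make this weighting concrete is to use each method's existence hit rate for the function as that method's weight. This is an assumption for illustration, not a rule stated in the source; the weight values are made up, and treating gene 603 as supported only by the self-organizing map is likewise an assumption.

```python
def weighted_support(methods_for_gene, method_weight):
    """Sum the weights of the methods that placed a gene in the
    functional cluster."""
    return sum(method_weight[m] for m in methods_for_gene)


# Illustrative weights, e.g. existence hit rates for "TCA cycle" (made up).
weights = {
    "hierarchical": 0.7,
    "self-organizing map": 0.4,
    "SVM (third-order)": 0.9,
}

gene_602 = {"SVM (third-order)"}       # supported by the SVM only
gene_603 = {"self-organizing map"}     # assumed: supported by the SOM only

support_602 = weighted_support(gene_602, weights)
support_603 = weighted_support(gene_603, weights)
```

Both genes are supported by exactly one method, so a plain count cannot separate them; the weights can.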

[0021] FIG. 7 shows still another example of a display method according to the present invention, which quantifies the degree, described above, to which each gene has a function. When a biological function name is entered in box 701, the degree to which each gene has the entered function is calculated and the genes are listed in order of that degree (score). The score is calculated according to whether or not the gene belongs to the functional cluster obtained by each clustering method, with weighting based on the existence hit rate and non-existence hit rate shown in FIGS. 3 and 4 and on other indices. From this display, genes with a high probability of having the function entered in box 701 can be found easily, which helps when planning follow-up experiments.

[0022] To summarize the above, the method for evaluating gene clustering results according to the present invention evaluates the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for a group of genes judged by the clustering to have a predetermined biological function, the clustering result is evaluated on the basis of the proportion, among the genes of known biological function belonging to that group, of genes known to have the predetermined biological function.

[0023] The method for evaluating gene clustering results according to the present invention also evaluates the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for a group of genes judged by the clustering not to have a predetermined biological function, the clustering result is evaluated on the basis of the proportion, among the genes of known biological function belonging to that group, of genes known not to have the predetermined biological function.

[0024] The method for displaying gene clustering results according to the present invention displays the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for a group of genes judged by the clustering to have a predetermined biological function, the proportion, among the genes of known biological function belonging to that group, of genes known to have the predetermined biological function is calculated and displayed.

[0025] The method for displaying gene clustering results according to the present invention also displays the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for a group of genes judged by the clustering not to have a predetermined biological function, the proportion, among the genes of known biological function belonging to that group, of genes known not to have the predetermined biological function is calculated and displayed.

[0026] The method for displaying gene clustering results according to the present invention also displays the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for each group of genes judged by the clustering to have a predetermined biological function, the number of genes of known biological function belonging to that group and the number of genes known to have the predetermined biological function are displayed.

[0027] The method for displaying gene clustering results according to the present invention also displays the result of clustering a plurality of genes by a predetermined clustering method on the basis of their expression patterns, and is characterized in that, for each group of genes judged by the clustering not to have a predetermined biological function, the number of genes of known biological function belonging to that group and the number of genes known not to have the predetermined biological function are displayed.

[0028] The method according to the present invention for extracting, from a plurality of genes, candidate genes having a predetermined biological function is characterized by comprising: a step of clustering the plurality of genes by a plurality of clustering methods on the basis of their expression patterns; a step of adding, to each gene judged to have the predetermined biological function in the clustering result of one clustering method, the weight set for that clustering method; a step of performing the weight-adding step for the clustering results of all the remaining clustering methods and calculating a score for each gene; and a step of extracting candidate genes having the predetermined biological function on the basis of the magnitude of the scores.
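The weight-adding and extraction steps can be sketched as follows, taking the clustering itself as already done. The sketch assumes each clustering result is given as a pair of the method's weight and the set of genes it judged to have the function, and that "extraction on the basis of the magnitude of the scores" means keeping genes at or above a minimum score; both are assumptions, as are the function names.

```python
def score_genes(gene_ids, results):
    """Accumulate a score per gene.

    gene_ids: all genes being scored
    results: list of (weight, genes_judged_to_have_function), one entry
             per clustering method
    """
    scores = {gene_id: 0.0 for gene_id in gene_ids}
    for weight, judged in results:
        for gene_id in judged:
            if gene_id in scores:
                scores[gene_id] += weight  # add this method's weight
    return scores


def extract_candidates(scores, min_score):
    """Return (gene ID, score) pairs with score >= min_score,
    highest score first (ties broken by gene ID)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(gene_id, s) for gene_id, s in ranked if s >= min_score]
```

A gene that no method places in the functional cluster keeps a score of 0, and a gene placed there by every method receives the sum of all weights.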

[0029] The method according to the present invention for displaying candidate genes having a predetermined biological function is characterized by comprising: a step of clustering a plurality of genes by a plurality of clustering methods on the basis of their expression patterns; a step of adding, to each gene judged to have the predetermined biological function in the clustering result of one clustering method, the weight set for that clustering method; a step of performing the weight-adding step for the clustering results of all the remaining clustering methods and calculating a score for each gene; and a step of displaying the genes, together with their scores, in descending order of score.

[0030]

BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention will be described below with reference to the drawings. FIG. 8 is a system configuration diagram of the present invention. The system comprises a biochip database 801 that records gene information and expression courses; a display device 802 for displaying analysis results; input means such as a keyboard 803 and a mouse 804 for entering values into the system and making selections; a central processing unit 805 that performs the clustering and calculates the existence hit rate, non-existence hit rate, scores, and so on; and a program memory 806 that stores the programs needed for processing in the central processing unit 805. The program memory 806 holds a clustering processing program 807 that clusters expression pattern data, a hit rate calculation processing program 808 that calculates the existence hit rate and non-existence hit rate, a score calculation processing program 809 that calculates for each gene the degree to which it has a biological function, and an analysis result display processing program 810 that displays the results of these analyses and calculations. These programs can be provided on a recording medium such as a CD-ROM, DVD-ROM, MO, or floppy (registered trademark) disk, or via a network. The biochip database 801 may be held in a storage device connected to the central processing unit 805, or it may be managed by a server computer at a remote location, with the gene data obtained from that database via a network or the like.

[0031] FIG. 9 shows the concrete structure of the data stored in the biochip database 801. The gene information is stored in an array gene[i] (i=1,2,…,gNum) of length gNum, where gNum is the number of genes contained in the gene data. Each gene record consists of a gene ID (900) that uniquely identifies the gene, attribute information (901) describing the gene, and expression data (902) obtained from the biochip. The attributes describing a gene include, for example, the gene name (903), the ORF (904), and the gene function (905). If the function of the gene is not known, the function attribute 905 is set to "UNKOWN". Attributes other than these can also be defined as members of the gene information structure. The expression data holds, for each experiment, the quantified degree of expression of the gene (the intensity of the hybridized fluorescent signal). In the present invention, with n as the number of experiments, each gene is treated as an n-dimensional vector.
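A record of this shape might be modeled as below. This is a sketch only: the field names are hypothetical, the reference numerals in the comments tie them back to the description, and the sentinel string is spelled as it appears in the source text.

```python
from dataclasses import dataclass, field
from typing import List

UNKNOWN_FUNCTION = "UNKOWN"  # sentinel, spelled as in the source text


@dataclass
class Gene:
    gene_id: int                       # gene ID (900)
    name: str                          # gene name (903)
    orf: str                           # ORF (904)
    function: str = UNKNOWN_FUNCTION   # gene function (905)
    # Expression data (902): one quantified expression level per experiment,
    # so a gene is an n-dimensional vector for n experiments.
    expression: List[float] = field(default_factory=list)

    def is_known(self) -> bool:
        """True if the gene has a known function."""
        return self.function != UNKNOWN_FUNCTION


# Hypothetical records; names and values are illustrative only.
genes = [
    Gene(1, "geneA", "ORF0001", "TCA cycle", [0.2, 1.4, 0.9]),
    Gene(2, "geneB", "ORF0002", expression=[0.1, 1.5, 1.0]),
]
known_count = sum(1 for g in genes if g.is_known())
```

Note that the description indexes the array from 1, while a Python list is indexed from 0.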

【００３２】図１０はクラスタリング結果を格納するた
めの構造体配列を示したものであり、これはclusterRes
ult[i] （i=1,2,…,crNum）という長さcrNumの配列に格
納されているとする。ただし、crNumはクラスタリング
した回数である。クラスタリング方法に与えるパラメー
タを変えれば、それらは別々の分類結果として扱うこと
にし、配列の要素としても別々にとることにする。この
構造体配列は、分類結果がどのようなクラスタリング方
法であるのかを表示時に識別するためのクラスタリング
方法名（１０００）と、その具体的なクラスタリング方
法（１００１）及びクラスタリングアルゴリズムに与え
るパラメータ情報（１００２）と、クラスタリング対象
の遺伝子に含まれる総既知遺伝子数（１００３）と、分
類されたクラスタの数を示すクラスタ数（１００４）
と、先頭クラスタへのポインタ（１００５）からなる。
実際の各クラスタの内容はリスト構造で格納しており、
先頭クラスタ１００５から全てのクラスタを辿れるよう
にしている。各リストは、各クラスタに対応しており、
リストのメンバとして、クラスタに含まれる遺伝子の数
を示す所属遺伝子数（１００６）と、含まれる遺伝子を
示す所属遺伝子ID（１００７）と、そのクラスタの機能
クラスタ名（１００８）と、スコア算出時に機能クラス
タにかけるウェイト（１００９）と、次のクラスタへの
リストを指し示す次クラスタへのポインタ（１０１０）
からなる。クラスタが機能クラスタではない場合は、１
０１１のように機能クラスタ名を「UNKOWN」とし、１０
１２のようにウェイトを０とする。また、リストの最終
端にきたときは、１０１３のように次クラスタへのポイ
ンタをNULLとする。FIG. 10 shows the structure array that stores the clustering results. The results are assumed to be stored in an array of length crNum, clusterResult[i] (i = 1, 2, ..., crNum), where crNum is the number of clustering runs. If the parameters given to a clustering method are changed, the outcomes are treated as separate classification results and occupy separate elements of the array. Each element of this structure array consists of a clustering method name (1000) used at display time to identify which clustering method produced the classification result, the specific clustering method (1001), the parameter information (1002) given to the clustering algorithm, the total number of known genes (1003) among the genes subject to clustering, the number of clusters (1004) into which the genes were classified, and a pointer to the first cluster (1005). The actual contents of the clusters are stored in a list structure so that every cluster can be reached from the first cluster 1005. Each list node corresponds to one cluster and has as members the number of belonging genes (1006) in the cluster, the belonging gene IDs (1007) identifying those genes, the functional cluster name (1008) of the cluster, the weight (1009) applied to the functional cluster when the score is calculated, and a pointer to the next cluster (1010). If a cluster is not a functional cluster, its functional cluster name is set to "UNKNOWN" as in 1011 and its weight is set to 0 as in 1012. At the end of the list, the pointer to the next cluster is set to NULL as in 1013.
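The structure described for FIG. 10 can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the class names `Cluster` and `ClusterResult`, the field names, and the helper `link_clusters` are hypothetical and do not appear in the specification; only the reference numerals in the comments come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Cluster:
    """One node of the per-result cluster list (hypothetical names)."""
    gene_ids: List[str]                  # belonging gene IDs (1007)
    function_name: str = "UNKNOWN"       # functional cluster name (1008)
    weight: float = 0.0                  # weight used for scoring (1009)
    next: Optional["Cluster"] = None     # pointer to the next cluster (1010)

    @property
    def gene_count(self) -> int:         # number of belonging genes (1006)
        return len(self.gene_ids)


@dataclass
class ClusterResult:
    """One element clusterResult[i] of the structure array."""
    method_name: str                     # clustering method name (1000)
    method: str                          # specific clustering method (1001)
    params: Dict[str, object]            # parameter information (1002)
    known_gene_total: int                # total number of known genes (1003)
    cluster_count: int = 0               # number of clusters (1004)
    first: Optional[Cluster] = None      # pointer to the first cluster (1005)

    def clusters(self) -> Iterator[Cluster]:
        """Walk the list from the first cluster until the NULL pointer."""
        node = self.first
        while node is not None:
            yield node
            node = node.next


def link_clusters(clusters: List[Cluster]) -> Optional[Cluster]:
    """Chain the clusters in order and return the head of the list."""
    head = None
    for cluster in reversed(clusters):
        cluster.next = head
        head = cluster
    return head
```

A cluster created without a function name defaults to "UNKNOWN" with weight 0, which mirrors the handling of non-functional clusters (1011, 1012), and the last node keeps `next = None`, the counterpart of the NULL pointer (1013).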

[0033] FIG. 11 shows the specific structure of the array data used for score calculation. The data are assumed to be stored in an array of length sNum, score[i] (i = 1, 2, ..., sNum), where sNum is the total number of genes being scored. Each element of this structure array consists of a gene ID (1100) and a score (1101). FIG. 12 is a flowchart outlining the processing according to the present invention. First, the gene expression pattern data gene[i] (i = 1, 2, ..., gNum) are read from the biochip database 801 (step 1200). Next, the clustering method to apply and the parameters to give to the clustering algorithm are selected using the keyboard 803 or the mouse 804 (step 1201). Cluster analysis is then performed, and the result is registered in clusterResult[i] (i = 1, ..., crNum) (step 1202). Specifically, a name that uniquely identifies at display time the clustering method chosen in step 1201 is registered in the clustering method name 1000 of clusterResult[i]; the actual clustering method and parameters are registered in the clustering method 1001 and the parameter information 1002; the number of known genes in the target gene data is registered in the total number of known genes 1003; and the number of clusters is registered in the number of clusters 1004. For each cluster, the number of genes it contains is entered in the number of belonging genes 1006, the genes it contains in the belonging gene IDs 1007, and the functional cluster name in 1008. The clusters are linked into a list structure through the pointers to the next cluster 1010, and the head pointer is registered in the pointer to the first cluster 1005.
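As a rough illustration of the registration in step 1202, the following Python sketch turns a gene-to-cluster assignment into one result record. It is a sketch under assumptions, not the patented implementation: the function `register_result`, the dictionary keys, and the use of a plain Python list in place of the linked list are all hypothetical, and the functional cluster name is left as "UNKNOWN" until it is decided separately.

```python
from collections import defaultdict

UNKNOWN = "UNKNOWN"


def register_result(method_name, method, params, assignment, gene_function):
    """Build one clusterResult[i]-style record (hypothetical layout).

    assignment:    gene ID -> cluster label produced by the clustering run
    gene_function: gene ID -> known function, or "UNKNOWN" / missing
    """
    members = defaultdict(list)
    for gene_id, label in assignment.items():
        members[label].append(gene_id)

    # Count clustered genes whose biological function is known (1003).
    known_total = sum(
        1 for gene_id in assignment
        if gene_function.get(gene_id, UNKNOWN) != UNKNOWN
    )

    # One entry per cluster; a Python list stands in for the linked list.
    clusters = [
        {"gene_ids": ids, "function_name": UNKNOWN, "weight": 0.0}
        for _, ids in sorted(members.items())
    ]
    return {
        "method_name": method_name,        # 1000
        "method": method,                  # 1001
        "params": params,                  # 1002
        "known_gene_total": known_total,   # 1003
        "cluster_count": len(clusters),    # 1004
        "clusters": clusters,              # reached via 1005 in the text
    }
```

Running the same method with different parameters would simply produce a second record, in line with the rule that each parameter setting is a separate classification result.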

[0034] Clustering methods include those that cluster using known information (supervised) and those that cluster without assuming known information (unsupervised). In the supervised case the functional cluster is uniquely determined, but in the unsupervised case it is not. For unsupervised clustering, therefore, the rule for deciding which cluster is the functional cluster must be fixed in advance, for example by taking the cluster that contains the largest number of known genes having the relevant biological function, or the cluster to which a representative gene among the known genes having that function belongs.
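The first rule mentioned for the unsupervised case (take the cluster holding the most known genes with the relevant function) can be sketched as follows. The function name `pick_functional_cluster`, the list-of-lists cluster representation, and the tie-breaking behaviour (the first such cluster wins) are assumptions made for illustration only.

```python
def pick_functional_cluster(clusters, gene_function, target):
    """Return the index of the cluster with the most known `target` genes.

    clusters:      list of lists of gene IDs
    gene_function: gene ID -> known function (missing means unknown)
    Returns None when no cluster contains a known gene with `target`.
    """
    if not clusters:
        return None
    counts = [
        sum(1 for gene_id in genes if gene_function.get(gene_id) == target)
        for genes in clusters
    ]
    # max() keeps the first index on ties (an arbitrary, assumed choice).
    best = max(range(len(clusters)), key=counts.__getitem__)
    return best if counts[best] > 0 else None
```

The alternative rule in the text, choosing the cluster that contains a designated representative gene, would replace the counting step with a simple membership test.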

[0035] clusterResult[i] may hold the results output in step 1202; alternatively, earlier clustering results that were stored in the biochip database or the like may be used as they are. Next, the existence hit rate, the non-existence hit rate, and the score are calculated from clusterResult[i] (steps 1203 and 1204). These processes are described in detail below. This completes the processing.

[0036] FIG. 13 is a detailed flow of the processing (step 1203) in FIG. 12 that calculates the existence hit rate and the non-existence hit rate. Using the keyboard 803 or the mouse 804, the biological function to be examined is selected and, if highlighting is to be used, its threshold is entered (step 1300). With i initialized to 1, the following processing is repeated until i reaches crNum (steps 1301 and 1302). First, the functional cluster C corresponding to the biological function selected in step 1300 is searched for in clusterResult[i]. That is, the list is traversed from the pointer to the first cluster 1005 of clusterResult[i], the cluster whose functional cluster name 1008 matches the biological function is found, and it is taken as C (step 1303).

[0037] Next, the existence hit rate is obtained. Among the belonging gene IDs of the functional cluster C, the number R of known genes that have the biological function selected in step 1300 and the number S of known genes that do not have it are counted. That is, R is obtained by counting the genes in the belonging gene IDs 1007 whose function 905 is the relevant biological function, and S by counting those whose function 905 is neither the relevant biological function nor "UNKNOWN". The existence hit rate is then calculated as 100 × R / (R + S) (steps 1304 and 1305).
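The counting of R and S and the resulting existence hit rate can be illustrated with a minimal Python sketch. The function name `existence_hit_rate` and the dictionary that maps gene IDs to their function (standing in for function 905) are assumptions; returning `None` when the cluster has no known genes is also an assumed convention, since the text does not say how that case is handled.

```python
UNKNOWN = "UNKNOWN"


def existence_hit_rate(cluster_genes, gene_function, target):
    """Return (R, S, rate) for functional cluster C.

    R: known genes in C that have the target function
    S: known genes in C that have some other (non-UNKNOWN) function
    rate: 100 * R / (R + S), or None if C holds no known genes
    """
    r = 0
    s = 0
    for gene_id in cluster_genes:
        function = gene_function.get(gene_id, UNKNOWN)
        if function == target:
            r += 1
        elif function != UNKNOWN:
            s += 1
    rate = 100.0 * r / (r + s) if (r + s) > 0 else None
    return r, s, rate
```

For a cluster with two genes known to have the target function, one gene known to have a different function, and one gene of unknown function, this gives R = 2, S = 1, and a rate of about 66.7; the unknown gene does not affect the rate.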

[0038] Next, the non-existence hit rate is obtained. The gene IDs contained in the clusters other than the functional cluster C are examined, and the number T of known genes that have the biological function selected in step 1300 is counted. As before, this is done by counting the genes in the belonging gene IDs 1007 whose function 905 is the relevant biological function. Then, using the total number of known genes 1003 of clusterResult[i], the non-existence hit rate is calculated as 100 × {total number of known genes − (R + S + T)} / {total number of known genes − (R + S)} (steps 1306 and 1307).
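The non-existence hit rate follows the same pattern: the denominator, total known genes minus (R + S), is the number of known genes outside C, and subtracting T from it leaves the known genes outside C that indeed lack the function. A minimal sketch, assuming a hypothetical function name `non_existence_hit_rate` and an assumed `None` result when no known gene lies outside C:

```python
UNKNOWN = "UNKNOWN"


def non_existence_hit_rate(other_clusters, gene_function, target,
                           known_total, r, s):
    """Return (T, rate) for the clusters other than functional cluster C.

    T: known genes outside C that nevertheless have the target function
    rate: 100 * (N - (R + S + T)) / (N - (R + S)), N = known_total
    """
    t = sum(
        1
        for genes in other_clusters
        for gene_id in genes
        if gene_function.get(gene_id, UNKNOWN) == target
    )
    known_outside = known_total - (r + s)
    if known_outside <= 0:
        return t, None
    return t, 100.0 * (known_total - (r + s + t)) / known_outside
```

With 7 known genes in total, R = 2, S = 1, and one target-function gene found outside C (T = 1), the rate is 100 × 3 / 4 = 75.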

[0039] Steps 1303 to 1307 are performed for every biological function selected in step 1300; i is then incremented by one and processing returns to step 1302 (step 1308). When i reaches crNum in step 1302, all existence hit rates and non-existence hit rates have been calculated. The existence hit rates, the non-existence hit rates, the values of R and S, and the like are then displayed on the display device 802, and the processing ends (step 1309).

[0040] FIG. 14 is a detailed flow of the score calculation processing (step 1204) in FIG. 12. Using the keyboard 803 or the mouse 804, the biological function to be scored is selected, and a weight is set for each clustering method. The weight may be taken from the existence hit rate or the non-existence hit rate, or may be set on the basis of another index. The weight is registered in the weight 1009 of each cluster (step 1400).

[0041] With i initialized to 1, the following processing is repeated until i reaches crNum (steps 1401 and 1402). First, the functional cluster C corresponding to the biological function selected in step 1400 is searched for in clusterResult[i]. That is, the list is traversed from the pointer to the first cluster 1005 of clusterResult[i], the cluster whose functional cluster name 1008 matches the biological function is found, and it is taken as C (step 1403). The gene IDs belonging to the functional cluster C are then examined. If a gene ID is not registered in the gene ID 1100 of score[i] (i = 1, ..., sNum), it is registered in the gene ID 1100 and its score 1101 is initialized to 0. Once a gene ID belonging to the functional cluster C is registered in score[i], the value of the weight 1009 is added to its score 1101 (step 1404).
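The accumulation in steps 1401 to 1404 can be sketched in Python as below. Two points are assumptions rather than statements of the specification: the dictionary-based record layout and the name `accumulate_scores` are hypothetical, and the sketch adds the weight on a gene's first appearance as well as on later ones, which is one reading of the text that matches the weight-adding step of claim 7.

```python
def accumulate_scores(results, target):
    """Sum, per gene, the weights of the functional clusters it falls in.

    results: list of records, each with a "clusters" list whose entries
             carry "function_name", "weight", and "gene_ids"
    target:  the biological function being scored
    Returns a dict gene ID -> score (the score[i] array of FIG. 11).
    """
    scores = {}
    for result in results:                          # i = 1 .. crNum
        for cluster in result["clusters"]:          # walk the cluster list
            if cluster["function_name"] != target:
                continue
            for gene_id in cluster["gene_ids"]:
                scores.setdefault(gene_id, 0.0)     # register with score 0
                scores[gene_id] += cluster["weight"]  # add weight (1009)
            break                                   # C found for this result
    return scores
```

A gene placed in the functional cluster by several clustering methods therefore ends up with a higher score than one supported by a single method.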

[0042] Steps 1403 and 1404 are performed for every biological function selected in step 1400; i is then incremented by one and processing returns to step 1402 (step 1405). When i reaches crNum in step 1402, all scores have been calculated. The entries of score[i] are then sorted by the value of the score 1101 and displayed on the display device 802, and the processing ends (step 1406). The processing described above makes possible the evaluation and analysis of cluster analysis results shown in FIGS. 3, 4, 5, and 7.
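The final sort and display can be illustrated with a few lines of Python. The names `rank_candidates` and `format_ranking` are hypothetical; the descending order follows claim 8, while breaking ties by gene ID is an assumption added only to make the output deterministic.

```python
def rank_candidates(scores):
    """Return (gene ID, score) pairs, highest score first."""
    # Sort by descending score; ties fall back to gene ID (assumed rule).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def format_ranking(scores):
    """Render the ranking as plain text lines, one gene per line."""
    return [
        f"{rank}. {gene_id}: {score:.1f}"
        for rank, (gene_id, score) in enumerate(rank_candidates(scores), 1)
    ]
```

Genes of unknown function that appear near the top of such a ranking are the candidates most consistently grouped with the designated biological function.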

【００４３】[0043]

[Effects of the Invention] As described above, according to the present invention, examining how genes of known biological function are classified across a plurality of clustering results makes it possible to evaluate how reliable each clustering method is and what characteristics it has. In addition, examining which clusters a gene of unknown function belongs to across the plurality of clustering results makes it possible to evaluate the degree to which the gene has a designated biological function. Such evaluation results help in confirming the function of a gene whose function is unknown.

[Brief description of drawings]

FIG. 1 is a diagram showing a display example of a hierarchical clustering result.

FIG. 2 is an explanatory diagram showing differences in cluster size in hierarchical clustering.

FIG. 3 is a diagram showing the proportion of genes correctly determined to have the relevant biological function.

FIG. 4 is a diagram showing the proportion of genes correctly determined not to have the relevant biological function.

FIG. 5 is a diagram showing the number of correctly determined known genes and the number of known genes contained in a cluster.

FIG. 6 is an explanatory diagram showing the relationship among genes inferred by a plurality of clustering methods.

FIG. 7 is a diagram showing scoring based on the plausibility of the inferred genes.

FIG. 8 is a block diagram showing the configuration of the cluster result analysis system according to the present invention.

FIG. 9 is a diagram showing an example of the data structure of gene data.

FIG. 10 is a diagram showing the data structure of the clustering result data.

FIG. 11 is a diagram showing the data structure of the array data for score calculation.

FIG. 12 is a flowchart outlining the processing of the cluster result analysis system according to the present invention.

FIG. 13 is a flowchart showing the details of the calculation of the existence hit rate and the non-existence hit rate.

FIG. 14 is a flowchart showing the details of the score calculation processing.

[Explanation of symbols]

801: biochip database; 802: display device; 803: keyboard; 804: mouse; 805: central processing unit; 806: program memory; 807: clustering processing unit; 808: hit rate calculation processing unit; 809: score calculation processing unit; 810: analysis result display processing unit

Continuation of front page

(72) Inventor: Akira Nakashige, c/o Hitachi Software Engineering Co., Ltd., 6-81 Onoe-cho, Naka-ku, Yokohama-shi, Kanagawa
(72) Inventor: Toshiko Matsumoto, c/o Hitachi Software Engineering Co., Ltd., 6-81 Onoe-cho, Naka-ku, Yokohama-shi, Kanagawa
(72) Inventor: Shingo Ueno, c/o Hitachi Software Engineering Co., Ltd., 6-81 Onoe-cho, Naka-ku, Yokohama-shi, Kanagawa
(72) Inventor: Takuro Tamura, c/o Hitachi Software Engineering Co., Ltd., 6-81 Onoe-cho, Naka-ku, Yokohama-shi, Kanagawa

F-terms (reference): 2G045 DA12 DA13; 4B063 QA08 QA18 QQ42 QQ52 QR08 QR42 QR55 QR62 QS25 QS34 QS39 QX01

Claims

[Claims]

1. A gene clustering result evaluation method for evaluating a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized in that, for a gene group determined by the clustering to have a predetermined biological function, the clustering result is evaluated based on the proportion, among the genes of known biological function belonging to the gene group, of genes known to have the predetermined biological function.

2. A gene clustering result evaluation method for evaluating a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized in that, for a gene group determined by the clustering not to have a predetermined biological function, the clustering result is evaluated based on the proportion, among the genes of known biological function belonging to the gene group, of genes known not to have the predetermined biological function.

3. A gene clustering result display method for displaying a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized by calculating and displaying, for a gene group determined by the clustering to have a predetermined biological function, the proportion, among the genes of known biological function belonging to the gene group, of genes known to have the predetermined biological function.

4. A gene clustering result display method for displaying a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized by calculating and displaying, for a gene group determined by the clustering not to have a predetermined biological function, the proportion, among the genes of known biological function belonging to the gene group, of genes known not to have the predetermined biological function.

5. A gene clustering result display method for displaying a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized by displaying, for each gene group determined by the clustering to have a predetermined biological function, the number of genes of known biological function belonging to the gene group and the number of genes known to have the predetermined biological function.

6. A gene clustering result display method for displaying a result of clustering a plurality of genes by a predetermined clustering method based on their expression patterns, characterized by displaying, for each gene group determined by the clustering not to have a predetermined biological function, the number of genes of known biological function belonging to the gene group and the number of genes known not to have the predetermined biological function.

7. A method of extracting candidates for genes having a predetermined biological function from a plurality of genes, the method comprising: a step of clustering the plurality of genes by a plurality of clustering methods based on their expression patterns; a step of adding, to each gene determined to have the predetermined biological function as a result of clustering by one clustering method, a weight set for that clustering method; a step of calculating a score for each gene by performing the weight-adding step on the clustering results of all the remaining clustering methods; and a step of extracting candidates for genes having the predetermined biological function based on the magnitude of the score.

8. A method of displaying candidates for genes having a predetermined biological function, the method comprising: a step of clustering a plurality of genes by a plurality of clustering methods based on their expression patterns; a step of adding, to each gene determined to have the predetermined biological function as a result of clustering by one clustering method, a weight set for that clustering method; a step of calculating a score for each gene by performing the weight-adding step on the clustering results of all the remaining clustering methods; and a step of arranging the genes in descending order of score and displaying them together with their scores.