JP4413017B2 - Clustering program and clustering apparatus

Info

Publication number: JP4413017B2
Application number: JP2004008510A
Authority: JP
Inventors: 聡雄西垣
Original assignee: RIKEN Institute of Physical and Chemical Research
Current assignee: RIKEN Institute of Physical and Chemical Research
Priority date: 2004-01-15
Filing date: 2004-01-15
Publication date: 2010-02-10
Anticipated expiration: 2024-01-15
Also published as: JP2005202697A

Description

The present invention relates to a clustering program for classifying a plurality of sequences according to their degree of homology and, more specifically, to a clustering program and a clustering apparatus capable of speeding up the "single linkage clustering" algorithm.

For example, the process of classifying cDNA end sequences according to their degree of homology with known sequences is called "EST clustering", and the "single linkage clustering" algorithm is commonly used as one part of the algorithms that implement it. The "single linkage clustering" algorithm can be realized by, for example, "SingleLinkageClusterer" (ftp://ftp.tigr.org/pub/software/singleLinkageClusterer).

A representative program for classifying cDNA end sequences according to their degree of homology with known sequences is "BLASTCLUST" (ftp://ftp.ncbi.nih.gov/blast/executables/LATEST-BLAST) (see Non-Patent Document 1). This program simultaneously performs the task of calculating the homology between a base sequence or amino acid sequence and known sequences and the task of classifying based on the calculation results. Specifically, Non-Patent Document 1 describes the following. "BLASTCLUST" uses "single linkage clustering" to group amino acid sequences based on homology results calculated by the "BLAST" algorithm. The program takes as input a "FASTA" format file consisting of multiple protein sequences, each with a unique sequence identifier, and outputs a file of sequence identifiers grouped into clusters.

"BLASTCLUST" can start processing from "BLAST". Alternatively, because the hit list (queryID, hitID) is saved in case of a crash, processing can also start from "single linkage clustering" by supplying that hit list as input.

Non-Patent Document 1: NCBI News, Fall/Winter 2000, "http://www.ncbi.nih.gov/Web/Newsltr/FallWinter2000/fall_winter2000.pdf"

However, while "BLASTCLUST" is convenient in that it simultaneously calculates the homology between cDNA end sequences of unknown function and known sequences and classifies based on the results, it is inflexible: even when a user wants to use only the "single linkage clustering" function, no such function is provided.

Furthermore, "SingleLinkageClusterer", one means of implementing the "single linkage clustering" algorithm, creates cluster members by, for example, checking the sequences in the hit list against one another in an all-against-all manner. Because this all-against-all check must be performed among the sequences in the hit list, the time required for classification increases.

Recently, dedicated "BLAST" machines that perform parallel processing on PC clusters have become generally available, and using such a machine speeds up the processing up to hit list creation. However, the "single linkage clustering" algorithm that follows "BLAST" involves checking all sequences in the hit list against one another, so it cannot be parallelized. For this reason, speeding up the processing of the "single linkage clustering" algorithm (the clustering process) has been very difficult.

The present invention has been made in view of the above. Its object is to provide a clustering program and a clustering apparatus that, when classifying cDNA end sequences of unknown function according to their degree of homology with known sequences, separate "BLAST" from "single linkage clustering" and significantly reduce the processing time required for "single linkage clustering".

To solve the problems described above and achieve the object, a clustering program according to the present invention classifies a plurality of sequences to be classified on the basis of homology, and causes a computer to execute: a sequence-number-ordered data generation step of, for example, reading a hit list in which the identifiers of a plurality of reference sequences included in the sequences to be classified are associated with the identifiers of sequences having high homology with those reference sequences, grouping, based on the hit list, the identifier of each reference sequence with the identifiers of the sequences having high homology with it, assigning a distinct cluster number to each group, sorting the identifiers to which cluster numbers have been assigned, assigning a distinct sequence number to each identifier, and then generating sequence-number-ordered data in which the identifier, cluster number, and sequence number are associated with one another in sequence number order; a sequence-number-ordered data storage step of storing the sequence-number-ordered data in a first area; a cluster-number-ordered data generation step of reading the sequence-number-ordered data, sorting it by cluster number, and generating cluster-number-ordered data in which the identifier, cluster number, and sequence number are associated with one another in cluster number order; a cluster-number-ordered data storage step of storing the cluster-number-ordered data in a second area; a cluster member determination step of reading the sequence-number-ordered data and the cluster-number-ordered data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same sequence number as the same cluster member, and generating a cluster member list describing the identifiers that constitute that cluster member; and a cluster member list storage step of storing the cluster member list in a third area.

A clustering program according to the present invention also classifies a plurality of cDNA end sequences of unknown function, each on the basis of the results of homology calculation against known sequences, and causes a computer to execute: a sequence-number-ordered data generation step of reading a hit list in which the identifier of each cDNA end sequence constituting the plurality of cDNA end sequences is associated with the identifiers of the known sequences obtained as the homology calculation results, grouping, based on the hit list, the identifier of each cDNA end sequence with the identifiers of the known sequences having high homology with it, assigning a distinct cluster number to each group, sorting the identifiers to which cluster numbers have been assigned, assigning a distinct sequence number to each identifier, and then generating sequence-number-ordered data in which the identifier, cluster number, and sequence number are associated with one another in sequence number order; a sequence-number-ordered data storage step of storing the sequence-number-ordered data in a first area; a cluster-number-ordered data generation step of reading the sequence-number-ordered data, sorting it by cluster number, and generating cluster-number-ordered data in which the identifier, cluster number, and sequence number are associated with one another in cluster number order; a cluster-number-ordered data storage step of storing the cluster-number-ordered data in a second area; a cluster member determination step of reading the sequence-number-ordered data and the cluster-number-ordered data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same sequence number as the same cluster member, and generating a cluster member list describing the identifiers that constitute that cluster member; and a cluster member list storage step of storing the cluster member list in a third area.

In the clustering program according to the present invention, the cluster member determination step first groups all identifiers belonging to a specific cluster number as the same cluster member, based on the cluster-number-ordered data. If checking the sequence-number-ordered data shows that an identifier constituting that cluster member also belongs to another cluster number, all identifiers belonging to that other cluster number are also added to the same cluster member, based on the cluster-number-ordered data. The addition of cluster members stops once the newly added identifiers belong to none of the remaining cluster numbers. This series of processes up to the stop is then repeated until all identifiers have been classified.

In the clustering program according to the present invention, in the cluster member determination step, when the identifiers belonging to a specific cluster number do not belong to any other cluster number (no sharing), a cluster member is formed only from the identifiers belonging to that specific cluster number.

In the clustering program according to the present invention, the task of calculating the homology between cDNA end sequences of unknown function and known sequences ("BLAST") and the task of classifying based on the calculation results (the clustering process) are implemented separately. When classifying cDNA end sequences of unknown function according to their degree of homology with known sequences, for example, the clustering program first assigns a distinct sequence number to each identifier, groups each cDNA end sequence with the known sequences having high homology with it, and assigns a distinct cluster number to each group. All sequences belonging to a plurality of cluster numbers that share the same sequence number then form one cluster member. As a result, compared with using the representative program "BLASTCLUST", and compared with the conventional approach of generating cluster members by all-against-all comparison of sequences, the processing time required for "single linkage clustering" can be greatly reduced, and consequently so can the processing time required for "EST clustering".

Embodiments of the clustering program and the clustering apparatus according to the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

FIG. 1 shows the configuration of a general computer system that operates as a clustering apparatus for classifying cDNA end sequences of unknown function according to their degree of homology with known sequences (corresponding to "EST clustering") and that can carry out the clustering process of the present invention. This embodiment describes, as an example, the case of classifying cDNA end sequences according to their degree of homology with known sequences. However, the sequences to be classified are not limited to cDNA sequences; the processing of this embodiment may also be applied to the classification of base sequences other than cDNA or of amino acid sequences. Nor do the sequences to be classified need to be sequences of unknown function.

This computer system includes, for example, a control unit 101 containing a CPU, a memory unit 102, a display unit 103, an input unit 104, a CD-ROM drive unit 105 (which may be a DVD drive unit or an FD drive unit), a disk unit 106, and an external I/F unit 107. These units are connected to one another via a system bus A.

In FIG. 1, the control unit 101 executes the clustering program of the present invention. The memory unit 102 includes various memories such as RAM and ROM, and stores the programs to be executed by the control unit 101 and the data obtained in the course of processing. The display unit 103 consists of a CRT, an LCD (liquid crystal display panel), or the like, and displays various screens to the user of the computer system. The input unit 104 consists of a keyboard, a mouse, and the like, and is used by the user of the computer system to enter various information. The CD-ROM 200 shown in the figure stores a program describing the clustering process of the present invention (the clustering program).

An example of how the computer system operates until the clustering program of the present invention becomes executable is as follows. First, the clustering program is installed on the disk unit 106 from the CD-ROM 200 set in the CD-ROM drive unit 105. Then, when the computer system starts up (in the case of a dedicated machine) or when the program is run, the program read from the disk unit 106 is loaded into the memory unit 102. In this state, the control unit 101 (CPU) executes the clustering process of the present invention according to the program stored in the memory unit 102.

In the present invention, the program describing the clustering process is provided on the CD-ROM 200, but the recording medium is not limited to this. Depending on the computer that constitutes the system, other recording media may be used, such as magnetic disks including floppy (registered trademark) disks, magneto-optical disks, and magnetic tape. A program provided via a transmission medium such as e-mail or the Internet may also be used.

Next, an overview of the "EST clustering" process, and the position of the clustering program of the present invention within that process, is described. FIG. 2 is a flowchart showing the "EST clustering" procedure.

First, in response to a user instruction, the control unit 101 of the computer system generates a file describing cDNA end sequences in "FASTA" format, namely a query.fasta file, and stores it in a predetermined area of the memory unit 102 (step S1).

Next, in response to a user instruction, the control unit 101 generates a file that combines known sequences already published by public institutions such as NCBI with the cDNA end sequences read from the query.fasta file, namely a db.fasta file, and stores it in a predetermined area of the memory unit 102 (step S2).

Next, the control unit 101 runs "BLAST" with query.fasta as the query side and db.fasta as the database side (step S3). The hit list obtained as the "BLAST" result is then stored in a predetermined area of the memory unit 102 (step S4). The hit list may take any format as long as it is a list, table, or data from which the combinations of queryID and hitID can be determined. FIGS. 3-1, 3-2, and 3-3 show an example of the alignments obtained as the "BLAST" result. Although this embodiment uses "BLAST" as the homology search means, other known search methods, such as the dynamic programming method or the "FASTA" method, may be used instead.

Finally, based on the hit list, the control unit 101 executes the clustering program of the present invention described later (which includes "single linkage clustering") (step S5), and stores the resulting cluster member list in a predetermined area of the memory unit 102 (step S6). The processing of steps S5 and S6, which is characteristic of the present invention, is described in detail below.

Next, the clustering program of the present invention (corresponding to steps S5 and S6 above) is described in detail with reference to the drawings. FIG. 4 is a flowchart showing the processing performed by the clustering program of the present invention.

It is assumed here that a hit list (queryID, hitID) has already been generated from the result of "BLAST", which calculates the homology between cDNA end sequences of unknown function and known sequences, and that this hit list is already stored in a predetermined area of the memory unit 102. FIG. 5 shows an example of the hit list. The hit list is data describing combinations of a queryID, which is the identifier of a cDNA end sequence of unknown function (000001_queryID_1 (see SEQ ID NO: 1), 000002_queryID, … in the figure), and a hitID, which is the identifier of a known sequence obtained as the homology calculation result (the "BLAST" result) (AB000096 (see SEQ ID NO: 2), AK004551, AK004675, and so on in the figure). The hit list may take any format as long as it is a list, table, or data from which the combinations of queryID and hitID can be determined. That is, it may be the "BLAST" result itself or that result converted into a format for the clustering process. As an example, the case where "BLAST" is run against a database published by NCBI is described here.

First, the control unit 101 reads the hit list from the memory unit 102, generates from it data containing only the queryIDs (see list_qid in FIG. 6) and data containing only the hitIDs (see list_hitid in FIG. 7), and stores these data separately in predetermined areas of the memory unit 102 (step S11).

Next, the control unit 101 reads the hit list, list_qid, and list_hitid from the memory unit 102 and uses them to associate each identifier (the queryIDs and hitIDs above) with a cluster number for classifying the sequences (hereinafter, "sequence" alone refers to the cDNA end sequences and known sequences above). That is, cluster number "1" is assigned to the first queryID, and the cluster number is incremented each time the queryID changes. The associated data are then stored in a predetermined area of the memory unit 102 (step S12). FIG. 8 shows an example of data in which identifiers and cluster numbers have been associated. Here, sequences with high homology are placed in the same group based on the "BLAST" result. For example, the cluster number for 000001_queryID, AB000096, AK004551, and AK004675 is set to "1", and the cluster number for 000002_queryID, AB000121, and BC030169 is set to "2". Thereafter, the cluster numbers for the sets formed by 000003_queryID, 000004_queryID, 000005_queryID, … and their respective hitIDs are set to "3", "4", "5", … in order (incremented). At this stage, all sequence numbers are written as "-1".
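The cluster-number assignment of step S12 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `assign_cluster_numbers`, the in-memory list of (queryID, hitID) tuples, and the assumption that rows for the same queryID are contiguous are all hypothetical. The identifiers are the ones quoted in the description.

```python
def assign_cluster_numbers(hits):
    """Turn (queryID, hitID) rows into (identifier, cluster_no, seq_no) records.

    Assumes rows with the same queryID are contiguous in the hit list.
    The cluster number starts at 1 and is incremented each time the queryID
    changes; the sequence number is left as the placeholder -1.
    """
    records = []
    cluster_no = 0
    prev_qid = None
    for qid, hid in hits:
        if qid != prev_qid:
            # A new queryID opens a new group; the query itself joins it.
            cluster_no += 1
            prev_qid = qid
            records.append((qid, cluster_no, -1))
        records.append((hid, cluster_no, -1))
    return records


# Hypothetical in-memory hit list built from the identifiers in the description.
hit_list = [
    ("000001_queryID", "AB000096"),
    ("000001_queryID", "AK004551"),
    ("000001_queryID", "AK004675"),
    ("000002_queryID", "AB000121"),
    ("000002_queryID", "BC030169"),
]
for record in assign_cluster_numbers(hit_list):
    print(record)
```

Each queryID and all of its hitIDs end up with the same cluster number, which is the grouping the description calls for.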

Next, the control unit 101 reads the data in which identifiers and cluster numbers are associated from the memory unit 102 and sorts them by identifier. The sorted data are then stored in a predetermined area of the memory unit 102 (step S13). FIG. 9 shows an example of the sorted data.

Next, the control unit 101 reads the sorted data from the memory unit 102, sets a distinct sequence number for each identifier in the sequence number column of the data, and thereby associates identifiers with sequence numbers. Here, the sequence numbers for the identifiers are set to "0", "1", "2", … in sorted order, and the data with sequence numbers set are stored as cluster_1 in a predetermined area of the memory unit 102 (step S14). If multiple records (rows) have the same cluster number and sequence number, the duplicate records are deleted. Such duplicate records may instead be removed from the hit list in advance. FIG. 10 shows an example of cluster_1. Where the same identifier appears more than once in the figure, that identifier is associated with more than one cluster number.
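Steps S13 and S14 (sort by identifier, number the identifiers, drop duplicate rows) might look like the sketch below. The function name `build_cluster_1` and the tuple layout are assumptions for illustration, and the sample data deliberately places AB000096 in two clusters, which is hypothetical and not taken from the figures.

```python
def build_cluster_1(records):
    """Build sequence-number-ordered rows (identifier, cluster_no, seq_no).

    `records` holds (identifier, cluster_no) pairs. Exact duplicates are
    dropped, rows are sorted by identifier, and each distinct identifier
    receives the next sequence number, starting from 0.
    """
    ordered = sorted(set(records))  # sorts by identifier, then cluster number
    cluster_1 = []
    seq_no = -1
    prev_id = None
    for ident, cluster_no in ordered:
        if ident != prev_id:
            # First row for this identifier: allocate a new sequence number.
            seq_no += 1
            prev_id = ident
        cluster_1.append((ident, cluster_no, seq_no))
    return cluster_1


# Hypothetical input: AB000096 is shared by clusters 1 and 2, with one duplicate row.
records = [
    ("000001_queryID", 1),
    ("AB000096", 1),
    ("AK004551", 1),
    ("000002_queryID", 2),
    ("AB000096", 2),
    ("AB000096", 2),
]
for row in build_cluster_1(records):
    print(row)
```

An identifier that belongs to several clusters keeps a single sequence number across all of its rows, which is what later makes shared membership easy to detect.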

Next, the control unit 101 reads cluster_1 from the memory unit 102, sorts it by cluster number, and stores the sorted data as cluster_2 in a predetermined area of the memory unit 102 (step S16). If multiple records (rows) have the same cluster number and sequence number, the duplicate records are deleted. FIG. 11 shows an example of cluster_2.
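Deriving cluster_2 from cluster_1 is essentially a re-sort, as in the following sketch. The name `build_cluster_2` and the row layout (identifier, cluster_no, seq_no) are assumptions; using the sequence number as a secondary sort key is also an assumption, since the description only specifies sorting by cluster number.

```python
def build_cluster_2(cluster_1):
    """Return cluster-number-ordered rows from sequence-number-ordered rows.

    Rows are (identifier, cluster_no, seq_no). Exact duplicates are dropped,
    then rows are sorted by cluster number (sequence number breaks ties,
    which is an illustrative choice).
    """
    return sorted(set(cluster_1), key=lambda row: (row[1], row[2]))


# Hypothetical cluster_1 rows in which sequence number 2 is shared by clusters 1 and 2.
cluster_1 = [
    ("000001_queryID", 1, 0),
    ("000002_queryID", 2, 1),
    ("AB000096", 1, 2),
    ("AB000096", 2, 2),
    ("AK004551", 1, 3),
]
for row in build_cluster_2(cluster_1):
    print(row)
```

Keeping both orderings means either "which sequences are in cluster c" or "which clusters contain sequence s" can be answered by a lookup in the appropriately sorted table.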

Next, the control unit 101 executes "single linkage clustering" using cluster_1 and cluster_2 (step S17). The procedure of "single linkage clustering" in this embodiment is described concretely here. FIG. 12 shows a concrete example of the procedure. FIG. 12 is intended as a simple illustration, and the correspondence between sequence numbers and cluster numbers shown in it differs from that in FIGS. 10 and 11. The cluster numbers 1, 2, 3, … in the figure correspond to the cluster numbers of cluster_2, and the sequence numbers 0, 1, 2, … correspond to the sequence numbers of cluster_1.

For example, if the sequence numbers corresponding to the identifiers (sequences) belonging to cluster number "1" are "0", "1", and "2", these identifiers form one cluster member. If, for example, the identifier with sequence number "2" also belongs to cluster number "4", all identifiers belonging to cluster number "4" (sequence numbers "2" and "6") are also added to the same cluster member. Furthermore, if an identifier belonging to cluster number "4", for example the identifier with sequence number "6", belongs to yet another cluster number, the identifiers belonging to that cluster number are also added to the same cluster member. The addition of cluster members continues until the newly added identifiers belong to no other cluster number. Because referring to cluster_1 shows immediately whether the identifier with sequence number "6" belongs to any cluster number other than "4", it is easy to determine when "the newly added identifiers belong to no other cluster number".

Similarly, if the identifiers (sequences) belonging to cluster number “2” have array numbers “3”, “4”, and “5”, these identifiers form a cluster member different from the one above. If the identifier with array number “5” also belongs to cluster number “5”, then all identifiers belonging to cluster number “5” (array numbers “5” and “8”) are added to the same cluster member. Further, if an identifier belonging to cluster number “5”, for example the identifier with array number “8”, belongs to yet another cluster number, the identifiers belonging to that cluster number are also added to the same cluster member. The addition process continues until no newly added identifier belongs to another cluster number. When the identifiers belonging to a specific cluster number do not belong to any other cluster number (no sharing), the cluster member is formed only from the identifiers belonging to that specific cluster number.
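The expansion described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `single_linkage` and the two dictionaries (standing in for lookups into cluster_1 and cluster_2) are assumptions, and the sample rows reproduce the FIG. 12 example from the text plus one hypothetical unshared cluster (cluster number 3, array number 7) to show the "no sharing" case.

```python
from collections import defaultdict, deque

def single_linkage(rows):
    """rows: iterable of (array_number, cluster_number) pairs (assumed layout)."""
    clusters_of = defaultdict(set)  # cluster_1-like lookup: array number -> cluster numbers
    members_of = defaultdict(set)   # cluster_2-like lookup: cluster number -> array numbers
    for array_no, cluster_no in rows:
        clusters_of[array_no].add(cluster_no)
        members_of[cluster_no].add(array_no)

    seen_clusters = set()
    result = []
    for start in sorted(members_of):
        if start in seen_clusters:
            continue  # already absorbed into an earlier cluster member
        seen_clusters.add(start)
        queue = deque([start])
        merged = set()
        while queue:
            cluster_no = queue.popleft()
            for array_no in members_of[cluster_no]:
                if array_no in merged:
                    continue
                merged.add(array_no)
                # Does this identifier also belong to another cluster number?
                for other in clusters_of[array_no]:
                    if other not in seen_clusters:
                        seen_clusters.add(other)
                        queue.append(other)
        # The queue is empty: no newly added identifier belongs elsewhere.
        result.append(sorted(merged))
    return result

# Example from the text, plus a hypothetical unshared cluster (3 -> array number 7).
rows = [(0, 1), (1, 1), (2, 1), (2, 4), (6, 4),
        (3, 2), (4, 2), (5, 2), (5, 5), (8, 5), (7, 3)]
print(single_linkage(rows))
```

Each cluster number is visited at most once, so in this sketch the work grows with the number of (identifier, cluster number) rows rather than with the number of sequence pairs.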

In this way, all cluster numbers are processed, and the result is stored as cluster_3 in a predetermined area of the memory unit 102 (step S17). As a result, cDNA end sequences of unknown function can be assigned to a cluster member that contains a known sequence. FIG. 13 is a schematic diagram for explaining the contents of cluster_3. In practice, cluster_3 is expressed in the following list format.
cluster_3[1] = {0,119978,126329,126352,3,119978,…}
cluster_3[2] = {1,119979,132190,4,119979,…}

Finally, the control unit 101 reads cluster_3 from the memory unit 102, generates a cluster member list based on cluster_3, and stores the list in a predetermined area of the memory unit 102 (step S18). FIG. 14 shows an example of the cluster member list.

Thus, in this embodiment, the task of calculating the homology between cDNA end sequences of unknown function and known sequences (“BLAST”) and the task of classifying the sequences based on the calculation results (clustering process) are implemented separately.

Further, in this embodiment, when cDNA end sequences of unknown function are classified according to their degree of homology with known sequences, an individual array number is first set for each identifier (sequence). Each cDNA end sequence is then grouped with the known sequences that have high homology with it, and an individual cluster number is set for each group. All identifiers belonging to a plurality of cluster numbers that share the same array number then form one cluster member. However, when the identifiers belonging to a specific cluster number do not belong to any other cluster number (no sharing), the cluster member is formed only from the identifiers belonging to that specific cluster number.
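As a rough sketch of the preparation summarized above, the following Python code derives array-number-order data (cluster_1) and cluster-number-order data (cluster_2) from a hit list. The hit list layout (pairs of query identifier and hit identifier), the row layout (identifier, cluster number, array number), the function name `build_tables`, and the identifiers other than 000001_queryID and AB000096 are assumptions made for illustration; the actual formats are defined in figures that are not reproduced here.

```python
def build_tables(hit_list):
    """hit_list: iterable of (query_id, hit_id) pairs (assumed layout)."""
    cluster_no_of_query = {}
    pairs = set()
    for query_id, hit_id in hit_list:
        # One cluster number per query sequence together with its hits.
        cluster_no = cluster_no_of_query.setdefault(
            query_id, len(cluster_no_of_query) + 1)
        pairs.add((query_id, cluster_no))
        pairs.add((hit_id, cluster_no))

    # Sort by identifier and give each distinct identifier one array number.
    array_no_of = {}
    cluster_1 = []
    for identifier, cluster_no in sorted(pairs):
        array_no = array_no_of.setdefault(identifier, len(array_no_of))
        cluster_1.append((identifier, cluster_no, array_no))

    # Re-sort the same rows by cluster number.
    cluster_2 = sorted(cluster_1, key=lambda row: (row[1], row[2]))
    return cluster_1, cluster_2

# Hypothetical hit list: two queries share the known sequence AB000096.
hits = [("000001_queryID", "AB000096"),
        ("000002_queryID", "AB000096"),
        ("000003_queryID", "XX000001")]
cluster_1, cluster_2 = build_tables(hits)
for row in cluster_1:
    print(row)
```

In this example AB000096 appears in two rows of cluster_1 with the same array number but different cluster numbers, which is the sharing condition that causes those two cluster numbers to be merged into one cluster member.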

Consequently, compared with using the representative program “BLASTCLUST”, and compared with the conventional method that generates cluster members by all-against-all comparison of sequences, the processing time required for “single linkage clustering” is greatly reduced, and hence so is the processing time required for “EST clustering”.

As described above, the clustering program according to the present invention can group base sequences and amino acid sequences in a short time according to their degree of sequence homology, and is therefore well suited to annotating genes of unknown function, designing DNA chips, and similar applications.

A diagram showing the configuration of a general computer system capable of implementing the clustering process of the present invention.
A flowchart showing the processing procedure of “EST clustering”.
A diagram showing an example of an alignment obtained as a “BLAST” result.
A diagram showing an example of an alignment obtained as a “BLAST” result.
A diagram showing an example of an alignment obtained as a “BLAST” result.
A flowchart showing the processing performed by the clustering program of the present invention.
A diagram showing an example of a hit list.
A diagram showing an example of list_qid.
A diagram showing an example of list_hitid.
A diagram showing an example of data in which identifiers and cluster numbers are associated.
A diagram showing an example of the data after sorting.
A diagram showing an example of cluster_1.
A diagram showing an example of cluster_2.
A diagram showing a specific example of the “single linkage clustering” procedure of this embodiment.
A schematic diagram for explaining the contents of cluster_3.
A diagram showing an example of a cluster member list.

Explanation of symbols

101 Control unit
102 Memory unit
103 Display unit
104 Input unit
105 CD-ROM drive unit
106 Disk unit
107 External I/F unit
200 CD-ROM

Sequence number 1: 000001_queryID
Sequence number 2: AB000096

Claims

A clustering program for classifying a plurality of classification target sequences based on homology, the program causing a computer to execute:
an array number order data generation step of reading a hit list in which identifiers of a plurality of reference sequences included in the classification target sequences are described in association with identifiers of sequences having high homology with those reference sequences; grouping, based on the hit list, the identifier of each reference sequence with the identifiers of the sequences having high homology with it; setting an individual cluster number for each group; sorting the identifiers to which cluster numbers have been set; setting an individual array number for each identifier; and then generating array number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in array number order;
an array number order data storage step of storing the array number order data in a first area;
a cluster number order data generation step of reading the array number order data, sorting it in cluster number order, and generating cluster number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in cluster number order;
a cluster number order data storage step of storing the cluster number order data in a second area;
a cluster member determination step of reading the array number order data and the cluster number order data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same array number as one cluster member, and generating a cluster member list describing the identifiers constituting each cluster member; and
a cluster member list storage step of storing the cluster member list in a third area.

A clustering program for classifying a plurality of cDNA end sequences of unknown function based on results of homology calculation against known sequences, the program causing a computer to execute:
an array number order data generation step of reading a hit list in which the identifiers of the individual cDNA end sequences constituting the plurality of cDNA end sequences are described in association with the identifiers of the known sequences obtained as the homology calculation results; grouping, based on the hit list, the identifier of each cDNA end sequence with the identifiers of the known sequences having high homology with it; setting an individual cluster number for each group; sorting the identifiers to which cluster numbers have been set; setting an individual array number for each identifier; and then generating array number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in array number order;
an array number order data storage step of storing the array number order data in a first area;
a cluster number order data generation step of reading the array number order data, sorting it in cluster number order, and generating cluster number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in cluster number order;
a cluster number order data storage step of storing the cluster number order data in a second area;
a cluster member determination step of reading the array number order data and the cluster number order data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same array number as one cluster member, and generating a cluster member list describing the identifiers constituting each cluster member; and
a cluster member list storage step of storing the cluster member list in a third area.

The clustering program according to claim 1, wherein, in the cluster member determination step:
all identifiers belonging to a specific cluster number are grouped as one cluster member based on the cluster number order data;
when a check of the array number order data shows that an identifier constituting the cluster member also belongs to another cluster number, all identifiers belonging to that other cluster number are added to the same cluster member based on the cluster number order data;
the cluster member addition process is stopped when no newly added identifier belongs to any other cluster number; and
this series of processes up to the stop is repeated until all identifiers have been classified.

The clustering program according to claim 3, wherein, in the cluster member determination step, when the identifiers belonging to the specific cluster number do not belong to any other cluster number (no sharing), the cluster member is formed only from the identifiers belonging to that specific cluster number.

A clustering apparatus that classifies a plurality of classification target sequences based on homology, comprising:
array number order data generation means for reading a hit list in which identifiers of a plurality of reference sequences included in the classification target sequences are described in association with identifiers of sequences having high homology with those reference sequences; grouping, based on the hit list, the identifier of each reference sequence with the identifiers of the sequences having high homology with it; setting an individual cluster number for each group; sorting the identifiers to which cluster numbers have been set; setting an individual array number for each identifier; and then generating array number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in array number order;
first storage means for storing the array number order data;
cluster number order data generation means for reading the array number order data, sorting it in cluster number order, and generating cluster number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in cluster number order;
second storage means for storing the cluster number order data;
cluster member determination means for reading the array number order data and the cluster number order data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same array number as one cluster member, and generating a cluster member list describing the identifiers constituting each cluster member; and
third storage means for storing the cluster member list.

A clustering apparatus that classifies a plurality of cDNA end sequences of unknown function based on results of homology calculation against known sequences, comprising:
array number order data generation means for reading a hit list in which the identifiers of the individual cDNA end sequences constituting the plurality of cDNA end sequences are described in association with the identifiers of the known sequences obtained as the homology calculation results; grouping, based on the hit list, the identifier of each cDNA end sequence with the identifiers of the known sequences having high homology with it; setting an individual cluster number for each group; sorting the identifiers to which cluster numbers have been set; setting an individual array number for each identifier; and then generating array number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in array number order;
first storage means for storing the array number order data;
cluster number order data generation means for reading the array number order data, sorting it in cluster number order, and generating cluster number order data in which the identifiers, cluster numbers, and array numbers are described in association with one another in cluster number order;
second storage means for storing the cluster number order data;
cluster member determination means for reading the array number order data and the cluster number order data, treating, based on these data, all identifiers belonging to a plurality of cluster numbers that share the same array number as one cluster member, and generating a cluster member list describing the identifiers constituting each cluster member; and
third storage means for storing the cluster member list.

The clustering apparatus according to claim 5, wherein the cluster member determination means:
groups all identifiers belonging to a specific cluster number as one cluster member based on the cluster number order data;
when a check of the array number order data shows that an identifier constituting the cluster member also belongs to another cluster number, adds all identifiers belonging to that other cluster number to the same cluster member based on the cluster number order data;
stops the cluster member addition process when no newly added identifier belongs to any other cluster number; and
repeats this series of processes up to the stop until all identifiers have been classified.

The clustering apparatus according to claim 7, wherein, when the identifiers belonging to the specific cluster number do not belong to any other cluster number (no sharing), the cluster member determination means forms the cluster member only from the identifiers belonging to that specific cluster number.