JP5453613B2

JP5453613B2 - Gene clustering apparatus and program

Info

Publication number: JP5453613B2
Application number: JP2008252353A
Authority: JP
Inventors: 毅井澤; 基広三原; 仁藤宮
Original assignee: National Institute of Agrobiological Sciences
Current assignee: National Institute of Agrobiological Sciences
Priority date: 2008-09-30
Filing date: 2008-09-30
Publication date: 2014-03-26
Anticipated expiration: 2028-09-30
Also published as: JP2010086142A

Description

The present invention relates to a gene clustering apparatus and program for clustering a plurality of genes based on sequence similarity.

To estimate the function of a gene whose function is unknown, it is known to be effective to evaluate its similarity to known genes and to perform clustering based on sequence similarity.
Conventionally, the maximum parsimony method, the maximum likelihood method, the neighbor-joining method, and the like have been used for gene clustering. These methods have in common that a phylogenetic tree is constructed while directly comparing the sequences of the genes to be clustered. An example that uses such clustering is the clustering and alignment program disclosed in Non-Patent Document 1.

Conventional gene clustering methods construct a phylogenetic tree by focusing on the base sequence of each individual gene and estimating the timing and order of the mutations in each base sequence. However, these methods cannot compare genes whose overall sequences differ substantially, such as genes that have diverged considerably or genes with functions newly acquired after differentiation. Conventional clustering is suited to comparing genes whose sequences differ only to the extent that arises in the course of evolution, that is, genes with relatively few changes.

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice; J. D. Thompson et al.; Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680.

As described above, it has been difficult to cluster evolutionarily distant genes with methods that, like conventional clustering methods, use the entire gene sequences as they are. In addition, it has been very difficult to narrow down gene functions, related proteins, and the like based only on the information contained in the gene sequences.

An object of the present invention is to provide a gene clustering apparatus and program that can discover genes with similar functions even among the genes of evolutionarily distant organisms.

A second object of the present invention is to provide information that makes it easier to infer the function of each gene by using not only gene sequence information but also gene expression data.

A gene clustering apparatus according to the present invention clusters a plurality of genes based on sequence similarity, and comprises: a motif search unit that searches for motif sequences included in gene sequences; a motif score calculation unit that calculates a similarity score for any two genes by comparing the motif sequences included in each gene sequence; an intergene distance calculation unit that calculates the intergene distance between any two genes using the similarity score; a clustering processing unit that clusters the plurality of genes based on the intergene distance; an expression data acquisition unit that acquires expression data of each gene from a gene expression data storage unit; and an expression data display unit that displays the acquired expression data of each gene at a position corresponding to that gene.
In the present invention, the clusters obtained by analyzing gene similarity, using the motifs included in the gene sequences as an index, are compared with gene expression data. Because genes with similar functions often share similar motifs even when they are evolutionarily distant, the present invention is very effective for discovering functionally similar genes across a wide range of species and for estimating the functions of unknown genes. Furthermore, finding genes that are similar at the motif level, and are therefore expected to have similar functions, but that differ in, for example, the timing of expression is very effective for inferring differences such as in the proteins they target.

A gene clustering apparatus according to the present invention compares the results of clustering a plurality of genes using each of two or more feature vector quantities, and comprises: a clustering processing unit that performs clustering using each of the feature vector quantities; a tone conversion unit that, based on the clustering results, converts the distance information of each sub-cluster into a one-dimensional gradation sequence; and a parallel display unit that displays in parallel, for the clustering result obtained with each of the feature vector quantities, the result of the conversion into the one-dimensional gradation sequence.

According to the present invention, it is possible to easily compare and grasp how similar two or more dendrograms, each created from different data about the genes, are to one another. In particular, for a group of genes known from the motif-based dendrogram to be structurally similar, it can easily be seen that their expression patterns differ by expression timing, expression site, and the like. This information yields important insights into differences in gene function, such as the possibility that the proteins produced interact with different partners or act in different networks.

A computer program according to the present invention causes a computer to function as a gene clustering apparatus that clusters a plurality of genes based on sequence similarity, the program causing the computer to function as: a motif search unit that searches for motif sequences included in gene sequences; a motif score calculation unit that calculates a similarity score for any two genes by comparing the motif sequences included in each gene sequence; an intergene distance calculation unit that calculates the intergene distance between any two genes using the similarity score; a clustering processing unit that clusters the plurality of genes based on the intergene distance; an expression data acquisition unit that acquires expression data of each gene from a gene expression data storage unit; and an expression data display unit that displays the acquired expression data of each gene at a position corresponding to that gene.
In the present invention, the clusters obtained by analyzing gene similarity, using the motifs included in the gene sequences as an index, are compared with gene expression data. Because genes with similar functions often share similar motifs even when they are evolutionarily distant, the present invention is very effective for discovering functionally similar genes across a wide range of species and for estimating the functions of unknown genes. Furthermore, finding genes that are similar at the motif level, and are therefore expected to have similar functions, but that differ in, for example, the timing of expression is very effective for inferring differences such as in the proteins they target.

A computer program according to the present invention causes a computer to function as a gene clustering apparatus that compares the results of clustering a plurality of genes using each of two or more feature vector quantities, the program causing the computer to function as: a clustering processing unit that performs clustering using each of the feature vector quantities; a tone conversion unit that, based on the clustering results, converts the distance information of each sub-cluster into a one-dimensional gradation sequence; and a parallel display unit that displays in parallel, for the clustering result obtained with each of the feature vector quantities, the result of the conversion into the one-dimensional gradation sequence.
According to the present invention, it is possible to easily compare and grasp how similar two or more dendrograms, each created from different data about the genes, are to one another. In particular, for a group of genes known from the motif-based dendrogram to be structurally similar, it can easily be seen that their expression patterns differ by expression timing, expression site, and the like. This information yields important insights into differences in gene function, such as the possibility that the proteins produced interact with different partners or act in different networks.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the functional configuration of a gene clustering apparatus 10 according to Embodiment 1 of the present invention. As shown in the figure, the gene clustering apparatus 10 includes an input device 11, a user interface unit 12, a data access unit 13, a gene sequence storage unit 14, a score storage unit 15, a motif storage unit 16, a gene expression data storage unit 17, a motif search unit 18, a motif score calculation unit 19, an intergene distance calculation unit 20, a clustering processing unit 21, an expression data acquisition unit 22, an output device 23, and an expression data display unit 24.

The gene clustering apparatus 10 is realized, for example, by having a general-purpose personal computer execute a predetermined program. The user interface unit 12, data access unit 13, motif search unit 18, motif score calculation unit 19, intergene distance calculation unit 20, clustering processing unit 21, expression data acquisition unit 22, and expression data display unit 24 represent modules of the operations that the computer's processor performs according to the program; in practice they together constitute the processor of the gene clustering apparatus 10.

The gene sequence storage unit 14, score storage unit 15, motif storage unit 16, and gene expression data storage unit 17 are storage devices, such as a hard disk, of the gene clustering apparatus 10.
The input device 11 is an input means such as a keyboard, mouse, or touch panel, and is used by the user to give processing instructions to the gene clustering apparatus 10 and to input data and parameters. Data can also be read from a memory medium or the like via a USB (Universal Serial Bus) interface. User operations through the input device 11 are controlled by the user interface unit 12.
The output device 23 is a display device, a printer, or the like.

Next, the gene clustering process according to this embodiment will be described. The gene sequence data to be analyzed and the corresponding gene expression data are stored in advance, from the input device 11 via the user interface unit 12 and the data access unit 13, in the gene sequence storage unit 14 and the gene expression data storage unit 17, respectively. The score data for gene sequence comparison needed for clustering is likewise input from the input device 11 and stored in the score storage unit 15. First, the sequence information of the group of genes to be clustered is supplied from the gene sequence storage unit 14 to the motif search unit 18 via the data access unit 13.

FIG. 2 is a diagram showing an example of a group of genes to be clustered. It lists the gene number of each target gene and its species. The example in FIG. 2 shows the genes hit by a BLAST search (threshold 1e-30) against the amino acid sequences of rice (Oryza sativa), Arabidopsis (Arabidopsis thaliana), and a red alga, using the maize (Zea mays) ID1 (indeterminate1) gene as the query.

Each gene sequence can be looked up, for example, at the following sites.
Rice: http://rapdb.lab.nig.ac.jp/ (RAP1)
Arabidopsis: http://mips.gsf.de/proj/thal/db/ (MIPS)
Red alga: http://merolae.biol.s.u-tokyo.ac.jp/

The ID1 gene was isolated as a gene that controls flowering in maize, and encodes a transcription factor with a zinc finger.
The way of selecting the gene group is not limited to the above method; other sequence analysis techniques may be used.

Next, the motif search unit 18 performs a motif search on the supplied gene group. A motif is a sequence pattern corresponding to an active site or functional region in the protein structure. The motif search can be performed using a technique such as MEME (Bailey and Elkan, 1994). FIG. 3 is a diagram showing an example of the motif data obtained by performing a motif search on the gene group partially shown in FIG. 2. In the figure, each numbered box corresponds to an individual motif. For example, the ID1 gene can be seen to contain the motif sequences numbered 5, 2, 3, 1, 7, 6, and 18. In general, functionally similar genes often share the same motifs even when they are genetically quite distant.

The motif search yields, from within the sequence of each gene, information on partial sequences of various lengths that are thought to contribute to determining the main structure and function. The motif data obtained is stored in the motif storage unit 16.

Next, the motif score calculation unit 19 compares all of the genes to be clustered with one another and calculates a score representing their similarity in terms of the motif sequences they contain. The similarity score can be calculated using substitution matrices based on the probabilities of amino acid substitutions, such as PAM (Point Accepted Mutation; in Margaret O. Dayhoff, editor, Atlas of Protein Sequence and Structure, volume 5, pages 345-352. National Biochemical Research Foundation, Washington DC, 1978) or BLOSUM (Blocks Substitution Matrix; Henikoff and Henikoff (1992), PNAS 89: 10915-10919). The score storage unit 15 stores the score data used by these methods.
In this embodiment, no score is calculated for regions other than the motifs, which means that the parts outside the motifs are treated as having a score of 0. Clustering is performed at high speed by restricting the score calculation to the motifs, that is, to the conserved portions of the sequences. If needed, in addition to the conserved sequence motifs, a function such as secondary structure prediction can be added to extract the structural portions that determine α-helices, β-sheets, and the like, and scoring those portions as motifs makes it possible to cluster by structural similarity as well as by function.

The method of calculating the similarity score will now be described.
For example, suppose that motif 1 in gene 1 and motif 2 in gene 2 have the following sequences.
Motif 1: WKCEKCAK
Motif 2: WKCDKCN

The first amino acid residue of both motif 1 and motif 2 is W, so looking up the W column of the W row in the PAM40 matrix shown in FIG. 4 gives a score of 13. The second residue is K in both sequences, giving a score of 6. Obtaining the scores in order in this way and adding them gives the following score for motif 1 and motif 2.
Score = 13 + 6 + 9 + 3 + 6 + 9 + (−3) = 43
In this way, scores are obtained in round-robin fashion for all pairs of motifs contained in gene 1 and gene 2. The sum of the scores of all motif pairs is then taken as the similarity score between gene 1 and gene 2. When the optimal score is to be calculated with deletions and insertions of amino acid residues taken into account in comparing motifs, the Smith-Waterman algorithm, which uses dynamic programming to find the optimal local alignment, is used (Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197.).
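The worked example above can be sketched in Python as follows. This is only an illustrative sketch: the dictionary holds just the five PAM40 entries quoted in, or implied by, the sum 13 + 6 + 9 + 3 + 6 + 9 + (−3), and the function names (residue_score, motif_score, gene_similarity) are hypothetical. Positions are compared without gaps, so the unpaired final residue of the longer motif is ignored, as in the example; the gapped Smith-Waterman variant is not shown.

```python
# Substitution scores limited to those appearing in the worked example.
# A full PAM40 matrix would normally be read from the score storage unit.
PAM40_SUBSET = {
    ("W", "W"): 13,
    ("K", "K"): 6,
    ("C", "C"): 9,
    ("D", "E"): 3,
    ("A", "N"): -3,
}


def residue_score(a, b, matrix=PAM40_SUBSET):
    """Look up the (symmetric) substitution score for two residues."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def motif_score(m1, m2, matrix=PAM40_SUBSET):
    """Ungapped score: sum of position-wise scores over the shorter motif."""
    return sum(residue_score(a, b, matrix) for a, b in zip(m1, m2))


def gene_similarity(motifs1, motifs2, matrix=PAM40_SUBSET):
    """Round-robin sum of the scores of all motif pairs of two genes."""
    return sum(motif_score(a, b, matrix) for a in motifs1 for b in motifs2)


print(motif_score("WKCEKCAK", "WKCDKCN"))  # 43
```

Calling gene_similarity with the motif lists of two genes gives the gene-to-gene score that would populate one cell of a score matrix such as the one in FIG. 5.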

FIG. 5 shows part of the gene-to-gene score matrix obtained as described above, giving the mutual similarity scores for four genes.

Next, the intergene distance calculation unit 20 calculates the distance between each pair of genes. The distance between genes can be defined in various ways; the present invention uses Pearson's correlation coefficient. In this method, any two rows of the matrix shown in FIG. 5 are taken and the correlation between their corresponding elements is computed. With the correlation coefficient, genes whose motif similarity profiles are relatively alike show a high correlation and are not pushed apart by differences in absolute magnitude. Even if genes differ somewhat in, for example, the number of common motifs they have, the distance can be obtained while normalizing to a common scale. A cosine coefficient can also be used instead.
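The row-wise correlation described above might be computed as in the following sketch. The 4x4 score matrix is hypothetical (the values of FIG. 5 are not reproduced here), and converting the correlation r into a distance as 1 − r is an assumption of this sketch; the text states only that Pearson's correlation coefficient is used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length rows.

    Assumes neither row is constant (non-zero variance).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance_matrix(scores):
    """Pairwise distances between rows, taken here as 1 - r (assumption)."""
    n = len(scores)
    return [[1.0 - pearson(scores[i], scores[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical similarity score matrix for four genes.
scores = [
    [100, 80, 20, 10],
    [80, 90, 25, 15],
    [20, 25, 70, 60],
    [10, 15, 60, 75],
]
dist = distance_matrix(scores)
```

In this made-up matrix, genes 0 and 1 have similar score profiles and end up close together, while genes 0 and 2 have opposite profiles and end up far apart, regardless of the absolute size of the scores.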

Next, the clustering processing unit 21 performs clustering from the distance values calculated by the intergene distance calculation unit 20, using a method such as Ward's method or the group average method. FIG. 6 shows the dendrogram of the clustering result. FIG. 6 suggests that the maize ID1 gene has a function similar to that of the Os10g0419200 gene. The Os10g0419200 gene encodes a zinc finger protein and is annotated with the function "Zinc finger, C2H2 type family protein", so it can reasonably be inferred to have a function similar to that of ID1.
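As an illustration of the clustering step, the following is a minimal pure-Python sketch of agglomerative clustering with group-average linkage. Ward's method, also mentioned above, is not shown. The function name, the distance values, and the labels are all hypothetical.

```python
def group_average_clustering(dist, labels):
    """Agglomerative clustering with group-average linkage.

    dist   -- symmetric matrix of pairwise distances (list of lists)
    labels -- leaf names, in the same order as the rows of dist
    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    next_id = len(labels)

    def average(a, b):
        # Mean distance over all leaf pairs drawn from the two clusters.
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(p, q) for k, p in enumerate(ids) for q in ids[k + 1:]]
        a, b = min(pairs, key=lambda pq: average(*pq))
        height = average(a, b)
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]],
                       height))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical distances between four genes.
example_dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 2],
    [5, 5, 2, 0],
]
merges = group_average_clustering(example_dist, ["A", "B", "C", "D"])
```

The list of merges, read in order, is the information a dendrogram such as the one in FIG. 6 draws: which groups join and at what height.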

Thus, the present invention enables a series of analyses consisting of motif extraction followed by clustering that uses the presence or absence of motifs and their similarity as an index. Motifs include, for example, conserved sequence patterns characteristic of functional domains, and analyzing with motifs as the index allows comparative analysis of genes that are functionally similar even though they are genetically distant. Analyses based on amino acid substitution rates already exist, but no comparative analysis method using the presence or absence and similarity of motifs as an index has been established; the present method can be used for analyzing functional genes conserved between organisms, estimating the functions of genes of unknown function, and the like. Advances in DNA sequencing technology have led to the genomes of a great many species being read, and if clustering can find functionally similar genes even among genes that do not share a common ancestor, this is very useful for analyzing the functions of unknown gene sequences.

The clustering method according to the present invention is not limited to gene motif information; it can also be applied to numerical sequence patterns in which structural features, such as α-helices, β-sheets, and strongly hydrophobic or hydrophilic areas, are replaced with various index values. The gene sequences described in the present invention are themselves character strings, so the clustering of gene sequences can be carried over directly to the clustering of character sequences. Needless to say, the method is applicable to any sequence of character or numerical information. For character strings, the number of matching characters may be used as the score, or a fixed score may simply be given for each word found in a dictionary. For numerical sequences, it goes without saying that the difference between the values, or its square, can be widely used as the distance.

Next, the expression data acquisition unit 22 retrieves the expression data of each of these genes from the gene expression data storage unit 17. As gene expression data, expression levels in pollen cells by maturation stage (meiosis, tetrad stage, uninucleate stage, binucleate stage, and mature pollen) and expression levels in the tapetum (meiosis, tetrad stage, and uninucleate stage) were used. The amount of mRNA expressed in the cells can be measured with a method using a DNA microarray, a method using RT-PCR, or a method in which the base sequences of the recovered mRNA are read with an automated DNA sequencer and the number of mRNA molecules is counted. FIG. 7 shows an example of results measured with a DNA microarray. The expression data display unit 24 displays the expression data on the output device 23 together with the clustering result shown in FIG. 6. The expression data is displayed at the position corresponding to each cluster. In the example shown in FIG. 7, the expression data is arranged horizontally beside each leaf of the dendrogram. Some stages have two or three samples each, in which case the samples are placed directly next to one another. Each expression level is displayed in a color whose darkness corresponds to the measured level.
Here, the darker the color, the higher the expression level. For example, the early stage of pollen maturation (71) has data for four samples, and the tetrad stage (72) has data for three samples, with the expression level almost unchanged. In contrast, the late binucleate stage (73, 74) is shown in a dark color, indicating that the expression level has increased.

FIG. 7 shows an example in which a group of genes found by clustering to be very similar also increase in expression at almost the same time. FIG. 8 shows a case (81) in which the expression pattern is not conserved between paralogous genes (homologous sequences newly arising by gene duplication within a species). FIG. 9 shows an example (91) of paralogous genes whose expression timing is slightly shifted. By displaying the gene expression data side by side with the dendrogram of the motif-based clustering result in this way, differences in gene behavior can be confirmed very easily.

FIG. 10 shows the procedure for this drawing. First, in step 101, the expression data acquisition unit 22 acquires the expression data of each gene from the gene expression data storage unit 17. Next, in step 102, with reference to the structure of the dendrogram that was created, display processing of the expression patterns of the member genes is performed for each sub-cluster. Then, in step 103, the expression data display unit 24 places the expression data beside the dendrogram and draws it. The results are those shown in FIGS. 7 to 9. This makes it possible to compare the expression patterns of the genes within a sub-cluster visually.
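The drawing procedure of steps 101 to 103 can be approximated in text form by the following sketch, which orders the expression data by dendrogram leaf order and maps each value to a darker or lighter character. The shade characters, the gene names, and the expression values are all hypothetical, and a real implementation would draw colored cells beside a graphical dendrogram rather than text.

```python
SHADES = " .:*#"  # light to dark; darker means higher expression


def to_shade(value, vmin, vmax):
    """Map an expression value to one of the shade characters."""
    if vmax == vmin:
        return SHADES[0]
    frac = (value - vmin) / (vmax - vmin)
    index = min(int(frac * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


def render_rows(leaf_order, expression):
    """Return one text row per dendrogram leaf, in leaf order."""
    values = [v for gene in leaf_order for v in expression[gene]]
    vmin, vmax = min(values), max(values)
    return [
        f"{gene:<8}" + "".join(to_shade(v, vmin, vmax)
                               for v in expression[gene])
        for gene in leaf_order
    ]


# Hypothetical expression levels over four stages for three genes.
expression = {
    "geneA": [0, 1, 8, 10],
    "geneB": [0, 2, 9, 10],
    "geneC": [5, 5, 5, 5],
}
leaf_order = ["geneA", "geneB", "geneC"]  # order of the dendrogram leaves
rows = render_rows(leaf_order, expression)
for row in rows:
    print(row)
```

Because the rows follow the leaf order of the dendrogram, genes in the same sub-cluster appear on adjacent rows, which is what makes the visual comparison of their expression patterns possible.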

As described above, in this embodiment the clusters obtained by analyzing gene similarity, using the motifs included in the gene sequences as an index, are compared with gene expression data. Because genes with similar functions often share similar motifs even when they are evolutionarily distant, the present invention is very effective for discovering functionally similar genes across a wide range of species and for estimating the functions of unknown genes. Furthermore, finding genes that are similar at the motif level, and are therefore expected to have similar functions, but that differ in, for example, the timing of expression is very effective for inferring differences such as in the proteins they target.

In addition, displaying the clustering result based on gene motif information together with expression data allows information on the actual behavior of living cells to be taken into account. Needless to say, the expression data can be combined according to the purpose of the comparison: data acquired for each tissue, data acquired in time series, data from different strains, and so on.

Embodiment 2.
In Embodiment 2, clustering using gene expression data is performed in addition to clustering using gene motif information, and both results are displayed so that they can be compared.
A method of comparing a plurality of clustering results is described with reference to FIG. 11. FIG. 11 shows an example in which the dendrogram calculated by the clustering processing unit 21 from gene motif information is displayed at the top and the result of clustering by gene expression data is displayed at the bottom, facing it. Between them are heat map regions 115a, 116a, 115b, and 116b, which are used to compare the clusters as described later.

The left side 111 of the figure shows a case in which the two clustering results are relatively similar, and the right side 112 shows an example in which they differ considerably. Case 111, in which the cluster structures are similar, is described first. Dendrogram 113a is the result of clustering genes A, B, C, and D using motif information. The distance between genes A and B is "3", as indicated below the branch point of their two branches. The distance from the centroid of genes A and B to C is "6", and the distance from the centroid of A, B, and C to D is "11".
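One way to read a per-gene distance sequence off such a dendrogram is to record, for each gene, the height at which it first joins another branch. The sketch below assumes a hypothetical nested-tuple encoding `(left, right, height)`; the tree shape given for the dissimilar case is only one shape that is consistent with the sequence (17, 7, 7, 10) quoted for 116b, not a structure taken from the figure.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding: a node is a gene name (leaf) or (left, right, height),
# where height is the distance at which the two branches join.
Node = Union[str, Tuple["Node", "Node", float]]


def first_merge_heights(node: Node) -> Dict[str, float]:
    """Map each gene to the distance at which it first merges into a cluster."""
    heights: Dict[str, float] = {}

    def visit(n: Node) -> None:
        if isinstance(n, str):
            return
        left, right, height = n
        for child in (left, right):
            if isinstance(child, str):
                heights[child] = height  # a leaf joins the tree at this branch point
            else:
                visit(child)

    visit(node)
    return heights


def distance_sequence(node: Node, genes: List[str]) -> List[float]:
    """Return the merge distances in a fixed gene order (the heat map column order)."""
    heights = first_merge_heights(node)
    return [heights[g] for g in genes]


if __name__ == "__main__":
    tree_113a = ((("A", "B", 3), "C", 6), "D", 11)
    print(distance_sequence(tree_113a, ["A", "B", "C", "D"]))  # [3, 3, 6, 11]
```

Keeping the gene order fixed across clustering results is what allows the sequences from different dendrograms to be compared column by column.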

To express these distance data compactly, the expression data display unit (gradation conversion unit, parallel display unit) 24 maps each distance to a color that becomes darker as the distance increases, as shown in 115a, and places the color under the corresponding gene. For a display device with 256 gradations, distances can be allocated proportionally so that the maximum distance in the clustering results being compared maps to the highest gradation level. If necessary, gamma correction can also be applied to emphasize differences at short distances and reduce color differences at long distances. In this embodiment, the distance "17" at 114b is the largest, so the display gradation is calculated as 255 × (distance / maximum distance), which maps that distance to 255.
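The proportional mapping 255 × (distance / maximum distance) can be sketched as below. The function name, the clamping, and the way gamma correction is expressed (as an exponent applied to the normalized distance, with an exponent below 1 stretching the short-distance end) are assumptions made for illustration; the text does not specify the correction formula.

```python
from typing import List, Sequence


def to_gradation(
    distances: Sequence[float], max_distance: float, gamma: float = 1.0
) -> List[int]:
    """Map distances to 0..255 levels; a larger distance gives a darker (higher) level.

    gamma is an assumed exponent on the normalized distance: 1.0 is the plain
    proportional mapping, values below 1.0 emphasize short distances.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    levels = []
    for d in distances:
        ratio = min(max(d / max_distance, 0.0), 1.0)  # clamp to [0, 1]
        levels.append(round(255 * ratio ** gamma))
    return levels


if __name__ == "__main__":
    # Distances of 115a, scaled against the overall maximum distance 17.
    print(to_gradation([3, 3, 6, 11], max_distance=17))  # [45, 45, 90, 165]
    print(to_gradation([17], max_distance=17))           # [255]
```

Scaling every sequence against the same maximum keeps the colors comparable across the heat map regions being displayed together.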

In case 111, as a comparison of heat map regions 115a and 116a shows, when the dendrograms are similar the heat map regions also have nearly the same gradation pattern. In case 112, by contrast, the clustering results differ, so it is easy to see that the gradation patterns of heat map regions 115b and 116b differ.

Furthermore, although the values are not shown in the figure, the similarity between clustering results can also be quantified by computing the Pearson correlation coefficient of the numbers in the heat map regions. For example, 115a is (3, 3, 6, 11) and 116a is (5, 5, 7, 15), and the Pearson correlation coefficient of the two is 0.9990. In 112, on the other hand, 115b is (2, 2, 7, 11) and 116b is (17, 7, 7, 10), giving a negative Pearson correlation coefficient of -0.2768. To select clustering results that are similar, a threshold can be set, for example a correlation coefficient of 0.7 or more. To select results with low correlation, those with coefficients close to zero or with negative coefficients can be chosen. Even simply sorting from positive to negative correlation, so that results can be viewed in order of similarity, makes the overall picture easier to organize and grasp.
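The selection and sorting described here might look like the following sketch. The `pearson` helper, the `pairs` dictionary, and its keys are hypothetical names; the sequences are the ones quoted in the text. The exact coefficients obtained depend on the values read from the figure, so the sketch relies only on the 0.7 threshold and on the ordering, not on specific coefficient values.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Heat map sequences for the two cases in FIG. 11 (motif-based, expression-based).
pairs: Dict[str, Tuple[Sequence[float], Sequence[float]]] = {
    "111": ((3, 3, 6, 11), (5, 5, 7, 15)),
    "112": ((2, 2, 7, 11), (17, 7, 7, 10)),
}

scores = {name: pearson(a, b) for name, (a, b) in pairs.items()}
# Cases whose two clustering results are similar (threshold taken from the text).
similar = [name for name, r in scores.items() if r >= 0.7]
# All cases ordered from most positively to most negatively correlated.
ranked = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print(similar, ranked)
```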

Two dendrograms are compared in the figure, but three or more can also be compared visually simply by arranging additional heat map regions in succession below heat map regions 115a and 116a. To evaluate numerically how much three or more clustering results vary, the variance of the gradations can be computed for each gene instead of the Pearson correlation coefficient, and the genes can then be sorted and selected according to whether their variation is large or small.

For example, the mean of the gradation values is computed directly, and the squared deviations from that mean are averaged. This quantifies the variation seen visually in the heat map. As described above, the present invention makes it possible to compare and display the results of clustering on various kinds of information and to quickly find which clustering results are similar and which are not. This helps to identify cases in which genes are similar at the motif level, and thus possibly similar in function, yet act on different proteins.
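The per-gene variance described above (the mean of squared deviations from the mean gradation) can be sketched as follows. The function name and the three input sequences are illustrative assumptions: the sequences are the gradations of (3, 3, 6, 11), (5, 5, 7, 15), and (17, 7, 7, 10) scaled against a maximum of 17, combined here only to show the calculation.

```python
from typing import Dict, List, Sequence


def per_gene_variance(
    genes: Sequence[str], sequences: Sequence[Sequence[float]]
) -> Dict[str, float]:
    """Population variance of the gradation values of each gene across clusterings."""
    result: Dict[str, float] = {}
    for i, gene in enumerate(genes):
        values = [seq[i] for seq in sequences]  # one value per clustering result
        mean = sum(values) / len(values)
        result[gene] = sum((v - mean) ** 2 for v in values) / len(values)
    return result


genes = ["A", "B", "C", "D"]
gradations: List[List[int]] = [
    [45, 45, 90, 165],    # illustrative clustering result 1
    [75, 75, 105, 225],   # illustrative clustering result 2
    [255, 105, 105, 150], # illustrative clustering result 3
]

variance = per_gene_variance(genes, gradations)
# Genes ordered from most to least variable across the clustering results.
most_variable_first = sorted(variance, key=variance.get, reverse=True)

if __name__ == "__main__":
    print(variance)
    print(most_variable_first)  # ['A', 'D', 'B', 'C']
```

Sorting by this value puts the genes whose placement changes most between clustering results at the top of the list.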

As described above, this embodiment makes it easy to compare and grasp how similar two or more dendrograms created from different data about each gene are. In particular, for a group of genes known from the motif-based dendrogram to be structurally similar, it is easy to see that their expression patterns differ by expression timing, expression site, and the like. This information yields important insights into differences in gene function, such as the possibility that the resulting proteins interact with different partners or act in different networks.

This embodiment showed clustering based on motif-level similarity and clustering based on gene expression levels, but the approach can also be applied to various other feature quantities. Needless to say, it can also be used, for example, when groups of numerical values (vector quantities) related to experimental conditions and vector quantities of the resulting experimental data are each clustered and then placed side by side for comparison.

FIG. 1 is a block diagram showing the functional configuration of a gene clustering apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of a group of genes to be clustered.
FIG. 3 is a diagram showing an example of motifs obtained by a search.
FIG. 4 is the PAM40 matrix table.
FIG. 5 is a diagram showing an example of similarity scores between genes.
FIG. 6 is a dendrogram of a gene clustering result.
FIG. 7 is a diagram showing an example in which the gene motif clustering result and the gene expression patterns are similar.
FIG. 8 is a diagram showing a case in which the expression pattern is not conserved between paralogous genes.
FIG. 9 is a diagram showing an example in which the expression timing of paralogous genes is slightly shifted.
FIG. 10 is a diagram showing the processing flow for comparing expression patterns within sub-clusters.
FIG. 11 is a diagram showing an example in which a plurality of clustering results are displayed side by side as heat maps.

Explanation of symbols

10 gene clustering apparatus, 11 input device, 12 user interface unit, 13 data access unit, 14 gene sequence storage unit, 15 score storage unit, 16 motif storage unit, 17 gene expression data storage unit, 18 motif search unit, 19 motif score calculation unit, 20 intergenic distance calculation unit, 21 clustering processing unit, 22 expression data acquisition unit, 23 output device, 24 expression data display unit

Claims

1. A gene clustering apparatus for clustering a plurality of genes based on sequence similarity, comprising:
a motif search unit that searches for motif sequences, each contained in a gene sequence and comprising a sequence pattern corresponding to an active site or functional region in a protein structure;
a motif score calculation unit that calculates a similarity score between any two genes based on the motif sequences contained in each gene sequence to be clustered;
an intergenic distance calculation unit that calculates an intergenic distance between any two genes using the similarity score;
a clustering processing unit that clusters the plurality of genes based on the intergenic distance;
an expression data acquisition unit that acquires expression data of each gene from a gene expression data storage unit; and
an expression data display unit that displays the acquired expression data of each gene at a position corresponding to that gene,
wherein the motif score calculation unit calculates similarity scores between all motif sequences contained in any two genes, treats the similarity score of regions other than the motif sequences as zero, and uses the sum of all the calculated motif sequence similarity scores as the similarity score of the two genes.

2. The gene clustering apparatus according to claim 1, wherein the clustering processing unit further performs clustering using a feature vector quantity other than the intergenic distance, the apparatus further comprising:
a gradation conversion unit that converts the distance information of each sub-cluster into a one-dimensional gradation sequence based on the result of each clustering; and
a parallel display unit that displays in parallel, for each clustering result, the result of the conversion into the one-dimensional gradation sequence.

3. The gene clustering apparatus according to claim 1 or 2, wherein the expression data display unit displays a plurality of expression data, obtained from a plurality of samples for each cell maturation process, arranged along the cell maturation process such that expression data for the same maturation stage are displayed adjacent to one another, and colors each expression datum with a density corresponding to the measured expression level.

4. A program that causes a computer to function as a gene clustering apparatus that clusters a plurality of genes based on sequence similarity, the program causing the computer to function as:
a motif search unit that searches for motif sequences, each contained in a gene sequence and comprising a sequence pattern corresponding to an active site or functional region in a protein structure;
a motif score calculation unit that calculates a similarity score between any two genes based on the motif sequences contained in each gene sequence to be clustered;
an intergenic distance calculation unit that calculates an intergenic distance between any two genes using the similarity score;
a clustering processing unit that clusters the plurality of genes based on the intergenic distance;
an expression data acquisition unit that acquires expression data of each gene from a gene expression data storage unit; and
an expression data display unit that displays the acquired expression data of each gene at a position corresponding to that gene,
wherein the motif score calculation unit calculates similarity scores between all motif sequences contained in any two genes, treats the similarity score of regions other than the motif sequences as zero, and uses the sum of all the calculated motif sequence similarity scores as the similarity score of the two genes.

5. The program according to claim 4, wherein the clustering processing unit further performs clustering using a feature vector quantity other than the intergenic distance, the program further causing the computer to function as:
a gradation conversion unit that converts the distance information of each sub-cluster into a one-dimensional gradation sequence based on the result of each clustering; and
a parallel display unit that displays in parallel, for each clustering result, the result of the conversion into the one-dimensional gradation sequence.

6. The program according to claim 4 or 5, wherein the expression data display unit displays a plurality of expression data, obtained from a plurality of samples for each cell maturation process, arranged along the cell maturation process such that expression data for the same maturation stage are displayed adjacent to one another, and colors each expression datum with a density corresponding to the measured expression level.