JP2001258598A - Method and device for analyzing DNA sequence

Info

Publication number: JP2001258598A
Application number: JP2000084184A
Authority: JP
Inventors: Hironobu Takahashi; 裕信高橋; Ryuichi Oka; 隆一岡; Yasuhide Mori; 靖英森
Original assignee: Real World Computing Partnership; Hitachi Ltd
Current assignee: Real World Computing Partnership; Hitachi Ltd
Priority date: 2000-03-24
Filing date: 2000-03-24
Publication date: 2001-09-25

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of DNA analysis. SOLUTION: In this DNA sequence analysis method, words each consisting of a plurality of bases are sampled sequentially from a DNA sequence at an interval of fewer bases than the number of bases in each sample, and the sampled words are clustered. The resulting distribution of the words in galaxy space is treated as the feature of the DNA sequence. The feature of a reference DNA sequence is compared with that of the DNA sequence to be analyzed to detect sections that do not match, thereby detecting abnormal sites in the DNA sequence.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a DNA sequence analysis method and device suitable for analyzing a DNA sequence expressed as a base sequence and detecting mutated portions, inserted portions, and the like.

[0002]

[Prior art] Known methods for analyzing a DNA sequence expressed as a base sequence use a hidden Markov model or dynamic programming to clarify the alignment of DNA sequences.

[0003]

[Problem to be solved by the invention] However, the analysis accuracy of the above methods is low, and analysis with higher accuracy is required.

[0004] Accordingly, an object of the present invention is to provide a novel DNA sequence analysis method and device with improved, high analysis accuracy.

[0005]

[Means for solving the problem] To achieve this object, the invention of claim 1 is characterized in that a reference DNA sequence is stored in an information processing device and a DNA sequence to be analyzed is given to the information processing device; the information processing device sequentially samples base sequences of a fixed number of bases from each of the reference DNA sequence and the DNA sequence to be analyzed, at an interval of fewer bases than the fixed number; clusters the sampled base sequences to obtain their distribution in galaxy space; and compares the distribution for the reference sequence with the distribution for the analysis target to detect sections of the DNA sequence to be analyzed in which the distributions do not match.

[0006] The invention of claim 2 is the DNA sequence analysis method according to claim 1, characterized in that the information processing device counts the number of base sequences in the section of the reference DNA sequence and in the section of the DNA sequence to be analyzed where the distributions do not match, and detects insertion and deletion of bases by comparing the counts.

[0007] The invention of claim 3 is the DNA sequence analysis method according to claim 1, characterized in that the information processing device outputs, as an analysis result, at least the section of the DNA sequence to be analyzed in which the distributions do not match.

[0008] The invention of claim 4 is characterized by comprising: storage means for storing a reference DNA sequence; input means for inputting a DNA sequence to be analyzed; sampling means for sequentially sampling base sequences of a fixed number of bases from each of the reference DNA sequence and the DNA sequence to be analyzed, at an interval of fewer bases than the fixed number; clustering means for clustering the sampled base sequences to obtain their distribution in galaxy space; and means for comparing the distribution for the reference sequence with the distribution for the analysis target to detect sections of the DNA sequence to be analyzed in which the distributions do not match.

[0009] The invention of claim 5 is the DNA sequence analyzer according to claim 4, characterized by further comprising means for counting the number of base sequences in the section of the reference DNA sequence and in the section of the DNA sequence to be analyzed where the distributions do not match, and for detecting insertion and deletion of bases by comparing the counts.

[0010] The invention of claim 6 is the DNA sequence analyzer according to claim 4, characterized by further comprising means for outputting, as an analysis result, at least the section of the DNA sequence to be analyzed in which the distributions do not match.

[0011]

[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0012] FIG. 1 is an explanatory diagram of the processing procedure of the DNA sequence analysis method of this embodiment. In this embodiment, when a DNA sequence is given, N consecutive bases (N is a positive number) are sampled. The sampling position within the sequence is shifted by M bases at a time (M is a positive number smaller than N). FIG. 1 shows an example with N = 7 and M = 1. Because the number of possible combinations of sampled bases is finite, similar combinations are gathered into one set, and a plurality of sets is created. In this embodiment, creating these sets is called clustering. In the example of FIG. 1, the sampled base sequences all differ from one another, so sets (called words in this embodiment) W1, W2, W3, ... are created. This embodiment uses the galaxy clustering method for clustering. The method is disclosed in Hironobu Takahashi, Yoshitaka Nitta, Takashi Endo, "Clustering Method of large-scale Bigram Network Specialization and Application to Text Retrieval", Technical Report of IECE, NLC-97-34, pp. 41-47, but is briefly described here because it relates to the present invention.
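The sampling step can be illustrated with a minimal Python sketch. The function name `sample_words` and the example sequence are hypothetical and do not appear in the patent; the defaults n = 7 and m = 1 follow the example values given for the embodiment.

```python
def sample_words(sequence: str, n: int = 7, m: int = 1) -> list[str]:
    """Return every window of n consecutive bases, advancing m bases per step."""
    if not 0 < m < n:
        # The embodiment requires the shift to be smaller than the window length.
        raise ValueError("the shift m must satisfy 0 < m < n")
    return [sequence[i:i + n] for i in range(0, len(sequence) - n + 1, m)]


if __name__ == "__main__":
    # Hypothetical 10-base sequence, for illustration only.
    print(sample_words("ACGTACGTAC", n=7, m=1))
```

Because m is smaller than n, consecutive words overlap, so a single changed base affects several neighbouring words.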

[0013] A virtual space representing N-dimensional features (galaxy space) is assumed, and the position of an object in galaxy space is denoted x. The objects are indexed by a number i. The similarity between the two objects at positions x_i and x_j is denoted M_ij, and the distance between the two object positions is denoted d_ij. Formulas for calculating the similarity and the distance are prepared in advance; pairs of objects are selected from the plurality of objects, varying the pair each time, and the similarity and distance between the two objects of each pair are calculated.

[0014] When two objects are similar, the similarity M_ij is large and the distance d_ij between the objects is small. Using these two parameters, similarity and distance, a plurality of sets is created, each gathering objects that are close to one another in galaxy space.
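The patent does not give the similarity and distance formulas, and the galaxy clustering method itself is only cited. The following Python sketch is therefore a stand-in under stated assumptions, not the cited method: similarity is taken as the fraction of matching positions between two equal-length words, distance as one minus that similarity, and grouping is a simple greedy pass. All names (`similarity`, `distance`, `group_words`, `max_distance`) are hypothetical.

```python
def similarity(a: str, b: str) -> float:
    """Assumed similarity: fraction of positions carrying the same base."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def distance(a: str, b: str) -> float:
    """Assumed distance: small when the similarity is large."""
    return 1.0 - similarity(a, b)


def group_words(words: list[str], max_distance: float) -> list[list[str]]:
    """Greedy stand-in for clustering: join the first group whose seed is near."""
    groups: list[list[str]] = []
    for word in words:
        for group in groups:
            if distance(group[0], word) <= max_distance:
                group.append(word)
                break
        else:
            # No existing group is close enough, so this word seeds a new one.
            groups.append([word])
    return groups


if __name__ == "__main__":
    print(group_words(["ACGTACG", "ACGTACC", "TTTTTTT", "ACGAACG"], 0.2))
```

A greedy pass depends on input order; it is used here only because it is short enough to read at a glance.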

[0015] Using this technique, with features such as the position of a specific base and the combination of other adjacent bases, the words (sampled base sequences) can be clustered.

[0016] When the clustered sets (words) are connected in order, they can be displayed schematically as shown in the lower part of FIG. 1. The display is two-dimensional for the purpose of illustration, but the sets are actually arranged in an N-dimensional space, that is, in galaxy space. In this embodiment, this arrangement, that is, the distribution of the words, is treated as the feature of the DNA sequence.

[0017] A reference DNA sequence whose genetic properties are known in advance is subjected to the above processing by an information processing device such as a personal computer, which thereby obtains the features of the distribution of its words (sets) in galaxy space. The DNA sequence to be analyzed is then subjected to the same processing to obtain the features of its distribution in galaxy space. The two resulting distributions are shown schematically in FIGS. 2 to 4. FIG. 2(A) shows an example in which one base of the reference DNA sequence is mutated to another base in the DNA sequence to be analyzed; words of the reference DNA sequence in galaxy space are denoted D, and words of the DNA sequence to be analyzed are denoted M. FIG. 2(B) shows that, at the mutated location, the two distributions differ in galaxy space to an extent that can be confirmed visually.

[0018] FIG. 3(A) shows a case in which a base sequence not present in the reference DNA sequence has been inserted into the DNA sequence to be analyzed, and FIG. 3(B) shows the arrangement of words in galaxy space for this case. Here, the words of the analysis target are denoted I.

[0019] FIG. 4 shows a case in which one or more base sequences of the reference DNA sequence are missing from the DNA sequence to be analyzed. FIG. 4(B) shows the distribution of words in galaxy space for this case. Here, the distribution of the analysis target is denoted V.

[0020] Accordingly, in this embodiment, comparing the two word distributions in galaxy space reveals the following.
1) The distance between the two distributions is calculated; a section in which the calculated distance exceeds a predetermined threshold is a section that differs from the reference DNA sequence. The abnormal section can thus be detected by simple information processing.
2) The words on each of the two distributions within the section where the distance exceeds the threshold are counted, and the counts are compared (their difference is calculated) to determine the type of anomaly: mutation, insertion, or deletion.
(a) If the word counts match (the difference is 0), the anomaly is a mutation, and the mutated portion lies in the word of the analysis target at which the distance is largest. The mutated base sequence can also be identified by comparing the words of the DNA sequences in the abnormal section.
(b) If the word count of the reference DNA sequence is larger than that of the analysis target, bases are missing from the DNA sequence to be analyzed, and the missing base sequence can also be identified by comparing the bases of the DNA sequences in the abnormal section.
(c) If the word count of the reference DNA sequence is smaller than that of the analysis target, a base sequence has been inserted into the DNA sequence to be analyzed, and the inserted base sequence can also be identified by comparing the words of the DNA sequences in the abnormal section.
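The two decisions described in this paragraph can be sketched in Python as follows. The sketch assumes that a per-position list of distances between the two distributions is already available; how that list is computed is not specified here. The names `abnormal_sections` and `classify_anomaly` are hypothetical.

```python
def abnormal_sections(distances: list[float], threshold: float) -> list[tuple[int, int]]:
    """Return half-open (start, end) index ranges where distance > threshold."""
    sections: list[tuple[int, int]] = []
    start = None
    for i, d in enumerate(distances):
        if d > threshold and start is None:
            start = i                      # an abnormal section begins
        elif d <= threshold and start is not None:
            sections.append((start, i))    # the section ended just before i
            start = None
    if start is not None:
        sections.append((start, len(distances)))
    return sections


def classify_anomaly(reference_words: int, target_words: int) -> str:
    """Classify a section from the word counts on the two distributions."""
    difference = reference_words - target_words
    if difference == 0:
        return "mutation"
    # More reference words means bases are missing from the analysis target.
    return "deletion" if difference > 0 else "insertion"


if __name__ == "__main__":
    print(abnormal_sections([0.0, 0.1, 0.9, 0.8, 0.1, 0.7], threshold=0.5))
    print(classify_anomaly(7, 7), classify_anomaly(7, 5), classify_anomaly(7, 10))
```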

[0021] A system for implementing the DNA analysis method described above is explained next. The system can be implemented with a general-purpose computer such as a personal computer or with digital circuitry; the example described here implements the analysis function in software on a general-purpose computer.

[0022] FIG. 5 shows the functional configuration of a software program for implementing the analysis function. In FIG. 5, reference numeral 10 denotes a word extraction unit that sequentially samples words from the DNA sequence to be analyzed, which is supplied by keyboard input or as a text file. Reference numeral 20 denotes a clustering unit that clusters the extracted words and outputs the distribution of the words in galaxy space.

[0023] Reference numeral 30 denotes a reference sequence storage unit, which stores the reference (DNA) sequence supplied by keyboard input or as a text file. A hard disk can be used as the reference sequence storage unit 30.

[0024] Reference numeral 40 denotes a word extraction unit that sequentially extracts words from the reference sequence; the word extraction unit 10 can be shared for this purpose. Reference numeral 50 denotes a clustering unit that clusters the words of the reference sequence and outputs the distribution of words in galaxy space for the reference sequence. The clustering units 20 and 50 can likewise be shared.

[0025] Reference numeral 60 denotes a pattern analysis unit, which analyzes the word distributions output from the clustering units 20 and 50 by the analysis method described above and outputs the analysis result. Reference numeral 70 denotes an output unit that outputs the analysis result from the pattern analysis unit 60, for example by displaying it in the form shown in FIG. 6. A printer, a communication device, or the like can be used as the output unit 70. The functions of the processing units 10, 20, 40, 50, and 60 are realized by a CPU executing a software program. The specific content of the software program depends on the programming language used, and a person skilled in the art can write it from the above description of the analysis method, so a detailed description is omitted.
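One way to picture how the functional blocks fit together is the Python sketch below. It is a simplified illustration, not the patented program: a plain count of identical words stands in for the clustering units, and the pattern analysis step only reports words present in one sequence and not the other. The function names are hypothetical; the numerals in the comments refer to the units of FIG. 5.

```python
from collections import Counter


def extract_words(sequence: str, n: int = 7, m: int = 1) -> list[str]:
    """Word extraction (units 10 and 40 share this one function)."""
    return [sequence[i:i + n] for i in range(0, len(sequence) - n + 1, m)]


def word_distribution(words: list[str]) -> Counter:
    """Stand-in for the clustering units 20 and 50: count identical words."""
    return Counter(words)


def analyze(reference: str, target: str, n: int = 7, m: int = 1):
    """Stand-in for pattern analysis unit 60: words unique to each side."""
    ref_dist = word_distribution(extract_words(reference, n, m))
    tgt_dist = word_distribution(extract_words(target, n, m))
    # Counter subtraction keeps only words with a positive remaining count.
    return sorted(ref_dist - tgt_dist), sorted(tgt_dist - ref_dist)


if __name__ == "__main__":
    only_reference, only_target = analyze("ACGTACGT", "ACGAACGT", n=4)
    print(only_reference)
    print(only_target)
```

Sharing `extract_words` and `word_distribution` between the two sequences mirrors the remark that units 10 and 40, and units 20 and 50, can be shared.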

[0026] The analysis result in FIG. 6 displays the base sequence of the reference DNA sequence alongside the DNA sequence to be analyzed. Matching bases (sections in which the distance between words is at or below the threshold) are joined by straight lines to show that they match; no lines are drawn for non-matching base sequence portions (portions where the word distribution is abnormal), which indicates inserted or missing bases. Mutated portions are brought to the user's attention by displaying their bases in a color different from the rest. A number indicating the position of each base in the DNA sequence is also displayed.

[0027] To produce such a display, the input DNA sequence to be analyzed is stored temporarily and displayed together with the DNA sequence stored on the hard disk. The bases are counted by a counter as they are displayed, and the count is shown as the number indicating the base position. The bases to be joined by straight lines are determined by the analysis method described above. This processing can be realized by a CPU executing software.
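A text-only approximation of such a display might look like the Python sketch below. It is a simplification under stated assumptions: the two sequences are assumed to have equal length (the mutation-only case), matches are decided base by base instead of by word distance, and a vertical bar stands in for the straight line. The name `render_comparison` is hypothetical.

```python
def render_comparison(reference: str, target: str) -> str:
    """Return a four-line text view: position ruler, reference, marks, target."""
    if len(reference) != len(target):
        raise ValueError("this sketch handles equal-length sequences only")
    # Last digit of the 1-based base position, produced by a running counter.
    ruler = "".join(str((i + 1) % 10) for i in range(len(reference)))
    # A bar joins matching bases; a blank marks a non-matching position.
    marks = "".join("|" if r == t else " " for r, t in zip(reference, target))
    return "\n".join([ruler, reference, marks, target])


if __name__ == "__main__":
    print(render_comparison("ACGTACGT", "ACGAACGT"))
```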

[0028] The display form of FIG. 6 is one example given for explanation. The nature of the anomaly, its base sequence, and its word position may instead be output in the form of bases, and various other forms of notification may be adopted.

[0029] In addition to the embodiment described above, the following variations are possible.
1) The embodiment above uses a sample length of 7 bases and a sampling interval of 1 base, but the sample length and the sampling interval may be made variable, for example by entering arbitrary values from a keyboard or by specifying them with a mouse.
2) A plurality of reference DNA sequences may be prepared. The reference DNA sequence that is similar to the DNA sequence to be analyzed can then be found among them. In this case, distances between the word distributions are calculated, and the reference DNA sequence at the smallest distance is determined to be the similar DNA sequence. If identification information indicating genetic properties is attached to each reference DNA sequence, retrieving the identification information of the similar reference DNA sequence that was found makes it possible to determine the type of genetic property of the DNA sequence to be analyzed.
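Variation 2) can be sketched in Python as a nearest-reference lookup. The patent does not define the distance between word distributions, so the sketch assumes the sum of absolute differences in word counts; the labels `gene_a` and `gene_b`, the sequences, and all function names are hypothetical.

```python
from collections import Counter


def word_counts(sequence: str, n: int = 7, m: int = 1) -> Counter:
    """Count the words sampled with length n and shift m."""
    return Counter(sequence[i:i + n] for i in range(0, len(sequence) - n + 1, m))


def distribution_distance(a: Counter, b: Counter) -> int:
    """Assumed distance: sum of absolute differences of the word counts."""
    return sum(abs(a[w] - b[w]) for w in a.keys() | b.keys())


def nearest_reference(target: str, references: dict[str, str],
                      n: int = 7, m: int = 1) -> str:
    """Return the label of the reference whose distribution is closest."""
    target_counts = word_counts(target, n, m)
    return min(
        references,
        key=lambda label: distribution_distance(
            target_counts, word_counts(references[label], n, m)
        ),
    )


if __name__ == "__main__":
    refs = {"gene_a": "ACGTACGTAC", "gene_b": "TTTTGGGGCC"}
    print(nearest_reference("ACGTACGAAC", refs, n=4))
```

The returned label plays the role of the identification information attached to each reference sequence.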

[0030] The analysis processing of the embodiment described above may also be performed using the reference DNA sequence selected in this way.

[0031] Various modifications other than the embodiments described above are possible; as long as a modification is based on the technical idea set out in the claims of the present application, it falls within the technical scope of the present invention.

[0032]

[Effects of the invention] When experiments were actually carried out based on the present invention described above, higher analysis accuracy was obtained than with conventional analysis methods. In the present invention, because the sampled base sequences are clustered, changes of bases within the changing base sequences are captured as a feature of the DNA sequence, and this feature is incorporated into the galaxy distribution. This is thought to be why inserted, missing, and mutated portions in the DNA sequence to be analyzed can be detected more accurately.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram of the DNA analysis method according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of the analysis performed in the embodiment of the present invention.

FIG. 3 is an explanatory diagram of the analysis performed in the embodiment of the present invention.

FIG. 4 is an explanatory diagram of the analysis performed in the embodiment of the present invention.

FIG. 5 is a block diagram showing the functional configuration of a system according to the embodiment of the present invention.

FIG. 6 is an explanatory diagram showing a display example according to the embodiment of the present invention.

[Explanation of symbols]

10, 40 Word extraction unit
20, 50 Clustering unit
60 Pattern analysis unit
70 Output unit

Continuation of front page
(72) Inventor: Ryuichi Oka, c/o Real World Computing Partnership (technology research association), 8th floor, Ryukakusan Building, 2-5-12 Higashi-Kanda, Chiyoda-ku, Tokyo
(72) Inventor: Yasuhide Mori, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
F-terms (reference): 4B024 AA11 CA01 HA11 HA19 4B029 AA07 AA23 BB20 FA10 4B063 QA01 QA12 QA13 QA18 QQ42 QS38 QS39

Claims

[Claims]

1. A DNA sequence analysis method, characterized in that a reference DNA sequence is stored in an information processing device and a DNA sequence to be analyzed is given to the information processing device, and the information processing device sequentially samples base sequences of a fixed number of bases from each of the reference DNA sequence and the DNA sequence to be analyzed at an interval of fewer bases than the fixed number, clusters the sampled base sequences to obtain their distribution in galaxy space, and compares the distribution for the reference sequence with the distribution for the analysis target to detect sections of the DNA sequence to be analyzed in which the distributions do not match.

2. The DNA sequence analysis method according to claim 1, characterized in that the information processing device counts the number of base sequences in the section of the reference DNA sequence and in the section of the DNA sequence to be analyzed where the distributions do not match, and detects insertion and deletion of bases by comparing the counts.

3. The DNA sequence analysis method according to claim 1, characterized in that the information processing device outputs, as an analysis result, at least the section of the DNA sequence to be analyzed in which the distributions do not match.

4. A DNA sequence analyzer, characterized by comprising: storage means for storing a reference DNA sequence; input means for inputting a DNA sequence to be analyzed; sampling means for sequentially sampling base sequences of a fixed number of bases from each of the reference DNA sequence and the DNA sequence to be analyzed, at an interval of fewer bases than the fixed number; clustering means for clustering the sampled base sequences to obtain their distribution in galaxy space; and means for comparing the distribution for the reference sequence with the distribution for the analysis target to detect sections of the DNA sequence to be analyzed in which the distributions do not match.

5. The DNA sequence analyzer according to claim 4, characterized by further comprising means for counting the number of base sequences in the section of the reference DNA sequence and in the section of the DNA sequence to be analyzed where the distributions do not match, and for detecting insertion and deletion of bases by comparing the counts.

6. The DNA sequence analyzer according to claim 4, characterized by further comprising means for outputting, as an analysis result, at least the section of the DNA sequence to be analyzed in which the distributions do not match.