JPH05219932A

JPH05219932A - Device for examining genetic information

Info

Publication number: JPH05219932A
Application number: JP2101292A
Authority: JP
Inventors: Mayumi Oya; 真弓大矢; Seiichi Aikawa; 聖一相川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-02-06
Filing date: 1992-02-06
Publication date: 1993-08-31

Abstract

PURPOSE: To evaluate the similarity of two amino acid sequences at high speed and with a small amount of memory by representing amino acids as characters and evaluating the similarity between a test amino acid sequence and a comparison amino acid sequence. CONSTITUTION: Detection means 10 detects the longest common character count between the test amino acid sequence and the comparison amino acid sequence, each expressed as a character string. Calculation means 11 calculates the ratio of the longest common character count to the character string length of the test amino acid sequence or of the comparison amino acid sequence, and the calculated ratio is output to an output device 2.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a genetic information testing apparatus that evaluates the similarity between a test amino acid sequence and a comparison amino acid sequence, and in particular to a genetic information testing apparatus that can evaluate this similarity with a simple processing mechanism.

[0002] In protein engineering, which is needed for drug development and similar work, advances in molecular biology have led to the accumulation of large amounts of genetic information, and this information is now being organized into databases. There is a need to extract biologically meaningful information, such as protein structure and function, from this large body of accumulated genetic information. To allow high-speed processing, this extraction is preferably realized with as simple a processing mechanism as possible.

[0003]

[Prior Art] The substance of a gene is DNA, which is expressed as a base sequence composed of four bases: A (adenine), T (thymine), C (cytosine), and G (guanine). About 20 kinds of amino acids make up living organisms, and it has been established that each run of three bases in a base sequence corresponds to one amino acid. In a living body, therefore, amino acids are synthesized according to the base sequence of the DNA, and the synthesized amino acids are folded to form a protein.

[0004] As described above, with the development of molecular biology, methods for determining base and amino acid sequences have been established, and large amounts of genetic information such as base sequence data and amino acid sequence data have begun to accumulate. Consequently, the central problem in the field of genetic information processing has become how to extract biological information about protein structure and function from this vast body of accumulated genetic information.

[0005] The basic technique for extracting such biological information is to compare amino acid sequences, because similar amino acid sequences are thought to have similar biological functions.

[0006] Against this background, two kinds of processing have come into use for estimating the function of an amino acid sequence under evaluation: homology search, which searches a database of known amino acid sequences whose functions have been elucidated for sequences similar to the sequence under evaluation, and alignment, which arranges the amino acid sequences being compared so that their differences and similarities become clear.

[0007] Regions of an amino acid sequence that encode functions important to an organism are thought to be conserved over the course of evolution. For example, when the amino acid sequences of proteins that have the same function in different species are compared, sequence patterns common to them are known to exist. Such a sequence pattern is called a motif. Examining which motifs an amino acid sequence contains therefore makes it possible not only to elucidate the properties and functions of a protein but also to apply the results widely in protein engineering, for example to enhance existing proteins, add functions to them, or synthesize new proteins. For this reason, motif searches have come to be performed.

[0008] Conventionally, two amino acid sequences have been compared using the dynamic programming technique that is used in speech recognition and similar processing.

[0009]

[Problems to Be Solved by the Invention] However, because the dynamic programming method compares the amino acid sequences two-dimensionally, it requires a large memory capacity and a long processing time.

[0010] The present invention has been made in view of these circumstances, and its object is to provide a new genetic information testing apparatus that can evaluate the similarity between a test amino acid sequence and a comparison amino acid sequence with a simple processing mechanism.

[0011]

[Means for Solving the Problems] FIG. 1(a) shows the principle configuration of the first invention, and FIG. 1(b) shows the principle configuration of the second invention.

[0012] In FIGS. 1(a) and 1(b), reference numeral 1 denotes a genetic information testing apparatus embodying the present invention, which represents amino acids as characters and evaluates the similarity between a test amino acid sequence and a comparison amino acid sequence; reference numeral 2 denotes an output device connected to the genetic information testing apparatus 1.

[0013] The genetic information testing apparatus 1 according to FIG. 1(a) comprises: detection means 10 for detecting the longest common character count between a test amino acid sequence expressed as a character string and a comparison amino acid sequence expressed as a character string; calculation means 11 for calculating the ratio of the longest common character count detected by the detection means 10 to the character string length of the test amino acid sequence or of the comparison amino acid sequence; output control means 12 for outputting the ratio calculated by the calculation means 11 to the output device 2; and evaluation means 13 for evaluating the similarity between the test amino acid sequence and the comparison amino acid sequence according to the ratio calculated by the calculation means 11.

[0014] The genetic information testing apparatus 1 according to FIG. 1(b) comprises: detection means 20 for detecting the longest common subsequences of a test amino acid sequence expressed as a character string and a comparison amino acid sequence expressed as a character string; identification means 21 for identifying the positions that each longest common subsequence occupies in the test amino acid sequence and in the comparison amino acid sequence and, from those positions, the lengths of the character strings lying between them; output control means 22 for outputting the detection results of the detection means 20 and the identification results of the identification means 21 to the output device 2; and evaluation means 23 for evaluating the similarity between the test amino acid sequence and the comparison amino acid sequence according to the detection results of the detection means 20 and the identification results of the identification means 21.

[0015]

[Operation] In the genetic information testing apparatus 1 according to FIG. 1(a), when, for example, the test amino acid sequence is expressed as the character string "ABCBDAB" and the comparison amino acid sequence as "BDCABA", the detection means 10 detects that the longest common character count of the two strings is 4. On receiving this result, the calculation means 11 calculates a ratio of 57% (= 4 ÷ 7) when the character string length of the test amino acid sequence is used as the reference, and a ratio of 67% (= 4 ÷ 6) when the character string length of the comparison amino acid sequence is used as the reference.
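The two ratios in this example can be checked with a short Python sketch. The function name `similarity_ratio` is an illustrative assumption and is not defined in the patent; the count of 4 and the two strings are taken from the text.

```python
def similarity_ratio(common_count, reference_length):
    """Ratio of the longest common character count to a reference string length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return common_count / reference_length


test_seq = "ABCBDAB"        # test amino acid sequence from the text
comparison_seq = "BDCABA"   # comparison amino acid sequence from the text
common_count = 4            # longest common character count stated in the text

ratio_vs_test = similarity_ratio(common_count, len(test_seq))
ratio_vs_comparison = similarity_ratio(common_count, len(comparison_seq))
print(f"{ratio_vs_test:.0%} {ratio_vs_comparison:.0%}")
```

Which string length serves as the reference is left open by the text, so the sketch simply takes it as a parameter.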

[0016] The output control means 12 outputs the calculated ratio to the output device 2, thereby notifying the user of the evaluation value for the similarity between the test amino acid sequence and the comparison amino acid sequence. The evaluation means 13, for its part, compares the calculated ratio with a prescribed reference value, thereby mechanically evaluating the similarity between the two sequences and carrying out the homology test described above.

[0017] In this way, the genetic information testing apparatus 1 according to FIG. 1(a) calculates the longest common character count between the test amino acid sequence and the comparison amino acid sequence, both expressed as character strings, and evaluates the similarity between the two sequences according to this count. It can therefore evaluate the similarity of two amino acid sequences with a smaller memory capacity and at a higher speed than the dynamic programming comparison method.

[0018] In the genetic information testing apparatus 1 according to FIG. 1(b), when, for example, the test amino acid sequence is expressed as "ABCBDAB" and the comparison amino acid sequence as "BDCABA", the detection means 20 detects that the longest common subsequences of the two strings are "BDAB", "BCBA", "BDAB", and "BCAB". On receiving this result, the identification means 21 identifies the positions that each longest common subsequence occupies in the test amino acid sequence and in the comparison amino acid sequence, and also identifies the lengths of the character strings lying between those positions.

[0019] The output control means 22 then notifies the user of the similarity between the test amino acid sequence and the comparison amino acid sequence by outputting the longest common subsequences detected by the detection means 20 as they are, by outputting them together with the character string lengths identified by the identification means 21, or by aligning the two amino acid sequences according to the positions identified by the identification means 21 so that their longest common subsequences line up, and outputting the aligned sequences.
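As a rough illustration of the alignment output, the following Python sketch pads two strings with gap characters so that a given chain of matched positions lines up column by column. The function name `align_on_pairs`, the gap character, and the padding rule for unmatched stretches are all assumptions made for illustration; the patent text does not specify an output layout. The position pairs used below are one chain for the common subsequence "BCBA", worked out by hand from the two example strings.

```python
def align_on_pairs(s1, s2, pairs, gap="-"):
    """Pad s1 and s2 so that the matched positions in `pairs` share a column.

    `pairs` holds 1-based (r, j) positions, strictly increasing in both
    coordinates, where s1[r-1] == s2[j-1].
    """
    top, bottom = [], []
    prev_r, prev_j = 0, 0
    sentinel = (len(s1) + 1, len(s2) + 1)      # flushes the trailing stretches
    for r, j in list(pairs) + [sentinel]:
        seg1 = s1[prev_r:r - 1]                # unmatched part of s1 before the match
        seg2 = s2[prev_j:j - 1]                # unmatched part of s2 before the match
        width = max(len(seg1), len(seg2))
        top.append(seg1.ljust(width, gap))
        bottom.append(seg2.ljust(width, gap))
        if (r, j) != sentinel:
            top.append(s1[r - 1])              # the matched character itself
            bottom.append(s2[j - 1])
        prev_r, prev_j = r, j
    return "".join(top), "".join(bottom)


top, bottom = align_on_pairs("ABCBDAB", "BDCABA", [(2, 1), (3, 3), (4, 5), (6, 6)])
print(top)
print(bottom)
```

Unmatched stretches are simply left-justified against each other, so a column outside the matched positions may pair two different characters; a real display would likely treat those columns more carefully.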

[0020] The evaluation means 23, for its part, handles the case in which the comparison amino acid sequence is expressed either as a contiguous character string or as a character string that contains unspecified characters between sequence positions, and the task is to evaluate whether the test amino acid sequence contains this comparison amino acid sequence. In that case it carries out the motif test described above by mechanically evaluating, while taking the identification results of the identification means 21 into account, whether the longest common subsequence detected by the detection means 20 matches the comparison amino acid sequence.

[0021] In this way, the genetic information testing apparatus 1 according to FIG. 1(b) identifies the longest common subsequences of the test amino acid sequence and the comparison amino acid sequence, both expressed as character strings, and evaluates the similarity between the two sequences according to these subsequences. It can therefore evaluate the similarity of two amino acid sequences with a smaller memory capacity and at a higher speed than the dynamic programming comparison method.

[0022]

[Embodiments] The present invention is described in detail below with reference to embodiments. FIG. 2 shows one embodiment of the genetic information testing apparatus 1 implementing the present invention. In the figure, 40 is an input device connected to the genetic information testing apparatus 1; 41 is an interactive device, such as a keyboard or mouse, provided in the input device 40; 42 is a display device connected to the genetic information testing apparatus 1; 50 is an amino acid sequence database that manages amino acid sequence information expressed as character strings; and 60 is a motif database that manages motif sequence information expressed as character strings.

[0023] The genetic information testing apparatus 1 of this embodiment comprises: an LCS detection unit 30 that detects the longest common character count, the longest common subsequence (LCS), and the positions of the longest common subsequence between the amino acid sequence character string input from the input device 40 and an amino acid sequence character string supplied from the amino acid sequence database 50 or the motif database 60; a homology determination unit 31 that determines, from the results of the LCS detection unit 30, the homology of the two amino acid sequences processed by the LCS detection unit 30; a homology search unit 32 that, according to the results of the homology determination unit 31, searches the amino acid sequence database 50 for amino acid sequences homologous to the amino acid sequence input from the input device 40; a motif search unit 33 that, according to the results of the LCS detection unit 30, searches the motif database 60 for motif sequences homologous to the amino acid sequence input from the input device 40; an alignment unit 34 that, according to the detection results of the LCS detection unit 30, aligns the amino acid sequence character string input from the input device 40 with an amino acid sequence character string supplied from the amino acid sequence database 50 or the motif database 60; and a display unit 35 that displays the processing results of each unit on the display device 42.

[0024] Next, the processing executed by the LCS detection unit 30 is described in detail with reference to the processing flows shown in FIGS. 3 to 5. The flow in FIG. 3 detects the longest common character count of the two amino acid sequences under test; the flow in FIGS. 4 and 5 detects their longest common subsequences and the positions of those subsequences.

[0025] To detect the longest common character count between the amino acid sequence expressed as string 1 and the amino acid sequence expressed as string 2, the LCS detection unit 30 first, in step 1 of the flow in FIG. 3, reads string 1 one character at a time and builds a table showing the positions at which each character occurs in string 1.

[0026] This occurrence table is realized, for example, as an array indexed by the letters A to Z in which the occurrence positions of each character are linked by pointers. When string 1 is "ABCBDAB", for example, the table is built as shown in FIG. 6: "A" occurs at positions 6 and 1, "B" at positions 7, 4, and 2, "C" at position 3, and "D" at position 5. Step 1 also initializes the array S[i] used in the subsequent processing, which has the same size as string 1, by setting every entry to zero.
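A minimal Python sketch of step 1 follows, assuming a dictionary of lists in place of the pointer-linked A to Z array described in the text. The name `build_occurrence_table` is hypothetical. Positions are 1-based and stored largest first, matching the order in which the example lists them.

```python
def build_occurrence_table(string1):
    """Map each character of string1 to its 1-based positions, largest first."""
    table = {}
    for pos, ch in enumerate(string1, start=1):
        # Prepend so that later (larger) positions come first in each list.
        table.setdefault(ch, []).insert(0, pos)
    return table


string1 = "ABCBDAB"
table = build_occurrence_table(string1)

# S has one extra slot at index 0 so that S[r - 1] is defined for r = 1.
# This is an implementation choice of the sketch; every entry starts at zero.
S = [0] * (len(string1) + 1)
print(table)
```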

[0027] Next, in step 2, one character is read from string 2, and the occurrence table built in step 1 is consulted to identify the position r at which that character occurs in string 1. Then, in step 3, it is determined whether the r-th entry S[r] of the array S[i] is equal to the (r-1)-th entry S[r-1].

[0028] If step 3 determines that S[r] and S[r-1] are equal, the flow proceeds to step 4, where 1 is added to every entry S[i] at position r or later whose value equals that of S[r]. In the following step 5, it is determined whether the last character of string 2 has been processed; if not, the flow returns to step 2. If step 3 determines that S[r] and S[r-1] are not equal, the flow proceeds directly to step 5 without performing the addition of step 4.

[0029] If the character read from string 2 in step 2 occurs more than once in string 1, steps 3 and 4 are executed for each occurrence position r in descending order of r.

[0030] When step 5 determines that the last character of string 2 has been processed, the flow proceeds to step 6, where the entry Kmax of the last element S[m] of the array S[i] is output as the longest common character count.
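Steps 1 to 6 can be sketched in Python as follows. This is one reading of the flow described in the text, not a transcription of FIG. 3; the function name `lcs_length`, the dictionary-based occurrence table, and the extra slot at index 0 of S are assumptions of the sketch.

```python
def lcs_length(string1, string2):
    """Longest common character count of two strings, using a single array S."""
    # Step 1: occurrence table (positions largest first) and zeroed array S.
    positions = {}
    for pos, ch in enumerate(string1, start=1):
        positions.setdefault(ch, []).insert(0, pos)
    m = len(string1)
    S = [0] * (m + 1)            # index 0 is a sentinel so S[r - 1] always exists

    # Steps 2 to 5: process string 2 one character at a time.
    for ch in string2:
        for r in positions.get(ch, []):      # descending order of r
            if S[r] == S[r - 1]:             # step 3
                old = S[r]
                i = r
                # Step 4: add 1 to each entry from r onward that equals S[r].
                # S is non-decreasing, so these entries form one contiguous run.
                while i <= m and S[i] == old:
                    S[i] += 1
                    i += 1

    # Step 6: the last entry S[m] is the longest common character count.
    return S[m]


print(lcs_length("ABCBDAB", "BDCABA"))
```

Only one array of length m + 1 is kept, in contrast to the two-dimensional table that the text attributes to the dynamic programming method.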

[0031] When this flow is executed with, for example, string 1 "ABCBDAB" and string 2 "BDCABA", reading the first character B of string 2 identifies r = 7, 4, 2 from the occurrence table of FIG. 6, and the entries of S[i] are updated as shown in FIG. 7(a). Reading the second character D identifies r = 5, and the entries are updated as shown in FIG. 7(b). Reading the third character C identifies r = 3, and the entries are updated as shown in FIG. 8(a). Reading the fourth character A identifies r = 6, 1, and the entries are updated as shown in FIG. 8(b).

[0032] Reading the fifth character B of string 2 then identifies r = 7, 4, 2 from the occurrence table of FIG. 6, and the entries of S[i] are updated as shown in FIG. 9(a). Reading the sixth character A identifies r = 6, 1, and the entries are updated as shown in FIG. 9(b). In the end, a longest common character count of 4 is obtained. For the convenience of the system, the array S[i] shown in FIGS. 7 to 9 is one element larger than string 1.
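The walkthrough can be reproduced with a small tracing variant of the same update rule. The sketch below records the array S after each character of string 2 has been processed; the name `lcs_trace` is hypothetical, and the snapshots are what this sketch computes, not transcriptions of FIGS. 7 to 9, which are not reproduced here.

```python
def lcs_trace(string1, string2):
    """Return (character, copy of S) after each character of string2."""
    positions = {}
    for pos, ch in enumerate(string1, start=1):
        positions.setdefault(ch, []).insert(0, pos)   # largest position first
    S = [0] * (len(string1) + 1)                      # one extra slot at index 0
    snapshots = []
    for ch in string2:
        for r in positions.get(ch, []):
            if S[r] == S[r - 1]:
                old = S[r]
                i = r
                while i < len(S) and S[i] == old:     # bump the run equal to S[r]
                    S[i] += 1
                    i += 1
        snapshots.append((ch, S.copy()))
    return snapshots


snapshots = lcs_trace("ABCBDAB", "BDCABA")
for ch, state in snapshots:
    print(ch, state)
```

The last element of the final snapshot is 4, in agreement with the count stated in the text.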

[0033] Next, the processing for detecting the longest common subsequences of the two amino acid sequences under test, and their positions, is described with reference to FIGS. 4 and 5. To detect the longest common subsequences of the amino acid sequence expressed as string 1 and the amino acid sequence expressed as string 2, together with their positions, the LCS detection unit 30 first, in step 10 of the flow in FIG. 4, reads string 1 one character at a time and builds a table showing the positions at which each character occurs in string 1, that is, the occurrence table described with reference to FIG. 6. This step also initializes the array S[i] used in the subsequent processing, which has the same size as string 1, by setting every entry to zero, and initializes the array data[k] used in the subsequent processing, which has the same size as the longest common character count, so that no entry points to anything.

[0034] Next, in step 11, one character (the j-th character) is read from string 2, and the occurrence table built in step 10 is consulted to identify the position r at which that character occurs in string 1. Then, in step 12, it is determined whether the r-th entry S[r] of the array S[i] is equal to the (r-1)-th entry S[r-1]. If step 12 determines that S[r] and S[r-1] are equal, the flow proceeds to step 13, where 1 is added to every entry S[i] at position r or later whose value equals that of S[r]. If they are not equal, the flow proceeds to step 17 of the flow in FIG. 5, so that this addition is not performed. If the character read from string 2 in step 11 occurs more than once in string 1, steps 12 and 13 are executed for each occurrence position r in descending order of r.

[0035] The value k of the entry S[r] set in this way is the longest common character count between the prefix of string 1 up to its r-th character and the prefix of string 2 up to its j-th character. Thus, even when detecting the longest common subsequences, the LCS detection unit 30 executes the processing for detecting the longest common character count described for the flow in FIG. 3.

[0036] After step 13, in step 14, the pair (r, j) consisting of the position r in string 1 and the position j in string 2 is stored in the array data[k], where k is the longest common character count given by the resulting entry S[r]. If the array S[i] has not changed from the previous processing cycle, this storing is not performed. The longest common subsequences are obtained by linking these data structures according to the processing described below.

[0037] Moving to the flow in FIG. 5, in step 15 the character positions r', j' stored in data[k-1] are consulted to determine whether r' < r and j' < j hold. If they do, the character positions are not reversed, so the flow proceeds to step 16, where a pointer is set to register the positions r', j' as the next candidate. In the following step 17, it is determined whether the last character of string 2 has been processed; if not, the flow returns to step 11 of the flow in FIG. 4. If step 15 determines that the above relations do not hold, the flow proceeds directly to step 17 without executing step 16.
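Steps 10 to 17 can be sketched in Python as below. This is a tentative reading of the described flow: a dictionary of lists stands in for data[k], each stored node carries a list of predecessor nodes in place of pointers, and the final walk that follows the links back from the highest level is an assumption of the sketch, since that part of the procedure is not spelled out in the text above. The names `lcs_chains` and `walk` are hypothetical.

```python
def lcs_chains(string1, string2):
    """Record (r, j) pairs per level k and link each to candidates in level k-1."""
    positions = {}
    for pos, ch in enumerate(string1, start=1):
        positions.setdefault(ch, []).insert(0, pos)    # step 10: occurrence table
    m = len(string1)
    S = [0] * (m + 1)
    data = {}                                          # k -> list of (r, j, preds)
    for j, ch in enumerate(string2, start=1):          # step 11
        for r in positions.get(ch, []):                # descending order of r
            if S[r] != S[r - 1]:                       # step 12: no change, no store
                continue
            old, i = S[r], r
            while i <= m and S[i] == old:              # step 13
                S[i] += 1
                i += 1
            k = S[r]
            # Steps 15 and 16: keep only candidates that precede (r, j) in both strings.
            preds = [n for n in data.get(k - 1, []) if n[0] < r and n[1] < j]
            data.setdefault(k, []).append((r, j, preds))   # step 14
    return S[m], data


def walk(node):
    """Yield every chain of (r, j) pairs that ends at the given node."""
    r, j, preds = node
    if not preds:
        yield [(r, j)]
    for pred in preds:
        for chain in walk(pred):
            yield chain + [(r, j)]


string1, string2 = "ABCBDAB", "BDCABA"
length, data = lcs_chains(string1, string2)
chains = [c for node in data.get(length, []) for c in walk(node)]
subsequences = ["".join(string1[r - 1] for r, _ in c) for c in chains]
print(length, subsequences)
```

For the example strings, the sketch finds four chains of length 4. Two of them spell "BDAB" through different positions in string 1, which is consistent with the earlier listing of "BDAB" twice among the longest common subsequences.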

[0038] When step 17 determines that processing has finished through the last character of character string 2, the process ends. Consider, as above, the case in which the amino acid sequence of character string 1 is "ABCBDAB" and that of character string 2 is "BDCABA". When the processing flows of FIGS. 4 and 5 are executed, reading the first character (j = 1) of character string 2, B, identifies "r = 7, 4, 2" from the appearance table of FIG. 6. As shown in FIG. 7(a), "r = 7" yields "S[7] = 1", so (7, 1) is stored in data[1]; "r = 4" yields "S[4] = 1", so (4, 1) is stored in data[1]; and "r = 2" yields "S[2] = 1", so (2, 1) is stored in data[1].

[0039] Reading the second character (j = 2) of character string 2, D, identifies "r = 5" from the appearance table of FIG. 6; as shown in FIG. 7(b), this yields "S[5] = 2", so (5, 2) is stored in data[2]. Reading the third character (j = 3), C, identifies "r = 3" from the appearance table; as shown in FIG. 8(a), this yields "S[3] = 2", so (3, 3) is stored in data[2]. Reading the fourth character (j = 4), A, identifies "r = 6, 1" from the appearance table; as shown in FIG. 8(b), "r = 6" yields "S[6] = 3", so (6, 4) is stored in data[3], and "r = 1" yields "S[1] = 1", so (1, 4) is stored in data[1].

[0040] Reading the fifth character (j = 5) of character string 2, B, identifies "r = 7, 4, 2" from the appearance table of FIG. 6; as shown in FIG. 9(a), "r = 7" yields "S[7] = 4", so (7, 5) is stored in data[4]; "r = 4" yields "S[4] = 3", so (4, 5) is stored in data[3]; and "r = 2" yields "S[2] = 2", so (2, 5) is stored in data[2]. Reading the sixth character (j = 6), A, identifies "r = 6, 1" from the appearance table; as shown in FIG. 9(b), "r = 6" yields "S[6] = 4", so (6, 6) is stored in data[4]. As FIG. 9(b) also shows, the array S[i] does not change between "r = 6" and "r = 1", so (1, 6) is not stored.
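The walk-through above can be reproduced with a short Python sketch. The update rule used for S (set S[r] = S[r-1] + 1 for each matching position r, visited in descending order, then take a running maximum) is an assumption inferred from the values in the walk-through, because steps 11 to 13 are described elsewhere. The names `lcs_pairs` and `occ` are hypothetical.

```python
from collections import defaultdict


def lcs_pairs(s1, s2):
    """Return (longest shared character count, data), where data[k] lists (r, j) pairs.

    Positions are 1-based, as in the walk-through.
    """
    m = len(s1)
    # Appearance table: character -> positions in s1, in descending order.
    occ = defaultdict(list)
    for r in range(m, 0, -1):
        occ[s1[r - 1]].append(r)

    S = [0] * (m + 1)          # S[0] is a sentinel for the empty prefix
    data = defaultdict(list)
    for j, ch in enumerate(s2, start=1):
        # Descending r guarantees S[r - 1] still holds the previous cycle's value.
        for r in occ.get(ch, ()):
            k = S[r - 1] + 1
            if k != S[r]:      # store only when the entry actually changes
                S[r] = k
                data[k].append((r, j))
        # Carry the best value forward so S[i] covers the prefix s1[:i].
        for i in range(1, m + 1):
            S[i] = max(S[i], S[i - 1])
    return S[m], dict(data)


length, data = lcs_pairs("ABCBDAB", "BDCABA")
print(length)   # longest shared character count
print(data)     # pairs grouped by k
```

With the two example strings, the stored pairs agree with those listed in the walk-through, including the omission of (1, 6).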

[0041] These items of character position information (each of which indicates the positions of an identical character in character string 1 and character string 2) are stored in data[k] as shown in FIG. 10, with a pointer set between an entry stored in data[k-1] and an entry stored in data[k] whenever the character positions of the two entries are not reversed.

[0042] The longest shared subsequence is identified by following the pointers of the character position information stored in data[k]. In the example of FIG. 10:

The link "(7, 5) of data[4] → (6, 4) of data[3] → (5, 2) of data[2] → (4, 1) of data[1]" identifies the longest shared subsequence BDAB and its positions in character string 1 and character string 2.

The link "(7, 5) of data[4] → (6, 4) of data[3] → (5, 2) of data[2] → (2, 1) of data[1]" identifies the longest shared subsequence BDAB and its positions in character string 1 and character string 2.

The link "(7, 5) of data[4] → (6, 4) of data[3] → (3, 3) of data[2] → (2, 1) of data[1]" identifies the longest shared subsequence BCAB and its positions in character string 1 and character string 2.

The link "(6, 6) of data[4] → (4, 5) of data[3] → (3, 3) of data[2] → (2, 1) of data[1]" identifies the longest shared subsequence BCBA and its positions in character string 1 and character string 2.
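The linking rule (r' < r and j' < j between data[k-1] and data[k]) can be illustrated with a minimal sketch that enumerates every chain from the top level down to data[1]. The `data` dictionary is copied from the example values in the text. The function name `trace_chains` and the recursive enumeration are assumptions made for illustration, not the flow of FIGS. 11 and 12.

```python
def trace_chains(data, s1):
    """Yield (subsequence, chain) for every valid link chain.

    A chain is a list of 1-based (r, j) pairs in ascending order.
    """
    top = max(data)

    def extend(k, r, j):
        if k == 1:
            yield [(r, j)]
            return
        for rp, jp in data[k - 1]:
            # Link only when the character positions are not reversed.
            if rp < r and jp < j:
                for chain in extend(k - 1, rp, jp):
                    yield chain + [(r, j)]

    for r, j in data[top]:
        for chain in extend(top, r, j):
            yield "".join(s1[p - 1] for p, _ in chain), chain


# Example values taken from the text (string 1 = "ABCBDAB", string 2 = "BDCABA").
data = {
    1: [(7, 1), (4, 1), (2, 1), (1, 4)],
    2: [(5, 2), (3, 3), (2, 5)],
    3: [(6, 4), (4, 5)],
    4: [(7, 5), (6, 6)],
}
for subsequence, chain in trace_chains(data, "ABCBDAB"):
    print(subsequence, chain)
```

For the example data this produces four chains, yielding BDAB twice (at two different positions), BCAB, and BCBA.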

[0043] FIGS. 11 and 12 show the processing flow that the LCS detection unit 30 executes when it identifies the longest shared subsequence by following these links. The following paragraphs describe the processing that each processing unit of the genetic information testing apparatus 1 shown in FIG. 2 performs on receiving the longest shared character count, the longest shared subsequence, and its positions, as detected by the LCS detection unit 30.

[0044] When the LCS detection unit 30 detects the longest shared character count between the character string of the amino acid sequence entered from the input device 40 (hereinafter, the input amino acid sequence) and the character string of an amino acid sequence supplied from the amino acid sequence database 50 or the motif database 60, the homology determination unit 31 computes the ratio of that longest shared character count to the character string length of the input amino acid sequence. If the ratio is greater than a prescribed reference value, the unit determines that the input amino acid sequence is homologous to the amino acid sequence supplied from the amino acid sequence database 50 or the motif database 60; if the ratio is smaller than the reference value, it determines that the two are not homologous.
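The determination rule amounts to one division and one comparison, as the following minimal sketch shows. The function name `is_homologous` and the threshold values are hypothetical; the text does not give a concrete reference value, nor does it say how a ratio exactly equal to the reference value is treated (it is treated as not homologous here).

```python
def is_homologous(lcs_length, input_length, reference_value):
    """Compare (longest shared character count / input length) with a reference value."""
    if input_length <= 0:
        raise ValueError("input sequence must not be empty")
    ratio = lcs_length / input_length
    return ratio > reference_value


# "ABCBDAB" (length 7) shares 4 characters with "BDCABA": ratio 4/7.
print(is_homologous(4, 7, 0.5))
print(is_homologous(4, 7, 0.6))
```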

[0045] The homology search unit 32 uses the determination results of the homology determination unit 31 to search the amino acid sequence database 50 for amino acid sequences homologous to the input amino acid sequence. When a sequence is homologous, the ratio calculated by the homology determination unit 31 and the longest shared subsequence detected by the LCS detection unit 30 are displayed on the display device 42 via the display unit 35.

[0046] FIG. 13 shows an example of this display. It shows the result of processing two amino acid sequences, human cytochrome c and bacterial cytochrome c, and displays the longest shared subsequence in a form that indicates at what character spacing it is located in each of the two amino acid sequences. Specifically, the display takes the form "GD{x3,3}G{x0,1}K{x0,2}...". This indicates that in human cytochrome c, "GD" is followed by three non-matching characters, then "G", then immediately "K", whereas in bacterial cytochrome c, "GD" is followed by three non-matching characters, then "G", then one non-matching character, then "K".
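A display string of this kind can be derived from the matched positions, as in the following sketch. It assumes that the first number in each brace counts the non-matching characters in the first sequence and the second number those in the second sequence, and that the brace is omitted when both counts are zero (as "GD" suggests). The function name `format_spacing` is hypothetical.

```python
def format_spacing(s1, chain):
    """Render a shared subsequence with {xA,B} gap markers.

    chain: ascending list of 1-based (r, j) matched positions in s1 and s2.
    A and B are the numbers of unmatched characters between consecutive
    matches in the first and second sequence respectively.
    """
    parts = []
    prev = None
    for r, j in chain:
        if prev is not None:
            gap1 = r - prev[0] - 1
            gap2 = j - prev[1] - 1
            if gap1 or gap2:
                parts.append("{x%d,%d}" % (gap1, gap2))
        parts.append(s1[r - 1])
        prev = (r, j)
    return "".join(parts)


# Chain for BCAB in "ABCBDAB" / "BDCABA", taken from the earlier example.
print(format_spacing("ABCBDAB", [(2, 1), (3, 3), (6, 4), (7, 5)]))
```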

[0047] The motif search unit 33 first uses the determination results of the homology determination unit 31 to search the motif database 60 for motif sequences homologous to the input amino acid sequence. It then determines, from the longest shared subsequence detected by the LCS detection unit 30 and the character string lengths between its sequence positions, whether each homologous motif sequence is a genuine motif sequence. Take the leucine zipper motif, in which "L" is followed by six unspecified characters and then another "L", for a total of five "L"s: the unit uses the detection results of the LCS detection unit 30 to check whether the input amino acid sequence contains this motif, including the six unspecified characters. When the input amino acid sequence has a motif sequence, the motif search unit 33 displays the input amino acid sequence and the motif sequence on the display device 42 via the display unit 35. FIG. 14 shows a display example for a rat egg cell potassium channel having a leucine zipper.
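The spacing check for the leucine zipper can be sketched as a test on matched positions: five "L"s, each separated from the next by six other characters. The function name `has_zipper_spacing`, its parameters, and the synthetic test sequence are hypothetical; the sequence is not a real protein. This is only an illustration of the spacing criterion, not the unit's actual procedure.

```python
def has_zipper_spacing(positions, repeats=5, spacer=6):
    """Check that matched 'L' positions are exactly spacer + 1 apart.

    positions: ascending 1-based positions of the matched 'L' characters
    in the input sequence.
    """
    if len(positions) != repeats:
        return False
    return all(b - a == spacer + 1 for a, b in zip(positions, positions[1:]))


# Synthetic sequence: an 'L' every seventh character, five times.
seq = "MK" + "LAAAAAA" * 4 + "L" + "GG"
l_positions = [i + 1 for i, ch in enumerate(seq) if ch == "L"]
print(l_positions, has_zipper_spacing(l_positions))
```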

[0048] The alignment unit 34 receives the longest shared subsequence detected by the LCS detection unit 30 together with its positions, aligns the input amino acid sequence with the amino acid sequence supplied from the amino acid sequence database 50 or the motif database 60 so that the longest shared subsequence in the two sequences lines up, and displays the result on the display device 42 via the display unit 35. FIG. 15 shows an example of this display. It shows the result of processing two amino acid sequences, human cytochrome c and bacterial cytochrome c; the alignment is performed by inserting blanks corresponding to the character string lengths between sequence positions.
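One way to realize such blank insertion is sketched below: the unmatched stretch before each matched pair is padded in the shorter sequence so that matched characters fall in the same column. The padding scheme, the function name `align`, and the `pad` parameter are assumptions; the layout of FIG. 15 may differ.

```python
def align(s1, s2, chain, pad=" "):
    """Pad two strings so that the matched positions in chain share columns.

    chain: ascending list of 1-based (r, j) matched positions.
    """
    row1, row2 = [], []
    pr = pj = 0
    # A sentinel past both ends flushes the trailing unmatched characters.
    for r, j in list(chain) + [(len(s1) + 1, len(s2) + 1)]:
        seg1 = s1[pr:r - 1]
        seg2 = s2[pj:j - 1]
        width = max(len(seg1), len(seg2))
        row1.append(seg1.ljust(width, pad))
        row2.append(seg2.ljust(width, pad))
        if r <= len(s1):
            row1.append(s1[r - 1])
            row2.append(s2[j - 1])
        pr, pj = r, j
    return "".join(row1), "".join(row2)


top, bottom = align("ABCBDAB", "BDCABA", [(4, 1), (5, 2), (6, 4), (7, 5)], pad="-")
print(top)
print(bottom)
```

Here "-" is used as the padding character only to make the inserted blanks visible.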

[0049]

[Effects of the Invention] As described above, the present invention detects the longest shared character count or the longest shared subsequence between a test amino acid sequence represented as a character string and a comparison amino acid sequence represented as a character string, and evaluates the similarity between the two sequences according to that count or subsequence. Compared with methods that compare amino acid sequences by dynamic programming, it can therefore evaluate the similarity of two amino acid sequences at high speed and with a small memory capacity.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 shows an embodiment of the present invention.

FIG. 3 shows an embodiment of the processing flow executed by the LCS detection unit.

FIG. 4 shows an embodiment of the processing flow executed by the LCS detection unit.

FIG. 5 shows an embodiment of the processing flow executed by the LCS detection unit.

FIG. 6 is an explanatory diagram of the appearance table created by the LCS detection unit.

FIG. 7 is an explanatory diagram of the update processing of the array S[i].

FIG. 8 is an explanatory diagram of the update processing of the array S[i].

FIG. 9 is an explanatory diagram of the update processing of the array S[i].

FIG. 10 is an explanatory diagram of the data structure created by the LCS detection unit.

FIG. 11 shows an embodiment of the processing flow executed by the LCS detection unit.

FIG. 12 shows an embodiment of the processing flow executed by the LCS detection unit.

FIG. 13 shows an embodiment of the display form of processing results.

FIG. 14 shows an embodiment of the display form of processing results.

FIG. 15 shows an embodiment of the display form of processing results.

[Explanation of symbols]

1: genetic information testing apparatus
2: output device
10: detection means
11: calculation means
12: output control means
13: evaluation means
20: detection means
21: specifying means
22: output control means
23: evaluation means

Claims

[Claims]

1. A genetic information testing apparatus for evaluating the similarity between a test amino acid sequence and a comparison amino acid sequence, in which each amino acid is represented by a character, the apparatus comprising: detection means (10) for detecting the longest shared character count between the test amino acid sequence represented as a character string and the comparison amino acid sequence represented as a character string; and calculation means (11) for calculating the ratio of the longest shared character count detected by the detection means (10) to the character string length of the test amino acid sequence or of the comparison amino acid sequence.

2. A genetic information testing apparatus for evaluating the similarity between a test amino acid sequence and a comparison amino acid sequence, in which each amino acid is represented by a character, the apparatus comprising detection means (20) for detecting the longest shared subsequence of the test amino acid sequence represented as a character string and the comparison amino acid sequence represented as a character string.

3. The genetic information testing apparatus according to claim 2, further comprising specifying means (21) for specifying the sequence positions of the longest shared subsequence within the test amino acid sequence and the comparison amino acid sequence.

4. The genetic information testing apparatus according to claim 3, configured to output the two amino acid sequences to an output device and, at the time of this output, to align the two amino acid sequences according to the sequence positions specified by the specifying means (21) so that their longest shared subsequences correspond to each other.

5. The genetic information testing apparatus according to claim 3, configured to output the longest shared subsequence of the two amino acid sequences to an output device and, at the time of this output, to output the character string lengths between sequence positions, as determined from the sequence positions specified by the specifying means (21), in association with the longest shared subsequence.