JP4218874B2

JP4218874B2 - Evaluation method of data string composed of characters, etc., program for executing the evaluation method, and execution result of the evaluation method

Info

Publication number: JP4218874B2
Application number: JP2003026082A
Authority: JP
Inventors: 邦彰 ▲高▼島
Original assignee: 高島邦彰
Priority date: 2002-08-29
Filing date: 2003-02-03
Publication date: 2009-02-04
Anticipated expiration: 2023-02-03
Also published as: JP2004145845A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の数値，記号，文字又は図形若しくはこれらの組み合わせからなる複数のデータ列について、これらデータ列がどの程度類似しているか、また、どの程度共通している部分があるのかを客観的に評価するデータ列の文字等からなるデータ列の評価装置及びデータ列評価装置用のコンピュータプログラムに関する。
【０００２】
【従来の技術】
文字や、数字等のデータからなるデータ列の類似性を判断する方法として、従来から種々の方法が提案されている（例えば、特許文献１及び特許文献２参照）。
【０００３】
［特許文献１］
特開平６−１３９２２４号公報
［特許文献２］
特開２０００２−２０２９８３号公報
【０００４】
しかし、上記したような従来の手法を用いても、一つの基準データ列に対して、比較の対象となる複数の比較データ列がある場合、前記複数の比較データ列が基準データ列に対してどの程度近いのか、ランク付けすることが困難な場合がある。
【０００５】
例えば、五つの文字「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」からなる基準データ列に対して、三つの比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」、「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」及び「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」があった場合、「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」が基準データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」に最も近いということは容易に判断できても、「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」及び「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」のいずれが基準データ列により近いのか、客観的な評価を行うことは困難であった。
さらに、基準データ列や比較データ列を構成する文字等のデータの数が増えると、処理しなければならない情報も大幅に増える。例えば、ＡＧＴＣの四種類のデータからなるＤＮＡの塩基配列やたんぱく質を構成する２０種類のアミノ酸配列等の長い文字等配列を解析するには、超高速の超大型コンピュータを使用しなければ処理できないというのが現状である。
【０００６】
上記のようなＤＮＡの塩基配列やアミノ酸配列を解析する方法やプログラム等として、特定モチーフ配列を検索・抽出するシステムに関する技術（特許文献３参照）や、ＳＮＰ（一塩基多型）検出に関する方法（特許文献４参照）等、種々のものが提案されている。
【０００７】
［特許文献３］
特開２０００−０６０５５３号公報
［特許文献４］
特開２００２−０６３１７５号公報
【０００８】
しかし、特許文献３及び特許文献４に記載の方法やプログラム、システムによっても、解析には多大な時間と労力を要するという問題がある。また、ＤＮＡの塩基配列やアミノ酸配列等の長い文字記号配列の中から共通部分を見つけだすことは極めて困難である。例えば、数千の文字等配列からなるヒトのアミノ酸配列と、比較対照となるサルやセン虫、マウスその他の動物とのアミノ酸配列がどの程度共通していて、どの部分が共通しているのかは、比較する文字等配列を研究者が目視で行っているのが現状で、複数の比較対象について共通部分を探索する作業は、膨大な時間と労力を要するという問題がある。
【０００９】
【発明が解決しようとする課題】
この発明は上記の問題点にかんがみてなされたもので、簡単な方法でデータ列の類似度を客観的に評価することができ、文字等のデータの数が増えても、コンピュータの処理能力に過度の負担をかけることのない文字等からなるデータ列の評価装置及びデータ列評価装置用のコンピュータプログラムを提供すること、複数のデータ列の共通部分を簡単に検索することが可能な文字等からなるデータ列の評価装置及びデータ列評価装置用のコンピュータプログラムの提供を目的とする。
【００１０】
【課題を解決するための手段】
上記の目的を達成するために、本発明の発明者は、比較するデータ列のそれぞれを、二つの文字等からなるデータの組に分解し、各組について、文字等の並び順が一致しているか否かを判断することで、簡単なプログラムで、類似性（共通性）の判断を容易に行うことができることを見出した。
すなわち、請求項１に記載の発明は、複数の文字，数値，記号又は図形のデータからなる基準データ列に対し、同様のデータからなる比較データが、どの程度類似しているかを評価する評価装置において、複数の文字，数値，記号又は図形のデータからなる基準データ列若しくはこれらのデータを組み合わせてなる基準データ列の入力を受け付けて記憶部に格納するとともに、前記比較データ列の入力を受け付けて記憶部に格納する入力部と、前記記憶部に格納された前記基準データ列及び前記比較データ列の各々から、並び順を変えずに、二つのデータからなる組を形成して前記記憶部に格納し、前記記憶部に格納された前記比較データ列の組における前記データの並び順と、前記記憶部に格納された前記基準データにおける前記組を構成するデータが同じデータである組のデータの並び順とを比較する比較部と、比較の結果、前記並び順が一致したときに第一の評価点数を付け、前記並び順が一致しないときには、第二の評価点数を付け、前記比較データ列の中から抽出された全ての組について付された前記第一及び第二の評価点数を合算して、得られた総合点数を前記記憶部に格納し、前記総合点数に基づいて評価を行う演算部とを有する構成としてある。
【００１１】
また、請求項２に記載の発明は、前記第一の評価点数が１点で、前記第二の評価点数が０点とした構成としてある。
類似性の評価は、比較するデータ列にどの程度共通部分が含まれているかで判断することができる。そして、共通部分の割合を示す一定の計算式を予め定義し、この計算式と一致する組の数又は不一致とされた組の数を代入することで、類似性を判断することができる。すなわち、請求項３に記載の発明は、複数の文字，数値，記号又は図形のデータからなる基準データ列に対し、同様のデータからなる比較データが、どの程度類似しているかを評価する評価装置において、複数の文字，数値，記号又は図形のデータからなる基準データ列若しくはこれらのデータを組み合わせてなる基準データ列の入力を受け付けて記憶部に格納するとともに、前記比較データ列の入力を受け付けて記憶部に格納する入力部と、前記記憶部に格納された前記基準データ列及び比較データ列から、二つのデータからなる組を形成して前記記憶部に格納し、前記記憶部に格納された前記比較データ列の組における前記データの並び順と、前記記憶部に格納された前記基準データにおける前記組を構成するデータが同じデータである組のデータの並び順とを比較する比較部と、前記比較データ列の組における前記並び順が一致した組の数又は不一致となった組の数を求め、全ての組の数に対する一致する組又は不一致となった組の数の割合から前記比較データ列の評価値を求め、前記評価値を前記記憶部に格納する演算部とを有する構成としてある。
【００１２】
なお、本発明においては、類似性だけでなく共通性の判断も簡単に行うことができる。すなわち、請求項４に記載の発明は、請求項１〜３のいずれかに記載のデータ列の評価装置において、前記入力部から入力された複数のデータ列の各々の中から、並び順を変えずに、二つのデータからなる組を形成して前記記憶部に格納し、前記記憶部に格納された前記組の中から、前記複数のデータ列において前記データの並び順が一致する組を抽出し、前記一致する組を、各々の前記データ列について、前記データ列の中の位置情報とともに記憶部に格納させ、各データ列について、前記並び順及び前記位置情報が一致する組を、前記記憶部に格納された情報に基づいて再現する演算部を有する構成としてある。なお、本発明においては、請求項５に記載するように、前記二つのデータが離間している場合にも適用が可能である。この場合には、データ間の距離を並び順の条件に加えて、比較を行う。
【００１３】
また、本発明は遺伝子の塩基配列やたんぱく質を構成するアミノ酸配列のような膨大なデータ列に対しても有効に応用が可能である。請求項６に記載の発明は、前記基準データ列及び前記比較データ列を構成するデータが、Ａ（アデニン）、Ｇ（グアニン）、Ｔ（チミン）及びＣ（シトシン）の四種類の文字データからなる遺伝子の塩基配列である構成としてある。また、請求項７に記載の発明は、前記基準データ列及び前記比較データ列を構成するデータがたんぱく質を構成する２０種のアミノ酸配列に係る文字等である構成としてある。
【００１４】
本発明のデータ列評価装置用のコンピュータプログラムは、ＣＤやＦＤ、ＭＯ、ＤＶＤ等の磁気記録媒体又は光記録媒体に記録して提供することが可能である。また、無線又は有線の通信回線を介して提供することも可能である。そして、本発明のプログラムを読み取り装置等を介してパソコンや専用の解析装置のメモリに読み込み、ＣＰＵで当該プログラムを実行させることで、本発明の評価装置の動作が可能になる。
【００１５】
【発明の実施の形態】
以下、本発明の好適な実施形態を、図面にしたがって詳細に説明する。
［第一の実施形態］
図１は、本発明の評価方法を実行する装置の構成を示すブロック図、図２は、本発明の評価方法のフローチャート、図３〜図７は本発明の評価方法による具体的な評価例を示す図である。
まず、図１及び図２を参照しながら、本発明の評価方法による評価の手順を説明する。
【００１６】
図１に示すように、本発明の評価方法を実行するための評価装置１は、キーボードやマウス等の入力部１１と、この入力部１１からの入力や、プリンタやディスプレイ等の表示部１５への出力の処理を行う入出力処理部１２と、入出力処理部１２で処理された内容を記憶する記憶部１４と、入出力処理部１２の指令に基づいて所定の比較や演算を行う比較・演算部１３とを備えている。前記した記憶部１４は、ＲＡＭ等のメモリやハードディスク，フレキシブルディスク（ＦＤ）の他、入出力処理部１２の指令によって所定の内容を読み書き可能な光ディスク等の媒体を用いることができる。また、記憶部１４は、評価を行うにあたっての基準となるデータの列（基準データ列）を記憶する第一の記憶領域１４ａと、前記基準データ列との比較の対象となるデータの列（比較データ列）を記憶する第二の記憶領域１４ｂと、評価の結果を点数で記憶する第三の記憶領域１４ｃとを有している。
【００１７】
図２に示すように、所定の評価を行うにあたり、まず、基準となるデータの列（基準データ列）を入力部１１から入出力処理部１２に入力する（ステップＳ１）。入力された基準データ列は、入出力処理部１２によって、記憶部１４内の第一の記憶領域１４ａに格納される。
なお、以下の説明では、この基準データ列を「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の五つの文字データからなるデータ列であるとして説明する。
【００１８】
次いで、比較データ列の個数ｎを入力部１１から入出力処理部１２に入力し（ステップＳ２）、比較データ列を入力部１１から入出力処理部１２に入力する（ステップＳ３）。比較データ列の個数ｎ及び各比較データ列は、入出力処理部１２によって、記憶部１４内の第二の記憶領域１４ｂに格納される。
なお、以下の説明では、比較の対象となる比較データ列を、「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」（ｋ＝１）、「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」（ｋ＝２）、「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」（ｋ＝３）、「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」（ｋ＝４）、「Ｅ，Ｄ，Ｃ，Ｂ，Ａ」（ｋ＝５）の五つ（ｎ＝５）として説明する。
【００１９】
この後、入力部１１を介して「実行」指令を行うことで、評価が開始される。「実行」の入力とともに、まず、第一の記憶領域１４ａに格納されている基準データ列が、入出力処理部１２によって読み出され、比較・演算部１３に送られる。次いで、第二の記憶領域１４ｂに格納されている先頭（ｋ＝１，この番号の定義はステップＳ４で行われる）の比較データ列が入出力処理部１２によって読み出され、比較・演算部１３に送られる。
【００２０】
比較・演算部１３では、入出力処理部１２から送られてきたｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の中から、二つのデータの組を抽出する。このとき、各データ「Ａ」，「Ｂ」，「Ｃ」，「Ｄ」，「Ｅ」の順序は変更しない（ステップＳ５）。
すなわち、比較・演算部１３は、図３に示すように、「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の比較データ列の中から、「Ａ，Ｂ」、「Ａ，Ｃ」・・・「Ｃ，Ｅ」、「Ｄ，Ｅ」なる１０種類のデータの組を抽出するわけである。
なお、基準データ列の中からも、予め、比較データ列の１０種類の組に対応する１０種類のデータの組（二つのデータからなる組）を抽出しておく（ステップＳ１６）。
【００２１】
そして、この１０種類の組におけるデータの順序と、各データの組に含まれるデータと同じデータの組について、基準データ列のデータの順序と比較する（ステップＳ６）。
例えば、比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」から抽出した「Ａ，Ｃ」の組と、基準データ列内のデータ「Ａ」及びデータ「Ｃ」との順序（配置関係）を比較し、両者が一致しているかどうかを検討する。この場合は、ともに、データ「Ａ」及びデータ「Ｃ」の順序が、「Ａ，Ｃ」となっているので、順序が一致していると判断して（ステップＳ７）、比較・演算部１３は、この組（「Ａ，Ｃ」の組）に対して「１」を点数として付与する（ステップＳ８）。
【００２２】
このようにして、１０種類の全ての組について評価を行い（ステップＳ１０）、全ての組について評価が完了したときには、ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」について、全ての組の評価の点数を合計する。
ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」は、基準データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」と完全に一致しているので、その評価点数の合計は、図３に示すように満点（１０点）となる。この評価点数は、ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」として、入出力処理部１２によって第三の記憶領域１４ｃに格納される。
【００２３】
ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の評価を終えた後、ｋのカウントを一つだけアップして（ステップＳ１２，Ｓ１４）、次のｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」について評価を行う。
このｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」についても、各データ「Ａ」，「Ｃ」，「Ｂ」，「Ｄ」，「Ｅ」の順序を変えることなく、１０種類のデータの組を作成する（ステップＳ５）。
そして先と同様の手順で比較を行い（ステップＳ６）、基準データ列のデータの順序と一致するかどうかの判断を行う（ステップＳ７）。
【００２４】
このｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」においては、基準データ列に対して、「Ｃ」と「Ｂ」のデータが入れ替わっているので、「Ｃ，Ｂ」の組のデータの順序は、基準データ列の中の「Ｂ」，「Ｃ」のデータの順序とは異なるものである。そこで、比較・演算部１３は、図４に示すように、「Ｃ，Ｂ」の組については、「０」を点数として付与する（ステップＳ９）。
ｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」は、基準データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」に対して、「Ｃ」，「Ｂ」のデータの順序が入れ替わっただけであるので、その評価点数の合計は、図４に示すように９点となる。この評価点数は、ｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」として、入出力処理部１２によって第三の記憶領域１４ｃに格納される。
【００２５】
そして、ｋ＝３、ｋ＝４、ｋ＝５の比較データ列に対してもステップＳ５〜ステップＳ１０の手順に従って同様に評価を行い、各比較データ列ごとに評価点数の合計を求め（ステップＳ１１）、その結果を第三の記憶領域１４ｃに格納する。
図５〜図７に示すように、ｋ＝３の比較データ列「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」では評価点数は６点、ｋ＝４の比較データ列「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」の評価点数は５点、基準データ列のデータの配置と全く反対のｋ＝５の比較データ列「Ｅ，Ｄ，Ｃ，Ｂ，Ａ」の評価点数は０点となる。
【００２６】
このようにして、五つの比較データ列の全てについて評価を終えた後（ステップＳ１２）、ｋ＝１〜５の比較データ列に対して評価点数の比較を行う（ステップＳ１３）。
すなわち、ｋ＝１〜５の比較データ列の評価点数を、比較・演算部１３は、第三の記憶領域１４ｃから読み出して比較し、点数の高いものから順にランク付けを行う。
【００２７】
上記したｋ＝１〜５の比較データ列では、ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の評価点数が最も高く、「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」（ｋ＝２）、「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」（ｋ＝３）、「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」（ｋ＝４）、「Ｅ，Ｄ，Ｃ，Ｂ，Ａ」（ｋ＝５）の順で評価点数が低くなる。これは、すなわち、基準データ列に対する比較データ列の類似性（接近性）を表していることに他ならない。
【００２８】
本発明では、従来の手法では困難であった客観的な評価が可能になるわけである。例えば、ｋ＝３の比較データ列「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」と、ｋ＝４の比較データ列「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」とは従来、どちらが基準データ列に近いのか判断が困難であったが、各比較データ列を構成するデータの距離という観点からその類似性を判断すると、ｋ＝３の比較データ列の方が基準データ列により近い、ということを客観的に判断することが可能になる。
【００２９】
［第二の実施形態］
上記の実施形態では、データの並び順が同一か否かで、評価点数１点又は０点を付与している。
この実施形態では、「類似性」は、比較する文字等のデータ列の中における共通部分の割合であるととらえ、類似性の評価を、以下の式で求めるものとする。
なお、以下の式では、
(i) 基準データ列のデータ数＝ｎ，基準データ列のデータの組の数＝ａ
ここで、組の数ａはａ＝ｎ（ｎ−１）／２で表される。
(ii) 比較データ列のデータ数＝ｍ，比較データ列のデータの組の数＝ｂ
ここで、組の数ｂはｂ＝ｍ（ｍ−１）／２で表される。
(iii) 基準データ列と比較データ列との共通する組の数＝ｃ
【００３０】
類似性評価値＝（ｃ×２）／（ａ＋ｂ）
【００３１】
なお、類似性評価値をパーセント（％）で表す場合は、上記の式の右辺に１００を掛ければよい。
図８は、基準データ列と比較データ列のデータ数が同じ場合（双方とも５つ、すなわち、ｎ＝ｍ＝５）の類似性評価の結果を示すものである。
第一の実施形態と同様に、基準データ列及び比較データ列（比較データ列１，２，３）を二つのデータの組に分類する。図８に示す例では、基準データ列及び比較データ列ともに、１０個の組に分類することができる（すなわち、ａ＝ｂ＝１０）。
【００３２】
ここで、第一の実施形態と同様に、各組におけるデータの並び順が、基準データの組のデータの並び順と一致するかどうかを判断する。
図８に示すように、比較データ列１では、基準データ列と一致する組の数が８個（ｃ＝８）あるので、各代数を上記の式に代入して、
類似性評価値＝１６／２０（８０％）となる。
同様に、比較データ列２では、類似性評価値＝１２／２０（６０％）、比較データ列３では、並び順の共通する組が存在せず、類似性評価値＝０／２０（０％）となる。
【００３３】
［第三の実施形態］
第二の実施形態の類似性の評価は、基準データ列のデータの数と、比較データ列のデータの数とが異なる場合にも適用が可能で、このような場合にも類似性の評価を客観的に行うことができるという利点がある。
以下の第三の実施形態では、基準データ列「ＡＢＣＤＥ」と比較データ列１「ＡＣＢＥ」、比較データ列２「ＢＣＤ」及び比較データ列３「ＥＤＣ」との類似性評価を行うものとする。
この第三の実施形態の類似性評価の結果を、図９の表に示す。
比較データ列１では、データ数ｍ＝４、データの組の数ｂ＝６で、基準データ列と共通するデータの組の数ｃ＝５であるから、これを上記の式に代入して、
類似性評価値＝１０／１６（６２．５％）となる。同様に、比較データ２については、類似性評価値＝６／１３（４６．２％）、比較データ３については、類似性評価値＝０となる。
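The similarity value of the second and third embodiments can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and it assumes each character occurs at most once per data string, as in the ＡＢＣＤＥ examples. The symbols a, b, and c follow the definitions given with the formula.

```python
from itertools import combinations

def ordered_pairs(seq):
    """All two-item sets (i < j), keeping the original order of the items."""
    return list(combinations(seq, 2))

def similarity(reference, comparison):
    """Similarity value (c x 2) / (a + b) from the description.

    Assumption: characters are not repeated within a string, so a set
    lookup is enough to decide whether a pair is shared.
    """
    ref_pairs = set(ordered_pairs(reference))
    cmp_pairs = ordered_pairs(comparison)
    a = len(reference) * (len(reference) - 1) // 2   # sets in the reference string
    b = len(cmp_pairs)                               # sets in the comparison string
    c = sum(1 for pair in cmp_pairs if pair in ref_pairs)  # shared sets
    return 2 * c / (a + b)

for comparison in ("ACBE", "BCD", "EDC"):
    print(comparison, f"{similarity('ABCDE', comparison):.1%}")
```

With the reference string ＡＢＣＤＥ this reproduces the three values stated for the third embodiment (10/16, 6/13, and 0).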
【００３４】
［第四の実施形態］
上記第一〜第三の実施形態では、データの組の並び順にのみ基づいて類似性の評価を行っているが、この第四の実施形態では、各組のデータ間の距離も考慮して、類似性の評価を行うようにしている。
図１０に示す類似性評価の例では、基準データ列「ＡＢＢＣＣ」に対し、三つの比較データ列、すなわち、比較データ列１「ＢＢＣＣＡ」、比較データ列２「ＣＡＢＢＣ」及び比較データ列３「ＡＢＣＣＢ」についての評価を行っている。
【００３５】
そして、これら基準データ列及び比較データ列１，２，３の中からデータの組を取り出す場合において、各組のデータの距離をスペース‘＿’で示している。例えば、基準データ列「ＡＢＢＣＣ」では、先頭の「Ａ」と四番目の「Ｃ」とのデータの組を、「Ａ＿＿Ｃ」で表している。
そして、各比較データ列１，２，３において、基準データ列の組と二つのデータの並び順と距離が一致する組の数を数え、これを、上記の式に代入して類似性評価を行っている。
【００３６】
本発明は、基準データ列及び比較データ列を構成するデータの数が多くなり同一文字・記号が繰り返し発生する配列の場合でも、当該比較データ列がどの程度基準データ列に近いのか、客観的かつ正確に判断を行うことができるので、きわめてその適用分野が広い。又、例えば、複数の生物の遺伝子のＤＮＡの文字配列（比較データ列）をヒトのＤＮＡの文字配列（基準データ列）と比較したり、複数の生物のある細胞を構成するたんぱく質のアミノ酸配列とヒトの細胞のたんぱく質を構成するアミノ酸配列を比較し、どの生物がどの程度ヒトに近いのかを客観的かつ容易に判断することに応用出来る。
なお、この実施形態においても、図１１に示すように、基準データ列と比較データ列との数が異なる場合にも適用が可能である。
図１１に示す例では、基準データ列のデータの数よりも比較データ列のデータ数が多い場合を示しているが、上記と同様の手法により、類似性の評価を行うことができる。
【００３７】
［第五の実施形態］
次に、本発明の評価方法を遺伝子配列に応用した第五の実施形態について、図１、図２及び図１２〜図１６を参照しながら説明する。
【００３８】
図２に示すように、所定の評価を行うにあたり、まず、基準となるデータの列（基準データ列）を入力部１１から入出力処理部１２に入力する（ステップＳ１）。入力された基準データ列は、入出力処理部１２によって、記憶部１４内の第一の記憶領域１４ａに格納される。
なお、以下の説明では、この基準データ列を「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の五つの文字データからなるデータ列であるとして説明する。
【００３９】
次いで、比較データ列の個数ｎを入力部１１から入出力処理部１２に入力し（ステップＳ２）、比較データ列を入力部１１から入出力処理部１２に入力する（ステップＳ３）。比較データ列の個数ｎ及び各比較データ列は、入出力処理部１２によって、記憶部１４内の第二の記憶領域１４ｂに格納される。
なお、以下の説明では、比較の対象となる比較データ列を、図１２〜図１６に示すように、「Ｃ，Ｔ，Ｃ，Ｇ，Ａ」（ｋ＝１）、「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」（ｋ＝２）、「Ｔ，Ｃ，Ｇ，Ａ，Ｃ」（ｋ＝３）、「Ａ，Ｇ，Ｃ，Ｔ，Ｃ」（ｋ＝４）、「Ａ，Ｇ，Ｔ，Ａ，Ｃ」（ｋ＝５）の五つ（ｎ＝５）として説明する。
【００４０】
この後、入力部１１を介して「実行」指令を行うことで、評価が開始される。「実行」の入力とともに、まず、第一の記憶領域１４ａに格納されている基準データ列が、入出力処理部１２によって読み出され、比較・演算部１３に送られる。次いで、第二の記憶領域１４ｂに格納されている先頭（ｋ＝１，この番号の定義はステップＳ４で行われる）の比較データ列が入出力処理部１２によって読み出され、比較・演算部１３に送られる。
【００４１】
比較・演算部１３では、入出力処理部１２から送られてきたｋ＝１の比較データ列「Ｃ，Ｔ，Ｃ，Ｇ，Ａ」の中から、二つのデータの組を抽出する。このとき、各データ「Ｃ」，「Ｔ」，「Ｃ」，「Ｇ」，「Ａ」の順序は変更せず（ステップＳ５）、かつ、二つのデータの間の距離も考慮している。
すなわち、比較・演算部１３は、図１２に示す「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の比較データ列の中から、「ＣＴ」、「Ｃ＿Ｃ」、「Ｃ＿＿Ｇ」、「Ｃ＿＿＿Ａ」「ＴＣ」、「Ｔ＿Ｇ」、…「ＧＡ」なる１０種類のデータの組を抽出するわけである。‘＿’は空白を意味し、２つの文字の距離を示す。
【００４２】
なお、この空白にはどのような文字等が挿入されていてもよいものとし、各データの間の空白に入る文字等の数、すなわちデータ間の距離を数字で表して、それぞれを、「Ｃ０Ｔ」、「Ｃ１Ｃ」、「Ｃ２Ｇ」、「Ｃ３Ａ」「Ｔ０Ｃ」、「Ｔ１Ｇ」、…「Ｇ０Ａ」のように表現することも可能である。
先の実施形態と同様に、基準データ列の中からも、予め、比較データ列の１０種類の組に対応する１０種類のデータの組（二つのデータからなる組）を抽出しておく（ステップＳ１６）。
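The distance-annotated notation described above (「Ｃ０Ｔ」, 「Ｃ１Ｃ」, ...) could be generated as in the following sketch. The function name `gapped_pairs` is a hypothetical helper; the gap is taken to be the number of characters lying between the two members of a set.

```python
from itertools import combinations

def gapped_pairs(seq):
    """Encode each ordered pair as '<first><gap><second>'.

    gap = number of items between the two members (0 for neighbours).
    """
    return [f"{seq[i]}{j - i - 1}{seq[j]}"
            for i, j in combinations(range(len(seq)), 2)]

codes = gapped_pairs("CTCGA")
print(codes)
```

For the five-character string this yields ten codes, beginning with C0T, C1C, C2G, C3A and ending with G0A, in the same order as the listing in the description.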
【００４３】
そして、基準データ列と比較データ列について、この１０種類の組におけるデータの並び順及び距離の比較を行う（ステップＳ６）。
例えば、比較データ列「Ｃ、Ｔ、Ｔ、Ｇ、Ａ」から抽出した「Ｃ＿＿Ｇ」の組と、基準データ列内のデータ「Ｃ＿＿Ｇ」を比較し、両者の配列が距離を含めて一致しているかどうかを検討する。この場合、比較データ列内の「Ｃ＿＿Ｇ」が基準データ配列内に存在すると判断して（ステップＳ７）、比較・演算部１３は、この組（「Ｃ＿＿Ｇ」の組）に対して「１」を点数として付与する（ステップＳ８）。
なお、評価点数「１」又は「０」を付与する代わりに、並び順及び距離の双方が一致する組の数を求め、第二及び第三の実施形態で示した式に代入して、類似性の評価を行うようにしてもよいことは勿論である。
【００４４】
このようにして、１０種類の全ての組について評価を行い（ステップＳ１０）、全ての組について評価が完了したときには、ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」について、全ての組の評価の点数を合計する。
ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」は、基準データ列「Ｃ，Ｔ，Ｃ，Ｇ，Ａ」と完全に一致しているので、その評価点数の合計は、図１２に示すように満点（１０点）となる。この評価点数は、ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の類似性評価値として、入出力処理部１２によって第三の記憶領域１４ｃに格納される。
【００４５】
ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の評価を終えた後、ｋのカウントを一つだけアップして（ステップＳ１２，Ｓ１４）、次のｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」について評価を行う。
このｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」についても、各データ「Ｃ」，「Ｔ」，「Ｔ」，「Ｇ」，「Ａ」の順序を変えることなく、１０種類のデータの組を作成する（ステップＳ５）。
そして先と同様の手順で比較を行い（ステップＳ６）、基準データ列のデータの順序と一致するかどうかの判断を行う（ステップＳ７）。
【００４６】
このｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」においては、基準データ列に対して、３文字目の「Ｃ」と「Ｔ」のデータが入れ替わっている為、比較データ列から作られた「ＴＧ」の組のデータは、基準データ列の中に存在しないものである。そこで、比較・演算部１３は、図１３に示すように、「ＴＧ」の組については、「０」を点数として付与する（ステップＳ９）。
ｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」は、基準データ列「Ｃ，Ｔ，Ｃ，Ｇ，Ａ」に対して、「Ｃ」，「Ｔ」のデータの順序が入れ替わっただけであるが、その評価点数の合計は、図１３に示すように6点となる。この評価点数は、ｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」として、入出力処理部１２によって第三の記憶領域１４ｃに格納される。
【００４７】
そして、ｋ＝３、ｋ＝４、ｋ＝５の比較データ列に対してもステップＳ５〜ステップＳ１０の手順に従って同様に評価を行い、各比較データ列ごとに評価点数の合計を求め（ステップＳ１１）、その結果を第三の記憶領域１４ｃに格納する。
図１４〜図１６に示すように、ｋ＝３の比較データ列「Ｔ，Ｃ，Ｇ，Ａ，Ｃ」では評価点数は６点、ｋ＝４の比較データ列「Ａ，Ｇ，Ｃ，Ｔ，Ｃ」の評価点数は３点、ｋ＝５の比較データ列「Ａ，Ｇ，Ｔ，Ａ，Ｃ」の評価点数は０点となる。
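The scores of the fifth embodiment can be checked with a short Python sketch. It is an illustration under stated assumptions, not the patented program: a set of the comparison string earns one point when the same (first character, gap, second character) triple occurs anywhere in the reference string, and repeated triples are not treated specially. Function names are hypothetical.

```python
from itertools import combinations

def gapped_pairs(seq):
    """(first, gap, second) triples; gap = number of items in between."""
    return [(seq[i], j - i - 1, seq[j])
            for i, j in combinations(range(len(seq)), 2)]

def gapped_score(reference, comparison):
    """One point per comparison triple that also occurs in the reference."""
    ref = set(gapped_pairs(reference))
    return sum(1 for triple in gapped_pairs(comparison) if triple in ref)

reference = "CTCGA"
comparisons = ["CTCGA", "CTTGA", "TCGAC", "AGCTC", "AGTAC"]  # k = 1..5
scores = [gapped_score(reference, c) for c in comparisons]
print(scores)  # [10, 6, 6, 3, 0]
```

These totals agree with the values given for k = 1 to 5 (10, 6, 6, 3, and 0 points).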
【００４８】
このようにして、五つの比較データ列の全てについて評価を終えた後（ステップＳ１２）、ｋ＝１〜５の比較データ列に対して評価点数の比較を行う（ステップＳ１３）。
すなわち、ｋ＝１〜５の比較データ列の評価点数を、比較・演算部１３は、第三の記憶領域１４ｃから読み出して比較し、点数の高いものから順にランク付けを行う。
【００４９】
上記したｋ＝１〜５の比較データ列では、ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の評価点数が最も高く、「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」（ｋ＝２）、「Ｔ，Ｃ，Ｇ，Ａ，Ｃ」（ｋ＝３）、「Ａ，Ｇ，Ｃ，Ｔ，Ｃ」（ｋ＝４）、「Ａ，Ｇ，Ｔ，Ａ，Ｃ」（ｋ＝５）の順で評価点数が低くなる。これは、すなわち、基準データ列に対する比較データ列の類似性（接近性）を表していることに他ならない。
【００５０】
このように、本発明では、従来の手法では困難であった客観的な評価が可能になるわけである。例えば、ｋ＝３の比較データ列「Ｔ，Ｃ，Ｇ，Ａ，Ｃ」と、ｋ＝４の比較データ列「Ａ，Ｇ，Ｃ，Ｔ，Ｃ」とは従来、どっちが基準データ列に近いのか判断が困難であったが、各比較データ列を構成するデータの距離という観点からその類似性を判断すると、ｋ＝３の比較データ列の方が基準データ列により近い、ということを客観的に判断することが可能になる。
【００５１】
［第六の実施形態］
本発明をさらに応用することで、複数のデータ列の中から共通するデータ列の部分を抜き出すことが可能である。
例えば、データ列１：「ＡＢＢＣＣ」，データ列２：「ＢＢＣＣＡ」，データ列３：「ＣＡＢＢＣ」及びデータ列４：「ＡＢＣＣＢ」の四つのデータ列があるものとする。
そこで、上記と同様に、これらデータ列１〜４を、距離を含めて二つのデータの組に分解する。
これを図１７に示す。文字間の数字は、二つの文字の間の距離を意味している。例えば、ＢとＣとの間に二つの文字が存在する場合、距離は「２」となる。
【００５２】
この第六の実施形態における処理の手順を、図２のブロック図及び図１８のフローチャートを参照しながら説明する。
まず、比較する複数のデータ列（この場合は四つのデータ列）を入力する（ステップＳ２１）。図２の評価装置の入出力処理部１２は、各データ列１〜４を第一の記憶領域１４ａに格納する。また、入出力処理部１２は、各データ列１〜４を二つのデータの組に分解し（ステップＳ２２）、各組ごとにデータ間の距離情報を付して、第二の記憶領域１４ｂに格納する（ステップＳ２３）。
【００５３】
比較演算部１３は、第二の記憶領域１４ｂから各データ列１〜４ごとにデータの組を読み出して、並び順とデータ間の距離とが一致するか否かを判断する（ステップＳ２４）。この場合、例えばデータ列１の各データの組を基準として、他のデータ列２，３のデータの組の中から一致するデータの組を抽出する（ステップＳ２５）。そして、一致するデータの組を、各組のデータ間の距離とともに、第三の記憶領域１４ｃに格納する（ステップＳ２６）。
【００５４】
この後、復元の必要や要求に応じて（ステップＳ２７）、この第三の記憶領域１４ｃに格納されている内容をディスプレイ等の表示部１５に復元・表示させる（ステップＳ２８）。これにより、これら四つのデータ列１〜４の共通部分が「ＢＢＣ」であることが容易にわかる（図１７の「共有配列」の欄参照）。
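The extraction of shared sets in steps S24 to S26 can be sketched as a set intersection over distance-annotated pairs. The sketch below is only an illustration with hypothetical names. It uses data strings 1 to 3, the ones named in the description of step S25. Data string 4 (ＡＢＣＣＢ) is left out because the adjacent pair B0B does not occur in it, so including it would shrink the intersection.

```python
from itertools import combinations

def gapped_pairs(seq):
    """Set of (first, gap, second) triples; gap = number of items in between."""
    return {(seq[i], j - i - 1, seq[j])
            for i, j in combinations(range(len(seq)), 2)}

strings = ["ABBCC", "BBCCA", "CABBC"]  # data strings 1 to 3
shared = set.intersection(*(gapped_pairs(s) for s in strings))
print(sorted(shared))
```

The three shared triples (B0B, B0C, B1C) are exactly the triples of the string "BBC", which is how the common portion can be read back from the stored sets.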
【００５５】
［第七の実施形態］
この第七の実施形態においては、先頭の文字の位置情報を加味することも可能である。図１９に示す第七の実施形態では、先頭の文字の位置情報を、先頭の文字の前に付した「１」〜「４」の数値で表している。この位置情報を前記のデータ間の距離とともに第三の記憶領域１４ｃに格納し、前記と同様に共有配列を復元することで、長いデータ列の中のどの位置に、当該共通部分が存在するかを知ることができる。
【００５６】
［第八の実施形態］
上記の第七の実施形態は、データ列の長さが長くなっても適用が可能である。例えば、ヒト（Homo sapiens）のアミノ酸配列「MIVFVRFNSSHGFPVEVDSDTSIFQ・・・・」とセン虫（Arabidopsis thaliana）のアミノ酸配列「MENNREGPYSVLTRDQLKGNMKKQIA・・・」との共通部分を容易に発見することが可能になる。
この場合も、ヒト及びセン虫についてのアミノ酸配列（データ列）を、各データ列内の位置情報及びデータ間の距離情報を有する二つのデータからなる組に分解する。
【００５７】
例えば、上記のヒトのアミノ酸配列と、セン虫のアミノ酸配列を比べて見ると、ヒトのアミノ酸配列の先頭から１４番目のＰＶ（「１４Ｐ２Ｖ」）と、セン虫のアミノ酸配列の８番目のＰＶ（「８Ｐ２Ｖ」）とが共通している。
上記した第七及び第八の実施形態においても、基本的には、第六の実施形態と同様の処理を行えばよい。第七及び第八の実施形態では、図２０に示すような手順で処理を行う。
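Locating a distance-annotated set together with its position can be sketched as follows. The helper name is hypothetical, positions are counted from 1, and only the sequence prefixes quoted in the description are used, so this is a check of the P..V example rather than an analysis of the full sequences.

```python
def positioned_pairs(seq, first, gap, second):
    """1-based positions where `first` is followed by `second`
    with exactly `gap` items in between."""
    step = gap + 1
    return [i + 1
            for i in range(len(seq) - step)
            if seq[i] == first and seq[i + step] == second]

seq_a = "MIVFVRFNSSHGFPVEVDSDTSIFQ"    # prefix of the first sequence in the text
seq_b = "MENNREGPYSVLTRDQLKGNMKKQIA"   # prefix of the second sequence in the text

print(positioned_pairs(seq_a, "P", 2, "V"))  # [14]
print(positioned_pairs(seq_b, "P", 2, "V"))  # [8]
```

Both prefixes contain P followed by V with two residues in between, at positions 14 and 8 respectively.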
【００５８】
なお、図２０のフローチャートにおいては、図１８のフローチャートと同一のステップに同一の符号を付し、その説明は省略する。
第七及び第八の実施形態の処理では、二つのデータの組を抽出（ステップＳ２２）した後、距離情報を付与する（ステップＳ２３）際に、当該データの組の位置情報を一時的に記憶部（例えば第三の記憶領域１４ｃ）に格納する（ステップＳ３１）。そして、ステップＳ２５で一致するデータの組を抜き出して、前記距離情報とともに記憶部に記憶させる際に、前記位置情報を記憶部から読み出して前記データの組に付与する（ステップＳ３２）。ステップＳ２８で共有部分の復元を行う際には、前記位置情報に基づいて、当該共通部分のみをアミノ酸配列上の所定位置に復元させるようにする。これにより、共通部分を容易に判断することが可能になる。
【００５９】
［第九の実施形態］
本発明は、アミノ酸配列モチーフのような一定の並び順のデータ配列検出にも有効に利用できる。
アミノ酸モチーフ配列は、例えば、Ｃ・・・Ｈ・・・Ｃ・・・のように複数のデータ（一般には文字）からなるデータ列であって、かつ、規定のデータ（Ｃ，Ｈ，Ｃ）の間の距離が一定範囲内のものである。すなわち、あるアミノ酸配列モチーフは、例えば、Ｃ４〜１４Ｈ４〜１５Ｃ３〜５Ｃのように表すことができる。
【００６０】
このアミノ酸配列モチーフの解析においては、アミノ酸配列モチーフを基準データ列として、比較データ列（アミノ酸配列）の中に前記基準データ列と共通のデータ列が存在するかどうかを解析する。
この場合は、まず、基準データ列と同じ並び順を有するデータの組を、比較データ列の中から取り出す（以下の表の例では、ＣＨＣＣの並び順を有するデータの組を抽出する）。
そして、下記の表に示すように、基準データの距離情報に幅を持たせ、比較データ列の中の同じ並び順の組のそれぞれについて、距離情報が基準データ列の距離情報の範囲内に含まれるかどうかを判断する。
【００６１】
【表１】

【００６２】
そして、比較データ列の中のデータの組の全てについて距離情報が一致するとき（つまり、上記の表で評価点数が１のとき）に、当該比較データ列の中に所定のモチーフが存在すると判断する。上記の表の例では、比較データ列１の中にはモチーフは存在しないが、比較データ列２の中にはモチーフが存在すると判断することができる。
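A motif with distance ranges, such as Ｃ４〜１４Ｈ４〜１５Ｃ３〜５Ｃ, can be tested with a regular expression. This swaps in a standard regex search in place of the pairwise table lookup described above, and it assumes that each range counts the characters lying between two named residues. The two test sequences are made up for illustration and are not taken from the table.

```python
import re

# Assumed reading of the motif: C, 4-14 residues, H, 4-15 residues,
# C, 3-5 residues, C.
MOTIF = re.compile(r"C.{4,14}H.{4,15}C.{3,5}C")

def has_motif(seq):
    """True if the sequence contains the motif anywhere."""
    return MOTIF.search(seq) is not None

# Hypothetical sequences: the second has only 2 residues before the last C.
hit = "MK" + "C" + "A" * 4 + "H" + "A" * 4 + "C" + "A" * 3 + "C" + "LL"
miss = "MK" + "C" + "A" * 4 + "H" + "A" * 4 + "C" + "A" * 2 + "C" + "LL"

print(has_motif(hit), has_motif(miss))  # True False
```

A sequence is reported as containing the motif only when every distance falls inside its range, which mirrors the all-sets-must-match condition of the ninth embodiment.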
【００６３】
本発明の好適な実施形態について説明したが、本発明は上記の実施形態に限定されるものではない。
例えば、上記の説明では、説明の便宜のため、基準データ列及び比較データ列の類似性評価方法として説明したが、類似性を評価する為の２文字からなる配列データ（例：Ｃ＿＿Ｔ）に配列内の位置情報（＝ｉ）を付加し、且つ‘Ｃ’⇔‘Ｔ’間の距離を数値化（＝ｗ）して１件のミニマムなコア配列データを、「ｉＣｗＴ」、例えば「１Ｃ２Ｔ」のように表現して記憶させることにより、
▲１▼基準データ列及び比較データ列が共有する連続文字配列データを出力する事が可能になる。
▲２▼ 又、▲１▼と同様に、基準データ列及び比較配列データのそれぞれが持つ固有の連続文字配列を出力する事が出来る。
▲３▼ 遺伝子解析上重要な課題である１塩基置換配列（ＳＮＰ）を確実に把握することが出来る。
【００６４】
その他、ＤＮＡ配列分析の為の、客観的で信頼性の高い効率的な解析手法の基礎とする事が出来る。
また、本発明では、上記で説明した処理を実行するプログラムを光ディスクや磁気ディスク等の記録媒体に格納し、この記録媒体から読み出した前記プログラムをコンピュータに読み込んで実行させることが可能である。
さらに、本発明では、第一の実施形態〜第九の実施形態を適宜に組み合わせて実施することが可能である。例えば、第六の実施形態にしたがって複数のデータ列の中から共通部分を抜き出した後に、第一の実施形態にしたがって、当該共通部分からなるデータ列について、データの並び順の一致性を判断するようにすることも可能である。
また、本発明は、上記のような一次元的なデータ列に限らず、二次元的なデータ列にも適用が可能で、比較画像の類似性を判断するパターン解析にも適用が可能である。
【００６５】
【発明の効果】
このように、本発明によれば、簡単な方法でデータ列の類似性及び共通性を客観的に評価できるものである。さらに、データの数が多くなっても、二つのデータからなる組を抽出するだけであり、過度に複雑な処理プロセスを必要とせずコンピュータの処理能力に大きな負担をかけることなく客観的な評価を安価かつ短時間で行うことが出来る。従って、パーソナルコンピュータ程度の汎用コンピュータによる処理でもある程度の規模の文字列間の評価を行うことができ、又、処理を分散し複数のコンピュータで並行的な処理を行う事も可能になる。
【図面の簡単な説明】
【図１】本発明の評価方法を説明するフローチャートである。
【図２】本発明の評価方法を実行する装置の構成を示すブロック図である。
【図３】本発明の評価方法の第一の実施形態にかかる具体例で、ｋ＝１の比較データ列「Ａ，Ｂ，Ｃ，Ｄ，Ｅ」の評価結果を示すものである。
【図４】本発明の評価方法の第一の実施形態にかかる具体例で、ｋ＝２の比較データ列「Ａ，Ｃ，Ｂ，Ｄ，Ｅ」の評価結果を示すものである。
【図５】本発明の評価方法の第一の実施形態にかかる具体例で、ｋ＝３の比較データ列「Ｃ，Ｂ，Ａ，Ｅ，Ｄ」の評価結果を示すものである。
【図６】本発明の評価方法の第一の実施形態にかかる具体例で、ｋ＝４の比較データ列「Ｅ，Ａ，Ｂ，Ｄ，Ｃ」の評価結果を示すものである。
【図７】本発明の評価方法の第一の実施形態にかかる具体例で、ｋ＝５の比較データ列「Ｅ，Ｄ，Ｃ，Ｂ，Ａ」の評価結果を示すものである。
【図８】第二の実施形態の類似性評価の結果を示す表である。
【図９】第三の実施形態の類似性評価の結果を示す表である。
【図１０】第四の実施形態の類似性評価の結果を示す表である。
【図１１】第五の実施形態の類似性評価の結果を示す表である。
【図１２】本発明の評価方法の第五の実施形態にかかる具体例で、ｋ＝１の比較データ列「Ｃ、Ｔ、Ｃ、Ｇ、Ａ」の評価結果を示すものである。
【図１３】本発明の評価方法の第五の実施形態にかかる具体例で、ｋ＝２の比較データ列「Ｃ，Ｔ，Ｔ，Ｇ，Ａ」の評価結果を示すものである。
【図１４】本発明の評価方法の第五の実施形態にかかる具体例で、ｋ＝３の比較データ列「Ｔ，Ｃ，Ｇ，Ａ，Ｃ」の評価結果を示すものである。
【図１５】本発明の評価方法の第五の実施形態にかかる具体例で、ｋ＝４の比較データ列「Ａ，Ｇ，Ｃ，Ｔ，Ｃ」の評価結果を示すものである。
【図１６】本発明の評価方法の第五の実施形態にかかる具体例で、ｋ＝５の比較データ列「Ａ，Ｇ，Ｔ，Ａ，Ｃ」の評価結果を示すものである。
【図１７】第六の実施形態のデータ列が共有する配列部分を検出する処理の結果を示す表である。
【図１８】第六の実施形態の処理の手順を説明するためのフローチャートである。
【図１９】第七の実施形態のデータ列が共有する配列を元の配列として復元する処理の結果を示す表である。
【図２０】第七及び第八の実施形態の処理の手順を説明するためのフローチャートである。
【符号の説明】
１評価装置
１１入力部
１３比較・演算部
１４記憶部
１４ａ第一の記憶領域
１４ｂ第二の記憶領域
１４ｃ第三の記憶領域
１５表示部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an evaluation apparatus for data strings composed of characters and the like, and to a computer program for such an apparatus, which objectively evaluate how similar a plurality of data strings are to one another and how much they have in common, where each data string consists of numerical values, symbols, characters, figures, or combinations thereof.
[0002]
[Prior art]
Conventionally, various methods have been proposed as methods for determining the similarity of data strings composed of data such as characters and numbers (see, for example, Patent Document 1 and Patent Document 2).
[0003]
[Patent Document 1]
JP-A-6-139224
[Patent Document 2]
JP 20002-202983 A
[0004]
However, even with the conventional methods described above, when a plurality of comparison data strings are to be compared against one reference data string, it can be difficult to rank the comparison data strings by how close each is to the reference data string.
[0005]
For example, given a reference data string consisting of the five characters “A, B, C, D, E” and three comparison data strings “A, C, B, D, E”, “C, B, A, E, D”, and “E, A, B, D, C”, it is easy to judge that “A, C, B, D, E” is closest to the reference data string “A, B, C, D, E”, but it has been difficult to evaluate objectively which of “C, B, A, E, D” and “E, A, B, D, C” is closer to it.
Furthermore, as the number of data items such as characters in the reference and comparison data strings increases, the amount of information to be processed increases significantly. For example, analyzing long character sequences such as DNA base sequences, which consist of the four data types A, G, T, and C, or sequences of the 20 amino acids that make up proteins currently cannot be done without a very fast, very large computer.
[0006]
Various methods and programs for analyzing DNA base sequences and amino acid sequences such as those described above have been proposed, including a technique relating to a system for searching for and extracting a specific motif sequence (see Patent Document 3) and a method relating to SNP (single nucleotide polymorphism) detection (see Patent Document 4).
[0007]
[Patent Document 3]
JP 2000-060553 A
[Patent Document 4]
JP 2002-063175 A
[0008]
However, even with the methods, programs, and systems described in Patent Documents 3 and 4, analysis requires a great deal of time and labor. Moreover, it is extremely difficult to find common portions in long character and symbol sequences such as DNA base sequences and amino acid sequences. For example, to determine how much a human amino acid sequence consisting of several thousand characters has in common with the amino acid sequences of monkeys, nematodes, mice, and other animals used for comparison, and which portions are shared, researchers currently compare the character sequences by eye, so searching for common portions across multiple comparison targets requires enormous time and effort.
[0009]
[Problems to be solved by the invention]
The present invention has been made in view of the above problems. Its object is to provide an evaluation apparatus for data strings composed of characters and the like, and a computer program for such an apparatus, that can objectively evaluate the similarity of data strings by a simple method without placing an excessive burden on the processing capacity of a computer even as the number of data items such as characters increases, and that can easily search for the common portions of a plurality of data strings.
[0010]
[Means for Solving the Problems]
To achieve the above object, the inventor found that similarity (commonality) can be determined easily with a simple program by decomposing each data string to be compared into sets of two data items such as characters and determining, for each set, whether the order of the items matches.
That is, the invention according to claim 1 is an evaluation apparatus that evaluates how similar a comparison data string is to a reference data string, each consisting of a plurality of character, numerical value, symbol, or figure data items. The apparatus comprises: an input unit that accepts input of the reference data string, which consists of a plurality of character, numerical value, symbol, or figure data items or a combination of these, and stores it in a storage unit, and that accepts input of the comparison data string and stores it in the storage unit; a comparison unit that forms sets of two data items from each of the stored reference data string and comparison data string without changing their order, stores the sets in the storage unit, and compares the order of the data items in each set of the comparison data string with the order of the data items in the set of the reference data string that consists of the same data items; and a calculation unit that assigns a first evaluation score when the orders match and a second evaluation score when they do not, adds up the first and second evaluation scores assigned to all sets extracted from the comparison data string, stores the resulting total score in the storage unit, and performs the evaluation based on the total score.
[0011]
In the invention according to claim 2, the first evaluation score is 1 point and the second evaluation score is 0 points.
Similarity can be judged by how much common content the compared data strings contain. By defining in advance a formula that expresses the proportion of the common portion and substituting into it the number of matching sets or the number of non-matching sets, similarity can be determined. That is, the invention according to claim 3 is an evaluation apparatus that evaluates how similar a comparison data string is to a reference data string, each consisting of a plurality of character, numerical value, symbol, or figure data items. The apparatus comprises: an input unit that accepts input of the reference data string, which consists of a plurality of character, numerical value, symbol, or figure data items or a combination of these, and stores it in a storage unit, and that accepts input of the comparison data string and stores it in the storage unit; a comparison unit that forms sets of two data items from the stored reference data string and comparison data string, stores the sets in the storage unit, and compares the order of the data items in each set of the comparison data string with the order of the data items in the set of the reference data string that consists of the same data items; and a calculation unit that obtains the number of sets of the comparison data string whose order matches, or the number whose order does not match, obtains an evaluation value for the comparison data string from the ratio of that number to the total number of sets, and stores the evaluation value in the storage unit.
[0012]
In the present invention, commonality as well as similarity can be determined easily. That is, the invention according to claim 4 is the data string evaluation apparatus according to any one of claims 1 to 3, further comprising a calculation unit that forms sets of two data items from each of a plurality of data strings input from the input unit without changing their order, stores the sets in the storage unit, extracts from the stored sets those whose data order matches across the plurality of data strings, stores the matching sets in the storage unit for each data string together with their position information within that data string, and, for each data string, reproduces the sets whose order and position information match based on the information stored in the storage unit. As described in claim 5, the present invention is also applicable when the two data items are separated from each other. In that case, the distance between the data items is added to the order condition for the comparison.
[0013]
The present invention can also be applied effectively to very large data strings such as the base sequences of genes or the amino acid sequences that make up proteins. In the invention according to claim 6, the data constituting the reference data string and the comparison data string are gene base sequences consisting of the four types of character data A (adenine), G (guanine), T (thymine), and C (cytosine). In the invention according to claim 7, the data constituting the reference data string and the comparison data string are characters or the like representing sequences of the 20 amino acids that make up proteins.
[0014]
The computer program for the data string evaluation apparatus of the present invention can be provided recorded on a magnetic or optical recording medium such as a CD, FD, MO, or DVD. It can also be provided over a wireless or wired communication line. Loading the program into the memory of a personal computer or a dedicated analysis device via a reading device or the like and executing it on the CPU enables the operation of the evaluation apparatus of the present invention.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
[First embodiment]
FIG. 1 is a block diagram showing the configuration of an apparatus that executes the evaluation method of the present invention, FIG. 2 is a flowchart of the evaluation method, and FIGS. 3 to 7 show specific evaluation examples obtained with the evaluation method.
First, the evaluation procedure will be described with reference to FIGS. 1 and 2.
[0016]
As shown in FIG. 1, the evaluation apparatus 1 for executing the evaluation method of the present invention includes an input unit 11 such as a keyboard or mouse; an input/output processing unit 12 that processes input from the input unit 11 and output to a display unit 15 such as a printer or display; a storage unit 14 that stores the content processed by the input/output processing unit 12; and a comparison/calculation unit 13 that performs predetermined comparisons and calculations based on commands from the input/output processing unit 12. The storage unit 14 may be a memory such as RAM, a hard disk, a flexible disk (FD), or a medium such as an optical disk on which predetermined content can be read and written according to commands from the input/output processing unit 12. The storage unit 14 has a first storage area 14a that stores the data string serving as the reference for evaluation (the reference data string), a second storage area 14b that stores the data strings to be compared with the reference data string (the comparison data strings), and a third storage area 14c that stores the evaluation results as scores.
[0017]
As shown in FIG. 2, to perform a given evaluation, a reference data string is first input from the input unit 11 to the input/output processing unit 12 (step S1). The input/output processing unit 12 stores the input reference data string in the first storage area 14a of the storage unit 14.
In the following description, the reference data string is assumed to consist of the five character data items “A, B, C, D, E”.
[0018]
Next, the number n of comparison data strings is input from the input unit 11 to the input/output processing unit 12 (step S2), and the comparison data strings are input from the input unit 11 to the input/output processing unit 12 (step S3). The input/output processing unit 12 stores the number n and each comparison data string in the second storage area 14b of the storage unit 14.
In the following description, five comparison data strings (n = 5) are assumed: “A, B, C, D, E” (k = 1), “A, C, B, D, E” (k = 2), “C, B, A, E, D” (k = 3), “E, A, B, D, C” (k = 4), and “E, D, C, B, A” (k = 5).
[0019]
The evaluation then starts when an “execute” command is given via the input unit 11. When “execute” is input, the input/output processing unit 12 first reads the reference data string stored in the first storage area 14a and sends it to the comparison/calculation unit 13. It then reads the first comparison data string stored in the second storage area 14b (k = 1; this number is defined in step S4) and sends it to the comparison/calculation unit 13.
[0020]
The comparison/calculation unit 13 extracts sets of two data items from the k = 1 comparison data string “A, B, C, D, E” sent from the input/output processing unit 12. The order of the data items “A”, “B”, “C”, “D”, “E” is not changed when doing so (step S5).
That is, as shown in FIG. 3, the comparison/calculation unit 13 extracts the ten data sets “A, B”, “A, C”, ..., “C, E”, “D, E” from the comparison data string “A, B, C, D, E”.
The ten corresponding data sets (sets of two data items) are also extracted in advance from the reference data string (step S16).
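The extraction of two-item sets in step S5 can be sketched with Python's standard library. The function name `ordered_pairs` is a hypothetical helper, not something named in the patent. `itertools.combinations` emits pairs in the order in which the items appear, which corresponds to forming sets without changing the order of the data.

```python
from itertools import combinations

def ordered_pairs(seq):
    """Every two-item set (i < j), keeping the original order of the items."""
    return list(combinations(seq, 2))

pairs = ordered_pairs("ABCDE")
print(len(pairs))  # 10
print(pairs)
```

A string of n items gives n(n-1)/2 sets, so five items give the ten sets listed above, from ("A", "B") to ("D", "E").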
[0021]
The order of the data items in each of these ten sets is then compared with the order, in the reference data string, of the same two data items (step S6).
For example, the set “A, C” extracted from the comparison data string “A, B, C, D, E” is compared with the order (positional relationship) of data “A” and data “C” in the reference data string to check whether the two agree. Here, data “A” and data “C” appear in the order “A, C” in both, so the order is judged to match (step S7), and the comparison/calculation unit 13 assigns a score of “1” to this set (the “A, C” set) (step S8).
[0022]
In this way, evaluation is performed for all ten sets (step S10). When all sets have been evaluated, the evaluation scores of all sets are summed for the comparison data string “A, B, C, D, E” of k = 1.
Since the comparison data string “A, B, C, D, E” of k = 1 completely matches the reference data string “A, B, C, D, E”, the total evaluation score is a perfect score (10 points), as shown in FIG. 3. This evaluation score is stored in the third storage area 14c by the input / output processing unit 12 as the score of the comparison data string “A, B, C, D, E” of k = 1.
[0023]
After the evaluation of the comparison data string “A, B, C, D, E” of k = 1, the count of k is incremented by one (steps S12 and S14), and the next comparison data string “A, C, B, D, E” of k = 2 is evaluated.
For the comparison data string “A, C, B, D, E” of k = 2, ten data sets are likewise created without changing the order of the data “A”, “C”, “B”, “D”, “E” (step S5).
Then the comparison is performed by the same procedure as above (step S6), and it is determined whether the order of the data matches that in the reference data string (step S7).
[0024]
In the comparison data string “A, C, B, D, E” of k = 2, the data “C” and “B” are interchanged relative to the reference data string, so the order of the data in the set “C, B” differs from the order of the data “B” and “C” in the reference data string. Therefore, as shown in FIG. 4, the comparison / calculation unit 13 assigns “0” as the score of the set “C, B” (step S9).
Since the comparison data string “A, C, B, D, E” of k = 2 differs from the reference data string “A, B, C, D, E” only in that “C” and “B” are interchanged, the total evaluation score is 9 points, as shown in the figure. This evaluation score is stored in the third storage area 14c by the input / output processing unit 12 as the score of the comparison data string “A, C, B, D, E” of k = 2.
[0025]
The comparison data strings of k = 3, k = 4, and k = 5 are evaluated in the same way by the procedure of steps S5 to S10, the total score is obtained for each comparison data string (step S11), and the results are stored in the third storage area 14c.
As shown in FIGS. 5 to 7, the comparison data string “C, B, A, E, D” of k = 3 scores 6 points, the comparison data string “E, A, B, D, C” of k = 4 scores 5 points, and the comparison data string “E, D, C, B, A” of k = 5, whose data arrangement is the exact reverse of that of the reference data string, scores 0 points.
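The scoring of steps S5 to S11 can be sketched as below. This is only an illustrative sketch: the function name `order_score` is hypothetical, and the code assumes that every datum occurs exactly once in both strings, as in the example of this embodiment.

```python
from itertools import combinations

def order_score(reference, comparison):
    """Count the sets of two data whose order agrees with the reference.

    Assumes every datum occurs exactly once in both strings.
    """
    rank = {datum: i for i, datum in enumerate(reference)}
    score = 0
    for first, second in combinations(comparison, 2):
        # 1 point if `first` also precedes `second` in the reference, else 0.
        score += 1 if rank[first] < rank[second] else 0
    return score

reference = "ABCDE"
comparisons = ["ABCDE", "ACBDE", "CBAED", "EABDC", "EDCBA"]
scores = [order_score(reference, c) for c in comparisons]
print(scores)  # [10, 9, 6, 5, 0]
```

The printed totals agree with the scores given above for k = 1 to 5.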
[0026]
After all five comparison data strings have been evaluated in this way (step S12), the evaluation scores of the comparison data strings of k = 1 to 5 are compared (step S13).
That is, the comparison / calculation unit 13 reads the evaluation scores of the comparison data strings of k = 1 to 5 from the third storage area 14c, compares them, and ranks the strings in descending order of score.
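The ranking of step S13 amounts to sorting the stored scores. The sketch below assumes the scores are held in a dictionary keyed by k; the name `rank_by_score` and this storage layout are hypothetical.

```python
# Scores as stored for k = 1..5 in the first embodiment.
scores = {1: 10, 2: 9, 3: 6, 4: 5, 5: 0}

def rank_by_score(scores):
    """Return the k values ordered from highest to lowest evaluation score."""
    # sorted() is stable, so strings with equal scores keep their input order.
    return sorted(scores, key=lambda k: scores[k], reverse=True)

print(rank_by_score(scores))  # [1, 2, 3, 4, 5]
```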
[0027]
Among the comparison data strings of k = 1 to 5 described above, the comparison data string “A, B, C, D, E” of k = 1 has the highest evaluation score, and the score decreases in the order “A, C, B, D, E” (k = 2), “C, B, A, E, D” (k = 3), “E, A, B, D, C” (k = 4), “E, D, C, B, A” (k = 5). In other words, the evaluation score represents the similarity (closeness) of each comparison data string to the reference data string.
[0028]
The present invention makes possible an objective evaluation that was difficult with conventional methods. For example, it was conventionally difficult to determine which of the comparison data string “C, B, A, E, D” of k = 3 and the comparison data string “E, A, B, D, C” of k = 4 is closer to the reference data string. If the similarity is judged from the viewpoint of the distance of the data constituting each comparison data string, however, it can be objectively determined that the comparison data string of k = 3 is closer to the reference data string.
[0029]
[Second Embodiment]
In the above embodiment, 1 or 0 evaluation points are assigned depending on whether the data is arranged in the same order.
In this embodiment, it is assumed that “similarity” is the ratio of common parts in a data string such as characters to be compared, and the evaluation of similarity is obtained by the following expression.
In the following formula,
(i) Number of data in the reference data string = n, number of data sets in the reference data string = a
Here, the number a of sets is represented by a = n (n−1) / 2.
(ii) Number of data in comparison data string = m, number of data sets in comparison data string = b
Here, the number b of sets is represented by b = m (m−1) / 2.
(iii) Number of common sets of reference data string and comparison data string = c
[0030]
Similarity evaluation value = (c × 2) / (a + b)
[0031]
When the similarity evaluation value is expressed as a percentage (%), the right side of the above formula may be multiplied by 100.
FIG. 8 shows the result of similarity evaluation when the number of data in the reference data string and the comparison data string is the same (both are five, that is, n = m = 5).
As in the first embodiment, the reference data string and the comparison data strings (comparison data strings 1, 2, 3) are decomposed into sets of two data. In the example shown in FIG. 8, both the reference data string and each comparison data string yield 10 sets (that is, a = b = 10).
[0032]
Here, as in the first embodiment, it is determined whether the data arrangement order in each set matches the data arrangement order of the reference data string.
As shown in FIG. 8, comparison data string 1 has eight sets (c = 8) that match the reference data string. Substituting these values into the above formula gives
Similarity evaluation value = 16/20 (80%).
Similarly, comparison data string 2 has a similarity evaluation value of 12/20 (60%), and comparison data string 3, which has no set with a common arrangement order, has a similarity evaluation value of 0/20 (0%).
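The arithmetic of the formula can be checked with a short sketch. The function names `pair_count` and `similarity` are hypothetical; the values of c are those read from the results above (c = 8, and c = 6 and 0 as implied by 12/20 and 0/20).

```python
from fractions import Fraction

def pair_count(n):
    """Number of sets of two data in a string of n data: n(n-1)/2."""
    return n * (n - 1) // 2

def similarity(n, m, c):
    """Similarity evaluation value 2c / (a + b) as an exact fraction."""
    a, b = pair_count(n), pair_count(m)
    return Fraction(2 * c, a + b)

for c in (8, 6, 0):  # common sets for comparison data strings 1 to 3
    value = similarity(5, 5, c)
    print(f"c = {c}: {float(value):.0%}")
```

Using exact fractions avoids rounding until the value is finally shown as a percentage.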
[0033]
[Third embodiment]
The similarity evaluation of the second embodiment can also be applied when the number of data in the reference data string differs from the number of data in the comparison data string, and has the advantage that the similarity can be evaluated objectively in such a case as well.
In the following third embodiment, it is assumed that the similarity evaluation between the reference data string “ABCDE”, the comparison data string 1 “ACBE”, the comparison data string 2 “BCD”, and the comparison data string 3 “EDC” is performed.
The result of similarity evaluation of the third embodiment is shown in the table of FIG.
In comparison data string 1, the number of data m = 4, the number of data sets b = 6, and the number of data sets in common with the reference data string c = 5. Substituting these into the above formula gives
Similarity evaluation value = 10/16 (62.5%). Similarly, the similarity evaluation value is 6/13 (46.2%) for comparison data string 2 and 0 for comparison data string 3.
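The same evaluation for strings of different lengths can be sketched directly from the strings. This is an illustrative sketch only; the function name `similarity` is hypothetical, and the code assumes that no datum is repeated within a string, as in this example.

```python
from itertools import combinations

def similarity(reference, comparison):
    """Similarity 2c / (a + b) based on shared ordered sets of two data.

    Assumes no datum is repeated within a string.
    """
    ref_pairs = set(combinations(reference, 2))
    cmp_pairs = list(combinations(comparison, 2))
    c = sum(1 for pair in cmp_pairs if pair in ref_pairs)  # common sets
    a, b = len(ref_pairs), len(cmp_pairs)
    return 2 * c / (a + b)

for comparison in ("ACBE", "BCD", "EDC"):
    print(comparison, round(similarity("ABCDE", comparison), 3))
```

For “ACBE” the only set out of order is “C, B”, which leaves c = 5 and gives 10/16.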
[0034]
[Fourth embodiment]
In the first to third embodiments, the similarity is evaluated based only on the arrangement order of the data in each set. In the fourth embodiment, the similarity is evaluated taking the distance between the data of each set into account as well.
In the example of similarity evaluation illustrated in FIG. 10, three comparison data strings, namely comparison data string 1 “BBCCA”, comparison data string 2 “CABBC”, and comparison data string 3 “ABCCB”, are evaluated against the reference data string “ABBCC”.
[0035]
When data sets are extracted from the reference data string and the comparison data strings 1, 2, and 3, the distance between the data of each set is indicated by blanks "_". For example, in the reference data string “ABBCC”, the data set of the first “A” and the fourth “C” is represented by “A__C”.
For each of the comparison data strings 1, 2, and 3, the number of sets whose arrangement order and distance both match those of a set of the reference data string is counted and substituted into the above formula to evaluate the similarity.
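A distance-aware decomposition and evaluation might look like the following sketch. The names `distance_pairs` and `distance_similarity` are hypothetical. How repeated sets are counted is not spelled out in the text; the sketch assumes that each set of the comparison data string counts once if the same set occurs anywhere in the reference data string.

```python
def distance_pairs(data):
    """List every set of two data with the gap between them drawn as '_'."""
    return [data[i] + "_" * (j - i - 1) + data[j]
            for i in range(len(data)) for j in range(i + 1, len(data))]

def distance_similarity(reference, comparison):
    """2c / (a + b), where c counts comparison sets also found in the reference."""
    ref_sets = distance_pairs(reference)
    cmp_sets = distance_pairs(comparison)
    known = set(ref_sets)
    c = sum(1 for s in cmp_sets if s in known)  # assumed counting rule
    return 2 * c / (len(ref_sets) + len(cmp_sets))

print(distance_pairs("ABBCC")[:4])            # sets that start at the first "A"
print(distance_similarity("ABBCC", "ABBCC"))  # identical strings
```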
[0036]
According to the present invention, even when the number of data constituting the reference data string and the comparison data string increases and the same character or symbol occurs repeatedly, how close the comparison data string is to the reference data string can be judged objectively and accurately, so the field of application is extremely wide. For example, the DNA character sequences of genes of a plurality of organisms (comparison data strings) can be compared with the character sequence of human DNA (reference data string), or the amino acid sequences of proteins constituting the cells of a plurality of organisms can be compared with the amino acid sequences of proteins constituting human cells, in order to judge objectively and easily which organisms are close to humans.
Note that this embodiment can also be applied when the number of data in the reference data string differs from that in the comparison data string, as shown in FIG. 11.
The example shown in FIG. 11 is a case where the number of data in the comparison data string is larger than the number of data in the reference data string, but the similarity can be evaluated by the same method as described above.
[0037]
[Fifth embodiment]
Next, a fifth embodiment in which the evaluation method of the present invention is applied to gene sequences will be described with reference to FIGS. 1, 2, and 12 to 16.
[0038]
As shown in FIG. 2, when performing a predetermined evaluation, a reference data string is first input from the input unit 11 to the input / output processing unit 12 (step S1). The input reference data string is stored in the first storage area 14a in the storage unit 14 by the input / output processing unit 12.
In the following description, the reference data string is assumed to be a data string composed of five character data “C, T, C, G, A”.
[0039]
Next, the number n of comparison data strings is input from the input unit 11 to the input / output processing unit 12 (step S2), and the comparison data strings are input from the input unit 11 to the input / output processing unit 12 (step S3). The number n of comparison data strings and each comparison data string are stored in the second storage area 14b in the storage unit 14 by the input / output processing unit 12.
In the following description, the comparison data strings to be compared are the five (n = 5) strings “C, T, C, G, A” (k = 1), “C, T, T, G, A” (k = 2), “T, C, G, A, C” (k = 3), “A, G, C, T, C” (k = 4), and “A, G, T, A, C” (k = 5) shown in the figures.
[0040]
Thereafter, an “execute” command is given via the input unit 11 to start the evaluation. When “execute” is input, the reference data string stored in the first storage area 14a is first read by the input / output processing unit 12 and sent to the comparison / calculation unit 13. Next, the first comparison data string stored in the second storage area 14b (k = 1; this number is set in step S4) is read by the input / output processing unit 12 and sent to the comparison / calculation unit 13.
[0041]
The comparison / calculation unit 13 extracts sets of two data from the comparison data string “C, T, C, G, A” of k = 1 sent from the input / output processing unit 12. At this time, the order of the data “C”, “T”, “C”, “G”, “A” is not changed (step S5), and the distance between the two data is also taken into account.
That is, the comparison / calculation unit 13 extracts the sets “CT”, “C_C”, “C__G”, “C___A”, “TC”, “T_G”, ..., “GA” from the comparison data string “C, T, C, G, A” shown in the figure. Here '_' indicates a blank and represents the distance between the two characters.
[0042]
Note that any character or the like may be inserted in the blanks. The number of characters that fall between the two data, that is, the distance between the data, may also be represented by a numeral, as in “C0T”, “C1C”, “C2G”, “C3A”, “T0C”, “T1G”, ..., “G0A”.
As in the previous embodiments, ten data sets (sets of two data) corresponding to the ten sets of the comparison data string are extracted in advance from the reference data string (step S16).
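The two notations for distance are interchangeable, as the following sketch suggests. The function names `to_numeric` and `numeric_pairs` are hypothetical, and the code assumes that each datum is a single character.

```python
def to_numeric(pair):
    """Rewrite a set such as 'C__G' as 'C2G' (gap length as a numeral)."""
    first, last = pair[0], pair[-1]
    gap = len(pair) - 2  # number of characters between the two data
    return f"{first}{gap}{last}"

def numeric_pairs(data):
    """All sets of two data of a string, in numeric-distance form."""
    return [f"{data[i]}{j - i - 1}{data[j]}"
            for i in range(len(data)) for j in range(i + 1, len(data))]

print(to_numeric("C__G"))
print(numeric_pairs("CTCGA"))
```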
[0043]
Then, with respect to the reference data string and the comparison data string, the arrangement order and distance of the data in the ten kinds of sets are compared (step S6).
For example, the set “C__G” extracted from the comparison data string “C, T, T, G, A” is compared with the set “C__G” in the reference data string to determine whether the two match in arrangement, including the distance. In this case, it is determined that “C__G” of the comparison data string exists in the reference data string (step S7), and the comparison / calculation unit 13 assigns “1” as the score of this set (the set “C__G”) (step S8).
Instead of assigning the evaluation score “1” or “0”, the number of sets in which both the arrangement order and the distance match may of course be obtained and substituted into the formula shown in the second and third embodiments to evaluate the similarity.
[0044]
In this way, evaluation is performed for all ten sets (step S10). When all sets have been evaluated, the evaluation scores of all sets are summed for the comparison data string “C, T, C, G, A” of k = 1.
Since the comparison data string “C, T, C, G, A” of k = 1 completely matches the reference data string “C, T, C, G, A”, the total evaluation score is a perfect score (10 points), as shown in FIG. 12. This evaluation score is stored in the third storage area 14c by the input / output processing unit 12 as the similarity evaluation value of the comparison data string “C, T, C, G, A” of k = 1.
[0045]
After the evaluation of the comparison data string “C, T, C, G, A” of k = 1, the count of k is incremented by one (steps S12 and S14), and the next comparison data string “C, T, T, G, A” of k = 2 is evaluated.
For the comparison data string “C, T, T, G, A” of k = 2, ten data sets are likewise created without changing the order of the data “C”, “T”, “T”, “G”, “A” (step S5).
Then the comparison is performed by the same procedure as above (step S6), and it is determined whether each set matches a set of the reference data string (step S7).
[0046]
In the comparison data string “C, T, T, G, A” of k = 2, the third character “C” of the reference data string is replaced with “T”, so the set “TG” created from this comparison data string does not exist in the reference data string. Therefore, as shown in FIG. 13, the comparison / calculation unit 13 assigns “0” as the score of the set “TG” (step S9).
Although the comparison data string “C, T, T, G, A” of k = 2 differs from the reference data string “C, T, C, G, A” only in that one “C” is replaced with “T”, the total evaluation score is 6 points, as shown in the figure. This evaluation score is stored in the third storage area 14c by the input / output processing unit 12 as the score of the comparison data string “C, T, T, G, A” of k = 2.
[0047]
The comparison data strings of k = 3, k = 4, and k = 5 are evaluated in the same way by the procedure of steps S5 to S10, the total score is obtained for each comparison data string (step S11), and the results are stored in the third storage area 14c.
As shown in FIGS. 14 to 16, the comparison data string “T, C, G, A, C” of k = 3 scores 6 points, the comparison data string “A, G, C, T, C” of k = 4 scores 3 points, and the comparison data string “A, G, T, A, C” of k = 5 scores 0 points.
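The distance-aware scoring of this embodiment can be sketched as follows. The names `numeric_pairs` and `distance_score` are hypothetical; the sketch gives one point to each set of the comparison data string that also exists, with the same distance, somewhere in the reference data string.

```python
def numeric_pairs(data):
    """All sets of two data of a string, with the distance as a numeral."""
    return [f"{data[i]}{j - i - 1}{data[j]}"
            for i in range(len(data)) for j in range(i + 1, len(data))]

def distance_score(reference, comparison):
    """Give 1 point to each comparison set that also exists in the reference."""
    ref_sets = set(numeric_pairs(reference))
    return sum(1 for s in numeric_pairs(comparison) if s in ref_sets)

reference = "CTCGA"
comparisons = ["CTCGA", "CTTGA", "TCGAC", "AGCTC", "AGTAC"]
print([distance_score(reference, c) for c in comparisons])  # [10, 6, 6, 3, 0]
```

The totals agree with the scores given above for k = 1 to 5.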
[0048]
After all five comparison data strings have been evaluated in this way (step S12), the evaluation scores of the comparison data strings of k = 1 to 5 are compared (step S13).
That is, the comparison / calculation unit 13 reads the evaluation scores of the comparison data strings of k = 1 to 5 from the third storage area 14c, compares them, and ranks the strings in descending order of score.
[0049]
Among the comparison data strings of k = 1 to 5 described above, the comparison data string “C, T, C, G, A” of k = 1 has the highest evaluation score, followed in order by “C, T, T, G, A” (k = 2), “T, C, G, A, C” (k = 3), “A, G, C, T, C” (k = 4), and “A, G, T, A, C” (k = 5), with k = 2 and k = 3 tied at 6 points. In other words, the evaluation score represents the similarity (closeness) of each comparison data string to the reference data string.
[0050]
As described above, the present invention makes possible an objective evaluation that was difficult with conventional methods. For example, it was conventionally difficult to determine which of the comparison data string “T, C, G, A, C” of k = 3 and the comparison data string “A, G, C, T, C” of k = 4 is closer to the reference data string. If the similarity is judged from the viewpoint of the distance of the data constituting each comparison data string, however, it can be objectively determined that the comparison data string of k = 3 is closer to the reference data string.
[0051]
[Sixth embodiment]
By further applying the present invention, it is possible to extract a common data string portion from a plurality of data strings.
For example, it is assumed that there are four data strings of data string 1: “ABCCC”, data string 2: “BBCCA”, data string 3: “CABBC”, and data string 4: “ABCCB”.
Then, in the same manner as described above, these data strings 1 to 4 are decomposed into sets of two data including the distance.
The result is shown in the figure. The number between the letters is the distance between the two letters. For example, when there are two characters between B and C, the distance is “2”.
[0052]
The processing procedure of the sixth embodiment will be described with reference to the block diagram of FIG. 2 and the flowchart for this embodiment.
First, a plurality of data strings to be compared (in this case, four data strings) are input (step S21). The input / output processing unit 12 of the evaluation apparatus in FIG. 2 stores the data strings 1 to 4 in the first storage area 14a. Further, the input / output processing unit 12 decomposes each of the data strings 1 to 4 into sets of two data (step S22), attaches distance information between the data to each set, and stores the sets in the second storage area 14b (step S23).
[0053]
The comparison / calculation unit 13 reads the data sets of each of the data strings 1 to 4 from the second storage area 14b and determines whether both the arrangement order and the distance between the data match (step S24). In this case, for example, taking each data set of data string 1 as the basis, matching data sets are extracted from the data sets of the other data strings 2 and 3 (step S25). The matched data sets are then stored in the third storage area 14c together with the distance between the data of each set (step S26).
[0054]
Thereafter, when restoration is necessary or requested (step S27), the contents stored in the third storage area 14c are restored and displayed on the display unit 15 such as a display (step S28). This makes it easy to see that the common part of these four data strings 1 to 4 is “BBC” (see the column “Shared sequence” in FIG. 17).
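Steps S22 to S25 can be pictured as a set intersection over the decomposed strings. The sketch below is illustrative only: the names `numeric_pairs` and `shared_pairs` are hypothetical, and the example strings are invented for the illustration and are not the data strings of FIG. 17.

```python
def numeric_pairs(data):
    """All sets of two data of a string, with the distance as a numeral."""
    return [f"{data[i]}{j - i - 1}{data[j]}"
            for i in range(len(data)) for j in range(i + 1, len(data))]

def shared_pairs(strings):
    """Sets of two data (with distance) that are present in every input string."""
    sets = [set(numeric_pairs(s)) for s in strings]
    return set.intersection(*sets)

# Illustrative strings (hypothetical, not the ones in the figures).
print(sorted(shared_pairs(["XABCY", "ZABCW", "ABCQQ"])))
```

In this illustration the surviving sets “A0B”, “A1C”, and “B0C” together describe the contiguous common part “ABC”.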
[0055]
[Seventh embodiment]
In the seventh embodiment, the position information of the first character can also be taken into account. In the seventh embodiment shown in FIG. 19, the position information of the first character is represented by the numerical values “1” to “4” added in front of the first character. By storing this position information in the third storage area 14c together with the distance between the data and restoring the shared array in the same manner as described above, it is possible to know at which position in a long data string the common part exists.
[0056]
[Eighth embodiment]
The seventh embodiment can be applied even when the data strings are long. For example, it becomes possible to easily find the common part of the human (Homo sapiens) amino acid sequence “MIVFVRFNSSHGFPVEVDSDTSIFQ ...” and the worm (Arabidopsis thaliana) amino acid sequence “MENNREGPYSVLTRDQLKGNMKKQIA ...”.
In this case as well, the human and worm amino acid sequences (data strings) are each decomposed into sets of two data carrying position information within the data string and distance information between the data.
[0057]
For example, when the above human amino acid sequence is compared with the worm amino acid sequence, the set P__V starting at the 14th character P of the human amino acid sequence (“14P2V”) and the set P__V starting at the 8th character P of the worm amino acid sequence (“8P2V”) are common.
In the seventh and eighth embodiments described above, basically the same processing as in the sixth embodiment may be performed. In the seventh and eighth embodiments, processing is performed according to the following procedure.
[0058]
In the flowchart of FIG. 20, the same steps as those in the flowchart of FIG. 18 are denoted by the same reference numerals, and the description thereof is omitted.
In the processing of the seventh and eighth embodiments, after the sets of two data are extracted (step S22), the position information of each data set is temporarily stored in a storage unit (for example, the third storage area 14c) when the distance information is assigned (step S23) (step S31). When the matching data sets are extracted in step S25 and stored in the storage unit together with the distance information, the position information is read from the storage unit and added to the data sets (step S32). When the shared part is restored in step S28, only the common part is restored at the predetermined position on the amino acid sequence based on the position information. This makes it possible to determine the common part easily.
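Attaching position information to each set (steps S31 and S32) can be sketched as below. The names `positioned_pairs`, `token`, and `shared_with_positions` are hypothetical, positions are counted from 1, and the example strings are invented for the illustration rather than taken from the figures.

```python
from collections import defaultdict

def positioned_pairs(data):
    """Map each (first, distance, second) set to its 1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            positions[(data[i], j - i - 1, data[j])].append(i + 1)
    return positions

def token(key, position):
    """Format a set as position + first datum + distance + second datum."""
    return f"{position}{key[0]}{key[1]}{key[2]}"

def shared_with_positions(s1, s2):
    """For every set shared by two strings, list its tokens in each string."""
    p1, p2 = positioned_pairs(s1), positioned_pairs(s2)
    shared = sorted(p1.keys() & p2.keys())
    return [([token(k, i) for i in p1[k]], [token(k, i) for i in p2[k]])
            for k in shared]

# Illustrative strings (hypothetical, not taken from the figures).
for tokens1, tokens2 in shared_with_positions("XABCY", "ZZABC"):
    print(tokens1, tokens2)
```

Because each token keeps its own start position, the common part can be placed back at the right offset in each original string.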
[0059]
[Ninth Embodiment]
The present invention can also be effectively used to detect a data sequence in a certain arrangement order such as an amino acid sequence motif.
The amino acid motif sequence is, for example, a data string composed of a plurality of data (generally characters) such as C ... H ... C ..., in which the distance between the prescribed data (C, H, C) falls within a certain range. That is, a certain amino acid sequence motif can be represented, for example, as C4-14H4-15C3-5C.
[0060]
In the analysis of the amino acid sequence motif, the amino acid sequence motif is used as a reference data string, and whether or not a data string common to the reference data string exists in the comparison data string (amino acid sequence) is analyzed.
In this case, data sets having the same order as the reference data string are first extracted from the comparison data string (in the example of the table below, data sets in the order C, H, C, C are extracted).
Then, as shown in the table below, the distance information of the reference data string is given a range, and for each set of the comparison data string in the same order it is determined whether its distance information falls within the range of the distance information of the reference data string.
[0061]
[Table 1]

[0062]
Then, when the distance information matches for all of the data sets in the comparison data string (that is, when the evaluation score is 1 in the above table), it is determined that the predetermined motif exists in the comparison data string. In the example of the above table, it can be determined that the motif does not exist in comparison data string 1 but exists in comparison data string 2.
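A motif search with distance ranges might be sketched as follows. This is not the procedure of Table 1, which is not reproduced here. The function name `find_motif` is hypothetical, and reading “C4-14H4-15C3-5C” as gaps of 4 to 14, 4 to 15, and 3 to 5 data between neighbouring motif characters is an assumption; the test sequences are invented.

```python
def find_motif(sequence, symbols, gap_ranges):
    """Return the 0-based start positions of every motif occurrence.

    symbols    -- the prescribed data in order, e.g. "CHCC"
    gap_ranges -- (min, max) number of data allowed between neighbours
    """
    def extend(pos, k):
        # pos is the index matched for symbols[k]; try to place symbols[k + 1].
        if k == len(symbols) - 1:
            return True
        low, high = gap_ranges[k]
        for gap in range(low, high + 1):
            nxt = pos + gap + 1
            if nxt < len(sequence) and sequence[nxt] == symbols[k + 1]:
                if extend(nxt, k + 1):
                    return True
        return False

    return [i for i, ch in enumerate(sequence)
            if ch == symbols[0] and extend(i, 0)]

# Assumed reading of the motif notation (see the note above).
gaps = [(4, 14), (4, 15), (3, 5)]
hit = "C" + "A" * 4 + "H" + "A" * 4 + "C" + "A" * 3 + "C"
miss = "C" + "A" * 2 + "H" + "A" * 4 + "C" + "A" * 3 + "C"
print(find_motif(hit, "CHCC", gaps), find_motif(miss, "CHCC", gaps))
```

In the second sequence the first gap is only two data long, which falls outside the assumed range of 4 to 14, so no occurrence is reported.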
[0063]
Although a preferred embodiment of the present invention has been described, the present invention is not limited to the above-described embodiment.
For example, the above description has, for convenience of explanation, dealt with the method of evaluating the similarity between the reference data string and the comparison data string. However, by adding position information (= i) to the two-character array data used for evaluating the similarity (for example, C__T), quantifying the distance between 'C' and 'T' (= w), and expressing and storing one minimum core array datum as “iCwT”, for example “1C2T”, the following become possible:
(1) Continuous character array data shared by the reference data string and the comparison data string can be output.
(2) Similarly to (1), the continuous character arrays unique to each of the reference data string and the comparison data string can be output.
(3) Single-base substitutions (SNPs), which are an important issue in gene analysis, can be reliably identified.
[0064]
In addition, it can be the basis of an objective, reliable and efficient analysis method for DNA sequence analysis.
Further, in the present invention, a program for executing the processing described above can be stored in a recording medium such as an optical disk or a magnetic disk, and the program can be read from the recording medium into a computer and executed.
Furthermore, the first to ninth embodiments of the present invention can be combined as appropriate. For example, after a common part is extracted from a plurality of data strings according to the sixth embodiment, the consistency of the data arrangement order may be determined according to the first embodiment for the data string consisting of the common part.
Further, the present invention is not limited to one-dimensional data strings as described above; it can also be applied to two-dimensional data strings, and to pattern analysis for determining the similarity of images being compared.
[0065]
[Effect of the invention]
As described above, according to the present invention, the similarity and commonality of data strings can be objectively evaluated by a simple method. Furthermore, even if the number of data increases, only sets of two data are extracted, so no excessively complicated processing is required and an objective evaluation can be performed inexpensively and in a short time without placing an excessive burden on the processing power of the computer. Therefore, character strings of a certain scale can be evaluated even on a general-purpose computer such as a personal computer, and the processing can also be distributed and performed in parallel on a plurality of computers.
[Brief description of the drawings]
FIG. 1 is a flowchart illustrating an evaluation method of the present invention.
FIG. 2 is a block diagram showing a configuration of an apparatus for executing the evaluation method of the present invention.
FIG. 3 is a specific example according to the first embodiment of the evaluation method of the present invention, and shows evaluation results of comparison data strings “A, B, C, D, E” with k = 1.
FIG. 4 is a specific example according to the first embodiment of the evaluation method of the present invention, and shows evaluation results of comparison data strings “A, C, B, D, E” with k = 2.
FIG. 5 is a specific example according to the first embodiment of the evaluation method of the present invention, and shows the evaluation results of the comparison data string “C, B, A, E, D” for k = 3.
FIG. 6 is a specific example according to the first embodiment of the evaluation method of the present invention, and shows the evaluation results of the comparison data string “E, A, B, D, C” with k = 4.
FIG. 7 is a specific example according to the first embodiment of the evaluation method of the present invention, and shows the evaluation result of the comparison data string “E, D, C, B, A” with k = 5.
FIG. 8 is a table showing the results of similarity evaluation according to the second embodiment.
FIG. 9 is a table showing the results of similarity evaluation according to the third embodiment.
FIG. 10 is a table showing the results of similarity evaluation according to the fourth embodiment.
FIG. 11 is a table showing the results of similarity evaluation according to the fifth embodiment.
FIG. 12 is a specific example according to the fifth embodiment of the evaluation method of the present invention and shows the evaluation result of the comparison data string “C, T, C, G, A” with k = 1.
FIG. 13 is a specific example according to the fifth embodiment of the evaluation method of the present invention, and shows the evaluation result of the comparison data string “C, T, T, G, A” with k = 2.
FIG. 14 is a specific example according to the fifth embodiment of the evaluation method of the present invention and shows the evaluation result of the comparison data string “T, C, G, A, C” with k = 3.
FIG. 15 is a specific example according to the fifth embodiment of the evaluation method of the present invention, and shows the evaluation results of the comparison data string “A, G, C, T, C” with k = 4.
FIG. 16 is a specific example according to the fifth embodiment of the evaluation method of the present invention, and shows the evaluation results of the comparison data string “A, G, T, A, C” with k = 5.
FIG. 17 is a table showing a result of a process for detecting an array portion shared by data strings according to the sixth embodiment.
FIG. 18 is a flowchart for explaining a processing procedure according to the sixth embodiment.
FIG. 19 is a table showing a result of a process for restoring an array shared by data strings according to the seventh embodiment to the original array.
FIG. 20 is a flowchart for explaining a processing procedure of the seventh and eighth embodiments.
[Explanation of symbols]
1 Evaluation device
11 Input section
13 Comparison / Calculation section
14 Storage unit
14a First storage area
14b Second storage area
14c Third storage area
15 Display section

Claims

In an evaluation apparatus for evaluating how similar a comparison data string consisting of the same kinds of data is to a reference data string consisting of a plurality of character, numerical value, symbol, or figure data,
an input unit that accepts input of a reference data string consisting of a plurality of character, numerical value, symbol, or figure data, or of a combination of these data, and stores it in a storage unit, and that accepts input of the comparison data string and stores it in the storage unit;
a comparison unit that forms sets of two data, without changing the arrangement order, from each of the reference data string and the comparison data string stored in the storage unit and stores the sets in the storage unit, and that compares the arrangement order of the data in each set of the comparison data string stored in the storage unit with the arrangement order of the data in the set of the reference data string, stored in the storage unit, that consists of the same data; and
as a result of the comparison, a first evaluation score is assigned when the arrangement orders match and a second evaluation score is assigned when they do not match, the first and second evaluation scores assigned to all sets extracted from the comparison data string are totaled, the resulting total score is stored in the storage unit, and an evaluation is performed based on the total score;
An apparatus for evaluating a data string composed of characters or the like, characterized by comprising the foregoing.

The data string evaluation device according to claim 1, wherein 1 is assigned as the first evaluation score, and 0 is assigned as the second evaluation score.

An apparatus for evaluating a data string composed of characters or the like, which evaluates how similar a comparison data string composed of data of the same kind is to a reference data string composed of a plurality of character, numerical value, symbol, or figure data, the apparatus comprising:
an input unit that accepts input of a reference data string composed of a plurality of character, numerical value, symbol, or figure data, or of a combination of these data, and stores it in a storage unit, and that accepts input of the comparison data string and stores it in the storage unit; and
a comparison unit that forms sets of two data from the reference data string and the comparison data string stored in the storage unit, stores the sets in the storage unit, and compares the arrangement order of the data in each set of the comparison data string with the arrangement order of the data in the set of the reference data string that consists of the same data;
wherein the number of sets of the comparison data string whose arrangement order matches, or the number of sets whose arrangement order does not match, is obtained, an evaluation value of the comparison data string is calculated from the ratio of the number of matching or non-matching sets to the number of all sets, and the evaluation value is stored in the storage unit.
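
As a rough illustration of the ratio-based evaluation value, the following Python sketch divides the number of matching sets by the number of all sets. The function name `match_ratio` is hypothetical, and the sketch assumes that all two-element sets are formed in string order and that each datum occurs only once; returning 0.0 for a string with fewer than two data is also an assumption.

```python
from itertools import combinations


def match_ratio(reference, comparison):
    """Ratio of order-matching sets to all sets of the comparison string."""
    reference_pairs = set(combinations(reference, 2))
    comparison_pairs = list(combinations(comparison, 2))
    if not comparison_pairs:
        return 0.0  # assumption: no sets can be formed, so no evaluation
    matches = sum(1 for pair in comparison_pairs if pair in reference_pairs)
    return matches / len(comparison_pairs)


print(match_ratio("ABCDE", "ACBDE"))  # 0.9
print(match_ratio("ABCDE", "EABDC"))  # 0.5
```

Because the value is normalized by the number of sets, comparison strings of different lengths can be placed on the same 0 to 1 scale; the ratio of non-matching sets is simply one minus this value.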

The data string evaluation device composed of characters or the like according to any one of claims 1 to 3, wherein
sets of two data are formed from each of a plurality of data strings input from the input unit, without changing the arrangement order, and are stored in the storage unit,
sets whose arrangement order of the data matches across the plurality of data strings are extracted from the sets stored in the storage unit, and, for each of the data strings, the matching sets are stored in the storage unit together with their position information in that data string, and
the device has a calculation unit that, for each data string, reproduces the sets in which the arrangement order and the position information match, based on the information stored in the storage unit.
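
One possible reading of the extraction of matching sets together with position information is sketched below in Python. The function name `common_pairs_with_positions` and the dictionary layout are hypothetical, positions are zero-based indices, and only the first occurrence of a repeated set is kept; none of these choices is specified by the claim.

```python
from itertools import combinations


def common_pairs_with_positions(strings):
    """For each string, map every set shared by all strings to its positions."""
    per_string = []
    for s in strings:
        pairs = {}
        # enumerate() attaches the position of each datum to the set.
        for (i, a), (j, b) in combinations(enumerate(s), 2):
            pairs.setdefault((a, b), (i, j))  # assumption: keep first occurrence
        per_string.append(pairs)
    # A set is "matching" when the same ordered pair occurs in every string.
    common = set(per_string[0]).intersection(*per_string[1:])
    return [{pair: pairs[pair] for pair in sorted(common)} for pairs in per_string]


first, second = common_pairs_with_positions(["ABCDE", "ACBDE"])
print(first[("A", "B")], second[("A", "B")])  # (0, 1) (0, 2)
```

The stored positions are what allow the shared sets to be laid back over each original data string, which is how this sketch interprets "reproduces".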

4. The data string evaluation device according to claim 1, wherein the arrangement order includes a distance between the two data.
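
When the arrangement order also includes the distance between the two data, a set can be treated as matching only if both the order and the index gap agree. The Python sketch below shows this under stated assumptions: the names `pairs_with_distance` and `total_score_with_distance` are hypothetical, the distance is taken as the difference of zero-based indices, and scores of 1 and 0 are used.

```python
from itertools import combinations


def pairs_with_distance(seq):
    """Form every set of two data as (earlier, later, distance between them)."""
    return {
        (a, b, j - i)
        for (i, a), (j, b) in combinations(enumerate(seq), 2)
    }


def total_score_with_distance(reference, comparison):
    """Count comparison sets whose order and distance both occur in the reference."""
    reference_sets = pairs_with_distance(reference)
    return sum(
        1
        for (i, a), (j, b) in combinations(enumerate(comparison), 2)
        if (a, b, j - i) in reference_sets
    )


print(total_score_with_distance("ABCDE", "ABCDE"))  # 10
print(total_score_with_distance("ABCDE", "ACBDE"))  # 3
```

Including the distance makes the criterion stricter: swapping B and C leaves only the sets (A, D), (A, E), and (D, E) unchanged in both order and spacing.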

The apparatus for evaluating a data string composed of characters or the like according to any one of claims 1 to 5, wherein the data constituting the reference data string and the comparison data string is a base sequence of a gene consisting of four types of character data: A (adenine), G (guanine), T (thymine), and C (cytosine).

The data string evaluation device according to any one of claims 1 to 5, wherein the data constituting the reference data string and the comparison data string is characters or the like representing a sequence of the 20 kinds of amino acids that constitute a protein.

A computer program for a data string evaluation apparatus that evaluates how similar a comparison data string composed of data of the same kind is to a reference data string composed of a plurality of character, numerical value, symbol, or figure data, the program comprising:
a step of accepting input of a reference data string composed of a plurality of character, numerical value, symbol, or figure data, or of a combination of these data, and storing it in a storage unit, and accepting input of the comparison data string and storing it in the storage unit;
a step of forming sets of two data from the reference data string and the comparison data string stored in the storage unit, without changing the arrangement order, and storing the sets in the storage unit;
a step of comparing the arrangement order of the data in each set of the comparison data string stored in the storage unit with the arrangement order of the data in the set of the reference data string, stored in the storage unit, that consists of the same data;
a step of assigning a first evaluation score when the arrangement orders match and a second evaluation score when they do not match, and storing the scores in the storage unit; and
a step of summing the first and second evaluation scores assigned to all sets extracted from the comparison data string, and storing the total score in the storage unit.

A computer program for a data string evaluation apparatus that evaluates how similar a comparison data string composed of data of the same kind is to a reference data string composed of a plurality of character, numerical value, symbol, or figure data, the program comprising:
a step of accepting input of a reference data string composed of a plurality of character, numerical value, symbol, or figure data, or of a combination of these data, and storing it in a storage unit, and accepting input of the comparison data string and storing it in the storage unit;
a step of forming sets of two data from the reference data string and the comparison data string stored in the storage unit, and storing the sets in the storage unit;
a step of comparing the arrangement order of the data in each set of the comparison data string stored in the storage unit with the arrangement order of the data in the set of the reference data string, stored in the storage unit, that consists of the same data;
a step of obtaining the number of sets of the comparison data string whose arrangement order matches, or the number of sets whose arrangement order does not match; and
a step of obtaining an evaluation value of the comparison data string from the ratio of the number of matching or non-matching sets to the number of all sets, and storing the evaluation value in the storage unit.

The computer program for a data string evaluation apparatus according to claim 9 or 10, comprising:
a step of forming sets of two data from each of a plurality of data strings input from the input unit, without changing the arrangement order, and storing the sets in the storage unit;
a step of extracting, from the sets stored in the storage unit, sets whose arrangement order of the data matches across the plurality of data strings;
a step of storing, for each of the data strings, the matching sets in the storage unit together with their position information in that data string; and
a step of reproducing, for each data string, the sets in which the arrangement order and the position information match, based on the information stored in the storage unit.