JP2006113786A

JP2006113786A - Sequence information extraction apparatus, sequence information extraction method and sequence information extraction program

Info

Publication number: JP2006113786A
Application number: JP2004299926A
Authority: JP
Inventors: Masahiro Akasaka; 賢洋赤坂; Shigeki Yajima; 成樹谷嶋
Original assignee: Mitsubishi Space Software Co Ltd
Current assignee: Mitsubishi Space Software Co Ltd
Priority date: 2004-10-14
Filing date: 2004-10-14
Publication date: 2006-04-27

Abstract

PROBLEM TO BE SOLVED: To extract nucleic acid sequences and amino acid sequences contained in arbitrary text data, acquire sequence information on the extracted sequences, and create a link file that links the acquired sequence information to the sequence character strings.
SOLUTION: In a sequence information extraction apparatus 100, a text input part 110 inputs arbitrary text data. A candidate character string extraction part 130 extracts, from the character strings included in the text data, at least one of a candidate character string for a nucleic acid sequence and a candidate character string for an amino acid sequence. A sequence information acquisition part 150 acquires from a database the sequence information on a sequence matching each extracted candidate character string. An information link file creation part 170 creates an information link file linking the extracted candidate character string with the acquired sequence information.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

A sequence information extraction apparatus, a sequence information extraction method, and a sequence information extraction program for bioinformatics.

Bio public databases generally store amino acid or nucleotide sequences together with the natural-language comments attached to them (hereinafter, annotations).
Many public databases also provide an interface (I/F) through which they can be searched over the Internet.
There are two main search methods. One is homology search, which takes a sequence as input and searches by sequence similarity. Many homology search algorithms exist, and many programs implement them; the program called BLAST is the representative example. Well-known bio public databases that provide an I/F for running BLAST searches over the Web are NCBI (USA, http://www.ncbi.nlm.nih.gov/) and DDBJ (Japan, http://www.ddbj.nig.ac.jp/Welcome-j.html).
The other search method is keyword search, which matches input natural-language words against the annotations. The two public databases mentioned above also provide a Web I/F for keyword search.
Users of a bio public database generally run homology searches and keyword searches by manually entering search information into the Web I/F, and obtain one or more pieces of sequence information. Sequence information means an amino acid or nucleotide sequence together with the annotations attached to it.
Sequence information can also be obtained from a biomedical literature database, which stores literature information on medicine and biology. A well-known biomedical literature database is MEDLINE, provided by the NLM (National Library of Medicine), which can be searched using PubMed, NCBI's literature search system.
In the following description of acquiring sequence information from a bio public database, a biomedical literature database may be used in place of the bio public database.
A user's interest sometimes ends with obtaining the sequence information, but users also feed the acquired sequence information into separate analysis programs to analyze it further.
Because many of these analysis programs have no Web I/F, the user first saves the acquired sequence information to the local environment and then runs the analysis program there. This procedure is described below.

FIG. 1 shows the procedure for using a bio public database in the prior art.
In FIG. 1, (1) the user accesses the Web page of a bio public database with a Web browser (hereinafter, browser), enters search information into the Web I/F, and runs a homology search or a keyword search.
(2) The bio public database runs the search, converts the result to HTML (HyperText Markup Language), and returns it to the browser on the client PC (personal computer) operated by the user.
(3) The user reads the returned HTML, decides which of the character strings linked to sequence information are needed, moves the on-screen cursor over each such string, selects the sequence information with a mouse click or similar, and instructs the bio public database, through the browser, to download the selected sequence information.
(4) The browser acquires the sequence information from the bio public database, downloads it to the user's client PC, and stores it as a file.
(5) The user moves the stored file to an analysis server in the local environment and runs the desired analysis program.
(6) The user obtains the results output by the analysis program.
The user repeated the above steps to obtain the necessary sequence information and analysis results.

The following is an improvement on the above prior art.
First, the data stored in the bio public database is downloaded and a mirror database is built in the local environment.
Next, a server program with a Web I/F for searching the mirror database is developed.
The user runs searches through the Web I/F of that server program rather than that of the bio public database.
Because the server program is developed independently, it can offer, for example, a Web I/F that passes search results to other analysis programs as input. This removes the need for the user to move sequence information to the analysis program by hand or to type the command that runs the analysis program. This procedure is described below.

FIG. 2 shows the procedure for using a bio public database in the prior art.
In FIG. 2, (1) the server program downloads the sequence information of the bio public database in advance and builds a local database.
(2) The user accesses, with a browser, the Web page that serves as the interface of the server program, enters search information into the Web I/F, and runs a homology search or a keyword search.
(3) The server program runs the search, converts the result to HTML, and returns it to the browser on the client PC operated by the user.
(4) The user reads the returned HTML, decides which of the character strings linked to sequence information are needed, moves the on-screen cursor over each such string, selects the sequence information with a mouse click or similar, and instructs the server program to run an analysis program on the sequence information.
(5) The server program takes the sequence information selected by the user as input and runs the analysis program specified by the user on the analysis server.
(6) When the analysis program finishes, the server program acquires the result from the analysis server, converts it to HTML, and returns it to the browser on the client PC.
The user repeated the above steps to obtain the necessary sequence information and analysis results.

The prior art shown in FIG. 2 had the following advantages over the prior art shown in FIG. 1.
In the prior art of FIG. 1, using the search results of a bio public database as input to an analysis program required manually downloading the acquired sequence information to the client PC and then manually moving it to the analysis server. In the prior art of FIG. 2, the server program automated these steps.
In the prior art of FIG. 1, the user had to type a command to run the analysis program. In the prior art of FIG. 2, the server program automated this.
In the prior art of FIG. 1, the analysis program results obtained by the user were mostly expressed as text. In the prior art of FIG. 2, the server program converts the analysis results to HTML, so graphical or interactive presentation was also possible, depending on how the server program was implemented.

The prior art shown in FIG. 2 downloaded the data of the bio public database locally and built a mirror site on which the data could be searched. The data of a bio public database is very large, from several GB (gigabytes) to several hundred GB, and its specifications differ from database to database. Building a mirror site therefore required expensive H/W (hardware) and complex S/W (software).
In addition, the Web I/F of the server program differs from that of the bio public database because of the differences in function. Users therefore have to get used to operating an application whose look and feel differs from that of the widely known bio public databases.

To address the problem that building a mirror site of a bio public database is expensive and that using one requires complicated processing, there is a system comprising a main server on which a mirror database of the bio public database is built and a search server that stores index files for searching. The user searches from the search server, and the search server uses the index files to acquire data from the main server (Patent Document 1).

There is also a technique that displays the results of investigations on biological data as charts (Patent Document 2).
Patent Document 1: JP 2002-366553 A
Patent Document 2: Japanese translation of PCT publication No. 2003-509776

In the prior art, however, to obtain or analyze sequence information, the user had to enter each nucleic acid sequence or amino acid sequence of interest into the search interface one at a time.

Furthermore, to display sequence information or its analysis results, the user had to save each piece of acquired sequence information and each analysis result as a separate file, and then find and open the needed file among the saved files.

Likewise, to obtain or analyze sequence information for the nucleic acid sequences and amino acid sequences shown in an arbitrary Web page or text file, the user had to read the Web page or text file by eye, identify every nucleic acid sequence and amino acid sequence it contained, and enter them into the search interface one at a time.

The present invention has been made to solve the above problems. One object is to make it possible to obtain and analyze sequence information without the user entering anything into a search interface.

Another object is to make it possible to display sequence information and its analysis results without the user having to save sequence information files and analysis result files and then find and open the saved files.

A further object is to make it possible to extract the nucleic acid sequences and amino acid sequences shown in an arbitrary Web page or text file without the user having to read the page or file by eye and identify every sequence in it.

The sequence information extraction apparatus of the present invention comprises: a text input unit that inputs text data; a candidate character string extraction unit that extracts, from the character strings included in the text data input by the text input unit, at least one of a candidate character string for a nucleic acid sequence and a candidate character string for an amino acid sequence; and a sequence information acquisition unit that acquires from a database, for each candidate character string extracted by the candidate character string extraction unit, the sequence information of sequences homologous to that candidate character string.

The apparatus further comprises an information link file generation unit that generates an information link file linking the candidate character string extracted by the candidate character string extraction unit with the sequence information acquired by the sequence information acquisition unit.

The apparatus further comprises a score determination unit that determines a homology score for each homologous sequence whose sequence information was acquired by the sequence information acquisition unit. The information link file generation unit generates an information link file linking the candidate character string extracted by the candidate character string extraction unit with the sequence information, among that acquired by the sequence information acquisition unit, of a sequence that the score determination unit determined to have a high homology score.
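The role of the score determination unit can be illustrated with a minimal sketch. The `Hit` record, its field names, and the `min_score` and `top_n` parameters are assumptions made for illustration; the source does not specify how a "high" homology score is decided.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One homologous sequence returned by the database (hypothetical record)."""
    accession: str    # hypothetical identifier of the database entry
    score: float      # homology score reported for this hit
    annotation: str   # annotation text attached to the sequence


def select_high_scoring(hits, min_score=0.0, top_n=1):
    """Keep hits scoring at least min_score, best first, at most top_n of them."""
    kept = [h for h in hits if h.score >= min_score]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_n]


hits = [
    Hit("A", 52.0, "hypothetical protein"),
    Hit("B", 310.5, "kinase"),
    Hit("C", 120.0, "kinase-like"),
]
best = select_high_scoring(hits, min_score=100.0, top_n=2)
print([h.accession for h in best])
```

Only the hits that survive this selection would then be linked to the candidate character string or passed on for analysis.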

The apparatus further comprises a sequence analysis unit that analyzes sequences for the sequence information acquired by the sequence information acquisition unit, and an analysis link file generation unit that generates an analysis link file linking the candidate character string extracted by the candidate character string extraction unit with the analysis result produced by the sequence analysis unit.

The apparatus further comprises a score determination unit that determines a homology score for each homologous sequence whose sequence information was acquired by the sequence information acquisition unit. The sequence analysis unit analyzes a sequence that is among those whose information was acquired by the sequence information acquisition unit and that the score determination unit determined to have a high homology score.

The candidate character string extraction unit includes a candidate character string determination unit and uses it to extract candidate character strings. For the character strings included in the text data input by the text input unit, the candidate character string determination unit makes at least one of two determinations: that a character string composed of characters that constitute nucleic acid sequences is a candidate character string for a nucleic acid sequence, and that a character string composed of characters that constitute amino acid sequences is a candidate character string for an amino acid sequence.
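The determination described above can be sketched as a simple alphabet test. This is a minimal illustration and not the patented implementation: the character sets, the minimum length, and the function name are all assumptions, since the source only says that strings composed of sequence-constituting characters become candidates.

```python
import re

# Assumed alphabets; the source does not list the exact characters.
NUCLEOTIDE_CHARS = set("ACGTUN")
AMINO_ACID_CHARS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 10  # hypothetical threshold to skip short ordinary words


def extract_candidates(text, min_length=MIN_LENGTH):
    """Return (kind, string) pairs for runs of letters that look like sequences."""
    candidates = []
    for token in re.findall(r"[A-Za-z]+", text):
        upper = token.upper()
        if len(upper) < min_length:
            continue
        letters = set(upper)
        # Test the nucleotide alphabet first because it is a subset
        # of the amino acid alphabet apart from U.
        if letters <= NUCLEOTIDE_CHARS:
            candidates.append(("nucleic_acid", upper))
        elif letters <= AMINO_ACID_CHARS:
            candidates.append(("amino_acid", upper))
    return candidates


sample = "The primer ATGCGTACGTTAGC binds near MKVLAAGIVGLLLAQ in the construct."
print(extract_candidates(sample))
```

An alphabet test alone will also accept some ordinary long words, which fits the description of the results as candidates to be checked against a database rather than confirmed sequences.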

The text input unit inputs markup language source code as the text data. The candidate character string extraction unit includes a tag character string extraction unit and a candidate character string determination unit, and uses them to extract candidate character strings. The tag character string extraction unit extracts the character strings enclosed by tags in the markup language source code input by the text input unit. For the character strings extracted by the tag character string extraction unit, the candidate character string determination unit makes at least one of two determinations: that a character string composed of characters that constitute nucleic acid sequences is a candidate character string for a nucleic acid sequence, and that a character string composed of characters that constitute amino acid sequences is a candidate character string for an amino acid sequence.
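As a rough sketch of what a tag character string extraction step might look like, the following collects the text enclosed by tags using Python's standard `html.parser` module. The class and function names are hypothetical, and the source does not say which parsing method the apparatus uses.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the non-empty text found between tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # handle_data receives the character data between tags.
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def extract_tag_enclosed_strings(html_source):
    """Return the list of text strings enclosed by tags in html_source."""
    collector = TagTextCollector()
    collector.feed(html_source)
    collector.close()
    return collector.texts


page = "<html><body><p>Sequence:</p><pre>ATGCGTACGTTAGC</pre></body></html>"
print(extract_tag_enclosed_strings(page))
```

Each returned string would then go through the nucleic acid and amino acid candidate determinations, so tag names and attributes are never mistaken for sequence text.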

The sequence information extraction apparatus of the present invention comprises: a text input unit that inputs, as text data, markup language source code that is output from a database and represents sequence information; a source code format determination unit that determines the description format used in the markup language source code input by the text input unit; a sequence information extraction unit that extracts the sequence information represented by the markup language source code on the basis of the description format determined by the source code format determination unit; and a sequence analysis unit that analyzes sequences for the sequence information extracted by the sequence information extraction unit.

The apparatus further comprises an analysis link file generation unit that generates an analysis link file linking the sequence information represented by the markup language source code input by the text input unit with the analysis result produced by the sequence analysis unit.

The sequence information extraction method of the present invention executes: a text input step of inputting text data; a candidate character string extraction step of extracting, from the character strings included in the text data input in the text input step, at least one of a candidate character string for a nucleic acid sequence and a candidate character string for an amino acid sequence; and a sequence information acquisition step of acquiring from a database, for each candidate character string extracted in the candidate character string extraction step, the sequence information of sequences homologous to that candidate character string.

The sequence information extraction method of the present invention executes: a text input step of inputting, as text data, markup language source code that is output from a database and represents sequence information; a source code format determination step of determining the description format used in the markup language source code input in the text input step; and a sequence analysis step of analyzing sequences for the sequence information represented by the markup language source code, on the basis of the description format determined in the source code format determination step.

The sequence information extraction program of the present invention causes a computer to execute the above sequence information extraction method.

According to the present invention, sequence information can be obtained and analyzed without the user entering anything into a search interface.

In addition, sequence information and its analysis results can be displayed without the user having to save sequence information files and analysis result files and then find and open the saved files.

In addition, the nucleic acid sequences and amino acid sequences shown in an arbitrary Web page or text file can be extracted without the user having to read the page or file by eye and identify every sequence in it.

Embodiment 1.
FIG. 3 shows the procedure for using a bio public database in Embodiment 1.
In FIG. 3, (1) the sequence information extraction apparatus 100 inputs the HTML source code of the Web page displayed in the Web browser 210 and extracts from it the character strings that represent nucleic acid sequences or amino acid sequences.
(2) The sequence information extraction apparatus 100 requests from the bio public database 200 the sequence information for the nucleic acid sequence or amino acid sequence represented by each extracted character string.
(3) The bio public database 200 returns the sequence information in response to the request from the sequence information extraction apparatus 100.
(4) The sequence information extraction apparatus 100 runs an analysis program with the sequence information returned from the bio public database 200 as input, creates HTML source code that links the analyzed nucleic acid sequence or amino acid sequence extracted from the HTML source code in (1) to the analysis result, and displays it in the Web browser 210 as a Web page.
The sequence information extraction apparatus 100 performs the above processing for each nucleic acid sequence and each amino acid sequence contained in the HTML source code. The user can thus obtain the necessary sequence information and analysis results, and can also view an analysis result by moving the cursor over the character string that represents a nucleic acid sequence or amino acid sequence on the Web page and clicking the mouse or the like.
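Step (4) above, producing a page in which each sequence string links to its analysis result, might be sketched as follows. This is an illustration under stated assumptions: the function name, the mapping from candidate string to result URL, and the result file names are hypothetical, and the input is treated as plain text rather than full HTML for brevity.

```python
import html


def link_candidates(page_text, result_urls):
    """Wrap each candidate string in page_text with a link to its result page.

    result_urls maps a candidate character string to the (hypothetical)
    location of the file holding its sequence information or analysis result.
    """
    # Escape the text first so the anchors added below are the only markup.
    linked = html.escape(page_text)
    for candidate, url in result_urls.items():
        anchor = '<a href="{}">{}</a>'.format(
            html.escape(url, quote=True), candidate
        )
        linked = linked.replace(candidate, anchor)
    return linked


text = "Primer ATGCGTACGTTAGC was used."
urls = {"ATGCGTACGTTAGC": "results/seq1.html"}
print(link_candidates(text, urls))
```

Clicking the linked sequence in the browser would then open the stored result, which matches the interaction described for the Web page in FIG. 3. The sketch assumes candidate strings consist only of letters and that no candidate is a substring of another or of a result URL.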

FIG. 4 shows the external appearance of the sequence information extraction apparatus 100 in Embodiment 1.
In FIG. 4, the sequence information extraction apparatus 100 comprises a system unit 910, a CRT (Cathode Ray Tube) display device 901, a keyboard (K/B) 902, a mouse 903, a compact disc device (CDD) 905, a printer device 906, and a scanner device 907, which are connected by cables.
The sequence information extraction apparatus 100 is also connected by cable to a FAX machine 932 and a telephone 931, and is connected to the Internet 940 via a local area network (LAN) 942 and a web server 941.

FIG. 5 is a hardware configuration diagram of the sequence information extraction apparatus 100 in Embodiment 1.
In FIG. 5, the sequence information extraction apparatus 100 includes a CPU (Central Processing Unit) 911 that executes programs. The CPU 911 is connected via a bus 912 to a ROM 913, a RAM 914, a communication board 915, the CRT display device 901, the K/B 902, the mouse 903, an FDD (Flexible Disk Drive) 904, a magnetic disk device 920, the CDD 905, the printer device 906, and the scanner device 907.
The RAM 914 is an example of volatile memory. The ROM 913, the FDD 904, the CDD 905, the magnetic disk device 920, and an optical disk device are examples of nonvolatile memory. These are examples of a storage device or storage unit.
The communication board 915 is connected to the FAX machine 932, the telephone 931, the LAN 942, and so on.
The communication board 915, the K/B 902, the scanner device 907, the FDD 904, and the like are examples of an information input unit.
The communication board 915, the CRT display device 901, and the like are examples of an output unit.

The communication board 915 need not be connected to the LAN 942; it may instead be connected directly to the Internet 940 or to a WAN (wide area network) such as ISDN. In that case the sequence information extraction apparatus 100 is connected directly to the Internet 940 or to the WAN such as ISDN, and the web server 941 is unnecessary.
The magnetic disk device 920 stores an operating system (OS) 921, a window system 922, a program group 923, and a file group 924. The program group 923 is executed by the CPU 911, the OS 921, and the window system 922.

The program group 923 stores the programs that execute the functions described as "... unit" in the embodiments below. The programs are read and executed by the CPU 911.
The file group 924 stores, as "... files", the items described in the embodiments below as "determination result of ...", "calculation result of ...", and "processing result of ...".
The arrows in the flowcharts described in the embodiments below mainly indicate data input and output. For that input and output, the data is recorded on the magnetic disk device 920 or on another recording medium such as an FD (Flexible Disk cartridge), optical disc, CD (compact disc), MD (MiniDisc), or DVD (Digital Versatile Disk), or it is transmitted over a signal line or other transmission medium.

また、以下に述べる実施の形態の説明において「〜部」として説明するものは、ＲＯＭ９１３に記憶されたファームウェアで実現されていても構わない。或いは、ソフトウェアのみ、或いは、ハードウェアのみ、或いは、ソフトウェアとハードウェアとの組み合わせ、さらには、ファームウェアとの組み合わせで実施されても構わない。 In addition, what is described as a “˜ unit” in the description of the embodiments below may be realized by firmware stored in the ROM 913. Alternatively, it may be implemented by software alone, by hardware alone, by a combination of software and hardware, or further in combination with firmware.

また、以下に述べる実施の形態を実施するプログラムは、磁気ディスク装置９２０、ＦＤ、光ディスク、ＣＤ、ＭＤ、ＤＶＤ等のその他の記録媒体による記録装置を用いて記憶されても構わない。 In addition, a program that implements the embodiment described below may be stored using a recording device using another recording medium such as a magnetic disk device 920, FD, optical disk, CD, MD, or DVD.

実施の形態１における配列情報抽出装置１００は、図３に示した処理を含めて以下のような処理を行うことができる。
（１）任意のテキストデータを入力し、テキストデータに含まれる核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、核酸配列またはアミノ酸配列を示す文字列とその配列情報とをリンクさせた情報リンクファイルを作成する。
（２）任意のＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードのＨＴＭＬタグに括られた文字列の中から核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、核酸配列またはアミノ酸配列を示す文字列とその配列情報とをリンクさせた情報リンクファイルを作成する。
（３）バイオ公共データベースを使用した配列情報の検索結果を示すＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードに示される配列情報を取得し、配列情報に対して配列を解析し、配列情報と解析結果とをリンクさせた解析リンクファイルを作成する。
（４）任意のテキストデータを入力し、テキストデータに含まれる核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、配列情報に対して配列を解析し、核酸配列またはアミノ酸配列を示す文字列とその解析結果とをリンクさせた解析リンクファイルを作成する。
（５）任意のＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードのＨＴＭＬタグに括られた文字列の中で核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、配列情報に対して配列を解析し、核酸配列またはアミノ酸配列を示す文字列とその解析結果とをリンクさせた解析リンクファイルを作成する。 The sequence information extraction apparatus 100 according to Embodiment 1 can perform the following processing, including the processing shown in FIG. 3.
(1) Input arbitrary text data, extract a character string indicating a nucleic acid sequence or amino acid sequence contained in the text data, acquire the sequence information of the nucleic acid sequence or amino acid sequence, and create an information link file that links the character string indicating the nucleic acid sequence or amino acid sequence with its sequence information.
(2) Input arbitrary HTML source code, extract a character string indicating a nucleic acid sequence or amino acid sequence from the character strings enclosed in the HTML tags of the HTML source code, acquire the sequence information of the nucleic acid sequence or amino acid sequence, and create an information link file that links the character string indicating the nucleic acid sequence or amino acid sequence with its sequence information.
(3) Input HTML source code showing the result of a sequence information search performed with a bio public database, acquire the sequence information shown in the HTML source code, analyze the sequence based on the sequence information, and create an analysis link file that links the sequence information with the analysis result.
(4) Input arbitrary text data, extract a character string indicating a nucleic acid sequence or amino acid sequence contained in the text data, acquire the sequence information of the nucleic acid sequence or amino acid sequence, analyze the sequence based on the sequence information, and create an analysis link file that links the character string indicating the nucleic acid sequence or amino acid sequence with its analysis result.
(5) Input arbitrary HTML source code, extract a character string indicating a nucleic acid sequence or amino acid sequence from the character strings enclosed in the HTML tags of the HTML source code, acquire the sequence information of the nucleic acid sequence or amino acid sequence, analyze the sequence based on the sequence information, and create an analysis link file that links the character string indicating the nucleic acid sequence or amino acid sequence with its analysis result.

図６は、実施の形態１における配列情報抽出装置１００の構成図である。
図７は、実施の形態１における配列情報抽出方法の処理の流れを示す図である。
配列情報抽出装置１００の処理（１）〜（５）について、図６と図７に基づいて以下に説明する。 FIG. 6 is a configuration diagram of the sequence information extraction apparatus 100 according to the first embodiment.
FIG. 7 is a diagram showing a processing flow of the sequence information extraction method according to the first embodiment.
Processing (1) to (5) of the sequence information extracting apparatus 100 will be described below with reference to FIGS. 6 and 7.

まず、（１）任意のテキストデータを入力し、テキストデータに含まれる核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、核酸配列またはアミノ酸配列を示す文字列とその配列情報とをリンクさせた情報リンクファイルを作成することについて説明する。 First, (1) input arbitrary text data, extract a character string indicating the nucleic acid sequence or amino acid sequence included in the text data, obtain the sequence information of the nucleic acid sequence or amino acid sequence, and indicate the nucleic acid sequence or amino acid sequence The creation of an information link file in which a character string and its sequence information are linked will be described.

配列情報抽出装置１００の構成について図６に基づいて説明する。ここでは配列情報抽出装置の処理（１）の処理に関する構成部分を説明する。
記憶部１０４は、テキストデータやＨＴＭＬソースコードや配列情報やリンクファイルなどの配列情報抽出装置１００が使用する各種データを記憶する。また、配列情報抽出装置１００を構成する各部の出力するデータも記憶部１０４に記憶され、各部の入力データとなる。
ブラウザ実行部１０１は、記憶部１０４に記憶されたＨＴＭＬソースコードやリンクファイルなどをＷＥＢブラウザ２１０にＷＥＢページとして表示する。
エディタ実行部１０５は、記憶部１０４に記憶されたテキストデータをテキストエディタ２２０に表示する。
ユーザＩ／Ｆ部１０３は、ユーザの指定する処理命令のデータを入力し、入力した処理命令データを記憶部１０４に記憶し、またテキスト入力部１１０に出力する。
ここで、ユーザＩ／Ｆ部１０３は、テキストエディタ２２０に表示されたテキストデータに含まれる核酸配列およびアミノ酸配列の配列情報の取得を示す処理命令データをユーザから入力されたものとして以下の説明をする。
ユーザＩ／Ｆ部１０３は、ＷＥＢブラウザ２１０やテキストエディタ２２０に備えられたツールバーのような形態が好ましい。 The configuration of the sequence information extraction apparatus 100 will be described with reference to FIG. Here, the components related to the process (1) of the sequence information extraction apparatus will be described.
The storage unit 104 stores various data used by the sequence information extraction device 100 such as text data, HTML source code, sequence information, and link files. In addition, data output from each unit constituting the sequence information extracting apparatus 100 is also stored in the storage unit 104 and becomes input data of each unit.
The browser execution unit 101 displays the HTML source code and the link file stored in the storage unit 104 on the WEB browser 210 as a WEB page.
The editor execution unit 105 displays the text data stored in the storage unit 104 on the text editor 220.
The user I / F unit 103 inputs processing instruction data designated by the user, stores the input processing instruction data in the storage unit 104, and outputs the data to the text input unit 110.
Here, the following description assumes that the user I / F unit 103 has received from the user processing instruction data indicating acquisition of the sequence information of the nucleic acid sequences and amino acid sequences included in the text data displayed in the text editor 220.
The user I / F unit 103 preferably has a form such as a toolbar provided in the WEB browser 210 or the text editor 220.

テキスト入力部１１０は、ユーザＩ／Ｆ部１０３が出力した処理命令データに基づいて、エディタ実行部１０５が表示するテキストデータを記憶部１０４から取得し、取得したテキストデータを入力とする。さらに、入力したテキストデータをソースコード形式判定部１２０に出力する。
ソースコード形式判定部１２０は、テキスト入力部１１０が出力したテキストデータの形式を判定し処理を切り換える。
テキストデータの形式がＨＴＭＬソースコードでない場合には候補文字列抽出部１３０の候補文字列判定部１３１にテキストデータを出力することで処理を切り換える。 The text input unit 110 acquires the text data displayed by the editor execution unit 105 from the storage unit 104 based on the processing command data output from the user I / F unit 103, and uses the acquired text data as input. Further, the input text data is output to the source code format determination unit 120.
The source code format determination unit 120 determines the format of text data output from the text input unit 110 and switches processing.
If the format of the text data is not an HTML source code, the process is switched by outputting the text data to the candidate character string determination unit 131 of the candidate character string extraction unit 130.
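The format determination that switches processing could be approximated as follows. This is a minimal sketch under an assumed heuristic (presence of a doctype declaration or an opening `<html>` tag); the text does not specify how the format is judged, and the name `is_html_source` is hypothetical.

```python
import re

# Assumed heuristic, not taken from the patent: the text is treated as HTML
# source code if it contains a doctype declaration or an opening <html> tag.
_HTML_MARKER = re.compile(r"<!DOCTYPE\s+html|<html[\s>]", re.IGNORECASE)


def is_html_source(text: str) -> bool:
    """Return True if the text looks like HTML source code."""
    return _HTML_MARKER.search(text) is not None
```

Text for which this returns False would be passed to the candidate character string determination step.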

候補文字列抽出部１３０は、候補文字列判定部１３１を備え、ソースコード形式判定部１２０が出力したテキストデータから、核酸配列の候補文字列またはアミノ酸配列の候補文字列を抽出する。さらに、抽出した候補文字列を配列情報取得部１５０に出力する。
候補文字列判定部１３１は、ソースコード形式判定部１２０が出力したテキストデータから、核酸配列の候補文字列またはアミノ酸配列の候補文字列を抽出する。さらに、抽出した候補文字列を配列情報取得部１５０に出力する。
候補文字列判定部１３１は、Ａ（アデニン）、Ｔ（チミン）、Ｇ（グアニン）、Ｃ（シトシン）の４つの文字（以下、ＤＮＡ構成文字とする）から構成される文字列をＤＮＡ核酸配列の候補文字列（以下、ＤＮＡ候補文字列とする）として抽出する。
また、Ａ（アデニン）、Ｕ（ウラシル）、Ｇ（グアニン）、Ｃ（シトシン）の４つの文字（以下、ＲＮＡ構成文字）から構成される文字列をＲＮＡ核酸配列の候補文字列（以下、ＲＮＡ候補文字列とする）として抽出する。
また、Ａ（アラニン（Ａｌａ））、Ｖ（バリン（Ｖａｌ））、Ｌ（ロイシン（Ｌｅｕ））、Ｉ（イソロイシン（Ｉｌｅ））、Ｆ（フェニルアラニン（Ｐｈｅ））、Ｗ（トリプトファン（Ｔｒｐ））、Ｍ（メチオニン（Ｍｅｔ））、Ｐ（プロリン（Ｐｒｏ））、Ｇ（グリシン（Ｇｌｙ））、Ｙ（チロシン（Ｔｙｒ））、Ｓ（セリン（Ｓｅｒ））、Ｔ（スレオニン（Ｔｈｒ））、Ｃ（システイン（Ｃｙｓ））、Ｎ（アスパラギン（Ａｓｎ））、Ｑ（グルタミン（Ｇｌｎ））、Ｋ（リシン（Ｌｙｓ））、Ｒ（アルギニン（Ａｒｇ））、Ｈ（ヒスチジン（Ｈｉｓ））、Ｄ（アスパラギン酸（Ａｓｐ））、Ｅ（グルタミン酸（Ｇｌｕ））の２０の文字（以下、アミノ酸構成文字とする）から構成される文字列をアミノ酸配列の候補文字列（以下、アミノ酸候補文字列とする）として抽出する。 The candidate character string extraction unit 130 includes a candidate character string determination unit 131, and extracts a candidate character string of a nucleic acid sequence or a candidate character string of an amino acid sequence from the text data output by the source code format determination unit 120. Further, the extracted candidate character string is output to the sequence information acquisition unit 150.
The candidate character string determination unit 131 extracts a candidate character string of a nucleic acid sequence or a candidate character string of an amino acid sequence from the text data output by the source code format determination unit 120. Further, the extracted candidate character string is output to the sequence information acquisition unit 150.
The candidate character string determination unit 131 extracts a character string composed of the four characters A (adenine), T (thymine), G (guanine), and C (cytosine) (hereinafter referred to as DNA constituent characters) as a candidate character string of a DNA nucleic acid sequence (hereinafter referred to as a DNA candidate character string).
It also extracts a character string composed of the four characters A (adenine), U (uracil), G (guanine), and C (cytosine) (hereinafter referred to as RNA constituent characters) as a candidate character string of an RNA nucleic acid sequence (hereinafter referred to as an RNA candidate character string).
It also extracts a character string composed of the 20 characters A (alanine (Ala)), V (valine (Val)), L (leucine (Leu)), I (isoleucine (Ile)), F (phenylalanine (Phe)), W (tryptophan (Trp)), M (methionine (Met)), P (proline (Pro)), G (glycine (Gly)), Y (tyrosine (Tyr)), S (serine (Ser)), T (threonine (Thr)), C (cysteine (Cys)), N (asparagine (Asn)), Q (glutamine (Gln)), K (lysine (Lys)), R (arginine (Arg)), H (histidine (His)), D (aspartic acid (Asp)), and E (glutamic acid (Glu)) (hereinafter referred to as amino acid constituent characters) as a candidate character string of an amino acid sequence (hereinafter referred to as an amino acid candidate character string).

例えば、形態素解析を行ってテキストデータから単語を抽出し、抽出した単語に対して、ＤＮＡ構成文字以外の文字が含まれるか判定し、判定した結果、ＤＮＡ構成文字以外の文字が含まれない単語をＤＮＡ候補文字列として抽出する。
ＲＮＡ候補文字列、アミノ酸候補文字列も同様にして抽出する。 For example, morphological analysis is performed to extract words from the text data, each extracted word is checked for characters other than DNA constituent characters, and any word that contains no such characters is extracted as a DNA candidate character string.
RNA candidate character strings and amino acid candidate character strings are extracted in the same manner.
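The word-based extraction can be sketched as follows. This is an illustration only: splitting on non-letter characters stands in for morphological analysis, uppercase one-letter codes are assumed, and the names `DNA_CHARS`, `RNA_CHARS`, `AMINO_CHARS`, and `extract_word_candidates` are hypothetical.

```python
import re

# Constituent characters as listed in the description (uppercase assumed).
DNA_CHARS = set("ATGC")
RNA_CHARS = set("AUGC")
AMINO_CHARS = set("AVLIFWMPGYSTCNQKRHDE")


def extract_word_candidates(text: str, alphabet: set[str]) -> list[str]:
    """Return the words that contain no character outside `alphabet`."""
    # Assumption: a plain split into letter runs replaces morphological analysis.
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if set(word) <= alphabet]
```

Note that a word such as `ATGCGT` qualifies as both a DNA candidate and an amino acid candidate, and that very short words can qualify too; the later score determination is what separates real sequences from coincidental matches.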

また例えば、テキストデータに対してＤＮＡ構成文字の検索を行い、検索した文字の次の文字がＤＮＡ構成文字であるか判定し、判定した結果、次の文字がＤＮＡ構成文字の場合は、さらに次の文字がＤＮＡ構成文字のいずれかであるか判定し、ＤＮＡ構成文字が３文字以上連続する場合、その文字列をＤＮＡ候補文字列として抽出する。
ＲＮＡ候補文字列、アミノ酸候補文字列も同様にして抽出する。
この場合の連続する構成文字数は３文字に限らず任意である。 As another example, the text data is searched for a DNA constituent character, and it is determined whether the character following the found character is also a DNA constituent character. If it is, the next character is checked in the same way, and when three or more DNA constituent characters occur consecutively, that character string is extracted as a DNA candidate character string.
RNA candidate character strings and amino acid candidate character strings are extracted in the same manner.
In this case, the required number of consecutive constituent characters is not limited to three and may be chosen arbitrarily.
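The run-based extraction maps naturally onto a regular expression. The sketch below is one possible rendering, not the patent's implementation; the function name `extract_run_candidates` and its parameters are hypothetical, and the offset is returned so that a later step could locate the run in the text.

```python
import re


def extract_run_candidates(
    text: str, alphabet: str = "ATGC", min_len: int = 3
) -> list[tuple[int, str]]:
    """Return (offset, run) pairs for runs of constituent characters.

    A run qualifies when it consists only of characters in `alphabet`
    and is at least `min_len` characters long.
    """
    pattern = re.compile(f"[{re.escape(alphabet)}]{{{min_len},}}")
    return [(match.start(), match.group()) for match in pattern.finditer(text)]
```

Passing `alphabet="AUGC"` or the 20 amino acid characters gives the RNA and amino acid variants, and `min_len` corresponds to the arbitrary minimum run length.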

配列情報取得部１５０は、候補文字列抽出部１３０の候補文字列判定部１３１が出力した候補文字列に対して、候補文字列に相同するＤＮＡ核酸配列またはＲＮＡ核酸配列またはアミノ酸配列の配列情報をバイオ公共データベースで検索する。さらに、検索した配列情報を取得しスコア判定部１６０に出力する。
バイオ公共データベースには、ＮＣＢＩやＤＤＢＪやＧｅｎＢａｎｋなどがある。
配列情報取得部１５０は、ＮＣＢＩとＤＤＢＪとＧｅｎＢａｎｋのいずれかから配列情報を取得してもよいし、その他のデータベースから配列情報を取得してもよい。使用するデータベースに合わせて検索処理、配列情報取得処理を行う。
また、配列情報取得部１５０が取得する配列情報とは、配列情報の内容を示すデータでもよいし、配列情報の内容を示すデータにリンクされたリンク情報でもよい。
また、配列情報取得部１５０がスコア判定部１６０に出力する配列情報は、候補文字列抽出部１３０が出力した各候補文字列とそれぞれの配列情報とを関連付けたテーブルとする。 The sequence information acquisition unit 150 obtains sequence information of a DNA nucleic acid sequence, an RNA nucleic acid sequence, or an amino acid sequence that is homologous to the candidate character string with respect to the candidate character string output by the candidate character string determination unit 131 of the candidate character string extraction unit 130. Search in bio public database. Furthermore, the retrieved sequence information is acquired and output to the score determination unit 160.
Bio public databases include NCBI, DDBJ, and GenBank.
The sequence information acquisition unit 150 may acquire sequence information from any of NCBI, DDBJ, and GenBank, or may acquire sequence information from other databases. Search processing and sequence information acquisition processing are performed according to the database to be used.
Further, the sequence information acquired by the sequence information acquisition unit 150 may be data indicating the content of the sequence information or link information linked to data indicating the content of the sequence information.
The sequence information output from the sequence information acquisition unit 150 to the score determination unit 160 is a table in which each candidate character string output from the candidate character string extraction unit 130 is associated with each sequence information.
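The table that associates each candidate character string with its sequence information could be built as sketched below. The database query itself is left abstract: `lookup` is a hypothetical caller-supplied function, and records are represented as plain dictionaries purely for illustration.

```python
from typing import Callable

# Hypothetical type: a lookup takes a candidate character string and returns
# the sequence-information records found for it in some database.
Lookup = Callable[[str], list[dict]]


def build_sequence_table(
    candidates: list[str], lookup: Lookup
) -> dict[str, list[dict]]:
    """Associate each candidate character string with its retrieved records."""
    table: dict[str, list[dict]] = {}
    for candidate in candidates:
        if candidate not in table:  # query each distinct candidate only once
            table[candidate] = lookup(candidate)
    return table
```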

図８は、実施の形態１における配列情報の内容を示す図である。
ＮＣＢＩやＤＤＢＪやＧｅｎＢａｎｋなどのバイオ公共データベースから取得する配列情報には図８に示すような情報が含まれる。 FIG. 8 is a diagram showing the contents of the sequence information in the first embodiment.
The sequence information acquired from bio public databases such as NCBI, DDBJ, and GenBank includes information as shown in FIG.

分類としてアノテーションと配列とがある。
アノテーションは、配列に付加された自然言語の説明文、もしくは遺伝子名等の記号である。
配列（Ｓｅｑｕｅｎｃｅ）は、文字列で表されるヌクレオチドあるいはアミノ酸の配列である。ヌクレオチドはＤＮＡ候補文字またはＲＮＡ候補文字を示し、ヌクレオチドの配列はＤＮＡ核酸配列またはＲＮＡ核酸配列のことである。 Classification includes annotation and sequence.
The annotation is a natural language description added to the sequence or a symbol such as a gene name.
The sequence (Sequence) is a nucleotide or amino acid sequence represented by a character string. The nucleotide indicates a DNA candidate letter or an RNA candidate letter, and the nucleotide sequence is a DNA nucleic acid sequence or an RNA nucleic acid sequence.

ｅｎｔｒｙｉｄは、整数で表される、配列情報の行を識別するユニークな数値である。
ｐｒａｉａｃｃは、文字列で表される、バイオ公共データベースが配列に付加した識別コードのうち、代表的なものである。
ａｃｃｅｓｓｉｏｎは、文字列で表される、バイオ公共データベースが配列に付加した識別コードである。これは複数存在する場合がある。
ｌｏｃｕｓは、文字列で表される、配列の染色体上の位置を表す記号である。
ｇｉは、整数で表される、ＧｅｎＢａｎｋが配列を管理するために付加した識別コードである。
ｌｅｎｇｔｈは、整数で表される、配列の長さである。
ｓｔｒａｎｄは、文字列で表される、生体内で翻訳される配列の向きである。
ｍｏｌｔｙｐｅは、文字列で表される、配列が塩基配列かアミノ酸配列かを示す識別コードである。塩基配列はＤＮＡ核酸配列またはＲＮＡ核酸配列のことである。
ｃｉｒｃｕｌａｒは、文字列で表される、線状配列か環状配列かを示す識別コードである。
ｄｉｖｉｓｉｏｎは、文字列で表される、配列の分類を示す記号である。
ｃｄａｔｅは、配列のエントリの作成された日付である。
ｕｄａｔｅは、配列のエントリの更新された日付である。
ｄｅｆｉｎｉｔｉｏｎは、文字列で表される、配列に付加された自然言語の説明文である。
ｋｅｙｗｏｒｄは、文字列で表される、配列の特質を表現する単語の組である。
ｓｏｕｒｃｅは、文字列で表される、配列が採取された生物の生物種あるいは器官等の情報である。
ｏｒｇａｎｉｓｍは、文字列で表される、配列が採取された生物の生物種である。
ｔａｘｏｎｏｍｙは、文字列で表される、配列が採取された生物の生物学的分類である。
ｔａｘｉｄは、文字列で表される、生物種を特定する識別コードである。ｔａｘｏｎｏｍｙデータベース（ｈｔｔｐ：／／ｗｗｗ．ｎｃｂｉ．ｎｌｍ．ｎｉｈ．ｇｏｖ／Ｔａｘｏｎｏｍｙ／ｔａｘｏｎｏｍｙｈｏｍｅ．ｈｔｍｌ／）で規定されている。
ｃｏｍｍｅｎｔは、文字列で表される、配列に対する付加的な自然言語の説明文である。 entryid is a unique numerical value, represented as an integer, that identifies a row of the sequence information.
praiacc is the representative one among the identification codes, represented as character strings, that the bio public database has added to the sequence.
accession is an identification code, represented as a character string, that the bio public database has added to the sequence. There may be more than one.
locus is a symbol, represented as a character string, indicating the position of the sequence on the chromosome.
gi is an identification code, represented as an integer, that GenBank has added to manage the sequence.
length is the length of the sequence, represented as an integer.
strand is the direction of the sequence that is translated in vivo, represented as a character string.
moltype is an identification code, represented as a character string, indicating whether the sequence is a base sequence or an amino acid sequence. A base sequence is a DNA nucleic acid sequence or an RNA nucleic acid sequence.
circular is an identification code, represented as a character string, indicating whether the sequence is linear or circular.
division is a symbol, represented as a character string, indicating the classification of the sequence.
cdate is the date on which the sequence entry was created.
udate is the date on which the sequence entry was updated.
definition is a natural-language description, represented as a character string, added to the sequence.
keyword is a set of words, represented as a character string, expressing the characteristics of the sequence.
source is information, represented as a character string, such as the species or organ of the organism from which the sequence was collected.
organism is the species, represented as a character string, of the organism from which the sequence was collected.
taxonomy is the biological classification, represented as a character string, of the organism from which the sequence was collected.
taxid is an identification code, represented as a character string, that identifies a biological species. It is defined in the taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/).
comment is an additional natural-language description of the sequence, represented as a character string.
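The fields listed above suggest a simple record type. The sketch below models a subset of them as a Python dataclass; the class name `SequenceRecord`, the choice of fields, and the default values are assumptions made for illustration, and the sample values in the usage note are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SequenceRecord:
    """A subset of the sequence-information fields described for FIG. 8."""

    entryid: int                      # unique row identifier
    praiacc: str                      # representative identification code
    accession: list[str] = field(default_factory=list)  # may hold several codes
    gi: Optional[int] = None          # identifier added by GenBank, if any
    length: int = 0                   # length of the sequence
    moltype: str = ""                 # base sequence or amino acid sequence
    definition: str = ""              # natural-language description
    organism: str = ""                # species of the source organism
    taxid: str = ""                   # species identification code
    sequence: str = ""                # the sequence itself
```

For example, `SequenceRecord(entryid=1, praiacc="ACC0001", sequence="ATGC", length=4)` creates a record whose `accession` list starts empty.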

上記の他に、配列情報にはホモロジーサーチ、すなわちＢＬＡＳＴを使用した相同性検索の相同スコアが含まれる。この相同スコアにはＥ−ｖａｌｕｅ、ｂｉｔスコア、％ｉｄがある。
Ｅ−ｖａｌｕｅは、その配列が照合ＤＢ中に偶発的に存在する確率であり、値が小さい方ほどホモロジーが高い。通常、Ｅ−ｖａｌｕｅ＜０．００１で有意なホモロジーが有るとされている。以下、値が小さくホモロジーが高いことを相同スコアが高いとする。
ｂｉｔスコアは、マッチした文字列のスコアを加算したものである。値が大きいほどホモロジーが高い。値は底が２のＬｏｇである。
％ｉｄは、文字列が一致しているパーセントである。通常、アミノ酸で３０％以上一致していれば、立体構造が類似、つまり生化学的機能に類似性が期待できるとされている。 In addition to the above, the sequence information includes the homology scores of a homology search, that is, a similarity search using BLAST. These homology scores are the E-value, the bit score, and %id.
E-value is the probability that the sequence is present in the reference DB by chance; the smaller the value, the higher the homology. Usually, E-value < 0.001 is regarded as indicating significant homology. Hereinafter, a small value, that is, high homology, is referred to as a high homology score.
The bit score is the sum of the scores of the matched characters. The larger the value, the higher the homology. The value is a base-2 logarithm.
%id is the percentage of characters that match. Usually, if amino acid sequences match by 30% or more, the three-dimensional structures are considered similar, that is, similarity in biochemical function can be expected.
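To make the %id measure concrete, the sketch below computes percent identity for two strings that are assumed to be already aligned to equal length, with `-` marking gaps. This is a simplified illustration, not BLAST's own calculation; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding identical, non-gap characters."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)
```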

図６において、スコア判定部１６０は、配列情報取得部１５０が出力した配列情報に含まれるホモロジーサーチの相同スコアの判定を行う。さらに、スコアの判定結果に基づいて配列情報を処理切換部１０６に出力する。
相同スコアの判定では、基準の相同スコアを超える場合、つまりホモロジーの高い場合に、配列情報取得部１５０によりバイオ公共データベースで検索された候補文字列を、ＤＮＡ核酸配列またはＲＮＡ核酸配列またはアミノ酸配列と判定し、検索した配列情報を処理切換部１０６に出力する。さらに、ＤＮＡ核酸配列またはＲＮＡ核酸配列またはアミノ酸配列と判定した候補文字列の配列情報が複数存在した場合、最も高い相同スコアの配列情報を判定結果としてもよい。
例えば、基準の相同スコアをＥ−ｖａｌｕｅ＜０．００１として相同スコアを判定する。
また、Ｅ−ｖａｌｕｅとｂｉｔスコアと％ｉｄとのそれぞれに任意の係数を掛けて、それらの合計値により相同スコアを判定してもよい。 In FIG. 6, the score determination unit 160 determines the homology score of the homology search included in the sequence information output from the sequence information acquisition unit 150. Further, the sequence information is output to the process switching unit 106 based on the score determination result.
In the determination of the homology score, when the reference homology score is exceeded, that is, when the homology is high, the candidate character string that the sequence information acquisition unit 150 searched for in the bio public database is determined to be a DNA nucleic acid sequence, an RNA nucleic acid sequence, or an amino acid sequence, and the retrieved sequence information is output to the process switching unit 106. Furthermore, when a candidate character string so determined has multiple pieces of sequence information, the sequence information with the highest homology score may be used as the determination result.
For example, the homology score is determined using E-value < 0.001 as the reference homology score.
Alternatively, the E-value, the bit score, and %id may each be multiplied by an arbitrary coefficient, and the homology score may be determined from their sum.
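The two determination rules can be sketched as follows. Hits are represented as dictionaries with hypothetical keys `e_value`, `bit_score`, and `pct_id`; the function names and the example coefficients are assumptions, since the text leaves the coefficients arbitrary.

```python
from typing import Optional


def best_hit(hits: list[dict], max_e_value: float = 0.001) -> Optional[dict]:
    """Return the hit with the smallest E-value below the threshold, if any."""
    significant = [hit for hit in hits if hit["e_value"] < max_e_value]
    return min(significant, key=lambda hit: hit["e_value"], default=None)


def combined_score(hit: dict, w_e: float, w_bit: float, w_id: float) -> float:
    """Weighted sum of E-value, bit score, and %id with caller-chosen weights."""
    return w_e * hit["e_value"] + w_bit * hit["bit_score"] + w_id * hit["pct_id"]
```

Because a smaller E-value means higher homology while larger bit scores and %id values do, a caller would likely give the E-value a negative coefficient when combining the three.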

処理切換部１０６は、ユーザＩ／Ｆ部１０３が入力した処理命令データを記憶部１０４から取得し、処理命令データに基づいて処理を切り換える。
処理命令データが配列情報の取得を示す場合は、スコア判定部１６０が出力した配列情報を情報リンクファイル生成部１７０に出力することで処理を切り換える。 The process switching unit 106 acquires the processing command data input by the user I / F unit 103 from the storage unit 104, and switches processing based on the processing command data.
When the processing instruction data indicates acquisition of sequence information, the sequence information output by the score determination unit 160 is output to the information link file generation unit 170 to switch processing.

情報リンクファイル生成部１７０は、ユーザＩ／Ｆ部１０３が入力した処理命令データを記憶部１０４から取得し、取得した処理命令データと、処理切換部１０６が出力した配列情報とに基づいてリンクファイルを生成し、生成したリンクファイルを記憶部１０４に記憶する。
例えば、処理命令データが、テキストエディタ２２０に表示されたテキストデータに含まれる核酸配列またはアミノ酸配列の配列情報の取得を示す場合、テキストエディタ２２０に表示されたテキストデータを記憶部１０４から取得してコピーする。
さらに、コピーしたテキストデータにヘッダタグを挿入しＨＴＭＬソースコードを生成する。
さらに、候補文字列抽出部１３０が出力した候補文字列を、生成したＨＴＭＬソースコードから検索し、ＨＴＭＬソースコードで検索した候補文字列に、スコア判定部１６０がホモロジーが高いと判定した配列の配列情報をリンクしたタグを、生成したＨＴＭＬソースコードに挿入する。 The information link file generation unit 170 acquires the processing instruction data input by the user I / F unit 103 from the storage unit 104, generates a link file based on the acquired processing instruction data and the sequence information output by the process switching unit 106, and stores the generated link file in the storage unit 104.
For example, when the processing instruction data indicates acquisition of the sequence information of a nucleic acid sequence or amino acid sequence included in the text data displayed in the text editor 220, the text data displayed in the text editor 220 is acquired from the storage unit 104 and copied.
Further, HTML source code is generated by inserting header tags into the copied text data.
Furthermore, the candidate character string output by the candidate character string extraction unit 130 is searched for in the generated HTML source code, and a tag that links the found candidate character string to the sequence information of the sequence that the score determination unit 160 determined to have high homology is inserted into the generated HTML source code.
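The generation of the information link file can be sketched as follows: the text is wrapped in a minimal HTML skeleton and every occurrence of a confirmed candidate is turned into a hyperlink. This is an illustration under assumptions: candidates are taken to consist of uppercase letters only, the HTML skeleton is a guess at what "header tags" means, and the function name `make_link_file` and the `links` mapping (candidate to URL) are hypothetical.

```python
import html
import re


def make_link_file(text: str, links: dict[str, str], title: str = "Sequence links") -> str:
    """Wrap `text` in minimal HTML and hyperlink each candidate in `links`."""
    escaped = html.escape(text)
    if links:
        # Longest candidates first, so a short candidate never splits a longer one.
        alternatives = sorted(links, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(c) for c in alternatives))

        def anchor(match: re.Match) -> str:
            url = html.escape(links[match.group()], quote=True)
            return f'<a href="{url}">{match.group()}</a>'

        escaped = pattern.sub(anchor, escaped)
    return (
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><pre>{escaped}</pre></body></html>"
    )
```

Doing all replacements in a single pass avoids re-matching a candidate inside a tag that was inserted for another candidate.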

例えば、配列情報取得部１５０が、取得した配列情報と、候補文字列抽出部１３０の出力した候補文字列とを関連付けたテーブルを配列情報としてスコア判定部１６０に出力することで、情報リンクファイル生成部１７０は、処理切換部１０６の出力した配列情報から上記のような処理により情報リンクファイルであるＨＴＭＬソースコードを生成することができる。
また、候補文字列抽出部１３０は、抽出した候補文字列と、その候補文字列のテキストデータ内での位置を示したポインタとを関連付けたテーブルを候補文字列として配列情報取得部１５０に出力し、情報リンクファイル生成部１７０は、ポインタの示すテキストデータ位置に配列情報をリンクしたタグを挿入してもよい。
候補文字列への配列情報のリンクについて、バイオ公共データベース２００の提供する配列情報のＷＥＢページのアドレスをリンク情報としてタグを挿入してもよい。
また、配列情報取得部１５０が、バイオ公共データベース２００の提供する配列情報のＷＥＢページから配列情報を取得して記憶部１０４に記憶し、記憶部１０４に記憶された配列情報の記憶位置をリンク情報としてタグを挿入してもよい。
また、スコア判定部１６０が最も相同スコアが高いと判定した配列情報のみをリンクしてもよいし、スコア判定部１６０が基準を超える相同スコアであると判定した複数の配列情報をリンクしてもよい。 For example, when the sequence information acquisition unit 150 outputs to the score determination unit 160, as sequence information, a table associating the acquired sequence information with the candidate character strings output by the candidate character string extraction unit 130, the information link file generation unit 170 can generate the HTML source code that serves as the information link file from the sequence information output by the process switching unit 106, using the processing described above.
Alternatively, the candidate character string extraction unit 130 may output to the sequence information acquisition unit 150, as the candidate character strings, a table associating each extracted candidate character string with a pointer indicating its position in the text data, and the information link file generation unit 170 may insert a tag linking the sequence information at the text data position indicated by the pointer.
To link sequence information to a candidate character string, a tag may be inserted using the address of the WEB page of the sequence information provided by the bio public database 200 as link information.
Alternatively, the sequence information acquisition unit 150 may acquire the sequence information from the WEB page of sequence information provided by the bio public database 200 and store it in the storage unit 104, and a tag may be inserted using the storage location of the sequence information in the storage unit 104 as link information.
Furthermore, only the sequence information that the score determination unit 160 determined to have the highest homology score may be linked, or multiple pieces of sequence information that the score determination unit 160 determined to have homology scores exceeding the reference may be linked.
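The pointer-based variant, in which each candidate carries its position in the text, could look like the sketch below. The tuple layout `(offset, candidate, url)` and the function name `insert_links_at_offsets` are hypothetical; the URL may be either a database page address or a local storage location, as the text allows both.

```python
import html


def insert_links_at_offsets(text: str, hits: list[tuple[int, str, str]]) -> str:
    """Insert hyperlinks at recorded offsets; each hit is (offset, candidate, url)."""
    parts: list[str] = []
    cursor = 0
    for offset, candidate, url in sorted(hits):
        end = offset + len(candidate)
        # Skip pointers that overlap an earlier link or no longer match the text.
        if offset < cursor or text[offset:end] != candidate:
            continue
        parts.append(html.escape(text[cursor:offset]))
        parts.append(
            f'<a href="{html.escape(url, quote=True)}">{html.escape(candidate)}</a>'
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

Compared with searching the generated HTML for each candidate, working from offsets links only the exact occurrence that was extracted.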

配列情報抽出方法の処理の流れについて図７に基づいて説明する。ここでは配列情報抽出装置の処理（１）に関して説明する。
まず、Ｓ１で以下の処理を行う。
ユーザＩ／Ｆ部１０３はユーザの指定する処理命令データを入力する。
ここで、ユーザＩ／Ｆ部１０３は、テキストエディタ２２０に表示されたテキストデータに含まれる核酸配列またはアミノ酸配列の配列情報の取得を示す処理命令データをユーザから入力されたものとして以下の説明をする。
テキスト入力部１１０は、ユーザＩ／Ｆ部１０３の入力した処理命令データに基づいて、テキストエディタ２２０に表示されたテキストデータを入力する。
ソースコード形式判定部１２０は、テキスト入力部１１０が入力したテキストデータの形式がＨＴＭＬソースコードでないと判定し、候補文字列抽出部１３０の候補文字列判定部１３１を起動する。テキストデータを候補文字列判定部１３１の入力とする。
候補文字列判定部１３１は、テキストデータからＤＮＡ候補文字列、ＲＮＡ候補文字列、アミノ酸候補文字列を抽出し、抽出した候補文字列を記憶部１０４に記憶する。 The processing flow of the sequence information extraction method will be described with reference to FIG. Here, the process (1) of the sequence information extracting apparatus will be described.
First, the following processing is performed in S1.
The user I / F unit 103 inputs processing instruction data designated by the user.
Here, the following description assumes that the user I / F unit 103 has received from the user processing instruction data indicating acquisition of the sequence information of the nucleic acid sequence or amino acid sequence included in the text data displayed in the text editor 220.
The text input unit 110 inputs the text data displayed on the text editor 220 based on the processing command data input by the user I / F unit 103.
The source code format determination unit 120 determines that the format of the text data input by the text input unit 110 is not an HTML source code, and activates the candidate character string determination unit 131 of the candidate character string extraction unit 130. The text data is input to the candidate character string determination unit 131.
The candidate character string determination unit 131 extracts a DNA candidate character string, an RNA candidate character string, and an amino acid candidate character string from the text data, and stores the extracted candidate character string in the storage unit 104.

次に、Ｓ５で以下の処理を行う。
配列情報取得部１５０は、記憶部１０４に記憶された各候補文字列について外部データベースを参照する。
外部データベースとは、ＮＣＢＩやＤＤＢＪなどのバイオ公共データベース２００のことであり、外部データベースの参照とは、ホモロジーサーチや配列情報の取得を行うことである。 Next, in S5, the following processing is performed.
The sequence information acquisition unit 150 refers to the external database for each candidate character string stored in the storage unit 104.
The external database is a bio-public database 200 such as NCBI or DDBJ, and the reference to the external database is to perform homology search or sequence information acquisition.

次に、Ｓ６で以下の処理を行う。
外部データベースは、各候補文字列について、核酸配列の配列情報を記憶する核酸データベースと、アミノ酸配列の配列情報を記憶するアミノ酸データベースを使用してホモロジーサーチを行う。
さらに、ホモロジーサーチの結果として、各候補文字列に相同する配列の配列情報を配列情報抽出装置１００の配列情報取得部１５０に返送する。配列情報には、Ｅ−ｖａｌｕｅやｂｉｔスコアや％ｉｄといった相同スコアが含まれる。 Next, the following processing is performed in S6.
For each candidate character string, the external database performs a homology search using a nucleic acid database that stores the sequence information of the nucleic acid sequence and an amino acid database that stores the sequence information of the amino acid sequence.
Further, as a result of the homology search, the sequence information of the sequence that is homologous to each candidate character string is returned to the sequence information acquisition unit 150 of the sequence information extraction device 100. The sequence information includes homology scores such as E-value, bit score, and% id.

次に、Ｓ７で以下の処理を行う。
配列情報取得部１５０は、外部データベースから取得した配列情報をスコア判定部１６０に出力する。
スコア判定部１６０は、配列情報に含まれる相同スコアに基づいて、各候補文字列から核酸配列またはアミノ酸配列を示す文字列を判定する。 Next, in S7, the following processing is performed.
The sequence information acquisition unit 150 outputs the sequence information acquired from the external database to the score determination unit 160.
The score determination unit 160 determines a character string indicating a nucleic acid sequence or an amino acid sequence from each candidate character string based on the homology score included in the sequence information.

次に、Ｓ８で以下の処理を行う。
処理切換部１０６は、Ｓ１でユーザＩ／Ｆ部１０３の入力したユーザの処理命令データが配列情報の取得であると判定し、情報リンクファイル生成部１７０を起動する。 Next, in S8, the following processing is performed.
The process switching unit 106 determines that the user process instruction data input by the user I / F unit 103 in S1 is acquisition of sequence information, and activates the information link file generation unit 170.

次に、Ｓ９で以下の処理を行う。
情報リンクファイル生成部１７０は、スコア判定部１６０が核酸配列またはアミノ酸配列だと判定したテキストデータ内の文字列に対してリンク情報を、配列情報取得部１５０が外部データベースから取得した配列情報から取得する。
ここで、配列情報から取得するリンク情報を、外部データベースの提供する配列情報を示すＷＥＢページのアドレスとする。ただし、ＷＥＢページのアドレスに限らない。例えば、配列情報取得部１５０が配列情報の内容を記憶部１０４に記憶し、記憶部１０４内での配列情報の記憶位置をリンク情報としてもよい。 Next, the following processing is performed in S9.
The information link file generation unit 170 acquires, from the sequence information that the sequence information acquisition unit 150 acquired from the external database, link information for each character string in the text data that the score determination unit 160 determined to be a nucleic acid sequence or an amino acid sequence.
Here, the link information acquired from the sequence information is the address of the WEB page showing the sequence information provided by the external database. However, the link information is not limited to the address of a WEB page. For example, the sequence information acquisition unit 150 may store the contents of the sequence information in the storage unit 104, and the storage location of the sequence information in the storage unit 104 may be used as the link information.

次に、Ｓ１０で以下の処理を行う。
情報リンクファイル生成部１７０は、取得したリンク情報を、テキスト入力部１１０が入力したテキストデータの対象文字列部分にハイパーリンクするタグを挿入し、ＨＴＭＬソースコードを生成する。
ブラウザ実行部１０１は、生成されたＨＴＭＬソースコードの示すＷＥＢページをＷＥＢブラウザ２１０に表示する。
ユーザは、ＷＥＢブラウザ２１０に表示されたＷＥＢページ上で、ＷＥＢページに表示された各核酸配列や各アミノ酸配列のハイパーリンク部分を指定し、それぞれの配列情報を参照することができる。 Next, the following processing is performed in S10.
The information link file generation unit 170 inserts, at the target character string portion of the text data input by the text input unit 110, a tag that hyperlinks that portion to the acquired link information, and generates HTML source code.
The browser execution unit 101 displays a WEB page indicated by the generated HTML source code on the WEB browser 210.
The user can specify the hyperlink portion of each nucleic acid sequence or each amino acid sequence displayed on the WEB page on the WEB page displayed on the WEB browser 210, and can refer to the respective sequence information.

上記説明では、配列情報抽出装置１００の処理（１）について説明した。
次に、（２）任意のＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードのＨＴＭＬタグに括られた文字列の中から核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、核酸配列またはアミノ酸配列を示す文字列とその配列情報とをリンクさせた情報リンクファイルを作成することについて説明する。 In the above description, the process (1) of the sequence information extraction apparatus 100 has been described.
Next, process (2) will be described: arbitrary HTML source code is input, character strings indicating a nucleic acid sequence or an amino acid sequence are extracted from the character strings enclosed in HTML tags of the HTML source code, sequence information on the nucleic acid sequence or amino acid sequence is acquired, and an information link file is created in which each character string indicating a nucleic acid sequence or amino acid sequence is linked to its sequence information.

配列情報抽出装置１００の構成について図６に基づいて説明する。ここでは配列情報抽出装置１００の処理（２）に関する構成について（１）の説明と異なる部分を説明する。
ブラウザ実行部１０１は、ＷＥＢブラウザ２１０へのＷＥＢページの表示が終了するとシグナルをシグナル受信部１０２に出力する。
シグナル受信部１０２は、ブラウザ実行部１０１が出力したシグナルを受信すると、テキスト入力部１１０を起動する。
テキスト入力部１１０は、シグナル受信部１０２に起動されると、ＷＥＢブラウザ２１０にＷＥＢページとして表示されるＨＴＭＬソースコードを記憶部１０４から取得し、取得したＨＴＭＬソースコードをテキストデータとして入力する。さらに、入力したテキストデータをソースコード形式判定部１２０に出力する。 The configuration of the sequence information extraction apparatus 100 will be described with reference to FIG. 6. Here, regarding the configuration related to process (2) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) will be described.
The browser execution unit 101 outputs a signal to the signal reception unit 102 when the display of the WEB page on the WEB browser 210 is completed.
The signal receiving unit 102 activates the text input unit 110 when receiving the signal output from the browser execution unit 101.
When activated by the signal receiving unit 102, the text input unit 110 acquires the HTML source code displayed as a WEB page on the WEB browser 210 from the storage unit 104, and inputs the acquired HTML source code as text data. Further, the input text data is output to the source code format determination unit 120.

ソースコード形式判定部１２０は、テキスト入力部１１０が出力したテキストデータの形式がＨＴＭＬソースコードである場合、ＨＴＭＬソースコードが示すＷＥＢページがバイオ公共データベースの提供する配列情報検索結果のＷＥＢページか判定する。判定した結果、バイオ公共データベースの提供する配列情報検索結果のＷＥＢページでない場合、候補文字列抽出部１３０のタグ文字列抽出部１３２にＨＴＭＬソースコードを出力することで処理を切り換える。
例えば、各バイオ公共データベースの提供する配列情報検索結果のＷＥＢページを示すＨＴＭＬソースコード内のヘッダタグで括られた文字列を記憶部１０４に記憶し、判定するＨＴＭＬソースコード内のヘッダタグで括られた文字列と、記憶部１０４に記憶された各バイオ公共データベースの提供する配列情報検索結果のＷＥＢページを示すＨＴＭＬソースコード内のヘッダタグで括られた文字列とを比較することで判定を行う。 When the format of the text data output from the text input unit 110 is HTML source code, the source code format determination unit 120 determines whether the WEB page represented by the HTML source code is a WEB page of sequence information search results provided by a bio public database. If it determines that the page is not such a search result page, it switches the processing by outputting the HTML source code to the tag character string extraction unit 132 of the candidate character string extraction unit 130.
For example, the character string enclosed by the header tag in the HTML source code of each sequence information search result WEB page provided by each bio public database is stored in the storage unit 104, and the determination is made by comparing the character string enclosed by the header tag in the HTML source code under examination with the stored character strings.
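The comparison described above can be sketched as follows. This is only an illustration under assumptions: it takes the compared string to be the text of the TITLE element in the page header, and the function names and the whitespace- and case-insensitive comparison are choices made here, not details given in the description.

```python
import re

def page_title(html_source):
    """Return the text inside the first <title> element, or None."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html_source,
                  flags=re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

def is_search_result_page(html_source, stored_titles):
    """True when the page title matches one of the stored titles."""
    title = page_title(html_source)
    if title is None:
        return False
    # Assumption: ignore case and whitespace differences when comparing.
    key = "".join(title.split()).lower()
    return key in {"".join(t.split()).lower() for t in stored_titles}
```

A page for which `is_search_result_page` returns False would be passed on to the tag character string extraction unit 132; `stored_titles` stands in for the strings held in the storage unit 104.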

候補文字列抽出部１３０は、候補文字列判定部１３１に加え、タグ文字列抽出部１３２を備える。
タグ文字列抽出部１３２は、ソースコード形式判定部１２０が出力したＨＴＭＬソースコード内のタグで括られた文字列を抽出する。さらに、抽出した文字列を候補文字列判定部１３１に出力する。
候補文字列判定部１３１は、タグ文字列抽出部１３２の出力した文字列がＤＮＡ候補文字列またはＲＮＡ候補文字列またはアミノ酸候補文字列か判定し、いずれかの候補文字列と判定した文字列を候補文字列として配列情報取得部１５０に出力する。
その他の部分は、配列情報抽出装置１００の処理（１）の説明と同様であり、配列情報取得部１５０は配列情報を取得し、情報リンクファイル生成部１７０は情報リンクファイルを生成する。 The candidate character string extraction unit 130 includes a tag character string extraction unit 132 in addition to the candidate character string determination unit 131.
The tag character string extraction unit 132 extracts a character string enclosed by tags in the HTML source code output from the source code format determination unit 120. Further, the extracted character string is output to the candidate character string determination unit 131.
The candidate character string determination unit 131 determines whether each character string output by the tag character string extraction unit 132 is a DNA candidate character string, an RNA candidate character string, or an amino acid candidate character string, and outputs each character string determined to be one of these to the sequence information acquisition unit 150 as a candidate character string.
The other parts are the same as in the description of process (1) of the sequence information extraction apparatus 100: the sequence information acquisition unit 150 acquires the sequence information, and the information link file generation unit 170 generates the information link file.
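The interplay of the tag character string extraction unit 132 and the candidate character string determination unit 131 might look like the sketch below. The alphabets, the minimum length of 10, and all names are assumptions made for illustration; the actual determination criteria are those of process (1).

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text found between tags, skipping script and style."""
    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.texts.append(data.strip())

# Assumed alphabets (N = unknown base); not taken from the description.
DNA = set("ACGTN")
RNA = set("ACGUN")
AMINO = set("ACDEFGHIKLMNPQRSTVWY")

def classify(text, min_len=10):
    """Return every candidate type the string could be ([] if none)."""
    s = "".join(text.split()).upper()
    if len(s) < min_len:
        return []
    letters = set(s)
    kinds = []
    if letters <= DNA:
        kinds.append("DNA")
    if letters <= RNA:
        kinds.append("RNA")
    if letters <= AMINO:
        kinds.append("amino acid")
    return kinds

def candidates(html_source):
    """Extract tag-enclosed strings and keep those that are candidates."""
    parser = TagTextExtractor()
    parser.feed(html_source)
    parser.close()
    return [(t, classify(t)) for t in parser.texts if classify(t)]
```

A string made only of A, C, G, and T is reported as both a DNA and an amino acid candidate, because those letters are also valid amino acid codes; the ambiguity is left for the later search and score determination steps.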

配列情報抽出方法の処理の流れについて図７に基づいて説明する。ここでは、配列情報抽出装置１００の処理（２）に関して（１）の説明と異なる部分を説明する。
まず、Ｓ２で以下の処理を行う。
ブラウザ実行部１０１は、任意のＷＥＢページのＷＥＢブラウザ２１０への表示を開始し、表示処理が終了した場合、シグナル受信部１０２にシグナルを出力する。
ここで、ＷＥＢブラウザ２１０に表示されたＷＥＢページは、バイオ公共データベースの提供する配列情報検索結果のＷＥＢページでないものとして以下の説明をする。
シグナル受信部１０２は、シグナルを受信した場合、テキスト入力部１１０を起動する。
テキスト入力部１１０は、ＷＥＢブラウザに表示されたＷＥＢページのＨＴＭＬソースコードをテキストデータとして入力する。
ソースコード形式判定部１２０は、テキスト入力部１１０が入力したテキストデータの形式がＨＴＭＬソースコードであると判定し、さらに、バイオ公共データベースの提供する配列情報検索結果のＷＥＢページのＨＴＭＬソースコードでないと判定し、候補文字列抽出部１３０のタグ文字列抽出部１３２を起動する。ＨＴＭＬソースコードをタグ文字列抽出部１３２の入力とする。
タグ文字列抽出部１３２は、ＨＴＭＬソースコード内のタグで括られた文字列を抽出する。 The processing flow of the sequence information extraction method will be described with reference to FIG. 7. Here, regarding process (2) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) will be described.
First, the following processing is performed in S2.
The browser execution unit 101 starts displaying an arbitrary WEB page on the WEB browser 210, and outputs a signal to the signal receiving unit 102 when the display process ends.
Here, the following explanation will be made assuming that the WEB page displayed on the WEB browser 210 is not the WEB page of the sequence information search result provided by the bio public database.
When receiving a signal, the signal receiving unit 102 activates the text input unit 110.
The text input unit 110 inputs the HTML source code of the WEB page displayed on the WEB browser as text data.
The source code format determination unit 120 determines that the format of the text data input by the text input unit 110 is HTML source code and, further, that it is not the HTML source code of a WEB page of sequence information search results provided by a bio public database, and activates the tag character string extraction unit 132 of the candidate character string extraction unit 130. The HTML source code is given to the tag character string extraction unit 132 as input.
The tag character string extraction unit 132 extracts a character string enclosed by tags in the HTML source code.

次に、Ｓ４で以下の処理を行う。
候補文字列判定部１３１は、タグ文字列抽出部１３２が抽出した文字列の中からＤＮＡ候補文字列、ＲＮＡ候補文字列、アミノ酸候補文字列を抽出し、抽出した候補文字列を記憶部１０４に記憶する。
その後の処理は、配列情報抽出装置１００の処理（１）の説明と同様である。 Next, in S4, the following processing is performed.
The candidate character string determination unit 131 extracts DNA candidate character strings, RNA candidate character strings, and amino acid candidate character strings from the character strings extracted by the tag character string extraction unit 132, and stores the extracted candidate character strings in the storage unit 104.
The subsequent processing is the same as the description of the processing (1) of the sequence information extracting apparatus 100.

配列情報抽出装置１００の処理（２）では、ブラウザ実行部１０１の出力するＷＥＢブラウザにＷＥＢページを表示し終えたことを示すシグナルにより、テキスト入力部１１０がＨＴＭＬソースコードを入力し、配列情報抽出処理を始めた。
ただし、配列情報抽出装置１００の処理（１）のように、ユーザＩ／Ｆ部１０３のユーザの処理命令データの入力により、テキスト入力部１１０がＨＴＭＬソースコードを入力し、配列情報抽出処理を始めてもよい。 In process (2) of the sequence information extraction apparatus 100, the text input unit 110 inputs the HTML source code and starts the sequence information extraction process in response to the signal, output by the browser execution unit 101, indicating that the WEB page has finished being displayed on the WEB browser.
However, as in process (1) of the sequence information extraction apparatus 100, the text input unit 110 may instead input the HTML source code and start the sequence information extraction process when the user I/F unit 103 receives processing instruction data from the user.

また、ＨＴＭＬソースコードを入力としたが、ＨＴＭＬソースコードに限らず他のマークアップ言語ソースコードを入力としてもよい。例えば、ＸＭＬソースコードでもよい。 Further, although HTML source code is input, other markup language source code may be input without being limited to HTML source code. For example, XML source code may be used.

上記説明では、配列情報抽出装置１００の処理（２）について説明した。
次に、（３）バイオ公共データベースを使用した配列情報の検索結果を示すＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードに示される配列情報を取得し、配列情報に対して配列を解析し、配列情報と解析結果とをリンクさせた解析リンクファイルを作成することについて説明する。 In the above description, the process (2) of the sequence information extraction apparatus 100 has been described.
Next, process (3) will be described: HTML source code showing the results of a sequence information search using a bio public database is input, the sequence information shown in the HTML source code is acquired, the sequences in the sequence information are analyzed, and an analysis link file is created in which the sequence information is linked to the analysis results.

配列情報抽出装置１００の構成について図６に基づいて説明する。ここでは配列情報抽出装置１００の処理（３）に関する構成について（１）および（２）の説明と異なる部分を説明する。
ソースコード形式判定部１２０は、テキスト入力部１１０が出力したテキストデータの形式がＨＴＭＬソースコードであり、さらにＨＴＭＬソースコードが示すＷＥＢページがバイオ公共データベースの提供する配列情報検索結果のＷＥＢページである場合、ＨＴＭＬソースコードの示すＷＥＢページが、いずれのバイオ公共データベースの提供する配列情報検索結果のＷＥＢページの形式であるか判定する。判定した形式に基づいて、配列情報抽出部１４０の備えるＨＴＭＬソースコードの各形式を処理する第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎのいずれかの配列情報抽出構成部にＨＴＭＬソースコードを出力することで処理を切り換える。 The configuration of the sequence information extraction apparatus 100 will be described with reference to FIG. 6. Here, regarding the configuration related to process (3) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) and (2) will be described.
When the format of the text data output from the text input unit 110 is HTML source code and the WEB page represented by the HTML source code is a WEB page of sequence information search results provided by a bio public database, the source code format determination unit 120 determines which bio public database's search result page format the WEB page has. Based on the determined format, it switches the processing by outputting the HTML source code to one of the first sequence information extraction configuration unit 1411 to the n-th sequence information extraction configuration unit 141n, which are included in the sequence information extraction unit 140 and each process one format of HTML source code.

配列情報抽出部１４０は、第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎと配列情報編集部１４２とを備え、ＨＴＭＬソースコードと、ＨＴＭＬソースコードからリンクされたバイオ公共データベースのデータとから配列情報を抽出する。さらに、抽出した配列情報を配列解析部１８０に出力する。 The sequence information extraction unit 140 includes the first sequence information extraction configuration unit 1411 to the n-th sequence information extraction configuration unit 141n and a sequence information editing unit 142, and extracts sequence information from the HTML source code and from the bio public database data linked from the HTML source code. Further, it outputs the extracted sequence information to the sequence analysis unit 180.

第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎは、それぞれ異なる形式で、ソースコード形式判定部１２０が出力したＨＴＭＬソースコードと、ＨＴＭＬソースコードからリンクされたバイオ公共データベースのデータとから配列情報を抽出する。さらに、抽出した配列情報を配列情報編集部１４２に出力する。
図９、図１０、図１１、図１２、図１３は実施の形態１におけるバイオ公共データベースのＷＥＢページのＨＴＭＬソースコード形式を示す図である。
各配列情報抽出構成部の行う配列情報の抽出方法を図９〜図１３の各図に基づいて説明する。
図９は、ＮＣＢＩのＮｕｃｌｅｏｔｉｄｅ／Ｐｒｏｔｅｉｎ検索結果のＳｕｍｍａｒｙ表示のＨＴＭＬソースコード形式である。
ＷＥＢページの識別は、Ｎｕｃｌｅｏｔｉｄｅ検索結果の場合、ヘッダタグ内の”＜ｔｉｔｌｅ＞ＥｎｔｒｅｚＮｕｃｌｅｏｔｉｄｅ＜／ｔｉｔｌｅ＞”の記述を判定する。
Ｐｒｏｔｅｉｎ検索結果の場合は、”＜ｔｉｔｌｅ＞ＥｎｔｒｅｚＰｒｏｔｅｉｎ＜／ｔｉｔｌｅ＞”の記述を判定する。
ＨＴＭＬの構造は、Ｊａｖａ（登録商標）Ｓｃｒｉｐｔの記述を取り除くと図９に示すような構造である。
ＤＬタグ、ＤＴタグ、ＡＨＲＥＦタグ、ＤＤタグを判定することで、「配列へのＵＲＬ」と「アノテーション」とを抽出し、「配列へのＵＲＬ」を用いてＨＴＴＰリクエストをＮＣＢＩに発行して配列を取得する。 The first sequence information extraction configuration unit 1411 to the n-th sequence information extraction configuration unit 141n each extract sequence information, for a different format, from the HTML source code output by the source code format determination unit 120 and from the bio public database data linked from the HTML source code. Further, each outputs the extracted sequence information to the sequence information editing unit 142.
FIG. 9, FIG. 10, FIG. 11, FIG. 12, and FIG. 13 are diagrams showing the HTML source code formats of the WEB pages of the bio public databases in the first embodiment.
The sequence information extraction method performed by each sequence information extraction configuration unit will be described with reference to FIGS. 9 to 13.
FIG. 9 shows the HTML source code format of the Summary display of NCBI Nucleotide/Protein search results.
The WEB page is identified, in the case of a Nucleotide search result, by checking for the description “<title>EntrezNucleotide</title>” in the header tag.
In the case of a Protein search result, the description “<title>EntrezProtein</title>” is checked for.
With the Java (registered trademark) Script descriptions removed, the HTML has the structure shown in FIG. 9.
By identifying the DL, DT, A HREF, and DD tags, the "URL to the sequence" and the "annotation" are extracted, and the sequence is acquired by issuing an HTTP request to NCBI using the "URL to the sequence".
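The DL/DT/A HREF/DD extraction can be sketched with the standard-library HTML parser. FIG. 9 is not reproduced here, so the nesting assumed below (a link inside each DT, followed by a DD holding the annotation) is a guess at the structure, and the class and function names are hypothetical. The HTTP request that fetches each sequence is left out so the sketch runs offline.

```python
from html.parser import HTMLParser

class SummaryParser(HTMLParser):
    """Pair each link inside <dt> with the text of the following <dd>."""
    def __init__(self):
        super().__init__()
        self.entries = []  # list of {"url": ..., "annotation": ...}
        self._in_dt = False
        self._in_dd = False

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._in_dt, self._in_dd = True, False
        elif tag == "dd":
            self._in_dd, self._in_dt = True, False
        elif tag == "a" and self._in_dt:
            href = dict(attrs).get("href")
            if href:
                self.entries.append({"url": href, "annotation": ""})

    def handle_endtag(self, tag):
        if tag == "dt":
            self._in_dt = False
        elif tag in ("dd", "dl"):
            self._in_dd = False

    def handle_data(self, data):
        # Text inside <dd> is attached to the most recent link.
        if self._in_dd and self.entries:
            self.entries[-1]["annotation"] += data

def parse_summary(html_source):
    """Return the "URL to the sequence" and "annotation" of each entry."""
    parser = SummaryParser()
    parser.feed(html_source)
    parser.close()
    for entry in parser.entries:
        entry["annotation"] = " ".join(entry["annotation"].split())
    return parser.entries
```

Each returned `url` would then be requested from the database to obtain the sequence itself.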

図１０は、ＮＣＢＩのＮｕｃｌｅｏｔｉｄｅ検索結果のＦａｓｔＡ表示のＨＴＭＬソースコード形式である。
ＷＥＢページの識別は、ヘッダタグ内の”＜ＴＩＴＬＥ＞ＮＣＢＩＳｅｑｕｅｎｃｅＶｉｅｗｅｒ＜／ＴＩＴＬＥ＞”の記述を判定する。
ＨＴＭＬの構造は、Ｊａｖａ（登録商標）Ｓｃｒｉｐｔの記述を取り除くと図１０に示すような構造である。
ＰＲＥタグを判定することで、「アノテーション」と「配列」とを取得する。 FIG. 10 shows the HTML source code format of FastA display of NCBI Nucleotide search results.
The WEB page is identified by checking for the description “<TITLE>NCBI Sequence Viewer</TITLE>” in the header tag.
With the Java (registered trademark) Script descriptions removed, the HTML has the structure shown in FIG. 10.
By identifying the PRE tag, the “annotation” and the “sequence” are acquired.
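Reading the annotation and the sequence out of a PRE block can be sketched as below, assuming the block holds standard FASTA text (a header line beginning with ">" followed by sequence lines). The function name and the returned dictionary shape are hypothetical.

```python
import html
import re

def parse_fasta_pre(html_source):
    """Return annotation/sequence pairs found in the first <pre> block."""
    m = re.search(r"<pre[^>]*>(.*?)</pre>", html_source,
                  flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return []
    # Drop any markup nested in the block, then decode entities (&gt; etc.).
    body = html.unescape(re.sub(r"<[^>]+>", "", m.group(1)))
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A FASTA header line starts a new record.
            records.append({"annotation": line[1:].strip(), "sequence": ""})
        elif records:
            # Sequence lines are concatenated onto the current record.
            records[-1]["sequence"] += line
    return records
```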

図１１は、ＤＤＢＪのＳＲＳ検索結果のＳｅｑＳｉｍｐｌｅＶｉｅｗ表示のＨＴＭＬソースコード形式である。
ＷＥＢページの識別は、ヘッダタグ内の”＜ＴＩＴＬＥ＞ＱｕｅｒｙＲｅｓｕｌｔ＜／ＴＩＴＬＥ＞”の記述を判定する。
ＨＴＭＬの構造は、Ｊａｖａ（登録商標）Ｓｃｒｉｐｔおよび不要なタグ属性等の記述を取り除くと図１１に示すような構造である。
ＴＲタグ、ＴＤタグ、ＡＨＲＥＦタグを判定することで、「配列情報へのＵＲＬ」を抽出し、「配列情報へのＵＲＬ」が示すリンク先から配列情報を取得する。 FIG. 11 is an HTML source code format of SeqSimpleView display of the SRS search result of DDBJ.
The WEB page is identified by checking for the description “<TITLE>Query Result</TITLE>” in the header tag.
The structure of HTML is as shown in FIG. 11 when descriptions such as Java (registered trademark) Script and unnecessary tag attributes are removed.
By determining the TR tag, TD tag, and A HREF tag, “URL to sequence information” is extracted, and sequence information is acquired from the link destination indicated by “URL to sequence information”.

図１２は、ＤＤＢＪのＳＲＳ検索結果のＦａｓｔＡ表示のＨＴＭＬソースコード形式である。
ＰＲＥタグを判定することで、「アノテーション」と「配列」とを取得する。 FIG. 12 is an HTML source code format of FastA display of the SRS search result of DDBJ.
By identifying the PRE tag, the “annotation” and the “sequence” are acquired.

図１３は、ＤＤＢＪのＳＲＳ検索結果のＣｏｍｐｌｅｔｅＥｎｔｒｉｅｓ表示のＨＴＭＬソースコード形式である。
ＨＴＭＬの構造は、Ｊａｖａ（登録商標）Ｓｃｒｉｐｔおよび不要なタグ属性等の記述を取り除くと図１３に示すような構造である。
ＰＲＥタグを判定することで、「配列情報」を取得する。 FIG. 13 shows the HTML source code format of the Complete Entries display of the SRS search result of the DDBJ.
The structure of HTML is as shown in FIG. 13 when descriptions such as Java (registered trademark) Script and unnecessary tag attributes are removed.
By determining the PRE tag, “sequence information” is acquired.

その他のＷＥＢページにおいても同様にＨＴＭＬソースコード形式を判定し配列情報を取得する。 Similarly, in other WEB pages, the HTML source code format is determined and the sequence information is acquired.

上記のようにして各配列情報抽出構成部が取得した配列情報を配列情報編集部１４２が入力する。
図６において、配列情報編集部１４２は、各ＷＥＢページから取得された配列情報を、ＷＥＢページに依存しない形式に変換し、また対応する配列を組み合わせて配列情報を作成する。さらに、作成した配列情報を配列解析部１８０に出力する。
例えば、作成する配列情報の形式を図８に示すようなフォーマットにする。 The sequence information editing unit 142 inputs the sequence information acquired by each sequence information extraction configuration unit as described above.
In FIG. 6, the sequence information editing unit 142 converts the sequence information acquired from each WEB page into a format that does not depend on the WEB page, and creates sequence information by combining it with the corresponding sequences. Further, it outputs the created sequence information to the sequence analysis unit 180.
For example, the sequence information is created in the format shown in FIG. 8.
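The editing step can be pictured as normalizing every extractor's output into one record type. The sketch below is an assumption-laden illustration: the `SequenceRecord` fields are invented here and are not the format of FIG. 8, and the input entries are assumed to be dictionaries with optional `url`, `annotation`, and `sequence` keys.

```python
from dataclasses import dataclass

@dataclass
class SequenceRecord:
    annotation: str
    sequence: str
    source: str  # label of the page format the record came from

def edit_records(raw_entries, source, sequences_by_url=None):
    """Convert extractor output into page-independent records.

    Entries that carry only a URL (e.g. from a summary page) are combined
    with the sequence obtained for that URL, passed in `sequences_by_url`.
    """
    sequences_by_url = sequences_by_url or {}
    records = []
    for entry in raw_entries:
        seq = entry.get("sequence") or sequences_by_url.get(entry.get("url"), "")
        seq = "".join(seq.split()).upper()  # strip whitespace, unify case
        if seq:
            records.append(
                SequenceRecord(entry.get("annotation", "").strip(), seq, source))
    return records
```

Entries for which no sequence is available are dropped, so the sequence analysis unit 180 receives only complete records.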

図１４は、実施の形態１における配列情報抽出部１４０のデータフローダイアグラムである。
図１４において、テキスト入力部１１０はブラウザ実行部１０１が出力すたＨＴＭＬ読み込み終了シグナルか、ユーザＩ／Ｆ部１０３が入力したユーザの処理命令データを入力とする。
ソースコード形式判定部１２０は、テキスト入力部１１０から起動される。また、ＷＥＢブラウザ２１０に表示されたＷＥＢページのＨＴＭＬソースコードを入力とする。
第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎは、ソースコード形式判定部１２０から起動される。また、ＷＥＢブラウザ２１０に表示されたＷＥＢページのＨＴＭＬソースコードを入力とする。また、配列とアノテーションとを含む配列情報を出力とする。
配列情報編集部１４２は、第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎから起動される。また、第１配列情報抽出構成部１４１１〜第ｎ配列情報抽出構成部１４１ｎが出力した配列情報を入力とする。また編集した配列情報を出力とする。 FIG. 14 is a data flow diagram of the sequence information extraction unit 140 in the first embodiment.
In FIG. 14, the text input unit 110 takes as input either the HTML read end signal output by the browser execution unit 101 or the user's processing instruction data input via the user I/F unit 103.
The source code format determination unit 120 is activated by the text input unit 110. It takes as input the HTML source code of the WEB page displayed on the WEB browser 210.
The first sequence information extraction configuration unit 1411 to the n-th sequence information extraction configuration unit 141n are activated by the source code format determination unit 120. They take as input the HTML source code of the WEB page displayed on the WEB browser 210, and output sequence information including the sequence and the annotation.
The sequence information editing unit 142 is activated by the first sequence information extraction configuration unit 1411 to the n-th sequence information extraction configuration unit 141n. It takes as input the sequence information output by those units, and outputs the edited sequence information.

図６において、配列解析部１８０は、配列情報、特に配列情報に含まれる配列について解析し、解析結果を解析リンクファイル生成部１９０に出力する。
配列解析部１８０が行う解析処理は任意である。 In FIG. 6, the sequence analysis unit 180 analyzes the sequence information, particularly the sequence included in the sequence information, and outputs the analysis result to the analysis link file generation unit 190.
The analysis process performed by the sequence analysis unit 180 is arbitrary.

解析リンクファイル生成部１９０は、配列情報抽出部１４０の配列情報抽出構成部が配列情報を抽出したＨＴＭＬソースコード内での配列情報または配列情報へのリンク情報の位置に、配列解析部１８０の解析結果をリンクしたタグを挿入した解析リンクファイルを生成し、生成した解析リンクファイルを記憶部１０４に記憶する。
リンクする情報を配列解析部１８０の解析結果とし、リンクファイルの生成について情報リンクファイル生成部１７０と同様である。 The analysis link file generation unit 190 generates an analysis link file in which a tag linking to the analysis result of the sequence analysis unit 180 is inserted at the position, in the HTML source code from which the sequence information extraction configuration unit of the sequence information extraction unit 140 extracted the sequence information, of the sequence information or of the link information to the sequence information, and stores the generated analysis link file in the storage unit 104.
The link file is generated in the same way as by the information link file generation unit 170, except that the information to be linked is the analysis result of the sequence analysis unit 180.

図６において、その他の部分は、配列情報抽出装置１００の処理（１）および（２）の説明と同様である。 In FIG. 6, the other parts are the same as those in the processes (1) and (2) of the sequence information extracting apparatus 100.

配列情報抽出方法の処理の流れについて図７に基づいて説明する。ここでは、配列情報抽出装置１００の処理（３）に関して（１）および（２）の説明と異なる部分を説明する。
まず、Ｓ３で以下の処理を行う。
ブラウザ実行部１０１は、任意のＷＥＢページのＷＥＢブラウザ２１０への表示を開始し、表示処理が終了した場合、シグナル受信部１０２にシグナルを出力する。
ここで、ＷＥＢブラウザ２１０に表示されたＷＥＢページは、バイオ公共データベースの提供する配列情報検索結果のＷＥＢページであるものとして以下の説明をする。
シグナル受信部１０２は、シグナルを受信した場合、テキスト入力部１１０を起動する。
テキスト入力部１１０は、ＷＥＢブラウザに表示されたＷＥＢページのＨＴＭＬソースコードをテキストデータとして入力する。
ソースコード形式判定部１２０は、テキスト入力部１１０が入力したテキストデータの形式がＨＴＭＬソースコードであると判定し、さらに、いずれかのバイオ公共データベースの提供する配列情報検索結果のＷＥＢページのＨＴＭＬソースコードであると判定し、判定したＷＥＢページのＨＴＭＬソースコードを処理する配列情報抽出構成部を起動する。ＨＴＭＬソースコードを配列情報抽出構成部の入力とする。
配列情報抽出構成部は、ＨＴＭＬソースコードと、ＨＴＭＬソースコードからリンクされたバイオ公共データベースのデータとから配列情報を抽出する。
配列情報編集部１４２は、配列情報を編集し、編集した配列情報を記憶部１０４に記憶する。 The processing flow of the sequence information extraction method will be described with reference to FIG. 7. Here, regarding process (3) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) and (2) will be described.
First, the following processing is performed in S3.
The browser execution unit 101 starts displaying an arbitrary WEB page on the WEB browser 210, and outputs a signal to the signal receiving unit 102 when the display process ends.
Here, the following description assumes that the WEB page displayed on the WEB browser 210 is a WEB page of sequence information search results provided by a bio public database.
When receiving a signal, the signal receiving unit 102 activates the text input unit 110.
The text input unit 110 inputs the HTML source code of the WEB page displayed on the WEB browser as text data.
The source code format determination unit 120 determines that the format of the text data input by the text input unit 110 is HTML source code and, further, that it is the HTML source code of a WEB page of sequence information search results provided by one of the bio public databases, and activates the sequence information extraction configuration unit that processes the HTML source code of the determined WEB page. The HTML source code is given to that sequence information extraction configuration unit as input.
The sequence information extraction configuration unit extracts sequence information from the HTML source code and the data of the bio public database linked from the HTML source code.
The sequence information editing unit 142 edits the sequence information and stores the edited sequence information in the storage unit 104.

次に、Ｓ１１で以下の処理を行う。
配列解析部１８０は、記憶部１０４に記憶された配列情報を入力とし、配列解析処理を行い、解析結果を記憶部１０４に記憶する。 Next, in S11, the following processing is performed.
The sequence analysis unit 180 receives the sequence information stored in the storage unit 104 as input, performs sequence analysis processing, and stores the analysis result in the storage unit 104.

次に、Ｓ１２で以下の処理を行う。
解析リンクファイル生成部１９０は、記憶部１０４に記憶された解析結果の記憶位置をリンク情報として取得する。 Next, in S12, the following processing is performed.
The analysis link file generation unit 190 acquires the storage position of the analysis result stored in the storage unit 104 as link information.
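Steps S11 and S12 amount to saving each analysis result and using its storage position as the link information. A minimal sketch, assuming the storage unit 104 is an ordinary directory and the link information is a file URL (both assumptions; the function name is hypothetical):

```python
import os
import pathlib
import tempfile

def store_analysis_result(result_text, storage_dir):
    """Save one analysis result and return a file URL pointing at it."""
    # mkstemp creates a uniquely named file, so results never collide.
    fd, path = tempfile.mkstemp(prefix="analysis_", suffix=".txt",
                                dir=storage_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(result_text)
    # The storage position, expressed as a URL usable in a hyperlink tag.
    return pathlib.Path(path).resolve().as_uri()
```

The returned URL is what would be placed in the hyperlink tag inserted in S10.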

次に、Ｓ１０で以下の処理を行う。
解析リンクファイル生成部１９０は、取得したリンク情報を、テキスト入力部１１０が入力したＨＴＭＬソースコードの対象部分にハイパーリンクするタグを挿入し、ＨＴＭＬソースコードを生成する。
ブラウザ実行部１０１は、生成されたＨＴＭＬソースコードの示すＷＥＢページをＷＥＢブラウザ２１０に表示する。
ユーザは、ＷＥＢブラウザ２１０に表示されたＷＥＢページ上で、ＷＥＢページに表示された各核酸配列や各アミノ酸配列のハイパーリンク部分を指定し、それぞれの配列の解析結果を参照することができる。 Next, the following processing is performed in S10.
The analysis link file generation unit 190 inserts, at the target portion of the HTML source code input by the text input unit 110, a tag that hyperlinks to the acquired link information, and thereby generates HTML source code.
The browser execution unit 101 displays a WEB page indicated by the generated HTML source code on the WEB browser 210.
On the WEB page displayed on the WEB browser 210, the user can specify the hyperlink portion of each nucleic acid sequence or each amino acid sequence displayed on the WEB page, and can refer to the analysis result of each sequence.

上記説明では、配列情報抽出装置１００の処理（３）について説明した。
次に、（４）任意のテキストデータを入力し、テキストデータに含まれる核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、配列情報に対して配列を解析し、核酸配列またはアミノ酸配列を示す文字列とその解析結果とをリンクさせた解析リンクファイルを作成することについて説明する。 In the above description, the process (3) of the sequence information extraction apparatus 100 has been described.
Next, process (4) will be described: arbitrary text data is input, character strings indicating a nucleic acid sequence or an amino acid sequence contained in the text data are extracted, sequence information on the nucleic acid sequence or amino acid sequence is acquired, the sequences in the sequence information are analyzed, and an analysis link file is created in which each character string indicating a nucleic acid sequence or amino acid sequence is linked to its analysis result.

配列情報抽出装置１００の構成について図６に基づいて説明する。ここでは配列情報抽出装置１００の処理（４）に関する構成について（１）および（３）の説明と異なる部分を説明する。
ユーザＩ／Ｆ部１０３は、テキストエディタ２２０に表示されたテキストデータに含まれる核酸配列またはアミノ酸配列の解析結果の取得を示す処理命令データをユーザから入力されたものとして以下の説明をする。
処理切換部１０６は、処理命令データが解析結果の取得を示す場合は、スコア判定部１６０が出力した配列情報を配列解析部１８０に出力して処理を切り換える。
配列解析部１８０は、配列情報、特に配列情報に含まれる配列について解析し、解析結果を解析リンクファイル生成部１９０に出力する。
配列解析部１８０が行う解析処理は任意である。
解析リンクファイル生成部１９０は、候補文字列抽出部１３０が抽出した候補文字列のテキストデータ内での位置に、配列解析部１８０の解析結果をリンクしたタグを挿入した解析リンクファイルを生成し、生成した解析リンクファイルを記憶部１０４に記憶する。
リンクする情報を配列解析部１８０の解析結果とし、リンクファイルの生成について情報リンクファイル生成部１７０と同様である。
図６において、その他の部分は、配列情報抽出装置１００の処理（１）および（３）の説明と同様である。 The configuration of the sequence information extraction apparatus 100 will be described with reference to FIG. 6. Here, regarding the configuration related to process (4) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) and (3) will be described.
The following description assumes that the user I/F unit 103 has received from the user processing instruction data requesting acquisition of analysis results for the nucleic acid sequences or amino acid sequences contained in the text data displayed in the text editor 220.
When the processing instruction data indicates acquisition of the analysis result, the process switching unit 106 outputs the sequence information output by the score determination unit 160 to the sequence analysis unit 180 and switches the process.
The sequence analysis unit 180 analyzes the sequence information, particularly the sequence included in the sequence information, and outputs the analysis result to the analysis link file generation unit 190.
The analysis process performed by the sequence analysis unit 180 is arbitrary.
The analysis link file generation unit 190 generates an analysis link file in which a tag linking to the analysis result of the sequence analysis unit 180 is inserted at the position, in the text data, of each candidate character string extracted by the candidate character string extraction unit 130, and stores the generated analysis link file in the storage unit 104.
The link file is generated in the same way as by the information link file generation unit 170, except that the information to be linked is the analysis result of the sequence analysis unit 180.
In FIG. 6, the other portions are the same as those described in the processes (1) and (3) of the sequence information extraction apparatus 100.

配列情報抽出方法の処理の流れについて図７に基づいて説明する。ここでは、配列情報抽出装置１００の処理（４）に関して（１）および（３）の説明と異なる部分を説明する。
ユーザＩ／Ｆ部１０３は、テキストエディタ２２０に表示されたテキストデータに含まれる核酸配列またはアミノ酸配列の解析結果の取得を示す処理命令データをユーザから入力されたものとして以下の説明をする。
Ｓ１、Ｓ５、Ｓ６、Ｓ７の順で処理を行い、処理内容は配列情報抽出装置１００の処理（１）の説明と同様である。 The processing flow of the sequence information extraction method will be described with reference to FIG. 7. Here, regarding process (4) of the sequence information extraction apparatus 100, only the parts that differ from the description of (1) and (3) will be described.
The following description assumes that the user I/F unit 103 has received from the user processing instruction data requesting acquisition of analysis results for the nucleic acid sequences or amino acid sequences contained in the text data displayed in the text editor 220.
Processing is performed in the order of S1, S5, S6, and S7, and the processing content is the same as the description of the processing (1) of the sequence information extraction apparatus 100.

次に、Ｓ８で以下の処理を行う。
処理切換部１０６は、Ｓ１でユーザＩ／Ｆ部１０３が入力したユーザの処理命令データが、解析結果の取得であると判定し、配列情報を記憶部１０４に記憶し、配列解析部１８０を起動する。 Next, in S8, the following processing is performed.
The process switching unit 106 determines that the user's processing instruction data input via the user I/F unit 103 in S1 requests acquisition of analysis results, stores the sequence information in the storage unit 104, and activates the sequence analysis unit 180.

Ｓ８の処理後、Ｓ１１、Ｓ１２、Ｓ１０の順で処理を行い、処理内容は配列情報抽出装置１００の処理（３）の説明と同様である。 After the processing of S8, the processing is performed in the order of S11, S12, and S10, and the processing content is the same as the description of the processing (3) of the sequence information extracting apparatus 100.

上記説明では、配列情報抽出装置１００の処理（４）について説明した。
次に、（５）任意のＨＴＭＬソースコードを入力し、ＨＴＭＬソースコードのＨＴＭＬタグに括られた文字列の中で核酸配列またはアミノ酸配列を示す文字列を抽出し、核酸配列またはアミノ酸配列の配列情報を取得し、配列情報に対して配列を解析し、核酸配列またはアミノ酸配列を示す文字列とその解析結果とをリンクさせた解析リンクファイルを作成することについて説明する。 In the above description, the process (4) of the sequence information extraction apparatus 100 has been described.
Next, process (5) will be described: arbitrary HTML source code is input, character strings indicating a nucleic acid sequence or an amino acid sequence are extracted from the character strings enclosed in HTML tags of the HTML source code, sequence information on the nucleic acid sequence or amino acid sequence is acquired, the sequences in the sequence information are analyzed, and an analysis link file is created in which each character string indicating a nucleic acid sequence or amino acid sequence is linked to its analysis result.

配列情報抽出装置１００の構成について図６に基づいて説明する。
テキスト入力部１１０は、シグナル受信部１０２またはユーザＩ／Ｆ部１０３から起動され、ＷＥＢブラウザ２１０に表示されたＨＴＭＬソースコードを入力とし、配列情報抽出装置１００は、ＨＴＭＬソースコードに含まれる核酸配列またはアミノ酸配列の解析結果の取得を行う。
各構成の説明は、配列情報抽出装置１００の処理（４）の説明と同様である。 The configuration of the sequence information extraction apparatus 100 will be described with reference to FIG. 6.
The text input unit 110 is activated by the signal receiving unit 102 or the user I/F unit 103 and takes as input the HTML source code displayed on the WEB browser 210, and the sequence information extraction apparatus 100 acquires analysis results for the nucleic acid sequences or amino acid sequences contained in the HTML source code.
The description of each configuration is the same as the description of the process (4) of the sequence information extraction apparatus 100.

配列情報抽出方法の処理の流れについて図７に基づいて説明する。
テキスト入力部１１０は、シグナル受信部１０２またはユーザＩ／Ｆ部１０３から起動され、ＷＥＢブラウザ２１０に表示されたＨＴＭＬソースコードを入力とし、配列情報抽出装置１００は、ＨＴＭＬソースコードに含まれる核酸配列またはアミノ酸配列の解析結果の取得を行う。
処理の流れは、Ｓ２、Ｓ４、Ｓ５、Ｓ６、Ｓ７、Ｓ８、Ｓ１１、Ｓ１２、Ｓ１０の順であり、処理の内容は、配列情報抽出装置１００の処理（２）および（４）の説明と同様である。 The processing flow of the sequence information extraction method will be described with reference to FIG. 7.
The text input unit 110 is activated by the signal receiving unit 102 or the user I/F unit 103 and takes as input the HTML source code displayed on the WEB browser 210, and the sequence information extraction apparatus 100 acquires analysis results for the nucleic acid sequences or amino acid sequences contained in the HTML source code.
The processing proceeds in the order S2, S4, S5, S6, S7, S8, S11, S12, and S10, and the content of each step is the same as described for processes (2) and (4) of the sequence information extraction apparatus 100.
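The step orders stated for processes (3), (4), and (5) can be written down as a small table and driven by a loop. This is only a sketch: the step labels come from the description, while the table name, the handler mapping, and the `run_process` function are hypothetical.

```python
# Steps shared by every process that ends with an analysis link file:
# S11 (analyze), S12 (get link information), S10 (insert hyperlink tags).
ANALYSIS_TAIL = ["S11", "S12", "S10"]

PROCESS_STEPS = {
    3: ["S3"] + ANALYSIS_TAIL,
    4: ["S1", "S5", "S6", "S7", "S8"] + ANALYSIS_TAIL,
    5: ["S2", "S4", "S5", "S6", "S7", "S8"] + ANALYSIS_TAIL,
}

def run_process(number, handlers, data):
    """Call the handler of each step of the given process, in order.

    `handlers` maps a step label to a function that takes the data
    produced so far and returns the data for the next step.
    """
    for step in PROCESS_STEPS[number]:
        data = handlers[step](data)
    return data
```

Processes (4) and (5) differ only in how the input is obtained (S1 versus S2 and S4); from S5 onward they share the same steps.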

実施の形態１では、配列情報抽出装置１００がＨＴＭＬソースコードを含む任意のテキストデータを入力とできることで、ユーザはバイオ公共データベース検索用インタフェースへの入力無しに配列情報の入手や配列情報の解析を行える。 In the first embodiment, because the sequence information extraction apparatus 100 can take arbitrary text data, including HTML source code, as input, the user can obtain sequence information and analyze sequence information without entering anything into a bio public database search interface.

また、配列情報抽出装置１００が配列情報および解析結果を記憶部１０４に記憶し、入力としたテキストデータに配列情報および解析結果をリンクしたリンクファイルを作成できることで、ユーザは配列情報および解析結果ファイルを記憶し、記憶したファイルを探して、探したファイルを表示するという処理無しに、配列情報の表示や配列情報の解析結果の表示を行える。 In addition, because the sequence information extraction apparatus 100 stores the sequence information and analysis results in the storage unit 104 and can create a link file in which the sequence information and analysis results are linked to the input text data, the user can display the sequence information and its analysis results without having to save sequence information and analysis result files, search for the saved files, and open the files found.

In addition, because the sequence information extraction apparatus 100 can extract the nucleic acid sequences or amino acid sequences contained in the input text data, the user can extract them without visually scanning the text data to pick out every nucleic acid sequence and amino acid sequence.
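The extraction of sequence candidates from free text can be illustrated with a minimal Python sketch. The character sets, the minimum length, and the function name `extract_candidates` are assumptions made for illustration only; this passage does not fix them.

```python
import re

# Assumed alphabets (not specified in this passage of the description).
NUCLEIC_CHARS = set("ACGTUN")
AMINO_CHARS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 20  # hypothetical threshold that filters out ordinary words


def extract_candidates(text, min_length=MIN_LENGTH):
    """Return (kind, string) pairs for letter runs that look like sequences."""
    candidates = []
    for match in re.finditer(r"[A-Za-z]+", text):
        token = match.group(0).upper()
        if len(token) < min_length:
            continue  # too short to be treated as a sequence candidate
        chars = set(token)
        if chars <= NUCLEIC_CHARS:
            candidates.append(("nucleic", token))
        elif chars <= AMINO_CHARS:
            candidates.append(("amino", token))
    return candidates
```

A run made only of nucleotide letters is classified as a nucleic acid candidate first, because every nucleotide letter except U is also a valid amino acid letter; a real implementation would need a more careful rule for short or ambiguous runs.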

In the first embodiment, the link files generated by the information link file generation unit 170 and the analysis link file generation unit 190 are HTML source code. However, they are not limited to HTML; link files in other markup language source code may be generated. For example, XML source code may be used, or the file format of an editor that provides a hyperlinked-document function.
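As a rough illustration of an HTML link file, the following Python sketch wraps each extracted candidate string in an anchor that points to a stored sequence information file. The function name `build_link_file`, the `<pre>` wrapper, and the file paths are hypothetical; the actual layout of the link file is not given in this passage.

```python
import html
import re


def build_link_file(text, links):
    """Return HTML in which each candidate string links to its stored file.

    `links` maps a candidate string to the (hypothetical) path of the file
    holding the sequence information acquired for it.
    """
    if not links:
        return "<html><body><pre>" + html.escape(text) + "</pre></body></html>"
    # Longest candidates first so a shorter candidate cannot split a longer one.
    ordered = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(c) for c in ordered))
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        href = html.escape(links[match.group(0)], quote=True)
        parts.append('<a href="{}">{}</a>'.format(href, html.escape(match.group(0))))
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "<html><body><pre>" + "".join(parts) + "</pre></body></html>"
```

The text is scanned in a single pass so that an inserted anchor is never rewritten by a later replacement.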

In the first embodiment, the homology search is performed using an external database. However, the sequence information extraction apparatus 100 may instead include a search unit that executes the homology search, a nucleic acid database, and an amino acid database, and perform the homology search without using an external database.

In the first embodiment, the sequence information extraction apparatus 100 includes the score determination unit 160 and creates a link file that links the sequence information of sequences with high homology scores. However, it may instead skip score determination and create a link file that links the sequence information located at the top of the homology search results.
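The two selection policies, choosing by homology score or simply taking the first search result, can be contrasted in a short Python sketch. The `Hit` record, its field names, and the `threshold` parameter are assumptions for illustration; the criterion for a "high" score is not defined in this passage.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    accession: str  # hypothetical identifier of the matched sequence
    score: float    # homology score reported by the search


def select_hit(hits: Sequence[Hit], use_score: bool = True,
               threshold: float = 0.0) -> Optional[Hit]:
    """Pick the hit whose sequence information will be linked."""
    if not hits:
        return None
    if not use_score:
        # Variant without score determination: take the top of the result list.
        return hits[0]
    best = max(hits, key=lambda h: h.score)
    # The threshold is an assumed stand-in for "high homology score".
    return best if best.score >= threshold else None
```

For example, with hits scored 50, 90, and 70 in that order, the score-based policy selects the second hit, while the variant without score determination selects the first.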

User convenience improves if the user I/F unit 103 takes a form such as a toolbar plugged into the WEB browser 210 or the text editor 220.
For example, the sequence information extraction apparatus 100 may consist of a WEB server that has the sequence analysis unit 180, the search unit, the nucleic acid database, and the amino acid database, and a plug-in that has the remaining components of the sequence information extraction apparatus 100.

For example, the plug-in's menu bar may provide a user login function and a function for specifying previously acquired sequence information and analysis information.
FIG. 15 is a diagram showing the sequence information extraction apparatus 100 comprising a plug-in and a WEB server in the first embodiment.
The WEB server stores acquired sequence information in association with a session ID that identifies the processing performed when that sequence information was acquired.
The user can access the WEB server by logging in (S2).
A logged-in user can specify a session ID (S1).
The user can also request the WEB server to acquire new sequence information (S3), and the WEB server generates a unique session ID when it receives a new request (S4).
When a session ID is specified, the WEB server retrieves the sequence information corresponding to that session ID and performs analysis processing (S5).
The WEB server maps the session ID to the analysis result page (S6), and the plug-in switches the display of the WEB browser 210 to the analysis result page (S7).
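The session handling on the WEB server side can be sketched in Python as follows. This is only an illustrative in-memory model: the class name `SessionStore`, its methods, and the use of `uuid` for session IDs are assumptions, and the login step (S2) is omitted.

```python
import uuid


class SessionStore:
    """In-memory model of the session bookkeeping (hypothetical design)."""

    def __init__(self):
        self._sequence_info = {}  # session ID -> acquired sequence information
        self._result_pages = {}   # session ID -> analysis result page

    def new_request(self, sequence_info):
        """S3/S4: store newly acquired information under a unique session ID."""
        session_id = uuid.uuid4().hex
        self._sequence_info[session_id] = sequence_info
        return session_id

    def analyze(self, session_id, analyzer):
        """S5/S6: analyze the stored information and map the ID to the result."""
        info = self._sequence_info[session_id]
        page = analyzer(info)
        self._result_pages[session_id] = page
        return page

    def result_page(self, session_id):
        """S7: look up the page that the plug-in would display."""
        return self._result_pages.get(session_id)
```

A caller would obtain an ID from `new_request`, pass it to `analyze` together with an analysis function, and later retrieve the same result page with `result_page`.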

FIG. 1 is a diagram showing the procedure for using a bio public database in the prior art.
FIG. 2 is a diagram showing the procedure for using a bio public database in the prior art.
FIG. 3 is a diagram showing the procedure for using a bio public database in the first embodiment.
FIG. 4 is a diagram showing the external appearance of the sequence information extraction apparatus 100 in the first embodiment.
FIG. 5 is a hardware configuration diagram of the sequence information extraction apparatus 100 in the first embodiment.
FIG. 6 is a configuration diagram of the sequence information extraction apparatus 100 in the first embodiment.
FIG. 7 is a diagram showing the processing flow of the sequence information extraction method in the first embodiment.
FIG. 8 is a diagram showing the contents of the sequence information in the first embodiment.
FIG. 9 is a diagram showing an HTML source code format of a WEB page of a bio public database in the first embodiment.
FIG. 10 is a diagram showing an HTML source code format of a WEB page of a bio public database in the first embodiment.
FIG. 11 is a diagram showing an HTML source code format of a WEB page of a bio public database in the first embodiment.
FIG. 12 is a diagram showing an HTML source code format of a WEB page of a bio public database in the first embodiment.
FIG. 13 is a diagram showing an HTML source code format of a WEB page of a bio public database in the first embodiment.
FIG. 14 is a data flow diagram of the sequence information extraction unit 140 in the first embodiment.
FIG. 15 is a diagram showing the sequence information extraction apparatus 100 comprising a plug-in and a WEB server in the first embodiment.

Explanation of symbols

100 sequence information extraction apparatus, 101 browser execution unit, 102 signal receiving unit, 103 user I/F unit, 104 storage unit, 105 editor execution unit, 106 process switching unit, 110 text input unit, 120 source code format determination unit, 130 candidate character string extraction unit, 131 candidate character string determination unit, 132 tag character string extraction unit, 140 sequence information extraction unit, 1411 first sequence information extraction component, 142 sequence information editing unit, 150 sequence information acquisition unit, 160 score determination unit, 170 information link file generation unit, 180 sequence analysis unit, 190 analysis link file generation unit, 200 bio public database, 210 WEB browser, 220 text editor, 901 CRT display device, 902 K/B, 903 mouse, 904 FDD, 905 CDD, 906 printer device, 907 scanner device, 910 system unit, 911 CPU, 912 bus, 913 ROM, 914 RAM, 915 communication board, 920 magnetic disk device, 921 OS, 922 window system, 923 program group, 924 file group, 931 telephone, 932 FAX machine, 940 Internet, 941 web server, 942 LAN.

Claims

A sequence information extraction apparatus connected to a database that stores sequence information of at least one of a nucleic acid sequence and an amino acid sequence, the apparatus comprising:
a text input unit that inputs text data;
a candidate character string extraction unit that extracts, from the character strings included in the text data input by the text input unit, at least one of a candidate character string of a nucleic acid sequence and a candidate character string of an amino acid sequence; and
a sequence information acquisition unit that acquires, from the database, sequence information of a sequence homologous to the candidate character string extracted by the candidate character string extraction unit.

The sequence information extraction apparatus according to claim 1, further comprising:
an information link file generation unit that generates an information link file linking the candidate character string extracted by the candidate character string extraction unit with the sequence information acquired by the sequence information acquisition unit.

The sequence information extraction apparatus according to claim 2, further comprising:
a score determination unit that determines a homology score for the sequence information of the homologous sequences acquired by the sequence information acquisition unit,
wherein the information link file generation unit generates an information link file linking the candidate character string extracted by the candidate character string extraction unit with, among the sequence information acquired by the sequence information acquisition unit, the sequence information of a sequence whose homology score as determined by the score determination unit is high.

The sequence information extraction device further includes:
For the sequence information acquired by the sequence information acquisition unit, a sequence analysis unit for analyzing the sequence,
2. An analysis link file generation unit that generates an analysis link file that links a candidate character string extracted by the candidate character string extraction unit and an analysis result analyzed by a sequence analysis unit. The sequence information extraction device described.

The sequence information extraction apparatus according to claim 4, further comprising:
a score determination unit that determines a homology score for the sequence information of the homologous sequences acquired by the sequence information acquisition unit,
wherein the sequence analysis unit analyzes the sequence of, among the sequence information acquired by the sequence information acquisition unit, the sequence information whose homology score as determined by the score determination unit is high.

The candidate character string extraction unit
includes a candidate character string determination unit that performs, on the character strings included in the text data input by the text input unit, at least one of a determination that identifies a character string composed of characters constituting a nucleic acid sequence as a candidate character string of a nucleic acid sequence and a determination that identifies a character string composed of characters constituting an amino acid sequence as a candidate character string of an amino acid sequence, and thereby extracts the candidate character string. The sequence information extraction apparatus described.

The text input unit
inputs markup language source code as the text data, and
the candidate character string extraction unit includes:
a tag character string extraction unit that extracts a character string enclosed by tags in the markup language source code input by the text input unit; and
a candidate character string determination unit that performs, on the character string extracted by the tag character string extraction unit, at least one of a determination that identifies a character string composed of characters constituting a nucleic acid sequence as a candidate character string of a nucleic acid sequence and a determination that identifies a character string composed of characters constituting an amino acid sequence as a candidate character string of an amino acid sequence, and thereby extracts the candidate character string. 6. The sequence information extraction apparatus according to claim 1.

A sequence information extraction apparatus connected to a database that stores sequence information of at least one of a nucleic acid sequence and an amino acid sequence, the apparatus comprising:
a text input unit that inputs, as text data, markup language source code that is output from the database and indicates sequence information;
a source code format determination unit that determines the description format of the markup language source code input by the text input unit;
a sequence information extraction unit that extracts the sequence information indicated by the markup language source code based on the description format determined by the source code format determination unit; and
a sequence analysis unit that analyzes the sequence of the sequence information extracted by the sequence information extraction unit.

The sequence information extraction apparatus according to claim 8, further comprising:
an analysis link file generation unit that generates an analysis link file linking the sequence information indicated by the markup language source code input by the text input unit with the analysis result produced by the sequence analysis unit.

A sequence information extraction method of a sequence information extraction apparatus connected to a database that stores sequence information of at least one of a nucleic acid sequence and an amino acid sequence, the method comprising:
a text input step of inputting text data;
a candidate character string extraction step of extracting, from the character strings included in the text data input in the text input step, at least one of a candidate character string of a nucleic acid sequence and a candidate character string of an amino acid sequence; and
a sequence information acquisition step of acquiring, from the database, sequence information of a sequence homologous to the candidate character string extracted in the candidate character string extraction step.

A sequence information extraction method of a sequence information extraction apparatus connected to a database that stores sequence information of at least one of a nucleic acid sequence and an amino acid sequence, the method comprising:
a text input step of inputting, as text data, markup language source code that is output from the database and indicates sequence information;
a source code format determination step of determining the description format of the markup language source code input in the text input step; and
a sequence analysis step of analyzing the sequence of the sequence information indicated by the markup language source code, based on the description format determined in the source code format determination step.

A sequence information extraction program that causes a computer to execute the sequence information extraction method according to claim 10 or claim 11.

10. The sequence information extraction apparatus according to claim 1, wherein the database is at least one of a bio public database and a medical biology literature database.