JP2006085437A

JP2006085437A - Non-redundant biopolymer database production method and server for retrieval service

Info

Publication number: JP2006085437A
Application number: JP2004269658A
Authority: JP
Inventors: Masashi Hata (昌史秦); Sada Mizunuma (貞水沼)
Original assignee: Hitachi Software Engineering Co Ltd
Current assignee: Hitachi Software Engineering Co Ltd
Priority date: 2004-09-16
Filing date: 2004-09-16
Publication date: 2006-03-30

Abstract

PROBLEM TO BE SOLVED: To shorten database search time when searching across biopolymer databases, by eliminating the need to access the search Web page of each database and repeat the search, and the need to remove duplicates from the retrieved search results.

SOLUTION: A non-redundant biopolymer database 104 is created by using the correspondence of data among biopolymer databases A, B, and C, and searches are performed against the created non-redundant biopolymer database.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a method for efficiently searching data in biopolymer data files such as base sequences and protein sequences.

Numerous biopolymer databases are publicly available on the Internet (for example, Baxevanis, A.D., Nucl. Acids Res., 28:1-10, 2000; "Genetics Databases" (Bishop, M.J., ed.), Academic Press, Cambridge, 1999). Researchers in molecular biology search these databases to obtain data relevant to their research. Because much biopolymer data is duplicated across databases, the biopolymer databases publish the correspondence between the IDs of the duplicated data.

To search across multiple databases, a user must access the search Web page of each database and repeat the search for each one. In addition, because data is duplicated between databases, the duplicates must be removed from the retrieved search results. The whole procedure is very laborious.

For example, as shown in FIG. 11, to search across database 801 and database 802, a search is run against both databases. Then, with reference to the correspondence 805, the duplicated data 806 is deleted from the obtained search results 803 and 804 to obtain a duplicate-free search result 807.
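The conventional procedure of FIG. 11 can be pictured with a minimal Python sketch. The record IDs, the pair list, and the function name `merge_without_duplicates` are hypothetical illustrations and do not appear in the specification; the sketch only assumes that each correspondence pair lists an ID of the first database followed by the equivalent ID of the second.

```python
# Hypothetical search results from two databases (IDs only).
results_801 = ["A1", "A4", "A7"]
results_802 = ["B2", "B4", "B7"]

# Hypothetical correspondence 805: (ID in first database, equivalent ID in second).
correspondence_805 = [("A4", "B4"), ("A7", "B7")]


def merge_without_duplicates(first, second, pairs):
    """Keep every hit of `first`; drop hits of `second` that duplicate a hit of `first`."""
    first_ids = set(first)
    duplicated = {b for a, b in pairs if a in first_ids}
    return list(first) + [x for x in second if x not in duplicated]


merged = merge_without_duplicates(results_801, results_802, correspondence_805)
print(merged)  # ['A1', 'A4', 'A7', 'B2']
```

This merge has to be repeated for every query, which is the effort the invention aims to remove.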

An object of the present invention is to provide a method for efficiently searching biopolymer databases.

In the present invention, the correspondence of data between biopolymer databases is used to remove duplicated data and create a non-redundant biopolymer database, and searches are performed against the created non-redundant biopolymer database. With this method, a single search is equivalent to searching a plurality of biopolymer databases simultaneously, and it yields non-redundant search results.

According to the present invention, a cross-database search of biopolymer databases no longer requires accessing the search Web page of each database and repeating the search, nor removing duplicates from the retrieved search results. The time spent on database searches can therefore be shortened.

An embodiment of the present invention is described below in detail with reference to the drawings.
FIG. 1 is a schematic diagram showing an example of a search service according to the present invention. The search service center 111 includes a search service server 105 having a storage device 101. The search service server 105 includes a DB data acquisition unit 121, a correspondence acquisition unit 122, a correspondence table creation unit 123, a non-redundant DB creation unit 124, and a search processing unit 125.

The search service center 111 downloads the data of a plurality of external databases A, B, and C, which contain data duplicated between the databases, onto the storage device 101 of the search service server 105 in the search service center 111. This processing is performed by the DB data acquisition unit 121 of the search service server 105. The correspondence acquisition unit 122 of the search service server 105 acquires information on the correspondence of data between the databases and passes it to the correspondence table creation unit 123. The correspondence table creation unit 123 creates a correspondence table 130 representing the correspondence of the data duplicated between the databases and stores it in the storage device 101. Thereafter, the correspondence table 130 is used to remove the duplicated data (103) from the downloaded data of databases A, B, and C, and a non-redundant biopolymer database 104 is constructed. This processing is performed by the non-redundant DB creation unit 124.

Using this non-redundant biopolymer database 104, the search service center 111 provides a search service via a network 107 to a user (client) operating a device equipped with a display device 108, an arithmetic unit 109, a keyboard 106, and a mouse 110. This search service is performed by the search processing unit 125 of the search service server 105.

FIG. 2 is a diagram schematically showing the data registered in databases A, B, and C. In the example of FIG. 2, data A1, A4, A5, and A7 are registered in database A; data B2, B4, B6, and B7 are registered in database B; and data C3, C5, C6, and C7 are registered in database C.

The method for creating a non-redundant biopolymer database according to the present invention is described with reference to the flowchart of FIG. 3 and the process diagram of FIG. 7.

First, the DB data acquisition unit 121 of the search service server 105 in the search service center 111 downloads the data of a plurality of external biopolymer databases, in this example database A, database B, and database C, onto the storage device 101 (S11). Next, the search service server 105 accesses database A, database B, and database C, and the correspondence acquisition unit 122 acquires information on the correspondence of data between the databases. Each biopolymer database contains a portion that describes its correspondence with the data of other biopolymer databases; the correspondence acquisition unit 122 extracts that portion and passes it to the correspondence table creation unit 123. The correspondence table creation unit 123 organizes the received data, creates the correspondence table 130, and stores it in the storage device 101 (S12).

FIG. 4 is a schematic diagram of the correspondence table 130 created in this way, which shows the correspondence of data between the databases. In this example, the table registers the following pairs as equivalent data: A4 and B4, and A7 and B7, between databases A and B; B6 and C6, and B7 and C7, between databases B and C; and C5 and A5, and C7 and A7, between databases C and A.
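One possible in-memory form of such a correspondence table is a symmetric lookup from each ID to the IDs registered as equivalent. The Python sketch below uses the pairs of the FIG. 4 example; the dictionary layout and the name `build_correspondence_table` are assumptions made for illustration, not the structure prescribed by the specification.

```python
from collections import defaultdict

# Equivalent pairs registered in the FIG. 4 example.
PAIRS = [
    ("A4", "B4"), ("A7", "B7"),  # databases A-B
    ("B6", "C6"), ("B7", "C7"),  # databases B-C
    ("C5", "A5"), ("C7", "A7"),  # databases C-A
]


def build_correspondence_table(pairs):
    """Return a symmetric mapping: record ID -> set of equivalent record IDs."""
    table = defaultdict(set)
    for left, right in pairs:
        table[left].add(right)
        table[right].add(left)
    return dict(table)


table = build_correspondence_table(PAIRS)
print(sorted(table["A7"]))  # ['B7', 'C7']
```

Storing each pair in both directions lets a later step ask, for any record, which records in other databases duplicate it.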

FIG. 5 is a diagram showing a specific example of the correspondence of data between biopolymer databases. It shows the correspondence between the UniGene database and the GenBank database, both published by NCBI (National Center for Biotechnology Information). Correspondences of data between biopolymer databases are published in this kind of format. The data is tab-delimited, and each line represents one record. The first column 301 holds the UniGene ID, and the fourth column 302 holds the GenBank ID corresponding to that UniGene entry. For example, the data with the UniGene ID Hs.103504 (303) corresponds to the data with the GenBank ID AF061055 (304). Extracting these columns yields the correspondence of data between the databases.
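Extracting the two ID columns from such a tab-delimited file could look like the following Python sketch. Only the IDs Hs.103504 and AF061055 and the column positions (first and fourth) come from the description; the contents of the second and third columns are not given, so the sample line uses hypothetical placeholders, and the function name `read_correspondence` is likewise an assumption.

```python
import csv
import io

# One sample record; "col2" and "col3" are hypothetical placeholders for
# columns whose contents the description does not specify.
SAMPLE = "Hs.103504\tcol2\tcol3\tAF061055\n"


def read_correspondence(lines, src_col=0, dst_col=3):
    """Yield (source ID, target ID) pairs from tab-delimited lines.

    Columns are zero-based: 0 is the first column, 3 is the fourth.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        # Skip blank or short lines rather than failing on them.
        if len(row) > max(src_col, dst_col):
            pairs.append((row[src_col], row[dst_col]))
    return pairs


pairs = read_correspondence(io.StringIO(SAMPLE))
print(pairs)  # [('Hs.103504', 'AF061055')]
```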

The subsequent processing is performed by the non-redundant DB creation unit 124 of the search service server 105. Based on prioritization instructions entered by an operator, the non-redundant DB creation unit 124 first assigns priorities to database A, database B, and database C. The priorities may be assigned arbitrarily. Here, as shown in FIG. 6, database A is given the highest priority score, followed by database B and then database C (step 13). Next, in descending order of priority (step 14), the unit acquires from each database the data that has no correspondence with data in any database of higher priority (step 15). Repeating steps 14 and 15 creates the non-redundant DB 104.

The processing of steps 14 and 15 is described with reference to FIG. 7. First, as shown in FIG. 7(a), data is acquired from database A, which has the highest priority score. Because no database has a higher priority than database A, all of its data, A1, A4, A5, and A7, is acquired. Next, as shown in FIG. 7(b), data is acquired from database B, which has the second-highest priority score. Because database B has a lower priority than database A, only the data that has no correspondence with database A, namely B2 and B6, is acquired from database B. Duplicates are identified using the correspondence table 130 shown in FIG. 4, which represents the correspondence of data between the databases. The dotted lines in FIG. 7 indicate duplicated data. Finally, as shown in FIG. 7(c), data is acquired from database C, which has the lowest priority. Because database C has a lower priority than databases A and B, only the data that has no correspondence with database A or database B, namely C3, is acquired from database C. Here too, duplicates are identified using the correspondence table shown in FIG. 4. In this way, the non-redundant database 104 containing data A1, B2, C3, A4, A5, B6, and A7 is created.
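Steps 13 to 15 can be traced with a short Python sketch on the example data of FIG. 2, FIG. 4, and FIG. 6. The function and variable names are hypothetical, and the sketch is one reading of the described procedure rather than the actual implementation of the non-redundant DB creation unit 124.

```python
# Example data from FIG. 2 (records), FIG. 4 (pairs), and FIG. 6 (priority order).
DATABASES = {
    "A": ["A1", "A4", "A5", "A7"],
    "B": ["B2", "B4", "B6", "B7"],
    "C": ["C3", "C5", "C6", "C7"],
}
PAIRS = [("A4", "B4"), ("A7", "B7"), ("B6", "C6"),
         ("B7", "C7"), ("C5", "A5"), ("C7", "A7")]
PRIORITY = ["A", "B", "C"]  # highest priority first


def build_non_redundant(databases, pairs, priority):
    """Keep a record unless the table links it to a record of a higher-priority database."""
    owner = {rec: name for name, recs in databases.items() for rec in recs}
    partners = {}
    for left, right in pairs:
        partners.setdefault(left, set()).add(right)
        partners.setdefault(right, set()).add(left)
    rank = {name: i for i, name in enumerate(priority)}

    result = []
    for name in priority:                      # step 14: descending priority
        for rec in databases[name]:            # step 15: filter each record
            has_higher = any(rank[owner[p]] < rank[name]
                             for p in partners.get(rec, ()))
            if not has_higher:
                result.append(rec)
    return result


non_redundant = build_non_redundant(DATABASES, PAIRS, PRIORITY)
print(non_redundant)  # ['A1', 'A4', 'A5', 'A7', 'B2', 'B6', 'C3']
```

The result contains the same seven records as the description (A1, B2, C3, A4, A5, B6, A7), listed here in acquisition order. B6 is kept even though it corresponds to C6, because C has a lower priority than B; C6 is the one that is dropped.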

FIG. 8 is a conceptual diagram of a search against the search service server 105 in the search service center 111. As shown in FIG. 8, a user 906 can search, through a network 904, a non-redundant database 903 created by removing the duplicates from the overlapping databases 901 and 902, and can therefore obtain a duplicate-free search result 905. The user enters a search keyword through a graphical user interface such as the one shown in FIG. 9. When the user enters a keyword 403 in the text box 401 and presses the search start button 402, the search starts, and the data related to the entered keyword is displayed as a list as shown in FIG. 10. The list shows, for each hit, the name 501 of the database from which the data was extracted, the data ID 502, and a summary 503 of the data. Clicking a data ID 502 displays the details of that data.
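A keyword search over the non-redundant database, returning the three fields listed in FIG. 10, might be sketched in Python as follows. The description does not say how keywords are matched, so the case-insensitive substring match is an assumption; the `Record` type, the `search` function, and the summary texts are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    database: str   # source database name (501)
    record_id: str  # data ID (502)
    summary: str    # data summary (503)


# Hypothetical contents; the summaries are invented purely for illustration.
NON_REDUNDANT_DB = [
    Record("A", "A1", "protein kinase mRNA"),
    Record("A", "A4", "zinc finger protein"),
    Record("B", "B2", "kinase inhibitor gene"),
    Record("C", "C3", "ribosomal protein"),
]


def search(db, keyword):
    """Return records whose summary contains the keyword (case-insensitive, assumed rule)."""
    needle = keyword.lower()
    return [r for r in db if needle in r.summary.lower()]


for hit in search(NON_REDUNDANT_DB, "Kinase"):
    print(hit.database, hit.record_id, hit.summary)
```

Because the database was deduplicated when it was built, the hit list needs no per-query duplicate removal.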

FIG. 1: Schematic diagram showing an example of a search service according to the present invention.
FIG. 2: Diagram showing an example of the data in the biopolymer databases.
FIG. 3: Flowchart showing the flow of the method for creating a non-redundant biopolymer database.
FIG. 4: Schematic diagram of the correspondence table.
FIG. 5: Diagram showing a specific example of the correspondence of data between biopolymer databases.
FIG. 6: Diagram showing an example of priority settings.
FIG. 7: Schematic process diagram showing an example of creating a non-redundant biopolymer database.
FIG. 8: Conceptual diagram of a search against the search service server.
FIG. 9: Diagram showing an example of the keyword input graphical user interface.
FIG. 10: Diagram showing an example of the search result list graphical user interface.
FIG. 11: Explanatory diagram of the conventional search method.

Explanation of symbols

101: storage device; 103: duplicate removal operation; 104: non-redundant biopolymer database; 105: search service server; 107: network; 111: search service center; 121: DB data acquisition unit; 122: correspondence acquisition unit; 123: correspondence table creation unit; 124: non-redundant DB creation unit; 125: search processing unit; 130: correspondence table; 301: UniGene ID; 302: GenBank ID; 303: duplicated UniGene ID; 304: duplicated GenBank ID; 401: text box; 402: search start button; 403: keyword; 501: name of the database from which the data was extracted; 502: data ID; 503: data summary; 903: non-redundant biopolymer database; 904: network; 905: duplicate-free search result

Claims

A method for creating a non-redundant biopolymer database, comprising:
a step of storing, in a storage device, data of a plurality of biopolymer databases that have data overlapping between the databases, kept separate for each database;
a step of obtaining, from the plurality of biopolymer databases, information on the correspondence of data between the databases, and storing it in the storage device as a correspondence table;
a step of prioritizing the plurality of biopolymer databases; and
a step of repeating, for the plurality of biopolymer databases stored in the storage device and in order from the database with the highest priority, a process of acquiring the data whose correspondence with data of any database of higher priority than that database is not registered in the correspondence table.

A search service server comprising:
a DB data acquisition unit that acquires data from a plurality of external biopolymer databases and stores the data in a storage device separately for each database;
a correspondence acquisition unit that acquires information on the correspondence of data between the plurality of biopolymer databases;
a correspondence table creation unit that organizes the information acquired by the correspondence acquisition unit and creates a correspondence table representing the correspondence of the data duplicated between the databases; and
a non-redundant DB creation unit that, with reference to the correspondence table, removes duplicated data from the data of the plurality of biopolymer databases stored in the storage device to create a non-redundant biopolymer database.