JP4334955B2

JP4334955B2 - Biological information lossless encoder

Info

Publication number: JP4334955B2
Application number: JP2003323368A
Authority: JP
Inventors: 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2003-09-16
Filing date: 2003-09-16
Publication date: 2009-09-30
Anticipated expiration: 2023-09-16
Also published as: JP2005087069A

Description

The present invention relates to fields in which biological information databases are constructed and searched, such as bioinformatics, genomic drug discovery, and the development of new biomaterials; to CG animation video production using computer graphics; to visualization video production in scientific and engineering simulation; and to CG-based visualization of polymer structure and behavior.

In recent years, with the rapid progress of bioinformatics, exemplified by the Human Genome Project, enormous biological information databases are being built. DNA sequence data in particular is approaching completion, and proteome information is now accumulating at a rapid pace. To exploit these large databases for drug development, new material development, and similar applications, it is important that they can be handled smoothly over a network. In other words, what matters is how efficiently the data can be compressed and how efficiently it can be searched.

Because even a single-character error in a biological information sequence can be a fatal defect, lossy compression such as MPEG and near-lossless compression cannot be applied; compression is limited to lossless methods. Fortunately, biological information sequences are in ASCII text format, so general-purpose lossless compression tools for text (ZIP, LZH, etc.) can compress them to some extent, and ZIP is applied to the databases accumulated to date.

Several other techniques have been proposed for encoding such biological information (see, for example, Patent Documents 1 and 2). Analysis of biological information also requires analysis of three-dimensional models such as protein three-dimensional structures, and a method for compressing such three-dimensional models has also been proposed (see, for example, Patent Document 3).
Patent Document 1: JP 2003-188735 A
Patent Document 2: JP 2003-101485 A
Patent Document 3: JP H10-320583 A

However, the general-purpose compression tools described above (universal compression methods) and the techniques of the above patent documents cannot exploit the characteristics of biological information, so their compression ratio is limited. For example, a DNA sequence is composed of four characters and can in theory be encoded with 2 bits per character. In the FASTA format, the recording format used by FASTA, a representative homology search engine for DNA, annotation information is mixed in with the sequence, so compression reaches only about 3 bits per character. In addition, DNA has characteristic repeating patterns, and exploiting them may allow compression to less than 2 bits per character.

Accordingly, an object of the present invention is to provide: a lossless encoding device for biological information that exploits the characteristics of biological information sequences and can compress biological information with an optimal code length even when annotation information is mixed in; a biological information search device that can search a compressed biological information sequence with little memory, without fully decoding it; and a lossless encoding device for three-dimensional information that can losslessly compress three-dimensional models as well.

To solve the above problem, the lossless encoding device for biological information according to the present invention comprises: data separating means that, for a biological information file composed of sequence information of characters defined within a predetermined range and annotation information annotating a specific range of the sequence information, separates the annotation information from the sequence information to form annotation data and a sequence data body, and adds to the annotation data link information to the sequence data body so that the biological information file can be restored; fixed-length encoding means that compresses the data by assigning a fixed bit length to each character recorded in the sequence data body, thereby obtaining intermediate sequence data; and variable-length encoding means that compresses, with variable bit lengths, each of the fixed-length-compressed intermediate sequence data and the annotation data.

Further, the biological information search device according to the present invention searches for a target sequence in search sequence data in which each base or amino acid is recorded in less than one byte, and comprises: search key input means for inputting a sequence to be used as a search key; search pattern creation means for creating a plurality of search patterns by shifting the input search key one recording unit (one base or one amino acid) at a time and adding arbitrary bits so that each pattern as a whole occupies a whole number of bytes; and collation means for collating the created search patterns against the search sequence data by comparing them one byte at a time.

Further, the lossless encoding device for three-dimensional information according to the present invention operates on a three-dimensional information file composed of character information, which includes numerical values defined within a predetermined range, and annotation information annotating a specific range of the character information, and comprises: run-length encoding means that extracts the blank character codes marking information delimiters, performs run-length encoding, and converts the blank character portions of the three-dimensional information file into predetermined run-length codes; data separating means that separates the numerical values contained in the character information to form a numerical data body, treats the remainder as annotation data, and adds to the annotation data link information to the numerical data body so that the three-dimensional information file can be restored; and variable-length encoding means that compresses, with variable bit lengths, each of the numerical data body and the annotation data.

According to the lossless encoding device for biological information of the present invention, for a biological information file in which annotation information and sequence information are mixed, the annotation information and the sequence information are separated into annotation data and a sequence data body, link information to the sequence data body is added to the annotation data, and each is then encoded. As a result, biological information can be compressed with an optimal code length even when annotation information is mixed in.

According to the biological information search device of the present invention, the input search key is shifted one character (one base or one amino acid) at a time to create a plurality of search patterns, each a whole number of bytes, and the sequence data is searched using these patterns. As a result, the search can be performed with little memory.

According to the lossless encoding device for three-dimensional information of the present invention, for a three-dimensional information file in which annotation information and numerical information are mixed, the blank character codes marking information delimiters are extracted and run-length encoded, the annotation information and the numerical information are then separated into annotation data and a numerical data body, link information to the numerical data body is added to the annotation data, and each is then encoded. As a result, three-dimensional models can also be compressed losslessly.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Lossless encoding device for biological information)
FIG. 1 is a functional block diagram showing the configuration of the lossless encoding device for biological information according to the present invention. In FIG. 1, 1 is data separating means, 2 is fixed-length encoding means, and 3 is variable-length encoding means. The data separating means 1 separates the annotation information and the sequence information recorded in a biological information file to obtain annotation data and a sequence data body. The fixed-length encoding means 2 encodes the sequence data body separated by the data separating means 1 by assigning the same fixed bit length to every character, regardless of which sequence character it is. The variable-length encoding means 3 encodes, with variable lengths, both the annotation data separated by the data separating means 1 and the sequence data body encoded by the fixed-length encoding means 2.

The structure of the biological information to be compressed in the present invention is described here. In this embodiment, base sequences and amino acid sequences can be used as biological information. Base sequences are described first. FIG. 2(a) shows an original base sequence file expressed in the FASTA format, a representative data format. In FIG. 2(a), t, c, a, and g denote the four bases thymine, cytosine, adenine, and guanine, respectively. Annotation information other than the four base characters is abbreviated here as <ANNOTATION>; in an actual file, annotation information describing the base sequence is written there. The characters making up the annotation information and the bases are all recorded in ASCII code, so each character takes 8 bits.

Next, the processing operation of the device shown in FIG. 1 will be described. When an original base sequence file such as that shown in FIG. 2(a) is input, the data separating means 1 first separates the annotation information and the sequence information in the file into annotation data and a sequence data body. Specifically, the original base sequence file is read in order from the beginning. Text consisting only of the ASCII characters t, c, a, and g is judged to be part of the sequence data body, and text containing any other ASCII character is judged to be annotation data; the two are separated accordingly. During this process, the number of bases separated into the sequence data body is counted, and after each piece of annotation information the number of bases that followed it is recorded. In the example of FIG. 2(a), 67 bases follow <ANNOTATION2>, so information indicating that 67 bases are to be inserted is recorded in the annotation data. In this embodiment, however, the annotation information is recorded in ASCII code, and values from 0 to 127 are interpreted as character information. The base count 67 is therefore added to 127, the maximum value used for character information, before being recorded. Consequently, as shown in FIG. 2(b), "194" is recorded after <ANNOTATION2>.
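The separation rule described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `separate` and the line-by-line treatment are assumptions, and line breaks are simply dropped here, so this sketch alone does not reproduce the original file byte for byte.

```python
BASES = set("tcag")  # the four base characters named in the description

def separate(lines):
    """Split FASTA-like lines into annotation entries and one sequence body.

    Returns (annotations, body), where annotations is a list of
    [annotation_text, number_of_bases_that_follow_it].
    Hypothetical helper: the patent does not specify this interface.
    """
    annotations = []
    body = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines are ignored in this sketch
        if set(text) <= BASES:
            # Only t, c, a, g: part of the sequence data body.
            body.append(text)
            if not annotations:
                annotations.append(["", 0])
            annotations[-1][1] += len(text)
        else:
            # Any other character: annotation data.
            annotations.append([text, 0])
    return annotations, "".join(body)
```

Note that, under the rule as stated, an annotation line that happens to consist only of t, c, a, and g would be classified as sequence data; the sketch inherits that behavior.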

One byte can hold values from 0 to 255, and since 0 to 127 are used for character information as described above, at most 128 bases can be recorded in one byte. When the number of bases is 129 or more, the count is therefore recorded in two bytes. In the example of FIG. 2(a), 136 bases follow <ANNOTATION1>, so information indicating that 136 bases are to be inserted is recorded in the annotation data. In this case, 136 is split into 128 and 8, and 127 is added to each before they are recorded as the first and second bytes. Consequently, as shown in FIG. 2(b), "255" and "135" are recorded after <ANNOTATION1>. Recording the number of bases to be inserted in the annotation data in this way makes it possible to link the annotation data to the sequence data body at decoding time.
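The count encoding can be illustrated with a short sketch. The function names are hypothetical, and continuing past two bytes for counts above 256 is an assumption extrapolated from the two-byte example in the text.

```python
def encode_count(n):
    """Encode a base count as bytes in the range 128..255 (value + 127).

    Each byte carries at most 128 bases; larger counts are split into
    chunks of 128 plus a remainder, as in the 136 -> 128 + 8 example.
    """
    if n < 1:
        raise ValueError("count must be positive")
    out = []
    while n > 128:
        out.append(128 + 127)  # a full chunk of 128 bases -> 255
        n -= 128
    out.append(n + 127)
    return bytes(out)

def decode_count(data):
    """Recover the base count from consecutive count bytes."""
    return sum(b - 127 for b in data)
```

With these definitions, `encode_count(67)` yields the single byte 194 and `encode_count(136)` yields the bytes 255 and 135, matching the values given for FIG. 2(b).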

The sequence data body is the original base sequence file with the annotation information removed, so that the bases are arranged contiguously. When 136 bases and 67 bases are recorded as in FIG. 2(a), 203 bases are therefore recorded contiguously, as shown in FIG. 2(c).

Next, the fixed-length encoding means 2 applies fixed-length encoding to the sequence data body to obtain intermediate sequence data. Specifically, each base, recorded in 8 bits, is replaced with a 2-bit code using the base conversion table shown in FIG. 3. As a result, each base that occupied 8 bits now occupies 2 bits, which greatly reduces the amount of data.
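A minimal sketch of this 2-bit packing follows. The actual assignment of bit patterns is given by the base conversion table of FIG. 3, which is not reproduced in the text; the mapping `BASE_CODE` below is therefore a placeholder assumption, as are the function names and the zero-bit padding of the final byte.

```python
# Hypothetical 2-bit codes; the real table is the one in FIG. 3.
BASE_CODE = {"t": 0b00, "c": 0b01, "a": 0b10, "g": 0b11}
CODE_BASE = {v: k for k, v in BASE_CODE.items()}

def pack_bases(seq):
    """Pack a string of t/c/a/g into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | BASE_CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # pad the last byte with zero bits
        out.append(byte)
    return bytes(out)

def unpack_bases(data, n):
    """Restore the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(CODE_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

The base count `n` passed to `unpack_bases` would come from the counts stored in the annotation data, so the padding bits in the last byte are never mistaken for bases.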

Meanwhile, the variable-length encoding means 3 encodes the annotation data with variable lengths. An outline of this processing is shown in the flowchart of FIG. 4. First, the annotation data that has been read is run-length compressed byte by byte (step S1). Next, a frequency table of the byte data is created (step S2). Specifically, the table associates bit strings of shorter length with byte values in descending order of their frequency of occurrence. The frequency table is saved for later use. Next, the run-length-compressed data is converted using the frequency table (step S3), so that the more frequent a value is, the smaller the value it is converted to. The converted data is then variable-length encoded (step S4). A well-known method such as Golomb-Rice coding can be used for the variable-length encoding in step S4. The result is compressed annotation data.
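Steps S1 to S4 can be sketched as below. This is one possible reading under stated assumptions, not the patented implementation: the run-length format (value, run) pairs, the rank-based use of the frequency table, the Rice parameter `k`, and all function names are choices made for illustration. A naive pair-based run-length code can also enlarge data with few repeats.

```python
from collections import Counter

def run_length(data):
    """S1: byte-wise run-length coding as (value, run) pairs, run in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)

def frequency_table(data):
    """S2: byte values ordered from most frequent to least frequent."""
    counts = Counter(data)
    return sorted(counts, key=lambda b: (-counts[b], b))

def to_ranks(data, table):
    """S3: replace each byte by its rank, so frequent bytes become small."""
    rank = {b: r for r, b in enumerate(table)}
    return [rank[b] for b in data]

def rice_encode(values, k):
    """S4: Rice code: unary quotient, a 0 terminator, then k remainder bits."""
    bits = []
    for v in values:
        q, r = v >> k, v & ((1 << k) - 1)
        remainder = format(r, f"0{k}b") if k else ""
        bits.append("1" * q + "0" + remainder)
    return "".join(bits)

def compress(data, k=2):
    """Run S1-S4 and return the bit string together with the saved table."""
    rle = run_length(data)
    table = frequency_table(rle)
    return rice_encode(to_ranks(rle, table), k), table
```

The table returned by `compress` plays the role of the saved frequency table: a decoder would need it to map ranks back to byte values before undoing the run-length step.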

After processing the annotation data, the variable-length encoding means 3 applies variable-length encoding to the intermediate sequence data produced by the fixed-length encoding means 2. This processing is the same as steps S1 to S4 above. The result is compressed sequence data.

The above processing yields a compressed file containing the compressed annotation data, the compressed sequence data, the annotation frequency table, the sequence frequency table, and the base conversion table. Storing this compressed file in a predetermined storage device makes it possible to distribute it. For example, if the file is saved in a predetermined directory of a computer that is accessible over the Internet, users need to download only a small amount of data and can therefore obtain the data quickly.

Next, decoding of the compressed file will be described. Decoding restores the annotation data from the compressed annotation data and the annotation frequency table, restores the intermediate sequence data from the compressed sequence data and the sequence frequency table, restores the sequence data body from the intermediate sequence data and the base conversion table, and finally merges the annotation data and the sequence data body to obtain the original biological information file. Specifically, the compressed annotation data is first subjected to the reverse of the processing shown in the flowchart of FIG. 4, using the annotation frequency table, to restore the annotation data. The compressed sequence data is likewise subjected to the reverse of the processing of FIG. 4, using the sequence frequency table, to restore the intermediate sequence data. Since each base in the intermediate sequence data is represented by 2 bits, the sequence data body is restored by converting each base back to 8 bits using the base conversion table. The annotation data and the sequence data body are then merged: the annotation information <ANNOTATION> is read from the annotation data, the number of bases given by the insertion count recorded immediately after it is read from the sequence data body, and those bases are inserted after the annotation information. Repeating this for each piece of annotation information restores the biological information file.
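The final merge step can be sketched as follows, assuming the annotation data is a byte string in which values below 128 are ASCII annotation characters and values of 128 or more are insertion counts offset by 127, as described for the encoder. The function name `merge` is hypothetical, and line breaks are not modeled.

```python
def merge(annotation_data, body):
    """Rebuild the text by inserting bases from body at each count byte.

    annotation_data: bytes; values < 128 are ASCII annotation characters,
                     values >= 128 mean "insert (value - 127) bases here".
    body:            the restored sequence data body as a string.
    """
    out = []
    pos = 0  # read position in the sequence data body
    for b in annotation_data:
        if b < 128:
            out.append(chr(b))
        else:
            n = b - 127
            out.append(body[pos:pos + n])
            pos += n
    return "".join(out)
```

Because consecutive count bytes are simply processed one after another, a two-byte count such as 255, 135 inserts 128 and then 8 bases without any special handling.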

(Biological information search device)
Next, the biological information search device according to the present invention will be described. FIG. 5 is a functional block diagram showing the configuration of the biological information search device according to the present invention. In FIG. 5, 11 is search key input means, 12 is search pattern creation means, and 13 is collation means. The search key input means 11 accepts a search key, that is, the sequence to be searched for. The search pattern creation means 12 creates a plurality of search patterns by shifting the input search key one character at a time. The collation means 13 collates the created search patterns against the sequences in the intermediate sequence data.

Next, the processing operation of the search device shown in FIG. 5 will be described. The structure of the intermediate sequence data is shown in FIG. 6(a). As described above, each base in the intermediate sequence data is recorded in 2 bits. In FIG. 6, the data is shown divided into units of 1 byte (4 bases). Consider searching this intermediate sequence data for the sequence "tatagc". When the search key "tatagc" is input from the search key input means 11, the search pattern creation means 12 creates the four search patterns shown in FIG. 6(b): A "tatagc**", B "*tatagc*", C "**tatagc", and D "***tatagc***". Here, "*" stands for an arbitrary 2-bit value. Each search pattern is a whole number of bytes: patterns A, B, and C are 2 bytes, and pattern D is 3 bytes. Next, the collation means 13 searches byte by byte, starting from the beginning of a search pattern. For example, the first byte of pattern A, "tata", is matched against the intermediate sequence data one byte at a time, and when a matching byte is found, the second byte, "gc**", is matched. In this way, there is no need to match the entire target sequence at once; the second and subsequent bytes are matched only when the first byte matches, which greatly shortens the search time. If no match is found with pattern A, the search is attempted with patterns B, C, and D in that order, so that all patterns are tried.
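The pattern construction and byte-wise matching can be sketched as below. Each pattern byte is represented as a (value, mask) pair so that the wildcard positions "*" are ignored in the comparison; this representation, the 2-bit codes in `BASE_CODE`, the `n_bases` argument, and the function names are all assumptions for illustration. Unlike the description, which tries the patterns in order, this sketch collects the hits from all four patterns.

```python
# Hypothetical 2-bit codes; the real table is the one in FIG. 3.
BASE_CODE = {"t": 0b00, "c": 0b01, "a": 0b10, "g": 0b11}

def make_patterns(key):
    """Build one byte-aligned pattern per alignment (offsets 0..3).

    Each pattern is a list of (value, mask) pairs, one pair per byte.
    """
    patterns = []
    for offset in range(4):
        padded = "*" * offset + key
        padded += "*" * (-len(padded) % 4)  # pad to a whole number of bytes
        pattern = []
        for i in range(0, len(padded), 4):
            value = mask = 0
            for ch in padded[i:i + 4]:
                value <<= 2
                mask <<= 2
                if ch != "*":
                    value |= BASE_CODE[ch]
                    mask |= 0b11
            pattern.append((value, mask))
        patterns.append(pattern)
    return patterns

def search(packed, key, n_bases):
    """Return the base offsets at which key occurs in 2-bit packed data."""
    hits = []
    for offset, pattern in enumerate(make_patterns(key)):
        for start in range(len(packed) - len(pattern) + 1):
            # all() stops at the first mismatching byte, so later bytes
            # are compared only when the earlier ones matched.
            if all((packed[start + j] & m) == v
                   for j, (v, m) in enumerate(pattern)):
                pos = start * 4 + offset
                if pos + len(key) <= n_bases:  # ignore padding bits
                    hits.append(pos)
    return sorted(hits)
```

For the key "tatagc", `make_patterns` produces patterns of 2, 2, 2, and 3 bytes, corresponding to patterns A to D in the text.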

(Example of amino acid sequences)
The lossless encoding device and search device for biological information have been described above using DNA base sequences as an example, but the same applies to amino acid sequences. Here, the points that differ from the DNA base sequence case when compressing and searching amino acid sequences are described. For an amino acid sequence, after processing by the data separating means 1, the fixed-length encoding means 2 converts each amino acid, represented in 8 bits, into 4 bits. However, there are 20 kinds of amino acids, which cannot all be represented in 4 bits, so the 5 kinds with relatively low frequency of occurrence are represented in 8 bits and the other 15 kinds in 4 bits. Specifically, the conversion uses the amino acid conversion table shown in FIG. 7.
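One way such a mixed 4-bit/8-bit code could work is sketched below. The actual table is the one in FIG. 7, which is not reproduced in the text, so everything here is an assumption: the choice of which 15 amino acids are "common", the use of the 16th nibble value as an escape prefix for the 5 rare ones, the padding nibble, and the function names.

```python
COMMON = "ARNDQEGILKFPSTV"  # 15 letters given 4-bit codes (hypothetical choice)
RARE = "CHMWY"              # 5 letters given 8-bit codes (hypothetical choice)
ESCAPE = 0xF                # nibble value that introduces an 8-bit code

def encode_amino(seq):
    """Encode amino acid letters as nibbles: 4 bits common, 8 bits rare."""
    nibbles = []
    for ch in seq:
        if ch in COMMON:
            nibbles.append(COMMON.index(ch))
        else:
            nibbles += [ESCAPE, RARE.index(ch)]  # ValueError if unknown
    if len(nibbles) % 2:
        nibbles.append(ESCAPE)  # pad to a whole byte
    return bytes((hi << 4) | lo
                 for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def decode_amino(data, count):
    """Restore count amino acid letters from the encoded bytes."""
    nibbles = [n for b in data for n in (b >> 4, b & 0xF)]
    out, i = [], 0
    while len(out) < count:
        if nibbles[i] == ESCAPE:
            out.append(RARE[nibbles[i + 1]])
            i += 2
        else:
            out.append(COMMON[nibbles[i]])
            i += 1
    return "".join(out)
```

Using an escape nibble keeps the code prefix-free, so a decoder can always tell a 4-bit code from an 8-bit one; the patent's own table may achieve this differently.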

Next, searching of amino acid sequences will be described. The structure of the intermediate sequence data for amino acids is shown in FIG. 8(a). As described above, each amino acid in the intermediate sequence data is recorded in 4 bits or 8 bits. In FIG. 8, the data is shown divided into units of 1 byte (1 or 2 amino acids). Consider searching this intermediate sequence data for the sequence "EKAR". In this case, the two patterns shown in FIG. 8(b), E "EKAR" and F "*EKAR*", are created and the search is performed byte by byte. Here, "*" stands for an arbitrary 4-bit value. For example, the first byte of pattern E, "EK", is matched against the intermediate sequence data one byte at a time, and when a matching byte is found, the second byte, "AR", is matched. In this way, there is no need to match the entire target sequence at once; the second and subsequent bytes are matched only when the first byte matches, which greatly shortens the search time. If no match is found with pattern E, the search is attempted with pattern F.

(Lossless encoding device for three-dimensional information)
FIG. 9 is a functional block diagram showing the configuration of the lossless encoding device for three-dimensional information according to the present invention. In FIG. 9, 21 is run-length encoding means, 22 is fixed tag encoding means, 23 is data separating means, and 24 is variable-length encoding means. The run-length encoding means 21 run-length encodes the blank characters in a three-dimensional information file. The fixed tag encoding means 22 converts the fixed tags in the three-dimensional information file into corresponding bit strings. The data separating means 23 separates the annotation information and the numerical information recorded in the three-dimensional information file to obtain annotation data and a numerical data body. The variable-length encoding means 24 encodes, with variable lengths, the annotation data and the numerical data body separated by the data separating means 23.

The structure of the three-dimensional information file to be compressed in the present invention is described here. FIG. 10(a) shows a three-dimensional CG file expressed in the VRML format, a representative data format. In FIG. 10(a), underscores indicate spaces. Annotation information other than numerical values is abbreviated here as <ANNOTATION>, as in FIG. 2(a); in an actual file, annotation information describing the numerical values is written there.

Next, compression of the three-dimensional information file will be described. When the three-dimensional data is read, the run-length encoding means 21 first run-length encodes the space (blank) information. Next, the fixed tag encoding means 22 encodes the fixed tags, using a fixed tag conversion table such as that shown in FIG. 11. Next, the data separating means 23 separates the numerical values contained in the character information to form the numerical data body and treats the remainder as annotation data. Specifically, the original three-dimensional CG file shown in FIG. 10(a) is read in order from the beginning. Text consisting only of the ASCII characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, the minus sign, and the decimal point is judged to be part of the numerical data body, and text containing any other ASCII character is judged to be annotation data; the two are separated accordingly. During this process, the number of numerical characters separated into the numerical data body is counted, and after each piece of annotation information, information about the numerical values that followed it is recorded. Values are assigned to the characters read, byte by byte, according to [Conversion rule 1] below.

[Conversion Rule 1]
0 to 127: ASCII character
128 to 191: inserted numerical value length + 127
192 to 223: fixed tag code + 192
224 to 255: inserted space length + 223

For example, the five leading spaces shown in FIG. 10(a) are recorded in the annotation data shown in FIG. 10(b) as "228", obtained by adding 223 to the inserted space length of 5. Similarly, a run of two spaces is recorded in the annotation data as "225" (2 + 223). The fixed tag "POINT" shown in FIG. 10(a) is recorded in the annotation data shown in FIG. 10(b) as "213", obtained by adding 192 to the code "21" given by the table shown in FIG. 11. A numerical value is recorded as the number of consecutive numerical characters, including ".", plus 127. Thus an 8-character value such as "0.000000" is recorded as "135", and a 9-character value such as "-0.000100" is recorded as "136".
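The separation step and [Conversion Rule 1] can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that tokens are delimited by spaces and newlines, that runs longer than a code range allows are split into several codes, and that the tag table contains only the one entry stated in the text (POINT = 21). The names `encode`, `TAG_CODES`, and `NUMERIC_CHARS` are hypothetical.

```python
import re

# Hypothetical excerpt of the fixed tag conversion table. Only POINT -> 21 is
# stated in the text; the full table is given in the patent figure.
TAG_CODES = {"POINT": 21}

# Characters that make a token part of the numerical data body.
NUMERIC_CHARS = set("0123456789-.")


def encode(text: str) -> tuple[bytes, str]:
    """Split text into annotation data (bytes) and a numerical data body."""
    annotation = bytearray()
    numeric_body = []
    # Assumption: tokens are runs of spaces, single newlines, or other runs.
    for token in re.findall(r" +|\n|[^ \n]+", text):
        if token[0] == " ":
            # 224 to 255: inserted space length + 223 (at most 32 per code).
            for i in range(0, len(token), 32):
                annotation.append(223 + len(token[i:i + 32]))
        elif set(token) <= NUMERIC_CHARS:
            # 128 to 191: inserted numerical length + 127 (at most 64 per code).
            for i in range(0, len(token), 64):
                annotation.append(127 + len(token[i:i + 64]))
            numeric_body.append(token)
        elif token in TAG_CODES:
            # 192 to 223: fixed tag code + 192.
            annotation.append(192 + TAG_CODES[token])
        else:
            # 0 to 127: plain ASCII characters are copied unchanged.
            annotation.extend(token.encode("ascii"))
    return bytes(annotation), "".join(numeric_body)


annotation, body = encode("     POINT  0.000000 -0.000100")
print(list(annotation))  # [228, 213, 225, 135, 224, 136]
print(body)              # 0.000000-0.000100
```

The printed codes 228, 213, 225, 135, and 136 match the worked values in the text. The code 224 for a single space follows from the same rule (1 + 223) and is not itself one of the examples given.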

The numerical data body is what remains when the annotation information is removed from the original three-dimensional CG data and the numerical values are arranged consecutively. The numerical values are therefore recorded one after another, as shown in FIG. 10(c).

The variable-length encoding means 24 encodes the annotation data and the numerical data body with variable-length codes. Specifically, it executes the processing shown in the flowchart of FIG. 4. The result is a compressed file consisting of compressed annotation data, compressed numerical data, an annotation frequency table, a numerical frequency table, and the fixed tag conversion table.

Next, the decoding process will be described. Decoding restores the annotation data from the compressed annotation data and the annotation frequency table, restores the numerical data body from the compressed numerical data body and the numerical frequency table, and finally merges the annotation data and the numerical data body to obtain the original sequence data. Specifically, the inverse of the processing shown in the flowchart of FIG. 4 is first applied to the compressed annotation data, and the annotation data is restored using the annotation frequency table. The same inverse processing is applied to the compressed numerical data body, which is restored using the numerical frequency table. The annotation data and the numerical data body are then merged: the annotation information <ANNOTATION> is read from the annotation data, the value recorded immediately after it is converted according to [Conversion Rule 1] above, the corresponding number of numerical characters is read from the numerical data body, and these are inserted after the annotation information. Repeating this for each piece of annotation information restores the three-dimensional information file.
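The merging step of decoding can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the patented implementation: it takes annotation data and a numerical data body that have already been restored from their variable-length codes, and the tag table contains only the one entry stated in the text (21 = POINT). The names `decode` and `TAG_NAMES` are hypothetical.

```python
# Hypothetical excerpt of the fixed tag conversion table, inverted for
# decoding. Only code 21 -> POINT is stated in the text.
TAG_NAMES = {21: "POINT"}


def decode(annotation: bytes, numeric_body: str) -> str:
    """Merge annotation data and the numerical data body into the original text."""
    out = []
    pos = 0  # read position in the numerical data body
    for b in annotation:
        if b <= 127:
            # 0 to 127: plain ASCII character.
            out.append(chr(b))
        elif b <= 191:
            # 128 to 191: read (b - 127) numerical characters from the body.
            n = b - 127
            out.append(numeric_body[pos:pos + n])
            pos += n
        elif b <= 223:
            # 192 to 223: fixed tag code + 192.
            out.append(TAG_NAMES[b - 192])
        else:
            # 224 to 255: run of (b - 223) spaces.
            out.append(" " * (b - 223))
    return "".join(out)


restored = decode(bytes([228, 213, 225, 135, 224, 136]), "0.000000-0.000100")
print(repr(restored))  # '     POINT  0.000000 -0.000100'
```

Because every byte in the annotation data falls into exactly one of the four ranges of [Conversion Rule 1], the merge needs no separators in the numerical data body: the length codes alone determine where each value starts and ends.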

Each of the devices shown in FIGS. 1, 5, and 9 is realized, specifically, by installing a dedicated software program on hardware such as a computer.

FIG. 1 is a functional block diagram showing the configuration of the lossless encoding apparatus for biological information according to the present invention.
FIG. 2 is a diagram showing the processing performed by the data separation means 1.
FIG. 3 is a diagram showing an example of a base conversion table.
FIG. 4 is a flowchart outlining the processing performed by the variable-length encoding means.
FIG. 5 is a functional block diagram showing the configuration of the biological information search apparatus according to the present invention.
FIG. 6 is a diagram showing intermediate sequence data and a search pattern for a base sequence.
FIG. 7 is a diagram showing an example of an amino acid conversion table.
FIG. 8 is a diagram showing intermediate sequence data and a search pattern for an amino acid sequence.
FIG. 9 is a functional block diagram showing the configuration of the lossless encoding apparatus for three-dimensional information according to the present invention.
FIG. 10 is a diagram showing the processing performed by the data separation means 23.
FIG. 11 is a diagram showing an example of a fixed tag conversion table.

Explanation of symbols

1, 23 ... Data separation means
2 ... Fixed-length encoding means
3, 24 ... Variable-length encoding means
11 ... Search key input means
12 ... Search pattern creation means
13 ... Collation means
14 ... Archive execution means
21 ... Run-length encoding means
22 ... Fixed tag encoding means

Claims

A lossless encoding apparatus for biological information, operating on a biological information file composed of sequence information consisting of characters defined within a predetermined range and annotation information that annotates a specific range of the sequence information, the apparatus comprising:
data separation means for separating the biological information file into annotation data and a sequence data body, and for adding to the annotation data link information to the sequence data body so that the biological information file can be restored;
fixed-length encoding means for obtaining intermediate sequence data by compressing the sequence data body, assigning a fixed bit length to each character recorded in it; and
variable-length encoding means for compressing, with variable bit lengths, each of the fixed-length-compressed intermediate sequence data and the annotation data.

The lossless encoding apparatus for biological information according to claim 1,
wherein the variable-length encoding means performs run-length compression on the annotation data or the sequence data body in byte-sequence units, and encodes each byte value by assigning shorter bit lengths in descending order of occurrence frequency.

The lossless encoding apparatus for biological information according to claim 1,
wherein the sequence data body is sequence data composed of the four characters a, g, c, and t (uppercase is also acceptable), each character being recorded in 8 bits, and the fixed-length encoding means encodes each character with a fixed length of 2 bits.

The lossless encoding apparatus for biological information according to claim 1,
wherein the sequence data body is amino acid sequence data composed of the characters L, A, S, G, V, E, K, I, T, D, R, P, N, F, Q, Y, M, H, C, and W (lowercase is also acceptable), each character being recorded in 8 bits, and the fixed-length encoding means encodes the characters L, A, S, G, V, E, K, I, T, D, R, P, N, F, and Q with a fixed length of 4 bits and the characters Y, M, H, C, and W with a fixed length of 8 bits.