JP2003188735A

JP2003188735A - Data compressing device and method, and program

Info

Publication number: JP2003188735A
Application number: JP2001380473A
Authority: JP
Inventors: Takashi Yoshioka (吉岡 隆); Toshihiko Morimoto (森本 俊彦); Tetsuo Toyoda (豊田 哲郎)
Original assignee: NTT Data Corp; RIKEN Institute of Physical and Chemical Research
Current assignee: RIKEN Institute of Physical and Chemical Research; NTT Data Group Corp
Priority date: 2001-12-13
Filing date: 2001-12-13
Publication date: 2003-07-04

Abstract

PROBLEM TO BE SOLVED: To provide a data compression device, method, and program that can compress genome data efficiently. SOLUTION: A base sequence, for example that of a mouse, is stored in a dictionary DB 13 as a dictionary sequence. A human base sequence is stored in a sequence DB 11 as the compression target sequence. A part of the compression target sequence is taken as reference information and compared with the mouse base sequence to search for a match. The length of the reference information is not fixed; it can be extended for as long as it continues to match the dictionary sequence. When a match is found, a token is created that represents the matching position in the dictionary sequence and the length of the reference information, and the part of the compression target sequence corresponding to the reference information is replaced with that token. COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a data compression device, method, and program capable of efficiently compressing genome data.

[0002]

[Description of the Related Art] The bases of a genome are represented by "A", "C", "G", and "T". Research to determine the genomic base sequences of various organisms is currently under way in many countries. Compression techniques are generally used to reduce the volume of data. One such technique reduces the total amount of information by replacing parts of the data with data that has fewer bits than the original. For example, a genomic base sequence is encoded as "A = 00", "C = 01", "G = 10", and "T = 11". A base that previously occupied 1 byte then occupies 2 bits, so the amount of information is reduced to about 1/4.

[0003]

[Problems to Be Solved by the Invention] For example, a human has about 3 billion bases, and even if each base is represented by 2 bits as described above, the amount of information in the human base sequence is 3 Gbytes. Projects that examine the genomic base sequences of various plants and animals, such as rice and mouse as well as human, and investigate the correlations among them are currently under way at research organizations, and the volume of genome data keeps growing. As genome data grows, so does the cost of managing the files that store it. Genome data projects are also sometimes carried out jointly by several research organizations, and the communication time needed to exchange genome data between institutions becomes enormous. The present invention has been made in view of these circumstances, and its object is to provide a data compression device, method, and program capable of efficiently compressing genome data.

[0004]

[Means for Solving the Problems] The present invention has been made to achieve the above object. The present invention is a data compression device that compresses compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the device comprising: a storage unit that stores the dictionary data; a comparison unit that compares reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion, and that, when a matching portion is detected as a result of the search, compares the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively, and determines position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a compression unit that replaces the matching portion in the compression target data with replacement information including the position information and length information determined by the comparison unit.

[0005] Further, in the above invention, when no matching portion is found as a result of the search, the comparison unit takes as new reference information a sequence of unit data whose start point is shifted by one or more unit data from the start point of the reference information, and compares the new reference information with the dictionary data.

[0006] Thus, for example, when the genome data of one species is used as the dictionary data and the genome data of another species is used as the compression target data, the compression ratio of the base sequence data, amino acid sequence data, and the like of the other species can be improved dramatically. In addition, when the genome data of a plurality of species is used as dictionary data and the compression target data is compressed using those species, the compression ratio rises as the number of species used as dictionary data increases, and as a result the growth rate of genome data, which keeps increasing worldwide, can be held down. Compressing the data can also be expected to reduce communication costs, and data management costs can be reduced as well.

[0007] Further, in the above invention, the present invention further comprises similarity acquisition means that acquires the similarity between the compression target data and the dictionary data by comparing the amount of information of the compression target data before compression with the amount after compression.

[0008] Thus, for example, when the genome data of one species is used as the dictionary data and the genome data of another species is used as the compression target data, the similarity between the species used as the dictionary data and the species being compressed can be acquired. Moreover, such a similarity can be acquired while the data remains compressed.
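The text does not give a formula for the similarity. As one possible reading, a minimal Python sketch might use the fraction of the original size that compression removes. The function name `similarity_from_sizes` and the formula itself are assumptions for illustration, not taken from the specification.

```python
def similarity_from_sizes(original_bits: int, compressed_bits: int) -> float:
    """Hypothetical similarity score: share of the original size saved by compression.

    Assumption: 0.0 means nothing matched the dictionary (or the output grew),
    and values near 1.0 mean almost everything was replaced by tokens.
    """
    if original_bits <= 0:
        raise ValueError("original_bits must be positive")
    saved = 1.0 - compressed_bits / original_bits
    # Clamp so that an output larger than the input still maps to 0.0.
    return max(0.0, min(1.0, saved))
```

Under this assumed definition, a target that compresses to a quarter of its original size against a given dictionary would score 0.75.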

[0009] The present invention is also a data compression device that compresses compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the device comprising: a comparison unit that takes a sequence of specific unit data among the unit data constituting the compression target data as the dictionary data, compares reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion, and that, when a matching portion is detected as a result of the search, compares the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively, and determines position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; a compression unit that replaces the matching portion in the compression target data with replacement information including the position information and length information determined by the comparison unit; and a storage unit that stores the dictionary data.

[0010] Further, in the above invention, when no matching portion is found as a result of the search, the comparison unit causes a sequence of unit data shifted by one or more unit data from the end point of the dictionary data to be additionally stored in the storage unit.

[0011] Thus, for example, the compression ratio of base sequence data, amino acid sequence data, and the like can be improved dramatically. Compressing the data can also be expected to reduce communication costs, and data management costs can be reduced. Furthermore, when no portion matches the dictionary data, the unmatched portion can be added as new dictionary data, which further improves the compression ratio.

[0012] The present invention is also a data compression method for compressing compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the method comprising the steps of: storing the dictionary data; comparing reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; when a matching portion is detected as a result of the search, comparing the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively; determining position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and replacing the matching portion in the compression target data with replacement information including the determined position information and length information.

[0013] The present invention is also a data compression method for compressing compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the method comprising the steps of: taking a sequence of specific unit data among the unit data constituting the compression target data as the dictionary data; storing the dictionary data; comparing reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; when a matching portion is detected as a result of the search, comparing the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively; determining position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and replacing the matching portion in the compression target data with replacement information including the position information and length information determined by the comparison unit.

[0014] The present invention is also a data compression program for compressing compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the program causing a computer to execute the steps of: storing the dictionary data; comparing reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; when a matching portion is detected as a result of the search, comparing the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively; determining position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and replacing the matching portion in the compression target data with replacement information including the determined position information and length information.

[0015] The present invention is also a data compression program for compressing compression target data consisting of a plurality of unit data, using dictionary data consisting of a plurality of unit data, the program causing a computer to execute the steps of: taking a sequence of specific unit data among the unit data constituting the compression target data as the dictionary data; storing the dictionary data; comparing reference information, which is a sequence of unit data at an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; when a matching portion is detected as a result of the search, comparing the unit data that follows the reference information with the unit data that follows the sequence of unit data in the dictionary data that matches the reference information, thereby determining the length over which the unit data match consecutively; determining position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and replacing the matching portion in the compression target data with replacement information including the position information and length information determined by the comparison unit.

[0016]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. First, base sequences are briefly explained. As stated above, the bases of a genome are represented by "A", "C", "G", and "T". The base sequences of organisms have portions in common across species. For example, when the human and mouse base sequences are as illustrated in FIG. 10, the two sequences share the portion labeled A and the portion labeled B in the figure, although their locations in the genome differ. Such sequences are base sequences that have been conserved unchanged even as organisms evolved into different species, and this commonality tends to be more pronounced the more functionally significant the portion is. The data compression device of this embodiment exploits this property of genomic base sequences to compress base sequence data.

[0017] FIG. 1 is a block diagram showing the structure of the data compression device 1. The configuration of the data compression device 1 is described with reference to FIG. 1. In FIG. 1, reference numeral 11 denotes a sequence DB (database), which stores the base sequences to be compressed. Reference numeral 12 denotes a compression target sequence acquisition unit, which acquires the compression target sequence stored in the sequence DB 11. Reference numeral 13 denotes a dictionary DB, which stores the base sequence to be compared with the compression target sequence. Hereinafter, a base sequence stored in the dictionary DB 13 is referred to as a dictionary sequence. FIG. 2 shows an example of a dictionary sequence stored in the dictionary DB 13. As detailed later, there is an example in which the dictionary sequence is stored in the dictionary DB 13 in advance and an example in which no dictionary sequence is stored at first. Reference numeral 14 denotes a dictionary sequence acquisition unit, which acquires the dictionary sequence.

[0018] Reference numeral 15 denotes a comparison unit, which compares the compression target sequence with the dictionary sequence and acquires position information on the matching portion and length information on the match. Reference numeral 16 denotes a compression unit, which compresses the compression target sequence by replacing parts of it with tokens containing the position information and length information acquired by the comparison unit 15. Reference numeral 17 denotes a similarity acquisition unit, which acquires a similarity from the compression ratio.

[0019] The sequence DB 11 is now described. The sequence DB 11 stores three files, "nhr", "nsq", and "nin". The "nhr" file records header information, which is information on the organism of origin. The "nsq" file records information representing the actual base sequence (sequence information). The "nin" file records the length of the header information and the length of the sequence information.

[0020] Each file is described with reference to the sequence shown as an example in FIG. 3. In FIG. 3, the first line is information indicating the origin of the organism. The second and third lines show the base sequence of that organism. The header information shown in the first line of FIG. 3 is recorded in the "nhr" file. The base sequence shown in the second and third lines is encoded as "A = 00", "C = 01", "G = 10", and "T = 11". For example, the leading "AGCT" is converted to "00011011". The information converted in this way is recorded in the "nsq" file. With this conversion, a base that was represented by one character (1 byte = 8 bits) can be represented by 2 bits, which is 1/4 of the length before conversion.
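A minimal Python sketch of the 2-bit encoding described above. The names `CODE` and `pack_bases`, and the choice to pack four bases per byte with the first base in the high bits, are assumptions for illustration; the text only fixes the mapping A = 00, C = 01, G = 10, T = 11.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # mapping given in the text

def pack_bases(seq: str) -> bytes:
    """Pack a base string into bytes, four bases per byte, first base in the high bits.

    Assumption: a final partial byte is padded with zero bits on the right.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad an incomplete final group
        out.append(byte)
    return bytes(out)
```

With the mapping as stated, `pack_bases("ACGT")` yields the single byte `0b00011011`, and any 140-base sequence packs into 35 bytes.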

[0021] The header information shown in the first line of FIG. 3 has 54 characters, so its length is 54. The number of bases shown in the second and third lines of FIG. 3 is 140, but after the encoding described above the length of the sequence information (base sequence) is 35 (= 140/4). The length of the array that holds the header information length and the sequence information length is 8 (4 bytes × 2 entries). The header information length of 54, the sequence information length of 35, and the array length of 8 are recorded in the "nin" file.
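The three lengths recorded in the "nin" file can be checked with a short Python sketch. The function name `nin_lengths` and its parameters are hypothetical; the 4-byte width of each length field follows the example values in the text, and the actual file layout is not specified here.

```python
import math

def nin_lengths(header: str, n_bases: int, bytes_per_length: int = 4) -> tuple[int, int, int]:
    """Return (header length, packed sequence length in bytes, length-array size in bytes).

    Assumptions: one byte per header character, 2 bits per base, and two
    length fields of `bytes_per_length` bytes each.
    """
    header_len = len(header)
    seq_len = math.ceil(n_bases * 2 / 8)   # four bases fit in one byte
    array_len = bytes_per_length * 2       # header length + sequence length
    return header_len, seq_len, array_len
```

For a 54-character header and 140 bases this gives the values 54, 35, and 8 used in the example.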

[0022] Next, the operation is described. In the description the bases are written as "A", "C", "G", and "T", but the processing described below is actually performed on data in which each base is represented by 2 bits, as in "A = 00", "C = 01", "G = 10", and "T = 11".

[0023] ＜Example 1＞ Static compression is described here. In static compression, a dictionary sequence is stored in the dictionary DB 13 in advance, and compression is performed using that stored dictionary sequence. Here it is assumed that, for example, a mouse base sequence is stored in the dictionary DB 13 as the dictionary sequence and a human base sequence is stored in the sequence DB 11 as the compression target sequence.

[0024] The dictionary sequence acquisition unit 14 of the data compression device 1 first acquires the dictionary sequence from the dictionary DB 13 (step S10 in FIG. 4) and sends the acquired dictionary sequence to the comparison unit 15. The compression target sequence acquisition unit 12 acquires the compression target sequence from the sequence DB 11 (step S11 in FIG. 4). To do so, it refers to the "nin" file in the sequence DB 11 to determine the length of the stored compression target sequence, and reads a compression target sequence of that length from the "nsq" file. The compression target sequence acquisition unit 12 sends the acquired compression target sequence to the comparison unit 15.

[0025] The comparison unit 15 compares the dictionary sequence with the compression target sequence and searches for matching portions (step S12 in FIG. 4). As shown in FIG. 5, the comparison unit 15 searches the dictionary sequence for a base sequence that matches a contiguous base sequence fragment of 40 bases or more (2 bits per base) starting at the compression start position of the compression target sequence. Hereinafter, the base sequence fragment of 40 bases or more in the compression target sequence that is compared against the dictionary sequence is referred to as reference information. In FIG. 5 the dictionary sequence is shown wrapped across lines.
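One way to picture the search in step S12 is a substring lookup for the minimum-length fragment. This is a sketch only: `find_seed`, `MIN_LEN`, and the use of Python's `str.find` are assumptions, and the text does not say how the comparison unit 15 actually searches the dictionary sequence.

```python
MIN_LEN = 40  # minimum length of the reference information, per the text

def find_seed(target: str, start: int, dictionary: str, min_len: int = MIN_LEN) -> int:
    """Return the 0-based dictionary index where target[start:start+min_len] first occurs.

    Returns -1 when the fragment is absent or when fewer than min_len bases
    remain in the target.
    """
    seed = target[start:start + min_len]
    if len(seed) < min_len:
        return -1
    return dictionary.find(seed)
```

Note that `str.find` reports the first occurrence, which is not necessarily the one that extends to the longest match.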

[0026] First, the case in which the search finds a base sequence in the dictionary sequence that matches the reference information (step S13 in FIG. 4) is described. The underline in the dictionary sequence in FIG. 6 marks the base sequence portion that matches the reference information of the compression target sequence. In FIG. 6 the dictionary sequence is shown wrapped across lines. The comparison unit 15 sends the compression unit 16 a notice that a match was found, together with the start position of the matching portion in the dictionary sequence and the length of the matching bases.

[0027] The compression unit 16 creates a token in order to replace part of the compression target sequence with a reference to the dictionary sequence (step S14 in FIG. 4). For example, if the matching portion in the dictionary sequence (the underlined part of the dictionary sequence in FIG. 6) starts at the 60th base from the start of the dictionary sequence and the matching length is 40 (bases), the token is created as "1 (60, 40)". The "1" in this token is a bit flag indicating a match with the dictionary sequence.

[0028] Here, the figure of 40 for the length of the reference information (number of bases) is not fixed; it is the minimum length of the reference information. The reference information can be made arbitrarily long as long as it matches the dictionary sequence. For example, let the first base of the compression target sequence be the start of the reference information, and suppose that the 40 contiguous bases from the start of the compression target sequence match the 40 contiguous bases from the start of the dictionary sequence. If further comparison by the comparison unit 15 shows that the compression target sequence from its first base to its 50th base matches the dictionary sequence from its start to its 50th base, the compression unit 16 creates the token "1 (1, 50)". Making the length of the reference information variable in this way means that reference information matching the dictionary sequence can be replaced by a token of fixed size no matter how long it is, and the compression ratio rises as a result.
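The extension step can be sketched as follows in Python. `extend_match` is a hypothetical helper; it takes the 0-based indices of an already confirmed 40-base match and returns the total matched length, so a seed that keeps matching for 10 more bases returns 50.

```python
def extend_match(target: str, t_start: int, dictionary: str, d_start: int,
                 seed_len: int = 40) -> int:
    """Grow a confirmed seed match base by base and return the total match length.

    Assumes target[t_start:t_start+seed_len] == dictionary[d_start:d_start+seed_len].
    """
    length = seed_len
    # Stop at the first mismatch or at the end of either sequence.
    while (t_start + length < len(target)
           and d_start + length < len(dictionary)
           and target[t_start + length] == dictionary[d_start + length]):
        length += 1
    return length
```

Since the token carries only a position and a length, its size stays the same whether the returned length is 40 or several thousand.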

[0029] After passing the start position of the matching portion in the dictionary sequence and the length of the matching bases to the compression unit 16, the comparison unit 15 checks whether the whole compression target sequence has been compared (step S16 in FIG. 4). If the comparison is not finished, it takes a contiguous base sequence fragment of length 40 or more, starting at the base that follows the replaced part of the compression target sequence, as new reference information and compares it with the dictionary sequence to search for a matching portion. For example, if the compression target sequence has been replaced by a token from its first base through its 50th base, a contiguous fragment of length 40 or more starting at the 51st base of the compression target sequence becomes the new reference information and is compared with the dictionary sequence.

[0030] Next, the case where the search finds no matching base sequence in the dictionary sequence (step S13 in FIG. 4) is described. The comparison unit 15 sends the compression unit 16 a notice that no match was found, together with the first base of the reference information of the compression target sequence. The compression unit 16 emits that first base as-is, with a bit flag set to indicate that no match was found (step S15 in FIG. 4). For example, when reference information starting at the base "A" does not match the dictionary sequence, the token "0(0, 0)A" is created. In this token, the leading "0" is the bit flag indicating that no match was found, "(0, 0)" is dummy information, and "A" is the unmatched base.
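
The two token shapes described so far can be sketched as follows. This is only an illustration under stated assumptions: the `Token` tuple, the helper names `match_token`, `literal_token`, and `render`, and the textual rendering are hypothetical and are not taken from the patent, which does not fix a concrete encoding here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    flag: int    # 1 = match found in the dictionary sequence, 0 = no match
    start: int   # start position of the match in the dictionary (0 = dummy)
    length: int  # number of matched bases (0 = dummy)
    base: str    # the unmatched base carried as-is ("" for a match token)


def match_token(start: int, length: int) -> Token:
    """Token for a stretch that was found in the dictionary sequence."""
    return Token(1, start, length, "")


def literal_token(base: str) -> Token:
    """Dummy token carrying a single unmatched base."""
    return Token(0, 0, 0, base)


def render(tok: Token) -> str:
    """Render a token in the notation used in the description."""
    return f"{tok.flag}({tok.start}, {tok.length}){tok.base}"
```

With these helpers, `render(literal_token("A"))` gives the dummy token from the example above, and `render(match_token(50, 40))` gives a match token of the form used later in the description.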

[0031] After passing the no-match notice and the first base of the reference information to the compression unit 16, the comparison unit 15 checks whether the entire compression target sequence has been compared (step S16 in FIG. 4). If the comparison is not finished, then, as shown in FIG. 7, the start position of the reference information in the compression target sequence is advanced by one base (2 bits), and a base sequence of 40 or more bases starting there is used as new reference information. For example, if reference information starting at the 41st base of the compression target sequence does not match the dictionary sequence, the 42nd base becomes the start of the new reference information.

[0032] The comparison unit 15 searches the dictionary sequence for a portion that matches this new reference information. If there is a match, it causes the compression unit 16 to create a token indicating the match, as described above; if there is none, it causes the compression unit 16 to create a dummy token.

[0033] By repeating this cycle of comparing reference information with the dictionary sequence and creating tokens, the comparison unit 15 and the compression unit 16 compress the compression target sequence. The created tokens are temporarily stored in a token field in the data compression apparatus 1 and held there until the entire compression target sequence read by the compression target sequence acquisition unit 12 has been compared with the dictionary sequence. Once tokens have been created for the whole compression target sequence, the compression unit 16 outputs information including the created tokens as compressed data.
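
The compare-and-tokenize loop of Embodiment 1 can be sketched roughly as below. This is a minimal illustration, not the patented implementation: the function name `compress`, the tuple layout `(flag, start, length, base)`, 0-based positions, and the use of Python's `str.find` (which returns the first occurrence only) are all assumptions made for the sketch.

```python
MIN_LEN = 40  # minimum reference length used in the embodiment


def compress(target: str, dictionary: str, min_len: int = MIN_LEN):
    """Return a list of (flag, start, length, base) tokens for `target`."""
    tokens = []
    i = 0
    while i < len(target):
        ref = target[i:i + min_len]  # reference information
        # Search only when a full-length reference is still available.
        pos = dictionary.find(ref) if len(ref) == min_len else -1
        if pos < 0:
            # No match: emit a dummy token with the first base, skip one base.
            tokens.append((0, 0, 0, target[i]))
            i += 1
            continue
        # Match: extend it base by base for as long as both sequences agree.
        length = min_len
        while (i + length < len(target)
               and pos + length < len(dictionary)
               and target[i + length] == dictionary[pos + length]):
            length += 1
        tokens.append((1, pos, length, ""))
        i += length  # continue from the base after the replaced part
    return tokens
```

For a small trial, `compress("TTACGGTTG", "ACGTACGGTTCA", min_len=4)` yields one dummy token, one match token covering seven bases, and one trailing dummy token. A production implementation would likely use an index rather than a linear scan for the search.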

[0034] When the base sequence of another, not yet compressed organism also needs to be compressed (step S17 in FIG. 4), the compression target sequence and the dictionary sequence are read again and compression is performed in the same way. When compression of the compression target sequence is finished, the similarity acquisition unit 17 obtains, from the compression ratio of the compression target sequence, how close the species of the compression target sequence is to the species of the dictionary data. The principle, as described above, is that when the base sequence of one organism is used as a dictionary to compress the base sequence of another, the closer the species is to the dictionary organism, the more matching portions there are and, as a result, the higher the compression ratio. The compression ratio can therefore be used to estimate how close the dictionary organism and the compression target organism are.
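
One way to turn the compression result into a closeness estimate is sketched below. The patent does not give a formula, so everything here is an assumption for illustration: the measure `space_saving`, the constants of 2 bits per base and 48 bits (6 bytes) per token, the helper `closest_dictionary`, and the organism labels in the usage note.

```python
BITS_PER_BASE = 2   # one base is treated as 2 bits in the description
TOKEN_BITS = 48     # assumed: every token occupies 6 bytes


def space_saving(n_bases: int, n_tokens: int) -> float:
    """Fraction of space saved: 1 - compressed size / original size."""
    original_bits = n_bases * BITS_PER_BASE
    compressed_bits = n_tokens * TOKEN_BITS
    return 1.0 - compressed_bits / original_bits


def closest_dictionary(results: dict) -> str:
    """Pick the dictionary whose use gave the largest space saving.

    `results` maps a dictionary label to (n_bases, n_tokens) obtained when
    the same target sequence was compressed against that dictionary.
    """
    return max(results, key=lambda name: space_saving(*results[name]))
```

For instance, if a 1000-base target needs 10 tokens against one dictionary and 40 tokens against another, the first dictionary gives the larger saving and would be judged the closer species under this measure.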

[0035] <Embodiment 2> This section describes dynamic compression. Unlike the embodiment above, in dynamic compression no dictionary sequence is stored in the dictionary DB 13 before compression starts. Each unit operates as described above, so only a brief description follows.

[0036] The compression target sequence acquisition unit 12 of the data compression apparatus 1 reads the compression target sequence from the sequence DB 11 and sends it to the comparison unit 15. As shown in FIG. 8, the comparison unit 15 takes part of the compression target sequence as the dictionary sequence and searches the compression target sequence for base sequence portions that match it. For example, the first 40 bases of the compression target sequence are used as the dictionary sequence.

[0037] The comparison unit 15 sends the dictionary sequence to the dictionary sequence acquisition unit 14, which stores the received dictionary sequence in the dictionary DB 13. At the same time, the comparison unit 15 searches the compression target sequence for a base sequence that matches the dictionary sequence.

[0038] If the search finds a match, the comparison unit 15 sends the relevant information to the compression unit 16, as described above, so that a token is created. If the search finds no match, it likewise sends the relevant information to the compression unit 16 so that a dummy token is created. In addition, the comparison unit 15 adds to the dictionary sequence the portion of the compression target sequence that follows the part already used as the dictionary sequence, and sends it to the dictionary sequence acquisition unit 14. The dictionary sequence acquisition unit 14 stores the received dictionary sequence in the dictionary DB 13. For example, when the 1st through 40th bases of the compression target sequence form the dictionary sequence, the 41st base is added to the dictionary sequence and stored in the dictionary DB 13.
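
The dynamic variant can be sketched as follows. Several details are not stated in the description and are assumptions of this sketch: the search starts right after the seed, the seed itself is not tokenized because the dictionary is stored separately, the dictionary grows by one base only on a miss, and the names `compress_dynamic`, `seed_len`, and `min_len` are hypothetical.

```python
def compress_dynamic(target: str, seed_len: int = 40, min_len: int = 40):
    """Return (dictionary, tokens) with the dictionary built from `target`."""
    dictionary = target[:seed_len]  # initial dictionary: start of the target
    tokens = []
    i = seed_len                    # assumed: matching starts after the seed
    while i < len(target):
        ref = target[i:i + min_len]
        pos = dictionary.find(ref) if len(ref) == min_len else -1
        if pos >= 0:
            # Match: extend while the dictionary and the target keep agreeing.
            length = min_len
            while (i + length < len(target)
                   and pos + length < len(dictionary)
                   and target[i + length] == dictionary[pos + length]):
                length += 1
            tokens.append((1, pos, length, ""))
            i += length
        else:
            # Miss: emit a dummy token, then grow the dictionary by the base
            # that follows the part of the target already used as dictionary.
            tokens.append((0, 0, 0, target[i]))
            i += 1
            if len(dictionary) < len(target):
                dictionary += target[len(dictionary)]
    return dictionary, tokens
```

Because the dictionary is always a prefix of the target in this sketch, it has to be kept (for example in the dictionary DB) for the tokens to be interpretable later.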

[0039] For subsequent compressions, the operation of Embodiment 1 may be performed using the dictionary sequence obtained in Embodiment 2. That is, when another base sequence is compressed after the dictionary sequence has been obtained in Embodiment 2, matches are first searched for by comparison with the dictionary sequence stored in the dictionary DB 13. If no match is found, it is sufficient to output only a dummy token as described in Embodiment 1; in addition, the next base, for which no match was found, may be added to the dictionary sequence as described in Embodiment 2. If base sequences of different species are mixed in the dictionary DB 13 as dictionary sequences, it is advisable to make them distinguishable, for example by assigning a type ID to each dictionary sequence.

[0040] As noted above, base sequences conserved across different species are often functionally significant. Therefore, by searching for the portions of the dictionary sequence that are used in common by the compressed data of different species, biologically meaningful base sequence data can be obtained. Moreover, this biologically meaningful base sequence data can be obtained while the base sequence data remain compressed.

[0041] In the embodiments above, the comparison unit 15 of the data compression apparatus 1 uses 40 or more bases as reference information, but the number of bases is not limited to 40 or more. However, if the reference information consisted of, say, 2 bases, a token created as described above would occupy 6 bytes after compression, whereas the information before compression occupies only 4 bits, so the uncompressed form is smaller. Compression is pointless in such a case, so the number of bases in the reference information should be chosen so that the amount of information after compression is smaller than the amount before compression.
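
The size comparison in this paragraph can be worked through numerically. The sketch below takes the 6-byte token size and the 2 bits per base from the description; that every match token has exactly this size, and the names `saves_space` and `min_useful_length`, are assumptions for illustration.

```python
TOKEN_BITS = 6 * 8   # token size after compression, from the 6-byte example
BITS_PER_BASE = 2    # one base is treated as 2 bits in the description


def saves_space(n_bases: int) -> bool:
    """True if replacing n_bases with one token makes the data smaller."""
    return n_bases * BITS_PER_BASE > TOKEN_BITS


# Smallest reference length for which a single token is a net gain.
min_useful_length = next(n for n in range(1, 1000) if saves_space(n))
```

Under these assumptions a 2-base reference (4 bits) loses against a 48-bit token, while the 40-base minimum used in the embodiment (80 bits) comes out ahead.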

[0042] The dictionary sequence may also be stored in a tree structure such as the example shown in FIG. 9. Searching is then faster than when the dictionary sequence is stored in the structure exemplified in FIG. 2.
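
FIG. 9 is not reproduced in this text, so the exact tree layout is unknown. As one plausible illustration only, the sketch below stores every fixed-length fragment of the dictionary sequence in a trie keyed by base, so a lookup follows at most `k` branches instead of scanning the whole sequence. The functions `build_trie` and `lookup`, the fragment length `k`, and the `"$"` marker for position lists are hypothetical choices, not the structure disclosed in the patent.

```python
def build_trie(dictionary: str, k: int) -> dict:
    """Index every length-k fragment of `dictionary` by its start position."""
    root: dict = {}
    for start in range(len(dictionary) - k + 1):
        node = root
        for base in dictionary[start:start + k]:
            node = node.setdefault(base, {})  # descend, creating branches
        node.setdefault("$", []).append(start)  # leaf: list of positions
    return root


def lookup(trie: dict, ref: str) -> list:
    """Return the start positions at which the length-k `ref` occurs."""
    node = trie
    for base in ref:
        node = node.get(base)
        if node is None:
            return []  # no branch for this base: fragment not in dictionary
    return node.get("$", [])
```

A lookup returns all start positions of the fragment, which a compressor could then extend base by base as in Embodiment 1.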

[0043] In Embodiment 1, when no match is found, new reference information obtained by shifting the unmatched range by one base is compared with the dictionary sequence, but the shift is not limited to one base. Likewise, in Embodiment 2, when no match is found, one new base is appended to the existing dictionary sequence to form the new dictionary sequence, but the number of bases appended is not limited to one.

[0044] The embodiments above compress genome data, but the invention is not limited to this. For example, it can also be applied to compressing amino acid sequence information.

[0045] In the embodiments above, the token that replaces part of the compression target sequence consists, as in "1(50, 40)", of a bit flag indicating whether there was a match, the start position of the match in the dictionary sequence, and the matched length (number of bases), with the first unmatched base itself appended when there was no match. The token is not limited to this form, however; it need only be information indicating which part of the dictionary sequence is matched.
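
This section does not spell out how tokens are expanded again, but the token contents suggest a straightforward reading, sketched here under assumptions: tokens are `(flag, start, length, base)` tuples with 0-based start positions, and the function name `decompress` is hypothetical.

```python
def decompress(tokens, dictionary: str) -> str:
    """Rebuild the original sequence from tokens and the dictionary sequence."""
    parts = []
    for flag, start, length, base in tokens:
        if flag == 1:
            # Match token: copy the referenced stretch of the dictionary.
            parts.append(dictionary[start:start + length])
        else:
            # Dummy token: the unmatched base is carried in the token itself.
            parts.append(base)
    return "".join(parts)
```

Since a match token only points into the dictionary sequence, the same dictionary that was used for compression must be available when the data are expanded.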

[0046] Each unit shown in FIG. 1 may be implemented by dedicated hardware. Alternatively, each unit shown in FIG. 1 may consist of a memory and a CPU (central processing unit), with its function implemented by loading a program for that function into the memory and executing it.

[0047] The functions of the units in FIG. 1 may also be implemented by recording a program for those functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The term "computer system" here includes an OS and hardware such as peripheral devices.

[0048] A "computer-readable recording medium" means a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The term also covers media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period, such as volatile memory inside a computer system acting as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system. Although embodiments of the invention have been described in detail with reference to the drawings, the specific configuration is not limited to these embodiments and includes design changes and the like that do not depart from the gist of the invention.

【００４９】[0049]

[Effects of the Invention] As described above, the data compression apparatus, method, and program of the present invention make it possible to compress genome data at a higher compression ratio than before, and the compression ratio can be used to estimate the similarity between the organism used as dictionary data and the organism of the compression target data.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a data compression apparatus according to an embodiment of the present invention.

FIG. 2 shows an example of the dictionary DB in the embodiment.

FIG. 3 shows an example of a compression target sequence stored in the sequence DB in the embodiment.

FIG. 4 is a diagram explaining the operation of the data compression apparatus in the embodiment.

FIG. 5 is a diagram explaining the compression operation of the data compression apparatus in the embodiment.

FIG. 6 is a diagram explaining the compression operation of the data compression apparatus in the embodiment.

FIG. 7 is a diagram explaining the compression operation of the data compression apparatus in the embodiment.

FIG. 8 is a diagram explaining the compression operation of the data compression apparatus in the embodiment.

FIG. 9 shows another example of the dictionary DB in the embodiment.

FIG. 10 is a diagram for explaining the nature of genome base sequences.

[Explanation of symbols]

1: data compression apparatus, 11: sequence DB, 12: compression target sequence acquisition unit, 13: dictionary DB, 14: dictionary sequence acquisition unit, 15: comparison unit, 16: compression unit, 17: similarity acquisition unit

Continuation of front page
(72) Inventor: Toshihiko Morimoto, c/o NTT Data Corporation, 3-3 Toyosu 3-chome, Koto-ku, Tokyo
(72) Inventor: Tetsuro Toyoda, c/o RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa
F-terms (reference): 5B075 KK23 UU19; 5J064 AA02 BA11 BC01 BC14 BC29 BD02 BD04

Claims

[Claims]

1. A data compression apparatus for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, comprising: a storage unit that stores the dictionary data; a comparison unit that compares reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion, that, when a matching portion is detected as a result of the search, compares the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively, and that recognizes position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a compression unit that replaces the matching portion in the compression target data with replacement information including the position information and the length information recognized by the comparison unit.

2. The data compression apparatus according to claim 1, wherein, when no matching portion is found as a result of the search, the comparison unit uses, as new reference information, a sequence of unit data whose start point is shifted by one or more unit data from the start point of the reference information, and compares the new reference information with the dictionary data.

3. The data compression apparatus according to claim 1, further comprising a similarity acquisition unit that obtains the degree of similarity between the compression target data and the dictionary data by comparing the amount of information of the compression target data before compression with the amount of information after compression.

4. A data compression apparatus for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, comprising: a comparison unit that uses, as the dictionary data, a sequence of specific unit data among the unit data constituting the compression target data, that compares reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion, that, when a matching portion is detected as a result of the search, compares the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively, and that recognizes position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; a compression unit that replaces the matching portion in the compression target data with replacement information including the position information and the length information recognized by the comparison unit; and a storage unit that stores the dictionary data.

5. The data compression apparatus according to claim 4, wherein, when no matching portion is found as a result of the search, the comparison unit causes a sequence of unit data shifted by one or more unit data from the end point of the dictionary data to be additionally stored in the storage unit.

6. A data compression method for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, comprising: a step of storing the dictionary data; a step of comparing reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; a step of, when a matching portion is detected as a result of the search, comparing the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively; a step of recognizing position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a step of replacing the matching portion in the compression target data with replacement information including the recognized position information and length information.

7. A data compression method for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, comprising: a step of using, as the dictionary data, a sequence of specific unit data among the unit data constituting the compression target data; a step of storing the dictionary data; a step of comparing reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; a step of, when a matching portion is detected as a result of the search, comparing the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively; a step of recognizing position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a step of replacing the matching portion in the compression target data with replacement information including the position information and the length information recognized by the comparison unit.

8. A data compression program for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, the program causing a computer to execute: a step of storing the dictionary data; a step of comparing reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; a step of, when a matching portion is detected as a result of the search, comparing the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively; a step of recognizing position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a step of replacing the matching portion in the compression target data with replacement information including the recognized position information and length information.

9. A data compression program for compressing compression target data composed of a plurality of unit data by using dictionary data composed of a plurality of unit data, the program causing a computer to execute: a step of using, as the dictionary data, a sequence of specific unit data among the unit data constituting the compression target data; a step of storing the dictionary data; a step of comparing reference information, which is a sequence of unit data in an uncompressed portion of the compression target data, with the dictionary data to search for a matching portion; a step of, when a matching portion is detected as a result of the search, comparing the unit data following the reference information with the unit data that follows, in the sequence of unit data constituting the dictionary data, the sequence of unit data matching the reference information, thereby recognizing the length over which the unit data match consecutively; a step of recognizing position information indicating the start position of the matching portion in the dictionary data and length information of the matching unit data; and a step of replacing the matching portion in the compression target data with replacement information including the position information and the length information recognized by the comparison unit.