JP3555506B2

JP3555506B2 - Character string data compression encoding device, character string data decompression device, and character string data arithmetic processing device

Info

Publication number: JP3555506B2
Application number: JP17162299A
Authority: JP
Inventors: 真一湊
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-06-17
Filing date: 1999-06-17
Publication date: 2004-08-18
Anticipated expiration: 2019-06-17
Also published as: JP2000357970A

Description

【０００１】
【発明の属する技術分野】
本発明は、文字列データ圧縮符号化装置及び文字列データ復元装置及び文字列データ演算処理装置に係り、特に、データの通信及び加工及び、文字列データを圧縮符号化して効率よく伝送または保存し、圧縮符号化された文字列データ同士の演算処理を高速に実行するための文字列データ圧縮符号化装置及び文字列データ復元装置及び文字列データ演算処理装置に関する。
【０００２】
【従来の技術】
入力文字列を、過去に現れた文字列の複製として符号化するデータ圧縮方式として、Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式が広く知られている。また、これを改良した方式に関する多数の特許が出願されている。例えば、特許第２５９０２８７号には、Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式を改良した方式に関する発明が記載されている。
【０００３】
一般に、Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式は、過去に入力された文字列の履歴（の一部）を記憶するメモリと、その中に含まれる部分列を登録・検索する辞書を持ち、入力文字列データを順次読み込みながら、辞書に部分列を登録し、これを参照して同一の部分列を検出することにより、データ圧縮を行う方式である。Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式にはユニバーサル型と増分分解型の２つの方式がある。
【０００４】
ユニバーサル型Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式は、入力文字列を符号化する際に、過去の入力文字列の任意の位置から現在の文字列に一致する最大長の文字列部分を区切り（文字列部分列という）、その文字部分列のメモリ上の位置と文字列部分列の長さにより指定してデータ圧縮を行う方式である。
増分分解型Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式は、過去にコード化された最大長の文字列に新たに出現した一文字を付加した文字列を符号化して登録し、現在の文字列を過去に現れた文字部分列の最長文字部分列の複製として符号化する方式である。
【０００５】
いずれの方式においても、文字列データの圧縮率を高めることと、圧縮・復元処理を効率よく実行できること、の２点を目的として発明及び改良がなされている。即ち、圧縮符号化されたコード列データは、一旦、復元処理を通して元の文字列データに戻した後に、種々の演算処理に供されることを想定している。圧縮符号化されたコード列データを復元処理を通さずに直接演算処理を行うことは、従来技術では考慮されておらず、このような演算処理を効率よく行えるように工夫された符号にはなっていない。
【０００６】
一方、データを圧縮して記憶装置に格納し、圧縮したままの状態で高速に演算処理を行う方法として、「二分決定グラフ」と呼ばれるデータ構造を用いた論理関数データの表現方式及び演算方式が知られている。その処理方式は、Ｒ．Ｅ．Ｂｒｙａｎｔ：”Ｇｒａｐｈ−Ｂａｓｅｄ Ａｌｇｏｒｉｔｈｍｓ ｆｏｒ Ｂｏｏｌｅａｎ Ｆｕｎｃｔｉｏｎ Ｍａｎｉｐｕｌａｔｉｏｎ”，ＩＥＥＥ Ｔｒａｎｓ．Ｃｏｍｐｕｔ．，Ｖｏｌ．Ｃ−３５，Ｎｏ．８，ｐｐ．６７７−６９１（以後、論文１と呼ぶ）に記載されている。また、情報処理学会誌「情報処理」１９９３年５月号５８４〜６３０ページの「特集ＢＤＤ（二分決定グラフ）」（以後、記事２と呼ぶ）にも詳しい解説記事が掲載されている。
【０００７】
論理関数データとは、ある複数個の入力変数に０，１の論理値の組み合わせがそれぞれ与えられたときに、関数値として０，１のいずれの論理値が出力されるか、という論理関数の情報を表したものである。
二分決定グラフが発明される以前は、論理関数データを表現するために、２^ｎ通りの入力変数値の組み合わせ（ｎは入力変数の個数）に対する関数値を羅列した「真理値表」と呼ばれる０，１の数列の形式で表現する方式が多く用いられている。図１２は、Ｘ_１，Ｘ_２，Ｘ_３の３変数を入力とする論理関数データを表現する真理値表の一例である。
【０００８】
二分決定グラフは、グラフを用いた論理関数データの表現方法であって、
・すべての入力変数に、ある固定した順序を与える手段（以下、順序付け処理と呼ぶ）、
・入力変数の中のある１つの変数に０及び１の論理値を代入することにより、論理関数から２つの部分関数を導出する手段（以下、展開処理と呼ぶ）、
・上記展開処理を、上記順序付け処理により与えられる順序に従って、０，１を代入する入力変数がなくなるまで繰り返し、個々の展開処理を表す中間節点、個々の展開処理により得られた２つの部分関数を指す０−枝及び１−枝、展開処理の繰り返しの結果得られた０，１の論理値を表す０−終端節点と１−終端節点、を用いた図１３のような二分木グラフを生成する手段（以下、二分木生成処理と呼ぶ）、
・上記二分木生成処理において、図１４のように、該二分木のある２つの中間節点の０−枝同士、１−枝同士がそれぞれ同一の節点を指しているかどうかを調べ、そのような論理的に等価な２つの中間節点を発見した場合に、その一方の中間節点を消去し、他方の中間節点を共有して用いる手段（以下、共有化処理と呼ぶ）、
・個々の中間節点において、図１５のようにその０−枝と１−枝が、共に同じ節点を指しているときには、該中間節点を取り除き、枝を直結させる手段（以下、非冗長化処理と呼ぶ）、
からなる。
【０００９】
上記の処理を組み合わせて行うことにより、図１６のような二分決定グラフが得られる。この二分決定グラフは、図１７のように中間節点を列挙した表の形式で記憶装置に格納することができる。このときに必要とする記憶量は、二分決定グラフの節点数に比例したものとなる。論理関数データを真理値表で表現した場合は、入力変数の値の組み合わせの数（変数の個数をｎとすると２^ｎ通り）の記憶量が常に必要となるが、２分決定グラフを用いるとそれよりも少ない記憶量となる場合がしばしばあり、ときには、圧縮率が数百倍に達する例もある。そのため、二分決定グラフは論理関数データの圧縮記憶方式として、近年広く利用されている。
【００１０】
さらに、論文１、及び記事２には、二分決定グラフ同士の二項演算処理方式が記載されている。即ち、２つの二分決定グラフが記憶装置に格納されているときに、それらの二分決定グラフが表す論理関数データ同士の二項論理演算（論理積、論理和、排他的論理和、等）を実行し、その演算結果の論理関数データを表現する新しい二分決定グラフを生成し、記憶装置に格納する方法が述べられている。
【００１１】
その演算方法は、与えられた２つの二分決定グラフの最も上位の変数に０及び１を代入することにより、各々対応する２つの中間節点の組に分解し、得られた中間節点の組をさらに分解し、これを繰り返して、多数の終端節点の組を得る。それらの終端節点の組の各々に対する２項演算結果を表す終端節点を新たに生成し、そのようにして得られた演算結果を再び二分決定グラフに組み上げることにより、新しい二分決定グラフを生成する。このとき、演算処理中に現れた中間節点の組とその組に対する演算結果とを辞書に登録することにより、同じ中間節点の組を検出したときには、それ以上分解せず、直ちに辞書に登録されている中間節点を参照して演算結果とする、という高速化手段を具備している。
【００１２】
この演算方法の計算時間は、演算対象として与えられた２つの二分決定グラフの節点数と、演算結果として生成された二分決定グラフの節点数の総和にほぼ比例する。従って、真理値表データ形式を用いた論理演算方式と比較すると、二分決定グラフによるデータ圧縮率にほぼ比例して、演算処理高速化の効果が得られる。
【００１３】
【発明が解決しようとする課題】
Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式では、圧縮符号化されたコード列データは、一旦、復元処理を通して元の文字列データに戻した後に、種々の演算処理に供されることを想定している。圧縮符号化を行うことにより、伝送データ量または、記憶領域が節約できるが、圧縮・復元のための処理時間が余計に必要となる。圧縮率と処理時間は一般にトレードオフの関係にあると考えられている。
【００１４】
しかし、圧縮符号化されたコード列データを復元せずに直接演算処理をすることができれば、圧縮・復元のための処理時間が不要になるばかりでなく、データの圧縮率に比例して演算処理時間を短縮することも可能になると考えられる。
そのためには、圧縮符号化の際の部分列の区切り方を、演算処理しやすいように工夫する必要がある。ところが、従来のＬｅｍｐｅｌ−Ｚｉｖ符号化方式では、入力される文字列データの内容により異なる位置で部分列が区切られるため、２本の圧縮コード列同士の二項演算処理を実行しようとすると、それぞれの部分列データの対応関係が複雑となり、圧縮したままでは効率よく演算処理を実行することができないという問題がある。
【００１５】
一方、二分決定グラフを用いた二項論理演算処理方式では、圧縮したままのデータ形式で二項演算処理を実行することができ、圧縮率にほぼ比例する高速化が実現されている。
しかし、二分決定グラフを用いた二項論理演算処理方式では、演算対象の２つの二分決定グラフは記憶装置内に節点表として格納された状態で演算処理装置に引き渡され、演算結果の二分決定グラフも記憶装置内に節点表として生成される。従って、これらの二分決定グラフの全ての節点が前以って記憶装置内に格納可能であることが必要条件となっている。従来の技術の項で記載した二分決定グラフの共有化処理と非冗長化処理では、節点の一致判定を高速に行うために、全ての節点を記憶装置内の辞書に登録・検索する方式が用いられている。辞書の容量は有限であり、演算結果の二分決定グラフが辞書に入り切らなくなった場合には、処理の途中で異常終了し、演算処理の結果を得ることができないという問題がある。
【００１６】
さらに、二分決定グラフを用いた二項演算処理方式では、演算対象の２つの二分決定グラフのすべての節点を記憶装置に格納してからでなければ演算処理を開始できない。即ち、前段の処理が完了しないと次段の処理を開始できない。よって、連続する演算処理を直列接続してパイプライン式に並行実行することができないという問題がある。
【００１７】
本発明は、上記の点に鑑みなされたもので、圧縮符号化されたコード列データを必ずしも完全に復元せずに直接演算処理することができ、データ圧縮率に応じて演算処理の高速化が可能となるような、文字列データ圧縮符号化装置及び文字列データ復元装置及び文字列データ演算処理装置を提供することを目的とする。
また、本発明は、従来の二分決定グラフを用いた演算処理方式のようにすべての入力データを記憶装置内に格納してから演算を開始するのではなく、入力された圧縮コード列データを先頭から順次読み込むと同時に演算処理を実行し、演算結果を順次出力することが可能な文字列データ圧縮符号化装置及び文字列データ復元装置及び文字列データ演算処理装置を提供することを目的とする。
【００１８】
【課題を解決するための手段】
図１は、本発明の原理構成図（圧縮符号化装置）である。
本発明（請求項１）は、２^ｎ（但し、ｎは０以上の整数定数）個の長さの文字列データを入力とし、その情報を損なうことなしにデータを圧縮したコード列データを出力する文字列データ圧縮符号化装置であって、
圧縮前の文字列データを先頭から順次読み込む入力処理手段１１０と、
圧縮符号化後のコード列データを先頭から順次出力する出力処理手段１２０と、
ある有限の容量の辞書１３０と、
入力処理手段１１０から入力された長さ２ ^ｎの入力文字列データをその前半部と後半部のそれぞれ半分の長さの２つの部分列に区切り、得られた該部分列の前半部と後半部に区切り、これを繰り返して、長さ１，２，４，８，１６、…２^ｎの部分列に分解するために参照される分解規則１５０と、
分解規則１５０を参照して入力処理手段１１０から入力された入力文字列データを順次読み込みながら、長さ１，２，４，８，１６、…，２^ｎの部分列を取り出して、異なる文字列毎に異なる登録番号を付与して、辞書にその容量の許す限り格納し、該辞書を参照しながら、該入力文字列データを読み込むことにより、該入力文字列データの中に同一の文字列が複数回繰り返して出現したことを検出する検出手段１４１と、
入力文字列データに同一の文字列からなる部分列が複数回繰り返して出現する場合には、１回目は文字列と該文字列の登録番号の組を出力するように出力処理手段１２０に指示し、２回目以降は、該部分列を指し示す登録番号のみを出力するよう該出力処理手段１２０に指示する出力制御手段１４２とを有する。
【００１９】
本発明（請求項２）は、任意の長さの文字列データが入力された場合に、入力文字列データの長さがちょうど２^ｎ（但し、ｎはなるべく小さい整数定数）となるように、空白を表す特別な文字を文字列データの末尾に複数個追加する手段を有する。
本発明（請求項３）は、検出手段１４１において、
長さ２以上の任意の文字列ａを、前半部と後半部のそれぞれ半分の長さの２つの部分列ａ１と部分列ａ２に分解し、該部分列ａ１と該部分列ａ２の文字列がすでに辞書に格納されている場合に、該部分列ａ１と該部分列ａ２の文字列を指し示す登録番号の組を持って文字列ａを表現し、該組に該文字列ａを表す新たな登録番号を与えて該辞書１３０に格納する格納手段を有し、
出力制御手段１４２において、
長さ２以上の文字列ａを出力する際に、前半部と後半部の２つの部分列ａ１，部分列ａ２の文字列が辞書１３０に格納されていない場合には、該部分列ａ１，部分列ａ２の文字列を順次出力させることにより、文字列ａを表現し、該部分列ａ１、該部分列ａ２の片方または両方の文字列が該辞書１３０に格納済の場合には、格納済の文字列を出力する代わりに該文字列が指し示す登録番号を出力させる手段を有する。
【００２０】
本発明（請求項４）は、格納手段において、
辞書１３０の容量が限界に達して、新しい文字列ｘを格納できなくなった場合に、これまでに格納された文字列の中で、他の文字列の前半部または後半部としてまだ一度も使われていない文字列ｙを検索し、該文字列ｙを表す該文字列ｙの前半部と後半部の文字列を指し示す登録番号の組を一旦消去し、消去した文字列ｙと同じ登録番号で指し示される記憶領域に新しい文字列ｘを表す情報を格納し、以降の処理を進める手段を有する。
【００２１】
図２は、本発明の原理構成図（文字列データ復元装置）である。
本発明（請求項５）は、文字列データ圧縮符号化装置により圧縮符号化されたコード列データから元の文字列データを復元して取り出す文字列データ復元装置であって、
圧縮符号化されたコード列データを先頭から順次読み込む入力処理手段２１０と、
復元した文字列データを先頭から順次出力する出力処理手段２２０と、
文字列データを圧縮したときに用いた文字列データ圧縮符号化装置の辞書と同等または、それ以上の容量の辞書２３０と、
入力処理手段２１０から取得した圧縮符号化されたコード列データを順次読み込む際に、単純な文字列として表現された部分列を読み込んだ場合には、該文字列をそのまま出力するように出力処理手段２２０に指示し、部分列の登録番号と文字列の組を読み込んだ場合には、辞書２３０に該文字列を該登録番号を付与して格納し、部分列の登録番号のみを読み込んだ場合には、該登録番号が指し示す部分列を該辞書２３０から取り出して出力するように該出力処理手段２２０に指示する復元制御手段２４０とを有する。
【００２２】
本発明（請求項６）は、復元制御手段２４０において、
圧縮符号化されたコード列データを順次読み込む際に、前半部と後半部の２つの部分列の組によって表現された文字列が入力された場合に、該前半部と該後半部の部分列の登録番号の組をもって該文字列を表現して辞書２３０に格納する手段と、
辞書２３０に格納された部分列を取り出して、出力する際には、該部分列の前半部と後半部の部分列を該辞書２３０から順次参照し、それらの部分列がさらに前半部と後半部の組として表現されている場合には、さらにそれらを順次参照し、長さ１の部分列を取り出して出力させる手段を更に有する。
【００２３】
図３は、本発明の原理構成図（文字列データ演算処理装置）である。
本発明（請求項７）は、任意の２個の文字に対して１個の文字を計算して出力する、ある二項演算が定義されている場合に、長さが同じ２本の文字列データを入力として、先頭からそれぞれ同じ位置にある文字同士について、所定の二項演算を計算した結果を表す１本の文字列データを出力する文字列データ演算処理装置であって、
圧縮符号化されたコード列データを先頭から順次読み込む第１の入力手段と、復元した文字列データを先頭から順次出力する第１の出力手段と、文字列データを圧縮した時に用いた文字列データ圧縮符号化装置の辞書と同等または、それ以上の容量の第１の辞書と、該第１の入力手段から取得した圧縮符号化されたコード列データを順次読み込む際に、単純な文字列として表現された部分列を読み込んだ場合には、該文字列をそのまま出力するように該第１の出力手段に指示し、部分列の登録番号と文字列の組を読み込んだ場合には、該第１の辞書に該文字列を該登録番号を付与して格納し、部分列の登録番号のみを読み込んだ場合には、該登録番号が指し示す部分列を該第１の辞書から取り出して出力するように該第１の出力手段に指示する復元制御手段とを有する２つの文字列データ復元装置からなり、入力されたコード列データを文字列データにそれぞれ復元して転送する２つの入力処理手段３１０，３２０と、
入力処理手段３１０，３２０から転送された２つの文字列データの、先頭からそれぞれ同じ位置にある文字同士の二項演算を順次計算し、その結果を新たな文字列データとして転送する演算処理手段３３０と、
圧縮前の文字列データを先頭から順次読み込む第２の入力手段と、圧縮符号化後のコード列データを先頭から順次出力する第２の出力手段と、ある有限の容量の第２の辞書と、該第２の入力手段から入力された長さ２^ｎの入力文字列データをその前半部と後半部のそれぞれ半分の長さの２つの部分列に区切り、得られた該部分列の前半部と後半部に区切り、これを繰り返して、長さ１，２，４，８，１６、…２^ｎの部分列に分解するために参照される分解規則と、該分解規則を参照して該第２の入力手段から入力された該入力文字列データを順次読み込みながら、該長さ１，２，４，８，１６、…２^ｎの部分列を取り出して、異なる文字列毎に異なる登録番号を付与して、該第２の辞書にその容量の許す限り格納し、該第２の辞書を参照しながら、該入力文字列データを読み込むことにより、該入力文字列データの中に同一の文字列が複数回繰り返して出現したことを検出する検出手段と、該入力文字列データに同一の文字列からなる部分列が複数回繰り返して出現する場合には、１回目は文字列と該文字列の登録番号の組を出力するように該第２の出力手段に指示し、２回目以降は、該部分列を指し示す登録番号のみを出力するよう該第２の出力手段に指示する出力制御手段とを有する文字列データ圧縮符号化装置であり、演算処理手段３３０から転送された文字列データを圧縮符号化してコード列データとして出力する出力処理手段３６０とを有する。
【００２４】
本発明（請求項８）は、入力処理手段３１０、３２０と出力処理手段３６０において、部分列がそれぞれの第１の辞書または第２の辞書に格納または参照された場合に、それぞれの登録番号の情報を転送する転送手段を更に有し、
転送手段から登録番号の情報を取得して、ある２つの入力部分列の組に対する二項演算の結果がどの部分列に対応したかを、２つの入力部分列と１つの出力部分列の３つの部分列の登録番号の組として、ある有限の容量の演算高速化辞書に該容量が許す限り格納し、該演算高速化辞書を参照しながら演算処理を順次行うことにより、同一の入力部分列の組が複数回出現したことを検出する検出手段と、
同一の入力部分列の組が出現したときには、入力処理手段３１０、３２０において、部分列の登録番号を文字列データに復元する処理を省略するように制御し、出力処理手段３６０において演算高速処理辞書に格納されている演算結果の部分列の登録番号をそのまま出力するように制御する制御手段を更に有する。
【００２５】
本発明の圧縮符号化装置は、２^ｎ（但し、ｎは０以上の整数定数）個の長さの文字列データを入力とし、その情報を損なうことなしに、データを圧縮したコード列データを出力する。当該圧縮符号化装置によれば、長さ２^ｎの入力文字列データを、その前半部と後半部のそれぞれ半分の長さの２つの部分列に区切り、得られた部分列をさらにそれぞれの前半部と後半部に区切り、これを繰り返して長さ、１，２，４，８，１６，…，２^{（ｎ−１）}の多数の部分列に分解するという分解規則を用いている。
【００２６】
さらに、上記の分解規則に基づいて、入力文字列データを順次読み込みながら、上記の長さ１，２，４，８，１６，…，２^{（ｎ−１）}の部分列を取り出して、異なる文字列毎に異なる登録番号を付与して、辞書にその容量を許す限り格納する。この辞書を参照しながら、入力文字列データを読み込むことにより、文字列データの中に同一の文字列が複数回繰り返して出現したことを検出することができる。入力文字列データに同一の文字列からなる部分列が複数回繰り返して出現する場合には、１回目は文字列とその登録番号の組を出力し、２回目以降は、当該文字列を指し示す登録番号のみを出力することにより、データを圧縮符号化する。
【００２７】
また、本発明の圧縮符号化装置において、入力文字列データの長さの制限を取り除く、すなわち、入力文字列データの長さがちょうど２^ｎ（但し、ｎはなるべく小さい整数定数）となるように、空白を表す特別な文字を文字列データの末尾に複数個追加し、これを圧縮符号化する。
また、本発明の圧縮符号化装置において、長さ１，２，４，８，１６，…，２^{（ｎ−１）}の部分列を、そのまま列挙して辞書に格納するのではなく、前半部と後半部の２つの部分列に分解して格納する。即ち、長さ２以上の文字列ａを、前半部と後半部のそれぞれ半分の長さの２つの部分列ａ１とａ２に分解し、ａ１とａ２の文字列がすでに辞書に格納されている場合に、２つの文字列を指し示す登録番号の組を持って文字列ａを表現し、これに文字列ａを表す新たな登録番号を与えて、辞書に格納する。これにより、長さ２以上の部分列はすべて、前半部と後半部の部分列を指し示す登録番号の組により表現される。
【００２８】
また、長さ２以上の文字列ａを出力する際に、前半部と後半部の２つの部分列ａ１、ａ２の文字列が辞書に格納されていない場合には、２つの文字列を順次出力することにより、文字列ａを表現し、そうでない場合、（すなわちａ１，ａ２の片方または両方の文字列が辞書に格納済の場合）には、格納済の文字列を出力する代わりにそれを指し示す登録番号を出力することにより、データを圧縮符号化することが可能となる。
【００２９】
また、本発明の圧縮符号化装置において、文字列データの部分列に分解して辞書に格納する処理において、辞書の容量が限界に達して、新しい文字列ｘを格納できなくなった場合に、これまでに格納された文字列の中で、他の文字列の前半部または、後半部としてまだ一度も使われていない文字列ｙを探し出して、その文字列ｙを表す情報（文字列の前半部と後半部の文字列を指し示す登録番号の組）を一旦消去し、消去した文字列ｙと同じ登録番号で指し示される記憶領域に新しい文字列ｘを表す情報を格納し、以降の処理を進める。
【００３０】
本発明の文字列データ復元装置は、データ圧縮符号化装置により圧縮符号化されたコード列データを順次読み込む際に、単純な文字列で表現された部分列を読み込んだときはその文字列をそのまま出力し、部分列の登録番号と文字列の組を読み込んだ場合には、文字列にその登録番号を付与して辞書に格納し、登録番号のみを読み込んだ場合は、その登録番号が指し示す部分列を辞書から取り出して出力することにより、文字列データを復元する。
【００３１】
さらに、本発明の文字列データ復元装置は、文字列データを前半部と後半部の２つの部分列に分解して圧縮符号化されたデータを復元する。即ち、圧縮符号化されたコード列データを順次読み込む際に、前半部と後半部の２つの部分列の組によって表現された文字列が入力された場合に、前半部と後半部の部分列の登録番号の組を持ってその文字列を表現して辞書に格納し、一方、辞書に登録された部分列を取り出して出力する際には、その前半部と後半部の部分列を辞書から順次参照し、それらの部分列がさらに前半部と後半部の組として表現されている場合には、さらに、それらを順次参照し、遂に長さ１の部分列を取り出してこれを出力することを繰り返して、部分列全体を出力する。
【００３２】
本発明の文字列データ演算処理装置は、データ圧縮符号化装置により圧縮された状態の２本のコード列データを順次読み込み、これらの圧縮化コード列を復元したら得られるであろう２本の文字列データについて、先頭からそれぞれ同じ位置にある文字同士について、所定の二項演算を計算した結果を表す１本の文字列データを生成し、これを圧縮化方式で圧縮された状態で順次出力する。また、２つの入力処理手段において入力されたコード列データを文字列データにそれぞれ復元して演算処理手段に転送し、当該演算処理手段において、入力処理手段から転送されて来た２本の文字列データの、先頭からそれぞれ同じ位置にある文字同士の二項演算を順次計算し、その結果を新たな文字列データとして出力処理手段に転送し、当該出力処理手段において、演算処理手段から転送されてきた文字列データを圧縮符号化してコード列データとして出力する。
【００３３】
さらに、本発明の文字列データ演算処理装置に、検出手段、制御手段と、ある有限の容量の演算高速化辞書を設け、２個の入力処理手段と、出力処理手段において、部分列がそれぞれの辞書に格納されたり参照されたりしたときに、それぞれの登録番号の情報を検出手段に転送する。当該検出手段において、ある２つの入力部分列の組に対する二項演算の結果がどの部分列に対応したかを、それら３つの部分列の登録番号の組として、演算高速化辞書にその容量が許す限り格納し、この辞書を参照しながら演算を順次行うことにより、同一の入力部分列の組が複数回現れたことを検出する。制御手段において、同一の入力部分列の組が出現したときには、入力処理手段において部分列の登録番号を文字列データに復元する処理を省略し、出力処理手段において演算高速化辞書に格納されている演算結果の部分列の登録番号をそのまま出力することにより、演算処理を高速化することが可能となる。
【００３４】
上記のように、本発明では、Ｌｅｍｐｅｌ−Ｚｉｖ符号化方式のように、入力文字列データの内容によって異なる位置で部分列を切り出すのではなく、文字列データを前半部と後半部の２つの部分列に固定的に区切り、各部分列をさらに前半部と後半部の２つの部分列に区切り、これを繰り返して長さ１，２，４，８，１６，…２^{（ｎ−１）}の部分列に区切るという分解規則を用いている。そして、入力データを先頭から順次読み込みながら、それらの部分列を取り出して、二分決定グラフと同様のデータ構造として辞書に登録する。
【００３５】
このような固定的な分解方式を採用すると、自由な区切り位置を許す場合に比べて、一致する部分列を検出する機会が減り圧縮率が低下するという可能性もある。しかしながら、上記のように区切り位置を固定することにより、文字列データの内容に関わりなく、データの先頭から２^ｋ（１＜ｋ＜ｎ）の倍数に当たる位置が常に部分列の境界となるため、複数の圧縮コード列の各々の部分列同士の対応関係が単純になる。
【００３６】
そこで、２本の圧縮コード列を入力として、それらの文字列同士の二項演算結果を新たな圧縮コード列として出力する処理において、ある２つの入力部分列の組に対する二項演算の結果が出力コード列のどの部分列に対応するかを、それらの３つの部分列の登録番号の組として、演算高速化辞書にその容量が許す限り格納し、この辞書を参照しながら演算処理を順次行うことにより、同一の入力部分列の組が複数回出現したことを検出することができる。２回目以降の出現時には、演算結果の部分列を示す辞書登録番号だけを出力することによって、同じ演算処理を繰り返す手間が省略されて高速化が可能になる。
【００３７】
本発明で用いている圧縮コード列の二項演算の高速化方式は、従来の二分決定グラフによる二項論理演算の高速化方式と同じ原理に基づく。しかしながら、本発明は、従来のように二分決定グラフの全ての節点の表が記憶装置に格納された状態で入力データとして与えられるのではなく、圧縮符号化されたコード列データを入力として順次読み込むと同時に演算結果を順次出力するという点で、従来の方法と本質的に異なる。
【００３８】
また、本発明では、文字列の圧縮及び復元の処理中に、辞書の容量が限界に達して、新しい文字列ｘを格納できなくなった場合に、これまでに格納された文字列の中で、他の文字列の前半部または、後半部としてまだ一度も使われていない文字列ｙを探し出して、その文字列ｙを表す情報（文字列の前半部と後半部の文字列を指し示す登録番号の組）を一旦消去し、消去した文字列ｙと同じ登録番号で指し示される記憶領域に新しい文字列ｘを表す情報を格納し、以降の処理を継続することが可能である。文字列ｙの情報を消去したことにより、部分列の一致を検出する機会が減少して圧縮率が低下する恐れはあるが、これにより文字列データの情報が失われたり歪んだりすることはないため、演算処理は正しく行うことが可能となる。
【００３９】
つまり、本発明では、記憶装置の容量が十分でない場合でも、その容量に応じた圧縮率及び高速化率で演算処理を行うことが可能である。これに対して従来の二分決定グラフを用いた二項論理演算方式では、記憶装置に格納された二分決定グラフの節点表が入力データとして引き渡されるため、記憶装置の容量を超える大規模データを扱うことは原理的に不可能である。
【００４０】
さらに、本発明では、圧縮符号化されたコード列データを入力として順次読み込むと同時に演算結果のコード列データを順次出力するので、処理を複数段直列に接続してパイプライン式に並列実行することができる。従来の二分決定グラフを用いた二項論理演算処理方式では、前段の処理が完了しないと次段の処理を開始できないため、パイプライン式に並列実行して高速化を図ることができないが、本発明は、この点についても従来の方法より優れていると言える。
【００４１】
【発明の実施の形態】
１．文字列データ圧縮符号化装置：
最初に文字列データ圧縮符号化装置について説明する。
図４は、本発明の文字列データ圧縮符号化装置の構成を示す。
同図に示す文字列データ圧縮符号化装置１００は、２^ｎ（但し、ｎ≧０の整数定数）個の長さの文字列データを入力とし、その情報を損なうことなしにデータを圧縮したコード列データを出力する。
【００４２】
当該文字列データ圧縮符号化装置１００は、圧縮前の文字列データを先頭から順次読み込む入力処理部１１０と、圧縮符号化後のコード列データを先頭から順次出力する出力処理部１２０と、ある有限の容量の辞書１３０、それらを制御する圧縮符号化制御部１４０及び分解規則１５０から構成される。
当該文字列データ圧縮符号化装置１００では、入力文字列データの長さは丁度２^ｎ（ｎ≧０のある整数）個であることが仮定されている。もしもそれ以外の長さの文字列データが入力された場合は、入力処理部１１０において、空白を表す特別な文字を文字列データの末尾に複数個追加し、圧縮符号化制御部１４０に送られる文字列データの長さがちょうど２^ｎ（但し、ｎはなるべく小さい整数定数）になるようにする。
【００４３】
圧縮符号化制御部１４０においては、入力処理部１１０から転送されてくる文字列データを多数の部分列に分解し、辞書１３０に格納する。その部分列の区切り位置は文字列データの内容によらず、ある一定の分解規則１５０に基づく。即ち、長さ２^ｎの入力文字列データを、その前半部と後半部のそれぞれ半分の長さの２つの部分列に区切り、得られた部分列をさらにそれぞれの前半部と後半部に区切り、それを繰り返して長さ１，２，４，８，１６、…、２^ｎの多数の部分列に分解するという規則を用いている。図５は、本発明の文字列データを部分列に分解した例を示しており、同図の例では、長さ８の文字列データを部分列に分解する様子が示されている。
【００４４】
圧縮符号化制御部１４０においては、上記の分解規則１５０に基づいて、入力文字列データを先頭から順次読み込みながら、上記の長さ１，２，４，８，１６，…，２^ｎの部分列を取り出して、異なる文字列毎に異なる登録番号を付与して、辞書１３０にその容量の許す限り格納する。この辞書を参照しながら入力文字列データを読み込むことにより、文字列データの中に同一の文字列が複数回繰り返して出現したことを検出する。
【００４５】
２．文字列データ復元装置：
次に、文字列データ復元装置について説明する。
図６は、本発明の文字列データ復元装置の構成を示す。
同図に示す文字列データ復元装置２００は、文字列データ圧縮符号化装置１００により圧縮符号化されたコード列データを元の文字列に復元するものである。
【００４６】
当該文字列データ復元装置２００は、圧縮コード列データを先頭から順次読み込む入力処理部２１０と、復元文字列データを先頭から順次出力する出力処理部２２０と、文字列データを圧縮したときに用いた文字列データ圧縮符号化装置１００の辞書１３０と同等以上の容量の辞書２３０と、それらを制御する復元制御部２４０から構成される。
【００４７】
復元制御部２４０においては、圧縮符号化されたコード列データを順次読み込む際に、単純な文字列で表現された部分列を読み込んだ場合は、その文字列を出力処理部２２０からそのまま出力し、部分列の登録番号とその文字列の組を読み込んだ場合には、その部分列を該登録番号を付与して辞書２３０に格納し、部分列の登録番号のみを読み込んだ場合には、その登録番号が指し示す部分列を辞書２３０から取り出して出力することにより文字列データを復元する。
【００４８】
さらに、復元制御部２４０においては、圧縮符号化されたコード列データを順次読み込む際に、前半部と後半部の２つの部分列の組によって表現された文字列が入力された場合に、前半部と後半部の文字列の登録番号の組を持ってその文字列を表現して辞書２３０に格納する。結果的に、前述の文字列データ圧縮符号化装置１００により圧縮符号化を行ったときに内部で作られた辞書１３０と全く同じ内容の辞書が復元される。
【００４９】
さらに、復元制御部２４０においては、辞書２３０に格納された部分列を取り出して出力する際に、その前半部と後半部の部分列を辞書２３０から順次参照し、それらの部分列がさらに前半部と後半部の組として表現されている場合には、さらに、それらを順次参照し、遂に長さ１の部分列を取り出してこれを出力することを繰り返して、部分列全体を出力する。
【００５０】
３．文字列データ演算処理装置：
次に、文字列データ演算処理装置について説明する。
図７は、本発明の文字列データ演算処理装置の構成を示す。
同図に示す文字列データ演算処理装置は、入力処理部３１０、３２０として、２つの圧縮コード列復元装置（前述の文字列データ復元装置と同様の構成）と、出力処理部３６０として、圧縮符号化装置（前述の文字列データ圧縮符号化装置と同様の構成）と、二項演算を行う演算処理部３３０と、ある有限の容量の演算高速化辞書３５０と、それらを制御する制御部３４０からなる。
【００５１】
この装置は、任意の２個の文字に対して１個の文字を計算して出力する、ある二項演算（例えば、論理積、論理和のような論理演算や加減乗除のような算術演算）が定義されているときに、前述の符号化方式で圧縮された状態の２本のコード列データを順次読み込み、それらの圧縮化コード列を復元したら得られるであろう２本の文字列データについて、先頭からそれぞれ同じ位置にある文字同士について、所定の二項演算を計算した結果を表す１本の文字列データを生成し、これを前述の符号化方式で圧縮された状態で順次出力するものである。
【００５２】
２つの入力処理部３１０、３２０においては、圧縮コード列データを先頭から順次読み込み、文字列データにそれぞれ復元して演算処理部３３０に転送し、演算処理部３３０において、入力処理部３１０、３２０から転送されてきた２本の文字列データの、先頭からそれぞれ同じ位置にある文字同士の二項演算を順次計算し、その演算結果を新たな文字列データとして出力処理部３６０に転送する。出力処理部３６０において、演算処理部３３０から転送されてきた文字列データを圧縮符号化してコード列データとして出力する。
【００５３】
しかし、単に文字列データを復元して、先頭から順に演算して、再度圧縮するというだけでは、常に復元文字列データの長さに比例する計算時間が必要となり、効率的でない。そこで、本発明では、以下に説明するような演算処理高速化の手段を備えている。
２つの入力処理部３１０、３２０と出力処理部３６０においては、部分列がそれぞれの辞書に格納されたり参照されたりしたときに、それぞれの登録番号の情報を制御部３４０に定期的に報告する。
【００５４】
制御部３４０においては、ある２つの入力部分列の組に対する二項演算の結果がどの部分列に対応したかを常に監視し、それら３つの部分列の登録番号の組が得られる毎に、演算高速化辞書３５０にその容量が許す限り格納し、この演算高速化辞書３５０を参照しながら演算を順次行うことにより、同一の入力部分列の組が複数回現れたことを検出する。
【００５５】
さらに、制御部３４０においては、同一の入力部分列の組を検出したときには、２回目以降は、入力処理部３１０、３２０において部分列の登録番号を文字列データに復元する処理を省略するように制御指令を送り、出力処理部３６０において、演算高速化辞書３５０に格納されている演算結果の部分列の登録番号をそのまま出力するように制御指令を送る。
【００５６】
以上の方法により、同一の入力部分列の組が多数出現する場合には、単純に復元して演算して再圧縮する方式に比べて、全体の処理時間を短縮することができる。その高速化の割合は、同一の入力部分列の組が出現する頻度によって決まり、演算処理高速化辞書３５０の容量が十分にあるときには、入力と出力のコード列データの圧縮率にほぼ比例した高速化率が得られる。
【００５７】
【実施例】
以下、図面と共に本発明の実施例を説明する。
［第１の実施例］
第１の実施例として図４に示す文字列データ圧縮符号化装置１００に基づいて説明する。
【００５８】
図８は、本発明の第１の実施例の文字列データ圧縮符号化装置の動作を説明するための図であり、「ＡＢＡＢＡＢＡＡ」という長さ８の文字列データを入力した場合の動作の例を示す。
図９は、本発明の第１の実施例の文字列データの部分列を登録した辞書の一例を示す。以下、図８と図９の例に基づいて説明する。
【００５９】
文字列は左から１文字ずつ入力処理部１１０により読み込まれる。まず最初に「Ａ」という長さ１の文字が認識される。次に、「Ｂ」という長さ１の文字が認識される。ここで、「ＡＢ」という部分列が初めて出現するので、これを部分列１という登録番号を付与して辞書１３０に登録する。辞書には文字「Ａ」と「Ｂ」を連結した長さ２の部分文字列として格納される。
【００６０】
次に、第３文字目と第４文字目を読み込んだところで、再び「ＡＢ」という部分列を認識する。これは、すでに登録されている部分列１と同一の文字列であるので、新たに登録する必要はない。
次に、第１文字目から４文字目までの部分列「ＡＢＡＢ」を認識する。これは、新しい文字列であるから部分列２という登録番号を付与して辞書１３０に登録する。この文字列の前半部「ＡＢ」と後半部「ＡＢ」はそれぞれ部分列１と同一の文字列であるので、辞書１３０には部分列１を２つ連結した長さ４の部分列として格納する。
【００６１】
引き続いて、第５文字目と６文字目を読み込み、「ＡＢ」という部分列を認識するが、これはやはり部分列１と同一の文字列であるので、新たに登録する必要はない。
次に、第７文字目と８文字目を読み込み「ＡＡ」という部分列を認識する。これは、新しく出現した文字列であるので、部分列３という登録番号を付与して辞書１３０に登録する。辞書１３０には、文字「Ａ」を２つ連結した長さ２の部分列として格納される。
【００６２】
次に、第５文字目から８文字目までの文字列「ＡＢＡＡ」が認識される。これも初めて出現する文字列であるので、部分列４という登録番号を付与して辞書１３０に登録する。辞書１３０には部分列１と部分列３を連結した長さ４の部分列として格納される。
最後に第１文字目から８文字目までの部分列を認識し、これを部分列５という登録番号を付与して辞書１３０に登録する。辞書１３０には、部分列２と部分列４を連結した長さ８の部分列として格納される。
【００６３】
以上のようにして、入力文字データは部分列に分解されて辞書１３０に格納される。
次に、上記のように読み込まれた入力データがどのように圧縮符号化されて出力されるかを説明する。図１０は、本発明の第１の実施例の文字列データを圧縮符号化したコード列データの一例を示す。同図に示すコード列データは、図８に示す文字列データを圧縮符号化したものである。
【００６４】
圧縮符号化制御部１４０は、入力文字列データに同一の文字列からなる部分列が複数回繰り返して出現する場合には、１回目は文字列とその登録番号の情報を出力し、２回目以降は部分列を指し示す登録番号のみを出力することにより、データを圧縮符号化する。図１０では、まず、文字「Ａ」、「Ｂ」をそのまま出力し、次に、「（列１＝Ａ，Ｂ）」という符号により、「ＡＢ」が部分列１という登録番号で登録されたことを出力する。以後は、「ＡＢ」という文字列を出力する代わりに、「列１」という符号だけを出力して、データを圧縮する。
【００６５】
圧縮コード列データを出力するタイミングに関しては、入力文字列データを最後まで読み終えるのを待つことなく、新しい部分列が認識され登録される毎に、入力と同時並行的に出力が行われる。
このような圧縮符号化を行うことにより得られるコード列データの長さは、入力文字列データの中に新しく出現した部分列の個数にほぼ比例する。従って、同一の部分列が多数出現する入力文字列データを与えた場合には、非常に高い圧縮率を得ることができる。
【００６６】
辞書１３０を登録・参照するための計算時間は、ランダムアクセス可能な記憶装置を前提とすれば、ハッシュテーブルの技法により、辞書の登録データ数に依存せず、ある定数の時間で１回の登録・参照の処理を実行できる。従って、全体の計算時間は、入力文字列データの長さにほぼ比例する。
ところで、現実の装置においては、無限容量の辞書を実現することは不可能であり、ある有限の容量の辞書１３０を用いることになる。従って、大規模な文字列データを入力した場合に、辞書１３０が満杯になり、新たな部分列を格納できなくなることを考慮しなくてはならない。
【００６７】
そこで、本実施例では、辞書１３０が満杯になった場合には、これまでに格納された部分列の中で、他の部分列の前半部または後半部として使われていない部分列を探し出して、その部分列を表す情報（文字列の前半部と後半部の部分列を指し示す登録番号の組）を一旦消去し、消去した部分列と同じ登録番号で指し示される記憶領域に新しい部分列を表す情報を格納し、以降の処理を進めることとする（以後、この処理を「辞書溢れ続行処理」と呼ぶ）。
【００６８】
上記の辞書溢れ続行処理を行うことにより、過去の部分列の情報の一部が失われるため、本来は一致検出して圧縮できるはずだった部分列が検出できなくなる恐れがある。圧縮できなかった文字列は、そのまま出力されるので、圧縮率が低下することはあるが、情報が失われたり、歪んだりする心配はない。復元装置において、圧縮装置と同等以上の容量の辞書があれば、正しく元通りに復元可能であることが保証される。
【００６９】
また、現実の文字列データにおいては、距離的に近い位置にある部分列の方が複数回出現する可能性が高いと仮定すると、上記の辞書溢れ続行処理のように、使用頻度の低い部分列を消去して最新の部分列に置き換えた方が、限られた辞書容量をより有効に利用することになり、結果として圧縮率の低下を少なく抑えられると考えられる。
【００７０】
辞書溢れ続行処理を実施する際には、消去してもよい部分列を高速に発見するために、ある部分列が他の部分列から参照されている回数を数えるカウンタを、個々の部分列情報に付加し、新たに参照されたり、参照が失われたりしたときには、このカウンタの値を増減させるようにする。そして、カウンタの値が０になっている部分列を常にリストアップしておくことにより、必要なときに即座に辞書１３０を更新することができる。
【００７１】
［第２の実施例］
以下の第２の実施例は、図６に示す文字列データ復元装置２００に基づいて説明する。
当該文字列データ復元装置２００では、前述の第１の実施例における図１０に示す圧縮符号化されたコード列データが入力された場合には、図９に示す内容の辞書が復元される。
【００７２】
例えば、図９において、部分列４として辞書２３０に格納されている部分列を取り出して出力するためには、復元制御部２４０は、先ず部分列４の前半部である部分列１を参照し、さらに、部分列１の前半部を参照して「Ａ」を出力し、次に、部分列１の後半部を参照して「Ｂ」を出力し、これをもって部分列１の出力が完了する。
【００７３】
次に、部分列４に一旦戻り、その後半部である部分列３を参照し、以下同様にして「Ａ」「Ａ」を出力する。このようにして部分列４全体を出力することができる。
復元文字列データを出力するタイミングに関しては、圧縮コード列データを最後まで読み終えるのを待つことなく、入力処理部２１０または、辞書２３０から部分列を取り出す毎に、同時並行的に出力が行われる。
【００７４】
［第３の実施例］
第３の実施例として、図７に示す文字列データ演算処理装置について説明する。当該文字列データ演算処理装置は、圧縮コード列データ同士の二項演算処理を行うものであり、前述の第２の実施例の圧縮コード列復元装置２００を入力処理部３１０、３２０とし、前述の第１の実施例の文字列データ圧縮符号化装置１００を出力処理部３６０とし、入力処理部３１０、３２０から入力された２つの復元文字列データの二項演算を行う演算処理部３３０とある有限の容量の演算処理高速化辞書３５０とこれらの各構成要素を制御する制御部３４０から構成されている。
【００７５】
制御部３４０では、２つの入力処理部３１０、３２０及び出力処理部３６０から、部分列がそれぞれの有する辞書（図６の辞書２３０に相当）に格納されたり参照されたりした場合に、それぞれの登録番号の情報を取得し、ある２つの入力部分列の組に対する演算処理部３３０における二項演算の結果がどの部分列に対応したかを監視し、当該２つの部分列および二項演算の結果が対応する部分列の３つの部分列の登録番号の組が得られる毎に、演算処理高速化辞書３５０に格納する。そして、この演算処理高速化辞書３５０を参照して、同一の入力部分列の組が複数回出現したことを検出する。
【００７６】
ここで、同一の入力部分列の組を検出した場合に、２回目以降は、入力処理部３１０、３２０に対して、部分列の登録番号を文字列データに復元する処理を省略するように指令を出し、出力処理部３６０では、演算処理高速化辞書３５０に格納されている演算結果の部分列の登録番号をそのまま出力するように指令を出す。これにより、同じ演算処理を繰り返すことがないため、高速化が可能となる。
【００７７】
図１１は、本発明の第３の実施例の演算処理高速化辞書の例を示す。当該演算処理高速化辞書３５０には、２つの入力と１つの演算結果の合わせて３つの部分列の登録番号が１組として登録される。ランダムアクセス可能な記憶装置を前提とすれば、ハッシュテーブルの技法により、当該辞書３５０のエントリ数に依存せずに、ある定数時間で、１回の登録・参照を行うことができる。
【００７８】
現実の装置では、演算処理高速化辞書３５０の容量は有限なものであるが、当該辞書３５０が満杯になった場合には、古い情報を適宜消去して新しい情報を上書きする。これにより省略できるはずの部分列を検出する機会が失われ、演算処理速度が低下する恐れがある。しかし、その場合でも速度が低下するだけで演算結果のコード列データの内容には影響はない。
【００７９】
なお、本発明は、上記の実施例に限定されることなく、特許請求の範囲内で種々変更・応用が可能である。
【００８０】
【発明の効果】
上述のように、本発明によれば、圧縮符号化されたコード列データを必ずしも完全に復元せずに直接演算処理を行う装置を提供できる。これにより、データの圧縮率にほぼ比例して、処理の高速化が可能となる。
また、本発明によれば、すべての入力データを前もって記憶装置内に格納可能でない場合でも、入力された圧縮コード列データの先頭から順次、演算処理を実行し、演算結果を順次出力することが可能である。これにより、従来手法では扱えなかった大規模なデータの演算処理が行えるようになる。
【００８１】
さらに、本発明によれば、複数個の演算処理装置を直列に接続することにより、パイプライン式に並行実行して高速化を図ることができる。
【図面の簡単な説明】
【図１】本発明の原理構成図（文字列データ圧縮符号化装置）である。
【図２】本発明の原理構成図（文字列データ復元装置）である。
【図３】本発明の原理構成図（文字列データ演算処理装置）である。
【図４】本発明の文字列データ圧縮符号化装置の構成図である。
【図５】本発明の部分列の分解規則を説明するための図である。
【図６】本発明の文字列データ復元装置の構成図である。
【図７】本発明の文字列データ演算処理装置の構成図である。
【図８】本発明の第１の実施例の文字列データ圧縮符号化装置の動作を説明するための図である。
【図９】本発明の第１の実施例の文字列データの部分列を登録した辞書の一例である。
【図１０】本発明の第１の実施例の文字列データを圧縮符号化したコード列データの一例である。
【図１１】本発明の第３の実施例の演算処理高速化辞書の例である。
【図１２】Ｘ_１，Ｘ_２，Ｘ_３の３変数を入力とする論理関数データを表現する真理値表の一例である。
【図１３】論理関数データに対して入力変数に０，１を代入することにより展開して得られる二分木状のグラフの一例である。
【図１４】二分決定グラフの共有化処理を説明するための図である。
【図１５】二分決定グラフの非冗長化処理を説明するための図である。
【図１６】論理関数データを表現する二分決定グラフの一例である。
【図１７】二分決定グラフを記憶装置に格納するためにグラフの節点を列挙した表の一例である。
【符号の説明】
１００文字列データ圧縮符号化装置
１１０入力処理手段、入力処理部
１２０出力処理手段、出力処理部
１３０辞書
１４０圧縮符号化制御部
１４１検出手段
１４２出力制御手段
１５０分解規則
２００文字列データ復元装置
２１０入力処理手段、入力処理部
２２０出力処理手段、出力処理部
２３０辞書
２４０復元制御手段、復元制御部
３００文字列データ演算処理装置
３１０、３２０入力処理手段
３３０演算処理手段
３４０制御部
３５０演算処理高速化辞書
３６０出力処理手段
[0001]
[Technical field of the invention]
The present invention relates to a character string data compression encoding apparatus, a character string data decompression apparatus, and a character string data arithmetic processing apparatus. In particular, it relates to data communication and processing, and to apparatuses that compress and encode character string data so that it can be transmitted or stored efficiently and that execute arithmetic processing between compression-encoded character strings at high speed.
[0002]
[Prior art]
The Lempel-Ziv encoding method is widely known as a data compression method that encodes an input character string as a copy of a character string that appeared earlier. Many patent applications have also been filed for improvements to this method. For example, Japanese Patent No. 2590287 describes an invention that improves the Lempel-Ziv encoding method.
[0003]
In general, the Lempel-Ziv encoding method has a memory that stores (part of) the history of previously input character strings and a dictionary for registering and looking up the substrings contained in that memory. It reads the input character string data sequentially, registers substrings in the dictionary, and compresses the data by consulting the dictionary to detect identical substrings. There are two Lempel-Ziv encoding schemes: a universal type and an incremental decomposition type.
[0004]
In the universal Lempel-Ziv encoding method, when an input character string is encoded, the longest portion that matches the current character string, starting from an arbitrary position in the past input, is delimited (this portion is called a substring), and the data is compressed by specifying the substring by its position in the memory and its length.
The incremental decomposition type Lempel-Ziv encoding method encodes and registers a character string formed by appending one newly appearing character to the longest previously encoded character string, and encodes the current character string as a copy of the longest substring that appeared in the past.
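As an illustration of the incremental decomposition type, the following is a minimal Python sketch of a textbook LZ78-style parser. It is not taken from this patent or from the cited Japanese Patent No. 2590287; the function name `lz78_parse` and the `(index, character)` output format are assumptions made for the example.

```python
def lz78_parse(text):
    """Split text into phrases, each being the longest known phrase plus one new character.

    Returns a list of (dictionary index of the known prefix, new character).
    Index 0 stands for the empty phrase.
    """
    dictionary = {"": 0}   # phrase -> registration number
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            # Still matching a phrase seen before: keep extending it.
            current += ch
        else:
            # Longest known phrase + one new character becomes a new entry.
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # Input ended inside a phrase that is already registered.
        phrases.append((dictionary[current], ""))
    return phrases


codes = lz78_parse("ABABABAA")
```

For the input "ABABABAA" the phrases are A, B, AB, ABA, A, so the phrase boundaries depend on the content of the input rather than on fixed positions.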
[0005]
In both methods, inventions and improvements have been aimed at two goals: increasing the compression ratio of character string data and executing compression and decompression efficiently. That is, it is assumed that compression-encoded code string data is first returned to the original character string data through decompression and only then subjected to various arithmetic processing. Performing arithmetic processing directly on compression-encoded code string data, without decompression, is not considered in the prior art, and the codes are not designed so that such processing can be performed efficiently.
[0006]
On the other hand, as a method of compressing data, storing it in a storage device, and performing high-speed arithmetic processing on it while it remains compressed, a representation method and an operation method for logic function data using a data structure called a “binary decision diagram” (BDD) are known. The method is described in R. E. Bryant: "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Trans. Comput., Vol. C-35, No. 8, pp. 677-691 (hereinafter referred to as Paper 1). A detailed explanatory feature, “Special Issue: BDD (Binary Decision Diagrams)” (hereinafter referred to as Article 2), was also published on pp. 584-630 of the May 1993 issue of the IPSJ magazine “Information Processing”.
[0007]
Logic function data represents the information of a logic function: which logical value, 0 or 1, is output as the function value when each combination of the logical values 0 and 1 is given to a set of input variables.
Before the binary decision diagram was invented, logic function data was often represented as a “truth table”, a sequence of 0s and 1s that lists the function value for each of the 2ⁿ combinations of input variable values (n is the number of input variables). FIG. 12 is an example of a truth table representing logic function data with the three input variables X₁, X₂, X₃.
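The truth-table representation can be sketched in a few lines of Python. The function shown in FIG. 12 is not reproduced in the text, so `example_function` below is a hypothetical three-variable function chosen only for illustration, and `truth_table` is a helper name introduced here.

```python
from itertools import product


def truth_table(f, n):
    """List f's value for all 2**n input combinations, with X1 as the most significant bit."""
    return [f(*bits) for bits in product((0, 1), repeat=n)]


def example_function(x1, x2, x3):
    # Hypothetical example: (X1 AND X2) OR X3.
    return (x1 & x2) | x3


table = truth_table(example_function, 3)
```

The table always has 2ⁿ entries (8 here), regardless of how regular the function is, which is the storage cost that the binary decision diagram is meant to reduce.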
[0008]
A binary decision diagram is a method of representing logic function data using a graph, and consists of:
・means for giving a fixed order to all input variables (hereinafter referred to as ordering processing);
・means for deriving two partial functions from a logic function by substituting the logical values 0 and 1 into one of the input variables (hereinafter referred to as expansion processing);
・means for repeating the expansion processing, in the order given by the ordering processing, until no input variable remains into which 0 or 1 can be substituted, and for generating a binary tree graph as shown in FIG. 13 using intermediate nodes that represent the individual expansions, 0-branches and 1-branches that point to the two partial functions obtained by each expansion, and a 0-terminal node and a 1-terminal node that represent the logical values 0 and 1 obtained as the result of the repeated expansion (hereinafter referred to as binary tree generation processing);
・means for checking, during the binary tree generation processing, whether the 0-branches of two intermediate nodes of the binary tree point to the same node and their 1-branches also point to the same node, as shown in FIG. 14, and, when two such logically equivalent intermediate nodes are found, deleting one of them and sharing the other (hereinafter referred to as sharing processing); and
・means for removing any intermediate node whose 0-branch and 1-branch both point to the same node, as shown in FIG. 15, and connecting the branch directly (hereinafter referred to as non-redundancy processing).
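The processing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Paper 1: the names `build_bdd` and `evaluate`, the node numbering (0 and 1 for the terminal nodes, 2 and up for intermediate nodes), and the use of a truth table as input are all choices made for the example.

```python
def build_bdd(table):
    """Build a reduced BDD from a truth table of length 2**n (X1 is the most significant bit).

    Returns (root, nodes), where nodes maps node id -> (variable, 0-branch, 1-branch).
    """
    unique = {}   # (variable, 0-branch, 1-branch) -> node id, used for sharing
    nodes = {}    # node id -> (variable, 0-branch, 1-branch), i.e. the node table

    def make_node(var, low, high):
        if low == high:
            # Non-redundancy processing: both branches agree, so skip this node.
            return low
        key = (var, low, high)
        if key not in unique:
            # Sharing processing: one node per distinct (variable, branches) triple.
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    def expand(sub, var):
        # Expansion processing: the first half is variable=0, the second half is variable=1.
        if len(sub) == 1:
            return sub[0]          # terminal node 0 or 1
        half = len(sub) // 2
        return make_node(var, expand(sub[:half], var + 1), expand(sub[half:], var + 1))

    return expand(tuple(table), 1), nodes


def evaluate(root, nodes, assignment):
    """Follow branches from the root; assignment[i] is the value of variable i+1."""
    node = root
    while node > 1:
        var, low, high = nodes[node]
        node = high if assignment[var - 1] else low
    return node


# Hypothetical example: truth table of (X1 AND X2) OR X3.
root, nodes = build_bdd([0, 1, 0, 1, 0, 1, 1, 1])
```

For this hypothetical function the node table has three intermediate nodes, compared with eight truth-table entries.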
[0009]
By performing the above processes in combination, a binary decision diagram as shown in FIG. 16 is obtained. This binary decision diagram can be stored in a storage device in the form of a table listing its intermediate nodes, as shown in FIG. 17. The amount of storage required is proportional to the number of nodes in the diagram. When logic function data is represented by a truth table, storage equal to the number of combinations of input variable values (2ⁿ for n variables) is always required, whereas a binary decision diagram often needs less, and in some cases the compression ratio reaches several hundredfold. For this reason, binary decision diagrams have been widely used in recent years as a compressed storage method for logic function data.
[0010]
Paper 1 and Article 2 also describe a method for binary operations between binary decision diagrams. That is, when two BDDs are stored in a storage device, the method executes a binary logical operation (AND, OR, exclusive OR, etc.) between the logic function data that they represent, generates a new BDD representing the logic function data of the result, and stores it in the storage device.
[0011]
In this operation method, 0 and 1 are substituted into the highest-ranked variable of the two given BDDs, decomposing them into two pairs of corresponding intermediate nodes. The resulting pairs are decomposed further, and this is repeated until many pairs of terminal nodes are obtained. A terminal node representing the result of the binary operation is generated for each pair of terminal nodes, and the results are assembled back into a binary decision diagram, producing a new BDD. During this process, each pair of intermediate nodes that appears and the operation result for that pair are registered in a dictionary. When the same pair of intermediate nodes is detected again, it is not decomposed further; the intermediate node registered in the dictionary is immediately used as the operation result. This is the means by which the method is accelerated.
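The operation method described above can be sketched in Python as follows. This is an illustrative reconstruction of the general idea, not code from Paper 1 or Article 2; the class name `BDD`, its method names, and the node numbering (0 and 1 for terminals) are assumptions. The `memo` dictionary plays the role of the dictionary of node pairs and their operation results.

```python
import operator


class BDD:
    def __init__(self):
        self.nodes = {}    # node id -> (variable, 0-branch, 1-branch)
        self.unique = {}   # (variable, 0-branch, 1-branch) -> node id

    def make_node(self, var, low, high):
        if low == high:                      # non-redundancy processing
            return low
        key = (var, low, high)
        if key not in self.unique:           # sharing processing
            self.unique[key] = len(self.unique) + 2
            self.nodes[self.unique[key]] = key
        return self.unique[key]

    def variable(self, index):
        return self.make_node(index, 0, 1)

    def top(self, node):
        # Terminals sort after every variable.
        return self.nodes[node][0] if node > 1 else float("inf")

    def cofactors(self, node, var):
        # Substitute 0 and 1 into var; a node that does not test var is unchanged.
        if node > 1 and self.nodes[node][0] == var:
            return self.nodes[node][1], self.nodes[node][2]
        return node, node

    def apply(self, op, f, g, memo=None):
        memo = {} if memo is None else memo
        if f <= 1 and g <= 1:
            return op(f, g)                  # pair of terminal nodes
        if (f, g) in memo:
            return memo[(f, g)]              # same pair seen before: reuse the result
        var = min(self.top(f), self.top(g))  # highest-ranked variable of the pair
        f0, f1 = self.cofactors(f, var)
        g0, g1 = self.cofactors(g, var)
        result = self.make_node(var,
                                self.apply(op, f0, g0, memo),
                                self.apply(op, f1, g1, memo))
        memo[(f, g)] = result
        return result

    def evaluate(self, node, assignment):
        while node > 1:
            var, low, high = self.nodes[node]
            node = high if assignment[var - 1] else low
        return node


bdd = BDD()
x1, x2, x3 = bdd.variable(1), bdd.variable(2), bdd.variable(3)
conj = bdd.apply(operator.and_, x1, x2)   # X1 AND X2
func = bdd.apply(operator.or_, conj, x3)  # (X1 AND X2) OR X3
```

Because each distinct pair of nodes is decomposed at most once per call, the work is bounded by the number of node pairs reached rather than by the 2ⁿ input combinations.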
[0012]
The calculation time of this operation method is roughly proportional to the sum of the number of nodes in the two BDDs given as operands and the number of nodes in the BDD generated as the result. Therefore, compared with logical operations on the truth table format, the operation is accelerated roughly in proportion to the data compression ratio achieved by the BDD.
[0013]
[Problems to be solved by the invention]
In the Lempel-Ziv encoding method, it is assumed that the code string data that has been compression-encoded is once returned to the original character string data through the restoration processing, and then subjected to various arithmetic processing. By performing the compression encoding, the amount of transmission data or the storage area can be saved, but processing time for compression / decompression is additionally required. It is generally considered that the compression ratio and the processing time have a trade-off relationship.
[0014]
However, if arithmetic processing could be performed directly on compression-encoded code string data without decompressing it, not only would the processing time for compression and decompression become unnecessary, but it should also be possible to shorten the arithmetic processing time in proportion to the data compression ratio.
For this purpose, the way substrings are delimited during compression encoding must be designed so that arithmetic processing is easy. In the conventional Lempel-Ziv encoding method, however, substrings are delimited at different positions depending on the content of the input character string data. When a binary operation between two compressed code strings is attempted, the correspondence between their substrings therefore becomes complicated, and the operation cannot be executed efficiently on the data while it remains compressed.
[0015]
On the other hand, in the binomial logical operation processing method using the binary decision diagram, the binomial arithmetic processing can be executed in the data format as it is, and the speeding up which is almost proportional to the compression ratio is realized.
However, in the binomial logical operation processing method using the binary decision diagram, the two BDDs to be operated on are delivered to the arithmetic processing device in a state stored as node tables in the storage device, and the BDD of the operation result is also generated as a node table in the storage device. It is therefore a prerequisite that all nodes of these BDDs can be stored in the storage device. In the node-sharing and redundancy-removal processing of BDDs described in the section on the related art, all nodes are registered in and retrieved from a dictionary in the storage device in order to determine quickly whether nodes match. The capacity of the dictionary is finite, and if the BDD of the operation result does not fit in the dictionary, the processing terminates abnormally partway through and the result of the operation cannot be obtained.
[0016]
Further, in the binary operation processing method using the BDD, the calculation process cannot be started until all nodes of the two BDDs to be operated are stored in the storage device. That is, the processing of the next stage cannot be started unless the processing of the previous stage is completed. Therefore, there is a problem that continuous arithmetic processing cannot be connected in series and executed in parallel in a pipeline manner.
[0017]
The present invention has been made in view of the above points, and an object thereof is to provide a character string data compression encoding device, a character string data restoring device, and a character string data arithmetic processing device that can perform arithmetic processing directly on compression-encoded code string data without necessarily decompressing it completely, thereby enabling high-speed arithmetic processing in accordance with the data compression ratio.
Another object of the present invention is to provide a character string data compression encoding device, a character string data restoring device, and a character string data arithmetic processing device that do not start the operation only after storing all the input data in the storage device, as in the conventional operation processing method using the BDD, but instead execute the arithmetic processing while sequentially reading the input compressed code string data from the beginning, and sequentially output the operation result.
[0018]
[Means for Solving the Problems]
FIG. 1 is a diagram showing the principle configuration (compression encoding device) of the present invention.
The present invention (claim 1) is a character string data compression encoding apparatus which receives character string data of length 2ⁿ (where n is an integer constant equal to or greater than 0) as input and outputs code string data obtained by compressing the data without losing its information, the apparatus comprising:
Input processing means 110 for sequentially reading the character string data before compression from the beginning,
Output processing means 120 for sequentially outputting the code string data after compression encoding from the beginning,
A finite capacity dictionary 130;
A decomposition rule 150 that is referenced to divide input character string data of length 2ⁿ input from the input processing means 110 into two substrings of half length, a first half and a second half, and to divide each obtained substring again into a first half and a second half, thereby decomposing the data into substrings of lengths 1, 2, 4, 8, 16, ..., 2ⁿ;
Detecting means 141 which, while sequentially reading the input character string data input from the input processing means 110 with reference to the decomposition rule 150, extracts substrings of lengths 1, 2, 4, 8, 16, ..., 2ⁿ, assigns a different registration number to each different character string, and stores them in the dictionary 130 as long as its capacity permits, and which, by reading the input character string data while referring to the dictionary, detects that the same character string appears repeatedly in the input character string data; and
Output control means 142 which, when a substring consisting of the same character string appears repeatedly in the input character string data, instructs the output processing means 120 to output a set of the character string and its registration number at the first appearance, and to output only the registration number indicating the substring at the second and subsequent appearances.
[0019]
According to the present invention (claim 2), there is provided means for, when character string data of an arbitrary length is input, adding a plurality of special characters representing blanks to the end of the character string data so that the length of the input character string data becomes exactly 2ⁿ (where n is an integer constant as small as possible).
According to the present invention (claim 3), in the detection means 141,
Storage means for decomposing an arbitrary character string a having a length of 2 or more into two substrings a1 and a2 of half length, a first half and a second half, and, if the character strings of the substrings a1 and a2 are already stored in the dictionary, representing the character string a by a set of the registration numbers indicating the character strings of a1 and a2, giving that set a new registration number representing the character string a, and storing it in the dictionary 130;
In the output control means 142,
Means for, when outputting a character string a having a length of 2 or more, expressing the character string a by sequentially outputting the character strings of the two substrings a1 and a2 if neither of them is stored in the dictionary 130, and, if one or both of the substrings a1 and a2 are already stored in the dictionary 130, outputting the registration number indicating the stored character string instead of the character string itself.
[0020]
According to the present invention (claim 4), in the storage means,
There is provided means which, when the capacity of the dictionary 130 reaches its limit and a new character string x cannot be stored, searches the character strings stored so far for a character string y that has not yet been used as the first half or the second half of another character string, deletes the set of registration numbers indicating the first half and the second half that represents the character string y, stores information representing the new character string x in the storage area indicated by the same registration number as the deleted character string y, and continues the subsequent processing.
[0021]
FIG. 2 is a block diagram showing the principle of the present invention (character string data restoration device).
The present invention (claim 5) is a character string data restoring device for restoring and extracting original character string data from code string data compressed and encoded by the character string data compression encoding device,
An input processing means 210 for sequentially reading the compression-encoded code string data from the beginning,
Output processing means 220 for sequentially outputting the restored character string data from the beginning,
A dictionary 230 having a capacity equal to or greater than the dictionary of the character string data compression encoding device used when compressing the character string data,
Restoration control means 240 which, when sequentially reading the compression-encoded code string data obtained from the input processing means 210, instructs the output processing means 220 to output the character string as it is when a substring expressed as a simple character string is read, stores the character string in the dictionary 230 under the registration number when a set of a registration number and the character string of a substring is read, and instructs the output processing means 220 to extract the substring indicated by the registration number from the dictionary 230 and output it when only a registration number is read.
[0022]
According to the present invention (claim 6), in the restoration control means 240,
Means for, when a character string represented by a set of two substrings, a first half and a second half, is input while sequentially reading the compression-encoded code string data, expressing the character string by a set of the registration numbers of the first-half and second-half substrings and storing it in the dictionary 230; and
Means for, when extracting and outputting a substring stored in the dictionary 230, sequentially referring to the first-half and second-half substrings of that substring in the dictionary 230, and, if those substrings are themselves expressed as a set of a first half and a second half, referring to them in turn until substrings of length 1 are extracted and output.
[0023]
FIG. 3 is a diagram showing the principle configuration (character string data processing device) of the present invention.
The present invention (claim 7) is a character string data arithmetic processing device which, given a defined binary operation that calculates and outputs one character for any two characters, receives two character string data of the same length as input and outputs one character string data representing the result of applying the predetermined binary operation to the characters at the same position from the beginning, the device comprising:
Two input processing means 310 and 320, each being a character string data restoring device that has: first input means for sequentially reading compression-encoded code string data from the beginning; first output means for sequentially outputting restored character string data from the beginning; a first dictionary having a capacity equal to or greater than that of the dictionary of the character string data compression encoding device used when the character string data was compressed; and restoration control means which, when sequentially reading the compression-encoded code string data obtained from the first input means, instructs the first output means to output the character string as it is when a substring expressed as a short simple character string is read, stores the character string in the first dictionary under the registration number when a set of a registration number and the character string of a substring is read, and instructs the first output means to take the substring indicated by the registration number out of the first dictionary and output it when only the registration number of a substring is read; the two input processing means restoring the respective input code string data to character string data and transferring it;
Arithmetic processing means 330 for sequentially calculating the binary operation between the characters at the same position from the beginning of the two character string data transferred from the input processing means 310 and 320, and transferring the result as new character string data; and
Output processing means 360, being a character string data compression encoding device that has: second input means for sequentially reading character string data before compression from the beginning; second output means for sequentially outputting code string data after compression encoding from the beginning; a second dictionary having a finite capacity; a decomposition rule that is referenced to divide input character string data of length 2ⁿ input from the second input means into two substrings of half length, a first half and a second half, and to divide each obtained substring again into a first half and a second half, thereby decomposing the data into substrings of lengths 1, 2, 4, 8, 16, ..., 2ⁿ; detecting means which, while sequentially reading the input character string data input from the second input means with reference to the decomposition rule, extracts substrings of lengths 1, 2, 4, 8, 16, ..., 2ⁿ, assigns a different registration number to each different character string, stores them in the second dictionary as long as its capacity permits, and, by reading the input character string data while referring to the second dictionary, detects that the same character string appears repeatedly in the input character string data; and output control means which, when a substring consisting of the same character string appears repeatedly in the input character string data, instructs the second output means to output a set of the character string and its registration number at the first appearance and to output only the registration number indicating the substring at the second and subsequent appearances; the output processing means 360 compression-encoding the character string data transferred from the arithmetic processing means 330 and outputting it as code string data.
[0024]
The present invention (claim 8) further comprises: transfer means for transferring, when a substring is stored in or referred to from the first dictionary or the second dictionary in the input processing means 310, 320 and the output processing means 360, the information of the corresponding registration number;
Detecting means which acquires the information of the registration numbers from the transfer means, stores in an operation acceleration dictionary of finite capacity, as long as its capacity permits, which substring corresponds to the result of the binomial operation for a given set of two input substrings, as a set of the registration numbers of the three substrings (two input substrings and one output substring), and, while the operation processing is performed sequentially with reference to the operation acceleration dictionary, detects that the same set of input substrings has appeared more than once; and
Control means which, when the same set of input substrings appears, controls the input processing means 310 and 320 so as to omit the process of restoring the registration numbers of the substrings to character string data, and controls the output processing means 360 so as to output the registration number of the substring of the operation result stored in the operation acceleration dictionary.
[0025]
The compression encoding device of the present invention receives character string data of length 2ⁿ (where n is an integer constant of 0 or more) as input and outputs code string data obtained by compressing the data without losing its information. In this compression encoding device, the input character string data of length 2ⁿ is divided into two substrings of half length, a first half and a second half, and each obtained substring is further divided into a first half and a second half, so that the data is decomposed into a number of substrings of lengths 1, 2, 4, 8, 16, ..., 2^(n-1).
[0026]
Further, based on the above decomposition rule, while the input character string data is read sequentially, substrings of lengths 1, 2, 4, 8, 16, ..., 2^(n-1) are extracted, a different registration number is assigned to each different character string, and the strings are stored in the dictionary as long as its capacity permits. By reading the input character string data while referring to this dictionary, it can be detected that the same character string appears repeatedly in the character string data. If a substring consisting of the same character string appears repeatedly in the input character string data, a set of the character string and its registration number is output at the first appearance, and only the registration number indicating the character string is output at the second and subsequent appearances; in this way the data is compression-encoded.
[0027]
Further, in the compression encoding apparatus of the present invention, the restriction on the length of the input character string data is removed: a plurality of special characters representing blanks are added to the end of the character string data so that its length becomes exactly 2ⁿ (where n is an integer constant as small as possible), and the padded data is compression-encoded.
Further, in the compression encoding apparatus of the present invention, the substrings of lengths 1, 2, 4, 8, 16, ..., 2^(n-1) are not enumerated and stored in the dictionary as they are, but are stored after being decomposed into two substrings, a first half and a second half. That is, a character string a having a length of 2 or more is decomposed into two substrings a1 and a2 of half length, a first half and a second half, and when the character strings a1 and a2 are already stored in the dictionary, the character string a is represented by the set of registration numbers indicating these two character strings, and this set is given a new registration number representing the character string a and stored in the dictionary. As a result, every substring having a length of 2 or more is represented by a set of the registration numbers indicating its first-half and second-half substrings.
[0028]
When a character string a having a length of 2 or more is output, the character string a is expressed by sequentially outputting the two character strings a1 and a2 of its first half and second half if neither is stored in the dictionary. Otherwise (that is, if one or both of the strings a1 and a2 are already stored in the dictionary), the registration number of the stored character string is output instead of the character string itself. In this way the data can be compressed and encoded.
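The storage and output rules above can be sketched as follows. This is a minimal illustration, not the encoding defined by the patent: the token format (`("lit", c)` for a newly registered character, `("ref", k)` for an already registered substring, `("pair",)` to register the two preceding substrings as one new entry), the function name `compress`, and the unlimited dictionary capacity are all assumptions made for the example.

```python
def compress(text):
    """Encode a string whose length is a power of two by recursive halving.

    Hypothetical token stream (an assumption of this sketch):
      ("lit", c)  - new character c, implicitly given the next registration number
      ("ref", k)  - a substring already registered under number k
      ("pair",)   - register the two preceding substrings as one new entry
    """
    if not text or len(text) & (len(text) - 1):
        raise ValueError("length must be a power of two")

    # key: a character, or a (left_number, right_number) pair -> registration number
    table = {}

    def visit(start, length):
        if length == 1:
            key = text[start]
            if key in table:
                return table[key], [("ref", table[key])]
            table[key] = len(table)
            return table[key], [("lit", key)]
        half = length // 2
        left, left_tokens = visit(start, half)
        right, right_tokens = visit(start + half, half)
        key = (left, right)
        if key in table:
            # Repeated substring: only its registration number is emitted.
            return table[key], [("ref", table[key])]
        table[key] = len(table)
        return table[key], left_tokens + right_tokens + [("pair",)]

    return visit(0, len(text))[1]
```

For example, `compress("abab")` registers `a`, `b`, and `ab` on the first pass and then emits a single reference for the second `ab`.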
[0029]
In the compression encoding apparatus of the present invention, if the capacity of the dictionary reaches its limit and a new character string x cannot be stored while the character string data is being decomposed into substrings and stored in the dictionary, a character string y that has never been used as the first half or the second half of another character string is searched for among the character strings stored so far, the information representing the character string y (the set of registration numbers indicating its first half and second half) is deleted, information representing the new character string x is stored in the storage area indicated by the same registration number as the deleted character string y, and the subsequent processing proceeds.
[0030]
When sequentially reading code string data compressed and encoded by the data compression encoding device, the character string data restoration device of the present invention outputs the character string as it is when it reads a substring represented by a simple character string; stores the character string in the dictionary under the registration number when it reads a set of a registration number and the character string of a substring; and, when it reads only a registration number, extracts the substring indicated by that registration number from the dictionary and outputs it. In this way the character string data is restored.
[0031]
Further, the character string data restoring device of the present invention restores data that was compression-encoded by decomposing the character string data into two substrings, a first half and a second half. That is, when a character string represented by a set of two substrings of the first half and the second half is input while the compression-encoded code string data is read sequentially, the character string is represented by the set of registration numbers of the first-half and second-half substrings and stored in the dictionary. Conversely, when a substring registered in the dictionary is extracted and output, its first half and second half are referred to in the dictionary in order; if these substrings are themselves expressed as a pair of a first half and a second half, the reference is repeated until substrings of length 1 are finally extracted and output, so that the entire substring is output.
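The restoration procedure can be sketched as below. The token format is an assumption of this example rather than the patent's code format: `("lit", c)` carries a new character that receives the next registration number, `("ref", k)` refers to an already registered substring, and `("pair",)` registers the two most recent substrings as a single entry expressed by their registration numbers. The name `decompress` and the unbounded dictionary are likewise hypothetical.

```python
def decompress(tokens):
    """Restore a string from the hypothetical lit/ref/pair token stream."""
    entries = []  # registration number -> character or (left_number, right_number)
    stack = []    # numbers of substrings read so far and not yet paired
    out = []      # restored text, produced in input order

    def expand(number):
        # Follow first-half / second-half references down to length-1 strings.
        entry = entries[number]
        if isinstance(entry, str):
            return entry
        return expand(entry[0]) + expand(entry[1])

    for token in tokens:
        kind = token[0]
        if kind == "lit":
            entries.append(token[1])
            stack.append(len(entries) - 1)
            out.append(token[1])
        elif kind == "ref":
            stack.append(token[1])
            out.append(expand(token[1]))
        elif kind == "pair":
            right = stack.pop()
            left = stack.pop()
            entries.append((left, right))
            stack.append(len(entries) - 1)
        else:
            raise ValueError(f"unknown token: {token!r}")
    return "".join(out)
```

Because entries are numbered in the order they are registered, the decoder rebuilds the same numbering an encoder using this format would have assigned, without the numbers being transmitted for new entries.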
[0032]
The character string data arithmetic processing device of the present invention sequentially reads two code string data compressed by the data compression encoding device, calculates, for the two character string data that would be obtained by restoring these compressed code strings, one character string data representing the result of performing a predetermined binary operation on the characters at the same position from the beginning, and sequentially outputs the result in a state compressed by the above compression method. Specifically, the two input processing means each restore the input code string data to character string data and transfer it to the arithmetic processing means; the arithmetic processing means sequentially calculates the binary operation between the characters at the same position from the beginning of the two transferred character string data and transfers the result to the output processing means as new character string data; and the output processing means compression-encodes the character string data transferred from the arithmetic processing means and outputs it as code string data.
[0033]
Further, the character string data arithmetic processing device of the present invention is provided with detecting means, control means, and an operation acceleration dictionary of finite capacity. Each time the two input processing means and the output processing means store a substring in, or refer to a substring from, their dictionaries, the information of the registration number is transferred to the detecting means. The detecting means stores in the operation acceleration dictionary, as long as its capacity permits, which substring corresponds to the result of the binomial operation for a given set of two input substrings, as a set of the registration numbers of these three substrings, and, by performing the operation sequentially while referring to this dictionary, detects that the same set of input substrings has appeared a plurality of times. When the same set of input substrings appears, the control means omits the process of restoring the registration numbers of the substrings to character string data in the input processing means, and has the output processing means output the registration number of the substring of the operation result stored in the operation acceleration dictionary as it is, thereby speeding up the operation.
[0034]
As described above, in the present invention, instead of cutting out substrings at positions that depend on the contents of the input character string data as in the Lempel-Ziv encoding method, a decomposition method is used in which the character string data is divided into two parts, a first half and a second half, each substring is further divided into a first half and a second half, and this is repeated to obtain substrings of lengths 1, 2, 4, 8, 16, ..., 2^(n-1). Then, while the input data is read sequentially from the head, the substrings are extracted, converted into a data structure similar to a BDD, and registered in the dictionary.
[0035]
Adopting such a fixed decomposition method may reduce the chance of detecting matching substrings, and hence the compression ratio, compared with a case where break positions can be chosen freely. However, by fixing the break positions as described above, positions from the beginning of the data that correspond to multiples of 2^k (1 < k < n) are always substring boundaries regardless of the content of the character string data, so the correspondence between the substrings of a plurality of compressed code strings is simplified.
[0036]
Therefore, in a process that takes two compressed code strings as input and outputs the binary operation result of their character strings as a new compressed code string, which substring of the output code string corresponds to the result of the binary operation for a given set of two input substrings is stored in the operation acceleration dictionary, as a set of the registration numbers of these three substrings, as far as the capacity permits. By performing the operation sequentially with reference to this dictionary, it can be detected that the same set of input substrings appears multiple times. At the second and subsequent appearances, only the dictionary registration number indicating the substring of the operation result is output, which saves repeating the same operation and increases the speed.
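The caching idea can be illustrated with a small sketch. To isolate the role of the operation acceleration dictionary, this example holds both input dictionaries fully in memory rather than reading code strings sequentially, which is a simplification and not the streaming arrangement described in the text. The representation (a list mapping each registration number to a character or to a pair of numbers) and the names `apply_op` and `expand` are assumptions of the example.

```python
def expand(entries, number):
    """Restore the string registered under `number` (helper for checking results)."""
    entry = entries[number]
    if isinstance(entry, str):
        return entry
    return expand(entries, entry[0]) + expand(entries, entry[1])


def apply_op(op, dict_a, root_a, dict_b, root_b):
    """Apply a character-wise binary operation to two equal-length compressed strings.

    dict_a / dict_b: registration number -> character or (left, right) numbers.
    Returns (output_entries, output_root) in the same representation.
    """
    out_entries = []  # output dictionary
    out_index = {}    # entry -> number, so identical output substrings are shared
    cache = {}        # (number_a, number_b) -> output number: the acceleration dictionary

    def register(entry):
        if entry not in out_index:
            out_index[entry] = len(out_entries)
            out_entries.append(entry)
        return out_index[entry]

    def go(a, b):
        if (a, b) in cache:
            return cache[(a, b)]  # same input pair seen before: reuse the result
        ea, eb = dict_a[a], dict_b[b]
        if isinstance(ea, str):
            result = register(op(ea, eb))
        else:
            result = register((go(ea[0], eb[0]), go(ea[1], eb[1])))
        cache[(a, b)] = result
        return result

    return out_entries, go(root_a, root_b)
```

With `"abab"` and `"xxxx"` as inputs, the pair of substrings (`ab`, `xx`) occurs twice, so the character-level operation runs only twice instead of four times.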
[0037]
The high-speed binomial operation method for compressed code strings used in the present invention is based on the same principle as the conventional high-speed binomial logical operation method using a BDD. However, whereas the conventional method is given, as input data, a table of all nodes of the BDDs already stored in a storage device, the present invention differs in that it sequentially reads compression-encoded code string data as input and, at the same time, sequentially outputs the operation result.
[0038]
Further, in the present invention, when the capacity of the dictionary reaches its limit during the compression or decompression of a character string and a new character string x cannot be stored, a character string y that has never been used as the first half or the second half of another character string is searched for among the stored character strings, the information representing the character string y (the registration numbers indicating the character strings of its first half and second half) is deleted, the information representing the new character string x is stored in the storage area indicated by the same registration number as the character string y, and the subsequent processing can be continued. Deleting the information of the character string y may reduce the chance of detecting matching substrings and lower the compression ratio, but it does not cause the information of the character string data to be lost or distorted. Therefore, the arithmetic processing can still be performed correctly.
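The replacement rule can be sketched as a small dictionary class. This is only an illustration under stated assumptions: the class name `FiniteDictionary`, the use counters, the choice of the first eligible entry as y, and the extra rule that the halves of the entry being stored are never chosen as y are all hypothetical, and the sketch does not show how an encoder and a decoder would keep their dictionaries in step after a replacement.

```python
class FiniteDictionary:
    """Finite-capacity dictionary that reuses the slot of an unreferenced entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entry = {}   # registration number -> character or (left, right) numbers
        self.number = {}  # character or pair -> registration number
        self.uses = {}    # registration number -> how many entries use it as a half

    def store(self, key):
        """Return the registration number for key, or None if nothing can be replaced."""
        if key in self.number:
            return self.number[key]
        if len(self.entry) < self.capacity:
            slot = len(self.entry)
        else:
            halves = key if isinstance(key, tuple) else ()
            # Look for a string y that is not the first or second half of another entry.
            slot = next(
                (i for i in self.entry if self.uses[i] == 0 and i not in halves),
                None,
            )
            if slot is None:
                return None
            old = self.entry[slot]
            del self.number[old]
            if isinstance(old, tuple):
                for half in old:
                    self.uses[half] -= 1
        # The new string x takes over the registration number of the deleted y.
        self.entry[slot] = key
        self.number[key] = slot
        self.uses[slot] = 0
        if isinstance(key, tuple):
            for half in key:
                self.uses[half] += 1
        return slot
```

With a capacity of 3 holding `a`, `b`, and the pair for `ab`, storing a new character replaces the pair entry, since it is the only one not used as a half of another entry.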
[0039]
That is, in the present invention, even when the capacity of the storage device is not sufficient, the arithmetic processing can be performed at a compression ratio and a speed-up ratio commensurate with that capacity. In the conventional binary operation method using the BDD, by contrast, the node table of the BDD stored in the storage device is delivered as input data, so handling large-scale data that exceeds the capacity of the storage device is impossible in principle.
[0040]
Furthermore, in the present invention, the compression-encoded code string data is read sequentially and the code string data of the operation result is output sequentially at the same time as the input, so a plurality of processing stages can be connected in series and executed in parallel in a pipeline manner. In the conventional binary operation processing method using the BDD, the processing of the next stage cannot be started until the processing of the previous stage is completed. The present invention can be said to be superior to the conventional method in this respect as well.
[0041]
[Best Mode for Carrying Out the Invention]
1. String data compression encoding device:
First, the character string data compression encoding apparatus will be described.
FIG. 4 shows the configuration of the character string data compression encoding apparatus of the present invention.
The character string data compression encoding device 100 shown in the figure receives character string data of length 2ⁿ (where n is an integer constant with n ≧ 0) as input, and outputs code string data obtained by compressing the data without losing its information.
[0042]
The character string data compression encoding apparatus 100 includes an input processing unit 110 that sequentially reads character string data before compression from the beginning, an output processing unit 120 that sequentially outputs code string data after compression encoding from the beginning, a compression encoding control unit 140 that controls them, and a decomposition rule 150.
In the character string data compression encoding apparatus 100, the length of the input character string data is assumed to be exactly 2ⁿ (n being an integer with n ≧ 0). If character string data of any other length is input, the input processing unit 110 adds a plurality of special characters representing blanks to the end of the character string data so that its length becomes exactly 2ⁿ (where n is an integer constant as small as possible), and sends the result to the compression encoding control unit 140.
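The padding step can be sketched as follows. The function name `pad_to_power_of_two` and the choice of the NUL character as the default blank are assumptions of this example; the text only requires some special character representing a blank.

```python
def pad_to_power_of_two(text, blank="\0"):
    """Append blank characters until the length is the smallest power of two
    that is not less than the original length (an empty input becomes length 1)."""
    target = 1
    while target < len(text):
        target *= 2
    return text + blank * (target - len(text))
```

For instance, a 5-character input is extended to 8 characters, while a 4-character input is returned unchanged.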
[0043]
The compression encoding control unit 140 decomposes the character string data transferred from the input processing unit 110 into a number of subsequences and stores them in the dictionary 130. The break positions of the subsequences follow a fixed decomposition rule 150 regardless of the content of the character string data. That is, input character string data of length 2ⁿ is divided into two substrings of half length, a first half and a second half, and each obtained substring is further divided into a first half and a second half, so that the data is decomposed into a number of subsequences of lengths 1, 2, 4, 8, 16, ..., 2ⁿ. FIG. 5 shows an example in which character string data is decomposed into subsequences according to the present invention, for character string data of length 8.
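The fixed decomposition rule can be sketched as a recursive halving. The generator name `decompose` and the order in which subsequences are produced (each subsequence is yielded once both of its halves have been produced, which matches reading the data from the beginning) are choices made for this illustration.

```python
def decompose(text, start=0, length=None):
    """Yield every subsequence obtained by repeatedly halving `text`.

    The length of `text` must be a power of two. A subsequence is yielded
    after both of its halves, so shorter pieces near the start come first.
    """
    if length is None:
        length = len(text)
        if length == 0 or length & (length - 1):
            raise ValueError("length must be a power of two")
    if length > 1:
        half = length // 2
        yield from decompose(text, start, half)
        yield from decompose(text, start + half, half)
    yield text[start:start + length]
```

A string of length 8 yields 15 subsequences in total: eight of length 1, four of length 2, two of length 4, and the whole string.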
[0044]
The compression encoding control unit 140 sequentially reads the input character string data from the beginning based on the decomposition rule 150 described above, extracts subsequences of lengths 1, 2, 4, 8, 16, ..., 2ⁿ, assigns a different registration number to each different character string, and stores them in the dictionary 130 as long as its capacity allows. By reading the input character string data while referring to the dictionary, it detects that the same character string appears repeatedly in the character string data.
[0045]
2. String data recovery device:
Next, a character string data restoration device will be described.
FIG. 6 shows the configuration of the character string data restoration device of the present invention.
The character string data restoration device 200 shown in FIG. 6 restores code string data compressed and encoded by the character string data compression encoding device 100 to the original character string.
[0046]
The character string data restoration device 200 is composed of an input processing unit 210 that sequentially reads compression-encoded code string data from the beginning, an output processing unit 220 that sequentially outputs the restored character string data from the beginning, a dictionary 230 having a capacity equal to or greater than that of the dictionary 130 of the character string data compression encoding device 100 used when the character string data was compressed, and a restoration control unit 240 that controls them.
[0047]
When sequentially reading the compression-encoded code string data, the restoration control unit 240 behaves as follows. When it reads a subsequence represented as a simple character string, it outputs that character string as it is from the output processing unit 220. When it reads a set consisting of the registration number of a subsequence and its character string, it stores the subsequence in the dictionary 230 under that registration number. When it reads only the registration number of a subsequence, it extracts the subsequence indicated by that number from the dictionary 230 and outputs it. In this way the character string data is restored.
[0048]
Further, when sequentially reading the compression-encoded code string data, if a character string represented by a pair of two subsequences, a first half and a second half, is input, the restoration control unit 240 represents that character string by the pair of registration numbers of the first-half and second-half subsequences and stores it in the dictionary 230. As a result, a dictionary having exactly the same contents as the dictionary 130 created internally during compression encoding by the character string data compression encoding apparatus 100 is reconstructed.
[0049]
Further, when extracting and outputting a subsequence stored in the dictionary 230, the restoration control unit 240 refers in order to the first-half and second-half subsequences of that subsequence in the dictionary 230. If those subsequences are themselves represented as pairs of a first half and a second half, it refers to them in turn, until subsequences of length 1 are finally extracted and output. In this way the entire subsequence is output.
[0050]
3. Character string data processing unit:
Next, the character string data processing device will be described.
FIG. 7 shows the configuration of the character string data processing device of the present invention.
The character string data arithmetic processing device 300 shown in FIG. 7 comprises two compressed code string restoration devices (of the same configuration as the character string data restoration device 200 described above) as input processing units 310 and 320, a compression encoding device (of the same configuration as the character string data compression encoding apparatus 100 described above) as an output processing unit 360, an arithmetic processing unit 330 that performs a binary operation, an arithmetic processing acceleration dictionary 350 having a finite capacity, and a control unit 340 that controls them.
[0051]
Given a binary operation that computes one output character from any two characters (for example, a logical operation such as logical product or logical sum, or an arithmetic operation such as addition, subtraction, multiplication, or division), this device sequentially reads two code string data compressed by the encoding method described above. For the two character string data that would be obtained by restoring those compressed code strings, it computes one character string data representing the result of applying the binary operation to the characters at the same position from the beginning, and sequentially outputs this result in a state compressed by the same encoding method.
[0052]
In the two input processing units 310 and 320, the compressed code string data is sequentially read from the beginning, restored to character string data, and transferred to the arithmetic processing unit 330. The arithmetic processing unit 330 sequentially performs the binary operation between characters at the same position from the beginning of the two transferred character string data, and transfers the result to the output processing unit 360 as new character string data. In the output processing unit 360, the character string data transferred from the arithmetic processing unit 330 is compression-encoded and output as code string data.
[0053]
However, simply restoring the character string data, sequentially calculating the data from the beginning, and compressing the data again requires a calculation time proportional to the length of the restored character string data, which is not efficient. Therefore, the present invention includes means for speeding up the arithmetic processing as described below.
The two input processing units 310 and 320 and the output processing unit 360 report the relevant registration numbers to the control unit 340 whenever subsequences are stored in, or referred to from, their respective dictionaries.
[0054]
The control unit 340 constantly monitors which subsequence corresponds to the result of the binary operation on a given pair of input subsequences. Each time a set of the registration numbers of these three subsequences is obtained, it stores the set in the arithmetic processing acceleration dictionary 350 as long as the capacity permits. By performing the calculation sequentially while referring to the arithmetic processing acceleration dictionary 350, it detects that the same pair of input subsequences appears more than once.
[0055]
Further, when the control unit 340 detects the same pair of input subsequences, from the second time on it sends a control command to the input processing units 310 and 320 to omit the process of restoring the registration numbers of the subsequences into character string data, and sends a control command to the output processing unit 360 to output directly the registration number of the result subsequence stored in the arithmetic processing acceleration dictionary 350.
[0056]
According to the above method, when many identical pairs of input subsequences appear, the overall processing time can be reduced compared with simply restoring, calculating, and recompressing. The degree of speed-up is determined by how frequently the same pair of input subsequences appears. When the capacity of the arithmetic processing acceleration dictionary 350 is sufficient, a speed-up roughly proportional to the compression ratio of the input and output code string data is obtained.
[0057]
【Example】
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
A first embodiment will be described based on a character string data compression encoding apparatus 100 shown in FIG.
[0058]
FIG. 8 is a diagram for explaining the operation of the character string data compression encoding apparatus according to the first embodiment of the present invention, and illustrates, as an example, the operation when the character string data "ABABABAA" of length 8 is input.
FIG. 9 shows an example of a dictionary in which substrings of character string data according to the first embodiment of the present invention are registered. Hereinafter, a description will be given based on the examples of FIGS. 8 and 9.
[0059]
The character string is read by the input processing unit 110 character by character from the left. First, the character "A" of length 1 is recognized. Next, the character "B" of length 1 is recognized. Here, since the subsequence "AB" appears for the first time, it is given the registration number of subsequence 1 and registered in the dictionary 130. The dictionary stores it as a character string of length 2 in which the characters "A" and "B" are connected.
[0060]
Next, when the third character and the fourth character are read, the substring "AB" is recognized again. Since this is the same character string as substring 1 already registered, there is no need to newly register it.
Next, the subsequence "ABAB" from the first character to the fourth character is recognized. Since this is a new character string, it is given the registration number of subsequence 2 and registered in the dictionary 130. Since the first half "AB" and the second half "AB" of this character string are each the same character string as subsequence 1, the dictionary 130 stores it as a subsequence of length 4 in which two copies of subsequence 1 are connected.
[0061]
Subsequently, the fifth character and the sixth character are read, and the substring "AB" is recognized. However, since this is the same character string as the substring 1, there is no need to newly register it.
Next, the seventh and eighth characters are read, and the subsequence "AA" is recognized. Since this is a newly appearing character string, it is given the registration number of subsequence 3 and registered in the dictionary 130. It is stored in the dictionary 130 as a subsequence of length 2 in which two characters "A" are connected.
[0062]
Next, the character string "ABAA" from the fifth character to the eighth character is recognized. Since this is also a character string that appears for the first time, it is given the registration number of subsequence 4 and registered in the dictionary 130. It is stored in the dictionary 130 as a subsequence of length 4 in which subsequence 1 and subsequence 3 are connected.
Finally, the subsequence from the first character to the eighth character is recognized, and it is given the registration number of subsequence 5 and registered in the dictionary 130. It is stored in the dictionary 130 as a subsequence of length 8 in which subsequences 2 and 4 are connected.
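The registration order walked through above can be reproduced with a short sketch. It is not the apparatus itself: the function name `build_dictionary`, the recursive formulation, and the use of a Python dict keyed by the pair (first half, second half) are assumptions for illustration. Characters are kept as `str` and registration numbers as `int`, so the two cannot be confused, and the dictionary is treated as unbounded here.

```python
def build_dictionary(text: str):
    """Register every distinct block of the halving decomposition of `text`.

    Returns (root, table) where `table` maps (first_half, second_half) to a
    registration number and `root` identifies the whole string.
    """
    table = {}

    def visit(lo: int, hi: int):
        if hi - lo == 1:
            return text[lo]                  # a length-1 block is the character itself
        mid = (lo + hi) // 2
        key = (visit(lo, mid), visit(mid, hi))
        if key not in table:                 # first appearance: assign the next number
            table[key] = len(table) + 1
        return table[key]

    return visit(0, len(text)), table


root, table = build_dictionary("ABABABAA")
```

Visiting the first half before the second half completes each block in the same order as a left-to-right reader, so for "ABABABAA" the five entries come out numbered as in the description: 1 = (A, B), 2 = (1, 1), 3 = (A, A), 4 = (1, 3), 5 = (2, 4).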
[0063]
As described above, the input character data is decomposed into substrings and stored in the dictionary 130.
Next, how the input data read as described above is compression-encoded and output will be described. FIG. 10 shows an example of code string data obtained by compressing and encoding character string data according to the first embodiment of the present invention. The code string data shown in FIG. 10 is obtained by compressing and encoding the character string data shown in FIG. 8.
[0064]
When a subsequence consisting of the same character string appears repeatedly in the input character string data, the compression encoding control unit 140 outputs the character string together with its registration number the first time, and compresses the data by outputting only the registration number indicating the subsequence the second and subsequent times. In FIG. 10, the characters "A" and "B" are first output as they are, and then the code "(column 1 = A, B)" is output to indicate that "AB" has been registered under the registration number "subsequence 1". Thereafter, only the code "column 1" is output instead of the character string "AB", which compresses the data.
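The "define once, then refer by number" idea can be sketched as a token stream. The token layout below (`("char", c)`, `("def", k)`, `("ref", k)`) is a hypothetical format chosen for this illustration; it is not the code layout of FIG. 10, which is not reproduced in the text. Here `("def", k)` means "the two preceding items form subsequence k".

```python
def encode(text: str):
    """Return a token list in which each distinct block is spelled out once
    and every later occurrence is replaced by a single ("ref", k) token."""
    table = {}                                   # (first, second) -> number

    def visit(lo: int, hi: int):
        if hi - lo == 1:
            return text[lo], [("char", text[lo])]
        mid = (lo + hi) // 2
        left, left_tokens = visit(lo, mid)
        right, right_tokens = visit(mid, hi)
        key = (left, right)
        if key in table:                         # seen before: number only
            return table[key], [("ref", table[key])]
        table[key] = len(table) + 1              # first time: contents + definition
        return table[key], left_tokens + right_tokens + [("def", table[key])]

    return visit(0, len(text))[1]


tokens = encode("ABABABAA")
```

Under this assumed format a decoder can rebuild the same dictionary with a stack: push on `char` and `ref`, and on `def` pop two items, register the pair, and push its number. The number of tokens grows with the number of newly appearing subsequences rather than with the input length.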
[0065]
Regarding the timing of outputting the compressed code string data, the output is performed simultaneously and in parallel with the input each time a new partial string is recognized and registered without waiting for the input character string data to be completely read.
The length of the code string data obtained by performing such compression encoding is substantially proportional to the number of substrings newly appearing in the input character string data. Therefore, when input character string data in which the same subsequence appears many times is given, a very high compression ratio can be obtained.
[0066]
The calculation time for registering in and referring to the dictionary 130 does not depend on the number of entries registered in the dictionary: assuming a randomly accessible storage device, a hash table technique allows each registration or reference to be executed in a roughly constant time. Therefore, the total calculation time is almost proportional to the length of the input character string data.
By the way, in an actual device, it is impossible to realize a dictionary having an infinite capacity, and a dictionary 130 having a certain finite capacity is used. Therefore, it is necessary to take into consideration that when large-scale character string data is input, the dictionary 130 becomes full and a new subsequence cannot be stored.
[0067]
Therefore, in the present embodiment, when the dictionary 130 is full, the subsequences stored so far are searched for one that is not yet used as the first half or the second half of another subsequence. The information representing that subsequence (the set of registration numbers indicating the first half and the second half of the character string) is first deleted, the information representing the new subsequence is stored in the storage area indicated by the same registration number as the deleted subsequence, and the subsequent processing proceeds (hereinafter, this processing is referred to as "dictionary overflow continuation processing").
[0068]
When the above dictionary overflow continuation processing is performed, part of the information on past subsequences is lost, so a subsequence whose recurrence could otherwise have been detected and compressed may go undetected. A character string that could not be compressed is output as it is, so the compression ratio may fall, but there is no risk that information is lost or distorted. If the restoration device has a dictionary with a capacity equal to or larger than that of the compression device, it is guaranteed that the data can be correctly reconstructed and restored.
[0069]
Further, if it is assumed that in actual character string data a subsequence is more likely to reappear at a nearby position than at a distant one, then erasing an infrequently used subsequence and replacing it with the most recent one, as in the dictionary overflow continuation processing above, can be expected to use the limited dictionary capacity more efficiently and, as a result, to limit the drop in the compression ratio.
[0070]
When performing the dictionary overflow continuation process, in order to quickly find a subsequence that may be deleted, a counter that counts the number of times a certain subsequence is referred to by other subsequences is provided for each subsequence information. When a new reference is made or a reference is lost, the value of this counter is increased or decreased. By always listing the subsequences in which the counter value is 0, the dictionary 130 can be updated immediately when necessary.
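The counter-based bookkeeping can be sketched as follows. The class and attribute names are hypothetical, and two details are assumptions of this sketch rather than statements of the specification: the victim is the smallest-numbered unreferenced entry, and the two halves of the entry being registered are never chosen as victims (they are unreferenced until the new entry exists).

```python
class BoundedDictionary:
    """Finite dictionary of (first_half, second_half) pairs with reference counts."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pair_of = {}       # registration number -> (first, second)
        self.number_of = {}     # (first, second) -> registration number
        self.refcount = {}      # number -> how many entries use it as a half
        self.free = set()       # numbers whose counter is currently 0

    def _adjust(self, half, delta: int) -> None:
        if isinstance(half, int):            # plain characters (str) are not counted
            self.refcount[half] += delta
            if self.refcount[half] == 0:
                self.free.add(half)
            else:
                self.free.discard(half)

    def register(self, first, second):
        """Return the number for (first, second), evicting if full.
        Returns None when nothing can be evicted (caller outputs uncompressed)."""
        key = (first, second)
        if key in self.number_of:
            return self.number_of[key]
        if len(self.pair_of) < self.capacity:
            number = len(self.pair_of) + 1
        else:
            candidates = self.free - {first, second}
            if not candidates:
                return None
            number = min(candidates)         # assumption: any unreferenced entry will do
            old = self.pair_of.pop(number)
            del self.number_of[old]
            self.free.discard(number)
            self._adjust(old[0], -1)         # the deleted entry no longer refers
            self._adjust(old[1], -1)         # to its two halves
        self.pair_of[number] = key
        self.number_of[key] = number
        self.refcount[number] = 0
        self.free.add(number)
        self._adjust(first, +1)
        self._adjust(second, +1)
        return number
```

With a capacity of 3 and the entries of the running example, registering (1, 3) after (A, B), (1, 1), and (A, A) overwrites entry 2, because entry 1 is still referenced and entry 3 is about to be.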
[0071]
[Second embodiment]
The following second embodiment will be described based on the character string data restoration device 200 shown in FIG.
In the character string data restoration apparatus 200, when the compression-encoded code string data shown in FIG. 10 in the first embodiment is input, the dictionary shown in FIG. 9 is restored.
[0072]
For example, in FIG. 9, in order to extract and output a subsequence stored in the dictionary 230 as the subsequence 4, the restoration control unit 240 first refers to the subsequence 1 which is the first half of the subsequence 4, Further, "A" is output with reference to the first half of the partial sequence 1, and then "B" is output with reference to the second half of the partial sequence 1. This completes the output of the partial sequence 1.
[0073]
Next, the process returns to the sub-sequence 4 once, refers to the sub-sequence 3 which is the latter half, and outputs “A” and “A” in the same manner. In this way, the entire partial sequence 4 can be output.
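The traversal just described is a depth-first expansion, which might be sketched as below. The dictionary literal is reconstructed from the description of FIG. 9 in the first embodiment; the function name `expand` and the number-to-pair layout are illustrative assumptions. A generator is used so that characters are produced one at a time, in line with the sequential output described for the device.

```python
# Entries as described for FIG. 9: number -> (first half, second half).
FIG9_DICTIONARY = {
    1: ("A", "B"),
    2: (1, 1),
    3: ("A", "A"),
    4: (1, 3),
    5: (2, 4),
}


def expand(item, dictionary):
    """Yield the characters of `item`, which is either a single character (str)
    or a registration number (int) looked up in `dictionary`."""
    if isinstance(item, str):
        yield item
    else:
        first, second = dictionary[item]
        yield from expand(first, dictionary)    # first half, then
        yield from expand(second, dictionary)   # return and do the second half


subsequence_4 = "".join(expand(4, FIG9_DICTIONARY))
```

Expanding entry 4 yields "ABAA", and expanding entry 5 yields the full input "ABABABAA".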
Regarding the timing of outputting the restored character string data, output is performed concurrently with input each time a subsequence is extracted from the input processing unit 210 or the dictionary 230, without waiting for the compressed code string data to be read to the end.
[0074]
[Third embodiment]
As a third embodiment, the character string data processing device shown in FIG. 7 will be described. The character string data processing device performs a binary operation between compressed code string data. It uses the character string data restoration device 200 of the second embodiment as the input processing units 310 and 320 and the character string data compression encoding device 100 of the first embodiment as the output processing unit 360, and further comprises an arithmetic processing unit 330 that performs a binary operation on the two restored character string data input from the input processing units 310 and 320, and a control unit 340 that controls each of these components.
[0075]
The control unit 340 receives the registration numbers whenever the two input processing units 310 and 320 and the output processing unit 360 store subsequences in, or refer to them from, their respective dictionaries (corresponding to the dictionary 230 in FIG. 6), and monitors which subsequence corresponds to the result of the binary operation in the arithmetic processing unit 330 for a given pair of input subsequences. Each time a set of three registration numbers (the two input subsequences and the corresponding result subsequence) is obtained, it is stored in the arithmetic processing acceleration dictionary 350. By referring to the arithmetic processing acceleration dictionary 350, the control unit detects that the same pair of input subsequences appears more than once.
[0076]
When the same pair of input subsequences is detected, from the second time on the control unit instructs the input processing units 310 and 320 to omit the process of restoring the registration numbers of the subsequences to character string data, and instructs the output processing unit 360 to output, as it is, the registration number of the result subsequence stored in the arithmetic processing acceleration dictionary 350. Since the same arithmetic processing is thereby not repeated, the speed can be increased.
[0077]
FIG. 11 shows an example of an arithmetic processing acceleration dictionary according to the third embodiment of the present invention. In the arithmetic processing acceleration dictionary 350, the registration numbers of three subsequences, two inputs and one operation result, are registered as one set. Assuming a randomly accessible storage device, the hash table technique allows one registration or reference to be performed in a roughly constant time, independent of the number of entries in the dictionary 350.
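The role of the acceleration dictionary can be illustrated with a recursive sketch in which a cache keyed by the pair of input registration numbers stores the registration number of the result. This is a simplified model, not the streaming control flow of the device: all names are hypothetical, the dictionaries are unbounded, and both inputs are assumed to have the same power-of-two length so that their halving decompositions line up.

```python
def compress(text, table):
    """Hash-cons the halving tree of `text`; return its root (char or number)."""
    if len(text) == 1:
        return text
    mid = len(text) // 2
    key = (compress(text[:mid], table), compress(text[mid:], table))
    return table.setdefault(key, len(table) + 1)


def apply_op(op, a, b, pairs_a, pairs_b, out_table, cache):
    """Combine compressed inputs a and b position by position with `op`."""
    if isinstance(a, str):                      # length-1 blocks: operate directly
        return op(a, b)
    if (a, b) in cache:                         # same input pair seen before:
        return cache[(a, b)]                    # reuse the stored result number
    (a1, a2), (b1, b2) = pairs_a[a], pairs_b[b]
    key = (apply_op(op, a1, b1, pairs_a, pairs_b, out_table, cache),
           apply_op(op, a2, b2, pairs_a, pairs_b, out_table, cache))
    result = out_table.setdefault(key, len(out_table) + 1)
    cache[(a, b)] = result                      # (input 1, input 2) -> result
    return result


def expand(item, pairs):
    """Restore the character string for `item` (used here only for checking)."""
    if isinstance(item, str):
        return item
    return expand(pairs[item][0], pairs) + expand(pairs[item][1], pairs)


def invert(table):
    return {number: pair for pair, number in table.items()}


calls = []                                      # records every character-level operation


def bit_and(x, y):
    calls.append((x, y))
    return "1" if x == y == "1" else "0"


table_x, table_y, table_z, cache = {}, {}, {}, {}
root_x = compress("01010101", table_x)
root_y = compress("00110011", table_y)
root_z = apply_op(bit_and, root_x, root_y,
                  invert(table_x), invert(table_y), table_z, cache)
result = expand(root_z, invert(table_z))
```

In this example the second halves of both inputs repeat the first halves, so the cache answers for them immediately and only 4 of the 8 character-level operations are actually carried out.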
[0078]
In an actual device, the capacity of the arithmetic processing acceleration dictionary 350 is finite, so when the dictionary 350 becomes full, old information is deleted as appropriate and overwritten with new information. As a result, an opportunity to detect a subsequence whose processing could be omitted may be lost, and the processing speed may fall. Even in such a case, however, only the speed is reduced; the content of the code string data of the operation result is not affected.
[0079]
It should be noted that the present invention is not limited to the above-described embodiment, but can be variously modified and applied within the scope of the claims.
[0080]
【Effect of the Invention】
As described above, according to the present invention, it is possible to provide an apparatus that performs arithmetic processing directly on compression-encoded character string data without necessarily restoring it completely. As a result, the processing speed can be increased substantially in proportion to the data compression ratio.
Further, according to the present invention, even when all the input data cannot be stored in the storage device in advance, it is possible to execute the arithmetic processing sequentially from the beginning of the input compressed code string data and to output the results sequentially. As a result, arithmetic processing of large-scale data that cannot be handled by the conventional method can be performed.
[0081]
Further, according to the present invention, by connecting a plurality of arithmetic processing units in series, it is possible to increase the processing speed by executing the processing in parallel in a pipeline manner.
[Brief description of the drawings]
FIG. 1 is a block diagram of the principle (character string data compression encoding apparatus) of the present invention.
FIG. 2 is a block diagram of the principle (character string data restoration device) of the present invention.
FIG. 3 is a diagram illustrating the principle configuration (character string data processing device) of the present invention.
FIG. 4 is a configuration diagram of a character string data compression encoding device of the present invention.
FIG. 5 is a diagram for explaining a subsequence decomposition rule according to the present invention.
FIG. 6 is a configuration diagram of a character string data restoration device of the present invention.
FIG. 7 is a configuration diagram of a character string data processing device of the present invention.
FIG. 8 is a diagram for explaining the operation of the character string data compression encoding apparatus according to the first embodiment of the present invention.
FIG. 9 is an example of a dictionary in which substrings of character string data according to the first embodiment of the present invention are registered.
FIG. 10 is an example of code string data obtained by compressing and encoding character string data according to the first embodiment of the present invention.
FIG. 11 is an example of an arithmetic processing acceleration dictionary according to a third embodiment of the present invention.
FIG. 12 is an example of a truth table expressing logic function data having three variables X₁, X₂, X₃ as inputs.
FIG. 13 is an example of a binary tree-like graph obtained by substituting 0 and 1 for input variables with respect to logic function data.
FIG. 14 is a diagram illustrating a process of sharing a BDD.
FIG. 15 is a diagram for explaining a process of making a BDD non-redundant.
FIG. 16 is an example of a BDD representing logic function data.
FIG. 17 is an example of a table listing nodes of the BDD in order to store the BDD in a storage device.
[Explanation of symbols]
100 Character string data compression encoding device
110 input processing means, input processing unit
120 output processing means, output processing unit
130 dictionary
140 compression encoding control unit
141 detection means
142 output control means
150 Decomposition rules
200 character string data recovery device
210 input processing means, input processing unit
220 output processing means, output processing unit
230 dictionary
240 restoration control means, restoration control unit
300 Character string data processing unit
310, 320 Input processing means
330 arithmetic processing means
340 control unit
350 arithmetic processing acceleration dictionary
360 output processing means

Claims

A character string data compression encoding device that receives character string data having a length of 2ⁿ (where n is an integer constant of 0 or more) and outputs code string data obtained by compressing the data without losing its information, comprising:
Input processing means for sequentially reading the character string data before compression from the beginning,
Output processing means for sequentially outputting code string data after compression encoding from the beginning,
A finite capacity dictionary,
A decomposition rule that is referred to in order to divide the input character string data of length 2ⁿ input from the input processing means into two subsequences, the first half and the second half, each of half the length, to further divide each of the obtained subsequences into its first half and second half, and, by repeating this, to decompose the data into subsequences of lengths 1, 2, 4, 8, 16, ..., 2ⁿ,
Detecting means for sequentially reading the input character string data input from the input processing means while referring to the decomposition rule, extracting subsequences of lengths 1, 2, 4, 8, 16, ..., 2ⁿ, assigning a different registration number to each distinct character string, storing them in the dictionary as long as its capacity permits, and, by reading the input character string data while referring to the dictionary, detecting that the same character string has appeared repeatedly in the input character string data;
Output control means for, when a subsequence consisting of the same character string appears repeatedly in the input character string data, instructing the output processing means to output a set of the character string and its registration number the first time, and instructing the output processing means to output only the registration number indicating the subsequence the second and subsequent times.

The character string data compression encoding apparatus according to claim 1, further comprising means for, when character string data of an arbitrary length is input, appending a plurality of special characters representing blanks to the end of the data so that the length of the input character string data becomes exactly 2ⁿ (where n is an integer constant as small as possible).

The character string data compression encoding apparatus according to claim 1, wherein
the detecting means includes
storage means for decomposing an arbitrary character string a having a length of 2 or more into two subsequences a1 and a2, the first half and the second half, each of half the length, and, if the character strings of the subsequences a1 and a2 are already stored in the dictionary, representing the character string a by the pair of registration numbers indicating the character strings of the subsequences a1 and a2, giving it a registration number, and storing it in the dictionary, and
the output control means includes
means for, when a character string a having a length of 2 or more is output, expressing the character string a by sequentially outputting the character strings of the subsequences a1 and a2 if the character strings of the two subsequences a1 and a2 of the first half and the second half are not stored in the dictionary, and, if one or both of the character strings of the subsequences a1 and a2 are stored in the dictionary, outputting the registration number indicating the stored character string instead of outputting the character string itself.

The character string data compression encoding apparatus according to claim 3, wherein
the storage means includes means for, when the capacity of the dictionary reaches its limit and a new character string x cannot be stored, searching the character strings stored so far for a character string y that is not yet used as the first half or the second half of another character string, first deleting the set of registration numbers indicating the first half and the second half that represents the character string y, storing information representing the new character string x in the storage area indicated by the same registration number as the deleted character string y, and proceeding with the subsequent processing.

A character string data decompression device for decompressing and extracting original character string data from code sequence data compressed and encoded by a character string data compression encoding device,
Input processing means for sequentially reading compression-encoded code string data from the beginning,
Output processing means for sequentially outputting the restored character string data from the beginning,
A dictionary equivalent to or larger than the dictionary of the character string data compression encoding device used when compressing the character string data,
Restoration control means that, when sequentially reading the compression-encoded code string data obtained from the input processing means, instructs the output processing means to output the character string as it is when a subsequence represented as a simple character string is read, stores the character string in the dictionary under the registration number when a set of the registration number of a subsequence and its character string is read, and instructs the output processing means to take the subsequence indicated by the registration number out of the dictionary and output it when only the registration number of a subsequence is read.

The character string data restoring device according to claim 5, wherein the restoration control means includes:
Means for, when sequentially reading the compression-encoded code string data and a character string represented by a pair of two subsequences of a first half and a second half is input, expressing the character string by the pair of registration numbers of the first-half and second-half subsequences and storing it in the dictionary; and
Means for, when extracting and outputting a subsequence stored in the dictionary, referring in order to the first-half and second-half subsequences of that subsequence in the dictionary and, when those subsequences are themselves represented as pairs of a first half and a second half, referring to them in turn and extracting and outputting subsequences of length 1.

A character string data processing device that, where a binary operation computing and outputting one character for any two characters is defined, receives two character string data of the same length and outputs one character string data representing the result of applying the binary operation to characters at the same position from the beginning, comprising:
Two input processing means, each composed of a character string data restoration device and each restoring its input code string data to character string data and transferring it, the character string data restoration device having: first input means for sequentially reading compression-encoded code string data from the beginning; first output means for sequentially outputting the restored character string data from the beginning; a first dictionary having a capacity equal to or larger than that of the dictionary of the character string data compression encoding device used when the character string data was compressed; and restoration control means that, when sequentially reading the compression-encoded code string data acquired from the first input means, instructs the first output means to output the character string as it is when a subsequence represented as a simple character string is read, stores the character string in the first dictionary under the registration number when a set of the registration number of a subsequence and its character string is read, and instructs the first output means to take the subsequence indicated by the registration number out of the first dictionary and output it when only the registration number of a subsequence is read,
Arithmetic processing means for sequentially calculating a binary operation between characters at the same position from the beginning of the two character string data transferred from the input processing means, and transferring the result as new character string data;
output processing means comprising a character string data compression encoding device having: second input means for sequentially reading character string data before compression from the beginning; second output means for sequentially outputting compression-encoded code string data from the beginning; a second dictionary having a certain finite capacity; a decomposition rule under which input character string data of length 2^n input from the second input means is divided into two subsequences of half length, a first half and a second half, each obtained subsequence is again divided into a first half and a second half, and this is repeated so that the data is decomposed into subsequences of lengths 1, 2, 4, 8, 16, ..., 2^n; detecting means which, referring to the decomposition rule, sequentially reads the input character string data input from the second input means, extracts subsequences of lengths 1, 2, 4, 8, 16, ..., 2^n, assigns a different registration number to each different character string, stores them in the second dictionary as far as its capacity allows, and, by reading the input character string data while referring to the second dictionary, detects that the same character string appears repeatedly in the input character string data; and output control means which, when the same subsequence appears repeatedly in the input character string data, instructs the second output means to output a pair of the character string and its registration number on the first appearance and to output only the registration number indicating that character string on subsequent appearances, the output processing means compressing and encoding the character string data transferred from the arithmetic processing means into code string data and outputting it.
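The restore, operate, and re-encode flow of this claim can be illustrated with a small sketch. Several things here are assumptions rather than the claimed format: the token layout (a `(registration number, content)` tuple on first appearance, a bare registration number afterwards), the representation of a longer subsequence as a pair of registration numbers of its halves, and an unbounded dictionary. The names `encode`, `decode`, and `apply_binary` are hypothetical.

```python
def encode(text):
    """Compress a string whose length is a power of two into a token list."""
    table = {}   # substring -> registration number
    tokens = []

    def visit(s):
        if s in table:
            # Repeated substring: reuse the existing registration number.
            return table[s]
        if len(s) == 1:
            body = s  # length-1 substring is stored literally
        else:
            half = len(s) // 2
            body = (visit(s[:half]), visit(s[half:]))
        reg_no = len(table)
        table[s] = reg_no
        tokens.append((reg_no, body))  # first appearance: number plus content
        return reg_no

    tokens.append(visit(text))  # finally, the number of the whole string
    return tokens


def decode(tokens):
    """Restore the string from the token list produced by encode."""
    dictionary = {}
    out = []

    def expand(reg_no):
        body = dictionary[reg_no]
        if isinstance(body, str):
            return body
        return expand(body[0]) + expand(body[1])

    for token in tokens:
        if isinstance(token, tuple):
            reg_no, body = token
            dictionary[reg_no] = body   # store under the given number
        else:
            out.append(expand(token))   # number only: look up and output
    return "".join(out)


def apply_binary(op, code_a, code_b):
    """Restore both inputs, apply op character by character, re-encode."""
    a, b = decode(code_a), decode(code_b)
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return encode("".join(op(x, y) for x, y in zip(a, b)))


# Example with the character-wise maximum as the binary operation.
result_code = apply_binary(max, encode("abab"), encode("bbaa"))
result_text = decode(result_code)
```

In this sketch every subsequence is fully restored before the operation is applied, which corresponds to the device without the operation acceleration dictionary.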

8. The character string data processing device according to claim 7, wherein the input processing means and the output processing means each further include transfer means for transferring the registration number information whenever a subsequence is stored in or referred to from the first dictionary or the second dictionary, the device further comprising:
detecting means which obtains the registration number information from the transfer means, stores in an operation acceleration dictionary of finite capacity, as far as its capacity allows, the subsequence corresponding to the result of the binary operation on a pair of two input subsequences as a triple of registration numbers (those of the two input subsequences and the one output subsequence), and, by carrying out the operation sequentially while referring to the operation acceleration dictionary, detects that the same pair of input subsequences has appeared more than once; and
control means which, when the same pair of input subsequences appears, controls the input processing means so as to omit restoring the registration numbers of those subsequences to character string data, and controls the output processing means so as to output as it is the registration number of the operation-result subsequence stored in the operation acceleration dictionary.
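The role of the operation acceleration dictionary can be sketched as a cache keyed by the registration numbers of the two input subsequences. The sketch below makes simplifying assumptions that are not in the claim: a single shared store is used for both inputs and the output instead of separate first and second dictionaries, and neither the store nor the cache has a capacity limit. `Store` and `apply_cached` are hypothetical names.

```python
class Store:
    """Assumed shared dictionary: registration number <-> subsequence content."""

    def __init__(self):
        self.bodies = []  # registration number -> char or (left, right)
        self.index = {}   # content -> registration number

    def register(self, body):
        if body not in self.index:
            self.index[body] = len(self.bodies)
            self.bodies.append(body)
        return self.index[body]

    def build(self, text):
        """Register text (length a power of two) by halving it recursively."""
        if len(text) == 1:
            return self.register(text)
        half = len(text) // 2
        return self.register((self.build(text[:half]), self.build(text[half:])))

    def text(self, reg_no):
        body = self.bodies[reg_no]
        if isinstance(body, str):
            return body
        return self.text(body[0]) + self.text(body[1])


def apply_cached(store, op, a, b, cache):
    """Apply op to subsequences a and b (equal length) by registration number."""
    key = (a, b)
    if key in cache:
        # Same input pair seen before: return the stored result number
        # without restoring either input to characters.
        return cache[key]
    body_a, body_b = store.bodies[a], store.bodies[b]
    if isinstance(body_a, str):
        result = store.register(op(body_a, body_b))
    else:
        result = store.register((
            apply_cached(store, op, body_a[0], body_b[0], cache),
            apply_cached(store, op, body_a[1], body_b[1], cache),
        ))
    cache[key] = result  # entry pairs the two input numbers with the output
    return result


store = Store()
cache = {}
a = store.build("abababab")
b = store.build("bbaabbaa")
r = apply_cached(store, max, a, b, cache)
```

Because both inputs repeat their first half, the pair of half-length subsequences is computed once and the second occurrence is answered from the cache, which is the saving the claim is aiming at.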