JP5808361B2

JP5808361B2 - String compression and decompression system and method

Info

Publication number: JP5808361B2
Application number: JP2013080293A
Authority: JP
Inventors: 健山室; 史和小西
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-04-08
Filing date: 2013-04-08
Publication date: 2015-11-10
Anticipated expiration: 2033-04-08
Also published as: JP2014204358A

Description

The present invention relates to a method and apparatus for creating a hierarchical sample character string dictionary for character string compression. More particularly, for replacement-based character string compression, it relates to a method and apparatus that create a sample character string dictionary for improving locality of reference by performing j-stage replacement using the frequency order of frequent patterns.

Specifically, before compression, sample character strings (a set of partial character strings) are extracted from the input character string N to be compressed by a suitable method, and a sample character string dictionary is created from them. During compression, the dictionary is referenced to perform pointer replacement, which improves locality of reference in the hierarchical memory structure during restoration.

Known methods for compressing an input character string include the following.

One method takes the data string to be compressed as input, uses a hash-based search data structure to compare the current character string being compressed with character strings that appeared earlier, and achieves compression by replacing any string that has already appeared with a pointer to its earlier occurrence (see, for example, Patent Document 1). In the example of FIG. 1, when the character string "abcd" appears a second or later time, each later occurrence is replaced with a pointer to the first occurrence, and the character string is compressed in this way.

Another method takes the data string to be compressed as input and achieves efficient data compression by applying a process that compares the current character string being compressed against earlier matching character strings (see, for example, Patent Document 2).

A further method improves the compression ratio by switching among several dictionary-based LZ compression methods while evaluating their compression ratios (see, for example, Patent Document 3).

Patent Document 1: Dean K. Gibson, Mark D. Graybill, "Apparatus and method for very high data rate-compression incorporating lossless data compression and expansion utilizing a hashing technique", US Patent 5,049,881.
Patent Document 2: JP 2003-179501 A
Patent Document 3: Japanese Patent No. 3065591

However, as shown in FIG. 1, the conventional compression methods above achieve compression by replacing the current character string, when it matches a character string that appeared at an arbitrary earlier position, with a pointer to that earlier occurrence. Because no rule constrains which pointer is used, restoration must follow pointers to arbitrary positions. The memory location to be referenced cannot be predicted, and locality of reference for the pointer targets is not guaranteed. In particular, when the character string to be compressed is large, for example 1G in size, restoration may be delayed.

The method of Patent Document 3 combines several LZ compression methods, but it switches among them using only the compression ratio as the evaluation criterion. Simply combining the methods in this way therefore does not improve locality of reference during restoration, and so does not improve restoration speed.

The present invention has been made in view of the above. Its object is to provide a method and apparatus that, before the conventional pointer replacement, generate a sample character string dictionary based on the appearance frequency of partial character strings of the input character string and actively reference that dictionary, thereby improving locality of reference during restoration and increasing restoration speed through j-stage replacement using the frequency order of frequent patterns.

The present invention is a character string compression and restoration system based on a replacement method, comprising:
encoding means that includes
frequent pattern analysis means that extracts partial character strings from an input character string N, counts the number of times each partial character string appears in the input character string N, and stores the partial character strings and their appearance counts in frequent pattern storage means,
sample character string generation means that sorts the partial character strings stored in the frequent pattern storage means in descending order of appearance count, stores them back in the frequent pattern storage means, and stores the top N entries of the frequent pattern storage means in sample character string storage means as a sample character string, and
replacement pointer generation means that obtains the maximum matching length L_M between the partial character string starting at position i of the input character string N and the sample character string M read from the sample character string storage means, together with the appearance position P_M of the match; obtains the maximum matching length L_N between the partial character string starting at position i of the input character string N and the partial character strings that appeared between positions 0 and i−1 of the input character string N, together with the appearance position P_N of the match; makes the replacement pointer indicate a past position in the input character string N when L_N is greater than L_M, and a position on the sample character string M when L_M is greater than or equal to L_N; stores the partial character string [i…i+L+1] of the input character string N in dynamic dictionary storage means as a partial character string that has already appeared; and outputs the replacement pointer sequence and the sample character string; and
decoding means that includes replacement pointer analysis means that obtains the replacement pointer sequence and the sample character string from the replacement pointer generation means, outputs the partial character string on the sample character string M referenced by a replacement pointer when that pointer indicates a position on the sample character string M, and outputs the already restored partial character string referenced by a replacement pointer when that pointer indicates a past position in the input character string N.

By applying frequent pattern analysis to the input character string N, the present invention obtains character strings with high appearance frequency. It then generates the sample character string dictionary by dividing the sample character strings, in descending order of frequency, to fit the sizes of the upper levels of an N-stage hierarchical memory structure, which are faster to reference and smaller in size. Because patterns with low appearance frequency are not included, the size of the dictionary can be kept small. Furthermore, in a j-stage hierarchical memory structure, performing replacement with the sample character strings divided into N stages improves locality and thereby increases the restoration speed of the compressed character string.

FIG. 1 is a diagram for explaining replacement-based compression of a character string.
FIG. 2 is a diagram showing an overview of an embodiment of the present invention.
FIG. 3 is a configuration diagram of a character string compression apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart of character string compression processing according to an embodiment of the present invention.
FIG. 5 is a flowchart of sample character string generation processing according to an embodiment of the present invention.
FIG. 6 is a flowchart of frequent pattern analysis processing according to an embodiment of the present invention.
FIG. 7 is an example of the frequent pattern storage unit according to an embodiment of the present invention.
FIG. 8 is an example of the frequent pattern storage unit after sorting according to an embodiment of the present invention.
FIG. 9 is a flowchart of the processing of the replacement pointer generation unit according to an embodiment of the present invention.
FIG. 10 is an example of the dynamic dictionary storage unit according to an embodiment of the present invention.
FIG. 11 is an example of the replacement pointer storage unit according to an embodiment of the present invention.
FIG. 12 is a flowchart of the processing of the replacement pointer analysis unit according to an embodiment of the present invention.
FIG. 13 is an example of dictionary placement in a hierarchical memory structure according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

First, an overview of the present invention is given.

FIG. 2 is a diagram for explaining the overview of an embodiment of the present invention.

The present invention aims to improve replacement-based compression by using a sample character string. Character strings extracted based on the appearance frequency of partial character strings of the input character string N (sample character strings) are stored in a storage unit (the sample character string M storage unit), and matching parts of the input are replaced with references to partial character strings within the stored sample character string M (where M << input character string N). On the premise that the sample character string M is very small (1% of the entire input character string N), locality of reference is improved. The 1% figure was obtained experimentally. In FIG. 2, patterns that are not found on the sample character string are handled by the conventional replacement method.

FIG. 3 shows the configuration of a character string compression apparatus according to an embodiment of the present invention.

The character string compression apparatus shown in the figure has an encoding unit 100 and a decoding unit 200.

The encoding unit 100 has a frequent pattern analysis unit 310, a frequent pattern storage unit 320, a sample character string generation unit 110, a sample character string acquisition unit 120, a replacement pointer generation unit 130, a sample character string search unit 140, a sample character string M storage unit 150, a dynamic dictionary search/update unit 160, a dynamic dictionary storage unit 170, and a replacement pointer storage unit 180. The sample character string generation unit 110 has a memory (not shown) for temporarily storing sample character strings.

The decoding unit 200 has a replacement pointer analysis unit 210, an input sample character string storage unit 220, and an output character string storage unit 230.

The processing performed by this configuration is described below.

First, the processing of the encoding unit 100 is described.

FIG. 4 is a flowchart of character string compression processing according to an embodiment of the present invention.

Step 100) The encoding unit 100 waits until it receives the input character string N to be compressed.

Step 200) The frequent pattern analysis unit 310 analyzes the frequent patterns of the input character string N, and the sample character string generation unit 110 generates the sample character string M based on the appearance frequency of those patterns.

Step 300) The replacement pointer generation unit 130 generates replacement pointers with the input character string N as its argument and stores them in the replacement pointer storage unit 180.

The processing of Step 200 above is described next.

FIG. 5 is a flowchart of sample character string generation processing according to an embodiment of the present invention.

Step 210) The sample character string generation unit 110 obtains the input character string N from its input argument.

Step 220) The frequent pattern analysis unit 310 is instructed to perform the processing shown in FIG. 6.

Step 230) The sample character string generation unit 110 refers to the frequent pattern storage unit 320 via the sample character string acquisition unit 120, sorts the frequent patterns in descending order of appearance frequency, and stores them back in the frequent pattern storage unit 320. Let H be the number of elements in the frequent pattern storage unit 320.

Step 240) Let j be the number of levels in the hierarchical memory, and let Mem_0, Mem_1, …, Mem_{j-1} be the sizes of the memories. A smaller subscript is assumed to denote a memory that is faster to reference and smaller in capacity.

Step 250) The counters are initialized to i = 0 and k = 0.

Step 260) The sample character string generation unit 110 appends the k-th most frequent pattern (partial character string) in the frequent pattern storage unit 320 to the end of M_i.

Step 270) k is compared with the number of elements H in the frequent pattern storage unit 320. If k < H, the process proceeds to Step 280; otherwise it proceeds to Step 310.

Step 280) If the size of the sample character string M_i is less than Mem_i, the process returns to Step 260; otherwise it proceeds to Step 290.

Step 290) The value of i is incremented by 1 (i = i + 1).

Step 300) If i is smaller than the number of levels j (i < j), the process returns to Step 260; otherwise it proceeds to Step 310.

Step 310) The sample character strings generated for Mem_0, Mem_1, …, Mem_{j-1} are stored in the sample character string M storage unit 150.
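
As a rough illustration of Steps 230 to 310, the following Python sketch ranks patterns by count and fills one sample character string per memory level. The name `build_tiered_samples`, the dictionary-of-counts input, and the use of character counts as memory sizes are assumptions made for this example; the text does not give an implementation. As in Step 280, capacity is checked between appends, so a level may exceed its nominal size by part of one pattern.

```python
def build_tiered_samples(counts, tier_sizes):
    """Build one sample string per memory level, hottest patterns first.

    counts     -- hypothetical mapping {pattern: appearance count}
    tier_sizes -- [Mem_0, Mem_1, ...], fastest and smallest level first
    """
    # Step 230: rank patterns from most to least frequent
    # (patterns with equal counts keep their input order).
    ranked = sorted(counts, key=lambda p: counts[p], reverse=True)

    tiers = ["" for _ in tier_sizes]  # M_0 .. M_{j-1}
    i = 0                             # index of the level being filled
    for pattern in ranked:
        # Steps 280-300: move to the next level once M_i has reached Mem_i.
        while i < len(tier_sizes) and len(tiers[i]) >= tier_sizes[i]:
            i += 1
        if i == len(tier_sizes):
            break                     # every level is full
        tiers[i] += pattern           # Step 260: append the pattern to M_i
    return tiers
```

With counts `{"ab": 5, "cd": 4, "e": 3, "fgh": 2, "x": 1}` and sizes `[3, 2]`, the two most frequent patterns land in level 0 and the least frequent pattern is dropped because both levels are already full.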

Next, the processing of the frequent pattern analysis unit 310 in Step 220 above is described.

FIG. 6 is a flowchart of frequent pattern analysis processing according to an embodiment of the present invention.

Step 221) The sample character string generation unit 110 receives the input character string N from its input argument and passes it to the frequent pattern analysis unit 310 via the sample character string acquisition unit 120.

Step 222) The frequent pattern analysis unit 310 sets the maximum length of a frequent pattern to G.

Step 223) The frequent pattern analysis unit 310 initializes the counters to i = 0 and S = 1. Here, i is the start position currently being searched, and S is the length of the matching character string counted from i. This step initializes the first match length S to 1.

Step 224) It is checked whether the partial character string [i…S] of the input character string N exists in the frequent pattern storage unit 320. If it exists, the process proceeds to Step 226; if not, it proceeds to Step 225.

Step 225) The number of times the partial character string [i…S] appears in the input character string N is determined, and the partial character string and its count are stored in the frequent pattern storage unit 320. FIG. 7 shows an example of the frequent pattern storage unit 320. As shown in the figure, the frequent pattern storage unit 320 stores the character string of each pattern and its appearance count.

Step 226) The count S is incremented by 1 (S = S + 1).

Step 227) If the count S is smaller than the maximum frequent pattern length G (S < G) and i + S is smaller than the number of input characters N (i + S < N), the process returns to Step 224; otherwise it proceeds to Step 228.

Step 228) The position counter i is incremented by 1 (i = i + 1), and S is reset to 1 (S = 1).

Step 229) If i is smaller than the number of input characters, the process returns to Step 224; otherwise the processing ends.
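
The counting loop of Steps 221 to 229 can be approximated as follows. This is a minimal Python sketch under stated assumptions rather than the procedure of FIG. 6 itself: `count_frequent_patterns` is a hypothetical name, overlapping occurrences are counted, and each substring is tallied as it is encountered instead of through a separate existence check against the storage unit.

```python
from collections import Counter


def count_frequent_patterns(text, max_len):
    """Count every substring of length 1..max_len at every start position.

    text    -- the input character string N
    max_len -- the maximum pattern length G
    """
    counts = Counter()
    for i in range(len(text)):            # i: current start position
        for s in range(1, max_len + 1):   # s: current pattern length S
            if i + s > len(text):
                break                     # pattern would run past the input
            counts[text[i:i + s]] += 1
    return counts
```

The result plays the role of the frequent pattern storage unit 320. Calling `counts.most_common()` on it yields the patterns in descending order of count, which corresponds to the sort described for FIG. 8.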

When the processing of the frequent pattern analysis unit 310 shown in FIG. 6 is finished, the sample character string generation unit 110, via the sample character string acquisition unit 120, sorts the entries of the frequent pattern storage unit 320 shown in FIG. 7 into descending order of appearance count, as shown in FIG. 8. It then creates j sample character strings M_0 to M_{j-1}, in descending order of appearance frequency, sized to fit the capacities of the j levels of the hierarchical memory, and adds them to the sample character string M storage unit 150. On the premise that a sample character string M_x with a smaller subscript is placed in memory that is faster to reference and smaller in capacity, the sample character strings with smaller subscripts are composed of character strings with higher appearance frequency.

Next, the processing of the replacement pointer generation unit 130 in Step 300 above is described.

FIG. 9 is a flowchart of the processing of the replacement pointer generation unit according to an embodiment of the present invention.

Step 301) The replacement pointer generation unit 130 receives the input character string N from its input argument.

Step 302) The position counter i for the input character string is set to 0.

Step 303) The sample character string search unit 140 is instructed to compute the maximum matching length L_M between the partial character string starting at position i of the input character string N and the character strings on the sample character string M in the sample character string M storage unit 150, together with the appearance position P_M. The sample character string search unit 140 refers to the input character string N and the sample character string M storage unit 150, computes L_M and its appearance position P_M on the sample character string M, and returns them to the replacement pointer generation unit 130.

Step 304) As in Step 303, the replacement pointer generation unit 130 instructs the dynamic dictionary search/update unit 160 to compute the maximum matching length L_N between the partial character string starting at position i of the input character string N and the partial character strings that appeared in [0…i−1] of the input character string, together with the appearance position P_N. The dynamic dictionary search/update unit 160 compares the partial character string starting at position i with [0…i−1] of the input character string to obtain L_N, and then looks up the longest matching partial character string in the dynamic dictionary storage unit 170 to obtain its appearance position P_N. As shown in FIG. 10, the dynamic dictionary storage unit 170 is a dictionary that stores character strings that have appeared together with their positions. For example, when the input character string N is "zxywe…", entries are registered by sliding one character at a time: "zxy" at position "0", "xyw" at position "1", "ywe" at position "2", and so on.

Step 305) The maximum matching length L_M obtained in Step 303 is compared with the maximum matching length L_N obtained in Step 304. If L_M < L_N, the process proceeds to Step 306; if L_M ≧ L_N, it proceeds to Step 308.

Step 306) When L_M < L_N (the match was found in the earlier part of the input character string N), the replacement pointer generation unit 130 sets the replacement pointer flag F to 0. That is, the replacement pointer refers to a past position in the input character string N.

Step 307) The maximum matching length L is set to L_N and the pointer is set to P_N, and the process proceeds to Step 310.

Step 308) When L_M ≧ L_N (the match was found on the sample character string M), the replacement pointer generation unit 130 sets the replacement pointer flag F to 1. That is, the replacement pointer refers to a position on the sample character string M.

Step 309) The maximum matching length L is set to L_M and the pointer is set to the appearance position P_M on the sample character string M.

Step 310) The replacement pointer flag F set in Step 306 or Step 308 and the replacement pointer (L / P / the (i+L+1)-th character of the character string N) are stored in the replacement pointer storage unit 180.

FIG. 11 shows an example of the replacement pointer storage unit 180. The replacement pointer storage unit 180 stores a replacement pointer type flag and a replacement pointer. When the flag is "0", the replacement pointer refers to a past position in the input character string N; when it is "1", the replacement pointer refers to a position on the sample character string M. A replacement pointer consists of the triple {position from the beginning, length, terminal character of the replaced string}. In the example of FIG. 11, the type flag of the first entry in the replacement pointer storage unit 180 is "1", so the replacement pointer refers to a position on the sample character string M. With [sample character string M: abcdefg…] and [input character string N: zxywefghic…abcdk…ef ghij…], the maximum matching length L between "abcd" in the sample character string M and "abcd" in the input character string N is "4", the start position P on the sample character string is 0, and the (i+L+1)-th character of the input character string N is "k", so the replacement pointer is "0,4,'k'".

Step 311) The partial character string [i…i+L+1] of the input character string N is registered in the dynamic dictionary storage unit 170, using the dynamic dictionary search/update unit 160, as a partial character string that has already appeared.

Step 312) i is updated to i = i + L + 1.

Step 313) If i is less than the total length of the input character string N, the process returns to Step 303; otherwise the processing ends.
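
The selection rule of Steps 303 to 313 can be sketched in Python as below. The helper names `longest_match` and `encode` and the tuple layout `(flag, position, length, terminal character)` are assumptions for illustration. The sketch searches by brute force with `str.find` instead of maintaining the dynamic dictionary of FIG. 10, and it only considers past matches that lie entirely within positions 0 to i−1.

```python
def longest_match(haystack, needle):
    """Return (length, position) of the longest prefix of needle in haystack."""
    for length in range(len(needle), 0, -1):
        pos = haystack.find(needle[:length])
        if pos != -1:
            return length, pos
    return 0, 0  # no prefix of needle occurs in haystack


def encode(text, sample):
    """Turn text into a list of (flag, position, length, terminal char) tuples."""
    pointers = []
    i = 0
    while i < len(text):
        l_m, p_m = longest_match(sample, text[i:])    # Step 303: sample string M
        l_n, p_n = longest_match(text[:i], text[i:])  # Step 304: past input
        if l_m < l_n:
            flag, length, pos = 0, l_n, p_n           # Steps 306-307
        else:
            flag, length, pos = 1, l_m, p_m           # Steps 308-309
        # Character right after the match; empty at the very end of the input.
        terminal = text[i + length:i + length + 1]
        pointers.append((flag, pos, length, terminal))  # Step 310
        i += length + 1                                 # Step 312
    return pointers
```

For the sample character string "abcdefg" and the input fragment "abcdk", this produces a single entry with flag 1, position 0, length 4 and terminal character 'k', in line with the FIG. 11 example. Because ties go to the sample character string (L_M ≧ L_N), references are steered toward the small dictionary whenever it matches at least as well as the past input.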

After the above processing, the replacement pointer generation unit 130 outputs the replacement pointer sequence from the replacement pointer storage unit 180, together with the sample character string M, to the replacement pointer analysis unit 210 of the decoding unit 200.

The replacement pointer analysis unit 210 of the decoding unit 200, which performs Step 400 of FIG. 4, is described below.

FIG. 12 is a flowchart of the processing of the replacement pointer analysis unit according to an embodiment of the present invention.

Step 401) The replacement pointer analysis unit 210 receives, as input arguments, the replacement pointer sequence and the sample character string M from the encoding unit 100.

Step 402) The replacement pointer analysis unit 210 stores the received sample character string M in the input sample character string storage unit 220.

Step 403) The total number of replacement pointers is set to Z.

Step 404) The replacement pointer counter i is set to 0.

Step 405) The replacement pointer flag F contained in the i-th replacement pointer is obtained from the replacement pointer sequence received in Step 401.

Step 406) It is determined whether the replacement pointer flag F is 1. If it is 1, the process proceeds to Step 408; otherwise it proceeds to Step 407.

ステップ４０７）置換ポインタFが参照する既に復元済みの部分文字列を出力文字列記憶部２３０に出力し、ステップ４０９に移行する。 Step 407) The already restored partial character string referred to by the replacement pointer F is output to the output character string storage unit 230, and the process proceeds to Step 409.

Step 408) i is incremented by 1.

Step 409) If i < Z, the process returns to step 405; otherwise, the process ends.
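
The decoding loop of steps 401 to 409 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: the pointer layout (flag, position, length), the convention that flag 1 refers to the sample character string M and flag 0 to already-restored output, and the names ReplacementPointer and decode are all hypothetical. The branch bodies follow the behavior of the replacement pointer analysis means as stated in the claims.

```python
from typing import List, NamedTuple


class ReplacementPointer(NamedTuple):
    """Hypothetical pointer layout; the field meanings are assumptions."""
    flag: int      # assumed: 1 = refers to sample string M, 0 = refers to restored output
    position: int  # start offset in the referenced string
    length: int    # number of characters to copy


def decode(pointers: List[ReplacementPointer], sample_m: str) -> str:
    """Restore the original string from a replacement pointer string and M."""
    out: List[str] = []   # plays the role of the output character string storage unit
    z = len(pointers)     # step 403: total number of replacement pointers
    i = 0                 # step 404: pointer counter
    while i < z:          # step 409: loop while i < Z
        p = pointers[i]   # step 405: fetch the i-th pointer and its flag
        if p.flag == 1:   # step 406: branch on the flag
            # Assumed meaning of flag 1: copy from the sample string M.
            out.extend(sample_m[p.position:p.position + p.length])
        else:
            # Copy from already-restored output, one character at a time,
            # so that a reference overlapping the write position still works.
            for k in range(p.length):
                out.append(out[p.position + k])
        i += 1            # step 408: advance to the next pointer
    return "".join(out)


if __name__ == "__main__":
    pointers = [
        ReplacementPointer(1, 0, 4),  # "abcd" taken from M
        ReplacementPointer(0, 0, 4),  # "abcd" taken from restored output
        ReplacementPointer(1, 1, 2),  # "bc" taken from M
    ]
    print(decode(pointers, "abcd"))
```

Copying character by character in the second branch is a design choice of this sketch: it lets a pointer refer to a region that extends into the text it is itself producing, as is common in replacement-based compressors.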

FIG. 13 shows an example of dictionary placement in a hierarchical memory structure according to an embodiment of the present invention. The figure shows the case of three levels (j = 0, 1, 2). The main storage device at level j[2] holds a dictionary created by sorting the entries by appearance frequency and concatenating them. This makes it more likely that the CPU loads the surrounding dictionary region into the cache device (level j[0]), which has a high reference speed, and as a result the dictionary reference speed improves.
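
The construction of such a frequency-ordered dictionary can be sketched in Python as follows. This is a simplified illustration, not the method of the embodiment: it assumes that the partial character strings are fixed-length substrings, and the function name build_sample_string and the parameters gram_len and top_n are hypothetical.

```python
from collections import Counter


def build_sample_string(text: str, gram_len: int = 4, top_n: int = 3) -> str:
    """Concatenate the top_n most frequent substrings of length gram_len.

    Fixed-length substrings are an assumption of this sketch.
    """
    # Count every substring of length gram_len (frequent pattern analysis).
    counts = Counter(
        text[i:i + gram_len] for i in range(len(text) - gram_len + 1)
    )
    # most_common() sorts by descending count; entries with equal counts
    # keep the order in which they were first seen.
    top = counts.most_common(top_n)
    # Concatenate so that the most frequent entries sit next to each other.
    return "".join(substring for substring, _ in top)


if __name__ == "__main__":
    print(build_sample_string("abcdabcdabcdxy", gram_len=4, top_n=2))
```

Placing the most frequent entries contiguously at the front of the dictionary is what keeps frequently referenced regions close together in memory, which is the locality effect described for FIG. 13.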

The operations of the components of the character string compression device shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the character string compression device, or distributed via a network.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

100 encoding unit
110 sample character string generation unit
120 sample character string acquisition unit
130 replacement pointer generation unit
140 sample character string search unit
150 sample character string M storage unit
160 dynamic dictionary search/update unit
170 dynamic dictionary storage unit
180 replacement pointer storage unit
200 decoding unit
210 replacement pointer analysis unit
220 input sample character string storage unit
230 output character string storage unit
310 frequent pattern analysis unit
320 frequent pattern storage unit

Claims

A character string compression and decompression system using a replacement method, comprising:
encoding means having:
frequent pattern analysis means that extracts partial character strings from an input character string N, counts the number of appearances of each partial character string in the input character string N, and stores the partial character strings and their appearance counts in frequent pattern storage means;
sample character string generation means that sorts the partial character strings stored in the frequent pattern storage means in descending order of appearance count, stores them in the frequent pattern storage means, and stores the top N entries of the frequent pattern storage means in sample character string storage means as a sample character string M; and
replacement pointer generation means that obtains the maximum matching length L_M between the partial character string starting at position i of the input character string N and the sample character string M read from the sample character string storage means, together with the appearance position P_M of that partial character string; obtains the maximum matching length L_N between the partial character string starting at position i of the input character string N and the partial character strings that appeared at positions 0 through (i−1) of the input character string N, together with the appearance position P_N of the partial character string that appeared; makes the replacement pointer indicate a past position of the input character string N when the maximum matching length L_N is greater than the maximum matching length L_M; makes the replacement pointer indicate a position on the sample character string M when the maximum matching length L_M is equal to or greater than the maximum matching length L_N; stores the partial character string [i ... i + L + 1] of the input character string N in dynamic dictionary storage means as a partial character string that has already appeared; and outputs a replacement pointer string and the sample character string; and
decoding means having replacement pointer analysis means that obtains the replacement pointer string and the sample character string from the replacement pointer generation means, outputs the partial character string on the sample character string M referred to by the replacement pointer when the replacement pointer indicates a position on the sample character string M, and outputs the already-restored partial character string referred to by the replacement pointer when the replacement pointer indicates a past position of the input character string N,
wherein compression and decompression are performed by two-stage replacement using the sample character string (dictionary).

The compression and decompression system by two-stage replacement using the sample character string (dictionary) according to claim 1, wherein the sample character string storage means is a region whose size is about 1% or less of that of the input character string N.

A character string compression and decompression method using a replacement method, performed in a system comprising encoding means having frequent pattern analysis means, frequent pattern storage means, sample character string generation means, sample character string storage means, dynamic dictionary storage means, and replacement pointer generation means, and decoding means having replacement pointer analysis means, the method comprising:
a frequent pattern analysis step in which the frequent pattern analysis means of the encoding means extracts partial character strings from an input character string N, counts the number of appearances of each partial character string in the input character string N, and stores the partial character strings and their appearance counts in the frequent pattern storage means;
a sample character string generation step in which the sample character string generation means of the encoding means sorts the partial character strings stored in the frequent pattern storage means in descending order of appearance count, stores them in the frequent pattern storage means, and stores the top N entries of the frequent pattern storage means in the sample character string storage means as a sample character string M;
a replacement pointer generation step in which the replacement pointer generation means of the encoding means obtains the maximum matching length L_M between the partial character string starting at position i of the input character string N and the sample character string M read from the sample character string storage means, together with the appearance position P_M of that partial character string; obtains the maximum matching length L_N between the partial character string starting at position i of the input character string N and the partial character strings that appeared at positions 0 through (i−1) of the input character string N, together with the appearance position P_N of the partial character string that appeared; makes the replacement pointer indicate a past position of the input character string N when the maximum matching length L_N is greater than the maximum matching length L_M; makes the replacement pointer indicate a position on the sample character string M when the maximum matching length L_M is equal to or greater than the maximum matching length L_N; stores the partial character string [i ... i + L + 1] of the input character string N in the dynamic dictionary storage means as a partial character string that has already appeared; and outputs a replacement pointer string and the sample character string; and
a replacement pointer analysis step in which the replacement pointer analysis means of the decoding means obtains the replacement pointer string and the sample character string from the encoding means, outputs the partial character string on the sample character string M referred to by the replacement pointer when the replacement pointer indicates a position on the sample character string M, and outputs the already-restored partial character string referred to by the replacement pointer when the replacement pointer indicates a past position of the input character string N,
wherein compression and decompression are performed by two-stage replacement using the sample character string (dictionary).

The compression and decompression method by two-stage replacement using the sample character string (dictionary) according to claim 3, wherein the sample character string storage means is a region whose size is about 1% or less of that of the input character string N.