JP2011221845A

JP2011221845A - Text processing apparatus, text processing method and text processing program

Info

Publication number: JP2011221845A
Application number: JP2010091311A
Authority: JP
Inventors: Tatsuya Asai; 達哉浅井; Shinichiro Tako; 真一郎多湖; Hiroya Inakoshi; 宏弥稲越; Seishi Okamoto; 青史岡本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-04-12
Filing date: 2010-04-12
Publication date: 2011-11-04
Anticipated expiration: 2030-04-12
Also published as: JP5544998B2

Abstract

PROBLEM TO BE SOLVED: To efficiently create a phrase tree that can be used for encoding text data at a high compression ratio. SOLUTION: A frequency counting means 1c counts up the appearance frequency of a selected character given to the node at the determination position. When that appearance frequency reaches a predetermined threshold, a node addition means 1b adds to a phrase tree 2a a child node of the node at the determination position corresponding to the selected character. When a child node corresponding to the selected character already exists for the node at the determination position, the node addition means 1b changes the determination position to that child node. When no such child node exists and the appearance frequency has not reached the predetermined threshold, a determination position change means 1d changes the determination position to a child node of the root node.

Description

The present invention relates to a text processing apparatus, a text processing method, and a text processing program for processing text data.

In a computer system, information processing may be performed on text data. For example, information retrieval is performed on text data.
In recent years, as the applications of information and communication technology have expanded, the amount of text data to be processed has tended to increase. Various techniques have therefore been proposed for making information processing of text data more efficient. For example, a data sort processing method that sorts data at high speed using a small amount of memory has been proposed. A data set dividing method that can efficiently divide a data set along designated items has also been proposed.

When text data is searched, processing can be sped up by efficiently detecting, in the stored text data, character strings that match the search keyword. One such technique is compressed pattern matching, in which text data is compressed and stored, and pattern matching is performed on the compressed text without decompressing it.

In compressed pattern matching, each substring contained in the input text is encoded into a code that uniquely identifies that substring, so that the text data can be searched while it remains compressed. Encoding the substrings contained in the text data reduces the amount of data compared with the original text data.

When specific character strings in a text are encoded, the compression ratio improves if substrings with a high appearance frequency, or long substrings, are encoded. STVF (Suffix Tree based Variable-length-to-Fixed-length code) coding has been proposed as a technique for improving the compression ratio in this way.

In STVF coding, VF (Variable-length-to-Fixed-length code) coding is performed using, as the phrase tree, a suffix tree pruned on the basis of frequency information.

JP 2007-11784 A
JP 2007-11548 A

Takuya Kida, "Suffix Tree Based VF-Coding for Compressed Pattern Matching," Hokkaido University, TCS Research Reports, TCS-TR-A-08-36, Division of Computer Science, 18 November 2008
Takuya Kida, "Suffix Tree Based VF-Coding for Compressed Pattern Matching," Proceedings of the 2009 Data Compression Conference (DCC), 16-18 March 2009, p. 449

However, creating a phrase tree for STVF coding takes time. Creating the suffix tree in STVF coding requires analyzing the entire text data, which makes the processing slow, and the larger the amount of text data, the longer it takes.

The present invention has been made in view of the above, and an object thereof is to provide a text processing apparatus, a text processing method, and a text processing program capable of efficiently creating a phrase tree that can be used for encoding text data at a high compression ratio.

To solve the above problem, a text processing apparatus having a character selection means, a node addition means, and a frequency counting means is provided. The character selection means selects characters in order from a character string in text data. The node addition means refers to a phrase tree storage means that stores a phrase tree. In the phrase tree, a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of the root node; a child node corresponding to a character that can appear in the text data can be added to each node; and each node is given a node identifier and, for each character, the number of times that character appeared next while the node was the determination position. The node addition means starts the determination position at the root node. When the node at the determination position has a child node corresponding to the character selected by the character selection means, the node addition means moves the determination position to that child node. When the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to that node has not reached a predetermined threshold, the node addition means moves the determination position to the child node of the root node corresponding to the selected character. When the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to that node has reached the predetermined threshold, the node addition means adds to the node at the determination position a child node corresponding to the selected character, gives it a new identifier, and moves the determination position to the added child node. The frequency counting means refers to the phrase tree storage means and, when the node at the determination position has no node corresponding to the selected character, counts up the number of appearances of the selected character given to the node at the determination position.

A text processing method in which a computer executes the same processing as the above text processing apparatus is also provided, as is a text processing program that causes a computer to execute that processing.

According to the above text processing apparatus, text processing method, and text processing program, a phrase tree that can be used for encoding text data at a high compression ratio can be created efficiently.

A diagram showing an outline of the embodiment.
A diagram showing an example of the system configuration.
A diagram showing an example hardware configuration of the server used in the present embodiment.
A diagram showing an example of the node structure of a summary trie.
A diagram showing an example of a summary trie.
A block diagram showing the functions of the server.
A block diagram showing the detailed functions of the text encoding unit.
A flowchart showing the procedure of text encoding processing.
A flowchart showing the procedure of summary trie creation and text compression processing.
A flowchart showing the procedure of summary trie creation and text compression processing.
A diagram showing an example of the summary trie and compressed text in the initial state.
A diagram showing an example of the summary trie and compressed text after the 1st character is processed.
A diagram showing an example of the summary trie and compressed text after the 2nd character is processed.
A diagram showing an example of the summary trie and compressed text after the 3rd character is processed.
A diagram showing an example of the summary trie and compressed text after the 4th character is processed.
A diagram showing an example of the summary trie and compressed text after the 5th character is processed.
A diagram showing an example of the summary trie and compressed text after the 6th character is processed.
A diagram showing an example of the summary trie and compressed text after the 7th character is processed.
A diagram showing an example of the summary trie and compressed text after the 8th character is processed.
A diagram showing an example of the summary trie and compressed text after the 9th character is processed.
A diagram showing an example of the summary trie and compressed text after the 10th character is processed.
A diagram showing an example of the summary trie and compressed text after the value of position "i" exceeds the number of characters "n".
A diagram showing the functions of the server according to the third embodiment.
A block diagram showing the detailed functions of the text encoding unit according to the third embodiment.
A flowchart showing the procedure of text encoding processing according to the third embodiment.
A flowchart showing the detailed procedure of compression processing.
A block diagram showing the functions of the text encoding unit according to the fourth embodiment.
A diagram showing an example of the data structure of the threshold table storage unit.
A flowchart showing the procedure of text encoding processing according to the fourth embodiment.

Hereinafter, the present embodiment will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a diagram showing an outline of the embodiment. The text processing apparatus 1 has a character selection means 1a, a node addition means 1b, a frequency counting means 1c, and an identifier output means 1d.

The phrase tree storage means 2 stores a phrase tree 2a. In the phrase tree 2a, a plurality of nodes corresponding to characters that can appear in the text data 3 are associated in advance, in a tree structure, as children of the root node. A child node corresponding to a character that can appear in the text data 3 can be added to each node of the phrase tree 2a. Each node of the phrase tree 2a is also given a node identifier and, for each character, the number of times that character appeared next while the node was the determination position.

In the example of FIG. 1, each node is drawn as a circle containing the identifier of that node. Arrows connecting the nodes show the relationships between nodes in the tree structure of the phrase tree 2a. Of two connected nodes, the one closer to the root node is the parent node and the one farther from the root node is the child node. The character corresponding to a child node is shown beside the arrow leading to it.

The root node is provided, for example from the initial state, with a child node for each character that can be contained in the text data 3. In the example of FIG. 1, the text data 3 can contain the four characters "a", "b", "c", and "d", and the root node with identifier "0" has four child nodes with identifiers "1" to "4".
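As a rough illustration of the initial state described above, the following Python sketch builds a root node with identifier 0 and one child per character of the alphabet. The names `Node`, `children`, `counts`, and `make_initial_tree` are assumptions made for this sketch and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ident: int                                    # node identifier
    children: dict = field(default_factory=dict)  # character -> child Node
    counts: dict = field(default_factory=dict)    # character -> number of appearances


def make_initial_tree(alphabet):
    """Return a root (identifier 0) with one child per character of the alphabet."""
    root = Node(0)
    for ident, ch in enumerate(alphabet, start=1):
        root.children[ch] = Node(ident)
    return root


root = make_initial_tree("abcd")
print({ch: node.ident for ch, node in root.children.items()})
# {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```

Keeping the per-character counts on each node mirrors the description that every node carries both an identifier and the appearance counts of the characters that followed it.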

The character selection means 1a selects characters in order from the character string in the text data 3. In the example of FIG. 1, the text data 3 contains the character string "abcabcdabc". For example, the character selection means 1a selects the characters of "abcabcdabc" one at a time from the left.

The node addition means 1b refers to the phrase tree storage means 2 and starts the determination position at the root node. When the node at the determination position has a child node corresponding to the character selected by the character selection means 1a, the node addition means 1b moves the determination position to that child node.

When the node at the determination position has no child node corresponding to the selected character, the node addition means 1b determines whether the number of appearances of the selected character given to that node has reached a predetermined threshold. If it has not, the node addition means 1b moves the determination position to the child node of the root node corresponding to the selected character. If it has, the node addition means 1b adds to the node at the determination position a child node corresponding to the selected character, gives it a new identifier, and moves the determination position to the added child node.

For example, the node addition means 1b adds child nodes until the number of nodes in the phrase tree 2a reaches a predetermined number. In other words, the node addition means 1b can be configured not to add child nodes once the number of nodes exceeds the predetermined number.

A node added to the phrase tree 2a corresponds to the substring (phrase) given by the sequence of characters of the nodes on the path from the root node to that node. That is, the identifier of an added node serves as the code representing the character string corresponding to that node.
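The correspondence between a node and its phrase can be sketched by walking from the node back up to the root. The parent table below is a hypothetical tree: identifiers 0 to 4 follow the FIG. 1 description, while nodes 5 and 6 are assumed additions used only for illustration, and `PARENT` and `phrase_of` are names invented for this sketch.

```python
# Parent table: identifier -> (parent identifier, character on the connecting arrow).
PARENT = {
    1: (0, "a"), 2: (0, "b"), 3: (0, "c"), 4: (0, "d"),  # initial children of the root
    5: (1, "b"),  # hypothetical added node, child of node 1
    6: (5, "c"),  # hypothetical added node, child of node 5
}


def phrase_of(ident):
    """Return the phrase spelled by the path from the root (identifier 0) to the node."""
    chars = []
    while ident != 0:
        ident, ch = PARENT[ident]   # step to the parent, remembering the edge character
        chars.append(ch)
    return "".join(reversed(chars))


print(phrase_of(5))  # ab
print(phrase_of(6))  # abc
```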

The frequency counting means 1c refers to the phrase tree storage means 2. When the node at the determination position has no node corresponding to the selected character, the frequency counting means 1c counts up the number of appearances of the selected character given to the node at the determination position; for example, it increases that number by 1. The frequency counting means 1c may also count up the number of appearances of the selected character given to the node at the determination position when a node corresponding to the selected character does exist.

When the node at the determination position has no node corresponding to the selected character, the frequency counting means 1c can count up the number of appearances before the node addition means 1b performs its processing. In that case, the node addition means 1b determines whether the number of appearances after the count-up has reached the threshold.
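The movement, counting, and node addition rules can be summarized in a minimal Python sketch. It assumes that the count is incremented before the threshold test, that new identifiers are simply the next unused integers, and that every character of the text belongs to the alphabet. The function names and the values `threshold=2` and `max_nodes=256` are illustrative choices, not values taken from the specification.

```python
class Node:
    def __init__(self, ident):
        self.ident = ident
        self.children = {}   # character -> child Node
        self.counts = {}     # character -> number of appearances at this node


def build_phrase_tree(text, alphabet, threshold=2, max_nodes=256):
    root = Node(0)
    for ident, ch in enumerate(alphabet, start=1):
        root.children[ch] = Node(ident)
    node_count = len(alphabet) + 1
    pos = root                                      # determination position
    for ch in text:                                 # assumes ch is in the alphabet
        child = pos.children.get(ch)
        if child is not None:                       # child exists: move down
            pos = child
            continue
        pos.counts[ch] = pos.counts.get(ch, 0) + 1  # count up before the threshold test
        if pos.counts[ch] >= threshold and node_count < max_nodes:
            child = Node(node_count)                # new identifier (assumed: next integer)
            pos.children[ch] = child
            node_count += 1
            pos = child                             # move to the added child
        else:
            pos = root.children[ch]                 # move to the root's child for ch
    return root


def phrases(node, prefix=""):
    """List (identifier, phrase) pairs for every node below the given node."""
    out = []
    for ch, child in node.children.items():
        out.append((child.ident, prefix + ch))
        out.extend(phrases(child, prefix + ch))
    return sorted(out)


tree = build_phrase_tree("abcabcdabc", "abcd", threshold=2)
print(phrases(tree))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'ab'), (6, 'abc')]
```

Under these assumptions, "b" is seen twice after "a" before node 5 ("ab") is added, and "c" is seen twice after "ab" before node 6 ("abc") is added. A different threshold would give a different tree.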

The identifier output means 1d refers to the phrase tree storage means 2. When the node at the determination position has no child node corresponding to the selected character, the identifier output means 1d outputs the identifier given to the node at the determination position. When the determination position is moved because the last character of the character string in the text data 3 has been selected, the identifier output means 1d may also output the identifier given to the node at the determination position after the move. Each identifier output by the identifier output means 1d is the code for one substring of the text data 3, and the sequence of identifiers output for the entire character string in the text data 3 is the codeword string 4 that encodes the text data 3. The identifier output means 1d can, for example, store the codeword string 4 in a storage device. It can also send the codeword string 4 over a network, code by code, as the codes are generated.

With this text processing apparatus 1, the character selection means 1a selects characters in order from the character string in the text data 3. The frequency counting means 1c then counts up the number of appearances of the selected character given to the node at the determination position.

When the node at the determination position has a child node corresponding to the selected character, the node addition means 1b changes the determination position to that child node. Thus, as long as a child node corresponding to the selected character exists, the determination position moves from the root of the phrase tree toward the leaves.

When the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to that node has reached the threshold, the node addition means 1b adds a node to the phrase tree 2a. That is, the node addition means 1b adds to the phrase tree 2a a child node, corresponding to the selected character, of the node at the determination position. The node addition means 1b then moves the determination position to the added child node.

When the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to that node has not reached the threshold, the node addition means 1b moves the determination position to the child node of the root node of the phrase tree 2a corresponding to the selected character. Thus, when no child node corresponding to the selected character exists, matching against the phrase tree restarts from the root node with a substring that begins with the selected character.

Furthermore, when the node at the determination position has no child node corresponding to the selected character, the identifier output means 1d outputs the identifier given to the node at the determination position. Once all the characters in the text data 3 have been selected, the identifier output means 1d has therefore output the codeword string 4 that encodes the entire text data 3. This makes it possible to encode the text data 3 in parallel with the construction of the phrase tree 2a.

By selecting the characters in the text data 3 in this way, a phrase tree 2a made up of nodes corresponding to character strings whose appearance frequency is at or above a predetermined value is generated. The text data 3 is also encoded using the phrase tree 2a as the encoding dictionary, and the codeword string 4 is output.

Because the generated phrase tree 2a is made up of nodes corresponding to substrings with a high appearance frequency, encoding with the phrase tree 2a achieves a high compression ratio.
Furthermore, compared with STVF coding, the text processing apparatus 1 can construct the phrase tree 2a in a short time because it does not need to construct a suffix tree from the entire text data 3. Moreover, because the text data 3 can be encoded in parallel with the construction of the phrase tree 2a, the encoding itself is also efficient. For example, encoding by the text processing apparatus 1 can be about 20 times faster than STVF coding (one twentieth of the encoding time).

Unlike STVF coding, encoding of the text data 3 by the text processing apparatus 1 can be performed sequentially. With STVF coding, a suffix tree is first constructed from the text data 3 and then pruned (leaving only the nodes corresponding to high-frequency character strings) to generate the phrase tree (first pass). The text data 3 is then encoded using the generated phrase tree (second pass). Because STVF coding is a two-pass algorithm, encoding of the text data cannot start until the first pass is complete, which makes sequential processing difficult.

The text processing apparatus 1, on the other hand, selects characters in order from the beginning of the text data 3 and can construct the phrase tree 2a and encode the text data 3 from each character as it is selected. Each character in the text data 3 needs to be selected for processing only once, so the text can be processed sequentially, one character at a time. Because sequential processing is possible, substrings of text data 3 being transferred over a network, for example, can be encoded one after another and the generated codes forwarded as they become available, keeping the communication delay to a minimum.

When improving the compression ratio takes priority over sequential processing, the phrase tree 2a may be constructed first, and the characters of the text data 3 may then be selected again in order from the first character and encoded using the completed phrase tree 2a. This increases the compression ratio of the text data 3.
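A second pass over the text with a completed phrase tree might look like the following sketch. The transition table `CHILD` is a hypothetical finished tree in which nodes 5 and 6 stand for the phrases "ab" and "abc"; the function names are likewise assumptions. An identifier is emitted whenever the current node has no child for the next character, matching then restarts at the root's child for that character, and the identifier of the final position is emitted at the end of the text.

```python
# Hypothetical finished tree: (node identifier, character) -> child identifier.
CHILD = {
    (0, "a"): 1, (0, "b"): 2, (0, "c"): 3, (0, "d"): 4,
    (1, "b"): 5,   # phrase "ab"
    (5, "c"): 6,   # phrase "abc"
}


def encode_static(text, child=CHILD):
    """Encode text with a fixed tree; assumes every character has a child under the root."""
    codes = []
    pos = 0                              # start at the root
    for ch in text:
        nxt = child.get((pos, ch))
        if nxt is None:
            codes.append(pos)            # emit the phrase matched so far
            nxt = child[(0, ch)]         # restart from the root's child for ch
        pos = nxt
    if pos != 0:
        codes.append(pos)                # emit the final phrase
    return codes


def decode(codes, child=CHILD):
    """Rebuild the text by concatenating the phrase of each identifier."""
    phrase = {0: ""}
    # Assumes a child always has a larger identifier than its parent.
    for (parent, ch), ident in sorted(child.items(), key=lambda kv: kv[1]):
        phrase[ident] = phrase[parent] + ch
    return "".join(phrase[c] for c in codes)


codes = encode_static("abcabcdabc")
print(codes)          # [6, 6, 4, 6]
print(decode(codes))  # abcabcdabc
```

With this assumed tree, the ten characters are covered by four identifiers because the phrase "abc" is available from the first character onward.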

The phrase tree 2a also has properties that are advantageous for fast compressed pattern matching on the codeword string 4. Compressed pattern matching is a technique for performing pattern matching on compressed text without decompressing it. Properties of a compression technique that favor fast compressed pattern matching include VF coding, a static encoding dictionary, and a higher compression ratio. As described above, the phrase tree 2a provides a high compression ratio. The phrase tree 2a generated by the text processing apparatus 1 also supports VF coding if the identifier of each node, which is used as a code, is given a fixed number of bits. With fixed-length codes the boundaries between codes are clear, which reduces the processing load when the character string being searched for is matched against the codeword string 4. In addition, if the addition of nodes to the phrase tree 2a is stopped once the number of nodes reaches the predetermined number, the phrase tree 2a can be used as a static encoding dictionary from then on.
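The benefit of fixed-length identifiers can be illustrated with a small sketch that writes each identifier as a field of the same width. The 3-bit width and the example identifiers are assumptions (3 bits are enough for a tree of up to 8 nodes), and the bit string is shown as text only for readability.

```python
def pack_fixed(codes, bits):
    """Concatenate identifiers as fixed-width binary fields (as a '0'/'1' string)."""
    if any(c < 0 or c >= (1 << bits) for c in codes):
        raise ValueError("identifier does not fit in the chosen width")
    return "".join(format(c, "0{}b".format(bits)) for c in codes)


def unpack_fixed(bitstring, bits):
    """Split the string every `bits` characters; no delimiters are needed."""
    return [int(bitstring[i:i + bits], 2) for i in range(0, len(bitstring), bits)]


packed = pack_fixed([6, 6, 4, 6], bits=3)
print(packed)                   # 110110100110
print(unpack_fixed(packed, 3))  # [6, 6, 4, 6]
```

Because every code occupies the same number of bits, the k-th code always starts at bit offset k times the width, which is what makes the code boundaries unambiguous during matching.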

In this way, the text processing apparatus 1 can create a phrase tree 2a suited to fast compressed pattern matching.
[Second Embodiment]
In the second embodiment, the text encoding technique according to the first embodiment is applied to a system that performs searches by compressed pattern matching.

図２は、システム構成の一例を示す図である。図２の例では、サーバ１００に対して複数のクライアント２０１，２０２がネットワーク１０を介して接続されている。サーバ１００は、クライアント２０１，２０２から入力されたテキスト文書を符号化により圧縮し、保持する。また、サーバ１００は、クライアント２０１，２０２からのテキスト検索の要求に応答して、圧縮して保持したテキスト文書から検索要求（検索クエリ）に合致する文字を検索する。さらにサーバ１００は、クライアント２０１，２０２からの解凍要求に応答して、圧縮したテキスト文書を元の状態に解凍する。 FIG. 2 is a diagram illustrating an example of a system configuration. In the example of FIG. 2, a plurality of clients 201 and 202 are connected to the server 100 via the network 10. The server 100 compresses and stores the text document input from the clients 201 and 202 by encoding. Further, in response to the text search request from the clients 201 and 202, the server 100 searches for a character that matches the search request (search query) from the compressed and held text document. Further, in response to the decompression request from the clients 201 and 202, the server 100 decompresses the compressed text document to the original state.

The clients 201 and 202 are computers used by users. By operating a client 201 or 202, a user can send a text document from that client to the server 100. A user can likewise send a search request or a decompression request from a client 201 or 202 to the server 100.

FIG. 3 is a diagram illustrating an example hardware configuration of the server used in the present embodiment. The server 100 as a whole is controlled by a CPU (Central Processing Unit) 101. A RAM (Random Access Memory) 102 and a plurality of peripheral devices are connected to the CPU 101 via a bus 108.

The RAM 102 is used as the main storage device of the server 100. The RAM 102 temporarily stores at least part of the OS (Operating System) program and the application programs to be executed by the CPU 101. The RAM 102 also stores various data needed for processing by the CPU 101.

The peripheral devices connected to the bus 108 are a hard disk drive (HDD) 103, a graphic processing device 104, an input interface 105, an optical drive device 106, and a communication interface 107.

The HDD 103 magnetically writes data to and reads data from its built-in disk. The HDD 103 is used as a secondary storage device of the server 100. The HDD 103 stores the OS program, application programs, and various data. A semiconductor storage device such as a flash memory can also be used as the secondary storage device.

A monitor 11 is connected to the graphic processing device 104. The graphic processing device 104 displays images on the screen of the monitor 11 in accordance with commands from the CPU 101. Examples of the monitor 11 include a display device using a CRT (Cathode Ray Tube) and a liquid crystal display device.

A keyboard 12 and a mouse 13 are connected to the input interface 105. The input interface 105 forwards signals sent from the keyboard 12 and the mouse 13 to the CPU 101. The mouse 13 is one example of a pointing device, and other pointing devices can be used instead. Examples of other pointing devices include a touch panel, a tablet, a touch pad, and a trackball.

The optical drive device 106 reads data recorded on an optical disk 14 using laser light or the like. The optical disk 14 is a portable recording medium on which data is recorded so that it can be read by reflected light. Examples of the optical disk 14 include a DVD (Digital Versatile Disc), a DVD-RAM, a CD-ROM (Compact Disc Read Only Memory), and a CD-R (Recordable) / RW (ReWritable).

The communication interface 107 is connected to the network 10. The communication interface 107 exchanges data with other computers or communication devices via the network 10.

The hardware configuration described above can realize the processing functions of the present embodiment. Although FIG. 3 shows the hardware configuration of the server 100, the clients 201 and 202 can be realized with the same hardware configuration.

Next, the data structure of the summary trie will be described. A summary trie has an ordered tree structure called a trie. In a summary trie, node structures, each representing a node (vertex), are linked by pointers.

FIG. 4 is a diagram illustrating an example of the node structure of the summary trie. The node structure 20 contains a node ID 21, a label 22, a counter list 23 for child nodes, and a pointer list 24 to child nodes.

The node ID 21 is an identifier for the node structure 20. Node IDs are integers of 1 or more, assigned in the order in which node structures 20 are created and incremented by 1 each time. The node ID 21 is used as a compression code.

The label 22 is the character corresponding to the node structure 20. When the node structures 20 of child nodes are connected under the node structure of the parent node in a predetermined arrangement determined by their characters, the character corresponding to each node structure can be identified from its position in that arrangement. In that case, the node structure 20 need not have the label 22.

The counter list 23 for child nodes holds one counter for each kind of character that can occur in the text data to be processed. For example, if the characters that can occur in the text data are the letters of the alphabet, one counter is prepared per letter. If the text data can contain 1-byte characters of a given code system, 256 counters are prepared. Each counter in the counter list 23 is associated with a child node. The counter for a child node records the number of times the character of that child node has appeared immediately after the character string formed by the nodes on the path from the root node to the node of the node structure 20. The initial value of each counter is “0”.

The pointer list 24 to child nodes holds one pointer for each kind of character that can occur in the text data to be processed. For example, if the characters that can occur in the text data are the letters of the alphabet, one pointer is prepared per letter. If the text data can contain 1-byte characters of a given code system, 256 pointers are prepared. Each pointer in the pointer list 24 is associated with a child node. The pointer for a child node holds information that uniquely identifies the node structure of that child node. The initial value of each pointer is “NULL”.
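The node structure described above can be sketched in Python as follows. This is an illustrative sketch only: the class name `Node`, the field names, the four-letter example alphabet, and the use of dictionaries keyed by character (in place of fixed-size arrays with one slot per character) are all assumptions, not part of the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

ALPHABET = "abcd"  # assumed example alphabet; 256 entries for 1-byte characters


@dataclass
class Node:
    node_id: int   # node ID 21, used as the compression code
    label: str     # label 22, the character this node stands for
    # counter list 23: one counter per possible character, initially 0
    counters: Dict[str, int] = field(
        default_factory=lambda: {ch: 0 for ch in ALPHABET})
    # pointer list 24: one pointer per possible character, initially None (NULL)
    children: Dict[str, Optional["Node"]] = field(
        default_factory=lambda: {ch: None for ch in ALPHABET})


root = Node(node_id=0, label="root")
```

Each `Node` gets its own counter and pointer tables through `default_factory`, so updating one node never affects another.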

Linking nodes that have this node structure yields a summary trie, which is an ordered tree structure (a trie).
FIG. 5 is a diagram illustrating an example of a summary trie. In the example of FIG. 5, the characters that can occur in the text data are “a, b, c, d”, and the set Σ containing these characters (Σ = {a, b, c, d}) is defined. The number of elements in Σ (|Σ|) is therefore “4”.

The summary trie 30 consists of a plurality of nodes 31 to 37. Each of the nodes 31 to 37 has a node structure like the one shown in FIG. 4. The counter list for child nodes in each of the nodes 31 to 37 has a counter for each of a, b, c, and d. The pointer list to child nodes in each of the nodes 31 to 37 has a pointer for each of a, b, c, and d.

The node 31 is the root node, and its node ID is “0”. The label of the node 31, the root node, is set to “root”. The node 32 corresponds to the character “a”; its label is “a” and its node ID is “1”. The node 33 corresponds to the character “b”; its label is “b” and its node ID is “2”. The node 34 corresponds to the character “c”; its label is “c” and its node ID is “3”. The node 35 corresponds to the character “d”; its label is “d” and its node ID is “4”. The node 36 corresponds to the character “b”; its label is “b” and its node ID is “5”. The node 37 corresponds to the character “c”; its label is “c” and its node ID is “6”.

All counters in the counter list for child nodes in the node 31 have the value “0”. The pointer list to child nodes in the node 31 holds pointers to its four child nodes, the nodes 32 to 35.

In FIG. 5, each pointer and the node it points to are connected by an arrow. A “NULL” pointer is indicated by “•” in the figure.
Among the counters in the counter list for child nodes in the node 32, only the counter for the character “b” is set to “2”; the other counters are “0”. In the example of FIG. 5, a child node is created for a character when the corresponding counter reaches “2”. Accordingly, a node 36 corresponding to the character “b” has been created as a child node of the node 32. The pointer for the character “b” in the node 32 points to the node 36.

None of the nodes 33 to 35 has a counter that has reached “2”, so they have no child nodes. In the node 36, the counter for the character “c” is “2”, and the node 37 has been created as the child node corresponding to the character “c”.
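The counting rule of the FIG. 5 example (a child node is created when a counter reaches “2”) can be sketched as below. The `Node` class and the function `count_and_maybe_create` are hypothetical names, and the sketch assumes the caller has already confirmed that no child exists for the character; what happens to the decision position afterwards is outside this sketch.

```python
class Node:
    def __init__(self, node_id, label, alphabet):
        self.node_id = node_id
        self.label = label
        self.counters = {ch: 0 for ch in alphabet}     # all counters start at 0
        self.children = {ch: None for ch in alphabet}  # None plays the role of NULL


def count_and_maybe_create(parent, ch, alpha, next_id, alphabet):
    """Count one occurrence of ch after parent's string.

    Returns the new child node if the counter just reached alpha, else None.
    Assumes parent.children[ch] is None when called.
    """
    parent.counters[ch] += 1
    if parent.counters[ch] == alpha:
        child = Node(next_id, ch, alphabet)
        parent.children[ch] = child
        return child
    return None


alphabet = "abcd"
node_a = Node(1, "a", alphabet)   # plays the role of node 32 in FIG. 5
first = count_and_maybe_create(node_a, "b", alpha=2, next_id=5, alphabet=alphabet)
second = count_and_maybe_create(node_a, "b", alpha=2, next_id=5, alphabet=alphabet)
```

The first occurrence of “b” after “a” only raises the counter; the second one reaches the threshold and creates the node with ID 5, matching the node 36 in FIG. 5.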

A character string can be encoded using the summary trie 30. For example, the character string “a, b, c” corresponds to the path from the root node 31 to the node 37. The character string “a, b, c” can therefore be encoded as the node ID “6” of the node 37.
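The encoding of “a, b, c” as “6” can be reproduced with a small Python sketch that builds the trie of FIG. 5 by hand and walks it. The names `Node`, `build_figure5_trie`, and `code_for` are illustrative assumptions; only the node IDs, labels, and counter values come from the description.

```python
ALPHABET = "abcd"


class Node:
    def __init__(self, node_id, label):
        self.node_id = node_id
        self.label = label
        self.counters = {ch: 0 for ch in ALPHABET}
        self.children = {ch: None for ch in ALPHABET}


def build_figure5_trie():
    """Build the seven-node summary trie of FIG. 5 (node IDs 0 to 6)."""
    root = Node(0, "root")
    for node_id, ch in enumerate(ALPHABET, start=1):
        root.children[ch] = Node(node_id, ch)   # nodes 32 to 35
    a = root.children["a"]
    a.counters["b"] = 2
    a.children["b"] = Node(5, "b")              # node 36
    ab = a.children["b"]
    ab.counters["c"] = 2
    ab.children["c"] = Node(6, "c")             # node 37
    return root


def code_for(root, s):
    """Follow s from the root and return the node ID at the end of the path."""
    node = root
    for ch in s:
        node = node.children[ch]
        if node is None:
            raise KeyError(f"no path for {s!r}")
    return node.node_id


trie = build_figure5_trie()
code = code_for(trie, "abc")   # the path root -> a -> b -> c ends at node ID 6
```

A string with no matching path, such as “ba” in this trie, has no single code and would have to be split into shorter phrases.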

In the second embodiment, the characters of the text data are read one at a time, and the summary trie 30 is built and the character string is encoded as the reading proceeds. Because the summary trie can be built and the character string encoded as a stream process, the text data can be encoded more efficiently.

FIG. 6 is a block diagram illustrating the functions of the server. The server 100 includes an encoding dictionary storage unit 110, a compressed text storage unit 120, a text encoding unit 130, a search unit 140, and a decompression unit 150. In the example of FIG. 6, the uncompressed text 40 is input from the client 201, and the search query 41 or the decompression request 43 is input from the client 202.

The encoding dictionary storage unit 110 stores the summary trie used as the encoding dictionary. For example, part of the storage area of the RAM 102 or the HDD 103 is used as the encoding dictionary storage unit 110.

The compressed text storage unit 120 stores text data whose size has been reduced by encoding (compressed text). For example, part of the storage area of the RAM 102 or the HDD 103 is used as the compressed text storage unit 120.

The text encoding unit 130 encodes the character strings in the uncompressed text 40 input from the client 201, thereby compressing the data. While encoding the character strings, the text encoding unit 130 creates an encoding dictionary in summary trie form. The text encoding unit 130 stores the created summary trie in the encoding dictionary storage unit 110, and stores the compressed text in the compressed text storage unit 120.

In response to a search query input from the client 202, the search unit 140 refers to the compressed text in the compressed text storage unit 120 and identifies the position, in the uncompressed text 40, of any character string that matches the character string specified by the search query. The search over the compressed text can use, for example, a pattern matching technique for VF codes, that is, codes produced by a VF coding technique that encodes a variable-length source into fixed-length codes. Pattern matching on compressed text can be performed without decompressing the text. For example, pattern matching using the SIGMA search algorithm, the KMP (Knuth-Morris-Pratt) algorithm, or the AC (Aho-Corasick) algorithm is possible. The search unit 140 transmits the search result 42 to the client 202. The search result 42 indicates, for example, the position of each matched character string in the uncompressed text 40 (from which character to which character).

In response to a decompression request input from the client 202, the decompression unit 150 decompresses the compressed text in the compressed text storage unit 120. The decompression unit 150 then transmits the resulting decompressed text 44 to the client 202.

Next, details of the text encoding unit 130 will be described.
FIG. 7 is a block diagram showing the detailed functions of the text encoding unit. The text encoding unit 130 includes a frequency threshold storage unit 131, a maximum code number storage unit 132, a character selection unit 133, a node creation unit 134, a frequency count unit 135, and a code output unit 136.

The frequency threshold storage unit 131 stores the threshold for the appearance frequency of a character string. For example, part of the storage area of the RAM 102 or the HDD 103 is used as the frequency threshold storage unit 131. A node is created for each character string whose appearance frequency exceeds the threshold.

The maximum code number storage unit 132 stores the maximum code number, which is the maximum number of codes the summary trie may contain. For example, part of the storage area of the RAM 102 or the HDD 103 is used as the maximum code number storage unit 132. Once the number of codes in the summary trie reaches the maximum code number, the summary trie is no longer updated.

The character selection unit 133 selects characters one at a time, starting from the beginning of the input uncompressed text 40. The character selection unit 133 passes each selected character to the node creation unit 134, the frequency count unit 135, and the code output unit 136. The character selection unit 133 can, for example, store the input uncompressed text 40 temporarily in the RAM 102 and select characters one at a time from the RAM 102.

Each time the node creation unit 134 receives a character from the character selection unit 133, it determines whether a node needs to be created in the summary trie. The node creation unit 134 manages information (a node pointer) indicating which node of the summary trie is the decision position. The node creation unit 134 determines whether a node needs to be created by comparing the node structure of the node at the decision position with the received character. If a node needs to be created, the node creation unit 134 creates a new node structure and adds it to the summary trie stored in the encoding dictionary storage unit 110.

When the frequency count unit 135 receives a character from the character selection unit 133, it updates a counter in the node at the decision position in the summary trie. For example, if no node has yet been created for the not-yet-encoded partial character string that ends with the received character, the frequency count unit 135 increments the counter corresponding to that character.

Each time the code output unit 136 receives a character from the character selection unit 133, it determines whether a code for the not-yet-encoded partial character string needs to be output. When outputting a code, the code output unit 136 stores the node ID of a node in the summary trie in the compressed text storage unit 120 as the code. For example, the code output unit 136 outputs a code when the node at the decision position is a leaf of the summary trie and no node for the character selected by the character selection unit 133 is newly created as a child of the node at the decision position. In this case, the code output unit 136 encodes the not-yet-encoded partial character string, up to the character immediately before the received character, as the node ID of the node at the decision position. The code output unit 136 then stores the code in the compressed text storage unit 120.

Next, the procedure of the text encoding process will be described in detail. The inputs of the text encoding process are the uncompressed text 40, the threshold, and the maximum code number.
Let the number of characters in the uncompressed text 40 be “n” (n is an integer of 1 or more). The character string in the uncompressed text 40 is then defined, using an array, as T = T[1], ..., T[n].

The threshold is read from the frequency threshold storage unit 131 and assigned to the variable “α”. The maximum code number is read from the maximum code number storage unit 132 and assigned to the variable “K”.

The outputs of the text encoding process are the summary trie and the compressed text. The summary trie is denoted “D”. As shown in FIG. 5, the summary trie “D” consists of a plurality of node structures linked by pointers. The compressed text is denoted “C = Compress(T)”. The compressed text “C” is the sequence of codes generated by encoding, arranged in the order in which they were generated.

The text encoding process uses a node pointer “P” that indicates the node at the decision position in the summary trie “D”. For example, the node ID of the node at the decision position is assigned to the node pointer “P”.

The current size of the summary trie is denoted “k”. The size “k” of the summary trie “D” is the number of codes it contains. When, as in the example of FIG. 5, the root node of the summary trie has node ID “0” and the other nodes are given node IDs in order starting from “1”, the number of codes in the summary trie “D” equals the largest node ID.

The position in the text data of the character currently being processed is denoted “i”. The position i takes integer values of 0 or more. The position of a character is its ordinal position counted from the beginning of the text. The character to be processed is obtained by using the position “i” as the index into the array “T”.

FIG. 8 is a flowchart showing the procedure of the text encoding process. The process shown in FIG. 8 is described below in order of step number.
[Step S11] Each element of the text encoding unit 130 initializes its information. For example, the character selection unit 133 initializes the character position “i” to “0”.

The node creation unit 134 initializes the summary trie “D” in the encoding dictionary storage unit 110. The initialized summary trie “D” represents the state in which every character of the alphabet is registered with a count of “0”. For example, the node creation unit 134 creates the node structure of the root node and links under it the node structure of a child node for each element of the set Σ. The node creation unit 134 sets every counter in every node to “0” and sets every pointer in every node other than the root node to “NULL”. The number of codes in the initial state is the number of elements in Σ, and this number is set as the size “k” of the summary trie.

The node creation unit 134 also initializes the node pointer “P”. The initial value of the node pointer “P” indicates the root node of the summary trie “D”, which is written “P = root(D)”.

The code output unit 136 initializes the compressed text in the compressed text storage unit 120. The initialized compressed text is set to the empty string, which is written “C = ε”, where ε denotes the empty string.
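The initialization of step S11 can be sketched in Python as follows. This is a sketch under stated assumptions: the `Node` class, the function `initialize`, and the dictionary holding i, D, P, k, and C are hypothetical, and the compressed text C is modeled as a list of node IDs rather than a stored byte string.

```python
class Node:
    def __init__(self, node_id, label, alphabet):
        self.node_id = node_id
        self.label = label
        self.counters = {ch: 0 for ch in alphabet}     # every counter starts at 0
        self.children = {ch: None for ch in alphabet}  # every pointer starts as NULL


def initialize(alphabet):
    """Return the initial state of step S11 for the given alphabet (the set Σ)."""
    root = Node(0, "root", alphabet)
    # One child per character of Σ, with node IDs 1, 2, ..., |Σ|.
    for node_id, ch in enumerate(alphabet, start=1):
        root.children[ch] = Node(node_id, ch, alphabet)
    return {
        "i": 0,              # character position
        "D": root,           # summary trie
        "P": root,           # node pointer, P = root(D)
        "k": len(alphabet),  # initial number of codes = |Σ|
        "C": [],             # compressed text, initially empty (C = ε)
    }


state = initialize("abcd")
```

With Σ = {a, b, c, d}, the initial trie has four codes, matching the children of the root node in FIG. 5.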

[Step S12] The character selection unit 133 increments the position “i” (i = i + 1). The first character of the uncompressed text 40 thus becomes the character to be processed.
[Step S13] The node creation unit 134 determines whether the position “i” is greater than the number of characters “n”. If the position “i” is greater than “n”, the process proceeds to step S20. If the position “i” is less than or equal to “n”, the process proceeds to step S14.

[Step S14] If the position “i” is less than or equal to the number of characters “n”, the node creation unit 134, the frequency count unit 135, and the code output unit 136 cooperate to execute the summary trie creation and text compression process. Details of this process are described later (see FIG. 9).

[Step S15] The character selection unit 133 increments the position “i” (i = i + 1). The next character of the uncompressed text 40 thus becomes the character to be processed.
[Step S16] The node creation unit 134 determines whether the size “k” of the summary trie “D” is less than the maximum code number “K”. If “k” is less than “K”, the process proceeds to step S13. If “k” is greater than or equal to “K”, the process proceeds to step S17.

[Step S17] The node creation unit 134 determines whether the position “i” is greater than the number of characters “n”. If the position “i” is greater than “n”, the process proceeds to step S20. If the position “i” is less than or equal to “n”, the process proceeds to step S18.

[Step S18] If the position “i” is less than or equal to the number of characters “n”, the code output unit 136 executes the text compression process. Details of this process are described later (see FIG. 10).
[Step S19] The character selection unit 133 increments the position “i” (i = i + 1). The next character of the uncompressed text 40 thus becomes the character to be processed. The process then proceeds to step S17.

[Step S20] When the position “i” becomes greater than the number of characters “n”, the code output unit 136 appends the code of the node indicated by the node pointer “P” to the end of the compressed text “C”. The process then ends.
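The control flow of steps S11 to S20 can be summarized with the following Python sketch. It shows only the two-phase loop structure; the per-character work of steps S14 and S18 and the final output of step S20 are passed in as callables. The function name `run_encoding` and all parameter names are assumptions made for illustration.

```python
def run_encoding(n, max_codes, trie_size, build_and_compress, compress_only, flush):
    """Drive the two phases of FIG. 8 over character positions 1..n.

    trie_size()            -> current number of codes k
    build_and_compress(i)  -> step S14 for position i (trie may still grow)
    compress_only(i)       -> step S18 for position i (trie is frozen)
    flush()                -> step S20, output the code of the current node
    """
    i = 0                              # S11
    i += 1                             # S12
    # Phase 1: build the trie while encoding (S13 to S16).
    while i <= n:                      # S13
        build_and_compress(i)          # S14
        i += 1                         # S15
        if trie_size() >= max_codes:   # S16: k has reached K
            break
    # Phase 2: encode with the now static dictionary (S17 to S19).
    while i <= n:                      # S17
        compress_only(i)               # S18
        i += 1                         # S19
    flush()                            # S20
```

Splitting the process into two loops mirrors the flowchart: once k reaches K, the remaining characters are encoded without any further trie updates, so the summary trie acts as a static encoding dictionary from that point on.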

Next, details of the summary trie creation and text compression process (step S14) will be described.
FIG. 9 is a flowchart showing the procedure of the summary trie creation and text compression process. The process shown in FIG. 9 is described below in order of step number.

[Step S21] The node creation unit 134 determines whether the node indicated by the node pointer “P” has a child node corresponding to the character to be processed, “T[i]”. Specifically, the node creation unit 134 receives the character T[i] from the character selection unit 133. The node creation unit 134 then refers to the node structure of the node indicated by the node pointer “P” and checks the pointer corresponding to the received character. If that pointer holds a valid value other than “NULL”, a child node corresponding to the received character exists.

If the child node exists, the process proceeds to step S22. If it does not exist, the process proceeds to step S23.
[Step S22] If the child node exists, the node creation unit 134 makes the node pointer “P” point to the child node that corresponds to the character “T[i]” under the node currently designated by “P”. The summary trie creation and text compression process then ends, and the process proceeds to step S15 in FIG. 8.

[Step S23] When no child node exists, the frequency count unit 135 increments by one the value of the counter corresponding to the acquired character in the node designated by the current node pointer “P”. The value of the counter indicating the appearance frequency of the character “T[i]” at the node indicated by the node pointer “P” can be written as “count(P, T[i])”.

[Step S24] The node creation unit 134 determines whether the value of the counter (count(P, T[i])) corresponding to the acquired character in the node designated by the current node pointer “P” is equal to the threshold “α”. If the value of the counter is equal to the threshold “α”, the process proceeds to step S25. If not, the process proceeds to step S28.

[Step S25] When the value of the counter is equal to the threshold “α”, the node creation unit 134 creates a child node corresponding to the character “T[i]” under the node designated by the current node pointer “P”. Specifically, the node creation unit 134 generates a node structure to which a new node ID is assigned. For example, the node creation unit 134 stores the node ID of the most recently created node and uses that node ID plus 1 as the node ID of the new node. The label of the created node is the character “T[i]”. Initial values are set in each counter and each pointer in the created node structure. The node creation unit 134 then sets a value indicating the newly created node in the pointer corresponding to the character “T[i]” in the node designated by the current node pointer “P”.

[Step S26] The node creation unit 134 sets the node pointer “P” to the child node newly created in step S25.
[Step S27] The node creation unit 134 increments the value of the size “k” of the summary trie “D”. Thereafter, the summary trie creation and text compression processing ends, and the process proceeds to step S15 in FIG. 8.

[Step S28] When it is determined in step S24 that the value of the counter is not equal to the threshold “α”, the code output unit 136 writes the code of the node designated by the node pointer “P” at the end of the compressed text “C”.

[Step S29] The node creation unit 134 sets the node pointer “P” to the child node of the root node that corresponds to the character “T[i]”. Thereafter, the summary trie creation and text compression processing ends, and the process proceeds to step S15 in FIG. 8.
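The branching in steps S21 to S29 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the names `Node`, `SummaryTrie`, and `build_step` are hypothetical, the root node is given ID 0, the node ID doubles as the output code, and the size “k” is assumed to count the root node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int                                   # assumed to double as the output code
    label: str
    children: dict = field(default_factory=dict)   # character -> child Node (missing = NULL)
    counts: dict = field(default_factory=dict)     # character -> count(P, character)


class SummaryTrie:
    """Summary trie D with one child of the root per alphabet character."""

    def __init__(self, alphabet):
        self.root = Node(0, "")                    # root ID 0 is an assumption
        for ch in alphabet:
            self.root.children[ch] = Node(len(self.root.children) + 1, ch)
        self.size = 1 + len(alphabet)              # k; counting the root is an assumption
        self.next_id = self.size


def build_step(trie, p, ch, alpha, out):
    """Run steps S21 to S29 once for character ch and return the new P."""
    child = p.children.get(ch)
    if child is not None:                          # S21 -> S22: follow the existing child
        return child
    p.counts[ch] = p.counts.get(ch, 0) + 1         # S23: increment count(P, ch)
    if p.counts[ch] == alpha:                      # S24: threshold reached
        child = Node(trie.next_id, ch)             # S25: create the child node
        trie.next_id += 1
        p.children[ch] = child
        trie.size += 1                             # S27: k = k + 1
        return child                               # S26: P moves to the new node
    out.append(p.node_id)                          # S28: write the code of P to C
    return trie.root.children[ch]                  # S29: restart from the root's child
```

Calling `build_step` once per character, starting from `P = trie.root`, corresponds to repeating step S14.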

Next, details of the text compression processing (step S18) will be described.
FIG. 10 is a flowchart showing the procedure of the text compression processing. In the following, the process illustrated in FIG. 10 will be described in order of step number.

[Step S31] The code output unit 136 determines whether a child node corresponding to the processing target character “T[i]” exists among the child nodes of the node indicated by the node pointer “P”. If a child node exists, the process proceeds to step S32. If no child node exists, the process proceeds to step S33.

[Step S32] When a child node exists, the code output unit 136 sets the node pointer “P” to the child node that corresponds to the character “T[i]” under the node currently designated by the node pointer “P”. Thereafter, the text compression processing ends, and the process proceeds to step S19 in FIG. 8.

[Step S33] When no child node exists, the code output unit 136 writes the code of the node designated by the node pointer “P” at the end of the compressed text “C”.
[Step S34] The code output unit 136 sets the node pointer “P” to the child node of the root node that corresponds to the character “T[i]”. Thereafter, the text compression processing ends, and the process proceeds to step S19 in FIG. 8.
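Steps S31 to S34 differ from the trie-building step only in that no counter is updated and no node is created. A minimal Python sketch, assuming a hypothetical `Node` class whose ID doubles as the output code, could look like this:

```python
class Node:
    """Hypothetical trie node: an ID used as the code, plus child pointers."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.children = {}        # character -> child Node


def compress_step(root, p, ch, out):
    """Run steps S31 to S34 once for character ch and return the new P."""
    child = p.children.get(ch)
    if child is not None:         # S31 -> S32: extend the current match
        return child
    out.append(p.node_id)         # S33: write the code of P to C
    return root.children[ch]      # S34: restart from the root's child for ch
```

Because the trie is read-only here, its size stays fixed while compression continues.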

In this way, the summary trie “D” and the compressed text “C” are created. Because a child node is added to the summary trie “D” only when the value of a counter reaches the threshold “α”, nodes are created only for character strings with a high appearance frequency.

Moreover, when the size “k” of the summary trie “D” becomes equal to or greater than the maximum code number “K”, the summary trie “D” is no longer updated, and only the creation of the compressed text “C” by the text compression processing continues. That is, once the size of the summary trie “D” reaches the maximum code number, no new node is created even if a node counter of the summary trie “D” would reach the threshold “α”. As a result, the size of the summary trie “D” is kept at or below a predetermined value.
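The effect of the maximum code number can be illustrated with a small end-to-end encoder. The sketch below is one reading of the procedure, not the embodiment itself: the function name `encode` and the dictionary-based node layout are hypothetical, the root is given ID 0, and the size “k” is assumed to include the root.

```python
def encode(text, alphabet, alpha, max_codes):
    """Return (codes, trie_size) for text; trie growth stops once k >= K."""

    def new_node(node_id):
        return {"id": node_id, "children": {}, "counts": {}}

    root = new_node(0)                          # root ID 0 is an assumption
    for n, ch in enumerate(alphabet, start=1):
        root["children"][ch] = new_node(n)
    size = 1 + len(alphabet)                    # k, assumed to include the root
    codes = []
    p = root
    for ch in text:
        child = p["children"].get(ch)
        if child is not None:                   # existing child: extend the match
            p = child
            continue
        if size < max_codes:                    # trie may still grow
            p["counts"][ch] = p["counts"].get(ch, 0) + 1
            if p["counts"][ch] == alpha:
                node = new_node(size)           # IDs run 0..size-1, so size is free
                p["children"][ch] = node
                p = node
                size += 1
                continue
        codes.append(p["id"])                   # write the code of P
        p = root["children"][ch]                # restart from the root's child
    if p is not root:                           # final code (guard for empty input)
        codes.append(p["id"])
    return codes, size
```

With `max_codes=5` and the alphabet “abcd”, the trie never grows beyond its initial five nodes, so every character is emitted as its own single-character code.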

Next, a specific example of generating a summary trie and compressed text will be described.
FIG. 11 is a diagram illustrating an example of the summary trie and the compressed text in the initial state. In the example of FIG. 11, it is assumed that the characters that can be included in the uncompressed text 40 are the four characters “a, b, c, d”. That is, the set Σ = {a, b, c, d}, and the alphabet size |Σ| = 4. The character string in the input uncompressed text 40 is assumed to be “abcabcdabc”. That is, the array T = “abcabcdabc”. Further, it is assumed that the threshold is “α = 2” and the maximum code number is “K = 256”. A maximum code number of 256 (2 to the 8th power) means that each code has a fixed length of 1 byte.
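The relation between the maximum code number and the code length can be checked with a short calculation. The helpers below are hypothetical and assume byte-aligned, big-endian packing, which the description does not specify beyond the 1-byte case.

```python
def code_width_bits(max_codes):
    """Bits needed to distinguish max_codes different codes (0 .. max_codes - 1)."""
    return max(1, (max_codes - 1).bit_length())


def pack_codes(codes, max_codes):
    """Pack codes at a fixed width, rounded up to whole bytes (an assumption)."""
    width = (code_width_bits(max_codes) + 7) // 8
    return b"".join(code.to_bytes(width, "big") for code in codes)
```

For K = 256 the width is 8 bits, so a sequence of seven codes occupies seven bytes.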

In the initialization phase, the summary trie 50 in its initial state is created. The created summary trie 50 has five nodes 51 to 55. The node 51 is the root node. The nodes 52 to 55 are child nodes of the node 51 and correspond to the characters “a, b, c, d”, respectively. Pointers to the child nodes 52 to 55 are set in the node 51.

In the initial state, the values of the counters in the nodes 51 to 55 are all “0”. Also in the initial state, the node pointer “P” points to the node 51 (P = root(D)). In the example of FIG. 11, the node indicated by the node pointer “P” is marked with a flag bearing the letter “P”.
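The initial state described above can be sketched with a node structure that holds one counter and one pointer per character. The class name `NodeStruct`, the array layout, and the root ID 0 are assumptions made for illustration; the IDs 1 to 4 follow the codes used for the nodes 52 to 55 in the example.

```python
ALPHABET = "abcd"                                 # the set Σ of the example


class NodeStruct:
    """Node structure: ID, label, and one counter and pointer per character."""

    def __init__(self, node_id, label):
        self.node_id = node_id
        self.label = label
        self.counters = [0] * len(ALPHABET)       # count(node, ch), initially 0
        self.pointers = [None] * len(ALPHABET)    # child per character, None = NULL


def initial_trie():
    """Build the root (node 51) with one child per character (nodes 52 to 55)."""
    root = NodeStruct(0, None)                    # root ID 0 is an assumption
    for idx, ch in enumerate(ALPHABET):
        root.pointers[idx] = NodeStruct(idx + 1, ch)
    return root


root = initial_trie()
p = root            # P = root(D)
compressed = []     # C = ε
i = 0               # position of the character being processed
```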

An empty character string (ε) is set in the compressed text 60 by the initialization process. In addition, “0” is set as the position “i” of the character to be processed.
The position “i” is incremented from the state shown in FIG. 11 so that i = 1. Then, the first character “a” in the uncompressed text 40 is selected as a processing target.

FIG. 12 is a diagram illustrating an example of the summary trie and the compressed text after the first character is processed. The root node 51, which is designated by the node pointer “P” in the initial state, has child nodes corresponding to all characters. Therefore, in accordance with the first character “a”, the designation destination of the node pointer “P” is changed to the node 52 corresponding to “a” among the child nodes of the node 51. At this time, no code is written to the compressed text 60.

The position “i” is incremented from the state shown in FIG. 12, so that i = 2. Then, the second character “b” in the uncompressed text 40 is selected as the processing target.
FIG. 13 is a diagram illustrating an example of the summary trie and the compressed text after the second character is processed. The node 52 has no child node corresponding to the character “b”. Therefore, the counter corresponding to the character “b” in the node 52 is counted up to “1” (count(1, b) = 1). At this time, count(1, b) < α. Therefore, the code “1” corresponding to the node 52 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 1”. Then, the designation destination of the node pointer “P” is changed to the node 53 corresponding to “b” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 13, so that i = 3. Then, the third character “c” in the uncompressed text 40 is selected as the processing target.
FIG. 14 is a diagram illustrating an example of the summary trie and the compressed text after the third character is processed. The node 53 has no child node corresponding to the character “c”. Therefore, the counter corresponding to the character “c” in the node 53 is counted up to “1” (count(2, c) = 1). At this time, count(2, c) < α. Therefore, the code “2” corresponding to the node 53 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 12”. Then, the designation destination of the node pointer “P” is changed to the node 54 corresponding to “c” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 14, so that i = 4. Then, the fourth character “a” in the uncompressed text 40 is selected as the processing target.
FIG. 15 is a diagram illustrating an example of the summary trie and the compressed text after the fourth character is processed. The node 54 has no child node corresponding to the character “a”. Therefore, the counter corresponding to the character “a” in the node 54 is counted up to “1” (count(3, a) = 1). At this time, count(3, a) < α. Therefore, the code “3” corresponding to the node 54 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 123”. Then, the designation destination of the node pointer “P” is changed to the node 52 corresponding to “a” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 15, so that i = 5. Then, the fifth character “b” in the uncompressed text 40 is selected as the processing target.
FIG. 16 is a diagram illustrating an example of the summary trie and the compressed text after the fifth character is processed. The node 52 has no child node corresponding to the character “b”. Therefore, the counter corresponding to the character “b” in the node 52 is counted up to “2” (count(1, b) = 2). At this time, count(1, b) = α. Therefore, a node 56 is created as the child node of the node 52 corresponding to the character “b”. At this time, a value indicating the node 56 is set in the pointer corresponding to the character “b” in the node 52. The newly created node 56 is given the node ID “5” and the label “b”. Initial values are set in the counters and pointers in the node 56. Then, the designation destination of the node pointer “P” is changed to the newly created node 56. At this time, no code is written to the compressed text 60.

The position “i” is incremented from the state shown in FIG. 16, so that i = 6. Then, the sixth character “c” in the uncompressed text 40 is selected as the processing target.
FIG. 17 is a diagram illustrating an example of the summary trie and the compressed text after the sixth character is processed. The node 56 has no child node corresponding to the character “c”. Therefore, the counter corresponding to the character “c” in the node 56 is counted up to “1” (count(5, c) = 1). At this time, count(5, c) < α. Therefore, the code “5” corresponding to the node 56 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 1235”. Then, the designation destination of the node pointer “P” is changed to the node 54 corresponding to “c” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 17, so that i = 7. Then, the seventh character “d” in the uncompressed text 40 is selected as the processing target.
FIG. 18 is a diagram illustrating an example of the summary trie and the compressed text after the seventh character is processed. In the state of FIG. 17, the node 54 has no child node corresponding to the character “d”. Therefore, the counter corresponding to the character “d” in the node 54 is counted up to “1” (count(3, d) = 1). At this time, count(3, d) < α. Therefore, the code “3” corresponding to the node 54 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 12353”. Then, the designation destination of the node pointer “P” is changed to the node 55 corresponding to “d” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 18, so that i = 8. Then, the eighth character “a” in the uncompressed text 40 is selected as the processing target.
FIG. 19 is a diagram illustrating an example of the summary trie and the compressed text after the eighth character is processed. The node 55 has no child node corresponding to the character “a”. Therefore, the counter corresponding to the character “a” in the node 55 is counted up to “1” (count(4, a) = 1). At this time, count(4, a) < α. Therefore, the code “4” corresponding to the node 55 is written at the end of the compressed text 60. As a result, the compressed text 60 becomes “C = 123534”. Then, the designation destination of the node pointer “P” is changed to the node 52 corresponding to “a” among the child nodes of the node 51.

The position “i” is incremented from the state shown in FIG. 19, so that i = 9. Then, the ninth character “b” in the uncompressed text 40 is selected as the processing target.
FIG. 20 is a diagram illustrating an example of the summary trie and the compressed text after the ninth character is processed. The node 52 has a child node corresponding to the character “b”. Therefore, in accordance with the ninth character “b”, the designation destination of the node pointer “P” is changed to the node 56 corresponding to “b” among the child nodes of the node 52. At this time, no code is written to the compressed text 60.

The position “i” is incremented from the state shown in FIG. 20, so that i = 10. Then, the tenth character “c” in the uncompressed text 40 is selected as the processing target.
FIG. 21 is a diagram illustrating an example of the summary trie and the compressed text after the tenth character is processed. In the state of FIG. 20, the node 56 has no child node corresponding to the character “c”. Therefore, the counter corresponding to the character “c” in the node 56 is counted up to “2” (count(5, c) = 2). At this time, count(5, c) = α. Therefore, a node 57 is created as the child node of the node 56 corresponding to the character “c”. At this time, a value indicating the node 57 is set in the pointer corresponding to the character “c” in the node 56. The newly created node 57 is given the node ID “6” and the label “c”. Initial values are set in the counters and pointers in the node 57. Then, the designation destination of the node pointer “P” is changed to the newly created node 57. At this time, no code is written to the compressed text 60.

The position “i” is incremented from the state shown in FIG. 21, so that i = 11. The text has no eleventh character. Therefore, the value of the position “i” is larger than the number of characters “n”.

FIG. 22 is a diagram illustrating an example of the summary trie and the compressed text after the value of the position “i” has become larger than the number of characters “n”. When the value of the position “i” becomes larger than the number of characters “n”, the construction of the summary trie 50 ends. Then, the code of the node 57 indicated by the node pointer “P” is written to the compressed text 60. As a result, the compressed text 60 becomes “C = 1235346”.
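The walkthrough of FIGS. 12 to 22 can be reproduced with a short trace. The sketch below is illustrative only: the function name `trace` and the integer-keyed dictionaries are hypothetical, the root is given ID 0, and the maximum code number is ignored because K = 256 is never reached in this example. Codes are joined into a string, which is readable only while every ID is a single digit.

```python
def trace(text, alphabet="abcd", alpha=2):
    """Yield (i, character, ID of P, C as a string) after each character."""
    children = {0: {ch: n for n, ch in enumerate(alphabet, start=1)}}
    counts = {}                              # (node ID, character) -> counter
    next_id = len(alphabet) + 1
    p, out = 0, []                           # P = root, C = ε
    for i, ch in enumerate(text, start=1):
        kids = children.setdefault(p, {})
        if ch in kids:                       # child exists: move P down
            p = kids[ch]
        else:
            counts[p, ch] = counts.get((p, ch), 0) + 1
            if counts[p, ch] == alpha:       # threshold reached: create a node
                kids[ch] = p = next_id
                next_id += 1
            else:                            # emit the code and restart at the root
                out.append(p)
                p = children[0][ch]
        yield i, ch, p, "".join(map(str, out))
    out.append(p)                            # final code once i > n
    yield len(text) + 1, "", p, "".join(map(str, out))


for row in trace("abcabcdabc"):
    print(row)
```

The last row printed is `(11, '', 6, '1235346')`, matching the compressed text of FIG. 22.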

In this way, the summary trie 50 and the compressed text 60 are created from the uncompressed text data. With the text encoding technique of the second embodiment, there is no need to create a suffix tree as in STVF. A suffix tree is created by reading all of the text data, so its construction takes time. Because it does not need to create a suffix tree, the text encoding technique of the second embodiment can perform encoding about 20 times faster than STVF.

Moreover, the text encoding technique of the second embodiment reads the characters of the text one at a time from the beginning, which allows sequential processing (stream processing). Sequential processing makes the technique usable, for example, for encoding transmission data in communication between computers. When a suffix tree is constructed from all of the text data as in STVF, and the text data is encoded only after a phrase tree has been created from that suffix tree, encoded data cannot be produced until the entire transmission data has been read. This makes stream processing difficult. In contrast, the text encoding technique of the second embodiment can encode transmission data in order from its beginning and transmit the encoded data as it is produced. This keeps the communication delay caused by encoding to a minimum.
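The stream-processing property can be illustrated with a generator that yields each code as soon as it is determined. This is a sketch under assumptions, not the embodiment's interface: `encode_stream` is a hypothetical name, the root has ID 0, and the maximum code number check is omitted for brevity.

```python
def encode_stream(chars, alphabet, alpha):
    """Lazily yield codes while consuming an iterable of characters."""
    children = {0: {ch: n for n, ch in enumerate(alphabet, start=1)}}
    counts = {}                          # (node ID, character) -> counter
    next_id = len(alphabet) + 1
    p = 0                                # P = root
    for ch in chars:
        kids = children.setdefault(p, {})
        if ch in kids:
            p = kids[ch]
            continue
        counts[p, ch] = counts.get((p, ch), 0) + 1
        if counts[p, ch] == alpha:
            kids[ch] = p = next_id       # create the node and move P to it
            next_id += 1
        else:
            yield p                      # available before later input is read
            p = children[0][ch]
    if p != 0:
        yield p                          # final code at end of input
```

A sender could forward each yielded code immediately; for the input “abcabcdabc” the first code is produced after only two characters have been read.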

[Third Embodiment]
In the third embodiment, a plurality of uncompressed texts can be compressed and used as a character search target.

FIG. 23 is a diagram illustrating functions of a server according to the third embodiment. The server 100a according to the third embodiment includes an encoding dictionary storage unit 110a, a compressed text storage unit 120a, a text encoding unit 130a, a search unit 140a, and a decompression unit 150a.

The encoding dictionary storage unit 110a and the compressed text storage unit 120a have the same functions as the elements of the same names in the second embodiment shown in FIG. 6.
In addition to the functions of the text encoding unit 130 of the second embodiment shown in FIG. 6, the text encoding unit 130a has a function of inserting, into the compressed text 80, control symbols for distinguishing the plurality of uncompressed texts 71, 72, and 73. A symbol that does not overlap with any code assigned in the summary trie is used as the control symbol. For example, in the example of FIG. 23, “$” is inserted in the compressed text 80 as a control symbol indicating the boundaries between the uncompressed texts 71 to 73.

In addition to the functions of the search unit 140 of the second embodiment shown in FIG. 6, the search unit 140a has a function of including, in the search result, information indicating the uncompressed text that contains the hit character string. For example, the uncompressed text containing the hit character string is indicated by its order in the compressed text 80. For example, the search unit 140a counts the number of control symbols “$” that precede the position of the character string hit in the search, and includes the number obtained by adding 1 to that count in the search result 42 as the order of the uncompressed text containing the character string.
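The order calculation can be written in a few lines. In this sketch the compressed text is assumed to be a Python list holding integer codes and the string “$”, and `text_order` and `hit_index` are hypothetical names; how the search itself locates the hit is outside the sketch.

```python
CONTROL = "$"   # control symbol marking the end of each text


def text_order(compressed, hit_index):
    """Return the 1-based order of the text containing position hit_index."""
    # Count the control symbols before the hit, then add 1.
    return compressed[:hit_index].count(CONTROL) + 1
```

For example, with `[1, 2, "$", 3, 5, "$", 4]` a hit at index 3 lies after one control symbol, so it belongs to the second text.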

In addition to the functions of the decompression unit 150 of the second embodiment shown in FIG. 6, the decompression unit 150a has a function of decompressing, upon receiving a decompression request 43 that specifies the order of an uncompressed text, the codeword string at the specified order in the compressed text 80. For example, the decompression unit 150a divides the compressed text 80 at the control symbols to generate a plurality of codeword strings. The decompression unit 150a then counts the codeword strings from the beginning of the compressed text 80, decompresses the codeword string at the requested order, and outputs it as the decompressed text 44.
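The splitting and selective decompression can be sketched as follows. The names `split_texts`, `decompress`, and `phrases` are hypothetical; `phrases` is assumed to map each code to the character string spelled by the path from the root to that node, and the compressed text is assumed to be a list of integer codes and “$” symbols.

```python
CONTROL = "$"   # control symbol marking the end of each text


def split_texts(compressed):
    """Divide the compressed text into codeword strings at the control symbols."""
    parts, current = [], []
    for symbol in compressed:
        if symbol == CONTROL:
            parts.append(current)
            current = []
        else:
            current.append(symbol)
    if current:                      # tolerate a missing final control symbol
        parts.append(current)
    return parts


def decompress(compressed, order, phrases):
    """Decompress only the codeword string at the given 1-based order."""
    codes = split_texts(compressed)[order - 1]
    return "".join(phrases[code] for code in codes)
```

With the trie of the earlier example, `phrases` would be `{1: "a", 2: "b", 3: "c", 4: "d", 5: "ab", 6: "abc"}`, and the codeword string `1 2 3 5 3 4 6` decodes back to “abcabcdabc”.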

FIG. 24 is a block diagram illustrating detailed functions of the text encoding unit according to the third embodiment. The text encoding unit 130a includes a frequency threshold storage unit 131a, a maximum code number storage unit 132a, a character selection unit 133a, a node creation unit 134a, a frequency count unit 135a, a code output unit 136a, and a control symbol output unit 137. The frequency threshold storage unit 131a, the maximum code number storage unit 132a, the character selection unit 133a, the node creation unit 134a, the frequency count unit 135a, and the code output unit 136a have the same functions as the elements of the same names in the second embodiment shown in FIG. 7.

The control symbol output unit 137 detects that the code output unit 136a has output, to the compressed text storage unit 120a, the last code generated from one uncompressed text (the code stored in step S20 in FIG. 8). The control symbol output unit 137 then stores the control symbol “$” at the end of the codes for that uncompressed text (after the last code).

Next, the procedure of the text encoding process will be described in detail. The inputs of the text encoding process are M uncompressed texts 71, 72, 73, ... (M is an integer of 1 or more), the threshold, and the maximum code number. Here, the text data strings contained in the M uncompressed texts 71, 72, 73, ... are denoted T1, ..., TM, respectively. The text number identifying the uncompressed text currently being processed is denoted “m” (m is an integer of 1 or more). The control symbol indicating a boundary between texts is “$”. The meanings of the other symbols are the same as in the second embodiment.

FIG. 25 is a flowchart illustrating the procedure of the text encoding process according to the third embodiment. In the following, the process illustrated in FIG. 25 will be described in order of step number.
[Step S41] Each element in the server 100a initializes information. For example, the character selection unit 133a initializes the value of the character position “i” to “0”. In addition, the character selection unit 133a initializes the value of the text number “m” to “1”.

The node creation unit 134a initializes the summary trie “D” in the encoding dictionary storage unit 110a. The node creation unit 134a also initializes the node pointer “P”. The code output unit 136a initializes the compressed text in the compressed text storage unit 120a. An empty character string is set in the initialized compressed text, which is written as “C = ε”, where ε denotes the empty character string.

[Step S42] The character selection unit 133a determines whether the text number “m” is larger than the number of texts “M”. If the text number is larger than the number of texts, the process ends. If the text number is equal to or less than the number of texts, the process proceeds to step S43.

[Step S43] The character selection unit 133a acquires the “m”-th text data “Tm”.
[Step S44] The character selection unit 133a sets “1” as the value of the position “i” (i = 1). As a result, the first character of the “m” th text data “Tm” becomes the processing target.

[Step S45] The character selection unit 133a assigns the length (number of characters) of the text data “Tm” to the number of characters “n”.
[Step S46] Compression processing is performed through the cooperation of the character selection unit 133a, the node creation unit 134a, the frequency count unit 135a, and the code output unit 136a. Details of this processing will be described later (see FIG. 26).

[Step S47] The control symbol output unit 137 writes the control symbol “$” at the end of the compressed text “C”.
[Step S48] The node creation unit 134a resets the node pointer “P” to the root of the summary trie (P = root(D)).

[Step S49] The character selection unit 133a adds 1 to the text number “m”. The process then returns to step S42.
FIG. 26 is a flowchart showing the detailed procedure of the compression process. The process illustrated in FIG. 26 is described below in order of step number.

［ステップＳ５１］ノード作成部１３４ａは、位置「ｉ」の値が、文字数「ｎ」より大きいか否かを判断する。位置「ｉ」の値が文字数「ｎ」より大きければ、処理がステップＳ５８に進められる。位置「ｉ」の値が文字数「ｎ」以下であれば、処理がステップＳ５２に進められる。 [Step S51] The node creation unit 134a determines whether the value of the position “i” is larger than the number of characters “n”. If the value of position “i” is larger than the number of characters “n”, the process proceeds to step S58. If the value of position “i” is equal to or less than the number of characters “n”, the process proceeds to step S52.

[Step S52] Because the position “i” is less than or equal to the number of characters “n”, the node creation unit 134a, the frequency count unit 135a, and the code output unit 136a cooperate to execute the summary trie creation and text compression process. The details of this process are the same as those of the summary trie creation and text compression process of the second embodiment shown in FIG. 9.

［ステップＳ５３］文字選択部１３３ａは、位置「ｉ」の値をインクリメントする（ｉ＝ｉ＋１）。これにより、テキスト番号「ｍ」の未圧縮テキスト内の次の文字が処理対象となる。 [Step S53] The character selection unit 133a increments the value of the position “i” (i = i + 1). As a result, the next character in the uncompressed text with the text number “m” becomes the processing target.

［ステップＳ５４］ノード作成部１３４ａは、要約トライ「Ｄ」の大きさ「ｋ」が、最大符号数「Ｋ」未満か否かを判断する。要約トライ「Ｄ」の大きさ「ｋ」が最大符号数「Ｋ」未満であれば、処理がステップＳ５１に進められる。要約トライ「Ｄ」の大きさ「ｋ」が最大符号数「Ｋ」以上であれば、処理がステップＳ５５に進められる。 [Step S54] The node creation unit 134a determines whether the size “k” of the summary trie “D” is less than the maximum code number “K”. If the size “k” of the summary trie “D” is less than the maximum code number “K”, the process proceeds to step S51. If the size “k” of the summary trie “D” is equal to or greater than the maximum code number “K”, the process proceeds to step S55.

［ステップＳ５５］ノード作成部１３４ａは、位置「ｉ」の値が、文字数「ｎ」より大きいか否かを判断する。位置「ｉ」の値が文字数「ｎ」より大きければ、処理がステップＳ５８に進められる。位置「ｉ」の値が文字数「ｎ」以下であれば、処理がステップＳ５６に進められる。 [Step S55] The node creation unit 134a determines whether the value of the position “i” is larger than the number of characters “n”. If the value of position “i” is larger than the number of characters “n”, the process proceeds to step S58. If the value of position “i” is equal to or less than the number of characters “n”, the process proceeds to step S56.

[Step S56] Because the position “i” is less than or equal to the number of characters “n”, the code output unit 136a executes the text compression process. The details of this process are the same as those of the text compression process of the second embodiment shown in FIG. 10.

［ステップＳ５７］文字選択部１３３ａは、位置「ｉ」の値をインクリメントする（ｉ＝ｉ＋１）。これにより、テキスト番号「ｍ」の未圧縮テキスト内の次の文字が処理対象となる。その後、処理がステップＳ５５に進められる。 [Step S57] The character selection unit 133a increments the value of the position “i” (i = i + 1). As a result, the next character in the uncompressed text with the text number “m” becomes the processing target. Thereafter, the process proceeds to step S55.

[Step S58] Once the position “i” exceeds the number of characters “n”, the code output unit 136a writes the code of the node indicated by the node pointer “P” at the end of the compressed text “C”. The compression process then ends, and the process proceeds to step S47 in FIG. 25.

In this way, a control symbol can be inserted between the individually encoded text data. The control symbol allows the codeword sequence in the compressed text to be divided according to the uncompressed text from which each part was generated. The order of the divided codeword sequences in the compressed text then identifies the uncompressed text from which each codeword sequence was generated.
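As a rough illustration of how the control symbol separates the codes of each text, the following Python sketch concatenates per-text code sequences with “$” and splits them again. The names `encode_texts`, `split_by_text`, and `toy_encoder` are hypothetical and do not appear in the embodiment. `toy_encoder` merely stands in for the compression process of step S46 and assigns one code per character.

```python
from typing import Callable, List, Union

CONTROL = "$"  # control symbol written after the codes of each text (step S47)
Token = Union[int, str]


def encode_texts(texts: List[str],
                 encode_one: Callable[[str], List[int]]) -> List[Token]:
    """Append the codes of each text, followed by the control symbol."""
    compressed: List[Token] = []
    for text in texts:  # corresponds to m = 1 .. M
        compressed.extend(encode_one(text))
        compressed.append(CONTROL)
    return compressed


def split_by_text(compressed: List[Token]) -> List[List[int]]:
    """Recover one code sequence per source text, in the original order."""
    groups: List[List[int]] = []
    current: List[int] = []
    for token in compressed:
        if token == CONTROL:
            groups.append(current)  # the m-th group belongs to the m-th text
            current = []
        else:
            current.append(token)
    return groups


def toy_encoder(text: str) -> List[int]:
    # Stand-in for the trie-based compression: one code per character.
    return [ord(ch) for ch in text]
```

Because the groups come back in the order in which the texts were encoded, the index of a group is enough to tell which uncompressed text it came from.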

[Fourth Embodiment]
In the fourth embodiment, the threshold “α” applied to the nodes of the summary trie can be changed dynamically. In encoding with a summary trie, the compression ratio varies greatly depending on the choice of the threshold “α”. If “α” is too large, the summary trie grows slowly and codes cannot be assigned to long character strings. Conversely, if “α” is too small, the summary trie grows too quickly and reaches its maximum size before the later part of the data has been read. The fourth embodiment therefore uses a threshold table to keep “α” small at first and to increase it gradually as the amount of data read grows.

図２７は、第４の実施の形態に係るテキスト符号化部の機能を示すブロック図である。テキスト符号化部１３０ｂは、未圧縮テキスト４０を符号化し、要約トライと圧縮済テキストとを生成する。要約トライは、符号化辞書として符号化辞書記憶部１１０ｂに格納される。圧縮済テキストは、圧縮済テキスト記憶部１２０ｂに格納される。符号化辞書記憶部１１０ｂは、要約トライ形式の符号化辞書を記憶する。圧縮済テキスト記憶部１２０ｂは、圧縮済テキストを記憶する。 FIG. 27 is a block diagram illustrating functions of the text encoding unit according to the fourth embodiment. The text encoding unit 130b encodes the uncompressed text 40 and generates a summary trie and a compressed text. The summary trie is stored in the encoding dictionary storage unit 110b as an encoding dictionary. The compressed text is stored in the compressed text storage unit 120b. The encoding dictionary storage unit 110b stores a summary trie format encoding dictionary. The compressed text storage unit 120b stores the compressed text.

The text encoding unit 130b includes a frequency threshold storage unit 131b, a maximum code number storage unit 132b, a character selection unit 133b, a node creation unit 134b, a frequency count unit 135b, a code output unit 136b, a threshold table storage unit 138, and a threshold setting unit 139. The frequency threshold storage unit 131b, the maximum code number storage unit 132b, the character selection unit 133b, the node creation unit 134b, the frequency count unit 135b, and the code output unit 136b have the same functions as the elements of the same names in the second embodiment shown in FIG. 7.

閾値テーブル記憶部１３８は、符号化された文字数と閾値との対応関係を示す閾値テーブルを記憶する。例えば、ＲＡＭ１０２またはＨＤＤ１０３の記憶領域の一部が、閾値テーブル記憶部１３８として使用される。 The threshold value table storage unit 138 stores a threshold value table indicating the correspondence between the number of encoded characters and the threshold value. For example, a part of the storage area of the RAM 102 or the HDD 103 is used as the threshold table storage unit 138.

閾値設定部１３９は、文字選択部１３３ｂが選択してノード作成部１３４ｂに渡した文字数に応じて、頻度閾値記憶部１３１ｂ内の閾値を設定する。その際、閾値設定部１３９は、閾値テーブル記憶部１３８を参照し、設定する閾値を決定する。 The threshold setting unit 139 sets the threshold in the frequency threshold storage unit 131b according to the number of characters selected by the character selection unit 133b and passed to the node creation unit 134b. At this time, the threshold setting unit 139 refers to the threshold table storage unit 138 and determines a threshold to be set.

図２８は、閾値テーブル記憶部のデータ構造の一例を示す図である。閾値テーブル記憶部１３８には、閾値テーブル１３８ａが格納されている。閾値テーブル１３８ａには、文字数と閾値との欄が設けられている。閾値テーブル１３８ａ内の横方向に並べられた情報が互いに関連付けられている。 FIG. 28 is a diagram illustrating an example of a data structure of the threshold table storage unit. The threshold value table storage unit 138 stores a threshold value table 138a. The threshold value table 138a has columns for the number of characters and the threshold value. Information arranged in the horizontal direction in the threshold value table 138a is associated with each other.

文字数の欄には、符号化された文字の数を示す数値の範囲が設定されている。図２８の例では、文字のデータ量を示すバイト数によって、文字数が示されている。例えば１バイト文字であれば、バイト数の数値が、そのまま文字数となる。また２バイト文字であれば、バイト数の数値の半分の値が、文字数となる。 A numerical value range indicating the number of encoded characters is set in the number of characters column. In the example of FIG. 28, the number of characters is indicated by the number of bytes indicating the amount of character data. For example, if it is a 1-byte character, the numerical value of the number of bytes becomes the number of characters as it is. If it is a 2-byte character, the number of characters is half the number of bytes.

閾値の欄には、対応する文字数の範囲内の文字が符号化されたときの閾値が設定されている。例えば、符号化された文字のデータ量が１バイトから１Ｋバイト範囲内であれば、閾値「α」として１０が設定される。また符号化された文字のデータ量が１Ｋバイトから１００Ｋバイト範囲内であれば、閾値「α」として１００が設定される。なお、文字のデータ量が「１〜１Ｋ」と「１Ｋ〜１００Ｋ」の境界の値「１Ｋ」となった場合、例えば値が大きい方の閾値（この例では「１００」）が設定される。 In the threshold value column, a threshold value when characters within the range of the corresponding number of characters are encoded is set. For example, if the data amount of the encoded character is within the range of 1 byte to 1 Kbyte, 10 is set as the threshold “α”. If the encoded character data amount is within the range of 1 Kbytes to 100 Kbytes, 100 is set as the threshold “α”. When the character data amount reaches the value “1K” at the boundary between “1 to 1K” and “1K to 100K”, for example, a threshold having a larger value (“100” in this example) is set.
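The table lookup described above can be sketched in Python as follows. This is only an illustration: the function name `lookup_threshold` is hypothetical, 1K is assumed to mean 1024 bytes, and only the two rows quoted in the text are filled in. What happens beyond the last row is not stated in this section, so the sketch simply keeps the last threshold.

```python
from bisect import bisect_right

KB = 1024  # assumption: "1K" is taken as 1024 bytes

# Exclusive upper bound of each byte-count range, and the threshold for that range.
RANGE_UPPER_BOUNDS = [1 * KB, 100 * KB]
THRESHOLDS = [10, 100]


def lookup_threshold(encoded_bytes: int) -> int:
    """Return the threshold alpha for the number of bytes encoded so far."""
    if encoded_bytes < 1:
        raise ValueError("encoded_bytes must be at least 1")
    # bisect_right places a value equal to a boundary in the higher range,
    # so exactly 1K bytes yields the larger threshold (100).
    row = bisect_right(RANGE_UPPER_BOUNDS, encoded_bytes)
    # Beyond the rows shown, keep the last threshold (assumption).
    return THRESHOLDS[min(row, len(THRESHOLDS) - 1)]
```

For 1-byte characters the argument is simply the number of characters encoded; for 2-byte characters it would be twice that number.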

FIG. 29 is a flowchart showing the procedure of the text encoding process according to the fourth embodiment. The process illustrated in FIG. 29 is described below in order of step number.
[Step S61] Each element in the text encoding unit 130b initializes its information. For example, the character selection unit 133b initializes the character position “i” to “0”. The node creation unit 134b initializes the summary trie “D” in the encoding dictionary storage unit 110b, and also initializes the node pointer “P”. The code output unit 136b initializes the compressed text in the compressed text storage unit 120b.

[Step S62] The character selection unit 133b increments the position “i” (i = i + 1). As a result, the first character in the uncompressed text 40 becomes the processing target.
[Step S63] The node creation unit 134b determines whether the position “i” is larger than the number of characters “n”. If the position “i” is larger than the number of characters “n”, the process proceeds to step S71. If the position “i” is less than or equal to the number of characters “n”, the process proceeds to step S64.

［ステップＳ６４］閾値設定部１３９は、閾値テーブル１３８ａと位置「ｉ」に基づいて閾値「α」を求める。例えば閾値設定部１３９は、未圧縮テキスト４０に含まれる文字が１バイト文字であれば、位置「ｉ」の値を、符号化した文字数に応じたバイト数として取得する。次に閾値設定部１３９は、取得したバイト数を含む範囲を、閾値テーブル１３８ａの文字数の欄から選択する。さらに閾値設定部１３９は、選択した範囲に対応する閾値を、閾値テーブル１３８ａから取得する。そして閾値設定部１３９は、取得した閾値を頻度閾値記憶部１３１ｂに格納する。 [Step S64] The threshold value setting unit 139 obtains a threshold value “α” based on the threshold value table 138a and the position “i”. For example, if the character included in the uncompressed text 40 is a 1-byte character, the threshold setting unit 139 acquires the value of the position “i” as the number of bytes corresponding to the number of encoded characters. Next, the threshold setting unit 139 selects a range including the acquired number of bytes from the number of characters column of the threshold table 138a. Further, the threshold setting unit 139 acquires a threshold corresponding to the selected range from the threshold table 138a. Then, the threshold setting unit 139 stores the acquired threshold in the frequency threshold storage unit 131b.

[Step S65] Because the position “i” is less than or equal to the number of characters “n”, the node creation unit 134b, the frequency count unit 135b, and the code output unit 136b cooperate to execute the summary trie creation and text compression process. The details of this process are the same as those of the process of the second embodiment shown in FIG. 9.

[Step S66] The character selection unit 133b increments the position “i” (i = i + 1). As a result, the next character in the uncompressed text 40 becomes the processing target.
[Step S67] The node creation unit 134b determines whether the size “k” of the summary trie “D” is less than the maximum code number “K”. If the size “k” is less than the maximum code number “K”, the process proceeds to step S63. If the size “k” is greater than or equal to the maximum code number “K”, the process proceeds to step S68.

［ステップＳ６８］ノード作成部１３４ｂは、位置「ｉ」の値が、文字数「ｎ」より大きいか否かを判断する。位置「ｉ」の値が文字数「ｎ」より大きければ、処理がステップＳ７１に進められる。位置「ｉ」の値が文字数「ｎ」以下であれば、処理がステップＳ６９に進められる。 [Step S68] The node creation unit 134b determines whether the value of the position “i” is greater than the number of characters “n”. If the value of position “i” is larger than the number of characters “n”, the process proceeds to step S71. If the value of position “i” is equal to or less than the number of characters “n”, the process proceeds to step S69.

[Step S69] Because the position “i” is less than or equal to the number of characters “n”, the code output unit 136b executes the text compression process. The details of this process are the same as those of the process of the second embodiment shown in FIG. 10.

［ステップＳ７０］文字選択部１３３ｂは、位置「ｉ」の値をインクリメントする（ｉ＝ｉ＋１）。これにより、未圧縮テキスト４０内の次の文字が処理対象となる。その後、処理がステップＳ６８に進められる。 [Step S70] The character selection unit 133b increments the value of the position “i” (i = i + 1). As a result, the next character in the uncompressed text 40 becomes the processing target. Thereafter, the process proceeds to step S68.

［ステップＳ７１］位置「ｉ」の値が文字数「ｎ」より大きくなると、符号出力部１３６ｂは、ノードポインタ「Ｐ」が示すノードの符号を、圧縮済テキスト「Ｃ」の末尾に書き出す。その後、処理が終了する。 [Step S71] When the value of the position “i” becomes larger than the number of characters “n”, the code output unit 136b writes the code of the node indicated by the node pointer “P” at the end of the compressed text “C”. Thereafter, the process ends.

In this way, the threshold “α” can be changed dynamically. Even if the number of encoded characters gradually increases while the system is in operation, an appropriate threshold “α” is set according to the number of characters encoded. For example, gradually increasing “α” as the number of encoded characters increases avoids the situation in which “α” is too large and the summary trie hardly grows. It also avoids the situation in which “α” is too small, so that the summary trie grows too quickly and the number of nodes reaches the maximum code number before the later part of the data has been read.
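To make the interplay between the position-dependent threshold and the maximum code number concrete, here is a minimal Python sketch of an encoder that grows a trie while encoding. It is an illustration under stated assumptions, not the procedure of the embodiment: the names `Node`, `encode`, and `two_step_schedule` are hypothetical, every input character is assumed to be in `alphabet`, and the per-character step follows the behavior summarized in the supplementary notes. The procedure of FIG. 9 is not reproduced in this section, so whether a code is written at the moment a node is added is an assumption (this sketch writes none).

```python
from typing import Callable, Dict, List, Tuple


class Node:
    """One trie node: its code, the phrase it stands for, children, miss counts."""

    def __init__(self, code: int, phrase: str) -> None:
        self.code = code
        self.phrase = phrase
        self.children: Dict[str, "Node"] = {}
        self.counts: Dict[str, int] = {}  # per character: misses seen at this node


def encode(text: str, alphabet: str, max_codes: int,
           threshold_at: Callable[[int], int]) -> Tuple[List[int], Dict[int, str]]:
    """Return (codes, code-to-phrase table); threshold_at(i) is alpha at position i."""
    root = Node(-1, "")
    phrases: Dict[int, str] = {}
    for ch in alphabet:  # the root starts with one child per possible character
        root.children[ch] = Node(len(phrases), ch)
        phrases[root.children[ch].code] = ch
    out: List[int] = []
    p = root  # node pointer P
    for i, ch in enumerate(text, start=1):
        if ch not in root.children:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        child = p.children.get(ch)
        if child is not None:
            p = child  # extend the current match
            continue
        if len(phrases) < max_codes:  # trie may still grow (k < K)
            p.counts[ch] = p.counts.get(ch, 0) + 1
            if p.counts[ch] >= threshold_at(i):
                child = Node(len(phrases), p.phrase + ch)
                p.children[ch] = child
                phrases[child.code] = child.phrase
                p = child  # move to the newly added node
                continue
        out.append(p.code)  # write the matched phrase, restart below the root
        p = root.children[ch]
    if p is not root:
        out.append(p.code)  # code of the final node, as in step S71
    return out, phrases


def two_step_schedule(i: int) -> int:
    # Mirrors the two table rows quoted earlier, assuming 1-byte characters and 1K = 1024.
    return 10 if i < 1024 else 100
```

Calling `encode(text, alphabet, K, two_step_schedule)` keeps the threshold at 10 for the first 1023 characters and raises it to 100 afterwards. Joining `phrases[c]` over the returned codes reproduces the input, which is a convenient sanity check for the sketch.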

[Other application examples]
The above processing functions can be realized by a computer. In that case, a program is provided that describes the processing contents of the functions the server should have. Executing the program on a computer realizes the above processing functions on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium. Computer-readable recording media include magnetic storage devices, optical discs, magneto-optical recording media, and semiconductor memories. Magnetic storage devices include hard disk drives (HDD), flexible disks (FD), and magnetic tapes. Optical discs include DVD, DVD-RAM, and CD-ROM/RW. Magneto-optical recording media include MO (Magneto-Optical disc).

プログラムを流通させる場合には、例えば、そのプログラムが記録されたＤＶＤ、ＣＤ−ＲＯＭなどの可搬型記録媒体が販売される。また、プログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することもできる。 When distributing the program, for example, a portable recording medium such as a DVD or a CD-ROM in which the program is recorded is sold. It is also possible to store the program in a storage device of a server computer and transfer the program from the server computer to another computer via a network.

プログラムを実行するコンピュータは、例えば、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、自己の記憶装置に格納する。そして、コンピュータは、自己の記憶装置からプログラムを読み取り、プログラムに従った処理を実行する。なお、コンピュータは、可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することもできる。また、コンピュータは、サーバコンピュータからプログラムが転送されるごとに、逐次、受け取ったプログラムに従った処理を実行することもできる。 The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, the computer reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to the program. Further, each time the program is transferred from the server computer, the computer can sequentially execute processing according to the received program.

また、上記の処理機能の少なくとも一部を、ＤＳＰ（Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）などの電子回路で実現することもできる。 In addition, at least a part of the above processing functions can be realized by an electronic circuit such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a PLD (Programmable Logic Device).

以上、実施の形態を例示したが、実施の形態で示した各部の構成は同様の機能を有する他のものに置換することができる。また、他の任意の構成物や工程が付加されてもよい。さらに、前述した実施の形態のうちの任意の２以上の構成（特徴）を組み合わせたものであってもよい。 As mentioned above, although embodiment was illustrated, the structure of each part shown by embodiment can be substituted by the other thing which has the same function. Moreover, other arbitrary structures and processes may be added. Further, any two or more configurations (features) of the above-described embodiments may be combined.

The techniques disclosed in the above embodiments include the techniques described in the following supplementary notes.
(Supplementary Note 1) A text processing apparatus comprising:
character selection means for selecting characters in order from a character string in text data;
node addition means for referring to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is given a node identifier and, for each character, the number of times that character appeared next while the node was the determination position, the node addition means starting the determination position at the root node; moving the determination position to a child node when the node at the determination position has a child node corresponding to the character selected by the character selection means; moving the determination position to the child node of the root node corresponding to the selected character when the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to the node at the determination position has not reached a predetermined threshold; and, when the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to the node at the determination position has reached the predetermined threshold, adding to the node at the determination position a child node that corresponds to the selected character and is given a new identifier, and moving the determination position to the added child node; and
frequency counting means for referring to the phrase tree storage means and counting up the number of appearances of the selected character given to the node at the determination position when the node at the determination position has no node corresponding to the selected character.

（付記２）前記文節木記憶手段を参照し、前記判断位置のノードに対して前記選択された文字に対応する子のノードが存在しない場合、前記判断位置のノードに付与された識別子を出力する識別子出力手段をさらに有することを特徴とする付記１記載のテキスト処理装置。 (Supplementary Note 2) With reference to the phrase tree storage means, if there is no child node corresponding to the selected character with respect to the node at the determination position, an identifier assigned to the node at the determination position is output. The text processing apparatus according to appendix 1, further comprising identifier output means.

(Supplementary Note 3) The text processing apparatus according to Supplementary Note 2, wherein, when the determination position is moved because the last character of the character string in the text data has been selected, the identifier output means outputs the identifier given to the node at the determination position after the move.

（付記４）前記ノード追加手段は、子のノードの追加を、前記文節木のノードの数が所定数に達するまで行うことを特徴とする付記１乃至３のいずれかに記載のテキスト処理装置。 (Supplementary note 4) The text processing apparatus according to any one of supplementary notes 1 to 3, wherein the node addition unit performs addition of a child node until the number of nodes of the phrase tree reaches a predetermined number.

(Supplementary Note 5) The text processing apparatus according to any one of Supplementary Notes 1 to 4, further comprising threshold changing means for changing the threshold according to the total amount of selected characters.
(Supplementary Note 6) The text processing apparatus according to Supplementary Note 5, wherein the threshold changing means makes the threshold larger as the total amount of selected characters increases.

(Supplementary Note 7) The text processing apparatus according to Supplementary Note 3, further comprising control symbol output means for outputting, after the identifier given to the node at the determination position after the move has been output, a control symbol indicating the end of the codes corresponding to the character string in the text data.

(Supplementary Note 8) A text processing method in which a computer:
selects characters in order from a character string in text data;
refers to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is given a node identifier and, for each character, the number of times that character appeared next while the node was the determination position;
starts the determination position at the root node;
moves the determination position to a child node when the node at the determination position has a child node corresponding to the selected character;
moves the determination position to the child node of the root node corresponding to the selected character when the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to the node at the determination position has not reached a predetermined threshold;
adds to the node at the determination position a child node that corresponds to the selected character and is given a new identifier, and moves the determination position to the added child node, when the node at the determination position has no child node corresponding to the selected character and the number of appearances of the selected character given to the node at the determination position has reached the predetermined threshold; and
counts up the number of appearances of the selected character given to the node at the determination position when the node at the determination position has no node corresponding to the selected character.

(Supplementary Note 9) The text processing method according to Supplementary Note 8, wherein the computer further refers to the phrase tree storage means and, when the node at the determination position has no child node corresponding to the selected character, outputs the identifier given to the node at the determination position.

(Supplementary Note 10) The text processing method according to Supplementary Note 9, wherein, when the determination position is moved because the last character of the character string in the text data has been selected, the computer further outputs the identifier given to the node at the determination position after the move.

(Supplementary Note 11) The text processing method according to any one of Supplementary Notes 8 to 10, wherein child nodes are added until the number of nodes in the phrase tree reaches a predetermined number.
(Supplementary Note 12) The text processing method according to any one of Supplementary Notes 8 to 11, wherein the computer further changes the threshold according to the total amount of selected characters.

(Supplementary Note 13) The text processing method according to Supplementary Note 12, wherein, when the threshold is changed, the threshold is made larger as the total amount of selected characters increases.
(Supplementary Note 14) The text processing method according to Supplementary Note 10, wherein, after outputting the identifier given to the node at the determination position after the move, the computer further outputs a control symbol indicating the end of the codes corresponding to the character string in the text data.

(Supplementary Note 15) A text processing program that causes a computer to execute a process comprising:
selecting characters in order from a character string in text data;
referring to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is assigned a node identifier and the number of appearances of each character that appeared next while that node was the determination position;
starting the determination position from the root node;
when a child node corresponding to the selected character exists for the node at the determination position, moving the determination position to that child node;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has not reached a predetermined threshold, moving the determination position to the child node of the root node that corresponds to the selected character;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has reached the predetermined threshold, adding to the node at the determination position a child node that corresponds to the selected character and is assigned a new identifier, and moving the determination position to the added child node; and
when no node corresponding to the selected character exists for the node at the determination position, counting up the number of appearances of the selected character assigned to the node at the determination position.
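The procedure in Supplementary Note 15 can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the publication: the names `Node`, `PhraseTree`, and `train` are hypothetical, and the sketch assumes that the appearance count is incremented before it is compared with the threshold and that every input character belongs to the initial alphabet.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ident: int                                     # node identifier
    children: dict = field(default_factory=dict)   # character -> child Node
    counts: dict = field(default_factory=dict)     # character -> appearance count


class PhraseTree:
    """Hypothetical phrase tree; the root starts with one child per character."""

    def __init__(self, alphabet, threshold):
        self.threshold = threshold
        self.next_id = 0
        self.root = Node(self._new_id())
        for ch in alphabet:
            self.root.children[ch] = Node(self._new_id())

    def _new_id(self):
        ident = self.next_id
        self.next_id += 1
        return ident

    def train(self, text):
        pos = self.root                            # determination position
        for ch in text:
            child = pos.children.get(ch)
            if child is not None:
                pos = child                        # child exists: move down
                continue
            # No child for this character: count up its appearance here.
            pos.counts[ch] = pos.counts.get(ch, 0) + 1
            if pos.counts[ch] >= self.threshold:
                # Assumption: the count is compared after being incremented.
                child = Node(self._new_id())
                pos.children[ch] = child
                pos = child                        # move to the added child
            else:
                # Restart at the root's child for this character.
                pos = self.root.children[ch]


tree = PhraseTree("ab", threshold=2)
tree.train("ababab")
print(tree.root.children["a"].children["b"].ident)  # 3
```

With a threshold of 2, the second time "b" follows "a" the node for "ab" is added with the next free identifier, while "ba" has been seen only once and gets no node.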

(Supplementary Note 16) The text processing program according to Supplementary Note 15, further causing the computer to refer to the phrase tree storage means and, when no child node corresponding to the selected character exists for the node at the determination position, output the identifier assigned to the node at the determination position.

(Supplementary Note 17) The text processing program according to Supplementary Note 16, further causing the computer to output, when the determination position has been moved as a result of the last character of the character string in the text data being selected, the identifier assigned to the node at the determination position after the move.
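The identifier output described in Supplementary Notes 16 and 17 might look like the sketch below. To keep it short, it assumes a phrase tree that is already built and is not modified during encoding, so the count-up and node-addition branches are omitted. The nested-dictionary tree layout and the function name `encode` are hypothetical.

```python
def encode(tree, text):
    """Encode text as a list of node identifiers.

    tree is a nested dict of the form {"id": int, "children": {char: subtree}}.
    Assumption: the root has a child for every character that can appear.
    """
    codes = []
    pos = tree                                  # determination position
    for ch in text:
        child = pos["children"].get(ch)
        if child is None:
            # No child for this character: output the current identifier,
            # then restart at the root's child for the character.
            codes.append(pos["id"])
            child = tree["children"][ch]
        pos = child
    if pos is not tree:
        # The last character moved the position: output where it ended up.
        codes.append(pos["id"])
    return codes


# Hypothetical tree: root 0, "a" = 1, "b" = 2, "ab" = 3.
tree = {
    "id": 0,
    "children": {
        "a": {"id": 1, "children": {"b": {"id": 3, "children": {}}}},
        "b": {"id": 2, "children": {}},
    },
}
print(encode(tree, "abba"))  # [3, 2, 1]
```

Here "ab" is emitted as the single identifier 3, after which "b" and "a" fall back to their single-character nodes.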

(Supplementary Note 18) The text processing program according to any one of Supplementary Notes 15 to 17, wherein child nodes are added until the number of nodes in the phrase tree reaches a predetermined number.
(Supplementary Note 19) The text processing program according to any one of Supplementary Notes 15 to 18, further causing the computer to change the threshold according to the total amount of the selected characters.

(Supplementary Note 20) The text processing program according to Supplementary Note 19, wherein, when the threshold is changed, the threshold is made larger as the total amount of the selected characters increases.
(Supplementary Note 21) The text processing program according to Supplementary Note 17, further causing the computer to output, after outputting the identifier assigned to the node at the determination position after the move, a control symbol indicating the end of the code corresponding to the character string in the text data.
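The refinements in Supplementary Notes 18 to 21 can be illustrated with three small helpers. The notes only state that node addition stops at a predetermined node count, that the threshold grows with the total amount of selected characters, and that a control symbol marks the end of a code. The linear schedule, the parameter names `base`, `step`, and `max_nodes`, and the value `-1` for the control symbol are all assumptions made for this sketch.

```python
END_OF_CODE = -1  # hypothetical control symbol; must differ from every node identifier


def threshold_for(total_chars, base=2, step=1000):
    """Threshold that never decreases as more characters are processed.

    Assumption: a linear schedule; the notes only require that the
    threshold become larger as the total amount of characters grows.
    """
    return base + total_chars // step


def may_add_node(node_count, max_nodes):
    """Allow node addition only until the tree reaches max_nodes nodes."""
    return node_count < max_nodes


def terminate(codes):
    """Append the end-of-code control symbol after the last identifier."""
    return list(codes) + [END_OF_CODE]


print(threshold_for(0), threshold_for(2500))  # 2 4
print(may_add_node(4, 4))                     # False
print(terminate([3, 2, 1]))                   # [3, 2, 1, -1]
```

Raising the threshold as the input grows keeps rare character sequences from consuming nodes late in a long text, and the node cap bounds the memory used by the tree.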

DESCRIPTION OF SYMBOLS
1 Text processing apparatus
1a Character selection means
1b Node addition means
1c Frequency counting means
1d Identifier output means
2 Phrase tree storage means
2a Phrase tree
3 Text data
4 Code word string

Claims

A text processing apparatus comprising:
character selection means for selecting characters in order from a character string in text data;
node addition means for referring to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is assigned a node identifier and the number of appearances of each character that appeared next while that node was the determination position, the node addition means starting the determination position from the root node; moving the determination position to a child node when that child node corresponds to the character selected by the character selection means and exists for the node at the determination position; moving the determination position to the child node of the root node that corresponds to the selected character when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has not reached a predetermined threshold; and, when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has reached the predetermined threshold, adding to the node at the determination position a child node that corresponds to the selected character and is assigned a new identifier, and moving the determination position to the added child node; and
frequency counting means for referring to the phrase tree storage means and, when no node corresponding to the selected character exists for the node at the determination position, counting up the number of appearances of the selected character assigned to the node at the determination position.

The text processing apparatus according to claim 1, further comprising identifier output means for referring to the phrase tree storage means and, when no child node corresponding to the selected character exists for the node at the determination position, outputting the identifier assigned to the node at the determination position.

The text processing apparatus according to claim 2, wherein, when the determination position has been moved as a result of the last character of the character string in the text data being selected, the identifier output means outputs the identifier assigned to the node at the determination position after the move.

The text processing apparatus according to claim 1, wherein the node adding unit adds a child node until the number of nodes in the phrase tree reaches a predetermined number.

The text processing apparatus according to claim 1, further comprising a threshold value changing unit that changes the threshold value according to the total amount of the selected characters.

A text processing method in which a computer:
selects characters in order from a character string in text data;
refers to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is assigned a node identifier and the number of appearances of each character that appeared next while that node was the determination position;
starts the determination position from the root node;
when a child node corresponding to the selected character exists for the node at the determination position, moves the determination position to that child node;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has not reached a predetermined threshold, moves the determination position to the child node of the root node that corresponds to the selected character;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has reached the predetermined threshold, adds to the node at the determination position a child node that corresponds to the selected character and is assigned a new identifier, and moves the determination position to the added child node; and
when no node corresponding to the selected character exists for the node at the determination position, counts up the number of appearances of the selected character assigned to the node at the determination position.

A text processing program that causes a computer to execute a process comprising:
selecting characters in order from a character string in text data;
referring to phrase tree storage means that stores a phrase tree in which a plurality of nodes corresponding to characters that can appear in the text data are associated in advance, in a tree structure, as children of a root node, in which a child node corresponding to a character that can appear in the text data can be added to each node, and in which each node is assigned a node identifier and the number of appearances of each character that appeared next while that node was the determination position;
starting the determination position from the root node;
when a child node corresponding to the selected character exists for the node at the determination position, moving the determination position to that child node;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has not reached a predetermined threshold, moving the determination position to the child node of the root node that corresponds to the selected character;
when no child node corresponding to the selected character exists for the node at the determination position and the number of appearances of the selected character assigned to the node at the determination position has reached the predetermined threshold, adding to the node at the determination position a child node that corresponds to the selected character and is assigned a new identifier, and moving the determination position to the added child node; and
when no node corresponding to the selected character exists for the node at the determination position, counting up the number of appearances of the selected character assigned to the node at the determination position.