JP2006302082A

JP2006302082A - Character string retrieval system

Info

Publication number: JP2006302082A
Application number: JP2005124860A
Authority: JP
Inventors: Takaaki Nakamura; 隆顕中村; Mitsunori Kori; 光則郡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-04-22
Filing date: 2005-04-22
Publication date: 2006-11-02
Anticipated expiration: 2025-04-22
Also published as: JP4726046B2

Abstract

PROBLEM TO BE SOLVED: To search a compressed block string efficiently and quickly in a retrieval device that determines whether a digitized text contains a prescribed character string described by a regular expression and obtains its position, where the compressed block string is obtained by replacing partial character strings of the text with compressed blocks.
SOLUTION: A DFA (deterministic finite automaton) is used for retrieval. A compressed block (code) is acquired from the compressed block string (code string) (S21). If the partial character string corresponding to the compressed block is new, the compressed block is decompressed into the partial character string, which is input to the DFA, and the history of the DFA's state transitions (state history) is stored (S261 to S264). Compressed blocks corresponding to the same partial character string are not decompressed; instead, the state history is referenced to update the state of the DFA (S251 to S254).
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for retrieving a predetermined character string from a document.

In recent years, documents have increasingly been digitized in various fields. As large volumes of digitized documents come into use, the following problems have arisen.

The first problem is that it becomes difficult to find a desired document among a large number of documents. A method for efficiently searching digitized documents is therefore required.
As a technique for solving this problem, a search method using a DFA (Deterministic Finite Automaton) is known (for example, Non-Patent Document 1).
In addition, searching only for a fixed character string is inefficient. The search condition is therefore generalized by allowing part or all of the search character string to be specified as alternatives, or by allowing repetition of the same character string, so that similar character strings are searched for simultaneously. Regular expressions are a known notation for search patterns that generalize a search character string in this way.
It is known that an NFA (Nondeterministic Finite Automaton) that searches for a regular expression can be constructed from that regular expression (for example, Non-Patent Documents 1 and 2).
It is further known that an NFA can be converted into an equivalent DFA (for example, Non-Patent Documents 1 and 2).

The second problem is that storing large numbers of documents consumes storage capacity, and exchanging them over a network consumes network bandwidth. To save storage capacity and reduce the amount of data transferred over the network, a method for efficiently compressing digitized documents is required.
As techniques for solving this problem, various lossless compression methods are known, such as LZ77 (Lempel-Ziv 77), LZ78, LZSS (Lempel-Ziv-Storer-Szymanski), LZW (Lempel-Ziv-Welch), and Huffman coding.

The third problem is that the search techniques and compression techniques described above have developed independently of each other, so it is difficult to search compressed documents efficiently.
The usual approach to this problem is simply to decompress (restore) the compressed text (compressed document) once and then search it. As the search method, for example, string matching is performed using a state transition machine (finite automaton).
On the other hand, methods for searching a compressed document without decompressing it are also known (for example, Patent Document 1 and Non-Patent Document 1).

The search method described in Patent Document 1 relates to searching compressed text at high speed with a fixed search character string. In this method, a finite automaton is created from a compression dictionary and a fixed search character string, and the compressed text is searched with that finite automaton without being decompressed, so that the search is faster by a factor equal to the reciprocal of the compression ratio.

The search method described in Non-Patent Document 1 relates to searching compressed text, compressed in the LZ78 or LZW format, at high speed with a search condition that includes regular expressions. In this method, the compressed text is searched with a deterministic finite automaton without being decompressed, which makes the search fast.
Patent Document 1: JP-A-10-260980
Non-Patent Document 1: Gonzalo Navarro, “Regular Expression Searching on Compressed Text”, Journal of Discrete Algorithms, Volume 1, Issue 5-6 (October 2003), pp. 423-443
Non-Patent Document 2: J. E. Hopcroft and J. D. Ullman, “Formal Languages and Their Relation to Automata”, Addison-Wesley (1969) (Japanese edition: “Language Theory and Automata”, Saiensu-sha, 1971)

The approach of decompressing a compressed document and then searching it requires both the time to decompress the compressed text and the time for string matching, so the search time becomes long.
The search method described in Patent Document 1 takes a fixed search character string and a compression dictionary as input, and therefore cannot handle regular expressions as search conditions. In addition, in dictionary-based compression methods in wide use today, such as LZ77, LZSS, LZ78, and LZW, the compression dictionary is generated while the compressed text is being decompressed, so the method of Patent Document 1, which assumes that the compression dictionary exists when the search starts, cannot be applied.
The search method described in Non-Patent Document 1 assumes that the text consists of single-byte codes and does not consider searching text that contains multibyte characters. Moreover, because it exploits features of LZ78 and LZW compression dictionaries, it cannot search text compressed by other methods whose compression dictionaries lack those features.

The present invention has been made, for example, to solve the problems described above, and its object is to search compressed documents efficiently.

The character string search device according to the present invention is a character string search device that executes an automaton, acquires a code string obtained by replacing partial character strings included in a character string with predetermined codes corresponding to those partial character strings, and searches the character string for a search character string,
wherein the automaton holds a state, receives a character as input, calculates a transition destination state based on the held state and the input character, and updates the held state to the calculated transition destination state, and is configured so that, when the characters constituting a given character string are input, whether a search character string corresponding to a predetermined search pattern is included in that character string can be determined by determining whether the held state becomes a predetermined state,
the device comprising:
an automaton execution unit that executes the automaton;
a history storage unit that stores the states held by the automaton as a state history;
a code acquisition unit that acquires the codes constituting the code string;
a condition determination unit that determines, based on the state held by the automaton, the state history stored in the history storage unit, and the code acquired by the code acquisition unit, whether a first condition and a second condition are satisfied;
a transition destination calculation unit that, when the condition determination unit determines that the first condition is satisfied, calculates a transition destination state based on the state history stored in the history storage unit and updates the state held by the automaton to the calculated transition destination state; and
a character string restoration unit that, when the condition determination unit determines that the second condition is satisfied, restores the partial character string corresponding to the code acquired by the code acquisition unit and inputs the characters constituting the partial character string to the automaton.

According to the present invention, for example, in a search device that searches compressed text for a character string by accepting a search condition written as a regular expression or the like and converting it into a finite automaton, the history of the search performed on a partial character string that was replaced with a compressed block is stored, and the stored history is used to skip state transitions of the automaton, so that the search can be performed at high speed.
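The following is a minimal sketch of that idea, not the claimed implementation. This passage does not spell out the first and second conditions, so the sketch assumes that the first condition means "the pair (current state, code) is already in the state history" and the second means "it is not". The names `TABLE`, `search_code_string`, `dictionary`, and `codes` are hypothetical, and the code-to-substring dictionary is a stand-in for whatever the compression format provides.

```python
# Hypothetical DFA that finds "abc" over the alphabet {a, b, c}; state 3 is accepting.
TABLE = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 3},
    3: {"a": 1, "b": 0, "c": 0},
}
ACCEPT = 3

def search_code_string(codes, dictionary):
    """Return (match end positions, number of codes that had to be decompressed)."""
    state, pos = 0, 0
    history = {}        # (state on entry, code) -> (state on exit, match offsets, length)
    matches = []
    decompressed = 0
    for code in codes:
        key = (state, code)
        if key not in history:
            # Assumed "second condition": restore the substring and feed it to the DFA.
            text = dictionary[code]
            decompressed += 1
            s, offsets = state, []
            for i, ch in enumerate(text, start=1):
                s = TABLE[s][ch]
                if s == ACCEPT:
                    offsets.append(i)
            history[key] = (s, offsets, len(text))
        # Assumed "first condition" (and the tail of the second): reuse the history.
        end_state, offsets, length = history[key]
        matches.extend(pos + off for off in offsets)
        state = end_state
        pos += length
    return matches, decompressed

# Hypothetical dictionary and code string for the text "bababccababc".
dictionary = {0: "b", 1: "ab", 2: "abc", 3: "c"}
codes = [0, 1, 2, 3, 1, 2]
print(search_code_string(codes, dictionary))  # ([6, 12], 4)
```

In this run the last two codes re-enter the DFA in states already recorded for them, so they are handled from the history without decompression.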

First, a search method using a DFA (an example of an automaton) will be described.

String matching with a DFA is based on the model of a state transition machine (automaton). A state transition machine internally has a state and a state transition function. The state transition function determines the next state from the current state and an input character. In string matching with a DFA, the input text is read one character at a time, and the machine moves to the next state obtained by applying the state transition function to the pair of the current state and the input character. With this method, matching is performed by scanning the text once without backtracking, which enables high-speed string matching. When matching against multiple conditions, a finite automaton with output (a Moore machine), which extends the DFA by defining an output for each state, is also used to distinguish which condition matched.

FIG. 39 is a conceptual diagram showing an example of state transitions in the operation of a DFA.
In FIG. 39, states 990 to 993 are the states of the DFA. The DFA holds exactly one of the states 990 to 993 at any time, and the held state transitions (is updated) in response to input. At the start of a search, it holds the initial state 990.
The arrows in the figure indicate state transitions. When the character attached to an arrow is input, the DFA transitions to the state at the head of the arrow.
For example, when the current state is state 990 and the character “a” is input, the DFA transitions to state 991. When the current state is state 990 and the character “b” or “c” is input, the state remains 990.

Here, to simplify the explanation, only three kinds of characters, “a”, “b”, and “c”, are input to the DFA, but an actual DFA may of course accept many more kinds of characters.
The term “character” here is not limited to characters in the narrow sense, such as letters of the alphabet or kanji; it may be anything a computer can handle as a character. On a computer, characters are represented by bit strings. For example, in ASCII (American Standard Code for Information Interchange), the character “a” is represented by the 8-bit string “01100001” (61h). In Shift JIS (JIS: Japanese Industrial Standards), the character “あ” is represented by the 16-bit string “1000001010100000” (82A0h). Thus the bit length of the bit string representing a character can differ depending on the character code used. Accordingly, anything that can be represented as a bit string on a computer can be treated as a “character” and input to the DFA.
However, a DFA cannot operate unless the transition destination for every input is determined in advance, so the number of kinds of characters that may be input to the DFA must be finite.
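As a small illustration of characters being bit strings of different lengths, the following sketch encodes one character in ASCII and one in Shift JIS using Python's built-in codecs. The helper name `bit_string` is hypothetical.

```python
def bit_string(ch, codec):
    """Encode one character with the given codec and return its bits as text."""
    return "".join(format(byte, "08b") for byte in ch.encode(codec))

print(bit_string("a", "ascii"))       # 01100001 (61h, 8 bits)
print(bit_string("あ", "shift_jis"))  # 1000001010100000 (82A0h, 16 bits)
```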

In FIG. 39, state 993 is a special state called an accepting state. A DFA constructed for searching enters an accepting state when the search succeeds.
The DFA shown in FIG. 39 searches for the search character string “abc”.

The operation of the DFA shown in FIG. 39 will be described using the example of searching the character string “bababcc” for the search character string “abc”.

The characters constituting the character string “bababcc” are input to the DFA one at a time from the beginning.
FIG. 40 shows how the state of the DFA transitions as characters are input to the DFA shown in FIG. 39.
At the start of the search, the DFA is in the initial state 990.
First, when the character “b” is input, the DFA remains in state 990.
Next, when the character “a” is input, the DFA transitions to state 991.
Next, when the character “b” is input, the DFA transitions to state 992.
Next, when the character “a” is input, the DFA transitions to state 991.
Next, when the character “b” is input, the DFA transitions to state 992.
Next, when the character “c” is input, the DFA transitions to state 993.
Since state 993 is an accepting state, it is known at this point that the character string “bababcc” contains the search character string “abc”. Moreover, since the DFA entered the accepting state 993 when the sixth character “c” was input, it is also known that the search character string “abc” appears at a position ending at the sixth character of “bababcc”.
Next, when the character “c” is input, the DFA transitions to state 990.
Since there are no more characters to input, the DFA ends its operation.

From the above operation, it is found that the search character string “abc” appears once in the character string “bababcc”, at a position ending at the sixth character.

Next, the flow of processing in which the automaton execution unit executes this DFA will be described.

The automaton execution unit stores the transition destination table of FIG. 41, which corresponds to the DFA of FIG. 39.
This table shows, for the DFA in the state given in the leftmost column, the state to which it transitions when the character given in the top row is input.
FIG. 42 is a flowchart showing an example of the processing flow of the automaton execution unit.
At the start of a search, the automaton execution unit initializes the state of the DFA to the initial state (S991). For example, it stores state number 0, the initial state, in the memory that holds the DFA state.
Next, for example, the character string restoration unit inputs a character string to the automaton one character at a time (S992). That is, it notifies the automaton execution unit of the character to be input, and the automaton execution unit acquires it.
The automaton execution unit refers to the stored transition destination table and calculates the transition destination state from the current DFA state and the input character (S993).
Next, the automaton execution unit updates the state of the DFA (S994). For example, it stores the state number of the transition destination state in the memory that holds the DFA state.
The automaton execution unit determines whether the DFA state is an accepting state (S995). If it is, the search has succeeded, so search success processing is performed (S996). For example, a message indicating that the search succeeded and the position at which the search character string appeared are displayed on the CRT display device 901.
The above processing is repeated until there are no more characters to input to the DFA (S997).
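The loop of FIG. 42 can be sketched as a table lookup per input character. In the sketch below, state numbers 0 to 3 stand for states 990 to 993. Only the transitions mentioned in the text are taken from the document; the remaining table entries, and the names `TABLE`, `ACCEPTING`, and `run_dfa`, are assumptions chosen so that the DFA finds “abc” over the alphabet {a, b, c}.

```python
# Transition destination table: TABLE[state][character] -> next state.
# Entries not stated in the text are assumed.
TABLE = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 3},
    3: {"a": 1, "b": 0, "c": 0},
}
ACCEPTING = {3}

def run_dfa(text):
    """Run the DFA over text; return the visited states and the match end positions."""
    state = 0                                       # S991: initialize to the initial state
    trace = [state]
    match_ends = []
    for position, ch in enumerate(text, start=1):   # S992: input one character
        next_state = TABLE[state][ch]               # S993: look up the destination state
        state = next_state                          # S994: update the held state
        trace.append(state)
        if state in ACCEPTING:                      # S995: accepting state?
            match_ends.append(position)             # S996: search success processing
    return trace, match_ends                        # S997: no more input

trace, match_ends = run_dfa("bababcc")
print(trace)        # [0, 0, 1, 2, 1, 2, 3, 0]
print(match_ends)   # [6]
```

The printed trace corresponds to the sequence 990, 990, 991, 992, 991, 992, 993, 990 for the input “bababcc”, with a single match ending at the sixth character.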

FIG. 43 shows an example of a slightly more complex DFA.
The DFA shown in FIG. 43 is configured to search for the character strings “ababb”, “abca”, “aba”, and “bb”.

Next, regular expressions will be described.

A regular expression is a notation for expressing a class of languages called regular languages.

A regular language is the set of those character strings, among all strings formed by arbitrarily concatenating the characters of its alphabet, that follow certain rules.
A regular expression is a character string made up of the characters of the regular language and metacharacters, and expresses the rule for deciding whether a given character string belongs to the regular language.
Various notations for regular expressions are known. One example is described here.

To simplify the explanation, the regular language is assumed to use only three characters, “a”, “b”, and “c”.
There are five metacharacters: “(”, “)”, “|”, “*”, and “?”. Here, “(” and “)” denote grouping, “|” denotes alternation, “*” denotes zero or more repetitions, and “?” denotes zero or one occurrence.
For example, the regular expression “(ab|c)” expresses the regular language whose elements are the character strings “ab” and “c”.
The regular expression “ab*” expresses the regular language whose elements are the character strings “a”, “ab”, “abb”, “abbb”, and so on.
The regular expression “c?b” expresses the regular language whose elements are the character strings “b” and “cb”.

Regular expressions used in practice are more complex, but they are not described here (see, for example, Non-Patent Document 2).

Because a regular expression can express a set of character strings concisely in this way, regular expressions are used as search patterns that describe search conditions.
For example, to search for the character strings “ababb”, “abca”, “aba”, and “bb”, the search pattern can be written as the regular expression “ababb|abca|aba|bb”. It may also be written as “(aba)?(bb)?|abca” or as “abc?a|(aba)?bb”.
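The five metacharacters above have the same meaning in Python's `re` module, so the examples can be checked by enumeration. The sketch below lists every non-empty string over {a, b, c} up to a given length that a pattern matches in full. The helper name `language` is hypothetical. Only non-empty strings are enumerated; the pattern “(aba)?(bb)?|abca” also matches the empty string, which does not matter when searching for non-empty strings.

```python
import re
from itertools import product

def language(pattern, max_len=5, alphabet="abc"):
    """Return all non-empty strings up to max_len that the whole pattern matches."""
    words = ("".join(p) for n in range(1, max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

print(sorted(language("(ab|c)")))          # ['ab', 'c']
print(sorted(language("ab*", max_len=4)))  # ['a', 'ab', 'abb', 'abbb']
print(sorted(language("c?b")))             # ['b', 'cb']

# The three ways of writing the four-string search pattern agree on non-empty strings.
target = {"ababb", "abca", "aba", "bb"}
for pattern in ("ababb|abca|aba|bb", "(aba)?(bb)?|abca", "abc?a|(aba)?bb"):
    print(pattern, language(pattern) == target)  # True for each pattern
```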

One way to search for a search character string that corresponds to a search pattern written as a regular expression is to obtain the NFA corresponding to the regular expression and execute that NFA.

Like a DFA, an NFA is a kind of automaton. In an NFA, however, a state may have two or more transition destination states for a single input character, and the state may change even when no character is input (called an “empty transition” or “ε-transition”), so the transition destination state cannot be uniquely determined.
To execute an NFA, when there are two or more transition destinations, one of them is selected and executed tentatively. If that execution fails, execution returns to the branch point, selects another alternative, and tries again. An NFA can be executed by backtracking in this way.

However, when an NFA is executed by backtracking in this way, every failure partway through causes a return to an earlier point, so the search takes time.

Therefore, the NFA is converted into a DFA and the converted DFA is executed, which allows the search to be completed in a short time without backtracking.
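As an illustration only, the sketch below applies the standard subset construction to a small hand-written NFA with ε-transitions for the pattern “(ab|c)”. The NFA, the use of an empty string as the ε label, and the function names are assumptions made for this example; the document itself defers the conversion to the cited literature.

```python
EPS = ""  # label used in this sketch for an ε-transition

# Hypothetical NFA for "(ab|c)": state -> {label -> set of next states}; state 3 accepts.
NFA = {
    0: {EPS: {1, 4}},
    1: {"a": {2}},
    2: {"b": {3}},
    4: {"c": {3}},
}
NFA_ACCEPTING = {3}

def eps_closure(nfa, states):
    """Return all NFA states reachable from `states` through ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = eps_closure(nfa, {start})
    table, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for ch in alphabet:
            moved = set()
            for s in current:
                moved |= nfa.get(s, {}).get(ch, set())
            nxt = eps_closure(nfa, moved)   # the empty set acts as a dead state
            table[current][ch] = nxt
            todo.append(nxt)
    return start_set, table

def dfa_accepts(start_set, table, accepting, text):
    """Run the constructed DFA once over text, with no backtracking."""
    state = start_set
    for ch in text:
        state = table[state][ch]
    return bool(state & accepting)

start_set, table = to_dfa(NFA, 0, "abc")
print(len(table))                                          # 4 DFA states
print(dfa_accepts(start_set, table, NFA_ACCEPTING, "ab"))  # True
print(dfa_accepts(start_set, table, NFA_ACCEPTING, "a"))   # False
```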

Methods for obtaining an NFA from a regular expression and for converting an NFA into a DFA are already known and are not described here (see, for example, Non-Patent Document 1).

Next, document compression techniques will be described.
Document compression techniques in common use today are of either the self-referencing type or the dictionary-referencing type. The dictionary-referencing type is further divided into techniques that use a separately prepared dictionary and techniques in which the dictionary is embedded in the compressed document.

Self-referencing compression techniques will now be described. Examples of this kind of compression are the LZ77 and LZSS methods.
The basic idea of self-referencing compression is that when the same partial character string occurs at different positions in the original character string, one occurrence is replaced with a reference to the other, thereby shortening the code length of the character string as a whole.

FIG. 44 is a diagram showing an example of encoding in the LZSS method.

The compression of the character string “cbababcbc” is used as an example. ASCII is assumed here, so each character is represented by an 8-bit string.

The partial character strings included in the character string are converted into codes according to the following rules.

Rule 1: A partial character string consisting of one character is converted into a 9-bit code consisting of a flag 981 (1 bit) and a bit string 982 (8 bits) representing that character.
Rule 2: When a partial character string matches another partial character string that appeared earlier, it is converted into a 14-bit code consisting of the flag 981 (1 bit), the appearance position 983 of the other partial character string (for example, 8 bits), and the length 984 of the partial character string (for example, 5 bits).

The flag 981 distinguishes whether the code was produced by rule 1 or rule 2, and is used later when the original character string is restored.
The bit string 982 is the ASCII code of the character. The same code as the original character is used here, but a different code may be substituted as long as the original character can be restored.
The appearance position 983 of the other partial character string gives the start of the earlier partial character string as a distance (a number of characters back) from the start of the current partial character string. In this example the appearance position 983 is an 8-bit string, so rule 2 cannot be applied when the other partial character string lies more than 256 characters back.
The length 984 of the partial character string is the number of characters in the partial character string encoded by rule 2. In this example the length 984 is a 5-bit string, so rule 2 cannot be applied to a partial character string of 32 characters or more.

When the character string “cbababcbc” is compressed, the first “c”, “b”, and “a” have not appeared before, so rule 1 is applied to each character and they are converted into codes 601 to 603.
The three-character partial character string “bab” starting at the fourth character matches another three-character partial character string “bab” starting at the second character (two characters before the current position), so rule 2 is applied and it is converted into code 604. (As this shows, the other partial character string may partly overlap the string being encoded.)
The two-character partial character string “cb” starting at the seventh character matches another two-character partial character string “cb” starting at the first character (six characters before the current position), so rule 2 is applied and it is converted into code 605.
The one-character partial character string “c” starting at the ninth character matches the one-character partial character string “c” starting at the seventh character, so rule 2 could be applied. However, a code produced by rule 2 is 14 bits long, whereas a code produced by rule 1 is only 9 bits long. Rule 1 is therefore applied, so that the compression efficiency is higher (the bit length after compression is shorter), and the character is converted into code 606.
Through the above conversion, the character string “cbababcbc”, which has a total bit length of 72 bits, is replaced with a code string having a total bit length of 64 bits.
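The worked example above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the names lzss_encode and total_bits, the greedy longest-match search, and the 255-character window are hypothetical choices, not taken from the conventional example. Only the 9-bit and 14-bit code lengths and the 31-character limit of rule 2 come from the text.

```python
LITERAL_BITS = 1 + 8        # rule 1: flag bit + 8-bit character
POINTER_BITS = 1 + 8 + 5    # rule 2: flag bit + 8-bit position + 5-bit length


def lzss_encode(text, window=255, max_len=31):
    """Return tokens: a str for rule 1, a (distance, length) tuple for rule 2."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for start in range(max(0, i - window), i):
            length = 0
            # The match may run past position i, i.e. overlap the string being encoded.
            while (length < max_len and i + length < len(text)
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        # Use rule 2 only when it is shorter than emitting the characters by rule 1.
        if best_len * LITERAL_BITS > POINTER_BITS:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def total_bits(tokens):
    """Bit length of the code string for the given tokens."""
    return sum(LITERAL_BITS if isinstance(t, str) else POINTER_BITS for t in tokens)


tokens = lzss_encode("cbababcbc")
print(tokens)              # ['c', 'b', 'a', (2, 3), (6, 2), 'c']
print(total_bits(tokens))  # 64
```

The six tokens correspond to codes 601 to 606: four 9-bit codes and two 14-bit codes, 64 bits in total, against 72 bits for the nine uncompressed 8-bit characters.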

A procedure for restoring the original character string from the code string obtained in this way will now be described.

FIG. 45 is a flowchart showing an example of the flow of control when the original character string is restored from the code string in the conventional example.
First, one bit (flag 981) is acquired from the code string (S981), and it is determined whether the code that follows was produced by rule 1 or by rule 2 (S982).
If the flag 981 is “1”, rule 1 applies, so the following 8 bits (code 982) are acquired (S983) and output as a character (S984).
The output character is stored in the output history (S985).
For example, code 601 in FIG. 44 is converted into the character “c” and output.
If the flag 981 is “0”, rule 2 applies, so the following 13 bits (appearance position 983 and length 984) are acquired (S986).
For example, code 604 in FIG. 44 indicates three characters starting two characters back.
Next, the partial character string is restored with reference to the output history and output (S987).
Each output character is stored in the output history (S988).
For example, in the case of code 604 in FIG. 44, the character two characters back in the output history is read. At this point the three characters “cba” are stored as the output history, so the character two characters back is “b”. Therefore, “b” is output and immediately stored in the output history, which becomes “cbab”.
This is repeated for the number of characters indicated by the length (S989).
For the second character, the character two characters back is “a”, so “a” is output as the second character, and the output history becomes “cbaba”. For the third character, the character two characters back is “b”, so “b” is output as the third character. Three characters have now been output, so the processing of code 604 ends and the next code is processed. The three characters “bab” are thus output for code 604.
This is repeated until the code string ends (S990).
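The restoration flow of FIG. 45 can be sketched in Python as follows. This is a minimal illustration, not the conventional implementation itself: the code string is modeled as a str of '0' and '1' characters with no padding, the appearance position is interpreted as a distance back from the current position, and the names pack and lzss_decode are hypothetical.

```python
def pack(tokens):
    """Build a bit string from tokens (str = rule 1, (distance, length) = rule 2)."""
    out = []
    for t in tokens:
        if isinstance(t, str):
            out.append("1" + format(ord(t), "08b"))
        else:
            distance, length = t
            out.append("0" + format(distance, "08b") + format(length, "05b"))
    return "".join(out)


def lzss_decode(bits):
    """Restore the original string from an unpadded bit string."""
    history = []                      # output history
    i = 0
    while i < len(bits):              # S990: repeat until the code string ends
        flag = bits[i]                # S981: acquire the flag
        i += 1
        if flag == "1":               # S982: rule 1
            history.append(chr(int(bits[i:i + 8], 2)))   # S983 to S985
            i += 8
        else:                         # rule 2
            distance = int(bits[i:i + 8], 2)             # S986
            length = int(bits[i + 8:i + 13], 2)
            i += 13
            for _ in range(length):                      # S989
                # Copy one character at a time so that overlapping
                # references see the characters just written (S987, S988).
                history.append(history[-distance])
    return "".join(history)


bits = pack(['c', 'b', 'a', (2, 3), (6, 2), 'c'])
print(len(bits))          # 64
print(lzss_decode(bits))  # cbababcbc
```

Copying character by character, rather than slicing the history once, is what allows the overlapping reference of code 604 to be restored correctly.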

FIG. 46 is a diagram showing an example of encoding in the LZ77 method, which is also a kind of self-reference compression technique.
The LZ77 method differs from the LZSS method in its encoding rules, but is otherwise almost the same.
In the LZ77 method, the original character string is replaced with a code string according to the following two rules.

Rule 1: If a partial character string, excluding its last character, matches another partial character string that appeared earlier, the partial character string is converted into two codes: a 13-bit code consisting of the appearance position 983 of the other partial character string (for example, 8 bits) and the length 984 of the partial character string (for example, 5 bits), and a bit string indicating the last character (for example, 8 bits).
Rule 2: A partial character string consisting of one character is converted into two codes: a code indicating appearance position 0 and length 0 (for example, 13 bits), and a bit string indicating that character (for example, 8 bits). Here, appearance position 0 is one example of a value indicating that there is no reference to another character string.

According to these rules, a code indicating a pointer to another partial character string and a code indicating a bit string representing a character always appear alternately, so no flag is needed to indicate which kind of code it is.

A description of the encoding and restoration operations is omitted.
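Although the operations are omitted here, the two LZ77 rules above suggest a simple restoration loop, sketched below in Python. Everything in it is an assumption made for illustration: the function name lz77_decode, the representation of each code pair as a (distance, length, character) tuple, the reading of the appearance position as a distance back from the current position, and the sample tokens, which are not taken from FIG. 46.

```python
def lz77_decode(triples):
    """Restore a string from (distance, length, next_char) triples.

    distance 0 and length 0 (rule 2) mean there is no back reference.
    """
    out = []
    for distance, length, ch in triples:
        for _ in range(length):
            out.append(out[-distance])   # copy from the already restored text
        out.append(ch)                   # the explicit last character
    return "".join(out)


# Hypothetical code string for "cbababcbc"
sample = [(0, 0, "c"), (0, 0, "b"), (0, 0, "a"), (2, 3, "c"), (2, 1, "c")]
print(lz77_decode(sample))   # cbababcbc
```

Because every pointer is followed by exactly one literal character, the loop never needs a flag to tell the two kinds of code apart.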

A dictionary reference compression technique will now be described.
The basic idea of dictionary reference compression is that, when a partial character string of the original character string matches a word registered in a dictionary (replacement dictionary), the partial character string is replaced with the code registered for it in the dictionary, thereby shortening the bit length of the entire character string.
For example, assume that there is a dictionary as shown in FIG. 47. When the character string “cbababcbc” is compressed, the code string “12233” is obtained. If the bit length per code is 12 bits, the total bit length is 60 bits.
For example, when a document written in a natural language is compressed, a high compression ratio can be obtained by registering words of that language and frequently occurring phrases in the dictionary.
The dictionary may be prepared separately from the code string or may be embedded in the code string.
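As a rough illustration of dictionary reference compression, the following Python sketch replaces the longest registered word at each position with its code. FIG. 47 is not reproduced here, so the dictionary contents are an assumption inferred from the stated input and output (they are the entries for which “cbababcbc” yields “12233”); the function name dict_compress and the greedy longest-match strategy are likewise hypothetical.

```python
# Assumed dictionary: word -> code (inferred, not copied from FIG. 47)
DICTIONARY = {"c": 1, "ba": 2, "bc": 3}
BITS_PER_CODE = 12


def dict_compress(text, dictionary):
    """Greedily replace the longest registered word at each position."""
    codes = []
    i = 0
    while i < len(text):
        candidates = (w for w in dictionary if text.startswith(w, i))
        word = max(candidates, key=len, default=None)
        if word is None:
            raise ValueError(f"no dictionary entry matches at position {i}")
        codes.append(dictionary[word])
        i += len(word)
    return codes


codes = dict_compress("cbababcbc", DICTIONARY)
print(codes)                        # [1, 2, 2, 3, 3]
print(len(codes) * BITS_PER_CODE)   # 60
```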

An embedded dictionary reference compression technique will now be described. Methods of this type include the LZ78 method and the LZW method.
The embedded dictionary reference type is one form of the dictionary reference type. Instead of preparing a dictionary separately, the dictionary information is embedded in the code string.
FIG. 48 is a diagram showing an example of encoding in the LZ78 method.

A partial character string included in the character string is converted into codes according to the following rules.

Rule 1: If a partial character string is registered in the dictionary, it is converted into a code 971 indicating its dictionary reference number (for example, 10 bits), and the next character is converted into a code 972 that is a bit string indicating that character (8 bits).
Rule 2: If a partial character string consisting of one character is not registered in the dictionary, it is converted into two codes: a code 971 indicating reference number 0 and a code 972 that is a bit string indicating that character. Here, reference number 0 is one example of a number indicating that the string is not registered in the dictionary; another number may be used.

According to these rules, the code 971 indicating a reference number and the code 972 indicating a bit string representing a character always appear alternately, so no flag is needed to indicate which kind of code it is. However, a flag bit for distinguishing rule 1 from rule 2 may be provided, in which case, under rule 2, only the code indicating the character is emitted.

The string converted by these rules (partial character string plus one character) is not registered in the replacement dictionary. If it were, a partial character string one character longer could have been converted into a code of the same bit length. This string (partial character string plus one character) is therefore newly registered in the replacement dictionary 650.

Before compression starts, nothing is registered in the replacement dictionary 650. However, partial character strings agreed on in advance may be registered beforehand. For example, if all partial character strings consisting of one character are registered in the replacement dictionary 650, rule 2 becomes unnecessary.

Taking the character string “abababcbc” as an example, the first “a” is not registered in the replacement dictionary 650. Rule 2 is therefore applied, and the reference number “0” and the character “a” are output as codes 621 and 622. The partial character string “a” is then registered in the replacement dictionary 650 with reference number “1”.
The next “b” is also not registered in the replacement dictionary 650, so the reference number “0” and the character “b” are output, and “b” is registered in the replacement dictionary 650 (reference number 2).
The next “a” is registered in the replacement dictionary 650 (reference number 1), but “ab” is not, so the reference number “1” and the character “b” are output, and “ab” is registered in the replacement dictionary 650 (reference number 3).
The next “ab” is registered in the replacement dictionary 650 (reference number 3), but “abc” is not, so the reference number “3” and the character “c” are output, and “abc” is registered in the replacement dictionary 650 (reference number 4).
The next “b” is registered in the replacement dictionary 650 (reference number 2), but “bc” is not, so the reference number “2” and the character “c” are output, and “bc” is registered in the replacement dictionary 650 (reference number 5).
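The encoding steps above can be traced with a small Python sketch. It is an illustration only: the function name lz78_encode and the use of (reference number, character) tuples in place of the bit-level codes 971 and 972 are assumptions, and the handling of input that ends in the middle of a registered string (emitting an empty character) is a choice made here, since the text does not cover that case.

```python
def lz78_encode(text):
    """Return ((reference number, character) pairs, final dictionary)."""
    dictionary = {}      # replacement dictionary: string -> reference number
    pairs = []
    phrase = ""          # longest registered string matched so far
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                                 # keep extending the match
        else:
            # Reference number 0 means "not registered" (rule 2).
            pairs.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Assumption: flush a trailing match with an empty character.
        pairs.append((dictionary[phrase], ""))
    return pairs, dictionary


pairs, dictionary = lz78_encode("abababcbc")
print(pairs)       # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'c'), (2, 'c')]
print(dictionary)  # {'a': 1, 'b': 2, 'ab': 3, 'abc': 4, 'bc': 5}
```

The five pairs match the five steps of the walkthrough, and the dictionary ends with the same five entries and reference numbers.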

FIG. 49 is a flowchart showing an example of the flow of control when the original character string is restored from the code string in the conventional example.
First, the replacement dictionary 650 is initialized (S971), for example by emptying it.
Next, 10 bits (code 971 indicating the dictionary reference number) are acquired from the code string (S972). If the reference number is other than 0 (S973), the replacement dictionary 650 is consulted to obtain the preceding character string corresponding to the reference number, which is then output (S974).
Then, 8 bits (code 972, a bit string representing a character) are acquired from the code string (S975), and the character they indicate is output (S976).
The characters output in S974 and S976 are concatenated and newly registered in the replacement dictionary 650 (S977).
This is repeated until the code string is exhausted (S978).

For example, a case where the code string 600 of FIG. 48 is restored will be described.
First, the replacement dictionary 650 is emptied.
Code 621 indicates the reference number “0” and code 622 indicates the character “a”, so the partial character string “a” is restored from these two codes and output. Then, “a” is registered in the replacement dictionary 650 (reference number 1).
Code 623 indicates the reference number “0” and code 624 indicates the character “b”, so the partial character string “b” is restored from these two codes and output. Then, “b” is registered in the replacement dictionary 650 (reference number 2).
Code 625 indicates the reference number “1” and code 626 indicates the character “b”, so the partial character string “ab” is restored from these two codes and output. Then, “ab” is registered in the replacement dictionary 650 (reference number 3).
Code 627 indicates the reference number “3” and code 628 indicates the character “c”, so the partial character string “abc” is restored from these two codes and output. Then, “abc” is registered in the replacement dictionary 650 (reference number 4).
Code 629 indicates the reference number “2” and code 630 indicates the character “c”, so the partial character string “bc” is restored from these two codes and output. Then, “bc” is registered in the replacement dictionary 650 (reference number 5).
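The restoration flow of FIG. 49 and the example above can be sketched in Python as follows. This is a minimal illustration under assumptions: the code string is modeled as a list of (reference number, character) tuples rather than 10-bit and 8-bit fields, and the name lz78_decode is hypothetical.

```python
def lz78_decode(pairs):
    """Return (restored string, rebuilt dictionary) from (ref, char) pairs."""
    dictionary = {}                      # S971: initialize (empty) the dictionary
    out = []
    for ref, ch in pairs:                # S972, S975: acquire codes 971 and 972
        # S973, S974: look up the preceding string unless the reference is 0
        prefix = dictionary[ref] if ref != 0 else ""
        phrase = prefix + ch             # S976: append the explicit character
        out.append(phrase)
        dictionary[len(dictionary) + 1] = phrase   # S977: register the new string
    return "".join(out), dictionary      # S978: loop ends with the code string


text, dictionary = lz78_decode([(0, "a"), (0, "b"), (1, "b"), (3, "c"), (2, "c")])
print(text)        # abababcbc
print(dictionary)  # {1: 'a', 2: 'b', 3: 'ab', 4: 'abc', 5: 'bc'}
```

The decoder rebuilds the same dictionary as the encoder without receiving it separately, which is what makes the dictionary "embedded" in the code string.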

Note that, instead of storing the partial character string corresponding to each reference number, the replacement dictionary 650 may store a pointer to the corresponding part of the restored character string. Alternatively, it may store the reference number corresponding to the preceding character string together with the remaining one character.

FIG. 50 is a diagram showing an example of encoding in the LZW method, which is also a kind of embedded dictionary reference compression technique.

Rule 1: A partial character string is converted into a code 971 indicating its dictionary reference number.

All one-character partial character strings that may appear are registered in the dictionary at the start. A partial character string is therefore never absent from the dictionary. Consequently, unlike the LZ78 method, there is no conversion rule (rule 2) for the unregistered case.

In this example, only the three characters “a”, “b”, and “c” are assumed to appear, so the three partial character strings “a”, “b”, and “c” are registered in the dictionary at the start.

The converted partial character string plus the next (not yet converted) character is newly registered in the dictionary.

Since there is only one rule, there is no need to distinguish between encoding rules. In addition, because some partial character strings are registered in the dictionary from the beginning, the compression ratio is better than with the LZ78 method.

A detailed description of the encoding and restoration is omitted.
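Although the details are omitted here, the single LZW rule can be sketched in a few lines of Python. The sketch rests on assumptions: the function name lzw_encode, the numbering of the initial entries (“a” = 1, “b” = 2, “c” = 3), and the sample input “abababcbc” are chosen for illustration and are not taken from FIG. 50. Every input character is assumed to belong to the initial alphabet.

```python
def lzw_encode(text, alphabet="abc"):
    """Return the list of dictionary reference numbers for text."""
    # All one-character strings are registered before encoding starts.
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                       # keep extending the match
        else:
            codes.append(dictionary[phrase])   # rule 1: emit the reference number
            # Register the converted string plus the next (unconverted) character.
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])       # flush the final match
    return codes


print(lzw_encode("abababcbc"))   # [1, 2, 4, 4, 3, 2, 3]
```

Each output is a reference number only; unlike the LZ78 sketch of pairs, no explicit character accompanies it.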

Embodiment 1.
Embodiment 1 will be described with reference to FIGS. 1 to 8.

FIG. 1 is a diagram showing an example of the appearance of a compressed text search device 100 (an example of a character string search device) in this embodiment.
In FIG. 1, the compressed text search device 100 includes a system unit 910, a CRT (Cathode Ray Tube) display device 901, a keyboard (K/B) 902, a mouse 903, a compact disc device (CDD) 905, a printer device 906, and a scanner device 907, which are connected by cables.
Furthermore, the compressed text search device 100 is connected by cables to a FAX machine 932 and a telephone 931, and is connected to the Internet 940 via a local area network (LAN) 942 and a gateway 941.

FIG. 2 is a diagram showing an example of the hardware configuration of the compressed text search device in this embodiment.
In FIG. 2, the compressed text search device 100 includes a CPU (Central Processing Unit) 911 that executes programs. The CPU 911 is connected via a bus 912 to a ROM (Read Only Memory) 913, a RAM (Random Access Memory) 914, a communication board 915, the CRT display device 901, the K/B 902, the mouse 903, an FDD (Flexible Disk Drive) 904, a magnetic disk device 920, the CDD 905, the printer device 906, and the scanner device 907.
The RAM 914 is an example of a volatile memory. The ROM 913, the FDD 904, the CDD 905, and the magnetic disk device 920 are examples of nonvolatile memories. These are examples of a storage device or a storage unit.
The communication board 915 is connected to the FAX machine 932, the telephone 931, the LAN 942, and the like.
For example, the communication board 915, the K/B 902, the scanner device 907, and the FDD 904 are examples of an input unit.
Likewise, the communication board 915 and the CRT display device 901, for example, are examples of an output unit.

Here, the communication board 915 is not limited to being connected to the LAN 942; it may be connected directly to the Internet 940 or to a WAN (Wide Area Network) such as ISDN. In that case, the compressed text search device 100 is connected to the Internet 940 or to the WAN such as ISDN, and the gateway 941 is unnecessary.
The magnetic disk device 920 stores an operating system (OS) 921, a window system 922, a program group 923, and a file group 924. The program group 923 is executed by the CPU 911, the OS 921, and the window system 922.

The program group 923 stores programs that execute the functions described as “... unit” in the description of the embodiments below. The programs are read and executed by the CPU 911.
The file group 924 stores, as “... files”, the items described as “determination result of ...”, “calculation result of ...”, and “processing result of ...” in the description of the embodiments below.
The arrows in the flowcharts described in the embodiments below mainly indicate input and output of data. For that input and output, the data is recorded in the RAM 914, the magnetic disk device 920, or another recording medium such as an FD (Flexible Disk), an optical disc, a CD (compact disc), an MD (MiniDisc), or a DVD (Digital Versatile Disk). Alternatively, the data is transmitted over a signal line or another transmission medium.

What is described as a “... unit” in the description of the embodiments below may be realized by firmware stored in the ROM 913. Alternatively, it may be implemented by software alone, by hardware alone, by a combination of software and hardware, or by a combination that also includes firmware.

The programs that implement the embodiments described below may also be stored using the RAM 914, the magnetic disk device 920, or a recording device based on another recording medium such as an FD (Flexible Disk), an optical disc, a CD (compact disc), an MD (MiniDisc), or a DVD (Digital Versatile Disk).

FIG. 3 is a block diagram showing an example of the block configuration of the compressed text search device 100 in this embodiment.
This compressed text search device determines whether a character string matching the search condition exists in the input compressed text and, if one exists, outputs the position of the end of that character string as a hit position. If none exists, nothing is output.
In FIG. 3, the compressed text search device 100 comprises a search condition input unit 102, a compressed text storage unit 103, a collation result output unit 104, a state transition table generation unit 105, a state transition table storage unit 106, and a collation unit 107. The collation unit 107 includes a compressed block acquisition unit 108, a character acquisition unit 109, a state transition machine 110, a state storage unit 111, a state transition storage unit 112, a compression dictionary storage unit 113, a condition determination unit 114, a current position counter 115, a transition destination calculation unit 116, and a search success determination unit 117.

The search condition input unit 102 inputs a search condition (an example of a search pattern). The search condition is expressed using a regular expression, but it may instead be a fixed search character string.

The state transition table generation unit 105 has a function of, upon receiving a search condition, generating a state transition table corresponding to a DFA (an example of an automaton) that accepts character strings matching the search condition.
That is, based on the search condition input by the search condition input unit 102, it obtains the corresponding NFA, converts the NFA into a DFA, and generates the corresponding state transition table (an example of a transition destination list).
The state transition table generated by the state transition table generation unit 105 is stored in the state transition table storage unit 106.

The compressed text storage unit 103 stores compressed text (an example of a code string). Compressed text is an electronic document (an example of a character string) that has been encoded by a compression technique so that its total bit length is shortened. The compressed text need not be an encoded document; it may be any encoded data stored by a computer.

The collation unit 107 has a function of, upon receiving compressed text, determining with reference to the state transition table whether a character string matching the search condition exists in the compressed text and, if one exists, outputting its hit position.
That is, when the compressed text stored in the compressed text storage unit 103 is input, the collation unit 107 determines whether the uncompressed document corresponding to that compressed text contains a partial character string matching the search condition and, if it does, outputs the position at which it appears (the hit position).

The hit position output by the collation unit 107 is displayed by the collation result output unit 104 on, for example, the CRT display device 901.

The compressed block acquisition unit 108 (an example of a code acquisition unit) has a function of acquiring compressed blocks (an example of codes) one at a time from the input compressed text.
That is, it acquires the compressed blocks constituting the compressed text stored in the compressed text storage unit 103 in order from the beginning.

The compression dictionary storage unit 113 (an example of a dictionary storage unit) has a function of acquiring a compression dictionary (an example of a replacement dictionary) from the compressed text and storing it.
That is, in an embedded dictionary reference compression method, in which the compression dictionary information is embedded in the compressed text, it extracts and stores the compression dictionary information embedded in the compressed text. Alternatively, in a dictionary reference compression method in which the compression dictionary is prepared separately from the compressed text, it acquires and stores the separately prepared compression dictionary.

The character acquisition unit 109 (an example of a character string restoration unit) has a function of acquiring characters one at a time from the compressed block acquired by the compressed block acquisition unit 108, while referring to the compression dictionary stored in the compression dictionary storage unit 113.
That is, it obtains the partial character string corresponding to the compressed block based on the compression dictionary stored in the compression dictionary storage unit 113. It then acquires the characters constituting that partial character string in order from the beginning and inputs them to the state transition machine 110.

The state storage unit 111 has a function of storing the current state of the state transition machine 110. The state transition machine 110 has a function of obtaining the next state by referring to the state transition table, based on the character from the character acquisition unit 109 and the state in the state storage unit 111, and updating the state in the state storage unit 111. The state transition storage unit 112 has a function of storing the history of state transitions corresponding to the character strings in the compression dictionary storage unit 113.
That is, the state storage unit 111 and the state transition machine 110 are an example of an automaton execution unit, and execute the DFA corresponding to the state transition table stored in the state transition table storage unit 106. The state held by the DFA is stored in the state storage unit 111. The state transition machine 110 refers to the state transition table stored in the state transition table storage unit 106, based on the DFA state stored in the state storage unit 111 and the character input by the character acquisition unit 109, and obtains the transition destination state. The state storage unit 111 stores the transition destination state obtained by the state transition machine 110 as the DFA state, overwriting the old DFA state (that is, it updates the state).
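The interplay between the state storage unit and the state transition machine can be illustrated with a small Python sketch. It is not the embodiment's implementation: the class and function names, the hand-written transition table for the fixed search string “bc”, and the convention of reporting 1-based end positions as hit positions are all assumptions made for this example. In the embodiment the table would instead be generated from the search condition by the state transition table generation unit.

```python
# Hypothetical state transition table for the fixed search string "bc".
# Characters missing from a row lead back to state 0.
TABLE = {
    0: {"b": 1},
    1: {"b": 1, "c": 2},
    2: {"b": 1},          # state 2 is accepting; keep scanning for further hits
}
ACCEPTING = {2}


class StateTransitionMachine:
    def __init__(self, table, start=0):
        self.table = table
        self.state = start            # role of the state storage unit

    def step(self, ch):
        """Look up the transition destination and overwrite the stored state."""
        self.state = self.table[self.state].get(ch, 0)
        return self.state


def find_hits(text):
    """Return the 1-based end positions of every match in text."""
    machine = StateTransitionMachine(TABLE)
    hits = []
    for position, ch in enumerate(text, start=1):   # role of the position counter
        if machine.step(ch) in ACCEPTING:
            hits.append(position)
    return hits


print(find_hits("abcdecbabcdebecdabcded"))   # [3, 10, 19]
```

Each character costs one table lookup, which is why a DFA-based search runs in time proportional to the length of the text regardless of the pattern.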

The state transition storage unit 112 (an example of a history storage unit) stores the history of the DFA states stored by the state storage unit 111 (the state history).

The condition determination unit 114 determines conditions such as whether a compressed block is to be restored to its original partial character string.

The current position counter 115 is a counter indicating the current position of the search. It is used to obtain the hit position when the original character string contains a search character string matching the search condition.

The transition destination calculation unit 116 calculates the transition destination state of the DFA based on the state history stored in the state transition storage unit 112.

The search success determination unit 117 determines whether the original character string contains a search character string matching the search pattern and, if it does, outputs the position at which it appears.

Each functional unit of the compressed text search device 100 in FIG. 3 may be configured from a plurality of arithmetic devices such as CPUs and storage devices such as memories, or may be realized as software running on a single arithmetic device and one or more storage devices.

FIG. 4 is a diagram showing an example of the compressed text stored in the compressed text storage unit 103 in this embodiment.
Compressed text produced by dictionary-based compression (a dictionary-reference compression method) consists of a compressed block sequence 300 (an example of a code string) and a compression dictionary 303 (an example of a replacement dictionary).
Each compressed block 302 (an example of a code) contains the reference number of one entry in the compression dictionary.
The compression dictionary storage unit 113 extracts the compression dictionary information from the compressed text and stores it. The compression dictionary is a table that associates each character string 305 (an example of a partial character string) with its reference number 304 (an example of a code).
To decompress (restore) the compressed block sequence 300 “1213414” shown in FIG. 4, the compressed blocks 302 are taken from the compressed block sequence 300 one at a time, and each compressed block 302 is replaced, based on its reference information, with the corresponding character string 305 in the compression dictionary (hereinafter called the reference character string of the compressed block).
For example, since the first compressed block 302 refers to the first entry of the compression dictionary, it can be replaced with the character string “abcde” of the first entry. Repeating this for all the compressed blocks yields the decompressed text “abcdecbabcdebecdabcded” from the compressed block sequence 300. In dictionary-based compression, character strings appearing in the text are replaced in this way with compressed blocks whose bit length is shorter than the strings themselves, so the more often the same character string recurs, the higher the compression ratio.
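The decompression described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the names `COMPRESSION_DICTIONARY`, `COMPRESSED_BLOCKS`, and `decompress` are hypothetical, and dictionary entries 3 and 4 are not stated in the text; they are inferred here from the decompressed text “abcdecbabcdebecdabcded” (entries 1 and 2 appear later in the description).

```python
# Hypothetical rendering of the compression dictionary of FIG. 4.
# Entries 1 ("abcde") and 2 ("cb") are given in the description;
# entries 3 and 4 are inferred from the decompressed text.
COMPRESSION_DICTIONARY = {1: "abcde", 2: "cb", 3: "bec", 4: "d"}

# Compressed block sequence 300: each block is a dictionary reference number.
COMPRESSED_BLOCKS = [1, 2, 1, 3, 4, 1, 4]


def decompress(blocks, dictionary):
    """Replace every compressed block with its reference character string."""
    return "".join(dictionary[ref] for ref in blocks)


text = decompress(COMPRESSED_BLOCKS, COMPRESSION_DICTIONARY)
print(text)  # abcdecbabcdebecdabcded
```

A plain `dict` is used only for brevity; any structure that maps a block to its dictionary entry one-to-one would serve.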

Here the compression dictionary is shown as a table, but any implementation may be used as long as compressed blocks and compression dictionary entries can be associated one-to-one. For example, a tree structure or a hash may be used.
Hereinafter, the reference number of a compressed block is written as a number enclosed in <>.

FIG. 5 is a diagram showing the structure of the state history stored in the state transition storage unit 112 in this embodiment.
The state transition storage unit 112 holds a reference number 401 (an example of a code), a state transition history 402, and an acceptance position 403.
The reference number 401 corresponds one-to-one to the reference number 304 of the compression dictionary.
The state transition history 402 records the states that the state transition machine passed through while reading the corresponding character string of the compression dictionary. Its first element is the state immediately before the character string is read, and its last element is the state immediately after the whole character string has been read. For example, for the first entry, the machine starts from state [1], transitions through [2]-[3]-[4]-[5] as it reads the character string “abcde” one character at a time, and reaches state [6] immediately after reading the final “e”.
The acceptance position 403 indicates at which character of the compression dictionary string the state transition machine reached an accepting state. For example, suppose that state [4] in FIG. 4 is an accepting state. Then the acceptance position 403 of entry 1 means that the accepting state [4] was reached immediately after reading “c”, the third character of the compression dictionary string.
Here, only the first and third entries have an acceptance position, one each, but a single state transition history may have two or more acceptance positions.
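One possible in-memory shape for a state history entry is sketched below. The class name `StateHistory`, its field names, and the `history_table` dictionary are assumptions made for illustration; only the values for entry 1 (states 1 through 6, acceptance position 3) come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StateHistory:
    """State transition history 402 plus acceptance positions 403 for one entry."""
    states: List[int]  # first state, then one state per character read
    accept_positions: List[int] = field(default_factory=list)  # 1-based character counts

    @property
    def first_state(self) -> int:
        # State immediately before the dictionary string is read.
        return self.states[0]

    @property
    def last_state(self) -> int:
        # State immediately after the whole dictionary string has been read.
        return self.states[-1]


# Keyed by reference number 401; entry 1 corresponds to "abcde".
history_table: Dict[int, StateHistory] = {
    1: StateHistory(states=[1, 2, 3, 4, 5, 6], accept_positions=[3]),
}

entry = history_table[1]
print(entry.first_state, entry.last_state, entry.accept_positions)
```

Storing the acceptance positions as a list allows for the case, mentioned above, in which one history has two or more of them.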

In this embodiment, matching a regular expression against text uses a state transition machine whose state transitions are uniquely determined. A typical example of such a machine is the DFA.

FIG. 6 is a diagram showing an example of the state transition table 200 stored in the state transition table storage unit 106 in this embodiment. The leftmost column of the state transition table 200 gives the current state, and the first row gives the next input character. Every other cell gives the next state (transition destination state). For example, when the current state is [1] and the next input character is “a”, the next state is the state [2] found in the row for state [1] and the column for character “a” (second row, second column).
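The table lookup can be sketched as a nested dictionary. FIG. 6 itself is not reproduced in the text, so the table below is a reconstruction under stated assumptions: the transitions marked "from the text" appear in the description, while every other cell, and the rule that unlisted characters return to the initial state, is an assumption chosen so that the machine searches for the regular expression discussed in this embodiment. The names `TABLE` and `next_state` are hypothetical.

```python
INITIAL_STATE = 1

# Rows: current state. Columns: next input character. Cells: next state.
TABLE = {
    1: {"a": 2, "d": 5},                    # (1, a) -> 2 is from the text
    2: {"a": 2, "b": 3, "d": 5},            # (2, b) -> 3 is from the text
    3: {"a": 2, "c": 4, "d": 5, "e": 4},    # (3, c) -> 4 and (3, b) -> 1 are from the text
    4: {"a": 2, "d": 5, "e": 4},            # (4, d) -> 5 is from the text
    5: {"a": 2, "d": 5, "e": 6},            # (5, e) -> 6 is from the text
    6: {"a": 2, "c": 3, "d": 5},            # (6, c) -> 3 is from the text
}


def next_state(state, ch):
    """Look up the transition destination; unlisted characters go back to state 1 (assumed)."""
    return TABLE[state].get(ch, INITIAL_STATE)


print(next_state(1, "a"))  # 2
```

Because each (state, character) pair yields exactly one destination, a lookup is a single step regardless of how complex the regular expression is.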

The state transition table 200 of the state transition machine that accepts the regular expression included in the search condition is generated by the state transition table generation unit 105 before the compressed text is input and matching begins.
For example, suppose the search condition input unit 102 inputs the regular expression “(ab|dec)[ce]e*” as the search condition (here “[ce]” is shorthand for “(c|e)”). This regular expression denotes “abc”, “abe”, “abce”, “abee”, “decc”, “dece”, “decce”, “decee”, and so on. The state transition table 200 of FIG. 6 is the table that the state transition table generation unit 105 generates from this regular expression.
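As a quick check of what the regular expression denotes, the example strings can be tested with Python's standard `re` module. This uses a general-purpose regex engine rather than the state transition table of the embodiment, and the variable names are hypothetical.

```python
import re

# The search condition from the example, written in Python regex syntax.
PATTERN = re.compile(r"(ab|dec)[ce]e*")

examples = ["abc", "abe", "abce", "abee", "decc", "dece", "decce", "decee"]

# fullmatch succeeds only if the whole string conforms to the expression.
all_match = all(PATTERN.fullmatch(s) is not None for s in examples)
print(all_match)  # True

# "ab" and "dec" alone lack the required [ce] character, so they do not match.
prefix_only = [s for s in ("ab", "dec") if PATTERN.fullmatch(s)]
print(prefix_only)  # []
```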

FIG. 7 is a flowchart showing an example of the control flow of the search process of the compressed text search apparatus 100 in this embodiment.

In the initialization process (S10), based on the search condition input by the search condition input unit 102, the state transition table generation unit 105 generates the state transition table corresponding to the DFA that accepts the search condition, and the state transition table storage unit 106 stores it.
The compression dictionary information is extracted from the compressed text stored in the compressed text storage unit 103, and the compression dictionary storage unit 113 stores it.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state transition history it stores.
The current position counter 115 initializes the stored current position to 0, in order to count the length of the text before compression (also called the original text).

The compressed block acquisition unit 108 acquires one compressed block at a time, in order from the head of the compressed block sequence (S11).

The condition determination unit 114 determines whether the current DFA state stored in the state storage unit 111 matches the first state of the state transition history that the state transition storage unit 112 stores for the compressed block (hereinafter called the state transition history of the compressed block) (S12).
That is, among the state histories stored in the state transition storage unit 112, it refers to the state transition history 402 corresponding to the compressed block acquired by the compressed block acquisition unit 108 and obtains its first state (the state of the DFA before the partial character string corresponding to that compressed block was input to the DFA). It then compares this first state with the current DFA state to decide whether they match.

When the condition determination unit 114 determines that they match (the first condition is satisfied) (S12), the transition destination calculation unit 116 refers to the matching state transition history and obtains its last state (the state of the DFA after the partial character string corresponding to that compressed block was input to the DFA) as the transition destination state. The transition destination calculation unit 116 updates the DFA state stored in the state storage unit 111 to the obtained transition destination state (S13).

When the transition destination calculation unit 116 updates the DFA state in this way, the states that would have been passed through along the way are skipped. If an accepting state is among them, the DFA would move on without ever entering the accepting state.
For this reason, the state transition storage unit 112 stores whether the skipped states include an accepting state and, if so, where it occurs (the acceptance position 403).
The search success determination unit 117 refers to the matched state transition history among the state histories stored in the state transition storage unit 112 and determines whether it contains an accepting state (S14).
If it does, the search success determination unit 117 adds the acceptance position 403 associated with the matched state transition history 402 to the current original-text length (current position) stored by the current position counter 115, and outputs the result as the hit position (S15). The collation result output unit 104 then displays it on the CRT display device 901.
In some cases there are multiple acceptance positions. In that case, an occurrence position (hit position) is output for every acceptance position.
Conversely, if there is no accepting state, nothing is output.
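The hit-position arithmetic of S14 and S15 is small enough to show in isolation. The function name `hits_from_history` is hypothetical; the sample values (current position 7, acceptance position 3, hit position 10) are the ones used in the worked example of this embodiment.

```python
def hits_from_history(current_position, accept_positions):
    """Return one hit position per stored acceptance position (S14, S15).

    current_position: length of the original text consumed before this block.
    accept_positions: 1-based character counts stored with the matched history.
    """
    return [current_position + p for p in accept_positions]


print(hits_from_history(7, [3]))   # [10]
print(hits_from_history(7, []))    # [] (no accepting state, so nothing is output)
```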

When the condition determination unit 114 determines that they do not match (the second condition is satisfied) (S12), the partial character string is restored from the compressed block and input to the DFA, and the state transition history is obtained (S16).

Finally, the current position counter 115 adds the length (number of characters) of the partial character string corresponding to the compressed block to the stored current position, and stores the updated value (S17).

The above processing is repeated until the compressed blocks are exhausted (S18).
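The loop S10 to S18 can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the claimed implementation: the transition table is a reconstruction (only some of its cells are confirmed by the description), dictionary entries 3 and 4 are inferred from the decompressed example text, the history is always overwritten after S16 (one of the permitted variants), and all identifiers are hypothetical.

```python
DICTIONARY = {1: "abcde", 2: "cb", 3: "bec", 4: "d"}   # entries 3 and 4 inferred
BLOCKS = [1, 2, 1, 3, 4, 1, 4]
INITIAL, ACCEPTING = 1, {4}

# Assumed table for searching (ab|dec)[ce]e*; unlisted characters return to state 1.
TABLE = {
    1: {"a": 2, "d": 5},
    2: {"a": 2, "b": 3, "d": 5},
    3: {"a": 2, "c": 4, "d": 5, "e": 4},
    4: {"a": 2, "d": 5, "e": 4},
    5: {"a": 2, "d": 5, "e": 6},
    6: {"a": 2, "c": 3, "d": 5},
}


def step(state, ch):
    return TABLE[state].get(ch, INITIAL)


def search(blocks, dictionary):
    state, position = INITIAL, 0      # S10: initial state, current position 0
    history = {}                      # S10: ref -> (states, acceptance positions)
    hits = []
    for ref in blocks:                # S11 fetches a block, S18 loops until none remain
        text = dictionary[ref]
        cached = history.get(ref)
        if cached is not None and cached[0][0] == state:    # S12: first state matches
            states, accepts = cached
            state = states[-1]                               # S13: jump to last state
            hits.extend(position + p for p in accepts)       # S14, S15
        else:                                                # S16: decompress and run
            states, accepts = [state], []
            for count, ch in enumerate(text, start=1):
                state = step(state, ch)
                states.append(state)
                if state in ACCEPTING:
                    accepts.append(count)
                    hits.append(position + count)
            history[ref] = (states, accepts)   # variant: always overwrite
        position += len(text)                  # S17
    return hits


hits = search(BLOCKS, DICTIONARY)
print(hits)  # [3, 10, 19]
```

The hit at 10 is produced by the cached path (S13 to S15) without decompressing the block; the hit at 19 depends on the assumed cells of the table, since at that point the DFA is not in the cached first state and the block is decompressed again.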

FIG. 8 is a flowchart showing an example of the details of the process in S16 of FIG. 7.

The state transition storage unit 112 prepares a temporary storage area H for the state transition history and stores in it, as the first state, the DFA state stored in the state storage unit 111 (S161). Alternatively, the storage area H may be emptied and the DFA state then added at its head.

Next, the character acquisition unit 109 refers to the reference character string of the compressed block (the partial character string corresponding to the reference number, obtained from the compression dictionary stored in the compression dictionary storage unit 113), acquires its characters one at a time from the beginning, and inputs each to the DFA (S162).

The state transition machine 110 obtains the next state by referring to the state transition table stored in the state transition table storage unit 106, based on the input character and the current state. That is, based on the current DFA state stored in the state storage unit 111 and the character input by the character acquisition unit 109, it looks up the transition destination state in the state transition table and sets it in (updates) the state storage unit 111. The state storage unit 111 stores the transition destination state obtained by the state transition machine 110 as the new DFA state (S163).

The state transition storage unit 112 appends the new DFA state stored in the state storage unit 111 to the end of the storage area H (S164).

The search success determination unit 117 determines whether the new DFA state stored in the state storage unit 111 is an accepting state (S165).

If the search success determination unit 117 determines that it is an accepting state, the state transition storage unit 112 stores in the storage area H, as the acceptance position, the number of characters that the character acquisition unit 109 has input to the DFA so far (S166).
Furthermore, the acceptance position is added to the current position stored by the current position counter 115, and the result is output as the hit position (S167). The collation result output unit 104 displays the output hit position on the CRT display device 901.

This is repeated until the characters of the reference character string are exhausted (S168).

Finally, the state transition storage unit 112 reflects the state transition history and the acceptance positions stored in the storage area H in the state transition history of the compressed block. That is, it copies them to the position corresponding to the reference number of the compressed block (S169).
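Steps S161 to S169 can be sketched as a single function that builds the temporary area H for one block. The function name `run_block`, the returned tuple, and the table are assumptions for illustration; the table is reconstructed, with only some cells confirmed by the description. Hit output (S167) is left to the caller so that the sketch stays focused on how H is filled.

```python
INITIAL, ACCEPTING = 1, {4}

# Assumed table; unlisted characters return to the initial state.
TABLE = {
    1: {"a": 2, "d": 5},
    2: {"a": 2, "b": 3, "d": 5},
    3: {"a": 2, "c": 4, "d": 5, "e": 4},
    4: {"a": 2, "d": 5, "e": 4},
    5: {"a": 2, "d": 5, "e": 6},
    6: {"a": 2, "c": 3, "d": 5},
}


def run_block(state, reference_string):
    """Feed one block's reference character string to the DFA and record H."""
    h_states = [state]          # S161: H starts with the state before reading
    h_accepts = []
    for count, ch in enumerate(reference_string, start=1):   # S162, S168
        state = TABLE[state].get(ch, INITIAL)                 # S163
        h_states.append(state)                                # S164
        if state in ACCEPTING:                                # S165
            h_accepts.append(count)                           # S166
    return h_states, h_accepts  # S169: caller copies these under the reference number


print(run_block(1, "abcde"))  # ([1, 2, 3, 4, 5, 6], [3])
print(run_block(6, "cb"))     # ([6, 3, 1], [])
```

Both printed results use only transitions that the description states explicitly.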

Note that, instead of preparing the temporary storage area H, the state transition storage unit 112 may store the state transition history and the acceptance positions directly at the position corresponding to the reference number of the compressed block. This is preferable because it eliminates the step of copying the contents of the storage area H.

Alternatively, only the first and last states may be stored, without storing the intermediate states of the state transition history. This is preferable because it saves storage space.

The state transition storage unit 112 may also store, as the state history for a compressed block's reference number, multiple state histories, one for each first state. This is preferable because it raises the likelihood that the first state of a stored state history matches the current state, so that state transitions can be skipped.

Alternatively, only one state history may be stored for each compressed block reference number. This is preferable because it saves the storage area used for state histories.
In that case, if a state history for the compressed block's reference number (with a different first state) is already stored, the new state history need not be stored. Alternatively, the state history may be overwritten only when its first state is a specific state (for example, the initial state). This is preferable because, when that state occurs frequently, the first state and the current state are more likely to match, so that state transitions can be skipped.
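Two of the storage policies above can be contrasted in a short sketch. The class names, method names, and the `preferred_first_state` parameter are hypothetical. `SingleHistoryStore` stores the first history seen for a reference number and afterwards overwrites it only when the new history begins in the preferred state, which is one reading of the overwrite rule described above. The sample history beginning in state 5 is illustrative only.

```python
class MultiHistoryStore:
    """Keeps one history per (reference number, first state) pair."""

    def __init__(self):
        self._entries = {}

    def lookup(self, ref, current_state):
        return self._entries.get((ref, current_state))

    def store(self, ref, states, accepts):
        self._entries[(ref, states[0])] = (list(states), list(accepts))


class SingleHistoryStore:
    """Keeps at most one history per reference number to save memory."""

    def __init__(self, preferred_first_state=1):
        self._entries = {}
        self._preferred = preferred_first_state

    def lookup(self, ref, current_state):
        entry = self._entries.get(ref)
        if entry is not None and entry[0][0] == current_state:
            return entry
        return None

    def store(self, ref, states, accepts):
        # Store when empty; afterwards overwrite only for the preferred first state.
        if ref not in self._entries or states[0] == self._preferred:
            self._entries[ref] = (list(states), list(accepts))


multi = MultiHistoryStore()
multi.store(1, [1, 2, 3, 4, 5, 6], [3])
multi.store(1, [5, 2, 3, 4, 5, 6], [3])   # second history for the same block

single = SingleHistoryStore(preferred_first_state=1)
single.store(1, [5, 2, 3, 4, 5, 6], [3])  # stored: nothing there yet
single.store(1, [1, 2, 3, 4, 5, 6], [3])  # overwrites: starts in the preferred state
single.store(1, [5, 2, 3, 4, 5, 6], [3])  # ignored: not the preferred state
```

The first policy trades memory for a higher chance of skipping; the second keeps memory bounded at one history per dictionary entry.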

How the compressed text search apparatus 100 actually operates under this procedure is described below using the concrete example shown in FIGS. 4 to 6.

The initialization process (S10) initializes the contents of each storage unit.
The state transition table storage unit 106 stores the state transition table 200 shown in FIG. 6. The state transition table 200 was generated by the state transition table generation unit 105 based on the search condition (including a regular expression) input by the search condition input unit 102. In the DFA corresponding to the state transition table 200, the initial state is the state with state number 1, and the accepting state is the state with state number 4. The state transition table storage unit 106 stores this information as well. Note that a DFA always has exactly one initial state, but may have multiple accepting states.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state history it stores.
The compression dictionary storage unit 113 stores the compression dictionary.
The current position counter 115 stores 0 as the current position.

The compressed block acquisition unit 108 acquires a compressed block from the compressed block sequence 300 stored in the compressed text storage unit 103 (S11). The first compressed block 302 is “1”.

The condition determination unit 114 compares the current DFA state stored in the state storage unit 111 with the first state of the state transition history corresponding to the compressed block, among the state histories stored in the state transition storage unit 112 (S12).
However, because the state history is empty, no corresponding first state is stored.
The condition determination unit 114 therefore determines that they do not match (proceed to S16).

The state transition storage unit 112 stores the current DFA state (= 1) stored in the state storage unit 111 in the storage area H (S161).

The character acquisition unit 109 refers to reference number 1 of the compression dictionary stored in the compression dictionary storage unit 113, acquires the first character “a” of the partial character string “abcde”, and inputs it to the DFA (S162).

Since the current DFA state is state number 1 and the input character is “a”, the state transition machine 110 refers to the state transition table 200 stored in the state transition table storage unit 106 and obtains state number 2 as the transition destination state. The state storage unit 111 stores state number 2 as the new DFA state (S163).

The state transition storage unit 112 appends state number 2 to the end of the storage area H (S164), which now holds “12”.

The search success determination unit 117 determines that the state stored in the state storage unit 111 (state number 2) is not the accepting state (state number 4) (S165). The processes of S166 and S167 are therefore not performed.

Since characters of the reference character string still remain (S168), the character acquisition unit 109 acquires the next character “b” and inputs it to the DFA (S162).
With state number 2 and input character “b”, the state transition machine 110 obtains transition destination state number 3 from the state transition table 200, and the state storage unit 111 stores it (S163).
The state transition storage unit 112 appends state number 3 to the end of the storage area H (S164), which now holds “123”.
The search success determination unit 117 determines that the current state (state number 3) is not the accepting state (state number 4) (S165).

Since the characters are not yet exhausted (S168), the character acquisition unit 109 acquires the next character “c” and inputs it to the DFA (S162).
With state number 3 and input character “c”, the new DFA state is state number 4, which is appended to the storage area H (S163, S164).
The search success determination unit 117 determines that the current state (4) is the accepting state (4) (S165).
The state transition storage unit 112 stores the acceptance position in the storage area H (S166). The character acquisition unit 109 has input three characters to the DFA so far, so the acceptance position is 3.
Furthermore, the search success determination unit 117 adds the acceptance position (= 3) to the current position (= 0) stored by the current position counter 115, and calculates and outputs the hit position (= 3) (S167).

Since the characters are not yet exhausted (S168), the next character “d” is input to the DFA (S162).
With state number 4 and input character “d”, the new DFA state is 5, which is stored (S163, S164).
The current state (5) is not the accepting state (4), so processing continues (S165).

In the same way (S168), the next character “e” is input (S162), and the new state is 6 (S163, S164). It is not an accepting state, so processing continues (S165).

Since the characters of the reference character string are now exhausted (S168), the state transition storage unit 112 copies the state transition history and the acceptance position stored in the storage area H to the position in the state history that corresponds to reference number 1 of the compressed block (S169).
At this point the storage area H holds “123456” as the state transition history and “3” as the acceptance position, so these are stored at the position corresponding to reference number 1.

The length (number of characters, = 5) of the partial character string corresponding to the compressed block is added to the current position (= 0) stored by the current position counter 115, updating the current position to 5 (S17).

Since the compressed blocks are not yet exhausted (S18), the compressed block acquisition unit 108 acquires the next compressed block, “2”.

The state transition storage unit 112 does not store a state transition history corresponding to reference number 2, so the condition determination unit 114 determines that there is no match (S12).

The state transition history is obtained in the same way as before (S16).
The partial character string corresponding to reference number 2 of the compression dictionary is “cb”, so the DFA starts in state number 6, transitions to state number 3 on input “c”, and then to state number 1 on input “b”.
None of these is the accepting state (4), so the search success determination unit 117 outputs nothing.
The state transition storage unit 112 therefore stores “631” as the state transition history at the position corresponding to reference number 2 of the state history. There is no acceptance position, so none is stored.

The current position (= 5) stored by the current position counter 115 is increased by the character count 2 and becomes 7 (S17).

Since the compressed blocks are not yet exhausted (S18), the compressed block acquisition unit 108 acquires the next compressed block, “1”.

The state transition storage unit 112 stores a state transition history corresponding to reference number 1. The current DFA state stored in the state storage unit 111 is 1. Among the state histories stored in the state transition storage unit 112, the first state of the state transition history corresponding to reference number 1 is also 1, so the two match. The condition determination unit 114 therefore determines that they match (S12).

The transition destination calculation unit 116 then obtains state number 6, the transition destination state, from the last state of the matched state transition history, and has the state storage unit 111 store it (S13).

The search success determination unit 117 determines whether the state transition storage unit 112 stores an acceptance position for the matched state transition history (S14). Here the acceptance position “3” is stored, so the acceptance position (= 3) is added to the current position (= 7), and the hit position (= 10) is calculated and output (S15).

The current position (= 7) stored by the current position counter 115 is increased by the character count 5 and becomes 12 (S17).

Processing continues in the same way until no compressed blocks remain for the compressed block acquisition unit 108 to acquire (S18), and the search process ends once the end of the compressed block sequence has been processed.

In this example, the condition determination unit 114 determines whether the current DFA state stored in the state storage unit 111 matches the first state of the state transition history that corresponds to the compressed block's reference number, among the state transition histories stored in the state transition storage unit 112.
However, the apparatus may instead be configured so that, each time the DFA state is updated (transitions), the new state is compared with the corresponding state in the state transition history stored in the state transition storage unit 112 and, if they match, the remaining state transitions are skipped.

Rather than comparing after every single state update, the comparison may be made after a predetermined number of updates.
For example, after obtaining the next state for the i-th character (i is a natural number) of the reference character string, that state may be compared with the i-th state of the state transition history; if they match, the process of S16 ends and processing moves to S13. In this case, the first through i-th states of the state transition history stored in the temporary storage area H are reflected in the state transition history of the compressed block.
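The mid-block comparison can be sketched as below. Several points are assumptions rather than statements from the text: the "i-th state" is taken to mean the stored state after i characters (index i of a list whose index 0 is the first state), the comparison is made after every character, and the remaining acceptance positions are taken from the stored history once the states coincide. The function name, return shape, and table are hypothetical, and the transition (5, a) -> 2 used in the example is an assumed cell of the table.

```python
INITIAL, ACCEPTING = 1, {4}

# Assumed table; unlisted characters return to the initial state.
TABLE = {
    1: {"a": 2, "d": 5},
    2: {"a": 2, "b": 3, "d": 5},
    3: {"a": 2, "c": 4, "d": 5, "e": 4},
    4: {"a": 2, "d": 5, "e": 4},
    5: {"a": 2, "d": 5, "e": 6},
    6: {"a": 2, "c": 3, "d": 5},
}


def run_block_with_resync(state, text, cached=None):
    """Run a block, stopping early once the state rejoins the cached history.

    cached is a (states, accept_positions) pair for the same text, or None.
    Returns (states, accept_positions, number_of_dfa_steps_taken).
    """
    states, accepts = [state], []
    for i, ch in enumerate(text, start=1):
        state = TABLE[state].get(ch, INITIAL)
        states.append(state)
        if state in ACCEPTING:
            accepts.append(i)
        if cached is not None and i < len(text) and cached[0][i] == state:
            # Same state at the same offset of the same string: a deterministic
            # machine must follow the cached tail from here on.
            old_states, old_accepts = cached
            states.extend(old_states[i + 1:])
            accepts.extend(p for p in old_accepts if p > i)
            return states, accepts, i
    return states, accepts, len(text)


cached_entry = ([1, 2, 3, 4, 5, 6], [3])   # history of "abcde" from state 1
result = run_block_with_resync(5, "abcde", cached_entry)
print(result)  # ([5, 2, 3, 4, 5, 6], [3], 1)
```

In this example the block starts in state 5 rather than 1, yet a single DFA step is enough: after the first character both runs are in state 2, so the remaining four transitions are taken from the stored history.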

In step S16, the state transition history of the compressed block may always be updated, or it need not always be updated. That is, the history may be left as first acquired, or it may be updated only when the state at the head of the history is a particular state.

As described above, in this embodiment, when the current state of the DFA stored in the state storage unit 111 does not match the first state of the state transition history of the acquired compressed block, processing the state transitions takes as many steps as the reference character string of the compressed block is long. When the current state does match the first state of the history, the transitions are processed in a single step. In a state transition machine whose next state is uniquely determined by the current state and the input character, the current state is often the initial state. Consequently, the higher the compression ratio, that is, the more often long character strings recur in the text, the more the number of steps required for state transitions can be reduced.
The matching itself, which checks whether the text contains a character string that conforms to the regular expression, uses a conventional state transition machine that accepts the regular expression and whose transitions are uniquely determined.
Thus the compressed text search apparatus of this embodiment can search compressed text at high speed with search conditions that include regular expressions.
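As a non-limiting illustration, the single-step transition described above might be sketched in Python as follows. The name run_blocks, the mapping from reference numbers to dictionary strings, and the toy DFA that tracks the substring "ab" are assumptions made for this sketch. Hit reporting is omitted, and the history is always overwritten, which is only one of the update policies mentioned above.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def step_ab(state: int, ch: str) -> int:
    """Hypothetical toy DFA: 0 = start, 1 = just read "a", 2 = just read "ab"."""
    if ch == "a":
        return 1
    if ch == "b" and state == 1:
        return 2
    return 0


def run_blocks(step: Callable[[int, str], int], start: int,
               dictionary: Dict[int, str],
               blocks: Sequence[int]) -> Tuple[int, int]:
    """Return (final state, number of steps) for a sequence of reference numbers."""
    state = start
    history: Dict[int, List[int]] = {}   # reference number -> states visited
    steps = 0
    for ref in blocks:
        past = history.get(ref)
        if past is not None and past[0] == state:
            state = past[-1]             # head matches: one step to the tail
            steps += 1
            continue
        visited = [state]
        for ch in dictionary[ref]:       # otherwise one step per character
            state = step(state, ch)
            visited.append(state)
            steps += 1
        history[ref] = visited
    return state, steps
```

For three consecutive blocks that all reference the string "ab", the third block starts in the same state as the second and is handled in one step, so five steps are taken instead of six.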

The search apparatus described here has the following features.
It is a search apparatus that searches text compressed by a dictionary-based compression method with a regular expression, without decompressing the text.
The search uses a state transition machine whose state transitions are uniquely determined.
The apparatus consists of a state transition machine, a state transition table generation unit, a compressed block acquisition unit, a character acquisition unit, a state storage unit, a state transition storage unit, and a compression dictionary storage unit.
During a search, the state transition history of the state transition machine is stored in the state transition storage unit for each dictionary character string referenced by a compressed block. When the current state matches the first state of the history referenced by a compressed block, the machine moves to the last state of that history in a single transition.

Thus, in a character string search apparatus that searches by means of the state transitions of an automaton, compressed blocks are acquired directly from the compressed text and searched without first decompressing the text to restore the original character string. This saves the time needed to decompress the text and speeds up the search.

If an acquired compressed block has not been searched before, it is restored to its original partial character string and the automaton performs the state transitions for the search. If it has been searched before and the automaton is in the same state as it was then, the same state transitions need not be performed again. Storing the state transitions from the earlier search therefore lets the automaton complete them in a single transition, which speeds up the search.

Moreover, the post-transition state is not computed in advance just so that the transition can be completed in one step. The transitions that occurred during an actual search are stored, so no computation is wasted. This also speeds up the search.

Using a deterministic finite automaton for the search means the destination state can be determined uniquely, so no backtracking is needed and the search is faster.

Because the search string can be specified with a search pattern written as a regular expression, searches are more flexible and more efficient.

In general, a DFA built to search for a regular expression is more complex and has more states than one built to search for a fixed character string. Computing the post-transition state in advance for such a DFA, so that each compressed block takes a single transition, is wasteful. With the approach of this embodiment, which stores the transitions made during an actual search as a history, no computation is wasted and the search is fast.

Embodiment 2.
Embodiment 2 is described with reference to FIGS. 3, 6, 9, 10, and 46.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes searching text compressed with the LZ77 method.

FIG. 9 shows an example of the contents stored by the compressed text storage unit 103 and the state transition storage unit 112 (an example of a history storage unit) in this embodiment.

The compressed text storage unit 103 stores a compressed block sequence 300 (an example of a code sequence). The compressed block sequence 300 shown in FIG. 9 is obtained by applying the rules of FIG. 46 to the original character string 500, "abcdecbabcdebecdabcded".

The state transition storage unit 112 (an example of a history storage unit) stores the states of the DFA (an example of an automaton) and the characters input to the DFA as a state history.
In the state history, the state transition storage unit 112 stores characters and states in one-to-one correspondence.
For example, character 411 is stored in association with state 461. Likewise, character 412 is associated with state 462, and character 413 with state 463. In the following description, each entry of the state history is written as a state and a character enclosed in parentheses. For example, the state history shown in FIG. 9 is written "(1, a) (2, b) (3, c) (4, d) ...".

The state history stored in the state transition storage unit 112 is initially empty and grows little by little as the search proceeds.
For example, character 411 corresponds to the first character "a" of the original character string 500 and was stored when that character was input to the DFA. State 461 is the state of the DFA that the state storage unit 111 held before character 411 was input to the DFA.

FIG. 10 is a flowchart showing an example of the control flow of the search processing of the compressed text search apparatus 100 in this embodiment.

In the initialization process (S10), the state transition table generation unit 105 generates, from the search condition (an example of a search pattern) input by the search condition input unit 102, a state transition table for a state transition machine (DFA) that accepts the search condition, and the state transition table storage unit 106 stores it.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state history it stores.
The current position counter 115 stores 0 as the current position.

The compressed block acquisition unit 108 (an example of a code acquisition unit) acquires one compressed block (an example of a code) at a time, in order from the head of the compressed block sequence (an example of a code sequence) (S21).

Because the compressed block sequence 300 was produced by applying the rules of FIG. 46, each odd-numbered compressed block is a code 985 or a code 986, and each even-numbered compressed block is a code 982. S21 acquires an odd-numbered compressed block, so unless it is a dummy pointer (code 986), it contains pointer information referring to another character string.

The condition determination unit 114 checks whether the compressed block acquired by the compressed block acquisition unit 108 is 0. If it is not 0, the unit determines that the block contains pointer information referring to another character string (S22).

If so, the condition determination unit 114 decodes the compressed block to obtain the occurrence position 983 of the (first character of the) other partial character string, expressed as a distance back from the current position, and the length 984 of that partial character string. It then obtains, from the state history stored in the state transition storage unit 112, the state and character corresponding to the occurrence position, that is, the entry stored that many characters earlier (S23), and compares the obtained state with the current state of the DFA stored in the state storage unit 111 (S24).

If the condition determination unit 114 determines in S24 that the states match (the first condition is satisfied), the state transition storage unit 112 appends the state and character obtained by the condition determination unit 114 to the end of the state history (S251).
The transition destination calculation unit 116 obtains the next state and character from the state history (S252).
The search success determination unit 117 determines whether the state obtained by the transition destination calculation unit 116 is an accepting state. If it is, the collation result output unit 104 outputs the current position stored in the current position counter 115 as a hit position (S253).
The current position stored in the current position counter 115 is incremented by one, and the remaining length of the partial character string is decremented by one (S254).
If any of the partial character string remains, processing repeats from S251 (S255).
When the processing of S251 to S255 is finished, the transition destination calculation unit 116 updates the DFA state stored in the state storage unit 111 to the state obtained in S252 (S256).

If the condition determination unit 114 determines in S24 that the states do not match (the second condition is satisfied), the state transition storage unit 112 appends the state stored in the state storage unit 111 and the character obtained by the condition determination unit 114 to the end of the state history (S261).
The character acquisition unit 109 (an example of a character string restoration unit) takes the character obtained by the condition determination unit 114 and inputs it to the DFA. Based on that character and the current state of the DFA stored in the state storage unit 111, the state transition machine 110 looks up the destination state in the state transition table stored in the state transition table storage unit 106. It then stores the destination state in the state storage unit 111, updating the current state (S262).
The search success determination unit 117 determines whether the updated current state of the DFA is an accepting state. If it is, the collation result output unit 104 outputs the current position stored in the current position counter 115 as a hit position (S263).
The current position stored in the current position counter 115 is incremented by one, and the remaining length of the partial character string is decremented by one (S264).
If any of the partial character string remains, processing repeats from S23 (S265).

During this repetition, the condition determination unit 114 again determines whether the states match (S24). Even if the state before the partial character string was input differs from the state history, the states may come to match after several characters have been input to the DFA.

Thus, even when the initial states do not match, if the states match partway through, the remaining state transitions can be completed in a single step, which speeds up the search.

However, to avoid the slowdown caused by the additional condition checks, the states may be compared only at the start.
Alternatively, the comparison may be made once every several characters rather than every time.
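As a non-limiting illustration, the choice of how often to compare might be expressed as a parameter, as in the following Python sketch of the processing of a single pointer. The name copy_pointer, the parameter compare_every, and the toy DFA that tracks the substring "ab" are assumptions made for this sketch; hit reporting and the position counter are omitted.

```python
from typing import Callable, List, Tuple


def step_ab(state: int, ch: str) -> int:
    """Hypothetical toy DFA: 0 = start, 1 = just read "a", 2 = just read "ab"."""
    if ch == "a":
        return 1
    if ch == "b" and state == 1:
        return 2
    return 0


def copy_pointer(step: Callable[[int, str], int], state: int,
                 history: List[Tuple[int, str]], dist: int, length: int,
                 compare_every: int = 1) -> Tuple[int, int]:
    """Process one (dist, length) pointer; return (new state, table lookups).

    compare_every=1 compares before every character; a value of at least
    `length` compares only before the first character.
    """
    lookups = 0
    for i in range(length):
        h_state, ch = history[-dist]           # entry `dist` characters back
        if i % compare_every == 0 and h_state == state:
            for _ in range(length - i):        # copy the rest from the history
                history.append((h_state, ch))
                h_state, ch = history[-dist]
            return h_state, lookups
        history.append((state, ch))            # no match: feed the DFA
        state = step(state, ch)
        lookups += 1
    return state, lookups
```

All policies reach the same state and the same history; they differ only in how many table lookups are traded against how many comparisons.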

If the condition determination unit 114 determines in S22 that the block contains no pointer information referring to another character string, or once the processing from S23 onward has finished, the compressed block acquisition unit 108 acquires the next compressed block from the compressed block sequence 300 (S27).
This is an even-numbered compressed block, so it is a code 982, a bit string representing a character.
The condition determination unit 114 therefore determines unconditionally that it contains no pointer information referring to another character string.

The character acquisition unit 109 obtains the character corresponding to the compressed block acquired by the compressed block acquisition unit 108, and the state transition storage unit 112 appends the current state of the DFA stored in the state storage unit 111 and that character to the end of the state history (S281).

The character acquisition unit 109 inputs the obtained character to the DFA. The state transition machine 110 computes the destination state and updates the current state of the DFA stored in the state storage unit 111 (S282).

The search success determination unit 117 determines whether the updated current state of the DFA is an accepting state. If it is, the collation result output unit 104 outputs the current position stored in the current position counter 115 as a hit position (S283).

The current position stored in the current position counter 115 is incremented by one (S284).

If compressed blocks remain in the compressed block sequence 300, processing repeats from S21 (S29).
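As a non-limiting illustration, the flow of S21 to S29 might be sketched in Python as follows. The representation of the compressed block sequence as pairs of a pointer and a literal character, the function names, the toy DFA that tracks the substring "ab", and the choice to count positions from 1 are assumptions made for this sketch and are not taken from FIG. 10.

```python
from typing import Callable, List, Sequence, Set, Tuple

Block = Tuple[Tuple[int, int], str]   # ((distance, length), literal character)


def step_ab(state: int, ch: str) -> int:
    """Hypothetical toy DFA: 0 = start, 1 = just read "a", 2 = just read "ab"."""
    if ch == "a":
        return 1
    if ch == "b" and state == 1:
        return 2
    return 0


def search(step: Callable[[int, str], int], start: int, accepting: Set[int],
           blocks: Sequence[Block]) -> Tuple[List[int], int]:
    """Return (hit positions, number of DFA table lookups)."""
    state, pos = start, 1
    history: List[Tuple[int, str]] = []   # (state before input, character)
    hits: List[int] = []
    lookups = 0
    for (dist, length), literal in blocks:           # S21
        remaining = length                           # S22: (0, 0) is the dummy pointer
        while remaining > 0:
            h_state, ch = history[-dist]             # S23
            if h_state == state:                     # S24: states match
                while remaining > 0:
                    history.append((h_state, ch))    # S251
                    h_state, ch = history[-dist]     # S252: next entry
                    if h_state in accepting:         # S253
                        hits.append(pos)
                    pos += 1                         # S254
                    remaining -= 1
                state = h_state                      # S256
            else:                                    # S24: states differ
                history.append((state, ch))          # S261
                state = step(state, ch)              # S262
                lookups += 1
                if state in accepting:               # S263
                    hits.append(pos)
                pos += 1                             # S264
                remaining -= 1
        history.append((state, literal))             # S281 (literal from S27)
        state = step(state, literal)                 # S282
        lookups += 1
        if state in accepting:                       # S283
            hits.append(pos)
        pos += 1                                     # S284
    return hits, lookups
```

For the text "abcabcab" encoded as three literals followed by the pointer (3, 4) and the literal "b", the sketch reports hits at positions 2, 5, and 8 with only four table lookups, because the four characters covered by the pointer are replayed from the history.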

The operation described above is now explained in detail using the specific examples shown in FIGS. 6 and 9.

In the initialization process (S10), each storage unit initializes its contents.
The state transition table storage unit 106 stores the state transition table 200 shown in FIG. 6. The state transition table 200 was generated by the state transition table generation unit 105 from the search condition (which includes a regular expression) input by the search condition input unit 102. In the DFA corresponding to the state transition table 200, the initial state is state number 1 and the accepting state is state number 4. The state transition table storage unit 106 stores this information as well. A DFA always has exactly one initial state but may have several accepting states.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state history it stores.
The current position counter 115 stores "1" as the current position.

The compressed block acquisition unit 108 acquires a compressed block from the compressed block sequence 300 stored in the compressed text storage unit 103 (S21). The first compressed block 311 is "0, 0".
The condition determination unit 114 determines that the acquired compressed block contains no pointer information referring to another partial character string (S22), and processing proceeds to S27.

The compressed block acquisition unit 108 acquires the next compressed block 312, "a" (S27).
The state transition storage unit 112 appends the current DFA state "1" stored in the state storage unit 111 and the character "a" to the end of the state history (state 461 and character 411) (S281).
The character acquisition unit 109 inputs the character "a" to the DFA, and the DFA state becomes "2" (S282). This is not an accepting state, so nothing is output (S283), and the current position becomes "2" (S284). Processing moves on to the next compressed block (S29).

The next compressed block acquired (S21) is "0, 0" (S22), so processing proceeds to S27.
The next compressed block acquired (S27) is "b", so state "2" and character "b" are appended to the state history (state 462 and character 412) (S281). Inputting "b" to the DFA changes its state to "3" (S282). The state is checked for acceptance (S283), and the current position becomes "3" (S284). Processing moves on (S29).

The next compressed block acquired (S21) is "0, 0" (S22), so processing proceeds to S27.
The next compressed block acquired (S27) is "c", so state "3" and character "c" are appended to the state history (state 463 and character 413) (S281). Inputting "c" to the DFA changes its state to "4" (S282). This is an accepting state, so the hit position (= 3) is output (S283), and the current position becomes "4" (S284). Processing moves on (S29).

The next compressed block acquired (S21) is "0, 0" (S22), so processing proceeds to S27.
The next compressed block acquired (S27) is "d", so state "4" and character "d" are appended to the state history (S281). Inputting "d" to the DFA changes its state to "5" (S282). The state is checked for acceptance (S283), and the current position becomes "5" (S284). Processing moves on (S29).

The next compressed block acquired (S21) is "0, 0" (S22), so processing proceeds to S27.
The next compressed block acquired (S27) is "e", so state "5" and character "e" are appended to the state history (S281). Inputting "e" to the DFA changes its state to "6" (S282). The state is checked for acceptance (S283), and the current position becomes "6" (S284). Processing moves on (S29).

At this point the state history is "(1, a) (2, b) (3, c) (4, d) (5, e)".

The next compressed block acquired (S21), "3, 1", is a pointer (S22), so occurrence position "3" and length "1" are obtained, along with state 463 "3" and character 413 "c" from three characters back (occurrence position = 3) (S23).
The obtained state "3" is compared with the current DFA state (= 6) stored in the state storage unit 111 (S24).
They do not match, so the current state "6" and the obtained character "c" are appended to the state history (S261). Inputting "c" to the DFA changes its state to "3" (S262). The state is checked for acceptance (S263), and the current position becomes "7" (S264). All characters of the given length have now been input (S265), so processing proceeds to S27.

The next compressed block acquired (S27) is "b", so state "3" and character "b" are appended to the state history (S281). Inputting "b" to the DFA changes its state to "1" (S282). The state is checked for acceptance (S283), and the current position becomes "8" (S284). Processing moves on (S29).

At this point the state history is "(1, a) (2, b) (3, c) (4, d) (5, e) (6, c) (3, b)".
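As a non-limiting illustration, the steps of the example so far can be replayed with the following Python sketch. The table contains only the transitions quoted in the example above and is not the complete state transition table 200 of FIG. 6. The names TABLE, ACCEPTING, and replay, and the representation of the blocks as pairs of a pointer and a literal character, are assumptions made for this sketch. Every character is fed to the DFA; the comparison of S24 is left out, which does not change the result here because the only pointer encountered so far did not match.

```python
from typing import Dict, List, Sequence, Tuple

# Transitions quoted in the worked example (assumed partial view of FIG. 6).
TABLE: Dict[Tuple[int, str], int] = {
    (1, "a"): 2, (2, "b"): 3, (3, "c"): 4, (4, "d"): 5,
    (5, "e"): 6, (6, "c"): 3, (3, "b"): 1,
}
ACCEPTING = {4}

Block = Tuple[Tuple[int, int], str]   # ((distance, length), literal character)


def replay(blocks: Sequence[Block]):
    """Return (state history, hit positions, final state, current position)."""
    state, pos = 1, 1                  # initial state 1, current position "1"
    history: List[Tuple[int, str]] = []
    hits: List[int] = []

    def feed(ch: str) -> None:
        nonlocal state, pos
        history.append((state, ch))    # record the state before the input
        state = TABLE[(state, ch)]
        if state in ACCEPTING:
            hits.append(pos)
        pos += 1

    for (dist, length), literal in blocks:
        for _ in range(length):
            feed(history[-dist][1])    # character `dist` positions back
        feed(literal)
    return history, hits, state, pos


blocks = [((0, 0), "a"), ((0, 0), "b"), ((0, 0), "c"),
          ((0, 0), "d"), ((0, 0), "e"), ((3, 1), "b")]
history, hits, state, pos = replay(blocks)
```

The resulting history is (1, a) (2, b) (3, c) (4, d) (5, e) (6, c) (3, b), the only hit is at position 3, the DFA is in state 1, and the current position is 8, in agreement with the description above.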

The next compressed block acquired (S21), "7, 5", is a pointer (S22), so occurrence position "7" and length "5" are obtained, along with state 461 "1" and character 411 "a" from seven characters back (occurrence position = 7) (S23).
The obtained state "1" is compared with the current DFA state (= 1) stored in the state storage unit 111 (S24).

They match, so the obtained state "1" and character "a" are appended to the state history (state 468 and character 418) (S251). State 462 "2" and character 412 "b" are obtained from seven characters back in the state history (S252), and the obtained state "2" is checked for acceptance (S253). The current position becomes "9" (S254), and four characters remain (S255).

The obtained state "2" and character "b" are appended to the state history (state 469 and character 419) (S251). State 463 "3" and character 413 "c" are obtained from seven characters back in the state history (S252), and the obtained state "3" is checked for acceptance (S253). The current position becomes "10" (S254), and three characters remain (S255).

The obtained state "3" and character "c" are appended to the state history (state 470 and character 420) (S251). State "4" and character "d" are obtained from seven characters back in the state history (S252). The obtained state "4" is an accepting state, so the hit position "10" is output (S253). The current position becomes "11" (S254), and two characters remain (S255).

The obtained state "4" and character "d" are appended to the state history (S251). State "5" and character "e" are obtained from seven characters back in the state history (S252), and the obtained state "5" is checked for acceptance (S253). The current position becomes "12" (S254), and one character remains (S255).

The obtained state "5" and character "e" are appended to the state history (S251). State "6" and character "c" are obtained from seven characters back in the state history (S252), and the obtained state "6" is checked for acceptance (S253). The current position becomes "13" (S254), and the repetition ends (S255). The current state of the DFA becomes "6" (S256).

The next compressed block acquired (S27) is "b", so state "6" and character "b" are appended to the state history (S281). Inputting "b" to the DFA changes its state to "1" (S282). The state is checked for acceptance (S283), and the current position becomes "14" (S284). Processing moves on (S29).

The next compressed block acquired (S21), "8, 2", is a pointer (S22), so occurrence position "8" and length "2" are obtained, along with state "6" and character "c" from eight characters back (occurrence position = 8) (S23).
The obtained state "6" is compared with the current DFA state (= 1) stored in the state storage unit 111 (S24).
They do not match, so the current state "1" and the obtained character "c" are appended to the state history (S261). Inputting "c" to the DFA leaves its state at "1" (S262). The state is checked for acceptance (S263), and the current position becomes "15" (S264). One character remains (S265).

State "3" and character "b" are obtained from eight characters back (S23).
The obtained state "3" is compared with the current DFA state (= 1) stored in the state storage unit 111 (S24).
They do not match, so the current state "1" and the obtained character "b" are appended to the state history (S261). Inputting "b" to the DFA leaves its state at "1" (S262). The state is checked for acceptance (S263), and the current position becomes "16" (S264). The repetition ends (S265).

The next compressed block acquired (S27) is “d”, so the state “1” and the character “d” are added to the state history (S281). When the character “d” is input to the DFA, the DFA state becomes “5” (S282). Whether this is an accepting state is determined (S283), and the current position becomes “17” (S284). Processing proceeds to the next block (S29).

The next compressed block acquired, “9, 5” (S21), is a pointer (S22), so the appearance position “9” and the length “5” are acquired, and the state “1” and the character “a” nine characters back (appearance position = 9) are acquired (S23).
The acquired state “1” is compared with the current state (= 5) of the DFA stored in the state storage unit 111 (S24).
Since they do not match, the current state “5” and the acquired character “a” are added to the state history (S261). The character “a” is input to the DFA, and the DFA state becomes “2” (S262). Whether this is an accepting state is determined (S263), and the current position becomes “18” (S264). Four characters of the repetition remain (S265).

The state “2” and the character “b” nine characters back are acquired (S23).
The acquired state “2” is compared with the current state (= 2) of the DFA stored in the state storage unit 111 (S24).
Since they match, the acquired state “2” and the character “b” are added to the state history (S251). The state “3” and the character “c” nine characters back are acquired from the state history (S252), and whether the acquired state “3” is an accepting state is determined (S253). The current position becomes “19” (S254), and three characters of the repetition remain (S255).

The same processing is repeated from here on, so further description is omitted.
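The walkthrough above can be condensed into a short sketch. It is only an illustration under stated assumptions, not the apparatus itself: the function name `search_blocks`, the table `DELTA` (a small DFA that finds the string "ab") and the block encoding (a single character, or a `(distance, length)` tuple) are all hypothetical. For simplicity the sketch re-checks the stored state for every character of a pointer instead of staying in the S251 to S255 loop after the first match; the result is the same.

```python
# Hypothetical DFA that reaches state 2 whenever the input so far ends in "ab".
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}
ACCEPTING = {2}


def search_blocks(blocks, delta, start, accepting):
    """Return 1-based end positions of matches.

    A block is either a single character or a (distance, length) pointer.
    """
    history = []                      # (state before the character, character)
    state, position, hits = start, 0, []

    def feed(ch):
        # S261-S264 / S281-S284: record, run the DFA, test for acceptance.
        nonlocal state, position
        history.append((state, ch))
        state = delta[state][ch]
        position += 1
        if state in accepting:
            hits.append(position)

    for block in blocks:
        if isinstance(block, str):                # literal character
            feed(block)
            continue
        distance, length = block
        for _ in range(length):
            old_state, ch = history[-distance]    # S23: look back in the history
            if old_state != state:                # S24: states differ, run the DFA
                feed(ch)
                continue
            history.append((old_state, ch))       # S251: copy the stored pair
            position += 1                         # S254
            state = history[-distance][0]         # S252: stored successor state
            if state in accepting:                # S253
                hits.append(position)
    return hits
```

When the stored state equals the current state, the successor state is read from the history and the transition table `delta` is not consulted for that character.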

Note that the state transition storage unit 112 need not store the entire state history. For example, if the bit string representing the appearance position 983 is 8 bits long, a pointer can refer back at most 255 characters. State history from 256 or more characters back is therefore never used, and may be deleted starting from the oldest entries.
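One way to bound the history in this manner is a fixed-capacity queue. The following is a minimal sketch assuming an 8-bit appearance position; the names `OFFSET_BITS`, `record` and `look_back` are hypothetical.

```python
from collections import deque

OFFSET_BITS = 8                          # assumed width of the appearance-position field
MAX_DISTANCE = (1 << OFFSET_BITS) - 1    # 255: farthest entry a pointer can reach

# Each entry pairs the DFA state before a character was input with that character.
history = deque(maxlen=MAX_DISTANCE)


def record(state, ch):
    """Append one (state, character) pair; the oldest pair is dropped when full."""
    history.append((state, ch))


def look_back(distance):
    """Return the pair stored `distance` characters before the current position."""
    if not 1 <= distance <= len(history):
        raise IndexError("distance outside the retained history")
    return history[-distance]
```

Because `deque(maxlen=...)` discards from the opposite end on append, no explicit deletion step is needed.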

As described above, when searching text compressed by the LZ77 method, the function of the sliding window that is generally used to restore LZ77-compressed text is extended so that the DFA state at each point is stored together with the text. State transitions from earlier in the search can then be reused, and the same state transitions need not be repeated. This has the effect that the search can be performed at high speed.

Embodiment 3.
Embodiment 3 will be described with reference to FIG. 3 and FIGS. 11 to 13.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes another case of searching text compressed by the LZ77 method.

FIG. 11 is a diagram showing the structure of compressed text in the LZ77 format. Instead of a fixed compression dictionary such as that of FIG. 4, the LZ77 format uses part of the text itself, a fixed-length region called a sliding window, as the dictionary. The length of the sliding window depends on the implementation. Compressed text in the LZ77 format consists solely of the compressed block sequence 300. Each compressed block 802 holds the position of the matching character string in the sliding window 803, the length of the matching character string, and the first mismatched character. In the example of FIG. 11, the compressed block 804 has matching position = 1, match length = 5, and first mismatched character = “b”. This means that the compressed block 804 is equal to the character string “abcdeb”, formed by appending the character “b” to the first through fifth characters of the sliding window 803 (whose length is 8 in this example). The numbers shown below the sliding window 803 are added only for convenience, to make the character positions in the window clear. Hereinafter, the character string that starts at the matching position in the sliding window and has the match length is called the reference character string of the compressed block. The matching position of a compressed block is also called the reference position, and the match length is also called the reference character string length.
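The expansion of a single block can be illustrated as follows. This is a sketch only: the function names are hypothetical, and the window contents “abcdefgh” are an assumption (the text confirms only that the first five characters are “abcde” and that the window length is 8). References that run past the end of the window are not handled.

```python
def expand_block(window, position, length, next_char):
    """Expand one (position, length, next_char) block against the sliding window.

    `position` is 1-based, matching the numbering shown under the window.
    """
    start = position - 1
    reference = window[start:start + length]    # reference character string
    return reference + next_char


def slide(window, text, window_len):
    """Append the expanded text and keep only the last `window_len` characters."""
    return (window + text)[-window_len:]


window = "abcdefgh"                             # assumed contents, length 8
text = expand_block(window, 1, 5, "b")          # block 804: position 1, length 5, "b"
window = slide(window, text, 8)
```

With these assumed contents, `text` is “abcdeb”, which agrees with the character string given for the compressed block 804.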

Various methods have been proposed for speeding up references into the LZ77 sliding window. This embodiment does not depend on which of them is used, as long as the sliding-window function described above is provided.

FIG. 12 illustrates the information stored in the compression dictionary storage unit 113 (an example of a dictionary storage unit) and the state transition storage unit 112 (an example of a history storage unit) in this embodiment. The compression dictionary storage unit 113 stores the sliding window 210. The state transition storage unit 112 stores a state transition history 220, which holds (sliding window length + 1) state transitions, and acceptance positions 230. The acceptance positions 230 record where accepting states occur in the state transition history 220. When the reference position of a compressed block is n, the n-th state counted from the head of the state transition history is called the reference position of the state transition history.

FIG. 13 is a flowchart of the search processing in the compressed text search apparatus 100 of this embodiment. As the initial condition, it is assumed that the state transition table generation unit 105 has generated a state transition table from the search condition input by the search condition input unit 102, and that the table is stored in the state transition table storage unit 106. It is also assumed that the initial state (= 1) is set in the state storage unit 111, and that the compression dictionary storage unit 113 and the state transition history and acceptance positions of the state transition storage unit 112 are empty. A counter for counting the original text length is initialized to 0.

First, in step S701, the compressed block acquisition unit 108 acquires compressed blocks one at a time, in order from the head of the compressed block sequence. In step S702, it is determined whether the current state of the DFA stored in the state storage unit 111 matches the state at the reference position of the state transition history. If the states match (YES), in step S703 the next state is obtained from the (reference position + reference character string length)-th state counted from the head of the state transition history and the first mismatched character, and is set as the current state. In step S704, it is determined whether an accepting state exists between the reference position of the state transition history and the (reference position + reference character string length)-th position. If one exists (YES), the hit position is calculated and output in step S705. The hit position is the current original text length + the acceptance position. If the current state is itself an accepting state, the original text length + the reference character string length is also output as a hit position. If there is no accepting state (NO), processing proceeds to step S706 without further action. In step S706, the sliding window and the state transition history are updated.

To update the sliding window, the character string in the sliding window is first shifted forward by (reference character string length + 1) characters. The reference character string and the first mismatched character are then appended after the last character of the sliding window. The state transition history is likewise shifted forward by (reference character string length + 1) entries, and the portion of the state transition history covering the reference character string length, starting from the reference position, is appended at the end. The current state is then also appended to the state transition history.

In step S707, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S701. If it has (YES), the search processing ends. In this step S707, (reference character string length + 1) is added to the original text length.

If, in step S702, the current state does not match the state at the reference position of the state transition history (NO), the history of state transitions is obtained in step S708 for the character string formed by appending the first mismatched character to the reference character string. That is, as in the processing flow of FIG. 7, characters are taken one at a time from the head of that character string, and the next state is obtained from the state transition machine for each. When step S708 is finished, the sliding window and the state transition history are updated in step S706: the character string in the sliding window is shifted forward by (reference character string length + 1) characters, and the reference character string and the first mismatched character are appended after the last character of the sliding window. The state transition history is likewise shifted forward by (reference character string length + 1) entries, and the state transition history obtained in step S708 is set at the end.
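Steps S701 to S708 might be sketched as below. This is an illustration under several assumptions rather than the flowchart of FIG. 13 itself: the function name `search_lz77`, the table `DELTA` (a small DFA that finds "ab"), the use of `(0, 0, ch)` for a block with no reference string, and the tracking of hits by scanning the reused states are all hypothetical, and references that extend past the end of the window are not handled.

```python
# Hypothetical DFA that reaches state 2 whenever the input so far ends in "ab".
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}


def search_lz77(blocks, delta, start, accepting, window_len):
    """Return 1-based end positions of matches in the text encoded by `blocks`.

    Each block is (position, length, next_char); position is 1-based.
    """
    window = ""            # sliding window of decoded characters
    states = [start]       # states[i] = DFA state before window[i], plus one final state
    state = start
    consumed = 0           # original text length processed so far
    hits = []
    for position, length, ch in blocks:
        first = position - 1
        ref = window[first:first + length]
        if length > 0 and states[first] == state:
            # S702 YES: the stored trajectory over the reference string is reused.
            path = states[first + 1:first + length + 1]
        else:
            # S708: feed the reference string to the DFA one character at a time.
            path, s = [], state
            for c in ref:
                s = delta[s][c]
                path.append(s)
        after_ref = path[-1] if path else state
        state = delta[after_ref][ch]           # transition on the mismatched character
        path.append(state)
        hits += [consumed + i + 1 for i, s in enumerate(path) if s in accepting]
        consumed += len(path)
        # S706: slide the window and the state history together.
        window = (window + ref + ch)[-window_len:]
        states = (states + path)[-(window_len + 1):]
    return hits
```

The list `states` is kept one entry longer than `window`, which corresponds to the (sliding window length + 1) entries of the state transition history.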

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the reference position of the state transition history. If the states match, state transition processing that would otherwise take as many transitions as the reference character string length can be completed in two state transitions, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts that regular expression. Compressed text in the LZ77 format can therefore be searched at high speed with search conditions that include regular expressions.

The compressed text search apparatus described here has the following features.
It stores an LZ77-format sliding window in the compression dictionary storage unit.
It stores a state transition history of length (sliding window length + 1) in the state transition storage unit.
It reads an LZ77-format compressed block and, when the current state matches the first state of the state transition history referenced by the compressed block, advances the state to the last character of the reference character string in a single state transition and then advances the state once more using the mismatched character.

Embodiment 4.
Embodiment 4 will be described with reference to FIGS. 3, 14, 15, and 44.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes the case of searching text compressed by the LZSS method.

FIG. 14 is a diagram showing an example of the contents stored in the compressed text storage unit 103 and the state transition storage unit 112 in this embodiment.

The compressed text storage unit 103 stores a compressed block sequence 300 (an example of a code sequence). The compressed block sequence 300 shown in FIG. 14 is obtained from the original character string 500 “abcdecbabcdebecdabcded” by applying the replacement rules of FIG. 44.

The state transition storage unit 112 (an example of a history storage unit) stores, as the state history, the states of the DFA (an example of an automaton) and the characters input to the DFA.
Unlike Embodiment 2, this embodiment can store a plurality of states for each character.
For example, the character 431 is associated with the state 481, and the character 438 is associated with both the state 488 and the state 448. Possible concrete realizations include table, list, and pointer structures, but other realizations may also be used.
As in Embodiment 2, characters and states may instead be stored in one-to-one correspondence.

FIG. 15 is a flowchart showing an example of the control flow of the search processing of the compressed text search apparatus 100 in this embodiment.
FIG. 15 is almost the same as FIG. 10 described in Embodiment 2, so only the differences are described.

The LZSS method uses a different encoding from the LZ77 method, so the corresponding parts of the control flow also differ.

Specifically, the compressed block acquisition unit 108 first acquires one bit (flag 981) from the compressed block sequence 300 and determines the length of the compressed block. If the flag 981 is “1”, rule 1 of FIG. 44 applies, so the following 8 bits (bit string 982) are acquired. If the flag 981 is “0”, rule 2 of FIG. 44 applies, so the following 13 bits (appearance position 983 and length 984) are acquired (S21).
If the flag 981 is “1”, the condition determination unit 114 determines that the block contains no pointer to another partial character string (S22); the character acquisition unit 109 acquires the character from the compressed block, and the state transition storage unit 112 adds the current state and the acquired character to the end of the state history (S281).
If the flag 981 is “0”, the condition determination unit 114 determines that the block contains a pointer to another partial character string (S22), and acquires from the compressed block the appearance position 983 of (the first character of) the other partial character string, expressed as a distance from the current position, and the length 984 of the partial character string. It then acquires the state and character corresponding to the appearance position from the state history stored in the state transition storage unit 112, and compares the state with the current state of the DFA stored in the state storage unit 111 (S24).
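The flag-driven parsing in S21 and S22 can be sketched as follows. The split of the 13 pointer bits into an 8-bit appearance position and a 5-bit length is an assumption made here for illustration, as are the function name `read_blocks`, the representation of the bit stream as a string of “0” and “1” characters, and the reading of the 8-bit code as a character code.

```python
def read_blocks(bits, offset_bits=8, length_bits=5):
    """Split a string of '0'/'1' characters into LZSS blocks.

    Flag 1 -> ("char", character) taken from the next 8 bits (rule 1).
    Flag 0 -> ("pointer", distance, length) taken from the next 13 bits (rule 2).
    """
    blocks, i = [], 0
    while i < len(bits):
        flag, i = bits[i], i + 1
        if flag == "1":
            blocks.append(("char", chr(int(bits[i:i + 8], 2))))
            i += 8
        else:
            distance = int(bits[i:i + offset_bits], 2)
            i += offset_bits
            length = int(bits[i:i + length_bits], 2)
            i += length_bits
            blocks.append(("pointer", distance, length))
    return blocks


# One literal "a" followed by a pointer with appearance position 4 and length 2.
sample = "1" + format(ord("a"), "08b") + "0" + format(4, "08b") + format(2, "05b")
```

Calling `read_blocks(sample)` yields one character block and one pointer block, which the search loop can then dispatch on as in S22.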

This embodiment differs from Embodiment 2 in one more respect: the state transition storage unit 112 is configured so that a plurality of states can be stored for each character.

In S24, when a plurality of states have been acquired, the condition determination unit 114 determines whether any of them matches the current state of the DFA (S24).
If one matches, the state transition storage unit 112 adds all of the acquired states, including those that did not match, together with the character, to the end of the state history (S251).
The transition destination calculation unit 116 acquires the next states and character (S252).
The search success determination unit 117 determines whether, among the acquired states, the state corresponding to the matched state is an accepting state (S253).

For example, in FIG. 14 the state transition storage unit 112 stores two states, the state 488 “4” and the state 448 “2”, for the character 438 “b”. Suppose that the current state is “2”.
In S23, the condition determination unit 114 acquires the states “4” and “2” and the character “b”.
In S24, the condition determination unit 114 determines that there is a match, because the acquired states (4, 2) include the current state (= 2).
In S251, the state transition storage unit 112 adds all of the acquired states, “4” and “2”, and the character “b” to the end of the state history.
In S252, the transition destination calculation unit 116 acquires the next states, the state 489 “1” and the state 449 “3”, and the character 439 “c”.

Here, the next state 489 “1” corresponds to the state 488 “4”: it is the state reached by inputting the character “b” when the DFA was in the state 488 “4”. Likewise, the next state 449 “3” corresponds to the state 448 “2”: it is the state reached by inputting the character “b” when the DFA was in the state 448 “2”. FIG. 14 expresses this correspondence with arrows; the state transition storage unit 112 stores it using, for example, pointers.

In S253, the search success determination unit 117 determines whether, of the two acquired states, the state 449 “3” corresponding to the matched state 448 is an accepting state.

Similarly, in S256, when a plurality of states have been acquired, the transition destination state calculated by the transition destination calculation unit 116 is the state corresponding to the matched state, and the state storage unit 111 stores that state (S256).

In S24, if none of the acquired states matches the current state of the DFA, the condition determination unit 114 determines that there is no match (S24).
The state transition storage unit 112 adds the current state, the acquired states, and the acquired character to the end of the state history (S261).

Consequently, each time the states fail to match, the number of states stored for that character increases by one.

In self-referencing compression techniques, when the same partial character string occurs many times in the original character string, the occurrence nearest to the current position may be the one referenced. One reason is that the number of bits available for encoding the appearance position limits how far back a reference can reach. Another is that referencing very distant partial character strings would make compression too slow and would consume too much storage during restoration to be practical.

When the same partial character string occurs many times and only one state is stored, the states are unlikely to match. If a plurality of states can be stored, the likelihood of a match rises accordingly. This has the effect that the search can be performed even faster.
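The effect of keeping several states per character can be sketched as follows. This is a simplified illustration under assumptions, not the structure of FIG. 14: each history entry holds a dictionary from “state before the character” to “state after the character” in place of the pointers between entries, and the names `search_multi` and `DELTA` (a small DFA that finds "ab") are hypothetical. The counter `dfa_steps` records how many real DFA transitions were needed.

```python
# Hypothetical DFA that reaches state 2 whenever the input so far ends in "ab".
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}


def search_multi(blocks, delta, start, accepting):
    """Return (hit positions, number of DFA transitions actually computed).

    A block is either a single character or a (distance, length) pointer.
    """
    history = []                      # (character, {state before: state after})
    state, position = start, 0
    hits, dfa_steps = [], 0

    def consume(ch, known):
        nonlocal state, position, dfa_steps
        links = dict(known)           # carry over every state stored for this character
        if state not in links:        # no stored state matches: one real DFA transition
            links[state] = delta[state][ch]
            dfa_steps += 1
        history.append((ch, links))
        state = links[state]
        position += 1
        if state in accepting:
            hits.append(position)

    for block in blocks:
        if isinstance(block, str):
            consume(block, {})        # literal: nothing to reuse
        else:
            distance, length = block
            for _ in range(length):
                consume(*history[-distance])
    return hits, dfa_steps
```

For the blocks `["c", "a", "b", (2, 2), (2, 2)]`, which encode “cababab”, the sketch reports hits at positions 3, 5 and 7 while computing only four DFA transitions for seven characters, because the second pointer finds both of its start states already stored.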

The operation described above is now explained in detail using the specific example shown in FIGS. 6 and 14.

Processing has finished up to the compressed block 334, and the state transition storage unit 112 stores the states 481 to 484 and the characters 431 to 434 as the state history.
The state storage unit 111 stores “5” as the current state of the DFA.

The next compressed block 335 is acquired (S21). Since the flag 981 is “0”, it is a pointer (S22). The appearance position is “4” and the length is “2”, so the state 481 “1” and the character 431 “a” four characters back are acquired (S23), and the acquired state “1” is compared with the current state “5” (S24).
Since they do not match, the current state “5”, the acquired state “1”, and the character “a” are added to the end of the state history (states 485 and 445, character 435) (S261).
The character “a” is input to the DFA, and the current state transitions to “2” (S262).
Whether this is an accepting state is determined (S263), and the current position is advanced (S264). One character of the repetition remains (S265).

The state 482 “2” and the character 432 “b” four characters back are acquired, and the acquired state “2” is compared with the current state “2” (S24).
Since they match, the acquired state “2” and the character “b” are added to the end of the state history (state 486 and character 436) (S251).
At this time, because the same state “2” results whether the previous state was “1” or “5”, the state 486 is associated with both the state 485 and the state 445 (the arrows in FIG. 14).

The state 483 “3” and the character 433 “c” four characters back are acquired from the state history (S252), whether the state is an accepting state is determined (S253), and the current position is advanced (S254). The repetition ends (S255).

The DFA state is updated to the acquired state “3” (S256), and processing proceeds to the next compressed block (S29).

The next compressed block 336 is acquired (S21). Since the flag 981 is “1”, it is a character (S22). The character “e” is acquired, and the current state “3” and the character “e” are added to the state history (state 487 and character 437) (S281). When the character “e” is input to the DFA, the DFA state becomes “4” (S282). Whether this is an accepting state is determined (S283), and the current position is advanced (S284). Processing proceeds to the next compressed block (S29).

The next compressed block 337 is acquired (S21). Since the flag 981 is “0”, it is a pointer (S22). The appearance position is “6” and the length is “7”, so the state 482 “2” and the character 432 “b” six characters back are acquired (S23), and the acquired state “2” is compared with the current state “4” (S24).

Since they do not match, the current state “4”, the acquired state “2”, and the character “b” are added to the state history (states 488 and 448, character 438) (S261).
The character “b” is input to the DFA, and the current state transitions to “1” (S262).
Whether this is an accepting state is determined (S263), and the current position is advanced (S264). Six characters of the repetition remain (S265).

The state 483 “3” and the character 433 “c” six characters back are acquired (S23), and the acquired state “3” is compared with the current state “1” (S24).
Since they do not match, the current state “1”, the acquired state “3”, and the character “c” are added to the state history (states 489 and 449, character 439) (S261).
At this time, the state 489 is associated with the state 488, and the state 449 is associated with the state 448.
The character “c” is input to the DFA, and the current state transitions to “1” (S262).
Whether this is an accepting state is determined (S263), and the current position is advanced (S264). Five characters of the repetition remain (S265).

The state 484 “4” and the character 434 “d” six characters back are acquired (S23), and the acquired state “4” is compared with the current state “1” (S24).
Since they do not match, the current state “1”, the acquired state “4”, and the character “d” are added to the state history (states 490 and 450, character 440) (S261).
At this time, the state 490 is associated with the state 489, and the state 450 is associated with the state 449.
The character “d” is input to the DFA, and the current state transitions to “5” (S262).
Whether this is an accepting state is determined (S263), and the current position is advanced (S264). Four characters of the repetition remain (S265).

The state 485 “5”, the state 445 “1”, and the character 435 “a” six characters back are acquired (S23), and the acquired states are compared with the current state “5” (S24).
Since the current state matches the state 485, the acquired states “5” and “1” and the character “a” are added to the state history (states 491 and 451, character 441) (S251).
At this time, the state 491 is associated with both the state 490 and the state 450, because either state leads to the same state “5”. The state 451 is not associated with any previous state.

The state 486 “2” and the character 435 “b” from six characters earlier are acquired (S252).
Whether the acquired state is an accepting state is determined (S253), and the current position is advanced (S254). Three characters remain to be processed (S265).

As described above, when searching text compressed by the LZSS method, the sliding window that is generally used to restore LZSS compressed text is extended so that the DFA state at each position is stored together with the characters. The state transitions obtained earlier in the search can then be reused, so the same state transitions need not be repeated. This has the effect of making the search faster.

Furthermore, by storing not only the DFA state at each position but also the past state transitions that it refers to, past state transitions can be reused even when they are not referenced directly, so the same state transitions need not be repeated. This has the effect of making the search even faster.

Note that a configuration in which a plurality of states can be stored for one character in this way is not limited to LZSS compressed text, and can also be used when searching LZ77 compressed text.

There are also various other self-referential compression methods. One example takes the LZ77 or LZSS code string described above and further replaces it using Huffman coding (static or dynamic) to shorten the overall bit length.
The embodiment described here can also be applied to text compressed by such compression methods.

Embodiment 5.
Embodiment 5 will be described with reference to FIG. 3, FIG. 13, and FIG. 16.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes another case of searching text compressed by the LZSS method.

FIG. 16 is a diagram showing the structure of compressed text in the LZSS format. The LZSS format is a compression format that aims to improve compression efficiency by removing redundant data from the compressed blocks of the LZ77 format. Like the LZ77 format, the LZSS format does not hold a fixed compression dictionary 303 as shown in FIG. 4; instead it uses a part of its own text, called a fixed-length sliding window, as the dictionary. Compressed text in the LZSS format consists only of the compressed block sequence 300. In the LZ77 format, even when the sliding window contains no reference character string, the compressed block still carries the redundant information reference position = 0 and reference character string length = 0. The LZSS format removes this redundant data by placing a 1-bit flag at the head of each compressed block. When the sliding window contains no reference character string, the leading bit is set to 0, as in the compressed block 1002, and is followed by the non-matching character. When a reference character string exists, the leading bit is set to 1, as in the compressed block 1004, and is followed by the reference position and the reference character string length.
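The flag-based block layout can be illustrated with a minimal Python sketch. The field widths (8-bit character, 4-bit reference position, 4-bit reference character string length), the bit-string input, and the names `parse_lzss_blocks`, `CHAR_BITS`, `POS_BITS`, and `LEN_BITS` are assumptions made for illustration; the specification does not state them.

```python
# Assumed field widths; the specification does not give concrete values.
CHAR_BITS = 8   # bits for a non-matching character
POS_BITS = 4    # bits for the reference position
LEN_BITS = 4    # bits for the reference character string length


def parse_lzss_blocks(bits):
    """Split a string of '0'/'1' characters into LZSS compressed blocks.

    A block starting with flag 0 carries one non-matching character.
    A block starting with flag 1 carries a reference position and length.
    """
    blocks = []
    i = 0
    while i < len(bits):
        flag = bits[i]
        i += 1
        if flag == "0":
            code = int(bits[i:i + CHAR_BITS], 2)
            i += CHAR_BITS
            blocks.append(("lit", chr(code)))
        else:
            pos = int(bits[i:i + POS_BITS], 2)
            i += POS_BITS
            length = int(bits[i:i + LEN_BITS], 2)
            i += LEN_BITS
            blocks.append(("ref", pos, length))
    return blocks


# One literal block for "a", then one reference block (position 0, length 3).
example_bits = "0" + format(ord("a"), "08b") + "1" + "0000" + "0011"
example_blocks = parse_lzss_blocks(example_bits)
```

A literal block here costs 9 bits and a reference block also 9 bits, whereas an LZ77-style block would always carry position, length, and character together.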

Various methods have been proposed for speeding up references to the LZSS sliding window. The compressed text search apparatus of this embodiment does not depend on how the sliding window is implemented, as long as it provides the sliding window function described above.

The flow of search processing in the compressed text search apparatus 100 of this embodiment is the same as that shown in FIG. 13. As the initial condition, it is assumed that the state transition table generation unit 105 has generated a state transition table from the search condition input by the search condition input unit 102, and that the state transition table storage unit 106 stores it. It is also assumed that the initial state is set in the state storage unit 111. The compression dictionary storage unit 113, and the state transition history and accepting positions in the state transition storage unit 112, are assumed to be empty. In addition, a counter for counting the original text length is initialized to 0.

The main differences from Embodiment 3 are steps S702 and S706. First, in step S701, compressed blocks are acquired one at a time in order from the head of the compressed block sequence. In step S702, the leading bit of the compressed block is examined. If the leading bit is 1, it is further determined whether the current state matches the state at the reference position of the state transition history. If the states match (YES), in step S703 the (reference position + reference character string length)-th state from the head of the state transition history is set as the current state. In step S704, it is determined whether any state from the reference position up to the (reference position + reference character string length)-th state of the state transition history is an accepting state. If there is an accepting state (YES), the hit position is calculated and output in step S705, and the process proceeds to step S706. If there is no accepting state (NO), the process proceeds to step S706 without doing anything. In step S706, the sliding window and the state transition history are updated.

To update the sliding window, the character string in the sliding window is first shifted forward by the reference character string length. The reference character string is then appended after the last character of the sliding window. Likewise, the state transition history is shifted forward by the reference character string length, and the portion of the state transition history covering the reference character string length from the reference position is appended at the end.

In step S707, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S701. If it has been reached (YES), the search process ends. Also in step S707, the original text length is increased by 1 when the leading bit of the compressed block is 0, and by the reference character string length when the leading bit is 1.

If, in step S702, the leading bit of the compressed block is 0, or the current state does not match the state at the reference position of the state transition history (NO), the history of state transitions for the non-matching character or the reference character string is obtained in step S708. That is, as in the processing flow of FIG. 5, characters are taken one at a time from the head of the character string and the state transition machine yields the next state for each. When step S708 is complete, the sliding window and the state transition history are updated in step S706. That is, the character string in the sliding window is shifted forward by one character or by the reference character string length, and the non-matching character or the reference character string is appended after the last character of the sliding window. Likewise, the state transition history is shifted forward by one character or by the reference character string length, and the history of state transitions obtained in step S708 is appended at the end.
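The flow of steps S701 to S708 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the apparatus itself: blocks are given as already-parsed tuples, reference positions are counted from 0 at the head of the window, a reference never extends past the end of the window, and `step_abc` is a hypothetical DFA for the pattern “abc” (it is not the state transition table of the figures). The names `search_lzss`, `window`, and `history` are likewise illustrative.

```python
def step_abc(state, ch):
    """Hypothetical DFA: state 4 is reached whenever the text ends in "abc"."""
    if ch == "a":
        return 2
    if state == 2 and ch == "b":
        return 3
    if state == 3 and ch == "c":
        return 4
    return 1


def search_lzss(blocks, step=step_abc, start=1, accepting=frozenset({4}),
                window_size=4096):
    """Return 1-based end positions of matches in the original text."""
    window = []        # most recent characters of the restored text
    history = [start]  # history[i] = state before window[i]; history[-1] = current state
    consumed = 0       # original text length so far
    hits = []
    for block in blocks:                      # S701
        new_states = None
        if block[0] == "lit":                 # leading bit 0
            chars = [block[1]]
        else:                                 # leading bit 1
            _, pos, length = block
            chars = window[pos:pos + length]
            if history[-1] == history[pos]:   # S702: states match
                # S703: reuse the recorded transitions instead of stepping the DFA.
                new_states = history[pos + 1:pos + length + 1]
        if new_states is None:                # S708: run the DFA character by character
            new_states = []
            state = history[-1]
            for ch in chars:
                state = step(state, ch)
                new_states.append(state)
        for offset, state in enumerate(new_states, start=1):  # S704, S705
            if state in accepting:
                hits.append(consumed + offset)
        consumed += len(chars)
        window.extend(chars)                  # S706: update window and history
        history.extend(new_states)
        drop = len(window) - window_size
        if drop > 0:
            del window[:drop]
            del history[:drop]
    return hits


# "dabcd" as literals, then a reference to "abc" at window position 1.
demo_hits = search_lzss([("lit", c) for c in "dabcd"] + [("ref", 1, 3)])
```

In the demo, the current state before the reference block equals the state recorded at window position 1, so the DFA is not stepped for the referenced “abc”; the recorded states are copied instead. Keeping `history` one element longer than `window` mirrors the state transition history of length (sliding window length + 1).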

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the reference position of the state transition history. If the states match, state transition processing that would otherwise require a number of steps proportional to the reference character string length can be performed as a single state transition, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. As a result, compressed text in the LZSS format can be searched at high speed using search conditions that include regular expressions.

Besides the LZSS format shown here, text compressed in other formats derived from the LZ77 format, such as the LZB (Lempel-Ziv-Bell) format and the LZBW (Lempel-Ziv-Bender-Wolf) format, can be searched in the same way by the compressed text search apparatus of this embodiment.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores an LZSS-format sliding window.
The state transition storage unit stores a state transition history of length (sliding window length + 1).
When an LZSS-format compressed block is read and the current state matches the first state of the state transition history referenced by the compressed block, the state is advanced to that of the last character of the reference character string in a single state transition.

Embodiment 6.
Embodiment 6 will be described with reference to FIG. 3, FIG. 6, FIG. 17, FIG. 18, and FIG. 48.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes the case of searching text compressed by the LZ78 method.

FIG. 17 is a diagram showing an example of the contents stored by the compressed text storage unit 103, the compression dictionary storage unit 113 (an example of a dictionary storage unit), and the state transition storage unit 112 (an example of a history storage unit) in this embodiment.

The compressed text storage unit 103 stores a compressed block sequence 300. The compressed block sequence 300 is obtained by replacing the original character string 500 “aababcdecabcdebabcdecc” according to the rule of FIG. 48.
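As an illustration of how such a block sequence arises, the following Python sketch encodes the original character string 500 with the textbook LZ78 procedure. It is an assumption that the rule of FIG. 48 corresponds to standard LZ78 with alternating reference numbers and characters; the function name `lz78_encode` and the tuple representation are illustrative only. The first six pairs it produces agree with the blocks processed in the worked example of this embodiment.

```python
def lz78_encode(text):
    """Encode text as LZ78 pairs (forward reference number, suffix character)."""
    index = {}      # substring -> reference number (1-based)
    tokens = []
    current = ""    # longest dictionary substring matched so far
    ref = 0         # its reference number (0 = empty)
    for ch in text:
        if current + ch in index:
            current += ch
            ref = index[current]
        else:
            tokens.append((ref, ch))
            index[current + ch] = len(index) + 1
            current, ref = "", 0
    if current:
        # Input ended inside a known substring: emit its prefix plus its last character.
        tokens.append((index.get(current[:-1], 0), current[-1]))
    return tokens


tokens = lz78_encode("aababcdecabcdebabcdecc")
```

Each pair corresponds to two compressed blocks: an odd-numbered block holding the reference number and an even-numbered block holding the character.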

The state transition storage unit 112 stores the state history in the form of a table whose rows are the “reference number” of the dictionary and whose columns are the “first state”. The table form is not essential; the history may instead be stored as a list, for example.
The state history stores a “transition destination state” and an “accepting position”. The state history is initially empty and is filled in as the search proceeds.
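One possible in-memory form of this table is a mapping keyed by the pair (reference number, first state). The Python sketch below is only an illustration; the names `state_history`, `record`, and `lookup` are hypothetical, and the six entries are the values that the worked example of this embodiment stores, with accepting position 0 meaning that there is none.

```python
# (reference number, first state) -> (transition destination state, accepting position)
state_history = {}


def record(ref, first_state, dest_state, accept_pos=0):
    """Store the result of feeding the substring `ref` to the DFA from `first_state`."""
    state_history[(ref, first_state)] = (dest_state, accept_pos)


def lookup(ref, first_state):
    """Return the stored entry, or None when the field is still empty."""
    return state_history.get((ref, first_state))


# Entries recorded while the first three pairs of compressed blocks are processed.
record(1, 1, 2)
record(1, 2, 2)
record(2, 2, 3)
record(1, 3, 2)
record(2, 3, 3)
record(3, 3, 4, accept_pos=3)
```

A sparse mapping like this stores only the fields that have actually been filled, which is the same motivation as the list form mentioned above.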

The compression dictionary storage unit 113 stores a compression dictionary (an example of a replacement dictionary) extracted from the compressed block sequence 300. The compression dictionary associates each “reference number” of the dictionary with a forward reference number and a suffix character (an example of a backward character string). The partial character string corresponding to a “reference number” is the partial character string corresponding to the forward reference number (an example of a forward character string) followed by the suffix character. When the forward reference number is “0”, the corresponding partial character string consists of the single suffix character.
For example, reference number 1 corresponds to the partial character string “a”. Reference number 2 corresponds to the partial character string “a” of reference number 1 followed by the suffix “b”, that is, “ab”. Reference number 3 is “ab” + “c”, that is, “abc”.
The compression dictionary is initially empty, and entries are added as the search proceeds.
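The way a partial character string is recovered from forward reference numbers and suffixes can be sketched in Python as follows. The function name `substring` and the dictionary layout are assumptions for illustration; the six pairs are the first blocks of the worked example in this embodiment.

```python
def substring(dictionary, ref):
    """Rebuild the partial character string for `ref` by following forward references."""
    chars = []
    while ref != 0:
        ref, suffix = dictionary[ref]
        chars.append(suffix)
    return "".join(reversed(chars))


dictionary = {}   # reference number -> (forward reference number, suffix character)
position = 0      # current position in the original text
positions = []    # current position after each pair of compressed blocks
pairs = [(0, "a"), (1, "b"), (2, "c"), (0, "d"), (0, "e"), (0, "c")]
for forward_ref, suffix in pairs:
    new_ref = len(dictionary) + 1            # the dictionary grows by one entry per pair
    dictionary[new_ref] = (forward_ref, suffix)
    position += len(substring(dictionary, new_ref))
    positions.append(position)
```

After these six pairs the current position is 9, and reference numbers 1 to 3 expand to “a”, “ab”, and “abc”.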

FIG. 18 is a flowchart showing an example of the control flow of the search processing of the compressed text search apparatus 100 in this embodiment.

In the initialization process (S10), based on the search condition (an example of a search pattern) input by the search condition input unit 102, the state transition table generation unit 105 generates a state transition table corresponding to a state transition machine (DFA) that accepts the search condition, and the state transition table storage unit 106 stores it.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state history that it stores.
The compression dictionary storage unit 113 empties the compression dictionary that it stores.
The current position counter 115 stores 0 as the current position.

The compressed block acquisition unit 108 (an example of a code acquisition unit) acquires one compressed block (an example of a code) at a time, in order from the head of the compressed block sequence (an example of a code sequence) (S31). The state transition storage unit 112 records the current state of the DFA stored in the state storage unit 111 as the first state.

According to the rule of FIG. 48, the code 971 indicating a reference number and the code 972, a bit string representing a character, appear alternately, so each odd-numbered compressed block is a code 971 indicating a reference number.

The compressed block acquisition unit 108 passes the reference number to the condition determination unit 114 and executes the expansion routine (S32).

In the expansion routine, the condition determination unit 114 determines whether the received reference number is 0 (S321).
If the reference number is 0, the corresponding partial character string is empty, so the expansion routine ends.
If the reference number is not 0, the corresponding partial character string is registered in the compression dictionary. The condition determination unit 114 therefore refers to the state history stored in the state transition storage unit 112. It looks at the “reference number” row and the “current state” column of the state history table to see whether a state history from an earlier search of the partial character string for that reference number is stored (S322).

If it is stored, the transition destination calculation unit 116 acquires the transition destination state from the state history (S323) and updates the current state of the DFA stored in the state storage unit 111 to the transition destination state (S324).
Next, the search success determination unit 117 acquires the accepting position from the state history and, if there is an accepting position, calculates and outputs the hit position (S325).

If the condition determination unit 114 determines that no state history is stored (S322), the character acquisition unit 109 (an example of a character string restoration unit) acquires the forward reference number from the compression dictionary (S41).
The compressed block acquisition unit 108 passes the forward reference number to the condition determination unit 114 and executes the expansion routine recursively (S42).

Next, the character acquisition unit 109 acquires the suffix character from the compression dictionary (S43).

The character acquisition unit 109 inputs the suffix character to the DFA, the state transition machine 110 calculates the transition destination state, and the current state of the DFA stored in the state storage unit 111 is updated to the transition destination state (S44).

The search success determination unit 117 determines whether the current state is an accepting state and, if so, calculates and outputs the hit position (S45).

Finally, the state transition storage unit 112 stores the current state and the accepting position as the state history in the “reference number” row and the “first state” column (S48), and the expansion routine ends.

When the expansion routine ends, the compressed block acquisition unit 108 acquires the next compressed block from the compressed text storage unit 103 (S33).
Since this is an even-numbered compressed block, it is a code 972, a bit string representing a character.

The character acquisition unit 109 inputs the character (suffix character) corresponding to this compressed block to the DFA, the state transition machine 110 calculates the transition destination state, and the current state of the DFA stored in the state storage unit 111 is updated to the transition destination state (S34).

The search success determination unit 117 determines whether the current state is an accepting state and, if so, calculates and outputs the hit position (S35).

The current position is updated by adding the length of the partial character string corresponding to the two compressed blocks to the current position stored in the current position counter 115 (S36).

The compression dictionary storage unit 113 registers the reference number and the suffix character in the compression dictionary (S37).

The state transition storage unit 112 acquires the reference number corresponding to the partial character string newly registered in the compression dictionary, and stores the current state and the accepting position as the state history in the “reference number” row and the “first state” column of the state history (S38).

The above processing is repeated until no compressed blocks remain (S18).
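The main loop and the recursive expansion routine described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: compressed blocks are given as (reference number, suffix character) pairs, `step_abc` is a hypothetical DFA for the pattern “abc” rather than the state transition table 200 of FIG. 6, and the state history keeps a tuple of hit offsets instead of a single accepting position. The names `search_lz78` and `expand` are illustrative.

```python
def step_abc(state, ch):
    """Hypothetical DFA: state 4 is reached whenever the text ends in "abc"."""
    if ch == "a":
        return 2
    if state == 2 and ch == "b":
        return 3
    if state == 3 and ch == "c":
        return 4
    return 1


def search_lz78(pairs, step=step_abc, start=1, accepting=frozenset({4})):
    """Return 1-based end positions of matches in the original text."""
    dictionary = {}   # reference number -> (forward reference number, suffix)
    length = {0: 0}   # reference number -> length of its partial character string
    history = {}      # (reference number, first state) -> (destination state, hit offsets)
    state, position, hits = start, 0, []

    def expand(ref, first):
        """Expansion routine: feed substring `ref` to the DFA starting from `first`."""
        if ref == 0:                                   # S321: empty substring
            return first, ()
        if (ref, first) in history:                    # S322 to S325: reuse the history
            return history[(ref, first)]
        forward_ref, suffix = dictionary[ref]          # S41, S43
        mid, offsets = expand(forward_ref, first)      # S42: recursive call
        dest = step(mid, suffix)                       # S44
        if dest in accepting:                          # S45
            offsets += (length[ref],)
        history[(ref, first)] = (dest, offsets)        # S48
        return dest, offsets

    for ref, suffix in pairs:                          # S31, S33
        first = state
        state, offsets = expand(ref, first)            # S32
        state = step(state, suffix)                    # S34
        new_ref = len(dictionary) + 1
        dictionary[new_ref] = (ref, suffix)            # S37
        length[new_ref] = length[ref] + 1
        if state in accepting:                         # S35
            offsets += (length[new_ref],)
        hits.extend(position + offset for offset in offsets)
        history[(new_ref, first)] = (state, offsets)   # S38
        position += length[new_ref]                    # S36
    return hits


pairs = [(0, "a"), (1, "b"), (2, "c"), (0, "d"), (0, "e"), (0, "c"),
         (3, "d"), (5, "b"), (7, "e"), (6, "c")]
found = search_lz78(pairs)
```

With the hypothetical “abc” DFA, the sketch reports the hit at position 6 that also appears in the worked example, and later hits are found from stored history entries without feeding “abc” to the DFA again. Unlike the routine described in the text, which outputs hits inside the expansion routine, this sketch returns the offsets to the main loop and converts them to absolute positions there.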

The operation described above will now be explained using the specific example shown in FIG. 6 and FIG. 17.

In the initialization process (S10), each storage unit initializes its stored contents.
The state transition table storage unit 106 stores the state transition table 200 shown in FIG. 6. The state transition table 200 is generated by the state transition table generation unit 105 based on the search condition (which includes a regular expression) input by the search condition input unit 102. In the DFA corresponding to the state transition table 200, the initial state is the state with state number 1, and the accepting state is the state with state number 4. This information is also stored in the state transition table storage unit 106. A DFA always has exactly one initial state, but it may have a plurality of accepting states.
The state storage unit 111 stores the initial state (state number = 1).
The state transition storage unit 112 empties the state history that it stores.
The compression dictionary storage unit 113 empties the compression dictionary that it stores.
The current position counter 115 stores 0 as the current position.

In the main loop, the compressed block acquisition unit 108 acquires the first compressed block from the compressed block sequence 300 stored in the compressed text storage unit 103 (S31). The first compressed block represents a reference number, and the reference number is “0”.
The state transition storage unit records the current state (= 1) of the DFA stored in the state storage unit 111 as the first state.

Next, the compressed block acquisition unit 108 passes the reference number “0” to the condition determination unit 114 and calls the expansion routine (S32). To distinguish it from recursive calls, a call from the main loop is referred to as nesting level 1.

In the expansion routine, the condition determination unit 114 determines whether the reference number is 0 (S321). In this case the reference number is 0, so the expansion routine (nesting level 1) ends.

Returning to the main loop, the compressed block acquisition unit 108 acquires the next compressed block (S33). Since it is an even-numbered compressed block, it represents a character, and that character (suffix character) is “a”.

The character acquisition unit 109 inputs the character “a” to the DFA, the state transition machine 110 refers to the state transition table 200 stored in the state transition table storage unit 106 and calculates the transition destination state (= 2), and the current state of the DFA stored in the state storage unit 111 is updated to “2” (S34).

The search success determination unit 117 determines that the current state (= 2) is not the accepting state (= 4) and does nothing (S35).

The current position counter 115 adds 1 to the current position, and the current position becomes “1”.

The compression dictionary storage unit 113 registers the reference number “0” and the suffix “a” in the compression dictionary. Since the compression dictionary in the compression dictionary storage unit 113 was empty, the reference number (registration number) corresponding to the registered partial character string is “1”.

The state transition storage unit 112 stores the current state “2” (the transition destination state) and the accepting position (“0”, since there is none) in the field corresponding to reference number “1” and first state “1” (S38).

The next compressed block (reference number “1”) is acquired (S18, S31). The first state is “2”.
The reference number “1” is passed and the expansion routine (nesting level 1) is called (S32). Since the reference number is not 0 (S321), the state history is checked (S322).
Since the field of the state history for reference number “1” and first state “2” is empty (S322), the entry for reference number “1” in the compression dictionary is consulted, and the forward reference number “0” is obtained (S41).
The forward reference number “0” is passed and the expansion routine is called recursively (S42). Since this call is made from nesting level 1, it is referred to as nesting level 2.
In the nesting level 2 expansion routine, the reference number is 0 (S321), so the routine returns without doing anything.
Back at nesting level 1, the entry for reference number “1” in the compression dictionary is consulted, and the suffix “a” is obtained (S43).
The suffix “a” is input to the DFA, and the state of the DFA becomes “2” (S44). Since this is not an accepting state, nothing is output (S45).
The transition destination state “2” and the accepting position “0” are stored in the field of the state history for reference number “1” and first state “2” (S48), and the nesting level 1 expansion routine ends.

Returning to the main loop, the next compressed block (suffix “b”) is acquired (S33).
The suffix “b” is input to the DFA, and the state of the DFA becomes “3” (S34). Since this is not an accepting state, nothing is output (S35), and the current position becomes “3” (S36).
The reference number “1” and the suffix “b” are registered in the compression dictionary. The registration number is “2”.
In the state history, the transition destination state “3” and the accepting position “0” are stored in the field for reference number “2” and first state “2”.

The next compressed block (reference number “2”) is acquired (S18, S31). The first state is “3”.
The reference number “2” is passed and the expansion routine (nesting level 1) is called (S32). Since the reference number is not 0 (S321), the state history is checked (S322).
Since the field of the state history for reference number “2” and first state “3” is empty (S322), the entry for reference number “2” in the compression dictionary is consulted, and the forward reference number “1” is obtained (S41).
The forward reference number “1” is passed and the expansion routine (nesting level 2) is called recursively (S42).

In the nesting level 2 expansion routine, the reference number is not 0 (S321), so the state history is checked (S322).
Since the field of the state history for reference number “1” and first state “3” is empty (S322), the entry for reference number “1” in the compression dictionary is consulted, and the forward reference number “0” is obtained (S41).
The forward reference number “0” is passed and the expansion routine (nesting level 3) is called recursively (S42).
In the nesting level 3 expansion routine, the reference number is 0 (S321), so the routine returns without doing anything.

Back at nesting level 2, the entry for reference number “1” in the compression dictionary is consulted, and the suffix “a” is obtained (S43).
The suffix “a” is input to the DFA, and the state of the DFA becomes “2” (S44). Since this is not an accepting state, nothing is output (S45).
The transition destination state “2” and the accepting position “0” are stored in the field of the state history for reference number “1” and first state “3” (S48), and the nesting level 2 expansion routine ends.

Returning to nesting level 1, the reference number “2” field of the compression dictionary is consulted and the suffix character “b” is obtained (S43).
The suffix character “b” is input to the DFA, and the DFA state becomes “3” (S44). Since this is not an accepting state, nothing is output (S45).
Transition destination state “3” and acceptance position “0” are stored in the state history field for reference number “2”, first state “3” (S48), and the nesting level 1 expansion routine ends.

Returning to the main loop, the next compressed block (suffix character “c”) is acquired (S33).
The suffix character “c” is input to the DFA, and the DFA state becomes “4” (S34). Since this is an accepting state, hit position “6” is output (S35), and the current position becomes “6” (S36).
Reference number “2” and suffix character “c” are registered in the compression dictionary. The registration number is “3”.
In the state history, transition destination state “4” and acceptance position “3” are stored in the field for reference number “3”, first state “3”.

The same processing is then repeated. Once the next six compressed blocks (reference number “0”, suffix character “d”; reference number “0”, suffix character “e”; reference number “0”, suffix character “c”) have been processed, the compression dictionary and the state history have the contents shown in FIG. 17. The current position stored in the current position counter 115 is “9”, and the current DFA state stored in the state storage unit 111 is “3”.

The next compressed block (reference number “3”) is acquired (S31). The first state is “3”.
Reference number “3” is passed and the expansion routine is called (nesting level 1) (S32). Since the reference number is not 0 (S321), the state history is checked (S322).
Since a state history is stored in the field for reference number “3”, first state “3” (S322), transition destination state “4” is obtained from the state history (S323).
The current DFA state stored in the state storage unit 111 is updated to “4” (S324).
Acceptance position “3” is obtained from the state history (S325), hit position “12” is output (S326), and the nesting level 1 expansion routine ends.

Returning to the main loop, the next compressed block (suffix character “d”) is acquired (S33).
The suffix character “d” is input to the DFA, and the DFA state becomes “5” (S34). Since this is not an accepting state, nothing is output (S35), and the current position becomes “13” (S36).
Reference number “3” and suffix character “d” are registered in the compression dictionary. The registration number is “7”.
In the state history, transition destination state “5” and acceptance position “3” are stored in the field for reference number “7”, first state “3”.

This is repeated until no compressed blocks remain (S18).

In this way, when searching text compressed by the LZ78 method, the compression dictionary is built up while the compressed text is read, and the DFA states at that time are stored along with it. The state transitions obtained earlier in the search can therefore be reused, and the same state transitions need not be repeated. This has the effect of making the search fast.

Furthermore, in the LZ78 method, each partial character string newly registered in the dictionary is a previously registered partial character string with one character appended. Executing the search for one reference number therefore also executes the search for the prefix strings it contains. If these are stored in the state history through the recursive calls, a state history is more often already stored, and the same state transitions need not be repeated. This has the effect of making the search fast.
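The recursive expansion with a state history can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the DFA is a hypothetical one that accepts whenever the text read so far ends with the assumed pattern "abc" (it is not the DFA of the figures), the names `step`, `search_lz78`, and `expand` are invented here, acceptance positions are kept as a list of offsets, and the step labels in the comments indicate only an approximate correspondence.

```python
PATTERN = "abc"            # assumed search pattern, not the DFA of the figures
ACCEPT = len(PATTERN)      # accepting state of the illustrative DFA


def step(state, ch):
    """Illustrative DFA: the state is the length of the longest pattern prefix matched."""
    s = PATTERN[:state] + ch
    for k in range(min(len(s), len(PATTERN)), 0, -1):
        if s.endswith(PATTERN[:k]):
            return k
    return 0


def search_lz78(blocks):
    """Return hit positions (number of characters consumed when the DFA accepts)."""
    dictionary = {}   # reference number -> (forward reference number, suffix character)
    length = {0: 0}   # reference number -> length of its partial character string
    history = {}      # (reference number, first state) -> (destination state, acceptance offsets)
    state, pos, hits = 0, 0, []

    def expand(ref):
        nonlocal state
        if ref == 0:                        # S321: nothing to expand
            return []
        key = (ref, state)
        if key in history:                  # S322: a history exists for this first state
            state, offsets = history[key]   # S323 to S325: reuse it, no DFA input needed
            return offsets
        fwd, ch = dictionary[ref]           # S41 (and the suffix character of S43)
        offsets = list(expand(fwd))         # S42: recursive call on the prefix string
        state = step(state, ch)             # S44: input the suffix character to the DFA
        if state == ACCEPT:
            offsets.append(length[ref])
        history[key] = (state, offsets)     # S48: remember the result
        return offsets

    for ref, ch in blocks:
        first = state
        offsets = list(expand(ref))
        state = step(state, ch)             # suffix character of the block itself
        new_ref = len(dictionary) + 1
        dictionary[new_ref] = (ref, ch)     # register the new partial character string
        length[new_ref] = length[ref] + 1
        if state == ACCEPT:
            offsets.append(length[new_ref])
        hits.extend(pos + o for o in offsets)
        history[(new_ref, first)] = (state, offsets)
        pos += length[new_ref]
    return hits


# Assumed LZ78 parse of "aababcdecabcdebabcdecc" as (reference number, suffix character) pairs.
BLOCKS = [(0, "a"), (1, "b"), (2, "c"), (0, "d"), (0, "e"), (0, "c"),
          (3, "d"), (5, "b"), (7, "e"), (6, "c")]
print(search_lz78(BLOCKS))
```

With the assumed pattern, the later blocks that refer to reference numbers 5, 7, and 6 are resolved from the stored history without feeding their reference strings to the DFA again.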

In this embodiment, the effect described above is obtained by using recursive calls. However, recursive calls are not strictly required for searching LZ78 compressed text, and the method described in Embodiment 1 and elsewhere may be used instead.
The compression dictionary stored in the compression dictionary storage unit 113 may also be structured to store reference numbers and partial character strings, as described in Embodiment 1 and elsewhere.
The state history stored in the state transition storage unit 112 may likewise have a structure such as that described in Embodiment 1 and elsewhere.

Embodiment 7.
Embodiment 7 will be described with reference to FIGS. 3, 6, 8, 19, and 20.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes another case of searching text compressed by the LZ78 method.

FIG. 19 shows the structure of compressed text in the LZ78 format. Unlike FIG. 4, the LZ78 format has no fixed compression dictionary 303; LZ78 compressed text consists only of the compressed block sequence 300. Each compressed block 1202 holds the reference number of the compression dictionary 1203 entry whose character string is the longest match, together with the next mismatched character. The compression dictionary 1203 consists of character strings 1205 and reference numbers 1204 used to refer to them, and entries are added as the compressed text is decompressed. When the compressed block sequence in the example of FIG. 19 is decompressed, compressed block 1206 has reference number 9 and first mismatched character “a”, so it is replaced by the character string “abcdea”, formed by appending the mismatched character to the character string of the ninth entry of the compression dictionary 1203. This character string “abcdea” is then added as a new entry at the end of the compression dictionary 1203. Hereinafter, the compression dictionary character string that a compressed block refers to is called the reference character string.
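The decompression rule just described can be sketched as follows. This is a minimal sketch that assumes each compressed block is given as a (reference number, mismatched character) pair and that reference number 0 means "no reference character string"; the function name `lz78_decode` and the block list are illustrative and are not taken from FIG. 19.

```python
def lz78_decode(blocks):
    """Decode LZ78 blocks given as (reference number, mismatched character) pairs."""
    dictionary = {0: ""}      # reference number 0: empty reference character string
    pieces = []
    for ref, ch in blocks:
        phrase = dictionary[ref] + ch            # reference string + mismatched character
        pieces.append(phrase)
        dictionary[len(dictionary)] = phrase     # append as the next dictionary entry
    return "".join(pieces), dictionary


# Assumed example block sequence.
blocks = [(0, "a"), (1, "b"), (2, "c"), (0, "d"), (0, "e"), (0, "c"),
          (3, "d"), (5, "b"), (7, "e"), (6, "c")]
text, dictionary = lz78_decode(blocks)
print(text)
print(dictionary[9])
```

Because every new entry is an existing entry plus one character, the dictionary never has to be transmitted; it is rebuilt from the block sequence alone.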

Various methods using tree structures, hashes, and the like have been proposed to speed up lookups in an LZ78 compression dictionary, but the compressed text search apparatus of this embodiment does not depend on which implementation is used.

The state transition storage unit of this embodiment is the same as that shown in FIG. 6.

FIG. 20 is a flowchart of the search processing in the compressed text search apparatus of this embodiment. As the initial condition, the state transition table storage unit 106 is assumed to already store the state transition table that the state transition table generation unit 105 generated from the search condition input by the search condition input unit 102. The initial state is assumed to be set in the state storage unit 111. The compression dictionary storage unit 113, and the state transition histories and acceptance positions in the state transition storage unit 112, are assumed to be empty. A counter for the original text length is initialized to 0.

First, in step S1101, the compressed block acquisition unit 108 acquires one compressed block, in order from the head of the compressed block sequence. In step S1102, it is determined whether the current state in the state storage unit 111 matches the first state of the state transition history that the compressed block refers to. If they match (YES), in step S1103 the next state is obtained from the last state of that state transition history and the mismatched character of the compressed block, and is set in the state storage unit 111. Next, in step S1104, it is determined whether the state transition history has an acceptance position. If it does (YES), the hit position is calculated and output in step S1105, and processing proceeds to step S1106. Here, hit position = original text length + acceptance position. If the current state is an accepting state, original text length + reference character string length is also output as a hit position. If there is no acceptance position in step S1104 (NO), processing proceeds directly to step S1106. In step S1106, the compression dictionary and the state transition history are updated.

For the compression dictionary, the reference character string with the mismatched character appended is added as a new entry. For the state transition history, the state transition history referred to by the compression dictionary, with the current state appended, is added as a new entry.

In step S1107, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S1101. If the end has been reached (YES), the search processing ends. Although not shown explicitly in FIG. 20, the reference character string length is added to the original text length at this point.

If, in step S1102, the current state stored in the state storage unit 111 does not match the first state of the state transition history that the compressed block refers to (NO), processing proceeds to step S1108, where the state transition history is obtained for the character string formed by appending the mismatched character to the reference character string. Once the state transition history has been obtained in step S1108, processing proceeds to step S1107. When the reference number of the compressed block is 0, that is, when there is no reference character string in the compression dictionary, processing also proceeds from step S1102 to step S1108.
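The flow of steps S1101 to S1108 can be sketched roughly as follows. This is a hedged illustration rather than the processing of FIG. 20 or FIG. 8 itself: the DFA is a hypothetical one for the assumed pattern "abc", one state transition history is kept per dictionary entry, the result of step S1108 is assumed to be stored as the history of the new entry, and the names `step` and `search_lz78_history` are invented for this sketch.

```python
PATTERN = "abc"            # assumed search pattern
ACCEPT = len(PATTERN)      # accepting state of the illustrative DFA


def step(state, ch):
    """Illustrative DFA: the state is the length of the longest pattern prefix matched."""
    s = PATTERN[:state] + ch
    for k in range(min(len(s), len(PATTERN)), 0, -1):
        if s.endswith(PATTERN[:k]):
            return k
    return 0


def search_lz78_history(blocks):
    phrases = {0: ""}     # compression dictionary: reference number -> character string
    histories = {}        # reference number -> (list of states, acceptance offsets)
    state, text_len, hits = 0, 0, []
    for ref, ch in blocks:                                  # S1101
        phrase = phrases[ref] + ch
        hist = histories.get(ref)                           # None when ref == 0
        if hist is not None and hist[0][0] == state:        # S1102: first state matches
            states, offsets = hist
            nxt = step(states[-1], ch)                      # S1103: one transition only
            new_states = states + [nxt]
            new_offsets = list(offsets)
            if nxt == ACCEPT:
                new_offsets.append(len(phrase))
        else:                                               # S1108: feed every character
            new_states, new_offsets = [state], []
            for i, c in enumerate(phrase, 1):
                new_states.append(step(new_states[-1], c))
                if new_states[-1] == ACCEPT:
                    new_offsets.append(i)
        hits.extend(text_len + o for o in new_offsets)      # S1104, S1105
        state = new_states[-1]
        new_ref = len(phrases)
        phrases[new_ref] = phrase                           # S1106: update the dictionary
        histories[new_ref] = (new_states, new_offsets)      # and the state transition history
        text_len += len(phrase)                             # original text length so far
    return hits


# Assumed LZ78 parse of "aababcdecabcdebabcdecc".
BLOCKS = [(0, "a"), (1, "b"), (2, "c"), (0, "d"), (0, "e"), (0, "c"),
          (3, "d"), (5, "b"), (7, "e"), (6, "c")]
print(search_lz78_history(BLOCKS))
```

In this sketch, a block whose referenced history begins with the current state costs a single DFA transition regardless of how long its reference character string is.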

FIG. 8 shows an example of the processing flow of step S1108. FIG. 8 was described in Embodiment 1, so its description is omitted here.

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the referenced position of the state transition history. If the states match, state transition processing that would otherwise take as many transitions as the length of the reference character string can be handled in at most two state transitions, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. Compressed text in the LZ78 format can therefore be searched at high speed with a search condition that includes a regular expression.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores a compression dictionary in the LZ78 format.
A compressed block in the LZ78 format is read and, if the current state matches the first state of the state transition history that the compressed block refers to, the state is advanced to that for the last character of the reference character string in a single state transition. The state is then advanced by the mismatched character.
The character string consisting of the reference character string and the mismatched character is added as a new entry in the compression dictionary, and the above state transition is added as a new entry in the state transition storage unit.

Embodiment 8.
Embodiment 8 will be described with reference to FIGS. 21 to 23.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes the case of searching text compressed by the LZW method.

FIG. 21 shows an example of the contents stored by the compressed text storage unit 103 and the compression dictionary storage unit 113 (an example of a dictionary storage unit) in this embodiment.

The compressed text storage unit 103 stores a compressed block sequence 300. The compressed block sequence 300 is obtained by replacing the original character string 500, “aababcdecabcdebabcdecc”, according to the rule of FIG. 50.

The compression dictionary storage unit 113 (an example of a dictionary storage unit) stores a compression dictionary (an example of a replacement dictionary). All single-character partial character strings made up of characters that may appear are registered in the compression dictionary at the start. This example assumes that only the five characters “a”, “b”, “c”, “d”, and “e” appear. The compression dictionary therefore initially has the five partial character strings “a”, “b”, “c”, “d”, and “e” registered under reference numbers 1 to 5.
Reference numbers 6 and above are registered as the search proceeds.

In addition to the “reference number”, “forward reference number”, and “suffix character” described in Embodiment 6, the compression dictionary storage unit 113 stores a “prefix character” in the compression dictionary. The prefix character is the first character of the partial character string. It is stored so that the suffix character can be obtained immediately when the dictionary is updated, and it may be omitted.

FIG. 22 shows an example of the contents stored by the state transition storage unit 112 (an example of a history storage unit) in this embodiment.

The state transition storage unit 112 stores the state history. At the start of the search the state history is empty, and entries are registered as the search proceeds.
As in Embodiment 6, the state transition storage unit 112 stores, as the state history, a “state transition history” and an “acceptance position” in a form that can be looked up by “reference number” and “first state”.
The state transition history records not only the first state and the last state (transition destination state) but also every intermediate state. However, as described in Embodiment 6, the transition destination state alone may be stored instead.
Even in that case, it is preferable to also store the state following the first state (the second state). The second state is needed when the state history is updated, and storing it allows it to be obtained without feeding input to the DFA.

Here, the state transition history and the acceptance position are stored in list form, so when the condition determination unit 114 determines whether a state history beginning with a first state equal to the current state is stored, it must search within the list. This takes more processing time than storing in the table form described in Embodiment 6, but when the DFA has many states it saves the storage area that the state transition storage unit 112 needs for the state history. Which form to use can therefore be decided in light of the CPU's processing capacity, the hard disk's storage capacity, and so on.
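The trade-off between the two storage forms can be illustrated with a small sketch. The class names `TableHistory` and `ListHistory`, and the shape of the stored record, are assumptions made only for this illustration; the actual layouts are those of the figures.

```python
class TableHistory:
    """Table form: one slot per (reference number, first state); lookup is direct."""

    def __init__(self, num_states):
        self.num_states = num_states
        self.rows = {}                       # reference number -> list indexed by first state

    def put(self, ref, first_state, record):
        row = self.rows.setdefault(ref, [None] * self.num_states)
        row[first_state] = record

    def get(self, ref, first_state):
        row = self.rows.get(ref)
        return None if row is None else row[first_state]


class ListHistory:
    """List form: only the histories actually seen are kept; lookup scans the list."""

    def __init__(self):
        self.rows = {}                       # reference number -> [(first state, record), ...]

    def put(self, ref, first_state, record):
        self.rows.setdefault(ref, []).append((first_state, record))

    def get(self, ref, first_state):
        for state, record in self.rows.get(ref, []):
            if state == first_state:
                return record
        return None


table, listed = TableHistory(num_states=6), ListHistory()
for store in (table, listed):
    store.put(3, 3, (4, [3]))                # destination state 4, acceptance position 3
    print(store.get(3, 3), store.get(3, 0))
```

The table form reserves a slot for every DFA state per reference number, so its size grows with the number of states, whereas the list form grows only with the histories actually recorded.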

FIG. 23 is a flowchart showing an example of the control flow of the search processing of the compressed text search apparatus 100 in this embodiment.

The LZW method differs from the LZ78 method in its encoding rule, so the corresponding parts of the control flow differ as well. Only the differing parts are described below.

In the initialization processing (S10), the compression dictionary storage unit 113 registers all single-character partial character strings made up of characters that may appear. The reference number may, for example, be the same number as the character code.

In S18, if no compressed blocks remain, the search processing ends.

In S51, the next compressed block (code 971 indicating a reference number) is acquired. This is because the next character must be known in order to update the compression dictionary stored in the compression dictionary storage unit 113.
In S52, the compression dictionary stored in the compression dictionary storage unit 113 is consulted to obtain the first character (prefix character) of the next compressed block. Because the prefix character is stored in the compression dictionary, the dictionary can be updated without decompressing the next compressed block.
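How a stored prefix character allows the dictionary to be updated without decompressing the next block can be sketched as follows. The `Entry` fields mirror the “forward reference number”, “suffix character”, and “prefix character” columns named in this embodiment, but the class, the function names, and the handling of a next block that refers to the entry currently being created (a known property of LZW that is not discussed in this text) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    forward_ref: int   # reference number of the string without its last character (0 = none)
    suffix: str        # last character of the partial character string
    prefix: str        # first character of the partial character string


def initial_dictionary(alphabet):
    """Register every single-character string under reference numbers 1, 2, ..."""
    return {i: Entry(0, ch, ch) for i, ch in enumerate(alphabet, 1)}


def register(dictionary, cur_ref, next_ref):
    """Add (string of cur_ref) + (first character of next_ref's string) as a new entry."""
    cur = dictionary[cur_ref]
    if next_ref in dictionary:
        first = dictionary[next_ref].prefix   # S52: read the prefix character directly
    else:
        first = cur.prefix                    # next block refers to the entry being created
    new_ref = len(dictionary) + 1
    dictionary[new_ref] = Entry(cur_ref, first, cur.prefix)
    return new_ref


d = initial_dictionary("abcde")               # reference numbers 1 to 5
r6 = register(d, 1, 2)                        # "a" followed by a block starting with "b"
r7 = register(d, r6, 7)                       # next block is the entry being created
print(r6, d[r6])
print(r7, d[r7])
```

Each registration reads only two existing entries, so no partial character string has to be rebuilt character by character.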

In S38, the state history is stored for the reference number newly registered in the dictionary. At this point, the state after one more character has been input to the DFA must be stored as the transition destination state.
The state transition storage unit 112 therefore obtains the second state from the state transition history stored for the next reference number.
Alternatively, the second state may be obtained by referring to the state transition table stored in the state transition table storage unit 106.
Alternatively, the state history may be stored after one more character has been input to the DFA.

The processing of the expansion routine is the same as that described with reference to FIG. 18 in Embodiment 6, so its description is omitted here.

In this way, when searching text compressed by the LZW method, the compression dictionary is built up while the compressed text is read, and the DFA states at that time are stored along with it. The state transitions obtained earlier in the search can therefore be reused, and the same state transitions need not be repeated. This has the effect of making the search fast.

Furthermore, in the LZW method, each partial character string newly registered in the dictionary is a previously registered partial character string with one character appended. Executing the search for one reference number therefore also executes the search for the prefix strings it contains. If these are stored in the state history through the recursive calls, a state history is more often already stored, and the same state transitions need not be repeated. This has the effect of making the search fast.

There are various other compression methods of the embedded-dictionary reference type. For example, the code sequence of the LZ78 or LZW method described above may be further replaced by Huffman coding (static or dynamic) to shorten the overall bit length.
The embodiments described here can also be applied to text compressed by such methods.

Embodiment 9.
Embodiment 9 will be described with reference to FIGS. 20 and 24.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes another case of searching text compressed by the LZW method.

FIG. 24 shows the structure of compressed text in the LZW format. The LZW format is derived from the LZ78 format and, like LZ78, its compressed text consists only of the compressed block sequence 300. The LZW format is characterized by having entries for the single-byte characters in the compression dictionary in advance (implicitly). Each compressed block 1302 holds only the reference number of the compression dictionary entry whose character string is the longest match. Apart from having these single-byte character entries, the structure of the compression dictionary is the same as that described with reference to FIG. 19 in Embodiment 7.

Various methods using tree structures, hashes, and the like have been proposed to speed up lookups in an LZW compression dictionary, but the compressed text search apparatus of this embodiment does not depend on which implementation is used.

The flow of search processing in the compressed text search apparatus of this embodiment is almost the same as that shown in FIG. 20 for Embodiment 7, so it is described with reference to FIG. 20. As the initial condition, the state transition table generation unit 105 is assumed to have generated the state transition table from the search condition input by the search condition input unit 102, and the state transition table storage unit 106 is assumed to store it. The initial state is assumed to be set in the state storage unit 111. The compression dictionary storage unit 113, and the state transition histories and acceptance positions in the state transition storage unit 112, are assumed to be empty. A counter for the original text length is initialized to 0.

When the compression dictionary storage unit 113 receives a reference to an entry number from 0 to 255, it returns the single-byte character whose character code equals that entry number. The single-byte characters therefore need not be registered in the compression dictionary in advance.

The differences from FIG. 20 of Embodiment 7 are in steps S1102 and S1106. First, in step S1101, the compressed block acquisition unit 108 acquires one compressed block, in order from the head of the compressed block sequence. In step S1102, it is determined whether the current state in the state storage unit 111 matches the first state of the state transition history that the compressed block refers to. If they match (YES), in step S1103 the last state of that state transition history is set in the state storage unit 111. Next, in step S1104, it is determined whether the state transition history has an acceptance position. If it does (YES), the hit position is calculated and output in step S1105, and processing proceeds to step S1106. Here, hit position = current original text length + acceptance position. If there is no acceptance position in step S1104 (NO), processing proceeds directly to step S1106.

In step S1106, the reference character string of the compressed block, with the first character of the next compressed block's reference character string appended, is added to the compression dictionary as a new entry. For the state transition history, the state transition history that the compressed block refers to, with the current state appended, is added as a new entry. The next state obtained from the first character of the compressed block's reference character string is then also appended to that entry. If there is no next compressed block, nothing is done in step S1106.
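The dictionary handling described for this embodiment (implicit entries for entry numbers 0 to 255, and a new entry formed from the current reference character string plus the first character of the next block's reference character string) can be sketched as a plain LZW decoder. This is an illustration under assumptions: the function names are invented, state histories are left out, and the case where the next block refers to the entry currently being created is a standard LZW property that this text does not discuss.

```python
def lzw_decode(codes):
    """Decode a list of LZW entry numbers; entries 0 to 255 are implicit single bytes."""
    table = {}                                   # only entry numbers >= 256 are stored

    def lookup(code):
        # An implicit entry returns the character whose code equals the entry number.
        return chr(code) if code < 256 else table[code]

    pieces = []
    next_code = 256
    for i, code in enumerate(codes):
        current = lookup(code)
        pieces.append(current)
        if i + 1 == len(codes):
            break                                # no next block: nothing to register
        following = codes[i + 1]
        if following == next_code:
            first = current[0]                   # next block is the entry being created
        else:
            first = lookup(following)[0]         # first character of the next block
        table[next_code] = current + first       # corresponds to the update in S1106
        next_code += 1
    return "".join(pieces)


# Assumed example: the LZW code sequence for "abababa" with byte-valued initial entries.
print(lzw_decode([97, 98, 256, 258]))
```

Because the single-byte entries are computed on demand, the stored table holds only the entries created during decoding.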

In step S1107, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S1101. If the end has been reached (YES), the search processing ends. Although not shown explicitly in FIG. 20, the compression dictionary character string length is added to the pre-compression text length at this point. This makes it possible to know how many characters of the pre-compression text have been searched so far.

If, in step S1102, the current state in the state storage unit 111 does not match the first state of the state transition history that the compressed block refers to (NO), processing proceeds to step S1108, where the state transition history is obtained for the reference character string of the compressed block. Once the state transition history has been obtained in step S1108, processing proceeds to step S1107.

以上のように、この実施の形態によれば、圧縮ブロック列を１個取得した後、現在の状態と、状態遷移履歴の参照位置の状態とを比較し、状態が一致した場合は、本来参照文字列長分の回数要する状態遷移の処理を、２回の状態遷移で処理することができ、処理ステップを削減することができる。
また、正規表現とテキストとの照合処理自体は、その正規表現を受理する、状態遷移機械を利用する。これにより、正規表現を含んだ検索条件によって、ＬＺＷ形式で圧縮された圧縮テキストを高速に検索することができる。 As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the reference position of the state transition history. If the states match, state transition processing that would otherwise require as many transitions as the length of the reference character string can be completed in two state transitions, which reduces the number of processing steps.
In addition, the matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. As a result, compressed text in the LZW format can be searched at high speed with a search condition that includes a regular expression.

この実施の形態と同様にして、ＬＺ７８形式から派生した圧縮形式によって圧縮された圧縮テキストをこの実施の形態の圧縮テキスト検索装置によって高速に検索することができる。 Similarly to this embodiment, the compressed text compressed by the compression format derived from the LZ78 format can be searched at high speed by the compressed text search device of this embodiment.

ここで説明した圧縮テキスト検索装置は、以下の特徴を持つ。
圧縮辞書記憶部にＬＺＷ形式の圧縮辞書を記憶する。
ＬＺＷ形式の圧縮ブロックを読み込み、現在の状態が圧縮ブロックが参照する状態遷移の履歴の先頭の状態と一致した場合に、参照文字列の末尾の文字まで１回の状態遷移で状態を遷移させる。次の圧縮ブロックの参照文字列の先頭の文字によって状態を遷移させる。
参照文字列と次の圧縮ブロックの参照文字列の先頭の文字からなる文字列を圧縮辞書の新たなエントリとして追加し、上記の状態遷移を状態遷移記憶部の新たなエントリとして追加する。 The compressed text search apparatus described here has the following features.
A compression dictionary in the LZW format is stored in the compression dictionary storage unit.
When a compressed block in LZW format is read and the current state matches the beginning state of the history of state transitions referred to by the compressed block, the state is changed by one state transition to the last character of the reference character string. The state is changed by the first character of the reference character string of the next compressed block.
A character string composed of the reference character string and the first character of the reference character string of the next compressed block is added as a new entry in the compression dictionary, and the above state transition is added as a new entry in the state transition storage unit.

実施の形態１０．
実施の形態１０を図７、図２５〜図２７を用いて説明する。 Embodiment 10.
The tenth embodiment will be described with reference to FIGS. 7 and 25 to 27.

いままでの実施の形態においては、検索装置においてＤＦＡに入力する文字のビット長と、圧縮技術において置換する部分文字列を構成する文字のビット長が一致するものと仮定していた。 In the embodiments so far, it has been assumed that the bit length of the character input to the DFA in the search device matches the bit length of the character constituting the partial character string to be replaced in the compression technique.

しかし、使用するコードによっては、文字を表現するビット列のビット長が異なる場合がある。
例えば、ＡＳＣＩＩコードを用いる場合、文字を表現するビット列のビット長は８ビット（１バイト）である。
これに対し、シフトＪＩＳコードを用いる場合、文字を表現するビット列のビット長は１６ビット（２バイト）である。
更にいえば、シフトＪＩＳコードはＡＳＣＩＩコードと混在させることができるので、同じ文字列の中に８ビットのビット列によって表される文字と１６ビットのビット列によって表される文字とが混在する場合もある。 However, depending on the code used, the bit length of the bit string representing the character may be different.
For example, when the ASCII code is used, the bit length of the bit string representing the character is 8 bits (1 byte).
In contrast, when the Shift JIS code is used, the bit length of the bit string representing a character is 16 bits (2 bytes).
Furthermore, since the Shift JIS code can be mixed with the ASCII code, characters represented by 8-bit bit strings and characters represented by 16-bit bit strings may be mixed in the same character string.

しかし、圧縮技術においては、それぞれの文字を表すビット列の長さが何ビットであるかは重要な問題ではない。
結果として得られる符号列全体のビット長が、元の文字列全体のビット長より短くなっていればよいのであって、その文字列が何を意味しているかを理解する必要はないからである。
そこで、圧縮技術においては通常、すべての文字列は８ビットのビット長を持つ文字から構成されているものとして扱っている。 However, in compression techniques, the number of bits in the bit string representing each character is not an important issue.
This is because it suffices that the bit length of the resulting code string as a whole is shorter than the bit length of the original character string as a whole; there is no need to understand what the character string means.
Therefore, in the compression technique, all character strings are normally handled as being composed of characters having a bit length of 8 bits.

これに対して、検索装置においては、文字を表現したビット列のビット長よりも、文字数のほうが重要である。
例えば、ある文字列を画面に表示する場合、使用者はその文字列を構成する文字が、コンピュータ内部で何ビットのビット列によって表現されているかを意識する必要はない。
したがって、検索装置は通常、検索条件に合致する検索文字列が「何文字目にあった」と画面に表示する。
また、正規表現で指定する検索条件においては、「任意の１文字」といった指定の仕方が可能である。この場合、その１文字が、コンピュータ内部において何ビットのビット列で表現されているかは無関係である。
したがって、検索装置においてＤＦＡに入力する文字は、必ずしも８ビットのビット長を持つビット列であるとは限らない。 On the other hand, in the search device, the number of characters is more important than the bit length of the bit string representing the characters.
For example, when a character string is displayed on the screen, the user does not need to be aware of how many bits are used inside the computer to represent each character of the string.
Therefore, the search device usually displays on the screen the character position at which the string matching the search condition was found.
In addition, in the search condition designated by a regular expression, a designation method such as “any one character” is possible. In this case, it is irrelevant how many bits the character is represented in the computer.
Therefore, the character input to the DFA in the search device is not necessarily a bit string having a bit length of 8 bits.

図２５は、圧縮技術において取り扱う文字を表現したビット列のビット長と、検索装置において取り扱う文字を表現したビット列のビット長とが異なっている場合について説明するための説明図である。 FIG. 25 is an explanatory diagram for explaining a case where the bit length of the bit string expressing the character handled in the compression technique is different from the bit length of the bit string expressing the character handled in the search device.

元の文字列５００は、シフトＪＩＳコードを用いる場合、コンピュータの内部では、内部表現５５０のビット列で表現されている。なお、図２５ではわかりやすいよう、内部表現のビット列を８ビットごとに区切って１６進数で表している。 When the shift JIS code is used, the original character string 500 is represented by a bit string having an internal representation 550 inside the computer. In FIG. 25, for easy understanding, the bit string of the internal representation is divided into 8 bits and expressed in hexadecimal.

符号列６００は、これを圧縮技術によって圧縮したものである。上述したように、圧縮技術においては、文字を表現するビット列のビット長にかかわらず、８ビットを１文字として置換を行う。したがって、符号に置換された部分文字列が、元の文字列の文字の区切りとは違うところで区切られる場合がある。 The code string 600 is obtained by compressing the original character string 500 using a compression technique. As described above, the compression technique performs replacement treating 8 bits as one character, regardless of the bit length of the bit string representing a character. Therefore, a partial character string replaced with a code may be delimited at a position that differs from the character boundaries of the original character string.

これを検索装置が検索する場合、例えば、最初の符号を復元してＤＦＡに入力しようとすると、「あ」「い」までは入力できるが、最後の８ビットが余ってしまい、ＤＦＡに入力することができない。 When the search device searches this, for example, if it restores the first code and tries to input the result to the DFA, it can input “あ” and “い”, but the last 8 bits are left over and cannot be input to the DFA.
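
The leftover byte can be reproduced with a few lines of Python. This is a minimal sketch, not part of the patent: it assumes Python's standard `shift_jis` codec and the usual Shift JIS lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC), and the helper names `is_lead_byte` and `split_complete` are hypothetical.

```python
def is_lead_byte(b: int) -> bool:
    """True if b is the first byte of a two-byte Shift JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_complete(chunk: bytes):
    """Split a chunk into (complete characters, leftover bytes).

    Assumes the chunk starts on a character boundary.
    """
    i = 0
    while i < len(chunk):
        step = 2 if is_lead_byte(chunk[i]) else 1
        if i + step > len(chunk):
            break  # a lead byte whose second byte is in the next chunk
        i += step
    return chunk[:i].decode("shift_jis"), chunk[i:]


text = "あいう".encode("shift_jis")  # bytes 82 A0 82 A2 82 A4
first_code = text[:5]                 # a byte-oriented compressor may cut here
chars, rest = split_complete(first_code)
```

Here `chars` holds the two characters that can be fed to a character-based DFA, and `rest` holds the single byte that has to wait for the next code.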

そこで、この実施の形態は、未完文字復元部１２１とバイトデータ記憶部１２２（未完文字記憶部の一例）を設けることにより、この課題を解決するものである。 Therefore, this embodiment solves this problem by providing an incomplete character restoration unit 121 and a byte data storage unit 122 (an example of an incomplete character storage unit).

この実施の形態における圧縮テキスト検索装置１００（文字列検索装置の一例）の外観、ハードウェア構成は、実施の形態１で説明したものと同一なので、ここでは説明を省略する。 Since the appearance and hardware configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in the first embodiment, description thereof is omitted here.

図２６は、この実施の形態における圧縮テキスト検索装置１００のブロック構成の一例を示すブロック図である。
未完文字復元部１２１は、圧縮ブロック（符号の一例）から復元した部分文字列に、ＤＦＡに入力できない文字（未完文字）が含まれているかを判断し、未完文字がある場合には、バイトデータ記憶部１２２に記憶させる。
他の部分は、実施の形態１において図３を用いて説明したものと同一なので、ここでは説明を省略する。 FIG. 26 is a block diagram showing an example of a block configuration of the compressed text search apparatus 100 in this embodiment.
The incomplete character restoration unit 121 determines whether the partial character string restored from a compressed block (an example of a code) contains a character that cannot be input to the DFA (an incomplete character) and, if so, stores it in the byte data storage unit 122.
The other parts are the same as those described in Embodiment 1 with reference to FIG. 3, so their description is omitted here.

未完文字は、次の部分文字列の先頭の文字（他の未完文字）と結合することによって、ＤＦＡに入力できる文字となる。 An incomplete character becomes a character that can be input to the DFA when it is combined with the first character (another incomplete character) of the next partial character string.

図２７は、この実施の形態において、圧縮テキスト記憶部１０３、圧縮辞書記憶部１１３（辞書記憶部の一例）、状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図である。 FIG. 27 is a diagram showing an example of the contents stored in the compressed text storage unit 103, the compression dictionary storage unit 113 (an example of a dictionary storage unit), and the state transition storage unit 112 (an example of a history storage unit) in this embodiment.

圧縮テキスト記憶部１０３は、圧縮ブロック列３００を記憶している。圧縮ブロック列３００は、元の文字列５００を辞書参照型圧縮方式により置換して圧縮してしたものである。 The compressed text storage unit 103 stores a compressed block sequence 300. The compressed block string 300 is obtained by replacing the original character string 500 with a dictionary reference compression method and compressing it.

状態遷移記憶部１１２は、状態履歴を記憶する。検索開始時には、状態履歴は空である。検索が進むにつれて、埋まっていく。
状態遷移記憶部１１２は状態履歴として、状態遷移履歴、受理位置、未完文字、末尾の未完文字を記憶する。
「状態遷移履歴」は、その部分文字列をＤＦＡに入力する前（前の部分文字列の末尾に未完文字がある場合も含む）の状態（最初の状態）から、その部分文字列をＤＦＡに入力した後（その部分文字列の末尾に未完文字がある場合には、未完文字の手前まで入力した後）の状態（遷移先状態）までの、ＤＦＡの状態遷移の履歴である。なお、途中経過は記憶せず、最初の状態と遷移先状態だけを記憶してもよい。
「未完文字」は、その部分文字列を展開する前に、バイトデータ記憶部１２２が記憶していた未完文字である。これを、部分文字列の先頭にある未完文字と結合することによって、ＤＦＡに入力できる文字となる。
「末尾の未完文字」は、その部分文字列の最後に未完文字がある場合の未完文字を示す。 The state transition storage unit 112 stores a state history. At the start of the search, the state history is empty. It fills in as the search progresses.
The state transition storage unit 112 stores a state transition history, an acceptance position, an incomplete character, and an incomplete character at the end as a state history.
The “state transition history” is the history of DFA state transitions from the state before the partial character string is input to the DFA (the first state; this includes the case where there is an incomplete character at the end of the previous partial character string) to the state after the partial character string has been input to the DFA (the transition destination state; if there is an incomplete character at the end of the partial character string, the state after input up to just before that incomplete character). Intermediate states need not be stored; only the first state and the transition destination state may be stored.
The “incomplete character” is an incomplete character stored in the byte data storage unit 122 before the partial character string is expanded. By combining this with the incomplete character at the beginning of the partial character string, the character can be input to the DFA.
“Ending incomplete character” indicates an incomplete character when there is an incomplete character at the end of the partial character string.

圧縮辞書記憶部１１３は、圧縮に用いた辞書と同じ辞書を記憶している。これは、圧縮テキスト記憶部１０３が記憶していたものから取得してもよいし、圧縮ブロック列の中に埋め込まれた情報を抽出したものであってもよい。
圧縮ブロックによって置換される部分文字列には、ここに示すように、末尾にＤＦＡに入力できない文字（未完文字）を有するもの（例えば、参照番号１）、先頭にＤＦＡに入力できない文字（他の未完文字）を有するもの（例えば、参照番号２）、両方に有するもの（例えば、参照番号３）などがある。 The compression dictionary storage unit 113 stores the same dictionary as that used for compression. This may be acquired from what is stored in the compressed text storage unit 103, or may be obtained by extracting information embedded in the compressed block sequence.
As shown here, the partial character strings replaced by compressed blocks include those with a character that cannot be input to the DFA (an incomplete character) at the end (for example, reference number 1), those with a character that cannot be input to the DFA (another incomplete character) at the beginning (for example, reference number 2), and those with both (for example, reference number 3).

状態遷移表記憶部１０６は、検索条件入力部１０２が入力した検索条件（検索パターン）に基づいて、状態遷移表生成部１０５が生成した状態遷移表を記憶している。
例えば、検索条件入力部１０２が、正規表現「（あい｜えおう）［うお］え＊」を検索条件として入力する（ここで「［うお］」は「（う｜お）」の簡略表記である）。この正規表現は、「あいう」「あいお」「あいうお」「あいおお」「えおうう」「えおうお」「えおううお」「えおうおお」・・・などを意味する。図２７の状態遷移表は、この正規表現に基づいて、状態遷移表生成部１０５が生成するものである。 The state transition table storage unit 106 stores the state transition table generated by the state transition table generation unit 105 based on the search condition (search pattern) input by the search condition input unit 102.
For example, the search condition input unit 102 inputs the regular expression “（あい｜えおう）［うお］え＊” as a search condition (here “［うお］” is shorthand for “（う｜お）”). This regular expression means “あいう”, “あいお”, “あいうお”, “あいおお”, “えおうう”, “えおうお”, “えおううお”, “えおうおお”, and so on. The state transition table in FIG. 27 is generated by the state transition table generation unit 105 based on this regular expression.

図７は、この実施の形態における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図である。
この実施の形態における制御の流れは、実施の形態１において図７を用いて説明した流れとほぼ同一である。ここでは、相違する点だけを説明する。 FIG. 7 is a flowchart showing an example of the control flow of search processing of the compressed text search apparatus 100 in this embodiment.
The control flow in this embodiment is almost the same as the flow described in Embodiment 1 with reference to FIG. 7. Only the differences are described here.

Ｓ１２において、条件判断部１１４は、状態記憶部１１１が記憶した現在のＤＦＡの状態と、状態遷移記憶部１１２が記憶した圧縮ブロックに対応する状態遷移履歴（以後、圧縮ブロックの状態遷移履歴という）の先頭の状態とが一致するかを判定するとともに、バイトデータ記憶部１２２が記憶した未完文字と、状態遷移履歴の未完文字とが一致するかも判定する。条件判断部１１４は、両方が一致した場合のみ一致と判断し、Ｓ６２以降の処理に移る。 In S12, the condition determination unit 114 determines whether the current DFA state stored in the state storage unit 111 matches the first state of the state transition history corresponding to the compressed block stored in the state transition storage unit 112 (hereinafter, the state transition history of the compressed block), and also determines whether the incomplete character stored in the byte data storage unit 122 matches the incomplete character in the state transition history. The condition determination unit 114 judges a match only when both match, and then proceeds to the processing from S62 onward.

Ｓ１３において、遷移先算出部１１６は、状態遷移履歴から取得した遷移先状態に、状態記憶部１１１が記憶したＤＦＡの状態を更新するとともに、バイトデータ記憶部１２２が記憶した未完文字を、状態遷移履歴から取得した末尾の未完文字に更新する。 In S13, the transition destination calculation unit 116 updates the DFA state stored in the state storage unit 111 to the transition destination state acquired from the state transition history, and converts the incomplete characters stored in the byte data storage unit 122 into the state transition. Update to the last incomplete character obtained from the history.

Ｓ１６において、文字取得部１０９が取得した文字が未完文字である場合には、未完文字復元部１２１がそれを判断し、バイトデータ記憶部１２２に未完文字を記憶させる。 In S16, if the character acquired by the character acquisition unit 109 is an incomplete character, the incomplete character restoration unit 121 determines that and stores the incomplete character in the byte data storage unit 122.

これ以外の部分における処理は、実施の形態１において図７を用いて説明したものと同一なので、ここでは説明を省略する。 Since the processing in other parts is the same as that described in Embodiment 1 with reference to FIG. 7, the description thereof is omitted here.

このように、未完文字がある場合にはそれを一時的に記憶しておき、ＤＦＡの状態と未完文字が両方とも一致するかを判断する。一致する場合には、これを展開してＤＦＡに入力しても全く同じ状態遷移をすることになるので、これを展開せず、過去の履歴を参照して、ＤＦＡの状態遷移を１回で済ませる。
これにより、検索装置が扱う文字を表すビット列のビット長が、圧縮テキストの圧縮方式で想定している文字を表すビット列のビット長と異なる場合でも、検索が高速になるという効果を奏する。 In this way, if there is an incomplete character, it is stored temporarily, and it is determined whether both the DFA state and the incomplete character match. If they match, expanding the block and inputting it to the DFA would produce exactly the same state transitions, so the block is not expanded; instead, the past history is referenced and the DFA state transition is completed in a single step.
Thereby, even when the bit length of the bit string representing the character handled by the search device is different from the bit length of the bit string representing the character assumed in the compression method of the compressed text, there is an effect that the search is performed at high speed.
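
The reuse condition of this embodiment (both the DFA state and the pending incomplete character must match) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the DFA is a plain dictionary `delta` that falls back to state 0, acceptance positions are omitted, and the names `run_block` and `search` are hypothetical.

```python
def is_lead_byte(b: int) -> bool:
    """True if b is the first byte of a two-byte Shift JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def run_block(delta, state, pending, chunk):
    """Expand one block: feed pending + chunk to the DFA character by character."""
    data = pending + chunk
    i = 0
    while i < len(data):
        step = 2 if is_lead_byte(data[i]) else 1
        if i + step > len(data):
            break  # trailing incomplete character
        ch = data[i:i + step].decode("shift_jis")
        state = delta.get((state, ch), 0)  # assumption: unknown input resets to 0
        i += step
    return state, data[i:]


def search(blocks, dictionary, delta, start=0):
    # ref -> (first state, incomplete char before, end state, incomplete char after)
    history = {}
    state, pending, expanded = start, b"", 0
    for ref in blocks:
        h = history.get(ref)
        if h is not None and h[0] == state and h[1] == pending:
            state, pending = h[2], h[3]  # reuse the history, no expansion
        else:
            before = (state, pending)
            state, pending = run_block(delta, state, pending, dictionary[ref])
            history[ref] = (*before, state, pending)
            expanded += 1
    return state, pending, expanded


# "あいう" twice, cut by a byte-oriented compressor into 5-byte and 1-byte entries.
dictionary = {1: "あい".encode("shift_jis") + b"\x82", 2: b"\xa4"}
delta = {(0, "あ"): 1, (1, "い"): 2, (2, "う"): 0}
result = search([1, 2, 1, 2], dictionary, delta)
```

In this run only the first occurrence of each block is expanded; the second occurrences are resolved from the stored history because the state and the pending byte both match.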

なお、ここでは、辞書参照型圧縮方式によって圧縮された圧縮テキストを検索する場合について説明したが、実施の形態２〜実施の形態９で説明した構成と組み合わせることにより、他の圧縮方式によって圧縮された圧縮テキストを検索することも可能である。 Here, the case where the compressed text compressed by the dictionary reference compression method is searched has been described. However, by combining with the configuration described in the second to ninth embodiments, the compressed text is compressed by another compression method. It is also possible to search for compressed text.

実施の形態１１．
実施の形態１１を図７、図２８を用いて説明する。
この実施の形態における圧縮テキスト検索装置１００（文字列検索装置の一例）の外観、ハードウェア構成、ブロック構成は、実施の形態１０で説明したものと同一なので、ここでは説明を省略する。 Embodiment 11.
Embodiment 11 will be described with reference to FIGS. 7 and 28.
Since the appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in the tenth embodiment, description thereof is omitted here.

この実施の形態では、検索装置が扱う文字を表すビット列のビット長が、圧縮テキストの圧縮方式で想定している文字を表すビット列のビット長と異なる場合の別の方式について説明する。 In this embodiment, another method will be described in the case where the bit length of the bit string representing the character handled by the search device is different from the bit length of the bit string representing the character assumed in the compressed text compression method.

図２８は、この実施の形態において、圧縮テキスト記憶部１０３、圧縮辞書記憶部１１３（辞書記憶部の一例）、状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図である。 FIG. 28 is a diagram showing an example of the contents stored in the compressed text storage unit 103, the compression dictionary storage unit 113 (an example of a dictionary storage unit), and the state transition storage unit 112 (an example of a history storage unit) in this embodiment.

状態遷移記憶部１１２は、状態履歴を記憶する。検索開始時には、状態履歴は空である。検索が進むにつれて、埋まっていく。
状態遷移記憶部１１２は状態履歴として、状態遷移履歴、受理位置、先頭の未完文字、末尾の未完文字を記憶する。
「状態遷移履歴」は、その部分文字列をＤＦＡに入力する前（その部分文字列の先頭に未完文字がある場合には、そこまで入力した後）の状態（最初の状態）から、その部分文字列をＤＦＡに入力した後（その部分文字列の末尾に未完文字がある場合には、未完文字の手前まで入力した後）の状態（遷移先状態）までの、ＤＦＡの状態遷移の履歴である。なお、途中経過は記憶せず、最初の状態と遷移先状態だけを記憶してもよい。
「先頭の未完文字」は、その部分文字列の先頭に未完文字がある場合の未完文字を示す。バイトデータ記憶部１２２が記憶していた未完文字を、これと結合することによって、ＤＦＡに入力できる文字となる。
「末尾の未完文字」は、その部分文字列の最後に未完文字がある場合の未完文字を示す。 The state transition storage unit 112 stores a state history. At the start of the search, the state history is empty. It fills in as the search progresses.
The state transition storage unit 112 stores, as the state history, the state transition history, the acceptance position, the leading incomplete character, and the trailing incomplete character.
The “state transition history” is the history of DFA state transitions from the state before the partial character string is input to the DFA (the first state; if there is an incomplete character at the beginning of the partial character string, the state after input up to and including it) to the state after the partial character string has been input to the DFA (the transition destination state; if there is an incomplete character at the end of the partial character string, the state after input up to just before it). Intermediate states need not be stored; only the first state and the transition destination state may be stored.
The “leading incomplete character” is the incomplete character at the beginning of the partial character string, if there is one. Combining the incomplete character stored in the byte data storage unit 122 with it yields a character that can be input to the DFA.
The “trailing incomplete character” is the incomplete character at the end of the partial character string, if there is one.

他の部分については、実施の形態１０において図２７を使って説明したものと同一であるので、ここでは説明を省略する。 Other portions are the same as those described in Embodiment 10 with reference to FIG. 27, and thus description thereof is omitted here.

Ｓ１１において、圧縮ブロック取得部１０８（符号取得部の一例）が圧縮ブロック（符号）を取得する。バイトデータ記憶部１２２が未完文字を記憶している場合には、未完文字復元部１２１が、部分文字列の先頭にある未完文字（他の未完文字）と結合して、ＤＦＡに入力できる文字を取得し、これをＤＦＡに入力して、状態遷移処理を行う。
これにより、Ｓ１２の処理をする段階では未完文字がなくなるので、Ｓ１２においては、未完文字が一致するかを判別する必要がなく、状態の一致のみを判別すればよい。 In S11, the compressed block acquisition unit 108 (an example of a code acquisition unit) acquires a compressed block (code). If the byte data storage unit 122 stores an incomplete character, the incomplete character restoration unit 121 combines it with the incomplete character (another incomplete character) at the beginning of the partial character string to obtain a character that can be input to the DFA, inputs this character to the DFA, and performs the state transition processing.
As a result, there are no incomplete characters at the stage of processing in S12, so in S12, it is not necessary to determine whether the incomplete characters match, and it is only necessary to determine the match of the state.

このように、未完文字がある場合にはそれを一時的に記憶しておき、次の圧縮ブロックの先頭にある未完文字と結合してＤＦＡに入力する。未完文字がなくなってから、ＤＦＡの状態が一致するかを判断するので、未完文字の存在を無視することができる。
これにより、検索装置が扱う文字を表すビット列のビット長が、圧縮テキストの圧縮方式で想定している文字を表すビット列のビット長と異なる場合でも、検索が高速になるという効果を奏する。 In this way, if there is an incomplete character, it is temporarily stored, combined with the incomplete character at the head of the next compressed block, and input to the DFA. Since it is determined whether or not the DFA status matches after the incomplete characters disappear, the presence of incomplete characters can be ignored.
Thereby, even when the bit length of the bit string representing the character handled by the search device is different from the bit length of the bit string representing the character assumed in the compression method of the compressed text, there is an effect that the search is performed at high speed.

実施の形態１２．
実施の形態１２を図２９〜図３３を用いて説明する。
この実施の形態における圧縮テキスト検索装置１００（文字列検索装置の一例）の外観、ハードウェア構成、ブロック構成は、実施の形態１０で説明したものと同一なので、ここでは説明を省略する。 Embodiment 12.
Embodiment 12 will be described with reference to FIGS. 29 to 33.
Since the appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in the tenth embodiment, description thereof is omitted here.

この実施の形態では、検索装置が扱う文字を表すビット列のビット長が、圧縮テキストの圧縮方式で想定している文字を表すビット列のビット長と異なる場合の更に別の方式について説明する。 In this embodiment, another method will be described in the case where the bit length of the bit string representing the character handled by the search apparatus is different from the bit length of the bit string representing the character assumed in the compressed text compression method.

この圧縮テキスト検索装置は、入力されたマルチバイトコード文字を含む圧縮テキスト中に検索条件に適合する文字列が存在するか否かを判定し、存在する場合はその文字列の出現位置を、存在しない場合は何も出力しない検索装置である。 This compressed text search device determines whether the input compressed text, which contains multi-byte code characters, includes a character string that matches the search condition. If such a string exists, the device outputs the position at which it appears; if not, it outputs nothing.

マルチバイトコード文字を含むテキストでは、テキスト中に１バイトコードの文字と、２バイト以上のコードの文字が混在して存在する。ここでは、文字のコードを「（８２Ａ０）」のように（）で囲んだ１６進数値で表記するものとする。これ以降、主にシフトＪＩＳコードを例に説明する。文字コード（８２Ａ０）は、シフトＪＩＳコードの「あ」である。また、２バイト以上のコードの文字の、１文字に満たない部分コードをバイトデータ（未完文字の一例）と呼ぶこととする。例えば、「あ」のバイトデータは、（８２）や（Ａ０）である。 In text that includes multi-byte code characters, characters with 1-byte codes and characters with codes of 2 or more bytes are mixed. Here, a character code is written as a hexadecimal value enclosed in parentheses, such as “(82A0)”. The following description mainly uses the Shift JIS code as an example. The character code (82A0) is “あ” in the Shift JIS code. A partial code of a character whose code is 2 or more bytes, amounting to less than one character, is called byte data (an example of an incomplete character). For example, the byte data of “あ” are (82) and (A0).

図２９は、この実施の形態による圧縮テキストの構造を示す図である。図２９は、それぞれ実施の形態１において図４を用いて説明したものと対応している。 FIG. 29 is a diagram showing the structure of the compressed text according to this embodiment. FIG. 29 corresponds to what was described in Embodiment 1 with reference to FIG. 4.

一般的に辞書式圧縮は、１バイト単位で処理されるため、マルチバイトコード文字を含むテキストを圧縮した場合、２バイト以上からなる文字が圧縮辞書の２つ以上のエントリに分かれて登録されてしまうことがある。
図２９のような圧縮テキスト＜１、２、１、３、４、１、３＞を伸張する場合、圧縮ブロック列３００から１つずつ圧縮ブロック１７０２を取得し、その参照情報を元に圧縮ブロック１７０２を圧縮辞書の文字列１７０５と置き換える。
例えば、最初の圧縮ブロック１７０２は、圧縮辞書の１番目のエントリを参照しているため、最初の圧縮ブロックは、１番目のエントリの文字列「あいうえ（８２）」と置き換えることができる。同様に２番目の圧縮ブロックは、「（Ａ８）うい」と置き換えることができる。ここで、文字コード（８２Ａ８）は、シフトＪＩＳコードで「お」を表わしているため、１番目と２番目の圧縮ブロックからは、文字列「あいうえおうい」が得られる。同様に全ての圧縮ブロックについて繰り返すことで、圧縮ブロック列３００から伸張されたテキスト「あいうえおういあいうえいおうあえあいうえあえ」を得ることができる（（８２Ａ０）＝「あ」、（８２Ａ２）＝「い」）。 Dictionary-based compression generally operates in 1-byte units, so when text containing multi-byte code characters is compressed, a character consisting of 2 or more bytes may be split across two or more entries of the compression dictionary.
When decompressing the compressed text <1, 2, 1, 3, 4, 1, 3> shown in FIG. 29, compressed blocks 1702 are acquired one at a time from the compressed block sequence 300, and each compressed block 1702 is replaced with the character string 1705 of the compression dictionary based on its reference information.
For example, the first compressed block 1702 refers to the first entry of the compression dictionary, so it can be replaced with that entry's character string “あいうえ（８２）”. Similarly, the second compressed block can be replaced with “（Ａ８）うい”. Since the character code (82A8) represents “お” in the Shift JIS code, the first and second compressed blocks yield the character string “あいうえおうい”. Repeating this for all compressed blocks yields the decompressed text “あいうえおういあいうえいおうあえあいうえあえ” from the compressed block sequence 300 ((82A0) = “あ”, (82A2) = “い”).
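
The first two blocks of this example can be checked in Python. The sketch below assumes the standard `shift_jis` codec and models only dictionary entries 1 and 2, whose contents are stated in the text; the variable names are illustrative.

```python
# Entries 1 and 2 of the compression dictionary, as raw Shift JIS bytes.
dictionary = {
    1: "あいうえ".encode("shift_jis") + b"\x82",  # あいうえ(82)
    2: b"\xa8" + "うい".encode("shift_jis"),      # (A8)うい
}

blocks = [1, 2]

# Decompression concatenates the referenced byte strings; the dangling
# (82) of entry 1 and the leading (A8) of entry 2 join into (82A8).
raw = b"".join(dictionary[ref] for ref in blocks)
text = raw.decode("shift_jis")
```

Neither entry is a whole number of characters on its own, yet their concatenation decodes cleanly because the split falls inside the two-byte code of “お”.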

図３０は、この実施の形態において、状態遷移記憶部１１２が記憶する状態履歴の構造を示す構成図である。状態遷移記憶部１１２は、参照番号１８０１、状態遷移履歴１８０２、受理位置１８０３の情報を持つ。参照番号１８０１は、圧縮辞書の参照番号３０４と１対１に対応している。状態遷移履歴１８０２は、対応する圧縮辞書の文字列による状態遷移機械の状態遷移の履歴を記憶したものであり、先頭が圧縮辞書の文字列を読む直前の状態、末尾が圧縮辞書の文字列を全て読んだ直後の状態をさす。状態は、１バイトのデータではなく１文字に対して１回遷移する。例えば、１番目の状態遷移履歴の場合、状態［１］から開始して、文字列「あいうえ（８２）」を１文字読む毎に状態が［２］−［３］−［４］−［５］と遷移する。最後の「（８２）」は、バイトデータであるため状態を遷移させることができない。
このように、実施の形態１では、状態遷移の履歴の長さは、圧縮ブロックの参照文字列の文字列長＋１であったが、この実施の形態のマルチバイトコード文字を含むテキストを検索する圧縮テキスト検索装置では、参照文字列の文字列長＋１となるとは限らず、参照文字列の文字数＋１となる。受理位置１８０３は、圧縮ブロックの参照文字列の何バイト目で、状態遷移機械が受理状態に到達したかを表わしている。この受理位置は、文字単位でカウントしても良い。このとき、１文字に満たない１バイトデータがある場合でも、それを１文字とカウントしても良いし、しなくても良い。 FIG. 30 is a configuration diagram showing the structure of the state history stored in the state transition storage unit 112 in this embodiment. The state transition storage unit 112 holds a reference number 1801, a state transition history 1802, and an acceptance position 1803. The reference number 1801 corresponds one-to-one with the reference number 304 of the compression dictionary. The state transition history 1802 records the history of state transitions of the state transition machine caused by the corresponding character string of the compression dictionary; its head is the state immediately before the character string is read, and its tail is the state immediately after the entire character string has been read. The state transitions once per character, not once per byte of data. For example, in the first state transition history, starting from state [1], the state transitions through [2]-[3]-[4]-[5] as each character of the string “あいうえ（８２）” is read. The final “(82)” is byte data, so it cannot cause a state transition.
Thus, in Embodiment 1 the length of the state transition history was the character string length of the compressed block's reference character string plus 1, but in the compressed text search device of this embodiment, which searches text containing multi-byte code characters, the length is not necessarily the character string length plus 1; it is the number of characters in the reference character string plus 1. The acceptance position 1803 indicates at which byte of the compressed block's reference character string the state transition machine reached an accepting state. The acceptance position may instead be counted in characters. In that case, 1-byte data amounting to less than one character may or may not be counted as one character.
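
The difference between byte length and history length can be illustrated with the entry “あいうえ（８２）”. The sketch below is an assumption-laden illustration: the transition table `delta` is a hypothetical chain [1] to [5] chosen to mirror the example (it is not the table of FIG. 31), unknown input falls back to state 1, and `transition_history` is a made-up helper name.

```python
def is_lead_byte(b: int) -> bool:
    """True if b is the first byte of a two-byte Shift JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def transition_history(delta, state, entry):
    """Return (list of states, trailing byte data) for one dictionary entry.

    The DFA steps once per complete character, so trailing byte data
    adds no state. Assumes the entry starts on a character boundary.
    """
    history = [state]
    i = 0
    while i < len(entry):
        step = 2 if is_lead_byte(entry[i]) else 1
        if i + step > len(entry):
            break
        ch = entry[i:i + step].decode("shift_jis")
        state = delta.get((state, ch), 1)  # assumption: fall back to state 1
        history.append(state)
        i += step
    return history, entry[i:]


delta = {(1, "あ"): 2, (2, "い"): 3, (3, "う"): 4, (4, "え"): 5}
entry = "あいうえ".encode("shift_jis") + b"\x82"
history, tail = transition_history(delta, 1, entry)
```

The entry is 9 bytes long, but the history has 5 states (4 complete characters plus 1), and the final byte is returned separately as byte data.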

図３１は、状態遷移表記憶部１０６が記憶する状態遷移表の一例を示す図である。
図３１の状態遷移表１９０１の左端の列は、現在の状態を表わしている。また、１行目は次に入力された文字を表わしている。検索条件に含まれる正規表現を受理する状態遷移機械の状態遷移表は、照合が開始されるまでに生成される。 FIG. 31 is a diagram illustrating an example of a state transition table stored in the state transition table storage unit 106.
The leftmost column of the state transition table 1901 in FIG. 31 represents the current state. The first line represents the next input character. The state transition table of the state transition machine that accepts the regular expression included in the search condition is generated before collation is started.

図３２は、この実施の形態の圧縮テキスト検索装置における検索処理の流れ図である。初期状態として、検索条件入力部１０２が入力した検索条件から、状態遷移表生成部１０５が状態遷移表を生成し、状態遷移表記憶部１０６が記憶しているものとする。また、圧縮テキスト記憶部１０３が記憶した圧縮テキストから圧縮辞書についての情報を抽出し、圧縮辞書記憶部１１３が記憶しているものとする。また、状態記憶部１１１には初期状態がセットされているものとする。状態遷移記憶部１１２の状態遷移履歴と受理位置は空であるとする。また、元テキスト長をカウントするためのカウンタを０に初期化する。 FIG. 32 is a flowchart of search processing in the compressed text search apparatus of this embodiment. As an initial state, it is assumed that the state transition table generation unit 105 generates a state transition table from the search condition input by the search condition input unit 102 and the state transition table storage unit 106 stores the state transition table. Also, it is assumed that information about the compression dictionary is extracted from the compressed text stored in the compressed text storage unit 103 and stored in the compression dictionary storage unit 113. In addition, it is assumed that the initial state is set in the state storage unit 111. It is assumed that the state transition history and the reception position in the state transition storage unit 112 are empty. Also, a counter for counting the original text length is initialized to zero.

まず、ステップＳ１５０１で圧縮ブロック取得部１０８によって、圧縮ブロック列の先頭から順に圧縮ブロックを１個取得する。ステップＳ１５０２で、バイトデータ記憶部１２２にバイトデータがあるか判定する。バイトデータがある場合は（ＹＥＳ）、ステップＳ１５０３に進む。 First, in step S1501, the compressed block acquisition unit 108 acquires one compressed block in order from the beginning of the compressed block sequence. In step S1502, it is determined whether there is byte data in the byte data storage unit 122. If there is byte data (YES), the process proceeds to step S1503.

テキストが全て文字の割り付けられている文字コードからなり、バイトデータ記憶部１２２にバイトデータがある場合、圧縮ブロックの参照文字列の先頭にも、バイトデータがある。そこで、ステップＳ１５０３で、バイトデータ記憶部のバイトデータを上位バイト、参照文字列の最初のバイトデータを下位バイトとする１文字と、状態記憶部１１１の現在の状態を元に、状態遷移機械１１０によって次の状態を取得する。取得した次の状態は、状態記憶部１１１にセットする。同時にバイトデータ記憶部１２２を空にする。 When the text is composed of character codes to which characters are allotted and there is byte data in the byte data storage unit 122, there is byte data at the head of the reference character string of the compressed block. Therefore, in step S1503, the state transition machine 110 is based on one character having the byte data in the byte data storage unit as the upper byte, the first byte data in the reference character string as the lower byte, and the current state of the state storage unit 111. To get the next state. The acquired next state is set in the state storage unit 111. At the same time, the byte data storage unit 122 is emptied.

If there is no byte data in step S1502 (NO), the process proceeds directly to step S1504.
In step S1504, it is determined whether the current state in the state storage unit 111 matches the first state of the state transition history referenced by the compressed block. If they match (YES), the last state of the state transition history is set in the state storage unit 111 in step S1505.
Next, in step S1506, it is determined whether the state transition history has an acceptance position. If it does (YES), the hit position is calculated and output in step S1507. Here, hit position = current original text length + acceptance position. If there is no acceptance position in step S1506 (NO), the process proceeds directly to step S1508.
In step S1508, it is determined whether there is byte data at the end of the reference character string of the compressed block. If there is (YES), that byte data is set in the byte data storage unit 122 in step S1509. If there is not (NO), the process proceeds directly to step S1510.
In step S1510, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S1501. If the end has been reached (YES), the search processing ends.
Although not shown explicitly in FIG. 32, the character string length of the compression dictionary is added to the original text length at this point.
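The control flow of FIG. 32 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in the specification: every dictionary string is assumed to consist of whole single-byte characters, so the byte-data steps (S1502, S1503, S1508, S1509) are left out. The names `search` and `toy_delta`, the two-entry dictionary, and the three-state DFA are hypothetical.

```python
def search(blocks, dictionary, delta, start, accepting):
    """Return hit positions (offset just past each match) in the original text."""
    state, text_len, hits = start, 0, []
    history = {}  # dictionary index -> (visited states, acceptance offsets)
    for idx in blocks:                                      # S1501: next compressed block
        cached = history.get(idx)
        if cached is not None and cached[0][0] == state:    # S1504: first state matches
            states, offsets = cached
            state = states[-1]                              # S1505: jump to the last state
        else:                                               # S1511: replay the string
            states, offsets = [state], []
            for i, ch in enumerate(dictionary[idx], start=1):
                state = delta(state, ch)
                states.append(state)
                if state in accepting:
                    offsets.append(i)                       # acceptance position
            history[idx] = (states, offsets)
        hits.extend(text_len + off for off in offsets)      # S1507 / S1606
        text_len += len(dictionary[idx])                    # original text length so far
    return hits


def toy_delta(state, ch):
    """Hypothetical DFA that finds "ab" anywhere: 0 = start, 1 = saw 'a', 2 = accept."""
    if ch == "a":
        return 1
    if ch == "b" and state == 1:
        return 2
    return 0


# Blocks 0, 0, 1 expand to "abx" + "abx" + "ab"; the second block reuses the history.
print(search([0, 0, 1], ["abx", "ab"], toy_delta, 0, {2}))  # [2, 5, 8]
```

In this sketch the second occurrence of block 0 costs no DFA transitions at all, because the state on entry equals the first state of the stored history.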

Here, the hit position is output as the number of bytes from the beginning of the text, but it may instead be output as a number of characters. In that case, the number of characters of the original text is counted instead of the original text length, and the acceptance positions in the state transition history are also recorded as character counts.

If, in step S1504, the current state stored in the state storage unit 111 does not match the first state of the state transition history referenced by the compressed block (NO), the process proceeds to step S1511, where a state transition history is obtained for the reference character string of the compressed block.
If the head of the reference character string was byte data in step S1502, the state transition history is obtained for the character string starting from the character that follows that byte data.
When the state transition history has been obtained in step S1511, the process proceeds to step S1510.

FIG. 33 is a flowchart of the processing of step S1511 in the search flow of FIG. 32. Here, a temporary storage area H for the state transition history is prepared and is assumed to be empty initially.
First, in step S1601, the current state is added at the head of the state transition history in storage area H. In step S1602, the character acquisition unit 109 acquires characters one at a time, in order from the beginning of the reference character string of the compressed block.
In step S1603, the next state is obtained from the state transition machine 110, using the character acquired in step S1602 and the current state stored in the state storage unit 111 as inputs, and is set in the state storage unit 111.
In step S1604, the state obtained in step S1603 is appended to the state transition history in storage area H.
In step S1605, it is determined whether the state in the state storage unit 111 is an accepting state. If it is (YES), the hit position is output in step S1606 and the process proceeds to step S1607. Here, the hit position is the original text length (in bytes) + the number of bytes from the beginning of the reference character string of the compressed block. At the same time, the number of bytes from the beginning of the reference character string is set as an acceptance position of the state transition history in storage area H. If the state is not an accepting state in step S1605 (NO), the process proceeds directly to step S1607.
In step S1607, it is determined whether the end of the character string has been reached. If not (NO), the process proceeds to step S1608.
In step S1608, it is determined whether the next character is byte data. If it is (YES), the processing ends. If it is not (NO), the next character is acquired in step S1602.
If the character string has been processed to its end in step S1607 (YES), the processing ends.
When the processing ends, the history and acceptance positions of the state transition history in storage area H are reflected in the state transition history referenced by the compressed block.
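The flow of FIG. 33 might be sketched as below for Shift JIS data. This is only an illustration: the function names are hypothetical, the lead-byte ranges are an assumption about Shift JIS rather than something stated in this specification, and the caller is assumed to have already removed any byte data at the head of the string. The transitions and the dictionary string are the ones used in the worked example of this embodiment.

```python
def is_sjis_lead(byte):
    """True if byte starts a 2-byte Shift JIS character (assumed ranges)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def build_history(data, state, delta, accepting):
    """Replay data through the DFA; return (states, acceptance offsets, leftover bytes)."""
    states, accept_offsets, pos = [state], [], 0        # S1601: history starts with the current state
    while pos < len(data):                              # S1607: end of string reached?
        width = 2 if is_sjis_lead(data[pos]) else 1
        if pos + width > len(data):                     # S1608: only byte data remains
            break
        ch = data[pos:pos + width].decode("shift_jis")  # S1602: next character
        state = delta[(state, ch)]                      # S1603: next state
        states.append(state)                            # S1604: extend the history
        pos += width
        if state in accepting:                          # S1605: accepting state?
            accept_offsets.append(pos)                  # S1606: bytes from the start of the string
    return states, accept_offsets, data[pos:]


delta = {(1, "あ"): 2, (2, "い"): 3, (3, "う"): 4, (4, "え"): 5}
entry = "あいうえ".encode("shift_jis") + b"\x82"         # dictionary string "あいうえ(82)"
print(build_history(entry, 1, delta, {4}))              # ([1, 2, 3, 4, 5], [6], b'\x82')
```

The leftover byte returned here corresponds to the byte data that step S1509 places in the byte data storage unit 122.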

In the flow of FIG. 33 above, the state transition history is reflected at the end of the processing, but it need not always be updated.
That is, the history may be left as first obtained and never updated, or it may be updated only when the first state of the state transition history is a particular state.

If the head of the character string is byte data at the start of the processing of FIG. 33, the processing may be ended immediately.

In this embodiment, step S1504 determines whether the current state stored in the state storage unit 111 matches the first state of the state transition history corresponding to the compressed block in the state transition storage unit. Alternatively, after the next state is obtained in step S1603 of FIG. 33 (step S1511 of FIG. 32), it may be compared with the state in the state transition history, and if they match, the processing of step S1511 may be ended and the process moved to step S1505.

An example of the processing of the compressed text search apparatus according to this embodiment is shown with reference to FIGS. 26 to 33. As the initial state, the search condition input unit 102 has input a search condition containing a regular expression, the state transition table generation unit 105 has generated, based on that search condition, the state transition table 1901 of a DFA that accepts the regular expression, and the state transition table storage unit 106 stores it. The only accepting state is state [4]. The state storage unit 111 stores the initial state [1]. The state transition history in the state transition storage unit 112 is empty, and the byte data storage unit 122 is also empty. The compressed block sequence 300 and the compression dictionary 1703 are obtained from the compressed text stored in the compressed text storage unit 103. The compression dictionary is stored in the compression dictionary storage unit 113. All characters used here are assumed to be 2-byte characters.

First, in step S1501, the first compressed block is acquired from the compressed block sequence 300.
Next, in step S1502, the byte data storage unit 122 holds no byte data, so the process proceeds to step S1504.
In step S1504, the current state stored in the state storage unit 111 is [1], whereas the state transition history referenced by the compressed block is empty, so the states do not match. The process therefore moves to step S1511.

In step S1601 of FIG. 33, the current state is set at the head of the first state transition history in the state transition storage unit.
In step S1602, the first character "あ" of the compression dictionary character string "あいうえ(82)" referenced by the compressed block is acquired.
Next, in step S1603, the state transition machine 110 obtains the next state [2] from the current state [1] and the character "あ". The obtained state [2] is set in the state storage unit 111, and the process proceeds to step S1604.
In step S1604, the current state is appended to the first state transition history. The first state transition history stored in the state transition storage unit 112 is now "1-2".
In step S1605, it is determined whether the current state is an accepting state; it is not, so the process proceeds to step S1607.
In step S1607, it is determined whether the end of the compression dictionary character string has been reached. It has not, so the process proceeds to step S1608; the next character is not byte data either, so the next character "い" is acquired in step S1602.

The second character "い" is processed in the same way; when step S1608 is reached, the current state is [3] and the state transition history is "1-2-3".

Next, the next character "う" is acquired in step S1602. When the character "う" is read in state [3], the state transition table 1901 gives the next state [4]. Immediately after step S1604 is executed, therefore, the current state is [4] and the state transition history is "1-2-3-4".
In step S1605, it is determined whether the current state is an accepting state; it is, so the process proceeds to step S1606. Since 2-byte characters have been read up to the third character, 2 bytes × 3 = 6 is output as the hit position. At the same time, 6 is added as an acceptance position of the first state transition history.

The processing is repeated in the same way, and in step S1608 for the fourth character "え", it is determined whether the next character is byte data. In Shift JIS, the first byte of a character shows whether it is a 1-byte or a 2-byte character. The next character (82) is byte data, so the processing ends. At this point, processing has finished up to byte 9 from the beginning of the text, the current state is [5], and the first state transition history is "1-2-3-4-5".

Next, the process proceeds to step S1508 of FIG. 32. In step S1508, the last character is the byte data (82), so (82) is set in the byte data storage unit 122 in step S1509. The process then proceeds to step S1510; the end of the compressed block sequence has not been reached, so the next compressed block is acquired in step S1501. At this point, the original text length is 9 bytes.
In step S1502, the byte data storage unit 122 holds the byte data (82), so the process proceeds to step S1503.
In step S1503, the state transition machine 110 obtains the next state [6] from the current state [5] and the character "お" (= (82A8)), whose upper byte is the byte data (82) in the byte data storage unit 122 and whose lower byte is the first byte data (A8) of the reference character string, and sets it in the state storage unit 111. The byte data storage unit 122 is then emptied.
From step S1504 onward the processing is the same; when step S1510 is reached, processing has finished up to byte 14 of the text, the current state is [1], the first state transition history is "1-2-3-4-5", and the second state transition history is "6-3-1".
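The byte combination performed in step S1503 can be checked with a few lines of Python. The variable names are illustrative only; the byte values are those of the example.

```python
pending = b"\x82"        # byte data left over from the end of the previous block
reference = b"\xa8"      # first byte of the next reference character string
char = (pending + reference[:1]).decode("shift_jis")
print(char)              # お
```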

In step S1501, the third compressed block is acquired. The third compressed block references <1>.
In step S1502, the byte data storage unit 122 holds no byte data, so the process proceeds to step S1504.
In step S1504, the current state [1] is compared with the first state [1] of the first state transition history; they match, so the process proceeds to step S1505.
In step S1505, the state storage unit 111 stores the last state [5] of the state transition history as the current state.
In step S1506, the first state transition history has an acceptance position, so the hit position is calculated and output in step S1507. The text processed before the third compressed block was acquired is 14 bytes long and the acceptance position in the state transition history is 6, so the hit position is 14 + 6 = 20.
In step S1508, there is no byte data at the end of the reference character string, so the process proceeds to step S1510.
In step S1510, the end of the compressed block sequence has not been reached, so the process proceeds to step S1501.
The processing continues in the same way, and the search processing ends when the end of the compressed block sequence has been processed.

As described above, according to this embodiment, when the current state in the state storage unit does not match the first state of the state transition history corresponding to the acquired compressed block, processing the state transitions takes a number of steps proportional to the length of the compression dictionary character string referenced by the compressed block. When the current state does match the first state of the state transition history, on the other hand, the block can be processed with at most two state transitions. In a state transition machine whose transition destination is uniquely determined by the current state and the character, the current state is often the initial state. Consequently, the higher the compression ratio, that is, the more often long character strings recur in the text, the more the number of steps needed for state transitions can be reduced.
The matching itself, that is, checking whether a character string conforming to the regular expression exists in the text, uses a conventional state transition machine that accepts the regular expression and whose state transitions are uniquely determined.
Thus, the compressed text search apparatus of this embodiment can search compressed text containing multibyte code characters at high speed using a search condition that includes a regular expression.

This embodiment has been described using Shift JIS text as an example, but text in other multibyte character codes can be searched in the same way. In Japanese EUC (Extended UNIX Code; UNIX is a registered trademark), for example, the first byte shows whether a character is a 1-, 2-, or 3-byte character, as in Shift JIS, so it is only necessary to pay attention to whether the byte data is 1 byte or 2 bytes long. JIS code, in turn, carries information for determining whether a character is a 1-byte or a 2-byte character. That information may, for example, be stored together with the byte data in the byte data storage unit 122.
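As an illustration of determining the character length from the first byte in Japanese EUC, the following sketch classifies a lead byte. The function name is hypothetical, and the byte ranges are the commonly documented EUC-JP ranges, which are an assumption here since this specification does not list them.

```python
def euc_jp_char_width(first_byte):
    """Length in bytes of an EUC-JP character, judged from its first byte."""
    if first_byte == 0x8F:              # single shift 3: 3-byte character
        return 3
    if first_byte == 0x8E:              # single shift 2: 2-byte half-width katakana
        return 2
    if 0xA1 <= first_byte <= 0xFE:      # ordinary 2-byte character
        return 2
    return 1                            # ASCII and control codes


for ch in ("A", "あ", "ｱ"):
    encoded = ch.encode("euc_jp")
    print(ch, euc_jp_char_width(encoded[0]), len(encoded))
```

With a 3-byte character, up to 2 bytes may be left over at the end of a dictionary string, which is why the byte data can be 1 or 2 bytes long.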

The compressed text search apparatus described here has the following features.
It is a search apparatus that searches, by regular expression and without decompression, text containing multibyte code characters that has been compressed by a dictionary-based compression method.
The search uses a state transition machine whose state transitions are uniquely determined.
During the search, a history of the state transitions of the state transition machine is stored for each character string in the dictionary referenced by compressed blocks; when the current state matches the first state of the history referenced by a compressed block, the state is advanced in a single step to the last state of the history.
It has a storage unit for byte data amounting to less than one character; when a character string in the dictionary ends with byte data amounting to less than one character, that byte data is stored, and the state transition is processed once the remaining byte data has been obtained from the head of the next compressed block.

Embodiment 13.
Embodiment 13 will be described with reference to FIGS. 34 and 35.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 10, so their description is omitted here.

This embodiment describes searching text compressed by the LZ77 method in the case where the bit length of the bit string representing a character handled by the search apparatus differs from the bit length of the bit string representing a character assumed by the compression method of the compressed text.

FIG. 34 illustrates the information stored in the compression dictionary storage unit 113 and the state transition storage unit 112 in this embodiment. The compression dictionary storage unit 113 stores the sliding window 2103. The state transition storage unit 112 stores the state transition history 2104 corresponding to the sliding window, and the acceptance positions 2105. The state transition history 2104 stores state transitions for the sliding window length + 1 characters. The acceptance positions 2105 record the positions of accepting states within the state transition history 2104.
In this embodiment, the sliding window 2103 of the compression dictionary storage unit 113 also serves as the byte data storage unit.

FIG. 35 is a flowchart of the search processing in the compressed text search apparatus of this embodiment. As the initial state, it is assumed that the state transition table generation unit 105 has generated a state transition table from the search condition input by the search condition input unit 102, and that the state transition table storage unit 106 stores it. The initial state is set in the state storage unit 111. The compression dictionary storage unit 113, and the state transition history and acceptance positions in the state transition storage unit 112, are empty. A counter for counting the original text length is initialized to 0.

First, in step S2001, compressed blocks are acquired one at a time, in order from the beginning of the compressed block sequence.
In step S2002, it is determined whether the head of the reference character string of the compressed block is byte data. If it is, and there is byte data at the end of the compression dictionary storage unit, the state transition machine 110 obtains the next state from the current state and the single character whose upper byte is that trailing byte data and whose lower byte is the first byte data of the reference character string; the result is set as the current state and at the end of the state transition history 2104. If there is no byte data at the end of the compression dictionary storage unit, the process may proceed to step S2008.
In step S2004, it is determined whether the current state matches the first state of the state transition history referenced by the compressed block. If the head of the reference character string was byte data, the comparison is made with the state of the following character. If the first states match (YES), then in step S2005 the state of the character at position (reference character string position + reference character string length) in the sliding window is set as the current state. In step S2006, it is determined whether there is an accepting state between the reference character string position and position (reference character string position + reference character string length). If there is (YES), the hit position is calculated and output in step S2007, and the process proceeds to step S2008. If there is not (NO), the process proceeds to step S2008 without doing anything.
In step S2008, the sliding window and the state transition history are updated. That is, the character string in the sliding window is shifted forward by (reference character string length + 1) bytes, and the character string of the reference character string length starting at the reference character string position, followed by the first mismatched character, is appended at the end. Likewise, the state transition history is shifted forward by (reference character string length + 1) bytes, and the state transition history from the first character of the reference character string through the character following the reference character string is appended at the end. In addition, when the end of the reference character string and the mismatched character are both byte data that together form one character, the state transition machine 110 must obtain the next state and set it in the state storage unit 111 and at the end of the state transition history.
In step S2009, it is determined whether the end of the compressed block sequence has been reached. If not (NO), the next compressed block is acquired in step S2001. If it has (YES), the search processing ends.

If, in step S2004, the current state does not match the state of the first character of the compressed block's reference character string in the state transition history (NO), then in step S2010 a state transition history is obtained for the character string formed by appending the first mismatched character to the reference character string. That is, as in the flow of FIG. 33, characters are acquired one at a time from the beginning of that character string while the state transition machine supplies the next state. When the processing of step S2010 ends, the sliding window and the state transition history are updated in step S2008.
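The idea of keeping a state transition history aligned with the sliding window can be sketched as follows. This is a heavily simplified illustration, not the procedure of FIG. 35 itself: characters are single bytes (no byte-data handling), the window is unbounded instead of being shifted, copies are assumed not to overlap the text still being produced, and the names `search_lz77` and `toy_delta` and the block format `(position, length, literal)` are hypothetical.

```python
def search_lz77(blocks, delta, start, accepting):
    """blocks: (position, length, literal) triples; returns hit positions."""
    text = []        # decoded characters; stands in for the sliding window
    hist = [start]   # hist[i] = DFA state after text[:i], one entry longer than text
    state, hits = start, []
    for pos, length, literal in blocks:                  # S2001: next compressed block
        base = len(text)
        copied = text[pos:pos + length]
        if state == hist[pos]:                           # S2004: states match
            new_states = hist[pos + 1:pos + length + 1]  # S2005: reuse the stored history
        else:                                            # S2010: replay the copied string
            new_states, s = [], state
            for ch in copied:
                s = delta(s, ch)
                new_states.append(s)
        if new_states:
            state = new_states[-1]
        new_states.append(delta(state, literal))         # transition on the mismatched character
        hits.extend(base + i for i, s in enumerate(new_states, 1) if s in accepting)
        text.extend(copied + [literal])                  # S2008: update the window
        hist.extend(new_states)                          #        and the history
        state = new_states[-1]
    return hits


def toy_delta(state, ch):
    """Hypothetical DFA that finds "ab" anywhere: 0 = start, 1 = saw 'a', 2 = accept."""
    if ch == "a":
        return 1
    if ch == "b" and state == 1:
        return 2
    return 0


# Three literals "a", "b", "x", then a copy of "abx" followed by the literal "a".
blocks = [(0, 0, "a"), (0, 0, "b"), (0, 0, "x"), (0, 3, "a")]
print(search_lz77(blocks, toy_delta, 0, {2}))            # [2, 5]
```

For the last block the state on entry equals the stored state at the referenced position, so the three copied characters are handled by reading the history and only the literal needs a DFA transition.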

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the referenced position in the state transition history. If the states match, state transition processing that would otherwise take as many transitions as there are characters in the reference character string can be done with at most three state transitions, reducing the number of processing steps.

The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. Compressed text containing multibyte code characters compressed in LZ77 format can thus be searched at high speed using a search condition that includes a regular expression.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores an LZ77-format sliding window.
The state transition storage unit stores a state transition history of length (sliding window length + 1).
The end of the sliding window is used as the byte data storage unit.
An LZ77-format compressed block is read, and when the current state matches the first state of the state transition history referenced by the compressed block, the state is advanced to that of the last character of the reference character string in a single state transition; if, in addition, the end of the reference character string and the mismatched character together form one character, the state is advanced by that character.

Embodiment 14.
Embodiment 14 will be described with reference to FIG. 35.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 10, so their description is omitted here.

This embodiment describes searching text compressed by the LZSS method in the case where the bit length of the bit string representing a character handled by the search apparatus differs from the bit length of the bit string representing a character assumed by the compression method of the compressed text.

The flow of the search processing in the compressed text search apparatus of this embodiment is the same as that described in Embodiment 13 with reference to FIG. 35. First, consider the case where the first bit of the compressed block is 1, that is, where the reference character string is in the compression dictionary. The main difference from Embodiment 13 is that the compressed block contains no mismatched character. That is, the search can be performed in the same way as in Embodiment 7, except that no mismatched character is considered in the processing of steps S2008, S2009, and S2010.

Next, consider the case where the first bit of the compressed block is 0. Here, the mismatched character in the compressed block is always 1 byte. In this case, the processing from step S2002 onward takes one of the following three forms. First, if the mismatched character is a 1-byte character, the next state is acquired from the current state and the mismatched character and is set in the state storage unit 111 in step S2005, and then the processing of step S2008 is executed. Second, if the mismatched character is byte data and there is also byte data at the end of the sliding window, the processing of steps S2003 and S2008 is executed. Third, if the mismatched character is byte data and there is no byte data at the end of the sliding window, only the processing of step S2008 needs to be executed.

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the referenced position in the state transition history. If the states match, state transition processing that would otherwise require a number of steps proportional to the length of the reference character string can be completed in at most two state transitions, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. As a result, compressed text that contains multibyte code characters and is compressed in the LZSS format can be searched at high speed with a search condition that includes a regular expression.

Besides the LZSS format described here, text compressed in other compression formats derived from the LZ77 format, such as the LZB format and the LZBW format, can be searched in the same way by the compressed text search apparatus of this embodiment.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores a sliding window in the LZSS format.
The state transition storage unit stores a state transition history whose length is the sliding window length + 1.
The end of the sliding window is used as the byte data storage unit.
When a compressed block in the LZSS format is read and the current state matches the first state of the state transition history referenced by the compressed block, the state is advanced to the last character of the reference character string by a single state transition.
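
The reuse of a state transition history kept alongside a sliding window can be sketched as follows. This is a minimal Python sketch under simplifying assumptions that are not taken from the source: every character is a single byte (so the byte data storage unit is not modeled), the whole decoded text is kept instead of a bounded window, compressed blocks are represented by the hypothetical tuples `("lit", ch)` and `("ref", distance, length)`, and the DFA is a literal-substring automaton built by a hypothetical helper `make_step` rather than one compiled from a regular expression.

```python
def make_step(pattern):
    """Return the transition function of a DFA that accepts text ending in `pattern`.

    A state is the length of the longest prefix of `pattern` that is a suffix
    of the input read so far; len(pattern) is the accepting state.
    """
    def step(state, ch):
        seen = pattern[:state] + ch
        for k in range(min(len(pattern), len(seen)), 0, -1):
            if seen.endswith(pattern[:k]):
                return k
        return 0
    return step


def search_lzss(blocks, pattern):
    """Return (end positions of matches, number of DFA transitions executed)."""
    step, accept = make_step(pattern), len(pattern)
    window = []    # decoded characters (unbounded here; a real window is bounded)
    history = [0]  # history[i] is the DFA state after i characters
    hits, transitions = [], 0
    for block in blocks:
        if block[0] == "lit":
            _, ch = block
            window.append(ch)
            history.append(step(history[-1], ch))
            transitions += 1
            if history[-1] == accept:
                hits.append(len(window))
            continue
        _, distance, length = block
        start = len(window) - distance
        # The stored history is reusable only if it starts in the current state.
        reuse = history[start] == history[-1]
        for j in range(length):
            ch = window[start + j]
            window.append(ch)
            if reuse:
                history.append(history[start + j + 1])  # copied, no DFA transition
            else:
                history.append(step(history[-1], ch))
                transitions += 1
            if history[-1] == accept:
                hits.append(len(window))
    return hits, transitions
```

For the blocks `[("lit","a"), ("lit","b"), ("lit","c"), ("ref",3,3), ("ref",3,6)]`, which decode to `abcabcabcabc`, and the pattern `bc`, the second reference starts in the same state as its source, so its six characters are handled without running the DFA.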

Embodiment 15.
Embodiment 15 will be described with reference to FIG. 32.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 10, so their description is omitted here.

This embodiment describes searching text compressed by the LZ78 method in the case where the bit length of the bit string representing a character handled by the search apparatus differs from the bit length of the bit string representing a character assumed by the compression method of the compressed text.

The state transition storage unit according to this embodiment is the same as that shown in FIG. 30.

The flow of search processing in the compressed text search apparatus of this embodiment is almost the same as that described in Embodiment 12, so it is described with reference to FIG. 32. Only the differences from Embodiment 12 are described.
When searching text compressed in the LZ78 format, an entry must be added to the compression dictionary and to the state transition history immediately before step S1508 or immediately before step S1510. First, the reference character string of the compressed block with the mismatched character appended is added as a new entry in the compression dictionary.
If the mismatched character is byte data and the end of the reference character string is not byte data, or if the mismatched character and the end of the reference character string are both byte data and together still do not form one character, the state transition history referenced by the compressed block is added as is as a new entry in the state transition storage unit 112.
The byte data is then added to the byte data storage unit 122. If the mismatched character and the end of the reference character string are both byte data and together form one character, or if the mismatched character is a 1-byte character, the next state is acquired from that character and the current state, and the acquired state is set in the state storage unit and at the end of the state transition history. Further, if that state is an accepting state, the hit position is output and is also added to the acceptance positions.
If the current state and the first state of the state transition history referenced by the compressed block do not match in step S1504, the state transition history acquired in step S1511 is added as the new entry in the state transition storage unit.

In step S1511, the state transition history is obtained for the character string formed by appending the mismatched character to the reference character string.

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the referenced position in the state transition history. If the states match, state transition processing that would otherwise require as many transitions as there are characters in the reference character string can be completed in at most three state transitions, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. As a result, compressed text that contains multibyte code characters and is compressed in the LZ78 format can be searched at high speed with a search condition that includes a regular expression.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores a compression dictionary in the LZ78 format.
When a compressed block in the LZ78 format is read and the current state matches the first state of the state transition history referenced by the compressed block, the state is advanced to the last character of the reference character string by a single state transition. Further, when the end of the reference character string and the mismatched character together form one character, the state is advanced by that character.
The character string consisting of the reference character string and the mismatched character is added as a new entry in the compression dictionary, and the above state transition is added as a new entry in the state transition storage unit.

Embodiment 16.
Embodiment 16 will be described with reference to FIG. 35.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 10, so their description is omitted here.

This embodiment describes searching text compressed by the LZW method in the case where the bit length of the bit string representing a character handled by the search apparatus differs from the bit length of the bit string representing a character assumed by the compression method of the compressed text.

The flow of search processing in the compressed text search apparatus of this embodiment is almost the same as that of Embodiment 15, so it is described with reference to FIG. 35. The difference from the search processing of Embodiment 15 is that the compressed block does not contain a mismatched character. The compressed text search apparatus of this embodiment uses the first character of the next compressed block in place of the mismatched character of Embodiment 15.

As described above, according to this embodiment, after one compressed block is acquired, the current state is compared with the state at the referenced position in the state transition history. If the states match, state transition processing that would otherwise require as many transitions as the length of the reference character string can be completed in at most three state transitions, which reduces the number of processing steps.
The matching of the regular expression against the text itself uses a state transition machine that accepts the regular expression. As a result, compressed text that contains multibyte code characters and is compressed in the LZW format can be searched at high speed with a search condition that includes a regular expression.

In the same way, compressed text that contains multibyte code characters and is compressed in a compression format derived from the LZ78 format can be searched at high speed by the compressed text search apparatus of this embodiment.

The compressed text search apparatus described here has the following features.
The compression dictionary storage unit stores a compression dictionary in the LZW format.
When a compressed block in the LZW format is read and the current state matches the first state of the state transition history referenced by the compressed block, the state is advanced to the last character of the reference character string by a single state transition. Further, when the end of the reference character string and the first character of the reference character string of the next compressed block together form one character, the state is advanced by that character.
The character string consisting of the reference character string and the first character of the reference character string of the next compressed block is added as a new entry in the compression dictionary, and the above state transition is added as a new entry in the state transition storage unit.
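
How the first character of the next block can stand in for the mismatched character is sketched below. This is an illustrative Python sketch under assumptions that are not from the source: characters are single-byte, codes are plain integers indexing a dictionary preloaded with the hypothetical `alphabet`, the DFA is a literal-substring automaton built by a hypothetical helper `make_step`, and a code that is not yet in the dictionary is decoded with the usual LZW rule (previous string plus its own first character).

```python
def make_step(pattern):
    """Return the transition function of a DFA that accepts text ending in `pattern`.

    A state is the length of the longest prefix of `pattern` that is a suffix
    of the input read so far; len(pattern) is the accepting state.
    """
    def step(state, ch):
        seen = pattern[:state] + ch
        for k in range(min(len(pattern), len(seen)), 0, -1):
            if seen.endswith(pattern[:k]):
                return k
        return 0
    return step


def search_lzw(codes, alphabet, pattern):
    """Return (end positions of matches, number of DFA transitions executed)."""
    step, accept = make_step(pattern), len(pattern)
    strings = list(alphabet)         # initial single-character dictionary entries
    entries = [None] * len(strings)  # per entry: (state history, acceptance offsets)
    state, pos, hits, transitions = 0, 0, [], 0
    prev = None                      # (string, history, offsets) of the previous block
    for code in codes:
        if code < len(strings):
            s, entry = strings[code], entries[code]
        else:
            # Code not defined yet: previous string followed by its first character.
            s, entry = prev[0] + prev[0][0], None
        if entry is not None and entry[0][0] == state:
            history, offsets = entry  # head state matches: reuse the stored history
        else:
            history, offsets = [state], []
            for i, c in enumerate(s, 1):
                history.append(step(history[-1], c))
                transitions += 1
                if history[-1] == accept:
                    offsets.append(i)
        if prev is not None:
            # New entry: previous string + first character of this block. Its
            # history is the previous history plus the state reached after that
            # first character, which is history[1] of the current block.
            p_str, p_hist, p_off = prev
            extra = [len(p_str) + 1] if history[1] == accept else []
            strings.append(p_str + s[0])
            entries.append((list(p_hist) + [history[1]], list(p_off) + extra))
        hits.extend(pos + off for off in offsets)
        state, pos = history[-1], pos + len(s)
        prev = (s, history, offsets)
    return hits, transitions
```

With `alphabet="ab"`, the code sequence `[0, 1, 2, 4, 3, 6, 5, 2]` decodes to `ab` repeated nine times; for the pattern `aba`, the block with code 5 (`abab`) is reached in the same state in which its history was recorded, so it is skipped without DFA transitions.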

Embodiment 17.
Embodiment 17 will be described with reference to FIG. 36.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

This embodiment describes another example of the contents of the state history stored in the state transition storage unit 112 (an example of a history storage unit).

FIG. 36 is a configuration diagram showing the structure of the state history stored in the state transition storage unit 112 of the compressed text search apparatus according to this embodiment. In this embodiment, the state transition storage unit 112 is configured to store a plurality of state transition histories for one entry of the compression dictionary.

One entry of the state transition storage unit of the compressed text search apparatus of this embodiment stores the reference number 2201 of the entry, a state transition history 2202, and an acceptance position 2203. The reference number 2201 is an identifier that corresponds one-to-one with a reference number of the compression dictionary.
The state transition history records the state transitions that the state transition machine makes on the character string of the corresponding compression dictionary entry. Its first element is the state immediately before the character string is read, and its last element is the state immediately after the entire character string has been read.
The acceptance position 2203 indicates where in the state transition history the state transition machine reached an accepting state.
In the compressed text search apparatus of this embodiment, one entry of the state transition storage unit stores zero or more pairs of a state transition history 2202 and an acceptance position 2203. Such a pair is called a record.

The compressed text search apparatus of this embodiment adds, for example, the state transition history and acceptance position obtained in step S207 of FIG. 7 described in Embodiment 1 as a new record of the entry of the state transition storage unit that the compressed block references.

When only one state history is stored for each compressed block, the number of state transitions for the reference character string of the compressed block can be reduced only when the current state stored in the state storage unit 111 matches the first state of the state transition history in the state transition storage unit 112 that the compressed block references.
Because the compressed text search apparatus of this embodiment, configured as described above, can store a plurality of state transition histories, the probability of a match increases when it determines whether the current state in the state storage unit matches the first state of a state transition history referenced by the compressed block. The number of state transitions is therefore more likely to be reduced, and the search processing can be made faster.

The compressed text search apparatus described here is characterized by storing a plurality of state transition histories for one entry of the compression dictionary.
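
A store that keeps several records per dictionary entry might look like the following. This is a minimal Python sketch under assumptions of our own: the class name `MultiRecordStore` and its methods are hypothetical, states are integers, and at most one record is kept per distinct head state (the source only says that zero or more records are stored per entry).

```python
class MultiRecordStore:
    """State transition store holding zero or more records per dictionary entry.

    A record is a pair (state transition history, acceptance offsets).
    """

    def __init__(self):
        self._records = {}  # reference number -> list of records

    def lookup(self, ref, state):
        """Return the record of entry `ref` whose history starts in `state`, or None."""
        for history, offsets in self._records.get(ref, []):
            if history[0] == state:
                return history, offsets
        return None

    def add(self, ref, history, accept_offsets):
        """Add a record unless one with the same head state is already stored."""
        if self.lookup(ref, history[0]) is None:
            record = (tuple(history), tuple(accept_offsets))
            self._records.setdefault(ref, []).append(record)
```

A search loop would call `lookup(ref, current_state)` first and fall back to running the DFA, followed by `add`, only when no record matches.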

Embodiment 18.
Embodiment 18 will be described with reference to FIG. 37.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

FIG. 37 is a configuration diagram showing the structure of the state history stored in the state transition storage unit 112 of the compressed text search apparatus according to this embodiment. In this embodiment, the state transition storage unit 112 is configured to store a plurality of state transition histories for the sliding window.

The state transition storage unit of the compressed text search apparatus of this embodiment holds a plurality of records, each consisting of a state transition history 2304 that stores (sliding window length + 1) states and an acceptance position 2305.

The flow of search processing of the compressed text search apparatus of this embodiment will now be described.

The compressed text search apparatus described here is characterized by storing a plurality of state transition histories for one character string in the dictionary, that is, for the sliding window of a compression format derived from the LZ77 format or the LZSS format.

Embodiment 19.
Embodiment 19 will be described with reference to FIG. 38.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

FIG. 38 is a configuration diagram showing the structure of the state history stored in the state transition storage unit 112 of the compressed text search apparatus according to this embodiment. In this embodiment, the state transition storage unit 112 is configured to store only the first and last states of the state transition history.

The state transition storage unit of the compressed text search apparatus of this embodiment stores a reference number 2401, the first state 2402 of the state transition history, the last state 2403 of the state transition history, and an acceptance position 2404. The apparatus sets in the state transition storage unit, for example, only the first and last states and the acceptance position of the state transition history obtained in step S207 of FIG. 7 described in Embodiment 1.

When the state transition storage unit 112 stores the entire state transition history, it needs a storage area that grows with the number of entries in the compression dictionary and the length of the character strings in the compression dictionary. Consequently, when the number of entries or the length of the character strings becomes large, the storage capacity of memory and the like may come under pressure.
The compressed text search apparatus of this embodiment needs to know only the first state and the last state of the state transitions and the acceptance position.
With the configuration described above, the size of the storage area required by the state transition storage unit does not depend on the length of the character strings in the compression dictionary and can be held to a constant multiple of the number of entries in the compression dictionary.

The compressed text search apparatus described here is characterized in that, when storing the state transition history for a character string in the dictionary, it stores only the first state and the last state of the history.
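
Reducing a full history to its first state, last state, and acceptance positions can be sketched as below. This is a minimal Python sketch under assumptions of our own: the names `Summary`, `summarize`, and `advance` are hypothetical, states are integers, and the length of the dictionary string is supplied by the caller (for example from the compression dictionary).

```python
from typing import NamedTuple, Tuple


class Summary(NamedTuple):
    head: int                        # state before the dictionary string is read
    tail: int                        # state after the whole string has been read
    accept_offsets: Tuple[int, ...]  # offsets at which an accepting state was reached


def summarize(history, accept_state):
    """Reduce a full state transition history to head, tail, and acceptance offsets."""
    offsets = tuple(i for i, s in enumerate(history) if i > 0 and s == accept_state)
    return Summary(history[0], history[-1], offsets)


def advance(state, pos, summary, length):
    """Skip a dictionary string of `length` characters if the summary applies.

    Returns (new_state, new_pos, hit_positions), or None when the head state
    differs and the string has to be fed to the DFA character by character.
    """
    if summary.head != state:
        return None
    return summary.tail, pos + length, [pos + off for off in summary.accept_offsets]
```

Intermediate states are dropped, so the state part of each entry has a fixed size; in this sketch the list of acceptance offsets is still kept in full.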

Embodiment 20.
The appearance, hardware configuration, and block configuration of the compressed text search apparatus 100 (an example of a character string search apparatus) in this embodiment are the same as those described in Embodiment 1, so their description is omitted here.

In the compressed text search apparatus of this embodiment, the state transition storage unit 112 sets a state transition history only when the length of the reference character string of the compressed block is equal to or greater than a predetermined length.

For example, taking FIG. 4 as an example, if the state transition history is stored only when the length of the reference character string is 5 or more, only the first state transition is stored.

When the current state stored in the state storage unit 111 matches the first state of the state transition history in the state transition storage unit 112 that the compressed block references, the compressed text search apparatus of this embodiment can process the state transitions for the reference character string of the compressed block in one step. The number of processing steps saved grows with the length of the character string that the compressed block references. In other words, when the reference character string of the compressed block is short, the reduction in processing steps is small.

With the configuration described above, the compressed text search apparatus of this embodiment does not need to store state transition histories for entries whose character strings in the compression dictionary are short, so the storage area required by the state transition storage unit can be reduced.

The compressed text search apparatus described here is characterized in that, when storing state transition histories for the character strings in the dictionary, it stores a state transition history only for character strings of at least a predetermined length.
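
The length threshold amounts to a guard in front of the store. The following Python sketch is illustrative only: the function `maybe_store`, the dictionary-based `store`, and the default threshold of 5 (borrowed from the example given for FIG. 4) are assumptions, not part of the source.

```python
MIN_REFERENCE_LENGTH = 5  # hypothetical default, matching the "5 or more" example


def maybe_store(store, ref, reference_string, history, min_length=MIN_REFERENCE_LENGTH):
    """Store `history` for entry `ref` only if the reference string is long enough.

    Returns True when the history was stored and False when it was skipped.
    """
    if len(reference_string) < min_length:
        return False  # short strings save few transitions, so they are not stored
    store[ref] = list(history)
    return True
```

Entries that are skipped simply take the ordinary path in which the reference character string is fed to the DFA one character at a time.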

The configurations of the state history stored in the state transition storage unit 112 described in Embodiments 17 to 20 may be used in combination.

実施の形態１における圧縮テキスト検索装置１００（文字列検索装置の一例）の外観の一例を示す図。FIG. 3 is a diagram illustrating an example of an appearance of a compressed text search device 100 (an example of a character string search device) in the first embodiment. 実施の形態１における圧縮テキスト検索装置のハードウェア構成の一例を示す図。2 is a diagram illustrating an example of a hardware configuration of a compressed text search device according to Embodiment 1. FIG. 実施の形態１における圧縮テキスト検索装置１００のブロック構成の一例を示すブロック図。FIG. 3 is a block diagram illustrating an example of a block configuration of the compressed text search apparatus 100 according to the first embodiment. 実施の形態１において、圧縮テキスト記憶部１０３が記憶した圧縮テキストの一例を示す図。FIG. 4 is a diagram showing an example of compressed text stored in a compressed text storage unit 103 in the first embodiment. 実施の形態１において、状態遷移記憶部１１２が記憶する状態履歴の構造を示す図。In Embodiment 1, the figure which shows the structure of the state history which the state transition memory | storage part 112 memorize | stores. 実施の形態１において、状態遷移表記憶部１０６が記憶する状態遷移表２００の一例を示す図。FIG. 6 is a diagram showing an example of a state transition table 200 stored in the state transition table storage unit 106 in the first embodiment. 実施の形態１における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図。FIG. 3 is a flowchart showing an example of a control flow of search processing of the compressed text search apparatus 100 according to the first embodiment. 図７のＳ１６における処理の詳細の一例を示すフローチャート図。The flowchart figure which shows an example of the detail of a process in S16 of FIG. 実施の形態２において、圧縮テキスト記憶部１０３及び状態遷移記憶部１１２が記憶する記憶内容の一例を示す図。FIG. 10 is a diagram illustrating an example of storage contents stored in a compressed text storage unit 103 and a state transition storage unit 112 in the second embodiment. 実施の形態２における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図。The flowchart figure which shows an example of the flow of search processing of the compressed text search device 100 in Embodiment 2. ＬＺ７７形式による圧縮テキストの構造を示す図。The figure which shows the structure of the compression text by LZ77 format. 
実施の形態３における、圧縮辞書記憶部１１３（辞書記憶部の一例）と状態遷移記憶部１１２（履歴記憶部の一例）の記憶する情報を示す図。The figure which shows the information which the compression dictionary memory | storage part 113 (an example of a dictionary memory | storage part) and the state transition memory | storage part 112 (an example of a history memory | storage part) memorize | store in Embodiment 3. FIG. 実施の形態３の圧縮テキスト検索装置１００における検索処理の流れ図。10 is a flowchart of search processing in the compressed text search apparatus 100 according to the third embodiment. 実施の形態４において、圧縮テキスト記憶部１０３及び状態遷移記憶部１１２の記憶内容の一例を示す図。In Embodiment 4, it is a figure which shows an example of the memory content of the compressed text memory | storage part 103 and the state transition memory | storage part 112. FIG. 実施の形態４における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図。The flowchart figure which shows an example of the flow of control of the search process of the compressed text search device 100 in Embodiment 4. ＬＺＳＳ形式による圧縮テキストの構造を示す図。The figure which shows the structure of the compression text by a LZSS format. 実施の形態６において、圧縮テキスト記憶部１０３及び圧縮辞書記憶部１１３（辞書記憶部の一例）及び状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図。In Embodiment 6, it is a figure which shows an example of the memory content which the compressed text memory | storage part 103, the compression dictionary memory | storage part 113 (an example of a dictionary memory | storage part), and the state transition memory | storage part 112 (an example of a history memory | storage part) memorize | store. 実施の形態６における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図。FIG. 20 is a flowchart showing an example of a control flow of search processing of the compressed text search device 100 according to Embodiment 6. ＬＺ７８形式による圧縮テキストの構造を示す図。The figure which shows the structure of the compression text by LZ78 format. 実施の形態７の圧縮テキスト検索装置における検索処理の流れ図。18 is a flowchart of search processing in the compressed text search device according to the seventh embodiment. 
実施の形態８において、圧縮テキスト記憶部１０３及び圧縮辞書記憶部１１３（辞書記憶部の一例）が記憶する記憶内容の一例を示す図。In Embodiment 8, it is a figure which shows an example of the memory content which the compressed text memory | storage part 103 and the compression dictionary memory | storage part 113 (an example of a dictionary memory | storage part) memorize | store. 実施の形態８において、状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図。In Embodiment 8, it is a figure which shows an example of the memory content which the state transition memory | storage part 112 (an example of a log | history memory | storage part) memorize | stores. 実施の形態８における圧縮テキスト検索装置１００の検索処理の制御の流れの一例を示すフローチャート図。FIG. 19 is a flowchart showing an example of the flow of search processing performed by the compressed text search apparatus 100 according to the eighth embodiment. ＬＺＷ形式による圧縮テキストの構造を示す図。The figure which shows the structure of the compression text by a LZW format. 圧縮技術において取り扱う文字を表現したビット列のビット長と、検索装置において取り扱う文字を表現したビット列のビット長とが異なっている場合について説明するための説明図。Explanatory drawing for demonstrating the case where the bit length of the bit string expressing the character handled in compression technology differs from the bit length of the bit string expressing the character handled in the search device. 実施の形態１０における圧縮テキスト検索装置１００のブロック構成の一例を示すブロック図。FIG. 25 is a block diagram showing an example of a block configuration of the compressed text search device 100 according to the tenth embodiment. 実施の形態１０において、圧縮テキスト記憶部１０３、圧縮辞書記憶部１１３（辞書記憶部の一例）、状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図。In Embodiment 10, the figure which shows an example of the memory content which the compressed text memory | storage part 103, the compression dictionary memory | storage part 113 (an example of a dictionary memory | storage part), and the state transition memory | storage part 112 (an example of a history memory | storage part) memorize | store. 
実施の形態１１において、圧縮テキスト記憶部１０３、圧縮辞書記憶部１１３（辞書記憶部の一例）、状態遷移記憶部１１２（履歴記憶部の一例）が記憶する記憶内容の一例を示す図。In Embodiment 11, it is a figure which shows an example of the memory content which the compressed text memory | storage part 103, the compression dictionary memory | storage part 113 (an example of a dictionary memory | storage part), and the state transition memory | storage part 112 (an example of a history memory | storage part) memorize | store. 実施の形態１２による圧縮テキストの構造を示す図。FIG. 19 shows a structure of compressed text according to the twelfth embodiment. 実施の形態１２において、状態遷移記憶部１１２が記憶する状態履歴の構造を示す構成図。In Embodiment 12, it is a block diagram which shows the structure of the state history which the state transition memory | storage part 112 memorize | stores. 状態遷移表記憶部１０６が記憶する状態遷移表の一例を示す図。The figure which shows an example of the state transition table which the state transition table memory | storage part 106 memorize | stores. 実施の形態１２の圧縮テキスト検索装置における検索処理の流れ図。19 is a flowchart of search processing in the compressed text search device according to the twelfth embodiment. 図３２の検索処理の流れにおけるステップＳ１５１１の処理の流れ図。The flowchart of the process of step S1511 in the flow of the search process of FIG. 実施の形態１３による、圧縮辞書記憶部１１３と状態遷移記憶部１１２の記憶する情報を示す図。The figure which shows the information which the compression dictionary memory | storage part 113 and the state transition memory | storage part 112 memorize | store according to Embodiment 13. FIG. 実施の形態１３の圧縮テキスト検索装置における検索処理の流れ図。18 is a flowchart of search processing in the compressed text search device according to the thirteenth embodiment. 実施の形態１７による圧縮テキスト検索装置の、状態遷移記憶部１１２が記憶する状態履歴の構造を示す構成図。The block diagram which shows the structure of the state history which the state transition memory | storage part 112 of the compressed text search device by Embodiment 17 memorize | stores. 
A configuration diagram showing the structure of the state history stored by the state transition storage unit 112 of the compressed text search device according to Embodiment 18.
A configuration diagram showing the structure of the state history stored by the state transition storage unit 112 of the compressed text search device according to Embodiment 19.
A conceptual diagram showing an example of state transitions during the operation of a DFA.
An explanatory diagram of the state transitions of a DFA.
A diagram showing an example of the state transition table that the automaton execution unit stores in order to execute a DFA.
A flowchart showing an example of the processing flow of the automaton execution unit.
A conceptual diagram showing an example of a DFA.
A diagram showing an example of encoding in the LZSS method.
A flowchart showing an example of the control flow for restoring the original character string from a code string in a conventional example.
A diagram showing an example of encoding in the LZ77 method.
A diagram showing an example of encoding in a dictionary-reference compression method.
A diagram showing an example of encoding in the LZ78 method.
A flowchart showing an example of the control flow for restoring the original character string from a code string in a conventional example.
A diagram showing an example of encoding in the LZW method.

Explanation of symbols

100 compressed text search device, 102 search condition input unit, 103 compressed text storage unit, 104 collation result output unit, 105 state transition table generation unit, 106 state transition table storage unit, 107 collation unit, 108 compressed block acquisition unit, 109 character acquisition unit, 110 state transition machine, 111 state storage unit, 112 state transition storage unit, 113 compression dictionary storage unit, 114 condition determination unit, 115 current position counter, 116 transition destination calculation unit, 117 search success determination unit, 121 incomplete character restoration unit, 122 byte data storage unit, 200 state transition table, 300 compressed block string, 500 original character string, 600 code string, 650 replacement dictionary, 901 CRT display device, 902 K/B, 903 mouse, 904 FDD, 905 CDD, 906 printer device, 907 scanner device, 910 system unit, 911 CPU, 912 bus, 913 ROM, 914 RAM, 915 communication board, 920 magnetic disk device, 921 OS, 922 window system, 923 program group, 924 file group, 931 telephone, 932 FAX machine, 940 Internet, 941 gateway, 942 LAN.

Claims

A character string search device that acquires a code string, obtained by replacing a partial character string included in a character string with a predetermined code corresponding to the partial character string, and searches the character string for a search character string by executing an automaton, the automaton holding a state, receiving a character as input, calculating a transition destination state based on the held state and the input character, and updating the held state to the calculated transition destination state, and the automaton being configured to determine whether a search character string matching a predetermined search pattern is included in the character string by determining whether the held state is a predetermined state when the characters constituting the character string are input, the character string search device comprising:
an automaton execution unit that executes the automaton;
a history storage unit that stores the state held by the automaton as a state history;
a code acquisition unit that acquires a code constituting the code string;
a condition determination unit that determines whether a first condition and a second condition are satisfied, based on the state held by the automaton, the state history stored in the history storage unit, and the code acquired by the code acquisition unit;
a transition destination calculation unit that, when the condition determination unit determines that the first condition is satisfied, calculates the transition destination state based on the state history stored in the history storage unit and updates the state held by the automaton to the calculated transition destination state; and
a character string restoration unit that, when the condition determination unit determines that the second condition is satisfied, restores the partial character string corresponding to the code acquired by the code acquisition unit and inputs the characters constituting the partial character string to the automaton.
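As an illustration only, the two branches recited above can be sketched in Python. The toy DFA `DELTA` (which detects the substring "ab"), the function name `search_codes`, the dictionary layout, and the choice to key the history by the pair (code, state before input) are all assumptions made for this sketch and do not come from the specification. The accepting state is made absorbing so that skipping over a partial character string cannot lose a match.

```python
from typing import Dict, List, Tuple

# Assumed toy DFA: detects the substring "ab" over the alphabet {a, b, c}.
# State 2 is accepting and absorbing (an assumption of this sketch).
DELTA: Dict[Tuple[int, str], int] = {
    (0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
    (1, "a"): 1, (1, "b"): 2, (1, "c"): 0,
    (2, "a"): 2, (2, "b"): 2, (2, "c"): 2,
}
ACCEPT = 2


def search_codes(codes: List[int], dictionary: Dict[int, str]) -> Tuple[bool, int]:
    """Return (found, number of codes that had to be decompressed)."""
    state = 0
    # Hypothetical history layout: (code, state before input) -> state after input.
    history: Dict[Tuple[int, int], int] = {}
    decompressed = 0
    for code in codes:
        key = (code, state)
        if key in history:
            # First condition: reuse the stored history, no decompression.
            state = history[key]
        else:
            # Second condition: restore the substring and feed it to the DFA.
            decompressed += 1
            for ch in dictionary[code]:
                state = DELTA[(state, ch)]
            history[key] = state
    return state == ACCEPT, decompressed
```

With the hypothetical dictionary `{0: "ca", 1: "cc", 2: "b"}`, the code string `[1, 1, 1, 0, 2]` stands for "cccccccab"; only three of the five codes are decompressed, because the second and third occurrences of code 1 arrive in a state already recorded in the history.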

The character string search device according to claim 1, wherein
the automaton execution unit executes an automaton for which the transition destination state can be calculated uniquely.

The character string search device according to claim 1, wherein
the automaton execution unit executes an automaton configured to search for a search character string matching a search pattern that expresses at least one of concatenation, alternation, and repetition of characters.

The character string search device according to claim 1, wherein the character string search device acquires a code string in which, when a partial character string included in the character string matches another partial character string included in the character string, the partial character string has been replaced with a code including information on a pointer to the other partial character string, and
the character string restoration unit,
when it determines that the code acquired by the code acquisition unit includes information on a pointer to the other partial character string, restores the other partial character string as the partial character string corresponding to the code, and
when it determines that the code acquired by the code acquisition unit does not include information on a pointer to the other partial character string, restores the character corresponding to the code as the character string corresponding to the code.

The character string search device according to claim 1, further comprising
a dictionary storage unit that stores, as a replacement dictionary, the correspondence between a partial character string and the code corresponding to the partial character string, wherein
the character string restoration unit
obtains the partial character string corresponding to the code based on the replacement dictionary stored in the dictionary storage unit.

The character string search device according to claim 5, wherein
the code acquisition unit
acquires the codes constituting the code string from a code string that includes information on the replacement dictionary, and
the dictionary storage unit
obtains the information on the replacement dictionary from the code acquired by the code acquisition unit and stores it.

The character string search device according to claim 1, wherein
the condition determination unit,
for the code acquired by the code acquisition unit, acquires from the state history stored in the history storage unit the state that the automaton held before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires, based on the state history stored in the history storage unit, the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the state held by the automaton as a state history, and
when the condition determination unit determines that the first condition is satisfied, acquires from the state history the states that the automaton held when the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, and stores the acquired states as a state history.
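The position-based state history described above, combined with pointer codes of the kind recited in claim 4, might be sketched as follows. This is only one possible reading: the token format (a literal character, or a `(distance, length)` pointer), the names `search_pointer_codes`, `hist`, and `src`, and the toy DFA are assumptions of this sketch. `hist[i]` records the state after `i` characters, and `src` lets a skipped character still be resolved later if a subsequent pointer refers to it.

```python
from typing import Dict, List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # literal char, or (distance, length) pointer

# Assumed toy DFA: detects the substring "ab" over the alphabet {a, b, c}.
DELTA: Dict[Tuple[int, str], int] = {
    (0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
    (1, "a"): 1, (1, "b"): 2, (1, "c"): 0,
    (2, "a"): 2, (2, "b"): 2, (2, "c"): 2,
}
ACCEPT = 2


def search_pointer_codes(tokens: List[Token]) -> Tuple[bool, int]:
    """Return (found, number of characters actually fed to the DFA)."""
    hist: List[int] = [0]            # hist[i] = state after i characters
    src: List[Union[str, int]] = []  # literal char, or the earlier position it copies
    fed = 0

    def char_at(i: int) -> str:
        # Follow copy links back to the literal that defines position i.
        while isinstance(src[i], int):
            i = src[i]
        return src[i]

    for tok in tokens:
        if isinstance(tok, str):
            src.append(tok)
            hist.append(DELTA[(hist[-1], tok)])
            fed += 1
            continue
        dist, length = tok
        start = len(src) - dist
        # First condition: the state before the referenced substring
        # equals the state currently held by the automaton.
        reuse = hist[start] == hist[-1]
        for k in range(length):
            src.append(start + k)
            if reuse:
                # Copy the recorded states instead of running the DFA.
                hist.append(hist[start + k + 1])
            else:
                # Second condition: restore the character and run the DFA.
                hist.append(DELTA[(hist[-1], char_at(start + k))])
                fed += 1
    return ACCEPT in hist, fed
```

For example, the tokens `["c", "c", (2, 4), "a", "b"]` stand for "ccccccab"; the four copied characters are never fed to the DFA, because the state before the referenced substring matches the current state.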

The character string search device according to claim 1, wherein
the condition determination unit,
for the code acquired by the code acquisition unit, acquires from the state history stored in the history storage unit the state that the automaton held before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires from the state history stored in the history storage unit the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the code acquired by the code acquisition unit and the state held by the automaton in association with each other as a state history.

The character string search device according to claim 8, further comprising
a dictionary storage unit that stores, as a replacement dictionary, the correspondence between a partial character string and the code corresponding to the partial character string, and that further stores, as the replacement dictionary, for a partial character string composed of a forward character string and a backward character string, the correspondence between the pair of the code corresponding to the forward character string and the backward character string, and the code corresponding to the partial character string, wherein
the character string restoration unit, when the condition determination unit determines that the second condition is satisfied,
obtains the partial character string and inputs the characters constituting the partial character string to the automaton, when the dictionary storage unit stores the partial character string corresponding to the code, and
obtains the code corresponding to the forward character string and the backward character string, and inputs the characters constituting the backward character string to the automaton, when the dictionary storage unit stores the code corresponding to the forward character string and the backward character string, and
the code acquisition unit further
acquires, as the code, the code corresponding to the forward character string obtained by the character string restoration unit.
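A dictionary whose entries are either a full substring or a pair of (code for the forward character string, backward character string) resembles the LZ78 and LZW families. A minimal sketch under that assumption follows. The names `run_code` and `Entry`, the one-entry-per-code history layout `(state before, state after)`, the recursive processing order (forward code first, then the backward string), and the toy DFA are all choices made for this illustration, not details taken from the claims.

```python
from typing import Dict, Tuple, Union

# A dictionary entry: a full substring, or (forward code, backward string).
Entry = Union[str, Tuple[int, str]]

# Assumed toy DFA: detects the substring "ab"; state 2 is accepting and absorbing.
DELTA: Dict[Tuple[int, str], int] = {
    (0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
    (1, "a"): 1, (1, "b"): 2, (1, "c"): 0,
    (2, "a"): 2, (2, "b"): 2, (2, "c"): 2,
}


def run_code(code: int, state: int,
             dictionary: Dict[int, Entry],
             history: Dict[int, Tuple[int, int]]) -> int:
    """Advance the DFA over one code and return the resulting state."""
    recorded = history.get(code)
    if recorded is not None and recorded[0] == state:
        # First condition: same starting state as recorded, so jump ahead.
        return recorded[1]
    # Second condition: restore (part of) the substring.
    start = state
    entry = dictionary[code]
    if isinstance(entry, str):
        suffix = entry
    else:
        forward_code, suffix = entry
        # The forward code is itself handled as a code, so it may hit the history.
        state = run_code(forward_code, state, dictionary, history)
    for ch in suffix:
        state = DELTA[(state, ch)]
    history[code] = (start, state)  # code associated with the states
    return state
```

With the hypothetical dictionary `{0: "a", 2: (0, "b"), 3: (2, "c")}`, code 3 expands to "abc"; processing it once also records history entries for codes 0 and 2, so later occurrences of any of the three codes in the same starting state skip decompression.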

The character string search device according to claim 1, wherein the character string search device acquires a code string in which a partial character string composed of characters whose bit length differs from the bit length of the characters input to the automaton has been replaced with the code corresponding to the partial character string, the character string search device further comprising
an incomplete character storage unit that stores an incomplete character, wherein
the character string restoration unit
determines whether the restored partial character string contains an incomplete character that cannot be input to the automaton because of the bit-length mismatch and, when it determines that there is one, stores the incomplete character in the incomplete character storage unit.

The character string search device according to claim 10, wherein
the condition determination unit,
for the code acquired by the code acquisition unit, acquires from the state history stored in the history storage unit the state that the automaton held, and the incomplete character that the incomplete character storage unit stored, before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, compares the acquired incomplete character with the incomplete character stored in the incomplete character storage unit, and determines that the first condition is satisfied when both match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires, based on the state history stored in the history storage unit, the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the state held by the automaton and the incomplete character stored in the incomplete character storage unit as a state history, and
when the condition determination unit determines that the first condition is satisfied, acquires from the state history the states that the automaton held, and the incomplete characters that the incomplete character storage unit stored, when the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, and stores them as a state history.

The character string search device according to claim 10, wherein
the condition determination unit,
for the code acquired by the code acquisition unit, acquires from the state history stored in the history storage unit the state that the automaton held, and the incomplete character that the incomplete character storage unit stored, before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, compares the acquired incomplete character with the incomplete character stored in the incomplete character storage unit, and determines that the first condition is satisfied when both match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires from the state history stored in the history storage unit the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the code acquired by the code acquisition unit, in association with the state held by the automaton and the incomplete character stored in the incomplete character storage unit, as a state history.

The character string search device according to claim 10, further comprising
an incomplete character restoration unit that, when the incomplete character storage unit stores an incomplete character, restores, as another incomplete character, the portion of the partial character string corresponding to the code acquired by the code acquisition unit that forms a character that can be input to the automaton when combined with the stored incomplete character, and inputs the character obtained by combining the incomplete character and the other incomplete character to the automaton.
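The handling of incomplete characters can be illustrated with a small sketch. It assumes, purely for the example, that the compression works on 8-bit units while the automaton consumes fixed 16-bit characters; the function name `split_complete` and the `width` parameter are hypothetical. The function joins the stored incomplete character with the newly restored bytes, returns the characters that are now complete, and returns any trailing bytes as the new incomplete character.

```python
from typing import List, Tuple


def split_complete(pending: bytes, chunk: bytes,
                   width: int = 2) -> Tuple[List[bytes], bytes]:
    """Combine a stored incomplete character with a restored substring.

    pending: bytes left over from the previous code (the incomplete character)
    chunk:   bytes restored from the current code
    width:   assumed byte width of one character accepted by the automaton
    """
    data = pending + chunk
    usable = len(data) - len(data) % width
    # Characters that can now be input to the automaton.
    chars = [data[i:i + width] for i in range(0, usable, width)]
    # Whatever remains cannot be input yet and is stored as incomplete.
    return chars, data[usable:]
```

For instance, restoring three bytes yields one complete character and one leftover byte; when the next code restores three more bytes, the leftover byte is combined with the first of them, and no bytes remain pending.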

The character string search device according to claim 13, wherein
the condition determination unit,
when the incomplete character restoration unit has not input a character to the automaton, acquires, for the code acquired by the code acquisition unit, from the state history stored in the history storage unit the state that the automaton held before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match,
when the incomplete character restoration unit has input a character to the automaton, acquires, for the code acquired by the code acquisition unit, from the state history stored in the history storage unit the state that the automaton held before the portion of the characters constituting the partial character string corresponding to the code, excluding the other incomplete character restored by the incomplete character restoration unit, was input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires, based on the state history stored in the history storage unit, the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the state held by the automaton and the incomplete character stored in the incomplete character storage unit as a state history, and
when the condition determination unit determines that the first condition is satisfied, acquires from the state history the states that the automaton held, and the incomplete characters that the incomplete character storage unit stored, when the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, and stores them as a state history.

The character string search device according to claim 13, wherein
the condition determination unit,
when the incomplete character restoration unit has not input a character to the automaton, acquires, for the code acquired by the code acquisition unit, from the state history stored in the history storage unit the state that the automaton held before the characters constituting the partial character string corresponding to the code were input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match,
when the incomplete character restoration unit has input a character to the automaton, acquires, for the code acquired by the code acquisition unit, from the state history stored in the history storage unit the state that the automaton held before the portion of the characters constituting the partial character string corresponding to the code, excluding the other incomplete character restored by the incomplete character restoration unit, was input to the automaton, compares the acquired state with the state held by the automaton, and determines that the first condition is satisfied when the two match, and
determines that the second condition is satisfied when it determines that the first condition is not satisfied,
the transition destination calculation unit,
when the condition determination unit determines that the first condition is satisfied, acquires from the state history stored in the history storage unit the state to which the automaton was updated after the characters constituting the partial character string corresponding to the code acquired by the code acquisition unit were input to the automaton, takes that state as the transition destination state, and updates the state held by the automaton to the transition destination state, and
the history storage unit,
when the condition determination unit determines that the second condition is satisfied, stores the code acquired by the code acquisition unit, in association with the state held by the automaton and the incomplete character stored in the incomplete character storage unit, as a state history.