JP2003242179A

JP2003242179A - Character string collating method, document processing device using the method and program

Info

Publication number: JP2003242179A
Application number: JP2002028736A
Authority: JP
Inventors: Isamu Hasegawa; 勇長谷川
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2002-02-05
Filing date: 2002-02-05
Publication date: 2003-08-29
Anticipated expiration: 2022-02-05
Also published as: JP3852757B2

Abstract

PROBLEM TO BE SOLVED: To enable POSIX regular expressions that include partial regular expressions and longest matching, or regular expressions that include multi-character collating elements, to be processed by a deterministic finite state automaton.

SOLUTION: This document processing system is provided with an automaton construction unit 210, which constructs a nondeterministic finite state automaton from the regular expression of a character string and constructs a deterministic finite state automaton based on the nondeterministic finite state automaton, and an automaton determination unit 240, which matches character strings using the deterministic finite state automaton. For a matched character string, the automaton determination unit 240 specifies the match range of the character string using the nondeterministic finite state automaton and the deterministic finite state automaton. The automaton determination unit 240 also performs matching for each element of the character string being processed while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a technique for performing pattern matching (character string matching) using a finite state automaton, and in particular to a technique for executing pattern matching based on a regular expression of a character string with a deterministic finite state automaton.

[0002]

[Description of the Related Art] In computer-based text processing tools, regular expressions are widely used as a mechanism for the pattern matching used in character string searches and the like, that is, for finding and manipulating patterns in character strings. For example, the regular expression "(ab)*" matches zero or more repetitions of "ab" ("", "ab", "abab", ...), and "[acd]" matches "a", "c", or "d". Further, "." matches any single character, and "ab|cd" matches either "ab" or "cd".
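The four example patterns above can be checked with a short sketch. It assumes Python's standard `re` module, which is a backtracking engine rather than a POSIX implementation; for these simple patterns the two agree. The table `EXAMPLES` and the helper `fully_matches` are names introduced here for illustration only.

```python
import re

# Each entry: (pattern, strings that should match in full, strings that should not).
EXAMPLES = [
    (r"(ab)*", ["", "ab", "abab"], ["a", "aba"]),
    (r"[acd]", ["a", "c", "d"], ["b", "ac"]),
    (r".", ["x", "7"], ["", "xy"]),
    (r"ab|cd", ["ab", "cd"], ["ad", "abcd"]),
]


def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


def check_examples() -> bool:
    """Return True when every example behaves as the description states."""
    for pattern, accepted, rejected in EXAMPLES:
        if not all(fully_matches(pattern, s) for s in accepted):
            return False
        if any(fully_matches(pattern, s) for s in rejected):
            return False
    return True
```

`re.fullmatch` is used so that the whole input must match, which corresponds to an automaton accepting or rejecting a complete string.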

[0003] When processing such as a character string search is performed, a finite state automaton, which serves as the evaluator for pattern matching by a regular expression, is created from the regular expression and run against the character string to be searched. Finite state automata include nondeterministic finite state automata (NFA) and deterministic finite state automata (DFA), and a DFA can be created from an NFA.

[0004] FIG. 34 shows an example of an NFA that matches the regular expression "([ab]c)*ac". FIG. 35 shows a DFA equivalent to the NFA of FIG. 34. When the input is, for example, "acac" or "ac", these finite state automata accept it because it matches the regular expression "([ab]c)*ac". When the input is "aa", they do not accept it because it does not match the regular expression. In general, multiple NFAs correspond to a given regular expression, whereas the DFA is uniquely determined. In actual processing it is preferable to use a DFA, which offers superior processing speed. Therefore, in a character string search, for example, the search condition is written as a regular expression, an NFA is created from the regular expression, a DFA is created from the NFA, and the search is performed by running the DFA. Pattern matching using regular expressions and finite state automata is described in detail in, for example, the following reference.

Reference: V. J. Rayward-Smith, 言語理論入門 (Introduction to Language Theory), translation supervised by Kenzo Inoue, Kyoritsu Shuppan, 1986.
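The NFA-to-DFA step described above can be sketched with the standard subset construction. FIGS. 34 and 35 are not reproduced in this text, so the ε-free NFA below is a hand-written assumption for "([ab]c)*ac" and is not necessarily the automaton drawn in the figures. All names (`NFA_DELTA`, `build_dfa`, `dfa_accepts`) are hypothetical.

```python
from collections import deque

# Assumed epsilon-free NFA for "([ab]c)*ac".
# State 0: start / loop entry, 1: after [ab] inside the loop,
# state 2: after the 'a' of the trailing "ac", 3: accepting.
NFA_START = 0
NFA_ACCEPT = {3}
NFA_DELTA = {
    (0, "a"): {1, 2},
    (0, "b"): {1},
    (1, "c"): {0},
    (2, "c"): {3},
}
ALPHABET = "abc"


def build_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({NFA_START})
    delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in ALPHABET:
            target = frozenset(
                t for s in current for t in NFA_DELTA.get((s, symbol), ())
            )
            if not target:
                continue  # the dead state is left implicit
            delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, delta


def dfa_accepts(text: str) -> bool:
    """Run the DFA over `text`; one table lookup per input character."""
    state, delta = build_dfa()
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return bool(state & NFA_ACCEPT)
```

Under this assumed NFA the construction yields four reachable DFA states, and the DFA reports only acceptance or rejection, with no record of which part of the pattern consumed which character.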

[0005]

[Problems to Be Solved by the Invention] In UNIX (registered trademark) systems, POSIX (Portable Operating System Interface for UNIX), a standard API (Application Program Interface), is defined by the IEEE to ensure compatibility between operating systems. Regular expressions conforming to POSIX (hereinafter, POSIX regular expressions) have features, such as partial regular expressions (subexpressions) and longest matching, that go beyond the scope of regular expressions in the original sense. It has therefore sometimes been difficult to implement with a DFA the pattern matching functions used for character string searches and the like. POSIX has a longest-leftmost rule: the match is made against the leftmost substring in the given character string, and each element of the regular expression matches the longest possible character string. Consequently, when the input "acac" is given to the POSIX regular expression "([ab]c)*ac", for example, "([ab]c)*" is matched as long as possible, and information can be obtained about the part matched by the partial regular expression "([ab]c)" (the leading "ac" of the input). Such information may be needed, for instance, when POSIX-compliant functions are used. However, this information is lost when the NFA is converted to a DFA, and it is no longer possible to tell which element of the regular expression matched which character. The DFA therefore could not be used, and pattern matching had to be performed with the slower NFA.
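The kind of submatch information at stake can be illustrated as follows. This is only a sketch: it uses Python's `re` module, which is a backtracking engine and does not implement the POSIX longest-leftmost rule in general, although for this particular pattern and input it reports the same ranges as the text describes. The helper name `submatch_spans` is introduced here for illustration.

```python
import re


def submatch_spans(pattern: str, text: str):
    """Return [(start, end), ...] for the whole match and each parenthesized group.

    Index 0 is the whole match; index i is the i-th parenthesized group.
    Returns None when the text does not match in full.
    """
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return [m.span(i) for i in range(m.re.groups + 1)]


# For "([ab]c)*ac" on "acac": the whole input matches, and the
# parenthesized part ([ab]c) matched the leading "ac".
SPANS = submatch_spans(r"([ab]c)*ac", "acac")
```

A plain DFA run over the same input yields only accept or reject, so the ranges held in `SPANS` are exactly the information that has to be recovered by other means.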

[0006] Some text processing tools can also handle variable-length collating elements (hereinafter, multi-character collating elements), so that, for example, the regular expression "[a-b]" matches the elements 'a', 'b', and 'aa'. FIG. 36 shows an example of an NFA corresponding to a regular expression that includes a multi-character collating element. In this case, the length of the accepted characters differs between the transition from the initial state 0 to state 1 and the transition from state 0 to state 2 in FIG. 36, so a DFA cannot be constructed and used for pattern matching as is. One conceivable remedy is to enumerate the character strings that the multi-character collating elements can match ('a', 'b', and 'aa' for the regular expression "[a-b]" above) and to construct a DFA with a separate state transition for each. However, POSIX provides no API (Application Program Interface) that enumerates all the character strings matched by multi-character collating elements, so such a DFA cannot be constructed. In such cases, therefore, pattern matching had to be performed with the slower NFA.

[0007] An object of the present invention is therefore to enable POSIX regular expressions that include partial regular expressions and longest matching to be processed by a DFA (deterministic finite state automaton), and to realize a processing device (computer device) that does so. Another object is to enable regular expressions that include multi-character collating elements to be processed by a DFA, and to realize a processing device (computer device) that does so.

[0008]

[Means for Solving the Problems] To achieve the above objects, the present invention is realized as a character string collating method for matching character strings using a computer. The method includes the steps of: creating a nondeterministic finite state automaton from a regular expression of a character string; creating a deterministic finite state automaton based on the nondeterministic finite state automaton; matching character strings using the deterministic finite state automaton; and, for a matched character string, specifying the match range of the character string using the nondeterministic finite state automaton and the deterministic finite state automaton.

[0009] More specifically, the step of creating the nondeterministic finite state automaton includes the steps of: generating, for the regular expression of the character string, states of the nondeterministic finite state automaton in one-to-one correspondence with the elements of the regular expression, excluding the elements that designate a range within the regular expression; and assigning ε-transitions to the elements that denote repetition and the elements that denote alternation, while assigning to every other element a transition to the state associated with the next element. The step of specifying the match range includes, more specifically, the steps of: deleting, from among the state sequences representing the state transitions of the deterministic finite state automaton, unnecessary state sequences that do not reach the end state; and specifying the match range of the character string based on the remaining state sequences. More preferably, the step of specifying the match range includes a step of specifying the match range of the character string based on the state sequence that satisfies the longest-leftmost rule among the state sequences representing the state transitions of the deterministic finite state automaton.
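The idea of deleting state sequences that cannot reach the end state can be sketched as a backward pass over the recorded DFA states. The sketch assumes that each DFA state is stored as the set of NFA states it represents, and it reuses a hand-written ε-free NFA for "([ab]c)*ac"; the NFA described in this document uses ε-transitions, so this is a simplification, not the claimed construction. All names are hypothetical.

```python
# Assumed epsilon-free NFA for "([ab]c)*ac" (state 3 is the end state).
NFA_START = 0
NFA_ACCEPT = {3}
NFA_DELTA = {
    (0, "a"): {1, 2},
    (0, "b"): {1},
    (1, "c"): {0},
    (2, "c"): {3},
}


def run_and_record(text: str):
    """Run the DFA on the fly and record every DFA state that is visited.

    Each recorded DFA state is the frozenset of NFA states it stands for.
    """
    history = [frozenset({NFA_START})]
    for ch in text:
        nxt = frozenset(
            t for s in history[-1] for t in NFA_DELTA.get((s, ch), ())
        )
        history.append(nxt)
    return history


def prune(history, text: str):
    """Walk backward, keeping only NFA states that can still reach the end state."""
    kept = [set() for _ in history]
    kept[-1] = set(history[-1]) & NFA_ACCEPT
    for i in range(len(text) - 1, -1, -1):
        kept[i] = {
            s for s in history[i]
            if NFA_DELTA.get((s, text[i]), set()) & kept[i + 1]
        }
    return kept
```

For the input "acac", the recorded sets are {0}, {1, 2}, {0, 3}, {1, 2}, {0, 3}, and pruning leaves the single NFA path 0, 1, 0, 2, 3. That path shows that the loop body consumed the first two characters and the trailing "ac" consumed the last two. When several paths survive, a further selection rule such as the longest-leftmost rule is needed, which this sketch does not implement.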

[0010] The present invention is also realized as another character string collating method, as follows. The method includes: a first step of reading from memory a deterministic finite state automaton created based on a regular expression of a given character string, matching character strings using the deterministic finite state automaton, and storing the processing result in memory; a second step of narrowing the state sequences representing the state transitions of the deterministic finite state automaton down to state sequences that can reach the end state; and a third step of specifying, based on the narrowed-down state sequences, which character in the character string being processed matched which part of the regular expression. More preferably, the third step includes a step of selecting, from among the narrowed-down state sequences, the state sequence in which the earlier-appearing repetitions repeat the greatest number of times, and determining, based on that state sequence, which part of the regular expression each character in the character string being processed matched.

[0011] The present invention is further realized as another character string collating method, as follows. The method includes: a first step of reading from memory a deterministic finite state automaton created based on a regular expression of a given character string, matching character strings using the deterministic finite state automaton, and storing the processing result in memory; a second step of restoring, based on a first state sequence representing the state transitions of the deterministic finite state automaton, a second state sequence representing the state transitions in the nondeterministic finite state automaton; and a third step of obtaining, based on the restored second state sequence, information on which character in the character string matched which part of the regular expression. More preferably, the second step includes a step of restoring, as the second state sequence, a state sequence that satisfies the longest-leftmost rule.

[0012] The present invention is further realized as another character string collating method, as follows. The method includes the steps of: creating a nondeterministic finite state automaton from a regular expression of a character string; creating a deterministic finite state automaton based on the nondeterministic finite state automaton; and performing matching for each element of the character string being processed while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton.

[0013] More specifically, the step of performing the matching includes the steps of: reading ahead in the character string being processed to determine whether it contains a character string that may correspond to a multi-character collating element; and, when it does, dynamically determining the transition destination state so as to reflect the state transition that applies when that character string is a multi-character collating element. More preferably, the character string collating method further includes a step of obtaining, after the matching, information on the match range of the matched character string using the nondeterministic finite state automaton and the deterministic finite state automaton.
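The read-ahead idea can be illustrated with a deliberately small sketch. It assumes a hypothetical locale in which the bracket expression "[a-b]" matches the collating elements 'a', 'b', and 'aa', and it handles only patterns that are a plain sequence of bracket expressions. The function `matches` and the constants are illustrative names; this is not the automaton structure defined in this document.

```python
# Hypothetical locale data: collating elements matched by the bracket "[a-b]".
BRACKET_A_TO_B = {"a", "b", "aa"}
MAX_ELEMENT_LEN = 2  # longest collating element assumed to exist


def matches(pattern, text: str) -> bool:
    """Match `text` in full against `pattern`, a list of sets of collating elements.

    active[pos] holds the pattern positions (states) reachable after
    consuming text[:pos]. Transitions are decided while scanning the input.
    """
    active = [set() for _ in range(len(text) + 1)]
    active[0].add(0)
    for pos in range(len(text)):
        for state in active[pos]:
            if state == len(pattern):
                continue  # pattern exhausted but input remains
            # Read ahead: try every element length, not only one character.
            for length in range(1, MAX_ELEMENT_LEN + 1):
                chunk = text[pos:pos + length]
                if len(chunk) == length and chunk in pattern[state]:
                    active[pos + length].add(state + 1)
    return len(pattern) in active[len(text)]


# The pattern "[a-b]c" expressed as a sequence of element sets.
PATTERN = [BRACKET_A_TO_B, {"c"}]
```

With `PATTERN`, the inputs "ac", "bc", and "aac" match, while "abc" does not. The transition that consumes "aa" is added only because the read-ahead finds that the next two characters form a collating element, so the set of possible elements never has to be enumerated in advance.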

[0014] The present invention is further realized as another character string collating method, as follows. The method includes the steps of: creating a nondeterministic finite state automaton from a regular expression of a character string; creating a deterministic finite state automaton based on the nondeterministic finite state automaton; and reading ahead in the character string being processed and, when it contains a character string that may correspond to a multi-character collating element, generating state transitions of the deterministic finite state automaton for each element of the character string, virtually generating state transitions corresponding to the character string that may correspond to the multi-character collating element, and performing matching based on these state transitions.

[0015] Another aspect of the present invention that achieves the above objects is a document processing device that searches for character strings using regular expressions. The device includes: nondeterministic finite state automaton construction means for constructing a nondeterministic finite state automaton from a regular expression of a character string; deterministic finite state automaton construction means for constructing a deterministic finite state automaton based on the nondeterministic finite state automaton constructed by the nondeterministic finite state automaton construction means; and determination means for matching character strings using the deterministic finite state automaton constructed by the deterministic finite state automaton construction means. For a matched character string, the determination means further specifies the match range of the character string using the nondeterministic finite state automaton and the deterministic finite state automaton.

[0016] Here, the nondeterministic finite state automaton construction means generates, for the regular expression of the character string, states of the nondeterministic finite state automaton in one-to-one correspondence with the elements of the regular expression, excluding the elements that designate a range within the regular expression. It assigns ε-transitions to the elements that denote repetition and the elements that denote alternation, and assigns to every other element a transition to the state associated with the next element. Preferably, the determination means includes: state sequence narrowing means for deleting, from among the state sequences representing the state transitions of the deterministic finite state automaton, unnecessary state sequences that do not reach the end state; and match range determination means for specifying the match range of the character string based on the state sequences narrowed down by the state sequence narrowing means.

[0017] The present invention is also realized as another document processing device, as follows. A document processing device that searches for character strings using regular expressions includes: nondeterministic finite state automaton construction means for constructing a nondeterministic finite state automaton from a regular expression of a character string; deterministic finite state automaton construction means for constructing a deterministic finite state automaton based on the nondeterministic finite state automaton constructed by the nondeterministic finite state automaton construction means; and determination means for performing matching, using the deterministic finite state automaton constructed by the deterministic finite state automaton construction means, for each element of the character string being processed while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton.

[0018] More preferably, the determination means reads ahead in the character string being processed to determine whether it contains a character string that may correspond to a multi-character collating element and, when it determines that it does, dynamically determines the transition destination state so as to reflect the state transition that applies when that character string is a multi-character collating element. The determination means may also include: state sequence narrowing means for deleting, from among the state sequences representing the state transitions of the deterministic finite state automaton, unnecessary state sequences that do not reach the end state; and match range determination means for specifying the match range of the character string based on the state sequences narrowed down by the state sequence narrowing means.

[0019] The present invention can further be realized as a program that controls a computer to execute the character string collating methods described above, or as a program that causes a computer to function as the document processing device described above. Such a program can be provided by storing it on a magnetic disk, an optical disk, a semiconductor memory, or another recording medium for distribution, or by delivering it over a network.

[0020]

[Embodiments of the Invention] The present invention is described in detail below based on the first and second embodiments shown in the accompanying drawings. In the following description of the embodiments, "regular expression" means a POSIX regular expression (a regular expression conforming to POSIX).

[First Embodiment] In the first embodiment, a document processing system that performs character string searches using a DFA (deterministic finite state automaton), based on a search condition expressed as a regular expression, is realized on a computer. Once the DFA has been created, the system records the states through which the DFA transitions. From the recorded sequence of states it then obtains the sequence of states that would be followed if the corresponding NFA (nondeterministic finite state automaton) were used, and thereby recovers the necessary information that was lost in the conversion to the DFA. This makes high-speed pattern matching by a DFA possible even for the partial regular expressions and longest matching used in regular expressions.

[0021] FIG. 1 schematically shows an example of the hardware configuration of a computer device suitable for realizing the document processing system according to the first embodiment. The computer device shown in FIG. 1 includes: a CPU (Central Processing Unit) 101 serving as the computing means; a main memory 103 connected to the CPU 101 via an M/B (motherboard) chipset 102 and a CPU bus; a video card 104 likewise connected to the CPU 101 via the M/B chipset 102 and an AGP (Accelerated Graphics Port); a hard disk 105 and a network interface 106 connected to the M/B chipset 102 via a PCI (Peripheral Component Interconnect) bus; and a floppy (registered trademark) disk drive 108 and a keyboard/mouse 109 connected to the M/B chipset 102 from the PCI bus via a bridge circuit 107 and a low-speed bus such as an ISA (Industry Standard Architecture) bus. Although not shown in FIG. 1, the computer device also includes a clock oscillator and its controller as means for controlling the operating performance (operating clock) of the CPU 101, as described later. FIG. 1 merely illustrates one hardware configuration of a computer device that realizes the present embodiment, and various other configurations may be adopted as long as the present embodiment is applicable. For example, instead of the video card 104, only video memory may be mounted and the image data processed by the CPU 101; a sound mechanism for audio input and output may be provided; or a CD-ROM (Compact Disc Read Only Memory) or DVD-ROM (Digital Versatile Disc Read Only Memory) drive may be provided via an interface such as ATA (AT Attachment).

[0022] FIG. 2 is a block diagram illustrating the configuration of the document processing system according to the first embodiment. Referring to FIG. 2, the document processing system 200 of the present embodiment includes: an automaton construction unit 210 that creates a DFA for pattern matching of character strings; an automaton holding unit 220 that holds the created DFA; a document holding unit 230 that holds the document data to be searched; and an automaton determination unit 240 that executes the pattern matching in a character string search using the DFA. It also includes a document processing unit 250 for executing processing other than the character string search of the present embodiment, and a processing program holding unit 260 that holds the processing programs that realize these functions on the computer device. As illustrated, the document processing system 200 is connected to an input/output device 300 through which a user (document user) inputs search keys, document data to be searched, and commands for various kinds of processing, and through which processing results are output.

[0023] In FIG. 2, the automaton holding unit 220, the document holding unit 230, and the processing program holding unit 260 are realized in the main memory 103. Data held in the main memory 103 can be saved to a storage device such as the hard disk 105 as needed. The automaton construction unit 210, the automaton determination unit 240, and the document processing unit 250 are software blocks realized by the CPU 101 under the control of the processing programs stored in the processing program holding unit 260. The processing programs that control the CPU 101 to realize these functions are provided by storing them on a magnetic disk, an optical disk, a semiconductor memory, or another recording medium for distribution, or by delivering them over a network.

[0024] FIG. 3 is a flowchart showing the general flow of the DFA construction processing performed by the automaton construction unit 210. As shown in FIG. 3, when the automaton construction unit 210 receives as input the regular expression of a character string to be used as the search key (step 301), it first constructs a syntax tree from the regular expression (step 302). The constructed syntax tree is stored in the main memory 103 or in a cache memory (not shown) of the CPU 101. Next, the automaton construction unit 210 reads the syntax tree from the main memory 103 or the like and constructs an NFA based on the syntax tree (step 303). The constructed NFA is stored in the main memory 103 or in the cache memory (not shown) of the CPU 101. The automaton construction unit 210 then reads the NFA from the main memory 103 or the like and constructs a DFA based on the NFA (step 304). The constructed DFA is stored in the main memory 103 or in the cache memory (not shown) of the CPU 101. In other words, the automaton construction unit 210 functions as syntax tree construction means, NFA construction means, and DFA construction means in order to create the DFA used in the present embodiment. The constructed NFA and DFA are stored in the automaton holding unit 220 realized in the main memory 103.

[0025] The syntax tree (a binary tree) constructed from the regular expression in step 302 may have any structure as long as it contains the information needed to create the appropriate NFA described later. For example, it can be a syntax tree in which every node is one of the following four kinds.
1. An operator other than the elements '(' and ')', which delimit a range within the regular expression (a partial regular expression). One operator corresponds to one element of the regular expression. For this kind of node, '[abc]' is regarded as a single operator and has a single corresponding node. '*' has one child: the subtree corresponding to <re> in "<re>*". (A subtree is a tree consisting of a node of the syntax tree and all of its descendants, that is, its children, grandchildren and so on; it corresponds to a part of the regular expression.) '|' (<ALT>) has two children: the subtrees corresponding to <re1> and <re2> in "<re1>|<re2>". The other operators ("a", "[a-c]" and so on) have no children.
2. <TERM>: a virtual operator that marks the end of the regular expression. (For the regular expression "abc", a virtual terminating character is considered to follow 'c', and this character is <TERM>.) This node has no children.
3. <CONCAT>: a virtual node that represents the concatenation of regular expressions. It has no corresponding NFA state and no corresponding operator. For example, the regular expression "abc" is the concatenation of the elements 'a' and 'b', followed by the concatenation of that result with the element 'c'; each such concatenation is one node. This node has two children: the subtrees corresponding to the two partial regular expressions being concatenated.
4. <SUBEXP>: a partial regular expression enclosed in parentheses. It has no corresponding NFA state and corresponds to a pair of '(' and ')' in the regular expression. This node has one child: the subtree corresponding to the partial regular expression inside the parentheses.

[0026] FIG. 4 shows the syntax tree constructed from the regular expression "[ab]|c(de)*". In FIG. 4, "[ab]", "|", "c", "d", "e" and "*" are the operators of the regular expression "[ab]|c(de)*", and a node is generated for each of them. Node <C> is CONCAT, node | is ALT, node <S> is SUBEXP, and node <T> is TERM. A known technique can be used to construct this syntax tree. FIG. 5 is an example of a program that performs the construction.
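The four node kinds and the example of FIG. 4 can be sketched as a small data structure. This is only an illustration, not the program of FIG. 5: the class `Node`, the helper functions and the `kind` strings are hypothetical names, and attaching <TERM> through a top-level CONCAT node is an assumption, since FIG. 4 itself is not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "OP", "TERM", "CONCAT" or "SUBEXP"
    label: str = ""                # operator text; meaningful for OP nodes
    children: List["Node"] = field(default_factory=list)


def op(label, *children):
    """An operator node: a character, a bracket expression, '*' or '|'."""
    return Node("OP", label, list(children))


def concat(left, right):
    return Node("CONCAT", "<C>", [left, right])


def subexp(child):
    return Node("SUBEXP", "<S>", [child])


# "[ab]|c(de)*" followed by the virtual terminator <TERM>.
# Assumption: <TERM> hangs off a top-level CONCAT node.
tree = concat(
    op("|",
       op("[ab]"),
       concat(op("c"), op("*", subexp(concat(op("d"), op("e")))))),
    Node("TERM", "<T>"),
)


def nfa_state_nodes(node):
    """Collect the nodes that own an NFA state: every OP node plus <TERM>."""
    own = [node] if node.kind in ("OP", "TERM") else []
    return own + [n for child in node.children for n in nfa_state_nodes(child)]


labels = [n.label for n in nfa_state_nodes(tree)]
print(labels)        # six operators plus <TERM>, in pre-order
print(len(labels))   # number of NFA states this tree would need
```

CONCAT and SUBEXP nodes are skipped by `nfa_state_nodes`, which mirrors the statement that these two kinds have no corresponding NFA state.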

[0027] The NFA constructed from the syntax tree in step 303 satisfies the conditions required for use in this embodiment: it can be restored from a DFA state sequence, and it follows the longest-leftmost rule of regular expressions. As described above, more than one NFA can be created from a given regular expression. In this embodiment, however, pattern matching by the DFA uses information about the sequence of states that would be traversed if the NFA were used. An NFA suited to this purpose, that is, one that satisfies the above conditions, must therefore be constructed (hereinafter this NFA is called the appropriate NFA). The appropriate NFA has the following characteristics.
1. Exactly one NFA state corresponds to each element other than "(" and ")" ("a", "[a-c]", "*" and so on).
2. ε-transitions (transitions made without accepting a character) are allowed only for repetition (*, +, ? and so on) and alternation (|).
3. Apart from ε-transitions, each state has exactly one transition destination.

[0028] To create such an appropriate NFA, this embodiment defines three functions: first, epsdest and next. Each function is described below.
first: returns the NFA state corresponding to the operator that "appears" first in the partial regular expression corresponding to a given subtree of the syntax tree (the subtree rooted at the argument node). Here, the operator that "appears" first means the operator that must be processed first. An NFA state is drawn as one circle (node) in the NFA. In this embodiment an NFA state corresponds to an operator, and every NFA state has exactly one corresponding syntax tree node (the converse is not true: some nodes have no corresponding NFA state). Whereas the function first that is conventionally defined for constructing an NFA or DFA returns a set of operators, in this embodiment it returns only one operator (= one NFA state). From this definition, the following hold.
first(<char>) = <char> (character)
first(<re1><re2>) = first(<re1>) (concatenation)
first(<re1>|<re2>) = | (alternation)
first(<re1>*) = * (repetition)
first((<re>)) = first(<re>) (partial regular expression enclosed in parentheses)
FIG. 6 illustrates code that defines the function first.
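The five equations for first translate almost directly into a recursive function. The following is a minimal sketch under the assumption that the syntax tree uses the four node kinds described earlier; `Node` and its fields are hypothetical names, and this is not the code of FIG. 6.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    kind: str                      # "OP", "TERM", "CONCAT" or "SUBEXP"
    label: str = ""
    children: List["Node"] = field(default_factory=list)


def first(node):
    """Return the single node whose NFA state is entered first for this subtree."""
    if node.kind in ("CONCAT", "SUBEXP"):
        # first(<re1><re2>) = first(<re1>) and first((<re>)) = first(<re>)
        return first(node.children[0])
    # A character, a bracket expression, '*', '|' and <TERM> stand for themselves.
    return node


a, b = Node("OP", "a"), Node("OP", "b")
star = Node("OP", "*", [Node("SUBEXP", "<S>", [a])])
tree = Node("CONCAT", "<C>", [star, b])         # the regular expression "(a)*b"

print(first(tree).label)               # the repetition operator comes first
print(first(star.children[0]).label)   # inside the parentheses, 'a' comes first
```

Because the function returns one node rather than a set, every subtree has a unique entry state, which is the property the paragraph contrasts with the conventional definition.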

[0029] epsdest: returns the set of NFA states that can be reached by an ε-transition from the NFA state corresponding to the argument node. epsdest returns a non-empty set, that is, ε-transitions are allowed, only for the repetition operator and the alternation operator. From this definition, the following hold.
epsdest(*) = first(<re>) ∪ next(<re>) (where the context of * is "<re>*")
epsdest(|) = first(<re1>) ∪ first(<re2>) (where the context of | is "<re1>|<re2>")
epsdest(<op>) = φ (where <op> is an operator other than * and |, such as "a" or "[ab]")
FIG. 7 illustrates code that defines the function epsdest.

[0030] next: returns the NFA state corresponding to the operator that appears next after the partial regular expression corresponding to the subtree rooted at the argument node. That is, the function next gives the transition destination of the NFA state corresponding to the argument node. As with first, the returned value is a single operator rather than a set of operators, so the transition destination of every NFA state other than "*" and "|" is uniquely determined. The character accepted on this transition is a character matched by the operator corresponding to the node. From this definition, the following hold, where next(<re1>) is defined according to the context of <re1> in the regular expression.
When the context of <re1> is <re1><re2>: next(<re1>) = first(<re2>) (concatenation)
When the context of <re1> is <re1>|<re2>: next(<re1>) = next(<re1>|<re2>) (alternation)
When the context of <re1> is <re1>*: next(<re1>) = * (repetition)
When the context of <re1> is (<re1>): next(<re1>) = next((<re1>)) (partial regular expression enclosed in parentheses)
When <re1> is the entire regular expression, that is, when its context is <re1> alone: next(<re1>) = <TERM> (outermost regular expression)
FIG. 8 illustrates code that defines the function next.
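Since epsdest is defined in terms of first and next, the two functions can be sketched together. This is an illustration rather than the code of FIGS. 7 and 8, and it rests on several assumptions: `Node`, `mk` and `next_state` are hypothetical names (`next_state` avoids shadowing Python's built-in `next`); <TERM> is attached through a top-level CONCAT node so that the "outermost" rule follows from the concatenation rule; the right operand of a concatenation, which the rules above do not list, inherits next from its CONCAT node; and next(<re>) in epsdest(*) is read as the state that follows the whole repetition "<re>*", which is what the worked example for "((a*)a)*a" later in the description implies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    kind: str                      # "OP", "TERM", "CONCAT" or "SUBEXP"
    label: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def mk(kind, label, *children):
    """Create a node and record it as the parent of its children."""
    node = Node(kind, label, list(children))
    for child in children:
        child.parent = node
    return node


def first(node):
    return first(node.children[0]) if node.kind in ("CONCAT", "SUBEXP") else node


def next_state(node):
    parent = node.parent
    if parent is None:
        return None                          # nothing follows the root
    if parent.kind == "CONCAT" and parent.children[0] is node:
        return first(parent.children[1])     # <re1><re2>: first(<re2>)
    if parent.kind == "OP" and parent.label == "*":
        return parent                        # <re1>*: back to the '*' state
    return next_state(parent)                # '|', "(...)", right operand of CONCAT


def epsdest(node):
    if node.kind == "OP" and node.label == "*":
        # Assumed reading: enter the body, or leave the repetition.
        return [first(node.children[0]), next_state(node)]
    if node.kind == "OP" and node.label == "|":
        return [first(child) for child in node.children]
    return []


# "((a*)a)*a" followed by <TERM>
a0, a2, a4 = (mk("OP", "a") for _ in range(3))
s1 = mk("OP", "*", a0)
s3 = mk("OP", "*", mk("SUBEXP", "<S>", mk("CONCAT", "<C>", mk("SUBEXP", "<S>", s1), a2)))
term = mk("TERM", "<T>")
root = mk("CONCAT", "<C>", mk("CONCAT", "<C>", s3, a4), term)

print(next_state(a0) is s1, next_state(a2) is s3, next_state(a4) is term)
print(epsdest(s1) == [a0, a2], epsdest(s3) == [s1, a4])
```

With the operators numbered 0 to 5 in the order they appear in "((a*)a)*a", these results correspond to the ε-transitions 1 → {0, 2} and 3 → {1, 4} and the character transitions 0 → 1, 2 → 3 and 4 → 5.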

[0031] FIG. 9 is a flowchart explaining the process of constructing the appropriate NFA using the functions described above. In the following operation, the start state of the NFA is first of the syntax tree (= the operator to be processed first in the regular expression). One NFA state is assigned to each operator of the regular expression other than '(' and ')', and one to <TERM>. The ε-transition destinations of a given NFA state are, by definition, epsdest of the node corresponding to that NFA state. epsdest is non-empty only for '*' and '|', so these are the only states that can make ε-transitions. An NFA state that has ε-transition destinations has no ordinary transition destination. An NFA state that has no ε-transition destinations has, as its transition destination, next of the node corresponding to that NFA state; the character accepted on this transition is a character matched by the operator corresponding to the NFA state. For example, the operator "a" matches the character 'a', and the operator "[abc]" matches any of the characters 'a', 'b' and 'c'. The end state of the NFA is <TERM>. (Every NFA state other than <TERM> has either two ε-transition destinations or one ordinary transition destination.)

[0032] Referring to FIG. 9, the automaton construction unit 210 first generates nfa_st_num NFA states (step 901). Here, nfa_st_num is the number of NFA states generated from the regular expression by this method: the number of elements in the regular expression other than '(' and ')', plus 1 (for <TERM>). Next, the start state of the NFA (nfa_init) is set to the first node of the syntax tree (first(tree)), that is, the operator to be processed first in the regular expression, and the halt state of the NFA (nfa_halt) is set to the number of NFA states minus one (nfa_st_num - 1) (step 902).

[0033] Next, the variable i is initialized (i = 0), and the reference to the syntax tree node corresponding to the i-th NFA state (state i), state_trees[i], is examined (steps 903 and 904). If the operator of the referenced node is '*' or '|', epsdest(state_trees[i]), obtained from the function epsdest, is set as the ε-transition destinations of state i (step 905). If the operator of the referenced node is anything other than '*' and '|', the transition destination of state i for the element represented by state_trees[i] is set to next(state_trees[i]), obtained from the function next (step 906).

[0034] The value of the variable i is then incremented by 1 (step 907), and it is checked whether the new value of i has reached the number of NFA states (step 908). If the new value of i is less than the number of NFA states, the processing from step 904 onward is executed with the new value. If the new value of i has reached the number of NFA states, the halt state of the NFA has been reached and the appropriate NFA has been created, so the processing ends. The created appropriate NFA is stored in the automaton holding unit 220, because it is used later in the processing by the automaton determination unit 240.
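The loop of FIG. 9 can be sketched as follows. This is a hedged illustration, not the patented program: `Node`, `mk`, `next_state`, `state_nodes` and `build_nfa` are hypothetical names, states are assumed to be numbered in the order their operators appear in the regular expression (which agrees with nfa_halt = nfa_st_num - 1 for <TERM>), and the handling of epsdest(*) and of the right operand of a concatenation follows assumptions chosen to agree with the example "((a*)a)*a" used later in the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)               # eq=False keeps nodes hashable by identity
class Node:
    kind: str                      # "OP", "TERM", "CONCAT" or "SUBEXP"
    label: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def mk(kind, label, *children):
    node = Node(kind, label, list(children))
    for child in children:
        child.parent = node
    return node


def is_eps(node):
    return node.kind == "OP" and node.label in ("*", "|")


def first(node):
    return first(node.children[0]) if node.kind in ("CONCAT", "SUBEXP") else node


def next_state(node):
    parent = node.parent
    if parent is None:
        return None
    if parent.kind == "CONCAT" and parent.children[0] is node:
        return first(parent.children[1])
    if parent.kind == "OP" and parent.label == "*":
        return parent
    return next_state(parent)


def epsdest(node):
    if not is_eps(node):
        return []
    if node.label == "*":
        return [first(node.children[0]), next_state(node)]
    return [first(child) for child in node.children]


def state_nodes(node):
    """Nodes that own an NFA state, in the order their operators appear."""
    if is_eps(node) and node.label == "*":             # postfix operator
        return state_nodes(node.children[0]) + [node]
    if is_eps(node):                                   # '|' sits between operands
        return state_nodes(node.children[0]) + [node] + state_nodes(node.children[1])
    if node.kind in ("CONCAT", "SUBEXP"):
        return [n for child in node.children for n in state_nodes(child)]
    return [node]


def build_nfa(tree):
    state_trees = state_nodes(tree)                    # step 901
    number = {node: i for i, node in enumerate(state_trees)}
    nfa_init = number[first(tree)]                     # step 902
    nfa_halt = len(state_trees) - 1
    eps, trans = {}, {}
    for i, node in enumerate(state_trees):             # steps 903 to 908
        if is_eps(node):
            eps[i] = [number[d] for d in epsdest(node)]             # step 905
        elif node.kind != "TERM":
            trans[i] = (node.label, number[next_state(node)])       # step 906
    return nfa_init, nfa_halt, eps, trans


# "((a*)a)*a" followed by <TERM>
inner = mk("SUBEXP", "<S>", mk("OP", "*", mk("OP", "a")))
outer = mk("OP", "*", mk("SUBEXP", "<S>", mk("CONCAT", "<C>", inner, mk("OP", "a"))))
tree = mk("CONCAT", "<C>", mk("CONCAT", "<C>", outer, mk("OP", "a")), mk("TERM", "<T>"))

nfa_init, nfa_halt, eps, trans = build_nfa(tree)
print(nfa_init, nfa_halt)   # start and halt state numbers
print(eps)                  # ε-transition destinations of the '*' states
print(trans)                # state -> (operator, next state)
```

Each state ends up with either a list of ε-destinations or exactly one labelled transition, never both, which is the property the description requires of the appropriate NFA.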

[0035] A known technique can be used for the process of constructing the DFA from the appropriate NFA in step 304. That is, the information on how the states of the original NFA transition when one character is received is gathered to make one new state, and this work is repeated in order from the start state to the final state. The DFA created in this way is stored in the automaton holding unit 220, implemented in the main memory 103 or the like, and is used in the processing by the automaton determination unit 240.
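The known technique referred to here is commonly called the subset construction. A minimal sketch follows, assuming the NFA is given as two tables: `EPS` for ε-transitions and `TRANS` for character transitions. The table values are transcribed from the worked example for "((a*)a)*a" given later in the description; the names `EPS`, `TRANS`, `eps_closure`, `step` and `build_dfa` are hypothetical, and operators are assumed to be single literal characters.

```python
EPS = {1: [0, 2], 3: [1, 4]}                        # ε-transitions of the '*' states
TRANS = {0: ("a", 1), 2: ("a", 3), 4: ("a", 5)}     # state -> (operator, next state)
NFA_INIT, NFA_HALT = 3, 5


def eps_closure(states):
    """All NFA states reachable from `states` through ε-transitions only."""
    closure, stack = set(states), list(states)
    while stack:
        for dest in EPS.get(stack.pop(), []):
            if dest not in closure:
                closure.add(dest)
                stack.append(dest)
    return frozenset(closure)


def step(dfa_state, ch):
    """The DFA state reached from `dfa_state` on the character `ch`."""
    moved = {dest for s in dfa_state if s in TRANS
             for oper, dest in [TRANS[s]] if oper == ch}
    return eps_closure(moved)


def build_dfa(alphabet):
    """Explore every DFA state reachable from the start state."""
    start = eps_closure({NFA_INIT})
    table, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in table:
            continue
        table[state] = {}
        for ch in alphabet:
            target = step(state, ch)
            if target:
                table[state][ch] = target
                todo.append(target)
    return start, table


start, table = build_dfa("a")
print(sorted(start))               # NFA states in the DFA start state
print(sorted(table[start]["a"]))   # NFA states after reading one 'a'
```

Each DFA state is kept as a set of NFA state numbers rather than being renamed, because the later narrowing step needs to know which NFA states a DFA state contains.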

[0036] FIG. 10 is a flowchart showing the general flow of the pattern matching processing performed by the automaton determination unit 240. As shown in FIG. 10, when the automaton determination unit 240 has read and input the DFA stored in the automaton holding unit 220 and the character string to be searched (document data) stored in the document holding unit 230 (step 1001), it first judges (matches) the input character string with the DFA (step 1002). The judgment result is stored in the main memory 103 or in a cache memory (not shown) of the CPU 101. Next, if the input character string has been accepted, the automaton determination unit 240 uses the NFA and DFA read from the automaton holding unit 220 to narrow down the DFA state sequence and to determine the match ranges, that is, the ranges of the parts of the character string (substrings) that match given parts of the regular expression (partial regular expressions) (steps 1003 to 1005). The result of this processing is stored in the main memory 103 or in a cache memory (not shown) of the CPU 101. Finally, the automaton determination unit 240 reads the judgment result of step 1002 and the processing results of steps 1004 and 1005 from the main memory 103 or the like and outputs them through an output device such as a display device (step 1006). If the input character string was not accepted by the DFA, a rejection result is output (steps 1003 and 1006). The above processing is executed for all input character strings (step 1007). That is, the automaton determination unit 240 functions as input character string judgment means, DFA state sequence narrowing means and match range determination means.

[0037] A known technique can be used for the character string matching in step 1002. Even when the input character string matches the DFA and is accepted, however, that alone gives no information about which parts of the input character string were matched by the partial regular expressions. Because the DFA state sequence contains every NFA state sequence that the input could follow, it is impossible to tell which element of the regular expression matched which character, and the state transitions that conform to the POSIX longest-leftmost rule are unknown. The automaton determination unit 240 of this embodiment therefore obtains this information in two stages: it narrows down the DFA state sequence to obtain the candidates for the longest NFA state sequence, and it then determines the match ranges by restoring the leftmost NFA state sequence.

[0038] In the narrowing of the DFA state sequence in step 1004, the NFA state sequences that cannot reach the end state are deleted from the DFA state sequence, which contains all NFA states that can be reached for the input. This narrowing is performed by tracing the DFA state sequence backwards from its end, based on the NFA (the appropriate NFA) from which the DFA was constructed. FIG. 11 is a flowchart explaining the operation of narrowing the DFA state sequence. As shown in FIG. 11, the automaton determination unit 240 first initializes the DFA state sequence determined in step 1002 (state_log) and the index into the input character string (idx). Specifically, the DFA state for the last halt state (state_log[last]) is set to a DFA state that contains only the halt state of the NFA (nfa_halt). The index of the input character currently being processed (that is, being narrowed), idx, is set to the position immediately before the last halt state (last - 1) (step 1101). For example, when the DFA state sequence (state_log) is {{0, 1, 3}, {2, 3, 4}, {3, 5}, {4}} and the halt state of the NFA is 5, the only halt state in the DFA state sequence is {3, 5}. The value of last is then 2 (counting from 0). That is, state_log[last] = [nfa_halt] becomes state_log[2] = [5] in this example, and as a result the DFA state sequence becomes {{0, 1, 3}, {2, 3, 4}, {5}, {4}}.
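The initialization of step 1101 can be checked against the example above with a few lines of code. This is a sketch only; `init_narrowing` is a hypothetical name, and `last` is taken to be the largest index whose DFA state contains the NFA halt state.

```python
def init_narrowing(state_log, nfa_halt):
    """Step 1101: locate the last accepting DFA state and reduce it to {nfa_halt}."""
    last = max(i for i, states in enumerate(state_log) if nfa_halt in states)
    narrowed = [set(states) for states in state_log]   # leave the input untouched
    narrowed[last] = {nfa_halt}
    return narrowed, last, last - 1                    # idx starts just before `last`


state_log = [{0, 1, 3}, {2, 3, 4}, {3, 5}, {4}]
narrowed, last, idx = init_narrowing(state_log, nfa_halt=5)
print(last, idx)    # 2 1
print(narrowed)     # [{0, 1, 3}, {2, 3, 4}, {5}, {4}]
```

Choosing the last accepting position rather than the first is what yields the longest match; the entries after `last` are left as they are and play no further part in the narrowing.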

[0039] Next, the automaton determination unit 240 updates the DFA state for the input character currently being processed (state_log[idx]) (step 1102). It then moves the index of the input character to be processed to the position one before the current one (idx = idx - 1) (step 1103). If unprocessed input characters remain, it returns to step 1102 and repeats the processing; when all input characters have been processed, the narrowing of the DFA state sequence ends (step 1104).

[0040] FIG. 12 is a flowchart explaining in detail the process of updating the DFA state (state_log[idx]) in step 1102. Referring to FIG. 12, the automaton determination unit 240 first initializes the memory area used in this process (state_buf) and sets the index of the NFA state referenced in this process (nfa_idx) to 0 (step 1201). Next, it compares nfa_idx with the number of elements of state_log[idx] (step 1202). If the number of elements of state_log[idx] is larger, the NFA state being referenced (nfa_st) is set to state_log[idx][nfa_idx], and the pointer (node) to the syntax tree node corresponding to that NFA state in the memory area is set to state_trees[nfa_st] (step 1203). It then checks whether the string representation of the node in the syntax tree (node.label) is '*' or '|'. If it is either of these, nfa_idx is incremented by 1 and processing returns to step 1202 (steps 1204 and 1208).

[0041] If the string representation of the node (node.label) is neither '*' nor '|', the automaton determination unit 240 next checks whether next(node) ∈ state_log[idx+1] holds for the function value next(node) (steps 1204 and 1205). If it does not hold, nfa_idx is incremented by 1 and processing returns to step 1202 (step 1208). If next(node) ∈ state_log[idx+1] holds, the automaton determination unit 240 next checks whether the string representation of the node accepts the input character corresponding to the index currently being processed (input(idx)) (step 1206). If it does not accept the character, nfa_idx is incremented by 1 and processing returns to step 1202 (step 1208). If the string representation of the node accepts input(idx), the automaton determination unit 240 adds nfa_st to the memory area state_buf (step 1207), then increments nfa_idx by 1 and returns to step 1202 (step 1208).

[0042] Steps 1202 to 1208 are repeated, adding NFA states to state_buf. When the value of nfa_idx reaches the number of elements of state_log[idx], the automaton determination unit 240 performs the ε-transition source addition process (step 1209), which is described later. After that, the automaton determination unit 240 updates state_log[idx] to the processing result accumulated in state_buf (step 1210).

[0043] FIG. 13 is a flowchart explaining in detail the ε-transition source addition process. Referring to FIG. 13, the automaton determination unit 240 first sets the index of the NFA state referenced in this process (nfa_idx) to 0 (step 1301). Next, it compares nfa_idx with the number of elements of state_log[idx] (step 1302). If the number of elements of state_log[idx] is larger, the NFA state being referenced (nfa_st) is set to state_log[idx][nfa_idx], and the pointer (node) to the syntax tree node corresponding to that NFA state in the memory area is set to state_trees[nfa_st] (step 1303). It then checks whether the string representation of the node in the syntax tree (node.label) is '*' or '|'. If it is neither of these, nfa_idx is incremented by 1 and processing returns to step 1302 (steps 1304 and 1307).

[0044] If the string representation of the node (node.label) is '*' or '|', the automaton determination unit 240 next checks whether epsdest(node) ∈ state_buf holds for the function value epsdest(node) (steps 1304 and 1305). If it does not hold, nfa_idx is incremented by 1 and processing returns to step 1302 (step 1307). If epsdest(node) ∈ state_buf holds, the automaton determination unit 240 adds nfa_st to the memory area state_buf and returns to step 1301 (step 1306).

[0045] The above processing is repeated, adding NFA states to state_buf. When the value of nfa_idx reaches the number of elements of state_log[idx], the ε-transition source addition process ends. The DFA state for the input character currently being processed (state_log[idx]) is thereby updated, and the narrowing of the DFA state sequence is complete.
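The backward pass of FIGS. 11 to 13 can be sketched compactly. This is an illustration under stated assumptions rather than the patented flowcharts: the NFA is given as the hypothetical tables `EPS` and `TRANS`, whose values are transcribed from the "((a*)a)*a" example discussed in the paragraphs that follow; operators are assumed to be single literal characters; the test "epsdest(node) ∈ state_buf" is read as "at least one ε-destination of the node is already in state_buf"; and a state that is already in state_buf is skipped so that the restart in step 1306 terminates.

```python
EPS = {1: [0, 2], 3: [1, 4]}                        # ε-transitions of the '*' states
TRANS = {0: ("a", 1), 2: ("a", 3), 4: ("a", 5)}     # state -> (operator, next state)
NFA_HALT = 5


def accepts(oper, ch):
    return oper == ch          # assumption: literal characters only


def update(state_log, idx, text):
    # FIG. 12: keep the character states that can step into state_log[idx + 1].
    state_buf = []
    for nfa_st in sorted(state_log[idx]):
        if nfa_st in EPS or nfa_st not in TRANS:
            continue                                   # '*', '|' and <TERM>
        oper, nxt = TRANS[nfa_st]
        if nxt in state_log[idx + 1] and accepts(oper, text[idx]):
            state_buf.append(nfa_st)
    # FIG. 13: add ε-transition sources until nothing more can be added.
    changed = True
    while changed:
        changed = False
        for nfa_st in sorted(state_log[idx]):
            if (nfa_st in EPS and nfa_st not in state_buf
                    and any(d in state_buf for d in EPS[nfa_st])):
                state_buf.append(nfa_st)
                changed = True
    state_log[idx] = set(state_buf)


def narrow(state_log, text):
    # FIG. 11: start from the last accepting position and walk backwards.
    log = [set(states) for states in state_log]
    last = max(i for i, states in enumerate(log) if NFA_HALT in states)
    log[last] = {NFA_HALT}
    for idx in range(last - 1, -1, -1):
        update(log, idx, text)
    return log[: last + 1]


state_log = [{0, 1, 2, 3, 4}] + [set(range(6)) for _ in range(4)]
result = narrow(state_log, "aaaa")
print(result)
```

For the input "aaaa" the narrowed sequence ends in {3, 4} followed by {5}, preceded by {1, 2, 3}, which is the same backward reasoning the description walks through by hand for FIG. 16(A).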

[0046] Next, a concrete example of the narrowing of the DFA state sequence described above is given, taking the case of matching the input character string "aaaa" against the regular expression "((a*)a)*a". FIG. 14 shows the appropriate NFA for the regular expression "((a*)a)*a". The NFA shown in FIG. 14 has six states: 0, 1, 2, 3, 4 and 5. FIG. 15 shows the DFA for the NFA of FIG. 14 and its state transitions for the character string "aaaa". Referring to the DFA of FIG. 15, the input character string "aaaa" is accepted; the start state of the DFA is {0, 1, 2, 3, 4}, and the DFA state after the first input character and after every later input character is {0, 1, 2, 3, 4, 5}. According to the longest-leftmost rule, the first two 'a' characters of the input character string "aaaa" are matched, as a partial match, by the 'a*' in the regular expression "((a*)a)*a", but this information cannot be obtained from the DFA.

[0047] FIG. 16 shows how the NFA states transition for the input character string "aaaa". FIG. 16(A) lists, for each NFA state, its ε-transition destinations, the character of the NFA state, and the NFA state reached on that character. FIG. 16(B) shows the state transition paths through the DFA state sequence narrowed down on the basis of FIG. 16(A).

[0048] Consider narrowing down the DFA state sequence with reference to the NFA of FIG. 14 and the transition table of FIG. 16(A). First, the rightmost column of FIG. 16(A) is the end state of the NFA, so according to the NFA of FIG. 14 it must be NFA state 5. Next, the only NFA states that can reach NFA state 5 on one input character 'a' are NFA state 4 and NFA state 3, which makes an ε-transition to NFA state 4. The only NFA states that can then reach NFA state 4 or NFA state 3 on one input character 'a' are NFA state 2, NFA state 1, which makes an ε-transition to that state, and NFA state 3, which makes an ε-transition to NFA state 1. In the same way, the NFA states that move to NFA state 1, 2, or 3 on one input character 'a' are NFA states 0, 1, 2, and 3. Finally, the NFA states that move to NFA state 1, 2, or 3 on one more input character 'a' (these correspond to the start state) are again NFA states 0, 1, 2, and 3. In FIG. 16(A), the DFA state sequence extracted, that is, narrowed down, in this way is shown in italics.
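The backward narrowing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the transition tables EPS and STEP are reconstructed from the text because FIG. 14 is not reproduced here, and the names eclosure, forward, and narrow are hypothetical rather than identifiers from the specification.

```python
# NFA for "((a*)a)*a", reconstructed from the description (an assumption).
EPS = {1: [0, 2], 3: [1, 4]}   # epsilon destinations of the two '*' nodes
STEP = {0: 1, 2: 3, 4: 5}      # character nodes: state reached after reading 'a'
START, HALT = 3, 5


def eclosure(state):
    """Return the state together with everything reachable by epsilon moves."""
    seen, stack = {state}, [state]
    while stack:
        for dest in EPS.get(stack.pop(), []):
            if dest not in seen:
                seen.add(dest)
                stack.append(dest)
    return seen


def forward(text):
    """Record the set of NFA states (one DFA state) before and after each character."""
    log = [eclosure(START)]
    for ch in text:
        nxt = set()
        for s in log[-1]:
            if ch == "a" and s in STEP:
                nxt |= eclosure(STEP[s])
        log.append(nxt)
    return log


def narrow(log):
    """Working right to left, keep only NFA states that can still reach the halt state."""
    kept = [set() for _ in log]
    for k in range(len(log) - 1, -1, -1):
        if k == len(log) - 1:
            kept[k] = log[k] & {HALT}
        else:
            # Character nodes whose successor survived in the next column.
            kept[k] = {s for s in log[k] if STEP.get(s) in kept[k + 1]}
        # Add '*' nodes that reach a surviving state of the same column by epsilon moves.
        changed = True
        while changed:
            changed = False
            for s in log[k] - kept[k]:
                if any(d in kept[k] for d in EPS.get(s, [])):
                    kept[k].add(s)
                    changed = True
    return kept
```

Calling narrow(forward("aaaa")) gives one set per column of the transition table, from the start column to the end column.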

[0049] FIG. 16(B) shows the possible paths from NFA state 3, the start state, to NFA state 5, the end state, through the DFA state sequence narrowed down as described above. In the figure, ε-transitions are drawn as vertical paths and transitions on the character 'a' as horizontal paths. The notation (1) pointing toward NFA state 0 means that NFA state 0, like NFA state 2, is reached from NFA state 1.

[0050] Next, the match range determination of step 1005 in the processing by the automaton determination unit 240 (see FIG. 10) is described. Using the DFA state sequence narrowed down in step 1004, this processing determines the match range of the character string from the state-transition information of the appropriate NFA. This yields information such as subexpression matches and the longest match in accordance with the POSIX leftmost-longest rule.

[0051] FIG. 17 is a flowchart explaining the operation of the match range determination. As shown in FIG. 17, the automaton determination unit 240 first initializes the parameters held in registers (step 1701). Specifically, the currently referenced NFA state (nfa_state) is set to the start state (nfa_init) of the appropriate NFA read from the automaton holding unit 220; the data matches[i], which records the portion matched by a subexpression, is set to (-1, -1) for 0 ≤ i ≤ MAX; and the index currently referenced in the character string being processed (idx) is set so that idx = matches[0].first = 0.

[0052] The automaton determination unit 240 then updates the registers (step 1702) and moves from the currently referenced NFA state to the next NFA state (step 1703). Both operations are described below. Next, the automaton determination unit 240 determines whether the NFA state (nfa_state) is the halt state (nfa_halt) (step 1704). If nfa_state ≠ nfa_halt, processing returns to step 1702 and repeats; if nfa_state = nfa_halt, the match range determination ends.

[0053] FIG. 18 is a flowchart explaining the register update in detail. Referring to FIG. 18, the automaton determination unit 240 first initializes a variable i (i = 0) (step 1801) and determines whether the NFA state (nfa_state) lies within the range of a subexpression, that is, whether subexps[i].first ≤ nfa_state < subexps[i].last (step 1802). If the NFA state is not within this range, matches[i].last is set to the value of the index (idx) (step 1803); since i = 0 initially, step 1701 of FIG. 17 gives matches[0].last = matches[0].first = 0. The variable i is then incremented by 1 (step 1806), and if i has not reached the value MAX set in step 1701 of FIG. 17, processing returns to step 1802 (step 1807).

[0054] If the NFA state is within the range of the subexpression in step 1802, the automaton determination unit 240 next checks whether matches[i].first = -1 or matches[i].last ≠ -1 holds (step 1804). If it does, matches[i].first is set to the value of the index (idx) and matches[i].last is set to -1 (step 1805). The variable i is then incremented by 1 (step 1806), and if i has not reached MAX set in step 1701 of FIG. 17, processing returns to step 1802 (step 1807). If neither matches[i].first = -1 nor matches[i].last ≠ -1 holds in step 1804, the data is left unchanged, i is incremented by 1 (step 1806), and processing returns to step 1802 unless i has reached MAX (step 1807). Steps 1802 to 1807 are repeated in this way, and the register update ends when i reaches MAX.
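The register update of FIG. 18 might be sketched in Python as below. Several points are assumptions rather than statements of the specification: the NFA-state ranges in SUBEXPS, the (nfa_state, idx) trace, and in particular the guard on step 1803, which closes a range only while it is open (first set, last still -1). The text sets matches[i].last whenever the state is outside the range; the guard is added here so that a range that has already been closed is not overwritten. The whole-expression entry matches[0] is left out of the sketch.

```python
def update_registers(matches, subexps, nfa_state, idx):
    """One pass of the register update (steps 1801 to 1807), with an added guard."""
    for i, (lo, hi) in enumerate(subexps):            # steps 1801, 1806, 1807
        first, last = matches[i]
        if lo <= nfa_state < hi:                      # step 1802
            if first == -1 or last != -1:             # step 1804
                matches[i] = [idx, -1]                # step 1805: open a new range
        elif first != -1 and last == -1:              # guard: assumption of this sketch
            matches[i] = [first, idx]                 # step 1803: close the open range


# Hypothetical NFA-state ranges [first, last) for the groups ((a*)a) and (a*)
# of "((a*)a)*a"; FIG. 14 is not reproduced here, so these are assumptions.
SUBEXPS = [(0, 3), (0, 2)]

# Hypothetical trace of (nfa_state, idx) pairs for the input "aaaa".
TRACE = [(3, 0), (1, 0), (0, 0), (1, 1), (0, 1), (1, 2), (2, 2), (3, 3), (4, 3)]


def run(trace=TRACE, subexps=SUBEXPS):
    """Apply the register update once per visited NFA state and return the ranges."""
    matches = [[-1, -1] for _ in subexps]
    for nfa_state, idx in trace:
        update_registers(matches, subexps, nfa_state, idx)
    return matches
```

With this trace, run() returns a [first, last] pair per group that can be used to slice the input string, for example "aaaa"[first:last].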

[0055] FIG. 19 is a flowchart explaining the transition to the next NFA state in detail. Referring to FIG. 19, the automaton determination unit 240 first sets node, a pointer to the syntax-tree node in the memory area that corresponds to the currently referenced NFA state, to state_trees[nfa_state] (step 1901). It then checks whether the string representation of that node (node.label) is '*' or '|'. If it is, the lowest-numbered state in the set epsdest(node) becomes the NFA state (nfa_state) (steps 1902, 1903). If node.label is neither '*' nor '|', the NFA state (nfa_state) is set to the next node (next(node)) and the processing ends (steps 1902, 1904).

[0056] Next, a concrete example of the match range determination described above is given, using the case described with reference to FIGS. 14 to 16 in which the input character string "aaaa" is matched against the regular expression "((a*)a)*a". As shown in FIG. 16, the narrowing of the DFA state sequence in step 1004 of FIG. 10 has already extracted the possible state sequences leading from the start state to the end state of the appropriate NFA. FIG. 20 shows how the match range is determined from the transition table of FIG. 16(A) and the state-transition paths through the narrowed DFA state sequence of FIG. 16(B). In the transition table of FIG. 20(A), NFA states removed by the narrowing of the DFA state sequence are written as numbers in parentheses.

[0057] Following the possible DFA state sequence with reference to FIGS. 20(A) and 20(B), the NFA first moves from start state 3 to NFA state 1, from which it can move to either NFA state 0 or NFA state 2. Under the POSIX leftmost-longest rule, the repetition that appears first is chosen as many times as possible, so the destination is NFA state 0. The NFA then moves from NFA state 0 to NFA state 1, and can again move to NFA state 0 or NFA state 2; the leftmost-longest rule again selects NFA state 0. After that, the NFA moves from NFA state 0 through NFA states 1, 2, 3, and 4 to NFA state 5 and reaches the end state. In FIGS. 20(A) and 20(B), the DFA state sequence selected in this way is shown in italics.
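The walk through the narrowed state sequence can be reproduced with a short Python sketch. It rests on assumptions: the tables EPS and NEXT are reconstructed from the text because FIG. 14 is not reproduced here, the narrowed columns are taken from the walkthrough of FIG. 16(A), the index is assumed to advance on each character transition, and restricting the choice at a '*' node to states that survive in the current column is this sketch's reading of how the lowest-numbered destination is selected. All names are illustrative.

```python
# NFA for "((a*)a)*a", reconstructed from the description (an assumption).
EPS = {1: [0, 2], 3: [1, 4]}    # epsdest of the two '*' nodes
NEXT = {0: 1, 2: 3, 4: 5}       # next(node) for the character nodes
NFA_INIT, NFA_HALT = 3, 5

# Narrowed DFA state sequence for "aaaa", one set of NFA states per column.
NARROWED = [{0, 1, 2, 3}, {0, 1, 2, 3}, {1, 2, 3}, {3, 4}, {5}]


def restore_path(narrowed):
    """Follow the narrowed columns from the start state to the halt state."""
    state, idx = NFA_INIT, 0
    path = [state]
    consumed = []                       # (input index, character node) pairs
    while state != NFA_HALT:
        if state in EPS:
            # '*' node: take the lowest-numbered epsilon destination that
            # is still present in the narrowed column for this index.
            state = min(d for d in EPS[state] if d in narrowed[idx])
        else:
            # Character node: consume one input character and move on.
            consumed.append((idx, state))
            state = NEXT[state]
            idx += 1
        path.append(state)
    return path, consumed
```

restore_path(NARROWED) returns the visited NFA states in order, together with a list that pairs each input index with the character node that consumed it.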

[0058] In this way, the NFA state sequence to be followed when the input character string "aaaa" matches the regular expression "((a*)a)*a" has been restored. FIG. 21 shows the restored NFA state sequence. In FIG. 21, arrows labeled with the character 'a' are transitions on a character, and arrows without the label are ε-transitions. FIG. 21 shows that the first two characters 'a' of the input character string "aaaa" correspond to transitions from NFA state 0 to NFA state 1 and therefore match the 'a*' inside the double parentheses of the regular expression "((a*)a)*a". Likewise, the third character 'a' matches the 'a' inside the outer parentheses, and the last character 'a' matches the outermost 'a' of the regular expression.

[0059] In this way, information is obtained as to which characters of the input character string match which part of the regular expression. When pattern matching of this kind is performed with an NFA, the processing time is on the order of O(2^n), that is, proportional to 2 to the power n for an input of length n. In the present embodiment, by contrast, pattern matching with the DFA takes time on the order of O(n), and the narrowing of the DFA states and the match range determination also take time on the order of O(n), so each completes in time proportional to n for an input of length n. The present embodiment can therefore greatly reduce the processing time compared with processing that uses an NFA.

[0060] An example in which this embodiment is applied to a character string search in an actual text file is now described. Suppose there is a text file containing an address book. In this text file, items such as the name are separated by ',' and the first item is the name, but the set of items contained in an entry is not uniform. Suppose the address-book text file contains the following entries.

日本太郎,taro@yamato.ibm.com,046-xxx-xxxx,神奈川県大和市大和東X-X-X
大和次郎,Yamato Jiro,jiro@jp.ibm.com,神奈川県大和市下鶴間X-X-X

Consider the task of listing, from this text file, the names and e-mail addresses of the people whose address is in "神奈川県大和市" (Yamato City, Kanagawa Prefecture) and whose e-mail address contains the string ibm.com. Such a search is easy with a regular expression; for example, the following regular expression can be used.

([^,]*),([^,]*,)*([^,]*@[^,]*ibm.com),([^,]*,)*神奈川県大和市

In this regular expression, [^,] matches any character other than ',', and * denotes zero or more repetitions, so [^,]* matches one item of an entry in the text file. For example, the leading ([^,]*), matches the name and the ',' that follows it, and the parenthesized part matches the name. ([^,]*,)* matches zero or more repetitions of an item followed by ',', that is, any number of items between the name and the e-mail address. ([^,]*@[^,]*ibm.com), which contains the character @ and the string ibm.com, matches the e-mail address. The rest of the expression then matches zero or more items and finally "神奈川県大和市".
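What the parenthesized groups of this expression capture can be checked with Python's standard re module. Note the limits of the illustration: re is a backtracking engine with its own matching semantics, not the DFA-based method of this specification and not a POSIX leftmost-longest matcher, so the sketch shows only which items the first and third groups pick out for the two sample entries. The helper name name_and_email is hypothetical. As in the source expression, the '.' in ibm.com is left unescaped and therefore matches any character.

```python
import re

# Regular expression and sample entries as given in the text.
PATTERN = re.compile(
    r"([^,]*),([^,]*,)*([^,]*@[^,]*ibm.com),([^,]*,)*神奈川県大和市"
)
ENTRIES = [
    "日本太郎,taro@yamato.ibm.com,046-xxx-xxxx,神奈川県大和市大和東X-X-X",
    "大和次郎,Yamato Jiro,jiro@jp.ibm.com,神奈川県大和市下鶴間X-X-X",
]


def name_and_email(entry):
    """Return (name, e-mail address) when the entry matches, otherwise None."""
    m = PATTERN.match(entry)
    if m is None:
        return None
    # Group 1 is the first item (the name); group 3 is the e-mail item.
    return m.group(1), m.group(3)


results = [name_and_email(entry) for entry in ENTRIES]
```

An entry whose e-mail item lacks ibm.com, or whose address item does not begin with 神奈川県大和市, makes name_and_email return None.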

[0061] If this search is processed with a conventional DFA, it is not known where in the original text the parts enclosed in the parentheses '(' and ')' of the regular expression matched, so specific items such as the name or the e-mail address cannot be extracted. With an NFA, on the other hand, execution takes time exponential in the length of the input. In the first embodiment described above, however, the character string can be matched quickly by pattern matching with the DFA, and then, for the matched string, narrowing the DFA state sequence and restoring the NFA state sequence to determine the match range yields information as to which item matched which part of the regular expression, so the desired items can be extracted.

[0062] [Second Embodiment] Like the first embodiment described above, the second embodiment implements on a computer a document processing system that searches for character strings with a DFA (deterministic finite state automaton) on the basis of search conditions expressed as a regular expression. In addition, it allows multi-character collating elements in the character string that serves as the search key. When performing pattern matching of a character string with the DFA, this system reads ahead in the input character string to determine whether a multi-character collating element is present. Furthermore, when determining whether the input character string matches (whether the DFA accepts it), it evaluates the string while dynamically determining and dynamically generating the destination state of each DFA state transition.

[0063] The document processing system of the second embodiment is implemented on the same kind of computer apparatus as the document processing system of the first embodiment illustrated in FIG. 1, and has the same system configuration as the document processing system of the first embodiment shown in FIG. 2. In this embodiment, therefore, each component is referred to by the reference numerals used in FIGS. 1 and 2, and the hardware configuration of the computer apparatus that implements the document processing system and the functional configuration of the system are not described again.

[0064] In the second embodiment, the operation of the automaton construction unit 210 is the same as the operation of the automaton construction unit 210 in the first embodiment described with reference to FIGS. 3 to 9. Its description is therefore omitted in this embodiment.

[0065] The automaton determination unit 240 of the second embodiment operates, in outline, in the same way as the automaton determination unit 240 of the first embodiment. That is, as shown in FIG. 10, it reads the DFA stored in the automaton holding unit 220 and the character string to be searched (document data) stored in the document holding unit 230 (step 1001), matches the input character string with the DFA (step 1002), narrows down the DFA state sequence and determines the match range (steps 1003 to 1005), and outputs the processing result (steps 1006, 1007).

[0066] In the first embodiment, a known technique was used for the character string matching of step 1002. In the second embodiment, by contrast, the string is evaluated while the DFA state transitions are dynamically determined and generated, in order to support multi-character collating elements. FIGS. 22 and 23 are flowcharts explaining the determination of the input character string by the DFA in the second embodiment. Referring to FIGS. 22 and 23, the automaton determination unit 240 first initializes the index into the input character string being processed (idx = 0) and sets the DFA state sequence entry state_log[0] to eclosure(nfa_init), where nfa_init is the start state of the NFA (step 2201). Here, eclosure(state), for an NFA state state, is the largest set obtained by extending epsdest(state) with epsdest(state') for each element state' of epsdest(state), and so on repeatedly. The index into the NFA states referenced in this processing (nfa_idx) is then set to 0 (step 2202).

[0067] Next, nfa_idx is compared with the number of elements of state_log[idx] (step 2203). If the number of elements of state_log[idx] is larger, the referenced NFA state (nfa_st) is set to state_log[idx][nfa_idx], and node, a pointer to the syntax-tree node in the memory area that corresponds to this NFA state, is set to state_trees[nfa_st] (step 2204). It is then checked whether the string representation of the node (node.label) is '*' or '|'. If it is, nfa_idx is incremented by 1 and processing returns to step 2203 (steps 2205, 2212).

[0068] If node.label is neither '*' nor '|' in step 2205, the automaton determination unit 240 next checks whether the input character string has a multi-character collating element beginning at the index currently being processed (idx) (step 2206). If it does, mblen is defined as the length of the multi-character collating element beginning at index idx of the input character string (step 2207), and it is then checked whether the syntax-tree node corresponding to the NFA state can accept this multi-character collating element (step 2208). If it can, eclosure(next(node)) is added to the DFA state sequence entry state_log[idx+mblen] (step 2209).

[0069] When step 2206 determines that the input character string has no multi-character collating element, when step 2208 determines that the node cannot accept the multi-character collating element, or after step 2209 has been performed because the node can accept it, the automaton determination unit 240 checks whether the string representation of the regular expression corresponding to the node being processed (node.label) accepts the input at the current index (input(idx)) (step 2210). If it does, eclosure(next(node)) is added to the DFA state sequence entry state_log[idx+1] (step 2211). If node.label does not accept input(idx) in step 2210, or after step 2211, nfa_idx is incremented by 1 and processing returns to step 2203 (step 2212).

[0070] Steps 2203 to 2212 are repeated, and when nfa_idx reaches the number of elements of state_log[idx], the automaton determination unit 240 advances the index into the input character string (idx) by one (step 2213) and checks whether the index has reached the last character of the input character string or whether state_log[i] = 0 (i ≥ idx) holds (step 2214). If neither holds, processing returns to step 2202 and repeats; if either holds, the determination of the input character string by the DFA ends.
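The on-the-fly generation of DFA states in steps 2201 to 2214 might look as follows in Python. This is a sketch under assumptions, not the implementation of the specification: the NFA tables for "[a-b]c|aab" are a guess at FIG. 27 (which is not reproduced here), "aa" is treated as the only multi-character collating element, a node label is modeled as the set of elements the node accepts, a literal 'a' node is assumed not to accept the collating element "aa", the termination test reads state_log[i] = 0 as "no states remain", and every name other than state_log, eclosure, nfa_idx, nfa_st, and idx is hypothetical.

```python
# Assumed NFA for "[a-b]c|aab": node 2 is the '|' node, node 6 the halt state.
LABEL = {0: {"a", "b", "aa"}, 1: {"c"}, 2: "|", 3: {"a"}, 4: {"a"}, 5: {"b"}}
EPSDEST = {2: [0, 3]}
NEXT = {0: 1, 1: 6, 3: 4, 4: 5, 5: 6}
NFA_INIT, NFA_HALT = 2, 6
COLLATING = ["aa"]   # multi-character collating elements of the assumed locale


def eclosure(state):
    """The state itself plus everything reachable through epsilon moves."""
    seen, stack = {state}, [state]
    while stack:
        for dest in EPSDEST.get(stack.pop(), []):
            if dest not in seen:
                seen.add(dest)
                stack.append(dest)
    return seen


def collating_at(text, idx):
    """Return the multi-character collating element starting at idx, if any."""
    for elem in COLLATING:
        if text.startswith(elem, idx):
            return elem
    return None


def dfa_match(text):
    """Build state_log, the list of NFA-state sets, while scanning the text."""
    n = len(text)
    state_log = [[] for _ in range(n + 1)]

    def add(pos, states):
        for s in sorted(states):
            if s not in state_log[pos]:
                state_log[pos].append(s)

    add(0, eclosure(NFA_INIT))                               # step 2201
    idx = 0
    while idx < n and any(state_log[idx:]):                  # step 2214
        nfa_idx = 0                                          # step 2202
        while nfa_idx < len(state_log[idx]):                 # step 2203
            nfa_st = state_log[idx][nfa_idx]                 # step 2204
            label = LABEL.get(nfa_st)
            if isinstance(label, set):                       # step 2205: not '*' or '|'
                elem = collating_at(text, idx)               # step 2206
                if elem is not None and elem in label:       # steps 2207, 2208
                    add(idx + len(elem), eclosure(NEXT[nfa_st]))   # step 2209
                if text[idx] in label:                       # step 2210
                    add(idx + 1, eclosure(NEXT[nfa_st]))     # step 2211
            nfa_idx += 1                                     # step 2212
        idx += 1                                             # step 2213
    return state_log


def accepted(text):
    """True when the halt state is present after the whole text is read."""
    return NFA_HALT in dfa_match(text)[-1]
```

Because a collating element of length mblen writes into state_log[idx+mblen], entries ahead of the current index can already hold states before the scan reaches them, which is why the loop keeps going while any later entry is non-empty.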

[0071] Next, as shown in FIG. 10, the automaton determination unit 240 narrows down the DFA state sequence (step 1004). The narrowing of the DFA state sequence in the second embodiment is broadly the same as the processing of the first embodiment shown in FIG. 11, but differs in that the update of state_log[idx] (step 1102) handles multi-character collating elements. FIGS. 24 and 25 are flowcharts explaining the narrowing of the DFA state sequence in this embodiment. In FIGS. 24 and 25, steps 2401 to 2404 are the same as steps 1201 to 1204 of FIG. 12, and steps 2412 to 2414 are the same as steps 1208 to 1210 of FIG. 12.

[0072] If node.label is neither '*' nor '|' in step 2404, the automaton determination unit 240 next checks whether the input character string has a multi-character collating element beginning at the index currently being processed (idx) (step 2405). If it does, mblen is defined as the length of the multi-character collating element beginning at index idx of the input character string (step 2406), and it is then checked whether the syntax-tree node corresponding to the NFA state can accept this multi-character collating element (step 2407). If it can, it is further checked whether next(node) ∈ state_log[idx+mblen] holds for the function value next(node) (step 2408).

[0073] When step 2405 determines that the input character string has no multi-character collating element, when step 2407 determines that the node cannot accept the multi-character collating element, or when step 2408 determines that next(node) ∈ state_log[idx+mblen] does not hold, the automaton determination unit 240 next checks whether next(node) ∈ state_log[idx+1] holds (step 2409). If it does not, nfa_idx is incremented by 1 and processing returns to step 2402 (step 2412). If next(node) ∈ state_log[idx+1] holds in step 2409, the automaton determination unit 240 next checks whether the string representation of the node accepts the input character at the index currently being processed (input(idx)) (step 2410). If it does not, nfa_idx is incremented by 1 and processing returns to step 2402 (step 2412). If the string representation of the node accepts input(idx) in step 2410, or if next(node) ∈ state_log[idx+mblen] holds in step 2408, the automaton determination unit 240 adds nfa_st to the memory area state_buf (step 2411), then increments nfa_idx by 1 and returns to step 2402 (step 2412). As noted above, the subsequent operation is the same as in the first embodiment shown in FIG. 12.
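The checks of steps 2405 to 2411 can be sketched in Python as a backward pass over state_log. The sketch is hedged in several ways: the NFA tables for "[a-b]c|aab" and the starting state_log for the input "aac" are assumptions (FIG. 27 is not reproduced here), "aa" is taken to be the only multi-character collating element, and the handling of the '|' node, which belongs to the steps shared with FIG. 12, is this sketch's guess (a '|' node is kept when one of its ε-destinations survives in the same column). Names other than state_log, state_buf, nfa_st, and idx are hypothetical.

```python
# Assumed NFA for "[a-b]c|aab": node 2 is the '|' node, node 6 the halt state.
LABEL = {0: {"a", "b", "aa"}, 1: {"c"}, 2: "|", 3: {"a"}, 4: {"a"}, 5: {"b"}}
EPSDEST = {2: [0, 3]}
NEXT = {0: 1, 1: 6, 3: 4, 4: 5, 5: 6}
NFA_HALT = 6
COLLATING = ["aa"]   # multi-character collating elements of the assumed locale

# Assumed result of the forward pass for the input "aac".
STATE_LOG = [[0, 2, 3], [1, 4], [1, 5], [6]]


def collating_at(text, idx):
    """Return the multi-character collating element starting at idx, if any."""
    for elem in COLLATING:
        if text.startswith(elem, idx):
            return elem
    return None


def keeps(nfa_st, idx, text, log):
    """Decide whether a character node stays in column idx (steps 2405 to 2410)."""
    label, nxt = LABEL[nfa_st], NEXT[nfa_st]
    elem = collating_at(text, idx)                          # step 2405
    if elem is not None and elem in label:                  # steps 2406, 2407
        if nxt in log[idx + len(elem)]:                     # step 2408
            return True
    return nxt in log[idx + 1] and text[idx] in label       # steps 2409, 2410


def narrow(text, state_log):
    """Prune state_log from right to left so only states on accepting paths remain."""
    log = [set(col) for col in state_log]
    log[-1] &= {NFA_HALT}
    for idx in range(len(text) - 1, -1, -1):
        state_buf = {s for s in log[idx]
                     if isinstance(LABEL.get(s), set) and keeps(s, idx, text, log)}  # step 2411
        # Assumed treatment of '|' nodes: keep them when a destination survives.
        changed = True
        while changed:
            changed = False
            for s in log[idx] - state_buf:
                if any(d in state_buf for d in EPSDEST.get(s, [])):
                    state_buf.add(s)
                    changed = True
        log[idx] = state_buf
    return log
```

In this run the column for index 1 becomes empty, because the only surviving path consumes "aa" as a single collating element and jumps from index 0 directly to index 2.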

[0074] Next, as shown in FIG. 10, the automaton determination unit 240 determines the match range (step 1005). The match range determination in the second embodiment is broadly the same as the processing of the first embodiment shown in FIG. 11, but differs in that the transition to the next NFA state (step 1703) handles multi-character collating elements. FIG. 26 is a flowchart explaining the match range determination in this embodiment. In FIG. 26, steps 2601 to 2603 are the same as steps 1901 to 1903 of FIG. 19.

[0075] If node.label, the string representation of the syntax-tree node, is neither '*' nor '|' in step 2602, the automaton determination unit 240 next checks whether the input character string has a multi-character collating element beginning at the index currently being processed (idx) (step 2604). If it does not, the index currently being processed (idx) is advanced by one (step 2608), the NFA state (nfa_state) is set to the next node (next(node)), and the processing ends (step 2609).

[0076] If it is determined in step 2604 that the input character string has a multi-character collating element, the data mblen is defined as mblen = the length of the multi-character collating element starting at index (idx) of the input character string (step 2605). It is then checked whether the node on the syntax tree corresponding to the NFA state can accept the multi-character collating element (step 2606). If it can, eclosure(next(node)) is added to state_log[idx+mblen] (step 2607), the NFA state (nfa_state) is set to the next node (next(node)), and the process ends (step 2609). If it is determined in step 2606 that the node cannot accept the multi-character collating element, the current index (idx) is advanced by one (step 2608), the NFA state (nfa_state) is set to the next node (next(node)), and the process ends (step 2609).
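The branch structure of steps 2604 to 2609 can be sketched as follows. This is only an illustrative reading of the two paragraphs above: the Node class, the COLLATING_ELEMENTS table, and the simplified eclosure helper are hypothetical stand-ins, since the text does not define these data structures. The sketch follows the text literally, so idx is left unchanged in the branch where the element is accepted.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table of valid multi-character collating elements (assumption).
COLLATING_ELEMENTS = ("aa",)


@dataclass
class Node:
    """Hypothetical syntax-tree node: a label, the strings it accepts, and next(node)."""
    label: str
    accepts: frozenset
    next: Optional["Node"] = None


def multi_char_element_at(text, idx):
    """Step 2604: return the multi-character collating element starting at idx, if any."""
    for elem in COLLATING_ELEMENTS:
        if text.startswith(elem, idx):
            return elem
    return None


def eclosure(node):
    """Placeholder epsilon-closure: just the node's own label (assumption)."""
    return {node.label} if node is not None else set()


def step_to_next_nfa_state(node, text, idx, state_log):
    """Return (nfa_state, idx) after one transition, updating state_log in place."""
    elem = multi_char_element_at(text, idx)              # step 2604
    if elem is not None:
        mblen = len(elem)                                # step 2605
        if elem in node.accepts:                         # step 2606
            # step 2607: record the closure at the position after the element
            state_log.setdefault(idx + mblen, set()).update(eclosure(node.next))
            return node.next, idx                        # step 2609
    idx += 1                                             # step 2608
    return node.next, idx                                # step 2609


c_node = Node("c", frozenset({"c"}))
bracket = Node("[a-b]", frozenset({"a", "b", "aa"}), c_node)
log = {}
print(step_to_next_nfa_state(bracket, "aac", 0, log), log)
```

In this sketch a node that cannot accept the element (for example a plain ‘a’ node) falls through to step 2608, exactly like the case where no multi-character collating element is present.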

[0077] Next, a specific example of the input character string determination processing by the DFA in the second embodiment is described. The example checks whether the input character string “aac” matches the regular expression “[a-b]c|aab”. Note that “aa” is a valid multi-character collating element, that is, “aa” ∈ “[a-b]”. FIG. 27 is a diagram showing the appropriate NFA of the regular expression “[a-b]c|aab” created by the automaton construction unit 210. The NFA shown in FIG. 27 has seven states: 0, 1, 2, 3, 4, 5, and 6. FIGS. 28 to 33 are diagrams showing how the automaton determination unit 240 dynamically determines and generates the transition destination states in the DFA state transitions.

[0078] Referring to FIGS. 27 and 28, first, the DFA start state {0, 2, 3} corresponding to NFA state 2, which is the start state of the NFA, is generated. In this example, the corresponding DFA states are generated one after another for each character of the input character string “aac”.

[0079] Next, the first character ‘a’ of the input character string “aac” is processed. Referring to the NFA shown in FIG. 27, the input character ‘a’ allows a transition from NFA state 0 to NFA state 1 or from NFA state 3 to NFA state 4, so state {1, 4} is generated as shown in FIG. 29. At this point the input character string is read ahead. Because the first character is ‘a’, the first and second characters may form the multi-character collating element “aa”. Therefore, based on the NFA of FIG. 27, state {1} is provisionally generated, as shown in FIG. 30, as the transition destination for the case where this string is the multi-character collating element “aa”. Since this state {1} is an incomplete state produced by read-ahead, it is drawn with a broken line in FIG. 30.

[0080] Next, the second character ‘a’ of the input character string is processed. Referring to the NFA shown in FIG. 27, there is no transition from NFA state 1 on the character ‘a’; the only transition is from NFA state 4 to NFA state 5, so state {5} is generated as shown in FIG. 31. It is now known that one of two things has happened: either the multi-character collating element “aa” formed by the first and second characters of the input matches “[a-b]” and the NFA moves from state 0 to state 1, or each character ‘a’ is consumed individually along NFA state 3 → NFA state 4 → NFA state 5. Therefore, as shown in FIG. 32, state {1, 5} is generated by merging state {1} and state {5}.

[0081] Next, the NFA of FIG. 27 shows that the input character ‘c’ causes a transition from NFA state 1 to NFA state 6, and the input character ‘b’ causes a transition from NFA state 5 to NFA state 6. State {6} is therefore generated, as shown in FIG. 33. Because this is an end state of the NFA, the dynamic generation of transition destination states in the DFA state transitions for the input character string “aac” is now complete. Since the last character of the input character string “aac” is ‘c’, the end state {6} is reached, which shows that the input character string “aac” matches this DFA.
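The walk-through above can be reproduced with a small simulation. The following is a minimal sketch, not the patented implementation: it computes each DFA state lazily as a set of NFA states and, when read-ahead finds the multi-character collating element “aa”, deposits the provisional destination two positions ahead so that it merges with the state reached character by character. The transition table is an assumed encoding of FIG. 27; in particular, the ε-edges from state 2 to states 0 and 3 are inferred from the DFA start state {0, 2, 3} rather than stated in the text.

```python
# Assumed encoding of the FIG. 27 NFA (epsilon edges inferred, see lead-in).
EPS = {2: {0, 3}}
# (state, symbol) -> destination; "aa" is the multi-character collating element in [a-b].
DELTA = {
    (0, "a"): 1, (0, "b"): 1, (0, "aa"): 1,
    (1, "c"): 6,
    (3, "a"): 4, (4, "a"): 5, (5, "b"): 6,
}
START, FINAL = 2, 6
MULTI = ("aa",)  # valid multi-character collating elements (assumption)


def eclosure(states):
    """Return the epsilon-closure of a set of NFA states."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def run(text):
    """Return (list of DFA states per input position, whether the whole text matches)."""
    pending = [set() for _ in range(len(text) + 1)]
    pending[0] = set(eclosure({START}))
    trace = []
    for i in range(len(text) + 1):
        current = frozenset(pending[i])
        trace.append(current)
        if i == len(text):
            break
        # The single character, plus any collating element found by read-ahead.
        symbols = [text[i]] + [m for m in MULTI if text.startswith(m, i)]
        for sym in symbols:
            for s in current:
                if (s, sym) in DELTA:
                    # A multi-character symbol lands len(sym) positions ahead,
                    # where it merges with states reached one character at a time.
                    pending[i + len(sym)] |= eclosure({DELTA[(s, sym)]})
    return trace, FINAL in trace[-1]


trace, matched = run("aac")
print([sorted(state) for state in trace], matched)
```

For “aac” the trace is {0, 2, 3}, {1, 4}, {1, 5}, {6}, which agrees with the states named in paragraphs [0078] to [0081]; the provisional state {1} appears only as the partial content of position 2 before the second ‘a’ is processed.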

[0082] In this way, high-speed pattern matching using a DFA becomes possible even for regular expressions that contain multi-character collating elements. When such pattern matching is performed using an NFA, the processing time is on the order of O(2ⁿ), that is, proportional to 2 to the power n for an input n. In contrast, in the present embodiment the time required for pattern matching using the DFA is on the order of O(n), so matching completes in time proportional to n for an input n. The present embodiment can therefore greatly reduce the processing time compared with processing that uses an NFA. In addition, because the number of characters in a collating element can be variable, high-speed pattern matching using a DFA can also be applied to character search over multi-byte characters such as Japanese characters.

[0083] In the second embodiment, the dynamic determination and generation of transition destination states in the DFA state transitions during input string determination is added to the first embodiment. This technique can, however, also be used independently of the technique of the first embodiment (that is, the technique of obtaining information about the part of the input character string matched by a partial regular expression by narrowing down the DFA state sequences and restoring the NFA states). In that case too, as described above, high-speed pattern matching using a DFA is possible for regular expressions that contain multi-character collating elements. When it is combined with the first embodiment as described above, however, it can also be applied to backreferences. A backreference is “an operator that matches the character string matched by a specified partial regular expression”. For example, the regular expression “(ab(cd))\1\2” matches “abcdabcdcd”, and \1 and \2 in this expression are backreferences. The 1 and 2 in \1 and \2 indicate the order in which the partial regular expressions appear. In the regular expression “(ab(cd))”, for example, the outer (first-appearing) parenthesized “(ab(cd))” corresponds to \1, and the inner parenthesized “(cd)” that appears next corresponds to \2. \1 and \2 then match the strings “abcd” and “cd” that these partial regular expressions matched. Similarly, the regular expression “(ab)(cd)\2\1” matches “abcdcdab”, and “(a+)\1” matches “aa”, “aaaa”, “aaaaaa”, and so on, that is, strings consisting of an even number of a's. In the latter case, the first half of the string matches (a+) and the second half matches \1. A backreference thus requires the information “what did the partial regular expression match before \1 appears”, and the backreference itself “matches multiple characters with a single operator”. By handling the former requirement with the first embodiment and the latter with the second embodiment, character strings can be collated and searched using backreferences.
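The backreference examples in this paragraph can be checked with any regex engine that supports backreferences. The sketch below uses Python's standard re module purely to confirm the behaviour of the patterns; that module is a backtracking matcher, not the DFA-based method described in this document.

```python
import re

# Patterns and inputs taken from the paragraph above.
examples = [
    (r"(ab(cd))\1\2", "abcdabcdcd"),
    (r"(ab)(cd)\2\1", "abcdcdab"),
]
for pattern, text in examples:
    m = re.fullmatch(pattern, text)
    # groups() shows what each parenthesized partial regular expression captured.
    print(pattern, text, m.groups() if m else None)

# (a+)\1 should accept exactly the strings made of an even number of a's.
even_as = re.compile(r"(a+)\1")
results = {n: bool(even_as.fullmatch("a" * n)) for n in range(1, 7)}
print(results)
```

In the first example, group 1 captures “abcd” and group 2 captures “cd”, which is the information a matcher has to retain before it can evaluate \1 and \2.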

[0084]

[Effects of the Invention] As described above, according to the present invention, POSIX regular expressions that include partial regular expressions and longest matching can be processed by a DFA (deterministic finite state automaton). In addition, according to the present invention, regular expressions that include multi-character collating elements can be processed by a DFA (deterministic finite state automaton).

[Brief description of drawings]

FIG. 1 is a diagram schematically showing an example of the hardware configuration of a computer device suitable for realizing the document processing system according to the first embodiment.

FIG. 2 is a diagram illustrating the configuration of the document processing system according to the first embodiment.

FIG. 3 is a flowchart showing the general flow of the DFA construction process performed by the automaton construction unit of the first embodiment.

FIG. 4 is a diagram showing the syntax tree constructed from the regular expression “ab|c(de)*”.

FIG. 5 is a diagram showing an example of a program for executing the process of constructing a syntax tree.

FIG. 6 is a diagram illustrating the definition code of the function first used to construct the appropriate NFA according to the first embodiment.

FIG. 7 is a diagram illustrating the definition code of the function epsdest used to construct the appropriate NFA according to the first embodiment.

FIG. 8 is a diagram illustrating the definition code of the function next used to construct the appropriate NFA according to the first embodiment.

FIG. 9 is a flowchart illustrating the process of constructing the appropriate NFA in the first embodiment.

FIG. 10 is a flowchart showing the general flow of the pattern matching process performed by the automaton determination unit of the first embodiment.

FIG. 11 is a flowchart illustrating the operation of narrowing down the DFA state sequences in the first embodiment.

FIG. 12 is a flowchart illustrating details of the process of updating the DFA state sequence (state_log[idx]) in FIG. 11.

FIG. 13 is a flowchart illustrating details of the process of adding ε-transition sources in FIG. 12.

FIG. 14 is a diagram showing the appropriate NFA of the regular expression “((a*)a)*a”.

FIG. 15 is a diagram showing the DFA of the NFA shown in FIG. 14 and its state transitions for the character string “aaaa”.

FIG. 16 is a diagram showing the NFA state transitions for the input character string “aaaa”.

FIG. 17 is a flowchart illustrating the operation of the match range determination process in the first embodiment.

FIG. 18 is a flowchart illustrating details of the register update process in FIG. 17.

FIG. 19 is a flowchart illustrating details of the transition process to the next NFA state in FIG. 17.

FIG. 20 is a diagram showing how the match range is determined based on the diagram of the transitions shown in FIG. 16 and the state transition path given by the narrowed-down DFA state sequence.

FIG. 21 is a diagram showing the restored NFA state sequence.

FIG. 22 is a flowchart illustrating the input character string determination process by the DFA in the second embodiment.

FIG. 23 is a flowchart illustrating the input character string determination process by the DFA in the second embodiment.

FIG. 24 is a flowchart illustrating the process of narrowing down the DFA state sequences in the second embodiment.

FIG. 25 is a flowchart illustrating the process of narrowing down the DFA state sequences in the second embodiment.

FIG. 26 is a flowchart illustrating the match range determination process in the second embodiment.

FIG. 27 is a diagram showing the appropriate NFA of the regular expression “[a-b]c|aab” created by the automaton construction unit of the second embodiment.

FIG. 28 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where states up to state {0} have been generated.

FIG. 29 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where states up to state {1, 2} have been generated.

FIG. 30 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where state {1} has been provisionally generated.

FIG. 31 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where states up to state {3} have been generated.

FIG. 32 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where states up to state {1, 3} have been generated.

FIG. 33 is a diagram showing how the transition destination states in the DFA state transitions are dynamically determined and generated according to the second embodiment, at the point where states up to the end states, state {4} and state {5}, have been generated.

FIG. 34 is a diagram showing an example of an NFA that matches the regular expression “([ab]c)*ac”.

FIG. 35 is a diagram showing a DFA equivalent to the NFA of FIG. 34.

FIG. 36 is a diagram showing an example of an NFA corresponding to a regular expression that includes a multi-character collating element.

[Explanation of symbols]

101 ... CPU (central processing unit), 102 ... M/B (motherboard) chipset, 103 ... main memory, 105 ... hard disk, 200 ... document processing system, 210 ... automaton construction unit, 220 ... automaton holding unit, 230 ... document holding unit, 240 ... automaton determination unit, 250 ... document processing unit, 260 ... processing program holding unit, 300 ... input/output device

Continuation of front page: (72) Inventor: Isamu Hasegawa, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa, c/o Yamato Site, IBM Japan, Ltd. F-term (reference): 5B075 ND03 QM10

Claims

[Claims]

1. A character string collating method for collating character strings using a computer, comprising: a step of creating a non-deterministic finite state automaton from a regular expression of a character string and storing it in a memory; a step of reading the non-deterministic finite state automaton from the memory, creating a deterministic finite state automaton based on the non-deterministic finite state automaton, and storing it in a memory; a step of reading the deterministic finite state automaton from the memory and matching character strings using the deterministic finite state automaton; and a step of specifying, for a matched character string, the match range of the character string by using the non-deterministic finite state automaton and the deterministic finite state automaton read from the memory.

2. The character string collating method according to claim 1, wherein the step of creating the non-deterministic finite state automaton includes: a step of generating, for the regular expression of the character string, one state of the non-deterministic finite state automaton corresponding to each element of the regular expression except elements that designate a certain range; and a step of associating ε-transitions with elements that mean repetition and elements that mean selection, and associating with the other elements a transition to the state associated with the next element.

3. The character string collating method according to claim 1, wherein the step of specifying the match range includes: a step of deleting, from the state sequences indicating state transitions by the deterministic finite state automaton, unnecessary state sequences that do not reach an end state; and a step of specifying the match range of the character string based on the remaining state sequences.

4. The character string collating method according to claim 1, wherein the step of specifying the match range includes a step of specifying the match range of the character string based on a state sequence that satisfies the longest leftmost rule among the state sequences indicating state transitions by the deterministic finite state automaton.

5. A character string collating method for collating character strings using a computer, comprising: a first step of reading from a memory a deterministic finite state automaton created based on a regular expression of a predetermined character string, matching character strings using the deterministic finite state automaton, and storing the processing result in the memory; and a third step of identifying, based on the narrowed-down state sequence, which character in the character string to be processed matches which part of the regular expression.

6. The character string collating method according to claim 5, wherein the third step includes a step of selecting, from among the narrowed-down state sequences, a state sequence in which the repetition that appears first is repeated the greatest number of times, and determining, based on that state sequence, the correspondence between each part of the character string to be processed and each part of the regular expression.

7. A character string collating method for collating character strings using a computer, comprising: a first step of reading from a memory a deterministic finite state automaton created based on a regular expression of a predetermined character string and matching character strings using the deterministic finite state automaton; a second step of restoring, based on a first state sequence indicating state transitions by the deterministic finite state automaton, a second state sequence indicating state transitions in the non-deterministic finite state automaton; and a third step of obtaining, based on the restored second state sequence, information on which character in the character string matched which part of the regular expression.

8. The character string collating method according to claim 7, wherein the second step includes a step of restoring, as the second state sequence, a state sequence that satisfies the longest leftmost rule.

9. A character string collating method for collating character strings using a computer, comprising: a step of creating a non-deterministic finite state automaton from a regular expression of a character string and storing it in a memory; a step of reading the non-deterministic finite state automaton from the memory, creating a deterministic finite state automaton based on the non-deterministic finite state automaton, and storing it in a memory; and a step of performing matching on each element of the character string to be processed while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton read from the memory.

10. The character string collating method according to claim 9, wherein the step of performing the matching includes: a step of reading ahead in the character string to be processed and determining whether the character string contains a character string that may correspond to a multi-character collating element; and a step of, when the character string contains a character string that may correspond to a multi-character collating element, dynamically determining the transition destination state so as to reflect the state transition for the case where that character string is a multi-character collating element.

11. The character string collating method according to claim 9, further comprising a step of obtaining, after the matching has been performed and for a matched character string, information on the match range of the character string by using the non-deterministic finite state automaton and the deterministic finite state automaton read from the memory.

12. A character string collating method for collating character strings using a computer, comprising: a step of creating a non-deterministic finite state automaton from a regular expression of a character string and storing it in a memory; a step of reading the non-deterministic finite state automaton from the memory, creating a deterministic finite state automaton based on the non-deterministic finite state automaton, and storing it in a memory; and a step of reading ahead in the character string to be processed and, when the character string contains a character string that may be a multi-character collating element, generating the state transitions of the deterministic finite state automaton read from the memory for each element of the character string, provisionally generating the state transition corresponding to the character string that may be the multi-character collating element, and performing matching based on these state transitions.

13. A document processing device for searching character strings using a regular expression, comprising: non-deterministic finite state automaton constructing means for constructing a non-deterministic finite state automaton from a regular expression of a character string; deterministic finite state automaton constructing means for constructing a deterministic finite state automaton based on the non-deterministic finite state automaton constructed by the non-deterministic finite state automaton constructing means; and determining means for matching character strings using the deterministic finite state automaton constructed by the deterministic finite state automaton constructing means, wherein the determining means further specifies, for a matched character string, the match range of the character string by using the non-deterministic finite state automaton and the deterministic finite state automaton.

14. The document processing device according to claim 13, wherein the non-deterministic finite state automaton constructing means generates, for the regular expression of the character string, one state of the non-deterministic finite state automaton corresponding to each element of the regular expression except elements that designate a certain range, associates ε-transitions with elements that mean repetition and elements that mean selection, and associates with the other elements a transition to the state associated with the next element.

15. The document processing device according to claim 13, further comprising: state sequence narrowing means for narrowing down the state sequences indicating state transitions by the deterministic finite state automaton by deleting unnecessary state sequences that do not reach an end state; and match range determining means for specifying the match range of the character string based on the state sequences narrowed down by the state sequence narrowing means.

16. A document processing device for searching character strings using a regular expression, comprising: non-deterministic finite state automaton constructing means for constructing a non-deterministic finite state automaton from a regular expression of a character string; deterministic finite state automaton constructing means for constructing a deterministic finite state automaton based on the non-deterministic finite state automaton constructed by the non-deterministic finite state automaton constructing means; and determining means for performing matching on each element of the character string to be processed, using the deterministic finite state automaton constructed by the deterministic finite state automaton constructing means, while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton.

17. The document processing device according to claim 16, wherein the determining means reads ahead in the character string to be processed, determines whether the character string contains a character string that may correspond to a multi-character collating element, and, when it determines that such a character string is contained, dynamically determines the transition destination state so as to reflect the state transition for the case where that character string is a multi-character collating element.

18. The document processing device according to claim 16, further comprising: state sequence narrowing means for narrowing down the state sequences indicating state transitions by the deterministic finite state automaton by deleting unnecessary state sequences that do not reach an end state; and match range determining means for specifying the match range of the character string based on the state sequences narrowed down by the state sequence narrowing means.

19. A program for controlling a computer to collate character strings, the program causing the computer to execute: a process of creating a non-deterministic finite state automaton from a regular expression of a character string and storing it in a memory; a process of reading the non-deterministic finite state automaton from the memory, creating a deterministic finite state automaton based on the non-deterministic finite state automaton, and storing it in a memory; a process of reading the deterministic finite state automaton from the memory and matching character strings using the deterministic finite state automaton; and a process of specifying, for a matched character string, the match range of the character string by using the non-deterministic finite state automaton and the deterministic finite state automaton read from the memory.

20. The program according to claim 19, wherein the process of creating the non-deterministic finite state automaton includes: a process of generating, for the regular expression of the character string, one state of the non-deterministic finite state automaton corresponding to each element of the regular expression except elements that designate a certain range; and a process of associating ε-transitions with elements that mean repetition and elements that mean selection, and associating with the other elements a transition to the state associated with the next element.

21. The program according to claim 19, wherein the process of specifying the match range includes: a process of deleting, from the state sequences indicating state transitions by the deterministic finite state automaton, unnecessary state sequences that do not reach an end state; and a process of specifying the match range of the character string based on the remaining state sequences.

22. A program for controlling a computer to collate character strings, the program causing the computer to execute: a process of creating a non-deterministic finite state automaton from a regular expression of a character string and storing it in a memory; a process of reading the non-deterministic finite state automaton from the memory, creating a deterministic finite state automaton based on the non-deterministic finite state automaton, and storing it in a memory; and a process of performing matching on each element of the character string to be processed while dynamically determining the transition destination state in the state transitions of the deterministic finite state automaton read from the memory.

23. The program according to claim 22, wherein the matching process includes: a process of reading ahead in the character string to be processed and determining whether the character string contains a character string that may correspond to a multi-character collating element; and a process of, when the character string contains a character string that may correspond to a multi-character collating element, dynamically determining the transition destination state so as to reflect the state transition for the case where that character string is a multi-character collating element.