JP4347087B2

JP4347087B2 - Pattern matching apparatus and method, and program

Info

Publication number: JP4347087B2
Application number: JP2004051680A
Authority: JP
Inventors: 亮介榑林; 勝片山; 直明山中; 公平塩本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-02-26
Filing date: 2004-02-26
Publication date: 2009-10-21
Anticipated expiration: 2024-02-26
Also published as: JP2005242672A

Description

The present invention relates to a technique and an apparatus for pattern matching, that is, for searching text, which is a given input character string, for patterns, which are arbitrary character strings of multiple kinds.

Pattern matching, which searches a given text for multiple arbitrary patterns, is applied in a variety of fields such as word-processing software and database search. However, with the progress of the information society and the growing capacity and falling cost of storage devices such as hard disks, the amount of information to be searched has become enormous. Attempts have therefore been made to make pattern matching more efficient, for example by implementing it in hardware.

Pattern matching is defined here as follows. Assume a text $T$ of length $n$ and a pattern $P$ of length $m$. The pattern and the text are each formed by arranging characters from a common character set in order from left to right: $n$ characters for the text and $m$ characters for the pattern. Writing the characters of the text as $t_i$, the text $T$ can be written as $t_1 \dots t_n$. Similarly, the pattern $P$ can be written as $p_1 \dots p_m$. The purpose of pattern matching is to determine whether the pattern occurs in the text, that is, to detect a partial text such that $t_{s+1} \dots t_{s+m} = p_1 \dots p_m$.
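The definition above can be illustrated with a minimal sketch. The function name `find_occurrences` is hypothetical and not part of the patent, and the brute-force comparison is used only to make the definition concrete; it is not one of the methods discussed in this document. With Python's 0-based slicing, the returned offset `s` corresponds to the $s$ in $t_{s+1} \dots t_{s+m}$.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return every offset s such that text[s:s+m] equals the pattern.

    This is a direct transcription of the definition, not an efficient
    algorithm: it re-compares up to m characters at every offset.
    """
    n, m = len(text), len(pattern)
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


if __name__ == "__main__":
    print(find_occurrences("xxababcdababcd", "ababcd"))
```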

Many pattern matching methods have already been proposed. They can be broadly classified into the following two types.
1) Methods in which the character comparison position between the text and the pattern changes functionally, that is, depending on the results of earlier comparisons.

In methods of this type, the characters of the text and the pattern are compared, and when a pair of characters does not match, the position at which the text and the pattern overlap (hereinafter called the window) is moved to the right along the text and the comparison between the text in the window and the pattern is resumed. To reduce the number of character comparisons, it is important to move the window as far as possible on a mismatch. Typical methods of this type include BM (Boyer-Moore), Reverse Factor, and KMP (Knuth-Morris-Pratt) (see, for example, Non-Patent Document 1).
2) Methods in which the character comparison position between the text and the pattern advances at a constant rate.

In methods of this type, the number of character comparisons depends only on the text length, whatever text and pattern are used. Methods of this type include automaton-based methods and the Shift-OR algorithm.
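As an illustration of a method of this second type, the following is a minimal sketch of the Shift-OR idea for a single pattern. The function name `shift_or_search` is an assumption made for this example. The point to notice is that the loop performs exactly one state update per text character, whatever the text and pattern contain.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Single-pattern Shift-OR search; returns 0-based start offsets."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # masks[ch] has bit i cleared exactly when pattern[i] == ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("abracadabra", "abra"))
```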

The former type generally has a better expected number of character comparisons than the latter, so it is widely used, particularly in software implementations. However, the former type cannot search for multiple patterns simultaneously. In addition, the following problems arise when it is implemented in hardware.

The distance the window can be moved to the right is determined by the results of comparing the preceding input characters. The input characters of the text therefore cannot be pipelined.

Because the character comparison position changes functionally, the number of character comparisons varies with the text and the pattern, and its worst case is larger than that of the latter type. A buffer for the input text is needed to absorb the resulting variation in pattern matching throughput. Enlarging the buffer reduces the probability of overflow, but it is difficult to guarantee that the buffer never overflows. Moreover, when the text in the buffer runs out, problems such as pipeline stalls occur.
Non-Patent Document 1: Christian Charras and Thierry Lecroq, "Handbook of Exact String Matching Algorithms", [online], [retrieved February 10, 2004], Internet <URL:http://homepage.stts.edu/~aikawa/string.pdf>
Non-Patent Document 2: A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search", Comm. of the ACM, 18(6):333-340, June 1975.

Conventional hardware implementations of pattern matching mainly use automaton-based methods. The first feature of an automaton is that 1) it can search for multiple patterns simultaneously. Its features that suit hardware are that 2-1) the character comparison position between the text and the pattern advances at a constant rate, so no mechanism for buffering the text is needed, and 2-2) the delay between the input of one text character and the input of the next is small compared with other methods. On the other hand, automaton-based methods face several problems in terms of memory size, logic circuit scale, and throughput.

In the automaton-based method, an automaton is first constructed whose states are the longest prefix of the pattern that matches a suffix of the text input so far. Here, a prefix of an arbitrary character string $t_i t_{i+1} \dots t_{i+j}$ is defined as $t_i t_{i+1} \dots t_{i+l}$ ($0 \le l \le j$), and a suffix of $t_i t_{i+1} \dots t_{i+j}$ is defined as $t_l t_{l+1} \dots t_{i+j}$ ($i \le l \le i+j$). When there are multiple patterns, the automata for the individual patterns can be merged into a single automaton. This method is called Aho-Corasick (see, for example, Non-Patent Document 2).

FIG. 9 illustrates an Aho-Corasick automaton, taking as an example the automaton generated for the two patterns cabc and cbaca. In every state, input of a character not shown on an outgoing arc returns the automaton to state 0. The characters of the input text are fed to the automaton one at a time, and the automaton repeatedly moves to its next state according to the input character and its current state. When the automaton enters a state that represents an entire pattern (hereinafter, a final state), a character string in the text has matched that pattern. In the example of FIG. 9, states 4 and 8 are the final states.
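The automaton described above can be sketched in software as follows. This is an illustrative reconstruction under stated assumptions, not the circuit of FIG. 9: the function names are hypothetical, states are numbered in the order the pattern characters are inserted (which happens to give final states 4 and 8 for cabc and cbaca), and failure links are followed at run time rather than being drawn as explicit arcs.

```python
from collections import deque


def build_aho_corasick(patterns):
    """Build goto/failure/output tables for a set of patterns."""
    goto = [{}]          # goto[state][char] -> next state
    output = [set()]     # output[state] -> patterns recognised in this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # Breadth-first pass: fail[s] is the state for the longest proper
    # suffix of s's string that is also a prefix of some pattern.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, patterns):
    """Return sorted (start offset, pattern) pairs for every occurrence."""
    goto, fail, output = build_aho_corasick(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((pos - len(p) + 1, p) for p in output[state])
    return sorted(hits)


if __name__ == "__main__":
    print(search("cabcbaca", ["cabc", "cbaca"]))
```

In the text cabcbaca the two patterns overlap on the character c, and the failure link from state 4 back to state 1 is what lets the second occurrence be found without rescanning.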

FIG. 10 shows a conventional implementation of an automaton using memory, and FIG. 11 shows a conventional implementation in which the automaton is expanded into a circuit. Hardware implementations of an automaton fall into two groups: representing the automaton in memory, as in FIG. 10, and turning the automaton directly into a circuit, as in FIG. 11. To represent the automaton in memory, a table is used that takes the newly input text character and the current state of the automaton as input and outputs the next state. However, when the automaton is implemented as a table, the number of table entries grows exponentially with the bit width of one text character.
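The growth of the table can be made concrete with a small calculation. The sketch below assumes a dense table with one entry per (state, character) pair, which is one plausible reading of the table described above; the helper name `table_entries` is hypothetical, and the 9 states are those of the cabc/cbaca example (states 0 to 8).

```python
def table_entries(num_states: int, bits_per_char: int) -> int:
    """Entries in a dense next-state table indexed by (state, character)."""
    return num_states * (2 ** bits_per_char)


if __name__ == "__main__":
    # Each extra bit of character width doubles the table.
    for bits in (7, 8, 16):
        print(f"{bits:2d}-bit characters: {table_entries(9, bits)} entries")
```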

Alternatively, the number of entries can be reduced by representing the automaton with a binary tree or a similar structure. With this method, however, the delay from the input of a text character until the next state is determined is larger than with a table. Because the next text character cannot be processed during this delay, throughput is severely limited. In the method of FIG. 11, in which the automaton is implemented directly as wired logic, a comparator is needed for every state of the automaton, and throughput is limited by the wiring delay between states.

The present invention has been made against this background, and its object is to provide a pattern matching apparatus and method that can hide the various delays that arise when pattern matching is performed in hardware and can improve throughput.

The present invention is characterized by adding the following features to conventional automaton-based pattern matching.

Each pattern is divided into a plurality of patterns by distributing its characters according to a function; an automaton whose states are the longest prefix of the pattern that matches a suffix of the text input so far is constructed for each divided pattern; and pattern matching is performed by determining whether all the automata corresponding to all the divided patterns generated from one pattern have reached their final states.

All the automata corresponding to the patterns divided from the individual patterns can be merged into a single automaton for pattern matching.

By implementing the generated automaton as a pipeline and performing the comparisons between the text and the individual divided patterns in a time-division manner, a single text can be pipelined character by character.

By arranging the generated automata in parallel and distributing the characters of the text to the individual automata so that they are input in parallel, a single text can be processed in parallel character by character.

In other words, because each divided pattern can be compared with the text independently of the others, the comparisons can be pipelined or processed in parallel. When the pattern is divided into $k$ parts, the text is also divided into $k$ parts. In the present invention, these divided texts are each compared independently with the individual divided patterns, and the results are combined.

As a result, the comparisons between divided patterns and divided texts can be processed in parallel. In addition, by subdividing the automaton's comparison processing into multiple stages and processing the independent divided texts in a time-division manner, the automaton can be pipelined.

In this way, the object of the present invention is achieved: by pipelining or parallelizing the processing of the text character by character, delays that arise when pattern matching is performed in hardware, such as wiring delay and memory access delay, are hidden, and throughput is improved.

That is, a first aspect of the present invention is a pattern matching apparatus that searches text, which is an input character string, for a pattern, which is an arbitrary character string, the apparatus comprising: $k$ sets of automaton units, each set being formed by arranging in parallel $k$ automaton units which, for the divided patterns obtained by dividing one pattern into $k$ parts by distributing the characters of the pattern in order from the head to every $k$-th position (that is, skipping $k-1$ characters, where $k$ is an integer of 2 or more), each hold a mutually different divided pattern and each have as their state the longest prefix of the held divided pattern that matches a suffix of the text input so far; means for dividing one text into $k$ divided texts by distributing the characters of the text in order from the head to every $k$-th position, and for inputting mutually different divided texts to the respective $k$ sets of automaton units; and means for performing pattern matching by determining, over all $k$ sets of automaton units, whether one of the automaton units constituting each set has reached its final state and whether the $k$ automaton units that have reached the final state hold mutually different divided patterns.

Furthermore, the sets of automaton units may be implemented as a pipeline, and pipeline processing means may be provided that performs the comparisons between the divided texts and the divided patterns in a time-division manner on the pipelined sets of automaton units.

Alternatively, each set of automaton units may be replaced with a single automaton unit that is obtained by merging the $k$ automaton units holding mutually different divided patterns and that has $k$ final states corresponding respectively to the different divided patterns.

A second aspect of the present invention is a pattern matching method in which a pattern matching apparatus searches text, which is an input character string, for a pattern, which is an arbitrary character string. The method uses $k$ sets of automaton units, each set being formed by arranging in parallel $k$ automaton units which, for the divided patterns obtained by dividing one pattern into $k$ parts by distributing the characters of the pattern in order from the head to every $k$-th position (that is, skipping $k-1$ characters, where $k$ is an integer of 2 or more), each hold a mutually different divided pattern and each have as their state the longest prefix of the held divided pattern that matches a suffix of the text input so far. The method executes: a step in which character string input means divides one text into $k$ divided texts by distributing the characters of the input text in order from the head to every $k$-th position, and inputs mutually different divided texts to the respective $k$ sets of automaton units; and a step in which determination means performs pattern matching by determining, over all $k$ sets of automaton units, whether one of the automaton units constituting each set has reached its final state and whether the $k$ automaton units that have reached the final state hold mutually different divided patterns.

Furthermore, the sets of automaton units may be implemented as a pipeline, and a pipeline processing step may be executed that performs the comparisons between the divided texts and the divided patterns in a time-division manner on the pipelined sets of automaton units.

Alternatively, each set of automaton units may be replaced with a single automaton unit that is obtained by merging the $k$ automaton units holding mutually different divided patterns and that has $k$ final states corresponding respectively to the different divided patterns.

A third aspect of the present invention is a program that, when installed on an information processing apparatus, causes that apparatus to realize functions corresponding to a pattern matching apparatus that searches text, which is an input character string, for a pattern, which is an arbitrary character string. The program realizes: a function corresponding to $k$ sets of automaton units, each set being formed by arranging in parallel $k$ automaton units which, for the divided patterns obtained by dividing one pattern into $k$ parts by distributing the characters of the pattern in order from the head to every $k$-th position (that is, skipping $k-1$ characters, where $k$ is an integer of 2 or more), each hold a mutually different divided pattern and each have as their state the longest prefix of the held divided pattern that matches a suffix of the text input so far; a function of dividing one text into $k$ divided texts by distributing the characters of the text in order from the head to every $k$-th position, and of inputting mutually different divided texts to the respective $k$ sets of automaton units; and a function of performing pattern matching by determining, over all $k$ sets of automaton units, whether one of the automaton units constituting each set has reached its final state and whether the $k$ automaton units that have reached the final state hold mutually different divided patterns.

Furthermore, the sets of automaton units may be implemented as a pipeline, and a pipeline processing function may be realized that performs the comparisons between the divided texts and the divided patterns in a time-division manner on the pipelined sets of automaton units.

A fourth aspect of the present invention is a recording medium, readable by the information processing apparatus, on which the program of the present invention is recorded. Because the program is recorded on the recording medium, the information processing apparatus can install the program using the medium. Alternatively, the program can be installed on the information processing apparatus directly over a network from a server that holds the program.

This makes it possible to realize, using a general-purpose information processing apparatus, a pattern matching apparatus that can hide various delays and improve throughput.

According to the present invention, the various delays that arise when pattern matching is performed in hardware can be hidden and throughput can be improved. The same benefits can be obtained when pattern matching is realized with a general-purpose information processing apparatus and a program.

A pattern matching apparatus, method, and program according to an embodiment of the present invention are described with reference to FIGS. 1 to 8. FIG. 1 is a configuration diagram of a pattern matching apparatus using divided patterns. FIG. 2 shows an example in which character-level parallel processing is performed by automata using divided patterns. FIG. 3 shows a configuration in which pipeline processing is performed by an automaton implemented in memory. FIG. 4 shows a configuration in which pipeline processing is performed by an automaton expanded into a circuit. FIG. 5 is a configuration diagram of a pattern matching apparatus in which the divided patterns are merged. FIG. 6 shows an example of pattern matching in which the divided patterns are merged. FIG. 7 shows the final state determination unit for the memory representation. FIG. 8 shows an example in which the automaton is represented in memory.

As shown in FIG. 1, the embodiment of the present invention is a pattern matching apparatus that searches text, which is an input character string, for patterns, which are arbitrary character strings of multiple kinds. It is characterized by comprising: an automaton unit 1 that holds divided patterns, obtained by dividing each individual pattern into a plurality of patterns by distributing its characters according to a function, and that constructs, for each divided pattern, an automaton whose states are the longest prefix of the pattern that matches a suffix of the text input so far; and a final state determination unit 2 that performs pattern matching by determining whether all the automata constructed for all the divided patterns divided from one pattern have reached their final states.

Furthermore, as shown in FIG. 3, the automaton may be implemented as a pipeline, and pipeline processing means may be provided that performs the comparisons between the text and the individual divided patterns in a time-division manner on the pipelined automaton.

Alternatively, as shown in FIG. 2, a plurality of the automata may be arranged in parallel, and means may be provided that distributes the characters of the text to the individual automata, inputs them in parallel, and processes a single text in parallel character by character.

Alternatively, as shown in FIG. 8, the automaton may be implemented as a pipeline and a plurality of such pipelined automata may be arranged in parallel, and parallel pipeline processing means may be provided that distributes the characters of the text to the individual automata, inputs them in parallel, and processes a single text in parallel character by character, while also performing the comparisons between the characters of the text and the individual divided patterns in a time-division manner on the pipelined automata.

Furthermore, as shown in FIG. 5, a merged automaton unit 3 may be provided that merges all the automata constructed for all the divided patterns into a single automaton.

The pattern matching method of this embodiment, which searches text, which is an input character string, for patterns, which are arbitrary character strings of multiple kinds, is characterized by executing: a step of dividing each individual pattern into a plurality of patterns by distributing its characters according to a function; a step of constructing, for each divided pattern produced by the dividing step, an automaton whose states are the longest prefix of the pattern that matches a suffix of the text input so far; and a step of performing pattern matching by determining whether all the automata constructed for all the divided patterns divided from one pattern have reached their final states.

Furthermore, the automaton may be implemented as a pipeline, and a pipeline processing step may be executed that performs the comparisons between the text and the individual divided patterns in a time-division manner on the pipelined automaton.

Alternatively, a plurality of the automata may be arranged in parallel, and a step may be executed that distributes the characters of the text to the individual automata, inputs them in parallel, and processes a single text in parallel character by character.

Alternatively, the automaton may be implemented as a pipeline and a plurality of such pipelined automata may be arranged in parallel, and a parallel pipeline processing step may be executed that distributes the characters of the text to the individual automata, inputs them in parallel, and processes a single text in parallel character by character, while also performing the comparisons between the characters of the text and the individual divided patterns in a time-division manner on the pipelined automata.

Furthermore, a step may be executed that merges all the automata constructed for all the divided patterns into a single automaton.

This embodiment can be realized as a program that, when installed on a general-purpose information processing apparatus, causes that apparatus to realize functions corresponding to the pattern matching apparatus of this embodiment. The program is recorded on a recording medium and installed on the information processing apparatus, or is installed on the information processing apparatus over a communication line, and thereby causes the apparatus to realize functions corresponding to the automaton unit 1, the final state determination unit 2, and the merged automaton unit 3.

The embodiment of the present invention is described in more detail below. First, each pattern is divided into a plurality of patterns by distributing its characters according to a fixed function. As a concrete example of the function, when the characters are distributed in order from the head of the pattern, dividing the undivided pattern $P = p_1 p_2 \dots p_m$ into $k$ parts gives divided patterns $DP_i$ ($i = 1, \dots, k$) that can be written as $p_i p_{i+k} \dots p_{i+(m/k-1)k}$. For example, when two divided patterns are generated from the pattern ababcd, $DP_1$ is aac and $DP_2$ is bbd. To determine whether an arbitrary partial text $t_{s+1} t_{s+2} \dots t_{s+m}$ matches the pattern, the partial text is divided in the same way as the pattern, and each divided partial text is checked against the corresponding divided pattern. That is, it is determined whether the following equation holds for $i = 1, \dots, k$.

$$t_{s+i}\, t_{s+i+k} \dots t_{s+i+(m/k-1)k} = p_i\, p_{i+k} \dots p_{i+(m/k-1)k}$$

Next, for each divided pattern, an automaton is constructed whose state is the longest prefix of the pattern that is a suffix of the text input so far. FIG. 1 shows a configuration for performing pattern matching with automata that use divided patterns. In a search by the automata, the text $t_1 \dots t_n$ is distributed into $k$ text strings according to the function.

As a specific example of the function, when the characters are distributed in order from the beginning of the text, the divided text $DT_i$ ($i = 1, \dots, k$) can be written as $t_i\, t_{i+k} \dots t_{i+(n/k-1)k}$. Each divided text $DT_i$ is then input to the automata corresponding to the respective divided patterns.
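The round-robin division of a pattern and of a text can be sketched in a few lines of Python. The helper name `split_interleaved` is an assumption introduced for illustration. Indices are 0-based here, whereas the description uses 1-based indices, and the slicing also tolerates lengths that are not multiples of k, a case the formulas in the description do not cover.

```python
def split_interleaved(s, k):
    """Distribute the characters of s, in order, into k strings.

    Element i of the result holds s[i], s[i+k], s[i+2k], ...
    (0-based), i.e. DP_(i+1) or DT_(i+1) in the 1-based notation.
    """
    if k < 2:
        raise ValueError("k must be an integer of 2 or more")
    return [s[i::k] for i in range(k)]


# The pattern from the description, divided into two divided patterns.
divided_patterns = split_interleaved("ababcd", 2)
print(divided_patterns)  # ['aac', 'bbd']

# A text is divided the same way (the sample text is made up for this sketch).
divided_texts = split_interleaved("xxababcdxx", 2)
print(divided_texts)     # ['xaacx', 'xbbdx']
```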

Whether the text matches the entire pattern is determined by checking, at the appropriate timing, whether the automata corresponding to all of the divided patterns have transitioned to their final states. In addition, by operating the automata in parallel, the matching of text against pattern can be processed in parallel on a character-by-character basis.
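The determination can be illustrated with a minimal software sketch, under the following assumptions: the pattern length m is a multiple of k, each automaton is built in the Knuth-Morris-Pratt style (the description only requires that the state be the longest matching prefix), and the "appropriate timing" is taken to be immediately after each text character is consumed. The names `PrefixAutomaton` and `find_all` are hypothetical. The loop is sequential; it models the bookkeeping of the k sets of k automata, not their parallel execution in hardware.

```python
class PrefixAutomaton:
    """State = length of the longest pattern prefix that is a suffix of the input."""

    def __init__(self, pattern):
        self.p = pattern
        self.fail = [0] * len(pattern)      # KMP failure function
        j = 0
        for i in range(1, len(pattern)):
            while j and pattern[i] != pattern[j]:
                j = self.fail[j - 1]
            if pattern[i] == pattern[j]:
                j += 1
            self.fail[i] = j
        self.state = 0

    def step(self, ch):
        if self.state == len(self.p):       # leave the final state first
            self.state = self.fail[self.state - 1]
        while self.state and ch != self.p[self.state]:
            self.state = self.fail[self.state - 1]
        if ch == self.p[self.state]:
            self.state += 1

    @property
    def is_final(self):
        return self.state == len(self.p)


def find_all(text, pattern, k):
    """Return the 0-based start offsets of pattern in text."""
    m = len(pattern)
    if k < 2 or m == 0 or m % k:
        raise ValueError("this sketch assumes k >= 2 and len(pattern) a positive multiple of k")
    pieces = [pattern[i::k] for i in range(k)]
    # sets[j] is the set of k automata fed by divided text j (text[j::k]).
    sets = [[PrefixAutomaton(dp) for dp in pieces] for _ in range(k)]
    hits = []
    for pos, ch in enumerate(text):
        for automaton in sets[pos % k]:
            automaton.step(ch)
        # A match ending at pos starts at pos - m + 1, so divided pattern i
        # must have just finished in the set fed by divided text (pos + 1 + i) mod k.
        if all(sets[(pos + 1 + i) % k][i].is_final for i in range(k)):
            hits.append(pos - m + 1)
    return hits


print(find_all("xxababcdxx", "ababcd", 2))  # [2]
```

Each of the k sets ends up contributing a different divided pattern to a hit, which is the condition that the automata holding mutually different divided patterns are all in their final states.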

FIG. 2 illustrates a case where character-level parallel processing is performed on the pattern ababcd by automata using divided patterns.

Furthermore, in addition to parallelization, each operation of an automaton can be subdivided into a plurality of stages, as shown in FIGS. 3 and 4, and the different divided texts destined for the same automaton can be input in a time-division manner, so that the matching of text against pattern is pipelined on a character-by-character basis.

When each divided pattern is compared with each divided text using a separate automaton, the number of automata is proportional to the square of the number of divided patterns. For example, FIG. 2 requires four automata to compare two divided patterns. FIG. 5 shows a configuration in which the number of automata is reduced by synthesizing the automata based on the Aho-Corasick method. When the divided patterns are synthesized, the number of automata required for pattern matching is proportional to the number of divided patterns. FIG. 6 illustrates a case where pattern matching is performed by dividing the pattern ababcd and synthesizing the resulting automata. Compared with FIG. 2, the required number of automata is halved.
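The synthesized variant can be sketched as follows, again assuming that the pattern length is a multiple of k. A single Aho-Corasick automaton is built over the k divided patterns, and one state register is kept per divided text, so k automata are simulated instead of k × k. The function names and the dictionary-based transition table are illustrative assumptions, not the layout of FIG. 5 or FIG. 6.

```python
from collections import deque


def build_aho_corasick(pieces):
    """Return (goto, fail, out); out[s] = indices of the pieces ending in state s."""
    goto, fail, out = [{}], [0], [set()]
    for idx, piece in enumerate(pieces):
        node = 0
        for ch in piece:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(idx)                  # final state for divided pattern idx
    queue = deque(goto[0].values())         # depth-1 states fail to the root
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def step(goto, fail, state, ch):
    while state and ch not in goto[state]:
        state = fail[state]
    return goto[state].get(ch, 0)


def find_all_synthesized(text, pattern, k):
    """Return the 0-based start offsets of pattern in text."""
    m = len(pattern)
    if k < 2 or m == 0 or m % k:
        raise ValueError("this sketch assumes k >= 2 and len(pattern) a positive multiple of k")
    goto, fail, out = build_aho_corasick([pattern[i::k] for i in range(k)])
    states = [0] * k                        # one synthesized automaton per divided text
    hits = []
    for pos, ch in enumerate(text):
        j = pos % k
        states[j] = step(goto, fail, states[j], ch)
        # Divided pattern i must be final in the automaton fed by
        # divided text (pos + 1 + i) mod k.
        if all(i in out[states[(pos + 1 + i) % k]] for i in range(k)):
            hits.append(pos - m + 1)
    return hits


print(find_all_synthesized("xxababcdxx", "ababcd", 2))  # [2]
```

In software the k synthesized automata are identical, so the sketch shares one transition table among them and keeps only the state registers separate.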

The final state determination unit 2 determines whether the current state of an automaton is a final state. FIG. 7 shows an implementation example of the final state determination unit 2 for the case where the automaton is represented in memory: it takes the state of the automaton as input and outputs a signal indicating whether the automaton has transitioned to a final state. When the automaton is instead represented as a circuit, it suffices to output the value of the register for each state that represents a final state.
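One way to picture the memory-based determination is a lookup table indexed by the state number. The layout of FIG. 7 is not reproduced here, so the table format, the state numbering, and the function names below are assumptions made purely for illustration.

```python
def make_final_table(num_states, finals):
    """Build a table with one bit vector per state.

    finals maps a final state number to the index of the divided pattern
    it completes; bit i of table[s] is set when state s is final for
    divided pattern i.
    """
    table = [0] * num_states
    for state, piece in finals.items():
        table[state] |= 1 << piece
    return table


def is_final(table, state, piece):
    """Signal whether `state` is a final state for divided pattern `piece`."""
    return bool((table[state] >> piece) & 1)


# Hypothetical numbering: a 7-state automaton in which state 3 completes
# divided pattern 0 and state 6 completes divided pattern 1.
table = make_final_table(7, {3: 0, 6: 1})
print(table)                   # [0, 0, 0, 1, 0, 0, 2]
print(is_final(table, 3, 0))   # True
print(is_final(table, 3, 1))   # False
```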

FIG. 8 shows a case where the automaton is represented in memory and the pattern is divided into four. In FIG. 8, two of the four divided texts generated from the text are pipelined. In addition, two automata are arranged in parallel so that two characters are processed in parallel. The circuit-expanded case is similar: two automata, in each of which the circuit representing each state is subdivided into two stages, are arranged in parallel. Pattern matching can then be performed in the same manner as in FIG. 8, by dividing the text, inputting it to the automata, and determining whether each divided pattern has reached its final state.

According to the present invention, the various delays that arise when pattern matching is processed in hardware can be concealed and throughput can be improved, so a pattern matching apparatus capable of highly efficient pattern matching can be realized. The same concealment of delays and improvement in throughput are also obtained when pattern matching is realized with a general-purpose information processing apparatus and a program, so a highly efficient pattern matching apparatus can be realized in that case as well.

FIG. 1 is a block diagram of a pattern matching apparatus using divided patterns.
FIG. 2 is a diagram showing an example in which character-level parallel processing is performed by automata using divided patterns.
FIG. 3 is a diagram showing a configuration for pipeline processing by an automaton implemented in memory.
FIG. 4 is a diagram showing a configuration for pipeline processing by a circuit-expanded automaton.
FIG. 5 is a block diagram of a pattern matching apparatus in which the divided patterns are synthesized.
FIG. 6 is a diagram showing an example of pattern matching in which the divided patterns are synthesized.
FIG. 7 is a diagram showing the final state determination unit for the memory representation.
FIG. 8 is a diagram showing an example in which the automaton is represented in memory.
FIG. 9 is a diagram showing an automaton generated based on Aho-Corasick.
FIG. 10 is a diagram showing an example of an automaton realized with memory.
FIG. 11 is a diagram showing an example of circuit expansion of an automaton.

1 Automaton unit
2 Final state determination unit
3 Synthesized automaton unit

Claims

A pattern matching apparatus for searching text, which is an input character string, for a pattern, which is an arbitrary character string, the apparatus comprising:
k sets of automaton units, each set being formed by arranging k automaton units in parallel, wherein one pattern is divided into k divided patterns (k being an integer of 2 or more) by distributing its characters, in order from the beginning, at intervals of k-1 characters, the k automaton units of a set hold mutually different divided patterns, and each automaton unit has as its state the longest prefix of the divided pattern it holds with respect to the suffix of the text input so far;
means for dividing one text into k divided texts by distributing the characters of the text, in order from the beginning, at intervals of k-1 characters, and for inputting mutually different divided texts to the respective k sets of automaton units; and
means for performing pattern matching by determining whether, for all of the k sets of automaton units, one of the automaton units constituting the set is in a final state and the k automaton units holding mutually different divided patterns have reached the final state.

The pattern matching apparatus according to claim 1, wherein each set of automaton units is implemented as a pipeline, and the pipelined set of automaton units includes a pipeline processing unit that compares the divided texts with the divided patterns in a time-division manner.

The pattern matching apparatus according to claim 1, wherein each set of automaton units is replaced with a single automaton unit that is a synthesis of the k automaton units holding the mutually different divided patterns and that has k final states respectively corresponding to the mutually different divided patterns.

A pattern matching method in which a pattern matching apparatus searches text, which is an input character string, for a pattern, which is an arbitrary character string,
the method using k sets of automaton units, each set being formed by arranging k automaton units in parallel, wherein one pattern is divided into k divided patterns (k being an integer of 2 or more) by distributing its characters, in order from the beginning, at intervals of k-1 characters, the k automaton units of a set hold mutually different divided patterns, and each automaton unit has as its state the longest prefix of the divided pattern it holds with respect to the suffix of the text input so far, the method comprising:
a step in which character string input means divides the text of the input character string into k divided texts by distributing its characters, in order from the beginning, at intervals of k-1 characters, and inputs mutually different divided texts to the respective k sets of automaton units; and
a step in which determination means performs pattern matching by determining whether, for all of the k sets of automaton units, one of the automaton units constituting the set is in a final state and the k automaton units holding mutually different divided patterns have reached the final state.

The pattern matching method according to claim 4, wherein each set of automaton units is implemented as a pipeline, and the method executes a pipeline processing step in which the pipelined set of automaton units compares the divided texts with the divided patterns in a time-division manner.

The pattern matching method according to claim 4 or 5, wherein each set of automaton units is replaced with a single automaton unit that is a synthesis of the k automaton units holding the mutually different divided patterns and that has k final states respectively corresponding to the mutually different divided patterns.

A program that, when installed on an information processing apparatus, causes the information processing apparatus to implement functions corresponding to a pattern matching apparatus for searching text, which is an input character string, for a pattern, which is an arbitrary character string, the program implementing:
a function corresponding to k sets of automaton units, each set being formed by arranging k automaton units in parallel, wherein one pattern is divided into k divided patterns (k being an integer of 2 or more) by distributing its characters, in order from the beginning, at intervals of k-1 characters, the k automaton units of a set hold mutually different divided patterns, and each automaton unit has as its state the longest prefix of the divided pattern it holds with respect to the suffix of the text input so far;
a function of dividing one text into k divided texts by distributing the characters of the text, in order from the beginning, at intervals of k-1 characters, and of inputting mutually different divided texts to the respective k sets of automaton units; and
a function of performing pattern matching by determining whether, for all of the k sets of automaton units, one of the automaton units constituting the set is in a final state and the k automaton units holding mutually different divided patterns have reached the final state.

The program according to claim 7, wherein each set of automaton units is implemented as a pipeline, and the program implements a pipeline processing function by which the pipelined set of automaton units compares the divided texts with the divided patterns in a time-division manner.

The program according to claim 7 or 8, wherein each set of automaton units is replaced with a single automaton unit that is a synthesis of the k automaton units holding the mutually different divided patterns and that has k final states respectively corresponding to the mutually different divided patterns.

A recording medium readable by an information processing apparatus, on which the program according to any one of claims 7 to 9 is recorded.