JPH11161677A

JPH11161677A - Dictionary collating device

Info

Publication number: JPH11161677A
Application number: JP9343918A
Authority: JP
Inventors: Hideo Ito; 秀夫伊東
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-11-28
Filing date: 1997-11-28
Publication date: 1999-06-18

Abstract

PROBLEM TO BE SOLVED: To easily discriminate between a symbol and a symbol string, so that the single arrangement method can be applied to state transitions whose transition labels may be symbol strings as well as single symbols. SOLUTION: The device consists of an input part 1 for a symbol string, a dictionary 2 comprising a set of symbol strings represented by a state transition table, and a dictionary collating part 3 which detects all dictionary elements appearing in the input symbol string. When the dictionary collating part 3 makes a state transition on the basis of a symbol or symbol string in the input symbol string and the state transition table, a code derived from the code of the head symbol through an arithmetic operation is made to correspond to the symbol string. Symbols and symbol strings are discriminated through an arithmetic operation of low computational cost, so that high-speed dictionary collating is realized, while symbol string labels reduce the number of state transitions and improve storage efficiency.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a dictionary matching device, and more specifically to a device that represents a set of symbol strings (a dictionary) as a state transition table, matches it against an input symbol string, and detects all dictionary elements appearing in the input symbol string. It is applicable to translation devices, text search devices, spell checkers, kana-kanji conversion devices, speech recognition devices, character recognition devices, and the like.

[0002]

[Prior art] The task of matching an input symbol string against a set of symbol strings (a dictionary) and detecting all dictionary elements that appear in the input symbol string (dictionary matching) is performed in a wide range of fields, not only in language processing. In general, fast dictionary matching in time proportional to the input length can be achieved by representing the dictionary as a directed graph, reading the input symbol string one symbol at a time, and making deterministic state transitions based on the kind (code) of each symbol. FIG. 10 shows an example of a directed graph called a TRIE, and FIG. 11 an example called a DAWG; both examples represent the set of symbol strings {aaa, aba, adca}.
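As a minimal sketch of the dictionary matching task described above, the Python below builds a TRIE for the example set {aaa, aba, adca} as nested dictionaries and reports every dictionary element found in an input string. The names `build_trie`, `find_all`, and the `END` marker are illustrative assumptions, not taken from the patent. For simplicity the scan restarts at every input position, so it does not reach the strictly input-proportional running time mentioned above.

```python
END = None  # end-of-word marker; assumed never to occur as an input symbol


def build_trie(words):
    """Build a TRIE as nested dicts: one dict per state, one key per symbol."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
        node[END] = True
    return root


def find_all(text, trie):
    """Return (start, element) for every dictionary element found in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:          # no transition for this symbol
                break
            if END in node:           # a dictionary element ends here
                hits.append((start, text[start:end + 1]))
    return hits


trie = build_trie(["aaa", "aba", "adca"])
matches = find_all("xaaabay", trie)   # [(1, 'aaa'), (3, 'aba')]
```

Each transition is a single dictionary lookup on the current symbol, which is the deterministic step the paragraph refers to.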

[0003] A graph such as those above represents a set of state transitions of the form si-(x)->sj. The single arrangement method is a technique for representing a set of state transitions compactly. FIG. 12 shows state transitions represented by the single arrangement method. In FIG. 12, each state s has its own position base(s) in the array. For a state transition si-(a)->sj, the base value base(sj) of state sj is obtained, from base(si) and the internal code a' of the symbol a, as the value of the base array at position base(si) + a'. The value a' is stored in the label array at position base(si) + a' so that the source state of the transition by a' into that position is uniquely determined.
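The lookup just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the layout of FIG. 12: the state names s0 to s8, the codes a=1 to d=4, the first-fit placement in `build`, the use of 0 as the empty-slot marker, and the separate `ACCEPT` set are all hypothetical choices made for the example.

```python
def build(transitions, root, codes):
    """Place each state by first fit; return (base, label, base_of)."""
    order, seen = [root], {root}
    for s in order:  # breadth-first; the list grows while we iterate
        for t in transitions.get(s, {}).values():
            if t not in seen:
                seen.add(t)
                order.append(t)
    label, dest, base_of = [], [], {}
    for s in order:
        out = transitions.get(s, {})
        b = 0
        # base(s) must be unique, and every slot base(s) + a' must be free
        while b in base_of.values() or any(
                b + codes[x] < len(label) and label[b + codes[x]] for x in out):
            b += 1
        base_of[s] = b
        for x, t in out.items():
            p = b + codes[x]
            while len(label) <= p:
                label.append(0)   # 0 marks an empty slot (codes start at 1)
                dest.append(None)
            label[p], dest[p] = codes[x], t
    base = [base_of[t] if t is not None else 0 for t in dest]
    return base, label, base_of


def step(b, code, base, label):
    """One transition from the state with base b on internal code a'."""
    p = b + code
    if p < len(label) and label[p] == code:
        return base[p]            # base value of the destination state
    return None                   # no such transition


CODES = {"a": 1, "b": 2, "c": 3, "d": 4}
TRIE = {"s0": {"a": "s1"}, "s1": {"a": "s2", "b": "s3", "d": "s4"},
        "s2": {"a": "s5"}, "s3": {"a": "s6"}, "s4": {"c": "s7"},
        "s7": {"a": "s8"}}        # hypothetical TRIE for {aaa, aba, adca}
BASE, LABEL, BASE_OF = build(TRIE, "s0", CODES)
ACCEPT = {BASE_OF[s] for s in ("s5", "s6", "s8")}


def in_dictionary(word):
    """Walk the arrays from the root; True if word is a dictionary element."""
    b = BASE_OF["s0"]
    for ch in word:
        if ch not in CODES:
            return False
        b = step(b, CODES[ch], BASE, LABEL)
        if b is None:
            return False
    return b in ACCEPT
```

Because every state has a distinct base and the slot at base(si) + a' stores a', a slot can only be reached from one source state, which is the uniqueness property the paragraph points out.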

[0004]

[Problems to be solved by the invention] FIG. 13 shows an example of a DAWG; the graph in FIG. 13 also represents the set of symbol strings {aaa, aba, adca}. It differs from the example in FIG. 11 in that there is no state s4; instead, the symbol string dc serves as a label. To apply the single arrangement method to state transitions whose transition labels may be symbol strings as well as symbols, symbols and symbol strings must be easy to tell apart. The invention of claim 1 is intended to solve this problem.

[0005] The base array of the single arrangement method must be able to store any position in the array. Therefore, when the array is long, the storage needed for each cell of the base array grows. For example, 1 byte can represent positions 0 to 255, but once the array grows beyond that range 2 bytes are needed, and the storage required for the base array doubles at a stroke. The inventions of claims 2 and 3 are intended to solve this problem.

[0006] A DAWG has fewer states and fewer state transitions than a TRIE. In particular, for a set of symbol strings, the ends of all symbol strings converge on a single state (called the terminal state). Therefore, when a DAWG is represented by the single arrangement method, transitions to this sole terminal state arise from distant positions in the array, so the storage required for each cell of the base array becomes large. The invention of claim 4 is intended to solve this problem.

[0007] In a TRIE or DAWG whose transition labels may be symbol strings as well as symbols, once the beginning of a label's symbol string matches the input, the rest of that symbol string can be predicted. The invention of claim 5 is intended to provide dictionary matching with such a prediction function.

[0008]

[Means for solving the problems] The invention of claim 1 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a symbol string is associated with a code derived by an arithmetic operation from the code of its first symbol. Because symbols and symbol strings are distinguished by an arithmetic operation of low computational cost, fast dictionary matching is possible, while symbol string labels reduce the number of state transitions and improve storage efficiency.

[0009] The invention of claim 2 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state number of the transition destination is assigned by an arithmetic operation from the identification number (state number) of the transition source state. This reduces the array capacity required by the single arrangement method.

[0010] The invention of claim 3 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state transition is divided into several transitions so that the difference between the state number of the transition source and that of the transition destination always stays within a fixed value. Dividing state transitions in this way reduces the array capacity required by the single arrangement method.

[0011] The invention of claim 4 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a state with no outgoing transitions (a terminal state) is assigned a plurality of identification numbers (state numbers). Providing several terminal states suppresses long-distance state transitions in the single arrangement method and reduces the array capacity.

[0012] The invention of claim 5 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, each transition label of the state transition table is the longest common prefix of the partial symbol strings represented by the state. With the longest common prefix of the set of partial symbol strings corresponding to a state as the transition label, the longest character string becomes predictable when dictionary matching proceeds from left to right.

[0013]

[Embodiments of the invention] FIG. 1 is a schematic diagram of the main parts, illustrating the basic configuration of the present invention. In the figure, 1 is the symbol string input unit, 2 is the set of symbol strings (dictionary) represented by a state transition table, and 3 is the dictionary matching unit that detects all dictionary elements appearing in the input symbol string. The symbol string entered at the input unit 1 is matched against the dictionary (set of symbol strings) 2 in the dictionary matching unit 3, in the manner described below for claims 1 to 5.

[0014] (Invention of claim 1) FIG. 2 is a diagram for explaining the operation of the invention of claim 1. The invention of claim 1 comprises a symbol string input unit 1, a dictionary 2 consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit 3 that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit 3 makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a symbol string is associated with a code derived by an arithmetic operation from the code of its first symbol. As shown in FIG. 2, in a DAWG or TRIE there is at most one transition out of a given state that begins with the symbol a. Hence, noting that a symbol and a symbol string with the same first symbol never stand side by side, as shown in FIG. 3, a symbol string beginning with a is associated with the code a' + sigma, where sigma is the maximum code value in the dictionary. This guarantees that the code of a symbol string never coincides with the code of any symbol. Note that if codes coincided, the transition source could not be uniquely determined from the value in the label array alone. The dictionary matching algorithm is as shown in FIG. 4.
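The coding rule a' + sigma can be illustrated with a short Python sketch. It assumes, purely for the example, that internal codes are numbered 1 to sigma in alphabetical order; the function names `assign_codes`, `label_code`, and `decode` are hypothetical, and the algorithm of FIG. 4 itself is not reproduced here.

```python
def assign_codes(alphabet):
    """Give each symbol an internal code 1..sigma (an assumed numbering)."""
    codes = {sym: i for i, sym in enumerate(sorted(set(alphabet)), start=1)}
    sigma = len(codes)            # maximum code value in the dictionary
    return codes, sigma


def label_code(label, codes, sigma):
    """Code of a transition label: a' for a symbol, a' + sigma for a string."""
    head = codes[label[0]]
    return head if len(label) == 1 else head + sigma


def decode(code, sigma):
    """Recover (head symbol code, is_symbol_string) with one comparison."""
    if code > sigma:
        return code - sigma, True
    return code, False


codes, sigma = assign_codes("abcd")            # a=1, b=2, c=3, d=4, sigma=4
d_code = label_code("d", codes, sigma)         # 4: the single symbol d
dc_code = label_code("dc", codes, sigma)       # 8: a symbol string starting with d
```

Symbol codes fall in 1..sigma and symbol string codes in sigma+1..2*sigma, so the two ranges never overlap and a single comparison distinguishes them. How the remainder of the symbol string (here c) is stored is not specified in this passage and is left out of the sketch.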

[0015] (Invention of claim 2) FIG. 5 is a diagram for explaining the operation of the invention of claim 2. The invention of claim 2 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state number of the transition destination is assigned by an arithmetic operation from the identification number (state number) of the transition source state. As shown in FIG. 5, the base array of the single arrangement method does not store the base value of the destination state directly; instead it stores the relative distance from the state number of the transition source, which keeps down the size required for each base array element. In the case of FIG. 5, the next base value is given by offset + base(si). FIG. 6 shows the dictionary matching algorithm of the invention of claim 2.
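A rough Python sketch of the relative addressing follows. The `LABEL` and `BASE` arrays are an illustrative layout of a TRIE for {aaa, aba, adca} with codes a=1, b=2, c=3, d=4; they are assumptions made for the example and are not the contents of FIG. 5, and the function names are hypothetical.

```python
# Illustrative single arrangement layout (assumed, not taken from the figures).
LABEL = [0, 1, 1, 2, 1, 4, 1, 3, 1]   # 0 marks an empty slot
BASE = [0, 1, 3, 5, 2, 4, 6, 7, 8]    # absolute base of each destination state
CODES = {"a": 1, "b": 2, "c": 3, "d": 4}


def to_offsets(base, label):
    """Replace each absolute base by its distance from the source state's base."""
    offsets = [0] * len(base)
    for p, code in enumerate(label):
        if code:                       # occupied slot
            source_base = p - code     # the slot sits at base(si) + a'
            offsets[p] = base[p] - source_base
    return offsets


def step_relative(cur_base, code, offsets, label):
    """One transition using relative distances: next base = offset + base(si)."""
    p = cur_base + code
    if p < len(label) and label[p] == code:
        return cur_base + offsets[p]
    return None


OFFSETS = to_offsets(BASE, LABEL)      # [0, 1, 2, 4, -1, 3, 1, 3, 1]


def walk(word):
    """Follow word from the root (base 0); return the final base or None."""
    b = 0
    for ch in word:
        b = step_relative(b, CODES[ch], OFFSETS, LABEL)
        if b is None:
            return None
    return b
```

In this toy layout the largest stored magnitude drops from 8 to 4. Offsets can be negative, so a real layout would need either signed cells or a placement rule that keeps destinations after their sources; the passage does not say which, so the sketch simply allows signed values.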

[0016] (Invention of claim 3) The invention of claim 3 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state transition is divided into several transitions so that the difference between the state number of the transition source and that of the transition destination always stays within a fixed value. When the transition destination is not within a fixed distance of the current base value, an unused array position within that range (a position that is not the base of any state) is searched for, a new state having that position as its base is created, and an ε transition, which consumes no input symbol, is made from it. This is repeated until the original destination state is reached.

[0017] In a state transition s1-(a)->s2, if the distance between base(s1) and base(s2) exceeds a fixed value max, a chain of state transitions as shown in FIG. 7 is made. In each transition of the chain, the distance between the transition source and the transition destination does not exceed max, and state transitions by the symbol e (ε) consume no input symbols.
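The chaining can be sketched in Python as a function that computes the intermediate base positions. This is a simplified illustration under assumptions: `split_transition` and its parameters are hypothetical names, the search picks the farthest free position toward the destination (the text only requires some unused position within range), and the placement of the label slots for the ε transitions themselves is ignored.

```python
def split_transition(src, dst, max_dist, used):
    """Return the chain of base positions from src to dst with hops <= max_dist.

    `used` is the set of positions that are already the base of some state;
    it is updated in place with the newly created intermediate states.
    """
    chain = [src]
    cur = src
    while abs(dst - cur) > max_dist:
        direction = 1 if dst > cur else -1
        # look for an unused position within range, farthest first
        for d in range(max_dist, 0, -1):
            cand = cur + direction * d
            if cand >= 0 and cand not in used:
                break
        else:
            raise ValueError("no unused position within range")
        used.add(cand)
        chain.append(cand)
        cur = cand
    chain.append(dst)
    return chain


used_bases = {2, 17, 40}
chain = split_transition(2, 40, 15, used_bases)   # [2, 16, 31, 40]
```

Here the single transition from base 2 to base 40 becomes three hops of length 14, 15, and 9. One hop carries the original symbol a and the others are ε transitions; which hop carries a is not fixed by this sketch.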

[0018] (Invention of claim 4) The invention of claim 4 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a state with no outgoing transitions (a terminal state) is assigned a plurality of identification numbers (state numbers). In a state transition s1-(a)->F, where F is the terminal state, if the distance between base(s1) and base(F) exceeds a fixed value max, a new terminal state F2 is created near base(s1), at a distance not exceeding max, and the transition is replaced by s1-(a)->F2.
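A small Python sketch of this idea follows. The function `terminal_for` and its parameters are hypothetical, and reusing an already created terminal state when one lies within range is an assumption of the sketch; the text itself only describes creating a new terminal state F2 near base(s1).

```python
def terminal_for(src, terminals, max_dist, used):
    """Return the base of a terminal state within max_dist of src.

    `terminals` lists the bases of existing terminal states and `used` the
    bases of all states; both are updated when a new terminal is created.
    """
    for f in terminals:
        if abs(f - src) <= max_dist:
            return f                      # an existing terminal is close enough
    for d in range(1, max_dist + 1):      # search outward from base(s1)
        for cand in (src + d, src - d):
            if cand >= 0 and cand not in used:
                used.add(cand)
                terminals.append(cand)    # the new terminal state F2
                return cand
    raise ValueError("no unused position within range")


terminals = [3]
used_bases = {0, 3, 100, 101}
near = terminal_for(5, terminals, 10, used_bases)     # 3: F itself is in range
far = terminal_for(100, terminals, 10, used_bases)    # 99: a new terminal F2
```

After the second call the terminal state is known under two state numbers, 3 and 99, and the transition out of the state at base 100 stays within the distance limit.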

[0019] (Invention of claim 5) The invention of claim 5 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. It is characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, each transition label of the state transition table is the longest common prefix of the partial symbol strings represented by the state. For example, consider state transitions by single symbols such as those shown in FIG. 8.

[0020] The graph of FIG. 8 represents the set of symbol strings {abcde, abcdf, efcde, efcdf}. As shown in FIG. 9, this graph is changed so that, going from left to right, structure is not shared within a longest common prefix. Here, the longest common prefix of {abcde, abcdf}, for example, is abcd. As a result, when the input symbol a arrives in state s1, the longest symbol string that can be predicted (uniquely determined) from the dictionary to follow, namely abcd, is obtained.
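The longest common prefix computation in this example can be sketched in Python as follows. The helper names `longest_common_prefix` and `root_labels` are hypothetical, and the sketch only derives the labels leaving the initial state; it does not build the full graph of FIG. 9.

```python
def longest_common_prefix(strings):
    """Longest prefix shared by every string in a non-empty list."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def root_labels(words):
    """Group words by first symbol and label each group with its LCP."""
    groups = {}
    for w in words:
        groups.setdefault(w[0], []).append(w)
    return {head: longest_common_prefix(ws) for head, ws in groups.items()}


labels = root_labels(["abcde", "abcdf", "efcde", "efcdf"])
# {'a': 'abcd', 'e': 'efcd'}
predicted = labels["a"]   # 'abcd': predictable as soon as the symbol a is read
```

With these labels, reading a in the initial state already determines that bcd must follow for any dictionary element to match, which is the prediction described above.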

[0021]

[Effects of the invention] The invention of claim 1 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. When the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a symbol string is associated with a code derived by an arithmetic operation from the code of its first symbol. Symbols and symbol strings are therefore distinguished by an arithmetic operation of low computational cost, which enables fast dictionary matching, while symbol string labels reduce the number of state transitions and improve storage efficiency.

[0022] The invention of claim 2 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. When the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state number of the transition destination is assigned by an arithmetic operation from the identification number (state number) of the transition source state, so the array capacity required by the single arrangement method can be reduced.

[0023] The invention of claim 3 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. When the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state transition is divided into several transitions so that the difference between the state number of the transition source and that of the transition destination always stays within a fixed value. Dividing state transitions in this way reduces the array capacity required by the single arrangement method.

[0024] The invention of claim 4 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. When the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a state with no outgoing transitions (a terminal state) is assigned a plurality of identification numbers (state numbers). Providing several terminal states suppresses long-distance state transitions in the single arrangement method and reduces the array capacity.

[0025] The invention of claim 5 comprises a symbol string input unit, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string. When the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, each transition label of the state transition table is the longest common prefix of the partial symbol strings represented by the state. With the longest common prefix of the set of partial symbol strings corresponding to a state as the transition label, the longest character string becomes predictable when dictionary matching proceeds from left to right.

[Brief description of the drawings]

FIG. 1 is a schematic configuration diagram of a matching device to which the present invention is applied.

FIG. 2 is a diagram for explaining the operation of the invention of claim 1.

FIG. 3 is a diagram for explaining the operation of the invention of claim 1.

FIG. 4 is a diagram showing the dictionary matching algorithm of the invention of claim 1.

FIG. 5 is a diagram for explaining the operation of the invention of claim 2.

FIG. 6 is a diagram showing the dictionary matching algorithm of the invention of claim 2.

FIG. 7 is a diagram explaining the chain of state transitions in the invention of claim 3.

FIG. 8 is a diagram for explaining state transitions in the invention of claim 5.

FIG. 9 is a diagram showing the graph with the minimum number of states in the invention of claim 5.

FIG. 10 is a diagram showing an example of a TRIE directed graph.

FIG. 11 is a diagram showing an example of a DAWG directed graph.

FIG. 12 is a diagram for explaining an example of the single arrangement method.

FIG. 13 is a graph showing another example of a DAWG.

[Explanation of symbols]

1: input unit, 2: dictionary, 3: matching unit.

Claims

[Claims]

1. A dictionary matching device comprising an input unit for a symbol string, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string, characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a symbol string is associated with a code derived by an arithmetic operation from the code of its first symbol.

2. A dictionary matching device comprising an input unit for a symbol string, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string, characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state number of the transition destination is assigned by an arithmetic operation from the identification number (state number) of the transition source state.

3. A dictionary matching device comprising an input unit for a symbol string, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string, characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, the state transition is divided into several transitions so that the difference between the state number of the transition source state and the state number of the transition destination always stays within a fixed value.

4. A dictionary matching device comprising an input unit for a symbol string, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string, characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, a state with no outgoing transitions (a terminal state) is assigned a plurality of identification numbers (state numbers).

5. A dictionary matching device comprising an input unit for a symbol string, a dictionary consisting of a set of symbol strings represented by a state transition table, and a dictionary matching unit that detects all dictionary elements appearing in the input symbol string, characterized in that, when the dictionary matching unit makes a state transition based on a symbol or symbol string in the input symbol string and the state transition table, each transition label of the state transition table is the longest common prefix of the partial symbol strings represented by the state.