JPS60105040A

JPS60105040A - Sentence retrieving system

Info

Publication number: JPS60105040A
Application number: JP58211718A
Authority: JP
Inventors: Ushio Inoue; 潮井上; Haruo Hayamizu; 速水　治夫
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1983-11-12
Filing date: 1983-11-12
Publication date: 1985-06-10
Also published as: JPH0315221B2

Abstract

PURPOSE: To enable text retrieval with a small state transition table even when the character code has a long bit length, by using partial patterns obtained by dividing the character code into a power-of-two number of parts. CONSTITUTION: A switching circuit 21 sets each input character from a data register 20 into the low-order 8 bits of an address register 7, 8 bits at a time, alternating in the order high-order half, then low-order half. The high-order 8 bits of the address register 7, whose initial value is all zeros, are supplied together with the low-order 8 bits to an address decoder 9, and the 8-bit data at the addressed location of a random access memory 8 is read out and stored in a memory register 10. A discrimination circuit 22 operates according to a truth table: when the high-order 8 bits of the register 7 equal the special value FF, it ignores the contents of the register 10 and forcibly resets the high-order 8 bits of the register 7 to "00", which prevents code patterns of different character strings from coinciding.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of the Invention) The present invention relates to a text search method for determining whether a predetermined character string is present in a character string.

(Prior Art) In the field of data processing systems, it is often necessary to retrieve, from a collection of character string data such as text, only those items that contain a specific substring serving as a key, or to extract every key contained in the character string data. Since one character is normally represented by a fixed-length code of n bits, character string data is a sequence of n-bit codes.

Character string data is generally stored in an external storage device of a computer, such as a magnetic disk, and is transferred serially, one character at a time, to the central processing unit during a search. To shorten the processing time, the search therefore has to be performed at the same time as the data transfer.

FIG. 1 is an explanatory diagram of such a text search mechanism. In FIG. 1, 1 is a storage device in which character string data is stored, 2 is a character string search device that searches character strings, 3 is a character string data transfer path, and 4 is a signal line that outputs search results. Character string data is input serially, one character at a time, from the storage device 1 to the character string search device 2 via the data transfer path 3. The character string search device 2 compares the input data with a previously stored substring serving as a key and outputs a match signal on the signal line 4 as soon as a match is detected. A method using a finite automaton has long been generally known as a way of collating character strings in the character string search device 2 (L. A. Hollaar, "Hardware Systems for Text Information Retrieval," ACM SIGIR 6th Conference, 1983).

FIG. 2 is an explanatory diagram showing the state transitions of a finite automaton. In FIG. 2, 5 denotes a state of the automaton and 6 denotes the direction of a state transition; the automaton shown can find the three-character key "DOG" in character string data. Its operation is as follows. The initial state is state (0), and an input character "D" causes a transition to state (1). In FIG. 2, "≠" denotes any other character: if the input character in state (0) is anything other than "D", the automaton remains in state (0). State (1) behaves similarly: an input "O" causes a transition to state (2), "D" a transition back to state (1), and any other character a transition to state (0).

If the input character in state (2) is "G", the automaton moves to state (3); the key "DOG" has then been detected, and a match signal is output on the signal line 4 of FIG. 1.
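The automaton of FIG. 2 can be sketched in a few lines of Python. This is only an illustration of the transitions described above; the function name `find_dog`, the behaviour in state (2) for characters other than "G", and the restart after a match are assumptions, since the text does not spell them out.

```python
def find_dog(text: str) -> list[int]:
    """Return the positions at which the key "DOG" ends in text."""
    state = 0          # state (0) is the initial state
    hits = []
    for pos, ch in enumerate(text):
        if state == 0:
            state = 1 if ch == "D" else 0
        elif state == 1:
            # "O" advances, "D" stays in state (1), anything else resets.
            state = 2 if ch == "O" else 1 if ch == "D" else 0
        else:  # state == 2
            if ch == "G":
                hits.append(pos)   # state (3): key detected, match signal
                state = 0          # assumption: restart after a match
            else:
                # Assumption: treated like state (1) for non-matching input.
                state = 1 if ch == "D" else 0
    return hits
```

For example, `find_dog("HOTDOGS AND DDOG")` reports the two occurrences, including the one preceded by a repeated "D".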

FIG. 3 is an explanatory diagram of a circuit configuration that implements a conventional finite automaton for character string data expressed in 8-bit JIS code. In FIG. 3, 3 is a character string data transfer path, 4 is a signal line that outputs search results, 7 is a 16-bit address register, 8 is a 64 KB (256 × 2^8 B) random access memory, 9 is an address decoder, 10 is an 8-bit memory register, 11 is a discrimination circuit, 12, 14 and 15 are 8-bit wide data lines, and 13 is a 16-bit wide address line.

FIG. 4 shows the contents of the state transition table stored in the random access memory 8 of FIG. 3, where 16 is the 8-bit data, 17 is the upper 8 bits of the memory address, and 18 is the lower 8 bits of the memory address. Logically, the upper address 17 corresponds to the state number and the lower address 18 corresponds to the character code; 19 is the character represented by the code of the lower address 18.

The input character is set from the data transfer path 3 into the lower 8 bits of the address register 7. The upper 8 bits of the address register 7 are initially set to all zeros. The contents of the address register 7 are input to the address decoder 9 via the address line 13, and the 8-bit data 16 stored at that address is read out of the random access memory 8 and stored in the memory register 10 via the data line 14.

The discrimination circuit 11 examines the contents of the memory register 10 via the data line 15. If the value is the high value ("FF" in hexadecimal), it outputs a match signal on the signal line 4; otherwise the contents of the memory register 10 are set into the upper 8 bits of the address register 7 via the data line 12. The search is carried out by repeating this operation each time one character is input from the data transfer path 3.
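The conventional lookup loop can be modelled in software as follows. This is a minimal sketch, not the circuit itself: the names `build_table` and `search` are hypothetical, the table is built for a single key with a simple "restart on the first character" rule, and resetting to state 0 after a match is an assumption because the text does not say what follows the match signal.

```python
MATCH = 0xFF  # "high value" read from the table when the key is complete

def build_table(key: bytes) -> bytearray:
    """State transition table addressed by (state << 8) | character code."""
    if not 0 < len(key) < MATCH:
        raise ValueError("key length must be between 1 and 254")
    table = bytearray(256 * 256)              # 64 KB, mostly zero (state 0)
    for state, code in enumerate(key):
        if state > 0:
            table[(state << 8) | key[0]] = 1  # assumed restart on first character
        nxt = MATCH if state == len(key) - 1 else state + 1
        table[(state << 8) | code] = nxt      # expected character advances
    return table

def search(table: bytearray, data: bytes) -> list[int]:
    """Feed one 8-bit character per step and collect match positions."""
    upper = 0                                 # upper 8 bits of address register 7
    hits = []
    for pos, code in enumerate(data):
        value = table[(upper << 8) | code]    # contents of memory register 10
        if value == MATCH:
            hits.append(pos)                  # match signal on signal line 4
            upper = 0                         # assumption: restart after a match
        else:
            upper = value                     # next state via data line 12
    return hits
```

A call such as `search(build_table(b"DOG"), b"HOTDOGS AND DDOG")` performs exactly one table read per input character, which is what allows the search to keep pace with the data transfer.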

If the conventional method described above is applied to Japanese character strings, in which each character is represented by 16 bits, the number of distinct codes becomes 2^16. The random access memory 8 that stores a state transition table representing 256 states, as in FIG. 3, must then be 16 MB (256 × 2^16 B) in size. Moreover, most of the data 16 stored there is generally zero, so the method has the drawback that an enormous amount of memory must be used at extremely low utilization efficiency.
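As a worked check of this memory requirement, assuming one byte per table entry as in FIG. 3 and 1 MB = 2^20 B:

```latex
256 \times 2^{16}\,\mathrm{B}
  = 2^{8} \times 2^{16}\,\mathrm{B}
  = 2^{24}\,\mathrm{B}
  = 16{,}777{,}216\,\mathrm{B}
  = 16\,\mathrm{MB}
```

This is 256 times the 64 KB needed for 8-bit codes with the same number of states.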

(Object of the Invention) The present invention is characterized in that the entries of the state transition table are not the character codes themselves but partial patterns obtained by dividing the character code into a power-of-two number of parts. Its object is to make it possible to search with a small state transition table even when the character code has a long bit length. The case in which the character code is divided into two parts is described in detail below.

(Structure and Operation of the Invention) FIG. 5 is a block diagram of one embodiment showing the configuration of a circuit that implements a finite automaton using the method of the present invention; it illustrates a search method for character string data expressed in 16-bit codes. In FIG. 5, 20 is a 16-bit data register, 21 is a switching circuit, and 23 and 24 are 8-bit wide data lines.

FIG. 6 shows the structure of the state transition table stored in the random access memory 8 of FIG. 5 for the case in which the key substring is "海抜" (expressed in hexadecimal JIS code as "3324, 4834").

The input character is stored in the data register 20 from the transfer path 3. The switching circuit 21 sets the contents of the data register 20 into the lower 8 bits of the address register 7 via the data line 23, 8 bits at a time, alternating in the order upper half, then lower half.
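The role of the switching circuit 21 can be pictured as a generator that emits the upper and then the lower 8 bits of each 16-bit code. The name `split_halves` is hypothetical and the sketch only models the ordering of the halves, not the hardware timing.

```python
from typing import Iterable, Iterator

def split_halves(codes: Iterable[int]) -> Iterator[int]:
    """Yield the upper 8 bits, then the lower 8 bits, of each 16-bit code."""
    for code in codes:
        yield (code >> 8) & 0xFF   # upper half goes first
        yield code & 0xFF          # lower half follows
```

Applied to the key codes 3324 and 4834 quoted for FIG. 6, `list(split_halves([0x3324, 0x4834]))` gives the four 8-bit values 33, 24, 48, 34 in hexadecimal.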

The upper 8 bits of the address register 7 are initially set to all zeros. The contents of the address register 7 are input to the address decoder 9 via the address line 13, and the 8-bit data 16 stored at that address is read out of the random access memory 8 and stored in the memory register 10 via the data line 14.

FIG. 7 is a truth table describing the operation of the discrimination circuit 22, in which 25 is the input from the data line 24, 26 is the input from the data line 15, 27 is the output to the data line 12, and 28 is the output to the signal line 4, all shown in hexadecimal.

The discrimination circuit 22 operates according to the truth table of FIG. 7. When I1, that is, the upper 8 bits of the address register 7, is "FE", it ignores I2, that is, the contents of the memory register 10, and forcibly resets the upper 8 bits of the address register 7 to "00".

FIG. 8 is a concrete example showing why this forced reset is necessary, where 29 is the key substring, 30 is the hexadecimal representation of the substring 29, 31 is the character string data to be searched, and 32 is the hexadecimal representation of the character string data 31.

As FIG. 8 shows, when characters represented by 16-bit codes are divided into 8-bit halves for searching, the code patterns of two different character strings can coincide when one is shifted by exactly 8 bits relative to the other.
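The coincidence can be reproduced with the hexadecimal values that survive in the FIG. 8 text of this document: key (33)(24)(48)(34) and search data (3B)(33)(24)(48)(34)(64). The sketch below is an illustration under the assumption that those values were recovered correctly; the variable names are hypothetical.

```python
key = bytes.fromhex("33244834")        # key substring 29, two 16-bit codes
data = bytes.fromhex("3B3324483464")   # search data 31, three 16-bit codes

# A search that treats the data as a plain stream of 8-bit units finds the
# key at offset 1, i.e. shifted by 8 bits from a character boundary.
byte_offset = data.find(key)

# A search restricted to character boundaries (even offsets) finds nothing,
# because none of the three characters in the data belongs to the key.
aligned_hits = [
    i for i in range(0, len(data) - len(key) + 1, 2)
    if data[i:i + len(key)] == key
]
```

Here `byte_offset` is odd and `aligned_hits` is empty, so the unaligned hit is a false detection of the kind the forced reset is meant to suppress.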

To prevent this phenomenon, the present invention embeds a special value ("FE") in the state transition table and has the discrimination circuit 22 apply a forced reset. When I1 of the discrimination circuit 22 is anything other than "FE", the operation is the same as in the conventional method.
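One way to picture the combined effect of the split table and the forced reset is the sketch below. It is an assumption-laden model, not the flowchart of FIG. 9: the names `build_split_table` and `search_split` are hypothetical, the table is built for a single key, "FE" is placed in every non-matching entry of the rows that expect an upper half, overlapping-prefix failure transitions are ignored, and the state is reset to 0 after a match.

```python
MATCH = 0xFF   # match value, as in the conventional method
SKIP = 0xFE    # special value that triggers the forced reset

def build_split_table(key: bytes) -> bytearray:
    """key holds 16-bit codes as bytes: upper, lower, upper, lower, ..."""
    if len(key) % 2 or not 0 < len(key) < SKIP:
        raise ValueError("key must hold 1 to 126 two-byte codes")
    table = bytearray(len(key) * 256)        # one 256-byte row per half
    for state in range(0, len(key), 2):
        row = state << 8
        # Row expecting an upper half: on a mismatch, skip the lower half.
        table[row:row + 256] = bytes([SKIP]) * 256
        table[row | key[0]] = 1              # assumed restart on first upper half
        table[row | key[state]] = state + 1
        # Row expecting a lower half: mismatches stay 0 (next byte is aligned).
        last = state + 1 == len(key) - 1
        table[((state + 1) << 8) | key[state + 1]] = MATCH if last else state + 2
    return table

def search_split(table: bytearray, data: bytes) -> list[int]:
    """Feed one 8-bit half per step; return positions where the key ends."""
    upper = 0                                # upper 8 bits of address register 7
    hits = []
    for pos, half in enumerate(data):
        if upper == SKIP:
            upper = 0                        # forced reset: table output ignored
            continue
        value = table[(upper << 8) | half]
        if value == MATCH:
            hits.append(pos)
            upper = 0                        # assumption: restart after a match
        else:
            upper = value
    return hits
```

With the key 3324 4834, the data 3B33 2448 3464 from FIG. 8 produces no hit, because each non-matching upper half causes its lower half to be skipped and the automaton stays on character boundaries, whereas data containing the key on a character boundary is still detected.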

FIG. 9 is a flowchart for creating the state transition table of FIG. 6 or FIG. 10.

In FIG. 9, 33 is a set of 256 work areas, i (shown at 34) is a variable representing the upper 8 bits of an address in the state transition table, and j (shown at 35) is a variable representing the lower 8 bits of the address.

FIG. 10 shows the state transition table created according to the flowchart of FIG. 9 together with the contents of the work area; the key character string is "商務省" (hexadecimal codes 3E26, 4C33, 3E4A).

As is clear from the flowchart of FIG. 9, creating the state transition table is simple and easy. Furthermore, the state transition table of FIG. 10 is 1536 bytes (6 × 2^8 B) in size, far smaller than the 196,608 bytes (3 × 2^16 B) required for the state transition table of the conventional method.
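The two sizes can be checked as follows. The symbol m for the number of key characters is introduced here for illustration and does not appear in the original; n = 16 is the code length and one byte per entry is assumed.

```latex
\begin{aligned}
\text{split table:}        \quad & 2m \times 2^{n/2}\,\mathrm{B} = 6 \times 2^{8}\,\mathrm{B} = 1536\,\mathrm{B} \\
\text{conventional table:} \quad & m \times 2^{n}\,\mathrm{B} = 3 \times 2^{16}\,\mathrm{B} = 196{,}608\,\mathrm{B} \\
\text{ratio:}              \quad & \frac{m \times 2^{n}}{2m \times 2^{n/2}} = 2^{n/2 - 1} = 2^{7} = 128
\end{aligned}
```

For a 3-character key and 16-bit codes, the split table is therefore 128 times smaller.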

Although the above description deals with the case in which the code is divided in half, the invention is clearly not limited to this; the code may equally be divided into any power-of-two number of parts.

(Effects) As described above, the present invention does not use the n-bit code representing one character directly as an entry of the state transition table. Instead it uses, as entries, the partial patterns obtained by dividing the n-bit code pattern in half, and it provides a forced reset mechanism that prevents a substring not actually contained in the data from being detected when the character string data is shifted by n/2 bits. It therefore has the advantage that exactly the same function as the conventional method can be realized with a state transition table far smaller than the conventional one.

[Brief Description of the Drawings]

FIG. 1 is an explanatory diagram of the text search mechanism; FIG. 2 is an explanatory diagram showing the state transitions of a finite automaton; FIG. 3 is a circuit diagram of a conventional finite automaton implementation; FIG. 4 is a diagram of the structure of the conventional state transition table; FIG. 5 is a block diagram of one embodiment showing the configuration of a finite automaton implementation circuit using the present invention; FIG. 6 is a diagram of the structure of the state transition table stored in the random access memory of FIG. 5; FIG. 7 is a truth table for the operation of the discrimination circuit according to the present invention; FIG. 8 is a concrete example showing why the forced reset mechanism of the present invention is necessary; FIG. 9 is a flowchart for creating the state transition table of FIG. 6 or FIG. 10; and FIG. 10 is a diagram showing the state transition table created according to the flowchart of FIG. 9 and the contents of the work area.

1: storage device; 2: character string search device; 3: data transfer path; 4: signal line; 5: automaton state; 6: direction of state transition; 7: address register; 8: random access memory; 9: address decoder; 10: memory register; 11: discrimination circuit; 12, 14, 15, 23, 24: data lines; 13: address line; 16: data; 17: upper memory address; 18: lower memory address; 19: character corresponding to the code; 20: data register; 21: switching circuit; 22: discrimination circuit; 25, 26: input terminals; 27, 28: output terminals; 29: substring; 30, 32: hexadecimal representations of character strings; 31: character string data to be searched; 33: work area.

Patent applicant: Nippon Telegraph and Telephone Public Corporation

[Drawings not reproduced. Recoverable labels: FIG. 8, key substring 29 with hexadecimal codes (33)(24)(48)(34) at 30 and search data 31 with hexadecimal codes (3B)(33)(24)(48)(34)(64) at 32; FIG. 10, key codes (3E)(26)(4C)(33)(3E)(4A).]

Claims

[Claims]

A text retrieval method using a finite automaton that, where n is a positive even number, searches character strings composed of characters expressed by n-bit codes by means of a two-dimensional state transition table having codes and state numbers as entries, characterized in that partial patterns obtained by dividing the n-bit code pattern into several parts are used as the entries of the state transition table, the state transition destination is determined by indexing the state transition table as many times as the number of parts into which the code pattern is divided, and a forced reset mechanism is provided for preventing erroneous search results from being output owing to bit misalignment of the codes.