JPH02136969A

JPH02136969A - Retrieving system for character string

Info

Publication number: JPH02136969A
Application number: JP63290800A
Authority: JP
Inventors: Hiroshi Ichiyanagi; 一柳　洋
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-11-17
Filing date: 1988-11-17
Publication date: 1990-05-25

Abstract

PURPOSE: To search for character strings with fewer backtracks during matching, by gradually narrowing the table matching range when the character string to be searched is matched against a search string table. CONSTITUTION: A search string table is prepared that sorts and holds a plurality of search strings. A substring starting at an arbitrary position of the character string to be searched is then matched against the strings stored in the search string table, without backtracking, by gradually narrowing the matching range of the table while handling wild characters. Backtracking during string search is thereby reduced and the efficiency of the matching process is improved.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a string search method, and in particular to a string search method applicable to string processing tools such as text editors and file conversion programs.

(Prior Art) Conventional methods for searching a searched string for a plurality of search strings include a method in which the search string table is a hash table and a method in which the search strings are expressed as a finite automaton.

(Problem to be Solved by the Invention) Of the conventional methods described above, the method using a hash table is effective when the searched string can be divided into suitable words and some of those words are to be searched for. However, when a plurality of arbitrary character strings are given as search strings, the hash function must, in the worst case, be applied to every substring of the searched string. In other words, the hash table method is unsuitable for searching for a plurality of arbitrary character strings. Furthermore, this method is difficult to apply when the search strings contain wild characters.

The method using a finite automaton has the drawback that, when the search strings are numerous or long, the finite automaton becomes large and backtracking during the search increases, so that processing efficiency falls.

(Means for Solving the Problem) The string search method according to the present invention includes: a sorting step of sorting a plurality of search strings containing wild characters in character code order and storing them in a search string table; a matching step of matching a substring starting at an arbitrary position of the searched string against the strings stored in the search string table, without backtracking, by successively narrowing the matching range of the search string table while using the wild characters; and an output step of outputting each string successfully matched in the matching step, or the index of that string within the searched string.

(Embodiment) The present invention is described below with reference to the drawings.

FIG. 1 is a flowchart showing the configuration and processing procedure of the present invention. Reference numeral 11 denotes a search string table that holds a plurality of character strings containing wild characters. The strings held in the search string table 11 are sorted in character code order by a sorting step 12. The searched string 13, in turn, is matched against the contents of the search string table 11 in a matching step 14.

Each string successfully matched in the matching step 14, or the index of that string within the searched string, is output as output data 16 through an output step 15.

That is, as described above, the processing of this embodiment is performed in the order of the sorting step 12, the matching step 14, and the output step 15. The sorting of the search strings in the sorting step 12 can be performed with a well-known algorithm such as quicksort.
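The sorting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `WILDCARDS`, `sort_key`, and `build_search_table` are hypothetical, the wild characters are written as ASCII "?" and "&" following the description of FIG. 3, and the sort key deliberately ranks wild characters after every ordinary character so that strings beginning with a wild character land toward the end of the table, as that description states. Python's built-in `sorted` stands in for quicksort; any comparison sort gives the same table.

```python
# Hypothetical wildcard symbols: "?" for a letter, "&" for a digit (per FIG. 3).
WILDCARDS = "?&"


def sort_key(pattern: str) -> list[int]:
    """Rank ordinary characters by code point and wild characters after all of them.

    Assumption: the source only says "character code order" and that strings
    beginning with a wild character sit toward the end of the table.
    """
    return [
        0x110000 + WILDCARDS.index(ch) if ch in WILDCARDS else ord(ch)
        for ch in pattern
    ]


def build_search_table(patterns: list[str]) -> list[str]:
    """Sorting step 12: return the search strings in table order."""
    return sorted(patterns, key=sort_key)


table = build_search_table(["AB", "?1", "A", "B&", "A?C", "&&"])
print(table)  # ['A', 'AB', 'A?C', 'B&', '?1', '&&']
```

Because entries that share a leading character end up in adjacent rows, a matching range can later be described by row positions alone.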

The output data 16 can be output to any suitable medium, such as computer memory, secondary storage such as a magnetic disk, a printer, or a CRT display.

Next, the matching step 14 of this embodiment is described in detail with reference to FIG. 2, which is a flowchart showing the processing procedure of the matching step in detail.

First, in step 21, the searched-string pointer is positioned at the beginning of the searched string. In step 22, it is checked whether scanning of the searched string has finished; if it has not, pattern matching is performed (step 23). In step 24, it is determined whether any string matched; if so, in step 25 that string, or its index within the searched string, is written to the output buffer. Then, in step 26, the searched-string pointer is advanced: by the length of the matched string if there was a match, or by one if there was none. Processing then returns to step 22.
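The loop of steps 21 to 26 can be sketched in Python as below. The function names `scan` and `toy_match` are hypothetical, and `toy_match` is only a stand-in that checks a fixed word list; it is not the table-narrowing pattern matching of step 23, which is described separately. The sketch also assumes that both the matched string and its index are recorded, whereas the source allows either.

```python
from typing import Callable, Optional


def scan(
    text: str,
    pattern_match: Callable[[str, int], Optional[int]],
) -> list[tuple[int, str]]:
    """Scan the searched string as in FIG. 2 and collect (index, matched string) pairs."""
    output = []                                   # output buffer
    pos = 0                                       # step 21: pointer at the beginning
    while pos < len(text):                        # step 22: scanning finished?
        length = pattern_match(text, pos)         # step 23: pattern matching
        if length:                                # step 24: was there a match?
            output.append((pos, text[pos:pos + length]))  # step 25
            pos += length                         # step 26: advance by the match length
        else:
            pos += 1                              # step 26: advance by one
    return output


def toy_match(text: str, pos: int) -> Optional[int]:
    """Stand-in matcher (assumption): length of the longest listed word at pos."""
    words = ("AB", "ABC", "X")
    hits = [len(w) for w in words if text.startswith(w, pos)]
    return max(hits, default=None)


print(scan("ZABCXAB", toy_match))  # [(1, 'ABC'), (4, 'X'), (5, 'AB')]
```

Advancing by the match length means that matches reported by this loop never overlap one another.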

FIG. 3 shows the data structures used when the pattern matching 23 is performed. In the figure, 31 is part of the searched string and 32 is part of the search string table, in which the individual search strings are stored in sorted order, one per row. Search strings that begin with a wild character are sorted toward the end of the table. Here, "?" denotes a wild character for letters and "&" denotes a wild character for digits.

The pattern matching algorithm is described below.

First, assume that the searched-string pointer is positioned at the "A" of string 31. The matching pointer is initially set to 1, and the table positions of the strings 33 that begin with "A" and of the strings 34 that begin with the wild character "?" are registered in the matching range table as the initial matching range.

Seen through the matching range table, the search string table appears as shown at 35 in the figure. That is, it looks as if the strings 33 beginning with "A" and the strings 34 beginning with the wild character "?" in the original search string table 32 had been compacted together. If the range of table rows for each leading character is stored separately in an index table during the sorting step, the search string table does not need to be searched when the initial matching range is set. If no search string begins with "A" or "?", there is no matching string. Once the matching range has been set, the matching pointer is advanced by one and the next character of the searched string is matched. At this point the search strings registered in the matching range table are matched; if any search string ends there (for example, the search string in the first row of 35), a matching string is recorded as found. For the search strings that match successfully, their indexes in the search string table are compacted into the matching range table. As a result, ranges 36 and 37 of 35 in the figure are compacted, and, seen through the matching range table, the search string table appears as shown at 38 in the figure.
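A minimal Python sketch of the per-leading-character index table and of setting the initial matching range follows. The names `build_first_char_index` and `initial_range` are hypothetical, the example table is invented for illustration, and the sketch assumes a sorted table of non-empty strings in which "?" stands for a letter and "&" for a digit. Python's `str.isalpha` and `str.isdigit` also accept non-ASCII letters and digits, which may be broader than the source intends.

```python
def build_first_char_index(table: list[str]) -> dict[str, range]:
    """Map each leading character to the contiguous rows of the sorted table that begin with it."""
    index: dict[str, range] = {}
    start = 0
    for row in range(1, len(table) + 1):
        # Close the current run at the end of the table or when the leading character changes.
        if row == len(table) or table[row][0] != table[start][0]:
            index[table[start][0]] = range(start, row)
            start = row
    return index


def initial_range(index: dict[str, range], ch: str) -> list[int]:
    """Rows that can match a substring beginning with ch: literal rows plus wild-character rows."""
    rows = list(index.get(ch, ()))
    if ch.isalpha():
        rows += index.get("?", ())   # "?" matches a letter
    if ch.isdigit():
        rows += index.get("&", ())   # "&" matches a digit
    return rows


# Illustrative table, already sorted with wild characters last (assumption).
table = ["A", "AB", "A?C", "B&", "?1", "&&"]
index = build_first_char_index(table)
print(initial_range(index, "A"))  # [0, 1, 2, 4]
print(initial_range(index, "7"))  # [5]
print(initial_range(index, "-"))  # []
```

An empty result corresponds to the case in which no search string can begin at the current position, so the pattern matching reports no match without reading the table.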

The matching pointer is then advanced further and the same processing is repeated while the matching range is successively narrowed; the repetition ends at the stage where the matching range becomes 0.
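The repeated narrowing can be sketched in Python as below. The names `char_matches` and `pattern_match` are hypothetical and the example table is invented. Two points are assumptions rather than statements of the source: the initial range is taken to be the whole table for brevity (an index of leading characters would shrink it up front), and when several search strings end during the scan the longest one is reported. The matching pointer is 0-based here, whereas the source starts it at 1.

```python
from typing import Optional


def char_matches(pattern_ch: str, text_ch: str) -> bool:
    """Compare one table character with one text character, honouring wild characters."""
    if pattern_ch == "?":
        return text_ch.isalpha()   # "?" matches a letter
    if pattern_ch == "&":
        return text_ch.isdigit()   # "&" matches a digit
    return pattern_ch == text_ch


def pattern_match(table: list[str], text: str, pos: int) -> Optional[int]:
    """Return the length of the longest table entry matching text at pos, or None."""
    active = list(range(len(table)))   # matching range: rows still consistent with the text
    best = None
    offset = 0                         # matching pointer
    while active and pos + offset < len(text):
        ch = text[pos + offset]
        # Narrow the range: keep rows that have a character here and agree with ch.
        active = [
            row for row in active
            if offset < len(table[row]) and char_matches(table[row][offset], ch)
        ]
        offset += 1
        # A surviving row that ends exactly here is a successful match.
        if any(len(table[row]) == offset for row in active):
            best = offset
    return best


table = ["A", "AB", "A?C", "B&", "?1", "&&"]
print(pattern_match(table, "AXC9", 0))  # 3, via "A?C"
print(pattern_match(table, "B7", 0))    # 2, via "B&"
print(pattern_match(table, "-", 0))     # None
```

Each character of the searched string is read once per starting position and the set of candidate rows only shrinks, so no character is re-examined within a single call.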

(Effects of the Invention) As explained above, the present invention provides a search string table that sorts and holds a plurality of search strings, and, when the searched string is matched against the search string table, performs matching that successively narrows the table matching range. This has the effect of enabling string search with little backtracking during matching.

[Brief explanation of the drawing]

FIG. 1 is a flowchart showing the configuration and processing procedure of an embodiment of the string search method according to the present invention; FIG. 2 is a flowchart showing in detail the processing procedure of the matching step in FIG. 1; and FIG. 3 is a data structure diagram for the pattern matching in FIG. 2. 11: search string table; 12: sorting step; 13: searched string; 14: matching step; 15: output step; 16: output data.

Claims

[Claims]

A string search method comprising: a sorting step of sorting a plurality of search strings containing wild characters in character code order and storing them in a search string table; a matching step of matching a substring starting at an arbitrary position of a searched string against the strings stored in the search string table, without backtracking, by successively narrowing the matching range of the search string table while using the wild characters; and an output step of outputting each string successfully matched in the matching step, or the index of that string within the searched string.