JP2535629B2

JP2535629B2 - Input string normalization method of search system

Info

Publication number: JP2535629B2
Application number: JP1290714A
Authority: JP
Inventors: 誠二中野
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-11-08
Filing date: 1989-11-08
Publication date: 1996-09-18
Anticipated expiration: 2011-09-18
Also published as: JPH03150668A

Description

DETAILED DESCRIPTION OF THE INVENTION [Overview] This invention relates to an input string normalization method for a search system that searches a record file, such as a database, using an input character string as the keyword. Its purpose is to make searching easy even when the input string uses abbreviations or other forms that are not the correct spelling. The formal words corresponding to the words cut out of the input string are looked up in a word dictionary, and one or more normalized strings are created from combinations of those formal words and used for the data search.

[Field of Industrial Application] The present invention relates to an input string normalization method for a search system that searches a record file, such as a database, using an input character string as the keyword.

In a search system that uses, for example, a counterparty company name taken from a telex message as the search keyword to retrieve required data such as an account number from a record file such as a database, the input string used as the keyword (the company name or similar) must be spelled correctly.

However, abbreviations are often used in the input strings that serve as search keywords. One conceivable approach is to register, in addition to the official name, the abbreviations that are expected to be used as keywords. What is desired, though, is a system in which data can be retrieved as easily with an abbreviation as with the official name, without increasing the number of keywords.

[Prior Art] Systems have been proposed that automatically analyze and process telex messages and similar communications from overseas that are used in banking transactions and the like.

In such an automatic message analysis system, the counterparty company name spelled out in the message is used as the search keyword to search a database that records information such as account numbers, and the required counterparty data is retrieved.

[Problems to Be Solved by the Invention] However, in a search system that uses an input string such as a company name as the search keyword, the counterparty company name rarely arrives exactly as the official name. Because it only needs to be recognizable to the person handling the message on the receiving side, it is sent abbreviated in many different ways: English words may be abbreviated, only the proper-noun part at the head of the company name may be entered, or the initial letters of the words that make up the company name may be strung together.

Furthermore, when Japanese is written in Latin letters there is more than one possible spelling. For example, "Tokyo" (東京) is written as either "TOKYO" or "TOKIO".

To handle such abbreviations and variant spellings of the input string, abbreviated strings in many different forms have to be anticipated and registered.

However, there is a limit to how many names can be prepared to cover the varied ways a counterparty company name may be entered. Registering every conceivable spelling requires enormous human effort, the search dictionary becomes huge and squeezes the program's execution area, and search efficiency also falls.

The present invention was made in view of these conventional problems. Its object is to provide an input string normalization method for a search system that makes it easy to search with input strings that use abbreviations or other forms that are not the correct spelling.

[Means for Solving the Problems] FIG. 1 is a diagram explaining the principle of the present invention.

The present invention is directed to a search system in which the search processing means 12 searches the record file 14, using an input string from the processed data storage means 10 as the search keyword, and outputs the corresponding data.

For such a search system, the present invention comprises: a word dictionary 16 that stores the correctly spelled formal words corresponding to the abbreviations and the like that make up input strings; a word cut-out means that cuts the input string from the processed data storage means (10) into words; and a normalization means (18). The normalization means (18) consists of three processes. The storage process matches each cut-out piece of the input string against the word dictionary (16); when one or more formal words are obtained, it stores all of them in the storage position corresponding to that piece, and when no formal word is obtained, it stores the piece itself in that storage position. The loop process repeats the storage process until no words remain to be cut out. The normalized string creation process, once no words remain, creates the one or more normalized strings that can be formed by combining the stored contents and outputs them to the search means (12).

[Operation] With the input string normalization method of the search system configured in this way, even when the input string contains something that is not a correctly spelled word, such as an abbreviation, the formal word corresponding to the abbreviation is looked up in the word dictionary as preprocessing for the search, normalized strings are created from combinations of the formal words, and the search is carried out with these normalized strings as keywords. Consequently, as long as abbreviations and spelling variations are anticipated word by word, a normalized string containing the correctly spelled official name is generated and the data search can be performed effectively.

[Embodiment] FIG. 2 is a configuration diagram showing one embodiment of the present invention.

In FIG. 2, reference numeral 10 denotes the search target data file, in which data such as telex messages is stored by online processing or batch processing. Reference numeral 20 denotes the host computer, which provides the functions of the string normalization processing unit 18 according to the present invention and of the search processing unit 12. A word dictionary file 16 is provided for the string normalization processing unit 18 of the host computer 20, and a search database 14 is provided for the search processing unit 12. The search results from the search processing unit 12 of the host computer 20 are output to an output device 22 such as a CRT or a printer.

The host computer 20 takes the telex message to be processed from the search target data file 10 and passes the input string representing the counterparty company name contained in the message to the string normalization processing unit 18. By referring to the word dictionary file 16, the unit generates normalized strings in which the abbreviations used in the input string are converted to correctly spelled words.

The word dictionary file 16 used in the normalization process of the string normalization processing unit 18 stores the correctly spelled formal words corresponding to the abbreviations and the like that make up input strings. The company names that appear as counterparties in telex messages can be regarded as consisting of a proper noun plus words that express the type of business, the products handled, place names and so on. The word dictionary file 16 therefore registers the words of company names other than proper nouns, together with their abbreviated forms. For example, because "BK", "BNK", "GINKO" and the like are used as abbreviations of the regular spelling "BANK", the entries are registered so that the correctly spelled formal word can be retrieved from each of these forms.

The normalization process performed by the string normalization processing unit 18 is outlined as follows.

First, the input string obtained from the search target data file 10 is cut into individual words. Next, the word dictionary file 16 is searched for each cut-out word to find the one or more corresponding formal words. Finally, one or more normalized strings are created from combinations of the formal words obtained from the word dictionary file 16 and handed to the search processing unit 12, which searches the search database 14 using the normalized strings as keywords. In other words, the string normalization process of the present invention is performed as preprocessing of the input string, such as a counterparty company name, that the search processing unit 12 uses as a keyword.

Next, the processing operation of the string normalization processing unit 18 of FIG. 2 is described with reference to FIGS. 3A and 3B.

In FIG. 3A, first, in step S1 (the word "step" is omitted below), the input string obtained from the search target data file 10 is split at its delimiters and set in the input word group. For example, if the input string is "NIPPON TEL + TEL" as shown in FIG. 4, it is split into four at the spaces that act as delimiters between the words, and each word is stored in the input word group.

Next, in S2, the number of words in the input word group is set in the input word count N. In the case of FIG. 4, N = 4. Then S3 checks whether N = 0, that is, whether the normalization process has finished. Since N = 4 at first, processing proceeds to S4.

In S4, the word at position N = 4 of the input word group, that is, the fourth word from the right in FIG. 4, "NIPPON", is set as the search word, and in S5 the word dictionary file 16 is searched with "NIPPON" as the keyword. Because "NIPPON" is a proper noun, it is not registered in the word dictionary file 16 in this embodiment. Processing therefore goes from S6 to S7, where the search word "NIPPON" itself is taken as the data, and in S8 the data "NIPPON" obtained in S7 is stored at the position of input word group index 1 in the word storage area shown in FIG. 5.

Next, S9 sets N = 4 - 1 = 3 and processing returns to S4 via S3. The word at position N = 3 of the input word group, that is, the third word from the right in FIG. 4, "TEL", is set as the search word, and in S5 the word dictionary file 16 is searched with "TEL" as the keyword. This search yields two formal words for "TEL": "TELEPHONE" and "TELEGRAM". Since S6 finds that the word dictionary file contains entries matching the search word, processing proceeds to S10, where the two formal words found in S5 are stored at the position of input word group index 2 in the word storage area shown in FIG. 5.

Likewise, for N = 2, searching the word dictionary file 16 with the second word from the right in FIG. 4, "PLUS", as the search word yields two words, "AND" and "PLUS", which are stored at the position of input word group index 3 in FIG. 5. In the final pass, where N = 1, two words are again found for the search word "TEL", as in the first lookup of "TEL", and they are stored at the position of input word group index 4 in FIG. 5.

When the word dictionary file 16 has been searched for every delimiter-separated word of the input string in this way, S3 determines that N = 0 and processing proceeds to S11 in FIG. 3B.

In S11, the number of input words, N = 4, is set in the input word index. In S12, one word is taken from the word storage area shown in FIG. 5, starting at the position of index 4. S13 decrements the index by one, and S14 checks whether the index is 0, that is, whether four words have been taken. If not, processing returns to S12 and a word is taken from the next position, index 3, and this is repeated in order until the index reaches 0. Once four words have been taken and the index is 0, processing proceeds to S15, where the normalized string formed by combining the four words taken from the word storage area of FIG. 5 is stored in the string storage area. S16 then checks whether all patterns have been stored in the string storage area. If not, processing returns to S11, the index is again set to N = 4, and the next pattern is extracted.

When S16 determines that all patterns have been stored in the string storage area, processing proceeds to S17, where the contents of the string storage area are handed to the search processing unit 12. The search processing unit 12 then searches the search database 14 using the normalized strings received from the string normalization processing unit 18 as keywords.

FIG. 6 shows how the normalized strings created in S11 to S17 of FIG. 3B from the data stored in the word storage area of FIG. 5 are stored.

That is, for the input string of FIG. 4, the dictionary file search has produced the correctly spelled words shown at indexes 1 to 4 in FIG. 5, and combining them generates the eight normalized string patterns shown in FIG. 6.

The eight patterns generated as shown in FIG. 6 are used one after another by the search processing unit 12 as keywords for searching the search database 14, and a search result is obtained whenever a pattern is a correct normalized string. In the case of FIG. 6, one of the patterns is the official name, so the search that uses that normalized string as its keyword returns the corresponding result.

[Effects of the Invention] As described above, according to the present invention, even when only correctly spelled input strings are valid as search keywords for the record file, an input string that uses abbreviations is converted into correctly spelled strings by the normalization process, so the corresponding search result can be obtained even for such an input. This reduces search errors, cuts down on wasted operations, and improves search performance.

[Brief Description of the Drawings] FIG. 1 is a diagram explaining the principle of the present invention; FIG. 2 is a configuration diagram of an embodiment of the present invention; FIGS. 3A and 3B are flowcharts of the string normalization process of the present invention; FIG. 4 is an explanatory diagram of the input string; FIG. 5 is an explanatory diagram of the input word index and the storage of the retrieved data; FIG. 6 is an explanatory diagram of the string storage area. Reference numerals in the figures: 10: processed data storage means (search target data file); 12: search processing means; 14: database (search database); 16: word dictionary (word dictionary file); 18: string normalization means; 20: host computer; 22: output device.

Claims

(57) [Claims]

1. An input string normalization method for a search system in which a search processing means (12) searches a record database (14) using an input character string from a processed data storage means (10) as a search keyword and outputs the corresponding data, the method being characterized by providing: a word dictionary (16) storing correctly spelled formal words corresponding to the abbreviations and the like that make up the input character string; and a normalization means (18) comprising a word cut-out process of cutting the input character string from the processed data storage means (10) into words, a storage process of matching the cut-out input character string against the word dictionary (16) and, when one or more formal words are obtained, storing all of the obtained formal words in the storage position corresponding to the input character string, or, when no formal word is obtained, storing the input character string in the storage position corresponding to that input character string, a loop process of repeating the storage process until no words remain to be cut out, and a normalized string creation process of creating, once no words remain to be cut out, the one or more normalized strings that can be formed by combining the stored contents and outputting them to the search means (12).