JPH04225471A - Keyword retrieving method

Info

Publication number: JPH04225471A
Application number: JP2407098A
Authority: JP
Inventors: Hidetoshi Ito; 伊東　英俊
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1990-12-27
Filing date: 1990-12-27
Publication date: 1992-08-14

Abstract

PURPOSE: To improve retrieval efficiency by reasonably rescuing, rather than simply excluding, keywords in which some constituent characters have been misread. CONSTITUTION: Read character strings whose characters match those of the registered keyword at the same rank in at least a first predetermined number of places, that number depending on the number of characters in the registered keyword, are selected. For the selected read character strings, the set of characters at each rank is derived. When each such set contains at least a second predetermined number of characters that match the character of the corresponding rank of the registered keyword, that number depending on the number of characters belonging to each set, all of the previously selected read character strings are regarded as keywords to be retrieved.

Description

[Detailed description of the invention]

[0001]

[Industrial Application Field] This invention relates to a method of searching a document for character strings identical to a registered keyword, and in particular to a keyword search method that improves search efficiency by reasonably rescuing, rather than simply excluding, occurrences in which some of the characters constituting the keyword have been misread.

[0002]

[Prior Art] In general, keywords are an effective way to grasp the contents of a document quickly and accurately. For example, documents on global environmental protection use keywords such as "radioactivity," "ozone layer," and "global pollution." When a document is converted into a database, a character reading device reads the characters of the document (the original characters) in order, selects as each read character the character with the highest similarity to a standard character, and creates a text represented by character codes. This text is then searched for a single character or character string identical to the registered keyword.

[0003]

[Problem to Be Solved by the Invention] In the conventional method, if the result read by the character reading device contains an error, that is, if even one character is misread, the keyword is excluded from the search results even though it is present. Consequently, when search efficiency is defined as the ratio of the number of keywords retrieved to the total number of keywords in the document, the search efficiency drops markedly.
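As a rough illustration of this problem (a sketch, not part of the patent), the following Python snippet runs an exact-match search over a made-up OCR output in which one of two keyword occurrences contains a misread character. The sample text and the helper names `exact_match_count` and `search_efficiency` are hypothetical and chosen only for this example.

```python
def exact_match_count(text: str, keyword: str) -> int:
    """Count the windows of len(keyword) characters that equal the keyword exactly."""
    n = len(keyword)
    return sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == keyword)


def search_efficiency(retrieved: int, total: int) -> float:
    """Ratio of retrieved keywords to keywords actually present in the document."""
    return retrieved / total if total else 0.0


# Hypothetical OCR output: the keyword was printed twice in the original,
# but in the second occurrence the character 士 was misread as 土.
ocr_text = "富士電機の製品と富土電機の技術"
keyword = "富士電機"

found = exact_match_count(ocr_text, keyword)
efficiency = search_efficiency(found, total=2)
```

Under these assumptions a single misread character halves the search efficiency, because the conventional search has no way to recover the second occurrence.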

[0004] An object of this invention is to solve the above problem of the prior art and to provide a keyword search method that improves search efficiency by reasonably rescuing, rather than simply excluding, occurrences in which some of the characters constituting a keyword have been misread.

[0005]

[Means for Solving the Problem] To solve this problem, the keyword search method according to claim 1 is a method of searching a document for character strings identical to a registered keyword, in which: each character of the document is read by a character reading device; read character strings are selected that have the same number of characters as the registered keyword, match it in at least a first predetermined number of characters of the same rank (that is, the same character position), and do not match it in every character; for the selected read character strings, the set of characters at each rank is obtained; and when the number of characters in the set for each rank that match the character of the corresponding rank of the registered keyword is at least a second predetermined number, all of the selected read character strings are regarded as keywords to be retrieved.

[0006] In the keyword search method according to claim 2, the first predetermined number in the method of claim 1 is determined according to the number of characters in the registered keyword.

[0007] In the keyword search method according to claim 3, the second predetermined number in the method of claim 1 or 2 is determined according to the number of common characters belonging to the character set for each rank.

[0008]

[Operation] In the keyword search method according to claim 1: (1) each character of the document is read by a character reading device, and read character strings are selected that have the same number of characters as the registered keyword, match it in at least a first predetermined number of characters of the same rank, and do not match it in every character. In other words, read character strings that partially match the registered keyword, and that would conventionally be excluded, pass a first selection as candidates for rescue. (2) For the selected read character strings, the set of characters at each rank is obtained, and when the number of characters in the set for each rank that match the character of the corresponding rank of the registered keyword is at least a second predetermined number, all of the read character strings are regarded, as a second selection, as keywords to be retrieved. The first predetermined number used in the first selection is determined according to the number of characters in the registered keyword, as in claim 2, and the second predetermined number used in the second selection is determined according to the number of common characters belonging to the character set for each rank, as in claim 3.

[0009]

[Embodiment] A search device to which the keyword search method according to the present invention is applied is described below with reference to the drawings. FIG. 3 shows an example of a registered keyword and the read character strings that pass the first selection in the search device. In FIG. 3, the registered keyword K is the four-character string 富士電機 (Fuji Electric). Assume that the first selection yields five read character strings W1 to W5, namely 富土謳機, 富土電揆, 笛士雷機, 宮士壇機, and 宙土電機. Each read character string shares two characters (underlined in the original) with the registered keyword at the same rank. Here, the first predetermined number of the invention is 2.
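The first selection in this example can be sketched in Python as follows. This is only an illustrative reading of the text, not the patent's implementation: the function names `matching_positions` and `first_selection` are hypothetical, and the sixth candidate string is an invented non-matching string added to show a rejection.

```python
def matching_positions(candidate: str, keyword: str) -> int:
    """Number of ranks (positions) at which candidate and keyword hold the same character."""
    return sum(1 for c, k in zip(candidate, keyword) if c == k)


def first_selection(candidates, keyword, first_number):
    """Keep strings of the keyword's length that match it at first_number or more
    ranks but are not identical to it (exact matches are found conventionally)."""
    return [
        w for w in candidates
        if len(w) == len(keyword)
        and w != keyword
        and matching_positions(w, keyword) >= first_number
    ]


keyword = "富士電機"
read_strings = [
    "富土謳機", "富土電揆", "笛士雷機", "宮士壇機", "宙土電機",  # W1 to W5 from the text
    "宮土壇揆",  # hypothetical extra string with no matching character
]
selected = first_selection(read_strings, keyword, first_number=2)
```

With the first predetermined number set to 2, the five strings W1 to W5 are kept and the invented sixth string is rejected.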

[0010] FIG. 4 shows the second selection process for the read character strings. In FIG. 4, the first column gives the character rank, the second column the registered keyword, the third column the set of characters at each rank of the read character strings that passed the first selection, and the fourth column the number of characters in each set that match the corresponding constituent character of the registered keyword. For example, the constituent character of the registered keyword at character rank 1 is 富, and the set of characters at rank 1 in the five read character strings W1 to W5 is {富, 富, 笛, 宮, 宙}. The number of matching characters is therefore 2. Here, the second predetermined number of the invention is 2. In the same way, character sets are obtained for character ranks 2, 3, and 4, giving 2, 2, and 4 matching characters respectively. The five read character strings W1 to W5 that passed the first selection therefore also pass the second selection and are judged to be keywords. The first and second predetermined numbers are basically determined empirically. In general, the first predetermined number takes a larger value as the number of characters constituting the registered keyword increases, and the second predetermined number takes a proportionally larger value as the number of characters belonging to the character set of each rank increases.
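The per-rank counting described for FIG. 4 can be reproduced with a short Python sketch. It assumes that the second selection requires every rank to reach the second predetermined number, which is how the worked example behaves but is an interpretation of the text; the function names `rank_match_counts` and `second_selection` are hypothetical.

```python
def rank_match_counts(selected, keyword):
    """For each rank, count how many selected strings carry the keyword's character there."""
    return [
        sum(1 for w in selected if w[i] == k)
        for i, k in enumerate(keyword)
    ]


def second_selection(selected, keyword, second_number):
    """True when, at every rank, the keyword's character occurs at least
    second_number times among the selected strings (assumed reading)."""
    if not selected:
        return False
    return all(n >= second_number for n in rank_match_counts(selected, keyword))


keyword = "富士電機"
w1_to_w5 = ["富土謳機", "富土電揆", "笛士雷機", "宮士壇機", "宙土電機"]

counts = rank_match_counts(w1_to_w5, keyword)
passed = second_selection(w1_to_w5, keyword, second_number=2)
```

For the five strings of the example, the counts come out as 2, 2, 2, and 4, so the second selection passes with a second predetermined number of 2 and would fail with 3.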

[0011] FIG. 2 is a block diagram showing the configuration of the search device. In FIG. 2, 1 is an image scanner that obtains an image of the original characters of the document, and 2 is a reading unit that reads the characters from that image. 3 is a memory for read characters, in which the read characters are stored as character codes. 4 is an input unit for registered keywords, and 5 is a memory for registered keywords. 6 is a collation unit that collates the corresponding character codes from memories 3 and 5, judges whether they match, carries out the first and second selection processes described above, and finally decides whether a string is to be regarded as a keyword. 7 is a CRT that displays the collation results on its screen. A printer that prints out the collation results can be installed alongside the CRT 7.

[0012] FIG. 1 is a flowchart showing the operation of the search device. In step S1 of FIG. 1, read character strings W with the same number of characters as the keyword KW are selected one after another (preliminary selection). In step S2, from the read character strings W selected in S1, the read character strings Wi whose characters are identical to those of KW at every rank are selected. This is the conventional search method. In step S3, from the read character strings not selected in S2, the read character strings Wj that have the same character as KW at A or more ranks (A being the first predetermined number) are selected (first selection). That is, the steps from S3 onward rescue keywords that would otherwise be missed. In the example of FIG. 3, A = 2. In step S4, steps S1 to S3 are repeated over the entire document. In step S5, it is judged whether, in the set of same-rank characters of the read character strings Wj selected in S3, B or more characters (B being the second predetermined number) are the same as the same-rank character of KW (second selection). If YES, processing proceeds to step S6; if NO, processing ends without any rescue. In the example of FIG. 4, B = 2. In step S6, the read character strings Wj are regarded as KW and rescued. The retrieved keywords are therefore the read character strings Wi and Wj.
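Steps S1 to S6 can be put together in one Python sketch. Several details are assumptions made for illustration rather than statements of the patent: the candidate strings W are taken as a one-character sliding window over the read text, the second selection requires every rank to reach B, and the function name `search_with_rescue` and the sample text are hypothetical.

```python
def search_with_rescue(text, keyword, first_number, second_number):
    """Return (exact, rescued) lists of (offset, string) pairs for one keyword."""
    n = len(keyword)
    exact, partial = [], []  # Wi and Wj in the flowchart's terms

    # S1 and S4: take every window of the keyword's length over the whole text.
    for i in range(len(text) - n + 1):
        w = text[i:i + n]
        matches = sum(1 for c, k in zip(w, keyword) if c == k)
        if matches == n:               # S2: conventional exact match (Wi)
            exact.append((i, w))
        elif matches >= first_number:  # S3: first selection (Wj)
            partial.append((i, w))

    # S5: second selection on the per-rank character sets of all Wj.
    rescued = []
    if partial and all(
        sum(1 for _, w in partial if w[p] == k) >= second_number
        for p, k in enumerate(keyword)
    ):
        rescued = partial              # S6: regard every Wj as the keyword
    return exact, rescued


# Hypothetical read text containing one correct occurrence and the five
# misread strings W1 to W5 from the example.
text = "富士電機は富土謳機と富土電揆と笛士雷機と宮士壇機と宙土電機"
exact, rescued = search_with_rescue(text, "富士電機", first_number=2, second_number=2)
strict_exact, strict_rescued = search_with_rescue(text, "富士電機", 2, 3)
```

With A = 2 and B = 2 the sketch returns the one exact occurrence plus the five rescued strings. Raising B to 3 leaves only the exact occurrence, which shows how the second predetermined number limits the rescue.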

[0013] As described above, read character strings that the conventional method excludes are rescued through the first and second selection processes. Because this rescue is carried out under appropriate and reasonable constraints, it improves search efficiency, and the rescued read character strings are retrieved with high accuracy.

[0014]

[Effects of the Invention] The keyword search methods according to claims 1 to 3 have the following in common: as a first selection, read character strings are selected that have the same number of characters as the registered keyword, match it in at least a first predetermined number of characters of the same rank, and do not match it in every character; for the selected read character strings, the set of characters at each rank is obtained; and when the number of characters in the set for each rank that match the character of the corresponding rank of the registered keyword is at least a second predetermined number, all of the read character strings are regarded, as a second selection, as keywords to be retrieved.

[0015] Consequently, even if some of the characters constituting a keyword are misread, search efficiency is improved by reasonably rescuing the affected strings rather than simply excluding them. In particular, because the first predetermined number is determined according to the number of characters in the registered keyword, as in claim 2, and the second predetermined number is determined according to the number of common characters belonging to the character set for each rank, as in claim 3, the rescue is carried out under appropriate and reasonable constraints and the rescued read character strings are retrieved with high accuracy.

[Brief explanation of the drawing]

[FIG. 1] Flowchart showing the operation of a search device to which the method according to the present invention is applied.

[FIG. 2] Block diagram showing the configuration of the search device.

[FIG. 3] Example of a registered keyword and the read character strings that pass the first selection in the search device.

[FIG. 4] Diagram of the second selection process for the read character strings.

[Explanation of symbols]

1 Image sensor
2 Reading unit
3 Memory
4 Input unit
5 Memory
6 Collation unit
7 CRT

Claims

[Claims]

Claim 1: A keyword search method for searching a document for character strings identical to a registered keyword, characterized in that: each character of the document is read by a character reading device; read character strings are selected that have the same number of characters as the registered keyword, match it in at least a first predetermined number of characters of the same rank, and do not match it in every character; for the selected read character strings, the set of characters at each rank is obtained; and when the number of characters in the set for each rank that match the character of the corresponding rank of the registered keyword is at least a second predetermined number, all of the read character strings are regarded as keywords to be retrieved.

Claim 2: The keyword search method according to claim 1, characterized in that the first predetermined number is determined according to the number of characters of the registered keyword.

Claim 3: The keyword search method according to claim 1 or 2, characterized in that the second predetermined number is determined according to the number of common characters belonging to the character set for each rank.