JPH04215183A - Key word retrieving method

Info

Publication number: JPH04215183A
Application number: JP2401761A
Authority: JP
Inventors: Yukio Kudo; 久藤　幸生
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1990-12-13
Filing date: 1990-12-13
Publication date: 1992-08-05

Abstract

PURPOSE: To realize a retrieval method that improves retrieval efficiency by recovering sensibly even when some original characters in a document are misread. CONSTITUTION: The retrieval device is composed of an image scanner 1, a reading part 2, a memory 3 for candidate characters, an input part 4 for registered keywords, a memory 5 for registered keywords, a collation part 6 and a CRT 7. Based on the image of each original character, the reading part 2 selects up to a prescribed number (e.g. three) of candidate characters and stores them in the memory 3. When at least one of the candidate characters of each original character coincides with the character at the corresponding position of the registered keyword, the original characters are judged to be the corresponding characters of the keyword to be retrieved.

Description

[Detailed description of the invention]

[0001]

[Industrial Application Field] This invention relates to a method of searching a document for a keyword, that is, a single character or character string identical to a registered keyword, and in particular to a keyword search method that retains good search efficiency even if some of the characters in the document are misread.

[0002]

[Prior Art] In general, keywords are an effective way to grasp the contents of a document quickly and accurately. For example, documents on global environmental protection use keywords such as "radioactivity," "ozone layer," and "global pollution." When documents are converted into a database, the characters of the document, that is, the original characters, are read in order by a character reading device, which selects as the read character the character with the highest similarity to a standard character, and a text expressed in character codes is created. This text is then searched for keywords, that is, single characters or character strings identical to a registered keyword.

[0003]

[Problem to Be Solved by the Invention] In the conventional method, if the result read by the character reading device contains an error, that is, if even one character is misread, the keyword is excluded from the search results even though it is present in the document. In other words, if search efficiency is defined as the ratio of the number of keywords retrieved to the total number of keywords in the document, the search efficiency drops markedly.
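As a minimal sketch of the problem described above, the following Python fragment applies the definition of search efficiency to a hypothetical OCR result in which a single character of one keyword occurrence was misread. The sample text, the function name `search_efficiency`, and the counts are illustrative assumptions, not data from the specification.

```python
def search_efficiency(retrieved: int, total: int) -> float:
    """Keywords retrieved divided by keywords actually present in the document."""
    if total <= 0:
        raise ValueError("total must be positive")
    return retrieved / total


# Hypothetical document containing the keyword twice; in the second
# occurrence the reader picked 閉 instead of 開 as its single best guess.
keyword = "共同開発"
ocr_text = "共同開発と共同閉発"

# Conventional search: exact match against the single best reading.
retrieved = ocr_text.count(keyword)
print(search_efficiency(retrieved, total=2))  # 0.5
```

One misread character is enough to hide the second occurrence, so only half of the keywords present are retrieved.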

[0004] An object of the present invention is to solve the above problems of the prior art and to provide a keyword search method that retains good search efficiency even if some of the characters in the document are misread.

[0005]

[Means for Solving the Problem] To solve this problem, the keyword search method according to claim 1 is a method of searching a document for a keyword, that is, a single character or character string identical to a registered keyword, in which: each character of the document is read in order as an original character by a character reading device, and up to a predetermined number of candidate characters are selected for each original character based on similarity to standard characters; the earliest original character for which at least one of its candidate characters matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following the original character that corresponds to this first character; and when at least one of the candidate characters of each following original character matches the character at the corresponding subsequent position of the registered keyword, each following original character is taken as the character at the corresponding subsequent position of the keyword to be retrieved. The keyword search method according to claim 2 is the method of claim 1 in which the predetermined number is three.
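The procedure of claim 1 can be sketched in Python as follows. This is one possible reading, not the patent's implementation: the function name `find_keyword`, the list-of-lists representation of candidate characters, and the choice to report every matching start position (rather than only the earliest) are assumptions made for illustration.

```python
from typing import List, Sequence


def find_keyword(candidates: Sequence[Sequence[str]], keyword: str) -> List[int]:
    """Return start positions where every keyword character appears among
    the candidate characters of the corresponding original character."""
    n = len(keyword)
    if n == 0:
        return []
    hits = []
    for start in range(len(candidates) - n + 1):
        # First character: at least one candidate must equal keyword[0];
        # following characters: same test at each subsequent position.
        if all(keyword[k] in candidates[start + k] for k in range(n)):
            hits.append(start)
    return hits


# Hypothetical reading of a five-character text, up to three candidates each.
page = [["の"], ["共", "井", "丼"], ["伺", "同", "向"], ["開", "閉", "関"], ["発", "溌", "廃"]]
print(find_keyword(page, "共同開発"))  # [1]
```

The keyword is found at position 1 even though the best single reading of the second keyword character is 伺 rather than 同.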

[0006]

[Operation] In the keyword search method according to either claim 1 or claim 2, even if some original characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are listed as read characters, and the earliest read character for which at least one candidate matches the first character of the registered keyword is sought; a character matched in this way is very likely the correct character. Each subsequent original character is then read in order, up to the same predetermined number of candidate characters (for example three) are selected for each, and those for which at least one candidate matches the corresponding subsequent character of the registered keyword are sought; characters matched in this way are, with high probability, the corresponding subsequent characters of the keyword.

[0007]

[Embodiment] A search device to which the keyword search method according to the present invention is applied is described below with reference to the drawings. FIG. 3 is a correspondence diagram of the registered keyword, the original characters, and the candidate characters in the search device. In FIG. 3, the first row is the registered keyword, the second row is the original characters to be searched, the third row is the first candidate character from reading each original character, the fourth row is the second candidate character, and the fifth row is the third candidate character. The registered keyword is "共同開発" ("joint development"). For the original character 共, the first candidate character is 共, the second is 井, and the third is 丼. Likewise, 伺, 同, and 向 are selected as candidate characters for the original character 同; 開, 閉, and 関 for the original character 開; and 発, 溌, and 廃 for the original character 発. The first, second, and third candidate characters are selected, up to three in number, either in descending order of similarity between the original character and the standard characters or by successively relaxing a similarity threshold. When the similarity becomes extremely low, a character can no longer be called a candidate, so it is not necessary to fill all three slots.
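The candidate selection rule described above (at most three characters, in descending similarity, dropping any whose similarity is too low) might be sketched as follows. The similarity scores, the threshold value 0.5, and the name `select_candidates` are hypothetical; the specification does not give numerical similarities.

```python
from typing import Dict, List


def select_candidates(similarity: Dict[str, float],
                      max_candidates: int = 3,
                      min_similarity: float = 0.5) -> List[str]:
    """Pick up to max_candidates standard characters, most similar first,
    skipping any whose similarity falls below min_similarity."""
    ranked = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    return [char for char, score in ranked[:max_candidates] if score >= min_similarity]


# Hypothetical similarity scores for an image of the original character 同.
scores = {"伺": 0.91, "同": 0.89, "向": 0.84, "何": 0.40}
print(select_candidates(scores))  # ['伺', '同', '向']

# With a poor image only two characters clear the threshold, so only two
# candidates are returned instead of padding the list to three.
print(select_candidates({"開": 0.80, "閉": 0.62, "関": 0.31}))  # ['開', '閉']
```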

[0008] FIG. 4 is a correspondence diagram in which each character of FIG. 3 is replaced by a symbol: Ki denotes each constituent character of the registered keyword K, Wi each constituent character of the original text W, and Ai, Bi, and Ci the first, second, and third candidate characters, respectively. Here i = 1, 2, 3, 4 is the character position index shared by the registered keyword and the original characters.

[0009] FIG. 2 is a block diagram showing the configuration of the search device. In FIG. 2, 1 is an image scanner that obtains an image of the original characters of a document, and 2 is a reading unit that selects up to three candidate characters based on the image of each original character. The first, second, and third candidate characters are as described above. 3 is a memory for the candidate characters of the read characters, in which the first, second, and third candidate characters are stored as character codes. 4 is an input unit for registered keywords, and 5 is a memory for registered keywords. 6 is a collation unit that compares the corresponding character codes from memories 3 and 5 and determines whether they match. 7 is a CRT that displays the collation result on its screen. A printer that prints out the collation result may be provided alongside the CRT 7.

[0010] FIG. 1 is a flowchart showing the operation of the search device. In FIG. 1, step S1 initializes to i = 1 the character position index i shared by the constituent characters Ki of the four-character registered keyword and the original characters Wi. In step S2, the first, second, and third candidate characters Ai, Bi, and Ci resulting from reading the original character Wi are selected. In step S3, it is determined whether the first candidate character Ai matches the keyword character Ki; if YES, the process proceeds to step S6, and if NO, to step S4. After Wi and Ki are judged to match in step S6, the above process is repeated for all four characters via steps S7 and S8. Returning to step S4, it is determined whether the second candidate character Bi matches Ki; if YES, the process proceeds to step S6, and if NO, to step S5. In step S5, it is determined whether the third candidate character Ci matches Ki; if YES, the process proceeds to step S6, and if NO, to step S9. In step S9, Wi is treated as unreadable, and consequently the search is treated as impossible in the following step S10.
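The flow of FIG. 1 can be traced with a short Python sketch using the characters of FIG. 3. The step labels in the comments follow the description above; the assumption that steps S7 and S8 advance i and test for the end of the keyword is inferred from the statement that the process repeats for all four characters, and the function name `collate` is hypothetical.

```python
from typing import Sequence


def collate(keyword: str, candidates: Sequence[Sequence[str]]) -> bool:
    """Follow steps S1-S10: True if every keyword character K_i matches
    A_i, B_i, or C_i; False as soon as one position has no match."""
    i = 0                                   # S1: initialise position index
    while i < len(keyword):                 # S8 (assumed): all characters done?
        if i >= len(candidates):
            return False                    # no reading available for this position
        ranked = candidates[i]              # S2: candidates A_i, B_i, C_i
        matched = False
        for candidate in ranked:            # S3, S4, S5: test each in turn
            if candidate == keyword[i]:
                matched = True              # S6: W_i judged to match K_i
                break
        if not matched:
            return False                    # S9/S10: unreadable, search fails
        i += 1                              # S7 (assumed): next position
    return True


fig3 = [["共", "井", "丼"], ["伺", "同", "向"], ["開", "閉", "関"], ["発", "溌", "廃"]]
print(collate("共同開発", fig3))                 # True
print("".join(c[0] for c in fig3) == "共同開発")  # False: first candidates alone misread 同
```

The second print shows why the first candidates alone are not enough: they spell 共伺開発, which an exact-match search would miss.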

[0011] To add a few remarks: even if the first original character W1 of what should be a keyword is misread, up to three candidate characters A1, B1, and C1 are listed as read characters, and if one is found in which at least one of them matches the first character K1 of the registered keyword, the matched character can be regarded as very likely the correct character. Each subsequent original character Wi is then read in order, up to three candidate characters Ai, Bi, and Ci are selected for each, and if those are found in which at least one candidate matches the corresponding subsequent character Ki of the registered keyword, the matched characters can be said to be, with high probability, the corresponding subsequent characters of the keyword. The more candidate characters there are, the higher the reading accuracy, but the longer the processing time, so a balance must be found. The limit of three candidate characters is empirical: it is based on the demonstrated fact that, with moderate processing time, characters can be read correctly with a probability of approximately 99%.
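To give a feel for the figure quoted above, the following sketch estimates how often a whole keyword would be found if each character independently had a 99% chance of having the correct character among its candidates. Treating the 99% figure as a per-character probability and assuming independence between characters are both simplifying assumptions not stated in the specification.

```python
def keyword_hit_probability(per_char: float, length: int) -> float:
    """Probability that all `length` characters match, assuming independence."""
    if not 0.0 <= per_char <= 1.0:
        raise ValueError("per_char must be a probability")
    return per_char ** length


# Four-character keyword such as 共同開発, 99% per character (assumed).
print(round(keyword_hit_probability(0.99, 4), 4))  # 0.9606
```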

[0012]

[Effects of the Invention] In the keyword search method according to either claim 1 or claim 2, even if some original characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are listed as read characters, and the earliest read character for which at least one candidate matches the first character of the registered keyword is sought; a character matched in this way is very likely the correct character. Each subsequent original character is then read in order, up to the same predetermined number of candidate characters (for example three) are listed for each, and those for which at least one candidate matches the corresponding subsequent character of the registered keyword are sought; characters matched in this way are, with high probability, the corresponding subsequent characters of the keyword. As a result, keyword search with good search efficiency is obtained even if some characters in the document are misread. In particular, limiting the candidate characters to three has been demonstrated to give correct character reading with a probability of approximately 99% at moderate processing time, balancing processing time against search accuracy.

[Brief explanation of the drawing]

[FIG. 1] Flowchart showing the operation of a search device to which the method according to the present invention is applied

[FIG. 2] Block diagram showing the configuration of this search device

[FIG. 3] Correspondence diagram of the registered keyword, original characters, and candidate characters in this search device

[FIG. 4] Correspondence diagram in which each character of FIG. 3 is replaced by a symbol

[Explanation of symbols]

1 Image sensor
2 Reading unit
3 Memory
4 Input unit
5 Memory
6 Collation unit
7 CRT

Claims

[Claims]

Claim 1. A keyword search method for searching a document for a keyword, that is, a single character or character string identical to a registered keyword, characterized in that: each character of the document is read in order as an original character by a character reading device, and up to a predetermined number of candidate characters are selected for each original character based on similarity to standard characters; the earliest original character for which at least one of its candidate characters matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following the original character that corresponds to this first character; and when at least one of the candidate characters, up to the predetermined number, of each following original character matches the character at the corresponding subsequent position of the registered keyword, each following original character is taken as the character at the corresponding subsequent position of the keyword to be retrieved.

Claim 2. The keyword search method according to claim 1, wherein the predetermined number is three.