JPH04215183A - Key word retrieving method - Google Patents

Key word retrieving method

Info

Publication number
JPH04215183A
Authority
JP
Japan
Prior art keywords
character
characters
keyword
manuscript
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2401761A
Other languages
Japanese (ja)
Inventor
Yukio Kudo
久藤 幸生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Electric Co Ltd, Fuji Facom Corp
Priority to JP2401761A
Publication of JPH04215183A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

PURPOSE: To realize a retrieving method that improves retrieval efficiency by recovering sensibly even when some original characters in a document are read incorrectly. CONSTITUTION: The retrieving device is composed of an image scanner 1, a read-out part 2, a memory 3 for candidate characters, an input part 4 for registration key words, a memory 5 for registration key words, a collation part 6, and a CRT 7. Based on the image of each original character, the read-out part selects up to a prescribed number (e.g. three) of candidate characters and stores them in the memory 3. When at least one of the candidate characters of an original character coincides with the character at the corresponding position in the registration key word, that original character is decided to be the corresponding character of the key word to be retrieved.

Description

【発明の詳細な説明】[Detailed description of the invention]

【0001】0001

【産業上の利用分野】この発明は、登録キーワードと同
じ単一文字または文字列をキーワードとして文書中から
検索する方法であって、とくに文書中の各文字の読取り
に多少の誤読があっても検索効率の良好なキーワード検
索方法に関する。
[Industrial Application Field] This invention relates to a method for searching a document for a single character or character string identical to a registered keyword, and in particular to a keyword search method that maintains good search efficiency even when some characters in the document are misread.

【0002】0002

【従来の技術】一般に、文書の内容を迅速,的確に把握
するには、キーワードを活用するのが有効である。たと
えば、地球環境保護の問題に関する文書では、たとえば
「放射能」や「オゾン層」,「地球汚染」などのキーワ
ードが用いられる。さて、文書のデータベース化をおこ
なうとき、文書の文字つまり原稿文字を順に文字読取装
置によって、標準文字に対する類似度のもっとも高い文
字を読取文字として選出し、文字コードで表されるテキ
ストを作成する。このテキストに対して、登録されたキ
ーワードと同じ単一文字または文字列をキーワードとし
て検索する。
[Prior Art] Generally, keywords are an effective way to grasp the contents of a document quickly and accurately. For example, documents on global environmental protection use keywords such as "radioactivity," "ozone layer," and "global pollution." When documents are converted into a database, a character reading device reads the characters of the document (the manuscript characters) in order, selects for each one the character with the highest similarity to the standard characters as the read character, and creates a text expressed in character codes. This text is then searched for a single character or character string identical to a registered keyword.

【0003】0003

【発明が解決しようとする課題】従来の方法では、文字
読取装置によって読み取られた結果に誤り、つまり誤読
が1字でもあると、キーワードが存在するにもかかわら
ず、検索対象から除外される。すなわち、文書中のキー
ワード総数に対する検索キーワード数の比率を検索効率
と定義したとき、検索効率は著しく低下する。
[Problem to be Solved by the Invention] In the conventional method, if the result read by the character reading device contains an error, that is, if even one character is misread, the keyword is excluded from the search results even though it is present. In other words, when search efficiency is defined as the ratio of the number of keywords retrieved to the total number of keywords in the document, search efficiency drops significantly.
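The failure mode described above can be illustrated with a minimal Python sketch. It is not part of the patent: the function names `exact_search` and `search_efficiency` and the sample OCR string are assumptions made for illustration. The sample text contains the keyword twice, with 共 misread as 井 in the second occurrence.

```python
def exact_search(text: str, keyword: str) -> list[int]:
    """Return the start positions of exact keyword matches in top-1 OCR text."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        hits.append(start)
        start = text.find(keyword, start + 1)
    return hits


def search_efficiency(retrieved: int, total: int) -> float:
    """Ratio of keywords retrieved to keywords actually present in the document."""
    if total <= 0:
        raise ValueError("total must be positive")
    return retrieved / total


# Hypothetical OCR output: the keyword occurs twice in the manuscript, but the
# first character of the second occurrence was misread (共 read as 井).
ocr_text = "共同開発と井同開発"
hits = exact_search(ocr_text, "共同開発")
print(hits)                              # [0]
print(search_efficiency(len(hits), 2))   # 0.5
```

A single misread character removes the whole occurrence from the results, which is why the efficiency in this made-up case falls to one half.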

【0004】この発明の課題は、従来の技術がもつ以上
の問題点を解消し、文書中の各文字の読取りに多少の誤
読があっても検索効率の良好なキーワード検索方法を提
供することにある。
[0004] An object of the present invention is to solve the above problems of the conventional technology and to provide a keyword search method that maintains good search efficiency even when some characters in a document are misread.

【0005】[0005]

【課題を解決するための手段】この課題を解決するため
に、請求項1に係るキーワード検索方法は、登録キーワ
ードと同じ単一文字または文字列をキーワードとして文
書中から検索する方法において、この文書の各文字を原
稿文字として順に文字読取装置によって読み取り、前記
各原稿文字について標準文字に対する類似度に基づき所
定個数までの候補文字を選出し;前記各原稿文字でもっ
とも先行するものの各候補文字のうち少なくとも一つが
前記登録キーワードの先頭文字と一致するときの、前記
原稿文字を前記検索すべきキーワードの先頭文字とし;
この先頭文字に対応する原稿文字に後続の各文字につい
て前記と同じ所定個数までの候補文字を選出し;この後
続順の各原稿文字に対応する前記所定個数までの各候補
文字のうち少なくとも一つが前記登録キーワードの対応
する後続順位の文字と一致するとき、前記後続順の各原
稿文字を前記検索すべきキーワードの対応する後続順位
の各文字とする。請求項2に係るキーワード検索方法は
、請求項1に記載の方法において、所定個数が3である
[Means for Solving the Problem] To solve this problem, the keyword search method according to claim 1 is a method for searching a document for a single character or character string identical to a registered keyword, wherein: each character of the document is read in order as a manuscript character by a character reading device, and up to a predetermined number of candidate characters are selected for each manuscript character based on similarity to standard characters; the earliest manuscript character for which at least one candidate character matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following that manuscript character; and when at least one of the candidate characters for each following manuscript character matches the character at the corresponding position in the registered keyword, each following manuscript character is taken as the character at the corresponding position of the keyword to be retrieved. The keyword search method according to claim 2 is the method according to claim 1 in which the predetermined number is three.
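As a rough illustration of the method of claim 1, the following Python sketch scans a document whose characters are each represented by a short list of candidate characters. The function name `find_keyword`, the sample data, and the choice to test every start position (rather than stopping at the first hit) are assumptions for illustration and are not taken from the patent.

```python
from typing import Sequence


def find_keyword(candidates: Sequence[Sequence[str]], keyword: str) -> list[int]:
    """Return every start index at which each keyword character appears among
    the candidates of the manuscript character at the corresponding position."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    n, m = len(candidates), len(keyword)
    hits = []
    for start in range(n - m + 1):
        # A position matches when at least one candidate equals the keyword character.
        if all(keyword[j] in candidates[start + j] for j in range(m)):
            hits.append(start)
    return hits


# Hypothetical document of five manuscript characters, each with up to three
# candidates ordered by similarity. Some characters have fewer than three.
document = [
    ["の", "め"],
    ["井", "共", "丼"],
    ["同", "伺"],
    ["開", "閉", "関"],
    ["廃", "発"],
]
print(find_keyword(document, "共同開発"))  # [1]
```

Here the keyword is found at index 1 even though the first-ranked candidates alone would spell a different string.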

【0006】[0006]

【作用】請求項1または2に係るキーワード検索方法で
は共通に、文書の各原稿文字の読取りに多少の誤読があ
っても、読取文字として所定個数、たとえば請求項2の
ように、3個までの候補文字を上げ、そのうち少なくと
も一つが登録キーワードの先頭文字と一致する最先行の
読取文字を探せば、その一致したものは正しい文字であ
る確率が高い。以下、後続する各原稿文字を順に読み取
り、それぞれに対し同じ所定個数、たとえば3個までの
候補文字を選出し、そのうち少なくとも一つが登録キー
ワードの対応する各後続文字と一致するものを探せば、
その一致したものは、高い確率でキーワードの対応する
各後続文字である。
[Operation] In the keyword search method according to claim 1 or 2, even if some manuscript characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are proposed for each read character. If the earliest read character is found for which at least one candidate matches the first character of the registered keyword, that match is highly likely to be the correct character. Each subsequent manuscript character is then read in turn, up to the same predetermined number of candidates (for example three) is selected for each, and a candidate matching the corresponding subsequent character of the registered keyword is sought.
Such a match is, with high probability, the corresponding subsequent character of the keyword.

【0007】[0007]

【実施例】本発明に係るキーワード検索方法が適用され
る検索装置について、以下に図を参照しながら説明する
。図3は検索装置に係る登録キーワード,原稿文字,各
候補文字の対応図である。図3において、第1行は登録
キーワード、第2行は検索すべき原稿文字、第3行は各
原稿文字の読取結果の第1候補文字、第4行は同じくそ
の第2候補文字、第5行は同じくその第3候補文字であ
る。登録キーワードは「共同開発」、原稿文字「共」に
係る読取結果の第1候補文字は「共」、第2候補文字は
「井」、第3候補文字は「丼」である。以下、原稿文字
「同」に対し伺,同,向が、原稿文字「開」に対し開,
閉,関が、原稿文字「発」に対し発,溌,廃がそれぞれ
候補文字として選出される。なお、第1,第2,第3の
各候補文字は、各原稿文字の標準文字に対する類似度の
高い順に、または類似度に係るしきい値を順に緩和して
、3個までの文字が選定される。類似度が極端に低くな
るときには、候補文字とは言えないから、3個を揃えて
選定する必要はない。
[Embodiment] A search device to which the keyword search method according to the present invention is applied is described below with reference to the drawings. FIG. 3 is a correspondence diagram of the registered keyword, the manuscript characters, and the candidate characters in the search device. In FIG. 3, the first row is the registered keyword, the second row is the manuscript characters to be searched, the third row is the first candidate character obtained by reading each manuscript character, the fourth row is the second candidate character, and the fifth row is the third candidate character. The registered keyword is 「共同開発」 ("joint development"). For the manuscript character 「共」, the first candidate character is 「共」, the second is 「井」, and the third is 「丼」. Likewise, 伺, 同, 向 are selected as candidates for the manuscript character 「同」; 開, 閉, 関 for the manuscript character 「開」; and 発, 溌, 廃 for the manuscript character 「発」.
The first, second, and third candidate characters are selected, up to three in total, either in descending order of each manuscript character's similarity to the standard characters or by successively relaxing a similarity threshold. When the similarity becomes extremely low, a character can no longer be called a candidate, so it is not necessary to fill all three slots.
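The FIG. 3 example can be checked with a short Python sketch. The candidate lists are transcribed from the description (共/井/丼, 伺/同/向, 開/閉/関, 発/溌/廃); the variable names are illustrative assumptions.

```python
keyword = "共同開発"

# Candidate characters per manuscript character, first to third candidate,
# as listed in the description of FIG. 3.
candidates = [
    ["共", "井", "丼"],
    ["伺", "同", "向"],
    ["開", "閉", "関"],
    ["発", "溌", "廃"],
]

# Text a conventional reader would store: the first candidate of each character.
top1_text = "".join(c[0] for c in candidates)

# Rank (1 to 3) at which each keyword character appears, or None if absent.
matched_rank = [
    c.index(k) + 1 if k in c else None
    for k, c in zip(keyword, candidates)
]

print(top1_text)             # 共伺開発
print(top1_text == keyword)  # False
print(matched_rank)          # [1, 2, 1, 1]
```

The second manuscript character is matched only by its second candidate, so a search over first candidates alone would miss this occurrence while the candidate-based comparison finds it.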

【0008】図4は、図3の各文字を符号化したときの
対応図であり、登録キーワードKの各構成文字:Ki、
原稿文字Wの各構成文字:Wi、第1,第2,第3の各
候補文字:Ai,Bi,Ciにそれぞれ対応する。ここ
で、i=1,2,3,4で、登録キーワード、原稿文字
の共通な文字順位符号である。
FIG. 4 is a correspondence diagram in which the characters of FIG. 3 are replaced by symbols: Ki denotes each constituent character of the registered keyword K, Wi each constituent character of the manuscript characters W, and Ai, Bi, Ci the first, second, and third candidate characters, respectively. Here i = 1, 2, 3, 4 is the character position index shared by the registered keyword and the manuscript characters.

【0009】図2は検索装置の構成を示すブロック図で
ある。図2において、1は文書の原稿文字に係る画像を
求めるイメージスキャナ、2は読取部で、原稿文字に係
る画像に基づいて3個までの候補文字を選出する。なお
、第1,第2,第3の各候補文字については、既に説明
したとおりである。3は読取文字に係る候補文字用のメ
モリで、読取りの第1,第2,第3の各候補文字が文字
コードで格納される。4は登録キーワード用の入力部、
5は登録キーワード用のメモリである。6は照合部で、
各メモリ3,5からの対応する文字コードを照合し、一
致,不一致の判定をする。7はCRTで、照合結果を画
面に表示する。なお、このCRT7に照合結果を印刷し
て出力するプリンタを併設することもできる。
FIG. 2 is a block diagram showing the configuration of the search device. In FIG. 2, 1 is an image scanner that captures an image of the manuscript characters of the document, and 2 is a reading unit that selects up to three candidate characters based on that image. The first, second, and third candidate characters are as already described. 3 is a memory for the candidate characters of the read characters, in which the first, second, and third candidate characters are stored as character codes. 4 is an input unit for registered keywords, and 5 is a memory for registered keywords. 6 is a collation unit that compares the corresponding character codes from memories 3 and 5 and determines whether they match. 7 is a CRT that displays the collation result on its screen. A printer that prints the collation result may also be provided alongside the CRT 7.

【0010】図1は検索装置の動作を示すフローチャー
トである。図1において、ステップS1で、4個の文字
からなる登録キーワードの各構成文字Ki、原稿文字W
iの共通な文字順位符号iの初期化、i=1をおこなう
。ステップS2で、原稿文字Wiを読み取った結果の3
個の第1,第2,第3の各候補文字Ai,Bi,Ciを
選出する。ステップS3で、第1候補文字Aiがキーワ
ード構成文字Kiと一致するかどうかが判断され、YE
SならステップS6へ、NOならステップS4へ移行す
る。ステップS6で、Wi,Kiは一致すると判定され
た後、以上のプロセスがステップS7とステップS8を
経て、4個の文字すべてについて繰り返される。戻って
ステップS4で、第2候補文字BiがKiと一致するか
どうかが判断され、YESならステップS6へ、NOな
らステップS5へ移行する。ステップS5で、第3候補
文字CiがKiと一致するかどうかが判断され、YES
ならステップS6へ、NOならステップS9へ移行する
。ステップS9で、Wiは読取不能とされ、したがって
次のステップS10で検索不能とされる。
FIG. 1 is a flowchart showing the operation of the search device. In step S1, the character position index i shared by the constituent characters Ki of the four-character registered keyword and the manuscript characters Wi is initialized to i = 1. In step S2, the first, second, and third candidate characters Ai, Bi, Ci obtained by reading the manuscript character Wi are selected. In step S3, it is determined whether the first candidate character Ai matches the keyword character Ki; if YES, the process moves to step S6, and if NO, to step S4. In step S6, Wi and Ki are judged to match, after which the above process is repeated, via steps S7 and S8, for all four characters. Returning to step S4, it is determined whether the second candidate character Bi matches Ki; if YES, the process moves to step S6, and if NO, to step S5. In step S5, it is determined whether the third candidate character Ci matches Ki; if YES, the process moves to step S6, and if NO, to step S9. In step S9, Wi is treated as unreadable, and consequently in the next step S10 the keyword is treated as unsearchable.
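The flow of FIG. 1 might be sketched in Python as follows. The step comments map onto S1 to S10 as described in the text, with two assumptions the description does not spell out: that S7 advances the index and S8 tests whether all characters have been handled. The function name `match_by_flowchart` and the 0-based index are also choices made for this sketch.

```python
from typing import Optional, Sequence


def match_by_flowchart(
    keyword: str, candidates: Sequence[Sequence[str]]
) -> Optional[list[int]]:
    """Return the candidate rank (1 to 3) matched at each keyword position,
    or None when any position fails all candidates (search not possible)."""
    if len(candidates) < len(keyword):
        return None                      # not enough manuscript characters
    ranks = []
    i = 0                                # S1: initialise index (0-based here)
    while i < len(keyword):              # S8 (assumed): all characters done?
        trio = candidates[i][:3]         # S2: up to three candidates A_i, B_i, C_i
        for rank, ch in enumerate(trio, start=1):
            if ch == keyword[i]:         # S3, S4, S5: test A_i, then B_i, then C_i
                ranks.append(rank)       # S6: W_i judged to match K_i
                break
        else:
            return None                  # S9: unreadable, S10: unsearchable
        i += 1                           # S7 (assumed): advance to next character
    return ranks


fig3 = [["共", "井", "丼"], ["伺", "同", "向"], ["開", "閉", "関"], ["発", "溌", "廃"]]
print(match_by_flowchart("共同開発", fig3))  # [1, 2, 1, 1]
```

Testing the candidates in rank order means the comparison stops at the first match, so in the common case only the first candidate is examined.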

【0011】ここで、若干補足すると、キーワードとな
るべき先頭の原稿文字W1の読取りに多少の誤読があっ
ても、読取文字として3個までの候補文字A1,B1,
C1を上げ、そのうち少なくとも一つが登録キーワード
の先頭文字K1と一致するものを探せば、その一致した
ものは正しい文字である確率が高い、と考えることがで
きる。以下、後続する各原稿文字Wiを順に読み取り、
それぞれに対し3個までの候補文字Ai,Bi,Ciを
選出し、そのうち少なくとも一つが登録キーワードの対
応する各後続文字Kiと一致するものを探せば、その一
致したものは、高い確率でキーワードの対応する各後続
文字であると言える。なお、候補文字の個数は多いほど
、読取り確度は上がるが、処理時間もかかるから、調和
点を求める必要がある。候補文字を3個までとしたのは
、経験的なもので、処理時間もほどほどの線で、ほぼ9
9%の確率で正確な文字読取りができることが実証され
たことに基づく。
[0011] To supplement the above: even if the first manuscript character W1 of a would-be keyword is somewhat misread, up to three candidate characters A1, B1, C1 are proposed for the read character, and if at least one of them matches the first character K1 of the registered keyword, that match can be considered highly likely to be the correct character. Each subsequent manuscript character Wi is then read in order, up to three candidate characters Ai, Bi, Ci are selected for each, and if at least one of them matches the corresponding subsequent character Ki of the registered keyword, the match can be said to be, with high probability, the corresponding subsequent character of the keyword. The more candidate characters there are, the higher the reading accuracy, but the longer the processing time, so a balance must be found. The limit of three candidates is empirical: it is based on the demonstrated result that characters can be read correctly with a probability of approximately 99% while keeping processing time moderate.
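To give a feel for what a per-character figure of about 99% means for a whole keyword, the following Python sketch multiplies per-character probabilities. It assumes that characters are read independently and with equal accuracy, which the patent does not state. The 0.90 figure for a single-candidate reader is purely hypothetical and is included only for comparison.

```python
def keyword_hit_probability(per_char_accuracy: float, keyword_length: int) -> float:
    """Probability that every character of a keyword occurrence is matched,
    assuming each character is read independently with the same accuracy."""
    if not 0.0 <= per_char_accuracy <= 1.0:
        raise ValueError("per_char_accuracy must be in [0, 1]")
    if keyword_length < 1:
        raise ValueError("keyword_length must be at least 1")
    return per_char_accuracy ** keyword_length


# About 99% per character with up to three candidates (figure from the text).
print(round(keyword_hit_probability(0.99, 4), 4))  # 0.9606
# Hypothetical 90% per character with a single candidate (not from the patent).
print(round(keyword_hit_probability(0.90, 4), 4))  # 0.6561
```

Under these assumptions a four-character keyword such as 「共同開発」 would be found in roughly 96 of 100 occurrences, and the gap to a less accurate reader widens as keywords get longer.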

【0012】0012

【発明の効果】請求項1または2に係るキーワード検索
方法では共通に、文書の各原稿文字の読取りに多少の誤
読があっても、読取文字として所定個数、たとえば請求
項2のように、3個までの候補文字を上げ、そのうち少
なくとも一つが登録キーワードの先頭文字と一致する最
先行の読取文字を探せば、その一致したものは正しい文
字である確率が高いから、以下、後続する各原稿文字を
順に読み取り、それぞれに対し同じ所定個数、たとえば
3個までの候補文字を上げ、そのうち少なくとも一つが
登録キーワードの対応する各後続文字と一致するものを
探せば、その一致したものは、高い確率でキーワードの
対応する各後続文字である。したがって、文書中の各文
字の読取りに多少の誤読があっても、結果として検索効
率の良好なキーワード検索ができるという効果が得られ
る。とくに、候補文字を3個までにすることによって、
処理時間もほどほどの線で、ほぼ99%の確率で正確な
文字読取りができ、処理時間と検索確度との調和が図れ
ることが実証された。
[Effects of the Invention] In the keyword search method according to claim 1 or 2, even if some manuscript characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are proposed for each read character. If the earliest read character is found for which at least one candidate matches the first character of the registered keyword, that match is highly likely to be the correct character. Each subsequent manuscript character is then read in order, up to the same predetermined number of candidates (for example three) is proposed for each, and a candidate that matches the corresponding subsequent character of the registered keyword is, with high probability, that character of the keyword. As a result, keyword search with good search efficiency is achieved even when some characters in the document are misread. In particular, it has been demonstrated that limiting the candidates to three allows characters to be read correctly with a probability of approximately 99% at moderate processing time, balancing processing time against search accuracy.

【図面の簡単な説明】[Brief explanation of the drawing]

【図1】本発明に係る方法を適用した検索装置の動作を
示すフローチャート
FIG. 1 is a flowchart showing the operation of a search device to which the method according to the present invention is applied.

【図2】この検索装置の構成を示すブロック図[Figure 2] Block diagram showing the configuration of this search device

【図3】
この検索装置に係る登録キーワード,原稿文字,各候補
文字の対応図
[Figure 3]
Correspondence diagram of registered keywords, manuscript characters, and candidate characters related to this search device

【図4】図3の各文字を符号化したときの対応図[Figure 4] Correspondence diagram when each character in Figure 3 is encoded

【符号の説明】[Explanation of symbols]

1    イメージセンサ 2    読取部 3    メモリ 4    入力部 5    メモリ 6    照合部 7    CRT
1 Image sensor
2 Reading unit
3 Memory
4 Input unit
5 Memory
6 Collation unit
7 CRT

Claims (2)

【特許請求の範囲】[Claims] 【請求項1】登録キーワードと同じ単一文字または文字
列をキーワードとして文書中から検索する方法において
、この文書の各文字を原稿文字として順に文字読取装置
によって読み取り、前記各原稿文字について標準文字に
対する類似度に基づき所定個数までの候補文字を選出し
;前記各原稿文字でもっとも先行するものの各候補文字
のうち少なくとも一つが前記登録キーワードの先頭文字
と一致するときの、前記原稿文字を前記検索すべきキー
ワードの先頭文字とし;この先頭文字に対応する原稿文
字に後続の各文字について前記と同じ所定個数までの候
補文字を選出し;この後続順の各原稿文字に対応する前
記所定個数までの各候補文字のうち少なくとも一つが前
記登録キーワードの対応する後続順位の文字と一致する
とき、前記後続順の各原稿文字を前記検索すべきキーワ
ードの対応する後続順位の各文字とする;ことを特徴と
するキーワード検索方法。
1. A keyword search method for searching a document for a single character or character string identical to a registered keyword, characterized in that: each character of the document is read in order as a manuscript character by a character reading device, and up to a predetermined number of candidate characters are selected for each manuscript character based on similarity to standard characters; the earliest manuscript character for which at least one candidate character matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following the manuscript character corresponding to this first character; and when at least one of the candidate characters for each following manuscript character matches the character at the corresponding position in the registered keyword, each following manuscript character is taken as the character at the corresponding position of the keyword to be retrieved.
【請求項2】請求項1に記載の方法において、所定個数
は、3であることを特徴とするキーワード検索方法。
2. The method according to claim 1, wherein the predetermined number is three.
JP2401761A 1990-12-13 1990-12-13 Key word retrieving method Pending JPH04215183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2401761A JPH04215183A (en) 1990-12-13 1990-12-13 Key word retrieving method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2401761A JPH04215183A (en) 1990-12-13 1990-12-13 Key word retrieving method

Publications (1)

Publication Number Publication Date
JPH04215183A true JPH04215183A (en) 1992-08-05

Family

ID=18511590

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2401761A Pending JPH04215183A (en) 1990-12-13 1990-12-13 Key word retrieving method

Country Status (1)

Country Link
JP (1) JPH04215183A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011034230A (en) * 2009-07-30 2011-02-17 Rakuten Inc Image search engine

Similar Documents

Publication Publication Date Title
EP1843276A1 (en) Method for automated processing of hard copy text documents
CN109902303B (en) Entity identification method and related equipment
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
JP3589007B2 (en) Document filing system and document filing method
JPH04215183A (en) Key word retrieving method
WO2022019275A1 (en) Document search device, document search system, document search program, and document search method
JP2815707B2 (en) Keyword search method
JPH08272811A (en) Document management method and device therefor
JPH04225471A (en) Keyword retrieving method
JP2586372B2 (en) Information retrieval apparatus and information retrieval method
WO2020244150A1 (en) Speech retrieval method and apparatus, computer device, and storage medium
JPH08272813A (en) Filing device
JP2827066B2 (en) Post-processing method for character recognition of documents with mixed digit strings
CN110349568B (en) Voice retrieval method, device, computer equipment and storage medium
CN117573839B (en) Document retrieval method, man-machine interaction method, electronic device and storage medium
JP3241854B2 (en) Automatic word spelling correction device
JPH07296005A (en) Japanese text registration/retrieval device
JP3924899B2 (en) Text search apparatus and text search method
JPH08180064A (en) Document retrieval method and document filing device
JP2839515B2 (en) Character reading system
JPH0256086A (en) Method for postprocessing for character recognition
JP2570784B2 (en) Document reader post-processing device
JPH03257693A (en) Character recognized result correcting system
CN115455948A (en) Spelling error correction model training method, spelling error correction method and storage medium
JPS5930176A (en) Character discrimination processing system