JPH04215183A - Key word retrieving method - Google Patents
- Publication number
- JPH04215183A
- Authority
- JP
- Japan
- Prior art keywords
- character
- characters
- keyword
- manuscript
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
Abstract
Description
[0001]
[Industrial Application Field] This invention relates to a method of searching a document for a single character or character string identical to a registered keyword, and in particular to a keyword search method that retains good search efficiency even when some characters in the document are misread.
[0002]
[Prior Art] In general, keywords are an effective way to grasp the content of a document quickly and accurately. For example, documents on global environmental protection use keywords such as "radioactivity", "ozone layer", and "global pollution". When a document is converted into a database, a character reading device reads the characters of the document (the manuscript characters) in order, selects for each one the character with the highest similarity to a standard character as the read character, and produces a text expressed in character codes. This text is then searched for a single character or character string identical to a registered keyword.
[0003]
[Problem to Be Solved by the Invention] In the conventional method, if the result produced by the character reading device contains an error, that is, if even one character is misread, the keyword is missed by the search even though it is present in the document. In other words, when search efficiency is defined as the ratio of the number of keywords retrieved to the total number of keywords in the document, the search efficiency drops markedly.
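The failure mode described above can be illustrated with a minimal Python sketch. It assumes the conventional pipeline keeps only the single best-guess character per position; the function names and the sample text are hypothetical and do not come from the patent.

```python
def exact_match_positions(text: str, keyword: str) -> list[int]:
    """Return every index at which the keyword occurs verbatim in the text."""
    n = len(keyword)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == keyword]


def search_efficiency(retrieved: int, total: int) -> float:
    """Ratio of keywords retrieved to keywords actually present in the document."""
    return retrieved / total if total else 0.0


# Hypothetical OCR output: the manuscript contains 共同開発 twice, but in the
# second occurrence the reader picked 伺 instead of 同 as its single best guess.
ocr_text = "共同開発と共伺開発"
hits = exact_match_positions(ocr_text, "共同開発")
efficiency = search_efficiency(len(hits), 2)
```

Here only the first occurrence is found, so the efficiency under this definition is 0.5: a single misread character hides an entire keyword.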
[0004] The object of this invention is to eliminate the above problem of the prior art and to provide a keyword search method that achieves good search efficiency even when some characters in the document are misread.
[0005]
[Means for Solving the Problem] To solve this problem, the keyword search method according to claim 1 is a method of searching a document for a single character or character string identical to a registered keyword, in which: each character of the document is read in order as a manuscript character by a character reading device, and up to a predetermined number of candidate characters are selected for each manuscript character based on its similarity to standard characters; the earliest manuscript character for which at least one candidate character matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following the manuscript character that corresponds to this first character; and when at least one of the candidate characters of each following manuscript character matches the character at the corresponding position in the registered keyword, each following manuscript character is taken as the character at the corresponding position of the keyword to be retrieved. The keyword search method according to claim 2 is the method of claim 1 in which the predetermined number is three.
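One way to read the method of claim 1 is as a scan over per-character candidate lists. The sketch below is an illustration under that reading, not the patent's implementation: `find_keyword` and `document` are hypothetical names, the first list is invented filler, and the remaining four lists reuse the candidates given in the embodiment.

```python
def find_keyword(candidates: list[list[str]], keyword: str) -> list[int]:
    """Return start positions at which every keyword character appears among
    the candidates of the manuscript character at the same offset."""
    n = len(keyword)
    positions = []
    for start in range(len(candidates) - n + 1):
        if all(keyword[i] in candidates[start + i] for i in range(n)):
            positions.append(start)
    return positions


# Candidate lists for five manuscript characters, best guess first.
document = [
    ["の", "め"],          # hypothetical filler character
    ["共", "井", "丼"],
    ["伺", "同", "向"],    # the best guess 伺 is a misread; 同 is ranked second
    ["開", "閉", "関"],
    ["発", "溌", "廃"],
]
positions = find_keyword(document, "共同開発")
```

The keyword is found at position 1 even though the top-ranked reading of its second character is wrong, which is the behavior the claim is after.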
[0006]
[Operation] In both claim 1 and claim 2, even if some manuscript characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are produced for each read character. If the earliest read character is found for which at least one candidate matches the first character of the registered keyword, the matching candidate is very likely the correct character. Each following manuscript character is then read in order, up to the same predetermined number of candidates (for example three) is selected for each, and a candidate matching the corresponding following character of the registered keyword is looked for. Each candidate that matches is, with high probability, the corresponding following character of the keyword.
[0007]
[Embodiment] A search device to which the keyword search method of this invention is applied is described below with reference to the drawings. FIG. 3 is a correspondence diagram of the registered keyword, the manuscript characters, and the candidate characters in the search device. In FIG. 3, row 1 is the registered keyword, row 2 the manuscript characters to be searched, row 3 the first candidate character read for each manuscript character, row 4 the second candidate character, and row 5 the third candidate character. The registered keyword is 共同開発 ("joint development"). For the manuscript character 共, the first candidate is 共, the second is 井, and the third is 丼. Likewise, 伺, 同, 向 are selected as candidates for the manuscript character 同; 開, 閉, 関 for the manuscript character 開; and 発, 溌, 廃 for the manuscript character 発. The first, second, and third candidates are chosen, up to three characters, in descending order of each manuscript character's similarity to the standard characters, or by successively relaxing a threshold on the similarity. When the similarity becomes extremely low, the character cannot be called a candidate, so it is not necessary to fill all three slots.
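The candidate selection rule just described (best matches first, at most three, none below a similarity floor) might be sketched as follows. The similarity scores and the `min_similarity` value are invented for illustration; the patent gives no numeric scores or threshold.

```python
def select_candidates(similarities: dict[str, float],
                      max_candidates: int = 3,
                      min_similarity: float = 0.5) -> list[str]:
    """Pick up to max_candidates characters in descending order of similarity,
    dropping any whose similarity falls below min_similarity."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [ch for ch, score in ranked[:max_candidates] if score >= min_similarity]


# Hypothetical similarity scores of one manuscript character against standard characters.
scores_kyo = {"共": 0.92, "井": 0.81, "丼": 0.74, "其": 0.30}
scores_hatsu = {"発": 0.95, "溌": 0.62, "廃": 0.20}

candidates_kyo = select_candidates(scores_kyo)      # three candidates survive
candidates_hatsu = select_candidates(scores_hatsu)  # only two clear the floor
```

The second call returns only two characters, matching the remark that the three slots need not all be filled when similarity is too low.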
[0008] FIG. 4 is the correspondence diagram obtained when each character of FIG. 3 is replaced by a symbol: the constituent characters of the registered keyword K are written Ki, the constituent characters of the manuscript characters W are written Wi, and the first, second, and third candidate characters are written Ai, Bi, and Ci respectively. Here i = 1, 2, 3, 4 is the character position index common to the registered keyword and the manuscript characters.
[0009] FIG. 2 is a block diagram showing the configuration of the search device. In FIG. 2, 1 is an image scanner that obtains an image of the manuscript characters of the document, and 2 is a reading unit that selects up to three candidate characters based on that image. The first, second, and third candidate characters are as already described. 3 is a memory for the candidate characters of the read characters, in which the first, second, and third candidates are stored as character codes. 4 is an input unit for the registered keyword, and 5 is a memory for the registered keyword. 6 is a collation unit that compares the corresponding character codes from memories 3 and 5 and determines whether they match. 7 is a CRT that displays the collation result on its screen. A printer that prints out the collation result may also be provided alongside the CRT 7.
[0010] FIG. 1 is a flowchart showing the operation of the search device. In FIG. 1, step S1 initializes to i = 1 the character position index i common to the constituent characters Ki of the four-character registered keyword and the manuscript characters Wi. Step S2 selects the three candidate characters Ai, Bi, and Ci (first, second, and third) obtained by reading the manuscript character Wi. Step S3 determines whether the first candidate Ai matches the keyword character Ki; if YES the process moves to step S6, if NO to step S4. In step S6, Wi and Ki are judged to match, after which the above process is repeated, via steps S7 and S8, for all four characters. Returning to step S4, it is determined whether the second candidate Bi matches Ki; if YES the process moves to step S6, if NO to step S5. Step S5 determines whether the third candidate Ci matches Ki; if YES the process moves to step S6, if NO to step S9. In step S9, Wi is treated as unreadable, and consequently in the next step, S10, the keyword is treated as unretrievable.
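The flow of FIG. 1 can be traced in a short Python sketch. The text does not spell out steps S7 and S8, so the sketch assumes they increment i and test whether all characters have been processed; `match_keyword` and its return convention (the rank of the matching candidate per character, or `None` on failure) are hypothetical. It also assumes one candidate list is supplied per keyword character.

```python
from typing import Optional, Sequence


def match_keyword(keyword: str,
                  candidates: Sequence[Sequence[str]]) -> Optional[list[int]]:
    """Return the candidate rank (1, 2 or 3) that matched each K_i,
    or None if the flow reaches S9/S10 (unreadable, hence unretrievable)."""
    ranks = []
    i = 0                                   # S1: i = 1 in the patent's 1-based numbering
    while i < len(keyword):                 # S8 (assumed): all characters done?
        # S2: take up to three candidates A_i, B_i, C_i, padding missing slots
        a_i, b_i, c_i = (list(candidates[i]) + [None, None, None])[:3]
        if a_i == keyword[i]:               # S3
            ranks.append(1)                 # S6: W_i matches K_i
        elif b_i == keyword[i]:             # S4
            ranks.append(2)                 # S6
        elif c_i == keyword[i]:             # S5
            ranks.append(3)                 # S6
        else:
            return None                     # S9 and S10
        i += 1                              # S7 (assumed): next character
    return ranks


# Candidate characters from the FIG. 3 example, best guess first.
fig3 = [["共", "井", "丼"], ["伺", "同", "向"], ["開", "閉", "関"], ["発", "溌", "廃"]]
ranks = match_keyword("共同開発", fig3)
```

For the FIG. 3 data the second character is recovered from the second candidate, so the keyword is retrieved despite the misread first guess.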
[0011] To add some detail: even if the leading manuscript character W1 of what should be a keyword is misread, up to three candidate characters A1, B1, and C1 are produced for it, and if one of them is found to match the first character K1 of the registered keyword, the matching candidate can be considered very likely to be the correct character. Each following manuscript character Wi is then read in order, up to three candidates Ai, Bi, and Ci are selected for each, and a candidate matching the corresponding following character Ki of the registered keyword is looked for; each candidate that matches can be said, with high probability, to be the corresponding following character of the keyword. The more candidate characters there are, the higher the reading accuracy, but the longer the processing time, so a balance must be struck. The limit of three candidates is empirical: it is based on the demonstrated result that, with moderate processing time, characters can be read correctly with a probability of about 99%.
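A rough calculation shows why the per-character figure matters for whole keywords. The sketch assumes that the 99% figure is the chance that the correct character is among the three candidates, and that characters fail independently; neither assumption is stated in the patent, and the 90% single-candidate rate is purely hypothetical.

```python
def keyword_hit_probability(per_char_hit_rate: float, keyword_length: int) -> float:
    """Probability that every character of a keyword is recoverable,
    assuming independent per-character outcomes."""
    return per_char_hit_rate ** keyword_length


# Four-character keyword, as in the embodiment.
with_three_candidates = keyword_hit_probability(0.99, 4)   # about 0.961
with_one_candidate = keyword_hit_probability(0.90, 4)      # about 0.656 (hypothetical rate)
```

Under these assumptions, roughly 96% of four-character keywords would be retrievable with three candidates per character, against roughly 66% if a single best guess were right 90% of the time.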
[0012]
[Effects of the Invention] In both claim 1 and claim 2, even if some manuscript characters of the document are misread, up to a predetermined number of candidate characters (for example three, as in claim 2) are produced for each read character. If the earliest read character is found for which at least one candidate matches the first character of the registered keyword, the matching candidate is very likely the correct character. Each following manuscript character is then read in order, up to the same predetermined number of candidates (for example three) is produced for each, and a candidate matching the corresponding following character of the registered keyword is looked for; each candidate that matches is, with high probability, the corresponding following character of the keyword. Consequently, even if some characters in the document are misread, keyword search with good search efficiency is achieved. In particular, it has been demonstrated that limiting the candidates to three allows characters to be read correctly with a probability of about 99% at moderate processing time, balancing processing time against search accuracy.
[Fig. 1] Flowchart showing the operation of a search device to which the method of this invention is applied
[Fig. 2] Block diagram showing the configuration of this search device
[Fig. 3] Correspondence diagram of the registered keyword, manuscript characters, and candidate characters in this search device
[Fig. 4] Correspondence diagram obtained when each character of Fig. 3 is replaced by a symbol
1 Image sensor 2 Reading unit 3 Memory 4 Input unit 5 Memory 6 Collation unit 7 CRT
Claims (2)
1. A keyword search method for searching a document for a single character or character string identical to a registered keyword, characterized in that: each character of the document is read in order as a manuscript character by a character reading device, and up to a predetermined number of candidate characters are selected for each manuscript character based on its similarity to standard characters; the earliest manuscript character for which at least one candidate character matches the first character of the registered keyword is taken as the first character of the keyword to be retrieved; up to the same predetermined number of candidate characters are selected for each character following the manuscript character that corresponds to this first character; and when at least one of the candidate characters of each following manuscript character matches the character at the corresponding position in the registered keyword, each following manuscript character is taken as the character at the corresponding position of the keyword to be retrieved.
2. The keyword search method according to claim 1, characterized in that the predetermined number is three.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2401761A JPH04215183A (en) | 1990-12-13 | 1990-12-13 | Key word retrieving method |
Publications (1)
Publication Number | Publication Date |
---|---|
JPH04215183A (en) | 1992-08-05 |
Family
ID=18511590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2401761A Pending JPH04215183A (en) | 1990-12-13 | 1990-12-13 | Key word retrieving method |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPH04215183A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011034230A (en) * | 2009-07-30 | 2011-02-17 | Rakuten Inc | Image search engine |