JPH04225471A - Keyword retrieving method - Google Patents

Keyword retrieving method

Info

Publication number
JPH04225471A
Authority
JP
Japan
Prior art keywords
characters
character
keyword
rank
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2407098A
Other languages
Japanese (ja)
Inventor
Hidetoshi Ito
伊東 英俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1990-12-27
Filing date
1990-12-27
Publication date
1992-08-14
Application filed by Fuji Electric Co Ltd and Fuji Facom Corp
Priority to JP2407098A
Publication of JPH04225471A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: To improve retrieval efficiency, which otherwise drops whenever part of a keyword's constituent characters is misread, by reasonably rescuing such misread strings instead of simply excluding them. CONSTITUTION: From the read character strings, those that agree with the registered keyword at the same rank (character position) in at least a first prescribed number of positions, set according to the number of characters in the registered keyword, are selected, and a character set is derived for every rank from the selected strings. When, in the character set of every rank, the number of characters that coincide with the registered keyword's character of the corresponding rank is at least a second prescribed number, set according to the number of characters belonging to each character set, all of the previously selected read character strings are regarded as keywords to be retrieved.

Description

[Detailed description of the invention]

[0001]

[Industrial Application Field] This invention relates to a method of searching a document for character strings identical to a registered keyword, and in particular to a keyword search method that improves search efficiency by reasonably rescuing strings in which some keyword characters were misread, rather than simply excluding them.

[0002]

[Prior Art] In general, keywords are an effective means of grasping the content of a document quickly and accurately. For example, documents on global environmental protection use keywords such as "radioactivity", "ozone layer", and "global pollution". When documents are compiled into a database, a character reading device reads the characters of the document (the original characters) in order, selects for each one the character with the highest similarity to a standard character as the read character, and thereby produces a text expressed in character codes. This text is then searched for a single character or character string identical to a registered keyword.

[0003]

[Problems to be Solved by the Invention] In the conventional method, if the result read by the character reading device contains an error, that is, if even one character is misread, the string is excluded from the search results even though the keyword is present in the document. If search efficiency is defined as the ratio of the number of keywords retrieved to the total number of keywords in the document, search efficiency drops markedly.
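The definition of search efficiency above can be written as a one-line calculation. The following is a minimal sketch; the function name and the sample counts are hypothetical and do not come from the patent.

```python
def search_efficiency(retrieved: int, total: int) -> float:
    """Ratio of keyword occurrences retrieved to keyword occurrences present."""
    if total <= 0:
        raise ValueError("total must be positive")
    return retrieved / total


# Hypothetical counts: 10 occurrences in the document,
# 3 of them lost because a single character was misread.
print(search_efficiency(7, 10))  # 0.7
```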

[0004] The object of the present invention is to solve the above problem of the prior art and to provide a keyword search method that improves search efficiency by reasonably rescuing read strings in which some keyword characters were misread, rather than simply excluding them.

[0005]

[Means for Solving the Problem] To solve this problem, the keyword search method according to claim 1 is a method of searching a document for character strings identical to a registered keyword, in which: each character of the document is read by a character reading device, and read character strings are selected that have the same number of characters as the registered keyword, match it at a first predetermined number or more of same-rank positions (a rank being a character's position within the string), and do not match all of its characters; for the selected read character strings, the set of characters at each rank is obtained; and when, in the character set of each rank, the number of characters that match the registered keyword's character of the corresponding rank is a second predetermined number or more, all of the selected read character strings are regarded as keywords to be retrieved.

[0006] In the keyword search method according to claim 2, in the method of claim 1, the first predetermined number is determined according to the number of characters in the registered keyword.

[0007] In the keyword search method according to claim 3, in the method of claim 1 or 2, the second predetermined number is determined according to the number of common characters belonging to the character set for each rank.

[0008]

[Operation] In the keyword search method according to claim 1: (1) each character of the document is read by a character reading device, and read character strings are selected that have the same number of characters as the registered keyword, match it at a first predetermined number or more of same-rank positions, and do not match all of its characters; in other words, read character strings that partially match the registered keyword, and that would conventionally be excluded, pass a first selection as candidates for rescue; (2) for the selected read character strings, the set of characters at each rank is obtained, and when, in the character set of each rank, the number of characters matching the registered keyword's character of the corresponding rank is a second predetermined number or more, all of the read character strings are regarded, as a second selection, as keywords to be retrieved. The first predetermined number used in the first selection is determined according to the number of characters in the registered keyword, as in claim 2, and the second predetermined number used in the second selection is determined according to the number of common characters belonging to the character set for each rank, as in claim 3.

[0009]

[Embodiment] A search device to which the keyword search method according to the present invention is applied is described below with reference to the drawings. FIG. 3 shows an example of a registered keyword and the read character strings that pass the first selection. In FIG. 3, the registered keyword K is the four-character string "富士電機" (Fuji Electric). Suppose that five read character strings W1 to W5 are selected by the first selection: 富土謳機, 富土電揆, 笛士雷機, 宮士壇機, and 宙土電機. Each read character string shares two characters, underlined in the figure, with the registered keyword at the same ranks. Here, the first predetermined number of the invention is 2.
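The first selection in this example can be checked with a short script. This is a minimal sketch rather than the patented implementation: the function names are hypothetical, and only the keyword, the five read strings, and the threshold of 2 are taken from the text.

```python
def matching_positions(read: str, keyword: str) -> int:
    """Count the ranks (positions) at which both strings hold the same character."""
    return sum(r == k for r, k in zip(read, keyword))


def passes_first_selection(read: str, keyword: str, first_threshold: int) -> bool:
    """Same length, at least first_threshold same-rank matches, but not an exact match."""
    if len(read) != len(keyword) or read == keyword:
        return False
    return matching_positions(read, keyword) >= first_threshold


KEYWORD = "富士電機"
READ_STRINGS = ["富土謳機", "富土電揆", "笛士雷機", "宮士壇機", "宙土電機"]  # W1..W5

for w in READ_STRINGS:
    # Each line prints the string, its match count (2), and True.
    print(w, matching_positions(w, KEYWORD), passes_first_selection(w, KEYWORD, 2))
```

An exact match is rejected here on purpose, because the text treats fully matching strings as ordinary search hits rather than rescue candidates.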

[0010] FIG. 4 shows the process of the second selection of read character strings. In FIG. 4, the first column gives the character rank, the second column the registered keyword's character, the third column the set of characters at that rank in the read character strings that passed the first selection, and the fourth column the number of characters in that set that match the registered keyword's character. For example, the registered keyword's character at rank 1 is "富", and the set of rank-1 characters of the five read character strings W1 to W5 is {富, 富, 笛, 宮, 宙}, so the number of matching characters is 2. Here, the second predetermined number of the invention is 2. Character sets are obtained in the same way for ranks 2, 3, and 4, giving match counts of 2, 2, and 4. The five read character strings W1 to W5 that passed the first selection therefore also pass the second selection and are judged to be keywords. The first and second predetermined numbers are basically determined empirically; in general, the first predetermined number takes a larger value as the number of characters in the registered keyword increases, and the second predetermined number takes a proportionally larger value as the number of characters belonging to the character set of each rank increases.
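The per-rank counts of FIG. 4 can be reproduced as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the rule "every rank must reach the second predetermined number" is one reading of the description, not a confirmed detail of the original device.

```python
from typing import List


def rank_match_counts(selected: List[str], keyword: str) -> List[int]:
    """For each rank, count how many selected strings carry the keyword's character there."""
    return [sum(w[i] == k for w in selected) for i, k in enumerate(keyword)]


def passes_second_selection(selected: List[str], keyword: str, second_threshold: int) -> bool:
    """True when every rank has at least second_threshold matching characters."""
    if not selected:
        return False
    return all(c >= second_threshold for c in rank_match_counts(selected, keyword))


KEYWORD = "富士電機"
SELECTED = ["富土謳機", "富土電揆", "笛士雷機", "宮士壇機", "宙土電機"]  # W1..W5

print(rank_match_counts(SELECTED, KEYWORD))           # [2, 2, 2, 4]
print(passes_second_selection(SELECTED, KEYWORD, 2))  # True
```

With a second predetermined number of 3, the same five strings would fail, since ranks 1 to 3 each reach only 2 matches.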

[0011] FIG. 2 is a block diagram showing the configuration of the search device. In FIG. 2, 1 is an image scanner that captures an image of the document's original characters, and 2 is a reading unit that reads the characters from that image. 3 is a memory for read characters, in which the read characters are stored as character codes. 4 is an input unit for registered keywords, and 5 is a memory for registered keywords. 6 is a collation unit that collates the corresponding character codes from memories 3 and 5, judges whether they match, carries out the first and second selection processes described above, and finally decides whether a read character string is to be regarded as a keyword. 7 is a CRT that displays the collation result on its screen. A printer that prints out the collation result can also be provided alongside the CRT 7.

[0012] FIG. 1 is a flowchart showing the operation of the search device. In step S1 of FIG. 1, read character strings W having the same number of characters as the keyword KW are selected one after another (preliminary selection). In step S2, from the read character strings W selected in S1, the read character strings Wi whose characters all equal those of KW at the same ranks are selected. This is the conventional search method. In step S3, from the read character strings not selected in S2, the read character strings Wj that have the same character as KW at A (the first predetermined number) or more same-rank positions are selected (first selection). Step S3 and the steps after it thus constitute the rescue procedure for search keywords. In the example of FIG. 3, A = 2. In step S4, steps S1 to S3 are repeated over the entire document. In step S5, it is judged whether, in the set of same-rank characters of the read character strings Wj selected in S3, B (the second predetermined number) or more are the same as the same-rank character of KW (second selection); if YES, the process proceeds to step S6, and if NO, the process ends without rescue. In the example of FIG. 4, B = 2. In step S6, the read character strings Wj are regarded as KW and rescued. The search keywords are therefore the read character strings Wi and Wj.
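Steps S1 to S6 can be put together in one routine. This is a minimal sketch, not the device's actual logic: the function name is hypothetical, and sliding a fixed-length window over the text one character at a time is an assumption, since the description does not say how candidate strings W are delimited. The sample text simply joins the keyword and the five misread strings of FIG. 3 with "/" separators.

```python
from typing import List, Tuple


def search_with_rescue(text: str, keyword: str, a: int, b: int) -> Tuple[List[int], List[int]]:
    """Return (offsets of exact matches Wi, offsets of rescued strings Wj)."""
    n = len(keyword)
    exact: List[int] = []
    candidates: List[int] = []
    # S1 and S4: take every window of the keyword's length over the whole text.
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        hits = sum(c == k for c, k in zip(window, keyword))
        if hits == n:
            exact.append(start)        # S2: conventional exact match
        elif hits >= a:
            candidates.append(start)   # S3: first selection
    rescued: List[int] = []
    if candidates:
        # S5: per-rank count of candidates that carry the keyword's character.
        counts = [sum(text[s + i] == k for s in candidates) for i, k in enumerate(keyword)]
        if all(c >= b for c in counts):
            rescued = candidates       # S6: treat all candidates as the keyword
    return exact, rescued


KEYWORD = "富士電機"
TEXT = "富士電機/富土謳機/富土電揆/笛士雷機/宮士壇機/宙土電機"

exact, rescued = search_with_rescue(TEXT, KEYWORD, a=2, b=2)
print(exact, rescued)  # [0] [5, 10, 15, 20, 25]
```

The rescue is all-or-nothing, following the description: either every first-selection candidate is accepted or none is.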

[0013] As described above, read character strings that the conventional method would exclude are rescued through the first and second selection processes. Because this rescue is carried out under appropriate and reasonable constraints, it improves search efficiency, and the search accuracy of the rescued read character strings is high.

[0014]

[Effects of the Invention] In the keyword search method according to any of claims 1 to 3, read character strings that have the same number of characters as the registered keyword, match it at a first predetermined number or more of same-rank positions, and do not match all of its characters are selected in a first selection; for the selected read character strings, the set of characters at each rank is obtained, and when, in the character set of each rank, the number of characters matching the registered keyword's character of the corresponding rank is a second predetermined number or more, all of the read character strings are regarded, as a second selection, as keywords to be retrieved.

[0015] Consequently, even if some of the characters constituting a keyword are misread, the read string is reasonably rescued rather than simply excluded, which improves search efficiency. In particular, because the first predetermined number is determined according to the number of characters in the registered keyword, as in claim 2, and the second predetermined number is determined according to the number of common characters belonging to the character set for each rank, as in claim 3, the rescue is carried out under appropriate and reasonable constraints, and the search accuracy of the rescued read character strings is high.

[Brief explanation of the drawings]

[FIG. 1] Flowchart showing the operation of a search device to which the method according to the present invention is applied.

[FIG. 2] Block diagram showing the configuration of this search device.

[FIG. 3] Example of a registered keyword and first-selection read character strings for this search device.

[FIG. 4] Process diagram for the second selection of read character strings.

[Explanation of symbols]

1 Image sensor
2 Reading unit
3 Memory
4 Input unit
5 Memory
6 Collation unit
7 CRT

Claims (3)

[Claims]

[Claim 1] A keyword search method for searching a document for character strings identical to a registered keyword, characterized in that: each character of the document is read by a character reading device, and read character strings are selected that have the same number of characters as the registered keyword, match it at a first predetermined number or more of same-rank positions, and do not match all of the characters of the registered keyword; for the selected read character strings, the set of characters at each rank is obtained; and when, in the character set of each rank, the number of characters that match the registered keyword's character of the corresponding rank is a second predetermined number or more, all of the read character strings are regarded as keywords to be retrieved.

[Claim 2] The keyword search method according to claim 1, characterized in that the first predetermined number is determined according to the number of characters in the registered keyword.

[Claim 3] The keyword search method according to claim 1 or 2, characterized in that the second predetermined number is determined according to the number of common characters belonging to the character set for each rank.
JP2407098A 1990-12-27 1990-12-27 Keyword retrieving method Pending JPH04225471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2407098A JPH04225471A (en) 1990-12-27 1990-12-27 Keyword retrieving method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2407098A JPH04225471A (en) 1990-12-27 1990-12-27 Keyword retrieving method

Publications (1)

Publication Number Publication Date
JPH04225471A true JPH04225471A (en) 1992-08-14

Family

ID=18516712

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2407098A Pending JPH04225471A (en) 1990-12-27 1990-12-27 Keyword retrieving method

Country Status (1)

Country Link
JP (1) JPH04225471A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11232296A (en) * 1998-02-18 1999-08-27 Mitsubishi Electric Corp Document filing system and document filing method
JP2000057315A (en) * 1998-08-06 2000-02-25 Mitsubishi Electric Corp Document filing device and its method
US6070161A (en) * 1997-03-19 2000-05-30 Minolta Co., Ltd. Method of attaching keyword or object-to-key relevance ratio and automatic attaching device therefor
US7130487B1 (en) 1998-12-15 2006-10-31 Matsushita Electric Industrial Co., Ltd. Searching method, searching device, and recorded medium
JP2011034230A (en) * 2009-07-30 2011-02-17 Rakuten Inc Image search engine

Similar Documents

Publication Publication Date Title
US9208185B2 (en) Indexing and search query processing
JP3689455B2 (en) Information processing method and apparatus
AU746762B2 (en) Data summariser
EP1843276A1 (en) Method for automated processing of hard copy text documents
US20040049499A1 (en) Document retrieval system and question answering system
US7539343B2 (en) Classifying regions defined within a digital image
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
JPH07160806A (en) Paper recognition system for document
EP1288792B1 (en) A method for automatically indexing documents
EA003619B1 (en) System and method for searching electronic documents created with optical character recognition
AU2002331728A1 (en) A method for automatically indexing documents
JPH087033A (en) Method and device for processing information
WO2008130501A1 (en) Unstructured and semistructured document processing and searching and generation of value-based information
US20110229036A1 (en) Method and apparatus for text and error profiling of historical documents
JPH04225471A (en) Keyword retrieving method
JPH0773197A (en) Supporting system for preparing different notation word dictionary
JPH10240901A (en) Document filing device and method therefor
US6792415B2 (en) Method and system for document classification with multiple dimensions and multiple algorithms
JPS5856071A (en) Retrieval system by japanese
JP2815707B2 (en) Keyword search method
JP3477822B2 (en) Document registration search system
JP4677750B2 (en) Document attribute acquisition method and apparatus, and recording medium recording program
JP2500680B2 (en) Data name assignment registration device
JP3210842B2 (en) Information processing device
JP2005284776A (en) Text mining apparatus and text analysis method