JPS62278689A

JPS62278689A - Word retrieving system

Info

Publication number: JPS62278689A
Application number: JP61123127A
Authority: JP
Inventors: Yoshitake Tsuji; 辻　善丈
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1986-05-27
Filing date: 1986-05-27
Publication date: 1987-12-03
Anticipated expiration: 2009-06-15
Also published as: JPH0646420B2

Abstract

PURPOSE: To fetch a desired word even if some of the retrieval keys contain a wrong character or an unreadable character, by generating plural retrieval keys from partial character strings formed from adjacent characters in an input character string. CONSTITUTION: A retrieval key converting part 3 fetches each possible partial character string of the input character string stored in an input character string register 2, converts it successively to a key value, and stores it in a retrieval key register 4. An index control part 5 fetches the retrieval key values successively from the retrieval key register 4, uses an index table 6 to fetch from a dictionary pointer storage part 7 the group of dictionary pointers to the words containing each retrieval key value, and transfers it to a dictionary pointer counting part 8. The dictionary pointer counting part 8 transfers the plural distinct dictionary pointers held in a dictionary pointer register 9, together with their count values, to a candidate pointer selecting part 10. The candidate pointer selecting part 10 reads out each dictionary pointer and its count value successively from the dictionary pointer register 9, and when the count value is equal to or greater than a threshold T, decides that the pointer is the dictionary pointer of a candidate word and transfers it to the index control part 5.

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a word search method, and in particular to a word search method that searches a word dictionary for a plurality of words using partial character strings of an input character string when that input character string, such as one read by an optical character reader (hereinafter, OCR), contains misread characters or unreadable characters.

(Prior Art) In mail, forms, and other items read by character reading devices, katakana and alphabetic characters are mostly used in proper nouns and common nouns such as personal names, place names, and product names. Unlike digits, the characters within such words depend strongly on one another, and the words often carry ample redundancy. If recognition is therefore performed with the word as the unit, this dependency and redundancy can be used to correct misread characters and recover unreadable ones, which improves the recognition rate considerably. Such word-unit recognition is hereinafter called word recognition. In word recognition, a word dictionary storing personal names, place names, and the like is generally held in memory, on disk, or on a similar medium.

The problem in this case is that, as the number of words in the word dictionary grows, the time required to search it increases and the recognition speed drops sharply. To reduce this slowdown, a commonly used method takes a specific partial character string of the input character string output by the OCR as a search key, applies a simple conversion to it, and uses a hash table to retrieve the plurality of words that match the search key.

However, if the input character string contains erroneous characters or reject states (states in which a character is unreadable or in which multiple candidate characters are output), it can become difficult to retrieve, by means of the specific partial character string described above, a set of words that includes the word matching the input character string.

To address this problem, one known dictionary search method converts a specific partial character string into a search key on the basis of a confusion matrix that represents in advance the error tendencies of the OCR, as shown for example in "Word Recognition for OCR", Proceedings of the 20th National Convention of the Information Processing Society of Japan (1979), pp. 487-488. Another known search method uses two specific partial character strings, the first half and the second half of the input character string, as search keys, as shown for example on pp. 128-130 of "A Multifont Word Recognition System for Postal Address Reading", IEEE Transactions on Computers, Vol. C-27, No. 8, August 1978. With these methods, word search becomes difficult if the predetermined partial character string selected as the search key contains a character segmentation error, such as one character being split into two. To cope with this, there is a further method, described for example in Japanese Patent Application No. 56-103554, "Word Search Device", filed by the present applicant, that prepares in advance the character category strings in which segmentation errors are likely to occur.

(Problems to Be Solved by the Invention) As described above, even when a specific partial character string is set and a confusion matrix or the like is used, retrieval becomes difficult if the length of the partial character string changes, for instance because of a character segmentation error. Moreover, with the method that prepares in advance the combinations of character categories liable to arise from segmentation errors, it is difficult to prepare those combinations when handling items such as mail that carry printed characters in a variety of typefaces, because the character pattern differs even within the same character category and also depends on the character segmentation method of the OCR. The object of the present invention is therefore to solve these problems by providing a word search method in which the possible partial character strings formed from adjacent characters are converted into search keys together with their relative position information (for example, whether a partial character string contains the first character of the input character string, or contains neither its first nor its last character), and in which, among the plurality of words retrievable with these search keys, the words corresponding to the input character string are selected on the basis of the number of search keys that retrieved the same word, so that the desired word, or a plurality of words including the desired word, can be selected even if some of the search keys contain erroneous or rejected characters.

(Means for Solving the Problems) To solve the problems described above and achieve this object, the word search method provided by the present invention searches for a predetermined word stored in a word dictionary using partial character strings of an input character string, and is characterized by comprising: means for converting a partial character string consisting of adjacent characters of the input character string into a search key that includes relative position information; pointer storage means for storing pointers, each indicating the storage position of a word containing the partial character string corresponding to a search key, grouped into a block for each distinct search key; an index table that stores the storage position of each such block within the pointer storage means at a predetermined address determined by the search key; means for generating the plurality of possible search keys from the input character string and using the index table to fetch from the pointer storage means the plurality of blocks corresponding to those search keys; and means for selecting the word corresponding to the input character string on the basis of the number of pointers indicating the same word contained in the fetched blocks and the number of characters in the input character string.

(Operation) In the present invention, a plurality of search keys are generated from the possible partial character strings formed from adjacent characters in the input character string, so the desired word can be retrieved even if some of the search keys contain erroneous or unreadable characters. Using a plurality of search keys makes it more certain that the desired word is retrieved, but it also brings in more words than necessary. For this reason, when the possible partial character strings generated from the input character string are converted into search keys, the present invention includes the relative position information of each partial character string in the search key, and, among the plurality of words retrievable with the generated search keys, selects only the group of words similar to the input character string on the basis of the number of search keys that retrieved the same word.

(Embodiment) An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows an example of a method for calculating a series of search keys from an input character string. FIG. 1(a) shows an example in which each of the 26 letters of the alphabet is associated with a unique code value. In FIG. 1(a), the letter A corresponds to code value 0, the letter B to code value 1, and so on, each letter uniquely. When an arbitrary partial character string of the input character string is converted into a search key, a simple calculation on these code values therefore yields a value that corresponds uniquely to the partial character string (hereinafter, the search key value). In the following, words and input character strings are described as consisting of the 26 letters of the alphabet, but the same naturally applies to katakana, special symbols, and the like, or to any mixture of them.

FIG. 1(b) shows a method of generating the partial character strings of an input character string $a_1 a_2 \cdots a_n$ (where each $a_i$, $i = 1, \ldots, n$, is one of the letters A, B, C, ..., Z) from pairs of adjacent characters and calculating the search key values. FIG. 1(b) shows the case of two adjacent characters, but three characters, for example, could equally be used. In FIG. 1(b), the symbol $P_i$ is the value of the position information for the adjacent pair $a_i a_{i+1}$ whose search key value is to be calculated: $P_i = 0$ if $a_i$ is the first character of the input character string, $P_i = 2$ if $a_{i+1}$ is the last character of the input character string, and $P_i = 1$ if the pair contains neither the first nor the last character. The adjacent pair $a_i a_{i+1}$ is then converted uniquely to a search key value by the conversion formula

$$P_i \cdot 26^2 + a_i \cdot 26 + a_{i+1},$$

where $a_i$ and $a_{i+1}$ stand for the code values of the two characters. As a result, even adjacent pairs made up of the same characters produce different search keys depending on, for example, whether $a_i$ is the first character or not.
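The conversion of FIG. 1(b) can be sketched in Python as follows. This is a minimal sketch rather than the patent's own implementation: the names `code_value`, `position_info`, and `search_key` are hypothetical, 0-based indices are assumed in place of the 1-based indices of the text, and the handling of a two-character string (where the single pair contains both the first and the last character) is an assumption, since the text does not cover that case.

```python
def code_value(ch: str) -> int:
    """Map 'A'..'Z' to the code values 0..25 of FIG. 1(a)."""
    if len(ch) != 1 or not "A" <= ch <= "Z":
        raise ValueError(f"not an uppercase letter: {ch!r}")
    return ord(ch) - ord("A")


def position_info(i: int, n: int) -> int:
    """Position value P for the pair starting at 0-based index i in a string of n characters."""
    if i == 0:
        return 0  # the pair contains the first character (assumed to win when n == 2)
    if i + 1 == n - 1:
        return 2  # the pair contains the last character
    return 1      # the pair contains neither the first nor the last character


def search_key(p: int, first: str, second: str) -> int:
    """Search key value P * 26^2 + a_i * 26 + a_(i+1)."""
    return p * 26 ** 2 + code_value(first) * 26 + code_value(second)
```

For example, `search_key(1, "B", "C")` evaluates to 704, whereas the same pair at the start of a string, `search_key(0, "B", "C")`, evaluates to 28, so the two occurrences are indexed separately.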

FIG. 2 shows an example of a method of retrieving the desired word using the search keys generated from an input character string.

In the figure, the word ABCDEF corresponding to the input character string ??BCDEE (where ? denotes an unreadable state) is retrieved as follows. This input arises, for example, when OCR reading turns the character A into ?? through a character segmentation error and also misrecognizes the character F as E.

First, the conversion formula $P_i \cdot 26^2 + a_i \cdot 26 + a_{i+1}$ described with FIG. 1 (in FIG. 2, $a_i$ = B, C, D, E and $a_{i+1}$ = C, D, E, E) is used to convert the adjacent pairs BC, CD, DE, and EE into search keys. For the adjacent pair BC, for example, $1 \cdot 26^2 + 1 \cdot 26 + 2 = 704$ (since $P_i = 1$, $a_i$ is the character B with value 1, and $a_{i+1}$ is the character C with value 2), so the search key value 704 is obtained. In the same way, the adjacent pairs CD, DE, and EE yield the search key values 731, 758, and 1460, respectively. If the OCR returns multiple recognition results for a single character image, search key values are assumed to be generated by the above method for every combination of the two characters.
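The key generation for a whole input character string, including rejected characters and multiple recognition candidates, might look like the following sketch. The representation is an assumption made for illustration: each character position is given as a string of candidate letters, with `?` marking an unreadable position, and the function name `search_keys` is hypothetical. Pairs that touch an unreadable position produce no key.

```python
from itertools import product


def search_keys(positions):
    """Return the distinct search key values for a list of per-position candidate strings.

    Each element of `positions` holds the candidate letters for one character
    position; '?' marks an unreadable (rejected) position.
    """
    n = len(positions)
    keys = []
    for i in range(n - 1):
        left, right = positions[i], positions[i + 1]
        if "?" in left or "?" in right:
            continue  # no key can be formed from an unreadable character
        # Relative position information: 0 = first pair, 2 = last pair, 1 = interior.
        p = 0 if i == 0 else 2 if i + 1 == n - 1 else 1
        # Every combination of candidates for the two positions yields a key.
        for a, b in product(left, right):
            key = p * 26 ** 2 + (ord(a) - ord("A")) * 26 + (ord(b) - ord("A"))
            if key not in keys:
                keys.append(key)
    return keys
```

Called as `search_keys(list("??BCDEE"))`, this returns `[704, 731, 758, 1460]`, matching the values worked out in the text. If the last position had the two candidates E and F, `search_keys(["?", "?", "B", "C", "D", "E", "EF"])` would additionally return the key 1461 for the pair EF.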

Next, the index table is accessed using the search key values 704, 731, 758, and 1460 for the adjacent pairs BC, CD, DE, and EE.

At the address corresponding to each search key value, the index table stores the starting position, within the dictionary pointer storage unit, of the block for the group of words that contain the adjacent pair uniquely determined by that search key.

For example, for the search key value 704 of the adjacent pair BC in the input character string, address 704 of the index table holds the pointer P(1, BC). Here 1 is the value of the relative position information $P_i$ described with FIG. 1, and P(1, BC) indicates the starting position of the block BK(1, BC) for the group of words that have the adjacent pair BC with relative position information $P_i = 1$. The block BK(1, BC) in the dictionary pointer storage unit can therefore be located directly through the index table. In the word dictionary shown in FIG. 2, the block BK(1, BC) obtained for the search key value 704 holds the storage positions (hereinafter, dictionary pointers) of words such as ABCDEF and FBCAAA, so fetching these dictionary pointers yields the words ABCDEF, FBCAAA, and so on that contain the adjacent pair BC of the input character string ??BCDEE. In the same way, the dictionary pointers held in the corresponding blocks are obtained for the adjacent pairs CD, DE, and EE of the input character string.
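The index table and the dictionary pointer storage can be modelled as two flat arrays, as in the sketch below. Several details are assumptions: a dictionary pointer is taken to be the position of a word in a Python list, each index table entry stores the block length alongside the block start (the text mentions only the starting position), and the names `build_index` and `fetch_block` are hypothetical.

```python
def build_index(words):
    """Build (index_table, pointer_store) for a list of uppercase words.

    index_table[key] is None or (start, length) of the block for that key;
    pointer_store is the concatenation of all blocks of dictionary pointers.
    """
    blocks = {}
    for ptr, word in enumerate(words):
        n = len(word)
        for i in range(n - 1):
            # Same relative position information as for the input string.
            p = 0 if i == 0 else 2 if i + 1 == n - 1 else 1
            key = p * 676 + (ord(word[i]) - 65) * 26 + (ord(word[i + 1]) - 65)
            block = blocks.setdefault(key, [])
            if ptr not in block:
                block.append(ptr)
    index_table = [None] * (3 * 676)  # one address per possible key value
    pointer_store = []
    for key in sorted(blocks):
        index_table[key] = (len(pointer_store), len(blocks[key]))
        pointer_store.extend(blocks[key])
    return index_table, pointer_store


def fetch_block(index_table, pointer_store, key):
    """Return the block of dictionary pointers stored for a search key value."""
    entry = index_table[key]
    if entry is None:
        return []
    start, length = entry
    return pointer_store[start:start + length]
```

With the hypothetical four-word dictionary `["FBCAAA", "ABCDEF", "CDCDEE", "EFFCCD"]`, `fetch_block` for key 704 returns `[0, 1]` (FBCAAA and ABCDEF), and for key 731 it returns `[1, 2]`. The word EFFCCD is absent from the block for 731 because its pair CD is the final pair and is therefore stored under key 1407.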

Note that in FIG. 2 the word EFFCCD contains the adjacent pair CD, as does the input character string ??BCDEE, but in the present invention the search key value 731 is calculated with the relative position information $P_i$ of the input character string taken into account. The adjacent pair CD of the input character string therefore does not retrieve the word EFFCCD, in which CD is the final pair, and the retrieval of unnecessary words is limited.

Next, the number of occurrences of each of the dictionary pointers P0, P1, P2, P3, and P4 in the blocks BK(1, BC), BK(1, CD), BK(1, DE), and BK(2, EE) fetched with the possible search key values 704, 731, 758, and 1460 of the input character string ??BCDEE is counted. That is, dictionary pointer P0 occurs once, P1 three times, P2 three times, P3 twice, and P4 once. The candidate words for the input character string ??BCDEE can then be selected on the basis of the count values 1, 3, 3, 2, and 1 of the dictionary pointers P0, P1, P2, P3, and P4 and the number of characters $l_{in} = 7$ of the input character string.
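The counting step can be sketched with a `collections.Counter`. The function name `count_pointers` is hypothetical, and each pointer is counted at most once per block on the assumption that the count represents the number of search keys that retrieved the word.

```python
from collections import Counter


def count_pointers(blocks):
    """Count, for each dictionary pointer, how many fetched blocks contain it."""
    counts = Counter()
    for block in blocks:
        counts.update(set(block))  # one vote per search key, even if a pointer repeats
    return counts
```

As a usage illustration, the four blocks `[[1, 4], [1, 2, 3], [1, 2, 3], [0, 2]]` give the counts 1, 3, 3, 2, and 1 for pointers 0 to 4. The membership of these blocks is invented here so as to reproduce the count values stated in the text; it is not taken from FIG. 2.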

As one example, a threshold $T = \alpha \cdot (l_{in} - 1)$ (where $0 < \alpha \le 1$) is calculated, and the dictionary pointers whose count value is equal to or greater than the threshold $T$ are selected. In the example above, with $\alpha = 0.5$ the threshold $T$ is 3, so the dictionary pointers P1 and P2 with count values of 3 or more, that is, the candidate words ABCDEF and CDCDEE, are selected. As another example, the number of characters of the input character string may be left unused and the candidate words ABCDEF and CDCDEE retrieved simply by taking the dictionary pointers, that is, the words, whose count value is the maximum.
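Both selection rules can be written in a few lines. The following is a sketch under the assumption that the count values are available as a mapping from dictionary pointer to count; the function names `select_by_threshold` and `select_by_max` are hypothetical.

```python
def select_by_threshold(counts, input_length, alpha=0.5):
    """Select pointers whose count is at least T = alpha * (input_length - 1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    threshold = alpha * (input_length - 1)
    return sorted(ptr for ptr, count in counts.items() if count >= threshold)


def select_by_max(counts):
    """Select the pointers that share the largest count value."""
    if not counts:
        return []
    best = max(counts.values())
    return sorted(ptr for ptr, count in counts.items() if count == best)
```

With the count values of the text, `{0: 1, 1: 3, 2: 3, 3: 2, 4: 1}`, and an input length of 7, both rules return `[1, 2]`. The threshold rule scales with the input length, whereas the maximum rule always returns at least one candidate when any pointer was retrieved.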

As this example shows, when the word corresponding to the input character string cannot be determined uniquely, a unique decision can be made using a word matching device that matches the input character string against words, such as the one described in Japanese Patent Application No. 55-126244, "Dissimilarity Detection Device", filed by the present applicant. It also goes without saying that, once the word dictionary to be used is fixed, the dictionary pointer storage unit and the index table described above can be created in advance using an ordinary microprocessor or computer.

FIG. 3 is a logical block diagram showing an embodiment of the present invention. In the figure, 1 is a character reading device (OCR), and the character string read by the OCR 1 is stored sequentially in the input character string register 2. The search key conversion unit 3, as described with FIG. 1, extracts the possible partial character strings of the input character string stored in the input character string register 2, converts them sequentially into search key values, and stores them in the search key register 4. When the plurality of search key values have been set in the search key register 4, the index control unit 5 takes the search key values out of the search key register 4 one at a time and, as shown in FIG. 2, uses the index table 6 to fetch from the dictionary pointer storage unit 7 the group of dictionary pointers to the words containing the search key value (a partial character string of the input character string), and transfers it to the dictionary pointer counting unit 8. This operation is performed for every search key value stored in the search key register 4. The dictionary pointer counting unit 8 then computes a count value for each distinct dictionary pointer, by incrementing the count value of a dictionary pointer whenever a dictionary pointer transferred from the index control unit 5 is identical to one already transferred, stores the distinct dictionary pointers and their count values in the dictionary pointer register 9, and transfers them to the candidate pointer selection unit 10.

The candidate pointer selection unit 10 calculates the threshold $T$ ($= \alpha \cdot (l_{in} - 1)$) from the number of characters $l_{in}$ of the input character string stored in the input character string register 2 and a predetermined coefficient $\alpha$ ($0 < \alpha \le 1$). It then reads each dictionary pointer and its count value in turn from the dictionary pointer register 9 and, if the count value is equal to or greater than the threshold $T$, judges the pointer to be the dictionary pointer of a candidate word and transfers it to the index control unit 5. Alternatively, the candidate pointer selection unit 10 may judge the dictionary pointer having the largest count value among those stored in the dictionary pointer register 9 to be the dictionary pointer of a candidate word. When the dictionary pointer of a candidate word is transferred from the candidate pointer selection unit 10, the index control unit 5 fetches the word corresponding to that dictionary pointer from the word dictionary storage unit 11 and transfers it to the candidate word register 12.
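The data flow through the blocks of FIG. 3 can be traced end to end in a compact sketch. It is an illustration under several assumptions rather than the embodiment itself: the index table 6 and dictionary pointer storage unit 7 are merged into a single mapping from key value to a set of pointers, the index is rebuilt on every call instead of being prepared in advance, `?` marks an unreadable character, and the names `pair_keys` and `retrieve` are hypothetical.

```python
from collections import Counter, defaultdict


def pair_keys(chars):
    """Yield the key value of every readable adjacent pair in a string."""
    n = len(chars)
    for i in range(n - 1):
        a, b = chars[i], chars[i + 1]
        if a == "?" or b == "?":
            continue  # unreadable characters produce no key
        p = 0 if i == 0 else 2 if i + 1 == n - 1 else 1
        yield p * 676 + (ord(a) - 65) * 26 + (ord(b) - 65)


def retrieve(input_string, words, alpha=0.5):
    """Return the candidate words for an OCR input string."""
    # Index table 6 and pointer storage unit 7, modelled as key -> set of pointers.
    index = defaultdict(set)
    for ptr, word in enumerate(words):
        for key in pair_keys(word):
            index[key].add(ptr)
    # Search key conversion unit 3 filling search key register 4.
    keys = set(pair_keys(input_string))
    # Index control unit 5 feeding dictionary pointer counting unit 8.
    counts = Counter(ptr for key in keys for ptr in index.get(key, ()))
    # Candidate pointer selection unit 10.
    threshold = alpha * (len(input_string) - 1)
    # Word dictionary storage unit 11 -> candidate word register 12.
    return [words[ptr] for ptr in sorted(counts) if counts[ptr] >= threshold]
```

For the hypothetical dictionary `["FBCAAA", "ABCDEF", "CDCDEE", "EFFCCD"]`, `retrieve("??BCDEE", words)` returns `["ABCDEF", "CDCDEE"]`, the two candidate words named in the text. A final word matching step would still be needed to choose between them.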

In this way, the word corresponding to the input character string stored in the input character string register 2 is set in the candidate word register 12.

If a plurality of words are set in the candidate word register 12, an ordinary word matching device (not shown in the figure) matches the input character string against those words and decides on one specific word.

(Effects of the Invention) As described above, when a desired word is retrieved using partial character strings of an input character string as search keys, the present invention makes it possible to select the desired word, or a plurality of words including the desired word, stably and efficiently even if some of the search keys contain erroneous characters or unreadable states.

[Brief explanation of drawings]

FIG. 1 is a diagram showing an example of converting an input character string into a series of search keys. FIG. 2 is a diagram showing an example of a method of looking up a desired word using a series of search keys. FIG. 3 is a logical block diagram showing an embodiment of the present invention. In the figures, 1 is the OCR, 2 is the input character string register, 3 is the search key conversion unit, 4 is the search key register, 5 is the index control unit, 6 is the index table, 7 is the dictionary pointer storage unit, 8 is the dictionary pointer counting unit, 9 is the dictionary pointer register, 10 is the candidate pointer selection unit, 11 is the word dictionary storage unit, and 12 is the candidate word register.

Claims

[Claims]

A word search method for searching for a predetermined word stored in a word dictionary using partial character strings of an input character string, characterized by comprising: means for converting a partial character string consisting of adjacent characters of the input character string into a search key that includes relative position information; pointer storage means for storing pointers, each indicating the storage position of a word containing the partial character string corresponding to the search key, grouped into a block for each distinct search key; an index table that stores the storage position of each such block within the pointer storage means at a predetermined address determined by the search key; means for generating a plurality of possible search keys from the input character string and using the index table to fetch from the pointer storage means a plurality of blocks corresponding to the plurality of search keys; and means for selecting a word corresponding to the input character string on the basis of the number of pointers indicating the same word contained in the plurality of blocks and the number of characters in the input character string.