JPH03154985A

JPH03154985A - Maximum likelihood word recognizing system

Info

Publication number: JPH03154985A
Application number: JP1292226A
Authority: JP
Inventors: Naotaka Daikoumei; 大光明　直孝
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-11-13
Filing date: 1989-11-13
Publication date: 1991-07-02

Abstract

PURPOSE: To increase processing speed by summing the scores of the characters in a character candidate string for each word, narrowing the words down to a smaller number based on the number of summations, and then discriminating the coincident and noncoincident character parts to re-collate the words. CONSTITUTION: An evaluation value summing processing part 20 sums the evaluation values given to the characters that coincide with each candidate word into the corresponding evaluation value table 12, counts the number of summations for each word, and stores that count in a summing frequency count totalizing table 15. A sort processing part 3 sorts the candidate words output by the evaluation value summing processing part 20 in descending order of evaluation value, and a discrimination processing part 4 outputs the candidate word having the largest evaluation value as the collation output. Thus, the number of candidate words subject to re-collation processing is reduced and the processing speed is increased.

Description

[Detailed Description of the Invention] [Objective of the Invention] (Industrial Application Field) The present invention relates to a maximum likelihood word recognition method that searches for candidate words from an input character string and recognizes, from among those candidates, the most appropriate word with the highest likelihood. More specifically, it relates to a maximum likelihood word recognition method that uses a candidate word narrowing method: from the N candidate words obtained as an intermediate result of word matching, it extracts the n candidate words that should be subjected to re-matching, in which character segmentation and character matching are redone for the mismatched character portions.

(Prior Art) As shown in FIG. 3, the conventional maximum likelihood word recognition method has a search processing section 1 that performs associative integration processing between the input character string and the word dictionary 11 in memory section 37, searches for candidate words in which the character at each position of the input character string exists, and extracts all such candidate words. An evaluation value addition processing section 2 then adds, for each candidate word obtained by the search processing section 1, the evaluation value assigned to each matched character to the evaluation value table 12. Next, a sort processing section 3 sorts the candidate words output from the evaluation value addition processing section 2 in descending order of evaluation value to obtain the top N candidate words. The search processing section 1, the evaluation value addition processing section 2, and the sort processing section 3 together constitute a word matching processing section 16.

For each of the N candidate words obtained by the sort processing section 3, a determination processing section 4 determines whether all the characters of the word are contained in the input character string.

If this determination finds that not all the characters are contained, a re-matching processing section 10 performs character segmentation and character matching again on the mismatched portion of the word and then runs word recognition again. The re-matching processing section 10 consists of a mismatched character position search processing section 6, a narrowing section 13 based on character matching rate, a character re-segmentation section 7, a character re-matching section 8, a word matching processing section 5, and a determination processing section 9.

If the determination processing section 4 instead determines that all the constituent characters of a candidate word are present in the candidate character string (the input character string), that candidate word is output as the matching recognition result.

In an arbitrary-pitch character string, the region of a single character cannot always be identified accurately: the region of one character may be split into two character regions, or conversely the regions of two characters may be merged into one.

Consequently, when region identification is wrong, the match between the character candidate string narrowed down from the input character string and a candidate word cannot be separated into matching and mismatching parts simply by comparing character codes one character at a time from the start of each string. Such a simple comparison often misjudges the matching and mismatching parts. The matching and mismatching parts are therefore determined by a method similar to what is called DP matching, which finds the best correspondence while also comparing against the characters before and after each character candidate. The processing cost of this mismatch determination is on the order of (number of candidate words N) × (candidate word length l) × (character candidate string length m).
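The DP-matching-like determination described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the patent: it uses a longest-common-subsequence style table, and the function name `align` and the representation of the character candidate string as a list of sets of candidate characters are hypothetical. The table has l × m cells per word, which is where the N × l × m cost for N candidate words comes from.

```python
def align(word, candidates):
    """Return the indices of the characters of `word` that match, in order,
    some position of `candidates` (a list of sets of candidate characters).
    The remaining indices form the mismatched portion."""
    l, m = len(word), len(candidates)
    # dp[i][j] = maximum number of matches between word[i:] and candidates[j:]
    dp = [[0] * (m + 1) for _ in range(l + 1)]
    for i in range(l - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(dp[i + 1][j], dp[i][j + 1])
            if word[i] in candidates[j]:
                best = max(best, dp[i + 1][j + 1] + 1)
            dp[i][j] = best
    # Trace back through the table to recover which characters matched.
    matched = []
    i = j = 0
    while i < l and j < m:
        if word[i] in candidates[j] and dp[i][j] == dp[i + 1][j + 1] + 1:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # skip a word character (mismatch)
        else:
            j += 1  # skip a candidate position (e.g. a split region)
    return matched


# Hypothetical input: the middle character was misrecognized.
word = "東京都"
candidates = [{"東", "束"}, {"車"}, {"都"}]
matched = align(word, candidates)
mismatched = [i for i in range(len(word)) if i not in matched]
```

Here `matched` holds the matching part and `mismatched` the part that would go back to character segmentation and matching.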

The ratio of the number of characters in the matching part obtained by this determination to the number of characters in the word is called the character matching rate. The words to be re-matched are narrowed down by selecting the n words (n < N) with the highest character matching rates.

(Problems to be Solved by the Invention) Because the conventional method described above computes the matching and mismatching parts for all N candidates, the processing load is heavy and the method cannot be made fast.

The present invention has been made in view of the above. Its object is to provide a maximum likelihood word recognition method that is faster because it reduces the number of candidate words subjected to re-matching and thereby reduces the amount of re-matching processing.

[Structure of the Invention] (Means for Solving the Problems) To achieve the above object, the maximum likelihood word recognition method of the present invention is a method for searching a word dictionary for a word that matches a Japanese character string written at an arbitrary pitch. It has likelihood search means that, for every character in the character candidate string obtained by character segmentation and character recognition of an input character string, searches for words containing that character, adds the score of the character to the score of each such word, and judges words with higher resulting scores to have higher similarity. It also has re-matching means that checks, for a predetermined number of high-likelihood words, whether all constituent characters are contained in the character candidate string and judges a word whose characters are all contained to be a word that matches the input character candidate string. When no word is judged to match, the re-matching means searches the predetermined number of words for the matching and mismatching parts between the character candidate string and each word, performs character region identification and character recognition again on the mismatching parts, and determines the maximum likelihood word. The gist of the invention is that the method comprises addition count storage means for storing the number of additions for each word when character scores are added as word scores, and re-matching means that performs re-matching only on words, fewer in number than the predetermined number, whose pseudo character matching rate, obtained by dividing the number of additions by the number of constituent characters of the word, is equal to or greater than a predetermined value.

(Operation) In the maximum likelihood word recognition method of the present invention, the scores of the characters contained in the character candidate string are added for each word and the number of additions is stored. The words are narrowed down to a small number based on this addition count, and only then are the matching and mismatching character parts identified and the words re-matched.

(Example) An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a maximum likelihood word recognition method according to one embodiment of the present invention. The method shown in the figure consists of a word matching processing section 31, a determination processing section 4, a re-matching processing section 33, and a memory section 35. The word matching processing section 31 consists of a search processing section 1, an evaluation value addition processing section 20, and a sort processing section 3. The re-matching processing section 33 consists of a narrowing section 14 based on pseudo character matching rate, a mismatched character position search processing section 6, a character re-segmentation section 7, a character re-matching section 8, a word matching processing section 5, and a determination processing section 9. The memory section 35 consists of a word dictionary 11, an evaluation value table 12, and an addition count aggregation table 15.

In the method of this embodiment, components identical to those used in the method of FIG. 3 described above carry the same reference numerals. This embodiment differs from the method of FIG. 3 in two respects. First, the evaluation value addition processing section 20 counts the number of additions, and an addition count aggregation table 15 is provided to store that count. Second, a narrowing section 14 based on pseudo character matching rate is provided, which narrows the N candidate words down to the n candidates most likely to be correct before the mismatched character position search is performed, thereby reducing the processing load.

As shown in FIG. 2, the narrowing section 14 based on pseudo character matching rate consists of a pseudo character matching rate calculation section 26, a grouping and ranking section 27, and a candidate word acceptance determination section 28.

Next, the operation is described.

The input character string is first compared with each word of the word dictionary 11 in the search processing section 1. Candidate words in which the character at each position of the input character string exists are searched for, and all such candidate words are extracted. For each extracted candidate word, the evaluation value addition processing section 20 adds the evaluation value assigned to each matched character to the corresponding entry of the evaluation value table 12. At the same time, it counts the number of evaluation value additions for each word and stores the count in the addition count aggregation table 15. Storing the number of additions adds only a count increment in the evaluation value addition processing section and one store operation per word, so its cost is negligible.
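The accumulation step can be sketched as below. This is a minimal illustration, assuming a data layout the source does not specify: the character candidate string is a list of positions, each holding `(character, evaluation value)` pairs, and a word receives an addition when it contains the character within `window` positions of the input position. The names `accumulate` and `window` and the sample values are hypothetical.

```python
from collections import defaultdict


def accumulate(candidate_string, dictionary, window=1):
    """Add character evaluation values to each word and count the additions.

    Returns (scores, counts): `scores` plays the role of the evaluation value
    table 12 and `counts` the role of the addition count aggregation table 15.
    """
    scores = defaultdict(int)
    counts = defaultdict(int)
    for pos, candidates in enumerate(candidate_string):
        lo, hi = max(0, pos - window), pos + window + 1
        for char, value in candidates:
            for word in dictionary:
                # The word qualifies if the character occurs near this position.
                if char in word[lo:hi]:
                    scores[word] += value
                    counts[word] += 1  # the only extra work per addition
    return dict(scores), dict(counts)


# Hypothetical recognition result for a two-character input.
candidate_string = [[("東", 10), ("束", 6)], [("京", 9)]]
dictionary = ["東京", "京都", "大阪"]
scores, counts = accumulate(candidate_string, dictionary)
```

A real system would presumably look words up through an index keyed by character rather than scanning the whole dictionary; the linear scan only keeps the sketch short.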

Next, the sort processing section 3 sorts the candidate words output by the evaluation value addition processing section 20 in descending order of evaluation value and obtains the top N candidate words. The determination processing section 4 then checks, for each of the N candidate words obtained by the sort processing section 3, how its constituent characters match the input character string, that is, the candidate characters. If there are candidate words whose constituent characters are all present in the candidate character string, the one with the highest evaluation value is output as the matching result.
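The sort and determination steps can be sketched as follows. This is an illustrative sketch only: the function name `judge`, the dictionary of scores, and the representation of the candidate character string as a list of sets are assumptions, not details from the source.

```python
def judge(scores, candidate_string, top_n):
    """Sort words by evaluation value, keep the top N, and return the
    best-scoring word whose characters all appear in the candidate string.

    Returns (word_or_None, top_n_words). A result of None means the top N
    words have to be handed to re-matching.
    """
    seen = {char for position in candidate_string for char in position}
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    for word in ranked:  # already in descending order of evaluation value
        if all(char in seen for char in word):
            return word, ranked
    return None, ranked


# Hypothetical scores and candidate characters.
scores = {"東京都": 25, "東京": 19, "京都": 9}
candidate_string = [{"東", "束"}, {"京"}]
result, top = judge(scores, candidate_string, top_n=2)
```

In this example "東京都" has the highest score but contains a character that was never recognized, so the next word in the ranking is checked.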

If no candidate word has all of its constituent characters in the candidate character string, the N candidate words are passed to the re-matching processing section 33. To decide which of the N candidate words are to be re-matched, the re-matching processing section 33 first runs the narrowing section 14 based on pseudo character matching rate. It then runs, in order, the mismatched character position search processing section 6, the character re-segmentation section 7, the character re-matching section 8, the word matching processing section 5, and the determination processing section 9, and outputs the matching result.

As shown in FIG. 2, the narrowing section 14 based on pseudo character matching rate runs the pseudo character matching rate calculation section 26, the grouping and ranking section 27, and the candidate word acceptance determination section 28.

The pseudo character matching rate calculation section 26 calculates the pseudo character matching rate by dividing the number of evaluation value additions stored by the evaluation value addition processing section 20 by the word length, that is, the number of constituent characters of the word. It is called a "pseudo" character matching rate for the following reason. To compensate for shifts in character position caused by character segmentation errors, the evaluation value addition processing section 20 examines a range of characters before and after each character position and adds to every word that contains the character in that range. For "東京" (Tokyo) and "京都" (Kyoto), for example, the second character "京" causes one addition for both words, so the count does not always express the exact matching state. The conventional match/mismatch determination takes the order of the characters into account, so it does not suffer from this inaccuracy.
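The calculation itself is a single division per word. The sketch below assumes the addition counts are available as a dictionary keyed by word; the function name and the sample counts are hypothetical.

```python
def pseudo_match_rates(counts):
    """Map each word to (number of additions) / (number of characters in the word)."""
    return {word: count / len(word) for word, count in counts.items()}


# Hypothetical counts: only "京" was recognized, as the second input character,
# so "東京" and "京都" each received one addition.
counts = {"東京": 1, "京都": 1}
rates = pseudo_match_rates(counts)
```

Both words end up with the same rate even though "京" sits at a different position in "京都", which is the inaccuracy the word "pseudo" refers to.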

The grouping and ranking section 27 sorts the candidate words in descending order of pseudo character matching rate, so that the candidate words with a high rate obtained by the pseudo character matching rate calculation section 26 become the targets of re-matching. When candidate words differ only slightly in pseudo character matching rate, however, using the plain rank as the evaluation measure and applying a threshold to it would decide acceptance or rejection among similar words, which are inherently hard to judge as correct or incorrect, on the basis of a tiny difference in rate. That is not a sound method of judgment.

The grouping and ranking section 27 therefore treats similar candidate words with close pseudo character matching rates as a single group and assigns a rank to each group.

The candidate word acceptance determination section 28 performs the last step of the narrowing section 14 based on pseudo character matching rate. From the grouped and ranked candidate words, it extracts those ranked at or above a preset threshold on the evaluation measure (a group rank, e.g. rank 1 or 2) as the target candidate words for re-matching.
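Grouping, ranking, and acceptance can be sketched together as below. The source does not say how closeness within a group is measured, so this sketch assumes a word joins the current group while its rate is within `step` of the group's highest rate. The names `group_and_select`, `step`, and `max_rank` and the sample rates are hypothetical.

```python
def group_and_select(rates, step=0.05, max_rank=2):
    """Sort words by pseudo character matching rate, group words with close
    rates, rank the groups, and keep words whose group rank <= max_rank.

    Returns a list of (word, group_rank) in descending order of rate.
    """
    ordered = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    selected = []
    rank = 0
    leader = None  # highest rate in the current group
    for word, rate in ordered:
        if leader is None or leader - rate > step:
            rank += 1  # too far from the group leader: start a new group
            leader = rate
        if rank > max_rank:
            break  # all remaining words belong to lower-ranked groups
        selected.append((word, rank))
    return selected


# Hypothetical rates for four candidate words.
rates = {"東京": 1.0, "東京都": 0.98, "京都": 0.5, "大阪": 0.2}
targets = group_and_select(rates)
```

Words with nearly equal rates share a rank, so a small difference in rate cannot by itself push one of them below the threshold.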

With the narrowing section 14 based on pseudo character matching rate described above, when the determination processing section 4 decides that re-matching is necessary, the N candidate words can be narrowed down to the n candidate words most likely to be correct. Apart from the value used to calculate the matching rate, the processing of the narrowing section 14 based on pseudo character matching rate is exactly the same as that of the conventional narrowing section 13 based on character matching rate.

[Effects of the Invention] As described above, according to the present invention, the scores of the characters contained in the character candidate string are added for each word, the number of additions is stored, and the words are narrowed down to a small number based on this addition count before the matching and mismatching character parts are identified and the words re-matched. The processing load is reduced in proportion to the narrowing, and the method can be made faster.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a maximum likelihood word recognition method according to one embodiment of the present invention. FIG. 2 is a block diagram showing the detailed configuration of the narrowing section based on pseudo character matching rate used in the method of FIG. 1. FIG. 3 is a block diagram showing the configuration of a conventional maximum likelihood word recognition method.

1 ... search processing section
3 ... sort processing section
4 ... determination processing section
6 ... mismatched character position search processing section
11 ... word dictionary
12 ... evaluation value table
14 ... narrowing section based on pseudo character matching rate
15 ... addition count aggregation table
20 ... evaluation value addition processing section
31 ... word matching processing section

Claims

[Claims]

(1) A maximum likelihood word recognition method for searching a word dictionary for a word that matches a Japanese character string written at an arbitrary pitch, the method having: likelihood search means that, for every character in a character candidate string obtained by character segmentation and character recognition of an input character string, searches for words containing that character, adds the score of the character to the score of each such word, and judges words with higher resulting scores to have higher similarity; and re-matching means that checks, for a predetermined number of high-likelihood words, whether all constituent characters are contained in the character candidate string, judges a word whose characters are all contained to be a word that matches the input character candidate string, and, when no word is judged to match, searches the predetermined number of words for the matching and mismatching parts between the character candidate string and each word, performs character region identification and character recognition again on the mismatching parts, and determines the maximum likelihood word,
the method being characterized by comprising: addition count storage means for storing the number of additions for each word when character scores are added as word scores; and re-matching means for performing re-matching only on words, fewer in number than the predetermined number, whose pseudo character matching rate, obtained by dividing the number of additions by the number of constituent characters of the word, is equal to or greater than a predetermined value.

(2) The maximum likelihood word recognition method according to claim 1, wherein the re-matching means sorts the words by pseudo character matching rate, treats words whose pseudo character matching rates fall within a predetermined step width as one group, assigns ranks to the groups in descending order of pseudo character matching rate, and performs re-matching on the words, fewer in number than the predetermined number, whose rank is at or above a predetermined rank.