JP2827066B2

JP2827066B2 - Post-processing method for character recognition of documents with mixed digit strings

Info

Publication number: JP2827066B2
Application number: JP4101817A
Authority: JP
Inventors: 明利塚本; 節正広垣; 直弘天本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1992-03-27
Filing date: 1992-03-27
Publication date: 1998-11-18
Anticipated expiration: 2013-11-18
Also published as: JPH05274482A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a post-processing method for character recognition of documents with mixed digit strings. The method takes characters read by an optical character recognition device or the like and, particularly when the document being processed contains digit strings, automatically corrects any errors in the recognition result before output.

[0002]

[Prior Art] A character recognition device recognizes each character written in a document, by hand or otherwise, as a separate pattern. Because these characters form words in the document, a recognition error of one or two characters within a word can be corrected by post-processing the recognition result. Conventional post-processing proceeds as follows. First, the post-processor receives from the character recognition device the candidate characters for each original character pattern, together with a "distance" that expresses how much the shape of the original character pattern differs from each candidate character. It then builds a "reference word" by arranging the "first candidate characters", that is, the candidates with the smallest distance. Next, it selects from a word dictionary the words that have the same length as the reference word and match it in the largest number of characters, and lists them as "candidate words". Finally, for each candidate word it computes the sum of the distances of the characters used to assemble that word from the candidate characters (the "cost value"), and outputs the word with the smallest cost value (see, for example, Japanese Patent Application No. Hei 3-196509).

[0003]

[Problem to Be Solved by the Invention] The conventional technique described above has the following problem. When a word in the document is a digit string, that digit string is not registered in the dictionary. As a result, even if the recognition result is correct, it may be erroneously "corrected" to some word in the dictionary, and the digits may never be output. The present invention addresses this point. Among the candidate characters in the recognition result for each character pattern, it arranges the highest-ranked digit (or similar) candidates to create a "digit string candidate word", and adds this to the candidate words obtained by searching the dictionary. Its object is to provide a post-processing method for character recognition of documents with mixed digit strings that also corrects misrecognition of digit strings and thereby improves recognition performance.
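
The conventional procedure, and the way it fails on a digit string, can be sketched as follows. This is a minimal Python sketch, not the implementation from the cited application. The candidate table and the dictionary words are the ones this document gives for its embodiment (FIG. 3 and word dictionary 11); the names `CANDIDATES`, `DICTIONARY`, and `conventional_postprocess`, and the fallback to the reference word when no candidate word has a cost, are assumptions made for illustration.

```python
# (character, distance) candidates per character pattern, as listed for FIG. 3.
CANDIDATES = [
    [("l", 80), ("1", 90), ("t", 120)],
    [("0", 70), ("o", 80), ("c", 100)],
    [("6", 50), ("5", 90), ("o", 100), ("c", 110)],
    [("o", 80), ("0", 85), ("a", 120)],
    [("l", 80), ("1", 100), ("t", 120)],
]
DICTIONARY = ["label", "labor", "lemon", "level", "local", "loyal", "melon"]


def conventional_postprocess(candidates, dictionary):
    # Reference word: the smallest-distance candidate of each pattern.
    reference = "".join(min(c, key=lambda p: p[1])[0] for c in candidates)
    # Same-length dictionary words, scored by position-wise matches.
    same_len = [w for w in dictionary if len(w) == len(reference)]
    matches = {w: sum(a == b for a, b in zip(w, reference)) for w in same_len}
    best = max(matches.values(), default=0)
    candidate_words = [w for w in same_len if matches[w] == best]
    # Cost: sum of distances, defined only when every character of the word
    # is a listed candidate for its pattern.
    costs = {}
    for word in candidate_words:
        dists = [dict(c).get(ch) for ch, c in zip(word, candidates)]
        if None not in dists:
            costs[word] = sum(dists)
    # Assumption: fall back to the reference word if nothing has a cost.
    return min(costs, key=costs.get) if costs else reference


print(conventional_postprocess(CANDIDATES, DICTIONARY))  # local
```

Although the written word is the digit string "10601", the only candidate with a computable cost is the dictionary word "local", so the digits are lost.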

[0004]

[Means for Solving the Problem] The post-processing method of the present invention operates on a character recognition result obtained as follows: each character in an arbitrary word cut out from the document to be recognized is compared with each item of character data in a character data dictionary; for each item of character data, a distance expressing how closely it approximates the character in the cut-out word is obtained; and the character data with relatively small distances are listed as one or more candidates for that character. For this recognition result, a word dictionary is prepared in which words expected to appear in the document are registered in advance. First, among the candidate character data for each character in the cut-out word, the character data with the smallest distance is taken as the first candidate, and the remaining candidates are ranked second, third, and so on in ascending order of distance. The word formed by arranging the first candidates is used as a reference word for consulting the word dictionary. The reference word is compared with each registered word in the word dictionary, and the registered words whose characters match the characters of the reference word in the largest number of positions are selected as candidate words. Next, when a digit appears among the candidates (the first candidate or any later one) for every character in the recognized character string, a digit string candidate is created from those digits; where several digits are listed as candidates for one character, the candidate digit with the smallest distance is used. The number of characters that match between the reference word and the digit string candidate is then calculated, either by comparing the reference word with the digit string candidate or by counting the digits in the digit string candidate that are first candidates. When this number of matching characters is equal to or greater than the number of characters by which the candidate words selected from the word dictionary match the reference word, the digit string candidate is added to the candidate words. Finally, for each candidate word or digit string candidate whose characters or digits all appear among the candidates (the first candidate or any later one), a cost value, which is the sum of the distances, is calculated, and the candidate word or digit string candidate with the smallest cost value is output.

[0005]

[Operation] The post-processing method of the present invention operates on a character recognition result obtained as follows: each character in an arbitrary word cut out from the document to be recognized is compared with each item of character data in the character data dictionary, a distance expressing how closely each item approximates the character in the cut-out word is obtained, and the character data with relatively small distances are listed as one or more candidates for that character. A word dictionary in which words expected to appear in the document are registered in advance is prepared, and the following steps are carried out. First, among the candidate character data for each character in the cut-out word, the character data with the smallest distance is taken as the first candidate, and the rest are ranked second, third, and so on in ascending order of distance. The word formed by arranging the first candidates is used as the reference word for consulting the word dictionary. The reference word is then compared with each registered word in the word dictionary, and the registered words that match the reference word in the largest number of characters are selected as candidate words. Next, when a digit appears among the candidates (the first candidate or any later one) for every character in the recognized character string, a digit string candidate is created from those digits; where several digits are listed for one character, the candidate digit with the smallest distance is used. The number of characters that match between the reference word and the digit string candidate is calculated, either by comparing the two or by counting the digits in the digit string candidate that are first candidates. If this count is equal to or greater than the number of characters by which the candidate words selected from the word dictionary match the reference word, the digit string candidate is added to the candidate words. Finally, for each candidate word or digit string candidate whose characters or digits all appear among the listed candidates, the cost value (the sum of the distances) is calculated, and the candidate word or digit string candidate with the smallest cost value is output.
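
The steps above can be put together in one short function. The following is a hedged Python sketch under stated assumptions, not the patented implementation: the function name `postprocess`, the data layout (a list of `(character, distance)` pairs per character pattern), the restriction of "digits" to the characters 0 to 9, and the fallback to the reference word when no candidate has a cost are all illustrative choices. The sample data are the values this document gives for its embodiment.

```python
DIGITS = set("0123456789")  # assumption: only 0-9 are treated as digits


def postprocess(candidates, dictionary):
    # Rank the candidates of each pattern by ascending distance.
    ranked = [sorted(c, key=lambda p: p[1]) for c in candidates]
    reference = "".join(r[0][0] for r in ranked)

    def match_count(word):
        return sum(a == b for a, b in zip(word, reference))

    # Candidate words: same-length dictionary words with the most matches.
    same_len = [w for w in dictionary if len(w) == len(reference)]
    best = max((match_count(w) for w in same_len), default=0)
    pool = [w for w in same_len if match_count(w) == best]

    # Digit string candidate: lowest-distance digit of every pattern,
    # created only if every pattern lists at least one digit.
    digits = [next((ch for ch, _ in r if ch in DIGITS), None) for r in ranked]
    if None not in digits:
        digit_string = "".join(digits)
        if match_count(digit_string) >= best:
            pool.append(digit_string)

    # Cost value: sum of distances, only for fully covered candidates.
    costs = {}
    for word in pool:
        dists = [dict(r).get(ch) for ch, r in zip(word, ranked)]
        if None not in dists:
            costs[word] = sum(dists)
    # Assumption: fall back to the reference word if nothing has a cost.
    return min(costs, key=costs.get) if costs else reference


fig3 = [
    [("l", 80), ("1", 90), ("t", 120)],
    [("0", 70), ("o", 80), ("c", 100)],
    [("6", 50), ("5", 90), ("o", 100), ("c", 110)],
    [("o", 80), ("0", 85), ("a", 120)],
    [("l", 80), ("1", 100), ("t", 120)],
]
words = ["label", "labor", "lemon", "level", "local", "loyal", "melon"]
print(postprocess(fig3, words))  # 10601
```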

[0006]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of one embodiment of the post-processing method for character recognition of documents with mixed digit strings according to the present invention. The illustrated character recognition post-processing device 10 comprises candidate character determination means 1, reference word creation means 2, candidate word selection means 3, digit string candidate creation means 4, matching character count calculation means 5, candidate word determination means 6, cost value calculation means 7, and word output means 8.

[0007] The candidate character determination means 1 determines the candidate characters for each character in the document 13 recognized by the character recognition device 12. The character recognition device 12 has a character data dictionary 14. It compares a character in the document 13 (in the illustrated case, the digit "1") with the character data "a", "b", and so on in the character data dictionary 14 to calculate distances, and lists several items of character data with relatively small distances, such as "l" and "1". The candidate character determination means 1 ranks these items in ascending order of distance: first candidate "l", second candidate "1", and so on. The reference word creation means 2 arranges the first-candidate character data to create the reference word "l06ol".

[0008] The candidate word selection means 3 compares the reference word "l06ol" with each registered word in the word dictionary 11: "label", "labor", "lemon", "level", "local", "loyal", and "melon". It selects as candidate words the registered words whose characters match the characters "l", "0", "6", "o", "l" of the reference word in the largest number of positions. In the illustrated example, among the registered words with the same number of characters, "label", "labor", "lemon", "level", "local", and "loyal" are selected; the characters that match at corresponding positions are "l, l", "l, o", "l, o", "l, l", "l, l", and "l, l", respectively. The number of matching characters in this case is 2.
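
The selection described in this paragraph amounts to a position-by-position comparison. Below is a minimal Python sketch of it; the function name `select_candidate_words` and its return shape are assumptions for illustration, while the reference word and dictionary contents are those given in the text.

```python
def select_candidate_words(reference, dictionary):
    """Return (candidate_words, matching_character_count)."""
    # Only words with the same number of characters are compared.
    same_len = [w for w in dictionary if len(w) == len(reference)]
    # Count the positions at which word and reference word agree.
    scores = {w: sum(a == b for a, b in zip(w, reference)) for w in same_len}
    best = max(scores.values(), default=0)
    return [w for w in same_len if scores[w] == best], best


dictionary = ["label", "labor", "lemon", "level", "local", "loyal", "melon"]
selected, count = select_candidate_words("l06ol", dictionary)
print(selected)  # ['label', 'labor', 'lemon', 'level', 'local', 'loyal']
print(count)     # 2
```

"melon" is dropped because it agrees with "l06ol" only in the fourth position ("o"), giving a count of 1.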

[0009] The digit string candidate creation means 4 creates a digit string candidate when, for every character in the character string recognized by the character recognition device 12, a digit appears among the candidates, starting from the first candidates "l06ol". Where several digits are listed as candidates for one character, the highest-ranked of those digits is used. In the illustrated example, "10601" is created as the digit string candidate. The matching character count calculation means 5 calculates the number of characters that match between the reference word "l06ol" and the digit string candidate "10601". This can be done either by simply comparing the reference word "l06ol" with the digit string candidate "10601" and counting the matching characters (the digits "0" and "6"), or by counting the digits in the digit string candidate "10601" that are first candidates ("0" and "6"). In the illustrated example, the number of matching characters is 2.
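
Digit string candidate creation and the first of the two counting methods can be sketched as follows. This is an illustrative Python sketch: the names `digit_string_candidate` and `match_count` are assumptions, digits are limited to 0 to 9 for simplicity, and the ranked candidate lists are the ones this document gives for FIG. 3 (best candidate first).

```python
DIGITS = set("0123456789")  # assumption: only 0-9 are treated as digits


def digit_string_candidate(ranked):
    """Pick the highest-ranked digit of each pattern, or None if some
    pattern lists no digit among its candidates."""
    picked = []
    for chars in ranked:
        digit = next((ch for ch in chars if ch in DIGITS), None)
        if digit is None:
            return None
        picked.append(digit)
    return "".join(picked)


def match_count(a, b):
    # Number of positions at which the two strings agree.
    return sum(x == y for x, y in zip(a, b))


# Candidates per character pattern, best (smallest distance) first.
ranked = [
    ["l", "1", "t"],
    ["0", "o", "c"],
    ["6", "5", "o", "c"],
    ["o", "0", "a"],
    ["l", "1", "t"],
]
reference = "".join(chars[0] for chars in ranked)
candidate = digit_string_candidate(ranked)
print(reference, candidate, match_count(reference, candidate))  # l06ol 10601 2
```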

[0010] The candidate word determination means 6 adds the digit string candidate to the candidate words when the number of matching characters calculated by the matching character count calculation means 5 is equal to or greater than the number of matching characters found by the candidate word selection means 3. In the illustrated example both counts are 2, so the digit string candidate "10601" is added to the candidate words "label", "labor", "lemon", "level", "local", and "loyal". The cost value calculation means 7 calculates the cost value of those candidate words and digit string candidates whose characters or digits are all listed as candidates by the character recognition device 12. In the illustrated example, "local" and "10601" meet this condition. The word output means 8 outputs the candidate word or digit string candidate with the smallest cost value calculated by the cost value calculation means 7. In the illustrated example, the digit string candidate "10601" is output.

[0011] FIG. 2 is a flowchart of the character recognition post-processing procedure, and FIG. 3 illustrates the creation of the reference word. FIG. 3 shows an example of the recognition result for the character patterns "10601", giving the candidate characters and their distances for each character pattern. The distance expresses how much a character pattern differs from its candidate character as a pattern. It varies with noise at recognition time, with how the feature portions of the pattern are extracted, and with misalignment. For the first character pattern "1", the first candidate is the letter "l" at a distance of 80 from the pattern, the second candidate is the digit "1" at a distance of 90, and the third candidate is the letter "t" at a distance of 120. As this shows, differences in pattern recognition can make the distance of the digit "1" larger than that of the letter "l" for the character pattern "1".

[0012] For the second character pattern "0", the first candidate is the digit "0" at a distance of 70, the second candidate is the letter "o" at a distance of 80, and the third candidate is the letter "c" at a distance of 100. For the character pattern "6", the first candidate is the digit "6" at a distance of 50, the second candidate is the character "5" at a distance of 90, the third candidate is the letter "o" at a distance of 100, and the fourth candidate is the letter "c" at a distance of 110.

[0013] For the fourth character pattern "0", the first candidate is the letter "o" at a distance of 80 and the second candidate is the digit "0" at a distance of 85. Here again, differences in pattern recognition make the distance of the digit "0" larger than that of the letter "o" for the character pattern "0". The third candidate is the letter "a", at a distance of 120 from the character pattern "0". For the fifth character pattern "1", the first candidate is the letter "l" at a distance of 80, the second candidate is the digit "1" at a distance of 100, and the third candidate is the letter "t" at a distance of 120. As before, the digit "1" can end up with a larger distance than the letter "l" for the character pattern "1", and the distance values themselves can differ from those obtained for the earlier occurrence of the same character.

[0014] First, the "reference word" is created from the data in FIG. 3. The reference word is the sequence of recognition candidates with the smallest distance, that is, the first-candidate letters or digits. In the example of FIG. 3, the reference word is "l06ol". Next, "candidate words" close to the reference word are obtained using a word dictionary prepared in advance. A conventional technique can be used for this step (see Japanese Patent Application No. Hei 3-196509).
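
Reference word creation from the FIG. 3 data can be sketched as below. This is a small Python illustration: the table holds the candidates and distances listed in the text, while the names `FIG3`, `ranked`, and `reference_word` are assumptions made for the example. How ties in distance are broken is not stated in the document; here the first listed candidate wins.

```python
# (character, distance) candidates for each of the five character patterns.
FIG3 = [
    [("l", 80), ("1", 90), ("t", 120)],
    [("0", 70), ("o", 80), ("c", 100)],
    [("6", 50), ("5", 90), ("o", 100), ("c", 110)],
    [("o", 80), ("0", 85), ("a", 120)],
    [("l", 80), ("1", 100), ("t", 120)],
]


def ranked(candidates):
    """Characters ordered as first, second, third, ... candidate."""
    return [ch for ch, _ in sorted(candidates, key=lambda pair: pair[1])]


def reference_word(patterns):
    """Concatenate the first (smallest-distance) candidate of each pattern."""
    return "".join(ranked(candidates)[0] for candidates in patterns)


print(ranked(FIG3[2]))       # ['6', '5', 'o', 'c']
print(reference_word(FIG3))  # l06ol
```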

[0015] FIG. 4 illustrates candidate word selection. It shows an example of the candidate words obtained for the result in FIG. 3. In the illustrated example, the six words that match the reference word "l06ol" in the most characters, "label", "labor", "lemon", "level", "local", and "loyal", are listed as candidate words. "l06ol" and "label" match in the first "l" and the last "l". "l06ol" and "labor" match in "l" and "o", as do "l06ol" and "lemon". "l06ol" matches each of "level", "local", and "loyal" in the first "l" and the last "l". A matching character count of 2 is therefore stored for these candidate words. Next, the digit string candidate is created by arranging, for each character pattern, the highest-ranked digit (including commas, periods, and the like) among the candidate characters given for that pattern.

[0016] FIG. 5 illustrates digit string candidate creation. It shows the digit string candidate created for the recognition result in FIG. 3. In the illustrated example, the digit string "10601", formed by arranging the highest-ranked digit among the candidate characters for each pattern, becomes the digit string candidate. The reference word "l06ol" and the digit string candidate "10601" match in "0" and "6", so a matching character count of 2 is calculated for the digit string candidate in step S11 of FIG. 6. Next, it is determined whether to add the digit string candidate to the candidate words.

[0017] FIG. 6 is a flowchart showing an example of this processing. First, the digit string candidate is compared with the reference word and the number of matching characters is calculated (step S11). Next, this count is compared with the already stored number of matching characters between the candidate words and the reference word (step S12). If the digit string candidate has the larger count, the existing candidate words are deleted and only the digit string candidate is registered as a candidate word (step S14). If the two counts are equal, the digit string candidate is added as one more candidate word (step S13). In the example of FIG. 5, the reference word "l06ol" and the digit string candidate "10601" have 2 matching characters, equal to the matching character count of 2 for the candidate words in FIG. 4, so the digit string candidate is added as one of the candidate words.
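
The decision flow of FIG. 6 (steps S11 to S14) can be sketched as a single function. This is an illustrative Python sketch; the function name `update_candidates` and its parameters are assumptions. The text does not spell out the branch in which the digit string candidate has fewer matches, so leaving the candidate words unchanged in that case is an inference from the "equal to or greater" condition stated elsewhere in the document.

```python
def update_candidates(reference, candidate_words, word_match_count, digit_string):
    # S11: count matching characters between digit string and reference word.
    digit_match_count = sum(a == b for a, b in zip(reference, digit_string))
    # S12: compare with the count stored for the dictionary candidate words.
    if digit_match_count > word_match_count:
        # S14: discard the previous candidates, keep only the digit string.
        return [digit_string]
    if digit_match_count == word_match_count:
        # S13: add the digit string as one more candidate word.
        return candidate_words + [digit_string]
    # Fewer matches: candidates stay as they are (inferred, see lead-in).
    return candidate_words


words = ["label", "labor", "lemon", "level", "local", "loyal"]
print(update_candidates("l06ol", words, 2, "10601"))
# ['label', 'labor', 'lemon', 'level', 'local', 'loyal', '10601']
```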

[0018] In this example, the decision to add the digit string candidate was made by comparing the reference word with the digit string candidate. Alternatively, each character of the digit string candidate can be checked to see whether it is the minimum-distance candidate, and the number of such characters can be used as the matching character count in the comparison above. In the example of FIG. 4, "0" and "6" are minimum-distance candidates, so the count is 2. Next, candidate characters for each character pattern are combined to form each candidate word, and the cost value of each candidate word is calculated (step S5 in FIG. 2). The output word is then determined (step S6 in FIG. 2), and the selected word is output. A conventional technique can also be used for this step (see Japanese Patent Laid-Open No. Hei 3-196509).
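
The alternative counting method mentioned in this paragraph, counting the digits that are themselves minimum-distance candidates, can be sketched as follows. This is an illustrative Python sketch; the names `FIG3` and `first_candidate_digit_count` are assumptions, and the data are the candidates and distances the document lists for FIG. 3.

```python
FIG3 = [
    [("l", 80), ("1", 90), ("t", 120)],
    [("0", 70), ("o", 80), ("c", 100)],
    [("6", 50), ("5", 90), ("o", 100), ("c", 110)],
    [("o", 80), ("0", 85), ("a", 120)],
    [("l", 80), ("1", 100), ("t", 120)],
]


def first_candidate_digit_count(patterns, digit_string):
    """Count the characters of the digit string candidate that are the
    minimum-distance (first) candidate of their character pattern."""
    count = 0
    for candidates, ch in zip(patterns, digit_string):
        first = min(candidates, key=lambda pair: pair[1])[0]
        if ch == first:
            count += 1
    return count


print(first_candidate_digit_count(FIG3, "10601"))  # 2
```

Because the reference word is itself the sequence of first candidates, this gives the same count as a direct comparison with the reference word, without building that comparison explicitly.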

[0019] FIG. 7 shows the result of calculating the cost value for each candidate word. For "label", the first "l" and the last "l" are listed as candidates, but "a", "b", and "e" are not, so no distance is available for them and the cost value cannot be calculated. For "labor", "l" and "o" are listed as candidates, but "a", "b", and "r" are not, so the cost value cannot be calculated. For "lemon", "l" and "o" are listed, but "e", "m", and "n" are not, so the cost value cannot be calculated. For "level", the first "l" and the last "l" are listed, but the first "e", "v", and the last "e" are not, so the cost value cannot be calculated. For "local", all of the first "l", "o", "c", "a", and the last "l" are listed as candidates and have distances, so the cost value can be calculated; it is 470. For "loyal", the first "l", "o", "a", and the last "l" are listed, but "y" is not, so the cost value cannot be calculated.

[0020] The cost value of the digit string candidate "10601", on the other hand, is 395. The candidate with the lowest cost value, "10601", is therefore selected from among these candidate words and the digit string candidate, and is output as the result of post-processing. The embodiment has been described for English words, but by changing the contents of the dictionary the present invention can also be applied to other languages. In Japanese, for example, the katakana long-vowel mark "ー" and the kanji numeral "一" have similar patterns, and other situations like the one described above can arise when post-processing text that contains digit strings. Given such well-known phenomena, it goes without saying that the present invention is not limited to the embodiment described above.
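
The cost values of FIG. 7 and the final selection can be reproduced with a few lines of Python. This is a sketch for illustration: the names `FIG3` and `cost` are assumptions, `None` stands for "cost value cannot be calculated", and the data are the candidates and distances listed for FIG. 3.

```python
FIG3 = [
    [("l", 80), ("1", 90), ("t", 120)],
    [("0", 70), ("o", 80), ("c", 100)],
    [("6", 50), ("5", 90), ("o", 100), ("c", 110)],
    [("o", 80), ("0", 85), ("a", 120)],
    [("l", 80), ("1", 100), ("t", 120)],
]


def cost(word, patterns):
    """Sum of distances, or None if some character of the word is not a
    listed candidate for its character pattern."""
    if len(word) != len(patterns):
        return None
    total = 0
    for ch, candidates in zip(word, patterns):
        distances = dict(candidates)
        if ch not in distances:
            return None
        total += distances[ch]
    return total


candidates = ["label", "labor", "lemon", "level", "local", "loyal", "10601"]
costs = {w: cost(w, FIG3) for w in candidates}
computable = {w: c for w, c in costs.items() if c is not None}
best = min(computable, key=computable.get)

print(computable)  # {'local': 470, '10601': 395}
print(best)        # 10601
```

The two computable sums are 80 + 80 + 110 + 120 + 80 = 470 for "local" and 90 + 70 + 50 + 85 + 100 = 395 for "10601", matching the values stated in the text.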

[0021]

[Effects of the Invention] As described above, the post-processing method for character recognition of documents with mixed digit strings according to the present invention selects candidate words from a word dictionary, also creates a digit string candidate, and adds this digit string candidate to the post-processing candidates according to its number of characters matching the reference word. Misrecognition in the recognition result of a document containing digit strings can therefore be corrected, and the correct word can be output.

[Brief description of the drawings]

【図１】本発明の数字列混在文書の文字認識の後処理方
法の一実施例のブロック図である。FIG. 1 is a block diagram of an embodiment of a post-processing method for character recognition of a document including a string of numeric characters according to the present invention.

【図２】文字認識後処理手順を説明するフローチャート
である。FIG. 2 is a flowchart illustrating a character recognition post-processing procedure.

【図３】参照単語作成の説明図である。FIG. 3 is an explanatory diagram of reference word creation.

【図４】候補単語選択の説明図である。FIG. 4 is an explanatory diagram of candidate word selection.

【図５】数字列候補作成の説明図である。FIG. 5 is an explanatory diagram of creating a numeric string candidate.

【図６】数字列候補の候補単語への追加判定を説明する
フローチャートである。FIG. 6 is a flowchart illustrating the determination of whether to add a numeric string candidate to the candidate words.

【図７】各候補単語に対するコスト値の計算結果を示す
図である。FIG. 7 is a diagram illustrating a calculation result of a cost value for each candidate word.

[Explanation of symbols]

1: 候補文字決定手段 (candidate character determination means)
2: 参照単語作成手段 (reference word creation means)
3: 候補単語選択手段 (candidate word selection means)
4: 数字列候補作成手段 (numeric string candidate creation means)
5: 一致文字数算出手段 (matching character count calculation means)
6: 候補単語決定手段 (candidate word determination means)
7: コスト値算出手段 (cost value calculation means)
8: 単語出力手段 (word output means)
10: 文字認識後処理装置 (character recognition post-processing device)

フロントページの続き (Continuation of the front page)
(56) References: JP-A-5-135211 (JP, A); JP-A-5-108885 (JP, A); JP-A-5-114052 (JP, A); JP-A-2-103690 (JP, A); JP-A-62-251986 (JP, A); JP-A-1-205288 (JP, A); 「英文認識における単語照合」 ("Word collation in English sentence recognition"), システム制御情報学会研究発表後援会講演論文集, Vol. 36, pp. 275-276, 1992
(58) Fields searched (Int. Cl.⁶, DB name): G06K 9/72; G06K 9/68; Patent file (PATOLIS); JICST file (JOIS)

Claims

(57) [Claims]

1. A post-processing method for character recognition of documents with mixed digit strings, applied to a character recognition result in which each character of an arbitrary word cut out from a document to be recognized is compared with each character data in a character data dictionary to obtain, for each character data, a distance representing its degree of approximation to the character in the cut-out word, and in which the character data with relatively small distances are listed as one or more candidates for each character in the cut-out word, the method comprising: preparing a word dictionary in which words predicted to be present in the document subjected to the character recognition are registered in advance; first, among the candidate character data corresponding to each character in the cut-out word, taking the character data with the smallest distance as the first candidate and ranking the second, third, and subsequent candidates in ascending order of distance; using the word composed by arranging the first-candidate character data as a reference word for consulting the word dictionary; comparing the reference word with each registered word in the word dictionary and selecting, as candidate words, registered words having a large number of characters that match the character data in the reference word; next, if every character in the recognized character string has a digit among its plurality of candidates starting from the first candidate, creating a numeric string candidate from these digits, using, where several digits are listed as candidates for one character, the candidate digit with the smallest distance; calculating the number of matching characters between the reference word and the numeric string candidate, either by comparing the reference word with the numeric string candidate or by counting the digits in the numeric string candidate that are first candidates; adding the numeric string candidate to the candidate words when this number of matching characters is equal to or greater than the number of characters by which the candidate words selected from the word dictionary match the reference word; calculating, for each candidate word or numeric string candidate whose characters or digits all appear among the listed candidates starting from the first candidate, a cost value that is the sum of the distances; and outputting the candidate word or numeric string candidate with the smallest cost value.
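The final steps of the claim, computing the cost value and outputting the minimum-cost candidate, might look like the following Python sketch. It is illustrative only: `cost_value`, `select_output`, the `(character, distance)` layout, and every distance value are assumptions for the example, and the dictionary word "loyal" is hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed layout: one list per character position of (character, distance) pairs.
Candidates = List[Tuple[str, int]]


def cost_value(word: str, result: List[Candidates]) -> Optional[int]:
    """Sum of the distances of the candidates used to spell `word`.

    Returns None when some character of `word` is not among the candidates
    listed for its position, i.e. the word cannot be built from the result.
    """
    if len(word) != len(result):
        return None
    total = 0
    for ch, cands in zip(word, result):
        lookup = dict(cands)
        if ch not in lookup:
            return None
        total += lookup[ch]
    return total


def select_output(candidates: List[str], result: List[Candidates]) -> Optional[str]:
    """Return the candidate word or numeric string candidate with the smallest cost."""
    scored = []
    for word in candidates:
        cost = cost_value(word, result)
        if cost is not None:
            scored.append((cost, word))
    if not scored:
        return None
    return min(scored)[1]


# Hypothetical recognition result; the distances are invented for illustration.
example = [
    [("l", 10), ("1", 12)],
    [("0", 8), ("O", 9)],
    [("6", 5), ("b", 20)],
    [("O", 7), ("0", 9)],
    [("1", 6), ("l", 8)],
]
print(cost_value("10601", example))                 # 40
print(select_output(["loyal", "10601"], example))   # 10601
```

In this example "loyal" receives no cost value because "o", "y" and "a" are not among the listed candidates, so only "10601" remains and is output. Skipping such words is one reading of the claim's condition that all characters or digits appear among the listed candidates.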