JPH06215198A

JPH06215198A - Character recognition post-processing system

Info

Publication number: JPH06215198A
Application number: JP5003614A
Authority: JP
Inventors: Shinji Sase; 慎治佐瀬
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-01-12
Filing date: 1993-01-12
Publication date: 1994-08-05
Anticipated expiration: 2011-12-04
Also published as: JP2560959B2

Abstract

PURPOSE: To handle Japanese text and alphabetic character strings together in character recognition post-processing, and to enable collation by unit of meaning as well, by matching the unit of collation to the unit of a word. CONSTITUTION: Using the character recognition result 20 as key characters, dictionary reading processing 11 reads character arrangement information from a dictionary 21. Dictionary generation processing 12 analyzes the read character arrangement information, detects delimiters, and divides the character arrangement information either at the delimiters or into pieces of a prescribed length. A collation part 13 collates the character recognition result 20 with the divided character arrangement information; during collation, delimiter characters are handled specially. The results for the divided pieces are accumulated and taken as the collation result for one piece of character arrangement information in the dictionary.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a character recognition device, and more particularly to a character recognition post-processing method for confirming and correcting reading results.

[0002]

[Prior Art] Post-processing of character recognition is widely used as a means of compensating for the imperfection of character recognition.

[0003] As for the nature of input character strings, Western text is written with words separated from one another, whereas Japanese text is customarily written solid, with no spaces between words. The post-processing technique therefore differs depending on the text being processed.

[0004] Post-processing for Western text divides the character recognition result into words on the basis of delimiter information such as blanks, and performs collation word by word. Processing for solid-written Japanese, by contrast, performs collation without assuming that a character string set off by delimiter information such as blanks is a single word.

[0005]

[Problems to Be Solved by the Invention] When the post-processing methods described above are applied to reading mixed Japanese and Western text, then depending on the technique adopted, either the Japanese must also be written with spaces between words, or Western words may be extracted run together without delimiter information.

[0006] There are also character strings, such as book titles, that are structurally a sequence of words but are better handled semantically as a single word. To deal with such strings in the conventional post-processing described above, one can either treat the string as a sequence of words and register each word in the dictionary separately, or register the whole sequence in the dictionary as a single word. With the former, however, it becomes difficult to treat the whole as one semantic unit; with the latter, the word becomes long and the processing time becomes enormous.

[0007] Furthermore, if semantically coherent word sequences can be identified, the relationships among them (for example, between a book title and its author's name) can be expressed with a simple structure. Conventional techniques, however, either can handle only a single semantically coherent word sequence or cannot adequately partition word sequences into semantic units.

[0008] In addition, in English and similar text, letters change between uppercase and lowercase depending on the format, and names are often abbreviated to initials, so such variations must be handled. Performing collation that copes with these variations consistently makes the processing very complicated.

[0009] The present invention has been made in view of the above circumstances, and its object is to provide a novel character recognition post-processing method that makes it possible to solve the above problems inherent in the prior art.

[0010]

[Means for Solving the Problems] To achieve the above object, the character recognition post-processing method according to the present invention comprises: means for reading information to be collated from a dictionary according to the character recognition result; means for dividing the read character string according to the character recognition result and the read character arrangement information to generate a hierarchical divided dictionary; means for collating the input character string with the read character arrangement information and obtaining their similarity sequentially in hierarchical order; and means for finally determining the post-processing result corresponding to the input character string.

[0011]

[Embodiments] The present invention will now be described concretely in terms of its preferred embodiments, with reference to the drawings.

[0012] FIG. 1 is a block diagram showing an embodiment of the present invention (the invention according to claim 1). The processing flow is outlined below.

[0013] Referring to FIG. 1, when the character recognition result 20 is given, process start 10 activates dictionary reading 11.

[0014] FIG. 2 shows an example of the character recognition result 20; here, three recognition candidates are output for each written character.

[0015] Dictionary reading 11 searches the dictionary 21 using the character recognition result 20 as key characters, and selects and reads the word information to be collated. The dictionary generation unit 12 sets the collation range of the input character string around the delimiter information, analyzes the read word information on that basis, sequentially generates collation information divided into shorter parts as necessary, and passes each piece to the collation unit 13 together with its collation position.

[0016] After all the word information has been sent to the collation unit 13, an end-of-transmission notice is sent to the collation unit 13 and the next dictionary read is executed.

[0017] When the division information of a word does not match the delimiter information of the input character string, dictionary generation is aborted and the next dictionary read is executed.

[0018] The collation unit 13 collates each piece of divided character arrangement information, as it arrives, with the character recognition result 20 and accumulates the results. On receiving the end-of-transmission notice from dictionary generation 12, the collation unit 13 compares the accumulated result with the intermediate collation results 22 and, if the accumulated result could still remain in the final determination result, adds it to the intermediate collation results 22.

[0019] When no dictionary entries remain to be read in dictionary reading 11, determination 14 is activated; it refers to the intermediate collation results 22, creates the determination result 23, and the processing ends.
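The flow of paragraphs [0013] to [0019] can be pictured as a loop over dictionary entries. The following Python sketch is only an illustration under simplifying assumptions, not the implementation described in the patent: the dictionary is a plain list of strings, entries are keyed on their first character, only blanks act as delimiters, an entry must be as long as the input, and the score is a sum derived from candidate rank. The names `split_words` and `post_process` are hypothetical.

```python
def split_words(entry):
    """Divide a dictionary string at blanks into (start_position, piece) pairs."""
    pieces, start = [], 0
    for i, ch in enumerate(entry + " "):
        if ch == " ":
            if i > start:
                pieces.append((start, entry[start:i]))
            start = i + 1
    return pieces


def post_process(candidates, dictionary, keep=3):
    """Return the best-matching dictionary string, or None if nothing survives."""
    intermediate = []                                   # intermediate collation results 22
    keys = set(candidates[0]) if candidates else set()
    for entry in dictionary:                            # dictionary reading 11
        if not entry or entry[0] not in keys:
            continue                                    # key characters do not match
        if len(entry) != len(candidates):
            continue                                    # simplification of this sketch
        total, rejected = 0, False
        for start, piece in split_words(entry):         # dictionary generation 12
            score = 0
            for offset, ch in enumerate(piece):         # collation 13
                cands = candidates[start + offset]
                if ch in cands:
                    score += len(cands) - cands.index(ch)
            if score == 0:                              # no character of the piece matched
                rejected = True
                break
            total += score
        if not rejected:
            intermediate.append((total, entry))
            intermediate.sort(key=lambda item: item[0], reverse=True)
            del intermediate[keep:]                     # drop the least likely candidates
    return intermediate[0][1] if intermediate else None  # determination 14


# Three recognition candidates per written character, best candidate first.
recognized = [
    ["I", "J", "T"], ["O", "0", "Q"], ["H", "N", "A"], ["N", "M", "H"],
    [" ", "_", "-"],
    ["5", "S", "8"], ["M", "N", "H"], ["I", "1", "l"], ["T", "I", "7"], ["H", "N", "M"],
]
entries = ["JOHN SMITH", "JOAN SMITH", "IOWA STATE", "JOHN"]
print(post_process(recognized, entries))  # JOHN SMITH
```

Scoring piece by piece is what lets a mismatching entry be abandoned early, which is the point of dividing long entries in the first place.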

[0020] The character recognition post-processing method configured as above can be implemented with a CPU (central processing unit) and storage media (RAM, hard disk, floppy disk, etc.) that hold the various processing programs and data.

[0021] Processes 11 to 13 are described in detail below.

[0022] In dictionary reading process 11, character arrangement information is read from the dictionary 21 using the character recognition result 20 as key characters. FIG. 3 shows an example of character arrangement information: one piece of character arrangement information, divided off as a semantic unit. The character storage position indicates where the referenced character code string is stored, and the character count indicates the number of character codes referenced in that area. The connection information indicates the connection relationship with the character arrangement information of other semantic units, and the content information gives semantic information about this entry. The symbol Δ in the character code string denotes a blank code.
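The fields listed in [0022] map naturally onto a small record type. The sketch below is one possible Python representation, assuming a single shared string as the character code pool; the class name `ArrangementInfo`, its field names, the helper `read_codes`, and the sample strings are all hypothetical and are not taken from FIG. 3.

```python
from dataclasses import dataclass, field
from typing import List

BLANK = "Δ"  # the text uses Δ to denote a blank code


@dataclass
class ArrangementInfo:
    """One semantic unit of character arrangement information (assumed layout)."""
    storage_position: int                 # where the character code string starts
    char_count: int                       # number of character codes referenced
    connections: List[int] = field(default_factory=list)  # indexes of related entries
    content: str = ""                     # semantic label such as "title" or "name"


def read_codes(pool: str, info: ArrangementInfo) -> str:
    """Return the character code string that an entry refers to."""
    return pool[info.storage_position:info.storage_position + info.char_count]


# A shared pool holding two semantic units back to back.
pool = "PATTERN" + BLANK + "RECOGNITION" + "JOHN" + BLANK + "SMITH"
title = ArrangementInfo(storage_position=0, char_count=19, connections=[1], content="title")
author = ArrangementInfo(storage_position=19, char_count=10, content="name")
```

Keeping the code string outside the record, referenced by position and count, mirrors the storage format the paragraph describes and lets several entries share one pool.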

[0023] In dictionary generation process 12, the character recognition result 20 is searched for delimiters and their positions are recorded. Two kinds of delimiters are extracted: blanks and other symbols ("," and ";"). Next, the read character arrangement information is searched for delimiters, the delimiter positions are compared with the delimiter positions of the character recognition result, and when they match the character arrangement information is divided there. If there are no delimiters, the character arrangement information is divided into pieces of a predetermined length. The delimiter comparison covers all delimiters at positions other than the final character; at the final character, only a blank is compared.
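The division step in [0023] can be sketched in a few lines of Python. This is a hedged illustration, not the patent's procedure in full: it assumes ASCII delimiters, treats a position as a delimiter if any of its candidates is one, and omits the special rule for the final character. The function names and the default `chunk_len` are hypothetical.

```python
DELIMITERS = {" ", ",", ";"}  # blank plus the other symbols named in the text


def delimiter_positions(candidates):
    """Positions whose candidate list contains a delimiter (assumption: any rank counts)."""
    return [i for i, cands in enumerate(candidates)
            if any(c in DELIMITERS for c in cands)]


def split_entry(entry, input_delims, chunk_len=4):
    """Divide a dictionary string at its delimiters, or into fixed-length pieces.

    Returns a list of (start_position, piece) pairs, or None when the entry's
    delimiter positions disagree with the input, in which case generation stops.
    """
    entry_delims = [i for i, ch in enumerate(entry) if ch in DELIMITERS]
    if not entry_delims:
        # No delimiter: fall back to pieces of a predetermined length.
        return [(i, entry[i:i + chunk_len]) for i in range(0, len(entry), chunk_len)]
    if any(p not in input_delims for p in entry_delims):
        return None
    pieces, start = [], 0
    for p in entry_delims + [len(entry)]:
        if p > start:
            pieces.append((start, entry[start:p]))
        start = p + 1
    return pieces


recognized = [
    ["J", "I", "T"], ["O", "0", "Q"], ["H", "N", "M"], ["N", "M", "H"],
    [" ", "_", "-"],
    ["S", "5", "8"], ["M", "N", "H"], ["I", "1", "l"], ["T", "I", "7"], ["H", "N", "M"],
]
input_delims = delimiter_positions(recognized)
```

With this input, `split_entry("JOHN SMITH", input_delims)` yields the pieces `JOHN` and `SMITH` with their start positions, while an entry whose blank falls elsewhere returns `None`.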

[0024] FIG. 4 shows an example in which the character arrangement information of FIG. 3 is divided with respect to the character recognition result of FIG. 2.

[0025] The divided character arrangement information is sent sequentially to the collation unit 13, and after the last piece has been sent, an end-of-transmission notice is sent.

[0026] The delimiter comparison described above succeeds when the delimiter positions in the character arrangement information coincide with the delimiter positions in the character recognition result, and the non-delimiter portions of the character arrangement information coincide with the non-blank portions of the character recognition result.

[0027] The collation unit 13 is described with reference to FIG. 5.

[0028] Referring to FIG. 5, the collation unit 13 calculates the similarity between the divided character arrangement information and the character recognition result 20. Similarity calculation 132 checks whether the character specified by the character arrangement information is among the candidates at the specified position of the character recognition result. The candidate rank from character recognition may be used as the similarity value, for example. For a delimiter character, however, a delimiter table registered in advance is used to find the best delimiter candidate, and that character is substituted into the divided character arrangement information; this detection is not included in the similarity calculation.
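A rank-based similarity of the kind suggested in [0028] might look like the Python sketch below. The scoring rule (number of candidates minus the zero-based rank) is an assumption chosen for illustration; the text only says that the candidate rank may be used. The function name `piece_similarity` and the sample data are hypothetical, and delimiters are simply skipped rather than replaced from a delimiter table.

```python
def piece_similarity(candidates, start, piece, delimiters=frozenset(" ,;")):
    """Score one divided piece against the recognition candidates.

    Returns (score, hits). A character found at rank r among n candidates
    adds n - r to the score; hits counts the matched non-delimiter characters.
    """
    score, hits = 0, 0
    for offset, ch in enumerate(piece):
        if ch in delimiters:
            continue                      # delimiters are not part of the similarity
        pos = start + offset
        if pos >= len(candidates):
            continue                      # piece runs past the end of the input
        cands = candidates[pos]
        if ch in cands:
            score += len(cands) - cands.index(ch)
            hits += 1
    return score, hits


recognized = [
    ["J", "I", "T"], ["O", "0", "Q"], ["H", "N", "M"], ["N", "M", "H"],
    [" ", "_", "-"],
    ["S", "5", "8"], ["M", "N", "H"], ["I", "1", "l"], ["T", "I", "7"], ["H", "N", "M"],
]
```

A piece whose `hits` value is zero corresponds to the case in which no character is found at its position, so a caller would treat it as a rejection rather than a low score.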

[0029] If none of the characters of a piece of divided character arrangement information, excluding delimiters, is present at the corresponding positions of the character recognition result, reject flag 135 is set. Otherwise, the similarity is accumulated (133) and stored in the primary storage buffer 136 together with the collation position and the character arrangement information.

[0030] On receiving the notice that transmission of the division information has ended, collation candidate addition 134 is performed. First, reject flag 135 is checked. If the reject flag 135 is not set, the contents of the primary storage buffer 136 are compared with the intermediate collation results 22; if the candidate could remain among the final determination candidates, it is added to the intermediate collation results 22, and if the number of candidates in the intermediate collation results 22 then exceeds a predetermined number, the least likely candidate is deleted.
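The candidate bookkeeping in [0030] amounts to maintaining a short, ranked list. The Python sketch below assumes that a higher accumulated score means a more likely candidate and that a candidate is a `(score, position, text)` tuple; the function name `add_candidate` and the limit of three are hypothetical.

```python
def add_candidate(intermediate, candidate, rejected, max_candidates=3):
    """Return the intermediate results with the candidate merged in.

    Nothing is added when the reject flag is set. The list is kept sorted by
    descending score and cut to max_candidates, which drops the least likely entry.
    """
    if rejected:
        return list(intermediate)
    merged = sorted(intermediate + [candidate], key=lambda c: c[0], reverse=True)
    return merged[:max_candidates]


results = []
results = add_candidate(results, (12, 0, "JOHN"), rejected=False)
results = add_candidate(results, (5, 0, "JOIN"), rejected=False)
results = add_candidate(results, (15, 5, "SMITH"), rejected=False)
results = add_candidate(results, (9, 5, "SMYTH"), rejected=False)
results = add_candidate(results, (99, 0, "XXXX"), rejected=True)
```

After these calls the list holds `SMITH`, `JOHN`, and `SMYTH` in that order: `JOIN` was pushed out when the limit was exceeded, and the rejected entry never entered.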

[0031] FIG. 6 is a processing flow showing the character recognition post-processing method according to claim 2. Dictionary reading 31 performs a dictionary search using the candidate characters of the character recognition result 20 as key characters and, in addition, performs a dictionary search based on the intermediate collation results 22. In doing so, the connection information of FIG. 4 is used as the dictionary search information.

[0032] Determination 32 creates the determination result 23 by combining collation candidates from the intermediate collation results 22 on the basis of their connection information and collation positions.
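One way to read [0032] is that determination 32 looks for candidates that are linked by connection information and occupy compatible positions in the input. The Python sketch below is an assumption-laden illustration of that idea: candidates are dictionaries with hypothetical keys `id`, `start`, `end`, and `score`, links are ordered pairs of entry ids, and the best pair is the one with the largest summed score.

```python
def combine(candidates, links):
    """Return (total_score, first_id, second_id) for the best linked pair, or None.

    A pair qualifies when (first_id, second_id) is in links and the first
    candidate ends at or before the position where the second one starts.
    """
    best = None
    for a in candidates:
        for b in candidates:
            if (a["id"], b["id"]) in links and a["end"] <= b["start"]:
                total = a["score"] + b["score"]
                if best is None or total > best[0]:
                    best = (total, a["id"], b["id"])
    return best


found = [
    {"id": 10, "start": 0, "end": 19, "score": 40},   # a book title
    {"id": 20, "start": 20, "end": 30, "score": 25},  # the author linked to that title
    {"id": 21, "start": 20, "end": 30, "score": 28},  # an unrelated author
]
links = {(10, 20)}
print(combine(found, links))  # (65, 10, 20)
```

Note that the unrelated author has the higher individual score but is not chosen, because only the linked pair satisfies the connection information.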

[0033] In the invention according to claim 3, when the dictionary generation of claim 1 or 2 detects that a piece of divided character arrangement information consists of alphabetic characters, or of alphabetic characters and digits, three kinds of character arrangement information are created by conversion between uppercase and lowercase. FIG. 7 shows the division information created from FIG. 3.
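The case handling in [0033] can be illustrated with a short Python function. The text does not spell out the three variants here; the sketch assumes they are all uppercase, all lowercase, and first letter capitalized, which is a guess based on the wording of claim 3. The function name `case_variants` is hypothetical.

```python
def case_variants(piece):
    """Return case variants for an ASCII alphanumeric piece containing a letter.

    Pieces with non-ASCII characters, symbols, or no letters are returned as is.
    """
    has_letter = any(ch.isalpha() for ch in piece)
    if not (piece.isascii() and piece.isalnum() and has_letter):
        return [piece]
    # Assumed variants: UPPERCASE, lowercase, Capitalized.
    return [piece.upper(), piece.lower(), piece.capitalize()]
```

For example, `case_variants("Smith")` gives `["SMITH", "smith", "Smith"]`, while a Japanese piece such as `"文字"` or a purely numeric piece is left unchanged. Generating the variants at dictionary generation time means the dictionary itself needs only one registered form.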

[0034] In the invention according to claim 4, in the dictionary generation of claim 1, 2, or 3, when the divided character arrangement information consists of alphabetic characters, the number of divisions is two or more, and the content information field of the character arrangement information is marked as a personal name, every piece other than the last is replaced with its initial. FIG. 8 shows an example of this generation.
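The initial substitution in [0034] is simple to sketch in Python. The label `"name"` for the content information and the function name `with_initials` are hypothetical; the three conditions checked are the ones the paragraph lists.

```python
def with_initials(pieces, content):
    """Replace every piece but the last with its initial for alphabetic names.

    The pieces are returned unchanged unless the entry is marked as a name,
    has at least two pieces, and every piece is purely ASCII alphabetic.
    """
    if content != "name" or len(pieces) < 2:
        return list(pieces)
    if not all(p.isascii() and p.isalpha() for p in pieces):
        return list(pieces)
    return [p[0] for p in pieces[:-1]] + [pieces[-1]]
```

For example, `with_initials(["John", "Ronald", "Tolkien"], "name")` gives `["J", "R", "Tolkien"]`, whereas the pieces of a title are returned as they are.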

[0035]

[Effects of the Invention] As described above, according to the present invention, character arrangement information in which Japanese and alphabetic notation are mixed can be handled in the same semantic units, and alphabetic text, which generally has more characters than Japanese notation, can be processed at about the same speed as Japanese.

[0036] Further, the invention according to claim 2 also enables collation based on connection relationships between semantic units, such as a book title and its author's name.

[0037] The invention according to claim 3 can cope with inconsistent use of uppercase and lowercase in alphabetic writing using a single registered entry, and can also cope with English text contained in part of a Japanese text.

[0038] The invention according to claim 4 additionally makes it possible to cope with variations in how a personal name is written using a single registered entry.

[0039] Accordingly, the present invention is highly effective as character recognition post-processing for entry forms whose content is typified by a book title, author name, and publisher name.

[Brief description of drawings]

FIG. 1 is a functional block diagram showing an embodiment of the invention according to claim 1.

FIG. 2 is a diagram showing an example of a character recognition result.

FIG. 3 is a diagram showing an example of the storage format of one piece of character arrangement information in the dictionary.

FIG. 4 is a diagram showing the divided character arrangement information generated from the character arrangement information of FIG. 3.

FIG. 5 is a block diagram showing the detailed functions of collation function 13.

FIG. 6 is a functional block diagram showing an embodiment of the invention according to claim 2.

FIG. 7 is a diagram showing the character arrangement information generated from the character arrangement information of FIG. 3 by the invention according to claim 3.

FIG. 8 is a diagram showing an example of the generation of divided character arrangement information by the invention according to claim 4.

[Explanation of symbols]

10 ... Start of processing
11, 31 ... Dictionary reading processing
12 ... Dictionary generation processing
13 ... Collation processing
14, 32 ... Determination processing
15 ... End of processing
20 ... Character recognition result
21 ... Dictionary
22 ... Intermediate collation result
23 ... Determination result
131 ... Division end
132 ... Similarity calculation
133 ... Result accumulation
134 ... Addition of collation candidate
135 ... Reject flag
136 ... Primary storage buffer

Claims

[Claims]

1. A character recognition post-processing method for confirming and correcting the result of character recognition on the basis of a dictionary describing information on the arrangement of characters, comprising: means for reading information to be collated from the dictionary according to the result of character recognition; means for dividing the character string according to the character recognition result and the read character arrangement information to create hierarchical divided character arrangement information; means for collating the input character string with the read character arrangement information and obtaining their similarity sequentially in hierarchical order; and means for finally determining a post-processing result corresponding to the input character string.

2. A character recognition post-processing method for confirming and correcting the result of character recognition on the basis of a dictionary describing information on the arrangement of characters, comprising: means for reading information to be collated from the dictionary according to the result of character recognition; means for dividing the character string according to the character recognition result and the read character arrangement information to create hierarchical divided character arrangement information; means for collating the input character string with the read character arrangement information and obtaining their similarity; and means for finally determining a post-processing result corresponding to the input character string, wherein, when dictionary information to be collated is selected and read from the dictionary, the result of collating the input character string with dictionary information already read is used as a read condition together with the result of character recognition.

3. The character recognition post-processing method according to claim 1 or 2, wherein, when the dictionary is read according to the character recognition result and the character recognition result includes alphabetic characters, the dictionary is read without distinguishing between uppercase and lowercase; and wherein, when a divided dictionary is created and the divided dictionary consists only of alphabetic characters and digits, three kinds of divided dictionary are generated, namely a divided dictionary consisting only of uppercase letters, a divided dictionary consisting only of lowercase letters, and a dictionary consisting of uppercase and lowercase letters, and the dictionary having the highest similarity is selected in advance.

4. The character recognition post-processing method according to any one of claims 1, 2, and 3, wherein, when a divided dictionary is created, the divided dictionary consists only of alphabetic characters and digits, and the content of the dictionary information is limited to a personal name, a dictionary composed of initials is generated.