JPH0696285A

JPH0696285A - Character recognizing device

Info

Publication number: JPH0696285A
Application number: JP4241599A
Authority: JP
Inventors: Yukiya Sugiyama; 幸也杉山
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1992-09-10
Filing date: 1992-09-10
Publication date: 1994-04-08

Abstract

PURPOSE: To provide a highly reliable character recognizing device that can obtain the correct solution characters even when two adjacent half size characters are erroneously segmented as one full size character while character image data is obtained by segmenting characters from image data. CONSTITUTION: The character recognizing device is provided with a character segmenting part 2 for segmenting image data into image data for each character, a character recognizing part 3 for recognizing the image data and converting it to a code, a language processing part 4 for executing language processing on the recognition result of the character recognizing part 3, a suspected word extracting part 5 for extracting from the word group, as a suspected word, a word that is a one-character word and consists of a character which can be separated into left and right parts, a suspected word resegmenting part 6 for segmenting the image data of the suspected word as two half size characters, a suspected word character recognizing part 7 for recognizing that image data and converting it to a code, a language reprocessing part 8 for executing language processing on the recognition result after the resegmentation, and a deciding part 9 for comparing the recognition result of the language processing part with the recognition result of the language reprocessing part and determining a final solution character string.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition device that is used as an input device for a computer or the like, and that recognizes patterns such as printed characters, dot-matrix characters, and handwritten characters in newspapers, magazines, novels, and the like and converts them into code information such as JIS codes.

[0002]

[Prior Art] In recent years, with the rapid development of information devices such as computers, documents have increasingly been digitized to improve work efficiency, and character recognition devices that recognize patterns such as printed characters and convert them into character codes have come into wide use as input devices for computers and the like.

[0003] A conventional character recognition device is described below. Its character recognition method is explained using the recognition target document shown in (Table 1).

[0004]

[Table 1]

[0005] First, the recognition target document is read by an image reading device to obtain image data.

[0006] Next, character image data for each character is cut out from the image data. The character image data is then recognized to obtain a character recognition result. (Table 2) shows the character recognition result for the original text shown in (Table 1).

[0007]

[Table 2]

[0008] Here, because the two adjacent half-width characters ‘（’ and ‘３’ in the original text were erroneously cut out as one full-width character, the original “（３” is misrecognized as “行”, “往”, or “径”.

[0009] Next, language processing is performed based on the character recognition result in (Table 2). First, the candidate characters are combined to form character strings. These strings are checked against a word dictionary, and only the matching strings are kept as candidate words. The candidate words extracted in this way are shown in (Table 3).

[0010]

[Table 3]

[0011] Next, the candidate words are combined to obtain clauses that conform to the grammatical rules. In this example, the longest matching word is adopted for words composed of multiple characters, and the first candidate character is adopted for candidate words composed of a single character.

[0012] The adopted clauses are shown in (Table 4).

[0013]

[Table 4]

[0014] Through the above processing, ‘行）文字認識’ (‘行’, a right parenthesis, and the word for "character recognition") is obtained as the solution character string. Thus, because the two adjacent half-width characters in the original text were erroneously cut out as one full-width character, the original “（３” is misrecognized as “行”.

[0015]

[Problems to Be Solved by the Invention] However, in the conventional method described above, if two adjacent half-width characters are erroneously cut out as one full-width character when character image data is cut out from the image data, the misrecognition cannot be corrected by the subsequent processing. As a result, the recognition rate drops and reliability suffers.

[0016] The present invention solves the above conventional problem. Its object is to provide a highly reliable character recognition device that can obtain the correct solution characters even when two adjacent half-width characters are erroneously cut out as one full-width character when character image data is cut out from the image data.

[0017]

[Means for Solving the Problems] To achieve this object, the character recognition device of the present invention comprises: a character cutout unit that cuts the image data read by an image reading unit into character image data for each character; a character recognition unit that recognizes the character image data cut out by the character cutout unit and converts it into character codes; a language processing unit that performs language processing on the character recognition result produced by the character recognition unit and changes the result from character-unit divisions to word-unit divisions; a suspected word extraction unit that extracts, as a suspected word, each word in the word group obtained by the language processing unit that is a single-character word and consists of a character separable into left and right parts; a suspected word re-cutout unit that, judging that a suspected word extracted by the suspected word extraction unit is likely to result from misrecognizing adjacent half-width characters in the recognition target document as one full-width character, cuts out the image data of the suspected word as two half-width characters; a suspected word character recognition unit that recognizes the character image data cut out by the suspected word re-cutout unit and converts it into character codes; a re-language processing unit that performs language processing on the character recognition result obtained by the suspected word character recognition unit after re-cutting; and a determination unit that compares the character recognition result from the language processing unit with that from the re-language processing unit and determines the final solution character string.

[0018]

[Operation] With this configuration, by extracting suspected words and re-cutting them, the correct solution character string can be obtained even when two adjacent half-width characters have been erroneously cut out as one full-width character.

[0019]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0020] FIG. 1 is a block diagram of a character recognition device according to an embodiment of the present invention. Reference numeral 1 denotes an image reading unit that photoelectrically converts the recognition target document to obtain image data; 2 denotes a character cutout unit that cuts the image data read by the image reading unit 1 into character image data for each character; 3 denotes a character recognition unit that recognizes the character image data cut out by the character cutout unit 2 and converts it into character codes; 4 denotes a language processing unit that selects the correct characters by creating clauses that conform to grammatical rules, based on the character recognition result from the character recognition unit 3; 5 denotes a suspected word extraction unit that extracts, as a suspected word, each word in the word group obtained by the language processing unit 4 that is a single-character word and consists of a character separable into left and right parts; 6 denotes a suspected word re-cutout unit that cuts out the image data of a suspected word extracted by the suspected word extraction unit 5 as two half-width characters; 7 denotes a suspected word character recognition unit that recognizes the character image data cut out by the suspected word re-cutout unit 6 and converts it into character codes; 8 denotes a re-language processing unit that performs language processing on the character recognition result obtained by the suspected word character recognition unit 7 after re-cutting; and 9 denotes a determination unit that compares the character recognition result from the language processing unit 4 with that from the re-language processing unit 8 and determines the final solution character string.

[0021] The operation of the character recognition device of this embodiment, configured as described above, is explained below using the recognition target document shown in (Table 1) as an example.

[0022] FIG. 2 is a flowchart of the image reading unit, the character cutout unit, and the character recognition unit of the character recognition device according to an embodiment of the present invention; FIG. 3 is a flowchart of the language processing unit; FIG. 4 is a flowchart of the suspected word extraction unit; FIG. 5 is a flowchart of the suspected word re-cutout unit, the suspected word re-recognition unit, and the re-language processing unit; and FIG. 6 is a flowchart of the determination unit.

[0023] In FIG. 2, first, the image reading unit 1 photoelectrically converts the recognition target document to obtain image data (S1).

[0024] Next, the character cutout unit 2 obtains character image data for each character from the image data (S2).

[0025] Next, the character recognition unit 3 performs character recognition based on the character image data (S3).

[0026] (Table 2) shows the specific character recognition result for the recognition target document shown in (Table 1).

[0027] In (Table 2), the upper row shows the recognition results and the lower row shows the similarity. Next, language processing is performed.

[0028] In FIG. 3, provisional clauses are first set using character type transitions (S4). (Table 5) shows the specific provisional clause setting result for the character recognition result shown in (Table 2).

[0029]

[Table 5]

[0030] Here, a character type transition is a point at which the character type changes. Provisional clauses are set by consulting a predetermined character type transition table to decide whether each such point is treated as a clause boundary. The character type transition table is shown in (Table 6).

[0031]

[Table 6]

[0032] In (Table 6), "1" indicates that the point becomes a provisional clause setting point, that is, a clause boundary, and "0" indicates that it does not.
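The provisional clause setting of S4 can be illustrated with a minimal Python sketch. The contents of (Table 6) are not reproduced in the text, so the character type categories, the `TRANSITIONS` entries, and the fallback rule for unlisted pairs are all assumptions made for illustration; only the meaning of "1" (boundary) and "0" (no boundary) comes from the description.

```python
import unicodedata

# Hypothetical character type transition table: 1 = provisional clause
# boundary between the two types, 0 = no boundary. The real Table 6 is not
# available, so every entry here is an assumption.
TRANSITIONS = {
    ("kanji", "kanji"): 0,
    ("kanji", "hiragana"): 0,
    ("hiragana", "hiragana"): 0,
    ("hiragana", "kanji"): 1,
    ("kanji", "symbol"): 1,
    ("symbol", "kanji"): 1,
}


def char_type(ch):
    """Classify one character into a coarse, assumed character type."""
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    if ch.isdigit():
        return "digit"
    return "symbol"


def set_provisional_clauses(text):
    """Split a string of first-candidate characters into provisional clauses."""
    if not text:
        return []
    clauses = [text[0]]
    for prev, cur in zip(text, text[1:]):
        key = (char_type(prev), char_type(cur))
        # Assumed fallback for pairs missing from the table:
        # break whenever the character type changes.
        is_break = TRANSITIONS.get(key, 1 if key[0] != key[1] else 0)
        if is_break:
            clauses.append(cur)
        else:
            clauses[-1] += cur
    return clauses


print(set_provisional_clauses("行）文宇認織"))
```

With these assumed entries, the first-candidate string from the example splits into three provisional clauses, which agrees with the 0th, 1st, and 2nd provisional clauses referred to in the worked example of the description.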

[0033] Next, the provisional clause number is denoted by i, and i is set to 0 (S5). Then attention is focused on the i-th provisional clause (S6).

[0034] Next, the character number relative to the start of the provisional clause is denoted by k, and k is set to 0 (S7). Then the candidate rank is denoted by j, and j is set to 0 (S8).

[0035] Next, character strings that begin with the character at row j, column k of the i-th provisional clause are created (S9). The character strings created in S9 are then checked against the word dictionary, and only the matching strings are extracted as candidate words (S10).

[0036] Next, 1 is added to j (S11). Then it is checked whether j is less than 3 (S12).

[0037] If yes, processing jumps to S9. If no, 1 is added to k (S13).

[0038] Next, it is checked whether k is less than the number of characters in the i-th provisional clause (S14). If yes, processing jumps to S8.

[0039] If no, the candidate words extracted in S10 are combined to form candidate clauses that conform to the grammatical rules (S15).

[0040] Next, the average similarity of each candidate clause is obtained (S16). Then the candidate clause with the highest average similarity is adopted as the solution (S17).

[0041] Next, 1 is added to i (S18). Then it is checked whether i is less than the total number of provisional clauses (S19).

[0042] If yes, the processing from S6 onward is performed. If no, the language processing ends.
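The candidate word extraction loop (S7 to S14) for a single provisional clause can be sketched as follows. This is a minimal illustration, not the device's implementation: the lattice layout (`lattice[k][j]` holding the rank-j candidate at position k), the candidate characters other than those named in the description, and the three-word dictionary are assumptions.

```python
from itertools import product

# Assumed candidate lattice for the provisional clause '文宇認織':
# lattice[k][j] is the rank-j candidate character at position k.
LATTICE = [
    ["文", "又", "丈"],
    ["宇", "字", "章"],
    ["認", "詔", "謡"],
    ["織", "識", "職"],
]

# Assumed word dictionary, reduced to the words needed for the example.
DICTIONARY = {"文字", "文章", "認識"}


def extract_candidate_words(lattice, dictionary, max_rank=3):
    """Return (start position, word) pairs found in the dictionary."""
    found = []
    for k in range(len(lattice)):                          # S7, S13, S14
        for j in range(min(max_rank, len(lattice[k]))):    # S8, S11, S12
            head = lattice[k][j]
            # S9: build strings that start with the rank-j character at
            # position k and continue with any candidates that follow.
            for end in range(k + 1, len(lattice) + 1):
                for tail in product(*lattice[k + 1:end]):
                    word = head + "".join(tail)
                    if word in dictionary:                 # S10
                        found.append((k, word))
    return found


print(extract_candidate_words(LATTICE, DICTIONARY))
```

Enumerating every combination grows quickly with clause length, so a practical system would more likely walk a dictionary trie; the exhaustive form is used here only because it follows the flowchart steps directly.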

[0043] A specific example of language processing for the provisional clause setting result shown in (Table 5) is given below.

[0044] First, the 0th and 1st provisional clauses each consist of a single character, so the first candidate character is adopted as the solution.

[0045] Next, attention is focused on the 2nd provisional clause ‘文宇認織’ (S6). Then character strings beginning with ‘文’ are created (S9).

[0046] The created character strings are shown in (Table 7).

[0047]

[Table 7]

[0048] Next, the created character strings are checked against the word dictionary, and only the matching words are extracted (S10).

[0049] Next, the above processing is performed for all candidate characters (S8 to S14), giving the candidate word group shown in (Table 8).

[0050]

[Table 8]

[0051] Next, the candidate words are combined to form candidate clauses (S15). Since ‘文字’ ("character") and ‘認識’ ("recognition") are both nouns, they can be connected, giving the candidate clause ‘文字認識’. Similarly, since ‘文章’ ("sentence") and ‘認識’ are both nouns, they can be connected, giving the candidate clause ‘文章認識’.

[0052] Next, the average similarity of each candidate clause is obtained, and the clause with the larger value is adopted as the solution (S16 to S17).

[0053] The average similarity of the candidate clause ‘文字認識’ is 950, and that of the candidate clause ‘文章認識’ is 925. Therefore, ‘文字認識’ is adopted as the solution.
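The selection by average similarity (S16 to S17) amounts to a mean followed by a maximum. In the sketch below, the per-character similarity values are hypothetical: (Table 2) is not reproduced in the text, so the numbers were chosen only so that the averages match the 950 and 925 stated above. Keying the similarity by character rather than by lattice position is also a simplification.

```python
# Hypothetical per-character similarities, chosen so that the averages
# reproduce the values given in the description (950 and 925).
SIMILARITY = {"文": 980, "字": 940, "章": 840, "認": 960, "識": 920}


def average_similarity(clause, similarity):
    """S16: mean similarity of the characters that make up a candidate clause."""
    return sum(similarity[ch] for ch in clause) / len(clause)


def choose_clause(candidates, similarity):
    """S17: adopt the candidate clause with the highest average similarity."""
    return max(candidates, key=lambda c: average_similarity(c, similarity))


for clause in ("文字認識", "文章認識"):
    print(clause, average_similarity(clause, SIMILARITY))
print(choose_clause(["文字認識", "文章認識"], SIMILARITY))
```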

[0054] The clauses obtained by the above processing are shown in (Table 9).

[0055]

[Table 9]

[0056] Next, suspected word extraction is performed. In FIG. 4, first, the clause number is denoted by i, and i is set to 0 (S20).

[0057] Next, the number of a word within the clause, counted from the start of the clause, is denoted by j, and j is set to 0 (S21).

[0058] Next, attention is focused on the j-th word of the i-th clause (S22). Then it is checked whether the word of interest consists of a single character (S23).

[0059] If no, processing jumps to S26. If yes, it is checked whether the word of interest is a separable character (S24).

[0060] Here, a separable character is a character, such as ‘行’ or ‘語’, that is composed of a left-hand radical (hen) and a right-hand component (tsukuri) and can therefore be divided into left and right parts.

[0061] If no, processing jumps to S26. If yes, the word of interest is treated as a suspected word, and its character image data number is stored in the array doubt (S25).

[0062] Next, 1 is added to j (S26). Then it is checked whether j is less than the number of words in the i-th clause (S27).

[0063] If yes, processing jumps to S22. If no, 1 is added to i (S28).

[0064] Next, it is checked whether i is less than the number of clauses (S29). If yes, processing jumps to S21.

[0065] If no, suspected word extraction ends. A specific example of suspected word extraction for the language processing result shown in (Table 9) is given below.

[0066] First, the 0th clause consists only of ‘行’ (S22). ‘行’ is a single-character word (S23).

[0067] Next, ‘行’ is a separable character (S24). Therefore ‘行’ is treated as a suspected word, and 0, the character image data number of ‘行’, is assigned to doubt[0] (S25).

[0068] Next, the 1st clause is not a separable character (S24). Therefore it contains no suspected word.

[0069] Next, the 2nd clause consists of the two-character words ‘文字’ and ‘認識’ (S23).

[0070] Therefore it contains no suspected word. Next, suspected word re-cutting, suspected word character recognition, and re-language processing are performed.
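The suspected word extraction of FIG. 4 (S20 to S29), as walked through above, can be sketched in Python. The data layout (each clause as a list of `(word, image_index)` pairs, where `image_index` is the character image data number of the word's first character) and the contents of the `SEPARABLE` set are assumptions; the description only says that characters such as ‘行’ and ‘語’ are separable.

```python
# Assumed set of separable characters (left-hand radical plus right-hand
# component). Only '行' and '語' are named in the description; the others
# are illustrative additions.
SEPARABLE = {"行", "語", "往", "径"}


def extract_suspected_words(clauses, separable=SEPARABLE):
    """Return the character image data numbers of all suspected words."""
    doubt = []
    for clause in clauses:                    # S20, S28, S29
        for word, image_index in clause:      # S21, S22, S26, S27
            if len(word) != 1:                # S23: single-character words only
                continue
            if word not in separable:         # S24: must be a separable character
                continue
            doubt.append(image_index)         # S25: remember the image number
    return doubt


# Language processing result of the example: three clauses.
clauses = [
    [("行", 0)],
    [("）", 1)],
    [("文字", 2), ("認識", 4)],
]
print(extract_suspected_words(clauses))
```

For the example clauses this yields a single entry, 0, which corresponds to doubt[0] = 0 in the description.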

[0071] In FIG. 5, first, the suspected word number is denoted by i, and i is set to 0 (S30). Next, the character image data number of the suspected word is denoted by j, and doubt[i] is assigned to j (S31).

[0072] Next, the j-th character image data is re-cut to obtain two character image data, left and right (S32).

[0073] Next, each of the two character image data obtained in S32 is recognized (S33).

[0074] Next, 1 is added to i (S34). Then it is checked whether i is less than the number of suspected words (S35).

[0075] If yes, processing jumps to S31. If no, the character recognition result from before re-cutting and the one from after re-cutting are integrated (S36).

[0076] Next, language processing similar to that of FIG. 3 is performed again on the character recognition result integrated in S36 (S37).
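The re-cutting loop (S30 to S35) can be sketched as below. The description does not say where the suspected image is divided, so splitting at the middle column is an assumption, as are the bitmap representation (a list of equal-length rows of 0/1 pixels) and the `recognize` callback standing in for the suspected word character recognition unit. The integration step S36 is not modeled, because the layout of (Table 11) is not available.

```python
def split_left_right(bitmap):
    """S32: cut a character bitmap into left and right halves.

    Splitting at the middle column is an assumption; the description only
    states that two character images, left and right, are obtained.
    """
    mid = len(bitmap[0]) // 2
    left = [row[:mid] for row in bitmap]
    right = [row[mid:] for row in bitmap]
    return left, right


def resegment(images, doubt, recognize):
    """Re-cut and re-recognize every suspected image.

    images    : list of character bitmaps, indexed by image data number
    doubt     : image data numbers of the suspected words
    recognize : hypothetical callback mapping a bitmap to a recognition result
    Returns {image data number: (left result, right result)}.
    """
    recut = {}
    for j in doubt:                                       # S30, S31, S34, S35
        left, right = split_left_right(images[j])         # S32
        recut[j] = (recognize(left), recognize(right))    # S33
    return recut


# Toy data: one 2x4 bitmap and a stand-in recognizer that counts ink pixels.
images = [[[1, 0, 0, 1], [1, 0, 1, 1]]]
count_ink = lambda bitmap: sum(sum(row) for row in bitmap)
print(resegment(images, [0], count_ink))
```

A real device would presumably pass each half to the same recognizer used in S3 and receive ranked candidates with similarities; the ink-counting stand-in is there only to keep the sketch runnable.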

[0077] A specific example of suspected word re-cutting, suspected word character recognition, and re-language processing for the suspected word extraction result described above is given below.

[0078] First, 0 is stored in doubt[0] (S31). This indicates that the 0th character image data is a suspected word.

[0079] Next, the 0th character image data is re-cut to obtain two character image data, left and right (S32).

[0080] Next, each of these character image data is recognized (S33). The character recognition result at this time is shown in (Table 10).

[0081]

[Table 10]

[0082] Next, the recognition result from before re-cutting and the one from after re-cutting are integrated (S36).

[0083] The integration result is shown in (Table 11).

[0084]

[Table 11]

[0085] Next, language processing is performed again on this integrated result (S37). The re-language processing result is shown in (Table 12).

[0086]

[Table 12]

[0087] Next, the determination is performed. In FIG. 6, first, the suspected word number is denoted by i, and i is set to 0 (S38).

[0088] Next, the image data number of the suspected word is denoted by j, and doubt[i] is assigned to j (S39).

[0089] Next, attention is focused on the clause from before re-cutting and the clause from after re-cutting that contain the recognition result of the j-th character image data (S40).

[0090] Next, the certainty of the two clauses of interest is compared, and the clause to be adopted as the solution is determined (S41).

[0091] The certainty is determined using knowledge such as how frequently the words constituting a clause appear in general documents, and the fact that characters such as parentheses are often used in pairs.

[0092] Next, 1 is added to i (S42). Then it is checked whether i is less than the number of suspected words (S43).

[0093] If yes, processing jumps to S39. If no, all processing ends.

[0094] A specific example of the determination for the re-language processing result shown in (Table 12) is given below.

[0095] First, 0 is stored in doubt[0] (S39). Next, attention is focused on the clauses that contain the 0th character image data: the clause ‘行’ from before re-cutting and the clause ‘（３’ from after re-cutting (S40).

[0096] Next, the certainty of the two clauses is compared (S41). The clause ‘行’ has no relation at all to the adjacent clauses. The clause ‘（３’, on the other hand, contains a left parenthesis. Using the knowledge that parentheses are often used in pairs, the other clauses are checked for a right parenthesis. The adjacent clause contains the right parenthesis ‘）’, so the clause ‘（３’ is related to another clause.

[0097] Therefore, the clause from after re-cutting is judged to be more certain and is adopted as the solution.
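The bracket-pairing part of the determination (S40 to S41) can be illustrated with a short Python sketch. It models only the "parentheses are often used in pairs" knowledge; the word frequency knowledge mentioned in the description is left out. The `PAIRS` table, the scoring by a simple count, and the tie rule that keeps the clause from before re-cutting are assumptions.

```python
# Assumed table of opening brackets and their closing partners.
PAIRS = {"（": "）", "(": ")", "「": "」"}


def bracket_support(clause, other_clauses):
    """Count opening brackets in clause whose partner appears in another clause."""
    rest = "".join(other_clauses)
    return sum(1 for ch in clause if ch in PAIRS and PAIRS[ch] in rest)


def decide(before, after, other_clauses):
    """S41: choose between the clause before and after re-cutting.

    The clause after re-cutting wins only if it is better supported by
    bracket pairing; on a tie the original clause is kept (an assumption).
    """
    if bracket_support(after, other_clauses) > bracket_support(before, other_clauses):
        return after
    return before


others = ["）", "文字認識"]
chosen = decide("行", "（３", others)
print(chosen + "".join(others))
```

For the example, the clause after re-cutting is chosen because its left parenthesis is matched by the right parenthesis in the adjacent clause, and concatenating the clauses gives the solution string of the description.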

[0098] Through the above processing, ‘（３）文字認識’ is obtained as the final solution character string, and the recognition target document is recognized correctly.

[0099]

[Effects of the Invention] As described above, according to the present invention, by extracting suspected words and re-cutting them, a highly reliable character recognition device can be realized that recognizes the original text correctly even when two adjacent half-width characters have been erroneously cut out as one full-width character.

[Brief description of drawings]

FIG. 1 is a block diagram of a character recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the control procedure of the image reading unit, the character cutout unit, and the character recognition unit of the character recognition device according to an embodiment of the present invention.

FIG. 3 is a flowchart showing the control procedure of the language processing unit of the character recognition device according to an embodiment of the present invention.

FIG. 4 is a flowchart showing the control procedure of the suspected word extraction unit of the character recognition device according to an embodiment of the present invention.

FIG. 5 is a flowchart showing the control procedure of the suspected word re-cutout unit, the suspected word re-recognition unit, and the re-language processing unit of the character recognition device according to an embodiment of the present invention.

FIG. 6 is a flowchart showing the control procedure of the determination unit of the character recognition device according to an embodiment of the present invention.

[Explanation of symbols]

1 Image reading unit
2 Character cutout unit
3 Character recognition unit
4 Language processing unit
5 Suspected word extraction unit
6 Suspected word re-cutout unit
7 Suspected word character recognition unit
8 Re-language processing unit
9 Determination unit

Claims

[Claims]

1. A character recognition device comprising: a character cutout unit that cuts image data read by an image reading unit into character image data for each character; a character recognition unit that recognizes the character image data cut out by the character cutout unit and converts it into character codes; a language processing unit that performs language processing on the character recognition result produced by the character recognition unit and changes the result from character-unit divisions to word-unit divisions; a suspected word extraction unit that extracts, as a suspected word, each word in the word group obtained by the language processing unit that is a single-character word and consists of a character separable into left and right parts; a suspected word re-cutout unit that, judging that a suspected word extracted by the suspected word extraction unit is likely to result from misrecognizing adjacent half-width characters in the recognition target document as one full-width character, cuts out the image data of the suspected word as two half-width characters; a suspected word character recognition unit that recognizes the character image data cut out by the suspected word re-cutout unit and converts it into character codes; a re-language processing unit that performs language processing on the character recognition result obtained by the suspected word character recognition unit after re-cutting; and a determination unit that compares the character recognition result from the language processing unit with the character recognition result from the re-language processing unit and determines a final solution character string.