JPH09218921A - General document reader

Info

Publication number: JPH09218921A
Application number: JP8025393A
Authority: JP
Inventors: Yukiko Chiba; 由紀子千葉
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-02-13
Filing date: 1996-02-13
Publication date: 1997-08-19

Abstract

PROBLEM TO BE SOLVED: To improve the recognition rate for character strings composed only of alphabetic characters, and thereby the operational reliability of the device, by examining the uppercase/lowercase similarity tendency of each letter in an alphabetic string, determining the letter-case characteristics of that string, and judging uppercase and lowercase correctly. SOLUTION: A word matching unit 6 splits the character recognition result into words, matches each against a word dictionary 7, and extracts candidate words; an output forming unit 8 selects the word to output from the candidate word group. The uppercase/lowercase similarity tendency of each character in the selected word is then examined, and the word is classified as (1) a word that should be entirely uppercase, (2) a word that should be entirely lowercase, or (3) a word whose first character alone should be uppercase with the rest lowercase. Based on this classification, the correct case is selected for each character of the word, which improves the recognition rate.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a general document reading device that recognizes and reads the characters constituting a general document.

[0002]

[Prior art] In a conventional general document reading device, the document image to be read is generally divided, automatically or manually, into character, table, ruled-line, figure, and photograph regions, and character recognition is performed on the character regions and on the character portions of the table regions. To improve the reading accuracy of the character recognition result, a widely used post-processing method applies word matching or linguistic processing, using the sequence of candidate character sets corresponding to the input character string together with a word dictionary, and takes the matching word as the answer for the input character string.

[0003]

[Problem to be solved by the invention] However, with the prior art configured as described above, when alphabetic characters are recognized it is very difficult to determine whether a character whose uppercase and lowercase forms have the same shape, such as "P", was originally written in uppercase or in lowercase. Even when word matching against a word dictionary is performed, it remains difficult to decide the case of such a character.

[0004] In view of the above problem, an object of the present invention is to improve the recognition rate for character string portions consisting only of alphabetic characters, among the portions that the general document reading device recognizes as characters, and thereby to increase the operational reliability of the device.

[0005]

[Means for solving the problem] To achieve the above object, the present invention examines, for letters in an alphabetic string whose uppercase and lowercase shapes are similar, the uppercase/lowercase similarity tendency of each letter making up the string, and determines the letter-case characteristics of the string, so that uppercase and lowercase are judged correctly.

[0006] That is, the present invention provides a general document reading device that recognizes and reads the characters constituting a general document. The document image to be read is divided into character, table, ruled-line, figure, and photograph regions, after which a scanning unit scans the character regions and the character portions of the table regions and transfers the image signal obtained by photoelectric conversion to a character recognition unit. For each character string judged to be an alphabetic string among the sets of candidate characters and distance values formed by the character recognition unit, word matching or linguistic processing is applied using a word dictionary, and a matching word is selected. For the selected word, the device refers to an uppercase/lowercase similarity tendency table for alphabetic characters and judges whether each character of the word is "certainly uppercase" or "certainly lowercase", then counts how many "certainly uppercase" and "certainly lowercase" characters the word contains. From these counts and the character type of the word's first character, the device judges whether the word should be all lowercase, all uppercase, uppercase in its first character only with the rest lowercase, or mixed case. Based on that judgment, it decides whether each character in the word is uppercase or lowercase, swaps uppercase and lowercase where necessary, and outputs the result.

[0007]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention. In the figure, 1 is a form to be read, 2 is an image input device, 3 is a layout analysis unit, 4 is a scanning unit, 5 is a character recognition unit, 6 is a word matching unit, 7 is a word dictionary used by the word matching unit, and 8 is an output character string forming unit.

[0008] FIG. 2 is an explanatory diagram showing an example of the result of character recognition of a character string on the form 1. FIG. 3 is an explanatory diagram showing an example of the candidate word group obtained by matching the character recognition result of FIG. 2 against the word dictionary. FIG. 4 is an explanatory diagram showing the result when an output character string is selected from the candidate word group of FIG. 3 by the prior art.

[0009] FIG. 5 is an explanatory diagram showing an example of the table used in this embodiment to determine the uppercase/lowercase similarity tendency of each alphabetic character. FIG. 6 is an explanatory diagram showing an example of determining the character type of the output words of FIG. 4 by referring to the table of FIG. 5. FIG. 7 is an explanatory diagram showing an example of the table used in this embodiment to determine the characteristics of a word from the similarity tendency of the characters that make it up.

[0010] FIG. 8 is an explanatory diagram showing an example in which an output character string is selected from the candidate word group of FIG. 3 according to this embodiment. The operation of this embodiment is described below. The document image input from the image input device of FIG. 1 is divided, automatically or manually, into character, table, ruled-line, figure, and photograph regions by the layout analysis unit 3 of FIG. 1.

[0011] The scanning unit scans the regions that the layout analysis unit judged to be character regions, together with the character portions inside table regions, and transfers the image signal obtained by photoelectric conversion to the character recognition unit. The character recognition unit calculates the distance between the shape of each input character and the standard shape of each character, forms a set of candidate characters and distance values ordered by increasing distance (that is, by decreasing similarity of shape), and outputs it to the word matching unit. FIG. 2 shows the recognition target and an example of the character recognition result.
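
The ranking step in paragraph [0011] can be sketched as follows. This is only an illustration under stated assumptions: the three-element feature vectors, the `TEMPLATES` table, and the function name `rank_candidates` are hypothetical, and Euclidean distance stands in for whatever distance measure the device actually uses, which the text does not specify.

```python
from math import dist

# Hypothetical feature vectors standing in for the "standard shape" of each
# character. The text does not describe the real features.
TEMPLATES = {
    "o": (0.50, 0.50, 0.40),
    "O": (0.50, 0.50, 0.90),
    "a": (0.10, 0.90, 0.40),
    "M": (0.90, 0.10, 0.90),
}

def rank_candidates(features, templates=TEMPLATES, top_n=3):
    """Return (character, distance) pairs ordered by increasing distance."""
    scored = [(ch, dist(features, tpl)) for ch, tpl in templates.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance = most similar
    return scored[:top_n]

# One input glyph whose features sit close to a lowercase "o".
candidates = rank_candidates((0.50, 0.50, 0.50))
for ch, d in candidates:
    print(f"{ch}: {d:.3f}")
```

Note that "o" and "O" land next to each other in the ranking, which is the situation the later case-correction steps are meant to resolve.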

[0012] The word matching unit splits the input character recognition result at appropriate positions, forms the character strings that can be obtained by combining entries from the sequence of candidate-character and distance-value sets described above, looks them up in the word dictionary, and extracts those that exist in it. When the extracted character string is alphabetic, an English word dictionary is used. All words registered in the word dictionary are written in uppercase, and the character recognition result is first converted entirely to uppercase before being matched against the dictionary entries. FIG. 3 is an example in which the character recognition result is split into words, matched against the word dictionary, and the candidate words are extracted.
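
As a rough sketch of the matching step in paragraph [0012], the code below uppercases the candidates at each position, enumerates their combinations, and keeps the ones found in an uppercase-only dictionary. The dictionary contents, the candidate lists, and the name `match_words` are assumptions made for illustration, and the segment is assumed to be already split into one word.

```python
from itertools import product

# Hypothetical word dictionary; as in the text, entries are all uppercase.
WORD_DICTIONARY = {"HAPPY", "BIRTHDAY", "TO", "TOM"}

def match_words(candidate_sets, dictionary=WORD_DICTIONARY):
    """candidate_sets holds one list per character position, best candidate first."""
    # Convert every candidate to uppercase, dropping duplicates but keeping order.
    upper_sets = [list(dict.fromkeys(c.upper() for c in cands))
                  for cands in candidate_sets]
    matches = []
    for combo in product(*upper_sets):
        word = "".join(combo)
        if word in dictionary:
            matches.append(word)
    return matches

# Hypothetical recognition result for a three-character word.
segment = [["T", "I"], ["o", "O", "0"], ["M", "N"]]
print(match_words(segment))
```

Enumerating every combination grows quickly with word length, so a practical implementation would likely prune by distance or walk a trie instead; the exhaustive form is used here only because it is the shortest correct illustration.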

[0013] The output forming unit selects the word to output from the candidate word group extracted by the word matching unit. FIG. 4 is an example in which words with a higher average candidate rank are selected preferentially; that is, it shows the selection result under the prior art. The character recognition result, which was temporarily converted entirely to uppercase for the word matching of FIG. 3, is converted back by the output forming unit to the character type obtained at recognition time. When both the uppercase and lowercase forms of the same letter appear among the candidate characters at a character position, the higher-ranked candidate is output. For example, at the position of "o" in "ToM" in FIG. 4, both "o" and "O" are listed as candidates, but because "o" is the higher-ranked recognized character, "o" is output rather than "O".
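
The prior-art case restoration described in paragraph [0013] amounts to picking, at each position, the best-ranked candidate that agrees with the dictionary letter. A minimal sketch, assuming hypothetical candidate lists (FIG. 2 is not reproduced here) and a hypothetical function name `restore_case`:

```python
def restore_case(matched_word, candidate_sets):
    """Map an uppercase dictionary word back to the case seen at recognition time."""
    out = []
    for letter, cands in zip(matched_word, candidate_sets):
        # Candidates are ordered best first, so the first one whose uppercase
        # form equals the dictionary letter is the higher-ranked case variant.
        chosen = next((c for c in cands if c.upper() == letter), letter)
        out.append(chosen)
    return "".join(out)

# Hypothetical candidates: "o" is ranked above "O" at the second position.
segment = [["T", "I"], ["o", "O", "0"], ["M", "N"]]
print(restore_case("TOM", segment))
```

With these inputs the result is "ToM", the mixed-case output that the prior art produces in the FIG. 4 example.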

[0014] Word matching with a word dictionary can correct the sequence of letters, but it cannot determine whether "o" or "O" is correct, so the judgment made at character recognition time takes precedence, as in the example of FIG. 4. In the present invention, for alphabetic strings in the character recognition result, after the output forming unit has extracted a candidate word, the uppercase/lowercase similarity tendency of each character of the word is examined, and the word is judged to be (1) a word that should be entirely uppercase, (2) a word that should be entirely lowercase, or (3) a word whose first character alone should be uppercase with the rest lowercase. Correctly selecting the case of each character of the word based on this judgment improves the recognition rate.

[0015] The procedure for selecting output characters according to this embodiment is described below with reference to FIGS. 5 to 7. First, for each character of the word, the uppercase/lowercase similarity tendency table shown in FIG. 5 is consulted to judge whether the character is certainly uppercase or certainly lowercase. If the character belongs to the non-similar characters, it is judged to be certainly uppercase or certainly lowercase. If the character belongs to the similar characters, it is judged that uppercase and lowercase may have been confused during recognition. FIG. 6 shows an example in which the uppercase/lowercase similarity tendency of each character in the output forming result of FIG. 4 is examined and the characters that can be judged certainly uppercase or certainly lowercase are identified.
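
The table lookup in paragraph [0015] can be sketched as a three-way classification. The actual contents of the FIG. 5 table are not reproduced in this text, so the set `SIMILAR_LETTERS` below is a hypothetical stand-in inferred only from the worked examples later in the description; the labels and the function name `classify` are likewise assumptions.

```python
# Hypothetical stand-in for the FIG. 5 table: letters whose uppercase and
# lowercase shapes are treated as similar. Inferred from the worked examples
# only; the real table may list different letters.
SIMILAR_LETTERS = set("IOPTY")

def classify(ch):
    """Label one recognized letter using the similarity table."""
    if ch.upper() in SIMILAR_LETTERS:
        return "ambiguous"  # case may have been misrecognized
    return "certain_upper" if ch.isupper() else "certain_lower"

for ch in "apPy":
    print(ch, classify(ch))
```

For the characters of "apPy", only "a" comes out as certainly lowercase, while "p", "P", and "y" remain ambiguous.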

[0016] Next, the number of characters in each word that can be judged certainly uppercase, and the number that can be judged certainly lowercase, are counted. The first character of the word is excluded from this check, because it is more likely to be written in uppercase than characters at other positions. The "total" column of each table in FIG. 6 gives, for each word, the number of characters judged certainly uppercase or certainly lowercase.
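
The counting step in paragraph [0016] might look like the sketch below, which skips the first character and tallies the remaining unambiguous letters. The set `SIMILAR_LETTERS` is again a hypothetical substitute for the FIG. 5 table, and `count_certain` is an illustrative name.

```python
# Hypothetical substitute for the FIG. 5 similarity table.
SIMILAR_LETTERS = set("IOPTY")

def count_certain(word):
    """Return (certainly_lowercase, certainly_uppercase) totals for a word."""
    lower = upper = 0
    for ch in word[1:]:  # the first character is excluded from the check
        if ch.upper() in SIMILAR_LETTERS:
            continue     # similar-shape letters give no reliable evidence
        if ch.isupper():
            upper += 1
        else:
            lower += 1
    return lower, upper

for word in ["HapPy", "bIrThdaY", "tO", "ToM"]:
    print(word, count_certain(word))
```

Under this assumed table, the totals agree with the counts the description gives for the four example words: one certainly lowercase character for "HapPy", four for "bIrThdaY", none of either kind for "tO", and one certainly uppercase character for "ToM".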

[0017] Next, using the "total" values of FIG. 6 and the first character of each word in the output result of FIG. 4, the characteristics of each word are judged with the table of FIG. 7, and any character in the word that contradicts those characteristics is corrected. FIG. 8 is an example of the output character string determined from FIGS. 4, 6, and 7. For "HapPy" in FIG. 4, the first character "H" is excluded from the check as described above, and "apPy" is checked. As shown in FIG. 6(1), "apPy" contains one "certainly lowercase" character and no "certainly uppercase" character. In addition, the excluded first character "H" is judged to be uppercase, so from the table of FIG. 7 the word is judged to be one written with only its first character in uppercase and the rest in lowercase. The "P" in "HapPy" contradicts this characteristic, so "P" is corrected to "p" and "Happy" is output.

[0018] For "bIrThdaY" in FIG. 4, the first character "b" is excluded from the check as described above, and "IrThdaY" is checked. As shown in FIG. 6(2), "IrThdaY" contains four "certainly lowercase" characters and no "certainly uppercase" character. In addition, the excluded first character "b" is judged to be lowercase, so from the table of FIG. 7 the word is judged to be one written entirely in lowercase. The "I", "T", and "Y" in "bIrThdaY" contradict this characteristic, so they are corrected to "i", "t", and "y", and "birthday" is output.

[0019] For "tO" in FIG. 4, the first character "t" is excluded from the check as described above, and "O" is checked. As shown in FIG. 6(3), "O" is neither "certainly lowercase" nor "certainly uppercase". In addition, the excluded first character "t" is judged to be lowercase, so from the table of FIG. 7 the word is judged to be one written entirely in lowercase. The "O" in "tO" contradicts this characteristic, so "O" is corrected to "o" and "to" is output.

[0020] For "ToM" in FIG. 4, the first character "T" is excluded from the check as described above, and "oM" is checked. As shown in FIG. 6(4), "oM" contains no "certainly lowercase" character and one "certainly uppercase" character. In addition, the excluded first character "T" is judged to be uppercase, so from the table of FIG. 7 the word is judged to be one written entirely in uppercase. The "o" in "ToM" contradicts this characteristic, so "o" is corrected to "O" and "TOM" is output.
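
The four worked examples in paragraphs [0017] to [0020] can be reproduced with a short sketch. It is not the patent's implementation: `SIMILAR_LETTERS` is a hypothetical substitute for the FIG. 5 table, and the rules in `decide_style` are reconstructed from the four examples only. The "mixed" row and the "undetermined" fallback are assumptions, since the full FIG. 7 table is not reproduced in this text.

```python
# Hypothetical substitute for the FIG. 5 similarity table.
SIMILAR_LETTERS = set("IOPTY")

def decide_style(word):
    """Judge the word's case pattern from its first letter and the counts."""
    # Unambiguous letters after the first character.
    rest = [c for c in word[1:] if c.upper() not in SIMILAR_LETTERS]
    upper = sum(c.isupper() for c in rest)
    lower = len(rest) - upper
    first_is_upper = word[0].isupper()
    if lower and upper:
        return "mixed"         # assumed row: evidence for both cases
    if upper and first_is_upper:
        return "all_upper"     # as in the "ToM" example
    if lower and first_is_upper:
        return "capitalized"   # as in the "HapPy" example
    if not upper and not first_is_upper:
        return "all_lower"     # as in the "bIrThdaY" and "tO" examples
    return "undetermined"      # rows not covered by the worked examples

def correct_case(word):
    """Rewrite letters that contradict the judged pattern."""
    style = decide_style(word)
    if style == "all_upper":
        return word.upper()
    if style == "all_lower":
        return word.lower()
    if style == "capitalized":
        return word[0] + word[1:].lower()
    return word  # mixed or undetermined: leave the word unchanged

for word in ["HapPy", "bIrThdaY", "tO", "ToM"]:
    print(word, "->", correct_case(word))
```

Leaving mixed and undetermined words untouched is a conservative choice made for this sketch; it avoids overwriting the recognizer's output where the description gives no rule.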

[0021] Although the above embodiment describes a device that recognizes characters read from an image, the present invention is also applicable to a device that recognizes pen-input characters.

[0022]

[Effect of the invention] As described in detail above, according to the present invention, in a general document reading device that recognizes and reads the characters constituting a general document, the document image to be read is divided into character, table, ruled-line, figure, and photograph regions, after which a scanning unit scans the character regions and the character portions of the table regions and transfers the image signal obtained by photoelectric conversion to a character recognition unit. For each character string judged to be an alphabetic string among the sets of candidate characters and distance values formed by the character recognition unit, word matching or linguistic processing is applied using a word dictionary, and a matching word is selected. For the selected word, the uppercase/lowercase similarity tendency table for alphabetic characters is consulted to judge whether each character of the word is "certainly uppercase" or "certainly lowercase", and the number of "certainly uppercase" and "certainly lowercase" characters in the word is counted. From these counts and the character type of the word's first character, it is judged whether the word should be all lowercase, all uppercase, uppercase in its first character only with the rest lowercase, or mixed case. Based on that judgment, each character in the word is judged to be uppercase or lowercase, and uppercase and lowercase are swapped where necessary before output. Consequently, for letters in an alphabetic string whose uppercase and lowercase shapes are similar, the uppercase/lowercase similarity tendency of each letter making up the string is examined and the letter-case characteristics of the string are determined, so that uppercase and lowercase can be judged correctly.

[0023] This makes it possible to improve the recognition rate for character string portions consisting only of alphabetic characters, which has the effect of increasing the operational reliability of the device.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of the result of character recognition of a character string on a form.

FIG. 3 is an explanatory diagram showing an example of the candidate word group obtained by matching the character recognition result of FIG. 2 against the word dictionary.

FIG. 4 is an explanatory diagram showing the result when an output character string is selected from the candidate word group of FIG. 3 by the prior art.

FIG. 5 is an explanatory diagram showing an example of the table used in one embodiment of the present invention to determine the uppercase/lowercase similarity tendency of each alphabetic character.

FIG. 6 is an explanatory diagram showing an example of determining the character type of the output words of FIG. 4 by referring to the table of FIG. 5.

FIG. 7 is an explanatory diagram showing an example of the table used in one embodiment of the present invention to determine the characteristics of a word from the similarity tendency of the characters that make it up.

FIG. 8 is an explanatory diagram showing an example in which an output character string is selected from the candidate word group of FIG. 3 according to one embodiment of the present invention.

[Explanation of symbols]

1 form
2 image input device
3 layout analysis unit
4 scanning unit
5 character recognition unit
6 word matching unit
7 word dictionary
8 output character string forming unit

Claims

[Claims]

1. A general document reading device that recognizes and reads the characters constituting a general document, characterized in that: the document image to be read is divided into character, table, ruled-line, figure, and photograph regions, after which a scanning unit scans the character regions and the character portions of the table regions and transfers the image signal obtained by photoelectric conversion to a character recognition unit; for each character string judged to be an alphabetic string among the sets of candidate characters and distance values formed by the character recognition unit, word matching or linguistic processing is applied using a word dictionary and a matching word is selected; for the selected word, an uppercase/lowercase similarity tendency table for alphabetic characters is consulted to judge whether each character of the word is "certainly uppercase" or "certainly lowercase"; the number of "certainly uppercase" and "certainly lowercase" characters contained in the word is counted; from the result of that count and the character type of the word's first character, it is judged whether the word should be all lowercase, all uppercase, uppercase in its first character only with the rest lowercase, or mixed case; and, based on that judgment, each character in the word is judged to be uppercase or lowercase, and uppercase and lowercase are swapped where necessary before output.