JP2503259B2 - How to determine full-width and half-width characters

Info

Publication number: JP2503259B2
Application number: JP63252029A
Authority: JP
Inventors: 泰二森
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1988-10-07
Filing date: 1988-10-07
Publication date: 1996-06-05
Anticipated expiration: 2011-06-05
Also published as: JPH02100189A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a method for determining full-width and half-width characters in a character recognition device (OCR).

[Conventional technology]

In general, the size of a kanji or similar character is called full-width, and half of that width is called half-width. The character types that have both full-width and half-width forms are katakana, alphabetic letters, and numerals. Conventional OCR has distinguished full-width from half-width characters solely by relying on the fact that a half-width character is half as wide as a full-width character. That is, characters are segmented from the document image, and the size of a full-width character is extracted in the process. The size (width) of each segmented character is then compared with the full-width size to decide whether it is a full-width or a half-width character.
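The conventional width comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `classify_by_width` and the 0.75 ratio threshold (the midpoint between the half-width ratio 0.5 and the full-width ratio 1.0) are assumptions, since the text does not state how the comparison is thresholded.

```python
def classify_by_width(char_width: float, full_width: float,
                      ratio_threshold: float = 0.75) -> str:
    """Return "full" or "half" by comparing a character's width with the
    reference full-width size. The 0.75 threshold is an assumed value."""
    if full_width <= 0:
        raise ValueError("full_width must be positive")
    ratio = char_width / full_width
    return "full" if ratio >= ratio_threshold else "half"


# A character as wide as the reference is full-width; one half as wide is half-width.
print(classify_by_width(24, 24))
print(classify_by_width(12, 24))
```

Because the decision depends on width alone, any full-width glyph whose ink is narrower than the threshold is reported as half-width.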

[Problems to be Solved by the Invention]

However, with this approach, full-width characters that happen to be narrow, such as the katakana "イ" (i), "リ" (ri), and "ト" (to), the numeral "１", and the letter "Ｉ", are judged to be half-width. The recognition result then contains a mixture of full-width and half-width characters and is hard to read when output. For example, for a character row or column (character string) such as the one in FIG. 2, the numeral "1" and the katakana "ト" (to), "リ" (ri), and "ツ" (tsu) are recognized as half-width characters. In other words, existing OCR may misrecognize a narrow full-width character as a half-width character.

Accordingly, an object of the present invention is to provide a method that prevents a narrow full-width character from being misrecognized as a half-width character.

[Means for solving the problem]

An unknown character is subjected to recognition processing to extract a candidate character, and it is determined whether the candidate is a full-width or a half-width character and which character type it belongs to. Then, for each character string, it is checked whether a character of a character type that has half-width forms is present. If so, the partial character string consisting of the run of consecutive characters of that same character type is extracted, the number of full-width characters in that partial string is counted, and, according to that count, the partial string is unified into a string consisting only of full-width characters or only of half-width characters.

[Action]

For characters belonging to a character type in which full-width and half-width forms coexist, not only the size (width) of the character but also its context, that is, the characters before and after it, is examined. This prevents narrow full-width characters from being misjudged as half-width, which improves reliability and ensures a good appearance when the result is output.

[Example]

FIG. 1 is a flowchart showing an embodiment of the present invention.

First, character image data is segmented by known image processing, and in the process the reference full-width character size is extracted. Next, the size of the target character is compared with the full-width character size to determine whether the character is full-width or half-width, and the result is stored in memory as full-width/half-width character information. The target character is then recognized, and the result is converted into a predetermined character code and stored in memory. In the process, the character type is determined from the character code. By repeating these steps, from segmentation through recognition, the recognition result for one character row or column (character string) is obtained.
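Determining the character type from the character code might look like the sketch below. It assumes Unicode code points purely for illustration; the patent only says "a predetermined character code", which at the time would more likely have been a JIS encoding. The function name `char_type` and the type labels are hypothetical.

```python
from typing import Optional


def char_type(ch: str) -> Optional[str]:
    """Return "katakana", "alpha", or "digit" for the character types that
    have both full-width and half-width forms, and None for anything else.
    Ranges are Unicode code points (an assumption, not the patent's code)."""
    cp = ord(ch)
    # Full-width katakana letters, and half-width katakana letters and
    # voicing marks (the half-width long vowel mark U+FF70 is excluded).
    if (0x30A1 <= cp <= 0x30FA or 0xFF66 <= cp <= 0xFF6F
            or 0xFF71 <= cp <= 0xFF9F):
        return "katakana"
    # ASCII letters and their full-width counterparts.
    if ("A" <= ch <= "Z" or "a" <= ch <= "z"
            or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A):
        return "alpha"
    # ASCII digits and their full-width counterparts.
    if "0" <= ch <= "9" or 0xFF10 <= cp <= 0xFF19:
        return "digit"
    return None
```

Here kanji, hiragana, punctuation, and the long vowel mark all map to `None`, so they are not treated as members of a mixed-width character type by themselves.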

Next, so that narrow full-width characters in the recognized character string are not misjudged as half-width, the string is searched for characters belonging to a character type that has half-width forms, namely katakana, alphabetic letters, and numerals. If such a character is found, the span starting at that character and ending just before the first character of a different character type is extracted as one partial character string of a single character type. In doing so, a long vowel mark (ー) in a string that begins with katakana (a katakana string) is not treated as the end of the string. Similarly, a comma, period, colon, or semicolon in an alphabetic string, and a comma or period in a numeric string, is not treated as the end of the string.
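The extraction rule above, including the characters that do not terminate a run, can be sketched as follows. The names `CONTINUATIONS` and `extract_runs` are hypothetical, the per-character type labels are assumed to have been produced by an earlier step, and accepting both ASCII and full-width forms of the punctuation is an assumption. Whether a trailing punctuation mark belongs to the run is not stated in the text; this sketch simply lets the run absorb it.

```python
from typing import List, Optional, Tuple

# Characters that do not end a run of the given type (per the description).
CONTINUATIONS = {
    "katakana": set("ーｰ"),       # long vowel mark, full-width and half-width
    "alpha": set(",.:;，．：；"),  # comma, period, colon, semicolon
    "digit": set(",.，．"),        # comma, period
}


def extract_runs(chars: str,
                 types: List[Optional[str]]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) for each partial character string
    made of one mixed-width character type. types[i] is "katakana", "alpha",
    "digit", or None for chars[i]."""
    runs = []
    i, n = 0, len(chars)
    while i < n:
        t = types[i]
        if t not in CONTINUATIONS:  # not a mixed-width type: skip
            i += 1
            continue
        j = i + 1
        # Extend while the type matches or the character is a continuation.
        while j < n and (types[j] == t or chars[j] in CONTINUATIONS[t]):
            j += 1
        runs.append((i, j, t))
        i = j
    return runs


text = "第1回トリップー試験"
labels = [None, "digit", None, "katakana", "katakana",
          "katakana", "katakana", None, None, None]
print(extract_runs(text, labels))
```

In this made-up input the long vowel mark after "プ" stays inside the katakana run, while the kanji on either side end the runs.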

Next, using the full-width/half-width character information obtained in the width-determination step, it is checked whether full-width and half-width characters are mixed within this partial character string. If they are, the number of full-width characters in the partial string is counted. In this example, the partial string is treated as consisting of half-width characters only when the number of full-width characters is 0; in all other cases the entire partial string is unified as full-width characters.
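The counting and unification rule can be sketched in a few lines. The function name `unify_run` and the `min_full` parameter are hypothetical; the default of 1 reproduces the rule of this example (half-width only when the full-width count is 0), and other values are merely one possible reading of "according to that number".

```python
from typing import List


def unify_run(is_full: List[bool], min_full: int = 1) -> List[bool]:
    """Unify the width flags of one partial character string.

    is_full[i] is True when character i was judged full-width by width alone.
    With min_full=1 the run becomes all half-width only when it contains no
    full-width character, and all full-width otherwise.
    """
    full_count = sum(is_full)  # number of full-width characters in the run
    return [full_count >= min_full] * len(is_full)


# "トリップ": the first three characters were judged half-width by width,
# "プ" was judged full-width, so the whole run is unified as full-width.
print(unify_run([False, False, False, True]))
```

A run in which every character was judged half-width is left as half-width, so genuinely half-width strings are not widened.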

In this way, in the example of FIG. 2, even if the steps from segmentation through recognition yield the same result as the conventional method, as shown in the figure, the search and extraction steps extract the katakana string "トリップ" (trip). In the subsequent counting and unification steps, because this partial string contains the full-width character "プ" (pu), the entire partial string "トリップ" is recognized (determined) as consisting of full-width characters. As a result, narrow full-width characters such as "ト" and "リ" are no longer at risk of being misjudged as half-width, and the recognition result is no longer hard to read when output.

[Effect of the invention]

According to the present invention, for characters of types in which full-width and half-width forms coexist, the full-width/half-width decision takes into account not only the size (width) of the character but also its context. This removes the risk of misjudging a narrow full-width character as half-width, which improves reliability and ensures a good appearance when the result is output.

[Brief description of drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention, and FIG. 2 is an explanatory diagram showing an example character string and its full-width/half-width recognition results.

Explanation of reference signs: … character image segmentation, … full-width character size extraction, … full-width/half-width determination by character size, … recognition result, … search for katakana, letters, and numerals, … extraction of katakana, alphabetic, and numeric strings, … counting, … unification.

Claims

(57) [Claims]

1. A method for determining full-width and half-width characters, comprising: performing recognition processing on an unknown character to extract a candidate character; determining, based on the character width, whether the candidate character is a full-width character or a half-width character, and determining its character type; then, for each character row or column (character string), checking whether a character of a character type that has half-width characters is present; if such a character is present, extracting a partial character string consisting of a series of characters of the same character type; checking the number of full-width characters in the partial character string; and, according to that number, unifying the partial character string as a character string consisting only of full-width characters or only of half-width characters.