JP2503259B2 - How to determine full-width and half-width characters - Google Patents

How to determine full-width and half-width characters

Info

Publication number
JP2503259B2
Authority
JP
Japan
Prior art keywords
character
width
characters
full
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP63252029A
Other languages
Japanese (ja)
Other versions
JPH02100189A (en)
Inventor
泰二 森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Electric Co Ltd and Fuji Facom Corp
Priority to JP63252029A
Publication of JPH02100189A
Application granted
Publication of JP2503259B2
Legal status: Expired - Lifetime

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a method for determining full-width and half-width characters in a character recognition device (OCR).

[Prior Art]

In general, the size of characters such as kanji is called full-width, and half the width of a full-width character is called half-width. The character types that have both full-width and half-width forms are katakana, alphabetic characters, and numerals. Conventionally, OCR has distinguished full-width from half-width characters solely by relying on the fact that a half-width character is half as wide as a full-width character. That is, characters are segmented from the document image, and the size of a full-width character is extracted in the course of that segmentation. The size (character width) of each segmented character is then compared with the full-width character size to decide whether the character is full-width or half-width.
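The conventional width comparison can be sketched as follows. This is a minimal illustration only: the function name `classify_by_width` and the 0.75 cutoff are assumptions made for the example, since the text says only that the segmented character width is compared with the full-width size.

```python
def classify_by_width(char_width: float, full_width: float,
                      threshold: float = 0.75) -> str:
    """Label a segmented character 'full' or 'half' from its width alone.

    char_width : measured width of the segmented character (e.g. in pixels)
    full_width : reference full-width character size extracted during segmentation
    threshold  : hypothetical cutoff ratio; the source does not specify one
    """
    return "full" if char_width >= threshold * full_width else "half"
```

Because the decision uses only the measured width, any glyph whose ink is narrow relative to the reference size is labelled half-width regardless of the characters around it.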

[Problems to Be Solved by the Invention]

With such a method, however, full-width characters that are narrow, such as the katakana 「イ」, 「リ」, and 「ト」, the numeral "1", and the letter "I", are judged to be half-width. Full-width and half-width characters then end up mixed in the recognition result, which makes the output hard to read. For example, given a character row or column (character string) like the one in FIG. 2, the numeral "1" and the katakana 「ト」, 「リ」, and 「ツ」 are recognized as half-width characters. In other words, existing OCR may misrecognize even a full-width character as half-width if the character is narrow.

An object of the present invention is therefore to provide a method that prevents narrow full-width characters from being misrecognized as half-width characters.

[Means for Solving the Problems]

Recognition processing is performed on unknown characters to extract candidate characters, and for each candidate character it is determined whether the character is full-width or half-width and which character type it belongs to. Then, for each character string, it is checked whether a character of a character type that has half-width forms is present. If so, a substring consisting of a run of characters of that same character type is extracted, the number of full-width characters in the substring is counted, and, depending on that number, the substring is unified into a string consisting only of full-width characters or only of half-width characters.

[Operation]

For characters belonging to a character type in which full-width and half-width forms coexist, not only the character size (character width) but also the surrounding context is examined. This keeps narrow full-width characters from being misjudged as half-width characters, improves reliability, and ensures that the output looks good.

[Embodiment]

FIG. 1 is a flowchart showing an embodiment of the present invention.

First, character image data is segmented by known image processing (see ①), and in that process the reference full-width character size is extracted (see ②). Next, the size of the target character is compared with the full-width character size to determine whether the character is full-width or half-width, and the result is stored in memory as full-width/half-width character information (see ③). The target character is then recognized (see ④), and the result is converted into a predetermined character code and stored in memory. In that process, the character type is determined from the character code. Repeating steps ① to ④ yields the recognition result for one character row or column (character string).
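Determining the character type from the character code might look like the sketch below. It assumes Unicode code points, which is purely an illustrative choice (the source speaks only of "a predetermined character code"), and the function name `char_type` and its return labels are hypothetical.

```python
def char_type(ch: str) -> str:
    """Classify one recognized character by its code point.

    Returns 'katakana', 'alpha', 'digit' (the types that have both
    full-width and half-width forms) or 'other'.
    """
    cp = ord(ch)
    # Katakana block (U+30A0-U+30FF) and halfwidth katakana (U+FF66-U+FF9F)
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"
    # ASCII letters and fullwidth Latin letters (U+FF21-U+FF3A, U+FF41-U+FF5A)
    if ("A" <= ch <= "Z" or "a" <= ch <= "z"
            or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A):
        return "alpha"
    # ASCII digits and fullwidth digits (U+FF10-U+FF19)
    if "0" <= ch <= "9" or 0xFF10 <= cp <= 0xFF19:
        return "digit"
    return "other"
```

Note that in an OCR setting the code assigned at recognition time does not by itself settle full-width versus half-width; that information comes from the separate width comparison and is stored alongside the code.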

Next, so that a narrow full-width character in the recognized string is not misjudged as a half-width character, the string is searched for characters belonging to a character type that has half-width forms, namely katakana, alphabetic characters, and numerals (see ⑤). If one is found, the characters from that character up to, but not including, the first character of a different character type are extracted as one substring of a single character type (see ⑥). Here, a long-vowel mark in a string that begins with katakana (a katakana string) does not end the string. Likewise, a comma, period, colon, or semicolon does not end an alphabetic string, and a comma or period does not end a numeric string.
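The search and substring extraction described above can be sketched as follows. The names `extract_runs`, `TARGET_TYPES`, and `CONTINUERS` are hypothetical, and the type labels are assumed to have been assigned already during recognition; the sample string in the usage note is made up for illustration and is not the string of FIG. 2.

```python
TARGET_TYPES = {"katakana", "alpha", "digit"}

# Characters that do not terminate a run of the given type
# (ASCII and fullwidth punctuation variants are both listed as an assumption).
CONTINUERS = {
    "katakana": set("ーｰ"),          # long-vowel mark
    "alpha": set(",.:;，．：；"),     # comma, period, colon, semicolon
    "digit": set(",.，．"),          # comma, period
}

def extract_runs(chars, types):
    """Return (start, end, type) for each same-type run; end is exclusive."""
    runs = []
    i, n = 0, len(chars)
    while i < n:
        t = types[i]
        if t not in TARGET_TYPES:
            i += 1
            continue
        j = i + 1
        # Extend the run while the type matches or the character is a continuer.
        while j < n and (types[j] == t or chars[j] in CONTINUERS[t]):
            j += 1
        runs.append((i, j, t))
        i = j
    return runs
```

For example, with `chars = list("第1トリップ発生")` and the matching type labels, the function yields one numeric run covering "1" and one katakana run covering 「トリップ」. A trailing continuer such as a final comma stays attached to its run in this sketch, which is harmless for the later width decision.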

Next, the full-width/half-width character information obtained in ③ is used to check whether full-width and half-width characters are mixed in this substring; if they are, the number of full-width characters in the substring is counted (see ⑦). In this example, the substring is taken to consist of half-width characters only when the full-width character count is "0"; otherwise the whole substring is unified as full-width characters (see ⑧).
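The counting and unification rule of this embodiment is small enough to show directly. The function name `unify_run` and the 'full'/'half' labels are assumptions for the sketch; the rule itself (half-width only when the full-width count is zero) follows the embodiment.

```python
def unify_run(width_flags):
    """Unify the per-character width labels of one same-type substring.

    width_flags : list of 'full' / 'half' labels from the width comparison
    Returns a list of the same length in which every label is identical.
    """
    full_count = sum(1 for flag in width_flags if flag == "full")
    # Embodiment rule: half-width only if no full-width character is present.
    unified = "half" if full_count == 0 else "full"
    return [unified] * len(width_flags)
```

The claim speaks more generally of unifying "depending on the count", so a different cutoff would only change the single comparison against zero.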

With this approach, in the example of FIG. 2, even though steps ① to ④ give the same result as the conventional method shown in the figure, steps ⑤ and ⑥ extract the katakana string 「トリップ」, and in steps ⑦ and ⑧ the full-width character 「プ」 is found in this substring, so the entire substring 「トリップ」 is recognized (determined) as consisting of full-width characters. As a result, narrow full-width characters such as 「ト」 and 「リ」 are no longer misjudged as half-width characters, and the recognition result is no longer hard to read when output.
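The 「トリップ」 case can be traced end to end with a short, self-contained sketch. Everything named here (`char_type`, `determine_widths`, the Unicode ranges, and the input string "1トリップ" with its measured labels) is assumed for illustration; only the outcome for 「トリップ」 is taken from the description.

```python
TARGET = {"katakana", "alpha", "digit"}
CONTINUERS = {"katakana": "ーｰ", "alpha": ",.:;", "digit": ",."}

def char_type(ch):
    """Hypothetical type lookup based on Unicode code points."""
    cp = ord(ch)
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

def determine_widths(text, measured):
    """Correct width-only labels by unifying each same-type run.

    measured : per-character 'full'/'half' labels from the width comparison
    """
    result = list(measured)
    i, n = 0, len(text)
    while i < n:
        t = char_type(text[i])
        if t not in TARGET:
            i += 1
            continue
        j = i + 1
        while j < n and (char_type(text[j]) == t or text[j] in CONTINUERS[t]):
            j += 1
        # Any full-width character makes the whole run full-width.
        unified = "full" if "full" in measured[i:j] else "half"
        result[i:j] = [unified] * (j - i)
        i = j
    return result

# Width comparison alone labels the narrow glyphs as half-width.
text = "1トリップ"
measured = ["half", "half", "half", "half", "full"]
print(determine_widths(text, measured))
# prints ['half', 'full', 'full', 'full', 'full']
```

In this sketch the isolated "1" forms a numeric run with no full-width member, so it keeps its half-width label, while the katakana run is promoted to full-width because of 「プ」.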

[Effects of the Invention]

According to the present invention, for characters of a type in which full-width and half-width forms coexist, the full-width/half-width decision takes into account not only the character size (character width) but also the surrounding context. Narrow full-width characters are therefore no longer misjudged as half-width characters, which improves reliability and ensures that the output looks good.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention, and FIG. 2 is an explanatory diagram showing an example character string and its full-width/half-width recognition results.

Explanation of reference symbols
① Character image segmentation
② Extraction of the full-width character size
③ Full-width/half-width determination based on character size
④ Recognition result
⑤ Search for katakana, alphabetic characters, and numerals
⑥ Extraction of katakana strings, alphabetic strings, and numeric strings
⑦ Counting
⑧ Unification

Claims (1)

(57) [Claims]

1. A method for determining full-width and half-width characters, characterized by: performing recognition processing on unknown characters to extract candidate characters, and determining, based on character width, whether each candidate character is full-width or half-width, and determining its character type; then checking, for each character row or column (character string), whether a character of a character type that has half-width forms is present; if so, extracting a substring consisting of a run of characters of that same character type; counting the full-width characters in the substring; and, depending on that count, unifying the substring into a string consisting only of full-width characters or only of half-width characters.
JP63252029A 1988-10-07 1988-10-07 How to determine full-width and half-width characters Expired - Lifetime JP2503259B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63252029A JP2503259B2 (en) 1988-10-07 1988-10-07 How to determine full-width and half-width characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63252029A JP2503259B2 (en) 1988-10-07 1988-10-07 How to determine full-width and half-width characters

Publications (2)

Publication Number Publication Date
JPH02100189A JPH02100189A (en) 1990-04-12
JP2503259B2 true JP2503259B2 (en) 1996-06-05

Family

ID=17231596

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63252029A Expired - Lifetime JP2503259B2 (en) 1988-10-07 1988-10-07 How to determine full-width and half-width characters

Country Status (1)

Country Link
JP (1) JP2503259B2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS57157379A (en) * 1981-03-24 1982-09-28 Ricoh Co Ltd Discriminating device of kind of image
JPH0632070B2 (en) * 1983-11-28 1994-04-27 株式会社東芝 Character recognition device

Also Published As

Publication number Publication date
JPH02100189A (en) 1990-04-12
