JP2000251015A

JP2000251015A - Method and device for discriminating similar character and storage medium recording similar character discrimination program

Info

Publication number: JP2000251015A
Application number: JP11052844A
Authority: JP
Inventors: Koji Kurokawa; 浩司黒川; Katsuto Fujimoto; 克仁藤本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-03-01
Filing date: 1999-03-01
Publication date: 2000-09-14

Abstract

PROBLEM TO BE SOLVED: To provide a similar character discrimination method that adapts flexibly to documents mixing various fonts and emphasized expressions and discriminates similar characters accurately. SOLUTION: In a character recognition device provided with pattern recognition means that recognizes each character written on an original on the basis of character patterns, the method receives a character pattern string corresponding to a character string in a prescribed range that includes similar characters, together with the recognition result of the pattern recognition means, and determines whether the heights of the character patterns are uniform. When the heights vary, discriminant analysis on the character pattern heights classifies the patterns into an uppercase class and a lowercase class, and the similar characters are discriminated from that classification. When the heights are uniform, the already-determined characters in the recognition result are classified into prescribed character types, and each similar character is discriminated by comparing the height of its character pattern with the heights of the character patterns classified into those character types.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a character recognition device provided in a document file system or the like, and in particular to a technique for discriminating similar characters, which are difficult to discriminate because they are represented by geometrically similar character patterns. Some characters have uppercase and lowercase forms represented by nearly similar character patterns (hereinafter, similar characters), such as uppercase "C" and lowercase "c", or full-size hiragana "つ" and small hiragana "っ". For these, the recognition result cannot be settled by pattern matching with a pattern dictionary alone. A technique for determining whether such a character is uppercase or lowercase is therefore needed in addition to character recognition by pattern matching.

[0002]

[Prior Art] As a technique for determining whether a similar character is uppercase or lowercase, Japanese Unexamined Patent Application Publication No. H3-111983, "Similar Character Identification Method", has been proposed. FIG. 10 shows a configuration example of a character recognition device to which the similar character identification method of that publication is applied.

[0003] The pattern recognition unit 402 shown in FIG. 10 cuts out the character pattern corresponding to each character from the image data that the document reading unit 401 has read from the document, and recognizes each character by matching that character pattern against the standard character patterns registered in the pattern dictionary 403. When the pattern recognition unit 402 produces a recognition result indicating a similar character, the similar character extraction unit 404 in FIG. 10 sends the corresponding character pattern to the similar character determination unit 410, which then performs the determination process described below.

[0004] Also in FIG. 10, the output processing unit 405 receives the determination result obtained by the similar character determination unit 410, incorporates it into the recognition result, and sends the result to the document file system. Within the similar character determination unit 410, the blank ratio calculation unit 411 computes the ratio of the width of the blank portion of the character pattern representing the character under determination to the standard line width received from the pattern recognition unit 402, and passes this ratio to the comparator 412.

[0005] The comparator 412 compares the resulting blank ratio with a predetermined threshold TH1. It may be configured to output a determination of lowercase when the blank ratio is equal to or greater than TH1, and a determination of uppercase when the blank ratio is equal to or less than TH1. As shown in FIG. 11, when the font and point size are the same, the upper blank dl for an uppercase letter and the upper blank ds for a lowercase letter differ greatly. Whether a character is uppercase or lowercase can therefore be determined easily from the ratio of the upper blank d of the character under determination to the standard line width D.
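The prior-art test in paragraph [0005] can be illustrated with a minimal Python sketch. The function name, argument names, and the sample numbers are assumptions made for illustration and do not come from the publication. Because the text assigns the boundary value TH1 to both outcomes, the sketch arbitrarily resolves a tie as lowercase.

```python
def classify_by_blank_ratio(upper_blank: float, line_width: float, th1: float) -> str:
    """Classify a similar character from its upper blank d and the line width D.

    Hypothetical helper: the ratio d / D is compared with a fixed threshold TH1.
    A ratio exactly equal to TH1 is treated as lowercase (an assumption).
    """
    if line_width <= 0:
        raise ValueError("line_width must be positive")
    ratio = upper_blank / line_width
    return "lower" if ratio >= th1 else "upper"


# A small upper blank suggests a tall (uppercase) pattern.
print(classify_by_blank_ratio(2.0, 20.0, 0.25))
# A large upper blank suggests a short (lowercase) pattern.
print(classify_by_blank_ratio(8.0, 20.0, 0.25))
```

A single fixed TH1 is exactly what the following paragraphs identify as the weakness of this approach when fonts and emphasis styles are mixed.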

[0006]

[Problem to Be Solved by the Invention] With the spread of word processors and the like, the documents handled by character recognition devices have diversified, and a wide range of expressive techniques is now in use, including switching among many fonts and emphasis such as italic and bold type. In a document that mixes such fonts and emphasized expressions, the ratio of the upper blank dl of an uppercase character pattern to the standard line width D may exceed the predetermined threshold TH1 described above. Conversely, the ratio of the upper blank ds of a lowercase character pattern to the standard line width D may fall below TH1.

[0007] However, the conventional case determination method described above decides on the basis of a comparison between the blank ratio of the character under determination and a fixed threshold TH1, so it is difficult to discriminate accurately all of the similar characters contained in such a document. An object of the present invention is to provide a similar character discrimination method, a similar character discrimination device, and a storage medium recording a similar character discrimination program that adapt flexibly to documents mixing various fonts and emphasized expressions and can determine similar characters accurately.

[0008]

[Means for Solving the Problem] FIG. 1 shows the principle of the similar character discrimination method of the present invention. The invention of claim 1 is a method for use in a character recognition device provided with pattern recognition means that recognizes each character written on a document on the basis of character patterns read from the document. The method receives a character pattern string corresponding to a character string in a predetermined range that includes a similar character, that is, a character whose uppercase and lowercase forms are represented by geometrically similar character patterns, together with the recognition result produced for that character pattern string by the pattern recognition means, and determines whether the heights of the character patterns in the string are uniform. When the heights vary, discriminant analysis on the heights classifies the character patterns in the string into an uppercase class, whose heights correspond to uppercase letters, and a lowercase class, whose heights correspond to lowercase letters, and the recognition result for the similar character is decided from its classification. When the heights are uniform, the already-determined characters in the recognition result are classified into a plurality of character types that differ in how the character pattern is distributed in the height direction, and the recognition result for the similar character is decided by comparing the height of its character pattern with the heights of the character patterns of the determined characters classified into those character types.

[0009] According to the invention of claim 1, when the heights of the character patterns representing a character string that includes similar characters vary, discriminant analysis focused on character height classifies each character in the character set into the uppercase class or the lowercase class. The heights of all characters in the character set are thus used to determine whether the similar character under determination is uppercase or lowercase.

[0010] When the heights of the character patterns are uniform, the similar characters can be discriminated individually by using the heights of those characters in the character set that are classified into a character type serving as a reference for uppercase height or lowercase height. FIG. 2 is a block diagram showing the principle of the similar character discrimination device of the present invention. The invention of claim 2 is a device for use in a character recognition device provided with pattern recognition means 101 that recognizes each character written on a document on the basis of character patterns read from the document, the device comprising: character set input means 111 that receives and inputs a character pattern string corresponding to a character string in a predetermined range that includes a similar character, whose uppercase and lowercase forms are represented by geometrically similar character patterns, together with the recognition result produced for that character pattern string by the pattern recognition means 101; height distribution determination means 112 that determines whether the heights of the character patterns in the string are uniform; discriminant analysis processing means 113 that, when the heights are determined to vary, classifies the character patterns in the string by discriminant analysis on their heights into an uppercase class, whose heights correspond to uppercase letters, and a lowercase class, whose heights correspond to lowercase letters, and decides the recognition result of the similar character under determination from that classification; classification means 114 that, when the heights are determined to be uniform, classifies the already-determined characters in the recognition result into a plurality of character types that differ in how the character pattern is distributed in the height direction; and individual determination means 115 that individually decides the recognition result for each similar character on the basis of the height of its character pattern and the heights of the character patterns of the determined characters classified into those character types.

[0011] According to the invention of claim 2, either the discriminant analysis processing means 113, or the classification means 114 and the individual determination means 115, operate depending on the result from the height distribution determination means 112. The similar characters in the character set can thus be discriminated by using the characteristics of the height distribution of the series of character patterns received from the pattern recognition means 101 via the character set input means 111.

[0012] The invention of claim 3 is for use in a character recognition device provided with pattern recognition means 101 that recognizes each character written on a document on the basis of character patterns read from the document, and causes a computer to execute: a character set input procedure that receives and inputs a character pattern string corresponding to a character string in a predetermined range that includes a similar character, whose uppercase and lowercase forms are represented by geometrically similar character patterns, together with the recognition result produced for that character pattern string by the pattern recognition means 101; a height distribution determination procedure that determines whether the heights of the character patterns in the string are uniform; a discriminant analysis processing procedure that, when the heights are determined to vary, classifies the character patterns in the string by discriminant analysis on their heights into an uppercase class, whose heights correspond to uppercase letters, and a lowercase class, whose heights correspond to lowercase letters, and decides the recognition result of the similar character under determination from that classification; a classification procedure that, when the heights are determined to be uniform, classifies the already-determined characters in the recognition result into a plurality of character types that differ in how the character pattern is distributed in the height direction; and an individual determination procedure that individually decides the recognition result for each similar character on the basis of the height of its character pattern and the heights of the character patterns of the determined characters classified into those character types.

[0013] According to the invention of claim 3, either the discriminant analysis processing procedure, or the classification procedure and the individual determination procedure, are executed depending on the result of the height distribution determination procedure. The similar characters in the character set can thus be discriminated by using the characteristics of the height distribution of the series of character patterns received from the pattern recognition means 101 through the character set input procedure.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 3 shows the configuration of a character recognition device to which the similar character discrimination device of the present invention is applied.

[0015] In the character recognition device shown in FIG. 3, the pattern recognition unit 402 sequentially cuts out the character patterns contained in the image data read by the document reading unit 401, recognizes the characters on the basis of matching against the pattern dictionary 403, and sends the recognition results through the similar character discrimination device 210 to a document file system (not shown) or the like.

[0016] In the similar character discrimination device 210 shown in FIG. 3, the character string input unit 211 corresponds to the character set input means 111 of claim 2. It receives from the pattern recognition unit 402 a series of character patterns corresponding to a predetermined range of the document, together with their recognition results, and passes them to the character type discrimination unit 212 and the height distribution evaluation unit 213.

[0017] The character string input unit 211 may, for example, operate as the character set input means 111 described later in additional disclosure item 6, sequentially inputting the series of character patterns and recognition results for each line of the document as the object of discrimination. The character type discrimination unit 212 corresponds to the classification means 114 of claim 2 (called the character type discrimination means 114 in this paragraph of the original). It determines which of the character types shown in FIG. 4 each character in the one-line recognition result received from the character string input unit 211 belongs to, and passes the result to the discrimination control unit 214.

[0018] As shown in FIG. 4(a), the similar character type for alphabetic characters is the type of characters whose uppercase and lowercase forms are represented by mutually similar character patterns. It includes, for example, the uppercase letters "C, O, S, U, V, W, X, Z" and the lowercase letters "c, o, s, u, v, w, x, z". The uppercase type is the type of characters that touch both the capital line and the base line, which are the upper and lower tangents of uppercase letters, and whose uppercase and lowercase forms are represented by different character patterns. It includes, for example, the uppercase letters "A, B, D, E, F, G, H, K, L, M, N, R, T, Y" and the lowercase letters "b, d, f, h, k".

[0019] The lowercase type is the type of characters that touch both the lowercase line, which is the upper tangent of lowercase letters, and the base line described above, and whose uppercase and lowercase forms are represented by different character patterns. It includes, for example, the lowercase letters "a, e, m, n, r". The middle character type is the type of characters that touch the base line and whose upper edge lies between the lowercase line and the capital line. It includes, for example, the lowercase letters "i, t".

[0020] A character that touches the capital line and whose character pattern extends below the base line, such as uppercase "Q", belongs to the descending uppercase type. A character that touches the lowercase line and whose character pattern extends below the base line, such as lowercase "q, y", belongs to the descending lowercase type. A character whose upper edge lies between the lowercase line and the capital line and whose character pattern extends below the base line, such as lowercase "g, j", belongs to the descending middle character type.

[0021] Lowercase "p" and uppercase "P" are treated as a special similar character type. When determining whether a character of this type is uppercase or lowercase, the height of the portion of lowercase "p" above the base line may be used as the character height. Uppercase "I, J" and lowercase "l" are treated as an excluded type because their character type can be estimated only with low accuracy, and these characters are not used in the discrimination process described below.
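The character types of paragraphs [0018] to [0021] can be collected into a lookup table. The following Python sketch is illustrative only: the type names and the `char_type` helper are hypothetical, and the table holds just the example letters listed in the text, which the publication introduces with "for example" and which may therefore be incomplete.

```python
# Example members of each character type, copied from paragraphs [0018]-[0021].
# The English type names are labels chosen for this sketch.
CHAR_TYPES = {
    "similar": "COSUVWXZcosuvwxz",
    "upper": "ABDEFGHKLMNRTY" + "bdfhk",
    "lower": "aemnr",
    "middle": "it",
    "descending_upper": "Q",
    "descending_lower": "qy",
    "descending_middle": "gj",
    "special_similar": "Pp",
    "excluded": "IJl",
}

# Reverse map from a character to the name of its type.
TYPE_OF = {ch: name for name, chars in CHAR_TYPES.items() for ch in chars}


def char_type(ch: str):
    """Return the type name for a recognized character, or None if unlisted."""
    return TYPE_OF.get(ch)


print(char_type("C"))  # a similar-type character
print(char_type("k"))  # lowercase letter with uppercase height
print(char_type("l"))  # excluded from the discrimination process
```

Lowercase letters such as "b" and "k" sit in the uppercase type because the table groups characters by pattern height, not by case.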

[0022] Also in FIG. 3, the height distribution evaluation unit 213 uses the one-line set of character patterns received from the character string input unit 211 to obtain the character height of each character in that line and the range over which those heights are distributed. It passes the distribution range, together with height information giving each character's height, to the discrimination control unit 214. The discrimination control unit 214 shown in FIG. 3 controls the discriminant analysis processing unit 215, the height determination processing unit 216, and the height ratio determination processing unit 217 on the basis of the result from the character type discrimination unit 212 and the height distribution range obtained by the height distribution evaluation unit 213. Using the results obtained by these units, it settles the recognition result for each similar character in the pattern recognition result received via the character string input unit 211 and sends the settled recognition result to a document file system (not shown) or the like.

[0023] FIG. 5 is a flowchart of the similar character discrimination operation. When the character string input unit 211 shown in FIG. 3 inputs the pattern recognition result for one line, the character type discrimination unit 212 determines the character type of each character in that result (steps 301 and 302). If the set of character types obtained in step 302 does not include the similar character type, the discrimination control unit 214 takes the negative branch of step 303 and proceeds to step 310, where it sends the pattern recognition result received via the character string input unit 211 to the document file system unchanged and ends the process.
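The early exit at step 303 amounts to asking whether the line contains any similar-type character. The Python sketch below is a hypothetical illustration: the function name is invented, the set holds only the example letters given for the similar character type, and it is an assumption that the special similar type ("P", "p") does not trigger this branch.

```python
# Example similar-type letters listed in the description (possibly incomplete).
SIMILAR_TYPE = set("COSUVWXZcosuvwxz")


def needs_case_discrimination(recognized_line: str) -> bool:
    """Return True if the recognized line contains a similar-type character.

    False corresponds to the negative branch of step 303: the recognition
    result is passed on unchanged (step 310).
    """
    return any(ch in SIMILAR_TYPE for ch in recognized_line)


print(needs_case_discrimination("Hello"))  # contains "o"
print(needs_case_discrimination("image"))  # no similar-type letters
```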

[0024] If, on the other hand, the set of character types obtained in step 302 includes the similar character type, the discrimination control unit 214 refers to the evaluation result from the height distribution evaluation unit 213 and determines whether the heights of the character patterns for the input character string are uniform (step 304). When the distribution range obtained by the height distribution evaluation unit 213 is equal to or greater than a predetermined threshold V1, the discrimination control unit 214 judges that the input character string satisfies the conditions for applying discriminant analysis. It takes the negative branch of step 304 and passes the height information obtained by the height distribution evaluation unit 213 to the discriminant analysis processing unit 215.
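The uniformity test of step 304 can be sketched as follows. The publication does not define how the distribution range is measured, so taking it as the difference between the tallest and shortest pattern is an assumption, as are the function names and the sample value of V1.

```python
from typing import Sequence


def height_range(heights: Sequence[float]) -> float:
    """Spread of character heights in a line (assumed here to be max minus min)."""
    if not heights:
        raise ValueError("heights must not be empty")
    return max(heights) - min(heights)


def heights_are_uniform(heights: Sequence[float], v1: float) -> bool:
    """Step 304: heights are uniform when the range is below the threshold V1.

    A range equal to or greater than V1 means the line qualifies for
    discriminant analysis (negative branch of step 304).
    """
    return height_range(heights) < v1


print(heights_are_uniform([30, 31, 30, 29], 5))  # narrow spread
print(heights_are_uniform([30, 20, 31, 19], 5))  # mixed tall and short patterns
```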

[0025] In response, the discriminant analysis processing unit 215 performs discriminant analysis (step 305). FIG. 6 illustrates the discriminant analysis. Characters belonging to the similar character type can be divided into a class whose character height is equivalent to that of uppercase letters and a class whose character height is equivalent to that of lowercase letters.

[0026] If the character string under determination contains a sufficient number of characters, the distribution of the heights of its characters can be expressed, as shown in FIG. 6(a), as the sum of a normal distribution centered on the uppercase character height and a normal distribution centered on the lowercase character height. The discriminant analysis processing unit 215 may therefore use a discriminant analysis technique to find a threshold k that separates these two normal distributions and, on the basis of this threshold k, determine whether each similar-type character in the character string is uppercase or lowercase.

[0027] FIG. 6(b) shows the detailed configuration of the discriminant analysis processing unit. In FIG. 6, the histogram creation unit 221 uses the height information received from the discrimination control unit 214 to create a histogram representing the distribution of the heights of the characters in the character string under determination, and passes it to the separation degree calculation unit 223. The threshold determination unit 222 adjusts the value of the threshold k according to instructions from the optimization control unit 224 and passes it to the separation degree calculation unit 223. The separation degree calculation unit 223 computes a separation degree λ, which represents the sum of the probabilities of misjudgment that result when the distribution represented by the histogram is split at the threshold k received from the threshold determination unit 222.

[0028] The optimization control unit 224 controls the threshold determination process of the threshold determination unit 222 according to the value of the separation degree λ, and passes the threshold k at which λ is minimized to the comparison processing unit 225. The threshold k obtained in this way marks the boundary, shown in FIG. 6(a), between the height distribution of characters whose height is equivalent to uppercase letters and the height distribution of characters whose height is equivalent to lowercase letters.
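One way to write down the model of paragraphs [0026] to [0028] is given below. This is a sketch under assumptions: the publication states only that the height distribution is a sum of two normal distributions and that λ is a sum of misjudgment probabilities, so the mixing weights π and the integral form of λ(k) are illustrative choices, and all symbols other than k and λ are introduced here.

```latex
% Height h of a character pattern: mixture of an uppercase (U) and a lowercase (L) component
p(h) = \pi_U \, \mathcal{N}(h \mid \mu_U, \sigma_U^2) + \pi_L \, \mathcal{N}(h \mid \mu_L, \sigma_L^2),
\qquad \pi_U + \pi_L = 1, \quad \mu_L < \mu_U

% Sum of misjudgment probabilities when heights above k are called uppercase
\lambda(k) = \pi_L \int_{k}^{\infty} \mathcal{N}(h \mid \mu_L, \sigma_L^2) \, dh
           + \pi_U \int_{-\infty}^{k} \mathcal{N}(h \mid \mu_U, \sigma_U^2) \, dh

% Threshold selected by the optimization control unit 224
k^{*} = \arg\min_{k} \lambda(k)
```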

【００２９】したがって、比較処理部２２５は、ヒスト
グラム作成部２２１から判別対象の相似文字型の文字に
対応する文字高さを受け取り、この文字高さと閾値ｋと
の比較結果に基づいて、この相似文字が大文字であるか
小文字であるかを判定し、判定結果を判別制御部２１４
に返せばよい。The comparison processing unit 225 therefore receives from the histogram creation unit 221 the character height of the similar-type character to be discriminated, judges from the comparison of this height with the threshold k whether the similar character is uppercase or lowercase, and returns the result to the discrimination control unit 214.

【００３０】これに応じて、判別制御部２１４は、図５
に示したステップ３０６において、受け取った判別結果
に基づいて、パターン認識部４０２による認識結果に含
まれる相似文字に関する認識結果を確定し、得られた認
識結果を文書ファイルシステムなどに送出して処理を終
了すればよい。一方、判別対象の一行分の文字列の高さ
が一様であった場合は、上述したステップ３０４におけ
る肯定判定となり、判別制御部２１４は、ステップ３０
７において、文字型判別部２１２による判別結果の中
に、大文字の文字高さあるいは小文字の文字高さの基準
となる文字型に属する文字が含まれているか否かを判定
する。In response, the discrimination control unit 214, in step 306 of FIG. 5, finalizes the recognition results for the similar characters contained in the output of the pattern recognition unit 402 on the basis of the received discrimination result, sends the resulting recognition results to a document file system or the like, and ends the process. If, on the other hand, the heights in the one-line character string to be discriminated are uniform, step 304 yields an affirmative determination, and in step 307 the discrimination control unit 214 determines whether the results from the character type discrimination unit 212 include a character belonging to a character type that serves as a reference for uppercase or lowercase character height.

【００３１】このステップ３０７の肯定判定の場合に、
判別制御部２１４は、後に追加開示項１において述べる
基準文字抽出手段１３１として動作し、得られた基準文
字の高さおよび文字型を示す情報とともに判別対象の相
似文字の高さを示す情報を高さ判定処理部２１６に送出
し、これに応じて、この高さ判定処理部２１６により、
基準文字の高さによる判別処理が行われる（ステップ３
０８）。On an affirmative determination in step 307, the discrimination control unit 214 operates as the reference character extraction means 131 described later in additional disclosure item 1. It sends information indicating the height and character type of the reference character obtained, together with information indicating the height of the similar character to be discriminated, to the height determination processing unit 216, which then performs discrimination based on the height of the reference character (step 308).

【００３２】このとき、判別制御部２１４は、大文字型
あるいは小文字型に属する文字があれば優先的に基準文
字として選択し、これらの文字型に属する文字がない場
合には、下小文字型、下中文字型あるいは下大文字型に
属する文字を基準文字として選択し、後に追加開示項１
において述べる高さ判定手段１３２に相当する高さ判定
処理部２１６の処理に供すればよい。Here the discrimination control unit 214 preferentially selects as the reference character any character belonging to the uppercase type or the lowercase type. If no character of these types is present, it selects a character belonging to the lower-small, lower-medium, or lower-large character type, and supplies it to the height determination processing unit 216, which corresponds to the height determination means 132 described later in additional disclosure item 1.
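The selection priority in this paragraph can be illustrated with a minimal sketch. The type labels (`"uppercase"`, `"lower-small"`, and so on) and the function name `select_reference` are hypothetical; taking the first match within a priority group is also an assumption, since the text does not say which character is chosen when several qualify.

```python
# Character types that serve directly as a height reference (preferred).
PREFERRED_TYPES = ("uppercase", "lowercase")
# Types used only when no preferred-type character is present.
FALLBACK_TYPES = ("lower-small", "lower-medium", "lower-large")


def select_reference(chars):
    """Pick a reference character from (char_type, height) pairs.

    Preferred types are searched first, then the fallback types.
    Returns the first matching (char_type, height), or None when the
    string contains no usable reference (the step 307 negative branch).
    """
    for wanted in (PREFERRED_TYPES, FALLBACK_TYPES):
        for char_type, height in chars:
            if char_type in wanted:
                return char_type, height
    return None
```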

【００３３】図７に、文字高さによる判別動作を表す流
れ図を示す。高さ判定処理部２１６は、ステップ３１１
で受け取った基準文字型について、まず、大文字型であ
るか否かを判定し（ステップ３１２）、肯定判定の場合
に、高さ判定処理部２１６は、判別対象の相似文字は大
文字であると判断し（ステップ３１３）、判別結果を判
別制御部２１４に返して（ステップ３１４）、処理を終
了すればよい。FIG. 7 is a flowchart of the discrimination operation based on character height. The height determination processing unit 216 first determines whether the reference character type received in step 311 is the uppercase type (step 312). If so, it judges that the similar character to be discriminated is uppercase (step 313), returns the result to the discrimination control unit 214 (step 314), and ends the process.

【００３４】一方、ステップ３１２の否定判定の場合
は、受け取った基準文字型が小文字型であるか否かを判
定し（ステップ３１５）、肯定判定の場合に、高さ判定
処理部２１６は、判別対象の相似文字は小文字であると
判断し（ステップ３１６）、判別結果を判別制御部２１
４に返して（ステップ３１４）、処理を終了すればよ
い。また一方、ステップ３１５の否定判定の場合に、高
さ判定処理部２１６は、基準文字の文字高さHsと判別対
象の相似文字の文字高さHxとを比較し、ほぼ等しい場合
（ステップ３１７の肯定判定）にステップ３１３に進ん
で大文字であると判断し、ステップ３１７の否定判定の
場合は、ステップ３１６に進んで小文字であると判断す
ればよい。On a negative determination in step 312, the unit determines whether the received reference character type is the lowercase type (step 315). If so, the height determination processing unit 216 judges that the similar character is lowercase (step 316), returns the result to the discrimination control unit 214 (step 314), and ends the process. On a negative determination in step 315, it compares the height Hs of the reference character with the height Hx of the similar character. If they are approximately equal (affirmative determination in step 317), it proceeds to step 313 and judges the character uppercase; otherwise it proceeds to step 316 and judges it lowercase.
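The flow of FIG. 7 (steps 312 to 317) can be written as a short function. This is a sketch under stated assumptions: the text only says the heights are "approximately equal", so the relative `tolerance` parameter and its default of 10% are hypothetical, as are the type labels and the function name.

```python
def judge_by_reference(ref_type, ref_height, target_height, tolerance=0.1):
    """Decide whether a similar character is uppercase or lowercase.

    This branch runs only when the heights in the string are uniform,
    so an uppercase-type or lowercase-type reference decides the case
    directly. For any other reference type, the reference height Hs is
    compared with the target height Hx.
    """
    if ref_type == "uppercase":        # step 312 affirmative -> step 313
        return "uppercase"
    if ref_type == "lowercase":        # step 315 affirmative -> step 316
        return "lowercase"
    # Step 317: "approximately equal" is modelled as a relative tolerance
    # (assumption; the source gives no numeric criterion).
    if abs(ref_height - target_height) <= tolerance * ref_height:
        return "uppercase"
    return "lowercase"
```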

【００３５】このようにして得られた判別結果を受け取
って、判別制御部２１４は、図５に示したステップ３０
６に進み、判別結果をパターン認識結果に反映して、処
理を終了すればよい。ところで、判別対象の文字列に基
準となりうる文字型に属する文字が含まれていない場合
に、判別制御部２１４は、ステップ３０７の否定判定と
して、判別対象の相似文字の文字高さを示す高さ情報を
高さ比判定処理部２１７に送出し、これに応じて、この
高さ比判定処理部２１７により、該当する相似文字の文
字高さと標準的な大文字の文字高さとの比に基づく判定
処理が行われる（ステップ３０９）。On receiving the discrimination result obtained in this way, the discrimination control unit 214 proceeds to step 306 of FIG. 5, reflects the result in the pattern recognition result, and ends the process. If the character string to be discriminated contains no character of a type that can serve as a reference, the discrimination control unit 214 makes a negative determination in step 307 and sends height information indicating the height of the similar character to the height ratio determination processing unit 217, which then performs a determination based on the ratio of that character's height to the standard uppercase character height (step 309).
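The three-way branching among the discrimination methods can be summarized in a small dispatcher. The function name and the returned labels are hypothetical. How uniformity is tested in step 304 is not restated here, so it is passed in as a flag.

```python
def choose_method(heights_uniform, has_reference):
    """Select the discrimination method for one character string.

    heights_uniform: outcome of the uniformity check (step 304).
    has_reference:   outcome of the reference-character check (step 307).
    """
    if not heights_uniform:
        # Heights vary: split them by discriminant analysis.
        return "discriminant_analysis"
    if has_reference:
        # Uniform heights with a reference character present (step 308).
        return "height_determination"
    # Uniform heights and no reference character (step 309).
    return "height_ratio_determination"
```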

【００３６】図８に、高さ比判定処理部２１７の詳細構
成を示す。図８に示した高さ比判定処理部２１７におい
て、外接矩形抽出部２３１は、後に追加開示項２におい
て述べる基準文字抽出手段１３１として動作し、文字列
入力部２１１を介して原稿上の各行に対応する文字パタ
ーンを順次に受け取り、１行分の文字パターンに含まれ
る各文字に対応する文字パターンの外接矩形をそれぞれ
抽出し、文字型判別部２１２を介して受け取った文字型
に応じて、基準線推定部２３２、大文字線推定部２３３
および小文字線推定部２３４に送出する構成となってい
る。FIG. 8 shows the detailed configuration of the height ratio determination processing unit 217. In this unit, the circumscribed rectangle extraction unit 231 operates as the reference character extraction means 131 described later in additional disclosure item 2. It receives, via the character string input unit 211, the character patterns for each line of the document in turn, extracts the circumscribed rectangle of the character pattern of each character in the line, and sends the rectangles to the reference line estimation unit 232, the uppercase line estimation unit 233, and the lowercase line estimation unit 234 according to the character type received via the character type discrimination unit 212.

【００３７】この外接矩形抽出部２３１は、例えば、大
文字型と判別された文字パターンに対応する外接矩形を
基準線推定部２３２および大文字線推定部２３３に送出
し、小文字型と判別された文字パターンに対応する外接
矩形を基準線推定部２３２および小文字線推定部２３４
に送出すればよい。この基準線推定部２３２は、後に追
加開示項３において述べる基準線推定手段１３６に相当
するものであり、外接矩形抽出部２３１から受け取った
外接矩形の下側底辺の中点の集合について、例えば、最
小二乗法による直線近似処理などを行って、これらの外
接矩形が共通して外接する基準線を推定し、大文字サイ
ズ推定部２３５および小文字サイズ推定部２３６の処理
に供する構成となっている。For example, the circumscribed rectangle extraction unit 231 sends the circumscribed rectangles of character patterns judged to be of the uppercase type to the reference line estimation unit 232 and the uppercase line estimation unit 233, and those of character patterns judged to be of the lowercase type to the reference line estimation unit 232 and the lowercase line estimation unit 234. The reference line estimation unit 232 corresponds to the reference line estimation means 136 described later in additional disclosure item 3. It applies, for example, a least-squares straight-line approximation to the set of midpoints of the bottom edges of the rectangles received from the circumscribed rectangle extraction unit 231, thereby estimating the reference line on which these rectangles commonly rest, and supplies it to the uppercase size estimation unit 235 and the lowercase size estimation unit 236.

【００３８】一方、大文字線推定部２３３は、後に追加
開示項３において述べる大文字線推定手段１３５に相当
するものであり、外接矩形抽出部２３１から受け取った
外接矩形の上側底辺の中点の集合について、例えば、最
小二乗法による直線近似処理などを行って、これらの外
接矩形が共通して外接する大文字線を推定し、大文字サ
イズ推定部２３５の処理に供する構成となっている。The uppercase line estimation unit 233 corresponds to the uppercase line estimation means 135 described later in additional disclosure item 3. It applies, for example, a least-squares straight-line approximation to the set of midpoints of the top edges of the rectangles received from the circumscribed rectangle extraction unit 231, thereby estimating the uppercase line that these rectangles commonly touch, and supplies it to the uppercase size estimation unit 235.

【００３９】また、図８に示した大文字サイズ推定部２
３５は、後に追加開示項３において述べる文字高さ算出
手段１３７に相当するものであり、受け取った大文字線
と基準線との距離に基づいて標準的な大文字の高さを求
め、セレクタ２３７ａを介して高さ比算出部２３９の処
理に供するとともに、得られた大文字の高さをサイズ保
持部２３８ａに保持する構成となっている。The uppercase size estimation unit 235 shown in FIG. 8 corresponds to the character height calculation means 137 described later in additional disclosure item 3. It obtains the standard uppercase height from the distance between the received uppercase line and the reference line, supplies it to the height ratio calculation unit 239 via the selector 237a, and stores it in the size holding unit 238a.
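The line estimation and height calculation described in the preceding paragraphs can be sketched as follows. Rectangles are assumed to be `(left, top, right, bottom)` tuples in image coordinates with y increasing downward. The distance between the two lines is measured vertically at the horizontal centre of the uppercase rectangles, which is one reasonable reading of "distance" and is an assumption. All function names are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b through a list of (x, y) points."""
    n = len(points)
    if n == 0:
        raise ValueError("at least one point is required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # A single point (or identical x values): fall back to a horizontal line.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def edge_midpoints(rects, edge):
    """Midpoints of the top or bottom edge of each circumscribed rectangle."""
    index = 1 if edge == "top" else 3
    return [((r[0] + r[2]) / 2, r[index]) for r in rects]


def capital_height(upper_rects, all_rects):
    """Distance between the uppercase line and the reference line."""
    tops = edge_midpoints(upper_rects, "top")
    a_top, b_top = fit_line(tops)                       # uppercase line
    a_base, b_base = fit_line(edge_midpoints(all_rects, "bottom"))  # reference line
    x_mid = sum(x for x, _ in tops) / len(tops)
    return (a_base * x_mid + b_base) - (a_top * x_mid + b_top)
```

The same `fit_line` call applied to the top edges of lowercase-type rectangles would give the lowercase line used for the lowercase height.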

【００４０】このように、上述した基準線推定部２３２
および大文字線推定部２３３の動作により、現在注目し
ている原稿上の行に含まれる文字の大きさを忠実に反映
した基準線および大文字線を精密に推定し、大文字サイ
ズ推定部２３５の処理に供することにより、後に追加開
示項２において述べる大文字高さ推定手段１３３の機能
を実現し、文字列入力部２１１を介して受け取った各行
に対応する文字パターンの連なりに基づいて、標準的な
大文字の文字高さを示す指標を動的に算出することがで
きる。In this way, the reference line estimation unit 232 and the uppercase line estimation unit 233 precisely estimate a reference line and an uppercase line that faithfully reflect the size of the characters in the line currently of interest, and supply them to the uppercase size estimation unit 235. This realizes the function of the uppercase height estimation means 133 described later in additional disclosure item 2: an index of the standard uppercase character height is calculated dynamically from the sequence of character patterns for each line received via the character string input unit 211.

【００４１】また、図８において、セレクタ２３７ａ
は、大文字線推定部２３３からの指示に応じて、大文字
サイズ推定部２３５の出力とサイズ保持部２３８ａの出
力のいずれかを選択し、大文字の文字高さHaとして高さ
比算出部２３９の処理に供する構成となっている。ここ
で、大文字線推定部２３３は、大文字線を正常に推定で
きた場合に、大文字サイズ推定部２３５の出力を選択す
る旨をセレクタ２３７ａに指示し、大文字線の推定処理
ができなかった場合に、サイズ保持部２３８ａの出力を
選択する旨をセレクタ２３７ａに指示すればよい。In FIG. 8, the selector 237a selects either the output of the uppercase size estimation unit 235 or the output of the size holding unit 238a, as instructed by the uppercase line estimation unit 233, and supplies it to the height ratio calculation unit 239 as the uppercase character height Ha. The uppercase line estimation unit 233 instructs the selector 237a to select the output of the uppercase size estimation unit 235 when the uppercase line was estimated successfully, and to select the output of the size holding unit 238a when it could not be estimated.

【００４２】このように、大文字線が正常に推定できた
か否かに応じて、セレクタ２３７ａの選択動作を制御す
ることにより、現在注目している範囲に大文字の文字高
さの基準となる文字型に属する文字が存在するか否かに
かかわらず、大文字の高さを示す適切な指標を高さ比算
出部２３９の処理に供することができる。この高さ比算
出部２３９は、判別制御部２１４を介して判別対象の相
似文字の文字高さHxを受け取り、この文字高さHxと上述
した大文字の文字高さHaとの高さ比Vhを求める構成とな
っており、後述する閾値算出部２４０および比較処理部
２４１とともに、後に追加開示項２において述べる高さ
比判定手段１３４を形成している。By controlling the selector 237a according to whether the uppercase line was estimated successfully, an appropriate index of the uppercase height can be supplied to the height ratio calculation unit 239 whether or not the range currently of interest contains a character of a type that serves as a reference for uppercase height. The height ratio calculation unit 239 receives the height Hx of the similar character via the discrimination control unit 214 and obtains the ratio Vh of Hx to the uppercase character height Ha. Together with the threshold calculation unit 240 and the comparison processing unit 241 described below, it forms the height ratio determination means 134 described later in additional disclosure item 2.

【００４３】図８において、比較処理部２４１は、この
高さ比Vhと閾値算出部２４０によって得られる閾値Thv
とを比較し、得られた結果に応じて、判別対象の相似文
字が大文字であるか小文字であるかを示す判別結果を判
別制御部２１４に返す構成となっている。また、図８に
おいて、閾値算出部２４０は、上述した大文字の文字高
さHaと後述する小文字の文字高さHbとの比に基づいて閾
値Thv を算出し、比較処理部２４１の処理に供する構成
とすればよい。In FIG. 8, the comparison processing unit 241 compares the height ratio Vh with the threshold Thv obtained by the threshold calculation unit 240 and, according to the result, returns to the discrimination control unit 214 a discrimination result indicating whether the similar character is uppercase or lowercase. The threshold calculation unit 240 calculates the threshold Thv from the ratio of the uppercase character height Ha to the lowercase character height Hb described below, and supplies it to the comparison processing unit 241.

【００４４】上述した大文字の文字高さHaと同様にし
て、小文字の文字高さHbを示す指標を求め、高さ比につ
いての閾値Thv を動的に決定することができる。図８に
おいて、小文字線推定部２３４は、外接矩形抽出部２３
１から受け取った外接矩形の上側底辺の中点の集合につ
いて、上述した大文字線推定部２３３と同様の処理を行
い、これらの外接矩形が共通して外接する小文字線をそ
れぞれ推定し、小文字サイズ推定部２３６の処理に供す
る構成となっている。An index of the lowercase character height Hb can be obtained in the same way as the uppercase character height Ha, so that the threshold Thv for the height ratio is determined dynamically. In FIG. 8, the lowercase line estimation unit 234 applies the same processing as the uppercase line estimation unit 233 to the set of midpoints of the top edges of the rectangles received from the circumscribed rectangle extraction unit 231, estimates the lowercase line that these rectangles commonly touch, and supplies it to the lowercase size estimation unit 236.

【００４５】また、小文字サイズ推定部２３６は、受け
取った小文字線と基準線との距離に基づいて標準的な小
文字の高さを求め、セレクタ２３７ｂを介して閾値算出
部２４０の処理に供するとともに、得られた小文字の高
さをサイズ保持部２３８ｂに保持する構成となってい
る。このセレクタ２３７ｂは、小文字線推定部２３４か
らの指示に応じて、小文字サイズ推定部２３６の出力と
サイズ保持部２３８ｂの出力のいずれかを選択し、小文
字の文字高さHbとして閾値算出部２４０の処理に供する
構成とすればよい。The lowercase size estimation unit 236 obtains the standard lowercase height from the distance between the received lowercase line and the reference line, supplies it to the threshold calculation unit 240 via the selector 237b, and stores it in the size holding unit 238b. The selector 237b selects either the output of the lowercase size estimation unit 236 or the output of the size holding unit 238b, as instructed by the lowercase line estimation unit 234, and supplies it to the threshold calculation unit 240 as the lowercase character height Hb.
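The selector, size holding unit, and ratio comparison can be sketched together. The text says only that Thv is calculated from the ratio of Ha to Hb. The midpoint between the lowercase ratio Hb/Ha and 1.0 used below is an assumption chosen for illustration, and the names `HeldSize`, `ratio_threshold`, and `judge_by_ratio` are hypothetical.

```python
class HeldSize:
    """Models a size holding unit together with its selector.

    A fresh estimate is stored and passed through. When estimation
    failed (estimate is None), the previously held value is returned.
    """

    def __init__(self, initial=None):
        self.value = initial

    def select(self, estimate):
        if estimate is not None:
            self.value = estimate
        return self.value


def ratio_threshold(ha, hb):
    """Threshold Thv on the height ratio.

    Assumption: halfway between the lowercase ratio hb/ha and the
    uppercase ratio 1.0. The source does not give the formula.
    """
    return (1.0 + hb / ha) / 2.0


def judge_by_ratio(hx, ha, hb):
    """Compare Vh = hx/ha with Thv and return the judged case."""
    vh = hx / ha
    return "uppercase" if vh >= ratio_threshold(ha, hb) else "lowercase"
```

A line with no usable reference characters would call `select(None)` on both holders and reuse the heights from an earlier line.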

【００４６】これにより、原稿上の各行の特徴に応じ
て、高さ比による判定の基準となる大文字の文字高さHa
および判別のための閾値Thv を動的に決定し、これらの
判定基準に従って、相似文字の判別を行うことができ
る。ここで、図５に示したステップ３０９が実行される
のは、判別対象の文字列の高さが一様であり、かつ、大
文字あるいは小文字の高さの基準となる文字型に属する
文字が含まれていない場合であり、非常に希なケースで
ある。Thus the uppercase character height Ha, which is the basis for the height-ratio determination, and the discrimination threshold Thv are determined dynamically according to the characteristics of each line of the document, and similar characters are discriminated by these criteria. Step 309 of FIG. 5 is executed only when the heights in the character string are uniform and the string contains no character of a type that serves as a reference for uppercase or lowercase height, which is a very rare case.

【００４７】このとき、大文字線推定部２３３および小
文字線推定部２３４においては、ともに推定処理が不可
能となるため、セレクタ２３７ａ、２３７ｂにより、そ
れぞれサイズ保持部２３８ａ、２３８ｂに保持された大
文字の文字高さHaと小文字の文字高さHbがそれぞれ選択
され、高さ比算出部２３９、閾値算出部２４０および比
較処理部２４１によって、高さ比に基づく判定処理が行
われる。In that case neither the uppercase line estimation unit 233 nor the lowercase line estimation unit 234 can perform its estimation, so the selectors 237a and 237b select the uppercase character height Ha and the lowercase character height Hb held in the size holding units 238a and 238b, respectively, and the height ratio calculation unit 239, the threshold calculation unit 240, and the comparison processing unit 241 perform the determination based on the height ratio.

【００４８】この場合に、上述したようにして、現在注
目している原稿上の行よりも以前に認識された行におけ
る大文字の文字高さおよび大文字と小文字との文字高さ
の比を基準として、文字の高さに基づく判定処理を行う
ことにより、判別対象の相似文字が大文字であるか小文
字であるかを判断することができる。図８に示した大文
字線推定部２３３、基準線推定部２３２、小文字線推定
部２３４、大文字サイズ推定部２３５および小文字サイ
ズ推定部２３６により、この判別結果は、上述した判断
基準を求めた原稿上の行と現在注目している行との間
で、フォントの切替やポイント数の切替のような特徴の
変化がない限り正確である。In this case, as described above, the determination based on character height uses as its criteria the uppercase character height and the uppercase-to-lowercase height ratio obtained from a line recognized before the line currently of interest, so it can still be judged whether the similar character is uppercase or lowercase. This result is accurate as long as no change in characteristics, such as a change of font or point size, occurs between the line from which the uppercase line estimation unit 233, reference line estimation unit 232, lowercase line estimation unit 234, uppercase size estimation unit 235, and lowercase size estimation unit 236 shown in FIG. 8 derived the criteria and the line currently of interest.

【００４９】なお、高さ比算出部２３９によって得られ
た高さ比と固定の閾値とを比較した結果に基づいて、相
似文字の判別を行うことも可能である。この場合は、図
８に示した小文字線推定部２３４、小文字サイズ推定部
２３６、セレクタ２３７ｂ、サイズ保持部２３８および
閾値算出部２４０を除外して高さ比判定処理部２１７を
構成し、標準的なフォントにおける大文字の文字高さと
小文字の文字高さの比に基づいて決定した固定の閾値を
比較処理部２４１の処理に供すればよい。Similar characters can also be discriminated by comparing the height ratio obtained by the height ratio calculation unit 239 with a fixed threshold. In this case, the height ratio determination processing unit 217 is configured without the lowercase line estimation unit 234, the lowercase size estimation unit 236, the selector 237b, the size holding unit 238, and the threshold calculation unit 240 shown in FIG. 8, and a fixed threshold determined from the ratio of uppercase to lowercase character height in a standard font is supplied to the comparison processing unit 241.

【００５０】ここで、上述したように、文字列入力部２
１１の動作により、原稿上の１行ごとに相似文字の判別
処理を行う場合は、上述した判別分析処理部２１５、高
さ判定処理部２１６および高さ比判定処理部２１７は、
相似文字が大文字であるか小文字であるかを判別するた
めの判別指標をそれぞれ原稿上の各行について求めてい
る。As described above, when the character string input unit 211 causes similar-character discrimination to be performed line by line, the discriminant analysis processing unit 215, the height determination processing unit 216, and the height ratio determination processing unit 217 each obtain, for every line of the document, a discrimination index for deciding whether a similar character is uppercase or lowercase.

【００５１】したがって、例えば、脚注などのように、
ポイント数が行ごとに変化している部分を含む文書が原
稿となった場合においても、判別指標を行ごとに動的に
変化させ、各行に含まれる相似文字を正確に判別するこ
とが可能である。また、図３に示した相似文字判別装置
を構成する各部は、それぞれソフトウェアによって実現
可能であり、これらのソフトウェアによって、請求項３
で述べた文字集合入力手順、高さ分布評価手順、判別分
析処理手順、分類手順および個別判定手順をコンピュー
タに実行させることができ、このようなプログラムをフ
ロッピーディスクやＣＤ−ＲＯＭなどの記憶媒体に記録
して頒布することにより、本発明による正確な相似文字
判別処理を幅広い利用者に提供することができる。Therefore, even when the document contains portions where the point size changes from line to line, such as footnotes, the discrimination index changes dynamically per line and the similar characters in each line can be discriminated accurately. Each part of the similar character discrimination apparatus shown in FIG. 3 can be implemented in software. Such software can cause a computer to execute the character set input procedure, height distribution evaluation procedure, discriminant analysis processing procedure, classification procedure, and individual determination procedure described in claim 3. By recording such a program on a storage medium such as a floppy disk or CD-ROM and distributing it, the accurate similar-character discrimination of the present invention can be made available to a wide range of users.

【００５２】次に、日本語文字と英字とを相互に利用し
て、相似文字の判別処理を行う方法について説明する。
ここで、図９(ａ)に示すように、標準的なひらがな大文
字は、標準的な英大字よりも一回り大きい文字パターン
で表されている場合が多い。したがって、例えば、英文
字で記述された文書を主として認識する場合は、図９
(ｂ)に示すように、文字列入力部２１１に設けた日本語
文字判別部２５１によって、ひらがな大文字の高さhnの
基準となる文字を判別し、正規化処理部２５２により、
これらの文字を表す文字パターンの高さに適切な正規化
定数Csを乗じて、文字型判別部２１２および高さ分布評
価部２１３の処理に供する構成とすればよい。Next, a method of discriminating similar characters by using Japanese characters and English letters together is described. As shown in FIG. 9(a), standard full-size hiragana are often represented by character patterns slightly larger than standard uppercase English letters. Therefore, when mainly recognizing a document written in English letters, for example, the configuration shown in FIG. 9(b) may be used: a Japanese character discrimination unit 251 provided in the character string input unit 211 identifies the characters that serve as a reference for the full-size hiragana height hn, and a normalization processing unit 252 multiplies the heights of the character patterns of these characters by an appropriate normalization constant Cs before supplying them to the character type discrimination unit 212 and the height distribution evaluation unit 213.

【００５３】この日本語文字判別部２５１は、後に追加
開示項４において述べる言語判別手段１２１に相当する
ものであり、文字列入力部２１１を介して受け取ったパ
ターン認識結果から、例えば、図４(ｂ)に示したひらが
な文字「か、き、く、け、さ、し、す、そ」などを基準
となる文字として判別する構成とすればよい。また、正
規化処理部２５２は、後に追加開示項４において述べる
正規化手段１２２に相当するものであり、例えば、標準
的なひらがな大文字の高さhnと標準的な英大文字の高さ
heとの比に基づいて予め正規化定数Csを求めておき、日
本語文字判別部２５１によって判別された各文字に対応
する文字高さhxを上述した正規化定数Csによって正規化
し、英文字に関する認識結果とともに、文字型判別部２
１２および文字高さ分布評価部２１３に送出すればよ
い。The Japanese character discrimination unit 251 corresponds to the language discrimination means 121 described later in additional disclosure item 4. From the pattern recognition result received via the character string input unit 211, it identifies as reference characters, for example, the hiragana 「か、き、く、け、さ、し、す、そ」 shown in FIG. 4(b). The normalization processing unit 252 corresponds to the normalization means 122 described later in additional disclosure item 4. For example, the normalization constant Cs is obtained in advance from the ratio of the standard full-size hiragana height hn to the standard uppercase English letter height he; the character height hx of each character identified by the Japanese character discrimination unit 251 is normalized by Cs and sent, together with the recognition results for the English letters, to the character type discrimination unit 212 and the character height distribution evaluation unit 213.
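The normalization step can be illustrated with a minimal sketch. The text says Cs is obtained from the ratio of hn and he and is multiplied into the hiragana heights. Taking Cs = he / hn, so that a normalized hiragana height matches the uppercase English height, is an assumption, and the function names are hypothetical. The reference set is the example list quoted in the paragraph.

```python
# Reference hiragana named in the text as examples.
REFERENCE_KANA = set("かきくけさしすそ")


def normalization_constant(hn, he):
    """Cs that scales full-size hiragana height hn to uppercase height he.

    Assumption: Cs = he / hn. The source only says the constant is
    based on the ratio of the two heights.
    """
    return he / hn


def normalize_heights(chars, cs, reference_kana=REFERENCE_KANA):
    """Scale heights of reference hiragana by Cs and leave other characters as-is.

    chars is a list of (character, height) pairs from the recognition result.
    """
    return [(c, h * cs if c in reference_kana else h) for c, h in chars]
```

After normalization, the hiragana heights can be fed into the same height distribution evaluation as the English letters.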

【００５４】また、逆に、日本語文字を主として認識す
る場合に、日本語文字で記述された文書に混在している
英文字の高さをひらがな大文字の高さに合わせて正規化
すれば、図４(ｂ)に示したひらがなの相似文字「あ、
い、う、え、お、つ、や、ゆ、よ、ぁ、ぃ、ぅ、ぇ、
ぉ、っ、ゃ、ゅ、ょ」の判別処理に利用することができ
る。Conversely, when mainly Japanese characters are recognized, normalizing the heights of English letters mixed into a document written in Japanese to the height of full-size hiragana allows them to be used in discriminating the similar hiragana characters shown in FIG. 4(b), 「あ、い、う、え、お、つ、や、ゆ、よ、ぁ、ぃ、ぅ、ぇ、ぉ、っ、ゃ、ゅ、ょ」.

【００５５】また一方、文字列入力部２１１が、後に追
加開示項５において述べる文字集合入力手段１１１とし
て動作し、例えば、原稿読取部４０１による読み取り範
囲内の複数行についてのパターン認識結果を判別対象の
文字列として入力した場合は、判別分析処理部２１５、
高さ判定処理部２１６および高さ比判定処理部２１７に
より、それぞれこの読み取り範囲ごとに、相似文字が大
文字であるか小文字であるかを判別する判別指標を求め
ることができる。Alternatively, the character string input unit 211 may operate as the character set input means 111 described later in additional disclosure item 5 and input, for example, the pattern recognition results for multiple lines within the reading range of the document reading unit 401 as the character string to be discriminated. In that case the discriminant analysis processing unit 215, the height determination processing unit 216, and the height ratio determination processing unit 217 each obtain, per reading range, a discrimination index for deciding whether a similar character is uppercase or lowercase.

【００５６】この場合は、十分なサンプル数が確保でき
るので、確実に判別分析処理部２１５による判別処理を
行うことが可能であり、また、十分なサンプルに基づく
判別分析処理によれば、適切な閾値ｋを十分な精度で推
定することが可能であるから、判別対象の領域に含まれ
る強調表現やフォントの変化に伴う文字の大きさの微妙
な変化を吸収し、上述した判別対象の文字列に含まれる
相似文字を大文字と小文字とに確実に判別することがで
きる。In this case a sufficient number of samples is available, so discrimination by the discriminant analysis processing unit 215 can be performed reliably. Discriminant analysis based on sufficient samples also allows an appropriate threshold k to be estimated with sufficient accuracy, so subtle variations in character size caused by emphasis or font changes within the region are absorbed, and the similar characters in the character string can be reliably separated into uppercase and lowercase.

【００５７】その一方、このように判別対象の領域を拡
大した場合は、ポイント数の局所的な変化などには十分
に対応することができない。逆に、文字列入力部２１１
が、後に追加開示項７において述べる文字集合入力手段
１１１として動作し、パターン認識結果を単語単位で入
力する構成とした場合は、判別分析処理部２１５、高さ
判定処理部２１６および高さ比判定処理部２１７によ
り、それぞれ各単語について、相似文字が大文字である
か小文字であるかを判別する判別指標を動的に求めるこ
とができる。On the other hand, enlarging the region to be discriminated in this way makes it impossible to cope adequately with local changes in point size. Conversely, if the character string input unit 211 operates as the character set input means 111 described later in additional disclosure item 7 and inputs the pattern recognition results word by word, the discriminant analysis processing unit 215, the height determination processing unit 216, and the height ratio determination processing unit 217 can each dynamically obtain, for every word, a discrimination index for deciding whether a similar character is uppercase or lowercase.

【００５８】この場合は、サンプル数が極端に少ないた
めに、判別分析処理部２１５、高さ判定処理部２１６お
よび高さ比判定処理部２１７によってそれぞれ得られる
判別指標の推定精度は低下する可能性がある。その反
面、上述したようにして、各単語について動的に求めた
判別指標に基づいて、その単語に含まれる相似文字の判
別を行うことにより、例えば、単語単位でポイント数が
変化する部分を含む文書が原稿となった場合において
も、相似文字を判別することができる。In this case the number of samples is extremely small, so the estimation accuracy of the discrimination indices obtained by the discriminant analysis processing unit 215, the height determination processing unit 216, and the height ratio determination processing unit 217 may decrease. On the other hand, discriminating the similar characters in each word using the index obtained dynamically for that word makes it possible to discriminate similar characters even when the document contains portions where the point size changes word by word.

【００５９】[0059]

【発明の効果】以上に説明したように、請求項１、請求
項２および請求項１０の発明によれば、原稿から読み取
られた所定の範囲に含まれる文字パターンの高さに関す
る特徴に基づいて、判別分析処理あるいは個別判定処理
を行うことにより、この所定の範囲について動的に求め
た判別指標に基づいて、その範囲内の相似文字を判別す
ることができるので、フォントの違いや強調表現による
文字パターンの変化などを吸収して、相似文字を正確に
判別することが可能である。As described above, according to the inventions of claims 1, 2, and 10, discriminant analysis or individual determination is performed on the basis of features relating to the heights of the character patterns in a predetermined range read from a document. Similar characters within that range are thus discriminated using an index obtained dynamically for the range, so that differences between fonts and changes in character patterns due to emphasis are absorbed and similar characters can be discriminated accurately.

【００６０】以上の説明に関して、更に、以下の項を開
示する。追加開示項１．請求項２に記載の相似文字判別装置にお
いて、個別判定手段１１５は、分類手段１１４による分
類結果を受け取り、大文字または小文字の高さの基準と
なりうる文字型に分類される文字に対応する文字パター
ンを基準文字パターンとして抽出する基準文字抽出手段
１３１と、基準文字パターンの高さと判別対象となる相
似文字に対応する文字パターンの高さとの比較結果に応
じて、相似文字の認識結果を決定する高さ判定手段１３
２とを備えた構成であることを特徴とする。In connection with the above description, the following items are further disclosed. Additional disclosure item 1. In the similar character discrimination apparatus according to claim 2, the individual determination means 115 comprises: reference character extraction means 131 that receives the classification result from the classification means 114 and extracts, as reference character patterns, the character patterns of characters classified into a character type that can serve as a reference for uppercase or lowercase height; and height determination means 132 that determines the recognition result of a similar character according to the result of comparing the height of the reference character pattern with the height of the character pattern of the similar character to be discriminated.

【００６１】追加開示項１の相似文字判別装置は、高さ
判定手段１３２が、基準文字抽出手段１３１によって抽
出された基準文字パターンの高さと相似文字の文字パタ
ーンの高さとを直接に比較することにより、文字集合に
含まれる基準文字の高さに基づいて、相似文字を正確に
判別することができる。このように、追加開示項１を適
用すれば、パターン認識手段によって認識結果が確定し
ている文字の高さを大文字と小文字とを判別するための
判別指標として利用することにより、フォントの違いや
強調表現による文字パターンの変化にかかわらず、相似
文字を正確に判別することができる。In the similar character discrimination apparatus of additional disclosure item 1, the height determination means 132 directly compares the height of the reference character pattern extracted by the reference character extraction means 131 with the height of the character pattern of the similar character, so that similar characters are discriminated accurately on the basis of the heights of the reference characters in the character set. Applying additional disclosure item 1 thus uses the heights of characters whose recognition results have been finalized by the pattern recognition means as the index for distinguishing uppercase from lowercase, so that similar characters can be discriminated accurately regardless of font differences or changes in character patterns due to emphasis.

【００６２】追加開示項２．請求項２に記載の相似文字
判別装置において、個別判定手段１１５は、分類手段１
１４による分類結果を受け取り、大文字または小文字の
高さの基準となりうる文字型に分類される文字に対応す
る文字パターンを基準文字パターンとして抽出する基準
文字抽出手段１３１と、基準文字パターンに基づいて、
標準的な大文字の高さを推定する大文字高さ推定手段１
３３と、大文字高さ推定手段１３３によって得られた大
文字の高さと、判別対象となる相似文字に対応する文字
パターンの高さとの比に基づいて、相似文字の認識結果
を決定する高さ比判定手段１３４とを備えた構成である
ことを特徴とする。Additional disclosure item 2. In the similar character discrimination apparatus according to claim 2, the individual determination means 115 comprises: reference character extraction means 131 that receives the classification result from the classification means 114 and extracts, as reference character patterns, the character patterns of characters classified into a character type that can serve as a reference for uppercase or lowercase height; uppercase height estimation means 133 that estimates the standard uppercase height from the reference character patterns; and height ratio determination means 134 that determines the recognition result of a similar character from the ratio of the uppercase height obtained by the uppercase height estimation means 133 to the height of the character pattern of the similar character to be discriminated.

[0063] In the similar character discriminating apparatus of Additional Disclosure Item 2, the capital letter height estimating means 133 estimates the standard capital letter height from the heights of the reference characters extracted by the reference character extracting means 131. The height ratio determining means 134 can then discriminate the similar character based on the ratio between this standard capital letter height and the height of the similar character to be discriminated.
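The height ratio determination described above can be illustrated with a minimal Python sketch. The function names, the use of the median as the estimator, and the ratio threshold of 0.8 are all assumptions made for illustration; the text here fixes neither an estimator nor a threshold value.

```python
from statistics import median


def estimate_capital_height(reference_heights):
    """Estimate a standard capital letter height from reference character heights.

    Using the median is an assumption of this sketch, not something the
    description prescribes.
    """
    if not reference_heights:
        raise ValueError("at least one reference character height is required")
    return median(reference_heights)


def discriminate_by_height_ratio(similar_height, capital_height, ratio_threshold=0.8):
    """Return 'upper' or 'lower' from the ratio of the two heights.

    ratio_threshold is a hypothetical value chosen only for this example.
    """
    if capital_height <= 0:
        raise ValueError("capital_height must be positive")
    ratio = similar_height / capital_height
    return "upper" if ratio >= ratio_threshold else "lower"


# Heights (in pixels) of characters already fixed as capital-height references.
reference_heights = [40, 41, 39, 40]
capital_height = estimate_capital_height(reference_heights)

result_tall = discriminate_by_height_ratio(38, capital_height)   # ratio 0.95
result_short = discriminate_by_height_ratio(27, capital_height)  # ratio 0.675
```

Because the index is a ratio rather than an absolute height, the same threshold can be reused when the point size changes, which is the property the description relies on.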

[0064] Additional Disclosure Item 3. In the similar character discriminating apparatus according to claim 4, the capital letter height estimating means 133 comprises: capital letter line estimating means 135, which estimates the capital letter line touched by the tops of character patterns representing standard capital letters, based on the set of midpoints of the upper sides of the circumscribed rectangles corresponding to reference character patterns classified into a character type that can serve as a reference for capital letter height; reference line estimating means 136, which estimates the reference line touched by the bottoms of character patterns representing standard uppercase and lowercase letters, based on the set of midpoints of the lower sides of the circumscribed rectangles corresponding to reference character patterns classified into a character type that can serve as a reference for uppercase or lowercase height; and character height calculating means 137, which obtains the distance between the capital letter line and the reference line and outputs it as the standard capital letter height.

[0065] In the similar character discriminating apparatus of Additional Disclosure Item 3, the capital letter line estimating means 135 and the reference line estimating means 136 can precisely estimate the capital letter line and the reference line based on the heights of a plurality of reference characters, so the character height calculating means 137 can accurately obtain the standard capital letter height.
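As a rough illustration of the two line estimations and the height calculation, the following Python sketch fits a straight line through each set of midpoints and measures the gap between the two lines. The least-squares fit, the `(left, top, right, bottom)` box layout with y increasing downward, and measuring the gap vertically at the mean x position are all assumptions of this sketch; the description does not specify the fitting method.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept through (x, y) points."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("points must not share a single x position")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_capital_height(capital_boxes, baseline_boxes):
    """Estimate the capital letter height from circumscribed rectangles.

    Boxes are (left, top, right, bottom) in image coordinates (y grows downward).
    capital_boxes: rectangles of characters usable as capital-height references.
    baseline_boxes: rectangles of characters whose bottoms sit on the reference line.
    """
    top_midpoints = [((l + r) / 2, t) for l, t, r, b in capital_boxes]
    bottom_midpoints = [((l + r) / 2, b) for l, t, r, b in baseline_boxes]
    cap_slope, cap_intercept = fit_line(top_midpoints)
    base_slope, base_intercept = fit_line(bottom_midpoints)
    # Measure the vertical gap at the mean x of the baseline midpoints,
    # an approximation that is adequate when skew is small.
    x_ref = sum(x for x, _ in bottom_midpoints) / len(bottom_midpoints)
    cap_y = cap_slope * x_ref + cap_intercept
    base_y = base_slope * x_ref + base_intercept
    return base_y - cap_y


capital_boxes = [(0, 10, 20, 50), (30, 10, 50, 50), (60, 10, 80, 50)]
baseline_boxes = capital_boxes + [(90, 22, 105, 50)]  # one lowercase-height box
height = standard_capital_height(capital_boxes, baseline_boxes)
```

Fitting through several midpoints rather than taking a single character's box is what lets the estimate tolerate noise in any one rectangle.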

[0066] Therefore, when Additional Disclosure Items 2 and 3 are applied, the standard capital letter height is precisely estimated from the character patterns corresponding to characters whose recognition results have been fixed by the pattern recognition means, and the ratio to this standard capital letter height is used as the discrimination index. Similar characters can thus be accurately discriminated regardless of font differences or changes in the character pattern caused by emphasized expressions.

[0067] Additional Disclosure Item 4. In the similar character discriminating apparatus according to claim 2, the character set input means 111 comprises: language discriminating means 121, which discriminates the language to which each character in the character string received from the pattern recognition means 101 belongs; and normalizing means 122, which, for each character discriminated as a foreign-language character belonging to a language other than the main language primarily handled by the character recognition device, normalizes the size of the corresponding character pattern to match the size of characters belonging to the main language.

[0068] In the similar character discriminating apparatus of Additional Disclosure Item 4, the normalizing means 122 normalizes each character that the language discriminating means 121 has discriminated as a foreign-language character, so the foreign-language characters included in the character set can be used for discriminating similar characters. When Additional Disclosure Item 4 is applied, the size of the foreign-language characters is normalized based on the standard capital letter size of the main-language characters, so the discrimination index can be obtained from a character set that includes the foreign-language characters together with the main-language characters. This improves the accuracy of similar character discrimination for documents in which characters used in a plurality of languages are mixed.
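The normalization step can be pictured with the following Python sketch. It assumes that each character arrives as a `(height, is_foreign)` pair, with the flag standing in for the output of the language discrimination, and that scaling by the ratio of the two standard capital letter heights is an acceptable normalization. Both the data layout and the scaling rule are illustrative assumptions, not details given in the description.

```python
def normalize_foreign_heights(chars, main_capital_height, foreign_capital_height):
    """Scale foreign-language character heights onto the main-language scale.

    chars: list of (height, is_foreign) pairs (hypothetical layout).
    main_capital_height: standard capital letter height of the main language.
    foreign_capital_height: corresponding height measured on the foreign characters.
    """
    if main_capital_height <= 0 or foreign_capital_height <= 0:
        raise ValueError("capital heights must be positive")
    scale = main_capital_height / foreign_capital_height
    return [height * scale if is_foreign else height for height, is_foreign in chars]


chars = [(40, False), (30, True), (20, True)]
normalized = normalize_foreign_heights(chars, main_capital_height=40,
                                       foreign_capital_height=30)
```

After this step all heights share one scale, so a single discrimination index can be computed over the combined character set.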

[0069] Additional Disclosure Item 5. In the similar character discriminating apparatus according to claim 2, the character set input means 111 is configured to input the character patterns included in a predetermined reading area of the document that contains the similar character to be discriminated, together with the recognition results obtained for these character patterns by the pattern recognition means 101.

[0070] In the similar character discriminating apparatus of Additional Disclosure Item 5, the character set input means 111 inputs the character patterns included in the reading area and their recognition results as a character set. The discriminant analysis processing means 113 and the individual determining means 115 can therefore dynamically obtain, for each such reading area, the discrimination index used to decide whether a similar character is uppercase or lowercase.

[0071] When Additional Disclosure Item 5 is applied and the discrimination index is obtained for each reading area, similar characters can be precisely discriminated by discriminant analysis based on a sufficient number of samples.

Additional Disclosure Item 6. In the similar character discriminating apparatus according to claim 2, the character set input means 111 is configured to input the character patterns corresponding to one line of the document that contains the similar character to be discriminated, together with the recognition results obtained for these character patterns by the pattern recognition means 101.

[0072] In the similar character discriminating apparatus of Additional Disclosure Item 6, the character set input means 111 inputs a character set corresponding to one line of the document. The discriminant analysis processing means 113 and the individual determining means 115 can therefore dynamically obtain, for each line of the document, the discrimination index used to decide whether a similar character is uppercase or lowercase.

[0073] Additional Disclosure Item 7. In the similar character discriminating apparatus according to claim 2, the character set input means 111 is configured to input the character patterns corresponding to a word of the document that contains the similar character to be discriminated, together with the recognition results obtained for these character patterns by the pattern recognition means 101. In the similar character discriminating apparatus of Additional Disclosure Item 7, the character set input means 111 inputs a character set corresponding to a word. The discriminant analysis processing means 113 and the individual determining means 115 can therefore dynamically obtain, for each word, the discrimination index used to decide whether a similar character is uppercase or lowercase.

[0074] When Additional Disclosure Item 6 or Additional Disclosure Item 7 is applied and the discrimination index is obtained for each line or each word of the document, similar characters can be discriminated while flexibly following local changes such as changes in point size.
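To make the idea of a per-line or per-word discrimination index concrete, the following Python sketch groups character heights by a group identifier and picks, for each group, the height threshold that maximizes the between-class variance. That criterion is one common form of discriminant analysis and is used here as an assumption; the `(group_id, height)` input layout and the function names are likewise hypothetical.

```python
from collections import defaultdict


def discriminant_threshold(heights):
    """Return the threshold that maximizes between-class variance.

    Returns None when there are fewer than two distinct heights,
    i.e. when the heights cannot be split into two classes.
    """
    values = sorted(set(heights))
    if len(values) < 2:
        return None
    n = len(heights)
    best_threshold, best_score = None, -1.0
    # Candidate thresholds lie midway between neighbouring distinct heights.
    for lower, upper in zip(values, values[1:]):
        threshold = (lower + upper) / 2
        low = [h for h in heights if h < threshold]
        high = [h for h in heights if h >= threshold]
        w_low, w_high = len(low) / n, len(high) / n
        mean_low, mean_high = sum(low) / len(low), sum(high) / len(high)
        score = w_low * w_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


def thresholds_per_group(samples):
    """Compute one threshold per group (line or word) from (group_id, height) pairs."""
    groups = defaultdict(list)
    for group_id, height in samples:
        groups[group_id].append(height)
    return {gid: discriminant_threshold(hs) for gid, hs in groups.items()}


samples = [
    ("line1", 40), ("line1", 41), ("line1", 28), ("line1", 27),
    ("line2", 60), ("line2", 61), ("line2", 42), ("line2", 40),
]
thresholds = thresholds_per_group(samples)
```

In this example a character 42 pixels high falls in the lowercase class on `line2`, which is set in a larger size, although it would fall in the uppercase class on `line1`. This is the kind of local change in point size that a per-line index follows.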

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the principle of the similar character discriminating method of the present invention.

FIG. 2 is a block diagram illustrating the principle of the similar character discriminating apparatus of the present invention.

FIG. 3 is a diagram showing the configuration of a character recognition device to which the similar character discriminating apparatus of the present invention is applied.

FIG. 4 is a diagram illustrating examples of character types.

FIG. 5 is a flowchart illustrating the similar character discriminating operation.

FIG. 6 is a diagram illustrating discriminant analysis.

FIG. 7 is a flowchart illustrating the discriminating operation based on character height.

FIG. 8 is a detailed configuration diagram of the height ratio determination processing unit.

FIG. 9 is a diagram illustrating characteristics of Japanese characters and English letters.

FIG. 10 is a configuration diagram of a character recognition device to which the technique described in Japanese Patent Application Laid-Open No. 3-111983 is applied.

FIG. 11 is a diagram illustrating a conventional determination method based on a blank ratio.

[Explanation of symbols]

101: Pattern recognition means
111: Character set input means
112: Height distribution determining means
113: Discriminant analysis processing means
114: Classification means
115: Individual determining means
121: Language discriminating means
122: Normalizing means
131: Reference character extracting means
132: Height determining means
133: Capital letter height estimating means
134: Height ratio determining means
135: Capital letter line estimating means
136: Reference line estimating means
137: Character height calculating means
210: Similar character discriminating apparatus
211: Character string input unit
212: Character type discrimination unit
213: Height distribution evaluation unit
214: Discrimination control unit
215: Discriminant analysis processing unit
216: Height determination processing unit
217: Height ratio determination processing unit
221: Histogram creation unit
222: Threshold determination unit
223: Separation degree calculation unit
224: Optimization control unit
225, 241: Comparison processing unit
231: Circumscribed rectangle extraction unit
232: Reference line estimation unit
233: Capital letter line estimation unit
234: Lowercase line estimation unit
235: Capital letter size estimation unit
236: Lowercase size estimation unit
237: Selector
238: Size holding unit
239: Height ratio calculation unit
240: Threshold calculation unit
401: Document reading unit
402: Pattern recognition unit
403: Pattern dictionary
404: Similar character extraction unit
405: Output processing unit
410: Similar character determination unit
411: Blank ratio calculation unit
412: Comparator

Claims

[Claims]

1. A similar character discriminating method for a character recognition device provided with pattern recognition means for recognizing each character written on a document based on character patterns read from the document, the method comprising: receiving a character pattern string corresponding to a character string in a predetermined range that includes a similar character, whose uppercase and lowercase forms are represented by similar character patterns, together with the recognition result obtained for the character pattern string by the pattern recognition means; determining whether or not the heights of the character patterns included in the character pattern string are uniform; when the heights of the character patterns included in the character pattern string vary, classifying the character patterns, by discriminant analysis of their heights, into an uppercase class having heights corresponding to uppercase letters and a lowercase class having heights corresponding to lowercase letters, and determining the recognition result of the similar character based on the classification result; and when the heights of the character patterns included in the character pattern string are uniform, classifying the determined characters included in the recognition result into a plurality of character types whose character patterns have different distribution characteristics in the height direction, and determining the recognition result corresponding to the similar character according to the result of comparing the height of the character pattern corresponding to a determined character classified into the plurality of character types with the height of the character pattern corresponding to the similar character.

2. A similar character discriminating apparatus for a character recognition device provided with pattern recognition means for recognizing each character written on a document based on character patterns read from the document, the apparatus comprising: character set input means for receiving and inputting a character pattern string corresponding to a character string in a predetermined range that includes a similar character, whose uppercase and lowercase forms are represented by similar character patterns, together with the recognition result obtained for the character pattern string by the pattern recognition means; height distribution determining means for determining whether or not the heights of the character patterns included in the character pattern string are uniform; discriminant analysis processing means for, in response to a determination that the heights of the character patterns vary, classifying the character patterns included in the character pattern string, by discriminant analysis of their heights, into an uppercase class having heights corresponding to uppercase letters and a lowercase class having heights corresponding to lowercase letters, and determining, based on the classification result, the recognition result of the similar character to be discriminated; classification means for, in response to a determination that the heights of the character patterns are uniform, classifying the determined characters included in the recognition result into a plurality of character types whose character patterns have different distribution characteristics in the height direction; and individual determining means for individually determining the recognition result corresponding to each similar character based on the height of the character pattern corresponding to the similar character and the height of the character pattern corresponding to a determined character classified into the plurality of character types.

3. A storage medium storing a similar character discriminating program for a character recognition device provided with pattern recognition means for recognizing each character written on a document based on character patterns read from the document, the program causing a computer to execute: a character set input procedure of receiving and inputting a character pattern string corresponding to a character string in a predetermined range that includes a similar character, whose uppercase and lowercase forms are represented by similar character patterns, together with the recognition result obtained for the character pattern string by the pattern recognition means; a height distribution determining procedure of determining whether or not the heights of the character patterns included in the character pattern string are uniform; a discriminant analysis processing procedure of, in response to a determination that the heights of the character patterns vary, classifying the character patterns included in the character pattern string, by discriminant analysis of their heights, into an uppercase class having heights corresponding to uppercase letters and a lowercase class having heights corresponding to lowercase letters, and determining, based on the classification result, the recognition result of the similar character to be discriminated; a classification procedure of, in response to a determination that the heights of the character patterns are uniform, classifying the determined characters included in the recognition result into a plurality of character types whose character patterns have different distribution characteristics in the height direction; and an individual determination procedure of individually determining the recognition result corresponding to each similar character based on the height of the character pattern corresponding to the similar character and the height of the character pattern corresponding to a determined character classified into the plurality of character types.