JP2001338264A

JP2001338264A - Device and method for preparing character recognition pattern dictionary, and recording medium

Info

Publication number: JP2001338264A
Application number: JP2000155305A
Authority: JP
Inventors: 立 ▲せん▼; Ritsu Sen
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-05-25
Filing date: 2000-05-25
Publication date: 2001-12-07

Abstract

PROBLEM TO BE SOLVED: To provide a character recognition pattern dictionary preparing device whose recognition accuracy is not lowered even when a feature vector used for selecting a feature amount is not suitable for character recognition. SOLUTION: This device is provided with an input part 110 for inputting a character image; a feature amount extracting part 120 for extracting a feature amount from the character image inputted by the input part 110 by a discriminant analysis method or a principal component analysis method; a feature amount selection feature vector preparing part 130 for preparing, through adjustment using the appearance frequency information of characters, a feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extracting part 120; a feature amount selecting part 140 for reducing the number of dimensions of the feature amount extracted by the feature amount extracting part 120, using the feature amount selection feature vector prepared by the feature amount selection feature vector preparing part 130; and a dictionary preparing part 150 for preparing a dictionary from the feature amount selected by the feature amount selecting part 140.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a character recognition pattern dictionary creation device that creates a pattern recognition dictionary used in character recognition, optical character recognition devices, and image processing, and to a corresponding method and recording medium.

[0002]

[Prior Art] In general, improving the accuracy of character recognition requires extracting feature amounts with high separability. One way to do this is to increase the number of dimensions of the feature amount, for example by subdividing the region division more finely during feature extraction. However, increasing the number of dimensions increases the amount of computation and the storage capacity required for classification. Solving this problem requires reducing the number of dimensions of the feature amount without lowering the recognition rate, and statistical methods such as discriminant analysis can achieve this while retaining the effective features.

[0003]

[Problem to Be Solved by the Invention] However, if the feature vector used for feature amount selection is not suitable for character recognition, recognition accuracy may be lowered. An object of the present invention is to provide a character recognition pattern dictionary creation device, a character recognition pattern dictionary creation method, and a recording medium that create a pattern recognition dictionary without such a loss of recognition accuracy.

[0004]

[Means for Solving the Problem] The character recognition pattern dictionary creation device according to claim 1 of the present invention comprises: an input unit that inputs a character image; a feature amount extraction unit that extracts a feature amount from the character image input by the input unit by a discriminant analysis method or a principal component analysis method; a feature amount selection feature vector creation unit that creates, with adjustment using character appearance frequency information, a feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extraction unit; a feature amount selection unit that reduces the number of dimensions of the feature amount extracted by the feature amount extraction unit using the feature amount selection feature vector created by the feature amount selection feature vector creation unit; and a dictionary creation unit that creates a dictionary from the feature amount selected by the feature amount selection unit. The character recognition pattern dictionary creation device according to claim 2 is the device of claim 1 in which the feature amount selection feature vector creation unit creates the feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extraction unit with an adjustment that uses a high prior probability for characters to be recognized that appear frequently and a low prior probability for characters to be recognized that appear infrequently. The character recognition pattern dictionary creation device according to claim 3 is the device of claim 1 or claim 2 in which the feature amount selection feature vector creation unit creates the feature amount selection feature vector after clustering the learning data for the characters to be recognized. The character recognition pattern dictionary creation device according to claim 4 is the device of claim 3 in which the learning data is clustered by a combinatorial clustering method whose termination condition is that the distance of the cluster member farthest from the center of a generated cluster does not exceed a fixed value, and the feature amount selection feature vector is then calculated. The character recognition pattern dictionary creation method according to claim 5 inputs a character image, extracts a feature amount from the input character image, creates, with adjustment using appearance frequency information of the characters to be recognized, a feature amount selection feature vector for reducing the number of dimensions of the extracted feature amount, reduces the number of dimensions of the extracted feature amount using the created feature amount selection feature vector, and creates a dictionary from the selected feature amount.

[0005] The character recognition pattern dictionary creation method according to claim 6 of the present invention inputs a character image, extracts a feature amount from the input character image by a discriminant analysis method or a principal component analysis method, clusters the learning data for the characters to be recognized, then creates a feature amount selection feature vector for reducing the number of dimensions of the extracted feature amount with an adjustment that uses a high prior probability for characters to be recognized that appear frequently and a low prior probability for characters to be recognized that appear infrequently, reduces the number of dimensions of the extracted feature amount using the created feature amount selection feature vector, and creates a dictionary from the selected feature amount. The recording medium according to claim 7 of the present invention is a computer-readable recording medium for causing a computer to function as a character recognition pattern dictionary creation device that creates a pattern dictionary for character recognition. It records a program for causing the computer to function as a character recognition pattern dictionary creation device comprising: an input unit that inputs a character image; a feature amount extraction unit that extracts a feature amount from the character image input by the input unit by a discriminant analysis method or a principal component analysis method; a feature amount selection feature vector creation unit that creates, with adjustment using character appearance frequency information, a feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extraction unit; a feature amount selection unit that reduces the number of dimensions of the feature amount extracted by the feature amount extraction unit using the feature amount selection feature vector created by the feature amount selection feature vector creation unit; and a dictionary creation unit that creates a dictionary from the feature amount selected by the feature amount selection unit.

[0006]

[Embodiments of the Invention] The configuration and operating principle of the present invention are described below with reference to the drawings. FIG. 1 is a functional block diagram of a character recognition pattern dictionary creation device showing an example of an embodiment of the present invention. The character recognition pattern dictionary creation device 100 is composed of functional blocks such as an input unit 110, a feature amount extraction unit 120, a feature amount selection feature vector creation unit 130, a feature amount selection unit 140, a dictionary creation unit 150, and a recognition dictionary 200. The input unit 110 reads an image of manuscript paper or the like with a scanner. The image need not come from a scanner; an image that has already been captured may instead be read from a file or the like. The read character images are sent to the feature amount extraction unit 120 one character at a time. The feature amount extraction unit 120 extracts, from the character image of one character sent from the input unit 110, a feature vector (an n-dimensional vector) that is the feature amount used for recognition. This feature vector is created using, for example, the direction code histogram of Document 1 (Japanese Patent Laid-Open No. Hei 1-250184). The feature amount selection feature vector creation unit 130 refers to these feature vectors and, using an appropriate feature vector creation method, creates a feature amount selection feature vector for selecting feature amounts. The feature amount selection unit 140 reduces the dimensions of the feature amount extracted by the feature amount extraction unit 120 using the feature amount selection feature vector created by the feature amount selection feature vector creation unit 130. The dictionary creation unit 150 creates the recognition dictionary 200 from the dimension-reduced feature amounts. The recognition dictionary 200 holds, for each registered character, information such as a character code, a standard pattern feature vector (n-dimensional), and a feature amount selection feature vector (an n × k matrix, k < n). The value of k is chosen by experiment as a number smaller than n in order to keep the dictionary small.

[0007] The flow of processing in the embodiment of the present invention is described with reference to the flowchart of FIG. 2. An image of learning patterns written on manuscript paper or the like is read with a scanner, or an image that has already been captured is read from a file or the like (step S100). The read image is shown on a display, and a character portion is selected from it with a pointing device such as a mouse. The character portion may instead be determined using a region identification technique. Size normalization and noise removal are applied to the image of the selected character portion. A direction code histogram is extracted as the feature amount of the character image processed in this way (see Document 1) (step S110). The extracted feature amount is denoted by the n-dimensional feature amount X. The nature of the extracted feature amount is judged, and a feature amount selection feature vector is created by applying an appropriate feature amount selection technique (step S120; details are given later). The created feature amount selection feature vector is an n × k matrix A (k < n). The dimensions of the extracted feature amount X are reduced using the created feature amount selection feature vector A (step S130). Letting Y (k-dimensional) be the feature amount whose dimensions have been reduced by feature amount selection, Y is obtained as Y = A × X. The dimension-reduced feature amount Y is registered as the feature amount of the corresponding character in the recognition dictionary 200 (step S140).
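
The dimension reduction of step S130 can be sketched as a single matrix-vector product. This is a minimal illustration rather than the implementation described in the publication: the function name `reduce_dimensions` and the sample values are hypothetical. Because the text defines A as an n × k matrix and X as n-dimensional, the sketch assumes the product is taken as the transpose of A times X, so that Y has k components.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[i][j]


def reduce_dimensions(a: Matrix, x: Vector) -> Vector:
    """Project an n-dimensional feature X onto the k columns of A (n x k).

    Assumption (not stated in the source): Y = A^T X,
    i.e. Y[j] = sum over i of A[i][j] * X[i].
    """
    n = len(x)
    if len(a) != n:
        raise ValueError("A must have one row per component of X")
    k = len(a[0])
    if k >= n:
        raise ValueError("k must be smaller than n")
    return [sum(a[i][j] * x[i] for i in range(n)) for j in range(k)]


# Hypothetical example: n = 3 features reduced to k = 2.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
X = [2.0, 3.0, 4.0]
Y = reduce_dimensions(A, X)
print(Y)  # [6.0, 7.0]
```

The k-dimensional result Y is what step S140 would register in the recognition dictionary for the corresponding character.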

[0008] <Embodiment of the feature amount selection feature vector creation unit 130> FIGS. 3 and 4 are diagrams for explaining an example of an embodiment of the feature amount selection feature vector creation unit 130. Referring to FIG. 3, two categories with the same number of learning patterns, category “を” and category “苧”, overlap slightly in the two-dimensional feature amount space. O1 and O2 are the mean feature amounts of category “を” and category “苧”, respectively. Letting N be the total number of learning patterns and N_l the number of learning patterns of a category, the prior probability is estimated as P(w_l) = N_l / N. From this formula, if the numbers of learning patterns are the same, the prior probabilities are also regarded as the same. When the feature amount selection feature vector is calculated using the canonical discriminant analysis method of Document 2 (Wakabayashi, Tsuruoka, Kimura, Miyake: "A Study on Feature Selection in Handwritten Numeral Recognition", IEICE Transactions D-II, Vol. J78-D-II, No. 11, pp. 1627-1638, November 1995), Z1 of FIG. 3 is obtained. The centers O1 and O2 of category “を” and category “苧” are projected onto Z1, and the straight line G1 midway between the two projection lines is obtained. A point on the O1 side of G1 is judged to belong to category “を”, and a point on the O2 side of G1 is judged to belong to category “苧”.
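
The prior estimate P(w_l) = N_l / N and the midpoint decision along Z1 can be illustrated as follows. This is a minimal sketch under stated assumptions: the names `estimate_priors`, `project`, and `classify_by_midpoint` are hypothetical, the sample counts and coordinates are invented, and the direction Z1 is simply supplied as input rather than computed by canonical discriminant analysis.

```python
from typing import Dict, Sequence, Tuple


def estimate_priors(counts: Dict[str, int]) -> Dict[str, float]:
    """P(w_l) = N_l / N, with N the total number of learning patterns."""
    total = sum(counts.values())
    return {label: n_l / total for label, n_l in counts.items()}


def project(point: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a point onto a direction (dot product)."""
    return sum(p * d for p, d in zip(point, direction))


def classify_by_midpoint(point: Sequence[float],
                         center1: Sequence[float],
                         center2: Sequence[float],
                         direction: Sequence[float],
                         labels: Tuple[str, str]) -> str:
    """Return the label whose projected center lies on the same side of
    the midpoint as the projected point (ties go to the second label)."""
    t1 = project(center1, direction)
    t2 = project(center2, direction)
    midpoint = (t1 + t2) / 2.0  # plays the role of the line G1
    t = project(point, direction)
    same_side_as_first = (t - midpoint) * (t1 - midpoint) > 0
    return labels[0] if same_side_as_first else labels[1]


# Hypothetical values: equal pattern counts give equal priors.
priors = estimate_priors({"を": 100, "苧": 100})
O1, O2, Z1 = (1.0, 1.0), (3.0, 2.0), (1.0, 0.0)
label_a = classify_by_midpoint((1.5, 5.0), O1, O2, Z1, ("を", "苧"))
label_b = classify_by_midpoint((2.6, 0.0), O1, O2, Z1, ("を", "苧"))
print(priors, label_a, label_b)
```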

[0009] As shown in FIG. 3, the straight line G1 passes through the middle of the two categories, and the number of misrecognized characters is the same for each category. Therefore, the appearance frequencies of the characters are weighted so that the appearance frequency of the learning patterns of category “を” is increased while that of category “苧” is left as before. That is, the prior probability of category “を” is increased and the prior probability of “苧” is left unchanged. When the feature amount selection feature vector is calculated by the canonical discriminant analysis method (see Document 2) with these settings, the feature amount selection feature vector Z2 of FIG. 4 is obtained. By the nature of the canonical discriminant analysis method, the feature amount selection feature vector tilts toward the direction in which the category with the larger prior probability converges. The centers O1 and O2 of category “を” and category “苧” are projected onto the feature amount selection feature vector Z2, and the straight line G2 midway between the two projection lines is obtained. A point on the O1 side of G2 is judged to belong to category “を”, and a point on the O2 side of G2 is judged to belong to category “苧”. In this way, without work such as adding learning data, changing the feature amount selection feature vector reduces the misrecognized characters of category “を”; the feature vector can be changed in a direction that favors recognition of category “を”. In ordinary manuscripts, category “を” appears far more often than category “苧”, so calculating the feature amount selection feature vector with the same prior probability for these two characters is a disadvantage for improving recognition accuracy. It is also extremely difficult to collect learning data that matches the appearance frequencies of real manuscripts. Therefore, by changing the appearance frequency of characters belonging to a given category as in the present invention, the overall number of misrecognized characters can be reduced.
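
The effect of raising one category's prior probability can be illustrated with a two-class, two-dimensional example. This sketch assumes the standard prior-weighted Fisher criterion, with within-class scatter Sw = P1·S1 + P2·S2 and direction w = Sw⁻¹(m1 − m2); the formulation in Document 2 may differ. The function names and sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance(points: List[Point]) -> List[List[float]]:
    """2 x 2 covariance matrix of a set of 2-D points."""
    mx, my = mean(points)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return [[sxx, sxy], [sxy, syy]]


def discriminant_direction(c1: List[Point], c2: List[Point],
                           p1: float, p2: float) -> Point:
    """Assumed criterion: w = Sw^-1 (m1 - m2) with Sw = p1*S1 + p2*S2."""
    s1, s2 = covariance(c1), covariance(c2)
    sw = [[p1 * s1[i][j] + p2 * s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter is singular")
    m1, m2 = mean(c1), mean(c2)
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Multiply (dx, dy) by the inverse of the 2 x 2 matrix sw.
    wx = (sw[1][1] * dx - sw[0][1] * dy) / det
    wy = (-sw[1][0] * dx + sw[0][0] * dy) / det
    return (wx, wy)


# Hypothetical samples: class 1 is spread along x, class 2 along y.
class1 = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
class2 = [(5.0, 1.0), (5.0, 5.0), (4.0, 3.0), (6.0, 3.0)]
w_equal = discriminant_direction(class1, class2, 0.5, 0.5)
w_skewed = discriminant_direction(class1, class2, 0.9, 0.1)
print(w_equal, w_skewed)
```

With equal priors the two components of the direction are equal. With the prior of the first class raised to 0.9, the weighted scatter is dominated by that class, and the direction turns toward the axis along which that class is compact.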

[0010] <Another embodiment of the feature amount selection feature vector creation unit 130> FIGS. 5 and 6 are diagrams for explaining an example of another embodiment of the feature amount selection feature vector creation unit 130. Referring to FIG. 5, the two categories “ピ” and “ゼ” overlap in the two-dimensional feature amount space because their character shapes have similar parts. In addition, because each category includes learning data that differ in font and the like, each has a long, stretched shape in the feature amount space. When the feature amount selection feature vector is calculated for such feature amounts by the canonical discriminant analysis method (see Document 2), Z1 of FIG. 5 is obtained. The region to the upper left of the straight line G1 in FIG. 5 (this line can be obtained by linear discriminant analysis or the like) is the region of category “ピ”, and the region to the lower right of G1 is the region of category “ゼ”. Because the two categories overlap considerably, however, each has many misrecognized characters. Therefore, clustering is performed before feature amount selection by the canonical discriminant analysis method is applied. FIG. 6 shows the clustering result. The dotted lines show each category before clustering, and the solid circles show the clustering result. As FIG. 6 shows, category “ピ” and category “ゼ” are each divided into two clusters, and the clusters do not overlap. When the feature amount selection feature vector is created by the canonical discriminant analysis method (see Document 2) after this clustering, Z2 is obtained. The straight lines G2, G3, and G4 separate the clusters completely, and misrecognized characters are eliminated.

[0011] The clustering described above is performed by the following procedure.
(Procedure 1) Let N be the total number of learning data within a category. In the initial state, each of the N learning data forms one cluster. The distance from the center of a cluster to its farthest learning data is therefore DIST = 0, and the number of clusters M is M = N.
(Procedure 2) Among the M clusters, the pair with the highest similarity (the smallest distance) is found and merged into one cluster. The number of clusters M thereby becomes M − 1. The distance from the center of the merged cluster to its farthest learning data is taken as DIST. If the number of clusters M is 1 or more and the distance DIST does not exceed the predetermined threshold CONST (cf. claim 4), the process proceeds to Procedure 3; otherwise it proceeds to Procedure 4. The threshold CONST is determined empirically according to the nature of the learning data and the method of calculating the feature amount (for example, taking into account the within-category variance, the between-category variance, and the variance ratio).
(Procedure 3) The similarity (or distance) between the newly created cluster and the other clusters is calculated, and the process returns to Procedure 2.
(Procedure 4) The number of feature amounts (data) in each cluster and the feature amounts in each cluster are output to a file or the like as clustering information, and clustering ends.
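
The four procedures can be sketched as a short agglomerative loop. This is one reading under stated assumptions rather than the publication's implementation: similarity between clusters is taken to be the Euclidean distance between centroids, the loop requires at least two clusters in order to merge, and a merge whose DIST would exceed CONST is rejected before it is committed, so that the condition of claim 4 holds for every returned cluster. The names `centroid`, `max_radius`, and `cluster_category` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def centroid(cluster: List[Point]) -> List[float]:
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]


def max_radius(cluster: List[Point]) -> float:
    """DIST: distance from the cluster center to its farthest member."""
    c = centroid(cluster)
    return max(math.dist(c, p) for p in cluster)


def cluster_category(samples: List[Point], const: float) -> List[List[Point]]:
    # Procedure 1: every sample starts as its own cluster (M = N, DIST = 0).
    clusters = [[s] for s in samples]
    while len(clusters) > 1:
        # Procedure 2: find the closest pair (assumed: centroid distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Termination: stop before DIST would exceed the threshold CONST.
        if max_radius(merged) > const:
            break
        # Procedure 3 is implicit: distances are recomputed on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # Procedure 4: return the clusters (the text writes them to a file).
    return clusters


# Hypothetical samples of one category, forming two visibly separate groups.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
result = cluster_category(samples, const=1.5)
print(sorted(len(c) for c in result))  # [2, 3]
```

Recomputing every pairwise distance on each pass keeps the sketch short; updating only the distances that involve the newly merged cluster, as Procedure 3 describes, would avoid the repeated work.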

[0012] <Embodiment of the present invention as a program> The present invention is not limited to the embodiments described above. For example, it can also be realized by a computer having the hardware configuration shown in FIG. 7. The input device 1 consists of a keyboard, a mouse, a touch panel, or the like, and is used to input information. A scanner may also be connected for inputting character images. The display device 2 displays various output information, information input from the input device 1, and the like. The CPU (Central Processing Unit) 3 runs various programs. The memory 4 holds the programs themselves and also holds information temporarily created while a program is executed by the CPU 3. The storage device 5 holds the recognition dictionary 200 handled by the device, programs, temporary information generated during program execution, and the like. The medium drive device 6 is used to load a recording medium storing programs, data, and the like, read them, and store them in the memory 4 or the storage device 5. It may also be used to input and output data directly (for example, character learning data) or to execute programs directly. In such a computer, the functions constituting the character recognition pattern dictionary creation device shown in FIG. 1 are implemented as a program and written in advance to a recording medium such as a CD-ROM. The CD-ROM is loaded into a computer at each site equipped with a medium drive device 6 such as a CD-ROM drive, the program is stored in the memory 4 or the storage device 5, and executing it realizes the same functions as the embodiments described above. The recording medium may be any of a semiconductor medium (for example, a ROM or an IC memory card), an optical medium (for example, a DVD-ROM, MO, MD, or CD-R), or a magnetic medium (for example, magnetic tape or a flexible disk). A program realizing the functions of the present invention can be distributed in the form of a medium. It is also possible to store such a program in a storage device such as a magnetic disk and distribute it by download or a similar means over a wired or wireless communication network. The program may further be provided by distributing it over broadcast waves. The functions of the present invention may also be applied to a client/server system. That is, an image may be captured from a scanner on a client computer and sent to the input unit 110 on the server side, which then creates the recognition dictionary 200. Alternatively, the client may perform initial processing on the captured image before sending it to the input unit 110 on the server side.

[0013]

[Effects of the Invention] As described above, according to the present invention, the appearance frequency of the characters to be recognized is introduced and clustering is performed before the feature amount selection feature vector is set appropriately. As a result, a recognition dictionary can be created that allows recognition without a loss of recognition accuracy when feature amount selection is performed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the functional configuration of the present invention.

FIG. 2 is a flowchart showing the functional processing of the present invention.

FIG. 3 is a diagram illustrating the feature amount selection feature vector before the character appearance frequency is introduced.

FIG. 4 is a diagram illustrating the feature amount selection feature vector after the character appearance frequency is introduced.

FIG. 5 is a diagram illustrating the feature amount selection feature vector before clustering.

FIG. 6 is a diagram illustrating the feature amount selection feature vector after clustering.

FIG. 7 is a diagram showing the hardware configuration of a computer on which the present invention runs.

[Explanation of symbols]

1: Input device
2: Display device
3: CPU
4: Memory
5: Storage device
6: Medium drive device
100: Character recognition pattern dictionary creation device
110: Input unit
120: Feature amount extraction unit
130: Feature amount selection feature vector creation unit
140: Feature amount selection unit
150: Dictionary creation unit
200: Recognition dictionary

Claims

[Claims]

1. A character recognition pattern dictionary creation device comprising: an input unit that inputs a character image; a feature amount extraction unit that extracts a feature amount from the character image input by the input unit by a discriminant analysis method or a principal component analysis method; a feature amount selection feature vector creation unit that creates, by adjustment using character appearance frequency information, a feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extraction unit; a feature amount selection unit that reduces the number of dimensions of the feature amount extracted by the feature amount extraction unit using the feature amount selection feature vector created by the feature amount selection feature vector creation unit; and a dictionary creation unit that creates a dictionary from the feature amount selected by the feature amount selection unit.

2. The character recognition pattern dictionary creation device according to claim 1, wherein the feature amount selection feature vector creation unit creates the feature amount selection feature vector, which reduces the number of dimensions of the feature amount extracted by the feature amount extraction unit, by adjustment using a high prior probability for characters to be recognized that have a high appearance frequency and a low prior probability for characters to be recognized that have a low appearance frequency.
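As an illustration of the adjustment described in claim 2, the following Python sketch weights the within-class and between-class scatter matrices of a discriminant analysis by prior probabilities derived from character appearance frequencies. The function name `selection_vectors`, its parameters, the small regularization constant, and the use of weighted scatter matrices are assumptions made for illustration; the claim itself does not specify how the prior probabilities enter the calculation.

```python
import numpy as np

def selection_vectors(features, labels, char_freq, n_dims):
    """Return a (d, n_dims) matrix whose columns are candidate
    feature amount selection feature vectors (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))

    # Prior probability of each character, taken from its appearance frequency.
    total = sum(char_freq[c] for c in classes)
    priors = {c: char_freq[c] / total for c in classes}

    means = {c: features[labels == c].mean(axis=0) for c in classes}
    global_mean = sum(priors[c] * means[c] for c in classes)

    d = features.shape[1]
    s_w = np.zeros((d, d))  # prior-weighted within-class scatter
    s_b = np.zeros((d, d))  # prior-weighted between-class scatter
    for c in classes:
        centered = features[labels == c] - means[c]
        s_w += priors[c] * (centered.T @ centered) / len(centered)
        diff = (means[c] - global_mean)[:, None]
        s_b += priors[c] * (diff @ diff.T)

    # Assumed regularization so that s_w is invertible.
    s_w += 1e-6 * np.eye(d)

    # Eigenvectors of s_w^{-1} s_b, ordered by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    order = np.argsort(eigvals.real)[::-1][:n_dims]
    return eigvecs.real[:, order]
```

In this sketch, characters with a high appearance frequency contribute more to both scatter matrices, so the leading vectors favor directions that separate frequently occurring characters.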

3. The character recognition pattern dictionary creation device according to claim 1, wherein the feature amount selection feature vector creation unit clusters the learning data for the characters to be recognized and then creates the feature amount selection feature vector.

4. The character recognition pattern dictionary creation device according to claim 3, wherein the feature amount selection feature vector is calculated after the learning data is clustered by a combination clustering method under the condition that the distance from the center of a generated cluster to its farthest cluster member does not exceed a predetermined value.
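The clustering condition of claim 4 can be sketched in Python as a bottom-up merging procedure that refuses any merge whose result would place a member farther than a given limit from the cluster center. The names `radius_limited_clustering` and `max_radius`, the Euclidean distance, and the rule of trying the pair with the closest centers first are assumptions for illustration; the claim states only the distance condition.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate lists."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def radius(points):
    """Distance from the centroid to the farthest member."""
    c = centroid(points)
    return max(math.dist(c, p) for p in points)

def radius_limited_clustering(samples, max_radius):
    """Merge clusters bottom-up; a merge is allowed only if the merged
    cluster's farthest member stays within max_radius of its center."""
    clusters = [[i] for i in range(len(samples))]  # lists of sample indices
    while True:
        centers = [centroid([samples[i] for i in c]) for c in clusters]
        # Candidate pairs, closest centers first.
        pairs = sorted(
            (math.dist(centers[i], centers[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        for _, i, j in pairs:
            merged = clusters[i] + clusters[j]
            if radius([samples[k] for k in merged]) <= max_radius:
                clusters[i] = merged
                del clusters[j]
                break
        else:
            # No admissible merge remains.
            return clusters
```

Each resulting cluster could then be treated as its own category when the feature amount selection feature vector is calculated, although that step is not shown here.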

5. A character recognition pattern dictionary creation method comprising: inputting a character image; extracting a feature amount from the input character image; creating, by adjustment using character appearance frequency information, a feature amount selection feature vector for reducing the number of dimensions of the extracted feature amount; reducing the number of dimensions of the extracted feature amount using the created feature amount selection feature vector; and creating a dictionary from the selected feature amount.
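The last two steps of the method of claim 5 might look like the following Python sketch, which reduces the feature dimensions by projecting onto the selection vectors and stores one mean vector per character as the dictionary entry. The function name `build_dictionary`, the projection by matrix product, and the choice of a per-character mean as the dictionary content are all assumptions for illustration; the claim does not state what a dictionary entry contains.

```python
import numpy as np

def build_dictionary(features, labels, selection_vectors):
    """Project features onto the selection vectors and average the
    reduced features per character (hypothetical dictionary format)."""
    reduced = np.asarray(features, dtype=float) @ np.asarray(selection_vectors)
    labels = np.asarray(labels)
    return {
        c: reduced[labels == c].mean(axis=0)
        for c in sorted(set(labels.tolist()))
    }
```

With `selection_vectors` of shape (d, k), each dictionary entry is a k-dimensional vector, so recognition would compare inputs in the reduced space.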

6. A character recognition pattern dictionary creation method comprising: inputting a character image; extracting a feature amount from the input character image by a discriminant analysis method or a principal component analysis method; clustering the learning data for the characters to be recognized; then creating, by adjustment using a high prior probability for characters to be recognized that have a high appearance frequency and a low prior probability for characters to be recognized that have a low appearance frequency, a feature amount selection feature vector for reducing the number of dimensions of the extracted feature amount; reducing the number of dimensions of the extracted feature amount using the created feature amount selection feature vector; and creating a dictionary from the selected feature amount.

7. A computer-readable recording medium storing a program for causing a computer to function as a character recognition pattern dictionary creation device that creates a character recognition pattern dictionary, the device having: an input unit that inputs a character image; a feature amount extraction unit that extracts a feature amount from the character image input by the input unit by a discriminant analysis method or a principal component analysis method; a feature amount selection feature vector creation unit that creates, by adjustment using character appearance frequency information, a feature amount selection feature vector for reducing the number of dimensions of the feature amount extracted by the feature amount extraction unit; a feature amount selection unit that reduces the number of dimensions of the feature amount extracted by the feature amount extraction unit using the feature amount selection feature vector created by the feature amount selection feature vector creation unit; and a dictionary creation unit that creates a dictionary from the feature amount selected by the feature amount selection unit.